ICS 215: Advances in Database Management System Technology, Spring 2004
ICS 215: Advances in Database Management System Technology Spring 2004
Professor Chen Li
Information and Computer Science
University of California, Irvine
ICS215 Notes 01 2
Course Web Server
• URL: http://www.ics.uci.edu/~ics215/
  – All course info will be posted online
• Instructor: Chen Li
  – ICS 424B, [email protected]
• Course general info: http://www.ics.uci.edu/~ics215/geninfo.html
Topic today: Web Search
• How did earlier search engines work?
• How does Google work?
• Readings:
  – Lawrence and Giles, Searching the World Wide Web, Science, 1998.
  – Brin and Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7/Computer Networks 30(1-7): 107-117, 1998.
Earlier Search Engines
• Hotbot, Yahoo, Alta Vista, Northern Light, Excite, Infoseek, Lycos, …
• Main technique: “inverted index”
  – Conceptually: use a matrix to record how many times each term appears in each page
  – # of columns = # of pages (huge!)
  – # of rows = # of terms (also huge!)

             Page1  Page2  Page3  Page4  …
  ‘car’        1      0      1      0
  ‘toyota’     0      2      0      1      ← page 2 mentions ‘toyota’ twice
  ‘honda’      2      1      0      0
  …
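The counting scheme above can be sketched in a few lines of Python, storing the “matrix” sparsely as term → {page: count} (a toy illustration; the page names and contents are made up to match the example rows):

```python
# Sparse version of the term-count "matrix": term -> {page_id: count}
from collections import defaultdict

def build_index(pages):
    """pages: dict of page_id -> text. Returns term -> {page_id: count}."""
    index = defaultdict(lambda: defaultdict(int))
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term][page_id] += 1
    return index

pages = {
    "page1": "car honda honda",
    "page2": "toyota toyota honda",
    "page3": "car",
    "page4": "toyota",
}
index = build_index(pages)
# page2 mentions 'toyota' twice, matching the example row above
```

Storing only the nonzero cells is what makes the row/column counts bearable in practice.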
Search by Keywords
• If the query has one keyword, just return all the pages that contain the word
  – E.g., “toyota” → all pages containing “toyota”: page2, page4, …
  – There could be many, many pages!
  – Solution: return the pages in which the word appears most frequently first
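The lookup-then-rank step can be sketched as follows (a toy version over made-up posting lists; real engines use far more elaborate ranking):

```python
# Toy posting lists: term -> {page: count}, as in the matrix example
index = {
    "toyota": {"page2": 2, "page4": 1},
    "honda": {"page1": 2, "page2": 1},
}

def search_one(index, word):
    """All pages containing `word`, most frequent occurrences first."""
    postings = index.get(word, {})
    return sorted(postings, key=postings.get, reverse=True)
```

For example, `search_one(index, "toyota")` ranks page2 (two mentions) above page4 (one mention).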
Multi-keyword Search
• For each keyword W, find the set of pages mentioning W
• Intersect all the sets of pages
  – Assuming an “AND” semantics for the keywords
• Example:
  – A search for “toyota honda” returns all the pages that mention both “toyota” and “honda”
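The intersection step, sketched on the same toy posting lists (made-up data):

```python
index = {
    "toyota": {"page2": 2, "page4": 1},
    "honda": {"page1": 2, "page2": 1},
}

def search_all(index, words):
    """Pages mentioning every keyword (an implicit AND), as a set."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()
```

Here `search_all(index, ["toyota", "honda"])` yields only page2, the one page mentioning both words.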
Observations
• The “matrix” can be huge:
  – The Web now has 4.2 billion pages!
  – There are many “terms” on the Web; many of them are typos.
  – It’s not easy to do the computation efficiently: given a word, find all its pages; intersect many sets of pages…
• For these reasons, search engines never store this “matrix” so naively.
Problems
• Spamming:
  – People want their pages ranked at the very top for a word search (e.g., “toyota”), so they repeat the word many, many times
  – Yet such pages may be unimportant compared to www.toyota.com, even if the latter mentions “toyota” only once (or zero times).
• Search engines can be easily “fooled”
Closer look at the problems
• Lacking: the concept of the “importance” of each page on each topic
• E.g.: our ICS215 class page is not as “important” as Yahoo’s main page.
• A link from Yahoo is more important than a link from our class page
• But how to capture the importance of a page?
  – A guess: # of hits? But where to get that info?
  – # of inlinks to a page → Google’s main idea.
Google’s History
• Started at the Stanford DB group as a research project (Brin and Page)
• Used to be at: google.stanford.edu
• Very soon many people started liking it
• Incorporated in 1998: www.google.com
• The “largest” search engine now
• Started other businesses: froogle, gmail, …
PageRank
• Intuition:
  – The importance of each page should be decided by what other pages “say” about it
  – One naïve implementation: count the # of pages pointing to each page (i.e., # of inlinks)
• Problem:
  – We can easily fool this technique by generating many dummy pages that point to our class page
Details of PageRank
• At the beginning, each page has weight 1
• In each iteration, each page propagates its current weight W to all its N forward neighbors; each of them gets weight W/N
• Meanwhile, a page accumulates the weights from its backward neighbors
• Iterate until all weights converge; usually 6–7 iterations are good enough
• The final weight of each page is its importance
• NOTE: Google now uses many other techniques/heuristics for search; here we only cover some of the initial ideas.
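The propagation rule above can be sketched in a few lines (a toy version that ignores the dead-end and spider-trap problems discussed below; the two-page graph is made up):

```python
def pagerank(links, iterations=7):
    """Each page starts at weight 1; per iteration a page sends W/N to
    each of its N forward neighbors and collects from its backward ones."""
    weights = {p: 1.0 for p in links}
    for _ in range(iterations):
        new = {p: 0.0 for p in links}
        for page, outs in links.items():
            for q in outs:
                new[q] += weights[page] / len(outs)
        weights = new
    return weights

# Two pages pointing at each other keep weight 1 each, every iteration.
w = pagerank({"p": ["q"], "q": ["p"]})
```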
Example: MiniWeb
• (Materials used by courtesy of Jeff Ullman)
• Our “MiniWeb” has only three web sites: Netscape (NE), Amazon (AM), and Microsoft (MS). NE links to itself and AM; MS links to AM; AM links to NE and MS.
• Their weights are represented as a vector (n, m, a), updated each iteration by:

  [ n ]        [ 1/2   0   1/2 ] [ n ]
  [ m ]     =  [  0    0   1/2 ] [ m ]
  [ a ] new    [ 1/2   1    0  ] [ a ] old

For instance, in each iteration, half of the weight of AM goes to NE, and half goes to MS.
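Running this iteration in code (a toy sketch; the page names are abbreviated to n, m, a):

```python
# MiniWeb links: NE -> {NE, AM}, MS -> {AM}, AM -> {NE, MS}
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}
w = {p: 1.0 for p in links}           # every page starts with weight 1
for _ in range(100):                  # iterate well past convergence
    new = {p: 0.0 for p in links}
    for page, outs in links.items():
        for q in outs:
            new[q] += w[page] / len(outs)
    w = new
# w is now very close to n = 6/5, m = 3/5, a = 6/5
```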
Iterative computation

  n:  1    1    5/4   9/8    5/4   …  6/5
  m:  1   1/2   3/4   1/2   11/16  …  3/5
  a:  1   3/2    1    11/8  17/16  …  6/5

Final result:
• Netscape and Amazon have the same importance, and twice the importance of Microsoft.
• Does it capture the intuition? Yes.
Observations
• We cannot get absolute weights:
  – We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (the sum of each column is 1), so the iterations converge, computing the principal eigenvector of the matrix equation:

  [ n ]   [ 1/2   0   1/2 ] [ n ]
  [ m ] = [  0    0   1/2 ] [ m ]
  [ a ]   [ 1/2   1    0  ] [ a ]
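The fixed-point claim is easy to verify exactly with rational arithmetic (a quick check, not part of the original slides):

```python
from fractions import Fraction as F

# M in (n, m, a) order; columns are "from" pages, rows are "to" pages
M = [[F(1, 2), F(0), F(1, 2)],
     [F(0),    F(0), F(1, 2)],
     [F(1, 2), F(1), F(0)]]
v = [F(6, 5), F(3, 5), F(6, 5)]       # the converged weights (n, m, a)
Mv = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
assert Mv == v                        # v is an eigenvector with eigenvalue 1
assert all(sum(M[i][j] for i in range(3)) == 1 for j in range(3))  # stochastic
```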
Problem 1 of the algorithm: dead ends
• MS does not point to anybody (its column in the matrix is all zeros):

  [ n ]        [ 1/2   0   1/2 ] [ n ]
  [ m ]     =  [  0    0   1/2 ] [ m ]
  [ a ] new    [ 1/2   0    0  ] [ a ] old

• Iterating:

  n:  1    1    3/4   5/8   1/2   …  0
  m:  1   1/2   1/4   1/4   3/16  …  0
  a:  1   1/2   1/2   3/8   5/16  …  0

• Result: the weights of the Web “leak out”
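A few lines confirm the leak: with MS a dead end, the total weight drains toward 0 (a toy check on the three-page example):

```python
# Dead-end variant: Microsoft has no outlinks, so its weight vanishes
links = {"n": ["n", "a"], "m": [], "a": ["n", "m"]}
w = {p: 1.0 for p in links}
for _ in range(100):
    new = {p: 0.0 for p in links}
    for page, outs in links.items():
        for q in outs:                # a dead end sends nothing at all
            new[q] += w[page] / len(outs)
    w = new
total = sum(w.values())               # started at 3, drains toward 0
```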
Problem 2 of the algorithm: spider traps
• MS points only to itself:

  [ n ]        [ 1/2   0   1/2 ] [ n ]
  [ m ]     =  [  0    1   1/2 ] [ m ]
  [ a ] new    [ 1/2   0    0  ] [ a ] old

• Iterating:

  n:  1    1    3/4   5/8    1/2   …  0
  m:  1   3/2   7/4    2    35/16  …  3
  a:  1   1/2   1/2   3/8   5/16   …  0

• Result: all the weight goes to MS!
Google’s solution: “tax” each page
• Like people paying taxes, each page pays some of its weight into a public pool, which is distributed evenly to all pages.
• Example: assume a 20% tax rate in the “spider trap” example:

  [ n ]       [ 0.2 ]          [ 1/2   0   1/2 ] [ n ]
  [ m ]    =  [ 0.2 ]  + 0.8 · [  0    1   1/2 ] [ m ]
  [ a ] new   [ 0.2 ]          [ 1/2   0    0  ] [ a ] old

• The weights now converge to: n = 7/11, m = 21/11, a = 5/11
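A quick numeric check of the taxed iteration on the spider-trap example (0.8 is one minus the 20% tax rate):

```python
# Spider-trap links (MS points only to itself), with a 20% tax:
# new_weight = 0.2 + 0.8 * (weight received from backward neighbors)
links = {"n": ["n", "a"], "m": ["m"], "a": ["n", "m"]}
w = {p: 1.0 for p in links}
for _ in range(100):
    new = {p: 0.2 for p in links}     # everyone gets a share of the pool
    for page, outs in links.items():
        for q in outs:
            new[q] += 0.8 * w[page] / len(outs)
    w = new
# w converges to n = 7/11, m = 21/11, a = 5/11 (total weight is still 3)
```

MS still collects the most weight, but it can no longer absorb everything.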
The War of Search Engines
• More companies are realizing the importance of search engines
• More competitors in the market: Microsoft, Yahoo!, etc.
Next: HITS / Web communities
• Readings:
  – Jon M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46(5): 604-632, 1999.
  – Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins, Trawling the Web for emerging cyber-communities, WWW 1999.
Hubs and Authorities
• Motivation: find web pages related to a topic
  – E.g.: “find all web sites about automobiles”
• “Authority”: a page that offers info about a topic
  – E.g.: DBLP is a page about papers
  – E.g.: google.com, aj.com, teoma.com, lycos.com
• “Hub”: a page that doesn’t provide much info itself, but tells us where to find pages about a topic
  – E.g.: our ICS215 page linking to pages about papers
  – E.g.: www.searchenginewatch.com is a hub of search engines
Two values of a page
• Each page has a hub value and an authority value.
  – In PageRank, each page has just one value: its “weight”
• Two vectors:
  – H = (h1, h2, …): hub values
  – A = (a1, a2, …): authority values
HITS algorithm: find hubs and authorities
• First step: find pages related to the topic (e.g., “automobile”), and construct the corresponding “focused subgraph”:
  – Find the set S of pages containing the keyword (“automobile”)
  – Find all pages the S pages point to, i.e., their forward neighbors
  – Find all pages that point to the S pages, i.e., their backward neighbors
  – Compute the subgraph induced by these pages
• (Figure: the root set S grows into the focused subgraph.)
Step 2: computing H and A
• Initially: set each hub and authority value to 1
• In each iteration, the hub value of a page is the total authority value of its forward neighbors (after normalization)
• The authority value of each page is the total hub value of its backward neighbors (after normalization)
• Iterate until the values converge
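The two-step iteration can be sketched as follows (a toy version; the normalization here simply scales each vector so its largest entry is 1, and the example graph is made up):

```python
def hits(links, iterations=50):
    """links: page -> list of forward neighbors.
    Returns (hubs, auths), each scaled so its largest value is 1."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p = total hub value of pages pointing to p
        auths = {p: sum(hubs[q] for q in links if p in links[q]) for p in pages}
        # hub of p = total authority value of the pages p points to
        hubs = {p: sum(auths[q] for q in links.get(p, [])) for p in pages}
        for d in (auths, hubs):       # normalize to keep values bounded
            top = max(d.values()) or 1.0
            for p in d:
                d[p] /= top
    return hubs, auths

# toy graph: "h" links to two pages, "z" to one; "x" is cited the most
hubs, auths = hits({"h": ["x", "y"], "z": ["x"]})
```

In the toy graph, "x" ends up as the top authority and "h" as the top hub.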
Example: MiniWeb
• Link structure: NE points to NE, AM, and MS; AM points to NE and MS; MS points to AM. The adjacency matrix (rows and columns in the order n, m, a):

      [ 1  1  1 ]
  M = [ 0  0  1 ]
      [ 1  1  0 ]

• Update rules:

  H_new = M · A_old
  A_new = M^T · H_old

• Therefore:

  H_new = M · M^T · H_old
  A_new = M^T · M · A_old

• Normalization after each iteration!
Example: MiniWeb (cont.)

        [ 1  1  1 ]         [ 1  0  1 ]             [ 3  1  2 ]             [ 2  2  1 ]
  M  =  [ 0  0  1 ]   M^T = [ 1  0  1 ]   M·M^T  =  [ 1  1  0 ]   M^T·M  =  [ 2  2  1 ]
        [ 1  1  0 ]         [ 1  1  0 ]             [ 2  0  2 ]             [ 1  1  2 ]

• Authority iterates (unnormalized):

  n:  1   5   24   114  …
  m:  1   5   24   114  …
  a:  1   4   18    84  …

• Hub iterates (unnormalized):

  n:  1   6   28   132  …
  m:  1   2    8    36  …
  a:  1   4   20    96  …

• After normalization, the values converge in proportion A ∝ (1+√3, 1+√3, 2) and H ∝ (2+√3, 1, 1+√3) for (n, m, a).
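The authority iteration on this example can be checked with a few lines of integer arithmetic (no normalization, so the values grow):

```python
# adjacency matrix in (n, m, a) order: n -> all three, m -> a, a -> n and m
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
MT = [[M[j][i] for j in range(3)] for i in range(3)]
MTM = [[sum(MT[i][k] * M[k][j] for k in range(3)) for j in range(3)]
       for i in range(3)]
a = [1, 1, 1]
for _ in range(3):                    # three authority updates, A <- M^T M A
    a = [sum(MTM[i][j] * a[j] for j in range(3)) for i in range(3)]
# a == [114, 114, 84]
```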
Trawling: finding online communities
• Motivation: find groups of individuals who share a common interest, together with the Web pages most popular among them (similar to “hubs”)
• Examples:
  – Web pages of NBA fans
  – Community of Turkish student organizations in the US
  – Fans of movie star Jack Lemmon
• Applications:
  – Provide valuable and timely info for interested people
  – Represent the sociology of the Web
  – Targeted advertising
How: analyzing the Web structure
• These pages often do not reference each other:
  – Competition
  – Different viewpoints
• Main idea: “co-citations”
  – These pages often share a large number of pages that link to them
  – Example: the following two web sites share many such pages: http://kcm.co.kr/English/ and www.cyberkorean.com/church
Bipartite subgraphs
• Bipartite graph: two sets of nodes, F (“fans”) and C (“centers”), with edges from F to C
• Dense bipartite graph: there are “enough” edges between F and C
• Complete bipartite graph: there is an edge between each node in F and each node in C
• (i,j)-core: a complete bipartite subgraph with at least i nodes in F and j nodes in C
• An (i,j)-core is a good signature for finding online communities
• Usually i and j are between 3 and 9
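The (i,j)-core definition translates directly into a brute-force check (fine for toy graphs only; the fan/center names below are made up):

```python
from itertools import combinations

def has_core(edges, fans, centers, i, j):
    """True if some i fans and j centers form a complete bipartite subgraph.
    edges: set of (fan, center) pairs. Brute force: toy inputs only."""
    return any(all((f, c) in edges for f in fs for c in cs)
               for fs in combinations(fans, i)
               for cs in combinations(centers, j))

edges = {(f, c) for f in "xyz" for c in "pq"}   # x, y, z each cite p and q
```

Here `edges` forms a (3,2)-core; removing any single edge destroys it. Real trawling cannot afford this enumeration, which is what the pruning steps below address.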
“Trawling”: finding cores
• Find all (i,j)-cores in the Web graph:
  – In particular: find the “fans” (or “hubs”) in the graph
  – “centers” = “authorities”
  – Challenge: the Web is huge; how to find cores efficiently? Experiments: 200M pages, 1 TB of data
• Main idea: pruning
• Step 1: use out-degrees
  – Rule: each fan must point to at least 6 different websites
  – Pruning result: 12% of all pages (= 24M pages) are potential fans
  – Retain only the links, and ignore page contents
Step 2: eliminate mirror pages
• Many pages are mirrors (exact copies of the same page)
• They can produce many spurious fans
• Use a “shingling” method to identify and eliminate duplicates
• Results:
  – 60% of the 24M potential-fan pages are removed
  – The # of potential centers is about 30 times the # of potential fans
Step 3: use in-degrees of pages
• Delete highly referenced pages, e.g., yahoo, altavista
• Reason: they are referenced for many reasons, and are unlikely to mark an emerging community
• Formally: remove all pages with more than k inlinks (k = 50, for instance)
• Results:
  – 60M pages pointing to 20M pages
  – 2M potential fans
Step 4: iterative pruning
• To find (i,j)-cores:
  – Remove all pages whose # of out-links is < i
  – Remove all pages whose # of in-links is < j
  – Do this iteratively
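The degree-based pruning can be sketched as one loop over (fan, center) links (a toy sketch under simplified assumptions, not the paper’s implementation; real trawling also dedups mirrors and caps in-degrees):

```python
def prune(edges, i, j):
    """Iteratively drop links whose fan has out-degree < i or whose center
    has in-degree < j, until nothing changes. edges: set of (fan, center)."""
    edges = set(edges)
    while True:
        out, inn = {}, {}
        for f, c in edges:
            out[f] = out.get(f, 0) + 1
            inn[c] = inn.get(c, 0) + 1
        kept = {(f, c) for f, c in edges if out[f] >= i and inn[c] >= j}
        if kept == edges:
            return edges
        edges = kept

core = {(f, c) for f in "ab" for c in "xy"}     # a 2x2 complete core
```

For example, `prune(core | {("d", "x")}, 2, 2)` drops the dangling link from "d" and keeps the core intact. The point of iterating is that removing one page can push another below threshold.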
Step 5: inclusion-exclusion pruning
• Idea: in each step, we
  – either “include” a community,
  – or “exclude” a page from further contention
• Check a page x with out-degree j: x is a fan of an (i,j)-core if there are i−1 other fans that point to all of x’s forward neighbors
  – This step can be checked easily using the index on fans and centers
• Result: for (3,3)-cores, 5M pages remained
• Final step:
  – Since the graph is now much smaller, we can afford to “enumerate” the remaining cores
• Results:
  – (3,3)-cores: about 75K
  – High-quality communities
  – Check a few in the paper yourself