Qian Liu, Computer and Information Sciences Department
DESCRIPTION
A Presentation on The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page. Qian Liu, Computer and Information Sciences Department. Problem: Size of the Web: on the order of hundreds of terabytes; still growing. Problems with search engines:
Qian Liu, Computer and Information Sciences Department
A Presentation on
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Problem
• Size of the Web:
    • On the order of hundreds of terabytes
    • Still growing
• Problems with search engines:
    • Alta Vista, Excite, etc.:
        • Return a huge number of document entries
        • Too many low-quality or marginally relevant matches
Problem
• Yahoo:
    • Expensive
    • Slow to improve
    • Cannot cover all esoteric topics
• Problems with users:
    • Inexperienced
    • Do not provide tightly constrained keywords
Motivation and Applications
• To improve the quality of web search engines
• Scale to keep up with the growth of the Web
• Academic search engine research
• Current search engine technology: advertising oriented
• “Open” search engine
• Support research activities on large-scale web data
Methods
Basic Idea:
Q: “How can a search engine automatically identify high quality web pages for my topic?”
A: Hypertextual information --- improve search precision
• Link structure
• Anchor text
• Proximity
• Visual presentation
Methods
PageRank
PageRank: A measure of citation importance ranking
Link Structure: Latent human annotation of importance
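The citation-importance idea can be illustrated with the iterative formula given in the Brin and Page paper, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (the paper suggests 0.85). The following is a minimal sketch under that formula; the function name `pagerank`, the fixed iteration count, and the three-page toy graph are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no outgoing links passes nothing on in this sketch
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A and B both link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph C collects links from two pages and ends up ranked highest, A inherits most of C's rank through a single link, and B, which nobody links to, stays at the minimum value 1 - d.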
Methods
PageRank
Why does PageRank work?
• Users often want information from a “trusted” source
    • Collaborative trust
• Inexpensive to compute
    • Allows fast updates
• Fewer privacy implications
    • Only public information is used
Methods
Anchor Text
• Associates anchor text with:
    • the page the link is on
    • the page the link points to
• Provides accurate descriptions of web pages
• Makes non-indexable web pages searchable
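A small sketch may help show why anchor text makes non-indexable pages searchable: the text of a link is stored under the page the link points to, so an image with no text of its own still picks up a description. The class and function names and the example URLs are hypothetical; only the standard-library `html.parser` module is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Map each link target to the anchor texts that point at it."""
    anchors = defaultdict(list)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.pairs:
            anchors[href].append(text)
    return anchors


# Hypothetical page that links to an image, which has no indexable text itself.
pages = {
    "http://example.org/index.html":
        '<p>See the <a href="http://example.org/map.png">campus map image</a>.</p>',
}
anchor_index = index_anchor_text(pages)
```

A query for "campus map" could then match `http://example.org/map.png` through `anchor_index`, even though the image was never parsed.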
Methods
Proximity
• Hits
• Hit locations
• Multi-word search:
    Calculate proximity --- how far apart the hits occur in the document (or anchor)
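As an illustration of the proximity calculation, the sketch below finds the smallest gap between the hit locations of two query words in one document. The helper names and the sample sentence are assumptions; how the real system turns distances into scores is not covered here.

```python
def hit_positions(words, term):
    """Word offsets at which `term` occurs (the hit locations)."""
    return [i for i, word in enumerate(words) if word == term]


def min_distance(positions_a, positions_b):
    """Smallest gap between any hit of word A and any hit of word B.

    Both inputs must be sorted; returns None if either word has no hits.
    """
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever cursor points at the smaller position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


doc = "bill clinton met a senator who opposed the bill".split()
distance = min_distance(hit_positions(doc, "bill"), hit_positions(doc, "clinton"))
```

Here the two words occur next to each other at the start of the sentence, so `distance` is 1; a document where they only appear far apart would get a larger value and, presumably, a lower proximity score.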
Methods
Visual Presentation
• Font size
Larger/bolder fonts --- higher weights
• Capitalization --- higher weights
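One way to picture these weights is to record font size and capitalization with each hit and give more prominent hits a larger weight. The `Hit` fields and the numeric bonuses below are purely hypothetical; the slides do not give actual values.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    word: str
    position: int
    font_size: int      # relative font size; larger means more prominent
    capitalized: bool


def hit_weight(hit, base=1.0, font_bonus=0.5, cap_bonus=0.25):
    """Weight one hit; the bonus values are illustrative assumptions."""
    weight = base + font_bonus * hit.font_size
    if hit.capitalized:
        weight += cap_bonus
    return weight


heading_hit = Hit("google", position=0, font_size=3, capitalized=True)
body_hit = Hit("google", position=57, font_size=1, capitalized=False)
```

With these assumed values the heading hit weighs 2.75 and the body hit 1.5, so a word in a large, capitalized heading counts for more than the same word in running text.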
Methods
Architecture and Major Data Structures
Methods
Major Applications:
Crawling
Indexing
Searching
Crawling
How a crawler works:
• Works within a defined webspace
• Requests URLs
• Stores the returned objects into a file system
• Examines the content of each object
• Scans for HTML anchor tags <A..>
• Ignores URLs that do not conform to the specified rules; visits URLs that do
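The crawler loop described above can be sketched as follows. To keep the example self-contained, pages come from an in-memory dictionary (`FAKE_WEB`) rather than the network, the "file system" is a plain dictionary, and the rule is simply "stay on one host"; all of these, along with the function and class names, are assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Scans a document for <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(seed, fetch, allowed_host):
    """Breadth-first crawl of one host; `fetch` maps a URL to HTML or None."""
    store = {}                      # stands in for the file system
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        body = fetch(url)           # request the URL
        if body is None:
            continue
        store[url] = body           # store the returned object
        extractor = LinkExtractor()
        extractor.feed(body)        # examine the content for anchor tags
        for href in extractor.hrefs:
            link = urljoin(url, href)
            if urlparse(link).netloc != allowed_host:
                continue            # ignore URLs that break the rule
            if link not in seen:
                seen.add(link)
                queue.append(link)  # visit URLs that conform
    return store


FAKE_WEB = {
    "http://example.edu/": '<a href="/a.html">A</a> <a href="http://other.com/">out</a>',
    "http://example.edu/a.html": '<a href="/">home</a>',
}
stored = crawl("http://example.edu/", FAKE_WEB.get, "example.edu")
```

The `seen` set keeps the crawler from requesting the same URL twice, which matters here because the two pages link to each other.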
Crawling
Google’s web crawling system:
• Fast distributed crawling system
• URLServer serves URLs to crawlers
• Each crawler keeps roughly 300 connections open at once
• Each connection can be in one of several states:
    1. Looking up DNS
    2. Connecting to host
    3. Sending request
    4. Receiving response
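The four connection states can be modelled as a small state machine. This is only a sketch: the `DONE` state and the lock-step `advance` function are additions made so the example terminates, whereas a real crawler would move each connection forward independently as I/O events arrive.

```python
from enum import Enum, auto


class FetchState(Enum):
    LOOKING_UP_DNS = auto()
    CONNECTING_TO_HOST = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()
    DONE = auto()  # not on the slide; added so the sketch can terminate


# Successor of each state, in the order listed on the slide.
NEXT_STATE = {
    FetchState.LOOKING_UP_DNS: FetchState.CONNECTING_TO_HOST,
    FetchState.CONNECTING_TO_HOST: FetchState.SENDING_REQUEST,
    FetchState.SENDING_REQUEST: FetchState.RECEIVING_RESPONSE,
    FetchState.RECEIVING_RESPONSE: FetchState.DONE,
    FetchState.DONE: FetchState.DONE,
}


def advance(connections):
    """Move every open connection one step forward (a stand-in for an event loop)."""
    return [NEXT_STATE[state] for state in connections]


# One crawler with 300 connections, all starting at the DNS lookup.
connections = [FetchState.LOOKING_UP_DNS] * 300
for _ in range(4):
    connections = advance(connections)
```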
Indexing
• Uses Flex to generate a lexical analyzer
• Parses each document
• Converts each word into a wordID
• Converts each document into a set of hits
• Sorter sorts the result by wordID to generate the inverted index
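The indexing pipeline can be sketched in a few lines. A regular expression stands in for the Flex-generated lexical analyzer, a hit is reduced to a (docID, position) pair, and everything lives in memory; the function name and the two sample documents are assumptions.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Return (lexicon, inverted index) for a dict of docID -> text."""
    lexicon = {}   # word -> wordID
    forward = []   # hits in document order: (docID, wordID, position)
    for doc_id, text in docs.items():
        # A regex tokenizer stands in for the Flex lexical analyzer.
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            word_id = lexicon.setdefault(word, len(lexicon))
            forward.append((doc_id, word_id, position))
    # The sorter step: order hits by wordID to obtain the inverted index.
    inverted = defaultdict(list)
    for doc_id, word_id, position in sorted(forward, key=lambda h: (h[1], h[0], h[2])):
        inverted[word_id].append((doc_id, position))
    return lexicon, inverted


docs = {1: "Bill Clinton", 2: "the bill passed"}
lexicon, inverted = build_index(docs)
```

After the sort, all hits for one wordID sit together, so `inverted[lexicon["bill"]]` lists every document and position where "bill" occurs.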
Searching
1. Seek to the start of the doclist for every word.
2. Scan through the doclists until there is a document that matches all the search terms.
3. Compute the rank of that document for the query.
4. If we are not at the end of any doclist, go to step 2.
5. Sort the documents that have matched by rank and return the top k.
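The search steps above amount to intersecting sorted doclists and ranking the matches. The sketch below follows them with one cursor per doclist; the function name, the `rank` callback, and the sample doclists and scores are assumptions, and real doclists would carry hit data rather than bare docIDs.

```python
import heapq


def search(doclists, rank, k=10):
    """doclists: one sorted list of docIDs per query word; rank: docID -> score."""
    if not doclists:
        return []
    cursors = [0] * len(doclists)                                  # step 1
    matches = []
    while all(c < len(dl) for c, dl in zip(cursors, doclists)):    # step 4
        current = [dl[c] for c, dl in zip(cursors, doclists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):            # step 2: a match
            matches.append((rank(target), target))                 # step 3
            cursors = [c + 1 for c in cursors]
        else:
            # Advance every list that is still behind the largest current docID.
            cursors = [c + 1 if dl[c] < target else c
                       for c, dl in zip(cursors, doclists)]
    return [doc_id for _, doc_id in heapq.nlargest(k, matches)]    # step 5


bill = [1, 3, 5, 8]        # hypothetical doclist for "bill"
clinton = [3, 4, 8, 9]     # hypothetical doclist for "clinton"
scores = {3: 0.2, 8: 0.9}  # hypothetical rank per matching document
top = search([bill, clinton], scores.get, k=10)
```

Documents 3 and 8 contain both words, and document 8 comes first because its assumed score is higher.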
Searching
Ranking:
Ranking Function:
• PageRank
• Type weight
• Count weight
• Proximity
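A rough sketch of how these four ingredients might be combined: the paper describes an IR score formed as a dot product of type-weights and count-weights, which is then combined with PageRank. The specific type weights, the capped logarithm used for the count-weight, the additive proximity bonus, and the `alpha` mixing parameter are all assumptions made for illustration.

```python
import math

# Hypothetical type weights: hits in more prominent places count for more.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "large_font": 2.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grows with the number of hits but stops growing past `cap` (assumed shape)."""
    return math.log1p(min(count, cap))


def ir_score(hit_counts):
    """Dot product of type-weights and count-weights for one document."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())


def final_rank(hit_counts, pagerank, proximity_bonus=0.0, alpha=0.5):
    """Mix the IR score (plus a proximity bonus) with PageRank; the mix is an assumption."""
    return alpha * (ir_score(hit_counts) + proximity_bonus) + (1 - alpha) * pagerank
```

Capping the count-weight keeps a page that repeats a word hundreds of times from outranking a page that uses it a few times in its title.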
Results
A search on “bill clinton”:
• High quality pages
• Non-crawlable pages
• No results about a Bill other than Clinton
• No results about a Clinton other than Bill
Comparison with Other Search Engines
1. Breadth-first search vs. depth-first search
2. Comparison with WebCrawler:
    WebCrawler: Files that the WebCrawler cannot index, such as pictures, sounds, etc., are not retrieved.
    Google: Uses anchor text
3. Number of crawlers:
    WebCrawler: 15
    Google: typically 3
Comparison with Other Search Engines (continued)
4. Quantity vs. quality
    Alta Vista: Favors quantity
    Google: Provides quality search
Weak Points of Study
1. To limit response time, the searcher stops scanning once a certain number of matching documents has been found, then sorts and returns the results. The results may therefore be sub-optimal.
2. Lacks features such as boolean operators and negation.
3. Search efficiency:
    No optimizations such as query caching or subindices on common terms.
Suggestions for Future Study
1. Using link structure:
    In calculating PageRank: Exclude links between two pages in the same web domain, which often serve as navigation and do not confer authority.
2. Personalize PageRank by increasing the weight of a user’s homepage or bookmarks.
“99% of the Web information is useless to 99% of the Web users.”
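The first suggestion could be applied as a preprocessing step before PageRank is computed. The sketch below drops links whose source and target share a host name; treating "same web domain" as "same host" is an assumption, as are the function name and the example URLs.

```python
from urllib.parse import urlparse


def cross_domain_links(links):
    """Keep only links whose target is on a different host than the source page."""
    filtered = {}
    for source, targets in links.items():
        source_host = urlparse(source).netloc
        filtered[source] = [target for target in targets
                            if urlparse(target).netloc != source_host]
    return filtered


links = {
    "http://a.edu/index.html": ["http://a.edu/about.html", "http://b.org/"],
    "http://b.org/": ["http://b.org/contact.html"],
}
external_only = cross_domain_links(links)
```

The navigation-style links inside `a.edu` and `b.org` are removed, so only the link from `a.edu` to `b.org` would contribute to a PageRank computed on `external_only`.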
Suggestions for Future Study (continued)
3. Make use of hubs --- collections of links to authorities
4. In addition to anchor text, use the text surrounding links, too.
Conclusions
• Quality search results
• Techniques:
    • PageRank
    • Anchor text
    • Proximity
• A complete architecture for crawling, indexing, and searching.