Qian Liu, Computer and Information Sciences Department
A Presentation on
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
Problem
• Size of the Web:
  • On the order of hundreds of terabytes
  • Still growing
• Problems with search engines:
  • Alta Vista, Excite, etc.:
    • Return a huge number of document entries
    • Too many low quality or marginally relevant matches
Problem
• Yahoo:
  • Expensive
  • Slow to improve
  • Cannot cover all esoteric topics
• Problems with users:
  • Inexperienced
  • Do not provide tightly constrained keywords
Motivation and Applications
• To improve the quality of web search engines
• Scale to keep up with the growth of the Web
• Academic search engine research
• Current search engine technology: advertising oriented
• “Open” search engine
• Support research activities on large-scale web data
Methods
Basic Idea:
Q: “How can a search engine automatically identify high quality web pages for my topic?”
A: Hypertextual information --- improve search precision
• Link structure
• Anchor text
• Proximity
• Visual presentation
Methods: PageRank
PageRank: A measure of citation importance ranking
Link Structure: Latent human annotation of importance
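As a minimal sketch of how such a citation ranking can be computed, the code below iterates the formula given in the Brin and Page paper, PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A, C(T) is the number of links going out of T, and d is a damping factor (the paper usually sets it to 0.85). The function name `pagerank`, the `toy_web` graph, the fixed iteration count, and the handling of pages without out-links are assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # assumption: a page without out-links passes on no rank
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C collects links from both A and B, so it ends up ranked above A, which in turn ranks above B.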
Methods: PageRank
Why does PageRank work?
• Users often want information from a “trusted” source
  • Collaborative trust
• Inexpensive to compute
  • Allows fast updates
• Fewer privacy implications
  • Only public information is used
Methods: Anchor Text
• Associates anchor text with:
  • the page the link is on
  • the page the link points to
• Accurate descriptions of web pages
• Search non-indexable web pages
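The idea of crediting anchor text to the page a link points to can be sketched as follows. This is only an illustration built on Python's standard `html.parser`; the class `AnchorCollector`, the function `index_anchor_text`, and the example URLs are hypothetical names, not part of Google's system.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def index_anchor_text(pages):
    """Map each link target to the anchor texts that point at it."""
    index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.anchors:
            index[target].append(text)
    return index


# Hypothetical page linking to an image, which has no indexable text itself.
pages = {"http://a.example/": '<p>See the <a href="/logo.gif">company logo</a>.</p>'}
anchor_index = index_anchor_text(pages)
```

Here the image `http://a.example/logo.gif` becomes searchable through the words "company logo" even though the image itself contains no text.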
Methods: Proximity
• Hits
• Hit locations
• Multi-word search:
Calculate proximity --- how far apart the hits occur in the document (or anchor)
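A minimal sketch of the proximity calculation for a two-word query, assuming hit locations are simple word offsets within the document. The helper names `hit_positions` and `proximity` are hypothetical, and the paper's grouping of distances into proximity classes is left out.

```python
def hit_positions(text):
    """Map each lower-cased word to the sorted list of offsets where it occurs."""
    positions = {}
    for index, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(index)
    return positions


def proximity(positions_a, positions_b):
    """Smallest distance between any hit in one sorted list and any hit in the other."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance the list whose current hit is further left.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


hits = hit_positions("bill met hillary and then bill clinton spoke")
distance = proximity(hits["bill"], hits["clinton"])
```

The second occurrence of "bill" sits directly before "clinton", so the closest pair of hits is one word apart.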
Methods: Visual Presentation
• Font size
Larger/bolder fonts --- higher weights
• Capitalization --- higher weights
Methods: Architecture and Major Data Structures
Methods: Major Applications
• Crawling
• Indexing
• Searching
Crawling
How a crawler works:
1. Starts from a defined webspace
2. Requests URLs
3. Stores the returned objects in a file system
4. Examines the content of each object
5. Scans for HTML anchor tags <A..>
6. Ignores URLs that do not conform to the specified rules; visits URLs that do
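The crawl loop described above can be sketched as below. To keep the example self-contained, fetching and the URL rules are passed in as the caller-supplied functions `fetch` and `is_allowed`, and returned pages are stored in a dictionary instead of a file system. All names here (`crawl`, `LinkExtractor`, `fake_web`) are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed_urls, fetch, is_allowed, max_pages=100):
    """fetch(url) returns HTML text or None; is_allowed(url) encodes the URL rules."""
    store = {}
    queue = deque(url for url in seed_urls if is_allowed(url))
    seen = set(queue)
    while queue and len(store) < max_pages:
        url = queue.popleft()
        body = fetch(url)              # request the URL
        if body is None:
            continue
        store[url] = body              # store the returned object
        extractor = LinkExtractor()
        extractor.feed(body)           # scan for anchor tags
        for href in extractor.hrefs:
            link = urljoin(url, href)
            if is_allowed(link) and link not in seen:
                seen.add(link)         # visit only URLs that conform to the rules
                queue.append(link)
    return store


# A two-page in-memory "web" stands in for real HTTP requests.
fake_web = {
    "http://site.example/": '<a href="/a.html">A</a> <a href="http://other.example/">out</a>',
    "http://site.example/a.html": '<a href="/">home</a>',
}
stored = crawl(
    ["http://site.example/"],
    fake_web.get,
    lambda url: url.startswith("http://site.example/"),
)
```

The link to `other.example` falls outside the defined webspace and is never requested.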
Crawling
Google’s web crawling system:
• Fast distributed crawling system
• URLServer serves URLs to crawlers
• Each crawler keeps 300 connections open at once
• Each connection is in one of four states:
  1. Looking up DNS
  2. Connecting to host
  3. Sending request
  4. Receiving response
Indexing
1. Uses Flex to generate a lexical analyzer
2. Parses each document
3. Converts each word into a wordID
4. Converts each document into a set of hits
5. The sorter sorts the results by wordID to generate the inverted index
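The indexing steps can be illustrated with a small in-memory sketch. A regular expression stands in for the Flex-generated lexical analyzer, a hit is reduced to a (wordID, docID, position) tuple, and the function name `build_inverted_index` is hypothetical. The real system's barrels and compact hit encoding are not modeled.

```python
import re


def build_inverted_index(documents):
    """documents maps docID to text; returns (lexicon, inverted index)."""
    lexicon = {}   # word -> wordID
    hits = []      # forward view: one (wordID, docID, position) tuple per occurrence
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())   # parse the document
        for position, word in enumerate(words):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, doc_id, position))
    hits.sort()    # sorting by wordID groups the hits into doclists
    inverted = {}  # wordID -> list of (docID, [positions])
    for word_id, doc_id, position in hits:
        doclist = inverted.setdefault(word_id, [])
        if doclist and doclist[-1][0] == doc_id:
            doclist[-1][1].append(position)
        else:
            doclist.append((doc_id, [position]))
    return lexicon, inverted


documents = {1: "Bill Clinton speaks", 2: "the bill passed"}
lexicon, inverted = build_inverted_index(documents)
bill_doclist = inverted[lexicon["bill"]]
```

After sorting, the doclist for "bill" lists document 1 (position 0) followed by document 2 (position 1).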
Searching
1. Seek to the start of the doclist for every word.
2. Scan through the doclists until there is a document that matches all the search terms.
3. Compute the rank of that document for the query.
4. If we are not at the end of any doclist, go to step 2.
5. Sort the documents that have matched by rank and return the top k.
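The query evaluation loop above can be sketched as follows, assuming each doclist is a list of docIDs sorted in ascending order and that ranking is delegated to a caller-supplied function. The names `search` and `rank`, and the example scores, are hypothetical.

```python
import heapq


def search(doclists, rank, k=10):
    """doclists holds one sorted docID list per query word; rank(doc_id) returns a score."""
    if not doclists:
        return []
    cursors = [0] * len(doclists)                    # step 1: start of every doclist
    matches = []
    # Step 4: keep going while no doclist has reached its end.
    while all(c < len(dl) for c, dl in zip(cursors, doclists)):
        current = [dl[c] for c, dl in zip(cursors, doclists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            matches.append((rank(target), target))   # step 3: rank the matching document
            cursors = [c + 1 for c in cursors]
        else:
            # Step 2: advance every doclist that is still behind the largest docID.
            cursors = [c + 1 if dl[c] < target else c
                       for c, dl in zip(cursors, doclists)]
    # Step 5: sort matches by rank and return the top k.
    return [doc_id for _, doc_id in heapq.nlargest(k, matches)]


scores = {3: 0.2, 8: 0.9}
top = search([[1, 3, 5, 8], [3, 4, 8, 9]], scores.get, k=2)
```

Documents 3 and 8 appear in both doclists; document 8 has the higher score and is returned first.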
Searching: Ranking
Ranking function:
• PageRank
• Type weight
• Count weight
• Proximity
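For a single-word query, the paper describes taking the dot product of count-weights and type-weights to get an IR score, then combining that score with PageRank. It gives neither the actual weights nor the combining formula, so every number below, the linear-then-flat `count_weight`, and the additive `final_rank` are assumptions chosen only to show the shape of the computation. Proximity, which matters for multi-word queries, is omitted.

```python
# Hypothetical type weights: hits in more prominent places count for more.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "large_font": 3.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow linearly with the hit count, then stop growing at the cap (assumed shape)."""
    return min(count, cap)


def ir_score(hit_counts):
    """Dot product of count-weights and type-weights; hit_counts maps hit type to count."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())


def final_rank(hit_counts, pagerank, pagerank_weight=10.0):
    """Assumed combination: IR score plus a weighted PageRank."""
    return ir_score(hit_counts) + pagerank_weight * pagerank


title_and_body = ir_score({"title": 1, "plain": 20})
combined = final_rank({"plain": 2}, pagerank=0.5)
```

Because of the cap, twenty plain hits count no more than eight, so repeating a word many times does not keep raising the score.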
Results
A search on “bill clinton”:
• High quality pages
• Non-crawlable pages
• No results about a Bill other than Clinton
• No results about a Clinton other than Bill
Comparison with Other Search Engines
1. Breadth-first search vs. depth-first search
2. Comparison with WebCrawler:
   WebCrawler: Files that the WebCrawler cannot index, such as pictures, sounds, etc., are not retrieved.
   Google: Uses anchor text
3. Number of crawlers:
   WebCrawler: 15
   Google: typically 3
Comparison with Other Search Engines (continued)
4. Quantity vs. quality
   Alta Vista: Favors quantity
   Google: Provides quality search
Weak Points of Study
1. To limit response time, the searcher stops scanning once a certain number of matching documents has been found, then sorts and returns the results. The results can therefore be sub-optimal.
2. Lacks features such as boolean operators and negation.
3. Search efficiency:
   No optimizations such as query caching or subindices on common terms.
Suggestions for Future Study
1. Using link structure:
   In calculating PageRank, exclude links between two pages in the same web domain (these often serve as navigation and do not confer authority).
2. Personalize PageRank by increasing the weight of a user’s homepage or bookmarks.
“99% of the Web information is useless to 99% of the Web users.”
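Both suggestions on this slide can be sketched on top of the PageRank formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The code below is an assumption-laden illustration: `drop_same_host_links` approximates "same web domain" by comparing host names, and `personalized_pagerank` replaces the uniform (1-d) term with a per-page weight that is larger for a user's favorites. The function names, the `boost` value, and the example graphs are hypothetical.

```python
from urllib.parse import urlparse


def drop_same_host_links(links):
    """Remove links whose source and target share a host name."""
    return {
        source: [t for t in targets
                 if urlparse(t).netloc != urlparse(source).netloc]
        for source, targets in links.items()
    }


def personalized_pagerank(links, favorites, boost=5.0, d=0.85, iterations=50):
    """PageRank in which pages in favorites receive a larger base weight."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    weight = {p: (boost if p in favorites else 1.0) for p in pages}
    scale = len(pages) / sum(weight.values())   # keep the average base weight at 1
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) * weight[p] * scale for p in pages}
        for source, targets in links.items():
            if targets:
                share = d * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


site_links = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://b.example/": ["http://a.example/"],
}
external_only = drop_same_host_links(site_links)

# B and C are linked identically, so only the bookmark on B separates them.
symmetric_web = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
personal = personalized_pagerank(symmetric_web, favorites={"B"})
```

In the symmetric example B and C would tie under plain PageRank; bookmarking B is what lifts it above C.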
Suggestions for Future Study (continued)
3. Make use of hubs --- collections of links to authorities
4. In addition to anchor text, use the text surrounding links, too.
Conclusions
• Quality search results
• Techniques: PageRank, anchor text, proximity
• A complete architecture for crawling, indexing, and searching.