How Search Engines Work?
1
How Search Engines Work?
Ziv Bar-Yossef
Department of Electrical Engineering Technion
2
What is the Internet?
A global network of computers connected to each other
Computers “talk” to each other using standard protocols (TCP/IP)
3
What is the World-Wide Web (WWW)?
Collection of pages available via the Internet
Internet users can view pages with web browsers
WWW is only one application of the Internet
Other applications: email, messengers, VoIP, newsgroups, FTP
4
Web Pages
Various formats: PDF, Word, Excel, images, MP3, video, text
Most popular format: HTML
HTML pages point to each other using hyperlinks
Users “surf the web” by clicking hyperlinks
5
What are Search Engines?
Users have “information needs”
  Where can I find solutions to my math homework problem?
  Where can I find mp3s of Miri Messika’s latest album?
  What is the weather in Eilat during Hanukkah?
  What other Sharons are famous, besides our prime minister?
Search engines enable us to find web pages that match our information needs
6
Search Engines
[Diagram: the user turns an information need into a query; the search engine, which draws on web pages from the Web, returns a ranked list of matching pages]
Information need: What other Sharons are famous, besides our prime minister?
Query: sharon -ariel
Ranked list of matching pages:
  1. Sharon Creech
  2. Sharon Stone
  3. Sharon, Massachusetts
7
How Search Engines (don’t) Work?
Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches
[Diagram: the user sends the query to the search engine, which scans the web pages of the Web and returns a ranked list of matching pages]
Query: sharon -ariel
Ranked list of matching pages:
  1. Sharon Creech
  2. Sharon Stone
  3. Sharon, Massachusetts
8
How Search Engines Work?
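What do you do when you look for a term in an encyclopedia? Use the index!
[Diagram: the user sends the query to the search engine, which looks it up in an index built from the web pages of the Web and returns a ranked list of matching pages]
Query: sharon -ariel
Ranked list of matching pages:
  1. Sharon Creech
  2. Sharon Stone
  3. Sharon, Massachusetts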
9
Search Engine Architecture
[Diagram: the components of a search engine]
Crawler
Index
Ranking Algorithm
Query Processor
10
Web Crawler (a.k.a. Spider)
Fetches web pages and stores them in a local repository
Tries to get as many web pages as possible
Follows hyperlinks to learn about new pages
Refetches pages that change frequently
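The crawling loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not how any real crawler is implemented: the in-memory `FAKE_WEB` dictionary, the `fetch` helper, and the `.example` URLs are all hypothetical stand-ins for real HTTP requests, and refetching of frequently changing pages is left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing hyperlinks).
FAKE_WEB = {
    "a.example": ("home page", ["b.example", "c.example"]),
    "b.example": ("about sharon stone", ["a.example"]),
    "c.example": ("about ariel sharon", ["d.example"]),
    "d.example": ("a dead end", []),
}

def fetch(url):
    """Stand-in for an HTTP fetch; returns (text, links) or None if unknown."""
    return FAKE_WEB.get(url)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch pages, store them, follow their hyperlinks."""
    repository = {}              # local repository: URL -> page text
    frontier = deque(seed_urls)  # URLs discovered but not yet fetched
    seen = set(seed_urls)        # avoids queueing the same URL twice
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        repository[url] = text
        for link in links:       # hyperlinks reveal new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

repository = crawl(["a.example"])
```

Starting from a single seed, the crawler reaches every page that is linked, directly or indirectly, from that seed.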
11
The Index
Example pages (the number after each word is its position in the page):
www.cnn.com: Ariel(1) Sharon(2), the(3) prime(4) minister(5) of(6) Israel(7) founded(8) a(9) new(10) political(11) party(12).
www.hollywood.com: Sharon(1) Stone(2) dressed(3) a(4) new(5) Jean(6) Paul(7) Gaultier(8) gown(9) at(10) the(11) Oscars(12) after(13) party(14).
Index:
ariel: (cnn.com,1)
dress: (hollywood.com,3)
found: (cnn.com,8)
gaultier: (hollywood.com,8)
gown: (hollywood.com,9)
israel: (cnn.com,7)
jean: (hollywood.com,6)
minister: (cnn.com,5)
new: (cnn.com,10), (hollywood.com,5)
oscar: (hollywood.com,12)
party: (cnn.com,12), (hollywood.com,14)
paul: (hollywood.com,7)
political: (cnn.com,11)
prime: (cnn.com,4)
sharon: (cnn.com,2), (hollywood.com,1)
stone: (hollywood.com,2)
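A positional index like the one on this slide could be built along the following lines. This is only a sketch under stated assumptions: the `STOP_WORDS` set and the `build_index` function are illustrative names, and no stemming is applied, so the keys are "founded", "dressed", and "oscars" rather than the stemmed "found", "dress", and "oscar" shown on the slide.

```python
import re
from collections import defaultdict

# Assumed stop-word list: the common words the slide leaves out of its index.
STOP_WORDS = {"the", "of", "a", "at", "after"}

def build_index(pages):
    """Map each term to a posting list of (url, position) pairs."""
    index = defaultdict(list)
    for url, text in pages.items():
        words = re.findall(r"[a-z]+", text.lower())
        # Positions are 1-based and count every word, stop words included.
        for position, word in enumerate(words, start=1):
            if word not in STOP_WORDS:
                index[word].append((url, position))
    return dict(index)

pages = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel "
               "founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier gown "
                     "at the Oscars after party.",
}
index = build_index(pages)
```

Looking up a term is then a single dictionary access, for example `index["sharon"]`, instead of a scan over every page.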
12
Index by “Anchor Text”
Anchor text: what’s written inside a link
  Example: Ariel Sharon, the prime minister…
Usually succinctly describes what’s written in the linked page
By which terms is a page listed in the index?
  Terms that appear in the page
  Terms that appear in the anchor text of links to the page
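One possible way to collect anchor text is sketched below with Python's standard `html.parser` module. The HTML snippet, its link target, and the `AnchorTextParser` class are assumptions made for illustration; the slide does not say which page the example link points to.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collects (link target, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # list of (href, anchor text)
        self._href = None   # target of the link currently being read
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical page containing one link.
html = '<p><a href="http://www.cnn.com">Ariel Sharon</a>, the prime minister</p>'
parser = AnchorTextParser()
parser.feed(html)

# The linked page gets indexed under the terms of the anchor text.
anchor_terms = {href: text.lower().split() for href, text in parser.anchors}
```

Here the linked page would be listed under "ariel" and "sharon" even if those words never appeared in its own text.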
13
Query Processor
Gets a user query
Fetches relevant posting lists from index
Extracts relevant matches from lists
Example: Query = “sharon -ariel”
  L1 = posting list of sharon: (cnn.com,2), (hollywood.com,1)
  L2 = posting list of ariel: (cnn.com,1)
  Return all pages in L1 that do not occur in L2
  Result: hollywood.com (cnn.com is dropped because it occurs in L2)
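The query-processing steps on this slide can be sketched as below. The `process_query` function is a hypothetical name, and treating several plain terms as "all must match" is an assumption; the slide only shows one included and one excluded term.

```python
def process_query(index, query):
    """Return the pages that match the query, using only posting lists."""
    include, exclude = [], []
    for term in query.lower().split():
        if term.startswith("-"):
            exclude.append(term[1:])   # "-ariel" means: must not contain ariel
        else:
            include.append(term)
    if not include:
        return []

    def pages_of(term):
        # Posting lists hold (url, position) pairs; only the URL matters here.
        return {url for url, _position in index.get(term, [])}

    matches = pages_of(include[0])
    for term in include[1:]:
        matches &= pages_of(term)      # assumed: every plain term must occur
    for term in exclude:
        matches -= pages_of(term)      # drop pages containing excluded terms
    return sorted(matches)

index = {
    "sharon": [("cnn.com", 2), ("hollywood.com", 1)],
    "ariel": [("cnn.com", 1)],
}
result = process_query(index, "sharon -ariel")
```

For the slide's query, only pages in the posting list of sharon that are absent from the posting list of ariel remain in `result`.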
14
Ranking Algorithm
Many queries have many matching pages
  472 million matches for “London” in Google
Cannot return all of them to the user
  User needs the most relevant results anyway
Need to order results by relevance
  Most relevant results are at the top
Ranking algorithm: a method of ordering matches
  The “heart” of a search engine
  The reason why Google is the most preferred search engine today
15
Google’s PageRank: Ranking Elections
Candidates: all web pages
Voters: all web pages
p votes for q if p has a hyperlink to q
Favorites(p) = all the pages p votes for
Fans(p) = all the pages that vote for p
PageRank(p) = 1, if p has no fans
PageRank(p) = sum over all q in Fans(p) of PageRank(q) / |Favorites(q)|, otherwise
16
Google’s PageRank
Underlying principles:
  A page is “important” if it has important fans
  A page splits its “importance” evenly among its favorite pages
[Figure: example link graph whose pages have PageRank values 1, 1, 1, 1, 1.5, 2.5, and 4]
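The two principles above can be turned into a small computation, sketched here under several assumptions. The simplified rule used is: a page with no fans gets importance 1, and every other page gets the sum of its fans' importance, each divided by the number of that fan's favorites. The link graph below is hypothetical, chosen only because it reproduces the values 1, 1, 1, 1, 1.5, 2.5, and 4; the edges of the slide's own figure are not recoverable. This is not Google's production algorithm, and without further refinements the iteration is not guaranteed to settle on graphs that contain cycles.

```python
def pagerank(favorites, iterations=50):
    """Simplified PageRank: each page splits its importance among its favorites."""
    pages = set(favorites)
    for targets in favorites.values():
        pages.update(targets)

    # Fans(p) = all pages that link to p.
    fans = {p: [] for p in pages}
    for p, targets in favorites.items():
        for q in targets:
            fans[q].append(p)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            if not fans[p]:
                new_rank[p] = 1.0   # a page with no fans has importance 1
            else:
                new_rank[p] = sum(rank[q] / len(favorites[q]) for q in fans[p])
        if new_rank == rank:        # values stopped changing
            break
        rank = new_rank
    return rank

# Hypothetical graph: page -> pages it links to (its favorites).
favorites = {
    "A": ["E"], "B": ["E", "F"], "C": ["F"], "D": ["F"],
    "E": ["G"], "F": ["G"], "G": [],
}
rank = pagerank(favorites)
```

Page B splits its importance between E and F, which is why E ends up with 1.5 rather than 2.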
17
Google’s PageRank
Ranking algorithm:
  Find pages that match the given query
  Order them by their PageRank
  Return top 10 matches
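The last two steps of this ranking algorithm might look like the sketch below. The `rank_matches` function, the sample scores, and the `.example` URLs are hypothetical; the matches are assumed to come from an earlier query-processing step.

```python
def rank_matches(matches, pagerank, top_k=10):
    """Order matching pages by PageRank, highest first, and keep the top k."""
    ordered = sorted(
        matches,
        key=lambda page: pagerank.get(page, 0.0),  # unknown pages rank last
        reverse=True,
    )
    return ordered[:top_k]

# Hypothetical PageRank scores and query matches.
pagerank = {"a.example": 1.0, "b.example": 4.0, "c.example": 2.5}
top = rank_matches(["a.example", "b.example", "c.example"], pagerank, top_k=2)
```

Because the PageRank scores do not depend on the query, they can be computed ahead of time, so answering a query only requires a lookup and a sort.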
18
But… PageRank Does Not Always Work
SPAM
19
Conclusions
Search engines use an index to answer user queries
Ranking is the most important component
Spam is a problem
20
Thank You