Alexander Tolmachev: Web search engines
A brief overview of how web search engines work
Web search engines
Alexander Tolmachev, gr. #3057/2
2
Contents
- Introduction: what do web search engines mean for us today?
- History of web search engines
- How web search engines work
- Most popular search engines
- Conclusion: past, present and future of web search
3
Contents
➔ Introduction: what do web search engines mean for us today?
- History of web search engines
- How web search engines work
- Most popular search engines
- Conclusion: past, present and future of web search
4
The Web as a huge store of information
- A huge amount of information is contained in the World Wide Web
- And this amount is still growing day by day
- We need to orient ourselves in this enormous information space
- Web search engines let us quickly find the information that we are interested in
5
Web search engines in our life
- We use web search engines every day for:
  - Searching texts, articles, books, news, etc.
  - Searching different media: music, videos, films, pictures, etc.
  - Searching goods
  - Searching web sites and web portals
  - Preparing lectures and presentations ☺
  - …
- The verb “to google” is included in dictionaries
- Web search engines have become an integral part of our life
6
Contents
✔ Introduction: what do web search engines mean for us today?
➔ History of web search engines
- How web search engines work
- Most popular search engines
- Conclusion: past, present and future of web search
7
The very first search tools
- 1989–1991 – the invention of the World Wide Web by Sir Tim Berners-Lee at CERN
- Archie (1990)
  - The first Internet search tool
  - Fetching and indexing files on FTP servers
  - Providing search for indexed files
- Veronica and Jughead – search tools similar to Archie, built for the Gopher protocol (invented in 1991)
8
The first web search engines
- W3Catalog (1993)
  - The first primitive search engine
  - Mirroring and integration of manually maintained catalogues
  - Still available: http://www.w3catalog.com/
- World Wide Web Wanderer (1993)
  - The first web crawler
  - The first web index, called Wandex
  - Aimed to measure the size of the Web, not to serve as a search tool
9
The first web search engines
- JumpStation (1993)
  - The first web search engine combining crawling, indexing and searching
  - A web form for search queries
  - No ranking, just listing search results
- Excite (1994)
  - The first ranking system
- WebCrawler (1994)
  - Indexing full text
  - The first widely known web search engine
10
Web search evolution
- 1994–1997 – a number of similar web search engines:
  - Infoseek
  - OpenText
  - Magellan
  - Inktomi
  - Northern Light
  - AskJeeves
  - AltaVista
11
Web search evolution
- Yahoo! (1994)
  - Search in a human-edited hierarchical web directory
  - Relevance decided manually by editors
  - Search by keywords as well as browsing the full directory
  - Gained large popularity
  - Later, in 2004, developed its own web search engine
  - One of the main stars of the business world in the 1990s
12
Web search evolution
- Google (1998)
  - The invention of PageRank
  - Simple and clear interface instead of turning into a web portal
- Yandex (1997)
  - Full-text search with Russian morphology support
  - Quickly gained large popularity in Russia
13
Web search engines today
- Powerful web search technologies
  - Maximal freshness of results
  - Variety of types of searchable documents
  - Intelligent ranking algorithms
- Media search:
  - Images
  - Music
  - Videos
  - …
14
Web search engines today
- Personalized search
  - Based on the user's search history
  - Based on personal information from virtual social spaces
- Location-based search
- Vertical search
- Image-based search
- Audio-based search
15
Contents
✔ Introduction: what do web search engines mean for us today?
✔ History of web search engines
➔ How web search engines work
- Most popular search engines
- Conclusion: past, present and future of web search
16
Basic principles of web search
- Create and sort a pool of data
- Find the most appropriate information
- Deliver this information
17
Basic parts of a web search engine
- A web spider/crawler/robot – a computer program which:
  - Continuously traverses web pages
  - Finds new or changed content
  - Stores visited pages in the corpus
- Index – a database containing crawling results
- Search engine – a computer program which:
  - Identifies pages relevant to the search query
  - Retrieves these pages
  - Ranks them
- User interface
18
Web crawling
- Web crawling aims to traverse web pages and to store their copies for further indexing
- General web crawler algorithm:
  - Starts with a list of initial URLs, called the seeds
  - Visits these URLs
  - Retrieves the required information from each page
  - Identifies all the hyperlinks on the page
  - Adds these links to the queue of URLs, called the crawl frontier
  - Recursively visits URLs from the crawl frontier
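The crawler algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `LinkExtractor`, `crawl`, `fetch` and `PAGES` are hypothetical, the "web" is a small in-memory dictionary instead of real HTTP requests, and the frontier is processed with a queue in a loop rather than by literal recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    corpus = {}               # url -> stored copy of the page
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        corpus[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus


# Hypothetical three-page "web" kept in memory (assumption for the example).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">x</a>',
}

corpus = crawl(["http://example.test/"], PAGES.get)
```

A real crawler would replace `PAGES.get` with a function that downloads pages over HTTP and would apply the crawling policies on top of this loop.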
19
Web crawler architecture
20
Crawling policies
- A selection policy
  - Focused crawling
  - Restricting followed links
  - URL normalization
  - Path-ascending crawling
- A re-visit policy
  - Uniform policy
  - Proportional policy
- A politeness policy
- A parallelization policy
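As an illustration of one item of the selection policy, URL normalization, here is a minimal sketch. The function name `normalize_url` and the chosen rules (lowercase scheme and host, drop the default port, drop the fragment, use "/" for an empty path) are assumptions made for the example; real crawlers may apply more or different rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the meaning of a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical spelling so that equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # user info, if any, is dropped
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is never sent to the server, so it is removed.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

With these rules, `HTTP://Example.COM:80/a/b?x=1#top` and `http://example.com/a/b?x=1` map to the same string, so the crawler stores the page only once.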
21
Indexing
- The purpose of indexing is to find the documents in the corpus that are relevant to a search query quickly and efficiently
- For example, with 10,000 documents:
  - A query is answered within milliseconds with the help of an index
  - A sequential scan could take hours
- Meta search engines reuse the indices of other services and do not store a local index
  - E.g. vertical search can use the indices of vertical services
22
Inverted index
- For each word, stores a list of documents containing this word
- Provides direct access to the documents associated with each word in the search query
- Commonly used by web search engines
- Not convenient to update
Forward index
- Stores a list of words for each document
- It is handier to store the words of each document immediately during parsing
- Enables asynchronous processing – much easier to update than an inverted index
- Is stored in order to be transformed into an inverted index
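The transformation from a forward index to an inverted one can be sketched as follows. The function names `build_forward_index` and `invert` are hypothetical, and whitespace tokenization is an assumption made to keep the example short.

```python
def build_forward_index(documents):
    """Map each document id to the list of words it contains, in order."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}


def invert(forward_index):
    """Turn a forward index (doc -> words) into an inverted one (word -> docs)."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word in words:
            postings = inverted.setdefault(word, [])
            # Documents are processed one at a time, so a repeated word in the
            # same document would show up as the last posting; skip it.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return inverted


forward = build_forward_index({1: "web search web", 2: "search index"})
inverted = invert(forward)
```

Adding a newly parsed document to the forward index is a single assignment, `forward[3] = [...]`, whereas the inverted index has to touch one posting list per distinct word.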
24
Ranking
- Ranking is an arrangement of web search results in order of relevance
- Usually based on statistical methods
  - Frequency of keywords in a particular document
  - Rating page popularity and authority
- Advanced search engines also use intelligent ranking algorithms
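The simplest statistical method mentioned above, keyword frequency, might look like the sketch below. The function name `rank_by_term_frequency` and the sample documents are hypothetical. The score is the share of a document's words that match the query, which is only one of many possible weightings.

```python
from collections import Counter


def rank_by_term_frequency(documents, query):
    """Order document ids by the relative frequency of the query words."""
    terms = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        counts = Counter(words)
        hits = sum(counts[term] for term in terms)
        if hits:
            scores[doc_id] = hits / len(words)
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical sample documents.
SAMPLE = {
    "a": "search engines rank search results",
    "b": "results of a web search",
    "c": "crawlers store pages",
}
ranked = rank_by_term_frequency(SAMPLE, "search")
```

Document "a" contains the query word twice in five words and document "b" once in five, so "a" is listed first; document "c" does not match and is left out.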
25
Google PageRank
- PageRank was invented in 1998 by Larry Page and Sergey Brin at Stanford University
- It aims to rate the authority of a web page relative to other web pages
- Basic principles:
  - A hyperlink to a page counts as a vote of support
  - A page with a high number of incoming links has high authority
  - A hyperlink coming from an authoritative web page gives more points
- PR(p) is the probability that a person randomly clicking on links will arrive at page p
26
Google PageRank
[Figure: three diagrams of pages A, B, C and D, labelled with the following values]
      A      B      C      D
1)    0.25   0.25   0.25   0.25
2)    1/2    1/6    1/6    1/6
3)    6/17   2/17   3/17   6/17
27
Google PageRank
So, PageRank of page A:
In the general case, the PageRank value for any page u:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$
where B_u is the set of all pages linking to page u, and L(v) is the number of links from page v.
28
Google PageRank
- Spider traps [figure: example with pages A, B and C]
- Damping factor d – the probability that the random surfer continues the traversal
- (1 − d) – the probability of jumping to a random site
- The resulting formula:
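A minimal sketch of PageRank with a damping factor, computed by repeated iteration. It assumes the widely used normalized form PR(u) = (1 − d)/N + d · Σ PR(v)/L(v), where the sum runs over the pages v linking to u and N is the number of pages. N, the function name `pagerank`, the default d = 0.85, the handling of pages without outgoing links and the example graphs are all assumptions made for the illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps every page to the list of pages it links to.

    Every link target must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}   # start from a uniform distribution
    for _ in range(iterations):
        # With probability (1 - d) the surfer jumps to a random page.
        new = {page: (1.0 - d) / n for page in pages}
        for v, outgoing in links.items():
            if not outgoing:
                # Assumption: a page without links passes its rank to all pages.
                for page in pages:
                    new[page] += d * pr[v] / n
            else:
                share = d * pr[v] / len(outgoing)   # PR(v) / L(v), damped
                for u in outgoing:
                    new[u] += share
        pr = new
    return pr


# Hypothetical spider trap: page C links only to itself.
TRAP = {"A": ["B"], "B": ["C"], "C": ["C"]}
undamped = pagerank(TRAP, d=1.0)
damped = pagerank(TRAP, d=0.85)
```

Without damping (d = 1) all of the rank ends up in the trap page C. With d = 0.85 the random jumps keep returning some rank to A and B, which is the purpose of the damping factor.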
29
Web Search Engine Architecture
30
Contents
✔ Introduction: what do web search engines mean for us today?
✔ History of web search engines
✔ How web search engines work
➔ Most popular search engines
- Conclusion: past, present and future of web search
31
Google
- Was started in 1996 as a research project of Larry Page and Sergey Brin at Stanford University
- Was launched in 1998
- By the end of 1998 already had an index of about 60 million pages
- Quickly gained popularity due to the PageRank algorithm
32
- Today Google is the most popular web search engine in the world: 85% of the web search market
- Provides many other services:
  - Gmail
  - Google Maps
  - Google+
  - …
- Has its own OS – Android
- Provides a web browser – Google Chrome
- ...
33
Yandex
- Was founded in 1997 by Arkady Volozh and Ilya Segalovich
- The first web search engine providing morphological search
- The prototype of the Yandex search engine was a system for automated searching in the Bible
- The name stands for “Yet Another iNDEXer”
34
Yandex
- In 1998 Yandex launched contextual advertising
- In 2001 Yandex.Direct was launched – an automated, auction-based system for placement of text-based advertising
- 2005 – Ukraine portal, www.yandex.ua
- 2008 – Yandex Labs in the San Francisco Bay Area
- 2010 – English version of the web search engine
- 2011 – search engine and a range of other services in Turkey, at yandex.com.tr
35
Yandex
36
Yandex today
- 63% of the Russian web search market
- More than 3500 employees
- 24 offices in 8 countries
37
Contents
✔ Introduction: what do web search engines mean for us today?
✔ History of web search engines
✔ How web search engines work
✔ Most popular search engines
➔ Conclusion: past, present and future of web search
38
Conclusion
- Web search engines are an integral part of our life today
- They came a long way before they reached today's performance and power
- Their development is far from finished
- The main development trends are:
  - Web search personalization
  - Location-based search
  - Vertical search
39
Your questions, please
40
Thank you for your time!