tolmachev alexander web search engines

Web search engines Alexander Tolmachev gr. #3057/2

Upload: alexandertolmachev

Post on 17-May-2015


Category:

Technology



DESCRIPTION

A brief overview of how web search engines work

TRANSCRIPT

Page 1: Tolmachev Alexander Web Search Engines

Web search engines

Alexander Tolmachev, gr. #3057/2

Page 2: Tolmachev Alexander Web Search Engines

Contents

Introduction: what do web search engines mean for us today?
History of web search engines
How web search engines work
Most popular search engines
Conclusion: past, present and future of web search

Page 3: Tolmachev Alexander Web Search Engines

Contents

➔ Introduction: what do web search engines mean for us today?
History of web search engines
How web search engines work
Most popular search engines
Conclusion: past, present and future of web search

Page 4: Tolmachev Alexander Web Search Engines

The Web as a huge storage of information

A huge amount of information is contained in the World Wide Web
And this amount is still growing day by day
We need to orient ourselves in this enormous information space
Web search engines let us quickly find the information we are interested in

Page 5: Tolmachev Alexander Web Search Engines

Web search engines in our life

We use web search engines every day for:
  Searching for texts, articles, books, news, etc.
  Searching for different media: music, videos, films, pictures, etc.
  Searching for goods
  Searching for web sites and web portals
  Preparing lectures and presentations ☺
  …
The verb “to google” is included in dictionaries
Web search engines have become an integral part of our life

Page 6: Tolmachev Alexander Web Search Engines

Contents

✔ Introduction: what do web search engines mean for us today?
➔ History of web search engines
How web search engines work
Most popular search engines
Conclusion: past, present and future of web search

Page 7: Tolmachev Alexander Web Search Engines

The very first search tools

1989–1991 – the invention of the World Wide Web by Sir Tim Berners-Lee at CERN

Archie (1990)
  The first Internet search tool
  Fetching and indexing files on FTP servers
  Providing search for indexed files

Veronica and Jughead – search tools similar to Archie, built for the Gopher protocol (invented in 1991)

Page 8: Tolmachev Alexander Web Search Engines

The first web search engines

W3Catalog (1993)
  The first primitive search engine
  Mirroring and integration of manually maintained catalogues
  Still available: http://www.w3catalog.com/

World Wide Web Wanderer (1993)
  The first web crawler
  The first web index, called Wandex
  Aimed at measuring the size of the Web, not at serving as a search tool

Page 9: Tolmachev Alexander Web Search Engines

The first web search engines

JumpStation (1993)
  The first web search engine combining crawling, indexing and searching
  A web form for search queries
  No ranking, just listing search results

Excite (1994)
  The first ranking system

WebCrawler (1994)
  Indexing full text
  The first widely known web search engine

Page 10: Tolmachev Alexander Web Search Engines

Web search evolution

1994–1997 – a number of similar web search engines:
  Infoseek
  OpenText
  Magellan
  Inktomi
  Northern Light
  AskJeeves
  AltaVista

Page 11: Tolmachev Alexander Web Search Engines

Web search evolution

Yahoo! (1994)
  Search in a human-edited hierarchical web directory
  Relevance decided manually
  Search by keywords as well as browsing the full directory
  Gained large popularity
  Later, in 2004, developed its own web search engine
  One of the main stars of the business world in the 1990s

Page 12: Tolmachev Alexander Web Search Engines

Web search evolution

Google (1998)
  The invention of PageRank
  A simple and clear interface instead of turning into a web portal

Yandex (1997)
  Full-text search with Russian morphology support
  Quickly gained large popularity in Russia

Page 13: Tolmachev Alexander Web Search Engines

Web search engines today

Powerful web search technologies
  Maximal freshness of results
  Variety of types of searchable documents
  Intelligent ranking algorithms

Media search:
  Images
  Music
  Videos
  …

Page 14: Tolmachev Alexander Web Search Engines

Web search engines today

Personalized search
  Based on user's search history
  Based on personal information from virtual social spaces
Location-based search
Vertical search
Image-based search
Audio-based search

Page 15: Tolmachev Alexander Web Search Engines

Contents

✔ Introduction: what do web search engines mean for us today?
✔ History of web search engines
➔ How web search engines work
Most popular search engines
Conclusion: past, present and future of web search

Page 16: Tolmachev Alexander Web Search Engines

Basic principles of web search

Create and sort a pool of data
Find the most appropriate information
Deliver this information

Page 17: Tolmachev Alexander Web Search Engines

Basic parts of a web search engine

A web spider/crawler/robot – a computer program which:
  Continuously traverses web pages
  Finds new or changed content
  Stores visited pages in the corpus
Index – a database containing crawling results
Search engine – a computer program which:
  Identifies pages relevant to the search query
  Retrieves these pages
  Ranks them
User interface

Page 18: Tolmachev Alexander Web Search Engines

Web crawling

Web crawling aims to traverse web pages and to store their copies for further indexing

General web crawler algorithm:
  Starts with a list of initial URLs, called the seeds
  Visits these URLs
  Retrieves required information from the page
  Identifies all the hyperlinks on the page
  Adds these links to the queue of URLs, called the crawl frontier
  Recursively visits URLs from the crawl frontier
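
The crawler algorithm above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: `TOY_WEB`, `fetch_links` and `crawl` are hypothetical names, the "web" is an in-memory dictionary instead of real HTTP requests, and the frontier is processed with a queue rather than by literal recursion.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])


def crawl(seeds, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)  # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, so nothing is queued twice
    corpus = {}              # visited URL -> links found on that page
    while frontier and len(corpus) < max_pages:
        url = frontier.popleft()
        links = fetch_links(url)
        corpus[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return corpus


visited_order = list(crawl(["http://a.example/"]))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as `a.example` and `b.example` do here.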

Page 19: Tolmachev Alexander Web Search Engines


Web crawler architecture

Page 20: Tolmachev Alexander Web Search Engines

Crawling policies

A selection policy
  Focused crawling
  Restricting followed links
  URL normalization
  Path-ascending crawling
A re-visit policy
  Uniform policy
  Proportional policy
A politeness policy
A parallelization policy
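
As an illustration of one item from the selection policy, URL normalization, here is a small sketch using Python's standard `urllib.parse`. The function name `normalize_url` and the particular rules chosen (lowercase scheme and host, drop the default port and the fragment, use `/` for an empty path) are assumptions for the example; real crawlers apply more rules than these.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Map equivalent spellings of a URL to one canonical form.

    Simplified sketch: user info and IPv6 hosts are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With such a function the crawler can recognize that `HTTP://Example.COM:80/index.html#top` and `http://example.com/index.html` point to the same page and visit it only once.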

Page 21: Tolmachev Alexander Web Search Engines

Indexing

Indexing makes it fast to find the documents in the corpus that are relevant to a search query.

For example, for 10,000 documents:
  Queried within milliseconds with the help of an index
  A sequential scan could take hours

Meta search engines reuse the indices of other services and do not store a local index
  E.g. vertical search can use indices of vertical services

Page 22: Tolmachev Alexander Web Search Engines

Inverted index

For each word stores a list of documents containing this word
Provides direct access to the documents associated with each word in the search query
Commonly used by web search engines
Not convenient to update
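
A minimal sketch of an inverted index in Python may make this concrete. The sample documents and the names `DOCS`, `build_inverted_index` and `search` are hypothetical, and splitting on whitespace is a deliberate simplification of real tokenization.

```python
from collections import defaultdict

# Hypothetical sample corpus: document id -> text.
DOCS = {
    1: "web search engines index the web",
    2: "a crawler visits web pages",
    3: "search results are ranked",
}


def build_inverted_index(docs):
    """Map each word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_inverted_index(DOCS)
```

A query only touches the entries for its own words, so `search(INDEX, "web search")` never has to scan the text of documents 2 and 3.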

Page 23: Tolmachev Alexander Web Search Engines

Forward index

Stores a list of words for each document
It is handier to store words per document immediately during parsing
Enables asynchronous processing – much easier to update than an inverted index
Is stored so that it can later be transformed into an inverted index
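
The relationship between the two structures can be sketched as follows: the forward index is filled document by document during parsing, and a separate step turns it into an inverted index. The function names `build_forward_index` and `invert` are hypothetical, and whitespace splitting again stands in for real parsing.

```python
def build_forward_index(docs):
    """Map each document id to the list of words it contains."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def invert(forward_index):
    """Transform a forward index into an inverted index (word -> doc ids)."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word in words:
            inverted.setdefault(word, set()).add(doc_id)
    return inverted


forward = build_forward_index({1: "Web search", 2: "web crawler"})
inverted = invert(forward)
```

Adding a new document only appends one entry to the forward index, whereas in the inverted index it touches the entry of every word the document contains.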

Page 24: Tolmachev Alexander Web Search Engines

Ranking

Ranking is an arrangement of web search results in order of relevance

Usually based on statistical methods
  Frequency of keywords in a particular document
  Rating page popularity and authority

Advanced search engines also use intelligent ranking algorithms

Page 25: Tolmachev Alexander Web Search Engines

Google PageRank

PageRank was invented in 1998 by Larry Page and Sergey Brin at Stanford University
It aims to rate the authority of a web page relative to other web pages

Basic principles:
  A hyperlink to a page counts as a vote of support
  A page with a high number of incoming links has high authority
  A hyperlink coming from an authoritative web page gives more points

PR(p) is the probability that a person randomly clicking on links will arrive at page p

Page 26: Tolmachev Alexander Web Search Engines

Google PageRank

[Figure: example link graph with pages A, B, C, D and three sets of PageRank values]

A      B      C      D
0.25   0.25   0.25   0.25
1/2    1/6    1/6    1/6
6/17   2/17   3/17   6/17

Page 27: Tolmachev Alexander Web Search Engines

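Google PageRank

So, the PageRank of page A is the sum, over every page that links to A, of that page's PageRank divided by its number of outgoing links.

In the general case, the PageRank value for any page u:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

where B_u is the set containing all pages linking to page u, and L(v) is the number of links from page v.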

Page 28: Tolmachev Alexander Web Search Engines

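Google PageRank

Spider traps: [figure: example link graph with pages A, B, C]

Damping factor d – the probability that the random surfer continues following links
(1 − d) – the probability of jumping to a random page

The resulting formula:

$$PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

where N is the total number of pages.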
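
The PageRank computation can be sketched as a simple iteration in Python. This is an illustrative sketch, not Google's implementation: the graphs `GRAPH` and `TRAP` are hypothetical, the fixed number of iterations is arbitrary, the (1 − d) share is split evenly over all N pages so that the values remain probabilities, and pages without outgoing links are assumed to spread their rank evenly over all pages.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share (1 - d) / N.
        new_ranks = {page: (1.0 - d) / n for page in pages}
        for v, outgoing in links.items():
            if outgoing:
                # Page v passes d * PR(v) / L(v) to each page it links to.
                share = d * ranks[v] / len(outgoing)
                for u in outgoing:
                    new_ranks[u] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                for u in pages:
                    new_ranks[u] += d * ranks[v] / n
        ranks = new_ranks
    return ranks


# Hypothetical example graph: A links to B and C, B to C, C back to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(GRAPH)

# Hypothetical spider trap: C links only to itself.
TRAP = {"A": ["B"], "B": ["C"], "C": ["C"]}
trapped = pagerank(TRAP, d=1.0)   # no damping: C absorbs all the rank
damped = pagerank(TRAP, d=0.85)   # damping leaves some rank for A and B
```

Running it with `d=1.0` on the trap graph shows why the damping factor is needed: without random jumps, all of the rank ends up in the page that never links out to the rest of the graph.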

Page 29: Tolmachev Alexander Web Search Engines


Web Search Engine Architecture

Page 30: Tolmachev Alexander Web Search Engines

Contents

✔ Introduction: what do web search engines mean for us today?
✔ History of web search engines
✔ How web search engines work
➔ Most popular search engines
Conclusion: past, present and future of web search

Page 31: Tolmachev Alexander Web Search Engines

Google

Was started in 1996 as a research project of Larry Page and Sergey Brin at Stanford University
Was launched in 1998
By the end of 1998 already had an index of about 60 million pages
Quickly gained popularity due to the PageRank algorithm

Page 32: Tolmachev Alexander Web Search Engines

Google

Today Google is the most popular web search engine in the world: 85% of the web search market

Provides many other services:
  Gmail
  Google Maps
  Google+
  …
Has its own OS – Android
Provides a web browser – Google Chrome
...

Page 33: Tolmachev Alexander Web Search Engines

Yandex

Was founded in 1997 by Arkady Volozh and Ilya Segalovich
The first web search engine providing morphological search
The prototype of the Yandex search engine was a system for automated searching in the Bible
The name stands for “Yet Another iNDEXer”

Page 34: Tolmachev Alexander Web Search Engines

Yandex

In 1998 Yandex launched contextual advertising
In 2001 Yandex.Direct was launched – an automated, auction-based system for placement of text-based advertising

2005 – Ukrainian portal, www.yandex.ua
2008 – Yandex Labs in the San Francisco Bay Area
2010 – English version of the web search engine
2011 – search engine and a range of other services in Turkey, at yandex.com.tr

Page 35: Tolmachev Alexander Web Search Engines


Yandex

Page 36: Tolmachev Alexander Web Search Engines

Yandex today

63% of the Russian web search market
More than 3500 employees
24 offices in 8 countries

Page 37: Tolmachev Alexander Web Search Engines

Contents

✔ Introduction: what do web search engines mean for us today?
✔ History of web search engines
✔ How web search engines work
✔ Most popular search engines
➔ Conclusion: past, present and future of web search

Page 38: Tolmachev Alexander Web Search Engines

Conclusion

Web search engines are an integral part of our life today
They have come a long way to reach today's performance and power
Their development is far from being finished

Main development trends are:
  Web search personalization
  Location-based search
  Vertical search

Page 39: Tolmachev Alexander Web Search Engines


Your questions, please

Page 40: Tolmachev Alexander Web Search Engines


Thank you for your time!