

Search Engine Algorithms

Vincent Ng [email protected]

Department of Computing

Hong Kong Polytechnic University

Now and the Future

What are search engines?

Search engines are huge databases of web page files that have been assembled automatically by machine.

A search engine is also a program that searches documents for specified keywords and returns a list of the documents in which the keywords were found.

Some History

WWWW

SearchEngineWatch

Yahoo

1994 110,000 pages

1997 2,000,000 pages

2000 2 billion pages

Types of Search Engines

• Individual – Individual search engines compile their own searchable databases of the web.

• Meta – Meta-searchers do not compile databases. Instead, they search the databases of multiple individual engines simultaneously.

Meta-search Engines

• Do not crawl the web compiling their own searchable databases.

• They search the databases of multiple individual search engines simultaneously, from a single site and using the same interface.

• Meta-searchers provide a quick way of finding out which engines are retrieving the best results for you in your search.
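The meta-search idea above can be sketched in a few lines. This is only an illustration: `engine_a` and `engine_b` are hypothetical stand-ins for real engines, the URLs are made up, and the reciprocal-rank merging rule is an assumption, not something the slides prescribe. A real meta-searcher would also send the queries concurrently rather than one after another.

```python
from collections import defaultdict

# Hypothetical stand-ins for individual engines: each returns a ranked URL list.
def engine_a(query):
    return ["a.example/1", "b.example/2", "c.example/3"]

def engine_b(query):
    return ["b.example/2", "d.example/4", "a.example/1"]

def meta_search(query, engines):
    """Send one query to every engine and merge the ranked lists."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for name, engine in engines.items():
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank      # assumed merge rule: reciprocal rank
            sources[url].append(name)      # remember which engine found the URL
    merged = sorted(scores, key=lambda u: (-scores[u], u))
    return [(url, sources[url]) for url in merged]

results = meta_search("search engines", {"A": engine_a, "B": engine_b})
for url, found_by in results:
    print(url, found_by)
```

Keeping the list of source engines per URL is what lets the user see which engines are retrieving the best results.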

Subject Directories

• Unlike search engines, subject directories are created and maintained by human editors, not by electronic spiders or robots.

• The editors review and select sites for inclusion in their directories on the basis of previously determined selection criteria.

• Directories tend to be smaller than search engine databases, typically indexing only the home page or top level pages of a site.

Search Logic

                                 AltaVista          Excite                    Google
Content                          250M pages         250M pages + media objs   1.25 billion sites
Default operator between words   OR                 OR                        AND
Boolean operators                AND, AND NOT, OR   AND, AND NOT              Limited to including and excluding words

Search Logic

                 AltaVista        Excite                                 Google
Case sensitive   Yes              No                                     No
Truncation       No, use *        No                                     Automatic
Special          Date, language   Concept searching by suggested terms   Search any language
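The practical effect of the default operator can be shown with a small sketch. The three documents and the `matches` / `search` helpers are hypothetical and do not reproduce any engine's actual matching rules; the point is only that an OR default returns far more pages than an AND default for the same query.

```python
def matches(doc_text, query, default_op="OR"):
    """Return True if the document satisfies the query under the default operator."""
    words = set(doc_text.lower().split())
    hits = [term in words for term in query.lower().split()]
    # AND: every query word must occur; OR: at least one must occur.
    return all(hits) if default_op == "AND" else any(hits)

# Hypothetical mini collection.
DOCS = {
    "d1": "search engine ranking",
    "d2": "engine oil change",
    "d3": "web search",
}

def search(query, default_op):
    return sorted(d for d, text in DOCS.items() if matches(text, query, default_op))

print(search("search engine", "OR"))   # every page containing either word
print(search("search engine", "AND"))  # only pages containing both words
```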

Developing a search engine

1. No database, real time search

2. Use a database (e.g. MSSQL, Oracle)

   1. Build indices of key words

   2. Simple matching

3. Use a database in a server or multiple servers (server farms)

   1. Develop search indices based on key words or meta-information

   2. Develop a search structure

Components: spider, indexer, algorithm
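"Build indices of key words" followed by "simple matching" usually means an inverted index. A minimal in-memory sketch, assuming whitespace tokenisation and a plain dictionary in place of a real database; the page names and texts are made up for illustration.

```python
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of pages that contain it (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, query):
    """Simple matching: return the pages that contain every query keyword."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages.
pages = {
    "/intro": "Search engines index web pages",
    "/rank": "Ranking web pages by links",
    "/misc": "Unrelated notes",
}
index = build_index(pages)
print(sorted(lookup(index, "web pages")))
```

The index is built once when pages change, so each query costs a few dictionary lookups instead of a scan over every page.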

Searching on the Web

[Diagram: client, query screen, search engine, indexer, DB, spider, web]
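The spider's role in this architecture is to follow links from a start page and hand every page it reaches to the indexer. A minimal sketch, assuming the "web" is a small in-memory dictionary of made-up page names rather than real HTTP fetches, and assuming a breadth-first visiting order.

```python
from collections import deque

# Hypothetical web: page name -> list of outgoing links.
WEB = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "archive", "missing"],   # "missing" is a dead link
    "archive": [],
    "orphan": ["home"],                       # no page links to it
}

def spider(web, start):
    """Breadth-first crawl: return pages in the order they are discovered."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            # Skip dead links and pages already queued.
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

print(spider(WEB, "home"))
```

Pages that nothing links to, such as `orphan` here, are never found, which is one reason engines also accept submitted URLs.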

Three Algorithms

• A document is represented by

– Occurrences of a keyword

– Hyperlink structures

• Different ranking algorithms

– Boolean spread activation

– Most-cited

– TFxIDF

Boolean Spread Activation

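• Based on the occurrence or absence of keywords in a document

– $R_{i,q} = \sum_{j=1}^{M} C_{i,j}$

• A better approach

– $R_{i,q} = \sum_{j=1}^{M} I_{i,j}$

– $P_i$ is a page, $C_{i,j}$ marks whether keyword $Q_j$ of the query occurs in $P_i$, and $I_{i,j}$ extends $C_{i,j}$ with a link factor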
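A small sketch of both scores. The slides do not spell out the link factor, so the rule used here is an assumption: a keyword that occurs in the page itself contributes `c1`, a keyword that occurs only in a linked page contributes a smaller `c2`, and anything else contributes 0. The weights, page names and link table are all hypothetical.

```python
def occurrence(pages, i, term):
    """C_{i,j}: 1 if the keyword occurs in page i, else 0."""
    return 1 if term in pages[i] else 0

def boolean_score(pages, i, query):
    """Plain Boolean score: number of query keywords found in the page."""
    return sum(occurrence(pages, i, t) for t in query)

def spread_score(pages, links, i, query, c1=2.0, c2=1.0):
    """Boolean score with an assumed link factor (c1 > c2 are illustrative)."""
    total = 0.0
    for term in query:
        if occurrence(pages, i, term):
            total += c1                       # keyword is in the page itself
        elif any(occurrence(pages, k, term) for k in links.get(i, ())):
            total += c2                       # keyword is only in a linked page
    return total

# Hypothetical pages (as keyword sets) and links between them.
pages = {"p1": {"search", "engine"}, "p2": {"ranking"}, "p3": {"cooking"}}
links = {"p1": {"p2"}, "p2": {"p1"}, "p3": set()}
query = ["search", "ranking"]

for p in pages:
    print(p, boolean_score(pages, p, query), spread_score(pages, links, p, query))
```

With the plain score, `p1` and `p2` each match one keyword; with the link factor, each also gets partial credit for the keyword found in its neighbour.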

Most-Cited

• Takes advantage of information about hyperlinks between web pages

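– $R_{i,q} = \sum_{k=1, k \neq i}^{N} \left( Li_{i,k} \sum_{j=1}^{M} C_{k,j} \right)$

– $Li_{i,k}$ is the link indicator between pages $P_i$ and $P_k$; it is 0 when there is no link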
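A sketch of the Most-cited score. It assumes the link indicator is 1 when page k links to page i (an incoming link, or citation) and 0 otherwise; the slide only shows the "no link" case, so the direction is an assumption. Page names and contents are made up.

```python
def most_cited_score(pages, inlinks, i, query):
    """Sum, over every other page k that links to i, of the query keywords found in k."""
    total = 0
    for k in pages:
        if k == i:
            continue                                    # k != i in the formula
        li = 1 if k in inlinks.get(i, set()) else 0     # assumed: incoming link k -> i
        total += li * sum(1 for term in query if term in pages[k])
    return total

# Hypothetical pages (as keyword sets) and incoming-link table.
pages = {
    "p1": {"search"},
    "p2": {"search", "engine"},
    "p3": {"engine"},
    "p4": {"cooking"},
}
inlinks = {"p1": {"p2", "p3", "p4"}, "p2": {"p1"}}
query = ["search", "engine"]

for p in pages:
    print(p, most_cited_score(pages, inlinks, p, query))
```

Note that a page's own text does not count: `p1` ranks highest because relevant pages cite it, and the citation from the unrelated `p4` adds nothing.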

TFxIDF

• Based on the vector space model

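– $R_{i,q} = \sum_{Q_j \in q} \left( 0.5 + 0.5 \, \frac{TF_{i,j}}{TF_{i,\max}} \right) IDF_j$, where $TF_{i,j}$ is the term frequency of $Q_j$ in $P_i$ and $TF_{i,\max}$ is the maximum term frequency of a keyword in $P_i$

– $R_{i,q} = R_{i,q} / \text{normalization factor}$

– $IDF_j = \log \left( N / \sum_{i=1}^{N} C_{i,j} \right)$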
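A sketch of the TFxIDF score. Two things here are assumptions: query terms that do not occur in a page are skipped instead of receiving the 0.5 floor, and the final normalization step is left out because the slides do not define the normalization factor. The documents are hypothetical token lists.

```python
import math
from collections import Counter

def idf(term, docs):
    """IDF_j = log(N / number of pages containing the term); 0 if no page has it."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_score(docs, i, query):
    """Unnormalized TFxIDF score of page i for the query."""
    counts = Counter(docs[i])
    if not counts:
        return 0.0
    max_tf = max(counts.values())          # highest term frequency in the page
    score = 0.0
    for term in query:
        if counts[term] == 0:
            continue                       # assumption: absent terms contribute nothing
        score += (0.5 + 0.5 * counts[term] / max_tf) * idf(term, docs)
    return score

# Hypothetical pages as token lists.
docs = {
    "p1": ["search", "engine", "search"],
    "p2": ["engine", "oil"],
    "p3": ["cooking", "recipes"],
}
query = ["search", "engine"]

for p in docs:
    print(p, round(tfidf_score(docs, p, query), 3))
```

The rarer keyword ("search", found in one page of three) carries more weight than the common one ("engine"), which is the purpose of the IDF factor.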

An Excellent Search Engine - Google

Result of Search

More about Google

• Much more accurate than most other search engines

• But run the same search on Yahoo (look for web pages) and surprise! You will get the same results

• Because Yahoo is powered by Google!

Google Internal

• Makes use of the link structure of the Web to calculate a quality ranking of each web page

– PageRank

• Utilizes links to improve search results

PageRank

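• It can be thought of as a model of user behaviour

– The probability that a random surfer visits a page

• $PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$

– $T_1, \ldots, T_n$ are the pages that link to $A$, $C(T)$ is the number of links going out of page $T$, and $d$ is a damping factor between 0 and 1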
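The PageRank formula is recursive, so in practice it is computed by repeated substitution until the values settle. A minimal sketch, assuming a three-page made-up graph, a damping factor of 0.85, a fixed number of iterations, and a starting value of 1.0 for every page; none of these choices come from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T that link to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}                           # assumed starting value
    out_count = {p: len(links.get(p, ())) for p in pages}  # C(T)
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            incoming = sum(
                pr[t] / out_count[t]
                for t in pages
                if a in links.get(t, ())
            )
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 2))
```

Page C ends up ranked highest because it collects links from both A and B, while B receives only half of A's vote.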

Other Search Engines

• Personalization / context based

– Individual, web-filtering

– www.searchorbit.com

• Multimedia search

– Image search

– www.altavista.com (not really)

– http://disney.ctr.columbia.edu/webseek/

Internal Search Engine

• When under 100 web pages

– One can do it real time

– www.comp.polyu.edu.hk/~cstyng/hci.99/labs/search.htm

• For a small web site

– Direct matching is sufficient

• Other web sites

– Indexers are needed
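For the small-site case, "real time" direct matching can be as simple as scanning every page on each query, with no index at all. A minimal sketch, assuming the pages are already loaded into a dictionary; the file names and texts are made up.

```python
def realtime_search(pages, keyword):
    """Scan every page on each query and return the names of pages containing the keyword."""
    needle = keyword.lower()
    return [name for name, text in pages.items() if needle in text.lower()]

# Hypothetical site with a handful of pages.
site = {
    "index.htm": "Welcome to the Department of Computing",
    "labs.htm": "Lab schedule and computing facilities",
    "staff.htm": "Staff directory",
}
print(realtime_search(site, "computing"))
```

The cost grows with the total size of the site on every query, which is why larger sites need an indexer.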

Finding a search engine: A Check List

1. What platforms do the search engine and spider run on? Is it portable?

2. What programming language is it written in? Is it internet/web enabled? Don't give me Fortran!

3. Can the vendor customize the system at a reasonable cost and turn-around time?

4. What about local technical support?

5. Is it designed to search the internet, intranet, and your local disks?

6. Can it handle different file formats, such as ASCII, HTML, WORD, PPT, etc.?

A Check List

1. Does it support BIG5, GB, and UNICODE? And do so efficiently? 

2. Is it designed and optimized for the Web? Don't give me a relational engine!

3. Is it an English search engine retrofitted with Chinese search?

4. Can you control what the spider indexes and how frequently it indexes?

5. Does it support full Boolean queries and relevance ranking?  

6. Can you search by dates or by categories?

A Check List

1. Can you search files that are on a specific host or of a certain file type?

2. Can you specify partial words (e.g., econom* and *port)?

3. Can you expand and translate a query?

4. What about speed? Don't forget to ask for insertion speed!  

5. What about scalability? Can it exploit multiple servers and CPUs?
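Partial-word queries such as `econom*` and `*port` can be handled by expanding the pattern against the index vocabulary before matching. A minimal sketch using Python's standard `fnmatch` module; the vocabulary is made up, and note that `fnmatch` also gives `?` and `[...]` special meaning, which a real engine may not want.

```python
from fnmatch import fnmatchcase

def expand(pattern, vocabulary):
    """Return every indexed word that matches a wildcard pattern."""
    return sorted(word for word in vocabulary if fnmatchcase(word, pattern))

# Hypothetical index vocabulary.
vocabulary = {
    "economy", "economics", "economist",
    "export", "airport", "portfolio", "report",
}
print(expand("econom*", vocabulary))  # right truncation
print(expand("*port", vocabulary))    # left truncation
```

Left truncation (`*port`) is usually the costlier case, since a sorted vocabulary helps only with prefixes; this is one reason to ask a vendor whether both forms are supported.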

A Check List

1. Is the spider/crawler fault tolerant? Can it endure link or host failures?

2. Can it be optimized according to user behaviors?

3. Can it search across secured servers?

Some search engines in HK

1. Chinese.yahoo.com

2. www.goyoyo.com.hk

3. www.hksrch.com.hk

4. www.hkonly.com

5. www.gowhere.com.hk

Your Input