Google & Document Retrieval
Qing Li, School of Computing and Informatics, Arizona State University


Page 1: Google & Document Retrieval

Qing Li
School of Computing and Informatics, Arizona State University

Page 2: Outline

Simple introduction to Google

Architecture of a Web search engine

Key techniques of search engines
• Indexing
• Matching & ranking

Open-source code for search engines

Page 3: Google Search Engine

“Google”
• From “googol”: the number 1 followed by 100 zeros
• Reflects the company's mission: to organize the immense amount of information available on the web

Information types
• Text
• Image
• Video

Page 4: Google Services

Page 5: Google Web Searching

Page 6: Life of a Google Query

Page 7: Web Search System

[Diagram: the Web is crawled to gather data; an indexing process builds the index (e.g. term K1 → d1, d2; term K2 → d1, d3); the user's information need becomes a query, and the search engine matches the query against the index to return results to the user.]
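
To make the flow in the diagram concrete, here is a minimal Python sketch of the crawl → index → search pipeline; the function names and the toy in-memory data structures are my own illustration, not Google's implementation.

# Minimal crawl -> index -> search pipeline (illustrative only).
from collections import defaultdict

def crawl(seed_pages):
    # Stand-in for a real crawler: just returns the given pages as (doc_id, text).
    return dict(enumerate(seed_pages, start=1))

def build_index(docs):
    # Map each term to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # Return documents containing every query term (Boolean AND matching).
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = crawl(["Arizona State University", "Google web search", "search engine index"])
index = build_index(docs)
print(search(index, "search"))   # e.g. {2, 3}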

Page 8: Conventional Overview of Text Retrieval

[Diagram: two sides meet at the search engine. Text processing: text analysis turns raw text into an index. User/system interaction: analysis of information needs turns the user's need into a query. The search engine matches and ranks index entries against the query, drawing on knowledge resources & tools, and produces the retrieval result.]

Page 9: Text Processing (1) - Indexing

An index is a list of terms with relevant information
• Frequency of terms
• Location of terms
• Etc.

Index terms should represent document content and separate documents
• e.g. consider “economy” vs. “computer” as index terms for a news article in the Financial Times

To build the index
• Extract the index terms
• Compute their weights

Page 10: Text Processing (2) - Extraction

Extraction of index terms
• Word or phrase level
• Morphological analysis (stemming in English)
  • “information”, “informed”, “informs”, “informative” → inform
• Removal of common words listed in a “stop list”
  • “a”, “an”, “the”, “is”, “are”, “am”, …
• n-grams
  • “정보검색시스템” (Korean for “information retrieval system”) → “_정”, “정보”, “보검”, “검색”, … (bi-grams)
  • Surprisingly effective in some languages
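
A minimal Python sketch of the extraction steps above (lowercasing, stop-word removal, a crude terminal-"s" stemmer, and character bi-grams); the stop list and stemming rule are deliberately simplified illustrations, not a production analyzer.

stop_words = {"a", "an", "the", "is", "are", "am"}

def extract_terms(text):
    # Tokenize, lowercase, drop stop words, and strip a terminal "s" as a crude stemmer.
    terms = []
    for token in text.lower().split():
        if token in stop_words:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms

def char_bigrams(word):
    # Character bi-grams with a leading boundary marker, as in the Korean example.
    padded = "_" + word
    return [padded[i:i+2] for i in range(len(padded) - 1)]

print(extract_terms("The search engines are indexing documents"))
# ['search', 'engine', 'indexing', 'document']
print(char_bigrams("정보검색시스템")[:4])
# ['_정', '정보', '보검', '검색']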

Page 11: An Example

Building the indexing vocabulary for a collection of 1,033 abstracts in biomedicine:
• Identify all unique words in the collection: 13,471 terms
• Delete 170 common function words included in the stop list: 13,301 terms left
• Delete all terms with collection frequency equal to 1 (terms occurring in one document with frequency 1): 7,236 terms left
• Remove terminal “s” endings and combine identical word forms: 6,056 terms left
• Delete 30 very high-frequency terms occurring in over 24% of the documents: 6,026 terms left (the final indexing vocabulary)

Page 12: Text Processing (3) – Term Weight

Calculation of term weights
• Statistical weights using frequency information
• Reflect the importance of a term in a document
• E.g. TF*IDF
  • TF: total frequency of a term in a document
  • IDF: inverse document frequency
  • DF: in how many documents does the term appear?
• High TF, low DF → good word to represent the text
• High TF, high DF → bad word

Page 13: An Example

TF for “Arizona”
• In Document 1 it is 1
• In Document 2 it is 2

DF for “Arizona”
• In this collection (Document 1 & Document 2) it is 2, so IDF = 1/2

TW = TF * IDF

Normalization of TF is critical to retrieval effectiveness
• Prevents a bias towards longer documents
• TF = 0.5 + 0.5 * (TF / Max TF)

TW = TF * log2(N / DF + 1)

[Diagram: term weights computed for the two example documents, Document 1 and Document 2. Side note: log(10^-34) = -34.]
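
A small Python sketch of the weighting scheme above, using the normalized TF and the TW = TF * log2(N / DF + 1) formula from this slide; the two toy documents are stand-ins for Document 1 and Document 2.

import math
from collections import Counter

docs = {
    "Doc1": "Arizona university research",
    "Doc2": "Arizona Arizona state university",
}

def term_weights(docs):
    N = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    # DF: number of documents in which each term appears.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    weights = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        max_tf = max(tf.values())
        weights[d] = {
            t: (0.5 + 0.5 * tf[t] / max_tf) * math.log2(N / df[t] + 1)
            for t in tf
        }
    return weights

print(term_weights(docs)["Doc2"]["arizona"])   # 1.0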

Page 14: Text Processing (4) - Storing indexing results

From raw text to index:

[Diagram: an index listing each word with its per-document information, e.g. “Arizona” → (Document 1, frequency 1), (Document 2, frequency 2), with a corresponding entry for “University”.]

Page 15: Text Processing (5) - Storing indexing results

Inverted file
• A directory of terms (e.g. “search”, “Google”, “ASU”, …, “tiger”), each with the number of postings and a pointer into a posting file
• The posting file lists, for each term, the documents in which it occurs (e.g. Doc #1, Doc #2, Doc #5, …)
• A query term is looked up in the directory and its postings identify the matching documents
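
A minimal Python sketch of an inverted file: a directory mapping each term to its posting list of (document id, term frequency) pairs. The in-memory layout is a simplified stand-in for the directory/posting-file structure sketched above.

from collections import defaultdict, Counter

def build_inverted_index(docs):
    # directory: term -> list of (doc_id, term frequency) postings
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    return index

docs = {
    1: "Google search engine",
    2: "ASU search",
    5: "ASU tiger",
}
index = build_inverted_index(docs)
print(index["asu"])     # [(2, 1), (5, 1)]
print(index["search"])  # [(1, 1), (2, 1)]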

Page 16: Matching & Ranking

Ranking
• Retrieval model
  • Boolean (exact)
  • Vector space
  • Probabilistic
  • Inference net
  • Language model …

Weighting schemes
• Index terms, query terms
• Parameters in formulas

Page 17: Vector Space Model

Treat each document and query as a vector.

(DOC 1) ... dog ........ dog ....
• Doc 1 contains “dog” twice → Doc 1 = <2> on the “dog” axis

(DOC 2) ... cat ........ cat ...................... dog .............. dog ....................
• Doc 2 contains “dog” twice and “cat” twice → Doc 2 = <2, 2> in the (dog, cat) space

[Diagram: the two document vectors plotted on “dog” and “cat” axes.]

Page 18: Vector Space Model

(DOC) ... cat ........ cat ...................... dog .............. dog ....................
• Doc = <2, 2>

Query 1: “dog” → Query 1 = <1, 0>
Query 2: “cat, dog” → Query 2 = <1, 1>

COS(Q1, Doc) < COS(Q2, Doc)

If we use angles as a similarity measure, then Q2 is more similar to Doc than Q1.

[Diagram: Doc, Query 1, and Query 2 plotted on “dog” and “cat” axes.]

Page 19: Vector Space Model

Given two vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn):

Dot product: x · y = x1*y1 + x2*y2 + … + xn*yn

Cosine similarity: cos(x, y) = (x · y) / (|x| * |y|)

Page 20: Vector Space Model

<DOC 1> ... cat ........ dog ...................... dog ................ mouse ..... dog ........ mouse ........................

Q = <cat, mouse>

Over the terms (cat, mouse, dog):
D1 = (1, 2, 3)
Q = (1, 1, 0)

Similarity = cos(D1, Q) = (1*1 + 2*1 + 3*0) / (|D1| * |Q|) = 3 / (sqrt(14) * sqrt(2)) ≈ 0.57

Here the term weight is determined only by the term frequency.

[Diagram: D1 and Q plotted on “cat”, “mouse”, and “dog” axes.]
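
A short Python check of the example above, computing the cosine similarity between D1 = (1, 2, 3) and Q = (1, 1, 0); the function names are my own.

import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (|x| * |y|)
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

d1 = (1, 2, 3)   # (cat, mouse, dog) term frequencies of DOC 1
q = (1, 1, 0)    # query "cat, mouse"
print(cosine_similarity(d1, q))   # ≈ 0.567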

Page 21: Matching & Ranking

Techniques for efficiency
• New storage structures, esp. for new document types
• Use of accumulators for efficient generation of ranked output
• Compression/decompression of indexes

Techniques for Web search engines
• Use of hyperlinks
  • PageRank: inlinks & outlinks
  • HITS: authority vs. hub pages
• In conjunction with directory services (e.g. Yahoo)
• ...

Page 22: PageRank

Basic idea: more links to a page implies a better page
• But all links are not created equal
• Links from a more important page should count more than links from a weaker page

Basic PageRank PR(A) for page A:

PR(A) = Σ over pages x pointing to A of [ PR(x) / outDegree(x) ]

• outDegree(x) = number of edges leaving page x, i.e. the number of hyperlinks on page x
• Page x distributes its rank boost over all the pages it points to

Example (A links to B and C, B links to C, C links to A):
PR(A) = PR(C) / 1
PR(B) = PR(A) / 2
PR(C) = PR(A) / 2 + PR(B) / 1

Page 23: PageRank

The PageRank definition is recursive
• The rank of a page depends on and influences the ranks of other pages
• Eventually, the ranks converge

To compute PageRank:
• Choose an arbitrary initial R_old and use it to compute R_new
• Repeat, setting R_old to R_new, until R converges (the difference between the old and new R is sufficiently small)
• Rank values typically converge in 50-100 iterations
• Rank orders converge even faster
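
A minimal Python sketch of the iterative computation just described, applied to the three-page example from the previous slide (A → B, A → C, B → C, C → A); the graph representation and the convergence tolerance are illustrative choices.

# links[page] = list of pages that this page links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def basic_pagerank(links, tol=1e-10):
    pages = list(links)
    n = len(pages)
    r_old = {p: 1.0 / n for p in pages}           # arbitrary initial ranks
    while True:
        r_new = {p: 0.0 for p in pages}
        for x, outlinks in links.items():
            share = r_old[x] / len(outlinks)      # PR(x) / outDegree(x)
            for p in outlinks:
                r_new[p] += share
        if max(abs(r_new[p] - r_old[p]) for p in pages) < tol:
            return r_new
        r_old = r_new

print(basic_pagerank(links))   # roughly {'A': 0.4, 'B': 0.2, 'C': 0.4}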

Page 24: Problems with Basic PageRank

The Web is not a strongly connected graph
• Rank sink: a single page (node) with no outward links
• Nodes not part of the sink end up with a rank of 0

Page 25: Extended PageRank

Remove all nodes without outlinks
• No rank for these pages

Add a decay factor d:

PR(A) = (1 - d) / n + d * Σ over pages x pointing to A of [ PR(x) / outDegree(x) ]

• n is the number of nodes/pages
• d is a constant, typically between 0.8 and 0.9
• d represents the fraction of a page's rank that is distributed among the pages it links to; the rest of its rank is distributed among all pages

In the random surfer model, the decay factor corresponds to a user getting bored (or unhappy) with the links on a given page and jumping to a random page (not necessarily one linked to).

Page 26: Example

PR(A) = (1 - d) / n + d * Σ over pages x pointing to A of [ PR(x) / outDegree(x) ]

Set d = 0.5 and ignore n; a small graph like this can be solved directly:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Solving gives:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615
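
As a check, here is a small Python sketch that solves this 3×3 linear system directly (using numpy as an illustrative choice of tool); it reproduces the values above.

import numpy as np

# Rewrite the three equations as M * [PR(A), PR(B), PR(C)] = b
#  PR(A)                 - 0.5*PR(C) = 0.5
# -0.25*PR(A) + PR(B)                = 0.5
# -0.25*PR(A) - 0.5*PR(B) +   PR(C)  = 0.5
M = np.array([
    [1.0,    0.0,  -0.5],
    [-0.25,  1.0,   0.0],
    [-0.25, -0.5,   1.0],
])
b = np.array([0.5, 0.5, 0.5])
print(np.linalg.solve(M, b))   # approximately [1.07692308 0.76923077 1.15384615]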

Page 27: Example

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Set the initial values of PR(A), PR(B), PR(C) to 1 and update them in place, always using the most recently computed values.

After the first iteration:
• PR(A) = 0.5 + 0.5 * 1 = 1
• PR(B) = 0.5 + 0.5 * (1 / 2) = 0.75
• PR(C) = 0.5 + 0.5 * (1 / 2 + 0.75) = 1.125

After the second iteration:
• PR(A) = 0.5 + 0.5 * 1.125 = 1.0625
• PR(B) = 0.5 + 0.5 * (1.0625 / 2) = 0.765625
• PR(C) = 0.5 + 0.5 * (1.0625 / 2 + 0.765625) = 1.1484375

Page 28: Example

For a large number of pages, use the iteration method:

Iteration   PR(A)        PR(B)        PR(C)
0           1            1            1
1           1            0.75         1.125
2           1.0625       0.765625     1.1484375
3           1.07421875   0.76855469   1.15283203
4           1.07641602   0.76910400   1.15365601
5           1.07682800   0.76920700   1.15381050
6           1.07690525   0.76922631   1.15383947
7           1.07691973   0.76922993   1.15384490
8           1.07692245   0.76923061   1.15384592
9           1.07692296   0.76923074   1.15384611
10          1.07692305   0.76923076   1.15384615
11          1.07692307   0.76923077   1.15384615
12          1.07692308   0.76923077   1.15384615
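
A small Python sketch that reproduces the iterations above (d = 0.5, n ignored, updates applied in place in the order A, B, C); the loop structure is an illustrative choice.

d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}   # initial values (iteration 0)

for i in range(1, 13):
    # Update in place, always using the most recently computed values.
    pr["A"] = (1 - d) + d * pr["C"]
    pr["B"] = (1 - d) + d * (pr["A"] / 2)
    pr["C"] = (1 - d) + d * (pr["A"] / 2 + pr["B"])
    print(i, round(pr["A"], 8), round(pr["B"], 8), round(pr["C"], 8))

# The final line printed is: 12 1.07692308 0.76923077 1.15384615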

Page 29: Problems with PageRank

Shows a bias against new Web pages
• Can be addressed with a boost factor

No balance between relevancy and popularity
• Very popular pages (such as search engines and web portals) may be ranked artificially high due to their popularity, even if they are not very related to the query

Despite these problems, PageRank seems to work fairly well in practice.

Page 30: Open-Source Search Engine Code

Lucene Search Engine
• http://lucene.apache.org/

SWISH
• http://swish-e.org/

Glimpse
• http://webglimpse.net/

and more

Page 31: Reference

L. Page & S. Brin. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies, Working Paper 1999-0120, 1998.

S. Levy. All Eyes on Google. Newsweek, April 12, 2004.

E. Brown, J. Callan, & B. Croft. Fast Incremental Indexing for Full-Text Information Retrieval. Proceedings of the 20th International Conference on Very Large Databases (VLDB), 1994.

S. Brin & L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the Seventh International World Wide Web Conference (WWW 98), 1998.