Information Retrieval Part 2 Sissi 11/17/2008


Information Retrieval Part 2

Sissi 11/17/2008


Information Retrieval (cont.)

Web-Based Document Search
  Page Rank
  Anchor Text

Document Matching

Inverted Lists


Page Rank

PR(A): the PageRank of page A.
C(T): the number of outgoing links from page T.
d: the minimum value assigned to any page.
T_j: a page pointing to A.

$$PR(A) = d + (1 - d) \sum_{j} \frac{PR(T_j)}{C(T_j)}$$


Algorithm of Page Rank

1. Use the PageRank equation to compute the PageRank of each page in the collection, using the latest PageRank values of the pages that link to it.

2. Repeat step 1 until no PageRank changes significantly.
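The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `pagerank`, the `tol` and `max_iter` parameters, and the three-page graph are assumptions made for the example. The update uses the deck's convention, in which d is the minimum value assigned to any page.

```python
def pagerank(links, d=0.1, tol=1e-6, max_iter=1000):
    """Iterate PR(A) = d + (1 - d) * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to; every page
    must appear as a key. d is the minimum value assigned to any page.
    """
    # C(T): number of outgoing links from each page.
    out_degree = {page: len(targets) for page, targets in links.items()}
    # For each page, the pages T_j that point to it.
    incoming = {page: [] for page in links}
    for page, targets in links.items():
        for target in targets:
            incoming[target].append(page)

    pr = {page: 1.0 for page in links}  # assumed initial value of 1 per page
    for _ in range(max_iter):
        biggest_change = 0.0
        for page in links:
            new_value = d + (1 - d) * sum(pr[t] / out_degree[t] for t in incoming[page])
            biggest_change = max(biggest_change, abs(new_value - pr[page]))
            pr[page] = new_value  # later pages in this pass see the latest value
        if biggest_change < tol:  # step 2: stop when nothing changes significantly
            break
    return pr


# Hypothetical three-page graph: A links to B and C; B and C link back to A.
example_links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
ranks = pagerank(example_links)
print({page: round(value, 2) for page, value in ranks.items()})
```

Because each page is updated in place, pages processed later in a pass already see the new values of pages processed earlier, which is one reading of "latest PageRanks" in step 1.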


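Example

Initial values:

PR(A)=PR(B)=PR(C)=1

d=0.1

In the first iteration (the equations imply that B and C each link only to A, and A links to both B and C):

PR(A)=0.1+0.9*(PR(B)+PR(C)) =0.1+0.9*(1+1) =1.9
PR(B)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.955
PR(C)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.955

After repeated iterations the values settle at approximately:

PR(A)=1.48, PR(B)=0.76, PR(C)=0.76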
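As a check on the example, the values the iteration settles at can also be found directly. This assumes, as the first-iteration equations suggest, that A links to B and C while B and C each link only to A. Setting each PageRank equal to its own update gives:

$$
\begin{aligned}
PR(B) &= PR(C) = 0.1 + 0.9 \cdot \frac{PR(A)}{2} = 0.1 + 0.45\,PR(A) \\
PR(A) &= 0.1 + 0.9\,\bigl(PR(B) + PR(C)\bigr) = 0.1 + 0.9\,\bigl(0.2 + 0.9\,PR(A)\bigr) = 0.28 + 0.81\,PR(A) \\
PR(A) &= \frac{0.28}{0.19} \approx 1.47, \qquad PR(B) = PR(C) \approx 0.1 + 0.45 \times 1.47 \approx 0.76
\end{aligned}
$$

These agree, up to rounding and the stopping point of the iteration, with the values quoted in the example (1.48, 0.76, 0.76).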


Anchor Text

The anchor text is the visible, clickable text in a hyperlink.

For example: <a href="http://www.wikipedia.org">Wikipedia</a>

The anchor text is “Wikipedia”: the web page displays that word instead of the URL http://www.wikipedia.org/, which keeps the text clean and easy to read.
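To make the distinction between the URL and the anchor text concrete, the sketch below pulls both out of the example link using Python's standard `html.parser` module. The class name `AnchorTextParser` is hypothetical, and the code ignores nested or malformed markup.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []        # finished (href, text) pairs
        self._href = None        # href of the <a> currently open, if any
        self._in_anchor = False
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only text between <a> and </a> counts as anchor text.
        if self._in_anchor:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text_parts).strip()))
            self._in_anchor = False


parser = AnchorTextParser()
parser.feed('<a href="http://www.wikipedia.org">Wikipedia</a>')
print(parser.anchors)  # [('http://www.wikipedia.org', 'Wikipedia')]
```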


Anchor Text

Anchor text usually gives the user relevant descriptive or contextual information about the content of the link’s destination.

The anchor text may or may not be related to the actual text of the URL of the link.

The words contained in the anchor text can influence the ranking that search engines give to the destination page.
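One way to picture this effect is a toy index that credits each anchor-text word to the page the link points to. The sketch below is an illustration only and does not describe how any real search engine scores pages; the function names, the scoring rule (one point per matching anchor word), and the example links are all assumptions.

```python
from collections import Counter, defaultdict


def build_anchor_index(links):
    """Map each anchor-text word to the pages that links with that word point to."""
    index = defaultdict(Counter)
    for anchor_text, target_url in links:
        for word in anchor_text.lower().split():
            index[word][target_url] += 1
    return index


def rank_by_anchor_text(index, query):
    """Order pages by how often the query words appear in anchors pointing to them."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    return [url for url, _ in scores.most_common()]


# Hypothetical links: (anchor text, destination URL).
links = [
    ("Wikipedia", "http://www.wikipedia.org"),
    ("free encyclopedia", "http://www.wikipedia.org"),
    ("click here", "http://www.wikipedia.org"),
    ("click here", "http://example.com/news"),
    ("click here", "http://example.com/shop"),
]
index = build_anchor_index(links)
print(rank_by_anchor_text(index, "free encyclopedia"))
print(rank_by_anchor_text(index, "click here"))
```

In this toy model a descriptive anchor singles out its destination, while a generic anchor such as "click here" is shared by many unrelated pages and tells the index nothing about any of them.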


Common Misunderstanding

Webmasters sometimes misunderstand anchor text.

Instead of turning appropriate words inside a sentence into a clickable link, they frequently insert extra text to carry the link.


Example

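1. Today our troops have liberated another country from tyranny. To know more, click here. (Only the words “click here” form the link.)

2. The more concise way of coding that would be: Today our troops have liberated another country from tyranny. (The link is placed on descriptive words within the sentence itself.)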


Anchor Text

This method of linking benefits not only users but also webmasters, because anchor text carries significant weight in search engine ranking.

Most search engine optimization experts recommend against using “click here” to designate a link.


Google Bomb

In September 2000, the first Google bomb was created by Hugedisk Men’s Magazine, a now-defunct online humor magazine.

It linked the text “dumbmotherfucker” to a site selling George W. Bush-related merchandise.

A Google search for this term would return the pro-Bush online store as its top result.

After a fair amount of publicity, the George W. Bush-related merchandise site retained lawyers and sent a cease-and-desist letter to Hugedisk, thereby ending the Google bomb.


Known Google Bombs

A search for “more evil than Satan” returned the home page of Microsoft.

A search for “miserable failure”, “worst president”, or “unelectable” returned the resume of George W. Bush on the White House website.

A search for “out of touch executives” or “out of touch management” returned the home page of Google.

Other commercial uses


Document Matching

The query is an arbitrarily long document, not just a few keywords.

The goal is still to rank and output an ordered list of relevant documents.

The most similar documents are found using the measures described earlier.
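A minimal sketch of document matching follows. The similarity measures referred to above are not part of this section, so the code assumes cosine similarity over raw word counts as a stand-in; the function names and the small help-desk collection are hypothetical.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_document(query_text, collection):
    """Return (doc_id, similarity) pairs, most similar first."""
    query_vec = term_counts(query_text)
    scored = [(doc_id, cosine_similarity(query_vec, term_counts(text)))
              for doc_id, text in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stored documents; the query is itself a full description.
collection = {
    "doc1": "printer will not print after driver update",
    "doc2": "laptop battery drains quickly",
    "doc3": "reinstall the printer driver to fix print errors",
}
query = "my printer stopped working after a driver update and will not print"
for doc_id, score in match_document(query, collection):
    print(doc_id, round(score, 3))
```

This version compares the query against every stored document, which is exactly the cost that inverted lists are meant to reduce.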


Generalization of searching

Matching a document against a collection of documents looks like a tedious and expensive operation.

Even for a short query, comparing it against every large document in the collection is a relatively computation-intensive task.


Example of document matching

Consider an online help desk, where a complete description of a problem is submitted.

That document could be matched against stored documents, ideally finding descriptions of similar problems and their solutions without requiring the user to experiment with numerous keyword searches.


Summary

1. Search engines and document matchers are not focused on classification of new documents.

2. Their primary goal is to retrieve the most relevant documents from a collection of stored documents.


Inverted Lists

What are inverted lists?

Instead of documents pointing to words, a list of words pointing to documents is the primary internal representation for processing queries and matching documents.
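The inversion can be sketched as follows. The function name `build_inverted_lists` and the three short documents are assumptions made for illustration; real systems also store word positions or weights, which are omitted here.

```python
from collections import defaultdict


def build_inverted_lists(documents):
    """Turn doc_id -> text into word -> sorted list of doc_ids containing it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            inverted[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}


# Hypothetical collection: documents pointing to words.
documents = {
    1: "web search uses page rank",
    2: "anchor text helps web search",
    3: "inverted lists speed up search",
}
# After inversion: words pointing to documents.
inverted_lists = build_inverted_lists(documents)
print(inverted_lists["web"])     # [1, 2]
print(inverted_lists["search"])  # [1, 2, 3]
print(inverted_lists["anchor"])  # [2]
```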


Inverted Lists


Example

If the query contains words 100 and 200:

1) First process W(100) to compute the similarity S(i) of each document i: S(1)=0+1, S(2)=0+1, …

2) Then process W(200) in the same way: S(2)=1+1, …
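The accumulation in this example can be sketched as below. The inverted lists are hypothetical, chosen only to agree with the partial values shown (word 100 occurs in documents 1 and 2, word 200 in document 2), and the scoring rule of adding 1 per matching word is the simplest reading of the example.

```python
from collections import defaultdict


def score_query(query_words, inverted_lists):
    """Add 1 to S(i) for every query word whose inverted list contains document i."""
    scores = defaultdict(int)  # S(i) starts at 0 for every document
    for word in query_words:
        # Only documents on this word's list are touched; the rest are never visited.
        for doc_id in inverted_lists.get(word, []):
            scores[doc_id] += 1
    # Highest similarity first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical inverted lists W(100) and W(200).
inverted_lists = {100: [1, 2], 200: [2]}
print(score_query([100, 200], inverted_lists))  # [(2, 2), (1, 1)]
```

The work done is proportional to the lengths of the lists for the query words, not to the size of the whole collection.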


Summary

1. The inverted list is the key to the efficiency of information retrieval systems.

2. The inverted list has helped make nearest-neighbor methods a practical option for prediction.


Conclusion

1. Information retrieval methods are specialized nearest-neighbor methods, which are well-known prediction methods.

2. IR methods typically process unlabeled data, then order and display the retrieved documents.

3. IR methods involve no training and induce no new rules for classification.


Thank You!