How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011

Page 1: How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011

How PageRank Works

Ketan Mayer-Patel
University of North Carolina

January 31, 2011

Page 2

Me vs. Jeff

Me:

• High school
  – Public school in Texas
• College
  – The University of California, Berkeley
• Faculty member at...
  – UNC

Jeff:

• High school
  – Hoity-toity, private all-boys school in Jersey
• College
  – Stanford
• Faculty member at...
  – Duke

Page 3

The World Wide Web

• A Simple Request/Response System

Request for web page.

Web page returned.

Page 4

Making The Request

• How do you make a web request?
  – Use a browser.
    • Specify what you want directly.
    • Follow a link.
  – Turns out we very rarely specify documents directly.
  – Uniform Resource Locator (URL)
    • http://server-name.com/path/to/a/page
  – Two key characteristics of hyperlinks:
    • Directional
    • Unilateral
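As a rough illustration of the URL bullet, the sketch below splits the slide's example address into its parts using Python's standard `urllib.parse.urlsplit`. The variable names are illustrative only.

```python
from urllib.parse import urlsplit

# The example URL from the slide.
parts = urlsplit("http://server-name.com/path/to/a/page")

scheme = parts.scheme   # how to talk to the server
server = parts.netloc   # which server to ask
path = parts.path       # which document on that server

print(scheme, server, path)
```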

Page 5

Web Search In Three Easy Steps

• What’s step one?
  – Cut a hole in the box.

Page 6

Web Search In Three Easy Steps

• First, crawl.
  – Try to find all of the web pages.
    • Follow the links.
• Second, index.
  – Organize what you find.
    • Lots of secret sauce here.
• Third, query.
  – Usually, text query words.
  – Retrieves a list of related pages.
    • Usually because they contain the query text.
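The three steps can be sketched on a toy example. Everything here is hypothetical: the four-page `WEB` dictionary, its text, and the function names are made up for illustration, and a real index involves far more than a word-to-pages map.

```python
from collections import deque

# Hypothetical toy "web": page name -> (page text, outgoing links).
WEB = {
    "a": ("tar heels basketball", ["b", "c"]),
    "b": ("duke basketball", ["c"]),
    "c": ("page rank explained", ["a"]),
    "d": ("unlinked page about basketball", ["a"]),
}

def crawl(start):
    """Step 1: follow links breadth-first, returning pages in the order found."""
    seen, queue = [start], deque([start])
    while queue:
        page = queue.popleft()
        for target in WEB[page][1]:
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen

def build_index(pages):
    """Step 2: inverted index mapping each word to the pages containing it."""
    index = {}
    for page in pages:
        for word in WEB[page][0].split():
            index.setdefault(word, set()).add(page)
    return index

def query(index, word):
    """Step 3: return the pages that contain the query word."""
    return sorted(index.get(word, set()))

pages = crawl("a")
index = build_index(pages)
print(pages, query(index, "basketball"))
```

Page "d" contains the query word but never shows up, because no page links to it and the crawler only follows links.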

Page 7

Which to list first?

• Possible clues:
  – Number of times the query term appears
  – Where it appears
    • Title, body text, URL, metadata, etc.
  – How it appears
    • Style of text
    • Role of text
  – Position in the document graph
    • This is what distinguished Google from other search engines at the time.

Page 8

PageRank

• Supposedly named after Larry Page
• Part of his research in grad school
  – Patented while in grad school.
  – Licensed to Google for ~1 million shares of Google.
    • Sold for about $300M

Page 9

Document Graph

Page 10

Probability Distribution of a Random Walk

• Start walking the graph.
• After some reasonably long amount of time, stop.
• What’s the chance that you are on a particular page?
  – Larger chance => more important page
  – Is this actually true?
    • Maybe, maybe not

Page 11

Random Walk Example

Page 12

Random Walk Example

Page 13

Random Walk Example

Page 14

Random Walk Example

Page 15

Random Walk Example

Page 16

Random Walk Example

Page 17

Random Walk Example

Page 18

Trapdoors and Dead Ends

Shangri-La: Can’t ever get here.

Hotel California: Can’t ever leave.

Page 19

Spider Traps

Page 20

Fixing Our Random Walk

• What can we do to fix it?
  – Add a bit more randomness.
    • At each step, with probability α jump to any random page.
    • Otherwise, randomly follow a link.
  – Provides a way in to / out of trapdoors / dead ends and spider traps.
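A minimal simulation of this fixed-up walk, under stated assumptions: the four-page `LINKS` graph is hypothetical, and the rule that a dead end always forces a jump is an assumption added here, since the slide does not say what to do on a page with no links.

```python
import random

# Hypothetical graph: page -> pages it links to.
# Nothing links to "a" (can't ever get here); "d" has no links out (dead end).
LINKS = {"a": ["b"], "b": ["c", "d"], "c": ["b"], "d": []}

def random_walk(links, steps, alpha, seed=0):
    """Return the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        # With probability alpha jump to any random page.
        # Assumption: a dead end always jumps, since there is no link to follow.
        if not out or rng.random() < alpha:
            current = rng.choice(pages)
        else:
            current = rng.choice(out)
    return {page: visits[page] / steps for page in pages}

freq = random_walk(LINKS, steps=100_000, alpha=0.15)
print(freq)
```

With the jump in place, every page gets visited, including the one nothing links to, and the walk no longer gets stuck on the dead end.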

Page 21

Random Walk Scalability

• Problem: Would need to simulate the random walk over and over again to even come close to discovering the underlying probability distribution.
  – Easy to do for small graphs.
  – Pain in the ass for large ones.
• Markov Chain
  – Tool for analyzing stochastic processes.
  – Power method

Page 22

Power Method Equation

• N : Number of documents
• R_k : Page rank of document k
• L_k : Number of outgoing links in k
• δ(k,j) : Delta function for links between k and j
  – δ(k,j) = 1 if and only if there exists a link from document k to document j

$$R_j = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k}{L_k}$$

Page 23

Power Method Equation

• Our definition is circular.
  – To calculate page rank of a page we need to already know the page rank of other pages.
• Iterative solution.
  – Start with an initial assignment.
    • Basically set the page rank of every page to 1/N.
    • Why 1/N?
  – Calculate an updated value for every page using the current values.
  – Keep repeating until the values are stable.

Page 24

Power Method Equation

• Intuition:
  – Page rank of a document is the sum of its fair share of the page ranks of the pages that link to the document.

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$
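The iterative update can be sketched in a few lines. This is an illustration, not the presenter's code: the three-page `LINKS` graph, the function names, and the stopping tolerance are all assumptions. The graph is chosen to have no dead ends, so this undamped version behaves.

```python
def power_step(links, ranks):
    """One update: each page k hands R_k / L_k to every page it links to."""
    new = {page: 0.0 for page in links}
    for k, targets in links.items():
        for j in targets:
            new[j] += ranks[k] / len(targets)
    return new

def iterate(links, max_iters=100, tol=1e-9):
    """Start every page at 1/N and repeat the update until values are stable."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(max_iters):
        new = power_step(links, ranks)
        if max(abs(new[p] - ranks[p]) for p in links) < tol:
            return new
        ranks = new
    return ranks

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

first = power_step(LINKS, {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3})
ranks = iterate(LINKS)
print(first, ranks)
```

After one step, "c" already holds half the total because it receives all of "b" and half of "a". The values settle near a = 0.4, b = 0.2, c = 0.4.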

Page 25

Example, i = 0

Node values (graph figure not preserved): 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$

Page 26

Example, i = 1

Node values (graph figure not preserved): 0, 0.1, 0.1, 0.125, 0.2, 0.05, 0.1, 0.075, 0.025, 0.125

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$

Page 27

Example, i = 10

Node values (graph figure not preserved): 0, 0.154, 0.134, 0.015, 0.071, 0.036, 0.072, 0.051, 0.015, 0.189

$$R_j^{i+1} = \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$

Something is wrong!
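One quick check is to add the values up, assuming the ten numbers on each example slide are the complete set of page ranks for that iteration. The lists below are copied from the i = 0, i = 1, and i = 10 slides; the variable names are illustrative.

```python
# Page-rank values as printed on the example slides, keyed by iteration.
values = {
    0: [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    1: [0, 0.1, 0.1, 0.125, 0.2, 0.05, 0.1, 0.075, 0.025, 0.125],
    10: [0, 0.154, 0.134, 0.015, 0.071, 0.036, 0.072, 0.051, 0.015, 0.189],
}

# Total page rank in the graph at each iteration, rounded to 3 places.
totals = {i: round(sum(v), 3) for i, v in values.items()}
print(totals)  # {0: 1.0, 1: 0.9, 10: 0.737}
```

The total starts at 1.0 and keeps shrinking, so page rank is going missing somewhere in the graph.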

Page 28

Power Method v2

• Dead ends leak.
• Spider traps slowly collect everything.
• Translating our random walk solution:
  – Add a “virtual” link from every document to every other document.
  – Define a weighting factor α between 0.0 and 1.0
    • Distribute α proportion of your page rank over the virtual links
    • Distribute (1 − α) proportion of your page rank over the real links

$$R_j^{i+1} = \sum_{k=1}^{N} \frac{\alpha R_k^i}{N} + \sum_{k=1}^{N} \delta(k,j)\,\frac{(1-\alpha) R_k^i}{L_k}$$
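A hedged sketch of the v2 update, written to mirror the two sums term by term. The graph, function names, and fixed iteration count are assumptions for illustration; the example graph has a spider trap but no dead ends.

```python
def power_step_v2(links, ranks, alpha=0.15):
    """One v2 update: alpha over virtual links, (1 - alpha) over real links."""
    n = len(links)
    # Virtual links: page j receives alpha * R_k / N from every page k.
    virtual_share = alpha * sum(ranks.values()) / n
    new = {page: virtual_share for page in links}
    # Real links: page k spreads (1 - alpha) * R_k over its L_k outgoing links.
    for k, targets in links.items():
        for j in targets:
            new[j] += (1 - alpha) * ranks[k] / len(targets)
    return new

def pagerank_v2(links, alpha=0.15, iters=50):
    """Start at 1/N and apply the v2 update a fixed number of times."""
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iters):
        ranks = power_step_v2(links, ranks, alpha)
    return ranks

# Hypothetical graph: nothing links to "a"; "c" and "d" form a spider trap.
LINKS = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["c"]}
ranks = pagerank_v2(LINKS)
print(ranks)
```

On this graph the trap pages still rank highest, but they no longer collect everything: "a" keeps α/N = 0.0375 from the virtual links alone, and the total stays at 1.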

Page 29

Power Method v2

• Dead ends leak.
• Spider traps slowly collect everything.
• Translating our random walk solution:
  – Add a “virtual” link from every document to every other document.
  – Define a weighting factor α between 0.0 and 1.0
    • Distribute α proportion of your page rank over the virtual links
    • Distribute (1 − α) proportion of your page rank over the real links

$$R_j^{i+1} = \frac{\alpha}{N} + (1-\alpha) \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$

Page 30

Convergence

• Typical value for α is 0.15.
• Convergence typically occurs in about 50 iterations even for large graphs.

Page 31

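Example, i = 10

Node values (graph figure not preserved): 0.011, 0.107, 0.112, 0.034, 0.105, 0.061, 0.073, 0.074, 0.024, 0.115

$$R_j^{i+1} = \frac{\alpha}{N} + (1-\alpha) \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$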

Page 32

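Example, i = 10

Node values with Power Method v2 (graph figure not preserved): 0.011, 0.107, 0.112, 0.034, 0.105, 0.061, 0.073, 0.074, 0.024, 0.115

Node values with the original power method, for comparison: 0, 0.154, 0.134, 0.015, 0.071, 0.036, 0.072, 0.051, 0.015, 0.189

$$R_j^{i+1} = \frac{\alpha}{N} + (1-\alpha) \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$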

Page 33

Billions and billions

• How do you do this with billions of documents?
  – Can be implemented using matrix math.
  – Special techniques for sparse matrices.
  – PageRank roughly equivalent to first eigenvector.
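A sketch of the matrix form, assuming the simplified update with the α/N term and a graph with no dead ends. The names M, G, the rank vector R, and the all-ones vector 1 are notation introduced here, not taken from the slides.

$$M_{jk} = \frac{\delta(k,j)}{L_k}, \qquad \mathbf{R}^{i+1} = \frac{\alpha}{N}\,\mathbf{1} + (1-\alpha)\, M\, \mathbf{R}^{i}$$

$$G = \frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{T} + (1-\alpha)\, M, \qquad G\,\mathbf{R} = \mathbf{R} \quad \text{when } \mathbf{1}^{T}\mathbf{R} = 1$$

Under these assumptions a stable rank vector is an eigenvector of G with eigenvalue 1, which is the sense in which PageRank matches the first eigenvector. M has only one nonzero entry per real link, which is why sparse-matrix techniques apply.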

Page 34

Gaming The System

• Google Bomb!
  – Create a lot of links to the page that you want to be highly ranked.
• Create your own spider trap.
  – Relatively easy to combat by discounting links that come from the same domain.
• Comment spam.
• Porn trap.
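A crude sketch of the same-domain discount, assuming the simplest possible rule: drop any link whose source and target share a host. The URLs and the function name are hypothetical, and a real system would presumably down-weight such links rather than discard them outright.

```python
from urllib.parse import urlsplit

def cross_domain_links(links):
    """Keep only (source, target) URL pairs whose hosts differ."""
    return [
        (src, dst)
        for src, dst in links
        if urlsplit(src).hostname != urlsplit(dst).hostname
    ]

# Hypothetical link list: two pages on one site linking to each other,
# plus one link coming in from a different site.
links = [
    ("http://spam.example/a", "http://spam.example/b"),
    ("http://spam.example/b", "http://spam.example/a"),
    ("http://blog.example/post", "http://spam.example/a"),
]
kept = cross_domain_links(links)
print(kept)
```

Only the link from the other site survives, so a home-made spider trap built inside one domain contributes nothing.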

Page 35

Last Notes

• Stanford Sucks!
• GO HEELS!

Page 36

Bad Math

• When originally presented, the final version of the power method equation was shown as:

$$R_j^{i+1} = \frac{\alpha}{N} + (1-\alpha) \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$

• The simplification for the first term is wrong and should have been:

$$R_j^{i+1} = \alpha + (1-\alpha) \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^i}{L_k}$$
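For reference, a hedged sketch of where the first term comes from, starting from the unsimplified Power Method v2 sum over the virtual links. The symbol S, the total page rank at iteration i, is introduced here for illustration and is not in the slides.

$$\sum_{k=1}^{N} \frac{\alpha R_k^i}{N} = \frac{\alpha}{N} \sum_{k=1}^{N} R_k^i = \frac{\alpha S}{N}, \qquad S = \sum_{k=1}^{N} R_k^i$$

The constant therefore depends on how the ranks are normalized: it is α/N when the ranks sum to 1, and α when they sum to N.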