cse535 link analysis

Upload: debika

Post on 18-Feb-2018

Page 1: Cse535 Link Analysis

7/23/2019 Cse535 Link Analysis

http://slidepdf.com/reader/full/cse535-link-analysis 1/19

CS315 – Link Analysis

Three generations of Search Engines

Anchor text

Link analysis for ranking: Pagerank, HITS

Page 2: Cse535 Link Analysis

1st Generation: Content Similarity

Content Similarity Ranking: The more rare words two documents share, the more similar they are.

Documents are treated as “bags of words” (no effort to “understand” the contents).

Similarity is measured by vector angles.

Query results are ranked by sorting the angles between query and documents.

[Figure: document vectors such as d1 in a term space with axes t1, t2, t3; θ is the angle between two vectors]
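A minimal sketch of ranking by vector angle, assuming raw word counts as vector components. The slide's emphasis on rare words suggests a rarity weighting such as idf, which is left out here for brevity. The function names `cosine_similarity` and `rank` are illustrative, not from the slides.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, docs: list[str]) -> list[str]:
    """Sort documents by decreasing cosine, i.e. increasing angle to the query."""
    return sorted(docs, key=lambda d: cosine_similarity(query, d), reverse=True)
```

A smaller angle means a larger cosine, so sorting by decreasing cosine is the same as sorting by increasing angle.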

Page 3: Cse535 Link Analysis

But we also have links (los links!)

Assumption 1: A hyperlink from a page denotes a vote of confidence to the second page (quality signal).

Assumption 2: The anchor text of the hyperlink describes the target page (textual context).

[Figure: Page A links to Page B through a hyperlink carrying anchor text]

Page 4: Cse535 Link Analysis

2nd Generation: Add Popularity

A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B.

Score of a page = number of in-links

Query Processing: First retrieve all pages meeting the text query (say, venture capital).

Order these by the link popularity (of the page or the site).

[Figure: example link graph with each site's score: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]
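The popularity score can be sketched as a simple in-link count. The edge list format and the names `inlink_scores` and `rank_by_popularity` are assumptions made for illustration. Duplicate links between the same pair are counted once here, which is one possible reading of "vote".

```python
from collections import Counter


def inlink_scores(links):
    """links: iterable of (source, target) pairs. Score = number of distinct in-linking sources."""
    distinct = set(links)  # a repeated link from the same source counts once
    return Counter(target for _source, target in distinct)


def rank_by_popularity(matching_pages, scores):
    """Order pages that already matched the text query by their in-link count."""
    return sorted(matching_pages, key=lambda p: scores.get(p, 0), reverse=True)
```

For example, with links `[("a", "b"), ("c", "b"), ("a", "c")]`, page `b` has two in-links, `c` has one and `a` has none, so `b` is ranked first.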

Page 5: Cse535 Link Analysis

3rd Generation: Add Reputation

Each page starts with some basic “reputation” (e.g., = 1) and repeatedly distributes equal fractions to its links while receiving from them, until some “equilibrium”.

The reputation “PageRank” of a page P = the sum of a fair fraction of the reputations of all pages P_j that point to P.

Beautiful Math behind it:

PR = principal eigenvector of the web's link matrix

PR equivalent to the chance of randomly surfing to the page

Idea similar to academic co-citations

 

$$PR(W) = \frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \dots + \frac{PR(W_n)}{O(W_n)}$$
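One round of "distribute equal fractions, receive from in-links" might look like the sketch below. It assumes every page has at least one out-link (no dead-ends) and uses a small hypothetical graph. The name `propagate` is illustrative.

```python
def propagate(pr, out_links):
    """One round: each page splits its reputation equally among its out-links."""
    new = {page: 0.0 for page in pr}
    for page, targets in out_links.items():
        share = pr[page] / len(targets)  # the "fair fraction" PR(W_i) / O(W_i)
        for target in targets:
            new[target] += share
    return new


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
reputation = {"A": 1.0, "B": 1.0, "C": 1.0}
for _ in range(100):
    reputation = propagate(reputation, graph)
```

The total reputation is preserved by every round. On this particular graph the values settle at A = 1.2, B = 0.6, C = 1.2, which is the "equilibrium" the slide refers to.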

Page 6: Cse535 Link Analysis

Roots of PR: Citation Analysis

Citation frequency: the kind of background work Deans are doing at tenure time

Co-citation coupling frequency

Co-citations with a given author measures “impact”

Are you co-cited with influential publications?

Bibliographic coupling frequency

Articles that co-cite the same articles are related

Citation indexing: Who is author cited by?

Page 7: Cse535 Link Analysis

PageRank (PR) – Complete Definition

W is a web page

W_i are the web pages that have a link to W

O(W_i) is the number of out-links from W_i

t is the teleportation probability (e.g. 0.15)

N is the size of the web (that we have seen)

$$PR(W) = \frac{t}{N} + (1 - t)\left(\frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \dots + \frac{PR(W_n)}{O(W_n)}\right)$$

[Figure: page W with in-linking pages W1, W2, W3]

Page 8: Cse535 Link Analysis

PageRank: Iterative Computation

t is normally set to 0.15, but for this example, for simplicity, let's set it to 0.5

Set initial PR values to 1

Solve the following equations iteratively:

$$PR(A) = 0.5/3 + 0.5\,PR(C)$$

$$PR(B) = 0.5/3 + 0.5\,(PR(A)/2)$$

$$PR(C) = 0.5/3 + 0.5\,(PR(A)/2 + PR(B))$$

General formula:

$$PR(W) = \frac{t}{N} + (1 - t)\left(\frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \dots + \frac{PR(W_n)}{O(W_n)}\right)$$
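The three equations can be iterated directly. The sketch below assumes the graph they imply (A links to B and C, B links to C, C links to A) and updates all three values from the previous round. The function name and the number of rounds are illustrative choices.

```python
def iterate_pagerank(rounds=50, t=0.5, n=3):
    """Iterate the slide's three equations, starting from PR = 1 for every page."""
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}
    for _ in range(rounds):
        # The right-hand sides all use the values from the previous round.
        pr = {
            "A": t / n + (1 - t) * pr["C"],
            "B": t / n + (1 - t) * (pr["A"] / 2),
            "C": t / n + (1 - t) * (pr["A"] / 2 + pr["B"]),
        }
    return pr


ranks = iterate_pagerank()
```

Solving the equations by hand gives PR(A) = 14/39, PR(B) = 10/39 and PR(C) = 15/39 (about 0.359, 0.256 and 0.385), and the iteration approaches these values regardless of the starting point.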

Page 9: Cse535 Link Analysis

Example Computation of PR in Excel

Page 10: Cse535 Link Analysis

Pagerank – Matrix Multiplication (Equivalent Def.)

Imagine a browser doing a random walk on web pages: Start at a random page.

At each step, walk with equal probability out of the current page along one of the links on that page.

Continue doing this random walk for a long time.

“In the steady state” each page has a long-term visit rate:

Use this rate as the page's score.

[Figure: a page P with three out-links, each followed with probability 1/3]

Page 11: Cse535 Link Analysis

Not quite enough

The web is full of dead-ends. Random walk can get stuck in dead-ends.

Makes no sense to talk about long-term visit rates.

Page 12: Cse535 Link Analysis

Teleporting

At a dead end, jump to a random web page.

At any non-dead end:

With probability (say) 15%, jump to a random web page.

With remaining probability (85%), go out on a random link.

t = 0.15 is the “teleporting” parameter.
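The teleporting surfer can be simulated to estimate long-term visit rates. This is a sketch under stated assumptions: the graph is a hypothetical three-page example, a teleport picks uniformly among all pages (including the current one), and the name `simulate_surfer`, the step count and the seed are illustrative.

```python
import random
from collections import Counter


def simulate_surfer(out_links, steps=200_000, t=0.15, seed=0):
    """Estimate visit rates of a random surfer who teleports with probability t."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if not targets or rng.random() < t:
            current = rng.choice(pages)    # dead end, or teleport
        else:
            current = rng.choice(targets)  # follow a random out-link
    return {page: visits[page] / steps for page in pages}


# Hypothetical graph: A links to B and C, B links to C, C links to A.
rates = simulate_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

For this graph the PageRank formula with t = 0.15 gives roughly 0.388 for A, 0.215 for B and 0.397 for C, and the simulated visit rates should land close to those values.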

Page 13: Cse535 Link Analysis

Result of teleporting

Now cannot get stuck locally.

There exists a computable long-term rate at which any page is visited (this is not obvious, but it has been proven).

How do we compute this visit rate?

Page 14: Cse535 Link Analysis

Markov chains: abstractions of random walks

A Markov chain consists of n states, and an n×n transition probability matrix P.

At each step, we are in exactly one of the states.

For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i.

Clearly, for all i,

$$\sum_{j=1}^{n} P_{ij} = 1.$$

[Figure: states i and j joined by a transition labelled P_ij]
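The row-sum condition is easy to check in code. The sketch below represents P as a list of rows, which is an assumption of this example, and the helper names are illustrative. `step` computes the distribution after one transition, x' = xP.

```python
import math


def is_transition_matrix(P):
    """True if P is square, entries are probabilities, and every row sums to 1."""
    n = len(P)
    return all(
        len(row) == n
        and all(0.0 <= p <= 1.0 for p in row)
        and math.isclose(sum(row), 1.0, abs_tol=1e-12)
        for row in P
    )


def step(x, P):
    """One step of the chain: the row vector x multiplied by P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
```

Starting in state 1 with certainty, `x = [1.0, 0.0]`, one step with `P = [[0.5, 0.5], [1.0, 0.0]]` yields `[0.5, 0.5]`, which is simply the first row of P.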

Page 15: Cse535 Link Analysis

Computing PR with Markov chains

Example (next two slides): Represent the teleporting random walk with teleporting parameter t=1! as a Markov chain, for this graph:

[Figure: graph on the nodes A, B, C, D]

Page 16: Cse535 Link Analysis

Computing P with Matrix Multiplication

Start with the adjacency matrix A of the Web Graph: if there is a hyperlink from i to j, A_ij = 1, else A_ij = 0.

If a row has all 0's, replace each element by 1/N.

Else divide each 1 by the number of 1's in the row.

Multiply the matrix by 1-t.

Add t/N to every entry of the resulting matrix.

[Figure: graph on the nodes A, B, C, D and the resulting matrix P]
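The steps above translate almost line by line into code. This sketch assumes A is a list of rows containing only 0s and 1s. The function name `transition_matrix` is illustrative, and the example graph is hypothetical since the slide's A, B, C, D graph is not recoverable from the transcript.

```python
def transition_matrix(A, t):
    """Build the teleporting transition matrix P from a 0/1 adjacency matrix A."""
    n = len(A)
    P = []
    for row in A:
        ones = sum(row)
        if ones == 0:
            base = [1.0 / n] * n            # dead end: replace each element by 1/N
        else:
            base = [a / ones for a in row]  # divide each 1 by the number of 1's
        # Multiply by (1 - t), then add t/N to every entry.
        P.append([(1 - t) * b + t / n for b in base])
    return P


# Hypothetical 3-page graph: page 0 links to 1 and 2, page 1 links to 2, page 2 is a dead end.
P = transition_matrix([[0, 1, 1], [0, 0, 1], [0, 0, 0]], t=0.5)
```

Every row of the result sums to 1, so P is a valid transition matrix. The dead-end row comes out as 1/N in every position.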

Page 17: Cse535 Link Analysis

Computing all Pageranks

Theorem: Regardless of where we start, we eventually reach the steady state a.

Start with any distribution (say x = (1 0 … 0)). After one step, we're at xP; after two steps at xP^2, then xP^3 and so on.

“Eventually” means for “large” k, xP^k = a.

Algorithm: multiply x by increasing powers of P until the product looks stable.

[Figure: graph on the nodes A, B, C, D and matrix P]
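A minimal sketch of this algorithm, assuming P is a list of rows. Rather than forming the powers of P explicitly, it repeatedly replaces x by xP, which produces the same sequence xP, xP^2, xP^3 and so on. The tolerance used for "looks stable" and the name `steady_state` are illustrative choices.

```python
def steady_state(P, tol=1e-12, max_steps=10_000):
    """Power iteration: repeat x <- xP until x stops changing."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)  # start in the first state
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Transition matrix for the hypothetical graph A -> B, A -> C, B -> C, C -> A with t = 0.5.
P = [
    [1 / 6, 5 / 12, 5 / 12],
    [1 / 6, 1 / 6, 2 / 3],
    [2 / 3, 1 / 6, 1 / 6],
]
a = steady_state(P)
```

For this matrix the steady state is (14/39, 10/39, 15/39), the same values obtained by solving the per-page PageRank equations for that graph with t = 0.5.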

Page 18: Cse535 Link Analysis

Pagerank summary

Preprocessing: Given graph of links, build matrix P.

From it compute a.

The entry a_i is a number between 0 and 1: the pagerank of page i.

Query processing: Retrieve pages meeting query.

Rank them by their pagerank.

Order is query-independent: if PR(A) > PR(B) for some query, A beats B in every query.
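Query processing as summarised above could be sketched as follows. The matching rule (every query term must appear in the page text) and the names `search`, `pages` and `pagerank` are assumptions for illustration. The pagerank values are taken as precomputed.

```python
def search(query, pages, pagerank):
    """Retrieve pages containing every query term, then order them by pagerank."""
    terms = query.lower().split()
    hits = [
        url
        for url, text in pages.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(hits, key=lambda url: pagerank[url], reverse=True)


# Hypothetical pages and precomputed pagerank values.
pages = {"a": "venture capital firms", "b": "venture capital news", "c": "cooking"}
pagerank = {"a": 0.2, "b": 0.5, "c": 0.3}
results = search("venture capital", pages, pagerank)
```

Because the sort key does not depend on the query, page `b` precedes page `a` in every result list that contains both, which is the query-independence noted in the summary.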

Page 19: Cse535 Link Analysis

How is Pagerank used?

PageRank Technology:

PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.

This claim has recently changed:

“Today we use more than 200 signals, including PageRank, to order websites, and we update these algorithms on a weekly basis”

Pagerank is dead, long live Pagerank

http://www.google.com/corporate/tech.html