Link Analysis Rong Jin

Upload: bertram-mccoy

Post on 14-Dec-2015


Web Structure

The web is a graph
Each web site corresponds to a node
A link from one site to another forms a directed edge

What does it look like?

The web is a small world
The diameter of the web is 19, i.e., the average number of clicks from one web site to another is 19

Bowtie Structure (Broder et al., 2001)

Strongly connected component: the ‘center’ of the web
Sites that link towards the ‘center’ of the web
Sites that are linked to from the ‘center’ of the web

Inlinks and Outlinks

The numbers of incoming and outgoing links per page both follow a power law (Broder et al., 2001)

Connected Components

The sizes of weakly connected components (WCC) and strongly connected components (SCC) also follow a power law (Broder et al., 2001)

Why Is Link Analysis Useful for IR?

Assumption 1: A hyperlink is a quality signal. A hyperlink reflects the relevance perceived by the linking page's author.

Assumption 2: The anchor text describes the target page (d2).

Anchor Text for IR

Example: query IBM
Matching on page text, you could match IBM’s copyright page, IBM’s Wikipedia page, and spam pages, but not the IBM home page if it is mostly graphical.

Searching on anchor text is better for the query IBM.
Represent each page by all the anchor text pointing to it.
In this representation, the page with the most occurrences of IBM is www.ibm.com.

Example Anchor Text for IBM Pointing to www.ibm.com

www.nytimes.com: “IBM acquires Webify”
www.slashdot.org: “New IBM optical chip”
www.stanford.edu: “IBM faculty award recipients”

Anchor text is often a better description of a page’s content than the page itself, so weight anchor text more heavily than the page text.
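The anchor-text representation can be sketched in a few lines. This is a minimal illustration, not a production indexer: the three links to www.ibm.com are taken from the example, while the fourth link, the function names, and the simple term-count scoring are assumptions made for the sketch.

```python
from collections import defaultdict

# (source page, target page, anchor text). The first three come from the
# example; the last link and its URLs are hypothetical.
links = [
    ("www.nytimes.com", "www.ibm.com", "IBM acquires Webify"),
    ("www.slashdot.org", "www.ibm.com", "New IBM optical chip"),
    ("www.stanford.edu", "www.ibm.com", "IBM faculty award recipients"),
    ("www.example.org", "www.example.com/copyright", "IBM copyright notice"),
]

def anchor_documents(links):
    """Represent each target page by all anchor text pointing to it."""
    docs = defaultdict(list)
    for _source, target, text in links:
        docs[target].extend(text.lower().split())
    return docs

def rank_by_anchor_text(query, links):
    """Order pages by how often the query term occurs in their anchor text."""
    term = query.lower()
    docs = anchor_documents(links)
    counts = {page: words.count(term) for page, words in docs.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_anchor_text("IBM", links)
print(ranking[0])  # the page with the most occurrences of the query term
```

Counting raw occurrences is the simplest possible scoring; it also makes clear why manipulated anchor text can distort results.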

Google Bomb

Example: “who is a failure”
A Google bomb is a search with “bad” results due to maliciously manipulated anchor text.
This works because anchor text is weighted more heavily than page text.

Google introduced a new weighting function in January 2007 that fixed many Google bombs.

PageRank: Citation Analysis

Citation analysis: analysis of citations in the scientific literature
Example: “Miller (2001) has shown that physical activity alters the metabolism of cells.”
“Miller (2001)” acts as a hyperlink between two scientific articles.

Applications of these “hyperlinks”:
Measure the similarity of two articles (co-citation)
Measure the impact of scientific articles

PageRank: Citation Analysis

Citation frequency can be used to measure the impact of an article.
Each article gets one vote.
Not a very accurate measure.

[Figure: citation graph over articles d1 to d6, da, and db; Impact(da) = 4, Impact(db) = 3]

PageRank: Citation Analysis

Citation frequency can be used to measure the impact of an article.
Each article gets one vote.
Not a very accurate measure.

Better measure: weighted citation frequency / citation rank
An article’s vote is weighted according to its citation impact.

[Figure: the same citation graph with weighted votes; Impact(d1) = 1, Impact(d2) = 1, Impact(d3) = 1, Impact(d4) = 4, Impact(d5) = 1, Impact(d6) = 3, giving Impact(da) = 6 and Impact(db) = 8]

PageRank: Citation Analysis

Better measure: weighted citation frequency / citation rank
An article’s vote is weighted according to its citation impact.

Is this a circular definition?
No: it can be formalized in a well-defined way. This is basically PageRank.
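The two impact measures can be compared on the slide's numbers. The original figure is not available, so the citing sets below are an assumption: they are one assignment that reproduces both the plain counts (4 and 3) and the weighted values (6 and 8) quoted above. Function names are illustrative.

```python
# Which articles cite da and db. ASSUMPTION: chosen to be consistent with
# the impact values quoted on the slides; the figure itself is not available.
cited_by = {
    "da": ["d1", "d2", "d3", "d6"],
    "db": ["d4", "d5", "d6"],
}

# Impact of the citing articles, as given on the slides.
impact = {"d1": 1, "d2": 1, "d3": 1, "d4": 4, "d5": 1, "d6": 3}

def citation_frequency(article):
    """Each citing article gets exactly one vote."""
    return len(cited_by[article])

def weighted_citation(article):
    """Each citing article's vote is weighted by its own impact."""
    return sum(impact[citing] for citing in cited_by[article])

for article in ("da", "db"):
    print(article, citation_frequency(article), weighted_citation(article))
```

Under plain counting da outranks db, while the weighted measure reverses the order because db is cited by the high-impact article d4.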

Link Analysis for IR

A simple approach to using links for ranking web pages for a given query
Assumption: more in-links means higher popularity
First, retrieve all pages for a given query
Select the top K (e.g., 100) web pages
Reorder the top K web pages by their number of in-links

But this is prone to spam.
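A small sketch of this re-ranking step, and of why it is prone to spam. The page names, the edge list, and the function name are hypothetical; the retrieval step is assumed to have already produced an ordered list.

```python
from collections import Counter

def rerank_by_inlinks(retrieved, edges, k=100):
    """Reorder the top-k retrieved pages by their number of in-links."""
    inlinks = Counter(target for _source, target in edges)
    top_k = retrieved[:k]
    # sorted() is stable, so ties keep their original retrieval order.
    return sorted(top_k, key=lambda page: inlinks[page], reverse=True)

retrieved = ["p1", "p2", "p3"]                      # hypothetical query results
edges = [("p1", "p3"), ("p2", "p3"), ("p3", "p1")]  # (source, target) links
honest = rerank_by_inlinks(retrieved, edges)

# A spammer creates three fake pages that all link to p2.
spam_edges = edges + [("s1", "p2"), ("s2", "p2"), ("s3", "p2")]
spammed = rerank_by_inlinks(retrieved, spam_edges)

print(honest)
print(spammed)
```

Because every in-link counts equally, a handful of fabricated pages is enough to move p2 from last place to first.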

Random Walk Model

Consider a random walk through the Web graph
• Start at a random page
• At each step, go out of the current page along one of the links on that page, with equal probability

Random Walk Model

Consider a random walk through the Web graph
• Start at a random page
• At each step, go out of the current page along one of the links on that page, with equal probability

Impact = long-term visit rate:
What portion of time will the surfer spend on each site as time goes to infinity?
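The long-term rate can be estimated by simply simulating the surfer. This is a rough sketch on a hypothetical four-page graph without dead ends; the graph, the step count, and the fixed seed are assumptions made so the example is reproducible.

```python
import random

# Hypothetical graph: page -> pages it links to (every page has an out-link).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def simulate_visit_rates(graph, steps=100_000, seed=0):
    """Estimate the fraction of time the surfer spends on each page."""
    rng = random.Random(seed)
    page = rng.choice(sorted(graph))        # start at a random page
    visits = dict.fromkeys(graph, 0)
    for _ in range(steps):
        page = rng.choice(graph[page])      # follow one out-link uniformly
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

rates = simulate_visit_rates(graph)
print(rates)
```

For this graph the rates settle near 0.4 for A and C and 0.2 for B, while D, which no page links to, is never revisited. Simulation converges slowly, which is one motivation for the linear-algebra formulation that follows.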

Page Rank

[Figure: web graph of seven sites with long-term rates r_1, ..., r_7]

r_i: long-term rate for the i-th web site

$$ r_1 = r_2 + \tfrac{1}{2} r_3 + \tfrac{1}{4} r_5 + \tfrac{1}{2} r_6 $$

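Page Rank

[Figure: web graph of seven sites with long-term rates r_1, ..., r_7]

r_i: long-term rate for the i-th web site

$$ r_1 = r_2 + \tfrac{1}{2} r_3 + \tfrac{1}{4} r_5 + \tfrac{1}{2} r_6
       = \begin{pmatrix} 0 & 1 & \tfrac{1}{2} & 0 & \tfrac{1}{4} & \tfrac{1}{2} & 0 \end{pmatrix}
         \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \\ r_5 \\ r_6 \\ r_7 \end{pmatrix} $$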

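Page Rank

[Figure: web graph of seven sites with long-term rates r_1, ..., r_7]

Writing one such equation per site, with the coefficients of the equation for r_i in row i of the matrix B:

$$ \mathbf{r} = B\,\mathbf{r}, \qquad \mathbf{r} = (r_1, r_2, \dots, r_7)^\top $$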

Page Rank

$$ \mathbf{r} = B\,\mathbf{r}, \qquad \mathbf{r} = (r_1, r_2, \dots, r_7)^\top $$

This is an eigenvector problem:
r is the principal eigenvector of B (i.e., the eigenvector with the largest eigenvalue)

Page Rank

$$ \mathbf{r} = B\,\mathbf{r} $$

How do we compute the transition matrix B?

PageRank

adjacency matrix → row normalization → transition matrix
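Row normalization can be sketched as follows, on a hypothetical three-site graph. One point worth making explicit: row-normalizing the adjacency matrix gives P with P[i][j] equal to the probability of moving from site i to site j. In an equation of the form r = B r, where each rate is a sum of other rates divided by the out-degree of the linking site, the coefficient matrix is the transpose of P. The function name and the handling of dead ends are assumptions of this sketch.

```python
def transition_matrix(adjacency):
    """Row-normalize a 0/1 adjacency matrix given as a list of lists."""
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Dead end: kept as an all-zero row in this sketch.
            matrix.append([0.0] * len(row))
        else:
            matrix.append([x / out_degree for x in row])
    return matrix

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adjacency = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

P = transition_matrix(adjacency)      # P[i][j]: probability of i -> j
B = [list(col) for col in zip(*P)]    # B[i][j]: probability of j -> i

print(P)
print(B)
```

Each row of P sums to 1, and row i of B lists the weights with which other sites contribute to r_i.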

Page Rank

• Observation: a site's rank reflects the number of in-links it receives from highly ranked pages

Adding Self Loop

Allow the surfer to decide to stay at the same page

$$ B' = \alpha B + (1 - \alpha) I, \qquad 0 \le \alpha \le 1 $$

Dead Ends

The web is full of dead ends.
A random walk can get stuck in dead ends.
If there are dead ends, long-term visit rates are not well-defined (or are nonsensical).

What happens to the long-term rate if the graph has dead ends?

PageRank: Teleporting

At a dead end, jump to a random web page
At any non-dead end:
  with probability 10%, jump to a random web page
  with the remaining probability (90%), go out on a random hyperlink

10% is a parameter.
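The teleporting rule translates directly into a modified transition matrix. The sketch below assumes the input is a row-normalized matrix whose dead ends appear as all-zero rows; the function name, the example matrix, and the choice of a uniform jump over all pages are assumptions.

```python
def add_teleport(P, teleport=0.10):
    """Apply teleporting to a row-normalized transition matrix P.

    P[i][j] is the probability of following a link from page i to page j.
    Dead ends are expected as all-zero rows.
    """
    n = len(P)
    uniform = 1.0 / n
    result = []
    for row in P:
        if sum(row) == 0:
            # Dead end: always jump to a random page.
            result.append([uniform] * n)
        else:
            # Otherwise: teleport with the given probability, else follow a link.
            result.append([(1 - teleport) * p + teleport * uniform for p in row])
    return result

# Hypothetical 3-page graph; page 2 is a dead end.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
G = add_teleport(P)
print(G)
```

After this step every row sums to 1 and every entry is positive, so the walk can no longer get stuck.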

PageRank: Implementation

Compute PageRank
  Given the graph of links, build matrix B
  Apply self-loop and teleportation
  Compute the PageRank scores by the iterative procedure

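Page Rank: Implementation

[Figure: web graph of seven sites with long-term rates r_1, ..., r_7]

$$ \mathbf{r} = B\,\mathbf{r}, \qquad \mathbf{r} = (r_1, r_2, \dots, r_7)^\top $$

Solving this eigenvector problem directly does not scale to the size of the web.

Iterative procedure
  Initialize: $r_1 = r_2 = \dots = r_7 = 1/7$
  Update: $\mathbf{r}^{(t+1)} = B\,\mathbf{r}^{(t)}$
  Repeat until it converges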
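The iterative procedure is a power iteration. Below is a minimal sketch on a hypothetical three-site transition matrix; it assumes self-loops and teleportation have already been applied where needed, and the tolerance, iteration cap, and function name are illustrative choices.

```python
def pagerank(G, tol=1e-10, max_iter=1000):
    """Power iteration. G[i][j]: probability of moving from site i to site j."""
    n = len(G)
    r = [1.0 / n] * n                      # initialize: every site gets 1/n
    for _ in range(max_iter):
        # Update: the new rate of site j collects rate from every site i
        # in proportion to the probability of moving from i to j.
        new_r = [sum(r[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r                   # converged
        r = new_r
    return r

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (rows sum to 1).
G = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
rates = pagerank(G)
print(rates)
```

Each pass only needs one multiplication by the (sparse, in practice) transition matrix, which is what makes the procedure feasible at web scale. On this graph the rates approach 0.4, 0.2, and 0.4.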

PageRank: Implementation

How to utilize PageRank for information retrieval
  Apply the standard IR approach to identify the top K (e.g., K = 100) ranked web pages
  Re-rank the top-ranked pages by their PageRank

[Figure: a query is run against the document collection with Lucene, which returns D1, D2, D3; their PageRank (PR) scores 0.01, 0.03, 0.05 re-rank them as D3, D2, D1]
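The re-ranking step with the figure's values, as a short sketch. The retrieval call itself is not shown; the list `retrieved` stands in for whatever the text retrieval engine returned, and the function name is hypothetical.

```python
# PageRank scores for the three retrieved documents, as in the figure.
pagerank_score = {"D1": 0.01, "D2": 0.03, "D3": 0.05}

def rerank_by_pagerank(retrieved, scores, k=100):
    """Keep the top-k retrieved documents and reorder them by PageRank."""
    top_k = retrieved[:k]
    # Documents without a known score are treated as having score 0.
    return sorted(top_k, key=lambda doc: scores.get(doc, 0.0), reverse=True)

retrieved = ["D1", "D2", "D3"]     # order returned by the text retrieval step
reranked = rerank_by_pagerank(retrieved, pagerank_score)
print(reranked)
```

Note that this scheme discards the text relevance score once the top K have been chosen; combining both scores is a common alternative.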

How Important is PageRank?

Frequent claim: PageRank is the most important component of web ranking.
The reality:
  There are several components that are at least as important (e.g., anchor text, phrases, proximity, tiered indexes ...)
  Rumor has it that PageRank in its original form (as presented here) has a negligible impact on ranking!
  However, variants of a page’s PageRank are still an essential part of ranking.
  Addressing link spam is difficult and crucial.

HITS - Kleinberg’s Algorithm

• HITS – Hypertext Induced Topic Selection

• For each vertex v ∈ V in a subgraph of interest:
  a(v) - the authority of v
  h(v) - the hubness of v

• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.

• Hubness shows the importance of a site as a hub. A good hub is a site that links to many authoritative sites.

Authority and Hubness

[Figure: sites 2, 3, and 4 link to site 1; site 1 links to sites 5, 6, and 7]

a(1) = h(2) + h(3) + h(4)

h(1) = a(5) + a(6) + a(7)

Authority and Hubness: Version 1

pa[v]: sites that link to v; ch[v]: sites that v links to

HubsAuthorities(G)
  1 ← [1,…,1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    t ← t + 1
  until || a_t – a_{t-1} || + || h_t – h_{t-1} || < ε
  return (a_t, h_t)

Recursive dependency:
  a(v) ← Σ_{w ∈ pa[v]} h(w)
  h(v) ← Σ_{w ∈ ch[v]} a(w)

Problems? The scores are never rescaled, so they can grow without bound from one iteration to the next.

Authority and Hubness: Version 2

Recursive dependency + normalization:
  a(v) ← Σ_{w ∈ pa[v]} h(w), then a(v) ← a(v) / Σ_w a(w)
  h(v) ← Σ_{w ∈ ch[v]} a(w), then h(v) ← h(v) / Σ_w h(w)

HubsAuthorities(G)
  1 ← [1,…,1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / || a_t ||
    h_t ← h_t / || h_t ||
    t ← t + 1
  until || a_t – a_{t-1} || + || h_t – h_{t-1} || < ε
  return (a_t, h_t)
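A runnable sketch of the normalized procedure. Several details are assumptions rather than part of the pseudocode: the norm is taken to be the sum of the (non-negative) scores, an iteration cap is added as a safeguard, and the five-site example graph is hypothetical.

```python
def hits(edges, eps=1e-9, max_iter=1000):
    """HITS with normalization. edges: list of (source, target) links.

    Returns (authority, hubness) as dictionaries keyed by site.
    """
    nodes = sorted({v for edge in edges for v in edge})
    pa = {v: [] for v in nodes}    # pa[v]: sites that link to v
    ch = {v: [] for v in nodes}    # ch[v]: sites that v links to
    for source, target in edges:
        pa[target].append(source)
        ch[source].append(target)

    a = dict.fromkeys(nodes, 1.0)
    h = dict.fromkeys(nodes, 1.0)
    for _ in range(max_iter):
        # Both updates read the previous iteration's scores.
        new_a = {v: sum(h[w] for w in pa[v]) for v in nodes}
        new_h = {v: sum(a[w] for w in ch[v]) for v in nodes}
        # Normalize so that each score vector sums to 1.
        a_total = sum(new_a.values()) or 1.0
        h_total = sum(new_h.values()) or 1.0
        new_a = {v: x / a_total for v, x in new_a.items()}
        new_h = {v: x / h_total for v, x in new_h.items()}
        change = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hypothetical graph: sites 0, 1, 2 act as hubs; sites 3, 4 as authorities.
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3)]
authority, hubness = hits(edges)
print(authority)
print(hubness)
```

Site 3 ends up as the strongest authority because all three hubs point to it, and sites 0 and 1 tie as the best hubs because they point to both authorities.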

HITS Example Results

[Figure: authority and hubness weights for sites 1 to 15]

Authority and Hubness

Authority score
  Depends not only on the number of incoming links
  But also on the ‘quality’ (e.g., hubness) of the sites behind the incoming links

Hubness score
  Depends not only on the number of outgoing links
  But also on the ‘quality’ (e.g., authority) of the sites the outgoing links point to

Authority and Hubness: Convergence Question

Will the above iteration converge?
If we start with a random assignment to both hub and authority values, will the above iteration converge?
Will the converged authority and hub values depend on the initial assignments?

Convergence

Column vector a: a_i is the authority score for the i-th site
Column vector h: h_i is the hub score for the i-th site
Matrix M:

$$ M_{i,j} = \begin{cases} 1 & \text{the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases} $$

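Authority and Hub

Column vector a: a_i is the authority score for the i-th site
Column vector h: h_i is the hub score for the i-th site
Matrix M:

$$ M_{i,j} = \begin{cases} 1 & \text{the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases} $$

• Recursive dependency:
  a(v) ← Σ_{w ∈ pa[v]} h(w)
  h(v) ← Σ_{w ∈ ch[v]} a(w)

In matrix form:

$$ \mathbf{h} = M \mathbf{a}, \qquad \mathbf{a} = M^\top \mathbf{h} $$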

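Authority and Hub

Column vector a: a_i is the authority score for the i-th site
Column vector h: h_i is the hub score for the i-th site
Matrix M:

$$ M_{i,j} = \begin{cases} 1 & \text{the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases} $$

• Recursive dependency:
  a(v) ← Σ_{w ∈ pa[v]} h(w)
  h(v) ← Σ_{w ∈ ch[v]} a(w)

With the iteration index t:

$$ \mathbf{a}_t = M^\top \mathbf{h}_{t-1}, \qquad \mathbf{h}_t = M \mathbf{a}_{t-1} $$

followed by the normalization procedure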

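Authority and Hub

Substituting one update into the other (ignoring normalization):

$$ \mathbf{a}_t = M^\top \mathbf{h}_{t-1} = M^\top M \,\mathbf{a}_{t-2} $$
$$ \mathbf{h}_t = M \mathbf{a}_{t-1} = M M^\top \mathbf{h}_{t-2} $$

Apply SVD to matrix M:

$$ M = U \Sigma V^\top = \sum_i \sigma_i \mathbf{u}_i \mathbf{v}_i^\top $$

Since M_{i,j} = 1 when site i points to site j, a is the principal eigenvector of M^T M and h is the principal eigenvector of M M^T:
Authority scores: right principal singular vector of M, a ∝ v_1
Hub scores: left principal singular vector of M, h ∝ u_1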
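The link between the iteration and the SVD can be checked numerically. With M[i][j] = 1 when site i points to site j, the updates a = M^T h and h = M a make a an eigenvector of M^T M and h an eigenvector of M M^T, so a lines up with the principal right singular vector and h with the principal left singular vector. The sketch below is a plain-Python power iteration on M^T M, not a general SVD routine; the five-site matrix and the helper names are hypothetical.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def normalize(x):
    """Return x scaled to unit Euclidean length, together with its norm."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x], norm

def principal_singular_triplet(M, iterations=200):
    """Power iteration on M^T M. Returns (u1, sigma1, v1)."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(iterations):
        v, _ = normalize(matvec(Mt, matvec(M, v)))
    u, sigma = normalize(matvec(M, v))
    return u, sigma, v

# Hypothetical graph: sites 0 and 1 point to sites 3 and 4; site 2 points to 3.
M = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
u1, sigma1, v1 = principal_singular_triplet(M)

# Check the singular-vector relation M^T u1 = sigma1 * v1.
residual = max(abs(x - sigma1 * y) for x, y in zip(matvec(transpose(M), u1), v1))
print("authority direction (v1):", v1)
print("hub direction (u1):", u1)
print("residual:", residual)
```

The nonzero entries of v1 fall on sites 3 and 4 (the sites that are pointed to), and those of u1 fall on sites 0, 1, and 2 (the sites that point), matching the roles of authorities and hubs.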