jure leskovecand anandrajaraman · 2/4/2010 jure leskovec & anand rajaraman, stanford cs345a:...

CS345a: Data MiningJure Leskovec and Anand RajaramanjStanford University

… …

Philip S. Yu

IJCAI Q: what is most relatedconference to ICDM

ICDM

KDD

Ning Zhong

A: Random walks!

conference to ICDM

SDM

AAAI M. Jordan

R. RamakrishnanA: Random walks!

[Sun, ICDM2005]

NIPS…

…

2

…

Conference Author2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

0 008

0.009

0.011

0.0080.007

0.0050.011

0.005

0.0050 004

0.004

0.0040.004

32/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Pan, KDD ‘04]

…

Sea Sun Sky Wave{ } { }Cat Forest Grass Tiger

?A: Proximity on

grpahs!

4

{?, ?, ?,}grpahs!

[Pan KDD2004]2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Pan, KDD ‘04]

Region

Image

Test Image

5

Sea Sun Sky Wave Cat Forest Tiger Grass

Keyword2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Pan, KDD ‘04]

Region

Image

Test Image{Grass, Forest, Cat, Tiger}

6

Sea Sun Sky Wave Cat Forest Tiger Grass

Keyword2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

[Tong‐Faloutsos, ‘06]

Connection subgraphs: Connection subgraphs: What is the most likely connection between Andrew McCallum and Yiming Yang:Andrew McCallum and Yiming Yang:

Michael I.Jordan

RongJin

AndrewNg 1 16

2

7

Andrew Yiming

Jordan JinNg

ZoubinJohn D.

2

22 2

3

McCallum Yang

Xiaojin

GhahramaniLaffterty

2

Fernando42

1

42

2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 7

XiaojinZhu

Jian ZhangFernando

C.N. Pereira


Gi Q d B Given Q query nodes Find Center‐piece subgraph on b nodes

A C

subgraph on b nodes

Input of CEPS: Q Query nodes Q Query nodes Budget b k‐softAND coefficient B


A C


Individual Score Calculation: Individual Score Calculation: Measure proximity of each nodewith respect to individual query node

( , ) n Qr i j

with respect to individual query node

Combine Individual Scores:

Measure importance of a node tothe whole query set

1( , ) nr Q j

“Extract” the connection subgraph: arg max ( )H g H



Goal: Goal: Calculate importance score r(i,j) = ri,j f h d j d h d i for each node j and each query node i

How to do that? How to do that?



I J1

1 1

A BH1 1

1 1D

1 1

1 11E

FG

1 11

11

a.k.a.: Relevance, Closeness, ‘Similarity’…


Shortest path is not good:Shortest path is not good:

No influence for degree‐1 nodes (E, F)! known as ‘pizza delivery guy’ problem in undirected graphgraph

Multi‐faceted relationships2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 12

Max flow is not good: Max‐flow is not good:

Does not punish long paths



I J11 1

A BH1 1

D1 1

1 11•Multiple Connections…

EF

G1 11

• Quality of connection

•Direct & In‐direct connections

14

Direct & In direct connections

•Length, Degree, Weight…2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

910

2

9

12

1

3

8

11

4

56

15

7


Node 4

Node 1Node 2Node 3Node 4

0.130.100.130.222

9 10

8

120.13

0.10

0.08

0.04

0.02

0.03


0.130.050.050.08

1

4

3

6

811

30.13

0 05

0.04


0.040.030.040.02

5 6

7

0.13

0.05

0.05

Ranking vector More red more relevant

Nearby nodes, higher scores

16

More red, more relevant4r


( , ) i jQ i j rj

W : adjacency matrix. 1( )Q I cW

,( , ) i jQ j

i

j

c: damping factor

2c 3c ...W 2W 3WQ I c W WQ

all paths from i all paths from i all paths from i

17

to jwith length 1 to jwith length 2 to jwith length 3


(1 )r cWr c e

S i

(1 )i i ir cWr c e R b

0.13 0 1/3 1/3 1/3 0 0 0 0 0 0 0 00.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0

0.13 00.10 0

Ranking vector Starting vectorTransition matrix Restart prob.

13

2

9 10

811

12

0.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0.130.220.130.05

0.9

01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0

0.10 00.13 00.220.13 00.05 0

0.1

1

4

5 6

0.050.080.040.030.04

0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2 0 0 0 0 0 0 0 1/4 0 1/3 0 1/2

0.05 00.08 00.04 00.03 00.04 0

7

0.02 0 0 0 0 0 0 0 0 0 1/3 1/3 0

2 00.0

n x n n x 1n x 1182/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

0 1/3 1/3 1/3 0 0 0 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 0

0001

Query

0.9

0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0

0 0 0 0 1/4 1/2 0 0 0 0 0 0

00

0.10

?? 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/20 0 0 0 0

000

0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 1/4 0 1/3 0 1/2 0

0 0 0 0 0 0 0 0 0 1/3 1/3 0 0

Adjacency matrix Starting vectorRanking vectorRanking vector Adjacency matrix Starting vectorRanking vectorRanking vector

19

2/4/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 0

00

00

0.130.10

0.30

0.120.18

0.190.09

0.140.13

0.160.10

0.130.10 1/3 1/3 0 1/3 0 0 0 0 0 0 0 0

1/3 0 1/3 0 1/4

0.9

0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 00 0 0 0 1/4 1/2 0 0 0 0 0 0

0

00

0.10

1

01000

0.130.220.130.050 05

1

2

9 10

8

12

0.30.10.300

0.120.350.030.070 07

0.190.180.180.040 04

0.140.260.100.060 06

0.160.210.150.050 05

0.130.220.130.050 05

12

9 10

811

120.130.10

0.08

0.04

0.02

0.03

0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2

0 0 0 0 0

0000

0 0 1/4 0 1/3 0 1/2 0

00000

0.050.080.040.030.04

1

4

3

5 6

1100000

0.070.07000

0.040.060.0200.02

0.060.080.010.010.01

0.050.070.020.010.02

0.050.080.040.030.04

43

5 6

7

110.13

0.13

0.05

0.04

0 0 0 0 0 0 0 0 0 1/3 1/3 0 0 0

0.02

70 0 0 0 0.01 0.02

ir

ir

70.05

No pre computation / light storage


No pre‐computation / light storageSlow on‐line response O(mE)

Q1 Q2 Q3

Node 1Node 2

0.5767 0.0088 0.00880.1235 0.0076 0.0076

0 00240 0333

0.0088

5


0.0283 0.0283 0.02830.0076 0.1235 0.00760.0088 0.5767 0.00880.0076 0.0076 0.123510

11 12

13

4

0.1260

0 02830.0024

0.00760.00240.0333


0.0088 0.0088 0.57670.0333 0.0024 0.12600.1260 0.0024 0.03330.1260 0.0333 0.0024

10 133

62

0.1235

0.0283

0.0076

Node 11Node 12Node 13

0.0333 0.1260 0.00240.0024 0.1260 0.03330.0024 0.0333 0.1260

19 8

0.5767

0.1260 0.0333

0.0088

7


The RWR score r(i j) The RWR score r(i,j) i … query node, iQ j d i th t k j … node in the network

Combine r(i,j) to get the score r(Q,j)( ,j) g (Q,j) 1) AND query: Meeting probability prob. that |Q| random particles coincide on node jp |Q| p j


OR query: OR query: At least one particle arrives to node j:(i d j i i t t t t l t 1 d )(i.e., node j is important to at least 1 query node)

k‐SoftAND query:d l k d Node j is important to at least k query nodes

Can be computed recursively:


Goal:2

910Goal:

Extract a sub‐graph S on b nodes that maximizes

4

68

9

1113

the uS r(Q, u) Idea: Iterate until budget is

1 35

4712

14 15 16 Iterate until budget is reached Pick not‐yet‐selected node j (Q j)

29

10

j=arg maxj r(Q,j) Find good paths to all query nodes 4

68

1113

Add the paths to S


1 35 712

0 4505 0.0103

5

11 0 07100.10100.1010

0.45055

11 120.00190.00460.0046

0.0103

10

11 12

13

4

0.1010

0.22670.1010

0.0710

10

12

13

4

3

0.0046

0.00240.0046

3

62

0.0710

0 1010 0 1010

0.0710

1 7

3

62

0.0019

0 0046 0 0046

0.0019

AND

1 7

9 80.4505

0.1010 0.1010

0.4505

1 7

9 80.0103

0.0046 0.0046

0.0103

S f AND

25

AND 2‐SoftAND

0 0103

5

0 16170.1617

0.0103

11 124

0 1617 0 1617

0.13870.16170.1617

10 133

0.1617

0.1387

0.08490.1617

0.1387

1 7

62

0.1617 0.1617

26

9 80.0103 0.0103

Solving the random walk for each query Solving the random walk for each query separately is time consuming Not appropriate for real time Not appropriate for real‐time

Can we do better? Can we do better?

Idea 1) Pre compute scores for each possible Idea 1) Pre‐compute scores for each possible query node i


1 2 3 4 5 6 7 8 9 10 11 12r r r r r r r r r r r r0.20 0.13 0.14 0.13 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.340.28 0.20 0.13 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.450.14 0.13 0.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.330.13 0.10 0.13 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.32

29 10

8120.13

0.100.08

0.04

0 02

0.03

0.09 0.09 0.09 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.560.03 0.04 0.04 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.220.03 0.04 0.04 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.220.08 0.11 0.04 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.130 03 0 04 0 03 0 28 0 36 0 30 0 30 0 74 1 78 1 00 0 76 0 79

1

43

5 6

8 110.13

0.130.05

0.02

0.04Q‐1:

0.03 0.04 0.03 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.790.04 0.04 0.04 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.800.04 0.05 0.04 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.720.02 0.03 0.02 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05

70.13

0.05

Fast on‐line response

29

Heavy pre‐computation/storage costO(n3 ) O(n2)

[Tong‐Faloutsos‐Pan, ‘06]

Find Communities 12

9 10

812

12

9 10

812

100.04 0.039 10

12

1

43

5 6

7

11

9 1012

1

43

5 6

7

11

1

43

29 10

8 11

120.130.10

0.13

0.080.02

0.04

1

43

28

11

12 78

11

12 7

1

43

2

45 6

70.13

0.05

0.05

4

5 6

7

5 6

7

4

Combine1 3

29 10

8 11

121 3

29 10

8 11

12

30

Fix the remaining4

3

5 6

7

43

5 6

7

10

13

29 10

811

12

12

9 10

812

4

5 6

7

1

43

5 6

811

7 5

7

Within partition links cross partition links

31

Within‐partition links cross‐partition links

9 109 10

1

43

29

811

12

1

43

29

811

12

4

5 6

7

4

5 6

7

32

9 1012

9 1012

31

4

28

11

12

1

43

28

11

12

5 6

7

5 6

7

|S| << |W2|~| | | |

33

Offline stage: Offline stage:

QQ1,1

Q 1,2

=11Q

Q1,k

C i i BridgesCommunities Bridges

8/22/2006 34

Sherman Morrison Woodbury matrix identity: Sherman‐Morrison‐Woodbury matrix identity: Inverse of a rank‐k correction of matrix X can be computed by doing a rank‐k correction to X‐1:computed by doing a rank‐k correction to X :

In our case: In our case:Q‐1=c(I‐W)‐1=c(I‐W1‐W2)‐1=

1 1 11 U VQQ Q Q 8/22/2006 35

1 1 11 1

11U VcQQ Q Q

Online stage: Online stage: Compute column i of:

1 1 11 QQ Q Q

U i

1 1 11 1

11U VcQQ Q Q

Using:

8/22/2006 36

Most slides are borrowed from the tutorial by Most slides are borrowed from the tutorial by Hanghang Tong and Christos Faloutsos from CMUCMU

http://www.cs.cmu.edu/~htong/tut/cikm2008/cikm_tutorial.html

8/22/2006 37

jure leskovecand anandrajaraman · 2/4/2010 jure leskovec & anand rajaraman, stanford cs345a:...

Documents