jure leskovecand anandrajaraman · 2/4/2010 jure leskovec & anand rajaraman, stanford cs345a:...
TRANSCRIPT
CS345a: Data MiningJure Leskovec and Anand RajaramanjStanford University
… …
Philip S. Yu
IJCAI Q: what is most relatedconference to ICDM
ICDM
KDD
Ning Zhong
A: Random walks!
conference to ICDM
SDM
AAAI M. Jordan
R. RamakrishnanA: Random walks!
[Sun, ICDM2005]
NIPS…
…
2
…
Conference Author2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
0 008
0.009
0.011
0.0080.007
0.0050.011
0.005
0.0050 004
0.004
0.0040.004
32/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
[Pan, KDD ‘04]
…
Sea Sun Sky Wave{ } { }Cat Forest Grass Tiger
?A: Proximity on
grpahs!
4
{?, ?, ?,}grpahs!
[Pan KDD2004]2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
[Pan, KDD ‘04]
Region
Image
Test Image
5
Sea Sun Sky Wave Cat Forest Tiger Grass
Keyword2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
[Pan, KDD ‘04]
Region
Image
Test Image{Grass, Forest, Cat, Tiger}
6
Sea Sun Sky Wave Cat Forest Tiger Grass
Keyword2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
[Tong‐Faloutsos, ‘06]
Connection subgraphs: Connection subgraphs: What is the most likely connection between Andrew McCallum and Yiming Yang:Andrew McCallum and Yiming Yang:
Michael I.Jordan
RongJin
AndrewNg 1 16
2
7
Andrew Yiming
Jordan JinNg
ZoubinJohn D.
2
22 2
3
McCallum Yang
Xiaojin
GhahramaniLaffterty
2
Fernando42
1
42
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 7
XiaojinZhu
Jian ZhangFernando
C.N. Pereira
[Tong‐Faloutsos, ‘06]
Gi Q d B Given Q query nodes Find Center‐piece subgraph on b nodes
A C
subgraph on b nodes
Input of CEPS: Q Query nodes Q Query nodes Budget b k‐softAND coefficient B
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 8
A C
[Tong‐Faloutsos, ‘06]
Individual Score Calculation: Individual Score Calculation: Measure proximity of each nodewith respect to individual query node
( , ) n Qr i j
with respect to individual query node
Combine Individual Scores:
Measure importance of a node tothe whole query set
1( , ) nr Q j
“Extract” the connection subgraph: arg max ( )H g H
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 9
[Tong‐Faloutsos, ‘06]
Goal: Goal: Calculate importance score r(i,j) = ri,j f h d j d h d i for each node j and each query node i
How to do that? How to do that?
102/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
[Tong‐Faloutsos, ‘06]
I J1
1 1
A BH1 1
1 1D
1 1
1 11E
FG
1 11
11
a.k.a.: Relevance, Closeness, ‘Similarity’…
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Shortest path is not good:Shortest path is not good:
No influence for degree‐1 nodes (E, F)! known as ‘pizza delivery guy’ problem in undirected graphgraph
Multi‐faceted relationships2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 12
Max flow is not good: Max‐flow is not good:
Does not punish long paths
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 13
[Tong‐Faloutsos, ‘06]
I J11 1
A BH1 1
D1 1
1 11•Multiple Connections…
EF
G1 11
• Quality of connection
•Direct & In‐direct connections
14
Direct & In direct connections
•Length, Degree, Weight…2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
910
2
9
12
1
3
8
11
4
56
15
7
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
Node 4
Node 1Node 2Node 3Node 4
0.130.100.130.222
9 10
8
120.13
0.10
0.08
0.04
0.02
0.03
Node 5Node 6Node 7Node 8
0.130.050.050.08
1
4
3
6
811
30.13
0 05
0.04
Node 9Node 10Node 11Node 12
0.040.030.040.02
5 6
7
0.13
0.05
0.05
Ranking vector More red more relevant
Nearby nodes, higher scores
16
More red, more relevant4r
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
( , ) i jQ i j rj
W : adjacency matrix. 1( )Q I cW
,( , ) i jQ j
i
j
c: damping factor
2c 3c ...W 2W 3WQ I c W WQ
all paths from i all paths from i all paths from i
17
to jwith length 1 to jwith length 2 to jwith length 3
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
(1 )r cWr c e
S i
(1 )i i ir cWr c e R b
0.13 0 1/3 1/3 1/3 0 0 0 0 0 0 0 00.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0
0.13 00.10 0
Ranking vector Starting vectorTransition matrix Restart prob.
13
2
9 10
811
12
0.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0.130.220.130.05
0.9
01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0
0.10 00.13 00.220.13 00.05 0
0.1
1
4
5 6
0.050.080.040.030.04
0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2 0 0 0 0 0 0 0 1/4 0 1/3 0 1/2
0.05 00.08 00.04 00.03 00.04 0
7
0.02 0 0 0 0 0 0 0 0 0 1/3 1/3 0
2 00.0
n x n n x 1n x 1182/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
0 1/3 1/3 1/3 0 0 0 0 0 0 0 0 0 0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 0
0001
Query
0.9
0 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0
0 0 0 0 1/4 1/2 0 0 0 0 0 0
00
0.10
?? 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/20 0 0 0 0
000
0 0 1/4 0 1/3 0 1/2 0
0 0 0 0 0 0 0 1/4 0 1/3 0 1/2 0
0 0 0 0 0 0 0 0 0 1/3 1/3 0 0
Adjacency matrix Starting vectorRanking vectorRanking vector Adjacency matrix Starting vectorRanking vectorRanking vector
19
2/4/2010Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining
0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 0
00
00
0.130.10
0.30
0.120.18
0.190.09
0.140.13
0.160.10
0.130.10 1/3 1/3 0 1/3 0 0 0 0 0 0 0 0
1/3 0 1/3 0 1/4
0.9
0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 00 0 0 0 1/4 1/2 0 0 0 0 0 0
0
00
0.10
1
01000
0.130.220.130.050 05
1
2
9 10
8
12
0.30.10.300
0.120.350.030.070 07
0.190.180.180.040 04
0.140.260.100.060 06
0.160.210.150.050 05
0.130.220.130.050 05
12
9 10
811
120.130.10
0.08
0.04
0.02
0.03
0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2
0 0 0 0 0
0000
0 0 1/4 0 1/3 0 1/2 0
00000
0.050.080.040.030.04
1
4
3
5 6
1100000
0.070.07000
0.040.060.0200.02
0.060.080.010.010.01
0.050.070.020.010.02
0.050.080.040.030.04
43
5 6
7
110.13
0.13
0.05
0.04
0 0 0 0 0 0 0 0 0 1/3 1/3 0 0 0
0.02
70 0 0 0 0.01 0.02
ir
ir
70.05
No pre computation / light storage
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 20
No pre‐computation / light storageSlow on‐line response O(mE)
Q1 Q2 Q3
Node 1Node 2
0.5767 0.0088 0.00880.1235 0.0076 0.0076
0 00240 0333
0.0088
5
Node 3Node 4Node 5Node 6
0.0283 0.0283 0.02830.0076 0.1235 0.00760.0088 0.5767 0.00880.0076 0.0076 0.123510
11 12
13
4
0.1260
0 02830.0024
0.00760.00240.0333
Node 7Node 8Node 9Node 10
0.0088 0.0088 0.57670.0333 0.0024 0.12600.1260 0.0024 0.03330.1260 0.0333 0.0024
10 133
62
0.1235
0.0283
0.0076
Node 11Node 12Node 13
0.0333 0.1260 0.00240.0024 0.1260 0.03330.0024 0.0333 0.1260
19 8
0.5767
0.1260 0.0333
0.0088
7
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 21
The RWR score r(i j) The RWR score r(i,j) i … query node, iQ j d i th t k j … node in the network
Combine r(i,j) to get the score r(Q,j)( ,j) g (Q,j) 1) AND query: Meeting probability prob. that |Q| random particles coincide on node jp |Q| p j
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 22
OR query: OR query: At least one particle arrives to node j:(i d j i i t t t t l t 1 d )(i.e., node j is important to at least 1 query node)
k‐SoftAND query:d l k d Node j is important to at least k query nodes
Can be computed recursively:
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 23
Goal:2
910Goal:
Extract a sub‐graph S on b nodes that maximizes
4
68
9
1113
the uS r(Q, u) Idea: Iterate until budget is
1 35
4712
14 15 16 Iterate until budget is reached Pick not‐yet‐selected node j (Q j)
29
10
j=arg maxj r(Q,j) Find good paths to all query nodes 4
68
1113
Add the paths to S
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 24
1 35 712
0 4505 0.0103
5
11 0 07100.10100.1010
0.45055
11 120.00190.00460.0046
0.0103
10
11 12
13
4
0.1010
0.22670.1010
0.0710
10
12
13
4
3
0.0046
0.00240.0046
3
62
0.0710
0 1010 0 1010
0.0710
1 7
3
62
0.0019
0 0046 0 0046
0.0019
AND
1 7
9 80.4505
0.1010 0.1010
0.4505
1 7
9 80.0103
0.0046 0.0046
0.0103
S f AND
25
AND 2‐SoftAND
0 0103
5
0 16170.1617
0.0103
11 124
0 1617 0 1617
0.13870.16170.1617
10 133
0.1617
0.1387
0.08490.1617
0.1387
1 7
62
0.1617 0.1617
26
9 80.0103 0.0103
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 27
Solving the random walk for each query Solving the random walk for each query separately is time consuming Not appropriate for real time Not appropriate for real‐time
Can we do better? Can we do better?
Idea 1) Pre compute scores for each possible Idea 1) Pre‐compute scores for each possible query node i
2/4/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 28
1 2 3 4 5 6 7 8 9 10 11 12r r r r r r r r r r r r0.20 0.13 0.14 0.13 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.340.28 0.20 0.13 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.450.14 0.13 0.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.330.13 0.10 0.13 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.32
29 10
8120.13
0.100.08
0.04
0 02
0.03
0.09 0.09 0.09 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.560.03 0.04 0.04 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.220.03 0.04 0.04 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.220.08 0.11 0.04 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.130 03 0 04 0 03 0 28 0 36 0 30 0 30 0 74 1 78 1 00 0 76 0 79
1
43
5 6
8 110.13
0.130.05
0.02
0.04Q‐1:
0.03 0.04 0.03 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.790.04 0.04 0.04 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.800.04 0.05 0.04 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.720.02 0.03 0.02 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05
70.13
0.05
Fast on‐line response
29
Heavy pre‐computation/storage costO(n3 ) O(n2)
[Tong‐Faloutsos‐Pan, ‘06]
Find Communities 12
9 10
812
12
9 10
812
100.04 0.039 10
12
1
43
5 6
7
11
9 1012
1
43
5 6
7
11
1
43
29 10
8 11
120.130.10
0.13
0.080.02
0.04
1
43
28
11
12 78
11
12 7
1
43
2
45 6
70.13
0.05
0.05
4
5 6
7
5 6
7
4
Combine1 3
29 10
8 11
121 3
29 10
8 11
12
30
Fix the remaining4
3
5 6
7
43
5 6
7
10
13
29 10
811
12
12
9 10
812
4
5 6
7
1
43
5 6
811
7 5
7
Within partition links cross partition links
31
Within‐partition links cross‐partition links
9 109 10
1
43
29
811
12
1
43
29
811
12
4
5 6
7
4
5 6
7
32
9 1012
9 1012
31
4
28
11
12
1
43
28
11
12
5 6
7
5 6
7
|S| << |W2|~| | | |
33
Offline stage: Offline stage:
QQ1,1
Q 1,2
=11Q
Q1,k
C i i BridgesCommunities Bridges
8/22/2006 34
Sherman Morrison Woodbury matrix identity: Sherman‐Morrison‐Woodbury matrix identity: Inverse of a rank‐k correction of matrix X can be computed by doing a rank‐k correction to X‐1:computed by doing a rank‐k correction to X :
In our case: In our case:Q‐1=c(I‐W)‐1=c(I‐W1‐W2)‐1=
1 1 11 U VQQ Q Q 8/22/2006 35
1 1 11 1
11U VcQQ Q Q
Online stage: Online stage: Compute column i of:
1 1 11 QQ Q Q
U i
1 1 11 1
11U VcQQ Q Q
Using:
8/22/2006 36
Most slides are borrowed from the tutorial by Most slides are borrowed from the tutorial by Hanghang Tong and Christos Faloutsos from CMUCMU
http://www.cs.cmu.edu/~htong/tut/cikm2008/cikm_tutorial.html
8/22/2006 37