TRANSCRIPT
Fast Algorithms for Querying and Mining Large Graphs
Hanghang Tong
Machine Learning Department
Carnegie Mellon University
[email protected]
http://www.cs.cmu.edu/~htong
-----
Graphs are everywhere!
[Figures: Internet Map [Koren 2009]; Food Web [2007]; Protein Network [Salthe 2004]; Social Network [Newman 2005]; Web Graph; Terrorist Network [Krebs 2002]]
Why Do We Care?
Research Theme
How to help users understand and utilize large graph-related data?
A1: Social Networks
• Facebook (300m users, $10bn value, $500m revenue)
• MSN (240m users, 4.5pb); Myspace (110m users)
• LinkedIn (50m users, $1bn value); Twitter (18m users)
How to help users explore such networks? (e.g., find strange persons or communities, locate common friends, etc.)
A2: Network Forensics [Sun+ 2007]
How to detect abnormal traffic?
[Figure: traffic matrices showing port scanning, DDoS, and normal traffic]
Footnote: rows are IP sources; columns are IP destinations.
[Figure: a graph (example nodes ibm.com, cmu.edu) and its adjacency matrix]
A3: Business Intelligence
What is IBM's rank in the global service business over the years?
[Figure: yearly graphs (2005, 2006, 2007) linking news sources (NY Times, Forbes, Reuters) to keywords (Hardware, Service, IBM); plot of the rank of IBM in Global Service per year (higher is better)]
Footnote: nodes are business reviews and keywords; edges mean 'reporting'.
A4: Financial Fraud Detection [Tong+ 2007]
7.5% of U.S. adults lost money to financial fraud; 50%+ of US corporations lost >= $500,000 [Albrecht+ 2001], e.g., Enron ($70bn). Total cost of financial fraud: $1 trillion [Ansari 2006].
How to detect abnormal transaction patterns? (e.g., a money-laundering ring)
[Figure: transaction graph. Legends: anonymous accounts; anonymous banks]
A5: Immunization
How to select the k 'best' nodes for immunization?
[Figure: example contact network with 34 numbered nodes]
Footnote: SARS cost 700+ lives and $40+ Bn.
This Talk
• Querying [Goal: query complex relationships]
  - Q.1. Find complex user-specific patterns;
  - Q.2. Proximity tracking;
  - Q.3. Answer all the above questions quickly.
• Mining [Goal: find interesting patterns]
  - M.1. Immunization;
  - M.2. Spot anomalies.
Tasks vs. Applications
[Table: tasks Q1-Q3, M1-M2 against applications A1-A5]
A1: Social Networks; A2: Network Forensics; A3: Business Intelligence; A4: Financial Fraud; A5: Immunization.
Q1: Complex User-Specific Patterns; Q2: Proximity Tracking; Q3: Fast Proximity Computing; M1: Immunization; M2: Anomaly Detection.
Overview
• Q1: CePS, iPoG (KDD06, ICDM08, CIKM09)
• Q2: pTrack/cTrack (SDM08, SAM08)
• Q3: FastProx (ICDM06, KAIS07, KDD07b, ICDM08, SDM08, SAM08)
• M1: NetShield
• M2: Colibri-S, Colibri-D (KDD08)
Proximity Measurement
Q: How close is A to B?
[Figure: small weighted graph with nodes A through J]
a.k.a. relevance, closeness, 'similarity', ...
Background
Random Walk with Restart [Tong+ ICDM 2006]
[Figure: 12-node example graph; the RWR ranking vector r4 for query node 4 assigns nodes 1-12 the scores 0.13, 0.10, 0.13, 0.22, 0.13, 0.05, 0.05, 0.08, 0.04, 0.03, 0.04, 0.02]
Ranking vector: more red, more relevant; nearby nodes get higher scores.
Background
Intuitions: Why is RWR a Good Score?
[Figure: source node 1, target node 20, with several colored paths between them]
Score(Red Path) = (1-c) x c^6 x W(1,3) x W(3,4) x ... x W(14,20)
c^6 is the penalty for the length of the path; the product of W terms is the probability of traversing the path.
Footnote: (1-c) is the restart probability in RWR; W is the normalized adjacency matrix of the graph.
Prox(1, 20) = Score(Red Path) + Score(Green Path) + Score(Yellow Path) + Score(Purple Path) + ...
A high proximity means many short and/or highly weighted paths.
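This walk-sum view can be sanity-checked numerically: truncating the sum (1-c) * sum_l c^l W^l at a large enough length recovers the exact proximity matrix (1-c)(I - cW)^(-1). A minimal sketch on an assumed toy graph (not one from the talk):

```python
import numpy as np

c = 0.9
# Toy 4-node graph: edges 0-1, 0-2, 1-2, 2-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], float)
W = A / A.sum(axis=0)                       # column-normalized adjacency

# Exact proximity: (1-c) * (I - cW)^{-1}
prox = (1 - c) * np.linalg.inv(np.eye(4) - c * W)

# Sum over all walks up to length L: each length-l walk contributes
# (1-c) * c^l * (product of W entries along the walk)
L = 200
walk_sum = sum((1 - c) * c**l * np.linalg.matrix_power(W, l)
               for l in range(L + 1))
```

The truncation error is bounded by the geometric tail c^(L+1)/(1-c), so the two agree to high precision.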
Q1: Find Complex User-Specific Patterns
• Q1.1. Center-Piece Subgraph Discovery
  - e.g., who is the master-mind criminal, given suspects X, Y and Z?
• Q1.2. Interactive Querying (e.g., negation)
  - e.g., find the conferences most similar to KDD, but not like ICML.
Footnote: our algorithms for both Q1.1 and Q1.2 are to be deployed in a real system (Cyano) at IBM.
Q1.1 Center-Piece Subgraph Discovery [Tong+ KDD 06]
Q: How to find a hub for the black query nodes?
[Figure: input = original graph with query nodes A, B, C; output = CePS with the CePS node as hub]
Red node: max over candidates of Prox(A, Red) x Prox(B, Red) x Prox(C, Red)
CePS: Example (AND Query)
DBLP co-authorship network: 400,000 authors, 2,000,000 edges.
[Figure: CePS result connecting query authors R. Agrawal, Jiawei Han, V. Vapnik, and M. Jordan through H.V. Jagadish, Laks V.S. Lakshmanan, Heikki Mannila, Christos Faloutsos, Padhraic Smyth, Corinna Cortes, and Daryl Pregibon; edge labels are co-authorship counts]
K_SoftAND: Relaxation of AND
Asking an AND query? No answer! (disconnected communities, noise)
[Figure: CePS with 2_SoftAND on the same query; R. Agrawal and Jiawei Han connect on the DB side (via H.V. Jagadish, Laks V.S. Lakshmanan, Umeshwar Dayal), while V. Vapnik and M. Jordan connect on the Stat side (via Bernhard Scholkopf, Peter L. Bartlett, Alex J. Smola)]
Q1.2 iPoG for Interactive Querying [Tong+ ICDM 08, CIKM 09]
What are the most related conferences wrt KDD? (DBLP author-conference bipartite graph)
• Initial results: ICDM, ICML, SDM, VLDB, ICDE, SIGMOD, NIPS, PKDD, IJCAI, PAKDD
• After 'No' to ICML: ICDM, SDM, PKDD, ICDE, VLDB, SIGMOD, PAKDD, CIKM, SIGIR, WWW
• After 'Yes' to SIGIR: SIGIR, TREC, CIKM, ECIR, CLEF, ICDM, JCDL, VLDB, ACL, ICDE
There are two main sub-communities in KDD: DBs (green) vs. ML/AI (red). Negative feedback on ICML excludes the other ML/AI conferences (NIPS, IJCAI); positive feedback on SIGIR brings in more IR (brown) conferences.
Q2.2 pTrack: Challenge [Tong+ SDM 08]
• Observations (CePS, iPoG, ...)
  - All are for static graphs;
  - Proximity is the main tool.
• Graphs are evolving over time!
  - New nodes/edges show up;
  - Existing nodes/edges die out;
  - Edge weights change...
Q: How close is Philip Yu to DBs over the years? A: Track proximity, incrementally!
[Figure: author-keyword bipartite graphs from NIPS 1993, 1994, 1995 (authors Sejnowski, Jordan; keywords Neural Network, ICA, Bayes)]
pTrack: Trend analysis on the graph level
[Figure: rank of influence over the years for M. Jordan, G. Hinton, C. Koch, T. Sejnowski]
pTrack: Problem Definitions
• [Given]
  - a large, skewed, time-evolving bipartite graph,
  - the query nodes of interest.
• [Track]
  - (1) the top-k most related nodes for each query node at each time step t;
  - (2) the proximity score (or rank of proximity) between any two query nodes at each time step t.
pTrack: Philip S. Yu's Top-5 conferences up to each year
1992: ICDE, ICDCS, SIGMETRICS, PDIS, VLDB (Databases, Performance, Distributed Systems)
1997: CIKM, ICDCS, ICDE, SIGMETRICS, ICMCS
2002: KDD, SIGMOD, ICDM, CIKM, ICDCS
2007: ICDM, KDD, ICDE, SDM, VLDB (Databases, Data Mining)
DBLP (Author x Conf.): 400k authors, 3.5k conferences, 20 years.
[Figure: proximity rank vs. year on an author-conference graph; KDD's rank wrt VLDB over the years. Data Mining and Databases are getting closer and closer.]
Q2: pTrack on Bipartite Graphs
• Computational challenges (assuming ...)
  - Iterative method: O(m);
  - Straightforward update: ...
• Example: NetFlix (2.6m users x 18k movies, 100m ratings); both need > 1 hr.
Q2: pTrack on Bipartite Graphs
• Observation #1
  - n1 authors, n2 conferences, with n1 >> n2;
  - e.g., 400k authors vs. 3.5k conferences in DBLP.
• Observation #2
  - only a small number of edges changed, touching few authors and conferences;
  - so the rank of the update is small.
• Proposed algorithm: Fast-Update
Theorem (Tong+ 2008): (1) Fast-Update has no quality loss; (2) Fast-Update is ... (complexity elided on the slide)
Q2: Speed Comparison
[Figure: log(time) in seconds across data sets; our method achieves 40x and 176x speedups]
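The slides elide Fast-Update's algebra, but the standard machinery behind this kind of incremental proximity update is the Sherman-Morrison-Woodbury identity: if the changed edges form a low-rank perturbation dW = U V, the pre-computed Q = (I - cW)^(-1) can be corrected with small matrix operations instead of a fresh inversion. A sketch under that assumption (random matrix and hypothetical weight deltas, not the paper's exact algorithm):

```python
import numpy as np

c, n = 0.9, 6
rng = np.random.default_rng(0)
W = rng.random((n, n)); W /= W.sum(axis=0)       # normalized adjacency
Q = np.linalg.inv(np.eye(n) - c * W)             # pre-computed proximity core

# Hypothetical edge changes confined to rows {1, 2} and columns {4, 5}:
# the whole change is rank-2, dW = U @ V.
U = np.zeros((n, 2)); U[1, 0] = U[2, 1] = 1.0
V = np.zeros((2, n)); V[0, 4] = 0.05; V[1, 5] = -0.03
dW = U @ V

# Woodbury: (I - c(W + UV))^{-1} = Q + c Q U (I - c V Q U)^{-1} V Q
k = U.shape[1]
Q_new = Q + c * Q @ U @ np.linalg.inv(np.eye(k) - c * V @ Q @ U) @ V @ Q

# Reference: full re-inversion from scratch
Q_direct = np.linalg.inv(np.eye(n) - c * (W + dW))
```

The update costs only a k x k inversion (k = rank of the change) plus a few thin matrix products, instead of an O(n^3) re-inversion.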
Background
RWR: Think of it as a Wine Spill
1. Spill a drop of wine on cloth; 2. it spreads/diffuses to the neighborhood.
A wine spill on cloth is like RWR on a graph: the query node is where the drop lands.
[Figure: 12-node example graph with RWR scores diffusing from the query node]
Background
Computing RWR
[Figure: the 12-node example instantiates r4 = c x W x r4 + (1-c) x e4 with c = 0.9: the n x 1 ranking vector equals the (normalized) n x n adjacency matrix times the ranking vector, plus the restart term on the n x 1 starting vector]
r_i = c W r_i + (1 - c) e_i
Footnote: the 'Maxwell Equation' for the Web [Chakrabarti]
Computing RWR
Q = (I - c W)^{-1}, so that r_i = (1 - c) Q e_i.
How to get (elements) of Q?
Footnote: 1-c is the restart probability; W is the normalized adjacency matrix.
Computing RWR
• On-the-fly
  - No pre-computation;
  - Light storage cost (W);
  - Slow on-line response: O(m x #iterations).
• Pre-compute
  - Fast on-line response;
  - Prohibitive pre-compute cost: O(n^3);
  - Prohibitive storage cost: O(n^2).
Q: How to balance on-line and off-line costs?
B_Lin: Basic Idea [Tong+ ICDM 2006]
Goal: efficiently get (elements) of Q.
[Figure: on the 12-node example, (1) find communities ({1, 2, 3, 4} and {5, ..., 12}), (2) fix the remaining cross-community edges, (3) combine]
B_Lin: Basic Idea [Tong+ ICDM 2006]
• Pre-compute stage
  - Find communities;
  - Pre-compute the within-community scores.
• On-line stage
  - Fix the influence of the bridges (cross-community links).
B_Lin: details
W = W1 + cross-community part, where W1 holds the within-community edges and the cross-community part is approximated by a low-rank factorization U S V.
(I - c W)^{-1} ~ (I - c W1 - c U S V)^{-1}
I - c W1 is easy to invert (block-diagonal); the low-rank (LRA) difference is handled by the Sherman-Morrison lemma.
B_Lin: Pre-Compute Stage
• Q: Efficiently compute and store Q.
• A: Invert a few small matrices instead of ONE BIG one.
Footnote: Q1 = (I - c W1)^{-1}
B_Lin: On-Line Stage
• Q: Efficiently recover one column of Q.
• A: A few, instead of many, matrix-vector multiplications.
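Putting the two stages together on a toy two-community graph: invert each community block separately off-line, then fold the low-rank cross-community part back in on-line with Sherman-Morrison-Woodbury. The partition and factorization here are illustrative, not the paper's exact procedure:

```python
import numpy as np

c = 0.9
# Two communities {0,1,2} and {3,4,5}; one cross-community edge (2,3).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], float)
W = A / A.sum(axis=0)

# W1: within-community part (block-diagonal); the rest is cross-community.
W1 = W.copy(); W1[:3, 3:] = 0; W1[3:, :3] = 0
dW = W - W1                               # only the two cross entries survive

# Pre-compute: invert each small community block separately.
Q1 = np.zeros_like(W)
Q1[:3, :3] = np.linalg.inv(np.eye(3) - c * W1[:3, :3])
Q1[3:, 3:] = np.linalg.inv(np.eye(3) - c * W1[3:, 3:])

# Low-rank factorization of the cross part (rank 2 here, via SVD).
U_full, s, Vt = np.linalg.svd(dW)
r = int((s > 1e-12).sum())
U, S, V = U_full[:, :r], np.diag(s[:r]), Vt[:r, :]

# On-line: Sherman-Morrison-Woodbury combine
# (I - cW1 - cUSV)^{-1} = Q1 + c Q1 U (S^{-1} - c V Q1 U)^{-1} V Q1
core = np.linalg.inv(np.linalg.inv(S) - c * V @ Q1 @ U)
Q = Q1 + c * Q1 @ U @ core @ V @ Q1

Q_direct = np.linalg.inv(np.eye(6) - c * W)
```

All the expensive inversions involve only community-sized or rank-sized matrices, which is the source of B_Lin's pre-computation and storage savings.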
Our Results: Query Time vs. Pre-Compute Time
[Figure: log query time vs. log pre-compute time]
• Quality: 90%+
• On-line: up to 150x speedup
• Pre-computation: two orders of magnitude saving
More on Scalability Issues for Querying (the spectrum of ``FastProx'')
• B_Lin: one large linear system [Tong+ ICDM06, KAIS08]
• BB_Lin: the intrinsic complexity is small [Tong+ KAIS08]
• FastUpdate: a time-evolving linear system [Tong+ SDM08, SAM08]
• FastAllDAP: multiple linear systems [Tong+ KDD07a]
• Fast-iPoG: dealing with on-line feedback [Tong+ ICDM08, CIKM09]
A5: Immunization (recap)
How to select the k 'best' nodes for immunization?
Background
M1: SIS Virus Model [Chakrabarti+ 2008]
• 'Flu'-like: Susceptible-Infectious-Susceptible.
• If the virus 'strength' s < 1/λ1,A, an epidemic cannot happen (λ1,A is the leading eigenvalue of the adjacency matrix A).
Footnote: think of s as the number of sneezes before healing.
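The threshold is easy to compute. A small sketch contrasting two assumed topologies (a star is more vulnerable than a ring because its hub inflates the leading eigenvalue):

```python
import numpy as np

def epidemic_threshold(A):
    """SIS threshold: an epidemic dies out when virus strength s < 1/lambda_1(A)."""
    lam1 = np.max(np.linalg.eigvalsh(A))   # leading eigenvalue (A symmetric)
    return 1.0 / lam1

n = 10
# Star: node 0 is the hub.           Ring: each node links to its successor.
star = np.zeros((n, n)); star[0, 1:] = star[1:, 0] = 1.0
ring = np.zeros((n, n))
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = 1.0

t_star = epidemic_threshold(star)   # lambda_1 = sqrt(n-1) = 3  -> 1/3
t_ring = epidemic_threshold(ring)   # lambda_1 = 2              -> 1/2
```

The star's lower threshold means a weaker virus already suffices for an epidemic there, which is why removing hub-like nodes is a good immunization strategy.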
M1: Optimal Method
• Select the k nodes whose absence creates the largest drop in λ1,A.
[Figure: original graph with λ1,A vs. the graph without nodes {2, 6} with the reduced λ1,A]
M1: Optimal Method
• Select the k nodes whose absence creates the largest drop in λ1,A.
• But we need to evaluate the leading eigenvalue without every candidate subset of nodes S:
  - 1,000 nodes with 10,000 edges;
  - it takes 0.01 seconds to compute λ once;
  - it takes 2,615 years to find the best 5 nodes!
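The "2,615 years" figure is plain combinatorics: every size-5 subset of 1,000 nodes needs its own eigenvalue computation. As a check:

```python
import math

n, k = 1000, 5
secs_per_eig = 0.01                          # 0.01 s per leading-eigenvalue solve
subsets = math.comb(n, k)                    # number of candidate subsets
years = subsets * secs_per_eig / (3600 * 24 * 365)
```

math.comb(1000, 5) is about 8.25 x 10^12 subsets, so the brute-force search lands right around the slide's 2,615-year estimate.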
M1: NetShield to the Rescue
Theorem (Tong+ 2009): (1) ... (statement elided on the slide)
A u = λ1,A u, where u(i) is the eigen-score of node i.
[Figure: 16-node example graph; one tightly connected group has eigen-scores around 10, the remaining nodes around 1]
Think of u(i) as PageRank or in-degree.
M1: NetShield to the Rescue (Intuition)
• Find a set of nodes S in which each node
  - (1) has a high eigen-score, and
  - (2) is diverse from the others in S.
[Figure: three candidate node sets highlighted on the 16-node example graph]
M1: NetShield to the Rescue
• Example: 1,000 nodes with 10,000 edges. NetShield takes < 0.1 seconds to find the best 5 nodes, as opposed to 2,615 years for the optimal method.
Theorem (Tong+ 2009): (1) ... (elided); (2) Br(S) is sub-modular; (3) NetShield is near-optimal (wrt max Br(S)); (4) NetShield is O(n k^2 + m).
Footnote: near-optimal means Br(S_NetShield) >= (1 - 1/e) Br(S_Opt).
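The shield value Br(S) itself is elided on the slides; in the later published NetShield work it takes the form Sv(S) = sum_{i in S} 2 λ1 u(i)^2 - sum_{i,j in S} A(i,j) u(i) u(j), which rewards high eigen-scores and penalizes picking neighbors of each other. A greedy sketch under that assumed form:

```python
import numpy as np

def netshield(A, k):
    """Greedy NetShield-style selection: repeatedly add the node with the
    largest marginal gain of the (assumed) shield value
    Sv(S) = sum 2*lam1*u_i^2 - sum_{i,j in S} A_ij*u_i*u_j."""
    lam, vecs = np.linalg.eigh(A)
    lam1 = lam[-1]
    u = np.abs(vecs[:, -1])             # leading eigenvector = eigen-scores
    n = A.shape[0]
    S = []
    for _ in range(k):
        if S:
            penalty = 2 * (A[:, S] @ u[S]) * u   # links into already-picked nodes
        else:
            penalty = np.zeros(n)
        gain = 2 * lam1 * u**2 - penalty
        if S:
            gain[S] = -np.inf           # never pick a node twice
        S.append(int(gain.argmax()))
    return S

# Two triangles joined by one bridge edge (2-3): the high-eigen-score
# bridge endpoints are the natural picks.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 1, 1, 1, 0]], float)
A[5, 2] = A[2, 5] = 0                   # keep the graph as stated above
S = netshield(A, k=2)
```

On this symmetric example the greedy pass selects the two degree-3 bridge endpoints, one from each triangle.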
Why is NetShield Near-Optimal?
[Figure: 10-node example; blue bars show the marginal benefit of deleting the blue nodes, green bars the benefit of deleting the green nodes]
Sub-modular (i.e., diminishing returns): marginal benefit A >= marginal benefit B.
Theorem: the k-step greedy algorithm for maximizing a sub-modular function guarantees (1 - 1/e) of the optimum [Nemhauser+ 78].
M1: Why is Br(S) sub-modular? (details)
[Figure: green nodes {1, 2} are already deleted; blue nodes {5, 6} are to be deleted]
Marginal benefit = (term purely from the blue nodes) - (interaction between blue and green); only the interaction (purple) term depends on {1, 2}.
The more green nodes have already been deleted, the larger the interaction term and the smaller the remaining benefit, so the marginal benefit of the left-hand configuration >= that of the right-hand one: diminishing returns.
M1: Quality of NetShield
[Figure: eigen-drop vs. k (higher is better); NetShield tracks the optimal method and stays above (1 - 1/e) x optimal]
M1: Speed of NetShield
[Figure: time vs. k on the NIPS co-authorship network (lower is better); NetShield takes 0.1 seconds where the optimal method takes > 10 days]
Scalability of NetShield
[Figure: time vs. # of edges (x 10^8) (lower is better)]
Motivation [Tong+ KDD 08 b]
• Q: How to find patterns (e.g., communities, anomalies) in a large graph?
[Figure: author-conference bipartite graph; authors John, Tom, Bob, Carl, Van, Roy; conferences KDD, ICDM, RECOMB, ISMB]
• A: Low-Rank Approximation (LRA) of the adjacency matrix of the graph.
LRA for Graph Mining
A ~ L x M x R
[Figure: the author x conference adjacency matrix A for the example graph (rows = authors, columns = conferences):
1 1 0 0
1 1 0 0
1 1 0 0
0 1 1 1
0 0 1 1
0 0 1 1]
LRA for Graph Mining: Communities
A ~ L x M x R, where L gives the author groups, R the conference groups, and M the group-group interaction.
[Figure: the decomposition on the example author-conference graph]
LRA for Graph Mining: Anomalies
[Figure: reconstructed adjacency matrix for the example author-conference graph]
The reconstruction error is high for 'Carl', so 'Carl' is abnormal.
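The anomaly flag can be reproduced with a plain rank-2 SVD on the slide's example matrix; here SVD stands in for the LRA (Colibri/CUR would use actual columns instead):

```python
import numpy as np

# Adjacency matrix from the slide: rows = authors (John, Tom, Bob, Carl,
# Van, Roy), columns = conferences. Carl bridges the two communities.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], float)

# Rank-2 reconstruction of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = (U[:, :2] * s[:2]) @ Vt[:2, :]

row_error = np.linalg.norm(A - A2, axis=1)   # per-author reconstruction error
anomaly = int(row_error.argmax())            # index 3 = Carl
```

Carl's row matches neither community pattern, so the rank-2 model reconstructs it worst, exactly the signal the slide uses.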
Challenges: How to Get (L, M, R)?
• Efficiently: in both time and space;
• Intuitively: easy to interpret;
• Dynamically: track patterns over time.
None of the existing methods fully meets this wish list!
Why Not SVD and CUR/CMD?
• SVD (optimal in the L2 and Frobenius norms)
  - Efficiency: time O(min(n^2 m, n m^2)); space: (L, R) are dense.
  - Interpretation: a linear combination of many columns.
  - Dynamic: not easy.
• CUR/CMD (example-based)
  - Efficiency: better than SVD, but redundancy in L.
  - Interpretation: actual columns from A.
  - Dynamic: not easy.
Solutions: Colibri [Tong+ KDD 08 b]
• Colibri-S: for static graphs. Basic idea: remove linear redundancy.
• Colibri-D: for dynamic graphs. Basic idea: leverage smoothness over time.
Theorem (Tong+ 2008): (1) Colibri = CUR/CMD in accuracy; (2) Colibri <= CUR/CMD in time; (3) Colibri <= CUR/CMD in space.
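Colibri-S's "remove linear redundancy" idea can be sketched as a greedy scan: keep a sampled column only if it adds a new direction to the span of the columns kept so far. This is a simplified illustration, not the paper's exact iterative-update algorithm:

```python
import numpy as np

def select_independent_columns(A, tol=1e-8):
    """Keep only columns that add new directions: a column is skipped if its
    residual after projecting onto the span of the kept columns is ~0."""
    kept = []
    basis = np.zeros((A.shape[0], 0))        # orthonormal basis of kept columns
    for j in range(A.shape[1]):
        col = A[:, j].copy()
        if basis.shape[1]:
            col -= basis @ (basis.T @ col)   # project out kept directions
        norm = np.linalg.norm(col)
        if norm > tol:                       # genuinely new direction
            kept.append(j)
            basis = np.column_stack([basis, col / norm])
    return kept

# Example matrix from the earlier slide: its last two conference columns
# are identical, i.e., linearly redundant.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], float)
cols = select_independent_columns(A)         # the duplicate column is dropped
```

Dropping the redundant column is what lets Colibri keep CUR's interpretability (actual columns) while shedding CUR's storage overhead.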
Comparison: SVD, CUR vs. Colibri (details)
Wish list (efficiency, interpretation, dynamics): SVD [Golub+ 1989] and CUR [Drineas+ 2005] each miss parts of it; Colibri [Tong+ 2008] covers all three.
Performance of Colibri-S
[Figure: time and space vs. SVD, CUR, CMD]
• Accuracy: the same, 91%+
• Time: 12x faster than CMD, 28x faster than CUR
• Space: ~1/3 of CMD, ~10% of CUR
Performance of Colibri-D
[Figure: time vs. # of changed columns; Colibri-D achieves up to 112x speedup over Colibri-S (the prior best method); CMD shown for reference]
Network traffic data: 21,837 nodes; 1,220 hours; 22,800 edges/hr. Accuracy: the same, 93%+.
Some of My Other Work
• #1: FastDAP (KDD07a): predict link direction.
• #2: Graph X-Ray (KDD07b): best-effort pattern match in attributed graphs.
• #3: GhostEdge (KDD08a): classification in sparsely labeled networks.
• #4: TANGENT (KDD09): ``surprise-me'' recommendation.
• #5: GMine (VLDB06): interactive graph visualization and mining.
• #6: Graphite (ICDM08): visual query system for attributed graphs.
• #7: T3/MT3 (CIKM08): mine complex time-stamped events.
• #8: BlurDetect (ICME04): determine whether or not, and how, an image is blurred.
• #9: MRBIR (MM04, TIP06): manifold-ranking based image retrieval.
• #10: GBMML (CVPR05, ACM Multimedia 05): graph-based multiple-modality learning.
Overview (this talk + others)
Tasks: static graphs | dynamic graphs | images
• Querying, static: CePS, iPoG, Basset, DAP, G-Ray, Graphite, TANGENT, FastRWR (KDD06, ICDM06, KDD07a, KDD07b, ICDM08, KAIS08, CIKM09, KDD09)
• Querying, dynamic: pTrack, cTrack, Fast-Update (SDM08, SAM08)
• Mining, static: NetShield, Colibri-S, GhostEdge, GMine, Pack, Shiftr (VLDB06, KDD08a, KDD08b, SDM-LinkAnalysis 09)
• Mining, dynamic: T3/MT3, Colibri-D (KDD08a, CIKM08)
• Images: MRBIR, UOLIR (MM04, CVPR05); BlurDetect, GBMML, iQuality, iExpertise (ICME04, ICIP04, MMM05, PCM05, MM05)
What is Next?
Goals | Step 1 (this talk) | Step 2 (medium term) | Step 3 (long term)
• G1 Querying: CePS, iPoG, pTrack | recommendation, interpretable querying | querying rich data
• G2 Mining: NetShield, Colibri | immunization, interpretable mining | mining rich data
• G3 Scalability: all of the above in O(m) or better (single machine) | scalable by parallelism | scalable on rich data
Research Theme: Help users to understand and utilize large graph-related data
Current Recommendation (Focus on Relevance)
[Figure: movie similarity graph spanning sci-fi, comedy, horror, and adventure; red nodes mark what (most) existing algorithms recommend]
Footnote: nodes are movies; edges are similarity between movies.
``Broad Spectrum Recommendation'' (focus on completeness = relevance + diversity + novelty)
[Figure: the same movie graph; the recommendations now spread across adventure, sci-fi, comedy, and horror]
Interpretable Recommendation
• Current recommendation: Amazon.com recommends ... (based on items you purchased or told us you own).
• Interpretable recommendation: 'Amazing.com' recommends because of the topics it has. You are interested in: graph mining, linear algebra. You might be interested in: Hadoop, submodularity.
Immunization
• This talk: SIS (e.g., flu).
• In the future:
  - immunize for SIR (e.g., chicken pox);
  - immunize in dynamic settings: dynamics of the graph (edges/nodes changing) and dynamics of the virus (infection/healing rates changing).
Footnote: SIR stands for susceptible-infectious-recovered.
Interpretable Mining
• Find communities;
• Find a few nodes/edges to describe each community and the relationship between two communities.
Footnote: nodes are actors; edges indicate co-playing in a movie.
Querying Rich Graphs (e.g., geo-coded, attributed)
What is the difference between North America and Asia?
[Figure: phone/MSN communication graph with teenager and adult nodes]
Mining Rich Graphs (e.g., geo-coded, attributed)
How to find patterns? (e.g., communities, anomalies such as a telemarketer)
[Figure: the same kind of communication graph]
Scalability
• Two orthogonal efforts:
  - E1: O(m) or better on a single machine;
  - E2: parallelism (e.g., Hadoop): implementation, decoupling, analysis.
My Collaboration Graph (During Ph.D. Study)
[Figure: collaboration graph connecting real data, users, and scalability through the projects CePS, iPoG, Basset, pTrack, cTrack, BLin, BBLin, NBLin, FastUpdate, Fast-iPoG, Colibri, GhostEdge, Graphite, Pack, TANGENT, GMine, G-Ray, DAP, T3, MT3, NetShield]
Legends: green = querying; yellow = mining; purple = others.
Q & A
Thank you!