fast discovery of connection subgraphs

39
KDD04 Faloutsos, McCurley & Tom kins 1 Carnegie Mellon Fast Discovery of Connection Subgraphs Christos Faloutsos (CMU) Kevin McCurley (IBM) Andrew Tomkins (IBM)

Upload: adria-jefferson

Post on 31-Dec-2015

25 views

Category:

Documents


0 download

DESCRIPTION

Fast Discovery of Connection Subgraphs. Christos Faloutsos (CMU) Kevin McCurley (IBM) Andrew Tomkins (IBM). Outline. Introduction / Motivation Survey Proposed Method Algorithms Experiments Conclusions. Introduction. What are the best path s between ‘Kidman’ and ‘Diaz’?. Diaz. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 1

Carnegie Mellon

Fast Discovery of Connection Subgraphs

Christos Faloutsos (CMU)Kevin McCurley (IBM)Andrew Tomkins (IBM)

Page 2: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 2

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 3: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 3

Carnegie Mellon

Introduction

• What are the best paths between ‘Kidman’ and ‘Diaz’?

Kidman

Diaz

Page 4: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 4

Carnegie Mellon

Problem definition

• Given a graph, and two nodes s and t, and a 'budget' b of nodes

• Find the best b nodes that capture the relationship between s and t

s t

f

Page 5: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 5

Carnegie Mellon

Problem definition

• Given a graph, and two nodes s and t, and a 'budget' b of nodes

• Find the best b nodes that capture the relationship between s and t

s t

f

Page 6: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 6

Carnegie Mellon

Problem definition

• Part 1: How to quantify the goodness?

• Part 2: How to pick ‘best few’ nodes?

• Part 3: Scalability: large graphs (10**7 nodes)

s t

f

Page 7: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 7

Carnegie Mellon

Survey

• Graph Partitioning– [Karypis+Kumar]; [Newman+];

– [Virtanen]; …

• Communities– [Flake+]; [Tomkins, Kleinberg+]

• External distances [Palmer+]

Page 8: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 8

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 9: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 9

Carnegie Mellon

• part 1: measuring goodness:– electricity

• part 2: finding good paths– dynamic programming

• part 3: scalability– heuristics

Proposed method

Page 10: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 10

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

Page 11: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 11

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

• Why not net. flow?

Page 12: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 12

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

• Why not net. flow?

• Why not plain ‘voltages’?

+1V 0V

Page 13: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 13

Carnegie Mellon

s t

f

Electricity

• Why not shortest path?

• Why not net. flow?

• Why not plain ‘voltages’?

+1V 0V

+0.5V

Page 14: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 14

Carnegie Mellon

s t

f

...

Electricity, cont’d

• Proposed method: voltages with universal sink:– ~ ‘tax collector’

• goodness of a path:

• its electric current(*)!+1V 0V

0V

Page 15: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 15

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 16: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 16

Carnegie Mellon

Electricity – Algorithm

• Voltages/Amperages can be computed easily ( O(E) )

• without universal sink:v(i) = Σumj [v(j) * C(i,j) / C(i,*) ]

i != source, sink

v(source)=1; v(sink)=0

Page 17: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 17

Carnegie Mellon

Electricity – Algorithm

With universal sink:v(i) = 1/(1+a) Σumj [v(j) * C(i,j) / C(i,*) ]

(~ insensitive to a (=1))

Page 18: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 18

Carnegie Mellon

Given the voltages and amperages

• Which b nodes to keep?

• (and how to spot them quickly?)

Part 2: DisplayGen

Page 19: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 19

Carnegie Mellon

Part 2: DisplayGen

Page 20: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 20

Carnegie Mellon

Part 2: DisplayGen

• ‘delivered current’ of a path:– ~ ‘how many electrons’ choose this path

=4/5 *1/2A

Page 21: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 21

Carnegie Mellon

Part 2: DisplayGen

• find subgraph that max’s delivered current

• Incrementally, add nodes with max marginal delivered current

Page 22: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 22

Carnegie Mellon

Part 3: Scalability

‘CandidateGen’

• Starting from the large graph

• Eliminate nodes that are too far away to matter

• How?

Page 23: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 23

Carnegie Mellon

s tsource sink

Part 3: Scalability

• By successive, careful expansions

Page 24: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 24

Carnegie Mellon

s t

Part 3: Scalability

Page 25: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 25

Carnegie Mellon

s t

Part 3: Scalability

Page 26: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 26

Carnegie Mellon

s t

Part 3: Scalability

Page 27: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 27

Carnegie Mellon

Pseudo-code

Until (stoppingCriterion) use pickHeuristic() to pick a node n

expand node n

Page 28: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 28

Carnegie Mellon

Pseudo-code

pickHeuristic() favors• Nearby nodes with• Strong connections to source or sink

and with• Small degree

Page 29: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 29

Carnegie Mellon

Outline

• Introduction / Motivation

• Survey

• Proposed Method

• Algorithms

• Experiments

• Conclusions

Page 30: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 30

Carnegie Mellon

Experiments

• on large real graph – ~15M nodes, ~100M edges, weighted

– ‘who co-appears with whom’ (from 500M web pages)

• Q1: Quality of ‘voltage’ approach?

• Q2: Speed/accuracy trade-off?

Page 31: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 31

Carnegie Mellon

Q1: Quality

• Actors (A); Computer-Scientists (CS)

• Kidman-Diaz (A-A)

• Negreponte-Palmisano (CS-CS)

• Turing-Stone (CS-A)

Page 32: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 32

Carnegie Mellon

(A-A) Kidman-Diaz

Strong, direct link

• What are the best paths between ‘Kidman’ and ‘Diaz’?

Kidman

Diaz

Page 33: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 33

Carnegie Mellon

CS-CS: Negreponte - Palmisano

NN SP

• Mainly: CEOs of major Computer companies (Dell, Gates, Fiorina, ++)

Page 34: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 34

Carnegie Mellon

CS-CS: Negreponte - Palmisano

NNEsther Dyson Louis Gerstner

SP

Page 35: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 35

Carnegie Mellon

CS-A: Turing - Stone

TuringAnderson

Stone

Page 36: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 36

Carnegie Mellon

Outline

• Introduction / Motivation

• ...

• Experiments– Q1: quality

– Q2: speed/accuracy trade-off

• Conclusions

Page 37: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 37

Carnegie Mellon

Speed/Accuracy Trade-off

number of nodes kept (‘b’)

deliveredcurrent Kleinberg-Newell

Rivest-HoffmanTuring-StoneKidman-Diaz

Page 38: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 38

Carnegie Mellon

Speed/accuracy trade-off

• 80/20-like rule:

• the first few nodes/paths contribute the vast majority of ‘delivered current’

• Thus: CandidateGen makes sense

Page 39: Fast Discovery of Connection  Subgraphs

KDD04 Faloutsos, McCurley & Tomkins 39

Carnegie Mellon

Conclusions

• Defined the problem• Part 1: Electricity-based method to measure

quality• Part 2: Dynamic programming to spot best

paths (‘DisplayGen’)• Part 3: Scalability with good accuracy

(‘CandidateGen’)• Operational system