Leopard: Lightweight Partitioning and Replication for Dynamic Graphs
Jiewen Huang and Daniel Abadi, Yale University
Facebook Social Graph
Social Graphs
Web Graphs
Semantic Graphs
Many systems use hash partitioning
● Results in many edges being “cut”
Given a graph G and an integer k, partition the vertices into k disjoint sets such that:
● as few cuts as possible
● as balanced as possible
Graph Partitioning
NP Hard
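The two objectives in the problem statement can be measured directly. The following is a minimal sketch, not part of the talk: the toy graph, the `v % k` hash rule, and the function names `edge_cut` and `imbalance` are assumptions chosen for illustration.

```python
from collections import Counter

def edge_cut(edges, part):
    """Count edges whose endpoints sit in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(part, k):
    """Largest partition size divided by the ideal size n/k (1.0 = perfectly balanced)."""
    sizes = Counter(part.values())
    ideal = len(part) / k
    return max(sizes.values()) / ideal

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
k = 2

# Hash-style partitioning (assumed rule): vertex v goes to partition v mod k.
hashed = {v: v % k for v in range(6)}
# A partitioning that follows the structure of the graph.
clustered = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

hash_cut = edge_cut(edges, hashed)
clustered_cut = edge_cut(edges, clustered)
print(hash_cut, clustered_cut, imbalance(hashed, k), imbalance(clustered, k))
```

Both assignments are equally balanced on this toy graph, but the hash-style assignment cuts most of the edges while the structure-aware one cuts only the bridge between the two triangles.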
Multilevel scheme Coarsening phase
State of the Art
The only constant is change.
-------- Heraclitus
To Make the Problem more Complicated
Social graphs: new people and friendships
Semantic Web graphs: new knowledge
Web graphs: new websites and links
Dynamic Graphs
A
Partition 1 Partition 2
Is partition 1 still the better partition for A?
Repartitioning the entire graph upon every change is way too expensive
New Framework
Leopard:
● Locally reassess partitioning as a result of changes without a full re-partitioning
● Integrates consideration of replication with partitioning
Outline
Background and Motivation
LEOPARD
Overview
Computation Skipping
Replication
Experiments
Algorithm Overview
For each added/deleted edge <V1, V2>
Compute best partition for V1 using a heuristic
Re-assign V1 if needed
The same for V2
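The loop above can be sketched as a small single-process class. This is only an illustration under stated assumptions: the class name `LocalPartitioner`, the `place` helper, the tie-breaking rule, and the toy vertices are hypothetical, and the scoring rule is the simple "# neighbours * (1 - # vertices / capacity)" heuristic used for presentation in this talk, with the vertex being scored excluded from the vertex count.

```python
from collections import defaultdict

class LocalPartitioner:
    """Toy sketch of per-edge local reassessment (names are assumptions)."""

    def __init__(self, k, capacity):
        self.k = k
        self.capacity = capacity
        self.part = {}                # vertex -> partition index
        self.adj = defaultdict(set)   # vertex -> set of neighbours
        self.sizes = [0] * k          # number of vertices per partition

    def place(self, v, p):
        """Put v in partition p and keep the partition sizes up to date."""
        old = self.part.get(v)
        if old is not None:
            self.sizes[old] -= 1
        self.part[v] = p
        self.sizes[p] += 1

    def score(self, v, p):
        """# neighbours of v in p * (1 - # other vertices in p / capacity)."""
        neighbours = sum(1 for u in self.adj[v] if self.part.get(u) == p)
        others = self.sizes[p] - (1 if self.part.get(v) == p else 0)
        return neighbours * (1 - others / self.capacity)

    def reassess(self, v):
        current = self.part.get(v)
        # Highest score wins; ties keep v where it is, then favour emptier partitions.
        best = max(range(self.k),
                   key=lambda p: (self.score(v, p), p == current, -self.sizes[p]))
        if best != current:
            self.place(v, best)

    def add_edge(self, v1, v2):
        self.adj[v1].add(v2)
        self.adj[v2].add(v1)
        self.reassess(v1)   # compute the best partition for V1, re-assign if needed
        self.reassess(v2)   # the same for V2

    def remove_edge(self, v1, v2):
        self.adj[v1].discard(v2)
        self.adj[v2].discard(v1)
        self.reassess(v1)
        self.reassess(v2)

# Hypothetical usage: x starts alone in partition 0, y and z in partition 1.
g = LocalPartitioner(k=2, capacity=4)
g.place("x", 0)
g.place("y", 1)
g.place("z", 1)
g.add_edge("y", "z")   # both endpoints stay put
g.add_edge("x", "y")   # x now has its only neighbour in partition 1 and moves there
print(g.part, g.sizes)
```

A real deployment would look up remote vertex locations over the network; here everything lives in local dictionaries to keep the control flow visible.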
Example: Adding an Edge
AB
Partition 1 Partition 2
Compute the Partition for B
A
B
Partition 1: # neighbours: 1, # vertices: 5
Partition 2: # neighbours: 3, # vertices: 3
Goals: (1) few cuts and (2) balanced
Heuristic: # neighbours * (1 - # vertices / capacity)
Partition 1: 1 * (1 - 5/6) = 0.17
Partition 2: 3 * (1 - 3/6) = 1.5 (higher score)
This heuristic is simple for the sake of presentation. More advanced heuristics are discussed in the paper.
Compute the Partition for A
A
B
Partition 1: # neighbours: 1, # vertices: 4
Partition 2: # neighbours: 2, # vertices: 4
Goals: (1) few cuts and (2) balanced
Heuristic: # neighbours * (1 - # vertices / capacity)
Partition 1: 1 * (1 - 4/6) = 0.33
Partition 2: 2 * (1 - 4/6) = 0.67 (higher score)
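The score comparisons for B and A in this example can be reproduced with a few lines. The function name `score` and the constant `CAPACITY` are illustrative names; the inputs are the neighbour and vertex counts given in the example, with a capacity of 6.

```python
def score(neighbours, vertices, capacity):
    """Simple presentation heuristic: neighbours * (1 - vertices / capacity)."""
    return neighbours * (1 - vertices / capacity)

CAPACITY = 6

# (neighbours, vertices) per partition for vertex B, then for vertex A.
b_scores = {1: score(1, 5, CAPACITY), 2: score(3, 3, CAPACITY)}
a_scores = {1: score(1, 4, CAPACITY), 2: score(2, 4, CAPACITY)}

# The partition with the highest score is chosen for each vertex.
best_b = max(b_scores, key=b_scores.get)
best_a = max(a_scores, key=a_scores.get)

print({p: round(s, 2) for p, s in b_scores.items()}, "-> B goes to partition", best_b)
print({p: round(s, 2) for p, s in a_scores.items()}, "-> A goes to partition", best_a)
```

The first factor rewards placing a vertex next to its neighbours (fewer cuts); the second shrinks as a partition fills up (better balance).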
Example: Adding an Edge
B
Partition 1 Partition 2
A
(1) B stays put
(2) A moves to partition 2
Outline
Background and Motivation
Leopard
Overview
Computation Skipping
Replication
Experiments
Computation cost
For each new edge, for both vertices involved in the edge:
● Calculate the heuristic for each partition (may involve communication for a remote vertex location lookup)
Computation Skipping
Observation: As the number of neighbors of a vertex increases, the influence of a new neighbor decreases.
Computation Skipping
Basic Idea: Accumulate changes for a vertex. If the accumulated changes exceed a certain threshold, recompute the partition for the vertex.
For example, threshold: # accumulated changes / # neighbors = 20%.
(1) Compute the partition when V has 10 neighbors. Then 2 new edges are added for V: 2 / 12 = 17% < 20%. Don’t recompute.
(2) When 1 more new edge is added for V: 3 / 13 = 23% > 20%. Recompute the partition for V and reset # accumulated changes to 0.
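The threshold rule in this example can be sketched as a per-vertex counter. The class name `SkipTracker` and its fields are hypothetical, and the sketch only handles added edges; how deletions are counted is left out here.

```python
class SkipTracker:
    """Per-vertex bookkeeping for computation skipping (illustrative names)."""

    def __init__(self, neighbours, threshold=0.2):
        self.neighbours = neighbours   # current number of neighbours
        self.pending = 0               # changes accumulated since the last recomputation
        self.threshold = threshold

    def on_new_edge(self):
        """Record one new edge; return True if the partition should be recomputed."""
        self.neighbours += 1
        self.pending += 1
        if self.pending / self.neighbours > self.threshold:
            self.pending = 0           # reset after triggering a recomputation
            return True
        return False

# V had 10 neighbours when its partition was last computed.
v = SkipTracker(neighbours=10)
decisions = [v.on_new_edge() for _ in range(3)]
print(decisions)   # 1/11 and 2/12 stay under 20%, 3/13 exceeds it
```

Because the ratio is taken against the current neighbour count, high-degree vertices trigger recomputation far less often per added edge than low-degree ones.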
Outline
Background and Motivation
Leopard
Overview
Computation Skipping
Replication
Experiments
Goals of replication:
fault tolerance (k copies for each data point/block)
further cut reduction
Replication
It takes two parameters:
● minimum: fault tolerance
● average: cut reduction
Minimum-Average Replication
Example
# copies    vertices
2           A, C, D, E, H, J, K, L
3           F, I
4           B, G
min = 2, average = 2.5
first copy
replica
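The two parameters of this example can be checked from the copy counts in the table. This is only an arithmetic check; the variable names are illustrative.

```python
# Number of copies -> vertices with that many copies, as listed in the example.
copies = {2: "ACDEHJKL", 3: "FI", 4: "BG"}

# Flatten to one copy count per vertex.
per_vertex = {v: n for n, vertices in copies.items() for v in vertices}

minimum = min(per_vertex.values())
average = sum(per_vertex.values()) / len(per_vertex)
print(minimum, average)   # 30 copies over 12 vertices
```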
How Many Copies?
A
Scores of each partition:
Partition 1: 0.1, Partition 2: 0.2, Partition 3: 0.3, Partition 4: 0.4
minimum = 2, average = 3
Minimum requirement: 2 copies. What about the remaining partitions?
Always keep the last n computed scores.
Comparing against Past Scores
[Figure: past scores sorted from high to low; visible values: 0.9, 0.87, 0.4, 0.3, 0.29, 0.22, 0.2, 0.11, 0.1, with other values elided]
minimum = 2, average = 3
cutoff: the top (avg - 1)/(k - 1) fraction of scores
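One way to read the cutoff rule is sketched below. Several details are assumptions rather than statements from the talk: the class name `ReplicaPlanner`, the way the rank is rounded, comparing with `>=`, and recording a vertex's scores only after its decision. The history and the vertex scores are hypothetical values, not the ones from the slides.

```python
from collections import deque

class ReplicaPlanner:
    """Decide which partitions hold a copy of a vertex (illustrative sketch)."""

    def __init__(self, k, minimum, average, window):
        self.k = k
        self.minimum = minimum
        self.average = average
        self.recent = deque(maxlen=window)   # the last n computed scores

    def cutoff(self):
        """Score at the boundary of the top (avg - 1)/(k - 1) fraction of past scores."""
        if not self.recent:
            return float("inf")              # no history yet: only the minimum copies
        ranked = sorted(self.recent, reverse=True)
        rank = int((self.average - 1) * len(ranked) / (self.k - 1))
        return ranked[max(rank, 1) - 1]

    def choose(self, scores):
        """scores: partition -> score for one vertex. Returns partitions holding a copy."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        chosen = ranked[:self.minimum]       # minimum copies, for fault tolerance
        cut = self.cutoff()
        # Extra copies only where the score is high relative to past scores.
        chosen += [p for p in ranked[self.minimum:] if scores[p] >= cut]
        self.recent.extend(scores.values())
        return chosen

planner = ReplicaPlanner(k=4, minimum=2, average=3, window=9)
planner.recent.extend([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])  # hypothetical history
cut = planner.cutoff()                       # 6th highest of 9 past scores
copies = planner.choose({1: 0.05, 2: 0.5, 3: 0.7, 4: 0.8})
print(cut, copies)
```

The intuition for the fraction: of the k - 1 partitions that do not hold the first copy, avg - 1 should hold a replica on average, so a replica is placed when a score ranks within that share of recently seen scores.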
Comparing against Past Scores
[Figure: the sorted past scores, with the 30th and 31st highest scores marked]
# copies: 2
cutoff: 30th highest score
Comparing against Past Scores
# copies: 3
cutoff: 30th highest score
Comparing against Past Scores
# copies: 4
cutoff: 30th highest score
Outline
Background and Motivation
Leopard
Experiments
Experiment Setup
● Comparison points
○ Leopard with FENNEL heuristics
○ One-pass FENNEL (no vertex reassignment)
○ METIS (static graphs)
○ ParMETIS (repartitioning for dynamic graphs)
○ Hash Partitioning
● Graph Datasets
○ Type: social graphs, collaboration graphs, Web graphs, email graphs, and synthetic graphs
○ Size: up to 66 million vertices and 1.8 billion edges
Edge Cut
Computation Skipping
Effect of Replication on Edge Cut
Thanks!
Q & A