Post on 31-Mar-2015
Streaming Graph Partitioning KDD 8/15
Streaming Graph Partitioning for Large Distributed Graphs
Isabelle Stanton, UC Berkeley
Gabriel Kliot, Microsoft Research XCG
Motivation
• Modern graph datasets are huge
– The web graph had over a trillion links in 2011. Now?
– Facebook has “more than 901 million users with average degree 130”
– Protein networks
Motivation
• We still need to perform computations, so we have to deal with large data
– PageRank (and other matrix-multiply problems)
– Broadcasting status updates
– Database queries
– And on and on and on…
Graph has to be distributed across a cluster of machines!
Motivation
• Edges cut correspond (approximately) to communication volume required
• Too expensive to move data on the network
– Interprocessor communication: nanoseconds
– Network communication: microseconds
• The data has to be loaded onto the cluster at some point…
• Can we partition while we load the data?
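The edge-cut measure from the first bullet can be made concrete with a short Python sketch. The function name `edge_cut`, the edge-list representation, and the 4-cycle example are illustrative assumptions, not part of the talk.

```python
def edge_cut(edges, assignment):
    """Count edges whose two endpoints live on different machines.

    edges      -- iterable of (u, v) pairs (hypothetical representation)
    assignment -- dict mapping each vertex to a machine index
    """
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Hypothetical example: a 4-cycle split across two machines.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
assignment = {0: 0, 1: 0, 2: 1, 3: 1}
cut = edge_cut(edges, assignment)  # edges (1, 2) and (3, 0) cross machines
```

Each cut edge stands in for at least one message that has to cross the network, which is why the talk uses the cut size as a proxy for communication volume.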
High Level Background
• Graph partitioning is NP-hard on a good day
• But then we made it harder:
– Graphs like social networks are notoriously difficult to partition (expander-like)
– Large data sets drastically reduce the amount of computation that is feasible – O(n) or less
– The partitioning algorithms need to be parallel and distributed
The Streaming Model
Graph Stream → Partitioner → machines $M_1, M_2, \dots, M_k$
Graph is ordered:
• Random
• Breadth-First Search
• Depth-First Search
Goal: Generate an approximately balanced k-partitioning
Each machine holds $C = (1+\varepsilon)\frac{n}{k}$ nodes
Possible Buffer of size
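As a rough illustration of the model, the sketch below produces the three stream orderings and the per-machine capacity. The adjacency-dict representation, the function names, the choice to round the capacity up, and the restart-per-component behaviour are all assumptions made for this example.

```python
import math
import random
from collections import deque


def capacity(n, k, eps):
    """Per-machine capacity C = (1 + eps) * n / k (rounded up: an assumption)."""
    return math.ceil((1 + eps) * n / k)


def stream_order(adj, kind, seed=0):
    """Return the vertices of `adj` in 'random', 'bfs', or 'dfs' order."""
    if kind not in ("random", "bfs", "dfs"):
        raise ValueError("kind must be 'random', 'bfs', or 'dfs'")
    vertices = sorted(adj)
    if kind == "random":
        random.Random(seed).shuffle(vertices)
        return vertices
    order, seen = [], set()
    for root in vertices:              # restart in every connected component
        if root in seen:
            continue
        frontier = deque([root])
        while frontier:
            # A queue gives BFS; using the same deque as a stack gives DFS.
            v = frontier.popleft() if kind == "bfs" else frontier.pop()
            if v in seen:
                continue
            seen.add(v)
            order.append(v)
            nbrs = sorted(adj[v])
            if kind == "dfs":
                nbrs.reverse()         # so the smallest neighbor is explored first
            frontier.extend(u for u in nbrs if u not in seen)
    return order


# Hypothetical 4-vertex tree: 0 - 1 - 3 and 0 - 2.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
bfs = stream_order(adj, "bfs")
dfs = stream_order(adj, "dfs")
```

A partitioner in this model consumes such an ordering one vertex at a time and must commit each vertex to a machine that still has room.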
Lower Bounds On Orderings
Best balanced -partition cuts edges
Adversarial Ordering
- Give every other vertex
- See no edges till !
- Can’t compete
DFS Ordering
- Stream is connected
- Greedy will do optimally
Random Ordering
- Birthday paradox: won’t see edges until
- Still can’t compete with edges cut
Theory says these types of algorithms can’t do well
Current Approach in Real Systems
• Totally ignore edges and hash vertex ID
• Pro
– Fast to locate data
– Doesn’t require a complex DHT or synchronization
• Con
– Hashing the vertex ID cuts a fraction of the edges for any order
– Great simple approximation for MAX-CUT
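A minimal sketch of the hashing baseline follows. The use of MD5 as a stable hash and both function names are assumptions for illustration; real systems use their own hash functions. If vertices land on the k machines uniformly and independently, a non-loop edge has both endpoints on the same machine with probability 1/k, so the expected cut fraction is 1 - 1/k.

```python
import hashlib


def hash_partition(vertex_id, k):
    """Map a vertex ID to one of k machines, ignoring all edges.

    MD5 is used only as a stable, well-mixed hash (an illustrative choice).
    """
    digest = hashlib.md5(str(vertex_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k


def expected_cut_fraction(k):
    """Expected fraction of edges cut under uniform random placement."""
    return 1 - 1 / k


machine = hash_partition("user:42", 16)   # same answer on every machine
fraction = expected_cut_fraction(16)      # 0.9375
```

Because the placement depends only on the ID, any machine can locate any vertex without a lookup table, which is the "Pro" on the slide; the cost is that nearly all edges are cut once k is large.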
Our Approach
• Evaluate 16 natural heuristics on 21 datasets with each of the three orderings with varying numbers of partitions
• Find out which heuristics work on each graph
• Compare these with the results of
– Random Hashing to get worst case
– METIS to get ‘best’ offline performance
Caveats
• METIS is a heuristic, not true lower bound
– Does fine in practice
– Available online for reproducing results
• Used publicly available datasets
– Public graph datasets tend to be much smaller than what companies have
• Using meta-data for partitioning can be good
– Partitioning the web graph by URL
– Using geographic location for social network users
Heuristics
• Balanced
• Chunking
• Hashing
• (weighted) Deterministic Greedy
• (weighted) Randomized Greedy
• Triangles
• Balance Big
Uses a Buffer of size
• Prefer Big
• Avoid Big
• Greedy EvoCut
Weight functions
– Unweighted
– Linear weighted
– Exponentially weighted
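To make the weighted greedy idea concrete, here is a hedged sketch of Deterministic Greedy with a linear weight (the LDG heuristic named later in the talk). The slide's weight formulas did not survive, so the linear penalty 1 - load/C used here is an assumption, as are the function name, the adjacency-dict input, and the tie-break toward the lighter machine.

```python
def ldg_partition(stream, adj, k, capacity):
    """Place each streamed vertex on the machine that maximizes
    (neighbors already placed there) * (1 - load / capacity).

    The linear weight and the load-based tie-break are assumptions.
    """
    assignment = {}
    loads = [0] * k
    for v in stream:
        # Count how many already-placed neighbors sit on each machine.
        placed = [0] * k
        for u in adj[v]:
            if u in assignment:
                placed[assignment[u]] += 1
        best, best_key = None, None
        for i in range(k):
            if loads[i] >= capacity:       # machine is full, skip it
                continue
            score = placed[i] * (1 - loads[i] / capacity)
            key = (score, -loads[i])       # ties go to the lighter machine
            if best_key is None or key > best_key:
                best, best_key = i, key
        if best is None:
            raise ValueError("capacity too small for the stream")
        assignment[v] = best
        loads[best] += 1
    return assignment


# Hypothetical graph: two triangles joined by the single edge (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
parts = ldg_partition([0, 1, 2, 3, 4, 5], adj, k=2, capacity=3)
```

On this toy input the sketch keeps each triangle on one machine and cuts only the bridging edge. The work per vertex is proportional to its degree plus k, which fits the near-linear budget the talk asks for.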
Datasets
• Includes finite element meshes, citation networks, social networks, web graphs, protein networks and synthetically generated graphs
• Sizes: 297 vertices to 41.7 million vertices
• Synthetic graph models
– Barabasi-Albert (Preferential Attachment)
– RMAT (Kronecker)
– Watts-Strogatz
– Power law-Clustered
• Biggest graphs: LiveJournal and Twitter
Experimental Method
• For each graph, heuristic, and ordering, partition into 2, 4, 8, 16 pieces
• Compare with a random cut – upper bound
• Compare with METIS – lower bound
• Performance was measured by:
\[
\frac{\#\,\text{edges cut by random cut} - \#\,\text{edges cut by heuristic}}{\#\,\text{edges cut by random cut} - \#\,\text{edges cut by METIS}}
\]
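The performance measure is a simple ratio, sketched below. The function name and the cut counts in the example are hypothetical, chosen only to show the arithmetic.

```python
def cut_improvement(random_cut, heuristic_cut, metis_cut):
    """Fraction of the gap between a random cut and METIS closed by a heuristic.

    0.0 means no better than random; 1.0 means as good as METIS.
    Values below 0 mean the heuristic is worse than a random cut.
    """
    if random_cut == metis_cut:
        raise ValueError("random cut and METIS cut coincide; ratio undefined")
    return (random_cut - heuristic_cut) / (random_cut - metis_cut)


# Hypothetical counts: random cuts 1000 edges, the heuristic 400, METIS 200.
score = cut_improvement(1000, 400, 200)  # (1000 - 400) / (1000 - 200)
```

Under this measure a heuristic can score below zero, which is how a result such as "worse than Random Cut" shows up.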
Heuristic Results
Best heuristic, LDG, gets an average improvement of 76% over all datasets!
[Figure: panels for Synthetic, Social network, and Finite element mesh graphs; series Hash and METIS; orderings BFS, DFS, Random]
Scaling in the Size of Graphs: Exploiting Synthetic Graphs
[Figure: series LDG, Hash, METIS]
More Observations
• BFS is a superior ordering for all algorithms
• Avoid Big does 46% WORSE on average than Random Cut
• Further experiments showed Linear Det. Greedy has identical performance to Det. Greedy with load-based tie breaking.
Results on a Real System
• Compared the streamed partitioning with random hashing on SPARK, a distributed cluster computation system (http://www.spark-project.org/)
• Used 2 datasets
– LiveJournal: 4.6 million users, 77 million edges
– Twitter: 41.7 million users, 1.468 billion edges
• Computed the PageRank of each graph
Results on SPARK
                    LJ Hash    LJ Stream    Twitter Hash    Twitter Stream
Naïve PR Mean       296.2 s    181.5 s      1199.4 s        969.3 s
Naïve PR STD        5.5 s      2.2 s        81.2 s          16.9 s
Combiner PR Mean    155.1 s    110.4 s      599.4 s         486.8 s
Combiner PR STD     1.5 s      0.8 s        14.4 s          5.9 s
LJ Improvement: Naïve – 38.7%, Combiner – 28.8%
Twitter Improvement: Naïve – 19.1%, Combiner – 18.8%
LiveJournal – 4.6 million users, 77 million edges
Twitter – 41.7 million users, 1.468 billion edges
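The improvement percentages appear to be relative reductions in mean PageRank runtime, i.e. (hash - stream) / hash. That definition is an assumption inferred from the numbers, and the function and dictionary names below are illustrative; the timings are the means reported in the table.

```python
def runtime_improvement(hash_seconds, stream_seconds):
    """Relative runtime reduction of streamed partitioning over hashing, in percent."""
    return 100 * (hash_seconds - stream_seconds) / hash_seconds


# Mean PageRank runtimes in seconds, (hash, stream), copied from the table.
mean_runtimes = {
    ("LJ", "naive"): (296.2, 181.5),
    ("LJ", "combiner"): (155.1, 110.4),
    ("Twitter", "naive"): (1199.4, 969.3),
    ("Twitter", "combiner"): (599.4, 486.8),
}

improvements = {
    key: round(runtime_improvement(h, s), 1)
    for key, (h, s) in mean_runtimes.items()
}
```

Under this definition three of the four values reproduce the reported figures. The Twitter naïve case works out to roughly 19.2%, so the reported 19.1% was presumably truncated rather than rounded.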
Streaming graph partitioning is a really nice, simple, effective preprocessing step.
Where to now?
• Can we explain theoretically why the greedy algorithm performs so well?*
• What heuristics work better?
• What heuristics are optimal for different classes of graphs?
• Use multiple parallel streams!
• Implement in real systems!
*Work under submission: I. Stanton, Streaming Balanced Graph Partitioning Algorithms for Random Graphs
isabelle@eecs.berkeley.edu
Acknowledgements
• David B. Wecker
• Burton Smith
• Reid Andersen
• Nikhil Devanur
• Sameh Elnikety
• Sreenivas Gollapudi
• Yuxiong He
• Rina Panigrahy
• Yuval Peres
All at MSR
• Satish Rao
• Virginia Vassilevska Williams
• Alexandre Stauffer
• Ngoc Mai Tran
• Miklos Racz
• Matei Zaharia
All at Berkeley - CS and Statistics
Supported by NSF and NDSEG fellowships, NSF grant CCF-0830797, and an internship at Microsoft Research’s eXtreme Computing Group.