Big Graph Processing on Cloud
Jeffrey Xu Yu (于旭 )The Chinese University of Hong Kong
[email protected], http://www.se.cuhk.edu.hk/~yu
Vertex-Centric Computing on BSP
Distributed vertex-centric computing on BSP (Bulk Synchronous Parallel):
concurrent computing, communication, and barrier synchronization.
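The BSP cycle above (compute, communicate, barrier) can be sketched as a superstep loop. This is a minimal single-process illustration, not any system's actual API; the function and field names are assumptions.

```python
# Minimal sketch of the BSP superstep cycle (names are illustrative).
def bsp_run(vertices, compute, max_supersteps):
    """vertices: dict vertex_id -> state; compute: per-vertex UDF."""
    inbox = {v: [] for v in vertices}   # messages received last superstep
    active = set(vertices)              # vertices that will compute
    for step in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        # 1. Concurrent computing: every active vertex runs the UDF.
        for v in list(active):
            halt = compute(v, vertices, inbox[v], outbox)
            if halt:
                active.discard(v)
        # 2. Communication: deliver all messages produced this superstep.
        inbox = outbox
        # 3. Barrier synchronization: in a real system all workers wait here.
        active |= {v for v, msgs in inbox.items() if msgs}  # messages reactivate
        if not active:
            break
    return vertices

def propagate_max(v, vertices, msgs, outbox):
    """Example UDF: spread the maximum value through the graph."""
    state = vertices[v]
    new = max([state['value']] + msgs)
    if not msgs or new > state['value']:  # first superstep, or value improved
        state['value'] = new
        for u in state['edges']:
            outbox[u].append(new)
    return True  # vote to halt; incoming messages reactivate the vertex
```

Running `bsp_run` with `propagate_max` on a directed cycle converges once every vertex holds the global maximum and no new messages are sent.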
Workload Balancing
Computing: determined by the slowest node, so workloads must be balanced.
Communication: the volume matters; it is driven by cross-edges.
Computing + communication: balanced partitioning.
Balanced k-way Graph Partitioning
Size-balanced partitions with the minimum possible number of cross-edges.
It solves our problem if the graph is static. By static, we mean the vertices are always active during the computation. However, for graph analytics, the vertices may toggle between active and inactive.
Dynamic Workload Balancing
Dynamic workload balancing must respond to the vertices' active/inactive status.
We do not know in advance which graph algorithms will be run.
We do not know anything about the graphs themselves.
We cannot request that graphs be 'well' partitioned on the Cloud, and we cannot assume how graphs are initially partitioned on the Cloud.
It needs to react to workload imbalance in good time, and it cannot take long to rebalance itself.
Any General Approach?
Representative Graph Algorithms
PageRank, Semi-Clustering, Graph Coloring, Single Source Shortest Path, Breadth First Search, Random Walk, Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets.
Category 1: Always Active
The three algorithms: PageRank, Semi-Clustering, Graph Coloring.
The vertices are always active: the ideal case for static partitioning.
Perfectly balanced, as expected.
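The always-active pattern can be illustrated by a Pregel-style PageRank compute function. This is a sketch; the damping factor, state layout, and driver interface are assumptions, not taken from the slides.

```python
DAMPING = 0.85  # a common choice; an assumption, not from the slides

def pagerank_compute(vertex, messages, num_vertices, superstep, max_steps):
    """One PageRank superstep for a single vertex.
    vertex: {'rank': float, 'out_edges': [ids]}; messages: floats from in-neighbors.
    Returns (halt, outgoing), where outgoing maps neighbor id -> message value."""
    if superstep > 0:
        # Every vertex recomputes its rank every superstep: always active.
        vertex['rank'] = (1 - DAMPING) / num_vertices + DAMPING * sum(messages)
    outgoing = {}
    if superstep < max_steps - 1 and vertex['out_edges']:
        share = vertex['rank'] / len(vertex['out_edges'])
        for u in vertex['out_edges']:
            outgoing[u] = share
    # Halt only at the end; until then every vertex works in every superstep,
    # which is why a static size-balanced partition stays balanced.
    return superstep >= max_steps - 1, outgoing
```

On a 3-vertex directed cycle the ranks stay at 1/3 each and sum to 1 after any number of supersteps.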
Category 2: Traversal
The three algorithms: Single Source Shortest Path, Breadth First Search, Random Walk.
Significantly imbalanced.
Category 3: Multi-Phase
The three algorithms: Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets.
Somewhat balanced.
Predictable?
For Category 1, the algorithms have a stable working window.
For Category 2, predictability cannot be guaranteed, but most large-scale graphs have the low-diameter property, so SSSP has a reasonable hit-rate between supersteps.
For Category 3, the hit-rate between two successive phases is very high, due to the algorithm design.
Our Approach [Shang et al. ICDE'13]
Each computational node determines (a) how many vertices should be migrated, (b) to which destination node, and (c) which vertices will be moved.
How many to move: based on the workload of each computational node, a quota of vertices to be moved is computed.
Which vertices to move: priority is given to vertices that benefit more from being stored on the destination node than on the current node, measured by a willingness score. The vertices with the highest willingness scores are moved until the quota is used up.
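The selection step described above can be sketched as follows. The quota formula and the willingness score here are illustrative stand-ins; the paper's exact definitions use symbols that did not survive the transcript.

```python
def quota_for(node_load, avg_load, fraction=0.1):
    """How many to move: a node above the average load offers a fraction of
    its surplus for migration (the fraction is an assumed tuning knob)."""
    surplus = max(0, node_load - avg_load)
    return int(surplus * fraction)

def pick_vertices_to_move(vertices, scores, quota):
    """Which to move: select up to `quota` vertices with the highest
    willingness scores (scores: vertex -> benefit of migrating)."""
    ranked = sorted(vertices, key=lambda v: scores[v], reverse=True)
    return ranked[:quota]
```

A node with load 200 against an average of 100 would offer 10 vertices under the assumed 10% fraction, then fill that quota greedily by score.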
Some Basic Ideas
A vertex should be moved to the computational node holding most of its in-neighbors.
Do not move a vertex if it was moved in the last few iterations.
Move a vertex only if the gain is higher than a given ratio: if a vertex has only slightly more neighbors on another computational node than on its current one, it is not worth moving.
Do not move vertices to a computational node whose workload was higher than the average in the previous superstep.
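These heuristics can be sketched as a filter applied before migrating a vertex. The cooldown length, gain ratio, and field names are assumptions for illustration.

```python
def should_move(vertex, candidate_node, current_node, superstep,
                last_moved, neighbor_count, prev_loads, avg_prev_load,
                cooldown=3, gain_ratio=1.5):
    """Apply the basic migration rules to one vertex.
    last_moved: vertex -> superstep of its last migration.
    neighbor_count: (vertex, node) -> number of neighbors on that node."""
    # Rule 1: do not move a vertex moved within the last `cooldown` supersteps.
    if superstep - last_moved.get(vertex, -cooldown - 1) <= cooldown:
        return False
    # Rule 2: move only if the gain is high enough: the candidate node must
    # hold sufficiently more neighbors than the current node.
    here = neighbor_count.get((vertex, current_node), 0)
    there = neighbor_count.get((vertex, candidate_node), 0)
    if there < gain_ratio * max(here, 1):
        return False
    # Rule 3: never move to a node whose workload exceeded the average
    # in the previous superstep.
    if prev_loads[candidate_node] > avg_prev_load:
        return False
    return True
```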
Graph Computing on Cloud
The factors: memory consumption, communication cost, CPU cost, and the number of rounds.
The classes:
MapReduce Class (MRC) by Karloff et al. in SODA'10.
Minimal MapReduce Class (MMC) by Tao et al. in SIGMOD'13.
Scalable Graph Processing (SGC) on MapReduce by Qin et al. in SIGMOD'14.
Balanced Practical Pregel Algorithms (BPPA) on BSP by Yan et al. in VLDB'14.
Big data and bigger data: Google: 2+ EB; Twitter: 8 PB; Yahoo!: 400 PB; Facebook: 300 PB.
Big data needs to get answers fast. More data beats clever algorithms: "A few useful things to know about machine learning" by P. Domingos in CACM, 2012.
Auto-Approximate Graph Computing [Shang et al. VLDB'15]
Why Auto-Approximate?
Working in a distributed environment is hard, and designing a new algorithm is hard; a new distributed approximation algorithm is hard + hard.
The target is a fast answer, but it is impossible for a system to know the meaning of programs.
Auto-Approximate Graph Computing
The idea is to automatically modify the vertex-centric programs (UDFs), turning traditional computing into approximation computing.
The Errors
The error comes from two sides:
The "bad" input: error inherited from previous iterations.
Wrong calculation: error from the new approximate UDF.
Approximation
There is no single approach that can approximate all problems, as restricted by Rice's theorem: any nontrivial property of the language recognized by a Turing machine is undecidable.
Continuous functions vs. discrete functions: the notions of continuity from mathematical analysis are relevant and interesting even for software (Chaudhuri et al. in CACM, 2012), e.g., shortest paths and minimum spanning trees.
Synthesize
How to synthesize:
Sampling.
Memoization: remember old decisions.
Task skipping: sleep for one iteration.
Interpolation: replace complex functions by simple ones.
System function replacement.
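Two of these synthesis rewrites, memoization and task skipping, can be sketched as wrappers around a vertex UDF. The wrapper interface is hypothetical, not the system's actual API.

```python
def memoize_udf(udf):
    """Memoization: remember old decisions; reuse the result when a vertex
    sees the same multiset of messages again."""
    cache = {}
    def wrapped(vertex_id, messages):
        key = (vertex_id, tuple(sorted(messages)))
        if key not in cache:
            cache[key] = udf(vertex_id, messages)
        return cache[key]
    return wrapped

def skip_udf(udf, every=2):
    """Task skipping: 'sleep' for one iteration by reusing the previous
    result instead of recomputing on skipped calls."""
    last = {}
    calls = {}
    def wrapped(vertex_id, messages):
        n = calls.get(vertex_id, 0)
        calls[vertex_id] = n + 1
        if n % every != 0 and vertex_id in last:
            return last[vertex_id]        # skip: reuse the old result
        last[vertex_id] = udf(vertex_id, messages)
        return last[vertex_id]
    return wrapped
```

Both trade accuracy for time: stale or cached results feed the next superstep, which is exactly the "error inherited from previous iterations" above.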
An Example
Sampling as an example: find chances for sampling, synthesize the code, and correct the answer by regression.
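A sketch of the sampling rewrite: process only a random fraction of the incoming messages and scale the aggregate back up. The simple inverse-rate correction here stands in for the regression step; the names are illustrative.

```python
import random

def sampled_sum(messages, rate=0.5, rng=random):
    """Approximate sum(messages) by keeping each message with probability
    `rate` and correcting the estimate by the inverse sampling rate."""
    if not messages:
        return 0.0
    sample = [m for m in messages if rng.random() < rate]
    if not sample:
        return 0.0
    # Inverse-rate correction: approximately unbiased for the full sum.
    return sum(sample) / rate
```

With `rate=1.0` the estimate is exact; lowering the rate trades accuracy for per-superstep computing cost.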
Error-Time Tradeoff
Errors in the last iterations are larger than in the early ones. Use the original UDF f instead of the approximate f' in the last iterations.
The Sampling Strategies
Real graphs are power-law graphs. Place each vertex into a bucket according to its degree.
The workload of buckets 4-9 dominates, even though the number of vertices in them is not large.
Sampling the messages for those buckets may save computing cost.
Obtain the bucket statistics using graph samples generated by random walks.
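The degree-bucketing idea can be sketched as follows, assuming logarithmic buckets (the slide's exact bucket formula did not survive the transcript, but log-scale buckets are a natural fit for power-law degree distributions).

```python
import math
from collections import defaultdict

def bucket_of(degree):
    """Assign a vertex to a bucket by the logarithm of its degree
    (log-scale bucketing is an assumption for power-law graphs)."""
    return 0 if degree <= 1 else int(math.log2(degree))

def bucket_workloads(degrees):
    """degrees: vertex -> degree. Returns bucket -> [num_vertices, workload],
    where workload is the total degree (messages handled) in the bucket."""
    stats = defaultdict(lambda: [0, 0])
    for v, d in degrees.items():
        b = bucket_of(d)
        stats[b][0] += 1
        stats[b][1] += d
    return dict(stats)
```

A few high-degree vertices land in the high buckets yet account for most of the workload, which is why sampling their messages pays off.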
PR over twitter-mp (10 iterations): using f' in all iterations except the last one saves a significant amount of computing time.