Big Graph Processing on Cloud
Jeffrey Xu Yu (于旭 )The Chinese University of Hong Kong
[email protected], http://www.se.cuhk.edu.hk/~yu
Vertex-Centric Computing on BSP
Distributed vertex-centric computing on BSP (Bulk Synchronous Parallel):
concurrent computing, communication, and barrier synchronization.
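The BSP cycle above (compute, communicate, barrier) can be sketched as a superstep loop. This is a minimal single-process illustration, not any system's actual API; the function and field names are assumptions.

```python
# Minimal sketch of the BSP superstep cycle (names are illustrative).
def bsp_run(vertices, compute, max_supersteps):
    """vertices: dict vertex_id -> state; compute: per-vertex UDF."""
    inbox = {v: [] for v in vertices}   # messages received last superstep
    active = set(vertices)              # vertices that will compute
    for step in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        # 1. Concurrent computing: every active vertex runs the UDF.
        for v in list(active):
            halt = compute(v, vertices, inbox[v], outbox)
            if halt:
                active.discard(v)
        # 2. Communication: deliver all messages produced this superstep.
        inbox = outbox
        # 3. Barrier synchronization: in a real system all workers wait here.
        active |= {v for v, msgs in inbox.items() if msgs}  # messages reactivate
        if not active:
            break
    return vertices

def propagate_max(v, vertices, msgs, outbox):
    """Example UDF: spread the maximum value through the graph."""
    state = vertices[v]
    new = max([state['value']] + msgs)
    if not msgs or new > state['value']:  # first superstep, or value improved
        state['value'] = new
        for u in state['edges']:
            outbox[u].append(new)
    return True  # vote to halt; incoming messages reactivate the vertex
```

Running `bsp_run` with `propagate_max` on a directed cycle converges once every vertex holds the global maximum and no new messages are sent.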
Workload Balancing
Computing: determined by the slowest node, so workloads must be balanced.
Communication: the volume matters; it is driven by cross-edges.
Computing + communication: balanced partitioning.
Balanced k-way Graph Partitioning
Size-balanced partitions with the minimum possible number of cross-edges.
It solves our problem if the graph is static. By static, we mean the vertices are always active during the computation. However, for graph analytics, the vertices may toggle between active and inactive.
Dynamic Workload Balancing
Dynamic workload balancing must respond to the vertices' active/inactive status.
We do not know in advance which graph algorithms will be run.
We do not know anything about the graphs themselves.
We cannot request that graphs be 'well' partitioned on the Cloud, and we cannot assume how graphs are initially partitioned on the Cloud.
It needs to react to workload imbalance in good time, and it cannot take long to rebalance itself.
Any General Approach?
Representative Graph Algorithms
PageRank, Semi-Clustering, Graph Coloring, Single Source Shortest Path, Breadth First Search, Random Walk, Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets.
Category 1: Always Active
The three algorithms: PageRank, Semi-Clustering, Graph Coloring.
The vertices are always active: the ideal case for static partitioning.
Perfectly balanced, as expected.
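The always-active pattern can be illustrated by a Pregel-style PageRank compute function. This is a sketch; the damping factor, state layout, and driver interface are assumptions, not taken from the slides.

```python
DAMPING = 0.85  # a common choice; an assumption, not from the slides

def pagerank_compute(vertex, messages, num_vertices, superstep, max_steps):
    """One PageRank superstep for a single vertex.
    vertex: {'rank': float, 'out_edges': [ids]}; messages: floats from in-neighbors.
    Returns (halt, outgoing), where outgoing maps neighbor id -> message value."""
    if superstep > 0:
        # Every vertex recomputes its rank every superstep: always active.
        vertex['rank'] = (1 - DAMPING) / num_vertices + DAMPING * sum(messages)
    outgoing = {}
    if superstep < max_steps - 1 and vertex['out_edges']:
        share = vertex['rank'] / len(vertex['out_edges'])
        for u in vertex['out_edges']:
            outgoing[u] = share
    # Halt only at the end; until then every vertex works in every superstep,
    # which is why a static size-balanced partition stays balanced.
    return superstep >= max_steps - 1, outgoing
```

On a 3-vertex directed cycle the ranks stay at 1/3 each and sum to 1 after any number of supersteps.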
Category 2: Traversal
The three algorithms: Single Source Shortest Path, Breadth First Search, Random Walk.
Significantly imbalanced.
Category 3: Multi-Phase
The three algorithms: Maximal Matching, Minimum Spanning Tree, Maximal Independent Sets.
Somewhat balanced.
Predictable?
For Category 1, the algorithms have a stable working window.
For Category 2, predictability cannot be guaranteed, but most large-scale graphs have the low-diameter property, so SSSP has a reasonable hit-rate between supersteps.
For Category 3, the hit-rate between two successive phases is very high, due to the algorithm design.
Our Approach [Shang et al. ICDE'13]
Each computational node determines (a) how many vertices should be migrated, (b) to which destination node, and (c) which vertices will be moved.
How many to move: based on the workload of each computational node, a quota of vertices to be moved is computed.
Which vertices to move: priority is given to vertices that benefit more from being stored on the destination node than on the current node, measured by a willingness score. The vertices with the highest willingness scores are moved until the quota is used up.
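The selection step described above can be sketched as follows. The quota formula and the willingness score here are illustrative stand-ins; the paper's exact definitions use symbols that did not survive the transcript.

```python
def quota_for(node_load, avg_load, fraction=0.1):
    """How many to move: a node above the average load offers a fraction of
    its surplus for migration (the fraction is an assumed tuning knob)."""
    surplus = max(0, node_load - avg_load)
    return int(surplus * fraction)

def pick_vertices_to_move(vertices, scores, quota):
    """Which to move: select up to `quota` vertices with the highest
    willingness scores (scores: vertex -> benefit of migrating)."""
    ranked = sorted(vertices, key=lambda v: scores[v], reverse=True)
    return ranked[:quota]
```

A node with load 200 against an average of 100 would offer 10 vertices under the assumed 10% fraction, then fill that quota greedily by score.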
Some Basic Ideas
A vertex should be moved to the computational node holding most of its in-neighbors.
Do not move a vertex if it was moved in the last few iterations.
Move a vertex only if the gain is higher than a given ratio: if a vertex has only slightly more neighbors on another computational node than on its current one, it is not worth moving.
Do not move vertices to a computational node whose workload was higher than the average in the previous superstep.
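These heuristics can be sketched as a filter applied before migrating a vertex. The cooldown length, gain ratio, and field names are assumptions for illustration.

```python
def should_move(vertex, candidate_node, current_node, superstep,
                last_moved, neighbor_count, prev_loads, avg_prev_load,
                cooldown=3, gain_ratio=1.5):
    """Apply the basic migration rules to one vertex.
    last_moved: vertex -> superstep of its last migration.
    neighbor_count: (vertex, node) -> number of neighbors on that node."""
    # Rule 1: do not move a vertex moved within the last `cooldown` supersteps.
    if superstep - last_moved.get(vertex, -cooldown - 1) <= cooldown:
        return False
    # Rule 2: move only if the gain is high enough: the candidate node must
    # hold sufficiently more neighbors than the current node.
    here = neighbor_count.get((vertex, current_node), 0)
    there = neighbor_count.get((vertex, candidate_node), 0)
    if there < gain_ratio * max(here, 1):
        return False
    # Rule 3: never move to a node whose workload exceeded the average
    # in the previous superstep.
    if prev_loads[candidate_node] > avg_prev_load:
        return False
    return True
```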
Graph Computing on Cloud
The factors: memory consumption, communication cost, CPU cost, and the number of rounds.
The classes:
MapReduce Class (MRC) by Karloff et al. in SODA'10.
Minimal MapReduce Class (MMC) by Tao et al. in SIGMOD'13.
Scalable Graph Processing (SGC) on MapReduce by Qin et al. in SIGMOD'14.
Balanced Practical Pregel Algorithms (BPPA) on BSP by Yan et al. in VLDB'14.
Big data and bigger data: Google: 2+ EB; Twitter: 8 PB; Yahoo!: 400 PB; Facebook: 300 PB.
Big data needs to get answers fast. More data beats clever algorithms: "A few useful things to know about machine learning" by P. Domingos in CACM, 2012.
Auto-Approximate Graph Computing [Shang et al. VLDB'15]
Why Auto-Approximate?
Working in a distributed environment is hard, and designing a new algorithm is hard; a new distributed approximation algorithm is hard + hard.
The target is a fast answer, but it is impossible for a system to know the meaning of programs.
Auto-Approximate Graph Computing
The idea is to automatically modify the vertex-centric programs (UDFs), turning traditional computing into approximation computing.
The Errors
The error comes from two sides:
The "bad" input: error inherited from previous iterations.
Wrong calculation: error from the new approximate UDF.
Approximation
There is no single approach that can approximate all problems, as restricted by Rice's theorem: any nontrivial property of the language recognized by a Turing machine is undecidable.
Continuous functions vs. discrete functions: the notions of continuity from mathematical analysis are relevant and interesting even for software (Chaudhuri et al. in CACM, 2012), e.g., shortest paths and minimum spanning trees.
Synthesize
How to synthesize:
Sampling.
Memoization: remember old decisions.
Task skipping: sleep for one iteration.
Interpolation: replace complex functions by simple ones.
System function replacement.
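Two of these synthesis rewrites, memoization and task skipping, can be sketched as wrappers around a vertex UDF. The wrapper interface is hypothetical, not the system's actual API.

```python
def memoize_udf(udf):
    """Memoization: remember old decisions; reuse the result when a vertex
    sees the same multiset of messages again."""
    cache = {}
    def wrapped(vertex_id, messages):
        key = (vertex_id, tuple(sorted(messages)))
        if key not in cache:
            cache[key] = udf(vertex_id, messages)
        return cache[key]
    return wrapped

def skip_udf(udf, every=2):
    """Task skipping: 'sleep' for one iteration by reusing the previous
    result instead of recomputing on skipped calls."""
    last = {}
    calls = {}
    def wrapped(vertex_id, messages):
        n = calls.get(vertex_id, 0)
        calls[vertex_id] = n + 1
        if n % every != 0 and vertex_id in last:
            return last[vertex_id]        # skip: reuse the old result
        last[vertex_id] = udf(vertex_id, messages)
        return last[vertex_id]
    return wrapped
```

Both trade accuracy for time: stale or cached results feed the next superstep, which is exactly the "error inherited from previous iterations" above.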
An Example
Sampling as an example: find chances for sampling, synthesize the code, and correct the answer by regression.
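A sketch of the sampling rewrite: process only a random fraction of the incoming messages and scale the aggregate back up. The simple inverse-rate correction here stands in for the regression step; the names are illustrative.

```python
import random

def sampled_sum(messages, rate=0.5, rng=random):
    """Approximate sum(messages) by keeping each message with probability
    `rate` and correcting the estimate by the inverse sampling rate."""
    if not messages:
        return 0.0
    sample = [m for m in messages if rng.random() < rate]
    if not sample:
        return 0.0
    # Inverse-rate correction: approximately unbiased for the full sum.
    return sum(sample) / rate
```

With `rate=1.0` the estimate is exact; lowering the rate trades accuracy for per-superstep computing cost.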
Error-Time Tradeoff
Errors in the last iterations are larger than in the early ones. Use the original UDF f instead of the approximate f' in the last iterations.
The Sampling Strategies
Real graphs are power-law graphs. Place each vertex into a bucket according to its degree.
The workload of buckets 4-9 dominates, even though the number of vertices in them is not large.
Sampling the messages for those buckets may save computing cost.
Obtain the bucket statistics using graph samples generated by random walks.
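The degree-bucketing idea can be sketched as follows, assuming logarithmic buckets (the slide's exact bucket formula did not survive the transcript, but log-scale buckets are a natural fit for power-law degree distributions).

```python
import math
from collections import defaultdict

def bucket_of(degree):
    """Assign a vertex to a bucket by the logarithm of its degree
    (log-scale bucketing is an assumption for power-law graphs)."""
    return 0 if degree <= 1 else int(math.log2(degree))

def bucket_workloads(degrees):
    """degrees: vertex -> degree. Returns bucket -> [num_vertices, workload],
    where workload is the total degree (messages handled) in the bucket."""
    stats = defaultdict(lambda: [0, 0])
    for v, d in degrees.items():
        b = bucket_of(d)
        stats[b][0] += 1
        stats[b][1] += d
    return dict(stats)
```

A few high-degree vertices land in the high buckets yet account for most of the workload, which is why sampling their messages pays off.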
PR over twitter-mp (10 iterations): using f' in all iterations except the last one saves a significant amount of computing time.