
Distributed Graph-Parallel Computation on Natural Graphs. The Team: Joseph Gonzalez, Yucheng Low, Aapo Kyrola, Danny Bickson, Joe Hellerstein, Alex Smola, Haijie Gu, Carlos Guestrin

Upload: magdalen-hamilton, posted on 24-Dec-2015


TRANSCRIPT

Page 1:

Distributed Graph-Parallel Computation on Natural Graphs

The Team: Joseph Gonzalez, Yucheng Low, Aapo Kyrola, Danny Bickson, Joe Hellerstein, Alex Smola, Haijie Gu, Carlos Guestrin

Page 2:

Big-Learning

How will we design and implement parallel learning systems?

Page 3:

The popular answer: Map-Reduce / Hadoop

Build learning algorithms on top of high-level parallel abstractions.

Page 4:

Map-Reduce for Data-Parallel ML

• Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
– Feature Extraction
– Cross Validation
– Computing Sufficient Statistics

Graph-Parallel:
– Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
– Semi-Supervised Learning: Label Propagation, CoEM
– Graph Analysis: PageRank, Triangle Counting
– Collaborative Filtering: Tensor Factorization

Page 5:

Label Propagation

• Social Arithmetic. My predicted interests are a weighted blend of my profile and my friends' interests:

  I Like: 60% Cameras, 40% Biking
    = 50% × what I list on my profile (50% Cameras, 50% Biking)
    + 40% × what Sue Ann likes (80% Cameras, 20% Biking)
    + 10% × what Carlos likes (30% Cameras, 70% Biking)

• Recurrence Algorithm: Likes[i] = Σj wij × Likes[j]
  – iterate until convergence
• Parallelism: compute all Likes[i] in parallel

http://www.cs.cmu.edu/~zhuxj/pub/CMU-CALD-02-107.pdf
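The blend above can be checked with a small sketch (the `blend` helper is hypothetical, not from the talk):

```python
# Hypothetical sketch of the "social arithmetic" blend from the slide:
# Likes[i] = sum_j w_ij * Likes[j], where the self-weight covers the profile.

def blend(weights, likes):
    """Combine neighbor interest vectors with the given weights."""
    out = {}
    for w, dist in zip(weights, likes):
        for topic, p in dist.items():
            out[topic] = out.get(topic, 0.0) + w * p
    return out

profile = {"Cameras": 0.5, "Biking": 0.5}   # 50%: what I list on my profile
sue_ann = {"Cameras": 0.8, "Biking": 0.2}   # 40% weight
carlos  = {"Cameras": 0.3, "Biking": 0.7}   # 10% weight

me = blend([0.5, 0.4, 0.1], [profile, sue_ann, carlos])
# me is approximately 60% Cameras, 40% Biking, matching the slide
```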

Page 6:

Properties of Graph-Parallel Algorithms

• Dependency Graph: my interests depend on my friends' interests
• Iterative Computation with Local Updates
• Parallelism: run local updates simultaneously

Page 7:

Map-Reduce for Data-Parallel ML

• Excellent for large data-parallel tasks!

Data-Parallel (Map Reduce):
– Feature Extraction
– Cross Validation
– Computing Sufficient Statistics

Graph-Parallel (Map Reduce? No: a Graph-Parallel Abstraction):
– Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
– Semi-Supervised Learning: Label Propagation, CoEM
– Data-Mining: PageRank, Triangle Counting
– Collaborative Filtering: Tensor Factorization

Page 8:

Graph-Parallel Abstractions

• Vertex-Program associated with each vertex
• Graph constrains the interaction along edges
  – Pregel: programs interact through messages
  – GraphLab: programs can read each other's state

Page 9:

The Pregel Abstraction (a barrier separates the Compute and Communicate phases)

Pregel_LabelProp(i)
  // Read incoming messages
  msg_sum = sum(msg : in_messages)
  // Compute the new interests
  Likes[i] = f(msg_sum)
  // Send messages to neighbors
  for j in neighbors:
    send message(g(wij, Likes[i])) to j
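A minimal sketch of the superstep model this pseudocode assumes (names and API are illustrative, not Pregel's actual interface):

```python
# Minimal sketch (not Pregel's actual API) of one synchronous superstep:
# each vertex sums its incoming messages, applies f, and sends g(w_ij, value)
# along each out-edge; a barrier then separates this superstep from the next.

def superstep(values, out_edges, inbox, f, g):
    new_values, new_inbox = {}, {v: [] for v in values}
    for v in values:
        msg_sum = sum(inbox.get(v, []))        # read incoming messages
        new_values[v] = f(msg_sum)             # compute the new value
        for j, w in out_edges.get(v, []):      # send messages to neighbors
            new_inbox[j].append(g(w, new_values[v]))
    return new_values, new_inbox               # barrier: next superstep
```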

Page 10:

The GraphLab Abstraction

Vertex-Programs are executed asynchronously and directly read the neighboring vertex-program state.

GraphLab_LblProp(i, neighbors, Likes)
  // Compute sum over neighbors
  sum = 0
  for j in neighbors of i:
    sum += g(wij, Likes[j])
  // Update my interests
  Likes[i] = f(sum)
  // Activate neighbors if needed
  if Likes[i] changes then activate_neighbors()

Activated vertex-programs are executed eventually and can read the new state of their neighbors.

Page 11:

Never Ending Learner Project (CoEM)

[Figure: speedup vs. number of CPUs (1 to 16); GraphLab CoEM tracks the optimal linear speedup.]

Hadoop: 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min (15x faster, 6x fewer CPUs!)
Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of Hadoop time)

Page 12:

The Cost of the Wrong Abstraction

[Figure: runtime comparison on a log scale.]

Page 13:

Startups Using GraphLab

Companies experimenting (or downloading) with GraphLab

Academic projects exploring (or downloading) GraphLab

Page 14:

Why do we need GraphLab2?

Page 15:

Natural Graphs

[Image from WikiCommons]

Page 16:

Assumptions of Graph-Parallel Abstractions

Ideal Structure:
• Small neighborhoods – low degree vertices
• Vertices have similar degree
• Easy to partition

Natural Graph:
• Large neighborhoods – high degree vertices
• Power-law degree distribution
• Difficult to partition

Page 17:

Power-Law Structure

[Figure: log-log degree distribution with slope -α, α ≈ 2; the high-degree vertices form the tail.]

Top 1% of vertices are adjacent to 50% of the edges!
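As an illustrative check (synthetic sample, not the talk's graphs), drawing degrees from a heavy tail with exponent α ≈ 2 shows how concentrated the edge endpoints become:

```python
import random

# Illustrative only: sample 100k degrees with tail P(d) ~ d^-2
# (paretovariate(1.0) has density x^-2) and measure what share of
# edge endpoints the top 1% of vertices account for.
random.seed(0)
degrees = sorted((1 + int(random.paretovariate(1.0)) for _ in range(100_000)),
                 reverse=True)
share = sum(degrees[:1000]) / sum(degrees)   # top 1% of 100k vertices
# With a tail this heavy, the top 1% holds a large share of all endpoints.
```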

Page 18:

Challenges of High-Degree Vertices

• Sequential vertex-programs touch a large fraction of the graph (GraphLab)
• Produce many messages (Pregel)
• Edge information too large for a single machine
• Asynchronous consistency requires heavy locking (GraphLab)
• Synchronous consistency is prone to stragglers (Pregel)

Page 19:

Graph Partitioning

• Graph-parallel abstractions rely on partitioning:
  – Minimize communication
  – Balance computation and storage

[Figure: a graph cut across Machine 1 and Machine 2.]

Page 20:

Natural Graphs are Difficult to Partition

• Natural graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]

• Popular graph-partitioning tools (Metis, Chaco,…) perform poorly [Abou-Rjeili et al. 06]– Extremely slow and require substantial memory

Page 21:

Random Partitioning

• Both GraphLab and Pregel proposed random (hashed) partitioning for natural graphs.

10 machines: 90% of edges cut
100 machines: 99% of edges cut!
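The 90% and 99% figures follow from a one-line calculation: with vertices hashed uniformly onto p machines, an edge survives only if both endpoints hash to the same machine, which happens with probability 1/p.

```python
def expected_edge_cut(p):
    """Expected fraction of edges cut under random (hashed) vertex placement
    over p machines: an edge is cut unless both endpoints collide (prob 1/p)."""
    return 1 - 1 / p

# expected_edge_cut(10) is about 0.9; expected_edge_cut(100) is about 0.99
```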

Page 22:

In Summary

GraphLab and Pregel are not well suited for natural graphs:

• Poor performance on high-degree vertices
• Low-quality partitioning

Page 23:

GraphLab2:

• Distribute a single vertex-program
  – Move computation to data
  – Parallelize high-degree vertices
• Vertex partitioning
  – Simple online heuristic to effectively partition large power-law graphs

Page 24:

Decompose Vertex-Programs

• Gather (Reduce), user defined: Gather(Y) produces partial accumulators over the vertex scope, combined with a parallel sum, Σ = Σ1 + Σ2 + …
• Apply, user defined: Apply(Y, Σ) → Y' applies the accumulated value to the center vertex.
• Scatter, user defined: Scatter(Y') updates adjacent edges and vertices.

Page 25:

Writing a GraphLab2 Vertex-Program

LabelProp_GraphLab2(i)
  Gather(Likes[i], wij, Likes[j]) :
    return g(wij, Likes[j])
  sum(a, b) : return a + b
  Apply(Likes[i], Σ) : Likes[i] = f(Σ)
  Scatter(Likes[i], wij, Likes[j]) :
    if (change in Likes[i] > ε) then activate(j)
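A minimal runnable rendering of this Gather/sum/Apply/Scatter program (Python sketch with an assumed driver loop; the real GraphLab2 engine schedules vertices very differently):

```python
# Assumed driver loop, not GraphLab2's API: gather contributions from
# in-neighbors, sum them, apply f, and activate neighbors on large changes.

def run_gas(likes, edges, f, g, eps=1e-3, max_iters=100):
    """edges: dict mapping vertex i -> list of (neighbor j, weight w_ij)."""
    active = set(likes)
    for _ in range(max_iters):
        if not active:
            break
        next_active = set()
        for i in sorted(active):
            # Gather + sum: accumulate over neighbors
            acc = sum(g(w, likes[j]) for j, w in edges.get(i, []))
            # Apply: compute the new value from the accumulator
            new = f(acc)
            # Scatter: activate neighbors if the change is large enough
            if abs(new - likes[i]) > eps:
                next_active.update(j for j, _ in edges.get(i, []))
            likes[i] = new
        active = next_active
    return likes
```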

Page 26:

Distributed Execution of a Factorized Vertex-Program

[Figure: partial accumulators Σ1 and Σ2 are computed on Machine 1 and Machine 2, then combined at the vertex.]

O(1) data transmitted over network

Page 27:

Cached Aggregation

• Repeated calls to gather waste computation: every neighbor is re-read even when only one value changed.
• Solution: cache the previous gather result Σ and update it incrementally; combining the cached Σ with a delta Δ from the changed neighbor (new value minus old value) yields the new accumulator Σ' without re-reading the unchanged neighbors.

Page 28:

Writing a GraphLab2 Vertex-Program

LabelProp_GraphLab2(i)
  Gather(Likes[i], wij, Likes[j]) :
    return g(wij, Likes[j])
  sum(a, b) : return a + b
  Apply(Likes[i], Σ) : Likes[i] = f(Σ)
  Scatter(Likes[i], wij, Likes[j]) :
    if (change in Likes[i] > ε) then activate(j)
    Post Δj = g(wij, Likes[i]_new) - g(wij, Likes[i]_old)

Reduces runtime of PageRank by 50%!
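The delta-posting idea can be sketched as a cached accumulator (structure assumed for illustration, not GraphLab2's implementation):

```python
# Assumed structure: each vertex keeps its last gathered sum; a scatter
# posts a delta instead of forcing a full regather of all neighbors.

class CachedAccumulator:
    def __init__(self):
        self.cache = {}          # vertex -> last gathered sum

    def full_gather(self, i, contributions):
        """Initial gather: sum every neighbor contribution."""
        self.cache[i] = sum(contributions)
        return self.cache[i]

    def post_delta(self, i, old_contrib, new_contrib):
        """Incremental update: apply delta = g(w, new) - g(w, old)."""
        self.cache[i] += new_contrib - old_contrib
        return self.cache[i]
```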

Page 29:

Execution Models

Synchronous and Asynchronous

Page 30:

Synchronous Execution

• Similar to Pregel
• For all active vertices: Gather, Apply, Scatter
• Activated vertices are run on the next iteration
• Fully deterministic
• Potentially slower convergence for some machine learning algorithms

Page 31:

Asynchronous Execution

• Similar to GraphLab
• Active vertices are processed asynchronously as resources become available
• Non-deterministic
• Optionally enable serial consistency

Page 32:

Preventing Overlapping Computation

• New distributed mutual exclusion protocol

[Figure: adjacent vertex-programs share a conflict edge and must not run concurrently.]

Page 33:

Multi-core Performance

[Figure: L1 error (log scale, 1e-2 to 1e8) vs. runtime in seconds for multicore PageRank (25M vertices, 355M edges), comparing GraphLab, Pregel (simulated), GraphLab2 Factorized, and GraphLab2 Factorized + Caching.]

Page 34:

Vertex-Cuts for Partitioning

Percolation theory suggests that power-law graphs can be split by removing only a small set of vertices. [Albert et al. 2000]

What about graph partitioning?

Page 35:

The GraphLab2 Abstraction Permits a New Approach to Partitioning

• Rather than cut edges (CPU 1 | CPU 2): must synchronize many edges
• we cut vertices (CPU 1 | CPU 2): must synchronize a single vertex

Theorem: For any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage.
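A toy count conveys the theorem's intuition (illustration only, not the proof): for a star of degree d with edges split evenly across two machines, the edge-cut must synchronize every straddling edge, while the vertex-cut synchronizes only replicas of the center vertex.

```python
# Toy illustration: a star graph of degree d, edges split evenly over
# `machines` machines, with the center vertex as the only shared vertex.

def edge_cut_sync(d, machines=2):
    """Edges whose endpoints straddle machines under an edge-cut."""
    return d - d // machines

def vertex_cut_sync(machines=2):
    """Replicas of the single spanning (center) vertex under a vertex-cut."""
    return machines - 1

# For d = 1000 over 2 machines: 500 straddling edges vs. 1 vertex replica.
```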

Page 36:

Constructing Vertex-Cuts

• Goal: parallel graph partitioning on ingress.
• Propose three simple approaches:
  – Random edge placement: edges are placed randomly by each machine
  – Greedy edge placement with coordination: edges are placed using a shared objective
  – Oblivious-greedy edge placement: edges are placed using a local objective

Page 37:

Random Vertex-Cuts

• Assign edges randomly to machines and allow vertices to span machines.

[Figure: a vertex Y spanning Machine 1 and Machine 2.]

Page 38:

Random Vertex-Cuts

• Assign edges randomly to machines and allow vertices to span machines.
• Expected number of machines spanned by a vertex v of degree d(v), over p machines:

  E[machines spanned by v] = p(1 - (1 - 1/p)^d(v))
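The expected-span formula can be evaluated directly (a sketch; p is the machine count and d the vertex degree):

```python
# Expected number of machines spanned by a degree-d vertex when its d edges
# are placed uniformly at random over p machines: each machine is missed by
# one edge with probability (1 - 1/p), so E[span] = p * (1 - (1 - 1/p)**d).

def expected_span(p, d):
    return p * (1 - (1 - 1 / p) ** d)

# expected_span(100, 1) is about 1; for very high degree it approaches p.
```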

Page 39:

Random Vertex-Cuts

• Assign edges randomly to machines and allow vertices to span machines.
• Expected number of machines spanned by a vertex:

[Figure: improvement over random edge-cuts (log scale, 1x to 1000x) vs. number of machines, for power-law exponents α = 1.65, 1.7, 1.8, and 2.]

Page 40:

Greedy Vertex-Cuts by Derandomization

• Place the next edge on the machine that minimizes the future expected cost, given the placement information for previous vertices.
• Greedy: edges are greedily placed using a shared placement history.
• Oblivious: edges are greedily placed using a local placement history.
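A simplified sketch of the greedy placement rule (the tie-breaking and load term are assumptions for illustration; the real heuristic is more careful):

```python
# Simplified greedy edge placement (assumed tie-breaking, not the exact
# system heuristic): spans[v] is the set of machines vertex v already spans.
# Prefer a machine both endpoints span, then one either spans, then the
# least-loaded machine; this bounds how many new replicas an edge creates.

def place_edge(u, v, spans, loads):
    both = spans[u] & spans[v]
    either = spans[u] | spans[v]
    candidates = both or either or set(range(len(loads)))
    m = min(candidates, key=lambda i: (loads[i], i))   # least-loaded, stable
    spans[u].add(m)
    spans[v].add(m)
    loads[m] += 1
    return m
```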

Page 41:

Greedy Placement

• Shared objective (requires communication between Machine 1 and Machine 2)

Page 42:

Oblivious Placement

• Local objectives: CPU 1 and CPU 2 each use their own local objective, with no coordination

Page 43:

Partitioning Performance

Twitter Graph: 41M vertices, 1.4B edges

[Figure: machines spanned per vertex and load-time (seconds) for random, oblivious, and greedy placement.]

Oblivious/Greedy balance partition quality and partitioning time.

Page 44:

32-Way Partitioning Quality

Graph        Vertices  Edges
Twitter      41M       1.4B
UK           133M      5.5B
Amazon       0.7M      5.2M
LiveJournal  5.4M      79M
Hollywood    2.2M      229M

[Figure: machines spanned per vertex on each graph.]

Oblivious: 2x improvement over random, + 20% load-time
Greedy: 3x improvement over random, + 100% load-time

Page 45:

System Evaluation

Page 46:

Implementation

• Implemented as a C++ API
• Asynchronous IO over TCP/IP
• Fault-tolerance is achieved by check-pointing
• Substantially simpler than the original GraphLab
  – Synchronous engine < 600 lines of code
• Evaluated on 64 EC2 HPC cc1.4xLarge instances

Page 47:

Comparison with GraphLab & Pregel

• PageRank on synthetic power-law graphs
  – Random edge and vertex cuts

[Figure: runtime and communication as graphs get denser; GraphLab2 is lowest in both panels.]

Page 48:

Benefits of a good Partitioning

Better partitioning has a significant impact on performance.

Page 49:

Performance: PageRank

Twitter Graph: 41M vertices, 1.4B edges

[Figure: runtime and communication under random, oblivious, and greedy partitioning.]

Page 50:

Matrix Factorization

• Matrix factorization of the Wikipedia dataset (11M vertices, 315M edges): a bipartite Docs × Words graph.

Consistency = Lower Throughput

Page 51:

Matrix Factorization

Consistency = Faster Convergence

[Figure: fully asynchronous vs. serially consistent execution.]

Page 52:

PageRank on AltaVista Webgraph

1.4B vertices, 6.7B edges

Pegasus: 1320 s on 800 cores
GraphLab2: 76 s on 512 cores

Page 53:

Conclusion

• Graph-parallel abstractions are an emerging tool for large-scale machine learning
• The challenges of natural graphs:
  – Power-law degree distribution
  – Difficult to partition
• GraphLab2:
  – Distributes single vertex-programs
  – New vertex partitioning heuristic to rapidly place large power-law graphs
  – Experimentally outperforms existing graph-parallel abstractions

Page 54:

Carnegie Mellon University

Official release in July. http://graphlab.org

[email protected]

Page 55:

Pregel Message Combiners

User-defined commutative, associative (+) message operation:

[Figure: messages bound for the same vertex are combined with a sum on the sending machine before crossing from Machine 1 to Machine 2.]
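A combiner of this shape can be sketched as follows (assumed structure, not Pregel's API):

```python
# Sketch of a Pregel-style sum combiner (assumed shape): messages bound for
# the same destination vertex are reduced on the sending machine, so only
# one partial sum per destination crosses the network.

def combine_outbox(outbox):
    """outbox: list of (dest_vertex, value) pairs produced on one machine."""
    combined = {}
    for dest, value in outbox:
        combined[dest] = combined.get(dest, 0) + value   # commutative (+)
    return combined
```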

Page 56:

Costly on High Fan-Out

Many identical messages are sent across the network to the same machine.

[Figure: Machine 1 sending duplicate messages to Machine 2.]

Page 57:

GraphLab Ghosts

Neighbors' values are cached locally and maintained by the system.

[Figure: Machine 1 holds ghost copies of Machine 2's vertices.]

Page 58:

Reduces Cost of High Fan-Out

A change to a high-degree vertex is communicated with a single message.

[Figure: Machine 1 and Machine 2.]

Page 59:

Increases Cost of High Fan-In

Changes to neighbors are synchronized individually and collected sequentially.

[Figure: Machine 1 and Machine 2.]

Page 60:

Comparison with GraphLab & Pregel

• PageRank on synthetic power-law graphs

[Figure: two panels, power-law fan-in and power-law fan-out, as graphs get denser; GraphLab2 shown in both.]

Page 61:

Straggler Effect

• PageRank on synthetic power-law graphs

[Figure: two panels, power-law fan-in and power-law fan-out, comparing GraphLab, Pregel (Piccolo), and GraphLab2 as graphs get denser.]

Page 62:

Cached Gather for PageRank

[Figure: initial accumulator computation time vs. subsequent incremental updates.]

Reduces runtime by ~50%.