Distributed Graph-Parallel Computation on Natural Graphs
Joseph Gonzalez, Yucheng Low, Danny Bickson, Haijie Gu
Joint work with: Carlos Guestrin
Graphs are ubiquitous.
Social Media, Science, Advertising, Web
• Graphs encode relationships between: people, facts, products, interests, ideas
• Big: billions of vertices and edges, and rich metadata
Graphs are Essential to Data-Mining and Machine Learning
• Identify influential people and information
• Find communities
• Target ads and products
• Model complex data dependencies
Natural Graphs: graphs derived from natural phenomena
Problem:
Existing distributed graph computation systems perform poorly on natural graphs.
PageRank on Twitter Follower Graph
Natural graph with 40M users, 1.4 billion links
(bar chart: runtime per iteration, 0 to 200 s, for Hadoop, Twister, GraphLab, Piccolo, and PowerGraph)
Hadoop results from [Kang et al. '11]; Twister (in-memory MapReduce) from [Ekanayake et al. '10]
Order-of-magnitude speedup by exploiting the properties of natural graphs
Properties of Natural Graphs: Power-Law Degree Distribution
Power-Law Degree Distribution
AltaVista Web Graph: 1.4B vertices, 6.6B edges
(log-log plot: number of vertices vs. degree)
More than 10^8 vertices have exactly one neighbor, while the top 1% of vertices (the high-degree vertices) are adjacent to 50% of the edges!
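To see how heavy this tail is, here is a small sketch (not from the talk) that samples a power-law degree sequence and measures how much of the edge mass the top 1% of vertices carries:

    # Sketch (not from the talk): sample a power-law degree sequence and
    # measure the share of edge endpoints held by the top 1% of vertices.
    import random

    def sample_powerlaw_degrees(n, alpha=2.0, max_degree=10**6):
        # Inverse-transform sampling from P(d) proportional to d^(-alpha), d >= 1.
        draws = (int((1 - random.random()) ** (-1.0 / (alpha - 1.0)))
                 for _ in range(n))
        return [min(max(d, 1), max_degree) for d in draws]

    degrees = sorted(sample_powerlaw_degrees(1_000_000), reverse=True)
    top_mass = sum(degrees[: len(degrees) // 100])  # degree mass of the top 1%
    print("top 1%% of vertices touch %.0f%% of edge endpoints"
          % (100.0 * top_mass / sum(degrees)))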
Power-Law Degree Distribution: the "star-like" motif
(e.g., President Obama and his followers)
Power-Law Graphs are Difficult to Partition
• Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
• Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]
(illustration: a graph split across CPU 1 and CPU 2)
Properties of Natural Graphs
Power-Law Degree Distribution → High-Degree Vertices → Low-Quality Partition
• Split high-degree vertices across machines
• New abstraction → equivalence on split vertices
How do we program graph computation?
Program for this... run on this.
"Think like a Vertex." (Malewicz et al. [SIGMOD'10])
The Graph-Parallel Abstraction
• A user-defined vertex-program runs on each vertex
• The graph constrains interaction along edges:
  – Using messages (e.g., Pregel [PODC'09, SIGMOD'10])
  – Through shared state (e.g., GraphLab [UAI'10, VLDB'12])
• Parallelism: run multiple vertex-programs simultaneously
Example
What is the popularity of this user? Is she popular?
It depends on the popularity of her followers, which in turn depends on the popularity of their followers.
PageRank Algorithm
• Update ranks in parallel
• Iterate until convergence

The rank of user i is a weighted sum of her neighbors' ranks:

    R[i] = 0.15 + \sum_{j \in Nbrs(i)} w_{ji} R[j]
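For concreteness, a minimal single-machine sketch of this update (not PowerGraph code; here the damping factor 0.85 is folded into the edge weights w, consistent with the formula above):

    # Iterate R[i] = 0.15 + sum_j w[j,i] * R[j] to convergence on a toy graph.
    def pagerank(in_nbrs, w, iters=100, tol=1e-6):
        R = {i: 0.0 for i in in_nbrs}
        for _ in range(iters):
            R_new = {i: 0.15 + sum(w[(j, i)] * R[j] for j in in_nbrs[i])
                     for i in in_nbrs}
            if max(abs(R_new[i] - R[i]) for i in R) < tol:
                return R_new
            R = R_new
        return R

    # Toy 3-cycle 0 -> 1 -> 2 -> 0 with damping folded into the weights.
    in_nbrs = {0: [2], 1: [0], 2: [1]}
    w = {(2, 0): 0.85, (0, 1): 0.85, (1, 2): 0.85}
    print(pagerank(in_nbrs, w))  # converges to R[i] = 1.0 for every vertex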
The Pregel Abstraction
Vertex-programs interact by sending messages. (Malewicz et al. [PODC'09, SIGMOD'10])

    Pregel_PageRank(i, messages):
        // Receive all the messages
        total = 0
        foreach (msg in messages):
            total = total + msg
        // Update the rank of this vertex
        R[i] = 0.15 + total
        // Send new messages to neighbors
        foreach (j in out_neighbors[i]):
            Send msg(R[i] * w[i,j]) to vertex j
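To make the superstep semantics concrete, here is a hedged single-machine sketch (illustrative names only, not the actual Pregel API): each round delivers the previous round's messages, applies the update, and emits new messages.

    # Simulate Pregel supersteps for the program above on one machine.
    def superstep(R, inbox, out_nbrs, w):
        outbox = {i: [] for i in R}
        for i, messages in inbox.items():
            R[i] = 0.15 + sum(messages)             # apply the PageRank update
            for j in out_nbrs[i]:
                outbox[j].append(R[i] * w[(i, j)])  # send new messages
        return outbox

    out_nbrs = {0: [1], 1: [2], 2: [0]}
    w = {(0, 1): 0.85, (1, 2): 0.85, (2, 0): 0.85}  # damping folded into w
    R = {i: 0.0 for i in out_nbrs}
    inbox = {i: [0.0] for i in out_nbrs}            # empty first-round messages
    for _ in range(50):
        inbox = superstep(R, inbox, out_nbrs, w)
    print(R)                                        # approaches 1.0 everywhere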
The GraphLab Abstraction
Vertex-programs directly read their neighbors' state. (Low et al. [UAI'10, VLDB'12])

    GraphLab_PageRank(i):
        // Compute sum over neighbors
        total = 0
        foreach (j in in_neighbors(i)):
            total = total + R[j] * w[j,i]
        // Update the PageRank
        R[i] = 0.15 + total
        // Trigger neighbors to run again
        if R[i] not converged then
            foreach (j in out_neighbors(i)):
                signal vertex-program on j
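The shared-state model can be sketched with an explicit signal queue (again illustrative; the real GraphLab engine is a parallel C++ system):

    # Pull-based execution with dynamic scheduling via a signal queue.
    from collections import deque

    def graphlab_pagerank(in_nbrs, out_nbrs, w, tol=1e-6):
        R = {i: 0.0 for i in in_nbrs}
        queue, queued = deque(R), set(R)      # every vertex starts signaled
        while queue:
            i = queue.popleft()
            queued.discard(i)
            new_r = 0.15 + sum(w[(j, i)] * R[j] for j in in_nbrs[i])
            if abs(new_r - R[i]) > tol:       # not converged: signal neighbors
                for j in out_nbrs[i]:
                    if j not in queued:
                        queue.append(j)
                        queued.add(j)
            R[i] = new_r
        return R

    in_nbrs = {0: [2], 1: [0], 2: [1]}
    out_nbrs = {0: [1], 1: [2], 2: [0]}
    w = {(2, 0): 0.85, (0, 1): 0.85, (1, 2): 0.85}
    print(graphlab_pagerank(in_nbrs, out_nbrs, w))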
Challenges of High-Degree Vertices
• Sequentially process edges (GraphLab)
• Touch a large fraction of the graph (GraphLab)
• Asynchronous execution requires heavy locking (GraphLab)
• Send many messages (Pregel)
• Synchronous execution prone to stragglers (Pregel)
• Edge meta-data too large for a single machine
Communication Overhead for High-Degree Vertices: Fan-In vs. Fan-Out
Pregel Message Combiners on Fan-In
• A user-defined commutative, associative (+) message operation lets Machine 1 locally sum the messages from A, C, and D into one partial sum before sending it to B on Machine 2 (see the sketch below).
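A sum combiner can be sketched in a few lines (illustrative, not the Pregel API): messages bound for the same destination are pre-aggregated on the sending machine, so only one partial sum crosses the network.

    # User-defined commutative, associative (+) combiner applied per machine.
    def combine(outbox):
        return {dest: sum(msgs) for dest, msgs in outbox.items()}

    local_outbox = {"B": [0.2, 0.4, 0.1]}  # messages from A, C, D on Machine 1
    print(combine(local_outbox))           # {'B': 0.7}: one message on the wire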
Pregel Struggles with Fan-Out
• Broadcast sends many copies of the same message to the same machine!
(illustration: Machine 1 sends duplicate copies of one vertex's message to A, C, and D on Machine 2)
Fan-In and Fan-Out Performance
• PageRank on synthetic power-law graphs
  – Piccolo was used to simulate Pregel with combiners
(plot: total communication in GB vs. power-law constant α from 1.8 to 2.2; lower α means more high-degree vertices)
GraphLab Ghosting
• Changes to the master are synced to its ghosts
(illustration: A, B, C on Machine 1 are ghosted on Machine 2, and D on Machine 2 is ghosted on Machine 1)
GraphLab Ghosting
• Changes to the neighbors of high-degree vertices create substantial network traffic
(same illustration: every ghost of a high-degree vertex's neighborhood must be kept in sync)
Fan-In and Fan-Out Performance
• PageRank on synthetic power-law graphs
• GraphLab treats edges as undirected, so fan-in and fan-out incur the same ghosting cost
(plot: total communication in GB vs. power-law constant α from 1.8 to 2.2; lower α means more high-degree vertices)
Graph Partitioning
• Graph-parallel abstractions rely on partitioning:
  – Minimize communication
  – Balance computation and storage
• Data transmitted across the network is O(# cut edges)
(illustration: a vertex Y with edges cut between Machine 1 and Machine 2)
Random Partitioning
• Both GraphLab and Pregel resort to random (hashed) partitioning on natural graphs

Figure 4: (a) An edge-cut and (b) a vertex-cut of a graph into three parts. Shaded vertices are ghosts and mirrors, respectively.
5 Distributed Graph Placement

The PowerGraph abstraction relies on the distributed data-graph to store the computation state and encode the interaction between vertex-programs. The placement of the data-graph structure and data plays a central role in minimizing communication and ensuring work balance.

A common approach to placing a graph on a cluster of p machines is to construct a balanced p-way edge-cut (e.g., Fig. 4a) in which vertices are evenly assigned to machines and the number of edges spanning machines is minimized. Unfortunately, the tools [21, 31] for constructing balanced edge-cuts perform poorly [1, 26, 23] or even fail on power-law graphs. When the graph is difficult to partition, both GraphLab and Pregel resort to hashed (random) vertex placement. While fast and easy to implement, hashed vertex placement cuts most of the edges:
Theorem 5.1. If vertices are randomly assigned to p machines then the expected fraction of edges cut is:

$$\mathbb{E}\!\left[\frac{|\text{Edges Cut}|}{|E|}\right] = 1 - \frac{1}{p} \qquad (5.1)$$

For example, if just two machines are used, half of the edges will be cut, requiring order |E|/2 communication.
5.1 Balanced p-way Vertex-Cut

The PowerGraph abstraction enables a single vertex-program to span multiple machines. Hence, we can ensure work balance by evenly assigning edges to machines. Communication is minimized by limiting the number of machines a single vertex spans. A balanced p-way vertex-cut formalizes this objective by assigning each edge e ∈ E to a machine A(e) ∈ {1, ..., p}. Each vertex then spans the set of machines A(v) ⊆ {1, ..., p} that contain its adjacent edges. We define the balanced vertex-cut objective:

$$\min_{A} \; \frac{1}{|V|} \sum_{v \in V} |A(v)| \qquad (5.2)$$

$$\text{s.t.} \quad \max_{m} \, \bigl|\{e \in E \mid A(e) = m\}\bigr| < \lambda \frac{|E|}{p} \qquad (5.3)$$

where the imbalance factor λ ≥ 1 is a small constant. We use the term replicas of a vertex v to denote the |A(v)| copies of the vertex v: each machine in A(v) has a replica of v. The objective term (Eq. 5.2) therefore minimizes the average number of replicas in the graph and, as a consequence, the total storage and communication requirements of the PowerGraph engine.
Vertex-cuts address many of the major issues associated with edge-cuts in power-law graphs. Percolation theory [3] suggests that power-law graphs have good vertex-cuts. Intuitively, by cutting a small fraction of the very high-degree vertices we can quickly shatter a graph. Furthermore, because the balance constraint (Eq. 5.3) ensures that edges are uniformly distributed over machines, we naturally achieve improved work balance even in the presence of very high-degree vertices.

The simplest method to construct a vertex-cut is to randomly assign edges to machines. Random (hashed) edge placement is fully data-parallel, achieves nearly perfect balance on large graphs, and can be applied in the streaming setting. In the following we relate the expected normalized replication factor (Eq. 5.2) to the number of machines and the power-law constant α.
Theorem 5.2 (Randomized Vertex Cuts). Let D[v] denote the degree of vertex v. A uniform random edge placement on p machines has an expected replication factor

$$\mathbb{E}\!\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = \frac{p}{|V|} \sum_{v \in V} \left(1 - \left(1 - \frac{1}{p}\right)^{D[v]}\right). \qquad (5.4)$$

For a graph with power-law constant α we obtain:

$$\mathbb{E}\!\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = p - p \, \frac{\mathrm{Li}_{\alpha}\!\left(\frac{p-1}{p}\right)}{\zeta(\alpha)} \qquad (5.5)$$

where Li_α(x) is the transcendental polylog function and ζ(α) is the Riemann zeta function (plotted in Fig. 5a).
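Eq. (5.4) is easy to evaluate directly from a degree sequence; the sketch below (illustrative, not from the paper) compares it against a simulated random edge placement:

    # Expected replication factor (Eq. 5.4) vs. a simulated random placement.
    import random
    from collections import defaultdict

    def expected_replication(degrees, p):
        # (p/|V|) * sum_v (1 - (1 - 1/p)^D[v])
        return p / len(degrees) * sum(1 - (1 - 1 / p) ** d for d in degrees)

    def simulated_replication(edges, p):
        spans = defaultdict(set)                 # A(v): machines spanning v
        for u, v in edges:
            m = random.randrange(p)              # hash the edge to a machine
            spans[u].add(m)
            spans[v].add(m)
        return sum(len(s) for s in spans.values()) / len(spans)

    edges = [(i, (i + 1) % 1000) for i in range(1000)]  # a ring...
    edges += [(0, v) for v in range(1, 500)]            # ...plus one hub vertex
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    print(expected_replication(list(deg.values()), p=8),
          simulated_replication(edges, p=8))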
Higher α values imply a lower replication factor, confirming our earlier intuition. In contrast to a random 2-way edge-cut, which requires order |E|/2 communication, a random 2-way vertex-cut on an α = 2 power-law graph requires only order 0.3|V| communication, a substantial savings on natural graphs where |E| can be an order of magnitude larger than |V| (see Tab. 1a).
5.2 Greedy Vertex-Cuts

We can improve upon the randomly constructed vertex-cut by de-randomizing the edge-placement process. The resulting algorithm is a sequential greedy heuristic which places the next edge on the machine that minimizes the conditional expected replication factor. To construct the de-randomization we consider the task of placing the (i+1)-th edge after having placed the previous i edges. Using the conditional expectation we define the objective:

$$\arg\min_{k} \; \mathbb{E}\!\left[\sum_{v \in V} |A(v)| \;\middle|\; A_i, \, A(e_{i+1}) = k\right] \qquad (5.6)$$
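A simplified sketch of this greedy rule (the paper's case analysis is more refined; this is illustrative only): place each edge on a machine that adds as few new replicas as possible, breaking ties by load.

    # Greedy edge placement: minimize new replicas, then balance load.
    from collections import defaultdict

    def greedy_vertex_cut(edges, p):
        spans = defaultdict(set)                 # A(v): machines spanning v
        load = [0] * p
        assignment = {}
        for u, v in edges:
            both = spans[u] & spans[v]
            either = spans[u] | spans[v]
            # Prefer a machine already holding both endpoints, then one
            # holding either endpoint, else fall back to any machine.
            candidates = both or either or range(p)
            m = min(candidates, key=lambda k: load[k])
            spans[u].add(m)
            spans[v].add(m)
            load[m] += 1
            assignment[(u, v)] = m
        return assignment, spans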
10 machines → 90% of edges cut; 100 machines → 99% of edges cut!
In Summary
GraphLab and Pregel are not well suited for natural graphs:
• Challenges of high-degree vertices
• Low-quality partitioning
PowerGraph:
• GAS Decomposition: distribute vertex-programs
  – Move computation to data
  – Parallelize high-degree vertices
• Vertex Partitioning:
  – Effectively distribute large power-law graphs
A Common Pattern for Vertex-Programs
Gather information about the neighborhood → update the vertex → signal neighbors & modify edge data

    GraphLab_PageRank(i):
        // Gather: compute sum over neighbors
        total = 0
        foreach (j in in_neighbors(i)):
            total = total + R[j] * w[j,i]
        // Apply: update the PageRank
        R[i] = 0.15 + total
        // Scatter: trigger neighbors to run again
        if R[i] not converged then
            foreach (j in out_neighbors(i)):
                signal vertex-program on j
GAS Decomposition
• Gather (Reduce): accumulate information about the neighborhood
  – User defined: Gather(Y) → Σ, combined with a parallel sum: Σ1 + Σ2 → Σ3
• Apply: apply the accumulated value to the center vertex
  – User defined: Apply(Y, Σ) → Y'
• Scatter: update adjacent edges and vertices
  – User defined: Scatter(Y') → update edge data & activate neighbors
PageRank in PowerGraph

    R[i] = 0.15 + \sum_{j \in Nbrs(i)} w_{ji} R[j]

    PowerGraph_PageRank(i):
        Gather(j → i): return w[j,i] * R[j]
        sum(a, b): return a + b
        Apply(i, Σ): R[i] = 0.15 + Σ
        Scatter(i → j): if R[i] changed then trigger j to be recomputed
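A single-machine sketch of the engine loop that would drive this program (illustrative only; the real engine is a distributed C++ system that additionally splits gather and scatter across mirrors):

    # Minimal GAS engine: gather over in-edges, apply, scatter over out-edges.
    def gas_engine(in_nbrs, out_nbrs, gather, apply_fn, scatter, rounds=50):
        active = set(in_nbrs)
        while active and rounds > 0:
            rounds -= 1
            sigma = {i: sum(gather(j, i) for j in in_nbrs[i]) for i in active}
            signaled = set()
            for i in active:
                changed = apply_fn(i, sigma[i])    # update the center vertex
                for j in out_nbrs[i]:
                    if scatter(i, j, changed):     # optionally signal neighbor
                        signaled.add(j)
            active = signaled

    R = {0: 0.0, 1: 0.0, 2: 0.0}
    w = {(2, 0): 0.85, (0, 1): 0.85, (1, 2): 0.85}  # damping folded into w
    in_nbrs = {0: [2], 1: [0], 2: [1]}
    out_nbrs = {0: [1], 1: [2], 2: [0]}

    def gather(j, i):
        return w[(j, i)] * R[j]

    def apply_fn(i, sigma):
        old, R[i] = R[i], 0.15 + sigma
        return abs(R[i] - old) > 1e-6              # "R[i] changed"

    def scatter(i, j, changed):
        return changed                             # trigger j to be recomputed

    gas_engine(in_nbrs, out_nbrs, gather, apply_fn, scatter)
    print(R)                                       # approaches 1.0 everywhere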
Distributed Execution of a PowerGraph Vertex-Program
(illustration across Machines 1-4: mirrors compute partial sums Σ1...Σ4, which are combined at the master during Gather; the master runs Apply to produce Y', which is sent back to the mirrors; Scatter then runs at both master and mirrors)
MinimizingCommunicationinPowerGraph
YYY
Avertex-cutminimizesmachineseachvertexspans
Percolationtheorysuggeststhatpowerlawgraphshavegoodvertexcuts.[Albertetal.2000]
Communicationislinearinthenumberofmachines
eachvertexspans
36
New Approach to Partitioning
• Rather than cut edges (many edges must be synchronized between CPU 1 and CPU 2)...
• ...we cut vertices (only a single vertex must be synchronized)
New Theorem: For any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage.
Constructing Vertex-Cuts
• Evenly assign edges to machines
  – Minimize machines spanned by each vertex
• Assign each edge as it is loaded
  – Touch each edge only once
• We propose three distributed approaches:
  – Random edge placement
  – Coordinated greedy edge placement
  – Oblivious greedy edge placement
Random Edge Placement
• Randomly assign edges to machines
(illustration across Machines 1-3: a balanced vertex-cut in which Y spans 3 machines, Z spans 2 machines, and some vertices are not cut at all)
Analysis of Random Edge Placement
• Expected number of machines spanned by a vertex, on the Twitter follower graph (41 million vertices, 1.4 billion edges)
(plot: expected # of machines spanned vs. number of machines, 8 to 48; the prediction closely tracks the measured random placement)
• Accurately estimates memory and communication overhead
Random Vertex-Cuts vs. Edge-Cuts
• Expected improvement from vertex-cuts:
(plot, log scale: reduction in communication and storage vs. number of machines, 0 to 150; an order-of-magnitude improvement)
Greedy Vertex-Cuts
• Place edges on machines which already have the vertices in that edge.
(illustration: Machine 1 holds edges (A,B) and (A,D); Machine 2 holds edges (C,B) and (E,B))
Greedy Vertex-Cuts
• De-randomization → greedily minimize the expected number of machines spanned
• Coordinated edge placement
  – Requires coordination to place each edge
  – Slower, but higher-quality cuts
• Oblivious edge placement
  – Approximates the greedy objective without coordination
  – Faster, but lower-quality cuts
Partitioning Performance
Twitter graph: 41M vertices, 1.4B edges
(left plot: cost, the average number of machines spanned, vs. number of machines from 8 to 64; right plot: partitioning time in seconds vs. number of machines; lower is better for both)
Oblivious placement balances cut cost and partitioning time.
Greedy Vertex-Cuts Improve Performance
(bar chart: runtime relative to random placement for PageRank, collaborative filtering, and shortest path, under random, oblivious, and coordinated placement)
Greedy partitioning improves computation performance.
Other Features (See Paper)
• Supports three execution modes:
  – Synchronous: bulk-synchronous GAS phases
  – Asynchronous: interleave GAS phases
  – Asynchronous + Serializable: neighboring vertices do not run simultaneously
• Delta caching
  – Accelerates the gather phase by caching partial sums for each vertex (see the sketch below)
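A minimal sketch of the idea (assuming a per-vertex cache of the gather sum; illustrative, not the actual PowerGraph API): a warm vertex skips the full gather, and changes flow in as deltas.

    # Delta caching: keep each vertex's gather sum and patch it with deltas.
    cache = {}                                  # vertex -> cached gather sum

    def cached_gather(i, in_nbrs, gather):
        if i not in cache:                      # cold vertex: full gather once
            cache[i] = sum(gather(j, i) for j in in_nbrs[i])
        return cache[i]

    def scatter_delta(j, delta):
        if j in cache:
            cache[j] += delta                   # fold the change into the cache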
System Evaluation
System Design
• Implemented as a C++ API
• Uses HDFS for graph input and output
• Fault tolerance is achieved by checkpointing
  – Snapshot time < 5 seconds for the Twitter network
(software stack: the PowerGraph (GraphLab2) system runs on MPI/TCP-IP, PThreads, and HDFS, on EC2 HPC nodes)
Implemented Many Algorithms
• Collaborative filtering: alternating least squares, stochastic gradient descent, SVD, non-negative MF
• Statistical inference: loopy belief propagation, max-product linear programs, Gibbs sampling
• Graph analytics: PageRank, triangle counting, shortest path, graph coloring, k-core decomposition
• Computer vision: image stitching
• Language modeling: LDA
Comparison with GraphLab & Pregel
• PageRank on synthetic power-law graphs
(two plots vs. power-law constant α: total network communication in GB and per-iteration runtime in seconds, for Pregel (Piccolo) and GraphLab; lower α means more high-degree vertices)
PowerGraph is robust to high-degree vertices.
PageRank on the Twitter Follower Graph
Natural graph with 40M users, 1.4 billion links
(bar charts: total network communication in GB and runtime in seconds for GraphLab, Pregel (Piccolo), and PowerGraph)
PowerGraph reduces communication and runs faster.
32 nodes x 8 cores (EC2 HPC cc1.4x)
PowerGraph is Scalable
Yahoo AltaVista Web Graph (2002): one of the largest publicly available web graphs, with 1.4 billion webpages and 6.6 billion links.
• 1024 cores (2048 HT) across 64 HPC nodes
• 7 seconds per iteration: 1B links processed per second
• 30 lines of user code
Topic Modeling
• English language Wikipedia
  – 2.6M documents, 8.3M words, 500M tokens
  – Computationally intensive algorithm
(bar chart: million tokens per second)
• Smola et al.: 100 Yahoo! machines, specifically engineered for this task
• PowerGraph: 64 cc2.8xlarge EC2 nodes, 200 lines of code & 4 human hours
Triangle Counting on the Twitter Graph
Identify individuals with strong communities.
• Counted: 34.8 billion triangles
• Hadoop [WWW'11]: 1536 machines, 423 minutes
• PowerGraph: 64 machines, 1.5 minutes (282x faster)
Why? The wrong abstraction: broadcasting O(degree^2) messages per vertex.
S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11
Summary
• Problem: computation on natural graphs is challenging
  – High-degree vertices
  – Low-quality edge-cuts
• Solution: the PowerGraph system
  – GAS decomposition: split vertex-programs
  – Vertex partitioning: distribute natural graphs
• PowerGraph theoretically and experimentally outperforms existing graph-parallel systems.
PowerGraph (GraphLab2) System
Machine learning and data-mining toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering
Future Work
• Time-evolving graphs
  – Support structural changes during computation
• Out-of-core storage (GraphChi)
  – Support graphs that don't fit in memory
• Improved fault tolerance
  – Leverage vertex replication to reduce snapshots
  – Asynchronous recovery
PowerGraph is GraphLab Version 2.1 (Apache 2 License)
http://graphlab.org
Documentation... Code... Tutorials... (more on the way)