Distributed Graph-Parallel Computation on Natural Graphs
Joseph Gonzalez, Yucheng Low, Danny Bickson, Haijie Gu
Joint work with: Carlos Guestrin
Graphs are ubiquitous.
Social Media, Science, Advertising, Web
• Graphs encode relationships between: people, facts, products, interests, ideas
• Big: billions of vertices and edges, and rich metadata
Graphs are Essential to Data-Mining and Machine Learning
• Identify influential people and information
• Find communities
• Target ads and products
• Model complex data dependencies
Natural Graphs: graphs derived from natural phenomena
Problem:
Existing distributed graph computation systems perform poorly on natural graphs.
PageRank on Twitter Follower Graph
Natural graph with 40M users, 1.4 billion links
(bar chart: runtime per iteration, 0 to 200 s, for Hadoop, Twister, GraphLab, Piccolo, and PowerGraph)
Hadoop results from [Kang et al. '11]; Twister (in-memory MapReduce) from [Ekanayake et al. '10]
Order-of-magnitude speedup by exploiting the properties of natural graphs
Properties of Natural Graphs: Power-Law Degree Distribution
Power-Law Degree Distribution
AltaVista Web Graph: 1.4B vertices, 6.6B edges
(log-log plot: number of vertices vs. degree)
More than 10^8 vertices have exactly one neighbor, while the top 1% of vertices (the high-degree vertices) are adjacent to 50% of the edges!
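To see how heavy this tail is, here is a small sketch (not from the talk) that samples a power-law degree sequence and measures how much of the edge mass the top 1% of vertices carries:

    # Sketch (not from the talk): sample a power-law degree sequence and
    # measure the share of edge endpoints held by the top 1% of vertices.
    import random

    def sample_powerlaw_degrees(n, alpha=2.0, max_degree=10**6):
        # Inverse-transform sampling from P(d) proportional to d^(-alpha), d >= 1.
        draws = (int((1 - random.random()) ** (-1.0 / (alpha - 1.0)))
                 for _ in range(n))
        return [min(max(d, 1), max_degree) for d in draws]

    degrees = sorted(sample_powerlaw_degrees(1_000_000), reverse=True)
    top_mass = sum(degrees[: len(degrees) // 100])  # degree mass of the top 1%
    print("top 1%% of vertices touch %.0f%% of edge endpoints"
          % (100.0 * top_mass / sum(degrees)))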
Power-Law Degree Distribution: the "star-like" motif
(e.g., President Obama and his followers)
Power-Law Graphs are Difficult to Partition
• Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
• Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]
(illustration: a graph split across CPU 1 and CPU 2)
Properties of Natural Graphs
Power-Law Degree Distribution → High-Degree Vertices → Low-Quality Partition
• Split high-degree vertices across machines
• New abstraction → equivalence on split vertices
How do we program graph computation?
Program for this... run on this.
"Think like a Vertex." (Malewicz et al. [SIGMOD'10])
The Graph-Parallel Abstraction
• A user-defined vertex-program runs on each vertex
• The graph constrains interaction along edges:
  – Using messages (e.g., Pregel [PODC'09, SIGMOD'10])
  – Through shared state (e.g., GraphLab [UAI'10, VLDB'12])
• Parallelism: run multiple vertex-programs simultaneously
Example
What is the popularity of this user? Is she popular?
It depends on the popularity of her followers, which in turn depends on the popularity of their followers.
PageRank Algorithm
• Update ranks in parallel
• Iterate until convergence

The rank of user i is a weighted sum of her neighbors' ranks:

    R[i] = 0.15 + \sum_{j \in Nbrs(i)} w_{ji} R[j]
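For concreteness, a minimal single-machine sketch of this update (not PowerGraph code; here the damping factor 0.85 is folded into the edge weights w, consistent with the formula above):

    # Iterate R[i] = 0.15 + sum_j w[j,i] * R[j] to convergence on a toy graph.
    def pagerank(in_nbrs, w, iters=100, tol=1e-6):
        R = {i: 0.0 for i in in_nbrs}
        for _ in range(iters):
            R_new = {i: 0.15 + sum(w[(j, i)] * R[j] for j in in_nbrs[i])
                     for i in in_nbrs}
            if max(abs(R_new[i] - R[i]) for i in R) < tol:
                return R_new
            R = R_new
        return R

    # Toy 3-cycle 0 -> 1 -> 2 -> 0 with damping folded into the weights.
    in_nbrs = {0: [2], 1: [0], 2: [1]}
    w = {(2, 0): 0.85, (0, 1): 0.85, (1, 2): 0.85}
    print(pagerank(in_nbrs, w))  # converges to R[i] = 1.0 for every vertex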
The Pregel Abstraction
Vertex-programs interact by sending messages. (Malewicz et al. [PODC'09, SIGMOD'10])

    Pregel_PageRank(i, messages):
        // Receive all the messages
        total = 0
        foreach (msg in messages):
            total = total + msg
        // Update the rank of this vertex
        R[i] = 0.15 + total
        // Send new messages to neighbors
        foreach (j in out_neighbors[i]):
            Send msg(R[i] * w[i,j]) to vertex j
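To make the superstep semantics concrete, here is a hedged single-machine sketch (illustrative names only, not the actual Pregel API): each round delivers the previous round's messages, applies the update, and emits new messages.

    # Simulate Pregel supersteps for the program above on one machine.
    def superstep(R, inbox, out_nbrs, w):
        outbox = {i: [] for i in R}
        for i, messages in inbox.items():
            R[i] = 0.15 + sum(messages)             # apply the PageRank update
            for j in out_nbrs[i]:
                outbox[j].append(R[i] * w[(i, j)])  # send new messages
        return outbox

    out_nbrs = {0: [1], 1: [2], 2: [0]}
    w = {(0, 1): 0.85, (1, 2): 0.85, (2, 0): 0.85}  # damping folded into w
    R = {i: 0.0 for i in out_nbrs}
    inbox = {i: [0.0] for i in out_nbrs}            # empty first-round messages
    for _ in range(50):
        inbox = superstep(R, inbox, out_nbrs, w)
    print(R)                                        # approaches 1.0 everywhere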
The GraphLab Abstraction
Vertex-programs directly read their neighbors' state. (Low et al. [UAI'10, VLDB'12])

    GraphLab_PageRank(i):
        // Compute sum over neighbors
        total = 0
        foreach (j in in_neighbors(i)):
            total = total + R[j] * w[j,i]
        // Update the PageRank
        R[i] = 0.15 + total
        // Trigger neighbors to run again
        if R[i] not converged then
            foreach (j in out_neighbors(i)):
                signal vertex-program on j
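The shared-state model can be sketched with an explicit signal queue (again illustrative; the real GraphLab engine is a parallel C++ system):

    # Pull-based execution with dynamic scheduling via a signal queue.
    from collections import deque

    def graphlab_pagerank(in_nbrs, out_nbrs, w, tol=1e-6):
        R = {i: 0.0 for i in in_nbrs}
        queue, queued = deque(R), set(R)      # every vertex starts signaled
        while queue:
            i = queue.popleft()
            queued.discard(i)
            new_r = 0.15 + sum(w[(j, i)] * R[j] for j in in_nbrs[i])
            if abs(new_r - R[i]) > tol:       # not converged: signal neighbors
                for j in out_nbrs[i]:
                    if j not in queued:
                        queue.append(j)
                        queued.add(j)
            R[i] = new_r
        return R

    in_nbrs = {0: [2], 1: [0], 2: [1]}
    out_nbrs = {0: [1], 1: [2], 2: [0]}
    w = {(2, 0): 0.85, (0, 1): 0.85, (1, 2): 0.85}
    print(graphlab_pagerank(in_nbrs, out_nbrs, w))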
Challenges of High-Degree Vertices
• Sequentially process edges (GraphLab)
• Touch a large fraction of the graph (GraphLab)
• Asynchronous execution requires heavy locking (GraphLab)
• Send many messages (Pregel)
• Synchronous execution prone to stragglers (Pregel)
• Edge meta-data too large for a single machine
Communication Overhead for High-Degree Vertices: Fan-In vs. Fan-Out
Pregel Message Combiners on Fan-In
• A user-defined commutative, associative (+) message operation lets Machine 1 locally sum the messages from A, C, and D into one partial sum before sending it to B on Machine 2 (see the sketch below).
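A sum combiner can be sketched in a few lines (illustrative, not the Pregel API): messages bound for the same destination are pre-aggregated on the sending machine, so only one partial sum crosses the network.

    # User-defined commutative, associative (+) combiner applied per machine.
    def combine(outbox):
        return {dest: sum(msgs) for dest, msgs in outbox.items()}

    local_outbox = {"B": [0.2, 0.4, 0.1]}  # messages from A, C, D on Machine 1
    print(combine(local_outbox))           # {'B': 0.7}: one message on the wire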
Pregel Struggles with Fan-Out
• Broadcast sends many copies of the same message to the same machine!
(illustration: Machine 1 sends duplicate copies of one vertex's message to A, C, and D on Machine 2)
Fan-In and Fan-Out Performance
• PageRank on synthetic power-law graphs
  – Piccolo was used to simulate Pregel with combiners
(plot: total communication in GB vs. power-law constant α from 1.8 to 2.2; lower α means more high-degree vertices)
GraphLab Ghosting
• Changes to the master are synced to its ghosts
(illustration: A, B, C on Machine 1 are ghosted on Machine 2, and D on Machine 2 is ghosted on Machine 1)
GraphLab Ghosting
• Changes to the neighbors of high-degree vertices create substantial network traffic
(same illustration: every ghost of a high-degree vertex's neighborhood must be kept in sync)
Fan-In and Fan-Out Performance
• PageRank on synthetic power-law graphs
• GraphLab treats edges as undirected, so fan-in and fan-out incur the same ghosting cost
(plot: total communication in GB vs. power-law constant α from 1.8 to 2.2; lower α means more high-degree vertices)
Graph Partitioning
• Graph-parallel abstractions rely on partitioning:
  – Minimize communication
  – Balance computation and storage
• Data transmitted across the network is O(# cut edges)
(illustration: a vertex Y with edges cut between Machine 1 and Machine 2)
Random Partitioning
• Both GraphLab and Pregel resort to random (hashed) partitioning on natural graphs

Figure 4: (a) An edge-cut and (b) a vertex-cut of a graph into three parts. Shaded vertices are ghosts and mirrors, respectively.
5 Distributed Graph Placement

The PowerGraph abstraction relies on the distributed data-graph to store the computation state and encode the interaction between vertex-programs. The placement of the data-graph structure and data plays a central role in minimizing communication and ensuring work balance.

A common approach to placing a graph on a cluster of p machines is to construct a balanced p-way edge-cut (e.g., Fig. 4a) in which vertices are evenly assigned to machines and the number of edges spanning machines is minimized. Unfortunately, the tools [21, 31] for constructing balanced edge-cuts perform poorly [1, 26, 23] or even fail on power-law graphs. When the graph is difficult to partition, both GraphLab and Pregel resort to hashed (random) vertex placement. While fast and easy to implement, hashed vertex placement cuts most of the edges:
Theorem 5.1. If vertices are randomly assigned to p machines then the expected fraction of edges cut is:

$$\mathbb{E}\!\left[\frac{|\text{Edges Cut}|}{|E|}\right] = 1 - \frac{1}{p} \qquad (5.1)$$

For example, if just two machines are used, half of the edges will be cut, requiring order |E|/2 communication.
5.1 Balanced p-way Vertex-Cut

The PowerGraph abstraction enables a single vertex-program to span multiple machines. Hence, we can ensure work balance by evenly assigning edges to machines. Communication is minimized by limiting the number of machines a single vertex spans. A balanced p-way vertex-cut formalizes this objective by assigning each edge e ∈ E to a machine A(e) ∈ {1, ..., p}. Each vertex then spans the set of machines A(v) ⊆ {1, ..., p} that contain its adjacent edges. We define the balanced vertex-cut objective:

$$\min_{A} \; \frac{1}{|V|} \sum_{v \in V} |A(v)| \qquad (5.2)$$

$$\text{s.t.} \quad \max_{m} \, \bigl|\{e \in E \mid A(e) = m\}\bigr| < \lambda \frac{|E|}{p} \qquad (5.3)$$

where the imbalance factor λ ≥ 1 is a small constant. We use the term replicas of a vertex v to denote the |A(v)| copies of the vertex v: each machine in A(v) has a replica of v. The objective term (Eq. 5.2) therefore minimizes the average number of replicas in the graph and, as a consequence, the total storage and communication requirements of the PowerGraph engine.
Vertex-cuts address many of the major issues associated with edge-cuts in power-law graphs. Percolation theory [3] suggests that power-law graphs have good vertex-cuts. Intuitively, by cutting a small fraction of the very high-degree vertices we can quickly shatter a graph. Furthermore, because the balance constraint (Eq. 5.3) ensures that edges are uniformly distributed over machines, we naturally achieve improved work balance even in the presence of very high-degree vertices.

The simplest method to construct a vertex-cut is to randomly assign edges to machines. Random (hashed) edge placement is fully data-parallel, achieves nearly perfect balance on large graphs, and can be applied in the streaming setting. In the following we relate the expected normalized replication factor (Eq. 5.2) to the number of machines and the power-law constant α.
Theorem 5.2 (Randomized Vertex Cuts). Let D[v] denote the degree of vertex v. A uniform random edge placement on p machines has an expected replication factor

$$\mathbb{E}\!\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = \frac{p}{|V|} \sum_{v \in V} \left(1 - \left(1 - \frac{1}{p}\right)^{D[v]}\right). \qquad (5.4)$$

For a graph with power-law constant α we obtain:

$$\mathbb{E}\!\left[\frac{1}{|V|} \sum_{v \in V} |A(v)|\right] = p - p \, \frac{\mathrm{Li}_{\alpha}\!\left(\frac{p-1}{p}\right)}{\zeta(\alpha)} \qquad (5.5)$$

where Li_α(x) is the transcendental polylog function and ζ(α) is the Riemann zeta function (plotted in Fig. 5a).
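Eq. (5.4) is easy to evaluate directly from a degree sequence; the sketch below (illustrative, not from the paper) compares it against a simulated random edge placement:

    # Expected replication factor (Eq. 5.4) vs. a simulated random placement.
    import random
    from collections import defaultdict

    def expected_replication(degrees, p):
        # (p/|V|) * sum_v (1 - (1 - 1/p)^D[v])
        return p / len(degrees) * sum(1 - (1 - 1 / p) ** d for d in degrees)

    def simulated_replication(edges, p):
        spans = defaultdict(set)                 # A(v): machines spanning v
        for u, v in edges:
            m = random.randrange(p)              # hash the edge to a machine
            spans[u].add(m)
            spans[v].add(m)
        return sum(len(s) for s in spans.values()) / len(spans)

    edges = [(i, (i + 1) % 1000) for i in range(1000)]  # a ring...
    edges += [(0, v) for v in range(1, 500)]            # ...plus one hub vertex
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    print(expected_replication(list(deg.values()), p=8),
          simulated_replication(edges, p=8))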
Higher α values imply a lower replication factor, confirming our earlier intuition. In contrast to a random 2-way edge-cut, which requires order |E|/2 communication, a random 2-way vertex-cut on an α = 2 power-law graph requires only order 0.3|V| communication, a substantial savings on natural graphs where |E| can be an order of magnitude larger than |V| (see Tab. 1a).
5.2 Greedy Vertex-Cuts

We can improve upon the randomly constructed vertex-cut by de-randomizing the edge-placement process. The resulting algorithm is a sequential greedy heuristic which places the next edge on the machine that minimizes the conditional expected replication factor. To construct the de-randomization we consider the task of placing the (i+1)-th edge after having placed the previous i edges. Using the conditional expectation we define the objective:

$$\arg\min_{k} \; \mathbb{E}\!\left[\sum_{v \in V} |A(v)| \;\middle|\; A_i, \, A(e_{i+1}) = k\right] \qquad (5.6)$$
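A simplified sketch of this greedy rule (the paper's case analysis is more refined; this is illustrative only): place each edge on a machine that adds as few new replicas as possible, breaking ties by load.

    # Greedy edge placement: minimize new replicas, then balance load.
    from collections import defaultdict

    def greedy_vertex_cut(edges, p):
        spans = defaultdict(set)                 # A(v): machines spanning v
        load = [0] * p
        assignment = {}
        for u, v in edges:
            both = spans[u] & spans[v]
            either = spans[u] | spans[v]
            # Prefer a machine already holding both endpoints, then one
            # holding either endpoint, else fall back to any machine.
            candidates = both or either or range(p)
            m = min(candidates, key=lambda k: load[k])
            spans[u].add(m)
            spans[v].add(m)
            load[m] += 1
            assignment[(u, v)] = m
        return assignment, spans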
10 machines → 90% of edges cut; 100 machines → 99% of edges cut!
In Summary
GraphLab and Pregel are not well suited for natural graphs:
• Challenges of high-degree vertices
• Low-quality partitioning
PowerGraph:
• GAS Decomposition: distribute vertex-programs
  – Move computation to data
  – Parallelize high-degree vertices
• Vertex Partitioning:
  – Effectively distribute large power-law graphs
A Common Pattern for Vertex-Programs
Gather information about the neighborhood → update the vertex → signal neighbors & modify edge data

    GraphLab_PageRank(i):
        // Gather: compute sum over neighbors
        total = 0
        foreach (j in in_neighbors(i)):
            total = total + R[j] * w[j,i]
        // Apply: update the PageRank
        R[i] = 0.15 + total
        // Scatter: trigger neighbors to run again
        if R[i] not converged then
            foreach (j in out_neighbors(i)):
                signal vertex-program on j
GAS Decomposition
• Gather (Reduce): accumulate information about the neighborhood
  – User defined: Gather(Y) → Σ, combined with a parallel sum: Σ1 + Σ2 → Σ3
• Apply: apply the accumulated value to the center vertex
  – User defined: Apply(Y, Σ) → Y'
• Scatter: update adjacent edges and vertices
  – User defined: Scatter(Y') → update edge data & activate neighbors
PageRank in PowerGraph

    R[i] = 0.15 + \sum_{j \in Nbrs(i)} w_{ji} R[j]

    PowerGraph_PageRank(i):
        Gather(j → i): return w[j,i] * R[j]
        sum(a, b): return a + b
        Apply(i, Σ): R[i] = 0.15 + Σ
        Scatter(i → j): if R[i] changed then trigger j to be recomputed
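A single-machine sketch of the engine loop that would drive this program (illustrative only; the real engine is a distributed C++ system that additionally splits gather and scatter across mirrors):

    # Minimal GAS engine: gather over in-edges, apply, scatter over out-edges.
    def gas_engine(in_nbrs, out_nbrs, gather, apply_fn, scatter, rounds=50):
        active = set(in_nbrs)
        while active and rounds > 0:
            rounds -= 1
            sigma = {i: sum(gather(j, i) for j in in_nbrs[i]) for i in active}
            signaled = set()
            for i in active:
                changed = apply_fn(i, sigma[i])    # update the center vertex
                for j in out_nbrs[i]:
                    if scatter(i, j, changed):     # optionally signal neighbor
                        signaled.add(j)
            active = signaled

    R = {0: 0.0, 1: 0.0, 2: 0.0}
    w = {(2, 0): 0.85, (0, 1): 0.85, (1, 2): 0.85}  # damping folded into w
    in_nbrs = {0: [2], 1: [0], 2: [1]}
    out_nbrs = {0: [1], 1: [2], 2: [0]}

    def gather(j, i):
        return w[(j, i)] * R[j]

    def apply_fn(i, sigma):
        old, R[i] = R[i], 0.15 + sigma
        return abs(R[i] - old) > 1e-6              # "R[i] changed"

    def scatter(i, j, changed):
        return changed                             # trigger j to be recomputed

    gas_engine(in_nbrs, out_nbrs, gather, apply_fn, scatter)
    print(R)                                       # approaches 1.0 everywhere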
Distributed Execution of a PowerGraph Vertex-Program
(illustration across Machines 1-4: mirrors compute partial sums Σ1...Σ4, which are combined at the master during Gather; the master runs Apply to produce Y', which is sent back to the mirrors; Scatter then runs at both master and mirrors)
MinimizingCommunicationinPowerGraph
YYY
Avertex-cutminimizesmachineseachvertexspans
Percolationtheorysuggeststhatpowerlawgraphshavegoodvertexcuts.[Albertetal.2000]
Communicationislinearinthenumberofmachines
eachvertexspans
36
New Approach to Partitioning
• Rather than cut edges (many edges must be synchronized between CPU 1 and CPU 2)...
• ...we cut vertices (only a single vertex must be synchronized)
New Theorem: For any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage.
Constructing Vertex-Cuts
• Evenly assign edges to machines
  – Minimize machines spanned by each vertex
• Assign each edge as it is loaded
  – Touch each edge only once
• We propose three distributed approaches:
  – Random edge placement
  – Coordinated greedy edge placement
  – Oblivious greedy edge placement
Random Edge Placement
• Randomly assign edges to machines
(illustration across Machines 1-3: a balanced vertex-cut in which Y spans 3 machines, Z spans 2 machines, and some vertices are not cut at all)
Analysis of Random Edge Placement
• Expected number of machines spanned by a vertex, on the Twitter follower graph (41 million vertices, 1.4 billion edges)
(plot: expected # of machines spanned vs. number of machines, 8 to 48; the prediction closely tracks the measured random placement)
• Accurately estimates memory and communication overhead
Random Vertex-Cuts vs. Edge-Cuts
• Expected improvement from vertex-cuts:
(plot, log scale: reduction in communication and storage vs. number of machines, 0 to 150; an order-of-magnitude improvement)
Greedy Vertex-Cuts
• Place edges on machines which already have the vertices in that edge.
(illustration: Machine 1 holds edges (A,B) and (A,D); Machine 2 holds edges (C,B) and (E,B))
Greedy Vertex-Cuts
• De-randomization → greedily minimize the expected number of machines spanned
• Coordinated edge placement
  – Requires coordination to place each edge
  – Slower, but higher-quality cuts
• Oblivious edge placement
  – Approximates the greedy objective without coordination
  – Faster, but lower-quality cuts
Partitioning Performance
Twitter graph: 41M vertices, 1.4B edges
(left plot: cost, the average number of machines spanned, vs. number of machines from 8 to 64; right plot: partitioning time in seconds vs. number of machines; lower is better for both)
Oblivious placement balances cut cost and partitioning time.
Greedy Vertex-Cuts Improve Performance
(bar chart: runtime relative to random placement for PageRank, collaborative filtering, and shortest path, under random, oblivious, and coordinated placement)
Greedy partitioning improves computation performance.
Other Features (See Paper)
• Supports three execution modes:
  – Synchronous: bulk-synchronous GAS phases
  – Asynchronous: interleave GAS phases
  – Asynchronous + Serializable: neighboring vertices do not run simultaneously
• Delta caching
  – Accelerates the gather phase by caching partial sums for each vertex (see the sketch below)
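A minimal sketch of the idea (assuming a per-vertex cache of the gather sum; illustrative, not the actual PowerGraph API): a warm vertex skips the full gather, and changes flow in as deltas.

    # Delta caching: keep each vertex's gather sum and patch it with deltas.
    cache = {}                                  # vertex -> cached gather sum

    def cached_gather(i, in_nbrs, gather):
        if i not in cache:                      # cold vertex: full gather once
            cache[i] = sum(gather(j, i) for j in in_nbrs[i])
        return cache[i]

    def scatter_delta(j, delta):
        if j in cache:
            cache[j] += delta                   # fold the change into the cache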
System Evaluation
System Design
• Implemented as a C++ API
• Uses HDFS for graph input and output
• Fault tolerance is achieved by checkpointing
  – Snapshot time < 5 seconds for the Twitter network
(software stack: the PowerGraph (GraphLab2) system runs on MPI/TCP-IP, PThreads, and HDFS, on EC2 HPC nodes)
Implemented Many Algorithms
• Collaborative filtering: alternating least squares, stochastic gradient descent, SVD, non-negative MF
• Statistical inference: loopy belief propagation, max-product linear programs, Gibbs sampling
• Graph analytics: PageRank, triangle counting, shortest path, graph coloring, k-core decomposition
• Computer vision: image stitching
• Language modeling: LDA
Comparison with GraphLab & Pregel
• PageRank on synthetic power-law graphs
(two plots vs. power-law constant α: total network communication in GB and per-iteration runtime in seconds, for Pregel (Piccolo) and GraphLab; lower α means more high-degree vertices)
PowerGraph is robust to high-degree vertices.
PageRank on the Twitter Follower Graph
Natural graph with 40M users, 1.4 billion links
(bar charts: total network communication in GB and runtime in seconds for GraphLab, Pregel (Piccolo), and PowerGraph)
PowerGraph reduces communication and runs faster.
32 nodes x 8 cores (EC2 HPC cc1.4x)
PowerGraph is Scalable
Yahoo AltaVista Web Graph (2002): one of the largest publicly available web graphs, with 1.4 billion webpages and 6.6 billion links.
• 1024 cores (2048 HT) across 64 HPC nodes
• 7 seconds per iteration: 1B links processed per second
• 30 lines of user code
Topic Modeling
• English language Wikipedia
  – 2.6M documents, 8.3M words, 500M tokens
  – Computationally intensive algorithm
(bar chart: million tokens per second)
• Smola et al.: 100 Yahoo! machines, specifically engineered for this task
• PowerGraph: 64 cc2.8xlarge EC2 nodes, 200 lines of code & 4 human hours
Triangle Counting on the Twitter Graph
Identify individuals with strong communities.
• Counted: 34.8 billion triangles
• Hadoop [WWW'11]: 1536 machines, 423 minutes
• PowerGraph: 64 machines, 1.5 minutes (282x faster)
Why? The wrong abstraction: broadcasting O(degree^2) messages per vertex.
S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11
Summary
• Problem: computation on natural graphs is challenging
  – High-degree vertices
  – Low-quality edge-cuts
• Solution: the PowerGraph system
  – GAS decomposition: split vertex-programs
  – Vertex partitioning: distribute natural graphs
• PowerGraph theoretically and experimentally outperforms existing graph-parallel systems.
PowerGraph (GraphLab2) System
Machine learning and data-mining toolkits: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering
Future Work
• Time-evolving graphs
  – Support structural changes during computation
• Out-of-core storage (GraphChi)
  – Support graphs that don't fit in memory
• Improved fault tolerance
  – Leverage vertex replication to reduce snapshots
  – Asynchronous recovery
PowerGraph is GraphLab Version 2.1 (Apache 2 License)
http://graphlab.org
Documentation... Code... Tutorials... (more on the way)