
Introduction to Large-Scale Graph Computation
+ GraphLab and GraphChi

Aapo Kyrola, [email protected], Feb 27, 2013

Acknowledgments

Many slides (the pretty ones) are from Joey Gonzalez's lecture (2012). Many people were involved in the research: Yucheng Low, Danny Bickson, Carlos Guestrin, Guy Blelloch, Joe Hellerstein, David O'Hallaron, Alex Smola, Haijie Gu, Arthur Gretton, Joey Gonzalez.

Contents
- Introduction to Big Graphs
- Properties of Real-World Graphs
- Why MapReduce is not good for big graphs -> specialized systems
- Vertex-Centric Programming Model
- GraphLab -- distributed computation
- GraphChi -- disk-based computation

Basic Vocabulary
- Graph (network)
- Vertex (node)
- Edge (link), in-edge, out-edge
- Sparse graph / matrix

[Figure: A --e--> B] Terms: e is an out-edge of A, and an in-edge of B.

Introduction to Big Graphs

What is a Big Graph?

The definition changes rapidly:
- GraphLab paper 2009: biggest graph 200M edges
- GraphLab & GraphChi papers 2012: biggest graph 6.7B edges
- GraphChi @ Twitter: many times bigger

It also depends on the computation: matrix factorization (collaborative filtering) or belief propagation is much more expensive per edge than PageRank.

What is a Big Graph?

Biggest graphs available to researchers:
- Altavista: 6.7B edges, 1.4B vertices
- Twitter 2010: 1.5B edges, 68M vertices
- Common Crawl (2012): 5 billion web pages

But the industry has even bigger ones:
- Facebook (Oct 2012): 144B friendships, 1B users
- Twitter (2011): 15B follower-edges

When reading about graph processing systems, be critical of the problem sizes: are they really big? Shun and Blelloch (2013, PPoPP) used a single machine with 256 GB RAM for in-memory computation on the same graphs as the GraphLab/GraphChi papers. Big graphs are always extremely sparse.

Examples of Big Graphs

Twitter: what kinds of graphs?
- follow graph
- engagement graph
- list-members graph
- topic-authority graph (consumers -> producers)

The follow graph is the obvious graph in Twitter, but we can extract many other graphs as well. For example, the engagement graph may carry more information than the follow graph: following someone is easy, but engaging requires more effort and will be seen by your followers (you do not want to spam them by retweeting everything).

Example of Big Graphs

Facebook: the extended social graph. How does the FB friend graph differ from Twitter's graph?

[Slide from Facebook Engineering's presentation; this graph is huge]

Other Big Networks
- WWW
- Academic citations
- Internet traffic
- Phone calls

What can we compute from social networks / web graphs?
- Influence ranking: PageRank, TunkRank, SALSA, HITS
- Analysis: triangle counting (clustering coefficient), community detection, information propagation, graph radii, ...
- Recommendations: who-to-follow, who-to-follow for topic T, similarities
- Search enhancements: Facebook's Graph Search

But actually, "what can we compute?" is a hard question by itself!

Sparse Matrices

User x Item/Product matrices:
- explicit feedback (ratings)
- implicit feedback (seen or not seen)
- typically very sparse

          Argo   Plan 9 From the Outer Space   Titanic   ...   The Hobbit
User 1     -               3                      2      ...       5
User 2     4               -                      -      ...       3
...
User N     5               1                      4      ...       -

How to represent sparse matrices as graphs? As a user-product (item) bipartite graph.

[Figure: user-movie bipartite graph with ratings as edges; movies include City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown] (slide adapted from Joey Gonzalez)

What can we compute from user-item graphs?
- Collaborative filtering (recommendations): recommend products that users with similar tastes have recommended
- Similarity / distance metrics
- Matrix factorization
- Random-walk based methods

Lots of algorithms are available. See Danny Bickson's CF toolkit for GraphChi: http://bickson.blogspot.com/2012/08/collaborative-filtering-with-graphchi.html

Probabilistic Graphical Models
- Each vertex represents a random variable
- Edges between vertices represent dependencies, modelled with conditional probabilities
- Bayes networks, Markov Random Fields, Conditional Random Fields
- Goal: given evidence (observed variables), compute the likelihood of the unobserved variables
- Exact inference is generally intractable, so approximations are needed.

[Figure: two shoppers and their purchases, grouped into "Cameras" and "Cooking" interests]

Here we have two shoppers. We would like to recommend things for them to buy based on their interests. However, we may not have enough information to make informed recommendations by examining their individual histories in isolation. We can use the rich probabilistic structure to improve the recommendations for individual people.

Image Denoising

[Figure: synthetic noisy image, its graphical model, and where few updates are needed] (adapted from Joey's slides)

Still more examples
- Computational biology: protein-protein interaction networks, activator/deactivator gene networks, DNA assembly graphs
- Text modelling: word-document graphs
- Knowledge bases: the NELL project at CMU
- Planar graphs: road networks
- Implicit graphs: k-NN graphs

I hope you now got the idea that there are lots of interesting graphs out there. Some of them are bigger than others, but the type of computation also varies. Thus "big" is not well-defined.

Resources
- Stanford SNAP datasets: http://snap.stanford.edu/data/index.html
- ClueWeb (CMU): http://lemurproject.org/clueweb09/
- Univ. of Milan's repository: http://law.di.unimi.it/datasets.php

Properties of Real-World Graphs

[Figure: Twitter network visualization, by Akshay Java, 2009]

Natural Graphs

[Image from WikiCommons: partial map of the Internet, based on data from January 15, 2005]

Grids and other planar graphs are easy: it is easy to find separators. The fundamental properties of natural graphs are what make them computationally challenging.

Power Law

Degree of a vertex = number of adjacent edges (distinguish in-degree and out-degree).

Power Law = Scale-Free

Fraction of vertices having k neighbors: P(k) ~ k^(-alpha)

Generative models:
- rich-get-richer (preferential attachment)
- copy-model
- Kronecker graphs (Leskovec, Faloutsos, et al.)

Other phenomena with power-law characteristics? Wealth / income of individuals, size of cities.

Natural Graphs: Power Law

Altavista web graph: 1.4B vertices, 6.7B edges. The top 1% of vertices is adjacent to 53% of the edges!

[Figure: log-log plot of the degree distribution; the power law appears as a line with slope -2] (slide from Joey Gonzalez)

Properties of Natural Graphs
- Small diameter: the expected distance between two nodes in Facebook was 4.74 (2011)
- Nice local structure, but no global structure
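To make the earlier claim (top 1% of vertices adjacent to 53% of the edges) concrete for your own data, here is a small self-contained sketch, assuming a toy in-memory edge list, that measures how much of the edge mass the top 1% highest-degree vertices touch:

// Sketch: how many edge endpoints are adjacent to the top 1% highest-degree
// vertices of a graph given as an edge list (undirected view). Toy data only.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

int main() {
    // Toy edge list (src, dst); in practice this would be streamed from disk.
    std::vector<std::pair<int, int>> edges = {
        {0, 1}, {0, 2}, {0, 3}, {0, 4}, {1, 2}, {3, 4}, {5, 0}};
    int num_vertices = 6;

    std::vector<long> degree(num_vertices, 0);
    for (auto& e : edges) {          // each edge contributes to both endpoints
        degree[e.first]++;
        degree[e.second]++;
    }

    std::sort(degree.begin(), degree.end(), std::greater<long>());
    size_t top = std::max<size_t>(1, num_vertices / 100);  // top 1% of vertices
    long covered = 0, total = 0;
    for (size_t i = 0; i < degree.size(); i++) {
        total += degree[i];
        if (i < top) covered += degree[i];
    }
    // Each edge is counted twice in 'total'; only the ratio matters here.
    std::printf("top %zu vertices touch %.1f%% of edge endpoints\n",
                top, 100.0 * covered / total);
    return 0;
}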

(From Michael Mahoney's (Stanford) presentation. A great talk by M. Mahoney: "Extracting insight from large networks: implications of small-scale and large-scale structure".)

Graph Compression

Local structure helps compression:
- Blelloch et al. (2003): compress a web graph to 3-4 bits / link
- WebGraph framework from Univ. of Milano: social graphs ~ 10 bits / edge (2009)

Basic idea: order the vertices so that topologically close vertices have ids close to each other, then use difference (gap) encoding.
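The difference-encoding idea in a few lines: sort each adjacency list, store gaps instead of absolute ids, and write the gaps as variable-length bytes. This is a toy simplification of the idea behind WebGraph-style compression, not its actual codes:

// Sketch: gap (difference) encoding of a sorted adjacency list with varints.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Append v as a base-128 varint (7 bits per byte, high bit = "more follows").
static void encode_varint(uint64_t v, std::vector<uint8_t>& out) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Encode neighbors of one vertex: sort, then store the first id and the gaps.
std::vector<uint8_t> encode_adjacency(std::vector<uint64_t> neighbors) {
    std::sort(neighbors.begin(), neighbors.end());
    std::vector<uint8_t> out;
    uint64_t prev = 0;
    for (size_t i = 0; i < neighbors.size(); i++) {
        encode_varint(i == 0 ? neighbors[0] : neighbors[i] - prev, out);
        prev = neighbors[i];
    }
    return out;
}

int main() {
    // If "topologically close" vertices have close ids, the gaps are small
    // and most neighbors fit in a single byte.
    std::vector<uint64_t> neighbors = {100002, 100000, 100007, 100010, 250000};
    auto bytes = encode_adjacency(neighbors);
    std::printf("%zu neighbors -> %zu bytes\n", neighbors.size(), bytes.size());
    return 0;
}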

Computational Challenge

Natural graphs are very hard to partition: it is hard to distribute the computation over many nodes in a balanced way while minimizing the number of edges crossing partitions. Why? Think about stars.

Graph partitioning algorithms (METIS, spectral clustering) are not feasible on very large graphs. Vertex-cuts work better than edge-cuts (more on this later with GraphLab).

Large-Scale Graph Computation Systems: Why MapReduce Is Not Enough

Parallel Graph Computation

Distributed computation and/or multicore parallelism; the two are sometimes confused. We will talk mostly about distributed computation.

Are classic graph algorithms parallelizable? What about distributed?
- Depth-first search?
- Breadth-first search?
- Priority-queue based traversals (Dijkstra's, Prim's algorithms)?

BFS is actually intrinsically parallel, because it can be expressed as computation on frontiers.

MapReduce for Graphs

Graph computation is almost always iterative.

MapReduce ends up shipping the whole graph over the network on each iteration (map -> reduce -> map -> reduce -> ...), because mappers and reducers are stateless.

Iterative Computation is Difficult

The system is not optimized for iteration. [Figure: data flowing through CPUs across iterations, paying a disk penalty and a startup penalty on every iteration]

MapReduce and Partitioning

MapReduce splits the keys randomly between mappers/reducers. But on natural graphs, high-degree vertices (keys) may have a million times more edges than the average: an extremely uneven distribution. The time of an iteration = the time of the slowest job.

Curse of the Slow Job

[Figure: BSP iterations separated by barriers; every iteration waits for the slowest worker]
http://www.www2011india.com/proceeding/proceedings/p607.pdf

MapReduce is Bulk-Synchronous Parallel

Bulk-Synchronous Parallel = BSP (Valiant, 1980s): each iteration sees only the values of the previous iteration. In the linear systems literature, these are Jacobi iterations.

Pros:
- Simple to program
- Maximum parallelism
- Simple fault-tolerance

Cons:
- Slower convergence
- Iteration time = time taken by the slowest node

Asynchronous Computation

An alternative to BSP. In the linear systems literature: Gauss-Seidel iterations. When computing the value for item X, we can observe the most recently computed values of its neighbors. Often relaxed: we can see the most recent values available on a certain node.

Consistency issues: we must prevent parallel threads from overwriting or corrupting values (race conditions).
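The Jacobi vs. Gauss-Seidel distinction can be seen on a PageRank-style update: the BSP (Jacobi) sweep reads only last iteration's values and needs a second copy of the state, while the in-place (Gauss-Seidel) sweep reads the freshest values. A single-threaded toy sketch, with the graph and damping factor assumed:

// Sketch: synchronous (Jacobi/BSP) vs. in-place (Gauss-Seidel-style) sweeps
// of a PageRank-like update on a tiny graph. Toy data, single-threaded.
#include <cstdio>
#include <vector>

int main() {
    // in_nbrs[v] = vertices linking to v; out_deg[v] = number of out-links of v.
    std::vector<std::vector<int>> in_nbrs = {{1, 2}, {2}, {0, 1}};
    std::vector<int> out_deg = {1, 2, 2};
    const double d = 0.85;
    const int n = 3;

    std::vector<double> jacobi(n, 1.0 / n), gs(n, 1.0 / n);
    for (int iter = 0; iter < 20; iter++) {
        // Jacobi / BSP: every update sees only the previous iteration.
        std::vector<double> next(n);
        for (int v = 0; v < n; v++) {
            double sum = 0;
            for (int u : in_nbrs[v]) sum += jacobi[u] / out_deg[u];
            next[v] = (1 - d) / n + d * sum;
        }
        jacobi = next;   // note the extra copy: the "2x memory" of BSP

        // Gauss-Seidel style: update in place, so later vertices in the sweep
        // already see the new values of earlier ones; typically converges
        // in fewer sweeps.
        for (int v = 0; v < n; v++) {
            double sum = 0;
            for (int u : in_nbrs[v]) sum += gs[u] / out_deg[u];
            gs[v] = (1 - d) / n + d * sum;
        }
    }
    for (int v = 0; v < n; v++)
        std::printf("v%d: jacobi=%.4f gauss-seidel=%.4f\n", v, jacobi[v], gs[v]);
    return 0;
}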

MapReduce's (Hadoop's) poor performance on huge graphs has motivated the development of specialized graph-computation systems.

Specialized Graph Computation Systems (Distributed)

Common to all: graph partitions are kept resident in memory on the computation nodes, to avoid shipping the graph over and over.

- Pregel (Google, 2010): "think like a vertex", messaging model, BSP. Open source: Giraph, Hama, Stanford GPS, ...
- GraphLab (2010, 2012) [CMU]: asynchronous (also BSP). Version 2.1 (PowerGraph) uses vertex partitioning, giving extremely good performance on natural graphs.
- + others

But do you need a distributed framework at all?

Vertex-Centric Programming: "Think Like a Vertex"

"Think like a vertex" (Google, 2010). Historically, a similar idea was used in systolic computation, data-flow systems, the Connection Machine and others.

Basic idea: each vertex computes its value individually [in parallel]. Program state = vertex (and edge) values.
- Pregel: vertices send messages to each other.
- GraphLab/GraphChi: a vertex reads its neighbors' and edges' values and modifies edge values (this can be used to simulate messaging).

The computation is iterative. Fixed-point computations are typical: iterate until the state does not change (much).

Computational Model (GraphLab and GraphChi)

Graph G = (V, E) with directed edges e = (source, destination). Each edge and vertex is associated with a value (of a user-defined type). Vertex and edge values can be modified (GraphChi also supports modification of the graph structure).
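A minimal sketch of this data model in C++, with toy types of my own (not the actual GraphLab/GraphChi API; the vertex-update function itself is discussed next):

// Toy sketch of the GraphLab/GraphChi-style data model (not the real API).
#include <vector>

using VertexValue = double;        // user-defined vertex value type
using EdgeValue   = double;        // user-defined edge value type

struct Edge {
    int source, destination;       // e = (source, destination), directed
    EdgeValue value;               // mutable value attached to the edge
};

struct VertexScope {               // what an update function gets to see
    int id;
    VertexValue value;             // mutable value of this vertex
    std::vector<Edge*> in_edges;   // edges (u -> id)
    std::vector<Edge*> out_edges;  // edges (id -> u)
};

// A vertex update function reads and writes the vertex and its incident edges.
using UpdateFunction = void (*)(VertexScope&);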

[Figure: vertices A and B connected by edge e, with data attached to each vertex and edge] Terms: e is an out-edge of A, and an in-edge of B.

Let's now discuss the computational setting of this work, starting from the basic computational model.

Vertex Update Function

MyFunc(vertex) { /* modify the neighborhood: the vertex's value and its incident edge values */ }

Parallel Computation

Bulk-synchronous: all vertices update in parallel (note: this needs 2x memory -- why?).

Asynchronous: the basic idea is that if two vertices are not connected, they can be updated in parallel (also consider two-hop connections). GraphLab supports different consistency models, allowing the user to specify the level of protection (locking). Efficient locking is complicated in distributed computation (it is hidden from the user) -- why?

Scheduling

Often, some parts of the graph require more iterations to converge than others (remember the power-law structure). It is wasteful to update all vertices an equal number of times.

The Scheduler

[Figure: the scheduler hands vertices to CPU 1 and CPU 2; updated vertices may schedule their neighbors] The scheduler determines the order in which vertices are updated. The process repeats until the scheduler is empty.

Types of Schedulers (GraphLab)
- Round-robin
- Selective scheduling (skipping): round-robin, but jump over unscheduled vertices
- FIFO
- Priority scheduling: approximations are used in distributed computation (each node has its own priority queue); rarely used in practice (why?)

Example: PageRank

Exercise: express PageRank in words in the vertex-centric model.
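In words: gather the rank contributions left on the in-edges, apply the damping formula, and leave the new contribution (rank divided by out-degree) on the out-edges. A sketch using the toy VertexScope/Edge types from the earlier snippet; this is one common unnormalized formulation, illustrative rather than GraphChi's actual example code:

// Sketch: PageRank as a vertex update function (toy types from above).
// Each edge value holds the source vertex's contribution: rank / out-degree.
const double kDamping = 0.85;      // damping factor, value assumed

void pagerank_update(VertexScope& v) {
    double sum = 0;
    for (Edge* e : v.in_edges) sum += e->value;              // gather
    v.value = (1 - kDamping) + kDamping * sum;               // apply
    if (!v.out_edges.empty()) {
        double contribution = v.value / v.out_edges.size();
        for (Edge* e : v.out_edges) e->value = contribution; // scatter
    }
}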

Example: Connected Components

[Figure: a 7-vertex graph; vertex labels over the iterations: 1 2 3 4 5 6 7, then 1 1 1 2 5 6 5, then 1 1 1 1 5 5 5]

- First iteration: each vertex chooses label = its own id.
- Update: my label = the minimum of my neighbors' labels.
- Component id = leader id (the smallest id in the component).

How many iterations are needed for convergence in the synchronous model? What about in the asynchronous model?

Matrix Factorization: Alternating Least Squares (ALS)

[Figure: the Netflix Users x Movies rating matrix represented as a bipartite graph between user factors (U) u_1, u_2 and movie factors (M) m_1, m_2, m_3, with observed ratings r_11, r_12, r_23, r_24 as edges]
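The connected-components example as an update function: edge values carry labels, and every vertex takes the minimum label it can see. A sketch with the same toy types as above; initialization of edge values to the endpoint ids and scheduling of changed neighbors are left out:

// Sketch: connected components by minimum-label propagation (toy types from above).
#include <algorithm>

void connected_components_update(VertexScope& v) {
    double label = v.id;                                     // own id as the starting label
    for (Edge* e : v.in_edges)  label = std::min(label, e->value);
    for (Edge* e : v.out_edges) label = std::min(label, e->value);
    v.value = label;                                         // component id so far
    for (Edge* e : v.in_edges)  e->value = label;            // propagate to neighbors
    for (Edge* e : v.out_edges) e->value = label;
}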

Iterate:
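The update formula on the original slide was an image. A hedged reconstruction of the standard ALS updates, in my own notation (N(i) is the set of movies rated by user i, N(j) the set of users who rated movie j, lambda a regularization parameter; none of these symbols are on the slide):

u_i  <-  ( sum over j in N(i) of m_j m_j^T  +  lambda * I )^-1  *  sum over j in N(i) of r_ij * m_j
m_j  <-  ( sum over i in N(j) of u_i u_i^T  +  lambda * I )^-1  *  sum over i in N(j) of r_ij * u_i

Each step fixes one factor matrix and solves a small least-squares problem per vertex, which is why ALS fits the vertex-centric model so naturally.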

http://dl.acm.org/citation.cfm?id=142426951

GraphLab (v2.1 = PowerGraph)

Joseph Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of Operating Systems Design and Implementation (OSDI).

The GraphLab project was started in 2009 by Prof. Carlos Guestrin's team:
- multicore version: UAI 2009
- distributed version: VLDB 2012
- PowerGraph: OSDI 2012

The motivation came from machine learning: many ML problems can be represented naturally as graph problems. The scope has since expanded to network analysis etc. Open source: http://graphlab.org

All GraphLab slides are from Joey Gonzalez!

Problem: High-Degree Vertices Limit Parallelism
- A single vertex-update touches a large fraction of the graph (GraphLab 1): sequential vertex-updates
- Produces many messages (Pregel)
- Edge information is too large for a single machine
- Asynchronous consistency requires heavy locking (GraphLab 1)
- Synchronous consistency is prone to stragglers (Pregel)

Solution: distribute a single vertex-update -- move the computation to the data and parallelize high-degree vertices.

Vertex Partitioning

A simple online approach effectively partitions large power-law graphs.

Factorized Vertex Updates

Split the update into three phases:
- Gather: data-parallel over the in-edges; partial results are combined with a parallel sum into an accumulator.
- Apply: locally apply the accumulated value to the vertex.
- Scatter: data-parallel over the out-edges; update the neighbors.

PageRank in GraphLab 2:

PageRankProgram(i)
  Gather(j -> i):   return w_ji * R[j]
  sum(a, b):        return a + b
  Apply(i, Sigma):  R[i] = beta + (1 - beta) * Sigma
  Scatter(i -> j):  if (R[i] changed) then activate(j)
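The same program written out as a C++ struct, to make the roles of the four functions explicit. Illustrative only: the real GraphLab 2 API and signatures differ, and beta here is the teleport/reset probability, value assumed:

// Sketch: the factorized PageRank vertex-program (not the PowerGraph API).
struct PageRankProgram {
    static constexpr double beta = 0.15;   // teleport/reset probability (assumed)

    // Gather: run data-parallel over in-edges; results combined with sum().
    static double gather(double w_ji, double rank_j) { return w_ji * rank_j; }
    static double sum(double a, double b) { return a + b; }

    // Apply: run once on the vertex with the accumulated gather result Sigma.
    static double apply(double sigma) { return beta + (1 - beta) * sigma; }

    // Scatter: run data-parallel over out-edges; activate j if the rank changed.
    static bool scatter(double old_rank, double new_rank) {
        return old_rank != new_rank;
    }
};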

However, in some cases this can seem rather inefficient.

Distributed Execution of a GraphLab 2 Vertex-Program

[Figure: a vertex replicated across four machines; gather runs on each replica, the accumulated result is applied on the master, and scatter runs on each replica]

Minimizing Communication in GraphLab 2

A vertex-cut minimizes the number of machines each vertex spans. Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000]. Communication is linear in the number of machines each vertex spans.

Constructing Vertex-Cuts

Goal: parallel graph partitioning on ingress. GraphLab 2 provides three simple approaches:
- Random edge placement: edges are placed randomly by each machine; good theoretical guarantees.
- Greedy edge placement with coordination: edges are placed using a shared objective; better theoretical guarantees.
- Oblivious-greedy edge placement: edges are placed using a local objective.

Random Vertex-Cuts

Randomly assign edges to machines. [Figure: a random cut can make a vertex span 3 machines or 2 machines, while a balanced cut spans only 1 machine]

Greedy Vertex-Cuts

Place edges on machines which already have the vertices of that edge. [Figure: edges placed greedily across two machines] The greedy algorithm is no worse than random placement in expectation.

Greedy Vertex-Cuts

The greedy heuristic is a derandomization: it minimizes the expected number of machines spanned by each vertex.
- Coordinated: maintain a shared placement history (DHT); slower but higher quality.
- Oblivious: operate only on the local placement history; faster but lower quality.
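A minimal sketch of the greedy placement rule, in my own simplified form of the oblivious variant: prefer a machine that already hosts both endpoints, then one endpoint, breaking ties toward the least-loaded machine:

// Sketch: greedy vertex-cut edge placement (simplified, single-threaded).
#include <cstdio>
#include <set>
#include <vector>

int place_edge(int u, int v, std::vector<std::set<int>>& hosts,
               std::vector<long>& load) {
    int best = -1, best_score = -1;
    for (int m = 0; m < (int)hosts.size(); m++) {
        int score = hosts[m].count(u) + hosts[m].count(v);
        // Prefer machines holding more endpoints; break ties by lower load.
        if (score > best_score ||
            (score == best_score && load[m] < load[best])) {
            best = m;
            best_score = score;
        }
    }
    hosts[best].insert(u);   // the chosen machine now holds replicas of u and v
    hosts[best].insert(v);
    load[best]++;
    return best;
}

int main() {
    int num_machines = 3;
    std::vector<std::set<int>> hosts(num_machines);  // vertex replicas per machine
    std::vector<long> load(num_machines, 0);         // edges per machine
    int edges[][2] = {{0, 1}, {0, 2}, {1, 2}, {0, 3}, {3, 4}};
    for (auto& e : edges)
        std::printf("edge (%d,%d) -> machine %d\n", e[0], e[1],
                    place_edge(e[0], e[1], hosts, load));
    return 0;
}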

Partitioning Performance

Twitter graph: 41M vertices, 1.4B edges. [Figure: replication cost and construction time for random, oblivious and coordinated placement] Oblivious placement balances partition quality and partitioning time: random is fastest but worst, coordinated is best but slowest, and oblivious is the compromise.

Beyond Random Vertex Cuts!

Triangle counting on the Twitter graph (40M users, 1.2B edges; 34.8 billion triangles in total):
- Hadoop: 1536 machines, 423 minutes [Suri & Vassilvitskii '11]
- GraphLab 2: 64 machines (1024 cores), 1.5 minutes

[1] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW '11: Proceedings of the 20th International Conference on World Wide Web, 2011.

LDA Performance

All of English-language Wikipedia: 2.6M documents, 8.3M words, 500M tokens.

- State-of-the-art LDA sampler (100 machines), Alex Smola: 150 million tokens per second
- GraphLab sampler (64 cc2.8xlarge EC2 nodes): 100 million tokens per second, using only 200 lines of code and 4 human hours

PageRank: 40M webpages, 1.4 billion links (100 iterations)
- Hadoop: 5.5 hrs [Kang et al. '11]
- Twister (in-memory MapReduce): 1 hr [Ekanayake et al. '10]
- GraphLab 2: 8 min

Comparable numbers are hard to come by since everyone uses different datasets, but we try to equalize as much as possible, giving the competition an advantage when in doubt. We scale to 100 iterations so that costs are in dollars and not cents.

Notes on the numbers:
- Hadoop: ran on a Kronecker graph of 1.1B edges and 50M vertices, on 45 machines. From the available numbers, each machine is approximately 8 cores and 6 GB RAM, roughly a c1.xlarge instance, putting the total cost at about $33 per hour. ("Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation", U Kang, Brendan Meeder, Christos Faloutsos.)
- Twister: ran on a ClueWeb dataset of 50M pages, 1.4B edges, using 64 nodes of 4 cores each, 16 GB RAM per node, roughly m2.xlarge to m2.2xlarge instances, putting the total cost at $28-$56 per hour. (Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, "Twister: A Runtime for Iterative MapReduce," The First International Workshop on MapReduce and its Applications (MAPREDUCE'10), HPDC 2010.)
- GraphLab: ran on 64 cc1.4xlarge instances at $83.2 per hour.

GraphLab Version 2.1 API (C++)

Runs on Linux cluster services (Amazon AWS) on top of MPI/TCP-IP, PThreads and Hadoop/HDFS. GraphLab toolkits built on the API: graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering. GraphLab easily incorporates external toolkits and automatically detects and builds them.

GraphLab Future Directions
- Making programming easier: Gather-Apply-Scatter is difficult for some problems (one often wants repeated gather-applies); higher-level operators such as edge- and vertex-map/reduces.
- Integration with graph storage.

Development continues in Prof. Carlos Guestrin's team at the University of Washington (previously at CMU).

GraphChi: Large-Scale Graph Computation on Just a PC

Background

GraphChi is a spin-off of the GraphLab project: GraphLab(rador)'s small friend, GraphChi(huahua). It was also presented at OSDI '12. C++ and Java/Scala versions are available at http://graphchi.org

GraphChi: Going small with GraphLab

Solve huge problems on small or embedded devices? The key: exploit non-volatile memory (starting with SSDs and hard drives).

Could we compute Big Graphs on a single machine?

Disk-Based Graph Computation

Facebook's graph, 144B edges, is roughly 1 terabyte. I just said that the size of the data is not really the problem; the computation is. Maybe we can also do the computation on a single disk?

But let's first ask why we even want this. Can't we just use the cloud? Credit card in, solutions out. The trouble is that writing distributed applications remains cumbersome.

Distributed State is Hard to Program

[Figure: a cluster crash vs. a crash in your IDE]

I want to first take the point of view of the folks who develop algorithms on our systems.

Unfortunately, it is still hard, cumbersome, to write distributed algorithms. Even if you have some nice abstraction, you still need to understand what is happening.

I think this motivation is quite clear to everyone in this room, so let me give just one example.

Debugging, for example. Some people write bugs, and finding out what crashed a cluster can be really difficult: you need to analyze the logs of many nodes and understand all the middleware.

Compare this to being able to run the same big problems on your own machine: you just use your favorite IDE and its debugger. It is a huge difference in productivity.

Efficient Scaling

Businesses need to compute hundreds of distinct tasks on the same graph (example: personalized recommendations). Two ways to scale:
- Parallelize each task across a cluster: complex, expensive to scale.
- Parallelize across tasks, one task per machine: simple, and 2x machines = 2x throughput.

Another, perhaps a bit surprising, motivation comes from thinking about scalability at large scale. The industry wants to compute many tasks on the same graph. For example, for personalized recommendations the same task is computed for people in different countries, different interest groups, etc.

Currently you need a cluster just to compute one single task; to compute tasks faster, you grow the cluster. But this work allows a different way: since one machine can handle one big task, you can dedicate one task per machine.

Why does this make sense? Clusters are complex and expensive to scale, while in this new model it is very simple, as nodes do not talk to each other, and you can double the throughput by doubling the machines. There are other motivations as well, such as reducing costs and energy, but let's move on.

Other Benefits
- Costs: easier management, simpler hardware.
- Energy consumption: full utilization of a single computer.
- Embedded systems, mobile devices: a basic flash drive can fit a huge graph.

Research Goal

Compute on graphs with billions of edges, in a reasonable time, on a single PC. "Reasonable" = close to numbers previously reported for distributed systems in the literature.

Experiment PC: Mac Mini (2012)

Now we can state the goal of this research, or the research problem we set for ourselves when we started this project. The goal has some vagueness in it, so let me briefly explain. By "reasonable time" I mean that if there are papers reporting numbers for other systems, I assume the authors were happy with the performance; if our system can do the same in the same ballpark, it is likely reasonable, given the lower costs here. As a consumer PC we used a Mac Mini. It is not the cheapest computer there is, especially with the SSD, but it is still quite a small package. We have since also run GraphChi on cheaper hardware, on a hard drive instead of an SSD, and can say it provides good performance on the lower end as well. But for this work, we used this computer.

One outcome of this research is a single-computer comparison point for computing on very large graphs. Researchers who develop large-scale graph computation platforms can compare their performance to this and analyze the relative gain achieved by distributing the computation. Before, there was not really an understanding of whether many proposed distributed frameworks are efficient or not.

Random Access Problem

Symmetrized adjacency file with values:

vertex | in-neighbors                    | out-neighbors
5      | 3: 2.3, 19: 1.3, 49: 0.65, ...  | 781: 2.3, 881: 4.2, ...
19     | 3: 1.4, 9: 12.1, ...            | 5: 1.3, 28: 2.2, ...

... or with file index pointers:

vertex | in-neighbor-ptr                    | out-neighbors
5      | 3: 881, 19: 10092, 49: 20763, ...  | 781: 2.3, 881: 4.2, ...
19     | 3: 882, 9: 2872, ...               | 5: 1.3, 28: 2.2, ...

Updating an in-edge value through the symmetric copy requires a random write; following a file index pointer requires a random read (and synchronization).

For sufficient performance, millions of random accesses per second would be needed. Even for an SSD, this is too much.

Here in the table I have a snippet of a simple, straightforward storage of a graph as adjacency sets: for each vertex we have the list of its in-neighbors and out-neighbors, with associated values. Say we update vertex 5 and change the value of its in-edge from vertex 19. Since vertex 19 owns the out-edge, we need to update the value in 19's list, which incurs a random write. Perhaps we can solve this as follows: each vertex stores only its out-neighbors directly, and in-neighbors are stored as file pointers to their primary storage in the neighbor's out-edge list. In this case, when we load vertex 5, we need a random read to fetch the value of the in-edge. A random read is much better than a random write, but in our experiments, even on an SSD, it is way too slow; one additional reason is the overhead of a system call. Perhaps direct access to the SSD would help, but since we came up with a simpler solution that works even on a rotational hard drive, we abandoned this approach.

Possible Solutions
1. Use the SSD as a memory extension? [SSDAlloc, NSDI'11] -- Too many small objects; millions of accesses per second are needed.
2. Compress the graph structure to fit into RAM? [WebGraph framework] -- The associated values do not compress well, and they are mutated.
3. Cluster the graph and handle each cluster separately in RAM? -- Expensive; the number of inter-cluster edges is big.
4. Caching of hot nodes? -- Unpredictable performance.

These are some potential remedies one might consider. Compressing the graph and caching hot nodes work in many cases; for example for graph traversals, where you walk the graph in a random manner, they are a good solution, and there is other work on that. But for our computational model, where we actually need to modify the values of the edges, they will not be sufficient given the constraints. Of course, if you have one terabyte of memory, you do not need any of that.

Our Solution: Parallel Sliding Windows (PSW)

Now we finally move to our main contribution. We call it Parallel Sliding Windows, for a reason that will become apparent soon. One reviewer commented that it should be called Parallel Tumbling Windows, but as I had already committed to this term, and replace-alls are dangerous, I stuck with it.

Parallel Sliding Windows: Phases

PSW processes the graph one sub-graph at a time:

In one iteration, the whole graph is processed, and typically the next iteration is then started.

The three phases: 1. Load, 2. Compute, 3. Write.

The basic approach is that PSW loads one sub-graph of the graph at a time, computes the update functions for it, and saves the modifications back to disk. We will show soon how the sub-graphs are defined, and how we do this with almost no random access. We usually use this for iterative computation: we process the whole graph in sequence to finish a full iteration, and then move to the next one.

PSW: Shards and Intervals

The vertices are numbered from 1 to n and split into P intervals, each associated with a shard on disk. A sub-graph = an interval of vertices. [Figure: intervals interval(1)..interval(P) over the vertex id range 1..n, each with its shard shard(1)..shard(P)]

PSW: Layout

A shard contains the in-edges for an interval of vertices, sorted by source id. Shards are made small enough to fit in memory, and shard sizes are balanced. [Figure: four intervals (vertices 1..100, 101..700, 701..1000, 1001..10000) and their shards; Shard 1 holds the in-edges of vertices 1..100, sorted by source_id]

PSW: Loading a Sub-graph

To load the sub-graph for vertices 1..100, load all of Shard 1 (their in-edges) into memory. What about the out-edges? They are arranged in sequence in the other shards: because every shard is sorted by source id, the out-edges of interval 1 form a contiguous block in each of Shards 2..4. Similarly, to load the sub-graph for vertices 101..700, load Shard 2 fully and the corresponding out-edge blocks (the sliding windows) of the other shards.

PSW Load-Phase

Only P large sequential reads are needed for each interval, so one full pass over the graph requires P^2 reads.


PSW: Execute Updates

The update function is executed on the interval's vertices. Edges have pointers to the loaded data blocks, so changes take effect immediately: the computation is asynchronous. Deterministic scheduling prevents races between neighboring vertices.

When we have the sub-graph in memory, i.e. for all vertices in the sub-graph we have all their in- and out-edges, we can execute the update functions. An important thing to understand: the edges were loaded from disk as large blocks that stay in memory, and when we create the graph objects (vertices and edges), each edge object holds a pointer into those blocks. So if two vertices share an edge, they immediately observe each other's changes, since their edge pointers point to the same address.

PSW: Commit to Disk

In the write phase, the blocks are written back to disk; the next load phase sees the preceding writes (asynchronous).
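To tie the three phases together, here is a toy, fully in-memory simulation of PSW passes. This is my own simplification: real GraphChi keeps the shards on disk and gets its efficiency from the large sequential window reads, which an in-memory toy cannot show:

// Toy in-memory simulation of Parallel Sliding Windows (vectors stand in for
// shard files so the interval/shard bookkeeping can be shown end to end).
#include <cstdio>
#include <vector>

struct Edge { int src, dst; double value; };

int main() {
    const int n = 6, P = 3;                         // 6 vertices, 3 intervals
    // interval(p) = vertices [p*2, p*2+1]; shard p = in-edges of interval p.
    std::vector<Edge> edges = {{0,1,1}, {2,1,1}, {1,3,1}, {4,3,1},
                               {3,5,1}, {0,5,1}, {5,0,1}};
    std::vector<std::vector<Edge>> shard(P);
    for (const Edge& e : edges) shard[e.dst / 2].push_back(e);

    std::vector<double> value(n, 1.0);
    for (int iter = 0; iter < 2; iter++) {
        for (int p = 0; p < P; p++) {               // one interval at a time
            // 1. "Load": in-edges of interval p plus its out-edges found in
            //    the other shards (on disk these are the sliding windows).
            std::vector<Edge*> sub;
            for (int q = 0; q < P; q++)
                for (Edge& e : shard[q])
                    if (e.dst / 2 == p || e.src / 2 == p) sub.push_back(&e);
            // 2. "Compute": toy update -- each vertex of the interval sums its
            //    in-edge values, then writes its new value to its out-edges.
            for (int v = p * 2; v < p * 2 + 2; v++) {
                double sum = 0;
                for (Edge* e : sub) if (e->dst == v) sum += e->value;
                value[v] = 0.5 + 0.5 * sum;
                for (Edge* e : sub) if (e->src == v) e->value = value[v];
            }
            // 3. "Write": on disk this would write the modified blocks back.
        }
    }
    for (int v = 0; v < n; v++) std::printf("v%d = %.3f\n", v, value[v]);
    return 0;
}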

In total: P^2 reads and writes per full pass over the graph. This performs well on both SSD and hard drive.

Is GraphChi Fast Enough? Comparisons to Existing Systems

Experiment setting: Mac Mini (Apple Inc.), 8 GB RAM, 256 GB SSD, 1 TB hard drive, Intel Core i5, 2.5 GHz.

Experiment graphs:

Graph        | Vertices | Edges | P (shards) | Preprocessing
live-journal | 4.8M     | 69M   | 3          | 0.5 min
netflix      | 0.5M     | 99M   | 20         | 1 min
twitter-2010 | 42M      | 1.5B  | 20         | 2 min
uk-2007-05   | 106M     | 3.7B  | 40         | 31 min
uk-union     | 133M     | 5.4B  | 50         | 33 min
yahoo-web    | 1.4B     | 6.6B  | 50         | 37 min

The same graphs are typically used for benchmarking distributed graph processing systems.

Comparison to Existing Systems

Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all the other systems except GraphLab compute synchronously.

[Figures: PageRank, WebGraph Belief Propagation (U Kang et al.), Matrix Factorization (Alternating Least Squares), Triangle Counting; see the paper for more comparisons]

On a Mac Mini, GraphChi can solve problems as big as existing large-scale systems can, with comparable performance.

Unfortunately, the literature abounds with PageRank experiments but not much more. PageRank is really not that interesting, and quite simple solutions work; nevertheless, we get some idea. Pegasus is a Hadoop-based graph mining system, and it has been used to implement a wide range of different algorithms. The best comparable result we got was for a machine learning algorithm, belief propagation: a Mac Mini can roughly match a 100-node Pegasus cluster. This also highlights the inefficiency of MapReduce. That said, the Hadoop ecosystem is pretty solid, and people choose it for its simplicity. Matrix factorization has been one of the core GraphLab applications, and here we show that our performance is pretty good compared to GraphLab running on a slightly older 8-core server. Last, triangle counting, a heavy-duty social network analysis algorithm: a paper a couple of years ago introduced a Hadoop algorithm for counting triangles, and this comparison is a bit stunning. But I remind you that all of these are prior to PowerGraph: at OSDI, the map changed totally!

However, we are confident in saying that GraphChi is fast enough for many purposes. And indeed, it can solve problems as big as the other systems have been shown to execute; it is limited only by disk space.

PowerGraph Comparison

PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs. With 64x more machines and 512x more CPUs:
- PageRank: 40x faster than GraphChi
- Triangle counting: 30x faster than GraphChi

(OSDI '12)

GraphChi has state-of-the-art performance per CPU.

GraphChi vs. PowerGraph: PowerGraph really resets the speed comparisons. However, the ease-of-use point remains, and GraphChi likely provides sufficient performance for most people. If you need peak performance and have the resources, PowerGraph is the answer. GraphChi still has a role as a development platform for PowerGraph.

Evaluation: Evolving Graphs -- Streaming Graph Experiment

On the Mac Mini:
- Streamed edges in random order from the twitter-2010 graph (1.5B edges), with a maximum rate of 100K or 200K edges/sec (a very high rate).
- Simultaneously ran PageRank.
- Data layout: edges were streamed from the hard drive; shards were stored on the SSD.

This is hard to evaluate; there are no comparison points.

Ingest rate: when the graph grows, shard recreations become more expensive.

Streaming computational throughput: the throughput varies strongly due to shard rewrites and asymmetric computation.

Summary
- Introduced large graphs and the challenges they pose
- Discussed why specialized graph computation platforms are needed
- Vertex-centric computation model
- GraphLab
- GraphChi

Thank You!

Follow me on Twitter: @kyrpov
http://graphchi.org
http://code.google.com/p/graphchi
http://www.cs.cmu.edu/~akyrola