GraphChi: Large-Scale Graph Computation on Just a PC Aapo Kyrölä (CMU) Guy Blelloch (CMU) Carlos Guestrin (UW) OSDI’12 In co-operation with the GraphLab team.


TRANSCRIPT


GraphChi: Large-Scale Graph Computation on Just a PC
Aapo Kyrölä (CMU), Guy Blelloch (CMU), Carlos Guestrin (UW). OSDI'12. In co-operation with the GraphLab team.

Thank you for the introduction. My friend Joey just told us how to use PowerGraph to compute impressively efficiently on very large graphs by using a big cluster. Now, I will tell how to compute these same problems on just a small Mac Mini!

My name is Aapo Kyrölä. This is joint work with my advisors Guy Blelloch and Carlos Guestrin.

Big Data with Structure: Big Graphs
Social graph, follow-graph, consumer-products graph, user-movie ratings graph, DNA interaction graph, WWW link graph, etc.

As Joey told us, analysis of big social networks and other huge graphs is a hot topic.

In the keynote, we heard that analysis of the connections inside the genome is a big challenge.

Now, as Joey already gave a great introduction to graph computation, I will keep this short.

But I do want to emphasize one point... (next slide)

Big Graphs != Big Data
Data size: 140 billion connections, about 1 TB. Not a problem!
Computation: hard to scale.
(Twitter network visualization by Akshay Java, 2009)

Problems with Big Graphs are really quite different from the rest of Big Data. Why is that?

Well, when we compute on graphs, we are interested in the structure. Facebook just last week announced that their network has about 140 billion connections. I believe it is the biggest graph out there. However, storing this network on disk would take only roughly 1 terabyte of space. It is tiny! Smaller than tiny!

The reason why Big Graphs are so hard from a systems perspective is therefore the computation. Joey already talked about this, so I will be brief.

These so-called natural graphs are challenging because they have a very asymmetric structure. Look at the picture of the Twitter graph. Let me give an example: Lady Gaga has 30 million followers, I have only 300, and my advisor does not even have thirty.

This extreme skew makes distributing the computation very hard. It is hard to split the problem into components of even size that are not strongly connected to each other. PowerGraph has a neat solution to this.

But let me now introduce a completely different approach to the problem.

Could we compute Big Graphs on a single machine? Disk-based Graph Computation. Can't we just use the cloud?

I just told you that the size of the data is not really the problem; it is the computation.

Maybe we can also do the computation using just one disk?

But let's first ask why we even want this. Can't we just use the cloud? Credit card in, solutions out.

Writing distributed applications remains cumbersome.


Distributed State is Hard to Program
Cluster crash vs. crash in your IDE.

I want to first take the point of view of those folks who develop algorithms on our systems.

Unfortunately, it is still hard and cumbersome to write distributed algorithms. Even if you have some nice abstraction, you still need to understand what is happening.

I think this motivation is quite clear to everyone in this room, so let me give just one example.

Debugging. Some people write bugs. Finding out what crashed a cluster can be really difficult. You need to analyze the logs of many nodes and understand all the middleware.

Compare this to being able to run the same big problems on your own machine. Then you just use your favorite IDE and its debugger. It is a huge difference in productivity.

Efficient Scaling
Businesses need to compute hundreds of distinct tasks on the same graph. Example: personalized recommendations.
Parallelize each task: complex, expensive to scale. Parallelize across tasks: simple, 2x machines = 2x throughput.

Another, perhaps a bit surprising, motivation comes from thinking about scalability at large scale.

The industry wants to compute many tasks on the same graph. For example, to compute personalized recommendations, the same task is computed for people in different countries, different interest groups, and so on.

Currently: you need a cluster just to compute one single task. To compute tasks faster, you grow the cluster.

But this work allows a different way. Since one machine can handle one big task, you can dedicate one task per machine.

Why does this make sense? Clusters are complex and expensive to scale, while in this new model it is very simple: nodes do not talk to each other, and you can double the throughput by doubling the machines.

There are other motivations as well, such as reducing costs and energy. But let's move on.

Other Benefits
Costs: easier management, simpler hardware.
Energy consumption: full utilization of a single computer.
Embedded systems, mobile devices: a basic flash drive can fit a huge graph.

Research Goal
Compute on graphs with billions of edges, in a reasonable time, on a single PC. Reasonable = close to numbers previously reported for distributed systems in the literature.

Experiment PC: Mac Mini (2012)

Now we can state the goal of this research, or the research problem we set for ourselves when we started this project. The goal has some vagueness in it, so let me briefly explain. By reasonable time, I mean that if there have been papers reporting numbers for other systems, I assume at least the authors were happy with the performance; if our system can do the same in the same ballpark, it is likely reasonable, given the low costs here. As a consumer PC we used a Mac Mini. Not the cheapest computer there is, especially with the SSD, but still quite a small package. We have since also run GraphChi on cheaper hardware, on a hard drive instead of an SSD, and can say it provides good performance on the lower end as well. But for this work, we used this computer.

One outcome of this research is to come up with a single-computer comparison point for computing on very large graphs. Researchers who develop large-scale graph computation platforms can now compare their performance to this, and analyze the relative gain achieved by distributing the computation. Before, there was not really an understanding of whether many proposed distributed frameworks are efficient or not.

Computational Model
Graph G = (V, E) with directed edges: e = (source, destination). Each edge and vertex is associated with a value (user-defined type). Vertex and edge values can be modified (structure modification is also supported).
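To make this concrete, here is a minimal sketch of the data model just described, together with a vertex-centric update function of the kind introduced on the following slides. The type and function names are illustrative assumptions, not the actual GraphChi API.

    // Minimal sketch of the data model: values on vertices and edges,
    // modified by a vertex-centric update function. Illustrative only.
    #include <cstddef>
    #include <vector>

    typedef float VertexValue;   // user-defined value attached to each vertex
    typedef float EdgeValue;     // user-defined value attached to each edge

    struct Edge {
        std::size_t neighbor;    // the other endpoint of this edge
        EdgeValue*  value;       // mutable edge value, shared by both endpoints
    };

    struct Vertex {
        std::size_t id;
        VertexValue value;
        std::vector<Edge> in_edges;   // edges e = (source, this)
        std::vector<Edge> out_edges;  // edges e = (this, destination)
    };

    // Example update function: reads the in-edge values and writes new
    // values to the vertex and its out-edges ("modify the neighborhood").
    void update(Vertex& v) {
        float sum = 0.0f;
        for (const Edge& e : v.in_edges) sum += *e.value;
        v.value = sum;
        if (!v.out_edges.empty()) {
            float share = v.value / v.out_edges.size();
            for (Edge& e : v.out_edges) *e.value = share;
        }
    }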

Terms: for an edge e from A to B, e is an out-edge of A and an in-edge of B.

Let's now discuss the computational setting of this work. Let's first introduce the basic computational model.

Vertex-centric Programming
"Think like a vertex." Popularized by the Pregel and GraphLab projects. Historically: systolic computation and the Connection Machine.
MyFunc(vertex) { // modify neighborhood }

"Think like a vertex" was used by the Google Pregel paper, and also adopted by GraphLab. A historically similar idea was used in systolic computation and the Connection Machine architectures, but with regular networks. We have the data model where we associate a value with every vertex and edge, as shown in the picture. As the primary computation model, the user defines an update function that operates on a vertex and can access the values of the neighboring edges. That is, we modify the data directly in the graph, one vertex at a time. Of course, we can parallelize this, taking into account that neighboring vertices that share an edge should not (in general) be updated simultaneously.

The Main Challenge of Disk-based Graph Computation: Random Access

I will now briefly demonstrate why disk-based graph computation was not a trivial problem. Perhaps we can assume it wasn't, because no such system as stated in the goals clearly existed. But it makes sense to analyze why solving the problem required a small innovation, worthy of an OSDI publication. The main problem is stated on the slide: random access, i.e. when you need to read many times from many different locations on disk, is slow. This is especially true with hard drives: seek times are several milliseconds. On an SSD, random access is much faster, but still a far cry from the performance of RAM. Let's now study this a bit.

Random Access Problem
Symmetrized adjacency file with values:

    vertex | in-neighbors                   | out-neighbors
    5      | 3: 2.3, 19: 1.3, 49: 0.65, ... | 781: 2.3, 881: 4.2, ...
    19     | 3: 1.4, 9: 12.1, ...           | 5: 1.3, 28: 2.2, ...

(Synchronizing an in-edge value with the neighbor's out-edge list incurs a random write.)

... or with file index pointers:

    vertex | in-neighbor-ptr                   | out-neighbors
    5      | 3: 881, 19: 10092, 49: 20763, ... | 781: 2.3, 881: 4.2, ...
    19     | 3: 882, 9: 2872, ...              | 5: 1.3, 28: 2.2, ...

(Reading an in-edge value through its pointer incurs a random read.)

For sufficient performance, millions of random accesses per second would be needed. Even for an SSD, this is too much.

Here in the table I have a snippet of a simple, straightforward storage of a graph as adjacency sets. For each vertex, we have a list of its in-neighbors and out-neighbors, with associated values. Now let's say that when we update vertex 5, we change the value of its in-edge from vertex 19. As vertex 19 has the out-edge, we need to update its value in 19's list. This incurs a random write. Perhaps we can solve this as follows: each vertex stores only its out-neighbors directly, while in-neighbors are stored as file pointers to their primary storage in the neighbors' out-edge lists. In this case, when we load vertex 5, we need to do a random read to fetch the value of the in-edge. A random read is much better than a random write, but in our experiments, even on an SSD, it is way too slow. One additional reason is the overhead of a system call. Perhaps direct access to the SSD would help, but as we came up with a simpler solution that works even on a rotational hard drive, we abandoned this approach.

Possible Solutions
1. Use the SSD as a memory extension? [SSDAlloc, NSDI'11] -- Too many small objects; need millions per second.
2. Compress the graph structure to fit into RAM? [WebGraph framework] -- Associated values do not compress well, and are mutated.
3. Cluster the graph and handle each cluster separately in RAM? -- Expensive; the number of inter-cluster edges is big.
4. Caching of hot nodes? -- Unpredictable performance.

Now, there are some potential remedies we considered. Compressing the graph and caching hot nodes work in many cases; for graph traversals, for example, where you walk the graph in a random manner, they are a good solution, and there is other work on that. But for our computational model, where we actually need to modify the values of the edges, these are not sufficient given the constraints. Of course, if you have one terabyte of memory, you do not need any of that.

Our Solution: Parallel Sliding Windows (PSW)

Now we finally move to what is our main contribution. We call it Parallel Sliding Windows, for a reason that will become apparent soon. One reviewer commented that it should be called Parallel Tumbling Windows, but as I had already committed to using this term, and replace-alls are dangerous, I stuck with it.

Parallel Sliding Windows: Phases
PSW processes the graph one sub-graph at a time:

In one iteration, the whole graph is processed, and typically the next iteration is then started.

1. Load  2. Compute  3. Write

The basic approach is that PSW loads one sub-graph of the graph at a time, computes the update functions for it, and saves the modifications back to disk. We will show soon how the sub-graphs are defined, and how we do this with almost no random access. We usually use this for iterative computation: we process the whole graph in sequence to finish a full iteration, and then move on to the next one.

PSW: Shards and Intervals
Vertices are numbered from 1 to n. A sub-graph is an interval of vertices. There are P intervals, each associated with a shard on disk: interval(1), ..., interval(P) and shard(1), ..., shard(P).

How are the sub-graphs defined?

PSW: Layout
A shard contains the in-edges for an interval of vertices, sorted by source id. Shards are small enough to fit in memory, and shard sizes are balanced. Example: Shard 1 holds the in-edges for vertices 1..100, sorted by source_id; Shard 2 for vertices 101..700; Shard 3 for vertices 701..1000; Shard 4 for vertices 1001..10000.

Let us show an example.

PSW: Loading the Sub-graph
To load the sub-graph for vertices 1..100, load all of Shard 1's in-edges into memory. What about the out-edges? They are arranged in sequence in the other shards. To load the sub-graph for vertices 101..700, load Shard 2 fully into memory and read the out-edge blocks from the other shards.

PSW Load-Phase

Only P large reads for each interval; P² reads in one full pass.
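As a rough sketch of the load-compute-write loop just described, the outer loop over intervals reads one full shard plus a sliding window of out-edge blocks from every other shard. The names here (Shard, read_window, etc.) are illustrative assumptions, not the actual GraphChi implementation.

    // Sketch of the PSW loop, assuming read_all/read_window/write_window
    // wrap ordinary sequential file I/O. Illustrative names only.
    #include <vector>

    struct EdgeBlock { /* contiguous chunk of edges loaded from one shard */ };

    struct Shard {
        EdgeBlock read_all();                     // full shard: in-edges of its interval
        EdgeBlock read_window(int interval);      // out-edges whose source is in 'interval'
        void write_window(int interval, const EdgeBlock& b);
    };

    void run_iteration(std::vector<Shard>& shards, int P) {
        for (int i = 0; i < P; ++i) {             // one sub-graph (interval) at a time
            // 1. Load: one full shard + P-1 sliding windows => P large sequential reads
            EdgeBlock in_edges = shards[i].read_all();
            std::vector<EdgeBlock> out_blocks(P);
            for (int j = 0; j < P; ++j)
                if (j != i) out_blocks[j] = shards[j].read_window(i);

            // 2. Compute: run the vertex update functions on the loaded sub-graph
            // (edge objects point into in_edges/out_blocks, so updates are
            //  visible to neighbors immediately)

            // 3. Write: the modified blocks go back to their shards sequentially,
            //    plus the fully loaded shard i itself
            for (int j = 0; j < P; ++j)
                if (j != i) shards[j].write_window(i, out_blocks[j]);
        }
    }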


PSW: Execute Updates
The update function is executed on the interval's vertices. Edges have pointers to the loaded data blocks, so changes take effect immediately: the computation is asynchronous. Deterministic scheduling prevents races between neighboring vertices.

Now, when we have the sub-graph in memory, i.e. for all vertices in the sub-graph we have all their in- and out-edges, we can execute the update functions. Here comes an important thing to understand: as we load the edges from disk, these large blocks are stored in memory. When we then create the graph objects, vertices and edges, each edge object holds a pointer into the data blocks loaded from disk. So if two vertices share an edge, they will immediately observe a change made by the other one, since their edge pointers point to the same address.

PSW: Commit to Disk
In the write phase, the blocks are written back to disk. The next load phase sees the preceding writes: the computation is asynchronous.

In total: P² reads and writes per full pass over the graph. This performs well on both SSD and hard drive.

GraphChi: Implementation, Evaluation & Experiments
GraphChi is a C++ implementation: 8,000 lines of code. A Java implementation is also available (about 2-3x slower), with a Scala API. Several optimizations to PSW (see the paper). Source code and examples: http://graphchi.org

Because our goal was to make large-scale graph computation available to a large audience, we also put a lot of effort into making GraphChi easy to use.

Evaluation: Applicability

Evaluation: Is PSW expressive enough?
Algorithms implemented for GraphChi (Oct 2012):
Graph mining: connected components, approximate shortest paths, triangle counting, community detection.
SpMV: PageRank, generic recommendations, random walks.
Collaborative filtering (by Danny Bickson): ALS, SGD, Sparse-ALS, SVD, SVD++, Item-CF.
Probabilistic graphical models: belief propagation.
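Several of the listed algorithms fit the same update-function pattern. For instance, a connected-components label propagation could be sketched as below, reusing the illustrative Vertex and Edge types from the earlier sketch; this is an assumption-laden illustration, not the actual GraphChi code.

    // Sketch: connected components by label propagation in the same
    // vertex-centric style (labels spread as minimum vertex ids).
    #include <algorithm>

    void connected_components_update(Vertex& v) {
        // Take the smallest label seen on the vertex or any incident edge.
        float label = v.value;
        for (const Edge& e : v.in_edges)  label = std::min(label, *e.value);
        for (const Edge& e : v.out_edges) label = std::min(label, *e.value);
        v.value = label;
        // Write the label to all edges so that neighbors observe it.
        for (Edge& e : v.in_edges)  *e.value = label;
        for (Edge& e : v.out_edges) *e.value = label;
    }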

One important factor to evaluate is: is this system any good? Can you use it for anything? GraphChi is an early project, but we already have a great variety of algorithms implemented on it. I think it is safe to say that the system can be used for many purposes. I don't know of a better way to evaluate the usability of a system than listing what it has been used for. There are over a thousand downloads of the source code, plus checkouts which we cannot track, and we know many people are already using GraphChi's algorithms and also implementing their own. Most of these algorithms are currently available only in the C++ edition, apart from the random walk system, which is only in the Java version.

Is GraphChi fast enough? Comparisons to existing systems.

Experiment Setting
Mac Mini (Apple Inc.): 8 GB RAM, 256 GB SSD, 1 TB hard drive, Intel Core i5 at 2.5 GHz.

Experiment graphs:

    Graph        | Vertices | Edges | P (shards) | Preprocessing
    live-journal | 4.8M     | 69M   | 3          | 0.5 min
    netflix      | 0.5M     | 99M   | 20         | 1 min
    twitter-2010 | 42M      | 1.5B  | 20         | 2 min
    uk-2007-05   | 106M     | 3.7B  | 40         | 31 min
    uk-union     | 133M     | 5.4B  | 50         | 33 min
    yahoo-web    | 1.4B     | 6.6B  | 50         | 37 min

The same graphs are typically used for benchmarking distributed graph processing systems.

Comparison to Existing Systems
Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all the other systems except GraphLab compute synchronously. Benchmarks: PageRank, WebGraph belief propagation (U Kang et al.), matrix factorization (alternating least squares), triangle counting. See the paper for more comparisons. On a Mac Mini, GraphChi can solve problems as big as existing large-scale systems can, with comparable performance.

Unfortunately, the literature is abundant with PageRank experiments, but not much more. PageRank is really not that interesting, and quite simple solutions work. Nevertheless, we get some idea. Pegasus is a Hadoop-based graph mining system, and it has been used to implement a wide range of different algorithms. The best comparable result we got was for a machine learning algorithm, belief propagation. The Mac Mini can roughly match a 100-node Pegasus cluster. This also highlights the inefficiency of MapReduce. That said, the Hadoop ecosystem is pretty solid, and people choose it for its simplicity. Matrix factorization has been one of the core GraphLab applications, and here we show that our performance is pretty good compared to GraphLab running on a slightly older 8-core server. Last, triangle counting, which is a heavy-duty social network analysis algorithm. A paper in VLDB a couple of years ago introduced a Hadoop algorithm for counting triangles. This comparison is a bit stunning. But I remind you that these are prior to PowerGraph: at this OSDI, the map changed totally!

However, we are confident in saying that GraphChi is fast enough for many purposes. And indeed, it can solve problems as big as the other systems have been shown to execute. It is limited only by disk space.

PowerGraph Comparison
PowerGraph / GraphLab 2 outperforms previous systems by a wide margin on natural graphs. With 64 more machines, 512 more CPUs: PageRank is 40x faster than GraphChi; triangle counting is 30x faster than GraphChi.

GraphChi has state-of-the-art performance per CPU.

PowerGraph really resets the speed comparisons. However, the ease-of-use points remain, and GraphChi likely provides sufficient performance for most people. But if you need peak performance and have the resources, PowerGraph is the answer. GraphChi still has a role as a development platform for PowerGraph.

System Evaluation: Sneak Peek
Consult the paper for a comprehensive evaluation: HD vs. SSD; striping data across multiple hard drives; comparison to an in-memory version; bottleneck analysis; effect of the number of shards; block size and performance.

Now we move on to analyzing how the system itself works: how it scales and what the bottlenecks are.

Scalability / Input Size [SSD]
Throughput: number of edges processed per second. Conclusion: the throughput remains roughly constant when the graph size is increased. GraphChi on a hard drive is about 2x slower than on an SSD (if the computational cost is low). The paper covers the scalability of other applications.

In this plot, the x-axis is the size of the graph as the number of edges; all the experiment graphs are presented here. On the y-axis we have the performance: how many edges are processed per second. The dots represent different experiments (averaged), and the red line is a least-squares fit. On an SSD, the throughput remains almost constant when the graph size increases. Note that the structure of the graph also has an effect on performance, but only by a factor of two. The largest graph, yahoo-web, has a challenging structure, and thus its results are comparatively worse.

Bottlenecks / Multicore
Experiment on a MacBook Pro with 4 cores and an SSD. Computationally intensive applications benefit substantially from parallel execution. GraphChi saturates SSD I/O with 2 threads (Amdahl's law).

Evolving Graphs
Graphs whose structure changes over time. Just one more thing...

Evolving Graphs: Introduction
Most interesting networks grow continuously: new connections are made, some unfriended. Desired functionality: the ability to add and remove edges in streaming fashion, while continuing the computation. Related work: Kineograph (EuroSys '12), a distributed system for computation on a changing graph.

... however, Kineograph is not available.

PSW and Evolving Graphs
Adding edges: each (shard, interval) pair has an associated edge-buffer, i.e. shard(j) has edge-buffer(j, 1), ..., edge-buffer(j, P), one per interval. New edges arrive, for example, from the Twitter firehose.
Removing edges: the edge is flagged as removed.

Recreating Shards on Disk
When the buffers fill up, the shards are recreated on disk. Shards that grow too big are split. During recreation, deleted edges are permanently removed.

Evaluation: Evolving Graphs / Streaming Graph Experiment
On the Mac Mini: streamed edges in random order from the twitter-2010 graph (1.5 B edges), with a maximum rate of 100K or 200K edges/sec (a very high rate), while simultaneously running PageRank. Data layout: edges were streamed from the hard drive; shards were stored on the SSD.
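A minimal sketch of the edge-buffer idea described above; the names and the flush threshold are assumptions for illustration, not the actual GraphChi implementation.

    // Sketch of buffered edge insertion for evolving graphs.
    #include <cstddef>
    #include <vector>

    struct NewEdge { std::size_t src, dst; float value; };

    struct EvolvingShard {
        // This shard stores the in-edges of one vertex interval. New edges
        // for that interval are buffered per source interval:
        // edge_buffers[i] holds new edges whose source lies in interval i.
        std::vector<std::vector<NewEdge>> edge_buffers;
        std::size_t buffered = 0;
        static const std::size_t FLUSH_THRESHOLD = 1 << 20;   // assumed limit

        void add_edge(const NewEdge& e, std::size_t src_interval) {
            edge_buffers[src_interval].push_back(e);
            if (++buffered >= FLUSH_THRESHOLD)
                recreate_on_disk();   // merge buffers into the shard file,
                                      // split the shard if it grew too big,
                                      // and drop edges flagged as removed
        }
        void recreate_on_disk();      // not shown in this sketch
    };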


This is hard to evaluate; there are no comparison points.

Ingest Rate
When the graph grows, shard recreations become more expensive.

Streaming: Computational Throughput
Throughput varies strongly due to shard rewrites and asymmetric computation.

Conclusion

Future Directions
Come to the poster on Monday to discuss! This work assumes a small amount of memory. What if you have hundreds of GBs of RAM?

(Figure: the graph working memory for PSW stays on disk, while the computational state, e.g. Computation 1 state and Computation 2 state, is kept in RAM.)

Conclusion
The Parallel Sliding Windows algorithm enables processing of large graphs with very few non-sequential disk accesses. For systems researchers, GraphChi is a solid baseline for system evaluation: it can solve problems as big as distributed systems can. Takeaway: appropriate data structures are an alternative to scaling up. Source code and examples: http://graphchi.org (license: Apache 2.0).

Extra Slides

PSW is Asynchronous
If V > U, and there is an edge (U, V, &x) = (V, U, &x), then update(V) will observe the change to x done by update(U): the memory-shard for interval (j+1) will contain the writes to shard(j) done on previous intervals. (The previous slide covers the case where U and V are in the same interval.) PSW implements the Gauss-Seidel (asynchronous) model of computation, which has been shown in many cases to allow clearly faster convergence than Bulk-Synchronous Parallel (BSP). Each edge is stored only once.

Quite interestingly, PSW implements the asynchronous, or Gauss-Seidel, model of computation naturally. For BSP, one can think of different solutions, but they always require storing each value twice. Under the asynchronous model, it is also easy to implement BSP: just store the current and previous value for each edge and swap.
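A hedged sketch of that current/previous trick: keep two values per edge and swap roles at the end of each iteration. The types here are illustrative, not the actual GraphChi data layout.

    // Emulating BSP on top of the asynchronous model: two slots per edge.
    struct BspEdgeValue {
        float slot[2];            // slot[cur] = value written this iteration,
                                  // slot[1 - cur] = value from the previous one
    };

    struct BspContext {
        int cur = 0;                                                // "current" slot
        float read_previous(const BspEdgeValue& e) const { return e.slot[1 - cur]; }
        void  write_current(BspEdgeValue& e, float v) const { e.slot[cur] = v; }
        void  next_iteration() { cur = 1 - cur; }                   // swap per iteration
    };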

Number of Shards
If P is in the dozens, there is not much effect on performance.

I/O Complexity
See the paper for a theoretical analysis in the Aggarwal-Vitter I/O model. The worst case is only 2x the best case. Intuition: an edge within an interval is loaded from disk only once per iteration; an edge spanning two intervals is loaded twice per iteration.

Multiple Hard Drives (RAID-ish)
GraphChi supports striping shards across multiple disks for parallel I/O. Experiment on a 16-core AMD server (from year 2007).

Bottlenecks
Connected components on the Mac Mini with SSD: the cost of constructing the sub-graph in memory is almost as large as the I/O cost on an SSD. Graph construction requires a lot of random access in RAM, so memory bandwidth becomes a bottleneck.

Computational Setting
Constraints: not enough memory to store the whole graph, nor all the vertex values; enough memory to store one vertex and its edges with associated values. Largest example graph used in the experiments: the yahoo-web graph, with 6.7 B edges and 1.4 B vertices.

Recently GraphChi has been used on a MacBook Pro to compute with the most recent Twitter follow-graph (as of last year: 15 B edges). Note that bigger systems can also benefit from our work, since we can move the graph out of memory and use the memory for algorithmic state instead. Consider the yahoo-web graph, which has 1.4 B vertices: if we associate a 4-byte float with every vertex, that alone already takes almost 6 gigabytes of memory. On a laptop with 8 GB, you generally do not have enough memory left for the rest. The research contributions of this work are mostly in this setting, but the system, GraphChi, is actually very useful in settings with more memory as well. It is the flexibility that counts.
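For reference, the back-of-the-envelope arithmetic behind the "almost 6 gigabytes" figure:

    1.4 \times 10^{9}\ \text{vertices} \times 4\ \text{bytes/vertex}
      = 5.6 \times 10^{9}\ \text{bytes} \approx 5.6\ \text{GB}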