large scale graph processing

Large Scale Graph Processing Deepankar Patra IIT Madras

Post on 18-Oct-2014

Page 1: Large scale graph processing

Large Scale Graph Processing

Deepankar Patra

IIT Madras

Page 2: Large scale graph processing

Goal

Running graph algorithms (e.g. shortest path, connected components, finding the diameter) on huge graphs (a terabyte or more in size).

Page 3: Large scale graph processing

Graph

Figure: a graph with a node (vertex) and an edge labelled.

Page 4: Large scale graph processing

Example Graph Algorithm

● Shortest Path Algorithm

Figure: a graph with the source vertex and the destination vertex marked.

Page 5: Large scale graph processing

Why?

Many machine learning algorithms require graph computations, and in the real world their inputs are too large to fit on one machine.

Page 6: Large scale graph processing

Real World? 

Big Graphs:

● Social Networks

● Biological Networks

● Mobile Call Networks

● Citation Networks

● World Wide Web

● Geographic Pathways

● Customer-merchant graphs (Amazon, eBay)

Page 7: Large scale graph processing

Facebook Friends Graph

Src: http://wisonets.files.wordpress.com/2012/09/facebook-mutual-friends2.png

Page 8: Large scale graph processing

Machine Learning Algorithms?

● Recommendation

● PageRank

● Web search

● Cyber security

● Fraud detection

● Clustering

● Shortest Path Calculation

Page 9: Large scale graph processing

Graph Algorithms Typically Involve

● Performing computations at each node based on node features, edge features, and local link structure.

● Propagating computations: “traversing” the graph

Page 10: Large scale graph processing

Example

Src: http://www.slideshare.net/WeiruDai

Page 11: Large scale graph processing

Why not MapReduce?

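Graph algorithms can be written for MapReduce, but only awkwardly:

● Represent graphs as adjacency lists

● Perform local computations in the mapper

● Pass along partial results via outlinks, keyed by destination node

● Perform aggregation in the reducer on the inlinks to a node

● Iterate until convergence, controlled by an external “driver”

● Don’t forget to pass the graph structure between iterations

Every iteration is a separate job, and the whole graph is shuffled again each time.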
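The recipe above can be sketched in a single process. This is a minimal illustration, not a real MapReduce job: it computes unweighted shortest paths, and the names `NodeRecord`, `Emitted`, `MapPhase`, `ReducePhase` and `ShortestPaths` are hypothetical. Note how the mapper has to re-emit each node's adjacency list on every pass, and how the loop in `ShortestPaths` plays the role of the external driver.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <utility>
#include <vector>

// Hypothetical per-node record: current distance from the source plus the
// node's adjacency list (the "graph structure").
struct NodeRecord {
    int64_t distance;
    std::vector<int> out_links;
};

// Hypothetical intermediate value: either the node's own record (carrying the
// structure) or a distance proposed by an in-neighbour.
struct Emitted {
    bool is_structure;
    int64_t distance;
    std::vector<int> out_links;
};

using Graph = std::map<int, NodeRecord>;
const int64_t kInfinity = std::numeric_limits<int64_t>::max();

// Mapper: re-emit every node's structure, and propose distance + 1 to each
// out-link, keyed by the destination node.
std::multimap<int, Emitted> MapPhase(const Graph& graph) {
    std::multimap<int, Emitted> shuffled;
    for (const auto& entry : graph) {
        const NodeRecord& node = entry.second;
        shuffled.emplace(entry.first, Emitted{true, node.distance, node.out_links});
        if (node.distance == kInfinity) continue;  // unreached: nothing to propose
        for (int target : node.out_links) {
            shuffled.emplace(target, Emitted{false, node.distance + 1, {}});
        }
    }
    return shuffled;
}

// Reducer: per destination node, recover the structure and keep the minimum
// of all distances that arrived for it.
Graph ReducePhase(const std::multimap<int, Emitted>& shuffled) {
    Graph next;
    for (const auto& entry : shuffled) {
        auto it = next.find(entry.first);
        if (it == next.end()) {
            it = next.emplace(entry.first, NodeRecord{kInfinity, {}}).first;
        }
        if (entry.second.is_structure) it->second.out_links = entry.second.out_links;
        it->second.distance = std::min(it->second.distance, entry.second.distance);
    }
    return next;
}

// Driver: repeat map + reduce until no distance changes.
Graph ShortestPaths(const std::map<int, std::vector<int>>& adjacency, int source) {
    assert(adjacency.count(source) == 1);
    Graph graph;
    for (const auto& entry : adjacency) {
        graph[entry.first] = NodeRecord{kInfinity, entry.second};
    }
    graph[source].distance = 0;
    bool changed = true;
    while (changed) {
        Graph next = ReducePhase(MapPhase(graph));
        changed = false;
        for (const auto& entry : next) {
            auto old = graph.find(entry.first);
            if (old == graph.end() || old->second.distance != entry.second.distance) {
                changed = true;
            }
        }
        graph = std::move(next);
    }
    return graph;
}
```

Calling `ShortestPaths(adjacency, source)` returns every node with its hop count from `source`; unreachable nodes keep `kInfinity`.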

Page 12: Large scale graph processing

Why not Spark?

● Spark provides the GraphX library for graph algorithms.

● But Spark is a general-purpose engine; it is not designed specifically for graph algorithms.

● So optimizations that apply only to graphs are not available.

Page 13: Large scale graph processing

PREGEL, Google, 2010

● Basic idea: “think like a vertex”

● Based on the Bulk Synchronous Parallel (BSP) model

● Provides scalability

● Provides fault tolerance

● Provides flexibility to express arbitrary graph algorithms

Page 14: Large scale graph processing

How does it work?

● Master/Worker architecture

    ● Each worker is assigned a subset of a directed graph’s vertices

● Vertex-centric model. Each vertex has:

    ● An arbitrary “value” that can be read and set

    ● A list of messages sent to it

    ● A list of outgoing edges (edges have a value too)

    ● A binary state (active/inactive)
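The per-vertex state listed above maps onto a small data structure. The sketch below is illustrative only: `OutEdge`, `VertexState`, `WorkerFor` and `Partition` are hypothetical names, and values and messages are fixed to `double` for brevity. Assigning a vertex to a worker by hashing its ID modulo the number of workers follows the default described in the Pregel paper; the slides themselves do not state a rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// An outgoing edge: its target vertex and the edge's own value.
struct OutEdge {
    int64_t target_id;
    double value;
};

// The state a vertex carries in the vertex-centric model.
struct VertexState {
    int64_t id;
    double value;                    // arbitrary user value, readable and writable
    std::vector<double> inbox;       // messages sent to this vertex
    std::vector<OutEdge> out_edges;  // outgoing edges, each with a value
    bool active;                     // binary state: active or inactive
};

// Worker that owns a vertex: hash(ID) mod number of workers.
std::size_t WorkerFor(int64_t vertex_id, std::size_t num_workers) {
    assert(num_workers > 0);
    return std::hash<int64_t>{}(vertex_id) % num_workers;
}

// Hand each worker its subset of the vertices.
std::vector<std::vector<VertexState>> Partition(
        const std::vector<VertexState>& vertices, std::size_t num_workers) {
    assert(num_workers > 0);
    std::vector<std::vector<VertexState>> workers(num_workers);
    for (const VertexState& v : vertices) {
        workers[WorkerFor(v.id, num_workers)].push_back(v);
    }
    return workers;
}
```

Because the owner of a vertex is a pure function of its ID, any worker can work out where to send a message without consulting the master.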

Page 15: Large scale graph processing

Graph Partitioning

Figure: the graph’s vertices divided among Worker 1, Worker 2 and Worker 3.

Page 16: Large scale graph processing

Pregel execution model

The master initiates synchronous iterations, called “supersteps”. In every superstep:

● Each worker asynchronously executes a user function on all of its vertices

● Each vertex can receive the messages sent to it in the previous superstep

● Each vertex can modify its value, modify the values of its edges, and change the topology of the graph (add/remove vertices or edges)

● Each vertex can send messages to other vertices, to be received in the next superstep

● Each vertex can “vote to halt”

● Execution stops when all vertices have voted to halt and no messages are pending.

● A vote to halt is overridden by a non-empty message queue: an incoming message reactivates the vertex.
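The execution model can be simulated on one machine. The sketch below is an assumption-laden toy, not Pregel's API: `Vertex`, `Compute` and `RunSupersteps` are hypothetical names, and the user function propagates the maximum value through the graph. It shows the three rules that matter: messages sent in one superstep are only visible in the next, a halted vertex is reactivated by an incoming message, and the run ends when every vertex has halted and no messages are pending.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex {
    int value;
    std::vector<std::size_t> out_edges;  // indices of target vertices
    bool active;
};

// User function, run once per active vertex per superstep. A vertex that
// learns a larger value adopts it and forwards it; otherwise it votes to halt.
void Compute(Vertex& v, const std::vector<int>& messages, int superstep,
             std::vector<std::vector<int>>& outbox) {
    bool changed = (superstep == 0);  // superstep 0: everyone announces its value
    for (int m : messages) {
        if (m > v.value) {
            v.value = m;
            changed = true;
        }
    }
    if (changed) {
        for (std::size_t target : v.out_edges) {
            assert(target < outbox.size());
            outbox[target].push_back(v.value);
        }
    } else {
        v.active = false;  // vote to halt
    }
}

// Runs supersteps until all vertices have halted and no messages are pending.
// Returns the number of supersteps executed.
int RunSupersteps(std::vector<Vertex>& graph) {
    std::vector<std::vector<int>> inbox(graph.size());
    int superstep = 0;
    while (true) {
        bool any_active = false;
        for (std::size_t i = 0; i < graph.size(); ++i) {
            if (!inbox[i].empty()) graph[i].active = true;  // message beats the vote
            any_active = any_active || graph[i].active;
        }
        if (!any_active) break;

        std::vector<std::vector<int>> outbox(graph.size());
        for (std::size_t i = 0; i < graph.size(); ++i) {
            if (graph[i].active) Compute(graph[i], inbox[i], superstep, outbox);
        }
        inbox.swap(outbox);  // barrier: messages become visible next superstep
        ++superstep;
    }
    return superstep;
}
```

In a real deployment the inner loop over vertices is what each worker runs in parallel on its own partition, and the swap of inbox and outbox corresponds to the global synchronization barrier.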

Page 17: Large scale graph processing

Pregel Graph Processing

Page 18: Large scale graph processing

PageRank

PageRank is a link analysis algorithm that is used to determine the importance of a document based on the number of references to it and the importance of the source documents themselves.

Page 19: Large scale graph processing

PageRank

A = a given page

T1 ... Tn = pages that point to page A (citations)

d = damping factor between 0 and 1 (usually set to 0.85)

C(T) = number of links going out of page T

PR(A) = the PageRank of page A
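The formula these symbols belong to did not survive in the transcript. Assuming the slide showed the original PageRank formulation, which is the one these definitions fit, it reads:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

With d = 0.85 this becomes 0.15 + 0.85 times the sum of the shares received from the pages pointing to A, which is the update the Pregel `Compute` function for PageRank applies.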

Page 20: Large scale graph processing

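PageRank

    class PageRankVertex : public Vertex<double, void, double> {
     public:
      virtual void Compute(MessageIterator* msgs) {
        // From superstep 1 on, sum the rank shares sent by in-neighbours.
        if (superstep() >= 1) {
          double sum = 0;
          for (; !msgs->Done(); msgs->Next())
            sum += msgs->Value();
          *MutableValue() = 0.15 + 0.85 * sum;  // (1 - d) + d * sum, d = 0.85
        }
        // For 30 supersteps, send an equal share of this vertex's rank
        // along every outgoing edge; after that, vote to halt.
        if (superstep() < 30) {
          const int64 n = GetOutEdgeIterator().size();
          SendMessageToAllNeighbors(GetValue() / n);
        } else {
          VoteToHalt();
        }
      }
    };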
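The vertex program above depends on Pregel's internal API and cannot be compiled on its own. As a rough single-machine equivalent, the sketch below runs the same update for 30 rounds over an adjacency list. The function name `PageRank`, the initial rank of 1.0 and the handling of vertices without out-links (they simply send nothing) are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// out_links[v] lists the vertices that v points to.
// Each round mirrors one superstep: every vertex sends rank / out-degree
// along its out-links, then every vertex sets
// rank = (1 - damping) + damping * (sum of received shares).
std::vector<double> PageRank(const std::vector<std::vector<std::size_t>>& out_links,
                             int rounds = 30, double damping = 0.85) {
    assert(damping >= 0.0 && damping <= 1.0);
    const std::size_t n = out_links.size();
    std::vector<double> rank(n, 1.0);  // assumed initial value
    for (int round = 0; round < rounds; ++round) {
        std::vector<double> received(n, 0.0);
        for (std::size_t v = 0; v < n; ++v) {
            if (out_links[v].empty()) continue;  // assumption: dangling vertices send nothing
            const double share = rank[v] / static_cast<double>(out_links[v].size());
            for (std::size_t target : out_links[v]) {
                assert(target < n);
                received[target] += share;
            }
        }
        for (std::size_t v = 0; v < n; ++v) {
            rank[v] = (1.0 - damping) + damping * received[v];
        }
    }
    return rank;
}
```

The `received` vector stands in for the message queues: shares computed from the old ranks are collected first, and only then are the ranks overwritten, just as messages sent in one superstep are consumed in the next.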

Page 21: Large scale graph processing

Open Source

Pregel was published as a research paper; Google did not release an open-source implementation.

As a result, many open-source implementations have appeared, and they keep improving on the basic Pregel model. The two most notable are:

a) Apache Giraph, which originated at Yahoo! and is now developed and used mainly by Facebook

b) CMU's GraphLab (now a company in its own right)

Page 22: Large scale graph processing

One Example: GraphLab

● GraphLab is arguably the best of these at present

● GraphLab changed the partitioning strategy (in PowerGraph, cutting vertices rather than edges) to reduce the network overhead of message transfer among workers

● GraphLab has a rich library of machine learning algorithms, and it is growing

Page 23: Large scale graph processing

Reference

● Pregel: A System for Large-Scale Graph Processing

● PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs

● GraphX: A Resilient Distributed Graph System on Spark

● giraph.apache.org

● graphlab.org