
DEMYSTIFYING DISTRIBUTED GRAPH PROCESSING

Vasia Kalavri vasia@apache.org

@vkalavri

WHY DISTRIBUTED GRAPH PROCESSING?

MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE

Big Data Ninja

MISCONCEPTION #1

A SOCIAL NETWORK

YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL

▸ Naive Who(m) to Follow (see the sketch after this list):

▸ compute a friends-of-friends list per user

▸ exclude existing friends

▸ rank by common connections
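To make the intermediate-data point concrete, here is a minimal sketch of the naive pipeline over plain Scala collections (the object and method names are illustrative, not from the talk). The friends-of-friends candidate list built in the first step is the intermediate data: it grows roughly with the square of the average degree per user, however small the input edge list is.

// Naive who(m)-to-follow sketch over in-memory Scala collections (illustrative names).
object NaiveWhoToFollow {
  type User = Long

  def recommend(friends: Map[User, Set[User]], user: User, k: Int): Seq[User] = {
    val myFriends = friends.getOrElse(user, Set.empty[User])
    // 1) friends-of-friends candidate list: the often disregarded intermediate data
    val candidates = myFriends.toSeq.flatMap(f => friends.getOrElse(f, Set.empty[User]))
    // 2) exclude the user and existing friends
    val strangers = candidates.filter(c => c != user && !myFriends.contains(c))
    // 3) rank by the number of common connections (how often a candidate appears)
    strangers.groupBy(identity).toSeq
      .sortBy { case (_, hits) => -hits.size }
      .take(k)
      .map { case (candidate, _) => candidate }
  }
}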

DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE

Data Science Rockstar

MISCONCEPTION #2

GRAPHS DON’T APPEAR OUT OF THIN AIR

Expectation…

GRAPHS DON’T APPEAR OUT OF THIN AIR

Reality!

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

GRAPH APPLICATIONS ARE DIVERSE

▸ Iterative value propagation

▸ PageRank, Connected Components, Label Propagation

▸ Traversals and path exploration

▸ Shortest paths, centrality measures

▸ Ego-network analysis

▸ Personalized recommendations

▸ Pattern mining

▸ Finding frequent subgraphs

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

2004: MapReduce
2009: Pegasus
2010: Pregel, Signal-Collect
2012: PowerGraph

Iterative value propagation

PREGEL: THINK LIKE A VERTEX

[figure: a five-vertex example graph and its vertex-centric view: each vertex with its value and its neighbor IDs, e.g. 1 → 3, 4; 2 → 1, 4; 5 → 3]

PREGEL: SUPERSTEPS

(Vi+1, outbox) ← compute(Vi, inbox)

[figure: per-vertex values and message inboxes in superstep i and superstep i+1]

PREGEL EXAMPLE: PAGERANK

void compute(messages):
    sum = 0.0
    // sum up received messages
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
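The superstep loop itself can be sketched outside any framework. The following self-contained Scala sketch (not the Pregel or Gelly API) runs the vertex-centric PageRank above for a fixed number of supersteps over an in-memory adjacency map, making the (Vi+1, outbox) ← compute(Vi, inbox) cycle explicit; it assumes every vertex appears as a key and has at least one out-edge.

// Synchronous superstep loop for the vertex-centric PageRank sketch above.
object PregelPageRankSketch {
  type VertexId = Int

  def run(outEdges: Map[VertexId, Seq[VertexId]], supersteps: Int): Map[VertexId, Double] = {
    val n = outEdges.size
    var rank: Map[VertexId, Double] = outEdges.keys.map(v => v -> 1.0 / n).toMap

    // seed the first inbox with each vertex's initial rank share
    var inbox: Map[VertexId, Seq[Double]] =
      deliver(outEdges, rank.map { case (v, r) => v -> r / outEdges(v).size })

    for (_ <- 1 to supersteps) {
      // compute(): sum up received messages and update the vertex rank
      rank = rank.map { case (v, _) => v -> (0.15 / n + 0.85 * inbox(v).sum) }
      // outbox: each vertex distributes its new rank equally over its out-edges
      inbox = deliver(outEdges, rank.map { case (v, r) => v -> r / outEdges(v).size })
    }
    rank
  }

  // route each vertex's outgoing share to its targets and group by receiver (the "network" step)
  private def deliver(outEdges: Map[VertexId, Seq[VertexId]],
                      share: Map[VertexId, Double]): Map[VertexId, Seq[Double]] = {
    val sent = outEdges.toSeq.flatMap { case (src, targets) => targets.map(t => t -> share(src)) }
    val byReceiver = sent.groupBy { case (receiver, _) => receiver }
    outEdges.keys.map(v => v -> byReceiver.getOrElse(v, Seq.empty).map { case (_, msg) => msg }).toMap
  }
}

// example: PregelPageRankSketch.run(Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1)), supersteps = 20)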

SIGNAL-COLLECT

outbox ← signal(Vi)

Vi+1 ← collect(inbox)

[figure: the signal and collect phases of superstep i, followed by superstep i+1]

SIGNAL-COLLECT EXAMPLE: PAGERANK

void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for

void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
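To contrast with Pregel's single compute(), here is a minimal Scala sketch of one scatter-gather superstep (illustrative names, not the Signal/Collect API): signal() only produces messages from the current values, and collect() only updates values from the received messages.

// One scatter-gather superstep: the signal phase produces messages, the collect phase updates values.
object SignalCollectSketch {
  type VertexId = Int

  def superstep(outEdges: Map[VertexId, Seq[VertexId]],
                rank: Map[VertexId, Double]): Map[VertexId, Double] = {
    val n = rank.size
    // signal phase: each vertex distributes its current rank over its out-edges
    val messages: Seq[(VertexId, Double)] = outEdges.toSeq.flatMap { case (v, targets) =>
      targets.map(t => t -> rank(v) / targets.size)
    }
    // collect phase: each vertex sums its incoming messages and updates its rank
    val sums = messages.groupBy { case (target, _) => target }
      .map { case (target, msgs) => target -> msgs.map { case (_, m) => m }.sum }
    rank.map { case (v, _) => v -> (0.15 / n + 0.85 * sums.getOrElse(v, 0.0)) }
  }
}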

GATHER-SUM-APPLY (POWERGRAPH)

[figure: the Gather, Sum, and Apply phases in superstep i, followed by the next Gather in superstep i+1]

GSA EXAMPLE: PAGERANK

// compute partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()

// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2

// update rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum
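The same decomposition as a plain Scala sketch (illustrative names, not the PowerGraph or Gelly GSA API), assuming every edge endpoint has an entry in the rank map: gather emits one partial rank per edge, sum reduces the partials pairwise, and apply writes the new vertex value.

// One Gather-Sum-Apply superstep over an edge list (illustrative sketch).
object GsaPageRankSketch {
  type VertexId = Int

  def gather(neighborRank: Double, neighborOutDegree: Int): Double = neighborRank / neighborOutDegree
  def sum(rank1: Double, rank2: Double): Double = rank1 + rank2
  def apply(summed: Double): Double = 0.15 + 0.85 * summed

  def superstep(edges: Seq[(VertexId, VertexId)], rank: Map[VertexId, Double]): Map[VertexId, Double] = {
    // assumes every edge endpoint has an entry in rank
    val outDegree = edges.groupBy { case (src, _) => src }.map { case (src, es) => src -> es.size }
    // gather: one partial rank per edge, keyed by the edge's target vertex
    val partials = edges.map { case (src, dst) => dst -> gather(rank(src), outDegree(src)) }
    // sum: pairwise reduction of the partial ranks per vertex
    val summed = partials.groupBy { case (dst, _) => dst }
      .map { case (dst, ps) => dst -> ps.map { case (_, p) => p }.reduce(sum) }
    // apply: update each vertex's rank from the summed partials
    rank.map { case (v, _) => v -> apply(summed.getOrElse(v, 0.0)) }
  }
}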

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

2004: MapReduce
2009: Pegasus
2010: Pregel, Signal-Collect
2012: PowerGraph
2013: Giraph++

Iterative value propagation · Graph Traversals

THINK LIKE A (SUB)GRAPH

[figure: the five-vertex example graph, split across partitions]

▸ compute() on the entire partition (see the sketch below)

▸ Information flows freely inside each partition

▸ Network communication between partitions, not vertices
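As a concrete illustration, here is a Scala sketch of partition-centric connected components (illustrative names, not the Giraph++ API): compute() receives an entire partition, propagates minimum labels to a local fixed point with no network traffic, and only emits messages along edges that cross partition boundaries. Edges are treated as directed propagation channels, so an undirected graph should list each edge in both directions.

// Partition-centric connected components: one compute() call per partition.
object PartitionCentricSketch {
  type VertexId = Int
  type Label = Int

  // edges whose target is in the same partition vs. edges that cross to another partition
  case class Partition(internalEdges: Map[VertexId, Seq[VertexId]],
                       boundaryEdges: Map[VertexId, Seq[VertexId]])

  // returns the updated local labels and the outbox of (remote vertex, label) messages;
  // assumes both endpoints of every internal edge are keys of `labels`
  def compute(part: Partition,
              labels: Map[VertexId, Label],
              inbox: Map[VertexId, Seq[Label]]): (Map[VertexId, Label], Seq[(VertexId, Label)]) = {
    // absorb labels received from other partitions
    var local = labels.map { case (v, l) => v -> (l +: inbox.getOrElse(v, Seq.empty)).min }

    // propagate minimum labels to a fixed point inside the partition: no network involved
    var changed = true
    while (changed) {
      changed = false
      for ((src, targets) <- part.internalEdges; t <- targets) {
        if (local(src) < local(t)) {
          local = local.updated(t, local(src))
          changed = true
        }
      }
    }
    // only boundary edges generate network messages, one per crossing edge
    val outbox = for ((src, targets) <- part.boundaryEdges.toSeq; t <- targets) yield t -> local(src)
    (local, outbox)
  }
}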

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

2004: MapReduce
2009: Pegasus
2010: Pregel, Signal-Collect
2012: PowerGraph
2013: Giraph++
2014: NScale
2015: Arabesque
Tinkerpop

Iterative value propagation · Graph Traversals · Ego-network analysis · Pattern Matching

CAN WE HAVE IT ALL?

▸ Data pipeline integration: built on top of an efficient distributed processing engine

▸ Graph ETL: high-level API with abstractions and methods to transform graphs

▸ Familiar programming model: support popular programming abstractions

HELLO, GELLY! THE APACHE FLINK GRAPH API

▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API

▸ Transformations, library of common algorithms

val graph = Graph.fromDataSet(edges, env)

val ranks = graph.run(new PageRank(0.85, 20))

▸ Iteration abstractions: Pregel, Signal-Collect, Gather-Sum-Apply, Partition-Centric*
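A slightly fuller, self-contained version of the PageRank snippet above, sketched against the Gelly Scala API roughly as it looked at the time of this talk; the import paths and the PageRank constructor signature are assumptions and may differ across Flink versions.

import org.apache.flink.api.scala._
import org.apache.flink.graph.Edge
import org.apache.flink.graph.scala.Graph
import org.apache.flink.graph.library.PageRank

object GellyPageRankExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // a tiny edge list: (source id, target id, edge value)
    val edges = env.fromCollection(Seq(
      new Edge(1L, 2L, 1.0), new Edge(2L, 3L, 1.0), new Edge(3L, 1L, 1.0)))

    // build the graph and run 20 iterations of the library PageRank (damping factor 0.85)
    val graph = Graph.fromDataSet(edges, env)
    val ranks = graph.run(new PageRank[Long](0.85, 20))

    ranks.print()
  }
}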


WHY FLINK?

‣ efficient streaming runtime

‣ native iteration operators

‣ well-integrated

FEELING GELLY?

▸ Paper References

http://www.citeulike.org/user/vasiakalavri/tag/dotscale

▸ Apache Flink:

http://flink.apache.org/

▸ Gelly documentation:

http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html

▸ Gelly-Stream:

https://github.com/vasia/gelly-streaming
