large-scale graph processing with apache flink

35
Andra Lungu Flink committer andra.lungu @campus.tu-berlin.de Large-Scale Graph Processing with Apache Flink

Upload: truongthu

Post on 14-Feb-2017

236 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Large-Scale Graph Processing with Apache Flink

Andra Lungu

Flink committer

andra.lungu

@campus.tu-berlin.de

Large-Scale Graph Processing with Apache Flink

Page 2: Large-Scale Graph Processing with Apache Flink

What is Gelly?

Large-scale graph processing API

On top of Flink’s Java API

Official release: Flink 0.9

Off-the shelf library methods

Supports record and graph analysis applications; iterative algorithms

2

Page 3: Large-Scale Graph Processing with Apache Flink

The Growing Flink Stack

3

Page 4: Large-Scale Graph Processing with Apache Flink

How to use Gelly?

4

Page 5: Large-Scale Graph Processing with Apache Flink

Graph Creation

5

Page 6: Large-Scale Graph Processing with Apache Flink

Graph Properties

getVertices() getEdges() getVertexIds() getEdgeIds() inDegrees() outDegrees() getDegrees() numberOfVertices() numberOfEdges() getTriplets()

6

Page 7: Large-Scale Graph Processing with Apache Flink

Graph Transformations

Map• mapVertices(final MapFunction<Vertex<K, VV>, NV> mapper)

• mapEdges(final MapFunction<Edge<K, EV>, NV> mapper)

Filter • filterOnVertices(FilterFunction<Vertex<K, VV>> vertexFilter)

• filterOnEdges(FilterFunction<Edge<K, EV>> edgeFilter)

• subgraph(FilterFunction<Vertex<K, VV>> vertexFilter, FilterFunction<Edge<K, EV>> edgeFilter)

7

Page 8: Large-Scale Graph Processing with Apache Flink

Filter Functions

8

Page 9: Large-Scale Graph Processing with Apache Flink

Graph Transformations

Join• joinWithVertices(DataSet<Tuple2<K, T>> inputDataSet, final MapFunction<Tuple2<VV, T>, VV> mapper)

• joinWithEdges(DataSet<Tuple3<K, K, T>> inputDataSet, final MapFunction<Tuple2<EV, T>, EV> mapper)

• joinWithEdgesOnSource / joinWithEdgesOnTarget

Reverse

Undirected

9

Page 10: Large-Scale Graph Processing with Apache Flink

Union

10

Page 11: Large-Scale Graph Processing with Apache Flink

Graph Mutations

addVertex(final Vertex<K, VV> vertex) addVertices(List<Vertex<K, VV>>verticesToAdd)

addEdge(Vertex<K, VV> source, Vertex<K, VV>target, EV edgeValue)

addEdges(List<Edge<K, EV>> newEdges) removeVertex(Vertex<K, VV> vertex) removeVertices(List<Vertex<K, VV>>verticesToBeRemoved)

removeEdge(Edge<K, EV> edge) removeEdges(List<Edge<K, EV>>edgesToBeRemoved)

11

Page 12: Large-Scale Graph Processing with Apache Flink

Neighborhood Methods

reduceOnNeighbors(reduceNeighborsFunction, direction)

reduceOnEdges

groupReduceOnNeighbors; groupReduceOnEdges

12

Page 13: Large-Scale Graph Processing with Apache Flink

Graph Validation

Given criteria:

• Edge IDs correspond to vertex IDs

13

Page 14: Large-Scale Graph Processing with Apache Flink

Vertex-centric Iterations

Pregel [BSP] Execution Model

UDFs:

• Messaging Function

• VertexUpdateFunction

S-1: receive messages from neighbors

S: update vertex values

S+1: send new value to neighbors

14

Page 15: Large-Scale Graph Processing with Apache Flink

Single Source Shortest Paths

15

Page 16: Large-Scale Graph Processing with Apache Flink

SSSP – Second Superstep

16

Page 17: Large-Scale Graph Processing with Apache Flink

SSSP - Result

17

Page 18: Large-Scale Graph Processing with Apache Flink

SSSP – code snippet

18

Page 19: Large-Scale Graph Processing with Apache Flink

Gather-Sum-Apply Iterations

UDFs:

• GatherFunction

• SumFunction

• ApplyFunction

Back to SSSP:

• Gather: neighbor value + edge weight

• Sum/Accumulate: choose min

• Apply: compare computed min and update

19

Page 20: Large-Scale Graph Processing with Apache Flink

SSSP – Superstep 1

20

Page 21: Large-Scale Graph Processing with Apache Flink

SSSP – Superstep 2

21

Page 22: Large-Scale Graph Processing with Apache Flink

SSSP - Result

22

Page 23: Large-Scale Graph Processing with Apache Flink

SSSP – code snippet

23

Page 24: Large-Scale Graph Processing with Apache Flink

Vertex-centric or GSA?

Messaging = Gather + Sum

Update = Apply

Skewed graphs? – GSA (parallel

gather)

coGroup vs. reduce

GSA gathers from immediate neighbors;

Vertex-centric send to any vertex

24

Page 25: Large-Scale Graph Processing with Apache Flink

Library of Algorithms

Weakly Connected Components

Community Detection

Page Rank

Single Source Shortest Paths

Label Propagation

25

Page 26: Large-Scale Graph Processing with Apache Flink

Music Profiles Example

26

Page 27: Large-Scale Graph Processing with Apache Flink

Input Data

<user-id, song-id, play-count>

Set of bad records [IDs]

27

Page 28: Large-Scale Graph Processing with Apache Flink

Filter out Bad Records

28

Page 29: Large-Scale Graph Processing with Apache Flink

Compute Top Songs/User

29

Page 30: Large-Scale Graph Processing with Apache Flink

Compute Top Songs/User

30

Page 31: Large-Scale Graph Processing with Apache Flink

Create a user-user Graph

31

Page 32: Large-Scale Graph Processing with Apache Flink

Create a user-user Graph

32

Page 33: Large-Scale Graph Processing with Apache Flink

Cluster Similar Users

33

Page 34: Large-Scale Graph Processing with Apache Flink

Coming up Next

Gelly Blog Post Scala API More Library Methods Flink Streaming Integration Graph Partitioning Techniques Specialized Operators for Highly Skewed Graphs

Bipartite Graph Support

Curious? Gelly Roadmap

34