single-pass graph stream analytics with apache flink · @graphdevroom batch graph processing 5 we...

71
@GraphDevroom Single-pass Graph Stream Analytics with Apache Flink Vasia Kalavri <[email protected]> Paris Carbone <[email protected]> 1

Upload: others

Post on 28-May-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

@GraphDevroom

Single-pass Graph Stream Analytics with Apache Flink

Vasia Kalavri <[email protected]> Paris Carbone <[email protected]>

1

@GraphDevroom

Outline

• Why Graph Streaming?

• Single-Pass Algorithms Examples

• Apache Flink Streaming API

• The GellyStream API

@GraphDevroom

Real Graphs are dynamic

Graphs are created from events happening in real-time

3

@GraphDevroom

4

@GraphDevroom

Batch Graph Processing

5

We create and analyze a snapshot of the real graph

• the Facebook social network on January 30 2016

• user web logs gathered between March 1st 12:00 and 16:00

• retweets and replies for 24h after the announcement of the death of David Bowie

@GraphDevroom

Streaming Graph Processing

We consume events in real-time

• Get results faster • No need to wait for the job to finish

• Sometimes, early approximations are better than late exact answers

• Get results continuously • Process unbounded number of events

6

@GraphDevroom

Challenges

• Maintain the graph structure • How to apply state updates efficiently?

• Result updates • Re-run the analysis for each event? • Design an incremental algorithm? • Run separate instances on multiple snapshots?

• Computation on most recent events only

7

@GraphDevroom

Single-Pass Graph Streaming

• Each event is an edge addition

• Maintains only a graph summary

• Recent events are grouped in graph windows

8

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

Streaming Degrees Distribution#v

ertic

es

degree

9

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

10

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

11

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

12

@GraphDevroom

1

43

2

5

6

7

8

Streaming Degrees Distribution

0

2

4

6

1 2 3 4

#ver

tices

degree

13

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

14

@GraphDevroom

1

43

2

5

6

7

8

Streaming Degrees Distribution

0

2

4

6

1 2 3 4

#ver

tices

degree

15

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

16

@GraphDevroom

1

43

2

5

6

7

8

Streaming Degrees Distribution

0

2

4

6

1 2 3 4

#ver

tices

degree

17

@GraphDevroom

1

43

2

5

6

7

8

0

2

4

6

1 2 3 4

#ver

tices

degree

Streaming Degrees Distribution

18

@GraphDevroom

Graph Summaries

• spanners for distance estimation • sparsifiers for cut estimation • sketches for homomorphic properties

graph summary

algorithm algorithm~R1 R2

19

@GraphDevroom

Window Aggregations

Neighborhood aggregations on windows

20

@GraphDevroom

Examples

21

@GraphDevroom

Batch Connected Components

• State: the graph and a component ID per vertex (initially equal to vertex ID)

• Iterative Computation: For each vertex:

• choose the min of neighbors’ component IDs and own component ID as new ID

• if component ID changed since last iteration, notify neighbors

22

@GraphDevroom

1

43

2

5

6

7

8

i=0

Batch Connected Components

23

@GraphDevroom

1

11

2

2

6

6

6

i=1

Batch Connected Components

24

@GraphDevroom

1

11

1

5

6

6

6

i=2

Batch Connected Components

25

1

@GraphDevroom

1

11

1

1

6

6

6

i=3

Batch Connected Components

26

@GraphDevroom

Stream Connected Components

• State: a disjoint set data structure for the components

• Computation: For each edge

• if seen for the 1st time, create a component with ID the min of the vertex IDs

• if in different components, merge them and update the component ID to the min of the component IDs

• if only one of the endpoints belongs to a component, add the other one to the same component

27

@GraphDevroom

31

52

54

76

86

ComponentID Vertices

1

43

2

5

6

7

8

28

@GraphDevroom

31

52

54

76

86

42

ComponentID Vertices

1 1, 3

1

43

2

5

6

7

8

29

@GraphDevroom

31

52

54

76

86

42

ComponentID Vertices

43

2 2, 5

1 1, 3

1

43

2

5

6

7

8

30

@GraphDevroom

31

52

54

76

86

42

43

87

ComponentID Vertices

2 2, 4, 5

1 1, 3

1

43

2

5

6

7

8

31

@GraphDevroom

31

52

54

76

86

42

43

87

41

ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7

1

43

2

5

6

7

8

32

@GraphDevroom

52

54

76

86

42

43

87

41

ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7, 8

1

43

2

5

6

7

8

33

@GraphDevroom

54

76

86

42

43

87

41 ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7, 8

1

43

2

5

6

7

8

34

@GraphDevroom

76

86

42

43

87

41

ComponentID Vertices

2 2, 4, 5

1 1, 3

6 6, 7, 8

1

43

2

5

6

7

8

35

@GraphDevroom

76

86

42

43

87

41

ComponentID Vertices

6 6, 7, 8

1 1, 2, 3, 4, 5

1

43

2

5

6

7

8

36

@GraphDevroom

86

42

43

87

41

ComponentID Vertices

6 6, 7, 8

1 1, 2, 3, 4, 5

1

43

2

5

6

7

8

37

@GraphDevroom

42

43

87

41

ComponentID Vertices

6 6, 7, 8

1 1, 2, 3, 4, 5

1

43

2

5

6

7

8

38

@GraphDevroom Distributed Stream Connected

Components

39

@GraphDevroom

Stream Bipartite Detection

Similar to connected components, but

• Each vertex is also assigned a sign, (+) or (-)

• Edge endpoints must have different signs

• When merging components, if flipping all signs doesn’t work => the graph is not bipartite

40

@GraphDevroom

1

43

2

5

6

7

(+) (-)

(+)(-)

(+) (-)

(+)

Cid=1

Cid=5

41

Stream Bipartite Detection

@GraphDevroom

3 5

1

43

2

5

6

7

(+) (-)

(+)(-)

(+) (-)

(+)

Cid=1

Cid=5

42

Stream Bipartite Detection

@GraphDevroom

3 5

1

43

2

5

6

7

(+) (-)

(+)(-)

(+) (-)

(+)

Cid=1

Cid=5

43

Stream Bipartite Detection

@GraphDevroom

Cid=1

1

43

2

5

6

7

(+) (-)

(-)(+)

(+) (-)

(-)

3 5

44

Stream Bipartite Detection

@GraphDevroom

3 7

Cid=1

1

43

2

5

6

7

(+) (-)

(-)(+)

(+) (-)

(-)Can’t flip signs and stay consistent

=> not bipartite!

45

Stream Bipartite Detection

@GraphDevroom

API Requirements

• Continuous aggregations on edge streams

• Global graph aggregations

• Support for windowing

@GraphDevroom

47

The Apache Flink Stack

APIs

Execution

DataStreamDataSet

Distributed Dataflow

Deployment

• Bounded Data Sources • Blocking Operations • Structured Iterations

• Unbounded Data Sources • Continuous Operations • Asynchronous Iterations

@GraphDevroom

Unifying Data Processing

Job Manager• scheduling tasks • monitoring/recovery

Client

• task pipelining • blocking

• execution plan building • optimisation

48

DataStreamDataSet

Distributed Dataflow

Deployment

HDFS

Kafka

DataSet<String> text = env.readTextFile(“hdfs://…”); text.map(…).groupReduce(…)…

DataStream<String> events = env.addSource(new KafkaConsumer(…)); events.map(…).filter(…).window(…).fold(…)…

@GraphDevroom

Data Streams as ADTs

• Tasks are long running in a pipelined execution.

• State is kept within tasks.

• Transformations are applied per-record or window.

• Transformations: map, flatmap, filter, union…

• Aggregations: reduce, fold, sum

• Partitioning: forward, broadcast, shuffle, keyBy

• Sources/Sinks: custom or Kafka, Twitter, Collections…

49

DataStream

@GraphDevroom

Working with Windows

50

Why windows? We are often interested in fresh data!

Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!

#sec40 80

SUM #2

0

SUM #1

20 60 100

#sec40 80

SUM #3

SUM #2

0

SUM #1

20 60 100

120

15 38 65 88

15 38

38 65

65 88

15 38 65 88

110 120

myKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS));

1) Sliding windows

2) Tumbling windowsmyKeyStream.timeWindow( Time.of(60, TimeUnit.SECONDS));

window buckets/panes

@GraphDevroom

Example

51

myTextStream .flatMap(new Splitter()) //transformation .keyBy(0) //partitioning

.window(Time.of(5, TimeUnit.MINUTES)) .sum(1) //rolling aggregation

.setParallelism(4); counts.print();

10:48 - “cool, gelly is cool”

printwindow sumflatMap

11:01 - “dataflow is cool too”

<“gelly”,1>… <“cool”,2>

<“dataflow”,1>… <“cool”,1>

@GraphDevroom

Gelly on Streams

52

DataStreamDataSet

Distributed Dataflow

Deployment

Gelly Gelly-Stream

• Static Graphs • Multi-Pass Algorithms • Full Computations

• Dynamic Graphs • Single-Pass Algorithms • Approximate Computations

DataStream

@GraphDevroom

Introducing Gelly-Stream

53

Gelly-Stream enriches the DataStream API with two new additional ADTs:

• GraphStream: • A representation of a data stream of edges.

• Edges can have state (e.g. weights).

• Supports property streams, transformations and aggregations.

• GraphWindow: • A “time-slice” of a graph stream.

• It enables neighbourhood aggregations

@GraphDevroom

GraphStream Operations

54

.getEdges()

.getVertices()

.numberOfVertices()

.numberOfEdges()

.getDegrees()

.inDegrees()

.outDegrees()

GraphStream -> DataStream

.mapEdges();

.distinct();

.filterVertices();

.filterEdges();

.reverse();

.undirected();

.union();

GraphStream -> GraphStream

Property Streams Transformations

@GraphDevroom

Graph Stream Aggregations

55

result aggregate

property streamgraph stream

(window) fold

combine

fold

reduce

local summaries

global summary

edges

agg

global aggregates can be persistent or transient

graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))

@GraphDevroom

Graph Stream Aggregations

56

result aggregate

property streamgraph stream

(window) fold

combine transform

fold

reduce map

local summaries

global summary

edges

agg

graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))

@GraphDevroom

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

Connected Components

57

graph stream

31

52

1

43

2

5

6

7

8

#components

@GraphDevroom

Connected Components

58

graph stream

{1,3}

{2,5}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

59

graph stream

{1,3}

{2,5}

54

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

60

graph stream

{1,3}

{2,5}

{4,5}76

86

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

61

graph stream

{1,3}

{2,5}

{4,5}

{6,7}

{6,8}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

windowtriggers

@GraphDevroom

Connected Components

62

graph stream

{2,5}{6,8}

{1,3}{4,5}

{6,7}

3

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

63

graph stream

{1,3}{2,4,5}

{6,7,8}

3

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

64

graph stream

{1,3}{2,4,5}

{6,7,8}42

43

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

65

graph stream

{1,3}{2,4,5}

{6,7,8}{2,4}

{3,4}

41

87

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

66

graph stream

{1,3}{2,4,5}

{6,7,8}{1,2,4}

{3,4}{7,8}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

windowtriggers

@GraphDevroom

Connected Components

67

graph stream

{1,2,4,5}{6,7,8}

2

{3,4}{7,8}

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Connected Components

68

graph stream

{1,2,3,4,5}{6,7,8}

2

1

43

2

5

6

7

8

graphStream.aggregate( new ConnectedComponents(window,fold,combine,transform))

#components

@GraphDevroom

Aggregating Slices

69

graphStream.slice(Time.of(1, MINUTE), direction)

.reduceOnEdges();

.foldNeighbors();

.applyOnNeighbors();

• Slicing collocates edges by vertex information

• Neighbourhood aggregations are now enabled on sliced graphs

source

target

Aggregations

@GraphDevroom

Finding Matches Nearby

70

graphStream.filterVertices(GraphGeeks()) .slice(Time.of(15, MINUTE), EdgeDirection.IN) .applyOnNeighbors(FindPairs())

slice

GraphStream :: graph geek check-ins

wendy checked_in soap_bar steve checked_in soap_bar

tom checked_in joe’s_grill sandra checked_in soap_bar

rafa checked_in joe’s_grill

wendy

steve

sandra

soapbar

tom

rafa

joe’sgrill

FindPairs

{wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}

GraphWindow :: user-place

@GraphDevroom

Feeling Gelly?• Gelly-Stream: https://github.com/vasia/gelly-streaming

• Apache Flink: https://github.com/apache/flink

• An interesting read: http://users.dcc.uchile.cl/~pbarcelo/mcg.pdf

• A cool thesis: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-170425

• Twitter: @vkalavri , @senorcarbone