GRADOOP: Scalable Graph Analytics with Apache Flink

Martin Junghanns, University of Leipzig

Upload: flink-forward

Post on 15-Jan-2017


Page 1: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

GRADOOP: Scalable Graph Analytics with Apache Flink

Martin Junghanns University of Leipzig

Page 2: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

About the speaker and the team

2011: Bachelor of Engineering. Thesis: Partitioning of Dynamic Graphs

2014: Master of Science. Thesis: Graph Database Systems for Business Intelligence

Now: PhD Student, Database Group, University of Leipzig

Distributed Systems, Distributed Graph Data Management, Graph Theory & Algorithms

Professional Experience: sones GraphDB, SAP

Team: André (PhD Student), Martin (PhD Student), Kevin (M.Sc. Student), Niklas (M.Sc. Student)

Page 3: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Motivation

Page 4: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph = (Vertices, Edges)

“Graphs are everywhere”

Page 5: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph = (Users, Friendships)

“Graphs are everywhere”

Users: Alice, Bob, Eve, Dave, Carol, Mallory, Peggy

Page 9: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph = (Users, Followers)

“Graphs are everywhere”

Users: Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent

Page 11: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph = (Cities, Connections)

“Graphs are everywhere”

Cities: Leipzig (pop. 544K), Dresden (pop. 536K), Berlin (pop. 3.5M), Hamburg (pop. 1.7M), Munich (pop. 1.4M), Chemnitz (pop. 243K), Nuremberg (pop. 500K), Cologne (pop. 1M)

Page 12: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

“Graphs are large”

World Wide Web: ca. 1 billion websites

Facebook: ca. 1.49 billion active users, ca. 340 friends per user

Page 16: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

End-to-End Graph Analytics

Data Integration: Integrate data from one or more sources into a dedicated graph storage with a common graph data model

Graph Analytics: Definition of analytical workflows from an operator algebra

Representation: Result representation in a meaningful way

Page 17: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph Data Management

Criterion        | Graph Database Systems | Graph Processing Systems | Distributed Workflow Systems
Examples         | Neo4j, OrientDB        | Pregel, Giraph           | Flink Gelly, Spark GraphX
Data Model       | Rich Graph Models      | Generic Graph Models     | Generic Graph Models
Focus            | Local ACID Operations  | Global Graph Operations  | Global Data and Graph Operations
Query Language   | Yes                    | No                       | No
Persistency      | Yes                    | No                       | No
Scalability      | Vertical               | Horizontal               | Horizontal
Workflows        | No                     | No                       | Yes
Data Integration | No                     | No                       | No
Graph Analytics  | No                     | Yes                      | Yes
Representation   | Yes                    | No                       | No

Page 21: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

What's missing?

An end-to-end framework and research platform for efficient, distributed and domain-independent graph data management and analytics.

Page 23: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Gradoop Architecture & Data Model

Page 24: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

High Level Architecture

[Figure: architecture diagram; arrows distinguish data flow and control flow]

Components shown:

HDFS/YARN Cluster

HBase Distributed Graph Store

Extended Property Graph Model

Flink Operator Implementations

Flink Operator Execution

Workflow Declaration: Visual, GrALa DSL

Workflow Execution

Data Integration, Graph Analytics, Representation

Page 26: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Extended Property Graph Model

Page 29: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph Operators

Operator         | GrALa notation

Binary
Combination      | graph.combine(otherGraph) : Graph
Overlap          | graph.overlap(otherGraph) : Graph
Exclusion        | graph.exclude(otherGraph) : Graph
Isomorphism      | graph.isIsomorphicTo(otherGraph) : Boolean

Unary
Pattern Matching | graph.match(patternGraph, predicate) : Collection
Aggregation      | graph.aggregate(propertyKey, aggregateFunction) : Graph
Projection       | graph.project(vertexFunction, edgeFunction) : Graph
Summarization    | graph.summarize(vertexGroupKeys, vertexAggregateFunction, edgeGroupKeys, edgeAggregateFunction) : Graph
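
The binary operators can be read as set operations on the vertex and edge sets of two graphs. The following Python sketch is only an illustration under that assumption (combination as union, overlap as intersection, exclusion as difference that drops dangling edges); the `Graph` class and the sample graphs are hypothetical and are not part of Gradoop or GrALa.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Hypothetical in-memory graph: vertex ids plus (source, target) edges."""
    vertices: frozenset
    edges: frozenset  # set of (source_id, target_id) tuples

    def combine(self, other):
        # Assumed semantics: union of both vertex sets and both edge sets.
        return Graph(self.vertices | other.vertices, self.edges | other.edges)

    def overlap(self, other):
        # Assumed semantics: only vertices and edges contained in both graphs.
        return Graph(self.vertices & other.vertices, self.edges & other.edges)

    def exclude(self, other):
        # Assumed semantics: vertices only in this graph, plus the edges
        # whose endpoints both survive.
        kept = self.vertices - other.vertices
        edges = frozenset((s, t) for (s, t) in self.edges if s in kept and t in kept)
        return Graph(kept, edges)


g0 = Graph(frozenset({"Alice", "Bob"}), frozenset({("Alice", "Bob")}))
g1 = Graph(frozenset({"Bob", "Carol"}), frozenset({("Bob", "Carol")}))

combined = g0.combine(g1)
shared = g0.overlap(g1)
only_g0 = g0.exclude(g1)
```

Here `combined` holds all three persons and both edges, `shared` holds only Bob, and `only_g0` holds only Alice because her edge to Bob loses its target.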

Page 30: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Combination

personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

Page 33: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Summarization

personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
vertexGroupingKeys = {:type, "city"}
edgeGroupingKeys = {:type}
vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

Page 35: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Graph Collection Operators

Operator     | GrALa notation

Collection
Selection    | collection.select(predicate) : Collection
Distinct     | collection.distinct() : Collection
Sort by      | collection.sortBy(key, [:asc|:desc]) : Collection
Top          | collection.top(limit) : Collection
Union        | collection.union(otherCollection) : Collection
Intersection | collection.intersect(otherCollection) : Collection
Difference   | collection.difference(otherCollection) : Collection

Auxiliary
Apply        | collection.apply(unaryGraphOperator) : Collection
Reduce       | collection.reduce(binaryGraphOperator) : Graph
Call         | [graph|collection].callFor[Graph|Collection](algorithm, parameters) : [Graph|Collection]
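
To make the auxiliary operators more concrete, here is a small Python sketch of how apply, reduce, sortBy and top could behave on an in-memory list of graphs. It is an assumption-laden illustration: graphs are plain dictionaries, and the function names only mirror the GrALa notation above, they are not Gradoop code.

```python
from functools import reduce as fold


def apply(collection, unary_op):
    """collection.apply(op): run a unary graph operator on every graph."""
    return [unary_op(g) for g in collection]


def reduce(collection, binary_op):
    """collection.reduce(op): fold the collection into a single graph."""
    return fold(binary_op, collection)


def sort_by(collection, key, descending=False):
    """collection.sortBy(key, :asc|:desc): order graphs by a graph property."""
    return sorted(collection, key=lambda g: g[key], reverse=descending)


def top(collection, limit):
    """collection.top(limit): keep the first `limit` graphs."""
    return collection[:limit]


# Hypothetical toy graphs: each graph is a dict holding its vertex set "V".
graphs = [{"V": {1, 2}}, {"V": {2, 3, 4}}, {"V": {5}}]

# apply: annotate every graph with its vertex count.
with_size = apply(graphs, lambda g: {**g, "size": len(g["V"])})

# sortBy + top: the single largest graph.
largest = top(sort_by(with_size, "size", descending=True), 1)

# reduce with a combination-like operator: union of all vertex sets.
combined = reduce(graphs, lambda a, b: {"V": a["V"] | b["V"]})
```

Passing an operator as an argument is what lets apply and reduce reuse the unary and binary graph operators on whole collections.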

Page 36: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Selection

collection = <db.G[0],db.G[1],db.G[2]>
predicate = (Graph g => |g.V| > 3)
result = collection.select(predicate)
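
The selection above keeps every graph with more than three vertices. A minimal Python sketch of the same idea, assuming graphs are dictionaries with a vertex set `V` (the three sample graphs are invented for illustration):

```python
def select(collection, predicate):
    """Keep only the graphs for which the predicate holds."""
    return [g for g in collection if predicate(g)]


# Hypothetical stand-ins for db.G[0], db.G[1], db.G[2].
collection = [
    {"id": 0, "V": {"Alice", "Bob", "Eve"}},
    {"id": 1, "V": {"Carol", "Dave", "Mallory", "Peggy"}},
    {"id": 2, "V": {"Alice", "Bob", "Carol", "Dave", "Eve"}},
]

# predicate = (Graph g => |g.V| > 3)
result = select(collection, lambda g: len(g["V"]) > 3)
selected_ids = [g["id"] for g in result]
```

With these sample graphs only the second and third graph pass the predicate.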

Page 40: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Extended Property Graph Model in Flink

Fields

VertexData: ID, Label, Properties, Graphs

EdgeData: ID, Label, Properties, Source Vertex, Target Vertex, Graphs

GraphData: ID, Label, Properties

Pojo Representation (Gelly)

𝒱: DataSet<Vertex<ID,VertexData>>, VertexData as POJO

ℰ: DataSet<Edge<ID,EdgeData>>, EdgeData as POJO

𝒢: DataSet<Subgraph<ID,GraphData>>, GraphData as POJO

Tuple Representation

𝒱: DataSet<VertexData>, VertexData as Tuple

ℰ: DataSet<EdgeData>, EdgeData as Tuple

𝒢: DataSet<GraphData>, GraphData as Tuple
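
The field lists above can be sketched as three record types. The Python dataclasses below are an illustration only: the field names follow the slide, while the types, the sample data and the helper `vertices_of` are assumptions. The actual implementation uses Flink DataSets of Java POJOs or tuples.

```python
from dataclasses import dataclass, field


@dataclass
class GraphData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class VertexData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)  # ids of the graphs containing this vertex


@dataclass
class EdgeData:
    id: int
    label: str
    source_vertex: int
    target_vertex: int
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)  # ids of the graphs containing this edge


def vertices_of(graph_id, vertices):
    """Hypothetical helper: all vertices that belong to the given graph."""
    return [v for v in vertices if graph_id in v.graphs]


# Invented sample data.
community = GraphData(0, "Community", {"topic": "Databases"})
alice = VertexData(0, "Person", {"name": "Alice"}, {0})
bob = VertexData(1, "Person", {"name": "Bob"}, {0, 1})
eve = VertexData(2, "Person", {"name": "Eve"}, {1})
knows = EdgeData(0, "knows", alice.id, bob.id, graphs={0})

members = [v.properties["name"] for v in vertices_of(community.id, [alice, bob, eve])]
```

Storing graph membership on each vertex and edge is what allows one vertex (here Bob) to belong to several graphs at once.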

Page 41: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Summarization in Flink

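Input vertices 𝒱 (VID, City): (0, L), (1, L), (2, D), (3, D), (4, D), (5, B)

Input edges (EID, S, T): (0, 0, 1), (1, 1, 0), (2, 1, 2), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 4, 0), (7, 4, 1), (8, 5, 2), (9, 5, 3)

Vertex summarization

groupBy(City): L [0,1], D [2,3,4], B [5]

reduceGroup + filter + map:

𝒱′ (VID, City, Count): (0, L, 2), (2, D, 3), (5, B, 1)

(VID, Rep): (0, 0), (1, 0), (2, 2), (3, 2), (4, 2), (5, 5)

Edge summarization

join(VID==S), result (ID, S, T): (0, 0, 1), (1, 0, 0), (2, 0, 2), (3, 2, 1), (4, 2, 3), (5, 2, 2), (6, 2, 0), (7, 2, 1), (8, 5, 2), (9, 5, 3)

join(VID==T), result (ID, S, T): (0, 0, 0), (1, 0, 0), (2, 0, 2), (3, 2, 0), (4, 2, 2), (5, 2, 2), (6, 2, 0), (7, 2, 0), (8, 5, 2), (9, 5, 2)

groupBy(S,T): 0,0 [0,1]; 0,2 [2]; 2,0 [3,6,7]; 2,2 [4,5]; 5,2 [8,9]

reduceGroup + filter + map:

ℰ’ (EID, S, T, Count): (0, 0, 0, 2), (2, 0, 2, 1), (3, 2, 0, 3), (4, 2, 2, 2), (8, 5, 2, 2)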
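
The dataflow on this slide can be traced in a few lines of Python using the slide's own vertex and edge tables. This is a single-machine sketch, not the Flink implementation; choosing the smallest id of each group as its representative is an assumption that happens to match the ids shown on the slide.

```python
from collections import defaultdict

# Input data from the slide: VID -> City and EID -> (S, T).
vertices = {0: "L", 1: "L", 2: "D", 3: "D", 4: "D", 5: "B"}
edges = {0: (0, 1), 1: (1, 0), 2: (1, 2), 3: (2, 1), 4: (2, 3),
         5: (3, 2), 6: (4, 0), 7: (4, 1), 8: (5, 2), 9: (5, 3)}


def summarize(vertices, edges):
    # groupBy(City)
    groups = defaultdict(list)
    for vid, city in vertices.items():
        groups[city].append(vid)

    # reduceGroup: one summarized vertex per city plus a VID -> representative map.
    summary_vertices = {}
    representative = {}
    for city, members in groups.items():
        rep = min(members)  # assumption: smallest id represents the group
        summary_vertices[rep] = (city, len(members))
        for vid in members:
            representative[vid] = rep

    # join(VID==S) and join(VID==T): replace both endpoints by their representatives,
    # then groupBy(S,T).
    edge_groups = defaultdict(list)
    for eid, (s, t) in edges.items():
        edge_groups[(representative[s], representative[t])].append(eid)

    # reduceGroup: one summarized edge per (S, T) pair with its count.
    summary_edges = {min(eids): (s, t, len(eids)) for (s, t), eids in edge_groups.items()}
    return summary_vertices, summary_edges


summary_vertices, summary_edges = summarize(vertices, edges)
```

The result has three summarized vertices (one per city) and five summarized edges, each carrying the number of original elements it stands for.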

Page 42: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Use Case: Graph Business Intelligence

Page 44: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Use Case: Graph Business Intelligence

Relational approach

- Business intelligence is usually based on relational data warehouses
- Enterprise data is integrated within a dimensional schema
- Analysis is limited to predefined relationships
- No support for relationship-oriented data mining

[Figure: dimensional schema with a Facts table and Dim 1, Dim 2, Dim 3]

Graph-based approach

- Integrate data sources within an instance graph by preserving original relationships between data objects (transactional and master data)
- Determine subgraphs (business transaction graphs) related to business activities
- Analyze subgraphs or entire graphs with aggregation queries, mining relationship patterns, etc.

Page 45: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Prerequisites: Data Integration

Page 46: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Business Transaction Graphs

[Figure: example instance graph with data from CIT and ERP]

Vertices: Employee (Name: Dave), Employee (Name: Alice), Employee (Name: Bob), Employee (Name: Carol), Ticket (Expense: 500), SalesQuotation, SalesOrder, PurchaseOrder (2x), SalesRevenue (Revenue: 5,000), PurchaseInvoice (Expense: 2,000), PurchaseInvoice (Expense: 1,500)

Edge labels: sentBy, createdBy (2x), processedBy (3x), openedFor, basedOn, serves (2x), bills (3x)

Page 52: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(1) BTG Extraction

[Figure: extracted business transaction graphs BTG 1, BTG 2, BTG 3, BTG 4, BTG 5, …, BTG n]

Page 53: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(1) BTG Extraction

// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )

Page 56: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(2) Profit Aggregation

BTG   | ∑ Revenue | ∑ Expenses | Net Profit
BTG 1 | 5,000     | -3,000     | 2,000
BTG 2 | 9,000     | -3,000     | 6,000
BTG 3 | 2,000     | -1,500     | 500
BTG 4 | 5,000     | -7,000     | -2,000
BTG 5 | 10,000    | -15,000    | -5,000
…     | …         | …          | …
BTG n | 8,000     | -4,000     | 4,000

Page 57: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(2) Profit Aggregation

// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )
// define profit aggregate function
aggFunc = ( Graph g => g.V.values("Revenue").sum() - g.V.values("Expense").sum() )
// apply aggregate function and store result at new property
btgs = btgs.apply( Graph g => g.aggregate( "Profit" , aggFunc ) )
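
The profit aggregate sums all Revenue values and subtracts all Expense values of a graph's vertices. The Python sketch below replays that arithmetic on the property values of the example graph (Revenue 5,000; Expenses 500, 2,000 and 1,500). The dictionary layout and helper names are assumptions for illustration, not the GrALa runtime.

```python
def values(vertices, key):
    """All values of a property key among the vertices that carry it."""
    return [v[key] for v in vertices if key in v]


def profit(graph):
    # aggFunc: sum of revenues minus sum of expenses over all vertices.
    return sum(values(graph["V"], "Revenue")) - sum(values(graph["V"], "Expense"))


def aggregate(graph, property_key, agg_func):
    """Store the aggregate value as a new graph property."""
    return {**graph, property_key: agg_func(graph)}


# One business transaction graph with the vertex properties of the example graph.
btg = {"V": [
    {"label": "Ticket", "Expense": 500},
    {"label": "SalesRevenue", "Revenue": 5000},
    {"label": "PurchaseInvoice", "Expense": 2000},
    {"label": "PurchaseInvoice", "Expense": 1500},
    {"label": "SalesOrder"},
]}

# btgs.apply(Graph g => g.aggregate("Profit", aggFunc))
btgs = [aggregate(g, "Profit", profit) for g in [btg]]
```

For this graph the aggregate is 5,000 - (500 + 2,000 + 1,500) = 1,000, stored under the new property "Profit".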

Page 59: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(3) BTG Clustering

// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)
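
Using the net profit values from the aggregation table (2,000; 6,000; 500; -2,000; -5,000; 4,000), the two clusters can be reproduced with a selection followed by a difference. This Python sketch assumes each graph is a dictionary with a "Profit" property.

```python
# Net profit per BTG, taken from the profit aggregation table.
btgs = [
    {"id": "BTG 1", "Profit": 2000},
    {"id": "BTG 2", "Profit": 6000},
    {"id": "BTG 3", "Profit": 500},
    {"id": "BTG 4", "Profit": -2000},
    {"id": "BTG 5", "Profit": -5000},
    {"id": "BTG n", "Profit": 4000},
]

# profitBtgs = btgs.select(Graph g => g["Profit"] >= 0)
profit_btgs = [g for g in btgs if g["Profit"] >= 0]

# lossBtgs = btgs.difference(profitBtgs)
loss_btgs = [g for g in btgs if g not in profit_btgs]

profit_ids = [g["id"] for g in profit_btgs]
loss_ids = [g["id"] for g in loss_btgs]
```

Defining the loss cluster as a difference guarantees that the two clusters are disjoint and together cover the whole collection.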

Page 62: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(4) Cluster Characteristic Patterns

[Figure: the per-BTG table of ∑ Revenue, ∑ Expenses and Net Profit from step (2), shown with example pattern elements: Ticket, Alice, processedBy, Bob, createdBy, PurchaseOrder]

Page 63: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

(4) Cluster Characteristic Patterns

// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)
// apply magic
profitFreqPats = profitBtgs.callForCollection( :FrequentSubgraphs , {"Threshold":0.7} )
lossFreqPats = lossBtgs.callForCollection( :FrequentSubgraphs , {"Threshold":0.7} )
// determine cluster characteristic patterns
trivialPats = profitFreqPats.intersect(lossFreqPats)
profitCharPatterns = profitFreqPats.difference(trivialPats)
lossCharPatterns = lossFreqPats.difference(trivialPats)
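
The last three lines are plain set algebra on the two sets of frequent patterns. The Python sketch below shows that step with heavily simplified inputs: each graph is reduced to a set of opaque pattern names (A, B, C are invented), and "Threshold" is assumed to mean the minimum share of graphs in a cluster that must contain a pattern. A real frequent subgraph miner enumerates subgraphs instead of receiving them.

```python
def frequent_patterns(graphs, threshold):
    """Patterns contained in at least `threshold` (a fraction) of the graphs.

    Assumed reading of the "Threshold" parameter; graphs are given as sets of
    pattern names rather than real subgraphs.
    """
    counts = {}
    for patterns in graphs:
        for p in patterns:
            counts[p] = counts.get(p, 0) + 1
    return {p for p, c in counts.items() if c / len(graphs) >= threshold}


# Hypothetical clusters: every graph is the set of patterns it contains.
profit_btgs = [{"A", "B"}, {"A", "B"}, {"A", "B", "C"}]
loss_btgs = [{"A", "C"}, {"A", "C"}, {"A", "B", "C"}]

profit_freq_pats = frequent_patterns(profit_btgs, 0.7)
loss_freq_pats = frequent_patterns(loss_btgs, 0.7)

# Patterns frequent in both clusters say nothing about profit or loss.
trivial_pats = profit_freq_pats & loss_freq_pats
profit_char_patterns = profit_freq_pats - trivial_pats
loss_char_patterns = loss_freq_pats - trivial_pats
```

In this toy input, pattern A is frequent in both clusters and is discarded as trivial, which leaves B as characteristic for the profit cluster and C for the loss cluster.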

Page 64: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Current State & Future Work

Page 65: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Current State

0.0.1 First Prototype (May 2015)
- Hadoop MapReduce and Giraph for operator implementations
- Too much complexity
- Performance loss through serialization in HDFS/HBase

0.0.2 Using Flink as execution layer (June 2015)
- Basic operators

Currently 0.0.3-SNAPSHOT
- Performance improvements
- More operator implementations

Page 66: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Operator implementations (0.0.3-SNAPSHOT)

Unary: Pattern Matching, Aggregation, Projection, Summarization

Binary: Combination, Overlap, Exclusion, Isomorphism

Collection: Selection, Distinct, Sort by, Top, Union, Intersection, Difference

Auxiliary: Apply, Reduce, Call

Algorithms: LabelPropagation, BTG Extraction, FSM

Page 67: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Future Work

- Operator integration into Gelly: Summarization (FLINK-2411), Graph Sampling, …
- Graph Operations on streams (Flink)
- Graph Partitioning (maybe together with the Gelly people)
- Graph Versioning (Storage)
- Benchmarking
- GrALa Interpreter / Web UI

Page 68: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Benchmarks Sneak Preview

[Chart: Summarization (Vertex and Edge Labels); x-axis: # Worker (1, 2, 4, 8, 16); y-axis: Time [s] (0 to 1400)]

Cluster: 16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 Cores), 48 GB RAM

Software: Hadoop 2.5.2, Flink 0.9.0

Configuration: slots (per node) 12, jobmanager.heap.mb 2048, taskmanager.heap.mb 40960

Dataset: Foodbroker Graph (https://github.com/dbs-leipzig/foodbroker), generates BI process data; 858,624,267 vertices, 4,406,445,007 edges, 663 GB payload

Page 69: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Web UI Sneak Preview

Page 70: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Contributions welcome

Code
- Operator implementations
- Performance Tuning
- Storage layout

Data! and Use Cases
- We are researchers, we assume ...
- Getting real data (especially BI data) is nearly impossible

People
- Bachelor / Master / PhD Thesis

Page 71: Martin Junghans – Gradoop: Scalable Graph Analytics with Apache Flink

Thank you for building Flink!

www.gradoop.com

https://github.com/dbs-leipzig/gradoop

http://dbs.uni-leipzig.de/file/GradoopTR.pdf

http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf