graphgen: conducting graph analytics over relational databases

GraphGen: Conducting Graph Analytics over Relational Databases Konstantinos Xirogiannopoulos Amol Deshpande


Post on 13-Apr-2017


Page 1: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen: Conducting Graph Analytics over Relational Databases

Konstantinos Xirogiannopoulos Amol Deshpande

Page 2: GraphGen: Conducting Graph Analytics over Relational Databases

Example graph (figure): nodes Name: Konstantinos, Name: Amol, Name: University of Maryland, and Name: PyData DC (Year: 2016); edge labels collaborated, gave_talk, works_at, works_at.

Page 3: GraphGen: Conducting Graph Analytics over Relational Databases

Graph Analytics: (Network Science)

Using graph algorithms on the connections between entities in a network to gain insight into those entities and/or the network itself.

Page 4: GraphGen: Conducting Graph Analytics over Relational Databases

1) Why graph analytics?

2) How are graph analytics done currently?

3) What are most people dealing with?

4) Bolt-on graph analytics with GraphGen

5) The GraphGen Language

Page 5: GraphGen: Conducting Graph Analytics over Relational Databases

Graphs Across Domains

Protein-protein interaction networks

Financial transaction networks

Stock Trading Networks

Social Networks

Federal Funds Networks

Knowledge Graph

World Wide Web

Communication Networks

Citation Networks

...

http://go.umd.edu/graphs

Page 6: GraphGen: Conducting Graph Analytics over Relational Databases

Example Use cases

● Financial crimes (e.g. money laundering)

● Fraudulent transactions

● Cybercrime

● Counterterrorism

● Key players in a network

● Ranking entities (web pages, PageRank)

● Providing connection recommendations to users

● Optimizing transportation routes

● Identifying weaknesses in power grids, water grids etc.

● Computer networks

● Medical Research

● Disease pathology

● DNA Sequencing

Page 7: GraphGen: Conducting Graph Analytics over Relational Databases

1) Why graph analytics?

2) How are graph analytics done currently?

3) What are most people dealing with?

4) Bolt-on graph analytics with GraphGen

5) The GraphGen Language

Page 8: GraphGen: Conducting Graph Analytics over Relational Databases

Types of Graph Analytics

● Graph “queries”: Subgraph pattern matching, shortest paths, temporal queries

● Real Time Analytics: Anomaly/Event detection, online prediction

● Batch Analytics (Network Science): Centrality analysis, community detection, network evolution

● Machine Learning: Matrix factorization, logistic regression modeled as message passing in specially structured graphs.

http://go.umd.edu/graphs

Page 9: GraphGen: Conducting Graph Analytics over Relational Databases

State of the art

● Graph Analytics tasks are too widely varied

● There is no one-size-fits-all solution

○ RDBMS/Hadoop/Spark have their tradeoffs

● Fragmented area with little consensus

❖ Specialized graph databases (Neo4j, Titan, Blazegraph, Cayley, Dgraph)

❖ RDF stores (AllegroGraph, Jena)

❖ Bolt-on solutions (Teradata SQL-Graph, SAP Graph Engine, Oracle)

❖ Distributed batch processing systems (Giraph, GraphX, GraphLab)

Lots of ETL required!

❖ Many more research prototypes...

http://go.umd.edu/graphs

Page 10: GraphGen: Conducting Graph Analytics over Relational Databases

Different Analytics Flows

Other Systems

Graph Databases

Bolt-On Solutions

Page 11: GraphGen: Conducting Graph Analytics over Relational Databases

What should I use, then?

● What fraction of the overall workload is graph-oriented?

● How often does some sort of graph analytics need to run?

● Do you need to do graph updates?

● What types of analytics are required?

● How large would the graphs be?

● Are you starting from scratch or do you have an already deployed DBMS?

Page 12: GraphGen: Conducting Graph Analytics over Relational Databases

1) Why graph analytics?

2) How are graph analytics done currently?

3) What are most people dealing with?

4) Bolt-on graph analytics with GraphGen

5) The GraphGen Language

Page 13: GraphGen: Conducting Graph Analytics over Relational Databases

Where’s the Data?

● Most business analytics (querying, reporting, OLAP) happens in SQL

● Organizations typically model their data according to their needs

● Graph databases fit if you have strictly graph-centric workloads

Page 14: GraphGen: Conducting Graph Analytics over Relational Databases

Where’s the Data?

● Most likely organized in some type of database schema

● A collection of tables related to each other through common attributes or primary-key/foreign-key constraints

We need to extract connections between entities

Page 15: GraphGen: Conducting Graph Analytics over Relational Databases

Most Likely...

Page 16: GraphGen: Conducting Graph Analytics over Relational Databases

Lots of “hidden” graphs

● Let’s take TPC-H.

Part(part_key, supplier_key, ...)

Customer(customer_key, customer_name, ...)

Orders(order_key, part_key, customer_key, ...)

Supplier(supplier_key, supplier_name, ...)

● We could create edges between two customers if they’ve:

○ Bought the same item

○ Bought the same item on the same day

○ Bought from the same supplier

○ Etc.
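As a rough illustration of the "bought the same item" case, the hidden graph can be surfaced with a plain SQL self-join. The sketch below uses Python's built-in sqlite3 module with an in-memory table; the table contents are invented for illustration, and only the three Orders columns named on the slide are modeled. This is hand-written SQL, not what GraphGen generates.

```python
import sqlite3

# Toy stand-in for the Orders table; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_key INTEGER, part_key INTEGER, customer_key INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 100, 1),  # customer 1 bought part 100
        (2, 100, 2),  # customer 2 bought part 100
        (3, 200, 2),  # customer 2 bought part 200
        (4, 200, 3),  # customer 3 bought part 200
        (5, 300, 4),  # customer 4 bought part 300 (no co-buyer)
    ],
)

# Self-join on part_key: two distinct customers are linked if they ordered
# the same part. The "<" comparison keeps one row per undirected pair and
# drops self-pairs.
edges = conn.execute(
    """
    SELECT DISTINCT a.customer_key, b.customer_key
    FROM orders AS a
    JOIN orders AS b ON a.part_key = b.part_key
    WHERE a.customer_key < b.customer_key
    ORDER BY a.customer_key, b.customer_key
    """
).fetchall()

print(edges)  # [(1, 2), (2, 3)]
```

Each different edge definition (same day, same supplier) would need its own join, which is the kind of repetitive extraction work a declarative layer aims to remove.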


Page 19: GraphGen: Conducting Graph Analytics over Relational Databases

1) Why graph analytics?

2) How are graph analytics done currently?

3) What are most people dealing with?

4) Bolt-on graph analytics with GraphGen

5) The GraphGen Language

Page 20: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen

Extract and analyze many different kinds of graphs

Simple, Intuitive, Declarative Language, No ETL required

Full Graph API & Vertex Centric Framework

Page 21: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen Interfaces

Native Java Library

Python wrapper Library

GraphGen Explorer: UI Web Application

Page 22: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen Explorer Web App

Page 23: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen Explorer Web App

● Exploration of database schema to detect different types of hidden graphs.

● Allows users to visually explore potential graphs.

● Simple statistics and on-the-fly analysis

Not all graphs will be useful!

Page 25: GraphGen: Conducting Graph Analytics over Relational Databases

GraphgenPy in Python

Page 26: GraphGen: Conducting Graph Analytics over Relational Databases

from graphgenpy import GraphGenerator
import networkx as nx

# Define GraphGen query
datalogQuery = """
Nodes(ID, Name) :- Author(ID, Name).
Edges(ID1, ID2) :- AuthorPublication(ID1, PubID), AuthorPublication(ID2, PubID).
"""

# Database credentials: host, port, database name, username, password
gg = GraphGenerator("localhost", "5432", "testgraphgen", "kostasx", "password")

# Generate and serialize the graph to a GML file
fname = gg.generateGraph(datalogQuery, "extracted_graph", GraphGenerator.GML)

# Load the graph into NetworkX, using the GML 'id' attribute as the node label
G = nx.read_gml(fname, label='id')
print("Graph Loaded into NetworkX! Running PageRank...")

# Run any algorithm on the graph using NetworkX
print(nx.pagerank(G))
print("Done!")
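For readers who want to see roughly what the PageRank step computes, here is a minimal power-iteration sketch in plain Python. It is not NetworkX's implementation; the function name, the damping value of 0.85, the fixed iteration count, and the tiny co-authorship graph are all assumptions made for illustration.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [successors]}.

    Every node must appear as a key, even if it has no successors.
    """
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_edges[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_edges[v]:
                # Each node splits its current rank equally among its successors.
                new_rank[w] += damping * rank[v] / len(out_edges[v])
        rank = new_rank
    return rank


# Hypothetical co-authorship graph; each undirected edge is listed in both directions.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

In this toy graph, author "a" has co-authored with both others and so ends up with the highest score.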

Page 27: GraphGen: Conducting Graph Analytics over Relational Databases

Native GraphGen in Java

Page 28: GraphGen: Conducting Graph Analytics over Relational Databases

// Database credentials: establish connection to the database
GraphGenerator ggen = new GraphGenerator("host", "port", "dbName",
        "username", "password");

// Define GraphGen query: define and evaluate a single graph extraction query
String datalog_query = "...";

// Extract and load graph
Graph g = ggen.generateGraph(datalog_query).get(0);

// Define vertex-centric program: initialize vertex-centric object
VertexCentric p = new VertexCentric(g);

// Define vertex-centric compute function
Executor program = new Executor("result_value_name") {
    @Override
    public void compute(Vertex v, VertexCentric p) {
        // implementation of compute function
    }
};

// Run program: begin execution
p.run(program);
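The compute function is left empty on the slide. To give a feel for the vertex-centric model in general, the sketch below runs a per-vertex function in synchronous supersteps until nothing changes, and uses it to label connected components. It is written in Python with invented names (run_vertex_centric, min_label) and does not mirror GraphGen's Java Executor or VertexCentric classes.

```python
def run_vertex_centric(adjacency, compute, initial, max_supersteps=100):
    """Run compute on every vertex per superstep until no value changes."""
    values = dict(initial)
    for _ in range(max_supersteps):
        # Every vertex reads only the values from the previous superstep.
        updated = {v: compute(v, values, adjacency[v]) for v in adjacency}
        if updated == values:
            break
        values = updated
    return values


def min_label(vertex, values, neighbors):
    """Connected components: adopt the smallest label among self and neighbors."""
    return min([values[vertex]] + [values[n] for n in neighbors])


# Hypothetical undirected graph with two components: {1, 2, 3} and {4, 5}.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = run_vertex_centric(adjacency, min_label, {v: v for v in adjacency})
print(components)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

The appeal of the model is that the per-vertex function only looks at one vertex and its neighbors, which is what makes parallel execution straightforward.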

Page 29: GraphGen: Conducting Graph Analytics over Relational Databases

// Database credentials: establish connection to the database
GraphGenerator ggen = new GraphGenerator("host", "port", "dbName",
        "username", "password");

// Define GraphGen query: define and evaluate a single graph extraction query
String datalog_query = "...";

// Extract and load graph
Graph g = ggen.generateGraph(datalog_query).get(0);

// Use the full API to access the graph: visit every vertex
for (Vertex v : g.getVertices()) {
    // For each outgoing neighbor
    for (Vertex neighbor : v.getVertices(Direction.OUT)) {
        // Do something
    }
}

Page 30: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen Back-End Architecture

Page 31: GraphGen: Conducting Graph Analytics over Relational Databases

1) Why graph analytics?

2) How are graph analytics done currently?

3) What are most people dealing with?

4) Bolt-on graph analytics with GraphGen

5) The GraphGen Language

Page 32: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen DSL

● Intuitive Domain Specific Language based on Datalog

● User needs to specify:

○ How the nodes are defined

○ How the edges are defined

● The query is executed, and the user gets a Graph object to operate upon.

● Very expressive: Allows for homogeneous and heterogeneous graphs with various types of nodes and edges.

Page 33: GraphGen: Conducting Graph Analytics over Relational Databases

TPC-H Database

Part(partKey, supplierKey, ...)

Customer(customerKey, customerName, ...)

Orders(orderKey, partKey, customerKey, ...)

Supplier(supplierKey, supplierName, ...)

● We want to explore a graph of customers!

● Using the GraphGen Language:

○ Which tables do we need to combine to extract the nodes and edges?

Page 34: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen DSL Example

Nodes(ID, Name) :- Customer(ID, Name).

● Creates a node out of each row in the Customer table

■ Customer ID and Name as properties

Edges(ID1, ID2) :- Orders(_,partKey, ID1), Orders(_,partKey, ID2).

● Connect ID1 -> ID2 if they have both ordered the same part
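To make the two rules concrete, here is a small Python sketch that evaluates them by hand over in-memory tuples. The rows are invented, and the evaluation strategy (group by partKey, then pair up buyers) is just one way to compute the join, not a description of GraphGen's internals. One assumption is flagged in the code: read literally, the Edges rule also matches a customer with itself, and the sketch drops those self-pairs.

```python
from collections import defaultdict

# Toy tables; the rows are invented for illustration.
customer = [(1, "Alice"), (2, "Bob"), (3, "Carol")]  # (ID, Name)
orders = [  # (orderKey, partKey, customerKey)
    (10, 100, 1),
    (11, 100, 2),
    (12, 200, 2),
    (13, 200, 3),
]

# Nodes(ID, Name) :- Customer(ID, Name).
nodes = {cust_id: {"Name": name} for cust_id, name in customer}

# Edges(ID1, ID2) :- Orders(_, partKey, ID1), Orders(_, partKey, ID2).
# Group customers by partKey, then pair up everyone within a group.
buyers_by_part = defaultdict(set)
for _order_key, part_key, cust_id in orders:
    buyers_by_part[part_key].add(cust_id)

edges = {
    (id1, id2)
    for buyers in buyers_by_part.values()
    for id1 in buyers
    for id2 in buyers
    if id1 != id2  # assumption: drop the self-pairs the rule would also match
}

print(sorted(edges))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```

Because the rule is symmetric in ID1 and ID2, every pair shows up in both directions.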

Page 35: GraphGen: Conducting Graph Analytics over Relational Databases

GraphGen

● Enable extraction of different types of hidden graphs

● Independent of where the data is stored (given SQL)

● Enable complex analytics over the extracted graphs

● Efficient extraction through various in-memory representations

● Efficient analysis through a parallel execution engine

● Effortless through a Declarative Language

● Eliminates the need for complex ETL

● Intuitive and swift analysis of any graph that exists in your data!

Page 36: GraphGen: Conducting Graph Analytics over Relational Databases

Download GraphGen at: konstantinosx.github.io/graphgen-project/

DDL Blog Post at: blog.districtdatalabs.com/graph-analytics-over-relational-datasets

Page 37: GraphGen: Conducting Graph Analytics over Relational Databases

Email: [email protected]

Twitter: @kxirog

Download GraphGen at: konstantinosx.github.io/graphgen-project/

Thank you!