GraphGen: Conducting Graph Analytics over Relational Databases

Konstantinos Xirogiannopoulos, Amol Deshpande

Upload: pydata

Post on 16-Apr-2017



[Figure: example graph. Nodes: Konstantinos, Amol, University of Maryland, PyData DC (Year: 2016). Edge labels: collaborated, gave_talk, works_at.]

Graph Analytics: (Network Science)

Using the connections between entities in a network, by way of graph algorithms, to gain insight into those entities and/or the network as a whole.

1) Why graph analytics?
2) How are graph analytics done currently?
3) What are most people dealing with?
4) Bolt-on graph analytics with GraphGen
5) The GraphGen Language

Graphs Across Domains

● Protein-protein interaction networks

● Financial transaction networks

● Stock Trading Networks

● Social Networks

● Federal Funds Networks

● Knowledge Graph

● World Wide Web

● Communication Networks

● Citation Networks

● ...

http://go.umd.edu/graphs

Example Use cases

● Financial crimes (e.g. money laundering)

● Fraudulent transactions

● Cybercrime

● Counterterrorism

● Key players in a network

● Ranking entities (web pages, PageRank)

● Providing connection recommendations to users

● Optimizing transportation routes

● Identifying weaknesses in power grids, water grids etc.

● Computer networks

● Medical Research

● Disease pathology

● DNA Sequencing


Types of Graph Analytics

● Graph “queries”: Subgraph pattern matching, shortest paths, temporal queries

● Real Time Analytics: Anomaly/Event detection, online prediction

● Batch Analytics (Network Science): Centrality analysis, community detection, network evolution

● Machine Learning: Matrix factorization, logistic regression modeled as message passing in specially structured graphs.


State of the art

● Graph Analytics tasks are too widely varied

● There is no one-size-fits-all solution
  ○ RDBMS/Hadoop/Spark have their tradeoffs

● Fragmented area with little consensus
  ❖ Specialized graph databases (Neo4j, Titan, Blazegraph, Cayley, Dgraph)
  ❖ RDF stores (AllegroGraph, Jena)
  ❖ Bolt-on solutions (Teradata SQL-Graph, SAP Graph Engine, Oracle)
  ❖ Distributed batch processing systems (Giraph, GraphX, GraphLab). Lots of ETL required!
  ❖ Many more research prototypes...

Different Analytics Flows

[Figure: analytics flows for three setups: Graph Databases, Bolt-On Solutions, Other Systems]

What should I use then??

● What fraction of the overall workload is graph-oriented?

● How often does some sort of graph analytics need to run?

● Do you need to do graph updates?

● What types of analytics are required?

● How large would the graphs be?

● Are you starting from scratch or do you have an already deployed DBMS?


Where’s the Data?

● Most business analytics (querying, reporting, OLAP) happen in SQL

● Organizations typically model their data according to their needs

● Graph databases fit if you have strictly graph-centric workloads

● Most likely organized in some type of database schema

● A collection of tables related to each other through common attributes or primary/foreign-key constraints

We need to extract connections between entities

Most Likely...

Lots of “hidden” graphs

● Let’s take TPC-H:

  Part(part_key, supplier_key, ...)
  Customer(customer_key, customer_name, ...)
  Orders(order_key, part_key, customer_key, ...)
  Supplier(supplier_key, supplier_name, ...)

● We could create edges between two customers if they’ve:
  ○ Bought the same item
  ○ Bought the same item on the same day
  ○ Bought from the same supplier
  ○ Etc.
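To make the idea of several "hidden" graphs in one schema concrete, here is a minimal pure-Python sketch. It is not GraphGen code: the rows, the helper `co_occurrence_edges`, and the tuple layouts are all hypothetical, chosen only to follow the simplified Part and Orders tables on this slide. It derives two different customer graphs from the same data, one for "bought the same item" and one for "bought from the same supplier".

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy rows following the simplified schema on the slide.
part = [(100, 7), (200, 7), (300, 8)]  # (part_key, supplier_key)
orders = [(10, 100, 1), (11, 200, 2), (12, 300, 3), (13, 100, 3)]  # (order_key, part_key, customer_key)

def co_occurrence_edges(pairs):
    """Given (group, customer) pairs, link every two customers sharing a group."""
    members = defaultdict(set)
    for group, customer in pairs:
        members[group].add(customer)
    edges = set()
    for customers in members.values():
        # Undirected edges: each pair is emitted once, smaller key first.
        edges.update(combinations(sorted(customers), 2))
    return sorted(edges)

# Edge rule 1: two customers bought the same item.
same_item = co_occurrence_edges((p, c) for _, p, c in orders)

# Edge rule 2: two customers bought from the same supplier (Orders joined with Part).
supplier_of = dict(part)
same_supplier = co_occurrence_edges((supplier_of[p], c) for _, p, c in orders)

print(same_item)      # [(1, 3)]
print(same_supplier)  # [(1, 2), (1, 3), (2, 3)]
```

The same four orders yield a sparse graph under one rule and a fully connected one under the other, which is why the choice of edge definition matters before any analysis is run.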



GraphGen

Extract and analyze many different kinds of graphs

Simple, Intuitive, Declarative Language, No ETL required

Full Graph API & Vertex Centric Framework

GraphGen Interfaces

● Native Java Library

● Python wrapper Library

● GraphGen Explorer: UI Web Application

GraphGen Explorer Web App

● Exploration of database schema to detect different types of hidden graphs.

● Allows users to visually explore potential graphs.

● Simple statistics and on-the-fly analysis

Not all graphs will be useful!

GraphGen Explorer Web App

GraphgenPy in Python

from graphgenpy import GraphGenerator
import networkx as nx

# Define the GraphGen query
datalogQuery = """
Nodes(ID, Name) :- Author(ID, Name).
Edges(ID1, ID2) :- AuthorPublication(ID1, PubID), AuthorPublication(ID2, PubID).
"""

# Database credentials: host, port, database name, username, password
gg = GraphGenerator("localhost", "5432", "testgraphgen", "kostasx", "password")

# Generate the graph and serialize it in GML format
fname = gg.generateGraph(datalogQuery, "extracted_graph", GraphGenerator.GML)

# Load the graph into NetworkX
G = nx.read_gml(fname, 'id')
print("Graph Loaded into NetworkX! Running PageRank...")

# Run any algorithm on the graph using NetworkX
print(nx.pagerank(G))
print("Done!")

Native GraphGen in Java

// Database credentials: establish a connection to the database
GraphGenerator ggen = new GraphGenerator("host", "port", "dbName",
        "username", "password");

// Define the GraphGen query
String datalog_query = "...";

// Extract and load the graph (evaluates a single graph extraction query)
Graph g = ggen.generateGraph(datalog_query).get(0);

// Initialize the vertex-centric object
VertexCentric p = new VertexCentric(g);

// Define the vertex-centric program through its compute function
Executor program = new Executor("result_value_name") {
    @Override
    public void compute(Vertex v, VertexCentric p) {
        // implementation of compute function
    }
};

// Run the program
p.run(program);
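The slide leaves the body of the compute function open. As a rough illustration of the vertex-centric model in general, and not of GraphGen's Java API, the following Python sketch runs a compute function on every vertex in synchronous rounds until nothing changes. The names `run_vertex_centric` and `min_label`, the adjacency-dict graph format, and the choice of connected components by minimum-label propagation are all assumptions made for this example.

```python
def run_vertex_centric(adjacency, compute, initial, max_supersteps=50):
    """Apply `compute` to every vertex each superstep until no value changes."""
    values = dict(initial)
    for _ in range(max_supersteps):
        # Every vertex reads the values from the previous superstep (synchronous model).
        updated = {v: compute(v, values, adjacency[v]) for v in adjacency}
        if updated == values:
            break
        values = updated
    return values

def min_label(vertex, values, neighbors):
    """Compute function: adopt the smallest label among the vertex and its neighbors."""
    return min([values[vertex]] + [values[n] for n in neighbors])

# Undirected toy graph with two components: {1, 2, 3} and {4, 5}.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = run_vertex_centric(adjacency, min_label, {v: v for v in adjacency})
print(components)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Each vertex ends up labeled with the smallest vertex ID in its component. The point of the model is that the per-vertex function only looks at local state, which is what makes it straightforward for an engine to run the vertices in parallel.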

// Database credentials: establish a connection to the database
GraphGenerator ggen = new GraphGenerator("host", "port", "dbName",
        "username", "password");

// Define the GraphGen query
String datalog_query = "...";

// Extract and load the graph (evaluates a single graph extraction query)
Graph g = ggen.generateGraph(datalog_query).get(0);

// Use the full API to access the graph
for (Vertex v : g.getVertices()) {
    // For each neighbor
    for (Vertex neighbor : v.getVertices(Direction.OUT)) {
        // Do something
    }
}

GraphGen Back-End Architecture


GraphGen DSL

● Intuitive Domain Specific Language based on Datalog

● User needs to specify:
  ○ How the nodes are defined
  ○ How the edges are defined

● The query is executed, and the user gets a Graph object to operate upon.

● Very expressive: Allows for homogeneous and heterogeneous graphs with various types of nodes and edges.

TPC-H Database

Part(partKey, supplierKey, ...)
Customer(customerKey, customerName, ...)
Orders(orderKey, partKey, customerKey, ...)
Supplier(supplierKey, supplierName, ...)

● We want to explore a graph of customers!

● Using the GraphGen Language:
  ○ Which tables do we need to combine to extract the nodes and edges?

GraphGen DSL Example

Nodes(ID, Name) :- Customer(ID, Name).

● Creates a node out of each row in the Customer table
  ■ Customer ID and Name as properties

Edges(ID1, ID2) :- Orders(_,partKey, ID1), Orders(_,partKey, ID2).

● Connect ID1 -> ID2 if they have both ordered the same part
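One way to read the two rules is as ordinary SQL: the Nodes rule is a projection of Customer, and the shared partKey variable in the Edges rule acts as a self-join condition on Orders. The sketch below checks that reading with Python's built-in sqlite3 module on a few made-up rows. It is only an illustration and not necessarily the SQL that GraphGen generates; the sample data is hypothetical, and dropping self-pairs (ID1 = ID2) is an extra assumption made here for readability, since the rule as written would also match them.

```python
import sqlite3

# Toy stand-ins for the Customer and Orders tables (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (customerKey INTEGER PRIMARY KEY, customerName TEXT);
CREATE TABLE Orders (orderKey INTEGER PRIMARY KEY, partKey INTEGER, customerKey INTEGER);
INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO Orders VALUES (10, 100, 1), (11, 100, 2), (12, 200, 2), (13, 200, 3), (14, 300, 3);
""")

# Nodes(ID, Name) :- Customer(ID, Name).
nodes = conn.execute(
    "SELECT customerKey, customerName FROM Customer ORDER BY customerKey"
).fetchall()

# Edges(ID1, ID2) :- Orders(_, partKey, ID1), Orders(_, partKey, ID2).
# The repeated partKey variable becomes the join condition; self-pairs are filtered out.
edges = conn.execute("""
    SELECT DISTINCT o1.customerKey, o2.customerKey
    FROM Orders o1 JOIN Orders o2 ON o1.partKey = o2.partKey
    WHERE o1.customerKey <> o2.customerKey
    ORDER BY 1, 2
""").fetchall()

print(nodes)  # [(1, 'Ann'), (2, 'Bob'), (3, 'Cy')]
print(edges)  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```

Because the rule is symmetric in ID1 and ID2, every connection shows up in both directions, which matches the "ID1 -> ID2" reading on the slide.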

GraphGen

● Enable extraction of different types of hidden graphs

● Independent of where the data is stored (given SQL)

● Enable complex analytics over the extracted graphs

● Efficient extraction through various in-memory representations

● Efficient analysis through a parallel execution engine

● Effortless through a Declarative Language

● Eliminates the need for complex ETL

● Intuitive and swift analysis of any graph that exists in your data!

Download GraphGen at: konstantinosx.github.io/graphgen-project/

DDL Blog Post at: blog.districtdatalabs.com/graph-analytics-over-relational-datasets

Twitter: @kxirog


Thank you!