graph based machine learning on relational data

Post on 18-Jul-2015


Graph Based Machine Learning on Relational Data

Problems and Methods

Machine Learning using Graphs

- Machine learning is iterative, but iteration can also be seen as traversal.

- Many domains have structures already modeled as graphs (health records, finance).

- Important analyses are graph algorithms: clusters, influence propagation, centrality.

- Performance benefits on sparse data.

- More understandable implementation.
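The first and fourth bullets can be illustrated with a minimal sketch. It assumes a hypothetical four-edge toy graph (the names `edges`, `propagate_dense`, and `propagate_traversal` are illustrative, not from the talk) and shows one update step written two ways: as a loop over every cell of a dense matrix, and as a traversal that visits only the edges that exist.

```python
from collections import defaultdict

# Hypothetical toy graph: directed edges (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
nodes = sorted({n for e in edges for n in e})


def propagate_dense(values):
    """One update step as a matrix-vector product over all n * n cells."""
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        matrix[index[dst]][index[src]] = 1.0
    return {
        dst: sum(matrix[index[dst]][index[src]] * values[src] for src in nodes)
        for dst in nodes
    }


def propagate_traversal(values):
    """The same update as a traversal: only existing edges are visited."""
    result = defaultdict(float)
    for src, dst in edges:
        result[dst] += values[src]
    return {n: result[n] for n in nodes}


start = {"a": 1.0, "b": 2.0, "c": 3.0}
dense_step = propagate_dense(start)
traversal_step = propagate_traversal(start)
```

Both functions return the same values, but the dense version touches nine cells while the traversal touches four edges; on sparse data that gap grows with the number of nodes.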

Iterative PageRank in Python

import numpy as np
from scipy.sparse import csc_matrix

def pageRank(G, s=.85, maxerr=.001):
    """PageRank of adjacency matrix G with damping s; stops when the L1 change drops below maxerr."""
    n = G.shape[0]
    # transform G into markov matrix M (divide each entry by its row sum)
    M = csc_matrix(G, dtype=float)  # np.float was removed from NumPy
    rsums = np.array(M.sum(1))[:, 0]
    ri, ci = M.nonzero()
    M.data /= rsums[ri]
    sink = rsums == 0  # bool array of sink states
    # compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        for i in range(n):  # xrange in the Python 2 original
            Ii = np.array(M[:, i].todense())[:, 0]  # inlinks of state i
            Si = sink / float(n)  # account for sink states
            Ti = np.ones(n) / float(n)  # account for teleportation
            r[i] = ro.dot(Ii * s + Si * s + Ti * (1 - s))
    return r / sum(r)  # return normalized pagerank

Graph-Based PageRank in Gremlin

pagerank = [:].withDefault{0}
size = uris.size();
uris.each{
    count = it.outE.count();
    if(count == 0 || rand.nextDouble() > 0.85) {
        rank = pagerank[it]
        uris.each {
            pagerank[it] = pagerank[it] / uris.size()
        }
    }
    rank = pagerank[it] / it.outE.count();
    it.out.each{
        pagerank[it] = pagerank[it] + rank;
    }
}
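For readers who do not use Gremlin, the traversal style can also be sketched in plain Python. This is not a port of the snippet above: it swaps the random-surfer sampling for a deterministic power iteration over an adjacency list, and the function name, parameters, and the `web` graph are all assumptions made for illustration.

```python
def pagerank_by_traversal(out_links, damping=0.85, max_err=1e-6, max_iter=100):
    """PageRank by walking out-edges; every target must also be a key of out_links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by sink nodes (no out-edges) is spread evenly over all nodes.
        sink_mass = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * sink_mass / n
        new_rank = {node: base for node in nodes}
        # Traversal step: each node pushes its damped rank along its out-edges.
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        err = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if err < max_err:
            break
    return rank


# Hypothetical link graph: node -> list of nodes it links to.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_by_traversal(web)
```

The inner loop only touches edges that exist, which is the point the slides make about traversal: the work per iteration is proportional to the number of edges rather than to the square of the number of nodes.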

Learning by Example

- Machine learning requires many instances on which to fit a model that makes predictions.

- Current large-scale analytical methods (Pregel, Giraph, GraphLab) are in-memory, without data storage components.

- And while Neo4j, OrientDB, and Titan are OK...

- Most (active) data sits in relational databases, where users interact with it in real time via transactions in web applications.

Is it because relational data is a legacy system we must support?

Is it purely because of inertia?

NO! It’s because Relational Data is awesome!

Awesome sauce relational data of the future.

Requirements

- Ability to express queries/algorithms using a declarative, graph-domain-specific language like SQL, or at the very least via UDFs.

- Ability to explore and identify hidden or implicit graphs in the database.

- Combine in-memory analytics with some disk storage facility that is transactional.
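To make "hidden or implicit graphs" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `visits` table and its rows are invented for illustration: no table stores doctor-to-doctor edges, yet a self-join exposes a graph in which two doctors are connected when they share a patient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical relational data: which patient saw which doctor.
conn.executescript("""
    CREATE TABLE visits (patient TEXT, doctor TEXT);
    INSERT INTO visits VALUES
        ('p1', 'dr_a'), ('p1', 'dr_b'),
        ('p2', 'dr_b'), ('p2', 'dr_c'),
        ('p3', 'dr_c');
""")

# Self-join: an implicit edge links two doctors who have seen the same patient.
# The v1.doctor < v2.doctor condition keeps each undirected edge only once.
implicit_edges = conn.execute("""
    SELECT v1.doctor, v2.doctor, COUNT(*) AS shared_patients
    FROM visits AS v1
    JOIN visits AS v2
      ON v1.patient = v2.patient AND v1.doctor < v2.doctor
    GROUP BY v1.doctor, v2.doctor
    ORDER BY v1.doctor, v2.doctor
""").fetchall()
```

Here the graph had to be written by hand as a join; the second requirement asks for tooling that helps find such structures without knowing them ahead of time.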

Approach 1: ETL Methods

[Figure: ETL timeline from t = 0 to t > 0 with the stages extract, transform, load, synchronize, analyze.]

The Good:

- Processing does not depend on the physical storage layer.

- Relational data storage with real-time interaction.

- Analytics can scale in size to Hadoop or in speed to in-memory computation frameworks.

The Bad:

- Must know the structure of the graph in the relational database ahead of time; no exploration.

- Synchronization can cause inconsistency.

- OLAP processes incur a resource penalty (I/O or CPU, depending on location).
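The ETL flow and its synchronization risk can be sketched as follows. The `follows` table, the helper names, and the data are assumptions for illustration; the in-memory adjacency list stands in for whatever analytics framework receives the extracted graph.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical transactional table whose rows happen to be graph edges.
conn.executescript("""
    CREATE TABLE follows (follower TEXT, followee TEXT);
    INSERT INTO follows VALUES ('ann', 'bob'), ('bob', 'cat'), ('cat', 'ann');
""")


def extract_transform_load(connection):
    """Extract edge rows and load them into an in-memory adjacency list."""
    graph = defaultdict(set)
    rows = connection.execute("SELECT follower, followee FROM follows")
    for follower, followee in rows:
        graph[follower].add(followee)
    return graph


def in_degree(graph):
    """Analyze: count incoming edges per node."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


snapshot = extract_transform_load(conn)                     # t = 0
conn.execute("INSERT INTO follows VALUES ('dan', 'ann')")   # t > 0: new transaction
stale = in_degree(snapshot)                                 # analysis on the old copy
fresh = in_degree(extract_transform_load(conn))             # synchronize, then analyze
```

Until the copy is synchronized, `stale` and `fresh` disagree about `ann`, which is the inconsistency listed under "The Bad". Note also that the extraction query had to name the edge table and columns up front, so the graph structure must be known before any analysis starts.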


Approach 2: Store Graph in RDBMS

The Good:

- Can utilize relational devices like indices and parallel joins for graph-specific queries on existing data.

- Simply use SQL for the data access mechanism.

- Transactional storage of the data.

The Bad:

- Constrained to graph-specific schema.

- Many joins required for traversal.

- Depending on storage mechanisms there may be too few or too many tables in the database for applications.

- Must convert existing database to this structure.
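A minimal sketch of this approach, again with `sqlite3`: the `nodes` and `edges` tables are one common graph-specific schema, chosen here as an assumption rather than taken from the talk. It shows both the convenience (plain SQL, an index on the source column) and the cost (one self-join per hop, or a recursive query for open-ended traversal).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical graph-specific schema: one table for nodes, one for edges.
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (src INTEGER REFERENCES nodes(id),
                        dst INTEGER REFERENCES nodes(id));
    CREATE INDEX edges_src ON edges (src);
    INSERT INTO nodes VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4);
""")

# A two-hop traversal from node 1 needs one self-join; k hops need k - 1 joins.
two_hops = conn.execute("""
    SELECT e2.dst
    FROM edges AS e1
    JOIN edges AS e2 ON e1.dst = e2.src
    WHERE e1.src = 1
""").fetchall()

# Open-ended reachability from node 1 via a recursive common table expression.
reachable = conn.execute("""
    WITH RECURSIVE walk(id) AS (
        SELECT dst FROM edges WHERE src = 1
        UNION
        SELECT edges.dst FROM edges JOIN walk ON edges.src = walk.id
    )
    SELECT id FROM walk ORDER BY id
""").fetchall()
```

The queries are short, but the application's existing tables would first have to be converted into `nodes` and `edges`, which is the last drawback in the list.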


Approach 3: Use Graph Query Language

[Figure: query pipeline. Components: API, Query Translator, Optimizer. Labels: Graph DSL Query, SQL Queries, Final SQL Queries, Query Result.]

The Good:

- DSL in the graph domain that easily expresses graph analytics but also relational semantics.

- Can use existing relational schemas; allows for exploration and identification of graphs.

- Computation is offloaded into in-memory processing.

The Bad:

- Many graphs or big graphs can cause too many joins without optimal query translation.

- User is required to facilitate definition of relational structure into a graph representation.

- May not leverage relational resources.
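The translation step can be sketched with a toy translator. Everything here is hypothetical: `EdgeDefinition` stands in for the user-supplied mapping from relational structure to graph edges, and `translate_path` turns the single query "follow this edge k times from a start node" into SQL. It is far simpler than a real graph DSL and performs no optimization.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeDefinition:
    """Hypothetical user-supplied mapping from an existing table to graph edges."""
    table: str
    src_column: str
    dst_column: str


def translate_path(edge: EdgeDefinition, hops: int) -> str:
    """Translate a k-hop path query into SQL; identifiers must come from trusted input."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    sql = [
        f"SELECT DISTINCT e{hops}.{edge.dst_column}",
        f"FROM {edge.table} AS e1",
    ]
    # Naive translation: every additional hop becomes another self-join.
    for i in range(2, hops + 1):
        sql.append(
            f"JOIN {edge.table} AS e{i} "
            f"ON e{i - 1}.{edge.dst_column} = e{i}.{edge.src_column}"
        )
    sql.append(f"WHERE e1.{edge.src_column} = ?")
    return "\n".join(sql)


conn = sqlite3.connect(":memory:")
# Existing relational schema, used as-is without conversion.
conn.executescript("""
    CREATE TABLE citations (citing TEXT, cited TEXT);
    INSERT INTO citations VALUES ('p1', 'p2'), ('p2', 'p3'), ('p3', 'p4');
""")
cites = EdgeDefinition(table="citations", src_column="citing", dst_column="cited")
query = translate_path(cites, hops=3)
result = conn.execute(query, ("p1",)).fetchall()
```

The generated SQL for three hops already contains two joins, which shows why long paths or many graphs lead to join-heavy queries unless the translator or optimizer is smarter than this sketch.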


Any Questions?

Thank you!

Presented By:

Konstantinos Xirogiannopoulos <kostasx@cs.umd.edu>

Benjamin Bengfort <bengfort@cs.umd.edu>

May 7, 2015
