Graph Based Machine Learning on Relational Data Problems and Methods

Upload: benjamin-bengfort

Post on 18-Jul-2015


Page 1: Graph Based Machine Learning on Relational Data

Graph Based Machine Learning on Relational Data

Problems and Methods

Machine Learning using Graphs

- Machine learning is iterative, but iteration can also be seen as traversal.

- Many domains have structures already modeled as graphs (health records, finance).

- Important analyses are graph algorithms: clusters, influence propagation, centrality.

- Performance benefits on sparse data.

- More understandable implementation.

Iterative PageRank in Python

import numpy as np
from scipy.sparse import csc_matrix

def pageRank(G, s=.85, maxerr=.001):
    # G: square adjacency matrix, G[i, j] != 0 means i links to j
    # s: probability of following a link; maxerr: L1 convergence threshold
    n = G.shape[0]

    # transform G into markov matrix M by normalizing each row
    M = csc_matrix(G, dtype=float)
    rsums = np.array(M.sum(1))[:, 0]
    ri, ci = M.nonzero()
    M.data /= rsums[ri]

    sink = rsums == 0  # bool array of sink states

    # Compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        for i in range(n):
            Ii = np.array(M[:, i].todense())[:, 0]  # inlinks of state i
            Si = sink / float(n)  # account for sink states
            Ti = np.ones(n) / float(n)  # account for teleportation
            r[i] = ro.dot(Ii * s + Si * s + Ti * (1 - s))

    return r / sum(r)  # return normalized pagerank

Graph-Based PageRank in Gremlin

pagerank = [:].withDefault{0}
size = uris.size();
uris.each{
    count = it.outE.count();
    if(count == 0 || rand.nextDouble() > 0.85) {
        rank = pagerank[it]
        uris.each {
            pagerank[it] = pagerank[it] / uris.size()
        }
    }
    rank = pagerank[it] / it.outE.count();
    it.out.each{
        pagerank[it] = pagerank[it] + rank;
    }
}
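For comparison with the matrix version, the same computation can be phrased in Python as a walk over out-edges. This is a minimal sketch, not the presenters' code: the function name `pagerank_traversal`, the dict-of-lists graph representation, and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank_traversal(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank phrased as a walk over out-edges.

    `graph` maps each node to a list of nodes it links to (a hypothetical
    representation); every linked-to node must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # every node receives the teleportation share up front
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            out = graph[node]
            if out:
                # push this node's rank along each outgoing edge
                share = damping * rank[node] / len(out)
                for neighbor in out:
                    new_rank[neighbor] += share
            else:
                # sink: spread its rank evenly over all nodes
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_traversal(example)
```

Each pass touches only the edges that exist, which is where the benefit on sparse data would come from. In the example graph, node `c` collects rank from three in-links, while `d`, which has none, keeps only its teleportation share.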

Learning by Example

- Machine learning requires many instances with which to fit a model to make predictions.

- Current large-scale analytical methods (Pregel, Giraph, GraphLab) are in-memory, without data storage components.

- And while Neo4j, OrientDB, and Titan are OK...

- Most (active) data sits in relational databases, where users interact with it in real time via transactions in web applications.

Is it because relational data is a legacy system we must support?

Is it purely because of inertia?

NO! It’s because Relational Data is awesome!

Awesome sauce relational data of the future.

Requirements

- Ability to express queries/algorithms using a declarative, graph-domain-specific language like SQL, or at the very least via UDFs.

- Ability to explore and identify hidden or implicit graphs in the database.

- Combine in-memory analytics with some disk storage facility that is transactional.
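The second requirement, finding implicit graphs, can be illustrated with a many-to-many join table: a self-join turns shared membership into edges. The sketch below uses Python's built-in `sqlite3` module. The `authorship` table, its columns, and the sample rows are hypothetical, chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authorship (author TEXT, paper TEXT);
    INSERT INTO authorship VALUES
        ('ann', 'p1'), ('bob', 'p1'),
        ('bob', 'p2'), ('cat', 'p2'),
        ('ann', 'p3'), ('bob', 'p3');
""")

# Two authors are connected if they share a paper; the weight is the
# number of shared papers. The a.author < b.author filter keeps each
# undirected edge once and drops self-loops.
coauthor_edges = conn.execute("""
    SELECT a.author, b.author, COUNT(*) AS weight
    FROM authorship AS a
    JOIN authorship AS b
      ON a.paper = b.paper AND a.author < b.author
    GROUP BY a.author, b.author
    ORDER BY a.author, b.author
""").fetchall()
```

No table in this schema is named as a graph, yet the query yields a weighted co-authorship edge list that could be handed to any graph algorithm.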

Approach 1: ETL Methods

[Diagram: ETL pipeline with the stages extract, transform, load, synchronize, and analyze, shown at t = 0 and t > 0]

The Good

- Processing is not physical-layer dependent.

- Relational data storage with real-time interaction.

- Analytics can scale in size to Hadoop or in speed to in-memory computation frameworks.

The Bad

- Must know the structure of the graph in the relational database ahead of time; no exploration.

- Synchronization can cause inconsistency.

- OLAP processes incur a resource penalty (I/O or CPU, depending on location).
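A minimal sketch of the ETL flow, assuming a hypothetical `follows(follower, followee)` table in SQLite: edges are extracted with SQL, transformed into an in-memory adjacency map, and analyzed there. All function and table names are invented for illustration. The last lines show the synchronization problem: the in-memory copy is a snapshot, so later transactions are invisible until the data is extracted again.

```python
import sqlite3
from collections import defaultdict

def extract_edges(conn):
    """Extract: pull the edge list out of the relational store."""
    return conn.execute("SELECT follower, followee FROM follows").fetchall()

def build_adjacency(edges):
    """Transform/load: build an in-memory adjacency map."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst]  # touching the key registers nodes with no out-edges
    return adjacency

def in_degree(adjacency):
    """Analyze: count incoming edges per node, a simple centrality measure."""
    counts = {node: 0 for node in adjacency}
    for targets in adjacency.values():
        for dst in targets:
            counts[dst] += 1
    return counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (follower TEXT, followee TEXT)")
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [("ann", "bob"), ("cat", "bob"), ("bob", "ann")])

snapshot = build_adjacency(extract_edges(conn))  # state at t = 0
degrees = in_degree(snapshot)

# A later transaction is not reflected in the snapshot until re-extraction.
conn.execute("INSERT INTO follows VALUES ('dan', 'bob')")
stale = in_degree(snapshot)
fresh = in_degree(build_adjacency(extract_edges(conn)))
```

Here `stale` still reports two followers for `bob` while `fresh` reports three. That gap is the kind of inconsistency the synchronization step has to close.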

Approach 2: Store Graph in RDBMS

The Good

- Can utilize relational devices like indices and parallel joins for graph-specific queries on existing data.

- Simply use SQL for the data access mechanism.

- Transactional storage of the data.

The Bad

- Constrained to a graph-specific schema.

- Many joins required for traversal.

- Depending on storage mechanisms, there may be too few or too many tables in the database for applications.

- Must convert the existing database to this structure.
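A minimal sketch of this approach, assuming a hypothetical graph-specific schema of `nodes` and `edges` tables in SQLite. Each hop of a traversal is another join against `edges`; a recursive common table expression expresses that repeated join in one SQL statement. The table names, the index, and the helper `reachable_from` are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (
        src INTEGER REFERENCES nodes(id),
        dst INTEGER REFERENCES nodes(id)
    );
    CREATE INDEX edges_src ON edges (src);  -- relational index speeds each hop
    INSERT INTO nodes VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
    INSERT INTO edges VALUES (1, 2), (2, 3), (3, 1), (3, 4);
""")

def reachable_from(conn, start):
    """Return labels of all nodes reachable from `start`, including itself."""
    rows = conn.execute("""
        WITH RECURSIVE reach(id) AS (
            SELECT ?
            UNION                      -- UNION (not UNION ALL) stops on cycles
            SELECT edges.dst
            FROM edges JOIN reach ON edges.src = reach.id
        )
        SELECT nodes.label
        FROM nodes JOIN reach ON nodes.id = reach.id
        ORDER BY nodes.label
    """, (start,)).fetchall()
    return [label for (label,) in rows]

from_a = reachable_from(conn, 1)
from_d = reachable_from(conn, 4)
```

The data stays transactional and is queried with plain SQL, but an existing application schema would first have to be converted into the `nodes`/`edges` shape.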

Approach 3: Use Graph Query Language

[Diagram: components labeled API, Graph DSL Query, Query Translator, SQL Queries, Optimizer, Final SQL Queries, and Query Result]

The Good

- DSL in the graph domain that easily expresses graph analytics but also relational semantics.

- Can use existing relational schemas; allows for exploration and identification of graphs.

- Computation is offloaded into in-memory processing.

The Bad

- Many graphs or big graphs can cause too many joins without optimal query translation.

- User must help define how the relational structure maps to a graph representation.

- May not leverage relational resources.
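The translation step can be sketched with a toy example. Everything here is hypothetical: the mapping format, the function name `translate_neighbors`, and the `enrollment` table are invented to illustrate how a graph-style question over an existing relational schema might be rewritten into SQL. It is not the presenters' DSL or any real system's API.

```python
import sqlite3

# Hypothetical user-supplied mapping: which table and columns define edges.
EDGE_MAPPING = {
    "table": "enrollment",
    "node_column": "student",   # graph nodes are students
    "link_column": "course",    # two students are linked by a shared course
}

def translate_neighbors(mapping):
    """Translate 'neighbors of node X' into a parameterized SQL query.

    Identifiers come from the trusted mapping above, never from user input.
    """
    table = mapping["table"]
    node = mapping["node_column"]
    link = mapping["link_column"]
    return (
        f"SELECT DISTINCT b.{node} "
        f"FROM {table} AS a JOIN {table} AS b ON a.{link} = b.{link} "
        f"WHERE a.{node} = ? AND b.{node} <> a.{node} "
        f"ORDER BY b.{node}"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (student TEXT, course TEXT)")
conn.executemany("INSERT INTO enrollment VALUES (?, ?)",
                 [("ann", "ml"), ("bob", "ml"), ("bob", "db"), ("cat", "db")])

sql = translate_neighbors(EDGE_MAPPING)
bob_neighbors = [row[0] for row in conn.execute(sql, ("bob",))]
ann_neighbors = [row[0] for row in conn.execute(sql, ("ann",))]
```

The existing schema is left untouched, but the user still has to supply the mapping, and each graph hop becomes a self-join. A multi-hop query translated this naively would stack joins quickly, which is where an optimizer would be needed.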

Any Questions?

Thank you!

Presented By:

Konstantinos Xirogiannopoulos <[email protected]>

Benjamin Bengfort <[email protected]>

May 7, 2015