graph processing applications @ hug

Graph Processing Applications

www.thecloudavenue.com

@praveensripati

Agenda

Introduction to Graphs

Representing graphs

Different types of graphs

Algorithms in graphs

What constitutes a graph application

Graph databases (examples and how they work)

Graph computing engines (examples and how they work)

Questions & Answers

What are/aren't Graphs in this context?

[Figure: two examples, one labelled NO and one labelled YES; the vertices are numbered 1 to 6]

How is a graph represented?

A collection of vertices connected to each other using edges, with both vertices and edges having properties. A vertex can be a person, place, account or any item which needs to be tracked.

Vertex

Edge

A social graph

[Figure: a social graph. Each vertex is a person (Arun, Tom, Sheetal, Prajval, Deepak, Bob) and carries properties such as Name: Arun, Age: 25, Sex: M. Each edge is labelled Friend, Colleague or Relative and carries properties such as Relation: Colleague.]

Who are Arun's friends?

Whom should I recommend Sheetal to be friends with?
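Both questions can be answered by walking the edges of the graph. The following is a minimal sketch, not the method of any particular product: it ranks people who are not yet friends by the number of friends they share. The class name `FriendRecommender` and the friendships in `sample()` are assumptions made for illustration, since the edges of the slide's figure are not recoverable from this text.

```java
import java.util.*;

class FriendRecommender {
    // Undirected friendship graph: person -> set of friends.
    private final Map<String, Set<String>> friends = new HashMap<>();

    void addFriendship(String a, String b) {
        friends.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    Set<String> friendsOf(String person) {
        return friends.getOrDefault(person, Collections.emptySet());
    }

    // Ranks friends-of-friends by how many friends they share with `person`.
    List<String> recommend(String person) {
        Map<String, Integer> common = new TreeMap<>();
        for (String friend : friendsOf(person)) {
            for (String candidate : friendsOf(friend)) {
                if (!candidate.equals(person) && !friendsOf(person).contains(candidate)) {
                    common.merge(candidate, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(common.keySet());
        // Most common friends first; the sort is stable, so ties stay alphabetical.
        ranked.sort((x, y) -> common.get(y) - common.get(x));
        return ranked;
    }

    // Hypothetical friendships, chosen only to make the example concrete.
    static FriendRecommender sample() {
        FriendRecommender g = new FriendRecommender();
        g.addFriendship("Arun", "Tom");
        g.addFriendship("Arun", "Sheetal");
        g.addFriendship("Arun", "Deepak");
        g.addFriendship("Tom", "Sheetal");
        g.addFriendship("Tom", "Deepak");
        g.addFriendship("Tom", "Bob");
        g.addFriendship("Deepak", "Prajval");
        return g;
    }

    public static void main(String[] args) {
        FriendRecommender g = sample();
        System.out.println("Arun's friends: " + g.friendsOf("Arun"));
        System.out.println("Suggestions for Sheetal: " + g.recommend("Sheetal"));
    }
}
```

With these assumed edges, Deepak is suggested ahead of Bob because Sheetal shares two friends with Deepak (Arun and Tom) and only one with Bob.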

Facebook Recruiting Competition

Want an interview @ Facebook?

The challenge is to recommend missing links in a social network. Participants will be presented with an external anonymized, directed social graph (no, not Facebook, keep guessing) from which some edges have been deleted, and asked to make ranked predictions for each user in the test set of which other users they would want to follow.

What is Kaggle?

Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems - the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

http://www.kaggle.com/c/FacebookRecruiting

A spatial graph

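[Figure: a spatial graph. Each vertex is a city (Bangalore, Mumbai, Kolkata (Calcutta), Chennai, New Delhi, Lucknow) and carries properties such as Name: Bangalore, Population: 25,00,000, Area: 35,000 sq km. Each edge carries a distance property, for example Distance: 700 km; the other edge labels in the figure are 250, 350, 450, 450, 600, 800 and 850 km.]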

What is the shortest distance between Bangalore and Calcutta?

I would like to cover all the places; which is the shortest path?
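The first question is a single-source shortest-path problem. Below is a minimal sketch using Dijkstra's algorithm, which is one standard choice and not necessarily what the speaker had in mind. The class name `ShortestDistance` is made up, and the roads in `sample()` are hypothetical: they reuse distances shown in the figure, but which pair of cities each distance belongs to is an assumption. The second question (visiting every place) is a different and much harder problem and is not covered here.

```java
import java.util.*;

class ShortestDistance {
    // Undirected weighted graph: city -> (neighbouring city -> distance in km).
    private final Map<String, Map<String, Integer>> roads = new HashMap<>();

    // One queue entry: a city and the distance at which it was reached.
    private static final class Hop {
        final String city;
        final int km;
        Hop(String city, int km) { this.city = city; this.km = km; }
    }

    void addRoad(String a, String b, int km) {
        roads.computeIfAbsent(a, k -> new HashMap<>()).put(b, km);
        roads.computeIfAbsent(b, k -> new HashMap<>()).put(a, km);
    }

    // Dijkstra's algorithm; returns -1 when `to` cannot be reached from `from`.
    int shortest(String from, String to) {
        Map<String, Integer> best = new HashMap<>();
        PriorityQueue<Hop> queue = new PriorityQueue<>(Comparator.comparingInt((Hop h) -> h.km));
        best.put(from, 0);
        queue.add(new Hop(from, 0));
        while (!queue.isEmpty()) {
            Hop current = queue.poll();
            if (current.km > best.get(current.city)) continue; // stale queue entry
            if (current.city.equals(to)) return current.km;
            Map<String, Integer> next = roads.getOrDefault(current.city, Collections.emptyMap());
            for (Map.Entry<String, Integer> road : next.entrySet()) {
                int candidate = current.km + road.getValue();
                if (candidate < best.getOrDefault(road.getKey(), Integer.MAX_VALUE)) {
                    best.put(road.getKey(), candidate);
                    queue.add(new Hop(road.getKey(), candidate));
                }
            }
        }
        return -1;
    }

    // Hypothetical road network; the edge assignments are assumptions.
    static ShortestDistance sample() {
        ShortestDistance g = new ShortestDistance();
        g.addRoad("Bangalore", "Chennai", 350);
        g.addRoad("Bangalore", "Mumbai", 850);
        g.addRoad("Chennai", "Mumbai", 250);
        g.addRoad("Chennai", "Kolkata", 800);
        g.addRoad("Mumbai", "Kolkata", 450);
        g.addRoad("Mumbai", "New Delhi", 700);
        g.addRoad("New Delhi", "Lucknow", 450);
        g.addRoad("Lucknow", "Kolkata", 600);
        return g;
    }

    public static void main(String[] args) {
        System.out.println(sample().shortest("Bangalore", "Kolkata") + " km");
    }
}
```

In this assumed network the direct-looking route through Chennai (350 + 800 = 1150 km) loses to Bangalore, Chennai, Mumbai, Kolkata (350 + 250 + 450 = 1050 km), which is the kind of answer that is hard to spot by eye on a larger map.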

How to represent a Graph for computing?

[Figure: a directed graph with vertices 1 to 6]

.... as an adjacency list for sparse graph

1 -> 2, 4, 5
2 -> 3
3 -> 5
4 -> 3, 6
5 ->
6 -> 5

.... as an adjacency matrix for dense graph

    1  2  3  4  5  6
1   0  1  0  1  1  0
2   0  0  1  0  0  0
3   0  0  0  0  1  0
4   0  0  1  0  0  1
5   0  0  0  0  0  0
6   0  0  0  0  1  0

A graph with few edges is sparse; a graph with many edges is dense.

Obviously, the web, with billions of pages, cannot be represented as an adjacency matrix.
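The two representations can be built side by side from the same edge set. This is a minimal sketch using the edges of the adjacency list shown earlier (1 -> 2, 4, 5; 2 -> 3; 3 -> 5; 4 -> 3, 6; 6 -> 5); the class and method names are made up for illustration.

```java
import java.util.*;

class GraphRepresentations {
    static final int VERTICES = 6;

    // Directed edges of the six-vertex example, as {from, to}.
    static final int[][] EDGES = {
        {1, 2}, {1, 4}, {1, 5}, {2, 3}, {3, 5}, {4, 3}, {4, 6}, {6, 5}
    };

    // Adjacency list: storage grows with the number of edges.
    static Map<Integer, List<Integer>> adjacencyList() {
        Map<Integer, List<Integer>> list = new TreeMap<>();
        for (int v = 1; v <= VERTICES; v++) list.put(v, new ArrayList<>());
        for (int[] e : EDGES) list.get(e[0]).add(e[1]);
        return list;
    }

    // Adjacency matrix: always VERTICES x VERTICES cells, however few edges exist.
    static int[][] adjacencyMatrix() {
        int[][] m = new int[VERTICES][VERTICES];
        for (int[] e : EDGES) m[e[0] - 1][e[1] - 1] = 1;
        return m;
    }

    public static void main(String[] args) {
        adjacencyList().forEach((v, out) -> System.out.println(v + " -> " + out));
        for (int[] row : adjacencyMatrix()) System.out.println(Arrays.toString(row));
    }
}
```

Here the list stores 8 entries while the matrix stores 36 cells. The matrix grows with the square of the vertex count, so a graph with a billion vertices would need on the order of 10^18 cells, almost all of them zero.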

Different Graphs

Social graph (Facebook, LinkedIn etc)

Spatial graph (Google Maps, MapQuest, FedEx etc)

Web graph (PageRank, Recommendations etc)

Computer network graph (Optimal network layout etc)

Financial graph (Fraud detection, Currency Flow etc)

Data representations (Lists etc)

Chemistry (to represent genomes/molecules)

And others

Some of the Graph Algorithms

Shortest path (Finding the shortest path from A to B)

Minimum Spanning Tree (Cheapest way to connect objects so that every object can be reached from every other; can be used in internet, cable wiring etc)

Graph center (placing a warehouse, hospital in a city, so that all the locations can be reached easily)

Bipartite Matching (Matching in a dating site, job to employee and others)

Finding Planar Graph (as in the case of circuit designs).

http://www.graph-magics.com/practic_use.php
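As one worked instance from the list above, a minimum spanning tree can be computed with Kruskal's algorithm: sort the edges by cost and keep every edge that joins two components that are still separate. The sketch below is an illustration only; the class name and the cabling costs in `main` are hypothetical.

```java
import java.util.*;

class MinimumSpanningTree {
    // Union-find lookup with path halving.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Kruskal's algorithm. Each edge is {u, v, cost}; vertices are 0..n-1.
    // Returns the total cost of the tree, or -1 if the graph is not connected.
    static int totalCost(int n, int[][] edges) {
        int[][] sorted = edges.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] e) -> e[2]));
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        int cost = 0;
        int used = 0;
        for (int[] e : sorted) {
            int a = find(parent, e[0]);
            int b = find(parent, e[1]);
            if (a != b) {          // the edge joins two separate components
                parent[a] = b;
                cost += e[2];
                used++;
            }
        }
        return used == n - 1 ? cost : -1;
    }

    public static void main(String[] args) {
        // Hypothetical cabling costs between four sites.
        int[][] cables = { {0, 1, 4}, {0, 2, 3}, {1, 2, 1}, {1, 3, 2}, {2, 3, 5} };
        System.out.println(totalCost(4, cables)); // prints 6
    }
}
```

For the sample costs the tree keeps the edges of cost 1, 2 and 3 and skips the other two, because by then all four sites are already connected.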

Graph Applications

[Figure: Applications; Graph processing frameworks (Giraph, Hama); Graph Databases]

How to store a Graph?

Option 1 : In a flat file as

1- 4,5,6
4- 2,5,6

Where vertex 1 is connected to vertex 4,5,6 and so on

Option 2 : In a relational database using referencing tables or join tables.

Option 3 : Using a specialized database designed specifically for graphs.

Simple, but not efficient and easy to maintain.
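To make Option 1 concrete, here is a minimal sketch of a reader for the flat-file format shown above (`1- 4,5,6`). The class name `FlatFileGraph` and the decision to accept a vertex with no neighbours (`5-`) are assumptions; the slide does not specify the format beyond the two sample lines.

```java
import java.util.*;

class FlatFileGraph {
    // Parses lines such as "1- 4,5,6" into vertex -> adjacent vertices.
    static Map<Integer, List<Integer>> parse(List<String> lines) {
        Map<Integer, List<Integer>> graph = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;      // skip blank lines
            String[] parts = line.split("-", 2);      // "1" and " 4,5,6"
            int vertex = Integer.parseInt(parts[0].trim());
            List<Integer> neighbours = new ArrayList<>();
            if (parts.length == 2) {
                for (String token : parts[1].split(",")) {
                    if (!token.trim().isEmpty()) {
                        neighbours.add(Integer.parseInt(token.trim()));
                    }
                }
            }
            graph.put(vertex, neighbours);
        }
        return graph;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("1- 4,5,6", "4- 2,5,6")));
    }
}
```

The reader is only a few lines long, which is the "simple" part. The cost shows up at query time: answering even one neighbour lookup means scanning and parsing the file, or loading all of it into memory first.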

Comparing Graph with Relational DB

Depth   Execution Time (s), MySQL   Execution Time (s), Neo4j
2       0.016                       0.010
3       30.267                      0.168
4       1,543.505                   1.359
5       Not Finished in 1 Hour      2.132

http://www.neotechnology.com/2012/06/how-much-faster-is-a-graph-database-really/

In a DB of 1,000,000 users, finding friends-of-friends for 1,000 users at various depths.

Which one would you prefer for storing Graph data?

So, what is a Graph DB?

A graph database is any storage system that provides `index free adjacency`.

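[Figure: the graph with vertices 1 to 6, each vertex listed with its adjacent vertices: 1 -> 2, 4, 5; 2 -> 3; 3 -> 5; 4 -> 3, 6; 6 -> 5]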

Every element (node or edge) has a direct pointer to its adjacent element.

No index lookup : We can determine which vertex is adjacent to which other vertex without looking up an index tree.
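A rough in-memory analogy of index-free adjacency is an object graph in which every vertex holds direct references to its neighbours, so a traversal only follows references and never consults a global index. The sketch below illustrates that idea only; it does not show how Neo4j or any other graph database lays out its storage, and the class and method names are made up. It uses the six-vertex example graph from the adjacency list (1 -> 2, 4, 5; 2 -> 3; 3 -> 5; 4 -> 3, 6; 6 -> 5).

```java
import java.util.*;

class IndexFreeAdjacency {
    static final class Vertex {
        final int id;
        final List<Vertex> out = new ArrayList<>(); // direct references, no index
        Vertex(int id) { this.id = id; }
    }

    // Ids of the vertices reachable from `start` in at most `depth` hops.
    static Set<Integer> within(Vertex start, int depth) {
        Set<Integer> seen = new TreeSet<>();
        seen.add(start.id);
        List<Vertex> frontier = new ArrayList<>();
        frontier.add(start);
        for (int hop = 0; hop < depth; hop++) {
            List<Vertex> next = new ArrayList<>();
            for (Vertex v : frontier) {
                for (Vertex w : v.out) {             // follow the pointer
                    if (seen.add(w.id)) next.add(w); // first visit only
                }
            }
            frontier = next;
        }
        seen.remove(start.id);
        return seen;
    }

    // Builds the six-vertex example and returns vertex 1.
    // The array is used only while wiring the vertices together.
    static Vertex sample() {
        Vertex[] v = new Vertex[7];
        for (int i = 1; i <= 6; i++) v[i] = new Vertex(i);
        int[][] edges = { {1, 2}, {1, 4}, {1, 5}, {2, 3}, {3, 5}, {4, 3}, {4, 6}, {6, 5} };
        for (int[] e : edges) v[e[0]].out.add(v[e[1]]);
        return v[1];
    }

    public static void main(String[] args) {
        System.out.println(within(sample(), 1)); // [2, 4, 5]
        System.out.println(within(sample(), 2)); // [2, 3, 4, 5, 6]
    }
}
```

The work done per hop depends on the number of edges actually followed, not on the total size of the graph.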

So, what is a Graph DB? (.....)

Graph DB is the option when persisting graphs.

So, what is a Graph DB? (.....)

[Figure: NoSQL stores plotted with Data Size on one axis and Data Complexity on the other]

Key Value Store like Amazon Dynamo.

Columnar Databases like Cassandra, HBase.

Document Databases like MongoDB, CouchDB.

Graph Databases like Neo4j

Part of the NoSQL family

Graph DB Bindings (~JDBC API)

// connect to the database
// begin transaction

Node firstNode;
Node secondNode;
Relationship relationship;

firstNode = graphDb.createNode();
firstNode.setProperty( "message", "Hello, " );
secondNode = graphDb.createNode();
secondNode.setProperty( "message", "World!" );

relationship = firstNode.createRelationshipTo( secondNode, RelTypes.KNOWS );
relationship.setProperty( "message", "brave Neo4j " );

// end the transaction
// close the connection to the database

http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded-hello-world.html

Graph Adhoc Query (~SQL)

http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html

START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof

john                    fof
Node[4]{name:"John"}    Node[2]{name:"Maria"}
Node[4]{name:"John"}    Node[3]{name:"Steve"}

Different Graph Databases

http://en.wikipedia.org/wiki/Graph_database

FlockDB from Twitter

GraphBase

AllegroGraph

InfiniteGraph from Objectivity

What is a Graph Computing Engine?

[Figure: a Graph Computing Engine reads from an InputFormat and input location, runs Algorithms, and writes to an OutputFormat and output location]

Graph engines come with some built-in graph processing algorithms, but also provide an easy to use API to build new algorithms and extend the framework.

http://incubator.apache.org/giraph/apidocs/index.html
http://incubator.apache.org/hama/docs/r0.3.0/api/index.html

Different Graph Computing Engines

Memory based graphs (graph size < local machine RAM)
- jung.sourceforge.net
- igraph.sourceforge.net
- networkx.lanl.gov

Disk based graphs (graph size < local hard disk size)
- Neo4j
- Infinite Graph (objectivity.com)
- sparsity-technologies.com/dex

Cluster based graphs (depends on the cluster specs)
- Apache Hama
- Apache Giraph
- GoldenORB

Based on the BSP (Bulk Synchronous Parallel) model, in the spirit of Google Pregel

Bulk Synchronous Parallel

Some quick facts

An alternate computing model to MapReduce (Not all problems can be solved with MapReduce efficiently). Also, any MR algorithm can be simulated on BSP and vice versa.

Developed by Leslie Valiant during the 1980s. Was resurrected by Google in the Pregel paper (extensively used for PageRank).

Good for

- Processing big data with complicated relationships, e.g. graphs and networks.
- Iterative and recursive scientific computations
- Continuous Event Processing (CEP)

http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
http://arxiv.org/abs/1203.2081 (comparing MR vs BSP)

What is Bulk Synchronous Parallel?

[Figure: three consecutive supersteps: Super Step 1, Super Step 2, Super Step 3]

http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
http://blog.octo.com/en/introduction-to-large-scale-graph-processing/
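In the BSP model each superstep has three phases: local computation, communication, and a barrier that every participant must reach before the next superstep starts. The sketch below simulates this on a single thread with a vertex-centric "propagate the maximum value" computation. It is an illustration under stated assumptions and does not use the Hama or Giraph API; the class name, the sample values and the chain-shaped graph are all made up.

```java
import java.util.*;

class BspMaxValue {
    // Runs supersteps until no vertex sends a message and returns how many ran.
    // values[v] ends up holding the largest value in v's connected component.
    static int run(int[] values, int[][] neighbours) {
        int n = values.length;
        List<List<Integer>> inbox = new ArrayList<>();
        for (int i = 0; i < n; i++) inbox.add(new ArrayList<>());
        int superstep = 0;
        boolean anyMessage = true;
        while (anyMessage) {
            List<List<Integer>> outbox = new ArrayList<>();
            for (int i = 0; i < n; i++) outbox.add(new ArrayList<>());
            anyMessage = false;
            for (int v = 0; v < n; v++) {
                // 1. Local computation: a vertex sees only its own value and inbox.
                boolean changed = superstep == 0; // everyone announces itself once
                for (int message : inbox.get(v)) {
                    if (message > values[v]) {
                        values[v] = message;
                        changed = true;
                    }
                }
                // 2. Communication: messages are delivered in the next superstep.
                if (changed) {
                    for (int w : neighbours[v]) {
                        outbox.get(w).add(values[v]);
                        anyMessage = true;
                    }
                }
            }
            // 3. Barrier: every vertex has finished before the inboxes are swapped.
            inbox = outbox;
            superstep++;
        }
        return superstep;
    }

    public static void main(String[] args) {
        int[] values = {3, 6, 2, 1};                       // hypothetical vertex values
        int[][] chain = { {1}, {0, 2}, {1, 3}, {2} };      // 0 - 1 - 2 - 3
        int steps = run(values, chain);
        System.out.println(Arrays.toString(values) + " after " + steps + " supersteps");
    }
}
```

On the sample chain every vertex ends with the value 6 after four supersteps; the last superstep sends nothing, which is how the computation detects that it can stop. In a real engine the loop over vertices runs in parallel across machines and the barrier is a genuine synchronisation point.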

Hama vs Giraph

[Figure: Hama and Giraph are both derived from Google Pregel **. Hama runs its BSP engine on HDFS; Giraph runs BSP on top of MapReduce on HDFS.]

** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html

Hama vs Giraph (.....)

Hama                                           Giraph
Pure BSP engine.                               Uses BSP, but BSP API is not exposed.
Matrix, Graph, Network and other processing.   Just for Graph processing.
Jobs are run as a BSP Job on HDFS.             Jobs are run as MapReduce on Hadoop.

Both of them are derived from the `Pregel : A System for Large-Scale Graph Processing` paper published by Google. Both have been recently promoted from Incubator to Apache Top Level Project.

Both of them have a few graph algorithms implemented and also provide a very easy API to implement new Graph algorithms.

Page Rank in Hama

The PageRank algorithm assigns a numerical weight to each element of a hyperlinked set of documents.

bin/hama jar ../hama-0.4.0-examples.jar pagerank <input path> <output path> [damping factor] [epsilon error] [tasks]

http://wiki.apache.org/hama/PageRank

Input

Site1\tSite2\tSite3
Site2\tSite3
Site3

Output

Site1 0.5
Site2 1.3
Site3 1.2
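To show what the job computes, here is a minimal single-machine sketch of PageRank over the same tab-separated input, where the first field is a page and the remaining fields are the pages it links to. It is not the Hama example's code. The class name, the fixed iteration count (used here in place of an epsilon test) and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions. The ranks are normalised to sum to 1, so the numbers will not match the output shown above.

```java
import java.util.*;

class PageRankSketch {
    // Parses "page<TAB>target<TAB>target..." lines into page -> outgoing links.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> links = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            links.computeIfAbsent(fields[0], k -> new ArrayList<>());
            for (int i = 1; i < fields.length; i++) {
                links.get(fields[0]).add(fields[i]);
                links.computeIfAbsent(fields[i], k -> new ArrayList<>());
            }
        }
        return links;
    }

    // Power iteration with a damping factor; the ranks always sum to 1.
    static Map<String, Double> rank(Map<String, List<String>> links, double damping, int iterations) {
        int n = links.size();
        Map<String, Double> scores = new TreeMap<>();
        for (String page : links.keySet()) scores.put(page, 1.0 / n);
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new TreeMap<>();
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (String page : links.keySet()) {
                next.put(page, 0.0);
                if (links.get(page).isEmpty()) dangling += scores.get(page);
            }
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                if (e.getValue().isEmpty()) continue;
                double share = scores.get(e.getKey()) / e.getValue().size();
                for (String target : e.getValue()) next.merge(target, share, Double::sum);
            }
            for (String page : links.keySet()) {
                double received = next.get(page) + dangling / n;
                next.put(page, (1 - damping) / n + damping * received);
            }
            scores = next;
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Site1\tSite2\tSite3", "Site2\tSite3", "Site3");
        System.out.println(rank(parse(input), 0.85, 20));
    }
}
```

As in the output above, Site3 ranks above Site2, because Site3 is linked from both other sites while Site2 is linked only from Site1. Each iteration of the outer loop corresponds naturally to one superstep in a BSP engine: every page sends its share along its links, then all pages update together.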

What's next?

Deep dive into

- Both Graph databases and frameworks, with a demo.
- Bulk Synchronous Parallel processing model.

Hadoop, Hive, Pig and others are too crowded. Graph Frameworks and Databases are emerging and are an easy entry to contribute to in Apache.

I would suggest subscribing to the Apache mailing lists, getting familiar with the projects and contributing to them.
