Graph Processing Applications @ HUG
Post on 22-Sep-2014
DESCRIPTION
Introduces what graphs are and explores what happens behind some of the applications that use graph processing (PageRank, maps, Facebook, etc.). Introduces, at a high level, the different frameworks and software behind graph processing.
TRANSCRIPT
Agenda
Introduction to Graphs
Representing graphs
Different types of graphs
Algorithms in graphs
What constitutes a graph application
Graph databases (examples and how they work)
Graph computing engines (examples and how they work)
Questions & Answers
What are/aren't Graphs in this context?
[Figure: two examples labelled NO and YES, including a network of vertices numbered 1 to 6]
How is a graph represented?
A collection of vertices connected to each other using edges, with both vertices and edges having properties. A vertex can be a person, place, account or any item which needs to be tracked.
Vertex
Edge
A social graph
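[Figure: a social graph with six vertices (1 to 6): Arun, Tom, Sheetal, Prajval, Deepak and Bob. Callouts mark a vertex, an edge and their properties.]
Vertex properties, for example: Name: Arun, Age: 25, Sex: M
Edge properties, for example: Relation: Colleague
Edge labels in the figure: Friend, Colleague, Relative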
Who are Arun's friends?
Whom should I recommend Sheetal to be friends with?
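Both questions can be answered with short traversals. The following is a minimal sketch, not a method taken from the slides: the exact edges of the figure are not recoverable from the transcript, so the edge list is invented for illustration, and the helper names `friends_of` and `recommend_friends` are hypothetical. Candidates are ranked by the number of mutual friends, which is only one of many possible recommendation heuristics.

```python
from collections import Counter

# Undirected edges carrying a "relation" property.
# ASSUMPTION: this edge list is invented; the slide's real edges are unknown.
edges = [
    ("Arun", "Tom", "Friend"),
    ("Arun", "Sheetal", "Friend"),
    ("Arun", "Deepak", "Colleague"),
    ("Tom", "Prajval", "Friend"),
    ("Tom", "Bob", "Friend"),
    ("Sheetal", "Bob", "Relative"),
    ("Deepak", "Bob", "Friend"),
]


def friends_of(person):
    """Return the set of people linked to `person` by a Friend edge."""
    result = set()
    for a, b, relation in edges:
        if relation == "Friend":
            if a == person:
                result.add(b)
            elif b == person:
                result.add(a)
    return result


def recommend_friends(person):
    """Rank non-friends of `person` by how many mutual friends they share."""
    direct = friends_of(person)
    counts = Counter()
    for friend in direct:
        for candidate in friends_of(friend):
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for a stable order.
    return sorted(counts, key=lambda name: (-counts[name], name))


print("Arun's friends:", sorted(friends_of("Arun")))
print("Suggestions for Sheetal:", recommend_friends("Sheetal"))
```

With this invented data, Sheetal's only friend is Arun, so the suggestion comes from Arun's other friends.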
Facebook Recruiting Competition
Want an interview @ Facebook?
The challenge is to recommend missing links in a social network. Participants will be presented with an external anonymized, directed social graph (no, not Facebook, keep guessing) from which some edges have been deleted, and asked to make ranked predictions for each user in the test set of which other users they would want to follow.
What is Kaggle?
Kaggle is an innovative solution for statistical/analytics outsourcing. We are the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems - the world's best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.
http://www.kaggle.com/c/FacebookRecruiting
A spatial graph
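[Figure: a spatial graph with six vertices (1 to 6): Bangalore, Mumbai, Kolkata, Chennai, New Delhi and Lucknow. Callouts mark a vertex, an edge and their properties.]
Vertex properties, for example: Name: Bangalore, Population: 25,00,000, Area: 35,000 sq km
Edge properties, for example: Distance: 700 km
Edge labels in the figure: 450 km, 800 km, 250 km, 600 km, 350 km, 450 km, 850 km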
What is the shortest distance between Bangalore and Kolkata (Calcutta)?
I would like to cover all the places; which is the shortest path?
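The first question is a single-source shortest-path problem. The sketch below uses Dijkstra's algorithm, which the slides do not name; it is shown here as one standard choice for non-negative edge weights. The road list is an assumption: it reuses distance values from the figure but pairs them with cities arbitrarily, so the result says nothing about real geography. The second question (visiting every place) is a different and much harder problem, a variant of the travelling salesman problem, and is not covered by this sketch.

```python
import heapq

# ASSUMPTION: which cities each distance connects is invented for illustration.
roads = [
    ("Bangalore", "Chennai", 250),
    ("Bangalore", "Mumbai", 450),
    ("Bangalore", "Lucknow", 600),
    ("Mumbai", "New Delhi", 700),
    ("Mumbai", "Kolkata", 800),
    ("New Delhi", "Lucknow", 350),
    ("Lucknow", "Kolkata", 450),
    ("Chennai", "Kolkata", 850),
]

# Build an undirected weighted adjacency map: city -> {neighbour: km}.
graph = {}
for a, b, km in roads:
    graph.setdefault(a, {})[b] = km
    graph.setdefault(b, {})[a] = km


def shortest_path(graph, source, target):
    """Dijkstra's algorithm. Returns (distance, path), or (inf, []) if unreachable."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale queue entry
        done.add(node)
        if node == target:
            break
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in dist:
        return float("inf"), []
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


distance, path = shortest_path(graph, "Bangalore", "Kolkata")
print(distance, "km via", " -> ".join(path))
```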
How to represent a Graph for computing?
[Figure: a directed graph with vertices 1 to 6]

.... as an adjacency list for a sparse graph
1 -> 2, 4, 5
2 -> 3
3 -> 5
4 -> 3, 6
5 ->
6 -> 5

.... as an adjacency matrix for a dense graph
    1  2  3  4  5  6
1   0  1  0  1  1  0
2   0  0  1  0  0  0
3   0  0  0  0  1  0
4   0  0  1  0  0  1
5   0  0  0  0  0  0
6   0  0  0  0  1  0
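A graph with few edges is sparse; one with many edges is dense.
The web, with billions of pages, cannot practically be represented as an adjacency matrix: the matrix needs one cell for every pair of pages, and almost all of those cells would be zero.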
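The trade-off between the two representations can be made concrete with a small sketch. It builds a matrix from the six-vertex adjacency list used on this slide and compares how many entries each form stores; the function name `to_matrix` is hypothetical.

```python
# Adjacency list of the six-vertex directed graph: vertex -> out-neighbours.
adjacency_list = {1: [2, 4, 5], 2: [3], 3: [5], 4: [3, 6], 5: [], 6: [5]}


def to_matrix(adj):
    """Convert an adjacency list into a 0/1 adjacency matrix (rows = sources)."""
    nodes = sorted(adj)
    index = {vertex: i for i, vertex in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, targets in adj.items():
        for target in targets:
            matrix[index[source]][index[target]] = 1
    return matrix


matrix = to_matrix(adjacency_list)
for row in matrix:
    print(*row)

# The list stores one entry per edge; the matrix stores one cell per vertex pair.
edge_count = sum(len(targets) for targets in adjacency_list.values())
cell_count = len(matrix) ** 2
print("adjacency list entries:", edge_count)
print("adjacency matrix cells:", cell_count)
```

The list grows with the number of edges, the matrix with the square of the number of vertices, which is why the matrix form only pays off for dense graphs.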
Different Graphs
Social graph (Facebook, LinkedIn etc)
Spatial graph (Google Maps, MapQuest, FedEx etc)
Web graph (PageRank, Recommendations etc)
Computer network graph (Optimal network layout etc)
Financial graph (Fraud detection, Currency Flow etc)
Data representations (Lists etc)
Chemistry (to represent genomes/molecules)
And others
Some of the Graph Algorithms
Shortest path (Finding the shortest path from A to B)
Minimum Spanning Tree (Cheapest way to connect objects so that every object can reach every other; can be used for internet and cable wiring layouts etc)
Graph center (placing a warehouse, hospital in a city, so that all the locations can be reached easily)
Bipartite Matching (Matching in a dating site, job to employee and others)
Planarity testing (Deciding whether a graph can be drawn without crossing edges, as in the case of circuit designs)
http://www.graph-magics.com/practic_use.php
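As one example from the list above, a minimum spanning tree can be computed with Kruskal's algorithm. This is a sketch of one standard approach, not something the slides prescribe; the four sites and their cable costs are invented, and `minimum_spanning_tree` is a hypothetical helper name.

```python
def minimum_spanning_tree(vertex_count, edges):
    """Kruskal's algorithm.

    `edges` is a list of (cost, u, v) tuples with 0 <= u, v < vertex_count.
    Returns the list of edges kept in the tree.
    """
    parent = list(range(vertex_count))  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((cost, u, v))
    return tree


# ASSUMPTION: hypothetical cable costs between four sites numbered 0 to 3.
cable_costs = [(4, 0, 1), (1, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
tree = minimum_spanning_tree(4, cable_costs)
total_cost = sum(cost for cost, _, _ in tree)
print("edges kept:", tree)
print("total cost:", total_cost)
```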
Graph Applications
[Figure: applications sit on top of graph processing frameworks (Giraph, Hama) and graph databases]
How to store a Graph?
Option 1 : In a flat file as
1- 4,5,6
4- 2,5,6
where vertex 1 is connected to vertices 4, 5 and 6, and so on.
Option 2 : In a relational database using referencing tables or join tables.
Option 3 : Using a specialized database designed specifically for graphs.
Simple, but neither efficient nor easy to maintain.
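A short sketch of Option 1 shows both why the flat file is simple and where it becomes inefficient. The parser assumes the exact `vertex- n1,n2,n3` line format shown above; the function names are hypothetical.

```python
def parse_adjacency_line(line):
    """Parse one line such as '1- 4,5,6' into (1, [4, 5, 6])."""
    vertex, _, neighbours = line.partition("-")
    return int(vertex), [int(n) for n in neighbours.split(",") if n.strip()]


def load_graph(lines):
    """Build a dict vertex -> list of neighbours from flat-file lines."""
    graph = {}
    for line in lines:
        if line.strip():
            vertex, neighbours = parse_adjacency_line(line)
            graph[vertex] = neighbours
    return graph


def incoming(graph, target):
    """Find the vertices that point at `target`; this scans every record."""
    return sorted(v for v, neighbours in graph.items() if target in neighbours)


flat_file = ["1- 4,5,6", "4- 2,5,6"]
graph = load_graph(flat_file)
print(graph)
print("vertices pointing at 5:", incoming(graph, 5))
```

Reading a vertex's own neighbours is cheap, but any question that runs against the stored direction, or that changes the graph, means rescanning or rewriting the whole file.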
Comparing Graph with Relational DB
Depth   Execution time, MySQL (s)   Execution time, Neo4j (s)
2       0.016                       0.010
3       30.267                      0.168
4       1,543.505                   1.359
5       Not finished in 1 hour      2.132
http://www.neotechnology.com/2012/06/how-much-faster-is-a-graph-database-really/
The test: in a database of 1,000,000 users, find friends-of-friends for 1,000 users at various depths.
Which one would you prefer for storing graph data?
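One way to see why the relational timings grow so quickly with depth is that each extra level of friendship adds another self-join on the relationship table. The sketch below illustrates that with Python's built-in `sqlite3` module. It is not the benchmark's schema or query: the `friend` table, its columns, the sample rows and the helper `friends_at_depth` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friend (person TEXT, friend TEXT)")
# ASSUMPTION: invented sample rows, one row per directed friendship.
conn.executemany(
    "INSERT INTO friend VALUES (?, ?)",
    [
        ("Arun", "Tom"),
        ("Arun", "Sheetal"),
        ("Tom", "Prajval"),
        ("Sheetal", "Bob"),
    ],
)


def friends_at_depth(conn, person, depth):
    """Return the people exactly `depth` friendship hops away from `person`.

    Every additional level of depth adds one more self-join of `friend`.
    """
    joins = "".join(
        f" JOIN friend f{i} ON f{i}.person = f{i - 1}.friend"
        for i in range(2, depth + 1)
    )
    sql = (
        f"SELECT DISTINCT f{depth}.friend FROM friend f1{joins} "
        "WHERE f1.person = ? ORDER BY 1"
    )
    return [row[0] for row in conn.execute(sql, (person,))]


print("depth 1:", friends_at_depth(conn, "Arun", 1))
print("depth 2:", friends_at_depth(conn, "Arun", 2))
```

On a table with millions of rows, each of those joins multiplies the intermediate result, which is the effect the timings above reflect.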
So, what is a Graph DB?
A graph database is any storage system that provides `index free adjacency`.
[Figure: the six-vertex graph, with each vertex annotated with its adjacent vertices (1: 2, 4, 5; 2: 3; 3: 5; 4: 3, 6; 6: 5)]
Every element (node or edge) has a direct pointer to its adjacent elements.
No index lookup: we can determine which vertices are adjacent to a given vertex without looking up an index tree.
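The idea can be sketched in memory with objects that hold direct references to their neighbours. This is only an analogy for how a graph database lays out records on disk, not a description of any particular product; the `Node` class and `reachable` function are hypothetical, and the edges are those of the six-vertex example graph.

```python
class Node:
    """A vertex that holds direct references to its adjacent vertices."""

    def __init__(self, name):
        self.name = name
        self.out = []  # direct pointers to neighbouring Node objects

    def connect(self, other):
        self.out.append(other)


# The dict is used only to build the graph and to pick a starting node.
nodes = {i: Node(i) for i in range(1, 7)}
for source, targets in {1: [2, 4, 5], 2: [3], 3: [5], 4: [3, 6], 6: [5]}.items():
    for target in targets:
        nodes[source].connect(nodes[target])


def reachable(start):
    """Collect every vertex reachable from `start` by following pointers only.

    No global index is consulted per hop, so the cost of each step depends on
    the number of neighbours of the current node, not on the graph's size.
    """
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node.name not in seen:
            seen.add(node.name)
            stack.extend(node.out)
    return seen


print(sorted(reachable(nodes[4])))
```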
So, what is a Graph DB? (.....)
A graph DB is the option to choose when persisting graphs.
So, what is a Graph DB? (.....)
[Figure: members of the NoSQL family placed on two axes, Data Size and Data Complexity]
Key-value stores like Amazon Dynamo.
Columnar databases like Cassandra, HBase.
Document databases like MongoDB, CouchDB.
Graph databases like Neo4j.
All are part of the NoSQL family.
Graph DB Bindings (~JDBC API)
// connect to the database
// begin transaction

Node firstNode;
Node secondNode;
Relationship relationship;

firstNode = graphDb.createNode();
firstNode.setProperty( "message", "Hello, " );
secondNode = graphDb.createNode();
secondNode.setProperty( "message", "World!" );

relationship = firstNode.createRelationshipTo( secondNode, RelTypes.KNOWS );
relationship.setProperty( "message", "brave Neo4j " );

// end the transaction
// close the connection to the database
http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded-hello-world.html
Graph Adhoc Query (~SQL)
http://docs.neo4j.org/chunked/milestone/cypher-query-lang.html
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof

john                   fof
Node[4]{name:"John"}   Node[2]{name:"Maria"}
Node[4]{name:"John"}   Node[3]{name:"Steve"}
Different Graph Databases
http://en.wikipedia.org/wiki/Graph_database
FlockDB from Twitter
GraphBase
AllegroGraph
InfiniteGraph from Objectivity
What is a Graph Computing Engine?
[Figure: a graph computing engine takes an InputFormat and an input location, runs algorithms, and writes through an OutputFormat to an output location]
Graph engines come with some built-in graph processing algorithms, but also provide an easy to use API to build new algorithms and extend the framework.
http://incubator.apache.org/giraph/apidocs/index.html
http://incubator.apache.org/hama/docs/r0.3.0/api/index.html
Different Graph Computing Engines
Memory-based graphs (graph size < local machine RAM)
- jung.sourceforge.net
- igraph.sourceforge.net
- networkx.lanl.gov
Disk-based graphs (graph size < local hard disk size)
- Neo4j
- InfiniteGraph (objectivity.com)
- sparsity-technologies.com/dex
Cluster-based graphs (graph size depends on the cluster specs)
- Apache Hama
- Apache Giraph
- GoldenORB
The cluster-based engines are based on the BSP (Bulk Synchronous Parallel) model, in the spirit of Google Pregel.
Bulk Synchronous Parallel
Some quick facts
An alternate computing model to MapReduce (Not all problems can be solved with MapReduce efficiently). Also, any MR algorithm can be simulated on BSP and vice versa.
Developed by Leslie Valiant during the 1980s. It was revived by Google in the Pregel paper (used extensively for PageRank).
Good for
- Processing big data with complicated relationships, e.g. graphs and networks.
- Iterative and recursive scientific computations.
- Continuous Event Processing (CEP).
http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
http://arxiv.org/abs/1203.2081 (comparing MR vs BSP)
What is Bulk Synchronous Parallel?
[Figure: a BSP computation proceeding through Super Step 1, Super Step 2 and Super Step 3]
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
http://blog.octo.com/en/introduction-to-large-scale-graph-processing/
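In BSP, each superstep consists of local computation, communication between processes, and a barrier that all processes reach before the next superstep starts. The sketch below simulates that on a single machine with a Pregel-style "propagate the maximum value" job. It is a toy under stated assumptions: the graph and the initial values are invented, `bsp_max_value` is a hypothetical name, and nothing here uses the Hama or Giraph APIs.

```python
def bsp_max_value(graph, values):
    """Spread the largest vertex value through the graph in supersteps.

    graph:  dict vertex -> list of out-neighbours
    values: dict vertex -> initial value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # First superstep: every vertex sends its value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])
    supersteps = 1

    while any(inbox.values()):  # stop once no messages are in flight
        outbox = {vertex: [] for vertex in graph}
        # Local computation: a vertex sees only messages from the previous superstep.
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                # Communication: only vertices that changed send new messages.
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(values[vertex])
        # Barrier: new messages become visible only in the next superstep.
        inbox = outbox
        supersteps += 1
    return values, supersteps


# ASSUMPTION: a small invented chain A - B - C - D with edges in both directions.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
final_values, supersteps = bsp_max_value(chain, {"A": 3, "B": 6, "C": 2, "D": 1})
print(final_values, "after", supersteps, "supersteps")
```

In a real engine the vertices are partitioned across machines and the barrier is a cluster-wide synchronization; the loop body here stands in for one superstep.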
Hama vs Giraph
[Figure: Hama and Giraph are both derived from Google Pregel **. Hama runs BSP directly on HDFS; Giraph runs BSP on top of MapReduce on HDFS.]
** http://googleresearch.blogspot.in/2009/06/large-scale-graph-computing-at-google.html
Hama vs Giraph (.....)
Hama                                           Giraph
Pure BSP engine.                               Uses BSP, but the BSP API is not exposed.
Matrix, graph, network and other processing.   Just for graph processing.
Jobs are run as BSP jobs on HDFS.              Jobs are run as MapReduce jobs on Hadoop.
Both of them are derived from the `Pregel : A System for Large-Scale Graph Processing` paper published by Google. Both have recently been promoted from the Incubator to Apache Top Level Projects.
Both of them have a few graph algorithms implemented and also provide a very easy API to implement new Graph algorithms.
Page Rank in Hama
The PageRank algorithm assigns a numerical weight to each element of a hyperlinked set of documents, as a measure of its relative importance within the set.
./bin/hama jar ../hama-0.4.0-examples.jar pagerank <input path> <output path> [damping factor] [epsilon error] [tasks]
http://wiki.apache.org/hama/PageRank
Input
Site1\tSite2\tSite3
Site2\tSite3
Site3
Output
Site1 0.5
Site2 1.3
Site3 1.2
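To show what the job computes, here is a minimal single-machine PageRank sketch. It is not Hama's implementation and is not expected to reproduce the sample output above. The parameters `damping` and `epsilon` mirror the command's `[damping factor]` and `[epsilon error]` arguments; the default of 0.85, the normalization of ranks to sum to 1, and the even redistribution of rank from pages without outgoing links are assumptions of this sketch. It also assumes every linked page has its own input line.

```python
def parse_links(text):
    """Parse tab-separated lines: the first field is a page, the rest are its links."""
    links = {}
    for line in text.splitlines():
        page, *targets = line.split("\t")
        links[page] = targets
    return links


def pagerank(links, damping=0.85, epsilon=1e-6, max_iterations=100):
    """Iterate until the total change in rank drops below `epsilon`.

    links: dict page -> list of pages it links to.
    Returns dict page -> rank, with ranks summing to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not links[page])
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        for page in pages:
            for target in links[page]:
                new_rank[target] += damping * rank[page] / len(links[page])
        change = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if change < epsilon:
            break
    return rank


sample_input = "Site1\tSite2\tSite3\nSite2\tSite3\nSite3"
ranks = pagerank(parse_links(sample_input))
for page, value in ranks.items():
    print(page, round(value, 3))
```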
What's next?
Deep dive into
- Both graph databases and frameworks, with a demo.
- The Bulk Synchronous Parallel processing model.
Hadoop, Hive, Pig and similar projects are already crowded with contributors. Graph frameworks and databases are still emerging, so they are an easier entry point for contributing to Apache.
I would suggest subscribing to the Apache mailing lists for these projects, getting familiar with them and starting to contribute.
Q&A