a highly efficient runtime and graph library for large
TRANSCRIPT
© 2014 IBM Corporation
2014/02/15
A Highly Efficient Runtime and Graph Library for Large
Scale Graph Analytics
Ilie Gabriel Tanase – Research Staff Member, IBM TJ Watson
Yinglong Xia, Yanbin Liu, Wei Tan, Jason Crawford,
Ching-Yung Lin – IBM TJ Watson
Lifeng Nai – Georgia Tech
© 2014 IBM Corporation
Visualization
Analytics
Middleware
Database
Huge
Network
Visualization
Graphical
Model
Visualization
Network
Propagation I2 3D
Network
Visualization
Geo Network
Visualization
Centralities
Communities
Graph Sampling
Network Info
Flow Shortest Paths
Ego Net Features Graph Matching
Graph Query
Graph Search Bayesian
Networks Latent Net
Inference Markov Networks
Graph Processing Interface
Hadoop
BigInsights Shared Memory
Graph Library Distr. Memory
Graph Library
Gra
ph
Acce
lera
t
or
Infospher
e
Streams
(ISS)
Graph Data Interface
GBase (update, scan,
operators, indexing))
HBase
HDFS
Native Store
DB2
DB2 RDF TinkerPop
Compliant
DBs
System G v1.0 Architecture
Generic Graph
Library
Graph
Communication
Layer
(PAMI/RDMA)
© 2014 IBM Corporation
2014/02/15
The Spectrum of Open Source Graph Technology
disk in memory
# machines clusters single machine
sp
eed
of
pro
cessin
g
faste
r slo
wer
Fulgora Aurelius
Faunus Aurelius
OrientDB NuvolaBase
DEX Sparsity Tech.
Neo4J Neo4j.org
Giraph Apache Yahoo,FB,…
Jung
GraphLab CMU
NetworX NetworX.org
RedisGraph MIT
Titan Aurelius
© 2014 IBM Corporation
Motivation and Requirements
Flexible Graph Datastructure
– In memory only, persistent, both
– Directed, undirected, directed with predecessors
– ACID properties
Run well on IBM machines including X86, Power, Bluegene
– Large memory, large number of cores
– Clusters with Infiniband or specialized networks (RDMA)
Commercial solution
© 2014 IBM Corporation
IBM Parallel Programming Library
C++
– Object oriented design - inheritance
– Generic using templates
Datastructures – Graphs, Hash Tables, Arrays
Large shared memory
– Concurrency
Distributed memory clusters
– Messaging API based on active messages and
RDMA
© 2014 IBM Corporation
IBM PPL Graph class hierarchy
Graph<VertexProperty, EdgeProperty, Directness>
- in memory only
- custom vertex and edge property
InDiskGraph
- in memory and persistent storage
MultipropertyGraph
- StorageType : InMemory, Hybrid
© 2014 IBM Corporation
2014/02/15
Multiproperty Graph
Name : Gabriel
SSN : xxx-xx
Vertex
Key Value
Name : John
Friend
Label
Edge
since : 2010
Person
Label
Company
Name:IBM
Works at Edge
Since : 2010
© 2014 IBM Corporation
Persistent Storage
•Write through policy for now
•Separate structure from properties : benefits computations based on
structure only
•Efficient graph loading : on demand
•Versioning
© 2014 IBM Corporation
Programming Model/ Runtime
A graph is a collection of vertices
Each vertex maintains its in and out edges
Parallel processing on IBMPPL graph
– Task based model of parallelism
• execute_tasks(wf, num_tasks)
• for_each(graph, wf);
• schedule_task_graph(tg)
– Work stealing
– Two level nested parallelism
– Within shared memory for now
© 2014 IBM Corporation
Performance Add Vertex
1M 2M 3M 4M 5M 10M
0
50
100
150
200
250
300
350
400
Add vertex SG
Add vertex Neo4j
Add vertex Titan
Number of vertices
Exe
cutio
n T
ime
(sec)
Add vertices and for each vertex add a property
Indexed
v=add_vertex(); v.set_property(“name”, “vertex0”);
Titan with Berkeley DB backend
Intel Haswell 24 core 2.7GHz and 256 GB, SSD
© 2014 IBM Corporation
Performance Add Edge
1M 2M 3M 4M 5M 10M
0
5000
10000
15000
20000
25000
AddEdge SG
AddEdge Neo4j
AddEdge Titan
# Vertices (Edges=10*Vertices)
Exe
cutio
n T
ime
(sec)
Add edges randomly
The source and target are specified as vertex properties
add_edge(“vertex0”, “vertex7”)
Index lookup
© 2014 IBM Corporation
Performance - Query
1M 2M 3M 4M 5M 10M
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
1000000
TEPS_SG
TEPS_NEO4j
TEPS Titan
Graph Size (Vertices, Edges=10xVertices)
TE
PS
For a given vertex collect all its neighbors up to depth=3
~1000 edges traversed per query
© 2014 IBM Corporation
2014/02/15
© 2014 IBM Corporation
2014/02/15
RDF Graph Construction
• Load the .csv files for vertices
• Load the .csv files for edges
• Construct property graph in memory only
© 2014 IBM Corporation
2014/02/15
Query 2
0 10000 20000 30000 40000 50000 60000 70000 80000 90000
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
Edges/query
TE
PS
© 2014 IBM Corporation
2014/02/15
1 2 4 16 32
0
20000000
40000000
60000000
80000000
100000000
120000000
140000000
160000000
TCMalloc
Q2
Q2 P
Q4
Q4 P
Q6
Q6 P
Cores/threads
TE
PS
(T
CM
alloc)
Impact of Parallelism on Throughput
Queries are
assigned to
threads evenly
Queries are
processed by a
single thread
#concurrent
component
© 2014 IBM Corporation
Conclusion
Graph databases are gaining in popularity
– Google, Facebook, Twitter, Paypal, BAML
Feature System G Native Store Neo4j Titan
Back-end Graph Graph Non-graph
Scaling Yes Moderate Yes
Traversal efficiency Perfect Good Poor
Schemaless Yes Yes No
User defined function Yes No Yes
Performance-critical App.
Perfect Good Poor
Multi-language APIs C++, Java, Python, Shell Java, Cypher Java, Gremlin