
© 2014 IBM Corporation

2014/02/15

A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics

Ilie Gabriel Tanase – Research Staff Member, IBM TJ Watson

Yinglong Xia, Yanbin Liu, Wei Tan, Jason Crawford, Ching-Yung Lin – IBM TJ Watson

Lifeng Nai – Georgia Tech


System G v1.0 Architecture

[Figure: layered architecture diagram; layers and components listed below]

Visualization

– Huge Network Visualization

– Graphical Model Visualization

– Network Propagation

– I2 3D Network Visualization

– Geo Network Visualization

Analytics

– Centralities, Communities, Graph Sampling, Network Info Flow, Shortest Paths, Ego Net Features, Graph Matching, Graph Query, Graph Search, Bayesian Networks, Latent Net Inference, Markov Networks

Middleware (Graph Processing Interface)

– Hadoop, BigInsights, Shared Memory Graph Library, Distr. Memory Graph Library, Graph Accelerator, Infosphere Streams (ISS)

Database (Graph Data Interface)

– GBase (update, scan, operators, indexing), HBase, HDFS, Native Store, DB2, DB2 RDF, TinkerPop Compliant DBs

Generic Graph Library

Graph Communication Layer (PAMI/RDMA)



The Spectrum of Open Source Graph Technology

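[Figure: open source graph systems arranged by storage (disk vs. in memory), number of machines (clusters vs. single machine), and speed of processing (slower to faster)]

Systems shown (project, organization):

– Fulgora (Aurelius)

– Faunus (Aurelius)

– OrientDB (NuvolaBase)

– DEX (Sparsity Tech.)

– Neo4j (Neo4j.org)

– Giraph (Apache; Yahoo, FB, …)

– Jung

– GraphLab (CMU)

– NetworkX (NetworkX.org)

– RedisGraph (MIT)

– Titan (Aurelius)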


Motivation and Requirements

Flexible Graph Datastructure

– In memory only, persistent, both

– Directed, undirected, directed with predecessors

– ACID properties

Run well on IBM machines, including x86, Power, and Blue Gene

– Large memory, large number of cores

– Clusters with InfiniBand or specialized networks (RDMA)

Commercial solution


IBM Parallel Programming Library

C++

– Object oriented design - inheritance

– Generic using templates

Datastructures – Graphs, Hash Tables, Arrays

Large shared memory

– Concurrency

Distributed memory clusters

– Messaging API based on active messages and RDMA


IBM PPL Graph class hierarchy

Graph<VertexProperty, EdgeProperty, Directness>

- in memory only

- custom vertex and edge property

InDiskGraph

- in memory and persistent storage

MultipropertyGraph

- StorageType : InMemory, Hybrid



Multiproperty Graph

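[Figure: example multiproperty graph. Vertices carry a label (e.g. Person, Company) and key/value properties (Name : Gabriel, SSN : xxx-xx; Name : John; Name : IBM). Edges carry a label and key/value properties (Friend, since : 2010; Works at, since : 2010).]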
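
The multiproperty model on this slide (labels plus key/value properties on both vertices and edges) can be sketched as follows. This is an illustration only: the type and function names are hypothetical, properties are simplified to string values, and the edge endpoints and John's Person label are assumptions, since the extracted slide does not state them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Key/value properties, simplified to string values for this sketch.
using PropertyMap = std::map<std::string, std::string>;

struct MPVertex {
    std::string label;       // e.g. "Person", "Company"
    PropertyMap properties;  // e.g. Name -> Gabriel
};

struct MPEdge {
    std::size_t source;
    std::size_t target;
    std::string label;       // e.g. "Friend", "Works at"
    PropertyMap properties;  // e.g. since -> 2010
};

// Hypothetical container; not the IBM PPL MultipropertyGraph class.
class MultipropertyGraphSketch {
public:
    std::size_t add_vertex(const std::string& label, const PropertyMap& properties) {
        vertices_.push_back(MPVertex{label, properties});
        return vertices_.size() - 1;
    }

    std::size_t add_edge(std::size_t source, std::size_t target,
                         const std::string& label, const PropertyMap& properties) {
        assert(source < vertices_.size() && target < vertices_.size());
        edges_.push_back(MPEdge{source, target, label, properties});
        return edges_.size() - 1;
    }

    const MPVertex& vertex(std::size_t v) const { return vertices_.at(v); }
    const MPEdge& edge(std::size_t e) const { return edges_.at(e); }
    std::size_t num_edges() const { return edges_.size(); }

private:
    std::vector<MPVertex> vertices_;
    std::vector<MPEdge> edges_;
};

// Builds the slide's example. Assumed, not stated on the slide: John is a
// Person, Gabriel is the source of both edges, and both edges use the key "since".
inline MultipropertyGraphSketch build_slide_example() {
    MultipropertyGraphSketch g;
    std::size_t gabriel = g.add_vertex("Person", PropertyMap{{"Name", "Gabriel"}, {"SSN", "xxx-xx"}});
    std::size_t john = g.add_vertex("Person", PropertyMap{{"Name", "John"}});
    std::size_t ibm = g.add_vertex("Company", PropertyMap{{"Name", "IBM"}});
    g.add_edge(gabriel, john, "Friend", PropertyMap{{"since", "2010"}});
    g.add_edge(gabriel, ibm, "Works at", PropertyMap{{"since", "2010"}});
    return g;
}
```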


Persistent Storage

• Write through policy for now

• Separate structure from properties : benefits computations based on structure only

• Efficient graph loading : on demand

• Versioning


Programming Model/ Runtime

A graph is a collection of vertices

Each vertex maintains its in and out edges

Parallel processing on IBM PPL graph

– Task based model of parallelism

• execute_tasks(wf, num_tasks)

• for_each(graph, wf);

• schedule_task_graph(tg)

– Work stealing

– Two level nested parallelism

– Within shared memory for now
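
As a rough illustration of the task-based model, the sketch below implements stand-ins for execute_tasks(wf, num_tasks) and for_each(graph, wf) on top of std::thread. The signatures, the sketch namespace, the AdjacencyGraph type, and the num_threads parameter are assumptions, not the IBM PPL API. Scheduling uses a shared atomic counter from which threads claim tasks dynamically. That is a simpler technique than the work stealing named on the slide, though it also balances uneven task costs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

namespace sketch {

// Runs wf(task_id) for every task_id in [0, num_tasks) using num_threads threads.
// Threads claim the next unprocessed task from a shared atomic counter
// (dynamic self-scheduling, used here in place of true work stealing).
template <typename WorkFunction>
void execute_tasks(WorkFunction wf, std::size_t num_tasks, unsigned num_threads = 4) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&]() {
        for (;;) {
            std::size_t task = next.fetch_add(1);
            if (task >= num_tasks) break;
            wf(task);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 0; i + 1 < num_threads; ++i) pool.emplace_back(worker);
    worker();  // the calling thread participates as well
    for (auto& t : pool) t.join();
}

// Hypothetical minimal graph: out[v] lists the out-neighbors of vertex v.
struct AdjacencyGraph {
    std::vector<std::vector<std::size_t>> out;
};

// Applies wf(vertex, out_neighbors) to every vertex, one task per vertex.
// wf must be safe to call concurrently from several threads.
template <typename WorkFunction>
void for_each(const AdjacencyGraph& g, WorkFunction wf, unsigned num_threads = 4) {
    execute_tasks([&](std::size_t v) { wf(v, g.out[v]); }, g.out.size(), num_threads);
}

}  // namespace sketch
```

A work function passed to sketch::for_each might, for example, add each vertex's out-degree to a std::atomic counter to count the edges of the graph in parallel.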


Performance Add Vertex

[Figure: execution time (sec), 0 to 400, vs. number of vertices (1M, 2M, 3M, 4M, 5M, 10M) for Add vertex on SG, Neo4j, and Titan]

Add vertices and, for each vertex, add an indexed property:

v = add_vertex(); v.set_property("name", "vertex0");

Titan with Berkeley DB backend

Intel Haswell, 24 cores, 2.7 GHz, 256 GB, SSD


Performance Add Edge

[Figure: execution time (sec), 0 to 25000, vs. number of vertices (1M, 2M, 3M, 4M, 5M, 10M; edges = 10 × vertices) for AddEdge on SG, Neo4j, and Titan]

Add edges randomly

The source and target are specified as vertex properties:

add_edge("vertex0", "vertex7")

Each endpoint is resolved by an index lookup
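
The two insertion benchmarks depend on a property index: every vertex gets an indexed "name" property, and add_edge("vertex0", "vertex7") resolves both endpoints by name before linking them. A minimal sketch of that pattern follows. The class name and the hash-map index are assumptions for illustration, not the System G native store.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical graph with a hash index on the "name" vertex property.
class IndexedGraphSketch {
public:
    // Adds a vertex and returns its internal id.
    std::size_t add_vertex() {
        adjacency_.emplace_back();
        return adjacency_.size() - 1;
    }

    // Sets the indexed "name" property of vertex v.
    void set_name(std::size_t v, const std::string& name) {
        assert(v < adjacency_.size());
        name_index_[name] = v;
    }

    // Resolves a vertex by its indexed name; throws if the name is unknown.
    std::size_t lookup(const std::string& name) const {
        auto it = name_index_.find(name);
        if (it == name_index_.end())
            throw std::out_of_range("unknown vertex name: " + name);
        return it->second;
    }

    // Adds a directed edge between two vertices identified by name.
    // Each call performs two index lookups before touching the adjacency list.
    void add_edge(const std::string& source_name, const std::string& target_name) {
        std::size_t source = lookup(source_name);
        std::size_t target = lookup(target_name);
        adjacency_[source].push_back(target);
    }

    const std::vector<std::size_t>& out_neighbors(std::size_t v) const {
        return adjacency_.at(v);
    }

private:
    std::vector<std::vector<std::size_t>> adjacency_;
    std::unordered_map<std::string, std::size_t> name_index_;
};
```

In this layout the cost of add_edge is dominated by the two lookups, which is why the index implementation matters for the edge-insertion benchmark.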


Performance - Query

[Figure: TEPS (traversed edges per second), 0 to 1000000, vs. graph size (1M, 2M, 3M, 4M, 5M, 10M vertices; edges = 10 × vertices) for SG, Neo4j, and Titan]

For a given vertex collect all its neighbors up to depth=3

~1000 edges traversed per query
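
The query above is a bounded breadth-first traversal, and TEPS is the number of edges traversed divided by the elapsed time. The sketch below shows one way to collect neighbors up to a given depth while counting traversed edges. The function names and the adjacency-list input are assumptions; the benchmark's actual query code is not shown on the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

struct TraversalResult {
    std::vector<std::size_t> neighbors;  // vertices reached, excluding the start
    std::size_t edges_traversed;         // every edge examined during the search
};

// Breadth-first search from `start`, expanding vertices at depth < max_depth.
// out[v] lists the out-neighbors of vertex v.
inline TraversalResult collect_neighbors(const std::vector<std::vector<std::size_t>>& out,
                                         std::size_t start, unsigned max_depth) {
    assert(start < out.size());
    std::vector<bool> visited(out.size(), false);
    std::queue<std::pair<std::size_t, unsigned>> frontier;  // (vertex, depth)
    TraversalResult result{{}, 0};
    visited[start] = true;
    frontier.emplace(start, 0u);
    while (!frontier.empty()) {
        std::pair<std::size_t, unsigned> current = frontier.front();
        frontier.pop();
        if (current.second == max_depth) continue;  // do not expand past the depth limit
        for (std::size_t w : out[current.first]) {
            ++result.edges_traversed;
            if (!visited[w]) {
                visited[w] = true;
                result.neighbors.push_back(w);
                frontier.emplace(w, current.second + 1);
            }
        }
    }
    return result;
}

// Traversed edges per second for a query that took `seconds` of wall-clock time.
inline double teps(std::size_t edges_traversed, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(edges_traversed) / seconds;
}
```

For example, a query that traverses about 1000 edges in 2 ms corresponds to roughly 500,000 TEPS.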



RDF Graph Construction

• Load the .csv files for vertices

• Load the .csv files for edges

• Construct property graph in memory only
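
The three loading steps can be sketched as below. The CSV layouts are assumptions, because the slides do not show the file format: vertex rows are taken to be "id,label" and edge rows "source_id,target_id,label". Input comes from any std::istream, so the same code works for files (std::ifstream) and for in-memory strings.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory property graph built from CSV input.
struct CsvGraph {
    struct Edge {
        std::size_t source;
        std::size_t target;
        std::string label;
    };
    std::unordered_map<std::string, std::size_t> id_of;  // external id -> internal index
    std::vector<std::string> vertex_label;               // indexed by internal index
    std::vector<Edge> edges;
};

// Splits one line on commas (no quoting or escaping support in this sketch).
inline std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) fields.push_back(field);
    return fields;
}

// Step 1: load vertices from rows of the assumed form "id,label".
inline void load_vertices(std::istream& in, CsvGraph& g) {
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> f = split_csv_line(line);
        if (f.size() < 2 || g.id_of.count(f[0])) continue;  // skip malformed or duplicate rows
        g.id_of[f[0]] = g.vertex_label.size();
        g.vertex_label.push_back(f[1]);
    }
    assert(g.id_of.size() == g.vertex_label.size());
}

// Step 2: load edges from rows of the assumed form "source_id,target_id,label".
// Rows that reference unknown vertices are skipped. Returns the number of edges added.
inline std::size_t load_edges(std::istream& in, CsvGraph& g) {
    std::size_t added = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> f = split_csv_line(line);
        if (f.size() < 3) continue;
        auto s = g.id_of.find(f[0]);
        auto t = g.id_of.find(f[1]);
        if (s == g.id_of.end() || t == g.id_of.end()) continue;
        g.edges.push_back(CsvGraph::Edge{s->second, t->second, f[2]});
        ++added;
    }
    return added;
}
```

Loading vertices before edges lets every edge row be validated against the vertex index as it is read, so the graph is fully built in memory after a single pass over each file.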



Query 2

[Figure: TEPS, 0 to 4000000, vs. edges per query, 0 to 90000]



Impact of Parallelism on Throughput

[Figure: TEPS (with TCMalloc), 0 to 160000000, vs. cores/threads (1, 2, 4, 16, 32) for series Q2, Q2 P, Q4, Q4 P, Q6, Q6 P. Annotations: "Queries are assigned to threads evenly", "Queries are processed by a single thread", "#concurrent component".]
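
One plausible reading of the annotations is that a batch of queries is split evenly across threads and each individual query runs on a single thread. Under that assumption, the throughput experiment can be sketched as a static round-robin partition. The function name and signature are hypothetical, and the benchmark harness itself is not shown on the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Runs queries 0..num_queries-1, assigning them round-robin to num_threads threads.
// query(q) must be safe to call concurrently and must return the number of edges
// that query q traversed. The return value is the total across all queries.
template <typename Query>
std::size_t run_queries_evenly(std::size_t num_queries, unsigned num_threads, Query query) {
    assert(num_threads > 0);
    std::vector<std::size_t> per_thread(num_threads, 0);  // one private counter per thread
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&, t]() {
            // Thread t handles queries t, t + num_threads, t + 2 * num_threads, ...
            for (std::size_t q = t; q < num_queries; q += num_threads)
                per_thread[t] += query(q);
        });
    }
    for (auto& th : pool) th.join();
    std::size_t total = 0;
    for (std::size_t count : per_thread) total += count;
    return total;
}
```

Dividing the returned edge count by the wall-clock time of the whole batch gives the aggregate TEPS for that thread count.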


Conclusion

Graph databases are gaining in popularity

– Google, Facebook, Twitter, PayPal, BAML

Feature                     System G Native Store       Neo4j           Titan

Back-end                    Graph                       Graph           Non-graph

Scaling                     Yes                         Moderate        Yes

Traversal efficiency        Perfect                     Good            Poor

Schemaless                  Yes                         Yes             No

User defined function       Yes                         No              Yes

Performance-critical App.   Perfect                     Good            Poor

Multi-language APIs         C++, Java, Python, Shell    Java, Cypher    Java, Gremlin
