a highly efficient runtime and graph library for large

17
© 2014 IBM Corporation 2014/02/15 A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics Ilie Gabriel Tanase Research Staff Member, IBM TJ Watson Yinglong Xia, Yanbin Liu, Wei Tan, Jason Crawford, Ching-Yung Lin IBM TJ Watson Lifeng Nai Georgia Tech

Upload: others

Post on 21-Oct-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

A Highly Efficient Runtime and Graph Library for Large

Scale Graph Analytics

Ilie Gabriel Tanase – Research Staff Member, IBM TJ Watson

Yinglong Xia, Yanbin Liu, Wei Tan, Jason Crawford,

Ching-Yung Lin – IBM TJ Watson

Lifeng Nai – Georgia Tech

Page 2: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Visualization

Analytics

Middleware

Database

Huge

Network

Visualization

Graphical

Model

Visualization

Network

Propagation I2 3D

Network

Visualization

Geo Network

Visualization

Centralities

Communities

Graph Sampling

Network Info

Flow Shortest Paths

Ego Net Features Graph Matching

Graph Query

Graph Search Bayesian

Networks Latent Net

Inference Markov Networks

Graph Processing Interface

Hadoop

BigInsights Shared Memory

Graph Library Distr. Memory

Graph Library

Gra

ph

Acce

lera

t

or

Infospher

e

Streams

(ISS)

Graph Data Interface

GBase (update, scan,

operators, indexing))

HBase

HDFS

Native Store

DB2

DB2 RDF TinkerPop

Compliant

DBs

System G v1.0 Architecture

Generic Graph

Library

Graph

Communication

Layer

(PAMI/RDMA)

Page 3: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

The Spectrum of Open Source Graph Technology

disk in memory

# machines clusters single machine

sp

eed

of

pro

cessin

g

faste

r slo

wer

Fulgora Aurelius

Faunus Aurelius

OrientDB NuvolaBase

DEX Sparsity Tech.

Neo4J Neo4j.org

Giraph Apache Yahoo,FB,…

Jung

GraphLab CMU

NetworX NetworX.org

RedisGraph MIT

Titan Aurelius

Page 4: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Motivation and Requirements

Flexible Graph Datastructure

– In memory only, persistent, both

– Directed, undirected, directed with predecessors

– ACID properties

Run well on IBM machines including X86, Power, Bluegene

– Large memory, large number of cores

– Clusters with Infiniband or specialized networks (RDMA)

Commercial solution

Page 5: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

IBM Parallel Programming Library

C++

– Object oriented design - inheritance

– Generic using templates

Datastructures – Graphs, Hash Tables, Arrays

Large shared memory

– Concurrency

Distributed memory clusters

– Messaging API based on active messages and

RDMA

Page 6: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

IBM PPL Graph class hierarchy

Graph<VertexProperty, EdgeProperty, Directness>

- in memory only

- custom vertex and edge property

InDiskGraph

- in memory and persistent storage

MultipropertyGraph

- StorageType : InMemory, Hybrid

Page 7: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

Multiproperty Graph

Name : Gabriel

SSN : xxx-xx

Vertex

Key Value

Name : John

Friend

Label

Edge

since : 2010

Person

Label

Company

Name:IBM

Works at Edge

Since : 2010

Page 8: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Persistent Storage

•Write through policy for now

•Separate structure from properties : benefits computations based on

structure only

•Efficient graph loading : on demand

•Versioning

Page 9: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Programming Model/ Runtime

A graph is a collection of vertices

Each vertex maintains its in and out edges

Parallel processing on IBMPPL graph

– Task based model of parallelism

• execute_tasks(wf, num_tasks)

• for_each(graph, wf);

• schedule_task_graph(tg)

– Work stealing

– Two level nested parallelism

– Within shared memory for now

Page 10: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Performance Add Vertex

1M 2M 3M 4M 5M 10M

0

50

100

150

200

250

300

350

400

Add vertex SG

Add vertex Neo4j

Add vertex Titan

Number of vertices

Exe

cutio

n T

ime

(sec)

Add vertices and for each vertex add a property

Indexed

v=add_vertex(); v.set_property(“name”, “vertex0”);

Titan with Berkeley DB backend

Intel Haswell 24 core 2.7GHz and 256 GB, SSD

Page 11: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Performance Add Edge

1M 2M 3M 4M 5M 10M

0

5000

10000

15000

20000

25000

AddEdge SG

AddEdge Neo4j

AddEdge Titan

# Vertices (Edges=10*Vertices)

Exe

cutio

n T

ime

(sec)

Add edges randomly

The source and target are specified as vertex properties

add_edge(“vertex0”, “vertex7”)

Index lookup

Page 12: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Performance - Query

1M 2M 3M 4M 5M 10M

0

100000

200000

300000

400000

500000

600000

700000

800000

900000

1000000

TEPS_SG

TEPS_NEO4j

TEPS Titan

Graph Size (Vertices, Edges=10xVertices)

TE

PS

For a given vertex collect all its neighbors up to depth=3

~1000 edges traversed per query

Page 13: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

Page 14: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

RDF Graph Construction

• Load the .csv files for vertices

• Load the .csv files for edges

• Construct property graph in memory only

Page 15: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

Query 2

0 10000 20000 30000 40000 50000 60000 70000 80000 90000

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

4000000

Edges/query

TE

PS

Page 16: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

2014/02/15

1 2 4 16 32

0

20000000

40000000

60000000

80000000

100000000

120000000

140000000

160000000

TCMalloc

Q2

Q2 P

Q4

Q4 P

Q6

Q6 P

Cores/threads

TE

PS

(T

CM

alloc)

Impact of Parallelism on Throughput

Queries are

assigned to

threads evenly

Queries are

processed by a

single thread

#concurrent

component

Page 17: A Highly Efficient Runtime and Graph Library for Large

© 2014 IBM Corporation

Conclusion

Graph databases are gaining in popularity

– Google, Facebook, Twitter, Paypal, BAML

Feature System G Native Store Neo4j Titan

Back-end Graph Graph Non-graph

Scaling Yes Moderate Yes

Traversal efficiency Perfect Good Poor

Schemaless Yes Yes No

User defined function Yes No Yes

Performance-critical App.

Perfect Good Poor

Multi-language APIs C++, Java, Python, Shell Java, Cypher Java, Gremlin