GRAPH PROCESSING


TRANSCRIPT

Page 1: GRAPH PROCESSING

GRAPH PROCESSING

Page 2: GRAPH PROCESSING

Why Graph Processing?

Graphs are everywhere!

Page 3: GRAPH PROCESSING

Why Graph Processing?

Page 4: GRAPH PROCESSING

Why Distributed Graph Processing?

They are getting bigger!

Page 5: GRAPH PROCESSING

Road Scale

>24 million vertices, >58 million edges

*Route Planning in Road Networks - 2008

Page 6: GRAPH PROCESSING

Social Scale

>1 billion vertices, ~1 trillion edges
*Facebook Engineering Blog

~41 million vertices, >1.4 billion edges
*Twitter Graph - 2010

Page 7: GRAPH PROCESSING

Web Scale

>50 billion vertices, >1 trillion edges

*NSA Big Graph Experiment - 2013

Page 8: GRAPH PROCESSING

Brain Scale

>100 billion vertices, >100 trillion edges

*NSA Big Graph Experiment - 2013

Page 9: GRAPH PROCESSING

CHALLENGES IN PARALLEL GRAPH PROCESSING

Lumsdaine, Andrew, et al. "Challenges in Parallel Graph Processing." Parallel Processing Letters 17.01, 2007.

Page 10: GRAPH PROCESSING

Challenges

1. Structure-driven computation: data transfer issues

2. Irregular structure: partitioning issues

*Concept borrowed from Cristina Abad’s PhD defense slides

Page 11: GRAPH PROCESSING

Overcoming the challenges

1 Extend Existing Paradigms

2 BUILD NEW FRAMEWORKS!

Page 12: GRAPH PROCESSING

Build New Graph Frameworks!

Key Requirements from Graph Processing Frameworks

Page 13: GRAPH PROCESSING

1 Less pre-processing

2 Low and load-balanced computation

3 Low and load-balanced communication

4 Low memory footprint

5 Scalable wrt cluster size and graph size

Page 14: GRAPH PROCESSING

PREGEL

Malewicz, Grzegorz, et al. "Pregel: A System for Large-Scale Graph Processing." ACM SIGMOD, 2010.

Page 15: GRAPH PROCESSING

Life of a Vertex Program

[Figure: timeline of a vertex program: placement of vertices, then per-iteration computation and communication phases separated by barriers]

*Concept borrowed from LFGraph slides

Page 16: GRAPH PROCESSING

Sample Graph

[Figure: example graph with vertices A, B, C, D, E]

*Graph borrowed from the LFGraph paper

Page 17: GRAPH PROCESSING


Shortest Path Example

Page 18: GRAPH PROCESSING

[Figure: vertex values B=0, A=∞, C=∞, D=∞, E=∞]

Iteration 0: message (0+1)

Page 19: GRAPH PROCESSING

[Figure: vertex values B=0, A=1, C=∞, D=∞, E=∞]

Iteration 1: messages (1+1), (1+1)

Page 20: GRAPH PROCESSING

[Figure: vertex values B=0, A=1, C=∞, D=2, E=2]

Iteration 2
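
The walkthrough above corresponds to a vertex program. Below is a minimal Python sketch of a Pregel-style synchronous (BSP) single-source shortest-path computation with unit edge weights; the edge set (B→A, A→D, A→E) is inferred from the example, and all names and the message-passing loop are illustrative, not Pregel's actual API.

# Minimal Pregel-style BSP sketch of single-source shortest paths (unit weights).
# The graph and all names are illustrative; this is not Pregel's actual API.
import math

out_edges = {"B": ["A"], "A": ["D", "E"], "C": [], "D": [], "E": []}  # inferred from the example
dist = {v: math.inf for v in out_edges}
inbox = {v: [] for v in out_edges}
inbox["B"] = [0]                        # seed the source vertex with distance 0

superstep = 0
while any(inbox.values()):              # a barrier separates supersteps
    outbox = {v: [] for v in out_edges}
    for v, msgs in inbox.items():       # compute() runs on every vertex with messages
        if not msgs:
            continue
        candidate = min(msgs)
        if candidate < dist[v]:
            dist[v] = candidate
            for w in out_edges[v]:      # send dist + 1 along out-edges
                outbox[w].append(dist[v] + 1)
    inbox = outbox
    superstep += 1

print(superstep, dist)                  # 3 {'B': 0, 'A': 1, 'C': inf, 'D': 2, 'E': 2}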

Page 21: GRAPH PROCESSING

Can we do better?

GOAL vs PREGEL:

Computation: 1 pass
Communication: ∝ #edge cuts
Pre-processing: cheap (hash)
Memory: high (out edges + buffered messages)

Page 22: GRAPH PROCESSING

LFGRAPH – YES, WE CAN!

Hoque, Imranul, and Indranil Gupta. "LFGraph: Simple and Fast Distributed Graph Analytics." TRIOS, 2013.

Page 23: GRAPH PROCESSING


Features

Cheap Vertex Placement: Hash Based

Low graph initialization time

Page 24: GRAPH PROCESSING


Features

Publish-subscribe, fetch-once information flow

Low communication overhead

Page 25: GRAPH PROCESSING


Subscribe

Subscribing to vertex A

Page 26: GRAPH PROCESSING


Publish

Publish list of Server 1: (Server 2, A)
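
A rough Python sketch of the idea behind hash-based placement and publish lists follows; the two-server setup, the hash-based placement rule, and the data structures are illustrative assumptions, not LFGraph's implementation.

# Illustrative sketch: hash-based vertex placement plus publish/subscribe lists.
# The placement rule and data layout are assumptions, not LFGraph's actual code.
NUM_SERVERS = 2
edges = [("B", "A"), ("A", "D"), ("A", "E")]          # directed src -> dst, from the example

def server_of(v):
    # Cheap, pre-processing-free placement. (Python string hashing is randomized
    # per process; a real system would use a stable hash function.)
    return hash(v) % NUM_SERVERS

subscriptions = {s: set() for s in range(NUM_SERVERS)}   # server -> remote vertices it reads
publish_list = {s: set() for s in range(NUM_SERVERS)}    # server -> (dest server, vertex) pairs

for src, dst in edges:
    s_src, s_dst = server_of(src), server_of(dst)
    if s_src != s_dst:
        subscriptions[s_dst].add(src)          # dst's server subscribes to src
        publish_list[s_src].add((s_dst, src))  # src's server pushes src's value once per iteration

print("subscriptions:", subscriptions)
print("publish lists:", publish_list)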

Page 27: GRAPH PROCESSING

LFGraph Model

[Figure: the sample graph split across Server 1 and Server 2; the value of A is sent from Server 1 to Server 2]

Page 28: GRAPH PROCESSING


Features

Only stores in-neighbor vertices

Reduces memory footprint

Page 29: GRAPH PROCESSING


In-neighbor storage

Local in-neighbor – simply read the value

Remote in-neighbor – read locally available value
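
As a small illustration of this read path (all names and values here are assumed, not taken from LFGraph's code), remote in-neighbor values are served from a local duplicate store that the communication phase refreshes between iterations:

# Illustrative read path: local in-neighbors are read directly, remote in-neighbors
# from a local duplicate store refreshed by communication between iterations.
import math

local_values = {"A": 1.0, "D": math.inf}      # vertices hosted on this server
duplicate_store = {"B": 0.0}                  # remote in-neighbor values fetched last iteration
in_neighbors = {"A": ["B"], "D": ["A"]}       # in-edges of the local vertices

def read_value(u):
    # No network access during computation: every read is served locally.
    return local_values[u] if u in local_values else duplicate_store[u]

for v in in_neighbors:
    local_values[v] = min([local_values[v]] + [read_value(u) + 1 for u in in_neighbors[v]])

print(local_values)                           # {'A': 1.0, 'D': 2.0}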

Page 30: GRAPH PROCESSING

[Figure: vertex values B=0, A=∞, C=∞, D=∞, E=∞]

Iteration 0

Page 31: GRAPH PROCESSING

[Figure: vertex values B=0, A=∞, C=∞, D=∞, E=∞]

Iteration 1: read value 0, read value ∞; A becomes 1

The changed value of A is recorded in the duplicate store

Page 32: GRAPH PROCESSING

[Figure: vertex values B=0, A=1, C=∞, D=∞, E=∞]

Iteration 2: local reads of A; D becomes 2, E becomes 2

Page 33: GRAPH PROCESSING


Features

Single Pass Computation

Low computation overhead

Page 34: GRAPH PROCESSING

Life of a Vertex Program

[Figure: timeline of a vertex program: placement of vertices, then per-iteration computation and communication phases separated by barriers]

*Concept borrowed from LFGraph slides

Page 35: GRAPH PROCESSING

How Everything Works

Page 36: GRAPH PROCESSING
Page 37: GRAPH PROCESSING

GRAPHLAB

Low, Yucheng, et al. "GraphLab: A New Framework for Parallel Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

Page 38: GRAPH PROCESSING

GraphLab Model

[Figure: the sample graph split across two servers, with ghost copies of A, D, and E on the remote server]

Page 39: GRAPH PROCESSING

Can we do better?

GOAL vs GRAPHLAB:

Computation: 2 passes
Communication: ∝ #vertex ghosts
Pre-processing: cheap (hash)
Memory: high (in & out edges + ghost values)

Page 40: GRAPH PROCESSING

POWERGRAPH

Gonzalez, Joseph E., et al. "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." USENIX OSDI, 2012.

Page 41: GRAPH PROCESSING

PowerGraph Model

[Figure: the sample graph with vertex A split into mirrors A1 and A2 across servers]

Page 42: GRAPH PROCESSING

Can we do better?

GOAL vs POWERGRAPH:

Computation: 2 passes
Communication: ∝ #vertex mirrors
Pre-processing: expensive (intelligent)
Memory: high (in & out edges + mirror values)

Page 43: GRAPH PROCESSING

Communication Analysis

Pregel: external messages on edge cuts

GraphLab: ghost vertices (in- and out-neighbors)

PowerGraph: mirrors (in- and out-neighbors)

LFGraph: external in-neighbors only

Page 44: GRAPH PROCESSING

Computation Balance Analysis

• Power-law graphs have substantial load imbalance.

• A power-law graph has vertices of degree d with probability proportional to d^(-α).

• A lower α means a denser graph with more high-degree vertices.
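
To make the imbalance concrete, the following Python sketch samples a power-law degree sequence for two values of α and hash-partitions the vertices across servers; the vertex count, α values, and server count are arbitrary illustrations, not the paper's experimental setup.

# Illustrative: sample degrees with P(d) proportional to d^(-alpha) and measure the
# load imbalance of hash partitioning. Parameters are arbitrary, not the paper's setup.
import random

def sample_degrees(n, alpha, d_max=10_000):
    weights = [d ** (-alpha) for d in range(1, d_max + 1)]
    return random.choices(range(1, d_max + 1), weights=weights, k=n)

def imbalance(degrees, servers=16):
    load = [0] * servers
    for v, d in enumerate(degrees):
        load[hash(v) % servers] += d          # per-vertex work is proportional to its degree
    return max(load) / (sum(load) / servers)  # busiest server vs average server

random.seed(0)
for alpha in (1.8, 2.2):
    degrees = sample_degrees(100_000, alpha)
    print(f"alpha = {alpha}: max/avg server load = {imbalance(degrees):.2f}")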

Page 45: GRAPH PROCESSING

Computation Balance Analysis

Page 46: GRAPH PROCESSING

Computation Balance Analysis

Page 47: GRAPH PROCESSING

Real World vs Power Law

Page 48: GRAPH PROCESSING

Communication Balance Analysis

Page 49: GRAPH PROCESSING

PageRank – Runtime w/o partition

Page 50: GRAPH PROCESSING

PageRank – Runtime with partition

Page 51: GRAPH PROCESSING

PageRank – Memory footprint

Page 52: GRAPH PROCESSING

PageRank – Network Communication

Page 53: GRAPH PROCESSING

Scalability

Page 54: GRAPH PROCESSING

X-Stream: Edge-Centric Graph Processing Using Streaming Partitions

*Some figures adapted from the authors' presentation

Page 55: GRAPH PROCESSING

Motivation

• Can sequential access be used instead of random access?

• Can large graph processing be done on a single machine?

X-Stream

Page 56: GRAPH PROCESSING

Sequential Access: Key to Performance!

Speedup of sequential access over random access in different media:

Medium          Random read (MB/s)   Sequential read (MB/s)   Speedup   Random write (MB/s)   Sequential write (MB/s)   Speedup
RAM (1 core)    567                  2605                     4.6       1057                  2248                      2.2
RAM (16 core)   14198                25658                    1.9       10044                 13384                     1.4
SSD             22.5                 667.69                   29.7      48.6                  576.5                     11.9
Magnetic disk   0.6                  328                      546.7     2                     316.3                     158.2

Test bed: 64 GB RAM + 200 GB SSD + 3 TB magnetic disk
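
A rough way to reproduce the flavor of these numbers on any machine is a micro-benchmark like the Python sketch below; the file size, block size, and resulting rates are machine-dependent assumptions, while the table above reflects the authors' test bed.

# Rough micro-benchmark: sequential vs random 4 KB reads over a ~100 MB scratch file.
# Results depend on the machine and OS page cache; this only illustrates the idea.
import os, random, time

PATH, BLOCK, BLOCKS = "scratch.bin", 4096, 25_000
with open(PATH, "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

def bench(offsets):
    with open(PATH, "rb") as f:
        start = time.perf_counter()
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
        elapsed = time.perf_counter() - start
    return BLOCK * len(offsets) / elapsed / 1e6        # MB/s

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = sequential[:]
random.shuffle(shuffled)
print(f"sequential: {bench(sequential):.0f} MB/s, random: {bench(shuffled):.0f} MB/s")
os.remove(PATH)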

Page 57: GRAPH PROCESSING

How to Use Sequential Access?

Sequential access …

Edge-Centric Processing

Page 58: GRAPH PROCESSING

Vertex-Centric Scatter

for each vertex v:
    if v's state has updated:
        for each output edge e of v:
            scatter update on e

[Figure: a vertex U{state} emits updates u1, u2, ..., un on its output edges]

Page 59: GRAPH PROCESSING

Vertex-Centric Gather

for each vertex v:
    for each input edge e of v:
        if e has an update:
            apply the update to v's state

[Figure: incoming updates arrive on the input edges of V{state}, producing V{state2}]
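
The two pseudocode fragments above can be combined into one runnable example. The following Python sketch runs vertex-centric scatter/gather BFS on the 8-vertex example graph shown on the next slide; it is a simplified in-memory illustration and keeps the per-vertex edge lookups (the index, and hence random access) that motivate the edge-centric alternative.

# Vertex-centric scatter/gather BFS levels, following the pseudocode above.
# Simplified in-memory illustration; note the per-vertex out-edge lookups.
import math

out_edges = {1: [3, 5], 2: [7, 4], 3: [2, 8], 4: [3, 7, 8],
             5: [6], 6: [1], 7: [], 8: [5, 6]}        # the 8-vertex example graph
state = {v: math.inf for v in out_edges}
state[1] = 0                                          # BFS source
updated = {1}

while updated:
    # Scatter: each updated vertex writes an update on each of its output edges.
    updates = {}                                      # destination -> best incoming level
    for v in updated:
        for w in out_edges[v]:                        # random access via the vertex index
            updates[w] = min(updates.get(w, math.inf), state[v] + 1)
    # Gather: each vertex applies the updates on its input edges.
    updated = set()
    for w, level in updates.items():
        if level < state[w]:
            state[w] = level
            updated.add(w)

print(state)    # {1: 0, 2: 2, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3, 8: 2}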

Page 60: GRAPH PROCESSING

Vertex-Centric BFS

[Figure: BFS on an 8-vertex example graph; a SOURCE→DEST edge list (1→3, 1→5, 2→7, 2→4, 3→2, 3→8, 4→3, 4→7, 4→8, 5→6, 6→1, 8→5, 8→6), a vertex array 1–8, and a lookup index mapping each vertex to its edges]

Page 61: GRAPH PROCESSING

Edge-Centric Scatter

for each edge e:
    if e.source has updated:
        scatter update on e

[Figure: the edge list is streamed; edges whose source vertex (A, B, C) has updated produce updates u1, ..., un]

Page 62: GRAPH PROCESSING

Edge-Centric Gather

for each update u on edge e:
    apply update u to e.destination

[Figure: updates u1, u2, ..., un are streamed and applied to their destination vertices X, Y, Z]
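
The same BFS restated edge-centrically: every scatter pass streams the full, unsorted edge list, trading wasted edge reads for purely sequential access. A minimal in-memory Python sketch for illustration:

# Edge-centric scatter/gather BFS: stream the whole edge list in each scatter pass.
# Edge reads are sequential; the cost is touching edges whose source has not changed.
import math

edges = [(1, 3), (1, 5), (2, 7), (2, 4), (3, 2), (3, 8), (4, 3), (4, 7),
         (4, 8), (5, 6), (6, 1), (8, 5), (8, 6)]      # unsorted order is fine
state = {v: math.inf for v in range(1, 9)}
state[1] = 0
changed = {1}

while changed:
    updates = []
    for src, dst in edges:                            # scatter: one sequential pass over all edges
        if src in changed:
            updates.append((dst, state[src] + 1))
    changed = set()
    for dst, level in updates:                        # gather: one sequential pass over the updates
        if level < state[dst]:
            state[dst] = level
            changed.add(dst)

print(state)    # same result as the vertex-centric version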

Page 63: GRAPH PROCESSING

Sequential Access via Edge-Centric!

[Figure: vertex state kept in fast storage; edges and updates streamed from slow storage]

Page 64: GRAPH PROCESSING

Fast and Slow Storage

Page 65: GRAPH PROCESSING

Edge-Centric BFS

[Figure: the same 8-vertex graph and SOURCE→DEST edge list, processed edge-centrically]

Lots of wasted reads!

Most real-world graphs have a small diameter

A large diameter makes X-Stream slow and wasteful

Page 66: GRAPH PROCESSING

[Figure: the SOURCE→DEST edge list and a permutation of it are equivalent inputs]

Order is not important

No pre-processing (sorting and indexing) needed!

Page 67: GRAPH PROCESSING

But, still …

• Random access for vertices

• Vertices may not fit into fast storage

Streaming Partitions

Page 68: GRAPH PROCESSING

Streaming Partitions

V = a subset of the vertices
E = the outgoing edges of V
U = the incoming updates to V

The vertex sets of different partitions are mutually disjoint; V and E are constant, while U changes in each scatter phase.
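
A small Python sketch of how such partitions could be built; the partition rule and partition count are illustrative assumptions:

# Illustrative construction of streaming partitions: disjoint vertex subsets, each
# holding the out-edges of its vertices; update lists are refilled in every iteration.
edges = [(1, 3), (1, 5), (2, 7), (2, 4), (3, 2), (3, 8), (4, 3), (4, 7),
         (4, 8), (5, 6), (6, 1), (8, 5), (8, 6)]
NUM_PARTITIONS = 2            # chosen so each partition's vertex set fits in fast memory

def partition_of(v):
    return v % NUM_PARTITIONS                     # any cheap, stable rule; no sorting or indexing

partitions = [{"V": set(), "E": [], "U": []} for _ in range(NUM_PARTITIONS)]
for v in range(1, 9):
    partitions[partition_of(v)]["V"].add(v)       # V: constant and mutually disjoint
for src, dst in edges:
    partitions[partition_of(src)]["E"].append((src, dst))   # E: out-edges of V, constant
# U (incoming updates to V) starts empty and is refilled by the shuffle in each scatter phase.

for i, p in enumerate(partitions):
    print(f"partition {i}: V={sorted(p['V'])}, |E|={len(p['E'])}")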

Page 69: GRAPH PROCESSING

Scatter and Shuffle

[Figure: each partition i has (Vi, Ei, Ui); during scatter, Vi is loaded into fast memory, the edge list Ei is streamed through an input buffer, the source vertex of each edge is read, and updates are appended to an output buffer; the shuffle then appends each update to the update list U of the partition owning its destination]

Page 70: GRAPH PROCESSING

Shuffle

Stream Buffer with k partitions

Page 71: GRAPH PROCESSING

Gather

[Figure: during gather, partition i's vertex set Vi is loaded into fast memory and its update list Ui is streamed through the update buffer; each update is applied to its destination vertex; no output is produced]
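
Putting scatter, shuffle, and gather together, the following Python sketch runs BFS over two streaming partitions. It is an in-memory stand-in for illustration only: real X-Stream streams the edge and update lists from slow storage and keeps only the current partition's vertices in fast memory.

# Condensed in-memory stand-in for the X-Stream loop over streaming partitions (BFS).
# Real X-Stream streams E and U from slow storage; this sketch keeps everything in RAM.
import math

edges = [(1, 3), (1, 5), (2, 7), (2, 4), (3, 2), (3, 8), (4, 3), (4, 7),
         (4, 8), (5, 6), (6, 1), (8, 5), (8, 6)]
K = 2                                               # number of streaming partitions
part = lambda v: v % K
E = [[(s, d) for s, d in edges if part(s) == p] for p in range(K)]   # out-edges per partition
state = {v: math.inf for v in range(1, 9)}
state[1] = 0
changed = {1}

while changed:
    U = [[] for _ in range(K)]                      # update lists, rebuilt every iteration
    for p in range(K):                              # scatter + shuffle, one partition at a time
        for src, dst in E[p]:                       # stream this partition's edge list
            if src in changed:
                U[part(dst)].append((dst, state[src] + 1))   # shuffle to the owner of dst
    changed = set()
    for p in range(K):                              # gather, one partition at a time
        for dst, level in U[p]:                     # stream this partition's update list
            if level < state[dst]:
                state[dst] = level
                changed.add(dst)

print(state)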

Page 72: GRAPH PROCESSING

Parallelism

• State is stored in vertices
• Partitions have disjoint vertex sets

Compute partitions in parallel: parallel scatter and gather

Page 73: GRAPH PROCESSING

Experimental Results

Page 74: GRAPH PROCESSING

X-Stream Speedup over Graphchi

[Figure: speedup bars for Netflix/ALS, Twitter/PageRank, Twitter/Belief Propagation, and RMAT27/WCC]

Mean speedup = 2.3

Speedup without considering Graphchi's pre-processing time

Page 75: GRAPH PROCESSING

X-Stream Speedup over Graphchi

[Figure: speedup bars for Netflix/ALS, Twitter/PageRank, Twitter/Belief Propagation, and RMAT27/WCC]

Mean speedup = 3.7

Speedup considering Graphchi's pre-processing time

Page 76: GRAPH PROCESSING

X-Stream Runtime vs Graphchi Sharding

[Figure: time in seconds for Netflix/ALS, Twitter/PageRank, Twitter/Belief Propagation, and RMAT27/WCC, comparing Graphchi's sharding (pre-processing) time with X-Stream's runtime]

Page 77: GRAPH PROCESSING

Disk Transfer Rates

Metric          X-Stream      Graphchi
Data moved      224 GB        322 GB
Time taken      398 seconds   2613 seconds
Transfer rate   578 MB/s      126 MB/s

SSD sustained reads = 667 MB/s, writes = 576 MB/s

Data transfer rates for the PageRank algorithm on the Twitter workload

Page 78: GRAPH PROCESSING

Scalability on Input Data Size

[Figure: runtime (HH:MM:SS) vs input size from 384 MB to 1.5 TB, spanning RAM, SSD, and magnetic disk]

8 million V, 128 million E: 8 sec
256 million V, 4 billion E: 33 mins
4 billion V, 64 billion E: 26 hours

Page 79: GRAPH PROCESSING

Discussion

• Features like global values, aggregation functions, and asynchronous computation are missing from LFGraph. Will the overhead of adding these features slow it down?

• LFGraph assumes that all edge values are the same. If they are not, either the receiving vertices or the server will have to incorporate those values. What are the overheads?

• LFGraph uses one-pass computation, but it executes the vertex program at every vertex, active or inactive. What is the trade-off?

Page 80: GRAPH PROCESSING

Discussion

• Independent computation and communication rounds may not always be preferred. Why not use bandwidth when it is available?

• Fault tolerance is another feature missing from LFGraph. What are the overheads?

• Only three benchmarks are used in the experiments. Is that enough evaluation?

• The scalability comparison with Pregel uses different experimental settings, and the memory comparison with PowerGraph is based on heap values from logs. Are these fair experiments?

Page 81: GRAPH PROCESSING

Discussion

• Could the system become asynchronous?
• Could the scatter and gather phases be combined into one phase?
• Iterating over the edges/updates of a single vertex is not supported. Can this be added?
• How well do they determine the number of partitions?
• Can the shuffle be optimized by counting the updates for each partition during scatter?

Page 82: GRAPH PROCESSING

Thank you for listening!

Questions?

Page 83: GRAPH PROCESSING

Backup Slides

Page 84: GRAPH PROCESSING

Reason for Improvement

Page 85: GRAPH PROCESSING

Qualitative Comparison

GOAL             PREGEL                                 GRAPHLAB                                POWERGRAPH                               LFGRAPH
Computation      2 passes, combiners                    2 passes                                2 passes                                 1 pass
Communication    ∝ #edge cuts                           ∝ #vertex ghosts                        ∝ #vertex mirrors                        ∝ #external in-neighbors
Pre-processing   Cheap (hash)                           Cheap (hash)                            Expensive (intelligent)                  Cheap (hash)
Memory           High (out edges + buffered messages)   High (in & out edges + ghost values)    High (in & out edges + mirror values)    Low (in edges + remote values)

Page 86: GRAPH PROCESSING

Backup Slides

Page 87: GRAPH PROCESSING

Read Bandwidth - SSD

[Figure: read bandwidth in MB/s (0 to 1000) over a 5-minute window, X-Stream vs Graphchi]

Page 88: GRAPH PROCESSING

Write Bandwidth - SSD

[Figure: write bandwidth in MB/s (0 to 800) over a 5-minute window, X-Stream vs Graphchi]

Page 89: GRAPH PROCESSING

Scalability on Thread Count

Page 90: GRAPH PROCESSING

Scalability on Number of I/O Devices

Page 91: GRAPH PROCESSING

Sharding-Computing Breakdown in Graphchi

[Figure: Graphchi runtime breakdown for Netflix/ALS, Twitter/PageRank, Twitter/Belief Propagation, and RMAT27/WCC, as the fraction of runtime spent in compute + I/O vs re-sorting shards]

Page 92: GRAPH PROCESSING

X-Stream not Always Perfect

Page 93: GRAPH PROCESSING

A large diameter makes X-Stream slow!

Page 94: GRAPH PROCESSING

In-Memory X-Stream Performance

[Figure: BFS (32M vertices / 256M edges) runtime in seconds vs thread count (1, 2, 4, 8, 16) for BFS-1 [HPC 2010], BFS-2 [PACT 2011], and X-Stream; lower is better]

Page 95: GRAPH PROCESSING

Ligra vs. X-Stream

Page 96: GRAPH PROCESSING

Discussion

• The current implementation is on a single machine; can it be extended to clusters?
  – Would it still perform well?
  – How would it provide fault tolerance and synchronization?
• The waste rate is high (~65%). Could this be improved?
• Could the partitioning be more intelligent? Dynamic partitioning?
• Could all vertex-centric programs be converted to edge-centric ones?
• When does streaming outperform random access?