Source: alchem.usc.edu/ceng-seminar/slides/2016/rajiv_gupta.pdf
Parallel Graph Processing on GPUs, Clusters, and Multicores
Farzad Khorasani, Keval Vora, Rajiv Gupta
Graph Processing: Graphs & Analytics
- Web Graphs: PageRank (rank websites in search results); Belief Propagation (find malicious domains & infected hosts)
- Social Networks: Community Detection (common interests); Betweenness Centrality (critical nodes)
Characteristics: graphs have large numbers of nodes & edges; algorithms are iterative and convergence-based
→ Exploit Data Parallelism
Parallel Graph Processing
GPUs (VWC, Medusa, Totem, etc.)
  + Massive parallelism; power efficiency
  – Limited device memory & bandwidth
Clusters (GraphLab, GraphX, etc.)
  + Scalability: memory & cores across nodes
  – High communication latency
Multicores (GraphChi, X-Stream, etc.)
  + Efficient parallelism
  – Limited memory; storage system
Graphs on GPUs: Suitable for Data-Parallel Computations
NVIDIA GeForce GTX 780: 12 Streaming Multiprocessors (SMs); 64 warps/SM; 32 threads/warp
Limitations:
- SIMD limitation on warps
- Limited DRAM capacity – 3 GB
- Limited PCIe bandwidth – 12 GB/s
Challenges – Graphs on GPUs: Low SIMD-Efficiency on Irregular Graphs
Pokec: 30 M edges, 1.6 M vertices
LiveJournal: 69 M edges, 5 M vertices
Virtual Warp Centric (VWC) [Hong et al., PPoPP 2011] → warp utilization ~40%
Challenges – Graphs on GPUs: Scalability for Large Graphs
Limited DRAM capacity → multiple GPUs → limited PCIe bandwidth → low communication efficiency
Medusa – 4 GPUs [Zhong & He, IEEE TPDS 2014] → communication efficiency 9–15%
Totem – 2 GPUs [Gharaibeh et al., PACT 2012] → communication efficiency 11–18%
Our Approach
Warp Segmentation: a variable number of threads is given to each vertex to process its incoming edges → high SIMD-efficiency
Vertex Refinement: eliminates unnecessary communication → high communication efficiency
Iterative Graph Processing: Single Source Shortest Path (SSSP)
[Figure: a 5-vertex example graph with source V0. Each iteration recomputes a vertex's distance from its incoming neighbors, e.g., new value = min(V1 + 3, V2 + 5) for a vertex with incoming edges from V1 (weight 3) and V2 (weight 5).]

Iteration   V0   V1   V2   V3   V4
    0        0    ∞    ∞    ∞    ∞
    1        0    4    2    ∞    ∞
    2        0    3    2    7    5
    3        0    3    2    5    6
Compressed Sparse Row (CSR) Graph Representation
[Figure: the same example graph with its 8 edges labeled a–h.]

E   (neighbor index per edge, for edges b c d e f g h a):  0 4 2 0 1 2 2 4
PTR (per-vertex offsets into E):                           0 1 4 5 7 8
V   (vertex values, indices 0–4):                          V0 V1 V2 V3 V4
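To make the CSR layout concrete, here is a minimal one-thread-per-vertex SSSP relaxation sweep in CUDA. It is a sketch in the style of the 1-thread-per-vertex baseline cited on the next slide, not the authors' kernel; the weights array and all identifiers are illustrative assumptions.

    #include <cuda_runtime.h>

    // One SSSP relaxation sweep over a CSR graph with incoming-edge lists.
    // ptr[v] .. ptr[v+1] delimit vertex v's incoming edges; edges[e] holds
    // the neighbor index of edge e and weights[e] its (assumed) weight.
    __global__ void ssspIteration(const int* ptr, const int* edges,
                                  const float* weights, const float* dist,
                                  float* newDist, int nVertices) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= nVertices) return;
        float best = dist[v];
        for (int e = ptr[v]; e < ptr[v + 1]; ++e)  // serial loop per vertex:
            best = fminf(best, dist[edges[e]] + weights[e]);
        newDist[v] = best;  // lanes with different degrees finish at different
    }                       // times, which is exactly the SIMD inefficiency below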
Existing Works – Static Assignment
Each vertex is assigned the same (fixed) number of threads:
- [Harish & Narayanan, HiPC 2007] – 1 thread
- [Hong et al., PPoPP 2011] – a power-of-2 number of threads
- [Kim & Batten, MICRO 2014] – a thread block
→ Leads to SIMD inefficiency
Static – too few / too many threads
[Figure: V0 has 6 incoming edges (N0–N5) and V1 has 2 (N6–N7), with PTR = 0 6 8. With a static 4 threads per vertex, lanes 0–3 need two rounds to cover V0's edges while lanes 4–7 idle after covering V1's two; the lane-by-lane timeline shows the resulting gaps in the per-lane C, R, and RF steps.]
Warp Segmentation
Assign a warp to a vertex group.
[Figure: the same example; all 8 edges N0–N7 map onto lanes 0–7 in a single pass, and the R/RF reduction steps are confined to each vertex's segment, so no lane idles.]
Warp Segmentation: each lane binary-searches its E item index in PTR

Edge (lane)                     N0    N1    N2    N3    N4    N5    N6    N7
Search window narrowing        [0,8] [0,8] [0,8] [0,8] [0,8] [0,8] [0,8] [0,8]
                               [0,4] [0,4] [0,4] [0,4] [0,4] [0,4] [0,4] [0,4]
                               [0,2] [0,2] [0,2] [0,2] [0,2] [0,2] [0,2] [0,2]
                               [0,1] [0,1] [0,1] [0,1] [0,1] [0,1] [1,2] [1,2]
Belonging vertex index           0     0     0     0     0     0     1     1
Index inside segment, right      5     4     3     2     1     0     1     0
Index inside segment, left       0     1     2     3     4     5     0     1
Segment size                     6     6     6     6     6     6     2     2

[Figure: PTR = 0 6 8; V0 owns edges N0–N5 and V1 owns N6–N7.]
Warp Segmentation (cont.)
Worked example for edge N3 (PTR = 0 6 8):
- Binary search narrows [0,8] → [0,4] → [0,2] → [0,1]; belonging vertex index = 0
- Index inside segment from left: 3 - 0 = 3
- Segment size: 6 (check: right + left + 1 = 2 + 3 + 1)
- Index inside segment from right: 6 - 3 - 1 = 2
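The per-lane lookup can be written as a short binary search over PTR. The sketch below mirrors the table's arithmetic; it is illustrative, not the paper's exact code.

    // Locate the vertex owning edge eid: PTR has nVertices+1 entries and
    // PTR[v] <= eid < PTR[v+1] identifies the segment.
    __device__ void locateSegment(const int* ptr, int nVertices, int eid,
                                  int& vertex, int& fromLeft,
                                  int& fromRight, int& segSize) {
        int lo = 0, hi = nVertices;      // invariant: ptr[lo] <= eid < ptr[hi]
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (ptr[mid] <= eid) lo = mid; else hi = mid;
        }
        vertex    = lo;
        fromLeft  = eid - ptr[lo];           // N3: 3 - 0 = 3
        segSize   = ptr[lo + 1] - ptr[lo];   // N3: 6 - 0 = 6 (= 2 + 3 + 1)
        fromRight = segSize - fromLeft - 1;  // N3: 6 - 3 - 1 = 2
    }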
Efficiency of Warp Segmentation
- No shared-memory atomics for reduction
- No synchronization primitives used
- All memory accesses are coalesced, except accesses to neighbors' values
- Exploits instruction-level parallelism
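The per-segment reduction can then be done entirely with warp shuffles, which is one way to realize the "no shared-memory atomics, no synchronization" properties above. A sketch using the post-CUDA-9 __shfl_down_sync intrinsic (the GTX 780-era equivalent was __shfl_down); a full active warp is assumed:

    // Segmented min-reduction within a warp. Each lane holds one edge's
    // candidate value (val) and fromRight, its index-from-the-right inside
    // its vertex's segment (computed above). After the loop, each segment's
    // leftmost lane (fromLeft == 0) holds the segment minimum.
    __device__ float segmentedMin(float val, int fromRight) {
        for (int offset = 1; offset < 32; offset <<= 1) {
            float other = __shfl_down_sync(0xffffffffu, val, offset);
            if (offset <= fromRight)     // source lane is inside my segment
                val = fminf(val, other);
        }
        return val;
    }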
Experimental Setup
GPU: NVIDIA GeForce GTX 780

Programs:
BFS   Breadth-First Search
CC    Connected Components
NN    Neural Network
PR    PageRank
SSSP  Single Source Shortest Path
SSWP  Single Source Widest Path

Graphs (#V, #E):
RM33V335E    33 M, 335 M
Orkut        3.07 M, 234 M
LiveJournal  4.85 M, 69 M
SocPokec     1.63 M, 30.6 M
RoadNetCA    1.97 M, 5.5 M
Amazon0312   0.40 M, 3.2 M
Speedups over VWC

Average for programs:
BFS   1.27x – 2.60x
CC    1.33x – 2.90x
NN    1.21x – 2.70x
PR    1.22x – 2.68x
SSSP  1.31x – 2.76x
SSWP  1.28x – 2.80x

Average for graphs:
RM33V335E    1.23x – 1.56x
Orkut        1.15x – 1.99x
LiveJournal  1.29x – 1.99x
SocPokec     1.27x – 1.77x
RoadNetCA    1.24x – 9.90x
Amazon0312   1.53x – 2.68x
Warp Execution Efficiency (SSSP)
[Figure: warp execution efficiency (%, 0–80) of VWC-2/4/8/16/32 vs. Warp Segmentation on RM33V335E, ComOrkut, ER25V201E, RM25V201E, RM16V201E, RM16V134E, LiveJournal, SocPokec, HiggsTwitter, RoadNetCA, WebGoogle, and Amazon0312.]
Scalability – Multiple GPUs
[Figure: a 7-vertex graph (V0–V6) to be partitioned across multiple devices.]

Bandwidth (GB/s):  PCIe ~12   GPU DRAM ~300
Scalability – Multiple GPUs: partitioned CSR
[Figure: the graph is split so Device 0 owns V0, V1, V2, V5, V6 and Device 1 owns V3, V4; each device stores the V, PTR, and E arrays for its own partition, with edges a–g distributed accordingly.]
Medusa – ALL Method [IEEE TPDS 2014]
[Figure: after each iteration, each device ships all of its vertices to the other device, whether or not their values changed.]
→ Communication efficiency 9–15%
Totem – Maximal Subset (MS) [PACT 2012]
[Figure: only the maximal subset of boundary vertices (e.g., V2, V3, V4) is exchanged, but it is sent every iteration whether or not those vertices were updated.]
→ Communication efficiency 11–18%
Vertex Refinement (VR)
- Offline Vertex Refinement: marks boundary vertices ahead of time
- Online Vertex Refinement: performed on-the-fly by the CUDA kernel, looking for updates in values
Online Vertex Refinement
Each warp compacts the vertices that are both updated and marked into the device outbox:

Thread                         T#0  T#1  T#2  T#3  T#4  T#5  T#6  T#7
Vertex                         V0   V1   V2   V3   V4   V5   V6   V7
Is (Updated & Marked)          Y    N    N    Y    N    N    N    Y
Binary reduction → count       3    3    3    3    3    3    3    3
Reserve outbox region:         A = atomicAdd( deviceOutboxMovingIndex, 3 );
Shuffle A to all lanes         A    A    A    A    A    A    A    A
Binary prefix sum → rank       0    1    1    1    2    2    2    2
Fill outbox region:            O[A+0] = V0;  O[A+1] = V3;  O[A+2] = V7

Device outbox: V0 V3 V7
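In warp-intrinsic terms the compaction above might be sketched as follows; deviceOutboxMovingIndex follows the slide, while everything else (names, a value-only outbox, a full active warp) is an illustrative assumption:

    // Warp-synchronous outbox compaction for online vertex refinement.
    __device__ void pushIfUpdated(float myValue, bool updatedAndMarked,
                                  float* outbox, int* deviceOutboxMovingIndex) {
        unsigned mask = __ballot_sync(0xffffffffu, updatedAndMarked);
        int count = __popc(mask);          // binary reduction: # marked lanes
        int lane  = threadIdx.x & 31;
        int a = 0;
        if (lane == 0 && count > 0)        // reserve outbox region, once per warp
            a = atomicAdd(deviceOutboxMovingIndex, count);
        a = __shfl_sync(0xffffffffu, a, 0);            // shuffle A to all lanes
        int rank = __popc(mask & ((1u << lane) - 1));  // binary prefix sum
        if (updatedAndMarked)
            outbox[a + rank] = myValue;    // fill outbox region
    }

Vertex indices would be written to a parallel outbox array in the same way.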
Buffers – Inbox and Outbox: Double Buffering, Host-as-Hub
[Figure: GPUs #0, #1, #2 hold V/PTR/E partitions of M+1, N+1, and P+1 vertices; each fills an outbox of up to M, N, or P (value, index) pairs that travels over the PCIe lanes into per-device inboxes staged in host memory. Odd and even copies of the inboxes (double buffering) let one iteration's communication overlap the next iteration's computation.]
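One plausible realization of the host-as-hub exchange, sketched with CUDA streams and async copies; the buffer layout and the iteration-parity trick are assumptions, not the authors' implementation:

    #include <cuda_runtime.h>

    // Ship each device's outbox into host-pinned inbox staging buffers.
    // Odd/even host copies (double buffering) let iteration i+1 compute
    // while iteration i's boxes are still in flight. Error checks omitted.
    void exchangeOutboxes(int nGpus, float** devOutbox, const int* outboxLen,
                          float*** hostInbox /* [2][nGpus] */,
                          cudaStream_t* stream, int iter) {
        float** inbox = hostInbox[iter & 1];   // pick the odd or even buffer
        for (int g = 0; g < nGpus; ++g) {
            cudaSetDevice(g);
            cudaMemcpyAsync(inbox[g], devOutbox[g],
                            outboxLen[g] * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[g]);
        }
        // Each device later pulls the other devices' inboxes host-to-device
        // on the same streams before the next kernel launch consumes them.
    }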
Speedup – VR vs. MS & ALL
(top two graphs on 3 GPUs, bottom two on 2 GPUs)

Input graph (#V, #E)          BFS     PR      SSSP
RM54V704E (54 M, 704 M)
  over ALL                    1.85x   1.48x   1.75x
  over MS                     1.82x   1.47x   1.71x
RM41V536E (42 M, 537 M)
  over ALL                    1.89x   1.44x   1.80x
  over MS                     1.82x   1.39x   1.75x
RM41V503E (42 M, 503 M)
  over ALL                    1.29x   1.18x   1.24x
  over MS                     1.27x   1.15x   1.21x
RM35V402E (36 M, 403 M)
  over ALL                    1.32x   1.21x   1.25x
  over MS                     1.30x   1.20x   1.22x
VR vs. MS & ALL (3 GPUs)
[Figure: normalized aggregated processing time for ALL, MS, and VR under BFS, PR, and SSSP, split into aggregated computation duration and aggregated communication duration.]
Irregular Computations
Graphs on GPUs: [PACT 2015a] Warp Segmentation & Vertex Refinement; [HPDC 2014] CuSha with CW graph representation
Hashing on GPUs: [PACT 2015b] Stadium Hashing vs. Cuckoo Hashing
Generalization and Automation: [MICRO 2015] Branch Divergence (CCC); [IPDPS 2016] Thread Assignment (CTE)
Graphs on Clusters: Scalability for Large Graphs
Partition the graph across machines; exploit memory and cores across all nodes.
Challenges:
- Programmability – writing distributed programs
- Performance – network latency is high
Our Approach
Programmability: ASPIRE Distributed Shared Memory (DSM), e.g., Fetch(c); Fetch(a); Fetch(b); c' = f(c, a, b); Store(c, c')
Performance: a caching protocol to tolerate network latency
TARDIS: 16-node cluster with InfiniBand; a remote fetch is 2.3x slower than a local one
Relaxed Consistency Protocol (RCP)
Allow the use of stale values to tolerate network latency.
Staleness of a value: the number of updates performed on it since the cached copy was fetched.
But using stale values slows convergence → relax consistency without delaying convergence.
Bounded Staleness [Cui et al., USENIX ATC 2014]
Controls staleness using a static threshold.
[Figure: iterations to convergence vs. number of remote fetches. A high threshold sits at few fetches but many iterations; a low threshold at many fetches but few iterations. Where in between should the threshold sit?]
Relaxed Consistency Protocol
Set a high threshold: fetches are allowed to use stale values → minimizes the impact of network latency.
Best-effort refresh: minimizes the use of stale values and the staleness of the stale values that are used.
Relaxed Consistency Protocol: access classification (threshold t)
- Cache-miss: value not in cache
- Current-hit: value in cache; staleness = 0
- Stale-hit: value in cache; 0 < staleness ≤ t → issue a refresh request
- Stale-miss: value in cache; staleness > t
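In host-side C++ terms, the classification might be sketched as follows (illustrative only, not ASPIRE's code):

    enum class Access { CacheMiss, CurrentHit, StaleHit, StaleMiss };

    // Classify an access under RCP with staleness threshold t. Cache-misses
    // and stale-misses block on a remote fetch; a stale-hit returns the stale
    // value immediately and issues a best-effort refresh in the background.
    Access classify(bool cached, int staleness, int t) {
        if (!cached)        return Access::CacheMiss;   // must fetch
        if (staleness == 0) return Access::CurrentHit;  // use as-is
        if (staleness <= t) return Access::StaleHit;    // use + async refresh
        return Access::StaleMiss;                       // too stale: fetch
    }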
Relaxed Consistency Protocol: per-value state machine
[State diagram over Uncached, Shared, and Stale:]
- Uncached → Shared: cache-miss / write [local node]; staleness = 0
- Shared → Shared: hit / write [local node]
- Shared → Stale: invalidate [directory]; ++staleness
- Stale → Stale: stale-hit [local node]; invalidate [directory]; ++staleness
- Stale → Shared: refresh [local node] or stale-miss [local node]; staleness = 0
- Shared, Stale → Uncached: evict [local node]
Experimental Setup
Tardis – 16-node cluster

Programs:
SSSP  Single Source Shortest Path
CC    Connected Components
CD    Community Detection
PR    PageRank
GC    Graph Coloring

Graphs (#V, #E):
Orkut        3.07 M, 234 M
LiveJournal  4.85 M, 69 M
Pokec        1.63 M, 30.6 M
HiggsTwitter 0.46 M, 14.8 M
RoadNetCA    1.97 M, 5.5 M
RoadNetTX    1.38 M, 3.84 M
Execution Time: RCP over SCP

              CD     CC     GC     PR     SSSP
Orkut         1.57x  2.60x  2.14x  1.55x  2.43x
LiveJournal   3.71x  3.30x  2.47x  1.83x  2.39x
Pokec         2.40x  2.04x  2.26x  6.42x  2.09x
HiggsTwitter  1.94x  3.34x  2.05x  6.78x  1.25x
RoadNetCA     1.35x  1.13x  0.56x  1.02x  1.03x
RoadNetTX     0.94x  0.94x  0.88x  0.87x  0.98x

RCP is over 2x faster for graph algorithms.

Remote Fetches
RCP blocks on 42% of fetches; the best Stale-n configuration blocks on 86%.
Iterations for Convergence
RCP requires 49% more iterations; Stale-2 and Stale-3 require 146% and 176% more.

Staleness Percentage
97% of values used have staleness 0; 2% have staleness 1.
RCP compares well with GraphLab, and is orthogonal to it.

RCP over GraphLab [VLDB 2012]   SSSP    PR      CC
Orkut                           1.48x   1.01x   1.13x
LiveJournal                     0.72x   0.86x   3.03x
Pokec                           0.92x   0.94x   5.71x
HiggsTwitter                    2.20x   --      3.95x
RoadNetCA                       1.22x   11.5x   3.88x
RoadNetTX                       0.41x   15.5x   2.27x
Vora, Koduru, & Gupta [OOPSLA 2014]: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM.
Other Work – Graphs on Clusters: Evolving Graphs [TACO 2016]; Streaming Graphs; Confined Recovery
Out-of-core Graph Processing
GraphChi [OSDI 2012] – disk-friendly: the graph is split across multiple shards, created during pre-processing, which remain static throughout processing.
Load & process one shard at a time. Iterative processing is I/O bound: GraphChi spends 73–88% of its time on loading.
Opportunity: not all edges are always required
[Figure: percentage of useful edges per iteration of PageRank on the LJ, NF, UK, TT, and FT inputs; the useful fraction shrinks as iterations progress.]
Challenges
Different algorithms behave differently → need dynamic partitions.
[Figure: ideal (normalized) shard size vs. iteration for PR, MSSP, and CC; the ideal size shrinks at different rates for different algorithms.]
How to create dynamic partitions on the fly?
How to process dynamic partitions?
Shards: maintain locality while processing – load, compute, store
[Figure: a 5-vertex graph (a–e) stored as two shards, loaded into main memory and processed one at a time.]

Shard 0                Shard 1
Src  Dst  Value        Src  Dst  Value
a    b    e0           b    d    e4
a    c    e1           d    e    e5
d    a    e2           e    d    e6
e    c    e3
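As a concrete reading of the shard layout and the load-compute-store loop, here is a sketch with assumed stubs standing in for file I/O and the algorithm's edge function (not GraphChi's or the paper's API):

    #include <vector>

    struct Edge { int src, dst; float value; };

    // Assumed stubs for illustration only.
    std::vector<Edge> loadShard(int s);
    void storeShard(int s, const std::vector<Edge>& shard);
    float update(const std::vector<float>& vertexValue, const Edge& e);
    bool converged();

    // Out-of-core iteration over static shards: load, compute, store.
    void processGraph(int nShards, std::vector<float>& vertexValue) {
        while (!converged()) {
            for (int s = 0; s < nShards; ++s) {
                std::vector<Edge> shard = loadShard(s);  // load from disk
                for (Edge& e : shard)                    // compute each edge
                    e.value = update(vertexValue, e);
                storeShard(s, shard);                    // store back to disk
            }
        }
    }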
Dynamic Shards
[Figure: over time, dynamic shards keep only the still-needed edges; Shard 0 shrinks from 4 edges to 3 to 2, and Shard 1 from 3 edges to 2 to 1, until only 3 of the original 7 edges remain active.]
How to efficiently create dynamic shards on the fly?
Creating Dynamic Shards
Dynamic shards are emitted while the current shards are being processed in main memory: each generation contains only the edges that are still needed.
[Figure: processing Shard 0 and Shard 1 produces smaller next-generation shards.]
Sequential writes → light-weight.
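One plausible way to emit the next-generation shard during the compute pass, matching the "sequential writes, light-weight" note above (stillNeeded is an assumed predicate, e.g., "the source vertex's value is still changing"):

    #include <vector>

    struct Edge { int src, dst; float value; };   // as in the earlier sketch

    // Assumed stubs for illustration only.
    float update(const std::vector<float>& vertexValue, const Edge& e);
    bool stillNeeded(const Edge& e);

    // Compute over the current shard and append surviving edges, in order,
    // to the next generation; the output is written purely sequentially.
    std::vector<Edge> computeAndShrink(std::vector<Edge>& shard,
                                       std::vector<float>& vertexValue) {
        std::vector<Edge> next;
        for (Edge& e : shard) {
            e.value = update(vertexValue, e);       // normal edge computation
            if (stillNeeded(e)) next.push_back(e);  // keep only active edges
        }
        return next;                                // the shard's next version
    }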
Processing Dynamic Shards
[Figure: later iterations load the smaller dynamic shards instead of the originals.]
How to process vertices with missing edges?
Delay-based Processing
Delay the computations of vertices whose edges are missing from the dynamic shard.
[Figure: processing continues on the dynamic shards; the computation that needs the dropped b→d edge is deferred and buffered (b d 5) rather than performed with incomplete information.]
Delay-based Processing
Delayed computations are held in an in-memory buffer and periodically processed in a shadow iteration, during which all edges are made available.
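A sketch of the delay buffer and its shadow-iteration drain (record layout, stubs, and scheduling are simplifying assumptions):

    #include <vector>

    struct Edge { int src, dst; float value; };     // as in the earlier sketches
    struct Delayed { int vertex; float partial; };  // illustrative record

    // Assumed stubs for illustration only.
    void commit(int v, float value);
    void recomputeWithAllEdges(const Delayed& d,
                               const std::vector<Edge>& fullShard,
                               std::vector<float>& vertexValue);

    std::vector<Delayed> delayBuffer;   // delayed computations, held in memory

    // Defer a vertex whose edges are missing from the dynamic shard.
    void processVertex(int v, bool allEdgesPresent, float partial) {
        if (allEdgesPresent) commit(v, partial);   // normal path
        else delayBuffer.push_back({v, partial});  // delay until shadow iter
    }

    // Shadow iteration: the static shard makes every edge available again.
    void shadowIteration(const std::vector<Edge>& fullShard,
                         std::vector<float>& vertexValue) {
        for (const Delayed& d : delayBuffer)
            recomputeWithAllEdges(d, fullShard, vertexValue);
        delayBuffer.clear();                       // buffer drained
    }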
Shadow Iteration
[Figure: the shadow iteration loads the original static shards so delayed computations (e.g., the buffered b→d update) see all of their edges; once the delay buffer is drained, processing resumes on the dynamic shards.]
Dynamic Shards
[Figure: each static shard S0 … Sn-1 yields successive dynamic-shard generations DS0, …, DSi, DSi+1, …, with a shadow iteration between generations.]
64–73% of active vertices get delayed.
Accumulation-based Processing
[Figure: before-vs-now comparison for a vertex e with in-neighbors b, c, and d, shown under "No Delay" and "Partial Delay" scenarios.]
Evaluation Setup
Dell 8-core machine, 8 GB main memory; Dell 500 GB 7.2K RPM HDD and Dell 400 GB SSD; Ubuntu 14.04; file-system caches flushed between runs.
Benchmarks & Inputs
5 graph algorithms: PageRank, Belief Propagation, Heat Simulation, Multiple Source Shortest Paths, Connected Components.
6 real-world input graphs.
Performance
1.8x average speedup.
[Figure: speedups (0–3x) of PR, BP, HS, MSSP, and CC on the UK, TT, and FT inputs.]
Dynamic Shard Size
[Figure: normalized shard size vs. iteration for PR on FT, HS on FT, and BP on FT.]
Reads & Writes
Up to 64% reduction in data read; up to 54% reduction in data written. Shadow iterations are I/O intensive.
[Figure: normalized read and write sizes for PR, BP, and HS on UK, TT, and FT, split into regular vs. shadow reads and writes.]
Vora, Xu, & Gupta [USENIX ATC 2016]: Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing.
Other Work – Graphs on Multicores: Graph Reduction [HPDC 2016]; Out-of-core Version
Summary – Graph Processing
On GPUs: SIMD-efficiency – keeping threads busy; Communication – efficient use of PCIe bandwidth
On Clusters: Communication – tolerating network latency
On Multicores: I/O efficiency – transforming the graph representation