Source: alchem.usc.edu/ceng-seminar/slides/2016/rajiv_gupta.pdf
Parallel Graph Processing on GPUs, Clusters, and Multicores
Farzad Khorasani, Keval Vora, Rajiv Gupta
Graph Processing: Graphs & Analytics
- Web Graphs: PageRank (rank websites in search results); Belief Propagation (find malicious domains & infected hosts)
- Social Networks: Community Detection (common interests); Betweenness Centrality (critical nodes)
Characteristics: graphs have large numbers of nodes & edges; algorithms are iterative and convergence-based
→ Exploit Data Parallelism
Parallel Graph Processing
GPUs (VWC, Medusa, Totem, etc.)
  + Massive parallelism; power efficiency
  – Limited device memory & bandwidth
Clusters (GraphLab, GraphX, etc.)
  + Scalability: memory & cores across nodes
  – High communication latency
Multicores (GraphChi, X-Stream, etc.)
  + Efficient parallelism
  – Limited memory; storage system
Graphs on GPUs: Suitable for Data-Parallel Computations
NVIDIA GeForce GTX 780: 12 Streaming Multiprocessors (SMs); 64 warps/SM; 32 threads/warp
Limitations:
- SIMD limitation on warps
- Limited DRAM capacity – 3 GB
- Limited PCIe bandwidth – 12 GB/s
Challenges – Graphs on GPUs: Low SIMD-Efficiency on Irregular Graphs
Pokec: 30 M edges, 1.6 M vertices
LiveJournal: 69 M edges, 5 M vertices
Virtual Warp Centric (VWC) [Hong et al., PPoPP 2011] → warp utilization ~40%
Challenges – Graphs on GPUs: Scalability for Large Graphs
Limited DRAM capacity → multiple GPUs → limited PCIe bandwidth → low communication efficiency
Medusa – 4 GPUs [Zhong & He, IEEE TPDS 2014] → communication efficiency 9–15%
Totem – 2 GPUs [Gharaibeh et al., PACT 2012] → communication efficiency 11–18%
Our Approach
Warp Segmentation: a variable number of threads is given to each vertex to process its incoming edges → high SIMD-efficiency
Vertex Refinement: eliminates unnecessary communication → high communication efficiency
Iterative Graph Processing: Single Source Shortest Path (SSSP)
[Figure: a 5-vertex example graph with source V0. Each iteration recomputes a vertex's distance from its incoming neighbors, e.g., new value = min(V1 + 3, V2 + 5) for a vertex with incoming edges from V1 (weight 3) and V2 (weight 5).]

Iteration   V0   V1   V2   V3   V4
    0        0    ∞    ∞    ∞    ∞
    1        0    4    2    ∞    ∞
    2        0    3    2    7    5
    3        0    3    2    5    6
Compressed Sparse Row (CSR) Graph Representation
[Figure: the same example graph with its 8 edges labeled a–h.]

E   (neighbor index per edge, for edges b c d e f g h a):  0 4 2 0 1 2 2 4
PTR (per-vertex offsets into E):                           0 1 4 5 7 8
V   (vertex values, indices 0–4):                          V0 V1 V2 V3 V4
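To make the CSR layout concrete, here is a minimal one-thread-per-vertex SSSP relaxation sweep in CUDA. It is a sketch in the style of the 1-thread-per-vertex baseline cited on the next slide, not the authors' kernel; the weights array and all identifiers are illustrative assumptions.

    #include <cuda_runtime.h>

    // One SSSP relaxation sweep over a CSR graph with incoming-edge lists.
    // ptr[v] .. ptr[v+1] delimit vertex v's incoming edges; edges[e] holds
    // the neighbor index of edge e and weights[e] its (assumed) weight.
    __global__ void ssspIteration(const int* ptr, const int* edges,
                                  const float* weights, const float* dist,
                                  float* newDist, int nVertices) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= nVertices) return;
        float best = dist[v];
        for (int e = ptr[v]; e < ptr[v + 1]; ++e)  // serial loop per vertex:
            best = fminf(best, dist[edges[e]] + weights[e]);
        newDist[v] = best;  // lanes with different degrees finish at different
    }                       // times, which is exactly the SIMD inefficiency below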
Existing Works – Static Assignment
Each vertex is assigned the same (fixed) number of threads:
- [Harish & Narayanan, HiPC 2007] – 1 thread
- [Hong et al., PPoPP 2011] – a power-of-2 number of threads
- [Kim & Batten, MICRO 2014] – a thread block
→ Leads to SIMD inefficiency
Static – too few / too many threads
[Figure: V0 has 6 incoming edges (N0–N5) and V1 has 2 (N6–N7), with PTR = 0 6 8. With a static 4 threads per vertex, lanes 0–3 need two rounds to cover V0's edges while lanes 4–7 idle after covering V1's two; the lane-by-lane timeline shows the resulting gaps in the per-lane C, R, and RF steps.]
Warp Segmentation
Assign a warp to a vertex group.
[Figure: the same example; all 8 edges N0–N7 map onto lanes 0–7 in a single pass, and the R/RF reduction steps are confined to each vertex's segment, so no lane idles.]
Warp Segmentation: each lane binary-searches its E item index in PTR

Edge (lane)                     N0    N1    N2    N3    N4    N5    N6    N7
Search window narrowing        [0,8] [0,8] [0,8] [0,8] [0,8] [0,8] [0,8] [0,8]
                               [0,4] [0,4] [0,4] [0,4] [0,4] [0,4] [0,4] [0,4]
                               [0,2] [0,2] [0,2] [0,2] [0,2] [0,2] [0,2] [0,2]
                               [0,1] [0,1] [0,1] [0,1] [0,1] [0,1] [1,2] [1,2]
Belonging vertex index           0     0     0     0     0     0     1     1
Index inside segment, right      5     4     3     2     1     0     1     0
Index inside segment, left       0     1     2     3     4     5     0     1
Segment size                     6     6     6     6     6     6     2     2

[Figure: PTR = 0 6 8; V0 owns edges N0–N5 and V1 owns N6–N7.]
Warp Segmentation (cont.)
Worked example for edge N3 (PTR = 0 6 8):
- Binary search narrows [0,8] → [0,4] → [0,2] → [0,1]; belonging vertex index = 0
- Index inside segment from left: 3 - 0 = 3
- Segment size: 6 (check: right + left + 1 = 2 + 3 + 1)
- Index inside segment from right: 6 - 3 - 1 = 2
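The per-lane lookup can be written as a short binary search over PTR. The sketch below mirrors the table's arithmetic; it is illustrative, not the paper's exact code.

    // Locate the vertex owning edge eid: PTR has nVertices+1 entries and
    // PTR[v] <= eid < PTR[v+1] identifies the segment.
    __device__ void locateSegment(const int* ptr, int nVertices, int eid,
                                  int& vertex, int& fromLeft,
                                  int& fromRight, int& segSize) {
        int lo = 0, hi = nVertices;      // invariant: ptr[lo] <= eid < ptr[hi]
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            if (ptr[mid] <= eid) lo = mid; else hi = mid;
        }
        vertex    = lo;
        fromLeft  = eid - ptr[lo];           // N3: 3 - 0 = 3
        segSize   = ptr[lo + 1] - ptr[lo];   // N3: 6 - 0 = 6 (= 2 + 3 + 1)
        fromRight = segSize - fromLeft - 1;  // N3: 6 - 3 - 1 = 2
    }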
Efficiency of Warp Segmentation
- No shared-memory atomics for reduction
- No synchronization primitives used
- All memory accesses are coalesced, except accesses to neighbors' values
- Exploits instruction-level parallelism
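The per-segment reduction can then be done entirely with warp shuffles, which is one way to realize the "no shared-memory atomics, no synchronization" properties above. A sketch using the post-CUDA-9 __shfl_down_sync intrinsic (the GTX 780-era equivalent was __shfl_down); a full active warp is assumed:

    // Segmented min-reduction within a warp. Each lane holds one edge's
    // candidate value (val) and fromRight, its index-from-the-right inside
    // its vertex's segment (computed above). After the loop, each segment's
    // leftmost lane (fromLeft == 0) holds the segment minimum.
    __device__ float segmentedMin(float val, int fromRight) {
        for (int offset = 1; offset < 32; offset <<= 1) {
            float other = __shfl_down_sync(0xffffffffu, val, offset);
            if (offset <= fromRight)     // source lane is inside my segment
                val = fminf(val, other);
        }
        return val;
    }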
Experimental Setup
GPU: NVIDIA GeForce GTX 780

Programs:
BFS   Breadth-First Search
CC    Connected Components
NN    Neural Network
PR    PageRank
SSSP  Single Source Shortest Path
SSWP  Single Source Widest Path

Graphs (#V, #E):
RM33V335E    33 M, 335 M
Orkut        3.07 M, 234 M
LiveJournal  4.85 M, 69 M
SocPokec     1.63 M, 30.6 M
RoadNetCA    1.97 M, 5.5 M
Amazon0312   0.40 M, 3.2 M
Speedups over VWC

Average for programs:
BFS   1.27x – 2.60x
CC    1.33x – 2.90x
NN    1.21x – 2.70x
PR    1.22x – 2.68x
SSSP  1.31x – 2.76x
SSWP  1.28x – 2.80x

Average for graphs:
RM33V335E    1.23x – 1.56x
Orkut        1.15x – 1.99x
LiveJournal  1.29x – 1.99x
SocPokec     1.27x – 1.77x
RoadNetCA    1.24x – 9.90x
Amazon0312   1.53x – 2.68x
Warp Execution Efficiency (SSSP)
[Figure: warp execution efficiency (%, 0–80) of VWC-2/4/8/16/32 vs. Warp Segmentation on RM33V335E, ComOrkut, ER25V201E, RM25V201E, RM16V201E, RM16V134E, LiveJournal, SocPokec, HiggsTwitter, RoadNetCA, WebGoogle, and Amazon0312.]
Scalability – Multiple GPUs
[Figure: a 7-vertex graph (V0–V6) to be partitioned across multiple devices.]

Bandwidth (GB/s):  PCIe ~12   GPU DRAM ~300
Scalability – Multiple GPUs: partitioned CSR
[Figure: the graph is split so Device 0 owns V0, V1, V2, V5, V6 and Device 1 owns V3, V4; each device stores the V, PTR, and E arrays for its own partition, with edges a–g distributed accordingly.]
Medusa – ALL Method [IEEE TPDS 2014]
[Figure: after each iteration, each device ships all of its vertices to the other device, whether or not their values changed.]
→ Communication efficiency 9–15%
Totem – Maximal Subset (MS) [PACT 2012]
[Figure: only the maximal subset of boundary vertices (e.g., V2, V3, V4) is exchanged, but it is sent every iteration whether or not those vertices were updated.]
→ Communication efficiency 11–18%
Vertex Refinement (VR)
- Offline Vertex Refinement: marks boundary vertices ahead of time
- Online Vertex Refinement: performed on-the-fly by the CUDA kernel, looking for updates in values
Online Vertex Refinement
Each warp compacts the vertices that are both updated and marked into the device outbox:

Thread                         T#0  T#1  T#2  T#3  T#4  T#5  T#6  T#7
Vertex                         V0   V1   V2   V3   V4   V5   V6   V7
Is (Updated & Marked)          Y    N    N    Y    N    N    N    Y
Binary reduction → count       3    3    3    3    3    3    3    3
Reserve outbox region:         A = atomicAdd( deviceOutboxMovingIndex, 3 );
Shuffle A to all lanes         A    A    A    A    A    A    A    A
Binary prefix sum → rank       0    1    1    1    2    2    2    2
Fill outbox region:            O[A+0] = V0;  O[A+1] = V3;  O[A+2] = V7

Device outbox: V0 V3 V7
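In warp-intrinsic terms the compaction above might be sketched as follows; deviceOutboxMovingIndex follows the slide, while everything else (names, a value-only outbox, a full active warp) is an illustrative assumption:

    // Warp-synchronous outbox compaction for online vertex refinement.
    __device__ void pushIfUpdated(float myValue, bool updatedAndMarked,
                                  float* outbox, int* deviceOutboxMovingIndex) {
        unsigned mask = __ballot_sync(0xffffffffu, updatedAndMarked);
        int count = __popc(mask);          // binary reduction: # marked lanes
        int lane  = threadIdx.x & 31;
        int a = 0;
        if (lane == 0 && count > 0)        // reserve outbox region, once per warp
            a = atomicAdd(deviceOutboxMovingIndex, count);
        a = __shfl_sync(0xffffffffu, a, 0);            // shuffle A to all lanes
        int rank = __popc(mask & ((1u << lane) - 1));  // binary prefix sum
        if (updatedAndMarked)
            outbox[a + rank] = myValue;    // fill outbox region
    }

Vertex indices would be written to a parallel outbox array in the same way.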
Buffers – Inbox and Outbox: Double Buffering, Host-as-Hub
[Figure: GPUs #0, #1, #2 hold V/PTR/E partitions of M+1, N+1, and P+1 vertices; each fills an outbox of up to M, N, or P (value, index) pairs that travels over the PCIe lanes into per-device inboxes staged in host memory. Odd and even copies of the inboxes (double buffering) let one iteration's communication overlap the next iteration's computation.]
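One plausible realization of the host-as-hub exchange, sketched with CUDA streams and async copies; the buffer layout and the iteration-parity trick are assumptions, not the authors' implementation:

    #include <cuda_runtime.h>

    // Ship each device's outbox into host-pinned inbox staging buffers.
    // Odd/even host copies (double buffering) let iteration i+1 compute
    // while iteration i's boxes are still in flight. Error checks omitted.
    void exchangeOutboxes(int nGpus, float** devOutbox, const int* outboxLen,
                          float*** hostInbox /* [2][nGpus] */,
                          cudaStream_t* stream, int iter) {
        float** inbox = hostInbox[iter & 1];   // pick the odd or even buffer
        for (int g = 0; g < nGpus; ++g) {
            cudaSetDevice(g);
            cudaMemcpyAsync(inbox[g], devOutbox[g],
                            outboxLen[g] * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[g]);
        }
        // Each device later pulls the other devices' inboxes host-to-device
        // on the same streams before the next kernel launch consumes them.
    }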
Speedup – VR vs. MS & ALL
(top two graphs on 3 GPUs, bottom two on 2 GPUs)

Input graph (#V, #E)          BFS     PR      SSSP
RM54V704E (54 M, 704 M)
  over ALL                    1.85x   1.48x   1.75x
  over MS                     1.82x   1.47x   1.71x
RM41V536E (42 M, 537 M)
  over ALL                    1.89x   1.44x   1.80x
  over MS                     1.82x   1.39x   1.75x
RM41V503E (42 M, 503 M)
  over ALL                    1.29x   1.18x   1.24x
  over MS                     1.27x   1.15x   1.21x
RM35V402E (36 M, 403 M)
  over ALL                    1.32x   1.21x   1.25x
  over MS                     1.30x   1.20x   1.22x
VR vs. MS & ALL (3 GPUs)
[Figure: normalized aggregated processing time for ALL, MS, and VR under BFS, PR, and SSSP, split into aggregated computation duration and aggregated communication duration.]
Irregular Computations
Graphs on GPUs: [PACT 2015a] Warp Segmentation & Vertex Refinement; [HPDC 2014] CuSha with CW graph representation
Hashing on GPUs: [PACT 2015b] Stadium Hashing vs. Cuckoo Hashing
Generalization and Automation: [MICRO 2015] Branch Divergence (CCC); [IPDPS 2016] Thread Assignment (CTE)
Graphs on Clusters: Scalability for Large Graphs
Partition the graph across machines; exploit memory and cores across all nodes.
Challenges:
- Programmability – writing distributed programs
- Performance – network latency is high
Our Approach
Programmability: ASPIRE Distributed Shared Memory (DSM), e.g., Fetch(c); Fetch(a); Fetch(b); c' = f(c, a, b); Store(c, c')
Performance: a caching protocol to tolerate network latency
TARDIS: 16-node cluster with InfiniBand; a remote fetch is 2.3x slower than a local one
Relaxed Consistency Protocol (RCP)
Allow the use of stale values to tolerate network latency.
Staleness of a value: the number of updates performed on it since the cached copy was fetched.
But using stale values slows convergence → relax consistency without delaying convergence.
Bounded Staleness [Cui et al., USENIX ATC 2014]
Controls staleness using a static threshold.
[Figure: iterations to convergence vs. number of remote fetches. A high threshold sits at few fetches but many iterations; a low threshold at many fetches but few iterations. Where in between should the threshold sit?]
Relaxed Consistency Protocol
Set a high threshold: fetches are allowed to use stale values → minimizes the impact of network latency.
Best-effort refresh: minimizes the use of stale values and the staleness of the stale values that are used.
Relaxed Consistency Protocol: access classification (threshold t)
- Cache-miss: value not in cache
- Current-hit: value in cache; staleness = 0
- Stale-hit: value in cache; 0 < staleness ≤ t → issue a refresh request
- Stale-miss: value in cache; staleness > t
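In host-side C++ terms, the classification might be sketched as follows (illustrative only, not ASPIRE's code):

    enum class Access { CacheMiss, CurrentHit, StaleHit, StaleMiss };

    // Classify an access under RCP with staleness threshold t. Cache-misses
    // and stale-misses block on a remote fetch; a stale-hit returns the stale
    // value immediately and issues a best-effort refresh in the background.
    Access classify(bool cached, int staleness, int t) {
        if (!cached)        return Access::CacheMiss;   // must fetch
        if (staleness == 0) return Access::CurrentHit;  // use as-is
        if (staleness <= t) return Access::StaleHit;    // use + async refresh
        return Access::StaleMiss;                       // too stale: fetch
    }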
Relaxed Consistency Protocol: per-value state machine
[State diagram over Uncached, Shared, and Stale:]
- Uncached → Shared: cache-miss / write [local node]; staleness = 0
- Shared → Shared: hit / write [local node]
- Shared → Stale: invalidate [directory]; ++staleness
- Stale → Stale: stale-hit [local node]; invalidate [directory]; ++staleness
- Stale → Shared: refresh [local node] or stale-miss [local node]; staleness = 0
- Shared, Stale → Uncached: evict [local node]
Experimental Setup
Tardis – 16-node cluster

Programs:
SSSP  Single Source Shortest Path
CC    Connected Components
CD    Community Detection
PR    PageRank
GC    Graph Coloring

Graphs (#V, #E):
Orkut        3.07 M, 234 M
LiveJournal  4.85 M, 69 M
Pokec        1.63 M, 30.6 M
HiggsTwitter 0.46 M, 14.8 M
RoadNetCA    1.97 M, 5.5 M
RoadNetTX    1.38 M, 3.84 M
Execution Time: RCP over SCP

              CD     CC     GC     PR     SSSP
Orkut         1.57x  2.60x  2.14x  1.55x  2.43x
LiveJournal   3.71x  3.30x  2.47x  1.83x  2.39x
Pokec         2.40x  2.04x  2.26x  6.42x  2.09x
HiggsTwitter  1.94x  3.34x  2.05x  6.78x  1.25x
RoadNetCA     1.35x  1.13x  0.56x  1.02x  1.03x
RoadNetTX     0.94x  0.94x  0.88x  0.87x  0.98x

RCP is over 2x faster for graph algorithms.

Remote Fetches
RCP blocks on 42% of fetches; the best Stale-n configuration blocks on 86%.
Iterations for Convergence
RCP requires 49% more iterations; Stale-2 and Stale-3 require 146% and 176% more.

Staleness Percentage
97% of values used have staleness 0; 2% have staleness 1.
RCP compares well with GraphLab, and is orthogonal to it.

RCP over GraphLab [VLDB 2012]   SSSP    PR      CC
Orkut                           1.48x   1.01x   1.13x
LiveJournal                     0.72x   0.86x   3.03x
Pokec                           0.92x   0.94x   5.71x
HiggsTwitter                    2.20x   --      3.95x
RoadNetCA                       1.22x   11.5x   3.88x
RoadNetTX                       0.41x   15.5x   2.27x
Vora, Koduru, & Gupta [OOPSLA 2014]: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM.
Other Work – Graphs on Clusters: Evolving Graphs [TACO 2016]; Streaming Graphs; Confined Recovery
Out-of-core Graph Processing
GraphChi [OSDI 2012] – disk-friendly: the graph is split across multiple shards, created during pre-processing, which remain static throughout processing.
Load & process one shard at a time. Iterative processing is I/O bound: GraphChi spends 73–88% of its time on loading.
Opportunity: not all edges are always required
[Figure: percentage of useful edges per iteration of PageRank on the LJ, NF, UK, TT, and FT inputs; the useful fraction shrinks as iterations progress.]
Challenges
Different algorithms behave differently → need dynamic partitions.
[Figure: ideal (normalized) shard size vs. iteration for PR, MSSP, and CC; the ideal size shrinks at different rates for different algorithms.]
How to create dynamic partitions on the fly?
How to process dynamic partitions?
Shards: maintain locality while processing – load, compute, store
[Figure: a 5-vertex graph (a–e) stored as two shards, loaded into main memory and processed one at a time.]

Shard 0                Shard 1
Src  Dst  Value        Src  Dst  Value
a    b    e0           b    d    e4
a    c    e1           d    e    e5
d    a    e2           e    d    e6
e    c    e3
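As a concrete reading of the shard layout and the load-compute-store loop, here is a sketch with assumed stubs standing in for file I/O and the algorithm's edge function (not GraphChi's or the paper's API):

    #include <vector>

    struct Edge { int src, dst; float value; };

    // Assumed stubs for illustration only.
    std::vector<Edge> loadShard(int s);
    void storeShard(int s, const std::vector<Edge>& shard);
    float update(const std::vector<float>& vertexValue, const Edge& e);
    bool converged();

    // Out-of-core iteration over static shards: load, compute, store.
    void processGraph(int nShards, std::vector<float>& vertexValue) {
        while (!converged()) {
            for (int s = 0; s < nShards; ++s) {
                std::vector<Edge> shard = loadShard(s);  // load from disk
                for (Edge& e : shard)                    // compute each edge
                    e.value = update(vertexValue, e);
                storeShard(s, shard);                    // store back to disk
            }
        }
    }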
Dynamic Shards
[Figure: over time, dynamic shards keep only the still-needed edges; Shard 0 shrinks from 4 edges to 3 to 2, and Shard 1 from 3 edges to 2 to 1, until only 3 of the original 7 edges remain active.]
How to efficiently create dynamic shards on the fly?
Creating Dynamic Shards
Dynamic shards are emitted while the current shards are being processed in main memory: each generation contains only the edges that are still needed.
[Figure: processing Shard 0 and Shard 1 produces smaller next-generation shards.]
Sequential writes → light-weight.
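One plausible way to emit the next-generation shard during the compute pass, matching the "sequential writes, light-weight" note above (stillNeeded is an assumed predicate, e.g., "the source vertex's value is still changing"):

    #include <vector>

    struct Edge { int src, dst; float value; };   // as in the earlier sketch

    // Assumed stubs for illustration only.
    float update(const std::vector<float>& vertexValue, const Edge& e);
    bool stillNeeded(const Edge& e);

    // Compute over the current shard and append surviving edges, in order,
    // to the next generation; the output is written purely sequentially.
    std::vector<Edge> computeAndShrink(std::vector<Edge>& shard,
                                       std::vector<float>& vertexValue) {
        std::vector<Edge> next;
        for (Edge& e : shard) {
            e.value = update(vertexValue, e);       // normal edge computation
            if (stillNeeded(e)) next.push_back(e);  // keep only active edges
        }
        return next;                                // the shard's next version
    }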
Processing Dynamic Shards
[Figure: later iterations load the smaller dynamic shards instead of the originals.]
How to process vertices with missing edges?
Delay-based Processing
Delay the computations of vertices whose edges are missing from the dynamic shard.
[Figure: processing continues on the dynamic shards; the computation that needs the dropped b→d edge is deferred and buffered (b d 5) rather than performed with incomplete information.]
Delay-based Processing
Delayed computations are held in an in-memory buffer and periodically processed in a shadow iteration, during which all edges are made available.
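A sketch of the delay buffer and its shadow-iteration drain (record layout, stubs, and scheduling are simplifying assumptions):

    #include <vector>

    struct Edge { int src, dst; float value; };     // as in the earlier sketches
    struct Delayed { int vertex; float partial; };  // illustrative record

    // Assumed stubs for illustration only.
    void commit(int v, float value);
    void recomputeWithAllEdges(const Delayed& d,
                               const std::vector<Edge>& fullShard,
                               std::vector<float>& vertexValue);

    std::vector<Delayed> delayBuffer;   // delayed computations, held in memory

    // Defer a vertex whose edges are missing from the dynamic shard.
    void processVertex(int v, bool allEdgesPresent, float partial) {
        if (allEdgesPresent) commit(v, partial);   // normal path
        else delayBuffer.push_back({v, partial});  // delay until shadow iter
    }

    // Shadow iteration: the static shard makes every edge available again.
    void shadowIteration(const std::vector<Edge>& fullShard,
                         std::vector<float>& vertexValue) {
        for (const Delayed& d : delayBuffer)
            recomputeWithAllEdges(d, fullShard, vertexValue);
        delayBuffer.clear();                       // buffer drained
    }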
Shadow Iteration
[Figure: the shadow iteration loads the original static shards so delayed computations (e.g., the buffered b→d update) see all of their edges; once the delay buffer is drained, processing resumes on the dynamic shards.]
Dynamic Shards
[Figure: each static shard S0 … Sn-1 yields successive dynamic-shard generations DS0, …, DSi, DSi+1, …, with a shadow iteration between generations.]
64–73% of active vertices get delayed.
Accumulation-based Processing
[Figure: before-vs-now comparison for a vertex e with in-neighbors b, c, and d, shown under "No Delay" and "Partial Delay" scenarios.]
Evaluation Setup
Dell 8-core machine, 8 GB main memory; Dell 500 GB 7.2K RPM HDD and Dell 400 GB SSD; Ubuntu 14.04; file-system caches flushed between runs.
Benchmarks & Inputs
5 graph algorithms: PageRank, Belief Propagation, Heat Simulation, Multiple Source Shortest Paths, Connected Components.
6 real-world input graphs.
Performance
1.8x average speedup.
[Figure: speedups (0–3x) of PR, BP, HS, MSSP, and CC on the UK, TT, and FT inputs.]
Dynamic Shard Size
[Figure: normalized shard size vs. iteration for PR on FT, HS on FT, and BP on FT.]
Reads & Writes
Up to 64% reduction in data read; up to 54% reduction in data written. Shadow iterations are I/O intensive.
[Figure: normalized read and write sizes for PR, BP, and HS on UK, TT, and FT, split into regular vs. shadow reads and writes.]
Vora, Xu, & Gupta [USENIX ATC 2016]: Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing.
Other Work – Graphs on Multicores: Graph Reduction [HPDC 2016]; Out-of-core Version
Summary – Graph Processing
On GPUs: SIMD-efficiency – keeping threads busy; Communication – efficient use of PCIe bandwidth
On Clusters: Communication – tolerating network latency
On Multicores: I/O efficiency – transforming the graph representation