TRANSCRIPT
Parallel Graph Algorithms
Rupesh Nasre., IIT Madras
NITK, February 2016
2
Graphs
● Where do we encounter graphs?
  ● Social networks, road connections, molecular interactions, planetary forces, …
  ● Datasets: SNAP, Florida, DIMACS, KONECT, ...
● Why treat them separately?
  ● They provide structural information.
  ● They can be processed more efficiently.
● What challenges do they pose?
  ● Load imbalance, poor locality, …
  ● Irregularity
3
What is IrReguLariTy?
● Data-access or control patterns are unpredictable at compile time.

Irregular data-access:

    int a[N], b[N], c[N];
    readinput(a);
    c[5] = b[a[4]];

Irregular control-flow:

    int a[N];
    readinput(a);
    if (a[4] > 30) {
        ...
    }

Pointer-based data structures often contribute to irregularity.
Irregularity needs dynamic techniques.
4
Examples of Irregular Algorithms
[Figure: a triangulated mesh (Delaunay Mesh Refinement); a constraint graph for the code "a = &x; b = &y; p = &a; *p = b; c = a" (Points-to Analysis); a weighted graph (Minimum Spanning Tree Computation); a factor graph over clauses C1-C5 and variables X1-X5 (Survey Propagation)]

Delaunay Mesh Refinement · Points-to Analysis · Minimum Spanning Tree Computation · Survey Propagation
5
Regular vs. Irregular Algorithms
[Figure: matrix multiplication (regular) vs. shortest-paths computation on a weighted graph (irregular)]

Matrix multiplication: by knowing the matrix size and its starting address, and without knowing its values, we can determine the dynamic behavior.

Shortest-paths computation: the dynamic behavior depends on the input graph. This results in pessimistic synchronization.
6
Irregularity is a Matter of Degree
[Chart: applications plotted by memory-access irregularity (10%-80%, x-axis) vs. control-flow irregularity (0%-4.0%, y-axis), with matrix multiplication and shortest-paths computation as the running examples]
8
Dynamic Techniques

[Figure: a graph in which an active node and its surrounding neighborhood are highlighted]

● Non-overlapping neighborhoods can be processed in parallel.
● Overlapping neighborhoods require synchronization.
● This leads to optimistic and cautious parallelizations.
Question: Given an algorithm, can you schedule its parallel processing such that neighborhoods never overlap?
9
Parallelism Approaches

● Manual
● Libraries
  ● Galois, Ligra, LonestarGPU, Totem, ...
● Domain-Specific Languages
  ● Green-Marl, Elixir, BlackBuck, ...

[Figure: the three approaches arranged along a productivity vs. performance spectrum]
10
Specifying Parallelism
● Do not specify.
  ● Sequential input, completely automated; currently very challenging in general.
● Implicit parallelism
  ● aggregates, aggregate functions, value-based processing, ...
● Explicit parallelism
  ● pthreads, MPI, ...
11
Identifying Dependence
for (ii = 0; ii < 10; ++ii) {
    a[2 * ii] = ... a[2 * ii + 1] ...
}

Is there a flow dependence between different iterations?

Dependence equations:
    0 <= ii_w < ii_r < 10
    2 * ii_w = 2 * ii_r + 1

which can be written as
    0 <= ii_w
    ii_w <= ii_r - 1
    ii_r <= 9
    2 * ii_w <= 2 * ii_r + 1
    2 * ii_r + 1 <= 2 * ii_w

or, in matrix form,

    [ -1   0 ]            [  0 ]
    [  1  -1 ]   [ii_w]   [ -1 ]
    [  0   1 ] · [ii_r] ≤ [  9 ]
    [  2  -2 ]            [  1 ]
    [ -2   2 ]            [ -1 ]

A dependence exists if the system has a solution. (Here it has none: 2 * ii_w is always even while 2 * ii_r + 1 is always odd, so the iterations are independent.)
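The system above is small enough to check exhaustively. A minimal Python sketch (a stand-in for the integer-programming machinery a real dependence analyzer would use; the function name is ours) confirms that no solution exists:

```python
# Exhaustive check of the dependence system above: is there a pair
# (ii_w, ii_r) with 0 <= ii_w < ii_r < n and 2*ii_w == 2*ii_r + 1?
def flow_dependence_exists(n=10):
    return any(2 * ii_w == 2 * ii_r + 1
               for ii_w in range(n)
               for ii_r in range(ii_w + 1, n))

# 2*ii_w is always even and 2*ii_r + 1 always odd, so no pair satisfies
# the system: the loop carries no flow dependence across iterations.
print(flow_dependence_exists())  # False
```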
12
Outline
● Introduction
➔ Graph algorithms on GPUs
● Morph algorithms on GPUs
● Synthesizing graph algorithms for GPUs
13
What is a GPU?
● Graphics Processing Unit
● Separate piece of hardware, connected using a bus
● Separate address space from that of the CPU
● Massive multithreading
● Warp-based execution
14
What is a Warp?
Source: Wikipedia
15
GPU Computation Hierarchy
[Figure: nested computation hierarchy]

Level              Size (threads)
Thread             1
Warp               32
Block              1024
Multi-processor    tens of thousands
GPU                hundreds of thousands
16
Challenges with GPUs
● Warp-based execution
● Locking is expensive
● Dynamic memory allocation is costly
● Limited data-cache
● Programmability issues
  ● separate address space
  ● no recursion support
  ● complex computation hierarchy
  ● exposed memory hierarchy
  ● ...
17
Challenges in Graph Algorithms
● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory latency
  ● locality is difficult to exploit
  ● low caching support
● Thread-divergence
  ● work done per node varies with graph structure
● Uncoalesced memory accesses
  ● warp-threads access arbitrary graph elements
18
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Single-source shortest paths:

    sssp(src):
        dsrc = distance[src]
        forall edges (src, dst, wt)
            relaxEdge(src, dst, wt)
19
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Single-source shortest paths:

    sssp(src):
        dsrc = distance[src]
        forall edges (src, dst, wt)
            ddst = distance[dst]
            altdist = dsrc + wt
            if altdist < ddst
                distance[dst] = altdist

This is a memory-bound kernel.

Kernel unrolling: instead of operating on a node, a thread operates on a subgraph.
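A sequential Python sketch of the kernel above may help make the topology-driven scheme concrete: one loop iteration per edge stands in for one GPU thread per edge, and the kernel is re-launched until no distance changes. The example graph below is illustrative, not taken from the slides.

```python
import math

def sssp(num_nodes, edges, src):
    """Topology-driven SSSP: relax every edge each round until a fixpoint."""
    dist = [math.inf] * num_nodes
    dist[src] = 0
    changed = True
    while changed:                 # host-side iterate loop
        changed = False
        for s, d, wt in edges:     # one "thread" per edge
            altdist = dist[s] + wt
            if altdist < dist[d]:
                dist[d] = altdist
                changed = True
    return dist

edges = [(0, 1, 5), (0, 2, 2), (2, 1, 1), (1, 3, 7)]
print(sssp(4, edges, 0))  # [0, 3, 2, 10]
```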
20
Improving Synchronization
[Figure: push-based vs. pull-based updates at node a; threads t_five and t_seven relax edges of weight 2 and 3 from nodes at distances 3 and 4 into a node whose distance starts at 10]

Atomic-free update → lost-update problem → correction by topology-driven processing, exploiting monotonicity.
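The lost-update scenario and its monotonicity-based fix can be replayed sequentially in Python. This is a sketch: the node names and the interleaving are assumptions chosen to mirror the slide's t_five/t_seven example.

```python
# Node 'a' starts at distance 10; its two in-neighbors sit at distances
# 3 and 4, over edges of weight 2 and 3 (candidates 5 and 7).
dist = {'a': 10, 'b': 3, 'c': 4}

# Push-based, atomic-free: both threads read dist['a'] == 10, see an
# improvement, and write. The second write overwrites the first, so the
# better value 5 is lost.
dist['a'] = dist['b'] + 2   # t_five writes 5
dist['a'] = dist['c'] + 3   # t_seven writes 7 (lost update)

# Topology-driven reprocessing: relaxing every edge again restores the
# minimum, because updates are monotone (distances only ever decrease).
for cand in (dist['b'] + 2, dist['c'] + 3):
    if cand < dist['a']:
        dist['a'] = cand
print(dist['a'])  # 5
```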
21
Irregular Algorithms on GPUs
[Figure: weighted example graph]

Breadth-first search · Single-source shortest paths · Barnes-Hut n-body simulation

● Better memory layout
● Kernel unrolling
● Local worklists
● Improved synchronization

Application   Speedup
BFS           48
BH            90
SSSP          45
22
Outline
● Introduction
● Graph algorithms on GPUs
  ● e.g., breadth-first search, shortest paths
➔ Morph algorithms on GPUs
  ➔ e.g., mesh refinement, pointer analysis
● Synthesizing irregular algorithms for GPUs
23
Identify the Celebrity
Source: Wikipedia
24
What is a morph?
Source: Wikipedia
25
Examples of Morph Algorithms
[Figure: a triangulated mesh (Delaunay Mesh Refinement); a constraint graph for the code "a = &x; b = &y; p = &a; *p = b; c = a" (Points-to Analysis); a weighted graph (Minimum Spanning Tree Computation); a factor graph over clauses C1-C5 and variables X1-X5 (Survey Propagation)]

Delaunay Mesh Refinement · Points-to Analysis · Minimum Spanning Tree Computation · Survey Propagation
26
TAO Classification
[Table: algorithm classes vs. multi-core and GPU platforms]

● Regular (matrix multiplication, stencils)
● Irregular, static-graph (breadth-first search, shortest paths)
● Morph (mesh refinement, survey propagation)
The TAO of Parallelism in Algorithms, Pingali et al. [PLDI 2011]
27
First Attempt
● Ported multi-core Galois benchmarks to CUDA
● The GPU version's performance was comparable to that of the single-threaded version

[Chart: execution time in seconds (0-90) vs. number of CPU threads (1-48) for Serial (Shewchuk, Berkeley), GPU Ported, and Multicore Galois; DMR on a mesh with 10 million triangles, around 50% of them bad]
28
Challenges in Morph Algorithms

● Synchronization
  ● locks are prohibitively expensive on GPUs
  ● atomic instructions quickly become expensive
● Memory allocation
  ● changing graph structure requires new strategies
  ● memory requirement cannot be predicted
● Load imbalance
  ● different modifications to different parts of the graph
  ● work done per node changes dynamically
  ● leads to thread-divergence and uncoalesced memory accesses
29
Performance Considerations
1. Graph representation
2. Subgraph addition
3. Subgraph deletion
4. Conflict resolution
5. Load balancing
30
Graph Representation (1)

● Delaunay Mesh Refinement: points, triangles, and triangle neighbors
● Survey Propagation: clauses-to-literals and literals-to-clauses mappings
● Points-to Analysis: pointers and their points-to sets (edges)
● Minimum Spanning Tree Computation: the graph and its components
35
Subgraph Addition (2)

● Pre-allocation (e.g., MST, DMR)
  ● when the final graph has bounded size
  ● overallocation
● Host-only (e.g., PTA)
  ● when the next size is predictable on the host; global allocation
  ● may be expensive, but can be optimized
● Kernel-host (e.g., DMR)
  ● when the next size is predictable; global allocation
  ● the size computation can piggyback on the kernel
● Kernel-only (e.g., PTA)
  ● general purpose; local allocation
  ● expensive, but can be optimized
36
Performance

[Chart: execution time in seconds (0-90) vs. number of CPU threads (1-48) for Serial (Shewchuk, Berkeley), Multi-core Galois, GPU Ported, and GPU Optimized]

Application   Speedup
DMR           80
MST           20
PTA           23
SP            25

All five performance considerations apply: graph representation, subgraph addition, subgraph deletion, conflict resolution, load balancing.
37
GPU Optimization Principles
● Computation
  ● algorithm selection
  ● work sorting
  ● work chunking
  ● communication onto computation
  ● following the parallelism profile
  ● pipelined computation
● Memory
  ● kernel transformations
  ● data grouping
  ● exploiting the memory hierarchy
● Synchronization
  ● avoiding synchronization
  ● coarsening synchronization
  ● race-and-resolve mechanism
  ● combining synchronization
38
GPU Optimization Principles
These optimization principles are critical for high-performing irregular GPU computations.
39
Barrier-based Synchronization
Consider threads pushing elements into a worklist.

➔ Disjoint accesses
● Overlapping accesses
● Benign overlaps

atomic per element  →  O(e) atomics
atomic per thread   →  O(t) atomics
prefix-sum          →  O(log t) barriers
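The prefix-sum scheme at the bottom of the list can be sketched in Python. The sequential scan stands in for a parallel O(log t) scan; the per-thread items and the worklist capacity are made up for illustration.

```python
from itertools import accumulate

def push_all(per_thread_items, capacity=64):
    """Each thread counts its pushes; an exclusive prefix sum over the
    counts gives every thread a private offset, so all writes into the
    shared worklist are disjoint and need no atomics."""
    counts = [len(items) for items in per_thread_items]
    offsets = [0] + list(accumulate(counts))[:-1]   # exclusive prefix sum
    worklist = [None] * capacity
    for tid, items in enumerate(per_thread_items):  # "threads" in parallel
        for i, item in enumerate(items):
            worklist[offsets[tid] + i] = item
    return worklist[:sum(counts)]

print(push_all([[7, 8], [], [1], [2, 3, 4]]))  # [7, 8, 1, 2, 3, 4]
```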
40
Barrier-based Synchronization
Consider threads trying to own a set of elements.

● Disjoint accesses
➔ Overlapping accesses
● Benign overlaps

Race and resolve (AND): non-atomic mark, prioritized mark, then check. Example: owning cavities in Delaunay mesh refinement.
Race and resolve (OR): non-atomic mark, then check. Example: inserting unique elements into a worklist.
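A Python sketch of the AND-style race-and-resolve idiom: threads race to mark every element of their cavity, a deterministic priority (here, lower thread id wins, an assumption of this sketch) resolves conflicting marks, and a separate check phase keeps only the threads that own their whole cavity.

```python
def race_and_resolve(cavities, num_elements):
    owner = [None] * num_elements
    # Race phase: non-atomic, prioritized marks (lower tid wins).
    for tid, cavity in enumerate(cavities):
        for e in cavity:
            if owner[e] is None or tid < owner[e]:
                owner[e] = tid
    # Check phase: a thread proceeds only if it owns ALL of its elements.
    return [tid for tid, cavity in enumerate(cavities)
            if all(owner[e] == tid for e in cavity)]

# Threads 0 and 1 conflict on element 2 (thread 0 wins); thread 2 is disjoint.
print(race_and_resolve([[1, 2], [2, 3], [4, 5]], 6))  # [0, 2]
```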
41
Barrier-based Synchronization
Consider threads updating shared variables to the same value.

● Disjoint accesses
● Overlapping accesses
➔ Benign overlaps

The update can be done with or without atomics, e.g., in level-by-level breadth-first search, where all threads write the same level value.

[Figure: small BFS tree over nodes a-g]
42
Exploiting Algebraic Properties
Consider threads updating distances in shortest paths computation.

➔ Monotonicity
● Idempotency
● Associativity

[Figure: the lost-update example again: atomic-free update, lost-update problem, correction by topology-driven processing, exploiting monotonicity]
43
Exploiting Algebraic Properties
Consider threads updating distances in shortest paths computation.

● Monotonicity
➔ Idempotency
● Associativity

[Figure: node z updated by threads t1-t4; multiple instances of z in the worklist; the same node processed by threads t5-t8]

● Update by multiple threads
● Multiple instances of a node in the worklist
● Same node processed by multiple threads
44
Exploiting Algebraic Properties
Consider threads pushing points-to information to a pointer node.

● Monotonicity
● Idempotency
➔ Associativity

[Figure: threads t1-t4 push points-to facts x, y, {z, v}, {m, n} into node z, which accumulates {x, y, z, v, m, n}]

Associativity helps push information using prefix-sum.
45
Other Synchronization Methods
➔ Partitioning
● Scatter-gather

Delaunay Mesh Refinement:
● Layout-based graph partitioning is expensive.
● Partitioning time may be comparable to the overall running time.

[Figure: push-based vs. pull-based updates at node a; partitioning enables the pull-based approach]
46
Other Synchronization Methods
Consider threads pushing elements into a worklist.

● Partitioning
➔ Scatter-gather

atomic per element  →  O(e) atomics
atomic per thread   →  O(t) atomics
prefix-sum          →  O(log t) barriers
scatter, then gather
47
Outline
● Introduction
● Graph algorithms on GPUs
● Morph algorithms on GPUs
➔ Synthesizing irregular algorithms for GPUs
48
Hand-Written BFS
    ...
    // Memory allocation on host
    hedgessrcdst = (unsigned int *)malloc(nedges * sizeof(unsigned int));
    hpsrc = (unsigned int *)calloc(nnodes, sizeof(unsigned int));
    hdist = (float *)malloc(nnodes * sizeof(float));
    ...
    // Read graph.
    ...
49
Hand-Written BFS

    ...
    // Memory allocation on device
    cudaMalloc((void **)&edgessrcdst, nedges * sizeof(unsigned int));
    cudaMalloc((void **)&psrc, nnodes * sizeof(unsigned int));

    // Explicit memory transfer
    cudaMemcpy(edgessrcdst, hedgessrcdst, nedges * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(edgessrcwt, hedgessrcwt, nedges * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(psrc, hpsrc, nnodes * sizeof(unsigned int), cudaMemcpyHostToDevice);
    ...
50
Hand-Written BFS
    ...
    do {
        hchanged = false;
        cudaMemcpy(changed, &hchanged, sizeof(bool), cudaMemcpyHostToDevice);     // transfer to GPU
        relaxEdge <<<nblocks, blocksize>>> (dist, edgessrcdst, psrc, nedges, nnodes, changed);
        cudaThreadSynchronize();                                                  // explicit barrier
        cudaMemcpy(&hchanged, changed, sizeof(bool), cudaMemcpyDeviceToHost);     // transfer from GPU
    } while (hchanged);
    ...
51
Hand-Written BFS
    ... (the same iterate loop as on the previous slide) ...
    cudaMemcpy(hdist, dist, nnodes * sizeof(float), cudaMemcpyDeviceToHost);      // final transfer from GPU
    ...
52
SSSP Specification in Elixir
    Graph [nodes(node: Node, dist: int),
           edges(src: Node, dst: Node, wt: int)]
    source: Node

    initDist = [nodes(node n, dist d)] →
               [d = if (n == source) 0 else ∞]

    relaxEdge = [nodes(node p, dist pd),
                 nodes(node q, dist qd),
                 edges(src p, dst q, wt w),
                 pd + w < qd] →
                [qd = pd + w]

    init = foreach initDist
    bfs = iterate relaxEdge
    main = init; bfs

Graph rewrite rules; kernels are generated from the specification. Graph layout and processing order are left unspecified.
53
SSSP Specification in Green-Marl
    ...
    G.dist = (G == root) ? 0 : +INF;
    G.updated = (G == root) ? True : False;
    G.dist_nxt = G.dist;
    G.updated_nxt = G.updated;

    While (!fin) {
        fin = True;
        Foreach (n: G.Nodes)(n.updated) {
            Foreach (s: n.Nbrs) {
                Edge e = s.ToEdge();  // the edge to s
                // updated_nxt becomes true only if dist_nxt is actually updated
                <s.dist_nxt; s.updated_nxt> min= <n.dist + e.len; True>;
            }
        }
        G.dist = G.dist_nxt;
        G.updated = G.updated_nxt;
        G.updated_nxt = False;
        fin = !Exist(n: G.Nodes){n.updated};
    }
    ...
54
Using Operators
[Figure: node e with incoming edges from nodes a, b, c, d]

    A = b @ w1
    B = a @ w2
    C = a @ w3
    D = c @ w4
    E = (d @ w5) # (b @ w6)

Assigning # to a thread is vertex-based. Assigning @ to a thread is edge-based.

BFS:  @ = +, # = min, wi = 1
SSSP: @ = +, # = min, wi = wt
PR:   @ = /, # = +,   wi = outdegree
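A generic Python sketch of this formulation, with `at_op` standing in for @ and `hash_op` for #; the per-node reduction makes it vertex-based. The tiny graph and the SSSP instantiation (@ = +, # = min) are illustrative assumptions.

```python
import math
from functools import reduce

def apply_operators(values, in_edges, at_op, hash_op):
    """One round: each node's new value is the #-reduction over
    (neighbor value @ w_i) for its incoming edges."""
    new = dict(values)
    for v, edges in in_edges.items():       # one "thread" per node
        contribs = [at_op(values[u], w) for u, w in edges]
        if contribs:
            new[v] = reduce(hash_op, contribs)
    return new

# SSSP instantiation: @ = +, # = min, w_i = edge weight.
values = {'a': 0, 'b': math.inf, 'c': math.inf}
in_edges = {'b': [('a', 5), ('c', 1)], 'c': [('a', 2)]}
step1 = apply_operators(values, in_edges, lambda x, w: x + w, min)
step2 = apply_operators(step1, in_edges, lambda x, w: x + w, min)
print(step1['b'], step1['c'], step2['b'])  # 5 2 3
```

Swapping in @ = / and # = + with w_i = outdegree gives the PageRank-style update from the table; reading old values into a fresh `new` dict mirrors the dist/dist_nxt double buffering in the Green-Marl version.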
55
Language Features
● Automatic memory management
  ● Memory allocation on host and device
  ● Memory transfer between devices
● Automatic kernel configuration
  ● Easy overriding
● Intelligent barrier insertion
  ● Automatic synchronization
Performance of Synthesized BFS
Graph [nodes(node: Node, dist: int), edges(src: Node, dst: Node)]
source: Node
initDist = [nodes(node n, dist d)] → [d = if (n == source) 0 else ∞]
relaxEdge = [nodes(node p, dist pd), nodes(node q, dist qd),
             edges(src p, dst q), pd + 1 < qd] → [qd = pd + 1]
init = foreach initDist
bfs = iterate relaxEdge
main = init; bfs
● Generate kernels
● Graph rewrite rules
● Graph layout unspecified
● Processing order unspecified

Hand-written multicore: 100%
Hand-written GPU: 29%
Synthesized GPU: 41%
The Road Ahead
● Static graph algorithms
● Dynamic graph algorithms
● Language extensions for morph algorithms
● Optimizations for special classes of graphs
● Applications
Todo
● Literature survey
  ● Higher-level picture
  ● Irregularity definition, earlier work by Yelick, Runger
  ● Multi-core status, TAO
● Morph algorithms
● Benchmark suite: show properties of each b/m
● Optimizations
● Synthesis
● Current work: IR for multicore and GPU
● Add-ons: data-topology, workload
● Cite papers
● Avoid sentences at the bottom of the slides
Subgraph Deletion
● Marking (e.g., SP)
  ● When deletion is infrequent
  ● Deleted memory remains unavailable
● Explicit deletion
  ● When deletion is frequent and deletions are local
  ● Global deletion may require expensive compaction
● Recycle (e.g., DMR)
  ● When the new subgraph fits in the memory of the old subgraph
  ● Moderately expensive
Conflict Resolution
1. mark triangles
–barrier–
2. priority-check markings
–barrier–
3. check markings
(Figure: mesh triangles stamped with the competing thread ids 0 and 1.)
Effect of Optimizations
● Base with partitioning
● Race and resolve
● With atomic-free global barrier
● Optimized memory layout
● Adaptive configuration
● Load balancing
● Single-precision arithmetic
● On-demand memory reallocation

(Chart: speedup, ranging 0 to 80, grows as the optimizations accumulate.)

Input: mesh with 10 million triangles, around 50% of them bad
Benchmark: Delaunay Mesh Refinement