Atomic Operations across GPU generations
Juan Gómez-Luna
University of Córdoba (Spain)
University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015
About me
• Juan Gómez-Luna
• Telecommunications Engineering (University of Sevilla, 2001)
• Lecturer at the University of Córdoba since 2005
• PhD Thesis (University of Córdoba, 2012)
  – Programming Issues for Video Analysis on Graphics Processing Units
• Research collaborations:
  – Technical University Munich (Germany)
  – Technical University Eindhoven (The Netherlands)
  – University of Illinois at Urbana-Champaign (USA)
  – University of Málaga (Spain)
  – Barcelona Supercomputing Center (Spain)
• PI of University of Córdoba GPU Education Center (supported by NVIDIA)
Outline
• Uses of atomic operations
• Atomic operations on shared memory
  – Evolution across GPU generations
  – Case studies
    • Stream compaction
    • Histogramming
    • Reduction
• Atomic operations on global memory
  – Evolution across GPU generations
  – Case studies
    • Scatter vs. gather
    • Adjacent thread block synchronization
Uses of atomic operations
• Collaboration
  – Atomics on an array that will be the output of the kernel
  – Example: Histogramming
• Synchronization
  – Atomics on memory locations that are used for synchronization or coordination
  – Example: Locks, flags…
Uses of atomic operations
• CUDA provides atomic functions on shared memory and global memory
• Arithmetic functions
  – Add, sub, max, min, exch, inc, dec, CAS
    int atomicAdd(int*, int);
• Bitwise functions
  – And, or, xor
• Supported types: int, unsigned int, unsigned long long int, and float (a minimal usage example follows below)
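A minimal illustration (my sketch, not code from the slides): the same atomicAdd intrinsic compiles to a shared memory atomic or a global memory atomic depending on the address space of the pointer passed to it.

__global__ void count_matches(const int *data, int n, int key, int *g_count) {
    __shared__ int s_count;                  // block-private counter
    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == key)
        atomicAdd(&s_count, 1);              // shared memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(g_count, s_count);         // global memory atomic
}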
Atomic operations on shared memory
• Code
  – CUDA: int atomicAdd(int*, int);
  – PTX: atom.shared.add.u32 %r25, [%rd14], 1;
  – SASS on Tesla, Fermi, Kepler (Lock/Update/Unlock loop):

/*00a0*/ LDSLK P0, R9, [R8];
/*00a8*/ @P0 IADD R10, R9, R7;
/*00b0*/ @P0 STSCUL P1, [R8], R10;
/*00b8*/ @!P1 BRA 0xa0;

  – SASS on Maxwell (native atomic operations for 32-bit integers, plus 32-bit and 64-bit atomicCAS):

/*01f8*/ ATOMS.ADD RZ, [R7], R11;

• Lock/Update/Unlock vs. native atomic operations
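The pre-Maxwell Lock/Update/Unlock sequence is in essence a hardware retry loop. The same pattern can be written in software with atomicCAS; the sketch below follows the well-known CUDA Programming Guide recipe (an assumption on my part, not code from these slides) for emulating atomicAdd on double, which has no native atomic on these architectures.

// Emulate atomicAdd on double via a CAS retry loop, mirroring the
// lock/update/retry structure of the pre-Maxwell SASS above.
__device__ double atomicAddDouble(double *address, double val) {
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Retry until no other thread has modified the location in between
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}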
Atomic operations on shared memory
• Atomic conflict degree
  – Intra-warp conflict degree ranges from 1 to 32
  – No atomic conflict: votes proceed concurrently, in time t_base
  – Atomic conflict: votes are serialized, adding t_conflict
[Figure: two shared memory diagrams contrasting concurrent votes (no conflict, t_base) with serialized votes (conflict, t_base + t_conflict)]
Atomic operations on shared memory
• Microbenchmarking on Tesla, Fermi and Kepler
  – Position conflicts (GTX 580 – Fermi)
Atomic operations on shared memory
• Microbenchmarking on Tesla, Fermi and Kepler
  – Position conflicts (K20 – Kepler)
[Figure: execution time (ms, 0–80) vs. number of threads in the warp voting in the same location as the warp leader thread (1–31), for block sizes 32 to 1024]
Atomic operations on shared memory
• Microbenchmarking on Maxwell
  – Position conflicts (GTX 980 – Maxwell)
[Figure: execution time (ms, 0–10) vs. number of threads in the warp voting in the same location as the warp leader thread (1–31), for block sizes 32 to 1024; note the roughly order-of-magnitude smaller scale than on Kepler]
Case studies: Filtering
• Filtering / Stream compaction

Input:     2 1 3 0 0 1 3 4 0 0 2 1
Predicate: element > 0
Output:    2 1 3 1 3 4 2 1
Case studies: Filtering
• Filtering / Stream compaction
  – Global memory atomics:

__global__ void filter_k(int *dst, int *nres, const int *src, int n, int value) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n && src[i] != value) {
    int index = atomicAdd(nres, 1);
    dst[index] = src[i];
  }
}

  – Shared memory atomics:

__global__ void filter_shared_k(int *dst, int *nres, const int *src, int n, int value) {
  __shared__ int l_n;
  int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;

  for (int iter = 0; iter < NPER_THREAD; iter++) {
    // zero the counter
    if (threadIdx.x == 0)
      l_n = 0;
    __syncthreads();

    // get the value, evaluate the predicate, and
    // increment the counter if needed
    int d, pos;
    if (i < n) {
      d = src[i];
      if (d != value)
        pos = atomicAdd(&l_n, 1);
    }
    __syncthreads();

    // leader increments the global counter
    if (threadIdx.x == 0)
      l_n = atomicAdd(nres, l_n);
    __syncthreads();

    // threads with true predicates write their elements
    if (i < n && d != value) {
      pos += l_n;  // increment local pos by global counter
      dst[pos] = d;
    }
    __syncthreads();
    i += BS;
  }
}
Case studies: Filtering
• Filtering / Stream compaction: Shared memory atomics
Case studies: Filtering
• Filtering / Stream compaction
  – Find more: CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics (a sketch of the idea follows below)
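A hedged sketch of what the Pro Tip describes (the helper name is mine; the slides do not include this code): threads with a true predicate elect a leader, the leader performs a single atomicAdd for the whole warp, and every participating thread derives its own offset from the warp ballot. It uses pre-Volta intrinsics (__ballot, __shfl), matching the 2015 time frame.

__device__ int warpAggregatedAtomicAdd(int *ctr) {
    unsigned int active = __ballot(1);         // mask of threads executing this call
    int leader = __ffs(active) - 1;            // lowest active lane is the leader
    int lane = threadIdx.x % warpSize;
    int res;
    if (lane == leader)
        res = atomicAdd(ctr, __popc(active));  // one atomic per warp, not per thread
    res = __shfl(res, leader);                 // broadcast the base offset
    // rank of this lane among the active lanes below it
    return res + __popc(active & ((1u << lane) - 1));
}

In filter_k above, replacing int index = atomicAdd(nres, 1); with int index = warpAggregatedAtomicAdd(nres); cuts atomic traffic by up to 32x.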
Case studies: Histogramming
• Privatization for histogram generation (a minimal sketch follows below)
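To make the idea concrete, a minimal sketch of a privatized 256-bin histogram (a generic pattern, assumed rather than taken from the slides): each block accumulates into a private sub-histogram with shared memory atomics and merges it into the global histogram once, so global atomics cost one update per (block, bin) instead of one per pixel.

__global__ void histo_private(const unsigned char *in, unsigned int n,
                              unsigned int *histo) {
    __shared__ unsigned int sub[256];        // block-private sub-histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        sub[b] = 0;
    __syncthreads();

    // grid-stride loop: vote with shared memory atomics
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&sub[in[i]], 1);
    __syncthreads();

    // merge the sub-histogram into the global histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&histo[b], sub[b]);
}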
Case studies: Histogramming
• Privatization
  – 256-bin histogram calculation for 100 real images
  – Shared memory implementation uses 1 sub-histogram per block
  – Global atomics were greatly improved in Kepler
Case studies: Histogramming
• Histogram calculation

for (each pixel i in image I) {
  Pixel  = I[i]                      // Read pixel
  Pixel' = Computation(Pixel)        // Optional computation
  Histogram[Pixel']++                // Vote in histogram bin
}
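A direct CUDA rendering of this loop (my sketch, assuming one thread per pixel and plain global memory atomics, with no privatization yet):

__global__ void histo_naive(const unsigned char *I, unsigned int n,
                            unsigned int *Histogram) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned char pixel = I[i];          // Read pixel
        // (optional computation on the pixel value would go here)
        atomicAdd(&Histogram[pixel], 1);     // Vote in histogram bin
    }
}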
Case studies: Histogramming
• Histogram calculation
  – Natural images exhibit spatial correlation: neighboring pixels tend to fall into the same bins, which raises the atomic conflict degree
Case studies: Histogramming
• Histogram calculation
  – Privatization (per-block sub-histograms in shared memory) + Replication (several sub-histogram copies per block, spreading conflicting votes) + Padding (copies shifted so the same bin maps to different shared memory banks); a sketch follows below
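A hedged sketch of replication with padding (based on the general technique; R = 8 is an assumed value, and this is not the slides' exact kernel): R sub-histogram copies per block spread conflicting votes across copies, and padding each copy by one entry staggers the copies across shared memory banks.

#define R    8          // replication factor (assumed value)
#define BINS 256

__global__ void histo_replicated(const unsigned char *in, unsigned int n,
                                 unsigned int *histo) {
    // R copies, each padded by 1 entry to shift its bank mapping
    __shared__ unsigned int sub[R * (BINS + 1)];
    for (int k = threadIdx.x; k < R * (BINS + 1); k += blockDim.x)
        sub[k] = 0;
    __syncthreads();

    // each thread votes in its own copy, spreading intra-warp conflicts
    const unsigned int base = (threadIdx.x % R) * (BINS + 1);
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&sub[base + in[i]], 1);
    __syncthreads();

    // reduce the R copies and merge into the global histogram
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) {
        unsigned int sum = 0;
        for (int k = 0; k < R; k++)
            sum += sub[k * (BINS + 1) + b];
        atomicAdd(&histo[b], sum);
    }
}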
Case studies: Histogramming
• Histogram calculation: 100 real images, Privatization + Replication + Padding
[Figure: execution time (ms, 0–1.4) vs. replication factor for 64-bin and 256-bin histograms, on GTX 580 (Fermi GF110), K40c (Kepler GK110), and GTX 980 (Maxwell GM204)]
Case studies: Reduction
• Reduction
  – A tree-based algorithm is recommended (avoid the scatter style of one atomic vote per element)
Case studies: Reduction
• Reduction
  – 7 versions in the CUDA samples: tree-based reduction in shared memory (a sketch of the basic pattern follows the list)
    • Version 0: No whole warps active
    • Version 1: Contiguous threads, but many bank conflicts
    • Version 2: No bank conflicts
    • Version 3: First level of reduction when reading from global memory
    • Version 4: Warp shuffle or unrolling of final warp
    • Version 5: Warp shuffle or complete unrolling
    • Version 6: Multiple elements per thread sequentially
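For reference, a minimal sketch of the tree-based pattern these versions refine (close in spirit to reduce3 in the CUDA samples, but not a verbatim copy; blockDim.x is assumed to be a power of two and *out zeroed before launch):

__global__ void reduce_tree(const int *in, int *out, unsigned int n) {
    extern __shared__ int sdata[];           // blockDim.x ints
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // first level of reduction while reading from global memory (Version 3)
    int sum = (i < n) ? in[i] : 0;
    if (i + blockDim.x < n) sum += in[i + blockDim.x];
    sdata[tid] = sum;
    __syncthreads();

    // tree-based reduction: contiguous threads stay active, no bank conflicts
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // one global atomic per block combines the partial sums,
    // replacing a second kernel launch
    if (tid == 0) atomicAdd(out, sdata[0]);
}

Launch with reduce_tree<<<blocks, threads, threads * sizeof(int)>>>(in, out, n).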
Atomic operations on global memory
• Tesla:
  – Executed in DRAM
• Fermi:
  – Executed at the L2 cache
  – Atomic units near L2
• Kepler and Maxwell:
  – Atomic units near L2 now have a kind of local cache, buffering repeated updates to the same address
Case study: Scatter vs. gather
• Scatter vs. Gather

__global__ void s2g_gpu_scatter_kernel(unsigned int* in, unsigned int* out,
                                       unsigned int num_in, unsigned int num_out) {
  unsigned int inIdx = blockIdx.x*blockDim.x + threadIdx.x;
  if (inIdx < num_in) {
    unsigned int intermediate = outInvariant(in[inIdx]);
    for (unsigned int outIdx = 0; outIdx < num_out; ++outIdx) {
      // every input thread updates every output element: atomics required
      atomicAdd(&(out[outIdx]), outDependent(intermediate, inIdx, outIdx));
    }
  }
}

__global__ void s2g_gpu_gather_kernel(unsigned int* in, unsigned int* out,
                                      unsigned int num_in, unsigned int num_out) {
  unsigned int outIdx = blockIdx.x*blockDim.x + threadIdx.x;
  if (outIdx < num_out) {
    unsigned int out_reg = 0;
    for (unsigned int inIdx = 0; inIdx < num_in; ++inIdx) {
      unsigned int intermediate = outInvariant(in[inIdx]);
      out_reg += outDependent(intermediate, inIdx, outIdx);
    }
    out[outIdx] += out_reg;  // each output thread owns its element: no atomics
  }
}
Case study: Scatter vs. gather
• Scatter vs. Gather
  – Tesla: No L2 cache
Case study: Scatter vs. gather
• Scatter vs. Gather
  – Fermi: ROPs in L2 cache
Case study: Scatter vs. gather
• Scatter vs. Gather
  – Kepler: Buffer in ROPs
Case study: Adjacent block synchronization
• GPU programming with CUDA (or OpenCL) might not completely exploit the inherent parallelism in some algorithms
  – In-place operations
    • Possible dependences between consecutive thread blocks
  – Bulk synchronous parallel programming
    • Thread block synchronization requires kernel termination and relaunch
Case study: Adjacent block synchronization
• In-place matrix padding
  – Limited GPU memory makes in-place operation desirable
Case study: Adjacent block synchronization
• In-place matrix padding
  – Temporary storage in on-chip memory
  – Bulk synchronous programming
    • Global synchronization = kernel termination and relaunch
[Figure: design spectrum ranging from less parallelism to more parallelism]
Case study: Adjacent block synchronization
• Motivation: In-place matrix padding
  – 5000x4900 -> 5000x5000
    • Almost 100 rows moved in the first iteration
    • 181 iterations with some parallelism
    • Last 99 iterations move rows sequentially
  – Effective throughput is less than 20% of peak bandwidth
Case study: Adjacent block synchronization
• Regular Data Sliding
  – Dynamic thread block id allocation
    • Avoids deadlocks
  – Loading stage
    • Coarsening factor
  – Adjacent thread block synchronization
    • Avoids kernel termination and relaunch
  – Storing stage
Case study: Adjacent block synchronization
• Timing comparison of the two approaches
[Figure: two timelines for moving Rows 1–4. With kernel termination and re-launch, each row's load/store pair must finish before the next kernel begins. With adjacent synchronization (AS), each block starts loading its row immediately and only its store waits on the previous block's AS flag, so the rows' transfers overlap in time]
Case study: Adjacent block synchronization
• Regular Data Sliding
  – Adjacent block synchronization (Yan et al., 2013)
    • Leader thread waits until the previous block sets its flag
    • Avoids kernel termination and relaunch

__syncthreads();
if (tid == 0) {
  // Wait for the previous block's flag
  while (atomicOr(&flags[bid_ - 1], 0) == 0) {;}
  // Set this block's flag
  atomicOr(&flags[bid_], 1);
}
__syncthreads();
Case study: Adjacent block synchronization
• Regular Data Sliding
  – Dynamic block id allocation
    • Avoids deadlocks: ids are taken in scheduling order, so a block never waits on the flag of a block scheduled after it (a combined skeleton follows below)

__shared__ int bid_;
if (tid == 0)
  bid_ = atomicAdd(&S, 1);
__syncthreads();
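Putting the pieces together, a hedged skeleton (simplified from the stages described above; this is not the authors' kernel): blocks slide an array left by shift elements in place. flags[] and S are assumed zero-initialized before launch, each block moves blockDim.x elements, and the coarsening factor is omitted.

__device__ int S;                  // next dynamic block id

__global__ void slide_left(int *data, int n_out, int shift,
                           unsigned int *flags) {
    __shared__ int bid_;
    if (threadIdx.x == 0)
        bid_ = atomicAdd(&S, 1);   // ids follow scheduling order: no deadlock
    __syncthreads();

    // Loading stage: stage this block's chunk in on-chip storage (a register)
    int idx = bid_ * blockDim.x + threadIdx.x;
    int v = (idx < n_out) ? data[idx + shift] : 0;

    // Adjacent synchronization: wait until the previous block has stored;
    // a chunk's destination overlaps only sources of lower-id blocks
    if (threadIdx.x == 0 && bid_ > 0)
        while (atomicOr(&flags[bid_ - 1], 0) == 0) {}
    __syncthreads();

    // Storing stage: write the staged chunk to its final position
    if (idx < n_out)
        data[idx] = v;
    __syncthreads();               // all stores of this block complete
    __threadfence();               // and are visible device-wide
    if (threadIdx.x == 0)
        atomicOr(&flags[bid_], 1); // signal the next block
}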
Case study: Adjacent block synchronization
[Figure: data sliding through on-chip memory (registers and shared memory): 4 concurrent blocks of 2 threads each move chunks between global memory locations, coordinating through flags]
Case study: Adjacent block synchronization
• Regular Data Sliding: Padding and Unpadding
  – Baseline = bulk synchronous implementation (Motivation)
  – Up to 9.11x (Maxwell) and up to 73.25x (Hawaii)
[Figure: four panels of throughput (GB/s) vs. number of columns before padding / after unpadding (4950–4995, 5000 rows), comparing DS Padding/Unpadding against the Baseline on NVIDIA Maxwell and AMD Hawaii]
Case study: Adjacent block synchronization
• Irregular Data Sliding
  – Dynamic block id allocation
  – Loading stage
    • Local counter
  – Reduction
  – Adjacent block synchronization
  – Storing stage
    • Binary prefix-sum within the thread block (see the warp-level sketch below)
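The binary prefix-sum in the storing stage can be built from warp ballots. A minimal warp-level sketch (illustrative, using pre-Volta intrinsics; the slides do not show this code): each thread holds pred, either 0 or 1, and the result is the number of lower lanes whose predicate is true.

__device__ int warp_binary_prefix_sum(int pred, int lane) {
    unsigned int ballot = __ballot(pred);         // one predicate bit per lane
    return __popc(ballot & ((1u << lane) - 1));   // count set bits below this lane
}

Per-warp totals, __popc(ballot), can then be combined across the warps of a block in shared memory to complete the block-level prefix sum.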
Case study: Adjacent block synchronization
• Irregular Data Sliding
  – Adjacent block synchronization
    • count (shared memory variable) contains the block's reduction result
    • flag + count is the prefix sum of the blocks' reductions
    • count = flag makes the exclusive prefix sum visible to all threads in the block

__syncthreads();
if (tid == 0) {
  // Wait for the previous block's flag
  while (atomicOr(&flags[bid_ - 1], 0) == 0) {;}
  // Read the prefix sum of all previous blocks' counts
  int flag = flags[bid_ - 1];
  // Publish this block's inclusive prefix sum
  atomicAdd(&flags[bid_], flag + count);
  count = flag;
}
__syncthreads();
Case study: Adjacent block synchronization
• Irregular Data Sliding: Select
  – Up to 3.05x faster than Thrust on Maxwell
  – 2.80x on Kepler
  – 1.78x on Fermi
Case study: Adjacent block synchronization
• Irregular Data Sliding
  – Stream compaction
    • Our in-place, stable implementation reaches 68% of the speed of the fastest out-of-place, unstable kernel
  – Unique
    • Up to 3.24x faster than Thrust on Maxwell
    • 2.73x on Kepler
    • 1.66x on Fermi
  – Partition
    • Up to 2.84x faster than Thrust on Maxwell
    • 2.88x on Kepler
    • 1.64x on Fermi
Summary
• Significant hardware improvements for atomic operations
  – Shared memory: native integer atomics (Maxwell)
  – Global memory: L2 + buffer in ROPs (Kepler, Maxwell)
• They can free programmers from applying software optimizations
  – Histogramming
• They may allow a more natural way of coding, saving many lines of code
  – Reduction
• They may allow using new, faster algorithms
  – Filtering
  – Adjacent synchronization
Atomic Operations across GPU generations
Juan Gómez-Luna
University of Córdoba (Spain) [email protected] [email protected]
University of Illinois at Urbana-Champaign. ECE 408. October 22, 2015