GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
DESCRIPTION
GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs. Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal. Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.

TRANSCRIPT
GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal
Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
GPU Programming Gets Popular
• Many domains use GPUs for high performance
  • e.g., GPU-accelerated molecular dynamics, GPU-accelerated seismic imaging
• Available in both high-end and low-end systems
  • the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012]
  • commodity desktops/laptops are equipped with GPUs
Writing Efficient GPU Programs is Challenging
• Need careful management of
  • a large number of threads (organized in thread blocks)
• Need careful management of
  • a large number of threads
  • a multi-layer memory hierarchy

[Figure: Kepler GK110 memory hierarchy — threads in thread blocks access Shared Memory, the L1 Cache, and the Read-only Data Cache, backed by the L2 Cache and DRAM (device memory)]
• The on-chip memories (shared memory, L1 cache, read-only data cache) are fast but small, while DRAM (device memory) is large but slow
• The key tuning questions:
  • Which data in shared memory are infrequently accessed?
  • Which data in device memory are frequently accessed?
• Existing tools can’t help much
  • inapplicable to GPUs
  • coarse-grained
  • prohibitive runtime overhead
  • cannot handle irregular/indirect accesses
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
GMProf-basic: The Naïve Profiling Approach
• Shared memory profiling
  • integer counters count the accesses to shared memory
  • one counter for each shared memory element
  • counters are updated atomically, to avoid race conditions among threads
• Device memory profiling
  • integer counters count the accesses to device memory
  • one counter for each element in the user device memory arrays, since device memory is too large to be monitored as a whole (e.g., 6 GB)
  • counters are updated atomically
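The per-element counting above can be sketched as follows. This is a host-side C++ sketch for illustration only: in GMProf the instrumentation is inserted into CUDA kernels and the atomic update is CUDA’s atomicAdd; the `AccessProfile` name and its methods are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// GMProf-basic sketch: one integer counter per monitored array element.
// Every instrumented load/store bumps the matching counter atomically,
// so concurrent threads cannot lose updates (atomicAdd on the GPU).
struct AccessProfile {
    std::vector<std::atomic<unsigned>> counters;

    explicit AccessProfile(std::size_t n) : counters(n) {
        for (auto& c : counters) c.store(0);  // explicit zero-init
    }

    // Called at each instrumented access to element `idx`.
    void record(std::size_t idx) {
        counters[idx].fetch_add(1, std::memory_order_relaxed);
    }

    unsigned count(std::size_t idx) const {
        return counters[idx].load(std::memory_order_relaxed);
    }
};
```

The atomic update is exactly what makes this naïve version expensive; the later GMProf-NA optimization drops it.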
GMProf-SA: Static Analysis Optimization
• Observation I: Many memory accesses can be determined statically

__shared__ int s[];
…
s[threadIdx.x] = 3;
Don’t need to count the access at runtime
• How about this …
__shared__ float s[];
…
for(r=0; …; …) {
  for(c=0; …; …) {
    temp = s[input[c]];
  }
}
• Observation II: Some accesses are loop-invariant
  • e.g., s[input[c]] does not depend on the outer loop iterator r
Don’t need to profile in every r iteration
• Observation III: Some accesses are tid-invariant
  • e.g., s[input[c]] does not depend on threadIdx
Don’t need to update the counter in every thread
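Observations II and III can be combined into one transformation, sketched below in host-side C++ (an illustrative sketch, not the actual GMProf compiler pass; the function name and the scaling trick are assumptions). Since s[input[c]] depends on neither r nor threadIdx, one plausible accounting is to profile it once per c, in a single thread, and scale the count by the hoisted loop’s trip count.

```cpp
#include <cstddef>
#include <vector>

// GMProf-SA sketch: the access s[input[c]] is invariant in the outer
// loop iterator r, so instead of incrementing a counter inside both
// loops (and in every thread), we count once per c iteration and
// multiply by the number of r iterations that were hoisted away.
std::vector<unsigned> profile_hoisted(const std::vector<std::size_t>& input,
                                      std::size_t shm_size,
                                      unsigned r_iters) {
    std::vector<unsigned> counters(shm_size, 0);
    for (std::size_t c = 0; c < input.size(); ++c)  // inner loop only
        counters[input[c]] += r_iters;              // scale by hoisted trip count
    return counters;
}
```

The statically-determined accesses of Observation I need no runtime counter at all; only the indirect s[input[c]]-style accesses keep (reduced) instrumentation.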
GMProf-NA: Non-Atomic Operation Optimization
• Atomic operations cost a lot
  • atomicAdd(&counter, 1) serializes all the concurrent threads that update a shared counter
• Use non-atomic operations to update the counters instead
  • this does not hurt the overall accuracy, thanks to the other optimizations
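The change itself is tiny, as this hedged C++ sketch shows (the function is illustrative; on the GPU it is simply counter++ in place of atomicAdd(&counter, 1)):

```cpp
#include <cstddef>

// GMProf-NA sketch: a plain, non-atomic increment. Under true
// concurrency two threads may occasionally overwrite each other's
// update and lose a count, but the profile only needs relative access
// frequencies, so a small undercount is acceptable and the
// serialization cost of the atomic read-modify-write is avoided.
void record_nonatomic(unsigned* counters, std::size_t idx) {
    counters[idx] += 1;  // fast, approximately correct
}
```

This is a deliberate accuracy-for-speed trade: the static-analysis optimizations have already removed most counter updates, so the remaining non-atomic ones rarely collide.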
GMProf-SM: Shared Memory Counters Optimization
• Make full use of (fast but small) shared memory
  • store the counters in shared memory when possible
  • reduce the counter size, e.g., from 32-bit integers to 8-bit
GMProf-TH: Threshold Optimization
• A precise count may not be necessary
  • e.g., it is enough to know that A is accessed about 10 times while B is accessed more than 100 times
• Stop counting once a counter reaches a certain threshold
  • a tradeoff between accuracy and overhead
GMProf-Enhanced: Live Range Analysis
• The number of accesses to a shared memory location may be misleading
  • e.g., data0, data1, and data2 are staged one after another through the same shm_buf in shared memory while being streamed from input_array to output_array in device memory
• Need to count the accesses/reuse of DATA, not addresses
GMProf-Enhanced: Live Range Analysis
• Track data during its live range in shared memory
• Use a logical clock to mark the boundary of each live range
  • separate counters in each live range, based on the logical clock

shm_buffer = input_array[0]   // load data0 from DM to ShM  — live range of data0 begins
...
output_array[0] = shm_buffer  // store data0 from ShM to DM — live range of data0 ends
...
shm_buffer = input_array[1]   // load data1 from DM to ShM  — live range of data1 begins
...
output_array[1] = shm_buffer  // store data1 from ShM to DM — live range of data1 ends
...
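The logical-clock idea can be sketched as follows (host-side C++; the struct and its methods are illustrative, not GMProf’s actual data structures). The clock ticks at every live-range boundary, i.e., whenever the shared memory buffer is refilled from device memory, and counts are kept per (live range, location) pair.

```cpp
#include <cstddef>
#include <vector>

// GMProf-Enhanced sketch: per-live-range counters. Incrementing the
// logical clock at each buffer refill separates the counts of each
// piece of DATA that passes through the same shared memory ADDRESS.
struct LiveRangeProfile {
    std::size_t clock = 0;                      // current live range id
    std::vector<std::vector<unsigned>> counts;  // counts[range][location]

    explicit LiveRangeProfile(std::size_t n) {
        counts.emplace_back(n, 0u);             // counters for range 0
    }

    // Called when the buffer is refilled from device memory,
    // i.e., at a live-range boundary.
    void new_live_range() {
        ++clock;
        counts.emplace_back(counts[0].size(), 0u);  // fresh counters
    }

    void record(std::size_t idx) { counts[clock][idx] += 1; }
};
```

With this separation, a buffer that streams many data items each touched twice reports a reuse of 2 per item, instead of one large, misleading per-address total.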
Methodology
• Platform
  • GPU: NVIDIA Tesla C1060
    • 240 cores (30 × 8), 1.296 GHz
    • 16 KB shared memory per SM
    • 4 GB device memory
  • CPU: 2 × AMD Opteron, 2.6 GHz
  • 8 GB main memory
  • Linux kernel 2.6.32
  • CUDA Toolkit 3.0
• Six applications
  • Co-clustering, EM clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC
Runtime Overhead for Profiling Shared Memory Use
[Bar chart: per-application slowdowns; the naïve GMProf-basic approach incurs overheads ranging from 90x to 648x (182x, 144x, 648x, 181x, 113x, 90x), while the fully optimized GMProf incurs about 2.6x]
Runtime Overhead for Profiling Device Memory Use
[Bar chart: per-application slowdowns; the naïve approach incurs overheads ranging from 48.5x to 197x (83x, 197x, 48.5x), while the fully optimized GMProf incurs about 1.6x]
Case Study I: Put the most frequently used data into shared memory
• bo_v1: a naïve implementation where all data arrays are stored in device memory
• A1 ~ A4: four data arrays; (N): average access count of the elements in the corresponding data array

| Profiling Result | GMProf-basic | GMProf w/o TH | GMProf w/ TH |
|---|---|---|---|
| ShM | 0 | 0 | 0 |
| DM | A1(276) A2(276) A3(128) A4(1) | A1(276) A2(276) A3(128) A4(1) | A1(THR) A2(THR) A3(128) A4(1) |
• bo_v2: an improved version that puts the most frequently used arrays (identified by GMProf) into shared memory

| Profiling Result | GMProf-basic | GMProf w/o TH | GMProf w/ TH |
|---|---|---|---|
| ShM | A1(174,788) A2(169,221) | A1(165,881) A2(160,315) | A1(THR) A2(THR) |
| DM | A3(128) A4(1) | A3(128) A4(1) | A3(128) A4(1) |

• bo_v2 outperforms bo_v1 by a factor of 39.63
Case Study II: Identify the true reuse of data
• jcb_v1: the shared memory is accessed frequently, but there is little reuse of the data

| Profiling Result | GMProf-basic | GMProf w/o Enh. Alg. | GMProf w/ Enh. Alg. |
|---|---|---|---|
| ShM | shm_buf (5,760) | shm_buf (5,748) | shm_buf (2) |
| DM | in(4) out(1) | in(4) out(1) | in(4) out(1) |

• jcb_v2:

| Profiling Result | GMProf-basic | GMProf w/o Enh. Alg. | GMProf w/ Enh. Alg. |
|---|---|---|---|
| ShM | shm_buf (4,757) | shm_buf (4,741) | shm_buf (4) |
| DM | in(1) out(1) | in(1) out(1) | in(1) out(1) |

• jcb_v2 outperforms jcb_v1 by 2.59 times
Conclusions
• GMProf
  • a statically-assisted dynamic profiling approach
  • architecture-based optimizations
  • live range analysis to capture the real usage of data
  • low-overhead and fine-grained
  • may be applied to profile other events
Thanks!