Transcript
Page 1: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs

GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs

Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal

Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

Page 2: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GPU Programming Gets Popular

• Many domains are using GPUs for high performance

[Images: GPU-accelerated Molecular Dynamics; GPU-accelerated Seismic Imaging]

• Available in both high-end/low-end systems
  • the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012]
  • commodity desktops/laptops equipped with GPUs

Page 3: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


• Need careful management of
  • a large number of threads

Writing Efficient GPU Programs is Challenging

[Figure: thread blocks]

Page 4: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


• Need careful management of
  • a large number of threads
  • multi-layer memory hierarchy

Writing Efficient GPU Programs is Challenging

[Figure: Kepler GK110 memory hierarchy — threads in thread blocks access Shared Memory, the L1 Cache, and the Read-only Data Cache, backed by the L2 Cache and DRAM (device memory)]

Page 5: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


• Need careful management of
  • a large number of threads
  • multi-layer memory hierarchy

Writing Efficient GPU Programs is Challenging

[Figure: Kepler GK110 memory hierarchy — the on-chip Shared Memory, L1 Cache, and Read-only Data Cache are fast but small; DRAM (device memory) behind the L2 Cache is large but slow]

Page 6: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Writing Efficient GPU Programs is Challenging


Which data in shared memory are infrequently accessed?

Which data in device memory are frequently accessed?

[Figure: Kepler GK110 memory hierarchy — Shared Memory, L1 Cache, Read-only Data Cache, L2 Cache, DRAM (Device Memory)]

Page 7: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


• Existing tools can't help much
  • inapplicable to GPUs
  • coarse-grained
  • prohibitive runtime overhead
  • cannot handle irregular/indirect accesses

Writing Efficient GPU Programs is Challenging


Which data in shared memory are infrequently accessed?

Which data in device memory are frequently accessed?

[Figure: Kepler GK110 memory hierarchy — Shared Memory, L1 Cache, Read-only Data Cache, L2 Cache, DRAM (Device Memory)]

Page 8: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Outline

• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions

Page 9: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-basic: The Naïve Profiling Approach

• Shared Memory Profiling
  • integer counters to count accesses to shared memory
  • one counter for each shared memory element
  • atomically update the counter
    • to avoid race conditions among threads
• Device Memory Profiling
  • integer counters to count accesses to device memory
  • one counter for each element in the user device memory array
    • since device memory is too large to be monitored as a whole (e.g., 6GB)
  • atomically update the counter (sketched below)
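As a rough illustration of this naive instrumentation, here is a minimal CUDA sketch; the kernel, the in/out arrays, and the dm_counters/shm_counters layout are illustrative assumptions, not the tool's actual generated code:

    // GMProf-basic style instrumentation (sketch): one counter per monitored element,
    // every counter updated with an atomic operation. Both counter arrays live in
    // device memory; the shared-memory counters are indexed per block.
    __global__ void kernel_basic(const float *in, float *out, int n,
                                 unsigned int *dm_counters,    // one per element of `in`
                                 unsigned int *shm_counters)   // blockDim.x counters per block
    {
        __shared__ float s[256];                               // assumes blockDim.x == 256
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        if (tid < n) {
            // Original access: s[threadIdx.x] = in[tid];
            atomicAdd(&dm_counters[tid], 1u);                                    // count device-memory read
            atomicAdd(&shm_counters[blockIdx.x * blockDim.x + threadIdx.x], 1u); // count shared-memory write
            s[threadIdx.x] = in[tid];

            // Original access: out[tid] = s[threadIdx.x] * 2.0f;
            atomicAdd(&shm_counters[blockIdx.x * blockDim.x + threadIdx.x], 1u); // count shared-memory read
            out[tid] = s[threadIdx.x] * 2.0f;
        }
    }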

Page 10: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Outline

• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions

Page 11: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization


• Observation I: Many memory accesses can be determined statically

extern __shared__ int s[];
…
s[threadIdx.x] = 3;

Page 12: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization


• Observation I: Many memory accesses can be determined statically

extern __shared__ int s[];
…
s[threadIdx.x] = 3;

Don’t need to count the access at runtime
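As a hedged illustration of why no runtime counting is needed here: for s[threadIdx.x] = 3 outside any loop, static analysis already knows that each shared-memory element indexed by a thread is written exactly once per block, so the count can be folded in on the host. The helper below is hypothetical, not part of GMProf's interface:

    // Host-side bookkeeping sketch: credit the statically derived count once per
    // element instead of emitting a runtime atomicAdd for every access.
    void account_static_access(unsigned int *shm_counters, int block_dim_x)
    {
        for (int i = 0; i < block_dim_x; ++i)
            shm_counters[i] += 1;   // s[threadIdx.x] = 3 touches element i exactly once per block
    }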

Page 13: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization


• Observation I: Many memory accesses can be determined statically

extern __shared__ int s[];
…
s[threadIdx.x] = 3;

Don’t need to count the access at runtime

• How about this …

extern __shared__ float s[];
…
for (r = 0; …; …) {
    for (c = 0; …; …) {
        temp = s[input[c]];
    }
}

Page 14: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization

• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] does not depend on the outer loop iterator r

extern __shared__ float s[];
…
for (r = 0; …; …) {
    for (c = 0; …; …) {
        temp = s[input[c]];
    }
}

Page 15: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization

• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] does not depend on the outer loop iterator r

extern __shared__ float s[];
…
for (r = 0; …; …) {
    for (c = 0; …; …) {
        temp = s[input[c]];
    }
}

Don't need to profile in every r iteration

Page 16: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization

• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] does not depend on the outer loop iterator r

extern __shared__ float s[];
…
for (r = 0; …; …) {
    for (c = 0; …; …) {
        temp = s[input[c]];
    }
}

Don't need to profile in every r iteration

• Observation III: Some accesses are tid-invariant
  • E.g., s[input[c]] does not depend on threadIdx

Page 17: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SA: Static Analysis Optimization

• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] does not depend on the outer loop iterator r

extern __shared__ float s[];
…
for (r = 0; …; …) {
    for (c = 0; …; …) {
        temp = s[input[c]];
    }
}

Don't need to profile in every r iteration

• Observation III: Some accesses are tid-invariant
  • E.g., s[input[c]] does not depend on threadIdx

Don't need to update the counter in every thread (see the sketch below)
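A sketch of how the instrumentation might look once these observations are exploited (the counter names, placement, and exact counts recorded after hoisting are assumptions): the update for s[input[c]] is hoisted out of the r loop because the access is loop-invariant in r, and only one representative thread per block performs it because the access is tid-invariant.

    __global__ void kernel_sa(const int *input, int rows, int cols,
                              unsigned int *shm_counters)
    {
        extern __shared__ float s[];

        // Static analysis: s[input[c]] depends on neither r nor threadIdx, so the
        // profiling update is hoisted out of the r loop and done by one thread only.
        if (threadIdx.x == 0) {
            for (int c = 0; c < cols; ++c)
                shm_counters[input[c]] += 1;   // counted once, not once per r iteration per thread
        }

        // The actual accesses run uninstrumented.
        float temp;
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                temp = s[input[c]];
        (void)temp;                            // silence unused-value warnings in this sketch
    }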

Page 18: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-NA: Non-Atomic Operation Optimization

• Atomic operations cost a lot
  • serialize all concurrent threads when updating a shared counter
• Use non-atomic operations to update counters (contrasted in the sketch below)
  • does not impact the overall accuracy, thanks to the other optimizations

[Figure: atomicAdd(&counter, 1) turns otherwise concurrent threads into serialized threads]
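A minimal before/after sketch of this change, wrapped in hypothetical device helpers (the helper names are not GMProf's):

    // Counter update used by the instrumented code.
    __device__ void profile_access_atomic(unsigned int *counter)
    {
        atomicAdd(counter, 1u);   // correct, but serializes all threads contending on this counter
    }

    __device__ void profile_access_nonatomic(unsigned int *counter)
    {
        *counter += 1u;           // plain read-modify-write: a few increments may be lost under
                                  // contention, which the SA/TH optimizations make tolerable
    }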

Page 19: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SM: Shared Memory Counters Optimization

• Make full use of shared memory
  • Store counters in shared memory when possible
  • Reduce counter size
    • E.g., 32-bit integer counters -> 8-bit

[Figure: memory hierarchy — Shared Memory, L1 Cache, and Read-only Data Cache are fast but small; device memory behind the L2 Cache is not]

Page 20: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-SM: Shared Memory Counters Optimization

• Make full use of shared memory
  • Store counters in shared memory when possible
  • Reduce counter size
    • E.g., 32-bit integer counters -> 8-bit

[Figure: memory hierarchy — Shared Memory, L1 Cache, and Read-only Data Cache are fast but small; device memory behind the L2 Cache is not]

GMProf-TH: Threshold Optimization

• Precise counts may not be necessary
  • E.g., A is accessed 10 times, while B is accessed > 100 times
• Stop counting once a certain threshold is reached (see the sketch below)
  • Tradeoff between accuracy and overhead
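A sketch combining the two ideas, assuming (hypothetically) 8-bit per-element counters held in shared memory and a saturation threshold of 255; the real tool's counter width, threshold, and reporting path may differ:

    #define TILE   256     // assumes blockDim.x == TILE
    #define THRESH 255     // GMProf-TH: stop counting past this value

    __global__ void kernel_sm_th(const float *in, float *out, int n)
    {
        __shared__ float s[TILE];
        __shared__ unsigned char shm_counters[TILE];   // GMProf-SM: 8-bit counters in shared memory

        shm_counters[threadIdx.x] = 0;
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            unsigned char c = shm_counters[threadIdx.x];
            if (c < THRESH)
                shm_counters[threadIdx.x] = c + 1;     // saturate instead of counting precisely
            s[threadIdx.x] = in[tid];
            out[tid] = s[threadIdx.x];
        }
        __syncthreads();
        // On kernel exit the (possibly saturated) counters would be copied back to
        // device/host memory for reporting; that path is omitted in this sketch.
    }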

Page 21: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Outline

• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions

Page 22: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


GMProf-Enhanced: Live Range Analysis


• The number of accesses to a shared memory location may be misleading

[Figure: data0, data1, and data2 are staged one at a time from input_array in device memory into shm_buf in shared memory, then written to output_array in device memory]

• Need to count the accesses/reuse of DATA, not the address

Page 23: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


• Track data during its live range in shared memory
• Use a logical clock to mark the boundary of each live range
• Separate counters in each live range, based on the logical clock (see the sketch after the example below)

GMProf-Enhanced: Live Range Analysis

1.  ...
2.  shm_buffer = input_array[0]    // load data0 from DM to ShM
3.  ...
4.  output_array[0] = shm_buffer   // store data0 from ShM to DM
5.  ...
6.  ...
7.  shm_buffer = input_array[1]    // load data1 from DM to ShM
8.  ...
9.  output_array[1] = shm_buffer   // store data1 from ShM to DM
10. ...

live range of data0: lines 2–4
live range of data1: lines 7–9
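A sketch of how a logical clock could separate the counts of successive live ranges (the buffer names follow the slide; the clock, counter layout, and single-block launch are simplifying assumptions):

    #define TILE       256   // assumes a single block of TILE threads
    #define MAX_EPOCHS 8     // maximum number of live ranges tracked

    __global__ void kernel_live_range(const float *input_array, float *output_array,
                                      int num_chunks, unsigned int *reuse_counters)
    {
        // reuse_counters has MAX_EPOCHS * TILE entries, zero-initialized by the host.
        __shared__ float shm_buffer[TILE];
        __shared__ unsigned int logical_clock;   // one tick per live range

        if (threadIdx.x == 0) logical_clock = 0;
        __syncthreads();

        for (int chunk = 0; chunk < num_chunks && chunk < MAX_EPOCHS; ++chunk) {
            // Loading new data into shm_buffer starts a new live range.
            shm_buffer[threadIdx.x] = input_array[chunk * TILE + threadIdx.x];
            __syncthreads();
            if (threadIdx.x == 0) logical_clock += 1;
            __syncthreads();

            // Accesses are charged to the counter of the current live range, so what
            // gets measured is reuse of the DATA, not of the shared-memory address.
            reuse_counters[(logical_clock - 1) * TILE + threadIdx.x] += 1;
            output_array[chunk * TILE + threadIdx.x] = shm_buffer[threadIdx.x];
            __syncthreads();
        }
    }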

Page 24: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Outline

• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions

Page 25: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Methodology

• Platform
  • GPU: NVIDIA Tesla C1060
    • 240 cores (30×8), 1.296 GHz
    • 16KB shared memory per SM
    • 4GB device memory
  • CPU: AMD Opteron 2.6 GHz ×2
  • 8GB main memory
  • Linux kernel 2.6.32
  • CUDA Toolkit 3.0
• Six Applications
  • Co-clustering, EM clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC

Page 26: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Runtime Overhead for Profiling Shared Memory Use

[Chart: per-application runtime overheads; labeled values include 182x, 144x, 648x, 181x, 113x, 90x, and 2.6x]

Page 27: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Runtime Overhead for Profiling Device Memory Use

[Chart: per-application runtime overheads; labeled values include 83x, 197x, 48.5x, and 1.6x]

Page 28: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Case Study I: Put the most frequently used data into shared memory

• bo_v1: a naïve implementation where all data arrays are stored in device memory

Profiling Result | GMProf-basic                  | GMProf (w/o TH)               | GMProf (w/ TH)
ShM              | 0                             | 0                             | 0
DM               | A1(276) A2(276) A3(128) A4(1) | A1(276) A2(276) A3(128) A4(1) | A1(THR) A2(THR) A3(128) A4(1)

A1 ~ A4: four data arrays
(N): average access # of the elements in the corresponding data array

Page 29: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Case Study I: Put the most frequently used data into shared memory

• bo_v2: an improved version which puts the most frequently used arrays (identified by GMProf) into shared memory

Profiling Result | GMProf-basic            | GMProf (w/o TH)         | GMProf (w/ TH)
ShM              | A1(174,788) A2(169,221) | A1(165,881) A2(160,315) | A1(THR) A2(THR)
DM               | A3(128) A4(1)           | A3(128) A4(1)           | A3(128) A4(1)

• bo_v2 outperforms bo_v1 by a factor of 39.63

Page 30: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Case Study II: Identify the true reuse of data

• jcb_v1: the shared memory is accessed frequently, but there is little reuse of the data

Profiling Result | GMProf-basic    | GMProf (w/o Enh. Alg.) | GMProf (w/ Enh. Alg.)
ShM              | shm_buf (5,760) | shm_buf (5,748)        | shm_buf (2)
DM               | in(4) out(1)    | in(4) out(1)           | in(4) out(1)

• jcb_v2:

Profiling Result | GMProf-basic    | GMProf (w/o Enh. Alg.) | GMProf (w/ Enh. Alg.)
ShM              | shm_buf (4,757) | shm_buf (4,741)        | shm_buf (4)
DM               | in(1) out(1)    | in(1) out(1)           | in(1) out(1)

• jcb_v2 outperforms jcb_v1 by 2.59 times

Page 31: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Outline

• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
• Evaluation
• Conclusions

Page 32: GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs


Conclusions

• GMProf
  • Statically-assisted dynamic profiling approach
  • Architecture-based optimizations
  • Live range analysis to capture real usage of data
  • Low-overhead & fine-grained
  • May be applied to profile other events

Thanks!

