Energy Consumption of CUDA Kernels with Varying Thread Topology
Sebastian Dreßler & Thomas Steinke, Zuse Institute Berlin
September 12, 2012
Outline
1 GPGPUs @ HPC
2 Energy Awareness of GPGPUs
3 Applied Methods
4 Results & Interpretation
5 Conclusion & Outlook
Energy Consumption of CUDA Kernels with Varying Thread Topology Sebastian Dreßler & Thomas Steinke [email protected] 2 / 25
GPGPUs @ HPC
• GPGPUs are more and more utilized in modern HPC systems
• Top500 of June ’12:
  • 57 systems using accelerators in total
  • NVIDIA GPUs:
    • #5 (Tianhe-1A),
    • #10 (Nebulae), and
    • #14 (TSUBAME 2.0)
Energy awareness of accelerators becomes a key element in HPC systems.
Fermi GPU Architecture
[Figure: block diagram of a Fermi Streaming Multiprocessor (SM) — instruction cache, two warp schedulers with dispatch units, 32 CUDA cores (each with a dispatch port, operand collector, FP and INT units, and result queue), 16 LD/ST units, 4 SFUs, register file (32,768 × 32-bit), interconnect network, 64 KB shared memory / L1 cache, and uniform cache.]
Image copyright by NVIDIA. Source: NVIDIA Fermi Architecture Whitepaper.
Energy Awareness of GPGPUs
Measurements I
• E measurement, GPGPU vs. CPU (Rofouei et al. [6])
  • Scan application measured
  • Higher E consumption, but much lower run-time → improved E footprint
• E measurement on heterogeneous systems (McIntosh-Smith et al. [5])
  • NVIDIA beats AMD on E efficiency
  • GPGPUs: in general a better E footprint than multi-core CPUs alone
Measurements II
• Relation of utilized SMs vs. P̄ (Collange et al. [1])
  • Steep linear rise until half of the SMs are utilized
  • More than half of the SMs → flatter linear rise
• E measurement for block/thread combinations (Huang et al. [4])
  • Single application measured
  • Lower runtime → lower energy consumption
Applied methods: external hardware + software logging
Predictive Models
• Prediction of the # of utilized SMs for best EC (Hong and Kim [3])
  • Goal: decreased power consumption
  • Result: avg. E saving of ≈26%
• Fine-grained model for P̄ at instruction level (Haifeng and Qingkui [2])
  • Decomposition of PTX instructions into groups
    • Arithmetics, Memory Transfer, Control, ...
  • Reference P̄ measurement + instruction group count → P̄ prediction
  • Error: 5%
Goal of Our Work
• With respect to measurements
  • Switch from hardware (power meter, oscilloscope, ...) to software
  • Migration to software-controlled remote measurement
  • Supported by the upcoming integration of power sensors
    • Fermi architecture: HW power sensor available on Tesla M2090
  • Provide fine-grained measurement for instructions and applications
• Concerning predictive models
  • Clearly distinguish energy consumption and power consumption
  • Provide additional information to improve models
  • Demonstrate the thread scheduler’s impact
Applied Methods
Software Measurement Framework
• NVIDIA Management Library (NVML)
  • Measures the instantaneous overall power consumption P of a GPU card
• Framework: threaded library using NVML
  • On Tesla M2090: a new sample every 20 ms
• Measurement accuracy
  • Assumption: sensor values are correct
  • Captured multiple runtime profiles with constant card utilization
  • Statistical analysis of the runtime profiles
    • P̄ = 165.3 W ± 0.73 W
    • Relative uncertainty: 0.44 %
Instruction Level Kernels
• Purpose: Measure energy consumption at instruction level
Listing 1: Instruction level kernel example
1   float r1, r2, r3;
2
3   for (int i = 0; i < RUNS; i++) {
4       r3 = r1 + r2;
5       r2 = r3 + r1;
6       r1 = r2 + r3;
7       [...]
8       r3 = r1 + r2;
9       r2 = r3 + r1;
10      r1 = r2 + r3;
11  }
• Runtime >20 ms (l. 3)
• Obfuscate that the “code does nothing” (ll. 4–10)
• Ensured the above points with PTX code analysis
Application Level Kernels
• Purpose: Measure energy consumption at application level
Listing 2: Weak scaling: vector norm
1   void norm(
2       double *v,
3       double *norm) {
4
5       int i;
6       int idx = [...];
7       double a = 0.0;
8
9       for (i=0; i<SVEC; i++) {
10          a += pow( // Ineff. code
11              v[idx+i*SVEC], 2.0f
12          );
13      }
14
15      norm[idx] = sqrt(a);
16  }
Listing 3: Strong scaling: vector calculation
1   void vecpowadd(
2       double ma, double mb,
3       double *vec_a,
4       double *vec_b) {
5
6       int i;
7       int idx = [...];
8       int L = [...];
9
10      for (i=0; i<SVEC/L; i++) {
11          vec_a[idx + i] += pow(
12              vec_a[idx + L],
13              vec_b[idx + L]
14          );
15      }
16  }
Results & Interpretation
Single Instruction SP Floating Point
[Figure: 3D surface plot of energy (µJ) over blocks (0–64) and threads (0–1024) for the single-instruction SP floating-point kernel; energy ranges from 0 to ≈22 µJ.]
Vector Norm Application Kernel (Weak Scaling)
[Figure: 3D surface plot of energy (J) over blocks (0–64) and threads (0–1024) for the vector norm kernel; energy ranges from 0 to ≈2.2 J.]
Vector Calc. Application Kernel (Strong Scaling)
[Figure: energy vs. threads (0–1024) for the vector calculation kernel, one curve each for 16, 32, 48, and 64 blocks.]
P̄ vs. E
• Differentiation between P̄ and E is necessary
• Taking only P̄ into account leads to wrong conclusions
[Figure, left: 3D surface plot of average power consumption over blocks and threads; annotation: “Reason for breakdowns?”]
[Figure, right: power consumption in W per sample (≈130–190 W over ≈340 samples), showing the measured power consumption and the average power consumption, the range used for the average power calculation, and a region of different utilization due to scheduling.]
Scheduler Capabilities (Likely)
Notable result: energy breakdowns with increasing utilization
[Figure: energy vs. threads per block for 32 blocks (256–640 threads) and 64 blocks (640–1024 threads); energy ranges from ≈5 to 14.]
• Most likely reason: two distinct scheduling cases
  1. The scheduler can provide low delay for outstanding requests (low E)
  2. The opposite case (high E)
• Sometimes the scheduler is able to switch to the optimal scheduling
Relative Energy
[Figure: relative energy vs. threads (0–1024), one curve each for 16, 32, 48, and 64 blocks; values range from 0 to ≈16.]
• Linear dependence between energy consumption and threads per block
  • Optimal scheduling: linear increase of energy
  • Suboptimal scheduling: energy jumps
Conclusion & Outlook
• Software library for P̄ measurement and E calculation implemented
  • Based on NVML; high accuracy and remote measurement
  • LGPL licensed, available at GitHub
    • CUDA Power and Energy Measurement Framework
    • https://github.com/sdressler/CUDA-PEMF
• Showed the impact of the thread scheduler; very likely two categories
  • Optimal scheduling → energy efficient
  • Suboptimal scheduling → energy inefficient
  • Open question: assumption to be evaluated by NVIDIA
• Outlook
  • Investigate the scheduler’s impact further
  • Based on the results: further improve predictive models
Thank you for your attention.
References
[1] S. Collange, D. Defour, and A. Tisserand. Power Consumption of GPUs from a Software Perspective. In Proceedings of the 9th International Conference on Computational Science: Part I, ICCS ’09, pages 914–923, Berlin, Heidelberg, 2009. Springer-Verlag.
[2] W. Haifeng and C. Qingkui. An Energy Consumption Model for GPU Computing at Instruction Level. IJACT, 4(2):192–200, Feb. 2012.
[3] S. Hong and H. Kim. An integrated GPU power and performance model. SIGARCH Comput. Archit. News, 38(3):280–289, June 2010.
[4] S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics processing units for scientific computing. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–8, Washington, DC, USA, 2009. IEEE Computer Society.
[5] S. McIntosh-Smith, T. Wilson, A. A. Ibarra, J. Crisp, and R. B. Sessions. Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems. The Computer Journal, 55(2):192–205, 2012.
[6] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, and M. Sarrafzadeh. Energy-aware high performance computing with graphic processing units. In Proceedings of the 2008 conference on Power aware computing and systems, HotPower’08, pages 11–11, Berkeley, CA, USA, 2008. USENIX Association.