Energy Consumption of CUDA Kernels with Varying Thread Topology
Sebastian Dreßler & Thomas Steinke, Zuse Institute Berlin
September 12, 2012
Outline
1 GPGPUs @ HPC
2 Energy Awareness of GPGPUs
3 Applied Methods
4 Results & Interpretation
5 Conclusion & Outlook
Energy Consumption of CUDA Kernels with Varying Thread Topology Sebastian Dreßler & Thomas Steinke [email protected] 2 / 25
GPGPUs @ HPC
• GPGPUs are more and more utilized in modern HPC systems
• Top500 of June ’12:
  • 57 systems using accelerators in total
  • NVIDIA GPUs:
    • #5 (Tianhe-1A),
    • #10 (Nebulae), and
    • #14 (TSUBAME 2.0)
Energy awareness of accelerators becomes a key element in HPC systems.
Fermi GPU Architecture
[Figure: block diagram of a Fermi Streaming Multiprocessor (SM) — instruction cache, two warp schedulers with dispatch units, 32 CUDA cores (each with a dispatch port, operand collector, FP and INT units, and result queue), 16 LD/ST units, 4 SFUs, register file (32,768 × 32-bit), interconnect network, 64 KB shared memory / L1 cache, and uniform cache.]
Image copyright by NVIDIA. Source: NVIDIA Fermi Architecture Whitepaper.
Energy Awareness of GPGPUs
Measurements I
• E measurement, GPGPU vs. CPU (Rofouei et al. [6])
  • Scan application measured
  • Higher E consumption, but much lower run-time → improved E footprint
• E measurement on heterogeneous systems (McIntosh-Smith et al. [5])
  • NVIDIA beats AMD on E efficiency
  • GPGPUs: in general a better E footprint than multi-core CPUs alone
Measurements II
• Relation of utilized SMs vs. P̄ (Collange et al. [1])
  • Steep linear rise until half of the SMs are utilized
  • More than half of the SMs → flatter linear rise
• E measurement for block/thread combinations (Huang et al. [4])
  • Single application measured
  • Lower runtime → lower energy consumption
Applied methods: external hardware + software logging
Predictive Models
• Prediction of the # of utilized SMs for best EC (Hong and Kim [3])
  • Goal: decreased power consumption
  • Result: avg. E saving of ≈26%
• Fine-grained model for P̄ at instruction level (Haifeng and Qingkui [2])
  • Decomposition of PTX instructions into groups
    • Arithmetics, Memory Transfer, Control, ...
  • Reference P̄ measurement + instruction group count → P̄ prediction
  • Error: 5%
Goal of Our Work
• With respect to measurements
  • Switch from hardware (power meter, oscilloscope, ...) to software
  • Migration to software-controlled remote measurement
  • Supported by the upcoming integration of power sensors
    • Fermi architecture: HW power sensor available on Tesla M2090
  • Provide fine-grained measurement for instructions and applications
• Concerning predictive models
  • Clearly distinguish energy consumption and power consumption
  • Provide additional information to improve models
  • Demonstrate the thread scheduler’s impact
Applied Methods
Software Measurement Framework
• NVIDIA Management Library (NVML)
  • Measures the instantaneous overall power consumption P of a GPU card
• Framework: threaded library using NVML
  • On Tesla M2090: a new sample every 20 ms
• Measurement accuracy
  • Assumption: sensor values are correct
  • Captured multiple runtime profiles with constant card utilization
  • Statistical analysis of the runtime profiles
    • P̄ = 165.3 W ± 0.73 W
    • Relative uncertainty: 0.44 %
Instruction Level Kernels
• Purpose: Measure energy consumption at instruction level
Listing 1: Instruction level kernel example
1   float r1, r2, r3;
2
3   for (int i = 0; i < RUNS; i++) {
4       r3 = r1 + r2;
5       r2 = r3 + r1;
6       r1 = r2 + r3;
7       [...]
8       r3 = r1 + r2;
9       r2 = r3 + r1;
10      r1 = r2 + r3;
11  }
• Runtime >20 ms (l. 3)
• Obfuscate that the “code does nothing” (ll. 4–10)
• Ensured the above points with PTX code analysis
Application Level Kernels
• Purpose: Measure energy consumption at application level
Listing 2: Weak scaling: vector norm
1   void norm(
2       double *v,
3       double *norm) {
4
5       int i;
6       int idx = [...];
7       double a = 0.0;
8
9       for (i=0; i<SVEC; i++) {
10          a += pow( // Ineff. code
11              v[idx+i*SVEC], 2.0f
12          );
13      }
14
15      norm[idx] = sqrt(a);
16  }
Listing 3: Strong scaling: vector calculation
1   void vecpowadd(
2       double ma, double mb,
3       double *vec_a,
4       double *vec_b) {
5
6       int i;
7       int idx = [...];
8       int L = [...];
9
10      for (i=0; i<SVEC/L; i++) {
11          vec_a[idx + i] += pow(
12              vec_a[idx + L],
13              vec_b[idx + L]
14          );
15      }
16  }
Results & Interpretation
Single Instruction SP Floating Point
[Figure: 3D surface plot of energy (µJ) over blocks (0–64) and threads (0–1024) for the single-instruction SP floating-point kernel; energy ranges from 0 to ≈22 µJ.]
Vector Norm Application Kernel (Weak Scaling)
[Figure: 3D surface plot of energy (J) over blocks (0–64) and threads (0–1024) for the vector norm kernel; energy ranges from 0 to ≈2.2 J.]
Vector Calc. Application Kernel (Strong Scaling)
[Figure: energy vs. threads (0–1024) for the vector calculation kernel, one curve each for 16, 32, 48, and 64 blocks.]
P̄ vs. E
• Differentiation between P̄ and E is necessary
• Taking only P̄ into account leads to wrong conclusions
[Figure, left: 3D surface plot of average power consumption over blocks and threads; annotation: “Reason for breakdowns?”]
[Figure, right: power consumption in W per sample (≈130–190 W over ≈340 samples), showing the measured power consumption and the average power consumption, the range used for the average power calculation, and a region of different utilization due to scheduling.]
Scheduler Capabilities (Likely)
Notable result: energy breakdowns with increasing utilization
[Figure: energy vs. threads per block for 32 blocks (256–640 threads) and 64 blocks (640–1024 threads); energy ranges from ≈5 to 14.]
• Most likely reason: two distinct scheduling cases
  1. The scheduler can provide low delay for outstanding requests (low E)
  2. The opposite case (high E)
• Sometimes the scheduler is able to switch to the optimal scheduling
Relative Energy
[Figure: relative energy vs. threads (0–1024), one curve each for 16, 32, 48, and 64 blocks; values range from 0 to ≈16.]
• Linear dependence between energy consumption and threads per block
  • Optimal scheduling: linear increase of energy
  • Suboptimal scheduling: energy jumps
Conclusion & Outlook
• Software library for P̄ measurement and E calculation implemented
  • Based on NVML; high accuracy and remote measurement
  • LGPL licensed, available at GitHub
    • CUDA Power and Energy Measurement Framework
    • https://github.com/sdressler/CUDA-PEMF
• Showed the impact of the thread scheduler; very likely two categories
  • Optimal scheduling → energy efficient
  • Suboptimal scheduling → energy inefficient
  • Open question: assumption to be evaluated by NVIDIA
• Outlook
  • Investigate the scheduler’s impact further
  • Based on the results: further improve predictive models
Thank you for your attention.
References
[1] S. Collange, D. Defour, and A. Tisserand. Power Consumption of GPUs from a Software Perspective. In Proceedings of the 9th International Conference on Computational Science: Part I, ICCS ’09, pages 914–923, Berlin, Heidelberg, 2009. Springer-Verlag.
[2] W. Haifeng and C. Qingkui. An Energy Consumption Model for GPU Computing at Instruction Level. IJACT, 4(2):192–200, Feb. 2012.
[3] S. Hong and H. Kim. An integrated GPU power and performance model. SIGARCH Comput. Archit. News, 38(3):280–289, June 2010.
[4] S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics processing units for scientific computing. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS ’09, pages 1–8, Washington, DC, USA, 2009. IEEE Computer Society.
[5] S. McIntosh-Smith, T. Wilson, A. A. Ibarra, J. Crisp, and R. B. Sessions. Benchmarking energy efficiency, power costs and carbon emissions on heterogeneous systems. The Computer Journal, 55(2):192–205, 2012.
[6] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, and M. Sarrafzadeh. Energy-aware high performance computing with graphic processing units. In Proceedings of the 2008 conference on Power aware computing and systems, HotPower’08, pages 11–11, Berkeley, CA, USA, 2008. USENIX Association.