CUDA Performance Considerations
Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011
Agenda
Data Prefetching
Loop Unrolling
Thread Granularity
Data Prefetching
Independent instructions between a global memory read and its use can hide memory latency:

float m = Md[i];          // read global memory
float f = a * b + c * d;  // execute instructions that do not depend on the read
float f2 = m * f;         // use the value; executed by enough warps, the
                          // independent instructions hide the memory latency
Data Prefetching
Prefetching data from global memory can effectively increase the number of independent instructions between a global memory read and its use.
Data Prefetching
Recall tiled matrix multiply:
for (/* ... */)
{
// Load current tile into shared memory
__syncthreads();
// Accumulate dot product
__syncthreads();
}
Data Prefetching
Tiled matrix multiply with prefetch:

// Load first tile into registers
for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();
    // Load next tile into registers (prefetch for the next iteration)
    // Accumulate dot product (executed by enough warps, these instructions
    // hide the memory latency of the prefetch)
    __syncthreads();
}
Data Prefetching
Cost: added complexity and more registers – what does this imply? (More registers per thread can mean fewer resident threads per SM and lower occupancy.)
Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instructions per iteration:
One floating-point multiply
One floating-point add
What else?

Other instructions per iteration:
Update loop counter
Branch
Address arithmetic

Instruction mix:
2 floating-point arithmetic instructions
1 loop branch instruction
2 address arithmetic instructions
1 loop counter increment instruction
Loop Unrolling
Only 1/3 of the instructions are floating-point calculations. But I want my full theoretical 346.5 GFLOPS (G80)!
Consider loop unrolling.
Loop Unrolling

Pvalue +=
    Ms[ty][0] * Ns[0][tx] +
    Ms[ty][1] * Ns[1][tx] +
    ...
    Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16

No more loop
No loop counter update
No branch
Constant indices – no address arithmetic instructions
Thread Granularity
How much work should one thread do?
Parallel reduction: reduce two elements?
Matrix multiply: compute one element of Pd?
Thread Granularity
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Matrix Multiply: both elements of Pd require the same row of Md
Thread Granularity
Matrix Multiply: compute both Pd elements in the same thread
Reduces global memory access by 1/4
Increases the number of independent instructions
What is the benefit?
The new kernel uses more registers and shared memory – what does that imply?
Matrix Multiply
What improves performance? Prefetching? Loop unrolling? Thread granularity?
For what inputs?
Matrix Multiply
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

8x8 Tiles
• Coarser thread granularity helps
• Prefetching doesn't
• Loop unrolling doesn't

16x16 Tiles
• Coarser thread granularity helps
• Full loop unrolling can help
• Prefetch helps for 1x1 tiling
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign
Floating-Point Considerations
What is the IEEE floating-point format? A floating-point binary number consists of three parts: sign (S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating-point number.
For each bit pattern, its IEEE floating-point value is derived as:
value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B (i.e., 1 ≤ M < 2 in decimal)
The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.
IEEE 754 Format
http://kipirvine.com/asm/workbook/floating_tut.htm
Single precision: 1-bit sign, 8-bit exponent (bias 127), 23-bit fraction
Double precision: 1-bit sign, 11-bit exponent (bias 1023), 52-bit fraction
Mantissa
Take -3.154 x 10^5 as an example: the sign is negative, the mantissa is 3.154, and the exponent is 5.
The fractional portion of the mantissa is the sum of each digit multiplied by a power of 10:
.154 = 1/10 + 5/100 + 4/1000
A binary floating-point number is similar. For example, in the number +11.1011 x 2^3, the sign is positive, the mantissa is 11.1011, and the exponent is 3. The fractional portion of the mantissa is the sum of successive powers of 2. In our example, it is expressed as:
.1011 = 1/2 + 0/4 + 1/8 + 1/16 = 0.6875
Combined with the integer part of 11.1011, the decimal value of the mantissa is 3.6875.
Normalizing the Mantissa
Before a floating-point binary number can be stored correctly, its mantissa must be normalized. The process is basically the same as when normalizing a floating-point decimal number.
For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by moving the decimal point so that only one digit appears before the decimal.
The Exponent
The exponent is stored as an 8-bit unsigned integer with a bias of 127. An example: 1.101 x 2^5. The exponent (5) is added to 127 (2^(n-1) - 1) and the sum (132) is stored as binary 10000100.
Creating the IEEE Bit Representation
1.101 x 2^0 is stored as sign = 0 (positive), mantissa = 101, and exponent = 01111111 (the exponent value 0 is added to 127). The "1" to the left of the binary point is dropped from the mantissa.
Arithmetic Instruction Throughput
int and float add, shift, min, max, and float mul, mad: 4 cycles per warp
int multiply (*) is 32-bit by default and requires multiple cycles per warp
Use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply

Integer divide and modulo are expensive
The compiler will convert literal power-of-2 divides to shifts
Be explicit in cases where the compiler can't tell that the divisor is a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
Arithmetic Instruction Throughput
Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
These are the versions prefixed with "__", e.g. __sinf(), __expf()
Other functions are combinations of the above:
y / x == rcp(x) * y == 20 cycles per warp
sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp
Runtime Math Library
There are two types of runtime math operations:
__func(): direct mapping to hardware ISA
Fast but lower accuracy (see the programming guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
func(): compiles to multiple instructions
Slower but higher accuracy (5 ulp, units in the last place, or less)
Examples: sin(x), exp(x), pow(x,y)
The -use_fast_math compiler option forces every func() to compile to __func()
Make your program float-safe! Future hardware will have double-precision support, but G80 is single-precision only, and double precision will carry an additional performance cost. Careless use of double or undeclared types may run more slowly on G80+. It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed.
Add the 'f' suffix on float literals:
foo = bar * 0.123;  // double assumed
foo = bar * 0.123f; // float explicit
Use the float versions of standard library functions:
foo = sin(bar);  // double assumed
foo = sinf(bar); // single precision explicit
Deviations from IEEE-754
Addition and multiplication are IEEE 754 compliant: maximum 0.5 ulp (units in the last place) error. However, they are often combined into a single multiply-add (FMAD) whose intermediate result is truncated.
Division is non-compliant (2 ulp)
Not all rounding modes are supported
Denormalized numbers are not supported
No mechanism to detect floating-point exceptions
GPU Floating Point Features

Feature                            | G80                          | SSE                                   | IBM Altivec                 | Cell SPE
Precision                          | IEEE 754                     | IEEE 754                              | IEEE 754                    | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and to zero | All 4 IEEE: nearest, zero, +inf, -inf | Round to nearest only       | Round to zero/truncate only
Denormal handling                  | Flush to zero                | Supported, 1000's of cycles           | Supported, 1000's of cycles | Flush to zero
NaN support                        | Yes                          | Yes                                   | Yes                         | No
Overflow and Infinity support      | Yes, only clamps to max norm | Yes                                   | Yes                         | No, infinity
Flags                              | No                           | Yes                                   | Yes                         | Some
Square root                        | Software only                | Hardware                              | Software only               | Software only
Division                           | Software only                | Hardware                              | Software only               | Software only
Reciprocal estimate accuracy       | 24 bit                       | 12 bit                                | 12 bit                      | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                       | 12 bit                                | 12 bit                      | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                       | No                                    | 12 bit                      | No