CUDA Performance Considerations
Patrick Cozzi, University of Pennsylvania, CIS 565 - Spring 2011
Agenda
Data Prefetching
Loop Unrolling
Thread Granularity
Data Prefetching
Independent instructions between a global memory read and its use can hide memory latency:

float m = Md[i];          // read global memory
float f = a * b + c * d;  // execute instructions that do not depend on the read
float f2 = m * f;         // use the value; executed by enough warps, the
                          // independent instructions hide the memory latency
Data Prefetching
Prefetching data from global memory can effectively increase the number of independent instructions between a global memory read and its use.
Data Prefetching
Recall tiled matrix multiply:
for (/* ... */)
{
// Load current tile into shared memory
__syncthreads();
// Accumulate dot product
__syncthreads();
}
Data Prefetching
Tiled matrix multiply with prefetch:

// Load first tile into registers
for (/* ... */)
{
    // Deposit registers into shared memory
    __syncthreads();
    // Load next tile into registers (prefetch for the next iteration)
    // Accumulate dot product (executed by enough warps, these instructions
    // hide the memory latency of the prefetch)
    __syncthreads();
}
Data Prefetching
Cost: added complexity and more registers – what does this imply? (More registers per thread can mean fewer resident threads per SM and lower occupancy.)
Loop Unrolling

for (int k = 0; k < BLOCK_SIZE; ++k)
{
    Pvalue += Ms[ty][k] * Ns[k][tx];
}

Instructions per iteration:
One floating-point multiply
One floating-point add
What else?

Other instructions per iteration:
Update loop counter
Branch
Address arithmetic

Instruction mix:
2 floating-point arithmetic instructions
1 loop branch instruction
2 address arithmetic instructions
1 loop counter increment instruction
Loop Unrolling
Only 1/3 of the instructions are floating-point calculations. But I want my full theoretical 346.5 GFLOPS (G80)!
Consider loop unrolling.
Loop Unrolling

Pvalue +=
    Ms[ty][0] * Ns[0][tx] +
    Ms[ty][1] * Ns[1][tx] +
    ...
    Ms[ty][15] * Ns[15][tx]; // BLOCK_SIZE = 16

No more loop
No loop counter update
No branch
Constant indices – no address arithmetic instructions
Thread Granularity
How much work should one thread do?
Parallel reduction: reduce two elements?
Matrix multiply: compute one element of Pd?
Thread Granularity
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf
Matrix Multiply: both elements of Pd require the same row of Md
Thread Granularity
Matrix Multiply: compute both Pd elements in the same thread
Reduces global memory access by 1/4
Increases the number of independent instructions
What is the benefit?
The new kernel uses more registers and shared memory – what does that imply?
Matrix Multiply
What improves performance? Prefetching? Loop unrolling? Thread granularity?
For what inputs?
Matrix Multiply
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter5-CudaPerformance.pdf

8x8 Tiles
• Coarser thread granularity helps
• Prefetching doesn't
• Loop unrolling doesn't

16x16 Tiles
• Coarser thread granularity helps
• Full loop unrolling can help
• Prefetch helps for 1x1 tiling
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign
Floating-Point Considerations
What is the IEEE floating-point format? A floating-point binary number consists of three parts: sign (S), exponent (E), and mantissa (M). Each (S, E, M) pattern uniquely identifies a floating-point number.
For each bit pattern, its IEEE floating-point value is derived as:
value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B (i.e., 1 ≤ M < 2 in decimal)
The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.
IEEE 754 Format
http://kipirvine.com/asm/workbook/floating_tut.htm
Single precision: 1-bit sign, 8-bit exponent (bias 127), 23-bit fraction
Double precision: 1-bit sign, 11-bit exponent (bias 1023), 52-bit fraction
Mantissa
Take -3.154 x 10^5 as an example: the sign is negative, the mantissa is 3.154, and the exponent is 5.
The fractional portion of the mantissa is the sum of each digit multiplied by a power of 10:
.154 = 1/10 + 5/100 + 4/1000
A binary floating-point number is similar. For example, in the number +11.1011 x 2^3, the sign is positive, the mantissa is 11.1011, and the exponent is 3. The fractional portion of the mantissa is the sum of successive powers of 2. In our example, it is expressed as:
.1011 = 1/2 + 0/4 + 1/8 + 1/16 = 0.6875
Combined with the integer part of 11.1011, the decimal value of the mantissa is 3.6875.
Normalizing the Mantissa
Before a floating-point binary number can be stored correctly, its mantissa must be normalized. The process is basically the same as when normalizing a floating-point decimal number.
For example, decimal 1234.567 is normalized as 1.234567 x 10^3 by moving the decimal point so that only one digit appears before the decimal.
The Exponent
The exponent is stored as an 8-bit unsigned integer with a bias of 127. An example: 1.101 x 2^5. The exponent (5) is added to 127 (2^(n-1) - 1) and the sum (132) is stored as binary 10000100.
Creating the IEEE Bit Representation
1.101 x 2^0 is stored as sign = 0 (positive), mantissa = 101, and exponent = 01111111 (the exponent value 0 is added to 127). The "1" to the left of the binary point is dropped from the mantissa.
Arithmetic Instruction Throughput
int and float add, shift, min, max, and float mul, mad: 4 cycles per warp
int multiply (*) is 32-bit by default and requires multiple cycles per warp
Use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply

Integer divide and modulo are expensive
The compiler will convert literal power-of-2 divides to shifts
Be explicit in cases where the compiler can't tell that the divisor is a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
Arithmetic Instruction Throughput
Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
These are the versions prefixed with "__", e.g. __sinf(), __expf()
Other functions are combinations of the above:
y / x == rcp(x) * y == 20 cycles per warp
sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp
Runtime Math Library
There are two types of runtime math operations:
__func(): direct mapping to hardware ISA
Fast but lower accuracy (see the programming guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
func(): compiles to multiple instructions
Slower but higher accuracy (5 ulp, units in the last place, or less)
Examples: sin(x), exp(x), pow(x,y)
The -use_fast_math compiler option forces every func() to compile to __func()
Make your program float-safe! Future hardware will have double-precision support, but G80 is single-precision only, and double precision will carry an additional performance cost. Careless use of double or undeclared types may run more slowly on G80+. It is important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed.
Add the 'f' suffix on float literals:
foo = bar * 0.123;  // double assumed
foo = bar * 0.123f; // float explicit
Use the float versions of standard library functions:
foo = sin(bar);  // double assumed
foo = sinf(bar); // single precision explicit
Deviations from IEEE-754
Addition and multiplication are IEEE 754 compliant: maximum 0.5 ulp (units in the last place) error. However, they are often combined into a single multiply-add (FMAD) whose intermediate result is truncated.
Division is non-compliant (2 ulp)
Not all rounding modes are supported
Denormalized numbers are not supported
No mechanism to detect floating-point exceptions
GPU Floating Point Features

Feature                            | G80                          | SSE                                   | IBM Altivec                 | Cell SPE
Precision                          | IEEE 754                     | IEEE 754                              | IEEE 754                    | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and to zero | All 4 IEEE: nearest, zero, +inf, -inf | Round to nearest only       | Round to zero/truncate only
Denormal handling                  | Flush to zero                | Supported, 1000's of cycles           | Supported, 1000's of cycles | Flush to zero
NaN support                        | Yes                          | Yes                                   | Yes                         | No
Overflow and Infinity support      | Yes, only clamps to max norm | Yes                                   | Yes                         | No, infinity
Flags                              | No                           | Yes                                   | Yes                         | Some
Square root                        | Software only                | Hardware                              | Software only               | Software only
Division                           | Software only                | Hardware                              | Software only               | Software only
Reciprocal estimate accuracy       | 24 bit                       | 12 bit                                | 12 bit                      | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                       | 12 bit                                | 12 bit                      | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                       | No                                    | 12 bit                      | No