iap09 cuda@mit 6.963 - lecture 04: cuda advanced #1 (nicolas pinto, mit)

IAP09 CUDA@MIT / 6.963 Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 04 CUDA Advanced #1 - Nicolas Pinto (MIT)

Upload: npinto

Post on 17-Nov-2014

DESCRIPTION

More at http://sites.google.com/site/cudaiap2009 and http://pinto.scripts.mit.edu/Classes/CUDAIAP2009. Note that some slides were borrowed from Matthew Bolitho (Johns Hopkins) and NVIDIA.

TRANSCRIPT

Page 1: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

IAP09 CUDA@MIT / 6.963

Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA

Lecture 04: CUDA Advanced #1

Nicolas Pinto (MIT)

Page 2: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

During this course,

we’ll try to

and use existing material ;-)

“ ”

adapted for 6.963

Page 3: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
Page 4: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
Page 5: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

warp != wrap

Page 6: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Todayyey!!

Page 7: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Textures & OpenGL

Async API

Libraries

Interfacing CUDA

Performance


Page 8: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Textures and OpenGL


Page 9: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Texture Functionality

© NVIDIA Corporation 2008 160

Textures in CUDA

Different hardware path to memory

Benefits of CUDA textures:
  Texture fetches are cached
    Optimized for 2D locality
  Textures are addressable in 2D
    Using integer or normalized coordinates
    Means fewer addressing calculations in code
  Provide filtering for free
  Free wrap modes (boundary conditions)
    Clamp to edge / repeat

Limitations of CUDA textures:
  Read-only
  Currently either 1D or 2D (3D will be added)
  9-bit accuracy of filter weights

Textures
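The two wrap modes named on this slide (clamp to edge, repeat) can be modelled on the host in a few lines of C. This is a minimal CPU-side sketch, not CUDA code: the function names `clamp_texel`, `wrap_coord` and `texel_from_normalized` are made up for illustration, and the point-sampling rule (texel index = floor(u * n)) is an assumption about how a normalized coordinate maps to a texel.

```c
/* CPU-side model of texture addressing; all names here are illustrative. */

/* Clamp to edge: an out-of-range texel index snaps to the nearest valid texel. */
static int clamp_texel(int i, int n)
{
    if (i < 0) return 0;
    if (i > n - 1) return n - 1;
    return i;
}

/* Repeat: keep only the fractional part of a normalized coordinate,
 * so that u = 1.25 and u = -0.75 both address the same point as u = 0.25. */
static float wrap_coord(float u)
{
    float frac = u - (float)(long)u;      /* (long) truncates toward zero */
    return (frac < 0.0f) ? frac + 1.0f : frac;
}

/* Point sampling with a normalized coordinate in [0, 1): texel = floor(u * n).
 * The final clamp keeps u = 1.0 inside the texture. */
static int texel_from_normalized(float u, int n)
{
    return clamp_texel((int)(u * (float)n), n);
}
```

With hardware textures these adjustments happen inside the fetch, which is what the slide means by fewer addressing calculations in kernel code.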

Page 10: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008 161

Two CUDA Texture Types

Bound to linear memory
  Global memory is bound to a texture
  Only 1D
  Integer addressing
  No filtering, no addressing modes

Bound to CUDA arrays
  CUDA array is bound to a texture
  1D or 2D
  Float addressing (size-based or normalized)
  Filtering
  Addressing modes (clamping, repeat)

Both:
  Return either element type or normalized float

© NVIDIA Corporation 2008 162

CUDA Texturing Steps

Host (CPU) code:
  Allocate/obtain memory (global linear, or CUDA array)
  Create a texture reference object
    Currently must be at file-scope
  Bind the texture reference to memory/array
  When done:
    Unbind the texture reference, free resources

Device (kernel) code:
  Fetch using texture reference
  Linear memory textures: tex1Dfetch()
  Array textures: tex1D() or tex2D()

Textures

Page 12: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008 163

Texture Reference

Immutable parameters (compile-time)
  Type: type returned when fetching
    Basic int, float types
    CUDA 1-, 2-, 4-element vectors
  Dimensionality:
    Currently 1 or 2 (3 will be supported in the future)
  Read Mode:
    cudaReadModeElementType
    cudaReadModeNormalizedFloat (valid for 8- or 16-bit ints)
      returns [-1,1] for signed, [0,1] for unsigned

Mutable parameters (run-time, only for array-textures)
  Normalized:
    non-zero = addressing range [0, 1]
  Filter Mode:
    cudaFilterModePoint
    cudaFilterModeLinear
  Address Mode:
    cudaAddressModeClamp
    cudaAddressModeWrap
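To make the cudaReadModeNormalizedFloat ranges concrete, here is a small host-side sketch of the integer-to-float mapping. It is a model written for this transcript, not CUDA code: the function names are illustrative, and the signed rule (divide by 127, then clamp to -1) is an assumption about how the most negative value is handled.

```c
/* CPU-side model of cudaReadModeNormalizedFloat; function names are illustrative. */

/* Unsigned 8-bit texel: 0..255 maps linearly onto [0, 1]. */
static float norm_from_u8(unsigned char v)
{
    return (float)v / 255.0f;
}

/* Unsigned 16-bit texel: 0..65535 maps linearly onto [0, 1]. */
static float norm_from_u16(unsigned short v)
{
    return (float)v / 65535.0f;
}

/* Signed 8-bit texel mapped onto [-1, 1].
 * Assumption: divide by 127 and clamp, so that -128 and -127 both give -1. */
static float norm_from_s8(signed char v)
{
    float f = (float)v / 127.0f;
    return (f < -1.0f) ? -1.0f : f;
}
```

A kernel reading a `texture<unsigned short, 1, cudaReadModeNormalizedFloat>` therefore sees floats in [0, 1] rather than raw 16-bit integers.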

© NVIDIA Corporation 2008 164

Example: Host code for linear mem

// declare texture reference (must be at file-scope)
texture<unsigned short, 1, cudaReadModeNormalizedFloat> texRef;

...

// set up linear memory
unsigned short *dA = 0;
cudaMalloc((void**)&dA, numBytes);
cudaMemcpy(dA, hA, numBytes, cudaMemcpyHostToDevice);

// bind texture reference to linear memory
cudaBindTexture(NULL, texRef, dA);

Textures

Page 14: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008 165

cudaArray Type

Channel format, width, height

cudaChannelFormatDesc structure
  int x, y, z, w: bits for each component
  enum cudaChannelFormatKind, one of:
    cudaChannelFormatKindSigned
    cudaChannelFormatKindUnsigned
    cudaChannelFormatKindFloat
  Some predefined constructors:
    cudaCreateChannelDesc<float>(void);
    cudaCreateChannelDesc<float4>(void);

Management functions:
  cudaMallocArray, cudaFreeArray,
  cudaMemcpyToArray, cudaMemcpyFromArray, ...

© NVIDIA Corporation 2008 166

Example: Host code for 2D array tex

// declare texture reference (must be at file-scope)
texture<float, 2, cudaReadModeElementType> texRef;

...

// set up the CUDA array
cudaChannelFormatDesc cf = cudaCreateChannelDesc<float>();
cudaArray *texArray = 0;
cudaMallocArray(&texArray, &cf, dimX, dimY);
cudaMemcpyToArray(texArray, 0, 0, hA, numBytes, cudaMemcpyHostToDevice);

// specify mutable texture reference parameters
texRef.normalized = 0;
texRef.filterMode = cudaFilterModeLinear;
texRef.addressMode[0] = cudaAddressModeClamp;  // x direction
texRef.addressMode[1] = cudaAddressModeClamp;  // y direction

// bind texture reference to array
cudaBindTextureToArray(texRef, texArray);

Textures
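The example above selects cudaFilterModeLinear with clamp addressing. What that filter computes can be sketched on the CPU for the 1D case. This is an illustrative model, not the hardware implementation: the function names are invented, texel centers are assumed to sit at i + 0.5, and the interpolation weight is assumed to be quantized to steps of 1/256, which is one reading of the "9-bit accuracy of filter weights" limitation listed earlier.

```c
/* CPU-side model of 1D linear filtering with clamp addressing.
 * Names are illustrative; the 1/256 weight step is an assumption. */

static int floor_to_int(float x)
{
    int i = (int)x;                 /* truncates toward zero */
    return ((float)i > x) ? i - 1 : i;
}

static float fetch_clamped(const float *texels, int n, int i)
{
    if (i < 0) i = 0;
    if (i > n - 1) i = n - 1;
    return texels[i];
}

static float tex1d_linear(const float *texels, int n, float x)
{
    float xb = x - 0.5f;                     /* texel centers sit at i + 0.5 */
    int   i  = floor_to_int(xb);
    float alpha = xb - (float)i;             /* fractional part in [0, 1) */

    /* quantize the weight to 1/256 steps (low-precision filter weights) */
    alpha = (float)(int)(alpha * 256.0f + 0.5f) / 256.0f;

    return (1.0f - alpha) * fetch_clamped(texels, n, i)
         + alpha * fetch_clamped(texels, n, i + 1);
}
```

Because the weight is coarse, two sample positions closer together than about 1/256 of a texel can return the same value, which matters when a texture is used as a lookup table for precise interpolation.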

Page 16: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008 177

OpenGL Interoperability

OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory
  Vertex buffer objects
  Pixel buffer objects

Direct3D9 vertex objects can be mapped

Data can be accessed like any other global data in the device code

Image data can be displayed from pixel buffer objects using glDrawPixels / glTexImage2D
  Requires copy in video memory, but still fast

© NVIDIA Corporation 2008 178

OpenGL Interop Steps

Register a buffer object with CUDA
  cudaGLRegisterBufferObject(GLuint buffObj);
  OpenGL can use a registered buffer only as a source
  Unregister the buffer prior to rendering to it by OpenGL

Map the buffer object to CUDA memory
  cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
  Returns an address in global memory
  Buffer must be registered prior to mapping

Launch a CUDA kernel to process the buffer

Unmap the buffer object prior to use by OpenGL
  cudaGLUnmapBufferObject(GLuint buffObj);

Unregister the buffer object
  cudaGLUnregisterBufferObject(GLuint buffObj);
  Optional: needed if the buffer is a render target

Use the buffer object in OpenGL code

OpenGL

Page 18: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008 179

Interop Scenario: Dynamic CUDA-generated texture

Register the texture PBO with CUDA

For each frame:
  Map the buffer
  Generate the texture in a CUDA kernel
  Unmap the buffer
  Update the texture
  Render the textured object

unsigned char *p_d = 0;
cudaGLMapBufferObject((void**)&p_d, pbo);
prepTexture<<<height,width>>>(p_d, time);
cudaGLUnmapBufferObject(pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
glBindTexture(GL_TEXTURE_2D, texID);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256,
                GL_BGRA, GL_UNSIGNED_BYTE, 0);

© NVIDIA Corporation 2008 180

Interop Scenario: Frame Post-processing by CUDA

For each frame:
  Render to PBO with OpenGL
  Register the PBO with CUDA
  Map the buffer
  Process the buffer with a CUDA kernel
  Unmap the buffer
  Unregister the PBO from CUDA

unsigned char *p_d = 0;
cudaGLRegisterBufferObject(pbo);
cudaGLMapBufferObject((void**)&p_d, pbo);
postProcess<<<blocks,threads>>>(p_d);
cudaGLUnmapBufferObject(pbo);
cudaGLUnregisterBufferObject(pbo);
...

OpenGL

Page 20: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Async API

Page 21: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Asynchronous memory copy

Asynchronous host <-> device memory copy for page-locked memory frees up CPU on all CUDA capable devices

Overlap implemented by using a CUDA stream
  CUDA Stream = Sequence of CUDA operations that execute in order

Stream API:
  Each stream has an ID: 0 = default stream

// 4th argument is the copy direction (host-to-device assumed here), 5th is the stream
cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, 0);

Async

Page 22: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Overlap kernel and memory copy

Concurrent execution of a kernel and a host <-> device memory copy for page-locked memory
  Compute capability >= 1.1 (G84 and up)
  Available as a preview feature in CUDA 1.1

Overlaps kernel execution in one stream with a memory copy from another stream

Stream API:

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// the copy in stream1 and the kernel in stream2 are overlapped
// (copy direction assumed host-to-device here)
cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
cudaStreamQuery(stream2);

Async
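Why overlapping a copy with a kernel helps can be seen from a back-of-the-envelope pipeline model. The sketch below is an idealized estimate that is not taken from the slides: it assumes the work is split into equal chunks, that each copy and each kernel take a fixed time, and that one copy and one kernel can run concurrently with no other overhead. The function names are illustrative.

```c
/* Idealized timing model for chunked copy + kernel work; names are illustrative. */

/* Everything in one stream: each chunk is copied, then processed. */
static double serial_time(int chunks, double t_copy, double t_kernel)
{
    return chunks * (t_copy + t_kernel);
}

/* Two streams: the copy of chunk k+1 runs while the kernel works on chunk k.
 * After the first copy, each step costs the slower of the two operations,
 * and the last kernel runs with nothing left to overlap. */
static double overlapped_time(int chunks, double t_copy, double t_kernel)
{
    double slower = (t_copy > t_kernel) ? t_copy : t_kernel;
    if (chunks <= 0) return 0.0;
    return t_copy + (chunks - 1) * slower + t_kernel;
}
```

For example, with 4 chunks, 2 ms per copy and 3 ms per kernel, the model gives 20 ms serial versus 14 ms overlapped. Real gains depend on page-locked memory, device support for concurrent copy and execution, and launch overheads that this model ignores.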

Page 23: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Event API

Events are inserted (recorded) into CUDA call streams

Usage scenarios:
  measure elapsed time for CUDA calls (clock cycle precision)
  query the status of an asynchronous CUDA call
  block CPU until CUDA calls prior to the event are completed
  asyncAPI sample in CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start); cudaEventDestroy(stop);

Async

Page 24: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Libraries

Page 25: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

M02: High Performance Computing with CUDA

CUDA libraries

CUDA includes 2 widely used libraries
  CUBLAS: BLAS implementation
  CUFFT: FFT implementation

CUDPP (Data Parallel Primitives), available from http://www.gpgpu.org/developer/cudpp/ :
  Reduction
  Scan
  Sort

Library

Page 26: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Closely Coupled CPU-GPU

[Figure: CPU and GPU timelines for Operation 1, Operation 2 and Operation 3; labels: Init, Alloc, Function, Lib]

Integrated programming model
High speed data transfer, up to 5.5 GB/s
Asynchronous data transfer
Large GPU memory systems

Library

Page 27: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUBLAS

Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver
  Self-contained at the API level, no direct interaction with CUDA driver

Basic model for use
  Create matrix and vector objects in GPU memory space
  Fill objects with data
  Call sequence of CUBLAS functions
  Retrieve data from GPU

CUBLAS library contains helper functions
  Creating and destroying objects in GPU space
  Writing data to and retrieving data from objects

Library

Page 28: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Using CUBLAS

Interface to CUBLAS library is in cublas.h

Function naming convention
  cublas + BLAS name
  E.g., cublasSGEMM

Error handling
  CUBLAS core functions do not return error
    CUBLAS provides function to retrieve last error recorded
  CUBLAS helper functions do return error

Helper functions:
  Memory allocation, data transfer

Implemented using C-based CUDA tool chain
  Interfacing to C/C++ applications is trivial

Library

Page 29: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

Supported Features

           Single Precision      Double Precision*
           Real      Complex     Real                        Complex
Level 1    yes       yes         yes                         -
Level 2    yes       -           dgemv, dger, dsyr, dtrsv    -
Level 3    yes       cgemm       yes                         zgemm

*Double-precision functions only supported on GPUs with double-precision hardware

Library

Page 30: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

CUBLAS Helper Functions

cublasInit()
  Initializes CUBLAS library
cublasShutdown()
  Releases resources used by CUBLAS library
cublasGetError()
  Returns last error from CUBLAS core function (+ resets)
cublasAlloc()
  Wrapper around cudaMalloc() to allocate space for array
cublasFree()
  Destroys object in GPU memory
cublas[Set|Get][Vector|Matrix]()
  Copies array elements between CPU and GPU memory
  Accommodates non-unit strides

Library
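The "non-unit strides" remark refers to the `incx` / `incy` arguments of the vector copy helpers. A host-only sketch of that copy pattern is shown below, following the `(n, elemSize, x, incx, y, incy)` argument order used by `cublasSetVector` in the sgemm example. `strided_copy` is a made-up name, it copies host to host rather than to the GPU, and it only handles positive strides.

```c
#include <string.h>

/* Host-only model of the (n, elemSize, x, incx, y, incy) copy pattern used by
 * cublasSetVector()/cublasGetVector(). The name strided_copy is illustrative,
 * and only positive strides are handled. */
static void strided_copy(int n, int elem_size,
                         const void *x, int incx, void *y, int incy)
{
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;
    int k;

    for (k = 0; k < n; k++) {
        /* element k is read at stride incx and written at stride incy */
        memcpy(dst + (size_t)k * incy * elem_size,
               src + (size_t)k * incx * elem_size,
               (size_t)elem_size);
    }
}
```

One typical use is pulling a single row out of a column-major matrix: for an m-by-n matrix, row elements are m entries apart, so `incx = m` and `incy = 1` gathers the row into a contiguous vector.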

Page 31: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

sgemmExample.c

#include <stdio.h>
#include <stdlib.h>
#include "cublas.h"

int main(void)
{
    float *a_h, *b_h, *c_h;
    float *a_d, *b_d, *c_d;
    float alpha = 1.0f, beta = 0.0f;
    int N = 2048, n2 = N*N;
    int nBytes, i;

    nBytes = n2*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    c_h = (float *)malloc(nBytes);

    for (i=0; i < n2; i++) {
        a_h[i] = rand() / (float) RAND_MAX;
        b_h[i] = rand() / (float) RAND_MAX;
    }

    cublasInit();

    cublasAlloc(n2, sizeof(float), (void **)&a_d);
    cublasAlloc(n2, sizeof(float), (void **)&b_d);
    cublasAlloc(n2, sizeof(float), (void **)&c_d);

    cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1);
    cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);

    cublasSgemm('n', 'n', N, N, N, alpha, a_d, N,
                b_d, N, beta, c_d, N);

    cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);

    free(a_h); free(b_h); free(c_h);
    cublasFree(a_d); cublasFree(b_d);
    cublasFree(c_d);

    cublasShutdown();
    return 0;
}

Library
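A simple way to sanity-check the result of the sgemm example is to compare a few entries of `c_h` against a naive CPU computation. The sketch below is one possible reference, under the assumption (standard for BLAS and CUBLAS) that matrices are stored in column-major order. `sgemm_ref` and `IDX2C` are names chosen for this illustration, and only the untransposed 'n','n' case is covered.

```c
/* Column-major index of element (i, j) in a matrix with leading dimension ld. */
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

/* Naive CPU reference for C = alpha*A*B + beta*C with untransposed ('n','n')
 * column-major operands. sgemm_ref is an illustrative name, not a CUBLAS call. */
static void sgemm_ref(int m, int n, int k, float alpha,
                      const float *A, int lda, const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    int i, j, p;

    for (j = 0; j < n; j++) {
        for (i = 0; i < m; i++) {
            float acc = 0.0f;
            for (p = 0; p < k; p++)
                acc += A[IDX2C(i, p, lda)] * B[IDX2C(p, j, ldb)];

            /* as in BLAS, C is not read when beta is zero, so it may be
             * uninitialized on entry (as c_h and c_d are in the example) */
            if (beta == 0.0f)
                C[IDX2C(i, j, ldc)] = alpha * acc;
            else
                C[IDX2C(i, j, ldc)] = alpha * acc + beta * C[IDX2C(i, j, ldc)];
        }
    }
}
```

The triple loop is O(N^3), so for N = 2048 it is practical to check only a subset of entries, and the comparison should use a tolerance because the GPU and CPU sum the products in different orders.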

Page 32: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Calling CUBLAS from FORTRAN

Two interfaces:

Thunking (define CUBLAS_USE_THUNKING when compiling fortran.c)
  Allows interfacing to existing applications without any changes
  During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy back the results to CPU memory space and deallocate the GPGPU memory
  Intended for light testing due to call overhead

Non-Thunking (default)
  Intended for production code
  Substitute device pointers for vector and matrix arguments in all BLAS functions
  Existing applications need to be modified slightly to allocate and deallocate data structures in GPGPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)

Library

Page 33: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

SGEMM example (THUNKING)

! Define 3 single precision matrices A, B, C
real , dimension(m1,m1):: A, B, C
……
! Initialize
……
#ifdef CUBLAS
! Call SGEMM in CUBLAS library using THUNKING interface (library takes care of
! memory allocation on device and data movement)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#else
! Call SGEMM in host BLAS library
call SGEMM ('n','n',m1,m1,m1,alpha,A,m1,B,m1,beta,C,m1)
#endif

To use the host BLAS routine:

g95 -O3 code.f90 -L/usr/local/lib -lblas

To use the CUBLAS routine (fortran.c is provided by NVIDIA):

gcc -O3 -DCUBLAS_USE_THUNKING -I/usr/local/cuda/include -c fortran.c
g95 -O3 -DCUBLAS code.f90 fortran.o -L/usr/local/cuda/lib -lcublas

Library

Page 34: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

SGEMM example (NON-THUNKING)

! Define 3 single precision matrices A, B, C
real , dimension(m1,m1):: A, B, C
integer:: devPtrA, devPtrB, devPtrC, size_of_real=4
……
! Initialize A, B, C
………
! Allocate matrices on GPU
cublasAlloc(m1*m1, size_of_real, devPtrA)
cublasAlloc(m1*m1, size_of_real, devPtrB)
cublasAlloc(m1*m1, size_of_real, devPtrC)
! Copy data from CPU to GPU
cublasSetMatrix(m1,m1, size_of_real, A,m1, devPtrA, m1)
cublasSetMatrix(m1,m1, size_of_real, B,m1, devPtrB, m1)
cublasSetMatrix(m1,m1, size_of_real, C,m1, devPtrC, m1)
! Call SGEMM in CUBLAS library using NON-THUNKING interface (library is expecting data in GPU memory,
! so the matrix arguments are the device pointers)
call cublasSGEMM ('n','n',m1,m1,m1,alpha,devPtrA,m1,devPtrB,m1,beta,devPtrC,m1)
! Copy data from GPU to CPU
cublasGetMatrix(m1,m1, size_of_real, devPtrC,m1, C, m1)
! Free memory on device
cublasFree(devPtrA)
……

g95 -O3 code.f90 -L/usr/local/cuda/lib -lcublas

Library

Page 35: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Volkov and Demmel (SC08)

Benchmarking GPUs to Tune Dense Linear Algebra

Vasily Volkov
Computer Science Division, University of California at Berkeley

James W. Demmel
Computer Science Division and Department of Mathematics, University of California at Berkeley

Abstract

We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

1 Introduction

We make the following contributions. For the first time, we show an LU, QR and Cholesky factorization that achieve computational rates over 300 Gflop/s on a GPU. These are three of the most widely used factorizations in dense linear algebra and pave the way for the implementation of the entire LAPACK library [Anderson et al. 1990] for the GPUs.

Our results also include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs were available. We provide new insights into programming these and newer GPUs that help us achieve performance in such basic kernels as matrix-matrix multiply that is 60% faster than those in the optimized vendor's library CUBLAS 1.1. Some of our codes have been licensed by NVIDIA and included in CUBLAS 2.0. In our approach we think of the GPU as a multithreaded vector unit and our best algorithms were found to closely resemble earlier solutions found for vector processors.

We perform detailed benchmarks of the GPU and reveal some of the bottlenecks, such as access to the on-chip memory that bounds the performance of our best codes, and kernel launch overhead that prohibits efficient fine-grain computations. The benchmarks reveal the structure of the GPU memory system, including sizes and latencies of the L1 and L2 caches and TLB. For the first time we implement and measure the performance of a global barrier that runs entirely on the GPU. We believe this is an important step towards operating GPUs with lower CPU intervention.

To achieve the best performance in matrix factorizations we use state of art techniques such as look-ahead, overlapping CPU and GPU computation, autotuning, smarter variants of 2-level blocking, and choosing the right memory layout; we also use a novel algorithm with modified numerics. We analyze the performance of our implementations in detail to show that all components of the final system run at the nearly optimal rates.

Our best speedups vs. one quad core CPU are over 4× in all three factorizations.

The rest of this paper is organized as follows. Section 2 describes the architecture of the GPUs we used, highlighting the features common to vector architectures. Section 3 benchmarks operations including memory transfer, kernel start-up, and barriers, and uses these to analyze the performance of the panel factorization of LU. Section 4 discusses the design and performance evaluation of matrix multiplication. Section 5 discusses the design of LU, QR and Cholesky, and Section 6 evaluates their performance. Section 7 summarizes and describes future work.

2 GPU Architecture

In this work we are concerned with programming 8 series, 9 series, and 200 series of NVIDIA GPUs, as listed in Table 1. For the description of their architecture see the CUDA programming guide [NVIDIA 2008a], technical briefs [NVIDIA 2006; NVIDIA 2008b] and lecture slides in the course on programming GPUs at the University of Illinois, Urbana-Champaign [Hwu and Kirk 2007]. Additional insights can be found in decuda (1), which is a third-party disassembler of GPU binaries based on reverse-engineering of the native instruction set. The instruction set called PTX that was released by vendor is an abstraction that requires further compilation and so provides fewer insights.

GPU name             GeForce GTX280   GeForce 9800GTX   GeForce 8800GTX   GeForce 8600GTS
# of vector cores    30               16                16                4
core clock, GHz      1.30             1.67              1.35              1.45
registers/core       64KB             32KB              32KB              32KB
smem/core            16KB             16KB              16KB              16KB
memory bus, GHz      1.1              1.1               0.9               1.0
memory bus, pins     512              256               384               128
bandwidth, GB/s      141              70                86                32
memory amount        1GB              512MB             768MB             256MB
SP, peak Gflop/s     624              429               346               93
SP, peak per core    21               27                22                23
SP, flops:word       18               25                16                12
DP, peak Gflop/s     78               -                 -                 -
DP, flops:word       4.4              -                 -                 -

Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.

2.1 Notation

The GPU programming model used in the CUDA programming environment [NVIDIA 2008a] borrows much from abstractions used in graphics, e.g. such as used in the DirectX and OpenGL standards. GPU programs are run as collections of scalar threads that run faster if they remain convergent in an SIMD fashion. Similarly, individual arithmetic pipelines that execute scalar instructions are exposed as individual processing cores. For example, the technical brief on the latest GPU [NVIDIA 2008b] …

(1) http://www.cs.rug.nl/~wladimir/decuda/

SC2008 November 2008, Austin, Texas, USA 978-1-4244-2835-9/08 $25.00 ©2008 IEEE

Library
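The flops:word rows of Table 1 in the Volkov and Demmel excerpt follow from the definition in its caption: peak Gflop/s divided by the memory bandwidth expressed in words per second. A short sketch of that arithmetic follows, assuming a word is 4 bytes in single precision and 8 bytes in double precision. The function name is illustrative, and the sample figures (346 Gflop/s and 86 GB/s for the GeForce 8800GTX, 78 Gflop/s and 141 GB/s for double precision on the GTX280) are the values as read from the table.

```c
/* Flops:word as defined in the Table 1 caption: peak arithmetic rate divided by
 * memory bandwidth expressed in words. flops_per_word is an illustrative name. */
static double flops_per_word(double peak_gflops, double bandwidth_gbs,
                             int bytes_per_word)
{
    /* GB/s divided by bytes per word gives billions of words per second */
    double gwords_per_s = bandwidth_gbs / bytes_per_word;
    return peak_gflops / gwords_per_s;
}
```

For the 8800GTX this gives 346 / (86 / 4), roughly 16 flops per word, and for double precision on the GTX280 it gives 78 / (141 / 8), roughly 4.4, matching the table. A kernel has to perform about that many arithmetic operations per word fetched from memory to be compute-bound rather than bandwidth-bound.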


*%!+(7(+9(U(%EF%((+F%E!*6!8P(!%&8F7(!F%98+L,8F*%!9(8B!-P(!F%98+L,U8F*%!9(8!,&==(N!#-.!8P&8!O&9!+(=(&9(N!KJ!7(%N*+!F9!&%!&K98+&,8F*%!8P&8!+(kLF+(9!6L+8P(+!,*'MF=&8F*%!&%N!9*!M+*7FN(9!6(O(+!F%9FEP89B!

=>:,?21'1*2#,

-P(!"#$!M+*E+&''F%E!'*N(=!L9(N!F%!8P(!W$S\!M+*E+&''F%E!(%7F+*%'(%8!iZV[S[\!/110&j!K*++*O9!'L,P!6+*'!&K98+&,8F*%9!L9(N!F%!E+&MPF,9?!(BEB!9L,P!&9!L9(N!F%!8P(!SF+(,8.!&%N!]M(%"d!98&%N&+N9B!"#$!M+*E+&'9!&+(!+L%!&9!,*==(,8F*%9!*6!9,&=&+!8P+(&N9!8P&8! +L%! 6&98(+! F6! 8P(J! +('&F%! ,*%7(+E(%8! F%! &%! 4[QS! 6&9PF*%B!4F'F=&+=J?! F%NF7FNL&=! &+F8P'(8F,! MFM(=F%(9! 8P&8! (^(,L8(! 9,&=&+!F%98+L,8F*%9! &+(! (^M*9(N! &9! F%NF7FNL&=! M+*,(99F%E! ,*+(9B! )*+!(^&'M=(?!8P(!8(,P%F,&=!K+F(6!*%!8P(!=&8(98!"#$!iZV[S[\!/110Kj!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!;!P88MRGGOOOB,9B+LEB%=GhO=&NF'F+GN(,LN&G!

!"#$%&&%'()*')$+,")-%.%*+/)'#)0+#-)1'2%"&)'3)+//)'#)2+#*)'3)*0%&)4'#,)3'#)2"#&'(+/)'#)1/+&&#''$)5&")%&).#+(*"-)4%*0'5*)3"")2#'6%-"-)*0+*)1'2%"&)+#")('*)$+-")'#)-%&*#%75*"-

3'#)2#'3%*)'#)1'$$"#1%+/)+-6+(*+.")+(-)*0+*)1'2%"&)7"+#)*0%&)('*%1")+(-)*0")35//)1%*+*%'()'()*0")3%#&*)2+."8)9')1'2:)'*0"#4%&";)*')#"257/%&0;)*')2'&*)'()&"#6"#&)'#)*'))

#"-%&*#%75*")*')/%&*&;)#"<5%#"&)2#%'#)&2"1%3%1)2"#$%&&%'()+(-='#)+)3""8))

>?@AAB)C'6"$7"#)@AAB;)D5&*%(;)9"E+&;)F>D)GHBIJIK@KKI@BLMIG=AB)N@M8AA)O@AAB)PQQQ

Library

Page 37: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

DGEMM Performance

Library

Page 38: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

Additional Resources

CUDA SDK example: simpleCUBLAS

CUBLAS Library documentation: in doc folder of CUDA Toolkit or download from CUDA Zone

Library

Page 39: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUFFT

The Fast Fourier Transform (FFT) is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets.

CUFFT is the CUDA FFT library

Provides a simple interface for computing parallel FFTs on an NVIDIA GPU

Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation

Library

Page 40: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Supported Features

1D, 2D and 3D transforms of complex and real-valued data

Batched execution for doing multiple 1D transforms in parallel

1D transform size up to 8M elements

2D and 3D transform sizes in the range [2,16384]

In-place and out-of-place transforms for real and complex data.

Library

Page 41: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Transform Types

Library supports real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C

Directions: CUFFT_FORWARD (-1) and CUFFT_INVERSE (1)

According to sign of the complex exponential term

Real and imaginary parts of complex input and output arrays are interleaved

cufftComplex type is defined for this

Real to complex FFTs: output array holds only nonredundant coefficients

N -> N/2+1

N0 x N1 x … x Nn -> N0 x N1 x … x (Nn/2+1)

For in-place transforms the input/output arrays need to be padded
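The N -> N/2+1 rule for real-to-complex transforms can be sketched as below. This is a minimal illustration under stated assumptions, not code from the slides: the helper names `r2c_output_len`, `r2c_inplace_floats`, and `run_r2c_demo` are made up for this example, and error handling is reduced to return codes.

```cuda
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Number of complex outputs of a length-n real-to-complex transform (N -> N/2+1). */
static int r2c_output_len(int n) { return n / 2 + 1; }

/* Floats needed for an in-place R2C buffer: the real input must be padded
   so the same buffer can hold (n/2+1) interleaved complex values. */
static int r2c_inplace_floats(int n) { return 2 * r2c_output_len(n); }

/* Hypothetical demo: out-of-place 1D R2C transform of n real samples.
   Returns 0 on success, nonzero on any CUDA or CUFFT failure. */
static int run_r2c_demo(int n)
{
    cufftReal *d_in = NULL;
    cufftComplex *d_out = NULL;
    cufftHandle plan;
    int status = 1;

    assert(n > 0);
    if (cudaMalloc((void **)&d_in, sizeof(cufftReal) * n) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_out, sizeof(cufftComplex) * r2c_output_len(n)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemset(d_in, 0, sizeof(cufftReal) * n);   /* all-zero input, just to have defined data */

    if (cufftPlan1d(&plan, n, CUFFT_R2C, 1) == CUFFT_SUCCESS) {
        if (cufftExecR2C(plan, d_in, d_out) == CUFFT_SUCCESS &&
            cudaDeviceSynchronize() == cudaSuccess)
            status = 0;
        cufftDestroy(plan);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return status;
}
```

For N = 1024 the output buffer holds 513 complex values; an in-place transform would need the real input padded to 1026 floats. Calling `run_r2c_demo(1024)` from `main` exercises the plan on a GPU.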

Library

Page 42: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

More on Transforms

For 2D and 3D transforms, CUFFT performs transforms in row-major (C-order)

If calling from FORTRAN or MATLAB, remember to change the order of size parameters during plan creation

CUFFT performs un-normalized transforms:

IFFT(FFT(A)) = length(A)*A

CUFFT API is modeled after FFTW. Based on plans that completely specify the optimal configuration to execute a particular size of FFT

Once a plan is created, the library stores whatever state is needed to execute the plan multiple times without recomputing the configuration

Works very well for CUFFT, because different kinds of FFTs require different thread configurations and GPU resources
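Because the transforms are un-normalized, a forward/inverse round trip returns length(A)*A and the caller has to divide by the length. A hedged sketch of one way to do that on the device follows; the kernel `normalize` and the helper `roundtrip_normalized` are illustrative names, not part of CUFFT.

```cuda
#include <assert.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Scale every element by inv_n so that IFFT(FFT(A)) equals A after scaling. */
__global__ void normalize(cufftComplex *a, int count, float inv_n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        a[i].x *= inv_n;
        a[i].y *= inv_n;
    }
}

/* Hypothetical helper: in-place forward + inverse C2C transform on device
   data, followed by normalization. n is the transform length and batch the
   number of transforms stored back to back. Returns 0 on success. */
static int roundtrip_normalized(cufftComplex *d_a, int n, int batch)
{
    cufftHandle plan;
    assert(n > 0 && batch > 0);
    if (cufftPlan1d(&plan, n, CUFFT_C2C, batch) != CUFFT_SUCCESS) return 1;
    int ok = cufftExecC2C(plan, d_a, d_a, CUFFT_FORWARD) == CUFFT_SUCCESS &&
             cufftExecC2C(plan, d_a, d_a, CUFFT_INVERSE) == CUFFT_SUCCESS;
    cufftDestroy(plan);
    if (!ok) return 1;

    int count = n * batch, threads = 256;
    int blocks = (count + threads - 1) / threads;
    normalize<<<blocks, threads>>>(d_a, count, 1.0f / n);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```

Scaling on the device avoids the per-element division on the host after copying the data back.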

Library

Page 43: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

CUFFT Types and Definitions

cufftHandle: type used to store and access CUFFT plans

cufftResult: enumeration of API function return values

cufftReal: single-precision, real datatype

cufftComplex: single-precision, complex datatype

Real and complex transforms: CUFFT_C2C, CUFFT_C2R, CUFFT_R2C

Directions: CUFFT_FORWARD, CUFFT_INVERSE

Library

Page 44: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

CUFFT Example

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cufft.h"

int main(int argc, char *argv[])
{
    cufftComplex *a_h, *a_d;
    cufftHandle plan;
    int N = 1024, batchSize = 10;
    int i, nBytes;
    double maxError;

    nBytes = sizeof(cufftComplex)*N*batchSize;
    a_h = (cufftComplex *)malloc(nBytes);
    for (i=0; i < N*batchSize; i++) {
        a_h[i].x = sinf(i);
        a_h[i].y = cosf(i);
    }

    cudaMalloc((void **)&a_d, nBytes);
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

    /* batched plan: batchSize complex-to-complex transforms of length N */
    cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
    /* in-place forward transform followed by in-place inverse transform */
    cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);
    cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);

    cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

    // check error - normalize (the round trip scales the data by N)
    for (maxError = 0.0, i=0; i < N*batchSize; i++) {
        maxError = fmax(fabs(a_h[i].x/N-sinf(i)), maxError);
        maxError = fmax(fabs(a_h[i].y/N-cosf(i)), maxError);
    }
    printf("Max fft error = %g\n", maxError);

    cufftDestroy(plan);
    free(a_h); cudaFree(a_d);
    return 0;
}

Library

Page 45: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

Additional CUFFT Resources

CUDA SDK examples: simpleCUFFT, convolutionFFT2D, oceanFFT

CUFFT Library documentation: in doc folder of CUDA Toolkit or download from CUDA Zone

Library

Page 46: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
Page 47: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Glue ?

Page 48: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Interfacing CUDA

IAP09 CUDA@MIT / 6.963

Page 49: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Interfacing CUDA with other languages

CUDA kernels from FORTRAN, allocate pinned memory from FORTRAN

Calling CUDA from MATLAB with MEX files

Several packages (open source and commercial) to interface CUDA with Python, IDL, .NET, FORTRAN (Flagon). Browse CUDA Zone to find all the packages.

Glue

Page 50: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Pinned memory from FORTRAN

use iso_c_binding
! The allocation is performed by C function calls. Define the C pointer as type (C_PTR)
type(C_PTR) :: cptr_A, cptr_B, cptr_C
! Define Fortran arrays as pointer.
real, dimension(:,:), pointer :: A, B, C
! Allocating memory with cudaMallocHost.
! The Fortran arrays, now defined as pointers, are then associated with the C pointers using the
! new interoperability defined in iso_c_binding. This is equivalent to allocate(A(m1,m1))
res = cudaMallocHost ( cptr_A, m1*m1*sizeof(fp_kind) )
call c_f_pointer ( cptr_A, A, (/ m1, m1 /) )
! Use A as usual.
! See example code for cudaMallocHost interface code

Pinned memory provides a fast PCI-e transfer speed and enables use of streams:

• Allocation needs to be done with cudaMallocHost

• Use new Fortran 2003 features for interoperability with C.

http://www.nvidia.com/object/cuda_programming_tools.html

Glue

Page 51: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Calling CUDA kernels from FORTRAN

! Fortran -> C -> CUDA -> C -> Fortran
call cudafunction(c,c2,N)

From Fortran call C function that will call CUDA kernel

/* NB: Fortran subroutine arguments are passed by reference. */
extern "C" void cudafunction_(cuComplex *a, cuComplex *b, int *Np)
{
    ...
    int N = *Np;
    cudaMalloc((void **) &a_d, sizeof(cuComplex)*N);
    cudaMemcpy(a_d, a, sizeof(cuComplex)*N, cudaMemcpyHostToDevice);
    dim3 dimBlock(block_size);
    dim3 dimGrid(N/dimBlock.x);
    if (N % block_size != 0) dimGrid.x += 1;   /* round up so every element is covered */
    square_complex<<<dimGrid,dimBlock>>>(a_d, a_d, N);
    cudaMemcpy(b, a_d, sizeof(cuComplex)*N, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
}

complex_mul: main.f90 cuda_function.o
	$(FC) -o complex_mul main.f90 cuda_function.o -L/usr/local/cuda/lib -lcudart

cuda_function.o: cuda_function.cu
	nvcc -c -O3 cuda_function.cu
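The slide shows only the launch of `square_complex`, not the kernel itself. A plausible body, given purely as an assumption about what such a kernel might look like, is sketched here together with the round-up rule used for the grid size; `blocks_for` is a made-up helper name.

```cuda
#include <assert.h>
#include <cuda_runtime.h>
#include <cuComplex.h>

/* Hypothetical kernel body (not shown on the slide): squares each complex
   element, using (x + iy)^2 = (x^2 - y^2) + i(2xy). */
__global__ void square_complex(const cuComplex *in, cuComplex *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       /* guard: the last block may be partial */
        cuComplex v = in[i];           /* read first, so in == out is safe */
        out[i] = make_cuComplex(v.x * v.x - v.y * v.y, 2.0f * v.x * v.y);
    }
}

/* Number of blocks needed to cover n elements; same result as the slide's
   "N / block_size, plus one if there is a remainder". */
static int blocks_for(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}
```

The bounds check inside the kernel is what makes the rounded-up grid safe when N is not a multiple of the block size.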

Glue

Page 52: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA & MATLAB

Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C and Fortran).

MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files.

MEX files could be used to exploit multi-core processors with OpenMP or threaded codes or, as in this case, to offload functions to the GPU.

Glue

Page 53: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

NVMEX

Native MATLAB script cannot parse CUDA code

New MATLAB script nvmex.m compiles CUDA code (.cu) to create MATLAB function files

Syntax similar to original mex script:

>> nvmex -f nvmexopts.bat filename.cu -IC:\cuda\include -LC:\cuda\lib -lcudart

Available for Windows and Linux from:

http://developer.nvidia.com/object/matlab_cuda.html

Glue

Page 54: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Mex files for CUDA

A typical mex file will perform the following steps:

1. Convert from double to single precision

2. Rearrange the data layout for complex data

3. Allocate memory on the GPU

4. Transfer the data from the host to the GPU

5. Perform computation on GPU (library, custom code)

6. Transfer results from the GPU to the host

7. Rearrange the data layout for complex data

8. Convert from single to double

9. Clean up memory and return results to MATLAB

Some of these steps will go away with new versions of the library (2, 7) and new hardware (1, 8)
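Steps 1-2 and 7-8 of the list above are plain host-side data conversions. A minimal sketch follows, assuming the host data arrives as separate real and imaginary double arrays (the "split complex format" mentioned on the MEX example slide); `pack_c2c` and `unpack_c2c` are hypothetical names.

```cuda
#include <assert.h>
#include <stddef.h>
#include <cufft.h>

/* Steps 1-2: split double-precision arrays (re[], im[]) -> interleaved
   single-precision cufftComplex. im may be NULL for purely real input. */
static void pack_c2c(const double *re, const double *im, cufftComplex *out, size_t n)
{
    assert(re != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].x = (float)re[i];
        out[i].y = im ? (float)im[i] : 0.0f;
    }
}

/* Steps 7-8: interleaved single precision -> split double precision. */
static void unpack_c2c(const cufftComplex *in, double *re, double *im, size_t n)
{
    assert(in != NULL && re != NULL && im != NULL);
    for (size_t i = 0; i < n; i++) {
        re[i] = (double)in[i].x;
        im[i] = (double)in[i].y;
    }
}
```

These two loops run on the CPU on every call, which is why the slide expects them to disappear once the library and hardware handle the formats directly.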

Glue

Page 55: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA MEX example

/*Parse input, convert to single precision and to interleaved complex format */

…..

/* Allocate array on the GPU */

cufftComplex *rhs_complex_d;

cudaMalloc( (void **) &rhs_complex_d,sizeof(cufftComplex)*N*M);

/* Copy input array in interleaved format to the GPU */

cudaMemcpy( rhs_complex_d, input_single, sizeof(cufftComplex)*N*M, cudaMemcpyHostToDevice);

/* Create plan for CUDA FFT NB: transposing dimensions*/

cufftPlan2d(&plan, N, M, CUFFT_C2C) ;

/* Execute FFT on GPU */

cufftExecC2C(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE) ;

/* Copy result back to host */

cudaMemcpy( input_single, rhs_complex_d, sizeof(cufftComplex)*N*M, cudaMemcpyDeviceToHost);

/* Clean up memory and plan on the GPU */

cufftDestroy(plan); cudaFree(rhs_complex_d);

/*Convert back to double precision and to split complex format */

….

Additional code in MEX file to handle CUDA

Glue

Page 56: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Timing details

1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence

                                 Opteron 250            Opteron 2210
                                 Runtime    Speed up    Runtime    Speed up
Standard MATLAB                  8098 s                 9525 s
Overload FFT2 and IFFT2          4425 s     1.8x        4937 s     1.9x
Overload Szeta                   735 s      11.x        789 s      12.x
Overload Szeta, FFT2 and IFFT2   577 s      14.x        605 s      15.7x

PCI-e Bandwidth, host to/from device: 1483 MB/s, 1223 MB/s, 1135 MB/s, 1003 MB/s

Glue

Page 57: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Glue

Page 58: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Glue

Page 59: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Glue

Page 60: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Glue

Page 61: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Glue

Page 62: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
Page 63: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Wanna Play with The Big Guys?

Page 64: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Performance Strategies

IAP09 CUDA@MIT / 6.963

Page 65: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2006 3

Programming Model

A kernel is executed as a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by:

Sharing data through shared memory

Synchronizing their execution

Threads from different blocks cannot cooperate

[Figure: the Host launches Kernel 1 on Grid 1 of the Device (Blocks (0, 0) to (2, 1)) and Kernel 2 on Grid 2. Block (1, 1) is expanded to show its Threads (0, 0) to (4, 2).]

Threading

Page 66: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© NVIDIA Corporation 2008 10

Data Movement in a CUDA Program

Host Memory -> Device Memory -> [Shared Memory] -> COMPUTATION -> [Shared Memory] -> Device Memory -> Host Memory

Memory

Page 67: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Optimize Algorithms for the GPU

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Sometimes it's better to recompute than to cache

GPU spends its transistors on ALUs, not memory

Do more computation on the GPU to avoid costly data transfers

Even low parallelism computations can sometimes be faster than transferring back and forth to host

Perf

Page 68: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Optimize Memory Coherence

Coalesced vs. non-coalesced = order of magnitude

Global/Local device memory

Optimize for spatial locality in cached texture memory

In shared memory, avoid high-degree bank conflicts

Perf

Page 69: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Take Advantage of Shared Memory

Hundreds of times faster than global memory

Threads can cooperate via shared memory

Use one / a few threads to load / compute data shared by all threads

Use it to avoid non-coalesced access

Stage loads and stores in shared memory to re-order non-coalesceable addressing

Matrix transpose example later

Perf

Page 70: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Use Parallelism Efficiently

Partition your computation to keep the GPU multiprocessors equally busy

Many threads, many thread blocks

Keep resource usage low enough to support multiple active thread blocks per multiprocessor

Registers, shared memory

Perf

Page 71: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Memory optimizations

Optimizing memory transfers

Coalescing global memory accesses

Using shared memory effectively

Perf

Page 72: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Data Transfers

Device memory to host memory bandwidth much lower than device memory to device bandwidth

4GB/s peak (PCI-e x16) vs. 80 GB/s peak (Quadro FX 5600)

8GB/s for PCI-e 2.0

Minimize transfers

Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory

Group transfers

One large transfer much better than many small ones

Perf

Page 73: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Page-Locked Memory Transfers

cudaMallocHost() allows allocation of page-locked host memory

Enables highest cudaMemcpy performance

3.2 GB/s+ common on PCI-express (x16)

~4 GB/s measured on nForce 680i motherboards (overclocked PCI-e)

See the "bandwidthTest" CUDA SDK sample

Use with caution

Allocating too much page-locked memory can reduce overall system performance

Test your systems and apps to learn their limits
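One way to "test your systems" is to time the same host-to-device copy from pageable and from page-locked memory. The sketch below is a rough illustration, not the SDK bandwidthTest sample; `h2d_gbps` and `compare_h2d` are made-up names, and a single untimed-warm-up-free copy only gives a coarse number.

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Time one host-to-device copy of `bytes` bytes; returns GB/s, or -1 on error. */
static float h2d_gbps(const void *h_src, void *d_dst, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;
    return (float)bytes / (ms * 1.0e6f);      /* bytes per millisecond -> GB/s */
}

/* Hypothetical comparison: fills gbps[0] (pageable) and gbps[1] (page-locked).
   Returns 0 if both measurements succeeded. */
static int compare_h2d(size_t bytes, float gbps[2])
{
    void *pageable = calloc(1, bytes), *pinned = NULL, *dev = NULL;
    int status = 1;
    assert(bytes > 0);
    if (pageable &&
        cudaMallocHost(&pinned, bytes) == cudaSuccess &&
        cudaMalloc(&dev, bytes) == cudaSuccess) {
        gbps[0] = h2d_gbps(pageable, dev, bytes);
        gbps[1] = h2d_gbps(pinned, dev, bytes);
        status = (gbps[0] > 0.0f && gbps[1] > 0.0f) ? 0 : 1;
    }
    if (dev) cudaFree(dev);
    if (pinned) cudaFreeHost(pinned);
    free(pageable);
    return status;
}
```

Calling `compare_h2d(64 << 20, gbps)` with a 64 MB buffer and printing both entries shows how much the page-locked path helps on a given machine; the actual figures depend on the chipset and PCI-e generation.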

Perf

Page 74: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Global Memory Reads/Writes

Highest latency instructions: 400-600 clock cycles

Likely to be performance bottleneck

Optimizations can greatly increase performance

Coalescing: up to 10x speedup

Latency hiding: up to 2.5x speedup

Perf

Page 75: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalescing

A coordinated read by a half-warp (16 threads)

A contiguous region of global memory:

64 bytes - each thread reads a word: int, float, …

128 bytes - each thread reads a double-word: int2, float2, …

256 bytes - each thread reads a quad-word: int4, float4, …

Additional restrictions on G8X/G9X architecture:

Starting address for a region must be a multiple of region size

The k-th thread in a half-warp must access the k-th element in a block being read

Exception: not all threads must be participating

Predicated access, divergence within a half-warp

Perf

Page 76: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalesced Access: Reading floats

[Figure: half-warp threads t0 to t15 reading consecutive 4-byte words at addresses 128 to 192. Two panels: "All threads participate" and "Some Threads Do Not Participate".]

Perf

Page 77: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Uncoalesced Access: Reading floats

[Figure: half-warp threads t0 to t15 and 4-byte words at addresses 128 to 192. Two panels: "Permuted Access by Threads" and "Misaligned Starting Address (not a multiple of 64)".]

Perf

Page 78: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalescing: Timing Results

Experiment on G80:

Kernel: read a float, increment, write back

3M floats (12MB)

Times averaged over 10K runs

12K blocks x 256 threads:

356µs - coalesced

357µs - coalesced, some threads don't participate

3,494µs - permuted/misaligned thread access
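The kernel described in this experiment (read a float, increment, write back) might look like the sketch below, with a shifted variant that breaks the alignment rule. These kernels are illustrative assumptions rather than the code that produced the timings; `halfwarp_start_aligned` is a made-up host helper encoding the 64-byte rule for 4-byte words.

```cuda
#include <assert.h>
#include <stddef.h>
#include <cuda_runtime.h>

/* Coalesced: thread k of each half-warp touches element k of an aligned segment. */
__global__ void increment_coalesced(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

/* Misaligned: same work, but every access is shifted by `offset` elements, so
   a half-warp no longer starts on a 64-byte boundary when offset % 16 != 0. */
__global__ void increment_offset(float *data, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n) data[i] += 1.0f;
}

/* Host-side check of the rule for 4-byte words on G8x-class hardware:
   the half-warp's first address must be a multiple of 64 bytes
   (16 threads * 4 bytes). */
static int halfwarp_start_aligned(size_t first_byte_address)
{
    return first_byte_address % 64 == 0;
}
```

Timing both kernels over the same array, for example with CUDA events, is one way to reproduce the order-of-magnitude gap on hardware of that generation; newer GPUs relax these rules considerably.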

Perf

Page 79: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalescing: Structures of size ≠ 4, 8, or 16 Bytes

Use a Structure of Arrays (SoA) instead of Array of Structures (AoS)

If SoA is not viable:

Force structure alignment: __align(X), where X = 4, 8, or 16

Use SMEM to achieve coalescing

Point structure: x y z

AoS: x y z | x y z | x y z

SoA: x x x | y y y | z z z
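The two layouts can be compared side by side in code. The struct and kernel names below (`PointAoS`, `PointsSoA`, `scale_aos`, `scale_soa`) are made up for this sketch; the point is only the distance in bytes between the words that neighbouring threads touch.

```cuda
#include <assert.h>
#include <stddef.h>
#include <cuda_runtime.h>

/* Array of Structures: 12-byte elements, so a half-warp reading p[i].x steps
   through memory 12 bytes at a time (not coalesced on G8x-class hardware). */
struct PointAoS { float x, y, z; };

__global__ void scale_aos(PointAoS *p, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p[i].x *= s; p[i].y *= s; p[i].z *= s; }
}

/* Structure of Arrays: each component is its own contiguous float array,
   so consecutive threads read consecutive 4-byte words. */
struct PointsSoA { float *x, *y, *z; };

__global__ void scale_soa(PointsSoA p, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { p.x[i] *= s; p.y[i] *= s; p.z[i] *= s; }
}

/* Byte distance between the words accessed by neighbouring threads. */
static size_t aos_stride_bytes(void) { return sizeof(PointAoS); }
static size_t soa_stride_bytes(void) { return sizeof(float); }
```

With SoA, the three device arrays are allocated separately and the small `PointsSoA` struct of pointers is passed to the kernel by value.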

Perf

Page 80: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalescing: Summary

Coalescing greatly improves throughput

Critical to memory-bound kernels

Reading structures of size other than 4, 8, or 16 bytes will break coalescing:

Prefer Structures of Arrays over AoS

If SoA is not viable, read/write through SMEM

Additional resources: Aligned Types SDK Sample

Perf

Page 81: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Parallel Memory Architecture

In a parallel machine, many threads access memory

Therefore, memory is divided into banks

Essential to achieve high bandwidth

Each bank can service one address per cycle

A memory can service as many simultaneous accesses as it has banks

Multiple simultaneous accesses to a bank result in a bank conflict

Conflicting accesses are serialized

[Figure: shared memory drawn as Bank 0 through Bank 15.]

Perf

Page 82: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Bank Addressing Examples

No Bank Conflicts: linear addressing, stride == 1

No Bank Conflicts: random 1:1 permutation

[Figure: two panels mapping Threads 0 to 15 onto Banks 0 to 15, one thread per bank in both cases.]

Perf

Page 83: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Bank Addressing Examples

2-way Bank Conflicts: linear addressing, stride == 2

8-way Bank Conflicts: linear addressing, stride == 8

[Figure: two panels mapping Threads 0 to 15 onto Banks 0 to 15. With stride 2, two threads share each even bank; with stride 8, Banks 0 and 8 each receive eight threads (x8).]

Perf

Page 84: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

How addresses map to banks on G80

Bandwidth of each bank is 32 bits per 2 clock cycles

Successive 32-bit words are assigned to successive banks

G80 has 16 banks

So bank = address % 16

Same as the size of a half-warp

No bank conflicts between different half-warps, only within a single half-warp

Perf

Page 85: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Shared memory bank conflicts

Shared memory is as fast as registers if there are no bank conflicts

The fast case:

If all threads of a half-warp access different banks, there is no bank conflict

If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)

The slow case:

Bank Conflict: multiple threads in the same half-warp access the same bank

Must serialize the accesses

Cost = max # of simultaneous accesses to a single bank
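The cost rule (maximum number of simultaneous accesses to one bank) can be checked with a few lines of host code. This is a simplified model, assuming 16 banks of 32-bit words and a half-warp in which thread t accesses word t * stride; `bank_of_word` and `conflict_degree` are illustrative names, and the broadcast case is deliberately left out.

```cuda
#include <assert.h>

#define NUM_BANKS 16      /* G80: 16 banks, each one 32-bit word wide */
#define HALF_WARP 16

/* Bank holding a given 32-bit word index ("bank = address % 16"). */
static int bank_of_word(int word_index) { return word_index % NUM_BANKS; }

/* Conflict degree when thread t of a half-warp accesses word t * stride:
   the largest number of threads that land in the same bank.
   1 means conflict-free. Broadcast (all threads, same word) is not modelled. */
static int conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    assert(stride > 0);
    for (int t = 0; t < HALF_WARP; t++) {
        int b = bank_of_word(t * stride);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

In this model odd strides are conflict-free, stride 2 gives a 2-way conflict, stride 8 an 8-way conflict, and stride 16 serializes the whole half-warp, while stride 17 is conflict-free again. That last case is the reason for padding a 16-wide shared-memory tile to 17 columns.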

Perf

Page 86: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Conflicts, Coalescing, Warps... I hate growing up.

Perf

Page 87: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Optimization Example: Matrix Transpose

Perf

Page 88: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Matrix Transpose

SDK Sample ("transpose")

Illustrates:

Coalescing

Avoiding SMEM bank conflicts

Speedups for even small matrices

 1  2  3  4         1  5  9 13
 5  6  7  8   ->    2  6 10 14
 9 10 11 12         3  7 11 15
13 14 15 16         4  8 12 16

Perf

Page 89: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Uncoalesced Transpose

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height)
    {
        // read: neighbouring threads read neighbouring addresses
        unsigned int index_in = xIndex + width * yIndex;
        // write: neighbouring threads are 'height' elements apart
        unsigned int index_out = yIndex + height * xIndex;
        odata[index_out] = idata[index_in];
    }
}

Perf

Page 90: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Uncoalesced Transpose

Reads input from GMEM: stride = 1, coalesced

Write output to GMEM: stride = 16, uncoalesced

[Figure: 16x16 tile index grids (0,0 to 15,15) for the GMEM read and the GMEM write.]

Perf

Page 91: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalesced Transpose

Assumption: matrix is partitioned into square tiles

Threadblock (bx, by):

Read the (bx,by) input tile, store into SMEM

Write the SMEM data to (by,bx) output tile

Transpose the indexing into SMEM

Thread (tx,ty):

Reads element (tx,ty) from input tile

Writes element (tx,ty) into output tile

Coalescing is achieved if:

Block/tile dimensions are multiples of 16

Perf

Page 92: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalesced Transpose

[Figure: four 16x16 tile index grids showing the access pattern of each step: Reads from GMEM, Writes to SMEM, Reads from SMEM, Writes to GMEM.]

Perf

Page 93: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

SMEM Optimization

Threads read SMEM with stride = 16

Bank conflicts

Solution

Allocate an "extra" column

Read stride = 17

Threads read from consecutive banks

[Figure: two 16x16 tile index grids labelled "Reads from SMEM", before and after adding the extra column.]

Perf

Page 94: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)


Page 95: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Coalesced Transpose

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    // BLOCK_DIM is the tile width; the +1 pads each row with an extra column to avoid bank conflicts
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xBlock = blockDim.x * blockIdx.x;
    unsigned int yBlock = blockDim.y * blockIdx.y;
    unsigned int xIndex = xBlock + threadIdx.x;
    unsigned int yIndex = yBlock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xIndex < width && yIndex < height)
    {
        // coalesced read of the input tile into shared memory
        unsigned int index_in = width * yIndex + xIndex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];

        // transposed position inside the tile, and coalesced position in the output
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xBlock + threadIdx.y) + yBlock + threadIdx.x;
    }

    // wait until the whole tile has been loaded
    __syncthreads();

    if (xIndex < width && yIndex < height)
        odata[index_out] = block[index_transpose];
}
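To see the tiled transpose end to end, the sketch below restates it compactly, launches it with a rounded-up grid, and compares the result against a CPU reference. It is an illustration under stated assumptions, not the SDK sample: `TILE`, `transpose_tiled`, and `check_transpose` are names chosen here, a 2D shared array replaces the flat padded array, and the bounds checks are written so that sizes that are not multiples of the tile width also work.

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define TILE 16   /* assumed tile width, matching the half-warp size */

/* Compact restatement of the tiled transpose so this sketch compiles on its own. */
__global__ void transpose_tiled(float *odata, const float *idata, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];          /* +1 column avoids bank conflicts */
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;            /* column in the output matrix */
    y = blockIdx.x * TILE + threadIdx.y;            /* row in the output matrix */
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

/* Transpose a width x height matrix on the GPU and compare with the CPU view.
   Returns 0 if every element matches. */
static int check_transpose(int width, int height)
{
    size_t bytes = sizeof(float) * width * height;
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    float *d_in = NULL, *d_out = NULL;
    int bad = 1;
    assert(h_in && h_out);
    for (int i = 0; i < width * height; i++) h_in[i] = (float)i;

    if (cudaMalloc((void **)&d_in, bytes) == cudaSuccess &&
        cudaMalloc((void **)&d_out, bytes) == cudaSuccess) {
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
        dim3 block(TILE, TILE);
        dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
        transpose_tiled<<<grid, block>>>(d_out, d_in, width, height);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        bad = 0;
        for (int y = 0; y < height && !bad; y++)
            for (int x = 0; x < width; x++)
                if (h_out[x * height + y] != h_in[y * width + x]) { bad = 1; break; }
    }
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return bad;
}
```

Calling `check_transpose(1024, 1024)` from `main` verifies the kernel; wrapping the launch in CUDA events would give timings comparable in spirit to the ones on the next slide.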

Perf

Page 96: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Transpose Timings

Speedups with coalescing and SMEM optimization:

128x128: 0.011ms vs. 0.022ms (2.0X speedup)

512x512: 0.07ms vs. 0.33ms (4.5X speedup)

1024x1024: 0.30ms vs. 1.92ms (6.4X speedup)

1024x2048: 0.79ms vs. 6.6ms (8.4X speedup)

Coalescing without SMEM optimization:

128x128: 0.014ms

512x512: 0.101ms

1024x1024: 0.412ms

1024x2048: 0.869ms

Perf

Page 97: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Execution Configuration Optimizations

Perf

Page 98: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Occupancy

Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy

Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently

Limited by resource usage:

Registers

Shared memory

Perf

Page 99: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Grid/Block Size Heuristics

# of blocks > # of multiprocessors

So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2

Multiple blocks can run concurrently in a multiprocessor

Blocks that aren't waiting at a __syncthreads() keep the hardware busy

Subject to resource availability - registers, shared memory

# of blocks > 100 to scale to future devices

Blocks executed in pipeline fashion

1000 blocks per grid will scale across multiple generations

Perf

Page 100: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Register Dependency

Read-after-write register dependency

Instruction's result can be read ~22 cycles later

Scenarios:

CUDA: x = y + 5;
      z = x + 3;
PTX:  add.f32 $f3, $f1, $f2
      add.f32 $f5, $f3, $f4

CUDA: s_data[0] += 3;
PTX:  ld.shared.f32 $f3, [$r31+0]
      add.f32 $f3, $f3, $f4

To completely hide the latency:

Run at least 192 threads (6 warps) per multiprocessor

At least 25% occupancy

Threads do not have to belong to the same thread block

Perf

Page 101: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Register Pressure

Hide latency by using more threads per SM

Limiting Factors:

Number of registers per kernel

8192 per SM, partitioned among concurrent threads

Amount of shared memory

16KB per SM, partitioned among concurrent threadblocks

Check .cubin file for # registers / kernel

Use -maxrregcount=N flag to NVCC

N = desired maximum registers / kernel

At some point "spilling" into LMEM may occur

Reduces performance - LMEM is slow

Check .cubin file for LMEM usage

Perf

Page 102: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Determining resource usage

Use the "--ptxas-options=-v" option to nvcc

Or, compile the kernel code with the -cubin flag to determine register usage.

Open the .cubin file with a text editor and look for the "code" section.

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
    name = BlackScholesGPU
    lmem = 0      (per thread local memory)
    smem = 68     (per thread block shared memory)
    reg = 20      (per thread registers)
    bar = 0
    bincode {
        0xa0004205 0x04200780 0x40024c09 0x00200780
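Given the register and shared-memory figures reported for a kernel, a rough occupancy estimate can be computed by hand or with a few lines of code. The sketch below is a simplified model and not the CUDA Occupancy Calculator: the register and shared-memory limits are the ones quoted on the slides, the 24-warp maximum follows from "192 threads (6 warps) = 25% occupancy", and the limit of 8 resident blocks per multiprocessor is an assumption about G80-class parts. Allocation granularity is ignored.

```cuda
#include <assert.h>

/* Per-multiprocessor limits (see the assumptions stated above). */
#define REGS_PER_SM   8192
#define SMEM_PER_SM   16384
#define MAX_WARPS     24
#define MAX_BLOCKS    8      /* assumed, not stated on the slides */
#define WARP_SIZE     32

static int min_int(int a, int b) { return a < b ? a : b; }

/* Rough occupancy estimate: resident warps divided by the maximum. */
static float estimate_occupancy(int threads_per_block, int regs_per_thread,
                                int smem_per_block)
{
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
    int blocks = MAX_BLOCKS;
    blocks = min_int(blocks, MAX_WARPS / warps_per_block);
    blocks = min_int(blocks, REGS_PER_SM / (regs_per_thread * threads_per_block));
    if (smem_per_block > 0)
        blocks = min_int(blocks, SMEM_PER_SM / smem_per_block);
    return (float)(blocks * warps_per_block) / MAX_WARPS;
}
```

For a kernel reporting reg = 20 and smem = 68, launched with an assumed 128 threads per block, registers are the limiting resource: 3 blocks fit, giving 12 of 24 warps, or 50% occupancy under this model.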

Perf

Page 103: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Occupancy Calculator

Perf

Page 104: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Optimizing threads per block

Choose threads per block as a multiple of warp size

Avoid wasting computation on under-populated warps

More threads per block == better memory latency hiding

But, more threads per block == fewer registers per thread

Kernel invocations can fail if too many registers are used

Heuristics

Minimum: 64 threads per block

Only if multiple concurrent blocks

192 or 256 threads a better choice

Usually still enough regs to compile and invoke successfully

This all depends on your computation, so experiment!
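The first two heuristics are easy to check numerically. The sketch below (helper names are illustrative) rounds a block size up to a whole number of warps, counts the lanes that would sit idle otherwise, and computes how many blocks are needed to cover a problem of a given size.

```python
WARP_SIZE = 32


def round_up_to_warp(threads):
    """Smallest multiple of the warp size that is >= threads."""
    return ((threads + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE


def idle_lanes(threads_per_block):
    """Lanes wasted in the last, under-populated warp of each block."""
    return round_up_to_warp(threads_per_block) - threads_per_block


def grid_size(n_elements, threads_per_block):
    """Blocks needed so that every element gets a thread (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


print(round_up_to_warp(100), idle_lanes(100), idle_lanes(192))
print(grid_size(1000000, 256))
```

A 100-thread block occupies four warps and wastes 28 lanes per block; 192 or 256 threads waste none.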

Page 105: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Occupancy != Performance

Increasing occupancy does not necessarily increase performance

BUT…

Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels

(It all comes down to arithmetic intensity and available parallelism)

Page 106: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Parameterize Your Application

Parameterization helps adaptation to different GPUs

GPUs vary in many ways:

# of multiprocessors

Memory bandwidth

Shared memory size

Register file size

Threads per block

You can even make apps self-tuning (like FFTW and ATLAS)

"Experiment" mode discovers and saves optimal configuration
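An "experiment" mode can be as simple as timing each candidate configuration once and caching the winner. The sketch below shows the shape of such a loop; `fake_kernel_time` is a hypothetical stand-in for launching and timing a real kernel, and the JSON file name is arbitrary.

```python
import json
import os
import tempfile


def autotune(candidates, measure):
    """Run `measure` on every candidate configuration and return the fastest."""
    timings = {config: measure(config) for config in candidates}
    best = min(timings, key=timings.get)
    return best, timings


def save_config(path, config):
    with open(path, "w") as handle:
        json.dump({"threads_per_block": config}, handle)


def load_config(path):
    with open(path) as handle:
        return json.load(handle)["threads_per_block"]


def fake_kernel_time(threads_per_block):
    # Hypothetical cost model standing in for a timed kernel launch (ms).
    return 10.0 + abs(threads_per_block - 192) / 64.0


best, timings = autotune([64, 128, 192, 256, 512], fake_kernel_time)
config_path = os.path.join(tempfile.gettempdir(), "tuned_config.json")
save_config(config_path, best)
print(best, load_config(config_path))
```

On later runs the application would call `load_config` instead of repeating the experiment, re-tuning only when the GPU changes.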

Page 107: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Conclusion

Understand CUDA performance characteristics

Memory coalescing

Divergent branching

Bank conflicts

Latency hiding

Use peak performance metrics to guide optimization

Understand parallel algorithm complexity theory

Know how to identify type of bottleneck

e.g. memory, core computation, or instruction overhead

Optimize your algorithm, then unroll loops

Use template parameters to generate optimal code

Page 108: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

The CUDA Visual Profiler

Page 109: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

The CUDA Visual Profiler

Helps measure and find potential performance problems

GPU and CPU timing for all kernel invocations and memcpys

Time stamps

Access to hardware performance counters

Page 110: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Profiler Demo

Page 111: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Signals

Events are tracked with hardware counters on signals in the chip:

timestamp

gld_incoherent, gld_coherent, gst_incoherent, gst_coherent: Global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent)

local_load, local_store: Local loads/stores

branch, divergent_branch: Total branches and divergent branches taken by threads

instructions: instruction count

warp_serialize: thread warps that serialize on address conflicts to shared or constant memory

cta_launched: executed thread blocks

Page 112: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Interpreting profiler counters

Values represent events within a thread warp

Only targets one multiprocessor

Values will not correspond to the total number of warps launched for a particular kernel.

Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.

Values are best used to identify relative performance differences between unoptimized and optimized code

In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize
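Because the counters are only meaningful relative to one another, a convenient habit is to diff two profiler runs. The sketch below assumes the counter values have already been copied out of the profiler into dictionaries; the numbers are made up for illustration and do not come from a real run.

```python
def relative_change(before, after):
    """Fractional change of each counter between two profiler runs."""
    changes = {}
    for name, old in before.items():
        new = after.get(name, 0)
        if old:
            changes[name] = (new - old) / old
        elif new:
            changes[name] = float("inf")
        else:
            changes[name] = 0.0
    return changes


# Hypothetical counter values for the same kernel before and after tuning.
unoptimized = {"gld_incoherent": 12000, "divergent_branch": 800, "warp_serialize": 400}
optimized = {"gld_incoherent": 0, "divergent_branch": 200, "warp_serialize": 100}

for name, change in relative_change(unoptimized, optimized).items():
    print(name, "%+.0f%%" % (100 * change))
```

A drop of gld_incoherent to zero, as in this made-up example, would indicate that all global loads on the sampled multiprocessor are now coalesced.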

Page 113: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

COME

Page 114: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Back Pocket Slides

slide by David Cox

Page 115: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Dense Linear Algebra

IAP09 CUDA@MIT / 6.963

Page 116: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

© 2008 NVIDIA Corporation.

Dense Linear Algebra

Major problems:

• Linear systems

• Eigenvalues

• Singular values

Source of large-scale cases:

• Aeronautics: BEM

• Computational chemistry

• Data mining

Page 117: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Dense Linear Algebra

Major problems:

• Linear systems

• Eigenvalues

• Singular values

Algorithms:

• One-sided factorizations: LU, Cholesky, QR

• Two-sided factorizations: QR alg., Jacobi

• Two-sided factorizations: SVD

Page 118: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

BLAS (Fortran 77):

CALL SGEMM('N', 'N', m, n, k, 1.0, A, LDA, B, LDB, 1.0, C, LDC)

!"#$%&'!"#$%&()$*+,-#.(!"#"!"$"%"&"'

CUBLAS (C):

cublasSgemm('N', 'N', m, n, k, 1.0, dA, LDA, dB, LDB, 1.0, dC, LDC);

Computation in GPU requires:

• Initialization of CUDA environment

• Allocation of data structures in GPU memory (handlers dA, dB, dC)

• Transfer of data (matrices A, B, C)

• Computation (cublasSgemm)

• Retrieve result (matrix C)

• Free data structures in GPU memory

• Termination of CUDA environment
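For readers unfamiliar with the BLAS calling convention, the following plain-Python reference shows what an SGEMM call with both transpose flags set to 'N' computes: C := alpha*A*B + beta*C on column-major arrays with leading dimensions lda, ldb and ldc. It is a sketch of the semantics only (the function name `sgemm_reference` is made up) and says nothing about how CUBLAS implements the operation.

```python
def sgemm_reference(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """C := alpha*A*B + beta*C for column-major A (m x k), B (k x n), C (m x n).

    Element (i, j) of a column-major matrix with leading dimension ld
    is stored at index i + j * ld.
    """
    for j in range(n):
        for i in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]
    return c


# A = [[1, 2], [3, 4]] stored column by column.
A = [1.0, 3.0, 2.0, 4.0]
C = sgemm_reference(2, 2, 2, 1.0, A, 2, A, 2, 1.0, [1.0] * 4, 2)
print(C)
```

On the GPU the same call operates on device copies of the matrices (the handlers dA, dB, dC in the list above), which is why the allocation, transfer and retrieval steps surround the computation.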

Page 119: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Benchmarking GPUs to Tune Dense Linear Algebra

Vasily Volkov
Computer Science Division, University of California at Berkeley

James W. Demmel
Computer Science Division and Department of Mathematics, University of California at Berkeley

Abstract

We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80–90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.

1 Introduction

We make the following contributions. For the first time, we show an LU, QR and Cholesky factorization that achieve computational rates over 300 Gflop/s on a GPU. These are three of the most widely used factorizations in dense linear algebra and pave the way for the implementation of the entire LAPACK library [Anderson et al. 1990] for the GPUs.

Our results also include performance on the 8-series of NVIDIA GPUs that was not previously attained in the 1.5 years since these GPUs were available. We provide new insights into programming these and newer GPUs that help us achieve performance in such basic kernels as matrix-matrix multiply that is 60% faster than those in the optimized vendor's library CUBLAS 1.1. Some of our codes have been licensed by NVIDIA and included in CUBLAS 2.0. In our approach we think of the GPU as a multithreaded vector unit and our best algorithms were found to closely resemble earlier solutions found for vector processors.

We perform detailed benchmarks of the GPU and reveal some of the bottlenecks, such as access to the on-chip memory that bounds the performance of our best codes, and kernel launch overhead that prohibits efficient fine-grain computations. The benchmarks reveal the structure of the GPU memory system, including sizes and latencies of the L1 and L2 caches and TLB. For the first time we implement and measure the performance of a global barrier that runs entirely on the GPU. We believe this is an important step towards operating GPUs with lower CPU intervention.

To achieve the best performance in matrix factorizations we use state of art techniques such as look-ahead, overlapping CPU and GPU computation, autotuning, smarter variants of 2-level blocking, and choosing the right memory layout; we also use a novel algorithm with modified numerics. We analyze the performance of our implementations in detail to show that all components of the final system run at the nearly optimal rates.

Our best speedups vs. one quad core CPU are over 4× in all three factorizations.

The rest of this paper is organized as follows. Section 2 describes the architecture of the GPUs we used, highlighting the features common to vector architectures. Section 3 benchmarks operations including memory transfer, kernel start-up, and barriers, and uses these to analyze the performance of the panel factorization of LU. Section 4 discusses the design and performance evaluation of matrix multiplication. Section 5 discusses the design of LU, QR and Cholesky, and Section 6 evaluates their performance. Section 7 summarizes and describes future work.

2 GPU Architecture

In this work we are concerned with programming 8 series, 9 series, and 200 series of NVIDIA GPUs, as listed in Table 1. For the description of their architecture see the CUDA programming guide [NVIDIA 2008a], technical briefs [NVIDIA 2006; NVIDIA 2008b] and lecture slides in the course on programming GPUs at the University of Illinois, Urbana-Champaign [Hwu and Kirk 2007]. Additional insights can be found in decuda (1), which is a third-party disassembler of GPU binaries based on reverse-engineering of the native instruction set. The instruction set called PTX that was released by vendor is an abstraction that requires further compilation and so provides fewer insights.

GPU name             GeForce   GeForce   GeForce   GeForce
                     GTX280    9800GTX   8800GTX   8600GTS
# of vector cores    30        16        16        4
core clock, GHz      1.30      1.67      1.35      1.45
registers/core       64KB      32KB      32KB      32KB
smem/core            16KB      16KB      16KB      16KB
memory bus, GHz      1.1       1.1       0.9       1.0
memory bus, pins     512       256       384       128
bandwidth, GB/s      141       70        86        32
memory amount        1GB       512MB     768MB     256MB
SP, peak Gflop/s     624       429       346       93
SP, peak per core    21        27        22        23
SP, flops:word       18        25        16        12
DP, peak Gflop/s     78        -         -         -
DP, flops:word       4.4       -         -         -

Table 1: The list of the GPUs used in this study. SP is single precision and DP is double precision. Smem is shared memory. Peak flop rates are shown for multiply and add operations. Flops:word is the ratio of peak Gflop/s rate to pin-memory bandwidth in words.

2.1 Notation

The GPU programming model used in the CUDA programming environment [NVIDIA 2008a] borrows much from abstractions used in graphics, e.g. such as used in the DirectX and OpenGL standards. GPU programs are run as collections of scalar threads that run faster if they remain convergent in an SIMD fashion. Similarly, individual arithmetic pipelines that execute scalar instructions are exposed as individual processing cores. For example, the technical brief on the latest GPU [NVIDIA 2008b]

(1) http://www.cs.rug.nl/~wladimir/decuda/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SC2008 November 2008, Austin, Texas, USA 978-1-4244-2835-9/08 $25.00 ©2008 IEEE
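The flops:word rows of Table 1 follow directly from the peak rate and bandwidth rows. The short check below recomputes them, assuming a word is 4 bytes in single precision and 8 bytes in double precision (the caption does not spell out the word size, so that part is an assumption).

```python
# Peak single-precision Gflop/s and memory bandwidth in GB/s, from Table 1.
GPUS = {
    "GTX280": (624, 141),
    "9800GTX": (429, 70),
    "8800GTX": (346, 86),
    "8600GTS": (93, 32),
}


def flops_per_word(peak_gflops, bandwidth_gbs, bytes_per_word):
    """Arithmetic operations the GPU can perform per word fetched from memory."""
    words_per_second = bandwidth_gbs / bytes_per_word  # in Gwords/s
    return peak_gflops / words_per_second


for name, (peak, bandwidth) in GPUS.items():
    print(name, round(flops_per_word(peak, bandwidth, 4)))

# Double precision on GTX280: 78 Gflop/s peak, 8-byte words.
print(round(flops_per_word(78, 141, 8), 1))
```

A ratio of 18 means that a kernel on the GTX280 needs roughly 18 single-precision operations per word loaded from memory before arithmetic, rather than bandwidth, becomes the limit.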

Page 120: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Volkov and Demmel (SC08)


Page 121: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Volkov and Demmel (SC08)

Page 122: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)
Page 123: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

[Plot: Gflop/s (0 to 350) vs. order of matrix (64 to 16384) for QR, Cholesky and LU; annotated 78%, 49%, 51%]

Figure 7: Rates achieved in the factorizations, percents indicate the highest fraction of the system's peak (GPU+CPU or CPU only) achieved.

[Plot: speedup vs Core2 Quad (0.0 to 4.5) vs. order of matrix (64 to 16384) for QR, Cholesky and LU on GTX280 and 8800GTX; annotated 4.4× and 2.7×]

Figure 8: Speedup versus 3.0GHz Core2 Quad. Numbers on the right are the best speedups.

            Q6850     8800GTX+E6700        GTX280+E6700
            Gflop/s   Gflop/s   speedup    Gflop/s   speedup
LU          73        179       2.5×       309       4.1×
Cholesky    70        183       2.7×       315       4.4×
QR          75        192       2.6×       340       4.4×
SGEMM       88        208       2.4×       375       4.3×
peak        96        388       4.0×       667       6.9×

Table 4: Comparison of best Gflop/s rates in the CPU and GPU versions and best speedup vs. the CPU-alone versions. SGEMM rates for the GPU+CPU systems include GPU rates only.

[Plot: Gflop/s (0 to 550) vs. order of matrix (0 to 22500); annotated 538, 309, 298, 179]

Figure 9: Performance of one-GPU and two-GPU versions of the LU decomposition with best rates in Gflop/s shown on right.

storing the coarser blocks.

6 Results for Matrix Factorizations

For the results in this section we used a desktop system based on 2.67GHz Core2 Duo E6700 equipped with multiple PCIe 1.1 ×16 slots. For the results with one or two GeForce 8800GTX we used 32-bit Windows XP and CUDA 1.1. For the results with GeForce GTX280 we used 64-bit Windows XP and CUDA 2.0. CPU-only results were obtained on 3.0GHz Core2 Quad Q6850 running 64-bit Linux. In all cases the Intel MKL 10.0 library is used for factorizations on the CPU. We noted that it runs substantially slower in 32-bit. All results are in single precision.

Input and output data are in the pinned CPU memory, which provides a compromise between usefulness in applications (that are likely to run on the CPU) and performance (slower transfers to/from GPU if the data is in pageable memory). The cost of the memory allocation is not included in the timings.

Matrices are padded to an odd multiple of 64 words. This helps avoiding anomalous performance drops at some matrix sizes.

The correctness of the algorithms is tested in the following way. Input matrix $A$ is synthesized with random entries uniformly distributed in [-1,1] (to guarantee symmetric positive definiteness, $A = 0.001 \cdot I + X^T X$ is used instead in testing the Cholesky factorization, where $X$ is the random matrix as described above and $I$ is the identity matrix). Output factors are multiplied and max-norm of its difference with the input matrix is found. This measures the backward error in the factorization. We found that this error is about the same whether using our GPU-based algorithm or the purely CPU-based algorithm in the Intel MKL (always within a factor of 2, and within 20% in most cases). The variant of the LU factorization that multiplies by the inverses of the diagonal blocks of the triangular matrix has shown about same accuracy as when running triangular solves on the GPU. As an example, the errors as measured above in LU, QR and Cholesky at $n = 8192$ are about $2000 \cdot \varepsilon \cdot \|A\|_{\max}$, $200 \cdot \varepsilon \cdot \|A\|_{\max}$ and $17 \cdot \varepsilon \cdot \|A\|_{\max}$ resp., where $\varepsilon = 2^{-23}$ is machine epsilon in IEEE single precision and $\|A\|_{\max}$ is the max-norm of $A$.

6.1 Summary of Performance

Fig. 7 shows the Gflop/s rates sustained in the GPU-based matrix factorization routines and using Core2 Quad alone, and Fig. 8 details the speedups vs. Core2 Quad. According to the Figure, the crossover between the GPU-based and CPU-alone implementations is around $n = 1000$ for all but Cholesky run on GTX280, which is around $n = 600$. The best performances are summarized in Table 4. It shows that the speedup is nearly the same as the speedup in matrix-matrix multiply (SGEMM). However, difference in theoretical arithmetic peak rates is substantially higher highlighting that there are more computational resources available than we could harvest.

Fig. 9 shows the performance of the LU decomposition that achieves 538 Gflop/s at $n \approx 21{,}000$ by running two GPUs in parallel. Note that a single GTX 280 yields higher rates than two 8800 GTX. See the notes below on scaling.

6.2 Performance Analysis

Fig. 10 shows the breakdown of runtime in the LU factorization on 8800GTX. The breakdown shows that up to 90% of the runtime is consumed by computing on the GPU and about of 10% of this time overlaps with computing on the CPU. We expect the GPU part to be smaller when computing with faster GPUs producing better overlap at large matrix sizes. Time spent in the CPU-GPU transfers is substantial at small and medium sized
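The correctness test described for the Cholesky factorization is simple enough to sketch end to end. The code below is an illustration in plain Python, not the authors' test harness: it builds A = 0.001·I + XᵀX from a random X with entries in [-1, 1], factors it with a textbook Cholesky routine, and reports the max-norm of A − LLᵀ. Python floats are double precision, so the errors here are scaled by 2^-52 rather than the 2^-23 of the paper's single-precision runs.

```python
import random


def spd_test_matrix(n, seed=0):
    """A = 0.001*I + X^T X with X uniformly random in [-1, 1]."""
    rng = random.Random(seed)
    x = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    a = [[sum(x[p][i] * x[p][j] for p in range(n)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        a[i][i] += 0.001
    return a


def cholesky(a):
    """Return lower-triangular L with A = L L^T (textbook, unblocked)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = a[j][j] - sum(l[j][p] ** 2 for p in range(j))
        l[j][j] = diag ** 0.5
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][p] * l[j][p] for p in range(j))) / l[j][j]
    return l


def max_norm(m):
    return max(abs(value) for row in m for value in row)


def backward_error(a, l):
    """Max-norm of A - L L^T, the quantity measured in the paper's test."""
    n = len(a)
    diff = [[a[i][j] - sum(l[i][p] * l[j][p] for p in range(n)) for j in range(n)]
            for i in range(n)]
    return max_norm(diff)


A = spd_test_matrix(16)
L = cholesky(A)
relative_error = backward_error(A, L) / max_norm(A)
print(relative_error / 2.0 ** -52)  # error in units of machine epsilon
```

Dividing the error by ε·‖A‖max, as in the last line, gives a number directly comparable in spirit to the 17·ε·‖A‖max figure quoted for Cholesky, although the matrix size and precision differ.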

Page 124: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)


Page 125: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

!

!

!

"!

#!!

#"!

$!!

$"!

%!!

%"!

&' #$( $"& "#$ #!$' $!'( '!)& (#)$ #&%('

*+,-./0

123425-+567829:

;<=>-,40?@AB

C(D

')D

"#D

"#$%&'!()!*+,'-!+./#'0'1!#2!,/'!3+.,4&#5+,#42-6!7'&.'2,-!#21#.+,'!

,/'!/#$/'-,!3&+.,#42!43!,/'!-8-,'9:-!7'+;!<=>?@A>?!4&!A>?!

42B8C!+./#'0'1D!

!E!

!E"

#E!

#E"

$E!

$E"

%E!

%E"

'E!

'E"

&' #$( $"& "#$ #!$' $!'( '!)& (#)$ #&%('

F.443G.5H05=-24$5;G73

123425-+567829:

;<=>-,40?@AB

'E':

$EC:

*IJ$(!

((!!*IJ

"#$%&'!E)!F7''1%7!0'&-%-!GDH=I5!A4&'J!K%+1D!L%9M'&-!42!,/'!

&#$/,!+&'!,/'!M'-,!-7''1%7-D!!

!

!! KNEOH! EEHH=PQ@RN(HH! =PQJEH@RN(HH!

!! =3B47S-! =3B47S-! -7''1%7! =3B47S-! -7''1%7

T?! (G! U(V! JDOW! GHV! XDUW!

A/4B'-;8! (H! UEG! JD(W! GUO! XDXW!

K*! (O! UVJ! JDNW! GXH! XDXW!

F=RYY! EE! JHE! JDXW! G(O! XDGW!

7'+;! VN! GEE! XDHW! NN(! NDVW!

P+MB'!X)!A497+&#-42!43!M'-,!=3B47S-!&+,'-!#2!,/'!A>?!+21!=>?!0'&-#42-!+21!M'-,!-7''1%7!0-D!,/'!A>?Z+B42'!0'&-#42-D!F=RYY!&+,'-!34&!,/'!=>?@A>?!-8-,'9-!#2.B%1'!=>?!&+,'-!42B8D!

!

"!

#!!

#"!

$!!

$"!

%!!

%"!

'!!

'"!

"!!

""!

! $"!! "!!! C"!! #!!!! #$"!! #"!!! #C"!! $!!!! $$"!!

*+,-./0

123425-+567829:

"%(

%!) $)(

#C)

"#$%&'!V)!>'&34&9+2.'!43!42'Z=>?!+21!,[4Z=>?!0'&-#42-!43!

,/'!T?!1'.4974-#,#42![#,/!M'-,!&+,'-!#2!=3B47S-!-/4[2!42!&#$/,D

-,4&#2$!,/'!.4+&-'&!MB4.;-D!

!"#$%&'(%")*+",-(+./"0-1(*+.2-(.*3%"

"4&!,/'!&'-%B,-!#2!,/#-!-'.,#42!['!%-'1!+!1'-;,47!-8-,'9!M+-'1!42!JDN(=I5! A4&'J! \%4! RN(HH! ']%#77'1! [#,/! 9%B,#7B'! >A^'! UDU!!UN!-B4,-D!"4&!,/'!&'-%B,-![#,/!42'!4&!,[4!='"4&.'!EEHH=PQ!['!%-'1! GJZM#,!_#214[-!Q>! +21!A?\`!UDUD! "4&! ,/'! &'-%B,-![#,/!='"4&.'!=PQJEH!['!%-'1!NXZM#,!_#214[-!Q>!+21!A?\`!JDHD!A>?Z42B8!&'-%B,-!['&'!4M,+#2'1!42!GDH=I5!A4&'J!K%+1!KNEOH!&%22#2$!NXZM#,!T#2%aD!^2!+BB!.+-'-!,/'!^2,'B!YbT!UHDH!B#M&+&8!#-!%-'1! 34&! 3+.,4&#5+,#42-!42! ,/'!A>?D!_'!24,'1! ,/+,! #,! &%2-! -%MZ-,+2,#+BB8!-B4['&!#2!GJZM#,D!`BB!&'-%B,-!+&'!#2!-#2$B'!7&'.#-#42D!

^27%,!+21!4%,7%,!1+,+!+&'!#2!,/'!7#22'1!A>?!9'94&86![/#./!7&40#1'-!+!.497&49#-'!M',[''2!%-'3%B2'--!#2!+77B#.+,#42-!<,/+,!+&'!B#;'B8!,4!&%2!42!,/'!A>?C!+21!7'&34&9+2.'!<-B4['&!,&+2-3'&-!,4S3&49!=>?!#3!,/'!1+,+!#-!#2!7+$'+MB'!9'94&8CD!P/'!.4-,!43!,/'!9'94&8!+BB4.+,#42!#-!24,!#2.B%1'1!#2!,/'!,#9#2$-D!

Y+,&#.'-! +&'! 7+11'1! ,4! +2! 411!9%B,#7B'! 43! NX![4&1-D! P/#-!/'B7-! +04#1#2$! +249+B4%-! 7'&34&9+2.'! 1&47-! +,! -49'! 9+,&#a!-#5'-D!

The correctness of the algorithms is tested in the following way. Input matrix $A$ is synthesized with random entries uniformly distributed in $[-1,1]$ (to guarantee symmetric positive definiteness, $A = 0.001 \cdot I + X^T X$ is used instead in testing the Cholesky factorization, where $X$ is the random matrix as described above and $I$ is the identity matrix). Output factors are multiplied and max-norm of its difference with the input matrix is found. This measures the backward error in the factorization. We found that this error is about the same whether using our GPU-based algorithm or the purely CPU-based algorithm in the Intel MKL (always within a factor of 2, and within 20% in most cases). The variant of the LU factorization that multiplies by the inverses of the diagonal blocks of the triangular matrix has shown about same accuracy as when running triangular solves on the GPU. As an example, the errors as measured above in LU, QR and Cholesky at $n = 8192$ are about $2000\,\varepsilon\,\|A\|_{\max}$, $200\,\varepsilon\,\|A\|_{\max}$ and $17\,\varepsilon\,\|A\|_{\max}$ resp., where $\varepsilon = 2^{-23}$ is machine epsilon in IEEE single precision and $\|A\|_{\max}$ is the max-norm of $A$.
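The backward-error measurement described above can be sketched as follows for LU without pivoting. This is a minimal sketch, not the authors' test harness: the function names, the row-major layout and the absence of a permutation matrix are all assumptions made to keep the example short.

```c
#include <float.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* Max-norm (largest absolute entry) of an n x n row-major matrix. */
static float max_norm(const float *m, size_t n)
{
    float best = 0.0f;
    for (size_t i = 0; i < n * n; ++i)
        if (absf(m[i]) > best) best = absf(m[i]);
    return best;
}

/* Returns ||A - L*U||_max / (eps * ||A||_max) with eps = FLT_EPSILON = 2^-23,
   i.e. the backward error expressed in the units used in the text.
   Assumes A is nonzero and that no pivoting was applied. */
float lu_backward_error(const float *A, const float *L, const float *U, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double acc = 0.0;  /* accumulate in double so the check adds little error */
            for (size_t k = 0; k < n; ++k)
                acc += (double)L[i * n + k] * (double)U[k * n + j];
            float diff = absf((float)(acc - (double)A[i * n + j]));
            if (diff > worst) worst = diff;
        }
    return worst / (FLT_EPSILON * max_norm(A, n));
}
```

A return value of about 2000 would correspond to the LU figure quoted above for n = 8192.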

7.1 Summary of Performance

Fig. 7 shows the Gflop/s rates sustained in the GPU-based matrix factorization routines and using Core2 Quad alone, and Fig. 8 details the speedups vs. Core2 Quad. According to the Figure, the crossover between the GPU-based and CPU-alone implementations is around n = 1000 for all but Cholesky run on GTX280, which is around n = 600. The best performances are summarized in Table 4. It shows that the speedup is nearly the same as the speedup in matrix-matrix multiply (SGEMM). However, the difference in theoretical arithmetic peak rates is substantially higher, highlighting that there are more computational resources available than we could harvest.

Fig. 9 shows the performance of the LU decomposition that achieves 538 Gflop/s at n ≈ 21,000 by running two GPUs in parallel. Note that a single GTX 280 yields higher rates than two 8800 GTX. See the notes below on scaling.

7.2 Performance Analysis

Fig. 10 shows the breakdown of runtime in the LU factorization on 8800GTX. The breakdown shows that up to 90% of the runtime is consumed by computing on the GPU and about 10% of this time overlaps with computing on the CPU. We expect the GPU part to be smaller when computing with faster GPUs, producing better overlap at large matrix sizes. Time spent in the CPU-GPU transfers is substantial at small and medium sized

Page 126: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Figure 7: Rates achieved in the factorizations, percents indicate the highest fraction of the system's peak (GPU+CPU or CPU only) achieved. (Plot: Gflop/s vs. order of matrix; curves for QR, Cholesky and LU.)

Figure 8: Speedup versus 3.0GHz Core2 Quad. Numbers on the right are the best speedups. (Plot: speedup vs. Core2 Quad against order of matrix, for GTX280 and 8800GTX; best speedups shown: 4.4× and 2.7×.)

                Q6850     8800GTX+E6700        GTX280+E6700
                Gflop/s   Gflop/s   speedup    Gflop/s   speedup
LU              73        179       2.5×       309       4.1×
Cholesky        70        183       2.7×       315       4.4×
QR              75        192       2.6×       340       4.4×
SGEMM           88        208       2.4×       375       4.3×
peak            96        388       4.0×       667       6.9×

Table 4: Comparison of best Gflop/s rates in the CPU and GPU versions and best speedup vs. the CPU-alone versions. SGEMM rates for the GPU+CPU systems include GPU rates only.

Figure 9: Performance of one-GPU and two-GPU versions of the LU decomposition with best rates in Gflop/s shown on right. (Plot: Gflop/s vs. order of matrix; best rates shown: 538, 309, 298 and 179 Gflop/s.)


Page 127: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Figure 10: The breakdown of time in the LU decomposition run on GeForce 8800 GTX. (Plot: percentage of time vs. order of matrix; components: CPU-GPU transfer, transpose, look-ahead, CPU and GPU overlap, GPU, CPU.)

Figure 11: Slowdown when omitting one of the optimizations used when running on GeForce GTX 280. (Plot: slowdown vs. order of matrix; curves: overlap of CPU and GPU, transpose matrix, TRSM via GEMM, batch pivoting.)

matrices and should be improved with the newer PCIe interconnect supported by the newer GPUs and motherboards. Time spent in transposing the matrices is not substantial. Individual measurements have shown that transpose runs at 25-45 GB/s for n > 1000. This variation in bandwidth is due to the moderate granularity of this operation. For example, it takes ~7μs to copy or transpose a 1024×64 matrix at the peak sustained bandwidth of 76 GB/s, which is close to the kernel launch overhead. CPU-GPU transfers run at 3.0-3.3 GB/s for n > 1000, which approaches the peak sustained rate.

Fig. 11 evaluates the impacts of different optimizations used when computing on GTX280. The most important optimization was using row-major layout on the GPU, which nearly doubled the performance at large problem sizes. Individual measurements have shown that pivoting takes 1-10% of time in the entire computation for n > 1500 if done in the row-major layout. In that case it achieves 7-25 GB/s of effective bandwidth. When using column-major layout, it takes 27-50% of the total time and runs at 0.3-1.8 GB/s, with slower rates for larger matrices.

A surprisingly large speedup (up to 30%) was obtained by performing triangular solve via multiplying by the inverse matrix. Triangular solve with a 64×64 triangular matrix and 8192 right hand sides runs at 18 Gflop/s on GTX280 when using CUBLAS 2.0. It is an order of magnitude slower than the 268 Gflop/s rate achieved in multiplying a 64×64 matrix by a 64×8192 matrix that does the same work (this is 134 Gflop/s if not counting the redundant work).

Amortization of kernel launch overhead due to batch pivoting yields 30-100% speedup at n < 1024. Effect of all optimizations decreases at larger problem sizes, where time is dominated by matrix-matrix multiplies. Rates in these multiplies are affected by using 2-level schemes in LU and Cholesky and using autotuning to choose block size in QR. These techniques gave up to 4-7% speedup and factored in only for n > 4096.

According to Fig. 9, using two 8800GTX yields only 67% improvement in the peak Gflop/s rate. This result corresponds to pre-allocating pinned memory in the master CPU thread before GPU contexts are created in the child CPU threads. As a result, all transfers run at a small fraction of the peak PCIe bandwidth as if the memory was not pinned. Higher improvement of 74% when using two GTX280 corresponds to allocating pinned memory in one of the child CPU threads after the GPU contexts are attached. This memory is used to store the CPU's copy of the matrix, i.e. both the input and output data of the routine. This allows running transfers at full bandwidth to one of the GPUs. There are other reasons for less than ideal scaling, such as extra CPU-GPU bandwidth consumption, lack of 2-level blocking and not scaling the CPU side of the system.
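The ~7μs figure quoted above for copying or transposing a 1024×64 single-precision matrix at 76 GB/s can be checked with a few lines of C. The sketch assumes that a copy touches every element twice (one read, one write) and that 1 GB means 10^9 bytes; the function name is illustrative.

```c
/* Time in microseconds to read and write a rows x cols float matrix once
   at a given sustained bandwidth (GB/s, with 1 GB = 1e9 bytes). */
double copy_time_us(long rows, long cols, double gb_per_s)
{
    double bytes = 2.0 * rows * cols * sizeof(float);  /* one read + one write */
    return bytes / (gb_per_s * 1e9) * 1e6;
}
```

With these assumptions `copy_time_us(1024, 64, 76.0)` comes out just under 7 microseconds, which is why such a small operation is comparable in cost to a kernel launch.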

7.3 Comparison with Other Work

The first implementation of the LU factorization using GPUs that we know of was published by Galoppo et al. [2005] and ran at up to ~10 Gflop/s for n = 4000 without pivoting and at ~6 Gflop/s for n = 3500 with partial pivoting on the older GeForce 7800. They use a non-blocked algorithm that is bandwidth and/or overhead bound. Scaling these numbers with bandwidth gives up to 26 Gflop/s on GTX280, an order of magnitude less than in our implementation. Our solution works faster due to large blocking enabled by shared memory. Our high performance when pivoting is enabled by the high-bandwidth access to linear address space available on modern GPUs.

Barrachina et al. [2008] report 50 Gflop/s in LU factorization and 41 Gflop/s in Cholesky factorization for n = 5000 using CUBLAS 1.0 on GeForce 8800 Ultra. Our implementation achieves 2.9× and 3.7× higher speed for LU and Cholesky resp. on the slightly slower 8800GTX. This is due to our improved matrix-matrix multiply routine and the optimizations evaluated above.

Baboulin et al. [2008] describe an implementation of LU and QR algorithms that run at up to ≈55 Gflop/s on Quadro FX5600 for n ≈ 19,000 using CUBLAS 1.0. This GPU has slightly slower memory than 8800GTX and is otherwise similar. Their implementation of Cholesky runs at up to 90 Gflop/s if using CUBLAS and approaches 160 Gflop/s if using an early version of the matrix multiply described in this paper and offloading BLAS1/BLAS2 operations to the CPU. Our implementation achieves higher rates due to a deeper performance analysis and tuning.

Castillo et al. [2008] report results for Cholesky factorization run on 4-GPU NVIDIA Tesla S870. Each of these GPUs is similar to Quadro FX5600 described above. Authors report 180 Gflop/s on a system at n ≈ 10,000. We achieve this performance using a single 8800GTX. Their result was later improved to 424 Gflop/s at n ≈ 20,000 by using the matrix multiply routine presented in this paper [Quintana-Orti et al. 2008].

8 Conclusions

We have presented the fastest (so far) implementations of dense LU, QR and Cholesky factorizations running on one or two NVIDIA GPUs. Based on our performance benchmarking and modeling, they attain 80%-90% of the peak speeds possible for large matrices. This speed was achieved by carefully choosing optimizations to match the capabilities of the hardware...

Page 128: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUFFT Example

IAP09 CUDA@MIT / 6.963

Page 129: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

CUDA Example: Fourier-spectral Poisson Solver

Solve a Poisson equation on a rectangular domain with

periodic boundary conditions using a Fourier-spectral

method.

This example will show how to use the FFT library, transfer

the data to/from GPU and perform simple computations on

the GPU.

Page 130: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Mathematical background

$\nabla^2 u = r \;\xrightarrow{\text{FFT}}\; -(k_x^2 + k_y^2)\,\hat{u} = \hat{r}$

1. Apply 2D forward FFT to r to obtain r(k), where k is the

wave number

2. Apply the inverse of the Laplace operator to r(k) to obtain

u(k): simple element-wise division in Fourier space

3. Apply 2D inverse FFT to u(k) to obtain u

$\hat{u} = -\dfrac{\hat{r}}{k_x^2 + k_y^2}$
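The element-wise division in step 2 follows from writing the periodic solution as a Fourier series. The short derivation below is a sketch under the usual assumptions (periodic domain, u and r smooth enough to differentiate term by term); the symbols match the steps above, with hats denoting Fourier coefficients.

```latex
\[
u(x,y) = \sum_{k_x,k_y} \hat{u}(k_x,k_y)\, e^{i(k_x x + k_y y)}
\;\Longrightarrow\;
\nabla^2 u = \sum_{k_x,k_y} -(k_x^2 + k_y^2)\,\hat{u}(k_x,k_y)\, e^{i(k_x x + k_y y)}
\]
\[
\nabla^2 u = r
\;\Longrightarrow\;
-(k_x^2 + k_y^2)\,\hat{u} = \hat{r}
\;\Longrightarrow\;
\hat{u} = -\frac{\hat{r}}{k_x^2 + k_y^2},
\qquad (k_x,k_y) \neq (0,0)
\]
```

The mode (0,0) is left undetermined because u is only defined up to an additive constant, which is why the code that follows special-cases it instead of dividing by zero.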

Page 131: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Reference MATLAB implementation

% No. of Fourier modes
N = 64;
% Domain size (assumed square)
L = 1;
% Characteristic width of f (make << 1)
sig = 0.1;
% Vector of wavenumbers
k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];
% Matrix of (x,y) wavenumbers corresponding
% to Fourier mode (m,n)
[KX KY] = meshgrid(k,k);
% Laplacian matrix acting on the wavenumbers
delsq = -(KX.^2 + KY.^2);
% Kludge to avoid division by zero for
% wavenumber (0,0).
% (this waveno. of fhat should be zero anyway!)
delsq(1,1) = 1;
% Grid spacing
h = L/N;
x = (0:(N-1))*h;
y = (0:(N-1))*h;
[X Y] = meshgrid(x,y);
% Construct RHS f(x,y) at the Fourier gridpoints
rsq = (X-0.5*L).^2 + (Y-0.5*L).^2;
sigsq = sig^2;
f = exp(-rsq/(2*sigsq)).*...
    (rsq - 2*sigsq)/(sigsq^2);
% Spectral inversion of Laplacian
fhat = fft2(f);
u = real(ifft2(fhat./delsq));
% Specify arbitrary constant by forcing corner
% u = 0.
u = u - u(1,1);
% Compute L2 and Linf norm of error
uex = exp(-rsq/(2*sigsq));
errmax = norm(u(:)-uex(:),inf);
errmax2 = norm(u(:)-uex(:),2)/(N*N);
% Print L2 and Linf norm of error
fprintf('N=%d\n',N);
fprintf('Solution at (%d,%d): ',N/2,N/2);
fprintf('computed=%10.6f reference = %10.6f\n',u(N/2,N/2),uex(N/2,N/2));
fprintf('Linf err=%10.6e L2 norm err = %10.6e\n',errmax, errmax2);

http://www.atmos.washington.edu/2005Q2/581/matlab/pois_FFT.m
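The wavenumber vector built by the MATLAB line `k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)]` has to be produced on the host before it can be copied to the GPU. A plain C sketch of that step might look like this; the function name `fill_wavenumbers` is hypothetical and N is assumed to be even.

```c
/* Fill k[0..N-1] with FFT-ordered wavenumbers, mirroring the MATLAB line
   k = (2*pi/L)*[0:(N/2-1) (-N/2):(-1)];  N is assumed to be even. */
void fill_wavenumbers(float *k, int N, float L)
{
    const float two_pi = 6.28318530717958647692f;
    for (int i = 0; i < N; ++i) {
        int mode = (i < N / 2) ? i : i - N;   /* 0..N/2-1, then -N/2..-1 */
        k[i] = (two_pi / L) * (float)mode;
    }
}
```

For a square domain the same array can serve as both kx and ky, as in the MATLAB call `meshgrid(k,k)`.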

Page 132: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Implementation steps

The following steps need to be performed:

1. Allocate memory on host: r (NxN), u (NxN), kx (N) and ky (N)
2. Allocate memory on device: r_d, u_d, kx_d, ky_d
3. Transfer r, kx and ky from host memory to the corresponding arrays in device memory
4. Initialize plan for FFT
5. Compute execution configuration
6. Transform real input to complex input
7. 2D forward FFT
8. Solve Poisson equation in Fourier space
9. 2D inverse FFT
10. Transform complex output to real output
11. Transfer results from the GPU back to the host

We are not taking advantage of the symmetries (C2C transform for real data) to keep the code simple.

Page 133: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Solution walk-through (steps 1-2)

/* Allocate arrays on the host */
float *kx, *ky, *r;
kx = (float *) malloc(sizeof(float)*N);
ky = (float *) malloc(sizeof(float)*N);
r = (float *) malloc(sizeof(float)*N*N);

/* Allocate arrays on the GPU with cudaMalloc
   (kx_d, ky_d and r_d hold floats, so sizeof(float) is enough) */
float *kx_d, *ky_d, *r_d;
cudaMalloc( (void **) &kx_d, sizeof(float)*N);
cudaMalloc( (void **) &ky_d, sizeof(float)*N);
cudaMalloc( (void **) &r_d , sizeof(float)*N*N);

/* Complex work array used by the FFTs */
cufftComplex *r_complex_d;
cudaMalloc( (void **) &r_complex_d, sizeof(cufftComplex)*N*N);

Page 134: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Code walk-through (steps 3-4)

/* Initialize r, kx and ky on the host */

……………

/*Transfer data from host to device with

cudaMemcpy(target, source, size, direction)*/

cudaMemcpy (kx_d, kx, sizeof(float)*N , cudaMemcpyHostToDevice);

cudaMemcpy (ky_d, ky, sizeof(float)*N , cudaMemcpyHostToDevice);

cudaMemcpy (r_d , r , sizeof(float)*N*N, cudaMemcpyHostToDevice);

/* Create plan for CUDA FFT (interface similar to FFTW) */

cufftHandle plan;

cufftPlan2d( &plan, N, N, CUFFT_C2C);

Page 135: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Code walk-through (step 5)

/* Compute the execution configuration
   NB: block_size_x*block_size_y = number of threads per block
   On G80 the number of threads per block must be <= 512 */
dim3 dimBlock(block_size_x, block_size_y);
dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);

/* Handle N not multiple of block_size_x or block_size_y */
if (N % block_size_x != 0) dimGrid.x += 1;
if (N % block_size_y != 0) dimGrid.y += 1;

(Diagram: the N x N domain tiled by thread blocks of Block_size_x by Block_size_y.)
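The divide-then-increment logic in step 5 is an integer ceiling division. A compact host-side equivalent, shown here as a sketch with a hypothetical helper name, is:

```c
/* Number of blocks needed to cover n elements with blocks of block_size
   threads; gives the same result as dividing and then adding 1 when
   n is not a multiple of block_size. Assumes block_size > 0. */
unsigned int blocks_for(unsigned int n, unsigned int block_size)
{
    return (n + block_size - 1) / block_size;
}
```

For N = 64 with 32 x 16 thread blocks this gives a 2 x 4 grid, matching the `dimGrid 2 4` line in the example run shown later in the deck. When N is not a multiple of the block size, the last blocks contain threads that fall outside the array, which is why the kernels guard with `idx < N && idy < N`.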

Page 136: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Code walk-through (steps 6-10)

/* Transform real input to complex input */

real2complex<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N);

/* Compute in place forward FFT */

cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_FORWARD);

/* Solve Poisson equation in Fourier space */

solve_poisson<<<dimGrid, dimBlock>>> (r_complex_d, kx_d, ky_d,N);

/* Compute in place inverse FFT */

cufftExecC2C (plan, r_complex_d, r_complex_d, CUFFT_INVERSE);

/* Copy the solution back to a real array and apply scaling (an FFT followed by an iFFT will give you back the same array times the length of the transform) */

scale = 1.f / ( (float) N * (float) N );

complex2real_scaled<<<dimGrid, dimBlock>>> (r_d, r_complex_d, N, scale);
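The scaling comment above can be illustrated without a GPU. The sketch below is a 1D, 4-point analogue written in plain C, not CUFFT itself: it applies an unnormalized forward transform followed by an unnormalized inverse and recovers the input times the transform length. The `cplx` type and `dft4` function are stand-ins invented for this example.

```c
typedef struct { float x, y; } cplx;   /* stand-in with .x/.y like cufftComplex */

/* Unnormalized 4-point DFT. sign = -1 is the forward transform,
   sign = +1 the inverse. For N = 4 the twiddle factors are exact. */
void dft4(const cplx in[4], cplx out[4], int sign)
{
    /* powers of e^{+2*pi*i/4} = i: 1, i, -1, -i */
    const float re[4] = { 1.0f, 0.0f, -1.0f,  0.0f };
    const float im[4] = { 0.0f, 1.0f,  0.0f, -1.0f };
    for (int k = 0; k < 4; ++k) {
        out[k].x = 0.0f;
        out[k].y = 0.0f;
        for (int n = 0; n < 4; ++n) {
            int p = (k * n) % 4;
            float wr = re[p];
            float wi = (float)sign * im[p];   /* conjugate for the forward pass */
            out[k].x += in[n].x * wr - in[n].y * wi;
            out[k].y += in[n].x * wi + in[n].y * wr;
        }
    }
}
```

Running `dft4` forward and then inverse on {1, 2, 3, 4} yields {4, 8, 12, 16}, so a factor of 1/4 restores the input. In the 2D N x N case the same argument gives a factor of N*N, hence `scale = 1/(N*N)` in the code above.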

Page 137: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Code walk-through (step 11)

/*Transfer data from device to host with

cudaMemcpy(target, source, size, direction)*/

cudaMemcpy (r , r_d , sizeof(float)*N*N, cudaMemcpyDeviceToHost);

/* Destroy plan and clean up memory on device*/

cufftDestroy( plan);

cudaFree(r_complex_d);

…….

cudaFree(kx_d);

Page 138: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

real2complex

/* Copy real data to complex data */

__global__ void real2complex (float *a, cufftComplex *c, int N)

{

/* compute idx and idy, the location of the element in the original NxN array */

int idx = blockIdx.x*blockDim.x+threadIdx.x;

int idy = blockIdx.y*blockDim.y+threadIdx.y;

if ( idx < N && idy <N)

{

int index = idx + idy*N;

c[index].x = a[index];

c[index].y = 0.f;

}

}

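To see how the block and thread indices in `real2complex` cover the array, the kernel can be emulated on the CPU with nested loops. This is only a sketch for reasoning about the index arithmetic: `real2complex_host`, the `cplx` type and the block-size parameters `bx` and `by` are hypothetical stand-ins, and the loops run sequentially rather than in parallel.

```c
typedef struct { float x, y; } cplx;   /* stand-in for cufftComplex */

/* CPU stand-in for the kernel: the two outer loops play the role of the
   grid of blocks, the two inner loops the threads within one block. */
void real2complex_host(const float *a, cplx *c, int N, int bx, int by)
{
    int gx = (N + bx - 1) / bx;   /* blocks along x */
    int gy = (N + by - 1) / by;   /* blocks along y */
    for (int blk_y = 0; blk_y < gy; ++blk_y)
    for (int blk_x = 0; blk_x < gx; ++blk_x)
    for (int ty = 0; ty < by; ++ty)
    for (int tx = 0; tx < bx; ++tx) {
        int idx = blk_x * bx + tx;
        int idy = blk_y * by + ty;
        if (idx < N && idy < N) {          /* same guard as the kernel */
            int index = idx + idy * N;
            c[index].x = a[index];
            c[index].y = 0.0f;
        }
    }
}
```

With N = 5 and 4 x 2 blocks, some emulated threads fall outside the array; the guard skips them, and every one of the 25 elements is still written exactly once.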

Page 139: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

solve_poisson (with shared memory)

__global__ void solve_poisson (cufftComplex *c, float *kx, float *ky, int N)
{
    unsigned int idx = __umul24(blockIdx.x,blockDim.x)+threadIdx.x;
    unsigned int idy = __umul24(blockIdx.y,blockDim.y)+threadIdx.y;

    // use shared memory to minimize multiple access to same k values
    // (BLOCK_WIDTH and BLOCK_HEIGHT must match blockDim.x and blockDim.y)
    __shared__ float kx_s[BLOCK_WIDTH], ky_s[BLOCK_HEIGHT];
    // the first row of threads loads kx, the first column loads ky
    if (threadIdx.y < 1 && idx < N) kx_s[threadIdx.x] = kx[idx];
    if (threadIdx.x < 1 && idy < N) ky_s[threadIdx.y] = ky[idy];
    __syncthreads();

    if ( idx < N && idy < N)
    {
        unsigned int index = idx + __umul24(idy ,N);
        float scale = - ( kx_s[threadIdx.x]*kx_s[threadIdx.x]
                        + ky_s[threadIdx.y]*ky_s[threadIdx.y] );
        // wavenumber (0,0): avoid division by zero
        if ( idx == 0 && idy == 0 ) scale = 1.f;
        scale = 1.f / scale;
        c[index].x *= scale;
        c[index].y *= scale;
    }
}

$\hat{u} = -\dfrac{\hat{r}}{k_x^2 + k_y^2}$
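A plain CPU reference for the same element-wise division is handy for checking the kernel's output on small inputs. The sketch below is an assumption-laden stand-in rather than part of the original example: `solve_poisson_host` and the `cplx` type are hypothetical names, and the array is taken to be row-major with `index = idx + idy*N`, as in the kernel.

```c
typedef struct { float x, y; } cplx;   /* stand-in for cufftComplex */

/* CPU reference: c holds r-hat on entry and u-hat on exit (row-major, N x N). */
void solve_poisson_host(cplx *c, const float *kx, const float *ky, int N)
{
    for (int idy = 0; idy < N; ++idy)
        for (int idx = 0; idx < N; ++idx) {
            float scale = -(kx[idx] * kx[idx] + ky[idy] * ky[idy]);
            if (idx == 0 && idy == 0) scale = 1.0f;   /* avoid dividing by zero */
            scale = 1.0f / scale;
            c[idx + idy * N].x *= scale;
            c[idx + idy * N].y *= scale;
        }
}
```

Comparing this loop against the GPU result element by element (within a small tolerance) is one simple way to catch indexing mistakes in the shared-memory version.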

Page 140: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Compile and run poisson

Compile the example poisson.cu:
nvcc -O3 -o poisson poisson.cu \
    -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft \
    -I/usr/local/NVIDIA_CUDA_SDK/common/inc \
    -L/usr/local/NVIDIA_CUDA_SDK/lib -lcutil

Run the example:
./poisson -N64

Poisson solver on a domain 64 x 64

dimBlock 32 16 (512 threads)

dimGrid 2 4

L2 error 9.436995e-08:

Time 0.000569:

Time I/O 0.000200 (0.000136 + 0.000064):

Solution at (32,32)

computed=0.975879 reference=0.975882

Reference values from MATLAB: N=64

Solution at (32,32): computed= 0.975879 reference= 0.975882

Linf err=2.404194e-05 L2 norm err = 9.412790e-08

Page 141: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Misc

IAP09 CUDA@MIT / 6.963

Page 142: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Tesla C1060 Computing Processor

Processor:        1x Tesla T10P
Core GHz:         1.33 GHz
Form factor:      Full ATX: 4.736” (H) x 10.5” (L), dual slot wide
On-board memory:  4 GB
System I/O:       PCIe x16 gen2
Memory I/O:       512-bit, 800MHz DDR; 102 GB/s peak bandwidth
Display outputs:  None
Typical power:    160 W
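The 102 GB/s peak bandwidth follows from the bus width and memory clock listed above. A small C sketch of the arithmetic, assuming two transfers per clock (double data rate) and 1 GB = 10^9 bytes; the function name is illustrative:

```c
/* Peak memory bandwidth in GB/s (1 GB = 1e9 bytes) for a DDR interface:
   bus width in bits, memory clock in MHz, two transfers per clock. */
double ddr_peak_gb_per_s(int bus_bits, double clock_mhz)
{
    double bytes_per_transfer = bus_bits / 8.0;
    return bytes_per_transfer * clock_mhz * 1e6 * 2.0 / 1e9;
}
```

For a 512-bit bus at 800 MHz this gives about 102.4 GB/s, which the slide rounds to 102 GB/s.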

Page 143: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Tesla S1070 1U System

Processors:                4 x Tesla T10P
Core GHz:                  1.5 GHz
Form factor:               1U for an EIA 19” 4-post rack
Total 1U system memory:    16 GB (4.0GB per GPU)
System I/O:                2 PCIe x16
Memory I/O per processor:  512-bit, 800MHz GDDR; 102 GB/s peak bandwidth
Display outputs:           None
Typical power:             700 W
Chassis dimensions:        1.73” H × 17.5” W × 28.5” D

Page 144: IAP09 CUDA@MIT 6.963 - Lecture 04: CUDA Advanced #1 (Nicolas Pinto, MIT)

Double Precision Floating Point

Feature | NVIDIA GPU | SSE2 | Cell SPE
Precision | IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL | All 4 IEEE, round to nearest, zero, inf, -inf | All 4 IEEE, round to nearest, zero, inf, -inf | Round to zero/truncate only
Denormal handling | Full speed | Supported, costs 1000’s of cycles | Flush to zero
NaN support | Yes | Yes | No
Overflow and Infinity support | Yes | Yes | No infinity, clamps to max norm
Flags | No | Yes | Some
FMA | Yes | No | Yes
Square root | Software with low-latency FMA-based convergence | Hardware | Software only
Division | Software with low-latency FMA-based convergence | Hardware | Software only
Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit | No | No
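The "software with FMA-based convergence" entries for division and square root refer to iterative refinement of a hardware estimate. The sketch below shows the general idea for a reciprocal using Newton-Raphson in plain C. It is an illustration of the technique only, not NVIDIA's actual routine: the function name, the starting estimate and the step count are assumptions, and ordinary multiply and add are used where hardware would issue fused multiply-adds.

```c
/* Refine an estimate r of 1/d with Newton-Raphson: r <- r + r*(1 - d*r).
   Each step consists of two multiply-add operations, which is the shape
   of work a fused multiply-add unit handles well. The relative error is
   squared at every step, so a few steps suffice from a decent estimate. */
double reciprocal_newton(double d, double r0, int steps)
{
    double r = r0;
    for (int i = 0; i < steps; ++i) {
        double e = 1.0 - d * r;   /* residual: one multiply-add */
        r = r + r * e;            /* correction: one multiply-add */
    }
    return r;
}
```

Starting from 0.3 as an estimate of 1/3, the residual shrinks from 0.1 to 0.01, then 1e-4, then 1e-8, so a handful of steps reaches double-precision accuracy. A quotient a/d can then be formed as a times the refined reciprocal, usually followed by one more correction step.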