CUDA Advanced Memory Usage and Optimization
Yukai Hung
[email protected]
Department of Mathematics
National Taiwan University
Register as Cache?

Volatile Qualifier
//array is assumed to be declared in shared or global memory
__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x];
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x];   //identical read: the compiler may optimize this read away
        result[threadIdx.x]=temp1*temp2;
    }
}
Volatile Qualifier
__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        int temp=array[threadIdx.x];   //single read kept in a register
        temp1=temp;
        array[threadIdx.x+1]=2;
        temp2=temp;                    //reuses the cached register value
        result[threadIdx.x]=temp1*temp2;
    }
}
Volatile Qualifier
__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        __syncthreads();   //barrier forces the second read back to memory

        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}
Volatile Qualifier
__global__ void kernelFunc(int* result)
{
    volatile int temp1;
    volatile int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x]*2;   //volatile forbids caching the first read
        result[threadIdx.x]=temp1*temp2;
    }
}
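A more common idiom (not shown on the slides) places volatile on the shared array itself rather than on the temporaries, so every access goes back to shared memory; a minimal sketch assuming a 64-thread block:

```cuda
__global__ void kernelFunc(int* result)
{
    //declaring the shared array volatile forbids the compiler from
    //caching its elements in registers across the intervening write
    __shared__ volatile int array[64+1];

    array[threadIdx.x]=threadIdx.x;
    __syncthreads();

    if(threadIdx.x<warpSize)
    {
        int temp1=array[threadIdx.x];   //first read from shared memory
        array[threadIdx.x+1]=2;
        int temp2=array[threadIdx.x];   //re-read is not optimized away
        result[threadIdx.x]=temp1*temp2;
    }
}
```

This is the usual pattern in warp-synchronous code, where one warp's writes must be visible to later reads without a full barrier.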
Data Prefetch

Hide memory latency by overlapping loading and computing
 - double buffering is a traditional software pipeline technique
[Figure: tiled matrix multiplication Md x Nd -> Pd (tile Pdsub): load a block into shared memory, then compute on that block while loading the next block into shared memory]
Data Prefetch
for loop
{
    load data from global to shared memory
    synchronize block
    compute data in the shared memory
    synchronize block
}
Data Prefetch
load data from global memory to registers
for loop
{
    store data from registers to shared memory
    synchronize block
    load next data from global memory to registers
    compute data in the shared memory
    synchronize block
}
very small overhead: both memories are very fast
computing and loading overlap: registers and shared memory are independent
Data Prefetch

Matrix-matrix multiplication
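The pipeline above, applied to tiled matrix multiplication, can be sketched as follows (not the slides' code; the TILE size is illustrative, and only the Md tile is staged through shared memory to keep the sketch short):

```cuda
#define TILE 16

__global__ void matmulPrefetch(float* Md,float* Nd,float* Pd,int width)
{
    __shared__ float tile[TILE][TILE];

    int row=blockIdx.y*TILE+threadIdx.y;
    int col=blockIdx.x*TILE+threadIdx.x;
    float sum=0.0f;

    //prefetch the first element of Md from global memory into a register
    float reg=Md[row*width+threadIdx.x];

    for(int m=0;m<width/TILE;m++)
    {
        //store the prefetched element into shared memory
        tile[threadIdx.y][threadIdx.x]=reg;
        __syncthreads();

        //prefetch the next tile while computing on the current one
        if(m+1<width/TILE)
            reg=Md[row*width+(m+1)*TILE+threadIdx.x];

        //compute on the current tile in shared memory
        for(int k=0;k<TILE;k++)
            sum+=tile[threadIdx.y][k]*Nd[(m*TILE+k)*width+col];
        __syncthreads();
    }

    Pd[row*width+col]=sum;
}
```

The global load for the next tile is issued before the inner product loop, so the memory latency overlaps with the computation on the current tile.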
Constant Memory
Where is constant memory?
 - data is stored in the device global memory
 - read data through the multiprocessor constant cache
 - 64KB constant memory and 8KB cache for each multiprocessor

How about the performance?
 - optimized when a warp of threads reads the same location
 - 4 bytes per cycle through broadcasting to a warp of threads
 - serialized when a warp of threads reads different locations
 - very slow on a cache miss (read data from global memory)
 - access latency can range from one to hundreds of clock cycles
Constant Memory
How to use constant memory?
 - declare constant memory at file scope (global variable)
 - copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);
Constant Memory
//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
    int size=3200;
    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);
Constant Memory
    //execute device kernel
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);

    return 0;
}

__global__ void test_kernel(float* darray)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];

    return;
}
Texture Memory
Texture mapping
Texture Memory

Texture filtering
 - nearest-neighbor interpolation
 - linear/bilinear/trilinear interpolation
 - trilinear interpolation: two bilinear interpolations combined
Texture Memory

[Figure: G80 GPU block diagram - host, input assembler, vertex/pixel thread issue, work distribution, streaming processors (SP) grouped with L1 caches and texture filtering (TF) units, shared L2 caches and frame buffer (FB) partitions; these units perform graphical texture operations]
Texture Memory
two SMs cooperate as a texture processing cluster (TPC), the scalable unit on graphics hardware

texture specific unit, only available for texture
Texture Memory
texture specific units
 - texture address units compute texture addresses
 - texture filtering units compute data interpolation
 - read-only texture L1 cache
Texture Memory
[Figure: same G80 block diagram, highlighting the texture caches - a read-only texture L2 cache shared by all TPCs, and a read-only texture L1 cache for each TPC]
Texture Memory
texture specific units

Texture Memory
Texture is an object for reading data
 - data is stored on the device global memory
 - global memory is bound with texture cache
[Figure: GPU block diagram with the frame buffer (FB) partitions marked as global memory]
What are the advantages of texture?

Texture Memory
Data caching
 - helpful when global memory coalescing is the main bottleneck
Texture Memory
Data filtering
 - supports linear/bilinear and trilinear hardware interpolation
 - texture specific units perform the intrinsic interpolation
 - selected with cudaFilterModePoint or cudaFilterModeLinear
Texture Memory
Access modes
 - clamp and wrap addressing for out-of-bound accesses
 - selected with cudaAddressModeClamp or cudaAddressModeWrap

[Figure: clamp boundary vs wrap boundary behavior]
Texture Memory
Bound to linear memory
 - only supports 1-dimension problems
 - only gets the benefit of the texture cache
 - does not support addressing modes and filtering

Bound to cuda array
 - supports float addressing
 - supports addressing modes
 - supports hardware interpolation
 - supports 1/2/3-dimension problems
Texture Memory
Host code
 - allocate global linear memory or cuda array
 - create and set the texture reference at file scope
 - bind the texture reference to the allocated memory
 - unbind the texture reference to free cache resources

Device code
 - fetch data by indicating the texture reference
 - fetch data by using the texture fetch function
Texture Memory
Texture memory constraints

                          Compute capability 1.3   Compute capability 2.0
1D texture linear memory  8192                     32768
1D texture cuda array     1024x128
2D texture cuda array     (65536,32768)            (65536,65536)
3D texture cuda array     (2048,2048,2048)         (4096,4096,4096)
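These limits can also be queried at runtime instead of being hard-coded; a sketch using cudaGetDeviceProperties (the field names come from the CUDA runtime API):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;

    //query the properties of device 0
    cudaGetDeviceProperties(&prop,0);

    //print the texture dimension limits reported by the driver
    printf("1D texture: %d\n",prop.maxTexture1D);
    printf("2D texture: %d x %d\n",
           prop.maxTexture2D[0],prop.maxTexture2D[1]);
    printf("3D texture: %d x %d x %d\n",
           prop.maxTexture3D[0],prop.maxTexture3D[1],prop.maxTexture3D[2]);

    return 0;
}
```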
Texture Memory
Measuring texture cache miss or hit numbers
 - the latest visual profiler can count cache misses and hits
 - requires device compute capability higher than 1.2
Example: 1-dimension linear memory

Texture Memory
//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    int size=3200;
    float* harray;
    float* diarray;
    float* doarray;

    //allocate host and device memory
    harray=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&diarray,sizeof(float)*size);
    cudaMalloc((void**)&doarray,sizeof(float)*size);

    //initialize host array before usage
    for(int loop=0;loop<size;loop++)
        harray[loop]=(float)rand()/(float)(RAND_MAX-1);

    //copy array from host to device memory
    cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);
Texture Memory
    //bind texture reference with linear memory
    cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

    //execute device kernel
    kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result array from device to host memory
    cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(harray);
    cudaFree(diarray);
    cudaFree(doarray);

    return 0;
}
Texture Memory
__global__ void kernel(float* doarray,int size)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch global memory through texture reference
    doarray[index]=tex1Dfetch(texreference,index);

    return;
}
Texture Memory
__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory
    odata[index]=idata[index+offset];
}
Texture Memory
__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data through the texture cache, which absorbs the misaligned access
    odata[index]=tex1Dfetch(texreference,index+offset);
}
Example: 2-dimension cuda array

Texture Memory
#define size 3200

//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* carray;
    cudaChannelFormatDesc channel;

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);
Texture Memory
    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMallocArray(&carray,&channel,size,size);

    //copy matrix from host to device memory
    int bytes=sizeof(float)*size*size;
    cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeClamp;
Texture Memory
    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,carray);

    blocksize.x=16;
    blocksize.y=16;
    blocknum.x=(int)ceil((float)size/16);
    blocknum.y=(int)ceil((float)size/16);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(carray);

    return 0;
}
Texture Memory
__global__ void kernel(float* dmatrix,int size)
{
    int xindex;
    int yindex;

    //calculate each thread global index
    xindex=blockIdx.x*blockDim.x+threadIdx.x;
    yindex=blockIdx.y*blockDim.y+threadIdx.y;

    //fetch cuda array through texture reference
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);

    return;
}
Example: 3-dimension cuda array

Texture Memory
#define size 256

//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* cudaarray;
    cudaExtent volumesize;
    cudaChannelFormatDesc channel;
    cudaMemcpy3DParms copyparms={0};

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);
Texture Memory
    //initialize host matrix before usage
    for(int loop=0;loop<size*size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //set cuda array volume size
    volumesize=make_cudaExtent(size,size,size);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMalloc3DArray(&cudaarray,&channel,volumesize);

    //set cuda array copy parameters
    copyparms.extent=volumesize;
    copyparms.dstArray=cudaarray;
    copyparms.kind=cudaMemcpyHostToDevice;
    copyparms.srcPtr=
        make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
    cudaMemcpy3D(&copyparms);
Texture Memory
    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeWrap;
    texreference.addressMode[2]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,cudaarray,channel);

    blocksize.x=8;
    blocksize.y=8;
    blocksize.z=1;   //the kernel loops over the z dimension itself

    blocknum.x=(int)ceil((float)size/8);
    blocknum.y=(int)ceil((float)size/8);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);
Texture Memory
    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,sizeof(float)*size*size*size,
               cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(cudaarray);

    return 0;
}
Texture Memory
__global__ void kernel(float* dmatrix,int size)
{
    int loop;
    int xindex;
    int yindex;
    int zindex;

    //calculate each thread global index
    xindex=threadIdx.x+blockIdx.x*blockDim.x;
    yindex=threadIdx.y+blockIdx.y*blockDim.y;

    for(loop=0;loop<size;loop++)
    {
        zindex=loop;

        //fetch cuda array via texture reference
        dmatrix[zindex*size*size+yindex*size+xindex]=
            tex3D(texreference,xindex,yindex,zindex);
    }

    return;
}
Performance comparison: image projection

Texture Memory
image projection or ray casting
Texture Memory
trilinear interpolation on the nearby 8 pixels
the intrinsic interpolation units are very powerful
global memory accessing is very close to random
Texture Memory
Method                              Time    Speedup
global                              1.891   -
global/locality                     0.198   9.5
texture/point                       0.072   26.2
texture/linear                      0.037   51.1
texture/linear/locality             0.012   157.5
texture/linear/locality/fast math   0.011   171.9

object size 512 x 512 x 512 / ray number 512 x 512
Why is texture memory so powerful?

Texture Memory
CUDA array is reordered to something like a space-filling Z-order
 - the software driver supports reordering the data
 - the hardware supports the spatial memory layout
Why is the texture cache read-only?

Texture Memory

The texture cache cannot detect dirty data
[Figure: a writable cache over host memory - a float array is loaded from memory into the cache, operations are performed on the cache, and the write-back is lazily deferred; if other threads modify the array in memory, the data must be reloaded from memory into the cache]
Write data to global memory directly without the texture cache
 - only suitable for global linear memory, not cuda array

Texture Memory
[Figure: device memory with texture cache - darray[index]=value writes to global memory directly, while tex1Dfetch(texreference,index) reads through the texture cache, so the texture cache may not be updated]
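This read-after-write hazard can be sketched in a kernel (hypothetical code, assuming texreference is bound to darray as in the earlier 1D example):

```cuda
__global__ void hazardKernel(float* darray)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //the write goes directly to global memory, bypassing the texture cache
    darray[index]=1.0f;

    //a later fetch through the texture cache may still return the
    //stale value that was cached before the write
    float value=tex1Dfetch(texreference,index);

    darray[index]=value;
}
```

The safe pattern is to separate the phases: read through the texture in one kernel, write to the linear memory in the same or another kernel, and never read a texel that the same kernel launch has already written.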
How about the texture data locality?

Texture Memory
all blocks get scheduled round-robin based on the number of shaders

Why does CUDA distribute the work blocks in the horizontal direction?
Texture Memory
load balancing over all SMs, supposing consecutive blocks have very similar work loads

texture cache data locality, supposing consecutive blocks use similar nearby data
Texture Memory
reorder the block index into Z-order to take advantage of the texture L1 cache
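One way to do this in software (not shown on the slides) is to remap the linear block index to Morton (Z-order) coordinates by de-interleaving its bits; a sketch for a power-of-two square grid, assuming texreference is bound as in the 2D example:

```cuda
//compact the even bits of a Morton code into the low half (sketch
//for coordinates up to 16 bits)
__device__ unsigned int compactBits(unsigned int v)
{
    v&=0x55555555;
    v=(v|(v>>1))&0x33333333;
    v=(v|(v>>2))&0x0f0f0f0f;
    v=(v|(v>>4))&0x00ff00ff;
    v=(v|(v>>8))&0x0000ffff;
    return v;
}

__global__ void kernelZorder(float* dmatrix,int size)
{
    //linear block id in launch order
    unsigned int bid=blockIdx.y*gridDim.x+blockIdx.x;

    //reinterpret the linear id as a Morton code: even bits give the
    //block x coordinate, odd bits give the block y coordinate
    unsigned int bx=compactBits(bid);
    unsigned int by=compactBits(bid>>1);

    int xindex=bx*blockDim.x+threadIdx.x;
    int yindex=by*blockDim.y+threadIdx.y;

    //blocks that are adjacent in launch order now touch nearby
    //texture data, matching the Z-ordered cuda array layout
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
}
```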
Texture Memory
streaming processors: temp1=a/b+sin(c)
special function units: temp2[loop]=__cosf(d)
texture operation units: temp3=tex2D(ref,x,y)

concurrent execution for independent units
Texture Memory
Memory       Location   Cache   Speed            Access
global       off-chip   no      hundreds         all threads
constant     off-chip   yes     one ~ hundreds   all threads
texture      off-chip   yes     one ~ hundreds   all threads
shared       on-chip    -       one              block threads
local        off-chip   no      very slow        single thread
register     on-chip    -       one              single thread
instruction  off-chip   yes     -                invisible
Texture Memory
Memory    Read/Write   Property
global    read/write   input or output
constant  read         no structure
texture   read         locality structure
shared    read/write   shared within block
local     read/write   -
register  read/write   local temp variable
Reference
 - Mark Harris: http://www.markmark.net/
 - Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
 - Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php