CUDA Advanced Memory Usage and Optimization
Yukai Hung
[email protected]
Department of Mathematics
National Taiwan University
Register as Cache?

Volatile Qualifier
//array is assumed to be declared in shared or global memory
__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x];
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x];   //identical read: the compiler may optimize this read away
        result[threadIdx.x]=temp1*temp2;
    }
}
Volatile Qualifier
__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        int temp=array[threadIdx.x];   //single read kept in a register
        temp1=temp;
        array[threadIdx.x+1]=2;
        temp2=temp;                    //reuses the cached register value
        result[threadIdx.x]=temp1*temp2;
    }
}
Volatile Qualifier
__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        __syncthreads();   //barrier forces the second read back to memory

        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}
Volatile Qualifier
__global__ void kernelFunc(int* result)
{
    volatile int temp1;
    volatile int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x]*2;   //volatile forbids caching the first read
        result[threadIdx.x]=temp1*temp2;
    }
}
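A more common idiom (not shown on the slides) places volatile on the shared array itself rather than on the temporaries, so every access goes back to shared memory; a minimal sketch assuming a 64-thread block:

```cuda
__global__ void kernelFunc(int* result)
{
    //declaring the shared array volatile forbids the compiler from
    //caching its elements in registers across the intervening write
    __shared__ volatile int array[64+1];

    array[threadIdx.x]=threadIdx.x;
    __syncthreads();

    if(threadIdx.x<warpSize)
    {
        int temp1=array[threadIdx.x];   //first read from shared memory
        array[threadIdx.x+1]=2;
        int temp2=array[threadIdx.x];   //re-read is not optimized away
        result[threadIdx.x]=temp1*temp2;
    }
}
```

This is the usual pattern in warp-synchronous code, where one warp's writes must be visible to later reads without a full barrier.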
Data Prefetch

Hide memory latency by overlapping loading and computing
 - double buffering is a traditional software pipeline technique
[Figure: tiled matrix multiplication Md x Nd -> Pd (tile Pdsub): load a block into shared memory, then compute on that block while loading the next block into shared memory]
Data Prefetch
for loop
{
    load data from global to shared memory
    synchronize block
    compute data in the shared memory
    synchronize block
}
Data Prefetch
load data from global memory to registers
for loop
{
    store data from registers to shared memory
    synchronize block
    load next data from global memory to registers
    compute data in the shared memory
    synchronize block
}
very small overhead: both memories are very fast
computing and loading overlap: registers and shared memory are independent
Data Prefetch

Matrix-matrix multiplication
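The pipeline above, applied to tiled matrix multiplication, can be sketched as follows (not the slides' code; the TILE size is illustrative, and only the Md tile is staged through shared memory to keep the sketch short):

```cuda
#define TILE 16

__global__ void matmulPrefetch(float* Md,float* Nd,float* Pd,int width)
{
    __shared__ float tile[TILE][TILE];

    int row=blockIdx.y*TILE+threadIdx.y;
    int col=blockIdx.x*TILE+threadIdx.x;
    float sum=0.0f;

    //prefetch the first element of Md from global memory into a register
    float reg=Md[row*width+threadIdx.x];

    for(int m=0;m<width/TILE;m++)
    {
        //store the prefetched element into shared memory
        tile[threadIdx.y][threadIdx.x]=reg;
        __syncthreads();

        //prefetch the next tile while computing on the current one
        if(m+1<width/TILE)
            reg=Md[row*width+(m+1)*TILE+threadIdx.x];

        //compute on the current tile in shared memory
        for(int k=0;k<TILE;k++)
            sum+=tile[threadIdx.y][k]*Nd[(m*TILE+k)*width+col];
        __syncthreads();
    }

    Pd[row*width+col]=sum;
}
```

The global load for the next tile is issued before the inner product loop, so the memory latency overlaps with the computation on the current tile.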
Constant Memory
Where is constant memory?
 - data is stored in the device global memory
 - read data through the multiprocessor constant cache
 - 64KB constant memory and 8KB cache for each multiprocessor

How about the performance?
 - optimized when a warp of threads reads the same location
 - 4 bytes per cycle through broadcasting to a warp of threads
 - serialized when a warp of threads reads different locations
 - very slow on a cache miss (read data from global memory)
 - access latency can range from one to hundreds of clock cycles
Constant Memory
How to use constant memory?
 - declare constant memory at file scope (global variable)
 - copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);
Constant Memory
//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
    int size=3200;
    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);
Constant Memory
    //execute device kernel
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);

    return 0;
}

__global__ void test_kernel(float* darray)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];

    return;
}
Texture Memory
Texture mapping
Texture Memory

Texture filtering
 - nearest-neighbor interpolation
 - linear/bilinear/trilinear interpolation
 - trilinear interpolation: two bilinear interpolations combined
Texture Memory

[Figure: G80 GPU block diagram - host, input assembler, vertex/pixel thread issue, work distribution, streaming processors (SP) grouped with L1 caches and texture filtering (TF) units, shared L2 caches and frame buffer (FB) partitions; these units perform graphical texture operations]
Texture Memory
two SMs cooperate as a texture processing cluster (TPC), the scalable unit on graphics hardware

texture specific unit, only available for texture
Texture Memory
texture specific units
 - texture address units compute texture addresses
 - texture filtering units compute data interpolation
 - read-only texture L1 cache
Texture Memory
[Figure: same G80 block diagram, highlighting the texture caches - a read-only texture L2 cache shared by all TPCs, and a read-only texture L1 cache for each TPC]
Texture Memory
texture specific units

Texture Memory
Texture is an object for reading data
 - data is stored on the device global memory
 - global memory is bound with texture cache
[Figure: GPU block diagram with the frame buffer (FB) partitions marked as global memory]
What are the advantages of texture?

Texture Memory
Data caching
 - helpful when global memory coalescing is the main bottleneck
Texture Memory
Data filtering
 - supports linear/bilinear and trilinear hardware interpolation
 - texture specific units perform the intrinsic interpolation
 - selected with cudaFilterModePoint or cudaFilterModeLinear
Texture Memory
Access modes
 - clamp and wrap addressing for out-of-bound accesses
 - selected with cudaAddressModeClamp or cudaAddressModeWrap

[Figure: clamp boundary vs wrap boundary behavior]
Texture Memory
Bound to linear memory
 - only supports 1-dimension problems
 - only gets the benefit of the texture cache
 - does not support addressing modes and filtering

Bound to cuda array
 - supports float addressing
 - supports addressing modes
 - supports hardware interpolation
 - supports 1/2/3-dimension problems
Texture Memory
Host code
 - allocate global linear memory or cuda array
 - create and set the texture reference at file scope
 - bind the texture reference to the allocated memory
 - unbind the texture reference to free cache resources

Device code
 - fetch data by indicating the texture reference
 - fetch data by using the texture fetch function
Texture Memory
Texture memory constraints

                          Compute capability 1.3   Compute capability 2.0
1D texture linear memory  8192                     32768
1D texture cuda array     1024x128
2D texture cuda array     (65536,32768)            (65536,65536)
3D texture cuda array     (2048,2048,2048)         (4096,4096,4096)
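These limits can also be queried at runtime instead of being hard-coded; a sketch using cudaGetDeviceProperties (the field names come from the CUDA runtime API):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;

    //query the properties of device 0
    cudaGetDeviceProperties(&prop,0);

    //print the texture dimension limits reported by the driver
    printf("1D texture: %d\n",prop.maxTexture1D);
    printf("2D texture: %d x %d\n",
           prop.maxTexture2D[0],prop.maxTexture2D[1]);
    printf("3D texture: %d x %d x %d\n",
           prop.maxTexture3D[0],prop.maxTexture3D[1],prop.maxTexture3D[2]);

    return 0;
}
```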
Texture Memory
Measuring texture cache miss or hit numbers
 - the latest visual profiler can count cache misses and hits
 - requires device compute capability higher than 1.2
Example: 1-dimension linear memory

Texture Memory
//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    int size=3200;
    float* harray;
    float* diarray;
    float* doarray;

    //allocate host and device memory
    harray=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&diarray,sizeof(float)*size);
    cudaMalloc((void**)&doarray,sizeof(float)*size);

    //initialize host array before usage
    for(int loop=0;loop<size;loop++)
        harray[loop]=(float)rand()/(float)(RAND_MAX-1);

    //copy array from host to device memory
    cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);
Texture Memory
    //bind texture reference with linear memory
    cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

    //execute device kernel
    kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result array from device to host memory
    cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(harray);
    cudaFree(diarray);
    cudaFree(doarray);

    return 0;
}
Texture Memory
__global__ void kernel(float* doarray,int size)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch global memory through texture reference
    doarray[index]=tex1Dfetch(texreference,index);

    return;
}
Texture Memory
__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory
    odata[index]=idata[index+offset];
}
Texture Memory
__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data through the texture cache, which absorbs the misaligned access
    odata[index]=tex1Dfetch(texreference,index+offset);
}
Example: 2-dimension cuda array

Texture Memory
#define size 3200

//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* carray;
    cudaChannelFormatDesc channel;

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);
Texture Memory
    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMallocArray(&carray,&channel,size,size);

    //copy matrix from host to device memory
    int bytes=sizeof(float)*size*size;
    cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeClamp;
Texture Memory
    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,carray);

    blocksize.x=16;
    blocksize.y=16;
    blocknum.x=(int)ceil((float)size/16);
    blocknum.y=(int)ceil((float)size/16);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(carray);

    return 0;
}
Texture Memory
__global__ void kernel(float* dmatrix,int size)
{
    int xindex;
    int yindex;

    //calculate each thread global index
    xindex=blockIdx.x*blockDim.x+threadIdx.x;
    yindex=blockIdx.y*blockDim.y+threadIdx.y;

    //fetch cuda array through texture reference
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);

    return;
}
Example: 3-dimension cuda array

Texture Memory
#define size 256

//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;
    float* hmatrix;
    float* dmatrix;
    cudaArray* cudaarray;
    cudaExtent volumesize;
    cudaChannelFormatDesc channel;
    cudaMemcpy3DParms copyparms={0};

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);
Texture Memory
    //initialize host matrix before usage
    for(int loop=0;loop<size*size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //set cuda array volume size
    volumesize=make_cudaExtent(size,size,size);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMalloc3DArray(&cudaarray,&channel,volumesize);

    //set cuda array copy parameters
    copyparms.extent=volumesize;
    copyparms.dstArray=cudaarray;
    copyparms.kind=cudaMemcpyHostToDevice;
    copyparms.srcPtr=
        make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
    cudaMemcpy3D(&copyparms);
Texture Memory
    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeWrap;
    texreference.addressMode[2]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,cudaarray,channel);

    blocksize.x=8;
    blocksize.y=8;
    blocksize.z=1;   //the kernel loops over the z dimension itself

    blocknum.x=(int)ceil((float)size/8);
    blocknum.y=(int)ceil((float)size/8);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);
Texture Memory
    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,sizeof(float)*size*size*size,
               cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(cudaarray);

    return 0;
}
Texture Memory
__global__ void kernel(float* dmatrix,int size)
{
    int loop;
    int xindex;
    int yindex;
    int zindex;

    //calculate each thread global index
    xindex=threadIdx.x+blockIdx.x*blockDim.x;
    yindex=threadIdx.y+blockIdx.y*blockDim.y;

    for(loop=0;loop<size;loop++)
    {
        zindex=loop;

        //fetch cuda array via texture reference
        dmatrix[zindex*size*size+yindex*size+xindex]=
            tex3D(texreference,xindex,yindex,zindex);
    }

    return;
}
Performance comparison: image projection

Texture Memory
image projection or ray casting
Texture Memory
trilinear interpolation on the nearby 8 pixels
the intrinsic interpolation units are very powerful
global memory accessing is very close to random
Texture Memory
Method                              Time    Speedup
global                              1.891   -
global/locality                     0.198   9.5
texture/point                       0.072   26.2
texture/linear                      0.037   51.1
texture/linear/locality             0.012   157.5
texture/linear/locality/fast math   0.011   171.9

object size 512 x 512 x 512 / ray number 512 x 512
Why is texture memory so powerful?

Texture Memory
CUDA array is reordered to something like a space-filling Z-order
 - the software driver supports reordering the data
 - the hardware supports the spatial memory layout
Why is the texture cache read-only?

Texture Memory

The texture cache cannot detect dirty data
[Figure: a writable cache over host memory - a float array is loaded from memory into the cache, operations are performed on the cache, and the write-back is lazily deferred; if other threads modify the array in memory, the data must be reloaded from memory into the cache]
Write data to global memory directly without the texture cache
 - only suitable for global linear memory, not cuda array

Texture Memory
[Figure: device memory with texture cache - darray[index]=value writes to global memory directly, while tex1Dfetch(texreference,index) reads through the texture cache, so the texture cache may not be updated]
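This read-after-write hazard can be sketched in a kernel (hypothetical code, assuming texreference is bound to darray as in the earlier 1D example):

```cuda
__global__ void hazardKernel(float* darray)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //the write goes directly to global memory, bypassing the texture cache
    darray[index]=1.0f;

    //a later fetch through the texture cache may still return the
    //stale value that was cached before the write
    float value=tex1Dfetch(texreference,index);

    darray[index]=value;
}
```

The safe pattern is to separate the phases: read through the texture in one kernel, write to the linear memory in the same or another kernel, and never read a texel that the same kernel launch has already written.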
How about the texture data locality?

Texture Memory
all blocks get scheduled round-robin based on the number of shaders

Why does CUDA distribute the work blocks in the horizontal direction?
Texture Memory
load balancing over all SMs, supposing consecutive blocks have very similar work loads

texture cache data locality, supposing consecutive blocks use similar nearby data
Texture Memory
reorder the block index into Z-order to take advantage of the texture L1 cache
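One way to do this in software (not shown on the slides) is to remap the linear block index to Morton (Z-order) coordinates by de-interleaving its bits; a sketch for a power-of-two square grid, assuming texreference is bound as in the 2D example:

```cuda
//compact the even bits of a Morton code into the low half (sketch
//for coordinates up to 16 bits)
__device__ unsigned int compactBits(unsigned int v)
{
    v&=0x55555555;
    v=(v|(v>>1))&0x33333333;
    v=(v|(v>>2))&0x0f0f0f0f;
    v=(v|(v>>4))&0x00ff00ff;
    v=(v|(v>>8))&0x0000ffff;
    return v;
}

__global__ void kernelZorder(float* dmatrix,int size)
{
    //linear block id in launch order
    unsigned int bid=blockIdx.y*gridDim.x+blockIdx.x;

    //reinterpret the linear id as a Morton code: even bits give the
    //block x coordinate, odd bits give the block y coordinate
    unsigned int bx=compactBits(bid);
    unsigned int by=compactBits(bid>>1);

    int xindex=bx*blockDim.x+threadIdx.x;
    int yindex=by*blockDim.y+threadIdx.y;

    //blocks that are adjacent in launch order now touch nearby
    //texture data, matching the Z-ordered cuda array layout
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
}
```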
Texture Memory
streaming processors: temp1=a/b+sin(c)
special function units: temp2[loop]=__cosf(d)
texture operation units: temp3=tex2D(ref,x,y)

concurrent execution for independent units
Texture Memory
Memory       Location   Cache   Speed            Access
global       off-chip   no      hundreds         all threads
constant     off-chip   yes     one ~ hundreds   all threads
texture      off-chip   yes     one ~ hundreds   all threads
shared       on-chip    -       one              block threads
local        off-chip   no      very slow        single thread
register     on-chip    -       one              single thread
instruction  off-chip   yes     -                invisible
Texture Memory
Memory    Read/Write   Property
global    read/write   input or output
constant  read         no structure
texture   read         locality structure
shared    read/write   shared within block
local     read/write   -
register  read/write   local temp variable
Reference
 - Mark Harris: http://www.markmark.net/
 - Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
 - Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php