CUDA Advanced Memory Usage and Optimization Yukai Hung [email protected] Department of Mathematics National Taiwan University


Page 1: CUDA Advanced Memory Usage and Optimization
Yukai Hung, a0934147@gmail.com
Department of Mathematics, National Taiwan University

Page 2: Register as Cache?

Page 3: Volatile Qualifier

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x];    //array: shared or global array declared elsewhere
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x];    //identical read: the compiler optimizes this read away
        result[threadIdx.x]=temp1*temp2;
    }
}

Page 4: Volatile Qualifier

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        //what the compiler effectively generates: the two reads collapse into one
        int temp=array[threadIdx.x];
        temp1=temp;
        array[threadIdx.x+1]=2;
        temp2=temp;
        result[threadIdx.x]=temp1*temp2;
    }
}

Page 5: Volatile Qualifier

__global__ void kernelFunc(int* result)
{
    int temp1;
    int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        __syncthreads();  //barrier forces the second read to happen after the write

        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}

Page 6: Volatile Qualifier

__global__ void kernelFunc(int* result)
{
    volatile int temp1;  //volatile forbids caching the reads in registers
    volatile int temp2;

    if(threadIdx.x<warpSize)
    {
        temp1=array[threadIdx.x]*1;
        array[threadIdx.x+1]=2;
        temp2=array[threadIdx.x]*2;
        result[threadIdx.x]=temp1*temp2;
    }
}

Page 7: Data Prefetch

Page 8: Data Prefetch

Hide memory latency by overlapping loading and computing
- double buffering is a traditional software pipeline technique

[matrix-multiplication figure: matrices Md, Nd, and Pd with sub-block Pdsub — load the blue block into shared memory, then compute on the blue block in shared memory while loading the next block into shared memory]

Page 9: Data Prefetch

Without prefetching, loading and computing alternate:

for loop
{
    load data from global to shared memory
    synchronize block
    compute data in the shared memory
    synchronize block
}

Page 10: Data Prefetch

With prefetching (double buffering), the next load overlaps the current computation:

load data from global memory to registers
for loop
{
    store data from registers to shared memory
    synchronize block
    load next data from global memory to registers
    compute data in the shared memory
    synchronize block
}

- the register-to-shared store has very small overhead: both memories are very fast
- computing and loading overlap: registers and shared memory are independent units
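The loop above can be sketched as a concrete kernel for the tiled matrix-multiplication example. This is an illustrative sketch only, assuming square matrices whose width is a multiple of the tile size; the names Md, Nd, and Pd follow the slides, while TILE and the kernel name are assumptions:

```cuda
#define TILE 16

//double-buffered tiled matrix multiplication: Pd = Md * Nd (width x width)
__global__ void matmulPrefetch(const float* Md, const float* Nd,
                               float* Pd, int width)
{
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    //prefetch the first tile from global memory into registers
    float mReg = Md[row * width + tx];
    float nReg = Nd[ty * width + col];

    float acc = 0.0f;
    for (int t = 0; t < width / TILE; ++t) {
        //store the prefetched tile from registers to shared memory
        Ms[ty][tx] = mReg;
        Ns[ty][tx] = nReg;
        __syncthreads();

        //load the next tile into registers while computing on the current one
        if (t + 1 < width / TILE) {
            mReg = Md[row * width + (t + 1) * TILE + tx];
            nReg = Nd[((t + 1) * TILE + ty) * width + col];
        }
        for (int k = 0; k < TILE; ++k)
            acc += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    Pd[row * width + col] = acc;
}
```

The shared-memory tile is only overwritten after the barrier, so the register load for tile t+1 can proceed while the multiply-accumulate loop reads tile t.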

Page 11: Data Prefetch

Matrix-matrix multiplication

Page 12: Constant Memory

Page 13: Constant Memory

Where is constant memory?
- data is stored in the device global memory
- data is read through the per-multiprocessor constant cache
- 64KB constant memory and 8KB cache for each multiprocessor

How about the performance?
- optimized when a warp of threads reads the same location
- 4 bytes per cycle, broadcast to the warp of threads
- serialized when a warp of threads reads different locations
- very slow on cache miss (data is read from global memory)
- access latency can range from one to hundreds of clock cycles

Page 14: Constant Memory

How to use constant memory?
- declare constant memory at file scope (global variable)
- copy data to constant memory from the host (because it is constant!!)

//declare constant memory
__constant__ float cst_ptr[size];

//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);

Page 15: Constant Memory

//declare constant memory
__constant__ float cangle[360];

int main(int argc,char** argv)
{
    int size=3200;
    float* darray;
    float hangle[360];

    //allocate device memory
    cudaMalloc((void**)&darray,sizeof(float)*size);

    //initialize allocated memory
    cudaMemset(darray,0,sizeof(float)*size);

    //initialize angle array on host
    for(int loop=0;loop<360;loop++)
        hangle[loop]=acos(-1.0f)*loop/180.0f;

    //copy host angle data to constant memory
    cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);

Page 16: Constant Memory

    //execute device kernel
    test_kernel<<<size/64,64>>>(darray);

    //free device memory
    cudaFree(darray);

    return 0;
}

__global__ void test_kernel(float* darray)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    #pragma unroll 10
    for(int loop=0;loop<360;loop++)
        darray[index]=darray[index]+cangle[loop];

    return;
}

Page 17: Texture Memory

Page 18: Texture Memory

Texture mapping

Page 19: Texture Memory

Texture mapping

Page 20: Texture Memory

Texture filtering: nearest-neighborhood interpolation

Page 21: Texture Memory

Texture filtering: linear/bilinear/trilinear interpolation

Page 22: Texture Memory

Texture filtering: trilinear interpolation performed as two bilinear interpolations
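The filtering modes can be understood from a small CPU model. The helper below is hypothetical (not from the slides): it performs one bilinear lookup on a row-major 2D grid, which is the operation the texture filtering units do in hardware; two of these plus a linear blend along the third axis give trilinear filtering.

```cpp
//Bilinear interpolation on a row-major 2D grid.
//Assumes 0 <= x < width-1 and 0 <= y < height-1 so the four
//surrounding texels are all in bounds (boundary modes not modeled).
float bilinear(const float* data, int width, float x, float y)
{
    int x0 = (int)x, y0 = (int)y;      //top-left texel of the 2x2 footprint
    float fx = x - x0, fy = y - y0;    //fractional offsets inside the footprint

    float v00 = data[y0 * width + x0];
    float v01 = data[y0 * width + x0 + 1];
    float v10 = data[(y0 + 1) * width + x0];
    float v11 = data[(y0 + 1) * width + x0 + 1];

    //blend along x on both rows, then blend the rows along y
    float top    = v00 + fx * (v01 - v00);
    float bottom = v10 + fx * (v11 - v10);
    return top + fy * (bottom - top);
}
```

For example, on the 2x2 grid {0,1,2,3}, sampling at (0.5, 0.5) blends all four texels equally and yields 1.5.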

Page 23: Texture Memory

[G80 graphics pipeline diagram: Host, Input Assembler, Vtx Thread Issue, Setup / Rstr / ZCull, Work Distribution, Pixel Thread Issue, a thread processor array of SP pairs each with an L1 cache and TF (texture filtering) unit, and L2 caches with frame buffer (FB) partitions — these units perform graphical texture operations]

Page 24: Texture Memory

two SMs cooperate as a texture processing cluster, a scalable unit on the graphics card

the texture-specific unit is only available for texture

Page 25: Texture Memory

texture specific units:
- texture address units: compute texture addresses
- texture filtering units: compute data interpolation
- read-only texture L1 cache

Page 26: Texture Memory

[same G80 diagram: a read-only texture L2 cache shared by all TPCs, and a read-only texture L1 cache for each TPC]

Page 27: Texture Memory

texture specific units

Page 28: Texture Memory

Texture is an object for reading data
- data is stored in the device global memory
- global memory is bound with the texture cache

[G80 diagram with the frame-buffer partitions labeled as global memory]

Page 29: CUDA Advanced Memory Usage and Optimization Yukai Hung a0934147@gmail.com Department of Mathematics National Taiwan University Yukai Hung a0934147@gmail.com

What is the advantages of texture?What is the advantages of texture?

Page 30: Texture Memory

Data caching
- helpful when global memory coalescing is the main bottleneck

[G80 diagram highlighting the texture L1 and L2 caches]

Page 31: Texture Memory

Data filtering
- supports linear/bilinear and trilinear hardware interpolation
- intrinsic interpolation in the texture-specific unit
- filter modes: cudaFilterModePoint or cudaFilterModeLinear

Page 32: Texture Memory

Access modes
- clamp and wrap memory accessing for out-of-bound addresses
- clamp boundary: cudaAddressModeClamp
- wrap boundary: cudaAddressModeWrap
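For intuition, the two boundary policies can be emulated on the CPU with integer indices. This is a sketch only: the hardware applies these policies during the texture fetch itself, and on the GPU wrap mode is used with normalized coordinates.

```cpp
//Clamp: out-of-bound indices stick to the nearest edge texel.
int clampIndex(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

//Wrap: indices repeat periodically across the boundary.
int wrapIndex(int i, int n)
{
    int m = i % n;
    return m < 0 ? m + n : m;  //C's % can be negative; shift into [0, n)
}
```

With an 8-texel row, index -1 clamps to 0 but wraps to 7; index 9 clamps to 7 but wraps to 1.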

Page 33: Texture Memory

Bound to linear memory
- only supports 1-dimensional problems
- only gets the benefit of the texture cache
- does not support addressing modes and filtering

Bound to cuda array
- supports float addressing
- supports addressing modes
- supports hardware interpolation
- supports 1/2/3-dimensional problems

Page 34: Texture Memory

Host code
- allocate global linear memory or a cuda array
- create and set the texture reference at file scope
- bind the texture reference to the allocated memory
- unbind the texture reference to free cache resources

Device code
- fetch data by indicating the texture reference
- fetch data by using the texture fetch function

Page 35: Texture Memory

Texture memory constraints

                            Compute capability 1.3   Compute capability 2.0
1D texture (linear memory)  8192                     32768
1D texture (cuda array)     1024x128
2D texture (cuda array)     (65536,32768)            (65536,65536)
3D texture (cuda array)     (2048,2048,2048)         (4096,4096,4096)

Page 36: Texture Memory

Measuring texture cache miss or hit numbers
- the latest visual profiler can count cache misses and hits
- needs device compute capability higher than 1.2

Page 37: Example: 1-dimension linear memory

Page 38: Texture Memory

//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    int size=3200;

    float* harray;
    float* diarray;
    float* doarray;

    //allocate host and device memory
    harray=(float*)malloc(sizeof(float)*size);
    cudaMalloc((void**)&diarray,sizeof(float)*size);
    cudaMalloc((void**)&doarray,sizeof(float)*size);

    //initialize host array before usage
    for(int loop=0;loop<size;loop++)
        harray[loop]=(float)rand()/(float)(RAND_MAX-1);

    //copy array from host to device memory
    cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);

Page 39: Texture Memory

    //bind texture reference with linear memory
    cudaBindTexture(0,texreference,diarray,sizeof(float)*size);

    //execute device kernel
    kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result array from device to host memory
    cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(harray);
    cudaFree(diarray);
    cudaFree(doarray);

    return 0;
}

Page 40: Texture Memory

__global__ void kernel(float* doarray,int size)
{
    int index;

    //calculate each thread global index
    index=blockIdx.x*blockDim.x+threadIdx.x;

    //fetch global memory through texture reference
    doarray[index]=tex1Dfetch(texreference,index);

    return;
}

Page 41: Texture Memory

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory
    odata[index]=idata[index+offset];
}

Page 42: Texture Memory

__global__ void offsetCopy(float* idata,float* odata,int offset)
{
    //compute each thread global index
    int index=blockIdx.x*blockDim.x+threadIdx.x;

    //copy data from global memory through texture reference
    odata[index]=tex1Dfetch(texreference,index+offset);
}

Page 43: Example: 2-dimension cuda array

Page 44: Texture Memory

#define size 3200

//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;

    float* hmatrix;
    float* dmatrix;

    cudaArray* carray;
    cudaChannelFormatDesc channel;

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);

    //initialize host matrix before usage
    for(int loop=0;loop<size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

Page 45: Texture Memory

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMallocArray(&carray,&channel,size,size);

    //copy matrix from host to device memory
    size_t bytes=sizeof(float)*size*size;
    cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeClamp;

Page 46: Texture Memory

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,carray);

    blocksize.x=16;
    blocksize.y=16;

    blocknum.x=(int)ceil((float)size/16);
    blocknum.y=(int)ceil((float)size/16);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(carray);

    return 0;
}

Page 47: Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
    int xindex;
    int yindex;

    //calculate each thread global index
    xindex=blockIdx.x*blockDim.x+threadIdx.x;
    yindex=blockIdx.y*blockDim.y+threadIdx.y;

    //fetch cuda array through texture reference
    dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);

    return;
}

Page 48: Example: 3-dimension cuda array

Page 49: Texture Memory

#define size 256

//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;

int main(int argc,char** argv)
{
    dim3 blocknum;
    dim3 blocksize;

    float* hmatrix;
    float* dmatrix;

    cudaArray* cudaarray;
    cudaExtent volumesize;
    cudaChannelFormatDesc channel;

    cudaMemcpy3DParms copyparms={0};

    //allocate host and device memory
    hmatrix=(float*)malloc(sizeof(float)*size*size*size);
    cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);

Page 50: Texture Memory

    //initialize host matrix before usage
    for(int loop=0;loop<size*size*size;loop++)
        hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);

    //set cuda array volume size
    volumesize=make_cudaExtent(size,size,size);

    //create channel to describe data type
    channel=cudaCreateChannelDesc<float>();

    //allocate device memory for cuda array
    cudaMalloc3DArray(&cudaarray,&channel,volumesize);

    //set cuda array copy parameters
    copyparms.extent=volumesize;
    copyparms.dstArray=cudaarray;
    copyparms.kind=cudaMemcpyHostToDevice;

    copyparms.srcPtr=
        make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
    cudaMemcpy3D(&copyparms);

Page 51: Texture Memory

    //set texture filter mode property
    //use cudaFilterModePoint or cudaFilterModeLinear
    texreference.filterMode=cudaFilterModePoint;

    //set texture address mode property
    //use cudaAddressModeClamp or cudaAddressModeWrap
    texreference.addressMode[0]=cudaAddressModeWrap;
    texreference.addressMode[1]=cudaAddressModeWrap;
    texreference.addressMode[2]=cudaAddressModeClamp;

    //bind texture reference with cuda array
    cudaBindTextureToArray(texreference,cudaarray,channel);

    blocksize.x=8;
    blocksize.y=8;
    blocksize.z=1; //the kernel loops over the z dimension itself

    blocknum.x=(int)ceil((float)size/8);
    blocknum.y=(int)ceil((float)size/8);

    //execute device kernel
    kernel<<<blocknum,blocksize>>>(dmatrix,size);

Page 52: Texture Memory

    //unbind texture reference to free resource
    cudaUnbindTexture(texreference);

    //copy result matrix from device to host memory
    size_t bytes=sizeof(float)*size*size*size;
    cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);

    //free host and device memory
    free(hmatrix);
    cudaFree(dmatrix);
    cudaFreeArray(cudaarray);

    return 0;
}

Page 53: Texture Memory

__global__ void kernel(float* dmatrix,int size)
{
    int loop;
    int xindex;
    int yindex;
    int zindex;

    //calculate each thread global index
    xindex=threadIdx.x+blockIdx.x*blockDim.x;
    yindex=threadIdx.y+blockIdx.y*blockDim.y;

    for(loop=0;loop<size;loop++)
    {
        zindex=loop;

        //fetch cuda array via texture reference
        dmatrix[zindex*size*size+yindex*size+xindex]=
            tex3D(texreference,xindex,yindex,zindex);
    }

    return;
}

Page 54: Performance comparison: image projection

Page 55: Texture Memory

image projection or ray casting

Page 56: Texture Memory

- trilinear interpolation on the nearby 8 pixels
- the intrinsic interpolation units are very powerful
- global memory accessing is very close to random

Page 57: Texture Memory

Method                              Time    Speedup
global                              1.891   -
global/locality                     0.198   9.5
texture/point                       0.072   26.2
texture/linear                      0.037   51.1
texture/linear/locality             0.012   157.5
texture/linear/locality/fast math   0.011   171.9

object size 512 x 512 x 512 / ray number 512 x 512

Page 58: Why is texture memory so powerful?

Page 59: Texture Memory

CUDA arrays are reordered into something like a space-filling Z-order
- the software driver supports reordering the data
- the hardware supports the spatial memory layout
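The Z-order idea can be illustrated with a bit-interleaving function. This is only a model: the actual CUDA array layout is undocumented, but a Morton index shows why 2D-local accesses land close together in memory.

```cpp
//Morton (Z-order) index: interleave the bits of x and y so that
//nearby (x, y) pairs map to nearby linear addresses.
unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    unsigned int z = 0;
    for (int b = 0; b < 16; ++b) {
        z |= ((x >> b) & 1u) << (2 * b);      //x bits go to even positions
        z |= ((y >> b) & 1u) << (2 * b + 1);  //y bits go to odd positions
    }
    return z;
}
```

The four texels of a 2x2 block, (0,0) (1,0) (0,1) (1,1), map to consecutive indices 0, 1, 2, 3, so a warp fetching a small 2D neighborhood touches one compact region of the cache.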

Page 60: Why is the texture cache read-only?

Page 61: Texture Memory

The texture cache cannot detect dirty data

[figure: a float array in host memory is loaded into the cache, operations are performed on the cache with lazy update for write-back, and the cache must be reloaded from memory once the array is modified by other threads]

Page 62: Texture Memory

Write data to global memory directly, without the texture cache
- only suitable for global linear memory, not cuda arrays

[figure: a float array in device memory — darray[index]=value writes to global memory directly, while tex1Dfetch(texreference,index) reads through the texture cache, so the texture cache may not be updated]

Page 63: How about the texture data locality?

Page 64: Texture Memory

all blocks get scheduled round-robin based on the number of shaders

Why does CUDA distribute the work blocks in the horizontal direction?

Page 65: Texture Memory

- load balancing over all SMs: suppose consecutive blocks have very similar work loads
- texture cache data locality: suppose consecutive blocks use similar nearby data

Page 66: Texture Memory

reorder the block index to fit a Z-order curve to take advantage of the texture L1 cache

Page 67: Texture Memory

- streaming processors:    temp1=a/b+sin(c)
- special function units:  temp2[loop]=__cosf(d)
- texture operation units: temp3=tex2D(ref,x,y)

concurrent execution for independent units

Page 68: Texture Memory

Memory       Location   Cache   Speed            Access
global       off-chip   no      hundreds         all threads
constant     off-chip   yes     one ~ hundreds   all threads
texture      off-chip   yes     one ~ hundreds   all threads
shared       on-chip    -       one              block threads
local        off-chip   no      very slow        single thread
register     on-chip    -       one              single thread
instruction  off-chip   yes     -                invisible

Page 69: Texture Memory

Memory    Read/Write   Property
global    read/write   input or output
constant  read         no structure
texture   read         locality structure
shared    read/write   shared within block
local     read/write   -
register  read/write   local temp variable

Page 70: Reference

- Mark Harris: http://www.markmark.net/
- Wei-Chao Chen: http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu: http://impact.crhc.illinois.edu/people/current/hwu.php