1
ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 23, 2013. SharedMem.ppt
Using Shared memory
These notes demonstrate the improvements achieved by using shared memory, with code and results run on coit-grid06.uncc.edu.
2
Objective: as with the memory coalescing demo, load numbers into a two-dimensional array.
The flattened global thread ID of each thread is loaded into its array element, so one can tell which thread accessed which location when the array is printed out.
For comparison purposes, the access is done three ways:
1. Using global memory only
2. Using shared memory, with local 2-D arrays copied back to global memory
3. As 2., but using separate pointer arithmetic for speed
GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 32 x 32 threads (the maximum for compute capability 2.x).
Approach
3
__global__ void gpu_WithoutSharedMem (int *h, int N, int T) {
   // Array loaded with global thread ID that accesses that location
   // Coalescing should be possible

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   int threadID = col + row * N;
   int index = col + row * N;

   for (int t = 0; t < T; t++)   // repeated T times to reduce other time effects
      h[index] = threadID;       // load array with global thread ID
}
1. Using global memory only
4
__global__ void gpu_SharedMem (int *h, int N, int T) {

   __shared__ int h_local[BlockSize][BlockSize];  // shared memory, one array per block

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   int threadID = col + row * N;
   int index = col + row * N;

   // h_local[threadIdx.y][threadIdx.x] = h[index];  Not necessary here,
   // but might be in other calculations

   for (int t = 0; t < T; t++)
      h_local[threadIdx.y][threadIdx.x] = threadID;  // load shared array

   h[index] = h_local[threadIdx.y][threadIdx.x];     // copy back to global memory
}
2. Using shared memory
5
__global__ void gpu_SharedMem_ptr (int *h, int N, int T) {

   __shared__ int h_local[BlockSize][BlockSize];

   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;

   int threadID = col + row * N;
   int index = col + row * N;

   int *ptr = &h_local[0][0];                               // index calc. once outside loop
   int index_local = threadIdx.x + threadIdx.y * BlockSize; // stride is the tile width, not N

   for (int t = 0; t < T; t++)
      ptr[index_local] = threadID;

   h[index] = h_local[threadIdx.y][threadIdx.x];
}
3. Using shared memory with index calculation outside loop
This code I am still checking out
6
… /*------------------------- Allocate Memory -----------------------------------*/

int size = N * N * sizeof(int);    // number of bytes in total in array
int *h, *dev_h;                    // ptrs to arrays holding numbers on host and device

h = (int*) malloc(size);           // array on host
cudaMalloc((void**)&dev_h, size);  // allocate device memory

/* ------------------- GPU Computation without shared memory ------------------- */

gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);  // launch once outside timing

cudaEventRecord( start, 0 );
gpu_WithoutSharedMem <<< Grid, Block >>>(dev_h, N, T);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &elapsed_time_ms1, start, stop );

cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);  // get results to check

printf("\nComputation without shared memory\n");
printArray(h, N);
printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);
Main program
Computations 2 and 3 are timed similarly.
7
Some results
A grid of one block and one iteration. Array 32 x 32.
Shared memory speedup = 1.18
8
A grid of one block and 1000000 iterations
Array 32 x 32
Shared memory speedup = 1.24
9
Repeated just to check that the results are consistent.
10
A grid of 16 x 16 blocks and 10000 iterations
Array 512x512
Speedup = 1.74
Different numbers of iterations produce similar results
11
Array size Speedup
32 x 32 1.24
64 x 64 1.37
128 x 128 1.36
256 x 256 1.78
512 x 512 1.75
1024 x 1024 1.82
2048 x 2048 1.79
4096 x 4096 1.77
1000 iterations. Block size 32 x 32. The number of blocks is chosen to suit the array size.
Different Array Sizes
Questions