TP1: CUDA programming

CUDA devices, threading and memories
P. Bakowski


nVidia GPU device analysis

In our first exercise we will analyze the parameters of the nVidia graphics card installed in your computer.
We write a program that activates the CUDA runtime and requests the parameters of the graphical unit (GPU).
These parameters are communicated through the structure filled in by:

cudaGetDeviceProperties(&deviceProp, device);
struct cudaDeviceProp {
    char   name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int    regsPerBlock;
    int    warpSize;
    size_t memPitch;
    int    maxThreadsPerBlock;
    size_t totalConstMem;
    int    major;
    int    minor;
    int    clockRate;
    size_t textureAlignment;
    int    deviceOverlap;
    int    multiProcessorCount;
    int    kernelExecTimeoutEnabled;
    // ... (only the fields used in this TP are shown)
};
nVidia GeForce GPUs: multiprocessors, cores and compute capability

GPU                       Multiprocessors   Cores   Compute Capability
GeForce GTX 480           15                480     2.0
GeForce GTX 470           14                448     2.0
GeForce GTX 465           11                352     2.0
GeForce GTX 295           2x30              2x240   1.3
GeForce GTX 280/GTX 285   30                240     1.3
GeForce GTX 260           24                192     1.3
GeForce 210               2                 16      1.2
GeForce GT 240            12                96      1.2
GeForce GT 220            6                 48      1.2
GeForce GT 130            12                96      1.1
GeForce GT 120            4                 32      1.1
GeForce GTS 250           16                128     1.1
GeForce 9800 GX2          2x16              2x128   1.1
GeForce 9800 GTX          16                128     1.1
GeForce 8800 GT           14                112     1.1
GeForce 8800 GTS          12                96      1.0
GeForce 8600 GT/GTS       4                 32      1.1
Exercise 0:

Write a program that displays:
1. The name of the device
2. The size of the global memory
3. The size of shared memory (per block)
4. The number of registers per block
5. The number of threads per block
6. The version: major, minor
7. The clock frequency (in kHz)
8. The number of multiprocessors
GPU analysis

Some elements of the program:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(){
    ...
    cudaGetDeviceProperties(&deviceProp, device);
    ...
}

Attention:
The source code must be saved with the .cu extension; to compile the code we use the nvcc compiler:

% nvcc -o exo0 exo0.cu
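A minimal sketch of such a program, assuming the first CUDA device (device 0) is used; the printed fields are a selection from the struct above:

#include <stdio.h>
#include <cuda_runtime.h>

int main(){
    int device = 0;                      // assume the first CUDA device
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("name: %s\n", deviceProp.name);
    printf("global memory: %lu bytes\n", (unsigned long)deviceProp.totalGlobalMem);
    printf("shared memory per block: %lu bytes\n", (unsigned long)deviceProp.sharedMemPerBlock);
    printf("compute capability: %d.%d\n", deviceProp.major, deviceProp.minor);
    printf("clock rate: %d kHz\n", deviceProp.clockRate);
    printf("multiprocessors: %d\n", deviceProp.multiProcessorCount);
    return 0;
}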
P.Bakowski 10
Exercise 1: Vector addition

In this exercise we will write our first kernel – the function executed by the GPU.
The program consists of several parts:
- the kernel, declared as a global function: __global__ void addVect(..)
- global GPU memory allocation
- the transfer of the arguments to this memory
- the kernel call with the triple chevron <<<...>>>
- the transfer of the result from GPU memory to host memory
__global__ void addVect(float* a, float* b, float* c)
{
    int i = threadIdx.x;   // index of this thread within the block
    c[i] = a[i] + b[i];
}

The threadIdx.x variable is provided automatically by the compiler; it gives the index of the thread executing the function (threadIdx.x = 0 for the first thread, and so on).
float* Cv1;
cudaMalloc((void **)&Cv1, memsize);
float* Cv2;
cudaMalloc((void **)&Cv2, memsize);
float* Cres;
cudaMalloc((void **)&Cres, memsize);

Attention:
At the end of execution the GPU memory must be freed by:
cudaFree(Cv1); cudaFree(Cv2); cudaFree(Cres);
cudaMemcpy(Cv1, v1, memsize, cudaMemcpyHostToDevice);
cudaMemcpy(Cv2, v2, memsize, cudaMemcpyHostToDevice);
cudaMemcpy(res, Cres, memsize, cudaMemcpyDeviceToHost);
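The kernel call itself uses the triple chevron; a minimal sketch, assuming all N vector elements are processed by a single block of N threads:

addVect<<<1, N>>>(Cv1, Cv2, Cres);   // 1 block, N threads

The first value between the chevrons is the number of blocks, the second is the number of threads per block.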
Exercise 1a:

Write a complete application with the kernel that adds two vectors.
The values of the vectors can be prepared in the form of constants, e.g.:

float v2[] = {10.9,11.8,12.7,13.6,14.5,15.4,16.3,17.2,18.1,19.0};

After the return from the kernel the host displays the results.
Exercise 1b:

Write the same application by implementing the mechanism of blocks.
An example of the changes to make:

#define BLN 4

Kernel call:
// vsize – is the vector size

(Figure: blocks & threads – with BLN = 4, blockIdx.x ranges from 0 to 3, and blockDim.x gives the number of threads per block, e.g. blockDim.x = 2.)

The calculation of the index in the kernel is sketched below.
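A sketch of the corresponding launch and index computation, assuming vsize is a multiple of BLN:

// kernel call: BLN blocks, each with vsize/BLN threads
addVect<<<BLN, vsize/BLN>>>(Cv1, Cv2, Cres);

// in the kernel: global index of the thread
int i = blockIdx.x*blockDim.x + threadIdx.x;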
Exercise 2: Matrix multiplication

Matrix multiplication, which requires multiplications and additions performed in parallel on many elements, allows us to fully exploit the power of the GPU.

(Figure: computing one element (i,j) of the result, e.g. 0*1 + 1*3 = 3.)
Matrix multiplication: the dim3 structure

The organization of threads in two dimensions is well suited to computing the points of the resulting matrix.
We will therefore use the predefined structures of type dim3:

dim3 dimBlock(DIM,DIM);
dim3 dimGrid(1,1);

Each cell of the resulting matrix is processed by one thread.
Each thread has two identifiers: threadIdx.x and threadIdx.y.
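Inside the kernel these identifiers select the column and the row of the cell computed by the thread; a minimal sketch (the variable names match the kernel below):

int tx = threadIdx.x;   // column of the result cell
int ty = threadIdx.y;   // row of the result cell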
Matrix multiplication: arguments

The kernel function gets the following arguments:

// allocation on the GPU
cudaMalloc((void **)&dev_A, DIM*DIM*sizeof(float));
cudaMalloc((void **)&dev_B, DIM*DIM*sizeof(float));
cudaMalloc((void **)&dev_C, DIM*DIM*sizeof(float));

// data transfer to the GPU
int sizemat = DIM*DIM*sizeof(float);
cudaMemcpy(dev_A, buffA, sizemat, cudaMemcpyHostToDevice);
cudaMemcpy(dev_B, buffB, sizemat, cudaMemcpyHostToDevice);

Kernel call:
dim3 dimBlock(DIM,DIM);
dim3 dimGrid(1,1);
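With this configuration the launch itself would look as follows (a sketch; the argument list assumes the matrix_mul signature used below):

matrix_mul<<<dimGrid, dimBlock>>>(dev_A, dev_B, dev_C, DIM);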
__global__ void matrix_mul(float* dev_A, float* dev_B, float* dev_C, int Width)
{
    int tx = threadIdx.x;   // column
    int ty = threadIdx.y;   // row
    float Pvalue = 0;
    for(int k=0; k<Width; ++k)
    {   // products addition
        float Ael = dev_A[ty*Width + k];
        float Bel = dev_B[k*Width + tx];
        Pvalue += Ael*Bel;
    }
    dev_C[ty*Width + tx] = Pvalue;   // store the result cell
}
Matrix multiplication with the CPU

Below we find the matrix multiplication code programmed for the CPU.
In this case each element Cij is produced inside two sequential loops, one loop over the horizontal indexes (i) and one over the vertical indexes (j). Note that the GPU executes DIM*DIM threads in their place, so these two loops are not needed.

for(int i=0; i<DIM; i++)
    for(int j=0; j<DIM; j++)
        for(int k=0; k<DIM; k++)
            buffC[j+i*DIM] += buffA[k+i*DIM]*buffB[j+k*DIM];
Performance analysis

With CUDA, performance analysis can be performed via the mechanism of events.
Events are declared, created and recorded as follows:

float elapsedTime;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);   // activation
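After the kernel call the measurement is completed as follows (a sketch; cudaEventElapsedTime returns the time in milliseconds):

// ... kernel call to be measured ...
cudaEventRecord(stop, 0);                          // mark the end
cudaEventSynchronize(stop);                        // wait for the stop event to complete
cudaEventElapsedTime(&elapsedTime, start, stop);   // elapsed time in ms
cudaEventDestroy(start);
cudaEventDestroy(stop);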
Performance: GPU versus CPU

Write an implementation of matrix multiplication with two modes: GPU mode and CPU mode.
Compare the execution times with the events.
To obtain meaningful results use square matrices with a minimum size of 32*32. The content of the matrices can be initialized with the function rand():

float* random_block(int size)
{
    float* ptr = (float*)malloc(size*sizeof(float));   // host buffer
    for (int i=0; i<size; i++)
        ptr[i] = rand();
    return ptr;
}
CUDA: memory types

Several types of memory are available in GPUs.
Global memory is accessible to all blocks, but it is relatively slow.
Shared memory is accessible only within a block, but it has a relatively short access time.
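In CUDA C, shared memory is declared inside a kernel with the __shared__ qualifier; a minimal sketch (the array size is illustrative):

__shared__ float cache[256];   // allocated once per block, visible only to the threads of that block

Each block receives its own copy of such an array; threads from different blocks cannot exchange data through it.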
CUDA: shared memory

The threads of a block can share the fast (shared) memory and accelerate the execution of the application.
As an example we will take the dot product algorithm, which generates results to be shared among multiple threads.
With this program we will also study the mechanism of synchronization between the threads of the same block.
Dot product

Let us take the dot product of two vectors, here (1, 3, -5) and (4, -2, -1).
The first part of the execution is independent for each thread; each thread computes one product of vector elements:

thread0: 1*4;   thread1: 3*(-2);   thread2: (-5)*(-1)

The results of the multiplications are stored in shared memory. These results are added in the second phase of the execution.
Note that the number of threads can be larger than the number of cores. To ensure that all products are available in shared memory, it is important to synchronize the arrival of the results in shared memory with the primitive:

__syncthreads();
Addition with reduction

In the second execution phase it is also important to share the work among multiple threads (cores).
For example, a sequence of 8 values stored in shared memory (1, 2, 3, 4, 5, 6, 7, 8) may be added in 3 steps; in general, addition with reduction takes ns = log2(nv) steps for nv values (here log2(8) = 3):

step 1:  1+5  2+6  3+7  4+8   (4 threads)
step 2:  6+10  8+12           (2 threads)
step 3:  16+20                (1 thread, result 36)
Exercise 3: Dot product

Write a program that calculates the dot product of two vectors of size N.
The kernel must use shared memory to store the products.
It must use the technique of reduction to speed up the addition of the products available in shared memory (cache).
To facilitate the work, the essential part of the code is provided below.

#define imin(a,b) ((a)<(b)?(a):(b))   // helper used to cap the grid size
const int threadsPerBlock = 256;
const int blocksPerGrid = imin(32, (N+threadsPerBlock-1)/threadsPerBlock);

The number of elements of the vectors is N = 16*1024.
These elements are processed in blocks of 256 threads.
The number of blocks per grid should be equal to or smaller than 32. Since (16*1024 + 255)/256 = 64 (integer division), which is greater than 32, we take blocksPerGrid = 32.
Dot product: shared memory

The cache – the shared memory array – has 256 entries per block; this value is given by the constant threadsPerBlock. The first phase of the kernel:

__global__ void dot(float* a, float* b, float* c)
{
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    int cacheIndex = threadIdx.x;
    float temp = 0;
    while (tid < N)
    {
        temp += a[tid]*b[tid];
        tid += blockDim.x*gridDim.x;   // 256*32
    }
    cache[cacheIndex] = temp;   // store this thread's partial sum
    __syncthreads();

The execution of the threads is done in groups of blockDim.x*gridDim.x (256*32) elements.
Shared memory usage (256, 128, 64, ... 1 element):

    int i = blockDim.x/2;
    while(i != 0)
    {
        if(cacheIndex < i)   // only the lower half of the threads adds
            cache[cacheIndex] += cache[cacheIndex+i];
        __syncthreads();
        i /= 2;              // halve the number of active threads
    }
}
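After the loop, the block's partial result sits in cache[0]. Just before the kernel's closing brace, thread 0 of each block typically writes it out, and the host then finishes the sum; a sketch, assuming the per-block results are copied back into a host array partial_c of blocksPerGrid elements (the array name is illustrative):

// at the end of the kernel: one partial result per block
if (cacheIndex == 0)
    c[blockIdx.x] = cache[0];

// on the host, after the cudaMemcpy of the partial results:
float result = 0;
for (int i=0; i<blocksPerGrid; i++)
    result += partial_c[i];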
Complete the code.
Initialize both vectors with a single value, 2, in order to check the execution result (each product is then 4, so the dot product should be 4*N).
Exercise 3bis:

Write the same application (dot product) to be executed by the CPU.
Add CUDA events to test and compare the performance of both solutions.
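For the CPU version, a minimal sketch of the computation itself (the function and variable names are illustrative):

float dot_cpu(float* a, float* b, int n)
{
    float result = 0;
    for (int i=0; i<n; i++)   // one sequential loop instead of N parallel threads
        result += a[i]*b[i];
    return result;
}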
Summary

In this lab we studied the main characteristics of GPUs.
We studied and wrote some simple examples of CUDA programming, including:
- writing a simple kernel
- writing a kernel with two-dimensional processing
- matrix multiplication