+CUDA
Antonyus Pyetro do Amaral Ferreira
+The problem
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems.
The challenge is to develop application software that transparently scales its parallelism.
+A solution
CUDA is a parallel programming model and software environment.
A compiled CUDA program can therefore execute on any number of processor cores, and only the runtime system needs to know the physical processor count.
+CPU vs. GPU
+CPU vs. GPU
The GPU is especially well-suited to address problems that can be expressed as data-parallel computations.
Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control.
+Applications?
Applications range from general signal processing and physics simulation to computational finance and computational biology.
The latest generation of NVIDIA GPUs, based on the Tesla architecture, supports the CUDA programming model.
+CUDA - Hello world
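The code on this slide is not preserved in the transcript; a minimal hello-world sketch (assuming device-side printf, which requires compute capability 2.0 or later) might look like:

```cuda
#include <stdio.h>

// Kernel: runs on the device; each thread prints its own ID.
__global__ void hello(void)
{
    printf("Hello world from thread %d\n", threadIdx.x);
}

int main(void)
{
    hello<<<1, 4>>>();        // 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel to finish
    return 0;
}
```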
+What is CUDA?
CUDA extends C by allowing the programmer to define C functions, called kernels, that are executed N times in parallel by N different CUDA threads.
Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
+CUDA Sum of vectors
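The slide's listing is lost; a minimal vector-add kernel consistent with the vecAdd<<<1, N>>> launch shown later might look like:

```cuda
#define N 256

// Each of the N threads adds one pair of elements:
// thread i computes C[i] = A[i] + B[i].
__global__ void vecAdd(float *A, float *B, float *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Launched from the host as: vecAdd<<<1, N>>>(A, B, C);
```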
+Concurrency
Threads within a block can cooperate among themselves by sharing data through some shared memory.
__syncthreads() acts as a barrier at which all threads in the block must wait before any are allowed to proceed.
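As an illustration (not from the slides), a kernel that reverses a small array through shared memory, using __syncthreads() to separate the load phase from the read phase:

```cuda
// Reverse a 64-element array within one thread block.
__global__ void reverse(int *d)
{
    __shared__ int s[64];  // per-block shared memory
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();       // every store completes before any read below
    d[t] = s[63 - t];
}

// Launched as: reverse<<<1, 64>>>(devData);
```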
+Process Hierarchy
+Memory Hierarchy
Per-thread local memory
Per-block shared memory
+Memory Hierarchy
+Host and Device
CUDA assumes that the CUDA threads may execute on a physically separate device that operates as a coprocessor to the host.
CUDA also assumes that both the host and the device maintain their own DRAM, referred to as host memory and device memory.
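A typical host-side sequence under this model (the buffer names here are illustrative) allocates device memory, copies input across, and copies results back:

```cuda
int main(void)
{
    float hostA[256], hostB[256];
    float *devA;
    size_t bytes = 256 * sizeof(float);

    cudaMalloc((void **)&devA, bytes);                       // allocate in device DRAM
    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);  // host memory -> device memory
    /* ... launch kernels that operate on devA ... */
    cudaMemcpy(hostB, devA, bytes, cudaMemcpyDeviceToHost);  // device memory -> host memory
    cudaFree(devA);
    return 0;
}
```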
+Software Stack
Application
CUDA Libraries
CUDA Runtime
CUDA Driver
Host
Device
+Language Extensions
Function type qualifiers to specify whether a function executes on the host or on the device and whether it is callable from the host or from the device.
Variable type qualifiers to specify the memory location on the device of a variable.
+Language Extensions
A new directive to specify how a kernel is executed on the device from the host.
vecAdd<<<1, N>>>(A, B, C);
Four built-in variables that specify the grid and block dimensions and the block and thread indices.
+Function Type Qualifiers
__device__: executed on the device, callable from the device only.
__global__: executed on the device, callable from the host only.
__host__: executed on the host, callable from the host only. Default type.
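A sketch (not from the slides) showing all three qualifiers together:

```cuda
// Device-only helper: callable from other device code only.
__device__ float square(float x) { return x * x; }

// Kernel: runs on the device, launched from the host.
__global__ void squareAll(float *data)
{
    int i = threadIdx.x;
    data[i] = square(data[i]);
}

// Host function (the default when no qualifier is given).
__host__ void launch(float *devData, int n)
{
    squareAll<<<1, n>>>(devData);
}
```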
+Variable Type Qualifiers
__device__: resides in global memory space; accessible from all the threads within the grid.
__constant__: resides in constant memory space; accessible from all the threads within the grid.
__shared__: resides in the shared memory space of a thread block; accessible only from the threads within the block.
+Execution Configuration
Any call to a __global__ function must specify the execution configuration for that call.
The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device.
+Execution Configuration
<<< Dg, Db, Ns, S >>>
Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z is unused;
Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;
Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array; Ns is an optional argument which defaults to 0;
S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.
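For illustration (the kernel and its argument are hypothetical), a launch supplying Dg, Db, and Ns might be written:

```cuda
// Hypothetical kernel using the dynamically allocated shared memory.
__global__ void kernel(float *data)
{
    extern __shared__ float buf[];  // sized by Ns at launch time
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    buf[t] = data[t];
}

void launch(float *devData)
{
    dim3 Dg(4, 2);                    // 4 * 2 = 8 blocks (Dg.z is unused)
    dim3 Db(16, 16);                  // 16 * 16 = 256 threads per block
    size_t Ns = 256 * sizeof(float);  // dynamic shared memory per block
    kernel<<<Dg, Db, Ns>>>(devData);  // S omitted, so stream 0
}
```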
+Built-in Variables
gridDim
This variable is of type dim3 and contains the dimensions of the grid.
blockIdx
This variable is of type uint3 and contains the block index within the grid.
blockDim
This variable is of type dim3 and contains the dimensions of the block.
threadIdx
This variable is of type uint3 and contains the thread index within the block.
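A common pattern (not shown on the slide) combines these variables to compute a global element index, so a launch can span many blocks:

```cuda
// Vector add across many blocks: each thread derives its global
// index from blockIdx, blockDim, and threadIdx.
__global__ void vecAddLarge(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard against the last, partially filled block
        C[i] = A[i] + B[i];
}

// Launched as: vecAddLarge<<<(n + 255) / 256, 256>>>(A, B, C, n);
```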
+Example – Matrix multiplication
Task: C = A(hA, wA) × B(wA, wB)
Each thread block is responsible for computing one square sub-matrix Csub of C;
Each thread within the block is responsible for computing one element of Csub.
+Example – Matrix multiplication
Csub is equal to the product of two rectangular matrices:
the sub-matrix of A of dimension (wA, block_size) and the sub-matrix of B of dimension (block_size, wA)
these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary.
Csub is computed as the sum of the products of these square matrices.
+Example – Matrix multiplication
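The slide's listing is an image; a sketch of the tiled kernel described above, assuming both matrix dimensions are multiples of BLOCK_SIZE, might look like:

```cuda
#define BLOCK_SIZE 16

// Each block computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix Csub of C;
// each thread computes one element of Csub.
__global__ void matMul(float *A, float *B, float *C, int wA, int wB)
{
    // Shared-memory tiles holding one square sub-matrix of A and of B.
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    // Walk across the row of A tiles and down the column of B tiles,
    // accumulating the products of the square sub-matrices.
    for (int m = 0; m < wA / BLOCK_SIZE; ++m) {
        As[threadIdx.y][threadIdx.x] = A[row * wA + m * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * BLOCK_SIZE + threadIdx.y) * wB + col];
        __syncthreads();  // both tiles fully loaded before use

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with the tiles before the next load
    }
    C[row * wB + col] = sum;
}

// Launched as: matMul<<<dim3(wB / BLOCK_SIZE, hA / BLOCK_SIZE),
//                       dim3(BLOCK_SIZE, BLOCK_SIZE)>>>(A, B, C, wA, wB);
```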
+Compilation with NVCC
MS Visual Studio 2005:
Tools > Options > Projects and Solutions > VC++ Directories > Include files / Library files
Point to:
C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc
+CUDA Interoperability
OpenGL
Direct3D