Download - Approaches to GPU Computing
![Page 1: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/1.jpg)
Approaches to GPU ComputingLibraries, OpenACC Directives, and Languages
![Page 2: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/2.jpg)
GPUCPUAdd GPUs: Accelerate Science Applications
![Page 3: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/3.jpg)
146X
Medical Imaging U of Utah
36X
Molecular DynamicsU of Illinois, Urbana
18X
Video TranscodingElemental Tech
50X
Matlab ComputingAccelerEyes
100X
AstrophysicsRIKEN
149X
Financial SimulationOxford
47X
Linear AlgebraUniversidad Jaime
20X
3D UltrasoundTechniscan
130X
Quantum ChemistryU of Illinois, Urbana
30X
Gene SequencingU of Maryland
GPUs Accelerate Science
![Page 4: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/4.jpg)
Small Changes, Big Speed-upApplication Code
+
GPU CPUUse GPU to ParallelizeCompute-Intensive
FunctionsRest of Sequential
CPU Code
![Page 5: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/5.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in” Acceleration
Programming Languages
MaximumPerformance
OpenACC Directives
Easily Accelerate Applications
![Page 6: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/6.jpg)
GPU Accelerated Libraries“Drop-in” Acceleration for your
Applications
NVIDIA cuBLAS NVIDIA cuRAND
NVIDIA cuSPARSE NVIDIA NPP
Vector SignalImage
ProcessingMatrix Algebra on
GPU and Multicore NVIDIA cuFFT
C++ Templated Parallel Algorithms Sparse Linear
AlgebraIMSL Library
GPU Accelerated
Linear Algebra
Building-block Algorithms
![Page 7: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/7.jpg)
OpenACC Directives
Program myscience ... serial code ...!$acc kernels do k = 1,n1 do i = 1,n2 ... parallel code ... enddo enddo!$acc end kernels ...End Program myscience
CPU GPU
Your original Fortran or C
code
Simple Compiler hints
Compiler Parallelizes code
Works on many-core GPUs & multicore CPUs
OpenACCCompil
erHint
![Page 8: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/8.jpg)
Recommended ApproachesMATLAB, Mathematica, LabVIEWNumerical analytics
OpenACC, CUDA FortranFortran
OpenACC, CUDA CC
Thrust, CUDA C++C++
PyCUDAPython
GPU.NETC#
![Page 9: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/9.jpg)
CUDA-Accelerated Libraries
Drop-in Acceleration
![Page 10: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/10.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in” Acceleration
Programming Languages
OpenACCDirectives
MaximumFlexibility
Easily AccelerateApplications
![Page 11: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/11.jpg)
Easy, High-Quality Acceleration
Ease of use: Using libraries enables GPU acceleration without in-depth knowledge of GPU programming
“Drop-in”: Many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes
Quality: Libraries offer high-quality implementations of functions encountered in a broad range of applications
Performance: NVIDIA libraries are tuned by experts
![Page 12: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/12.jpg)
NVIDIA cuBLAS NVIDIA cuRAND NVIDIA cuSPARSE NVIDIA NPP
Vector SignalImage
ProcessingGPU
AcceleratedLinear Algebra
Matrix Algebra on GPU and Multicore NVIDIA cuFFT
C++ STL Features for
CUDAIMSL LibraryBuilding-block Algorithms for
CUDA
Some GPU-accelerated Libraries
ArrayFire Matrix Computations
![Page 13: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/13.jpg)
3 Steps to CUDA-accelerated application
Step 1: Substitute library calls with equivalent CUDA library calls
saxpy ( … ) cublasSaxpy ( … )
Step 2: Manage data locality- with CUDA: cudaMalloc(), cudaMemcpy(), etc.
- with CUBLAS: cublasAlloc(), cublasSetVector(), etc.
Step 3: Rebuild and link the CUDA-accelerated librarynvcc myobj.o –l cublas
![Page 14: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/14.jpg)
int N = 1 << 20;
// Perform SAXPY on 1M elements: y[]=a*x[]+y[]saxpy(N, 2.0, d_x, 1, d_y, 1);
Drop-In Acceleration (Step 1)
![Page 15: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/15.jpg)
int N = 1 << 20;
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
Drop-In Acceleration (Step 1)
Add “cublas” prefix and use device variables
![Page 16: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/16.jpg)
int N = 1 << 20;cublasInit();
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasShutdown();
Drop-In Acceleration (Step 2)
Initialize CUBLAS
Shut down CUBLAS
![Page 17: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/17.jpg)
int N = 1 << 20;cublasInit();cublasAlloc(N, sizeof(float), (void**)&d_x);cublasAlloc(N, sizeof(float), (void*)&d_y);
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasFree(d_x);cublasFree(d_y);cublasShutdown();
Drop-In Acceleration (Step 2)
Allocate device vectors
Deallocate device vectors
![Page 18: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/18.jpg)
int N = 1 << 20;cublasInit();cublasAlloc(N, sizeof(float), (void**)&d_x);cublasAlloc(N, sizeof(float), (void*)&d_y);
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasFree(d_x);cublasFree(d_y);cublasShutdown();
Drop-In Acceleration (Step 2)
Transfer data to GPU
Read data back GPU
![Page 19: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/19.jpg)
Explore the CUDA (Libraries) Ecosystem
CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone:
developer.nvidia.com/cuda-tools-ecosystem
![Page 20: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/20.jpg)
GPU Computing with OpenACC Directives
![Page 21: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/21.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in” Acceleration
Programming Languages
OpenACCDirectives
MaximumFlexibility
Easily AccelerateApplications
![Page 22: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/22.jpg)
OpenACC Directives
Program myscience ... serial code ...!$acc kernels do k = 1,n1 do i = 1,n2 ... parallel code ... enddo enddo!$acc end kernels ...End Program myscience
CPU GPU
Your original Fortran or C
code
Simple Compiler hints
Compiler Parallelizes code
Works on many-core GPUs & multicore CPUs
OpenACCCompiler
Hint
![Page 23: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/23.jpg)
OpenACC Open Programming Standard for Parallel Computing“OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board
OpenACC Standard
![Page 24: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/24.jpg)
Easy: Directives are the easy path to accelerate computeintensive applications
Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
Powerful: GPU Directives allow complete access to the massive parallel power of a GPU
OpenACC The Standard for GPU Directives
![Page 25: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/25.jpg)
Two Basic Steps to Get Started
Step 1: Annotate source code with directives:
Step 2: Compile & run:pgf90 -ta=nvidia -Minfo=accel file.f
!$acc data copy(util1,util2,util3) copyin(ip,scp2,scp2i) !$acc parallel loop … !$acc end parallel!$acc end data
![Page 26: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/26.jpg)
OpenACC Directives Example!$acc data copy(A,Anew) iter=0 do while ( err > tol .and. iter < iter_max )
iter = iter +1 err=0._fp_kind
!$acc kernels do j=1,m do i=1,n Anew(i,j) = .25_fp_kind *( A(i+1,j ) + A(i-1,j ) & +A(i ,j-1) + A(i ,j+1)) err = max( err, Anew(i,j)-A(i,j)) end do end do
!$acc end kernels IF(mod(iter,100)==0 .or. iter == 1) print *, iter, err A= Anew
end do
!$acc end data
Copy arrays into GPU memory within data region
Parallelize code inside region
Close off parallel region
Close off data region, copy data back
![Page 27: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/27.jpg)
Real-Time Object Detection
Global Manufacturer of Navigation Systems
Valuation of Stock Portfolios using Monte
Carlo Global Technology Consulting
Company
Interaction of Solvents and Biomolecules
University of Texas at San Antonio
Directives: Easy & Powerful
Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications. ”
-- Developer at the Global Manufacturer of Navigation Systems
“5x in 40 Hours 2x in 4 Hours 5x in 8 Hours
![Page 28: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/28.jpg)
Start Now with OpenACC Directives
Free trial license to PGI Accelerator
Tools for quick ramp
www.nvidia.com/gpudirectives
Sign up for a free trial of the directives compiler now!
![Page 29: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/29.jpg)
Programming Languagesfor GPU Computing
![Page 30: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/30.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in” Acceleration
Programming Languages
OpenACCDirectives
MaximumFlexibility
Easily AccelerateApplications
![Page 31: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/31.jpg)
GPU Programming Languages
OpenACC, CUDA FortranFortran
OpenACC, CUDA CC
Thrust, CUDA C++C++
PyCUDAPython
GPU.NETC#
MATLAB, Mathematica, LabVIEWNumerical analytics
![Page 32: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/32.jpg)
void saxpy_serial(int n, float a, float *x, float *y){ for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i];}
// Perform SAXPY on 1M elementssaxpy_serial(4096*256, 2.0, x, y);
__global__ void saxpy_parallel(int n, float a, float *x, float *y){ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = a*x[i] + y[i];}
// Perform SAXPY on 1M elementssaxpy_parallel<<<4096,256>>>(n,2.0,x,y);
CUDA CStandard C Code Parallel C Code
http://developer.nvidia.com/cuda-toolkit
![Page 33: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/33.jpg)
CUDA C++: Develop Generic Parallel Code
Class hierarchies
__device__ methods
Templates
Operator overloading
Functors (function objects)
Device-side new/delete
More…
template <typename T>struct Functor { __device__ Functor(_a) : a(_a) {} __device__ T operator(T x) { return a*x; } T a;}
template <typename T, typename Oper>__global__ void kernel(T *output, int n) { Oper op(3.7); output = new T[n]; // dynamic allocation int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) output[i] = op(i); // apply functor}
http://developer.nvidia.com/cuda-toolkit
CUDA C++ features enable sophisticated and flexible
applications and middleware
![Page 34: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/34.jpg)
// generate 32M random numbers on hostthrust::host_vector<int> h_vec(32 << 20);thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer data to device (GPU)thrust::device_vector<int> d_vec = h_vec;
// sort data on device thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to hostthrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
Rapid Parallel C++ Development
Resembles C++ STLHigh-level interface
Enhances developer productivityEnables performance portability between GPUs and multicore CPUs
FlexibleCUDA, OpenMP, and TBB backendsExtensible and customizableIntegrates with existing software
Open source
http://developer.nvidia.com/thrust or http://thrust.googlecode.com
![Page 35: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/35.jpg)
CUDA FortranProgram GPU using Fortran
Key language for HPCSimple language extensions
Kernel functionsThread / block IDsDevice & data management Parallel loop directives
Familiar syntaxUse allocate, deallocateCopy CPU-to-GPU with assignment (=)
module mymodule contains attributes(global) subroutine saxpy(n,a,x,y) real :: x(:), y(:), a, integer n, i attributes(value) :: a, n i = threadIdx%x+(blockIdx%x-1)*blockDim%x if (i<=n) y(i) = a*x(i) + y(i); end subroutine saxpyend module mymodule program main use cudafor; use mymodule real, device :: x_d(2**20), y_d(2**20) x_d = 1.0; y_d = 2.0 call saxpy<<<4096,256>>>(2**20,3.0,x_d,y_d,) y = y_d write(*,*) 'max error=', maxval(abs(y-5.0))end program main
http://developer.nvidia.com/cuda-fortran
![Page 36: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/36.jpg)
More Programming Languages
PyCUDA
NumericalAnalytics
C# .NET GPU.NET
Python
![Page 37: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/37.jpg)
MATLABhttp://www.mathworks.com/discovery/matlab-gpu.html
Get Started TodayThese languages are supported on all CUDA-capable GPUs.You might already have a CUDA-capable GPU in your laptop or desktop PC!
CUDA C/C++http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Libraryhttp://developer.nvidia.com/thrust
CUDA Fortranhttp://developer.nvidia.com/cuda-toolkit
GPU.NEThttp://tidepowerd.com
PyCUDA (Python)http://mathema.tician.de/software/pycuda
Mathematicahttp://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/
![Page 38: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/38.jpg)
CUDA Registered Developer ProgramAll GPGPU developers should become NVIDIA Registered Developers
Benefits include:Early Access to Pre-Release Software
Beta software and librariesCUDA 5.5 Release Candidate available now
Submit & Track Issues and BugsInteract directly with NVIDIA QA engineers
BenefitsExclusive Q&A Webinars with NVIDIA EngineeringExclusive deep dive CUDA training webinars In-depth engineering presentations on pre-release software
Sign up Now: www.nvidia.com/ParallelDeveloper
![Page 39: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/39.jpg)
GPU Technology Conference 2014May 24-77 | San Jose, CAThe one event you can’t afford to miss Learn about leading-edge advances in GPU computing Explore the research as well as the commercial applications Discover advances in computational visualization Take a deep dive into parallel programming
Ways to participate Speak – share your work and gain exposure as a thought leader Register – learn from the experts and network with your peers Exhibit/Sponsor – promote your company as a key player in the GPU
ecosystem www.gputechconf.com
![Page 40: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/40.jpg)
WHAT IS GPU COMPUTING?
![Page 41: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/41.jpg)
What is GPU Computing?
Computing with CPU + GPUHeterogeneous Computing
x86 GPUPCIe bus
![Page 42: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/42.jpg)
Low Latency or High Throughput?
CPUOptimised for low-latency access to cached data setsControl logic for out-of-order and speculative execution
GPUOptimised for data-parallel, throughput computationArchitecture tolerant of memory latencyMore transistors dedicated to computation
![Page 43: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/43.jpg)
Kepler GK110 Block DiagramArchitecture
7.1B Transistors15 SMX units> 1 TFLOP FP641.5 MB L2 Cache384-bit GDDR5PCI Express Gen3
![Page 44: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/44.jpg)
CUDA ARCHITECTURE
![Page 45: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/45.jpg)
CUDA Parallel Computing Architecture
Parallel computing architecture and programming model
Includes a CUDA C compiler, support for OpenCL and DirectCompute
Architected to natively support multiple computational interfaces (standard languages and APIs)
NVIDIA GPU with the CUDA parallel computing architecture
CUDA C OpenCL™ DirectCompute CUDA Fortran
GPU Computing Application
C C++ Fortran Java C# …
![Page 46: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/46.jpg)
CUDA PROGRAMMING MODEL
![Page 47: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/47.jpg)
Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute,caching data on chip for performance
3. Copy results from GPU memory to CPU memory
PCI Bus
![Page 48: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/48.jpg)
CUDA Kernels
Parallel portion of application: execute as a kernelEntire GPU executes kernel, many threads
CUDA threads:LightweightFast switching1000s execute simultaneously
CPU Host Executes functions
GPU Device Executes kernels
![Page 49: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/49.jpg)
CUDA Kernels: Parallel Threads
A kernel is an array of threads, executed in parallel
All threads execute the same code
Each thread has an IDSelect input/output dataControl decisions
float x = input[threadID];float y = func(x);output[threadID] = y;
![Page 50: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/50.jpg)
Key Idea of CUDA
Write a single-threaded program parameterized in terms of the thread ID.Use the thread ID to select a subset of the data for processing, and to make control flow decisions.Launch a number of threads, such that the ensemble of threads processes the whole data set.
![Page 51: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/51.jpg)
CUDA Kernels: Subdivide into Blocks
![Page 52: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/52.jpg)
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
![Page 53: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/53.jpg)
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocksBlocks are grouped into a grid
![Page 54: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/54.jpg)
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocksBlocks are grouped into a gridA kernel is executed as a grid of blocks of threads
![Page 55: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/55.jpg)
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocksBlocks are grouped into a gridA kernel is executed as a grid of blocks of threads
![Page 56: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/56.jpg)
Communication Within a Block
Threads may need to cooperateMemory accessesShare results
Cooperate using shared memoryAccessible by all threads within a block
Restriction to “within a block” permits scalabilityFast communication between N threads is not feasible when N large
![Page 57: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/57.jpg)
Transparent Scalability – G841 2 3 4 5 6 7 8 9 10 11 12
1 2
3 4
5 6
7 8
9 10
11 12
![Page 58: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/58.jpg)
Transparent Scalability – G801 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8
9 10 11 12
![Page 59: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/59.jpg)
Transparent Scalability – GT2001 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12 ...Idle Idle Idle
![Page 60: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/60.jpg)
MEMORY MODEL
![Page 61: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/61.jpg)
Memory hierarchy
Thread:Registers
![Page 62: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/62.jpg)
Memory hierarchy
Thread:Registers
Thread:Local memory
![Page 63: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/63.jpg)
Memory hierarchy
Thread:Registers
Thread:Local memory
Block of threads:Shared memory
![Page 64: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/64.jpg)
Memory hierarchy
Thread:Registers
Thread:Local memory
Block of threads:Shared memory
![Page 65: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/65.jpg)
Memory hierarchy
Thread:Registers
Thread:Local memory
Block of threads:Shared memory
All blocks:Global memory
![Page 66: Approaches to GPU Computing](https://reader035.vdocument.in/reader035/viewer/2022062520/56815fd6550346895dced866/html5/thumbnails/66.jpg)
Memory hierarchy
Thread:Registers
Thread:Local memory
Block of threads:Shared memory
All blocks:Global memory