Introduction to C++ AMP: Accelerated Massive Parallelism
It is time to start taking advantage of the computing power of GPUs…
Marc Grégoire, Software Architect
[email protected]
http://www.nuonsoft.com/
http://www.nuonsoft.com/blog/
January 23rd, 2012
Disclaimer: This presentation is based on the released Visual C++ 11 Preview.
Agenda
- Introduction
- Demo: N-Body Simulation
- Technical
  - The C++ AMP Technology
  - Coding Demo: Mandelbrot
- Summary
- Resources
The Power of Heterogeneous Computing
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Simulation in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences
CPUs vs GPUs today
CPU:
- Low memory bandwidth
- Higher power consumption
- Medium level of parallelism
- Deep execution pipelines
- Random accesses
- Supports general code
- Mainstream programming

GPU:
- High memory bandwidth
- Lower power consumption
- High level of parallelism
- Shallow execution pipelines
- Sequential accesses
- Supports data-parallel code
- Niche programming

(images source: AMD)
Tomorrow…
CPUs and GPUs are coming closer together… nothing is settled in this space yet; things are still in motion.

C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow. (images source: AMD)
C++ AMP
- Part of Visual C++ (available now in the Visual C++ 11 Preview)
- Complete Visual Studio integration
- STL-like library for multidimensional data
- Builds on Direct3D

Performance, portability, productivity.
C++ AMP
- Abstracts accelerators; the current version supports DirectX 11 GPUs, and others could be supported later, such as FPGAs or off-site cloud computing
- Supports a heterogeneous mix of accelerators; for example, C++ AMP can use both an NVidia GPU and an AMD GPU in your system at the same time
Demo… N-Body Simulation
Agenda
- Introduction
- Demo: N-Body Simulation
- Technical
  - The C++ AMP Technology
  - Coding Demo: Mandelbrot
- Summary
- Resources
C++ AMP – Basics
- Everything is in the concurrency namespace
- concurrency::array_view: wraps a user-allocated buffer so that C++ AMP can use it
- concurrency::array: allocates a buffer that can be used by C++ AMP
- C++ AMP automatically transfers data between those buffers and memory on the accelerators
C++ AMP – array_view
- Read/write buffer of a given dimensionality, with elements of a given type: array_view<type, dim>
- Read-only buffer: array_view<const type, dim>
- Write-only buffer: array_view<writeonly<type>, dim>
C++ AMP – parallel_for_each()

concurrency::parallel_for_each(grid, lambda);

- grid is the compute domain over which you want to perform arithmetic; for array_view objects, use the array_view's grid property
- The lambda is the code to execute on the accelerator; a restriction, restrict(direct3d), should be applied to it, which is a compile-time check that validates the lambda for execution on Direct3D accelerators
- The lambda accepts one parameter: the index into the compute domain (= grid), with the same dimensionality as the grid; if the grid is 2D, the index is also 2D (index<2> idx), and you access the two dimensions as idx[0] and idx[1]
Example:

array_view<writeonly<int>, 2> a(height, width, pBuffer);
parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) { /* … */ });
C++ AMP – parallel_for_each() – lambda
Several restrictions apply to the code in the lambda:
- It can only call other restrict(direct3d) functions
- It must capture everything by value, except concurrency::array objects
- No recursion, no virtual functions, no pointers to functions, no pointers to pointers, no goto, no try/catch/throw statements, no global variables, no static variables, no dynamic_cast, no typeid, no asm, no varargs, …

For more on restrict(direct3d), see http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/05/restrict-a-key-new-language-feature-introduced-with-c-amp.aspx and MSDN.
C++ AMP – parallel_for_each() – lambda
The lambda executes in parallel with the CPU code that follows parallel_for_each(), until a synchronization point is reached.

Synchronization happens:
- Manually, when calling array_view::synchronize(); a good idea, because you can handle exceptions gracefully
- Automatically, when for example the array_view goes out of scope; a bad idea, because errors will be ignored silently: destructors are not allowed to throw exceptions
- Automatically, when CPU code observes the array_view; not recommended, because you might lose error information if there is no try/catch block catching exceptions at that point
C++ AMP – accelerator / accelerator_view
- The concurrency::accelerator and concurrency::accelerator_view classes can be used to query information about the installed accelerators
- get_accelerators() returns a vector of the accelerators in the system
Coding Demo… Mandelbrot
Mandelbrot – Single-Threaded

for (int y = -halfHeight; y < halfHeight; ++y) {
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x) {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter) {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue) {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }
        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
}
Mandelbrot – PPL

parallel_for(-halfHeight, halfHeight, 1, [&](int y) {
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x) {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter) {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue) {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }
        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
});
Mandelbrot – C++ AMP

array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
    // Formula: zi = z^2 + z0
    int x = idx[1] - halfWidth;
    int y = idx[0] - halfHeight;
    float Z0_i = view_i + y * zoomLevel;
    float Z0_r = view_r + x * zoomLevel;
    float Z_r = Z0_r;
    float Z_i = Z0_i;
    float res = 0.0f;
    for (int iter = 0; iter < maxiter; ++iter) {
        float Z_rSquared = Z_r * Z_r;
        float Z_iSquared = Z_i * Z_i;
        if (Z_rSquared + Z_iSquared > escapeValue) {
            // We escaped
            res = iter + 1 - fast_log(fast_log(fast_sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
            break;
        }
        Z_i = 2 * Z_r * Z_i + Z0_i;
        Z_r = Z_rSquared - Z_iSquared + Z0_r;
    }
    unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
    a[idx] = result;
});
a.synchronize();
Mandelbrot – C++ AMP

Wrap C++ AMP code inside a try/catch block!

try
{
    array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
    parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
        ...
    });
    a.synchronize();
}
catch (const Concurrency::runtime_exception& ex)
{
    MessageBoxA(NULL, ex.what(), "Error", MB_ICONERROR);
}
Summary

- C++ AMP allows anyone to make use of parallel hardware
- Easy to use: high-level abstractions in C++ (not C)
- In-depth support for C++ AMP in Visual Studio, including the debugger
- Abstracts multi-vendor hardware
- The intention is to make C++ AMP an open specification
Resources
- Daniel Moth's blog (PM of C++ AMP), with lots of interesting C++ AMP-related posts: http://www.danielmoth.com/Blog/
- MSDN Native parallelism blog (team blog): http://blogs.msdn.com/b/nativeconcurrency/
- MSDN Dev Center for Parallel Computing: http://msdn.com/concurrency
- MSDN Forums to ask questions: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads
Questions?

I would like to thank Daniel Moth from Microsoft for his input on this presentation.