Introduction to C++ AMP: Accelerated Massive Parallelism
It is time to start taking advantage of the computing power of GPUs…
Marc Grégoire, Software Architect
[email protected]
http://www.nuonsoft.com/
http://www.nuonsoft.com/blog/
January 23rd, 2012
Disclaimer: This presentation is based on the released Visual C++ 11 Preview.
Agenda
- Introduction
- Demo: N-Body Simulation
- Technical
  - The C++ AMP Technology
  - Coding Demo: Mandelbrot
- Summary
- Resources
The Power of Heterogeneous Computing
- 146X: Interactive visualization of volumetric white matter connectivity
- 36X: Ionic placement for molecular dynamics simulation on GPU
- 19X: Transcoding HD video stream to H.264
- 17X: Simulation in Matlab using .mex file CUDA function
- 100X: Astrophysics N-body simulation
- 149X: Financial simulation of LIBOR model with swaptions
- 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
- 20X: Ultrasound medical imaging for cancer diagnostics
- 24X: Highly optimized object-oriented molecular dynamics
- 30X: Cmatch exact string matching to find similar proteins and gene sequences
CPUs vs GPUs today
CPU:
- Low memory bandwidth
- Higher power consumption
- Medium level of parallelism
- Deep execution pipelines
- Random accesses
- Supports general code
- Mainstream programming

GPU:
- High memory bandwidth
- Lower power consumption
- High level of parallelism
- Shallow execution pipelines
- Sequential accesses
- Supports data-parallel code
- Niche programming

(images source: AMD)
Tomorrow…
CPUs and GPUs are coming closer together… nothing is settled in this space yet; things are still in motion.

C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow. (images source: AMD)
C++ AMP
- Part of Visual C++ (available now in the Visual C++ 11 Preview)
- Complete Visual Studio integration
- STL-like library for multidimensional data
- Builds on Direct3D

Performance, portability, productivity.
C++ AMP
- Abstracts accelerators; the current version supports DirectX 11 GPUs, and others could be supported later, such as FPGAs or off-site cloud computing
- Supports a heterogeneous mix of accelerators; for example, C++ AMP can use both an NVidia GPU and an AMD GPU in your system at the same time
Demo… N-Body Simulation
Agenda
- Introduction
- Demo: N-Body Simulation
- Technical
  - The C++ AMP Technology
  - Coding Demo: Mandelbrot
- Summary
- Resources
C++ AMP – Basics
- Everything is in the concurrency namespace
- concurrency::array_view: wraps a user-allocated buffer so that C++ AMP can use it
- concurrency::array: allocates a buffer that can be used by C++ AMP
- C++ AMP automatically transfers data between those buffers and memory on the accelerators
C++ AMP – array_view
- Read/write buffer of a given dimensionality, with elements of a given type: array_view<type, dim>
- Read-only buffer: array_view<const type, dim>
- Write-only buffer: array_view<writeonly<type>, dim>
C++ AMP – parallel_for_each()

concurrency::parallel_for_each(grid, lambda);

- grid is the compute domain over which you want to perform arithmetic; for array_view objects, use the array_view's grid property
- The lambda is the code to execute on the accelerator; a restriction, restrict(direct3d), should be applied to it, which is a compile-time check that validates the lambda for execution on Direct3D accelerators
- The lambda accepts one parameter: the index into the compute domain (= grid), with the same dimensionality as the grid; if the grid is 2D, the index is also 2D (index<2> idx), and you access the two dimensions as idx[0] and idx[1]
Example:

array_view<writeonly<int>, 2> a(height, width, pBuffer);
parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) { /* … */ });
C++ AMP – parallel_for_each() – lambda
Several restrictions apply to the code in the lambda:
- It can only call other restrict(direct3d) functions
- It must capture everything by value, except concurrency::array objects
- No recursion, no virtual functions, no pointers to functions, no pointers to pointers, no goto, no try/catch/throw statements, no global variables, no static variables, no dynamic_cast, no typeid, no asm, no varargs, …

For more on restrict(direct3d), see http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/05/restrict-a-key-new-language-feature-introduced-with-c-amp.aspx and MSDN.
C++ AMP – parallel_for_each() – lambda
The lambda executes in parallel with the CPU code that follows parallel_for_each(), until a synchronization point is reached.

Synchronization happens:
- Manually, when calling array_view::synchronize(); a good idea, because you can handle exceptions gracefully
- Automatically, when for example the array_view goes out of scope; a bad idea, because errors will be ignored silently: destructors are not allowed to throw exceptions
- Automatically, when CPU code observes the array_view; not recommended, because you might lose error information if there is no try/catch block catching exceptions at that point
C++ AMP – accelerator / accelerator_view
- The concurrency::accelerator and concurrency::accelerator_view classes can be used to query information about the installed accelerators
- get_accelerators() returns a vector of the accelerators in the system
Coding Demo… Mandelbrot
Mandelbrot – Single-Threaded

for (int y = -halfHeight; y < halfHeight; ++y) {
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x) {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter) {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue) {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }
        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
}
Mandelbrot – PPL

parallel_for(-halfHeight, halfHeight, 1, [&](int y) {
    // Formula: zi = z^2 + z0
    float Z0_i = view_i + y * zoomLevel;
    for (int x = -halfWidth; x < halfWidth; ++x) {
        float Z0_r = view_r + x * zoomLevel;
        float Z_r = Z0_r;
        float Z_i = Z0_i;
        float res = 0.0f;
        for (int iter = 0; iter < maxiter; ++iter) {
            float Z_rSquared = Z_r * Z_r;
            float Z_iSquared = Z_i * Z_i;
            if (Z_rSquared + Z_iSquared > escapeValue) {
                // We escaped
                res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
                break;
            }
            Z_i = 2 * Z_r * Z_i + Z0_i;
            Z_r = Z_rSquared - Z_iSquared + Z0_r;
        }
        unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
        pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result;
    }
});
Mandelbrot – C++ AMP

array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
    // Formula: zi = z^2 + z0
    int x = idx[1] - halfWidth;
    int y = idx[0] - halfHeight;
    float Z0_i = view_i + y * zoomLevel;
    float Z0_r = view_r + x * zoomLevel;
    float Z_r = Z0_r;
    float Z_i = Z0_i;
    float res = 0.0f;
    for (int iter = 0; iter < maxiter; ++iter) {
        float Z_rSquared = Z_r * Z_r;
        float Z_iSquared = Z_i * Z_i;
        if (Z_rSquared + Z_iSquared > escapeValue) {
            // We escaped
            res = iter + 1 - fast_log(fast_log(fast_sqrt(Z_rSquared + Z_iSquared))) * invLogOf2;
            break;
        }
        Z_i = 2 * Z_r * Z_i + Z0_i;
        Z_r = Z_rSquared - Z_iSquared + Z0_r;
    }
    unsigned __int32 result = RGB(res * 50, res * 50, res * 50);
    a[idx] = result;
});
a.synchronize();
Mandelbrot – C++ AMP

Wrap C++ AMP code inside a try/catch block!

try
{
    array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);
    parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {
        ...
    });
    a.synchronize();
}
catch (const Concurrency::runtime_exception& ex)
{
    MessageBoxA(NULL, ex.what(), "Error", MB_ICONERROR);
}
Summary

- C++ AMP allows anyone to make use of parallel hardware
- Easy to use: high-level abstractions in C++ (not C)
- In-depth support for C++ AMP in Visual Studio, including the debugger
- Abstracts multi-vendor hardware
- The intention is to make C++ AMP an open specification
Resources
- Daniel Moth's blog (PM of C++ AMP), with lots of interesting C++ AMP-related posts: http://www.danielmoth.com/Blog/
- MSDN Native parallelism blog (team blog): http://blogs.msdn.com/b/nativeconcurrency/
- MSDN Dev Center for Parallel Computing: http://msdn.com/concurrency
- MSDN Forums to ask questions: http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads
Questions?

I would like to thank Daniel Moth from Microsoft for his input on this presentation.