introduction to c++ amp a ccelerated m assive p arallelism

24
Introduction to C++ AMP Accelerated Massive Parallelism It is time to start taking advantage of the computing power of GPUs… Marc Grégoire Software Architect marc.gregoire@nuonsoft. com http://www.nuonsoft.com / http://www.nuonsoft.com /blog January 23 rd 2012 Disclaimer: This presentation was made based on the released Visual C++ 11 Preview

Upload: bowie

Post on 14-Jan-2016

50 views

Category:

Documents


4 download

DESCRIPTION

Introduction to C++ AMP A ccelerated M assive P arallelism. Marc Grégoire Software Architect [email protected] http://www.nuonsoft.com / http://www.nuonsoft.com/blog /. Disclaimer: This presentation was made based on the released Visual C++ 11 Preview. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Introduction to C++ AMPAccelerated Massive Parallelism

It is time to start taking advantage of the computing power of GPUs…

Marc GrégoireSoftware [email protected] http://www.nuonsoft.com/ http://www.nuonsoft.com/blog/ January 23rd

2012

Disclaimer: This presentation was made based on the released Visual C++ 11 Preview

Page 2: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Agenda

Introduction Demo: N-Body Simulation Technical

The C++ AMP Technology Coding Demo: Mandelbrot

Summary Resources

Page 3: Introduction to C++ AMP A ccelerated  M assive  P arallelism

The Power of Heterogeneous Computing

146X

Interactive visualization of

volumetric white matter

connectivity

36X

Ionic placement for molecular

dynamics simulation on

GPU

19X

Transcoding HD video stream to

H.264

Simulation in Matlab

using .mex file CUDA function

100X

Astrophysics N-body simulation

149X

Financial simulation of LIBOR model

with swaptions

47X

GLAME@lab: An M-script API for linear Algebra operations on

GPU

20X

Ultrasound medical imaging

for cancer diagnostics

24X

Highly optimized object oriented

molecular dynamics

30X

Cmatch exact string matching to find similar

proteins and gene sequences

17X

source

Page 4: Introduction to C++ AMP A ccelerated  M assive  P arallelism

CPUs vs GPUs today

CPU Low memory bandwidth Higher power consumption Medium level of parallelism Deep execution pipelines Random accesses Supports general code Mainstream programming

GPU High memory bandwidth Lower power consumption High level of parallelism Shallow execution

pipelines Sequential accesses Supports data-parallel

code Niche programmingimages source: AMD

Page 5: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Tomorrow…

CPUs and GPUs coming closer together… …nothing settled in this space yet, things still in

motion…

C++ AMP is designed as a mainstream solution not only for today, but also for tomorrow images source: AMD

Page 6: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP

Part of Visual C++ (available nowin the Visual C++ 11 preview)

Complete Visual Studio integration STL-like library for multidimensional data Builds on Direct3Dperformance

portability

productivity

Page 7: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP

Abstracts accelerators Current version supports DirectX 11 GPUs Could support others like FPGA’s, off-site

cloud computing… Support heterogeneous mix of accelerators

Example: C++ AMP can use both an NVidia and AMD GPU in your system at the same time

Page 8: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Demo…N-Body

Simulation

Page 9: Introduction to C++ AMP A ccelerated  M assive  P arallelism

N-Body Simulation Demo

Page 10: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Agenda

Introduction Demo: N-Body Simulation Technical

The C++ AMP Technology Coding Demo: Mandelbrot

Summary Resources

Page 11: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP – Basics

Everything is in the concurrency namespace concurrency::array_view

Wraps a user-allocated buffer so that C++ AMP can use it

concurrency::array Allocates a buffer that can be used by C++ AMP

C++ AMP automatically transfers data between those buffers and memory on the accelerators

Page 12: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP – array_view

Read/write buffer of given dimensionality, with elements of given type: array_view<type, dim>

Read-only buffer: array_view<const type, dim>

Write-only buffer: array_view<writeonly<type>, dim>

Page 13: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP – parallel_for_each() concurrency::parallel_for_each(grid, lambda);

grid is the compute domain over which you want to perform arithmetic For array_view objects, use the array_view grid property

The lambda is the code to execute on the accelerator A restriction should be applied to the lambda, restrict(direct3d), which is a

compile-time check to validate the lambda for execution on direct3d accelerators

Accepts 1 parameter: the index into the compute domain (= grid) Same dimensionality as the grid, so if grid is 2D, index is also 2D:

index<2> idx Access the two dimensions as idx[0] and idx[1]

Example:array_view<writeonly<int>, 2> a(height, width, pBuffer);parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) { /* … */ });

Page 14: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP – parallel_for_each() – lambda

Several restrictions apply to the code in the lambda: Can only call other restrict(direct3d) functions Must capture everything by value, except concurrency::array

objects No recursion, no virtual functions, no pointers to functions, no

pointers to pointers, no goto, no try/catch/throw statements, no global variables, no static variables, no dynamic_cast, no typeid, no asm, no varargs, …

restrict(direct3d), see http://blogs.msdn.com/b/nativeconcurrency/archive/2011/09/05/restrict-a-key-new-language-feature-introduced-with-c-amp.aspx + MSDN

Page 15: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP – parallel_for_each() – lambda

The lambda executes in parallel with CPU code that follows parallel_for_each() until a synchronization point is reached

Synchronization: Manually when calling array_view::synchronize()

Good idea, because you can handle exceptions gracefully Automatically when for example array_view goes out of scope

Bad idea, errors will be ignored silently because destructors are not allowed to throw exceptions

Automatically, when CPU code observes the array_view Not recommended, because you might lose error information

if there is no try/catch block catching exceptions at that point

Page 16: Introduction to C++ AMP A ccelerated  M assive  P arallelism

C++ AMP – accelerator / accelerator_view

The concurrency::accelerator and concurrency::accelerator_view classes can be used to query for information on the installed accelerators

get_accelerators() returns a vector of accelerators in the system

Page 17: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Coding Demo…Mandelbrot

Page 18: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Mandelbrot – Single-Threadedfor (int y = -halfHeight; y < halfHeight; ++y) { // Formula: zi = z^2 + z0 float Z0_i = view_i + y * zoomLevel; for (int x = -halfWidth; x < halfWidth; ++x) { float Z0_r = view_r + x * zoomLevel; float Z_r = Z0_r; float Z_i = Z0_i; float res = 0.0f; for (int iter = 0; iter < maxiter; ++iter) { float Z_rSquared = Z_r * Z_r; float Z_iSquared = Z_i * Z_i; if (Z_rSquared + Z_iSquared > escapeValue) { // We escaped res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2; break; } Z_i = 2 * Z_r * Z_i + Z0_i; Z_r = Z_rSquared - Z_iSquared + Z0_r; } unsigned __int32 result = RGB(res * 50, res * 50, res * 50); pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result; }}

Page 19: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Mandelbrot – PPLparallel_for(-halfHeight, halfHeight, 1, [&](int y) { // Formula: zi = z^2 + z0 float Z0_i = view_i + y * zoomLevel; for (int x = -halfWidth; x < halfWidth; ++x) { float Z0_r = view_r + x * zoomLevel; float Z_r = Z0_r; float Z_i = Z0_i; float res = 0.0f; for (int iter = 0; iter < maxiter; ++iter) { float Z_rSquared = Z_r * Z_r; float Z_iSquared = Z_i * Z_i; if (Z_rSquared + Z_iSquared > escapeValue) { // We escaped res = iter + 1 - log(log(sqrt(Z_rSquared + Z_iSquared))) * invLogOf2; break; } Z_i = 2 * Z_r * Z_i + Z0_i; Z_r = Z_rSquared - Z_iSquared + Z0_r; } unsigned __int32 result = RGB(res * 50, res * 50, res * 50); pBuffer[(y + halfHeight) * m_nBuffWidth + (x + halfWidth)] = result; }});

Page 20: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Mandelbrot – C++ AMParray_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer);parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) { // Formula: zi = z^2 + z0 int x = idx[1] - halfWidth; int y = idx[0] - halfHeight; float Z0_i = view_i + y * zoomLevel; float Z0_r = view_r + x * zoomLevel; float Z_r = Z0_r; float Z_i = Z0_i; float res = 0.0f; for (int iter = 0; iter < maxiter; ++iter) { float Z_rSquared = Z_r * Z_r; float Z_iSquared = Z_i * Z_i; if (Z_rSquared + Z_iSquared > escapeValue) { // We escaped res = iter + 1 - fast_log(fast_log(fast_sqrt(Z_rSquared + Z_iSquared))) * invLogOf2; break; } Z_i = 2 * Z_r * Z_i + Z0_i; Z_r = Z_rSquared - Z_iSquared + Z0_r; } unsigned __int32 result = RGB(res * 50, res * 50, res * 50); a[idx] = result;});a.synchronize();

Page 21: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Mandelbrot – C++ AMP Wrap C++ AMP code inside a try-catch block!

try{ array_view<writeonly<unsigned __int32>, 2> a(m_nBuffHeight, m_nBuffWidth, pBuffer); parallel_for_each(a.grid, [=](index<2> idx) restrict(direct3d) {

...

}); a.synchronize();}catch (const Concurrency::runtime_exception& ex){ MessageBoxA(NULL, ex.what(), "Error", MB_ICONERROR);}

Page 22: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Summary C++ AMP allows anyone to make use of parallel

hardware Easy-to-use High-level abstractions in C++ (not C) In-depth support of C++ AMP in Visual Studio,

including the debugger Abstracts multi-vendor hardware

Intention is to make C++ AMP an open specification

Page 23: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Resources

Daniel Moth's blog (PM of C++ AMP), lots of interesting C++ AMP related posts http://www.danielmoth.com/Blog/

MSDN Native parallelism blog (team blog) http://blogs.msdn.com/b/nativeconcurrency/

MSDN Dev Center for Parallel Computing http://msdn.com/concurrency

MSDN Forums to ask questions http://social.msdn.microsoft.com/Forums/en/parallelcppnative/thr

eads

Page 24: Introduction to C++ AMP A ccelerated  M assive  P arallelism

Questions

?I would like to thank Daniel Moth

from Microsoft for his inputs for this presentation.