Максим Гольдин


TRANSCRIPT

Page 1: Максим Гольдин
Page 2: Максим Гольдин

DEV301

Introduction to the C++ AMP heterogeneous computing platform and the GPU tools in Visual Studio 11
Максим Гольдин
Senior Developer
Microsoft Corporation

Page 3: Максим Гольдин

Agenda

Context
Code
IDE
Summary

Page 4: Максим Гольдин

demo

N-Body Simulation

Page 5: Максим Гольдин
Page 6: Максим Гольдин

The Power of Heterogeneous Computing

146X: Interactive visualization of volumetric white matter connectivity

36X: Ionic placement for molecular dynamics simulation on GPU

19X: Transcoding HD video stream to H.264

17X: Simulation in Matlab using .mex file CUDA function

100X: Astrophysics N-body simulation

149X: Financial simulation of LIBOR model with swaptions

47X: GLAME@lab: an M-script API for linear algebra operations on GPU

20X: Ultrasound medical imaging for cancer diagnostics

24X: Highly optimized object oriented molecular dynamics

30X: Cmatch exact string matching to find similar proteins and gene sequences

source

Page 7: Максим Гольдин

CPUs vs GPUs today

CPU

Low memory bandwidth
Higher power consumption
Medium level of parallelism
Deep execution pipelines
Random accesses
Supports general code
Mainstream programming

GPU

High memory bandwidth
Lower power consumption
High level of parallelism
Shallow execution pipelines
Sequential accesses
Supports data-parallel code
Niche programming

images source: AMD

Page 8: Максим Гольдин

Tomorrow…

CPUs and GPUs are coming closer together…
…nothing is settled in this space, things are still in motion…

C++ Accelerated Massive Parallelism is designed as a mainstream solution not only for today, but also for tomorrow

image source: AMD

Page 9: Максим Гольдин

C++ AMP

Part of Visual C++
Visual Studio integration
STL-like library for multidimensional data
Builds on Direct3D

performance
portability
productivity

Page 10: Максим Гольдин

Agenda checkpoint

Context
Code
IDE
Summary

Page 11: Максим Гольдин

Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?

Page 12: Максим Гольдин

Hello World: Array Addition

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    for (int i = 0; i < n; i++)
    {
        pC[i] = pA[i] + pB[i];
    }
}

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(sum.grid, [=](index<1> i) restrict(direct3d)
    {
        sum[i] = a[i] + b[i];
    });
}


Page 13: Максим Гольдин

Basic Elements of C++ AMP coding

void AddArrays(int n, int * pA, int * pB, int * pC)
{
    array_view<int,1> a(n, pA);
    array_view<int,1> b(n, pB);
    array_view<int,1> sum(n, pC);
    parallel_for_each(sum.grid, [=](index<1> idx) restrict(direct3d)
    {
        sum[idx] = a[idx] + b[idx];
    });
}

array_view variables captured and associated data copied to accelerator (on demand)

parallel_for_each: execute the lambda on the accelerator once per thread

grid: the number and shape of threads to execute the lambda

index: the thread ID that is running the lambda, used to index into data

array_view: wraps the data to operate on the accelerator

restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)

Page 14: Максим Гольдин

grid<N>, extent<N>, and index<N>

index<N> represents an N-dimensional point

extent<N>: the number of units in each dimension of an N-dimensional space

grid<N>: an origin (index<N>) plus an extent<N>

N can be any number

Page 15: Максим Гольдин

Examples: grid, extent, and index

index<1> i(2);
index<2> i(0,2);
index<3> i(2,0,1);

extent<1> e(6);      grid<1> g(e);
extent<2> e(3,4);    grid<2> g(e);
extent<3> e(3,2,2);  grid<3> g(e);

Page 16: Максим Гольдин

vector<int> v(96);
extent<2> e(8,12);   // e[0] == 8; e[1] == 12;
array<int,2> a(e, v.begin(), v.end());

// in the body of my lambda
index<2> i(3,9);     // i[0] == 3; i[1] == 9;
int o = a[i];        // or a[i] = 16;
//int o = a(i[0], i[1]);

array<T,N>

Multi-dimensional array of rank N with element type T
Storage lives on the accelerator

[Figure: the 8 x 12 array shown as a grid with rows 0-7 and columns 0-11]

Page 17: Максим Гольдин

array_view<T,N>

View on existing data on the CPU or GPU
array_view<T,N>
array_view<const T,N>

vector<int> v(10);
extent<2> e(2,5);
array_view<int,2> a(e, v);

// the above two lines can also be written
// array_view<int,2> a(2,5,v);
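
The slide names a read-only form, array_view<const T,N>, which the example above does not show. A minimal sketch, assuming the preview syntax used throughout this deck (the shipped C++ AMP library later renamed grid to extent and restrict(direct3d) to restrict(amp)); Scale and its parameters are hypothetical:

#include <amp.h>
using namespace concurrency;

void Scale(int n, const int * pIn, int * pOut, int factor)
{
    array_view<const int,1> in(n, pIn);   // read-only view: data flows to the accelerator only
    array_view<int,1> out(n, pOut);       // writable view: results flow back on demand
    parallel_for_each(out.grid, [=](index<1> i) restrict(direct3d)
    {
        out[i] = in[i] * factor;
    });
}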

Page 18: Максим Гольдин

Data Classes Comparison

array<T,N>
Rank at compile time
Extent at runtime
Rectangular
Dense
Container for data
Explicit copy
Capture by reference [&]

array_view<T,N>
Rank at compile time
Extent at runtime
Rectangular
Dense in one dimension
Wrapper for data
Implicit copy
Capture by value [=]
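
A minimal sketch of the difference in use, assuming the preview syntax above; the explicit copy-back uses the copy() function of the shipped library, whose exact name in the preview may differ:

#include <amp.h>
#include <vector>
using namespace concurrency;

void DoubleValues(std::vector<int>& v)
{
    // array<T,N>: container whose storage lives on the accelerator,
    // explicit copy in/out, captured by reference
    array<int,1> a((int)v.size(), v.begin(), v.end());
    parallel_for_each(a.grid, [&a](index<1> i) restrict(direct3d) { a[i] *= 2; });
    copy(a, v.begin());   // explicit copy back to CPU memory

    // array_view<T,N>: wrapper over the existing CPU data,
    // implicit copy, captured by value
    array_view<int,1> av((int)v.size(), v);
    parallel_for_each(av.grid, [=](index<1> i) restrict(direct3d) { av[i] *= 2; });
    // copied back implicitly the next time the data is read through av on the CPU
}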

Page 19: Максим Гольдин

parallel_for_each(
    g,                                  // g is of type grid<N>
    [ ](index<N> idx) restrict(direct3d)
    {
        // kernel code
    }
);

parallel_for_each

Executes the lambda for each point in the extent
As-if synchronous in terms of visible side-effects
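
The "as-if synchronous" point means the kernel may actually run asynchronously, but any later CPU access through the captured views observes the completed results. A small hypothetical illustration in the same preview syntax:

std::vector<int> v(16, 1);
array_view<int,1> data((int)v.size(), v);

parallel_for_each(data.grid, [=](index<1> idx) restrict(direct3d)
{
    data[idx] += 1;
});

// Reading through the view on the CPU waits for the kernel and copies results back,
// so the program behaves as if parallel_for_each had completed synchronously.
int first = data[0];   // 2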

Page 20: Максим Гольдин

Example: Matrix Multiplication

void MatrixMultiplySerial(vector<float>& vC,
    const vector<float>& vA, const vector<float>& vB,
    int M, int N, int W)
{
    for (int row = 0; row < M; row++)
    {
        for (int col = 0; col < N; col++)
        {
            float sum = 0.0f;
            for (int i = 0; i < W; i++)
                sum += vA[row * W + i] * vB[i * N + col];
            vC[row * N + col] = sum;
        }
    }
}

void MatrixMultiplyAMP(vector<float>& vC,
    const vector<float>& vA, const vector<float>& vB,
    int M, int N, int W)
{
    array_view<const float,2> a(M,W,vA), b(W,N,vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i++)
            sum += a(row, i) * b(i, col);
        c[idx] = sum;
    });
}

Page 21: Максим Гольдин

accelerator, accelerator_view

accelerator
e.g. a DX11 GPU, the REF software emulator, or the CPU

accelerator_view
a context for scheduling and memory management

[Diagram: Host (CPUs + system memory) connected over PCIe to Accelerators (GPU example), each GPU with its own memory]

Data transfers
between accelerator and host
could be optimized away for an integrated memory architecture

Page 22: Максим Гольдин

Example: accelerator

// Identify an accelerator based on Windows device ID
accelerator myAcc("PCI\\VEN_1002&DEV_9591&CC_0300");

// ...or enumerate all accelerators (not shown)

// Allocate an array on my accelerator
array<int> myArray(10, myAcc.default_view);

// ...or launch a kernel on my accelerator
parallel_for_each(myAcc.default_view, myArrayView.grid, ...);
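
The enumeration step that the comment above leaves out could look roughly like this. The sketch uses accelerator::get_all() and the is_emulated property from the shipped C++ AMP library; the names in the Visual Studio 11 preview may differ:

#include <amp.h>
#include <vector>
using namespace concurrency;

// Pick the first accelerator that is real hardware rather than the REF emulator.
accelerator PickHardwareAccelerator()
{
    std::vector<accelerator> all = accelerator::get_all();
    for (size_t i = 0; i < all.size(); ++i)
    {
        if (!all[i].is_emulated)
            return all[i];
    }
    return accelerator();   // fall back to the default accelerator
}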

Page 23: Максим Гольдин

C++ AMP at a Glance (so far)

restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>, grid<N>
class accelerator
class accelerator_view

Page 24: Максим Гольдин

Achieving maximum performance gains

Schedule threads in tiles
Avoid thread index remapping
Gain the ability to use tile_static memory

The parallel_for_each overload for tiles accepts
a tiled_grid<D0>, tiled_grid<D0, D1>, or tiled_grid<D0, D1, D2>, and
a lambda which accepts
a tiled_index<D0>, tiled_index<D0, D1>, or tiled_index<D0, D1, D2>
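
Put together, a tiled call is just the same parallel_for_each with the grid tiled and the lambda taking a tiled_index. A skeleton in the preview syntax (the shipped library renamed tiled_grid to tiled_extent):

extent<2> e(8, 6);
grid<2> g(e);

parallel_for_each(g.tile<4,3>(),                  // tiled_grid<4,3>: 2x2 tiles of 4x3 threads
    [=](tiled_index<4,3> t_idx) restrict(direct3d)
{
    // t_idx.global      - position in the whole 8x6 grid
    // t_idx.local       - position within this thread's 4x3 tile
    // t_idx.tile        - coordinates of the tile itself
    // t_idx.tile_origin - global position of the tile's (0,0) element
});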

[Figure: an 8 x 6 grid of threads shown three times - untiled, tiled with g.tile<4,3>(), and tiled with g.tile<2,2>()]

extent<2> e(8,6);
grid<2> g(e);
g.tile<4,3>()
g.tile<2,2>()

Page 25: Максим Гольдин

tiled_grid, tiled_index

Given

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(data.grid.tile<2,2>(),
    [=](tiled_index<2,2> t_idx) … { … });

When the lambda is executed by the thread marked T in the figure (an 8 x 6 grid tiled 2 x 2):
t_idx.global      = index<2>(6,3)
t_idx.local       = index<2>(0,1)
t_idx.tile        = index<2>(3,1)
t_idx.tile_origin = index<2>(6,2)

Page 26: Максим Гольдин

tile_static, tile_barrier

Within the tiled parallel_for_each lambda we can use the tile_static storage class for local variables

indicates that the variable is allocated in fast cache memory, i.e. shared by the threads in a tile
only applicable in restrict(direct3d) functions

class tile_barrier
synchronizes all threads within a tile, e.g. t_idx.barrier.wait();
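
A minimal sketch of the pattern, separate from the matrix example on the next slide (hypothetical kernel; preview syntax; assumes n is a multiple of TS): each thread stages one element into tile_static memory, the tile waits at the barrier, and then every thread can read what its tile-mates loaded.

static const int TS = 256;
array_view<const float,1> in(n, pIn);
array_view<float,1> out(n, pOut);

parallel_for_each(in.grid.tile<TS>(), [=](tiled_index<TS> t_idx) restrict(direct3d)
{
    tile_static float local[TS];                 // fast memory shared by the whole tile
    local[t_idx.local[0]] = in[t_idx.global];    // each thread stages one element
    t_idx.barrier.wait();                        // wait until the tile is fully loaded

    float next = local[(t_idx.local[0] + 1) % TS];   // read a neighbour from the cache
    out[t_idx.global] = 0.5f * (local[t_idx.local[0]] + next);
});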

Page 27: Максим Гольдин

Example: Matrix Multiplication (tiled)

void MatrixMultSimple(vector<float>& vC,
    const vector<float>& vA, const vector<float>& vB,
    int M, int N, int W)
{
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid, [=](index<2> idx) restrict(direct3d)
    {
        int row = idx[0];
        int col = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < W; k++)
            sum += a(row, k) * b(k, col);
        c[idx] = sum;
    });
}

void MatrixMultTiled(vector<float>& vC,
    const vector<float>& vA, const vector<float>& vB,
    int M, int N, int W)
{
    static const int TS = 16;
    array_view<const float,2> a(M, W, vA), b(W, N, vB);
    array_view<writeonly<float>,2> c(M,N,vC);
    parallel_for_each(c.grid.tile<TS,TS>(),
        [=](tiled_index<TS,TS> t_idx) restrict(direct3d)
    {
        int row = t_idx.local[0];
        int col = t_idx.local[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i += TS)
        {
            tile_static float locA[TS][TS], locB[TS][TS];
            locA[row][col] = a(t_idx.global[0], col + i);
            locB[row][col] = b(row + i, t_idx.global[1]);
            t_idx.barrier.wait();
            for (int k = 0; k < TS; k++)
                sum += locA[row][k] * locB[k][col];
            t_idx.barrier.wait();
        }
        c[t_idx.global] = sum;
    });
}

Page 28: Максим Гольдин

C++ AMP at a Glance

restrict(direct3d, cpu)
parallel_for_each
class array<T,N>
class array_view<T,N>
class index<N>
class extent<N>, grid<N>
class accelerator
class accelerator_view

tile_static storage class
class tiled_grid< , , >
class tiled_index< , , >
class tile_barrier

Page 29: Максим Гольдин

Agenda checkpoint

Context
Code
IDE
Summary

Page 30: Максим Гольдин

Visual Studio 11

Organize, Edit, Design, Build, Browse, Debug, Profile

Page 31: Максим Гольдин

demo

C++ AMP Parallel Debugger

Page 32: Максим Гольдин

Visual Studio 11

Organize, Edit, Design, Build, Browse, Debug, Profile

Page 33: Максим Гольдин

C++ AMP Parallel Debugger

Well-known Visual Studio debugging features
Launch, Attach, Break, Stepping, Breakpoints, DataTips
Tool windows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch

New features (for both CPU and GPU)
Parallel Stacks window, Parallel Watch window, Barrier

New GPU-specific features
Emulator, GPU Threads window, race detection

Page 34: Максим Гольдин

demo

Concurrency Visualizerfor GPU

Page 35: Максим Гольдин

Visual Studio 11

Organize, Edit, Design, Build, Browse, Debug, Profile

Page 36: Максим Гольдин

Concurrency Visualizer for GPU

Direct3D-centric
Supports any library/programming model built on it

Integrated GPU and CPU view
The goal is to analyze high-level performance metrics:
Memory copy overheads
Synchronization overheads across CPU/GPU
GPU activity and contention with other processes

Page 37: Максим Гольдин

Concurrency Visualizer for GPU

Team is exploring ways to provide data on:

GPU Memory Utilization

GPU HW Counters

Page 38: Максим Гольдин

Agenda checkpoint

Context
Code
IDE
Summary

Page 39: Максим Гольдин

Summary

Democratization of parallel hardware programmability
Performance for the mainstream
High-level abstractions in C++ (not C)
State-of-the-art Visual Studio IDE
Hardware abstraction platform

Intent is to make C++ AMP an open specification

Page 40: Максим Гольдин

Resources

Daniel Moth's blog (PM of C++ AMP)
http://www.danielmoth.com/Blog/

MSDN Native parallelism blog (team blog)
http://blogs.msdn.com/b/nativeconcurrency/

MSDN Dev Center for Parallel Computing
http://msdn.com/concurrency

MSDN Forums to ask questions
http://social.msdn.microsoft.com/Forums/en/parallelcppnative/threads

Page 41: Максим Гольдин

Feedback

Your feedback is very important! Please complete an evaluation form!

Thank you!

Page 42: Максим Гольдин

DEV: Hands-on labs

November 9-10 in the self-paced lab room; November 9 with an instructor

10:30 - 11:45  DEV201ILL: Visual Studio LightSwitch fundamentals

13:00 - 14:15  DEV303ILL: Debugging with IntelliTrace using Visual Studio 2010 Ultimate

14:30 - 15:45  DEV304ILL: Using Architecture Explorer to analyze code in Visual Studio 2010 Ultimate

16:00 - 17:15  DEV305ILL: Test-Driven Development in Microsoft Visual Studio 2010

17:30 - 18:45  DEV302ILL: Web performance and load testing fundamentals with Visual Studio 2010 Ultimate

Page 43: Максим Гольдин

Questions?

DEV301 Максим Гольдин

Senior Developer
[email protected]
http://blogs.msdn.com/b/mgoldin/

You can ask your questions in the "Ask the expert" zone within an hour after the end of this session.

Page 44: Максим Гольдин

restrict(…)

Applies to functions (including lambdas)

Why restrict?
Target-specific language restrictions
Optimizations or special code-gen behavior
Future-proofing

Functions can have multiple restrictions
In the 1st release we are implementing direct3d and cpu
cpu is the implicit default

Page 45: Максим Гольдин

restrict(direct3d) restrictions

Can only call other restrict(direct3d) functions
All functions must be inlinable
Only direct3d-supported types
int, unsigned int, float, double; structs & arrays of these types

Pointers and references
Lambdas cannot capture by reference, nor capture pointers
References and single-indirection pointers are supported only as local variables and function arguments

Page 46: Максим Гольдин

restrict(direct3d) restrictions

No:
recursion
'volatile'
virtual functions
pointers to functions
pointers to member functions
pointers in structs
pointers to pointers

No:
goto or labeled statements
throw, try, catch
globals or statics
dynamic_cast or typeid
asm declarations
varargs
unsupported types, e.g. char, short, long double
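
For illustration, a hypothetical function that satisfies these rules next to constructs that would be rejected (preview syntax; the shipped compiler flags such violations as errors):

// OK: only supported types, inlinable, no recursion, virtuals, statics or exceptions
float Lerp(float a, float b, float t) restrict(direct3d)
{
    return a + (b - a) * t;
}

// Not allowed inside a restrict(direct3d) function:
//   char c;                  // unsupported type (char, short, long double, ...)
//   static float cache[16];  // statics / globals
//   throw my_error();        // throw, try, catch
//   goto retry;              // goto or labeled statements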

Page 47: Максим Гольдин

Example: restrict overloading

double bar(double) restrict(cpu,direct3d);   // 1: same code for both
double cos(double);                          // 2a: general code
double cos(double) restrict(direct3d);       // 2b: specific code

void SomeMethod(array<double>& c)
{
    parallel_for_each(c.grid, [&c](index<1> idx) restrict(direct3d)
    {
        //...
        double d1 = bar(c[idx]);   // ok
        double d2 = cos(c[idx]);   // ok, chooses the direct3d overload
        //...
    });
}

Page 48: Максим Гольдин

Not Covered

Math library, e.g. acosf

Atomic operation library, e.g. atomic_fetch_add

Direct3D intrinsics: debugging (e.g. direct3d_printf), fences (e.g. __dp_d3d_all_memory_fence), float math (e.g. __dp_d3d_absf)

Direct3D interop: *get_device, create_accelerator_view, make_array, *get_buffer