
Page 1: 11 DirectCompute by Example - AMD

Group execution

Page 2: 11 DirectCompute by Example - AMD

Advanced DirectX® 11 technology: DirectCompute by Example

Jason Yang and Lee Howes

August 18, 2010

Page 3: 11 DirectCompute by Example - AMD

DirectX 11 Basics

New API from Microsoft®

– Released alongside Windows® 7

– Runs on Windows Vista® as well

Supports downlevel hardware

– DirectX9, DirectX10, DirectX11-class HW supported

– Exposed features depend on GPU

Allows the use of the same API for multiple generations of GPUs

– However Windows Vista/Windows 7 required

Lots of new features…

Page 4: 11 DirectCompute by Example - AMD

What is DirectCompute?

DirectCompute brings GPGPU to DirectX

DirectCompute is both separate from and integrated with the DirectX graphics pipeline

– Compute Shader

– Compute features in Pixel Shader

Potential applications

– Physics

– AI

– Image processing

Page 5: 11 DirectCompute by Example - AMD

DirectCompute – part of DirectX 11

DirectX 11 helps efficiently combine Compute work with graphics

– Sharing of buffers is trivial

– Work graph is scheduled efficiently by the driver

[Diagram: the graphics pipeline (Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader), with the Compute Shader alongside it]

Page 6: 11 DirectCompute by Example - AMD

DirectCompute Features

Scattered writes

Atomic operations

Append/consume buffer

Shared memory (local data share)

Structured buffers

Double precision (if supported)
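A minimal HLSL sketch (hypothetical names, not from the deck) showing several of these features side by side: a structured buffer, shared memory, an atomic operation and a scattered write.

struct Particle { float3 pos; float pad; };

StructuredBuffer<Particle>   gInput  : register(t0);  // structured buffer (read)
RWStructuredBuffer<Particle> gOutput : register(u0);  // UAV: allows scattered writes
RWByteAddressBuffer          gCount  : register(u1);  // raw buffer used as an atomic counter

groupshared float3 sPos[64];                          // shared memory (local data share)

[numthreads(64, 1, 1)]
void FeatureDemoCS( uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex )
{
    // Stage this group's data through shared memory
    sPos[gi] = gInput[dtid.x].pos;
    GroupMemoryBarrierWithGroupSync();

    // Atomic operation: reserve an output slot in a global counter
    uint slot;
    gCount.InterlockedAdd( 0, 1, slot );

    // Scattered write: the destination is not derived from the thread id
    Particle p;
    p.pos = sPos[gi];
    p.pad = 0.0f;
    gOutput[slot] = p;
}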

Page 7: 11 DirectCompute by Example - AMD

DirectCompute by Example

Order Independent Transparency (OIT)

– Atomic operations

– Scattered writes

– Append buffer feature

Bullet Cloth Simulation

– Shared memory

– Shared compute and graphics buffers

Page 8: 11 DirectCompute by Example - AMD

Order Independent Transparency

Page 9: 11 DirectCompute by Example - AMD

Transparency Problem

Classic problem in computer graphics

Correct rendering of semi-transparent geometry requires sorting – blending is an order dependent operation

Sometimes sorting triangles is enough but not always

– Difficult to sort: Multiple meshes interacting (many draw calls)

– Impossible to sort: Intersecting triangles (must sort fragments)

Try doing this in PowerPoint!

Page 10: 11 DirectCompute by Example - AMD

Background

A-buffer – Carpenter '84

– CPU side linked list per-pixel for anti-aliasing

Fixed array per-pixel

– F-buffer, stencil routed A-buffer, Z3 buffer, and k-buffer, Slice map, bucket depth peeling

Multi-pass

– Depth peeling methods for transparency

Recent

– Freepipe, PreCalc [DirectX11 SDK]

Page 11: 11 DirectCompute by Example - AMD

OIT using Per-Pixel Linked Lists

Fast creation of linked lists of arbitrary size on the GPU using D3D11

– Computes correct transparency

Integration into the standard graphics pipeline

– Demonstrates compute from rasterized data

– DirectCompute features in Pixel Shader

– Works with depth and stencil testing

– Works with and without MSAA

Example of programmable blend

Page 12: 11 DirectCompute by Example - AMD

Linked List Construction

Two Buffers

– Head pointer buffer

addresses/offsets

Initialized to end-of-list (EOL) value (e.g., -1)

– Node buffer

arbitrary payload data + “next pointer”

Each shader thread

1. Retrieve and increment global counter value

2. Atomic exchange into head pointer buffer

3. Add new entry into the node buffer at location from step 1
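In HLSL the three steps map directly onto intrinsics; a minimal sketch with hypothetical names is shown here, and the full pixel shader appears on Page 26.

struct Node { float4 color; float depth; uint next; };
RWStructuredBuffer<Node> gNodeBuffer  : register(u1);   // UAV created with the counter flag
RWTexture2D<uint>        gHeadPointer : register(u2);

void InsertFragment( uint2 screenAddress, float4 fragmentColor, float depth )
{
    // 1. Retrieve and increment the global counter
    uint newNode = gNodeBuffer.IncrementCounter();

    // 2. Atomically exchange the new index into the head pointer buffer
    uint oldHead;
    InterlockedExchange( gHeadPointer[screenAddress], newNode, oldHead );

    // 3. Write the payload, linking to the previous head ("next pointer")
    Node n;
    n.color = fragmentColor;
    n.depth = depth;
    n.next  = oldHead;
    gNodeBuffer[newNode] = n;
}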

Page 13: 11 DirectCompute by Example - AMD

Algorithm Overview

0. Render opaque scene objects

1. Render transparent scene objects

2. Screen quad resolves and composites fragment lists

Page 14: 11 DirectCompute by Example - AMD

Step 0 – Render Opaque

Render all opaque geometry normally

Render Target

Page 15: 11 DirectCompute by Example - AMD

Algorithm Overview

0. Render opaque scene objects

1. Render transparent scene objects

– All fragments are stored using per-pixel linked lists

– Store fragment's color, alpha, & depth

2. Screen quad resolves and composites fragment lists

Page 16: 11 DirectCompute by Example - AMD

Setup

Two buffers

– Screen sized head pointer buffer

– Node buffer – large enough to handle all fragments

Render as usual

Disable render target writes

Insert render target data into linked list

Page 17: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: Head Pointer Buffer with every entry set to -1, an empty Node Buffer (slots 0 1 2 3 4 5 6 …), Counter = 0, shown alongside the Render Target]

Page 18: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: same initial state; Head Pointer Buffer entries all -1, Node Buffer empty, Counter = 0]

Page 19: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: a fragment shader thread calls IncrementCounter(); Counter becomes 1, buffers still untouched]

Page 20: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: InterlockedExchange() writes node index 0 into the pixel's Head Pointer Buffer entry, returning the old value (-1); Counter = 1]

Page 21: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: a scatter write stores the fragment (depth 0.87, next = -1) into Node Buffer slot 0; Counter = 1]

Page 22: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: two more fragments (depth 0.89 and 0.90, both next = -1) are written to Node Buffer slots 1 and 2 and their Head Pointer Buffer entries set to 1 and 2; a third fragment is culled due to existing scene geometry depth; Counter = 3]

Page 23: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: a second transparent surface adds nodes 3 (depth 0.65, next = 0) and 4 (depth 0.65, next = -1); the affected Head Pointer Buffer entries become 3 and 4; Counter = 5]

Page 24: 11 DirectCompute by Example - AMD

Step 1 – Create Linked List

[Figure: another fragment is stored as node 5 (depth 0.71, next = 3) and its Head Pointer Buffer entry is exchanged from 3 to 5; Counter = 6]

Page 25: 11 DirectCompute by Example - AMD

Node Buffer Counter

Counter allocated in GPU memory (i.e. a buffer)

– Atomic updates

– Contention issues

DirectX11 Append feature

– Linear writes to a buffer

– Implicit writes: Append()

– Explicit writes: IncrementCounter()

– Up to 60% faster than standard memory operations (memory counters)
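A small sketch of the two styles (hypothetical names; both require the corresponding counter/append flags when the UAV is created):

struct FragmentNode { float4 color; float depth; int next; };

RWStructuredBuffer<FragmentNode>     gNodesExplicit : register(u1);  // UAV created with the counter flag
AppendStructuredBuffer<FragmentNode> gNodesAppend   : register(u2);  // UAV created with the append flag

void StoreExplicit( FragmentNode n )
{
    // Explicit: reserve a slot, then write it; the returned index can be linked to
    uint slot = gNodesExplicit.IncrementCounter();
    gNodesExplicit[slot] = n;
}

void StoreImplicit( FragmentNode n )
{
    // Implicit: the slot is chosen for you; no index is returned to the shader
    gNodesAppend.Append( n );
}

For the linked-list build the explicit form is the one that matters, since the returned index is what gets exchanged into the head pointer buffer.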

Page 26: 11 DirectCompute by Example - AMD

Code Example

RWStructuredBuffer<int> RWStructuredCounter;   // element type lost in extraction; int assumed
RWTexture2D<int> tRWFragmentListHead;
RWTexture2D<float4> tRWFragmentColor;
RWTexture2D<int2> tRWFragmentDepthAndLink;

[earlydepthstencil]
void PS( PsInput input )
{
    float4 vFragment = ComputeFragmentColor(input);
    int2 vScreenAddress = int2(input.vPositionSS.xy);

    // Get counter value and increment
    int nNewFragmentAddress = RWStructuredCounter.IncrementCounter();
    if ( nNewFragmentAddress == FRAGMENT_LIST_NULL ) { return; }

    // Update head buffer
    int nOldFragmentAddress;
    InterlockedExchange( tRWFragmentListHead[vScreenAddress], nNewFragmentAddress, nOldFragmentAddress );

    // Write the fragment attributes to the node buffer
    int2 vAddress = GetAddress( nNewFragmentAddress );
    tRWFragmentColor[vAddress] = vFragment;
    tRWFragmentDepthAndLink[vAddress] = int2( int(saturate(input.vPositionSS.z) * 0x7fffffff), nOldFragmentAddress );

    return;
}

Page 27: 11 DirectCompute by Example - AMD

Algorithm Overview

0. Render opaque scene objects

1. Render transparent scene objects

2. Screen quad resolves and composites fragment lists

– Single pass

– Pixel shader sorts associated linked list (e.g., insertion sort)

– Composite fragments in sorted order with background

– Output final fragment
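A sketch of such a resolve pixel shader, reusing the resource layout from the construction shader on Page 26. tBackground, MAX_SORTED_FRAGMENTS, NODE_TEX_WIDTH and ResolvePS are hypothetical; GetAddress is the same helper assumed there, shown here with an assumed linear-to-2D mapping.

#define MAX_SORTED_FRAGMENTS 8        // hypothetical cap on fragments sorted per pixel
#define NODE_TEX_WIDTH       1024     // hypothetical width of the node textures

Texture2D<int>    tFragmentListHead     : register(t0);
Texture2D<float4> tFragmentColor        : register(t1);
Texture2D<int2>   tFragmentDepthAndLink : register(t2);
Texture2D<float4> tBackground           : register(t3);   // copy of the opaque render target

int2 GetAddress( int nodeIndex )        // assumed mapping from node index to texture address
{
    return int2( nodeIndex % NODE_TEX_WIDTH, nodeIndex / NODE_TEX_WIDTH );
}

float4 ResolvePS( float4 vPos : SV_POSITION ) : SV_Target
{
    int2 vScreenAddress = int2( vPos.xy );

    // Walk the per-pixel list and insertion-sort (furthest first) into a temp array
    int2 sorted[MAX_SORTED_FRAGMENTS];   // x = quantized depth, y = node index
    int  count = 0;
    int  node  = tFragmentListHead[vScreenAddress];
    while ( node >= 0 && count < MAX_SORTED_FRAGMENTS )
    {
        int2 d = tFragmentDepthAndLink[GetAddress( node )];
        int  i = count;
        while ( i > 0 && sorted[i - 1].x < d.x )
        {
            sorted[i] = sorted[i - 1];
            i--;
        }
        sorted[i] = int2( d.x, node );
        count++;
        node = d.y;                      // follow the "next pointer"
    }

    // Composite back to front over the opaque background
    float4 result = tBackground[vScreenAddress];
    for ( int f = 0; f < count; f++ )
    {
        float4 c = tFragmentColor[GetAddress( sorted[f].y )];
        result.rgb = lerp( result.rgb, c.rgb, c.a );
    }
    return result;
}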

Page 28: 11 DirectCompute by Example - AMD

Step 2 – Render Fragments

[Figure: resolve pass walking the screen from pixel (0,0): the fetched Head Pointer is -1, and -1 indicates there is no fragment to render for that pixel]

Page 29: 11 DirectCompute by Example - AMD

Step 2 – Render Fragments

[Figure: at pixel (1,1) the Head Pointer is 5; fetch Node Data (5), then walk the list and store the fragments in a temp array: 0.71, 0.65, 0.87]

Page 30: 11 DirectCompute by Example - AMD

Step 2 – Render Fragments

[Figure: at pixel (1,1) the temp array is sorted (0.65, 0.71, 0.87), then the colors are blended and written out]

Page 31: 11 DirectCompute by Example - AMD

Step 2 – Render Fragments

[Figure: the final composited result in the Render Target; Head Pointer Buffer and Node Buffer shown unchanged]

Page 32: 11 DirectCompute by Example - AMD

Anti-Aliasing

Store coverage information in the linked list

Resolve per-sample

– Execute a shader at each sample location

– Use MSAA hardware

Resolve per-pixel

– Execute a shader at each pixel location

– Average all sample contributions within the shader
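One way the coverage information can be used in a per-pixel resolve (a sketch under assumed names; the store pass is assumed to have written the pixel shader's SV_COVERAGE mask into each node):

#define NUM_SAMPLES 4                        // assumed MSAA sample count

// Blend the fragments that cover each sample, then average the samples.
// 'colors', 'coverage' and 'count' are the pixel's sorted fragment list;
// 'background' holds the opaque result for each sample.
float4 ResolvePixelMSAA( float4 background[NUM_SAMPLES],
                         float4 colors[8], uint coverage[8], int count )
{
    float4 result = 0;
    for ( int s = 0; s < NUM_SAMPLES; s++ )
    {
        float4 sampleColor = background[s];
        for ( int f = 0; f < count; f++ )
        {
            if ( coverage[f] & (1u << s) )   // does this fragment cover sample s?
                sampleColor.rgb = lerp( sampleColor.rgb, colors[f].rgb, colors[f].a );
        }
        result += sampleColor;
    }
    return result / NUM_SAMPLES;             // average all sample contributions
}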

Page 33: 11 DirectCompute by Example - AMD

Performance Comparison

                       Teapot     Dragon
Linked List            743 fps    338 fps
Precalc                285 fps    143 fps
Depth Peeling          579 fps     45 fps
Bucket Depth Peeling       ---    256 fps
Dual Depth Peeling         ---     94 fps

Based on internal numbers. Performance scaled to the ATI Radeon HD 5770 graphics card.

Page 34: 11 DirectCompute by Example - AMD

Mecha Demo

602K scene triangles

– 254K transparent triangles

Page 35: 11 DirectCompute by Example - AMD

Layers

Based on internal numbers

Page 36: 11 DirectCompute by Example - AMD

Scaling

Based on internal numbers

Page 37: 11 DirectCompute by Example - AMD

Future Work

Memory allocation

Sort on insert

Other linked list applications

– Indirect illumination

– Motion blur

– Shadows

More complex data structures

Page 38: 11 DirectCompute by Example - AMD

Bullet Cloth Simulation

Page 39: 11 DirectCompute by Example - AMD

DirectCompute for physics

DirectCompute in the Bullet physics SDK

– An introduction to cloth simulation

– Some tips for implementation in DirectCompute

– A demonstration of the current state of development

Page 40: 11 DirectCompute by Example - AMD

Cloth simulation

Large number of particles

– Appropriate for parallel processing

– Force from each spring constraint applied to both connected particles

[Diagram: original vs current mesh layout, with the simulation loop: compute forces as stretch from rest length → apply position corrections to masses → compute new positions]

Page 41: 11 DirectCompute by Example - AMD

Cloth simulation

Large number of particles

– Appropriate for parallel processing

– Force from each spring constraint applied to both connected particles

[Diagram: as before, with the rest length of each spring highlighted]

Page 42: 11 DirectCompute by Example - AMD

Cloth simulation steps

For each simulation iteration:

– Compute forces in each link based on its length

– Correct positions of masses/vertices from forces

– Compute new vertex positions

[Diagram: the simulation loop: compute forces as stretch from rest length → apply position corrections to masses → compute new positions]

Page 43: 11 DirectCompute by Example - AMD

Cloth simulation steps

For each simulation iteration:

– Compute forces in each link based on its length

– Correct positions of masses/vertices from forces

– Compute new vertex positions

[Diagram: the simulation loop: compute forces as stretch from rest length → apply position corrections to masses → compute new positions]

Page 44: 11 DirectCompute by Example - AMD

Cloth simulation steps

For each simulation iteration:

– Compute forces in each link based on its length

– Correct positions of masses/vertices from forces

– Compute new vertex positions

[Diagram: the simulation loop: compute forces as stretch from rest length → apply position corrections to masses → compute new positions]

Page 45: 11 DirectCompute by Example - AMD

Springs and masses

Two or three main types of springs

– Structural/shearing

– Bending

Page 46: 11 DirectCompute by Example - AMD

Springs and masses

Two or three main types of springs

– Structural/shearing

– Bending

Page 47: 11 DirectCompute by Example - AMD

CPU approach to simulation

One link at a time

Perform updates in place

“Gauss-Seidel” style

Conserves momentum

Iterate n times

Page 48: 11 DirectCompute by Example - AMD

CPU approach to simulation

One link at a time

Perform updates in place

“Gauss-Seidel” style

Conserves momentum

Iterate n times

Page 49: 11 DirectCompute by Example - AMD

CPU approach to simulation

One link at a time

Perform updates in place

“Gauss-Seidel” style

Conserves momentum

Iterate n times

Page 50: 11 DirectCompute by Example - AMD

CPU approach to simulation

One link at a time

Perform updates in place

“Gauss-Seidel” style

Conserves momentum

Iterate n times

Page 51: 11 DirectCompute by Example - AMD

CPU approach to simulation

One link at a time

Perform updates in place

“Gauss-Seidel” style

Conserves momentum

Iterate n times

Page 52: 11 DirectCompute by Example - AMD

CPU approach to simulation

One link at a time

Perform updates in place

“Gauss-Seidel” style

Conserves momentum

Iterate n times

Target

Page 53: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Page 54: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Page 55: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Page 56: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Page 57: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Page 58: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Page 59: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

In parallel!

Page 60: 11 DirectCompute by Example - AMD

Moving to the GPU: The pixel shader approach

Offers full parallelism

One vertex at a time

No scattered writes

Target

Page 61: 11 DirectCompute by Example - AMD

Downsides of the pixel shader approach

No propagation of updates

– If we double buffer

Or non-deterministic

– If we update in-place in a read/write array

Momentum preservation

– Lacking due to single-ended link updates

Page 62: 11 DirectCompute by Example - AMD

Can DirectCompute help?

Offers scattered writes as a feature as we saw earlier

The GPU implementation could be more like the CPU

– Solver per-link rather than per-vertex

– Leads to races between links that update the same vertex
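A minimal sketch of what such a per-link kernel can look like (hypothetical names; a position-based distance-constraint solve in the spirit of the CPU loop described earlier). Because each thread writes both end points, two links in flight must never share a vertex, which is exactly what the batching on the next slides guarantees.

struct Link { int a; int b; float restLength; };

StructuredBuffer<Link>     g_batchLinks      : register(t0);   // links of the current batch
RWStructuredBuffer<float4> g_vertexPositions : register(u0);   // xyz = position, w = inverse mass

cbuffer BatchConstants : register(b0)
{
    int g_startLink;   // first link of this batch
    int g_numLinks;    // number of links in this batch
};

[numthreads(64, 1, 1)]
void SolveLinksKernel( uint3 dtid : SV_DispatchThreadID )
{
    if ( (int)dtid.x >= g_numLinks )
        return;

    Link   link = g_batchLinks[g_startLink + dtid.x];
    float4 pa   = g_vertexPositions[link.a];
    float4 pb   = g_vertexPositions[link.b];

    // Stretch relative to the rest length
    float3 del  = pb.xyz - pa.xyz;
    float  len  = max( length( del ), 1e-6f );
    float  diff = ( len - link.restLength ) / len;

    // Split the correction by inverse mass and move BOTH ends
    float wSum = max( pa.w + pb.w, 1e-6f );
    pa.xyz += del * diff * ( pa.w / wSum );
    pb.xyz -= del * diff * ( pb.w / wSum );

    // Scattered writes to the two connected particles
    g_vertexPositions[link.a] = pa;
    g_vertexPositions[link.b] = pb;
}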

Page 63: 11 DirectCompute by Example - AMD

Execute independent subsets in parallel

All links act at both ends

Batch links

– No two links in a given batch share a vertex

– No data races

Page 64: 11 DirectCompute by Example - AMD

Execute independent subsets in parallel

All links act at both ends

Batch links

– No two links in a given batch share a vertex

– No data races

Page 65: 11 DirectCompute by Example - AMD

Execute independent subsets in parallel

All links act at both ends

Batch links

– No two links in a given batch share a vertex

– No data races

Page 66: 11 DirectCompute by Example - AMD

Execute independent subsets in parallel

All links act at both ends

Batch links

– No two links in a given batch share a vertex

– No data races

Page 67: 11 DirectCompute by Example - AMD

Execute independent subsets in parallel

All links act at both ends

Batch links

– No two links in a given batch share a vertex

– No data races

Page 68: 11 DirectCompute by Example - AMD

On a real cloth mesh we need many batches

Create independent subsets of links through graph coloring.

Synchronize between batches

Page 69: 11 DirectCompute by Example - AMD

On a real cloth mesh we need many batches

Create independent subsets of links through graph coloring.

Synchronize between batches

1st batch

Page 70: 11 DirectCompute by Example - AMD

On a real cloth mesh we need many batches

Create independent subsets of links through graph coloring.

Synchronize between batches

2nd batch

Page 71: 11 DirectCompute by Example - AMD

On a real cloth mesh we need many batches

Create independent subsets of links through graph coloring.

Synchronize between batches

3rd batch

Page 72: 11 DirectCompute by Example - AMD

On a real cloth mesh we need many batches

Create independent subsets of links through graph coloring.

Synchronize between batches

10 batches

Page 73: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: a simulation step spans Iteration 0, Iteration 1 and Iteration 2; within each iteration Batch 0 through Batch 4 are dispatched one after another]

Page 74: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: Iteration 0, dispatching Batch 0 through Batch 4]

// Execute the kernel
context->CSSetShader( solvePositionsFromLinksKernel.kernel, NULL, 0 );
int numBlocks = (constBuffer.numLinks + (blockSize-1)) / blockSize;
context->Dispatch( numBlocks, 1, 1 );

Page 75: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: one iteration of the simulation step, dispatching Batch 0 through Batch 4; after the first batch there is a wait: twiddle fingers…]

Page 76: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: the same iteration, now with waits (twiddle fingers…) accumulating between successive batch dispatches]

Page 77: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: a wait (twiddle fingers…) between every one of the batch dispatches. Remember, 10 batches!]

Page 78: 11 DirectCompute by Example - AMD

Returning to our batching

10 batches: 10 compute shader dispatches

1/10 of the links per batch

Low compute density per thread

10 batches

Page 79: 11 DirectCompute by Example - AMD

Packing for higher efficiency

Can create larger groups

– The cloth is fixed-structure

– Can be preprocessed

Fewer batches/dispatches

Less parallelism

4 batches

Page 80: 11 DirectCompute by Example - AMD

Solving cloths together

Solve multiple cloths together in n batches

Grouping

– Larger and reduced number of dispatches

– Regain the parallelism that increased work-per-thread removed

4 batches

Page 81: 11 DirectCompute by Example - AMD

Shared memory

We've made use of scattered writes

The next feature of DirectCompute: shared memory

– Load data at the start of a block

– Compute over multiple links together

– Write data out again

Page 82: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: each iteration of the simulation step now issues only Batch 0 and Batch 1 as compute dispatches; within each dispatch the shader itself loops over the inner batches (inner batch 0, inner batch 1, …)]

Page 83: 11 DirectCompute by Example - AMD

So let's look at the batching we saw before:

There are 4 batches:

– If we do this per group we need 3 groups rather than three DirectX "threads"

How can we improve the batching?

Group 1 Batch 1

Page 84: 11 DirectCompute by Example - AMD

So let's look at the batching we saw before:

There are 4 batches:

– If we do this per group we need 3 groups rather than three DirectX "threads"

How can we improve the batching?

Group 2 Batch 1

Page 85: 11 DirectCompute by Example - AMD

So let's look at the batching we saw before:

There are 4 batches:

– If we do this per group we need 3 groups rather than three DirectX "threads"

How can we improve the batching?

Group 1 Batch 2

Page 86: 11 DirectCompute by Example - AMD

Solving in shared memory

groupshared float4 positionSharedData[VERTS_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]

void

SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )

{

for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

positionSharedData[vertex] = g_vertexPositions[vertexAddress];

}

... // Perform computation in shared buffer

for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

g_vertexPositions[vertexAddress] = positionSharedData[vertex];

}

}

Page 87: 11 DirectCompute by Example - AMD

Solving in shared memory

groupshared float4 positionSharedData[VERTS_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]

void

SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )

{

for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

positionSharedData[vertex] = g_vertexPositions[vertexAddress];

}

... // Perform computation in shared buffer

for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

g_vertexPositions[vertexAddress] = positionSharedData[vertex];

}

}

Define a “groupshared” buffer for shared data storage

Page 88: 11 DirectCompute by Example - AMD

Solving in shared memory

groupshared float4 positionSharedData[VERTS_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]

void

SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )

{

for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

positionSharedData[vertex] = g_vertexPositions[vertexAddress];

}

... // Perform computation in shared buffer

for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

g_vertexPositions[vertexAddress] = positionSharedData[vertex];

}

}

Data will be shared across a group of threads with these dimensions

Page 89: 11 DirectCompute by Example - AMD

Solving in shared memory

groupshared float4 positionSharedData[VERTS_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]

void

SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )

{

for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

positionSharedData[vertex] = g_vertexPositions[vertexAddress];

}

... // Perform computation in shared buffer

for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

g_vertexPositions[vertexAddress] = positionSharedData[vertex];

}

}

Load data from global buffers into the shared region

Page 90: 11 DirectCompute by Example - AMD

Solving in shared memory

groupshared float4 positionSharedData[VERTS_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]

void

SolvePositionsFromLinksKernel( … uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID … )

{

for( int vertex = laneInWavefront; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

positionSharedData[vertex] = g_vertexPositions[vertexAddress];

}

... // Perform computation in shared buffer

for( int vertex = GTid.x; vertex < verticesUsedByWave; vertex+=GROUP_SIZE )

{

int vertexAddress = g_vertexAddressesPerWavefront[groupID*VERTS_PER_GROUP + vertex];

g_vertexPositions[vertexAddress] = positionSharedData[vertex];

}

}

Write back to the global buffer after computation

Page 91: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

Page 92: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

Page 93: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

Page 94: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

Few links, so low packing efficiency; not a problem with larger cloth

Page 95: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

[Diagram: a Synchronize point between each stage of the sequence]

Page 96: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

// Load
AllMemoryBarrierWithGroupSync();

for( each subgroup )
{
    // Process a subgroup
    AllMemoryBarrierWithGroupSync();
}

// Store

Page 97: 11 DirectCompute by Example - AMD

Why is this an improvement?

So we still need 10*4 batches. What have we gained?

– The batches within a group chunk are in-shader loops

– Only 4 shader dispatches, each with significant overhead

The barriers will still hit performance

– We are no longer dispatch bound, but we are likely to be on-chip synchronization bound

Page 98: 11 DirectCompute by Example - AMD

Exploiting the SIMD architecture

Hardware executes 64- or 32-wide SIMD

Sequentially consistent at the SIMD level

Synchronization is now implicit

– Take care

– Execute over groups that are SIMD width or a divisor thereof
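A sketch of what this can look like (hypothetical names and layout): the group is exactly one wavefront wide and the barriers between load, solve and store are simply dropped. As the slide says, take care; this leans entirely on lock-step SIMD execution.

#define WAVEFRONT_SIZE 64      // or 32, matching the hardware SIMD width
#define VERTS_PER_WAVE 128     // assumed size of one wave's vertex working set

struct Link { int a; int b; float restLength; };   // a, b index into the wave's shared region

StructuredBuffer<Link>     g_waveLinks       : register(t0);
StructuredBuffer<int>      g_vertexAddresses : register(t1);  // per-wave table of global vertex indices
RWStructuredBuffer<float4> g_vertexPositions : register(u0);

groupshared float4 sPositions[VERTS_PER_WAVE];

[numthreads(WAVEFRONT_SIZE, 1, 1)]
void SolveWaveKernel( uint3 Gid : SV_GroupID, uint3 GTid : SV_GroupThreadID )
{
    int base = Gid.x * VERTS_PER_WAVE;

    // Load the wave's vertices into shared memory; no barrier, one wavefront only
    for ( int v = GTid.x; v < VERTS_PER_WAVE; v += WAVEFRONT_SIZE )
        sPositions[v] = g_vertexPositions[g_vertexAddresses[base + v]];

    // Solve one link per lane against the shared copy (a single inner batch shown);
    // the batching guarantees no two lanes touch the same vertex
    Link   link = g_waveLinks[Gid.x * WAVEFRONT_SIZE + GTid.x];
    float3 del  = sPositions[link.b].xyz - sPositions[link.a].xyz;
    float  len  = max( length( del ), 1e-6f );
    float  diff = ( len - link.restLength ) / len;
    sPositions[link.a].xyz += del * ( 0.5f * diff );
    sPositions[link.b].xyz -= del * ( 0.5f * diff );

    // Write the results back, again with no explicit synchronization
    for ( int v = GTid.x; v < VERTS_PER_WAVE; v += WAVEFRONT_SIZE )
        g_vertexPositions[g_vertexAddresses[base + v]] = sPositions[v];
}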

Page 99: 11 DirectCompute by Example - AMD

Group execution

The sequence of operations for the first batch is:

[Diagram: the same sequence, where each synchronization point now just works… implicitly]

Page 100: 11 DirectCompute by Example - AMD

Driving batches and synchronizing

[Diagram: each simulation step dispatches Batch 0 and Batch 1; the inner batches (inner batch 0, inner batch 1, …) run inside the shader, with synchronization only between the remaining dispatches]

Page 101: 11 DirectCompute by Example - AMD

Performance gains

For 90,000 links:

– No solver running in 2.98 ms/frame

– Fully batched link solver in 3.84 ms/frame

– SIMD batched solver 3.22 ms/frame

– CPU solver 16.24 ms/frame

3.5x improvement in the solver alone

(67x improvement over the CPU solver)

Based on internal numbers

Page 102: 11 DirectCompute by Example - AMD

One more thing…

Remember the tight pipeline integration?

How can we use this to our advantage?

[Diagram: the graphics pipeline (Input Assembler → Vertex Shader → Tessellation → Geometry Shader → Rasterizer → Pixel Shader), with the Compute Shader alongside it]

Page 103: 11 DirectCompute by Example - AMD

Efficiently output vertex data

Cloth simulation updates vertex positions

– Generated on the GPU

– Need to be used on the GPU for rendering

– Why not keep them there?

Large amount of data to update

– Many vertices in fine simulation meshes

– Normals and other information present

Page 104: 11 DirectCompute by Example - AMD

Create a vertex buffer

// Create a vertex buffer with unordered access support
D3D11_BUFFER_DESC bd;
bd.Usage = D3D11_USAGE_DEFAULT;
bd.ByteWidth = vertexBufferSize * 32;
bd.BindFlags = D3D11_BIND_VERTEX_BUFFER | D3D11_BIND_UNORDERED_ACCESS;
bd.CPUAccessFlags = 0;
bd.MiscFlags = 0;
bd.StructureByteStride = 32;
hr = m_d3dDevice->CreateBuffer( &bd, NULL, &m_Buffer );

// Create an unordered access view of the buffer to allow writing
D3D11_UNORDERED_ACCESS_VIEW_DESC ud;
ud.Format = DXGI_FORMAT_UNKNOWN;
ud.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
ud.Buffer.FirstElement = 0;
ud.Buffer.NumElements = vertexBufferSize;
ud.Buffer.Flags = 0;
hr = m_d3dDevice->CreateUnorderedAccessView( m_Buffer, &ud, &m_UAV );

It's a vertex buffer. It's also bound for unordered access. Scattered writes!

Page 105: 11 DirectCompute by Example - AMD

Performance gains

For 90,000 links with copy on GPU:

– No solver running in 0.58 ms/frame

– Fully batched link solver in 0.82 ms/frame

– SIMD batched solver 0.617 ms/frame

6.5x improvement in solver

6.5x improvement from CPU copy alone

23x improvement over simpler solver with host copy

Based on internal numbers

Page 106: 11 DirectCompute by Example - AMD

Thanks

Justin Hensley

Holger Grün

Nicholas Thibieroz

Erwin Coumans

Page 107: 11 DirectCompute by Example - AMD

References

Yang J., Hensley J., Grün H., Thibieroz N.: Real-Time Concurrent Linked List Construction on the GPU. In Rendering Techniques 2010: Eurographics Symposium on Rendering (2010), vol. 29, Eurographics.

Grün H., Thibieroz N.: OIT and Indirect Illumination using DirectX11 Linked Lists. In Proceedings of Game Developers Conference 2010 (Mar. 2010). http://developer.amd.com/gpu_assets/OIT%20and%20Indirect%20Illumination%20using%20DirectX11%20Linked%20Lists_forweb.ppsx

http://developer.amd.com/samples/demos/pages/ATIRadeonHD5800SeriesRealTimeDemos.aspx

http://bulletphysics.org

Page 108: 11 DirectCompute by Example - AMD

Trademark Attribution

AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Microsoft, Windows, Windows Vista, Windows 7 and DirectX are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2010 Advanced Micro Devices, Inc. All rights reserved.