Rendering the Breeze - Khronos Group · 2014-04-08
TRANSCRIPT
© Copyright Khronos Group, 2010 - Page 1
Rendering the breeze
Yaki Tebeka (Graphic Remedy), Benedict R. Gaster (AMD)
Acknowledgements and contact info
• Rendering
- Min-Yu Huang (UC Davis)
• Cloth Simulation
- Lee Howes
© 2004 – 2010 Graphic Remedy. All Rights Reserved
www.amd.com
www.gremedy.com
Bullet physics
• An open source physics SDK for games (http://bulletphysics.org)
- Popular physics SDK
- Zlib license for copy-and-use openness
- Development led by Erwin Coumans of Sony
• Includes
- Rigid body dynamics, e.g.
- Ragdolls, destruction, and vehicles
- Soft body dynamics
- Cloth, rope, and deformable volumes
• AMD collaborating on GPU acceleration
- Cloth/soft body and fluids in OpenCL and DirectCompute
- Fully open-source contributions
OpenCL physics is…
Fluids
Cloth
Rigid bodies
Introducing cloth simulation
• A subset of the possible set of soft bodies
• For real time, generally based on a mass/spring system
- Large collection of masses (particles)
- Connected using spring constraints
- The layout and properties of the springs determine the properties of the cloth
Springs and masses
• Three main types of springs
- Structural
- Shearing
- Bending
Parallelism
• Large number of particles
- Appropriate for parallel processing
- Force from each spring constraint applied to both connected particles
[Figure: original layout vs. current layout; compute forces as stretch from rest length, apply position corrections to masses, compute new positions. The rest length of the spring is marked.]
Parallelism
• For each simulation iteration:
- Compute forces in each link based on its length
- Correct positions of masses/vertices from forces
- Compute new vertex positions
[Figure: the three solver steps shown on the layout diagram.]
CPU approach to simulation
• Iterative integration over vertex positions
- For each spring, compute a force
- Update both vertices with a new position
- Repeat n times, where n is configurable
• Note that the computation is serial
- Propagation of values through the solver is immediate
The CPU approach
for each iteration
{
for(int linkIndex = 0; linkIndex < numLinks; ++linkIndex)
{
float massLSC =
(inverseMass0 + inverseMass1)/linearStiffnessCoefficient;
float k = ((restLengthSquared - lengthSquared) /
(massLSC * (restLengthSquared + lengthSquared)));
vertexPosition0 -= length * (k*inverseMass0);
vertexPosition1 += length * (k*inverseMass1);
}
}
GPU parallelism
• The CPU implementation was serial.
- No atomicity issues.
- Value propagation immediate from a given update.
• The GPU implementation is parallel within a cloth.
- Multiple updates to the same node create races.
Vertex solver: a single batch
• Single batch
- Highly efficient per solver iteration
- Can re-use position data for central node
- Need to cleverly arrange data to allow efficient loop unrolling
• Double buffer the vertex data
- Updates will then not be seen until the next iteration
• Write to same buffer
- Updates can be seen quickly, but the computation will be non-deterministic
A branch divergence optimization
• The GPU is a vector machine: a collection of wide SIMD cores
- Divergent branches across a vector hurt performance
- Nodes have different degrees
- Regular mesh: low overhead, similar degree throughout
- Complicated mesh: arbitrarily many degree peaks
- Pack vertices by degree
Unfortunately…
• Our experiments have found that solver convergence is too slow
- Could we rectify this by changing the core solver algorithm?
• Slow convergence is due to:
- Slow propagation of new values
- Errors introduced because the solver does not conserve internal momentum
• Alternatively, let’s look at another approach
Batching the simulation
• Create independent subsets of links through graph coloring.
• Synchronize between batches
10 batches
for each iteration
{
for( int i = 0; i < m_linkData.m_batchStartLengths.size(); ++i )
{
int start = m_linkData.m_batchStartLengths[i].first;
int num = m_linkData.m_batchStartLengths[i].second;
for(int linkIndex = start; linkIndex < start + num; ++linkIndex)
{
…
}
}
}
Driving batches
Statically generated batches and pre-sorted buffers.
The inner loop, over the links within one batch, is now fully parallel.
solvePositionsFromLinksKernel.kernel.setArg(0, startLink);
solvePositionsFromLinksKernel.kernel.setArg(1, numLinks);
solvePositionsFromLinksKernel.kernel.setArg(2, kst);
solvePositionsFromLinksKernel.kernel.setArg(3, ti);
solvePositionsFromLinksKernel.kernel.setArg(4, m_linkData.m_clLinks.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(5, m_linkData.m_clLinksMassLSC.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(6, m_linkData.m_clLinksRestLengthSquared.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(7, m_vertexData.m_clVertexInverseMass.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(8, m_vertexData.m_clVertexPosition.getBuffer());
int numWorkItems = workGroupSize*((numLinks + (workGroupSize-1)) / workGroupSize);
cl_int err = m_queue.enqueueNDRangeKernel(
solvePositionsFromLinksKernel.kernel,
cl::NullRange, cl::NDRange(numWorkItems), cl::NDRange(workGroupSize));
Dispatching a batch
Note that the number of work items is rounded to a multiple of the group size
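The rounding expression used in the dispatch can be checked in isolation. A minimal sketch (the function name is ours; the values in the test are illustrative):

```cpp
#include <cassert>

// Round numLinks up to the nearest multiple of workGroupSize, so every
// work group is fully populated. The kernel must then guard against the
// excess work items, as the kernel body below does with its range check.
int roundUpWorkItems(int numLinks, int workGroupSize)
{
    return workGroupSize * ((numLinks + (workGroupSize - 1)) / workGroupSize);
}
```

This is needed because OpenCL requires the global work size to be a multiple of the work-group size when a local size is specified explicitly.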
__kernel void
SolvePositionsFromLinksKernel(
const int startLink,
const int numLinks,
const float kst,
const float ti,
__global int2 *g_linksVertexIndices,
__global float *g_linksMassLSC,
__global float *g_linksRestLengthSquared,
__global float *g_verticesInverseMass,
__global float4 *g_vertexPositions)
{
…
}
The OpenCL link solver kernel header
int linkID = get_global_id(0) + startLink;
if( get_global_id(0) < numLinks ) {
float massLSC = g_linksMassLSC[linkID];
float restLengthSquared = g_linksRestLengthSquared[linkID];
if( massLSC > 0.0f ) {
int2 nodeIndices = g_linksVertexIndices[linkID];
float3 position0 = g_vertexPositions[nodeIndices.x].xyz;
float3 position1 = g_vertexPositions[nodeIndices.y].xyz;
float inverseMass0 = g_verticesInverseMass[nodeIndices.x];
float inverseMass1 = g_verticesInverseMass[nodeIndices.y];
The OpenCL link solver kernel body
The number of links might not be a multiple of the block size.
Changing the memory layout might be a better solution.
float3 del = position1 - position0;
float lengthSquared = dot( del, del );
float k = ((restLengthSquared - lengthSquared)/(massLSC*(restLengthSquared + lengthSquared)))*kst;
position0 = position0 - del*(k*inverseMass0);
position1 = position1 + del*(k*inverseMass1);
g_vertexPositions[nodeIndices.x] = (float4)(position0, 0.f);
g_vertexPositions[nodeIndices.y] = (float4)(position1, 0.f);
}
}
The OpenCL link solver kernel body
Returning to our batching
• 10 batches: 10 OpenCL kernel enqueues/dispatches
• 1/10 of the links per batch
• Low compute density per thread
10 batches
Higher efficiency
• Can create larger groups
- The cloth is fixed-structure
- Can be preprocessed
• Fewer batches/dispatches
9 batches
Larger groups still
• We can move to much larger groups of links
- Number of parallel instances is reduced
- On arbitrary meshes, such batches are hard to create
• Larger batches mean:
- Smaller dispatches
- Less parallelism
4 batches
Solving cloths together
• Solve multiple cloths together in n batches
• Grouping
- Larger dispatches and reduced number of dispatches
- Regain the parallelism that increased work-per-thread removed
What if we consider SIMD packing?
• The previous batching would still require serial solving of each group
• However, the GPU is not really a collection of thousands of cores
- Tens of SIMD cores
- Each SIMD core runs a single instruction over a vector of work items
- Essentially a wide, flexible version of SSE
• So what does that tell us?
- Computations within a vector happen simultaneously
- Maybe we should be batching on the SIMD level, not on the work item level?
• The disadvantage of this is that we lose platform independence
How can this work?
• Take one group from the previous batching
• This group can be mapped onto a single SIMD core
A tighter loop
• We still need to batch, of course
• However, now we batch only WITHIN the SIMD
- These batches are processed in a single kernel call
- The loop to deal with them can be unrolled
Three Eras of Processor Performance
• Single-Core Era (single-thread performance over time)
- Enabled by: Moore's Law, voltage scaling, microarchitecture
- Constrained by: power, complexity
• Multi-Core Era (throughput performance over time / # of processors)
- Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
- Constrained by: power, parallel SW availability, scalability
• Heterogeneous Systems Era (targeted application performance over time / data-parallel exploitation)
- Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
- Constrained by: programming models, communication overheads
OpenCL Development Challenges
• Debugging and profiling parallel computing applications are hard, time-consuming tasks
• Delivering a robust, bug-free parallel computing application on time is hard
• It is almost impossible to optimize a parallel computing application to fully utilize the available system resources
gDEBugger
• gDEBugger is an OpenGL, OpenGL ES and OpenCL Debugger, Profiler and Memory Analyzer
• It provides the information a developer needs to find bugs and optimize application performance
Debugging Demo
OpenCL Calls History
• OpenCL calls are displayed in a Calls History view
• The Properties view displays each call's details
- Arguments
- Links to data viewer
- And more
Automatic error detection
gDEBugger can automatically break on:
• gDEBugger detected errors
• OpenCL errors
• OpenCL Memory leaks
• OpenCL function calls
• And more
Call Stack and Source Code views
• View the call stack and source code that led to the error / OpenCL function call
Statistical viewer
• Displays a statistical overview of the debugged application's OpenCL API usage
• Best-practice suggestions based on the application's activities
Images and Buffers viewer
• Displays images and buffers data
• Image and Data panes
Kernels Source Code Editor
• Displays allocated OpenCL Programs and Kernels
• Enables Program and Kernel "Edit and Continue"
OpenGL interoperability
• Debug applications that use both OpenGL and OpenCL in one consistent framework
• View interoperability properties
- OpenGL Contexts association
- OpenGL Buffers and Render-buffers association
- OpenGL Textures association
Profiling Demo
Real Time Statistics View
• Time-based overview of OpenCL activities
• Data is displayed per queue
- % Kernel, % Write, % Copy, % Read, % Idle
- Work items executed/sec
- Read, Write, Copy MB/sec
Command Queues Viewer
• Displays a detailed, per-queue, time-based view of OpenCL activities
Performance graph
• Displays performance counters from different sources
- gDEBugger's OpenGL and OpenCL servers
- GPUs and drivers: NVIDIA, AMD, S3 Graphics
- Operating systems: Windows, Mac OS X, iOS and Linux
Performance Analysis Toolbar
• Turn off different types of OpenCL operations to view the effect on parallel computing performance
- Disable Kernel Operations
- Disable Read Operations
- Disable Write Operations
- Disable Copy Operations
Optimizing OpenCL memory usage
Memory Analysis viewer
• Displays information about OpenCL objects memory usage
• Tracks OpenCL related memory leaks
• Displays object’s creation call stack
gDEBugger’s Customer benefits
• Improves application quality
• Optimizes application performance
• Reduces debugging and profiling time
• Shortens "time to market"
• Helps deploy on multiple platforms
Available on
• Windows - OpenGL, OpenGL ES and OpenCL
• Mac OS X - OpenGL and OpenCL
• iOS - iPhone & iPad on-device and Simulator, OpenGL ES 1.1 and 2.0
• Linux - OpenGL and OpenCL
Available now
•gDEBugger CL is in final beta testing
•Expected to be released in August 2010
•To join the free beta program
www.gremedy.com/gDEBuggerCL.php