porting bullet to opencl - khronos group · summary •introduction to bullet physics sdk...
TRANSCRIPT
© Copyright Khronos Group, 2010 - Page 1
Porting Bullet to OpenCL
Erwin CoumansSimulation Team Lead
Sony Computer Entertainment US R&D
Summary
• Introduction to Bullet Physics SDK
• Leverage the OpenCL SDKs from AMD, NVidia and Apple
• Implement and debug kernels on CPU first
• Compress data before sending it over the PCIe bus
- Only exchange the differences
• Change data structures from Array Of Structures to Structures of Arrays
• Choose appropriate algorithm
- Tree traversal from recursive to stackless to history traversal
• Use CMake for cross platform project files for MSVC, Xcode, Makefiles
- FindOpenCL.cmake
© Copyright Erwin Coumans, 2010 - Page 3
Some games using Bullet Physics
© Copyright Erwin Coumans, 2010 - Page 4
Some Movies using Bullet Physics
© Copyright Erwin Coumans, 2010 - Page 5
Authoring Tools
• Maya Dynamica Plugin
• Cinema 4D 11.5
• Blender
• Houdini
• Lightwave CORE
• TBA: Standalone Physics Editor
© Copyright Erwin Coumans, 2010 - Page 6
Summary
• Introduction to Bullet Physics SDK
• Leverage the OpenCL SDKs from AMD, NVidia and Apple
• Implement and debug kernels on CPU first
• Compress data before sending it over the PCIe bus
- Only exchange the differences
• Change data structures from Array Of Structures to Structures of Arrays
• Choose appropriate algorithm
- Tree traversal from recursive to stackless to history traversal
• Use CMake for cross platform project files for MSVC, Xcode, Makefiles
© Copyright Erwin Coumans, 2010 - Page 7
Leveraging the AMD, Apple & NVIDIA SDK
• Radix sort, bitonic sort
• Prefix scan, compaction
• Examples how to use fast shared memory
• Uniform Grid example in Particle Demo
© Copyright Erwin Coumans, 2010 - Page 8
Neighbor Search
• Calculate grid index of particle center
• Parallel Radix or Bitonic Sorted Hash Array
• Search 27 neighboring cells
- Can be reduced to 14 because of symmetry
• Interaction happens during search
- No need to store neighbor information
• Jacobi iteration: independent interactions
© Copyright Erwin Coumans, 2010 - Page 9
Interacting Particle Pairs
Array Index
Sorted Cell ID Particle ID
0 4,D
1 4,F
2 6,B
3 6,C
4 6,E
5 9,A
0 1 2 3
12 13 14 15
5 7
8 10 11
B
C E
D
F
A
Interacting Particle Pairs
D,F
B,C
B,E
C,E
A,D
A,F
A,B
A,C
A,E
© Copyright Erwin Coumans, 2010 - Page 10
Summary
• Introduction to Bullet Physics SDK
• Leverage the OpenCL SDKs from AMD, NVidia and Apple
• Implement and debug kernels on CPU first
• Compress data before sending it over the PCIe bus
- Only exchange the differences
• Change data structures from Array Of Structures to Structures of Arrays
• Choose appropriate algorithm
- Tree traversal from recursive to stackless to history traversal
• Use CMake for cross platform project files for MSVC, Xcode, Makefiles
Implement and debug on CPU
• Keep a reference CPU implementation and transfer memory to CPU and back
- All stages in the physics pipeline can be hot-swapped between CPU and GPU
• We implemented MiniCL, a non-compliant OpenCL replacement
- OpenCL kernels compiled and linked as regular C
- Use regular MSVC debugger, breakpoints etc.
- Multi-threaded or sequential for easier debugging
© Copyright Erwin Coumans, 2010 - Page 12
Summary
• Introduction to Bullet Physics SDK
• Leverage the OpenCL SDKs from AMD, NVidia and Apple
• Implement and debug kernels on CPU first
• Compress data before sending it over the PCIe bus
- Only exchange the differences
• Change data structures from Array Of Structures to Structures of Arrays
• Choose appropriate algorithm
- Tree traversal from recursive to stackless to history traversal
• Use CMake for cross platform project files for MSVC, Xcode, Makefiles
© Copyright Erwin Coumans, 2010 - Page 13
Compress data before sending
• Using the GPU Uniform Grid as part of the Bullet CPU pipeline
• Available through btOpenCLBroadphase
• Reduce bandwidth and avoid sending all pairs
• Bullet requires persistent contact pairs
- to store cached solver information (warm-starting)
• Pre-allocate pairs for each object
© Copyright Erwin Coumans, 2010 - Page 14
Persistent Pairs
Before After Difference
0 1 2 3
12 13 14 15
5 7
8 10 11
B
C E
D
F
A
Particle Pairs Before
After Differences
D,F D,F A,B removed
B,C B,C B,C removed
B,E B,E C,F added
C,E C,E C,D added
A,D A,D
A,F A,F
A,B A,C
A,C A,E
A,E C,F
C,D
0
1
2 3
12 13 14 15
5 7
8 10 11
B
C E
D
F
A
© Copyright Erwin Coumans, 2010 - Page 15
Summary
• Introduction to Bullet Physics SDK
• Leverage the OpenCL SDKs from AMD, NVidia and Apple
• Implement and debug kernels on CPU first
• Change data structures from Array Of Structures to Structures of Arrays
• Choose appropriate algorithm
- Tree traversal from recursive to stackless to history traversal
• Compress data before sending it over the PCIe bus
- Only exchange the differences
• Use CMake for cross platform project files for MSVC, Xcode, Makefiles
© Copyright Erwin Coumans, 2010 - Page 16
From particles to rigid bodies
Particles Rigid Bodies
World Transform Position Position and Orientation
Neighbor Search Uniform Grid Dynamic BVH tree
Compute Contacts Sphere-Sphere Generic Convex Closest Points, GJK
Static Geometry Planes Concave Triangle Mesh
Solving method Jacobi Projected Gauss Seidel
© Copyright Erwin Coumans, 2010 - Page 17
Dynamic BVH Trees
0 1 2 3
12 13 14 15
5 7
8 10 11
B
C E
D
F
A
B
C E
D
F
ADF
BC E A
© Copyright Erwin Coumans, 2010 - Page 18
Dynamic BVH tree acceleration structure
• Broadphase n-body neighbor search
• Ray and convex sweep test
• Concave triangle meshes
• Cloth and Soft body
• Compound collision shapes
• Occlusion culling and view frustum culling
© Copyright Erwin Coumans, 2010 - Page 19
Dynamic BVH tree (instead of uniform grid)
• Keep two dynamic trees, one for moving
objects, other for objects (sleeping/static)
• Find neighbor pairs:
- Overlap M versus M and Overlap M versus S
DF
BC E A
QP
UR S T
S: Non-moving DBVT M: Moving DBVT
© Copyright Erwin Coumans, 2010 - Page 20
Some Optimizations
• Objects can move from one tree to the other
• Incrementally update, re-balance tree
• Tree creation and update is hard to parallelize
- Research and development in progress
• Tree traversal can be parallelized on GPU
- Idea proposed by Takahiro Harada at GDC 2009
© Copyright Erwin Coumans, 2010 - Page 21
Parallel GPU Tree Traversal using History Flags
• Alternative to recursive or stackless traversal
00
00
00
00
© Copyright Erwin Coumans, 2010 - Page 22
Parallel GPU Tree Traversal using History Flags
• 2 bits at each level indicating visited children
10
00
00
00
© Copyright Erwin Coumans, 2010 - Page 23
Parallel GPU Tree Traversal using History Flags
• Set bit when descending into a child branch
10
10
00
00
© Copyright Erwin Coumans, 2010 - Page 24
Parallel GPU Tree Traversal using History Flags
• Reset bits when ascending up the tree
10
10
00
00
© Copyright Erwin Coumans, 2010 - Page 25
Parallel GPU Tree Traversal using History Flags
• Requires only twice the tree depth bits
10
11
00
00
© Copyright Erwin Coumans, 2010 - Page 26
Parallel GPU Tree Traversal using History Flags
• When both bits are set, ascend to parent
10
11
00
00
© Copyright Erwin Coumans, 2010 - Page 27
Parallel GPU Tree Traversal using History Flags
• When both bits are set, ascend to parent
10
00
00
00
© Copyright Erwin Coumans, 2010 - Page 28
History tree traversaldo{
if(Intersect(n->volume,volume)){
if(n->isinternal()) {
if (!historyFlags[curDepth].m_visitedLeftChild){
historyFlags[curDepth].m_visitedLeftChild = 1;
n = n->childs[0];curDepth++;
continue;}
if (!historyFlags[curDepth].m_visitedRightChild){
historyFlags[curDepth].m_visitedRightChild = 1;
n = n->childs[1];curDepth++;
continue;}
}
else
policy.Process(n);
}
n = n->parent;
historyFlags[curDepth].m_visitedLeftChild = 0;
historyFlags[curDepth].m_visitedRightChild = 0;
curDepth--;
} while (curDepth);
© Copyright Erwin Coumans, 2010 - Page 29
Voxelizing objects
© Copyright Erwin Coumans, 2010 - Page 30
General convex collision detection on GPU
• Algorithms are branchy (GJK, EPA)
• Parts of the computation can be accelerated using GPU hardware
- Replace support map by cube map lookup
© Copyright Erwin Coumans, 2010 - Page 31
Parallelizing Constraint Solver
• Projected Gauss Seidel iterations are not embarrassingly parallel
A
B D
1 4
© Copyright Erwin Coumans, 2010 - Page 32
Reordering constraint batches
A B C D
1 1
2 2
3 3
4 4
A
B D
1 4
A B C D
1 1 3 3
4 2 2 4
© Copyright Erwin Coumans, 2010 - Page 33
OpenCL kernel Setup Batches__kernel void kSetupBatches(...)
{
int index = get_global_id(0);
int currPair = index;
int objIdA = pPairIds[currPair * 2].x;
int objIdB = pPairIds[currPair * 2].y;
int batchId = pPairIds[currPair * 2 + 1].x;
int localWorkSz = get_local_size(0);
int localIdx = get_local_id(0);
for(int i = 0; i < localWorkSz; i++){
if((i==localIdx)&&(batchId < 0)&&(pObjUsed[objIdA]<0)&&(pObjUsed[objIdB]<0)){
if(pObjUsed[objIdA] == -1)
pObjUsed[objIdA] = index;
if(pObjUsed[objIdB] == -1)
pObjUsed[objIdB] = index;
}
barrier(CLK_GLOBAL_MEM_FENCE);
}
}
© Copyright Erwin Coumans, 2010 - Page 34
Summary
• Implement and debug kernels on CPU first
• Leverage the OpenCL SDKs from AMD, NVidia and Apple
• Change data structures from Array Of Structures to Structures of Arrays
• Choose appropriate algorithm
- Tree traversal from recursive to stackless to history traversal
• Compress data before sending it over the PCIe bus
- Only exchange the differences
• Use CMake for cross platform project files for MSVC, Xcode, Makefiles
© Copyright Erwin Coumans, 2010 - Page 35
Thanks!
• Available in SVN branches/OpenCL
- http:\\bullet.googlecode.com
• Visit the Physics Simulation Forum at
- http:\\bulletphysics.org
• For technical details see also Takahiro Harada’s GDC talk
• Email: [email protected]