VORPAL Optimizations for Petascale Systems
Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx, Stefan Muszala
Tech-X Corporation
Boyana Norris, Argonne National Lab
*mullowney@txcorp.com
*messmer@txcorp.com
Work supported by DOE Office of Science SBIR Phase II Award DE-FG02-07ER84731
VORPAL: Introduction
• VORPAL plasma framework widely used on leadership-class systems
– Particle-in-cell (PIC) algorithm for kinetic plasma model
– Finite-Difference Time-Domain (FDTD) for Maxwell solver
– Access to various libraries (Trilinos, PETSc) for doing linear system solves (electrostatic PIC)
– ADI methods for doing implicit Maxwell solve
• Self-consistent model for charged particles and electromagnetic field
– Electromagnetic field discretized on 3D Cartesian mesh
– Particles located anywhere in space
– Particles gather forces from and scatter charges/currents to the field
• Parallelization via domain decomposition
VORPAL Optimization Challenges
• Petascale systems require strong scaling
– FDTD computationally cheap to start with
– Need good performance on small domains
⇒ Efficient messaging on small computational domains
⇒ Efficient computation and messaging for heterogeneous architectures
• If particles are present, they are the main contribution to compute time
– Significant time savings via optimized particle-push algorithm
– Key challenge: finding a good data layout
⇒ Optimize nearest-neighbor messaging for small domain sizes
⇒ Optimize for heterogeneous architectures … GPUs
JumpShot Messaging Patterns in an FDTD Simulation
[Figure: JumpShot traces — left: PETSc on BG/L; right: VORPAL on BG/P]
Improving parallel efficiency with different messaging patterns
• Conventional field messaging: send and receive messages to/from all neighbors at once
• Staged messaging: send in one direction at a time, waiting for one direction to complete before starting the next
– Reduces overall number of messages: 6 instead of 26
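The 26-vs-6 count follows from simple neighbor arithmetic: a subdomain in d dimensions has 3^d − 1 face, edge, and corner neighbors, while staged messaging exchanges along one axis at a time (two faces per axis), letting edge and corner data ride along in the successive face exchanges. A minimal sketch (function names illustrative):

```cpp
// Message counts per exchange for the two schemes described above.

// Conventional all-at-once exchange: every cell of the surrounding
// 3^d block except the subdomain itself is a neighbor.
int conventionalMessages(int d) {
    int n = 1;
    for (int i = 0; i < d; ++i) n *= 3;
    return n - 1;        // 3^d - 1 neighbors (26 in 3D)
}

// Staged exchange: one send per face, two faces per axis,
// with each direction completed before the next begins.
int stagedMessages(int d) {
    return 2 * d;        // 6 in 3D
}
```

The trade-off is that the staged scheme serializes the three axis exchanges, which pays off when per-message latency dominates, i.e. on the small domains that strong scaling produces.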
Messaging results
• Staged messaging can be up to 5× faster for small domain sizes
• Similar performance on Cray XT4 and BG/P
[Figure: messaging timing results — left: Cray XT4; right: BG/P]
Effect of E/B Field Memory Structure on FDTD Performance
[Figure: two memory layouts for the E-field components. Layout A mixes the components cell by cell (Ex(i,j,k), Ey(i,j,k), Ez(i,j,k), Ex(i,j,k+1), Ey(i,j,k+1), Ez(i,j,k+1), …); Layout B stores each component contiguously (Ex(i,j,k), Ex(i,j,k+1), …, then Ey, then Ez).]
Memory Layout is a key consideration for GPU optimization
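The two layouts differ only in how a (cell, component) pair maps to a flat index. A minimal sketch, assuming Layout A interleaves the three components per cell and Layout B is component-contiguous (structure-of-arrays); the function names are illustrative. On a GPU, Layout B lets consecutive threads reading the same component touch consecutive addresses, which is the coalescing-friendly pattern:

```cpp
// Flat index of field component c (0 = Ex, 1 = Ey, 2 = Ez) at linear
// cell i, for n cells total. Illustrative sketch of the two layouts.
int layoutA(int i, int c)        { return 3 * i + c; } // interleaved per cell
int layoutB(int i, int c, int n) { return c * n + i; } // contiguous per component
```

With Layout B, thread t reading `Ex` at cell t accesses address t within the Ex block (unit stride); with Layout A the stride is 3, so each warp-wide load spans three times the memory.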
Using GPUs for Accelerating FDTD
• Fermi: 8× improvement over Tesla-series GPUs for double precision (>500 GFlops), likely 2× improvement in single precision (~2 TFlops)
• Fermi available in Spring 2010?
GPU Implementations of FDTD

Implementation 1: Generic 4-pt Stencil Kernels

for (int i = tid; i < nx; i += nThreads) {
    float r = 0.0f;
    r += a1 * in1[i];
    r += a2 * in2[i];
    r += a3 * in3[i];
    r += a4 * in4[i];
    res[i] += r;
}
Implementation 2: Yee-Mesh-Specific Kernels

float ex, ey, ez;
for (int i = tid; i < n; i += nThreads) {
    ex = Ex[i]; ey = Ey[i]; ez = Ez[i];
    Bx[i] += dtOverDy * (ez - EzYp1[i]) + dtOverDz * (EyZp1[i] - ey);
    By[i] += dtOverDz * (ex - ExZp1[i]) + dtOverDx * (EzXp1[i] - ez);
    Bz[i] += dtOverDx * (ey - EyXp1[i]) + dtOverDy * (ExYp1[i] - ex);
}
Implementation 1:
• Requires 6 calls to the generic kernel to do the full FDTD update
• Makes no use of the GPU memory hierarchy
• All memory accesses are global

Implementation 2:
• Corresponding call for the Ampere update
• Reuses 3 global memory accesses
• No use of shared memory or other trickery
Timings: FDTD Performance
[Figure: speedup plot]
Specifics
• Boundary conditions/PMLs can be handled through a spatially dependent dielectric constant
• Dey-Mittra cut-cell algorithms can be used through slight modifications to the 4-pt stencil kernel
GPULib (https://gpulib.txcorp.com)

• Collection of optimized vector routines that can perform all of the above-mentioned algorithms
– Accessible from within VORPAL or from high-level languages like IDL and MATLAB
– See Peter Messmer's dinner-time presentation on GPULib at the NUG meeting (Wed. 10/7)
Future Work
• Fully optimize the FDTD algorithm
• Move to multiple GPUs so that VORPAL can take advantage of new heterogeneous systems
• VORPAL ADI algorithm on GPUs:
– GPULib has highly optimized tridiagonal and pentadiagonal solvers for "small" linear systems (<1000 unknowns) that are easily task-farmed out to GPU thread blocks. Potential for huge speedup vs. CPU implementation.
• Move particle push to GPU
• Vlasov solver??