VORPAL Optimizations for Petascale Systems
Paul Mullowney, Peter Messmer, Ben Cowan, Keegan Amyx, Stefan Muszala
Tech-X Corporation
Boyana Norris, Argonne National Lab
*mullowney@txcorp.com
*messmer@txcorp.com
Work supported by DOE Office of Science SBIR Phase II Award DE-FG02-07ER84731
VORPAL: Introduction
• VORPAL plasma framework widely used on leadership-class systems
– Particle-in-cell (PIC) algorithm for kinetic plasma model
– Finite-Difference Time-Domain (FDTD) for Maxwell solver
– Access to various libraries (Trilinos, PETSc) for doing linear system solves (electrostatic PIC)
– ADI methods for doing implicit Maxwell solve
• Self-consistent model for charged particles and electromagnetic field
– Electromagnetic field discretized on 3D Cartesian mesh
– Particles located anywhere in space
– Particles gather forces from and scatter charges/currents to the field
• Parallelization via domain decomposition
VORPAL Optimization Challenges
• Petascale systems require strong scaling
– FDTD computationally cheap to start with
– Need good performance on small domains
⇒ Efficient messaging on small computational domains
⇒ Efficient computation and messaging for heterogeneous architectures
• If particles are present, they are the main contribution to compute time
– Significant time savings via optimized particle-push algorithm
– Key challenge: finding a good data layout
⇒ Optimize nearest-neighbor messaging for small domain sizes
⇒ Optimize for heterogeneous architectures … GPUs
JumpShot Messaging Patterns in an FDTD Simulation
[Figure: JumpShot traces — left: PETSc on BG/L; right: VORPAL on BG/P]
Improving parallel efficiency with different messaging patterns
• Conventional field messaging: send and receive messages to/from all neighbors at once
• Staged messaging: send in one direction at a time, waiting for one direction to complete before starting the next
– Reduces overall number of messages: 6 instead of 26
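The 26-vs-6 count follows from simple neighbor arithmetic: a subdomain in d dimensions has 3^d − 1 face, edge, and corner neighbors, while staged messaging exchanges along one axis at a time (two faces per axis), letting edge and corner data ride along in the successive face exchanges. A minimal sketch (function names illustrative):

```cpp
// Message counts per exchange for the two schemes described above.

// Conventional all-at-once exchange: every cell of the surrounding
// 3^d block except the subdomain itself is a neighbor.
int conventionalMessages(int d) {
    int n = 1;
    for (int i = 0; i < d; ++i) n *= 3;
    return n - 1;        // 3^d - 1 neighbors (26 in 3D)
}

// Staged exchange: one send per face, two faces per axis,
// with each direction completed before the next begins.
int stagedMessages(int d) {
    return 2 * d;        // 6 in 3D
}
```

The trade-off is that the staged scheme serializes the three axis exchanges, which pays off when per-message latency dominates, i.e. on the small domains that strong scaling produces.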
Messaging results
• Staged messaging can be up to 5× faster for small domain sizes
• Similar performance on Cray XT4 and BG/P
[Figure: messaging timing results — left: Cray XT4; right: BG/P]
Effect of E/B Field Memory Structure on FDTD Performance
[Figure: two memory layouts for the E-field components. Layout A mixes the components cell by cell (Ex(i,j,k), Ey(i,j,k), Ez(i,j,k), Ex(i,j,k+1), Ey(i,j,k+1), Ez(i,j,k+1), …); Layout B stores each component contiguously (Ex(i,j,k), Ex(i,j,k+1), …, then Ey, then Ez).]
Memory Layout is a key consideration for GPU optimization
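The two layouts differ only in how a (cell, component) pair maps to a flat index. A minimal sketch, assuming Layout A interleaves the three components per cell and Layout B is component-contiguous (structure-of-arrays); the function names are illustrative. On a GPU, Layout B lets consecutive threads reading the same component touch consecutive addresses, which is the coalescing-friendly pattern:

```cpp
// Flat index of field component c (0 = Ex, 1 = Ey, 2 = Ez) at linear
// cell i, for n cells total. Illustrative sketch of the two layouts.
int layoutA(int i, int c)        { return 3 * i + c; } // interleaved per cell
int layoutB(int i, int c, int n) { return c * n + i; } // contiguous per component
```

With Layout B, thread t reading `Ex` at cell t accesses address t within the Ex block (unit stride); with Layout A the stride is 3, so each warp-wide load spans three times the memory.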
Using GPUs for Accelerating FDTD
• Fermi: 8× improvement over Tesla-series GPUs for double precision (>500 GFlops), likely 2× improvement in single precision (~2 TFlops)
• Fermi available in Spring 2010?
GPU Implementations of FDTD

Implementation 1: Generic 4-pt Stencil Kernels

for (int i = tid; i < nx; i += nThreads) {
    float r = 0.0f;
    r += a1 * in1[i];
    r += a2 * in2[i];
    r += a3 * in3[i];
    r += a4 * in4[i];
    res[i] += r;
}
Implementation 2: Yee-Mesh-Specific Kernels

float ex, ey, ez;
for (int i = tid; i < n; i += nThreads) {
    ex = Ex[i]; ey = Ey[i]; ez = Ez[i];
    Bx[i] += dtOverDy * (ez - EzYp1[i]) + dtOverDz * (EyZp1[i] - ey);
    By[i] += dtOverDz * (ex - ExZp1[i]) + dtOverDx * (EzXp1[i] - ez);
    Bz[i] += dtOverDx * (ey - EyXp1[i]) + dtOverDy * (ExYp1[i] - ex);
}
Implementation 1:
• Requires 6 calls to the generic kernel to do the full FDTD update
• Makes no use of the GPU memory hierarchy
• All memory accesses are global

Implementation 2:
• Corresponding call for the Ampere update
• Reuses 3 global memory accesses
• No use of shared memory or other trickery
Timings: FDTD Performance
[Figure: speedup plot]
Specifics
• Boundary conditions/PMLs can be handled through a spatially dependent dielectric constant
• Dey-Mittra cut-cell algorithms can be used through slight modifications to the 4-pt stencil kernel
GPULib (https://gpulib.txcorp.com)

• Collection of optimized vector routines that can perform all of the above-mentioned algorithms
– Accessible from within VORPAL or from high-level languages like IDL and MATLAB
– See Peter Messmer's dinner-time presentation on GPULib at the NUG meeting (Wed. 10/7)
Future Work
• Fully optimize the FDTD algorithm
• Move to multiple GPUs so that VORPAL can take advantage of new heterogeneous systems
• VORPAL ADI algorithm on GPUs:
– GPULib has highly optimized tridiagonal and pentadiagonal solvers for "small" linear systems (<1000 unknowns) that are easily task-farmed out to GPU thread blocks. Potential for huge speedup vs. CPU implementation.
• Move particle push to GPU
• Vlasov solver??