TRANSCRIPT
The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/~keyes/talks.html
David E. Keyes
Department of Mathematics & Statistics,
Old Dominion University
Institute for Scientific Computing Research,
Lawrence Livermore National Laboratory
Institute for Computer Applications in Science & Engineering,
NASA Langley Research Center
GaMM 2001
Background of this Presentation
Originally prepared for the Petaflops II Conference
History of the Petaflops Initiative in the USA:
Enabling Technologies for Petaflops Computing, Feb 1994 (book by Sterling, Messina, and Smith, MIT Press, 1995)
Applications Workshop, Aug 1994
Architectures Workshop, Apr 1995
Systems Software Workshop, Jun 1996
Algorithms Workshop, Apr 1997
Systems Operations Workshop Review, Jun 1998
Enabling Technologies for Petaflops II, Feb 1999
Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)
GaMM 2001
Weighing in at the Bottom Line
Characterization of a 1 Teraflop/s computer of today:
about 1,000 processors of 1 Gflop/s (peak) each
due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
How do we want to get to 1 Petaflop/s by 2007 (the original goal)?
1,000,000 processors of 1 Gflop/s each (only wider)?
10,000 processors of 100 Gflop/s each (mainly deeper)?
From the point of view of PDE simulations on quasi-static Eulerian grids: either!
Caveat: dynamic grid simulations are not covered in this talk, but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere
GaMM 2001
Perspective
Many “Grand Challenges” in computational science are formulated as PDEs (possibly among alternative formulations)
However, PDE simulations historically have not performed as well as other scientific simulations
PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to “max out” on the standard LINPACK benchmark
The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
Memory-centric view of computation needs further promotion
GaMM 2001
Application Performance History: “3 orders of magnitude in 10 years” – better than Moore’s Law
GaMM 2001
Bell Prize Performance History
Year  Type  Application   Gflop/s  System        No. Procs
1988  PDE   Structures        1.0  Cray Y-MP             8
1989  PDE   Seismic           5.6  CM-2              2,048
1990  PDE   Seismic            14  CM-2              2,048
1992  NB    Gravitation       5.4  Delta               512
1993  MC    Boltzmann          60  CM-5              1,024
1994  IE    Structures        143  Paragon           1,904
1995  MC    QCD               179  NWT                 128
1996  PDE   CFD               111  NWT                 160
1997  NB    Gravitation       170  ASCI Red          4,096
1998  MD    Magnetism       1,020  T3E-1200          1,536
1999  PDE   CFD               627  ASCI BluePac      5,832
2000  NB    Gravitation     1,349  GRAPE-6              96
GaMM 2001
Plan of Presentation
General characterization of PDE requirements
Four sources of performance improvement to get from current 100's of Gflop/s (for PDEs) to 1 Pflop/s
Each illustrated with examples from computational aerodynamics, offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
Performance presented on up to thousands of processors of T3E and ASCI Red (for parallel aspects) and on numerous uniprocessors (for memory hierarchy aspects)
GaMM 2001
Purpose of Presentation
Not to argue for specific algorithms/programming models/codes in any detail, but see talks on Newton-Krylov-Schwarz (NKS) methods under homepage
Provide a requirements target for designers of today's systems: typical of several contemporary successfully parallelized PDE applications, though not comprehensive of all important large-scale applications
Speculate on a requirements target for designers of tomorrow's systems
Promote attention to current architectural weaknesses, relative to the requirements of PDEs
GaMM 2001
Four Sources of Performance Improvement
Expanded number of processors: arbitrarily large factor, through extremely careful attention to load balancing and synchronization
More efficient use of processor cycles, and faster processor/memory elements: one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
Algorithmic variants that are more architecture-friendly: approximately an order of magnitude, through improved locality and relaxed synchronization
Algorithms that deliver more “science per flop”: possibly a large problem-dependent factor, through adaptivity (this last does not contribute to raw flop/s!)
GaMM 2001
PDE Varieties and Complexities
Varieties of PDEs:
evolution (time hyperbolic, time parabolic)
equilibrium (elliptic, spatially hyperbolic or parabolic)
mixed, varying by region
mixed, of multiple type (e.g., parabolic with elliptic constraint)
Complexity parameterized by:
spatial grid points, Nx
temporal grid points, Nt
components per point, Nc
auxiliary storage per point, Na
grid points in stencil, Ns
Memory: $M \propto N_x\,(N_c + N_a + N_c^2 N_s)$
Work: $W \propto N_x N_t\,(N_c + N_a + N_c^2 N_s)$
(a small calculator for these two estimates is sketched below)
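As a rough illustration (not part of the original slides), here is a minimal Python sketch that evaluates the two estimates above; the parameter values are assumed, loosely modeled on a multicomponent 3D flow problem.

```python
# Minimal sketch (not from the talk): evaluate the memory and work estimates above.
# Parameter values below are assumed, loosely modeled on a 3D multicomponent flow.
def pde_memory(Nx, Nc, Na, Ns):
    """Words of storage: state + auxiliaries + Jacobian blockrow per gridpoint."""
    return Nx * (Nc + Na + Nc * Nc * Ns)

def pde_work(Nx, Nt, Nc, Na, Ns):
    """Flops up to a constant: the memory-like term, touched once per time-like step."""
    return Nx * Nt * (Nc + Na + Nc * Nc * Ns)

if __name__ == "__main__":
    Nx, Nt, Nc, Na, Ns = 10**6, 10**2, 5, 10, 15   # assumed sizes
    print(f"memory ~ {pde_memory(Nx, Nc, Na, Ns):.2e} words")
    print(f"work   ~ {pde_work(Nx, Nt, Nc, Na, Ns):.2e} flop (times a constant)")
```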
GaMM 2001
Resource Scaling for PDEs
For 3D problems, $\text{(Memory)} \propto \text{(Work)}^{3/4}$ (a one-line derivation follows below)
for equilibrium problems, work scales as problem size times the number of iteration steps – for “reasonable” implicit methods, proportional to the resolution in a single spatial dimension
for evolutionary problems, work scales as problem size times the number of time steps – CFL-type arguments place the latter on the order of the resolution in a single spatial dimension
The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
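The 3/4 power can be recovered in one line from the counting on this slide; a sketch of the derivation, assuming n grid points per spatial dimension:

```latex
% 3D storage scales like the number of gridpoints; the step or iteration count
% scales like the one-dimensional resolution n (CFL bound or "reasonable" implicit method).
\[
  M \propto n^{3}, \qquad W \propto n^{3}\cdot n = n^{4}
  \;\Longrightarrow\; M \propto W^{3/4}.
\]
```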
GaMM 2001
Typical PDE Tasks
Vertex-based loops: state vector and auxiliary vector updates
Edge-based “stencil op” loops: residual evaluation, approximate Jacobian evaluation, Jacobian-vector product (often replaced with a matrix-free form involving residual evaluation), intergrid transfer (coarse/fine)
Sparse, narrow-band recurrences: approximate factorization and back substitution, smoothing
Vector inner products and norms: orthogonalization/conjugation, convergence progress and stability checks
GaMM 2001
Edge-based Loop
Vertex-centered tetrahedral grid
Traverse by edges: load vertex values, compute intensively, store contributions to flux at vertices (a schematic loop is sketched below)
Each vertex appears in approximately 15 flux computations
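A schematic of such an edge-based loop (an illustrative sketch only, not the talk's production code; the `flux` routine and array shapes are assumptions):

```python
import numpy as np

def edge_based_residual(edges, u, geom, flux):
    """Minimal sketch of an edge-based "stencil op" loop on a vertex-centered grid.

    edges : (E, 2) int array of endpoint vertex indices
    u     : (V, Nc) state vector per vertex
    geom  : (E, ...) per-edge geometric coefficients (e.g., dual-face normals)
    flux  : function returning the numerical flux for one edge
    """
    res = np.zeros_like(u)
    for e, (i, j) in enumerate(edges):
        f = flux(u[i], u[j], geom[e])   # load both vertex states, compute intensively
        res[i] += f                     # scatter equal-and-opposite contributions
        res[j] -= f                     # back to the two endpoint vertices
    return res
```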
GaMM 2001
Explicit PDE Solvers
Concurrency is pointwise, O(N)
Comm.-to-comp. ratio is surface-to-volume, O((N/P)^{-1/3})
Communication range is nearest-neighbor, except for time-step computation
Synchronization frequency is once per step, O((N/P)^{-1})
Storage per point is low
Load balance is straightforward for static quasi-uniform grids
Grid adaptivity (together with temporal stability limitation) makes load balance nontrivial
Explicit update: $u^{\ell+1} = u^{\ell} + \Delta t^{\ell}\, f(u^{\ell})$
GaMM 2001
Domain-decomposed Implicit PDE Solvers
Concurrency is pointwise, O(N), or subdomainwise, O(P)
Comm.-to-comp. ratio is still mainly surface-to-volume, O((N/P)^{-1/3})
Communication is still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
Synchronization frequency is often more than once per grid-sweep, up to the Krylov dimension, O(K(N/P)^{-1})
Storage per point is higher, by a factor of O(K)
Load balance issues are the same as for the explicit case
Implicit update: solve $\dfrac{u^{\ell+1} - u^{\ell}}{\Delta t^{\ell}} = f(u^{\ell+1})$ for $u^{\ell+1}$
GaMM 2001
Source #1: Expanded Number of Processors
Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together – true for most explicit or iterative implicit PDE solvers; popularized in the 1986 Karp Prize paper by Gustafson et al.
Simple, back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
Caveat: the processor network must also be scalable (applies to protocols as well as to hardware)
The remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 10^4 times faster)
GaMM 2001
Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations
Given complexity estimates of the leading terms of:
the concurrent computation (per iteration phase)
the concurrent communication
the synchronization frequency
And a model of the architecture including:
internode communication (network topology and protocol, reflecting horizontal memory structure)
on-node computation (effective performance parameters, including vertical memory structure)
One can estimate optimal concurrency and execution time on a per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate):
simply differentiate the time estimate in terms of (N,P) with respect to P, equate to zero, and solve for P in terms of N
GaMM 2001
3D Stencil Costs (per Iteration)
grid points in each direction n, total work N = O(n^3)
processors in each direction p, total procs P = O(p^3)
memory per node requirements O(N/P)
execution time per iteration A n^3/p^3
grid points on a side of each processor subdomain n/p
neighbor communication per iteration B n^2/p^2
cost of global reductions in each iteration C log p or C p^{1/d}
C includes synchronization frequency
same dimensionless units for measuring A, B, C – e.g., cost of a scalar floating point multiply-add
GaMM 2001
3D Stencil Computation Illustration – Rich local network, tree-based global reductions
total wall-clock time per iteration: $T(n,p) = A\,\dfrac{n^3}{p^3} + B\,\dfrac{n^2}{p^2} + C \log p$
for optimal p, $\dfrac{\partial T}{\partial p} = 0$: $-3A\,\dfrac{n^3}{p^4} - 2B\,\dfrac{n^2}{p^3} + \dfrac{C}{p} = 0$, or $3A\,\dfrac{n^3}{p^3} + 2B\,\dfrac{n^2}{p^2} = C$
this cubic in p has a positive root with $p_{opt} \propto n$; in the limit $B \to 0$ it reduces to $p_{opt} = \left(\dfrac{3A}{C}\right)^{1/3} n$
so, without “speeddown,” p can grow in proportion to n
(a small numerical check of this optimization is sketched below)
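A quick numerical check of this per-iteration model (an illustrative sketch, not from the talk; the cost constants A, B, C are assumed values in the same dimensionless units). The ratio p_opt/n settling toward a constant is the "scalable" behavior claimed above.

```python
import math

def T(n, p, A, B, C):
    """Per-iteration model: local compute + neighbor exchange + tree-based reduction."""
    return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

def p_opt(n, A, B, C):
    """Brute-force minimizer of T over integer p (the calculus gives a cubic in p)."""
    return min(range(1, n**3 + 1), key=lambda p: T(n, p, A, B, C))

if __name__ == "__main__":
    A, B, C = 1.0, 10.0, 100.0          # assumed dimensionless cost constants
    for n in (16, 32, 64):
        p = p_opt(n, A, B, C)
        print(f"n={n:3d}  p_opt={p:5d}  p_opt/n={p/n:.2f}  T={T(n, p, A, B, C):.1f}")
```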
GaMM 2001
3D Stencil Computation Illustration – Rich local network, tree-based global reductions
optimal running time: substitute $p_{opt}$ into $T(n,p)$; since $p_{opt} \propto n$, the compute and neighbor-exchange terms become constants and only the reduction term grows, as $C \log n$
in the limit of infinite neighbor bandwidth, zero neighbor latency ($B \to 0$):
$T(n, p_{opt}(n)) = C \log n + \tfrac{C}{3}\left(1 + \log\tfrac{3A}{C}\right) = C \log n + \text{const.}$
(This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)
GaMM 2001
Summary for Various Networks
With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: optimal number of processors scales linearly with problem size
With 3D torus-based global reductions and scalable nearest-neighbor hardware: optimal number of processors scales as the three-fourths power of problem size (almost “scalable”)
With a common network bus (heavy contention): optimal number of processors scales as the one-fourth power of problem size (not “scalable”) – bad news for conventional Beowulf clusters, but see the 2000 Bell Prize “price-performance awards”
GaMM 2001
1999 Bell Prize Parallel Scaling Results on ASCI Red
ONERA M6 Wing Test Case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (Pentium Pro 333 MHz processors)
GaMM 2001
Surface Visualization of Test Domain for
Computing Flow over an ONERA M6 Wing
GaMM 2001
Transonic “Lambda” Shock Solution
GaMM 2001
Fixed-size Parallel Scaling Results (Flop/s)
GaMM 2001
Fixed-size Parallel Scaling Results (Time in seconds)
GaMM 2001
Algorithm: Newton-Krylov-Schwarz
Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable
GaMM 2001
Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)
ASCI runs: grid of 3.1M vertices; T3E runs: grid of 24.7M vertices
GaMM 2001
Source #2: More Efficient Use of Faster Processors
Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
PDEs have periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
Combined with “processors-in-memory” (PIM) technology for gather/scatter into densely used block transfers and multithreading, PDEs can approach full utilization of processor cycles
Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints worth of loads and stores
One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range
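To make the bandwidth caveat concrete, here is a back-of-envelope bound on the Mflop/s sustainable by a memory-bound sparse kernel (an illustrative sketch; the bytes-per-flop and bandwidth figures are assumed, not measurements from the talk):

```python
def bandwidth_bound_mflops(bytes_per_flop, stream_bw_gbs):
    """Upper bound on sustainable Mflop/s when every operand must stream from memory."""
    return stream_bw_gbs * 1e3 / bytes_per_flop   # GB/s -> MB/s, divided by bytes/flop

if __name__ == "__main__":
    # Assumed numbers: a sparse matrix-vector product moves very roughly 12 bytes
    # (an 8-byte coefficient plus an amortized 4-byte index) per multiply-add (2 flops),
    # i.e. about 6 bytes per flop; 1 GB/s is a plausible late-1990s memory system.
    print(f"{bandwidth_bound_mflops(6.0, 1.0):.0f} Mflop/s bound")  # ~167 Mflop/s
```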
GaMM 2001
Following the Clock
1999 Predictions from the Semiconductor Industry Association
http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
A factor of 2-3 can be expected by 2007 by following the clock alone
Frequency in GHz             2000   2004   2008   2014
Local (high perf.)           1.49   2.95   6.00   13.5
Cross-chip (high perf.)      1.32   1.86   2.50   3.60
Cross-chip (cost perf.)      0.66   0.99   1.40   2.20
Chip-to-board (high perf.)   1.32   1.86   2.50   3.60
GaMM 2001
Example of Multithreading
Same ONERA M6 wing Euler code simulation on ASCI Red
ASCI Red contains two processors per node, sharing memory
The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

Number of Nodes   1 MPI Process   2 MPI Processes   2 Threads
 256                        510               332         293
1024                        183               136         109
3072                         93                91          63
GaMM 2001
PDE Workingsets
Smallest: data for a single stencil: $N_s\,(N_c^2 + N_c + N_a)$ (sharp)
Largest: data for an entire subdomain: $(N_x/P)\,(N_c^2 N_s + N_c + N_a)$ (sharp)
Intermediate: data for a neighborhood collection of stencils, reused as possible
(a small calculator for these sizes is sketched below)
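For a feel for the magnitudes, a minimal sketch evaluating these workingset sizes (the parameter values are assumed, as in the earlier complexity sketch):

```python
def smallest_workingset(Nc, Na, Ns):
    """Words touched by a single stencil operation."""
    return Ns * (Nc**2 + Nc + Na)

def largest_workingset(Nx, P, Nc, Na, Ns):
    """Words for an entire subdomain of Nx/P gridpoints."""
    return (Nx // P) * (Nc**2 * Ns + Nc + Na)

if __name__ == "__main__":
    Nc, Na, Ns = 5, 10, 15                                   # assumed sizes
    print(smallest_workingset(Nc, Na, Ns))                   # 600 words
    print(largest_workingset(10**6, 1000, Nc, Na, Ns))       # per 1000-way subdomain
```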
GaMM 2001
Cache Traffic for PDEs
As successive workingsets “drop” into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses and reducing demand on main memory bandwidth
GaMM 2001
Strategies Based on Workingset Structure
No performance value in memory levels larger than the subdomain
Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
After providing an L1 large enough for the smallest workingset (and multiple independent copies of it to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM – used to pack/unpack densely used cache lines during subdomain traversal (a gather/scatter sketch follows below)
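The pack/unpack idea can be sketched in a few lines (illustrative only; this mimics in software what the slide proposes to do with PIM hardware):

```python
import numpy as np

def build_gather_table(edges):
    """Connectivity table in edge-traversal order; rebuilt after each grid rebalancing."""
    return np.asarray(edges).reshape(-1)     # (E, 2) vertex pairs -> flat index stream

def gather(u, table):
    """Pack scattered per-vertex data into a dense contiguous stream."""
    return u[table]                          # densely used block transfer

def scatter_add(res, contributions, table):
    """Unpack: add dense per-edge contributions back to their home vertices."""
    np.add.at(res, table, contributions)     # handles repeated vertex indices correctly
```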
GaMM 2001
Costs of Greater Per-processor Efficiency
Programming complexity of managing subdomain traversal
Space to store gather/scatter tables in PIM
Time to (re)build gather/scatter tables in PIM
Memory bandwidth commensurate with the peak rates of all processors
GaMM 2001
Source #3: More “Architecture Friendly” Algorithms
Algorithmic practice needs to catch up to architectural demands: several “one-time” gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
“One-time” refers to improvements by small constant factors, nothing that scales in N or P – complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
Caveat: remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
Perhaps an order of magnitude of performance remains here
GaMM 2001
Raw Performance Improvement from Algorithms
Spatial reorderings that improve locality:
interlacing of all related grid-based data structures
ordering gridpoints and grid edges for L1/L2 reuse
Discretizations that improve locality:
higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
vertex-centering (for the same tetrahedral grid, leads to denser blockrows than cell-centering)
Temporal reorderings that improve locality:
block vector algorithms (reuse cached matrix blocks; vectors in the block are independent)
multi-step vector algorithms (reuse cached vector blocks; vectors have a sequential dependence)
(an interlacing/reordering sketch follows below)
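A minimal sketch of the interlacing and reordering idea (illustrative only; the random permutation is a stand-in for a real bandwidth-reducing ordering such as RCM):

```python
import numpy as np

V, Nc = 100_000, 5
separate = [np.random.rand(V) for _ in range(Nc)]    # one array per unknown (e.g., rho, u, v, w, E)

# Interlaced (point-major) layout: the Nc unknowns of each gridpoint are adjacent in memory,
# so a single cache line delivers the whole state a stencil visit needs.
interlaced = np.stack(separate, axis=1)              # shape (V, Nc)

# Gridpoint/edge reordering then keeps successively visited points nearby in memory;
# 'perm' here is a random stand-in for a bandwidth-reducing ordering.
perm = np.random.permutation(V)
interlaced_reordered = interlaced[perm]
```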
GaMM 2001
Raw Performance Improvement from Algorithms
Temporal reorderings that reduce synchronization penalty:
less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
less global methods that reduce synchronization range, by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
Precision reductions that make bandwidth seem larger:
lower precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full precision extensions; see the sketch below)
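A minimal sketch of the reduced-precision storage idea (an illustrative NumPy toy, not the production solver):

```python
import numpy as np

def store_preconditioner(M_blocks):
    """Keep the (approximate) preconditioner in 32-bit form: halves its footprint."""
    return M_blocks.astype(np.float32)

def apply_preconditioner(M32, r):
    """Promote to 64-bit before arithmetic; only the stored representation is reduced."""
    return M32.astype(np.float64) @ r

if __name__ == "__main__":
    M = np.linalg.inv(np.eye(4) + 0.01 * np.random.rand(4, 4))  # toy dense stand-in
    r = np.random.rand(4)
    z = apply_preconditioner(store_preconditioner(M), r)
    print(np.abs(z - M @ r).max())   # perturbation only at the level of single precision
```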
GaMM 2001
Improvements Resulting from Locality Reordering
Processor       Clock MHz  Peak Mflop/s  Orig. Mflop/s  Orig. % of Peak  Interl. only Mflop/s  Reord. only Mflop/s  Opt. Mflop/s  Opt. % of Peak
Pent. Pro         200         200           16             8.0               26                    27                 42            21.0
Pent. Pro         333         333           21             6.3               36                    40                 60            18.8
Pent. II/NT       400         400           31             7.8               49                    49                 78            19.5
Pent. II/LIN      400         400           33             8.3               47                    52                 83            20.8
Ultra II/HPC      400         800           20             2.5               36                    47                 71             8.9
Ultra II          360         720           25             3.5               47                    54                 94            13.0
Ultra II          300         600           18             3.0               35                    42                 75            12.5
Alpha 21164       600        1200           16             1.3               37                    47                 91             7.6
Alpha 21164       450         900           14             1.6               32                    39                 75             8.3
604e              332         664           15             2.3               31                    43                 66             9.9
P2SC (4 card)     120         480           15             3.1               40                    59                117            24.3
P2SC (2 card)     120         480           13             2.7               35                    51                101            21.4
P3                200         800           32             4.0               68                    87                163            20.3
R10000            250         500           26             5.2               59                    74                127            25.4
GaMM 2001
Improvements from Blocking Vectors
Same ONERA M6 Euler simulation, on SGI Origin
One vector represents standard GMRES acceleration
Four vectors is a blocked Krylov method, not yet in the “production” version
Savings arise from not reloading matrix elements of the Jacobian for each new vector (four-fold increase in matrix element use per load)
The flop/s rate is effectively tripled – however, can the extra vectors be used efficiently from a numerical viewpoint?
(the memory-reuse effect is sketched below)

Number of Vectors   Bytes/floating multiply-add   Mflop/s achieved
1                   12.36                          45
4                    3.31                         120
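The memory-reuse effect behind the block variant can be sketched as follows (illustrative; SciPy is assumed, and this stands in for, rather than reproduces, the blocked Krylov production kernel):

```python
import numpy as np
import scipy.sparse as sp

n, k = 100_000, 4
A = sp.random(n, n, density=1e-4, format="csr")   # stand-in for the sparse Jacobian
v = np.random.rand(n)                             # a single Krylov vector
V = np.random.rand(n, k)                          # a block of k Krylov vectors

y1 = A @ v   # each stored matrix entry is loaded for one multiply-add
yk = A @ V   # the same entry is reused against k vectors per load,
             # cutting the bytes moved per multiply-add by roughly a factor of k
```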
GaMM 2001
Improvements from Reduced Precision
Same ONERA M6 Euler simulation, on SGI Origin
Standard (middle column) is double precision in all floating quantities
Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor
Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties

Number of Processors   Double Precision   Single Precision
 16                    223                139
 32                    117                 67
 64                     60                 34
120                     31                 16
GaMM 2001
Source #4: Algorithms Packing More Science Per Flop
Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
Desirable that petaflop/s machines be general purpose enough to run the “best” algorithms
Not daunting, conceptually, but puts an enormous premium on dynamic load balancing
An order of magnitude or more can be gained here for many problems
GaMM 2001
Examples of Adaptive Opportunities
Spatial discretization-based adaptivity: change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
Fidelity-based adaptivity: change the continuous formulation to accommodate required phenomena everywhere, without enriching in regions where nothing happens
Stiffness-based adaptivity: change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions
GaMM 2001
Experimental Example of Opportunity for Advanced Adaptivity
Driven cavity: Newton’s method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)
GaMM 2001
Status and Prospects for Advanced Adaptivity
Metrics and procedures are well developed in only a few areas: method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
Multi-model methods used in ad hoc ways in production: Boeing TRANAIR code
Poly-algorithmic solvers demonstrated in principle, but rarely in the “hostile” environment of high-performance computing
Requirements for progress: management of hierarchical levels of synchronization; user specification of hierarchical priorities of different threads
GaMM 2001
Summary of Suggestions for PDE Petaflops
Algorithms that deliver more “science per flop”: possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
Algorithmic variants that are more architecture-friendly: expect half an order of magnitude, through improved locality and relaxed synchronization
More efficient use of processor cycles, and faster processor/memory: expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
Expanded number of processors: expect two orders of magnitude, through dynamic balancing and extreme care in implementation
GaMM 2001
Reminder about the Source of PDEs
Computational engineering is not about individual large-scale analyses, done fast and “thrown over the wall”
Both “results” and their sensitivities are desired; often the multiple operation points to be simulated are known a priori, rather than sequentially
Sensitivities may be fed back into an optimization process
Full PDE analyses may also be inner iterations in a multidisciplinary computation
In such contexts, “petaflop/s” may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s – clearly a less daunting challenge, and one that has better synchronization properties for exploiting “The Grid,” than 1 analysis running at 1 Pflop/s
GaMM 2001
Summary Recommendations for Architects
Support rich (mesh-like) interprocessor connectivity and fast global reductions
Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
Give the user optional control of L2 cache traffic through directives
Develop at least gather/scatter processor-in-memory capability
Support a variety of precisions in blocked transfers and fast precision conversions
GaMM 2001
Recommendations for New Benchmarks
Recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but is explicit, structured, and relatively high-order
Similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects:
uniprocessor (“vertical”) memory system performance – a suite of problems of various grid sizes and multicomponent sizes, with different interlacings and edge orderings
parallel (“horizontal”) network performance – problems of various subdomain sizes and synchronization frequencies
GaMM 2001
Bibliography
High Performance Parallel CFD, Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, in “Proceedings of Parallel CFD'99,” Elsevier, 1999
Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, in “Parallel Solution of Partial Differential Equations,” Springer, 1999, pp. 247-278
Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, 1998, pp. 513-520
How Scalable is Domain Decomposition in Practice?, Keyes, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, 1998, pp. 286-297
On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, in “Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods,” AMS, 1998, pp. 311-319
GaMM 2001
Related URLs Follow-up on this talk
http://www.mcs.anl.gov/petsc-fun3d ASCI platforms
http://www.llnl.gov/asci/platforms Int. Conferences on Domain Decomposition Methods
http://www.ddm.org SIAM Conferences on Parallel Processing
http://www.siam.org (Norfolk, USA 12-14 Mar 2001) International Conferences on Parallel CFD
http://www.parcfd.org
GaMM 2001
Acknowledgments
Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
Sponsors: U.S. DOE, ICASE, NASA, NSF
Computer Resources: DOE, SGI-Cray
Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)