TRANSCRIPT
The Next Four Orders of Magnitude in Parallel PDE Simulation Performance
http://www.math.odu.edu/~keyes/talks.html
David E. Keyes
Department of Mathematics & Statistics,
Old Dominion University
Institute for Scientific Computing Research,
Lawrence Livermore National Laboratory
Institute for Computer Applications in Science & Engineering,
NASA Langley Research Center
GaMM 2001
Background of this Presentation
Originally prepared for the Petaflops II Conference
History of the Petaflops Initiative in the USA:
Enabling Technologies for Petaflops Computing, Feb 1994 (book by Sterling, Messina, and Smith, MIT Press, 1995)
Applications Workshop, Aug 1994
Architectures Workshop, Apr 1995
Systems Software Workshop, Jun 1996
Algorithms Workshop, Apr 1997
Systems Operations Workshop Review, Jun 1998
Enabling Technologies for Petaflops II, Feb 1999
Topics in Ultra-scale Computing, book by Sterling et al., MIT Press, 2001 (to appear)
GaMM 2001
Weighing in at the Bottom Line
Characterization of a 1 Teraflop/s computer of today:
about 1,000 processors of 1 Gflop/s (peak) each
due to inefficiencies within the processors, more practically characterized as about 4,000 processors of 250 Mflop/s each
How do we want to get to 1 Petaflop/s by 2007 (the original goal)?
1,000,000 processors of 1 Gflop/s each (only wider)?
10,000 processors of 100 Gflop/s each (mainly deeper)?
From the point of view of PDE simulations on quasi-static Eulerian grids: either!
Caveat: dynamic grid simulations are not covered in this talk, but see work at Bonn, Erlangen, Heidelberg, LLNL, and ODU presented elsewhere
GaMM 2001
Perspective
Many “Grand Challenges” in computational science are formulated as PDEs (possibly among alternative formulations)
However, PDE simulations historically have not performed as well as other scientific simulations
PDE simulations require a balance among architectural components that is not necessarily met in a machine designed to “max out” on the standard LINPACK benchmark
The justification for building petaflop/s architectures undoubtedly will (and should) include PDE applications
However, cost-effective use of petaflop/s on PDEs requires further attention to architectural and algorithmic matters
Memory-centric view of computation needs further promotion
GaMM 2001
Application Performance History: “3 orders of magnitude in 10 years” – better than Moore’s Law
GaMM 2001
Bell Prize Performance History
Year  Type  Application   Gflop/s  System        No. Procs
1988  PDE   Structures        1.0  Cray Y-MP             8
1989  PDE   Seismic           5.6  CM-2              2,048
1990  PDE   Seismic            14  CM-2              2,048
1992  NB    Gravitation       5.4  Delta               512
1993  MC    Boltzmann          60  CM-5              1,024
1994  IE    Structures        143  Paragon           1,904
1995  MC    QCD               179  NWT                 128
1996  PDE   CFD               111  NWT                 160
1997  NB    Gravitation       170  ASCI Red          4,096
1998  MD    Magnetism       1,020  T3E-1200          1,536
1999  PDE   CFD               627  ASCI BluePac      5,832
2000  NB    Gravitation     1,349  GRAPE-6              96
GaMM 2001
Plan of Presentation
General characterization of PDE requirements
Four sources of performance improvement to get from current 100's of Gflop/s (for PDEs) to 1 Pflop/s
Each illustrated with examples from computational aerodynamics, offered as typical of real workloads (nonlinear, unstructured, multicomponent, multiscale, etc.)
Performance presented on up to thousands of processors of T3E and ASCI Red (for parallel aspects) and on numerous uniprocessors (for memory hierarchy aspects)
GaMM 2001
Purpose of Presentation
Not to argue for specific algorithms/programming models/codes in any detail, but see talks on Newton-Krylov-Schwarz (NKS) methods under homepage
Provide a requirements target for designers of today's systems: typical of several contemporary successfully parallelized PDE applications, though not comprehensive of all important large-scale applications
Speculate on a requirements target for designers of tomorrow's systems
Promote attention to current architectural weaknesses, relative to the requirements of PDEs
GaMM 2001
Four Sources of Performance Improvement
Expanded number of processors: arbitrarily large factor, through extremely careful attention to load balancing and synchronization
More efficient use of processor cycles, and faster processor/memory elements: one to two orders of magnitude, through memory-assist language features, processors-in-memory, and multithreading
Algorithmic variants that are more architecture-friendly: approximately an order of magnitude, through improved locality and relaxed synchronization
Algorithms that deliver more “science per flop”: possibly a large problem-dependent factor, through adaptivity (this last does not contribute to raw flop/s!)
GaMM 2001
PDE Varieties and Complexities
Varieties of PDEs:
evolution (time hyperbolic, time parabolic)
equilibrium (elliptic, spatially hyperbolic or parabolic)
mixed, varying by region
mixed, of multiple type (e.g., parabolic with elliptic constraint)
Complexity parameterized by:
spatial grid points, Nx
temporal grid points, Nt
components per point, Nc
auxiliary storage per point, Na
grid points in stencil, Ns
Memory: $M \propto N_x\,(N_c + N_a + N_c^2 N_s)$
Work: $W \propto N_x N_t\,(N_c + N_a + N_c^2 N_s)$
(a small calculator for these two estimates is sketched below)
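As a rough illustration (not part of the original slides), here is a minimal Python sketch that evaluates the two estimates above; the parameter values are assumed, loosely modeled on a multicomponent 3D flow problem.

```python
# Minimal sketch (not from the talk): evaluate the memory and work estimates above.
# Parameter values below are assumed, loosely modeled on a 3D multicomponent flow.
def pde_memory(Nx, Nc, Na, Ns):
    """Words of storage: state + auxiliaries + Jacobian blockrow per gridpoint."""
    return Nx * (Nc + Na + Nc * Nc * Ns)

def pde_work(Nx, Nt, Nc, Na, Ns):
    """Flops up to a constant: the memory-like term, touched once per time-like step."""
    return Nx * Nt * (Nc + Na + Nc * Nc * Ns)

if __name__ == "__main__":
    Nx, Nt, Nc, Na, Ns = 10**6, 10**2, 5, 10, 15   # assumed sizes
    print(f"memory ~ {pde_memory(Nx, Nc, Na, Ns):.2e} words")
    print(f"work   ~ {pde_work(Nx, Nt, Nc, Na, Ns):.2e} flop (times a constant)")
```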
GaMM 2001
Resource Scaling for PDEs
For 3D problems, $\text{(Memory)} \propto \text{(Work)}^{3/4}$ (a one-line derivation follows below)
for equilibrium problems, work scales as problem size times the number of iteration steps – for “reasonable” implicit methods, proportional to the resolution in a single spatial dimension
for evolutionary problems, work scales as problem size times the number of time steps – CFL-type arguments place the latter on the order of the resolution in a single spatial dimension
The proportionality constant can be adjusted over a very wide range, both by discretization and by algorithmic tuning
If frequent time frames are to be captured, disk capacity and I/O rates must both scale linearly with work
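The 3/4 power can be recovered in one line from the counting on this slide; a sketch of the derivation, assuming n grid points per spatial dimension:

```latex
% 3D storage scales like the number of gridpoints; the step or iteration count
% scales like the one-dimensional resolution n (CFL bound or "reasonable" implicit method).
\[
  M \propto n^{3}, \qquad W \propto n^{3}\cdot n = n^{4}
  \;\Longrightarrow\; M \propto W^{3/4}.
\]
```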
GaMM 2001
Typical PDE Tasks
Vertex-based loops: state vector and auxiliary vector updates
Edge-based “stencil op” loops: residual evaluation, approximate Jacobian evaluation, Jacobian-vector product (often replaced with a matrix-free form involving residual evaluation), intergrid transfer (coarse/fine)
Sparse, narrow-band recurrences: approximate factorization and back substitution, smoothing
Vector inner products and norms: orthogonalization/conjugation, convergence progress and stability checks
GaMM 2001
Edge-based Loop
Vertex-centered tetrahedral grid
Traverse by edges: load vertex values, compute intensively, store contributions to flux at vertices (a schematic loop is sketched below)
Each vertex appears in approximately 15 flux computations
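A schematic of such an edge-based loop (an illustrative sketch only, not the talk's production code; the `flux` routine and array shapes are assumptions):

```python
import numpy as np

def edge_based_residual(edges, u, geom, flux):
    """Minimal sketch of an edge-based "stencil op" loop on a vertex-centered grid.

    edges : (E, 2) int array of endpoint vertex indices
    u     : (V, Nc) state vector per vertex
    geom  : (E, ...) per-edge geometric coefficients (e.g., dual-face normals)
    flux  : function returning the numerical flux for one edge
    """
    res = np.zeros_like(u)
    for e, (i, j) in enumerate(edges):
        f = flux(u[i], u[j], geom[e])   # load both vertex states, compute intensively
        res[i] += f                     # scatter equal-and-opposite contributions
        res[j] -= f                     # back to the two endpoint vertices
    return res
```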
GaMM 2001
Explicit PDE Solvers
Concurrency is pointwise, O(N)
Comm.-to-comp. ratio is surface-to-volume, O((N/P)^{-1/3})
Communication range is nearest-neighbor, except for time-step computation
Synchronization frequency is once per step, O((N/P)^{-1})
Storage per point is low
Load balance is straightforward for static quasi-uniform grids
Grid adaptivity (together with temporal stability limitation) makes load balance nontrivial
Explicit update: $u^{\ell+1} = u^{\ell} + \Delta t^{\ell}\, f(u^{\ell})$
GaMM 2001
Domain-decomposed Implicit PDE Solvers
Concurrency is pointwise, O(N), or subdomainwise, O(P)
Comm.-to-comp. ratio is still mainly surface-to-volume, O((N/P)^{-1/3})
Communication is still mainly nearest-neighbor, but nonlocal communication arises from conjugation, norms, and coarse grid problems
Synchronization frequency is often more than once per grid-sweep, up to the Krylov dimension, O(K(N/P)^{-1})
Storage per point is higher, by a factor of O(K)
Load balance issues are the same as for the explicit case
Implicit update: solve $\dfrac{u^{\ell+1} - u^{\ell}}{\Delta t^{\ell}} = f(u^{\ell+1})$ for $u^{\ell+1}$
GaMM 2001
Source #1: Expanded Number of Processors
Amdahl's law can be defeated if serial sections make up a nonincreasing fraction of total work as problem size and processor count scale up together – true for most explicit or iterative implicit PDE solvers; popularized in the 1986 Karp Prize paper by Gustafson et al.
Simple, back-of-envelope parallel complexity analyses show that processors can be increased as fast, or almost as fast, as problem size, assuming load is perfectly balanced
Caveat: the processor network must also be scalable (applies to protocols as well as to hardware)
The remaining four orders of magnitude could be met by hardware expansion (but this does not mean that fixed-size applications of today would run 10^4 times faster)
GaMM 2001
Back-of-Envelope Scalability Demonstration for Bulk-synchronized PDE Computations
Given complexity estimates of the leading terms of:
the concurrent computation (per iteration phase)
the concurrent communication
the synchronization frequency
And a model of the architecture including:
internode communication (network topology and protocol, reflecting horizontal memory structure)
on-node computation (effective performance parameters, including vertical memory structure)
One can estimate optimal concurrency and execution time on a per-iteration basis, or overall (by taking into account any granularity-dependent convergence rate):
simply differentiate the time estimate in terms of (N,P) with respect to P, equate to zero, and solve for P in terms of N
GaMM 2001
3D Stencil Costs (per Iteration)
grid points in each direction n, total work N = O(n^3)
processors in each direction p, total procs P = O(p^3)
memory per node requirements O(N/P)
execution time per iteration A n^3/p^3
grid points on a side of each processor subdomain n/p
neighbor communication per iteration B n^2/p^2
cost of global reductions in each iteration C log p or C p^{1/d}
C includes synchronization frequency
same dimensionless units for measuring A, B, C – e.g., cost of a scalar floating point multiply-add
GaMM 2001
3D Stencil Computation Illustration – Rich local network, tree-based global reductions
total wall-clock time per iteration: $T(n,p) = A\,\dfrac{n^3}{p^3} + B\,\dfrac{n^2}{p^2} + C \log p$
for optimal p, $\dfrac{\partial T}{\partial p} = 0$: $-3A\,\dfrac{n^3}{p^4} - 2B\,\dfrac{n^2}{p^3} + \dfrac{C}{p} = 0$, or $3A\,\dfrac{n^3}{p^3} + 2B\,\dfrac{n^2}{p^2} = C$
this cubic in p has a positive root with $p_{opt} \propto n$; in the limit $B \to 0$ it reduces to $p_{opt} = \left(\dfrac{3A}{C}\right)^{1/3} n$
so, without “speeddown,” p can grow in proportion to n
(a small numerical check of this optimization is sketched below)
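A quick numerical check of this per-iteration model (an illustrative sketch, not from the talk; the cost constants A, B, C are assumed values in the same dimensionless units). The ratio p_opt/n settling toward a constant is the "scalable" behavior claimed above.

```python
import math

def T(n, p, A, B, C):
    """Per-iteration model: local compute + neighbor exchange + tree-based reduction."""
    return A * n**3 / p**3 + B * n**2 / p**2 + C * math.log(p)

def p_opt(n, A, B, C):
    """Brute-force minimizer of T over integer p (the calculus gives a cubic in p)."""
    return min(range(1, n**3 + 1), key=lambda p: T(n, p, A, B, C))

if __name__ == "__main__":
    A, B, C = 1.0, 10.0, 100.0          # assumed dimensionless cost constants
    for n in (16, 32, 64):
        p = p_opt(n, A, B, C)
        print(f"n={n:3d}  p_opt={p:5d}  p_opt/n={p/n:.2f}  T={T(n, p, A, B, C):.1f}")
```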
GaMM 2001
3D Stencil Computation Illustration – Rich local network, tree-based global reductions
optimal running time: substitute $p_{opt}$ into $T(n,p)$; since $p_{opt} \propto n$, the compute and neighbor-exchange terms become constants and only the reduction term grows, as $C \log n$
in the limit of infinite neighbor bandwidth, zero neighbor latency ($B \to 0$):
$T(n, p_{opt}(n)) = C \log n + \tfrac{C}{3}\left(1 + \log\tfrac{3A}{C}\right) = C \log n + \text{const.}$
(This analysis is on a per-iteration basis; a fuller analysis would multiply this cost by an iteration count estimate that generally depends on n and p.)
GaMM 2001
Summary for Various Networks
With tree-based (logarithmic) global reductions and scalable nearest-neighbor hardware: optimal number of processors scales linearly with problem size
With 3D torus-based global reductions and scalable nearest-neighbor hardware: optimal number of processors scales as the three-fourths power of problem size (almost “scalable”)
With a common network bus (heavy contention): optimal number of processors scales as the one-fourth power of problem size (not “scalable”) – bad news for conventional Beowulf clusters, but see the 2000 Bell Prize “price-performance awards”
GaMM 2001
1999 Bell Prize Parallel Scaling Results on ASCI Red
ONERA M6 Wing Test Case, tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (Pentium Pro 333 MHz processors)
GaMM 2001
Surface Visualization of Test Domain for
Computing Flow over an ONERA M6 Wing
GaMM 2001
Transonic “Lambda” Shock Solution
GaMM 2001
Fixed-size Parallel Scaling Results (Flop/s)
GaMM 2001
Fixed-size Parallel Scaling Results (Time in seconds)
GaMM 2001
Algorithm: Newton-Krylov-Schwarz
Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable
GaMM 2001
Fixed-size Scaling Results for W-cycle (Time in seconds, courtesy of D. Mavriplis)
ASCI runs: grid of 3.1M vertices; T3E runs: grid of 24.7M vertices
GaMM 2001
Source #2: More Efficient Use of Faster Processors
Current low efficiencies of sparse codes can be improved if regularity of reference is exploited with memory-assist features
PDEs have periodic workingset structure that permits effective use of prefetch/dispatch directives, and lots of slackness
Combined with “processors-in-memory” (PIM) technology for gather/scatter into densely used block transfers and multithreading, PDEs can approach full utilization of processor cycles
Caveat: high bandwidth is critical, since PDE algorithms do only O(N) work for O(N) gridpoints worth of loads and stores
One to two orders of magnitude can be gained by catching up to the clock, and by following the clock into the few-GHz range
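To make the bandwidth caveat concrete, here is a back-of-envelope bound on the Mflop/s sustainable by a memory-bound sparse kernel (an illustrative sketch; the bytes-per-flop and bandwidth figures are assumed, not measurements from the talk):

```python
def bandwidth_bound_mflops(bytes_per_flop, stream_bw_gbs):
    """Upper bound on sustainable Mflop/s when every operand must stream from memory."""
    return stream_bw_gbs * 1e3 / bytes_per_flop   # GB/s -> MB/s, divided by bytes/flop

if __name__ == "__main__":
    # Assumed numbers: a sparse matrix-vector product moves very roughly 12 bytes
    # (an 8-byte coefficient plus an amortized 4-byte index) per multiply-add (2 flops),
    # i.e. about 6 bytes per flop; 1 GB/s is a plausible late-1990s memory system.
    print(f"{bandwidth_bound_mflops(6.0, 1.0):.0f} Mflop/s bound")  # ~167 Mflop/s
```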
GaMM 2001
Following the Clock
1999 Predictions from the Semiconductor Industry Association
http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
A factor of 2-3 can be expected by 2007 by following the clock alone
Frequency in GHz             2000   2004   2008   2014
Local (high perf.)           1.49   2.95   6.00   13.5
Cross-chip (high perf.)      1.32   1.86   2.50   3.60
Cross-chip (cost perf.)      0.66   0.99   1.40   2.20
Chip-to-board (high perf.)   1.32   1.86   2.50   3.60
GaMM 2001
Example of Multithreading
Same ONERA M6 wing Euler code simulation on ASCI Red
ASCI Red contains two processors per node, sharing memory
The second processor can be used either in message-passing mode with its own subdomain, or in multithreaded shared-memory mode, which does not require the number of subdomain partitions to double
The latter is much more effective in the flux evaluation phase, as shown by cumulative execution time (here, memory bandwidth is not an issue)

Number of Nodes   1 MPI Process   2 MPI Processes   2 Threads
 256                        510               332         293
1024                        183               136         109
3072                         93                91          63
GaMM 2001
PDE Workingsets
Smallest: data for a single stencil: $N_s\,(N_c^2 + N_c + N_a)$ (sharp)
Largest: data for an entire subdomain: $(N_x/P)\,(N_c^2 N_s + N_c + N_a)$ (sharp)
Intermediate: data for a neighborhood collection of stencils, reused as possible
(a small calculator for these sizes is sketched below)
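For a feel for the magnitudes, a minimal sketch evaluating these workingset sizes (the parameter values are assumed, as in the earlier complexity sketch):

```python
def smallest_workingset(Nc, Na, Ns):
    """Words touched by a single stencil operation."""
    return Ns * (Nc**2 + Nc + Na)

def largest_workingset(Nx, P, Nc, Na, Ns):
    """Words for an entire subdomain of Nx/P gridpoints."""
    return (Nx // P) * (Nc**2 * Ns + Nc + Na)

if __name__ == "__main__":
    Nc, Na, Ns = 5, 10, 15                                   # assumed sizes
    print(smallest_workingset(Nc, Na, Ns))                   # 600 words
    print(largest_workingset(10**6, 1000, Nc, Na, Ns))       # per 1000-way subdomain
```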
GaMM 2001
Cache Traffic for PDEs
As successive workingsets “drop” into a level of memory, capacity (and, with effort, conflict) misses disappear, leaving only compulsory misses and reducing demand on main memory bandwidth
GaMM 2001
Strategies Based on Workingset Structure
No performance value in memory levels larger than the subdomain
Little performance value in memory levels smaller than the subdomain but larger than required to permit full reuse of most data within each subdomain subtraversal
After providing an L1 large enough for the smallest workingset (and multiple independent copies of it to accommodate the desired level of multithreading), all additional resources should be invested in a large L2
Tables describing grid connectivity are built (after each grid rebalancing) and stored in PIM – used to pack/unpack densely used cache lines during subdomain traversal (a gather/scatter sketch follows below)
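The pack/unpack idea can be sketched in a few lines (illustrative only; this mimics in software what the slide proposes to do with PIM hardware):

```python
import numpy as np

def build_gather_table(edges):
    """Connectivity table in edge-traversal order; rebuilt after each grid rebalancing."""
    return np.asarray(edges).reshape(-1)     # (E, 2) vertex pairs -> flat index stream

def gather(u, table):
    """Pack scattered per-vertex data into a dense contiguous stream."""
    return u[table]                          # densely used block transfer

def scatter_add(res, contributions, table):
    """Unpack: add dense per-edge contributions back to their home vertices."""
    np.add.at(res, table, contributions)     # handles repeated vertex indices correctly
```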
GaMM 2001
Costs of Greater Per-processor Efficiency
Programming complexity of managing subdomain traversal
Space to store gather/scatter tables in PIM
Time to (re)build gather/scatter tables in PIM
Memory bandwidth commensurate with the peak rates of all processors
GaMM 2001
Source #3: More “Architecture Friendly” Algorithms
Algorithmic practice needs to catch up to architectural demands: several “one-time” gains remain to be contributed that could improve data locality or reduce synchronization frequency, while maintaining required concurrency and slackness
“One-time” refers to improvements by small constant factors, nothing that scales in N or P – complexities are already near information-theoretic lower bounds, and we reject increases in flop rates that derive from less efficient algorithms
Caveat: remaining algorithmic performance improvements may cost extra space or may bank on stability shortcuts that occasionally backfire, making performance modeling less predictable
Perhaps an order of magnitude of performance remains here
GaMM 2001
Raw Performance Improvement from Algorithms
Spatial reorderings that improve locality:
interlacing of all related grid-based data structures
ordering gridpoints and grid edges for L1/L2 reuse
Discretizations that improve locality:
higher-order methods (lead to larger, denser blocks at each point than lower-order methods)
vertex-centering (for the same tetrahedral grid, leads to denser blockrows than cell-centering)
Temporal reorderings that improve locality:
block vector algorithms (reuse cached matrix blocks; vectors in the block are independent)
multi-step vector algorithms (reuse cached vector blocks; vectors have a sequential dependence)
(an interlacing/reordering sketch follows below)
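A minimal sketch of the interlacing and reordering idea (illustrative only; the random permutation is a stand-in for a real bandwidth-reducing ordering such as RCM):

```python
import numpy as np

V, Nc = 100_000, 5
separate = [np.random.rand(V) for _ in range(Nc)]    # one array per unknown (e.g., rho, u, v, w, E)

# Interlaced (point-major) layout: the Nc unknowns of each gridpoint are adjacent in memory,
# so a single cache line delivers the whole state a stencil visit needs.
interlaced = np.stack(separate, axis=1)              # shape (V, Nc)

# Gridpoint/edge reordering then keeps successively visited points nearby in memory;
# 'perm' here is a random stand-in for a bandwidth-reducing ordering.
perm = np.random.permutation(V)
interlaced_reordered = interlaced[perm]
```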
GaMM 2001
Raw Performance Improvement from Algorithms
Temporal reorderings that reduce synchronization penalty:
less stable algorithmic choices that reduce synchronization frequency (deferred orthogonalization, speculative step selection)
less global methods that reduce synchronization range, by replacing a tightly coupled global process (e.g., Newton) with loosely coupled sets of tightly coupled local processes (e.g., Schwarz)
Precision reductions that make bandwidth seem larger:
lower precision representation of preconditioner matrix coefficients or poorly known coefficients (arithmetic is still performed on full precision extensions; see the sketch below)
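A minimal sketch of the reduced-precision storage idea (an illustrative NumPy toy, not the production solver):

```python
import numpy as np

def store_preconditioner(M_blocks):
    """Keep the (approximate) preconditioner in 32-bit form: halves its footprint."""
    return M_blocks.astype(np.float32)

def apply_preconditioner(M32, r):
    """Promote to 64-bit before arithmetic; only the stored representation is reduced."""
    return M32.astype(np.float64) @ r

if __name__ == "__main__":
    M = np.linalg.inv(np.eye(4) + 0.01 * np.random.rand(4, 4))  # toy dense stand-in
    r = np.random.rand(4)
    z = apply_preconditioner(store_preconditioner(M), r)
    print(np.abs(z - M @ r).max())   # perturbation only at the level of single precision
```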
GaMM 2001
Improvements Resulting from Locality Reordering
Processor       Clock MHz  Peak Mflop/s  Orig. Mflop/s  Orig. % of Peak  Interl. only Mflop/s  Reord. only Mflop/s  Opt. Mflop/s  Opt. % of Peak
Pent. Pro         200         200           16             8.0               26                    27                 42            21.0
Pent. Pro         333         333           21             6.3               36                    40                 60            18.8
Pent. II/NT       400         400           31             7.8               49                    49                 78            19.5
Pent. II/LIN      400         400           33             8.3               47                    52                 83            20.8
Ultra II/HPC      400         800           20             2.5               36                    47                 71             8.9
Ultra II          360         720           25             3.5               47                    54                 94            13.0
Ultra II          300         600           18             3.0               35                    42                 75            12.5
Alpha 21164       600        1200           16             1.3               37                    47                 91             7.6
Alpha 21164       450         900           14             1.6               32                    39                 75             8.3
604e              332         664           15             2.3               31                    43                 66             9.9
P2SC (4 card)     120         480           15             3.1               40                    59                117            24.3
P2SC (2 card)     120         480           13             2.7               35                    51                101            21.4
P3                200         800           32             4.0               68                    87                163            20.3
R10000            250         500           26             5.2               59                    74                127            25.4
GaMM 2001
Improvements from Blocking Vectors
Same ONERA M6 Euler simulation, on SGI Origin
One vector represents standard GMRES acceleration
Four vectors is a blocked Krylov method, not yet in the “production” version
Savings arise from not reloading matrix elements of the Jacobian for each new vector (four-fold increase in matrix element use per load)
The flop/s rate is effectively tripled – however, can the extra vectors be used efficiently from a numerical viewpoint?
(the memory-reuse effect is sketched below)

Number of Vectors   Bytes/floating multiply-add   Mflop/s achieved
1                   12.36                          45
4                    3.31                         120
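The memory-reuse effect behind the block variant can be sketched as follows (illustrative; SciPy is assumed, and this stands in for, rather than reproduces, the blocked Krylov production kernel):

```python
import numpy as np
import scipy.sparse as sp

n, k = 100_000, 4
A = sp.random(n, n, density=1e-4, format="csr")   # stand-in for the sparse Jacobian
v = np.random.rand(n)                             # a single Krylov vector
V = np.random.rand(n, k)                          # a block of k Krylov vectors

y1 = A @ v   # each stored matrix entry is loaded for one multiply-add
yk = A @ V   # the same entry is reused against k vectors per load,
             # cutting the bytes moved per multiply-add by roughly a factor of k
```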
GaMM 2001
Improvements from Reduced Precision
Same ONERA M6 Euler simulation, on SGI Origin
Standard (middle column) is double precision in all floating quantities
Optimization (right column) is to store the preconditioner for the Jacobian matrix in single precision only, promoting to double before use in the processor
Bandwidth and matrix cache capacities are effectively doubled, with no deterioration in numerical properties

Number of Processors   Double Precision   Single Precision
 16                    223                139
 32                    117                 67
 64                     60                 34
120                     31                 16
GaMM 2001
Source #4: Algorithms Packing More Science Per Flop
Some algorithmic improvements do not improve the flop rate, but lead to the same scientific end in the same time at lower hardware cost (less memory, lower operation complexity)
Caveat: such adaptive programs are more complicated and less thread-uniform than those they improve upon in quality/cost ratio
Desirable that petaflop/s machines be general purpose enough to run the “best” algorithms
Not daunting, conceptually, but puts an enormous premium on dynamic load balancing
An order of magnitude or more can be gained here for many problems
GaMM 2001
Examples of Adaptive Opportunities
Spatial discretization-based adaptivity: change discretization type and order to attain the required approximation to the continuum everywhere, without over-resolving in smooth, easily approximated regions
Fidelity-based adaptivity: change the continuous formulation to accommodate required phenomena everywhere, without enriching in regions where nothing happens
Stiffness-based adaptivity: change the solution algorithm to provide more powerful, robust techniques in regions of space-time where the discrete problem is linearly or nonlinearly stiff, without extra work in nonstiff, locally well-conditioned regions
GaMM 2001
Experimental Example of Opportunity for Advanced Adaptivity
Driven cavity: Newton’s method (left) versus new Additive Schwarz Preconditioned Inexact Newton (ASPIN) nonlinear preconditioning (right)
GaMM 2001
Status and Prospects for Advanced Adaptivity
Metrics and procedures are well developed in only a few areas: method-of-lines ODEs for stiff IBVPs and DAEs, FEA for elliptic BVPs
Multi-model methods used in ad hoc ways in production: Boeing TRANAIR code
Poly-algorithmic solvers demonstrated in principle, but rarely in the “hostile” environment of high-performance computing
Requirements for progress: management of hierarchical levels of synchronization; user specification of hierarchical priorities of different threads
GaMM 2001
Summary of Suggestions for PDE Petaflops
Algorithms that deliver more “science per flop”: possibly a large problem-dependent factor, through adaptivity (but we won't count this towards rate improvement)
Algorithmic variants that are more architecture-friendly: expect half an order of magnitude, through improved locality and relaxed synchronization
More efficient use of processor cycles, and faster processor/memory: expect one-and-a-half orders of magnitude, through memory-assist language features, PIM, and multithreading
Expanded number of processors: expect two orders of magnitude, through dynamic balancing and extreme care in implementation
GaMM 2001
Reminder about the Source of PDEs
Computational engineering is not about individual large-scale analyses, done fast and “thrown over the wall”
Both “results” and their sensitivities are desired; often the multiple operation points to be simulated are known a priori, rather than sequentially
Sensitivities may be fed back into an optimization process
Full PDE analyses may also be inner iterations in a multidisciplinary computation
In such contexts, “petaflop/s” may mean 1,000 analyses running somewhat asynchronously with respect to each other, each at 1 Tflop/s – clearly a less daunting challenge, and one that has better synchronization properties for exploiting “The Grid,” than 1 analysis running at 1 Pflop/s
GaMM 2001
Summary Recommendations for Architects
Support rich (mesh-like) interprocessor connectivity and fast global reductions
Allow disabling of expensive interprocessor cache coherence protocols for user-tagged data
Support fast message-passing protocols between processors that physically share memory, for legacy MP applications
Supply sufficient memory system bandwidth per processor (at least one word per clock per scalar unit)
Give the user optional control of L2 cache traffic through directives
Develop at least gather/scatter processor-in-memory capability
Support a variety of precisions in blocked transfers and fast precision conversions
GaMM 2001
Recommendations for New Benchmarks
Recently introduced sPPM benchmark fills a void for memory-system-realistic full-application PDE performance, but is explicit, structured, and relatively high-order
Similar full-application benchmark is needed for implicit, unstructured, low-order PDE solvers
Reflecting the hierarchical, distributed memory layout of high-end computers, this benchmark would have two aspects:
uniprocessor (“vertical”) memory system performance – a suite of problems of various grid sizes and multicomponent sizes, with different interlacings and edge orderings
parallel (“horizontal”) network performance – problems of various subdomain sizes and synchronization frequencies
GaMM 2001
Bibliography
High Performance Parallel CFD, Gropp, Kaushik, Keyes & Smith, Parallel Computing (to appear, 2001)
Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, in “Proceedings of Parallel CFD'99,” Elsevier, 1999
Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, in “Parallel Solution of Partial Differential Equations,” Springer, 1999, pp. 247-278
Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, 1998, pp. 513-520
How Scalable is Domain Decomposition in Practice?, Keyes, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, 1998, pp. 286-297
On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, in “Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods,” AMS, 1998, pp. 311-319
GaMM 2001
Related URLs Follow-up on this talk
http://www.mcs.anl.gov/petsc-fun3d ASCI platforms
http://www.llnl.gov/asci/platforms Int. Conferences on Domain Decomposition Methods
http://www.ddm.org SIAM Conferences on Parallel Processing
http://www.siam.org (Norfolk, USA 12-14 Mar 2001) International Conferences on Parallel CFD
http://www.parcfd.org
GaMM 2001
Acknowledgments
Collaborators: Dinesh Kaushik (ODU), Kyle Anderson (NASA), and the PETSc team at ANL: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
Sponsors: U.S. DOE, ICASE, NASA, NSF
Computer Resources: DOE, SGI-Cray
Inspiration: Shahid Bokhari (ICASE), Xiao-Chuan Cai (CU-Boulder), Rob Falgout (LLNL), Paul Fischer (ANL), Kyle Gallivan (FSU), Liz Jessup (CU-Boulder), Michael Kern (INRIA), Dimitri Mavriplis (ICASE), Alex Pothen (ODU), Uli Ruede (Univ. Erlangen), John Salmon (Caltech), Linda Stals (ODU), Bob Voigt (DOE), David Young (Boeing), Paul Woodward (UMinn)