Page 1:

Building the Next Generation of Parallel Applications: Co-Design Opportunities and Challenges

Michael A. Heroux

Scalable Algorithms Department

Sandia National Laboratories

Collaborators:

SNL Staff: [B.|R.] Barrett, E. Boman, R. Brightwell, H.C. Edwards, A. Williams

SNL Postdocs: M. Hoemmen, S. Rajamanickam, M. Wolf (MIT Lincoln Lab)

ORNL staff: Chris Baker, Greg Koenig, Geoffroy Vallee.

Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Page 2:

Topics

1. Context: A Library developer’s perspective.

2. Observations on success of MPI.

3. Scalable manycore approaches.

4. A new approach to soft faults.

5. Exascale concerns.

Page 3:

LLNL Compute Center View

Image: https://computing.llnl.gov/tutorials/bgp/images/nodeSoftwareStacks.gif (from the tutorial for users of Livermore Computing's Dawn BlueGene/P, Blaise Barney)

Page 4:

FTB World View

CiFTS Project: http://www.mcs.anl.gov/research/cifts/

Image: http://nowlab.cse.ohio-state.edu/projects/ftb-ib/FTB-IB_files/ftb.jpg

Page 5:

Math Libraries World View

• Math Libraries (e.g., Hypre, PETSc, SuperLU, Trilinos)
• Applications
• Programming Models
• Compilers
• OS/Runtime
• Processor Architecture
• Node Architecture
• Network Architecture

Page 6:

Three Design Points

• Terascale Laptop: Uninode-Manycore

• Petascale Deskside: Multinode-Manycore

• Exascale Center: Manynode-Manycore

Page 7:

Basic Concerns: Trends, Manycore

• Stein’s Law: If a trend cannot continue, it will stop.

Herbert Stein, chairman of the Council of Economic Advisers under Nixon and Ford.

• Trends at risk:
– Power.
– Single core performance.
– Node count.
– Memory size & BW.
– Concurrency expression in existing programming models.
– Resilience.

[Figure: Parallel CG Performance, 512 Threads; 32 Nodes = 2.2 GHz AMD, 4 sockets x 4 cores. Gigaflops vs. 3D grid points with 27-pt stencil, for p32 x t16, p128 x t4, and p512 x t1 configurations. Edwards: SAND2009-8196, Trilinos ThreadPool Library v1.1.]

“Status Quo” ~ MPI-only

Strong Scaling Potential

Page 8:

Observations

• MPI-only is not sufficient, except … much of the time.
• Near-to-medium term:
– MPI+[OMP|TBB|Pthreads|CUDA|OCL|MPI].
– Long term, too?
– $100 wager: The first 10 exascale apps will call MPI_Init.
• Concern:
– Best hybrid performance: 1 MPI rank per UMA core set.
– UMA core set size growing slowly ⇒ lots of MPI tasks.
• Long term:
– Something hierarchical, global in scope.
• Conjecture:
– Data-intensive apps need a non-SPMD model.
– Will develop a new programming model/environment.
– Rest of apps will adopt over time.
– Time span: 10-20 years.

Page 9:

What Can we Do Right Now?

• Study why MPI (really SPMD) was successful.• Study new parallel landscape.• Try to cultivate an approach similar to MPI (and

others).

Page 10:

MPI Impressions


Page 11:

Dan Reed, Microsoft: Workshop on the Road Map for the Revitalization of High End Computing, June 16-18, 2003.

Tim Stitts, CSCS: SOS14 talk, March 2010.

“MPI is often considered the ‘portable assembly language’ of parallel computing, …” (Brad Chamberlain, Cray, 2000)

Page 12:

MPI Reality


Page 13:

Tramonto WJDC Functional

• New functional.
• Bonded systems.
• 552 lines of C code.

WJDC-DFT (Wertheim, Jain, Dominik, and Chapman) theory for bonded systems. (S. Jain, A. Dominik, and W.G. Chapman. Modified interfacial statistical associating fluid theory: A perturbation density functional theory for inhomogeneous complex fluids. J. Chem. Phys., 127:244904, 2007.) Models stoichiometry constraints inherent to bonded systems.

How much MPI-specific code?

dft_fill_wjdc.c

Page 14:

dft_fill_wjdc.c: MPI-specific code

Page 15:

MFIX: Source term for pressure correction

• MPI-callable, OpenMP-enabled.
• 340 Fortran lines.
• No MPI-specific code.
• Ubiquitous OpenMP markup (red regions).

MFIX: Multiphase Flows with Interphase eXchanges (https://www.mfix.org/)

source_pp_g.f

Page 16:

Reasons for MPI/SPMD Success?

• Portability? Yes.
• Standardized? Yes.
• Momentum? Yes.
• Separation of many parallel & algorithms concerns? Big Yes.
• Once framework in place:
– Sophisticated physics added as serial code.
– Ratio of science experts vs. parallel experts: 10:1.
• Key goal for new parallel apps: Preserve this ratio.

Page 17:

Computational Domain Expert Writing MPI Code

Page 18:

Computational Domain Expert Writing Future Parallel Code

Page 19:

Evolving Parallel Programming Model


Page 20:

Parallel Programming Model: Multi-level/Multi-device

• Inter-node/inter-device (distributed) parallelism and resource management: message passing across a network of computational nodes.
• Intra-node (manycore) parallelism and resource management: threading within a computational node with manycore CPUs and/or GPGPU; node-local control flow is serial.
• Stateless, vectorizable computational kernels run on each core.
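To make the layering concrete, here is a minimal sketch (not from the talk) that combines the three levels: MPI for inter-node parallelism, OpenMP threads for intra-node parallelism, and a stateless, vectorizable kernel applied per entry. The kernel name, vector sizes, and the axpy-plus-sum computation are invented for illustration.

// Sketch: MPI across nodes, OpenMP threads within a node, stateless kernel per core.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

// Stateless, vectorizable kernel: no hidden state, safe to run on any core.
static inline double axpy_kernel(double a, double x, double y) { return a * x + y; }

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                       // inter-node parallelism
  int rank = 0; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n_local = 1 << 20;                  // this rank's share of the vector
  std::vector<double> x(n_local, 1.0), y(n_local, 2.0);
  double local_sum = 0.0;

  // Intra-node parallelism: threads apply the stateless kernel to local data.
  #pragma omp parallel for reduction(+ : local_sum)
  for (int i = 0; i < n_local; ++i) {
    y[i] = axpy_kernel(0.5, x[i], y[i]);
    local_sum += y[i];
  }

  // Back to the distributed level: combine node-local results across ranks.
  double global_sum = 0.0;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0) std::printf("global sum = %g\n", global_sum);

  MPI_Finalize();
  return 0;
}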

Page 21:

Domain Scientist’s Parallel Palette

• MPI-only (SPMD) apps:
– Single parallel construct.
– Simultaneous execution.
– Parallelism of even the messiest serial code.
• MapReduce:
– Plug-n-Play data processing framework; 80% of Google cycles.
• Pregel: graph framework (the other 20%).
• Next-generation PDE and related applications:
– Internode:
  • MPI, yes, or something like it.
  • Composed with intranode.
– Intranode:
  • Much richer palette.
  • More care required from programmer.
• What are the constructs in our new palette?

Page 22:

Obvious Constructs/Concerns

• Parallel for:
– No loop-carried dependence.
– Rich loops.
– Use of shared memory for temporal reuse, efficient device data transfers.
• Parallel reduce:
– Couple with other computations.
– Concern for reproducibility (‘+’ not associative); see the sketch below.
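To illustrate the reproducibility concern, here is a small hedged sketch (assuming OpenMP; the data and chunk count are invented) contrasting an ordinary parallel reduction, whose floating-point summation order can vary with scheduling, with a fixed-order chunked variant that gives the same answer every run.

// Sketch: nondeterministic vs. fixed-order parallel reductions.
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
  const int n = 1 << 22;
  std::vector<double> v(n);
  for (int i = 0; i < n; ++i) v[i] = 1.0 / (i + 1.0);

  // Ordinary parallel reduce: fast, but summation order depends on the runtime.
  double s1 = 0.0;
  #pragma omp parallel for reduction(+ : s1)
  for (int i = 0; i < n; ++i) s1 += v[i];

  // Fixed-order variant: deterministic chunking, partials combined serially.
  const int nchunks = 64;
  const int chunk = (n + nchunks - 1) / nchunks;
  std::vector<double> partial(nchunks, 0.0);
  #pragma omp parallel for
  for (int c = 0; c < nchunks; ++c) {
    double p = 0.0;
    const int lo = c * chunk, hi = (lo + chunk < n) ? lo + chunk : n;
    for (int i = lo; i < hi; ++i) p += v[i];
    partial[c] = p;
  }
  double s2 = 0.0;
  for (int c = 0; c < nchunks; ++c) s2 += partial[c];  // same order every run

  std::printf("reduce: %.17g  fixed-order: %.17g\n", s1, s2);
  return 0;
}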

Page 23:

Other construct: Pipeline

• Sequence of filters.
• Each filter is:
– Sequential (grab element ID, enter global assembly), or
– Parallel (fill element stiffness matrix).
• Filters executed in sequence.
• Programmer’s concern:
– Determine (conceptually): can the filter execute in parallel?
– Write the filter (serial code).
– Register it with the pipeline.
• Extensible:
– New physics feature.
– New filter added to pipeline.
(See the sketch after this list.)
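A minimal sketch of such a serial-parallel-serial pipeline, assuming the oneTBB parallel_pipeline interface. The ElemWork type, the placeholder stiffness fill, and the diagonal-only "global matrix" are invented for illustration; this is a simplification, not the Trilinos/Tramonto code.

// Sketch: serial filter (grab element ID) -> parallel filter (fill stiffness)
//         -> serial filter (assemble into global matrix).
#include <tbb/parallel_pipeline.h>
#include <vector>
#include <cstdio>

struct ElemWork { int elem_id; double stiffness[8][8]; };

int main() {
  const int num_elems = 1000;
  int next_elem = 0;
  std::vector<double> global_diag(num_elems, 0.0);  // stand-in for the global matrix

  tbb::parallel_pipeline(
    /*max_live_tokens=*/16,
    // Serial filter: grab the next element ID from the mesh.
    tbb::make_filter<void, ElemWork*>(tbb::filter_mode::serial_in_order,
      [&](tbb::flow_control& fc) -> ElemWork* {
        if (next_elem >= num_elems) { fc.stop(); return nullptr; }
        ElemWork* w = new ElemWork();
        w->elem_id = next_elem++;
        return w;
      })
    &
    // Parallel filter: compute the element stiffness matrix.
    tbb::make_filter<ElemWork*, ElemWork*>(tbb::filter_mode::parallel,
      [](ElemWork* w) -> ElemWork* {
        for (int i = 0; i < 8; ++i)
          for (int j = 0; j < 8; ++j)
            w->stiffness[i][j] = (i == j) ? 8.0 : -1.0;  // placeholder fill
        return w;
      })
    &
    // Serial filter: assemble into the global matrix (no locking needed).
    tbb::make_filter<ElemWork*, void>(tbb::filter_mode::serial_in_order,
      [&](ElemWork* w) {
        global_diag[w->elem_id] += w->stiffness[0][0];
        delete w;
      })
  );
  std::printf("assembled %d elements\n", num_elems);
  return 0;
}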

Page 24:

TBB Pipeline for FE assembly

[Figure: An FE mesh whose element-stiffness matrices (E1-E4) are computed in parallel and assembled into a global matrix. Pipeline stages: "Launch elem-data from mesh" (serial filter) → "Compute stiffnesses & loads" (parallel filter) → "Assemble rows of stiffness into global matrix" (several serial filters in series, e.g., rows 0-2, rows 3-5, rows 6-8). Each assembly filter assembles certain rows from a stiffness, then passes it on to the next assembly filter.]

TBB work of Alan Williams

Page 25:

Alternative TBB Pipeline for FE assembly

[Figure: The same FE mesh and global matrix as the previous slide. Pipeline stages: "Launch elem-data from mesh" (serial filter) → "Compute stiffnesses & loads" (parallel filter) → "Assemble rows of stiffness into global matrix" (parallel filter). Each parallel call to the assembly filter assembles all rows from the stiffness, using locking to avoid race conflicts with other threads.]

Page 26:

Base-line FE Assembly Timings

Num-procs   Assembly-time (Intel 11.1)   Assembly-time (GCC 4.4.4)
1           1.80s                        1.95s
4           0.45s                        0.50s
8           0.24s                        0.28s

Problem size: 80x80x80 == 512000 elements, 531441 matrix rows. The finite-element assembly performs 4096000 matrix-row sum-into operations (8 per element) and 4096000 vector-entry sum-into operations.

MPI-only, no threads. Linux dual quad-core workstation.

Page 27:

FE Assembly Timings

Num-threads   Elem-group-size   Matrix-conflicts   Vector-conflicts   Assembly-time
1             1                 0                  0                  2.16s
1             4                 0                  0                  2.09s
1             8                 0                  0                  2.08s
4             1                 95917              959                1.01s
4             4                 7938               25                 0.74s
4             8                 3180               4                  0.69s
8             1                 64536              1306               0.87s
8             4                 5892               49                 0.45s
8             8                 1618               1                  0.38s

Problem size: 80x80x80 == 512000 elements, 531441 matrix rows. The finite-element assembly performs 4096000 matrix-row sum-into operations (8 per element) and 4096000 vector-entry sum-into operations.

No MPI, only threads. Linux dual quad-core workstation.


Page 28:

Other construct: Thread team

• Multiple threads.
• Fast barrier.
• Shared, fast-access memory pool.
• Example: Nvidia SM (but we need 10-100X the local memory).
• On x86 this is vaguer today, emerging more clearly in future hardware.
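A rough OpenMP sketch of the thread-team idea (not an Nvidia SM; the scratch buffer here is ordinary memory rather than a fast local store): a team of threads computes a dot product cooperatively, using a shared scratch pool and a barrier to combine partial results.

// Sketch: a "thread team" with shared scratch and a barrier.
#include <omp.h>
#include <cstdio>
#include <vector>

double team_dot(const std::vector<double>& x, const std::vector<double>& y) {
  std::vector<double> scratch(omp_get_max_threads(), 0.0);  // shared pool
  double result = 0.0;
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    double local = 0.0;
    #pragma omp for nowait
    for (int i = 0; i < static_cast<int>(x.size()); ++i) local += x[i] * y[i];
    scratch[tid] = local;          // publish this thread's partial sum
    #pragma omp barrier            // the "fast barrier": all partials visible
    #pragma omp single
    for (int t = 0; t < omp_get_num_threads(); ++t) result += scratch[t];
  }
  return result;
}

int main() {
  std::vector<double> x(1000, 1.0), y(1000, 2.0);
  std::printf("dot = %g\n", team_dot(x, y));  // expect 2000
  return 0;
}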

Page 29:

Preconditioners for Scalable Multicore Systems

• Observe: Iteration count increases with the number of subdomains.
• With scalable threaded smoothers (LU, ILU, Gauss-Seidel):
– Solve with fewer, larger subdomains.
– Better kernel scaling (threads vs. MPI processes).
– Better convergence, more robust.
• Exascale potential: tiled, pipelined implementation.
• Three efforts:
– Level-scheduled triangular sweeps (ILU solve, Gauss-Seidel); see the sketch below.
– Decomposition by partitioning.
– Multithreaded direct factorization.

Strong scaling of Charon on TLCC (P. Lin, J. Shadid, 2009):

MPI Tasks   Threads   Iterations
4096        1         153
2048        2         129
1024        4         125
512         8         117
256         16        117
128         32        111

M. M. Wolf, M. A. Heroux, and E. G. Boman, “Factors Impacting Performance of Multithreaded Sparse Triangular Solve,” VECPAR 2010.
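To illustrate the first of the three efforts, here is a minimal sketch of a level-scheduled lower-triangular sweep. The CSR layout, OpenMP parallelization, and the small test matrix are assumptions for illustration, not the Trilinos implementation.

// Sketch: level-scheduled parallel solve of L x = b (L sparse lower triangular, CSR).
#include <omp.h>
#include <algorithm>
#include <cstdio>
#include <vector>

void level_scheduled_lower_solve(int n,
                                 const std::vector<int>& rowptr,
                                 const std::vector<int>& col,
                                 const std::vector<double>& val,
                                 const std::vector<double>& b,
                                 std::vector<double>& x) {
  // 1. Level of each row: 1 + max level of the earlier rows it depends on.
  std::vector<int> level(n, 0);
  int nlevels = 0;
  for (int i = 0; i < n; ++i) {
    int lev = 0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
      if (col[k] < i) lev = std::max(lev, level[col[k]] + 1);
    level[i] = lev;
    nlevels = std::max(nlevels, lev + 1);
  }
  // 2. Bucket rows by level.
  std::vector<std::vector<int>> rows_in_level(nlevels);
  for (int i = 0; i < n; ++i) rows_in_level[level[i]].push_back(i);

  // 3. Sweep level by level; rows within a level are independent.
  x.assign(n, 0.0);
  for (int lev = 0; lev < nlevels; ++lev) {
    const std::vector<int>& rows = rows_in_level[lev];
    #pragma omp parallel for
    for (int r = 0; r < static_cast<int>(rows.size()); ++r) {
      const int i = rows[r];
      double s = b[i], diag = 1.0;
      for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
        if (col[k] < i)       s -= val[k] * x[col[k]];
        else if (col[k] == i) diag = val[k];
      }
      x[i] = s / diag;
    }
  }
}

int main() {
  // Small test: L is lower bidiagonal with 2 on the diagonal and -1 below it.
  const int n = 5;
  std::vector<int> rowptr = {0, 1, 3, 5, 7, 9};
  std::vector<int> col    = {0, 0, 1, 1, 2, 2, 3, 3, 4};
  std::vector<double> val = {2, -1, 2, -1, 2, -1, 2, -1, 2};
  std::vector<double> b(n, 1.0), x;
  level_scheduled_lower_solve(n, rowptr, col, val, b, x);
  for (int i = 0; i < n; ++i) std::printf("x[%d] = %g\n", i, x[i]);
  return 0;
}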

Page 30:

Thread Team Advantages

• Qualitatively better algorithm:
– Threaded triangular solve scales.
– Fewer MPI ranks means fewer iterations, better robustness.
• Exploits:
– Shared data.
– Fast barrier.
– Data-driven parallelism.

Page 31:

Finite Elements/Volumes/Differences and parallel node constructs

• Parallel for, reduce, pipeline:
– Sufficient for the vast majority of node-level computation.
– Supports:
  • Complex modeling expression.
  • Vanilla parallelism.
– Must be “stencil-aware” for temporal locality.
– If well done:
  • OpenMP vs. TBB vs. CUDA vs. XYZ doesn’t matter much.
  • Refactoring costs are manageable.
• Thread team:
– Complicated.
– Requires substantial parallel algorithm knowledge.
– Useful in solvers.

Page 32:

Programming Today for Tomorrow’s Machines


Page 33:

Programming Today for Tomorrow’s Machines

• Parallel programming in the small:
– Focus:
  • Writing sequential code fragments.
  • Wrapped in simple thread-safe functions.
– Programmer skills:
  • 10%: Pattern/framework experts (domain-aware).
  • 90%: Domain experts (pattern-aware).
• The languages we need are already here:
– Maybe CAF will make it; others too.
– DSLs are useful (math libs are essentially DSLs).
– Exception: large-scale data-intensive graph problems?

Page 34:

FE/FV/FD Parallel Programming Today

for ((i,j,k) in points/elements on subdomain) {
  compute coefficients for point (i,j,k)
  inject into global matrix
}

Notes:
• User in charge of:
– Writing physics code.
– Iteration space traversal.
– Storage association.
• Pattern/framework/runtime in charge of:
– SPMD execution.

Page 35:

FE/FV/FD Parallel Programming Tomorrow

pipeline <i,j,k> {
  filter(addPhysicsLayer1<i,j,k>);
  ...
  filter(addPhysicsLayern<i,j,k>);
  filter(injectIntoGlobalMatrix<i,j,k>);
}

Notes:
• User in charge of:
– Writing physics code (filter).
– Registering the filter with the framework.
• Pattern/framework/runtime in charge of:
– SPMD execution.
– Iteration space traversal.
  o Sensitive to temporal locality.
– Filter execution scheduling.
– Storage association.
• Better assignment of responsibility (in general).

Page 36:

Co-Design Opportunities

• Better runtime systems:
– MPI still wins, a lot.
– Work, data placement & migration.
• Storage association:
– Portability requires math, storage indexing to be distinct.
• MPI shared-memory extensions (see the sketch below):
– Allocate a shared buffer for the ranks on a node.
• Repeatable results:
– Need some way for a non-expert to trust results.
– Debug mode.
• Data regularity:
– We need to exploit it as much as possible.
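MPI-3 later standardized shared-memory windows along these lines. The following sketch (assuming an MPI-3 implementation) splits off a per-node communicator, lets rank 0 on each node allocate a buffer, and gives every rank on the node a direct load/store pointer to it.

// Sketch: one buffer shared by all MPI ranks on a node via MPI-3 shared windows.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Split off a communicator containing only the ranks on this node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int node_rank = 0;
  MPI_Comm_rank(node_comm, &node_rank);

  // Rank 0 on the node allocates the shared buffer; others allocate 0 bytes.
  const MPI_Aint nbytes = (node_rank == 0) ? 1024 * sizeof(double) : 0;
  double* buf = nullptr;
  MPI_Win win;
  MPI_Win_allocate_shared(nbytes, sizeof(double), MPI_INFO_NULL, node_comm,
                          &buf, &win);

  // Every rank queries rank 0's segment and gets a direct load/store pointer.
  MPI_Aint size; int disp_unit;
  MPI_Win_shared_query(win, 0, &size, &disp_unit, &buf);

  if (node_rank == 0) buf[0] = 42.0;        // write through shared memory
  MPI_Win_fence(0, win);                    // simple synchronization
  std::printf("node rank %d sees buf[0] = %g\n", node_rank, buf[0]);

  MPI_Win_fence(0, win);
  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}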

Page 37:

Resilient Algorithms: A little reliability, please.


Page 38:

Soft Error Futures

• Soft error handling: a legitimate algorithms issue.
• Programming model and runtime environment play a role.

Page 39:

Every calculation matters

• Small PDE problem: ILUT/GMRES.
• Correct result: 35 iters, 343M FLOPS.
• 2 examples of a single bad op.
• Solvers:
– 50-90% of total app operations.
– Soft errors most likely in the solver.
• Need new algorithms for soft errors:
– Well-conditioned wrt errors.
– Decay proportional to number of errors.
– Minimal impact when no errors.

Description                                          Iters   FLOPS   Recursive Residual Error   Solution Error
All correct calcs                                    35      343M    4.6e-15                    1.0e-6
Iter=2, y[1] += 1.0; SpMV incorrect; ortho subspace  35      343M    6.7e-15                    3.7e+3
Q[1][1] += 1.0; non-ortho subspace                   N/C     N/A     7.7e-02                    5.9e+5


Soft Error Resilience

• New programming model elements:
– SW-enabled, highly reliable:
  • Data storage, paths.
  • Compute regions.
• Idea: New algorithms with minimal usage of high reliability.
• First new algorithm: FT-GMRES.
– Resilient to soft errors.
– Outer solve: highly reliable.
– Inner solve: “bulk” reliability.
• General approach applies to many algorithms (see the sketch below).

M. Heroux, M. Hoemmen
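The sketch below is not FT-GMRES itself, but it illustrates the same selective-reliability structure in a simpler setting (an invented tridiagonal system with an inner Gauss-Seidel solve): the outer loop, which would run in the highly reliable regime, applies the correction and recomputes the true residual, so a corrupted inner result (injected here deliberately) only delays convergence instead of destroying it.

// Sketch: reliable outer iteration wrapping an unreliable ("bulk") inner solve.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Invented test problem: A is tridiagonal with 4 on the diagonal, -1 off it.
  const int n = 64;
  std::vector<double> b(n, 1.0), x(n, 0.0), r(b), Ax(n), d(n);
  auto apply_A = [&](const std::vector<double>& v, std::vector<double>& out) {
    for (int i = 0; i < n; ++i) {
      out[i] = 4.0 * v[i];
      if (i > 0)     out[i] -= v[i - 1];
      if (i < n - 1) out[i] -= v[i + 1];
    }
  };

  for (int outer = 0; outer < 50; ++outer) {       // outer loop: reliable region
    // Inner solve at "bulk" reliability: a few Gauss-Seidel sweeps on A d = r.
    std::fill(d.begin(), d.end(), 0.0);
    for (int sweep = 0; sweep < 5; ++sweep)
      for (int i = 0; i < n; ++i) {
        double s = r[i];
        if (i > 0)     s += d[i - 1];
        if (i < n - 1) s += d[i + 1];
        d[i] = s / 4.0;
      }
    if (outer == 3) d[n / 2] += 1.0e6;             // injected soft error

    for (int i = 0; i < n; ++i) x[i] += d[i];      // reliable update
    apply_A(x, Ax);                                // reliable true residual
    double nrm = 0.0;
    for (int i = 0; i < n; ++i) { r[i] = b[i] - Ax[i]; nrm += r[i] * r[i]; }
    nrm = std::sqrt(nrm);
    std::printf("outer %2d: ||r|| = %.3e\n", outer, nrm);
    if (nrm < 1.0e-10) break;
  }
  // The corrupted inner correction at outer iteration 3 only delays
  // convergence: the reliably recomputed residual steers later iterations back.
  return 0;
}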

Page 40:

FT-GMRES Results


Page 41:

Exascale Concerns


Page 42:

If FLOPS are free, why are we making them cheaper?


Page 43:

Easy things should be easy, hard things should be possible.

Why are we making easy things easier and hard things impossible?


Page 44:

Explicit/SIMT vs. Implicit/Recursive Algorithms

[Figure: Time to solution vs. problem difficulty (easy to hard). Explicit/SIMT: explicit formulations, Jacobi preconditioning. Implicit/Recursive: implicit formulations, multilevel preconditioning.]

Page 45:

Problems with Accelerator-based Exascale

• Global SIMT is the only approach that really works well on GPUs, but:
– Many of our most robust algorithms have no apparent SIMT replacement.
– Working on it, but a lot to do, and fundamental issues at play.
• SMs might be useful to break the SIMT mold, but:
– Local store is way too small.
– No market reason to make it bigger.
• Could consider SIMT approaches, but:
– The broader apps community is moving the other way:
  • Climate: looking at implicit formulations.
  • Embedded UQ: coupled formulations.
• Exascale apps at risk?
– Isolation from the broader app trends.
– Accelerators good, but in combination with a strong multicore CPU.

Page 46:

Summary

• Building the next generation of parallel applications requires enabling domain scientists to:
– Write sophisticated methods.
– Do so with serial fragments.
– Have those fragments hoisted into a scalable, resilient framework.
• Resilient algorithms will mitigate soft error impact.
• Success of manycore will require breaking out of global SIMT-only execution.