
www.ExascaleProject.org

The National Strategic Computing Initiative and Synergistic Opportunities for Massive-scale Scientific and Data-analytic Computing

Presented to: Columbia University Data Science Institute Colloquium, January 20, 2017

James Ang, Ph.D.
Sandia National Laboratories – Manager, Exascale Computing Program
DOE Exascale Computing Project – Director, Hardware Technology

Acknowledgements: Paul Messina, ANL; Doug Kothe, ORNL; Rajeev Thakur, ANL; Terri Quinn, LLNL; John Shalf, LBNL; Rob Leland, Bruce Hendrickson, Jonathan Berry, SNL; Katherine Yelick, LBNL; Richard Murphy, Micron; John Johnson, PNNL

SAND2017-0746 PE

2 Exascale Computing Project

National Strategic Computing Initiative (NSCI)

3 Exascale Computing Project

NSCI Strategic Objectives
1. Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10-petaflop systems across a range of applications representing government needs.
2. Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing (DAC).
3. Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the "post-Moore's Law era").
4. Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.
5. Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.

4 Exascale Computing Project

DOE's Role in NSCI
• The National Strategic Computing Initiative (NSCI), created by President Obama's Executive Order, aims to maximize the benefits of HPC for US economic competitiveness and scientific discovery
  – Leadership in high-performance computing and large-scale data analysis will advance national competitiveness in a wide array of strategic sectors, including basic science, national security, energy technology, and economic prosperity.
  – NSCI Strategic Objective (1): DOE Exascale Computing Project
  – NSCI Strategic Objective (2): driver for inter-agency collaboration
• DOE is a lead agency within NSCI; DOE's role is focused on advanced simulation through capable exascale computing, emphasizing sustained performance on relevant applications and data analytic computing

5 Exascale Computing Project

The Exascale Computing Project (ECP)

• A collaborative effort of two US Department of Energy (DOE) organizations:
  – Office of Science (DOE-SC)
  – National Nuclear Security Administration (NNSA)
• A 7-year project to accelerate the development of capable exascale systems
  – Led by DOE laboratories
  – Executed in partnership with academia and industry

A capable exascale computing system will leverage a balanced ecosystem (software, hardware, applications)

6 Exascale Computing Project

What is a capable exascale computing system?

A capable exascale computing system requires an entire computational ecosystem that:
• Delivers 50× the performance of today's 20 PF systems, supporting applications that deliver high-fidelity solutions in less time and address problems of greater complexity
• Operates in a power envelope of 20–30 MW
• Is sufficiently resilient (average fault rate: ≤1/week)
• Includes a software stack that supports a broad spectrum of applications and workloads

This ecosystem will be developed using a co-design approach to deliver new software, applications, platforms, and computational science capabilities at heretofore unseen scale (a rough look at what these performance and power targets imply follows below).
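To put the targets above in context, here is a back-of-the-envelope calculation (my own illustration, not from the slides; the 1 exaflop figure simply follows from 50× a 20 PF baseline) of the energy efficiency such a system implies:

```python
# Rough estimate of the energy efficiency implied by the ECP targets above.
# Assumptions: 50x a 20 PF baseline = 1 exaflop sustained; 20-30 MW envelope.
baseline_pf = 20                               # petaflops (today's systems, per the slide)
speedup = 50                                   # capable exascale target
target_flops = baseline_pf * 1e15 * speedup    # = 1e18 flop/s (1 exaflop)

for power_mw in (20, 30):
    gflops_per_watt = target_flops / (power_mw * 1e6) / 1e9
    print(f"{power_mw} MW envelope -> ~{gflops_per_watt:.0f} GF/s per watt")
# ~50 GF/W at 20 MW and ~33 GF/W at 30 MW, well beyond the efficiency of
# typical petascale systems of the mid-2010s.
```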

7 Exascale Computing Project

Exascale Computing Project Goals

• Foster application development: develop scientific, engineering, and large-data applications that exploit the emerging, exascale-era computational trends caused by the end of Dennard scaling and Moore's law
• Ease of use: create software that makes exascale systems usable by a wide variety of scientists and engineers across a range of applications
• Rich exascale ecosystem: enable, by 2021 and 2023, at least two diverse computing platforms with up to 50× more computational capability than today's 20 PF systems, within a similar size, cost, and power footprint*
• US HPC leadership: help ensure continued American leadership in architecture, software, and applications to support scientific discovery, energy assurance, stockpile stewardship, and nonproliferation programs and policies

* Goal of the imminent RFI is to understand the design trade space for a 2021 exascale system

8 Exascale Computing Project

ECP leadership team
Staff from 6 national laboratories, with combined experience of >300 years
• Exascale Computing Project: Paul Messina, Project Director, ANL; Stephen Lee, Deputy Project Director, LANL
• Project Management: Kathlyn Boudwin, Director, ORNL
• Application Development: Doug Kothe, Director, ORNL; Bert Still, Deputy Director, LLNL
• Software Technology: Rajeev Thakur, Director, ANL; Pat McCormick, Deputy Director, LANL
• Hardware Technology: Jim Ang, Director, SNL; John Shalf, Deputy Director, LBNL
• Exascale Systems: Terri Quinn, Director, LLNL; Susan Coghlan, Deputy Director, ANL
• Chief Technology Officer: Al Geist, ORNL
• Integration Manager: Julia White, ORNL
• Communications Manager: Mike Bernhardt, ORNL

9 Exascale Computing Project

ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale
• Application Development – science and mission applications
• Software Technology – scalable and productive software stack
• Hardware Technology – hardware technology elements
• Exascale Systems – integrated exascale supercomputers

[Diagram: the software stack spans Applications; Co-Design; Correctness, Visualization, and Data Analysis; Programming models, development environment, and runtimes; Tools; Math libraries and frameworks; System software (resource management, threading, scheduling, monitoring, and control); Memory and burst buffer; Data management, I/O, and file system; Node OS and runtimes; Resilience; Workflows; and the hardware interface.]

ECP's scope encompasses applications, system software, hardware technologies and architectures, and workforce development

10 Exascale Computing Project

The new approved ECP plan of record
• A 7-year project that follows the holistic/co-design approach and runs through 2023 (including 12 months of schedule contingency) to meet the ECP goals
• Enable an initial exascale system based on an advanced architecture, delivered in 2021
• Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023 as part of NNSA and SC facility upgrades
• Acquisition of the exascale systems is outside the ECP scope; it will be carried out by DOE-SC and NNSA-ASC facilities

11 Exascale Computing Project

What is an exascale advanced architecture?

[Chart: computing capability vs. time, 2017–2027. Evolution of today's architectures continues on a lower (~5×) trajectory; the holistic ECP effort is required to reach the elevated (~10×) trajectory, which includes the first exascale advanced architecture system in 2021 and capable exascale systems in 2023.]

12 Exascale Computing Project

Reaching the Elevated Trajectory will require Advanced and Innovative Architectures

To reach the elevated trajectory, advanced architectures must be developed that make a big leap in:
– Parallelism
– Memory and storage
– Reliability
– Energy consumption

In addition, the exascale advanced architecture will need to solve emerging data science and machine learning problems alongside traditional modeling and simulation applications.

The exascale advanced architecture developments benefit all future U.S. systems on the higher trajectory

13 Exascale Computing Project

High-level ECP technical project schedule
[Timeline, FY2016–FY2026: Hardware Technology, Software Technology, and Application Development R&D run before the facilities procure the first system, then shift to targeted development for known exascale architectures; testbeds support the effort throughout. NRE and site preparation for Exascale System #1 and Exascale System #2 are facilities activities outside ECP.]

14 Exascale Computing Project

ECP WBS
1. Exascale Computing Project – Paul Messina
• 1.1 Project Management – Kathlyn Boudwin
  – 1.1.1 Project Planning and Management – Kathlyn Boudwin
  – 1.1.2 Project Controls & Risk Management – Monty Middlebrook
  – 1.1.3 Business Management – Dennis Parton
  – 1.1.4 Procurement Management – Willy Besancenez
  – 1.1.5 Information Technology and Quality Management – Doug Collins
  – 1.1.6 Communications & Outreach – Mike Bernhardt
  – 1.1.7 Integration – Julia White
• 1.2 Application Development – Doug Kothe
  – 1.2.1 DOE Science and Energy Apps – Andrew Siegel
  – 1.2.2 DOE NNSA Applications – Bert Still
  – 1.2.3 Other Agency Applications – Doug Kothe
  – 1.2.4 Developer Training and Productivity – Ashley Barker
  – 1.2.5 Co-Design and Integration – Phil Colella
• 1.3 Software Technology – Rajeev Thakur
  – 1.3.1 Programming Models and Runtimes – Rajeev Thakur
  – 1.3.2 Tools – Jeff Vetter
  – 1.3.3 Mathematical and Scientific Libraries and Frameworks – Mike Heroux
  – 1.3.4 Data Management and Workflows – Rob Ross
  – 1.3.5 Data Analytics and Visualization – Jim Ahrens
  – 1.3.6 System Software – Martin Schulz
  – 1.3.7 Resilience and Integrity – Al Geist
  – 1.3.8 Co-Design and Integration – Rob Neely
• 1.4 Hardware Technology – Jim Ang
  – 1.4.1 PathForward Vendor Node and System Design – Bronis de Supinski
  – 1.4.2 Design Space Evaluation – John Shalf
  – 1.4.3 Co-Design and Integration – Jim Ang
  – 1.4.4 LeapForward Vendor Node and System Design – TBD
• 1.5 Exascale Systems – Terri Quinn
  – 1.5.1 NRE – Terri Quinn
  – 1.5.2 Testbeds – Terri Quinn
  – 1.5.3 Co-Design and Integration – Susan Coghlan

15 Exascale Computing Project

ECP application, co-design center, and software project awards

16 Exascale Computing Project

ECP Mission Need Defines the Application Strategy

Key science and technology challenges to be addressed with exascale:
• Materials discovery and design
• Climate science
• Nuclear energy
• Combustion science
• Large-data applications
• Fusion energy
• National security
• Additive manufacturing
• Many others!

Meet national security needs:
• Stockpile Stewardship Annual Assessment and Significant Finding Investigations
• Robust uncertainty quantification (UQ) techniques in support of lifetime extension programs
• Understanding evolving nuclear threats posed by adversaries and developing policies to mitigate these threats

Support DOE science and energy missions:
• Discover and characterize next-generation materials
• Systematically understand and improve chemical processes
• Analyze the extremely large datasets resulting from the next generation of particle physics experiments
• Extract knowledge from systems-biology studies of the microbiome
• Advance applied energy technologies (e.g., whole-device models of plasma-based fusion systems)

17 Exascale Computing Project

Exascale Applications Will Address National Challenges
Summary of current DOE Science & Energy application development projects
• Nuclear Energy (NE): accelerate design and commercialization of next-generation small modular reactors* (Climate Action Plan; SMR licensing support; GAIN)
• Climate (BER): accurate regional impact assessment of climate change* (Climate Action Plan)
• Wind Energy (EERE): increase efficiency and reduce cost of turbine wind plants sited in complex terrains* (Climate Action Plan)
• Combustion (BES): design high-efficiency, low-emission combustion engines and gas turbines* (2020 greenhouse gas and 2030 carbon emission goals)
• Chemical Science (BES, BER): biofuel catalyst design; stress-resistant crops (Climate Action Plan; MGI)

* Scope includes a discernible data science component

18 Exascale Computing Project

Exascale Applications Will Address National Challenges (continued)
Summary of current DOE Science & Energy application development projects
• Materials Science (BES): find, predict, and control materials and properties: property change due to hetero-interfaces and complex structures (MGI)
• Materials Science (BES): protein structure and dynamics; 3D molecular structure design of engineering functional properties* (MGI; LCLS-II 2025 Path Forward)
• Nuclear Materials (BES, NE, FES): extend nuclear reactor fuel burnup and develop fusion reactor plasma-facing materials* (Climate Action Plan; MGI; Light Water Reactor Sustainability; ITER; Stockpile Stewardship Program)
• Accelerator Physics (HEP): practical economic design of a 1 TeV electron-positron high-energy collider with plasma wakefield acceleration* (>30k accelerators in use today in industry, security, energy, environment, and medicine)
• Nuclear Physics (NP): QCD-based elucidation of fundamental laws of nature: Standard Model validation and beyond-SM discoveries (2015 Long Range Plan for Nuclear Science; RHIC, CEBAF, FRIB)

* Scope includes a discernible data science component

19 Exascale Computing Project

Application Development: Partnerships

Industry: Intel, NVIDIA, Stone Ridge Technology

Universities: College of William & Mary, Columbia University, University of Indiana, University of Utah, University of Illinois, University of Tennessee, Rutgers University, University of California (Irvine and Los Angeles), Colorado State University, University of Chicago, Virginia Tech, University of Colorado

Laboratory partnerships: Ames, Argonne, Brookhaven, Fermi National Accelerator Laboratory, Idaho, Oak Ridge, Los Alamos, Lawrence Berkeley, Lawrence Livermore, Pacific Northwest, National Renewable Energy Laboratory, Thomas Jefferson National Accelerator Facility, National Energy Technology Laboratory, Princeton Plasma Physics Laboratory, SLAC National Accelerator Laboratory, Sandia

20 Exascale Computing Project

Meeting ECP Mission Need Application Requirements
• The current ECP application portfolio appropriately targets high-priority strategic problems in key DOE science, energy, and national security programs
• Some application gaps remain relative to ECP Mission Need requirements
  – Data analytic computing (DAC) applications per the NSCI Strategic Plan
  – High-priority applications for some NSCI deployment agencies (NASA, NOAA, DHS, FBI)
• ECP has not yet adequately engaged the talented U.S. computational science leadership in academia and industry
• A Request for Information (RFI) for candidate ECP application projects led by academia or industry is currently being drafted
  – Tentative release date of January 2017, with an informational videoconference scheduled shortly thereafter
  – Industry participation will adhere to standard DOE cost-share requirements and IP provisions
  – A subset of RFI responses, determined by DOE, ECP, and external peer review, will be asked to submit full proposals, with final selection of 3–5 projects (sized initially with teams of 2–4 staff) by April 2017
• New data analytic computing methods and applications are of particular interest, in addition to proposals to enhance the current ECP portfolio
  – Critical infrastructure, superfacility, supply chain, image/signal processing, in situ analytics
  – Machine/statistical learning, classification, streaming/graph analytics, discrete event, combinatorial optimization

21 Exascale Computing Project

Application Co-Design (CD)

Essential to ensure that applications effectively utilize exascale systems
• Pulls ST and HT developments into applications
• Pushes application requirements into ST and HT RD&D
• Has evolved from a best practice to an essential element of the development cycle

Executed by several CD Centers, each focusing on a unique collection of algorithmic motifs invoked by ECP applications
• Motif: an algorithmic method that drives a common pattern of computation and communication
• CD Centers must address all high-priority motifs invoked by ECP applications, including not only the 7 "classical" motifs but also the additional 6 motifs identified as associated with data science applications

A game-changing mechanism for delivering next-generation community products with broad application impact
• Evaluate, deploy, and integrate exascale hardware-savvy software designs and technologies for key crosscutting algorithmic motifs into applications

22 Exascale Computing Project

ECP Co-Design Centers (CDCs)
• CODAR: CDC for Online Data Analysis and Reduction at the Exascale – Ian Foster, ANL
  – Motifs: online data analysis and reduction
  – Addresses the growing disparity between simulation speeds and I/O rates, which makes it infeasible for HPC and data analytic applications to perform offline analysis. Targets common data analysis and reduction methods (e.g., feature and outlier detection, compression) and methods specific to particular data types and domains (e.g., particles, FEM)
• CDC for Block-Structured AMR – John Bell, LBNL
  – Motifs: structured mesh, block-structured AMR, particles
  – A new block-structured AMR framework (AMReX) for systems of nonlinear PDEs, providing the basis for temporal and spatial discretization strategies for DOE applications. A unified infrastructure to effectively utilize exascale systems and reduce computational cost and memory footprint while preserving local descriptions of physical processes in complex multi-physics algorithms
• CDC for Efficient Exascale Discretizations (CEED) – Tzanio Kolev, LLNL
  – Motifs: unstructured mesh, spectral methods, finite element (FE) methods
  – Develops FE discretization libraries that enable unstructured PDE-based applications to take full advantage of exascale resources without the need to "reinvent the wheel" of complicated FE machinery on coming exascale hardware
• CDC for Particle-based Methods – Tim Germann, LANL
  – Motif(s): particles (involving particle-particle and particle-mesh interactions); a minimal particle-mesh sketch follows below
  – Focuses on four sub-motifs: short-range particle-particle (e.g., MD and SPH), long-range particle-particle (e.g., electrostatic and gravitational), particle-in-cell (PIC), and the additional sparse matrix and graph operations of linear-scaling quantum MD
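To make the particle-mesh sub-motif concrete, here is a minimal, illustrative sketch (my own example, not code from any ECP co-design center) of 1-D cloud-in-cell charge deposition, in which each particle scatters its charge to the two nearest grid points:

```python
import numpy as np

# A minimal 1-D particle-in-cell (PIC) charge-deposition sketch: particles
# scatter charge to the two nearest grid points with linear (CIC) weights.
def deposit_charge(positions, charges, n_cells, length):
    """Cloud-in-cell deposition of particle charge onto a periodic 1-D grid."""
    dx = length / n_cells
    rho = np.zeros(n_cells)
    x = (positions / dx) % n_cells            # position in units of cells
    left = np.floor(x).astype(int)            # index of the grid point to the left
    frac = x - left                           # fractional distance past that point
    np.add.at(rho, left % n_cells, charges * (1.0 - frac))
    np.add.at(rho, (left + 1) % n_cells, charges * frac)
    return rho / dx                           # charge density on the grid

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 1.0, size=10_000)      # particle positions in [0, 1)
rho = deposit_charge(pos, np.ones(pos.size), n_cells=64, length=1.0)
print(rho.sum() * (1.0 / 64))                 # total charge is conserved: ~10000
```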

23 Exascale Computing Project

Software Technology Summary

• ECP will build a comprehensive and coherent software stack that enables application developers to productively write highly parallel applications that can portably target diverse exascale architectures
• ECP will accomplish this by:
  – Extending current technologies to exascale where possible
  – Performing the R&D required to develop new approaches where necessary
  – Coordinating with vendor efforts to develop and deploy high-quality and robust software products

24 Exascale Computing Project

ST Projects (incl. ATDM) Mapped to the Software Stack
• Programming models, development environment, and runtimes: MPI (MPICH, Open MPI), OpenMP, OpenACC, PGAS (UPC++, Global Arrays), task-based (PaRSEC, Legion, DARMA), RAJA, Kokkos, OMPTD, power steering
• Tools: PAPI, HPCToolkit, Darshan, performance portability (ROSE, Autotuning, PROTEAS), TAU, compilers (LLVM, Flang), Mitos, MemAxes, Caliper, AID, Quo, performance analysis
• Math libraries and frameworks: ScaLAPACK, DPLASMA, MAGMA, PETSc/TAO, Trilinos, xSDK, PEEKS, SuperLU, STRUMPACK, SUNDIALS, DTK, TASMANIAN, AMP, FleCSI, KokkosKernels, Agile Comp., DataProp, MFEM
• Visualization: VTK-m, ALPINE, Cinema
• Data analysis: ALPINE, Cinema
• Data management, I/O, and file system: ExaHDF5, PnetCDF, ROMIO, ADIOS, checkpoint/restart (VeloC, UNIFYCR), compression (EZ, ZFP), I/O services, HXHIM, SIO components, DataWarehouse
• Memory and burst buffer: checkpoint/restart (VeloC, UNIFYCR), API and library for complex memory hierarchy (SICM)
• System software, resource management, threading, scheduling, monitoring, and control: Argo Global OS, Qthreads, Flux, Spindle, BEE, Spack, Sonar
• Node OS and low-level runtimes: Argo OS enhancements, SNL OS project
• Resilience: checkpoint/restart (VeloC, UNIFYCR), FSEFI, fault modeling
• Workflows: Contour, Siboka
• Other stack layers shown without mapped ST projects: Applications, Co-Design, Correctness, Hardware interface

25 Exascale Computing Project

Plan to involve additional Universities and Vendors

• The current set of Software Technology projects spans 8 DOE national laboratories, 15 universities, and 2 independent software vendors

•  In FY17-18, we plan to issue additional solicitations for universities and vendors

•  We plan to coordinate our activities with NSF to cover the broader university community through NSF

•  ECP has formed an Industry Council. This group will also serve as an interface for engaging industrial users and software vendors.

•  Small businesses: The solicitations in FY17-18 will include independent software vendors. DOE SBIR funding could also support software of interest to ECP.

26 Exascale Computing Project

Hardware Technology Overview

Objective: fund R&D to design hardware that meets ECP's targets for application performance, power efficiency, and resilience

Issue PathForward and LeapForward hardware architecture R&D contracts that deliver:
• Conceptual exascale node and system designs
• Analysis of performance improvement on the conceptual system designs
• Technology demonstrators to quantify performance gains over existing roadmaps
• Support for active industry engagement in ECP holistic co-design efforts

DOE labs engage to:
• Participate in evaluation and review of PathForward and LeapForward deliverables
• Lead Design Space Evaluation through architectural analysis and abstract machine models of PathForward/LeapForward designs for ECP's holistic co-design

27 Exascale Computing Project

PathForward will drive improvements in vendor offerings that address ECP's needs for scale, parallel simulations, and large scientific data analytics

PathForward* addresses the disruptive trends in computing due to the power challenge:
• Power challenge: end of Dennard scaling; with today's technology, ~50–100 MW would be required to power the largest systems
• Processor/node trends: GPUs/accelerators; simple in-order cores; unreliability at near-threshold voltages; lack of large-scale cache coherency; massive on-node parallelism
• System trends: complex hardware; massive numbers of nodes; low bandwidth to memory; drop in platform resiliency
• Disruptive changes: new algorithms; new programming models; less "fast" memory; managing for increasing system disruptions

Four challenges: power, memory, parallelism, and resiliency

* LeapForward will prioritize the same technical challenges but address innovative concepts that may have higher risk tolerance for 2021 deployment

28 Exascale Computing Project

PathForward Status
• LLNL is managing the procurement
• Competitive RFP released in June 2016
• 14 PathForward proposals received in August 2016
• 6 proposals selected for SOW negotiations of final work packages
  – As of early December: the first vendor SOW has completed the LLNL procurement review and moved on to Livermore Field Office review; it will need final review by NNSA/HQ
  – Two more vendor SOWs are entering the LLNL procurement review
  – The final three are pending
• Firm fixed-price contracts with milestone deliverables and payments
• DOE advance IP waivers for vendors that provide ≥40% cost share
• Project duration: 3 years
• Targeting contract awards in February 2017

29 Exascale Computing Project

Overarching Goals for PathForward

• Improve the quality and number of competitive offeror responses to the Capable Exascale Systems RFP for delivery in 2022
• Improve offerors' confidence in the value and feasibility of the aggressive advanced technology options that would be bid in response to the Capable Exascale Systems RFP
• Improve DOE confidence in technology performance benefit, programmability, and the ability to integrate into a credible system platform acquisition

30 Exascale Computing Project

Overarching Goals for LeapForward

• Improve the quality and number of competitive offeror responses to the Advanced Architecture Exascale Systems RFP for delivery in 2021
• Develop advanced architecture concepts that arise from holistic co-design to impact ECP applications that are addressing:
  – Evolutionary challenge problems
  – Revolutionary challenge problems
• The Advanced Architecture Exascale System will maintain application performance requirements for a system delivered in 2021, but tradeoffs would be needed in other performance parameters

31 Exascale Computing Project

Hardware Technology Summary

• PathForward reduces the technical risk for NRE investments in the 2022/23 capable exascale system(s)
• LeapForward accepts technical risk in return for game-changing impact on the U.S. computing ecosystem and the potential for early deployment in a 2021/22 exascale system
• Establishes a foundation for architectural diversity in the HPC ecosystem
• Provides hardware technology expertise and analysis
• Provides an opportunity for inter-agency collaboration under NSCI

32 Exascale Computing Project

Exascale Systems: Non-Recurring Engineering (NRE)
Incentivizes awardees to address gaps in their system product roadmaps
• Brings promising hardware and software research (ECP, vendor, lab, etc.) to the product stage and integrates it into a system
• Includes application readiness R&D efforts
• Must start early enough to impact the system; more than two full years of lead time are necessary to maximize impact

Experience has shown that NRE can substantially improve the delivered system

33 Exascale Computing Project

ECP’s plan to accelerate and enhance system capabilities

PathForward Hardware R&D NRE

HW and SW engineering and productization

RFP release

NRE contract awards

System Build

Systems accepted

Build contract awards

NRE: Application Readiness

Co-Design

LeapForward PathForward

34 Exascale Computing Project

Co-Design and Integration
• ECP is a very large DOE project, composed of ~100 separate projects
  – National labs, vendors, universities
  – Many technologies
  – Different timeframes
  – At least two diverse system architectures
• For ECP to be successful, the whole must be more than the sum of the parts
• Holistic co-design will likely push PIs out of their comfort zone on several dimensions

[The ECP WBS organization chart (slide 14) is shown again as the backdrop.]

Over 500 people for the all-hands meeting!

35 Exascale Computing Project

Holistic Co-Design
• AD and ST teams cannot assume hardware architectures are firmly defined inputs to their project plans
  – AD and ST teams can provide inputs to the HT node and system design projects
• HT PathForward/LeapForward projects likewise cannot assume that applications and the software stack are fixed inputs to their project plans
  – HT design teams can provide inputs to the ECP AD and ST projects
• In holistic co-design, each project's output can be another project's input
  – Individual ECP projects do not operate in a vacuum
  – Assumptions about inputs lead to initial project plans, but the give and take of co-design interactions can lead to changes
• The ECP leadership team recognizes that while AD and ST milestones and deliverables will be rigorously managed, some flexibility may be required to support rational changes to initial project plans

36 Exascale Computing Project

Co-Design and ECP Challenges

• Organize multi-disciplinary projects into holistic co-design collaboration teams
  – Your funding arrives in technology-centric focus areas
  – ECP leadership will support integration of projects into collaborative co-design teams
  – Begin development of an integration map
  – ECP leadership will encourage commitment to shared milestones
  – Every ECP project's performance evaluation will include how well it plays with others
• But there is another challenge that ECP must address . . .

37 Exascale Computing Project

Co-Design and ECP Challenges - 2
• With ~25 AD, ~60 ST, 4 CDC, ~5 PathForward, and the DSE teams, all-to-all communication is impractical
• The AD Co-Design Centers and the HT Design Space Evaluation team will develop capabilities to facilitate ECP inter-project communication
  – Proxy applications and benchmarks
  – Abstract machine models and proxy architectures
• The ECP leadership team will actively work to identify:
  – Alignments of cross-cutting projects that will form natural holistic co-design collaboration teams – a cluster of projects
  – Orthogonal projects that should not expend time and energy on forced integration
• Key questions
  – How many projects should be in a holistic co-design team cluster?
  – How many holistic co-design clusters can a project participate in?
  – How many holistic co-design clusters should ECP have?
  – Should we foster higher levels of co-design collaboration?

38 Exascale Computing Project

NSCI Strategic Objectives
1. Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10-petaflop systems across a range of applications representing government needs.
2. Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing (DAC).
3. Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the "post-Moore's Law era").
4. Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.
5. Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.

39 Exascale Computing Project

Motivation for NSCI Objective 2: Increasing coherence between simulation and analytics
• For modeling & simulation (HPC)
  – HPC simulation must ride on some commodity curve
  – Larger market forces are behind analytics
  – Can exploit commodity component technology from analytics
• For large-scale data analytics (LSDA) – DAC
  – Large-scale data analytics problems are becoming ever more sophisticated
  – Requiring more coupled methods
  – Can exploit architectural lessons from HPC simulation
• For both: integration of simulation and analytics in the same workflow
  – Automation of analysis of data from simulation
  – Creation of synthetic data via simulation to augment analysis
  – Automated generation and testing of hypotheses
  – Exploration of new scientific and technical scenarios

Mutual inspiration, technical synergy, and economies of scale in the creation, deployment, and use of HPC resources

40 Exascale Computing Project

NSCI Objective 2 – continued
• Data analytic computing (DAC): inferring new information from what is already known to enable action on that information
• High-performance computing (HPC): insights into interactions of the parts of a system, and the system as a whole, to advance understanding in science and engineering and inform policy and economic decision-making
• DAC and HPC have traditionally relied on different hardware and software stacks
• The M&S community is increasing its use of large-scale data analytics, demanding a more dynamic interaction between analysis and simulations
• The LSDA community is demanding increased computational intensity, hence facing barriers to scalability along with new demands for interoperability, robustness, and reliability of results
• A coherent technology base for HPC and DAC benefits both while maximizing returns on R&D investments
• The primary challenge is the software layers of the HPC environment
  – A more agile and reusable HPC/DAC software portfolio that is equally capable in LSDA and M&S
  – Will improve productivity, increase reliability and trustworthiness, and establish more sustainable yet agile software

* https://www.whitehouse.gov/sites/whitehouse.gov/files/images/NSCI%20Strategic%20Plan.pdf

41 Exascale Computing Project

Application Motifs* (what’s the app footprint?) Algorithmic methods that capture a common pattern of computation and communication

1.  Dense Linear Algebra –  Dense matrices or vectors (e.g., BLAS Level 1/2/3)

2.  Sparse Linear Algebra –  Many zeros, usually stored in compressed matrices to access nonzero

values (e.g., Krylov solvers) 3.  Spectral Methods

–  Frequency domain, combining multiply-add with specific patterns of data permutation with all-to-all for some stages (e.g., 3D FFT)

4.  N-Body Methods (Particles) –  Interaction between many discrete points, with variations being

particle-particle or hierarchical particle methods (e.g., PIC, SPH, PME)

5.  Structured Grids –  Regular grid with points on a grid conceptually updated together with

high spatial locality (e.g., FDM-based PDE solvers)

6.  Unstructured Grids –  Irregular grid with data locations determined by app and connectivity to

neighboring points provided (e.g., FEM-based PDE solvers)

7.  Monte Carlo –  Calculations depend upon statistical results of repeated random trials

8.  Combinational Logic –  Simple operations on large amounts of data, often exploiting bit-level

parallelism (e.g., Cyclic Redundancy Codes or RSA encryption) 9.  Graph Traversal

–  Traversing objects and examining their characteristics, e.g., for searches, often with indirect table lookups and little computation

10.  Graphical Models –  Graphs representing random variables as nodes and dependencies as

edges (e.g., Bayesian networks, Hidden Markov Models) 11.  Finite State Machines

–  Interconnected set of states (e.g., for parsing); often decomposed into multiple simultaneously active state machines that can act in parallel

12.  Dynamic Programming –  Computes solutions by solving simpler overlapping subproblems, e.g.,

for optimization solutions derived from optimal subproblem results 13.  Backtrack and Branch-and-Bound

–  Solving search and global optimization problems for intractably large spaces where regions of the search space with no interesting solutions are ruled out. Use the divide and conquer principle: subdivide the search space into smaller subregions (“branching”), and bounds are found on solutions contained in each subregion under consideration

*The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report No. UCB/EECS-2006-183 (Dec 2006).
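To make the motif idea concrete, here is a minimal, illustrative sketch (my own example, not from the talk or the Berkeley report) of two motifs from the list above: a sparse matrix-vector product in CSR form (motif 2) and a 5-point structured-grid stencil update (motif 5):

```python
import numpy as np

# Motif 2: sparse matrix-vector product, y = A @ x, with A stored in CSR form.
def csr_matvec(data, indices, indptr, x):
    """data/indices/indptr are the standard CSR arrays for an n x n matrix."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for row in range(n):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# Motif 5: one Jacobi sweep of a 5-point stencil on a structured 2-D grid.
def jacobi_step(u):
    """Average each interior point with its four neighbors (high spatial locality)."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

# Tiny smoke test for both kernels.
# 2x2 sparse matrix [[2, 0], [1, 3]] in CSR form:
data, indices, indptr = np.array([2., 1., 3.]), np.array([0, 0, 1]), np.array([0, 1, 3])
print(csr_matvec(data, indices, indptr, np.array([1., 1.])))   # [2. 4.]
grid = np.zeros((6, 6)); grid[0, :] = 1.0                       # hot top boundary
print(jacobi_step(grid)[1, 1:-1])                               # 0.25 along the top interior row
```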

42 Exascale Computing Project

Seven Computational Giants of Massive Data Analysis*
1. Basic Statistics – mean, variance (and other moments), median, sorting, clustering, number of distinct and frequently occurring elements in a data set; O(N) calculations for N data points
2. Generalized N-body Problems – distances, kernels, or other similarities between (all or many) pairs (or higher-order n-tuples) of points; computational complexity O(N²) or O(N³)
3. Graph-Theoretic Computations – traversing a graph where the graph is the data or the statistical model takes the form of a graph; common statistical computations include betweenness, centrality, and commute distances, used to identify nodes or communities of interest
4. Linear Algebraic Computations – linear systems, eigenvalue problems, inverses, many of which result from linear models, e.g., linear regression, PCA. Differentiator: statistical problems such as the optimization in learning (e.g., the eigendecomposition of PCA to optimize a linear convex training error) do not necessarily require high accuracy. Differentiator: multivariate statistics arguably has its own matrix form, that of a kernel (or Gram) matrix
5. Optimizations – all subclasses of optimization, from unconstrained to constrained, both convex and non-convex; non-trivial optimizations in statistical methods, e.g., linear and quadratic programming, second-order cone programming in support vector machines, more recent classifiers, and recent manifold learning methods such as maximum variance unfolding
6. Integration – integration of functions is a key class of computations within massive data analysis; needed for Bayesian inference using any model and in non-Bayesian statistical settings, e.g., random-effects models; integrals that appear in statistics are often expectations with a special form
7. Alignment Problems – matchings between two or more data objects or data sets, e.g., multiple sequence alignments in computational biology, matching of catalogs from different instruments in astronomy, matching of objects between images, correspondence between synonymous words in text analysis; critical in data fusion, being required before further data analysis is possible

(The sketch below contrasts the first two giants.)

* Frontiers in Massive Data Analysis, National Research Council (The National Academies Press, 2013)

43 Exascale Computing Project

Some M&S + LSDA Use Cases, Evident and Possible
• In situ (upstream, during, downstream) analytics [online data analysis and reduction]
  – Cross-section processing (nuclear reactors)
  – Multivariate statistical analysis for characterizing turbulence and ignition dynamics; threshold-based topological feature detection (combustion)
  – Halo finder for identifying compact structures (cosmology)
  – Spatial and temporal feature detection in global ocean and atmospheric models (climate)
  – Formation and evolution of nonlinear turbulent plasma structures (fusion)
  – Multiphysics coupling within a nonlinear solution framework (manufacturing, nuclear reactors, fusion, wind, climate, etc.)
  – Statistically valid spatial and temporal data analyses of billion-atom atomistic simulations (MD for nuclear materials)
• Computational steering
  – Driving ensembles of MD simulations with unsupervised ML (precision medicine for cancer)
  – Robust and error-minimizing ML-driven mesh management (shock physics of dynamic fluids/solids)
• Modeling unresolved physics ("subgrid" models)
  – Superparameterization approach for subgrid cloud-resolving models (climate)
  – ML-driven subgrid turbulence model parameter setting (combustion, wind)

44 Exascale Computing Project

Some M&S + LSDA Use Cases, Evident and Possible (continued)
• Predictive analytics
  – Scalable supervised and unsupervised ML for tumor-drug response and treatment strategies (precision medicine for cancer)
• Superfacility (HPC + experimental facilities)
  – Integration of simulation and experimental science to improve analysis fidelity and turnaround time and to increase knowledge detection/discovery from raw data; computational steering of an experiment via near-real-time interpretation of molecular structure revealed by X-ray diffraction (data analytics for FEL-generated light source facilities)
  – ML-driven additive manufacturing of metal alloys (manufacturing)
• Graph analytics
  – Unknown genome reconstruction from collections of short, overlapping DNA segments (microbiome analysis)
  – Load balancing with spatial and temporal graph partitioning and clustering techniques (power grid)
  – Low-rank methods and dynamic load-balancing strategies in which combinatorial algorithms, such as graph matching and coloring, help determine the work and data distribution (chemistry)

45 Exascale Computing Project

Data structures describing simulation and analytics differ*
Graphs from simulations may be irregular, but have more locality than those derived from analytics (see the sketch below)

[Figure: example graphs. Computational simulation of physical phenomena: climate modeling, car crash. Large-scale data analytics: Internet connectivity, yeast protein interactions.]

* Large Scale Data Analytics and its Relationship to Simulation, Leland, Murphy, Hendrickson, Yelick, Johnson, and Berry
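One hedged, illustrative way to see this difference (my own example, not from the cited paper, and assuming the networkx package is available) is to compare the degree structure of a mesh-like simulation graph with that of a scale-free analytics-style graph; the mesh has small, bounded degree and strong locality under a natural ordering, while the preferential-attachment graph has a heavy-tailed degree distribution:

```python
import networkx as nx

# Mesh-like graph (stands in for a simulation mesh): 100 x 100 2-D grid.
mesh = nx.grid_2d_graph(100, 100)

# Scale-free graph (stands in for an analytics graph such as a web or
# protein-interaction network): Barabasi-Albert preferential attachment.
analytics = nx.barabasi_albert_graph(n=10_000, m=3, seed=0)

for name, g in [("mesh", mesh), ("analytics", analytics)]:
    degrees = [d for _, d in g.degree()]
    print(f"{name:9s} nodes={g.number_of_nodes():6d} "
          f"max degree={max(degrees):4d} mean degree={sum(degrees)/len(degrees):.1f}")
# The mesh's maximum degree is 4; the scale-free graph typically has hubs with
# degrees in the hundreds, one reason its computations have far less locality.
```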

46 Exascale Computing Project

Memory performance demands differ
A key differentiator in the performance of simulation and analytics (see the sketch below)

Standard benchmarks include:
• LINPACK (smallest data intensiveness; barely visible on the graph)
• STREAM
• SPEC FP
• SPEC Int

[Figure from Murphy & Kogge, with the radius of the LINPACK data point doubled to make it visible. The area of each circle is the relative data intensiveness, i.e., the total amount of unique data accessed over a fixed interval of instructions; simulation and analytics workloads form distinct groups on the graph.]
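One way to build intuition for "data intensiveness" (an illustrative calculation of my own, not from Murphy & Kogge) is to compare the arithmetic intensity of a STREAM-style triad with a LINPACK-style dense matrix multiply; the triad touches far more unique data per floating-point operation:

```python
# Arithmetic intensity (flops per byte of unique data) for two kernels, to
# illustrate why LINPACK has the smallest "data intensiveness" on the slide's
# graph while STREAM-like kernels have much more. Illustrative numbers only.
n = 4096

# STREAM triad: a[i] = b[i] + s * c[i]
#   2 flops per element; three arrays of 8-byte values, each touched once.
triad_flops = 2 * n
triad_bytes = 3 * n * 8

# LINPACK-like dense matrix multiply: 2*n^3 flops over 3 matrices of n^2 doubles.
mm_flops = 2 * n ** 3
mm_bytes = 3 * n ** 2 * 8

print(f"STREAM triad : {triad_flops / triad_bytes:6.3f} flops/byte (data-intensive)")
print(f"dense matmul : {mm_flops / mm_bytes:6.1f} flops/byte (compute-intensive)")
```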

47 Exascale Computing Project

Application code characteristics differ
Contrasting properties:
• Spatial locality: M&S – high; LSDA – low
• Temporal locality: M&S – moderate; LSDA – low
• Memory footprint: M&S – moderate; LSDA – high
• Computation type: M&S – may be floating-point dominated*; LSDA – integer intensive
• Input-output orientation: M&S – output dominated; LSDA – input dominated

* Increasingly, simulation work has become less floating-point dominated

48 Exascale Computing Project

So what do we really mean by “increasing coherence” between simulation and analytics?

• NOT one system ostensibly optimized for both simulation and analytics
• Greater commonality in underlying commodity computing components and design principles
• Greater interoperability, allowing interleaving of both types of computations

… A more common hardware and software roadmap between simulation and analytics

49 Exascale Computing Project

Simulation and analytics are evolving to become more similar in their architectural needs
• Current challenges for the DAC/LSDA community … similar to HPC, modeling & simulation
  – Data movement
  – Power consumption
  – Memory/interconnect bandwidth
  – Scaling efficiency
• Instruction mix for Sandia's HPC engineering codes … similar to DAC, LSDA
  – Memory operations: 40%
  – Integer operations: 40%
  – Floating point: 10%
  – Other: 10%
• Common design impacts of energy cost trends
  – Increased concurrency (processing threads, cores, memory depth)
  – Increased complexity and burden on system software, languages, tools, runtime support, and codes

50 Exascale Computing Project

Emerging architectural and system software synergies
• Computation: M&S – memory address generation dominated; LSDA – same
• Primary memory: M&S – low power, high bandwidth, semi-random access; LSDA – same
• Secondary memory: M&S – emerging technologies may offset cost, allowing much more memory; LSDA – … require extremely large memory spaces
• Storage: M&S – integration of another layer of the memory hierarchy to support checkpoint/restart; LSDA – … to support out-of-core data set access
• Interconnect technology: M&S – high bisection bandwidth (for relatively coarse-grained access); LSDA – … (for fine-grained access)
• System software (node-level): M&S – low dependence on system services, increasingly adaptive, resource management for structured parallelism; LSDA – … highly adaptive, resource management for unstructured parallelism
• System software (system-level): M&S – increasingly irregular workflows; LSDA – irregular workflows

51 Exascale Computing Project

Hardware Technology: LeapForward Priorities

• Prioritize R&D in high-impact, high-payoff hardware designs
• Support development of a common technology base of hardware component designs that addresses common bottlenecks for HPC and DAC applications
• Leverage opportunities for interagency collaboration and alignment
• Tap ECP AD and ST project teams for requirements that can drive the future directions of hardware design, through holistic co-design collaborations

52 Exascale Computing Project

This is a very exciting time for computing in the US
• Unique opportunity to do something special for the nation on a rapid time scale
• The advanced architecture system in 2021 affords the opportunity for
  – More rapid advancement and scaling of mission and science applications
  – More rapid advancement and scaling of an exascale software stack
  – Rapid investments in vendor technologies and software needed for the 2021 and 2023 systems
  – More rapid progress in numerical methods and algorithms for advanced architectures
  – Strong leveraging of and broader engagement with US computing capability
• When ECP ends, we will have
  – Run meaningful applications at exascale in 2021, producing useful results
  – Prepared a full suite of mission and science applications for the 2023 capable exascale systems
  – Demonstrated integrated software stack components at exascale
  – Invested in the engineering and development of, and participated in the acquisition and testing of, the 2023 capable exascale systems
  – Prepared industry and critical applications for a more diverse and sophisticated set of computing technologies, carrying US supercomputing well into the future


Thank you!

www.ExascaleProject.org