TRANSCRIPT
www.ExascaleProject.org
The National Strategic Computing Initiative and Synergistic Opportunities for Massive-scale Scientific and Data-analytic Computing
Presented to: Columbia University Data Science Institute Colloquium, January 20, 2017
James Ang, Ph.D.
Sandia National Laboratories – Manager, Exascale Computing Program
DOE Exascale Computing Project – Director, Hardware Technology
Acknowledgements: Paul Messina (ANL), Doug Kothe (ORNL), Rajeev Thakur (ANL), Terri Quinn (LLNL), John Shalf (LBNL), Rob Leland, Bruce Hendrickson, Jonathan Berry (SNL), Katherine Yelick (LBNL), Richard Murphy (Micron), John Johnson (PNNL)
SAND2017-0746 PE
3 Exascale Computing Project
NSCI Strategic Objectives
1. Accelerating delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 100 times the performance of current 10-petaflop systems across a range of applications representing government needs.
2. Increasing coherence between the technology base used for modeling and simulation and that used for data analytic computing (DAC).
3. Establishing, over the next 15 years, a viable path forward for future HPC systems even after the limits of current semiconductor technology are reached (the "post-Moore's Law era").
4. Increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.
5. Developing an enduring public-private collaboration to ensure that the benefits of the research and development advances are, to the greatest extent, shared between the United States Government and industrial and academic sectors.
4 Exascale Computing Project
DOE's Role in NSCI
• The National Strategic Computing Initiative (NSCI), created by President Obama's Executive Order, aims to maximize the benefits of HPC for US economic competitiveness and scientific discovery
– Leadership in high-performance computing and large-scale data analysis will advance national competitiveness in a wide array of strategic sectors, including basic science, national security, energy technology, and economic prosperity.
• DOE is a lead agency within NSCI; DOE's role is focused on advanced simulation through capable exascale computing, emphasizing sustained performance on relevant applications and data analytic computing
– NSCI Strategic Objective (1): DOE Exascale Computing Project
– NSCI Strategic Objective (2): driver for inter-agency collaboration
5 Exascale Computing Project
The Exascale Computing Project (ECP)
• A collaborative effort of two US Department of Energy (DOE) organizations:
– Office of Science (DOE-SC)
– National Nuclear Security Administration (NNSA)
• A 7-year project to accelerate the development of capable exascale systems
– Led by DOE laboratories
– Executed in partnership with academia and industry

A capable exascale computing system will leverage a balanced ecosystem (software, hardware, applications)
6 Exascale Computing Project
What is a capable exascale computing system?
A capable exascale computing system requires an entire computational ecosystem that:
• Delivers 50× the performance of today's 20 PF systems, supporting applications that deliver high-fidelity solutions in less time and address problems of greater complexity
• Operates in a power envelope of 20–30 MW
• Is sufficiently resilient (average fault rate: ≤1/week)
• Includes a software stack that supports a broad spectrum of applications and workloads

This ecosystem will be developed using a co-design approach to deliver new software, applications, platforms, and computational science capabilities at heretofore unseen scale.
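The targets above can be sanity-checked with simple arithmetic; the figures come from the slide, while the variable names and the derived efficiency range are ours:

```python
# Back-of-envelope check of the "capable exascale" targets above.
today_pf = 20                        # today's ~20 PF systems
speedup = 50                         # target: 50x performance
target_flops = today_pf * 1e15 * speedup
assert target_flops == 1e18          # i.e., 1 exaflop

# A 20-30 MW power envelope at 1 EF implies this sustained efficiency:
low = target_flops / 30e6            # FLOP/s per watt at 30 MW
high = target_flops / 20e6           # FLOP/s per watt at 20 MW
print(f"{low / 1e9:.1f}-{high / 1e9:.1f} GF/W")  # -> 33.3-50.0 GF/W
```

The implied ~33–50 GF/W is why the deck treats power, not peak speed, as the binding constraint.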
7 Exascale Computing Project
Exascale Computing Project Goals
• Foster application development: Develop scientific, engineering, and large-data applications that exploit the emerging, exascale-era computational trends caused by the end of Dennard scaling and Moore's law
• Ease of use: Create software that makes exascale systems usable by a wide variety of scientists and engineers across a range of applications
• Rich exascale ecosystem: Enable, by 2021 and 2023, at least two diverse computing platforms with up to 50× more computational capability than today's 20 PF systems, within a similar size, cost, and power footprint*
• US HPC leadership: Help ensure continued American leadership in architecture, software, and applications to support scientific discovery, energy assurance, stockpile stewardship, and nonproliferation programs and policies
* Goal of imminent RFI is to understand the design trade space for a 2021 exascale system
8 Exascale Computing Project
ECP leadership team: staff from 6 national laboratories, with combined experience of >300 years
• Exascale Computing Project: Paul Messina, Project Director, ANL; Stephen Lee, Deputy Project Director, LANL
• Project Management: Kathlyn Boudwin, Director, ORNL
• Application Development: Doug Kothe, Director, ORNL; Bert Still, Deputy Director, LLNL
• Software Technology: Rajeev Thakur, Director, ANL; Pat McCormick, Deputy Director, LANL
• Hardware Technology: Jim Ang, Director, SNL; John Shalf, Deputy Director, LBNL
• Exascale Systems: Terri Quinn, Director, LLNL; Susan Coghlan, Deputy Director, ANL
• Chief Technology Officer: Al Geist, ORNL
• Integration Manager: Julia White, ORNL
• Communications Manager: Mike Bernhardt, ORNL
9 Exascale Computing Project
ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale:
• Application Development – science and mission applications
• Software Technology – scalable and productive software stack
• Hardware Technology – hardware technology elements
• Exascale Systems – integrated exascale supercomputers

[Software stack diagram: applications and co-design sit atop programming models, development environment, and runtimes; tools; math libraries and frameworks; data management, I/O, and file system; system software (resource management, threading, scheduling, monitoring, and control); node OS and runtimes; and memory and burst buffer; with correctness, visualization, data analysis, resilience, and workflows as crosscutting concerns above the hardware interface.]
ECP’s scope encompasses applications, system software, hardware technologies and architectures, and workforce development
10 Exascale Computing Project
The newly approved ECP plan of record
• A 7-year project that follows the holistic/co-design approach and runs through 2023 (including 12 months of schedule contingency)
• To meet the ECP goals:
– Enable an initial exascale system based on an advanced architecture, delivered in 2021
– Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023 as part of NNSA and SC facility upgrades
• Acquisition of the exascale systems is outside the ECP scope; it will be carried out by the DOE-SC and NNSA-ASC facilities
11 Exascale Computing Project
What is an exascale advanced architecture?
[Chart: computing capability vs. time, 2017–2027. Evolution of today's architectures continues on a lower (~5×) trajectory; the first exascale advanced architecture system and the subsequent capable exascale systems sit on an elevated (~10×) trajectory. A holistic project is required to be on the elevated trajectory.]
12 Exascale Computing Project
Reaching the Elevated Trajectory will require Advanced and Innovative Architectures

To reach the elevated trajectory, advanced architectures must be developed that make a big leap in:
– Parallelism
– Memory and storage
– Reliability
– Energy consumption

In addition, the exascale advanced architecture will need to solve emerging data science and machine learning problems alongside traditional modeling and simulation applications.

The exascale advanced architecture developments benefit all future U.S. systems on the higher trajectory.
13 Exascale Computing Project
High-level ECP technical project schedule (FY2016–FY2026)
[Schedule chart: Application Development, Software Technology, and Hardware Technology R&D run before the facilities procure the first system, then shift to targeted development for known exascale architectures. Testbeds run throughout. Facilities activities outside ECP: NRE and site preparation for Exascale System #1, followed by NRE and site preparation for Exascale System #2.]
14 Exascale Computing Project
ECP WBS
1. Exascale Computing Project – Paul Messina
1.1 Project Management – Kathlyn Boudwin
1.1.1 Project Planning and Management – Kathlyn Boudwin
1.1.2 Project Controls & Risk Management – Monty Middlebrook
1.1.3 Business Management – Dennis Parton
1.1.4 Procurement Management – Willy Besancenez
1.1.5 Information Technology and Quality Management – Doug Collins
1.1.6 Communications & Outreach – Mike Bernhardt
1.1.7 Integration – Julia White
1.2 Application Development – Doug Kothe
1.2.1 DOE Science and Energy Apps – Andrew Siegel
1.2.2 DOE NNSA Applications – Bert Still
1.2.3 Other Agency Applications – Doug Kothe
1.2.4 Developer Training and Productivity – Ashley Barker
1.2.5 Co-Design and Integration – Phil Colella
1.3 Software Technology – Rajeev Thakur
1.3.1 Programming Models and Runtimes – Rajeev Thakur
1.3.2 Tools – Jeff Vetter
1.3.3 Mathematical and Scientific Libraries and Frameworks – Mike Heroux
1.3.4 Data Management and Workflows – Rob Ross
1.3.5 Data Analytics and Visualization – Jim Ahrens
1.3.6 System Software – Martin Schulz
1.3.7 Resilience and Integrity – Al Geist
1.3.8 Co-Design and Integration – Rob Neely
1.4 Hardware Technology – Jim Ang
1.4.1 PathForward Vendor Node and System Design – Bronis de Supinski
1.4.2 Design Space Evaluation – John Shalf
1.4.3 Co-Design and Integration – Jim Ang
1.4.4 LeapForward Vendor Node and System Design – TBD
1.5 Exascale Systems – Terri Quinn
1.5.1 NRE – Terri Quinn
1.5.2 Testbeds – Terri Quinn
1.5.3 Co-Design and Integration – Susan Coghlan
16 Exascale Computing Project
ECP Mission Need Defines the Application Strategy

Key science and technology challenges to be addressed with exascale:
• Materials discovery and design • Climate science • Nuclear energy • Combustion science • Large-data applications • Fusion energy • National security • Additive manufacturing • Many others!

Meet national security needs:
• Stockpile Stewardship Annual Assessment and Significant Finding Investigations
• Robust uncertainty quantification (UQ) techniques in support of lifetime extension programs
• Understanding evolving nuclear threats posed by adversaries and developing policies to mitigate these threats

Support DOE science and energy missions:
• Discover and characterize next-generation materials
• Systematically understand and improve chemical processes
• Analyze the extremely large datasets resulting from the next generation of particle physics experiments
• Extract knowledge from systems-biology studies of the microbiome
• Advance applied energy technologies (e.g., whole-device models of plasma-based fusion systems)
17 Exascale Computing Project
Exascale Applications Will Address National Challenges
Summary of current DOE Science & Energy application development projects:
• Nuclear Energy (NE): Accelerate design and commercialization of next-generation small modular reactors* – Climate Action Plan; SMR licensing support; GAIN
• Climate (BER): Accurate regional impact assessment of climate change* – Climate Action Plan
• Wind Energy (EERE): Increase efficiency and reduce cost of turbine wind plants sited in complex terrains* – Climate Action Plan
• Combustion (BES): Design high-efficiency, low-emission combustion engines and gas turbines* – 2020 greenhouse gas and 2030 carbon emission goals
• Chemical Science (BES, BER): Biofuel catalysts design; stress-resistant crops – Climate Action Plan; MGI

* Scope includes a discernible data science component
18 Exascale Computing Project
Exascale Applications Will Address National Challenges
Summary of current DOE Science & Energy application development projects:
• Materials Science (BES): Find, predict, and control materials and properties; property change due to hetero-interfaces and complex structures – MGI
• Materials Science (BES): Protein structure and dynamics; 3D molecular structure design of engineering functional properties* – MGI; LCLS-II 2025 Path Forward
• Nuclear Materials (BES, NE, FES): Extend nuclear reactor fuel burnup and develop fusion reactor plasma-facing materials* – Climate Action Plan; MGI; Light Water Reactor Sustainability; ITER; Stockpile Stewardship Program
• Accelerator Physics (HEP): Practical economic design of a 1 TeV electron-positron high-energy collider with plasma wakefield acceleration* – >30k accelerators today in industry, security, energy, environment, medicine
• Nuclear Physics (NP): QCD-based elucidation of fundamental laws of nature: SM validation and beyond-SM discoveries – 2015 Long Range Plan for Nuclear Science; RHIC, CEBAF, FRIB

* Scope includes a discernible data science component
19 Exascale Computing Project
Application Development: Partnerships

INDUSTRY: Intel; NVIDIA; Stone Ridge Technology

UNIVERSITY: College of William & Mary; Columbia University; University of Indiana; University of Utah; University of Illinois; University of Tennessee; Rutgers University; University of California, Irvine & Los Angeles; Colorado State University; University of Chicago; Virginia Tech; University of Colorado

LABORATORY PARTNERSHIPS: Ames; Argonne; Brookhaven; Fermi National Accelerator; Idaho; Oak Ridge; Los Alamos; Lawrence Berkeley; Lawrence Livermore; Pacific Northwest; National Renewable Energy Laboratory; Thomas Jefferson National Accelerator Facility; National Energy Technology Laboratory; Princeton Plasma Physics Laboratory; SLAC National Accelerator; Sandia
20 Exascale Computing Project
Meeting ECP Mission Need Application Requirements
• The current ECP application portfolio appropriately targets high-priority strategic problems in key DOE science, energy, and national security programs
• Some application gaps remain relative to ECP Mission Need requirements
– Data analytic computing (DAC) applications per the NSCI Strategic Plan
– High-priority applications for some NSCI deployment agencies (NASA, NOAA, DHS, FBI)
• ECP has not yet adequately targeted the talented U.S. computational science leadership in academia and industry
• A Request for Information (RFI) for candidate ECP application projects led by academia or industry is currently being drafted
– Tentative release date of Jan 2017, with an informational videoconference scheduled shortly thereafter
– Industry participation will adhere to standard DOE cost-share requirements and IP provisions
– A subset of RFI responses, determined by DOE, ECP, and external peer review, will be asked to submit full proposals, with final selection of 3–5 projects (sized initially with teams of 2–4 staff) by April 2017
• New data analytic computing methods and applications are of particular interest, in addition to proposals to enhance the current ECP portfolio
– Critical infrastructure, superfacility, supply chain, image/signal processing, in situ analytics
– Machine/statistical learning, classification, streaming/graph analytics, discrete event, combinatorial optimization
21 Exascale Computing Project
Application Co-Design (CD)

Essential to ensure that applications effectively utilize exascale systems:
• Pulls ST and HT developments into applications
• Pushes application requirements into ST and HT RD&D
• Evolved from best practice to an essential element of the development cycle

Executed by several CD Centers, each focusing on a unique collection of algorithmic motifs invoked by ECP applications:
• Motif: an algorithmic method that drives a common pattern of computation and communication
• CD Centers must address all high-priority motifs invoked by ECP applications, including not only the 7 "classical" motifs but also the additional 6 motifs identified as associated with data science applications

A game-changing mechanism for delivering next-generation community products with broad application impact:
• Evaluate, deploy, and integrate exascale hardware-savvy software designs and technologies for key crosscutting algorithmic motifs into applications
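A motif can be made concrete with a toy kernel. The sketch below is our illustration, not an ECP code: it shows the sparse linear algebra motif as a compressed sparse row (CSR) matrix-vector product, the computation pattern at the heart of Krylov solvers:

```python
# Toy instance of the sparse linear algebra motif: y = A @ x with A in
# compressed sparse row (CSR) form. A real co-design proxy app would add
# the threading and halo-exchange communication layers around this loop.
def spmv_csr(vals, cols, row_ptr, x):
    """CSR sparse matrix-vector product."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):                      # one row per output entry
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]         # indirect access: the motif's
    return y                                     # irregular memory pattern

# A = [[2, 0], [1, 3]] in CSR; x = [1, 1]
print(spmv_csr([2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3], [1.0, 1.0]))  # -> [2.0, 4.0]
```

The indirect `x[cols[k]]` access is exactly why this motif stresses memory bandwidth rather than floating-point units, which is the kind of property co-design feeds back to hardware designers.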
22 Exascale Computing Project
ECP Co-Design Centers (CDCs)
• CODAR: CDC for Online Data Analysis and Reduction at the Exascale – Ian Foster, ANL
– Motifs: online data analysis and reduction
– Addresses the growing disparity between simulation speeds and I/O rates, which renders it infeasible for HPC and data analytic applications to perform offline analysis. Targets common data analysis and reduction methods (e.g., feature and outlier detection, compression) and methods specific to particular data types and domains (e.g., particles, FEM)
• CDC for Block-Structured AMR – John Bell, LBNL
– Motifs: structured mesh, block-structured AMR, particles
– A new block-structured AMR framework (AMReX) for systems of nonlinear PDEs, providing the basis for temporal and spatial discretization strategies for DOE applications; a unified infrastructure to effectively utilize exascale and reduce computational cost and memory footprint while preserving local descriptions of physical processes in complex multi-physics algorithms
• CDC for Efficient Exascale Discretizations (CEED) – Tzanio Kolev, LLNL
– Motifs: unstructured mesh, spectral methods, finite element (FE) methods
– Develops FE discretization libraries to enable unstructured PDE-based applications to take full advantage of exascale resources without the need to "reinvent the wheel" of complicated FE machinery on coming exascale hardware
• CDC for Particle-Based Methods – Tim Germann, LANL
– Motifs: particles (involving particle-particle and particle-mesh interactions)
– Focuses on four sub-motifs: short-range particle-particle (e.g., MD and SPH), long-range particle-particle (e.g., electrostatic and gravitational), particle-in-cell (PIC), and the additional sparse matrix and graph operations of linear-scaling quantum MD
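To give a concrete feel for the short-range particle-particle sub-motif above, here is a minimal sketch: an O(N²) pair loop with a cutoff, using a stand-in 1/r² potential of our choosing. Production MD/SPH codes replace the brute-force loop with cell or neighbor lists:

```python
# Minimal sketch of the short-range particle-particle pattern: sum a pair
# potential over all pairs within a cutoff radius. The potential here is an
# illustrative stand-in, not a physical force field.
def pair_energy(positions, cutoff):
    """Sum 1/r^2 over all pairs closer than cutoff."""
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):            # each pair counted once
            r2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            if 0.0 < r2 < cutoff ** 2:       # cutoff makes the work short-range
                total += 1.0 / r2
    return total

# Three particles on a line, cutoff 1.5: only the two unit-distance pairs count.
print(pair_energy([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], 1.5))  # -> 2.0
```

The cutoff is what makes the motif "short-range": with spatial binning, each particle interacts with a bounded neighborhood, so the work scales as O(N) rather than O(N²).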
23 Exascale Computing Project
Software Technology Summary
• ECP will build a comprehensive and coherent software stack that will enable application developers to productively write highly parallel applications that can portably target diverse exascale architectures
• ECP will accomplish this by:
– Extending current technologies to exascale where possible
– Performing the R&D required to develop new approaches where necessary
– Coordinating with vendor efforts to develop and deploy high-quality and robust software products
24 Exascale Computing Project
ST Projects (incl. ATDM) Mapped to Software Stack
• Correctness
• Visualization: VTK-m, ALPINE, Cinema
• Data Analysis: ALPINE, Cinema
• Programming Models, Development Environment, and Runtimes: MPI (MPICH, Open MPI), OpenMP, OpenACC, PGAS (UPC++, Global Arrays), task-based (PaRSEC, Legion, DARMA), RAJA, Kokkos, OMPTD, power steering
• Tools: PAPI, HPCToolkit, Darshan, performance portability (ROSE, Autotuning, PROTEAS), TAU, compilers (LLVM, Flang), Mitos, MemAxes, Caliper, AID, Quo, performance analysis
• Math Libraries/Frameworks: ScaLAPACK, DPLASMA, MAGMA, PETSc/TAO, Trilinos, xSDK, PEEKS, SuperLU, STRUMPACK, SUNDIALS, DTK, TASMANIAN, AMP, FleCSI, KokkosKernels, Agile Comp., DataProp, MFEM
• System Software, Resource Management, Threading, Scheduling, Monitoring, and Control: Argo Global OS, Qthreads, Flux, Spindle, BEE, Spack, Sonar
• Memory and Burst Buffer: checkpoint/restart (VeloC, UNIFYCR), API and library for complex memory hierarchy (SICM)
• Data Management, I/O, and File System: ExaHDF5, PnetCDF, ROMIO, ADIOS, checkpoint/restart (VeloC, UNIFYCR), compression (EZ, ZFP), I/O services, HXHIM, SIO Components, DataWarehouse
• Node OS, Low-Level Runtimes: Argo OS enhancements, SNL OS project
• Resilience: checkpoint/restart (VeloC, UNIFYCR), FSEFI, fault modeling
• Workflows: Contour, Siboka
• Hardware interface
25 Exascale Computing Project
Plan to involve additional Universities and Vendors
• The current set of Software Technology projects spans 8 DOE national laboratories, 15 universities, and 2 independent software vendors
• In FY17-18, we plan to issue additional solicitations for universities and vendors
• We plan to coordinate our activities with NSF to cover the broader university community through NSF
• ECP has formed an Industry Council. This group will also serve as an interface for engaging industrial users and software vendors.
• Small businesses: The solicitations in FY17-18 will include independent software vendors. DOE SBIR funding could also support software of interest to ECP.
26 Exascale Computing Project
Hardware Technology Overview
Objective: Fund R&D to design hardware that meets ECP's targets for application performance, power efficiency, and resilience

Issue PathForward and LeapForward hardware architecture R&D contracts that deliver:
• Conceptual exascale node and system designs
• Analysis of performance improvement on the conceptual system designs
• Technology demonstrators to quantify performance gains over existing roadmaps
• Support for active industry engagement in ECP holistic co-design efforts

DOE labs engage to:
• Participate in evaluation and review of PathForward and LeapForward deliverables
• Lead Design Space Evaluation through architectural analysis and abstract machine models of PathForward/LeapForward designs for ECP's holistic co-design
27 Exascale Computing Project
PathForward will drive improvements in vendor offerings that address ECP's needs for scale, parallel simulations, and large scientific data analytics. PathForward* addresses the disruptive trends in computing due to the power challenge.

Power challenge:
• End of Dennard scaling
• Today's technology: ~50–100 MW to power the largest systems

Processor/node trends:
• GPUs/accelerators
• Simple in-order cores
• Unreliability at near-threshold voltages
• Lack of large-scale cache coherency
• Massive on-node parallelism

System trends:
• Complex hardware
• Massive numbers of nodes
• Low bandwidth to memory
• Drop in platform resiliency

Disruptive changes:
• New algorithms
• New programming models
• Less "fast" memory
• Managing for increasing system disruptions

Four challenges: power, memory, parallelism, and resiliency

* LeapForward will prioritize the same technical challenges but address innovative concepts that may have higher risk tolerance for 2021 deployment
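Reading the slide's ~50–100 MW figure as what today's technology would need at the largest scale, the efficiency gap versus the 20–30 MW exascale envelope can be computed directly. The arithmetic below is ours, not a deck figure:

```python
# Rough scale of the power challenge: ratio of today's-technology power
# draw to the exascale envelope, i.e., the required FLOPS/W improvement.
extrapolated_mw = (50, 100)   # ~50-100 MW with today's technology (slide)
target_mw = (20, 30)          # 20-30 MW exascale power envelope (slide)

best_case = extrapolated_mw[0] / target_mw[1]    # 50 MW vs 30 MW envelope
worst_case = extrapolated_mw[1] / target_mw[0]   # 100 MW vs 20 MW envelope
print(f"{best_case:.1f}x to {worst_case:.1f}x better FLOPS/W needed")
```

A roughly 2–5× efficiency gap is what motivates the processor/node trends above (accelerators, simple cores, near-threshold operation) rather than straightforward scaling of existing designs.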
28 Exascale Computing Project
PathForward Status
• LLNL is managing the procurement
• Competitive RFP released in June 2016
• 14 PathForward proposals received in August 2016
• 6 proposals selected for SOW negotiations of final work packages
– As of early December, the first vendor SOW has completed the LLNL procurement review and moved on to Livermore Field Office review; it will need final review by NNSA/HQ
– Two more vendor SOWs are entering the LLNL procurement review
– The final three are pending
• Firm fixed-price contracts with milestone deliverables and payments
• DOE advance IP waivers for vendors that provide ≥40% cost share
• Project duration: 3 years
• Targeting contract awards in February 2017
29 Exascale Computing Project
Overarching Goals for PathForward
• Improve the quality and number of competitive offeror responses to the Capable Exascale Systems RFP for delivery in 2022
• Improve offerors' confidence in the value and feasibility of aggressive advanced technology options that would be bid in response to the Capable Exascale Systems RFP
• Improve DOE confidence in technology performance benefit, programmability, and ability to integrate into a credible system platform acquisition
30 Exascale Computing Project
Overarching Goals for LeapForward
• Improve the quality and number of competitive offeror responses to the Advanced Architecture Exascale Systems RFP for delivery in 2021
• Develop Advanced Architecture concepts that arise from Holistic Co-design to impact ECP applications that are addressing: – Evolutionary Challenge Problems – Revolutionary Challenge Problems
• The Advanced Architecture Exascale System will maintain application performance requirements for a system that is delivered in 2021, but there would need to be tradeoffs in other performance parameters
31 Exascale Computing Project
Hardware Technology Summary
• PathForward reduces the technical risk for NRE investments in the 2022/23 capable exascale system(s)
• LeapForward accepts technical risk in return for game-changing impact on the U.S. computing ecosystem and the potential for early deployment in a 2021/22 exascale system
• Establishes a foundation for architectural diversity in the HPC ecosystem
• Provides hardware technology expertise and analysis
• Provides an opportunity for inter-agency collaboration under NSCI
32 Exascale Computing Project
Exascale Systems: Non-Recurring Engineering (NRE)
Incentivizes awardees to address gaps in their system product roadmaps:
• Brings promising hardware and software research (ECP, vendor, lab, etc.) to the product stage and integrates it into a system
• Includes application-readiness R&D efforts
• Must start early enough to impact the system; more than two full years of lead time are necessary to maximize impact

Experience has shown that NRE can substantially improve the delivered system.
33 Exascale Computing Project
ECP’s plan to accelerate and enhance system capabilities
PathForward Hardware R&D NRE
HW and SW engineering and productization
RFP release
NRE contract awards
System Build
Systems accepted
Build contract awards
NRE: Application Readiness
Co-Design
LeapForward PathForward
34 Exascale Computing Project
Co-Design and Integration
• ECP is a very large DOE project, composed of ~100 separate projects
– National labs, vendors, universities
– Many technologies
– Different timeframes
– At least two diverse system architectures
• For ECP to be successful, the whole must be more than the sum of the parts
• Holistic co-design will likely push PIs out of their comfort zone along several dimensions
[ECP WBS chart repeated from slide 14.]
Over 500 people at the all-hands meeting!
35 Exascale Computing Project
Holistic Co-Design
• AD and ST teams cannot assume hardware architectures are firmly defined inputs to their project plans
– AD and ST teams can provide inputs to the HT node and system design projects
• HT PathForward/LeapForward projects also cannot assume that applications and the software stack are fixed inputs to their project plans
– HT design teams can provide inputs to the ECP AD and ST projects
• In holistic co-design, each project's output can be another project's input
– Individual ECP projects do not operate in a vacuum
– Assumptions about inputs lead to initial project plans, but the give and take of co-design interactions can lead to changes
• The ECP leadership team recognizes that while AD and ST milestones and deliverables will be rigorously managed, some flexibility may be required to support rational changes to initial project plans
36 Exascale Computing Project
Co-Design and ECP Challenges
• Organize multi-disciplinary projects into holistic co-design collaboration teams
– Your funding arrives in technology-centric focus areas
– ECP leadership will support integration of projects into collaborative co-design teams
– Begin development of an integration map
– ECP leadership will encourage commitment to shared milestones
– Every ECP project's performance evaluation will include how well they play with others
• But there is another challenge that ECP must address . . .
37 Exascale Computing Project
Co-Design and ECP Challenges - 2
• With ~25 AD, ~60 ST, 4 CDC, ~5 PathForward, and the DSE teams, all-to-all communication is impractical
• The AD Co-Design Centers and the HT Design Space Evaluation team will develop capabilities to facilitate ECP inter-project communication
– Proxy applications and benchmarks
– Abstract machine models and proxy architectures
• The ECP leadership team will actively work to identify:
– Alignments of cross-cutting projects that will form natural holistic co-design collaboration teams – a cluster of projects
– Orthogonal projects that should not expend time and energy on forced integration
• Key questions:
– How many projects should be in a holistic co-design team cluster?
– How many holistic co-design clusters can a project participate in?
– How many holistic co-design clusters should ECP have?
– Should we foster higher levels of co-design collaboration?
38 Exascale Computing Project
[NSCI Strategic Objectives repeated from slide 3.]
39 Exascale Computing Project
Motivation for NSCI Objective 2: Increasing coherence between simulation and analytics
• For modeling & simulation (HPC):
– HPC simulation must ride on some commodity curve
– Larger market forces are behind analytics
– Can exploit commodity component technology from analytics
• For large-scale data analytics (LSDA) – DAC:
– LSDA problems are becoming ever more sophisticated
– Requiring more coupled methods
– Can exploit architectural lessons from HPC simulation
• For both: integration of simulation and analytics in the same workflow
– Automation of analysis of data from simulation
– Creation of synthetic data via simulation to augment analysis
– Automated generation and testing of hypotheses
– Exploration of new scientific and technical scenarios

Mutual inspiration, technical synergy, and economies of scale in the creation, deployment, and use of HPC resources
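One bullet above, creation of synthetic data via simulation to augment analysis, can be sketched end to end. The forward model, names, and numbers below are illustrative assumptions, not anything from the deck:

```python
# Toy integrated workflow: a "simulation" generates synthetic data and an
# "analytics" step recovers the hidden parameter, closing the loop the
# slide describes. The linear model is a stand-in for a real forward model.
import random

def simulate(slope, n=200, seed=1):
    """Forward model: y = slope*x + noise, standing in for a simulation."""
    rng = random.Random(seed)                      # fixed seed: reproducible
    xs = [i / n for i in range(n)]
    ys = [slope * x + rng.gauss(0.0, 0.01) for x in xs]
    return xs, ys

def fit_slope(xs, ys):
    """Analytics step: least-squares slope through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs, ys = simulate(slope=3.0)       # simulation creates synthetic data
estimate = fit_slope(xs, ys)       # analytics infers the parameter back
assert abs(estimate - 3.0) < 0.1   # automated hypothesis test on the result
```

Replacing the linear model with an expensive physics code and the least-squares fit with a learned surrogate gives the exascale version of the same loop.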
40 Exascale Computing Project
NSCI Objective 2 - continued
• Data analytic computing (DAC): inferring new information from what is already known to enable action on that information
• High-performance computing (HPC): insights into interactions of the parts of a system, and the system as a whole, to advance understanding in science and engineering and inform policy and economic decision-making
• DAC and HPC have traditionally relied on different hardware and software stacks
• The M&S community is increasing its use of large-scale data analytics, demanding a more dynamic interaction between analysis and simulations
• The LSDA community is demanding increased computational intensity, and hence faces barriers to scalability along with new demands for interoperability, robustness, and reliability of results
• A coherent technology base for HPC and DAC benefits both while maximizing returns on R&D investments
• Primary challenge: the software layers of the HPC environment
– A more agile and reusable HPC/DAC software portfolio that is equally capable in LSDA and M&S
– Will improve productivity, increase reliability and trustworthiness, and establish more sustainable yet agile software

* https://www.whitehouse.gov/sites/whitehouse.gov/files/images/NSCI%20Strategic%20Plan.pdf
Application Motifs* (what's the app footprint?)
Algorithmic methods that capture a common pattern of computation and communication
1. Dense Linear Algebra – Dense matrices or vectors (e.g., BLAS Level 1/2/3)
2. Sparse Linear Algebra – Many zeros, usually stored in compressed formats to access nonzero values (e.g., Krylov solvers)
3. Spectral Methods – Frequency-domain methods combining multiply-add with specific patterns of data permutation, with all-to-all communication for some stages (e.g., 3D FFT)
4. N-Body Methods (Particles) – Interactions between many discrete points, with variations being particle-particle or hierarchical particle methods (e.g., PIC, SPH, PME)
5. Structured Grids – Regular grid with points conceptually updated together, with high spatial locality (e.g., FDM-based PDE solvers)
6. Unstructured Grids – Irregular grid with data locations determined by the application and connectivity to neighboring points provided (e.g., FEM-based PDE solvers)
7. Monte Carlo – Calculations that depend upon statistical results of repeated random trials
8. Combinational Logic – Simple operations on large amounts of data, often exploiting bit-level parallelism (e.g., cyclic redundancy checks or RSA encryption)
9. Graph Traversal – Traversing objects and examining their characteristics, e.g., for searches, often with indirect table lookups and little computation
10. Graphical Models – Graphs representing random variables as nodes and dependencies as edges (e.g., Bayesian networks, hidden Markov models)
11. Finite State Machines – Interconnected sets of states (e.g., for parsing); often decomposed into multiple simultaneously active state machines that can act in parallel
12. Dynamic Programming – Computes solutions by solving simpler overlapping subproblems, e.g., for optimization problems whose solutions derive from optimal subproblem results
13. Backtrack and Branch-and-Bound – Solving search and global optimization problems over intractably large spaces by ruling out regions of the search space with no interesting solutions, using divide and conquer: subdivide the search space into smaller subregions ("branching") and bound the solutions contained in each subregion under consideration
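Motif 2 above is typified by the sparse matrix-vector product at the core of Krylov solvers. As an illustrative sketch only (not from the slides), here is that kernel over CSR (compressed sparse row) storage, the compressed format the motif description alludes to:

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values:  nonzero entries of A, stored row by row
    col_idx: column index of each nonzero
    row_ptr: row i's nonzeros live in values[row_ptr[i]:row_ptr[i+1]]
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]  # indirect access into x
    return y

# 3x3 example: A = [[2, 0, 1], [0, 3, 0], [0, 0, 4]]
values, col_idx, row_ptr = [2.0, 1.0, 3.0, 4.0], [0, 2, 1, 2], [0, 2, 3, 4]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0])  # → [3.0, 3.0, 4.0]
```

Note the indirect load `x[col_idx[k]]`: it is exactly the irregular memory access that distinguishes this motif from dense linear algebra.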
*The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report No. UCB/EECS-2006-183 (Dec 2006).
Seven Computational Giants of Massive Data Analysis*
1. Basic Statistics – Mean, variance (and other moments), median, sorting, clustering, number of distinct and frequently occurring elements in a data set; O(N) calculations for N data points
2. Generalized N-body Problems – Distances, kernels, or other similarities between (all or many) pairs (or higher-order n-tuples) of points; computational complexity O(N²) or O(N³)
3. Graph-Theoretic Computations – Traversing a graph, where the graph is the data or the statistical model takes the form of a graph. Common statistical computations include betweenness centrality and commute distances, used to identify nodes or communities of interest
4. Linear Algebraic Computations – Linear systems, eigenvalue problems, inverses, many of which result from linear models, e.g., linear regression, PCA. Differentiator: statistical problems such as the optimization in learning, e.g., the eigendecomposition in PCA to optimize a linear convex training error, do not necessarily require high accuracy. Differentiator: multivariate statistics arguably has its own matrix form, that of a kernel (or Gram) matrix
5. Optimizations – All subclasses of optimization, from unconstrained to constrained, both convex and non-convex. Non-trivial optimizations arise in statistical methods, e.g., linear and quadratic programming, second-order cone programming in support vector machines, more recent classifiers, and recent manifold learning methods, e.g., maximum variance unfolding
6. Integration – Integration of functions: a key class of computations within massive data analysis. Needed for Bayesian inference using any model and in non-Bayesian statistical settings, e.g., random effects models. Integrals that appear in statistics are often expectations with a special form
7. Alignment Problems – Matchings between two or more data objects or data sets, e.g., multiple sequence alignments in computational biology, matching of catalogs from different instruments in astronomy, matching of objects between images, and correspondence between synonymous words in text analysis. Critical in data fusion – required before further data analysis is possible
* Frontiers in Massive Data Analysis, National Research Council (The National Academies Press, 2013)
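The O(N²) cost named in giant 2 comes directly from enumerating pairs. A minimal sketch (illustrative only, not from the report) of an all-pairs Euclidean distance computation makes the scaling concrete:

```python
import math

def all_pairs_distances(points):
    """Dense Euclidean distance matrix: O(N^2) pairs for N points.

    points: list of equal-length coordinate tuples.
    Returns a symmetric N x N matrix of distances.
    """
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(points[i], points[j])))
            d[i][j] = d[j][i] = dist
    return d

pts = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
d = all_pairs_distances(pts)  # d[0][1] == 5.0
```

The quadratic pair loop is why practical methods for this giant turn to trees, sampling, and other approximations at scale.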
Some M&S + LSDA Use Cases Evident and Possible
• In situ (upstream, during, downstream) analytics [online data analysis and reduction]
  – Cross-section processing (nuclear reactors)
  – Multi-variate statistical analysis for characterizing turbulence & ignition dynamics; threshold-based topological feature detection (combustion)
  – Halo finder for identifying compact structures (cosmology)
  – Spatial and temporal feature detection in global ocean and atmospheric models (climate)
  – Formation and evolution of nonlinear turbulent plasma structures (fusion)
  – Multiphysics coupling within a nonlinear solution framework (manufacturing, nuclear reactors, fusion, wind, climate, etc.)
  – Statistically valid spatial and temporal data analyses of billion-atom atomistic simulations (MD for nuclear materials)
• Computational steering
  – Driving ensembles of MD simulations with unsupervised ML (precision medicine for cancer)
  – Robust and error-minimizing ML-driven mesh management (shock physics of dynamic fluids/solids)
• Modeling unresolved physics ("subgrid" models)
  – Superparameterization approach for subgrid cloud-resolving models (climate)
  – ML-driven subgrid turbulence model parameter setting (combustion, wind)
Some M&S + LSDA Use Cases Evident and Possible
• Predictive analytics
  – Scalable supervised and unsupervised ML for tumor-drug response and treatment strategies (precision medicine for cancer)
• Superfacility (HPC + experimental facilities)
  – Integration of simulation and experimental science to improve analysis fidelity and turnaround time and increase knowledge detection/discovery from raw data. Computational steering of an experiment via near-real-time interpretation of molecular structure revealed by X-ray diffraction (data analytics at FEL-generated light source facilities)
  – ML-driven additive manufacturing of metal alloys (manufacturing)
• Graph analytics
  – Unknown genome reconstruction from collections of short, overlapping DNA segments (microbiome analysis)
  – Load balancing with spatial and temporal graph partitioning and clustering techniques (power grid)
  – Design of low-rank methods and dynamic load-balancing strategies, where combinatorial algorithms such as graph matching and coloring can help determine the work and data distribution (chemistry)
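The graph coloring named in the last bullet can be sketched in a few lines. This greedy variant (illustrative only; the adjacency-list representation is an assumption, not anything from the talk) assigns each vertex the smallest color not already used by its neighbors:

```python
def greedy_coloring(adj):
    """Greedy graph coloring.

    adj: dict mapping vertex -> iterable of neighbor vertices.
    Returns dict vertex -> color (0-based). Neighboring vertices
    always receive distinct colors; the number of colors used is
    at most max_degree + 1.
    """
    colors = {}
    for v in adj:
        taken = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in taken:  # smallest color absent from neighbors
            c += 1
        colors[v] = c
    return colors

# Example: a 4-cycle, which greedy coloring handles with 2 colors here
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
coloring = greedy_coloring(square)
```

In the distributed-memory settings the slide has in mind, such a coloring partitions work into sets of vertices that can be updated concurrently without conflicts.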
Data structures describing simulation and analytics differ*
Graphs from simulations may be irregular, but have more locality than those derived from analytics
[Figure: example graphs. Computational simulation of physical phenomena: climate modeling, car crash. Large-scale data analytics: Internet connectivity, yeast protein interactions.]
* Large Scale Data Analytics and its Relationship to Simulation, Leland, Murphy, Hendrickson, Yelick, Johnson and Berry
Memory performance demands differ
A key differentiator in the performance of simulation and analytics
[Figure from Murphy & Kogge: simulation and analytics workloads plotted with standard benchmarks, including LINPACK (smallest data intensiveness; its data point's radius is doubled to make it visible), STREAM, SPEC FP, and SPEC Int. Area of each circle = relative data intensiveness, i.e., the total amount of unique data accessed over a fixed interval of instructions.]
Application code characteristics differ
Contrasting properties:

Application code property    Modeling & Simulation              Large Scale Data Analytics
Spatial locality             High                               Low
Temporal locality            Moderate                           Low
Memory footprint             Moderate                           High
Computation type             May be floating-point dominated*   Integer intensive
Input-output orientation     Output dominated                   Input dominated

* Increasingly, simulation work has become less floating-point dominated
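The spatial-locality contrast in the first table row can be made concrete with a small sketch (illustrative only, not from the talk): a structured-grid stencil sweep walks memory sequentially, while an analytics-style indirect gather jumps to data-dependent locations:

```python
import random

def stencil_sweep(a):
    """Structured-grid style update: neighbors are adjacent in memory,
    so accesses are sequential (high spatial locality)."""
    return [(a[i - 1] + a[i] + a[i + 1]) / 3.0 for i in range(1, len(a) - 1)]

def indirect_gather(a, idx):
    """Analytics-style access: indices come from the data itself
    (e.g., graph edges), so accesses are scattered (low spatial locality)."""
    return [a[j] for j in idx]

a = [float(i) for i in range(1000)]
smoothed = stencil_sweep(a)                  # sequential, cache-friendly
idx = random.sample(range(len(a)), 100)      # data-dependent index set
gathered = indirect_gather(a, idx)           # scattered accesses
```

On real hardware the first loop streams through cache lines, while the second pays a potential cache miss per element, which is the behavior the table summarizes as High versus Low spatial locality.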
So what do we really mean by "increasing coherence" between simulation and analytics?
• NOT one system ostensibly optimized for both simulation and analytics
• Greater commonality in underlying commodity computing components and design principles
• Greater interoperability, allowing interleaving of both types of computations
… a more common hardware and software roadmap between simulation and analytics
Simulation and analytics are evolving to become more similar in their architectural needs
• Current challenges for the DAC/LSDA community … similar to HPC, modeling & simulation
  – Data movement
  – Power consumption
  – Memory/interconnect bandwidth
  – Scaling efficiency
• Instruction mix for Sandia's HPC engineering codes … similar to DAC, LSDA
  – Memory operations: 40%
  – Integer operations: 40%
  – Floating point: 10%
  – Other: 10%
• Common design impacts of energy cost trends
  – Increased concurrency (processing threads, cores, memory depth)
  – Increased complexity and burden on system software, languages, tools, runtime support, codes
Emerging architectural and system software synergies

Architectural characteristic, for Modeling & Simulation vs. Large Scale Data Analytics:
• Computation – M&S: memory address generation dominated; LSDA: same
• Primary memory – M&S: low power, high bandwidth, semi-random access; LSDA: same
• Secondary memory – M&S: emerging technologies may offset cost, allowing much more memory; LSDA: … and require extremely large memory spaces
• Storage – M&S: integration of another layer of memory hierarchy to support checkpoint/restart; LSDA: … to support out-of-core data set access
• Interconnect technology – M&S: high bisection bandwidth (for relatively coarse-grained access); LSDA: … (for fine-grained access)
• System software (node-level) – M&S: low dependence on system services, increasingly adaptive, resource management for structured parallelism; LSDA: … highly adaptive, resource management for unstructured parallelism
• System software (system-level) – M&S: increasingly irregular workflows; LSDA: irregular workflows
Hardware Technology: LeapForward Priorities
• Prioritize R&D in high impact, high payoff hardware designs
• Support development of common technology base of hardware component designs that addresses common bottlenecks for HPC and DAC applications
• Leverage opportunities for Interagency Collaboration and Alignment
• Tap ECP AD and ST project teams for requirements that can drive the future directions of hardware design – through holistic co-design collaborations
This is a very exciting time for computing in the US
• Unique opportunity to do something special for the nation on a rapid time scale
• The advanced architecture system in 2021 affords the opportunity for
  – More rapid advancement and scaling of mission and science applications
  – More rapid advancement and scaling of an exascale software stack
  – Rapid investments in vendor technologies and software needed for the 2021 and 2023 systems
  – More rapid progress in numerical methods and algorithms for advanced architectures
  – Strong leveraging of and broader engagement with US computing capability
• When ECP ends, we will have
  – Run meaningful applications at exascale in 2021, producing useful results
  – Prepared a full suite of mission and science applications for 2023 capable exascale systems
  – Demonstrated integrated software stack components at exascale
  – Invested in the engineering and development of, and participated in the acquisition and testing of, 2023 capable exascale systems
  – Prepared industry and critical applications for a more diverse and sophisticated set of computing technologies, carrying US supercomputing well into the future