
Page 1: High Performance Computing and Modeling Infrastructure

Geophysical Fluid Dynamics Laboratory Review
May 20 – May 22, 2014

Presented by Brian Gross (GFDL) and V. Balaji (Princeton University)

Page 2: GFDL’s available computing doubles every ~2 years

… to advance scientific understanding of climate and its natural and anthropogenic variations and impacts, and improve NOAA's predictive capabilities, through the development and use of world-leading computer models of the Earth System.

Page 3: Scientific Advances are Linked to Computer Power

• 1st estimates of effects of 2×CO2
• 1st coupled model on which IPCC was based
• Start of detection & attribution; multiple GHG forcings; natural variability and forcings important; use of ensembles
• 1st simulation of chemistry-transport-radiation of the Antarctic O3 hole
• Attribution of the role of O3 depletion in climate change
• Effects of sulfate and black carbon in the satellite radiation budget
• Coupled ENSO forecasts
• 1st simulation of H2O, cloud-radiation, and ice-albedo feedbacks
• 1st NOAA coupled Earth System Model to close the carbon cycle

Page 4: GFDL depends on sustained, dedicated production computing

Page 5: An Infusion of Funds Accelerated NOAA’s HPC Capacity

• ARRA provided $170M for climate modeling
  – $5M for climate data records
  – $165M for R&D HPC
• Acquired and implemented
  – Two HPC systems (Gaea at ORNL & Zeus at WV)
  – Storage and analysis (remains a local activity)
  – Enhanced NOAA R&D HPC network
  – Associated operations and maintenance

Page 6: GFDL uses as much computing as it can grab

[Figure: GFDL RDHPCS Utilization vs. Allocation — core-hours (log scale, 10^5 to 10^8) vs. time at monthly granularity, with curves for Gaea GFDL allocation, Gaea GFDL usage, Zeus GFDL allocation, and Zeus GFDL usage.]

Page 7: GFDL relies on external partners

• DOE/ORNL
  – Gaea system for production computing
  – Titan to explore GPU architectures
  – Workflow research
• DOE/ANL (competitively awarded INCITE grant)
  – 150M core-hours on Mira (Blue Gene/Q) to explore extreme scaling
• TACC (competitively awarded XSEDE grant)
  – To explore the Intel Phi architecture on Stampede

Page 8: On the horizon …

• Disaster Recovery Act (Sandy Supplemental) supports
  – an upgrade to Zeus in FY15
  – a fine-grained parallel system in FY16
• Targeted for FY16
  – Upgrade Gaea
  – Additional support for software architecture re-engineering
• Continued partnerships to explore near-exascale performance

Page 9: … but this is not enough

Page 10: Computational constraints

Capability: maximum simulated years per day (SYPD) of a single model instance.
Capacity: aggregate SYPD on the available computing hardware.

Model choices (resolution, complexity, ensemble size) are made based on capability requirements (e.g. 5-10 SYPD for decadal-to-centennial simulations, 50-100 SYPD for the carbon cycle) and the available allocation.

Moore’s Law: capability increases 2× every 18 months. We are now in the post-Moore era: concurrency keeps increasing, but arithmetic does not get faster (quite likely slower!). Machines become harder to program, their behavior and performance harder to understand, with possible risks to reproducibility.

Requires: judicious balanced investment between hardware and software.
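To make the capability/capacity distinction concrete, here is a back-of-envelope sketch in Python; all numbers (cores per instance, allocation size, throughput) are illustrative assumptions, not GFDL figures.

```python
# Illustrative capability/capacity arithmetic (hypothetical numbers, not GFDL's).

def capability_sypd(simulated_years_per_hour: float) -> float:
    """Capability: simulated years per day (SYPD) of a single model instance."""
    return simulated_years_per_hour * 24.0

def capacity_sypd(sypd_per_instance: float, cores_per_instance: int,
                  allocated_cores: int) -> float:
    """Capacity: aggregate SYPD over all instances that fit in the allocation."""
    concurrent_instances = allocated_cores // cores_per_instance
    return concurrent_instances * sypd_per_instance

# Example: a decadal-prediction configuration needing ~5-10 SYPD per instance.
sypd = capability_sypd(simulated_years_per_hour=0.3)          # ~7.2 SYPD
total = capacity_sypd(sypd, cores_per_instance=4000,
                      allocated_cores=120000)                 # 30 instances -> ~216 SYPD
print(f"capability: {sypd:.1f} SYPD, capacity: {total:.0f} SYPD aggregate")
```

Raising resolution or complexity lowers SYPD per instance, so the same allocation supports fewer concurrent runs; this is the trade-off behind the model choices above.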


Page 11: GFDL ESM Genealogy

[Diagram: model genealogy — CM2.1 branching into FLOR, CM2.5, CM2.6, CM3, ESM2M, and ESM2G.]

Increasing ocean resolution: 1° → 1/4° → 1/10°
Ocean vertical coordinate: ESM2M — z* (MOM); ESM2G — ρ (GOLD)

Page 12: FMS and FRE

All of the models shown above were built with codes in the Flexible Modeling System (FMS): model components sharing a common codebase, common infrastructure (e.g. parallelism and I/O), and superstructure (the coupling interface).

The FMS Runtime Environment (FRE) provides a fault-tolerant, reproducible environment for configuring, testing, running, and analyzing FMS-based models.

The FRE workflow includes publication of datasets to an external server (ESGF).
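As a rough illustration of the configure → run → verify → publish loop that FRE automates, here is a minimal Python sketch. The function names, the restart-checksum check, and the paths are hypothetical stand-ins, not the actual FRE interface.

```python
# Hypothetical sketch of an FRE-like configure/run/verify/publish loop.
# None of these functions are real FRE commands; they only illustrate the idea
# of a reproducible workflow whose last step hands data to an ESGF node.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Checksum a restart file so a re-run can be compared bit for bit."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def run_experiment(config: dict, workdir: Path) -> Path:
    """Stand-in for building and submitting an FMS model run."""
    restart = workdir / f"{config['name']}.res.nc"
    restart.write_bytes(b"model restart placeholder")  # a real run writes restarts here
    return restart

config = {"name": "esm_test", "resolution": "1deg", "ensemble_size": 1}
workdir = Path("/tmp/fre_demo")
workdir.mkdir(parents=True, exist_ok=True)

# Run twice from the same configuration: a reproducible setup yields identical restarts.
first = checksum(run_experiment(config, workdir))
second = checksum(run_experiment(config, workdir))
assert first == second, "run is not reproducible"

print("restarts verified identical; dataset ready for publication to the ESGF node")
```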

Page 13: The hardware jungle and the software zoo

Processor clock speeds have stalled: it all hinges now on increased concurrency.

• Hosted systems (e.g. CPU+GPU)
• Many-core systems (e.g. MICs)
• Equally many programming techniques! MPI, OpenMP, OpenACC, PGAS… harder to program and to achieve performance…

GFDL’s conservative approach: a standards-based programming model (messages, threads, vectors) with offloaded I/O, plus extensive prototyping on experimental hardware.
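To illustrate the "messages" layer of that standards-based model, here is a minimal halo exchange on a 1-D domain decomposition. It is a sketch in Python with mpi4py for readability; the production FMS infrastructure does this in Fortran with MPI, and the array sizes here are arbitrary.

```python
# Minimal 1-D halo exchange (illustrative; FMS does the equivalent in Fortran/MPI).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 8                                          # interior points owned by this rank
field = np.full(n + 2, float(rank))            # indices 0 and n+1 are halo points
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send my rightmost interior point to the right neighbour; fill my left halo.
comm.Sendrecv(sendbuf=field[n:n + 1], dest=right, sendtag=0,
              recvbuf=field[0:1], source=left, recvtag=0)
# Send my leftmost interior point to the left neighbour; fill my right halo.
comm.Sendrecv(sendbuf=field[1:2], dest=left, sendtag=1,
              recvbuf=field[n + 1:], source=right, recvtag=1)

print(f"rank {rank}: halos = ({field[0]}, {field[-1]})")
```

Run with, e.g., `mpiexec -n 4 python halo.py`. The same pattern generalizes to 2-D decompositions and sits underneath the threading and vectorization layers in the real models.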

Page 14: Fault-resilient workflow

• Reproducible, fault-tolerant workflow across remote compute and the local archive, including publication to an ESGF node (a retry-and-verify sketch follows below)
• Large-scale automation and testing
• Also the basis for the disaster recovery plan
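A minimal sketch of the fault-tolerance idea: retry a transfer from remote compute to the local archive and verify it by checksum before moving on. The transfer command (scp), paths, and retry policy are illustrative assumptions, not the actual GFDL workflow tooling.

```python
# Hypothetical retry-with-verification step of a fault-tolerant archive transfer.
import hashlib
import subprocess
import time
from pathlib import Path

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def transfer_with_retry(src: str, dst: Path, expected_sha: str,
                        attempts: int = 3, backoff_s: float = 30.0) -> None:
    """Copy src to dst, verify the checksum, retry on failure, then give up loudly."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["scp", src, str(dst)], check=True)   # stand-in transfer tool
            if sha256(dst) == expected_sha:
                return                                           # verified, done
            raise IOError("checksum mismatch after transfer")
        except (subprocess.CalledProcessError, IOError):
            if attempt == attempts:
                raise                                            # surface the fault
            time.sleep(backoff_s * attempt)                      # back off and retry

# Example call (hypothetical host and paths):
# transfer_with_retry("gaea:/lustre/scratch/expt01/history.tar",
#                     Path("/archive/expt01/history.tar"), expected_sha="...")
```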

Page 15: Summary

• GFDL Strategic Plan: process studies, development of comprehensive models, climate extremes, experimental prediction, downstream science.
• Continued development of the atmospheric dynamical core; unification of ocean modeling capabilities.
• Convergence of multiple model branches into the trunk model, CM4.
• Forecast workflow.

Page 16: Challenges

• The right way to program the next generation of parallel machines
• Component and process concurrency
• Reproducibility: what if models became more like experimental biological systems, where an individual cell “culture” is not reproducible, only the ensemble is? (See the toy sketch below.)
• How to understand and analyze performance on a “sea of functional units”?
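As a toy illustration of that ensemble-reproducibility idea (purely illustrative; a random walk stands in for a chaotic model): two ensembles built from different seeds disagree member by member, yet their statistics agree within sampling error.

```python
# Toy illustration: individual realizations differ, ensemble statistics converge.
import numpy as np

def run_member(seed: int, n_steps: int = 10_000) -> float:
    """Stand-in for one chaotic realization: the end point of a random walk."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(0.0, 1.0, n_steps).sum())

ens_a = np.array([run_member(seed) for seed in range(0, 100)])
ens_b = np.array([run_member(seed) for seed in range(1000, 1100)])

# Member by member, the two ensembles are not reproducible...
print(f"first members differ: {ens_a[0]:.1f} vs {ens_b[0]:.1f}")
# ...but their ensemble statistics agree to within sampling error (~ sigma/sqrt(N)).
print(f"ensemble A: mean {ens_a.mean():.1f}, std {ens_a.std():.1f}")
print(f"ensemble B: mean {ens_b.mean():.1f}, std {ens_b.std():.1f}")
```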