TRANSCRIPT
The U.S. DOE Accelerated Climate Modeling for Energy Project
Robert Jacob, April 22, 2015, Third Workshop on Coupling Technologies for Earth System Models
Argonne National Laboratory
• $675M operating budget
• 3,200 employees
• 1,450 scientists and engineers
• 750 Ph.D.s
ACME in a nutshell… A new U.S. climate modeling effort led by the U.S. Department of Energy Office of Biological and Environmental Research
or…
“A collaboration among the DOE national laboratories (and a few other institutions) to develop and apply the most complete, leading-edge climate and Earth system models for the most challenging and demanding climate-change research problems and DOE mission needs while efficiently using DOE Leadership Computing Facilities.”
Why should DOE in particular be interested?
Leading DOE machines at our last meeting
• IBM Blue Gene/Q System
  – 48 racks
  – 49,152 nodes
  – 786 TB of memory
  – Peak flop rate: 10 PF
• Cray XK7 System
  – 299,008 cores
  – 18,688 NVIDIA Kepler K20x GPUs
  – 710 TB of memory
  – Peak flop rate: 27 PF
Upcoming DOE machines
• Intel/Cray Aurora (ALCF)
  – 50,000 Xeon Phi nodes (“Knight’s Hill”)
  – Approx. 150 PF
  – Production in 2019
• IBM/NVIDIA Summit (OLCF)
  – 3,400 Power9 nodes
  – Multiple NVIDIA Volta GPUs per node
  – Approx. 150 PF
  – Production in 2018
• Intel/Cray Cori (NERSC)
  – 9,300 Xeon Phi nodes (“Knight’s Corner”)
  – Approx. 30 PF
  – Production in 2017
ACME Project Goals
• a series of prediction and simulation experiments addressing scientific questions and mission needs;
• a well documented and tested, continuously advancing, evolving, and improving system of model codes that comprise the ACME Earth system model;
• the ability to effectively use leading (and “bleeding”) edge computational facilities soon after their deployment at DOE national laboratories; and
• an infrastructure to support code development, hypothesis testing, simulation execution, and analysis of results.
Climate Science Drivers for ACME
• Water cycle: How do the hydrological cycle and water resources interact with the climate system on local to global scales?
• Biogeochemistry: How do biogeochemical cycles interact with global climate change?
• Cryosphere: How do rapid changes in cryospheric systems interact with the climate system?
ACME: let specific science questions drive development
• Atmosphere: More accurate simulation of aerosols, clouds, wind, and precipitation
• Land: More accurate simulation of terrestrial feedbacks from more complex carbon, nutrient, and water cycles
• Ocean: Introduction of multi-resolution dynamics to more accurately simulate ocean heat uptake and water masses
• Sea ice: Recast numerics to focus resolution in polar regions, and add icebergs, sea ice strength, and snow physics
• Land ice: Addition of the first realistic, dynamic coupled ice-sheet model
[Diagram: Driver → Questions → Hypotheses → Experiments → Requirements → Development]
ACME Roadmap
ACME size/start:
• No new funding: 6-7 existing DOE lab climate projects combined into one program
• 8 U.S. national laboratories and 6 partner institutions
• 85 researchers working ¼ time or more
• Total effort ~43 FTE
• Started from a beta tag of CESM1.3
  – Using cpl7/MCT for the coupler.
• Started July 1, 2014; 3 years initially.
ACME organization
• ACME Council: Dave Bader, Chair
• Executive Committee: W. Collins, M. Taylor, R. Jacob, P. Jones, P. Rasch, P. Thornton, D. Williams
  – Ex officio: J. Edmonds, J. Hack, W. Large, E. Ng
  – Executive Committee Chair: D. Bader
• Chief Scientist: William Collins
• Chief Computational Scientist: Mark Taylor
• Project Engineer: Renata McCoy
• Coupled Simulation Group: Dave Bader, Bill Collins, Mark Taylor (plus Coupled Sim. task leaders)
• Workflow Group: Dean Williams, Katherine Evans (plus Workflow task leaders)
• Software Eng./Coupler Group: Robert Jacob, Andrew Salinger (plus SE/Coupler task leaders)
• Performance/Algorithms Group: Patrick Worley, Hans Johansen (plus Perf./Alg. task leaders)
• Land Group: Peter Thornton, William Riley (plus Land task leaders)
• Atmosphere Group: Philip Rasch, Shaocheng Xie (plus Atmosphere task leaders)
• Ocean/Ice Group: Philip Jones, Todd Ringler (plus Ocean/Ice task leaders)
ACME Workflow Group: Developing a comprehensive approach to enable large-scale climate science
• The end-to-end workflow integrates
  – a model run simulation manager (AKUNA)
  – a data publishing/sharing/archiving infrastructure (ESGF)
  – a secure data transport (Globus)
  – analysis/diagnostics/visualization tools (UV-CDAT)
  – a provenance capture framework (ProvEn)
  to improve reproducibility and tracking
ACME Performance Group: Simulation throughput with a target of 5 simulated years/day
• Performance monitoring and analysis
• Internode and I/O: load balancing, communication algorithm optimization, computation/communication overlap, exploiting additional concurrency
• On-node: accelerators, threading, memory management, programming models
• Next-generation architectures: NERSC NESAP, OLCF-4 CAAR, ALCF-3 ESP
• Work with DOE computer scientists in other projects involving performance and fast numerical libraries.
ACME Software Engineering/Coupler Group
• Best practices: common tools and methodologies adopted across ACME science/tech teams.
  – Developer test suites
  – Continuous integration with Jenkins
  – Repository setup and workflow
• I/O: parallel I/O at ACME model scales, increased use of in-situ diagnostics.
• Modularity and configurability: modular interfaces for all new models, runtime configurability.
• Coupling: coupler performance, coupler/main design, and MCT!
ACME git workflow
Model Coupling Toolkit
• A set of Fortran90 datatypes and functions for building parallel coupled models (a minimal usage sketch follows below).
  – With or without a coupler (which MCT doesn’t provide).
  – All models are assumed to be parallel with MPI.
  – 2-sided send/recv model for moving data, similar to MPI.
  – Support for online parallel interpolation using offline-calculated weights.
  – Model registry, decomposition descriptors (by numbering dofs), distributed data type, communication tables.
  – Functions for time accumulation, spatial averaging, and merging.
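To make the list above concrete, here is a minimal sketch of the send side of a coupling exchange, loosely following the MCT example programs; the component IDs, field names, and trivial one-segment-per-process decomposition are illustrative assumptions, not ACME's actual configuration.

! Minimal sketch of one model's coupling hooks, loosely following the MCT
! example programs. Component IDs, field names, and the trivial
! one-segment-per-process decomposition are illustrative assumptions.
subroutine atm_coupling_sketch(comm, ncomps, myid, ocnid, nlocal, offset)
  use m_MCTWorld,     only: MCTWorld_init => init
  use m_GlobalSegMap, only: GlobalSegMap, GlobalSegMap_init => init, &
                            GlobalSegMap_lsize => lsize
  use m_AttrVect,     only: AttrVect, AttrVect_init => init
  use m_Router,       only: Router, Router_init => init
  use m_Transfer,     only: MCT_Send => send
  implicit none
  include 'mpif.h'
  integer, intent(in) :: comm, ncomps, myid, ocnid, nlocal, offset

  type(GlobalSegMap) :: GSMap   ! decomposition descriptor (numbered dofs)
  type(AttrVect)     :: av      ! distributed data type holding coupling fields
  type(Router)       :: Rout    ! communication table to the other component

  ! Register this component in the model registry (once per component).
  call MCTWorld_init(ncomps, MPI_COMM_WORLD, comm, myid)

  ! Describe the local piece of the grid: one contiguous segment per process.
  call GlobalSegMap_init(GSMap, (/ offset + 1 /), (/ nlocal /), 0, comm, myid)

  ! Fields to exchange, sized to the local decomposition.
  call AttrVect_init(av, rList="temperature:precip", &
                     lsize=GlobalSegMap_lsize(GSMap, comm))

  ! Build the communication table to the ocean component, then send (2-sided).
  call Router_init(ocnid, GSMap, comm, Rout)
  call MCT_Send(av, Rout)
end subroutine atm_coupling_sketch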
Model Coupling Toolkit
• At the last meeting, Feb. 2013, this was still news
• Now, a little embarrassing
[Screenshot: MCT website as of 4/22/15]
A lot has been happening with MCT…
[Screenshot: GitX display of recent MCT repository history]
Model Coupling Toolkit, v2.9
• Moved repository from the Argonne git server to github.com
  – https://github.com/MCSclimate/MCT
• New features to aid in studying Router initialization
  – GSMap and MCTWorld print(): print contents to an ASCII file for later reading
  – Router init internal timers, invoked with an optional string argument to Router init
  – RouterTest.F90: a test program which reads in output GSMaps and MCTWorld info and builds a Router
    • Will build on the same number of procs and same decomposition as the original model
    • Great for creating coupling benchmarks!
Model Coupling Toolkit, v2.9
• Support for NAG 6.0
• Support for Mac builds
• Bug fixes, including ones found by valgrind (many thanks to NCAR’s Sean Santos for the above)
• mpi-serial 2.0: a small single-node MPI library
  – For programs that assume MPI and users who don’t want to install a full MPI library on their laptop/desktop. Doesn’t require mpirun. (A minimal usage sketch follows after this list.)
  – Not a stub library: MPI_Send/Recv really copies data.
  – In 2.0:
    • Many more MPI datatypes/functions added
    • Self-contained build system (autoconf)
  – Developed by ALCF’s Raymond Loy
• MCT 2.9 release is imminent.
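As a rough illustration, the toy program below (hypothetical, not part of the MCT distribution) uses only the standard MPI Fortran API; built against mpi-serial it runs as a single process with no mpirun, and the same source builds unchanged against a full MPI library.

! Hypothetical example: a program written against the standard MPI Fortran
! API. Linked against mpi-serial it runs as a single "rank 0" process with
! no mpirun; linked against a full MPI library it runs in parallel unchanged.
program hello_mpi
  implicit none
  include 'mpif.h'
  integer :: ierr, rank, nprocs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  print *, 'Hello from rank', rank, 'of', nprocs
  call MPI_Finalize(ierr)
end program hello_mpi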
Model Coupling Toolkit: Development process
• Use gitworkflows on github.
  – Anyone with a free github account can create a fork (git clone) of the MCT repo.
  – Make a branch and develop your cool new feature.
  – Submit a “Pull Request” to have your feature included in master.
• A few developers make branches directly in the main MCT repo.
  – The change gets reviewed and tested (by me) before inclusion.
  – The branch gets merged to master.
• Discussion of bugs and proposed new features happens in github issues.
• Documentation is on the github wiki.
Model Coupling Toolkit: Future plans
• Near term
  – Router initialization benchmarks based on ACME science cases (0.25-degree atmosphere, 1/10th-degree ocean, 10K cores)
  – Improve scaling/timing of Router init and Rearranger/MCT_Send/Recv communication for ACME cases on LCFs, releasing any MCT improvements.
• Long term
  – MCT-MOAB
    • Talked about before. Mesh Oriented Database.
    • Tried using MOAB’s Fortran interface to build MCT datatypes/functions. Not very satisfying.
    • New plan (notion): MCT-MOAB in C with MOAB’s C/C++ interface, and Fortran interfaces defined using the F2003 standard (see the sketch after this list).
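For a sense of what that last item could look like, here is a minimal sketch of a Fortran 2003 bind(C) interface to a hypothetical C-side routine; the module name, routine name, and arguments are invented for illustration and are not part of MCT or MOAB.

! Hypothetical illustration only: not an actual MCT or MOAB interface.
! Shows the Fortran 2003 iso_c_binding style that would let Fortran models
! call C implementations of MCT-MOAB routines.
module mct_moab_iface_sketch
  use iso_c_binding, only: c_int, c_double
  implicit none
  interface
     ! Assumed C prototype (invented for this sketch):
     !   int mctmoab_register_mesh(int comp_id, int n_cells, const double *coords);
     function mctmoab_register_mesh(comp_id, n_cells, coords) &
          bind(C, name="mctmoab_register_mesh") result(ierr)
       import :: c_int, c_double
       integer(c_int), value      :: comp_id, n_cells
       real(c_double), intent(in) :: coords(*)
       integer(c_int)             :: ierr
     end function mctmoab_register_mesh
  end interface
end module mct_moab_iface_sketch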
Exascale is coming
[Figure: Top500 list and projected performance (top500.org)]
#1: Tianhe-2 (33 PF); #2: Titan (DOE OLCF, 17 PF); #5: Mira (DOE ALCF, 8.5 PF)
Exascale impact on software (including coupling software)
• Massive in-node parallelism (exponential growth)
  – Programmer cannot hand-pick work granularity
  – Deeper memory hierarchy
  – “Communication is expensive, FLOPS are free”
• Power as a managed system resource
  – Turning components on/off
  – Selecting algorithms for speed within the power envelope
  – Adjusting arithmetic precision
  – Potentially adjusting fault protection
• Dynamic parallelism and work decomposition
• Fault tolerance actively managed in software at many levels
• Architecture organization:
  – Heterogeneous cores
  – Specialized functional units
  – In-situ NVRAM
Coupling at Exascale: possible problems
• The coupler (for Earth sub-system interfaces) is almost entirely 2D
  – Limited amount of parallelism
  – Also not a huge number of flops compared to the full model
  – Not a huge memory demand, except for datatypes that grow with the number of cores
• But the coupler does lots of memory movement
  – Moving data between a model’s native data type and the coupler data type
  – Moving data from one model’s processors to another’s
Coupling at Exascale: to do
• More parallelism through more components executing concurrently
  – Ensembles
  – Different models
• Reduce memory movement
  – One data type across all model components?
  – Co-located decompositions
Co-located decomposition
[Diagram: two models with unrelated decompositions vs. two models with related (co-located) decompositions, with grid cells assigned to processors 1, 2, and 3]
More information
http://climatemodeling.science.energy.gov/projects/accelerated-climate-modeling-energy
http://www.mcs.anl.gov/mct/