learning to forget - adac · • no loops / execution schedule • no explicit threading /...

30
Federal Department of Home Affairs FDHA Federal Office of Meteorology and Climatology MeteoSwiss Learning to forget Lessons from adopting and maintaining a weather and climate model for heterogeneous HPC systems Oliver Fuhrer, MeteoSwiss Contributions from X. Lapillonne 1 , C. Osuna 1 , M. Bianco 2 ,L. Benedicic 2 , T. Schulthess 2,3 1 MeteoSwiss, 2 CSCS, 2 ITP ETH Zurich, ADAC Workshop, 20.6.2018

Upload: others

Post on 07-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

Federal Department of Home Affairs FDHAFederal Office of Meteorology and Climatology MeteoSwiss

Learning to forgetLessons from adopting and maintaining a weather and climate model for heterogeneous HPC systemsOliver Fuhrer, MeteoSwissContributions from X. Lapillonne1, C. Osuna1, M. Bianco2,L. Benedicic2, T. Schulthess2,3

1MeteoSwiss, 2CSCS, 2ITP ETH Zurich,

ADAC Workshop, 20.6.2018

Page 2: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

2March 13, 2018

Can you spot the weather model?

Page 3: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

3March 13, 2018

Reality

Page 4: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

4March 13, 2018

Where did we start in 2010?

Page 5: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

5March 13, 2018

Operations in 2010

EZMWF-Modell 16 km grid spacing2 x per day10 day forecast

COSMO-7 6.6 km grid spacing3 x per day3 day forecast

COSMO-2 2.2 km grid spacing8 x per day33 h forecast

Page 6: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

6March 13, 2018

Strategy for next-generation(computational effort relative to operational system)

ECMWF-Model 9 to 18 km gridspacing2 to 4 x per day

COSMO-1 1.1 km gridspacing8 x per day1 to 2 d forecast

COSMO-E 2.2 km gridspacing2 x per day5 d forecast21 members

Ensemble data assimilation: LETKF

13 x 20 x

7 x

= 40 x

Page 7: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

7March 13, 2018

Business as usual

Cray XE6 (Albis/Lema)Current operationalsystem at CSCS

Next SystemAccounting for Moore’s law

Not feasible!

(power, floor space, cost)

Page 8: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

8March 13, 2018

COSMO Model

350 kLOC of F90 + MPI + NEC directives(“optimized code”)

Page 9: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

9March 13, 2018

• Increase level of abstraction- Hide implementation details- Can be disruptive

• Decrease level of abstraction- Add implementation details- Often incremental

Up or down?

Page 10: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

10ESM AO Workshop, Mai 2018, Pasadena, Oliver Fuhrer

Refactoring effort

Domain-specific language

(performance portable,re-usable)

High-level implementation

Separation of concerns

Fuhrer et al. 2014 (doi: 10.14529/jsfi140103)Gysi et al. 2015 (doi: 10.1145/2807591.2807627)

C++ / DSL rewrite Fortran +directives

Page 11: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

11March 13, 2018

Piz Dora(old code)

~26 CPUs

10 kWh

1.4

Piz Kesch(new code)

~7 GPUs

2.1 kWh

0.38

Time-to-solution

Energy-to-solution

Size of system (cabinets)

Factor

3.7 x

4.8 x

3.8 x

Results (COSMO-E benchmark)

Co-design (simultaneous software, hardware and workflow re-design)

allowed MeteoSwiss to increase computational load by 40x within 4–5

years

4x through investement into software.

Page 12: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

14March 13, 2018

• Used in production @MeteoSwiss• Near-global simulations on full Piz Daint

(4’888 GPU nodes, Fuhrer et al. 2018 GMD)

Applications

Visualization by Tarun Chadha (C2SM): clouds > 10-3 g/kg (white) and precipitation > 4 10-2 g/kg (blue)

Page 13: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

16March 13, 2018

Fortran + MPI + OpenACC + ...is not the solution!

Page 14: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

17March 13, 2018

durian:code fuhrer$ cloc E3SM/---------------------------------------------------------------------------------------Language files blank comment code---------------------------------------------------------------------------------------Fortran 90 2812 244713 354664 1125623C 1149 154791 250480 648455Fortran 77 498 68946 98188 287216HTML 441 4457 2927 162587XML 914 15353 6546 142869Bourne Shell 404 22088 25988 130941C/C++ Header 399 11006 21174 60059m4 50 5161 1170 49869Python 329 12158 15054 45945Perl 172 14406 21944 41502TeX 172 4979 3830 29693CMake 422 4679 6771 26563...---------------------------------------------------------------------------------------SUM: 8408 586306 832057 2835375---------------------------------------------------------------------------------------

Software productivity gap!

Page 15: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

19March 13, 2018

Software productivity?

Page 16: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

20March 13, 2018

Heavily optimized code is typically faster than Fortran + OpenACC, but also unreadable and unmaintainable!

Efficiency myth

Page 17: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

21March 13, 2018

• Radiation scheme on CPU (Intel E5-2690v3 “Haswell”) and GPU (NVIDIA Tesla K80) using Fortran + OpenMP + OpenACC

Performance portability myth

Lappillonne and Fuhrer, PPL, 2018Clement et al. 2018, PASC’18

Page 18: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

22March 13, 2018

DSL in C++ may alsonot be the solution!

Page 19: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

23March 13, 2018

STELLA

enum { data, lap };

template<typename TEnv>struct Laplacian {STENCIL_STAGE(TEnv)STAGE_PARAMETER(FullDomain, data)STAGE_PARAMETER(FullDomain, lap)

static void Do(Context ctx, FullDomain){ctx[lap::Center()] =-4.0 * ctx[data::Center()]+ ctx[data::At(iplus1)]+ ctx[data::At(iminus1)]+ ctx[data::At(jplus1)]+ ctx[data::At(jminus1)];

}};

Stencil stencil;StencilCompiler::Build(stencil,"Example",calculationDomainSize,StencilConfiguration<Real, BlockSize<32,4> >(),pack_parameters(Param<lap, cInOut>(lapfield),Param<data, cIn>. (datafield)

),define_loops(define_sweep<cKIncrement>(define_stages(StencilStage<Laplacian, IJRange<cComplete,0,0,0,0>,KRangeFullDomain >()

))

));

IJKRealField lapfield, datafield;

for(int step = 0; step < numOfSteps; ++step){stencil.Apply();

}

enum { data, lap };

template<typename TEnv>struct Laplacian {STENCIL_STAGE(TEnv)STAGE_PARAMETER(FullDomain, data)STAGE_PARAMETER(FullDomain, lap)

static void Do(Context ctx, FullDomain){ctx[lap::Center()] =-4.0 * ctx[data::Center()]+ ctx[data::At(iplus1)]+ ctx[data::At(iminus1)]+ ctx[data::At(jplus1)]+ ctx[data::At(jminus1)];

}};

Stencil stencil;StencilCompiler::Build(stencil,"Example",calculationDomainSize,StencilConfiguration<Real, BlockSize<32,4> >(),pack_parameters(Param<lap, cInOut>(lapfield),Param<data, cIn>. (datafield)

),define_loops(define_sweep<cKIncrement>(define_stages(StencilStage<Laplacian, IJRange<cComplete,0,0,0,0>,KRangeFullDomain >()

))

));

IJKRealField lapfield, datafield;

for(int step = 0; step < numOfSteps; ++step){stencil.Apply();

}

Loop bodyStencil

Data fields

Run

Page 20: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

24March 13, 2018

+• C++ well supported language

• Abstract (some) hardware dependent details

• High efficiency

• DSL allows for (partial) separation of concerns between domain-scientist and computer scientist

-• C++ not well accepted by weather and climate community

(boiler plate)

• No introspection / global optimization

• Backends complicated to implement (template meta-programming)

STELLA DSL in C++ critique

Page 21: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

25March 13, 2018

No turn key solution!Is it time for a SDK forweather and climate?

Page 22: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

26March 13, 2018

r2Ti,k =Ti+1,k + Ti�1,k + Ti,k+1 + Ti,k�1 � 4Ti,k

�2

Developmet Workflow

Mathematical Model

Discretization, Solver

Implementation

T = T (x, z, t)

T (x, z, t = 0) = T0(x, z)

@T

@t= ↵r2T

“sci

entis

t”

softw

are

engi

neer

Page 23: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

27March 13, 2018

High-level language for weather and climate• No explicit data structure• No loops / execution schedule• No explicit threading / vectorization• No directives• No HW-dependent details• …

Separation of concerns

We need to learn to forget

Mathematical Model

Discretization, Solver

High-level implementation

Domain-specific Compiler

clim

ate

scie

ntis

tco

mpu

ter

scie

ntis

t

com

puta

tiona

l sc

ient

ist

Page 24: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

28March 13, 2018

• “SDK for weather and climatescience”

• Joint development betweenCSCS / MeteoSwiss

• Domain-specific for Earthsystem model components

• Regional (production) andglobal grids (prototype)

• Multiple APIs (C++, Python,gtclang)

GridTools Framework

Page 25: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

29March 13, 2018

Example (GT4Py)

• Mathematical operators (stencils)

• Data fields

• Region over which operator is applied

• Boundary conditions (not shown in exmpale)

• High-level, declarative syntax

• Numpy as well as high-performance backends (x86 multi-core, NVIDIA GPU, Xeon Phi)

Page 26: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

30March 13, 2018

Example (gtclang)

function avg {offset offstorage in

avg = 0.5 * ( in(off) + in() )}

function coriolis_force {storage fc, in

coriolis_force = fc() * in()}

operator coriolis {storage u_tend, u, v_tend, v, fc

vertical_region ( k_start , k_end ) {u_tend += avg(j-1, coriolis_force(fc, avg(...v_tend -= avg(i-1, coriolis_force(fc, avg(...

}}

}

• Higher productivity and code safety

• Based on LLVM/Clang compiler framework

• Allows for global optimization ”across kernels”

• Generates efficient code for x86 multi-core,NVIDIA

GPUs, Intel Xeon Phi, and ARM (prototype)

• 4x – 6x reduction in LOC

Page 27: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

31March 13, 2018

Compiler Toolchain

Page 28: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

32March 13, 2018

• Weather and climate community is struggling to keep up- very few models able to target multiple hardware architectures- software productivity gap- no turn-key solutions

• Large software effort for COSMO to enable efficient execution on multiple hardware architectures

• Learnings- It’s not about porting, it’s about maintaining- Fortran + directives exacerbate problem (ad-interim solution?)- DSL embedded C++ not readily adopted by domain-scientists- We need a high-level software development environment for

weather and climate

Summary

Page 29: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

33March 13, 2018

Thank you! Questions?

Page 30: Learning to forget - ADAC · • No loops / execution schedule • No explicit threading / vectorization • No directives • No HW-dependent details • … Separation of concerns

34March 13, 2018

MeteoSvizzeraVia ai Monti 146CH-6605 Locarno-MontiT +41 58 460 92 22www.meteosvizzera.ch

MétéoSuisse7bis, av. de la PaixCH-1211 Genève 2T +41 58 460 98 88www.meteosuisse.ch

MétéoSuisseChemin de l‘AérologieCH-1530 PayerneT +41 58 460 94 44www.meteosuisse.ch

MeteoSwissOperation Center 1 CH-8058 Zurich-Airport T +41 58 460 91 11

www.meteoswiss.ch

Federal Department of Home Affairs FDHAFederal Office of Meteorology and Climatology MeteoSwiss