![Page 1: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/1.jpg)
PP POMPA (WG6) Overview Talk
COSMO GM11, Rome
st Birthday
![Page 2: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/2.jpg)
Who is POMPA?
• ARPA-EMR Davide Cesari
• C2SM/ETH Xavier Lapillonne, Anne Roches, Carlos Osuna
• CASPUR Stefano Zampini, Piero Lanucara, Cristiano Padrin
• Cray Pozanovich Jeffrey, Roberto Ansaloni
• CSCS Matthew Cordery, Mauro Biancho, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto
• DWD Ulrich Schättler, Kristina Fröhlich
• KIT Andrew Ferrone, Hartwig Anzt
• MeteoSwiss Petra Baumann, Oliver Fuhrer, André Walser
• NVIDIA Tim Schröder, Thomas Bradley
• Roshydromet Dmitry Mikushin
• SCS Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger
• USAM David Palella, Alessandro Cheloni, Pier Francesco Coppola
• USI Daniel Ruprecht
![Page 3: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/3.jpg)
Kickoff Workshop
• May 3-4 2011, hosted by CSCS in Manno
• 15 talks, 18 participants
• Goal: get to know each other, report on work already done, plan and coordinate future activities
• Revised project plan
![Page 4: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/4.jpg)
Task Overview
• Task 1: Performance analysis and documentation
• Task 2: Redesign memory layout and data structures
  • Closely linked to work in Task 5 and 6
• Task 3: Improve current parallelization
• Task 4: Parallel I/O
  • Focus on NetCDF (which is still from 1 core)
  • Technical problems
  • New person (Carlos Osuna, C2SM) starting work on 15.09.2011
• Task 5: Redesign implementation of dynamical core
• Task 6: Explore GPU acceleration
• Task 7: Implementation documentation
  • No progress
![Page 5: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/5.jpg)
Performance Analysis
Goal
- Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, …)
- Guide and prioritize the work in the other tasks
- Ensure that information and performance-portability developments are exchanged
![Page 6: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/6.jpg)
Performance Analysis (Task 1)
Work
• COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches)
• Workflow of RK timestep (Ulrich Schättler)
  http://www.c2sm.ethz.ch/research/COSMO-CCLM/hp2c_one_year_meeting/2a_schaettler
• Performance analysis
  • COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches)
  • COSMO-ART (Oliver Fuhrer)
• Wiki page
![Page 7: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/7.jpg)
Jean-Guillaume Piccinali and Anne Roches
![Page 8: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/8.jpg)
Jean-Guillaume Piccinali and Anne Roches
![Page 9: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/9.jpg)
Jean-Guillaume Piccinali and Anne Roches
![Page 10: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/10.jpg)
Problem: Overfetching
• Computational intensity is the ratio of floating point operations (ops) to memory references (refs)
• When accessing a single array value, a complete cache line (64 Bytes = 8 double precision values) is loaded into L1 cache
• Example:
    do i = 1+nbounlines, ie-nbounlines
      A(i) = 0.0d0
    end do
  …also loads A(1), A(2), A(3)
• If the subdomain on a processor is very small, many values loaded from memory are never used for computation
A(1) A(2) A(3) A(4) … A(ie-3) A(ie-2) A(ie-1) A(ie)
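The effect can be estimated with a few lines of arithmetic. The following is a minimal sketch, not part of the original talk: the function name `cache_line_utilization` is hypothetical, and it assumes a 1-D array `A(1:ie)` that starts on a cache-line boundary, with 8 double precision values per 64-byte line as stated above.

```python
def cache_line_utilization(ie, nbounlines, values_per_line=8):
    """Fraction of loaded values that the loop 'do i = 1+nbounlines, ie-nbounlines'
    actually touches, assuming A(1) sits at the start of a cache line."""
    first = 1 + nbounlines           # first index written (1-based, as in Fortran)
    last = ie - nbounlines           # last index written
    if last < first:
        return 0.0                   # loop body never executes
    used = last - first + 1
    # Cache lines are counted from 0; index i lives in line (i - 1) // values_per_line.
    first_line = (first - 1) // values_per_line
    last_line = (last - 1) // values_per_line
    loaded = (last_line - first_line + 1) * values_per_line
    return used / loaded


if __name__ == "__main__":
    for ie in (16, 64, 1000):
        print(ie, cache_line_utilization(ie, nbounlines=3))
```

Under these assumptions a subdomain with `ie = 16` and `nbounlines = 3` uses 10 of the 16 loaded values, while `ie = 1000` uses 994 of 1000, which illustrates why overfetching matters mostly for very small subdomains.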
![Page 11: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/11.jpg)
Performance Analysis: Wiki
https://wiki.c2sm.ethz.ch/Wiki/ProjPOMPATask1
![Page 12: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/12.jpg)
Improve Current Parallelization (Task 3)
• Loop level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini)
  • No clear benefit of this approach vs. flat MPI parallelization
  • Approach suitable for memory bandwidth bound code?
  • Restructuring of code (into blocks) may help!
• Overlap communication with computation using non-blocking MPI calls (Stefano Zampini)
• Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
![Page 13: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/13.jpg)
Halo exchange in COSMO
Three types of point-to-point communication: two partially non-blocking and one fully blocking (with MPI_SENDRECV)
Halo swapping needs the East-West exchange to complete before the South-North exchange starts (implicit corner exchange)
New version which communicates corners explicitly (2x more messages)
Stefano Zampini
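Why the ordering gives corners for free can be shown on a single periodic domain. This is a minimal sketch under stated assumptions, not COSMO code: `make_field` and `halo_exchange_periodic` are hypothetical helpers, the field is a plain list of lists, and the "neighbour" is the domain itself (periodic wrap) so that no MPI is needed.

```python
def make_field(ny, nx, halo=1):
    """Field with ny x nx interior points (value 10*j + i) and empty (None) halos."""
    field = [[None] * (nx + 2 * halo) for _ in range(ny + 2 * halo)]
    for j in range(ny):
        for i in range(nx):
            field[halo + j][halo + i] = 10 * j + i
    return field


def halo_exchange_periodic(field, halo=1):
    """Two-phase halo update: east-west first, then south-north on full rows."""
    ny = len(field) - 2 * halo
    nx = len(field[0]) - 2 * halo
    # Phase 1: east-west exchange, interior rows only.
    for j in range(halo, halo + ny):
        row = field[j]
        for h in range(halo):
            row[h] = row[nx + h]                 # west halo <- easternmost interior
            row[halo + nx + h] = row[halo + h]   # east halo <- westernmost interior
    # Phase 2: south-north exchange of complete rows, including the halo columns
    # filled in phase 1. This is what carries the corner values along.
    for h in range(halo):
        field[h] = list(field[ny + h])
        field[halo + ny + h] = list(field[halo + h])
    return field


if __name__ == "__main__":
    for row in halo_exchange_periodic(make_field(3, 4)):
        print(row)
```

If phase 2 started before phase 1 had finished, the rows copied in phase 2 would still contain empty halo columns and the corners would stay unset. Sending the corners as separate messages removes that dependency, at the price of more messages.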
![Page 14: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/14.jpg)
New halo-exchange routine
Stefano Zampini
OLD: CALL exch_boundaries(A) [communication time]
NEW: CALL exch_boundaries(A,2) CALL exch_boundaries(A,2) CALL exch_boundaries(A,3) [communication time]
![Page 15: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/15.jpg)
Early results: COSMO2
[Figure: two charts comparing COSMO and NEW for the decompositions 10x12+4, 20x24+4 and 28x35+40. Panels: total time (s) for model runs; mean total time for RK dynamics.]
Is Testany / Waitany the most efficient way to assure completion?
Restructuring of code to find more work (B) could help!
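The benefit of finding more work (B) can be illustrated with a simple cost model. This is a sketch under assumptions that are not from the talk: the function names are hypothetical, communication is assumed to progress fully in the background, and the numbers in the usage lines are made up for illustration only.

```python
def exposed_comm_time(t_comm, t_overlap_work):
    """Communication time that is NOT hidden behind independent work B."""
    return max(0.0, t_comm - t_overlap_work)


def step_time(t_comm, t_work_total, t_overlap_work, overlap=True):
    """Time for one step; t_overlap_work is the part of t_work_total (work B)
    that does not depend on the halo and can run while messages are in flight."""
    if not overlap:
        return t_work_total + t_comm            # blocking exchange: costs add up
    hidden_work = min(t_overlap_work, t_work_total)
    return t_work_total + exposed_comm_time(t_comm, hidden_work)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(step_time(4.0, 10.0, 1.0, overlap=False))  # blocking exchange
    print(step_time(4.0, 10.0, 1.0))                 # little work B available
    print(step_time(4.0, 10.0, 5.0))                 # enough work B to hide all communication
```

In this model the gain from non-blocking calls is bounded by the amount of halo-independent work available between starting and completing the exchange, which is why restructuring the code matters.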
![Page 16: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/16.jpg)
Explore GPU Acceleration (Task 6)
Goal
• Investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO
Background
• Early investigations by Michalakes et al. using WRF physical parametrizations
• Full port of JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA
• New model developments (e.g. NIM at NOAA) which have GPUs as a target architecture in mind from the very start
![Page 17: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/17.jpg)
GPU Motivation
• compute bound: × 8
• memory bound: × 5
• “power bound”: × 1.7
![Page 18: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/18.jpg)
Programming GPUs
• Programming languages (OpenCL, CUDA C, CUDA Fortran, …)
  • Two codes to maintain
  • Highest control, but requires a complete rewrite
  • Highest performance (if done by an expert)
• Directive based approach (PGI, OpenMP-acc, HMPP, …)
  • Smaller modifications to original code
  • The resulting code is still understandable by Fortran programmers and can be easily modified
  • Possible performance sacrifice (w.r.t. rewrite)
  • No standard for the moment
• Source-to-source translation (F2C-acc, Kernelgen, …)
  • One source code
  • Can achieve very good performance
  • Legacy codes often don’t map very well onto GPUs
  • Hard to debug
![Page 19: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/19.jpg)
Challenges
• How to change a wheel on a moving car?
  • GPU hardware and programming models are rapidly changing
  • Several approaches are vendor bound and/or not part of a standard
  • COSMO is also rapidly evolving
• How to have a single readable code which also compiles onto GPUs?
  • Efficiency may require restructuring or even a change of algorithm
  • Directives jungle
• Efficient GPU implementation requires…
  • to execute all of COSMO on the GPU
  • enough fine grain parallelism (i.e. threads)
![Page 20: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/20.jpg)
Explore GPU Acceleration (Task 6)
Work
• Source-to-source translation of the whole model (Dmitry Mikushin)
• Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin): see next talk
• Rewrite of dynamical core for GPUs (Oliver Fuhrer): see talk after next talk
![Page 21: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/21.jpg)
HP2C OPCODE Project
• Additional proposal to the Swiss HP2C initiative to build an “OPerational COSMO DEmonstrator (OPCODE)”
• Project proposal accepted
• Start of project 1 June 2011 until end of 2012
• Project lead: André Walser
• Project resources:
  • second contract with IT company SCS to continue collaboration until end of 2012
  • 2 new positions at MeteoSwiss for about 1 year
  • contribution to position at C2SM
  • contribution from CSCS
![Page 22: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/22.jpg)
HP2C OPCODE Project
Main Goals
• Leverage the research results of the ongoing HP2C COSMO project
• Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology
• Similar time-to-solution on hardware with substantially lower power consumption and price
Cray XT4 (3 cabinets)
GPU based hardware (a few rack units)
![Page 23: PP POMPA (WG6) Overview Talk](https://reader036.vdocument.in/reader036/viewer/2022062410/56815583550346895dc351d7/html5/thumbnails/23.jpg)
Thank you!