CLAW FORTRAN Compiler: source-to-source translation for performance portability

Valentin Clement [email protected]
XcalableMP Workshop, Akihabara, Tokyo, Japan, October 31, 2017
Image: NASA


Page 1: CLAW FORTRAN Compiler - XcalableMP · 2017. 10. 31. · Valentin Clement 5th XcalableMP Workshop - 31st of October 2017 3 GPU machine in Switzerland - Piz Kesch (MeteoSwiss) Production


Page 2: Summary

Valentin Clement, 5th XcalableMP Workshop - 31st of October 2017

• Performance portability problem in COSMO
• Single source code
• Portability
• Performance
• Portability of code
• CLAW FORTRAN Compiler
• Single column abstraction
• Future project

Page 3: GPU machine in Switzerland - Piz Kesch (MeteoSwiss)

Production + R&D
Each rack is composed of 12 compute nodes:
• 2x Intel Haswell E5-2690v3 2.6 GHz 12-core CPUs (total of 24 CPUs)
• 256 GB 2133 MHz DDR4 (total of 3 TB)
• 8x dual NVIDIA Tesla K80 GPU (total of 96 cards - 192 GPUs)

Page 4: GPU machine in Switzerland - Piz Daint (CSCS)

Cray XC40/XC50
• Intel® Xeon® E5-2690 v3 2.60 GHz, 12 cores, 64 GB RAM
• NVIDIA Tesla P100

Page 5: Performance portability problem - COSMO Radiation

[Bar chart: execution time [s], axis 0 to 1.2, for execution on CPU (Haswell E5-2690v3, 12 cores) and on GPU (1/2 NVIDIA Tesla K80). Series: code restructured for CPU; code restructured for GPU with OpenACC. Annotations: -30%, -35%.]

Page 6: Performance portability problem - COSMO Radiation

CPU structure:

    DO k = 1, nz
      CALL fct()
      DO j = 1, nproma
        ! 1st loop body
      END DO
      DO j = 1, nproma
        ! 2nd loop body
      END DO
      DO j = 1, nproma
        ! 3rd loop body
      END DO
    END DO

GPU structure:

    !$acc parallel loop
    DO j = 1, nproma
      !$acc loop
      DO k = 1, nz
        CALL fct()
        ! 1st loop body
        ! 2nd loop body
        ! 3rd loop body
      END DO
    END DO
    !$acc end parallel loop

Page 7: How to keep a single source code for everyone

• Massive code bases (200’000 to >1 mio LOC)
• Several architecture-specific optimizations survive
• Most of these code bases are CPU optimized
• Not suited for next-generation architectures
• Not suited for massive parallelism
• Little or no modularity

Page 8: What kind of code base are we dealing with?

• Global/local area weather forecast models
• >10 around the world
• Monster FORTRAN 77-2008 “monolithic” code
• Without much modularity

• So far we have investigated:
• COSMO (local area model consortium) - several institutions
• ICON - DWD (German Weather Agency) - will replace COSMO
• IFS current cycle + FVM - ECMWF - member state usage

Page 9: What kind of code base are we dealing with?

Examples of three code bases we have investigated so far:
• COSMO
• ICON
• IFS

Page 10: COSMO Model - LOC

Climate and local area model used by Germany, Switzerland, Russia…

------------------------------------------------------------
Language                 files     blank   comment      code
------------------------------------------------------------
Fortran 90                 173     53998    109381    211711
C/C++ Header               148      5595     11827     29888
C++                        121      5050      6189     26580
Python                      37      1454      1444      5764
Bourne Again Shell          17       246       381      3206
Bourne Shell                33       544       594      2349
XML                         11       272       193      2143
CMake                        9       103        98       793
make                         1        36        27       230
CUDA                        58         4         0        58
------------------------------------------------------------
SUM:                       620     68232    130684    286710
------------------------------------------------------------

Page 11: ECMWF IFS - LOC

European Centre for Medium-Range Weather Forecasts - global model

------------------------------------------------------------
Language                 files     blank   comment      code
------------------------------------------------------------
Fortran 90                4003    201338    300427    737289
C/C++ Header               480       846         0     13041
Perl                         1        89        13       240
------------------------------------------------------------
SUM:                      4484    202273    300440    750570
------------------------------------------------------------

* only source without external modules

Page 12: DWD ICON - LOC

New German global model with the option to use it as a local area model

------------------------------------------------------------
Language                 files     blank   comment      code
------------------------------------------------------------
Fortran 90                 822     99802    144962    447356
C                          219     43854     30991    150781
HTML                       307       449     15415     94940
Fortran 77                 463       294    113285     64061
Java                        95      2685      4335     11605
C/C++ Header               106      2194      8359      8332
Python                      43      2163      2425      7656
------------------------------------------------------------
SUM:                      2599    174509    346197    931446
------------------------------------------------------------

Page 13: Portability of code

• Current code base
• Optimized for one architecture
• Often with old optimizations still in place

• Not really modular
• Global fields injected everywhere

• Future?
• Modular standalone packages that can be plugged
• Into different models
• For different architectures

• Abstract the data layout where it can be done
• Abstract model-specific data; everything is passed as arguments.

Page 14: Performance portability - current code

    DO ilev = 1, nlay
      DO icol = 1, ncol
        tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) …
        trans(icol,ilev) = exp(-tau_loc(icol,ilev))
      END DO
    END DO
    DO ilev = nlay, 1, -1
      DO icol = 1, ncol
        radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) …
      END DO
    END DO
    DO ilev = 2, nlay + 1
      DO icol = 1, ncol
        radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt)
      END DO
    END DO

nlay size ~= 100
ncol size >= 10’000

ilev -> dependency
icol -> no dependency

Loop structure better for CPUs

Page 15: Performance portability - current code

    !$omp parallel default(shared)
    DO igpt = 1, ngptot, nproma
      CALL physical_parameterization(…)
      ! Code from previous slide
    END DO
    !$omp end parallel

Sometimes the nproma block optimization is used for better vectorization
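The nproma blocking mentioned on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the COSMO or IFS code: the module name nproma_demo, the routine scale_blocked and its trivial loop body are hypothetical stand-ins for a physical parameterization. The outer loop walks over blocks of at most nproma grid points and is shared among OpenMP threads with a combined parallel do, while the inner loop over one block is short and contiguous, which is what helps vectorization.

```fortran
module nproma_demo
  implicit none
  private
  public :: scale_blocked
contains

  ! Hypothetical stand-in for a physical parameterization:
  ! doubles every grid point, processing the points block by block.
  subroutine scale_blocked(field, ngptot, nproma)
    integer, intent(in)    :: ngptot          ! total number of grid points
    integer, intent(in)    :: nproma          ! block length (tuning parameter)
    real,    intent(inout) :: field(ngptot)
    integer :: igpt, iend, i

    ! Outer loop over blocks: iterations are distributed among threads.
    !$omp parallel do default(shared) private(igpt, iend, i)
    do igpt = 1, ngptot, nproma
      iend = min(igpt + nproma - 1, ngptot)   ! the last block may be shorter
      ! Inner loop over one block: contiguous and easy to vectorize.
      do i = igpt, iend
        field(i) = 2.0 * field(i)
      end do
    end do
    !$omp end parallel do
  end subroutine scale_blocked

end module nproma_demo
```

Without OpenMP enabled the directives are plain comments and the routine runs serially. Choosing nproma is a tuning decision: small blocks tend to suit cache-based CPUs, while a GPU usually prefers one very large block.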

Page 16: Performance portability - GPU structured code

    DO icol = 1, ncol
      DO ilev = 1, nlay
        tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) …
        trans(icol,ilev) = exp(-tau_loc(icol,ilev))
      END DO
      DO ilev = nlay, 1, -1
        radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) …
      END DO
      DO ilev = 2, nlay + 1
        radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt)
      END DO
    END DO

nlay size ~= 100
ncol size >= 10’000

ilev -> dependency
icol -> no dependency

Loop structure better for GPUs

Page 17: Performance portability - next architecture

- What is the best loop structure/data layout for the next architecture?
- Do we want to rewrite the code each time?
- Do we know exactly which architecture we will run on?

Page 18: What is the CLAW FORTRAN Compiler?

• Source-to-source translation for FORTRAN code
• Based on the XcodeML/F IR
• Using the OMNI Compiler front-end and back-end
• Contributions to it via GitHub

• Transformation of the AST
• Different transformations applied based on the target
• Promotion of scalars and arrays
• Insertion of iterations
• Insertion of OpenACC and OpenMP directives

Page 19: CLAW FORTRAN Compiler under the hood

• Based on the OMNI Compiler FORTRAN front-end & back-end
• Source-to-source translator
• Open source under the BSD license
• Available on GitHub with the specifications

[Diagram: CLAWFC takes a .f90 file as input and produces a .f90 file as output; the components shown are FPP, F_Front, CX2X and F_Back]

Page 20: Single column abstraction

Separation of concerns
• Domain scientists focus on their problem (1 column, 1 box)
• The CLAW compiler produces code for each target and directive language

Achieve modularity
• Standalone physical parameterizations
• Modular with respect to model specifics

Page 21: RRTMGP example - a nice modular code, CPU structured

F2003 radiation code
• From Robert Pincus et al. (AER, University of Colorado)
• The compute-intensive parts are well located in “kernel” modules.
• The code is nonetheless CPU structured, with the horizontal loop as the innermost loop in every iteration.

Page 22: RRTMGP example - original code - CPU structured

    SUBROUTINE lw_solver(ngpt, nlay, tau, …)
      ! DECLARATION PART OMITTED
      DO igpt = 1, ngpt
        DO ilev = 1, nlay
          DO icol = 1, ncol
            tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) …
            trans(icol,ilev) = exp(-tau_loc(icol,ilev))
          END DO
        END DO
        DO ilev = nlay, 1, -1
          DO icol = 1, ncol
            radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) …
          END DO
        END DO
        DO ilev = 2, nlay + 1
          DO icol = 1, ncol
            radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt)
          END DO
        END DO
      END DO
      radn_up(:,:,:) = 2._wp * pi * quad_wt * radn_up(:,:,:)
      radn_dn(:,:,:) = 2._wp * pi * quad_wt * radn_dn(:,:,:)
    END SUBROUTINE lw_solver

igpt: loop over spectral intervals
ilev: loop over vertical dimension
icol: loop over horizontal dimension

Page 23: RRTMGP example - Single column abstraction

    SUBROUTINE lw_solver(ngpt, nlay, tau, …)
      ! DECL: Fields don’t have the horizontal dimension (demotion)
      DO igpt = 1, ngpt
        DO ilev = 1, nlay
          tau_loc(ilev) = max(tau(ilev,igpt) …
          trans(ilev) = exp(-tau_loc(ilev))
        END DO
        DO ilev = nlay, 1, -1
          radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) …
        END DO
        DO ilev = 2, nlay + 1
          radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt)
        END DO
      END DO
      radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:)
      radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:)
    END SUBROUTINE lw_solver

Only dependency on these iteration spaces (the igpt and ilev loops)

Page 24: RRTMGP example - CLAW code

    SUBROUTINE lw_solver(ngpt, nlay, tau, …)
      !$claw parallelize ! model dimension info located in config
      DO igpt = 1, ngpt
        DO ilev = 1, nlay
          tau_loc(ilev) = max(tau(ilev,igpt) …
          trans(ilev) = exp(-tau_loc(ilev))
        END DO
        DO ilev = nlay, 1, -1
          radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) …
        END DO
        DO ilev = 2, nlay + 1
          radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt)
        END DO
      END DO
      radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:)
      radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:)
    END SUBROUTINE lw_solver

Algorithm for one column only
Dependency on the vertical dimension only
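To make the separation of concerns concrete, here is a small hand-written sketch of a one-column routine and a driver that iterates over independent columns. It is an illustration under stated assumptions, not CLAW output and not the RRTMGP code: the module sca_demo and the routines column_down and all_columns_down are hypothetical names, and the physics is reduced to the downward sweep from the slides with trans = exp(-tau).

```fortran
module sca_demo
  implicit none
  private
  public :: column_down, all_columns_down
contains

  ! One-column algorithm: only the vertical dimension appears.
  ! Downward sweep with a dependency from level ilev+1 to level ilev.
  pure subroutine column_down(nlay, tau, radn_dn)
    integer, intent(in)    :: nlay
    real,    intent(in)    :: tau(nlay)          ! optical depth per layer
    real,    intent(inout) :: radn_dn(nlay + 1)  ! radn_dn(nlay+1) holds the top boundary value
    real    :: trans
    integer :: ilev

    do ilev = nlay, 1, -1
      trans = exp(-tau(ilev))
      radn_dn(ilev) = trans * radn_dn(ilev + 1)
    end do
  end subroutine column_down

  ! Hand-written driver: the loop over columns lives outside the science code.
  subroutine all_columns_down(ncol, nlay, tau, radn_dn)
    integer, intent(in)    :: ncol, nlay
    real,    intent(in)    :: tau(nlay, ncol)
    real,    intent(inout) :: radn_dn(nlay + 1, ncol)
    integer :: icol

    do icol = 1, ncol   ! no dependency between columns
      call column_down(nlay, tau(:, icol), radn_dn(:, icol))
    end do
  end subroutine all_columns_down

end module sca_demo
```

The point of the sketch is that column_down never mentions icol; where the column loop goes, and which directives surround it, can then be decided per target.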

Page 25: RRTMGP example - CLAW transformation

    clawfc --directive=openacc --target=gpu -o mo_lw_solver.acc.f90 mo_lw_solver.f90
    clawfc --directive=openmp --target=cpu -o mo_lw_solver.omp.f90 mo_lw_solver.f90
    clawfc --directive=openmp --target=mic -o mo_lw_solver.mic.f90 mo_lw_solver.f90

• A single source code
• Specify a target architecture for the transformation
• Specify a compiler directive language to be added

[Diagram: the original code (.f90, architecture agnostic) goes through CLAWFC, which produces automatically transformed code (.f90) for MIC with OpenMP, CPU with OpenMP and GPU with OpenACC]

Page 26: CLAW - One column - OpenACC - local array strategy

• Data analysis for generation of OpenACC directives
• Potentially collapsing loops
• Generate data transfer if wanted

• Adapt data layout
• Promotion of scalar and array fields with model dimensions
• Detect unsupported statements for OpenACC

• Insertion of do statements to iterate over the new dimensions
• Insertion of directives (OpenMP/OpenACC)
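The promotion and do-statement insertion steps listed above can be illustrated with a hand-written before/after pair. This is only a sketch of the idea, assuming a CPU-style result with the horizontal loop innermost; it is not code produced by CLAW, and the module promotion_demo with its routines cumsum_column and cumsum_promoted is hypothetical.

```fortran
module promotion_demo
  implicit none
  private
  public :: cumsum_column, cumsum_promoted
contains

  ! Before: single-column code, a running sum over the vertical dimension.
  subroutine cumsum_column(nlay, q, qsum)
    integer, intent(in)  :: nlay
    real,    intent(in)  :: q(nlay)
    real,    intent(out) :: qsum(nlay)
    real    :: s            ! scalar that carries the vertical dependency
    integer :: ilev

    s = 0.0
    do ilev = 1, nlay
      s = s + q(ilev)
      qsum(ilev) = s
    end do
  end subroutine cumsum_column

  ! After: the fields and the scalar s gain a horizontal dimension
  ! (promotion) and do statements over icol are inserted.
  subroutine cumsum_promoted(ncol, nlay, q, qsum)
    integer, intent(in)  :: ncol, nlay
    real,    intent(in)  :: q(ncol, nlay)
    real,    intent(out) :: qsum(ncol, nlay)
    real    :: s(ncol)      ! promoted scalar
    integer :: icol, ilev

    do icol = 1, ncol
      s(icol) = 0.0
    end do
    do ilev = 1, nlay
      do icol = 1, ncol     ! inserted iteration over the new dimension
        s(icol) = s(icol) + q(icol, ilev)
        qsum(icol, ilev) = s(icol)
      end do
    end do
  end subroutine cumsum_promoted

end module promotion_demo
```

Both routines compute the same values for every column; only the position of the horizontal iteration and the shape of the temporaries differ.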

Page 27: RRTMGP example - GPU w/ OpenACC

    SUBROUTINE lw_solver(ngpt, nlay, tau, …)
      ! DECL: Fields promoted according to usage
      !$acc data present(…)
      !$acc parallel
      !$acc loop gang vector private(…) collapse(2)
      DO icol = 1 , ncol , 1
        DO igpt = 1 , ngpt , 1
          !$acc loop seq
          DO ilev = 1 , nlay , 1
            tau_loc(ilev) = max(tau(icol,ilev,igpt) …
            trans(ilev) = exp(-tau_loc(ilev))
          END DO
          !$acc loop seq
          DO ilev = nlay , 1 , (-1)
            radn_dn(icol,ilev,igpt) = trans(ilev) * radn_dn(icol,ilev+1,igpt)
          END DO
          !$acc loop seq
          DO ilev = 2 , nlay + 1 , 1
            radn_up(icol,ilev,igpt) = trans(ilev-1)*radn_up(icol,ilev-1,igpt)
          END DO
        END DO
        !$acc loop seq
        DO igpt = 1 , ngpt , 1
          !$acc loop seq
          DO ilev = 1 , nlay + 1 , 1
            radn_up(icol,ilev,igpt) = 2._wp * pi * quad_wt * radn_up(icol,ilev,igpt)
            radn_dn(icol,ilev,igpt) = 2._wp * pi * quad_wt * radn_dn(icol,ilev,igpt)
          END DO
        END DO
      END DO
      !$acc end parallel
      !$acc end data
    END SUBROUTINE lw_solver

Page 28: CLAW - One column - OpenACC - local array strategy

Examples of different strategies that are easy to test with an automated workflow:

1. Privatize local arrays
• Make local arrays private (unsupported for allocatable arrays)

2. Promote arrays
• Reduce allocation overhead
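The two local-array strategies can be compared on a toy kernel. The sketch below is hand-written for illustration and is not CLAW output: the module local_array_demo, the routines kernel_private and kernel_promoted, and the arithmetic in the loop bodies are all hypothetical. In the first routine the local array tmp keeps its one-column shape and is listed in a private clause; in the second it is promoted with the horizontal dimension and placed on the device with a create clause.

```fortran
module local_array_demo
  implicit none
  private
  public :: kernel_private, kernel_promoted
contains

  ! Strategy 1: tmp keeps its one-column shape and is privatized,
  ! so each column iteration works on its own copy.
  subroutine kernel_private(ncol, nlay, a, b)
    integer, intent(in)  :: ncol, nlay
    real,    intent(in)  :: a(ncol, nlay)
    real,    intent(out) :: b(ncol, nlay)
    real    :: tmp(nlay)
    integer :: icol, ilev

    !$acc parallel loop gang vector private(tmp) copyin(a) copyout(b)
    do icol = 1, ncol
      !$acc loop seq
      do ilev = 1, nlay
        tmp(ilev) = 0.5 * a(icol, ilev)
      end do
      !$acc loop seq
      do ilev = 1, nlay
        b(icol, ilev) = tmp(ilev) + tmp(1)
      end do
    end do
  end subroutine kernel_private

  ! Strategy 2: tmp is promoted with the horizontal dimension,
  ! so one shared array indexed by icol replaces the private copies.
  subroutine kernel_promoted(ncol, nlay, a, b)
    integer, intent(in)  :: ncol, nlay
    real,    intent(in)  :: a(ncol, nlay)
    real,    intent(out) :: b(ncol, nlay)
    real    :: tmp(ncol, nlay)
    integer :: icol, ilev

    !$acc parallel loop gang vector copyin(a) copyout(b) create(tmp)
    do icol = 1, ncol
      !$acc loop seq
      do ilev = 1, nlay
        tmp(icol, ilev) = 0.5 * a(icol, ilev)
      end do
      !$acc loop seq
      do ilev = 1, nlay
        b(icol, ilev) = tmp(icol, ilev) + tmp(icol, 1)
      end do
    end do
  end subroutine kernel_promoted

end module local_array_demo
```

Both routines return the same result, so switching between them is purely a performance experiment; compiled without OpenACC, the directives are comments and the loops run serially.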

Page 29: RRTMGP lw_solver - Original vs. CLAW CPU/OpenMP

RRTMGP lw_solver comparison of different kernel versions / Domain size: 100x100x42
Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80), PGI
Reference: original source code on 1 core

[Bar chart: speedup, axis 0 to 8]
Original code: 1.0x
One column abstraction: 0.5x
CLAW CPU/OpenMP (12 cores): 8.1x

Page 30: RRTMGP lw_solver - CLAW CPU vs. CLAW GPU

Comparison of different kernel versions / Domain size: 100x100x42
Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80), PGI
Reference: CLAW CPU/OpenMP 12 cores

[Bar chart: speedup, axis 0 to 6]
CLAW CPU/OpenMP (12 cores): 1.0x
CLAW GPU/OpenACC (0.5 K80): 6.1x

Page 31: RRTMGP lw_solver - CLAW vs. GridTools

Comparison of different kernel versions / Domain size: 100x100x42
Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80), PGI
Reference: CLAW CPU/OpenMP 12 cores

[Bar chart: speedup, axis 0 to 6, with two groups (Haswell E5-2690v3 12 cores; 1/2 NVIDIA Tesla K80) and three series in each group: CLAW (Cray 8.4.4), CLAW (PGI 16.3), GridTools (GNU + NVCC). Bar labels: 5.0x, 0.5x, 3.3x, 0.6x, 4.6x, 1.0x.]

Page 32: IFS - CloudSC - one column version

Cloud Microphysics Scheme
• Takes less than a day to create a one column version
• Can play with it and apply different strategies
• OpenACC privatization of local arrays
• OpenACC promotion of local arrays

Page 33: IFS - CloudSC - one column version: early results

Valentin Clement, ECMWF/UKMO/STFC

Comparison of different kernel versions / Domain size: 16000x137
Piz Daint (Haswell E5-2690v3 12 cores vs. NVIDIA Tesla P100), Cray 8.6.1
Reference: OpenMP 12 cores

[Bar chart: speedup, axis 0 to 3. OpenMP / Block: 1x (reference). The two CLAW GPU/OpenACC versions (private and promote) carry the bar labels 2.9x and 2.6x.]

Page 34: Portability with performance

ECMWF IFS operational DyCore data layout: nproma, level, block
IFS physical parameterizations data layout: nproma, level
ECMWF FVM data layout: jlevel, jnode

Page 35: Portability with performance

ECMWF IFS operational DyCore data layout: nproma, level, block
ECMWF FVM data layout: jlevel, jnode

IFS physical parameterizations data layout: level

[Diagram: the single-column IFS physical parameterizations (data layout: level) are transformed, each time using a model cfg, into one version with data layout nproma, level and one version with data layout level, jnode]

Page 36: Portability with performance

• Transformed code might be the same
• More parallelism on the outer loop is better for GPU
• Independent from data layout

• Might have to introduce a copy
• Spending a small amount of time copying to a new data layout might be worth it for overall performance
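A data-layout change with copy can be sketched in a few lines. This is a hand-written illustration under stated assumptions, not code from IFS, FVM or CLAW: the module layout_copy_demo, the routine vertical_sum_with_copy and the toy vertical recurrence are hypothetical. The model hands over a field with the column index first; the kernel copies it into a level-first work array so that the vertical recurrence walks through contiguous memory, then copies the result back.

```fortran
module layout_copy_demo
  implicit none
  private
  public :: vertical_sum_with_copy
contains

  ! Toy kernel with a vertical dependency, run on a copy of the data
  ! stored in a layout that differs from the model layout.
  subroutine vertical_sum_with_copy(ncol, nlay, field)
    integer, intent(in)    :: ncol, nlay
    real,    intent(inout) :: field(ncol, nlay)   ! model layout: column first
    real    :: work(nlay, ncol)                   ! kernel layout: level first
    integer :: icol, ilev

    work = transpose(field)          ! copy in: (ncol, nlay) -> (nlay, ncol)

    do icol = 1, ncol                ! outer loop over independent columns
      do ilev = 2, nlay              ! vertical dependency, contiguous in work
        work(ilev, icol) = work(ilev, icol) + work(ilev - 1, icol)
      end do
    end do

    field = transpose(work)          ! copy out: back to the model layout
  end subroutine vertical_sum_with_copy

end module layout_copy_demo
```

Whether the two transposes pay off depends on how much work the kernel does per element, which is the trade-off the slide points at.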

Page 37: Portability with performance

Change of data layout with copy

[Bar chart: %, axis 0 to 100, comparing different layouts: original layout, optimized layout, optimized layout w/ copy]

Page 38: CLAW collaboration with the OMNI Compiler Project

• Only viable FORTRAN source-to-source framework
• Only one currently maintained
• Very responsive people
• Accept pull requests
• 61 issues opened as of today -> 51 closed
• 61 PRs as of today -> 57 closed

Open source takes some effort but it is rewarding!

Page 39: ENIAC Project (2017-2020)

• Enabling the ICON model on heterogeneous architectures
• Port to OpenACC
• GridTools for stencil computation (DyCore)
• Looking at performance portability in FORTRAN code
• Enhance CLAW FORTRAN Compiler capabilities
• Move physical parameterizations to single column
• Getting more numbers :-)
• Apply transformations for x86, Xeon Phi and GPUs

Page 40: EuroExa Project

• Machine will be hosted at STFC in the UK
• ARM processor nodes
• Loaded with FPGAs

• ECMWF will investigate the single column abstraction
• Specific transformations for ARM processors
• Maybe automatic offloading to FPGA (FORTRAN to C translation)

Page 41: Only a problem of MeteoSwiss?

Page 42: Possible future collaboration

Page 43: CLAW FORTRAN Compiler - Resources

https://github.com/C2SM-RCM/claw-compiler
https://github.com/omni-compiler

CLAW FORTRAN Compiler developer’s guide

Page 44: CLAW FORTRAN Compiler

[email protected]

https://github.com/C2SM-RCM/claw-compiler
https://github.com/omni-compiler