claw fortran compiler - xcalablemp · 2017. 10. 31. · valentin clement 5th xcalablemp workshop -...
TRANSCRIPT
Image:NAS
A
CLAWFORTRANCompilersource-to-sourcetranslationforperformanceportabilityXcalableMPWorkshop,Akihabara,Tokyo,JapanOctober31,[email protected]
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 2
• PerformanceportabilityprobleminCOSMO• Singlesourcecode• Portability• Performance• Portabilityofcode• CLAWFORTRANCompiler• Singlecolumnabstraction• Futureproject
Summary
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 3
GPUmachineinSwitzerland-PizKesch(MeteoSwiss)
Production+R&DEachrackiscomposedof12Computenodes:•2xIntelHaswellE5-2690v3 2.6GHz12-coreCPUs(totalof24CPUs)•256GB2133MhzDDR4(totalof3TB)•8xDualNVIDIATESLAK80GPU(totalof96cards-192GPUs)
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 4
GPUmachineinSwitzerland-PizDaint(CSCS)
CrayXC40/XC50•Intel®Xeon®E5-2690v32.60GHz,12cores,64GBRAM•NVIDIATeslaP100
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 5
Execution time [s]
0
0.3
0.6
0.9
1.2
Execution on CPU Haswell E5-2690v3 12 cores
Execution on GPU 1/2 NVIDIA Tesla K80
Code restructured for CPUCode restructured for GPU with OpenACC
-30%
-35%
Performanceportabilityproblem-COSMORadiation
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 6
Performanceportabilityproblem-COSMORadiation
!$accparallelloopDOj=1,nproma!$accloopDOk=1,nzCALLfct()!1stloopbody!2ndloopbody!3rdloopbodyENDDOENDDO!$accendparallel
DOk=1,nzCALLfct()DOj=1,nproma!1stloopbodyENDDODOj=1,nproma!2ndloopbodyENDDODOj=1,nproma!3rdloopbodyENDDOENDDO
CPUstructure GPUstructure
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 7
Howtokeepasinglesourcecodeforeveryone
• Massivecodebase(200’000to>1mioLOC)• Severalarchitecturespecificoptimizationsurvive• MostofthesecodebaseareCPUoptimized• Notsuitedfornextgenerationarchitecture• Notsuitedformassiveparallelism• Fewornomodularity
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 8
Whatkindofcodebasearewedealingwith?
• Global/localareaweatherforecastmodel• >10aroundtheworld• MonsterFORTRAN77-2008“monolithic”code• Withoutmuchmodularity
• Sofarweinvestigate:• COSMO(Localareamodelconsortium)-Severalinstitution• ICON-DWD(GermanWeatherAgency)-WillreplaceCOSMO• IFSCurrentCycle+FVM-ECMWF-Memberstateusage
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 9
Whatkindofcodebasearewedealingwith?
Exampleofthreecodebaseweinvestigatedsofar:• COSMO• ICON• IFS
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 10
COSMOMode-loc
-------------------------------------------------------------------------------- Language files blank comment code -------------------------------------------------------------------------------- Fortran 90 173 53998 109381 211711 C/C++ Header 148 5595 11827 29888 C++ 121 5050 6189 26580 Python 37 1454 1444 5764 Bourne Again Shell 17 246 381 3206 Bourne Shell 33 544 594 2349 XML 11 272 193 2143 CMake 9 103 98 793 make 1 36 27 230 CUDA 58 4 0 58 -------------------------------------------------------------------------------- SUM: 620 68232 130684 286710 --------------------------------------------------------------------------------
ClimateandlocalareamodelusedbyGermany,Switzerland,Russia…
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 11
------------------------------------------------------------------------------- Language files blank comment code ------------------------------------------------------------------------------- Fortran 90 4003 201338 300427 737289 C/C++ Header 480 846 0 13041 Perl 1 89 13 240 ------------------------------------------------------------------------------- SUM: 4484 202273 300440 750570 --------------------------------------------------------------------------------
*onlysourcewithoutexternalmodules
ECMWFIFS-loc
EuropeanCentreforMedium-rangeweatherforecasts-GlobalModel
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 12
DWDICON-loc
--------------------------------------------------------------------------------------- Language files blank comment code --------------------------------------------------------------------------------------- Fortran 90 822 99802 144962 447356 C 219 43854 30991 150781 HTML 307 449 15415 94940 Fortran 77 463 294 113285 64061 Java 95 2685 4335 11605 C/C++ Header 106 2194 8359 8332 Python 43 2163 2425 7656 --------------------------------------------------------------------------------------- SUM: 2599 174509 346197 931446 ---------------------------------------------------------------------------------------
NewGermanGlobalmodelwithoptiontouseditaslocalareamodel
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 13
Portabilityofcode
• Currentcodebase• Optimizedforonearchitecture• Oftenwitholdoptimizationsstillinplace
• Notreallymodular• Globalfieldsinjectedeverywhere
• Future?• Modularstandalonepackagethatcanbeplug• Indifferentmodel• Fordifferentarchitecture
• Abstractdatalayoutwhereitcanbedone• Abstractmodelspecificdata,everythingpassedasarguments.
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 14
Performanceportability-currentcode
DO ilev = 1, nlay DO icol = 1, ncol tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) … trans(icol,ilev) = exp(-tau_loc(icol,ilev)) END DO END DO DO ilev = nlay, 1, -1 DO icol = 1, ncol radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) … END DO END DO DO ilev = 2, nlay + 1 DO icol = 1, ncol radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt) END DO END DO
nlay size ~= 100 ncol size >= 10’000
ilev -> dependency icol -> no dependency
LoopstructurebetterforCPUs
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 15
Performanceportability-currentcode
!$omp parallel default(shared) DO igpt = 1, ngptot, nproma CALL physical_parameteriztaion(…) ! Code from previous silde
END DO !$omp end parallel
Sometimesusingnpromablockoptimizationforbettervectorization
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 16
Performanceportability-GPUstructuredcode
DO icol = 1, ncol DO ilev = 1, nlay tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) … trans(icol,ilev) = exp(-tau_loc(icol,ilev)) END DO DO ilev = nlay, 1, -1 radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) … END DO DO ilev = 2, nlay + 1 radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt) END DO END DO
nlay size ~= 100 ncol size >= 10’000
ilev -> dependency icol -> no dependency
LoopstructurebetterforGPUs
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 17
Performanceportability-nextarchitecture
-Whatisthebestloopstructure/datalayoutfornextarchitecture?- Dowewanttorewritethecodeeachtime?- Doweknowexactlywhicharchitecturewewillrunon?
?
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 18
WhatistheCLAWFORTRANCompiler?
• Source-to-sourcetranslationforFORTRANcode• BasedonXcodeML/FIR• UsingOMNICompilerfront-endandback-end• ContributiontoitviaGitHub
• TransformationofAST• Differenttransformationappliedbasedontarget• Promotionofscalarandarrays• Insertionofiteration• InsertionofOpenACCandOpenMPdirectives
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017
CLAWFC
19
• BasedontheOMNICompilerFORTRANfront-end&back-end• Source-to-sourcetranslator• OpensourceundertheBSDlicense• AvailableonGitHubwiththespecifications
.f90
FPP CX2XF_Front F_Back
.f90
CLAWFORTRANCompilerunderthehood
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 20
Separationofconcerns• Domainscientistsfocusontheirproblem(1column,1box)
• CLAWcompilerproducecodeforeachtargetanddirectivelanguages
Achievemodularity• Standalonephysicalparameter•Modularfrommodelspecificity
Singlecolumnabstraction
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 21
RRTMGPExample-AnicemodularcodeCPUstructured
F2003radiationcode• FromRobertPincusandal.fromAERUniversityofColorado
• Computeintensivepartarewelllocatedin“kernel”module.
• Codeisnon-the-lessCPUstructuredwithhorizontalloopastheinnermostineveryiteration.
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 22
SUBROUTINE lw_solver(ngpt, nlay, tau, …) ! DECLARATION PART OMITTED DO igpt = 1, ngpt DO ilev = 1, nlay DO icol = 1, ncol tau_loc(icol,ilev) = max(tau(icol,ilev,igpt) … trans(icol,ilev) = exp(-tau_loc(icol,ilev)) END DO END DO DO ilev = nlay, 1, -1 DO icol = 1, ncol radn_dn(icol,ilev,igpt) = trans(icol,ilev) * radn_dn(icol,ilev+1,igpt) … END DO END DO DO ilev = 2, nlay + 1 DO icol = 1, ncol radn_up(icol,ilev,igpt) = trans(icol,ilev-1) * radn_up(icol,ilev-1,igpt) END DO END DO END DO radn_up(:,:,:) = 2._wp * pi * quad_wt * radn_up(:,:,:) radn_dn(:,:,:) = 2._wp * pi * quad_wt * radn_dn(:,:,:) END SUBROUTINE lw_solver
RRTMGPExample-originalcode-CPUstructuredLoop
overv
erticaldim
ensio
n
Loop
overh
orizo
ntaldim
ensio
n
Loop
overspe
ctralintervals
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 23
SUBROUTINE lw_solver(ngpt, nlay, tau, …) ! DECLARATION PART OMITTED DO igpt = 1, ngpt DO ilev = 1, nlay tau_loc(ilev) = max(tau(ilev,igpt) … trans(ilev) = exp(-tau_loc(ilev)) END DO DO ilev = nlay, 1, -1 radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) … END DO DO ilev = 2, nlay + 1 radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt) END DO END DO radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:) radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:) END SUBROUTINE lw_solver
RRTMGPExample-Singlecolumnabstraction
SUBROUTINE lw_solver(ngpt, nlay, tau, …) ! DECLARATION PART OMITTED DO igpt = 1, ngpt DO ilev = 1, nlay tau_loc(ilev) = max(tau(ilev,igpt) … trans(ilev) = exp(-tau_loc(ilev)) END DO DO ilev = nlay, 1, -1 radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) … END DO DO ilev = 2, nlay + 1 radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt) END DO END DO radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:) radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:) END SUBROUTINE lw_solver
SUBROUTINE lw_solver(ngpt, nlay, tau, …) ! DECL: Fields don’t have the horizontal dimension (demotion) DO igpt = 1, ngpt DO ilev = 1, nlay tau_loc(ilev) = max(tau(ilev,igpt) … trans(ilev) = exp(-tau_loc(ilev)) END DO DO ilev = nlay, 1, -1 radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) … END DO DO ilev = 2, nlay + 1 radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt) END DO END DO radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:) radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:) END SUBROUTINE lw_solver
Onlyde
pend
encyontheseite
ratio
nspaces
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 24
SUBROUTINE lw_solver(ngpt, nlay, tau, …) !$claw parallelize ! model dimension info located in config DO igpt = 1, ngpt DO ilev = 1, nlay tau_loc(ilev) = max(tau(ilev,igpt) … trans(ilev) = exp(-tau_loc(ilev)) END DO DO ilev = nlay, 1, -1 radn_dn(ilev,igpt) = trans(ilev) * radn_dn(ilev+1,igpt) … END DO DO ilev = 2, nlay + 1 radn_up(ilev,igpt) = trans(ilev-1) * radn_up(ilev-1,igpt) END DO END DO radn_up(:,:) = 2._wp * pi * quad_wt * radn_up(:,:) radn_dn(:,:) = 2._wp * pi * quad_wt * radn_dn(:,:) END SUBROUTINE lw_solver
Algorithm
foro
necolum
non
ly
Dependencyontheverticaldimensiononly
RRTMGPExample-CLAWcode
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 25
RRTMGPExample-CLAWtransformation
clawfc --directive=openacc --target=gpu -o mo_lw_solver.acc.f90 mo_lw_solver.f90
clawfc --directive=openmp --target=cpu -o mo_lw_solver.omp.f90 mo_lw_solver.f90
clawfc --directive=openmp --target=mic -o mo_lw_solver.mic.f90 mo_lw_solver.f90
• Asinglesourcecode• Specifyatargetarchitectureforthetransformation• Specifyacompilerdirectiveslanguagetobeadded
.f90
Originalcode(Architectureagnostic)
Automaticallytransformed
cod
e
.f90
MIC OpenMP
.f90
CPU OpenMP
.f90
GPU OpenACC
CLAWFC
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 26
CLAW-Onecolumn-OpenACC-localarraystrategy
• DataanalysisforgenerationofOpenACCdirectives• Potentiallycollapsingloops• Generatedatatransferifwanted
• Adaptdatalayout• Promotionofscalarandarrayfieldswithmodeldimensions• DetectunsupportedstatementsforOpenACC
• Insertionofdostatementstoiterateofnewdimensions• Insertionofdirectives(OpenMP/OpenACC)
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 27
SUBROUTINE lw_solver(ngpt, nlay, tau, …) ! DECL: Fields promoted accordingly to usage !$acc data present(…) !$acc parallel !$acc loop gang vector private(…) collapse(2) DO icol = 1 , ncol , 1 DO igpt = 1 , ngpt , 1 !$acc loop seq DO ilev = 1 , nlay , 1 tau_loc(ilev) = max(tau(icol,ilev,igpt) trans(ilev) = exp(-tau_loc(ilev)) END DO !$acc loop seq DO ilev = nlay , 1 , (-1) radn_dn(icol,ilev,igpt) = trans(ilev) * radn_dn(icol,ilev+1,igpt) END DO !$acc loop seq DO ilev = 2 , nlay + 1 , 1 radn_up(icol,ilev,igpt) = trans(ilev-1)*radn_up(icol,ilev-1,igpt) END DO END DO !$acc loop seq DO igpt = 1 , ngpt , 1 !$acc loop seq DO ilev = 1 , nlay + 1 , 1 radn_up(icol,igpt,ilev) = 2._wp * pi * quad_wt * radn_up(icol,igpt,ilev) radn_dn(icol,igpt,ilev) = 2._wp * pi * quad_wt * radn_dn(icol,igpt,ilev) END DO END DO END DO !$acc end parallel !$acc end data END SUBROUTINE lw_solver
RRTMGPExample-GPUw/OpenACC
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 28
CLAW-Onecolumn-OpenACC-localarraystrategy
Exampleofdifferentstrategyeasytotestwithanautomatizeworkflow:1.Privatizelocalarrays•Makelocalarraysprivate(unsupportedforallocatablearrays)
2.Promotearrays•Reduceallocationoverhead
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 29
RRTMGP lw_solver comparison of different kernel version / Domain size: 100x100x42 Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80) PGI
Reference: original source code on 1-core
Speedup
0
2
3
5
6
8
Original code One column abstraction CLAW CPU/OpenMP (12cores)
8.1x
0.5x1.0x
RRTMGPlw_solver-Originalvs.CLAWCPU/OpenMP
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 30
Comparison of different kernel version / Domain Size100x100x42 Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80) PGI
Reference: CLAW CPU/OpenMP 12-cores
Speedup
0
1
2
4
5
6
CLAW CPU/OpenMP (12cores) CLAW GPU/OpenACC (0.5K80)
6.1x
1.0x
RRTMGPlw_solver-CLAWCPUvs.CLAWGPU
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 31
Comparison of different kernel version / Domain Size100x100x42 Piz Kesch (Haswell E5-2690v3 12 cores vs. 1/2 NVIDIA Tesla K80) PGI
Reference: CLAW CPU/OpenMP 12-cores
Speedup
0
1
2
4
5
6
Haswell E5-2690v3 12 cores 1/2 NVIDIA Telse K80CLAW (Cray 8.4.4) CLAW (PGI 16.3) GridTools (GNU + NVCC) CLAW (Cray 8.4.4) CLAW (PGI 16.3) GridTools (GNU + NVCC)
5.0x
0.5x
3.3x
0.6x
4.6x
1.0x
CLAW (Cray 8.4.4)CLAW (PGI 16.3)GridTools (GNU + NVCC)
RRTMGPlw_solver-CLAWvs.GridTools
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 32
IFS-CloudSC-onecolumnversion
CloudMircophysicsScheme• Takelessthanadaytocreateaonecolumnversion• Canplaywithitandapplydifferentstrategy• OpenACCprivatizationoflocalarrays• OpenACCpromotionoflocalarrays
ValentinClement ECMWF/UKMO/STFC 33
IFS-CloudSC-onecolumnversion:earlyresults
Comparison of different kernel version / Domain Size: 16000x137 Piz Daint (Haswell E5-2690v3 12 cores vs. NVIDIA Tesla P100) Cray 8.6.1
Reference: OpenMP 12-cores
Speedup
0
0.75
1.5
2.25
3
2.9x2.6x
1x
CLAW GPU/OpenACC (private) CLAW GPU/OpenACC (promote)OpenMP / Block
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 34
Portabilitywithperformance
ECMWFIFSOperationalDyCoredatalayout:nproma,level,block
IFSPhysicalParameterizationsdatalayout:nproma,level
ECMWFFVMdatalayout:jlevel,jnode
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 35
Portabilitywithperformance
ECMWFIFSOperationalDyCoredatalayout:nproma,level,block
IFSPhysicalParameterizationsdatalayout:level
ECMWFFVMdatalayout:jlevel,jnode
IFSPhysicalParameterizationsdatalayout:nproma,level
IFSPhysicalParameterizationsdatalayout:level,jnode
model cfg
model cfg
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 36
Portabilitywithperformance
•Transformedcodemightbethesame•MoreparallelismonouterloopforGPUisbetter•Independentfromdatalayout
•Mighthavetointroducecopy• Spendingsmalltimecopyingtonewdatalayoutmightworthinoverallperformance
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 37
Portabilitywithperformance
Changementofdatalayoutwithcopy
%
0
25
50
75
100
Differentlayout
Originallayout Oprmizedlayout Oprmizedlayoutw/copy
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 38
CLAWcollaborationwiththeOMNICompilerProject
• OnlyviableFORTRANsource-to-sourceframework• Onlyonecurrentlymaintained• Veryresponsivepeople• AcceptPullRequests• 61issuesopenedastoday->51closed• 61PRastoday->57closed
Opensourcetakessomeeffortbutitisrewarding!
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 39
ENIACProject(2017-2020)
• EnablingICONmodelonheterogenousarchitecture• PorttoOpenACC• GridToolsforstencilcomputation(DyCore)• LookingatperformanceportabilityinFORTRANcode• EnhanceCLAWFORTRANCompilercapabilities• Movephysicalparameterizationtosinglecolumn• Gettingmorenumbers:-)• Applytransformationforx86,XeonPhiandGPUs
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 40
EuroExaProject
• MachinewillbehostedatSTFCinUK• ARMprocessornode• LoadedwithFPGA
• ECMWFwillinvestigatesinglecolumnabstraction• SpecifictransformationforARMprocessor• MabyeautomaticoffloadingtoFPGA(FORTRANtoCtranslation)
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 41
OnlyaproblemofMeteoSwiss?
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 42
Possiblefuturecollaboration
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 43
https://github.com/C2SM-RCM/claw-compilerhttps://github.com/omni-compiler
CLAWFORTRANCompilerdeveloper’sguide
CLAWFORTRANCompiler-Resources
ValentinClement 5thXcalableMPWorkshop-31stofOctober2017 44
https://github.com/C2SM-RCM/claw-compilerhttps://github.com/omni-compiler