performance tuning of ateles using xevolver · 2936: vec( 2): partiallyvectorized loop. 2936: vec(...
TRANSCRIPT
PerformanceTuningofAtelesusingXevolver
KazuhikoKomatsuTohokuUniversity11October,2017
WSSP26
Introduction• IncreaseofvarietyofHPCsystems– Scalar-typesystem
• Alargenumberofscalarprocessorswithmanycoresandlargecache
• Massivelyparallelcalculations– Accelerator-typesystem
• Acceleratorswithlotsofsimplecores• Dataparallelcalculations
– Vector-typesystem• Vectorcoreswithahighmemorybandwidth• calculatessetofdataelementsatonetime
– TOP500-specializedsystem
11October,2017 WSSP26 2
System-dependentinformation hastobewritteninacodetoexploitthepotentialofeachHPCsystem
Problems
• ByeffectivelyutilizingthefeaturesofeachHPCsystem– Directlywritingsystem-dependentinformation
• C,Fortran,CUDA/OpenCL,AVXintrinsic,…• Severalversions ofanapplicationcode(orIF/DEFHELL)
– EachversionisoptimizedforanHPCsystem
• Lowreadabilityandmaintainability– System-dependentoptimizationsmaybeobstacleofapplicationbehavior
11October,2017 WSSP26 3
ObjectiveandApproach
• Objective– System-dependentoptimizationswhilekeepingreadabilityandmaintainability
• Approach– Utilizecodetransformationsforsystem-dependentoptimizations• Xevolver:Codetransformationframework• Xevtgen:Fortran-likerulegenerator
– Theoriginalcodeasunchangedaspossible
11October,2017 WSSP26 4
Casestudy:TuningofAtelesforSX-ACE
• OverviewofAteles– OneofsolversintheAPESsimulationsuite
• Fortran2008/2003/95/90,python,OpenMP/MPI– ISOCinterface– functionpointer– allocatablevariablesinstructures
– DiscontinuousGalerkinSolver• Fortran100Klines• Python3.5Klines
11October,2017 WSSP26 5
PreliminaryEvaluationonSX-ACE
• Testdata– testsuite/sx_testing/ateles.lua
• Parametersurveyinthescript
– “level”doesnotaffecttheperformance– “degree”mightaffectthevectorlength
11October,2017 WSSP26 6
-- Parameterstovary------ Thepolynomialdegreespecifiestheaccuracywithineachelementanddefines-- thenumberofdegreesoffreedomperelementthatwillbe(degree+1)^3for-- apolyspaceofQdegree=7
-- ...theuniformrefinementlevelfortheperiodiccube-- (Therewillbe8^levelelementsinthemesh)level=2
Single-nodeperformance
11October,2017 WSSP26 7
degree=7 |level=2 degree=14|level=2RealTime(sec) 83.176829 811.614073UserTime(sec) 82.835361 811.393898SysTime(sec) 0.043444 0.062352VectorTime(sec) 12.100865 134.764311Inst.Count 25562226844 542853655029V.Inst.Count 799528202 8621373999V.ElementCount 61705955881 1262989794226V.LoadElementCount 7714451369 126354804785FLOPCount 9218093212 207136982518MOPS 1043.861625 2214.981009MFLOPS 111.282103 255.285359A.V.Length 77.17796 146.495187V.Op.Ratio(%) 71.362225 70.274554MemorySize(MB) 256.03125 320.03125MIPS 308.590758 669.038375I-Cache(sec) 0.266876 1.291097O-Cache(sec) 56.223546 377.723006BankConflictTimeCPUPortConf.(sec) 0.331255 8.065057MemoryNetworkConf. (sec) 3.205328 45.384267ADBHitElementRatio(%) 62.779898 51.393439
Detailedcostdistribution
11October,2017 WSSP26 8
PROC.NAME FREQ. EXCL.TIME[sec] % AVER.TIME
[msec] MFLOPS V.OPRATIO
AVER.V.LEN
VECTORTIME
BANKCONFLICT
ADBHIT
CPU NWPORT
ELEM.%
MODG_VOLTOFACE_Q 4584 69.65 83.8 15.194 25.9 53.85 151.1 1.572 0 0 99.99MODG_PRJ_PFLUX3_Q_6 48896 3.009 3.6 0.062 297.7 98.96 38.5 2.986 0.05 1.671 84.29MAXWELL_FLUX_CUBE_VEC 2292 2.403 2.9 1.048 214.9 80.25 39.1 2.181 0.009 0.043 63.24MODG_PRJ_PFLUX2_Q_6 48896 1.594 1.9 0.033 562 98.86 38.6 1.563 0.061 0.254 81.99MODG_PRJ_PFLUX1_Q_6 48896 1.518 1.8 0.031 589.8 98.87 38.8 1.476 0.03 0.165 79.93
PROC.NAME FREQ. EXCL.TIME[sec] % AVER.TIME
[msec] MFLOPS V.OPRATIO
AVER.V.LEN
VECTORTIME
BANKCONFLICT
ADBHIT
CPU NWPORT
ELEM.%
MODG_VOLTOFACE_Q 16080 665.772 82.2 41.404 62.6 48.56 245.9 16.892 0 0 100MODG_MAXWELL_PHYSFLUX_CONST 514560 23.831 2.9 0.046 0 95.6 238.3 23.027 3.559 17.651 0MODG_PRJ_PFLUX3_Q_6 171520 23.159 2.9 0.135 1080.5 99.74 130.1 23.075 1.499 8.046 77.57MODG_PRJ_PFLUX1_Q_6 171520 17.697 2.2 0.103 1414 99.71 130 17.546 1.214 1.401 70.25MODG_PRJ_PFLUX2_Q_6 171520 17.254 2.1 0.101 1450.3 99.71 130.4 17.133 1.173 1.96 79.31
degree=7 |level=2
degree=14 |level=2
• Costdistributionsinbothdegreesaresimilar• Lowvectoroperationratiointhehighestroutine
Analysisoftheroutine
11October,2017 WSSP26 9
MODG_VOLTOFACE_Q:
2929:+------>doiAnsX=1,maxPolyDegree+12930:|!getthefacevalueoftheansatzfunctionwithfixedcoordinate2931:|!!faceVal=faceValRightBndAns(iAnsX)2932:|2933:|+----->doiVar=1,nScalars2934:||+---->doiElem=1,nElems2936:|||V--->dofacepos=1,mpd1_square2937:||||!getpositionofthecurrentansatzfunction2938:||||pos=(facepos-1)*mpd1+iAnsX2939:||||AfaceState(iElem,facePos,iVar,leftOrRight)&2940:||||&=faceState(iElem,facePos,iVar,leftOrRight)&2941:||||&+volState(iElem,pos,iVar)2942:|||V--- enddo2944:||+---- enddo2945:|+----- enddo2946:|2947:+------ enddo:
LoopLength=~7
LoopLength=~6LoopLength=~64LoopLength=~64
• Compilemessage2936:vec(2):Partially vectorizedloop.2936:vec(24):Iterationcountisassumed.Iterationcount=5000.2936:vec(25):Workvectorsareused.Size=40000byte.
• Theinnermostloopisvectorized,butPARTIALLY– Datadependencycannotbesolvedbythecompiler
• ThelengthofthevectorizedloopisSHORT
OptimizationseffectiveforSX-ACE
11October,2017 WSSP26 10
MODG_VOLTOFACE_Q:
2930:+------>doiAnsX=1,maxPolyDegree+12931:|!getthefacevalueoftheansatzfunctionwithfixedcoordinate2934:|+-----> doiVar=1,nScalars2938:||m=nElems2939:||n=mpd1_square2940:||mn=m*n2941:|| !cdirnodep2942:||V----> doij=0,mn- 12943:|||iElem=ij/n+12944:|||facepos=mod(ij,n)+12945:|||!getpositionofthecurrentansatzfunction2946:|||pos=(facepos-1)*mpd1+iAnsX2947:|||A faceState(iElem,facePos,iVar,leftOrRight)&2948:|||&=faceState(iElem,facePos,iVar,leftOrRight)&2949:|||&+volState(iElem,pos,iVar)2950:||V---- enddo2953:|+----- enddo2955:+------ enddo:
LoopLength=~7
LoopLength=~6
LoopLength=~64*64
• Compilemessage2942:vec(1):Vectorizedloop.2947:vec(23): "NODEP"isspecified.Nodependencyisassumed.:FACESTATE2942:vec(29):ADBisusedforarray.:__REAL(KIND=8)
• Thelengthofthevectorizedloopincreases– Bycollapsingbothiloopandjloop
• Successfullyvectorizedbythe“nodep”directive• Appliedforallthesameloopnestsintheroutine
Motivation
• System-awareoptimizationforSX-ACE– Loopcollapseandcompiler-specificdirectives
• Degradereadability– Similaroptimizations needtobeREPEATEDLYapplied• Loopcollapseandinsertionofthedirectiveisappliedallloopnestsinwholecode
• Mightleadhumanerrors
11October,2017 11WSSP26
Therepetitiveoptimizationcanbereplacedwithtransformationrulesandcodetransformations
Xevolver:codetransformationframework
• Xevolver[Takizawa,2014]– Canseparatesystem-awarenessfromacode
• Externalcustomtranslationrulesandcustomdirectives➔Minimizemodificationsandthenkeepmaintainability oftheoriginalcode
11October,2017 12
Xevtgen:TransformationRuleGenerator
• Xevtgen[Suda,2015]– EasilygenerationoftransformationrulesforXevolver• DummyFortrancode
– Fortran-likecodewithsomespecialtgendirectives
• source anddestination patternsofadummyFortrancode
!$xevtgensrcbeginIF(I.EQ.0)EXIT!$xevtgensrcend
!$xevtgendstbeginIF(I==0)THENEXITENDIF!$xevtgendstend
Sourcepattern
Destinationpattern
StandardFortranprogrammerscaneasilylearnandgeneraterules
11October,2017 WSSP26 13
OptimizationusingXevolver
– Insteadofdirectlymodifyingacode,customrulesandcustomdirectivesaredefinedusingXevolver
➜Minimizemodificationsandthenkeepmaintainability
11October,2017 14
Ateles
Applycodemodification
Makingrulesofmodifications
OptimizedAteles
!$......……
LoopcollapseandNODEPdirective
• CodeTransformationbyXevolver– Justinserting“!$xevcollapse”intotheoriginalcode
• Cankeepthemaintainabilityoftheoriginalcodeasmuchaspossible
!$xevcollapsedoiVar=1,nScalarsdoiElem=1,nElemsdofacepos=1,mpd1_square…
enddoenddo
enddo
doiVar=1,nScalarsm=nElemsn=mpd1_squaremn=m*n
!cdirnodepdoij=0,mn- 1iElem=ij/n+1facepos=mod(ij,n)+1…
enddoenddo
Original+
xevdirective
11October,2017 WSSP26 15
Transformationrule
Sourcepattern Destinationpattern!$xevtgendstbeginm=nElemsn=mpd1_squaremn=m*n!cdirnodepdoij=0,mn- 1iElem=ij/n+1facepos=mod(ij,n)+1
!$xevtgenstmt(body0)enddo!$xevtgendstend
!$xevtgensrcbegin!$xevcollapsedoiElem=1,nElemsdofacepos=1,mpd1_square
!$xevtgenstmt(body0)enddo
enddo!$xevtgensrcend
11October,2017 WSSP26 1618system-awareoptimizationscanbeappliedbycodetransformations
Experimentalenvironment
• ComparisonofperformanceofAtelesonvariousplatforms– Originalversion– TunedversionforNECSX-ACE
• Loopcollapse• Insertionofcompiler-specificdirectives
11October,2017 WSSP26 17
Processor Intel XeonE5-2695v2 NEC SX-ACE IBMPower8
#.of cores 2x12cores 4cores 2x 8cores
Mem. Capacity 128 GB 64GB 512GB
Compiler Intel compiler16.03
NEC SX2003Rev.061
PGI16.10(openpower)
Results
– SX-ACE:x7.5speeduptooriginal,18%toInteloriginal– Xeon:Tunedversiondegrades14%– IBM:Tunedversionaccelerates6%
11October,2017
0
20
40
60
80
100
SX-ACE Xeon Power8
ExecutionTime[sec.]
Original Tuned
18WSSP26
Optimizationforparticularplatformisnotalwayseffective.Ourapproachthatusestransformationissuitable!
Detailedanalysisofloopcollapse
11October,2017 WSSP26 19
• degree=7,level=2
– Vectoroperationratiosbecomemorethan99%– Vectoraveragelengthsbecomemorethan250
FREQ. EXCL.TIME[sec] % AVER.TIME
[msec] MOPS MFLOPS V.OPRATIO
AVER.V.LEN
VECTORTIME
I-CACHEMISS
O-CACHEMISS
BANKCONFLICT ADBHITCPU NWPORT ELEM.%
BEFORE 4584 69.650 83.8 15.194 657.2 25.9 53.85 151.1 1.572 0.001 55.351 0 0 99.99AFTER 4584 2.117 13.7 0.462 18677.2 1135.1 99.83 256 2.114 0.001 0.001 0.485 0.025 73.51
Conclusions• PerformanceanalysisandoptimizationsofAtelesonSX-ACE
– Inlineexpansion– Loopcollapse➔ Degradesthereadabilityandmaintainability
• Forhighmaintainabilityandperformance– CodetransformationbyXevolverisemployed
• onlysmallnumberofdirectivesneedtobeinsertedintotheoriginalcode
• Futurework– IncreaseofcompatibilityofXevolver
• ThebackendofcompilerinfrastructuredoesnotfullysupportFortran2003/2008
• NeedtomodifyAtelesto avoidtheunsupporteddescription
11October,2017 WSSP26 20