performance tuning of ateles using xevolver · 2936: vec( 2): partiallyvectorized loop. 2936: vec(...

PerformanceTuningofAtelesusingXevolver

KazuhikoKomatsuTohokuUniversity11October,2017

WSSP26

Introduction• IncreaseofvarietyofHPCsystems– Scalar-typesystem

• Alargenumberofscalarprocessorswithmanycoresandlargecache

• Massivelyparallelcalculations– Accelerator-typesystem

• Acceleratorswithlotsofsimplecores• Dataparallelcalculations

– Vector-typesystem• Vectorcoreswithahighmemorybandwidth• calculatessetofdataelementsatonetime

– TOP500-specializedsystem

11October,2017 WSSP26 2

System-dependentinformation hastobewritteninacodetoexploitthepotentialofeachHPCsystem

Problems

• ByeffectivelyutilizingthefeaturesofeachHPCsystem– Directlywritingsystem-dependentinformation

• C,Fortran,CUDA/OpenCL,AVXintrinsic,…• Severalversions ofanapplicationcode(orIF/DEFHELL)

– EachversionisoptimizedforanHPCsystem

• Lowreadabilityandmaintainability– System-dependentoptimizationsmaybeobstacleofapplicationbehavior


ObjectiveandApproach

• Objective– System-dependentoptimizationswhilekeepingreadabilityandmaintainability

• Approach– Utilizecodetransformationsforsystem-dependentoptimizations• Xevolver:Codetransformationframework• Xevtgen:Fortran-likerulegenerator

– Theoriginalcodeasunchangedaspossible


Casestudy:TuningofAtelesforSX-ACE

• OverviewofAteles– OneofsolversintheAPESsimulationsuite

• Fortran2008/2003/95/90,python,OpenMP/MPI– ISOCinterface– functionpointer– allocatablevariablesinstructures

– DiscontinuousGalerkinSolver• Fortran100Klines• Python3.5Klines


PreliminaryEvaluationonSX-ACE

• Testdata– testsuite/sx_testing/ateles.lua

• Parametersurveyinthescript

– “level”doesnotaffecttheperformance– “degree”mightaffectthevectorlength


-- Parameterstovary------ Thepolynomialdegreespecifiestheaccuracywithineachelementanddefines-- thenumberofdegreesoffreedomperelementthatwillbe(degree+1)^3for-- apolyspaceofQdegree=7

-- ...theuniformrefinementlevelfortheperiodiccube-- (Therewillbe8^levelelementsinthemesh)level=2

Single-nodeperformance


degree=7 |level=2 degree=14|level=2RealTime(sec) 83.176829 811.614073UserTime(sec) 82.835361 811.393898SysTime(sec) 0.043444 0.062352VectorTime(sec) 12.100865 134.764311Inst.Count 25562226844 542853655029V.Inst.Count 799528202 8621373999V.ElementCount 61705955881 1262989794226V.LoadElementCount 7714451369 126354804785FLOPCount 9218093212 207136982518MOPS 1043.861625 2214.981009MFLOPS 111.282103 255.285359A.V.Length 77.17796 146.495187V.Op.Ratio(%) 71.362225 70.274554MemorySize(MB) 256.03125 320.03125MIPS 308.590758 669.038375I-Cache(sec) 0.266876 1.291097O-Cache(sec) 56.223546 377.723006BankConflictTimeCPUPortConf.(sec) 0.331255 8.065057MemoryNetworkConf. (sec) 3.205328 45.384267ADBHitElementRatio(%) 62.779898 51.393439

Detailedcostdistribution


PROC.NAME FREQ. EXCL.TIME[sec] % AVER.TIME

[msec] MFLOPS V.OPRATIO

AVER.V.LEN

VECTORTIME

BANKCONFLICT

ADBHIT

CPU NWPORT

ELEM.%

MODG_VOLTOFACE_Q 4584 69.65 83.8 15.194 25.9 53.85 151.1 1.572 0 0 99.99MODG_PRJ_PFLUX3_Q_6 48896 3.009 3.6 0.062 297.7 98.96 38.5 2.986 0.05 1.671 84.29MAXWELL_FLUX_CUBE_VEC 2292 2.403 2.9 1.048 214.9 80.25 39.1 2.181 0.009 0.043 63.24MODG_PRJ_PFLUX2_Q_6 48896 1.594 1.9 0.033 562 98.86 38.6 1.563 0.061 0.254 81.99MODG_PRJ_PFLUX1_Q_6 48896 1.518 1.8 0.031 589.8 98.87 38.8 1.476 0.03 0.165 79.93

PROC.NAME FREQ. EXCL.TIME[sec] % AVER.TIME

[msec] MFLOPS V.OPRATIO

AVER.V.LEN

VECTORTIME

BANKCONFLICT

ADBHIT

CPU NWPORT

ELEM.%

MODG_VOLTOFACE_Q 16080 665.772 82.2 41.404 62.6 48.56 245.9 16.892 0 0 100MODG_MAXWELL_PHYSFLUX_CONST 514560 23.831 2.9 0.046 0 95.6 238.3 23.027 3.559 17.651 0MODG_PRJ_PFLUX3_Q_6 171520 23.159 2.9 0.135 1080.5 99.74 130.1 23.075 1.499 8.046 77.57MODG_PRJ_PFLUX1_Q_6 171520 17.697 2.2 0.103 1414 99.71 130 17.546 1.214 1.401 70.25MODG_PRJ_PFLUX2_Q_6 171520 17.254 2.1 0.101 1450.3 99.71 130.4 17.133 1.173 1.96 79.31

degree=7 |level=2

degree=14 |level=2

• Costdistributionsinbothdegreesaresimilar• Lowvectoroperationratiointhehighestroutine

Analysisoftheroutine


MODG_VOLTOFACE_Q:

2929:+------>doiAnsX=1,maxPolyDegree+12930:|!getthefacevalueoftheansatzfunctionwithfixedcoordinate2931:|!!faceVal=faceValRightBndAns(iAnsX)2932:|2933:|+----->doiVar=1,nScalars2934:||+---->doiElem=1,nElems2936:|||V--->dofacepos=1,mpd1_square2937:||||!getpositionofthecurrentansatzfunction2938:||||pos=(facepos-1)*mpd1+iAnsX2939:||||AfaceState(iElem,facePos,iVar,leftOrRight)&2940:||||&=faceState(iElem,facePos,iVar,leftOrRight)&2941:||||&+volState(iElem,pos,iVar)2942:|||V--- enddo2944:||+---- enddo2945:|+----- enddo2946:|2947:+------ enddo:

LoopLength=~7

LoopLength=~6LoopLength=~64LoopLength=~64

• Compilemessage2936:vec(2):Partially vectorizedloop.2936:vec(24):Iterationcountisassumed.Iterationcount=5000.2936:vec(25):Workvectorsareused.Size=40000byte.

• Theinnermostloopisvectorized,butPARTIALLY– Datadependencycannotbesolvedbythecompiler

• ThelengthofthevectorizedloopisSHORT

OptimizationseffectiveforSX-ACE


MODG_VOLTOFACE_Q:

2930:+------>doiAnsX=1,maxPolyDegree+12931:|!getthefacevalueoftheansatzfunctionwithfixedcoordinate2934:|+-----> doiVar=1,nScalars2938:||m=nElems2939:||n=mpd1_square2940:||mn=m*n2941:|| !cdirnodep2942:||V----> doij=0,mn- 12943:|||iElem=ij/n+12944:|||facepos=mod(ij,n)+12945:|||!getpositionofthecurrentansatzfunction2946:|||pos=(facepos-1)*mpd1+iAnsX2947:|||A faceState(iElem,facePos,iVar,leftOrRight)&2948:|||&=faceState(iElem,facePos,iVar,leftOrRight)&2949:|||&+volState(iElem,pos,iVar)2950:||V---- enddo2953:|+----- enddo2955:+------ enddo:

LoopLength=~7

LoopLength=~6

LoopLength=~64*64

• Compilemessage2942:vec(1):Vectorizedloop.2947:vec(23): "NODEP"isspecified.Nodependencyisassumed.:FACESTATE2942:vec(29):ADBisusedforarray.:__REAL(KIND=8)

• Thelengthofthevectorizedloopincreases– Bycollapsingbothiloopandjloop

• Successfullyvectorizedbythe“nodep”directive• Appliedforallthesameloopnestsintheroutine

Motivation

• System-awareoptimizationforSX-ACE– Loopcollapseandcompiler-specificdirectives

• Degradereadability– Similaroptimizations needtobeREPEATEDLYapplied• Loopcollapseandinsertionofthedirectiveisappliedallloopnestsinwholecode

• Mightleadhumanerrors

11October,2017 11WSSP26

Therepetitiveoptimizationcanbereplacedwithtransformationrulesandcodetransformations

Xevolver:codetransformationframework

• Xevolver[Takizawa,2014]– Canseparatesystem-awarenessfromacode

• Externalcustomtranslationrulesandcustomdirectives➔Minimizemodificationsandthenkeepmaintainability oftheoriginalcode

11October,2017 12

Xevtgen:TransformationRuleGenerator

• Xevtgen[Suda,2015]– EasilygenerationoftransformationrulesforXevolver• DummyFortrancode

– Fortran-likecodewithsomespecialtgendirectives

• source anddestination patternsofadummyFortrancode

!$xevtgensrcbeginIF(I.EQ.0)EXIT!$xevtgensrcend

!$xevtgendstbeginIF(I==0)THENEXITENDIF!$xevtgendstend

Sourcepattern

Destinationpattern

StandardFortranprogrammerscaneasilylearnandgeneraterules


OptimizationusingXevolver

– Insteadofdirectlymodifyingacode,customrulesandcustomdirectivesaredefinedusingXevolver

➜Minimizemodificationsandthenkeepmaintainability

11October,2017 14

Ateles

Applycodemodification

Makingrulesofmodifications

OptimizedAteles

!$......……

LoopcollapseandNODEPdirective

• CodeTransformationbyXevolver– Justinserting“!$xevcollapse”intotheoriginalcode

• Cankeepthemaintainabilityoftheoriginalcodeasmuchaspossible

!$xevcollapsedoiVar=1,nScalarsdoiElem=1,nElemsdofacepos=1,mpd1_square…

enddoenddo

enddo

doiVar=1,nScalarsm=nElemsn=mpd1_squaremn=m*n

!cdirnodepdoij=0,mn- 1iElem=ij/n+1facepos=mod(ij,n)+1…

enddoenddo

Original+

xevdirective


Transformationrule

Sourcepattern Destinationpattern!$xevtgendstbeginm=nElemsn=mpd1_squaremn=m*n!cdirnodepdoij=0,mn- 1iElem=ij/n+1facepos=mod(ij,n)+1

!$xevtgenstmt(body0)enddo!$xevtgendstend

!$xevtgensrcbegin!$xevcollapsedoiElem=1,nElemsdofacepos=1,mpd1_square

!$xevtgenstmt(body0)enddo

enddo!$xevtgensrcend

11October,2017 WSSP26 1618system-awareoptimizationscanbeappliedbycodetransformations

Experimentalenvironment

• ComparisonofperformanceofAtelesonvariousplatforms– Originalversion– TunedversionforNECSX-ACE

• Loopcollapse• Insertionofcompiler-specificdirectives


Processor Intel XeonE5-2695v2 NEC SX-ACE IBMPower8

#.of cores 2x12cores 4cores 2x 8cores

Mem. Capacity 128 GB 64GB 512GB

Compiler Intel compiler16.03

NEC SX2003Rev.061

PGI16.10(openpower)

Results

– SX-ACE:x7.5speeduptooriginal,18%toInteloriginal– Xeon:Tunedversiondegrades14%– IBM:Tunedversionaccelerates6%

11October,2017

0

20

40

60

80

100

SX-ACE Xeon Power8

ExecutionTime[sec.]

Original Tuned

18WSSP26

Optimizationforparticularplatformisnotalwayseffective.Ourapproachthatusestransformationissuitable!

Detailedanalysisofloopcollapse


• degree=7,level=2

– Vectoroperationratiosbecomemorethan99%– Vectoraveragelengthsbecomemorethan250

FREQ. EXCL.TIME[sec] % AVER.TIME

[msec] MOPS MFLOPS V.OPRATIO

AVER.V.LEN

VECTORTIME

I-CACHEMISS

O-CACHEMISS

BANKCONFLICT ADBHITCPU NWPORT ELEM.%

BEFORE 4584 69.650 83.8 15.194 657.2 25.9 53.85 151.1 1.572 0.001 55.351 0 0 99.99AFTER 4584 2.117 13.7 0.462 18677.2 1135.1 99.83 256 2.114 0.001 0.001 0.485 0.025 73.51

Conclusions• PerformanceanalysisandoptimizationsofAtelesonSX-ACE

– Inlineexpansion– Loopcollapse➔ Degradesthereadabilityandmaintainability

• Forhighmaintainabilityandperformance– CodetransformationbyXevolverisemployed

• onlysmallnumberofdirectivesneedtobeinsertedintotheoriginalcode

• Futurework– IncreaseofcompatibilityofXevolver

• ThebackendofcompilerinfrastructuredoesnotfullysupportFortran2003/2008

• NeedtomodifyAtelesto avoidtheunsupporteddescription


performance tuning of ateles using xevolver · 2936: vec( 2): partiallyvectorized loop. 2936: vec(...

Documents