Improving the Performance of the MILC Code on Intel Knights Landing, An Overview. IXPUG 2017 Fall Meeting, September 26th–28th, 2017, Texas Advanced Computing Center (TACC), Austin, TX


Page 1

Improving the Performance of the MILC Code on Intel Knights Landing, An Overview

IXPUG 2017 Fall Meeting

September 26th–28th, 2017
Texas Advanced Computing Center (TACC)

Austin, TX

Page 2

MILC on KNL Working Group
1. Indiana University: S. Gottlieb, R. Li
2. University of Utah: C. DeTar
3. University of Arizona: D. Toussaint
4. Intel: K. Raman, A. Jha, D. Kalamkar, M. Tolubaeva, T. Phung, R. Malladi
5. Tata Consultancy Services (TCS): G. Bhaskar, P. Gaurav, J. Bhat
6. Jefferson Lab: B. Joó
7. LBL/NERSC: D. Doerfler

• This effort is supported by the Intel Parallel Computing Center at Indiana University
• And is a NERSC/NESAP Tier 1 code in the Cori Phase 2 (KNL) Project

Page 3

Impact of MILC QCD Simulations
• Measuring the fundamental parameters of the Standard Model of particle physics
• And looking for deviations which suggest physics NOT accounted for, i.e. New Physics!
• Method is to use Monte Carlo evaluation of the quantum mechanical path integral
• Plot shows results achieved over the last several years, using resources from multiple facilities
• As the physical grid (lattice) spacing decreases, the computational complexity increases
• Cori is helping with the calculations associated with a = 0.043 femtometers

[Figure: ratio of the decay constants of the K meson to the pion]

Page 4

Benchmarking Platforms
• Endeavor - Intel
  – Intel's internal development cluster
  – Knights Landing (KNL) and Skylake (SKX) nodes
  – Intel OPA high-speed interconnect
• Cori - NERSC
  – Cray XC-40 architecture
  – Intel Haswell & KNL processors
• Edison - NERSC
  – Cray XC-30 architecture
  – Intel Ivy Bridge processors
• Theta – Argonne National Lab
  – Cray XC-40 with KNL nodes
• Stampede – TACC
  – Dell, Intel and Seagate
  – Intel KNL with OPA interconnect

Page 5

MILC computational phases: Optimizations & Performance

Page 6

Where does MILC spend its time?
• Representative runtime breakdown (single node)
• su3_rhmd_hisq application

[Figure: runtime in seconds (axis 0 to 600) on a 16x16x16x16 lattice, one bar per configuration: Haswell 2 MPI + 16 OMP; Haswell MPI-only, 32 ranks; Haswell OMP-only, 32 threads; KNL MPI-only, 64 ranks; KNL OMP-only, 64 threads. Legend: Staggered CG, FF, GF, FL, others.]

Page 7

Profiling Tools and Methods
• Intel VTune and Advisor
• Good ol' Linux prof and gprof
• Roofline analysis
  – e.g. used to look at how close we are to the KNL MCDRAM BW roofline
• Integrated Performance Monitoring (IPM) tool
• And, MILC has extensive timing support, in particular for functions known to be time consuming (time in sec. and GF/s reports)…
  – However, we found some timings didn't add up
    • Needing additional timing for code not represented
    • and this led to routines getting OpenMP that were assumed to be trivial
  – E.g. Update_u() calls from the "trajectory update"
    • Function speedup after threading was 42.5x
    • resulting in an overall 12% improvement to the trajectory update
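The two Update_u() numbers above can be related through Amdahl's law. The sketch below is only a back-of-the-envelope check, not something taken from the slides: it assumes that "12% improvement" means a 1.12x speedup of the whole trajectory update, and the function names are made up for illustration. Under that reading, the threaded routine accounted for roughly 11% of the original trajectory-update time.

```python
def amdahl_overall_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the runtime is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def fraction_from_speedups(overall_speedup, local_speedup):
    """Invert Amdahl's law: share of the original runtime spent in the accelerated code."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / local_speedup)


# Assumption: "12% improvement" is read as a 1.12x speedup of the trajectory update.
share = fraction_from_speedups(1.12, 42.5)
print(f"implied share of trajectory-update time in the threaded routine: {share:.1%}")
```

A routine with such a small share is easy to dismiss as trivial, which fits the slide's point that it only showed up once the missing timers were added.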

Page 8

OpenMP Enhancements in Baseline MILC

Directory     # of files   # files with       # of loops   # of loops   % of loops
              total        candidate loops                 modified     rewritten
generic       102          43                 229          71           31%
generic_ks    185          42                 529          17           3.2%
ks_imp_rhmc   20           8                  36           4            11%

• MILC abstracts loops with FORALLSITES and FORALLFIELDSITES.
  – Convert to FORALLxxx_OMP macros
  – Candidate OpenMP loops need to be examined for OMP private variables and reductions
• Examination of all loops would be tedious, so we used the tools mentioned in the previous slide to help us identify the loops with potentially the most impact

Page 9

Integrating QPhiX Solver for Staggered Fermions (aka The Big Win)

• QPhiX, Staggered Fermion version, developed to improve vectorization and threading performance
• Staggered fermions vs. Wilson/Clover
  – Looking at a different "action" than Wilson/Clover
  – Multi-mass CG solver implementation
  – Uses a single right hand side -> limits reuse and decreases arithmetic intensity
  – 3 complex values per grid point vs. 12
  – Primarily used with double-precision variables
• X-Y blocking for SIMD vectorization
  – Data stored as arrays of structures of arrays
  – SoA in the X dimension
  – SIMD_width/SOA_length in the Y dimension
  – Enables efficient cache blocking of X-Z

[Figure: diagram of the X-Y blocking, with axes x and y and annotations "SIMD width" and "SoA length"]

https://github.com/JeffersonLab/qphix
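To make the X-Y blocking bullets concrete, here is a toy index mapping for one X-Y plane. It is an illustrative sketch only, not QPhiX's actual indexing: the function name, the row-major ordering of vectors, and the omission of even/odd checkerboarding are all assumptions. The idea it shows is that one SIMD vector holds SoA_length consecutive x sites from each of SIMD_width/SoA_length neighbouring y rows.

```python
def pack_site(x, y, nx, soa_len, simd_width):
    """Map site (x, y) of an X-Y plane to (vector index, lane within the vector).

    Hypothetical layout: each SIMD vector packs `soa_len` consecutive x sites
    from each of `simd_width // soa_len` adjacent y rows.
    """
    ngy = simd_width // soa_len            # y rows packed into one vector
    x_block, x_in = divmod(x, soa_len)     # which x chunk, offset inside the chunk
    y_block, y_in = divmod(y, ngy)         # which group of y rows, row inside the group
    n_xblocks = nx // soa_len              # vectors needed to cover one group of rows
    vector = y_block * n_xblocks + x_block
    lane = y_in * soa_len + x_in
    return vector, lane


# Example: 8 x 4 plane, SoA length 4, 8 double-precision lanes (one AVX-512 register).
for y in range(4):
    print([pack_site(x, y, nx=8, soa_len=4, simd_width=8) for x in range(8)])
```

With a short SoA length the x extent needed to fill a vector shrinks, which is what lets small local lattices still use the full SIMD width.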

Page 10

Multi-mass CG Solver (Single Node)

• The QPhiX solver provides a 1.5x improvement for the L=32 lattice
• For small lattice sizes, performance is limited by data remapping time

[Figure: two panels, "MILC CG Performance KNL" and "MILC CG (w. QPhiX) Performance KNL". Y-axis: GFLOPs/sec (0 to 120). X-axis: 1, 2, 4, 8, 16, annotated "Increasing MPI ranks, decreasing threads". Series: L=16, L=32.]

Page 11

Gauge Force (Single Node)

• Gauge Force performance improvements are up to 4x for L=32 lattice size

[Figure: two panels, "MILC GF Performance KNL" and "MILC GF (w. QPhiX) Performance KNL". Y-axis: GFLOPs/sec (0 to 300). X-axis: 1, 2, 4, 8, 16, annotated "Increasing MPI ranks, decreasing threads". Series: L=16, L=32.]

Page 12

Skylake vs. KNL

• Improvements translate to latest Xeon processor with AVX512 as well
• Architectures behave differently wrt rank/thread tradeoffs

[Figure: two panels, "MILC CG (w. QPhiX) L=32" (GFLOPs/sec, 0 to 120) and "MILC GF (w. QPhiX) L=32" (GFLOPs/sec, 0 to 300). X-axis: 1, 2, 4, 8, 16, annotated "Increasing MPI ranks, decreasing threads". Series: SKX base, SKX, KNL.]

Page 13

CG Performance on SKX + OPA (Multi Node)

• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• Requires minimum 2 ranks/node for best performance (1 rank per NUMA node)
• CG solver performance limited by memory bandwidth on Xeons. Hence, no improvement with QPhiX
• Parallel efficiency:
  – ~99% @ 64 nodes for 32^(4) lattice volume per node

[Figure: two panels, "Multimass CG (w/ QPhiX)" and "Multimass CG (Baseline MILC)", lattice volume per node 32^(4), Intel Xeon® 6148 Gold + Intel® OPA. Y-axis: GFLOPS/sec/node (0 to 70). X-axis: log(nodes)/log(2) (0 to 7). Series: 1, 2, 4, 8, 16, 32 ranks/node.]
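The slide does not spell out how parallel efficiency is computed. A common weak-scaling definition, assumed here, is the per-node rate at N nodes divided by the per-node rate on a single node (equivalently, single-node time divided by N-node time when the work per node is fixed). The sample values below are placeholders, not numbers read from the charts.

```python
def weak_scaling_efficiency(rate_per_node, baseline_rate_per_node):
    """Efficiency from per-node throughput (e.g. GFLOPS/sec/node)."""
    return rate_per_node / baseline_rate_per_node


def weak_scaling_efficiency_from_time(time_at_n_nodes, time_at_one_node):
    """Efficiency from wall-clock time with a fixed amount of work per node."""
    return time_at_one_node / time_at_n_nodes


# Placeholder values for illustration only.
print(f"{weak_scaling_efficiency(49.5, 50.0):.0%}")
print(f"{weak_scaling_efficiency_from_time(12.5, 10.0):.0%}")
```

Under this definition, ideal weak scaling appears as a flat line in a per-node throughput plot.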

Page 14

GF Performance on SKX + OPA (Multi Node)

• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~85% @ 64 nodes for 32^(4) lattice volume per node

[Figure: two panels, "Gauge Force (w/ QPhiX)", lattice volume per node 32^(4) (GFLOPS/sec/node, 0 to 300), and "Gauge Force (Baseline MILC)", lattice volume 32^(4) (GFLOPS/sec/node, 0 to 35), Intel Xeon® 6148 Gold + Intel® OPA. X-axis: log(nodes)/log(2) (0 to 7). Series: 1, 2, 4, 8, 16, 32 ranks/node.]

Page 15

MPI Messaging Characteristics, 16 KNL Nodes

• Message sizes vary with number of ranks per node (RPN)
• As RPN decreases
  – Lattice size (volume) per rank increases -> message sizes increase
  – Surface-to-volume decreases -> total amount of data decreases (proportional to 1/N4)
  – Same analogy with increasing lattice size (volume) per node

[Figure: "MILC Multimass CG: Total Time, 16 Nodes, L=16". Y-axis: time (seconds), 0 to 90. X-axis: ranks per node (64, 32, 16, 8, 4, 2, 1). Legend: Compute, Irecv, Isend, Allreduce, Wait.]

[Figure: "MILC Multimass CG, 16 Nodes, L=16". Y-axis: number of messages (1e6), log scale 1 to 100. X-axis: message size (bytes): 6K, 8K, 12K, 16K, 32K, 64K, 128K, 256K. Series: 64, 32, 16, 8, 4, 2, 1 rpn.]
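The relationship between ranks per node, message size and surface-to-volume ratio can be checked with a few lines of arithmetic. The sketch below is a simplified model, not MILC's actual layout code: it assumes the per-node 16^4 lattice is divided by repeatedly halving the largest dimension, and it counts every site on the eight faces of the 4D sub-lattice as surface.

```python
from math import prod


def split_lattice(dims, ranks):
    """Local sub-lattice after halving the largest dimension until `ranks` pieces exist.

    Assumes `ranks` is a power of two (hypothetical layout rule).
    """
    local = list(dims)
    remaining = ranks
    while remaining > 1:
        i = local.index(max(local))
        local[i] //= 2
        remaining //= 2
    return tuple(local)


def surface_sites(local):
    """Sites on the 2 * 4 faces of a 4D sub-lattice."""
    volume = prod(local)
    return sum(2 * volume // d for d in local)


for rpn in (64, 16, 4, 1):
    local = split_lattice((16, 16, 16, 16), rpn)
    volume = prod(local)
    largest_face = volume // min(local)   # sites in the largest single face message
    ratio = surface_sites(local) / volume
    print(rpn, local, largest_face, round(ratio, 3))
```

In this model, going from 64 ranks per node to 1 raises the largest face from 256 to 4096 sites while the surface-to-volume ratio drops from 1.5 to 0.5, matching the direction of both bullets.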

Page 16

MPI Characteristics, 512 Nodes

• Caveat 1: at this scale, the "full" IPM instrumentation impacts absolute time, but relative times "should" be representative (I need more confidence that this is true)
• Caveat 2: Results are from a single trial, could be up to 10% variability

[Figure: two panels, "QPhiX Multimass CG: Total Time, 512 Nodes, L=16x32x32x32" and "MILC Multimass CG: Total Time, 512 Nodes, L=16x32x32x32". X-axis: ranks per node (64, 32, 16, 8, 4, 2, 1). Legend: Compute, Irecv, Isend, Allreduce, Wait.]

Page 17

Cori (Cray Aries) Huge Pages Optimization

• MPI message rate micro-benchmark identified an issue where BW drops significantly when transitioning to the Rendezvous protocol
• Two solutions tried:
  • Move Rendezvous transition to 64 KB
  • Per Cray advice, use huge pages, 2M pages in this case
• This has a significant impact on performance when using a large number of MPI ranks per node
• Recommendation -> Use huge pages in communication intensive codes with moderate message size

[Figure: "SMB Msgrate Bandwidth: Cori/KNL". Y-axis: GB/s (0 to 12). X-axis: message size in bytes (8 to 1M). Series: 1/2 BW, r=1, r=2, r=4, r=8, r=16, r=32, r=64. Regions annotated "Eager protocol" and "Rendezvous protocol".]

[Figure: "SMB Msgrate Bandwidth: Cori/KNL". Y-axis: GB/s (0 to 12). X-axis: message size in bytes (8 to 1M). Series: 1/2 BW, r=8, r=8 w/ huge 2M, r=8 eager=64K.]

[Figure: "MILC CG Solve Time: 432 Nodes, 72x72x72x144 lattice". Y-axis: seconds (0 to 120). X-axis: ranks per node (64, 32, 16, 8, 4, 2, 1) and threads per rank (1, 2, 4, 8, 16, 32, 64). Series: eager=8KB, eager=64KB, w/ huge 2M.]

Page 18

Roofline Analysis
• Helps to focus on areas of code to target for optimization
• MILC is known to be memory BW bound, and the QPhiX version is near the roofline model, but the baseline version has higher AI and lower performance?

[Figure: "Roofline Model for KNL"]

[Figure: "MILC Performance vs. Roofline", titled "MILC Staggered CG Solve Roofline for KNL, 28x28x28x28 lattice". Y-axis: GFLOP/s (log scale, 1 to 1000). X-axis: Arithmetic Intensity (FLOP/byte) (log scale, 0.1 to 10). Series: Emp. MCDRAM Roofline, Emp. DDR Roofline, Baseline w/ MCDRAM, QPhiX w/ MCDRAM, QPhiX w/ DDR.]
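For readers unfamiliar with the roofline model, the bound it draws is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The sketch below shows that formula only; the peak and bandwidth values are placeholders chosen for illustration, not the empirical MCDRAM or DDR rooflines measured for these slides.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_per_s):
    """Attainable GFLOP/s: min(compute peak, AI * memory bandwidth)."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gb_per_s)


def fraction_of_roofline(measured_gflops, arithmetic_intensity, peak_gflops, bandwidth_gb_per_s):
    """How close a measured kernel sits to its roofline bound (1.0 means on the roof)."""
    return measured_gflops / roofline_gflops(arithmetic_intensity, peak_gflops, bandwidth_gb_per_s)


# Placeholder machine parameters (hypothetical, for illustration only).
PEAK = 2000.0       # GFLOP/s
FAST_BW = 400.0     # GB/s, stand-in for a high-bandwidth memory roofline
SLOW_BW = 80.0      # GB/s, stand-in for a DDR roofline

for ai in (0.1, 0.5, 1.0, 10.0):
    print(ai, roofline_gflops(ai, PEAK, FAST_BW), roofline_gflops(ai, PEAK, SLOW_BW))
```

A kernel with higher AI but lower GFLOP/s than another, as the slide reports for baseline MILC, sits further below its roof, which suggests something other than memory bandwidth is limiting it.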

Page 19

Impact on Production Computing

Page 20

The Rubber Hitting the Road
• Current Cori (KNL) production runs
  – 96x96x96x192 lattice on 128 nodes
  – ks_spectrum_hisq: calculation of meson and baryon spectra from a wide variety of sources and sinks
  – su3_clov_hisq: generating clover propagators and contracting meson and baryon two point functions
• QPhiX version of the multi-mass staggered fermion solver is being used

                     Staggered CG   Clover BiCG
Standard MILC [1]    40 GF/s        21-69 [2] GF/s
QPhiX [3]            52 GF/s        69 GF/s
QPhiX Improvement    1.3x           1x to 3.3x

1. 64 ranks per node, 1 thread per rank, includes OpenMP improvements
2. 256 nodes running a different problem
3. 32 ranks per node, 8 threads per rank

Page 21

Advantage of using Cori's Burst Buffer for I/O

• I/O overhead is being significantly reduced by using Cori's Burst Buffer (BB) subsystem
• Typical wallclock for 128 node run is ~5 hours
• Using nominal Lustre /scratch, 53 minutes spent writing 61 temporary files
• Using BB, I/O was reduced to 26 minutes
  – 2x improvement

Page 22

Current and Future Efforts
• Fermion Force calculation optimizations
• Fat Links calculation optimizations
• High-speed interconnect explorations
  – Multiple endpoints implementation of OPA MPI
• Continue improvements to OpenMP
  – Roofline analysis shows that baseline MILC is perhaps not BW bound
• Investigating Grid solver
• Continue working on SciDAC/QOP version of the code

Page 23

Discussion & Questions

Page 24

Backup

Page 25

CG Performance on SKX + OPA (Multi Node)

• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• Requires minimum 2 ranks/node for best performance (1 rank per NUMA node)
• Smaller volume [16^(4) per node] becomes communication sensitive at larger node counts
• Parallel efficiency:
  – ~80% @ 64 nodes for 16^(4) lattice volume per node

[Figure: two panels, "Multimass CG (w/ QPhiX)" and "Multimass CG (Baseline MILC)", lattice volume per node 16^(4), Intel Xeon® 6148 Gold + Intel® OPA. Y-axis: GFLOPS/sec/node (0 to 70). X-axis: log(nodes)/log(2) (0 to 7). Series: 1, 2, 4, 8, 16, 32 ranks/node.]

Page 26

GF Performance on SKX + OPA (Multi Node)

• Weak scaling runs on Intel Xeon® 6148 Gold + Intel® OPA on Intel's Endeavor cluster
• ~3.5x node-level performance improvement with QPhiX
  – Not seeing LLC effects with QPhiX as seen with Baseline MILC (need to investigate)
• Multiple ranks/node gives better performance for Gauge Force
• Parallel efficiency:
  – ~92% @ 64 nodes for 16^(4) lattice volume per node

[Figure: two panels, "Gauge Force (w/ QPhiX)" (GFLOPS/sec/node, 0 to 300) and "Gauge Force (Baseline MILC)" (GFLOPS/sec/node, 0 to 80), lattice volume per node 16^(4), Intel Xeon® 6148 Gold + Intel® OPA. X-axis: log(nodes)/log(2) (0 to 7). Series: 1, 2, 4, 8, 16, 32 ranks/node.]

Page 27

Communication Profile

• Profile collected using Intel® MPI Performance Snapshot Tool (part of ITAC)
• Smaller lattice volume (16^(4)) spends high % of time in MPI as expected

Compute vs. MPI (CG Solver), lattice volume per node: 32^(4), 8 MPI ranks/node (% time)
log(nodes)/log(2):   0      2      3      4      5      6
% Computation:       94.59  91.93  91.81  90.45  89.80  87.49
% MPI:               5.41   8.07   8.19   9.55   10.20  12.51

Compute vs. MPI (CG Solver), lattice volume: 16^(4), 8 MPI ranks/node (% time)
log(nodes)/log(2):   0      2      3      4      5
% Computation:       81.47  76.69  76.52  75.13  70.71
% MPI:               18.53  23.31  23.48  24.87  29.29

Page 28

MPI Function Summary

• Most of MPI time spent in MPI_Wait (i.e. P2P Send/Recv completion)
• Collective ops time (i.e. % MPI_Allreduce) increases with node counts
  – MPI_Allreduce ~40% of MPI time at 64 nodes
  – Potential bottleneck at very large node counts

MPI Function Summary (CG Solver), lattice volume: 16^(4) (% MPI time, normalized)
log(nodes)/log(2):                  0      2      3      4      5
% MPI_Wait:                         80.98  73.03  61.39  50.53  44.18
% MPI_Allreduce:                    13.55  20.05  27.5   29.97  40.05
Third series (label not preserved): 3.68   5.79   9.82   18.03  14.49
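Combining this chart with the 16^(4) panel of the Communication Profile slide gives a rough estimate of how much of the total runtime goes to MPI_Allreduce. This is a sketch under two assumptions that the transcript cannot confirm: that the smaller series on the Communication Profile slide is the MPI share of runtime, and that both charts come from the same runs with matching x-axis positions.

```python
# Values transcribed from the two 16^(4) charts; the pairing by x-axis position is an assumption.
log2_nodes = [0, 2, 3, 4, 5]
pct_mpi_of_runtime = [18.53, 23.31, 23.48, 24.87, 29.29]
pct_allreduce_of_mpi = [13.55, 20.05, 27.5, 29.97, 40.05]


def allreduce_share_of_runtime(pct_mpi, pct_allreduce):
    """Percent of total runtime spent in MPI_Allreduce at each point."""
    return [m * a / 100.0 for m, a in zip(pct_mpi, pct_allreduce)]


shares = allreduce_share_of_runtime(pct_mpi_of_runtime, pct_allreduce_of_mpi)
for n, s in zip(log2_nodes, shares):
    print(f"log2(nodes)={n}: MPI_Allreduce is about {s:.1f}% of runtime")
```

Under these assumptions the collective grows from a few percent of runtime on one node to roughly 12% at the largest point shown, which is consistent with the slide's concern about very large node counts.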

Page 29

Staggered Multi-mass CG: Lattice Scaling Study

MILC baseline:
• 64 MPI ranks per node

Staggered QPhiX:
• 1 MPI rank on 1 node
• up to 16 ranks on multiple nodes

Page 30

Multimass CG: Weak Scaling on Cori

[Figure: two panels, "QPhiX" and "MILC baseline". Various MPI rank/OMP thread combinations, L=24.]

Page 31

Benchmarks: Symanzik Gauge Force

MILC baseline:
• 64 MPI ranks per node

QOPQDP:
• 64 MPI ranks per node

QPhiX:
• 1 MPI rank with 1 node
• 16 ranks/node otherwise

[Figure: GF performance in GFLOPS/node for various lattice sizes per node L, 1 thread/core]

Page 32

Benchmarks: HISQ Fermion Force

MILC baseline:
• 64 MPI ranks per node

QOPQDP:
• 64 MPI ranks per node

Modified algorithm vs. MILC FF
• Speedup due to a reduced number of calculations (FLOPs), to 17% of baseline

[Figure: "Speedup of QOPQDP". Y-axis: 1.00 to 9.00. X-axis: 1, 16, 64. Series: L=16, L=24, L=32.]

Page 33

Cori and Stampede
• Both use the same 68-core KNL SKUs
• Cori uses Aries high-speed interconnect, Stampede uses Intel Omni-Path (OPA)
  – Cray and Intel MPI respectively
• Cori uses Cray CNL, Stampede is ??? Linux
• Standard (non-QPhiX) version of MILC
  – Primary analysis is high-speed interconnect performance, using relatively small lattice size per node
• Intel C for both, although 17.0.2.x vs. 17.0.0.x
• Limited to maximum job size of 80 nodes on Stampede -> limited scaling study to 64 nodes

Page 34

OSU Single rank Pt-2-Pt

[Figure: three panels, "Ping-Pong Bandwidth", "Uni-directional Bandwidth" and "Bi-directional Bandwidth". Y-axis: Bandwidth (GB/s), 0 to 16. X-axis: message size (bytes), 1K to 4M. Series: Stampede, Cori. Annotation: "Increasing message rate".]

• Single core, point-to-point between 2 nodes
• Latency is comparable at ~3.1 μs (but about 2x that of Haswell)
• Ping-pong is exactly that, single message ping-ponged between 2 nodes
• Uni-directional is a "streaming" exchange with a window of size 64
• Bi-directional is also "streaming"
• Cori shows better small message BW (message rate) and Stampede higher peak BW

Page 35

OSU Multi-rank Pt-2-Pt

• As ranks per node (RPN) increase in this uni-directional test, Stampede's performance improves.
• Cori has a BW "ceiling" below 32 RPN that limits large message BW that is not observed with Stampede
  – Cray attributes this to a PCI latency issue between KNL and Aries, can be mitigated by moving BTE "put" protocol transition to a smaller message (default is 4 MB)
• It looks as though Stampede could use some tuning in its transition to a large message protocol to take advantage of its higher peak BW

[Figure: two panels, "Stampede Multi-Rank Bandwidth" and "Cori Multi-Rank Bandwidth". Y-axis: Bandwidth (GB/s), 0 to 14. X-axis: message size (bytes), 64 to 4M. Series: 1/2 BW, 1, 2, 4, 8, 16, 32, 64 rpn. Annotation: "Increasing message rate".]

Page 36

SMB Multi-node, Multi-rank Stencil

• 16 nodes, 6 neighbors per rank (emulates 3D stencil communication pattern)
• Measures bi-directional message rate (converted to BW)
• Here we see Stampede is able to achieve close to its peak bi-directional BW (25 GB/s)
  – However, something happens at 1 MB message size, to be investigated
• Stampede could also use some tuning in its protocol transition to large messages (>64 KB)

[Figure: two panels, "Stampede SMB" and "Cori SMB". Y-axis: Bandwidth (GB/s), 0 to 30. X-axis: message size (bytes), 64 to 1M. Series: 1, 2, 4, 8, 16, 32, 64 rpn. Annotation: "Increasing message rate".]
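The "converted to BW" step is just message rate times message size. A minimal sketch, assuming decimal gigabytes (1 GB = 1e9 bytes) and using made-up rates rather than values from the benchmark:

```python
def bandwidth_gb_per_s(messages_per_second, message_bytes):
    """Convert a message rate at a given message size to bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return messages_per_second * message_bytes / 1e9


def message_rate(bandwidth_gb_s, message_bytes):
    """Inverse conversion: messages per second needed to sustain a bandwidth."""
    return bandwidth_gb_s * 1e9 / message_bytes


# Hypothetical example: messages per second needed to sustain 25 GB/s at two message sizes.
for size in (8 * 1024, 64 * 1024):
    print(size, f"{message_rate(25.0, size):,.0f} msgs/s")
```

The same bandwidth needs eight times the message rate at 8 KB as at 64 KB, which is why the small-message end of these plots is effectively a message-rate measurement.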

Page 37

Weak Scaling: Number of Nodes

• Weak scaled: lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32 and 64 nodes
  – number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64
  – All results use 1 thread per core

[Figure: two panels, "Cori: Multimass CG L=16" and "Stampede: Multimass CG L=16". Y-axis: GFLOP/s (0 to 70). X-axis: number of nodes (2^^N), N = 0 to 6. Series: 1, 2, 4, 8, 16, 32, 64 rpn.]
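With 64 cores per node and one thread per core, each ranks-per-node setting fixes the OpenMP thread count per rank. The sketch below enumerates the combinations implied by that constraint; it assumes every rank gets the same number of threads, and setting the result through the standard OMP_NUM_THREADS environment variable is one common way to apply it.

```python
def rank_thread_combinations(cores_per_node):
    """All (MPI ranks per node, OpenMP threads per rank) pairs that use every core once."""
    return [
        (ranks, cores_per_node // ranks)
        for ranks in range(1, cores_per_node + 1)
        if cores_per_node % ranks == 0
    ]


for ranks, threads in rank_thread_combinations(64):
    print(f"{ranks:2d} ranks/node -> OMP_NUM_THREADS={threads}")
```

For 64 cores this yields the seven power-of-two settings that appear as series in the charts (1, 2, 4, 8, 16, 32 and 64 rpn).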

Page 38

Weak Scaling: Ranks per Node

• Weak scaled: lattice size is 16x16x16x16 per node
• MPI/OpenMP tradeoff study at 1, 8, 16, 32 and 64 nodes
  – number of cores fixed at 64
  – MPI ranks/node (rpn) varied from 1 to 64

[Figure: two panels, "Cori: Multimass CG L=16" and "Stampede: Multimass CG L=16". Y-axis: GFLOP/s (0 to 70). X-axis: MPI ranks per node (log scale, 1 to 100). Series: 1, 8, 16, 32, 64 nodes.]

Page 39

Weak Scaling: Select MPI/OMP

• CG time and CG scaling for selected MPI/OpenMP combinations
  – Ideal scaling would be a flat horizontal line
• Stampede scaling is better with higher RPN
• There is a scaling improvement on Cori going from 32 to 64 RPN, but I wouldn't read too much into it at such a small scale

[Figure: "Cori CG Performance Weak Scaling, L=16 per node". Y-axis: Time (sec), 0 to 70. X-axis: number of nodes (1, 8, 16, 32, 64). Series: Cori 64/1, Cori 8/8, Cori 2/32, Stampede 64/1, Stampede 8/8, Stampede 2/32.]

[Figure: "Cori: CG Weak Scaling, L=16". Y-axis: scaling efficiency, 0.80 to 1.00. X-axis: number of nodes (16, 32, 64). Series: Cori 64/1, Cori 8/8, Cori 2/32, Stampede 64/1, Stampede 8/8, Stampede 2/32.]