Performance optimization of WEST and Qbox on Intel Knights Landing
Huihuo Zheng (1), Christopher Knight (1), Giulia Galli (1,2), Marco Govoni (1,2), and Francois Gygi (3)
(1) Argonne National Laboratory
(2) University of Chicago
(3) University of California, Davis
September 27th, 2017
IXPUG Annual Fall Conference 2017
ALCF Theta Early Science Project: First-Principles Simulations of Functional Materials for Energy Conversion
http://qboxcode.org/; http://west-code.org; http://www.quantum-espresso.org/
M. Govoni, G. Galli, J. Chem. Theory Comput. 2015, 11, 2680−2696
P. Giannozzi et al., J. Phys.: Condens. Matter, 21, 395502 (2009)
• Embedded nanocrystal: T. Li, Phys. Rev. Lett. 107, 206805 (2011)
• Matter at extreme conditions: D. Pan et al., Proc. Nat'l Acad. Sci. 110, 6250 (2013)
• Aqueous solution: A. Gaiduk et al., J. Am. Chem. Soc. Comm. (2016)
• Organic photovoltaics: M. Goldey, Phys. Chem. Chem. Phys., Advance Article (2016)
• Quantum information: H. Seo, Sci. Rep. 2016; 6: 20803
Optimization focus
[Figure: time-to-solution vs. number of processors, with the strong scaling limit marked]
• Utilizing tuned math libraries (FFTW, MKL, ELPA, …)
• Vectorization: AVX512
• High Bandwidth Memory
• Adding extra layers of parallelization -> increase intrinsic scaling limit
• Reducing communication overhead to reach the intrinsic limit
3,624 KNL nodes, 9.65 petaFLOPS
Outline
• WEST – additional layers of parallelization
  • Band parallelization of Sternheimer equation
  • Task group parallelization to fit 3D FFTs within a single KNL node to reduce communication overheads and take advantage of HBM
• Qbox – reduce communication overheads of dense linear algebra with on-the-fly data redistribution
  • Gather & scatter remap
  • Transpose remap
• Conclusions and insights
Optoelectronic calculations using many-body perturbation theory (GW)
Linear response theory: Δρ = χ ΔV_pert
(Δρ: electronic density; χ: response function; ΔV_pert: perturbation potential)
Massively parallel by distributing perturbations:
ΔV_1, ΔV_2, ΔV_3, ΔV_4, …, ΔV_(N−1), ΔV_N -> Δρ_1, Δρ_2, Δρ_3, Δρ_4, …, Δρ_(N−1), Δρ_N (via 3D FFTs)
Parallelization scheme (image groups & plane waves)
Intrinsic strong scaling limit: n_proc ~ N_pert × N_z
Efficient strong scaling up to the intrinsic limit, then scaling stops.
[Figure: strong scaling for Si35H36, 176 electrons, 256 perturbations]
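The distribution of perturbations over image groups, and the resulting scaling limit, can be sketched as below. This is a minimal illustration and not WEST's actual implementation: the function names, the round-robin assignment, and the value used for N_z in the usage note are assumptions made for the example.

```python
def assign_perturbations(n_pert, n_image):
    """Round-robin assignment of perturbation indices to image groups (assumed scheme)."""
    if n_image < 1 or n_image > n_pert:
        raise ValueError("need 1 <= n_image <= n_pert")
    groups = [[] for _ in range(n_image)]
    for p in range(n_pert):
        groups[p % n_image].append(p)
    return groups


def intrinsic_limit(n_pert, n_z):
    """Process count at which every image group holds one perturbation
    and the plane-wave level uses n_z processes per group."""
    return n_pert * n_z
```

For example, `assign_perturbations(256, 64)` gives 64 image groups with 4 perturbations each, and with a hypothetical N_z of 100 the limit would be `intrinsic_limit(256, 100)`, i.e. 25,600 processes. Beyond that point there is no independent work left to hand out at these two levels.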
Single perturbation runtime (4 BG/Q vs. 1 KNL)
• 80% of runtime is spent in external libraries
• 3.7x speedup from BG/Q (ESSL) to KNL (MKL)
• High-bandwidth memory on Theta is critical for performance (e.g. 3D FFTs): 3.1x speedup
[Figure: single-perturbation runtime for CdSe, 884 electrons; labels: BG/Q ESSL, cache mode]
Improvement of strong scaling by band parallelization of Sternheimer equation
• Increased parallelism by arranging the MPI ranks in a 3D grid (perturbations & bands & FFT)
• New intrinsic strong scaling limit: n_proc = N_pert × N_band × N_z
• Cost: AllReduce once across band groups (relatively cheap)
[Figure: strong scaling for Si35H36, 176 electrons, 256 perturbations; labels: image para., band para., Sternheimer equation, call hpsi]
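One way to picture the 3D arrangement of MPI ranks (perturbations & bands & FFT) is a mapping from a flat rank to three coordinates. The sketch below is an assumption for illustration only: the function names and the ordering (FFT index varying fastest) are hypothetical and not taken from WEST.

```python
def rank_to_coords(rank, n_image, n_band, n_fft):
    """Map a flat rank to (image, band, fft) coordinates; fft varies fastest."""
    if not 0 <= rank < n_image * n_band * n_fft:
        raise ValueError("rank out of range")
    image, rest = divmod(rank, n_band * n_fft)
    band, fft = divmod(rest, n_fft)
    return image, band, fft


def band_group_peers(rank, n_image, n_band, n_fft):
    """Ranks sharing this rank's image and fft coordinates.

    These are the ranks that would take part in the single AllReduce
    across band groups.
    """
    image, _, fft = rank_to_coords(rank, n_image, n_band, n_fft)
    return [(image * n_band + b) * n_fft + fft for b in range(n_band)]
```

With 2 image groups, 3 band groups and 4 FFT ranks, rank 5 sits at coordinates (0, 1, 1) and reduces with ranks 1, 5 and 9. In an MPI code the same grouping would typically be expressed with sub-communicators.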
Improving strong scaling of 3D FFTs using task group parallelization when a band group spreads across multiple nodes
Small 3D FFTs do not scale well across multiple KNL nodes because of internode communication overheads relative to shared-memory MPI. Task groups (tg) redistribute complete wavefunctions to separate nodes to simultaneously compute multiple 3D FFTs.
[Figure: strong scaling of 3D FFT (plane and pencil decomposition) on Cetus (BG/Q) and Theta (KNL) using a 256x256x256 FFT grid and FFTXlib in QE; annotations: cost for data redistribution, MPI_Alltoall within single KNL node]
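The task-group redistribution can be illustrated with a small simulation in which an all-to-all exchange turns "every rank holds a slice of every wavefunction" into "every rank holds one complete wavefunction". This is a simplified sketch: each list entry stands in for a whole task group, the all-to-all is simulated with plain lists rather than MPI, and the function names are hypothetical.

```python
def alltoall(send):
    """Simulate MPI_Alltoall: send[src][dst] becomes recv[dst][src]."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]


def to_task_groups(slices):
    """slices[r][b] is the piece of wavefunction b held by rank r.

    After the exchange, rank r holds every piece of wavefunction r in
    order, so it can run that 3D FFT without further communication.
    """
    received = alltoall(slices)
    return [[x for piece in pieces for x in piece] for pieces in received]
```

The price is the exchange itself (the data redistribution cost noted on the slide); the gain is that several FFTs proceed at once, each confined to fewer processes.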
Strong scaling of Qbox for hybrid-DFT calculations
SiC512: 512 atoms, 2048 electrons, PBE0
F. Gygi and I. Duchemin, J. Chem. Theory Comput., 2013, 9 (1), pp 582–587
[Figure: strong scaling; labels: dgemm; dgemm, Gram-Schmidt (syrk, potrf, trsm); exact exchange 3D FFTs]
Data layout: block distribution of wave functions to a 2D process grid
ψ_i(k), i = 1, 2, …, N_band, k = 0, 1, …, n_pw − 1
SiC512: 140,288 × 1,024
[Figure: wave function matrix (140,288 × 1,024, indices k and i) on the nprow × npcol process grid; labels: FFT, MPI_Alltoallv, MPI_Allreduce]
Good for 3D FFTs
Intrinsic limit: n_proc = N_band × N_z
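To make the block distribution concrete, the sketch below computes how many rows and columns of the wave function matrix a given process would own. It assumes contiguous blocks, with the plane-wave index split over process rows and the band index over process columns; the helper names and the 128 × 1,024 grid in the usage note are hypothetical choices for illustration, not values stated on the slide.

```python
def block_range(n, nblocks, b):
    """Half-open index range owned by block b when n items are split
    into nblocks contiguous blocks of (at most) equal size."""
    size = -(-n // nblocks)  # ceiling division
    start = min(b * size, n)
    return start, min(start + size, n)


def local_shape(n_pw, n_band, nprow, npcol, prow, pcol):
    """Local (rows, columns) of the n_pw x n_band matrix on process (prow, pcol)."""
    r0, r1 = block_range(n_pw, nprow, prow)
    c0, c1 = block_range(n_band, npcol, pcol)
    return r1 - r0, c1 - c0
```

For the 140,288 × 1,024 matrix on a hypothetical 128 × 1,024 grid, `local_shape(140288, 1024, 128, 1024, 0, 0)` is `(1096, 1)`: each process holds part of a single band, which suits the FFTs but leaves very little local work for dense linear algebra.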
Poor scalability of ScaLAPACK for tall-skinny matrices and small square matrices due to communication overheads
[Figure: overlap matrix (small square matrix) = product of wave function matrices (tall-skinny matrices); labels: Gram-Schmidt, d(z)gemm]
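The shapes involved can be illustrated with a toy overlap-matrix computation: each process multiplies its local rows of the tall-skinny wave function matrix, and a sum reduction yields the small square result. This sketch assumes real-valued data and a row-only distribution, and simulates the reduction with plain lists; it is not how ScaLAPACK or Qbox implement the operation.

```python
def local_overlap(block):
    """block holds this process's rows (plane waves) for all bands.

    Returns the local contribution block^T * block (n_band x n_band).
    """
    nb = len(block[0])
    s = [[0.0] * nb for _ in range(nb)]
    for row in block:
        for i in range(nb):
            for j in range(nb):
                s[i][j] += row[i] * row[j]
    return s


def allreduce_sum(mats):
    """Simulate MPI_Allreduce(SUM) over the per-process matrices."""
    nb = len(mats[0])
    return [[sum(m[i][j] for m in mats) for j in range(nb)] for i in range(nb)]
```

The result has only n_band × n_band entries no matter how many processes contribute, so adding processes mostly adds communication rather than useful work per process.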
Reducing communication overheads from ScaLAPACK with “gather & scatter” remap
Solution: creating a context with fewer columns and on-the-fly data redistribution
• Compute 3D FFTs on original grid
• Gather data to smaller grid
• Run ScaLAPACK on smaller grid
• Scatter data back to original grid
The remap communication pattern only involves procs within the same row or column.
Key: remap communication time needs to be small.
[Figure: schematic of gather & scatter remap and its communication pattern; gray processes are idle during ScaLAPACK computation]
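The gather step can be sketched as a plan that tells each active process whose data it collects. The choice of which columns stay active (here every `factor`-th column) and the function names are assumptions for illustration; the slide only states that the new context has fewer columns and that communication stays within a row or column.

```python
def gather_target(row, col, factor):
    """Active process that receives the data of (row, col); same row by construction."""
    return row, (col // factor) * factor


def gather_plan(nprow, npcol, factor):
    """Map each active process to the list of processes whose data it gathers."""
    plan = {}
    for r in range(nprow):
        for c in range(npcol):
            plan.setdefault(gather_target(r, c, factor), []).append((r, c))
    return plan
```

On a 2 × 8 grid with `factor=4`, four processes remain active and the other twelve are idle while ScaLAPACK runs. The scatter step simply reverses the plan.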
Improvement of strong scaling using “gather & scatter” remap
hpsi + wf_update time remains relatively flat with remap, and the (custom) remap time is two orders of magnitude smaller than the hpsi + wf_update time.
Custom remap function is 1000x faster than ScaLAPACK’s pdgemr2d.
Qbox’s strong scaling improves after the optimizations: runtime drops from ~400 s to ~30 s per SCF iteration (13x speedup) on 131,072 ranks for 2048 electrons.
[Figure: runtime per SCF iteration; labels: 400.8 s, 67.5 s, 30.5 s]
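The quoted 13x figure follows directly from the runtimes shown on the slide. The short check below only does that arithmetic; the function name is hypothetical, and the transcript does not say which configuration the intermediate 67.5 s value belongs to.

```python
def speedup(t_before, t_after):
    """Ratio of two runtimes; values above 1 mean the second run is faster."""
    return t_before / t_after


# Per-SCF-iteration runtimes quoted on the slide, in seconds.
t_start, t_mid, t_end = 400.8, 67.5, 30.5

overall = speedup(t_start, t_end)      # about 13.1
first_step = speedup(t_start, t_mid)   # about 5.9
```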
Reducing communication overheads from ScaLAPACK by “transpose” remap
Problem of “gather & scatter”: idle processes. How to utilize them? Assign idle processes to active columns.
Transpose remap:
• Perform 3D FFTs in the original context.
• Transfer data through a series of local regional transposes.
• Run ScaLAPACK in the new context.
Key concept for remap: creating different contexts that are optimal for different kernels and redistributing the data on-the-fly.
[Figure: process rearrangement and data movement of transpose remap; transpose communication pattern]
Improvement of runtime by remap methods:
(1) npcol' = npcol/8, nprow' = nprow
(2) npcol' = npcol/8, nprow' = 8 × nprow
[Figure labels: 1.0, 2.5x]
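The second context shape (npcol' = npcol/8, nprow' = 8 × nprow) keeps every process busy. The sketch below computes that shape and shows one possible way to place the old grid's processes in the new one. The placement rule and function names are assumptions for illustration; the transcript does not spell out the actual rearrangement used in Qbox.

```python
def transpose_context(nprow, npcol, factor=8):
    """Shape (nprow', npcol') of the new context; total process count is unchanged."""
    if npcol % factor:
        raise ValueError("npcol must be divisible by factor")
    return nprow * factor, npcol // factor


def new_coords(row, col, nprow, factor=8):
    """Hypothetical placement: `factor` neighbouring columns of the old
    grid are stacked into one taller column of the new grid."""
    return (col % factor) * nprow + row, col // factor
```

For instance, a 128 × 1,024 grid would become 1,024 × 128. The placement is one-to-one, so no process is left idle, in contrast to the gather & scatter variant.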
Conclusion and Insights
• Band parallelization reduces the internode communication overhead and improves strong scaling of WEST up to N_FFT × N_pert × N_band cores.
• Optimal remapping of data for matrix operations in Qbox reduces ScaLAPACK communication overhead at large scale, and makes hybrid-DFT calculations scale to N_FFT × N_band cores.
• Given the increased computational performance relative to network bandwidths, it is crucial to reduce and/or hide inter-node communication costs.
• Guiding principles for developing codes on many-core architectures:
  1) Parallelizing independent, fine-grain units of work, reducing inter-node communication, and maximizing utilization of on-node resources.
  2) Optimizing communication patterns for performance-critical kernels with on-the-fly data redistribution and process reconfiguration.
Acknowledgement
• This research is part of the Theta Early Science Project at Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility under Contract DE-AC02-06CH11357.
• This work was supported by MICCoM, as part of the Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Science, BES, MSE Division.