
Performance optimization of WEST and Qbox on Intel Knights Landing

Huihuo Zheng (1), Christopher Knight (1), Giulia Galli (1,2), Marco Govoni (1,2), and Francois Gygi (3)

(1) Argonne National Laboratory; (2) University of Chicago; (3) University of California, Davis

September 27th, 2017

IXPUG Annual Fall Conference 2017

ALCF Theta Early Science Project: First-Principles Simulations of Functional Materials for Energy Conversion

http://qboxcode.org/; http://west-code.org; http://www.quantum-espresso.org/

M. Govoni, G. Galli, J. Chem. Theory Comput. 2015, 11, 2680−2696

P. Giannozzi et al., J. Phys.: Condens. Matter 21, 395502 (2009)

• Embedded nanocrystal: T. Li, Phys. Rev. Lett. 107, 206805 (2011)
• Matter at extreme conditions: D. Pan et al., Proc. Natl. Acad. Sci. 110, 6250 (2013)
• Aqueous solution: A. Gaiduk et al., J. Am. Chem. Soc. Comm. (2016)
• Organic photovoltaics: M. Goldey, Phys. Chem. Chem. Phys., Advance Article (2016)
• Quantum information: H. Seo, Sci. Rep. 2016; 6: 20803.

Optimization focus

[Figure: time-to-solution vs. number of processors, with the strong scaling limit marked]

• Utilizing tuned math libraries (FFTW, MKL, ELPA, …)
• Vectorization: AVX512
• High Bandwidth Memory
• Adding extra layers of parallelization -> increase intrinsic scaling limit
• Reducing communication overhead to reach the intrinsic limit

3,624 KNL nodes, 9.65 petaFLOPS

Outline

• WEST – additional layers of parallelization
  • Band parallelization of Sternheimer equation
  • Task group parallelization to fit 3D FFTs within a single KNL node to reduce communication overheads and take advantage of HBM
• Qbox – reduce communication overheads of dense linear algebra with on-the-fly data redistribution
  • Gather & scatter remap
  • Transpose remap
• Conclusions and insights

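Optoelectronic calculations using many-body perturbation theory (GW)

Linear response theory: $\Delta\rho = \chi\,\Delta V_{\mathrm{pert}}$, with $\Delta\rho$ the electronic density, $\chi$ the response function, and $\Delta V_{\mathrm{pert}}$ the perturbation potential.

Massively parallel by distributing perturbations

[Schematic: the perturbations $\Delta V_1, \Delta V_2, \ldots, \Delta V_{N-1}, \Delta V_N$ are distributed and yield $\Delta\rho_1, \Delta\rho_2, \ldots, \Delta\rho_{N-1}, \Delta\rho_N$ through 3D FFTs]

Parallelization scheme (image groups & plane waves)

Intrinsic strong scaling limit: $\mathit{nproc} \sim N_{\mathrm{pert}} \times N_z$

Strong scaling is efficient up to the intrinsic limit, then stops.

Si35H36, 176 electrons, 256 perturbations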
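The distribution of perturbations over image groups, and the resulting ceiling on useful process counts, can be sketched in a few lines of Python. This is only an illustration: the function names, the round-robin assignment, and the value N_z = 64 are assumptions made here; only the 256 perturbations come from the slide.

```python
def distribute_perturbations(n_pert, n_image):
    """Assign perturbation indices to image groups in round-robin order."""
    return [list(range(g, n_pert, n_image)) for g in range(n_image)]


def intrinsic_limit(n_pert, n_z):
    """Largest useful process count if each perturbation can use at most n_z ranks."""
    return n_pert * n_z


# 256 perturbations (from the slide) spread over 8 image groups (assumed).
groups = distribute_perturbations(n_pert=256, n_image=8)

# n_z = 64 is a hypothetical value, used only to show the arithmetic.
limit = intrinsic_limit(n_pert=256, n_z=64)
```

Beyond `limit` processes there is no independent work left to hand out in this scheme, which is why the scaling curve flattens.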


Single perturbation runtime (4 BG/Q vs. 1 KNL)

• 80% of runtime is spent in external libraries
• 3.7x speedup from BG/Q (ESSL) to KNL (MKL)
• High-bandwidth memory on Theta is critical for performance (e.g. 3D FFTs): 3.1x speedup

[Figure: single perturbation runtime for CdSe, 884 electrons; labels: BG/Q ESSL, cache mode]

Improvement of strong scaling by band parallelization of Sternheimer equation

Increased parallelism by arranging the MPI ranks in a 3D grid (perturbations & bands & FFT)

New intrinsic strong scaling limit: $\mathit{nproc} = N_{\mathrm{pert}} \times N_{\mathrm{band}} \times N_z$

Cost: AllReduce once across band groups (relatively cheap)

[Figure: strong scaling with image parallelization vs. band parallelization for Si35H36, 176 electrons, 256 perturbations; schematic labels: Sternheimer equation, call hpsi]
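One way to picture the 3D rank grid is as a mapping from a linear MPI rank to (image, band group, FFT) coordinates, with each band group owning a contiguous slice of bands. The sketch below is a serial illustration under stated assumptions: the rank ordering, the function names, and the figure of 88 bands (176 electrons with doubly occupied states) are not taken from WEST.

```python
def rank_to_coords(rank, n_image, n_band_groups, n_fft):
    """Map a linear rank to (image, band group, FFT) coordinates.

    The FFT index varies fastest so that ranks sharing one 3D FFT are
    contiguous (an assumed ordering).
    """
    if not 0 <= rank < n_image * n_band_groups * n_fft:
        raise ValueError("rank outside the process grid")
    fft = rank % n_fft
    band = (rank // n_fft) % n_band_groups
    image = rank // (n_fft * n_band_groups)
    return image, band, fft


def band_slice(n_bands, n_band_groups, band_group):
    """Contiguous block of band indices owned by one band group."""
    base, extra = divmod(n_bands, n_band_groups)
    start = band_group * base + min(band_group, extra)
    return range(start, start + base + (1 if band_group < extra else 0))


# Serial stand-in for the single AllReduce across band groups:
# each group forms a partial sum over its own bands, then the partials are combined.
n_bands, n_band_groups = 88, 4  # assumed values
partials = [sum(band_slice(n_bands, n_band_groups, g)) for g in range(n_band_groups)]
total = sum(partials)
```

The only cross-group step is the final combination of `partials`, which mirrors why a single reduction across band groups is relatively cheap.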

Improving strong scaling of 3D FFTs using task group parallelization when a band group spreads across multiple nodes

Small 3D FFTs do not scale well across multiple KNL nodes because of internode communication overheads relative to shared-memory MPI. Task groups (tg) redistribute complete wavefunctions to separate nodes to simultaneously compute multiple 3D FFTs.

[Figure: strong scaling of 3D FFT (plane and pencil decomposition) on Cetus (BG/Q) and Theta (KNL) using a 256x256x256 FFT grid and FFTXlib in QE; annotations: cost for data redistribution; MPI_Alltoall within a single KNL node]
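The task group idea can be illustrated by splitting the ranks of a band group into one group per node and handing each group whole wavefunctions to transform. The following Python sketch is hypothetical: the function names, the 64 ranks per node, and the round-robin assignment are assumptions for illustration, not the FFTXlib implementation.

```python
def build_task_groups(n_ranks, ranks_per_node):
    """Split the ranks of a band group into one task group per node."""
    if n_ranks % ranks_per_node:
        raise ValueError("n_ranks must be a multiple of ranks_per_node")
    return [list(range(start, start + ranks_per_node))
            for start in range(0, n_ranks, ranks_per_node)]


def assign_wavefunctions(n_wavefunctions, n_task_groups):
    """Round-robin: task group t transforms wavefunctions t, t + ntg, t + 2*ntg, ..."""
    return [list(range(t, n_wavefunctions, n_task_groups))
            for t in range(n_task_groups)]


# A band group of 256 ranks on 4 nodes (assumed sizes) gives 4 task groups.
task_groups = build_task_groups(n_ranks=256, ranks_per_node=64)
work = assign_wavefunctions(n_wavefunctions=10, n_task_groups=len(task_groups))
```

Each entry of `work` lists the wavefunctions whose 3D FFTs stay inside one node, so the all-to-all exchanges of those FFTs never cross the network.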


Strong scaling of Qbox for hybrid-DFT calculations

SiC512: 512 atoms, 2048 electrons, PBE0

F. Gygi and I. Duchemin, J. Chem. Theory Comput., 2013, 9 (1), pp 582–587

[Figure: strong scaling of runtime by kernel; labels: dgemm; dgemm, Gram-Schmidt (syrk, potrf, trsm); exact exchange 3D FFTs]

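Data layout: block distribution of wave functions to 2D process grid

$\psi_i(k)$, $i = 1, 2, \ldots, N_{\mathrm{band}}$, $k = 0, 1, \ldots, n_{\mathrm{pw}} - 1$

SiC512: 140,288 × 1,024

[Figure: wave function matrix (140,288 × 1,024) on the nprow × npcol process grid; axes: k, i; labels: FFT, MPI_Alltoallv, MPI_Allreduce]

Good for 3D FFTs

Intrinsic limit: $\mathit{nproc} = N_{\mathrm{band}} N_z$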
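To make the block distribution concrete, the sketch below computes the local block each process would hold for the SiC512 matrix. It assumes that rows (plane-wave index k) are split over nprow and columns (band index i) over npcol, and it uses a 128 × 1,024 process grid as an example; the function names and the grid shape are assumptions, not Qbox internals.

```python
def block_range(n_global, n_procs, p):
    """Half-open index range [start, stop) owned by process p in a block distribution."""
    block = -(-n_global // n_procs)  # ceiling division
    start = min(p * block, n_global)
    stop = min(start + block, n_global)
    return start, stop


def local_shape(n_pw, n_band, nprow, npcol, prow, pcol):
    """Rows and columns of the wave function matrix stored on process (prow, pcol)."""
    r0, r1 = block_range(n_pw, nprow, prow)
    c0, c1 = block_range(n_band, npcol, pcol)
    return r1 - r0, c1 - c0


# SiC512 matrix from the slide on an assumed 128 x 1,024 process grid.
shape = local_shape(140_288, 1_024, nprow=128, npcol=1_024, prow=0, pcol=0)
```

With one band per process column, the local matrix is extremely narrow, which suits the 3D FFTs but leaves little work per process for dense linear algebra.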

Poor scalability of ScaLAPACK for tall-skinny matrices and small square matrices due to communication overheads

[Schematic: matrix product relating the wave function matrix (tall-skinny matrices) to the overlap matrix (small square matrix); panel labels: Gram-Schmidt, d(z)gemm]
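Written out, the product in the schematic takes the following form. Here the coefficients $\psi_i(k)$ are arranged as a tall-skinny matrix $\Psi$ with $n_{\mathrm{pw}}$ rows and $N_{\mathrm{band}}$ columns. The Cholesky-based orthogonalization in the second line is the standard way the syrk, potrf, and trsm kernels combine; that Qbox follows exactly these steps is an assumption made here.

$$
S_{ij} = \sum_{k=0}^{n_{\mathrm{pw}}-1} \psi_i^*(k)\,\psi_j(k), \qquad S = \Psi^\dagger \Psi \in \mathbb{C}^{N_{\mathrm{band}} \times N_{\mathrm{band}}} \quad (\text{syrk})
$$

$$
S = L L^\dagger \quad (\text{potrf}), \qquad \Psi \leftarrow \Psi L^{-\dagger} \quad (\text{trsm}), \qquad \text{so that } \Psi^\dagger \Psi = L^{-1} S L^{-\dagger} = I
$$

For SiC512, $\Psi$ is 140,288 × 1,024 and $S$ is only 1,024 × 1,024, so spreading $S$ over a very large process grid leaves each process with very little data relative to the communication required.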

Reducing communication overheads from ScaLAPACK with “gather & scatter” remap

Solution: creating a context with fewer columns and on-the-fly data redistribution
• Compute 3D FFTs on original grid
• Gather data to smaller grid
• Run ScaLAPACK on smaller grid
• Scatter data back to original grid

The remap communication pattern only involves processes within the same row or column.

Key: remap communication time needs to be small.

[Figure: schematic of gather & scatter remap and its communication pattern; gray processes are idle during ScaLAPACK computation]
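The gather and scatter steps can be emulated serially to show which process ends up holding which data. In this sketch the active process of each group is taken to be the first column of the group, in the same row; that choice, the function names, and the toy grid sizes are assumptions for illustration rather than the custom remap used in Qbox.

```python
def gather_target(prow, pcol, factor):
    """Process receiving the block of (prow, pcol): first column of its group, same row."""
    return prow, (pcol // factor) * factor


def is_active(pcol, factor):
    """True for processes that take part in the ScaLAPACK call on the smaller grid."""
    return pcol % factor == 0


def gather(blocks, nprow, npcol, factor):
    """Serial emulation: append the blocks of each column group onto its active process."""
    gathered = {}
    for prow in range(nprow):
        for pcol in range(npcol):
            target = gather_target(prow, pcol, factor)
            gathered.setdefault(target, []).extend(blocks[(prow, pcol)])
    return gathered


def scatter(gathered, factor, block_len):
    """Inverse of gather: hand each block back to its original owner."""
    blocks = {}
    for (prow, lead), data in gathered.items():
        for j in range(factor):
            blocks[(prow, lead + j)] = data[j * block_len:(j + 1) * block_len]
    return blocks


# Toy 2 x 8 grid, two matrix columns per process, gathered by a factor of 4.
nprow, npcol, factor, block_len = 2, 8, 4, 2
blocks = {(r, c): [c * block_len + j for j in range(block_len)]
          for r in range(nprow) for c in range(npcol)}
gathered = gather(blocks, nprow, npcol, factor)
restored = scatter(gathered, factor, block_len)
```

Every transfer in `gather` stays within one process row, and only the processes for which `is_active` is true hold data during the dense linear algebra step.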

Improvement of strong scaling using “gather & scatter” remap

hpsi + wf_update time remains relatively flat with remap, and the custom remap time is two orders of magnitude smaller than the hpsi + wf_update time.

Custom remap function is 1000x faster than ScaLAPACK’s pdgemr2d.

Improvement of Qbox’s strong scaling after optimizations: runtime improves from ~400 to ~30 s per SCF iteration (13x speedup) on 131,072 ranks for 2048 electrons.

[Figure labels: 400.8 s, 67.5 s, 30.5 s]
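The quoted 13x follows directly from the first and last runtime labels:

$$
\frac{400.8\ \mathrm{s}}{30.5\ \mathrm{s}} \approx 13.1
$$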

Reducing communication overheads from ScaLAPACK by “transpose” remap

Problem of “gather & scatter”: idle processes. How to utilize them? Assign idle processes to active columns.

Transpose remap:
• Perform 3D FFTs in the original context.
• Transfer data through a series of local regional transposes.
• Run ScaLAPACK in the new context.

Key concept for remap: creating different contexts that are optimal for different kernels and redistributing the data on-the-fly.

[Figure: transpose communication pattern; process rearrangement and data movement of transpose remap]

[Figure: improvement of runtime by remap methods; labels: 1.0, 2.5x]
(1) $\mathit{npcol}' = \mathit{npcol}/8$, $\mathit{nprow}' = \mathit{nprow}$
(2) $\mathit{npcol}' = \mathit{npcol}/8$, $\mathit{nprow}' = 8 \times \mathit{nprow}$
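The difference between the two remap variants comes down to how many processes stay busy in the new context. The sketch below computes the new grid shape and one possible reassignment of processes; stacking each group of adjacent columns into a taller column is an assumed ordering chosen for illustration, and the function names and the 128 × 1,024 starting grid are likewise hypothetical.

```python
def transpose_context(nprow, npcol, factor):
    """Grid shape of the new context: fewer columns, proportionally more rows."""
    if npcol % factor:
        raise ValueError("npcol must be divisible by factor")
    return nprow * factor, npcol // factor


def new_coords(prow, pcol, nprow, factor):
    """Coordinates of process (prow, pcol) in the new context.

    Each group of `factor` adjacent columns is stacked into one taller column
    (an assumed ordering).
    """
    return (pcol % factor) * nprow + prow, pcol // factor


def active_fraction_gather(factor):
    """Share of processes doing ScaLAPACK work with gather & scatter."""
    return 1 / factor


# Assumed 128 x 1,024 starting grid with a reduction factor of 8.
new_shape = transpose_context(nprow=128, npcol=1_024, factor=8)

# On a toy 2 x 8 grid, every process gets a distinct slot in the new 8 x 2 context.
slots = {new_coords(r, c, nprow=2, factor=4) for r in range(2) for c in range(8)}
```

With a factor of 8, gather & scatter keeps one process in eight active (`active_fraction_gather(8)`), whereas the transposed context keeps the total process count unchanged and so uses all of them.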

Conclusion and Insights

• Band parallelization reduces the internode communication overhead and improves strong scaling of WEST up to $N_{\mathrm{FFT}} N_{\mathrm{pert}} N_{\mathrm{band}}$ cores.
• Optimal remapping of data for matrix operations in Qbox reduces ScaLAPACK communication overhead at large scale, and makes hybrid-DFT calculations scale to $N_{\mathrm{FFT}} N_{\mathrm{band}}$ cores.
• Given the increased computational performance relative to network bandwidths, it is crucial to reduce and/or hide inter-node communication costs.
• Guiding principles for developing codes on many-core architectures:
  1) Parallelizing independent, fine-grain units of work, reducing inter-node communication, and maximizing utilization of on-node resources.
  2) Optimizing communication patterns for performance-critical kernels with on-the-fly data redistribution and process reconfiguration.

Acknowledgement

• This research is part of Theta Early Science Project at Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility under Contract DE-AC02-06CH11357.
• This work was supported by MICCoM, as part of Comp. Mats. Sci. Program funded by the U.S. DOE, Office of Science, BES, MSE Division.
