linear algebra tasks in materials science: optimization and portability … · 2020. 8. 4. ·...

34
Mitglied der Helmholtz-Gemeinschaft Linear algebra tasks in Materials Science: optimization and portability ADAC Workshop, July 17-19 2017 Edoardo Di Napoli

Upload: others

Post on 09-Oct-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Linear algebra tasks in MaterialsScience: optimization andportability

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli

Page 2: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Outline

Jülich Supercomputing Center

Chebyshev Accelerated Subspace Iteration (ChASE)

Hamiltonian and Overlap matrices initialization in FLAPW

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 2

Page 3: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Forschungszentrum Jülich

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 3

Page 4: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Jülich Supercomputing Centre

Supercomputing operations for:Center: FZJ

Region: JARA

Germany: Gauss Centre for SupercomputingJohn von Neumann Institute for Computing

Europe: PRACE, EU projectsApplication support

Primary support & Simulation Labs

Peer review support & coordinationR&D work

Methods and algorithms, computationalscience, performance analysis and tools

Scientific Big Data Analytics with HPC

Computer architectures, Co-Design ExascaleLabs together with IBM, Intel, NVIDIA

Education and trainingADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 4

Page 5: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Supercomputing systems: dual architecturestrategy

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 5

Page 6: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Simulation Laboratory as HPC enabler

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 6

Page 7: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

SimLab Quantum Materials (SLQM)

Team

Mission

The Simulation Laboratory Quantum Materials (SLQM) provides expertise in thefield of quantum-based simulations in Materials Science with a special focus onhigh-performance computing. SLQM acts as a high-level support structure indedicated projects and hosts research projects dealing with fundamental aspects ofcode development, algorithmic optimization, and performance improvement. TheLab acts as an enabler of large scale simulations on current HPC platforms as wellas on future architectures by targeting domain-specific co-design processes.

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 7

Page 8: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

High-performance computations

Observations:

Linear algebra algorithms ubiquitous in Materials Science

Numerical libraries are used as black boxes.

Domain-specific knowledge does not influence algorithm choice.

Rigid legacy codes are hard to modernize.

Goal

Design and optimize linear algebra algorithms in order to:

exploit available knowledge.

increase the parallelism of complex tasks.

facilitate performance portability

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 8

Page 9: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

SLQM active areas of research

Adaptive integration for functional Renormalization Group (fRG)

HPC Tensor operations (contractions, generalized contractions,

transpositions, etc.) with application in fRG, Quantum Chemistry, etc.

Algebraic eigensolvers (sequences of dense eigenproblems, sparse

eigenproblems) with applications in Density Functional Theory (DFT),

Numerical RG, Excitonic Hamiltonians, etc.

Articulated linear algebra tasks (Matrix initialization, etc.)

Structured Linear Systems (Dyson, Poisson, and Keldish equations)

Self-consistent Field preconditioners

Predicting material properties through machine learning

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 9

Page 10: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Two examples from FLAPW methods (DFT)

An eigensolver for sequences of dense matrices

Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) –Exterior iterative eigensolver tailored to sequences of dense Hermitianeigenvalue problems arising in DFT methods based on the LAPW basisset.

A complex linear algebra task

H and S Dense Linear Algebra (HSDLA) algorithm – Combination ofdense linear algebra kernels for the initialization of Hamiltonian andOverlap matrices in Density Functional Theory.

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 10

Page 11: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Density Functional Theory Self-Consistent FieldGeneral framework

Initial guessfor charge density

nstart(r)

InitializeA(`)

k and B(`)k

matrices

Solve a set ofeigenproblems

P(`)k1

. . .P(`)kN

Compute newcharge density

n(`)(r)

Converged?

|n(`)−n(`−1)|< η

OUTPUTElectronicstructure,

. . .

No

Yes

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 11

Page 12: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

The case of the FLAPW method

Observations:1 every P(`)

k : A(`)k x = B(`)

k λx is a generalized eigenvalue problem;

2 A and B are hermitian (B is positive definite);

3 required: lower 2÷20 % of eigenpairs;

4 eigenvectors of problems of same k are seemingly uncorrelated acrossiterations i

5 k-vector index: k = 1 : 10÷100;

6 iteration cycle index: `= 1 : 20÷50.

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 12

Page 13: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Outline

Jülich Supercomputing Center

Chebyshev Accelerated Subspace Iteration (ChASE)

Hamiltonian and Overlap matrices initialization in FLAPW

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 13

Page 14: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Generalized eigenvalue solver: subspace iteration

Two distinct optmization opportunities:

1 Knowledge exploitation: solution of previous SCF cycle can be usedto “precondition” the solver at the next iteration

2 Algorithm optimization: polynomial degree can be pre-computed inorder to minimize Mat-Vec multiplications

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 14

Page 15: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Chebyshev Accelerated Subspace EigensolverChebyshev filter

A generic vector v = ∑ni=1 sixi is very quickly aligned in the direction of the

eigenvector corresponding to the extremal eigenvalue λ1

vm = pm(A)v =n

∑i=1

si pm(A)xi =n

∑i=1

si pm(λi)xi

= s1x1 +n

∑i=2

siCm(

λi−ce )

Cm(λ1−c

e )xi ∼ s1x1

λmin λlowerc λupper

0

0.5

1

1.5

·104

eigenspectrum

Che

bysh

evpo

lyno

mia

l

Polynomial degree m = 6

λmin λlowerc λupper

100

101

102

103

104

eigenspectrum

Che

bysh

evpo

lyno

mia

l

Polynomial degree m = 6

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 15

Page 16: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

The core of the algorithm: Chebyshev filter> 80% Computing time

Three-terms recurrence relationCm+1 (t) = 2xCm (t)−Cm−1 (t) ; m ∈N, C0 (t) = 1, C1 (t) = x

Zm.= pm(H) Z0 with H = H− cIn

FOR: i = 1→ DEG−1

Zi+1 ← 2σi+1

eH × Zi −σi+1σi Zi−1 xGEMM

END FOR.

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 16

Page 17: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Preconditioning ChASEExploiting knowledge

ITERATION (`)

P(`)k1

(X(`)k1

,Λ(`)k1)

P(`)k2

(X(`)k2

,Λ(`)k2)

P(`)kN

(X(`)kN

,Λ(`)kN)

X ≡ x1, . . . ,xn

iterative

solver

iterative

solver

iterative

solver

ITERATION (`+1)

P(`+1)k1

(X(`+1)k1

,Λ(`+1)k1

)

P(`+1)k2

(X(`+1)k2

,Λ(`+1)k2

)

P(`+1)kN

(X(`+1)kN

,Λ(`+1)kN

)

Λ≡ diag(λ1, . . . ,λn)

iterative

solver

iterative

solver

iterative

solver

Nextcycle

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 17

Page 18: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Speed-upSpeed-up = CPU time (input random vectors)

CPU time (input approximate eigenvectors)

3 7 11 15 19 23Iteration Index

0

1

2

3

4Speed-u

pAu98Ag10 - n=13,379 - 128 cores.

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 18

Page 19: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

ChASE with polynomial optimizationSpeed-up

Speed-up = CPU time (m constant)CPU time (m optimized for each vector)

3 6 9 12 15

1.2

1.4

1.6

1.8

Iteration cycle index

Spee

d−up

24 Cores −− Multi−threaded BLAS24 Cores and 1 GPU −− CUBLAS

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 19

Page 20: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

ChASE pseudocodeINPUT: Hamiltonian H, TOL, DEG — OPTIONAL: approximate eigenvectorsZ0, extreme eigenvalues λ1,λNEV.

OUTPUT: NEV wanted eigenpairs (Λ,W).1 Lanczos DoS step. Identify the bounds for the eigenspectrum interval

corresponding to the wanted eigenspace.

REPEAT UNTIL CONVERGENCE:2 Chebyshev filter. Filter a block of vectors W←− Z0.

3 Re-orthogonalize the vectors outputted by the filter; W = QR.

4 Compute the Rayleigh quotient G = Q†HQ.

5 Compute the primitive Ritz pairs (Λ,Y) by solving for GY = YΛ.

6 Compute the approximate Ritz pairs (Λ,W← QY).

7 Check which one among the Ritz vectors converged.

8 Deflate and lock the converged vectors.

END REPEAT

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 20

Page 21: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Execution time for distinct platformsAn example from Density Functional Theory

0

50

100

150

200

250

300

0 5 10 15 20 25

Tim

e [s]

iteration

2x Intel Xeon E5-2670 (16 cores)1x NVIDIA K20m2x NVIDIA K20m

1x XeonPhi optimized (compact)

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 21

Page 22: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

ChASE new implementation

C++ Library

Separation between algorithm and implementationFew kernels for low-level tasks (Lanczos, QR, Mat-Mat multiply, etc..)

General interface

Kernels adapted to the parallel computing platform

Library templated for Real, Complex, SP and DP.

Integration into applications

FLEUR code

Jena BSE code (based on VASP)

early stage collaboration with Exciting developers

early stage collaboration with developers from AICS (Kobe)

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 22

Page 23: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Outline

Jülich Supercomputing Center

Chebyshev Accelerated Subspace Iteration (ChASE)

Hamiltonian and Overlap matrices initialization in FLAPW

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 23

Page 24: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Hamiltonian and Overlap matrices

Matrix form

H =NA

∑a=1

AHa T [AA]

a Aa︸ ︷︷ ︸HAA

+AHa T [AB]

a Ba +BHa T [BA]

a Aa +BHa T [BB]

a Ba︸ ︷︷ ︸HAB+BA+BB

S =NA

∑a=1

AHa Aa︸ ︷︷ ︸

SAA

+NA

∑a=1

BHa UH

a UaBa︸ ︷︷ ︸SBB

Aa and Ba ∈ CNL×NG while T ...a ∈ CNL×NL and Hermitian.

NL = (lmax +1)2 ≤ 121, NG = 1,000÷50,000, and NA = number of atoms.

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 24

Page 25: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Constructing SAAAn example of memory layout re-structuring

SAA =NA

∑a=1

AHa Aa .

1: for a := 1→ NA do2: SAA = AH

a Aa . (zherk: 4NLN2G Flops)

3: end for

NG

NL A1

A2

.

.

.

ANA

A∗

NG

NANL

1: SAA = AH? A? . (zherk: 4NANLN2

G Flops)

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 25

Page 26: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Constructing HAB+BA+BBAn example of algorithm re-structuring

HAB+BA+BB =NA

∑a=1

BHa (T

[BA]a Aa)+(AH

a T [AB]a )Ba +

12

BHa (T

[BB]a Ba)+

12(BH

a T [BB]a )Ba

=NA

∑a=1

BHa (T

[BA]a Aa +

12

T [BB]a Ba) +

(AHa T [AB]

a +12

BHa T [BB]

a )Ba

1: for a := 1→ NA do2: Za = T [BA]

a Aa . (zgemm: 8N2LNG Flops)

3: Za = Za +12 T [BB]

a Ba . (zhemm: 8N2LNG Flops)

4: Stack Za to Z? and Ba to B?

5: end for6: H = ZH

? B?+BH? Z? . (zher2k: 8NANLN2

G Flops)

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 26

Page 27: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

FLEUR early optimizationsMinimization of FLOPS in FLEUR

SAA =(4π)2

Ω∑a

exp [i(KG−KG′) ·xa]lsph

∑l=0

fl,a,G′ fl,a,GW2

l,a

l

∑m=−l

Y∗l,m(RaKG

)Yl,m

(RaKG′

)︸ ︷︷ ︸

?

.

The sum over m can be removed by using the well known identity

Pl(KG · KG′

) 2l+14π

= ?

Removing the sum over m yields a reduction in #ops by a factor of(lsph +1

)∼ 10 but destroys matrix form =⇒ “premature optimization”

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 27

Page 28: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Hamiltonian and Overlap matrices initializationMain features

Boosted performance through highly optimized computational kernels.

Performance portability through modular implementation.

Compute bound implementation at the cost of increased complexity (moreFLOPs).

NaCl (Kmax = 4.0): IvyBridge HaswellTiO2 (Kmax = 3.6): IvyBridge Haswell

0 2 4 6 8 10 12 14 16 18 20 22 240

1

2

3

#threads

spee

dup

Figure 1: Speedup of our algorithm over FLEUR with kmax = 4 and increasing parallelism

1

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 28

Page 29: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Porting to heterogeneous architecturesBack-of-the-envelope analysis

5 lines of the algorithm constitute 97% of flops

Correspond to BLAS-3 operations (gemm, herk, her2k)

High arithmetic intensity and should fit GPUs well

First step: offload these routine callsAll 5 are BLAS kernels. Can we use some library?

cuBLAScuBLAS-XTMAGMABLASX

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 29

Page 30: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Porting to heterogeneous architecturesAdditional code?

3x wrappers (zgemm, zherk, zher2k)

Init and cleanup of cuda runtime and devicesGet #devices, initialize devices, create handlers, ...

Destroy handlers, free devices, ...

Allocate data in page-locked memoryAvoid “hidden” copies

Fast data transfer

Only around 100 lines of additional code

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 30

Page 31: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Experimental resultsTest case 2: AuAg (NA = 108, NL = 121, NG = [3275−13379])

0

30

60

90

120

150

180

210

240

2.5 3 3.5 4

3.03x

4.96x

Tim

e (

se

co

nd

s )

Kmax

CPUCPU / 1 GPUCPU / 2 GPUs

Sandy Bridge: CPU: E5-2650, 2 x 8core, 2.0GHz, 64GBs RAM, 2 Nvidia Tesla K20XPeak performance: 256 GFs/s + 2 x 1.3 TFs/s

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 31

Page 32: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Conclusions

Conclusions

Modernizing algorithm structure of legacy code is critical

Tailoring algorithms to application-specific tasks

Exploiting knowledge from domain-specific applications

Layered design built on top of standardized libraries

Increase in performance

(Almost) free lunch⇒ performance portability

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 32

Page 33: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

References

“ChASE: a Chebyshev Accelerated Subspace iteration Eigensolver for sequences ofeigenvalue problems.” J. Winkelmann, M. Berljafa, P. Springer and E. Di Napoli.In preparation.

“An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems.” M.Berl-jafa, D. Wortmann, and E. Di Napoli.Concurrency Computat.: Pract. Exper. 27 (2015), pp. 905-922 [arXiv:1404.4161]

“High-performance generation of the Hamiltonian and Overlap matrices in FLAPWmethods.” Edoardo Di Napoli, Elmar Peise, Markus Hrywniak and Paolo Bientinesi.Comp. Phys. Comm. 211 (2017) 61-72 [arXiv:1602.06589]

“Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPWmethods.” Diego Fabregat-Traver, Davor Davidovic, Markus Höhnerbach, Edoardo DiNapoli.Lecture Notes in Computer Science, High-Performance Scientific Computing10164 (2016), pp. 200-211 [arXiv:1611.00606 ]

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 33

Page 34: Linear algebra tasks in Materials Science: optimization and portability … · 2020. 8. 4. · Jülich Supercomputing Center Chebyshev Accelerated Subspace Iteration ... July 17-19

Thank you

QUANTUMATERIALS

1

For more [email protected]

http://www.fz-juelich.de/ias/jsc/slqm

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 34