linear algebra tasks in materials science: optimization and portability … · 2020. 8. 4. ·...

Mitg

lied

derH

elm

holtz

-Gem

eins

chaf

t

Linear algebra tasks in MaterialsScience: optimization andportability

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli

Outline

Jülich Supercomputing Center

Chebyshev Accelerated Subspace Iteration (ChASE)

Hamiltonian and Overlap matrices initialization in FLAPW

ADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 2

Forschungszentrum Jülich


Jülich Supercomputing Centre

Supercomputing operations for:Center: FZJ

Region: JARA

Germany: Gauss Centre for SupercomputingJohn von Neumann Institute for Computing

Europe: PRACE, EU projectsApplication support

Primary support & Simulation Labs

Peer review support & coordinationR&D work

Methods and algorithms, computationalscience, performance analysis and tools

Scientific Big Data Analytics with HPC

Computer architectures, Co-Design ExascaleLabs together with IBM, Intel, NVIDIA

Education and trainingADAC Workshop, July 17-19 2017 Edoardo Di Napoli Folie 4

Supercomputing systems: dual architecturestrategy


Simulation Laboratory as HPC enabler


SimLab Quantum Materials (SLQM)

Team

Mission

The Simulation Laboratory Quantum Materials (SLQM) provides expertise in thefield of quantum-based simulations in Materials Science with a special focus onhigh-performance computing. SLQM acts as a high-level support structure indedicated projects and hosts research projects dealing with fundamental aspects ofcode development, algorithmic optimization, and performance improvement. TheLab acts as an enabler of large scale simulations on current HPC platforms as wellas on future architectures by targeting domain-specific co-design processes.


High-performance computations

Observations:

Linear algebra algorithms ubiquitous in Materials Science

Numerical libraries are used as black boxes.

Domain-specific knowledge does not influence algorithm choice.

Rigid legacy codes are hard to modernize.

Goal

Design and optimize linear algebra algorithms in order to:

exploit available knowledge.

increase the parallelism of complex tasks.

facilitate performance portability


SLQM active areas of research

Adaptive integration for functional Renormalization Group (fRG)

HPC Tensor operations (contractions, generalized contractions,

transpositions, etc.) with application in fRG, Quantum Chemistry, etc.

Algebraic eigensolvers (sequences of dense eigenproblems, sparse

eigenproblems) with applications in Density Functional Theory (DFT),

Numerical RG, Excitonic Hamiltonians, etc.

Articulated linear algebra tasks (Matrix initialization, etc.)

Structured Linear Systems (Dyson, Poisson, and Keldish equations)

Self-consistent Field preconditioners

Predicting material properties through machine learning


Two examples from FLAPW methods (DFT)

An eigensolver for sequences of dense matrices

Chebyshev Accelerated Subspace iteration Eigensolver (ChASE) –Exterior iterative eigensolver tailored to sequences of dense Hermitianeigenvalue problems arising in DFT methods based on the LAPW basisset.

A complex linear algebra task

H and S Dense Linear Algebra (HSDLA) algorithm – Combination ofdense linear algebra kernels for the initialization of Hamiltonian andOverlap matrices in Density Functional Theory.


Density Functional Theory Self-Consistent FieldGeneral framework

Initial guessfor charge density

nstart(r)

InitializeA(`)

k and B(`)k

matrices

Solve a set ofeigenproblems

P(`)k1

. . .P(`)kN

Compute newcharge density

n(`)(r)

Converged?

|n(`)−n(`−1)|< η

OUTPUTElectronicstructure,

. . .

No

Yes


The case of the FLAPW method

Observations:1 every P(`)

k : A(`)k x = B(`)

k λx is a generalized eigenvalue problem;

2 A and B are hermitian (B is positive definite);

3 required: lower 2÷20 % of eigenpairs;

4 eigenvectors of problems of same k are seemingly uncorrelated acrossiterations i

5 k-vector index: k = 1 : 10÷100;

6 iteration cycle index: `= 1 : 20÷50.


Outline





Generalized eigenvalue solver: subspace iteration

Two distinct optmization opportunities:

1 Knowledge exploitation: solution of previous SCF cycle can be usedto “precondition” the solver at the next iteration

2 Algorithm optimization: polynomial degree can be pre-computed inorder to minimize Mat-Vec multiplications


Chebyshev Accelerated Subspace EigensolverChebyshev filter

A generic vector v = ∑ni=1 sixi is very quickly aligned in the direction of the

eigenvector corresponding to the extremal eigenvalue λ1

vm = pm(A)v =n

∑i=1

si pm(A)xi =n

∑i=1

si pm(λi)xi

= s1x1 +n

∑i=2

siCm(

λi−ce )

Cm(λ1−c

e )xi ∼ s1x1

λmin λlowerc λupper

0

0.5

1

1.5

·104

eigenspectrum

Che

bysh

evpo

lyno

mia

l

Polynomial degree m = 6

λmin λlowerc λupper

100

101

102

103

104

eigenspectrum

Che

bysh

evpo

lyno

mia

l

Polynomial degree m = 6


The core of the algorithm: Chebyshev filter> 80% Computing time

Three-terms recurrence relationCm+1 (t) = 2xCm (t)−Cm−1 (t) ; m ∈N, C0 (t) = 1, C1 (t) = x

Zm.= pm(H) Z0 with H = H− cIn

FOR: i = 1→ DEG−1

Zi+1 ← 2σi+1

eH × Zi −σi+1σi Zi−1 xGEMM

END FOR.


Preconditioning ChASEExploiting knowledge

ITERATION (`)

P(`)k1

(X(`)k1

,Λ(`)k1)

P(`)k2

(X(`)k2

,Λ(`)k2)

P(`)kN

(X(`)kN

,Λ(`)kN)

X ≡ x1, . . . ,xn

iterative

solver

iterative

solver

iterative

solver

ITERATION (`+1)

P(`+1)k1

(X(`+1)k1

,Λ(`+1)k1

)

P(`+1)k2

(X(`+1)k2

,Λ(`+1)k2

)

P(`+1)kN

(X(`+1)kN

,Λ(`+1)kN

)

Λ≡ diag(λ1, . . . ,λn)

iterative

solver

iterative

solver

iterative

solver

Nextcycle


Speed-upSpeed-up = CPU time (input random vectors)

CPU time (input approximate eigenvectors)

3 7 11 15 19 23Iteration Index

0

1

2

3

4Speed-u

pAu98Ag10 - n=13,379 - 128 cores.


ChASE with polynomial optimizationSpeed-up

Speed-up = CPU time (m constant)CPU time (m optimized for each vector)

3 6 9 12 15

1.2

1.4

1.6

1.8

Iteration cycle index

Spee

d−up

24 Cores −− Multi−threaded BLAS24 Cores and 1 GPU −− CUBLAS


ChASE pseudocodeINPUT: Hamiltonian H, TOL, DEG — OPTIONAL: approximate eigenvectorsZ0, extreme eigenvalues λ1,λNEV.

OUTPUT: NEV wanted eigenpairs (Λ,W).1 Lanczos DoS step. Identify the bounds for the eigenspectrum interval

corresponding to the wanted eigenspace.

REPEAT UNTIL CONVERGENCE:2 Chebyshev filter. Filter a block of vectors W←− Z0.

3 Re-orthogonalize the vectors outputted by the filter; W = QR.

4 Compute the Rayleigh quotient G = Q†HQ.

5 Compute the primitive Ritz pairs (Λ,Y) by solving for GY = YΛ.

6 Compute the approximate Ritz pairs (Λ,W← QY).

7 Check which one among the Ritz vectors converged.

8 Deflate and lock the converged vectors.

END REPEAT


Execution time for distinct platformsAn example from Density Functional Theory

0

50

100

150

200

250

300

0 5 10 15 20 25

Tim

e [s]

iteration

2x Intel Xeon E5-2670 (16 cores)1x NVIDIA K20m2x NVIDIA K20m

1x XeonPhi optimized (compact)


ChASE new implementation

C++ Library

Separation between algorithm and implementationFew kernels for low-level tasks (Lanczos, QR, Mat-Mat multiply, etc..)

General interface

Kernels adapted to the parallel computing platform

Library templated for Real, Complex, SP and DP.

Integration into applications

FLEUR code

Jena BSE code (based on VASP)

early stage collaboration with Exciting developers

early stage collaboration with developers from AICS (Kobe)


Outline





Hamiltonian and Overlap matrices

Matrix form

H =NA

∑a=1

AHa T [AA]

a Aa︸︷︷︸HAA

+AHa T [AB]

a Ba +BHa T [BA]

a Aa +BHa T [BB]

a Ba︸︷︷︸HAB+BA+BB

S =NA

∑a=1

AHa Aa︸︷︷︸

SAA

+NA

∑a=1

BHa UH

a UaBa︸︷︷︸SBB

Aa and Ba ∈ CNL×NG while T ...a ∈ CNL×NL and Hermitian.

NL = (lmax +1)2 ≤ 121, NG = 1,000÷50,000, and NA = number of atoms.


Constructing SAAAn example of memory layout re-structuring

SAA =NA

∑a=1

AHa Aa .

1: for a := 1→ NA do2: SAA = AH

a Aa . (zherk: 4NLN2G Flops)

3: end for

NG

NL A1

A2

.

.

.

ANA

A∗

NG

NANL

1: SAA = AH? A? . (zherk: 4NANLN2

G Flops)


Constructing HAB+BA+BBAn example of algorithm re-structuring

HAB+BA+BB =NA

∑a=1

BHa (T

[BA]a Aa)+(AH

a T [AB]a )Ba +

12

BHa (T

[BB]a Ba)+

12(BH

a T [BB]a )Ba

=NA

∑a=1

BHa (T

[BA]a Aa +

12

T [BB]a Ba) +

(AHa T [AB]

a +12

BHa T [BB]

a )Ba

1: for a := 1→ NA do2: Za = T [BA]

a Aa . (zgemm: 8N2LNG Flops)

3: Za = Za +12 T [BB]

a Ba . (zhemm: 8N2LNG Flops)

4: Stack Za to Z? and Ba to B?

5: end for6: H = ZH

? B?+BH? Z? . (zher2k: 8NANLN2

G Flops)


FLEUR early optimizationsMinimization of FLOPS in FLEUR

SAA =(4π)2

Ω∑a

exp [i(KG−KG′) ·xa]lsph

∑l=0

fl,a,G′ fl,a,GW2

l,a

l

∑m=−l

Y∗l,m(RaKG

)Yl,m

(RaKG′

)︸︷︷︸

?

.

The sum over m can be removed by using the well known identity

Pl(KG · KG′

) 2l+14π

= ?

Removing the sum over m yields a reduction in #ops by a factor of(lsph +1

)∼ 10 but destroys matrix form =⇒ “premature optimization”


Hamiltonian and Overlap matrices initializationMain features

Boosted performance through highly optimized computational kernels.

Performance portability through modular implementation.

Compute bound implementation at the cost of increased complexity (moreFLOPs).

NaCl (Kmax = 4.0): IvyBridge HaswellTiO2 (Kmax = 3.6): IvyBridge Haswell

0 2 4 6 8 10 12 14 16 18 20 22 240

1

2

3

#threads

spee

dup

Figure 1: Speedup of our algorithm over FLEUR with kmax = 4 and increasing parallelism

1


Porting to heterogeneous architecturesBack-of-the-envelope analysis

5 lines of the algorithm constitute 97% of flops

Correspond to BLAS-3 operations (gemm, herk, her2k)

High arithmetic intensity and should fit GPUs well

First step: offload these routine callsAll 5 are BLAS kernels. Can we use some library?

cuBLAScuBLAS-XTMAGMABLASX


Porting to heterogeneous architecturesAdditional code?

3x wrappers (zgemm, zherk, zher2k)

Init and cleanup of cuda runtime and devicesGet #devices, initialize devices, create handlers, ...

Destroy handlers, free devices, ...

Allocate data in page-locked memoryAvoid “hidden” copies

Fast data transfer

Only around 100 lines of additional code


Experimental resultsTest case 2: AuAg (NA = 108, NL = 121, NG = [3275−13379])

0

30

60

90

120

150

180

210

240

2.5 3 3.5 4

3.03x

4.96x

Tim

e (

se

co

nd

s )

Kmax

CPUCPU / 1 GPUCPU / 2 GPUs

Sandy Bridge: CPU: E5-2650, 2 x 8core, 2.0GHz, 64GBs RAM, 2 Nvidia Tesla K20XPeak performance: 256 GFs/s + 2 x 1.3 TFs/s


Conclusions

Conclusions

Modernizing algorithm structure of legacy code is critical

Tailoring algorithms to application-specific tasks

Exploiting knowledge from domain-specific applications

Layered design built on top of standardized libraries

Increase in performance

(Almost) free lunch⇒ performance portability


References

“ChASE: a Chebyshev Accelerated Subspace iteration Eigensolver for sequences ofeigenvalue problems.” J. Winkelmann, M. Berljafa, P. Springer and E. Di Napoli.In preparation.

“An Optimized and Scalable Eigensolver for Sequences of Eigenvalue Problems.” M.Berl-jafa, D. Wortmann, and E. Di Napoli.Concurrency Computat.: Pract. Exper. 27 (2015), pp. 905-922 [arXiv:1404.4161]

“High-performance generation of the Hamiltonian and Overlap matrices in FLAPWmethods.” Edoardo Di Napoli, Elmar Peise, Markus Hrywniak and Paolo Bientinesi.Comp. Phys. Comm. 211 (2017) 61-72 [arXiv:1602.06589]

“Hybrid CPU-GPU generation of the Hamiltonian and Overlap matrices in FLAPWmethods.” Diego Fabregat-Traver, Davor Davidovic, Markus Höhnerbach, Edoardo DiNapoli.Lecture Notes in Computer Science, High-Performance Scientific Computing10164 (2016), pp. 200-211 [arXiv:1611.00606 ]


Thank you

QUANTUMATERIALS

1

For more [email protected]

http://www.fz-juelich.de/ias/jsc/slqm


linear algebra tasks in materials science: optimization and portability … · 2020. 8. 4. ·...

Documents