Linear algebra tasks in Materials Science: optimization and portability
ADAC Workshop, July 17-19 2017 Edoardo Di Napoli
Outline
Jülich Supercomputing Center
Chebyshev Accelerated Subspace Iteration (ChASE)
Hamiltonian and Overlap matrices initialization in FLAPW
Forschungszentrum Jülich
Jülich Supercomputing Centre
Supercomputing operations for:
Center: FZJ
Region: JARA
Germany: Gauss Centre for Supercomputing, John von Neumann Institute for Computing
Europe: PRACE, EU projects
Application support
Primary support & Simulation Labs
Peer review support & coordination
R&D work
Methods and algorithms, computational science, performance analysis and tools
Scientific Big Data Analytics with HPC
Computer architectures, Co-Design Exascale Labs together with IBM, Intel, NVIDIA
Education and training
Supercomputing systems: dual architecture strategy
Simulation Laboratory as HPC enabler
SimLab Quantum Materials (SLQM)
Team
Mission
The Simulation Laboratory Quantum Materials (SLQM) provides expertise in the field of quantum-based simulations in Materials Science with a special focus on high-performance computing. SLQM acts as a high-level support structure in dedicated projects and hosts research projects dealing with fundamental aspects of code development, algorithmic optimization, and performance improvement. The Lab acts as an enabler of large-scale simulations on current HPC platforms as well as on future architectures by targeting domain-specific co-design processes.
High-performance computations
Observations:
Linear algebra algorithms ubiquitous in Materials Science
Numerical libraries are used as black boxes.
Domain-specific knowledge does not influence algorithm choice.
Rigid legacy codes are hard to modernize.
Goal
Design and optimize linear algebra algorithms in order to:
exploit available knowledge.
increase the parallelism of complex tasks.
facilitate performance portability
SLQM active areas of research
Adaptive integration for functional Renormalization Group (fRG)
HPC Tensor operations (contractions, generalized contractions, transpositions, etc.) with application in fRG, Quantum Chemistry, etc.
Algebraic eigensolvers (sequences of dense eigenproblems, sparse eigenproblems) with applications in Density Functional Theory (DFT), Numerical RG, Excitonic Hamiltonians, etc.
Articulated linear algebra tasks (Matrix initialization, etc.)
Structured Linear Systems (Dyson, Poisson, and Keldysh equations)
Self-consistent Field preconditioners
Predicting material properties through machine learning
Two examples from FLAPW methods (DFT)
An eigensolver for sequences of dense matrices
Chebyshev Accelerated Subspace iteration Eigensolver (ChASE): exterior iterative eigensolver tailored to sequences of dense Hermitian eigenvalue problems arising in DFT methods based on the LAPW basis set.
A complex linear algebra task
H and S Dense Linear Algebra (HSDLA) algorithm: combination of dense linear algebra kernels for the initialization of Hamiltonian and Overlap matrices in Density Functional Theory.
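Density Functional Theory Self-Consistent Field
General framework
1 Initial guess for the charge density $n_{\mathrm{start}}(r)$.
2 Initialize the matrices $A_k^{(\ell)}$ and $B_k^{(\ell)}$.
3 Solve a set of eigenproblems $P_{k_1}^{(\ell)}, \dots, P_{k_N}^{(\ell)}$.
4 Compute the new charge density $n^{(\ell)}(r)$.
5 Converged? $|n^{(\ell)} - n^{(\ell-1)}| < \eta$
No: return to step 2 for the next cycle.
Yes: OUTPUT electronic structure, ...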
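The control flow of the cycle above can be sketched in a few lines of Python. This is only an illustration of the loop structure: the function `scf_loop` and the three toy callables (`build`, `solve`, `density`) are hypothetical stand-ins that use scalars in place of real matrices and densities. They are not part of any DFT code.

```python
def scf_loop(n_start, build_matrices, solve_eigenproblems, new_density, eta, max_cycles=50):
    """Generic SCF driver: iterate until the density changes by less than eta."""
    n_old = n_start
    for cycle in range(1, max_cycles + 1):
        problems = build_matrices(n_old)             # (A_k, B_k) for every k
        eigenvalues = solve_eigenproblems(problems)  # solve each P_k
        n_new = new_density(eigenvalues)             # new charge density
        if abs(n_new - n_old) < eta:                 # convergence test
            return n_new, cycle
        n_old = n_new
    raise RuntimeError("SCF did not converge")


# Toy stand-ins (hypothetical): 1x1 "matrices" and a scalar "density".
def build(n):
    return [(2.0 + 0.5 * n, 1.0)]


def solve(problems):
    return [a / b for a, b in problems]  # scalar A x = lambda B x


def density(eigs):
    return 1.0 / eigs[0]


n_conv, cycles = scf_loop(0.0, build, solve, density, eta=1e-12)
```

In this toy the fixed point satisfies n = 1 / (2 + n/2); in a real code each of the three callables is an expensive linear algebra task.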
The case of the FLAPW method
Observations:
1 every $P_k^{(\ell)}: A_k^{(\ell)} x = \lambda B_k^{(\ell)} x$ is a generalized eigenvalue problem;
2 A and B are Hermitian (B is positive definite);
3 required: lower 2÷20 % of eigenpairs;
4 eigenvectors of problems with the same k are seemingly uncorrelated across iterations;
5 k-vector index: k = 1 : 10÷100;
6 iteration cycle index: $\ell$ = 1 : 20÷50.
Chebyshev Accelerated Subspace Iteration (ChASE)
Generalized eigenvalue solver: subspace iteration
Two distinct optimization opportunities:
1 Knowledge exploitation: the solution of the previous SCF cycle can be used to “precondition” the solver at the next iteration
2 Algorithm optimization: the polynomial degree can be pre-computed in order to minimize Mat-Vec multiplications
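Chebyshev Accelerated Subspace Eigensolver
Chebyshev filter
A generic vector $v = \sum_{i=1}^{n} s_i x_i$ is very quickly aligned in the direction of the eigenvector corresponding to the extremal eigenvalue $\lambda_1$:
$$v_m = p_m(A)\,v = \sum_{i=1}^{n} s_i\, p_m(A)\, x_i = \sum_{i=1}^{n} s_i\, p_m(\lambda_i)\, x_i = s_1 x_1 + \sum_{i=2}^{n} s_i\, \frac{C_m\!\left(\frac{\lambda_i - c}{e}\right)}{C_m\!\left(\frac{\lambda_1 - c}{e}\right)}\, x_i \sim s_1 x_1$$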
[Figure: Chebyshev polynomial of degree m = 6 plotted over the eigenspectrum (axis marks $\lambda_{\min}$, $\lambda_{\mathrm{lower}}$, $c$, $\lambda_{\mathrm{upper}}$), shown in two panels: linear vertical scale and logarithmic vertical scale.]
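The behaviour that the filter relies on can be checked with a short Python sketch: Chebyshev polynomials stay bounded by 1 inside [-1, 1] and grow rapidly outside it. The helper name `chebyshev` and the sample points are illustrative choices, not taken from the slides.

```python
def chebyshev(m, t):
    """C_m(t) from C_0 = 1, C_1 = t, C_{k+1} = 2 t C_k - C_{k-1}."""
    prev, cur = 1.0, t
    if m == 0:
        return prev
    for _ in range(m - 1):
        prev, cur = cur, 2.0 * t * cur - prev
    return cur


# Degree 6, as in the figure: sample the interval [-1, 1] and one point outside.
inside = [chebyshev(6, -1.0 + 0.1 * k) for k in range(21)]
outside = chebyshev(6, -1.5)

largest_inside = max(abs(x) for x in inside)  # never exceeds 1
```

After mapping the unwanted part of the spectrum onto [-1, 1] through (λ − c)/e, the wanted eigenvalues fall outside the interval, so their components are amplified relative to all the others.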
The core of the algorithm: Chebyshev filter
> 80% of computing time
Three-term recurrence relation
$C_{m+1}(t) = 2t\,C_m(t) - C_{m-1}(t)$; $m \in \mathbb{N}$, $C_0(t) = 1$, $C_1(t) = t$
$Z_m \doteq p_m(\tilde{H})\, Z_0$ with $\tilde{H} = H - c I_n$
FOR: i = 1 → DEG−1
$Z_{i+1} \leftarrow \frac{2\sigma_{i+1}}{e}\, \tilde{H} Z_i - \sigma_{i+1}\sigma_i Z_{i-1}$   (xGEMM)
END FOR.
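A minimal pure-Python sketch of the scaled recurrence follows. The slide does not define the scaling factors, so the usual choice is assumed here: $\sigma_1 = e/(\lambda_1 - c)$ and $\sigma_{i+1} = 1/(2/\sigma_1 - \sigma_i)$, with $Z_1 = (\sigma_1/e)\tilde{H} Z_0$. This normalizes the filter to 1 at $\lambda_1$. The function name, its arguments and the diagonal test matrix are illustrative. A real implementation applies the update to the whole block with one xGEMM call instead of looping over vectors.

```python
def chebyshev_filter(H, Z0, deg, lam1, lower, upper):
    """Apply the scaled degree-`deg` filter (deg >= 1) to each vector of Z0.

    [lower, upper] is the interval to be damped; lam1 estimates the
    lowest eigenvalue (assumed scaling, see the lead-in).
    """
    n = len(H)
    c, e = (upper + lower) / 2.0, (upper - lower) / 2.0

    def shifted_matvec(v):
        # (H - c I) v
        return [sum(H[r][k] * v[k] for k in range(n)) - c * v[r] for r in range(n)]

    sigma1 = e / (lam1 - c)
    filtered = []
    for z0 in Z0:
        sigma = sigma1
        prev = list(z0)
        cur = [sigma1 / e * x for x in shifted_matvec(prev)]
        for _ in range(1, deg):
            sigma_next = 1.0 / (2.0 / sigma1 - sigma)
            hz = shifted_matvec(cur)
            nxt = [2.0 * sigma_next / e * a - sigma_next * sigma * b
                   for a, b in zip(hz, prev)]
            prev, cur, sigma = cur, nxt, sigma_next
        filtered.append(cur)
    return filtered


# Diagonal toy matrix: the filter acts on each eigen-component separately.
H = [[0.1, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 3.0]]
out = chebyshev_filter(H, [[1.0, 1.0, 1.0, 1.0]], deg=6,
                       lam1=0.1, lower=0.8, upper=3.0)[0]
```

With these inputs the component along the lowest eigenvector keeps its weight, while the components whose eigenvalues lie in [0.8, 3.0] are damped by more than two orders of magnitude.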
Preconditioning ChASE
Exploiting knowledge
Notation: $X \equiv \{x_1, \dots, x_n\}$, $\Lambda \equiv \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
ITERATION ($\ell$): for each $k = k_1, k_2, \dots, k_N$, an iterative solver takes the problem $P_k^{(\ell)}$ and returns the eigenpairs $(X_k^{(\ell)}, \Lambda_k^{(\ell)})$.
ITERATION ($\ell+1$): for each $k$, the iterative solver takes the problem $P_k^{(\ell+1)}$ with the solution $(X_k^{(\ell)}, \Lambda_k^{(\ell)})$ of the previous iteration as input and returns $(X_k^{(\ell+1)}, \Lambda_k^{(\ell+1)})$.
Next cycle: the same scheme repeats.
Speed-up
Speed-up = CPU time (input random vectors) / CPU time (input approximate eigenvectors)
[Figure: speed-up versus iteration index. Au98Ag10, n = 13,379, 128 cores.]
ChASE with polynomial optimization
Speed-up
Speed-up = CPU time (m constant) / CPU time (m optimized for each vector)
[Figure: speed-up versus iteration cycle index for two configurations: 24 cores with multi-threaded BLAS; 24 cores and 1 GPU with CUBLAS.]
ChASE pseudocode
INPUT: Hamiltonian H, TOL, DEG. OPTIONAL: approximate eigenvectors Z0, extreme eigenvalues λ1, λNEV.
OUTPUT: NEV wanted eigenpairs (Λ, W).
1 Lanczos DoS step. Identify the bounds for the eigenspectrum interval corresponding to the wanted eigenspace.
REPEAT UNTIL CONVERGENCE:
2 Chebyshev filter. Filter a block of vectors W ← Z0.
3 Re-orthogonalize the vectors output by the filter; W = QR.
4 Compute the Rayleigh quotient G = Q†HQ.
5 Compute the primitive Ritz pairs (Λ, Y) by solving GY = YΛ.
6 Compute the approximate Ritz pairs (Λ, W ← QY).
7 Check which of the Ritz vectors have converged.
8 Deflate and lock the converged vectors.
END REPEAT
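The loop can be illustrated with a deliberately reduced Python sketch that tracks a single vector (NEV = 1). In this case the QR step becomes a normalization, the Rayleigh quotient is a scalar, and steps 5 to 8 collapse into one residual check. Several simplifications are assumptions of this sketch, not features of ChASE: the upper bound comes from Gershgorin discs instead of a Lanczos DoS step, the lower bound of the damped interval is passed in by hand, and the recurrence is left unscaled because the vector is normalized after every filter call. All names are illustrative.

```python
import math


def matvec(H, v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cheb_filter(H, v, deg, lower, upper):
    """Apply C_deg((H - c I) / e) to v by the three-term recurrence (deg >= 1)."""
    c, e = (upper + lower) / 2.0, (upper - lower) / 2.0

    def shifted(w):
        return [(hw - c * x) / e for hw, x in zip(matvec(H, w), w)]

    prev, cur = v, shifted(v)
    for _ in range(deg - 1):
        prev, cur = cur, [2.0 * a - b for a, b in zip(shifted(cur), prev)]
    return cur


def lowest_eigenpair(H, lower, z0=None, deg=10, tol=1e-10, max_iter=100):
    """Filtered iteration for the lowest eigenpair of a real symmetric H."""
    n = len(H)
    # Gershgorin bound on the largest eigenvalue (stand-in for the DoS step).
    upper = max(H[i][i] + sum(abs(H[i][j]) for j in range(n) if j != i)
                for i in range(n))
    w = list(z0) if z0 is not None else [1.0] * n
    for it in range(1, max_iter + 1):
        w = cheb_filter(H, w, deg, lower, upper)         # step 2: filter
        nw = norm(w)
        w = [x / nw for x in w]                          # step 3: "QR" of one vector
        Hw = matvec(H, w)
        lam = sum(a * b for a, b in zip(w, Hw))          # steps 4-6: Rayleigh quotient
        res = norm([a - lam * b for a, b in zip(Hw, w)]) # step 7: residual
        if res < tol:
            return lam, w, it
    raise RuntimeError("no convergence")


# Toy matrix: 1D Laplacian, lowest eigenvalue 2 - 2 cos(pi / 9).
n = 8
H = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
lam, w, it_first = lowest_eigenpair(H, lower=0.4)

# A slightly perturbed matrix, solved from scratch and from the previous vector.
H2 = [row[:] for row in H]
H2[0][0] += 0.01
lam2_cold, _, it_cold = lowest_eigenpair(H2, lower=0.4)
lam2_warm, _, it_warm = lowest_eigenpair(H2, lower=0.4, z0=w)
```

The last two calls mimic the optional input Z0: starting from the eigenvector of the neighbouring problem should not need more filter calls than starting from a generic vector.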
Execution time for distinct platforms
An example from Density Functional Theory
[Figure: time [s] versus iteration for four platforms: 2x Intel Xeon E5-2670 (16 cores); 1x NVIDIA K20m; 2x NVIDIA K20m; 1x Xeon Phi optimized (compact).]
ChASE new implementation
C++ Library
Separation between algorithm and implementation
Few kernels for low-level tasks (Lanczos, QR, Mat-Mat multiply, etc.)
General interface
Kernels adapted to the parallel computing platform
Library templated for Real, Complex, SP and DP.
Integration into applications
FLEUR code
Jena BSE code (based on VASP)
early stage collaboration with Exciting developers
early stage collaboration with developers from AICS (Kobe)
Hamiltonian and Overlap matrices initialization in FLAPW
Hamiltonian and Overlap matrices
Matrix form
$$H = \sum_{a=1}^{N_A} \underbrace{A_a^H T_a^{[AA]} A_a}_{H_{AA}} + \underbrace{A_a^H T_a^{[AB]} B_a + B_a^H T_a^{[BA]} A_a + B_a^H T_a^{[BB]} B_a}_{H_{AB+BA+BB}}$$
$$S = \underbrace{\sum_{a=1}^{N_A} A_a^H A_a}_{S_{AA}} + \underbrace{\sum_{a=1}^{N_A} B_a^H U_a^H U_a B_a}_{S_{BB}}$$
$A_a$ and $B_a \in \mathbb{C}^{N_L \times N_G}$ while $T_a^{[\dots]} \in \mathbb{C}^{N_L \times N_L}$ and Hermitian.
$N_L = (l_{\max} + 1)^2 \le 121$, $N_G = 1{,}000 \div 50{,}000$, and $N_A$ = number of atoms.
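Constructing $S_{AA}$
An example of memory layout re-structuring
$$S_{AA} = \sum_{a=1}^{N_A} A_a^H A_a$$
Original loop, one small update per atom:
1: for a := 1 → $N_A$ do
2: $S_{AA} = S_{AA} + A_a^H A_a$   (zherk: $4 N_L N_G^2$ Flops)
3: end for
Re-structured layout: the blocks $A_1, A_2, \dots, A_{N_A}$ (each $N_L \times N_G$) are stacked vertically into one matrix $A_\star$ of size $N_A N_L \times N_G$, so that a single call suffices:
1: $S_{AA} = A_\star^H A_\star$   (zherk: $4 N_A N_L N_G^2$ Flops)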
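The equivalence of the two layouts is easy to check on toy data. The sketch below uses plain Python lists of complex numbers in place of zherk. The sizes and matrix entries are arbitrary illustrative values, and the helper names are hypothetical.

```python
def ctranspose(M):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[M[r][c].conjugate() for r in range(len(M))] for c in range(len(M[0]))]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


N_A, N_L, N_G = 3, 2, 4  # toy sizes; realistic ones are far larger
A = [[[complex(a + l - g, a * g - l) for g in range(N_G)] for l in range(N_L)]
     for a in range(N_A)]

# Loop version: N_A small rank-N_L updates.
S_loop = [[0j] * N_G for _ in range(N_G)]
for A_a in A:
    S_loop = add(S_loop, matmul(ctranspose(A_a), A_a))

# Stacked version: one product with the (N_A * N_L) x N_G matrix A_star.
A_star = [row for A_a in A for row in A_a]
S_stacked = matmul(ctranspose(A_star), A_star)

max_diff = max(abs(x - y) for rx, ry in zip(S_loop, S_stacked) for x, y in zip(rx, ry))
```

Both versions perform the same number of floating-point operations. The gain comes from issuing one large BLAS-3 call instead of many small ones.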
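Constructing $H_{AB+BA+BB}$
An example of algorithm re-structuring
$$H_{AB+BA+BB} = \sum_{a=1}^{N_A} B_a^H \left(T_a^{[BA]} A_a\right) + \left(A_a^H T_a^{[AB]}\right) B_a + \tfrac{1}{2} B_a^H \left(T_a^{[BB]} B_a\right) + \tfrac{1}{2} \left(B_a^H T_a^{[BB]}\right) B_a$$
$$= \sum_{a=1}^{N_A} B_a^H \left(T_a^{[BA]} A_a + \tfrac{1}{2} T_a^{[BB]} B_a\right) + \left(A_a^H T_a^{[AB]} + \tfrac{1}{2} B_a^H T_a^{[BB]}\right) B_a$$
1: for a := 1 → $N_A$ do
2: $Z_a = T_a^{[BA]} A_a$   (zgemm: $8 N_L^2 N_G$ Flops)
3: $Z_a = Z_a + \tfrac{1}{2} T_a^{[BB]} B_a$   (zhemm: $8 N_L^2 N_G$ Flops)
4: Stack $Z_a$ to $Z_\star$ and $B_a$ to $B_\star$
5: end for
6: $H = Z_\star^H B_\star + B_\star^H Z_\star$   (zher2k: $8 N_A N_L N_G^2$ Flops)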
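The re-structured algorithm can be compared with the term-by-term sum on toy data. The sketch assumes $T_a^{[AB]} = (T_a^{[BA]})^H$ and a Hermitian $T_a^{[BB]}$, which is what makes the second bracket the conjugate transpose of the first. All sizes, entries and helper names are illustrative, and plain Python lists stand in for zgemm, zhemm and zher2k.

```python
def ctranspose(M):
    return [[M[r][c].conjugate() for r in range(len(M))] for c in range(len(M[0]))]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scale(s, X):
    return [[s * x for x in row] for row in X]


N_A, N_L, N_G = 2, 2, 3  # toy sizes
A = [[[complex(a + l + 1, g - a) for g in range(N_G)] for l in range(N_L)] for a in range(N_A)]
B = [[[complex(g - l, a + l * g) for g in range(N_G)] for l in range(N_L)] for a in range(N_A)]
T_BA = [[[complex(i + a, j - i) for j in range(N_L)] for i in range(N_L)] for a in range(N_A)]
T_AB = [ctranspose(T) for T in T_BA]            # assumption: T_AB = T_BA^H
T_BB = [add(M, ctranspose(M)) for M in T_BA]    # Hermitian by construction

# Term-by-term reference sum.
H_ref = [[0j] * N_G for _ in range(N_G)]
for a in range(N_A):
    Ah, Bh = ctranspose(A[a]), ctranspose(B[a])
    H_ref = add(H_ref, matmul(Bh, matmul(T_BA[a], A[a])))
    H_ref = add(H_ref, matmul(matmul(Ah, T_AB[a]), B[a]))
    H_ref = add(H_ref, matmul(Bh, matmul(T_BB[a], B[a])))

# Re-structured version: build Z_a, stack, then one rank-2k style update.
Z_star, B_star = [], []
for a in range(N_A):
    Z_a = add(matmul(T_BA[a], A[a]), scale(0.5, matmul(T_BB[a], B[a])))
    Z_star += Z_a      # stack the rows of Z_a
    B_star += B[a]     # stack the rows of B_a
H_new = add(matmul(ctranspose(Z_star), B_star), matmul(ctranspose(B_star), Z_star))

max_diff = max(abs(x - y) for rx, ry in zip(H_ref, H_new) for x, y in zip(rx, ry))
```

Only the small $N_L \times N_L$ products stay inside the loop over atoms. The expensive $N_G \times N_G$ update is issued once on the stacked matrices.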
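FLEUR early optimizations
Minimization of FLOPS in FLEUR
$$S_{AA} = \frac{(4\pi)^2}{\Omega} \sum_a \exp\left[i (K_G - K_{G'}) \cdot x_a\right] \sum_{l=0}^{l_{\mathrm{sph}}} f_{l,a,G'}\, f_{l,a,G}\, W_{l,a}^2 \underbrace{\sum_{m=-l}^{l} Y_{l,m}^{*}\!\left(R_a \hat{K}_G\right) Y_{l,m}\!\left(R_a \hat{K}_{G'}\right)}_{\star}$$
The sum over m can be removed by using the well-known identity
$$P_l\!\left(\hat{K}_G \cdot \hat{K}_{G'}\right) \frac{2l+1}{4\pi} = \star$$
Removing the sum over m yields a reduction in #ops by a factor of $(l_{\mathrm{sph}} + 1) \sim 10$ but destroys the matrix form ⟹ “premature optimization”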
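The identity can be verified numerically for the lowest non-trivial case, l = 1, where $P_1(t) = t$. The sketch below hard-codes the three $Y_{1,m}$ in Cartesian form with the Condon-Shortley phase. The two test directions are arbitrary, and the helper names are illustrative.

```python
import math


def Y1(m, v):
    """Spherical harmonic Y_{1,m} evaluated at the unit vector v = (x, y, z)."""
    x, y, z = v
    if m == 0:
        return complex(math.sqrt(3.0 / (4.0 * math.pi)) * z)
    sign = -1.0 if m == 1 else 1.0  # Condon-Shortley phase
    return sign * math.sqrt(3.0 / (8.0 * math.pi)) * complex(x, m * y)


def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


a = unit((1.0, 2.0, 2.0))
b = unit((-2.0, 1.0, 0.5))

# Left-hand side: explicit sum over m = -1, 0, 1 (2l + 1 terms).
lhs = sum(Y1(m, a).conjugate() * Y1(m, b) for m in (-1, 0, 1))

# Right-hand side: (2l + 1) / (4 pi) * P_1(a . b), a single evaluation.
cos_angle = sum(p * q for p, q in zip(a, b))
rhs = 3.0 / (4.0 * math.pi) * cos_angle
```

The right-hand side needs one Legendre evaluation in place of 2l + 1 products, which is where the operation count drops. The price is that the expression can no longer be written as a product of the A matrices.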
Hamiltonian and Overlap matrices initialization
Main features
Boosted performance through highly optimized computational kernels.
Performance portability through modular implementation.
Compute-bound implementation at the cost of increased complexity (more FLOPs).
[Figure 1: Speedup of our algorithm over FLEUR with kmax = 4 and increasing parallelism. Horizontal axis: #threads; vertical axis: speedup. Curves: NaCl (Kmax = 4.0) on IvyBridge and Haswell; TiO2 (Kmax = 3.6) on IvyBridge and Haswell.]
Porting to heterogeneous architectures
Back-of-the-envelope analysis
5 lines of the algorithm constitute 97% of flops
Correspond to BLAS-3 operations (gemm, herk, her2k)
High arithmetic intensity and should fit GPUs well
First step: offload these routine calls
All 5 are BLAS kernels. Can we use some library?
cuBLAS
cuBLAS-XT
MAGMA
BLASX
Porting to heterogeneous architectures
Additional code?
3x wrappers (zgemm, zherk, zher2k)
Init and cleanup of CUDA runtime and devices
Get #devices, initialize devices, create handlers, ...
Destroy handlers, free devices, ...
Allocate data in page-locked memory
Avoid “hidden” copies
Fast data transfer
Only around 100 lines of additional code
Experimental results
Test case 2: AuAg ($N_A$ = 108, $N_L$ = 121, $N_G$ = [3275−13379])
[Figure: time (seconds) versus Kmax (2.5 to 4) for three configurations: CPU; CPU / 1 GPU; CPU / 2 GPUs. Annotated speedups: 3.03x and 4.96x.]
Sandy Bridge: CPU: E5-2650, 2 x 8 cores, 2.0 GHz, 64 GB RAM, 2 Nvidia Tesla K20X
Peak performance: 256 GFlop/s + 2 x 1.3 TFlop/s
Conclusions
Modernizing algorithm structure of legacy code is critical
Tailoring algorithms to application-specific tasks
Exploiting knowledge from domain-specific applications
Layered design built on top of standardized libraries
Increase in performance
(Almost) free lunch ⇒ performance portability
Thank you
For more information:
http://www.fz-juelich.de/ias/jsc/slqm