Slide 1 U N C L A S S I F I E D
Distributed Graph-based Density Matrix Calculation for Quantum
Molecular Dynamics using GPUs
April 4-7, 2016
2016 GPU Technology Conference
S. M. Mniszewski, C. F. A. Negre, M. J. Cawkwell, A. M. N. Niklasson
Los Alamos National Laboratory
Slide 2
Why Molecular Dynamics?
Slide 3
Background
• In molecular dynamics simulation, the relative positions of atoms evolve over a series of time steps according to the force acting on each atom
• Employed in materials science, chemistry, and biology to study structures, defects, and equilibrium and non-equilibrium phenomena
• Depends on an interatomic potential to calculate forces and energies
• Quantum-based models capture the making and breaking of covalent bonds, charge transfer between species of differing electronegativities, and long-range electrostatic interactions, which are needed to describe chemical reactions
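The time-stepping idea in the first bullet can be illustrated with a minimal sketch. It assumes a velocity Verlet integrator and a one-dimensional harmonic spring as a stand-in interatomic potential; the slides do not name an integrator, and `velocity_verlet`, `harmonic_force`, and `K_SPRING` are hypothetical names chosen for illustration.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance one particle in 1-D with the velocity Verlet scheme."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half kick with the old force
        x = x + dt * v_half                # drift to the new position
        f = force(x)                       # new force from the potential
        v = v_half + 0.5 * dt * f / mass   # half kick with the new force
        trajectory.append((x, v))
    return trajectory


# Hypothetical potential: harmonic spring U(x) = 0.5 * K_SPRING * x**2.
K_SPRING = 1.0


def harmonic_force(x):
    return -K_SPRING * x


def total_energy(x, v, mass=1.0):
    return 0.5 * mass * v * v + 0.5 * K_SPRING * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
print(f"energy drift over {len(traj) - 1} steps: {e_end - e_start:.2e}")
```

In a quantum-based model the `force` callback would be replaced by forces derived from the electronic structure at each step, which is where the cost discussed in the following slides arises.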
Slide 4
Integrate the equations of motion of classical molecular trajectories
with the forces calculated on the fly from a self-consistent quantum mechanical description of the electronic structure:
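$$M_I \ddot{R}_I = -\frac{\partial U(R; \rho_{sc})}{\partial R_I}$$
$$H[\rho]\,\psi_i = \epsilon_i \psi_i$$
$$\rho = \sum_{\mathrm{occ.}} |\psi_i|^2 \to \rho_{sc}$$
[Figure: repeated SCF cycles taking $U(R; \rho)$ to $U(R; \rho_{sc})$.]
Born-Oppenheimer MD
#SCF × $O(N^3)$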
Quantum Molecular Dynamics (QMD)
Slide 5
Example: Biosynthesis of Histidine
• What is responsible for the biosynthesis of histidine?
• Present in several pathogenic bacteria
• Allosteric mechanism was determined with MD simulation
• Several thousand atoms over timescales of 100s of ns!
Rivalta I, Sultan MM, Lee N-S, Manley GA, Loria JP, Batista VS. PNAS, 2012 109(22), E1428-E1436.
Slide 6
Future QMD Simulations
IAPP dimerization and type-2 diabetes (537 atoms).
Beta Amyloid Peptide and Alzheimer’s Disease (410 atoms).
Raffa DF, Rauk A, J. Phys. Chem. B, 2007, 111 (14), 3789-99.
Dupuis NF, Wu C, Shea J-E, Bowers MT, J. Am. Chem. Soc., 2011, 133 (19), 7240-43.
Slide 7
Computational Cost
• High computational cost and complexity of QMD calculations
• The most expensive part of each MD time step is the density matrix construction
• Breakthrough: the second-order spectral projection (SP2) algorithm
• Use of hybrid parallelism on GPU-accelerated clusters
[Figure: wall-clock time per MD time step vs. system size (N). Curves: O(N^3) diagonalization; O(N) regular linear scaling; O(N) low pre-factor.]
Significant pre-factor reduction is required for practical large-scale QMD!
Quantum Molecular Dynamics
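Why the pre-factor matters can be seen from a small arithmetic sketch. It assumes idealized cost models c3·N^3 for diagonalization and c1·N for a linear-scaling method; the pre-factor values below are invented for illustration and are not taken from the slides.

```python
def crossover_size(cubic_prefactor, linear_prefactor):
    """System size N at which c3 * N**3 equals c1 * N, i.e. N = sqrt(c1 / c3)."""
    return (linear_prefactor / cubic_prefactor) ** 0.5


# Hypothetical pre-factors in seconds (illustrative only).
C3_DIAG = 1e-9          # cost of diagonalization is C3_DIAG * N**3
C1_REGULAR = 1e-2       # regular linear scaling: C1_REGULAR * N
C1_LOW = 1e-4           # low pre-factor linear scaling: C1_LOW * N

n_regular = crossover_size(C3_DIAG, C1_REGULAR)
n_low = crossover_size(C3_DIAG, C1_LOW)
print(f"regular O(N) wins above N ~ {n_regular:.0f}")
print(f"low pre-factor O(N) wins above N ~ {n_low:.0f}")
```

Under these assumptions, a 100-fold reduction in the linear pre-factor moves the crossover to systems 10 times smaller, so the linear-scaling method pays off at practical system sizes.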
Slide 8
The Density Matrix Computation
• Typically, algorithms used in quantum-based models, most notably matrix diagonalization, are not ideally suited to GPUs
  – Due to their complexity
  – Difficulty in extracting thread-level parallelism
  – Difficulty of avoiding branching within warps
• New SP2 approach
  – Computed directly from the Hamiltonian through a recursive expansion of the Fermi operator with the second-order spectral projection (SP2) algorithm
  – Based on a series of generalized matrix-matrix multiplications
  – Only one matrix-matrix multiplication is required per iteration
  – Maps very well to GPUs
Slide 9
[Figure: f(x) vs. x on [0, 1] for the projections f(x) = x^2 and f(x) = 2x - x^2, and for the composition f_8(f_7(...f_1(x)...)).]
The Second Order Spectral Projection Algorithm (SP2) – Reduced Complexity
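$$\rho = \theta[\mu I - H] = \lim_{i \to \infty} f_i[f_{i-1}[\ldots f_0[X_0] \ldots]]$$
$$X_0 = \frac{\varepsilon_{\max} I - H}{\varepsilon_{\max} - \varepsilon_{\min}}$$
$$f_i[X_i] = \begin{cases} X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] \ge N_e \\ 2X_i - X_i^2 & \text{if } 2\,\mathrm{Tr}[X_i] < N_e \end{cases}$$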
Recursive Fermi Operator expansion
Niklasson AMN, Phys. Rev. B 66, 155115 (2002).
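The effect of the recursion is easiest to see on the eigenvalues alone. The sketch below applies the two projections to a list of scalar eigenvalues rather than to a matrix, choosing between x^2 and 2x - x^2 with the trace rule from this slide. The eigenvalues and electron count are invented for illustration, and `sp2_scalar` is a hypothetical name.

```python
def sp2_scalar(eigenvalues, n_electrons, n_iter=40):
    """Apply the SP2 recursion to eigenvalues; returns occupation factors in [0, 1]."""
    e_min, e_max = min(eigenvalues), max(eigenvalues)
    # X0 maps the spectrum into [0, 1], with the lowest energy mapped to 1.
    x = [(e_max - e) / (e_max - e_min) for e in eigenvalues]
    for _ in range(n_iter):
        if 2.0 * sum(x) >= n_electrons:
            x = [xi * xi for xi in x]               # x^2 lowers the trace
        else:
            x = [2.0 * xi - xi * xi for xi in x]    # 2x - x^2 raises the trace
    return x


# Hypothetical spectrum with 4 electrons, i.e. two doubly occupied states.
occupations = sp2_scalar([-2.0, -1.0, 0.5, 3.0], n_electrons=4)
print([round(o, 6) for o in occupations])
```

Both projections keep values inside [0, 1] and preserve their ordering, so the two lowest energies are driven to 1 and the rest to 0, which is the step function θ[μI − H] evaluated on the spectrum.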
Slide 10
SP2 Algorithm Using the GPU Approach
Estimate εmax and εmin
X = (εmaxI-H)/(εmax-εmin)
TraceX = Tr[X]                         /* Trace kernel on GPU */
Until converged do
    Xtmp = X
    Xtmp = -X^2 + Xtmp                 /* CUBLAS xGEMM */
    TraceXtmp = Tr[Xtmp]               /* Trace kernel on GPU */
    if |2TraceX - 2TraceXtmp - Ne| > |2TraceX + 2TraceXtmp - Ne|
        X = X + Xtmp                   /* CUBLAS xAXPY */
        TraceX = TraceX + TraceXtmp
    else
        X = X - Xtmp                   /* CUBLAS xAXPY */
        TraceX = TraceX - TraceXtmp
end until
ρ = X
Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
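The loop on this slide can be traced on the CPU with a small pure-Python sketch. Two details are assumptions that the slide leaves open: the spectral bounds are estimated with Gershgorin circles, and convergence is declared when Tr[X − X^2] falls below a tolerance. The helper names and the 3×3 Hamiltonian are hypothetical.

```python
def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def gershgorin_bounds(h):
    """Cheap spectral bounds (assumed estimator for eps_min and eps_max)."""
    radii = [sum(abs(v) for j, v in enumerate(row) if j != i) for i, row in enumerate(h)]
    return (min(h[i][i] - r for i, r in enumerate(radii)),
            max(h[i][i] + r for i, r in enumerate(radii)))


def sp2_dense(h, n_electrons, tol=1e-12, max_iter=100):
    n = len(h)
    e_min, e_max = gershgorin_bounds(h)
    # X = (eps_max * I - H) / (eps_max - eps_min)
    x = [[((e_max if i == j else 0.0) - h[i][j]) / (e_max - e_min) for j in range(n)]
         for i in range(n)]
    trace_x = trace(x)
    for _ in range(max_iter):
        x2 = matmul(x, x)
        # Xtmp = X - X^2 (one matrix-matrix multiplication per iteration)
        x_tmp = [[x[i][j] - x2[i][j] for j in range(n)] for i in range(n)]
        trace_tmp = trace(x_tmp)
        if abs(trace_tmp) < tol:       # assumed test: X is idempotent
            break
        if abs(2 * trace_x - 2 * trace_tmp - n_electrons) > \
                abs(2 * trace_x + 2 * trace_tmp - n_electrons):
            sign = 1.0                 # X <- X + Xtmp = 2X - X^2
        else:
            sign = -1.0                # X <- X - Xtmp = X^2
        x = [[x[i][j] + sign * x_tmp[i][j] for j in range(n)] for i in range(n)]
        trace_x += sign * trace_tmp
    return x


# Hypothetical 3-orbital Hamiltonian with 2 electrons (one doubly occupied state).
h_example = [[0.0, 1.0, 0.0],
             [1.0, 3.0, 0.0],
             [0.0, 0.0, 5.0]]
rho = sp2_dense(h_example, n_electrons=2)
```

The result should be idempotent with trace Ne/2, and it projects onto the lowest eigenvector of `h_example`. On a GPU the `matmul` and the element-wise update would map to the xGEMM and xAXPY calls noted in the pseudocode.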
Slide 11
Density Matrix Calculation (Nvidia M2090) – Liquid Methane (10 – 1250 molecules)
[Figure: time per density matrix build (s), 0 to 90, vs. matrix dimension, 0 to 10000. Curves: SP2 on 1 GPU; SP2 on 2 GPUs; SP2 on 3 GPUs; SP2 on CPU; diagonalization.]
Cawkwell MJ, Mniszewski SM, Niklasson AMN, Fast Quantum Molecular Dynamics on Multi-GPU Architectures in LATTE, GTC 2013.
Slide 12
Sparse Matrix SP2 – ELLPACK-R Format
• Described by 3 arrays: 2-D values, 2-D column indices, and a 1-D count of non-zero entries per row
• N rows and at most M non-zeros per row; O(NM^2) computational complexity
• Row-wise storage for parallelism opportunities
• No insertion cost compared to CSR
[Figure: a dense matrix and its ELLPACK-R sparse representation, consisting of a Values array, a Columns array, and the number of non-zeros per row.]
Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
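A minimal sketch of the three-array layout and of a row-wise sparse product follows. It is an illustration under stated assumptions, not the implementation from the cited paper: the function names are hypothetical, each output row is accumulated in a dictionary, and the test matrix is invented.

```python
def dense_to_ellpack_r(dense, max_nnz):
    """Build the three ELLPACK-R arrays: values, column indices, non-zeros per row."""
    n = len(dense)
    values = [[0.0] * max_nnz for _ in range(n)]
    columns = [[0] * max_nnz for _ in range(n)]
    nnz = [0] * n
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0.0:
                if nnz[i] == max_nnz:
                    raise ValueError("row %d exceeds max_nnz" % i)
                values[i][nnz[i]] = v
                columns[i][nnz[i]] = j
                nnz[i] += 1
    return values, columns, nnz


def ellpack_r_multiply(a, b, max_nnz):
    """C = A * B in ELLPACK-R. Rows are independent, so they could run in parallel."""
    (av, ac, an), (bv, bc, bn) = a, b
    n = len(an)
    cv = [[0.0] * max_nnz for _ in range(n)]
    cc = [[0] * max_nnz for _ in range(n)]
    cn = [0] * n
    for i in range(n):
        acc = {}                                  # column -> partial sum for row i
        for p in range(an[i]):                    # at most M entries of row i of A
            k, a_ik = ac[i][p], av[i][p]
            for q in range(bn[k]):                # at most M entries of row k of B
                j = bc[k][q]
                acc[j] = acc.get(j, 0.0) + a_ik * bv[k][q]
        if len(acc) > max_nnz:
            raise ValueError("row %d exceeds max_nnz" % i)
        for slot, j in enumerate(sorted(acc)):
            cv[i][slot], cc[i][slot] = acc[j], j
        cn[i] = len(acc)
    return cv, cc, cn


def ellpack_r_to_dense(ell, n_cols):
    values, columns, nnz = ell
    dense = [[0.0] * n_cols for _ in nnz]
    for i, count in enumerate(nnz):
        for slot in range(count):
            dense[i][columns[i][slot]] = values[i][slot]
    return dense


# Hypothetical tridiagonal test matrix; its square is needed in each SP2 step.
a_dense = [[1.0, 2.0, 0.0, 0.0],
           [2.0, 1.0, 2.0, 0.0],
           [0.0, 2.0, 1.0, 2.0],
           [0.0, 0.0, 2.0, 1.0]]
a_ell = dense_to_ellpack_r(a_dense, max_nnz=4)
a_sq = ellpack_r_multiply(a_ell, a_ell, max_nnz=4)
```

Each output row touches at most M × M products, which is where the O(NM^2) cost in the bullet above comes from. Appending an entry only writes into the next free slot of that row, with no shifting of later rows as CSR would require.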
Slide 13
SP2 Shared Memory – Significant Cost Reduction
[Figure: density matrix construction time (s), 0 to 6, vs. number of atoms, 0 to 6000, for polyethylene. Curves: Diag. serial; Diag. 16 threads; SP2, CSR serial; SP2, ELL 1 thread; SP2, ELL 4 threads; SP2, ELL 16 threads.]
Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
Slide 14
Shared Memory SP2 – TRP Cage Protein
[Figure: two panels of total energy (eV) vs. time (ps). NVT panel: 0 to 15 ps, energy axis from -39050 to -38850 eV. NVE panel: 6 to 18 ps, energy axis from -39028.1 to -39027.8 eV.]
Mniszewski SM, Cawkwell MJ, Wall ME, Mohd-Yosuf J, Bock N, Germann TG, Niklasson AMN, Efficient Parallel Linear Scaling Construction of the Density Matrix for Born–Oppenheimer Molecular Dynamics, J. Chem. Theory Comput., 2015, 11 (10), pp 4644–4654.
303 atom Trp Cage Protein solvated by 2682 water molecules (8349 atoms)
LATTE Simulation – 18.8 ps
(Thermalization)
Microcanonical Simulation
Slide 15
Sparse Matrix Algebra
Divide and Conquer
Graph Theory
Graph-based Electronic Structure Theory
Slide 16
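[Figure: data dependency graph $S_\tau$ with the subgraph $s_\tau^{(i)}$ of vertex $i$, showing its core and halo.]
$$S_\tau = \left\{ s_\tau^{(i)} \right\}_{i=1}^{N}$$
$$H = \left\{ h[s_\tau^{(i)}] \right\}_{i=1}^{N}$$
Recursive Fermi-operator expansion
$$D_\tau = \left\{ \lim_{n \to \infty} f_n(f_{n-1}(\ldots f_0(h[s_\tau^{(i)}]) \ldots)) \right\}_{i=1}^{N}$$
$S_\tau$ and $\lfloor \text{Fermi Operator Expansion} \rfloor_\tau$ (global): Exact Relation!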
Graph-based SP2
Slide 17
Graph-based SP2 – Hybrid approach
On GPUs!
Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.
Slide 18
Graph Partitioning SP2
[Figure: graph of H, partitioned graph of H, a subgraph of H (base and halo), and the corresponding submatrix of H/D. Labels: $H_t$, $D_{t-1}$, $D_t$.]
Graph Partitioning
• Structure-based
• Graph-based
• Hypergraph-based
• Community-based
Subgraph Processing
1. Determine core+halo
2. Extract submatrix
3. Run dense SP2
4. Collect into next D
Trivial parallelism based on dense matrix algebra at BLAS3 performance
Partitions are sets of core rows/orbitals of H
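The four subgraph-processing steps can be sketched in pure Python as follows. Several choices are assumptions made for this illustration: the halo is taken as the direct neighbours of the core in the non-zero pattern of H, the spectral bounds come from Gershgorin circles on the full H, and all subgraphs follow the same sequence of projections, selected from the trace summed over core diagonals. All function names and the example Hamiltonian are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gershgorin_bounds(h):
    radii = [sum(abs(v) for j, v in enumerate(row) if j != i) for i, row in enumerate(h)]
    return (min(h[i][i] - r for i, r in enumerate(radii)),
            max(h[i][i] + r for i, r in enumerate(radii)))


def core_plus_halo(h, core):
    """Step 1: core vertices followed by every vertex coupled to them through H."""
    halo = sorted({j for i in core for j, v in enumerate(h[i]) if v != 0.0} - set(core))
    return list(core) + halo


def graph_sp2(h, partitions, n_electrons, tol=1e-12, max_iter=100):
    e_min, e_max = gershgorin_bounds(h)          # shared by every subgraph
    scale = e_max - e_min
    subgraphs = []
    for core in partitions:
        idx = core_plus_halo(h, core)
        # Step 2: extract the dense submatrix and rescale it to X0.
        x = [[((e_max if a == b else 0.0) - h[a][b]) / scale for b in idx] for a in idx]
        subgraphs.append([idx, len(core), x])
    # Step 3: dense SP2 on every subgraph, with one shared projection sequence.
    for _ in range(max_iter):
        squares = [matmul(sub[2], sub[2]) for sub in subgraphs]   # independent work
        tr_x = sum(sub[2][i][i] for sub in subgraphs for i in range(sub[1]))
        tr_x2 = sum(x2[i][i] for sub, x2 in zip(subgraphs, squares) for i in range(sub[1]))
        if abs(tr_x - tr_x2) < tol:              # assumed convergence test
            break
        square_step = 2.0 * tr_x >= n_electrons
        for sub, x2 in zip(subgraphs, squares):
            if square_step:
                sub[2] = x2
            else:
                sub[2] = [[2.0 * u - v for u, v in zip(rx, rx2)]
                          for rx, rx2 in zip(sub[2], x2)]
    # Step 4: collect the core rows of each subgraph into the full density matrix.
    d = [[0.0] * len(h) for _ in h]
    for idx, n_core, x in subgraphs:
        for a in range(n_core):
            for b, col in enumerate(idx):
                d[idx[a]][col] = x[a][b]
    return d


# Hypothetical 4-orbital Hamiltonian with two uncoupled fragments and 4 electrons.
h_example = [[0.0, 1.0, 0.0, 0.0],
             [1.0, 3.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.5],
             [0.0, 0.0, 0.5, 4.0]]
d_graph = graph_sp2(h_example, partitions=[[0], [1], [2, 3]], n_electrons=4)
```

In this small example every halo spans the whole connected fragment, so the collected matrix should match a full dense SP2 run. For a connected system the halos are truncated, and the accuracy then depends on how the graph is built.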
Slide 19
Subgraph Processing
For all subgraphs: determine elements, extract submatrix, run SP2, assemble into $D_t$
• Trivial parallelism
• Same HOMO-LUMO sequence through SP2
• Dense matrix & communication-free SP2
• Small subgraphs: process single-threaded
• Large subgraphs: process multi-threaded
• Tunable accuracy via thresholds
Slide 20
Liquid water $(\mathrm{H_2O})_{100}$
DFTB-LATTE
Graph-based XL Born-Oppenheimer Molecular Dynamics (XL-BOMD)
Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.
Slide 21
[Figure: $\|D - D_{exact}\|_F$ per atom (1×10^-5 to 1×10^-1) vs. numerical threshold $\tau$ (1×10^-8 to 1×10^-4) for 512, 1024, and 2048 subgraphs.]
Polyalanine in water (20,000 atoms)
Graph-based SP2 – Tunable accuracy
Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.
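The error measure on this slide can be sketched as follows. The sketch assumes that "per atom" means the Frobenius norm divided by the number of atoms and that thresholding drops matrix elements with magnitude below τ. The 3×3 matrix is invented and is not data from the slides.

```python
def apply_threshold(matrix, tau):
    """Zero every element whose magnitude is below the numerical threshold tau."""
    return [[v if abs(v) >= tau else 0.0 for v in row] for row in matrix]


def frobenius_error_per_atom(d, d_exact, n_atoms):
    """||D - D_exact||_F divided by the atom count (assumed normalization)."""
    squared = sum((a - b) ** 2 for ra, rb in zip(d, d_exact) for a, b in zip(ra, rb))
    return squared ** 0.5 / n_atoms


# Hypothetical density matrix for 3 atoms with one orbital each.
d_exact = [[0.5, 3e-4, 2e-6],
           [3e-4, 0.5, 3e-4],
           [2e-6, 3e-4, 0.5]]

errors = {}
for tau in (1e-8, 1e-5, 1e-3):
    errors[tau] = frobenius_error_per_atom(apply_threshold(d_exact, tau), d_exact, n_atoms=3)
    print(f"tau = {tau:.0e}: error per atom = {errors[tau]:.3e}")
```

A larger threshold discards more small elements and raises the error, which is the trade-off the plot describes.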
Slide 22
[Figure: SP2 run time (s), 0.5 to 128 on a log scale, vs. number of communities, 64 to 4096. Curves: 1 CPU SpM Alg (MKL); 1 CPU Graph Part.; 1 GPU Graph Part.; 16 GPU Graph Part.; 32 GPU Graph Part.]
Graph-based SP2 – Distributed GPU Performance
Density matrix calculations for polyalanine in water (20,000 atoms)
• 64-4096 METIS partitions
• Single Nvidia Tesla M2090 GPU per node
• Partition core+halo sizes vary, load-balancing required
• 16,384 GPU threads, still perfect strong scaling
• Best - ~25 µs/atom
Niklasson AMN, et al, Graph-based Linear Scaling Electronic Structure Theory, http://arxiv.org/abs/1603.00937, 2016.
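The load-balancing bullet can be illustrated with one possible scheme. The slides do not say which scheme was used; the sketch below assumes a greedy largest-first assignment with cost proportional to the cube of the core+halo size, as for dense matrix multiplication. The sizes and the function name are hypothetical.

```python
import heapq


def balance_subgraphs(sizes, n_devices):
    """Greedy largest-first assignment of subgraphs to devices."""
    heap = [(0.0, dev) for dev in range(n_devices)]   # (accumulated cost, device id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_devices)]
    for sub in sorted(range(len(sizes)), key=lambda s: sizes[s], reverse=True):
        load, dev = heapq.heappop(heap)               # least-loaded device so far
        assignment[dev].append(sub)
        heapq.heappush(heap, (load + float(sizes[sub]) ** 3, dev))
    loads = [sum(float(sizes[s]) ** 3 for s in subs) for subs in assignment]
    return assignment, loads


# Hypothetical core+halo sizes (orbitals) for 8 partitions on 2 GPUs.
sizes = [400, 120, 380, 150, 300, 310, 90, 200]
assignment, loads = balance_subgraphs(sizes, n_devices=2)
print(assignment, [f"{load:.3e}" for load in loads])
```

Because the cost grows with the cube of the size, a few large subgraphs dominate, so they are placed first and the small ones fill the remaining gaps.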
Slide 23
Graph-based SP2 – Distributed CPU vs. GPU
• 1024 and 2048 METIS partitions
• 16-256 CPU/GPU nodes
• MKL vs. cuBLAS
• Similar for 1024 and 2048 partitions
• Near linear scaling
• Speedup of 1.7X on GPUs
• Best: 5 µs/atom
Density matrix calculations for polyalanine in water (20,000 atoms)
Slide 24
Basic Matrix Library (BML) & PROGRESS Library
[Figure: BML library layer diagram with the labels below.]
bml C API
● bml_multiply()
● bml_add()
● ...
bml Fortran API
● call bml_multiply()
● call bml_add()
● ...
dense matrix, sparse ELLPACK matrix, sparse CSR matrix
CPU/SMP, GPGPU, MPI
BML library core
Uses BLAS Level 2 and 3 routines.
BML available at http://qmmd.github.io/bml/ under the BSD 3-clause license
Slide 25
Summary
• Distributed graph-based SP2 provides significant speedup using distributed GPU-accelerated architectures
• Available as part of BML & PROGRESS libraries
• Allows for QMD simulations of larger systems and longer timeframes than previously possible