TRANSCRIPT
Porting the ONETEP Linear-Scaling Quantum Chemistry Code to Massively Parallel Heterogeneous Architectures
Karl Wilkinson
Skylaris Group, University of Southampton
Outline
1. Computational Chemistry
2. Heterogeneous Architectures and Programming models
3. Order-N Electronic Total Energy Package (ONETEP)
4. GPU Code Developments and Performance
5. Hybrid MPI-OpenMP Parallelism
6. Future Work
7. Acknowledgements
COMPUTATIONAL CHEMISTRY
Computational Chemistry
§ Trade-offs:
– Accuracy
– System size
– Level of sampling
§ Drug design
– Lead structure design
– Protein-lead structure interactions
§ Materials
– Epoxidised linseed oil for ship hulls
§ Catalysis
§ Energy research
– Enzyme-biomass interactions
[Figure: the three trade-off axes]
– Accuracy: low = molecular mechanics, high = ab initio
– System size: low = tens of atoms, high = millions of atoms
– Sampling: low = single-point energy, high = free-energy surface
HETEROGENEOUS ARCHITECTURES AND PROGRAMMING MODELS
Heterogeneous Compute Architectures
§ Heterogeneous: multiple distinct types of processors
– More than 10% of the top 500 supercomputers use some form of accelerator
– Typically Central Processing Units (CPUs) coupled with Graphics Processing Units (GPUs)
– Common in high-powered workstations
– Intel Xeon Phis emerging
§ Development considerations
– Generally need to adapt code to take advantage of the architecture
– A single code will contain multiple algorithms, each executed on the most suitable processing unit
– May result in new bottlenecks (PCIe data transfer)
[Images: multicore CPUs and GPU-based accelerator boards]
Parallelism in Heterogeneous Machines
[Figure: hierarchy of parallelism — O(100-10,000) nodes per machine; each node has CPU sockets (8-16 CPU cores) with PCIe-attached GPUs; each GPU has O(10) streaming multiprocessors, each with 32-192 CUDA cores (SP/DP)]
Titan:
– 18,688 nodes
– 18,688 CPUs (16 cores each): 299,008 CPU cores
– 18,688 K20X GPUs (14 multiprocessors each): 261,632 multiprocessors
– 560,640 total cores (CPU cores plus GPU multiprocessors)
Memory in Heterogeneous Machines
[Figure: memory hierarchy of a two-socket node — each CPU socket has its core caches and system RAM, the sockets are linked by QPI, and each PCIe-attached GPU has its own global GPU memory and multiprocessor memory; the host-device link is the typical data transfer bottleneck]
§ Complex memory hierarchy
§ GPUs designed to hide the movement of data on the GPU itself
§ Major issues:
– Host-device data transfer
– CPU speed for data-parallel execution
– GPU speed for serial execution
§ Hardware manufacturers are addressing these issues, creating uncertainty over future hardware:
– CPUs becoming more like GPUs (increasing core count)
– GPUs gaining CPU-like functionality (Project Denver)
– Hybrid "system-on-chip" models
Motivation for choosing OpenACC
§ Hardware considerations
– Multiple hardware models already available
– Uncertainty over future hardware models
§ Development considerations
– Complexity of hardware: the average developer is a domain specialist
– Porting existing code rather than writing from scratch
– Avoid changing the original version of the code
– Avoid mixing languages
– Avoid reinventing the wheel
– Ensure maintainability of code
§ Solutions
– Ensure portability and future-proofing
– Abstractions help to protect the developer
– Pragmas and directives: PGI Accelerator, OpenACC
ORDER-N ELECTRONIC TOTAL ENERGY PACKAGE (ONETEP)
ONETEP
§ Commercial code, distributed by Accelrys as part of Materials Studio
§ Actively developed by a number of research groups
– Central SVN repository
§ Standard Fortran 95
§ 233,032 lines of code
§ Parallel design from the onset
– Message Passing Interface (MPI)
– Aiming to extend beyond MPI parallelism
ONETEP
§ Highly accurate linear-scaling quantum chemistry package; cost scales linearly with:
– Number of atoms
– Number of cores
– Number of iterations maintained as the system grows
§ Wide range of functionality
– Applicable to a wide range of problems
– Used for materials science and biological systems
§ Scales to tens of thousands of atoms
§ Runs on hundreds of cores using MPI
ONETEP – Brief Theory
§ Utilizes a reformulation of Density Functional Theory in terms of the one-particle density matrix
– Broad strokes: the key parameter is the electron density
– Electrons exist within atomic orbitals; orbitals are represented by mathematical functions (φ)
§ Scaling with the number of atoms achieved through exploitation of the "nearsightedness" of electrons in many-atom systems
– Localized density kernel, K, and functions (φ)
– Not sums over all possible φα φβ pairs
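Concretely, the ONETEP literature writes the density matrix in the separable form below; truncating the kernel K beyond a spatial cutoff is what removes the sum over all φα φβ pairs:

\rho(\mathbf{r},\mathbf{r}') = \sum_{\alpha\beta} \phi_\alpha(\mathbf{r})\, K^{\alpha\beta}\, \phi_\beta^{*}(\mathbf{r}'), \qquad n(\mathbf{r}) = \rho(\mathbf{r},\mathbf{r})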
ONETEP – Methodology
§ Iterative process (a schematic of this two-level loop is sketched below)
– Optimize the NGWFs (φ(r))
– Optimize the density kernel (K)
§ Strictly localized non-orthogonal generalized Wannier functions (NGWFs) (φ(r)) used to represent atomic orbitals
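A schematic of the two-level loop, with dummy stand-in routines of my own (kernel_step and ngwf_step are not ONETEP's interfaces, and the energy updates are fake placeholders):

! Schematic of the two-level optimization: an inner loop over density
! kernel steps nested inside an outer loop over NGWF steps.
program two_level_loop
  implicit none
  integer :: ngwf_iter, kernel_iter
  real(8) :: energy, e_prev
  energy = 0.0d0
  do ngwf_iter = 1, 30                        ! outer: optimize the NGWFs
     do kernel_iter = 1, 10                   ! inner: optimize the kernel K
        call kernel_step(energy)
     end do
     e_prev = energy
     call ngwf_step(energy)
     if (abs(energy - e_prev) < 1.0d-8) exit  ! NGWF convergence test
  end do
contains
  subroutine kernel_step(e)                   ! placeholder kernel update
    real(8), intent(inout) :: e
    e = e - 1.0d-2 / ngwf_iter**2
  end subroutine
  subroutine ngwf_step(e)                     ! placeholder NGWF update
    real(8), intent(inout) :: e
    e = e - 1.0d-3 / ngwf_iter**2
  end subroutine
end program two_level_loop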
ONETEP – Methodology
§ NGWFs are expressed on a real-space grid
– Technically: a psinc basis set, equivalent to a plane wave basis set
[Figure: simulation cell (N1 and N2 directions) with a computationally efficient FFT box centered on atom A]
§ Localized operations performed within FFT boxes
§ FFT box: 3x the diameter of the largest localized function
§ Parallel scaling achieved through decomposition of data across processors
§ Three types of data:
– Atomic
– Simulation cell
– Sparse matrix
ONETEP – Data distribution and movement
§ Atomic (FFT box) operations produce data for simulation cell operations and vice versa
§ Memory-efficient data abstraction required
ONETEP – Data distribution and movement
§ The parallelepiped (PPD) is the basic unit of data used in this process
§ Data contiguity within PPDs is inconsistent relative to other representations (see the sketch below)
§ A natural barrier to initial areas of accelerated code
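To illustrate the contiguity problem, a minimal sketch of gathering one PPD out of an FFT box into contiguous storage (array names, extents, and the routine itself are hypothetical, not ONETEP's):

! Gather a p1 x p2 x p3 parallelepiped (PPD) starting at (o1,o2,o3)
! from a larger FFT box into a contiguous 1-D buffer. Only the innermost
! dimension is contiguous in memory, so the copy proceeds column by column.
subroutine gather_ppd(fftbox, n1, n2, n3, o1, o2, o3, p1, p2, p3, buf)
  implicit none
  integer, intent(in)  :: n1, n2, n3, o1, o2, o3, p1, p2, p3
  real(8), intent(in)  :: fftbox(n1, n2, n3)
  real(8), intent(out) :: buf(p1*p2*p3)
  integer :: j, k, base
  do k = 0, p3 - 1
     do j = 0, p2 - 1
        base = (k*p2 + j) * p1
        ! one contiguous run of p1 elements per (j,k) column
        buf(base+1 : base+p1) = fftbox(o1 : o1+p1-1, o2+j, o3+k)
     end do
  end do
end subroutine gather_ppd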
ONETEP – Profiling
§ Small chemical system by ONETEP standards (Tennis ball dimer: 181 atoms)
§ Basic calculation
§ Representative level of accuracy
§ Most common density kernel optimization scheme
§ Single iteration of NGWF optimization (over-represents the NGWF gradient step)
CODE DEVELOPMENTS
Scope of the Developments
§ Focus on atom-localized operations: FFT boxes
§ Focus on core algorithms that will benefit the full range of functionality
§ Maintain data structures and current communication schemes
§ Minimize disruption to the rest of the code
Submitted to the Journal of Computational Chemistry
Naive Implementation
§ Simple switch from a CPU FFT library (MKL/FFTW/ACML) to a GPU library (CUFFT)
§ No consideration given to data transfer optimization
§ Little/no benefit
§ Extend to other areas of code…
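The reason the naive swap gains little: every FFT is bracketed by a host-to-device and a device-to-host transfer across PCIe. A sketch of that pattern in CUDA Fortran, assuming PGI's cufft interface module (do_fft_on_gpu is an illustrative name, and the plan is assumed created beforehand with cufftPlan3d; this is not the ONETEP routine):

! Naive pattern: each FFT pays two PCIe transfers, so transfer time
! swamps the speedup of the transform itself.
subroutine do_fft_on_gpu(host_data, n1, n2, n3, plan)
  use cudafor
  use cufft
  implicit none
  integer, intent(in) :: n1, n2, n3
  integer, intent(in) :: plan                       ! made with cufftPlan3d
  complex(8), intent(inout) :: host_data(n1, n2, n3)
  complex(8), device, allocatable :: dev_data(:,:,:)
  integer :: ierr
  allocate(dev_data(n1, n2, n3))
  dev_data = host_data                              ! host -> device (PCIe)
  ierr = cufftExecZ2Z(plan, dev_data, dev_data, CUFFT_FORWARD)
  host_data = dev_data                              ! device -> host (PCIe)
  deallocate(dev_data)
end subroutine do_fft_on_gpu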
Implementation Notes
§ Hybrid PGI Accelerator/CUDA Fortran
– PGI Accelerator for kernels
– CUDA Fortran for data transfers
§ CUFFT library
§ Very straightforward code, no complex use of pragmas
§ PGI Accelerator ≃ OpenACC
§ Asynchronous data transfers by default in "pure" PGI Accelerator
– The hybrid model breaks this
Data Transfer Implementation
Charge Density (36% of runtime)
Diagonal elements of the one-particle density matrix: n(r) = ρ(r, r).
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Calculate product and sum
7. Deposit charge density data in simulation cell
Charge Density (36% of runtime)
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space
! Data transfer to GPU not shown
!$acc data region
!$acc region
coarse_work(1:n1,1:n2,1:n3) = scalefac * cmplx( &
     fftbox_batch(:,:,:,is,3,batch_count), &
     fftbox_batch(:,:,:,is,4,batch_count), kind=DP)
!$acc end region
! Fourier transform to reciprocal space on coarse grid
call cufftExec(cufftplan_coarse, planType, coarse_work, &
     coarse_work, CUFFT_FORWARD)
Charge Density (36% of runtime)
4. Up-sample to fine grid (avoid aliasing errors)
• Populate the corners of a cube (see the sketch after step 7 below)
• Typical dimensions of the coarse grid: 75-125 points per side (odd numbers); double this for the fine grid
[Figure: coarse array data copied into the corners of the larger fine array]
5. Transform to real space
6. Calculate product and sum (if common atom)
7. Deposit charge density data in simulation cell
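A minimal sketch of the corner-population step for odd coarse dimensions (the routine and array names are hypothetical, not ONETEP's): along each axis, non-negative frequencies keep their indices and negative frequencies shift to the top of the fine array, leaving zeros in between.

! Up-sample a 3-D reciprocal-space array by zero-padding: the eight
! corners of the fine array receive the coarse data, everything else is 0.
! Assumes odd coarse dimension n and fine dimension m >= n (e.g. m = 2n).
subroutine upsample(coarse, fine, n, m)
  implicit none
  integer, intent(in) :: n, m
  complex(8), intent(in)  :: coarse(n, n, n)
  complex(8), intent(out) :: fine(m, m, m)
  integer :: i1, i2, i3
  fine = (0.0d0, 0.0d0)
  do i3 = 1, n
     do i2 = 1, n
        do i1 = 1, n
           fine(map(i1), map(i2), map(i3)) = coarse(i1, i2, i3)
        end do
     end do
  end do
contains
  integer function map(i)   ! coarse index -> fine index
    integer, intent(in) :: i
    if (i <= (n + 1) / 2) then
       map = i              ! non-negative frequencies stay in place
    else
       map = m - n + i      ! negative frequencies move to the top end
    end if
  end function map
end subroutine upsample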
Charge Density Performance
§ Tested on Iridis3, Emerald and the Nvidia PSG cluster
§ Detailed timings are blocking
– No asynchronous kernel execution and data transfer
§ Data transfer stages (2 and 7) problematic; a larger quantity of data moves in step 7
Charge Density Performance
§ Clear shift in bottlenecks to data transfer stages
Local Potential Integrals
1. Extract NGWF
2. Populate FFT box
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Extract local potential FFT box from simulation cell
7. Apply local potential
8. Transform to reciprocal space
9. Down-sample to coarse grid
10. Transform to real space
11. Extract PPDs that overlap with φα
12. Extract φα PPDs that overlap with φβ
13. Calculate dot product of φα and φβ PPDs
[Figure: data-flow diagram for the local potential integral ∫ φα*(r) Vloc(r) φβ(r) dr — φβ(r) (step 2) and the Vloc(r) FFT box (step 6) are transferred from CPU to GPU; the FFTs, up-sampling, potential application and down-sampling (steps 3-10) run on the GPU, producing Vloc φβ in real and reciprocal space; the result is transferred from GPU to CPU for the PPD extractions and dot product (steps 11-13)]
Performance – Bottlenecks
Relative cost of stages:
• Local potential integrals: more data transfers
• Kinetic integrals: only the coarse grid
• NGWF gradient: partly accelerated; significantly more complex; data transfer at stage 4
Performance Summary
§ Performance of 1 GPU vs 1 CPU core (an unfair comparison):
§ Total calculation time:
– 1 GPU is 2-3.5x faster than 1 CPU core, equivalent to 4 CPU cores

Timings (speedup vs the CPU core in parentheses):

                           Xeon E5620   Tesla M2050   Tesla M2090   Kepler K20
Charge density             1221.1       340.5 (3.6)   271.1 (4.5)   225.4 (5.4)
Local potential integrals   861.3       301.2 (2.9)   242.2 (3.6)   219.1 (3.9)
Kinetic integrals           268.9       127.0 (2.1)   112.4 (2.4)   131.0 (2.1)
NGWF gradient              1120.8       636.1 (1.8)   521.5 (2.1)   509.1 (2.2)
[Figure: stacked bar chart of total time (s), broken down into charge density, local potential integrals, kinetic integrals, NGWF gradient and other, for 1x/2x/4x M2090, 1x/2x/4x M2050, 1x/2x/4x K20 and 4x CPU cores]
Scaling Performance
§ Good scaling up to 4 GPUs
– Seen for all areas of accelerated code
Understanding Poor Scaling
§ Neglected thread affinity
– QuickPath Interconnect (QPI)
§ Not the only problem
– Sheer size of data transfers
(A device-binding sketch follows the table below.)

Total GPUs | GPUs per node | Nodes | GPUs on bus 0 (max. 3) | GPUs on bus 1 (max. 5) | Total calculation time (s) | Scaling vs 1 GPU | % of ideal
 1         | 1             |  1    | 1                      | 0                      | 5655.9                     |  1.0             | 100
 8         | 8             |  1    | 3                      | 5                      | 1737.5                     |  3.3             |  41
24         | 8             |  3    | 3                      | 5                      |  637.9                     |  8.9             |  37
24         | 6             |  4    | 3                      | 3                      |  620.6                     |  9.1             |  38
24         | 4             |  6    | 2                      | 2                      |  471.9                     | 12.0             |  50
24         | 2             | 12    | 1                      | 1                      |  362.9                     | 15.6             |  65
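One way to respect the bus topology is to bind each MPI rank to a nearby GPU. A hedged sketch in CUDA Fortran using the MPI-3 shared-memory communicator (an illustration of the idea, not what the code described in this talk did):

! Pick a GPU per MPI rank so that ranks on the same node spread across
! the node's devices instead of contending for one PCIe bus.
program bind_gpu
  use mpi
  use cudafor
  implicit none
  integer :: ierr, local_comm, local_rank, ndev
  call MPI_Init(ierr)
  ! Ranks sharing a node end up in the same communicator
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)
  ierr = cudaGetDeviceCount(ndev)
  ierr = cudaSetDevice(mod(local_rank, ndev))   ! round-robin over devices
  call MPI_Finalize(ierr)
end program bind_gpu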
Strong Scaling
§ Detailed scaling tests on Emerald: 8 GPUs and 2 CPUs per node
§ Comparing against a single node
– Impressive results
§ Unfortunately not the full story…
– Is data transfer on a node saturating the available PCIe bandwidth?
[Figure: strong-scaling speedup vs number of nodes (ideal line shown) for a 495 atom cellulose nanofibril and a 1717 atom beta amyloid fibril; second panel: speedup vs number of GPUs, up to 8]
Conclusions
§ Overall acceleration by a factor of 2 to 3x relative to a single CPU core
§ Equivalent performance to 4 CPU cores
§ Data transfer is a serious bottleneck
– No asynchronous data transfer
– Running in lock-step due to blocking MPI communications
§ Strong scaling is hit hard by limited PCIe bandwidth
§ Promising results; much more to do, with an obvious first target
HYBRID MPI-OPENMP PARALLELISM
Motivation
§ Hard limit of 1 atom per core
§ 10 atoms per core is optimal
– Communications-workload balance
§ Memory demands mean only 50% of cores are used on some machines (HECToR)
§ Scaling limited to hundreds of cores
Method
§ Myself:
– Charge density
– Local potential integrals
– Kinetic integrals
– NGWF gradient
§ Dr. Nick Hine:
– Sparse algebra
– Projectors
– Testing and modification for the IBM BlueGene/Q at the Hartree Centre
§ OpenMP is applied at the level above that parallelized by the GPU developments (see the sketch below)
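A minimal sketch of the hybrid pattern, with hypothetical names (nboxes_local and process_fftbox are illustrative, not ONETEP's routines): each MPI rank owns a set of atoms, and OpenMP threads share that rank's loop over FFT-box operations.

! Each MPI rank owns a subset of atoms and hence of FFT boxes; OpenMP
! threads split that rank's loop over boxes. Dynamic scheduling absorbs
! the uneven cost of individual boxes.
subroutine process_local_boxes(nboxes_local)
  implicit none
  integer, intent(in) :: nboxes_local
  integer :: ibox
  !$omp parallel do schedule(dynamic)
  do ibox = 1, nboxes_local
     call process_fftbox(ibox)
  end do
  !$omp end parallel do
contains
  subroutine process_fftbox(ibox)   ! stub: populate box, FFT, apply operators
    integer, intent(in) :: ibox
  end subroutine process_fftbox
end subroutine process_local_boxes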
Initial Results
§ Charge density only
§ Tested on a 16-core workstation
§ 181 atom tennis ball dimer

Total time, % of optimal (MPI processes x OpenMP threads):

MPI \ OMP     1      2      4      8     16
  1       100.0   77.2   51.8   31.2   15.4
  2        96.4   71.1   47.1   24.4
  4        90.7   64.2   37.5
  8        78.5   50.5
 16        59.6

Charge density, % of optimal:

MPI \ OMP     1      2      4      8     16
  1       100.0   93.9   78.9   56.5   29.4
  2        96.6   85.4   69.8   41.7
  4        96.7   83.6   55.4
  8        86.3   65.0
 16        67.0
Latest Results
§ 1728 atom silicon oxide
§ Results courtesy of Dr. Nick Hine; calculations performed on Blue Joule at the Hartree Centre

MPI processes | OpenMP threads | Total time
Fixed core count (16):
16            |  1             | Out of memory
 8            |  2             | Out of time (>3600 s)
 4            |  4             | 1045 s
 2            |  8             |  617 s
 1            | 16             |  673 s
OpenMP scaling:
 4            |  1             | 2711 s
 4            |  4             | 1046 s
 4            |  8             |  784 s
Latest Results
[Figure: speedup vs total number of cores (4,000-16,000) for a 12,000 atom beta amyloid fibril, each MPI process using 16 OpenMP threads; curves for ideal scaling, sparse products, row sums, and box operations]
Results courtesy of Dr. Nick Hine; calculations performed on Blue Joule at the Hartree Centre.
Summary
§ Significant performance increase
– Scaling
– Memory usage
– More cores than atoms now possible
§ Excellent scaling up to 4 OpenMP threads; memory-bandwidth limited on Blue Joule above 8 cores
§ HECToR performance promising
FUTURE WORK
Future Work
Improved GPU code
• OpenACC
• Asynchronous data transfer
• Loop unrolling
• Extend to other areas
• GPU overloading
• Batched 1D FFTs
Hybrid MPI-OpenMP code
• Extend to "data parallelism"
• Extend to other areas
Large scale ab initio molecular dynamics on machines such as:
• Titan (AMD CPU + NVIDIA K20X)
• Stampede (Intel CPU + Intel Xeon Phi)
• Mira (BlueGene/Q)
Algorithmic improvements
• Dynamic/mixed precision
• Increased truncation
• Hybrid FFT/interpolation
Heterogeneous execution
Acknowledgements
§ Chris-Kriton Skylaris – University of Southampton
§ Nick Hine – University of Cambridge
§ NVIDIA – Mark Berger, Timothy Lanfear
§ PGI – Mathew Colgrove, Pat Brooks
§ E-Infrastructure South consortium – Istvan Reguly, Emerald
§ iSolutions – Iridis3
§ EPSRC
Precision
• Optimized energies exceed the level of precision required for chemical accuracy (milliHartree)
• Differences in calculated energies relative to the CPU code are comparable to the differences resulting from varying the number of MPI processes in the CPU code
Dynamic Precision
• GPUs have a larger number of single precision (SP) than double precision (DP) cores
• Other codes (GAMESS-UK and TeraChem) utilize dynamic precision
• The switch between SP and DP is determined by a convergence threshold (see the sketch below)
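A minimal sketch of the dynamic-precision idea under assumptions of my own (sp_iteration and dp_iteration are fake stand-ins, not how any of the named codes implement it): iterate in SP while the energy is still changing rapidly, then switch to DP once the change drops below a crossover threshold.

! Dynamic precision: run cheap single-precision iterations until the
! energy change falls below a crossover threshold, then finish in double.
program dynamic_precision
  implicit none
  real(8), parameter :: crossover = 5.0d-6   ! SP -> DP switch threshold (Eh)
  real(8), parameter :: converged = 1.0d-10  ! final DP convergence (Eh)
  real(8) :: e_old, e_new
  logical :: use_dp
  integer :: iter
  use_dp = .false.
  e_old = 0.0d0
  do iter = 1, 100
     if (use_dp) then
        e_new = dp_iteration(e_old)
     else
        e_new = sp_iteration(e_old)
     end if
     if (.not. use_dp .and. abs(e_new - e_old) < crossover) use_dp = .true.
     if (use_dp .and. abs(e_new - e_old) < converged) exit
     e_old = e_new
  end do
contains
  real(8) function sp_iteration(e)   ! stub: one SP step (fake update)
    real(8), intent(in) :: e
    sp_iteration = e - real(1.0e-3, 8) / iter**2
  end function
  real(8) function dp_iteration(e)   ! stub: one DP step (fake update)
    real(8), intent(in) :: e
    dp_iteration = e - 1.0d-5 / iter**2
  end function
end program dynamic_precision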
[Figure: energy difference (Eh, log scale, 1E-12 to 1E+02) vs iteration number (1-24) for CPU DP, GPU DP, and dynamic-precision runs with crossover thresholds 1x10^-7, 1x10^-5, 5x10^-6 and 1x10^-6]