doi.org/10.26434/chemrxiv.12018963.v1

Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program

Madushanka Manathunga, Yipu Miao, Dawei Mu, Andreas Goetz, Kenneth M. Merz Jr.

Submitted date: 23/03/2020. Posted date: 24/03/2020.
Licence: CC BY-NC-ND 4.0
Citation information: Manathunga, Madushanka; Miao, Yipu; Mu, Dawei; Goetz, Andreas; Merz Jr., Kenneth M. (2020): Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.12018963.v1

We present the details of a GPU capable exchange correlation (XC) scheme integrated into the open source QUantum Interaction Computational Kernel (QUICK) program. Our implementation features an octree based numerical grid point partitioning scheme, GPU enabled grid pruning and basis/primitive function prescreening, and fully GPU capable XC energy and gradient algorithms. Benchmarking against the CPU version demonstrated that the GPU implementation is capable of delivering impressive performance while retaining excellent accuracy. For small to medium size protein/organic molecular systems, the realized speedups in double precision XC energy and gradient computation on an NVIDIA V100 GPU were 60 to 80-fold and 140 to 780-fold, respectively, as compared to the serial CPU implementation. The acceleration gained in density functional theory calculations from a single V100 GPU significantly exceeds that of a modern CPU with 40 cores running in parallel.



Supporting Information for

Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program

Madushanka Manathunga†, Yipu Miao#, Dawei Mu%, Andreas W. Götz¶,* and Kenneth M. Merz, Jr.†,*

†Department of Chemistry and Department of Biochemistry and Molecular Biology, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824-1322, United States, #Facebook, 1 Hacker Way, Menlo Park, California 94025, United States, %National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark St, Urbana, IL 61801, United States, ¶San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0505, United States.

Contents:
S1: Grid point packing and retrieval of basis/primitive function indices
S2: Comparison of HF performance between the QUICK and GAMESS GPU versions

S1: Grid point packing and retrieval of basis/primitive function indices

Fig. S1. A. True grid points (T, green) are packed into an array of 256*nbin elements together with dummy grid points (D, brown). The kernel launch is performed with the same number of threads, and true or dummy threads are distinguished by an assigned integer flag. B. Structures of the basis and primitive function locator arrays and of the arrays that hold the corresponding IDs. Here l and m stand for the total number of basis and primitive functions. Colors indicate the access patterns. For example, threads in block 0 (magenta) access the first two elements of the basis function locator array and obtain the appropriate indices for accessing the basis function ID array (indices 0 to 2). To obtain primitive function IDs, for instance for the basis function located at index 0, the first two elements of the primitive function locator array (green) are picked. The corresponding primitive function IDs are then obtained by accessing the first two elements of the primitive function ID array (green).


S2: Comparison of HF performance between the QUICK and GAMESS GPU versions

Table S1. Comparison of HF SCF and gradient times between the QUICK and GAMESS GPU versions

Molecule (atom number)   Basis set (function number)   Time for first SCF iteration (s)   Time for last SCF iteration (s)
                                                       QUICK      GAMESS                  QUICK      GAMESS
Morphine (40)            6-31G (227)                   0.5        51.7                    0.4        11.2
                         6-31G* (353)                  0.8        121.1                   1.2        22.0
                         6-31G** (410)                 1.0        159.9                   1.6        28.0
Gly12 (87)               6-31G (517)                   1.0        115.4                   1.5        39.0
                         6-31G* (811)                  2.4        279.3                   4.6        82.4
                         6-31G** (925)                 3.1        374.3                   6.2        105.1
Valinomycin (168)        6-31G (882)                   4.9        931.7                   16.7       153.0
                         6-31G* (1350)                 11.4       1816.2                  43.6       467.0
                         6-31G** (1620)                16.7       2611.4                  63.0       511.0


Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program

Madushanka Manathunga†, Yipu Miao#, Dawei Mu%, Andreas W. Götz¶,* and Kenneth M. Merz, Jr.†,*

†Department of Chemistry and Department of Biochemistry and Molecular Biology, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824-1322, United States, #Facebook, 1 Hacker Way, Menlo Park, California 94025, United States, %National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark St, Urbana, IL 61801, United States, ¶San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0505, United States.

KEYWORDS: QUICK, Graphics Processing Units, Quantum Chemistry Software, DFT Package, Exchange Correlation

Abstract. We present the details of a GPU capable exchange correlation (XC) scheme integrated into the open source QUantum Interaction Computational Kernel (QUICK) program. Our implementation features an octree based numerical grid point partitioning scheme, GPU enabled grid pruning and basis/primitive function prescreening, and fully GPU capable XC energy and gradient algorithms. Benchmarking against the CPU version demonstrated that the GPU implementation is capable of delivering impressive performance while retaining excellent accuracy. For small to medium size protein/organic molecular systems, the realized speedups in double precision XC energy and gradient computation on an NVIDIA V100 GPU were 60 to 80-fold and 140 to 780-fold, respectively, as compared to the serial CPU implementation. The acceleration gained in density functional theory calculations from a single V100 GPU significantly exceeds that of a modern CPU with 40 cores running in parallel.

1. Introduction

Although graphics processing units (GPUs) were originally introduced for rendering computer graphics, they have become essential devices for enhancing the performance of scientific applications over the past decade. Examples of GPU accelerated applications span from bioinformatics software that helps solve genetic mysteries in biology to data analysis tools aiding gravitational wave detection in astrophysics.1,2 The availability of powerful general purpose GPUs at a reasonable cost, convenient computing and programming environments such as the Compute Unified Device Architecture (CUDA)3 and, especially, the fact that GPUs can perform trillions of floating point operations per second in combination with a high memory bandwidth, outperforming desktop central processing units (CPUs), are the main reasons for this trend.

GPUs are also known to deliver outstanding performance in traditional computational chemistry applications, particularly in classical molecular dynamics (MD) simulations4–12 and ab initio quantum chemical calculations.13–32 In the latter context, Hartree-Fock (HF)13–15,23,28–30,32 and post-HF energy and gradient implementations18,20,21,27,31 on GPUs have displayed multifold speedups for molecular systems containing a few to a large number of atoms. Nevertheless, a majority of the computational chemistry community is unable to enjoy such performance benefits due to the unavailability of open-source, user friendly, GPU enabled quantum chemistry software. Towards this end, we have been developing a quantum chemical code named the QUantum Interaction Computational Kernel (QUICK) to fill this void.23,28 As reported previously, QUICK is capable of efficiently computing HF energies and gradients. For instance, the speedup realized for moderate size molecular systems on a Kepler type GPU was about 10-20 times in comparison to a single CPU core while retaining excellent accuracy. With high angular momentum basis functions (d and f functions), the realized speedup remained 10-18 fold. This performance gain was primarily due to GPU accelerated electron repulsion integral (ERI) calculations. In our ERI engine, integrals are calculated using the vertical and horizontal recurrence relation algorithms33,34 reported by Obara and Saika and by Head-Gordon and Pople. The integrals are calculated on the fly and the Fock matrix is assembled on the GPU using an efficient direct self-consistent field (SCF) scheme. However, the accuracy of the HF method is insufficient, if not totally unsuitable, to study many chemical problems; but having a GPU enabled HF code paves the way towards an efficient post-HF or density functional theory (DFT) package. In the present work, we have undertaken the task of implementing the latter type of methods in QUICK. In fact, given the vast number of research articles published using DFT methods over the past decades,35 the incorporation of such methods into our package was essential.

In the context of GPU parallelization of Gaussian based DFT calculations, a few publications have appeared in the literature.22,36,37 Among these, Yasuda's work22 is the earliest. His exchange correlation (XC) quadrature scheme consisted of several special features aimed at maximizing performance on GPU hardware. These include partitioning the numerical grid points in 3-dimensional space, basis function prescreening and preparation of basis function lists for grid point partitions. Only the electron densities, their gradients and the matrix elements of the XC potential were calculated on the GPU. The hardware available at the time limited his algorithm to single precision calculations, but the reported accuracy of the benchmark tests was about 10^-5 a.u. A similar algorithm has been implemented by Martinez and coworkers in their GPU capable TeraChem software.36 In addition to grid point partitioning, this algorithm performs prescreening at the level of primitive functions and excludes certain partitions from the calculation based on a sorting procedure. In both of the above implementations, the values of the density functionals at the grid points are calculated on the CPU. We note another DFT package where the XC scheme is GPU enabled; however, the ERI calculations are performed on the CPU.37

The features of the XC quadrature scheme reported in the current work include grid point partitioning using an octree algorithm, prescreening and grid pruning based on the values of primitive Gaussian functions, and fully GPU enabled XC energy and gradient calculations in which not only the electron density and its derivatives, but also the XC functional values, are computed on the GPU. The next sections of this paper are organized as follows. In section 2, we give an overview of the underlying theory of our XC scheme, which was originally documented by Pople and coworkers.38 The details of the computational implementation are then presented in section 3. Here we first discuss important aspects of data parallel GPU programming. The GPU version of the XC scheme is then implemented following these considerations. Information regarding the parallel CPU implementation using the message passing interface (MPI)39 is also presented. Section 4 is devoted to benchmark tests and discussion. The tests include performance comparisons between QUICK and the GAMESS GPU version29,32,40 and accuracy and performance comparisons between the QUICK GPU and CPU versions. Finally, we conclude our discussion by exploring future directions for further improvement.

2. Theory

The Kohn-Sham formulation of DFT differs from the HF method in its treatment of the exchange and correlation contributions to the electronic energy. The total electronic energy ($E$) is given by,38

$E = E_T + E_V + E_J + E_{XC}$, (1)

where $E_T$ and $E_V$ stand for the kinetic and electron-nuclear interaction energies, $E_J$ is the Coulomb self-interaction of the electron density and $E_{XC}$ is the remaining electronic exchange and correlation energy. The electron density ($\rho$) is a sum of the $\alpha$ and $\beta$ electron densities ($\rho_\alpha$ and $\rho_\beta$, respectively), which can be expressed by choosing sets of orthonormal spin orbitals $\psi_i^\alpha$ ($i = 1,\ldots,n_\alpha$) and $\psi_i^\beta$ ($i = 1,\ldots,n_\beta$) as

$\rho_\alpha = \sum_i^{n_\alpha} |\psi_i^\alpha|^2, \qquad \rho_\beta = \sum_i^{n_\beta} |\psi_i^\beta|^2$, (2)

where $n_\alpha$ and $n_\beta$ are the numbers of occupied orbitals.

$E_T$, $E_V$ and $E_J$ can be written as,

$E_T = \sum_i^{n_\alpha} \langle \psi_i^\alpha | -\tfrac{1}{2}\nabla^2 | \psi_i^\alpha \rangle + \sum_i^{n_\beta} \langle \psi_i^\beta | -\tfrac{1}{2}\nabla^2 | \psi_i^\beta \rangle$, (3)

$E_V = -\sum_A^{\mathrm{nuclei}} Z_A \int \rho(r)\, |r - r_A|^{-1}\, dr$, (4)

$E_J = \tfrac{1}{2} \iint \rho(r_1)\, |r_1 - r_2|^{-1}\, \rho(r_2)\, dr_1\, dr_2$. (5)

$E_{XC}$, within the generalized gradient approximation (GGA), may be given by a functional $f$ that depends on the electron densities and their gradient invariants,

$E_{XC} = \int f(\rho_\alpha, \rho_\beta, \gamma_{\alpha\alpha}, \gamma_{\alpha\beta}, \gamma_{\beta\beta})\, dr$, (6)

$\gamma_{\alpha\alpha} = |\nabla\rho_\alpha|^2, \qquad \gamma_{\alpha\beta} = \nabla\rho_\alpha \cdot \nabla\rho_\beta, \qquad \gamma_{\beta\beta} = |\nabla\rho_\beta|^2$. (7)

In practical computational implementations, one expresses the molecular orbitals as a linear combination of atomic orbitals and $\rho_\alpha$ in equation (2) becomes,

$\rho_\alpha = \sum_\mu \sum_\nu \sum_i^{n_\alpha} (C_{\mu i}^\alpha)^* C_{\nu i}^\alpha\, \phi_\mu \phi_\nu = \sum_{\mu\nu} P_{\mu\nu}^\alpha\, \phi_\mu \phi_\nu$. (8)

Here $\phi_\mu$ ($\mu = 1,\ldots,N$) are the atomic orbitals. Furthermore, the gradient of $\rho_\alpha$ can be written as,

$\nabla\rho_\alpha = \sum_{\mu\nu} P_{\mu\nu}^\alpha\, \nabla(\phi_\mu \phi_\nu)$. (9)

A similar expression can be written for $\rho_\beta$. Substituting equations (8) and (9) into (2)-(7) and into the energy equation (1) and minimizing with respect to the coefficients $C_{\mu i}^\alpha$, $C_{\nu i}^\alpha$, one obtains a series of algebraic equations similar to the conventional HF procedure. The resulting Fock-type matrix (hereafter Kohn-Sham matrix) for $\rho_\alpha$ is,

$F_{\mu\nu}^\alpha = H_{\mu\nu}^{\mathrm{core}} + J_{\mu\nu} + F_{\mu\nu}^{XC\alpha}$, (10)

where $H_{\mu\nu}^{\mathrm{core}}$ is the one-electron Hamiltonian matrix. $J_{\mu\nu}$ is the Coulomb matrix, which can be written in the conventional form as,

$J_{\mu\nu} = \sum_{\lambda\sigma} P_{\lambda\sigma} (\mu\nu|\lambda\sigma), \qquad P_{\lambda\sigma} = P_{\lambda\sigma}^\alpha + P_{\lambda\sigma}^\beta$. (11)

The XC contribution to the Kohn-Sham matrix ($F_{\mu\nu}^{XC\alpha}$) is given by,

$F_{\mu\nu}^{XC\alpha} = \int \left[ \frac{\partial f}{\partial \rho_\alpha}\, \phi_\mu \phi_\nu + \left( 2\frac{\partial f}{\partial \gamma_{\alpha\alpha}} \nabla\rho_\alpha + \frac{\partial f}{\partial \gamma_{\alpha\beta}} \nabla\rho_\beta \right) \cdot \nabla(\phi_\mu \phi_\nu) \right] dr$, (12)

and a similar expression can be written for $F_{\mu\nu}^{XC\beta}$.

The gradient with respect to the position of nucleus $A$ is,

$\nabla_A E = \sum_{\mu\nu} P_{\mu\nu} (\nabla_A H_{\mu\nu}^{\mathrm{core}}) + \tfrac{1}{2} \sum_{\mu\nu\lambda\sigma} P_{\mu\nu} P_{\lambda\sigma}\, \nabla_A(\mu\nu|\lambda\sigma) - \sum_{\mu\nu} W_{\mu\nu} (\nabla_A S_{\mu\nu}) - 2 \sum_{\mu \in A} \sum_\nu P_{\mu\nu}^\alpha \int \left[ \frac{\partial f}{\partial \rho_\alpha}\, \phi_\nu \nabla\phi_\mu + X_{\mu\nu} \left( 2\frac{\partial f}{\partial \gamma_{\alpha\alpha}} \nabla\rho_\alpha + \frac{\partial f}{\partial \gamma_{\alpha\beta}} \nabla\rho_\beta \right) \right] dr - 2 \sum_{\mu \in A} \sum_\nu P_{\mu\nu}^\beta \int \left[ \frac{\partial f}{\partial \rho_\beta}\, \phi_\nu \nabla\phi_\mu + X_{\mu\nu} \left( \frac{\partial f}{\partial \gamma_{\alpha\beta}} \nabla\rho_\alpha + 2\frac{\partial f}{\partial \gamma_{\beta\beta}} \nabla\rho_\beta \right) \right] dr$, (13)

where the energy-weighted density matrix $W_{\mu\nu}$ is given by,

$W_{\mu\nu} = \sum_i^{n_\alpha} \epsilon_i^\alpha C_{\mu i}^\alpha C_{\nu i}^\alpha + \sum_i^{n_\beta} \epsilon_i^\beta C_{\mu i}^\beta C_{\nu i}^\beta$, (14)

and the matrix element $X_{\mu\nu}$ is,

$X_{\mu\nu} = \phi_\nu\, \nabla(\nabla\phi_\mu)^T + (\nabla\phi_\mu)(\nabla\phi_\nu)^T$. (15)

Fig. 1. Partitioning of the numerical grid points of a water molecule using the octree algorithm. A. A cell containing all grid points (Level 0) is recursively subdivided into octets (Levels 1 and 2). B. The fully partitioned bins/leaf nodes of the same example, containing all grid points that remain after pruning in stage 1 (middle), undergo primitive function based grid pruning. The resulting bins/leaf nodes and grid points are shown at the bottom right. [Panel annotations (cc-pVDZ basis set): 9865 grid points in 587 leaf nodes, with 14675 basis and 117400 primitive function evaluations, before primitive function based pruning; 7967 grid points in 221 leaf nodes, with 3092 basis and 6484 primitive function evaluations, after.]

Note that for hybrid functionals, the energy in equation (1) will also include a contribution from the HF exchange energy and equations (10) and (13) will be adjusted accordingly. Due to the complex form of the XC functionals f, the analytical calculation of the integrals required for the XC energy, the XC potential and its gradients in equations (6), (12) and (13) is impossible; hence, this is usually achieved through a numerical procedure. The key steps of such a procedure involve the formation of a numerical grid (also called the XC quadrature grid) with quadrature weights assigned to each grid point, calculation of the electron density and the gradients of the density at each grid point, calculation of the value of the density functional and the derivatives of the functional, and calculation of the XC energy and the contribution to the Kohn-Sham matrix (called matrix elements of the XC potential hereafter). In order to compute the nuclear gradients of the XC energy, one must compute second derivatives of the basis functions and the values of the two integral terms in equation (13) and add them to the total gradient vector. Finally, due to the involvement of a quadrature weighting scheme in the numerical procedure, the derivatives of the quadrature weights with respect to nuclear displacements must be computed and added to the total gradient.
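Written out as the quadrature that is actually evaluated, the XC energy and the matrix elements of the XC potential take the form below. This is a sketch in our own notation ($w_{Ag}$ denotes the weight of grid point $g$ attached to atom $A$); it simply restates equations (6) and (12) on the grid and is not reproduced from ref 38.

```latex
% Quadrature forms of Eqs. (6) and (12); w_{Ag} is the weight of grid
% point r_{Ag} belonging to atom A (our notation).
E_{XC} \approx \sum_{A}\sum_{g} w_{Ag}\,
   f\!\left(\rho_\alpha,\rho_\beta,\gamma_{\alpha\alpha},
            \gamma_{\alpha\beta},\gamma_{\beta\beta}\right)\Big|_{r_{Ag}},
\qquad
F_{\mu\nu}^{XC\alpha} \approx \sum_{A}\sum_{g} w_{Ag}
   \left[\frac{\partial f}{\partial\rho_\alpha}\,\phi_\mu\phi_\nu
   + \left(2\frac{\partial f}{\partial\gamma_{\alpha\alpha}}\nabla\rho_\alpha
   + \frac{\partial f}{\partial\gamma_{\alpha\beta}}\nabla\rho_\beta\right)
   \cdot\nabla(\phi_\mu\phi_\nu)\right]_{r_{Ag}}
```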

3. Implementation

3.1 Key considerations in GPU programming


Page 8: Parallel Implementation of Density Functional Theory

4

GPUs are ideal devices for data parallel computations, i.e. computations that can be performed on numerous data elements simultaneously. They allow massive parallelization in comparison to traditional CPU platforms, however, at the expense of programming complexity and flexibility. Therefore, a proper understanding of the GPU architecture and the memory hierarchy is essential for writing efficient code that exploits the full power of this hardware class. GPUs perform tasks according to a single instruction multiple data (SIMD) model, which simply means that they execute the same instructions for a given chunk of data and the same amount of work is expected to be performed for each piece of data. Hence, a programmer should organize the data and assign work to threads that are then processed by the GPU in batches known as thread blocks. The number of threads in a block is up to the programmer; however, there exists a maximum limit allowed by the GPU architecture. For instance, the NVIDIA Volta architecture permits a maximum of 1024 threads per block. NVIDIA GPUs execute threads on streaming multiprocessors (SMs) in warps whose size, 32 for recent architectures, is solely determined by the architecture. Each SM (for example, the V100 GPU has 80 SMs, each with 64 CUDA cores for a total of 5120 cores41) executes the same set of instructions for all threads in a warp during a given clock cycle. Therefore, it is essential to minimize branching in GPU codes (device kernels) to avoid instruction divergence.

A GPU possesses its own physical memory that is distinct from the host (CPU accessible) memory. The main memory, called global memory or dynamic random-access memory (DRAM), is relatively large (for example, 32 GB or 16 GB for the V100 and 12 GB for the Titan V) and accessible by all threads in the streaming multiprocessors. However, global memory transactions suffer from relatively high memory latency. A small secondary type of GPU memory called shared memory is also available on each SM. This type of memory transaction is faster but local, meaning that the shared memory of a given SM is only accessible by threads within a thread block that is currently executing on this SM. In addition to these two types of memory, threads in a warp have access to a certain number of registers, which is useful for facilitating communication between threads of the same warp. GPUs also contain constant and texture memories, which are read-only and capable of delivering higher performance than global memory in specific applications. If threads in a warp read from the same memory location, constant memory is useful. Texture memory is beneficial when the threads read from physically adjacent memory locations. To maximize the performance of device kernels, careful handling of memory is essential. The key considerations for engineering an efficient GPU code include minimizing warp divergence, minimizing frequent global memory transactions, maintaining a coherent global memory access pattern by adjacent threads in a block (coalesced memory access), and minimizing random memory access patterns or simultaneous access to a certain memory location by multiple threads. Furthermore, constant and texture memories should be employed where applicable. Our existing ERI and direct SCF scheme in QUICK was developed in adherence to this philosophy. As detailed below, we implement our XC scheme following the same practices.
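The following self-contained CUDA example (a toy kernel of ours, not QUICK code) illustrates these considerations: blocks of 256 threads are launched, adjacent threads read and write adjacent global memory addresses (coalesced access), a per-block value is staged once in shared memory, and the only branch is the bounds check at the tail of the array, so at most one warp can diverge.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each 256-thread block scales one contiguous chunk of
// 'in'. Adjacent threads touch adjacent addresses (coalesced access); the
// per-block scale factor is staged once in shared memory and reused by every
// thread of the block.
__global__ void scaleChunks(const double* in, const double* blockScale,
                            double* out, int n) {
    __shared__ double s;                      // one value shared by the block
    if (threadIdx.x == 0) s = blockScale[blockIdx.x];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                // only the last warp can diverge
        out[i] = s * in[i];
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;
    double *in, *scale, *out;                 // unified memory for brevity
    cudaMallocManaged(&in, n * sizeof(double));
    cudaMallocManaged(&scale, blocks * sizeof(double));
    cudaMallocManaged(&out, n * sizeof(double));
    for (int i = 0; i < n; ++i) in[i] = 1.0;
    for (int b = 0; b < blocks; ++b) scale[b] = 2.0;

    scaleChunks<<<blocks, threads>>>(in, scale, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);          // expect 2.0
    cudaFree(in); cudaFree(scale); cudaFree(out);
    return 0;
}
```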

3.2 Grid generation and pruning

The selected grid system for our work is the Standard Grid-1 (SG-1) reported by Pople and coworkers.42 This grid consists of 50 radial grid points and 194 angular grid points for each atom of a molecule. The radial grid points are generated using the Euler-Maclaurin scheme and the angular grid points using Lebedev grids. Following the grid generation, we compute weights for each grid point based on the scheme reported by Frisch et al.43 We then perform grid pruning in two stages. First, all points with a weight less than a threshold (usually 10^-10) are eliminated. The remaining points are pruned in a second stage based on the values of the atom centered basis/primitive functions at each grid point. As described in section 3.3, we make use of an octree algorithm for this purpose.
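A minimal host-side sketch of this first, weight-based pruning stage is shown below; the GridPoint record and the function name are our own illustrations and do not correspond to QUICK's internal data structures.

```cuda
#include <vector>

// Hypothetical grid-point record; QUICK's internal layout will differ.
struct GridPoint {
    double x, y, z;   // Cartesian position of the point
    double w;         // quadrature weight (radial x angular x Becke-type)
};

// Stage-1 pruning: discard points whose weight falls below the cutoff
// (10^-10 by default) and return the surviving points.
std::vector<GridPoint> pruneByWeight(const std::vector<GridPoint>& pts,
                                     double cutoff = 1.0e-10) {
    std::vector<GridPoint> kept;
    kept.reserve(pts.size());
    for (const GridPoint& p : pts)
        if (p.w >= cutoff) kept.push_back(p);
    return kept;
}
```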

3.3 Octree based grid pruning, preparation of basis and contracted function lists

In our octree algorithm, the grid points in space are partitioned as follows. First, the lower and upper bounds of the grid are determined and a single cell (node) containing all grid points is formed in 3-dimensional space (see Fig. 1A). This node is then divided into eight child nodes. The child nodes whose grid point count is greater than a user specified threshold are recursively subdivided into octets until the point count of each child node falls below the threshold. The nodes which are not further divisible (leaf nodes, also termed bins below) are considered to be fully partitioned. We then go through each grid point of the leaf nodes and compute the values of the basis and primitive functions at their positions. If a basis/primitive function satisfies the condition

$\left| \zeta_i^j + \nabla \zeta_i^j \right| > \tau$, (16)

for a given grid point, it is considered to be significant for that particular point. We use $\tau$ = 10^-10 as the default threshold, which leads to numerical results that are indistinguishable from the reference without pruning. For basis function based prescreening, $j$ in equation (16) becomes $\mu$ and $\zeta_i^j$ stands for the value of basis function $\phi_\mu$ at grid point $g_i$. Similarly, for primitive function based prescreening, $\zeta_i^j$ represents the value of $c_{\mu p}\chi_p$ at grid point $g_i$, where $\chi_p$ is the pth primitive function of $\phi_\mu$ and $c_{\mu p}$ is the corresponding contraction coefficient. Once the basis function values at each grid point are computed, the points that do not have at least one significant basis function are omitted from the corresponding bin (see Fig. 1B). Furthermore, bins without any grid points are also discarded. At this stage, lists of significant basis and primitive function IDs are prepared for each remaining bin of grid points, significantly reducing the number of basis function values and derivatives that have to be evaluated at each grid point during the SCF and gradient computations. Using this algorithm, the number of primitive Gaussian basis function evaluations is reduced from 117400 to 6484 for H2O with a cc-pVDZ basis set (Fig. 1B).
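The recursive subdivision itself can be sketched as follows (illustrative types and names only; the significance test of equation (16) and the ID-list preparation that follow in QUICK are not reproduced here).

```cuda
#include <vector>

// Same hypothetical grid-point record as in the previous sketch.
struct GridPoint { double x, y, z, w; };

struct OctreeNode {
    double lo[3], hi[3];              // bounding box of the cell
    std::vector<GridPoint> points;    // points owned by this cell
    std::vector<OctreeNode> children; // empty for a leaf node (bin)
};

// Recursively split a node into eight children until it holds no more than
// maxPoints grid points; the leaves of the finished tree are the bins.
void subdivide(OctreeNode& node, std::size_t maxPoints = 256) {
    if (node.points.size() <= maxPoints) return;            // leaf node
    double mid[3] = { 0.5 * (node.lo[0] + node.hi[0]),
                      0.5 * (node.lo[1] + node.hi[1]),
                      0.5 * (node.lo[2] + node.hi[2]) };
    node.children.resize(8);
    for (int c = 0; c < 8; ++c) {                           // child bounding boxes
        for (int d = 0; d < 3; ++d) {
            bool upper = (c >> d) & 1;
            node.children[c].lo[d] = upper ? mid[d] : node.lo[d];
            node.children[c].hi[d] = upper ? node.hi[d] : mid[d];
        }
    }
    for (const GridPoint& p : node.points) {                // distribute points
        int c = (p.x > mid[0]) | ((p.y > mid[1]) << 1) | ((p.z > mid[2]) << 2);
        node.children[c].points.push_back(p);
    }
    node.points.clear();
    for (OctreeNode& child : node.children) subdivide(child, maxPoints);
}
```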

3.4 Grid point packing and kernel launch

Following the two-stage grid pruning and the preparation of the basis and primitive function ID lists, the grid points are prepared (hereafter grid point packing) to be uploaded to the GPU. Here we add dummy points into each bin and set the total grid point count in each bin to the threshold value used in the octree run. It is worth noting that the threshold we choose is a multiple of the warp size 32, usually 256. By doing so, we are able to pack the true grid points into one corner of a block of an array (see Fig. S1A), and this helps us minimize warp divergence when we launch GPU threads using the same value for the block size. More specifically, for subsequent calculations (i.e. density, XC energy and XC gradients), we launch a total of 256*nbin threads (where nbin = number of significant bins), where each thread block contains at least one true grid point and a set of dummy points. The threads launched for dummy points should not perform any computation, and these are differentiated from true points by an assigned integer flag. As mentioned previously, the launched thread blocks are submitted to the streaming multiprocessors as warps of 32 by the warp scheduler during runtime. This ensures that at most one warp per thread block suffers from warp divergence and the threads of the remaining warps perform the same task. Once the grid points are packed and the basis and primitive function lists have been prepared, these are uploaded to the global memory of the GPU. The corresponding array pointers are stored in constant memory and are used in launching the kernel and in the execution of the device kernels during the density, SCF and gradient calculations.
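A sketch of the packing and launch strategy is given below with illustrative names; the real device kernels evaluate densities, XC energies or gradients in place of the placeholder work, and double-precision atomicAdd requires a GPU of compute capability 6.0 or higher.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Pack each bin to a fixed size of 256 slots (a multiple of the warp size).
// True points go to the front of their bin's slot range and are flagged 1;
// the remaining slots are dummies flagged 0. Names and layout are illustrative.
constexpr int kBinSize = 256;

void packBins(const std::vector<std::vector<double3>>& bins,
              std::vector<double3>& packedPts, std::vector<int>& flags) {
    packedPts.assign(bins.size() * kBinSize, double3{0.0, 0.0, 0.0});
    flags.assign(bins.size() * kBinSize, 0);
    for (std::size_t b = 0; b < bins.size(); ++b)
        for (std::size_t g = 0; g < bins[b].size(); ++g) {  // true points first
            packedPts[b * kBinSize + g] = bins[b][g];
            flags[b * kBinSize + g] = 1;
        }
}

// One thread per packed slot; dummy threads return immediately, so at most
// one warp per block diverges at the true/dummy boundary.
__global__ void perPointKernel(const double3* pts, const int* flags,
                               double* acc) {
    int slot = blockIdx.x * blockDim.x + threadIdx.x;
    if (!flags[slot]) return;                 // dummy grid point: do nothing
    const double3 p = pts[slot];
    // ... evaluate basis/primitive functions, densities, etc. at p ...
    atomicAdd(acc, 0.0 * p.x + 1.0);          // placeholder: count true points
}

// Launch with one 256-thread block per bin:
//   perPointKernel<<<nbin, kBinSize>>>(d_pts, d_flags, d_acc);
```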

Fig. 2. Flowchart depicting the workflow of a DFT geometry optimization calculation. Brown, blue and mixed color boxes indicate steps performed on the CPU, GPU and mixed CPU/GPU, respectively. OPT stands for geometry optimization. [Flowchart labels (approximate reading): 1 process input; 2 run LIBXC and upload parameters; 3 perform DFT grid operations (generate DFT grid, compute grid weights, prune grid based on weights, run octree, prune grid based on primitive function values, prepare basis and primitive function lists, pack grid points and upload); 4 compute electron densities; 5 SCF: compute XC energy and potential (compute functional values and derivatives via the LIBXC worker or an inbuilt functional); 6 gradients: XC gradients (compute XC energy gradients, prune and reupload grid, compute gradients of weights); 7 delete DFT grid; 8 finalize; with an OPT loop for geometry optimization.]

3.5 Computing electron densities on the GPU

As apparent from equations (8) and (9), the calculation of the electron densities and their gradients requires looping over basis and primitive functions at each grid point. This task is performed on the GPU, and each thread, working on a single grid point, only loops through the basis and primitive function lists assigned to the corresponding thread block. The retrieval of the correct lists from global memory is achieved by using two locator arrays (see Fig. S1B). Note that we use the term "ID" when referring to basis/primitive functions and "array index" for array locations. The first, called the basis function locator array, holds the array index ranges of the basis function ID array. Each thread accesses the former based on its block index, obtains the corresponding array index range of the latter and picks the basis function IDs. The second, the primitive function locator array, holds the array index ranges for accessing the primitive function ID array. Each thread picks elements from the primitive function locator array using basis function array indices, obtains the relevant array index range of the latter and takes the primitive function IDs. This retrieval strategy allows us to maintain a coalesced memory access pattern.
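A schematic device kernel that follows this retrieval pattern and evaluates equation (8) over the significant basis functions of one bin is shown below. The Gaussian evaluation itself is elided and every array name is ours, chosen only to mirror the layout sketched in Fig. S1B.

```cuda
#include <cuda_runtime.h>

// Evaluate the contracted basis function stored at array index 'a' of the
// significant-basis list at point p by summing over its significant
// primitives. The actual Gaussian evaluation is elided; only the
// locator-array access pattern is shown.
__device__ double evalBasis(int a, const int* primLoc, const int* primID,
                            double3 p) {
    double phi = 0.0;
    for (int k = primLoc[a]; k < primLoc[a + 1]; ++k) {
        int prim = primID[k];            // global primitive-function ID
        // phi += c[prim] * exp(-zeta[prim]*r2) * (angular part), with r2
        // built from p; replaced here by a placeholder value.
        phi += 1.0e-12 * prim;
    }
    return phi;
}

// One thread per packed grid point; the thread loops only over the basis
// functions assigned to its block via the basis-function locator array.
__global__ void densityAtPoints(const double3* pts, const int* flags,
                                const int* basisLoc, const int* basisID,
                                const int* primLoc, const int* primID,
                                const double* Palpha, int N, double* rho) {
    int slot = blockIdx.x * blockDim.x + threadIdx.x;
    if (!flags[slot]) return;                        // dummy grid point

    double3 p = pts[slot];
    double rhoLocal = 0.0;
    int lo = basisLoc[blockIdx.x], hi = basisLoc[blockIdx.x + 1];
    for (int a = lo; a < hi; ++a) {
        int mu = basisID[a];
        double phiMu = evalBasis(a, primLoc, primID, p);
        for (int b = lo; b < hi; ++b) {
            int nu = basisID[b];
            double phiNu = evalBasis(b, primLoc, primID, p);
            rhoLocal += Palpha[mu * N + nu] * phiMu * phiNu;   // Eq. (8)
        }
    }
    rho[slot] = rhoLocal;                            // rho_alpha at this point
}
```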

3.6 Computing the XC energy and the matrix elements of the XC potential

As reported elsewhere,23,28 our existing SCF implementation assembles the Fock matrix and computes the energy on the GPU. Therefore, our goal is to calculate the XC energy and associated derivatives on the GPU, which highlights the necessity of having density functionals implemented as device kernels. Needless to say, coding the numerous available functionals is a cumbersome process and the best solution is to make use of an existing density functional library. Nevertheless, to the best of our knowledge, such an appropriate GPU capable library is still not available; but a few stable CPU based density functional libraries have been reported and used in many DFT packages.44–46 As discussed below, we selected one of these libraries and modified the source code for execution on GPUs via CUDA to achieve our goal.


The chosen density functional library, named LIBXC,46 is open source and provides about 400 functionals that are based on the local density approximation (LDA), the generalized gradient approximation (GGA) and the meta-generalized gradient approximation (MGGA). This library is well organized, has a user friendly interface and, above all, the source code is written in the C language. Such factors make LIBXC easily re-engineerable as a GPU capable library with relatively modest coding effort. In LIBXC, the functionals are organized in separate source files and are handled by several workers (each assigned to the LDA, GGA exchange, GGA correlation, MGGA exchange and MGGA correlation functional types) with the aid of a series of intermediate files. When the user provides the necessary input with a desired functional name, an assigned functional ID is selected and, if necessary, parameters required by the workers to call or send to functionals are obtained from the intermediate files. The functionals are then called from the workers. To make use of LIBXC in our GPU code, we initiate LIBXC and obtain the functional ID and the necessary intermediate parameters through a dry run immediately after grid point packing but before starting the SCF procedure.

In order to do so, we made several minor modifications to the interface and intermediate files. The obtained information is then uploaded to the GPU. We also implemented device kernels for the LDA and GGA workers to call functionals during device kernel execution. The corresponding functional source files were also modified with compiler preprocessor directives and, at compilation time, these are included in the CUDA source files and compiled as device kernels. As reported previously, the use of compiler directives for porting complex scientific applications to GPUs is rewarding in terms of application performance and implementation effort.47 Note that our current XC implementation is not capable of computing kinetic energy densities, and for this reason we have not made any changes to the MGGA workers and related source files in LIBXC.

During the SCF procedure, the device kernel versions of the LIBXC workers are called with pre-stored functional IDs and other parameters. The worker then calls the appropriate functional kernel. Here we have used C function pointers rather than conditional statements due to the potential performance penalties introduced by the latter. Following the computation of the XC energy density on the grid points, we compute the matrix elements of the potential and update the Kohn-Sham matrix using the CUDA atomic add function. In the past, we employed atomic add in our direct SCF scheme and noted that it is not a critical performance bottleneck.23,28
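A minimal sketch of the accumulation step alone is shown below, restricted to the density-derivative (LDA-like) term of equation (12); the gradient-dependent GGA term and the actual LIBXC worker call are omitted and all names are illustrative.

```cuda
#include <cuda_runtime.h>

// Accumulate w * (df/drho_alpha) * phi_mu * phi_nu into the Kohn-Sham matrix.
// 'vrho' holds df/drho_alpha already evaluated at each grid point, 'w' the
// quadrature weights, and 'phi' the values of the significant basis functions
// at each point, stored with a fixed leading dimension 'maxSig'.
__global__ void accumulateFxc(const int* flags, const double* w,
                              const double* vrho, const double* phi,
                              const int* basisLoc, const int* basisID,
                              int maxSig, int N, double* Fxc) {
    int slot = blockIdx.x * blockDim.x + threadIdx.x;
    if (!flags[slot]) return;                        // dummy grid point

    int lo = basisLoc[blockIdx.x], hi = basisLoc[blockIdx.x + 1];
    int nsig = hi - lo;
    const double* phiPt = phi + (long long)slot * maxSig;  // this point's values
    double prefac = w[slot] * vrho[slot];

    for (int a = 0; a < nsig; ++a) {
        int mu = basisID[lo + a];
        for (int b = 0; b < nsig; ++b) {
            int nu = basisID[lo + b];
            // Several threads may update the same (mu, nu) element, hence the
            // atomic addition (double atomicAdd needs compute capability 6.0+).
            atomicAdd(&Fxc[mu * N + nu], prefac * phiPt[a] * phiPt[b]);
        }
    }
}
```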

Table 1. Comparison of times between QUICK and GAMESS

Molecule (atom count)   Basis set (basis function count)   Last SCF iteration (s)   ERI gradient time (s)   XC gradient time (s)
                                                           QUICK    GAMESS          QUICK    GAMESS         QUICK    GAMESS
Morphine (40)           6-31G (227)                        0.7      11.9            3.8      9.6            2.3      11.3
                        6-31G* (353)                       1.7      23.9            12.9     60.7           3.8      19.6
                        6-31G** (410)                      2.2      30.6            16.0     72.0           4.4      24.4
Gly12 (87)              6-31G (517)                        1.8      43.5            8.4      23.6           1.7      112.9
                        6-31G* (811)                       5.2      79.7            27.2     164.6          3.4      208.6
                        6-31G** (925)                      7.0      94.4            34.9     201.4          3.6      246.9
Valinomycin (168)       6-31G (882)                        18.9     159.8           85.1     183.2          10.4     686.6
                        6-31G* (1350)                      40.7     334.5           243.0    975.7          18.7     1167.3
                        6-31G** (1620)                     58.6     370.1           317.7    1291.5         19.3     1571.9

3.7 Computing XC energy nuclear gradients

Most of the implementation strategy described above for the XC energy and potential also holds for our gradient algorithm. More specifically, this is a two-step process consisting of calculating the XC energy gradients and the grid weight gradients. While computing the former involves a majority of the steps discussed for the energy calculation (i.e. computing the values of the basis functions and their derivatives and the functional values and associated derivatives), we additionally compute second derivatives of the basis functions. The device kernel performing this task is similar to the one that calculates basis function values and their gradients at a given grid point in the sense of accessing basis and primitive function IDs. The grid weight gradient calculation is only required for grid points whose weight is not equal to unity. Therefore, we prune our grid for a third time, removing all points that do not meet this criterion. The resulting grid points are packed without dummy points, uploaded to the GPU, and the appropriate device kernel is called to compute the gradients. The calculated gradient contribution from each thread is added to the total gradient vector using the CUDA atomic add function, consistent with our existing ERI gradient scheme.
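The atomic accumulation into the gradient vector can be sketched as follows (our own simplified kernel, assuming the per-thread contributions of equation (13) have already been evaluated; names are illustrative).

```cuda
#include <cuda_runtime.h>

// Fold precomputed per-thread XC gradient contributions into the global
// gradient vector (3 components per atom) with atomic additions, mirroring
// the strategy used for the ERI gradients. 'atomIdx[i]' is the atom to which
// contribution i belongs; double atomicAdd needs compute capability 6.0+.
__global__ void scatterXcGradient(const double3* contrib, const int* atomIdx,
                                  int n, double* grad) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int A = atomIdx[i];
    atomicAdd(&grad[3 * A + 0], contrib[i].x);
    atomicAdd(&grad[3 * A + 1], contrib[i].y);
    atomicAdd(&grad[3 * A + 2], contrib[i].z);
}
```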

Table 2. Comparison of the accuracy and performance of B3LYP calculations performed using QUICK CPU (serial) and GPU versions.

Molecule (atom count)                 Basis set (basis function count)   Second SCF iteration (s)       XC (s)                        Total energy (au)
                                                                         CPU      GPU     speedup       CPU     GPU    speedup        CPU               GPUa
(H2O)32 (96)                          6-311G (608)                       153.9    2.9     54            44.5    0.5    82             -2444.98724960    -1.09E-07
                                      6-311G** (992)                     367.2    7.7     48            68.7    1.0    70             -2445.85046773    -1.12E-07
                                      cc-pVDZ (800)                      428.3    5.5     78            53.9    0.8    67             -2445.00925604    -1.09E-07
Taxol (110)                           6-311G (940)                       687.0    14.9    46            131.0   1.7    75             -2927.02879410    -6.20E-08
                                      6-311G** (1453)                    2314.4   44.2    52            223.2   3.4    66             -2927.98372502    -3.10E-08
                                      cc-pVDZ (1160)                     3537.4   38.4    92            198.7   3.1    63             -2927.38471110    -5.10E-08
Valinomycin (168)                     6-311G (1284)                      1451.1   36.9    39            225.9   2.9    78             -3793.79224954    -1.28E-07
                                      6-311G** (2022)                    4628.3   105.6   44            364.2   5.2    69             -3795.12274932    -1.10E-07
                                      cc-pVDZ (1620)                     6625.2   81.4    81            304.8   4.7    65             -3794.17889687    -9.20E-08
310-helix acetyl(ala)18NH2 (189)      6-311G (1507)                      1680.8   43.7    38            211.3   2.7    77             -4660.64280944    -7.40E-08
                                      6-311G** (2356)                    5914.4   134.4   44            356.1   5.2    68             -4662.21706599    -5.60E-08
                                      cc-pVDZ (1885)                     8789.8   108.9   81            309.0   4.8    65             -4661.20248530    -6.80E-08
α-helix acetyl(ala)18NH2 (189)        6-311G (1507)                      1881.6   52.3    36            257.2   3.3    78             -4660.97909929    -1.55E-07
                                      6-311G** (2356)                    6186.0   157.4   39            408.9   5.9    69             -4662.53814845    -2.10E-07
                                      cc-pVDZ (1885)                     8907.1   119.8   74            355.9   5.4    65             -4661.54162676    3.10E-08
β-strand acetyl(ala)18NH2 (189)       6-311G (1507)                      921.2    24.5    38            100.3   1.3    76             -4660.89912485    -6.90E-08
                                      6-311G** (2356)                    3451.9   82.1    42            187.8   2.8    67             -4662.48158890    -1.45E-07
                                      cc-pVDZ (1885)                     5094.9   63.4    80            160.6   2.5    63             -4661.46843955    2.09E-06
α-conotoxin mii (PDB ID: 1M2C) (220)  6-31G* (1964)                      4627.1   108.3   43            373.6   4.9    77             -7142.85690266    -6.04E-06
                                      6-31G** (2276)                     5601.1   138.1   41            407.1   6.2    66             -7143.06700681    -5.34E-06
                                      6-311G (1852)                      3390.0   103.5   33            424.9   5.5    78             -7142.57919438    -7.30E-08
olestra (453)                         6-31G* (3181)                      5787.1   143.7   40            258.9   4.1    63             -7540.95874486    -2.14E-06
                                      6-31G** (4015)                     9881.3   277.6   36            310.1   4.5    70             -7541.35299607    -2.19E-06
                                      6-311G (3109)                      5511.5   164.3   34            276.0   3.7    74             -7540.71669525    -5.85E-07

a The GPU energy column shows the deviation of the energy with respect to the corresponding CPU calculation.

3.8 CPU analog of the GPU implementation

In order to perform a fair comparison with our GPU capable DFT implementation, we implemented a parallel CPU version using MPI. In this version, we make use of the standard LIBXC code base and omit the LIBXC dry run from our workflow (step 2 in Fig. 2). Furthermore, the grid generation and octree run are performed in serial, but the weight computation, prescreening and preparation of basis and primitive function lists are all performed in parallel. Moreover, the packing of grid points (step 3g in Fig. 2) is omitted. Computation of the electron densities, XC energy, potential and gradients (steps 4-6 in Fig. 2) is performed in parallel. More specifically, once the grid operations are completed, the partitioned bins are distributed among the slave CPU tasks. The corresponding grid weights and the basis and primitive function lists are also broadcast. All CPU tasks retrieve basis and primitive function IDs using the locators as discussed in section 3.5, but with the block index now replaced by the bin index. When computing the XC energy and the matrix elements of the potential, each CPU task initializes LIBXC through a standard Fortran 90 interface and computes the required functional values. The slave CPU tasks then send the computed energy and Kohn-Sham matrix contributions to the master CPU task. The implementation of the XC gradient scheme is very similar to the above.
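The distribution and reduction pattern can be sketched as follows. This is a schematic C++/MPI example of ours, using a static round-robin assignment of bins over all ranks for brevity; QUICK's actual master/slave distribution and Fortran interface differ.

```cuda
#include <mpi.h>
#include <vector>

// Schematic MPI work distribution: broadcast shared data, let each rank
// process its share of the bins, and reduce the per-rank Kohn-Sham
// contributions onto the master rank.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int nbin = 1000, N = 64;                 // illustrative sizes
    std::vector<double> Flocal(N * N, 0.0), Ftotal(N * N, 0.0);

    // Grid weights and basis/primitive lists would be broadcast here with
    // MPI_Bcast(...) before the bin loop.

    for (int b = rank; b < nbin; b += nranks) {
        // ... compute the electron density, XC energy and potential for bin b,
        //     accumulating the Kohn-Sham contribution into Flocal ...
        Flocal[0] += 0.0;                          // placeholder work
    }

    // Collect the distributed Kohn-Sham contributions on the master rank.
    MPI_Reduce(Flocal.data(), Ftotal.data(), N * N, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```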

Table 3. Comparison of gradient times between QUICK CPU (serial) and GPU versions

Molecule (atom count)                 Basis set (basis function count)   ERI gradients (s)                XC gradients (s)
                                                                         CPU        GPU      speedup      CPU        GPU     speedup
(H2O)32 (96)                          6-31G (647)                        218.3      5.3      41           1196.7     3.7     325
                                      6-31G** (992)                      1015.1     23.7     43           1261.3     5.4     235
Taxol (110)                           6-31G (647)                        1676.5     38.2     44           1893.4     9.4     201
                                      6-31G** (1160)                     7479.7     149.1    50           2094.2     14.7    142
Valinomycin (168)                     6-31G (882)                        3339.8     85.0     39           6156.7     19.4    318
                                      6-31G** (1620)                     14579.7    319.3    46           6457.1     27.0    239
310-helix acetyl(ala)18NH2 (189)      6-31G (608)                        4281.0     97.6     44           8611.6     20.6    418
                                      6-31G** (992)                      20009.4    395.9    51           8907.9     27.6    323
α-helix acetyl(ala)18NH2 (189)        6-31G (608)                        4592.9     109.7    42           8777.5     23.0    381
                                      6-31G** (992)                      20114.2    422.8    48           9073.9     32.5    279
β-strand acetyl(ala)18NH2 (189)       6-31G (608)                        2307.7     45.7     51           8398.9     16.5    509
                                      6-31G** (992)                      11247.6    187.8    60           8601.1     22.4    383
α-conotoxin mii (PDB ID: 1M2C) (220)  6-31G (1268)                       8371.3     232.6    36           13868.9    17.8    780
                                      6-31G** (2276)                     33798.1    813.3    42           14349.2    28.8    498

4. Benchmark Results and Discussion

4.1 Benchmarking the GPU implementation

We now present the benchmarking results of our DFT implementation. First, the performance of the QUICK GPU version is compared against the GPU version of GAMESS.29,32,40 Then a similar comparison between the QUICK CPU and GPU versions is performed. For the former test, we obtained a precompiled copy of GAMESS (the 17.09-r2-libcchem version) in a Singularity image from the NVIDIA GPU cloud webpage. In order to make a fair comparison, the QUICK code was compiled using comparable GNU and CUDA compilers with optimization level 2 (-O2). Note that the performance tradeoff between running a CPU/GPU application natively or through a container is reported to be negligible48,49 and we assume that the container overhead has no significant impact on our GAMESS timings. The selected test cases include morphine, (glycine)12 and valinomycin BLYP50–52 gradient calculations using the 6-31G, 6-31G* and 6-31G** basis sets. In the GAMESS input files, the SG-1 grid system and direct SCF were requested and the density matrix cutoff value was set to 10^-8. Default values were used for all the other options. All tests were carried out on an NVIDIA Volta V100-SXM2 GPU (32 GB) accompanied by an Intel Xeon (R) Gold 6138 CPU (2.10 GHz) with 190 GB of memory.

The performance comparison between QUICK and GAMESS suggests that the former is significantly faster than the latter (see Table 1). For ERI gradients, the observed speedup is roughly 2-5 fold, suggesting that our ERI engine is more efficient, in particular for basis sets with higher angular momentum quantum numbers. Furthermore, QUICK displays higher speedups (~60 or 80-fold in some cases) for the XC gradients; however, this result has to be interpreted carefully since the XC implementation of GAMESS appears not to be GPU capable. Note that in the larger test cases, the GAMESS XC gradient time surpasses the ERI gradient time. Based on the times reported for the last SCF iteration, the QUICK SCF scheme also outperforms GAMESS. Again, the slow performance of the latter may be partially attributed to its CPU based XC implementation. A comparison of separate timings for the ERI, XC energy and potential calculations during each SCF iteration was not possible since timings for these individual tasks were not reported by GAMESS. Nevertheless, HF single point calculations carried out for the same systems (see Table S1) show that QUICK ERI calculations are much faster, consistent with the gradient results documented above.

We now compare the accuracy and performance of the QUICK CPU and GPU versions using a series of B3LYP53 energy calculations. As is apparent from Table 2, the energies computed by the QUICK GPU version agree with the CPU version to within 10^-6 au or better. Furthermore, the GPU version of the XC implementation delivers at least a 60-fold speedup over the serial CPU version. In both versions, less than 30% of the SCF step time is spent on computing the XC energy and the contribution to the Kohn-Sham matrix. This suggests that the ERI calculation remains the performance bottleneck of QUICK DFT energy calculations. As a result, the speedup observed for the SCF iteration is in most cases somewhat lower than the XC speedup.

In Table 3, we report a comparison of B3LYP/6-31G and B3LYP/6-31G** gradient calculation times between the QUICK CPU and GPU versions. The selected test cases are a subset of the molecular systems chosen for the SCF tests. The key observations from Table 3 are: 1) the GPU version delivers significant speedups for both ERI and XC gradient calculations, 2) the speedups observed for ERI gradients resemble the ones realized for the energy calculations (~50 times), and 3) the speedup delivered for XC gradients is in the few hundreds (~100-800 times). The similar speedups observed for ERI energy and gradient calculations can be explained by the fact that their implementations are similar. Indeed, computing the gradients of a given basis function involves the cost of computing the value of a higher angular momentum basis function. Understanding the details of the observed speedups for the XC energy and gradient computations requires further investigation. As mentioned in section 3.7, we implemented the XC energy gradient and grid weight gradient calculations in two separate kernels. Careful examination of the kernel run times suggests that the former consumes more than 90% of the gradient time and gains a better speedup on a GPU. However, the structure of this kernel is similar to that of the XC energy kernel with the exception of computing second derivatives. Therefore, the observed higher speedup must be due to the inefficiency of computing the second derivatives in our CPU implementation.

Table 4. Comparison of accuracy and performance between LIBXC CPU (serial) and GPU versions for taxol (110 atoms) with the 6-31G basis set (647 contracted basis functions).

Functional            Number of SCF iterations   XC (s)                          Total energy (au)
                                                 CPU      GPU     speedup        CPU                GPU
BLYP-Native           21                         491.2    21.7    23             -2925.30524021     -3.65E-07
BLYP                  21                         486.6    21.7    22             -2925.30524021     -3.65E-07
B3LYP-Native          20                         398.4    21.7    18             -2926.28595424     -2.41E-07
B3LYP                 17                         399.1    17.6    23             -2926.28595424     -2.41E-07
PBE0 (refs 54,55)     17                         397.6    17.6    23             -2923.03725363     -1.50E-08
B3PW91 (ref 56)       17                         466.2    20.8    22             -2925.19518104     -1.10E-08
XVWN (refs 57,58)     21                         498.3    21.7    23             -2902.07631292     -2.70E-08
LP_A (ref 59)         21                         498.7    21.7    23             -2930.72703131     -2.80E-08

4.2 Performance of LIBXC functionals

As mentioned above, we have integrated the original and modified LIBXC library versions into the QUICK CPU and GPU codes, respectively. It is important to document the performance of these functionals within our XC scheme. In Table 4, we report a series of taxol energy calculations at the DFT/6-31G level of theory using various functionals. The selected functionals include hand-coded (native) BLYP and B3LYP and their LIBXC versions and, additionally, representative LDA, GGA and hybrid-GGA functionals. Comparison of the native BLYP and B3LYP total energies against the corresponding LIBXC versions suggests that the latter are as accurate as the former on both CPU and GPU platforms. Furthermore, the reported CPU and GPU XC times remain very similar and the acceleration delivered by the GPU remains substantially the same. In fact, the ca. 20-fold speedup realized on the GPU platform is common to all LDA and GGA functionals. Note that computing the value of the density functionals is relatively inexpensive with respect to the other operations performed in the XC energy or gradient calculations, and such a similar speedup is therefore expected. Finally, the fact that we maintain a similar speedup among native and LIBXC functionals suggests that the minor modifications performed to the density functional source code in the GPU version have not introduced performance penalties.

4.3 Comparison of GPU vs parallel CPU implementations

We now compare the performance of the GPU and parallel CPU (MPI) implementations using taxol gradient calculations at the B3LYP/6-31G** level of theory. The selected platform for testing is a single computer node equipped with an NVIDIA V100-SXM2 GPU (32 GB) and 40 Intel Xeon (R) Gold 6148 cores (2 sockets with 20 cores each, 2.40 GHz clock speed) with 367 GB of memory. The QUICK MPI version was compiled using the GNU 7.2.0 compiler and MPICH 3.2.1. The GPU version was compiled using the same GNU compiler with CUDA version 10.0.130. As is apparent from Fig. 3A, the DFT grid operation time for the same calculation is reduced by ca. 4 times (14.8 vs 3.6 s) when going from a single core to 40 cores in the MPI version. The realized speedup can be mainly attributed to the parallelized grid weight computation, grid pruning and basis function prescreening. The corresponding V100 GPU time for the same task is 1.7 s, with approximately 1 s spent on the CPU based octree run. We tested the necessity of implementing the octree algorithm on the GPU by measuring the time to partition the numerical grid points of larger molecular systems. The results suggest that our existing CPU implementation is sufficiently efficient to handle such systems. For example, partitioning the grid points of olestra (453 atoms, ~4.4 million grid points) only took 3.0 s, while computing grid weights and prescreening consumed 12.7 and 0.6 s, respectively.

Fig. 3. Comparison of performance between the GPU and MPI implementations for taxol (110 atoms) with B3LYP/6-31G** (1160 contracted basis functions). [Panels: A, DFT grid operation time; B, SCF ERI and XC times; C, ERI and XC gradient times; D, total run time; each plotted against the number of CPU cores, with the ideal MPI time and the time on a V100 GPU shown for reference.]

In Fig. 3B, we report the total time spent on the ERI and XC calculations during 19 SCF iterations. The maximum speedups observed for the former and latter in the MPI version are ca. 23 and 33-fold, respectively. The speedups from the GPU version for the same tasks are ca. 65 and 62-fold. For ERI gradient calculations (see Fig. 3C), the speedup gained from the MPI version is similar to that of the SCF (ca. 22-fold) but significantly less than the corresponding GPU speedup (ca. 50-fold). Furthermore, the MPI version delivers about a 59-fold speedup for the XC gradient calculations; however, this is below the acceleration achieved by a V100 GPU (ca. 142-fold). Finally, as evident from Fig. 3D, the total speedup gained from the GPU is two times that obtained using 40 cores. It is also important to comment on the linearity of our MPI plots in Fig. 3B and 3C. Both the ERI SCF and gradient calculations scale well with the number of cores and almost achieve the ideal MPI speedup, since we distribute atomic shells among the MPI tasks (CPU cores). However, in the XC implementation, we distribute bins containing varying numbers of grid points. Therefore, the workload is not optimally balanced and the plots of the XC MPI timings display a non-regular (not linear) speedup, unlike in the ERI case.

4.4 Performance of QUICK on different GPU architectures

For all the aforementioned benchmarks, we have used an NVIDIA V100 data center GPU. It is also necessary to document how the current QUICK GPU implementation performs on other available devices. In Table 5, we report the performance of several important kernels on different workstation GPUs (V100, Titan V and P100) and a gaming GPU (RTX2080Ti). The selected test case for this purpose is the taxol gradient calculation at the B3LYP/6-31G level of theory. As is apparent from Table 5, all kernels display their best performance on the V100 GPU. Surprisingly, the ERI kernels show their slowest times on the P100 rather than on the gaming GPU, suggesting that their performance is not limited by double precision (FP64) operations. Note that the P100 GPU has a higher FP64 capability in comparison to the RTX2080Ti.60,61 A detailed examination of the two kernels revealed that their performance is bound by high register usage and memory bandwidth. In contrast, the XC kernels displayed their highest timings on the RTX2080Ti GPU and are therefore bound by FP64 operations. Further examination of the XC energy gradient kernel indicated that its performance is also limited by atomic operations. Overall, excellent performance is achieved both on data center GPUs with the Pascal and Volta architectures and on gaming GPUs with the Turing architecture.


Table 5. Comparison of the performance of different kernels using different GPU architectures.

                 Total time for each task (s)
Task             V100 (32 GB)   Titan V (12 GB)   P100 (16 GB)   RTX2080Ti (11 GB)
SCF ERIa         80.4           107.6             159.2          125.2
SCF XCa          22.7           26.1              59.8           114.3
ERI gradients    38.0           40.7              66.9           55.1
XC gradients     6.9            7.9               8.5            13.4

a The reported time is the summation over 22 SCF cycles.

5. Conclusions

We have reported the details of the MPI parallel CPU and GPU enabled DFT implementations in the QUICK quantum chemical package. Our implementation consists of features such as octree based grid point partitioning, GPU assisted grid pruning and basis/primitive function prescreening, and fully GPU enabled XC energy and gradient computations. A performance comparison with the GAMESS GPU version demonstrates that DFT calculations with QUICK are significantly faster. The accelerations observed for the XC energy and gradient computations in the QUICK GPU version with respect to the serial CPU version are impressive. The speedups realized on a V100 GPU for the former and latter are approximately 60 to 80-fold and 140 to 780-fold, respectively. Such speedups are out of reach with the MPI parallel CPU version even if one uses 40 cores in parallel. The recommended device for the latest QUICK version (v20.03) is the NVIDIA V100 data center GPU, but the code also runs very well on gaming GPUs.

The profiling of the ERI and XC kernels has shown that there is room for further performance improvement. In the former context, reimplementing the large ERI kernels as several smaller kernels may be a viable strategy. This is expected to reduce the register pressure and enhance the performance of the ERI engine. The performance of the XC kernels on gaming GPUs may be improved by implementing a mixed precision scheme. Additionally, strategies such as storing the gradient vector in shared memory and latency hiding may help reduce the computational cost associated with atomic operations in the XC energy gradient kernel.

Currently, we are integrating QUICK as a library into the AMBER molecular dynamics package62 to enable fully GPU enabled quantum mechanics/molecular mechanics (QM/MM) simulations. Finally, QUICK version 20.03 can be freely downloaded from http://www.merzgroup.org/quick.html under the Mozilla public license.

ASSOCIATED CONTENT

Supporting text and Cartesian coordinates of the test molecules. This material is available free of charge via the Internet at http://pubs.acs.org.

ACKNOWLEDGMENT

We thank Jonathan Lefman and Peng Wang from NVIDIA for their useful comments on technical aspects of our GPU code. M.M. and K.M. are grateful to the Department of Chemistry and Biochemistry and the high performance computing center (iCER HPCC) at Michigan State University. M.M. and A.G. thank the San Diego Supercomputer Center for the granted computer time. This research was supported by the National Science Foundation grant OAC-1835144. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation (grant number ACI-1053575, resources at the San Diego Supercomputer Center through award TG-CHE130010 to A.G.).

References

(1) Nobile, M. S.; Cazzaniga, P.; Tangherloni, A.; Besozzi, D. Graphics Processing Units in Bioinformatics, Computational Biology and Systems Biology. Brief. Bioinform. 2017, 18, 870–885.

(2) NVIDIA. Seeing Gravity in Real-Time With Deep Learning https://images.nvidia.com/content/pdf/ncsa-gravity-group-success-story.pdf (accessed Feb 25, 2020).

(3) NVIDIA. CUDA C++ Programming Guide https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (accessed Feb 25, 2020).

(4) Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. Accelerating Molecular Dynamic Simulation on Graphics Processing Units. J. Comput. Chem. 2009, 30, 864–872.

(5) Götz, A. W.; Williamson, M. J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R. C. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born. J. Chem. Theory Comput. 2012, 8, 1542–1555.

(6) Salomon-Ferrer, R.; Götz, A. W.; Poole, D.; Le Grand, S.; Walker, R. C. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald. J. Chem. Theory Comput. 2013, 9, 3878–3888.

(7) Biagini, T.; Petrizzelli, F.; Truglio, M.; Cespa, R.; Barbieri, A.; Capocefalo, D.; Castellana, S.; Tevy, M. F.; Carella, M.; Mazza, T. Are Gaming-Enabled Graphic Processing Unit Cards Convenient for Molecular Dynamics Simulation? Evol. Bioinforma. 2019, 15, 1–3.

(8) Lee, T. S.; Cerutti, D. S.; Mermelstein, D.; Lin, C.; Legrand, S.; Giese, T. J.; Roitberg, A.; Case, D. A.; Walker, R. C.; York, D. M. GPU-Accelerated Molecular Dynamics and Free Energy Methods in Amber18: Performance Enhancements and New Features. J. Chem. Inf. Model. 2018, 58, 2043–2050.

(9) Harvey, M. J.; Giupponi, G.; De Fabritiis, G. ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. J. Chem. Theory Comput. 2009, 5, 1632–1639.

(10) Le Grand, S.; Götz, A. W.; Walker, R. C. SPFP: Speed without Compromise - A Mixed Precision Model for GPU Accelerated Molecular Dynamics Simulations. Comput. Phys. Commun. 2013, 184, 374–380.

(11) Kutzner, C.; Páll, S.; Fechner, M.; Esztermann, A.; De Groot, B. L.; Grubmüller, H. Best Bang for Your Buck: GPU Nodes for GROMACS Biomolecular Simulations. J. Comput. Chem. 2015, 36, 1990–2008.

(12) Eastman, P.; Swails, J.; Chodera, J. D.; McGibbon, R. T.; Zhao, Y.; Beauchamp, K. A.; Wang, L. P.; Simmonett, A. C.; Harrigan, M. P.; Stern, C. D.; et al. OpenMM 7: Rapid Development of High Performance Algorithms for Molecular Dynamics. PLoS Comput. Biol. 2017, 13, e1005659.

(13) Ufimtsev, I. S.; Martínez, T. J. Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation. J. Chem. Theory Comput. 2008, 4, 222–231.

(14) Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 2. Direct Self-Consistent-Field Implementation. J. Chem. Theory Comput. 2009, 5, 1004–1015.

(15) Ufimtsev, I. S.; Martinez, T. J. Quantum Chemistry on Graphical Processing Units. 3. Analytical Energy Gradients, Geometry Optimization, and First Principles Molecular Dynamics. J. Chem. Theory Comput. 2009, 5, 2619–2628.

(16) Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Dynamic Precision for Electron Repulsion Integral Evaluation on Graphical Processing Units (GPUs). J. Chem. Theory Comput. 2011, 7, 949–954.

(17) Isborn, C. M.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. Excited-State Electronic Structure with Configuration Interaction Singles and Tamm–Dancoff Time-Dependent Density Functional Theory on Graphical Processing Units. J. Chem. Theory Comput. 2011, 7, 1814–1823.

(18) Hohenstein, E. G.; Luehr, N.; Ufimtsev, I. S.; Martínez, T. J. An Atomic Orbital-Based Formulation of the Complete Active Space Self-Consistent Field Method on Graphical Processing Units. J. Chem. Phys. 2015, 142, 224103.

(19) Liu, F.; Sanchez, D. M.; Kulik, H. J.; Martínez, T. J. Exploiting Graphical Processing Units to Enable Quantum Chemistry Calculation of Large Solvated Molecules with Conductor-like Polarizable Continuum Models. Int. J. Quantum Chem. 2019, 119, e25760.

(20) Fales, B. S.; Levine, B. G. Nanoscale Multireference Quantum Chemistry: Full Configuration Interaction on Graphical Processing Units. J. Chem. Theory Comput. 2015, 11, 4708–4716.

(21) Fales, B. S.; Martínez, T. J. Efficient Treatment of Large Active Spaces through Multi-GPU Parallel Implementation of Direct Configuration Interaction. J. Chem. Theory Comput. 2020.

(22) Yasuda, K. Accelerating Density Functional Calculations with Graphics Processing Unit. J. Chem. Theory Comput. 2008, 4, 1230–1236.

(23) Miao, Y.; Merz, K. M. Acceleration of Electron Repulsion Integral Evaluation on Graphics Processing Units via Use of Recurrence Relations. J. Chem. Theory Comput. 2013, 9, 965–976.

(24) Götz, A. W.; Wölfle, T.; Walker, R. C. Quantum Chemistry on Graphics Processing Units. Annu. Rep. Comput. Chem. 2010, 6, 21–35.

(25) Walker, R. C.; Götz, A. W. Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics; Walker, R. C., Götz, A. W., Eds.; wiley: Chichester, UK, 2016.

(26) Andrade, X.; Aspuru-Guzik, A. Real-Space Density Functional Theory on Graphical Processing Units: Computational Approach and Comparison to Gaussian Basis Set Methods. J. Chem. Theory Comput. 2013, 9, 4360–4373.

(27) Asadchev, A.; Gordon, M. S. Fast and Flexible Coupled Cluster Implementation. J. Chem. Theory Comput. 2013, 9, 3385–3392.

(28) Miao, Y.; Merz, K. M. Acceleration of High Angular Momentum Electron Repulsion Integrals and Integral Derivatives on Graphics Processing Units. J. Chem. Theory Comput. 2015, 11, 1449–1462.

(29) Asadchev, A.; Allada, V.; Felder, J.; Bode, B. M.; Gordon, M. S.; Windus, T. L. Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units. J. Chem. Theory Comput. 2010, 6, 696–704.

(30) Yasuda, K. Two-Electron Integral Evaluation on the Graphics Processor Unit. J. Comput. Chem. 2008, 29, 334–342.

(31) DePrince, A. E.; Hammond, J. R. Coupled Cluster Theory on Graphics Processing Units I. The Coupled Cluster Doubles Method. J. Chem. Theory Comput. 2011, 7, 1287–1295.

(32) Asadchev, A.; Gordon, M. S. New Multithreaded Hybrid CPU/GPU Approach to Hartree-Fock. J. Chem. Theory Comput. 2012, 8, 4166–4176.

(33) Obara, S.; Saika, A. Efficient Recursive Computation of Molecular Integrals over Cartesian Gaussian Functions. J. Chem. Phys. 1986, 84, 3963–3974.

(34) Head-Gordon, M.; Pople, J. A. A Method for Two-Electron Gaussian Integral and Integral Derivative Evaluation Using Recurrence Relations. J. Chem. Phys. 1988, 89, 5777–5786.

(35) Jones, R. O. Density Functional Theory: Its Origins, Rise to Prominence, and Future. Rev. Mod. Phys. 2015, 87, 897.

(36) Luehr, N.; Sisto, A.; Martinez, T. J. Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics. In Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics; Walker, R. C., Götz, A. W., Eds.; John Wiley & Sons, Ltd: Chichester, UK, 2016; pp 1–342.

(37) Nitsche, M. A.; Ferreria, M.; Mocskos, E. E.; González Lebrero, M. C. GPU Accelerated Implementation of Density Functional Theory for Hybrid QM/MM Simulations. J. Chem. Theory Comput. 2014, 10, 959–967.

(38) Pople, J. A.; Gill, P. M. W.; Johnson, B. G. Kohn—Sham Density-Functional Theory within a Finite Basis Set. Chem. Phys. Lett. 1992, 199, 557–560.

(39) Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 3.1. https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf (accessed Mar 17, 2020).

(40) Schmidt, M. W.; Baldridge, K. K.; Boatz, J. A.; Elbert, S. T.; Gordon, M. S.; Jensen, J. H.; Koseki, S.; Matsunaga, N.; Nguyen, K. A.; Su, S.; et al. General Atomic and Molecular Electronic Structure System. J. Comput. Chem. 1993, 14, 1347–1363.

(41) NVIDIA. NVIDIA Tesla V100 GPU Architecture https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf (accessed Feb 25, 2020).

(42) Gill, P. M. W.; Johnson, B. G.; Pople, J. A. A Standard Grid for Density Functional Calculations. Chem. Phys. Lett. 1993, 209, 506–512.

(43) Stratmann, R. E.; Scuseria, G. E.; Frisch, M. J. Achieving Linear Scaling in Exchange-Correlation Density Functional Quadratures. Chem. Phys. Lett. 1996, 257, 213–223.

(44) Density Functional Repository http://www.cse.scitech.ac.uk/ccg/dft/ (accessed Feb 28, 2020).

(45) Ekström, U.; Visscher, L.; Bast, R.; Thorvaldsen, A. J.; Ruud, K. Arbitrary-Order Density Functional Response Theory from Automatic Differentiation. J. Chem. Theory Comput. 2010, 6, 1971–1980.

(46) Lehtola, S.; Steigemann, C.; Oliveira, M. J. T.; Marques, M. A. L. Recent Developments in LIBXC — A Comprehensive Library of Functionals for Density Functional Theory. SoftwareX 2018, 7, 1–5.

(47) Lapillonne, X.; Fuhrer, O. Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science. Parallel Process. Lett. 2014, 24.

(48) Xu, P.; Shi, S.; Chu, X. Performance Evaluation of Deep Learning Tools in Docker Containers. In International Conference on Big Data Computing and Communications; Institute of Electrical and Electronics Engineers Inc., 2017; pp 395–403.

(49) Torrez, A.; Priedhorsky, R.; Randles, T. HPC Container Runtime Performance Overhead: At First Order, There Is None https://sc19.supercomputing.org/proceedings/tech_poster/tech_poster_pages/rpost227.html (accessed Feb 25, 2020).

(50) Becke, A. D. Density-Functional Exchange-Energy Approximation with Correct Asymptotic Behavior. Phys. Rev. A 1988, 38, 3098–3100.

(51) Lee, C.; Yang, W.; Parr, R. G. Development of the Colle-Salvetti Correlation-Energy Formula into a Functional of the Electron Density. Phys. Rev. B 1988, 37, 785–789.

(52) Miehlich, B.; Savin, A.; Stoll, H.; Preuss, H. Results Obtained with the Correlation Energy Density Functionals of Becke and Lee, Yang and Parr. Chem. Phys. Lett. 1989, 157, 200–206.

(53) Stephens, P. J.; Devlin, F. J.; Chabalowski, C. F.; Frisch, M. J. Ab Initio Calculation of Vibrational Absorption and Circular Dichroism Spectra Using Density Functional Force Fields. J. Phys. Chem. 1994, 98, 11623–11627.

(54) Adamo, C.; Barone, V. Toward Reliable Density Functional Methods without Adjustable Parameters: The PBE0 Model. J. Chem. Phys. 1999, 110, 6158–6170.

(55) Ernzerhof, M.; Scuseria, G. E. Assessment of the Perdew-Burke-Ernzerhof Exchange-Correlation Functional. J. Chem. Phys. 1999, 110, 5029–5036.

(56) Becke, A. D. Density-Functional Thermochemistry. III. The Role of Exact Exchange. J. Chem. Phys. 1993, 98, 5648–5652.

(57) Dirac, P. A. M. Note on Exchange Phenomena in the Thomas Atom. Math. Proc. Cambridge Philos. Soc. 1930, 26, 376–385.

(58) Vosko, S. H.; Wilk, L.; Nusair, M. Accurate Spin-Dependent Electron Liquid Correlation Energies for Local Spin Density Calculations: A Critical Analysis. Can. J. Phys. 1980, 58, 1200–1211.

(59) Lee, C.; Parr, R. G. Exchange-Correlation Functional for Atoms and Molecules. Phys. Rev. A 1990, 42, 193–200.

(60) NVIDIA. NVIDIA Tesla P100 https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf (accessed Mar 17, 2020).

(61) NVIDIA. NVIDIA Turing GPU Architecture https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf (accessed Feb 25, 2020).

(62) Case, D. A.; Ben-Shalom, I. Y. .; Brozell, S. R.; Cerutti, D. S.; Cheatham, T. E.; III; Cruzeiro, V. W. D.; Darden, T. A.; Duke, R. E.; Simmerling, C. L.; et al. AMBER 2018. University of California: San Francisco, CA 2018.

Insert Table of Contents artwork here

[TOC graphic: valinomycin B3LYP/6-31G** XC (SCF) and XC gradient timings, CPU vs GPU, showing 67X and 239X speedups on the GPU.]

Supporting Information for

Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel Program

Madushanka Manathunga†, Yipu Miao#, Dawei Mu%, Andreas W. Götz¶,* and Kenneth M. Merz, Jr.†,*

†Department of Chemistry and Department of Biochemistry and Molecular Biology, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824-1322, United States, #Facebook, 1 Hacker Way, Menlo Park, California 94025, United States, %National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark St, Urbana, IL 61801, United States, ¶San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, La Jolla, California 92093-0505, United States.

Contents:
S1: Grid point packing and retrieval of basis/primitive function indices
S2: Comparison of HF performance between the QUICK and GAMESS GPU versions

S1: Grid point packing and retrieval of basis/primitive function indices

Fig. S1. A. True grid points (T, green) are packed into an array of 256*nbin elements together with dummy grid points (D, brown). The kernel launch is performed with the same number of threads, and true and dummy threads are distinguished by an assigned integer flag. B. Structures of the basis and primitive function locator arrays and of the arrays that hold the corresponding IDs. Here l and m stand for the total number of basis and primitive functions. Colors indicate the access patterns. For example, threads in block 0 (magenta) access the first two elements of the basis function locator array and obtain the appropriate indices for accessing the basis function ID array (indices 0 to 2). To obtain the primitive function IDs, for instance for the basis function located at index 0, the first two elements of the primitive function locator array (green) are picked. The corresponding primitive function IDs are then obtained by accessing the first two elements of the primitive function ID array (green).
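
The access pattern of Fig. S1 can be illustrated with the following CUDA sketch (all array and function names are hypothetical placeholders, not the actual QUICK data structures): each thread checks the true/dummy flag of its packed grid point, reads the range of significant basis function IDs for its bin from the basis function locator, and for each basis function entry reads the corresponding range of significant primitive function IDs from the primitive function locator.

```cuda
// Minimal sketch of the Fig. S1 locator-array access pattern
// (hypothetical names; not the actual QUICK kernels or data layout).
#include <cuda_runtime.h>

// Placeholder for the real work, e.g. evaluating primitive 'prim' of basis
// function 'mu' at grid point 'point' (hypothetical stub, returns 0).
__device__ double evaluatePrimitiveStub(int mu, int prim, int point)
{
    (void)mu; (void)prim; (void)point;
    return 0.0;
}

__global__ void binAccessSketch(const int* __restrict__ pointFlag,     // 1 = true grid point, 0 = dummy
                                const int* __restrict__ basisLocator,  // nbin + 1 entries
                                const int* __restrict__ basisID,       // significant basis function IDs per bin
                                const int* __restrict__ primLocator,   // one index range per basisID entry
                                const int* __restrict__ primID,        // significant primitive function IDs
                                double*    __restrict__ result)        // one value per packed grid point
{
    const int bin   = blockIdx.x;                      // one thread block per bin
    const int point = bin * blockDim.x + threadIdx.x;  // index into the packed grid point arrays

    if (!pointFlag[point]) return;                     // dummy threads do no work

    double acc = 0.0;

    // Significant basis functions of this bin: basisID[basisLocator[bin] .. basisLocator[bin+1])
    for (int b = basisLocator[bin]; b < basisLocator[bin + 1]; ++b) {
        const int mu = basisID[b];

        // Significant primitives of this basis function entry:
        // primID[primLocator[b] .. primLocator[b+1])
        for (int p = primLocator[b]; p < primLocator[b + 1]; ++p) {
            acc += evaluatePrimitiveStub(mu, primID[p], point);
        }
    }

    result[point] = acc;
}
```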

S2: Comparison of HF performance between the QUICK and GAMESS GPU versions

Table S1. Comparison of HF SCF and gradient times between the QUICK and GAMESS GPU versions

Molecule (atom number)   Basis set (function number)   Time for first SCF iteration (s)   Time for last SCF iteration (s)
                                                       QUICK        GAMESS                QUICK        GAMESS
Morphine (40)            6-31G (227)                   0.5          51.7                  0.4          11.2
                         6-31G* (353)                  0.8          121.1                 1.2          22.0
                         6-31G** (410)                 1.0          159.9                 1.6          28.0
Gly12 (87)               6-31G (517)                   1.0          115.4                 1.5          39.0
                         6-31G* (811)                  2.4          279.3                 4.6          82.4
                         6-31G** (925)                 3.1          374.3                 6.2          105.1
Valinomycin (168)        6-31G (882)                   4.9          931.7                 16.7         153.0
                         6-31G* (1350)                 11.4         1816.2                43.6         467.0
                         6-31G** (1620)                16.7         2611.4                63.0         511.0

Page 23: Parallel Implementation of Density Functional Theory

Parallel Implementation of Density Functional Theory Methods in the Quantum Interaction Computational Kernel ProgramMadushanka Manathunga†, Yipu Miao#, Dawei Mu%, Andreas W. Götz¶,* and Kenneth M. Merz, Jr.†,*

†Department of Chemistry and Department of Biochemistry and Molecular Biology, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824-1322, United States, #Facebook, 1 Hacker Way, Menlo Park, California 94025, United States, %National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, 1205 W Clark St, Urbana, IL 61801, United States, ¶San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, LaJolla, California 92093-0505, United States.

KEYWORDS: QUICK, Graphics Processing Units, Quantum Chemistry Software, DFT Package, Exchange Correlation

Abstract. We present the details of a GPU capable exchange correlation (XC) scheme integrated into the open source QUantumInteraction Computational Kernel (QUICK) program. Our implementation features an octree based numerical grid pointpartitioning scheme, GPU enabled grid pruning and basis/primitive function prescreening and fully GPU capable XC energy andgradient algorithms. Benchmarking against the CPU version demonstrated that the GPU implementation is capable of delivering animpressive performance while retaining excellent accuracy. For small to medium size protein/organic molecular systems, therealized speedups in double precision XC energy and gradient computation on a NVIDIA V100 GPU were 60 to 80-fold and 140 to780-fold respectively as compared to the serial CPU implementation. The acceleration gained in density functional theorycalculations from a single V100 GPU significantly exceeds that of a modern CPU with 40 cores running in parallel.

1. Introduction

Although graphics processing units (GPUs) were originallyintroduced for rendering computer graphics, they have becomeessential devices to enhance the performance of scientificapplications over the past decade. Examples of GPUaccelerated applications span from bioinformatics softwarethat help solving genetic mysteries in biology to data analysistools aiding gravitational wave detection in astrophysics.1,2

The availability of powerful general purpose GPUs at areasonable cost, convenient computing and programmingenvironments such as Compute Unified Device Architecture(CUDA)3 and especially, the fact that GPUs can performtrillions of floating point operations per second in combinationwith a high memory bandwidth, outperforming desktop centralprocessing units (CPUs), are the main reasons for this trend.

GPUs are also known to deliver outstanding performance intraditional computational chemistry applications, particularlyin classical molecular dynamics (MD) simulations4–12 and abinitio quantum chemical calculations.13-32 In the latter context,Hartree-Fock (HF)13–15,23,28–30,32 and post-HF energy andgradient implementations18,20,21,27,31 on GPUs have displayedmultifold speedups for molecular systems containing a few tolarge number of atoms. Nevertheless, a majority of the

computational chemistry community is unable to enjoy suchperformance benefits due to the unavailability of an open-source, user friendly, GPU enabled quantum chemistrysoftware. Towards this end, we have been developing aquantum chemical code named the QUantum InteractionComputational Kernel (QUICK) to fill this void.23,28 Asreported previously, QUICK is capable of efficientlycomputing HF energies and gradients. For instance, thespeedup realized for moderate size molecular systems on aKepler type GPU was about 10-20 times in comparison to asingle CPU core while retaining an excellent accuracy. Withhigh angular momentum basis functions (d and f functions),the realized speedup remained 10-18 fold. Such performancegain was primarily due to GPU accelerated electron repulsionintegral (ERI) calculations. In our ERI engine, integrals arecalculated using vertical and horizontal recurrence relationsalgorithms33,34 reported by Obara, Saika and Head-Gordon andPople. The integrals are calculated on the fly and the Fockmatrix is assembled on the GPU using an efficient direct self-consistent field (SCF) scheme. However, the accuracy of theHF method is insufficient, if not totally unsuitable, to studymany chemical problems; but having a GPU enabled HF codepaves the way towards an efficient post-HF or densityfunctional theory (DFT) package. In the present work, wehave undertaken the task of implementing the latter type of

1

Page 24: Parallel Implementation of Density Functional Theory

methods in QUICK. In fact, given the vast number of researcharticles published using DFT methods over the past decades,35

incorporation of such methods into our package was essential.

In the context of GPU parallelization of Gaussian based DFTcalculations, a few publications have appeared in theliterature.22,36,37 Among these, Yasuda’s work22 is the earliest.His exchange correlation (XC) quadrature scheme consisted ofseveral special features aimed at maximizing performance onGPU hardware. These include partitioning numerical gridpoints in 3-dimensional space, basis function prescreening andpreparation of basis function lists for grid point partitions.Only the electron densities and their gradients and matrixelements of the XC potential were calculated on the GPU. Thehardware available at the time limited his algorithm to singleprecision calculations; but the reported accuracy of benchmarktests was about 10-5 a.u. A similar algorithm has beenimplemented by Martinez and coworkers in their GPU capableTerachem software.36 In addition to grid point partitioning, thisalgorithm performs prescreening at the level of primitivefunctions and excludes certain partitions from the calculationbased on a sorting procedure. In both of the aboveimplementations, the value of the density functionals at thegrid points are calculated on the CPU. We note another DFTpackage where the XC scheme is GPU enabled, however, theERI calculations are performed on the CPU.37

The features of XC quadrature scheme reported in the currentwork include grid point partitioning using an octree algorithm,prescreening and grid pruning based on the value of primitiveGaussian functions and fully GPU enabled XC energy andgradient calculations where not only the electron density andits derivatives, but also the XC functional values are computedon the GPU. The next sections of this paper are organized asfollows. In section 2, we give an overview of the underlyingtheory of our XC scheme, which was originally documentedby Pople and coworkers.38 The details of the computationalimplementation are then presented in section 3. Here we firstdiscuss important aspects of data parallel GPU programming.The GPU version of the XC scheme is then implementedfollowing these considerations. Information regarding theparallel CPU implementation using the message passinginterface (MPI)39 is also presented. Section 4 is devoted tobenchmark tests and discussion. The tests include performancecomparisons between QUICK and the GAMESS GPUversion29,32,40 and accuracy and performance comparisonsbetween the QUICK GPU vs CPU versions. Finally, weconclude our discussion by exploring future directions forfurther improvement.

2. Theory

The Kohn-Sham formulation of DFT differs from the HFmethod by its treatment of the exchange and correlationcontributions to the electronic energy. The total electronicenergy ( E ) is given by,38

E=ET+EV + EJ+Exc ,(1)

where ET and EV stand for kinetic and electron-

nuclear interaction energies, EJ is the Coulomb self-

interaction of the electron density and E xc is the

remaining electronic exchange and correlation energies. Theelectron density ( ρ ) is a summation of α and βelectron densities ( ρα and ρβ respectively) that canbe expressed by choosing sets of orthonormal spin orbitals

ψ iα (i=1,…,nα) and ψ i

β (i=1, …, nβ) as

¿ψ iβ∨¿2

¿ψ iα∨¿2 , ρβ=∑

i

¿

ρα=∑i

¿

, (2)

where n and n are the number of occupied orbitals.

ET , EV and EJ can be written as,

ET=∑i

(ψ iα∨−1

2∇2∨ψ i

α )+∑i

n β

(ψ iβ∨−1

2∇2∨ψ i

β), (3)

Z A∫ ρ (r ) ¿r−r A∨¿−1 dr

EV =−∑A

nucl

¿, (4)

EJ=12∬ ρ (r1 )|r1−r2|

−1ρ(r2)d r 1dr2 .

(5)

Exc , within the generalized gradient approximation

(GGA), may be given by a functional f that depends onelectron densities and their gradient invariants,

E xc=∫ f (ρα , ρβ , γ αα , γ αβ , γ ββ )dr ,

(6)

¿∇ ρβ∨¿2

¿∇ ρα∨¿2 , γαβ=∇ ρα .∇ ρβ , γ ββ=¿γ αα=¿

. (7)

In practical computational implementations, one expressesmolecular orbitals as a linear combination of atomic orbitalsand ρα in equation (2) becomes,

ρα=∑μ

N

∑ν

N

∑i

(Cμiα )∗Cνi

α ϕ μϕ ν=∑μν

Pμνα ϕ μϕ ν .

(8)

Here ϕμ(μ=1, .., N ) are the atomic orbitals.

Furthermore, the gradient of ρα can be written as,

∇ ρα=∑μν

Pμνα ∇(ϕ μϕν ) .

(9)

2

Page 25: Parallel Implementation of Density Functional Theory

A similar expression can be written for ρβ . Substitution ofequations (8), (9) into (2) - (7) and into the energy equation (1)

and minimizing with respect to coefficients Cμiα , C νi

α

one obtains a series of algebraic equations similar to theconventional HF procedure. The resulting Fock-type matrix(hereafter Kohn-Sham matrix) for ρα is,

Fμνα =Hμν

core+J μν+FμνX Cα ,

(10)

where H μνcore is the one electron Hamiltonian matrix.

J μν is the Coulomb matrix which can be written in theconventional form as,

J μν=∑λσ

N

P λσ (μν∨λσ ) , Pλσ =P λσα +P λσ

β

(11)

The XC contribution to the Kohn-Sham matrix ( FμνX Cα ) is

given by,

FμνX Cα=∫ [ ∂ f

∂ ρα

ϕ μϕν+(2∂ f

∂ γ αα

∇ ρα+∂ f

∂ γ αβ

∇ ρβ) .∇ (ϕμϕ ν)]dr

, (12)

and a similar expression can be written for FμνX Cβ .

The gradient with respect to the position of nucleus A is,

∇ A E=∑μν

N

Pμν (∇A H μνcore)+1

2∑μνλσ

N

Pμν Pλσ ∇A (μν|λσ )−∑μν

N

W μν (∇ A Sμν )−2∑μ

'

∑ν

N

Pμνα ∫ [ ∂ f

∂ ρα

ϕ ν∇ϕ μ+X μν(2∂ f

∂ γ αα

∇ ρα∂ f

∂ γ αβ

∇ ρβ)]dr−2∑μ

'

∑ν

N

Pμνβ ∫ [ ∂ f

∂ ρβ

ϕ ν∇ϕ μ+X μν ( ∂ f∂ γαβ

∇ ρα+2∂ f

∂ γ ββ

∇ ρβ)] dr

. (13)

Where the energy weighted density matrix W μν is givenby,

W μν=∑i

ϵ iα Cμi

α C νiα +∑

i

ϵ iβ Cμi

β C νiβ

,

(14)

and the matrix element X μν is,

X μν=ϕν∇(∇ϕ μ)t+(∇ ϕμ)(∇ϕ μ)

t.

(15)

3

Page 26: Parallel Implementation of Density Functional Theory

A

B

Level 0

Level 1

Level 2

7967 Grid points9865 Grid points587 leaf nodes 221 leaf nodes

3092 basis functions6484 primitive functions

cc-pVDZ basis set

14675 basis functions117400 primitive functions

Fig. 1. Partitioning of numerical grid points of a water molecule using the octree algorithm. A. A cell containing all grid points (Level 0)is recursively subdivided into octets (Levels 1 and 2). B. Fully partitioned bins/leaf nodes of the same example containing all grid pointsfrom pruning in stage 1 (middle) undergoes primitive function based grid pruning. The resulting bins/leaf nodes and grid points are shownat the bottom right.

Note that for hybrid functionals, the energy in equation (1)will also include a contribution from the HF exchange energyand equations (10) and (13) will be adjusted accordingly. Dueto the complex form of the XC functionals f, the analyticalcalculation of the integrals required for the XC energy, XCpotential, and its gradients in equations (6), (12), and (13), isimpossible; hence, this is usually achieved through anumerical procedure. The key steps of such a procedureinvolve the formation of a numerical grid (also called XCquadrature grid) with quadrature weights assigned to each gridpoint, calculation of the electron density and the gradients ofthe density at each grid point, calculation of the value of thedensity functional and the derivatives of the functional,calculation of the XC energy and the contribution to Kohn-Sham matrix (called matrix elements of the XC potentialhereafter). In order to compute the nuclear gradients of the XCenergy, one must compute second derivatives of the basisfunctions, values of the two integral terms in equation (13) andadd them to the total gradient vector. Finally, due to theinvolvement of a quadrature weighing scheme in thenumerical procedure, the derivatives of the quadrature weightswith respect to nuclear displacements must be computed andadded to the total gradient.

3. Implementation

3.1 Key considerations in GPU programming

GPUs are ideal devices for data parallel computations, i.e.computations that can be performed on numerous dataelements simultaneously. They allow massive parallelizationin comparison to traditional CPU platforms, however, at theexpense of programming complexity and flexibility.Therefore, a proper understanding of the GPU architecture andthe memory hierarchy is essential for writing an efficient codethat exploits the full power of this hardware class. GPUsperform tasks according to a single instruction multiple data(SIMD) model, which simply means that they execute thesame instructions for a given chunk of data and the sameamount of work is expected to be performed for each piece ofdata. Hence, a programmer should organize the data andassign work to threads that are then processed by the GPU inbatches known as thread blocks. The number of threads in ablock is up to the programmer, however, there exists amaximum limit allowed by the GPU architecture. For instance,the NVIDIA Volta architecture permits a maximum of 1024threads per block. NVIDIA GPUs execute threads on

4

Page 27: Parallel Implementation of Density Functional Theory

streaming multiprocessors (SMs) in warps whose size, 32 forrecent architectures, is solely determined by the architecture.Each SM (for example the V100 GPU has 80 SMs, each with64 CUDA cores for a total of 5120 cores41) executes the sameset of instructions for all threads in a warp during a givenclock cycle. Therefore, it is essential to minimize thebranching in GPU codes (device kernels) to avoid instructiondivergence.

A GPU possesses its own physical memory that is distinctfrom the host (CPU accessible) memory. The main memorycalled global memory or dynamic random-access memory(DRAM) is relatively large (for example, 32 GB or 16 GB forthe V100 and 12 GB for Titan V) and accessible by all threadsin streaming multiprocessors. However, global memorytransactions suffer from relatively high memory latency. Asmall secondary type of GPU memory called shared memoryis also available on each SM. This type of memory transactionis faster but local, meaning that shared memory of a given SMis only accessible by threads within a thread block that iscurrently executing on this SM. In addition to these two typesof memory, threads in a warp have access to a certain numberof registers and this is useful to facilitate the communicationbetween threads of the same warp. GPUs also contain constantand texture memories, which are read-only and capable ofdelivering a higher performance than global memory inspecific applications. If threads in a warp read from the samememory location, constant memory is useful. Texture memoryis beneficial when the threads read from physically adjacentmemory locations. To maximize the performance of devicekernels careful handling of memory is essential. The keyconsiderations for engineering an efficient GPU code includeminimizing the warp divergence, minimizing frequent globalmemory transactions, maintaining a coherent global memoryaccess pattern by adjacent threads in a block (coalescedmemory access), minimizing random memory access patternsor simultaneously accessing a certain memory location bymultiple threads. Furthermore, constant and texture memoriesshould be employed where applicable. Our existing ERI anddirect SCF scheme in QUICK was developed in adherence tothis philosophy. As detailed below, we implement our XCscheme following the same practices.

3.2 Grid generation and pruning

The selected grid system for our work is the Standard Grid-1(SG-1) reported by Pople and coworkers.42 This grid consistsof 50 radial grid points and 194 angular grid points for eachatom of a molecule. The radial grid point generation isperformed using the Euler-Maclaurin scheme and as for theangular grid points, Lebedev grids are used. Following thegrid generation, we compute weights for each grid point basedon the scheme reported by Frisch et al.43 We then perform gridpruning in two stages. First, all points with weight less than athreshold (usually 10-10) are eliminated. The remaining pointsare pruned at a second stage based on the value of atomcentered basis/primitive functions at each grid point. Asdescribed in section 3.3, we make use of an octree algorithmfor this purpose.

3.3 Octree based grid pruning, preparation of basis andcontracted function lists

In our octree algorithm, the grid points in space are partitionedas follows. First, the lower and upper bounds of the grid is

determined and a single cell (node) containing all grid pointsis formed in 3-dimensional space (see Fig. 1A). This node isthen divided into eight child nodes. The child nodes whosegrid point count is greater than a user specified threshold arerecursively subdivided into octets until the point count of eachchild node falls below the threshold. The nodes which are notfurther divisible (leaf nodes, also termed bins below) areconsidered to be fully partitioned. We then go through eachgrid point of the leaf nodes and compute the values of thebasis and primitive functions at their positions. If abasis/primitive function satisfies the condition

|(ζ ij+∇ζ ij )|>τ , (16)

for a given grid point, it is considered to be significant for thatparticular point. We use = 10-10 as default threshold, whichleads to numerical results that are indistinguishable from thereference without pruning. For basis function-basedprescreening, j in equation (16) becomes μ and ζ ij

stands for the value of basis function ϕμ at grid point gi.

Similarly, for primitive function based prescreening, ζ ij

represents the value of cμp χ p at grid point gi, where

χ p is the pth primitive function of ϕμ and cμp isthe corresponding contraction coefficient. Once the basisfunction values at each grid point are computed, the points thatdo not have at least one significant basis function are omittedfrom the corresponding bin (see Fig. 1B). Furthermore, binswithout any grid points are also discarded. At this stage, listsof significant basis and primitive function IDs are prepared foreach remaining bin of grid points, significantly reducing thenumber of basis function values and derivatives that have to beevaluated at each grid point during the SCF and gradientcomputations. Using this algorithm, the number of primitiveGaussian basis function evaluations is reduced from 117400 to6484 for H2O with a cc-pVDZ basis set (Fig. 1B).

3.4 Grid point packing and kernel launch

Following the two-stage grid pruning and the preparation ofbasis and primitive function ID lists, the grid points areprepared (hereafter grid point packing) to be uploaded to theGPU. Here we add dummy points into each bin and set thetotal grid point count in each bin to a threshold value used inthe octree run. It is worth noting that the threshold we chooseis a multiple of the warp size 32, usually 256. By doing so, weare able to pack true grid points into one corner of a block ofan array (see Fig. S1A) and this helps us to minimize the warpdivergence when we launch GPU threads using the same valuefor the block size. More specifically, for subsequentcalculations (i.e. density, XC energy and XC gradients), wewill launch a total of 256*nbin threads (where nbin = number ofsignificant bins) where each thread block contains at least onetrue grid point and a set of dummy points. The threadslaunched for dummy points should not perform anycomputation and these are differentiated from true points by anassigned integer flag. As mentioned previously, the launchedthread blocks are submitted to streaming multiprocessors aswarps of 32 by the warp scheduler during runtime. Thisensures that at most one warp per thread block suffers fromwarp divergence and threads of the remaining warps wouldperform the same task. Once the grid points are packed andbasis and primitive function lists have been prepared, these are

5

Page 28: Parallel Implementation of Density Functional Theory

uploaded to the global memory of the GPU. Thecorresponding array pointers are stored in constant memory

and these are used in launching the kernel and the execution ofdevice kernels during density, SCF and gradient calculations.

Fig. 2. Flowchart depicting the workflow of a DFT geometry optimization calculation. Brown, blue and mixed color boxes indicate steps performed on CPU, GPU and mixed CPU/GPU respectively. OPT stands for geometry optimization.

3.5 Computing electron densities on the GPU

As apparent from eq. 8 and 9, the calculation of electrondensities and their gradients require looping over basis andprimitive functions at each grid point. This task is performedon the GPU and each thread working on a single grid pointonly loops through basis and primitive function lists assignedto the corresponding thread block. The retrieval of correct listsfrom global memory is achieved by using two locator arrays(see Fig. S1B). Note that we use the term “ID” when referringto basis/primitive functions and “array index” for arraylocations. The first, called basis function locator array, holdsthe array index ranges of the basis function ID array. Eachthread accesses the former based on their block index, obtainsthe corresponding array index range of the latter and picks thebasis function IDs. Second, the primitive function locatorarray, holds the array index ranges for accessing the primitivefunction ID array. Each thread picks elements from theprimitive function locator array using basis function arrayindices, obtains the relevant array index range of the latter andtakes the primitive function IDs. This retrieval strategy allowsus to maintain a coalesced memory access pattern.

3.6 Computing the XC energy and the matrix elements ofthe XC potential

As reported elsewhere,23,28 our existing SCF implementationassembles the Fock matrix and computes the energy on the

GPU. Therefore, our goal is to calculate the XC energy andassociated derivatives on the GPU, which highlights thenecessity to have density functionals implemented as devicekernels. Needless to say coding the numerous availablefunctionals is a cumbersome process and the best solution is tomake use of an existing density functional library.Nevertheless, to the best our knowledge, such an appropriateGPU capable library is still not available; but a few stableCPU based density functional libraries have been reported andused in many DFT packages.44–46 As discussed below, weselected one of these libraries and modified the source codefor execution on GPUs via CUDA to achieve our goal.

The chosen density functional library, named LIBXC46 is opensource and provides about 400 functionals that are based onthe local density approximation (LDA), the generalizedgradient approximation (GGA) and the meta-generalizedgradient approximation (MGGA). This library is wellorganized, has a user friendly interface and above all, thesource code is written in the C language. Such factors makeLIBXC easily re-engineerable as a GPU capable library withrelatively modest coding effort. In LIBXC, the functionals areorganized in separate source files and are handled by severalworkers (each assigned to LDA, GGA exchange, GGAcorrelation, MGGA exchange and MGGA correlationfunctional types) with the aid of a series of intermediate files.When the user provides the necessary input with a desiredfunctional name, an assigned functional ID is selected, and if

6

Page 29: Parallel Implementation of Density Functional Theory

necessary, parameters required by the workers to call or sendto functionals are obtained from the intermediate files. Thefunctionals are then called from the workers. To make use ofLIBXC in our GPU code, we initiate LIBXC and obtainfunctional ID and necessary intermediate parameters through adry run immediately after grid point packing but beforestarting the SCF procedure. In order to do so, we made severalminor modifications to the interface and intermediate files.The obtained information is then uploaded to the GPU. Wealso implemented device kernels for LDA and GGA workersto call functionals during device kernel execution. Thecorresponding functional source files were also modified withcompiler preprocessor directives and during the compilationtime, these are included in the CUDA source files andcompiled as device kernels. As reported previously, use ofcompiler directives for porting complex scientific applications

to GPUs is rewarding in terms of application performance andimplementation effort.47 Note that our current XCimplementation is not capable of computing kinetic energydensities and for this reason, we have not made any changes tothe MGGA workers and related source files in LIBXC.

During the SCF procedure, the device kernel versions ofLIBXC workers are called with pre-stored functional IDs andother parameters. The worker then calls the appropriatefunctional kernel. Here we have used C function pointersrather than conditional statements due to potentialperformance penalties introduced by the latter. Following thecomputation of XC energy density on grid points, we computethe matrix elements of the potential and update the Kohn-Sham matrix using the CUDA atomic add function. In the past,we employed atomic add in our direct SCF scheme and notedthat it is not a critical performance bottleneck.23,28

Table 1. Comparison of times between QUICK and GAMESS

Last SCF iteration (s) ERI gradient time (s) XC gradient time (s)

Molecule (atomcount)

Basis set (basisfunction count)

QUICK GAMESS QUICK GAMESS QUICK GAMESS

Morphine (40) 6-31G (227) 0.7 11.9 3.8 9.6 2.3 11.3

6-31G* (353) 1.7 23.9 12.9 60.7 3.8 19.6

6-31G** (410) 2.2 30.6 16.0 72.0 4.4 24.4

Gly12 (87) 6-31G (517) 1.8 43.5 8.4 23.6 1.7 112.9

6-31G* (811) 5.2 79.7 27.2 164.6 3.4 208.6

6-31G** (925) 7.0 94.4 34.9 201.4 3.6 246.9

Valinomycin (168) 6-31G (882) 18.9 159.8 85.1 183.2 10.4 686.6

6-31G* (1350) 40.7 334.5 243.0 975.7 18.7 1167.3

6-31G** (1620) 58.6 370.1 317.7 1291.5 19.3 1571.9

3.7 Computing XC energy nuclear gradients

Most of the implementation strategy described above for XCenergy and potential holds for our gradient algorithm. Morespecifically, this is a two-step process consisting of calculatingthe XC energy gradients and grid weight gradients. Whilecomputing the former involves a majority of steps discussedfor the energy calculation (i.e. computing values of the basisfunctions and derivatives and functional values and associatedderivatives), additionally, we compute second derivatives ofbasis functions. The device kernel performing this task issimilar to the one that calculates basis function values and

their gradients at a given grid point in the sense of accessingbasis and primitive function IDs. The grid weight gradientcalculation is only required for grid points whose weight is notequal to unity. Therefore, we prune our grid for the third timeremoving all points that do not meet this criterion. Theresulting grid points are packed without dummy points,uploaded to the GPU and the appropriate device kernel iscalled to compute the gradients. The calculated gradientcontribution from each thread is added to the total gradientvector using the CUDA atomic add function, consistent withour existing ERI gradient scheme.

Table 2. Comparison of the accuracy and performance of B3LYP calculations performed using QUICK CPU (serial) and GPU versions.

Second SCF iteration (s) XC (s) Total energy (au)

7

Page 30: Parallel Implementation of Density Functional Theory

Molecule(atom count)

Basis set (basisfunction count)

CPU GPU speedup CPU GPU speedup CPU GPUa

(H2O)32 (96) 6-311G (608) 153.9 2.9 54 44.5 0.5 82 -2444.98724960 -1.09E-07

6-311G** (992) 367.2 7.7 48 68.7 1.0 70 -2445.85046773 -1.12E-07

cc-pVDZ (800) 428.3 5.5 78 53.9 0.8 67 -2445.00925604 -1.09E-07

Taxol (110) 6-311G (940) 687.0 14.9 46 131.0 1.7 75 -2927.02879410 -6.20E-08

6-311G** (1453) 2314.4 44.2 52 223.2 3.4 66 -2927.98372502 -3.10E-08

cc-pVDZ (1160) 3537.4 38.4 92 198.7 3.1 63 -2927.38471110 -5.10E-08

Valinomycin(168)

6-311G (1284) 1451.1 36.9 39 225.9 2.9 78 -3793.79224954 -1.28E-07

6-311G** (2022) 4628.3 105.6 44 364.2 5.2 69 -3795.12274932 -1.10E-07

cc-pVDZ (1620) 6625.2 81.4 81 304.8 4.7 65 -3794.17889687 -9.20E-08

310-helixacetyl(ala)18NH2

(189)6-311G (1507) 1680.8 43.7 38 211.3 2.7 77 -4660.64280944 -7.40E-08

6-311G** (2356) 5914.4 134.4 44 356.1 5.2 68 -4662.21706599 -5.60E-08

cc-pVDZ (1885) 8789.8 108.9 81 309.0 4.8 65 -4661.20248530 -6.80E-08

α-helixacetyl(ala)18NH2

(189)6-311G (1507) 1881.6 52.3 36 257.2 3.3 78 -4660.97909929 -1.55E-07

6-311G** (2356) 6186.0 157.4 39 408.9 5.9 69 -4662.53814845 -2.10E-07

cc-pVDZ (1885) 8907.1 119.8 74 355.9 5.4 65 -4661.54162676 3.10E-08

β-strandacetyl(ala)18NH2

(189)6-311G (1507) 921.2 24.5 38 100.3 1.3 76 -4660.89912485 -6.90E-08

6-311G** (2356) 3451.9 82.1 42 187.8 2.8 67 -4662.48158890 -1.45E-07

cc-pVDZ (1885) 5094.9 63.4 80 160.6 2.5 63 -4661.46843955 2.09E-06

α-conotoxin mii(PDBID:

1M2C) (220)6-31G* (1964) 4627.1 108.3 43 373.6 4.9 77 -7142.85690266 -6.04E-06

6-31G** (2276) 5601.1 138.1 41 407.1 6.2 66 -7143.06700681 -5.34E-06

6-311G (1852) 3390.0 103.5 33 424.9 5.5 78 -7142.57919438 -7.30E-08

olestra (453) 6-31G* (3181) 5787.1 143.7 40 258.9 4.1 63 -7540.95874486 -2.14E-06

6-31G** (4015) 9881.3 277.6 36 310.1 4.5 70 -7541.35299607 -2.19E-06

6-311G (3109) 5511.5 164.3 34 276.0 3.7 74 -7540.71669525 -5.85E-07aGPU energy column shows the deviation of the energy with respect to the corresponding CPU calculation.

3.8 CPU analog of the GPU implementation

In order to perform a fair comparison with our GPU capableDFT implementation, we implemented a parallel CPU versionusing MPI. In this version, we make use of the standardLIBXC code base and omit the LIBXC dry run from ourworkflow (step 2 in Fig. 2). Furthermore, the grid generationand octree run are performed in serial but the weightcomputation, prescreening and preparation of basis andprimitive functions lists are all performed on in parallel.Moreover, the packing of grid points (step 3g in Fig. 2) isomitted. Computation of electron densities, XC energy,potential and gradients (steps 4-6 in Fig. 2) are performed in

parallel. More specifically, once the grid operations arecompleted, partitioned bins are distributed among slave CPUtasks. The corresponding grid weights, basis and primitivefunction lists are also broadcast. All CPU tasks retrieve basisand primitive function IDs using locators as discussed insection 3.5, but with the block index now replaced by the binindex. When computing the XC energy and matrix elements ofthe potential, each CPU task initializes LIBXC through astandard Fortran 90 interface and computes the requiredfunctional values. The slave CPU tasks then send thecomputed energy and Kohn-Sham matrix contributions to themaster CPU task. The implementation of the XC gradientscheme is very similar to the above.

8

Page 31: Parallel Implementation of Density Functional Theory

Table 3. Comparison of gradient times between QUICK CPU (serial) and GPU versions

ERI gradients (s) XC gradients (s)

Molecule (atom count)Basis set (basis function

count)CPU GPU speedup CPU GPU speedup

(H2O)32 (96) 6-31G (647) 218.3 5.3 41 1196.7 3.7 325

6-31G** (992) 1015.1 23.7 43 1261.3 5.4 235

Taxol (110) 6-31G (647) 1676.5 38.2 44 1893.4 9.4 201

6-31G** (1160) 7479.7 149.1 50 2094.2 14.7 142

Valinomycin (168) 6-31G (882) 3339.8 85.0 39 6156.7 19.4 318

6-31G** (1620) 14579.7 319.3 46 6457.1 27.0 239

310-helix acetyl(ala)18NH2

(189)6-31G (608) 4281.0 97.6 44 8611.6 20.6 418

6-31G** (992) 20009.4 395.9 51 8907.9 27.6 323

α-helix acetyl(ala)18NH2 (189) 6-31G (608) 4592.9 109.7 42 8777.5 23.0 381

6-31G** (992) 20114.2 422.8 48 9073.9 32.5 279

β-strand acetyl(ala)18NH2

(189)6-31G (608) 2307.7 45.7 51 8398.9 16.5 509

6-31G** (992) 11247.6 187.8 60 8601.1 22.4 383

α-conotoxin mii (PDBID:1M2C) (220)

6-31G (1268) 8371.3 232.6 36 13868.9 17.8 780

6-31G** (2276) 33798.1 813.3 42 14349.2 28.8 498

4. Benchmark Results and Discussion

4.1 Benchmarking the GPU implementation

We now present the benchmarking results of our DFTimplementation. First, the performance of the QUICK GPUversion is compared against the GPU version ofGAMESS.29,32,40 Then a similar comparison between theQUICK CPU and GPU versions is performed. For the formertest, we obtained a precompiled copy of GAMESS (the 17.09-r2-libcchem version) in a singularity image from the NVIDIAGPU cloud webpage. In order to make a fair comparison, theQUICK code was compiled using the comparable GNU andCUDA compilers with optimization level 2 (-O2). Note thatperformance tradeoff between running a CPU/GPUapplication native or through a container is reported to benegligible48,49 and we assume that the container overhead hasno significant impact on our GAMESS timings. The selectedtest cases include morphine, (glycine)12 and valinomycinBLYP50–52 gradient calculations using the 6-31G, 6-31G* and6-31G** basis sets. In GAMESS input files the SG-1 gridsystem and direct SCF were requested and the density matrixcutoff value was set to 10-8. Default values were used for allthe other options. All tests were carried out on a NVIDIA VoltaV100-SXM2 GPU (32 GB) accompanied by Intel Xeon (R)Gold 6138 CPU (2.10 GHz) with 190 GB memory.

The performance comparison between QUICK and GAMESSsuggests that the former is significantly faster than the latter(see Table 1). For ERI gradients, the observed speedup isroughly 2-5 fold suggesting that our ERI engine is moreefficient, in particular for basis sets with higher angularmomentum quantum numbers. Furthermore, QUICK displayshigher speedups (~60 or 80-fold in some cases) for the XC

gradients, however, this result has to be interpreted carefullysince the XC implementation of GAMESS appears to be notGPU capable. Note that in larger test cases, the GAMESS XCgradient time surpasses the ERI gradient time. Based on thetimes reported for last SCF iteration, the QUICK SCF schemealso outperforms GAMESS. Again, slow performance of thelatter may be partially attributed to its CPU based XCimplementation. A comparison of separate timings for ERI,XC energy and potential calculations during each SCFiteration was not possible since timings for these individualtasks were not reported by GAMESS. Nevertheless, HF singlepoint calculations carried out for the same systems (see TableS1) show that QUICK ERI calculations are much faster,consistent with the gradient results documented above.

We now compare the accuracy and performance between theQUICK CPU and GPU versions using a series of B3LYP53

energy calculations. As is apparent from Table 2, energiescomputed by the QUICK GPU version agree with the CPUversion up to 10-6 au or better. Furthermore, the GPU versionof the XC implementation delivers at least a 60-fold speedupover the serial CPU version. In both versions, less than 30% ofthe SCF step time is spent on computing the XC energy andcontribution to Kohn-Sham matrix. This suggests that ERIcalculation remains the performance bottleneck of QUICKDFT energy calculations. As a result, the speedup observed forthe SCF iteration in most cases is somewhat lower than theXC speedup.

In Table 3, we report a comparison of B3LYP/6-31G andB3LYP/6-31G** gradient calculation times between theQUICK CPU and GPU versions. The selected test cases are asubset of the molecular systems chosen for the SCF tests. Thekey observations from Table 3 include: 1) the GPU version

9

Page 32: Parallel Implementation of Density Functional Theory

delivers significant speedups for both ERI and XC gradientcalculations, 2) the speedup observed for ERI gradientsresemble the ones realized for energy calculation (~50 times),3) the speedup delivered for XC gradients is in few hundreds(~100-800 times). Similar speedups observed for ERI energyand gradient calculations can be explained by the fact thattheir implementations are similar. Indeed, computing thegradients of a given basis function involves the cost ofcomputing the value of higher angular momentum basisfunction. Understanding the details of the observed speedupsfor XC energy and gradient computations requires further

investigation. As mentioned in section 3.7, we implementedXC energy gradient and grid weight gradient calculations intwo separate kernels. Careful examination of kernel run timessuggests that the former consumes more than 90 % of thegradient time and gains a better speedup on a GPU. However,the structure of this kernel is similar to that of the XC energywith the exception of computing second derivatives.Therefore, the observed higher speedup must be due to theinefficiency of computing the second derivatives in our CPUimplementation.

Table 4. Comparison of accuracy and performance between LIBXC CPU (serial) and GPU versions for taxol (110 atoms) with the 6-31Gbasis set (647 contracted basis functions).

XC (s) Total energy (au)

FunctionalNumber of

SCF iterationsCPU GPU speedup CPU GPU

BLYP-Native 21 491.2 21.7 23 -2925.30524021 -3.65E-07

BLYP 21 486.6 21.7 22 -2925.30524021 -3.65E-07

B3LYP-Native 20 398.4 21.7 18 -2926.28595424 -2.41E-07

B3LYP 17 399.1 17.6 23 -2926.28595424 -2.41E-07

PBE054,55 17 397.6 17.6 23 -2923.03725363 -1.50E-08

B3PW9156 17 466.2 20.8 22 -2925.19518104 -1.10E-08

XVWN57,58 21 498.3 21.7 23 -2902.07631292 -2.70E-8

LP_A59 21 498.7 21.7 23 -2930.72703131 -2.80E-8

4.2 Performance of LIBXC functionals

As mentioned above, we have integrated the original andmodified LIBXC library versions into QUICK CPU and GPUcodes. It is important to document the performance of thesefunctionals within our XC scheme. In Table 4, we report aseries of taxol energy calculations at the DFT/6-31G level oftheory using various functionals. The selected functionalsinclude hand-coded (native) BLYP, B3LYP and their LIBXCversions and additionally, representative LDA, GGA andhybrid-GGA functionals. Comparison of native BLYP andB3LYP total energies against the corresponding LIBXCversions suggests that the latter are as accurate as the formeron both CPU and GPU platforms. Furthermore, the reportedCPU and GPU XC times remain very similar and theacceleration delivered by the GPU remains substantially thesame. In fact, the ca. 20-fold speedup realized on the GPUplatform is common for all LDA and GGA functionals. Notethat computing the value of density functionals is relativelyinexpensive with respect to other operations performed in XCenergy or gradient calculations and such a similar speedup isexpected. Finally, the fact that we maintain a similar speedupamong native and LIBXC functionals suggests that minormodifications performed to density functional source code inthe GPU version has not introduced performance penalties.

4.3 Comparison of GPU vs parallel CPU implementations

We now compare the performance between GPU and parallelCPU (MPI) implementations using taxol gradient calculationsat the B3LYP/6-31G** level of theory. The selected platformfor testing is a single computer node equipped with a NVIDIAV100-SXM2 GPU (32 GB) and 40 Intel Xeon (R) Gold 6148cores (2 sockets with 20 cores on each, 2.40 GHz clock speed)with 367 GB memory. The QUICK MPI version was compiledusing the GNU 7.2.0 compiler and MPICH 3.2.1. The GPUversion was compiled using the same GNU compiler withCUDA version 10.0.130. As is apparent from Fig. 3A, theDFT grid operation time for the same calculation is reduced byca. 4 times (14.8 vs 3.6 s) when going from single to 40 coresin the MPI version. The realized speedup can be mainlyattributed to parallelized grid weight computation, gridpruning and basis function prescreening. The correspondingV100 GPU time for the same task is 1.7 s with approximately1 s spent on the CPU based octree run. We tested the necessityof implementing the octree algorithm on the GPU bymeasuring the time to partition numerical grid points of largermolecular systems. The results suggest that our existing CPUimplementation is sufficiently efficient to handle suchsystems. For example, partitioning grid points of olestra (453atoms, ~4.4 million grid points) only took 3.0 s whilecomputing grid weights and prescreening consumed 12.7 and0.6 s respectively.

10

Page 33: Parallel Implementation of Density Functional Theory

1

8

64

512

4096

32768

1 2 4 8 16 32

Tim

e (s

)

Number of cores

1

8

64

512

4096

32768

1 2 4 8 16 32

Tim

e (s

)

Number of cores

1

8

64

512

4096

32768

1 2 4 8 16 32

Tim

e (s

)

Number of cores

1

8

64

512

4096

32768

1 2 4 8 16 32

Tim

e (s

)

Number of cores

Number of cores

Tim

e (s

)T

ime

(s)

Number of cores

Tim

e (s

)T

ime

(s)

Time on V100 GPU

A B

C D

Ideal MPI speedup

Ideal MPI time Ideal MPI timeSCF ERI time SCF XC time

Time on V100 GPU Time on V100 GPU

Ideal MPI time Ideal MPI timeERI gradient time XC gradient time

Time on V100 GPU Time on V100 GPU

DFT grid operation time

Total run timeIdeal MPI timeTime on V100 GPU

Fig. 3. Comparison of performance between GPU and MPI implementations for taxol (110 atoms) with B3LYP/6-31G** (1160 contractedbasis functions).

In Fig. 3B, we report the total time spent by the ERI and XCcalculations during 19 SCF iterations. The maximum speedupobserved for the former and latter in the MPI version are ca.23 and 33-fold respectively. The speedups from the GPUversion for the same tasks are ca. 65 and 62-fold. For ERIgradient calculations (see Fig. 3C), the speedup gained fromthe MPI version is similar to that of the SCF (ca. 22-fold) butsignificantly less than the corresponding GPU speedup (ca.50-fold). Furthermore, the MPI version delivers about 59-foldspeedup for the XC gradient calculations, however, this isbelow the acceleration achieved by a V100 GPU (ca. 142-fold). Finally, as evident from Fig. 3D, the total speedupgained from the GPU is two times over using 40 cores. It isalso important to comment on the linearity of our MPI plots inFig. 3B and 3C. Both ERI SCF and gradient calculations scalewell with the number of cores and almost achieve the idealMPI speedup since we distribute atomic shells among MPItasks (CPU cores). However, in the XC implementation, wedistribute bins containing varying number of grid points.Therefore, the workload is not optimally balanced and plots ofthe XC MPI timings display a non-regular speedup (not linear)unlike in the ERI case.

4.4 Performance of QUICK on different GPU architectures

For all the aforementioned benchmarks, we have used aNVIDIA V100 data center GPU. It is also necessary todocument how the current QUICK GPU implementationperforms on other available devices. In Table 5, we report theperformance of several important kernels on differentworkstation GPUs (V100, Titan V and P100) and a gamingGPU (RTX2080Ti). The selected test case for this purpose isthe taxol gradient calculation at the B3LYP/6-31G level oftheory. As is apparent from Table 5, all kernels display theirbest performance on the V100 GPU. Surprisingly, ERI kernelsshow their slowest times on the P100 rather than the gamingGPU, suggesting that their performance is not limited bydouble precision (FP64) operations. Note that the P100 GPUhas a higher FP64 capability in comparison to theRTX2080Ti.60,61 A detailed examination of the two kernelsrevealed that their performance is bound by high registerusage and memory bandwidth. In contrast, XC kernelsdisplayed their highest timings on the RTX2080Ti GPU andtherefore, these are bound by FP64 operations. Furtherexamination of the XC energy gradient kernel indicated that itsperformance is also limited by atomic operations. Overall,excellent performance is achieved both on data center GPUswith Pascal and Volta architectures and gaming GPUs withTuring architecture.

Table 5. Comparison of the performance of different kernels using different GPU architectures.

Total time for each task (s)

Task              V100 (32 GB)   Titan V (12 GB)   P100 (16 GB)   RTX2080Ti (11 GB)
SCF ERI (a)           80.4            107.6            159.2            125.2
SCF XC (a)            22.7             26.1             59.8            114.3
ERI gradients         38.0             40.7             66.9             55.1
XC gradients           6.9              7.9              8.5             13.4

(a) Reported time is the summation over 22 SCF cycles.

5. Conclusions

We have reported the details of the MPI parallel CPU and GPU enabled DFT implementation of the QUICK quantum chemical package. Our implementation includes features such as octree based grid point partitioning, GPU assisted grid pruning and basis/primitive function prescreening, and fully GPU enabled XC energy and gradient computations. A performance comparison with the GAMESS GPU version demonstrates that DFT calculations with QUICK are significantly faster. The accelerations observed for the XC energy and gradient computations in the QUICK GPU version with respect to the serial CPU version are impressive: the speedups realized on a V100 GPU for the former and latter are approximately 60 to 80-fold and 140 to 780-fold, respectively. Such speedups are out of reach for the MPI parallel CPU version even when 40 cores are used in parallel. The recommended device for the latest QUICK version (v20.03) is the NVIDIA V100 data center GPU, but the code also runs very well on gaming GPUs.

Profiling of the ERI and XC kernels has shown that there is room for further performance improvement. For the former, reimplementing the large ERI kernels as a set of smaller kernels may be a viable strategy; this is expected to reduce register pressure and enhance the performance of the ERI engine. The performance of the XC kernels on gaming GPUs may be improved by implementing a mixed precision scheme. Additionally, strategies such as storing the gradient vector in shared memory and latency hiding may help reduce the computational cost associated with atomic operations in the XC energy gradient kernel.
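As a rough illustration of the shared-memory strategy mentioned above, the sketch below first accumulates a block's contributions in shared memory and then issues only one global atomicAdd per gradient component per block, instead of one per grid point. The one-bin-per-block launch and the assumption that a bin contributes to a single atom are simplifications introduced for this example and do not reflect the actual QUICK kernel.

// Shared-memory reduction sketch: one block per octree bin, one global FP64
// atomic per component per block (requires compute capability >= 6.0).
// Example launch (illustrative): xcGradBinReduce<<<nBins, 256>>>(pointGrad, binStart, binAtom, atomGrad);
__global__ void xcGradBinReduce(const double* pointGrad, // 3 values per grid point (assumed precomputed)
                                const int*    binStart,  // first point index of each bin; length nBins+1
                                const int*    binAtom,   // atom each bin contributes to (simplification)
                                double*       atomGrad)  // 3 values per atom
{
    __shared__ double acc[3];
    if (threadIdx.x < 3) acc[threadIdx.x] = 0.0;
    __syncthreads();

    const int bin   = blockIdx.x;          // one block per bin
    const int first = binStart[bin];
    const int last  = binStart[bin + 1];

    for (int i = first + (int)threadIdx.x; i < last; i += blockDim.x) {
        // Shared-memory atomics are far cheaper than global FP64 atomics.
        atomicAdd(&acc[0], pointGrad[3 * i + 0]);
        atomicAdd(&acc[1], pointGrad[3 * i + 1]);
        atomicAdd(&acc[2], pointGrad[3 * i + 2]);
    }
    __syncthreads();

    // A single global atomic per component per block finalizes the reduction.
    if (threadIdx.x < 3)
        atomicAdd(&atomGrad[3 * binAtom[bin] + threadIdx.x], acc[threadIdx.x]);
}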

Currently, we are integrating QUICK as a library into the AMBER molecular dynamics package62 to enable fully GPU enabled quantum mechanics/molecular mechanics (QM/MM) simulations. Finally, QUICK version 20.03 can be freely downloaded from http://www.merzgroup.org/quick.html under the Mozilla public license.

ASSOCIATED CONTENT

Supporting text and Cartesian coordinates of the test molecules. This material is available free of charge via the Internet at http://pubs.acs.org.

ACKNOWLEDGMENT

We thank Jonathan Lefman and Peng Wang from NVIDIA for their useful comments on technical aspects of our GPU code. M.M. and K.M. are grateful to the Department of Chemistry and Biochemistry and the high performance computer center (iCER HPCC) at Michigan State University. M.M. and A.G. thank the San Diego Supercomputer Center for granted computer time. This research was supported by National Science Foundation grant OAC-1835144. This work also used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation (grant number ACI-1053575; resources at the San Diego Supercomputer Center through award TG-CHE130010 to A.G.).

References

(1) Nobile, M. S.; Cazzaniga, P.; Tangherloni, A.; Besozzi, D. Graphics Processing Units in Bioinformatics, Computational Biology and Systems Biology. Brief. Bioinform. 2017, 18, 870–885.

(2) NVIDIA. Seeing Gravity in Real-Time With Deep Learning. https://images.nvidia.com/content/pdf/ncsa-gravity-group-success-story.pdf (accessed Feb 25, 2020).

(3) NVIDIA. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (accessed Feb 25, 2020).

(4) Friedrichs, M. S.; Eastman, P.; Vaidyanathan, V.; Houston, M.; Legrand, S.; Beberg, A. L.; Ensign, D. L.; Bruns, C. M.; Pande, V. S. Accelerating Molecular Dynamic Simulation on Graphics Processing Units. J. Comput. Chem. 2009, 30, 864–872.

(5) Götz, A. W.; Williamson, M. J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R. C. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born. J. Chem. Theory Comput. 2012, 8, 1542–1555.

(6) Salomon-Ferrer, R.; Götz, A. W.; Poole, D.; Le Grand, S.; Walker, R. C. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald. J. Chem. Theory Comput. 2013, 9, 3878–3888.

(7) Biagini, T.; Petrizzelli, F.; Truglio, M.; Cespa, R.; Barbieri, A.; Capocefalo, D.; Castellana, S.; Tevy, M. F.; Carella, M.; Mazza, T. Are Gaming-Enabled Graphic Processing Unit Cards Convenient for Molecular Dynamics Simulation? Evol. Bioinforma. 2019, 15, 1–3.

(8) Lee, T. S.; Cerutti, D. S.; Mermelstein, D.; Lin, C.; Legrand, S.; Giese, T. J.; Roitberg, A.; Case, D. A.; Walker, R. C.; York, D. M. GPU-Accelerated Molecular Dynamics and Free Energy Methods in Amber18: Performance Enhancements and New Features. J. Chem. Inf. Model. 2018, 58, 2043–2050.

(9) Harvey, M. J.; Giupponi, G.; De Fabritiis, G. ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale. J. Chem. Theory Comput. 2009, 5, 1632–1639.
