
Page 1

Porting the ONETEP Linear-Scaling Quantum Chemistry Code to Massively Parallel Heterogeneous Architectures

Karl Wilkinson

Skylaris Group, University of Southampton

Page 2

Outline

1.  Computational Chemistry

2.  Heterogeneous Architectures and Programming models

3.  Order-N Electronic Total Energy Package (ONETEP)

4.  GPU Code Developments and Performance

5.  Hybrid MPI-OpenMP Parallelism

6.  Future Work

7.  Acknowledgements

2

Page 3

COMPUTATIONAL CHEMISTRY

3

Page 4

Computational Chemistry

§  Trade-offs:
–  Accuracy
–  System size
–  Level of sampling

§  Drug design
–  Lead structure design
–  Protein–lead structure interactions

§  Materials
–  Epoxidised linseed oil – ship hulls

§  Catalysis

§  Energy research
–  Enzyme–biomass interactions

Accuracy: low = molecular mechanics, high = ab initio
System size: low = tens of atoms, high = millions of atoms
Sampling: low = single point energy, high = free energy surface

4

Page 5

HETEROGENEOUS ARCHITECTURES AND PROGRAMMING MODELS

5

Page 6

Heterogeneous Compute Architectures

§  Heterogeneous: multiple distinct types of processors
–  More than 10% of the top 500 supercomputers use some form of accelerator
–  Typically Central Processing Units (CPUs) coupled with Graphics Processing Units (GPUs)
–  Common in high-powered workstations
–  Intel Xeon Phis emerging

§  Development considerations
–  Generally need to adapt code to take advantage of the architecture
–  A single code will contain multiple algorithms – execute each on the most suitable processing unit
–  May result in new bottlenecks (PCIe data transfer)

6

Page 7

Parallelism in Heterogeneous Machines

§  Multicore CPUs and GPU-based accelerator boards
–  Node: CPU sockets plus GPU boards attached via PCIe
–  8–16 CPU cores per node
–  O(10) streaming multiprocessors per GPU, each with 32–192 CUDA cores (SP/DP)
–  O(100–10,000) nodes per machine

§  Titan:
–  18,688 nodes
–  18,688 CPUs (hexadecacore): 299,008 CPU cores
–  18,688 K20X GPUs (14 multiprocessors each): 261,632 multiprocessors
–  560,640 total cores

7

Page 8

Memory in Heterogeneous Machines

[Figure: memory hierarchy of a heterogeneous node – core caches and system RAM on each CPU socket (sockets linked by QPI), with global GPU memory and per-multiprocessor memory on each GPU; the host–device link is the typical data transfer bottleneck]

§  Complex memory hierarchy

§  GPUs designed to hide the movement of data on the GPU itself

§  Major issues:
–  Host–device data transfer
–  CPU speed for data-parallel execution
–  GPU speed for serial execution

§  Hardware manufacturers are addressing these issues: uncertainty over future hardware
–  CPUs becoming more like GPUs (increasing core count)
–  GPUs gaining CPU-like functionality (Project Denver)
–  Hybrid “system-on-chip” models

8

Page 9

Motivation for choosing OpenACC

§  Hardware considerations
–  Multiple hardware models already available
–  Uncertainty over future hardware models

§  Development considerations
–  Complexity of hardware – the average developer is a domain specialist
–  Porting existing code rather than writing from scratch
–  Avoid changing the original version of the code
–  Avoid mixing languages
–  Avoid reinventing the wheel
–  Ensure maintainability of the code

§  Solutions
–  Ensure portability and future-proofing
–  Abstractions help to protect the developer
–  Pragmas and directives (see the sketch after this slide)

PGI Accelerator

OpenACC

9
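A minimal illustration of the directive-based approach referred to above, using OpenACC-style syntax (array and routine names are assumptions, not ONETEP code): the loop body stays as ordinary Fortran, and the directives ask the compiler to generate the GPU kernel and manage the data movement.

! Minimal OpenACC-style sketch (assumed names, not ONETEP code)
subroutine scale_boxes(n, boxes, scalefac)
  implicit none
  integer, intent(in)         :: n
  real(kind=8), intent(inout) :: boxes(n)
  real(kind=8), intent(in)    :: scalefac
  integer :: i
  !$acc data copy(boxes)          ! move the array to the GPU once
  !$acc kernels                   ! compiler generates the device kernel
  do i = 1, n
     boxes(i) = scalefac * boxes(i)   ! unchanged Fortran loop body
  end do
  !$acc end kernels
  !$acc end data
end subroutine scale_boxes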

Page 10

ORDER-N ELECTRONIC TOTAL ENERGY PACKAGE (ONETEP)

10

Page 11

ONETEP

§  Commercial code, distributed by Accelrys as part of Materials Studio

§  Actively developed by a number of research groups

–  Central SVN repository

§  Standard Fortran 95

§  233,032 lines of code

§  Parallel design from the outset – Message Passing Interface (MPI)

–  Aiming to extend beyond MPI parallelism

11

Page 12

ONETEP

§  Highly accurate linear-scaling quantum chemistry package
–  Linear in the number of atoms
–  Linear in the number of cores
–  While maintaining the number of iterations

§  Wide range of functionality
–  Applicable to a wide range of problems
–  Used for materials science and biological systems

§  Scales to tens of thousands of atoms

§  Runs on hundreds of cores using MPI

12

Page 13

ONETEP – Brief Theory

§  Utilizes a reformulation of Density Functional Theory in terms of the one-particle density matrix
–  Broad strokes: the key parameter is the electron density
–  Electrons exist within atomic orbitals; orbitals are represented by mathematical functions (φ)

§  Scaling with the number of atoms is achieved through exploitation of the “nearsightedness” of electrons in many-atom systems
–  Localized density kernel, K, and localized functions (φ) – the expansion is written out below
–  Not sums over all possible φα φβ pairs
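For reference, the expansion referred to in the last bullet is the standard linear-scaling DFT form used by ONETEP (the equation itself is not printed on the slide):

% Density matrix expanded in localized functions \phi with a sparse kernel K
\rho(\mathbf{r},\mathbf{r}') = \sum_{\alpha\beta} \phi_\alpha(\mathbf{r})\, K^{\alpha\beta}\, \phi^{*}_\beta(\mathbf{r}')
% Nearsightedness: K and the \phi's are truncated beyond cutoff radii, so the sum
% runs only over overlapping pairs and the cost grows linearly with the atom count.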

13

Page 14

ONETEP – Methodology

§  Iterative process –  Optimize the NGWFs (φ(r)) –  Optimize the density kernel (K)

§  Strictly localized non-orthogonal generalized Wannier functions (NGWFs) (φ(r)) used to represent atomic orbitals

14

Page 15

ONETEP – Methodology §  NGWFs are expressed on a real space grid

–  Technically: psinc basis set, equivalent to a plane wave basis set

[Figure: simulation cell (N1 and N2 directions) containing an FFT box around a localized function – operating within the FFT box rather than the whole cell is computationally efficient]

§  Localized operations performed within FFT boxes.

§  FFT box – 3x diameter of largest localized function

§  Parallel scaling achieved through decomposition of data across processors

§  Three types of data: atomic, simulation cell, sparse matrix

15

Page 16

ONETEP – Data distribution and movement

§  Atomic (FFT box) operations produce data for simulation cell operations and vice versa

§  Memory efficient data abstraction required

16

Page 17

ONETEP – Data distribution and movement

§  The parallelepiped (PPD) is the basic unit of data used in this process

§  Data contiguity within PPDs is inconsistent relative to other representations

§  A natural barrier when choosing the initial areas of code to accelerate

17

Page 18

ONETEP – Profiling

§  Small chemical system by ONETEP standards (Tennis ball dimer: 181 atoms)

§  Basic calculation

§  Representative level of accuracy

§  Most common density kernel optimization scheme

§  Single iteration of the NGWF optimization (over-represents the NGWF gradient step)

18

Page 19

CODE DEVELOPMENTS

19

Page 20

Scope of the Developments

§  Focus on atom localized operations: FFT boxes

§  Focus on core algorithms that will benefit the full range of functionality

§  Maintain data structures and current communications schemes

§  Minimize disruption to rest of code

Submitted to Journal of Computational Chemistry

20

Page 21

Naive Implementation

§  Simple switch from a CPU FFT library (MKL/FFTW/ACML) to a GPU library (CUFFT)

§  No consideration given to data transfer optimization

§  Little/no benefit

§  Extend to other areas of code…

21

Page 22

Implementation Notes

§  Hybrid PGI Accelerator/CUDA Fortran

–  PGI for Kernels

–  CUDA Fortran for data transfers

§  CUFFT library

§  Very straightforward code, no complex use of pragmas.

§  PGI Accelerator ≃ OpenACC

§  Asynchronous data transfers by default in “pure” PGI Accelerator
–  The hybrid model breaks this (see the sketch after this slide)

22
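A hedged sketch of the hybrid scheme described on this slide (array and routine names are assumptions, not ONETEP code): CUDA Fortran declares the device array and performs the explicit host–device copies, while a directive generates the kernel that operates on the device-resident data. It relies on PGI's interoperability between CUDA Fortran device data and directive regions.

! Hybrid CUDA Fortran + directive sketch (assumed names, not ONETEP code)
subroutine scale_fftbox(box_host, n, scalefac)
  use cudafor
  implicit none
  integer, intent(in)            :: n
  complex(kind=8), intent(inout) :: box_host(n)
  real(kind=8), intent(in)       :: scalefac
  complex(kind=8), device, allocatable :: box_dev(:)   ! CUDA Fortran device array
  integer :: i
  allocate(box_dev(n))
  box_dev = box_host            ! explicit host-to-device transfer (CUDA Fortran)
  !$acc kernels                 ! directive-generated kernel on the device data
  do i = 1, n
     box_dev(i) = scalefac * box_dev(i)
  end do
  !$acc end kernels
  box_host = box_dev            ! explicit device-to-host transfer
  deallocate(box_dev)
end subroutine scale_fftbox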

Page 23

Data Transfer Implementation

23

Page 24

Charge Density (36% of runtime)

Diagonal elements of the one-particle density matrix: n(r) = ρ(r, r) (written out in full after the step list).

1.  Extract NGWFs

2.  Populate FFT boxes

3.  Transform to reciprocal space

4.  Up-sample to fine grid (avoid aliasing errors)

5.  Transform to real space

6.  Calculate product and sum

7.  Deposit charge density data in simulation cell
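For reference, the quantity assembled by these steps is the diagonal of the density matrix expanded in NGWFs (standard ONETEP expression; not printed on the slide):

n(\mathbf{r}) = \rho(\mathbf{r},\mathbf{r}) = \sum_{\alpha\beta} K^{\alpha\beta}\, \phi_\alpha(\mathbf{r})\, \phi^{*}_\beta(\mathbf{r})
% steps 2-6 form the \phi_\alpha \phi_\beta products on the fine grid of each FFT box;
% step 7 accumulates the contributions into the simulation-cell grid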

24

Page 25

Charge Density (36% of runtime)

1.  Extract NGWFs

2.  Populate FFT boxes

3.  Transform to reciprocal space

!Data transfer to GPU not shown
!$acc data region
!$acc region
coarse_work(1:n1,1:n2,1:n3) = scalefac * cmplx( &
     fftbox_batch(:,:,:,is,3,batch_count), &
     fftbox_batch(:,:,:,is,4,batch_count), kind=DP)
!$acc end region
! Fourier transform to reciprocal space on coarse grid
call cufftExec(cufftplan_coarse, planType, coarse_work, &
     coarse_work, CUFFT_FORWARD)

25

Page 26

Charge Density (36% of runtime)

4.  Up-sample to fine grid (avoid aliasing errors)

•  Populate the corners of a cube (sketched after this slide)

•  Typical dimensions of the coarse grid: 75–125 (odd numbers); double this for the fine grid

[Figure: coarse array mapped into the corners of the fine array]

26
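A hedged sketch of the corner-populating step, under the assumption that it is the usual reciprocal-space zero-padding used for Fourier interpolation; the names and the odd-dimension index mapping are illustrative, not ONETEP's routine.

! Fourier up-sampling sketch (assumed names): copy the coarse reciprocal-space
! data into the corners of a zero-filled fine array of twice the size.
subroutine upsample_corners(coarse, n1, n2, n3, fine)
  implicit none
  integer, intent(in)          :: n1, n2, n3
  complex(kind=8), intent(in)  :: coarse(n1, n2, n3)
  complex(kind=8), intent(out) :: fine(2*n1, 2*n2, 2*n3)
  integer :: i, j, k
  fine = (0.0d0, 0.0d0)                       ! zero-pad the high frequencies
  do k = 1, n3
     do j = 1, n2
        do i = 1, n1
           fine(fmap(i, n1), fmap(j, n2), fmap(k, n3)) = coarse(i, j, k)
        end do
     end do
  end do
contains
  pure integer function fmap(i, n)            ! coarse index -> fine-grid index
    integer, intent(in) :: i, n
    if (i <= (n + 1) / 2) then
       fmap = i                               ! non-negative frequencies stay in place
    else
       fmap = i + n                           ! negative frequencies move to the end
    end if
  end function fmap
end subroutine upsample_corners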

Page 27

5.  Transform to real space

6.  Calculate product and sum (if common atom)

7.  Deposit charge density data in simulation cell

27

Page 28

Charge Density Performance

§  Tested on Iridis3, Emerald and the NVIDIA PSG cluster

§  Detailed timings are blocking – no asynchronous kernel execution and data transfer

§  Data transfer stages (2 and 7) are problematic; a larger quantity of data in step 7

28

Page 29

Charge Density Performance

§  Clear shift in bottlenecks to data transfer stages

29

Page 30

Local Potential Integrals

1.  Extract NGWF
2.  Populate FFT box
3.  Transform to reciprocal space
4.  Up-sample to fine grid (avoid aliasing errors)
5.  Transform to real space
6.  Extract local potential FFT box from simulation cell
7.  Apply local potential
8.  Transform to reciprocal space
9.  Down-sample to coarse grid
10. Transform to real space
11. Extract PPDs that overlap with φα
12. Extract φα PPDs that overlap with φβ
13. Calculate the dot product of the φα and φβ PPDs

[Figure: data flow for the local potential integrals – φβ(r) and the local potential FFT box Vloc(r) are transferred from CPU to GPU (around steps 2 and 6); the GPU performs the FFTs, up-sampling, application of Vloc(r) and down-sampling to give Vloc φβ(r) on the coarse grid; the result is transferred back to the CPU (around step 11) for PPD extraction and evaluation of ∫ φα*(r) Vloc(r) φβ(r) dr]

30

Page 31

Performance – Bottlenecks

Relative cost of stages:
•  Local Potential Integrals – more data transfers
•  Kinetic integrals – only the coarse grid
•  NGWF gradient – partly accelerated; significantly more complex; data transfer at stage 4

31

Page 32

Performance Summary

§  Performance of 1 GPU vs 1 CPU core (unfair):

§  Total calculation time: –  1 GPU 2-3.5x faster than 1 CPU core, equivalent to 4 CPU cores

Time in seconds (speedup vs. one CPU core in parentheses):

                             Xeon E5620   Tesla M2050   Tesla M2090   Kepler K20
Charge Density                   1221.1   340.5 (3.6)   271.1 (4.5)   225.4 (5.4)
Local Potential Integrals         861.3   301.2 (2.9)   242.2 (3.6)   219.1 (3.9)
Kinetic integrals                 268.9   127.0 (2.1)   112.4 (2.4)   131.0 (2.1)
NGWF Gradient                    1120.8   636.1 (1.8)   521.5 (2.1)   509.1 (2.2)

32

Page 33

Scaling Performance

§  Good scaling up to 4 GPUs – seen for all areas of accelerated code

[Figure: stacked bars of total time (s) broken down into charge density, local potential integrals, kinetic integrals, NGWF gradient and other, for 1x/2x/4x M2090, 1x/2x/4x M2050, 1x/2x/4x K20 and 4 CPU cores]

33

Page 34

Understanding Poor Scaling

§  Neglected thread affinity
–  QuickPath Interconnect (QPI)

§  Not the only problem
–  Sheer size of data transfers

Total GPUs   GPUs per node   Nodes   GPUs on bus 0 (max. 3)   GPUs on bus 1 (max. 5)   Total calculation time (s)   Scaling vs. 1 GPU   % of ideal
 1            1               1       1                        0                        5655.9                        1.0                100
 8            8               1       3                        5                        1737.5                        3.3                 41
24            8               3       3                        5                         637.9                        8.9                 37
24            6               4       3                        3                         620.6                        9.1                 38
24            4               6       2                        2                         471.9                       12.0                 50
24            2              12       1                        1                         362.9                       15.6                 65

34

Page 35

Strong Scaling

§  Detailed scaling tests on Emerald (8 GPUs, 2 CPUs per node)

§  Comparing against a single node – Impressive results

§  Unfortunately not the full story… –  Data transfer on a node saturating available PCIe bandwidth?

[Figure: strong-scaling speedup vs. number of nodes (ideal, 495 atom cellulose nanofibril, 1717 atom beta amyloid fibril) and speedup vs. number of GPUs (up to 8)]

35

Page 36

Conclusions

§  Overall acceleration of 2–3x relative to a single CPU core

§  Equivalent performance to 4 CPU cores

§  Data transfer is a serious bottleneck
–  No asynchronous data transfer
–  Running in lock-step due to blocking MPI communications

§  Strong scaling is hit hard by limited PCIe bandwidth

§  Promising results; much more to do, with obvious first targets

36

Page 37

HYBRID MPI-OPENMP PARALLELISM

37

Page 38

Motivation

§  Hard limit of 1 atom per core

§  10 atoms per core optimal –  Communications-workload balance

§  Memory demands mean only 50% of cores are used on some machines (HECToR)

§  Scaling limited to hundreds of cores

38

§  Myself:
–  Charge density
–  Local potential integrals
–  Kinetic integrals
–  NGWF gradient

§  Dr. Nick Hine:
–  Sparse algebra
–  Projectors
–  Testing and modification for the IBM BlueGene/Q at the Hartree Centre

Page 39

Method

§  OpenMP threading is applied at the level above that parallelized by the GPU developments (sketched below)

39
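A hedged sketch of the hybrid scheme (names and the work-splitting are assumptions, not ONETEP code): MPI distributes the atoms/FFT boxes across processes, and within each process OpenMP threads the loop over the locally owned boxes.

program hybrid_sketch
  ! Hybrid MPI-OpenMP sketch (assumed names, illustration only)
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, nbox_total, nper, first, last, ibox
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  nbox_total = 181                        ! e.g. one box per atom (tennis ball dimer)
  nper  = nbox_total / nprocs             ! crude even split, for illustration only
  first = rank * nper + 1
  last  = merge(nbox_total, first + nper - 1, rank == nprocs - 1)
  !$omp parallel do schedule(dynamic)
  do ibox = first, last                   ! boxes owned by this MPI process
     call fake_fftbox_work(ibox)          ! stands in for populate/FFT/deposit steps
  end do
  !$omp end parallel do
  call MPI_Finalize(ierr)
contains
  subroutine fake_fftbox_work(ibox)
    integer, intent(in) :: ibox
    ! placeholder for the per-box work that the GPU developments parallelize
  end subroutine fake_fftbox_work
end program hybrid_sketch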

Page 40

Initial Results

40

Total time (% of optimal)

                    OpenMP threads
MPI processes          1       2       4       8      16
            1      100.0    77.2    51.8    31.2    15.4
            2       96.4    71.1    47.1    24.4
            4       90.7    64.2    37.5
            8       78.5    50.5
           16       59.6

Charge density (% of optimal)

                    OpenMP threads
MPI processes          1       2       4       8      16
            1      100.0    93.9    78.9    56.5    29.4
            2       96.6    85.4    69.8    41.7
            4       96.7    83.6    55.4
            8       86.3    65.0
           16       67.0

§  Charge density only

§  Tested on 16 core workstation

§  181 atom tennis ball dimer

Page 41

Latest Results

41

1728 atom Silicon Oxide

Results courtesy of Dr. Nick Hine. Calculations performed on Blue Joule at the Hartree Centre.

MPI processes   OpenMP threads   Time
Fixed core count
16              1                Out of memory
 8              2                Out of time (>3600 s)
 4              4                1045 s
 2              8                617 s
 1              16               673 s
OpenMP scaling
 4              1                2711 s
 4              4                1046 s
 4              8                784 s

Page 42

Latest Results

§  12,000 atom beta amyloid fibril; each MPI process uses 16 OpenMP threads

[Figure: speedup (1–4x) vs. total number of cores (4,000–16,000) for sparse products, row sums and FFT box operations, compared with ideal scaling]

42

Results courtesy of Dr. Nick Hine. Calculations performed on Blue Joule at the Hartree Centre.

Page 43

Summary

§  Significant performance increase
–  Scaling
–  Memory usage
–  Cores > atoms is now possible

§  Excellent scaling up to 4 OpenMP threads; memory-bandwidth limited on Blue Joule above 8 cores

§  HECToR performance promising

43

Page 44

FUTURE WORK

44

Page 45

Future Work

Improved GPU code
•  OpenACC
•  Asynchronous data transfer
•  Loop unrolling
•  Extend to other areas
•  GPU overloading
•  Batched 1D FFTs

Hybrid MPI-OpenMP code
•  Extend to “data parallelism”
•  Extend to other areas

Large scale ab initio molecular dynamics on machines such as:
•  TITAN (AMD CPU + NVIDIA K20X)
•  Stampede (Intel CPU + Intel Xeon Phi)
•  Mira (BlueGene/Q)

Algorithmic improvements
•  Dynamic/mixed precision
•  Increased truncation
•  Hybrid FFT/interpolation

Heterogeneous execution

45

Page 46

Acknowledgements

§  Chris-Kriton Skylaris –  University of Southampton

§  Nick Hine –  University of Cambridge

§  NVIDIA –  Mark Berger

–  Timothy Lanfear

§  PGI –  Mathew Colgrove

–  Pat Brooks

§  E-Infrastructure South consortium –  Istvan Reguly

–  Emerald

§  iSolutions –  Iridis3

§  EPSRC

46

Page 47

Precision

•  Optimized energies exceed the level of precision required for chemical accuracy (milliHartree)

•  Differences in calculated energies relative to the CPU code are comparable to the differences resulting from changing the number of MPI processes in the CPU code

47

Page 48

Dynamic Precision

•  GPUs have a larger number of single precision (SP) than double precision (DP) cores.

•  Other codes (GAMESS-UK and TeraChem) utilize dynamic precision.

•  The switch between SP and DP is determined by a convergence threshold (sketched below).

48
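An illustration of the switching criterion only (assumed names and threshold value; not ONETEP's or the other codes' implementation): the calculation runs in SP until the convergence measure drops below the chosen threshold, then continues in DP.

! Dynamic precision sketch (assumed names; illustration only)
logical function switch_to_double(delta_e, threshold)
  implicit none
  real(kind=8), intent(in) :: delta_e    ! convergence measure, e.g. energy change
  real(kind=8), intent(in) :: threshold  ! e.g. 1.0d-6 (one of the values tested on the next page)
  switch_to_double = (abs(delta_e) < threshold)
end function switch_to_double

The thresholds compared on the next page (1x10^-7 to 1x10^-5) would play the role of `threshold` here.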

Page 49

Dynamic Precision

[Figure: energy difference (Eh, log scale from 1E-12 to 1E+02) vs. iteration number (1–24) for CPU DP, GPU DP, and dynamic precision runs with switching thresholds of 1x10^-7, 1x10^-5, 5x10^-6 and 1x10^-6]

49