TRANSCRIPT
Porting the ONETEP Linear-Scaling Quantum Chemistry Code to Massively Parallel Heterogeneous Architectures
Karl Wilkinson
Skylaris Group, University of Southampton
Outline
1. Computational Chemistry
2. Heterogeneous Architectures and Programming models
3. Order-N Electronic Total Energy Package (ONETEP)
4. GPU Code Developments and Performance
5. Hybrid MPI-OpenMP Parallelism
6. Future Work
7. Acknowledgements
COMPUTATIONAL CHEMISTRY
Computational Chemistry
§ Trade-offs:
– Accuracy
– System size
– Level of sampling
§ Drug design
– Lead structure design
– Protein-lead structure interactions
§ Materials
– Epoxidised linseed oil for ship hulls
§ Catalysis
§ Energy research
– Enzyme-biomass interactions
[Figure: the three trade-off axes]
– Accuracy: low = molecular mechanics, high = ab initio
– System size: low = tens of atoms, high = millions of atoms
– Sampling: low = single-point energy, high = free-energy surface
HETEROGENEOUS ARCHITECTURES AND PROGRAMMING MODELS
Heterogeneous Compute Architectures
§ Heterogeneous: multiple distinct types of processors
– More than 10% of the top 500 supercomputers use some form of accelerator
– Typically Central Processing Units (CPUs) coupled with Graphics Processing Units (GPUs)
– Common in high-powered workstations
– Intel Xeon Phis emerging
§ Development considerations
– Generally need to adapt code to take advantage of the architecture
– A single code will contain multiple algorithms, each executed on the most suitable processing unit
– May result in new bottlenecks (PCIe data transfer)
[Images: multicore CPUs and GPU-based accelerator boards]
Parallelism in Heterogeneous Machines
[Figure: hierarchy of parallelism — O(100-10,000) nodes per machine; each node has CPU sockets (8-16 CPU cores) with PCIe-attached GPUs; each GPU has O(10) streaming multiprocessors, each with 32-192 CUDA cores (SP/DP)]
Titan:
– 18,688 nodes
– 18,688 CPUs (16 cores each): 299,008 CPU cores
– 18,688 K20X GPUs (14 multiprocessors each): 261,632 multiprocessors
– 560,640 total cores (CPU cores plus GPU multiprocessors)
Memory in Heterogeneous Machines
[Figure: memory hierarchy of a two-socket node — each CPU socket has its core caches and system RAM, the sockets are linked by QPI, and each PCIe-attached GPU has its own global GPU memory and multiprocessor memory; the host-device link is the typical data transfer bottleneck]
§ Complex memory hierarchy
§ GPUs designed to hide the movement of data on the GPU itself
§ Major issues:
– Host-device data transfer
– CPU speed for data-parallel execution
– GPU speed for serial execution
§ Hardware manufacturers are addressing these issues, creating uncertainty over future hardware:
– CPUs becoming more like GPUs (increasing core count)
– GPUs gaining CPU-like functionality (Project Denver)
– Hybrid "system-on-chip" models
Motivation for choosing OpenACC
§ Hardware considerations
– Multiple hardware models already available
– Uncertainty over future hardware models
§ Development considerations
– Complexity of hardware: the average developer is a domain specialist
– Porting existing code rather than writing from scratch
– Avoid changing the original version of the code
– Avoid mixing languages
– Avoid reinventing the wheel
– Ensure maintainability of code
§ Solutions
– Ensure portability and future-proofing
– Abstractions help to protect the developer
– Pragmas and directives: PGI Accelerator, OpenACC
ORDER-N ELECTRONIC TOTAL ENERGY PACKAGE (ONETEP)
ONETEP
§ Commercial code, distributed by Accelrys as part of Materials Studio
§ Actively developed by a number of research groups
– Central SVN repository
§ Standard Fortran 95
§ 233,032 lines of code
§ Parallel design from the onset
– Message Passing Interface (MPI)
– Aiming to extend beyond MPI parallelism
ONETEP
§ Highly accurate linear-scaling quantum chemistry package; cost scales linearly with:
– Number of atoms
– Number of cores
– Number of iterations maintained as the system grows
§ Wide range of functionality
– Applicable to a wide range of problems
– Used for materials science and biological systems
§ Scales to tens of thousands of atoms
§ Runs on hundreds of cores using MPI
ONETEP – Brief Theory
§ Utilizes a reformulation of Density Functional Theory in terms of the one-particle density matrix
– Broad strokes: the key parameter is the electron density
– Electrons exist within atomic orbitals; orbitals are represented by mathematical functions (φ)
§ Scaling with the number of atoms achieved through exploitation of the "nearsightedness" of electrons in many-atom systems
– Localized density kernel, K, and functions (φ)
– Not sums over all possible φα φβ pairs
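Concretely, the ONETEP literature writes the density matrix in the separable form below; truncating the kernel K beyond a spatial cutoff is what removes the sum over all φα φβ pairs:

\rho(\mathbf{r},\mathbf{r}') = \sum_{\alpha\beta} \phi_\alpha(\mathbf{r})\, K^{\alpha\beta}\, \phi_\beta^{*}(\mathbf{r}'), \qquad n(\mathbf{r}) = \rho(\mathbf{r},\mathbf{r})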
ONETEP – Methodology
§ Iterative process (a schematic of this two-level loop is sketched below)
– Optimize the NGWFs (φ(r))
– Optimize the density kernel (K)
§ Strictly localized non-orthogonal generalized Wannier functions (NGWFs) (φ(r)) used to represent atomic orbitals
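A schematic of the two-level loop, with dummy stand-in routines of my own (kernel_step and ngwf_step are not ONETEP's interfaces, and the energy updates are fake placeholders):

! Schematic of the two-level optimization: an inner loop over density
! kernel steps nested inside an outer loop over NGWF steps.
program two_level_loop
  implicit none
  integer :: ngwf_iter, kernel_iter
  real(8) :: energy, e_prev
  energy = 0.0d0
  do ngwf_iter = 1, 30                        ! outer: optimize the NGWFs
     do kernel_iter = 1, 10                   ! inner: optimize the kernel K
        call kernel_step(energy)
     end do
     e_prev = energy
     call ngwf_step(energy)
     if (abs(energy - e_prev) < 1.0d-8) exit  ! NGWF convergence test
  end do
contains
  subroutine kernel_step(e)                   ! placeholder kernel update
    real(8), intent(inout) :: e
    e = e - 1.0d-2 / ngwf_iter**2
  end subroutine
  subroutine ngwf_step(e)                     ! placeholder NGWF update
    real(8), intent(inout) :: e
    e = e - 1.0d-3 / ngwf_iter**2
  end subroutine
end program two_level_loop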
ONETEP – Methodology
§ NGWFs are expressed on a real-space grid
– Technically: a psinc basis set, equivalent to a plane wave basis set
[Figure: simulation cell (N1 and N2 directions) with a computationally efficient FFT box centered on atom A]
§ Localized operations performed within FFT boxes
§ FFT box: 3x the diameter of the largest localized function
§ Parallel scaling achieved through decomposition of data across processors
§ Three types of data:
– Atomic
– Simulation cell
– Sparse matrix
ONETEP – Data distribution and movement
§ Atomic (FFT box) operations produce data for simulation cell operations and vice versa
§ Memory-efficient data abstraction required
ONETEP – Data distribution and movement
§ The parallelepiped (PPD) is the basic unit of data used in this process
§ Data contiguity within PPDs is inconsistent relative to other representations (see the sketch below)
§ A natural barrier to initial areas of accelerated code
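To illustrate the contiguity problem, a minimal sketch of gathering one PPD out of an FFT box into contiguous storage (array names, extents, and the routine itself are hypothetical, not ONETEP's):

! Gather a p1 x p2 x p3 parallelepiped (PPD) starting at (o1,o2,o3)
! from a larger FFT box into a contiguous 1-D buffer. Only the innermost
! dimension is contiguous in memory, so the copy proceeds column by column.
subroutine gather_ppd(fftbox, n1, n2, n3, o1, o2, o3, p1, p2, p3, buf)
  implicit none
  integer, intent(in)  :: n1, n2, n3, o1, o2, o3, p1, p2, p3
  real(8), intent(in)  :: fftbox(n1, n2, n3)
  real(8), intent(out) :: buf(p1*p2*p3)
  integer :: j, k, base
  do k = 0, p3 - 1
     do j = 0, p2 - 1
        base = (k*p2 + j) * p1
        ! one contiguous run of p1 elements per (j,k) column
        buf(base+1 : base+p1) = fftbox(o1 : o1+p1-1, o2+j, o3+k)
     end do
  end do
end subroutine gather_ppd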
ONETEP – Profiling
§ Small chemical system by ONETEP standards (Tennis ball dimer: 181 atoms)
§ Basic calculation
§ Representative level of accuracy
§ Most common density kernel optimization scheme
§ Single iteration of NGWF optimization (over-represents the NGWF gradient step)
CODE DEVELOPMENTS
Scope of the Developments
§ Focus on atom-localized operations: FFT boxes
§ Focus on core algorithms that will benefit the full range of functionality
§ Maintain data structures and current communication schemes
§ Minimize disruption to the rest of the code
Submitted to the Journal of Computational Chemistry
Naive Implementation
§ Simple switch from a CPU FFT library (MKL/FFTW/ACML) to a GPU library (CUFFT)
§ No consideration given to data transfer optimization
§ Little/no benefit
§ Extend to other areas of code…
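The reason the naive swap gains little: every FFT is bracketed by a host-to-device and a device-to-host transfer across PCIe. A sketch of that pattern in CUDA Fortran, assuming PGI's cufft interface module (do_fft_on_gpu is an illustrative name, and the plan is assumed created beforehand with cufftPlan3d; this is not the ONETEP routine):

! Naive pattern: each FFT pays two PCIe transfers, so transfer time
! swamps the speedup of the transform itself.
subroutine do_fft_on_gpu(host_data, n1, n2, n3, plan)
  use cudafor
  use cufft
  implicit none
  integer, intent(in) :: n1, n2, n3
  integer, intent(in) :: plan                       ! made with cufftPlan3d
  complex(8), intent(inout) :: host_data(n1, n2, n3)
  complex(8), device, allocatable :: dev_data(:,:,:)
  integer :: ierr
  allocate(dev_data(n1, n2, n3))
  dev_data = host_data                              ! host -> device (PCIe)
  ierr = cufftExecZ2Z(plan, dev_data, dev_data, CUFFT_FORWARD)
  host_data = dev_data                              ! device -> host (PCIe)
  deallocate(dev_data)
end subroutine do_fft_on_gpu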
Implementation Notes
§ Hybrid PGI Accelerator/CUDA Fortran
– PGI Accelerator for kernels
– CUDA Fortran for data transfers
§ CUFFT library
§ Very straightforward code, no complex use of pragmas
§ PGI Accelerator ≃ OpenACC
§ Asynchronous data transfers by default in "pure" PGI Accelerator
– The hybrid model breaks this
Data Transfer Implementation
Charge Density (36% of runtime)
Diagonal elements of the one-particle density matrix: n(r) = ρ(r, r).
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Calculate product and sum
7. Deposit charge density data in simulation cell
Charge Density (36% of runtime)
1. Extract NGWFs
2. Populate FFT boxes
3. Transform to reciprocal space
! Data transfer to GPU not shown
!$acc data region
!$acc region
coarse_work(1:n1,1:n2,1:n3) = scalefac * cmplx( &
     fftbox_batch(:,:,:,is,3,batch_count), &
     fftbox_batch(:,:,:,is,4,batch_count), kind=DP)
!$acc end region
! Fourier transform to reciprocal space on coarse grid
call cufftExec(cufftplan_coarse, planType, coarse_work, &
     coarse_work, CUFFT_FORWARD)
Charge Density (36% of runtime)
4. Up-sample to fine grid (avoid aliasing errors)
• Populate the corners of a cube (see the sketch after step 7 below)
• Typical dimensions of the coarse grid: 75-125 points per side (odd numbers); double this for the fine grid
[Figure: coarse array data copied into the corners of the larger fine array]
5. Transform to real space
6. Calculate product and sum (if common atom)
7. Deposit charge density data in simulation cell
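A minimal sketch of the corner-population step for odd coarse dimensions (the routine and array names are hypothetical, not ONETEP's): along each axis, non-negative frequencies keep their indices and negative frequencies shift to the top of the fine array, leaving zeros in between.

! Up-sample a 3-D reciprocal-space array by zero-padding: the eight
! corners of the fine array receive the coarse data, everything else is 0.
! Assumes odd coarse dimension n and fine dimension m >= n (e.g. m = 2n).
subroutine upsample(coarse, fine, n, m)
  implicit none
  integer, intent(in) :: n, m
  complex(8), intent(in)  :: coarse(n, n, n)
  complex(8), intent(out) :: fine(m, m, m)
  integer :: i1, i2, i3
  fine = (0.0d0, 0.0d0)
  do i3 = 1, n
     do i2 = 1, n
        do i1 = 1, n
           fine(map(i1), map(i2), map(i3)) = coarse(i1, i2, i3)
        end do
     end do
  end do
contains
  integer function map(i)   ! coarse index -> fine index
    integer, intent(in) :: i
    if (i <= (n + 1) / 2) then
       map = i              ! non-negative frequencies stay in place
    else
       map = m - n + i      ! negative frequencies move to the top end
    end if
  end function map
end subroutine upsample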
Charge Density Performance
§ Tested on Iridis3, Emerald and the Nvidia PSG cluster
§ Detailed timings are blocking
– No asynchronous kernel execution and data transfer
§ Data transfer stages (2 and 7) problematic; a larger quantity of data moves in step 7
Charge Density Performance
§ Clear shift in bottlenecks to data transfer stages
Local Potential Integrals
1. Extract NGWF
2. Populate FFT box
3. Transform to reciprocal space
4. Up-sample to fine grid (avoid aliasing errors)
5. Transform to real space
6. Extract local potential FFT box from simulation cell
7. Apply local potential
8. Transform to reciprocal space
9. Down-sample to coarse grid
10. Transform to real space
11. Extract PPDs that overlap with φα
12. Extract φα PPDs that overlap with φβ
13. Calculate dot product of φα and φβ PPDs
[Figure: data-flow diagram for the local potential integral ∫ φα*(r) Vloc(r) φβ(r) dr — φβ(r) (step 2) and the Vloc(r) FFT box (step 6) are transferred from CPU to GPU; the FFTs, up-sampling, potential application and down-sampling (steps 3-10) run on the GPU, producing Vloc φβ in real and reciprocal space; the result is transferred from GPU to CPU for the PPD extractions and dot product (steps 11-13)]
Performance – Bottlenecks
Relative cost of stages:
• Local potential integrals: more data transfers
• Kinetic integrals: only the coarse grid
• NGWF gradient: partly accelerated; significantly more complex; data transfer at stage 4
Performance Summary
§ Performance of 1 GPU vs 1 CPU core (an unfair comparison):
§ Total calculation time:
– 1 GPU is 2-3.5x faster than 1 CPU core, equivalent to 4 CPU cores

Timings (speedup vs the CPU core in parentheses):

                           Xeon E5620   Tesla M2050   Tesla M2090   Kepler K20
Charge density             1221.1       340.5 (3.6)   271.1 (4.5)   225.4 (5.4)
Local potential integrals   861.3       301.2 (2.9)   242.2 (3.6)   219.1 (3.9)
Kinetic integrals           268.9       127.0 (2.1)   112.4 (2.4)   131.0 (2.1)
NGWF gradient              1120.8       636.1 (1.8)   521.5 (2.1)   509.1 (2.2)
[Figure: stacked bar chart of total time (s), broken down into charge density, local potential integrals, kinetic integrals, NGWF gradient and other, for 1x/2x/4x M2090, 1x/2x/4x M2050, 1x/2x/4x K20 and 4x CPU cores]
Scaling Performance
§ Good scaling up to 4 GPUs
– Seen for all areas of accelerated code
Understanding Poor Scaling
§ Neglected thread affinity
– QuickPath Interconnect (QPI)
§ Not the only problem
– Sheer size of data transfers
(A device-binding sketch follows the table below.)

Total GPUs | GPUs per node | Nodes | GPUs on bus 0 (max. 3) | GPUs on bus 1 (max. 5) | Total calculation time (s) | Scaling vs 1 GPU | % of ideal
 1         | 1             |  1    | 1                      | 0                      | 5655.9                     |  1.0             | 100
 8         | 8             |  1    | 3                      | 5                      | 1737.5                     |  3.3             |  41
24         | 8             |  3    | 3                      | 5                      |  637.9                     |  8.9             |  37
24         | 6             |  4    | 3                      | 3                      |  620.6                     |  9.1             |  38
24         | 4             |  6    | 2                      | 2                      |  471.9                     | 12.0             |  50
24         | 2             | 12    | 1                      | 1                      |  362.9                     | 15.6             |  65
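One way to respect the bus topology is to bind each MPI rank to a nearby GPU. A hedged sketch in CUDA Fortran using the MPI-3 shared-memory communicator (an illustration of the idea, not what the code described in this talk did):

! Pick a GPU per MPI rank so that ranks on the same node spread across
! the node's devices instead of contending for one PCIe bus.
program bind_gpu
  use mpi
  use cudafor
  implicit none
  integer :: ierr, local_comm, local_rank, ndev
  call MPI_Init(ierr)
  ! Ranks sharing a node end up in the same communicator
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)
  ierr = cudaGetDeviceCount(ndev)
  ierr = cudaSetDevice(mod(local_rank, ndev))   ! round-robin over devices
  call MPI_Finalize(ierr)
end program bind_gpu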
Strong Scaling
§ Detailed scaling tests on Emerald: 8 GPUs and 2 CPUs per node
§ Comparing against a single node
– Impressive results
§ Unfortunately not the full story…
– Is data transfer on a node saturating the available PCIe bandwidth?
[Figure: strong-scaling speedup vs number of nodes (ideal line shown) for a 495 atom cellulose nanofibril and a 1717 atom beta amyloid fibril; second panel: speedup vs number of GPUs, up to 8]
Conclusions
§ Overall acceleration by a factor of 2 to 3x relative to a single CPU core
§ Equivalent performance to 4 CPU cores
§ Data transfer is a serious bottleneck
– No asynchronous data transfer
– Running in lock-step due to blocking MPI communications
§ Strong scaling is hit hard by limited PCIe bandwidth
§ Promising results; much more to do, with an obvious first target
HYBRID MPI-OPENMP PARALLELISM
Motivation
§ Hard limit of 1 atom per core
§ 10 atoms per core is optimal
– Communications-workload balance
§ Memory demands mean only 50% of cores are used on some machines (HECToR)
§ Scaling limited to hundreds of cores
Method
§ Myself:
– Charge density
– Local potential integrals
– Kinetic integrals
– NGWF gradient
§ Dr. Nick Hine:
– Sparse algebra
– Projectors
– Testing and modification for the IBM BlueGene/Q at the Hartree Centre
§ OpenMP is applied at the level above that parallelized by the GPU developments (see the sketch below)
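A minimal sketch of the hybrid pattern, with hypothetical names (nboxes_local and process_fftbox are illustrative, not ONETEP's routines): each MPI rank owns a set of atoms, and OpenMP threads share that rank's loop over FFT-box operations.

! Each MPI rank owns a subset of atoms and hence of FFT boxes; OpenMP
! threads split that rank's loop over boxes. Dynamic scheduling absorbs
! the uneven cost of individual boxes.
subroutine process_local_boxes(nboxes_local)
  implicit none
  integer, intent(in) :: nboxes_local
  integer :: ibox
  !$omp parallel do schedule(dynamic)
  do ibox = 1, nboxes_local
     call process_fftbox(ibox)
  end do
  !$omp end parallel do
contains
  subroutine process_fftbox(ibox)   ! stub: populate box, FFT, apply operators
    integer, intent(in) :: ibox
  end subroutine process_fftbox
end subroutine process_local_boxes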
Initial Results
§ Charge density only
§ Tested on a 16-core workstation
§ 181 atom tennis ball dimer

Total time, % of optimal (MPI processes x OpenMP threads):

MPI \ OMP     1      2      4      8     16
  1       100.0   77.2   51.8   31.2   15.4
  2        96.4   71.1   47.1   24.4
  4        90.7   64.2   37.5
  8        78.5   50.5
 16        59.6

Charge density, % of optimal:

MPI \ OMP     1      2      4      8     16
  1       100.0   93.9   78.9   56.5   29.4
  2        96.6   85.4   69.8   41.7
  4        96.7   83.6   55.4
  8        86.3   65.0
 16        67.0
Latest Results
§ 1728 atom silicon oxide
§ Results courtesy of Dr. Nick Hine; calculations performed on Blue Joule at the Hartree Centre

MPI processes | OpenMP threads | Total time
Fixed core count (16):
16            |  1             | Out of memory
 8            |  2             | Out of time (>3600 s)
 4            |  4             | 1045 s
 2            |  8             |  617 s
 1            | 16             |  673 s
OpenMP scaling:
 4            |  1             | 2711 s
 4            |  4             | 1046 s
 4            |  8             |  784 s
Latest Results
[Figure: speedup vs total number of cores (4,000-16,000) for a 12,000 atom beta amyloid fibril, each MPI process using 16 OpenMP threads; curves for ideal scaling, sparse products, row sums, and box operations]
Results courtesy of Dr. Nick Hine; calculations performed on Blue Joule at the Hartree Centre.
Summary
§ Significant performance increase
– Scaling
– Memory usage
– More cores than atoms now possible
§ Excellent scaling up to 4 OpenMP threads; memory-bandwidth limited on Blue Joule above 8 cores
§ HECToR performance promising
FUTURE WORK
Future Work
Improved GPU code
• OpenACC
• Asynchronous data transfer
• Loop unrolling
• Extend to other areas
• GPU overloading
• Batched 1D FFTs
Hybrid MPI-OpenMP code
• Extend to "data parallelism"
• Extend to other areas
Large scale ab initio molecular dynamics on machines such as:
• Titan (AMD CPU + NVIDIA K20X)
• Stampede (Intel CPU + Intel Xeon Phi)
• Mira (BlueGene/Q)
Algorithmic improvements
• Dynamic/mixed precision
• Increased truncation
• Hybrid FFT/interpolation
Heterogeneous execution
Acknowledgements
§ Chris-Kriton Skylaris – University of Southampton
§ Nick Hine – University of Cambridge
§ NVIDIA – Mark Berger, Timothy Lanfear
§ PGI – Mathew Colgrove, Pat Brooks
§ E-Infrastructure South consortium – Istvan Reguly, Emerald
§ iSolutions – Iridis3
§ EPSRC
Precision
• Optimized energies exceed the level of precision required for chemical accuracy (milliHartree)
• Differences in calculated energies relative to the CPU code are comparable to the differences resulting from varying the number of MPI processes in the CPU code
Dynamic Precision
• GPUs have a larger number of single precision (SP) than double precision (DP) cores
• Other codes (GAMESS-UK and TeraChem) utilize dynamic precision
• The switch between SP and DP is determined by a convergence threshold (see the sketch below)
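A minimal sketch of the dynamic-precision idea under assumptions of my own (sp_iteration and dp_iteration are fake stand-ins, not how any of the named codes implement it): iterate in SP while the energy is still changing rapidly, then switch to DP once the change drops below a crossover threshold.

! Dynamic precision: run cheap single-precision iterations until the
! energy change falls below a crossover threshold, then finish in double.
program dynamic_precision
  implicit none
  real(8), parameter :: crossover = 5.0d-6   ! SP -> DP switch threshold (Eh)
  real(8), parameter :: converged = 1.0d-10  ! final DP convergence (Eh)
  real(8) :: e_old, e_new
  logical :: use_dp
  integer :: iter
  use_dp = .false.
  e_old = 0.0d0
  do iter = 1, 100
     if (use_dp) then
        e_new = dp_iteration(e_old)
     else
        e_new = sp_iteration(e_old)
     end if
     if (.not. use_dp .and. abs(e_new - e_old) < crossover) use_dp = .true.
     if (use_dp .and. abs(e_new - e_old) < converged) exit
     e_old = e_new
  end do
contains
  real(8) function sp_iteration(e)   ! stub: one SP step (fake update)
    real(8), intent(in) :: e
    sp_iteration = e - real(1.0e-3, 8) / iter**2
  end function
  real(8) function dp_iteration(e)   ! stub: one DP step (fake update)
    real(8), intent(in) :: e
    dp_iteration = e - 1.0d-5 / iter**2
  end function
end program dynamic_precision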
[Figure: energy difference (Eh, log scale, 1E-12 to 1E+02) vs iteration number (1-24) for CPU DP, GPU DP, and dynamic-precision runs with crossover thresholds 1x10^-7, 1x10^-5, 5x10^-6 and 1x10^-6]