VASP: A CASE STUDY FOR ACCELERATING PLANE WAVE DFT CODES
Presenters: Sarah Tariq and Przemyslaw Tredak
Authors: Jeroen Bedorf, Przemyslaw Tredak, Dusan Stosic, Arash Ashari, Paul Springer, Darko Stosic, Sarah Tariq, Paul Fleurat-Lessard and Anciaux Sedrakian (Ens-lyon, IFPEN), Maxwell Hutchinson (University of Chicago) and Michael Widom (CMU)
GPU VASP COLLABORATION
Collaborators
Project Scope
Minimization algorithms to calculate electronic ground state
— Blocked Davidson (ALGO = NORMAL & FAST)
— RMM-DIIS (ALGO = VERYFAST & FAST)
Earlier work
— Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)
— VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)
— Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)
VASP OVERVIEW
Atomic scale materials modeling from first principles
Simulate atoms (mostly solids/surfaces)
Liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
Solve many-body Schrödinger equation
Density Functional Theory (DFT): Kohn-Sham equations
Optionally add exact exchange using hybrid Hartree-Fock (HF) functionals
THEORY
Self-consistent Kohn-Sham system
— Self-consistency loop until convergence
— Compute Kohn-Sham potential $v_{KS}(\mathbf{r})$
— Solve Kohn-Sham eigenproblem
— Obtain electronic density $n(\mathbf{r})$
Kohn-Sham eigenproblem
— Diagonalize Hamiltonian matrix $H_{KS}$
— Problem: often $H_{KS}$ is very big
— Solution: Iterative matrix diagonalization schemes
— Blocked Davidson, RMM-DIIS
— Find lowest few eigenstates $\varphi_i$ of $H_{KS}$
[Flowchart: $n_0(\mathbf{r})$ → $v_{KS}(\mathbf{r})$ → $H_{KS}\,\varphi_i(\mathbf{r}) = E_i\,\varphi_i(\mathbf{r})$ → $n(\mathbf{r}) = \sum_i |\varphi_i(\mathbf{r})|^2$ → stop? (yes: end; no: back to $v_{KS}(\mathbf{r})$)]
SIMILARITIES IN PW DFT CODES
Rely heavily on math libraries BLAS and FFT
— Easily offloaded using cuBLAS and cuFFT
Don’t need to write a lot of specialized routines
— Focus is on keeping GPU busy, and reducing communication instead of optimizing kernels
TARGET WORKLOADS
Silica
— 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24)
— RMM-DIIS (ALGO = VERYFAST)
NiAl-MD
— Liquid metal molecular dynamics sample of Nickel-based superalloy
— 500 atoms, 9 chemical species
— Blocked Davidson (ALGO = NORMAL)
VERSION AND HARDWARE
The GPU port is on VASP version 5.2.12
Code accelerated includes RMM-DIIS and Blocked Davidson routines and also exact-exchange work from CMU
We have run the code on Fermi and Kepler boards
The code has been tested for functional correctness on more than 25 benchmarks
We present performance results on 2 benchmarks at the end of this presentation
OPTIMIZATION DETAILS
RUNTIME DISTRIBUTION FOR SILICA
[Bar chart: time in sec (axis 0 to 3500) for 1 K40 GPU + 1 IvyBridge core. Bars: optimized GPU port, original GPU port. Legend: CPU, Memcopy, Gemm, FFT, Other]
OUTLINE
Reduce communication
Port more work to the GPU
Optimize for small benchmarks
Batch work
Improve MPI scaling
REDUCE COMMUNICATION
PCIe Bus
K40: 288 GB/s theoretical peak memory bandwidth on chip
PCIe Gen3: 16 GB/s theoretical peak per direction
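To put the two quoted peaks side by side, here is a small host-side C++ sketch. The helper names are illustrative assumptions, not part of the slides or of VASP, and the numbers are best-case figures: real transfers achieve less than the theoretical peak and also pay a latency cost that this model ignores.

```cpp
#include <cassert>

// Theoretical peaks quoted on the slide, in GB/s (1 GB = 1e9 bytes).
constexpr double kK40MemoryGBs = 288.0;  // K40 device memory
constexpr double kPcieGen3GBs = 16.0;    // PCIe Gen3, per direction

// Best-case seconds to move `bytes` at `gb_per_s`; latency is ignored.
// Hypothetical helper, for illustration only.
constexpr double transfer_seconds(double bytes, double gb_per_s) {
    return bytes / (gb_per_s * 1e9);
}

// How many times faster device memory is than the PCIe link at peak.
constexpr double bandwidth_ratio() {
    return kK40MemoryGBs / kPcieGen3GBs;
}
```

Under these assumptions `bandwidth_ratio()` is 18, and `transfer_seconds(16e9, kPcieGen3GBs)` is 1.0: data that the GPU could stream from its own memory in a small fraction of a second needs at least a full second to cross the bus.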
REDUCE COMMUNICATION – EDDRM AND EDDIAG
Overlap transfers with compute by passing stream index into pipeline of FFT subroutines
[Timeline, before: FFT and Memcopy serialized on the default stream, with unnecessary idle time]
[Timeline, after: FFT and Memcopy overlapped across Stream 1, Stream 2, and Stream 3]
Much better GPU utilization – 40% speedup in EDDRM and 144% in EDDIAG!
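Why overlapping helps can be seen from an idealized two-stage timing model. The sketch below is host-side C++, not VASP code. The function names, and the assumption that every chunk needs the same copy time and the same compute time, are illustrative. It does not try to reproduce the speedups quoted on the slide.

```cpp
#include <algorithm>
#include <cassert>

// Idealized timing model (an assumption, not measured VASP behavior):
// `chunks` independent pieces of work, each needing a copy of
// `copy_ms` followed by a compute step of `compute_ms`.

// Default stream: every copy and every kernel runs back to back.
double serialized_ms(int chunks, double copy_ms, double compute_ms) {
    return chunks * (copy_ms + compute_ms);
}

// Multiple streams: the copy for chunk k+1 overlaps the compute for
// chunk k, so only the slower stage limits steady-state throughput.
double overlapped_ms(int chunks, double copy_ms, double compute_ms) {
    if (chunks <= 0) return 0.0;
    return copy_ms + compute_ms +
           (chunks - 1) * std::max(copy_ms, compute_ms);
}
```

With 4 chunks of 1 ms copy and 1 ms compute, `serialized_ms` gives 8 ms and `overlapped_ms` gives 5 ms under this model. The gain grows with the number of chunks and is largest when copy and compute times are balanced.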
REDUCE COMMUNICATION – EDDIAG
[Figure: timelines before and after]
REDUCE COMMUNICATION – FORCE AND STRESS
Memory copies taking more time than the kernel!
Port downstream CPU work to GPU
Remove unnecessary memory copies
When possible, initialize data on the GPU
Use streams to overlap computation and transfers
[Timeline, built up over several slides: FFT, Memcopy (HtoD, DtoH), CPU, and GPU rows against time. The HtoD and DtoH copies around each FFT and the downstream CPU work are removed step by step as the work moves to the GPU.]
[Profiler timelines: 117 ms for the original GPU version, 14 ms after the changes]
8.3x speedup over original GPU version
REDUCE COMMUNICATION – HIGH LEVEL RMM-DIIS PORT
Typical drop-in replacement may not work well for small CPU functions
[Diagram: three consecutive CPU functions. Replacing only one of them with a GPU version adds an HtoD and a DtoH copy around it. Slowdown!]
Porting more functions and keeping data on the GPU reduces communication and improves results!
[Diagram: all three functions on the GPU, with no copies in between]
High level RMM-DIIS port – 18% improvement!
BATCH AND STREAM WORK
BATCH WORK AND STREAM WORK
GPU is massively parallel
Need to launch sufficient work to saturate it
A single call to a zgemm of (50x50) * (50x50) only launches 2 blocks which fit on one SM
- Not sufficient to fully utilize the GPU!
Can launch multiple independent pieces of work simultaneously
BATCH WORK AND STREAM WORK
for(int i=0;i<N;i++)
    cublasZgemm();
[Chart: GPU utilization (0 to 100) over time. The GEMMs run one after another.]
STREAMED
for(int i=0;i<N;i++){
    cublasSetStream();
    cublasZgemm();
}
[Chart: GPU utilization (0 to 100) over time. The GEMMs run concurrently. Improved: GPU utilization. Not improved: kernel launch overhead.]
BATCHED
cublasZgemmBatched();
[Chart: GPU utilization (0 to 100) over time. A single zgemmBatched launch.]
BATCH WORK – INVERSE REAL-SPACE PROJECTION
Padding with 0 required to have same sizes of all gemms
[Diagram: data blocks of different sizes padded with zeros to a common size]
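The padding requirement follows from the batched interface: `cublasZgemmBatched` takes a single set of dimensions for the whole batch. The sketch below is a CPU stand-in showing that zero padding leaves the meaningful part of each product unchanged. The `Matrix` type and both function names are hypothetical, and real `double` values are used instead of complex ones to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major matrix, as BLAS expects. Illustrative type, not VASP's.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;  // element (r, c) is data[r + c * rows]
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& at(std::size_t r, std::size_t c) { return data[r + c * rows]; }
    double at(std::size_t r, std::size_t c) const { return data[r + c * rows]; }
};

// Copy `m` into the top-left corner of a zero-filled rows x cols matrix.
Matrix pad_with_zeros(const Matrix& m, std::size_t rows, std::size_t cols) {
    assert(rows >= m.rows && cols >= m.cols);
    Matrix out(rows, cols);
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            out.at(r, c) = m.at(r, c);
    return out;
}

// CPU stand-in for one batched call: C[i] = A[i] * B[i] for every i.
// As with the batched GPU call, all problems must share one size.
std::vector<Matrix> gemm_batched(const std::vector<Matrix>& a,
                                 const std::vector<Matrix>& b) {
    assert(a.size() == b.size());
    std::vector<Matrix> c;
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i].rows == a[0].rows && a[i].cols == a[0].cols);
        assert(b[i].rows == a[0].cols && b[i].cols == b[0].cols);
        Matrix ci(a[i].rows, b[i].cols);
        for (std::size_t col = 0; col < ci.cols; ++col)
            for (std::size_t k = 0; k < a[i].cols; ++k)
                for (std::size_t row = 0; row < ci.rows; ++row)
                    ci.at(row, col) += a[i].at(row, k) * b[i].at(k, col);
        c.push_back(ci);
    }
    return c;
}
```

A 1x1 problem padded to 2x2 can then share a batch with a genuine 2x2 problem. Its product appears in the top-left element and the padded entries stay zero, at the cost of some wasted arithmetic on the zeros.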
BATCH WORK - RPROMU
Problem: How to easily batch it?
Use more grid dimensions and extract i and j from blockIdx.y and blockIdx.z
Original code:
for i in 1..N
    for j in 1..M
        kernel<<<B,T,0,stream(i)>>>(…i,j);
Batched code:
dim3 blocks(B,M,N);
kernel<<<blocks,T>>>(…);
[Profiler timelines ("Result") for each version]
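The index recovery can be checked on the host by emulating the grid. This is a minimal sketch: `BlockIdx` is a stand-in for CUDA's built-in `blockIdx`, and the mapping (y selects j, z selects i) is an assumption read off the `dim3 blocks(B,M,N)` launch, since the slide does not show the kernel body.

```cpp
#include <cassert>
#include <vector>

// Host-side stand-in for CUDA's blockIdx (illustration only).
struct BlockIdx { unsigned x, y, z; };

// The loop indices recovered inside the kernel. With a grid of
// (B, M, N) blocks, y runs over the M values of j and z over the
// N values of i. The slide's loops are 1-based, hence the +1.
struct LoopIndices { unsigned i, j; };

LoopIndices decode(const BlockIdx& b) {
    return LoopIndices{b.z + 1, b.y + 1};
}

// Emulate one launch of a (B, M, N) grid and count how many blocks
// land on each (i, j) pair; every pair should receive exactly B.
std::vector<unsigned> count_blocks_per_pair(unsigned B, unsigned M, unsigned N) {
    std::vector<unsigned> counts(M * N, 0);
    for (unsigned z = 0; z < N; ++z)
        for (unsigned y = 0; y < M; ++y)
            for (unsigned x = 0; x < B; ++x) {
                LoopIndices idx = decode(BlockIdx{x, y, z});
                counts[(idx.i - 1) * M + (idx.j - 1)] += 1;
            }
    return counts;
}
```

One launch therefore covers the same N*M pieces of work as the N*M separate launches in the original loop. CUDA limits the y and z grid dimensions to 65535, so very large M or N would need a different mapping.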
STREAM WORK: GRAM-SCHMIDT ORTHONORMALIZATION (ORTHCH), MULTI BASIS MATRIX MATRIX MULTIPLY (LINCOM)
[Profiler timelines: original and new]
Running on K20X with 14 SMs
Kernel launches 12 blocks
Because of register usage can run 3 blocks per SM
Theoretically can run 14*3 = 42 blocks
Use streams to launch multiple independent Zgemms and fill all the SMs
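The arithmetic on this slide can be written out as a short host-side sketch. The function names are illustrative, and the model assumes every concurrent kernel launches the same number of blocks.

```cpp
#include <cassert>

// Blocks that can be resident on the whole GPU at once.
constexpr int resident_block_slots(int sms, int blocks_per_sm) {
    return sms * blocks_per_sm;
}

// Smallest number of concurrent kernels whose blocks cover every slot
// (ceiling division). Assumes all kernels launch the same block count.
constexpr int kernels_to_fill(int slots, int blocks_per_kernel) {
    return (slots + blocks_per_kernel - 1) / blocks_per_kernel;
}
```

For the K20X case, `resident_block_slots(14, 3)` is 42 and `kernels_to_fill(42, 12)` is 4: a single 12-block Zgemm occupies under a third of the available slots, three concurrent ones occupy 36 of 42, and a fourth fills the remainder.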
MODIFY PARAMETERS TO IMPROVE BATCH SIZES
N = 2*NSIM
Increasing NSIM is an easy way to improve the performance without changing the numerical accuracy of the results
REDUCE ALLOCATION / DEALLOCATION ON GPU
Allocation / Deallocation on GPU is expensive, same as CPU
— Try to allocate once and use many times, even for temporary data
Allocations also cause expensive synchronization with the host, which introduces gaps in the GPU utilization
Allocations and deallocations may be tracked using CUDA API Trace functionality of CUDA Visual Profiler
REDUCE ALLOCATION/DEALLOCATION ON GPU
Before (allocate and free on every call):
cudaMalloc(…);
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
cudaFree(…);
After (keep the buffer, reallocate only when it is too small):
if(size < size_needed){
    cudaFree(…);
    cudaMalloc(…);
}
cudaMemcpy(…);
kernel<<<…>>>(…);
cudaMemcpy(…);
[Timelines: GPU, HtoD, DtoH, Allocate, and Deallocate activity over time for each version]
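The reuse pattern can be sketched as a grow-only scratch buffer. In this host-side version `std::malloc` and `std::free` stand in for `cudaMalloc` and `cudaFree` so that it runs without a GPU. The class name and interface are hypothetical, not taken from the VASP port.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Grow-only scratch buffer. std::malloc/std::free stand in for
// cudaMalloc/cudaFree; the name and interface are hypothetical.
class ScratchBuffer {
public:
    ScratchBuffer() = default;
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;
    ~ScratchBuffer() { std::free(ptr_); }

    // Return a buffer of at least `size_needed` bytes, reallocating
    // only when the current one is too small. Old contents are lost.
    void* get(std::size_t size_needed) {
        if (size_ < size_needed) {
            std::free(ptr_);
            ptr_ = std::malloc(size_needed);
            size_ = ptr_ ? size_needed : 0;
            ++allocations_;
        }
        return ptr_;
    }

    // Number of real allocations performed so far.
    int allocations() const { return allocations_; }

private:
    void* ptr_ = nullptr;
    std::size_t size_ = 0;
    int allocations_ = 0;
};
```

Repeated calls such as `buf.get(100); buf.get(50); buf.get(100);` allocate once. Only a larger request, such as `buf.get(200)`, triggers a second allocation. The trade-off is that the buffer holds on to its high-water-mark size until it is destroyed.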
REDUCE ALLOCATION/DEALLOCATION ON GPU - ECCP
[Profiler timelines annotated 1.4 ms and 0.3 ms, with some operations marked "Unnecessary"]
REDUCE ALLOCATION/DEALLOCATION ON GPU – FORCE AND STRESS
[Profiler timeline showing cuFFT plan create and cuFFT plan destroy]
Now: no plan create or destroy
REDUCE CPU WORK
PORT ADDITIONAL WORK TO THE GPU
Setup precond – 9.3x speedup
— Change from executing many times on the CPU in the new bands loop to executing only once on the GPU after the new bands loop
Potlok
— CPU: 2% of runtime
— Initial GPU: 7% of runtime
— GPU, after optimizing other parts: 15% of runtime
— GPU, after porting GGA (~50% of Potlok) to GPU: 6% of runtime
REMOVE UNNECESSARY CPU WORK
Example: Daxpy and Dscal in EDDRM
[Diagram, built up over several slides: an FFT between K space (135K elements) and real space (1,143K elements), with DSCAL and DAXPY operations on the data. The unnecessary DSCAL and DAXPY calls are crossed out and then removed.]
1.24x speedup for EDDRM routine
USING MORE CPU CORES
[Chart: SILICA, 1 K40 + 1 Ivy Bridge core. CPU (left over CPU work): 436, Memcopy: 68, Gemm: 120, FFT: 288, Other: 165]
[Chart: "Performance improvement with using multiple CPU cores". Speedup vs. 1 GPU with 1 core (axis 0 to 3) against cores per GPU (1, 2, 3, 4, 6), for 1 GPU, 2 GPUs, and 4 GPUs]
USE MULTI PROCESS SERVICE (MPS)
Performance issues with running multiple MPI ranks per GPU
— Increased MPI communication
— Each rank running in its own context on the GPU
Use the MPS functionality introduced in CUDA 5.5 to have multiple MPI ranks run on the same GPU at the same time
— Allows kernels from multiple MPI ranks to run at the same time on the GPU
USING MULTIPLE CPU CORES PER GPU
[Diagram: zgemm kernels over time. With 1 GPU + 1 core, the zgemms run in a single context. With 1 GPU + 2 cores, the zgemms of MPI rank 1 (context 1) and MPI rank 2 (context 2) are separated by a context switch.]
[Chart: "Performance improvement with using multiple CPU cores". Speedup vs. 1 core (axis 0.8 to 3.3) against cores per GPU (1, 2, 3, 4, 6), for 1 GPU, 2 GPU, and 4 GPU, each with and without MPS. Annotations: 14%, 13%, 11%]
OPTIMIZATION FOR SMALL BENCHMARKS
SMALL BENCHMARK - PROBLEMS
Launch latency, memory copies and bookkeeping relatively large part of time
Small kernels don’t saturate GPU, wasting resources
SMALL BENCHMARK - SOLUTION
Group independent parts together
Merge independent calls into one kernel
Group independent iterations together
SMALL BENCHMARK – EXAMPLE I3 LOOP
BEFORE
For each sim in nsim:
— Setup kernel arguments
— Launch Daxpy kernel
— Launch Reduction kernel
— Copy results to CPU
— Process results
AFTER
For each sim in nsim (CPU work in parallel):
— Setup kernel arguments
Launch Daxpy kernel
Launch Reduction kernel
Copy results to CPU
For each sim in nsim (CPU work in parallel):
— Process results
RESULTS FOR I3 LOOP
3.75x improvement for Pdo
— Small benchmark with only 87 ions
1.3x improvement for SILICA
SCALING
MPI SCALING
| Number of GPUs | EDDIAG [seconds, scaling] | EDDRM [seconds, scaling] | ORTHCH [seconds, scaling] |
| --- | --- | --- | --- |
| 1 GPU | 4.2s, 100% | 6.7s, 100% | 1.5s, 100% |
| 2 GPUs | 2.8s, 75% | 3.4s, 99% | 1.5s, 50% |
| 4 GPUs | 2.7s, 39% | 1.8s, 95% | 2.4s, 15% |
| 8 GPUs | 1.9s, 27% | 0.9s, 93% | 1.4s, 13% |
Compute intensive routine (EDDRM): good scaling
MPI intensive routines (EDDIAG, ORTHCH): bad scaling
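The scaling percentages appear consistent with the usual strong-scaling efficiency, t1 / (n * tn). That reading is an assumption, since the slides do not state the formula. For EDDIAG on 2 GPUs it gives 4.2 / (2 * 2.8) = 75%. A few entries differ from this by a point or two, presumably because the listed times are rounded. A minimal host-side sketch, with illustrative function names:

```cpp
#include <cassert>
#include <cmath>

// Strong-scaling efficiency: the ideal time on n GPUs is t1 / n, so
// efficiency is (t1 / n) / tn. Returns a fraction (1.0 == 100%).
double scaling_efficiency(double t1_seconds, int n_gpus, double tn_seconds) {
    return t1_seconds / (n_gpus * tn_seconds);
}

// Same value as a whole percentage, rounded to nearest.
int scaling_percent(double t1_seconds, int n_gpus, double tn_seconds) {
    return static_cast<int>(std::lround(
        100.0 * scaling_efficiency(t1_seconds, n_gpus, tn_seconds)));
}
```

For example, `scaling_percent(6.7, 8, 0.9)` gives 93 for EDDRM on 8 GPUs, and `scaling_percent(1.5, 2, 1.5)` gives 50 for ORTHCH on 2 GPUs, where adding a GPU did not reduce the time at all.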
![Page 65: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/65.jpg)
OVERLAPPING MPI AND GPU WORK
Reordered such that MPI overlaps with computation
GPU compute
Memcopy
Default stream
Time
MPI
![Page 66: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/66.jpg)
OVERLAPPING MPI AND GPU WORK
Reordered such that MPI overlaps with computation
[Figure: timeline of work on Stream 1 and Stream 2. Legend: GPU compute, Memcopy, MPI. Horizontal axis: Time]
Hide MPI communication and memory copies.
3x improvement in Striploop in EDDIAG
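To see why reordering helps, a toy timing model is enough. The sketch below is not VASP code. It assumes the work is split into chunks, that each chunk needs a GPU compute phase followed by a transfer phase (memcopy plus MPI), and that with two streams the transfer of one chunk can run while the next chunk computes. The chunk durations are hypothetical.

```python
def serialized_time(compute, transfer):
    """Everything on one stream: each phase waits for the previous one."""
    return sum(compute) + sum(transfer)

def overlapped_time(compute, transfer):
    """Two streams: the transfer of chunk i runs alongside the compute of chunk i + 1.

    Simplified model: the two phases are synchronized once per chunk, so each
    overlapped step costs as much as the slower of the two phases.
    """
    total = compute[0]  # first compute has nothing to overlap with
    for i in range(len(compute) - 1):
        total += max(transfer[i], compute[i + 1])
    return total + transfer[-1]  # last transfer has nothing to overlap with

# Hypothetical per-chunk durations in milliseconds (not measured values).
compute_ms = [4.0, 4.0, 4.0, 4.0]
transfer_ms = [3.0, 3.0, 3.0, 3.0]

print(serialized_time(compute_ms, transfer_ms))  # 28.0
print(overlapped_time(compute_ms, transfer_ms))  # 19.0
```

In this model, the more evenly the two phases are balanced, the larger the share of transfer time that is hidden behind computation.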
![Page 67: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/67.jpg)
PRE-ALLOCATING MEMORY IN ONE CONTIGUOUS CHUNK
VASP allocates hundreds of small buffers at the start of the RMM-DIIS iterations.
— Memory allocations require locks and synchronization, so they can be relatively expensive.
— This cost increases with multiple GPUs.
Instead:
— Do a single large memory allocation.
— Divide the large buffer among the hundreds of small buffers.
Result: the memory allocation phase is over 100x faster.
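The idea can be sketched on the host side in Python. This is an illustration of the sub-allocation pattern only, not the VASP implementation: `BufferPool`, the 256-byte alignment, and the buffer sizes are assumptions, and a `bytearray` stands in for a single large device allocation.

```python
class BufferPool:
    """Carves many small buffers out of one allocation made up front."""

    def __init__(self, sizes, alignment=256):
        self.alignment = alignment
        self.sizes = list(sizes)
        self.offsets = []
        cursor = 0
        for size in self.sizes:
            self.offsets.append(cursor)
            # Round the next start offset up to a multiple of the alignment.
            cursor += -(-size // alignment) * alignment
        # The single large allocation (stand-in for one device allocation).
        self._storage = memoryview(bytearray(cursor))

    def buffer(self, index):
        """Return a zero-copy view of the index-th small buffer."""
        start = self.offsets[index]
        return self._storage[start:start + self.sizes[index]]

# Hypothetical request: 300 small buffers of varying sizes.
sizes = [100 + 8 * i for i in range(300)]
pool = BufferPool(sizes)

first = pool.buffer(0)
first[:4] = b"\x01\x02\x03\x04"  # writes land in the shared storage
```

Handing out a buffer is then only an offset computation, so no allocator lock or synchronization is involved after the initial allocation.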
![Page 68: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/68.jpg)
USING GPU DIRECT
[Figure: data path between two GPUs. Before: GPU → CPU → NIC → NIC → CPU → GPU. After: GPU → NIC → NIC → GPU]
![Page 69: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/69.jpg)
USING GPU DIRECT
Use CUDA Aware MPI
— As simple as calling MPI_Send, MPI_Recv with pointers to the GPU data
Performance improvements
| Number of GPUs | Time ORTHCH, without GPU Direct | Time ORTHCH, with GPU Direct | % improvement |
| --- | --- | --- | --- |
| 2 GPUs | 1.32s | 0.99s | 33% |
| 4 GPUs | 0.87s | 0.63s | 37% |
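The improvement column matches the ratio of the two timings minus one. A small Python check, assuming that definition (the slide does not spell it out):

```python
def percent_improvement(time_without, time_with):
    """Speedup of the run with GPU Direct over the run without, in percent."""
    return (time_without / time_with - 1.0) * 100.0

# ORTHCH timings in seconds from the table above: (without, with).
orthch = {2: (1.32, 0.99), 4: (0.87, 0.63)}

improvements = {n: percent_improvement(*t) for n, t in orthch.items()}
for n, value in improvements.items():
    print(f"{n} GPUs: {value:.1f}%")
```

For 2 GPUs this gives about 33%, as on the slide. For 4 GPUs the rounded timings give about 38%, a point above the slide's 37%.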
![Page 70: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/70.jpg)
RESULTS
![Page 71: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/71.jpg)
RESULTS SILICA (RMM-DIIS) – VASP 5.2.2
• All results measured on K40 and dual-socket Sandy Bridge with 8 cores per socket running at 2.9GHz
[Figure: speedup vs. single CPU socket (y-axis, 0 to 10) against number of CPU sockets (x-axis, 0 to 10). Series: 2 GPU : 1 CPU ratio (1-2 cores/GPU); CPU only (8 cores/CPU); 1 GPU : 1 CPU ratio (2-6 cores/GPU). Data labels on the plot: 2.5x, 2.4x, 2.3x, 2.9x, 2.9x, 3.7x, 3.6x]
![Page 72: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/72.jpg)
RESULTS SILICA (RMM-DIIS) – VASP 5.2.2
• All results measured on K40 and dual-socket Sandy Bridge with 8 cores per socket running at 2.9GHz
[Figure: speedup vs. single CPU socket (y-axis, 0 to 10) against number of CPU sockets (x-axis, 0 to 10). Series: 2 GPU : 1 CPU ratio (1-2 cores/GPU); CPU only (8 cores/CPU); 1 GPU : 1 CPU ratio (2-6 cores/GPU)]
1 node with two GPUs is faster than 10 CPU sockets (5 nodes)
![Page 73: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/73.jpg)
RESULTS NIAL-MD (BLOCKED DAVIDSON), VASP 5.2.2
[Figure: speedup vs. single CPU socket (y-axis, 0 to 10) against number of CPU sockets (x-axis, 0 to 8). Series: 2 GPU : 1 CPU ratio (1 core/GPU); CPU only (8 cores/CPU); 1 GPU : 1 CPU ratio (1 core/GPU). Data labels on the plot: 4x, 6.9x, 4.8x, 4.9x, 3.5x, 3.4x]
• All results measured on K40 and dual-socket Sandy Bridge with 8 cores per socket running at 2.9GHz
• Running with more cores per GPU runs out of memory
![Page 74: VASP: A Case Study for Accelerating Plane Wave DFT Codes · —Liquid metal molecular dynamics sample of Nickel-based superalloy —500 atoms, 9 chemical species —Blocked Davidson](https://reader035.vdocument.in/reader035/viewer/2022070919/5fb88bd186e0af7963648bdd/html5/thumbnails/74.jpg)
RESULTS NIAL-MD (BLOCKED DAVIDSON), VASP 5.2.2
[Figure: speedup vs. single CPU socket (y-axis, 0 to 10) against number of CPU sockets (x-axis, 0 to 8). Series: 2 GPU : 1 CPU ratio (1 core/GPU); CPU only (8 cores/CPU); 1 GPU : 1 CPU ratio (1 core/GPU)]
• All results measured on K40 and dual-socket Sandy Bridge with 8 cores per socket running at 2.9GHz
• Running with more cores per GPU runs out of memory
1 node with one GPU is faster than 8 CPU sockets (4 nodes)