performance optimization of quantum espresso on knl · performance optimization of quantum espresso...

15
Performance Optimization of Quantum Espresso on KNL [email protected] NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul Kent, Pierre Carrier, Nathan Wichmann, David Prendergast, Jack Deslippe

Upload: others

Post on 18-Jul-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Performance Optimization of Quantum Espresso on KNL

[email protected] NERSC November 3 2016

Taylor Barnes, Thorsten Kurth, Paul Kent, Pierre Carrier, Nathan Wichmann, David Prendergast, Jack Deslippe

Page 2: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Introduction

Approximate exchange functional Exact exchange operator

Local DFT: Hybrid DFT:

Cost of Hybrid DFT

Goal: Prepare QE for large-scale execution on the KNL architecture, with a particular focus on improving the implementation of hybrid exchange

0

200

400

600

800

1000

1200

PBE PBE0

Wal

ltim

e (s

)

Local DFT Hybrid DFT

Total walltime for an SCF calculation on 16 waters:

Page 3: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

NERSC Whitebox, Quad Cache Mode; Intel 17.0.042 Compiler:

Threads

Original OpenMP Threading

Page 4: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Increasing the amount of work performed in each OMP loop dramatically reduces overhead costs.

NERSC Whitebox, Quad Cache Mode; Intel 17.0.042 Compiler:

Threads

Improved OpenMP Threading

Page 5: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Flat Mode

With the original code, running KNL in Cache mode is approximately 2x faster than Flat mode. Using FASTMEM directives enables Flat mode to outperform Cache mode.

KNL NUMA Mode Performance

Page 6: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Using the improved OpenMP code is the fastest way to run on KNL, and is about 1.6x faster than the best Haswell walltimes. OMP is significantly improved on both Haswell and KNL.

KNL vs. Haswell

Page 7: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Strong scaling of QE on Ivy Bridge, using pure MPI mode:

Strong Scaling of QE

Page 8: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

FFT-1[ ψi(g) ]

Calculate ck(g)

v(g) = ck(g) ρij(g) FFT-1[ v(g) ]

FFT[ ρij(r) ]

FFT[ Hψi(r) ]

Reduce Hψi(r)

Local Potential

Update ψ

Broadcast ψ(g) SCF Loop

(duplicated on each bgrp) Loop i

(duplicated on each bgrp)

Hψi(r) += v(r) ψj(r)

Loop j (bgrp parallelized)

ρij(r) = ψi(r) ψj(r)

Loop k

Overview of Existing Code

Page 9: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

FFT-1[ ψi(g) ]

v(g) = ck(g) ρij(g) FFT-1[ v(g) ]

FFT[ ρij(r) ]

Local Potential

Update ψ

Broadcast ψ(g) SCF Loop

(duplicated on each bgrp)

Hψi(r) += v(r) ψj(r)

Loop j (bgrp parallelized)

ρij(r) = ψi(r) ψj(r)

Loop k

FFT[ Hψi(r) ]

Reduce Hψi(g)

Re-use ck(g)

Loop i (bgrp parallelized)

Pair Parallelization

Page 10: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Strong scaling of the exact exchange part of the code on Ivy Bridge, with 1 band group per node:

Parallelization over band pairs both improves the strong scaling of the code and also improves the load balancing.

Pair Parallelization Performance

Page 11: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

FFT-1[ ψi(g) ]

v(g) = ck(g) ρij(g) FFT-1[ v(g) ]

FFT[ ρij(r) ]

Local Potential

Update ψ

Broadcast ψ(g) SCF Loop

(duplicated on each bgrp)

Hψi(r) += v(r) ψj(r)

Loop j (bgrp parallelized)

ρij(r) = ψi(r) ψj(r)

Loop k

FFT[ Hψi(r) ]

Reduce Hψi(g)

Re-use ck(g)

Loop i (bgrp parallelized)

Code Overview

Page 12: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Fraction of total walltime spent in local regions of the code, using 1 band group per node on Haswell:

Unintuitively, local regions of the code dominate the cost of the calculation when using large numbers of nodes. This is because the local regions of the code a run in serial with respect to band groups.

Cost of the Local Calculation

Page 13: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Independent Parallelization of the Local Code

Node 1

Band 1

Data structure of Ψ(g), in local code:

Band 2

Band 3

Band 4

Band 5

Band 6

g5 g2 g4 g6 g1 g3

Node 1

Band 1

Band 2

Band 3

Band 4

Band 5

Band 6

g1 g2 g3 g4 g5 g6

Node 2 Node 3

Data structure of Ψ(g), in exact exchange code:

Node 2

Node 3

Enabling parallelization of the local regions of the code across band groups requires on-the-fly transformation of the data structures between local and exact exchange regions of the code.

Page 14: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Full Calculation Exact Exchange Part Strong scaling on Ivy Bridge, with 1 band group per node:

Improved Strong Scaling

Page 15: Performance Optimization of Quantum Espresso on KNL · Performance Optimization of Quantum Espresso on KNL tbarnes@lbl.gov NERSC November 3 2016 Taylor Barnes, Thorsten Kurth, Paul

Thank You

Taylor Barnes, NERSC, November 2016 - 15