How to run VASP
Part 1: Influential settings
Quick summary
• Processor cores ≤ atoms
• NCORE = cores per node; NPAR = number of compute nodes
• NSIM = 1 (Intel), 4 (AMD)
• KPAR = compute nodes
• ALGO = Fast (or VeryFast)
• NBANDS = N × cores (a multiple of the number of cores)
• Use the right VASP binary (NGZhalf for standard k-point runs, wNGZhalf for gamma-point-only calculations)
Q: What’s missing?
A: Read the VASP manual!
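Put together, the quick-summary settings might look like the INCAR fragment below. This is a minimal sketch, not a universal recommendation: it assumes a hypothetical allocation of 4 compute nodes with 16 cores each (64 cores total), and the best values depend on your machine, as the benchmarks below show.

! INCAR fragment (illustrative; 4 nodes x 16 cores assumed)
ALGO  = Fast     ! Davidson + RMM-DIIS
NCORE = 16       ! cores per node cooperating on one band
KPAR  = 4        ! = number of compute nodes
NSIM  = 4        ! VASP's default; try 1 on Intel Sandy Bridge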
Efficiency: running as many jobs as possible within a given allocation of computer time.
Speed: the amount of time (in real time, "human time") to run a specific simulation from the time it starts.
Time-to-solution: the amount of time (in real time, "human time") it takes to get the results = runtime + the time spent waiting in the queue.
Efficiency: using as few cores as possible.
Speed: using as many cores as possible (up to a limit).
Time-to-solution:
• Finding the "right" job size which fits the cluster.
• Not using up all your allocated time (in order to get higher priority in the queue).
[Figure: VASP scaling on Carter. Runtime (s) vs. number of cores on log-log axes, for NiSi-504, PbSO4-24, NiSi-504 (12c), and NiSi-1200 (16c).]
Safe & Efficient: cores < n(atoms)
Daring & not as efficient: cores ≈ 2 × n(atoms)
Crash! cores > n(bands)
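A quick way to apply these rules of thumb is to count the atoms in the POSCAR before picking a core count. A shell sketch, assuming a VASP 5-format POSCAR (element symbols on line 6, per-species counts on line 7):

# sum the per-species atom counts on line 7 of a VASP 5 POSCAR
awk 'NR==7 { for (i = 1; i <= NF; i++) n += $i; print n, "atoms" }' POSCAR

Stay below that number of cores; to check against n(bands), read NBANDS from the OUTCAR of a short test run.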
Why should I care?
0"
10"
20"
30"
40"
50"
60"
70"
NPA
R=1"
NPA
R=2"
NPA
R=3"
NPA
R=6"
NPA
R=8"
NPA
R=12"
NPA
R=16"
NPA
R=48"
Number'of'SCF'itera0ons'vs'NPAR'FeCo%alloy%(53%atoms)%on%48%cores%
Does 10% matter?
Dear Professor N.N.,
We regret to inform you that we have decided to cut your VSC time allocation from 100,000 core hours/month to 90,000 core hours/month.
Best regards, VSC Staff
NPAR & NCORE
• Applies to the Davidson/RMM-DIIS algorithms (not hybrid calculations)
• Controls/activates band parallelization
• NPAR = 1: no parallelization over bands = BAD
• NPAR = (1-2)x number of compute nodes
• NCORE = cores working on one band
NCORE = cores per node (or socket)
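As an example (a sketch, assuming hypothetical 16-core nodes and a 4-node, 64-core standard-DFT job): NCORE and NPAR specify the same split in two ways, since with KPAR = 1, NCORE × NPAR = total cores.

! INCAR: band parallelization, 4 nodes x 16 cores = 64 cores assumed
NCORE = 16       ! 16 cores (one full node) cooperate on each band
! equivalent alternative: NPAR = 4 (64 cores / 16 = 4 band groups)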
[Figure: speed (jobs/h) vs. NPAR (1, 2, 4, 8, 16) for Li2FeSiO4 (128 atoms) on Triolith. Standard DFT, 4 compute nodes / 64 cores.]
NPAR: caveat
0"
10"
20"
30"
40"
50"
60"
70"
NPA
R=1"
NPA
R=2"
NPA
R=3"
NPA
R=6"
NPA
R=8"
NPA
R=12"
NPA
R=16"
NPA
R=48"
Number'of'SCF'itera0ons'vs'NPAR'FeCo%alloy%(53%atoms)%on%48%cores%
K-point parallelization ("KPAR")
• Controls parallelization over k-points (version 5.3.2+)
• Great for hybrid-DFT jobs
• Not so great for metals with 100-1000s of k-points…
KPAR = "number of k-points treated in parallel" cores / KPAR = "group size" = cores/node?
!
KPAR = min(nodes,#kpts)
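A hedged INCAR sketch for such a hybrid job, assuming 4 k-points spread over 4 nodes of 16 cores each:

! INCAR fragment for an HSE06 hybrid job (4 k-points, 4 nodes assumed)
LHFCALC  = .TRUE.
HFSCREEN = 0.2   ! HSE06 screening parameter
KPAR     = 4     ! = min(nodes, #kpts), giving 16 cores per k-point group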
[Figure: speed (jobs/h) vs. number of nodes (4, 8, 12, 24; 16 cores/node) for MgO (63 atoms) on Triolith. HSE06 hybrid calculation with 4 k-points, with curves for KPAR = 1, 2, and 4.]
NSIM
• Blocking mode for the RMM-DIIS algorithm (so it only applies for ALGO=Fast/VeryFast)
• Effect depends on BLAS library. Less important nowadays?
• NSIM = 1 for single-node jobs on Intel Sandy Bridge
• NSIM = 4 for older systems (incl. AMD). This is VASP's default.
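As a sketch, a single-node Intel job following the benchmarks below might use:

! INCAR: NSIM only matters for the RMM-DIIS part of the algorithm
ALGO = Fast
NSIM = 1         ! 1 on single-node Intel Sandy Bridge jobs; default 4 elsewhere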
[Figure: speed (jobs/h) vs. NSIM (1, 2, 4, 8) for Li2FeSiO4 (128 atoms) on Triolith. Standard DFT, 4 compute nodes / 64 cores.]
[Figure: speed (jobs/h) vs. NSIM (1, 2, 4, 6, 8) for PbSO4 (24 atoms) on Triolith. Standard DFT, 1 compute node / 16 cores.]
Core binding / pinning
• It is important to lock each MPI process to a specific processor core. Otherwise, processes can jump around between cores…
• Intel MPI binds by default.
• Older versions of OpenMPI do not. Try e.g.:
mpirun -np 32 --display-map --bind-to-core …
• Core binding on AMD Opteron 6300-series “Interlagos” processors is not intuitive. [Cf. one of my blog posts]
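Note (beyond the slide): OpenMPI 1.8 and later replaced --bind-to-core with the two-word form --bind-to core, and --report-bindings prints the resulting pinning. A sketch (the binary name is illustrative):

mpirun -np 32 --bind-to core --report-bindings ./vasp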