the impact of process placement on job cost and performance€¦ · 5 managed by ut-battelle for...

12
Managed by UT-Battelle for the Department of Energy The impact of process placement on job cost and performance Using Titan’s Turbo Switch

Upload: others

Post on 22-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

Managed by UT-Battelle for the Department of Energy

The impact of process placement on job cost

and performance

Using Titan’s Turbo Switch

Page 2: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

2 Managed by UT-Battelle for the Department of Energy

Outline

•  Who is the target audience?

•  An overview of Titan’s CPUs

•  What is aprun’s default placement?

•  Avoiding floating point contention

•  Best practices

•  What about Eos?

Page 3: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

3 Managed by UT-Battelle for the Department of Energy

Who is the target audience?

•  Any user whose application primarily uses the CPUs’ floating point units (FPUs)

•  Any user whose application uses 2-8 MPI ranks per node –  e.g. needs more memory per MPI rank

•  This does not apply to codes using the GPUs –  Unless they also heavily use the CPU FPUs

Page 4: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

4 Managed by UT-Battelle for the Department of Energy

An overview of Titan’s CPUs

•  AMD Opteron 6274 (Interlagos)

•  2 NUMA domains

•  8 Bulldozer modules –  4 per NUMA domain

•  Each Bulldozer has: –  2 Integer Units –  1 Floating Point Unit

(FPU) https://www.olcf.ornl.gov/kb_articles/xk7-titan-node-description/

Page 5: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

5 Managed by UT-Battelle for the Department of Energy

What is aprun’s default placement

•  The integer units are numbered 0-15

•  By default, aprun will: –  place 16 processes per node –  place the processes sequentially starting at core 0

•  Two processes will be assigned to each Bulldozer and each pair will have to share a FPU –  Even if you only request 8 processes per node,

they will be placed on cores 0-7 and compute for the first four FPUs

Page 6: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

6 Managed by UT-Battelle for the Department of Energy

Avoiding floating point contention

•  Request twice as many nodes from qsub –  Will consume 2x the node-hours

•  Place 8 processes per node (-N 8 or -S 4)

•  Use only 1 process per FPU (-j 1) –  Don’t use -d 2 (has different behavior on Eos)

Page 7: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

7 Managed by UT-Battelle for the Department of Energy

Avoiding floating point contention

•  Requesting 2x nodes is not enough

•  You must restrict the processes to one per FPU

•  Using -j 1 is easier than specifying the cores to use (-cc)

•  Set -r 1 as well to have OS processes run on core 15

1.00

1.69 1.69

1.01

0%

25%

50%

75%

100%

125%

150%

175%

200%

Default Even-OS Core Odd-OS Core Half-OS Core

S3D speedup when using twice the nodes

Page 8: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

8 Managed by UT-Battelle for the Department of Energy

Avoiding floating point contention

•  FPU-intensive apps see ~2.0x speedup –  e.g. HPL

•  Real apps do I/O or have non-floating-point sections

•  Typically, 1.4-1.7x speedup

•  Run your app to determine what, if any, speedup you get

Page 9: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

9 Managed by UT-Battelle for the Department of Energy

Best practices

•  If I see a speedup, should I always try to avoid FPU contention?

•  You are trading 2x the node-hours for 1.4-1.7x speedup

•  If you are debugging, conserve your node-hours

•  If you want to do your work faster, avoid FPU contention –  Bonus: larger jobs get priority in the queue

Page 10: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

10 Managed by UT-Battelle for the Department of Energy

What about Eos?

•  2 NUMA domains

•  16 Cores (8/NUMA)

•  Hyper-Threading –  32 integer cores –  but only 16 FPUs

•  Yes, it works here to, but…

•  Intel Xeon 5E-2670

https://www.olcf.ornl.gov/kb_articles/xc30-cpu-description/

Page 11: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

11 Managed by UT-Battelle for the Department of Energy

What about Eos?

•  By default, aprun uses -j 1 to place one process per core (i.e. no Hyper-Threading) –  Opposite policy from Titan

•  You can conserve node-hours –  Request ½ the nodes form qsub –  Pass -j 2 to aprun –  e.g. debugging

Page 12: The impact of process placement on job cost and performance€¦ · 5 Managed by UT-Battelle for the Department of Energy What is aprun’s default placement • The integer units

12 Managed by UT-Battelle for the Department of Energy

Thank you

•  Questions?