Containerizing Parallel MPI-based HPC Applications. Ahmed Bu-khamsin, Antoine Schonewille, Saudi Aramco


Post on 06-Mar-2018


Page 1: Containerizing Parallel MPI- based HPC Applications. Antione... · Aramco’s GigaPOWERS. •Chart: the more to the center the better. •CPU, interprocess and ... Various Simulation

Containerizing Parallel MPI-based HPC Applications

Ahmed Bu-khamsin
Antoine Schonewille
Saudi Aramco


What is a Container?

• Software containers are encapsulations of system environments and a means to use it – Gregory M. Kurtzer
• Containers share the host's kernel, with some level of isolation
• Based on standard Linux kernel technologies such as cgroups and namespaces
• Images are portable between different Linux versions
• Examples: Docker, Singularity
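To make the namespace point concrete, here is a minimal Python sketch that is not taken from the slides. On Linux, each entry under /proc/<pid>/ns is a symbolic link whose target looks like `pid:[4026531836]`; two processes are in the same namespace when both the type and the number match. The helper names and the sample numbers are hypothetical.

```python
import os
import re

# A namespace link target has the form "<type>:[<inode number>]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def shares_namespace(target_a, target_b):
    """True when two link targets name the same namespace."""
    return parse_ns_link(target_a) == parse_ns_link(target_b)


def current_namespaces(pid="self"):
    """Read the namespace links of a process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }
```

Comparing the output of `current_namespaces()` on the host with the output from inside a container would show which namespaces are shared and which are isolated.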


Containers VS Virtual Machines

http://geekyap.blogspot.ch/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html


Docker VS Singularity

Singularity
• New technology
• Minimal tools
• Designed for HPC
• Root is only needed to build images
• SETUID is used for mounting and creating namespaces
• Runs as the user who invokes it

Docker
• Most commonly used container technology
• Complete ecosystem
• Designed for microservices and cloud computing
• Requires a system service running with root privileges


Benefits of Containers in HPC

• Packaging
- Package once, run everywhere (due to system-call abstraction)
- Manage and run applications with complex dependencies easily and efficiently
• Distribution
- Reproducible, easily shareable package
• Custom Linux distros
• Simplifies node provisioning and configuration
• Enough with the theory, let’s go to the practice …


The MPI Challenge – Without Containers

[Diagram: Node 1 to Node 4, each with an SSH server; labels "MPI -> ssh" and "MPI JOB" (three times).]


MPI Challenge – With Container

[Diagram: Node 1 to Node 4, each with an SSH server; labels "MPI JOB" (three times) and "MPI -> ssh".]


MPI Challenge – Our Solution

[Diagram: Node 1 to Node 4, each with an SSH server; labels "MPI -> ssh", "Custom SSH" (twice) and "MPI JOB" (three times).]


SSH wrapper – cont’d

• Start remote containers and the MPI processes
• Control the resources
• Do not mess up existing things (too much)
• “There is a solution to every problem”
• We give you wrapper2


Wrapper2 – a problem to a solution

• SSH wrapper to start remote MPI…
- Not entirely accurate...
• The mpirun wrapper generates scripts that start the various components.
• The wrapper starts a remote container starter
- Allows setting and correcting user environments.
- Allows very granular tuning!!
- It’s just very cool.
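The idea of an SSH wrapper that starts the remote side inside a container can be sketched roughly as follows. This is not the Wrapper2 code from the slides, only a minimal illustration under stated assumptions: the function names, the image name `app.sif` and the remote command `mpi_daemon` are hypothetical, and a real Docker launch would need more options than shown. Only `singularity exec`, `docker run --rm --network host` and plain `ssh host command` are assumed to exist.

```python
import shlex


def containerize(remote_command, image, runtime="singularity"):
    """Prefix the command the MPI launcher wants to run remotely
    so that it starts inside a container instead of on the bare host."""
    if runtime == "singularity":
        return ["singularity", "exec", image] + list(remote_command)
    if runtime == "docker":
        # Host networking keeps the MPI processes reachable by the other nodes.
        return ["docker", "run", "--rm", "--network", "host", image] + list(remote_command)
    raise ValueError(f"unknown container runtime: {runtime!r}")


def build_ssh_argv(host, remote_command, image, runtime="singularity"):
    """Build the argv a custom ssh wrapper could execute in place of
    the plain 'ssh host command' that the MPI launcher asked for."""
    wrapped = containerize(remote_command, image, runtime)
    # ssh takes the remote command as one string, so quote it safely.
    return ["ssh", host, shlex.join(wrapped)]


if __name__ == "__main__":
    print(build_ssh_argv("node2", ["mpi_daemon", "--report"], "app.sif"))
```

Because the wrapper only rewrites the remote command line, it does not need to know which MPI flavor produced that command, which is in line with the container- and MPI-agnostic goal stated in the slides.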



Granular tuning

• Container behavior: all cores are utilized (balanced), even when the MPI job inside the container requires fewer cores
• Containers seem to ignore the affinity directives requested by MPI.
• However, we can tune with e.g. numactl
- Tune on the container itself
- Tune inside the container, on the MPI processes
- Or both! (but why?)
• Tuning helped and tuning ruined things.

[Diagram: host, container and four MPI processes, with two TUNE markers.]
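The two tuning levels can be sketched as command-line construction in Python. This is an illustration under assumptions, not the authors' tuning setup: the helper names, the image `app.sif` and the binary `./solver` are hypothetical. `--cpunodebind` and `--membind` are standard numactl options. Wrapping the container start command with numactl assumes the container runs as a child of that command (as with `singularity exec`); with Docker the processes are started by the daemon, so `docker run` options such as `--cpuset-cpus` and `--cpuset-mems` would be the equivalent. Tuning inside the container assumes numactl is installed in the image.

```python
def numactl_prefix(cpu_node=None, mem_node=None):
    """Build a numactl prefix binding CPU and/or memory to NUMA nodes."""
    argv = ["numactl"]
    if cpu_node is not None:
        argv.append(f"--cpunodebind={cpu_node}")
    if mem_node is not None:
        argv.append(f"--membind={mem_node}")
    return argv


def tune_container(container_argv, process_argv, cpu_node, mem_node):
    """Tune on the container itself: numactl wraps the container start."""
    return numactl_prefix(cpu_node, mem_node) + list(container_argv) + list(process_argv)


def tune_inside(container_argv, process_argv, cpu_node, mem_node):
    """Tune inside the container: numactl wraps the MPI process only."""
    return list(container_argv) + numactl_prefix(cpu_node, mem_node) + list(process_argv)


if __name__ == "__main__":
    container = ["singularity", "exec", "app.sif"]  # hypothetical image
    process = ["./solver"]                          # hypothetical MPI binary
    print(" ".join(tune_container(container, process, 0, 0)))
    print(" ".join(tune_inside(container, process, 1, 1)))
```

Keeping the two levels as separate functions mirrors the slide: either can be applied alone, or both can be combined, and each combination has to be measured because tuning can hurt as well as help.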


“Sing a tune…”

• We benchmarked HPL; however, to really prove containers, we used a real-life application: Aramco’s GigaPOWERS.
• Chart: the closer to the center, the better.
• CPU, interprocess and I/O were magnified for better comparison.
• Out-of-the-box Docker gave good results.
• Tuning on Docker mostly improved I/O.
• Tuning on MPI processes inside Docker mostly improved memory.

[Chart: Various Simulation Runs and their relative performance. Axes: CPU Intensive, Memory Intensive, InterProcess, I/O. Series: ORG GP BARE; ORG GP DOCKER; ORG GP DOCKER USING TMPFS; ORG GP SOME NUMA TUNE IN DOCKER; PROFILE GP BARE; PROFILE GP DOCKER; PROFILE GP NUMA ON DOCKER; PROFILE GP NUMA TUNE IN DOCKER.]


Some results

Size                   Bare      Docker    Diff %    Singularity  Diff %
16 cores, 1 node       7.683 h   7.689 h   -0.08%
32 cores, 2 nodes      4.070 h   4.118 h   -1.18%    4.095 h      -0.61%
64 cores, 4 nodes      2.223 h   2.245 h   -0.98%    2.238 h      -0.67%
128 cores, 8 nodes     1.216 h   1.227 h   -0.90%    1.243 h      -2.17% (?)
256 cores, 16 nodes    0.740 h   0.748 h   -1.07%

PPN (processes per node) of 16.
Best results taken out of many… many, many runs.
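The slides do not state how the "Diff %" column was computed. A plausible reading, assumed here, is the runtime difference relative to the bare-metal run, with a negative value meaning the container was slower. The sketch below applies that assumed formula to the published timings; the function name is hypothetical, and because the published hours are rounded, some rows come out a few hundredths of a percent away from the table.

```python
def diff_percent(bare_hours, container_hours):
    """Relative runtime difference in percent, using bare metal as baseline.
    Negative means the containerized run took longer (assumed convention)."""
    return (bare_hours - container_hours) / bare_hours * 100.0


# (size, bare, docker, singularity) in hours, copied from the table;
# None marks the entries the table leaves empty.
RESULTS = [
    ("16 cores, 1 node", 7.683, 7.689, None),
    ("32 cores, 2 nodes", 4.070, 4.118, 4.095),
    ("64 cores, 4 nodes", 2.223, 2.245, 2.238),
    ("128 cores, 8 nodes", 1.216, 1.227, 1.243),
    ("256 cores, 16 nodes", 0.740, 0.748, None),
]

if __name__ == "__main__":
    for size, bare, docker, singularity in RESULTS:
        line = f"{size}: Docker {diff_percent(bare, docker):+.2f}%"
        if singularity is not None:
            line += f", Singularity {diff_percent(bare, singularity):+.2f}%"
        print(line)
```

For the first two rows the assumed formula rounds to the published values (-0.08% for 16 cores, -1.18% and -0.61% for 32 cores).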


Start up duration and differences

[Charts, each comparing BARE METAL, DOCKERIZED and SINGULARITY at 40, 80 and 160 ranks: MPI_Barrier time in seconds; MPI_Bcast time in seconds; Small MPI program runtime in seconds.]


Delta in total runtime versus bare versus MPI rank quantity

[Chart: Small MPI program runtime in seconds, for BARE METAL, DOCKERIZED and SINGULARITY at 40, 80 and 160 ranks.]
[Charts: Delta small MPI program runtime in seconds at 2, 4 and 8 nodes, for 1 rank and 20 ranks. Docker panel: Tdocker - Tbare. Singularity panel: Tsingularity - Tbare.]


Conclusion

• Software containers have minimal performance overhead.
• Both container technologies scale equally well to support large parallel jobs.
• The learning curve with Docker can be steep; Singularity, on the other hand, is much easier.
• Docker is more mature and has better documentation.
• Our MPI process spawning solution is the only solution that supports any available container technology and MPI flavor.
• Wrapper2 and other documentation can be found at:
• https://github.com/ambu50/wrapper-sq