Containerizing Parallel MPI-based HPC Applications
Ahmed Bu-khamsin
Antoine Schonewille
Saudi Aramco
What is a Container?
• Software containers are encapsulations of system environments and a means to use them – Gregory M. Kurtzer
• Containers share the host's kernel, with some level of isolation
• Based on standard Linux kernel technologies such as cgroups and namespaces
• Images are portable between different Linux distributions
• Examples: Docker, Singularity
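As a quick illustration of the shared-kernel point, a container reports the same kernel version as its host. A minimal sketch in Python, assuming Singularity is installed and a hypothetical image file ubuntu.sif exists:

```python
import subprocess

# The kernel version on the host.
host = subprocess.run(["uname", "-r"], capture_output=True, text=True).stdout.strip()

# The kernel version seen inside a Singularity container.
# "ubuntu.sif" is a hypothetical image name; build or pull your own.
cont = subprocess.run(
    ["singularity", "exec", "ubuntu.sif", "uname", "-r"],
    capture_output=True, text=True,
).stdout.strip()

# Containers share the host kernel, so the two strings match.
print(host, cont, host == cont)
```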
Containers VS Virtual Machines
[Diagram comparing the container and virtual machine stacks; source: http://geekyap.blogspot.ch/2016/11/docker-vs-singularity-vs-shifter-in-hpc.html]
Docker VS Singularity
Singularity
• New technology
• Minimal tooling
• Designed for HPC
• Root is only needed to build images
• SETUID is used for mounting and creating namespaces
• Containers run as the user who invokes them
Docker
• Most commonly used container technology
• Complete ecosystem
• Designed for microservices and cloud computing
• Requires a system service running with root privileges
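One practical consequence of these design choices is the user identity inside the container: Docker processes run as root by default, while Singularity containers run as the invoking user. A hedged sketch, assuming both runtimes are installed and that a public ubuntu Docker image plus a hypothetical ubuntu.sif file are available:

```python
import subprocess

def uid_in(cmd):
    """Run a command and return its stdout (the reported UID)."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

# Docker: by default the containerized process runs as root (uid 0).
print("docker:", uid_in(["docker", "run", "--rm", "ubuntu", "id", "-u"]))

# Singularity: the process runs as the user who launched it.
print("singularity:", uid_in(["singularity", "exec", "ubuntu.sif", "id", "-u"]))
```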
Benefits of Containers in HPC
• Packaging
- Package once, run everywhere (due to system-call abstraction)
- Manage and run applications with complex dependencies easily and efficiently
• Distribution
- Reproducible, easily shareable packages
• Custom Linux distros
• Simplifies node provisioning and configuration
• Enough with the theory, let's move on to practice …
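In practice, "package once, run everywhere" can be as simple as pulling one image file onto a shared filesystem and running the same binary on every node. A minimal sketch, assuming Singularity and access to Docker Hub; the image names are illustrative:

```python
import subprocess

# Pull once: convert a Docker Hub image into a single portable .sif file.
subprocess.run(
    ["singularity", "pull", "app.sif", "docker://ubuntu:20.04"], check=True
)

# Run everywhere: the same file works on any node with a compatible kernel.
subprocess.run(
    ["singularity", "exec", "app.sif", "cat", "/etc/os-release"], check=True
)
```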
The MPI Challenge – Without Containers
[Diagram: without containers, MPI on Node 1 uses ssh to reach the SSH servers on Nodes 2-4 and starts the MPI job processes directly on each host.]
MPI Challenge – With Container
[Diagram: with containers, the MPI job processes live inside containers on Nodes 2-4, but ssh from MPI on Node 1 still lands on the hosts' SSH servers, outside the containers.]
MPI Challenge – Our Solution
[Diagram: a custom SSH wrapper intercepts MPI's ssh calls on Node 1; on Nodes 2-4 it starts the container and launches the MPI job process inside it.]
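The deck doesn't show the wrapper's internals, so here is a minimal sketch of the idea under stated assumptions: a drop-in replacement for ssh that forwards the connection but rewrites the remote command so it runs inside a Singularity container. The image path is hypothetical:

```python
#!/usr/bin/env python3
"""custom_ssh.py - toy ssh replacement that containerizes the remote command.

mpirun invokes it as:  custom_ssh.py <host> <remote command ...>
"""
import os
import sys

IMAGE = "/shared/app.sif"  # hypothetical image on a shared filesystem

host, remote_cmd = sys.argv[1], sys.argv[2:]

# Wrap the remote command so it executes inside the container.
wrapped = ["singularity", "exec", IMAGE] + remote_cmd

# Hand off to the real ssh with the rewritten command line.
os.execvp("ssh", ["ssh", host] + wrapped)
```

With Open MPI, such a wrapper can typically be plugged in via the rsh agent parameter, e.g. `mpirun --mca plm_rsh_agent ./custom_ssh.py -np 64 ./app`.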
SSH wrapper – cont'd
• Start remote containers and the MPI processes
• Control the resources
• Don't mess up existing things (too much)
• "There is a solution to every problem"
• We give you wrapper2
Wrapper2 – a problem to a solution
• An SSH wrapper to start remote MPI….
- Not entirely accurate...
• The mpirun wrapper generates scripts that start the various components (see the sketch below).
• The wrapper starts a remote container starter
- Allows setting and correcting user environments.
- Allows very granular tuning!!
- It's just very cool.
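The remote "container starter" itself isn't shown in the deck; a minimal sketch of what such a generated starter could look like, assuming Singularity and hypothetical paths (the real wrapper2 is at the GitHub link in the conclusion):

```python
#!/usr/bin/env python3
"""container_starter.py - toy remote starter that a wrapper might generate.

Runs on each remote node: fixes up the user environment, then execs the
MPI component inside the container.
"""
import os
import sys

IMAGE = "/shared/app.sif"   # hypothetical image path
BINDS = "/scratch,/home"    # hypothetical bind mounts

# Correct the user environment before anything MPI-related starts.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ["TMPDIR"] = "/scratch/tmp"

# Exec the MPI component (passed by mpirun) inside the container.
cmd = ["singularity", "exec", "--bind", BINDS, IMAGE] + sys.argv[1:]
os.execvp(cmd[0], cmd)
```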
Granular tuning
• Container behavior: utilizes all cores (balanced) even when fewer cores are required inside the container for the MPI job
• Seems to ignore affinity directives requested by MPI.
• However, we can tune with e.g. numactl (see the sketch below)
- Tune the container itself
- Tune the MPI processes inside the container
- Or both! (but why?)
• Tuning helped and tuning ruined things.
[Diagram: tuning can be applied at two levels – on the container from the host, and on the individual MPI processes inside the container.]
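Both tuning levels reduce to prefixing something with numactl. A hedged sketch of the two placements, with illustrative NUMA node numbers and image names:

```python
import subprocess

IMAGE = "/shared/app.sif"  # hypothetical image

# Level 1: tune the container itself - numactl wraps the container runtime,
# so every process inside inherits the CPU/memory binding.
subprocess.run(
    ["numactl", "--cpunodebind=0", "--membind=0",
     "singularity", "exec", IMAGE, "./mpi_rank_program"],
    check=True,
)

# Level 2: tune inside the container - the container starts unconstrained
# and numactl pins only the MPI process within it.
# (numactl must be installed inside the image for this variant.)
subprocess.run(
    ["singularity", "exec", IMAGE,
     "numactl", "--cpunodebind=0", "--membind=0", "./mpi_rank_program"],
    check=True,
)
```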
"Sing a tune…"
• We benchmarked HPL; however, to really prove containers, we used a real-life application: Aramco's GigaPOWERS.
• Chart: the closer to the center, the better.
• CPU, interprocess and I/O were magnified for better comparison.
• Out-of-the-box Docker gave good results.
• Tuning on Docker mostly improved I/O.
• Tuning on the MPI processes inside Docker mostly improved memory.
[Radar chart: "Various Simulation Runs and their relative performance", axes CPU intensive / memory intensive / interprocess / I/O; series: original and profiled GigaPOWERS runs on bare metal, on Docker, on Docker using tmpfs, and with NUMA tuning on/in Docker.]
Some results
Size                 Bare     Docker   Diff %   Singularity  Diff %
16 cores, 1 node     7.683 h  7.689 h  -0.08%
32 cores, 2 nodes    4.070 h  4.118 h  -1.18%   4.095 h      -0.61%
64 cores, 4 nodes    2.223 h  2.245 h  -0.98%   2.238 h      -0.67%
128 cores, 8 nodes   1.216 h  1.227 h  -0.90%   1.243 h      -2.17% (?)
256 cores, 16 nodes  0.740 h  0.748 h  -1.07%
PPN of 16. Best results taken out of many, many runs.
Start-up duration and differences
[Three charts comparing bare metal, Docker, and Singularity at 40, 80, and 160 ranks: MPI_Barrier time in seconds, MPI_Bcast time in seconds, and small MPI program runtime in seconds.]
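The deck doesn't include the benchmark source; a minimal sketch of how such startup-cost measurements could be taken, using mpi4py (an assumption; any MPI binding would do):

```python
from mpi4py import MPI  # assumes mpi4py is available on all nodes

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Time a barrier across all ranks.
comm.Barrier()                 # warm-up
t0 = MPI.Wtime()
comm.Barrier()
barrier_t = MPI.Wtime() - t0

# Time a broadcast from rank 0.
payload = b"x" * 1024 if rank == 0 else None
t0 = MPI.Wtime()
payload = comm.bcast(payload, root=0)
bcast_t = MPI.Wtime() - t0

if rank == 0:
    print(f"MPI_Barrier: {barrier_t:.6f} s, MPI_Bcast: {bcast_t:.6f} s")
```

Launched with e.g. `mpirun -np 40 python3 bench.py` on bare metal and again through the container wrapper, this yields directly comparable numbers.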
Delta in total runtime versus bare metal, by MPI rank quantity
[Charts: small MPI program runtime in seconds at 40, 80, and 160 ranks for bare metal, Docker, and Singularity; and the deltas T_docker − T_bare and T_singularity − T_bare in seconds for 1 and 20 ranks per node across 2, 4, and 8 nodes.]
Conclusion
• Software containers have minimal performance overhead.
• Both container technologies scale equally well to support large parallel jobs.
• The learning curve for Docker can be steep; Singularity, on the other hand, is much easier.
• Docker is more mature and has better documentation.
• Our MPI process spawning solution is the only one that supports any available container technology and MPI flavor.
• Wrapper2 and other documentation can be found at:
• https://github.com/ambu50/wrapper-sq