high performance containers · linux-vserver ports kernel isolation, but requires recompilation...

35
High Performance Containers Convergence of Hyperscale, Big Data and Big Compute

Upload: others

Post on 24-Apr-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

High Performance Containers

Convergence of Hyperscale, Big Data and Big Compute

Page 2: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Technical Account Manager, DockerChristian Kniep

Page 3: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Brief Recap of Container Technology

Page 4: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Brief History of Container Technology

Jails Zones Namespaces Docker

VServer cgroups LXC

Container Runtime and Image Format Standards, Jeff Borek, Stephen Walli, KubeCon Dec/2017

FreeBSD Jails expand on Unix chroot to isolate files

Linux-VServer ports kernel isolation, but requires recompilation

Solaris Zones bring the concept of snapshots

Google introduces Process Containers, merged as cgroups

RedHat adds user namespaces, limiting root access in containers

IBM creates LXC providing user tools for cgroups and namespaces

Docker provides simple user tools and images. Containers go mainstream

2000

2001

2004

2006

2008

2008

2013

Page 5: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Linux Namespaces 101Short Recap

Page 6: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Example PIDProcesses Isolation

● host sees all processes with real PID from the Kernels perspective

● first process within PID namespace gets PID=1

Host

cnt0

ps -ef

cnt1

start.sh

java -jar ..

cnt2

start.sh

java -jar ..

health.sh

Page 7: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Resource Isolation of Process Groups7 as of Kernel 4.10

1. MNT: Controls mount points

2. PID: Individual process table

3. NET: Network resources (IPs, routing,...)

4. IPC: Prevents the use of shared memory between processes

5. UTS: Individual host- and domain name

6. USR: Maps container UID to a different UID of the host

7. CGRP: Hides system cgroup hierarchy from container

Page 8: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Container Namespaces

A starting container gets his own namespaces.

PIDMNT IPCNET USR

Host

UTS CGRP

cnt0 cnt1 cnt2

But can share namespaces with other containers or even the host

Page 9: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

All In

When using all host namespaces - we are on the host (almost like ssh).

PIDMNT IPCNET USR

Host

UTS CGRP

cnt0

$ docker run -ti --rm \ --privileged \ --security-opt=seccomp=unconfined \ --pid=host \ --uts=host \ --ipc=host \ --net=host \ -v /:/host \ ubuntu bashroot@linuxkit-025000000001:/# chroot /host/ # ash/ #

Page 10: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

CGroups

While namespaces isolate,

Control Groups constraint resources.

Page 11: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

CGroups / Filesystem LayeringShort Recap [cont]

Page 12: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Overlay FilesystemCompose a FS from multiple pieces

ubuntu:16.04

openjre:9-b114

appA.jar:1.1 appB.jar

ARG FROM openjre:9-b114COPY appB.jar /usr/local/bin/CMD [“java”, “-jar”, “/usr/local/bin/appB.jar”]

ARG FROM openjre:9-b114COPY appA.jar /usr/local/bin/CMD [“java”, “-jar”, “/usr/local/bin/appA.jar”]

FROM ubuntu:16.04ARG JRE_VER=9~b114-0ubuntu1RUN apt-get update \ && apt-get install -y openjdk-9-jre-headless=${JRE_VER} \ && java -version

openjre:9-b117

Page 13: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Stacked View

Hardware

Host Kernel

Userland

Services Hypervisor

Kernel

Userland

Services1 Services2

Userland

Kernel

Hardware

Host Kernel

Userland

Services

Userland

appB appC

Userland

Cnt1 Cnt2

VM1 VM2

Traditional Virtualization os-virtualization

VM

-sho

rtcut

s(P

VM

, pci

-pas

sthr

ough

)

Page 14: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Traditional Virtualization

hardware

kernel

hypervisor

kernel

Interface View

libs

From Application to Kernel

application

libs

applicationlib-calls

102syscalls

101hypercalls

hardware

kernelhw-calls

container

os-virtualization

Page 15: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Why apply Containers to HPC?

Page 16: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

SciOps

Peer Review + Collaboration

Researcher Workstation

HPC Compute Nodes

Page 17: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Bridging the Technology GapContainers removing barriers of entry:

● Hardware Expertise

● Software Expertise

● Workflow/System Integration

● CapEx / OpEx

Page 18: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

How are containers used in HPC today?

Page 19: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Current Solutions

Lack of HPC focus gave birth to HPC workarounds.

ShipBuild

Development Build hub.docker.com

HPC Runtimes

Pull Image

Extract File-System Store on /share

chroot /container

Page 20: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Scientific Environments

Scientific end-users expect the environment to be set up for them, without prior knowledge about the

specifics of the cluster.

Service Cluster Compute ClusterStorage

/home//proj/

Enginerank0

Enginerank1

Enginerank2

EngineAI

Page 21: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Service vs Batch Scheduling

Traditionally container workloads are scheduled in a descriptive manner, as tasks (pods) on worker nodes.

HPC schedules a workloads as a batch job on multiple nodes.

Docker Engine

SWARM Kubernetes

Shared System

process1 process2

node0

Shared System

job-process2

agent

nodeN

job-process2

agent

manager

controller Distributed Process

Page 22: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

HPC Workload SchedulerDEMO!

Page 23: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Current Solutions [cont]HPC-specific workaround

+ Drop-in replacement as it wraps the job

- Not OCI compliance

- No integration with upstream container ecosystem

- hard to combine with new workloads

node0

Shared System

job-process2

HPC-runtime

agent

nodeN

job-process2

agent

manager

controller

Page 24: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

HPC Challenges

Page 25: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Kernel-bypassing Devices

To achieve the highest performance possible the kernel got squeezed out of the equation for

performance-critical parts.

Hardware

OS Kernel

Userland

ETH

TCP/IP

GPU

CUDA

IB

OFEDlibnet

Application

Page 26: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

The Stack

runc

containerd

Engine

Client

--device=/dev/nvidia0

"Devices": [{

"PathOnHost": "/dev/nvidia0",

"PathInContainer": "/dev/nvidia0",

"CgroupPermissions": "rwm"

}],

"devices": [{

"path": "/dev/nvidia0", "type": "c",

"major": 195, "minor": 0,"fileMode": 8630,

"uid": 0, "gid": 0

},

"hooks": { "prestart": [ {"path": “/usr/local/bin/nvidia.sh"}]}

Page 27: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Houdini Plugin

Page 28: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Houdini Plugin [cont]DEMO!

Page 29: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

HPC Opportunities

Page 30: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Leveraging HPC in the Enterprise / Enterprise in HPC

Image Registry Security scan& sign

Traditional

Third Party

HPC Workloads

docker store

Control Plane

Page 31: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

HPC @Docker: What’s next?

Page 32: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Convergence of AI and HPC

advanced

Multi Node, Shared Storage

intermediate

Single Node, Shared Storage

beginnerCom

plex

ity

Maturity

Single node, Local Storage

non-GPU

MPI

shared file-system

device passthrough

a.k.a. HPC!

Page 33: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Evidence for AI/DL trends

Page 34: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

StepsLow-hanging and high hanging fruit

Device Passthrough

Rather simple

Shared File-System

get UID:GID(s) from somewhere

MPI enablement

● Engine vs. Orchestrator vs. Workload Manager: Who is in charge?

● including PMIx into the engine?

● simple batch scheduling in SWARM?

Docker Ecosystem Goodies

● Secure Supply Chain

● Reproducible, OS idenpenced science

Page 35: High Performance Containers · Linux-VServer ports kernel isolation, but requires recompilation Solaris Zones bring the concept of snapshots Google introduces Process Containers,

Docker for Science