App Container and rkt


github.com/coreos/rkt
rkt-dev@googlegroups.com

App Container
github.com/appc

appc-dev@googlegroups.com

Yifan Gu
github.com/yifan-gu
@yifan7702

With containers, what does a "Linux Distro" mean?

[Diagram: the traditional stack, where the kernel, systemd, SSH, Python, Java, nginx, MySQL and OpenSSL are all part of the distro, with the APP on top]

[Diagram: the container stack, where only the kernel, systemd and SSH come from the host distro; LXC/Docker/rkt runs the APP together with its Python/Java/nginx/MySQL/OpenSSL dependencies, decoupled from the distro]
The Bad
$ python --version
Python 2.7.6
$ python app-requiring-python3.py

$ python --version
Python 3.4.3
$ python app-requiring-python2.py

package collisions

The Bad
$ cat /etc/os-release | grep ^NAME=
NAME=Fedora

$ rpm -i package-from-suse.rpm
file /foo from install of package-from-suse.rpm conflicts with file from package-from-fedora

dependency namespacing

The Good
$ gpg --list-only --import \
    /etc/apt/trusted.gpg.d/*
gpg: key 2B90D010: public key "Debian Archive Automatic Signing Key (8/jessie) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
gpg: no ultimately trusted keys found

users control trust

The Good
$ rsync ftp.us.debian.org::debian \
    /srv/mirrors/debian
$ dpkg -i \
    /srv/mirrors/debian/kernel-image-3.16.0-4-amd64-di_3.16.7-ckt9-2_amd64.udeb

trivial mirroring and hosting

Linux Packages 2.0
.deb and .rpm for containers

Container vs VM?
● Lightweight (100s vs 10s)
● Easy to deploy
● Less isolation?

What is a container?
● Packaging your apps together with their dependencies
● Running in isolation (using namespaces, cgroups)

Why would I want to use it?
● Deploy faster
● Run faster, run everywhere
● Run in isolation

App Container (appc)
github.com/appc

appc-dev@googlegroups.com

appc != rkt

Application Containers
self-contained, portable (decoupled from the operating system)
isolated (memory, network, …)

appc principles
Why are we doing this?

Open
Independent GitHub organisation

Contributions from Cloud Foundry, Mesosphere, Google, Red Hat

(and many others!)

Simple but efficient
Simple to understand and implement, but with an eye to optimisation (e.g. content-based caching)

Secure
Cryptographic image addressing
Image signing and encryption
Container identity

Standards-based
Well-known tools (tar, gzip, gpg, http), extensible with modern technologies (bittorrent, xz)

Composable
Integrate with existing systems
Non-prescriptive about build workflows
OS/architecture agnostic

appc components

Image Format
Application Container Image (ACI)
tarball of rootfs + manifest
uniquely identified by ImageID (hash)
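For illustration, a minimal sketch of how an ACI looks on disk (image name and contents are hypothetical):

# an ACI is a tarball containing an image manifest plus a rootfs
$ tar tf hello-0.0.1-linux-amd64.aci
manifest
rootfs/
rootfs/bin/hello

# the ImageID is a cryptographic hash of the (uncompressed) image
$ gunzip -c hello-0.0.1-linux-amd64.aci | sha512sum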

Image Discovery
App name → artifact
example.com/http-server
coreos.com/etcd
HTTPS + HTML
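Roughly how HTTPS + HTML discovery resolves a name (the URL template below is illustrative, following the appc discovery convention of an ac-discovery meta tag):

# the client fetches the app name over HTTPS and looks for an ac-discovery meta tag
$ curl -sL 'https://example.com/http-server?ac-discovery=1' | grep ac-discovery
<meta name="ac-discovery" content="example.com/http-server https://example.com/http-server-{version}-{os}-{arch}.{ext}">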

Executor (Runtime)
grouped applications
runtime environment
isolators
networking

Metadata Service
http://$AC_METADATA_URL/acMetadata
container metadata
container identity (HMAC verification)

ACE validator
is this executor compliant with the spec?
$EXECUTOR run ace_validator.aci

appc community

github.com/cdaylward/libappc

C++ library for working with app containers

github.com/cdaylward/nosecone

C++ executor for running app containers

mesos (wip)
https://issues.apache.org/jira/browse/MESOS-2162

github.com/3ofcoins/jetpack

FreeBSD Jails/ZFS-based executor (by @mpasternacki)

github.com/sgotti/acido

ACI toolkit (build ACIs from ACIs)

github.com/appc/docker2aci
docker2aci busybox/latest

docker2aci quay.io/coreos/etcd

github.com/appc/goaci

goaci github.com/coreos/etcd

appc spec in a nutshell

- Image Format (ACI): what does an application consist of?
- Image Discovery: how can an image be located?
- Pods: how can applications be grouped and run?
- Executor (runtime): what does the execution environment look like?

appc status
Stabilising towards first backwards-compatible release

github.com/coreos/rkt

rkt
an implementation of appc

Open standards. Composability.

rkt

rkt
a modern, secure container runtime

rkt
simple CLI tool

simple CLI tool
golang + Linux
self-contained
init system/distro agnostic

simple CLI tool
no daemon
no API*
apps run directly under the spawning process
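For illustration, a typical invocation (image name and version are examples); the app ends up as a child of whatever shell or init process called rkt:

# discover, fetch and run an image directly from this shell
# (assuming the signing key has already been trusted; see the security section)
$ sudo rkt run coreos.com/etcd:v2.0.9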

[Diagram: rkt invoked directly from bash, from runit, or from systemd; in each case the application(s) run as children of rkt under that spawning process]

rkt internals
modular architecture
execution divided into stages: stage0 → stage1 → stage2

[Diagram: invoking process (bash/runit/systemd/...) → rkt (stage0) → pod (stage1) → app1 and app2 (stage2)]

stage0 (rkt binary)
discover, fetch, manage application images
set up pod filesystems
commands to manage pod lifecycle

stage0 (rkt binary)

- rkt run

- rkt prepare

- rkt run-prepared

- rkt list

- rkt status

- ...

- rkt fetch

- rkt trust

- rkt image list

- rkt image export

- rkt image gc

- ...

stage0 (rkt binary)
file-based locking for concurrent operation (e.g. rkt gc, rkt list for pods)
database + reference counting for images
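A sketch of the pod lifecycle those stage0 commands cover (the UUID shown is illustrative):

# prepare a pod without starting it; rkt prints the pod UUID
$ sudo rkt prepare coreos.com/etcd:v2.0.9
c9fad0e6-…

# start the prepared pod, then inspect and garbage-collect
$ sudo rkt run-prepared c9fad0e6-…
$ sudo rkt list
$ sudo rkt status c9fad0e6-…
$ sudo rkt gc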

[Diagram: invoking process → rkt (stage0) → pod (stage1) → app1 and app2 (stage2), with stage1 highlighted]

stage1
execution environment for pods
app process lifecycle management
isolators

stage1 (swappable)
binary ABI with stage0
stage0 calls execve() into stage1

stage1 (swappable)
● default implementation
  ○ based on systemd-nspawn + systemd
  ○ Linux namespaces + cgroups for isolation
● kvm implementation (see the example below)
  ○ based on lkvm + systemd
  ○ hardware virtualisation for isolation
● others?
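As an illustration, the stage1 can be swapped per invocation; the exact flag has changed across rkt versions (early releases used --stage1-image, later ones --stage1-path and friends), and the path below is an assumption:

# run the same pod with the lkvm-based stage1 instead of the nspawn default
$ sudo rkt run --stage1-image=/usr/lib/rkt/stage1-kvm.aci coreos.com/etcd:v2.0.9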

[Diagram: invoking process → rkt (stage0) → pod (stage1) → app1 and app2 (stage2), with stage2 highlighted]

stage2
actual app execution
independent filesystems (chroot)
shared namespaces, volumes, IPC, ...

rkt + systemd
The different ways rkt integrates with systemd

[Diagram: systemd on the host (systemctl) manages rkt]

systemd (on host)
optional
"systemctl stop" just works
socket activation
pod-level isolators: CPUShares, MemoryLimit
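A hedged example of that host-side integration (the rkt docs suggest placing pods in the machine slice; the generated unit name is illustrative):

# run a pod as a transient systemd service with a pod-level memory limit;
# systemd-run prints the generated unit name (e.g. run-rXXXX.service)
$ sudo systemd-run --slice=machine -p MemoryLimit=1G rkt run coreos.com/etcd:v2.0.9

# the usual systemd tooling then applies to that unit
$ sudo systemctl status run-rXXXX.service
$ sudo systemctl stop run-rXXXX.service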

[Diagram: systemd on the host (systemctl) → rkt → systemd-nspawn]

systemd-nspawn
default stage1, besides lkvm
taking care of most of the low-level things

[Diagram: systemd on the host (systemctl) → rkt → systemd-nspawn → systemd as pid 1 inside the container]

systemd (inside the container)
pid 1
service files
socket activation

[Diagram: systemd on the host → rkt → systemd-nspawn → systemd in the container → application]

application
app-level isolators: CPUShares, MemoryLimit
chrooted

[Diagram: the application's logs go to systemd-journald inside the container (journalctl)]

systemd-journald
no changes in apps required
logs in the container, available from the host with journalctl -m / -M
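For example (the machine name is illustrative; rkt registers pods under names like rkt-<uuid>):

# merge journals from the host and all containers
$ sudo journalctl -m

# or read the journal of one specific pod
$ sudo journalctl -M rkt-c9fad0e6-…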

[Diagram: rkt registers the container with systemd-machined on the host (machinectl); logs still flow to systemd-journald]

systemd-machined
register on distros using systemd
machinectl {show,status,poweroff…}
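For example (machine name illustrative):

# list machines registered with systemd-machined
$ machinectl list

# inspect or power off a specific pod
$ machinectl status rkt-c9fad0e6-…
$ sudo machinectl poweroff rkt-c9fad0e6-…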

[Diagram: the same host systemd / rkt / systemd-nspawn picture, leading into cgroups]

cgroups

What’s a control group? (cgroup)

● group processes together
● organised in trees
● applying limits to them as a group

cgroups

cgroup API

/sys/fs/cgroup/*
/proc/cgroups
/proc/$PID/cgroup

List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  │  ├─ NetworkManager.service
│  │  │  └─ cgroup.procs
│  │  ...
│  └─ machine.slice
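A quick way to poke at that API from a shell (paths as on a typical systemd host):

# which controllers does the kernel know about?
$ cat /proc/cgroups

# which cgroups does the current process belong to?
$ cat /proc/self/cgroup

# which processes are in a given cgroup?
$ cat /sys/fs/cgroup/systemd/system.slice/NetworkManager.service/cgroup.procs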

How systemd units use cgroups (w/ containers)

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...

cgroups mounted in the container
(the pod's own subtree RW, the rest of the hierarchy RO)

Example: memory isolator

Application Image Manifest:
"limit": "500M"

systemd service file:
[Service]
ExecStart=…
MemoryLimit=500M

systemd action:
write to memory.limit_in_bytes
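A hedged way to see the result on the host (the scope path is illustrative; 500M is 524288000 bytes):

# the limit systemd sets ends up in the memory controller for the app's cgroup
$ cat /sys/fs/cgroup/memory/machine.slice/machine-rkt….scope/system.slice/app.service/memory.limit_in_bytes
524288000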

Example: CPU isolator

Application Image Manifest:
"limit": "500m"

systemd service file:
[Service]
ExecStart=…
CPUShares=512
(500 millicores on a scale of 1024 shares per core rounds to 512)

systemd action:
write to cpu.shares

Unified cgroup hierarchy
● Multiple hierarchies:
  ○ one cgroup mount point for each controller (memory, cpu, etc.)
  ○ flexible but complex
  ○ cannot remount with a different set of controllers
  ○ difficult to give to containers in a safe way
● Unified hierarchy:
  ○ cgroup filesystem mounted only one time
  ○ still in development in Linux: mount with option "__DEVEL__sane_behavior"
  ○ initial implementation in systemd v226 (September 2015)
  ○ no support in rkt yet

rkt: a few other things

- rkt and security
- rkt API service (new!)
- rkt networking
- rkt and user namespaces
- rkt and production

rkt and security"secure by default"

rkt security
- image signature verification (see the example below)
- privilege separation
  - e.g. fetch images as non-root user
- SELinux integration
- kernel keyring integration (soon)
- lkvm stage1 for true hardware isolation
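A brief illustration of the signature-verification flow (same example image as elsewhere in this deck):

# keys must be explicitly trusted before images signed by them are accepted
$ sudo rkt trust --prefix=coreos.com/etcd

# fetching then verifies the ACI signature against the trusted key
$ sudo rkt fetch coreos.com/etcd:v2.0.9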

rkt API service (new!)
optional, gRPC-based API daemon
exposes information on pods and images
runs as an unprivileged user
easier integration with other projects
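For illustration (the service is read-only; its default listen address varies by version, so none is shown):

# start the optional API service; gRPC clients can then list pods and images
$ rkt api-service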

rkt networking
plugin-based
Container Networking Interface (CNI)

[Diagram: Container Runtime (e.g. rkt) → Container Networking Interface (CNI) → plugins: veth, macvlan, ipvlan, OVS]
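A hedged example of such a plugin configuration (file name, network name and subnet are illustrative; rkt reads network configs from /etc/rkt/net.d/):

# define a bridge network backed by the host-local IPAM plugin
$ cat /etc/rkt/net.d/10-containers.conf
{
    "name": "containers",
    "type": "bridge",
    "ipam": {
        "type": "host-local",
        "subnet": "10.1.0.0/16"
    }
}

# join that network when running a pod
$ sudo rkt run --net=containers coreos.com/etcd:v2.0.9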

Networking, the rkt way

Network tooling
● Linux can create pairs of virtual net interfaces
● They can be linked in a bridge
● IP masquerading via iptables

[Diagram: container1 and container2 each see an eth0 backed by a veth pair (veth1, veth2) plugged into a bridge on the host; outgoing traffic leaves through the host's eth0 with IP masquerading via iptables]
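A rough sketch of what a bridge-based setup does under the hood (interface names, container PID and subnet are illustrative):

# create a veth pair and move one end into the container's network namespace
$ sudo ip link add veth1 type veth peer name cveth1
$ sudo ip link set cveth1 netns $CONTAINER_PID

# attach the host end to a bridge
$ sudo ip link add bridge0 type bridge
$ sudo ip link set veth1 master bridge0
$ sudo ip link set veth1 up
$ sudo ip link set bridge0 up

# masquerade the containers' subnet behind the host's address
$ sudo iptables -t nat -A POSTROUTING -s 10.1.0.0/16 ! -o bridge0 -j MASQUERADE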

rkt and user namespaces

History of Linux namespaces
✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
… development still active

Why user namespaces?

● Better isolation
● Run applications which would need more capabilities
● Per-user limits
● Future?
  ○ Unprivileged containers: possibility to have containers without root

User ID ranges
[Diagram: the host's UID range runs from 0 to 4,294,967,295 (32-bit range); container 1 and container 2 each get their own 65536-UID slice, and the rest stays unmapped]

User ID mapping
/proc/$PID/uid_map: "0 1048576 65536"
[Diagram: container UIDs 0–65535 map to host UIDs 1048576–1114111; everything outside that window is unmapped on both sides]
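To make the mapping concrete (PID illustrative; the three fields are container-start UID, host-start UID, length):

# from the host: how a containerised process's UIDs map to host UIDs
$ cat /proc/$PID/uid_map
         0    1048576      65536
# UID 0 inside the container is UID 1048576 on the host; 65536 IDs are mapped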

Problems with container images

[Diagram: an Application Container Image (ACI) is downloaded once (e.g. from a web server) and shared; container 1 and container 2 each get their own filesystem built from that shared ACI plus a per-container overlayfs "upper" directory]

Problems with container images
● Files' UID / GID
● rkt currently only supports user namespaces without overlayfs (see the sketch below)
  ○ Performance loss: no COW from overlayfs
  ○ "chown -R" for every file in each container
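As an illustration of that limitation, enabling user namespaces in rkt at the time looked roughly like this (treat the flag names as an assumption; the support was experimental):

# experimental: run with a user namespace, which also means giving up overlayfs
$ sudo rkt run --private-users --no-overlay coreos.com/etcd:v2.0.9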

Problems with volumes

[Diagram: a directory on the host (e.g. under /home or /var) is bind mounted (rw / ro) into containers, showing up there as e.g. /my-app or /data]

/data
● mounted in several containers
● No UID translation
● Dynamic UID maps

User namespace and filesystem problem

● Possible solution: add options to mount() to apply a UID mapping

● rkt would use it when mounting:
  ○ the overlay rootfs
  ○ volumes

● Idea suggested on kernel mailing lists

rkt and production

- still pre-1.0
- unstable (but stabilising) CLI and API
- explicitly not recommended for production
  - although some early adopters

rkt v1.0.0
EOY (fingers crossed)
stable API
stable CLI

ready to use!

Questions?

github.com/coreos/rkt

coreos.com/careers (soon in Berlin!)

Join us!
