App Container and rkt


github.com/coreos/rkt
rkt-dev@googlegroups.com

App Container
github.com/appc

appc-dev@googlegroups.com

Yifan Gu
github.com/yifan-gu
@yifan7702

With containers, what does a "Linux Distro" mean?

[Diagram: the traditional stack, where the kernel, systemd, SSH, Python, Java, nginx, MySQL and OpenSSL are all part of the distro, with the APP on top]

[Diagram: the container stack, where only the kernel, systemd and SSH come from the host distro; LXC/Docker/rkt runs the APP together with its Python/Java/nginx/MySQL/OpenSSL dependencies, decoupled from the distro]
The Bad
$ python --version
Python 2.7.6
$ python app-requiring-python3.py

$ python --version
Python 3.4.3
$ python app-requiring-python2.py

package collisions

The Bad
$ cat /etc/os-release | grep ^NAME=
NAME=Fedora

$ rpm -i package-from-suse.rpm
file /foo from install of package-from-suse.rpm conflicts with file from package-from-fedora

dependency namespacing

The Good
$ gpg --list-only --import \
    /etc/apt/trusted.gpg.d/*
gpg: key 2B90D010: public key "Debian Archive Automatic Signing Key (8/jessie) <ftpmaster@debian.org>" imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
gpg: no ultimately trusted keys found

users control trust

The Good
$ rsync ftp.us.debian.org::debian \
    /srv/mirrors/debian
$ dpkg -i \
    /srv/mirrors/debian/kernel-image-3.16.0-4-amd64-di_3.16.7-ckt9-2_amd64.udeb

trivial mirroring and hosting

Linux Packages 2.0
.deb and .rpm for containers

Container vs VM?
● Lightweight (100s vs 10s)
● Easy to deploy
● Less isolation?

What is a container?
● Packaging your apps together with their dependencies
● Running in isolation (using namespaces, cgroups)

Why would I want to use it?
● Deploy faster
● Run faster, run everywhere
● Run in isolation

App Container (appc)
github.com/appc

appc-dev@googlegroups.com

appc != rkt

Application Containers
self-contained, portable (decoupled from the operating system)
isolated (memory, network, …)

appc principles
Why are we doing this?

Open
Independent GitHub organisation

Contributions from Cloud Foundry, Mesosphere, Google, Red Hat

(and many others!)

Simple but efficient
Simple to understand and implement, but with an eye to optimisation (e.g. content-based caching)

Secure
Cryptographic image addressing
Image signing and encryption
Container identity

Standards-based
Well-known tools (tar, gzip, gpg, http), extensible with modern technologies (bittorrent, xz)

Composable
Integrate with existing systems
Non-prescriptive about build workflows
OS/architecture agnostic

appc components

Image Format
Application Container Image (ACI)
tarball of rootfs + manifest
uniquely identified by ImageID (hash)
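For illustration, a minimal sketch of how an ACI looks on disk (image name and contents are hypothetical):

# an ACI is a tarball containing an image manifest plus a rootfs
$ tar tf hello-0.0.1-linux-amd64.aci
manifest
rootfs/
rootfs/bin/hello

# the ImageID is a cryptographic hash of the (uncompressed) image
$ gunzip -c hello-0.0.1-linux-amd64.aci | sha512sum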

Image Discovery
App name → artifact
example.com/http-server
coreos.com/etcd
HTTPS + HTML
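Roughly how HTTPS + HTML discovery resolves a name (the URL template below is illustrative, following the appc discovery convention of an ac-discovery meta tag):

# the client fetches the app name over HTTPS and looks for an ac-discovery meta tag
$ curl -sL 'https://example.com/http-server?ac-discovery=1' | grep ac-discovery
<meta name="ac-discovery" content="example.com/http-server https://example.com/http-server-{version}-{os}-{arch}.{ext}">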

Executor (Runtime)
grouped applications
runtime environment
isolators
networking

Metadata Service
http://$AC_METADATA_URL/acMetadata
container metadata
container identity (HMAC verification)

ACE validator
is this executor compliant with the spec?
$EXECUTOR run ace_validator.aci

appc community

github.com/cdaylward/libappc

C++ library for working with app containers

github.com/cdaylward/nosecone

C++ executor for running app containers

mesos (wip)
https://issues.apache.org/jira/browse/MESOS-2162

github.com/3ofcoins/jetpack

FreeBSD Jails/ZFS-based executor (by @mpasternacki)

github.com/sgotti/acido

ACI toolkit (build ACIs from ACIs)

github.com/appc/docker2aci
docker2aci busybox/latest

docker2aci quay.io/coreos/etcd

github.com/appc/goaci

goaci github.com/coreos/etcd

appc spec in a nutshell

- Image Format (ACI): what does an application consist of?
- Image Discovery: how can an image be located?
- Pods: how can applications be grouped and run?
- Executor (runtime): what does the execution environment look like?

appc status
Stabilising towards first backwards-compatible release

github.com/coreos/rkt

rkt
an implementation of appc

Open standards. Composability.

rkt

rkt
a modern, secure container runtime

rkt
simple CLI tool

simple CLI tool
golang + Linux
self-contained
init system/distro agnostic

simple CLI tool
no daemon
no API*
apps run directly under the spawning process
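For illustration, a typical invocation (image name and version are examples); the app ends up as a child of whatever shell or init process called rkt:

# discover, fetch and run an image directly from this shell
# (assuming the signing key has already been trusted; see the security section)
$ sudo rkt run coreos.com/etcd:v2.0.9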

[Diagram: rkt invoked directly from bash, from runit, or from systemd; in each case the application(s) run as children of rkt under that spawning process]

rkt internals
modular architecture
execution divided into stages: stage0 → stage1 → stage2

[Diagram: invoking process (bash/runit/systemd/...) → rkt (stage0) → pod (stage1) → app1 and app2 (stage2)]

stage0 (rkt binary)
discover, fetch, manage application images
set up pod filesystems
commands to manage pod lifecycle

stage0 (rkt binary)

- rkt run

- rkt prepare

- rkt run-prepared

- rkt list

- rkt status

- ...

- rkt fetch

- rkt trust

- rkt image list

- rkt image export

- rkt image gc

- ...

stage0 (rkt binary)
file-based locking for concurrent operation (e.g. rkt gc, rkt list for pods)
database + reference counting for images
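A sketch of the pod lifecycle those stage0 commands cover (the UUID shown is illustrative):

# prepare a pod without starting it; rkt prints the pod UUID
$ sudo rkt prepare coreos.com/etcd:v2.0.9
c9fad0e6-…

# start the prepared pod, then inspect and garbage-collect
$ sudo rkt run-prepared c9fad0e6-…
$ sudo rkt list
$ sudo rkt status c9fad0e6-…
$ sudo rkt gc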

[Diagram: invoking process → rkt (stage0) → pod (stage1) → app1 and app2 (stage2), with stage1 highlighted]

stage1
execution environment for pods
app process lifecycle management
isolators

stage1 (swappable)
binary ABI with stage0
stage0 calls execve() into stage1

stage1 (swappable)
● default implementation
  ○ based on systemd-nspawn + systemd
  ○ Linux namespaces + cgroups for isolation
● kvm implementation (see the example below)
  ○ based on lkvm + systemd
  ○ hardware virtualisation for isolation
● others?
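As an illustration, the stage1 can be swapped per invocation; the exact flag has changed across rkt versions (early releases used --stage1-image, later ones --stage1-path and friends), and the path below is an assumption:

# run the same pod with the lkvm-based stage1 instead of the nspawn default
$ sudo rkt run --stage1-image=/usr/lib/rkt/stage1-kvm.aci coreos.com/etcd:v2.0.9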

[Diagram: invoking process → rkt (stage0) → pod (stage1) → app1 and app2 (stage2), with stage2 highlighted]

stage2
actual app execution
independent filesystems (chroot)
shared namespaces, volumes, IPC, ...

rkt + systemd
The different ways rkt integrates with systemd

[Diagram: systemd on the host (systemctl) manages rkt]

systemd (on host)
optional
"systemctl stop" just works
socket activation
pod-level isolators: CPUShares, MemoryLimit
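A hedged example of that host-side integration (the rkt docs suggest placing pods in the machine slice; the generated unit name is illustrative):

# run a pod as a transient systemd service with a pod-level memory limit;
# systemd-run prints the generated unit name (e.g. run-rXXXX.service)
$ sudo systemd-run --slice=machine -p MemoryLimit=1G rkt run coreos.com/etcd:v2.0.9

# the usual systemd tooling then applies to that unit
$ sudo systemctl status run-rXXXX.service
$ sudo systemctl stop run-rXXXX.service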

[Diagram: systemd on the host (systemctl) → rkt → systemd-nspawn]

systemd-nspawn
default stage1, besides lkvm
taking care of most of the low-level things

[Diagram: systemd on the host (systemctl) → rkt → systemd-nspawn → systemd as pid 1 inside the container]

systemd (inside the container)
pid 1
service files
socket activation

[Diagram: systemd on the host → rkt → systemd-nspawn → systemd in the container → application]

application
app-level isolators: CPUShares, MemoryLimit
chrooted

[Diagram: the application's logs go to systemd-journald inside the container (journalctl)]

systemd-journald
no changes in apps required
logs in the container, available from the host with journalctl -m / -M
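For example (the machine name is illustrative; rkt registers pods under names like rkt-<uuid>):

# merge journals from the host and all containers
$ sudo journalctl -m

# or read the journal of one specific pod
$ sudo journalctl -M rkt-c9fad0e6-…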

[Diagram: rkt registers the container with systemd-machined on the host (machinectl); logs still flow to systemd-journald]

systemd-machined
register on distros using systemd
machinectl {show,status,poweroff…}
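For example (machine name illustrative):

# list machines registered with systemd-machined
$ machinectl list

# inspect or power off a specific pod
$ machinectl status rkt-c9fad0e6-…
$ sudo machinectl poweroff rkt-c9fad0e6-…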

[Diagram: the same host systemd / rkt / systemd-nspawn picture, leading into cgroups]

cgroups

What’s a control group? (cgroup)

● group processes together
● organised in trees
● applying limits to them as a group

cgroups

cgroup API

/sys/fs/cgroup/*
/proc/cgroups
/proc/$PID/cgroup

List of cgroup controllers
/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  │  ├─ NetworkManager.service
│  │  │  └─ cgroup.procs
│  │  ...
│  └─ machine.slice
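A quick way to poke at that API from a shell (paths as on a typical systemd host):

# which controllers does the kernel know about?
$ cat /proc/cgroups

# which cgroups does the current process belong to?
$ cat /proc/self/cgroup

# which processes are in a given cgroup?
$ cat /sys/fs/cgroup/systemd/system.slice/NetworkManager.service/cgroup.procs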

How systemd units use cgroups (w/ containers)

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...

cgroups mounted in the container
(the pod's own subtree RW, the rest of the hierarchy RO)

Example: memory isolator

Application Image Manifest:
"limit": "500M"

systemd service file:
[Service]
ExecStart=…
MemoryLimit=500M

systemd action:
write to memory.limit_in_bytes
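A hedged way to see the result on the host (the scope path is illustrative; 500M is 524288000 bytes):

# the limit systemd sets ends up in the memory controller for the app's cgroup
$ cat /sys/fs/cgroup/memory/machine.slice/machine-rkt….scope/system.slice/app.service/memory.limit_in_bytes
524288000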

Example: CPU isolator

Application Image Manifest:
"limit": "500m"

systemd service file:
[Service]
ExecStart=…
CPUShares=512
(500 millicores on a scale of 1024 shares per core rounds to 512)

systemd action:
write to cpu.shares

Unified cgroup hierarchy
● Multiple hierarchies:
  ○ one cgroup mount point for each controller (memory, cpu, etc.)
  ○ flexible but complex
  ○ cannot remount with a different set of controllers
  ○ difficult to give to containers in a safe way
● Unified hierarchy:
  ○ cgroup filesystem mounted only one time
  ○ still in development in Linux: mount with option "__DEVEL__sane_behavior"
  ○ initial implementation in systemd v226 (September 2015)
  ○ no support in rkt yet

rkt: a few other things

- rkt and security
- rkt API service (new!)
- rkt networking
- rkt and user namespaces
- rkt and production

rkt and security"secure by default"

rkt security
- image signature verification (see the example below)
- privilege separation
  - e.g. fetch images as non-root user
- SELinux integration
- kernel keyring integration (soon)
- lkvm stage1 for true hardware isolation
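A brief illustration of the signature-verification flow (same example image as elsewhere in this deck):

# keys must be explicitly trusted before images signed by them are accepted
$ sudo rkt trust --prefix=coreos.com/etcd

# fetching then verifies the ACI signature against the trusted key
$ sudo rkt fetch coreos.com/etcd:v2.0.9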

rkt API service (new!)
optional, gRPC-based API daemon
exposes information on pods and images
runs as an unprivileged user
easier integration with other projects
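For illustration (the service is read-only; its default listen address varies by version, so none is shown):

# start the optional API service; gRPC clients can then list pods and images
$ rkt api-service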

rkt networking
plugin-based
Container Networking Interface (CNI)

[Diagram: Container Runtime (e.g. rkt) → Container Networking Interface (CNI) → plugins: veth, macvlan, ipvlan, OVS]
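A hedged example of such a plugin configuration (file name, network name and subnet are illustrative; rkt reads network configs from /etc/rkt/net.d/):

# define a bridge network backed by the host-local IPAM plugin
$ cat /etc/rkt/net.d/10-containers.conf
{
    "name": "containers",
    "type": "bridge",
    "ipam": {
        "type": "host-local",
        "subnet": "10.1.0.0/16"
    }
}

# join that network when running a pod
$ sudo rkt run --net=containers coreos.com/etcd:v2.0.9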

Networking, the rkt way

Network tooling
● Linux can create pairs of virtual net interfaces
● They can be linked in a bridge
● IP masquerading via iptables

[Diagram: container1 and container2 each see an eth0 backed by a veth pair (veth1, veth2) plugged into a bridge on the host; outgoing traffic leaves through the host's eth0 with IP masquerading via iptables]
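A rough sketch of what a bridge-based setup does under the hood (interface names, container PID and subnet are illustrative):

# create a veth pair and move one end into the container's network namespace
$ sudo ip link add veth1 type veth peer name cveth1
$ sudo ip link set cveth1 netns $CONTAINER_PID

# attach the host end to a bridge
$ sudo ip link add bridge0 type bridge
$ sudo ip link set veth1 master bridge0
$ sudo ip link set veth1 up
$ sudo ip link set bridge0 up

# masquerade the containers' subnet behind the host's address
$ sudo iptables -t nat -A POSTROUTING -s 10.1.0.0/16 ! -o bridge0 -j MASQUERADE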

rkt and user namespaces

History of Linux namespaces
✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt
… development still active

Why user namespaces?

● Better isolation
● Run applications which would need more capabilities
● Per-user limits
● Future?
  ○ Unprivileged containers: possibility to have containers without root

User ID ranges
[Diagram: the host's UID range runs from 0 to 4,294,967,295 (32-bit range); container 1 and container 2 each get their own 65536-UID slice, and the rest stays unmapped]

User ID mapping
/proc/$PID/uid_map: "0 1048576 65536"
[Diagram: container UIDs 0–65535 map to host UIDs 1048576–1114111; everything outside that window is unmapped on both sides]
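To make the mapping concrete (PID illustrative; the three fields are container-start UID, host-start UID, length):

# from the host: how a containerised process's UIDs map to host UIDs
$ cat /proc/$PID/uid_map
         0    1048576      65536
# UID 0 inside the container is UID 1048576 on the host; 65536 IDs are mapped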

Problems with container images

[Diagram: an Application Container Image (ACI) is downloaded once (e.g. from a web server) and shared; container 1 and container 2 each get their own filesystem built from that shared ACI plus a per-container overlayfs "upper" directory]

Problems with container images
● Files' UID / GID
● rkt currently only supports user namespaces without overlayfs (see the sketch below)
  ○ Performance loss: no COW from overlayfs
  ○ "chown -R" for every file in each container
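As an illustration of that limitation, enabling user namespaces in rkt at the time looked roughly like this (treat the flag names as an assumption; the support was experimental):

# experimental: run with a user namespace, which also means giving up overlayfs
$ sudo rkt run --private-users --no-overlay coreos.com/etcd:v2.0.9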

Problems with volumes

[Diagram: a directory on the host (e.g. under /home or /var) is bind mounted (rw / ro) into containers, showing up there as e.g. /my-app or /data]

/data
● mounted in several containers
● No UID translation
● Dynamic UID maps

User namespace and filesystem problem

● Possible solution: add options to mount() to apply a UID mapping

● rkt would use it when mounting:
  ○ the overlay rootfs
  ○ volumes

● Idea suggested on kernel mailing lists

rkt and production

- still pre-1.0
- unstable (but stabilising) CLI and API
- explicitly not recommended for production
  - although some early adopters

rkt v1.0.0
EOY (fingers crossed)
stable API
stable CLI

ready to use!

Questions?

github.com/coreos/rkt

coreos.com/careers (soon in Berlin!)

Join us!
