Containers and isolation as implemented in the Linux kernel
Technical Deep Dive Session
Hannes Frederic Sowa <[email protected]>
Senior Software Engineer
13 September 2016
Outline
Containers and isolation as implemented in the Linux kernel
Learned from history and enhanced and innovated in Free Software.
● Overview of not so recent history from other operating systems
● Representation and control from user space
● Implementation details in the kernel
● What is coming?
History of operating system isolation
• Plan 9 per-process namespaces
  • Distributed computing
  • Architecture-specific files mapped via bind/union mounts
  • Directory vnodes had an append operation
  • User space server via the 9P protocol
  • Not yet implemented in Linux: RPC via AF_UNIX over NFS
History of operating system isolation
• POSIX chroot
  • Available as a syscall, thus usable in self-written applications
  • Provides a new filesystem view, thus only limited isolation
• FreeBSD’s jails
  • Strongly integrated into the operating system
  • Only a small helper library available
  • No operating system control and tuning
  • Limited network isolation, based only on IP addresses
• Solaris Zones
  • Strongly integrated into the operating system (even the package manager)
  • Tooling is dictated by Solaris tools
Namespace API design in Linux
• Isolation and resource management completely decoupled
• API never tightly coupled to any user space library
  • Paved the way for many user space frameworks (e.g. Docker)
• Syscalls openly documented and reusable by third-party software
• Management available with already known kernel primitives
  • With rather primitive tools; nearly no new tools were needed
• Fine-grained control over which primitives to namespace
  • Opt-in model
• Easy to enhance in user space as well as in the kernel
Isolation vs. Resource Management
cgroups: resource management
namespaces: isolation
[Figure: Processes 1 to 4, grouped into cgroup1 and cgroup2 and, independently, into ns1 and ns2]
Not completely orthogonal but still...
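The two views can be inspected side by side through procfs. The following is a minimal sketch, not a tool from the talk: the function name `show_membership` and its optional second argument (an alternative procfs root, useful for trying the function against a copy of the tree) are assumptions made for illustration.

```shell
# Print the resource-management view (cgroups) and the isolation view
# (namespaces) of one process. Both are read from procfs.
# $1: PID (default: self)
# $2: procfs root (default: /proc; hypothetical parameter for experiments)
show_membership() (
    pid=${1:-self}
    proc_root=${2:-/proc}

    echo "cgroups:"
    # Each line has the form hierarchy-ID:controller-list:cgroup-path
    sed 's/^/  /' "$proc_root/$pid/cgroup"

    echo "namespaces:"
    for link in "$proc_root/$pid"/ns/*; do
        # Each entry is a symlink such as net -> net:[4026531969]
        printf '  %s\n' "$(readlink "$link")"
    done
)
```

Calling `show_membership $$` prints both lists for the current shell. Two processes can share every namespace and still sit in different cgroups, or the other way round.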
Namespaces in regular use
Even on non-servers namespaces see regular use nowadays:
$ lsns
        NS TYPE NPROCS  PID USER  COMMAND
4026531836 pid      63 2028 hsowa /usr/lib/systemd/systemd --user
4026531837 user     63 2028 hsowa /usr/lib/systemd/systemd --user
4026531838 uts      70 2028 hsowa /usr/lib/systemd/systemd --user
4026531839 ipc      70 2028 hsowa /usr/lib/systemd/systemd --user
4026531840 mnt      70 2028 hsowa /usr/lib/systemd/systemd --user
4026531969 net      63 2028 hsowa /usr/lib/systemd/systemd --user
4026532501 pid       2 3485 hsowa /opt/google/chrome/chrome --type=zygote
4026532503 net       6 3485 hsowa /opt/google/chrome/chrome --type=zygote
4026532621 pid       1 3486 hsowa /opt/google/chrome/nacl_helper
4026532623 net       1 3486 hsowa /opt/google/chrome/nacl_helper
4026532724 user      1 3486 hsowa /opt/google/chrome/nacl_helper
4026532725 user      6 3485 hsowa /opt/google/chrome/chrome --type=zygote
...
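The NPROCS column can be approximated with nothing but procfs and standard text tools. This is a rough sketch in the spirit of lsns, not its actual implementation; the function name `count_ns_members` and the optional procfs root argument are assumptions.

```shell
# Count processes per namespace of one type by reading /proc/<pid>/ns/<type>.
# $1: namespace type (pid, net, mnt, ...)
# $2: procfs root (default: /proc; hypothetical parameter for experiments)
count_ns_members() (
    ns_type=$1
    proc_root=${2:-/proc}
    for link in "$proc_root"/[0-9]*/ns/"$ns_type"; do
        # Entries that cannot be read (e.g. other users' processes)
        # are skipped silently.
        readlink "$link" 2>/dev/null
    done | sort | uniq -c | awk '{ print $2, $1 }'
)
```

`count_ns_members net` prints one line per network namespace, followed by the number of visible processes in it. Without root, only the caller's own processes are readable, so the counts can be lower than what lsns reports for root.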
Namespace API wrap-up
• No dependencies on third-party libraries or tools
• No design mandated by the operating system or distributions
• Resource management independent from isolation
• Made several management tools possible (some specialized)
  • iproute2, systemd, rkt, Docker, LXC, LXD, lmctfy, runc
• Free choice between a complete distribution and a specialized init
• … or maybe just running the application directly in a namespace
• OpenVZ/Virtuozzo reuses and contributes to namespaces upstream
Representation and control from user space
Each process is associated with one namespace of each type:
# ls -l /proc/self/ns/
total 0
lrwxrwxrwx. 1 root hsowa 0 12. Sep 22:09 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 net -> 'net:[4026531969]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 uts -> 'uts:[4026531838]'
# unshare -n    # -n :: unshare the network namespace
# ls -l /proc/self/ns/net
lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
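Because the link target carries the namespace type and inode number, comparing two link targets is enough to tell whether two processes share a namespace. A small sketch of that check follows; `same_namespace` and its optional procfs root argument are illustrative names, not part of any existing tool.

```shell
# Succeed (exit status 0) if two processes share the namespace of one type,
# fail with 1 if they do not, and with 2 if a link cannot be read.
# $1, $2: PIDs   $3: namespace type (net, pid, mnt, ...)
# $4: procfs root (default: /proc; hypothetical parameter for experiments)
same_namespace() (
    proc_root=${4:-/proc}
    a=$(readlink "$proc_root/$1/ns/$3") || exit 2
    b=$(readlink "$proc_root/$2/ns/$3") || exit 2
    [ "$a" = "$b" ]
)
```

For example, `same_namespace $$ 1 net && echo shared` reports whether the current shell is still in the network namespace of PID 1. Reading another user's `/proc/<pid>/ns` entries usually requires root.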
Making namespaces persistent
Managing namespaces as a mountpoint:
# unshare -n    # -n :: unshare the network namespace
# ls -l /proc/self/ns/net
lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
# touch /run/netns/my_namespace1
# mount -o bind /proc/self/ns/net /run/netns/my_namespace1
# ls -i /run/netns/my_namespace1
4026532727 /run/netns/my_namespace1
# exit
# readlink /proc/self/ns/net
net:[4026531969]
# nsenter --net=/run/netns/my_namespace1
# readlink /proc/self/ns/net
net:[4026532727]
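The interactive steps above can be folded into one helper. This is a sketch under assumptions: the names `run` and `persist_netns` and the `DRY_RUN` switch are hypothetical, and the real commands need root. With `DRY_RUN` set, the commands are only printed.

```shell
# Run a command, or only print it when DRY_RUN is set (hypothetical switch).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a new network namespace and pin it under /run/netns/<name>.
# $1: name of the namespace
persist_netns() {
    target=/run/netns/$1
    run mkdir -p /run/netns
    run touch "$target"
    # The bind mount is made from inside the new namespace, so that
    # /proc/self/ns/net refers to the namespace that was just created.
    # The mount itself lands in the unchanged mount namespace and
    # therefore outlives the short-lived mount process.
    run unshare -n mount --bind /proc/self/ns/net "$target"
}
```

`DRY_RUN=1 persist_netns my_namespace1` shows the three commands without touching the system. Run as root without `DRY_RUN`, the namespace can afterwards be entered with `nsenter --net=/run/netns/my_namespace1`.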
User namespaces
• User namespaces have a special role as they directly influence permission control
• Allow becoming root inside a user-created namespace
• Dissociate permissions from the parent namespace
• Example:
$ id -u
1000
$ unshare --user -r bash
# id -u
0
# unshare -n
# nc -l 80    # netcat is allowed to bind to port 80
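Behind `unshare --user -r` sits a UID mapping that the kernel exposes in `/proc/<pid>/uid_map`. Each line holds three numbers: the first ID inside the namespace, the first ID outside of it, and the length of the range. The sketch below translates an inside UID to the outside UID from such lines; the function name `map_uid_to_parent` is an assumption for illustration.

```shell
# Translate a UID seen inside a user namespace to the UID outside of it.
# Reads lines in the format of /proc/<pid>/uid_map on standard input:
#   <first ID inside> <first ID outside> <range length>
# $1: UID inside the namespace
# Prints nothing and fails if the UID is not covered by any range.
map_uid_to_parent() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { exit !found }
    '
}
```

In the session above, the map would consist of the single line `0 1000 1`, so `map_uid_to_parent 0 < /proc/self/uid_map` run inside the namespace yields 1000: root inside is the unprivileged user outside. GID mappings in `gid_map` use the same format.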
Easier management: netns
OpenStack already uses a lightweight wrapper around these to manage netns:
# ip netns add foo
# ip netns add bar
# ip link add type veth
# ip link set dev veth0 netns foo
# ip link set dev veth1 netns bar
# ip netns exec foo bash
# ip l l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ip_vti0@NONE: <NOARP> mtu 1332 qdisc noop state DOWN mode DEFAULT group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
47: veth0@if48: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:e5:a7:2f:d5:69 brd ff:ff:ff:ff:ff:ff link-netnsid 1
# exit
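The session stops with both links still down. A possible continuation that also assigns addresses and tests connectivity is sketched here. The helper names `run` and `connect_netns`, the `DRY_RUN` switch, and the addresses from the documentation range 192.0.2.0/24 are all assumptions and do not come from the talk; the real commands need root.

```shell
# Run a command, or only print it when DRY_RUN is set (hypothetical switch).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create two network namespaces and connect them with a veth pair.
# $1, $2: names of the two network namespaces
connect_netns() {
    run ip netns add "$1"
    run ip netns add "$2"
    # Name both ends explicitly instead of relying on automatic naming.
    run ip link add veth0 type veth peer name veth1
    run ip link set dev veth0 netns "$1"
    run ip link set dev veth1 netns "$2"
    # Addresses from the documentation range 192.0.2.0/24 (assumption).
    run ip netns exec "$1" ip addr add 192.0.2.1/24 dev veth0
    run ip netns exec "$2" ip addr add 192.0.2.2/24 dev veth1
    run ip netns exec "$1" ip link set dev veth0 up
    run ip netns exec "$2" ip link set dev veth1 up
    # One echo request from the first namespace to the second.
    run ip netns exec "$1" ping -c 1 192.0.2.2
}
```

`DRY_RUN=1 connect_netns foo bar` prints the command sequence for review before anything is changed.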
Representation wrap-up
• Namespaces are internally represented by normal inodes living in their own filesystem; these inodes are globally valid
• Thus file descriptor passing works as usual
• Persisting a namespace is simply achieved by bind mounting its representative file to a “stable location”
• Easy atomic utilities map directly to the respective syscalls
  • unshare(1) → unshare(2) or clone(2)
  • nsenter(1) → setns(2)
  • mount is really just mounting
Implementation details in the kernel
• struct user_namespace
  • Establishes its own configurable UID and GID mapping
• struct nsproxy
  • struct uts_namespace
    • Isolates hostname and domain name (e.g. for auth purposes)
  • struct ipc_namespace
    • Isolates POSIX message queues and System V IPC (message queues, semaphores, shared memory)
  • struct mnt_namespace
    • Abstraction and isolation of the filesystem views
  • struct pid_namespace
    • Isolates the process tree and PID numbers
  • struct net
    • Isolates network interfaces, routing tables and IP addresses
  • struct cgroup_namespace (recent development)
    • Control group namespace, isolates the view of resource management
17
Mount namespace
• The most important namespace, as it also provides the isolation for /proc and (partially) for sysfs, which should be remounted in a new container
• Mount namespaces basically form trees in the kernel which can partially overlap (mount subtrees)
• A process is attached to one subtree
• Discovered via nsproxy
18
System configuration (netns)
• Configuration, routing tables, firewall etc. are all separated per network namespace. How?
• System configuration is mostly done via sysctl
• A lot of sysctls are manageable per namespace
• A network namespace has its own sysctls in struct net
  • Incoming packets use the configuration of the network namespace of the incoming interface
  • Outgoing packets can use the namespace of the socket (locally generated) or the device context
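From user space, the per-namespace sysctls are visible through `/proc/sys/net`, which reflects the network namespace of the reading process. The following sketch reads one value through procfs; the function name `read_sysctl` and its optional procfs root argument are assumptions for illustration.

```shell
# Read one sysctl value through procfs.
# $1: dotted sysctl name, e.g. net.ipv4.ip_forward
# $2: procfs root (default: /proc; hypothetical parameter for experiments)
read_sysctl() (
    proc_root=${2:-/proc}
    # net.ipv4.ip_forward -> net/ipv4/ip_forward. Names with a dot inside
    # one component (for example VLAN interface names) are not handled.
    path=$(printf '%s' "$1" | tr . /)
    cat "$proc_root/sys/$path"
)
```

Calling `read_sysctl net.ipv4.ip_forward` once on the host and once from a shell started with `ip netns exec foo bash` (with the function defined there as well) can return different values, since each network namespace keeps its own copy of this setting.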
What is coming?
• Basically, the namespace concept is architecturally completely implemented
• New features added to the kernel are already designed in an orthogonal way or can deal with namespaces correctly
• Network namespaces are heavyweight, thus
  • Connecting a netns to the outside world requires a virtual router or bridge
  • Alternatives exist but are architecturally a dead end
    • ipvlan: multiplexes IP addresses on one interface
    • macvlan: multiplexes MAC addresses on one interface
• Provide isolation on the IP layer like FreeBSD jails or Solaris
• Maybe even extended to act like VRF with sockets