unikernels: rise of the library hypervisor
Post on 15-Apr-2017
Unikernels: the Rise of the Library Hypervisor
Anil Madhavapeddy, @avsm
Mindy Preston, @yomimono
Martin Lucina
+ the MirageOS and Docker for Mac/Win teams
Docker Inc, @docker, with contributions from IBM
Docker Distributed Systems Summit, 7th October 2016, Berlin, Germany
Conventional hypervisors
• Run full guest operating systems with complex emulation needs.
• Scaffolding for device emulation, instruction emulation, etc.
• Hard to compose into existing infrastructure without wrapping a full hypervisor layer.
[Diagram: Xen Hypervisor on Hardware, with Dom0 (qemu, xenstored, xenconsoled) and DomU.]
Conventional hypervisors
CVE-2016-3710: VGA emulation missing bounds checks causes exploit.
CVE-2016-5403: unbounded virtio memory usage causes DoS.
CVE-2014-3672: unrestricted qemu logging causes DoS.
CVE-2015-8554: qemu-dm buffer overrun in MSI-X causes exploit.
CVE-2015-7504: heap overflow in pcnet emulator causes exploit.
How can distributed systems use hardware protection more flexibly and composably?
Recap: Unikernels
• "library operating systems" break kernels into libraries.
• Link libraries with a boot layer, scheduler and application.
• Portable microservices that boot directly on hypervisors or Unix.
[Diagram: unikernel stack layers. Application: Configuration, Business Logic. Libraries: HTTP, JSON, SSL, TCP/IP, Xen Devices, Unix libev, Unix musl libc. Deployment targets: Xen on Hardware, or a Docker App on Linux on Hardware.]
Recap: Unikernels
• Many benefits are lost when deploying on existing clouds.
• Tiny binaries (200k) still require scaffolding of a full OS to boot.
• Difficult to manage hypervisor from inside a container as full host privilege is needed.
Library Hypervisors
• Extend the "kit" model and break down hypervisor functionality into libraries.
• Expose core functionality (CPU and memory) as a library; other pieces (device emulation) are optional.
• Benefit: huge reduction in TCB, and better fit to container-native infrastructure with privilege dropping.
• Drawback: no existing support in operating systems.
But let's take a closer look!
What has changed?
• OSX Hypervisor framework: xHyve, HyperKit (Docker for Mac)
• Linux /dev/kvm: kvmtool, novm, ukvm (MirageOS 3)
• FreeBSD bHyve
bhyve.org
xhyve.org
github.com/docker/hyperkit
Docker for Mac
Aiming for a native OSX experience that works with existing developer workflows.
• Easy drag and drop installation, and autoupdates to get latest Docker.
• Secure, sandboxed virtualisation architecture without elevated privileges.
• Native networking support, with VPN and network sharing compatibility.
• File sharing between container and host: uid mapping, inotify events, etc.
Virtualisation
• Uses the new HyperKit framework, which is in turn based on xHyve and FreeBSD's bHyve.
• Sandbox friendly: processes largely run as non-root, with privileges of the local user.
• Embeds Linux: includes an embedded lightweight Alpine Linux distribution optimised for fast boot and stateless operation for containers.
[Diagram: OSX Kernel (Hypervisor.framework; hardware virt: VMX, nested paging) and Userspace (HyperKit user process: Thread/vCPU, traps on I/O pages, manages ACPI and PCI devices; VirtIO IPC, VirtIO Block, VirtIO Net; QCow2; VPNKit). Guest: Linux Kernel, Alpine Linux Userspace, latest Docker preconfigured, logs redirected to OSX host.]
$ docker info
Containers: 358
 Running: 13
 Paused: 0
 Stopped: 345
Images: 485
Server Version: 1.11.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.9-moby
Operating System: Alpine Linux v3.3
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.858 GiB
HyperKit library structure
• In HyperKit, most functionality is linked as a library.
• If the app doesn't need a protocol, it is not linked and so is not part of the trusted computing base.
Networking
• Want to hide the gory details of virtualisation from the user. The Linux VM should be "invisible".
• Not solving this leads to many user complaints:
  • VPN software and corporate installations do not like bridged virtual machines or custom routing. Result: container traffic cannot connect to the Internet.
  • Services cannot be exposed on localhost or the external interface and are instead on the Linux VM IP address. Result: breaks common web OAuth workflows.
[Diagram: OSX Kernel (Hypervisor.framework; hardware virt: VMX, nested paging) and Userspace (HyperKit with VirtIO IPC, VirtIO Block and VirtIO Net). Ethernet in from VirtIO Net passes through a Bridge to an Ethernet Kernel Module. Containers! Containers! Containers!]
Networking
• Challenge: Services publishing ports should be exposed on localhost without needing VM info.
• Solution: VPNKit forwards container port requests to an OSX service which binds them natively on its external interface.
• Benefits:
  • docker run -P on the Mac now works without requiring any knowledge of the VM innards.
  • External OAuth workflows operate with web apps.
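The port-forwarding idea on this slide can be sketched in Python. This is only an illustration under stated assumptions, not VPNKit's implementation: a host-side helper binds a port natively on the host and relays every connection to a backend address that stands in for the container. The names `forward_port`, `_handle` and `_pipe` are hypothetical.

```python
import socket
import threading


def _pipe(src, dst):
    """Copy bytes from src to dst until src reaches end-of-stream."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass  # peer went away; treat it as end-of-stream
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # propagate EOF to the other side
        except OSError:
            pass


def _handle(client, backend_addr):
    """Relay one accepted connection to the backend, in both directions."""
    try:
        backend = socket.create_connection(backend_addr)
    except OSError:
        client.close()
        return
    upstream = threading.Thread(target=_pipe, args=(client, backend), daemon=True)
    upstream.start()
    _pipe(backend, client)
    upstream.join()
    client.close()
    backend.close()


def forward_port(backend_addr, listen_host="127.0.0.1", listen_port=0):
    """Bind a host port and forward connections to backend_addr.

    Returns the listening socket; use getsockname() to find the bound port.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((listen_host, listen_port))
    listener.listen()

    def accept_loop():
        while True:
            try:
                client, _ = listener.accept()
            except OSError:
                return  # listener was closed
            threading.Thread(
                target=_handle, args=(client, backend_addr), daemon=True
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

Because the listening socket is an ordinary host socket, a client on the host connects to localhost and never needs to know the backend's address, which is the property the slide describes.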
Networking
[Diagram: same layout as before (OSX Kernel with Hypervisor.framework; HyperKit in Userspace with VirtIO IPC, VirtIO Block and VirtIO Net), first with Ethernet in passing through a Bridge to an Ethernet Kernel Module, then with VPNKit (MirageOS TCP/IP, DNS, Socketer) connecting Ethernet in to Kernel Sockets. Containers! Containers! Containers!]
github.com/docker/vpnkit
Networking
• Challenge: Deal with custom VPN software on the host that makes it difficult to bridge.
• Solution: VPNKit efficiently reconstructs container traffic into separate TCP/IP flows and translates them into native OSX/Windows sockets.
• Benefits:
  • All network traffic is generated from normal socket calls (e.g. gethostbyaddr) on the Mac, so it interacts well with firewalls, VPNs, and any local security policies.
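As a rough illustration of "reconstructing traffic into separate flows", the sketch below groups already-parsed TCP segments by their connection 4-tuple and joins each flow's payload in sequence order. It is a simplification under stated assumptions: the `Segment` type and `reconstruct_flows` function are hypothetical, and retransmissions, sequence wraparound and connection state are ignored, all of which a real TCP/IP stack such as the MirageOS one used by VPNKit has to handle.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """A parsed TCP segment (hypothetical, simplified representation)."""
    src: str
    sport: int
    dst: str
    dport: int
    seq: int
    payload: bytes


def reconstruct_flows(segments):
    """Group segments per connection and rebuild each byte stream.

    Returns a dict mapping (src, sport, dst, dport) to the payload bytes
    joined in ascending sequence-number order.
    """
    flows = defaultdict(list)
    for seg in segments:
        flows[(seg.src, seg.sport, seg.dst, seg.dport)].append(seg)
    return {
        key: b"".join(s.payload for s in sorted(segs, key=lambda s: s.seq))
        for key, segs in flows.items()
    }
```

Each reconstructed stream could then be written to an ordinary host socket opened for that destination, so that the host only ever sees normal socket calls.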
Docker for Mac + unikernels
• Native OSX application that uses HyperKit to virtualise for a domain-specific purpose ("docker run").
• Links MirageOS unikernel libraries for networking and storage translation between OS boundaries.
• The library approach let us glue together these components really easily.
• Docker for Mac is quite a complex distributed system internally, but one that is (hopefully) hidden from the user.
MirageOS 3 + Solo5
• Unikernels have been gathering pace; next challenge is to make them easily deployable.
• Build handled via Docker, but docker run shouldn't need privileges (e.g. to start a VM).
• MirageOS 3 has a new library hypervisor for Linux, developed by IBM, Docker and Cambridge University contributors.
mirage.io
MirageOS 3 + Solo5
• Source: https://github.com/Solo5/solo5
• Runs as a Unix process and opens /dev/kvm for hardware isolation.
• ukvm is a small, modular monitor that links only what is needed. Can be 10k in size!
• Can run privilege separated: one process opens /dev/kvm, drops privileges, and then executes the unikernel.
• Boot times are the same as process fork times, since all the device setup is handled in-process.
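The privilege-separation pattern described above (open the device while privileged, drop privileges, then execute) can be sketched in Python. This is not ukvm's code: the function names are hypothetical, and passing the descriptor number through a `KVM_FD` environment variable is an assumption made purely for illustration.

```python
import os


def open_device(path="/dev/kvm"):
    """Open the device node and mark the descriptor to survive exec."""
    fd = os.open(path, os.O_RDWR)
    os.set_inheritable(fd, True)
    return fd


def drop_privileges(uid, gid):
    """Permanently switch to an unprivileged user (requires root)."""
    os.setgroups([])  # clear supplementary groups first
    os.setgid(gid)    # group before user: setgid fails once root is given up
    os.setuid(uid)


def launch(monitor_argv, uid, gid, path="/dev/kvm"):
    """Open the device, drop privileges, then exec the monitor.

    monitor_argv is the command line of the program to run; it would
    read the already-open descriptor number from the (hypothetical)
    KVM_FD environment variable. This function does not return.
    """
    fd = open_device(path)
    if os.getuid() == 0:
        drop_privileges(uid, gid)
    env = dict(os.environ, KVM_FD=str(fd))
    os.execve(monitor_argv[0], monitor_argv, env)
```

The ordering matters: the device is opened before privileges are dropped, so the code that runs afterwards holds only the one descriptor it needs and none of the rights that were required to obtain it.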
MirageOS 3 + Solo5
Source: Dan Williams and Ricardo Koller, IBM Research, HotCloud 16
MirageOS 3 + Solo5
• Due for stable release in the next month.
• Intended to be a "unikernel template" for other projects to share hypervisor code.
• Liberally licensed under BSD/Apache2/ISC to encourage adoption and embedding.
• BoF and tutorials tomorrow to demonstrate it. Developers are all here and hacking!
Demo!
How can distributed systems use hardware protection more flexibly and composably?
Questions?
Download free at docker.com
Twitter: @avsm
https://github.com/docker/hyperkit
https://github.com/docker/vpnkit
https://github.com/docker/datakit
https://github.com/mirage/
We will be hacking tomorrow!
Backup Slides
Filesystem Sharing
• Challenge: Share arbitrary OSX directory tree into Linux container without requiring extensive modification of either side.
• Solution: Use a FUSE forwarding layer and translate Linux filesystem calls to OSX equivalents.
[Diagram labels: OSX Host, Linux Host, Container (VOLUME), com.docker.osxfs, FUSE. Notes: track extra metadata; translate to OSX filesystem calls.]
Filesystem Sharing
• Challenge: Need filesystem activation so events on the Mac wake up container servers and vice-versa.
• Solution: osxfs uses the FSEvents API and injects inotify activation events into the container.
[Diagram labels: OSX Host, Linux Host, Container (VOLUME), com.docker.osxfs, FUSE. Notes: FSEvents watches open files; events from Linux cause OSX apps to wake up.]
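One way to picture the FSEvents-to-inotify step is as a translation between event bit masks. The sketch below uses the flag values from Apple's FSEvents.h and Linux's sys/inotify.h, but the mapping itself, including how a rename is split into "moved from" and "moved to", is an assumption for illustration and not osxfs's actual translation. The function name `to_inotify_mask` is hypothetical.

```python
# FSEvents item flags (values as defined in Apple's FSEvents.h).
FSE_ITEM_CREATED = 0x00000100
FSE_ITEM_REMOVED = 0x00000200
FSE_ITEM_INODE_META_MOD = 0x00000400
FSE_ITEM_RENAMED = 0x00000800
FSE_ITEM_MODIFIED = 0x00001000

# inotify event masks (values as defined in Linux's sys/inotify.h).
IN_MODIFY = 0x00000002
IN_ATTRIB = 0x00000004
IN_MOVED_FROM = 0x00000040
IN_MOVED_TO = 0x00000080
IN_CREATE = 0x00000100
IN_DELETE = 0x00000200


def to_inotify_mask(fs_flags, path_exists=True):
    """Translate an FSEvents flag word into an inotify mask.

    FSEvents reports a rename as one flag on each path involved, so
    this sketch assumes the caller says whether the path still exists:
    an existing path is treated as the rename target.
    """
    mask = 0
    if fs_flags & FSE_ITEM_CREATED:
        mask |= IN_CREATE
    if fs_flags & FSE_ITEM_REMOVED:
        mask |= IN_DELETE
    if fs_flags & FSE_ITEM_MODIFIED:
        mask |= IN_MODIFY
    if fs_flags & FSE_ITEM_INODE_META_MOD:
        mask |= IN_ATTRIB
    if fs_flags & FSE_ITEM_RENAMED:
        mask |= IN_MOVED_TO if path_exists else IN_MOVED_FROM
    return mask
```

A forwarding layer would deliver the resulting mask, together with the translated path, to whichever container processes hold inotify watches on that path.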
Networking
• Challenge: Deal with custom VPN software on the host that makes it difficult to bridge.
• Solution: VPNKit efficiently reconstructs container traffic into separate TCP/IP flows and translates them into native OSX/Windows sockets.
[Diagram labels: OSX Host, Linux Host, Container (RUN <...>), com.docker.hyperkit-net, Ethernet bridge, DHCPv4, NTP, TCP flows. Notes: reconstruct traffic; translate to OSX socket calls.]
Networking
• Challenge: Services publishing ports should be exposed on localhost without needing VM info.
• Solution: VPNKit forwards container port requests to an OSX service which binds them natively on its external interface.
[Diagram labels: OSX Host, Linux Host, Container (EXPOSE, RUN <...>), Privileged Port Service, Port Service, VSock Binder, VSock Listener, Userland Proxy.]
Multi-CPU architectures
$ docker run resin/armv7hf-debian uname -a
Linux 7ed2fca7a3f0 4.1.12 #1 SMP Tue Jan 12 10:51:00 UTC 2016 armv7l GNU/Linux
$ docker run justincormack/ppc64le-debian uname -a
Linux edd13885f316 4.1.12 #1 SMP Tue Jan 12 10:51:00 UTC 2016 ppc64le GNU/Linux