COMP9242 Advanced OS S2/2016 W03: Virtualization @GernotHeiser


2 © 2016 Gernot Heiser. Distributed under CC Attribution License

Copyright Notice

These slides are distributed under the Creative Commons Attribution 3.0 License

•  You are free:
   –  to share: to copy, distribute and transmit the work
   –  to remix: to adapt the work
•  under the following conditions:
   –  Attribution: You must attribute the work (but not in any way that suggests that the author endorses you or your use of the work) as follows:
      “Courtesy of Gernot Heiser, UNSW Australia”

The complete license text can be found at http://creativecommons.org/licenses/by/3.0/legalcode

COMP9242 S2/2016 W03


Virtual Machine (VM)

“A VM is an efficient, isolated duplicate of a real machine” [Popek&Goldberg 74]

•  Duplicate: VM should behave identically to the real machine
   –  Programs cannot distinguish between real or virtual hardware
   –  Except for:
      o  Fewer resources (and potentially different between executions)
      o  Some timing differences (when dealing with devices)
•  Isolated: Several VMs execute without interfering with each other
•  Efficient: VM should execute at speed close to that of real hardware
   –  Requires that most instructions are executed directly by real hardware

Hypervisor aka virtual-machine monitor: Software implementing the VM

“Real machine”: Modern usage more general, “virtualise” any API


Types of Virtualisation

[Figure: four kinds of virtual machine. Platform VM or System VM: a hypervisor hosts VMs, each running an OS and its processes, either Type-1 “bare metal” (hypervisor directly on the processor) or Type-2 “hosted” (hypervisor on top of an operating system). OS-level VM: a virtualisation layer on the operating system hosts VMs that contain processes. Process VM: e.g. a Java program running on a Java VM inside a process.]

Plus anything else you want to sound cool!


Why Virtual Machines?

•  Historically used for easier sharing of expensive mainframes
   –  Run several (even different) OSes on same machine
      o  called guest operating system
   –  Each on a subset of physical resources
   –  Can run single-user single-tasked OS in time-sharing mode
      o  legacy support
•  Gone out of fashion in 80’s
   –  Time-sharing OSes common-place
   –  Hardware too cheap to worry...

[Figure: the hypervisor partitions RAM into memory regions; VM1 and VM2 each run a guest OS and apps on their own virtual RAM.]


Why Virtual Machines?

•  Renaissance in recent years for improved isolation
•  Server/desktop virtual machines
   –  Improved QoS and security
   –  Uniform view of hardware
   –  Complete encapsulation
      o  replication
      o  migration/consolidation
      o  checkpointing
      o  debugging
   –  Different concurrent OSes
      o  eg Linux + Windows
   –  Total mediation
•  Would be mostly unnecessary
   –  … if OSes were doing their job!


Gernot’s prediction of 2004: 2014 OS textbooks will be identical to 2004 version except for s/process/VM/g


Why Virtual Machines?

•  Core driver today is Cloud computing
   –  Increased utilisation by sharing hardware
   –  Reduced maintenance cost through scale
   –  On-demand provisioning
   –  Dynamic load balancing through migration

[Figure: A.com and B.com each run three dedicated servers (apps on an OS on its own hardware). In the cloud provider data centre the same OS-plus-apps stacks run as six VMs on two hypervisor hosts.]


Why Virtual Machines?

•  Embedded systems: integration of heterogeneous environments
   –  RTOS for critical real-time functionality
   –  Standard OS for GUIs, networking etc
•  Alternative to physical separation
   –  low-overhead communication
   –  size, weight and power (SWaP) reduction
   –  consolidate complete components
      o  including OS
      o  certified
      o  supplied by different vendors
      o  legacy support
   –  “dual-persona” phone
   –  secure domain on COTS device

[Figure: the hypervisor partitions RAM into two memory regions; VM1 runs a high-level OS with apps and VM2 runs an RTOS with critical software, each on its own virtual RAM.]


Hypervisor aka Virtual Machine Monitor

•  Program that runs on real hardware to implement the virtual machine
•  Controls resources
   –  Partitions hardware
   –  Schedules guests
      o  “world switch”
   –  Mediates access to shared resources
      o  e.g. console
•  Implications
   –  Hypervisor executes in privileged mode
   –  Guest software executes in unprivileged mode
   –  Privileged instructions in guest cause a trap into hypervisor
   –  Hypervisor interprets/emulates them
   –  Can have extra instructions for hypercalls

[Figure: natively, the OS runs directly on the hardware; virtualised, the guest OS runs on the hypervisor, which runs on the hardware.]


Native vs. Hosted VMM

•  Hosted VMM beside native apps
   –  Sandbox untrusted apps
   –  Convenient for running alternative OS on desktop
   –  leverage host drivers
•  Less efficient
   –  Double mode switches
   –  Double context switches
   –  Host not optimised for exception forwarding

[Figure: Native/Classic/Bare-metal/Type-I: guest OS on hypervisor on hardware. Hosted/Type-II: guest OS on hypervisor on host OS on hardware, with the hypervisor running beside native apps.]


Virtualization Mechanics: Instruction Emulation

•  Traditional trap-and-emulate (T&E) approach:
   –  guest attempts to access physical resource
   –  hardware raises exception (trap), invoking HV’s exception handler
   –  hypervisor emulates result, based on access to virtual resource
•  Most instructions do not trap
   –  prerequisite for efficient virtualisation
   –  requires VM ISA (almost) same as processor ISA

Guest:
    ld  r0, curr_thrd
    ld  r1, (r0,ASID)
    mv  CPU_ASID, r1
    ld  sp, (r1,kern_stk)

Exception (raised by the privileged mv to CPU_ASID)

Hypervisor:
    lda r1, vm_reg_ctxt
    ld  r2, (r1,ofs_r0)
    sto r2, (r1,ofs_ASID)
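The trap-and-emulate cycle can be sketched in a few lines of Python. This is a toy model, not any real hypervisor's code: the tuple instruction format, the `VirtualCPU` class and the helper names are assumptions made for illustration. Only the `CPU_ASID` register name comes from the slide.

```python
class PrivilegedTrap(Exception):
    """Models the hardware exception raised by a privileged instruction in user mode."""
    def __init__(self, reg, value):
        super().__init__(reg)
        self.reg = reg
        self.value = value


class VirtualCPU:
    """Hypothetical per-VM register context kept by the hypervisor."""
    def __init__(self):
        self.regs = {"r0": 0, "r1": 0}
        self.virt_asid = 0  # the guest's virtual copy of CPU_ASID


def guest_step(vcpu, instr):
    """Run one guest instruction 'directly'; only the privileged one traps."""
    op, dst, src = instr
    if op == "mv" and dst == "CPU_ASID":
        raise PrivilegedTrap(dst, vcpu.regs[src])
    if op == "li":          # load immediate (innocuous)
        vcpu.regs[dst] = src
    elif op == "mv":        # register-to-register move (innocuous)
        vcpu.regs[dst] = vcpu.regs[src]
    else:
        raise ValueError(f"unknown instruction {instr!r}")


def hypervisor_emulate(vcpu, trap):
    """Exception handler: apply the effect to the virtual resource only."""
    if trap.reg == "CPU_ASID":
        vcpu.virt_asid = trap.value
    else:
        raise NotImplementedError(trap.reg)


def run_guest(vcpu, program):
    """Execute a guest program and return how many instructions trapped."""
    traps = 0
    for instr in program:
        try:
            guest_step(vcpu, instr)
        except PrivilegedTrap as trap:
            traps += 1
            hypervisor_emulate(vcpu, trap)
    return traps


vcpu = VirtualCPU()
program = [("li", "r1", 7), ("mv", "CPU_ASID", "r1"), ("li", "r0", 1)]
trap_count = run_guest(vcpu, program)
```

Of the three guest instructions only the write to `CPU_ASID` reaches the hypervisor, which is the property that makes T&E efficient.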


Trap-and-Emulate Requirements

Definitions:
•  Privileged instruction: traps when executed in user mode
   –  Note: NO-OP is insufficient!
•  Privileged state: determines resource allocation
   –  Includes privilege mode, addressing context, exception vectors…
•  Sensitive instruction: control- or behaviour-sensitive
   –  control sensitive: changes privileged state
   –  behaviour sensitive: exposes privileged state
      o  incl instructions which are NO-OPs in user but not privileged state
•  Innocuous instruction: not sensitive

•  Some instructions are inherently sensitive
   –  eg TLB load
•  Others are context-dependent
   –  eg store to page table


Trap-and-Emulate Architectural Requirements

•  T&E virtualisable: all sensitive instructions are privileged
   –  Can achieve accurate, efficient guest execution
      o  … by simply running guest binary on hypervisor
   –  VMM controls resources
   –  Virtualized execution indistinguishable from native, except:
      o  resources more limited (smaller machine)
      o  timing differences (if there is access to real time clock)
•  Recursively virtualisable:
   –  run hypervisor in VM
   –  possible if hypervisor not timing dependent, overheads low
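The T&E condition is a subset test: the set of sensitive instructions must be contained in the set of privileged instructions. A minimal sketch follows; the instruction names belong to a made-up toy ISA and are assumptions, not taken from any real architecture.

```python
def is_te_virtualisable(sensitive, privileged):
    """True if every sensitive instruction is also privileged (so it traps)."""
    return set(sensitive) <= set(privileged)


def problem_instructions(sensitive, privileged):
    """Sensitive instructions that would silently execute in user mode."""
    return sorted(set(sensitive) - set(privileged))


# Hypothetical toy ISA: read_psr exposes privileged state but does not trap.
toy_privileged = {"tlb_load", "set_asid", "halt"}
toy_sensitive = {"tlb_load", "set_asid", "read_psr"}

virtualisable = is_te_virtualisable(toy_sensitive, toy_privileged)
offenders = problem_instructions(toy_sensitive, toy_privileged)
```

Here `read_psr` is behaviour-sensitive yet unprivileged, so this toy ISA is not T&E virtualisable without impure techniques.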


Impure Virtualization

•  Virtualise other than by T&E of unmodified binary
•  Two reasons:
   –  Architecture not T&E virtualisable
   –  Reduce virtualisation overheads
•  Change guest OS, replacing sensitive instructions
   –  by trapping code (“hypercalls”)
   –  by in-line emulation code
•  Two approaches
   –  binary translation: change binary
   –  para-virtualisation: change ISA

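Original guest code:
    ld  r0, curr_thrd
    ld  r1, (r0,ASID)
    mv  r1, PSR
    ld  sp, (r1,kern_stk)

Sensitive instruction replaced by trapping code:
    ld  r0, curr_thrd
    ld  r1, (r0,ASID)
    trap
    ld  sp, (r1,kern_stk)

Sensitive instruction replaced by a jump to in-line emulation code:
    ld  r0, curr_thrd
    ld  r1, (r0,ASID)
    jmp fixup_15
    ld  sp, (r1,kern_stk)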


Binary Translation

•  Locate sensitive instructions in guest binary, replace on-the-fly by emulation or trap/hypercall
   –  pioneered by VMware
   –  detect/replace combinations of sensitive instructions for performance
   –  modifies binary at load time, no source access required
•  Looks like pure virtualisation!
•  Very tricky to get right (especially on x86!)
   –  Assumptions needed about sane guest behaviour
   –  “Heroic effort” [Orran Krieger, then IBM, later VMware] 😊
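The rewriting step can be illustrated on the pseudo-assembly used in these slides. This is a deliberately naive sketch: real binary translators work on machine code and must deal with variable-length instructions, self-modifying code and jump targets. The `SENSITIVE_OPERANDS` set and the `fixup_<index>` label scheme are assumptions for illustration.

```python
# Operands whose use makes an instruction sensitive (assumed for this sketch).
SENSITIVE_OPERANDS = {"PSR", "CPU_ASID"}


def is_sensitive(instr):
    """An instruction is treated as sensitive if any operand is a privileged register."""
    _, _, operands = instr.partition(" ")
    return any(arg.strip() in SENSITIVE_OPERANDS for arg in operands.split(","))


def translate(block):
    """Replace each sensitive instruction by a jump to emulation code.

    Returns the rewritten block plus a table mapping each fixup label to the
    original instruction that its emulation code has to reproduce.
    """
    rewritten, fixups = [], {}
    for index, instr in enumerate(block):
        if is_sensitive(instr):
            label = f"fixup_{index}"
            fixups[label] = instr
            rewritten.append(f"jmp {label}")
        else:
            rewritten.append(instr)
    return rewritten, fixups


guest_block = [
    "ld r0, curr_thrd",
    "ld r1, (r0,ASID)",
    "mv r1, PSR",
    "ld sp, (r1,kern_stk)",
]
translated_block, fixup_table = translate(guest_block)
```

Innocuous instructions are copied unchanged, so most of the translated code still runs at native speed.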


Para-Virtualization

•  New(ish) name, old technique
   –  coined by Denali [Whitaker ‘02], popularised by Xen [Barham ‘03]
   –  Mach Unix server [Golub ‘90], L4Linux [Härtig ‘97], Disco [Bugnion ‘97]
•  Idea: manually port guest OS to modified (more high-level) ISA
   –  Augmented by explicit hypervisor calls (hypercalls)
      o  higher-level ISA to reduce number of traps
      o  remove unvirtualisable instructions
      o  remove “messy” ISA features which complicate
   –  Generally outperforms pure virtualisation, binary re-writing
•  Drawbacks:
   –  Significant engineering effort
   –  Needs to be repeated for each guest-ISA-hypervisor combination
   –  Para-virtualised guests must be kept in sync with native evolution
   –  Requires source

[Figure: guest on hypervisor on hardware.]


Virtualization Overheads

•  VMM must maintain virtualised privileged machine state
   –  processor status
   –  addressing context
   –  device state
•  VMM needs to emulate privileged instructions
   –  translate between virtual and real privileged state
   –  eg guest ↔ real page tables
•  Virtualisation traps are expensive
   –  >1000 cycles on some Intel processors!
   –  Better recently, Haswell has <500 cyc round-trip
•  Some OS operations involve frequent traps
   –  STI/CLI for mutual exclusion
   –  frequent page table updates during fork()
   –  MIPS KSEG addresses used for physical addressing in kernel


Virtualization and Address Translation

[Figure: natively, a page table maps virtual memory to physical memory. Virtualised, each guest's virtual page table maps virtual memory to guest physical memory, and a further page table maps guest physical memory to physical memory.]

Two levels of address translation!

Must implement with single MMU translation!


Virtualization Mechanics: Shadow Page Table

[Figure: a user-level load (ld r0, adr) issues a guest virtual address. The (virtual) guest page table in the guest OS, reached through the software virtual PT pointer, maps it to a guest physical address; the hypervisor's guest memory map maps that to a physical address in memory. The hardware PT pointer refers to the shadow (real) guest page table, whose translations are cached in the TLB.]


Virtualization Mechanics: Shadow Page Table

Hypervisor must shadow (virtualize) all PT updates by guest:
•  trap guest writes to guest PT
•  translate guest PA in guest (virtual) PTE using guest memory map
•  insert translated PTE in shadow PT

Shadow PT has TLB semantics (i.e. weak consistency)
⇒ Update at synchronisation points:
•  page faults
•  TLB flushes

Shadow PT as virtual TLB
•  similar semantics
•  can be incomplete: LRU translation cache

Used by VMware
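The "shadow PT as virtual TLB" idea can be sketched with dictionaries standing in for page tables. This is a simplified model under stated assumptions: pages are plain integers, there are no permissions or multi-level tables, and the class and method names are hypothetical.

```python
class GuestPageFault(Exception):
    """The guest PT has no mapping: the fault must be forwarded to the guest OS."""


class ShadowPageTable:
    def __init__(self, guest_pt, memory_map):
        self.guest_pt = guest_pt      # guest virtual page -> guest physical page (guest-owned)
        self.memory_map = memory_map  # guest physical page -> physical page (hypervisor-owned)
        self.shadow = {}              # guest virtual page -> physical page (what the MMU uses)

    def translate(self, vpage):
        """Model an MMU access: a miss in the shadow PT is a synchronisation point."""
        if vpage not in self.shadow:
            self.handle_fault(vpage)
        return self.shadow[vpage]

    def handle_fault(self, vpage):
        """Fill the shadow entry by composing the guest PT with the memory map."""
        gpage = self.guest_pt.get(vpage)
        if gpage is None:
            raise GuestPageFault(vpage)
        self.shadow[vpage] = self.memory_map[gpage]

    def guest_tlb_flush(self):
        """A guest TLB flush traps; the hypervisor drops the cached translations."""
        self.shadow.clear()


guest_pt = {0: 10, 1: 11}
memory_map = {10: 100, 11: 101, 12: 102}
spt = ShadowPageTable(guest_pt, memory_map)

first = spt.translate(0)       # filled lazily on the first access
guest_pt[0] = 12               # guest changes its PT without flushing
stale = spt.translate(0)       # still the old translation, as with a real TLB
spt.guest_tlb_flush()
fresh = spt.translate(0)       # re-derived from the updated guest PT
```

The stale result between the guest's update and its flush is the weak consistency the slide refers to: a correct guest already flushes the TLB after changing a mapping, so the shadow only needs to be correct after those points.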


Virtualisation Semantics: Lazy Shadow Update

[Figure: timeline across User, Guest OS and Hypervisor; steps listed per actor in order]
•  User: access new page …
•  Guest OS: add mapping to GPT; add mappings…; return to user
•  Hypervisor: write-protect GPT; unprotect GPT & mark dirty; update dirty shadow; write-protect GPT


Virtualisation Semantics: Lazy Shadow Update

[Figure: timeline across User, Guest OS and Hypervisor; steps listed per actor in order]
•  User: continue
•  Guest OS: invalidate mapping in GPT; invalidate mapping; flush TLB
•  Hypervisor: write-protect GPT; unprotect GPT & mark dirty; update dirty shadow; write-protect GPT; flush TLB


Virtualization Mechanics: Real Guest PT

•  On guest PT access must translate (virtualize) PTEs
   –  store: translate guest “PTE” to real PTE
   –  load: translate real PTE to guest “PTE”
•  Each guest PT access traps!
   –  including reads
   –  high overhead

Hypervisor maintains guest PT

[Figure: a user-level load (ld r0, adr) issues a guest virtual address, which the guest PT maps directly to a physical address in memory; the hypervisor has its own HV PT.]


Virtualization Mechanics: Optimised Guest PT

•  Guest translates PTEs itself when reading from PT
   –  supported by Linux PT-access wrappers
•  Guest batches PT updates using hypercalls
   –  reduced overhead

[Figure: a user-level load (ld r0, adr) issues a guest virtual address, which the guest PT maps directly to a physical address; the hypervisor has its own HV PT.]

Para-virtualized guest “knows” it is virtualized

Used by original Xen
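Batching can be sketched as a guest-side queue that is flushed with one hypercall. The class names, the `update_page_table` hypercall and the batch size are assumptions for this illustration and do not reproduce the Xen interface; the guest-side reverse map stands in for whatever table a para-virtualised guest uses to translate real PTEs back.

```python
class Hypervisor:
    def __init__(self, memory_map):
        self.memory_map = memory_map  # guest physical page -> physical page
        self.real_pt = {}             # guest virtual page -> physical page (used by the MMU)
        self.hypercalls = 0

    def update_page_table(self, updates):
        """One hypercall validates and applies a whole batch of PT updates."""
        self.hypercalls += 1
        for vpage, gpage in updates:
            if gpage is None:
                self.real_pt.pop(vpage, None)
            else:
                self.real_pt[vpage] = self.memory_map[gpage]


class ParavirtGuest:
    def __init__(self, hypervisor, batch_size=4):
        self.hv = hypervisor
        self.batch_size = batch_size
        self.pending = []
        # Assumed guest-visible reverse map: physical page -> guest physical page.
        self.to_guest = {p: g for g, p in hypervisor.memory_map.items()}

    def set_pte(self, vpage, gpage):
        """Queue an update instead of trapping on every PT write."""
        self.pending.append((vpage, gpage))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.hv.update_page_table(self.pending)
            self.pending = []

    def read_pte(self, vpage):
        """Read the real PTE without a trap and translate it back in the guest."""
        return self.to_guest[self.hv.real_pt[vpage]]


hv = Hypervisor({g: g + 100 for g in range(8)})
guest = ParavirtGuest(hv, batch_size=4)
for page in range(8):
    guest.set_pte(page, page)
guest.flush()
```

Eight PT writes cost two hypercalls here instead of eight traps, and reads cost none.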


Virtualization Techniques

•  Impure virtualisation methods enable new optimisations
   –  avoid traps through ability to control the ISA
   –  changed contract between guest and hypervisor
•  Example: virtualised guest page table
   –  lazy update of virtual state (TLB semantics)
•  Example: virtual interrupt-enable bit (in virtual PSR)
   –  requires changing guest’s idea of where this bit lives
   –  hypervisor knows about VM-local virtual state
      o  eg queue virtual interrupt until guest enables in virtual PSR

[Figure: under trap-and-emulate, psid traps so that the hypervisor can change the bit; para-virtualised, the guest sets the bit in the VPSR in memory instead, without touching the real PSR:]

    mov r1,#VPSR
    ldr r0,[r1]
    orr r0,r0,#VPSR_ID
    sto r0,[r1]
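The virtual interrupt-enable bit can be sketched as shared state between guest and hypervisor. The bit value of `VPSR_ID`, the `VM` class and the function names are assumptions for illustration. In particular, delivering pending interrupts directly from `guest_enable_irqs` is a simplification: a real guest would need some way to notify the hypervisor, for example a hypercall when interrupts are pending.

```python
VPSR_ID = 0x1  # "interrupts disabled" bit in the virtual PSR (value is hypothetical)


class VM:
    def __init__(self):
        self.vpsr = 0        # virtual PSR, ordinary memory shared with the hypervisor
        self.pending = []    # virtual interrupts queued by the hypervisor
        self.delivered = []  # interrupts actually injected into the guest


def guest_disable_irqs(vm):
    """Replaces the privileged psid: a plain memory update, no trap."""
    vm.vpsr |= VPSR_ID


def guest_enable_irqs(vm):
    """Clear the bit, then let the hypervisor deliver anything it queued."""
    vm.vpsr &= ~VPSR_ID
    hv_deliver_pending(vm)


def hv_inject_irq(vm, irq):
    """Hypervisor side: queue the interrupt and deliver it if the guest allows."""
    vm.pending.append(irq)
    hv_deliver_pending(vm)


def hv_deliver_pending(vm):
    if vm.vpsr & VPSR_ID:
        return  # guest has interrupts disabled: keep them queued
    while vm.pending:
        vm.delivered.append(vm.pending.pop(0))


vm = VM()
guest_disable_irqs(vm)
hv_inject_irq(vm, 5)
queued_while_disabled = list(vm.pending)
guest_enable_irqs(vm)
```

The guest's frequent disable/enable pairs no longer trap; the hypervisor only has to consult the VPSR when it has an interrupt to deliver.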


Virtualization Mechanics: 3 Device Models

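[Figure: three device models, each showing VM1 (apps on an OS) above the hypervisor and the device. Emulated: the guest OS runs its own device driver, and an emulation layer in the hypervisor sits between it and the device. Pass-through: the guest OS's device driver accesses the device directly. Split: a virtual driver in the guest OS talks to the device driver in the hypervisor, which accesses the device.]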


Virtualization Mechanics: Emulated Device

[Figure: device register accesses by the guest OS's device driver go to the emulation layer in the hypervisor, which drives the device.]

•  Each device access must be trapped and emulated
   –  unmodified native driver
   –  high overhead!
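The overhead of an emulated device can be made concrete with a toy serial port. The register layout, addresses and class names below are made up for this sketch; each call into `EmulatedUart` stands for one trap into the hypervisor.

```python
UART_BASE = 0x1000               # hypothetical MMIO base address
UART_DATA = UART_BASE + 0
UART_STATUS = UART_BASE + 4
STATUS_TX_READY = 0x1


class EmulatedUart:
    """Hypervisor-side device model; every register access arrives as a trap."""
    def __init__(self):
        self.output = []
        self.traps = 0

    def mmio_read(self, addr):
        self.traps += 1
        if addr == UART_STATUS:
            return STATUS_TX_READY  # the emulated device is always ready
        return 0

    def mmio_write(self, addr, value):
        self.traps += 1
        if addr == UART_DATA:
            self.output.append(chr(value & 0xFF))


def native_driver_puts(dev, text):
    """Unmodified 'native' driver: poll the status register, then write one byte."""
    for ch in text:
        while not (dev.mmio_read(UART_STATUS) & STATUS_TX_READY):
            pass
        dev.mmio_write(UART_DATA, ord(ch))


uart = EmulatedUart()
native_driver_puts(uart, "hi")
printed = "".join(uart.output)
```

Two characters cost four traps in this model, one status read and one data write each, which is why a higher-level interface with fewer hypercalls pays off.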


Virtualization Mechanics: Split Driver (Xen speak)

“Para-virtualized driver”

•  Simplified, high-level device interface
   –  small number of hypercalls
   –  new (but very simple) driver
   –  low overhead
   –  must port drivers to hypervisor

[Figure: the virtual driver in VM1's OS uses a simple interface to a virtual device provided by the device driver in the hypervisor, which accesses the device.]


Virtualization Mechanics: Driver OS (Xen Dom0)

•  Leverage Driver-OS native drivers
   –  no driver porting
   –  must trust complete Driver OS guest!
   –  huge TCB!

[Figure: the virtual driver in VM1's OS connects over a virtual network (“Virtual NW”) to a separate device-driver VM, whose OS contains the device drivers; both VMs run on the hypervisor.]


Virtualization Mechanics: Pass-Through Driver

•  Unmodified native driver
•  Can’t share device between VMs
•  Must trust driver (and guest)
   –  unless have hardware support (I/O MMU)

Available on modern x86, latest ARM

[Figure: direct device access by guest: the device driver in VM1's OS accesses the device without going through the hypervisor.]


Modern Architectures Not T&E Virtualisable

•  Examples:
   –  x86: many non-virtualizable features
      o  e.g. sensitive PUSH of PSW is not privileged
      o  segment and interrupt descriptor tables in virtual memory
      o  segment descriptors expose privilege level
   –  MIPS: mostly ok, but
      o  kernel registers k0, k1 (for save/restore state) user-accessible
      o  performance issue with virtualising KSEG addresses
   –  ARM: mostly ok, but
      o  some instructions undefined in user mode (banked registers, CPSR)
      o  PC is a GPR, exception return is MOVS to PC, doesn’t trap
•  Addressed by virtualization extensions to ISA
   –  x86, Itanium since ~2006 (VT-x, VT-i, AMD-V), ARM since ’12
   –  additional processor modes and other features
   –  all sensitive ops trap into hypervisor or made innocuous (shadow state)
      o  eg guest copy of PSW


x86 Virtualization Extensions (VT-x)

•  New processor mode: VT-x root mode
   –  orthogonal to protection rings
   –  entered on virtualisation trap

[Figure: root and non-root mode each have rings 0 to 3. The guest kernel runs in non-root ring 0 and the hypervisor in root ring 0. Labelled transitions: VM exit and kernel entry.]


ARM Virtualization Extensions (1)

Hyp mode: New privilege level
•  Strictly higher than kernel
•  Virtualizes or traps all sensitive instructions
•  Only available in ARM TrustZone “normal world”

[Figure: privilege levels EL0 to EL3. “Normal world”: user mode (EL0), kernel modes (EL1), hyp mode (EL2). “Secure world”: user mode (EL0) and kernel modes (EL1). Monitor mode (EL3).]


ARM Virtualization Extensions (2)

Configurable Traps

Can configure traps to go directly to guest OS
Big performance boost!
x86 similar

[Figure: three panels across user, kernel and hyp mode: a native syscall, a virtual syscall, and a virtual syscall with trap to guest.]


ARM Virtualization Extensions (3)

Emulation
1)  Load faulting instruction
    •  Compulsory L1-D miss!
2)  Decode instruction
    •  Complex logic
3)  Emulate instruction
    •  Usually straightforward

[Figure: the faulting instruction mv CPU_ASID,r1 sits in the instruction register (IR), the L1 I-cache and the L2 cache, between ld r1,(r0,ASID) and ld sp,(r1,kern_stk). To emulate it, the hypervisor must load it as data through the L1 D-cache into a register (R2).]


ARM Virtualization Extensions (3)

Emulation Support
1)  HW decodes instruction
    •  No L1 miss
    •  No software decode
2)  SW emulates instruction
    •  Usually straightforward

No x86 equivalent

[Figure: the faulting instruction mv CPU_ASID,r1 is in the IR, L1 I-cache and L2 cache as before, but the hardware presents its decoded fields (mv, r1) to the hypervisor in registers R2 and R3, without a load through the L1 D-cache.]


ARM Virtualization Extensions (4)

2-stage translation
•  Hardware PT walker traverses both PTs
•  Loads combined (guest-virtual to physical) mapping into TLB
•  eliminates “virtual TLB”

x86 similar (EPTs)

[Figure: a user-level load (ld r0, adr) issues a guest virtual address. The 1st (hardware) PT pointer refers to the (virtual) guest page table in the guest OS, which maps it to a guest physical address; the 2nd (hardware) PT pointer refers to the hypervisor's guest memory map, which maps that to a physical address.]


ARM Virtualization Extensions (4)

2-stage translation cost
•  On page fault walk twice number of page tables!
•  Can have a page miss on each
   –  requiring PT walk
•  O(n²) misses in worst case for n-level PT
•  Worst-case cost is massively worse than for single-level translation!

Trade-off:
•  fewer traps
•  simpler implementation
•  higher TLB miss cost (50% in extreme cases!)

[Figure: 2-stage translation. The 1st (hardware) PT pointer, in the guest OS, maps guest virtual address to guest physical address; the 2nd (hardware) PT pointer, in the hypervisor, maps guest physical address to physical address.]
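The quadratic worst case can be checked by counting memory accesses. The sketch assumes that nothing is cached (no TLB hits, no page-walk caches) and that every guest page-table pointer is a guest physical address, which itself needs a full second-stage walk before the guest PTE can be read. The function names are illustrative.

```python
def single_stage_walk_accesses(levels):
    """Native walk: one memory access per page-table level."""
    return levels


def two_stage_walk_accesses(guest_levels, host_levels):
    """Worst-case memory accesses for one 2-stage translation, nothing cached."""
    accesses = 0
    for _ in range(guest_levels):
        accesses += host_levels  # translate the guest physical address of this guest table
        accesses += 1            # read the guest PTE itself
    accesses += host_levels      # translate the final guest physical data address
    return accesses


native_4 = single_stage_walk_accesses(4)
nested_4 = two_stage_walk_accesses(4, 4)
nested_2 = two_stage_walk_accesses(2, 2)
```

For n levels at both stages this gives n(n+1) + n = n² + 2n accesses, so a 4-level configuration needs 24 accesses where the native walk needs 4.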


ARM Virtualization Extensions (5)

Virtual Interrupts
•  ARM has 2-part IRQ controller
   –  Global “distributor”
   –  Per-CPU “interface”
•  New H/W “virt. CPU interface”
   –  Mapped to guest
   –  Used by HV to forward IRQ
   –  Used by guest to acknowledge
•  Halves hypervisor invocations for interrupt virtualization

x86: issue only for legacy level-triggered IRQs

[Figure: the distributor feeds the CPU interface, which is used by the hypervisor; the virt. CPU interface is used by the guest.]


ARM Virtualization Extensions (6)

System MMU (I/O MMU)
•  Devices use virtual addresses
•  Translated by system MMU
   –  elsewhere called I/O MMU
   –  translation cache, like TLB
   –  reloaded from same page table
•  Can do pass-through I/O safely
   –  guest accesses device registers
   –  no hypervisor invocation

x86 different (VT-d)
Many ARM SoCs different

[Figure: accesses from the VM's virtual address spaces (VAS) go through the TLB and device accesses go through the system MMU; both translate guest physical addresses to physical addresses in physical memory.]


World Switch

x86
•  VM state is ≤ 4 KiB
•  Save/restore done by hardware on VMexit/VMentry
•  Fast and simple

ARM
•  VM state is 488 B
•  Save/restore done by software (hypervisor)
•  Selective save/restore
   –  Eg traps w/o world switch

[Figure: on a world switch, guest 1 state is saved to the VM 1 control block and guest 2 state is restored from the VM 2 control block.]


Microkernel as Hypervisor (NOVA, seL4)

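[Figure: seL4 as hypervisor on ARM and x86. ARM: seL4 in hyp mode, guest OS in kernel mode, guest apps and VMM in user mode. x86: seL4 in root mode ring 0 and the VMM in root mode ring 3, guest OS in non-root ring 0 and guest apps in non-root ring 3. Guest syscalls go to the guest OS; hypercalls and exceptions reach the VMM as IPC from seL4. seL4 is general-purpose, the VMM is virtualisation-specific. One VMM per VM, cannot break isolation!]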


Hybrid Hypervisor OSes

•  Idea: turn standard OS into hypervisor
   –  … by running in VT-x root mode
   –  eg: KVM (“kernel-based virtual machine”)
•  Can re-use Linux drivers etc
•  Huge trusted computing base!
•  Often falsely called a Type-2 hypervisor

Variant: VMware MVP
•  ARM hypervisor
•  pre-HW support
•  re-writes exception vectors in Android kernel to catch virtualization traps in guest

[Figure: the Linux kernel acts as the hypervisor in root mode ring 0, with Linux daemons and native Linux apps in root mode ring 3. Each VM runs in non-root mode, with the guest kernel in ring 0 and guest apps in ring 3; a VM exit enters the Linux kernel.]


ARM: seL4 vs KVM [Dall&Nieh ‘14]

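[Figure: seL4 versus KVM on ARM, across hyp, kernel and user mode. seL4: seL4 in hyp mode, guest OS in kernel mode, guest apps and VMM in user mode; syscalls go to the guest OS and hypercalls to seL4. KVM: the lowvisor in hyp mode; the Linux kernel with the KVM highvisor in kernel mode, alongside the guest kernel; QEMU, native Linux apps and guest apps in user mode. Annotation on the KVM side: Full world switch!]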


Virtualisation Cost (KVM)

Micro BM         ARM A15 (cycles)   x86 Sandy Bridge (cycles)
VM exit+entry                  27                         821
World Switch                1,135                         814
I/O Kernel                  2,850                       3,291
I/O User                    6,704                      12,218
EOI+ACK                    13,726                       2,305

Component      ARM LoC   x86 LoC
Core CPU         2,493    16,177
Page Faults        738     3,410
Interrupts       1,057     1,978
Timers             180       573
Other            1,344     1,288
Total            5,812    25,367

KVM needs a world switch (WS) for any hypercall!

Source: [Dall&Nieh, ASPLOS’14]
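
The cycle counts in the micro-benchmark table can be read as ARM-to-x86 ratios to see where each architecture pays. The following is a minimal sketch that uses only the figures quoted on the slide; the names `CYCLES` and `arm_to_x86_ratio` are illustrative and do not come from the source.

```python
# Cycle counts copied from the slide's micro-benchmark table:
# (ARM Cortex-A15 cycles, x86 Sandy Bridge cycles).
CYCLES = {
    "VM exit+entry": (27, 821),
    "World Switch": (1135, 814),
    "I/O Kernel": (2850, 3291),
    "I/O User": (6704, 12218),
    "EOI+ACK": (13726, 2305),
}


def arm_to_x86_ratio(benchmark: str) -> float:
    """Return ARM cycles divided by x86 cycles for one micro-benchmark."""
    arm, x86 = CYCLES[benchmark]
    return arm / x86


if __name__ == "__main__":
    for name in CYCLES:
        print(f"{name:14s} ARM/x86 = {arm_to_x86_ratio(name):.2f}")
```

A ratio below 1 means the operation is cheaper on ARM. The bare VM exit+entry is far cheaper on ARM (27 vs 821 cycles), whereas the world switch that KVM needs for any hypercall costs more on ARM (1,135 vs 814 cycles).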


Fun and Games with Hypervisors

•  Time-travelling virtual machines [King ‘05]
  –  debug backwards by replaying VM from checkpoint, logging state changes
•  SecVisor: kernel integrity by virtualisation [Seshadri ‘07]
  –  controls modifications to kernel (guest) memory
•  Overshadow: protect apps from OS [Chen ‘08]
  –  make user memory opaque to OS by transparently encrypting
•  Turtles: Recursive virtualisation [Ben-Yehuda ‘10]
  –  virtualize VT-x to run hypervisor in VM
•  CloudVisor: mini-hypervisor underneath Xen [Zhang ‘11]
  –  isolates co-hosted VMs belonging to different users
  –  leverages remote attestation (TPM) and Turtles ideas

… and many more!
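
The checkpoint-and-replay idea behind time-travelling VMs can be illustrated with a toy model. This is only a sketch under strong assumptions: the "VM" is a hypothetical deterministic state machine whose state is one integer, and `ToyVM`, `Recorder` and `travel_to` are invented names, not part of any system cited on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVM:
    """Hypothetical deterministic 'VM': its whole state is one integer."""
    state: int = 0
    step_no: int = 0

    def step(self, external_input: int) -> None:
        # Execution is deterministic once the external input is known.
        self.state = self.state * 3 + external_input
        self.step_no += 1


@dataclass
class Recorder:
    """Logs every input and checkpoints the VM every `interval` steps."""
    interval: int = 4
    log: list = field(default_factory=list)
    checkpoints: dict = field(default_factory=dict)

    def run(self, vm: ToyVM, inputs) -> None:
        # Assumes `vm` is fresh (step_no == 0), so log[i] is the input of step i.
        for value in inputs:
            if vm.step_no % self.interval == 0:
                self.checkpoints[vm.step_no] = vm.state
            self.log.append(value)
            vm.step(value)

    def travel_to(self, target_step: int) -> ToyVM:
        """Rebuild the VM as it was after `target_step` steps."""
        # Restore the latest checkpoint at or before the target ...
        start = max(s for s in self.checkpoints if s <= target_step)
        vm = ToyVM(state=self.checkpoints[start], step_no=start)
        # ... then replay the logged inputs forward up to the target.
        for value in self.log[start:target_step]:
            vm.step(value)
        return vm
```

Going "backwards" is thus implemented as going forwards again from an earlier checkpoint; the checkpoint interval trades storage against replay time.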


Hypervisors vs Microkernels

•  Both contain all code executing at highest privilege level
  –  Although hypervisor may contain user-mode code as well
    o  privileged part usually called “hypervisor”
    o  user-mode part often called “VMM”
    (Difference to traditional terminology!)
•  Both need to abstract hardware resources
  –  Hypervisor: abstraction closely models hardware
  –  Microkernel: abstraction designed to support wide range of systems
•  What must be abstracted?
  –  Memory
  –  CPU
  –  I/O
  –  Communication


What’s the difference?

Resource        Hypervisor                    Microkernel
Memory          Virtual MMU (vMMU)            Address space
CPU             Virtual CPU (vCPU)            Thread or scheduler activation
I/O             Simplified virtual device;    IPC interface to user-mode driver;
                driver in hypervisor;         interrupt IPC
                virtual IRQ (vIRQ)
Communication   Virtual NIC, with driver      High-performance message-passing IPC
                and network stack

Annotations:
•  Memory: just page tables in disguise
•  CPU: just kernel-scheduled activities
•  Hypervisor I/O and communication: modelled on HW, re-uses SW
•  Microkernel I/O and communication: minimal overhead, custom API

Real Difference?
•  Similar abstractions
•  Optimised for different use cases
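
To make the "page tables in disguise" remark concrete, here is a toy sketch of what a vMMU has to provide: a guest-controlled mapping composed with a hypervisor-controlled one. Everything here is an assumption for illustration (single-level tables as Python dicts, a 4 KiB page size, and the names `guest_pt`, `hyp_pt` and `vmmu_translate`); real hardware uses multi-level tables and either shadow or nested paging.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def translate(page_table: dict, addr: int) -> int:
    """Translate an address through one single-level page table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise LookupError(f"page fault at {addr:#x}")
    return page_table[page] * PAGE_SIZE + offset


# Hypothetical mappings: the guest OS controls the first table,
# the hypervisor controls the second.
guest_pt = {0: 5, 1: 7}   # guest virtual page   -> guest physical page
hyp_pt = {5: 42, 7: 13}   # guest physical page  -> host physical frame


def vmmu_translate(guest_virtual: int) -> int:
    """Compose both stages: guest virtual -> guest physical -> host physical."""
    guest_physical = translate(guest_pt, guest_virtual)
    return translate(hyp_pt, guest_physical)
```

For example, `vmmu_translate(0x1010)` goes through guest page 1, guest physical page 7 and host frame 13, keeping the offset `0x10`. A microkernel address space is the same kind of mapping with only one stage.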


Closer Look at I/O and Communication

•  Communication is critical for I/O
  –  Microkernel IPC is highly optimised
  –  Hypervisor inter-VM communication is frequently a bottleneck

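[Figure: I/O paths compared.
Microkernel: apps, a server and a device driver run on top of the microkernel and communicate via IPC.
Hypervisor: a device-driver VM (OS with device driver) and VM1 (OS with virtual driver, running apps) run on top of the hypervisor and communicate via a virtual network (Virtual NW).]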


Hypervisors vs Microkernels: Drawbacks

Hypervisors:
•  Communication is Achilles heel
  –  more important than expected
    o  critical for I/O
  –  plenty of improvement attempts in Xen
•  Most hypervisors have big TCBs
  –  infeasible to achieve high assurance of security/safety
  –  in contrast, microkernel implementations can be proved correct

Microkernels:
•  Not ideal for virtualization
  –  API not very effective
    o  L4 virtualization performance close to hypervisor
    o  effort much higher
  –  Needed for legacy support
  –  No issue with H/W support?
•  L4 model uses kernel-scheduled threads for more than exploiting parallelism
  –  Kernel imposes policy
  –  Alternatives exist, e.g. K42 uses scheduler activations
