TRANSCRIPT
Microkernel Design
A walk through selected aspects of kernel design and seL4
These slides are distributed under the Creative Commons Attribution 3.0 License,
unless otherwise noted on individual slides.
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work
Under the following conditions:
Attribution — You must attribute the work (but not in any way that suggests that the author
endorses you or your use of the work) as follows:
“Courtesy of Kevin Elphinstone, UNSW”
The complete license text can be found at http://creativecommons.org/licenses/by/3.0/legalcode
© Kevin Elphinstone. Distributed under Creative Commons
Attribution License
Formal Verification – Proof Architecture

Specification
  | Proof
  v
C Code

NICTA Copyright 2010 – From imagination to impact

Proof Architecture

Specification
Access Control Spec → Confinement
Design – Haskell Prototype
C Code
Verification Strategy
• An OS perspective
– simple is better
– complex system-wide invariants increase difficulty
– concurrency is very difficult to reason about
• must consider every possible interleaving of execution
Fundamental Kernel
Abstractions
• Execution
– support CPU running multiple activities
• Memory
– support (and protect) state associated with
an activity
Execution
• Two execution environments
– kernel-level (in-kernel) and user-level (application execution)
• Covered execution models in detail earlier in the course
– Two common approaches
• Event-based
– smaller memory footprint, limited to smaller kernels
• Process-based
– larger memory footprint, programming model scales to
larger kernels, though synchronisation adds complexity
seL4 Kernel Execution?
• For verifiability
– Event-based
• sequential execution from kernel-mode entry to exit
– Context switch at kernel exit
• current process/thread control block switched as late as possible
• kernel C code not re-entrant
– Interrupts disabled
• delivered on return to user-level, or
• polled during long running operations
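The event-based, non-re-entrant model above can be sketched as a run-to-completion loop: each kernel entry executes sequentially with interrupts masked, polling for pending interrupts only at safe points inside long-running operations and delivering them on the return to user level. A minimal, hypothetical simulation (illustrative names, not seL4's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-stack, event-based kernel model: every kernel entry
 * runs to completion; interrupts stay masked and are only polled. */
static bool irq_pending;
static int irqs_delivered;

static void poll_interrupts(void) {
    /* Long-running operations check for pending interrupts at safe
     * preemption points instead of being preempted asynchronously. */
    if (irq_pending) {
        irq_pending = false;
        irqs_delivered++;
    }
}

static int handle_syscall(int units_of_work) {
    int done = 0;
    for (int i = 0; i < units_of_work; i++) {
        done++;                 /* one bounded chunk of kernel work */
        poll_interrupts();      /* safe point: kernel state is consistent */
    }
    return done;                /* runs to completion: no re-entry */
}

static void kernel_entry(int units_of_work) {
    handle_syscall(units_of_work);
    poll_interrupts();          /* deliver on return to user level */
}
```

Because the C code is never interrupted mid-operation, no kernel data structure is ever observed in an intermediate state, which is what makes the sequential verification story tractable.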
Application Execution
• From kernel perspective, commonly two
models
– single-threaded
• straightforward program execution
• potentially with another execution model layered on top
(e.g. user-level threads)
– multi-threaded
• potentially with another execution model or user-level
involvement
– m-n user-level threads
– scheduler activations
Virtualisation
• Introduces a third application (guest OS) execution model – the virtual CPU
• Has close parallels to a thread
• We’ll distinguish them as follows
– Fixed set at “boot time”
• e.g. no create/delete CPUs by guest
– Hardware-like synchronisation
• no blocking synch primitives
– Hardware-like communication
• low-level notification (interrupts), no complex messaging
• handled via interrupt handler
Application Execution
• For verification
– single-threaded
• execution still simplest – event-based sequential code
– multithreaded
• problematic due to concurrency
• good to overlap I/O (blocking) with execution, and to
utilise multiprocessors
– virtual CPU
• with interrupts disabled – event-based sequential code
• interrupts enabled: problematic due to the potential number of instruction interleavings
• obviously good for replicating the normal CPU execution model for a guest OS
seL4 Application
Execution?
• Multithreaded
– verified applications would be limited to a single thread
• Alternatives
– VCPUs
• verified applications have interrupts disabled
Memory Management
• Page-based virtual memory is ubiquitous
• Applications expect a specific memory model
– Text, data, BSS, stack
– Memory-mapped files
• shared libraries, shared memory
– External pagers of memory objects
• Mach
– External control of mappings
• Virtualisation (hypercalls, shadow page tables)
• L4
Text, data, …

[Figure: virtual address space with Text, Data, BSS, and Stack regions]

• Implications for kernel
– knowledge of the executable format
• limits alternatives – e.g. guest OS, guest application
– at minimum, the ability to load an application and set up mappings
• also implies allocation of page tables and memory frames
– implies some model for managing memory securely between applications
• also implies bookkeeping for de-allocation, i.e. resource attribution – e.g. processes
Memory Mapped Files/Objects

[Figure: virtual address space with Text, Data, BSS, Stack, libc, and File regions]

• Implications for kernel
– similar to text, data, …
– additionally
• adds a file-like store to name and retrieve/store data
• adds a mechanism for mapping a VM region to a file
External Pagers

[Figure: application address space (Text, Data, BSS, Stack, libc, File) served by a user-level file system server]

• Page faults propagated to user-level servers
– they supply the data for the page; the kernel still manages memory (frames, page tables, etc.)
• Implications for kernel
– adds the complexity of
• VM-region-based fault forwarding
• a data-provision mechanism
– removes the complexity of supplying/storing data from the kernel (not in Mach’s case)
Historical L4 Mapping Model

Address Spaces
• map
• unmap
• grant

© 2002 Kevin Elphinstone

Page Fault Handling
• The application faults; the kernel converts the fault into a “PF” message to the pager (PF IPC)
• The pager responds with a map message (result IPC)

[Figure: address spaces built recursively – Physical Memory → Initial AS → Pagers 1–4 → Applications and a Driver]
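The three operations can be illustrated with a toy model of pages shared between address spaces: map copies a page's access into another space while the mapper keeps it, grant moves it (the granter loses it), and unmap recursively revokes everything derived from one's own mapping. A hypothetical sketch of the semantics, not the real kernel's data structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NSPACES 4
#define NPAGES  4

/* Toy model: has[s][p] says whether address space s has page p mapped,
 * and from[s][p] records which space it was mapped from (-1 = root). */
static bool has[NSPACES][NPAGES];
static int  from[NSPACES][NPAGES];

static void vm_reset(void) {
    memset(has, 0, sizeof has);
    memset(from, -1, sizeof from);
    for (int p = 0; p < NPAGES; p++)
        has[0][p] = true;                   /* space 0 plays the initial AS */
}

static bool vm_map(int src, int dst, int p) {   /* src keeps the page */
    if (!has[src][p] || has[dst][p]) return false;
    has[dst][p] = true;
    from[dst][p] = src;
    return true;
}

static bool vm_grant(int src, int dst, int p) { /* src loses the page */
    if (!vm_map(src, dst, p)) return false;
    has[src][p] = false;
    from[dst][p] = from[src][p];   /* dst now derives from src's source */
    return true;
}

static void vm_unmap(int s, int p) {    /* revoke all mappings derived from s */
    for (int t = 0; t < NSPACES; t++)
        if (has[t][p] && from[t][p] == s) {
            vm_unmap(t, p);
            has[t][p] = false;
        }
}
```

Note that vm_unmap must walk every mapping derived from the revoked one; this derivation bookkeeping is exactly the mapping-relationship tracking the next slide identifies as a source of kernel complexity.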
Historical L4 Mapping Model
• Kernel only provides
– relatively simple mechanisms
– physical memory can be directly managed at user-level
– page tables still managed in-kernel
• complexity of some memory management remains
– introduces the complexity of tracking mapping relationships
Recursive Mapping Removed
• Single privileged syscall for the Initial AS
• Pagers request mappings from the Initial AS
• Removes the need to track mapping relationships in the kernel
[Figure: Physical Memory → Initial AS → Pagers 1 and 2 → Application]
Initial Task Removed
• Mapping operates on pre-allocated physical memory partitions
• Removes the need for user-level to proxy
• Adds partitioning policy to the kernel, but not a significant source of complexity
• Page-table management still in-kernel
– some memory allocation remains
[Figure: Physical Memory partitioned between Pagers]
Note Parallels with Hypervisors
• Mapping operates on pre-allocated physical memory partitions
– hypercalls
• Page-table management still in-kernel
– some memory allocation remains
• Page-table management becomes quite tricky when directly virtualising page tables without hardware assistance

[Figure: Physical Memory partitioned between Guest OSes]
Kernel Design for Isolation and Assurance of Physical Memory
Dhammika Elkaduwe, Philip Derrin, Kevin Elphinstone
Embedded Systems
• Increasing functionality
• Increasing software complexity
– Millions of lines of code
– Mutually untrusted SW vendors
• Consolidated functionality
• Connectivity
– Attacks from outside
– No longer closed systems
– Downloaded SW
IIES08/seL4 1
Embedded Systems
• Diverse applications
– Real-time vs. best-effort
• Tight resource budgets
• Mission/life-critical applications
• Sensitive information
Reliability is paramount
Small Kernel Approach

[Figure: untrusted legacy apps and trusted sensitive apps; a supervisor OS, Linux server, device drivers, and trusted services running on a small kernel (e.g. a microkernel) over the hardware]

• Smaller, more trustworthy foundation
– Hypervisor, microkernel, isolation kernel, …
• Facilitate controlled integration and isolation
– Isolate: fault isolation, diversity
– Integrate: performance
• Microkernel should:
– provide a sufficient API
– correctly realise that API
– adhere to the isolation/integration requirements of the system
Issue
• Kernel consumes resources
– Machine cycles
– Physical memory (kernel metadata)
• Example:
– threads – thread control blocks
– address spaces – page tables
– bookkeeping to reclaim memory

[Figure: microkernel holding TCBs and page tables on behalf of untrusted legacy apps and trusted sensitive apps]
Possible Approaches
How do we manage kernel metadata?
• Cache-like behaviour [EROS, Cache Kernel, HiStar, …]
– No predictability, limited RT applicability
• Static allocation
– Works for static systems
– Dynamic systems: overcommit, or fail under heavy load
• Domain-specific kernel modifications?
Modified ≠ Verified
• L4.Verified project: formally verify the implementation correctness of the kernel
• Properties:
– isolation, information flow, …
– mathematically proven properties of an abstract model
• Formal refinement
– formally connects the properties with the kernel implementation (Abstract Model → C Code → HW, property-preserving refinement)
– modifications invalidate the refinement
– verification is labour intensive
• 10K C lines ≈ 200K proof lines
• Memory management is core functionality
Approach in a Nutshell
• No implicit allocations within the kernel
– no heap, no slab allocation, etc.
• All abstractions are provided by first-class kernel objects
– threads – TCB objects
– address spaces – page-table objects
• All objects are created upon explicit user request

[Figure: supervisory OS, trusted OS server, and legacy OS servers atop the seL4 microkernel – no kernel heap]
Memory Management Model
• No implicit allocations within the kernel
• Physical memory is divided into untyped objects
• Authority is conferred via capabilities
– an untyped capability is sufficient authority to allocate kernel objects
• All abstractions are provided via first-class kernel objects
– kernel objects: untyped, TCBs (thread control blocks), capability tables (CT), comm. ports, …
– allocated on explicit user request
– the creator gets full authority
– capabilities are distributed to allow others access to the service
• Objects are managed by user-level
• Delegate authority
– allow others to obtain services
– delegate resource management
– memory management policy is completely in user space
• Isolation of physical memory = isolation of authority (capabilities)
– capability dissemination is controlled by a “take-grant”-like protection model

[Figure: seL4 kernel memory – kernel code plus untyped objects, some retyped into TCBs and page tables, managed by a supervisory OS and trusted/legacy OS servers]
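The allocation step can be sketched as follows: a user holding an untyped capability asks the kernel to carve a kernel object out of that region, so the only memory the kernel ever consumes is memory the user explicitly handed over. Names and the watermark scheme here are illustrative, not seL4's real API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model: an untyped object is a region of physical memory
 * with a watermark; retyping carves objects from it, never from a heap. */
struct untyped {
    size_t size;       /* bytes covered by this untyped object        */
    size_t watermark;  /* bytes already consumed by earlier retypes   */
};

/* Returns the offset of the new object within the untyped region,
 * or (size_t)-1 if the request does not fit.
 * align must be a power of two. */
static size_t retype(struct untyped *ut, size_t obj_size, size_t align) {
    size_t start = (ut->watermark + align - 1) & ~(align - 1);
    if (obj_size > ut->size || start > ut->size - obj_size)
        return (size_t)-1;              /* must fit inside the untyped  */
    ut->watermark = start + obj_size;   /* explicit, attributable alloc */
    return start;
}
```

Because every object is carved from a user-supplied untyped region, kernel memory consumption is fully attributable to the holder of the untyped capability, which is what makes the hard consumption guarantees possible.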
Memory Management Model …
• De-allocation upon explicit user request
– call revoke on the untyped capability
– memory can be reused
• Kernel tracks capability derivations
– recorded in the capability derivation tree (CDT)
– needs bookkeeping
• doubly-linked list through capabilities
• space allocated with capability tables

[Figure: CDT – an untyped capability with derived TCB capabilities and a TCB copy]
Capability Derivation Tree
• For allocation:
– the untyped capability must not have any CDT children
• guarantees that there are no previously allocated objects
– the size of the object(s) must be smaller than or equal to the untyped object

[Figure: CDT – an untyped capability with derived TCB capabilities and a TCB copy]
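Both allocation preconditions reduce to a simple check against the CDT before retyping. A hypothetical sketch of the rule (not kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cap {
    size_t size;          /* size of the region the capability covers  */
    int    num_children;  /* CDT children derived from this capability */
};

/* An untyped capability may be retyped only if it has no CDT children
 * (so nothing is still allocated from it) and the request fits. */
static bool may_allocate(const struct cap *untyped, size_t obj_size) {
    return untyped->num_children == 0 && obj_size <= untyped->size;
}
```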
Evaluation
• Formal properties:
– formalised the protection model in Isabelle/HOL
– machine-checked, abstract model of the kernel
– formal, machine-checked proof that the mechanisms are sufficient for enforcing spatial partitioning
– the proof also identifies the invariants the “supervisory OS” needs to enforce for isolation to hold:
• cannot share modifiable page/capability tables
• cannot share thread control blocks
• cannot have communication channels that allow capability propagation
Evaluation …
• Performance
– used paravirtualised Linux as an example
– compared with L4/Wombat (Linux) running LMbench

[Figure: Linux atop the supervisory OS and seL4 microkernel, vs. Wombat/Iguana atop an L4 microkernel, each with drivers]

Benchmark               L4 (µs)   seL4 (µs)   Gain (%)
fork                     4570      3083        32.5
exec                     5022      3440        31.5
shell                   29729     19999        32.7
page fault                 34        18.7      45.4
null syscall                3.4       2.9      11
ctx (proxy via Iguana)     10.7       9.3       7.6
Conclusion
• No implicit allocations within the kernel
– Users explicitly allocate kernel objects
– No heap, slab .. (no hidden bookkeeping)
– Authority confinement guarantees control of kernel memory
• All kernel memory management policy is outside the kernel
– Different isolation/integration configurations
– Support diverse, co-existing policies
– No modification to the kernel (remains verified)
• Hard guarantees on kernel memory consumption
– Facilitate formal reasoning of physical memory consumption
• Improve performance by controlled delegation
– Similar performance in other cases
Virtual Memory & seL4
• Implemented using 3 objects*
– Frames: objects corresponding to physical memory
– Page directory: an object corresponding to the level-1 page table of a two-level page table
– Page table: an object corresponding to the level-2 page table of a two-level page table
– created from untyped memory (as directed by user level)
* currently actually 4 – expect ASIDs will be removed
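The two-level structure means a mapping needs a page directory, a page table already installed in it, and a frame; if the page table is missing, the fault is reported to user level rather than the kernel allocating one implicitly. A toy sketch under those assumptions (tiny table sizes and names are illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define PD_ENTRIES 4
#define PT_ENTRIES 4

struct frame { int id; };
struct pt { struct frame *entries[PT_ENTRIES]; };
struct pd { struct pt *entries[PD_ENTRIES]; };

enum map_result { MAP_OK, MAP_MISSING_PT };

/* Installing a mapping needs caps to the PD, a PT already installed
 * in the right slot, and the frame; the kernel never allocates a PT. */
static enum map_result map_frame(struct pd *pd, unsigned vaddr, struct frame *f) {
    struct pt *pt = pd->entries[(vaddr / PT_ENTRIES) % PD_ENTRIES];
    if (pt == NULL)
        return MAP_MISSING_PT;  /* new fault type: missing page table */
    pt->entries[vaddr % PT_ENTRIES] = f;
    return MAP_OK;
}
```

The MAP_MISSING_PT case is what the new "missing page table" fault type reports: the user-level pager must retype a page-table object from its own untyped memory and install it before retrying.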
Virtual Memory & seL4
• Broadly similar model to previous L4 kernels
• VM faults are propagated as IPC
– Introduce new page fault type – missing page table
• To install a mapping, one needs:
– a cap to a page directory
– a cap to the page table to be installed in the page directory
• installation requires caps to both the PD and the PT
– a cap to a frame of physical memory
• Thus, model allows creation of domain specific VM model
– using only authorised memory
• Revocation handled via CDT
Verification Perspective
• Complexity of memory management policy, and VM
model pushed outside the kernel
– simple VM model implemented at user-level should also be
verifiable
– unverified complex models are also supported
• e.g. paravirtualised guest OSes
• CDT an additional complexity
– needed for revocation of caps anyway
– guarantees integrity (used to determine when memory has
no references)
Quick Summary
• Basic abstractions
– Execution
– Memory
• Many alternative models
– seL4 uses subset that:
• is amenable to verification in-kernel
• should be amenable to verification at user-level
Inter-process Communication
• Enables system construction
– the alternative is a monolithic server
• Processes cooperate to provide services
• Enables extensibility of the system
IPC Semantics
• Blocking versus Non-blocking
• Buffered versus Unbuffered
• Fixed versus Variable-size
• Direct versus Indirect
Blocking versus Non-blocking

Blocking (termed synchronous)
• Send
– returns control only after the message is sent
• Receive
– returns control only after a message is received

Non-blocking (termed asynchronous)
• Send
– the message is always immediately copied or queued, and send returns
• Receive
– polls for new messages

Issues:
• Needs buffering
– buffering is bounded
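The bounded-buffering issue can be made concrete with a toy queue: a non-blocking send succeeds immediately only until the buffer fills, after which it must fail (or the system must block after all), while a blocking send has no buffer to exhaust. A hypothetical sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define QCAP 2

/* Non-blocking (asynchronous) send: the message is queued immediately,
 * but the queue is bounded, so send can fail under load. */
static int queue[QCAP];
static int qlen;

static bool nb_send(int msg) {
    if (qlen == QCAP)
        return false;      /* buffer full: must block, fail, or drop */
    queue[qlen++] = msg;
    return true;
}

static bool nb_receive(int *msg) {  /* polls for a new message */
    if (qlen == 0)
        return false;
    *msg = queue[0];
    for (int i = 1; i < qlen; i++)  /* shift remaining messages down */
        queue[i - 1] = queue[i];
    qlen--;
    return true;
}
```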
Buffered versus Unbuffered

Buffered
• Requires at least an extra copy to the buffer
• Send may get ahead of receive
– matches differing processing rates
• Buffers are finite
– send eventually becomes blocking
– synchronisation and rendezvous occur

Unbuffered
• Rendezvous always
– performance
• Potential to copy the message directly
Fixed versus Variable Size
• Fixed size simplifies buffering and
marshalling
• Variable size needs the receiver to wait on the largest-size message every time
– not really an issue except for large messages
Direct versus Indirect

Direct
• send(dest, message)
• receive(var, message)

[Figure: Source → Dest]
Direct versus Indirect

Indirect
• send(mailbox, message)
• receive(var, message)
• Comms paths are first-class objects

[Figure: Source → Mailbox → Dest]
seL4 IPC model
• 6 system calls
– send, nbsend, call, wait, reply, replywait
• 2 communication objects
– EndPoint, AsyncEndPoint
Kernel Calls are IPC
• IPC specifies a capability as the
destination
• ‘call’-ing a cap invokes the kernel
– identifies the object
• TCB, PD, PT
– specifies the method and arguments of the call
Communications Objects
• EndPoint (EP) and AsyncEndPoint
(AEP)
– acts as a mailbox (indirect comms)
– distinguished caps to EP and AEP have
badges
• a word of bits
• used to determine authority or identity of
sender
EndPoints
• Call
– sends message via EP
– unbuffered (at the moment)
– receiver receives
• message
• unforgeable badge
• a reply cap to sender
– allows caps to propagate in a usable way
– “reply” responds via reply cap
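The reply capability minted by call is one-shot: the kernel creates it when the call is received, and reply consumes it, so the server can answer exactly the thread that called and nothing else. A toy model with hypothetical structures:

```c
#include <assert.h>
#include <stdbool.h>

struct reply_cap {
    int  caller;  /* thread the reply must go back to */
    bool valid;   /* one-shot: consumed by reply      */
};

static int last_reply_to = -1;

/* Receiving a call hands the server the message, the sender's badge,
 * and a freshly minted reply cap (only the reply cap is modelled here). */
static struct reply_cap recv_call(int caller) {
    struct reply_cap rc = { caller, true };
    return rc;
}

/* "reply" responds via the reply cap and consumes it. */
static bool reply(struct reply_cap *rc) {
    if (!rc->valid)
        return false;        /* already used */
    last_reply_to = rc->caller;
    rc->valid = false;
    return true;
}
```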
Call, EP, and Extensible Systems
• Call and EP enable kernel extensibility via user-level servers (Hydra)
• Calling a capability
– invokes a kernel-implemented object
• TCB, PD, PT, etc.
– or invokes a server-implemented object
• Capability propagation is consistent for both kernel-
and user-level implemented objects
– authority confinement of kernel object applies to user-objects
as well
AEP
• Used for signalling – “nbsend”
– can never block
• The badge is “or”-ed with a word in the AEP object
• Receiving
– receives the state of the AEP word
– zeroes the word (atomically)
• Depending on the encoding of badges, notification of up to 32 source events
– used in conjunction with shared memory
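The notification word behaves like a set of pending interrupt lines: each sender's badge is OR-ed in without blocking, repeated signals from the same source coalesce, and the receiver atomically reads and clears the word, then decodes which of up to 32 sources signalled. A sketch under those assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* The AEP's state is a single word of pending-notification bits. */
static uint32_t aep_word;

static void aep_nbsend(uint32_t badge) {
    aep_word |= badge;          /* OR in; never blocks, never queues */
}

static uint32_t aep_receive(void) {
    uint32_t w = aep_word;      /* read and zero - atomic inside the kernel */
    aep_word = 0;
    return w;
}
```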
IPC Importance
General IPC Algorithm
• Validate parameters
• Locate target thread
– if unavailable, deal with it
• Transfer message
– short data only
– long: out-of-line data or cap transfer
• Schedule target thread
– switch address space as necessary
• Wait for IPC
IPC – Implementation: Short IPC

Short IPC (uniprocessor)
• system-call preamble (disable interrupts)
• identify dest thread or endpoint and check
– basically a cap lookup
• ready-to-receive?
• analyse msg and transfer
– short: no action required
• switch to dest thread & address space
• system-call postamble

This is the critical path.

“call”: the caller goes from running to wait-to-receive; the destination goes from wait-to-receive to running.
“send” (eager): both sender and destination are running afterwards – not a common operation if send is a ‘signal’.
“send” (lazy): the same path as the eager case.
[Figure: repeated snapshots of the x86 register set (EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EFLAGS, EIP, and segment registers) across an IPC – the message payload travels in registers; note the “payload” from the green thread.]
Implementation Goal
• Most frequent kernel op: short IPC
– thousands of invocations per second
• Performance is critical:
– structure IPC for speed
– structure the entire kernel to support fast IPC
• What affects performance?
– cache-line misses
– TLB misses
– memory references
– pipeline stalls and flushes
– instruction scheduling
Fast Path
• Optimise for common cases
– write it in assembler
– non-critical paths written in C/C++, but still as fast as possible
• Avoid high-level-language overhead:
– function-call state preservation
– poor code “optimisations”
• We want every cycle possible!
IPC Attributes for the Fast Path
• short message
• single runnable thread after the IPC
• must be a valid IPC “call”
• switch threads; the originator blocks
• send phase:
– the target is waiting
• receive phase:
– no sender is ready to couple, causing us to block
Avoid Memory References!!!
• Memory references are slow
• The microkernel should minimise indirect costs
– cache pollution
– TLB pollution
– memory-bus traffic
Optimised Memory Layout
• TCB state (thread ID, CPU ID, UTCB pointer, thread state, kernel stack) grouped by cache lines
• A single TLB entry covers the TCB
• Also: hard-wire TLB entries for kernel code and data
Branch Elimination

    /* Fold all slow-path conditions into one word. */
    slow = ~receiver->thread_state    /* common case: waiting == -1, so ~state == 0 */
         + (timeouts & 0xffff)        /* common case: 0 */
         + sender->resources
         + receiver->resources;
    if (slow)
        enter_slow_path();

• Reduces branch-prediction footprint
• Avoids mispredicts, stalls & flushes
• Increases latency for the slow path
TCB Resources
• Resources bitfield – one bit per resource (e.g. debug registers, copy area)
• Fast path checks the entire word
– if not 0, jump to the resource handlers
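The bitfield lets the fast path test every rare per-thread condition with a single comparison, and only the slow path pays for decoding which resources are actually in use. A minimal sketch (bit assignments are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per rare per-thread resource (positions are illustrative). */
enum { RES_DEBUG_REGS = 1u << 0, RES_COPY_AREA = 1u << 1 };

/* Fast path: one test of the whole word decides fast vs. slow path. */
static bool needs_slow_path(uint32_t resources) {
    return resources != 0;
}

/* Slow path: decode and dispatch each set bit to its handler. */
static int handle_resources(uint32_t resources) {
    int handled = 0;
    while (resources) {
        uint32_t bit = resources & (~resources + 1);  /* lowest set bit */
        resources &= resources - 1;                   /* clear it       */
        (void)bit;  /* a real kernel would dispatch on the bit here */
        handled++;
    }
    return handled;
}
```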
Message Transfer
• IBM PowerPC 750, 500 MHz, 32 registers
– up to 10 physical registers
– virtual-register copy loop
• Many cycles wasted on pipe flushes for privileged instructions
Slow Path vs. Fast Path

[Graph: L4Ka::Pistachio IPC performance, Pentium 3 – cycles (0–600) vs. number of message registers (0–60), comparing the inter-AS C path with the inter-AS fast path.]
Inter vs. Intra Address Space

[Graph: L4Ka::Pistachio IPC performance, Pentium 3 – cycles (0–600) vs. number of message registers (0–60), comparing the intra-AS fast path with the inter-AS fast path.]
IPC – Implementation: Long IPC

Long IPC (uniprocessor)
• system-call preamble (disable interrupts)
• identify dest thread and check
– same chief
• ready-to-receive?
• analyse msg and transfer
• long/map:
– lock both partners
– enable interrupts
– transfer message
• preemptions possible (end of timeslice, device interrupt, …)
• page faults possible (in source and dest address space)
– disable interrupts
– unlock both partners
• switch to dest thread & address space
• system-call postamble

Both partners remain locked during the transfer.
IPC – Memory Copy
• Why is it needed? Why not share?
– Security
• need own copy
– Granularity
• object smaller than a page, or not aligned
Copy-in / Copy-out
• copy into a kernel buffer
• switch address spaces
• copy out of the kernel buffer
• costs for n words:
– 2 × 2n r/w operations
– 3 × n/8 cache lines
– 1 × n/8 overhead cache misses (small n)
– 4 × n/8 cache misses (large n)
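The two copies via a kernel bounce buffer can be sketched directly; the arrays stand in for the two address spaces, since the mechanism is just a copy before and a copy after the space switch. A hypothetical model:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KBUF_WORDS 64

/* Kernel bounce buffer: mapped in the kernel portion of both spaces. */
static long kbuf[KBUF_WORDS];

/* Copy n words from the sender's space to the receiver's space:
 * copy in, (switch spaces), copy out - hence 2 copies and 2*2n
 * read/write operations for an n-word message. */
static int ipc_copy(const long *src, long *dst, size_t n) {
    if (n > KBUF_WORDS)
        return -1;                            /* would need chunking */
    memcpy(kbuf, src, n * sizeof *kbuf);      /* copy into kernel buffer */
    /* ... address-space switch happens here ... */
    memcpy(dst, kbuf, n * sizeof *kbuf);      /* copy out of kernel buffer */
    return 0;
}
```

The temporary-mapping scheme below exists precisely to eliminate one of these two copies.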
Temporary Mapping
• select a dest area (4+4 MB)
• map it into the source AS (kernel-only mapping)
• copy the data
• switch to the dest space

Problems:
• multiple threads per AS
– mappings might change while the message is copied
• How long to keep the PTE? What about the TLB?
• when leaving the current thread during IPC:
– invalidate the PTE
– flush the TLB
• when returning to the thread during IPC:
– re-establishing the temp mapping requires storing the partner id and dest-area address in the sender’s TCB
– note: the receiver’s page mappings might have changed!
Cost Estimates

                                  Copy-in/copy-out   Temporary mapping
R/W operations                    2 × 2n             2n
Cache lines                       3 × n/8            2 × n/8
Small-n overhead cache misses     n/8                0
Large-n cache misses              5 × n/8            3 × n/8
Overhead TLB misses               0                  n / words-per-page
Startup instructions              0                  50
486 IPC Costs

[Graph: IPC cost in µs (0–400) vs. message length (0–6000 words) on a 486 – Mach (copy in/out) is the most expensive; L4 (temporary mapping) and L4 + cache flush stay much closer to the raw-copy cost.]
Summary
• Small messages
– buffering costs a little
– mapping more so
– ideally, direct copy between two pinned “message areas”
• needs to be synchronous
• Large messages
– mapping is more efficient
• especially with out-of-line messages
• startup costs high (cost of setup amortised)
• implementation complexity high
• Shared memory and notification
– similar to buffering in terms of performance
• copy-in/copy-out if mutually distrusting
– implementation complexity is out of the kernel
seL4
• EndPoint
– unbuffered, synchronous, small message to a pre-allocated pinned buffer
– used for “call”
• AsyncEndPoint
– “or”-ed notification
– used for notification (shared memory buffers)
• Expect long copied messages to be
– avoided if possible
– via shared memory
FPU Context Switching
• Strict switching – on every thread switch:
– store the current thread’s FPU state
– load the new thread’s FPU state
• Extremely expensive
– IA-32’s full SSE2 state is 512 bytes
– IA-64’s floating-point state is ~1.5 KB
• May not even be required
– threads do not always use the FPU

Lazy FPU switching
• Lock the FPU on thread switch
• Unlock at first use – the exception is handled by the kernel:

    if fpu_owner != current:
        save current FPU state to fpu_owner
        load new state from current
        fpu_owner := current

[Figure: the kernel keeps the FPU locked; a thread’s first FPU instruction (finit, fld, fcos, fst, …, pacman()) faults and triggers the lazy unlock]
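The lazy scheme can be sketched as follows: the thread switch only sets a "locked" flag, and the expensive save/restore happens in the fault handler the first time the new thread actually touches the FPU, so FPU-free threads pay nothing. Names are illustrative, and a single register stands in for the full FPU context:

```c
#include <assert.h>
#include <stdbool.h>

struct thread { int fpu_state; };    /* stands in for the full FPU context */

static int fpu_reg;                  /* the (one-register) FPU itself      */
static struct thread *fpu_owner;     /* whose state is live in the FPU     */
static struct thread *current;
static bool fpu_locked;
static int fpu_switches;             /* count of actual save/load pairs    */

static void thread_switch(struct thread *next) {
    current = next;
    fpu_locked = true;               /* cheap: no FPU state is touched */
}

/* First FPU use while locked traps to the kernel, which unlocks lazily. */
static void fpu_use(int value) {
    if (fpu_locked) {
        if (fpu_owner != current) {
            if (fpu_owner)
                fpu_owner->fpu_state = fpu_reg;  /* save the old owner */
            fpu_reg = current->fpu_state;        /* load the new state */
            fpu_owner = current;
            fpu_switches++;
        }
        fpu_locked = false;
    }
    fpu_reg = value;                 /* the actual FPU instruction */
}
```

Note the payoff: switching away to a thread that never touches the FPU and back again costs no save/restore at all, because the owner check in the fault handler sees the state is still live.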