The Multikernel: A new OS architecture for scalable multicore systems
Andrew Baumann (1), Paul Barham (2), Pierre-Evariste Dagand (3), Tim Harris (2), Rebecca Isaacs (2), Simon Peter (1), Timothy Roscoe (1), Adrian Schüpbach (1), Akhilesh Singhania (1)
(1) Systems Group, ETH Zurich; (2) Microsoft Research, Cambridge; (3) ENS Cachan Bretagne
Systems Group | Department of Computer Science | ETH Zurich
SOSP, 12th October 2009
Introduction
How should we structure an OS for future multicore systems?
- Scalability to many cores
- Heterogeneity and hardware diversity
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 2
System diversity
[Figure: successive Sun UltraSPARC processors, 2004 to 2008, with the scaling factor shown for each]
- UltraSPARC IIIi processor: 1x
- UltraSPARC T1 processor (32 threads, eight cores): 14x
- UltraSPARC T2 processor (64 threads, eight cores): 35x
- “Victoria Falls” (128 threads, 16 cores, two sockets): 65x
[Figure: block diagrams of Sun Niagara T2, AMD Opteron (Istanbul) and Intel Nehalem (Beckton)]
The interconnect matters: Today’s 8-socket Opteron
The interconnect matters: Tomorrow’s 8-socket Nehalem
The interconnect matters: On-chip interconnects
[Figure: block diagrams of two many-core chips. The first has in-order, multi-threaded, wide-SIMD cores with instruction and data caches and a coherent L2 cache, plus memory controllers (GDDR), texture logic, fixed-function units, and display, system and PCIe interfaces. The second is a tiled design: each tile has a processor (register file, pipelines P0 to P2), L1 instruction and data caches, I-TLB and D-TLB, an L2 cache, 2D DMA and a switch onto the MDN, TDN, UDN, IDN and STN networks, with DDR2 controllers, PCIe, XAUI, GbE, flexible I/O, and UART, HPI, I2C, JTAG and SPI interfaces.]
Core diversity
- Within a system:
  - Programmable NICs
  - GPUs
  - FPGAs (in CPU sockets)
- On a single die:
  - Performance asymmetry
  - Streaming instructions (SIMD, SSE, etc.)
  - Virtualisation support
Summary
- Increasing core counts, increasing diversity
- Unlike HPC systems, cannot optimise at design time
The multikernel model
- It’s time to rethink the default structure of an OS
  - Shared-memory kernel on every core
  - Data structures protected by locks
  - Anything else is a device
- Proposal: structure the OS as a distributed system
- Design principles:
  1. Make inter-core communication explicit
  2. Make OS structure hardware-neutral
  3. View state as replicated
Outline
- Introduction
- Motivation
  - Hardware diversity
- The multikernel model
  - Design principles
  - The model
- Barrelfish
- Evaluation
  - Case study: Unmap
1. Make inter-core communication explicit
- All communication with messages (no shared state)
- Decouples system structure from inter-core communication mechanism
- Communication patterns explicitly expressed
  - Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
  - Better match for future hardware
    - ... with cheap explicit message passing (e.g. Tile64)
    - ... without cache-coherence (e.g. Intel 80-core)
- Allows split-phase operations
  - Decouple requests and responses for concurrency
- We can reason about it
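As an illustration of split-phase operation, the Python sketch below sends every request before collecting any response. The names `Channel` and `split_phase_sum` are hypothetical and chosen for this example only; the channel is an in-process stand-in for a real inter-core transport, not Barrelfish code.

```python
from collections import deque


class Channel:
    """Hypothetical stand-in for a message channel to one remote core."""

    def __init__(self, handler):
        self._handler = handler    # function the "remote core" runs per request
        self._inflight = deque()   # requests sent but not yet answered

    def send(self, request, on_reply):
        """Phase 1: queue the request and return immediately."""
        self._inflight.append((request, on_reply))

    def poll(self):
        """Phase 2: deliver one pending reply, if there is one."""
        if not self._inflight:
            return False
        request, on_reply = self._inflight.popleft()
        on_reply(self._handler(request))
        return True


def split_phase_sum(channels, value):
    """Send `value` to every channel, then gather and sum the replies."""
    replies = []
    for ch in channels:                   # issue all requests without waiting
        ch.send(value, replies.append)
    # The caller could do unrelated work here while requests are in flight.
    while len(replies) < len(channels):   # collect the responses
        for ch in channels:
            ch.poll()
    return sum(replies)
```

A blocking design would wait for each reply before sending the next request; here the request and the response are separate steps, which is what leaves room for concurrency.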
Message passing vs. shared memory: experiment
Shared memory (move the data to the operation):
- Each core updates the same memory locations (no locking)
- Cache-coherence protocol migrates modified cache lines
  - Processor stalled while line is fetched or invalidated
  - Limited by latency of interconnect round-trips
  - Performance depends on data size (cache lines) and contention (number of cores)
Shared memory results (4×4-core AMD system)
[Figure: latency (cycles × 1000, axis 0 to 12) against number of cores (2 to 16) for SHM1, SHM2, SHM4 and SHM8. Annotation: stalled cycles (no locking!)]
Message passing vs. shared memory: experiment
Message passing (move the operation to the data):
- A single server core updates the memory locations
- Each client core sends RPCs to the server
  - Operation and results described in a single cache line
  - Block while waiting for a response (in this experiment)
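The shape of the message-passing experiment can be sketched in Python with one server thread that owns the data and client threads that send it blocking RPCs. This is a minimal sketch under stated assumptions: threads and `queue.Queue` stand in for cores and cache-line messages, and the names `server`, `client` and `run` are hypothetical. It shows the structure only and says nothing about the measured latencies.

```python
import queue
import threading


def server(requests, state):
    """Only this thread touches `state`; clients reach it through messages."""
    while True:
        msg = requests.get()
        if msg is None:              # shutdown sentinel
            return
        index, delta, reply = msg
        state[index] += delta        # the update runs where the data lives
        reply.put(state[index])      # the result travels back in the reply


def client(requests, index, delta, count):
    """Send `count` update requests, blocking on each reply."""
    reply = queue.Queue(maxsize=1)
    for _ in range(count):
        requests.put((index, delta, reply))
        reply.get()                  # block until the server answers


def run(num_clients=4, updates_per_client=100):
    """Run the experiment once and return the final value of the location."""
    state = [0]
    requests = queue.Queue()
    srv = threading.Thread(target=server, args=(requests, state))
    srv.start()
    workers = [
        threading.Thread(target=client, args=(requests, 0, 1, updates_per_client))
        for _ in range(num_clients)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    requests.put(None)               # stop the server once all clients finish
    srv.join()
    return state[0]
```

Because a single thread performs every update, no lock is needed around `state`, which mirrors the "move the operation to the data" idea on the slide.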
Message passing vs. shared memory: tradeoff (4×4-core AMD system)
[Figure: latency (cycles × 1000, axis 0 to 12) against number of cores (2 to 16) for SHM1, SHM2, SHM4, SHM8, MSG1, MSG8 and Server. Annotations: messaging faster for ≥4 cores and ≥4 cache lines; the Server line is the actual cost of the update at the server; “spare” cycles if RPC was split-phase]
2. Make OS structure hardware-neutral
- Separate OS structure from hardware
- Only hardware-specific parts:
  - Message transports (highly optimised / specialised)
  - CPU / device drivers
- Adaptability to changing performance characteristics
  - Late-bind protocol and message transport implementations
3. View state as replicated
- Potentially-shared state accessed as if it were a local replica
  - Scheduler queues, process control blocks, etc.
  - Required by message-passing model
- Naturally supports domains that do not share memory
- Naturally supports changes to the set of running cores
  - Hotplug, power management
Replication vs. sharing as default
- Replicas used as an optimisation in previous systems:
  - Tornado, K42 clustered objects
  - Linux read-only data, kernel text
- In a multikernel, sharing is a local optimisation
  - Shared (locked) replica for threads or closely-coupled cores
  - Hidden, local
  - Only when faster, as decided at runtime
  - Basic model remains split-phase
The multikernel model
Barrelfish
- From-scratch implementation of a multikernel
- Supports x86-64 multiprocessors (ARM soon)
- Open source (BSD licensed)
Barrelfish structure: Monitors and CPU drivers
- CPU driver serially handles traps and exceptions
- Monitor mediates local operations on global state
- URPC inter-core (shared memory) message transport on current (cache-coherent) x86 HW
Non-original ideas in Barrelfish
Multiprocessor techniques:
- Minimise shared state (Tornado, K42, Corey)
- User-space messaging decoupled from IPIs (URPC)
- Single-threaded non-preemptive kernel per core (K42)
Other ideas we liked:
- Capabilities for all resource management (seL4)
- Upcall processor dispatch (Psyche, Sched. Activations, K42)
- Push policy into application domains (Exokernel, Nemesis)
- Lots of information (Infokernel)
- Run drivers in their own domains (µkernels)
- EDF as per-core CPU scheduler (RBED)
- Specify device registers in a little language (Devil)
Applications running on Barrelfish
- Slide viewer (this one!)
- Webserver (www.barrelfish.org)
- Virtual machine monitor (runs unmodified Linux)
- SPLASH-2, OpenMP (benchmarks)
- SQLite
- ECLiPSe (constraint engine)
- more...
Evaluation goals
How do we evaluate an alternative OS structure?
- Good baseline performance
  - Comparable to existing systems on current hardware
- Scalability with cores
- Adaptability to different hardware
- Ability to exploit message-passing for performance
Case study: Unmap (TLB shootdown)
- Send a message to every core with a mapping, wait for all to be acknowledged
- Linux/Windows:
  1. Kernel sends IPIs
  2. Spins on shared acknowledgement count/event
- Barrelfish:
  1. User request to local monitor domain
  2. Single-phase commit to remote cores
- How to implement communication?
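The Barrelfish side of the unmap steps above can be sketched as follows. This is a simplified model, not the Barrelfish implementation: the `Monitor` class, its `mappings` dictionary and its method names are hypothetical, and a direct method call stands in for sending a message and waiting for its acknowledgement.

```python
class Monitor:
    """Hypothetical per-core monitor holding that core's replica of the mappings."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.mappings = {}           # virtual page -> frame (local replica)

    def handle_unmap(self, vpage):
        """Remote side: drop the local mapping, then acknowledge."""
        self.mappings.pop(vpage, None)
        return self.core_id          # the ack carries the sender's core id

    def unmap(self, vpage, monitors):
        """Local side: commit the unmap on every core that holds the mapping."""
        targets = [m for m in monitors if vpage in m.mappings]
        # Stand-in for "send to each target, then wait for all acks".
        acks = {m.handle_unmap(vpage) for m in targets}
        missing = {m.core_id for m in targets} - acks
        if missing:
            raise RuntimeError(f"no ack from cores {sorted(missing)}")
        return sorted(acks)
```

The operation completes only once every core that held the mapping has acknowledged, which is the single-phase commit named on the slide; how the messages are routed (unicast, broadcast or multicast) is a separate choice.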
Unmap communication protocols
[Figure: unicast and broadcast message patterns]
Unmap communication protocols: Raw messaging cost
[Figure: latency (cycles × 1000, axis 0 to 14) against number of cores (2 to 32) for Broadcast and Unicast]
Why use multicast (8×4-core AMD system)
Multicast communication
- “NUMA-aware” multicast
Unmap communication protocols: Raw messaging cost
[Figure: latency (cycles × 1000, axis 0 to 14) against number of cores (2 to 32) for Broadcast, Unicast, Multicast and NUMA-Aware Multicast]
System knowledge base
- Constructing multicast tree requires hardware knowledge
  - Mapping of cores to sockets (CPUID data)
  - Messaging latency (online measurements)
- More generally, Barrelfish needs a way to reason about diverse system resources
- We tackle this with constraint logic programming [Schüpbach et al., MMCS’08]
- System knowledge base stores rich, detailed representation of hardware, performs online reasoning
- Initial implementation: port of the ECLiPSe constraint solver
  - Prolog query used to construct multicast routing tree
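To make the two inputs concrete, the sketch below builds a two-level multicast tree from a core-to-socket mapping and per-socket latencies. It is plain Python swapped in for illustration, not the Prolog query the slide refers to. The function name, the argument shapes and the policy of contacting the highest-latency sockets first are all assumptions of this example.

```python
def build_multicast_tree(source, core_socket, socket_latency):
    """Return (send_order, tree) for a two-level, socket-aware multicast.

    core_socket maps core id -> socket id (stand-in for CPUID data).
    socket_latency maps socket id -> measured latency from the source.
    tree maps each socket's leader core to the cores it forwards to.
    """
    members = {}
    for core in sorted(core_socket):
        members.setdefault(core_socket[core], []).append(core)

    tree = {}
    for cores in members.values():
        # The source leads its own socket; elsewhere the lowest-numbered core does.
        leader = source if source in cores else cores[0]
        tree[leader] = [c for c in cores if c != leader]

    # Assumed policy: contact the highest-latency sockets first so the
    # slowest transfers overlap with the remaining sends.
    remote_leaders = [c for c in tree if c != source]
    send_order = sorted(
        remote_leaders,
        key=lambda c: (-socket_latency[core_socket[c]], c),
    )
    return send_order, tree
```

The source sends one message per remote socket, and each leader forwards locally, so only one message per socket crosses the interconnect.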
Unmap latency
[Figure: latency (cycles × 1000, axis 0 to 60) against number of cores (2 to 32) for Windows, Linux and Barrelfish]
Summary of other results
- No penalty for shared-memory (SPLASH, OpenMP)
- Network throughput: 951.7 Mbit/s (same as Linux)
- Pipelined web server
  - Static: 640 Mbit/s vs. 316 Mbit/s for lighttpd/Linux
  - Dynamic: 3417 requests/s (17.1 Mbit/s) bottlenecked on SQL
Conclusion
- Modern computers are inherently distributed systems
- It’s time to rethink OS structure to match
- The Multikernel: model of the OS as a distributed system
  1. Explicit communication, replicated state
  2. Hardware-neutral OS structure
- Barrelfish: our concrete implementation
  - Reasonable performance on current hardware
  - Better scalability/adaptability for future hardware
  - Promising approach
www.barrelfish.org
Backup slides
URPC implementation
- Current hardware provides one communication mechanism: cache-coherent shared memory
- Can we “trick” cache-coherence protocol to send messages?
  - User-level RPC (URPC) [Bershad et al., 1991]
- Channel is shared ring buffer
  - Messages are cache-line sized
  - Sender writes message into next line
  - Receiver polls on last word
- Marshalling/demarshalling, naming, binding all implemented above
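The channel layout described above can be modelled in a few lines of Python. This is a sketch of the data layout only, under several assumptions: a line is taken to be eight words, the last word carries a non-zero sequence number, and the receiver frees a slot by zeroing that word. The names `UrpcChannel`, `try_send` and `try_recv` are hypothetical. Python lists do not model cache coherence or memory ordering; a real implementation must ensure the last word becomes visible only after the payload.

```python
LINE_WORDS = 8  # assumed: a 64-byte cache line of eight 8-byte words


class UrpcChannel:
    """Ring of line-sized slots; the last word of each line signals arrival."""

    def __init__(self, slots=4):
        self.lines = [[0] * LINE_WORDS for _ in range(slots)]
        self.send_pos = 0   # private to the sender
        self.recv_pos = 0   # private to the receiver

    def try_send(self, payload):
        """Write one message into the next line; False if the ring is full."""
        if len(payload) > LINE_WORDS - 1:
            raise ValueError("payload does not fit in one line")
        line = self.lines[self.send_pos % len(self.lines)]
        if line[-1] != 0:                  # slot not yet consumed
            return False
        padded = list(payload) + [0] * (LINE_WORDS - 1 - len(payload))
        line[:LINE_WORDS - 1] = padded     # payload words first
        line[-1] = self.send_pos + 1       # last word written last
        self.send_pos += 1
        return True

    def try_recv(self):
        """Poll the last word of the next line; None if nothing has arrived."""
        line = self.lines[self.recv_pos % len(self.lines)]
        if line[-1] != self.recv_pos + 1:
            return None
        payload = line[:LINE_WORDS - 1]
        line[-1] = 0                       # hand the slot back to the sender
        self.recv_pos += 1
        return payload
```

Polling a single word per slot means the receiver reads only one line until a message arrives, which is the property the following slide relies on when it calls polling cheap.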
Polling for receive: Tradeoff vs. IPIs
- Polling is cheap: line is local to receiver until message arrives
- Hardware-imposed costs for IPI (on 4×4-core AMD):
  - ≈800 cycles to send (from user-mode)
  - ≈1200 cycles lost in receive (to user-mode)
- There is a tradeoff here!
- IPIs are decoupled from fast-path messaging, used only for:
  1. Specific (batches of) operations that require low latency, even when other tasks are executing
  2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)