The Multikernel: A new OS architecture for scalable multicore systems

Andrew Baumann (1), Paul Barham (2), Pierre-Evariste Dagand (3), Tim Harris (2), Rebecca Isaacs (2), Simon Peter (1), Timothy Roscoe (1), Adrian Schüpbach (1), Akhilesh Singhania (1)

(1) Systems Group, ETH Zurich; (2) Microsoft Research, Cambridge; (3) ENS Cachan Bretagne

Systems Group | Department of Computer Science | ETH Zurich
SOSP, 12th October 2009

Introduction

How should we structure an OS for future multicore systems?
- Scalability to many cores
- Heterogeneity and hardware diversity

12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 2

2 key challenges:

System diversity

[Figure: Sun UltraSPARC timeline, 2004 to 2008, with labels 1x (UltraSPARC IIIi), 14x (UltraSPARC T1: eight cores, 32 threads), 35x (UltraSPARC T2: eight cores, 64 threads) and 65x ("Victoria Falls": two sockets, 16 cores, 128 threads). Block diagram of the T2: eight cores (C0 to C7), each with FPU and SPU, connected by a full crossbar to eight L2 cache banks; four memory controllers (MCU) with FB-DIMMs; network interface unit with 2x 10 Gigabit Ethernet; PCIe; system interface buffer switch core.]

Sun Niagara T2

AMD Opteron (Istanbul)

Intel Nehalem (Beckton)


(fast banked shared L2)

=> To scale well, need to optimize shared data structures for each design

(L2 private; 2X slower shared L3)

(across chips)

(banked cache + ring network)

The interconnect matters: Today’s 8-socket Opteron


(within the chip - different topologies)

HyperTransport (HT)

The interconnect matters: Tomorrow’s 8-socket Nehalem


(another topology, which calls for different optimizations)

The interconnect matters: On-chip interconnects

[Figure: two block diagrams of on-chip interconnects. The first shows in-order, multi-threaded, wide-SIMD cores with private I$/D$ attached to a coherent L2 cache, together with memory controllers (GDDR), texture logic, fixed-function logic, display interfaces, system interfaces and PCIe. The second shows a single tile (register file, pipelines P0 to P2, L1-I and L1-D caches, I-TLB and D-TLB, L2 cache, 2D DMA) whose switch attaches to five networks (MDN, TDN, UDN, IDN, STN), surrounded by four DDR2 controllers, PCIe, XAUI, GbE, flexible I/O, and UART/HPI/I2C/JTAG/SPI.]


Larrabee's ring (GPGPU)

Tilera's mesh

(chip internals look like networks; many different topologies)

Core diversity

- Within a system:
  - Programmable NICs
  - GPUs
  - FPGAs (in CPU sockets)
- On a single die:
  - Performance asymmetry
  - Streaming instructions (SIMD, SSE, etc.)
  - Virtualisation support


(asymmetric cores; some cores might leave out)

the cores themselves might differ substantially...

Summary

- Increasing core counts, increasing diversity
- Unlike HPC systems, cannot optimise at design time


(general-purpose = many apps, not just 1 app that gets its own HW)

=> There's a need for the system software to adapt to different HW

inter/intra

The multikernel model

- It’s time to rethink the default structure of an OS
  - Shared-memory kernel on every core
  - Data structures protected by locks
  - Anything else is a device
- Proposal: structure the OS as a distributed system
- Design principles:
  1. Make inter-core communication explicit
  2. Make OS structure hardware-neutral
  3. View state as replicated


(because of all the aforementioned trends...)

instead

of the multikernel

(soon to be discussed in detail)

Outline

Introduction

Motivation
  Hardware diversity

The multikernel model
  Design principles
  The model

Barrelfish

Evaluation
  Case study: Unmap


1. Make inter-core communication explicit

- All communication with messages (no shared state)
- Decouples system structure from inter-core communication mechanism
- Communication patterns explicitly expressed
- Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
- Better match for future hardware
  - ... with cheap explicit message passing (e.g. Tile64)
  - ... without cache-coherence (e.g. Intel 80-core)
- Allows split-phase operations
  - Decouple requests and responses for concurrency
- We can reason about it


hard but...

(no more locks! no need to reason about locality of cache lines)

(huge change)

(SCC)

(lots of accumulated knowledge from distributed systems)
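
A split-phase operation can be sketched in Python, with a thread standing in for a remote core and queues standing in for message channels. This is only an illustration under those assumptions: the names `server_core` and `split_phase_sum` are hypothetical and are not Barrelfish interfaces.

```python
import queue
import threading


def server_core(requests, replies):
    """Hypothetical 'remote core': squares each request and sends a reply."""
    while True:
        msg = requests.get()
        if msg is None:  # shutdown marker
            break
        req_id, value = msg
        replies.put((req_id, value * value))


def split_phase_sum(values):
    """Issue all requests first, then collect the responses (split-phase)."""
    requests, replies = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=server_core, args=(requests, replies))
    worker.start()
    # Phase 1: send every request without waiting for any response.
    for req_id, value in enumerate(values):
        requests.put((req_id, value))
    # The caller would be free to do unrelated work at this point.
    # Phase 2: match responses to requests by their id.
    results = {}
    while len(results) < len(values):
        req_id, squared = replies.get()
        results[req_id] = squared
    requests.put(None)
    worker.join()
    return sum(results.values())
```

Because requests carry an id, the responses may be consumed in any order, which is what lets the sender overlap its own work with the round-trip.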

Message passing vs. shared memory: experiment

Shared memory (move the data to the operation):
- Each core updates the same memory locations (no locking)
- Cache-coherence protocol migrates modified cache lines
  - Processor stalled while line is fetched or invalidated
  - Limited by latency of interconnect round-trips
  - Performance depends on data size (cache lines) and contention (number of cores)


(wouldn't explicit msg passing be really slow in comparison to coherence?)

opt1:

Shared memory results: 4×4-core AMD system

[Figure: latency (cycles × 1000, axis 0 to 12) vs. number of cores (2 to 16) for shared-memory updates, with series SHM1, SHM2, SHM4 and SHM8 added one per build step. Annotation: SHM1 is writing to one cache line. Callout: the latency is stalled cycles (no locking!).]


Message passing vs. shared memory: experiment

Message passing (move the operation to the data):
- A single server core updates the memory locations
- Each client core sends RPCs to the server
  - Operation and results described in a single cache line
  - Block while waiting for a response (in this experiment)


opt2:

(through a ring buffer in shared memory)
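
The message-passing variant of the experiment can be sketched as follows, assuming Python threads as stand-ins for cores and `queue.Queue` as a stand-in for the message channel. The function name `run_counter_server` and the `"inc"`/`"done"` operations are made up for this illustration; the sketch shows the structure (one owner, blocking RPCs), not the measured performance.

```python
import queue
import threading


def run_counter_server(num_clients, increments_per_client):
    """One server thread owns the counter; clients update it only via RPC."""
    requests = queue.Queue()
    counter = 0  # touched by the server thread only, so no lock is needed

    def server():
        nonlocal counter
        remaining = num_clients
        while remaining:
            op, reply = requests.get()
            if op == "done":
                remaining -= 1
            else:  # op == "inc"
                counter += 1
                reply.put(counter)

    def client():
        reply = queue.Queue(maxsize=1)
        for _ in range(increments_per_client):
            requests.put(("inc", reply))
            reply.get()  # block until the server responds
        requests.put(("done", None))

    threads = [threading.Thread(target=server)]
    threads += [threading.Thread(target=client) for _ in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The operation moves to the data: only the server ever reads or writes `counter`, so the data never has to migrate between clients.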

Message passing vs. shared memory: tradeoff (4×4-core AMD system)

[Figure: latency (cycles × 1000, axis 0 to 12) vs. number of cores (2 to 16), comparing shared memory (SHM1, SHM2, SHM4, SHM8) with message passing (MSG1, MSG8). Callouts added over the build steps: messaging is faster for ≥4 cores and ≥4 cache lines; a "Server" series shows the actual cost of the update at the server; the gap between client latency and server cost is "spare" cycles if the RPC was split-phase.]


2. Make OS structure hardware-neutral

- Separate OS structure from hardware
- Only hardware-specific parts:
  - Message transports (highly optimised / specialised)
  - CPU / device drivers
- Adaptability to changing performance characteristics
  - Late-bind protocol and message transport implementations
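
Late binding of a message transport might look like the sketch below. Everything in it is an assumption made for illustration: the transport names, the `Transport` record, the `bind_transport` function, and the idea of selecting by a measured latency table are not taken from the slides or from Barrelfish.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Transport:
    name: str
    send: Callable[[bytes], str]


# Hypothetical transports; real ones would wrap hardware-specific code.
TRANSPORTS = {
    "shared_memory": Transport("shared_memory", lambda m: f"shm:{len(m)}"),
    "pcie": Transport("pcie", lambda m: f"pcie:{len(m)}"),
}


def bind_transport(measured_latency: Dict[str, float], coherent: bool) -> Transport:
    """Choose a transport at bind time from measured latencies (in cycles).

    A shared-memory transport is only usable between cache-coherent cores.
    """
    usable = {
        name: latency
        for name, latency in measured_latency.items()
        if name in TRANSPORTS and (coherent or name != "shared_memory")
    }
    if not usable:
        raise ValueError("no usable transport")
    return TRANSPORTS[min(usable, key=usable.get)]
```

Code above the transport calls only `send`, so the OS structure stays the same when the choice changes from one machine to the next.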


these are the

the comm. lib:

(later)

3. View state as replicated

- Potentially-shared state accessed as if it were a local replica
  - Scheduler queues, process control blocks, etc.
- Required by message-passing model
- Naturally supports domains that do not share memory
- Naturally supports changes to the set of running cores
  - Hotplug, power management


Memory should in fact be "shared"; so the final design principle:

(as we don't "have" shared memory)

(NIC, GPU)

all

(can use results from distributed systems about components that are coming/going & have replicas)
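
Treating state as replicated can be sketched as per-core copies that change only by applying update messages. The `CoreReplica` class, the toy run queue, and the `broadcast` helper are hypothetical names for this illustration; a real system would also need an agreement protocol to order concurrent updates, which this sketch sidesteps by using a single sender.

```python
class CoreReplica:
    """Per-core replica of some OS state (here: a toy run queue)."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.run_queue = []  # local copy, never shared with other cores
        self.inbox = []      # update messages not yet applied

    def deliver(self):
        """Apply queued update messages to the local replica, in order."""
        for op, task in self.inbox:
            if op == "add":
                self.run_queue.append(task)
            elif op == "remove" and task in self.run_queue:
                self.run_queue.remove(task)
        self.inbox.clear()


def broadcast(replicas, op, task):
    """Send the same update message to every running core."""
    for replica in replicas:
        replica.inbox.append((op, task))
```

Each core reads its own `run_queue` without synchronisation; consistency comes from every replica applying the same messages in the same order.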

Replication vs. sharing as default

- Replicas used as an optimisation in previous systems:
  - Tornado, K42 clustered objects
  - Linux read-only data, kernel text
- In a multikernel, sharing is a local optimisation
  - Shared (locked) replica for threads or closely-coupled cores
  - Hidden, local
  - Only when faster, as decided at runtime
  - Basic model remains split-phase


(shared memory possible for applications)

(OS part can be specialized to arch)

Barrelfish

- From-scratch implementation of a multikernel
- Supports x86-64 multiprocessors (ARM soon)
- Open source (BSD licensed)


(already supported)

Barrelfish structure: Monitors and CPU drivers

- CPU driver serially handles traps and exceptions
- Monitor mediates local operations on global state
- URPC inter-core (shared memory) message transport on current (cache-coherent) x86 HW


Non-original ideas in Barrelfish

Multiprocessor techniques:
- Minimise shared state (Tornado, K42, Corey)
- User-space messaging decoupled from IPIs (URPC)
- Single-threaded non-preemptive kernel per core (K42)

Other ideas we liked:
- Capabilities for all resource management (seL4)
- Upcall processor dispatch (Psyche, Sched. Activations, K42)
- Push policy into application domains (Exokernel, Nemesis)
- Lots of information (Infokernel)
- Run drivers in their own domains (µkernels)
- EDF as per-core CPU scheduler (RBED)
- Specify device registers in a little language (Devil)


Applications running on Barrelfish

- Slide viewer (this one!)
- Webserver (www.barrelfish.org)
- Virtual machine monitor (runs unmodified Linux)
- SPLASH-2, OpenMP (benchmarks)
- SQLite
- ECLiPSe (constraint engine)
- more...


www.barrelfish.org


Evaluation goals

How do we evaluate an alternative OS structure?
- Good baseline performance
  - Comparable to existing systems on current hardware
- Scalability with cores
- Adaptability to different hardware
- Ability to exploit message-passing for performance


goal:

Case study: Unmap (TLB shootdown)

- Send a message to every core with a mapping, wait for all to be acknowledged
- Linux/Windows:
  1. Kernel sends IPIs
  2. Spins on shared acknowledgement count/event
- Barrelfish:
  1. User request to local monitor domain
  2. Single-phase commit to remote cores
- How to implement communication?
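
The unmap flow above can be sketched as a single round of requests and acknowledgements. This is a sequential simulation under stated assumptions: the `unmap` function and the `core_mappings` dictionary are hypothetical, and the "send" and "ack" steps are modelled as plain function-local operations rather than real inter-core messages.

```python
def unmap(page, core_mappings, initiator):
    """Single-phase commit sketch for unmapping a page.

    core_mappings maps core id -> set of pages that core may have cached
    in its TLB. One request goes to each remote core holding the mapping,
    and the operation completes once all of them have acknowledged.
    Returns the sorted list of remote cores that acknowledged.
    """
    targets = [
        core
        for core, pages in core_mappings.items()
        if page in pages and core != initiator
    ]
    acks = []
    for core in targets:
        core_mappings[core].discard(page)  # remote core invalidates its entry
        acks.append(core)                  # ... and replies with an ack
    core_mappings.get(initiator, set()).discard(page)
    if set(acks) != set(targets):          # would block here in a real system
        raise RuntimeError("missing acknowledgement")
    return sorted(acks)
```

Cores that never held the mapping receive no message, which is the point of tracking who has it.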


Unmap communication protocols

Unicast

Broadcast



Unmap communication protocols: Raw messaging cost

[Figure: latency (cycles × 1000, axis 0 to 14) vs. number of cores (2 to 32) for the Broadcast and Unicast protocols.]


Why use multicast: 8×4-core AMD system


Multicast communication

- “NUMA-aware” multicast


(more hops, so send to them first)
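
One way to picture a NUMA-aware multicast is a two-level tree: the source sends to one core per remote socket, farthest socket first, and that core forwards to its socket-local neighbours. The sketch below assumes a core-to-socket map and a per-socket latency table are available as inputs; the function name `numa_multicast_plan` and the choice of the lowest-numbered core as the per-socket forwarder are illustrative assumptions, not the Barrelfish implementation.

```python
def numa_multicast_plan(core_to_socket, socket_latency, source):
    """Return (order, tree) for a two-level multicast from `source`.

    order: remote forwarding cores, sorted farthest-first by socket latency.
    tree:  forwarding core -> list of cores it forwards to on its own socket
           (the source itself forwards to the other cores on its socket).
    """
    by_socket = {}
    for core, sock in sorted(core_to_socket.items()):
        if core != source:
            by_socket.setdefault(sock, []).append(core)
    src_sock = core_to_socket[source]
    tree = {}
    for sock, cores in by_socket.items():
        if sock == src_sock:
            leader, children = source, cores
        else:
            leader, children = cores[0], cores[1:]
        tree[leader] = children
    remote_leaders = [leader for leader in tree if leader != source]
    order = sorted(
        remote_leaders,
        key=lambda leader: socket_latency[core_to_socket[leader]],
        reverse=True,  # more hops away, so send to them first
    )
    return order, tree
```

Sending to the farthest sockets first lets the longest message flights overlap with the shorter ones.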

Unmap communication protocols: Raw messaging cost

[Figure: latency (cycles × 1000, axis 0 to 14) vs. number of cores (2 to 32) for four protocols: Broadcast, Unicast, Multicast and NUMA-Aware Multicast.]


System knowledge base

- Constructing multicast tree requires hardware knowledge
  - Mapping of cores to sockets (CPUID data)
  - Messaging latency (online measurements)
- More generally, Barrelfish needs a way to reason about diverse system resources
- We tackle this with constraint logic programming [Schüpbach et al., MMCS’08]
- System knowledge base stores rich, detailed representation of hardware, performs online reasoning
- Initial implementation: port of the ECLiPSe constraint solver
  - Prolog query used to construct multicast routing tree


user service:

Unmap latency

[Figure: unmap latency (cycles × 1000, axis 0 to 60) vs. number of cores (2 to 32) for Windows, Linux and Barrelfish.]


It is not important that Barrelfish is faster (similar optimizations can be done in other OSes too); what is important is to think of unmap as a communication problem in a distributed system.

Summary of other results

- No penalty for shared-memory (SPLASH, OpenMP)
- Network throughput: 951.7 Mbit/s (same as Linux)
- Pipelined web server
  - Static: 640 Mbit/s vs. 316 Mbit/s for lighttpd/Linux
  - Dynamic: 3417 requests/s (17.1 Mbit/s), bottlenecked on SQL


The actual results are not very important. The point is not that Barrelfish is better, but that its results are reasonable and comparable to other OSes, which makes it a viable option.

Conclusion

- Modern computers are inherently distributed systems
- It’s time to rethink OS structure to match
- The Multikernel: model of the OS as a distributed system
  1. Explicit communication, replicated state
  2. Hardware-neutral OS structure
- Barrelfish: our concrete implementation
  - Reasonable performance on current hardware
  - Better scalability/adaptability for future hardware
  - Promising approach


www.barrelfish.org

Backup slides


URPC implementation

- Current hardware provides one communication mechanism: cache-coherent shared memory
- Can we “trick” cache-coherence protocol to send messages?
  - User-level RPC (URPC) [Bershad et al., 1991]
- Channel is shared ring buffer
  - Messages are cache-line sized
  - Sender writes message into next line
  - Receiver polls on last word
  - Marshalling/demarshalling, naming, binding all implemented above
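
The channel layout described above can be modelled with a toy ring of fixed-size slots. This sketch makes several assumptions that are not in the slides: the class name `UrpcChannel`, the use of a sequence number as the polled last word, and flow control by reading the receiver's position directly (a real channel would carry acknowledgements in the reverse direction). Python gives no control over cache lines or memory ordering, so the sketch illustrates only the protocol.

```python
class UrpcChannel:
    """Toy single-sender, single-receiver ring of line-sized slots.

    Each slot models one cache line: payload words followed by a final
    sequence word, which the sender writes last and the receiver polls.
    """

    def __init__(self, slots=4, words_per_line=8):
        self.slots = slots
        self.payload_words = words_per_line - 1
        self.ring = [[0] * words_per_line for _ in range(slots)]
        self.send_seq = 1  # sequence number of the next message to write
        self.recv_seq = 1  # sequence number the receiver expects next

    def try_send(self, payload):
        if len(payload) > self.payload_words:
            raise ValueError("message larger than one line")
        if self.send_seq - self.recv_seq >= self.slots:
            return False  # ring full: receiver has not caught up
        line = self.ring[(self.send_seq - 1) % self.slots]
        padding = [0] * (self.payload_words - len(payload))
        line[: self.payload_words] = list(payload) + padding
        line[-1] = self.send_seq  # written last: marks the line as valid
        self.send_seq += 1
        return True

    def try_recv(self):
        line = self.ring[(self.recv_seq - 1) % self.slots]
        if line[-1] != self.recv_seq:  # poll the last word of the next line
            return None
        self.recv_seq += 1
        return line[: self.payload_words]
```

Marshalling, naming and binding would sit above this layer and see only `try_send` and `try_recv`.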


Polling for receive: Tradeoff vs. IPIs

- Polling is cheap: line is local to receiver until message arrives
- Hardware-imposed costs for IPI (on 4×4-core AMD):
  - ≈800 cycles to send (from user-mode)
  - ≈1200 cycles lost in receive (to user-mode)
- There is a tradeoff here!
- IPIs are decoupled from fast-path messaging, used only for:
  1. Specific (batches of) operations that require low latency, even when other tasks are executing
  2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)

