
Improving IPC by Kernel Design

Jochen Liedtke

Shane Matthews

Portland State University

3/12/2004 Portland State University

Summary

• Review

• Performance improved

– Architecture Level

– Algorithmic Level

– Interface Level

– Coding Level


Micro-kernels

• Minimal OS, providing a set of primitives used to implement thread/address space management and IPC [1]

• Everything else is moved to user-space (servers)


Terminology (L3)

• Dataspace

– Memory object, mapped into address space

• Task

– Composed of threads, dataspaces, and an address space

• Message

– String/memory object


L3 Architecture & IPC

• Active components communicate via messages

• Applies to:

– Device drivers

• Implemented as user-level tasks

– Hardware interrupts

• Interrupt message from micro-kernel to thread


L3 Redesign Principles

• IPC performance is the master

– Security and performance must not be affected

• Synergetic effects taken into consideration

– (Think combined effects)

– May lead to reinforcement or diminution

• Design must aim at a performance goal

– Per short message transfer

– 350 cycles (7 microseconds at 50 MHz)


Architectural Level

• Messages

• Process Structure

• Control Blocks


Compound Messages

• Multiple send/receive -> 1 send/receive

• Messages consist of direct/indirect strings and memory objects


Twofold message copy

• [A space] -> [kernel] -> [B space]

• O(20 + 0.75n) cycles, n := bytes

• Good for small messages

• Need something better as n grows
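
As a rough illustration of why the twofold copy only suits small messages, the sketch below evaluates the 20 + 0.75n cycle figure for a few message sizes and converts cycles to microseconds at 50 MHz (the 486 DX-50 used for the measurements). The function names are hypothetical, and treating the figure as the cost of a single copy, so that a twofold transfer pays it twice, is an assumption rather than something the slide states.

```python
CLOCK_MHZ = 50  # 486 DX-50: 50 cycles per microsecond


def copy_cycles(n_bytes, copies=1):
    """Cycle estimate from the slide's 20 + 0.75n figure.

    Assumption: the figure describes one copy; pass copies=2 to model
    the twofold path [A space] -> [kernel] -> [B space].
    """
    return copies * (20 + 0.75 * n_bytes)


def cycles_to_us(cycles):
    """Convert a cycle count to microseconds at CLOCK_MHZ."""
    return cycles / CLOCK_MHZ


if __name__ == "__main__":
    for n in (8, 128, 4096):
        one = copy_cycles(n)
        two = copy_cycles(n, copies=2)
        print(f"{n:5d} bytes: one copy {one:7.1f} cycles, "
              f"two copies {two:7.1f} cycles ({cycles_to_us(two):.2f} us)")
```

For an 8-byte message the copy cost is small next to the 350-cycle goal, while for a 4096-byte message the second copy alone adds thousands of cycles.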


LRPC and SRC RPC

• Client/server share user-level memory

– sender -> shared buffer

• Problems

– When server to client is 1 to many, shared regions of address space become critical resources

– Shared regions require explicit opens (unlike L3)

– Message can change during/after checking


Direct Message Copy Via Windows

• L3's method

– Destination mapped into window

– Message copied to window

• Window

– per address space

– Accessed exclusively by kernel


Communication Windows

• Problems

– Must be fast

– Different threads coexisting within address space

• L3 Implementation

– Copy one word of the page directory from B to A


Process Structure

• Threads running in kernel mode have 1 kernel stack per thread

– Efficient since interrupts, page faults, and IPC already save state on the kernel stack

• Continuations

– Pro:

• Reduce kernel stack

– Cons:

• Require additional copies between kernel and continuation

• Interfere with other optimizations


Thread Control Blocks

• Implemented as large array in kernel

– Fast TCB access

• Array base + TCB # × TCB size

– Saves TLB misses (IPC)

• Kernel stacks of sender and receiver located in TCB page

– Locking done by unmapping the TCB


Algorithmic Level

• Thread Identifier

• Lazy Scheduling

• Short Messages Via Registers


Thread Identifier

• Thread addressed by 64-bit UID in user mode

• Thread number in lower 32 bits of UID

– AND with bit mask, add to TCB’s array base
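
The mask-and-add lookup can be sketched as follows. This is a minimal model, not L3's actual layout: the TCB size, the width of the thread-number field, the array base address, and the helper names are all hypothetical. The one structural assumption is that the thread number sits in the UID already shifted by log2 of the TCB size, so that a single AND yields the byte offset into the TCB array without a multiply.

```python
# All constants below are hypothetical, chosen only for illustration.
TCB_SIZE_LOG2 = 11                      # assume 2 KiB per TCB
TCB_SIZE = 1 << TCB_SIZE_LOG2
THREAD_BITS = 16                        # assume 16-bit thread numbers
TCB_BASE = 0xE000_0000                  # assumed base of the kernel TCB array
TCB_MASK = ((1 << THREAD_BITS) - 1) << TCB_SIZE_LOG2


def make_uid(thread_no, low_bits=0, high_word=0):
    """Build a 64-bit UID whose lower 32 bits carry the thread number,
    pre-shifted so that masking gives the offset thread_no * TCB_SIZE."""
    assert 0 <= thread_no < (1 << THREAD_BITS)
    assert 0 <= low_bits < TCB_SIZE
    assert 0 <= high_word < (1 << 32)
    low_word = (thread_no << TCB_SIZE_LOG2) | low_bits
    return (high_word << 32) | low_word


def tcb_address(uid):
    """AND with the bit mask, then add the TCB array base."""
    return TCB_BASE + (uid & TCB_MASK)


if __name__ == "__main__":
    uid = make_uid(3, low_bits=5, high_word=0xDEADBEEF)
    print(hex(tcb_address(uid)))
```

The result equals array base + thread number × TCB size, computed with one AND and one ADD and no table lookup.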


Lazy Scheduling

• IPC operation: call, or reply & receive next

– Delete sending thread from ready queue

– Insert into waiting queue

– Delete receiving thread from waiting queue

– Insert into ready queue

• Too many queue operations!


Lazy Scheduling cont.

• L3 queue invariants

– Ready queue contains at least all ready threads

– Waiting queue contains at least all threads waiting

• TCB contains thread's state (ready/waiting)

• Scheduler removes all threads not belonging to a queue during queue parsing
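
A minimal sketch of the idea, under stated assumptions: the class and field names are hypothetical, only the ready queue is modelled, and the running thread is allowed to be ready without being queued. An IPC only flips the state fields in the two TCBs; stale queue entries are dropped later, when the scheduler parses the queue.

```python
from collections import deque
from dataclasses import dataclass

READY, WAITING = "ready", "waiting"


@dataclass
class TCB:
    name: str
    state: str = READY
    queued: bool = False        # is this TCB linked into the ready queue?


class LazyScheduler:
    def __init__(self):
        self.ready_queue = deque()
        self.current = None
        self.queue_ops = 0      # counts inserts and deletes, for illustration

    def _enqueue(self, tcb):
        if not tcb.queued:
            self.ready_queue.append(tcb)
            tcb.queued = True
            self.queue_ops += 1

    def add_thread(self, tcb):
        self._enqueue(tcb)

    def ipc_call(self, sender, receiver):
        """Sender blocks, receiver runs: only TCB states change."""
        sender.state = WAITING
        receiver.state = READY
        self.current = receiver

    def schedule(self):
        """Pick the next ready thread, discarding stale entries on the way."""
        if self.current is not None and self.current.state == READY:
            self._enqueue(self.current)     # restore the invariant lazily
        while self.ready_queue:
            tcb = self.ready_queue.popleft()
            tcb.queued = False
            self.queue_ops += 1
            if tcb.state == READY:
                self.current = tcb
                return tcb
            # Not ready: the entry was stale, so it is simply dropped.
        self.current = None
        return None


if __name__ == "__main__":
    sched = LazyScheduler()
    client, server = TCB("client"), TCB("server", state=WAITING)
    sched.add_thread(client)
    sched.schedule()
    before = sched.queue_ops
    for _ in range(100):                    # 100 call/reply round trips
        sched.ipc_call(client, server)
        sched.ipc_call(server, client)
    print("queue operations during IPC:", sched.queue_ops - before)
```

In this model the 200 IPC operations perform no queue operations at all, compared with four per operation in the eager scheme listed on the previous slide.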


Short Messages Via Registers

• High proportion of messages are short

– Ex. Driver ack/error, hardware interrupts

• 486

– 7 general registers

– 3 needed for sender ID and result code

– 4 available

• 8-byte messages using coding scheme
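
To make the register path concrete, the sketch below packs a message of up to 8 bytes into two 32-bit register values and unpacks it again. The slide does not describe the actual coding scheme, so the little-endian, zero-padded layout and the function names here are purely illustrative assumptions.

```python
import struct


def to_registers(payload: bytes):
    """Pack up to 8 bytes into two 32-bit values (assumed layout:
    little-endian, zero-padded on the right)."""
    if len(payload) > 8:
        raise ValueError("short-message path carries at most 8 bytes")
    return struct.unpack("<II", payload.ljust(8, b"\x00"))


def from_registers(r0: int, r1: int, length: int = 8) -> bytes:
    """Rebuild the first `length` bytes of the message on the receiver side."""
    return struct.pack("<II", r0, r1)[:length]


if __name__ == "__main__":
    regs = to_registers(b"ACK")
    print([hex(r) for r in regs], from_registers(*regs, length=3))
```

Anything longer than 8 bytes would fall back to the memory copy path in this model.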


Interface Level

• Simple RPC stubs

– Load registers, system call, check success

– Compiler generates stubs inline

• Parameter Passing

– Use registers when possible


Coding Level

• Reduce cache and TLB misses

– Short kernel code

• Short jumps, use registers, short address displacements

– IPC kernel code in one page

– Handle save/restore of coprocessor lazily

• Delayed until different thread needs to use it
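
The lazy coprocessor handling can be modelled in a few lines. This is a sketch with hypothetical names, not L3 code: a thread switch leaves the coprocessor untouched, and the save/restore happens only when a thread other than the current owner actually touches it.

```python
class LazyCoprocessor:
    def __init__(self):
        self.current = None     # thread running on the CPU
        self.owner = None       # thread whose state is in the coprocessor
        self.registers = None   # stand-in for the coprocessor register file
        self.saved = {}         # per-thread saved coprocessor state
        self.swaps = 0          # number of save/restore operations performed

    def switch_to(self, thread):
        """Thread switch: no coprocessor work is done here."""
        self.current = thread

    def _claim(self):
        """On coprocessor use, swap state only if another thread owns it."""
        if self.owner != self.current:
            if self.owner is not None:
                self.saved[self.owner] = self.registers
            self.registers = self.saved.get(self.current)
            self.owner = self.current
            self.swaps += 1

    def write(self, value):
        self._claim()
        self.registers = value

    def read(self):
        self._claim()
        return self.registers


if __name__ == "__main__":
    fpu = LazyCoprocessor()
    fpu.switch_to("A")
    fpu.write(1.5)
    for _ in range(10):         # B never touches the coprocessor
        fpu.switch_to("B")
        fpu.switch_to("A")
    print("save/restore operations:", fpu.swaps)
```

Threads that never use the coprocessor, which is the common case for short IPC paths, never pay for saving or restoring its state.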


Results

• An increase of 100% would mean the IPC time doubles

• Removing all optimizations increases IPC time by 134% for an 8-byte message


Results

• L3 vs. Mach

• System

– Intel 486 DX-50

– 256 KB external cache

– 16 MB memory


Results cont.


Conclusions

• IPC improved by applying

– Performance based reasoning

– Synergetic effects

– Design at all levels, from architecture to coding


References

• [1] http://en.wikipedia.org/wiki/Micro_kernel

• [2] Improving IPC by Kernel Design - Jochen Liedtke
