Operating Systems - Architecture

DESCRIPTION

From the Operating Systems course (CMPSCI 377) at UMass Amherst, Fall 2007.

TRANSCRIPT
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science
Emery Berger
University of Massachusetts Amherst
Operating Systems
CMPSCI 377
Architecture
Architecture
Hardware Support for Applications & OS
Architecture basics & details
Focus on characteristics exposed to application programmer / OS
The Memory Hierarchy
Registers
Caches
Associativity
Misses
Locality
Registers
Register = dedicated name for one word of memory managed by CPU
General-purpose: “AX”, “BX”, “CX” on x86
Special-purpose:
“SP” = stack pointer
“FP” = frame pointer
“PC” = program counter
Change processes: save current registers & load saved registers = context switch

[Figure: stack diagram with SP and FP marking the current frame, which holds arguments arg0 and arg1]
Caches
Access to main memory: “expensive”
~ 100 cycles (slow, relatively cheap)
Caches: small, fast, expensive memory
Hold recently-accessed data (D$) or instructions (I$)
Different sizes & locations
Level 1 (L1) – on-chip, smallish
Level 2 (L2) – on or next to chip, larger
Level 3 (L3) – pretty large, on bus
Manages lines of memory (32-128 bytes)
Memory Hierarchy
Higher = small, fast, more $, lower latency
Lower = large, slow, less $, higher latency
[Figure: the memory hierarchy; lines are loaded upward on access and evicted downward]

registers: 1-cycle latency
L1 (D$, I$ separate): 2-cycle latency
L2 (D$, I$ unified): 7-cycle latency
RAM: 100-cycle latency
Disk: 40,000,000-cycle latency
Network: 200,000,000+ cycle latency
Cache Jargon
Cache initially cold
Accessing data initially misses
Fetch from lower level in hierarchy
Bring line into cache (populate cache)
Next access: hit
Once cache holds most-frequently used data: “warmed up”
Context switch implications?
Cache Details
Ideal cache would be fully associative
That is, a single LRU (least-recently-used) queue over all lines
Generally too expensive
Instead, partition memory addresses into separate bins, each divided into ways
1-way = direct-mapped
2-way = 2 entries per bin
4-way = 4 entries per bin, etc.
Associativity Example
Hash memory addresses to different indices (bins) in the cache
Miss Classification
First access = compulsory miss
Unavoidable without prefetching
Too many items mapping to the same bin = conflict miss
Avoidable if we had higher associativity
No space in cache = capacity miss
Avoidable if cache were larger
Invalidated = coherence miss
Avoidable if cache were unshared
Exercise
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses for the following trace?
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
# compulsory misses?
# conflict misses?
# capacity misses?
Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses? 2
# capacity misses? 0
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
Locality
Locality = re-use of recently-used items
Temporal locality: re-use in time
Spatial locality: use of nearby items
In same cache line, same page (4K chunk)
Intuitively – greater locality = fewer misses
# misses depends on cache layout, # of levels, associativity…
Machine-specific
Quantifying Locality
Instead of counting misses, compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses

[Figure: LRU histogram built up access by access for the trace 3 7 7 2 3 7]
Quantifying Locality
Instead of counting misses, compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
cache size   1  2  3  4  5  6
hits         1  1  3  3  3  3
Quantifying Locality
Instead of counting misses, compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
Normalize

cache size    1    2    3  4  5  6
hit rate     .33  .33   1  1  1  1

[Figure: normalized hit curve rising from 0% to 100%]
Hit Curve Exercise
Derive hit curve for following trace:

3 5 4 2 8 3 6 9 9 6 13 7 2 5 8 10

Solution:

cache size   1  2  3  4  5  6  7  8  9
hits         1  2  2  2  3  3  4  5  6

[Figure: hit curve rising from 0% to 100% over cache sizes 1-9]
Important CPU Internals
Issues that affect performance
Pipelining
Branches & prediction
System calls (kernel crossings)
Scalar architecture + memory…
Straight-up sequential execution
Fetch instruction
Decode it
Execute it
Problem: instruction or data miss in cache
Result – stall: everything stops
How long to wait for miss all the way to RAM?
Superscalar architectures
Out-of-order processors
Pipeline of instructions in flight
Instead of stalling on load, guess!
Branch prediction
Value prediction
Predictors based on history, location in program
Speculatively execute instructions
Actual results checked asynchronously
If mispredicted, squash instructions
Accurate prediction = massive speedup
Hides latency of memory hierarchy
Pipelining and Branches
Instruction fetch
Instruction decode
Execute
Memory access
Write back
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.

[Figure: pipeline diagram stalling at an unresolved branch]
Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.

[Figure: the five pipeline stages (fetch, decode, execute, memory access, write back) continuing speculatively past the branch]
Kernel Mode
Protects OS from users
kernel = English for nucleus
Think atom
Only privileged code executes in kernel
System call –
Enters kernel mode
Flushes pipeline, saves context
Executes code in kernel land
Returns to user mode, restoring context

[Figure: control crossing from user land into the kernel and back to where we were in user land]
Timers & Interrupts
Need to respond to events periodically
Change executing processes
Quantum – time limit for process execution
Fairness – when timer goes off, interrupt
Current process stops
OS takes control through interrupt handler
Scheduler chooses next process
Interrupts also signal I/O events
Network packet arrival, disk read complete…
To do
Read C/C++ notes for next week
First homework assigned next week
Language: C/C++
Will be due in 2 weeks
36
The End