Operating Systems - Architecture

DESCRIPTION

From the Operating Systems course (CMPSCI 377) at UMass Amherst, Fall 2007.

TRANSCRIPT
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science
Emery Berger
University of Massachusetts Amherst
Operating Systems
CMPSCI 377
Architecture
Architecture
Hardware Support for Applications & OS
Architecture basics & details
Focus on characteristics exposed to application programmer / OS
The Memory Hierarchy
Registers
Caches
Associativity
Misses
Locality
Registers
Register = dedicated name for one word of memory managed by CPU
General-purpose: “AX”, “BX”, “CX” on x86
Special-purpose:
“SP” = stack pointer
“FP” = frame pointer
“PC” = program counter
Change processes: save current registers & load saved registers = context switch

[Figure: stack diagram with SP and FP marking the current frame, which holds arguments arg0 and arg1]
Caches
Access to main memory: “expensive”
~ 100 cycles (slow, relatively cheap)
Caches: small, fast, expensive memory
Hold recently-accessed data (D$) or instructions (I$)
Different sizes & locations
Level 1 (L1) – on-chip, smallish
Level 2 (L2) – on or next to chip, larger
Level 3 (L3) – pretty large, on bus
Manages lines of memory (32-128 bytes)
Memory Hierarchy
Higher = small, fast, more $, lower latency
Lower = large, slow, less $, higher latency
[Figure: the memory hierarchy; lines are loaded upward on access and evicted downward]

registers: 1-cycle latency
L1 (D$, I$ separate): 2-cycle latency
L2 (D$, I$ unified): 7-cycle latency
RAM: 100-cycle latency
Disk: 40,000,000-cycle latency
Network: 200,000,000+ cycle latency
Cache Jargon
Cache initially cold
Accessing data initially misses
Fetch from lower level in hierarchy
Bring line into cache (populate cache)
Next access: hit
Once cache holds most-frequently used data: “warmed up”
Context switch implications?
Cache Details
Ideal cache would be fully associative
That is, a single LRU (least-recently-used) queue over all lines
Generally too expensive
Instead, partition memory addresses into separate bins, each divided into ways
1-way = direct-mapped
2-way = 2 entries per bin
4-way = 4 entries per bin, etc.
Associativity Example
Hash memory addresses to different indices (bins) in the cache
Miss Classification
First access = compulsory miss
Unavoidable without prefetching
Too many items mapping to the same bin = conflict miss
Avoidable if we had higher associativity
No space in cache = capacity miss
Avoidable if cache were larger
Invalidated = coherence miss
Avoidable if cache were unshared
Exercise
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses for the following trace?
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
# compulsory misses?
# conflict misses?
# capacity misses?
Solution
Cache with 4 entries, 2-way associativity
Assume hash(x) = x % 4 (modulus)
How many misses?
# compulsory misses? 10
# conflict misses? 2
# capacity misses? 0
3 7 11 2 3 7 7 9 9 6 13 7 2 5 8 10
Locality
Locality = re-use of recently-used items
Temporal locality: re-use in time
Spatial locality: use of nearby items
In same cache line, same page (4K chunk)
Intuitively – greater locality = fewer misses
# misses depends on cache layout, # of levels, associativity…
Machine-specific
Quantifying Locality
Instead of counting misses, compute hit curve from LRU histogram
Assume perfect LRU cache
Ignore compulsory misses

[Figure: LRU histogram built up access by access for the trace 3 7 7 2 3 7]
Quantifying Locality
Instead of counting misses, compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
cache size   1  2  3  4  5  6
hits         1  1  3  3  3  3
Quantifying Locality
Instead of counting misses, compute hit curve from LRU histogram
Start with total misses on right hand side
Subtract histogram values
Normalize

cache size    1    2    3  4  5  6
hit rate     .33  .33   1  1  1  1

[Figure: normalized hit curve rising from 0% to 100%]
Hit Curve Exercise
Derive hit curve for following trace:

3 5 4 2 8 3 6 9 9 6 13 7 2 5 8 10

Solution:

cache size   1  2  3  4  5  6  7  8  9
hits         1  2  2  2  3  3  4  5  6

[Figure: hit curve rising from 0% to 100% over cache sizes 1-9]
Important CPU Internals
Issues that affect performance
Pipelining
Branches & prediction
System calls (kernel crossings)
Scalar architecture + memory…
Straight-up sequential execution
Fetch instruction
Decode it
Execute it
Problem: instruction or data miss in cache
Result – stall: everything stops
How long to wait for miss all the way to RAM?
Superscalar architectures
Out-of-order processors
Pipeline of instructions in flight
Instead of stalling on load, guess!
Branch prediction
Value prediction
Predictors based on history, location in program
Speculatively execute instructions
Actual results checked asynchronously
If mispredicted, squash instructions
Accurate prediction = massive speedup
Hides latency of memory hierarchy
Pipelining and Branches
Instruction fetch
Instruction decode
Execute
Memory access
Write back
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.

[Figure: pipeline diagram stalling at an unresolved branch]
Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.

[Figure: the five pipeline stages (fetch, decode, execute, memory access, write back) continuing speculatively past the branch]
Kernel Mode
Protects OS from users
kernel = English for nucleus
Think atom
Only privileged code executes in kernel
System call –
Enters kernel mode
Flushes pipeline, saves context
Executes code in kernel land
Returns to user mode, restoring context

[Figure: control crossing from user land into the kernel and back to where we were in user land]
Timers & Interrupts
Need to respond to events periodically
Change executing processes
Quantum – time limit for process execution
Fairness – when timer goes off, interrupt
Current process stops
OS takes control through interrupt handler
Scheduler chooses next process
Interrupts also signal I/O events
Network packet arrival, disk read complete…
To do
Read C/C++ notes for next week
First homework assigned next week
Language: C/C++
Will be due in 2 weeks
36
The End