Appendix A & Ch. 5: Intro, Basic Pipelining & Memory Hierarchy Review


Page 1: Appendix A & Ch. 5: Intro, Basic Pipelining & Memory Hierarchy Review

Page 2: Simple RISC Pipeline (MIPS)

                   Clock number
Instruction #      1    2    3    4    5    6    7    8
Instruction i      IF   ID   EX   MEM  WB
Instruction i+1         IF   ID   EX   MEM  WB
Instruction i+2              IF   ID   EX   MEM  WB
Instruction i+3                   IF   ID   EX   MEM  WB
Instruction i+4                        IF   ID   EX   MEM
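The staggering above is purely mechanical, so it is easy to generate. A minimal sketch (not from the slides; the instruction labels are made up) that prints this diagram for a run of independent instructions:

```python
# Minimal sketch: print the staggered 5-stage pipeline diagram.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def print_pipeline(n_instructions, n_cycles=8):
    header = "".join(f"{c:>5}" for c in range(1, n_cycles + 1))
    print(" " * 16 + header)
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            if i + s < n_cycles:       # instruction i is in stage s at cycle i + s (0-based)
                row[i + s] = stage
        print(f"Instr {i:<10}" + "".join(f"{x:>5}" for x in row))

print_pipeline(5)
```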

Page 3: 5 Steps of MIPS Datapath

[Figure: the pipelined MIPS datapath. Stages: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Blocks include Next PC / Next SEQ PC logic with a +4 adder, instruction memory (Address), the register file (ports RS1, RS2, RD), sign extend of the immediate (Imm), MUXes feeding the ALU with its Zero? test, data memory, and the WB data MUX. The control path travels down the pipeline alongside the datapath, so Inst 1, Inst 2, Inst 3 each carry their control through successive stages.]

Page 4: Limits to Pipelining

• Hazards: circumstances that would cause incorrect execution if next instruction were launched

– Structural hazards: Attempting to use the same hardware to do two different things at the same time

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

Page 5: Structural Hazard Example: Shared Register File

In cycle 5, the Load instruction and Instr 3 use the same register file: a structural hazard.

[Figure: five instructions (Load, Instr 1-4) in order, each passing through Ifetch, Reg, ALU, DMem, Reg across cycles 1-7; Load's register write in cycle 5 collides with Instr 3's register read.]

Solution: write in the 1st half of the clock cycle, read in the 2nd half.

Page 6: Resolving Structural Hazards

• Defn: an attempt to use the same hardware for two different things at the same time
• Solution 1: Wait; must detect the hazard and must have a mechanism to stall
• Solution 2: Modify the pipeline

Page 7: Data Hazards (Data Dependencies), Figure A.6, CA:AQA 3e

dadd writes r1, and every following instruction reads it:

  dadd r1,r2,r3
  dsub r4,r1,r3
  and  r6,r1,r7
  or   r8,r1,r9
  xor  r10,r1,r11

[Figure: the five instructions flowing through IF, ID/RF, EX, MEM, WB across clock cycles; r1 is not written until dadd's WB, yet dsub, and, and or read it in earlier cycles.]

Page 8: Three Generic Data Hazards

• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it

  I: add r1,r2,r3
  J: sub r4,r1,r3

• Caused by a "Data Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Page 9: Three Generic Data Hazards

• Write After Read (WAR): InstrJ writes an operand before InstrI reads it

  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
• Occurs in superscalars

Page 10: Three Generic Data Hazards

• Write After Write (WAW): InstrJ writes an operand before InstrI writes it

  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
• WAR and WAW occur in superscalar pipelines
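To make the three categories concrete, here is a minimal sketch (not from the slides; the dict-based instruction format is a made-up representation) that classifies the hazards between an earlier instruction I and a later instruction J from their register fields:

```python
# Minimal sketch: classify RAW / WAR / WAW between an earlier instruction I
# and a later instruction J, given destination and source registers.
def classify_hazards(I, J):
    """I and J are dicts like {'dst': 'r1', 'src': ['r2', 'r3']} (hypothetical format)."""
    hazards = []
    if I['dst'] and I['dst'] in J['src']:
        hazards.append('RAW')   # J reads what I writes
    if J['dst'] and J['dst'] in I['src']:
        hazards.append('WAR')   # J writes what I reads
    if I['dst'] and I['dst'] == J['dst']:
        hazards.append('WAW')   # both write the same register
    return hazards

# I: sub r1,r4,r3   J: add r1,r2,r3  -> output dependence on r1
print(classify_hazards({'dst': 'r1', 'src': ['r4', 'r3']},
                       {'dst': 'r1', 'src': ['r2', 'r3']}))  # ['WAW']
```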

Page 11: Forwarding to Avoid Data Hazard, Figure A.7, CA:AQA 3e

  dadd r1,r2,r3
  dsub r4,r1,r3
  and  r6,r1,r7
  or   r8,r1,r9
  xor  r10,r1,r11

[Figure: the same instruction sequence, now with forwarding paths carrying dadd's ALU result from the EX/MEM and MEM/WB pipeline registers straight to the ALU inputs of the following instructions, removing the stalls.]

Page 12: Data Hazard Even with Forwarding (Load Followed by Immediate Use), Figure A.9, CA:AQA 3e

  ld  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure: ld produces r1 only at the end of its MEM stage, but sub needs it at the start of its EX stage in that same cycle, so forwarding alone cannot fix this case.]

Page 13: Resolving the Load Data Hazard

  ld  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure: the same sequence with a one-cycle bubble inserted after ld; sub, and, and or each slip by one cycle.]

One clock stall, then forward the data from memory (cache) to the ALU.
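A minimal sketch of the two decisions involved here (textbook forwarding and load-use stall logic rendered in Python, not the slides' hardware; the field names are assumptions):

```python
# Minimal sketch: load-use stall detection and forwarding selection for a
# classic 5-stage MIPS pipeline.

def needs_load_stall(id_ex, if_id):
    """Stall if the instruction in EX is a load whose destination register
    is a source of the instruction currently in ID."""
    return (id_ex['is_load'] and
            id_ex['rd'] != 0 and
            id_ex['rd'] in (if_id['rs'], if_id['rt']))

def forward_a(ex_mem, mem_wb, id_ex_rs):
    """Pick the ALU's first operand source: EX/MEM first (newest value),
    else MEM/WB, else the register file."""
    if ex_mem['reg_write'] and ex_mem['rd'] != 0 and ex_mem['rd'] == id_ex_rs:
        return 'EX/MEM'
    if mem_wb['reg_write'] and mem_wb['rd'] != 0 and mem_wb['rd'] == id_ex_rs:
        return 'MEM/WB'
    return 'REGFILE'

# ld r1,0(r2) in EX, sub r4,r1,r6 in ID -> must stall one cycle
print(needs_load_stall({'is_load': True, 'rd': 1}, {'rs': 1, 'rt': 6}))  # True
# After the stall, r1 is forwarded from the EX/MEM register:
print(forward_a({'reg_write': True, 'rd': 1},
                {'reg_write': False, 'rd': 0}, 1))  # 'EX/MEM'
```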

Page 14: Software Scheduling Avoids Load Hazards (done by the compiler)

Try producing fast code for

  a = b + c;
  d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd

Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd

In the slow code, ADD and SUB each immediately use a just-loaded value, incurring load-use stalls; the fast code moves the loads away from their uses.

Page 15: Control Hazard on Branches => Three-Stage Stall (branch penalty!)

  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11

[Figure: the branch outcome is not known until late in the pipeline, so the three instructions fetched after beq (at 14, 18, 22) may have to be thrown away before fetch restarts at the target (36).]

Page 16: Example: Branch Stall Impact

• If 30% of instructions are branches and each stalls 3 cycles, the effect is significant: CPI_new = 1 + 0.30 x 3 = 1.9!
• Two-part solution:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the taken-branch address earlier
• The MIPS branch tests whether a register = 0 or != 0
• MIPS solution:
  – Move the zero test to the ID/RF stage
  – Add an adder to calculate the new PC in the ID/RF stage
  – 1 clock cycle penalty for a branch versus 3
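The slide's arithmetic as a one-liner (a minimal sketch; the base CPI of 1.0 is the usual idealized assumption for the 5-stage pipeline):

```python
# Minimal sketch: pipeline CPI with branch stalls.
def cpi_with_branches(base_cpi, branch_freq, branch_penalty):
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branches(1.0, 0.30, 3))  # 1.9: the three-cycle stall above
print(cpi_with_branches(1.0, 0.30, 1))  # 1.3: branch resolved in ID/RF
```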

Page 17: Pipelined MIPS Datapath, Figure A.24, CA:AQA 3/e

[Figure: the same 5-stage datapath as before (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back, with the IF/ID, ID/EX, EX/MEM, MEM/WB registers, register file ports RS1/RS2/RD, Sign Extend, ALU, MUXes, and Data Memory), but with the Zero? test and the branch-target adder moved into the ID/RF stage, so Next SEQ PC is selected one stage earlier.]

• Data stationary control: local decode for each instruction phase / pipeline stage

Page 18: Four Simple Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – "Squash" instructions in the pipeline if the branch is actually taken
  – Advantage of late pipeline state update
  – 47% of MIPS branches not taken on average
  – PC+4 already calculated, so use it to get the next instruction

#3: Predict Branch Taken
  – 53% of MIPS branches taken on average
  – But the branch target address hasn't been calculated yet in MIPS
    » MIPS still incurs a 1-cycle branch penalty
    » Other machines: branch target known before outcome

Page 19: Predict Not Taken: Branch Untaken vs. Taken

Untaken branch (the pipeline simply proceeds):

Untaken branch instruction   IF  ID  EX  MEM WB
Instruction i+1                  IF  ID  EX  MEM WB
Instruction i+2                      IF  ID  EX  MEM WB
Instruction i+3                          IF  ID  EX  MEM
Instruction i+4                              IF  ID  EX

Taken branch (the instruction already fetched is idled):

Taken branch instruction     IF  ID   EX   MEM  WB
Instruction i+1                  IF   idle idle idle idle
Branch target                         IF   ID   EX   MEM  WB
Branch target + 1                          IF   ID   EX   MEM
Branch target + 2                               IF   ID   EX

Page 20: Delayed Branch: Behavior Is the Same Whether Taken or Untaken

Untaken branch:

Untaken branch instruction   IF  ID  EX  MEM WB
Branch delay instr (i+1)         IF  ID  EX  MEM WB
Instruction i+2                      IF  ID  EX  MEM WB
Instruction i+3                          IF  ID  EX  MEM
Instruction i+4                              IF  ID  EX

Taken branch:

Taken branch instruction     IF  ID  EX  MEM WB
Branch delay instr (i+1)         IF  ID  EX  MEM WB
Branch target                        IF  ID  EX  MEM WB
Branch target + 1                        IF  ID  EX  MEM
Branch target + 2                            IF  ID  EX

Page 21: Four Simple Branch Hazard Alternatives

#4: Delayed Branch
  – Define the branch to take place AFTER a following instruction:

      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n     <-- branch delay of length n
      branch target (if taken)

  – A 1-slot delay allows a proper decision and the branch target address in a 5-stage pipeline
  – Used in MIPS

Page 22: Scheduling the Branch Delay Slot

[Figure not captured in the transcript.]

Page 23: Delayed Branch

• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From the fall-through: only valuable when the branch is not taken
  – Canceling branches allow more slots to be filled
• Compiler effectiveness for a single branch delay slot:
  – Fills about 60% of branch delay slots
  – About 80% of instructions executed in branch delay slots are useful in computation
  – So about 50% (60% x 80%) of slots are usefully filled
• Delayed branch downside: it fits poorly with 7-8 stage pipelines and with multiple instructions issued per clock (superscalar)

Page 24: Precise Exceptions in Static Pipelines

Key observation: architected state only changes in the memory and register write stages.

Page 25: Deeper Pipeline Example: MIPS R4000

The R4000 uses an 8-stage pipeline: IF, IS (instruction fetch, first and second halves), RF, EX, DF, DS (data fetch, first and second halves), TC (tag check), WB.

Page 26: Load with Immediate Use = 2-Cycle Penalty

Page 27: Load with Immediate Use: 2-Cycle Penalty

                 1   2   3   4   5      6      7   8   9
LD   R1,..       IF  IS  RF  EX  DF     DS     TC
DADD R2,R1,..        IF  IS  RF  stall  stall  EX  DF  DS
DSUB R3,R1,..            IF  IS  stall  stall  RF  EX  DF
OR   R4,R1,..                IF  stall  stall  IS  RF  EX

Page 28: R4000 Taken/Untaken Branch Penalty

Taken branch: a one-cycle delay slot followed by a two-cycle stall:

                       1   2      3      4      5      6      7
branch instruction     IF  IS     RF     EX     DF     DS     TC
delay slot                 IF     IS     RF     EX     DF     DS
(stall)                    stall  stall  stall  stall  stall  stall
(stall)                           stall  stall  stall  stall  stall
branch target                                   IF     IS     RF

Untaken branch: only the delay slot is spent:

                       1   2   3   4   5   6   7
branch instruction     IF  IS  RF  EX  DF  DS  TC
delay slot                 IF  IS  RF  EX  DF  DS
branch instruction+2           IF  IS  RF  EX  DF
branch instruction+3               IF  IS  RF  EX

Page 29: Architecture: Abstraction Layers in Modern Systems

[Figure: the abstraction stack, top to bottom: Application, Algorithm, Programming Language, Operating System / Virtual Machine, Instruction Set Architecture (ISA), Microarchitecture, Gates / Register-Transfer Level (RTL), Circuits, Devices, Physics. The ISA was the original domain of the computer architect ('50s-'80s); the domain of computer architecture in the '90s broadened around the ISA and microarchitecture; reliability, power, parallel computing, and security drove a reinvigoration of computer architecture from the mid-2000s onward.]

Page 30: Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: Processor (Control, Datapath, Registers) -> On-Chip Cache -> Second-Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Tape).]

Roughly: registers ~1 ns and 100s of bytes; caches ~10 ns and KBs; main memory ~100 ns and MBs; disk ~10s of ms and GBs; tape ~10s of sec and TBs.

Page 31: Memory Hierarchy: Terminology

• Hit: the data appears in some block in the upper level (example: Block X)
  – Hit Rate: the fraction of memory accesses found in the upper level
  – Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
• Miss: the data needs to be retrieved from a block in the lower level (Block Y)
  – Miss Rate = 1 - (Hit Rate)
  – Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on the 21264!)

[Figure: blocks move between Upper Level Memory (Blk X) and Lower Level Memory (Blk Y), with data flowing to and from the processor.]
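These definitions combine into the standard average memory access time formula. A minimal sketch (the formula is standard; the numbers below are made up for illustration):

```python
# AMAT = hit time + miss rate * miss penalty
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical cache: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(1, 0.05, 100))  # 6.0 cycles on average
```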

Page 32: Memory Hierarchy Technology

• Random Access:
  – "Random" is good: access time is the same for all locations
  – DRAM: Dynamic Random Access Memory
    » High density, low power, cheap, slow
    » Dynamic: needs to be "refreshed" regularly
  – SRAM: Static Random Access Memory
    » Low density, high power, expensive, fast
    » Static: content lasts "forever" (until power is lost)
• "Not-so-random" Access Technology:
  – Access time varies from location to location and from time to time
  – Examples: disk, CDROM, DRAM page-mode access
• Sequential Access Technology: access time linear in location (e.g., tape)
• Main memory: DRAM. Caches: SRAM. Plus Flash.

Page 33: Why Memory Hierarchy?

• The processor-DRAM performance gap keeps growing
• The two goals (speed and size) conflict
• Principle of locality: programs access a relatively small portion of their address space. If an item is referenced:
  – it will tend to be referenced again soon (e.g., loops): temporal locality
  – nearby items will tend to be referenced soon (straight-line code): spatial locality
• How the memory hierarchy works
• Goal of the memory hierarchy: achieve
  – the speed of the highest level (cache)
  – the size of the lowest level (disk)

Page 34: Memory Hierarchy Example: Apple iMac G5 (PowerPC 970, prior to the Core Duo)

iMac G5, 1.6 GHz:

                  Reg     L1 Inst  L1 Data  L2      DRAM    Disk
Size              1K      64K      32K      512K    256M    80G
Latency (cycles)  1       3        3        11      88      10^7
Latency (time)    0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

Managed by the compiler (registers), by hardware (caches), and by OS, hardware, and application (main memory and disk).

Page 35: Intel Nehalem Cache Hierarchy

[Figure not captured in the transcript.]

Page 36: Four Questions for Caches and the Memory Hierarchy

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

Page 37: Format of a Block / Line in Cache

  Valid | Tag | Data

• Tag: the higher-order bits of the memory address of the data
  – With memory size = 2^32 bytes and cache size = 2^r bytes, how many bits should the tag field have? (32 - r)
• Valid = 1: valid block / line; Valid = 0: invalid block / line

Page 38: Q2: How Is a Block Found in the Upper Level?

Address split:  Block Address = [ Tag | Index ], followed by [ Block offset ]
                (Tag: set-select comparison; Index: set select; Block offset: data select)

• The Index is used to look up candidates in the cache
  – The Index identifies the set
• The Tag is used to identify the actual copy
  – If no candidates match, declare a cache miss
• The block is the minimum quantum of caching
  – The data-select field selects data within the block
  – Many caching applications don't have a data-select field

Page 39: Block Size and Spatial Locality

The block is the unit of transfer between cache and memory.

Split CPU address:  [ block address / Tag (32-b bits) | offset (b bits) ]
2^b = block size, a.k.a. line size (in bytes). Example in the figure: a 4-word block (Word0-Word3), b = 2 (word select).

Larger block sizes have distinct hardware advantages:
• less tag overhead
• exploit fast burst transfers from DRAM
• exploit fast burst transfers over wide busses

What are the disadvantages of increasing block size?
Fewer blocks => more conflicts. Can waste bandwidth.

Page 40: Example: 2 KB Direct-Mapped Cache with 32 B Blocks

– The uppermost 21 bits are the Cache Tag
– The lowest 5 bits are the Byte Select (block size = 2^5 = 32)
– 2 KB cache: 2^6 = 64 lines x 2^5 = 32 bytes/line
– On a cache miss, a "Cache Block" (a.k.a. "Cache Line") is read from memory

Address split: Cache Tag (bits 31-11, 21 bits) | Cache Index (bits 10-5, 6 bits) | Byte Select (bits 4-0, 5 bits)

[Figure: a 64-entry direct-mapped cache; each line stores a Valid bit, a 21-bit Cache Tag, and 32 bytes of Cache Data (Byte 0 ... Byte 31 for line 0, Byte 32 ... Byte 63 for line 1, up to Byte 2047 for line 63).]
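A minimal sketch of this address split (not from the slides; the example address is arbitrary):

```python
# Minimal sketch: split a 32-bit address for a direct-mapped cache
# (2 KB cache, 32 B blocks -> 5 offset bits, 6 index bits, 21 tag bits).
CACHE_BYTES = 2 * 1024
BLOCK_BYTES = 32
NUM_LINES   = CACHE_BYTES // BLOCK_BYTES        # 64
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1      # 5
INDEX_BITS  = NUM_LINES.bit_length() - 1        # 6

def split_address(addr):
    offset = addr & (BLOCK_BYTES - 1)
    index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x12345678))  # (149130, 51, 24) = tag, line index, byte offset
```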

Page 41: Set Associative Cache Example: 2-Way

• Example: 2 KB two-way set associative cache, 32 B lines
  – Two direct-mapped caches operate in parallel, 32 lines x 32 bytes each
  – The Cache Index (5 bits) selects a "set" from the cache
  – The two tags in the set are compared to the address in parallel (Tag = 22 bits)
  – Data is selected based on the tag comparison
  – Two 22-bit comparators are needed to compare the tags

[Figure: two 1 KB banks, each holding Valid bits, 22-bit Cache Tags, and Cache Data; the Cache Index selects Cache Block 0 from each bank, both tags are compared against the address tag in parallel, the comparator outputs are ORed into Hit, and Sel1/Sel0 drive a MUX that picks the matching Cache Block.]
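A minimal sketch of the lookup (not the slides' hardware; software checks the ways sequentially where the hardware compares them in parallel, and data storage is omitted):

```python
# Minimal sketch: lookup in a 2-way set associative cache
# (2 KB, 32 B lines -> 32 sets, 5 offset bits, 5 index bits, 22-bit tags).
NUM_SETS, OFFSET_BITS, INDEX_BITS = 32, 5, 5

# Each set holds two ways; each way is (valid, tag).
cache = [[{'valid': False, 'tag': 0}, {'valid': False, 'tag': 0}]
         for _ in range(NUM_SETS)]

def lookup(addr):
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    for way in cache[index]:          # "in parallel" in real hardware
        if way['valid'] and way['tag'] == tag:
            return True               # hit
    return False                      # miss

cache[19][0] = {'valid': True, 'tag': 0x12345678 >> 10}
print(lookup(0x12345678))  # True: set 19, matching tag in way 0
```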

Page 42: Fully Associative Cache

• Fully associative cache example: 2 KB cache, 32-byte lines
  – No Cache Index (still 64 lines)
  – Compare the Cache Tags of all cache entries in parallel!
  – With 32 B blocks the tag is 27 bits (32 - 5), so we need a 27-bit comparator per line: 64 of them
• By definition: conflict misses = 0 for a fully associative cache

[Figure: every line's Valid bit and 27-bit Cache Tag feed a bank of = comparators against the address tag simultaneously; the low 5 bits are the Byte Select.]

Page 43: Q3: Which Block Should Be Replaced on a Miss?

• Easy for direct mapped
• Set associative or fully associative:
  – LRU (Least Recently Used): appealing, but hard to implement for high associativity
  – Random: easy, but how well does it work?

Miss rates, LRU vs. random:

         2-way             4-way             8-way
Size     LRU     Random    LRU     Random    LRU     Random
16K      5.2%    5.7%      4.7%    5.3%      4.4%    5.0%
64K      1.9%    2.0%      1.5%    1.7%      1.4%    1.5%
256K     1.15%   1.17%     1.13%   1.13%     1.12%   1.12%
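A minimal sketch of LRU bookkeeping for a single set (not from the slides; an ordered dict stands in for the hardware's recency state, and tags stand in for whole blocks):

```python
# Minimal sketch: LRU replacement for one cache set
# (most recently used entries kept at the end of an OrderedDict).
from collections import OrderedDict

class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> data

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # refresh recency
            return 'hit'
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[tag] = None               # fill on miss
        return 'miss'

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# ['miss', 'miss', 'hit', 'miss', 'miss']: tag 2 was LRU when 3 arrived
```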

Page 44: Sources of Cache Misses

• Compulsory (cold start or process migration; first reference): the first access to a block
  – A "cold" fact of life
  – If you are running "billions" of instructions, compulsory misses are insignificant
• Capacity:
  – The cache cannot contain all the blocks accessed by the program
  – Solution: increase cache size
• Conflict (collision):
  – Multiple memory locations map to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity
• Coherence (invalidation): another process (e.g., I/O) updates memory

Page 45: Write Policy: What Happens on a Write? A: Write-Through, B: Write-Back

• Write-through: all writes update both the cache and memory (or the next-level cache)
  – Can always discard cached data; the most up-to-date data is in memory
  – Cache control bit: valid bit only
• Write-back: all writes simply update the cache
  – Can't just discard cached data
  – Update the lower level when the block falls out of the cache
  – Cache control bits: valid and dirty (modified) bits
• Other advantages:
  – Write-through:
    » memory (or other processors) always has the latest data
    » simpler management of the cache
  – Write-back:
    » much lower bandwidth, since data is often overwritten multiple times
    » better tolerance of long-latency memory
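A minimal sketch of the bookkeeping difference between the two policies (not from the slides; dicts stand in for the cache and the lower level):

```python
# Minimal sketch: write-through vs. write-back bookkeeping.
memory = {}          # lower level
cache = {}           # addr -> value
dirty = set()        # write-back only: blocks modified since fill

def write(addr, value, policy):
    cache[addr] = value
    if policy == 'write-through':
        memory[addr] = value          # lower level updated immediately
    else:                             # write-back
        dirty.add(addr)               # defer until eviction

def evict(addr, policy):
    if policy == 'write-back' and addr in dirty:
        memory[addr] = cache[addr]    # write the dirty block back now
        dirty.discard(addr)
    cache.pop(addr, None)

write(0x100, 42, 'write-back')
print(0x100 in memory)   # False: memory is stale until eviction
evict(0x100, 'write-back')
print(memory[0x100])     # 42
```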

Page 46: Write Buffers for Write-Through Caches

[Figure: Processor <-> Cache, with a Write Buffer between the cache and lower-level memory; the buffer holds data awaiting write-through to lower-level memory.]

Q. Why a write buffer?
A. So the CPU doesn't stall.

Q. Why not just one register?
A. Bursts of writes are common.

Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Drain the buffer before the next read, or check the write buffer for a match on reads.

Page 47: Write Policy 2: Write Allocate vs. Non-Allocate (What Happens on a Write Miss)

• Write allocate: allocate a new cache line in the cache
  – Usually means doing a "read miss" to fill in the rest of the cache line!
  – Alternative: per-word valid bits
• Write non-allocate (or "write-around"):
  – Simply send the write data through to the underlying memory/cache; don't allocate a new cache line!

Page 48: Virtual Memory Review

• Main memory is used as a "cache" for secondary (disk) storage
  – Managed jointly by CPU hardware and the OS
• Programs share main memory
  – Each gets a private virtual address space, e.g., 4 GB in MIPS / IA-32
  – Protected from other programs
• The CPU and OS translate virtual addresses to physical addresses (mapping)
  – A VM "block" is called a page, e.g., 4 KB
  – A VM translation "miss" is called a page fault

Page 49: Modern Virtual Memory Systems: Illusion of a Large, Private, Uniform Store

• Protection & privacy: several users, each with private and shared address spaces
• Demand paging: provides the ability to run programs larger than primary memory
• Hides differences in machine configurations
• The price is address translation on each memory reference (VA -> PA mapping, via the TLB)

[Figure: OS and user_i address spaces map through VA -> PA translation (TLB) into Primary Memory, backed by a Swapping Store.]

Page 50: Page Fault Penalty & Demand Paging

• On a page fault (page not present), the page must be fetched from disk
  – Takes millions of clock cycles
  – If no free page frame is left, a page is replaced
  – The page table is updated to point to the new location of the page on disk
  – Handled by OS code
• To minimize the page fault rate:
  – Fully associative placement
  – Smart replacement algorithms

Page 51: Paged Memory Systems

• The processor generates a virtual address: <page number, offset>
• A page table contains the physical address of the base of each page (the translation)

Pages of a program may be stored non-contiguously in physical memory; the page table translates.

[Figure: process-1's virtual address space (pages 0-3) maps through its page table to scattered physical pages.]
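A minimal sketch of this translation with a single-level page table and 4 KB pages (the page table contents are made up for illustration):

```python
# Minimal sketch: virtual -> physical translation, single-level page table.
PAGE_SIZE = 4096
OFFSET_BITS = 12

page_table = {0: 7, 1: 3, 2: 11, 3: 2}   # virtual page -> physical page

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    ppn = page_table[vpn]                 # a KeyError here models a page fault
    return (ppn << OFFSET_BITS) | offset

print(hex(translate(0x1ABC)))  # VPN 1 -> PPN 3, so 0x3ABC
```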

Page 52: Address Translation

• Fixed-size pages (e.g., 4 KB)

Page 53: Mapping Pages to Storage

[Figure not captured in the transcript.]

Page 54: Page Tables Example: Translating a Virtual to a Physical Address

[Figure: a 32-bit virtual address splits into a 20-bit virtual page number (bits 31-12) and a 12-bit page offset (bits 11-0). The page table register points to the page table; the virtual page number indexes a page table entry holding a Valid bit (if 0, the page is not present in memory) and an 18-bit physical page number, which is concatenated with the page offset to form the physical address.]

Page 55: Page Tables

• Store placement information
  – An array of page table entries, indexed by virtual page number
  – A page table register in the CPU points to the page table in physical memory
• If the page is present in memory:
  – The PTE stores the physical page number
  – Plus other status bits (referenced R, dirty D, ...)
• If the page is not present:
  – The PTE can refer to a location in swap space on disk

Page 56: What Is in a Page Table Entry (PTE)?

– A pointer to the next-level page table or to the actual page
– Permission bits: valid, read-only, read-write, write-only

• Example: the Intel x86 architecture PTE:
  – The address has the same format as on the previous slide (10, 10, 12-bit offset)
  – Intermediate page tables are called "Directories"

PTE layout: Page Frame Number (Physical Page Number) in bits 31-12, Free (OS) in bits 11-9, then 0, L, D, A, PCD, PWT, U, W, P down to bit 0.

  P:   Present (same as the "valid" bit in other architectures)
  W:   Writeable
  U:   User accessible
  PWT: Page write transparent: external cache write-through
  PCD: Page cache disabled (page cannot be cached)
  A:   Accessed: page has been accessed recently
  D:   Dirty (M bit): page has been modified
  L:   L = 1 means a 4 MB page (directory only); the bottom 22 bits of the virtual address serve as the offset

Page 57: The Size of a Linear (Single-Level) Page Table Is Huge

With 32-bit addresses, 4 KB pages & 4-byte PTEs: 2^20 PTEs, i.e., a 4 MB page table per user.

Larger pages?
• Internal fragmentation (not all memory in a page is used)
• Larger page fault penalty (more time to read from disk)

What about a 64-bit virtual address space?

Page 58: Hierarchical (2-Level) Page Table (used in Intel x86 processors)

Virtual address: p1 (10-bit L1 index, bits 31-22) | p2 (10-bit L2 index, bits 21-12) | offset (bits 11-0)

[Figure: a processor register holds the root of the current page table; p1 indexes the Level 1 page table to find a Level 2 page table; p2 indexes that table to find the data page. Level 2 tables and data pages may live in primary or secondary memory, and a PTE may mark a nonexistent page.]
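A minimal sketch of the two-level (10/10/12) walk (the table contents are made up; a missing entry models a nonexistent page):

```python
# Minimal sketch: a two-level page walk in the x86 10/10/12 style.
OFFSET_BITS, L2_BITS = 12, 10

level1 = {0: {3: 42}, 1: {0: 7}}   # p1 -> level-2 table (p2 -> physical page)

def walk(vaddr):
    p1 = vaddr >> (OFFSET_BITS + L2_BITS)
    p2 = (vaddr >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    l2 = level1.get(p1)
    if l2 is None or p2 not in l2:
        raise LookupError('page fault')
    return (l2[p2] << OFFSET_BITS) | offset

print(hex(walk(0x3ABC)))  # p1=0, p2=3 -> physical page 42 (0x2A): 0x2aabc
```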

Page 59: Replacement and Writes

• To reduce the page fault rate, use least-recently used (LRU) replacement
  – A reference bit in the PTE is set to 1 on each access to the page
  – Periodically cleared to 0 by the OS
  – A page with reference bit = 0 has not been used recently
• Disk writes take millions of cycles
  – Write a page at once, not individual locations
  – Use write-back; write-through is impractical
  – A dirty bit in the PTE is set when the page is written

Page 60: Clock Algorithm: Not Recently Used

• Clock algorithm:
  – Approximates LRU
  – Replaces an old page, not necessarily the oldest page
• Details:
  – A hardware "use" bit per physical page:
    » Hardware sets the use bit on each reference
    » If the use bit isn't set, the page has not been referenced in a long time
  – On a page fault:
    » Advance the clock hand
    » Check the use bit: 1 => used recently, so clear it and move on; 0 => selected as the candidate for replacement

[Figure: all pages in memory arranged in a circle with a single clock hand that advances only on page faults, checking for pages not used recently and marking pages as not used recently; each page table entry carries used and dirty bits.]
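A minimal sketch of the clock (second-chance) scan (not from the slides; in a real system the use bits would be set by hardware on each reference, and the frame contents are made up):

```python
# Minimal sketch: the clock algorithm over a fixed set of physical frames.
frames = [{'page': p, 'use': u} for p, u in
          [('A', 1), ('B', 0), ('C', 1), ('D', 0)]]
hand = 0

def select_victim():
    global hand
    while True:
        frame = frames[hand]
        if frame['use']:
            frame['use'] = 0      # used recently: clear and give a second chance
            hand = (hand + 1) % len(frames)
        else:
            victim = hand         # not recently used: replace this one
            hand = (hand + 1) % len(frames)
            return victim

print(frames[select_victim()]['page'])  # 'B': first frame found with use == 0
```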

Page 61: Fast Translation Using a TLB

• Address translation requires extra memory references:
  – one to access the PTE
  – then the actual memory access
• But access to page tables has good locality, so:
  – TLB (Translation Lookaside Buffer): a fast cache of PTEs within the CPU
  – Typical: 16-512 PTEs, 0.5-1 cycle for a hit, 10-100 cycles for a miss, 0.01%-1% miss rate
  – Misses can be handled by hardware or software
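A minimal sketch of a TLB sitting in front of the page table (contents and sizes are made up; a real TLB would also bound its capacity and evict entries):

```python
# Minimal sketch: a TLB as a small cache of page table entries.
OFFSET_BITS = 12
page_table = {0: 5, 1: 9, 2: 6}     # vpn -> ppn (the slow backing structure)
tlb = {}                            # vpn -> ppn (small and fast)

def translate(vaddr):
    vpn, offset = vaddr >> OFFSET_BITS, vaddr & 0xFFF
    if vpn in tlb:                   # TLB hit: no page table access
        ppn = tlb[vpn]
    else:                            # TLB miss: walk the page table, refill
        ppn = page_table[vpn]
        tlb[vpn] = ppn
    return (ppn << OFFSET_BITS) | offset

print(hex(translate(0x1234)))  # miss, refills the TLB: 0x9234
print(hex(translate(0x1ABC)))  # hit in the TLB:        0x9abc
```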

Page 62: Fast Translation Using a TLB

[Figure not captured in the transcript.]

Page 63: TLB Misses

• If the page is in memory:
  – Load the PTE from memory and retry
  – Could be handled in hardware
  – Or in software:
    » Raise a special exception, with an optimized handler
• If the page is not in memory (page fault, not present):
  – The OS fetches the page and updates the page table
  – Restart the faulting instruction

Page 64: Reducing Translation Time Further

• As described, the TLB lookup is in series with the cache lookup:

[Figure: Virtual Address = <V page no. | offset (10 bits)> feeds the TLB lookup (entries hold V, Access Rights, PA), producing Physical Address = <P page no. | offset (10 bits)>, which then goes to the cache.]

• Machines with TLBs go one step further: they overlap the TLB lookup with the cache access.
  – This works because the page offset is available early (it needs no translation).

Page 65: Overlapping TLB & Cache Access

[Figure: the 32-bit virtual address splits into a 20-bit page number and a 12-bit displacement; the TLB does an associative lookup on the page number while the displacement (10-bit index + 2-bit byte select, "00") indexes a 4 KB direct-mapped cache of 1 K entries x 4 bytes; the TLB's frame number (FN) is then compared with the cache tag, and both lookups produce Hit/Miss.]

• What if the cache size is increased to 8 KB?
  – The overlap is no longer complete (an index bit would now come from the translated page number)
  – Need to do something else
• OR: virtual caches
  – Tags in the cache are virtual addresses
  – Translation only happens on cache misses