14/12/2011
Computer Architecture 1 – David Black-Schaffer
Introduction to Computer Architecture: Review
Overview
• How to write assembly; saving registers
• Two's complement math, floating point
• Combinational logic and state machines
• What does "fastest" mean?
• How to implement the MIPS ISA
• How to split up instruction execution; how to pipeline to make it regular
• Hazards: forwarding, branching
• Memory-mapped I/O, DMA, interrupts and polling
• SRAM vs. DRAM, row and column access
• Associativity, tags, replacement policies
• VM for size and protection; how VM interacts with caches
• What to think about for the final
Course Outline
1. Introduction to Processors and Binary Numbers
2. ISA: Instruction Formats and Execution
3. ISA: Addressing Modes and Procedure Calls
4. Arithmetic and Integer Numbers
5. Digital Logic
6. Performance
7. Datapath: Single-Cycle
8. Datapath: Multi-Cycle
9. Datapath: Exceptions and Pipelining
10. Datapath: Pipelining Implementation
11. I/O
12. Memories
13. Caches
14. Virtual Memory 1: Address Translation
15. Virtual Memory 2: TLBs and Caches
16. Review
1. Introduction to Processors
• Basic computer operation:
  1. Load the instruction
  2. Figure out what operation to do (control)
  3. Figure out what data to use (data)
  4. Do the computation
  5. Figure out what instruction to load next
[Figure: basic processor blocks – Compute (add, sub, mul, etc.), Data (load/store), Control (if, else, loop), Instructions (loaded from memory), and Memory (big and slow); control tracks the current instruction, what to do, and which instruction to load next.]

Example program (instruction memory addresses 0–5):

0: load r0, mem[7]
1: r1 = r0 - 2
2: j_zero r1, 5 (done)
3: r0 = r0 + 1
4: jump 1 (loop)
1. Binary Numbers
• Binary numbers
  – Adders have 3 inputs and 2 outputs
  – Overflow limits the maximum we can represent
• Two's complement
  – Allows us to handle negative numbers
  – Basic idea: the biggest digit is negative
    • 1000₂ = -1·2³ + 0·2² + 0·2¹ + 0·2⁰ = -8₁₀
    • Numbers range from -2^(n-1) to 2^(n-1) - 1
  – Subtraction (invert and add 1)
  – Addition (same as unsigned)
  – Comparisons (a bit tricky)
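As a quick check, the two's-complement rules above can be sketched in a few lines (Python here, purely for illustration, using 4-bit values):

```python
# Sketch of the two's-complement facts above for 4-bit numbers:
# negate = invert and add 1; the top bit carries a negative weight.
N = 4

def negate(x):
    """Two's-complement negation: invert the bits, then add 1 (mod 2^N)."""
    return ((~x) + 1) & ((1 << N) - 1)

def to_signed(x):
    """Interpret an N-bit pattern with the biggest digit negative."""
    return x - (1 << N) if x & (1 << (N - 1)) else x

print(to_signed(0b1000))          # -8, as on the slide
print(bin(negate(0b0001)))        # 0b1111, the 4-bit pattern for -1
print(to_signed(negate(0b0001)))  # -1
```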
[Figure: 1-bit full adder (inputs A, B; outputs S, Cout) with worked single-bit addition examples, including the carry-out cases.]
2. ISA: Instruction Formats and Execution
• Register machines (load-store, number of registers)
• Memory organization (words/bytes)
• Program counter and next instruction
for (j = 1; j < 10; j++){ a = a + b
}
[Figure: from source to bits – the C loop above is turned by the Compiler into assembly (e.g., ADD R1, R2, R3; SUB R3, R2, R1), which the Assembler turns into binary (0010100101 0101010101); the hardware side shows the ALU, control logic, register file, program counter, instruction register, and memory address/data registers.]
2. ISA: Instruction Formats and Execution
• MIPS instruction formats (R, I, J)
• Immediate sizes and how they are used
  – Sign-extended
  – How to load 32-bit constants
• Control
  – Jumps (unconditional; how far can they jump?)
  – Branches (conditional; how far?)
• RISC vs. CISC
Name       | Fields                              | Comments
Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits | All MIPS instructions are 32 bits
R-format   | op | rs | rt | rd | shamt | funct   | Arithmetic instruction format
I-format   | op | rs | rt | address/immediate    | Transfer, branch, immediate format
J-format   | op | target address                 | Jump instruction format
3. ISA: Addressing Modes

[Figure: base addressing (I-format) – op | rs | rt | address; the 16-bit address field is added to a base register to form the address of a word in memory.]

Example: lw R1, 100(R2) encodes as op=35, rs=2, rt=1, address=100.

We add the 16-bit (signed) offset to the base address, so we can access about ±32,000 (±2^15 = ±32,768) bytes from the base register.

Don't forget the R- and J- addressing modes!
3. ISA: Procedure Calls

Caller:
    add  $a0, $t0, 2        ; set up the arguments
    add  $a1, $s0, $zero
    add  $a2, $s1, $t0
    add  $a3, $t0, 3
    addi $sp, $sp, -4       ; adjust the stack to make room for one item
    sw   $t0, 0($sp)        ; save $t0 in case the callee uses it
    jal  leaf_example       ; call the leaf_example procedure
    lw   $t0, 0($sp)        ; restore $t0 from the stack
    addi $sp, $sp, 4        ; adjust the stack to delete one item
    add  $t2, $v0, $zero    ; move the result into $t2

Callee:
leaf_example:               ; calculates f = (g+h) - (i+j)
                            ; g, h, i, and j are in $a0, $a1, $a2, $a3
    addi $sp, $sp, -4       ; adjust the stack to make room for one item
    sw   $s0, 0($sp)        ; save $s0 for the caller
    add  $t0, $a0, $a1      ; g + h
    add  $t1, $a2, $a3      ; i + j
    sub  $s0, $t0, $t1      ; f = (g+h) - (i+j)
    add  $v0, $s0, $zero    ; return f in the result register $v0
    lw   $s0, 0($sp)        ; restore $s0 for the caller
    addi $sp, $sp, 4        ; adjust the stack to delete one item
    jr   $ra                ; jump back to the calling routine

Notes: the caller uses $t0, $s0, $s1; $t0 is not preserved across calls, so the caller has to save it (what about $s0, $s1?). After the call, the caller restores $t0. The result is in $v0 (why did we not save $t2?). The callee finds its arguments in $a0–$a3, uses $s0 so it must save and restore it (what about $s1?), and places its result in $v0.
Nested Calls
• MIPS stacks are implemented via software convention
• What is stored on the stack?

[Figure: stacking of subroutine calls, returns, and environments – main() (A) calls B, B calls C; the stack grows A → A,B → A,B,C and shrinks back on each RET.]
Summary: ISA
• Architecture = what's visible to the program about the machine
  – Not everything in the deep implementation is "visible"
  – The name for this invisible stuff is microarchitecture or "implementation" (and it's really messy… but fun)
• A big piece of the ISA = assembly language structure
  – Primitive instructions, executed sequentially and atomically
  – Issues are formats, computations, addressing modes, etc.
  – Two broad flavors:
    • CISC: lots of complicated instructions
    • RISC: a few, essential instructions
  – Basically all recent machines are RISC, but the dominant machine of today, Intel x86, is still CISC (though they do RISC tricks in the guts…)
• We did one example in some detail: MIPS (from P&H Chapter 3)
  – A RISC machine; its virtue is that it is pretty simple
  – You can "get" the assembly language without too much memorization
Binary Addition          Binary Multiplication
0 + 0 = 0                0 × 0 = 0
0 + 1 = 1                0 × 1 = 0
1 + 0 = 1                1 × 0 = 0
1 + 1 = 10               1 × 1 = 1

Addition of two POSITIVE integers, A = (10111010)₂ = (186)₁₀ and B = (110111)₂ = (55)₁₀:

    11111     (carry)
   10111010
 +   110111
 ----------
   11110001 = (241)₁₀
4. Arithmetic
• Overflow
• Serial multiplication

4. Integer Numbers
• Signed magnitude
  – 1 bit for sign
  – Has two zeros
  – Operations are a pain
• Two's complement
  – To negate: invert and add 1 (easy to do with Cin)
  – The complement of 0001 is 2⁴ − 0001 = 10000 − 0001 = 1111
  – Overflow is different
• Non-integers
  – Fixed point: 0010.1100
  – Floating point (mantissa, exponent, and sign)
    • Multiplication is easier than addition
5. Digital Logic – Basic Gates

A B | AND  OR  XOR  XNOR  NAND  NOR
0 0 |  0    0   0    1     1     1
0 1 |  0    1   1    0     1     0
1 0 |  0    1   1    0     1     0
1 1 |  1    1   0    1     0     0
5. Karnaugh Maps
• Order variables such that only 1 changes in each row/column (Gray coding)
• Groups may overlap

CD\AB  00  01  11  10
  00    0   1   1   0
  01    0   1   1   0
  11    0   1   1   0
  10    0   1   0   0

Groups: B·D, !A·B, !C·B, which factor into B·(D + !A + !C)
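As a sanity check, the grouping above can be verified against the K-map's truth table with a short sketch (Python, illustrative only):

```python
# Sketch: verify the K-map simplification B.(D + !A + !C) against the
# truth table from the slide's 4-variable K-map (rows CD, columns AB,
# both in Gray-code order 00, 01, 11, 10).
rows = ["00", "01", "11", "10"]   # CD values (Gray order)
cols = ["00", "01", "11", "10"]   # AB values (Gray order)
grid = [[0, 1, 1, 0],             # CD = 00
        [0, 1, 1, 0],             # CD = 01
        [0, 1, 1, 0],             # CD = 11
        [0, 1, 0, 0]]             # CD = 10

def simplified(a, b, c, d):
    """B.(D + !A + !C) -- the factored cover from the slide."""
    return b and (d or (not a) or (not c))

for r, cd in enumerate(rows):
    for k, ab in enumerate(cols):
        a, b = int(ab[0]), int(ab[1])
        c, d = int(cd[0]), int(cd[1])
        assert int(simplified(a, b, c, d)) == grid[r][k]
print("simplification matches the K-map")
```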
5. Logic Blocks — Memory

[Figure: SRAM read path – a 4-bit binary address bus splits into bits 0–2, which feed a decoder producing 1-hot row enables (e.g., 100 in binary → 00010000 one-hot), and bit 3, which drives a MUX selecting onto the 8-bit data output from the array of SRAM cells (each box stores 1 bit). Example: read address 12 = 1100.]
5. State Storage — Building a Counter
[Figure: 3-bit counter – three chained full adders (A, B, Cin → S, Cout) add 1 to current_value to produce next_value; a clocked 3-bit latch makes next_value the new current_value.]
• When the clock edge rises (0→1), the next_value is stored as the current_value
• The new current_value then goes through the adder to make a new next_value
• Everything takes some time
Q: What limits the speed of the clock?
5. Finite State Machines

[Figure: launch-code FSM – states IDLE, WAIT_CODE, CODE_OK; Button A and Button B drive the transitions; LAUNCH is true only in CODE_OK.]

IDLE: LAUNCH = false. If (Button A) next state is WAIT_CODE, else next state is IDLE.
WAIT_CODE: LAUNCH = false. If (Button B) next state is CODE_OK, else next state is IDLE.
CODE_OK: LAUNCH = true. Next state is IDLE.
[Figure: FSM hardware – next-state logic (from the truth table) combines the inputs with the current state (memory) to produce the outputs and the next state; the clock loads the next state into the current-state register.]
5. Logic: Summary
• Combinational logic
  – Inputs "immediately" produce outputs
  – Use truth tables to determine logic equations
  – Common functions such as MUX, DEMUX, decode, encode
• Sequential logic
  – The current state is updated to the next state by the clock
  – Store the current state in a latch/memory/flip-flop
  – Combinational logic determines the next state, and the current state is updated on the clock
• What do you need for the labs?
  – Combinational logic for the ALU and adder
  – Combinational logic to decode instructions and connect up the ALU
• What do you need for life?
  – Understand that state is stored in memories
  – …that state is updated by combinational logic
  – …that clock speed depends on how long it takes to calculate the next state
6. Performance
• Comparing machines using microarchitecture
  – Latency (instruction execution time from start to finish)
  – Throughput (number of instructions per unit of time)
  – Processor clock rate (GHz) and cycle time
  – CPI – cycles per instruction
  – MIPS – millions of instructions per second
  – FLOPS – floating-point operations per second (also GFLOPS/TFLOPS/PFLOPS/EFLOPS)
• Comparing machines using benchmark programs
  – Which programs?
  – Benchmark suites / microbenchmarks
  – Different means: arithmetic, harmonic, and geometric
6. Metrics
• CPI (Cycles Per Instruction), for a 100-instruction program:
  – 25 instructions are loads/stores (each takes 2 cycles)
  – 50 instructions are adds (each takes 1 cycle)
  – 25 instructions are square roots (each takes 100 cycles)
  – CPI = ((25 × 2) + (50 × 1) + (25 × 100)) / 100 = 2600 / 100 = 26.0
• MIPS (Millions of Instructions Per Second):
  – Machine A has a special instruction for performing square root calculations; it takes 100 cycles to execute
  – Machine B doesn't have the special instruction, and must perform square root calculations in software using simple instructions (e.g., add, mult, shift) that each take 1 cycle to execute
  – Machine A: 1/100 MIPS = 0.01 MIPS; Machine B: 1 MIPS (at a 1 MHz clock)
  – The 100-cycle square root hurts the average "instructions per second" but may improve performance dramatically!
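The CPI calculation above is just a weighted average; here it is as a small sketch (Python, mirroring the slide's numbers):

```python
# Sketch of the slide's CPI calculation: a weighted average of the
# cycle counts over the instruction mix of a 100-instruction program.
mix = [
    (25, 2),    # 25 loads/stores, 2 cycles each
    (50, 1),    # 50 adds, 1 cycle each
    (25, 100),  # 25 square roots, 100 cycles each
]
total_cycles = sum(count * cycles for count, cycles in mix)
total_instructions = sum(count for count, _ in mix)
cpi = total_cycles / total_instructions
print(cpi)  # 26.0, as on the slide
```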
6. Comparisons
• Averages (arithmetic, harmonic, geometric)
• Normalizing (weights, runtimes)
• What you really care about is time… …for your application (benchmarks)
• Amdahl's Law

[Figure: a 100 s program split into a 10 s part and a 90 s part – a 10× speedup on the 10 s part gives 1 s + 90 s = 91 s total, far from 10× overall.]
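Amdahl's Law for the figure's numbers can be sketched as follows (Python, illustrative):

```python
# A minimal Amdahl's Law sketch matching the slide's numbers: speeding
# up a 10 s portion of a 100 s program by 10x only gets us to 91 s.
def amdahl_time(total, accel_part, speedup):
    """New runtime when accel_part seconds of total are sped up."""
    return (total - accel_part) + accel_part / speedup

new_time = amdahl_time(total=100.0, accel_part=10.0, speedup=10.0)
print(new_time)           # 91.0 seconds
print(100.0 / new_time)   # overall speedup: only about 1.1x
```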
6. Performance Summary
• Performance is important to measure
  – For architects comparing different deep mechanisms
  – For developers of software trying to optimize code and applications
  – For users trying to decide which machine to use, or to buy
• Performance metrics are subtle
  – It's easy to mess up the "machine A is X times faster than machine B" numerical comparison
  – You need to know exactly what you are measuring: time, rate, throughput, CPI, cycles, etc.
  – You need to know how combining these into aggregate numbers distorts the individual numbers in different ways (P&H is good on this in Chapter 2)
  – No metric is perfect, hence the heavy emphasis on standard benchmarks today
7. Datapath: Single-Cycle
• Control path – what to do
  – Logic for decoding signals (ALU, register file, muxes)
• Data path – process the data
  – What resources do we need? (ALU, register file, memory, PC)
7. Complete Single-Cycle Datapath

[Figure: single-cycle datapath – the PC feeds the instruction memory (RAM); the instruction drives the register file (Read Reg 1/2, Write Reg, Write Data) and a 16→32-bit sign-extended immediate; a MUX selects the second ALU operand; the ALU's Zero output, a PC+4 adder, and a shift-left-2 branch-target adder feed the next-PC MUX; the data memory (RAM) handles loads and stores.]
7. Cost of the Single-Cycle Architecture

[Figure: three instruction classes of different lengths all forced into one cycle time, set by the longest instruction – most of the time is wasted.]

Why is one instruction longer? It has to do more operations (e.g., conditional branch vs. jump).
8. Multi-Cycle Solution

[Figure: with a short clock period, instruction class 1 takes 4 cycles and class 3 takes 2 cycles – less wasted time.]

Idea: let the FASTEST instruction determine the clock period.
8. Multi-Cycle Memory-Reference Datapath

[Figure: multicycle datapath – one memory for both instructions and data (selected by IorD), an instruction register and memory data register (MDR), the register file, and a single shared ALU with A, B, and ALUOut registers; sign-extend and shift-left-2 form branch targets, and PC[31:28] concatenated with the shifted 26-bit target forms the 32-bit jump address.]

[Figure: control FSM fragment – from state 1, the memory-address computation (state 2: ALUSelA=1, ALUSelB=10, ALUOp=00) leads either to a memory read (state 3: MemRead, IorD=1) followed by the write-back step (state 4: RegWrite, MemtoReg=1, RegDst=0), or to a memory write (state 5: MemWrite, IorD=1); all paths return to state 0. Control outputs include IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSelA, ALUSelB, MemToReg, and ALUOp; the ALU control also decodes Instruction[5:0].]
8. Performance of the Multicycle Implementation
• Each type of instruction can take a variable number of cycles
• Example – assume the following instruction distribution:
  – loads: 5 cycles, 22%
  – stores: 4 cycles, 11%
  – R-type: 4 cycles, 49%
  – branches: 3 cycles, 16%
  – jumps: 3 cycles, 2%
• What's the average Cycles Per Instruction (CPI)?
  CPI = CPU clock cycles / instruction count
  CPI = (5 × 0.22) + (4 × 0.11) + (4 × 0.49) + (3 × 0.16) + (3 × 0.02)
  CPI = 4.04 cycles per instruction
• What was the CPI for the single-cycle machine?
  – Single cycle implies 1 clock cycle per instruction → CPI = 1.0
  – So isn't the single-cycle machine about 4 times faster?
8. Performance of the Multicycle Implementation
• The correct answer must consider the clock cycle time as well:
  – For the single-cycle implementation, the cycle time is set by the worst-case delay: Tcycle = 40 ns (for load instructions)
  – For the multicycle implementation, the cycle time is set by the worst-case delay over all execution steps: Tcycle = 10 ns (for each of the steps 1, 2, 3, or 4)
• The execution time per instruction is:
  – CPI × Tcycle = 1 × 40 ns = 40 ns per instruction for the single-cycle machine
  – CPI × Tcycle = 4.04 × 10 ns = 40.4 ns per instruction for the multicycle machine
  – Thus, the single-cycle machine is only 1% faster
• When considering other types of units (e.g., FP), the single-cycle implementation can be very inefficient
  – Think about how long it takes to do divide or square root!
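The single-cycle vs. multicycle comparison can be reproduced with a short sketch (Python, using the slide's CPI mix and cycle times):

```python
# Sketch comparing the slide's single-cycle vs. multicycle machines:
# execution time per instruction = CPI x cycle time.
mix = {"load": (5, 0.22), "store": (4, 0.11), "rtype": (4, 0.49),
       "branch": (3, 0.16), "jump": (3, 0.02)}
multicycle_cpi = sum(cycles * frac for cycles, frac in mix.values())

single_cycle_time = 1.0 * 40.0           # CPI 1.0, 40 ns worst-case cycle
multicycle_time = multicycle_cpi * 10.0  # CPI 4.04, 10 ns per step

print(round(multicycle_cpi, 2))   # 4.04
print(single_cycle_time)          # 40.0 (ns per instruction)
print(round(multicycle_time, 1))  # 40.4 (ns per instruction)
```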
8. Summary
• Single-cycle implementations have to consider the worst-case delay through the datapath to come up with the cycle time.
• Multicycle implementations have the advantage of using a different number of cycles for executing each instruction.
• In general, the multicycle machine is better than the single-cycle machine, but the actual execution time strongly depends on the workload.
• The most widely used machine implementation is neither single-cycle nor multicycle: it's the pipelined implementation. (Next lecture)
9. Pipelined Datapath

[Figure: a single-cycle computation of 100 ns is "too long"; the pipelined version has 5 pipe stages of ~20 ns each. Latches, called 'pipeline registers', break the computation into stages and store the intermediate results.]
9. Pipelining: Implementation Issues
• What prevents us from just doing a zillion pipe stages?
  – Those latches are NOT free: they take up area, and there is a real delay to go THROUGH the latch itself
  – In modern, deep pipelines (10–20 stages), this is a real effect
  – Typically you see logic "depths" in one pipe stage of 10–20 gates

[Figure: a 5-stage pipe with ~2 ns stages vs. a 10-stage pipe with ~0.2 ns stages – at these speeds, and with this few levels of logic, latch delay is important.]
• Unpipelined vs. pipelined
• Ideally, Speedup_pipeline = Time_sequential / Time_pipelined = pipeline depth

8. Performance of Pipelined Systems

[Figure: instructions vs. time – unpipelined: latency 5 cycles, throughput 1 instruction per 5 cycles; pipelined (5 stages): latency still 5 cycles, throughput 1 instruction per cycle.]

Ideal speedup only if we can keep the pipeline full!
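The ideal-speedup claim can be illustrated with a small timing sketch (Python; this assumes no hazards, i.e., the pipeline always stays full):

```python
# Sketch of ideal pipeline timing: n instructions through a d-stage
# pipeline finish in d + (n - 1) cycles, vs. n * d cycles unpipelined.
def unpipelined_cycles(n, depth):
    return n * depth

def pipelined_cycles(n, depth):
    # the first instruction fills the pipe (depth cycles), then one
    # instruction completes per cycle -- if we keep the pipe full
    return depth + (n - 1)

n, depth = 1000, 5
speedup = unpipelined_cycles(n, depth) / pipelined_cycles(n, depth)
print(round(speedup, 2))  # approaches the pipeline depth (5) for large n
```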
8. Complete 5-Stage Pipeline

[Figure: the single-cycle datapath cut by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB – PC and instruction memory (IF); register-file read and sign-extend (ID); ALU and branch-target adder (EX); data memory (MEM); write-back MUX (WB).]

In cycle 4 we have 3 instructions "in flight": instruction 1 is accessing the memory (DM), instruction 2 is using the ALU (EX), and instruction 3 is accessing the register file (ID).
Flow of Instructions Through the Pipeline

[Figure: program execution over clock cycles 1–7 – three loads (LW R1, 100(R0); LW R2, 200(R0); LW R3, 300(R0)) each flow through IM → Reg → ALU → DM → Reg, each starting one cycle after the previous one.]
10. Data Hazards
• In this particular case, the R10 value is not computed or returned to the register file by the time a later instruction wants to use it as an input.

Double pumping the register file doesn't help here; the later instruction needs R10 two clock cycles before it's been computed and stored back. Oops…

[Figure: two overlapped instructions (Iget, Rget, ALU op, Mput, Rput) – the first writes R10 in its Rput stage only after the second has already read R10 in its Rget stage.]
10. Forwarding

[Figure: the 5-stage pipelined datapath with a forwarding bus from WB back to the ALU inputs, bypassing the register file.]
10. Hazards
• Data hazards
  – An instruction depends on the result of a prior computation which is not ready yet
  – Fix by stalling, double pumping, and forwarding
• Structural hazards
  – The hardware cannot support a combination of instructions
  – OK, maybe add extra hardware resources; we may still have to stall
• Control hazards
  – Pipelining of branches and other instructions which change the PC
  – Branch predictors, branch delay slots, early branch computation
10. Pipelining Summary
• Need to keep the pipeline full for performance
  – Hazards make this hard: dependencies and resource conflicts
  – Time to access memory! (caches)
• Performance
  – Can't use infinitely many stages
  – Code and pipeline stages (branch delay slot)
• Exceptions/Interrupts
  – Can happen at different places in the pipeline
  – Can happen out of order (early in the pipeline for a later instruction vs. later in the pipeline for an earlier one)
  – Need to restart instructions
  – Jump to the OS to handle them
11. Input/Output
• Busses
  – Clocking, width, arbitration
• Performance
  – Latency
  – Throughput
• Talking to devices
  – Method: memory-mapped / I/O instructions
  – Means: polling, interrupts, DMA
12. Memory
• Random access memories
  – DRAM: Dynamic Random Access Memory
    • High density, low power, cheap, slow
    • Dynamic: needs to be "refreshed" regularly
  – SRAM: Static Random Access Memory
    • Low density, high power, expensive, fast
    • Static: content will last "forever" (until power is lost)
• What gets used where?
  – Main memory is DRAM: you need it big, so you need it cheap
  – CPU cache memory is SRAM: you need it fast, so it's more expensive, and therefore smaller than you would usually want due to resource limitations
• Relative performance
  – Size: DRAM is 4–8× denser than SRAM
  – Cost/cycle time: SRAM is 8–16× faster, and more expensive, than DRAM
12. Performance Impact (4-cycle memory)

Program: lw r2,0x20 / lw r3,0x30 / add r1,r2,r3 / sw r1,0x40

[Figure: pipeline diagram over cycles 1–18 (F D A M W stages, S = stall) – each 4-cycle memory access adds three stall cycles, delaying the dependent add and sw.]

• With a 1-cycle memory system, the program took 8 cycles
  – CPI = 8 cycles / 4 instructions = 2.0
  – With lots more instructions, CPI would approach 1.0
• With a 4-cycle memory system, the program takes 18 cycles
  – CPI = 18 cycles / 4 instructions = 4.5
  – This doesn't include the instruction-fetch penalty found in a real memory system

Remember the bit about "if you can keep the pipeline full"?
12. Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology
  – Provide access at the speed offered by the fastest technology

[Figure: processor (control, datapath, registers, local cache) → shared cache (SRAM) → main memory (DRAM) → secondary storage (disk); speeds run from ~1 ns (registers) through 5–10 ns (cache) and 10–100 ns (DRAM) to tens of ms (disk); sizes from KBs (cache) through MBs and GBs to TBs (disk).]
12. Memory Hierarchy: How Does it Work?
• Temporal locality (locality in time): if an item is referenced, the same item will tend to be referenced again soon ⇒ keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space): if an item is referenced, nearby items will tend to be referenced soon ⇒ move recently accessed "blocks" (groups of contiguous words) closer to the processor
• A "block" (or "line") is the minimum unit of data moved between two levels (e.g., between the L1 and L2 caches)
• Move data in blocks rather than bytes (more efficient if blocks are larger); a typical block is 64 bytes
13. Basic Cache Design
• A cache only holds a portion of a program
  – Which part of the program does the cache contain?
• The cache holds the most recently accessed references
• The cache is divided into units called cache blocks (also known as cache "lines"); each block holds a contiguous set of memory addresses
• How does the CPU know which part of the program the cache is holding?
  – Each cache block has extra bits, called the cache tag, which hold the main-memory address of the data in the block

[Figure: CPU ↔ 2-block cache (tag + data per block, block size in bytes) ↔ DRAM memory spanning 0x00000000–0xFFFFFFFC.]
13. The ABC's (or 1-2-3-4's) of Caches
• Caching is a general concept used in processors, operating systems, file systems, and applications.
• Wherever it is used, four basic questions arise:
  – Q1: Where can a block be placed in a cache? (direct-mapped, set-associative, fully-associative)
  – Q2: How is a block found if it is in a cache? (indexing for direct-mapped, limited search for set-associative, full search for fully-associative)
  – Q3: Which block should be replaced on a miss? (random, least-recently used (LRU))
  – Q4: What happens on a write? (write-through or write-back)
13. Cache Block Placement

[Figure: block 12 of a 32-block memory placed into an 8-block cache three ways – fully-associative: block 12 can go anywhere (complex); direct-mapped: block 12 can go only into block 4 (12 mod 8) (inflexible); 2-way set-associative: block 12 can go anywhere in set 0 (12 mod 4) (the compromise).]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache

[Figure: memory addresses 0–15 (0000–1111) map to cache lines 0–3 by address mod 4.]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache

[Figure: each cache line now carries a tag – addresses 0, 4, 8, 12 map to line 0; 1, 5, 9, 13 to line 1; 2, 6, 10, 14 to line 2; and so on. Memory addresses that map to the same cache line conflict in the cache – we can only store one or the other.]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache, 4 bytes/line

[Figure: each cache line holds a tag and bytes 0–3. We need to ignore the last two address bits because they choose the byte within the cache line.]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache, 4 bytes/line

[Figure: writing addresses as xxxYYZZ – YY is the line index (xxx00zz → line 0 through xxx11zz → line 3) and ZZ is the byte within the line; we need the tag (xxx) to tell us which memory address xxxYYZZ is in each line.]
13. Direct-Mapped Cache Indexing
• Direct-mapped caches have a 1:1 mapping from memory addresses to cache entries
  – E.g., a 4-entry cache with 4 bytes per line:
    • Entry 0: xxx00xx
    • Entry 1: xxx01xx
    • Entry 2: xxx10xx
    • Entry 3: xxx11xx
  – So addresses 0100000 and 1100001 map to the same cache line.
• To tell which is in the cache we look at the tag:
  – If the tag is 010 we have the first memory address
  – If the tag is 110 we have the second memory address
• To get individual bytes, we look at the byte offset:
  – 0100000 is the first byte of the line in cache entry 00
  – 0100001 is the second byte of the line in cache entry 00
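The tag/index/offset split described above can be sketched in a few lines (Python, for the slide's 4-entry, 4-byte-line cache):

```python
# Sketch of direct-mapped cache indexing for the slide's example:
# a 4-entry cache with 4-byte lines means 2 offset bits, 2 index bits,
# and the remaining high bits form the tag.
OFFSET_BITS = 2   # 4 bytes per line
INDEX_BITS = 2    # 4 entries

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# the slide's two conflicting addresses, 0100000 and 1100001:
print(split_address(0b0100000))  # tag 0b010, index 0, offset 0
print(split_address(0b1100001))  # tag 0b110, index 0, offset 1 -- same line!
```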
13. Set-Associative Cache Indexing
• Set-associative caches have a 1:1 mapping from memory addresses to sets
• Within a set, a memory address can be put anywhere (multiple entries per set)
  – E.g., a 2-way set-associative cache with 4 bytes per line:
    • Set 0: xxxx0xx
    • Set 1: xxxx1xx
    • …
  – So addresses 1010000 and 0110001 map to the same set, but the set can have multiple entries.
  – If the set has 2 entries, we can store both values in the cache, even though they map to the same set.
Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line

[Figure: lines 0 and 1 form set 0 (addresses xxxx0zz) and lines 2 and 3 form set 1 (addresses xxxx1zz); each line has its own tag (xxxx), which tells us which memory address xxxxYZZ is in each line.]
Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line

[Figure: the same cache against the 16-entry memory space – each memory address maps to exactly one set, but now we can hold any two lines of set 0 and any two lines of set 1 in the cache at the same time.]
Q1: Block Placement

[Figure: direct-mapped – block 12 of a 32-block memory can go only into cache block 4 (12 mod 8).]

Each memory location maps to one cache line. The mapping is done by a modulo operation, which is accomplished by ignoring some of the MSBs (e.g., addresses 001100 = 12 and 000100 = 4 both get mapped to 100 = 4).
Q1: Block Placement

[Figure: fully-associative – block 12 can go anywhere in the 8-block cache.]

Any memory location can go to any cache line. This is infinitely flexible, but it requires very complex (and slow) hardware.
Q1: Block Placement

[Figure: 2-way set-associative – block 12 can go anywhere in set 0 (12 mod 4).]

Every memory location maps to exactly one set of cache lines; within a set it can go in any location. A tradeoff between simplicity (direct-mapped to sets) and flexibility (fully-associative within sets).
13. Cache Performance
• What's the impact on performance (CPU time) when the following cache behavior is included?
  – 50-cycle miss penalty
  – All instructions normally take 2.0 cycles (excluding memory stalls)
  – Miss rate is 2.0%
  – Average of 1.33 memory references per instruction

  CPU time = IC × (CPI_execution + memory stall cycles per instruction) × clock cycle time
           = IC × (2.0 + 0.02 × 1.33 × 50) × clock cycle time
           = IC × 3.33 × clock cycle time

• Two important results to keep in mind:
  – The lower the CPI_execution, the higher the relative impact of a cache miss penalty (more sensitive to memory latency!)
  – Comparing two machines with identical memory systems, the machine with the higher clock rate will have the larger number of clock cycles per miss, and hence the memory portion of its CPI will be higher.
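The effective-CPI formula above as a one-function sketch (Python, with the slide's parameters):

```python
# Sketch of the slide's cache-performance calculation: effective CPI
# = base CPI + (memory refs per instruction x miss rate x miss penalty).
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    return base_cpi + refs_per_instr * miss_rate * miss_penalty

cpi = effective_cpi(base_cpi=2.0, refs_per_instr=1.33,
                    miss_rate=0.02, miss_penalty=50)
print(round(cpi, 2))  # 3.33, as on the slide
```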
14. Virtual Memory
• What is virtual memory?
  – A technique that allows execution of a program that
    • can reside in non-contiguous memory locations
    • does not have to completely reside in memory
  – It allows the computer to "fake" a program into believing that its
    • memory is contiguous
    • memory space is larger than physical memory
• Why is VM important?
  – Cheap: you no longer have to buy lots of RAM
  – It removes the burden of memory resource management from the programmer
  – It enables multiprogramming, time-sharing, and protection
14. Basic VM Algorithm

[Figure: VA → PA – a program's virtual addresses 0x00–0x10 (add r1,r2,r3; sub r2,r3,r4; lw r2,0x04; mult r3,r4,r5) map onto physical RAM at 0x00–0x0C, with some pages out on disk.]

• The program uses virtual addresses (load, store, instruction fetch)
• The computer translates each virtual address (VA) to a physical address (PA)
• The computer reads RAM using the PA, returning the data to the program
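A minimal sketch of the VA → PA translation just described (Python; the page size and the page-table contents here are hypothetical, not taken from the slide):

```python
# Hypothetical page-table sketch of VA -> PA translation: split the
# virtual address into a virtual page number (VPN) and an offset,
# look up the physical page number (PPN), and recombine.
PAGE_SIZE = 4096  # assumed 4 KB pages

# hypothetical page table: VPN -> PPN
page_table = {0: 5, 1: 2, 2: 7}

def translate(va):
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        # in a real system the OS would fetch the page from disk
        raise LookupError("page fault")
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1004)))  # VPN 1 -> PPN 2, offset 4 -> 0x2004
```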
14. Page Tables and Entries
• Page tables
  – Size of entries / size of the table
  – Multi-level page tables (why?)
  – How to look up entries
• TLB
  – Thrashing of pages (LRU example)
• Separate page tables (per process) provide page-level protection
• The OS creates and manages page tables so that no user-level process can alter any process's page table
  – Page tables are mapped into kernel memory where only the OS can read or write
Page Table Protection

[Figure: two software page tables managed by the OS – Process 1 maps virtual pages 0x00000–0x00004 to physical pages 0x001, 0x005, 0x00A, 0x004, 0x008; Process 2 maps the same virtual pages to 0x002, 0x006, 0x00B, 0x003, 0x009 (each entry has a valid bit). The processes' pages interleave in physical memory 0x00–0x0B without overlapping, and a hardware TLB caches the translations.]
14. Making Address Translation Fast
• A cache for address translations: the translation lookaside buffer (TLB)

[Figure: the TLB holds valid bits, tags (virtual page numbers), and physical page addresses; on a TLB miss, the page table supplies the physical page or disk address. A two-level organization uses a level-1 table (page directory) pointing to level-2 page tables.]
15. Multi-Level Page Tables to Save Space

[Figure: the virtual address splits into VPN 1 (index into the level-1 table, the page directory), VPN 2 (index into the level-2 page table), and the page offset; the resulting PPN plus the offset forms the physical address.]

Address Translation / Cache Lookup

[Figure: the VA's VPN goes through the TLB to produce the PPN; the PPN plus page offset (PO) is split into cache TAG, index (IDX), and block offset (BO); the tag compare (=?) yields hit/miss and the data.]
Overlapped Cache & TLB Access
• Simple for small caches: IDX + BO bits ≤ PO bits
  – Must satisfy cache size / associativity ≤ page size
• Assume 4 KB pages and 2-way set-associative caches
  – What is the max cache size allowed for parallel address translation to work?

[Figure: the index and block offset fall entirely within the page offset, so the cache lookup proceeds in parallel with the TLB translation; the tag compare uses the translated PPN.]
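The constraint above can be expressed directly (Python sketch): with 4 KB pages and 2-way associativity, the largest cache that still allows parallel translation is 8 KB.

```python
# Sketch of the overlapped-access constraint: the index + block-offset
# bits must fit in the page offset, i.e.
#   cache_size / associativity <= page_size
# so the largest such cache is page_size * associativity.
def max_parallel_cache_size(page_size, associativity):
    return page_size * associativity

print(max_parallel_cache_size(page_size=4096, associativity=2))  # 8192 bytes
```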
Virtually-Addressed Cache
• Look up the cache using the VA
• Access the TLB only on a cache miss
• Use the PA to access the next level (L2)

[Figure: the VA's TAG/IDX/BO index the cache directly; the TLB translates to PPN + PO only when the cache misses.]
Or, Only Use Virtual Bits to Index the Cache
• No need to wait for the TLB: the TLB access runs in parallel (e.g., for larger caches)
• A physically-tagged but virtually-indexed cache
• Can distinguish addresses from different processes
• But what if multiple processes share memory?

[Figure: the IDX and BO come from the virtual address while the TLB translates the VPN to the PPN; the tag compare uses the PPN.]
Summary
• Memory access is hard and complicated!
  – The speed of the CPU core demands very fast memory access. We use a cache hierarchy to solve this: it gives the illusion of speed most of the time, but is occasionally slow.
  – The size of programs demands large RAM. We use the VM hierarchy to solve this: it gives the illusion of size most of the time, but is occasionally slow.
• VM hierarchy
  – Another form of cache, but now between RAM and disk
  – The atomic units of memory are pages, typically 4 KB to 2 MB
  – The page table serves as the translation mechanism from virtual to physical addresses
    • The page table lives in physical memory, managed by the OS
    • For 64-bit addresses, multi-level tables are used, and some of the table is itself in VM
  – The TLB is yet another cache: it caches translated addresses (page table entries)
    • Saves having to go to physical memory to do a lookup on each access
    • Usually very small, managed by the OS
  – VM, the TLB, and caches have "interesting" interactions
    • Big impacts on speed and pipelining, and on exactly where the virtual-to-physical mapping takes place
And Now For Something Completely Different…
• Course evaluations! (15 minutes)
• Followed by… what to expect on the exam (without ruining too much of the surprise)

Course Evaluations
Final Exam
• Format
  – 3 short-answer questions
  – 6 true/false (-1/0/1 points each)
  – 4 or 5 longer questions
  – Hopefully no more than 2.5 hours to finish
  – (This is a bit harder than the exam from last year)
  – You are allowed one double-sided, hand-written, A4 sheet of notes during the exam, and a calculator
• Likely topics:
  – Caches, virtual memory, performance, pipelines, assembly, arithmetic, (simple) logic, input/output, etc.
  – Very likely topics: anything we spent time going through with multiple animations/examples in class
Questions?