


14/12/2011
Computer Architecture 1 - David Black-Schaffer

Introduction to Computer Architecture: Review

Overview
• How to write assembly
• Saving registers
• Two's complement math, floating point
• Combinational logic and state machines
• What does "fastest" mean?
• How to implement the MIPS ISA
• How to split up the instruction execution
• How to pipeline to make it regular
• Hazards: forwarding, branching
• Memory-mapped I/O, DMA, interrupts and polling
• SRAM vs. DRAM, row and column access
• Associativity, tags, replacement policies
• VM for size and protection
• How VM interacts with caches
• What to think about for the final

Course Outline
1. Introduction to Processors and Binary Numbers
2. ISA: Instruction Formats and Execution
3. ISA: Addressing Modes and Procedure Calls
4. Arithmetic and Integer Numbers
5. Digital Logic
6. Performance
7. Datapath: Single-cycle
8. Datapath: Multi-cycle
9. Datapath: Exceptions and Pipelining
10. Datapath: Pipelining Implementation
11. I/O
12. Memories
13. Caches
14. Virtual Memory 1: Address Translation
15. Virtual Memory 2: TLBs and Caches
16. Review

1. Introduction to Processors

• Basic computer operation:
– 1. Load the instruction
– 2. Figure out what operation to do (control)
– 3. Figure out what data to use (data)
– 4. Do the computation
– 5. Figure out what instruction to load next

[Figure: block diagram of basic computer operation. Control (if, else, loop) looks at the current instruction to decide what to do and which instruction comes next; Compute (add, sub, mul, etc.) produces the result; Data (load/store) and Instructions are both fetched from Memory (big and slow).]

[Figure: an example program and its data in memory.
0: load r0, mem[7]
1: r1 = r0 - 2
2: j_zero r1, 5 (done)
3: r0 = r0 + 1
4: jump 1 (loop)
5-6: (empty)
7: data]

1.  Binary  Numbers  

• Binary Numbers
– Adders have 3 inputs and 2 outputs
– Overflow limits the maximum we can represent
• Two's complement
– Allows us to handle negative numbers
• Basic idea: the biggest digit is negative
• 1000 (base 2) = -1*2^3 + 0*2^2 + 0*2^1 + 0*2^0 = -8 (base 10)
• Numbers range from -1*2^(n-1) to 2^(n-1) - 1
– Subtraction (invert and add 1)
– Addition (same as unsigned)
– Comparisons (a bit tricky)
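As a concrete illustration of the "invert and add 1" rule, here is a minimal C sketch (my own example, not from the slides), assuming 8-bit values:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t x = 5;                    /* 00000101 */
        uint8_t neg = (uint8_t)(~x + 1);  /* invert and add 1: 11111011 */
        printf("-5 as 8-bit two's complement: 0x%02X (%d)\n",
               neg, (int8_t)neg);         /* prints 0xFB (-5) */
        /* An 8-bit value ranges from -1*2^7 = -128 to 2^7 - 1 = 127. */
        return 0;
    }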

[Figure: a 1-bit adder block with inputs A and B and outputs S (sum) and Cout (carry out), shown with several example single-bit additions.]
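A 1-bit full adder like the one in the figure can be written directly from its truth table; this is just an illustrative C sketch, not course code:

    #include <stdio.h>

    /* One-bit full adder: sum = A xor B xor Cin; carry out when at least two inputs are 1. */
    void full_adder(int a, int b, int cin, int *s, int *cout) {
        *s    = a ^ b ^ cin;
        *cout = (a & b) | (cin & (a ^ b));
    }

    int main(void) {
        int s, cout;
        full_adder(1, 1, 0, &s, &cout);     /* 1 + 1 = 10 in binary */
        printf("S=%d Cout=%d\n", s, cout);  /* prints S=0 Cout=1 */
        return 0;
    }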

2. ISA: Instruction Formats and Execution

• Register machines (load-store, number of registers)
• Memory organization (words/bytes)
• Program counter and next instruction

for (j = 1; j < 10; j++) {
    a = a + b
}

[Figure: the compiler turns the loop into assembly (ADD R1, R2, R3; SUB R3, R2, R1), and the assembler turns the assembly into an executable binary (0010100101 0101010101 ...). The machine that runs it contains an ALU, control logic, a register file, a program counter, an instruction register, a memory address register, a memory data register, and memory.]

2. ISA: Instruction Formats and Execution

• MIPS instruction formats (R, I, J)
• Immediate sizes and how they are used
– Sign-extended
– How to load 32-bit constants
• Control
– Jumps (unconditional - how far can they jump?)
– Branches (conditional - how far?)
• RISC vs. CISC

Field sizes: 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits - all MIPS instructions are 32 bits
R-format: op | rs | rt | rd | shamt | funct - arithmetic instruction format
I-format: op | rs | rt | address/immediate - transfer, branch, immediate format
J-format: op | target address - jump instruction format
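To make the field sizes concrete, here is a small C sketch (my own example, not from the slides) that packs an R-format instruction such as add $rd, $rs, $rt into 32 bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack the six R-format fields (6+5+5+5+5+6 = 32 bits) into one word. */
    uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                      uint32_t rd, uint32_t shamt, uint32_t funct) {
        return (op << 26) | (rs << 21) | (rt << 16) |
               (rd << 11) | (shamt << 6) | funct;
    }

    int main(void) {
        /* add $8, $9, $10: op = 0, funct = 0x20 for add */
        printf("0x%08X\n", encode_r(0, 9, 10, 8, 0, 0x20));  /* 0x012A4020 */
        return 0;
    }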


3. ISA: Addressing Modes

• Base addressing (I-format): op | rs | rt | address
– The 16-bit (signed) offset is added to the base register to form the address of a word in memory.
– We can access +/- 32,768 bytes (2^15) around the base register.
– Example: lw R1, 100(R2) encodes op = 35, rs = 2, rt = 1, address = 100.

Don't forget the R- and J- addressing modes!

3. ISA: Procedure Calls

Caller:
add $a0, $t0, 2      ; set up the arguments
add $a1, $s0, $zero
add $a2, $s1, $t0
add $a3, $t0, 3
addi $sp, $sp, -4    ; adjust the stack to make room for one item
sw $t0, 0($sp)       ; save $t0 in case the callee uses it
jal leaf_example     ; call the leaf_example procedure
lw $t0, 0($sp)       ; restore $t0 from the stack
addi $sp, $sp, 4     ; adjust the stack to delete one item
add $t2, $v0, $zero  ; move the result into $t2

Callee:
leaf_example:        ; calculates f = (g+h) - (i+j)
                     ; g, h, i, and j are in $a0, $a1, $a2, $a3
addi $sp, $sp, -4    ; adjust the stack to make room for one item
sw $s0, 0($sp)       ; save $s0 for the caller
add $t0, $a0, $a1    ; g = $a0, h = $a1
add $t1, $a2, $a3    ; i = $a2, j = $a3
sub $s0, $t0, $t1
add $v0, $s0, $zero  ; return f in the result register $v0
lw $s0, 0($sp)       ; restore $s0 for the caller
addi $sp, $sp, 4     ; adjust the stack to delete one item
jr $ra               ; jump back to the calling routine

• The caller uses $t0, $s0, $s1. $t0 is not preserved by the callee, so the caller has to save it. What about $s0, $s1?
• After the call, the caller restores $t0.
• The result is in $v0. Why did we not save $t2? The callee finds its arguments in $a0-$a3.
• The callee uses $s0, so it must save it. What about $s1?
• Results are placed in $v0.
• The callee must restore $s0.

Nested Calls

•  MIPS stacks are implemented via software convention •  What is stored on the stack?

Stacking of Subroutine Calls & Returns and Environments:
[Figure: A: main() calls B, B calls C, C returns, B returns. The stack of active environments grows from A, to A B, to A B C, and then shrinks back to A B and A as each RET executes.]

Summary: ISA
• Architecture = what's visible to the program about the machine
– Not everything in the deep implementation is "visible"
– The name for this invisible stuff is microarchitecture or "implementation" (and it's really messy… but fun.)
• A big piece of the ISA = assembly language structure
– Primitive instructions that execute sequentially, atomically
– Issues are formats, computations, addressing modes, etc.
– Two broad flavors:
  • CISC: lots of complicated instructions
  • RISC: a few, essential instructions
  • Basically all recent machines are RISC, but the dominant machine of today, Intel x86, is still CISC (though they do RISC tricks in the guts…)
• We did one example in some detail: MIPS (from P&H Chap. 3)
– A RISC machine; its virtue is that it is pretty simple
– You can "get" the assembly language without too much memorization

Binary Addition               Binary Multiplication
0 + 0 = 0                     0 x 0 = 0
0 + 1 = 1                     0 x 1 = 0
1 + 0 = 1                     1 x 0 = 0
1 + 1 = 10                    1 x 1 = 1

Addition of two POSITIVE integers, A = (10111010)2 = (186)10 and B = (110111)2 = (55)10:

    11111      (carry)
   10111010
 +   110111
 ----------
   11110001  = (241)10

4. Arithmetic
• Overflow
• Serial multiplication

4. Integer Numbers
• Signed magnitude
– 1 bit for the sign
– Have two zeros
– Operations are a pain
• Two's complement
– To negate: invert and add 1 (easy to do with Cin)
– The complement of 0001 is 2^4 - 0001 = 10000 - 0001 = 1111
– Overflow is different
• Non-integers
– Fixed point: 0010.1100
– Floating point (mantissa, exponent, and sign)
• Multiplication is easier than addition


5. Digital Logic - Basic Gates

AND              OR               XOR
A B | Out        A B | Out        A B | Out
0 0 |  0         0 0 |  0         0 0 |  0
0 1 |  0         0 1 |  1         0 1 |  1
1 0 |  0         1 0 |  1         1 0 |  1
1 1 |  1         1 1 |  1         1 1 |  0

NAND             NOR              XNOR
A B | Out        A B | Out        A B | Out
0 0 |  1         0 0 |  1         0 0 |  1
0 1 |  1         0 1 |  0         0 1 |  0
1 0 |  1         1 0 |  0         1 0 |  0
1 1 |  0         1 1 |  0         1 1 |  1

5. Karnaugh Maps
• Order the variables such that only 1 changes in each row/column (Gray coding)
• Groups may overlap

CD\AB   00   01   11   10
  00     0    1    1    0
  01     0    1    1    0
  11     0    1    1    0
  10     0    1    0    0

Groups: B•D, !A•B, !C•B  =>  B•(D + !A + !C)
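One way to sanity-check the simplification B•(D + !A + !C) is to compare it against the map for all 16 input combinations; a small C sketch (illustrative only, the map is re-stated as a logic expression):

    #include <stdio.h>

    int main(void) {
        /* From the K-map: the output is 1 when AB = 01 (any CD),
           or when AB = 11 and CD is not 10. */
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                for (int c = 0; c <= 1; c++)
                    for (int d = 0; d <= 1; d++) {
                        int map = (b && !a) || (b && a && !(c && !d));
                        int simplified = b && (d || !a || !c);
                        if (map != simplified)
                            printf("Mismatch at A=%d B=%d C=%d D=%d\n", a, b, c, d);
                    }
        printf("Done: the simplified expression matches the map.\n");
        return 0;
    }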

5. Logic Blocks - Memory
[Figure: reading an SRAM array. A 4-bit binary address arrives on the address bus; bits 0-2 go to a decoder that produces 1-hot row enables (e.g., 100 in binary becomes the 1-hot value 00010000) selecting one row of the array of SRAM cells (each box stores 1 bit); bit 3 drives a MUX that picks which half of the row goes to the 8-bit data output. Example: read address 12 = 1100, bit 3 = 1.]


5. State Storage - Building a Counter
[Figure: a 3-bit latch holds current_value. Its outputs (out0-out2) feed the A inputs of a chain of three full adders whose B inputs are 1, 0, 0 and whose carries ripple from Cout to the next Cin; the sums form next_value, which feeds back to the latch inputs (in0-in2).]
• When the clock edge rises (0 to 1), the next_value is stored as the current_value
• The new current_value then goes through the adder to make a new next_value
• Everything takes some time
Q: What limits the speed of the clock?
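The counter in the figure can be modeled in a few lines of C: the latch holds current_value, and on each simulated clock edge the adder's output becomes the new current_value (an illustrative sketch; the 3-bit wrap-around is assumed):

    #include <stdio.h>

    int main(void) {
        int current_value = 0;                            /* the 3-bit latch */
        for (int clock = 0; clock < 10; clock++) {
            int next_value = (current_value + 1) & 0x7;   /* three full adders add 001 */
            current_value = next_value;                   /* rising clock edge loads the latch */
            printf("after edge %d: %d%d%d\n", clock + 1,
                   (current_value >> 2) & 1, (current_value >> 1) & 1,
                   current_value & 1);
        }
        return 0;
    }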

5. Finite State Machines
[Figure: state diagram with states IDLE, WAIT_CODE, and CODE_OK, with transitions on Button A and Button B; LAUNCH = false in IDLE and WAIT_CODE, LAUNCH = true in CODE_OK.]
• IDLE: LAUNCH = false. If (A button) the next state is WAIT_CODE, else the next state is IDLE.
• WAIT_CODE: LAUNCH = false. If (B button) the next state is CODE_OK, else the next state is IDLE.
• CODE_OK: LAUNCH = true. The next state is IDLE.
[Figure: generic FSM structure. The current state is held in memory and updated on the clock; next-state logic (from the truth table) combines the inputs with the current state to produce the outputs and the next state.]
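The launch-code FSM maps directly onto a switch statement: the stored value is the current state and the case bodies are the next-state logic. A minimal C sketch (the button-input sequence and names are my own, purely illustrative):

    #include <stdio.h>

    enum state { IDLE, WAIT_CODE, CODE_OK };

    int main(void) {
        enum state s = IDLE;
        const char inputs[] = "ABBAB";           /* assumed sequence of button presses */
        for (int i = 0; inputs[i] != '\0'; i++) {
            int launch = (s == CODE_OK);         /* output depends only on the current state */
            printf("state=%d LAUNCH=%s input=%c\n", s, launch ? "true" : "false", inputs[i]);
            switch (s) {                         /* next-state logic from the truth table */
            case IDLE:      s = (inputs[i] == 'A') ? WAIT_CODE : IDLE; break;
            case WAIT_CODE: s = (inputs[i] == 'B') ? CODE_OK   : IDLE; break;
            case CODE_OK:   s = IDLE; break;
            }
        }
        return 0;
    }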

5. Logic: Summary
• Combinational logic
– Inputs "immediately" produce outputs
– Use truth tables to determine logic equations
– Common functions such as MUX, DEMUX, decode, encode
• Sequential logic
– The current state is updated to the next state by the clock
– Store the current state in a latch/memory/flip-flop
– Combinational logic is used to determine the next state, and the current state is updated on the clock
• What do you need for the labs?
– Combinational logic for the ALU and adder
– Combinational logic to decode instructions and connect up the ALU
• What do you need for life?
– Understand that state is stored in memories
– ...that state is updated by combinational logic
– ...that clock speed depends on how long it takes to calculate the next state


6. Performance

• Comparing machines using microarchitecture
– Latency (instruction execution time from start to finish)
– Throughput (number of instructions per unit of time)
– Processor cycle time / clock rate (GHz)
– CPI - cycles per instruction
– MIPS - millions of instructions per second
– FLOPS - floating-point operations per second (also GFLOPS/TFLOPS/PFLOPS/EFLOPS)
• Comparing machines using benchmark programs
– Which programs?
– Benchmark suites / microbenchmarks
– Different means: arithmetic, harmonic, and geometric

6. Metrics

• CPI (Cycles Per Instruction):
– 25 instructions are loads/stores (each takes 2 cycles)
– 50 instructions are adds (each takes 1 cycle)
– 25 instructions are square roots (each takes 100 cycles)
– CPI = ((25 * 2) + (50 * 1) + (25 * 100)) / 100 = 2600 / 100 = 26.0
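The CPI here is just a weighted average of the per-instruction cycle counts; a quick C check of the arithmetic (illustrative only):

    #include <stdio.h>

    int main(void) {
        /* 25 loads/stores at 2 cycles, 50 adds at 1 cycle, 25 square roots at 100 cycles */
        double total_cycles = 25 * 2 + 50 * 1 + 25 * 100;  /* 2600 cycles */
        double cpi = total_cycles / 100.0;                  /* 100 instructions */
        printf("CPI = %.1f\n", cpi);                        /* prints CPI = 26.0 */
        return 0;
    }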

• MIPS (Millions of Instructions Per Second):
– Machine A has a special instruction for performing square root calculations. It takes 100 cycles to execute.
– Machine B doesn't have the special instruction. It must perform square root calculations in software using simple instructions (e.g., Add, Mult, Shift) that each take 1 cycle to execute.
– Machine A: 1/100 MIPS = 0.01 MIPS. Machine B: 1 MIPS.
– The square root takes 100 cycles: it hurts the average "instructions per second" but may improve performance dramatically!

6. Comparisons
• Averages (arithmetic, harmonic, geometric)
• Normalizing (weights, runtimes)
• What you really care about is time… …for your application (benchmarks)
• Amdahl's Law
[Figure: Amdahl's Law example. A 100 s program is split into a 10 s part and a 90 s part. A 10x speedup on the 10 s part reduces it to 1 s, so the total only drops from 100 s to 91 s.]
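The same example can be checked with Amdahl's Law, Speedup_overall = 1 / ((1 - f) + f/s), where f is the fraction of time that is sped up and s is the speedup of that part; a short C sketch (illustrative only):

    #include <stdio.h>

    int main(void) {
        double f = 10.0 / 100.0;   /* the sped-up part is 10 s of the 100 s program */
        double s = 10.0;           /* 10x speedup on that part */
        double overall = 1.0 / ((1.0 - f) + f / s);
        printf("new time = %.0f s, overall speedup = %.3fx\n",
               100.0 * ((1.0 - f) + f / s), overall);   /* 91 s, about 1.099x */
        return 0;
    }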

6. Performance Summary
• Performance is important to measure
– For architects comparing different deep mechanisms
– For developers of software trying to optimize code and applications
– For users trying to decide which machine to use, or to buy
• Performance metrics are subtle
– It is easy to mess up the "machine A is XXX times faster than machine B" numerical performance comparison
– You need to know exactly what you are measuring: time, rate, throughput, CPI, cycles, etc.
– You need to know how combining these into aggregate numbers "distorts" the individual numbers in different ways (P&H is good on this in Chapter 2)
– No metric is perfect, so there is a lot of emphasis on standard benchmarks today

7. Datapath: Single-cycle

• Control path
– What to do
– Logic for decoding signals (ALU, register file, muxes)
• Data path
– Process the data
– What resources do we need? (ALU, register file, memory, PC)

7. Complete Single-cycle Datapath
[Figure: the full single-cycle datapath. The current PC reads the Instruction Memory (RAM); the instruction selects registers in the Register File (Read Reg 1/2, Write Reg, Write Data, Read Data 1/2); a sign extender (16 to 32 bits) and MUXes feed the ALU (with a Zero output); the ALU result or the Data Memory (RAM) read data is selected by a MUX for write-back; the next PC comes from a +4 adder or from a branch-target adder (sign-extended offset << 2) through a MUX.]


7. Cost of the Single-Cycle Architecture
[Figure: instruction classes 1, 2, and 3 take different amounts of time, but our cycle time must fit the longest instruction, so most of the time is wasted.]
Why is one instruction longer? It has to do more operations (e.g., a conditional branch vs. a jump).

8. Multi-cycle Solution
[Figure: the same instruction classes broken into steps; one class takes 4 cycles and another takes 2 cycles, so there is much less wasted time.]
Idea: let the FASTEST instruction determine the clock period.

8. Multi-cycle Memory-Reference
[Figure: the multi-cycle datapath. A single Memory holds both instructions and data; the Instruction Register, Memory Data Register (MDR), A, B, and ALUOut registers hold values between cycles. MUXes controlled by IorD, ALUSelA, ALUSelB, RegDst, and MemtoReg route the PC, register values, the sign-extended (and shifted-left-2) immediate, the constant 4, and the jump address (PC[31:28] plus the 26-bit target shifted left 2) into the ALU, the memory, and the register file.]
[Figure: control FSM fragment for memory-reference instructions. From state 1, state 2 computes the memory address (ALUSelA = 1, ALUSelB = 10, ALUOp = 00). A load then goes to state 3 (memory access: MemRead, IorD = 1) and state 4 (write-back: RegWrite, MemtoReg = 1, RegDst = 0); a store goes to state 5 (memory access: MemWrite, IorD = 1). Both paths go back to state 0. Control signals include IorD, MemRead, MemWrite, IRWrite, RegDest, RegWrite, ALUSelA, ALUSelB, MemToReg, and ALUOp; the ALU control is derived from ALUOp and Instruction[5:0].]

8. Performance of the Multicycle Implementation

• Each type of instruction can take a variable number of cycles
• Example
– Assume the following instruction distribution:
  • loads: 5 cycles, 22%
  • stores: 4 cycles, 11%
  • R-type: 4 cycles, 49%
  • branches: 3 cycles, 16%
  • jumps: 3 cycles, 2%
– What's the average Cycles Per Instruction (CPI)?
  CPI = (CPU clock cycles / Instruction count)
  CPI = (5 cycles * 0.22) + (4 cycles * 0.11) + (4 cycles * 0.49) + (3 cycles * 0.16) + (3 cycles * 0.02)
  CPI = 4.04 cycles per instruction
– What was the CPI for the single-cycle machine?
  • Single cycle implies 1 clock cycle per instruction --> CPI = 1.0
  • So isn't the single-cycle machine about 4 times faster?

8. Performance of the Multicycle Implementation

• The correct answer should consider the clock cycle time as well:
– For the single-cycle implementation, the cycle time is given by the worst-case delay: Tcycle = 40 ns (for load instructions, see slide 8)
– For the multicycle implementation, the cycle time is given by the worst-case delay over all execution steps: Tcycle = 10 ns (for each of the steps 1, 2, 3, or 4).
• The execution time per instruction is:
– Tcycle * CPI = 40 ns * 1.0 = 40 ns per instruction for the single-cycle machine
– Tcycle * CPI = 10 ns * 4.04 = 40.4 ns per instruction for the multicycle machine
– Thus, the single-cycle machine is only 1% faster
• When considering other types of units (e.g., FP), the single-cycle implementation can be very inefficient.
– Think about how long it takes to do divide or square root!
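A short C sketch of the comparison above (the cycle times and CPIs are taken from the slide; the code itself is just illustrative):

    #include <stdio.h>

    int main(void) {
        double t_single = 40.0 * 1.0;    /* 40 ns cycle x CPI of 1.0 */
        double t_multi  = 10.0 * 4.04;   /* 10 ns cycle x CPI of 4.04 */
        printf("single-cycle: %.1f ns/instr, multicycle: %.1f ns/instr\n",
               t_single, t_multi);
        printf("single-cycle is %.1f%% faster\n",
               100.0 * (t_multi - t_single) / t_single);   /* about 1% */
        return 0;
    }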

8. Summary
• Single-cycle implementations have to consider the worst-case delay through the datapath to come up with the cycle time.
• Multicycle implementations have the advantage of using a different number of cycles for executing each instruction.
• In general, the multicycle machine is better than the single-cycle machine, but the actual execution time strongly depends on the workload.
• The most widely used machine implementation is neither single cycle nor multicycle - it's the pipelined implementation. (Next lecture)


9. Pipelined Datapath
[Figure: a single 100 ns computation is "too long". The pipelined version has 5 pipe stages of ~20 ns each; latches, called "pipeline registers", break up the computation into stages and store the intermediate results.]

9. Pipelining: Implementation Issues
• What prevents us from just doing a zillion pipe stages?
– Those latches are NOT free: they take up area, and there is a real delay to go THROUGH the latch itself
– In modern, deep pipelines (10-20 stages), this is a real effect
– Typically we see logic "depths" of 10-20 "gates" in one pipe stage
[Figure: an unpipelined ~2 ns computation vs. a 10-stage pipe with ~0.2 ns stages. At these speeds, and with this few levels of logic, the latch delay is important.]
• Ideally, Speedup_pipeline = Time_sequential / Time_pipelined = Pipeline Depth

9. Performance of Pipelined Systems
[Figure: instructions vs. time. Unpipelined: each instruction has a latency of 5 cycles and the throughput is 1 instruction per 5 cycles. Pipelined: the latency is still 5 cycles (one pipeline stage per cycle), but the throughput is 1 instruction per cycle.]
Ideal speedup only if we can keep the pipeline full!

9. Complete 5-Stage Pipeline
[Figure: the single-cycle datapath split by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers: PC and Instruction Memory (IF), Register File read (ID), sign extension, branch-target adder and ALU (EX), Data Memory (MEM), and the write-back MUX (WB).]
In cycle 4 we have 3 instructions "in flight":
• Instruction 1 is accessing the memory (DM)
• Instruction 2 is using the ALU (EX)
• Instruction 3 is accessing the register file (ID)

Flow of Instructions Through the Pipeline
[Figure: program execution over clock cycles 1-7. LW R1, 100(R0), LW R2, 200(R0), and LW R3, 300(R0) each flow through IM (instruction fetch), REG (register read), ALU, DM (data memory), and REG (write-back), with each instruction starting one clock cycle after the previous one.]

10. Data Hazards
• In this particular case...
– The R10 value is not computed or returned to the register file when a later instruction wants to use it as an input
[Figure: two instructions shown as Iget, Rget, ALU op, Mput, Rput stages; the first writes R10 in its Rput stage, but the second reads R10 in its Rget stage before that happens.]
Double pumping the reg file doesn't help here; the later instruction needs R10 2 clock cycles before it's been computed & stored back. Oops...


10. Forwarding
[Figure: the complete 5-stage pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with a forwarding bus from WB back toward the ALU inputs, so a result can be used before it has been written to the register file.]

10. Hazards
• Data hazards
– An instruction depends on the result of a prior computation which is not ready yet
– Stall, double pump, and forward to fix
• Structural hazards
– The HW cannot support a combination of instructions
– OK, maybe add extra hardware resources; may still have to stall
• Control hazards
– Pipelining of branches and other instructions which change the PC
– Branch predictors, branch delay slots, early branch computation

10. Pipelining Summary
• Need to keep the pipeline full for performance
– Hazards make this hard
– Dependencies and resource conflicts
– Time to access memory! (caches)
• Performance
– Can't use infinitely many stages
– Code and pipeline stages (branch delay slot)
• Exceptions/Interrupts
– Can happen at different places in the pipeline
– Can happen out-of-order (early in the pipeline for a later instruction vs. later in the pipeline for an earlier one)
– Need to restart instructions
– Jump to the OS to handle them

11. Input/Output
• Busses
– Clocking, width, arbitration
• Performance
– Latency
– Throughput
• Talking to devices
– Method: memory-mapped / IO instructions
– Means: polling, interrupt, DMA

12. Memory
• Random Access Memories
– DRAM: Dynamic Random Access Memory
  • High density, low power, cheap, slow
  • Dynamic: needs to be "refreshed" regularly
– SRAM: Static Random Access Memory
  • Low density, high power, expensive, fast
  • Static: content will last "forever" (until power is lost)
• What gets used where?
– Main memory is DRAM: you need it big, so you need it cheap
– CPU cache memory is SRAM: you need it fast, so it's more expensive, so it's smaller than you would usually want due to resource limitations
• Relative performance
– Size: DRAM is 4-8x denser than SRAM
– Cost / cycle time: SRAM is 8-16x faster, and more $$$, than DRAM

12. Performance Impact (4-cycle memory)
lw r2, 0x20
lw r3, 0x30
add r1, r2, r3
sw r1, 0x40
[Figure: pipeline diagram over cycles 1-18 showing each instruction going through F, D, A, M, W, with stall (S) cycles inserted while waiting for the 4-cycle memory.]
• With a 1-cycle memory system, the program took 8 cycles
– CPI = 8 cycles / 4 instructions = 2.0
– With lots more instructions, CPI would approach 1.0
• With a 4-cycle memory system, the program takes 18 cycles
– CPI = 18 cycles / 4 instructions = 4.5
– This doesn't include the instruction fetch penalty found in a real memory system
Remember the bit about "if you can keep the pipeline full"?


12. Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: processor (control + datapath) with registers, a local cache, a shared cache (SRAM), main memory (DRAM), and secondary storage (disk). Approximate speed (ns): registers ~1, local cache 5-10, shared cache 10-50, DRAM 100s, disk 10,000,000s (tens of ms). Approximate size (bytes): registers 100s, local cache Ks, shared cache Ms, DRAM Gs, disk Ts.]

12. Memory Hierarchy: How Does it Work?
• Temporal locality (locality in time):
=> If an item is referenced, the same item will tend to be referenced again soon
=> Keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space):
=> If an item is referenced, nearby items will tend to be referenced soon
=> Move recently accessed "blocks" (groups of contiguous words) closer to the processor
• "Block" (or "line") - the minimum unit of data moved between 2 levels (e.g., between the L1 and L2 caches)
[Figure: blocks X and Y moving between an upper-level memory (closer to the processor) and a lower-level memory.]
Move data in blocks rather than bytes. (More efficient if blocks are larger.) A typical block is 64 bytes.

13. Basic Cache Design
• A cache only holds a portion of a program
– Which part of the program does the cache contain?
• The cache holds the most recently accessed references
• The cache is divided into units called cache blocks (also known as cache "lines"); each block holds a contiguous set of memory addresses
– How does the CPU know which part of the program the cache is holding?
• Each cache block has extra bits, called the cache tag, which hold the main memory address of the data in the block
[Figure: a CPU with a 2-block cache (block 0 and block 1, each with a tag and a block-sized data field) in front of a DRAM memory spanning addresses 0x00000000 to 0xFFFFFFFC.]

13. The ABC's (or 1-2-3-4's) of Caches
• Caching is a general concept used in processors, operating systems, file systems, and applications.
• Wherever it is used, there are four basic questions that arise:
– Q1: Where can a block be placed in a cache?
  Direct mapped, set-associative, fully-associative
– Q2: How is a block found if it is in a cache?
  Indexing (direct mapped), limited search (set-associative), full search (fully-associative)
– Q3: Which block should be replaced on a miss?
  Random, least-recently used (LRU)
– Q4: What happens on a write?
  Write-through or write-back

13. Cache Block Placement
[Figure: memory blocks 0-31 and an 8-block cache.
Fully-associative: block 12 can go anywhere (complex).
Direct mapped: block 12 can go only into block 4 (12 mod 8) (inflexible).
2-way set-associative: block 12 can go anywhere in set 0 (12 mod 4), with sets 0-3 (a compromise).]

Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache
[Figure: memory addresses 0-15 (0000-1111) map to cache lines 0-3 by address mod 4: addresses ending in 00 map to line 0, 01 to line 1, 10 to line 2, and 11 to line 3.]


Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache
[Figure: each cache line has a tag. Addresses 0, 4, 8, and 12 all map to line 0; 1, 5, 9, and 13 to line 1; 2, 6, 10, and 14 to line 2; 3, 7, 11, and 15 to line 3.]
Memory addresses that map to the same cache line will conflict in the cache - we can only store one or the other.

Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache, 4 bytes/line
[Figure: each cache line now holds a tag plus bytes 0-3. Addresses of the form xxx00zz index line 0, xxx01zz line 1, xxx10zz line 2, and xxx11zz line 3. We need to ignore the last two bits because they choose the byte within the cache line.]
[Figure: line 0 holds bytes xxx0000-xxx0011, line 1 holds xxx0100-xxx0111, line 2 holds xxx1000-xxx1011, and line 3 holds xxx1100-xxx1111; each line stores the tag bits xxx. We need the tag to tell us which memory address xxxYYZZ is in each of the lines for YYZZ.]

13. Direct-Mapped Cache Indexing
• Direct-mapped caches have a 1:1 mapping from memory addresses to cache entries
– E.g., a 4-entry cache with 4 bytes per line:
  • Entry 0: xxx00xx
  • Entry 1: xxx01xx
  • Entry 2: xxx10xx
  • Entry 3: xxx11xx
– So addresses 0100000 and 1100001 map to the same cache line.
• To tell which is in the cache we look at the tag:
– If the tag is 010 we have the first memory address
– If the tag is 110 we have the second memory address
• To get individual bytes, we look at the byte offset:
– 0100000 is the first byte in the line in cache entry 00
– 0100001 is the second byte in the line in cache entry 00
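In code, splitting an address into byte offset, index, and tag is just shifting and masking; a small C sketch for the 4-entry, 4-bytes-per-line direct-mapped cache above (my own example, not from the slides):

    #include <stdio.h>

    int main(void) {
        unsigned addr = 0x64;                 /* 1100100 in binary */
        unsigned offset = addr & 0x3;         /* low 2 bits: byte within the 4-byte line */
        unsigned index  = (addr >> 2) & 0x3;  /* next 2 bits: which of the 4 cache entries */
        unsigned tag    = addr >> 4;          /* remaining bits: stored as the tag */
        printf("offset=%u index=%u tag=0x%X\n", offset, index, tag);
        /* Two addresses conflict when they have the same index but different tags. */
        return 0;
    }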

13. Set-Associative Cache Indexing
• Set-associative caches have a 1:1 mapping from memory addresses to sets
• Within a set, a memory address can be put anywhere (multiple entries per set)
– E.g., a 2-way set-associative cache with 4 bytes per line:
  • Set 0: xxxx0xx
  • Set 1: xxxx1xx
  • ...
– So addresses 1010000 and 0110001 map to the same set, but the set can have multiple entries.
– If the set has 2 entries, we can store both values in the cache, even though they map to the same set.

Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line
[Figure: cache lines 0 and 1 form set 0 (addresses xxxx0zz, bytes xxxx000-xxxx011) and lines 2 and 3 form set 1 (addresses xxxx1zz, bytes xxxx100-xxxx111); each line stores the tag bits xxxx. We need the tag to tell us which memory address xxxxYZZ is in each of the lines for YZZ.]


Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line
[Figure: in the memory space of addresses 0-15, addresses 0-3 and 8-11 map to set 0, and addresses 4-7 and 12-15 map to set 1; each set has two lines. But now we can have any two of set 0 and any two of set 1 at the same time.]

Q1: Block Placement
[Figure: memory blocks 0-31 and an 8-block cache. Direct mapped: block 12 can go only into block 4 (12 mod 8).]
Each memory location maps to one cache line. The mapping is done by a modulo operator, which is accomplished by ignoring some of the MSBs (e.g., address 001100 = 12 and 000100 = 4 both get mapped to 100 = 4).

Q1: Block Placement
[Figure: memory blocks 0-31 and an 8-block cache. Fully-associative: block 12 can go anywhere.]
Any memory location can go to any cache line. This is infinitely flexible, but it requires very complex (and slow) hardware.

Q1: Block Placement
[Figure: memory blocks 0-31 and an 8-block cache organized as sets 0-3. 2-way set-associative: block 12 can go anywhere in set 0 (12 mod 4).]
Every memory location maps to exactly one set of cache lines. Within a set it can go in any location. This is a tradeoff between simplicity (direct-mapped to sets) and flexibility (fully associative within sets).

13. Cache Performance
• What's the impact on performance (CPU time) when the following cache behavior is included?
– 50-cycle miss penalty
– All instructions normally take 2.0 cycles (excluding memory stalls)
– Miss rate is 2.0%
– Average of 1.33 memory references per instruction

CPU time = IC x (CPI_execution + Memory stall cycles per instruction) x Clock cycle time
         = IC x (2.0 + 0.02 x 1.33 x 50) x Clock cycle time
         = IC x 3.33 x Clock cycle time

• Two important results to keep in mind:
– The lower the CPI_execution, the higher the relative impact of a cache miss penalty (more sensitive to memory latency!)
– Comparing two machines with identical memory systems, the machine with the higher clock rate will have the larger number of clock cycles per miss, and hence the memory portion of its CPI will be higher.
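The CPU-time formula above is easy to evaluate; a short C sketch with the slide's numbers (illustrative only):

    #include <stdio.h>

    int main(void) {
        double cpi_exec     = 2.0;    /* cycles per instruction without memory stalls */
        double miss_rate    = 0.02;
        double refs_per_ins = 1.33;   /* memory references per instruction */
        double miss_penalty = 50.0;   /* cycles */
        double cpi = cpi_exec + miss_rate * refs_per_ins * miss_penalty;
        printf("effective CPI = %.2f\n", cpi);   /* prints 3.33 */
        return 0;
    }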

14. Virtual Memory
• What is virtual memory?
– A technique that allows execution of a program that:
  • can reside in non-contiguous memory locations
  • does not have to completely reside in memory
– Allows the computer to "fake" a program into believing that its:
  • memory is contiguous
  • memory space is larger than physical memory
• Why is VM important?
– Cheap - you no longer have to buy lots of RAM
– Removes the burden of memory resource management from the programmer
– Enables multiprogramming, time-sharing, protection


14. Basic VM Algorithm
[Figure: a processor running a program (add r1,r2,r3; sub r2,r3,r4; lw r2,0x04; mult r3,r4,r5) issues virtual addresses (0x00, 0x04, 0x08, 0x0C, 0x10, ...); the VA -> PA translation maps each one onto a physical address in RAM (0x00-0x0C) or onto disk.]
• The program uses virtual addresses (load, store, instruction fetch)
• The computer translates the virtual address (VA) to a physical address (PA)
• The computer reads RAM using the PA, returning the data to the program
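The translation step in the figure can be sketched as a page-table lookup in C (the page size, table contents, and names here are my own assumptions, purely illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u   /* assume 4 KB pages */

    int main(void) {
        /* Toy page table: virtual page number -> physical page number. */
        uint32_t page_table[4] = { 7, 3, 0, 5 };

        uint32_t va  = 0x1A04;                  /* some virtual address */
        uint32_t vpn = va / PAGE_SIZE;          /* virtual page number = 1 */
        uint32_t off = va % PAGE_SIZE;          /* offset within the page */
        uint32_t pa  = page_table[vpn] * PAGE_SIZE + off;
        printf("VA 0x%X -> PA 0x%X\n", va, pa); /* VA 0x1A04 -> PA 0x3A04 */
        return 0;
    }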

14. Page Tables and Entries
• Page tables
– Size of entries / size of the table
• Multi-level page tables (why?)
– How to look up entries
• TLB
– Thrashing of pages (LRU example)

Page Table Protection
• Separate page tables (per process) provide page-level protection
• The OS creates and manages page tables so that no user-level process can alter any process's page table
– Page tables are mapped into kernel memory where only the OS can read or write
[Figure: Process 1's page table maps virtual pages 0x00000-0x00004 to physical pages 0x001, 0x005, 0x00A, 0x004, 0x008; Process 2's page table maps the same virtual pages to physical pages 0x002, 0x006, 0x00B, 0x003, 0x009. Physical memory 0x00-0x0B is shared between P1's pages, P2's pages, and the OS. The page table is software managed by the OS; the TLB is hardware.]

14. Making Address Translation Fast
• A cache for address translations: the translation lookaside buffer (TLB)
[Figure: the TLB holds valid bits, tags (virtual page numbers), and physical page addresses; the page table holds valid bits and physical page or disk addresses for every virtual page, pointing into physical memory or disk storage.]

15. Multi-level Page Tables to Save Space
[Figure: the virtual address is split into VPN 1, VPN 2, and Offset. VPN 1 indexes the level 1 table (page directory), which points to a level 2 table (page table); VPN 2 indexes that table to get the PPN, and PPN + Offset forms the physical address.]

Address Translation / Cache Lookup
[Figure: the virtual address is split into VPN and PO (page offset). The TLB translates the VPN into a PPN; the PPN and PO form the physical address, which is split into TAG, IDX, and BO for the cache lookup; the cache compares the stored tag (=?) to produce Hit/Miss and the Data.]


Overlapped Cache & TLB Access
• Simple for small caches
– IDX + BO ≤ PO
• Must satisfy: Cache size / Associativity ≤ Page size
• Assume 4 KB pages and a 2-way set-associative cache
• What is the maximum cache size allowed for parallel address translation to work?
[Figure: the index (IDX) and block offset (BO) come entirely from the page offset (PO), so the cache can be indexed in parallel with the TLB lookup; the TLB's PPN is then compared with the cache tag (=?) to produce Hit/Miss and the Data.]
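One way to work out the answer (my arithmetic, not stated on the slide): for the index and block offset to fit inside the untranslated 12-bit page offset, we need cache size / associativity ≤ page size, so with 4 KB pages a 2-way set-associative cache can be at most 2 x 4 KB = 8 KB and still be indexed in parallel with the TLB lookup.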

Virtual Address Cache
• Look up the cache using the VA
• TLB access only on a miss
• Use the PA to access the next level (L2)
[Figure: the virtual address (VPN + PO) is split directly into TAG, IDX, and BO for the cache lookup; the TLB is consulted only on a miss, producing PPN + PO to access the next level.]

Or, Only Use Virtual Bits to Index the Cache
• Don't need to wait for the TLB
• Parallel TLB access (e.g., for larger caches)
• Physically-tagged but virtually-indexed cache
• Can distinguish addresses from different processes
• But, what if multiple processes share memory?
[Figure: the virtual index (IDX, BO) selects the cache set while the TLB translates the VPN in parallel; the resulting PPN is compared with the cache tag (=?) to produce Hit/Miss and the Data.]

Summary
• Memory access is hard and complicated!
– The speed of the CPU core demands very fast memory access. We use a cache hierarchy to solve this one. It gives the illusion of speed - most of the time. Occasionally slow.
– The size of programs demands large RAM. We use a VM hierarchy to solve this one. It gives the illusion of size - most of the time. Occasionally slow.
• VM hierarchy
– Another form of cache, but now between RAM and disk.
– The atomic units of memory are pages, typically 4 kB to 2 MB.
– The page table serves as the translation mechanism from virtual to physical addresses
  • The page table lives in physical memory, managed by the OS
  • For 64-bit addresses, multi-level tables are used, and some of the table is in VM
– The TLB is yet another cache - it caches translated addresses (page table entries).
  • It saves us from having to go to physical memory to do a lookup on each access
  • It is usually very small, managed by the OS
– VM, the TLB, and caches have "interesting" interactions.
  • Big impacts on speed and pipelining. Big impacts on exactly where the virtual-to-physical mapping takes place.

And Now For Something Completely Different…
• Course Evaluations! (15 minutes)
• Followed by… …what to expect on the exam (without ruining too much of the surprise)

Course Evaluations


Final Exam

• Format
– 3 short-answer questions
– 6 true/false (-1/0/+1 points each)
– 4 or 5 longer questions
– Hopefully no more than 2.5 hours to finish
– (This is a bit harder than the exam from last year)
– You are allowed one double-sided, hand-written, A4 sheet of notes during the exam, and a calculator
• Likely topics:
– Caches, virtual memory, performance, pipelines, assembly, arithmetic, (simple) logic, input/output, etc.
– Very likely topics: anything that we spent time going through with multiple animations/examples in class

Questions?