14/12/2011
Computer Architecture 1 – David Black-Schaffer
Introduction to Computer Architecture: Review
Overview
• How to write assembly; saving registers
• Two's complement math, floating point
• Combinational logic and state machines
• What does "fastest" mean?
• How to implement the MIPS ISA
• How to split up instruction execution; how to pipeline to make it regular
• Hazards: forwarding, branching
• Memory-mapped I/O, DMA, interrupts and polling
• SRAM vs. DRAM, row and column access
• Associativity, tags, replacement policies
• VM for size and protection; how VM interacts with caches
• What to think about for the final
Course Outline
1. Introduction to Processors and Binary Numbers
2. ISA: Instruction Formats and Execution
3. ISA: Addressing Modes and Procedure Calls
4. Arithmetic and Integer Numbers
5. Digital Logic
6. Performance
7. Datapath: Single-Cycle
8. Datapath: Multi-Cycle
9. Datapath: Exceptions and Pipelining
10. Datapath: Pipelining Implementation
11. I/O
12. Memories
13. Caches
14. Virtual Memory 1: Address Translation
15. Virtual Memory 2: TLBs and Caches
16. Review
1. Introduction to Processors
• Basic computer operation:
  1. Load the instruction
  2. Figure out what operation to do (control)
  3. Figure out what data to use (data)
  4. Do the computation
  5. Figure out what instruction to load next
[Figure: basic processor blocks – Compute (add, sub, mul, etc.), Data (load/store), Control (if, else, loop), Instructions (loaded from memory), and Memory (big and slow); control tracks the current instruction, what to do, and which instruction to load next.]

Example program (instruction memory addresses 0–5):

0: load r0, mem[7]
1: r1 = r0 - 2
2: j_zero r1, 5 (done)
3: r0 = r0 + 1
4: jump 1 (loop)
1. Binary Numbers
• Binary numbers
  – Adders have 3 inputs and 2 outputs
  – Overflow limits the maximum we can represent
• Two's complement
  – Allows us to handle negative numbers
  – Basic idea: the biggest digit is negative
    • 1000₂ = -1·2³ + 0·2² + 0·2¹ + 0·2⁰ = -8₁₀
    • Numbers range from -2^(n-1) to 2^(n-1) - 1
  – Subtraction (invert and add 1)
  – Addition (same as unsigned)
  – Comparisons (a bit tricky)
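As a quick check, the two's-complement rules above can be sketched in a few lines (Python here, purely for illustration, using 4-bit values):

```python
# Sketch of the two's-complement facts above for 4-bit numbers:
# negate = invert and add 1; the top bit carries a negative weight.
N = 4

def negate(x):
    """Two's-complement negation: invert the bits, then add 1 (mod 2^N)."""
    return ((~x) + 1) & ((1 << N) - 1)

def to_signed(x):
    """Interpret an N-bit pattern with the biggest digit negative."""
    return x - (1 << N) if x & (1 << (N - 1)) else x

print(to_signed(0b1000))          # -8, as on the slide
print(bin(negate(0b0001)))        # 0b1111, the 4-bit pattern for -1
print(to_signed(negate(0b0001)))  # -1
```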
[Figure: 1-bit full adder (inputs A, B; outputs S, Cout) with worked single-bit addition examples, including the carry-out cases.]
2. ISA: Instruction Formats and Execution
• Register machines (load-store, number of registers)
• Memory organization (words/bytes)
• Program counter and next instruction
for (j = 1; j < 10; j++){ a = a + b
}
[Figure: from source to bits – the C loop above is turned by the Compiler into assembly (e.g., ADD R1, R2, R3; SUB R3, R2, R1), which the Assembler turns into binary (0010100101 0101010101); the hardware side shows the ALU, control logic, register file, program counter, instruction register, and memory address/data registers.]
2. ISA: Instruction Formats and Execution
• MIPS instruction formats (R, I, J)
• Immediate sizes and how they are used
  – Sign-extended
  – How to load 32-bit constants
• Control
  – Jumps (unconditional; how far can they jump?)
  – Branches (conditional; how far?)
• RISC vs. CISC
Name       | Fields                              | Comments
Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits | All MIPS instructions are 32 bits
R-format   | op | rs | rt | rd | shamt | funct   | Arithmetic instruction format
I-format   | op | rs | rt | address/immediate    | Transfer, branch, immediate format
J-format   | op | target address                 | Jump instruction format
3. ISA: Addressing Modes

[Figure: base addressing (I-format) – op | rs | rt | address; the 16-bit address field is added to a base register to form the address of a word in memory.]

Example: lw R1, 100(R2) encodes as op=35, rs=2, rt=1, address=100.

We add the 16-bit (signed) offset to the base address, so we can access about ±32,000 (±2^15 = ±32,768) bytes from the base register.

Don't forget the R- and J- addressing modes!
3. ISA: Procedure Calls

Caller:
    add  $a0, $t0, 2        ; set up the arguments
    add  $a1, $s0, $zero
    add  $a2, $s1, $t0
    add  $a3, $t0, 3
    addi $sp, $sp, -4       ; adjust the stack to make room for one item
    sw   $t0, 0($sp)        ; save $t0 in case the callee uses it
    jal  leaf_example       ; call the leaf_example procedure
    lw   $t0, 0($sp)        ; restore $t0 from the stack
    addi $sp, $sp, 4        ; adjust the stack to delete one item
    add  $t2, $v0, $zero    ; move the result into $t2

Callee:
leaf_example:               ; calculates f = (g+h) - (i+j)
                            ; g, h, i, and j are in $a0, $a1, $a2, $a3
    addi $sp, $sp, -4       ; adjust the stack to make room for one item
    sw   $s0, 0($sp)        ; save $s0 for the caller
    add  $t0, $a0, $a1      ; g + h
    add  $t1, $a2, $a3      ; i + j
    sub  $s0, $t0, $t1      ; f = (g+h) - (i+j)
    add  $v0, $s0, $zero    ; return f in the result register $v0
    lw   $s0, 0($sp)        ; restore $s0 for the caller
    addi $sp, $sp, 4        ; adjust the stack to delete one item
    jr   $ra                ; jump back to the calling routine

Notes: the caller uses $t0, $s0, $s1; $t0 is not preserved across calls, so the caller has to save it (what about $s0, $s1?). After the call, the caller restores $t0. The result is in $v0 (why did we not save $t2?). The callee finds its arguments in $a0–$a3, uses $s0 so it must save and restore it (what about $s1?), and places its result in $v0.
Nested Calls
• MIPS stacks are implemented via software convention
• What is stored on the stack?

[Figure: stacking of subroutine calls, returns, and environments – main() (A) calls B, B calls C; the stack grows A → A,B → A,B,C and shrinks back on each RET.]
Summary: ISA
• Architecture = what's visible to the program about the machine
  – Not everything in the deep implementation is "visible"
  – The name for this invisible stuff is microarchitecture or "implementation" (and it's really messy… but fun)
• A big piece of the ISA = assembly language structure
  – Primitive instructions, executed sequentially and atomically
  – Issues are formats, computations, addressing modes, etc.
  – Two broad flavors:
    • CISC: lots of complicated instructions
    • RISC: a few, essential instructions
  – Basically all recent machines are RISC, but the dominant machine of today, Intel x86, is still CISC (though they do RISC tricks in the guts…)
• We did one example in some detail: MIPS (from P&H Chapter 3)
  – A RISC machine; its virtue is that it is pretty simple
  – You can "get" the assembly language without too much memorization
Binary Addition          Binary Multiplication
0 + 0 = 0                0 × 0 = 0
0 + 1 = 1                0 × 1 = 0
1 + 0 = 1                1 × 0 = 0
1 + 1 = 10               1 × 1 = 1

Addition of two POSITIVE integers, A = (10111010)₂ = (186)₁₀ and B = (110111)₂ = (55)₁₀:

    11111     (carry)
   10111010
 +   110111
 ----------
   11110001 = (241)₁₀
4. Arithmetic
• Overflow
• Serial multiplication

4. Integer Numbers
• Signed magnitude
  – 1 bit for sign
  – Has two zeros
  – Operations are a pain
• Two's complement
  – To negate: invert and add 1 (easy to do with Cin)
  – The complement of 0001 is 2⁴ − 0001 = 10000 − 0001 = 1111
  – Overflow is different
• Non-integers
  – Fixed point: 0010.1100
  – Floating point (mantissa, exponent, and sign)
    • Multiplication is easier than addition
5. Digital Logic – Basic Gates

A B | AND  OR  XOR  XNOR  NAND  NOR
0 0 |  0    0   0    1     1     1
0 1 |  0    1   1    0     1     0
1 0 |  0    1   1    0     1     0
1 1 |  1    1   0    1     0     0
5. Karnaugh Maps
• Order variables such that only 1 changes in each row/column (Gray coding)
• Groups may overlap

CD\AB  00  01  11  10
  00    0   1   1   0
  01    0   1   1   0
  11    0   1   1   0
  10    0   1   0   0

Groups: B·D, !A·B, !C·B, which factor into B·(D + !A + !C)
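As a sanity check, the grouping above can be verified against the K-map's truth table with a short sketch (Python, illustrative only):

```python
# Sketch: verify the K-map simplification B.(D + !A + !C) against the
# truth table from the slide's 4-variable K-map (rows CD, columns AB,
# both in Gray-code order 00, 01, 11, 10).
rows = ["00", "01", "11", "10"]   # CD values (Gray order)
cols = ["00", "01", "11", "10"]   # AB values (Gray order)
grid = [[0, 1, 1, 0],             # CD = 00
        [0, 1, 1, 0],             # CD = 01
        [0, 1, 1, 0],             # CD = 11
        [0, 1, 0, 0]]             # CD = 10

def simplified(a, b, c, d):
    """B.(D + !A + !C) -- the factored cover from the slide."""
    return b and (d or (not a) or (not c))

for r, cd in enumerate(rows):
    for k, ab in enumerate(cols):
        a, b = int(ab[0]), int(ab[1])
        c, d = int(cd[0]), int(cd[1])
        assert int(simplified(a, b, c, d)) == grid[r][k]
print("simplification matches the K-map")
```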
5. Logic Blocks — Memory

[Figure: SRAM read path – a 4-bit binary address bus splits into bits 0–2, which feed a decoder producing 1-hot row enables (e.g., 100 in binary → 00010000 one-hot), and bit 3, which drives a MUX selecting onto the 8-bit data output from the array of SRAM cells (each box stores 1 bit). Example: read address 12 = 1100.]
5. State Storage — Building a Counter
[Figure: 3-bit counter – three chained full adders (A, B, Cin → S, Cout) add 1 to current_value to produce next_value; a clocked 3-bit latch makes next_value the new current_value.]
• When the clock edge rises (0→1), the next_value is stored as the current_value
• The new current_value then goes through the adder to make a new next_value
• Everything takes some time
Q: What limits the speed of the clock?
5. Finite State Machines

[Figure: launch-code FSM – states IDLE, WAIT_CODE, CODE_OK; Button A and Button B drive the transitions; LAUNCH is true only in CODE_OK.]

IDLE: LAUNCH = false. If (Button A) next state is WAIT_CODE, else next state is IDLE.
WAIT_CODE: LAUNCH = false. If (Button B) next state is CODE_OK, else next state is IDLE.
CODE_OK: LAUNCH = true. Next state is IDLE.
[Figure: FSM hardware – next-state logic (from the truth table) combines the inputs with the current state (memory) to produce the outputs and the next state; the clock loads the next state into the current-state register.]
5. Logic: Summary
• Combinational logic
  – Inputs "immediately" produce outputs
  – Use truth tables to determine logic equations
  – Common functions such as MUX, DEMUX, decode, encode
• Sequential logic
  – The current state is updated to the next state by the clock
  – Store the current state in a latch/memory/flip-flop
  – Combinational logic determines the next state, and the current state is updated on the clock
• What do you need for the labs?
  – Combinational logic for the ALU and adder
  – Combinational logic to decode instructions and connect up the ALU
• What do you need for life?
  – Understand that state is stored in memories
  – …that state is updated by combinational logic
  – …that clock speed depends on how long it takes to calculate the next state
6. Performance
• Comparing machines using microarchitecture
  – Latency (instruction execution time from start to finish)
  – Throughput (number of instructions per unit of time)
  – Processor clock rate (GHz) and cycle time
  – CPI – cycles per instruction
  – MIPS – millions of instructions per second
  – FLOPS – floating-point operations per second (also GFLOPS/TFLOPS/PFLOPS/EFLOPS)
• Comparing machines using benchmark programs
  – Which programs?
  – Benchmark suites / microbenchmarks
  – Different means: arithmetic, harmonic, and geometric
6. Metrics
• CPI (Cycles Per Instruction), for a 100-instruction program:
  – 25 instructions are loads/stores (each takes 2 cycles)
  – 50 instructions are adds (each takes 1 cycle)
  – 25 instructions are square roots (each takes 100 cycles)
  – CPI = ((25 × 2) + (50 × 1) + (25 × 100)) / 100 = 2600 / 100 = 26.0
• MIPS (Millions of Instructions Per Second):
  – Machine A has a special instruction for performing square root calculations; it takes 100 cycles to execute
  – Machine B doesn't have the special instruction, and must perform square root calculations in software using simple instructions (e.g., add, mult, shift) that each take 1 cycle to execute
  – Machine A: 1/100 MIPS = 0.01 MIPS; Machine B: 1 MIPS (at a 1 MHz clock)
  – The 100-cycle square root hurts the average "instructions per second" but may improve performance dramatically!
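The CPI calculation above is just a weighted average; here it is as a small sketch (Python, mirroring the slide's numbers):

```python
# Sketch of the slide's CPI calculation: a weighted average of the
# cycle counts over the instruction mix of a 100-instruction program.
mix = [
    (25, 2),    # 25 loads/stores, 2 cycles each
    (50, 1),    # 50 adds, 1 cycle each
    (25, 100),  # 25 square roots, 100 cycles each
]
total_cycles = sum(count * cycles for count, cycles in mix)
total_instructions = sum(count for count, _ in mix)
cpi = total_cycles / total_instructions
print(cpi)  # 26.0, as on the slide
```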
6. Comparisons
• Averages (arithmetic, harmonic, geometric)
• Normalizing (weights, runtimes)
• What you really care about is time… …for your application (benchmarks)
• Amdahl's Law

[Figure: a 100 s program split into a 10 s part and a 90 s part – a 10× speedup on the 10 s part gives 1 s + 90 s = 91 s total, far from 10× overall.]
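Amdahl's Law for the figure's numbers can be sketched as follows (Python, illustrative):

```python
# A minimal Amdahl's Law sketch matching the slide's numbers: speeding
# up a 10 s portion of a 100 s program by 10x only gets us to 91 s.
def amdahl_time(total, accel_part, speedup):
    """New runtime when accel_part seconds of total are sped up."""
    return (total - accel_part) + accel_part / speedup

new_time = amdahl_time(total=100.0, accel_part=10.0, speedup=10.0)
print(new_time)           # 91.0 seconds
print(100.0 / new_time)   # overall speedup: only about 1.1x
```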
6. Performance Summary
• Performance is important to measure
  – For architects comparing different deep mechanisms
  – For developers of software trying to optimize code and applications
  – For users trying to decide which machine to use, or to buy
• Performance metrics are subtle
  – It's easy to mess up the "machine A is X times faster than machine B" numerical comparison
  – You need to know exactly what you are measuring: time, rate, throughput, CPI, cycles, etc.
  – You need to know how combining these into aggregate numbers distorts the individual numbers in different ways (P&H is good on this in Chapter 2)
  – No metric is perfect, hence the heavy emphasis on standard benchmarks today
7. Datapath: Single-Cycle
• Control path – what to do
  – Logic for decoding signals (ALU, register file, muxes)
• Data path – process the data
  – What resources do we need? (ALU, register file, memory, PC)
7. Complete Single-Cycle Datapath

[Figure: single-cycle datapath – the PC feeds the instruction memory (RAM); the instruction drives the register file (Read Reg 1/2, Write Reg, Write Data) and a 16→32-bit sign-extended immediate; a MUX selects the second ALU operand; the ALU's Zero output, a PC+4 adder, and a shift-left-2 branch-target adder feed the next-PC MUX; the data memory (RAM) handles loads and stores.]
7. Cost of the Single-Cycle Architecture

[Figure: three instruction classes of different lengths all forced into one cycle time, set by the longest instruction – most of the time is wasted.]

Why is one instruction longer? It has to do more operations (e.g., conditional branch vs. jump).
8. Multi-Cycle Solution

[Figure: with a short clock period, instruction class 1 takes 4 cycles and class 3 takes 2 cycles – less wasted time.]

Idea: let the FASTEST instruction determine the clock period.
8. Multi-Cycle Memory-Reference Datapath

[Figure: multicycle datapath – one memory for both instructions and data (selected by IorD), an instruction register and memory data register (MDR), the register file, and a single shared ALU with A, B, and ALUOut registers; sign-extend and shift-left-2 form branch targets, and PC[31:28] concatenated with the shifted 26-bit target forms the 32-bit jump address.]

[Figure: control FSM fragment – from state 1, the memory-address computation (state 2: ALUSelA=1, ALUSelB=10, ALUOp=00) leads either to a memory read (state 3: MemRead, IorD=1) followed by the write-back step (state 4: RegWrite, MemtoReg=1, RegDst=0), or to a memory write (state 5: MemWrite, IorD=1); all paths return to state 0. Control outputs include IorD, MemRead, MemWrite, IRWrite, RegDst, RegWrite, ALUSelA, ALUSelB, MemToReg, and ALUOp; the ALU control also decodes Instruction[5:0].]
8. Performance of the Multicycle Implementation
• Each type of instruction can take a variable number of cycles
• Example – assume the following instruction distribution:
  – loads: 5 cycles, 22%
  – stores: 4 cycles, 11%
  – R-type: 4 cycles, 49%
  – branches: 3 cycles, 16%
  – jumps: 3 cycles, 2%
• What's the average Cycles Per Instruction (CPI)?
  CPI = CPU clock cycles / instruction count
  CPI = (5 × 0.22) + (4 × 0.11) + (4 × 0.49) + (3 × 0.16) + (3 × 0.02)
  CPI = 4.04 cycles per instruction
• What was the CPI for the single-cycle machine?
  – Single cycle implies 1 clock cycle per instruction → CPI = 1.0
  – So isn't the single-cycle machine about 4 times faster?
8. Performance of the Multicycle Implementation
• The correct answer must consider the clock cycle time as well:
  – For the single-cycle implementation, the cycle time is set by the worst-case delay: Tcycle = 40 ns (for load instructions)
  – For the multicycle implementation, the cycle time is set by the worst-case delay over all execution steps: Tcycle = 10 ns (for each of the steps 1, 2, 3, or 4)
• The execution time per instruction is:
  – CPI × Tcycle = 1 × 40 ns = 40 ns per instruction for the single-cycle machine
  – CPI × Tcycle = 4.04 × 10 ns = 40.4 ns per instruction for the multicycle machine
  – Thus, the single-cycle machine is only 1% faster
• When considering other types of units (e.g., FP), the single-cycle implementation can be very inefficient
  – Think about how long it takes to do divide or square root!
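The single-cycle vs. multicycle comparison can be reproduced with a short sketch (Python, using the slide's CPI mix and cycle times):

```python
# Sketch comparing the slide's single-cycle vs. multicycle machines:
# execution time per instruction = CPI x cycle time.
mix = {"load": (5, 0.22), "store": (4, 0.11), "rtype": (4, 0.49),
       "branch": (3, 0.16), "jump": (3, 0.02)}
multicycle_cpi = sum(cycles * frac for cycles, frac in mix.values())

single_cycle_time = 1.0 * 40.0           # CPI 1.0, 40 ns worst-case cycle
multicycle_time = multicycle_cpi * 10.0  # CPI 4.04, 10 ns per step

print(round(multicycle_cpi, 2))   # 4.04
print(single_cycle_time)          # 40.0 (ns per instruction)
print(round(multicycle_time, 1))  # 40.4 (ns per instruction)
```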
8. Summary
• Single-cycle implementations have to consider the worst-case delay through the datapath to come up with the cycle time.
• Multicycle implementations have the advantage of using a different number of cycles for executing each instruction.
• In general, the multicycle machine is better than the single-cycle machine, but the actual execution time strongly depends on the workload.
• The most widely used machine implementation is neither single-cycle nor multicycle: it's the pipelined implementation. (Next lecture)
9. Pipelined Datapath

[Figure: a single-cycle computation of 100 ns is "too long"; the pipelined version has 5 pipe stages of ~20 ns each. Latches, called 'pipeline registers', break the computation into stages and store the intermediate results.]
9. Pipelining: Implementation Issues
• What prevents us from just doing a zillion pipe stages?
  – Those latches are NOT free: they take up area, and there is a real delay to go THROUGH the latch itself
  – In modern, deep pipelines (10–20 stages), this is a real effect
  – Typically you see logic "depths" in one pipe stage of 10–20 gates

[Figure: a 5-stage pipe with ~2 ns stages vs. a 10-stage pipe with ~0.2 ns stages – at these speeds, and with this few levels of logic, latch delay is important.]
• Unpipelined vs. pipelined
• Ideally, Speedup_pipeline = Time_sequential / Time_pipelined = pipeline depth

8. Performance of Pipelined Systems

[Figure: instructions vs. time – unpipelined: latency 5 cycles, throughput 1 instruction per 5 cycles; pipelined (5 stages): latency still 5 cycles, throughput 1 instruction per cycle.]

Ideal speedup only if we can keep the pipeline full!
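The ideal-speedup claim can be illustrated with a small timing sketch (Python; this assumes no hazards, i.e., the pipeline always stays full):

```python
# Sketch of ideal pipeline timing: n instructions through a d-stage
# pipeline finish in d + (n - 1) cycles, vs. n * d cycles unpipelined.
def unpipelined_cycles(n, depth):
    return n * depth

def pipelined_cycles(n, depth):
    # the first instruction fills the pipe (depth cycles), then one
    # instruction completes per cycle -- if we keep the pipe full
    return depth + (n - 1)

n, depth = 1000, 5
speedup = unpipelined_cycles(n, depth) / pipelined_cycles(n, depth)
print(round(speedup, 2))  # approaches the pipeline depth (5) for large n
```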
8. Complete 5-Stage Pipeline

[Figure: the single-cycle datapath cut by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB – PC and instruction memory (IF); register-file read and sign-extend (ID); ALU and branch-target adder (EX); data memory (MEM); write-back MUX (WB).]

In cycle 4 we have 3 instructions "in flight": instruction 1 is accessing the memory (DM), instruction 2 is using the ALU (EX), and instruction 3 is accessing the register file (ID).
Flow of Instructions Through the Pipeline

[Figure: program execution over clock cycles 1–7 – three loads (LW R1, 100(R0); LW R2, 200(R0); LW R3, 300(R0)) each flow through IM → Reg → ALU → DM → Reg, each starting one cycle after the previous one.]
10. Data Hazards
• In this particular case, the R10 value is not computed or returned to the register file by the time a later instruction wants to use it as an input.

Double pumping the register file doesn't help here; the later instruction needs R10 two clock cycles before it's been computed and stored back. Oops…

[Figure: two overlapped instructions (Iget, Rget, ALU op, Mput, Rput) – the first writes R10 in its Rput stage only after the second has already read R10 in its Rget stage.]
10. Forwarding

[Figure: the 5-stage pipelined datapath with a forwarding bus from WB back to the ALU inputs, bypassing the register file.]
10. Hazards
• Data hazards
  – An instruction depends on the result of a prior computation which is not ready yet
  – Fix by stalling, double pumping, and forwarding
• Structural hazards
  – The hardware cannot support a combination of instructions
  – OK, maybe add extra hardware resources; we may still have to stall
• Control hazards
  – Pipelining of branches and other instructions which change the PC
  – Branch predictors, branch delay slots, early branch computation
10. Pipelining Summary
• Need to keep the pipeline full for performance
  – Hazards make this hard: dependencies and resource conflicts
  – Time to access memory! (caches)
• Performance
  – Can't use infinitely many stages
  – Code and pipeline stages (branch delay slot)
• Exceptions/Interrupts
  – Can happen at different places in the pipeline
  – Can happen out of order (early in the pipeline for a later instruction vs. later in the pipeline for an earlier one)
  – Need to restart instructions
  – Jump to the OS to handle them
11. Input/Output
• Busses
  – Clocking, width, arbitration
• Performance
  – Latency
  – Throughput
• Talking to devices
  – Method: memory-mapped / I/O instructions
  – Means: polling, interrupts, DMA
12. Memory
• Random access memories
  – DRAM: Dynamic Random Access Memory
    • High density, low power, cheap, slow
    • Dynamic: needs to be "refreshed" regularly
  – SRAM: Static Random Access Memory
    • Low density, high power, expensive, fast
    • Static: content will last "forever" (until power is lost)
• What gets used where?
  – Main memory is DRAM: you need it big, so you need it cheap
  – CPU cache memory is SRAM: you need it fast, so it's more expensive, and therefore smaller than you would usually want due to resource limitations
• Relative performance
  – Size: DRAM is 4–8× denser than SRAM
  – Cost/cycle time: SRAM is 8–16× faster, and more expensive, than DRAM
12. Performance Impact (4-cycle memory)

Program: lw r2,0x20 / lw r3,0x30 / add r1,r2,r3 / sw r1,0x40

[Figure: pipeline diagram over cycles 1–18 (F D A M W stages, S = stall) – each 4-cycle memory access adds three stall cycles, delaying the dependent add and sw.]

• With a 1-cycle memory system, the program took 8 cycles
  – CPI = 8 cycles / 4 instructions = 2.0
  – With lots more instructions, CPI would approach 1.0
• With a 4-cycle memory system, the program takes 18 cycles
  – CPI = 18 cycles / 4 instructions = 4.5
  – This doesn't include the instruction-fetch penalty found in a real memory system

Remember the bit about "if you can keep the pipeline full"?
12. Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology
  – Provide access at the speed offered by the fastest technology

[Figure: processor (control, datapath, registers, local cache) → shared cache (SRAM) → main memory (DRAM) → secondary storage (disk); speeds run from ~1 ns (registers) through 5–10 ns (cache) and 10–100 ns (DRAM) to tens of ms (disk); sizes from KBs (cache) through MBs and GBs to TBs (disk).]
12. Memory Hierarchy: How Does it Work?
• Temporal locality (locality in time): if an item is referenced, the same item will tend to be referenced again soon ⇒ keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space): if an item is referenced, nearby items will tend to be referenced soon ⇒ move recently accessed "blocks" (groups of contiguous words) closer to the processor
• A "block" (or "line") is the minimum unit of data moved between two levels (e.g., between the L1 and L2 caches)
• Move data in blocks rather than bytes (more efficient if blocks are larger); a typical block is 64 bytes
13. Basic Cache Design
• A cache only holds a portion of a program
  – Which part of the program does the cache contain?
• The cache holds the most recently accessed references
• The cache is divided into units called cache blocks (also known as cache "lines"); each block holds a contiguous set of memory addresses
• How does the CPU know which part of the program the cache is holding?
  – Each cache block has extra bits, called the cache tag, which hold the main-memory address of the data in the block

[Figure: CPU ↔ 2-block cache (tag + data per block, block size in bytes) ↔ DRAM memory spanning 0x00000000–0xFFFFFFFC.]
13. The ABC's (or 1-2-3-4's) of Caches
• Caching is a general concept used in processors, operating systems, file systems, and applications.
• Wherever it is used, four basic questions arise:
  – Q1: Where can a block be placed in a cache? (direct-mapped, set-associative, fully-associative)
  – Q2: How is a block found if it is in a cache? (indexing for direct-mapped, limited search for set-associative, full search for fully-associative)
  – Q3: Which block should be replaced on a miss? (random, least-recently used (LRU))
  – Q4: What happens on a write? (write-through or write-back)
13. Cache Block Placement

[Figure: block 12 of a 32-block memory placed into an 8-block cache three ways – fully-associative: block 12 can go anywhere (complex); direct-mapped: block 12 can go only into block 4 (12 mod 8) (inflexible); 2-way set-associative: block 12 can go anywhere in set 0 (12 mod 4) (the compromise).]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache

[Figure: memory addresses 0–15 (0000–1111) map to cache lines 0–3 by address mod 4.]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache

[Figure: each cache line now carries a tag – addresses 0, 4, 8, 12 map to line 0; 1, 5, 9, 13 to line 1; 2, 6, 10, 14 to line 2; and so on. Memory addresses that map to the same cache line conflict in the cache – we can only store one or the other.]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache, 4 bytes/line

[Figure: each cache line holds a tag and bytes 0–3. We need to ignore the last two address bits because they choose the byte within the cache line.]
Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache, 4 bytes/line

[Figure: writing addresses as xxxYYZZ – YY is the line index (xxx00zz → line 0 through xxx11zz → line 3) and ZZ is the byte within the line; we need the tag (xxx) to tell us which memory address xxxYYZZ is in each line.]
13. Direct-Mapped Cache Indexing
• Direct-mapped caches have a 1:1 mapping from memory addresses to cache entries
  – E.g., a 4-entry cache with 4 bytes per line:
    • Entry 0: xxx00xx
    • Entry 1: xxx01xx
    • Entry 2: xxx10xx
    • Entry 3: xxx11xx
  – So addresses 0100000 and 1100001 map to the same cache line.
• To tell which is in the cache we look at the tag:
  – If the tag is 010 we have the first memory address
  – If the tag is 110 we have the second memory address
• To get individual bytes, we look at the byte offset:
  – 0100000 is the first byte of the line in cache entry 00
  – 0100001 is the second byte of the line in cache entry 00
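The tag/index/offset split described above can be sketched in a few lines (Python, for the slide's 4-entry, 4-byte-line cache):

```python
# Sketch of direct-mapped cache indexing for the slide's example:
# a 4-entry cache with 4-byte lines means 2 offset bits, 2 index bits,
# and the remaining high bits form the tag.
OFFSET_BITS = 2   # 4 bytes per line
INDEX_BITS = 2    # 4 entries

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# the slide's two conflicting addresses, 0100000 and 1100001:
print(split_address(0b0100000))  # tag 0b010, index 0, offset 0
print(split_address(0b1100001))  # tag 0b110, index 0, offset 1 -- same line!
```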
13. Set-Associative Cache Indexing
• Set-associative caches have a 1:1 mapping from memory addresses to sets
• Within a set, a memory address can be put anywhere (multiple entries per set)
  – E.g., a 2-way set-associative cache with 4 bytes per line:
    • Set 0: xxxx0xx
    • Set 1: xxxx1xx
    • …
  – So addresses 1010000 and 0110001 map to the same set, but the set can have multiple entries.
  – If the set has 2 entries, we can store both values in the cache, even though they map to the same set.
Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line

[Figure: lines 0 and 1 form set 0 (addresses xxxx0zz) and lines 2 and 3 form set 1 (addresses xxxx1zz); each line has its own tag (xxxx), which tells us which memory address xxxxYZZ is in each line.]
Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line

[Figure: the same cache against the 16-entry memory space – each memory address maps to exactly one set, but now we can hold any two lines of set 0 and any two lines of set 1 in the cache at the same time.]
Q1: Block Placement

[Figure: direct-mapped – block 12 of a 32-block memory can go only into cache block 4 (12 mod 8).]

Each memory location maps to one cache line. The mapping is done by a modulo operation, which is accomplished by ignoring some of the MSBs (e.g., addresses 001100 = 12 and 000100 = 4 both get mapped to 100 = 4).
Q1: Block Placement

[Figure: fully-associative – block 12 can go anywhere in the 8-block cache.]

Any memory location can go to any cache line. This is infinitely flexible, but it requires very complex (and slow) hardware.
Q1: Block Placement

[Figure: 2-way set-associative – block 12 can go anywhere in set 0 (12 mod 4).]

Every memory location maps to exactly one set of cache lines; within a set it can go in any location. A tradeoff between simplicity (direct-mapped to sets) and flexibility (fully-associative within sets).
13. Cache Performance
• What's the impact on performance (CPU time) when the following cache behavior is included?
  – 50-cycle miss penalty
  – All instructions normally take 2.0 cycles (excluding memory stalls)
  – Miss rate is 2.0%
  – Average of 1.33 memory references per instruction

  CPU time = IC × (CPI_execution + memory stall cycles per instruction) × clock cycle time
           = IC × (2.0 + 0.02 × 1.33 × 50) × clock cycle time
           = IC × 3.33 × clock cycle time

• Two important results to keep in mind:
  – The lower the CPI_execution, the higher the relative impact of a cache miss penalty (more sensitive to memory latency!)
  – Comparing two machines with identical memory systems, the machine with the higher clock rate will have the larger number of clock cycles per miss, and hence the memory portion of its CPI will be higher.
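The effective-CPI formula above as a one-function sketch (Python, with the slide's parameters):

```python
# Sketch of the slide's cache-performance calculation: effective CPI
# = base CPI + (memory refs per instruction x miss rate x miss penalty).
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    return base_cpi + refs_per_instr * miss_rate * miss_penalty

cpi = effective_cpi(base_cpi=2.0, refs_per_instr=1.33,
                    miss_rate=0.02, miss_penalty=50)
print(round(cpi, 2))  # 3.33, as on the slide
```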
14. Virtual Memory
• What is virtual memory?
  – A technique that allows execution of a program that
    • can reside in non-contiguous memory locations
    • does not have to completely reside in memory
  – It allows the computer to "fake" a program into believing that its
    • memory is contiguous
    • memory space is larger than physical memory
• Why is VM important?
  – Cheap: you no longer have to buy lots of RAM
  – It removes the burden of memory resource management from the programmer
  – It enables multiprogramming, time-sharing, and protection
14. Basic VM Algorithm

[Figure: VA → PA – a program's virtual addresses 0x00–0x10 (add r1,r2,r3; sub r2,r3,r4; lw r2,0x04; mult r3,r4,r5) map onto physical RAM at 0x00–0x0C, with some pages out on disk.]

• The program uses virtual addresses (load, store, instruction fetch)
• The computer translates each virtual address (VA) to a physical address (PA)
• The computer reads RAM using the PA, returning the data to the program
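A minimal sketch of the VA → PA translation just described (Python; the page size and the page-table contents here are hypothetical, not taken from the slide):

```python
# Hypothetical page-table sketch of VA -> PA translation: split the
# virtual address into a virtual page number (VPN) and an offset,
# look up the physical page number (PPN), and recombine.
PAGE_SIZE = 4096  # assumed 4 KB pages

# hypothetical page table: VPN -> PPN
page_table = {0: 5, 1: 2, 2: 7}

def translate(va):
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        # in a real system the OS would fetch the page from disk
        raise LookupError("page fault")
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1004)))  # VPN 1 -> PPN 2, offset 4 -> 0x2004
```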
14. Page Tables and Entries
• Page tables
  – Size of entries / size of the table
  – Multi-level page tables (why?)
  – How to look up entries
• TLB
  – Thrashing of pages (LRU example)
• Separate page tables (per process) provide page-level protection
• The OS creates and manages page tables so that no user-level process can alter any process's page table
  – Page tables are mapped into kernel memory where only the OS can read or write
Page Table Protection

[Figure: two software page tables managed by the OS – Process 1 maps virtual pages 0x00000–0x00004 to physical pages 0x001, 0x005, 0x00A, 0x004, 0x008; Process 2 maps the same virtual pages to 0x002, 0x006, 0x00B, 0x003, 0x009 (each entry has a valid bit). The processes' pages interleave in physical memory 0x00–0x0B without overlapping, and a hardware TLB caches the translations.]
14. Making Address Translation Fast
• A cache for address translations: the translation lookaside buffer (TLB)

[Figure: the TLB holds valid bits, tags (virtual page numbers), and physical page addresses; on a TLB miss, the page table supplies the physical page or disk address. A two-level organization uses a level-1 table (page directory) pointing to level-2 page tables.]
15. Multi-Level Page Tables to Save Space

[Figure: the virtual address splits into VPN 1 (index into the level-1 table, the page directory), VPN 2 (index into the level-2 page table), and the page offset; the resulting PPN plus the offset forms the physical address.]

Address Translation / Cache Lookup

[Figure: the VA's VPN goes through the TLB to produce the PPN; the PPN plus page offset (PO) is split into cache TAG, index (IDX), and block offset (BO); the tag compare (=?) yields hit/miss and the data.]
Overlapped Cache & TLB Access
• Simple for small caches: IDX + BO bits ≤ PO bits
  – Must satisfy cache size / associativity ≤ page size
• Assume 4 KB pages and 2-way set-associative caches
  – What is the max cache size allowed for parallel address translation to work?

[Figure: the index and block offset fall entirely within the page offset, so the cache lookup proceeds in parallel with the TLB translation; the tag compare uses the translated PPN.]
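The constraint above can be expressed directly (Python sketch): with 4 KB pages and 2-way associativity, the largest cache that still allows parallel translation is 8 KB.

```python
# Sketch of the overlapped-access constraint: the index + block-offset
# bits must fit in the page offset, i.e.
#   cache_size / associativity <= page_size
# so the largest such cache is page_size * associativity.
def max_parallel_cache_size(page_size, associativity):
    return page_size * associativity

print(max_parallel_cache_size(page_size=4096, associativity=2))  # 8192 bytes
```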
Virtually-Addressed Cache
• Look up the cache using the VA
• Access the TLB only on a cache miss
• Use the PA to access the next level (L2)

[Figure: the VA's TAG/IDX/BO index the cache directly; the TLB translates to PPN + PO only when the cache misses.]
Or, Only Use Virtual Bits to Index the Cache
• No need to wait for the TLB: the TLB access runs in parallel (e.g., for larger caches)
• A physically-tagged but virtually-indexed cache
• Can distinguish addresses from different processes
• But what if multiple processes share memory?

[Figure: the IDX and BO come from the virtual address while the TLB translates the VPN to the PPN; the tag compare uses the PPN.]
Summary
• Memory access is hard and complicated!
  – The speed of the CPU core demands very fast memory access. We use a cache hierarchy to solve this: it gives the illusion of speed most of the time, but is occasionally slow.
  – The size of programs demands large RAM. We use the VM hierarchy to solve this: it gives the illusion of size most of the time, but is occasionally slow.
• VM hierarchy
  – Another form of cache, but now between RAM and disk
  – The atomic units of memory are pages, typically 4 KB to 2 MB
  – The page table serves as the translation mechanism from virtual to physical addresses
    • The page table lives in physical memory, managed by the OS
    • For 64-bit addresses, multi-level tables are used, and some of the table is itself in VM
  – The TLB is yet another cache: it caches translated addresses (page table entries)
    • Saves having to go to physical memory to do a lookup on each access
    • Usually very small, managed by the OS
  – VM, the TLB, and caches have "interesting" interactions
    • Big impacts on speed and pipelining, and on exactly where the virtual-to-physical mapping takes place
And Now For Something Completely Different…
• Course evaluations! (15 minutes)
• Followed by… what to expect on the exam (without ruining too much of the surprise)

Course Evaluations
Final Exam
• Format
  – 3 short-answer questions
  – 6 true/false (-1/0/1 points each)
  – 4 or 5 longer questions
  – Hopefully no more than 2.5 hours to finish
  – (This is a bit harder than the exam from last year)
  – You are allowed one double-sided, hand-written, A4 sheet of notes during the exam, and a calculator
• Likely topics:
  – Caches, virtual memory, performance, pipelines, assembly, arithmetic, (simple) logic, input/output, etc.
  – Very likely topics: anything we spent time going through with multiple animations/examples in class
Questions?