


14/12/2011
Computer Architecture 1 - David Black-Schaffer

Introduction to Computer Architecture: Review

Overview
• How to write assembly
• Saving registers
• Two's complement math, floating point
• Combinational logic and state machines
• What does "fastest" mean?
• How to implement the MIPS ISA
• How to split up the instruction execution
• How to pipeline to make it regular
• Hazards: forwarding, branching
• Memory-mapped I/O, DMA, interrupts and polling
• SRAM vs. DRAM, row and column access
• Associativity, tags, replacement policies
• VM for size and protection
• How VM interacts with caches
• What to think about for the final

Course Outline
1. Introduction to Processors and Binary Numbers
2. ISA: Instruction Formats and Execution
3. ISA: Addressing Modes and Procedure Calls
4. Arithmetic and Integer Numbers
5. Digital Logic
6. Performance
7. Datapath: Single-cycle
8. Datapath: Multi-cycle
9. Datapath: Exceptions and Pipelining
10. Datapath: Pipelining Implementation
11. I/O
12. Memories
13. Caches
14. Virtual Memory 1: Address Translation
15. Virtual Memory 2: TLBs and Caches
16. Review

1. Introduction to Processors

• Basic computer operation:
– 1. Load the instruction
– 2. Figure out what operation to do (control)
– 3. Figure out what data to use (data)
– 4. Do the computation
– 5. Figure out what instruction to load next

[Figure: block diagram of basic computer operation. Control (if, else, loop) looks at the current instruction to decide what to do and which instruction comes next; Compute (add, sub, mul, etc.) produces the result; Data (load/store) and Instructions are both fetched from Memory (big and slow).]

[Figure: an example program and its data in memory.
0: load r0, mem[7]
1: r1 = r0 - 2
2: j_zero r1, 5 (done)
3: r0 = r0 + 1
4: jump 1 (loop)
5-6: (empty)
7: data]

1.  Binary  Numbers  

• Binary Numbers
– Adders have 3 inputs and 2 outputs
– Overflow limits the maximum we can represent
• Two's complement
– Allows us to handle negative numbers
• Basic idea: the biggest digit is negative
• 1000 (base 2) = -1*2^3 + 0*2^2 + 0*2^1 + 0*2^0 = -8 (base 10)
• Numbers range from -1*2^(n-1) to 2^(n-1) - 1
– Subtraction (invert and add 1)
– Addition (same as unsigned)
– Comparisons (a bit tricky)
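As a concrete illustration of the "invert and add 1" rule, here is a minimal C sketch (my own example, not from the slides), assuming 8-bit values:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t x = 5;                    /* 00000101 */
        uint8_t neg = (uint8_t)(~x + 1);  /* invert and add 1: 11111011 */
        printf("-5 as 8-bit two's complement: 0x%02X (%d)\n",
               neg, (int8_t)neg);         /* prints 0xFB (-5) */
        /* An 8-bit value ranges from -1*2^7 = -128 to 2^7 - 1 = 127. */
        return 0;
    }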

[Figure: a 1-bit adder block with inputs A and B and outputs S (sum) and Cout (carry out), shown with several example single-bit additions.]
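A 1-bit full adder like the one in the figure can be written directly from its truth table; this is just an illustrative C sketch, not course code:

    #include <stdio.h>

    /* One-bit full adder: sum = A xor B xor Cin; carry out when at least two inputs are 1. */
    void full_adder(int a, int b, int cin, int *s, int *cout) {
        *s    = a ^ b ^ cin;
        *cout = (a & b) | (cin & (a ^ b));
    }

    int main(void) {
        int s, cout;
        full_adder(1, 1, 0, &s, &cout);     /* 1 + 1 = 10 in binary */
        printf("S=%d Cout=%d\n", s, cout);  /* prints S=0 Cout=1 */
        return 0;
    }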

2. ISA: Instruction Formats and Execution

• Register machines (load-store, number of registers)
• Memory organization (words/bytes)
• Program counter and next instruction

for (j = 1; j < 10; j++) {
    a = a + b
}

[Figure: the compiler turns the loop into assembly (ADD R1, R2, R3; SUB R3, R2, R1), and the assembler turns the assembly into an executable binary (0010100101 0101010101 ...). The machine that runs it contains an ALU, control logic, a register file, a program counter, an instruction register, a memory address register, a memory data register, and memory.]

2. ISA: Instruction Formats and Execution

• MIPS instruction formats (R, I, J)
• Immediate sizes and how they are used
– Sign-extended
– How to load 32-bit constants
• Control
– Jumps (unconditional - how far can they jump?)
– Branches (conditional - how far?)
• RISC vs. CISC

Field sizes: 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits - all MIPS instructions are 32 bits
R-format: op | rs | rt | rd | shamt | funct - arithmetic instruction format
I-format: op | rs | rt | address/immediate - transfer, branch, immediate format
J-format: op | target address - jump instruction format
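To make the field sizes concrete, here is a small C sketch (my own example, not from the slides) that packs an R-format instruction such as add $rd, $rs, $rt into 32 bits:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack the six R-format fields (6+5+5+5+5+6 = 32 bits) into one word. */
    uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                      uint32_t rd, uint32_t shamt, uint32_t funct) {
        return (op << 26) | (rs << 21) | (rt << 16) |
               (rd << 11) | (shamt << 6) | funct;
    }

    int main(void) {
        /* add $8, $9, $10: op = 0, funct = 0x20 for add */
        printf("0x%08X\n", encode_r(0, 9, 10, 8, 0, 0x20));  /* 0x012A4020 */
        return 0;
    }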


3. ISA: Addressing Modes

• Base addressing (I-format): op | rs | rt | address
– The 16-bit (signed) offset is added to the base register to form the address of a word in memory.
– We can access +/- 32,768 bytes (2^15) around the base register.
– Example: lw R1, 100(R2) encodes op = 35, rs = 2, rt = 1, address = 100.

Don't forget the R- and J- addressing modes!

3. ISA: Procedure Calls

Caller:
add $a0, $t0, 2      ; set up the arguments
add $a1, $s0, $zero
add $a2, $s1, $t0
add $a3, $t0, 3
addi $sp, $sp, -4    ; adjust the stack to make room for one item
sw $t0, 0($sp)       ; save $t0 in case the callee uses it
jal leaf_example     ; call the leaf_example procedure
lw $t0, 0($sp)       ; restore $t0 from the stack
addi $sp, $sp, 4     ; adjust the stack to delete one item
add $t2, $v0, $zero  ; move the result into $t2

Callee:
leaf_example:        ; calculates f = (g+h) - (i+j)
                     ; g, h, i, and j are in $a0, $a1, $a2, $a3
addi $sp, $sp, -4    ; adjust the stack to make room for one item
sw $s0, 0($sp)       ; save $s0 for the caller
add $t0, $a0, $a1    ; g = $a0, h = $a1
add $t1, $a2, $a3    ; i = $a2, j = $a3
sub $s0, $t0, $t1
add $v0, $s0, $zero  ; return f in the result register $v0
lw $s0, 0($sp)       ; restore $s0 for the caller
addi $sp, $sp, 4     ; adjust the stack to delete one item
jr $ra               ; jump back to the calling routine

• The caller uses $t0, $s0, $s1. $t0 is not preserved by the callee, so the caller has to save it. What about $s0, $s1?
• After the call, the caller restores $t0.
• The result is in $v0. Why did we not save $t2? The callee finds its arguments in $a0-$a3.
• The callee uses $s0, so it must save it. What about $s1?
• Results are placed in $v0.
• The callee must restore $s0.

Nested Calls

•  MIPS stacks are implemented via software convention •  What is stored on the stack?

Stacking of Subroutine Calls & Returns and Environments:
[Figure: A: main() calls B, B calls C, C returns, B returns. The stack of active environments grows from A, to A B, to A B C, and then shrinks back to A B and A as each RET executes.]

Summary: ISA
• Architecture = what's visible to the program about the machine
– Not everything in the deep implementation is "visible"
– The name for this invisible stuff is microarchitecture or "implementation" (and it's really messy… but fun.)
• A big piece of the ISA = assembly language structure
– Primitive instructions that execute sequentially, atomically
– Issues are formats, computations, addressing modes, etc.
– Two broad flavors:
  • CISC: lots of complicated instructions
  • RISC: a few, essential instructions
  • Basically all recent machines are RISC, but the dominant machine of today, Intel x86, is still CISC (though they do RISC tricks in the guts…)
• We did one example in some detail: MIPS (from P&H Chap. 3)
– A RISC machine; its virtue is that it is pretty simple
– You can "get" the assembly language without too much memorization

Binary Addition               Binary Multiplication
0 + 0 = 0                     0 x 0 = 0
0 + 1 = 1                     0 x 1 = 0
1 + 0 = 1                     1 x 0 = 0
1 + 1 = 10                    1 x 1 = 1

Addition of two POSITIVE integers, A = (10111010)2 = (186)10 and B = (110111)2 = (55)10:

    11111      (carry)
   10111010
 +   110111
 ----------
   11110001  = (241)10

4. Arithmetic
• Overflow
• Serial multiplication

4. Integer Numbers
• Signed magnitude
– 1 bit for the sign
– Have two zeros
– Operations are a pain
• Two's complement
– To negate: invert and add 1 (easy to do with Cin)
– The complement of 0001 is 2^4 - 0001 = 10000 - 0001 = 1111
– Overflow is different
• Non-integers
– Fixed point: 0010.1100
– Floating point (mantissa, exponent, and sign)
• Multiplication is easier than addition


5. Digital Logic - Basic Gates

AND              OR               XOR
A B | Out        A B | Out        A B | Out
0 0 |  0         0 0 |  0         0 0 |  0
0 1 |  0         0 1 |  1         0 1 |  1
1 0 |  0         1 0 |  1         1 0 |  1
1 1 |  1         1 1 |  1         1 1 |  0

NAND             NOR              XNOR
A B | Out        A B | Out        A B | Out
0 0 |  1         0 0 |  1         0 0 |  1
0 1 |  1         0 1 |  0         0 1 |  0
1 0 |  1         1 0 |  0         1 0 |  0
1 1 |  0         1 1 |  0         1 1 |  1

5. Karnaugh Maps
• Order the variables such that only 1 changes in each row/column (Gray coding)
• Groups may overlap

CD\AB   00   01   11   10
  00     0    1    1    0
  01     0    1    1    0
  11     0    1    1    0
  10     0    1    0    0

Groups: B•D, !A•B, !C•B  =>  B•(D + !A + !C)
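One way to sanity-check the simplification B•(D + !A + !C) is to compare it against the map for all 16 input combinations; a small C sketch (illustrative only, the map is re-stated as a logic expression):

    #include <stdio.h>

    int main(void) {
        /* From the K-map: the output is 1 when AB = 01 (any CD),
           or when AB = 11 and CD is not 10. */
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++)
                for (int c = 0; c <= 1; c++)
                    for (int d = 0; d <= 1; d++) {
                        int map = (b && !a) || (b && a && !(c && !d));
                        int simplified = b && (d || !a || !c);
                        if (map != simplified)
                            printf("Mismatch at A=%d B=%d C=%d D=%d\n", a, b, c, d);
                    }
        printf("Done: the simplified expression matches the map.\n");
        return 0;
    }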

5. Logic Blocks - Memory
[Figure: reading an SRAM array. A 4-bit binary address arrives on the address bus; bits 0-2 go to a decoder that produces 1-hot row enables (e.g., 100 in binary becomes the 1-hot value 00010000) selecting one row of the array of SRAM cells (each box stores 1 bit); bit 3 drives a MUX that picks which half of the row goes to the 8-bit data output. Example: read address 12 = 1100, bit 3 = 1.]


5. State Storage - Building a Counter
[Figure: a 3-bit latch holds current_value. Its outputs (out0-out2) feed the A inputs of a chain of three full adders whose B inputs are 1, 0, 0 and whose carries ripple from Cout to the next Cin; the sums form next_value, which feeds back to the latch inputs (in0-in2).]
• When the clock edge rises (0 to 1), the next_value is stored as the current_value
• The new current_value then goes through the adder to make a new next_value
• Everything takes some time
Q: What limits the speed of the clock?
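The counter in the figure can be modeled in a few lines of C: the latch holds current_value, and on each simulated clock edge the adder's output becomes the new current_value (an illustrative sketch; the 3-bit wrap-around is assumed):

    #include <stdio.h>

    int main(void) {
        int current_value = 0;                            /* the 3-bit latch */
        for (int clock = 0; clock < 10; clock++) {
            int next_value = (current_value + 1) & 0x7;   /* three full adders add 001 */
            current_value = next_value;                   /* rising clock edge loads the latch */
            printf("after edge %d: %d%d%d\n", clock + 1,
                   (current_value >> 2) & 1, (current_value >> 1) & 1,
                   current_value & 1);
        }
        return 0;
    }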

5. Finite State Machines
[Figure: state diagram with states IDLE, WAIT_CODE, and CODE_OK, with transitions on Button A and Button B; LAUNCH = false in IDLE and WAIT_CODE, LAUNCH = true in CODE_OK.]
• IDLE: LAUNCH = false. If (A button) the next state is WAIT_CODE, else the next state is IDLE.
• WAIT_CODE: LAUNCH = false. If (B button) the next state is CODE_OK, else the next state is IDLE.
• CODE_OK: LAUNCH = true. The next state is IDLE.
[Figure: generic FSM structure. The current state is held in memory and updated on the clock; next-state logic (from the truth table) combines the inputs with the current state to produce the outputs and the next state.]
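The launch-code FSM maps directly onto a switch statement: the stored value is the current state and the case bodies are the next-state logic. A minimal C sketch (the button-input sequence and names are my own, purely illustrative):

    #include <stdio.h>

    enum state { IDLE, WAIT_CODE, CODE_OK };

    int main(void) {
        enum state s = IDLE;
        const char inputs[] = "ABBAB";           /* assumed sequence of button presses */
        for (int i = 0; inputs[i] != '\0'; i++) {
            int launch = (s == CODE_OK);         /* output depends only on the current state */
            printf("state=%d LAUNCH=%s input=%c\n", s, launch ? "true" : "false", inputs[i]);
            switch (s) {                         /* next-state logic from the truth table */
            case IDLE:      s = (inputs[i] == 'A') ? WAIT_CODE : IDLE; break;
            case WAIT_CODE: s = (inputs[i] == 'B') ? CODE_OK   : IDLE; break;
            case CODE_OK:   s = IDLE; break;
            }
        }
        return 0;
    }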

5. Logic: Summary
• Combinational logic
– Inputs "immediately" produce outputs
– Use truth tables to determine logic equations
– Common functions such as MUX, DEMUX, decode, encode
• Sequential logic
– The current state is updated to the next state by the clock
– Store the current state in a latch/memory/flip-flop
– Combinational logic is used to determine the next state, and the current state is updated on the clock
• What do you need for the labs?
– Combinational logic for the ALU and adder
– Combinational logic to decode instructions and connect up the ALU
• What do you need for life?
– Understand that state is stored in memories
– ...that state is updated by combinational logic
– ...that clock speed depends on how long it takes to calculate the next state


6. Performance

• Comparing machines using microarchitecture
– Latency (instruction execution time from start to finish)
– Throughput (number of instructions per unit of time)
– Processor cycle time / clock rate (GHz)
– CPI - cycles per instruction
– MIPS - millions of instructions per second
– FLOPS - floating-point operations per second (also GFLOPS/TFLOPS/PFLOPS/EFLOPS)
• Comparing machines using benchmark programs
– Which programs?
– Benchmark suites / microbenchmarks
– Different means: arithmetic, harmonic, and geometric

6. Metrics

• CPI (Cycles Per Instruction):
– 25 instructions are loads/stores (each takes 2 cycles)
– 50 instructions are adds (each takes 1 cycle)
– 25 instructions are square roots (each takes 100 cycles)
– CPI = ((25 * 2) + (50 * 1) + (25 * 100)) / 100 = 2600 / 100 = 26.0
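The CPI here is just a weighted average of the per-instruction cycle counts; a quick C check of the arithmetic (illustrative only):

    #include <stdio.h>

    int main(void) {
        /* 25 loads/stores at 2 cycles, 50 adds at 1 cycle, 25 square roots at 100 cycles */
        double total_cycles = 25 * 2 + 50 * 1 + 25 * 100;  /* 2600 cycles */
        double cpi = total_cycles / 100.0;                  /* 100 instructions */
        printf("CPI = %.1f\n", cpi);                        /* prints CPI = 26.0 */
        return 0;
    }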

• MIPS (Millions of Instructions Per Second):
– Machine A has a special instruction for performing square root calculations. It takes 100 cycles to execute.
– Machine B doesn't have the special instruction. It must perform square root calculations in software using simple instructions (e.g., Add, Mult, Shift) that each take 1 cycle to execute.
– Machine A: 1/100 MIPS = 0.01 MIPS. Machine B: 1 MIPS.
– The square root takes 100 cycles: it hurts the average "instructions per second" but may improve performance dramatically!

6. Comparisons
• Averages (arithmetic, harmonic, geometric)
• Normalizing (weights, runtimes)
• What you really care about is time… …for your application (benchmarks)
• Amdahl's Law
[Figure: Amdahl's Law example. A 100 s program is split into a 10 s part and a 90 s part. A 10x speedup on the 10 s part reduces it to 1 s, so the total only drops from 100 s to 91 s.]
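The same example can be checked with Amdahl's Law, Speedup_overall = 1 / ((1 - f) + f/s), where f is the fraction of time that is sped up and s is the speedup of that part; a short C sketch (illustrative only):

    #include <stdio.h>

    int main(void) {
        double f = 10.0 / 100.0;   /* the sped-up part is 10 s of the 100 s program */
        double s = 10.0;           /* 10x speedup on that part */
        double overall = 1.0 / ((1.0 - f) + f / s);
        printf("new time = %.0f s, overall speedup = %.3fx\n",
               100.0 * ((1.0 - f) + f / s), overall);   /* 91 s, about 1.099x */
        return 0;
    }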

6. Performance Summary
• Performance is important to measure
– For architects comparing different deep mechanisms
– For developers of software trying to optimize code and applications
– For users trying to decide which machine to use, or to buy
• Performance metrics are subtle
– It is easy to mess up the "machine A is XXX times faster than machine B" numerical performance comparison
– You need to know exactly what you are measuring: time, rate, throughput, CPI, cycles, etc.
– You need to know how combining these into aggregate numbers "distorts" the individual numbers in different ways (P&H is good on this in Chapter 2)
– No metric is perfect, so there is a lot of emphasis on standard benchmarks today

7. Datapath: Single-cycle

• Control path
– What to do
– Logic for decoding signals (ALU, register file, muxes)
• Data path
– Process the data
– What resources do we need? (ALU, register file, memory, PC)

7. Complete Single-cycle Datapath
[Figure: the full single-cycle datapath. The current PC reads the Instruction Memory (RAM); the instruction selects registers in the Register File (Read Reg 1/2, Write Reg, Write Data, Read Data 1/2); a sign extender (16 to 32 bits) and MUXes feed the ALU (with a Zero output); the ALU result or the Data Memory (RAM) read data is selected by a MUX for write-back; the next PC comes from a +4 adder or from a branch-target adder (sign-extended offset << 2) through a MUX.]


7. Cost of the Single-Cycle Architecture
[Figure: instruction classes 1, 2, and 3 take different amounts of time, but our cycle time must fit the longest instruction, so most of the time is wasted.]
Why is one instruction longer? It has to do more operations (e.g., a conditional branch vs. a jump).

8. Multi-cycle Solution
[Figure: the same instruction classes broken into steps; one class takes 4 cycles and another takes 2 cycles, so there is much less wasted time.]
Idea: let the FASTEST instruction determine the clock period.

8. Multi-cycle Memory-Reference
[Figure: the multi-cycle datapath. A single Memory holds both instructions and data; the Instruction Register, Memory Data Register (MDR), A, B, and ALUOut registers hold values between cycles. MUXes controlled by IorD, ALUSelA, ALUSelB, RegDst, and MemtoReg route the PC, register values, the sign-extended (and shifted-left-2) immediate, the constant 4, and the jump address (PC[31:28] plus the 26-bit target shifted left 2) into the ALU, the memory, and the register file.]
[Figure: control FSM fragment for memory-reference instructions. From state 1, state 2 computes the memory address (ALUSelA = 1, ALUSelB = 10, ALUOp = 00). A load then goes to state 3 (memory access: MemRead, IorD = 1) and state 4 (write-back: RegWrite, MemtoReg = 1, RegDst = 0); a store goes to state 5 (memory access: MemWrite, IorD = 1). Both paths go back to state 0. Control signals include IorD, MemRead, MemWrite, IRWrite, RegDest, RegWrite, ALUSelA, ALUSelB, MemToReg, and ALUOp; the ALU control is derived from ALUOp and Instruction[5:0].]

8. Performance of the Multicycle Implementation

• Each type of instruction can take a variable number of cycles
• Example
– Assume the following instruction distribution:
  • loads: 5 cycles, 22%
  • stores: 4 cycles, 11%
  • R-type: 4 cycles, 49%
  • branches: 3 cycles, 16%
  • jumps: 3 cycles, 2%
– What's the average Cycles Per Instruction (CPI)?
  CPI = (CPU clock cycles / Instruction count)
  CPI = (5 cycles * 0.22) + (4 cycles * 0.11) + (4 cycles * 0.49) + (3 cycles * 0.16) + (3 cycles * 0.02)
  CPI = 4.04 cycles per instruction
– What was the CPI for the single-cycle machine?
  • Single cycle implies 1 clock cycle per instruction --> CPI = 1.0
  • So isn't the single-cycle machine about 4 times faster?

8. Performance of the Multicycle Implementation

• The correct answer should consider the clock cycle time as well:
– For the single-cycle implementation, the cycle time is given by the worst-case delay: Tcycle = 40 ns (for load instructions, see slide 8)
– For the multicycle implementation, the cycle time is given by the worst-case delay over all execution steps: Tcycle = 10 ns (for each of the steps 1, 2, 3, or 4).
• The execution time per instruction is:
– Tcycle * CPI = 40 ns * 1.0 = 40 ns per instruction for the single-cycle machine
– Tcycle * CPI = 10 ns * 4.04 = 40.4 ns per instruction for the multicycle machine
– Thus, the single-cycle machine is only 1% faster
• When considering other types of units (e.g., FP), the single-cycle implementation can be very inefficient.
– Think about how long it takes to do divide or square root!
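A short C sketch of the comparison above (the cycle times and CPIs are taken from the slide; the code itself is just illustrative):

    #include <stdio.h>

    int main(void) {
        double t_single = 40.0 * 1.0;    /* 40 ns cycle x CPI of 1.0 */
        double t_multi  = 10.0 * 4.04;   /* 10 ns cycle x CPI of 4.04 */
        printf("single-cycle: %.1f ns/instr, multicycle: %.1f ns/instr\n",
               t_single, t_multi);
        printf("single-cycle is %.1f%% faster\n",
               100.0 * (t_multi - t_single) / t_single);   /* about 1% */
        return 0;
    }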

8. Summary
• Single-cycle implementations have to consider the worst-case delay through the datapath to come up with the cycle time.
• Multicycle implementations have the advantage of using a different number of cycles for executing each instruction.
• In general, the multicycle machine is better than the single-cycle machine, but the actual execution time strongly depends on the workload.
• The most widely used machine implementation is neither single cycle nor multicycle - it's the pipelined implementation. (Next lecture)


9. Pipelined Datapath
[Figure: a single 100 ns computation is "too long". The pipelined version has 5 pipe stages of ~20 ns each; latches, called "pipeline registers", break up the computation into stages and store the intermediate results.]

9. Pipelining: Implementation Issues
• What prevents us from just doing a zillion pipe stages?
– Those latches are NOT free: they take up area, and there is a real delay to go THROUGH the latch itself
– In modern, deep pipelines (10-20 stages), this is a real effect
– Typically we see logic "depths" of 10-20 "gates" in one pipe stage
[Figure: an unpipelined ~2 ns computation vs. a 10-stage pipe with ~0.2 ns stages. At these speeds, and with this few levels of logic, the latch delay is important.]
• Ideally, Speedup_pipeline = Time_sequential / Time_pipelined = Pipeline Depth

9. Performance of Pipelined Systems
[Figure: instructions vs. time. Unpipelined: each instruction has a latency of 5 cycles and the throughput is 1 instruction per 5 cycles. Pipelined: the latency is still 5 cycles (one pipeline stage per cycle), but the throughput is 1 instruction per cycle.]
Ideal speedup only if we can keep the pipeline full!

9. Complete 5-Stage Pipeline
[Figure: the single-cycle datapath split by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers: PC and Instruction Memory (IF), Register File read (ID), sign extension, branch-target adder and ALU (EX), Data Memory (MEM), and the write-back MUX (WB).]
In cycle 4 we have 3 instructions "in flight":
• Instruction 1 is accessing the memory (DM)
• Instruction 2 is using the ALU (EX)
• Instruction 3 is accessing the register file (ID)

Flow of Instructions Through the Pipeline
[Figure: program execution over clock cycles 1-7. LW R1, 100(R0), LW R2, 200(R0), and LW R3, 300(R0) each flow through IM (instruction fetch), REG (register read), ALU, DM (data memory), and REG (write-back), with each instruction starting one clock cycle after the previous one.]

10. Data Hazards
• In this particular case...
– The R10 value is not computed or returned to the register file when a later instruction wants to use it as an input
[Figure: two instructions shown as Iget, Rget, ALU op, Mput, Rput stages; the first writes R10 in its Rput stage, but the second reads R10 in its Rget stage before that happens.]
Double pumping the reg file doesn't help here; the later instruction needs R10 2 clock cycles before it's been computed & stored back. Oops...


10. Forwarding
[Figure: the complete 5-stage pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with a forwarding bus from WB back toward the ALU inputs, so a result can be used before it has been written to the register file.]

10. Hazards
• Data hazards
– An instruction depends on the result of a prior computation which is not ready yet
– Stall, double pump, and forward to fix
• Structural hazards
– The HW cannot support a combination of instructions
– OK, maybe add extra hardware resources; may still have to stall
• Control hazards
– Pipelining of branches and other instructions which change the PC
– Branch predictors, branch delay slots, early branch computation

10. Pipelining Summary
• Need to keep the pipeline full for performance
– Hazards make this hard
– Dependencies and resource conflicts
– Time to access memory! (caches)
• Performance
– Can't use infinitely many stages
– Code and pipeline stages (branch delay slot)
• Exceptions/Interrupts
– Can happen at different places in the pipeline
– Can happen out-of-order (early in the pipeline for a later instruction vs. later in the pipeline for an earlier one)
– Need to restart instructions
– Jump to the OS to handle them

11. Input/Output
• Busses
– Clocking, width, arbitration
• Performance
– Latency
– Throughput
• Talking to devices
– Method: memory-mapped / IO instructions
– Means: polling, interrupt, DMA

12. Memory
• Random Access Memories
– DRAM: Dynamic Random Access Memory
  • High density, low power, cheap, slow
  • Dynamic: needs to be "refreshed" regularly
– SRAM: Static Random Access Memory
  • Low density, high power, expensive, fast
  • Static: content will last "forever" (until power is lost)
• What gets used where?
– Main memory is DRAM: you need it big, so you need it cheap
– CPU cache memory is SRAM: you need it fast, so it's more expensive, so it's smaller than you would usually want due to resource limitations
• Relative performance
– Size: DRAM is 4-8x denser than SRAM
– Cost / cycle time: SRAM is 8-16x faster, and more $$$, than DRAM

12. Performance Impact (4-cycle memory)
lw r2, 0x20
lw r3, 0x30
add r1, r2, r3
sw r1, 0x40
[Figure: pipeline diagram over cycles 1-18 showing each instruction going through F, D, A, M, W, with stall (S) cycles inserted while waiting for the 4-cycle memory.]
• With a 1-cycle memory system, the program took 8 cycles
– CPI = 8 cycles / 4 instructions = 2.0
– With lots more instructions, CPI would approach 1.0
• With a 4-cycle memory system, the program takes 18 cycles
– CPI = 18 cycles / 4 instructions = 4.5
– This doesn't include the instruction fetch penalty found in a real memory system
Remember the bit about "if you can keep the pipeline full"?


12. Memory Hierarchy of a Modern Computer System
• By taking advantage of the principle of locality:
– Present the user with as much memory as is available in the cheapest technology.
– Provide access at the speed offered by the fastest technology.
[Figure: processor (control + datapath) with registers, a local cache, a shared cache (SRAM), main memory (DRAM), and secondary storage (disk). Approximate speed (ns): registers ~1, local cache 5-10, shared cache 10-50, DRAM 100s, disk 10,000,000s (tens of ms). Approximate size (bytes): registers 100s, local cache Ks, shared cache Ms, DRAM Gs, disk Ts.]

12. Memory Hierarchy: How Does it Work?
• Temporal locality (locality in time):
=> If an item is referenced, the same item will tend to be referenced again soon
=> Keep the most recently accessed data items closer to the processor
• Spatial locality (locality in space):
=> If an item is referenced, nearby items will tend to be referenced soon
=> Move recently accessed "blocks" (groups of contiguous words) closer to the processor
• "Block" (or "line") - the minimum unit of data moved between 2 levels (e.g., between the L1 and L2 caches)
[Figure: blocks X and Y moving between an upper-level memory (closer to the processor) and a lower-level memory.]
Move data in blocks rather than bytes. (More efficient if blocks are larger.) A typical block is 64 bytes.

13. Basic Cache Design
• A cache only holds a portion of a program
– Which part of the program does the cache contain?
• The cache holds the most recently accessed references
• The cache is divided into units called cache blocks (also known as cache "lines"); each block holds a contiguous set of memory addresses
– How does the CPU know which part of the program the cache is holding?
• Each cache block has extra bits, called the cache tag, which hold the main memory address of the data in the block
[Figure: a CPU with a 2-block cache (block 0 and block 1, each with a tag and a block-sized data field) in front of a DRAM memory spanning addresses 0x00000000 to 0xFFFFFFFC.]

13. The ABC's (or 1-2-3-4's) of Caches
• Caching is a general concept used in processors, operating systems, file systems, and applications.
• Wherever it is used, there are four basic questions that arise:
– Q1: Where can a block be placed in a cache?
  Direct mapped, set-associative, fully-associative
– Q2: How is a block found if it is in a cache?
  Indexing (direct mapped), limited search (set-associative), full search (fully-associative)
– Q3: Which block should be replaced on a miss?
  Random, least-recently used (LRU)
– Q4: What happens on a write?
  Write-through or write-back

13. Cache Block Placement
[Figure: memory blocks 0-31 and an 8-block cache.
Fully-associative: block 12 can go anywhere (complex).
Direct mapped: block 12 can go only into block 4 (12 mod 8) (inflexible).
2-way set-associative: block 12 can go anywhere in set 0 (12 mod 4), with sets 0-3 (a compromise).]

Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache
[Figure: memory addresses 0-15 (0000-1111) map to cache lines 0-3 by address mod 4: addresses ending in 00 map to line 0, 01 to line 1, 10 to line 2, and 11 to line 3.]


Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache
[Figure: each cache line has a tag. Addresses 0, 4, 8, and 12 all map to line 0; 1, 5, 9, and 13 to line 1; 2, 6, 10, and 14 to line 2; 3, 7, 11, and 15 to line 3.]
Memory addresses that map to the same cache line will conflict in the cache - we can only store one or the other.

Direct-Mapped Cache Indexing
• 4-entry direct-mapped cache, 4 bytes/line
[Figure: each cache line now holds a tag plus bytes 0-3. Addresses of the form xxx00zz index line 0, xxx01zz line 1, xxx10zz line 2, and xxx11zz line 3. We need to ignore the last two bits because they choose the byte within the cache line.]
[Figure: line 0 holds bytes xxx0000-xxx0011, line 1 holds xxx0100-xxx0111, line 2 holds xxx1000-xxx1011, and line 3 holds xxx1100-xxx1111; each line stores the tag bits xxx. We need the tag to tell us which memory address xxxYYZZ is in each of the lines for YYZZ.]

13. Direct-Mapped Cache Indexing
• Direct-mapped caches have a 1:1 mapping from memory addresses to cache entries
– E.g., a 4-entry cache with 4 bytes per line:
  • Entry 0: xxx00xx
  • Entry 1: xxx01xx
  • Entry 2: xxx10xx
  • Entry 3: xxx11xx
– So addresses 0100000 and 1100001 map to the same cache line.
• To tell which is in the cache we look at the tag:
– If the tag is 010 we have the first memory address
– If the tag is 110 we have the second memory address
• To get individual bytes, we look at the byte offset:
– 0100000 is the first byte in the line in cache entry 00
– 0100001 is the second byte in the line in cache entry 00
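In code, splitting an address into byte offset, index, and tag is just shifting and masking; a small C sketch for the 4-entry, 4-bytes-per-line direct-mapped cache above (my own example, not from the slides):

    #include <stdio.h>

    int main(void) {
        unsigned addr = 0x64;                 /* 1100100 in binary */
        unsigned offset = addr & 0x3;         /* low 2 bits: byte within the 4-byte line */
        unsigned index  = (addr >> 2) & 0x3;  /* next 2 bits: which of the 4 cache entries */
        unsigned tag    = addr >> 4;          /* remaining bits: stored as the tag */
        printf("offset=%u index=%u tag=0x%X\n", offset, index, tag);
        /* Two addresses conflict when they have the same index but different tags. */
        return 0;
    }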

13. Set-Associative Cache Indexing
• Set-associative caches have a 1:1 mapping from memory addresses to sets
• Within a set, a memory address can be put anywhere (multiple entries per set)
– E.g., a 2-way set-associative cache with 4 bytes per line:
  • Set 0: xxxx0xx
  • Set 1: xxxx1xx
  • ...
– So addresses 1010000 and 0110001 map to the same set, but the set can have multiple entries.
– If the set has 2 entries, we can store both values in the cache, even though they map to the same set.

Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line
[Figure: cache lines 0 and 1 form set 0 (addresses xxxx0zz, bytes xxxx000-xxxx011) and lines 2 and 3 form set 1 (addresses xxxx1zz, bytes xxxx100-xxxx111); each line stores the tag bits xxxx. We need the tag to tell us which memory address xxxxYZZ is in each of the lines for YZZ.]


Set-Associative Cache Indexing
• 4-entry, 2-way set-associative cache, 4 bytes/line
[Figure: in the memory space of addresses 0-15, addresses 0-3 and 8-11 map to set 0, and addresses 4-7 and 12-15 map to set 1; each set has two lines. But now we can have any two of set 0 and any two of set 1 at the same time.]

Q1: Block Placement
[Figure: memory blocks 0-31 and an 8-block cache. Direct mapped: block 12 can go only into block 4 (12 mod 8).]
Each memory location maps to one cache line. The mapping is done by a modulo operator, which is accomplished by ignoring some of the MSBs (e.g., address 001100 = 12 and 000100 = 4 both get mapped to 100 = 4).

Q1: Block Placement
[Figure: memory blocks 0-31 and an 8-block cache. Fully-associative: block 12 can go anywhere.]
Any memory location can go to any cache line. This is infinitely flexible, but it requires very complex (and slow) hardware.

Q1: Block Placement
[Figure: memory blocks 0-31 and an 8-block cache organized as sets 0-3. 2-way set-associative: block 12 can go anywhere in set 0 (12 mod 4).]
Every memory location maps to exactly one set of cache lines. Within a set it can go in any location. This is a tradeoff between simplicity (direct-mapped to sets) and flexibility (fully associative within sets).

13. Cache Performance
• What's the impact on performance (CPU time) when the following cache behavior is included?
– 50-cycle miss penalty
– All instructions normally take 2.0 cycles (excluding memory stalls)
– Miss rate is 2.0%
– Average of 1.33 memory references per instruction

CPU time = IC x (CPI_execution + Memory stall cycles per instruction) x Clock cycle time
         = IC x (2.0 + 0.02 x 1.33 x 50) x Clock cycle time
         = IC x 3.33 x Clock cycle time

• Two important results to keep in mind:
– The lower the CPI_execution, the higher the relative impact of a cache miss penalty (more sensitive to memory latency!)
– Comparing two machines with identical memory systems, the machine with the higher clock rate will have the larger number of clock cycles per miss, and hence the memory portion of its CPI will be higher.
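The CPU-time formula above is easy to evaluate; a short C sketch with the slide's numbers (illustrative only):

    #include <stdio.h>

    int main(void) {
        double cpi_exec     = 2.0;    /* cycles per instruction without memory stalls */
        double miss_rate    = 0.02;
        double refs_per_ins = 1.33;   /* memory references per instruction */
        double miss_penalty = 50.0;   /* cycles */
        double cpi = cpi_exec + miss_rate * refs_per_ins * miss_penalty;
        printf("effective CPI = %.2f\n", cpi);   /* prints 3.33 */
        return 0;
    }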

14. Virtual Memory
• What is virtual memory?
– A technique that allows execution of a program that:
  • can reside in non-contiguous memory locations
  • does not have to completely reside in memory
– Allows the computer to "fake" a program into believing that its:
  • memory is contiguous
  • memory space is larger than physical memory
• Why is VM important?
– Cheap - you no longer have to buy lots of RAM
– Removes the burden of memory resource management from the programmer
– Enables multiprogramming, time-sharing, protection


14. Basic VM Algorithm
[Figure: a processor running a program (add r1,r2,r3; sub r2,r3,r4; lw r2,0x04; mult r3,r4,r5) issues virtual addresses (0x00, 0x04, 0x08, 0x0C, 0x10, ...); the VA -> PA translation maps each one onto a physical address in RAM (0x00-0x0C) or onto disk.]
• The program uses virtual addresses (load, store, instruction fetch)
• The computer translates the virtual address (VA) to a physical address (PA)
• The computer reads RAM using the PA, returning the data to the program
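The translation step in the figure can be sketched as a page-table lookup in C (the page size, table contents, and names here are my own assumptions, purely illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u   /* assume 4 KB pages */

    int main(void) {
        /* Toy page table: virtual page number -> physical page number. */
        uint32_t page_table[4] = { 7, 3, 0, 5 };

        uint32_t va  = 0x1A04;                  /* some virtual address */
        uint32_t vpn = va / PAGE_SIZE;          /* virtual page number = 1 */
        uint32_t off = va % PAGE_SIZE;          /* offset within the page */
        uint32_t pa  = page_table[vpn] * PAGE_SIZE + off;
        printf("VA 0x%X -> PA 0x%X\n", va, pa); /* VA 0x1A04 -> PA 0x3A04 */
        return 0;
    }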

14. Page Tables and Entries
• Page tables
– Size of entries / size of the table
• Multi-level page tables (why?)
– How to look up entries
• TLB
– Thrashing of pages (LRU example)

Page Table Protection
• Separate page tables (per process) provide page-level protection
• The OS creates and manages page tables so that no user-level process can alter any process's page table
– Page tables are mapped into kernel memory where only the OS can read or write
[Figure: Process 1's page table maps virtual pages 0x00000-0x00004 to physical pages 0x001, 0x005, 0x00A, 0x004, 0x008; Process 2's page table maps the same virtual pages to physical pages 0x002, 0x006, 0x00B, 0x003, 0x009. Physical memory 0x00-0x0B is shared between P1's pages, P2's pages, and the OS. The page table is software managed by the OS; the TLB is hardware.]

14. Making Address Translation Fast
• A cache for address translations: the translation lookaside buffer (TLB)
[Figure: the TLB holds valid bits, tags (virtual page numbers), and physical page addresses; the page table holds valid bits and physical page or disk addresses for every virtual page, pointing into physical memory or disk storage.]

15. Multi-level Page Tables to Save Space
[Figure: the virtual address is split into VPN 1, VPN 2, and Offset. VPN 1 indexes the level 1 table (page directory), which points to a level 2 table (page table); VPN 2 indexes that table to get the PPN, and PPN + Offset forms the physical address.]

Address Translation / Cache Lookup
[Figure: the virtual address is split into VPN and PO (page offset). The TLB translates the VPN into a PPN; the PPN and PO form the physical address, which is split into TAG, IDX, and BO for the cache lookup; the cache compares the stored tag (=?) to produce Hit/Miss and the Data.]


Overlapped Cache & TLB Access
• Simple for small caches
– IDX + BO ≤ PO
• Must satisfy: Cache size / Associativity ≤ Page size
• Assume 4 KB pages and a 2-way set-associative cache
• What is the maximum cache size allowed for parallel address translation to work?
[Figure: the index (IDX) and block offset (BO) come entirely from the page offset (PO), so the cache can be indexed in parallel with the TLB lookup; the TLB's PPN is then compared with the cache tag (=?) to produce Hit/Miss and the Data.]
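One way to work out the answer (my arithmetic, not stated on the slide): for the index and block offset to fit inside the untranslated 12-bit page offset, we need cache size / associativity ≤ page size, so with 4 KB pages a 2-way set-associative cache can be at most 2 x 4 KB = 8 KB and still be indexed in parallel with the TLB lookup.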

Virtual Address Cache
• Look up the cache using the VA
• TLB access only on a miss
• Use the PA to access the next level (L2)
[Figure: the virtual address (VPN + PO) is split directly into TAG, IDX, and BO for the cache lookup; the TLB is consulted only on a miss, producing PPN + PO to access the next level.]

Or, Only Use Virtual Bits to Index the Cache
• Don't need to wait for the TLB
• Parallel TLB access (e.g., for larger caches)
• Physically-tagged but virtually-indexed cache
• Can distinguish addresses from different processes
• But, what if multiple processes share memory?
[Figure: the virtual index (IDX, BO) selects the cache set while the TLB translates the VPN in parallel; the resulting PPN is compared with the cache tag (=?) to produce Hit/Miss and the Data.]

Summary
• Memory access is hard and complicated!
– The speed of the CPU core demands very fast memory access. We use a cache hierarchy to solve this one. It gives the illusion of speed - most of the time. Occasionally slow.
– The size of programs demands large RAM. We use a VM hierarchy to solve this one. It gives the illusion of size - most of the time. Occasionally slow.
• VM hierarchy
– Another form of cache, but now between RAM and disk.
– The atomic units of memory are pages, typically 4 kB to 2 MB.
– The page table serves as the translation mechanism from virtual to physical addresses
  • The page table lives in physical memory, managed by the OS
  • For 64-bit addresses, multi-level tables are used, and some of the table is in VM
– The TLB is yet another cache - it caches translated addresses (page table entries).
  • It saves us from having to go to physical memory to do a lookup on each access
  • It is usually very small, managed by the OS
– VM, the TLB, and caches have "interesting" interactions.
  • Big impacts on speed and pipelining. Big impacts on exactly where the virtual-to-physical mapping takes place.

And Now For Something Completely Different…
• Course Evaluations! (15 minutes)
• Followed by… …what to expect on the exam (without ruining too much of the surprise)

Course Evaluations


Final Exam

• Format
– 3 short-answer questions
– 6 true/false (-1/0/+1 points each)
– 4 or 5 longer questions
– Hopefully no more than 2.5 hours to finish
– (This is a bit harder than the exam from last year)
– You are allowed one double-sided, hand-written, A4 sheet of notes during the exam, and a calculator
• Likely topics:
– Caches, virtual memory, performance, pipelines, assembly, arithmetic, (simple) logic, input/output, etc.
– Very likely topics: anything that we spent time going through with multiple animations/examples in class

Questions?