CS 161 Review for Test 2
1 1999 ©UCB
CS 161 Review for Test 2
Instructor: L.N. Bhuyan www.cs.ucr.edu/~bhuyan
Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)
2 1999 ©UCB
How to Study for Test 2 : Chap 5
°Single-cycle (CPI=1) processor
• know how to reason about processor organization (datapath, control)
- e.g., how to add another instruction? (must modify control, datapath, or both)
- how to add multiplexors in the datapath
- how to design the hardware control unit
°Multicycle (CPI>1) processor
- changes to the single-cycle datapath
- control design through an FSM
- how to add a new instruction to the multicycle design?
3 1999 ©UCB
Putting Together a Datapath for MIPS
[Figure: MIPS datapath built from PC, instruction memory (Imem: Address, Data Out), Registers, ALU, and data memory (Dmem: Address, Data In, Data Out), divided into Steps 1 through 5]
°Question: Which instruction uses which steps and what is the execution time?
4 1999 ©UCB
Datapath Timing: Single-cycle vs. Pipelined
°Suppose the following delays for major functional units:
• 2 ns for a memory access or ALU operation
• 1 ns for register file read or write
°Total datapath delay for single-cycle:

Insn     Insn    Reg    ALU    Data     Reg     Total
Type     Fetch   Read   Oper   Access   Write   Time
beq      2ns     1ns    2ns                     5ns
R-form   2ns     1ns    2ns             1ns     6ns
sw       2ns     1ns    2ns    2ns              7ns
lw       2ns     1ns    2ns    2ns      1ns     8ns

°What about multi-cycle datapath?
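The timing figures on this slide can be reproduced with a short Python sketch. The unit delays and the steps each instruction uses come from the slide; the multicycle estimate is an assumption (one step per clock cycle, with the cycle as long as the slowest unit), so treat it as one plausible answer to the question rather than the official one.

```python
# Functional-unit delays from the slide, in ns.
DELAY = {"fetch": 2, "reg_read": 1, "alu": 2, "mem": 2, "reg_write": 1}

# Steps used by each instruction class, as listed on the slide.
STEPS = {
    "beq":    ["fetch", "reg_read", "alu"],
    "R-form": ["fetch", "reg_read", "alu", "reg_write"],
    "sw":     ["fetch", "reg_read", "alu", "mem"],
    "lw":     ["fetch", "reg_read", "alu", "mem", "reg_write"],
}

def total_delay(insn):
    """Sum of the unit delays an instruction passes through."""
    return sum(DELAY[step] for step in STEPS[insn])

def single_cycle_clock():
    """The single-cycle clock must fit the slowest instruction."""
    return max(total_delay(insn) for insn in STEPS)

def multicycle_time(insn):
    """Assumption: one step per cycle, cycle time = slowest unit."""
    cycle = max(DELAY.values())
    return len(STEPS[insn]) * cycle
```

Under these assumptions every single-cycle instruction takes 8 ns (set by lw), while a multicycle beq takes 3 cycles of 2 ns and a multicycle lw takes 5.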
5 1999 ©UCB
Implementing Main Control
[Figure: Main Control block. Input: op (6 bits). Outputs: RegDst, Branch, MemRead, MemtoReg, ALUOp (2 bits), MemWrite, ALUSrc, RegWrite]
Main Control has one 6-bit input and 9 output bits (7 are 1-bit signals, ALUOp is 2 bits)
To build Main Control as sum-of-products:
(1) Construct a minterm for each different instruction (or R-type); each minterm corresponds to a single instruction (or all of the R-type instructions), e.g., M_R-format, M_lw
(2) Determine each main control output by forming the logical OR of relevant minterms (instructions), e.g., RegWrite: M_R-format OR M_lw
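The two steps above can be sketched in Python. The opcode values are the standard MIPS encodings (R-format 000000, lw 100011, sw 101011, beq 000100); the function names are hypothetical. Only RegWrite is spelled out on the slide; the other equations follow the usual textbook control table. MemtoReg and ALUOp are left out here because their encoding depends on how the figure wires the mux.

```python
# Standard MIPS opcodes for the four instruction classes.
OPCODES = {"R-format": 0b000000, "lw": 0b100011, "sw": 0b101011, "beq": 0b000100}

def minterm(op, code):
    """AND over all six opcode bits: every bit of op must match code."""
    return all(((op >> i) & 1) == ((code >> i) & 1) for i in range(6))

def main_control(op):
    """Form each output as the OR of the relevant instruction minterms."""
    m = {name: minterm(op, code) for name, code in OPCODES.items()}
    return {
        "RegWrite": int(m["R-format"] or m["lw"]),  # the slide's example
        "RegDst":   int(m["R-format"]),
        "ALUSrc":   int(m["lw"] or m["sw"]),
        "MemRead":  int(m["lw"]),
        "MemWrite": int(m["sw"]),
        "Branch":   int(m["beq"]),
    }
```

Because each output is an OR of minterms, any don't-care entry in the control table simply comes out as 0.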
6 1999 ©UCB
Single-Cycle MIPS-lite CPU
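[Figure: single-cycle MIPS-lite datapath. PC feeds Imem (Read Addr, Instruction); an adder computes PC + 4 and a second adder with a shift-left-2 computes the branch target, selected by the PCSrc mux. Instruction fields 25:21 and 20:16 select ReadReg1 and ReadReg2; a mux controlled by RegDst picks 20:16 or 15:11 as WriteReg; field 15:0 goes through Sign Extend. A mux controlled by ALUSrc picks Read data 2 or the sign-extended immediate as the second ALU input. ALU Control takes funct (5:0) and ALUOp; the ALU produces Zero and the Dmem Address. Dmem has MemRead and MemWrite; a mux controlled by MemtoReg selects the register WriteData. Main Control takes op = [31:26] and drives RegDst, Branch, RegWrite, and the other control signals.]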
7 1999 ©UCB
R-format Execution Illustration (step 4)
[Figure: the single-cycle MIPS-lite datapath with the R-format path highlighted. Control settings shown: RegDst=1, ALUSrc=0, MemtoReg=1, PCSrc=0. The value written back to the register file is [r1] + [r2].]
8 1999 ©UCB
Multicycle Datapath (overview)
[Figure: MIPS-lite multicycle datapath. PC, a single Memory (Address, Instruction or Data), Instruction Register, Memory Data Register, Registers (ReadReg1, ReadReg2, WriteReg, Data, Read data 1, Read data 2), temporary registers A and B, ALU, ALUOut]
• One ALU (no extra adders)
• One Memory (no separate Imem, Dmem)
• New Temporary Registers (“clocked”/require clock input)
9 1999 ©UCB
Cycle 3 Datapath (R-format)
[Figure: MIPS-lite multicycle datapath during cycle 3 of an R-format instruction. Components: PC, Mem (Address, Read Data, Write Data), IR, MDR, Regs (fields 25:21, 20:16, 15:11), Sign Extend (15:0), <<2, registers A and B, a 4-input mux (inputs 0 to 3, including the constant 4) on the second ALU input, ALU Control driven by funct (5:0), ALU, ALUOut]
ALUOut = A op B
10 1999 ©UCB
FSM diagram for Multicycle Machine
°state 0 (cycle 1, start new instruction): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 1, ALUOp = 0, PCWrite, PCSrc = 0
°state 1 (cycle 2): ALUSrcA = 0, ALUSrcB = 3, ALUOp = 0
°cycle 3 depends on the instruction:
• state 2, lw/sw (Memory Access): ALUSrcA = 1, ALUSrcB = 2, ALUOp = 0
• state 6, R-form (R-format execution): ALUSrcA = 1, ALUSrcB = 0, ALUOp = 2
• state 8, beq (Branch Completion): ALUSrcA = 1, ALUSrcB = 0, ALUOp = 1, PCWriteCond, PCSrc = 1
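The first three cycles of this FSM can be sketched as a lookup table plus a next-state function. The signal values are the ones on the slide; the state numbers 0, 1, 2, 6, 8 are read off the diagram, and the opcodes are the standard MIPS encodings. States that follow 2 and 6 do not appear on this slide, so the sketch deliberately stops there.

```python
OP_RFORMAT, OP_BEQ, OP_LW, OP_SW = 0x00, 0x04, 0x23, 0x2B

# Control outputs asserted in each state shown on the slide.
STATE_OUTPUTS = {
    0: {"MemRead": 1, "ALUSrcA": 0, "IorD": 0, "IRWrite": 1,
        "ALUSrcB": 1, "ALUOp": 0, "PCWrite": 1, "PCSrc": 0},
    1: {"ALUSrcA": 0, "ALUSrcB": 3, "ALUOp": 0},
    2: {"ALUSrcA": 1, "ALUSrcB": 2, "ALUOp": 0},
    6: {"ALUSrcA": 1, "ALUSrcB": 0, "ALUOp": 2},
    8: {"ALUSrcA": 1, "ALUSrcB": 0, "ALUOp": 1, "PCWriteCond": 1, "PCSrc": 1},
}

def next_state(state, op):
    """Next-state function for the part of the FSM shown on the slide."""
    if state == 0:                      # fetch always goes to decode
        return 1
    if state == 1:                      # decode dispatches on the opcode
        if op in (OP_LW, OP_SW):
            return 2
        if op == OP_RFORMAT:
            return 6
        if op == OP_BEQ:
            return 8
        raise ValueError(f"unsupported opcode {op:#x}")
    if state == 8:                      # beq finishes in cycle 3
        return 0
    raise NotImplementedError("later states are not shown on this slide")
```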
11 1999 ©UCB
Implementing the FSM controller (C.3)
[Figure: PLA or ROM implementation of both next-state and output functions.
Inputs: Op5 to Op0 (instruction register opcode field) and S3 to S0 (state register).
Outputs: datapath control points PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSrc, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, plus next-state bits NS3 to NS0]
12 1999 ©UCB
Micro-programmed Control (Chap. 5.5)
° In microprogrammed control, FSM states become microinstructions of a microprogram (“microcode”)
• one FSM state=one microinstruction
• usually represent each micro-instruction textually, like an assembly instruction
°FSM current state register becomes the microprogram counter (micro-PC)
• normal sequencing: add 1 to micro-PC to get next micro-instruction
• microprogram branch: separate logic determines next microinstruction
13 1999 ©UCB
Micro-program for Multi-cycle Machine
Label   ALU Op  In1  In2   Reg File  Mem Op  Mem Src  PC Write  Next-Instr
------  ------  ---  ----  --------  ------  -------  --------  ----------
Fetch:  Add     PC   4               Rd      PC       ALU
        Add     PC   SE*4  Rd                                   [D1]
Mem:    Add     A    SE                                         [D2]
LW:                                  Rd      ALU
                           Wr                                   Fetch
SW:                                  Wr      ALU                Fetch
Rform:  funct   A    B
                           Wr                                   Fetch
BEQ:    Sub     A    B                                Equ       Fetch

D1 = { Mem, Rform, BEQ }
D2 = { LW, SW }
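The sequencing column of this microprogram can be traced with a small Python sketch of the micro-PC. The labels and the dispatch sets D1 and D2 come from the slide; the mapping from instruction to dispatch target (for example, lw and sw both going to Mem in D1) is the natural reading of those sets but is stated here as an assumption, and the names `MICROCODE` and `trace` are hypothetical.

```python
# One (label, sequencing) pair per microinstruction, in microprogram order.
# "seq" = add 1 to the micro-PC, "D1"/"D2" = dispatch, "Fetch" = back to Fetch.
MICROCODE = [
    ("Fetch", "seq"),
    (None,    "D1"),
    ("Mem",   "D2"),
    ("LW",    "seq"),
    (None,    "Fetch"),
    ("SW",    "Fetch"),
    ("Rform", "seq"),
    (None,    "Fetch"),
    ("BEQ",   "Fetch"),
]
LABELS = {label: i for i, (label, _) in enumerate(MICROCODE) if label}

# Dispatch tables (assumed mapping from instruction to target label).
D1 = {"lw": "Mem", "sw": "Mem", "rform": "Rform", "beq": "BEQ"}
D2 = {"lw": "LW", "sw": "SW"}

def trace(insn):
    """Return the micro-PC values visited while executing one instruction."""
    upc, path = LABELS["Fetch"], []
    while True:
        path.append(upc)
        sequencing = MICROCODE[upc][1]
        if sequencing == "seq":
            upc += 1
        elif sequencing == "D1":
            upc = LABELS[D1[insn]]
        elif sequencing == "D2":
            upc = LABELS[D2[insn]]
        else:  # "Fetch": this instruction is finished
            return path
```

The length of each trace is the number of cycles the instruction takes: 5 for lw, 4 for sw and R-format, 3 for beq.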
14 1999 ©UCB
How to Study for Test 2 : Chap 6
°Pipelined Processor
• how do the pipelined datapath and control differ from the architectures of Chapter 5?
- all instructions execute the same 5 cycles
- pipeline registers separate the stages of datapath & control
• Problems for Pipelining
- pipeline hazards: structural, data, control (how is each solved?)
15 1999 ©UCB
Pipelining Lessons
° Pipelining doesn’t help latency (execution time) of single task, it helps throughput of entire workload
° Multiple tasks operating simultaneously using different resources
° Potential speedup = Number of pipe stages
° What is real speedup?
° Time to “fill” pipeline and time to “drain” it reduces speedup
[Figure: laundry analogy. Tasks A, B, C, D in task order versus time from 6 PM to 9 PM and beyond, overlapped in seven 30-minute slots]
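The gap between potential and real speedup can be worked out with a short Python sketch. It assumes an ideal pipeline with equal-length stages and no hazards; the function names are hypothetical. With 4 tasks and 4 stages of 30 minutes, the pipelined schedule occupies 7 slots, which matches the seven 30-minute slots in the laundry figure.

```python
def sequential_time(n_tasks, n_stages, stage_time):
    """Each task runs all of its stages before the next task starts."""
    return n_tasks * n_stages * stage_time

def pipelined_time(n_tasks, n_stages, stage_time):
    """The first task fills the pipeline; each later task finishes one slot after."""
    return (n_stages + n_tasks - 1) * stage_time

def speedup(n_tasks, n_stages, stage_time=1):
    return (sequential_time(n_tasks, n_stages, stage_time)
            / pipelined_time(n_tasks, n_stages, stage_time))
```

For 4 tasks the speedup is 16/7, roughly 2.3, well short of the 4 stages; as the number of tasks grows, fill and drain time matter less and the speedup approaches the number of stages.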
16 1999 ©UCB
Space-Time Diagram
°To simplify pipeline, every instruction takes same number of steps, called stages
°One clock cycle per stage
Time (left to right), Program Flow (top to bottom):
IFtch Dcd   Exec  Mem   WB
      IFtch Dcd   Exec  Mem   WB
            IFtch Dcd   Exec  Mem   WB
                  IFtch Dcd   Exec  Mem   WB
                        IFtch Dcd   Exec  Mem   WB
17 1999 ©UCB
Problems for Pipelining
°Hazards prevent next instruction from executing during its designated clock cycle, limiting speedup
• Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
• Control hazards: conditional branches & other instructions may stall the pipeline delaying later instructions (must check detergent level before washing next load)
• Data hazards: Instruction depends on result of prior instruction still in the pipeline (matching socks in later load)
18 1999 ©UCB
Control Hazard : Solution 1
°Guess the branch outcome, then back up if wrong: “branch prediction”
• For example, predict not taken
• Impact: 1 clock per branch instruction if right, 2 if wrong (static: right ~ 50% of time)
• More dynamic scheme: keep history of the branch instruction (right ~ 90% of time)
[Figure: pipeline diagram, instruction order (add, beq, Load) versus time in clock cycles; each instruction passes through IM, Reg, ALU, DM, Reg]
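The impact figures on this slide turn into an expected cost per branch with one line of arithmetic. A minimal Python sketch, using the slide's costs (1 clock if the prediction is right, 2 if wrong); the function name is hypothetical:

```python
def clocks_per_branch(accuracy, right_cost=1, wrong_cost=2):
    """Expected clocks per branch for a given prediction accuracy."""
    return accuracy * right_cost + (1 - accuracy) * wrong_cost
```

With the slide's numbers, static prediction at about 50% accuracy averages 1.5 clocks per branch, and a history-based scheme at about 90% averages about 1.1.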
19 1999 ©UCB
Control Hazard : Solution 2
°Redefine branch behavior (takes place after next instruction): “delayed branch”
° Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “delay slot” (about 50% of the time)
[Figure: pipeline diagram, instruction order (add, beq, Misc, Load) versus time in clock cycles; each instruction passes through IM, Reg, ALU, DM, Reg, with Misc occupying the delay slot after beq]
20 1999 ©UCB
Data Hazard on $1: Illustration
Dependencies backwards in time are hazards
add $1,$2,$3
sub $4,$1,$3
and $6,$1,$7
or $8,$1,$9
xor $10,$1,$11
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB (IM, Reg, ALU, DM, Reg) for the five instructions above, instruction order versus time in clock cycles]
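The dependences on $1 in this instruction sequence can be found mechanically. The Python sketch below flags a hazard when a result is read by one of the next two instructions; that window of 2 assumes a register file that writes in the first half of a cycle and reads in the second half, so an instruction three slots later is safe. The parser only handles three-register instructions like the ones on the slide, and all names are hypothetical.

```python
import re

def parse(insn):
    """Split 'add $1,$2,$3' into (dest, [sources])."""
    regs = [int(r) for r in re.findall(r"\$(\d+)", insn)]
    return regs[0], regs[1:]

def raw_hazards(program, window=2):
    """Return (producer index, consumer index, register) for each hazard."""
    hazards = []
    for i, producer in enumerate(program):
        dest, _ = parse(producer)
        # Only the next `window` instructions read the register too early.
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, sources = parse(program[j])
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards

PROGRAM = [
    "add $1,$2,$3",
    "sub $4,$1,$3",
    "and $6,$1,$7",
    "or $8,$1,$9",
    "xor $10,$1,$11",
]
```

On the slide's program this reports hazards for sub and and, but not for or or xor.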
21 1999 ©UCB
Data Hazard : Solution
• “Forward” result from one stage to another
• “or” OK if implement register file properly
add $1,$2,$3
sub $4,$1,$3
and $6,$1,$7
or $8,$1,$9
xor $10,$1,$11
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB (IM, Reg, ALU, DM, Reg) for the five instructions above, with forwarding paths from add to the later instructions; instruction order versus time in clock cycles]
22 1999 ©UCB
Data Hazard Even with Forwarding
• Must stall pipeline 1 cycle (insert 1 bubble)
lw $1, 0($2)
sub $4,$1,$6
and $6,$1,$7
or $8,$1,$9
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB (IM, Reg, ALU, DM, Reg) versus time in clock cycles; a bubble delays sub, and, and or by one cycle after the lw]
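The load-use case can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: forwarding is in place, so the only remaining stall is a lw followed immediately by an instruction that reads the loaded register, and that following instruction is a three-register or load-style instruction whose first register is its destination. The function name is hypothetical.

```python
import re

def load_use_bubbles(program):
    """Count 1-cycle stalls caused by a lw followed by a dependent instruction."""
    bubbles = 0
    for prev, curr in zip(program, program[1:]):
        if not prev.startswith("lw"):
            continue
        load_dest = re.findall(r"\$(\d+)", prev)[0]
        # Assumption: the first register of curr is its destination,
        # so the remaining registers are the ones it reads.
        curr_sources = re.findall(r"\$(\d+)", curr)[1:]
        if load_dest in curr_sources:
            bubbles += 1
    return bubbles

PROGRAM = ["lw $1, 0($2)", "sub $4,$1,$6", "and $6,$1,$7", "or $8,$1,$9"]
```

For the slide's sequence the sketch reports one bubble, caused by sub reading $1 right after the lw.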
23 1999 ©UCB
How to Study for Test 2 : Chap 7
°Processor-Memory performance gap: problem for hardware designers and software developers alike
°Memory Hierarchy--The Goal: want to create illusion of single large, fast memory
• accesses that hit in the highest level are processed most quickly
• Exploit Principle of Locality to obtain high hit rate
°Caches vs. Virtual Memory: how are they similar? Different?
24 1999 ©UCB
Memory Hierarchy: Terminology
°Hit Time: Time to access the upper level, which consists of
• Time to determine hit/miss + Memory access time
°Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
°Note: Hit Time << Miss Penalty
[Note: “<<” here means “much less than”]
25 1999 ©UCB
Issues with Direct-Mapped
° If block size > 1, rightmost bits of index are really the offset within the indexed block

ttttttttttttttttt iiiiiiiiii oooo
tag: to check if have correct block
index: to select block
byte offset: within block
Q: How do Set-Associative and Fully-Associative Designs Look?
26 1999 ©UCB
Read from cache at offset, return word b
° Address: 000000000000000000 0000000001 0100
  (Tag field | Index field | Offset)

Index  Valid  Tag  0x0-3  0x4-7  0x8-b  0xc-f
0
1      1      0    a      b      c      d
2
...
1022
1023
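The lookup on this slide can be replayed in Python. The field widths (4 offset bits, 10 index bits, the rest tag) and the contents of cache entry 1 come from the slide; the 4-byte word size matches the 0x0-3, 0x4-7 column headings. The function names and the dictionary layout are hypothetical.

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # 16-byte blocks, 1024 entries

def split_address(addr):
    """Split an address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# cache[index] = (valid, tag, [word at 0x0-3, 0x4-7, 0x8-b, 0xc-f])
cache = {1: (True, 0, ["a", "b", "c", "d"])}

def read_word(addr):
    """Return the cached word on a hit, or None on a miss."""
    tag, index, offset = split_address(addr)
    valid, stored_tag, words = cache.get(index, (False, None, None))
    if valid and stored_tag == tag:
        return words[offset // 4]   # 4-byte words within the block
    return None
```

The slide's address has tag 0, index 1, and offset 0100, which is 0x14; `read_word(0x14)` selects the second word of entry 1.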
27 1999 ©UCB
Miss Rate Versus Block Size
[Figure 7.12: miss rate (0% to 40%) versus block size (4, 16, 64, 256 bytes) for direct mapped caches with total cache size 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB]
28 1999 ©UCB
Compromise: N-way Set Associative Cache
°N-way set associative: N cache blocks for each Cache Index
• Like having N direct mapped caches operating in parallel
°Example: 2-way set associative cache
• Cache Index selects a “set” of 2 blocks from the cache
• The 2 tags in the set are compared in parallel
• Data is selected based on the tag result (which matched the address)
• Where is data written? Determined by the replacement policy: FIFO, LRU, or random
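The 2-way example can be modeled in a few lines of Python. This is a sketch, not a hardware description: it tracks tags only, picks LRU from the replacement policies listed, and checks the two tags one after the other where hardware would compare them in parallel. The class name and the use of block addresses (byte address divided by block size) are assumptions.

```python
class TwoWaySetAssociativeCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Each set holds up to 2 tags, ordered least- to most-recently used.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, block_address):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        index = block_address % self.num_sets    # cache index selects the set
        tag = block_address // self.num_sets
        ways = self.sets[index]
        if tag in ways:                          # tag match: hit
            ways.remove(tag)
            ways.append(tag)                     # mark most recently used
            return True
        if len(ways) == 2:                       # set full: evict the LRU block
            ways.pop(0)
        ways.append(tag)
        return False
```

Block addresses 0 and 4 map to the same set in a 4-set cache and can both stay resident, which a direct mapped cache of the same size could not do.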
29 1999 ©UCB
Improving Cache Performance
° In general, want to minimize
Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
(recall Hit Time << Miss Penalty)
°Generally, two ways to look at it:
• Reduce Miss Rate: larger block size, larger cache, higher associativity
• Reduce Miss Penalty: reducing DRAM latency, L2 cache approach
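The average access time formula on this slide is easy to experiment with. A minimal Python sketch; the numbers in the usage note are made up for illustration and do not come from the slides.

```python
def average_access_time(hit_time, miss_penalty, miss_rate):
    """Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate."""
    return hit_time * (1 - miss_rate) + miss_penalty * miss_rate
```

For example, with a hypothetical 1-cycle hit time, 50-cycle miss penalty, and 5% miss rate, the average is 0.95 + 2.5 = 3.45 cycles. Halving either the miss rate or the miss penalty brings it down to about 2.2 cycles, which is why both are worth attacking.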
30 1999 ©UCB
Virtual Memory has own terminology
°Each process has its own private “virtual address space” (e.g., 2^32 Bytes); CPU actually generates “virtual addresses”
°Each computer has a “physical address space” (e.g., 128 MegaBytes DRAM); also called “real memory”
°Library analogy:
• virtual address is like the title of a book
• physical address is the location of book in the library as given by its Library of Congress call number
31 1999 ©UCB
Mapping Virtual to Physical Address
Virtual Address:  bits 31..10 = Virtual Page Number, bits 9..0 = Page Offset
                  (Translation maps the Virtual Page Number to a Physical Page Number)
Physical Address: bits 29..10 = Physical Page Number, bits 9..0 = Page Offset
1KB page size
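The mapping on this slide can be written out in Python. The 10-bit page offset follows from the 1KB page size; the page table is modeled as a plain dictionary from virtual page number to physical page number, which is an assumption made for illustration, as are the names.

```python
PAGE_OFFSET_BITS = 10   # 1 KB pages

def translate(virtual_address, page_table):
    """Translate the virtual page number; the page offset passes through unchanged."""
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    ppn = page_table[vpn]   # a missing entry (KeyError) stands in for a page fault
    return (ppn << PAGE_OFFSET_BITS) | offset
```

For instance, if virtual page 2 maps to physical page 7, virtual address 2 x 1024 + 5 translates to 7 x 1024 + 5.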
32 1999 ©UCB
How to Translate Fast?
°Observation: since there is locality in pages of data, there must be locality in virtual addresses of those pages!
°Why not create a cache of virtual to physical address translations to make translation fast? (smaller is faster)
°For historical reasons, such a “page table cache” is called a Translation Lookaside Buffer, or TLB
°TLB organization is same as Icache or Dcache – Direct-mapped or Set Associative
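The idea of caching translations can be sketched in Python. To keep it short, this version is fully associative with LRU replacement, which is a simplification; the slide describes direct-mapped or set associative organizations. The class name, the dictionary page table, and the hit and miss counters are all hypothetical.

```python
from collections import OrderedDict

class TLB:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table    # vpn -> ppn, the slow path
        self.entries = OrderedDict()    # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        """Return the physical page number, using the TLB when possible."""
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # mark most recently used
            return self.entries[vpn]
        self.misses += 1
        ppn = self.page_table[vpn]             # fall back to the page table
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[vpn] = ppn
        return ppn
```

Repeated accesses to the same page hit in the TLB and skip the page table, which is where the locality observation pays off.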
33 1999 ©UCB
Access TLB and Cache in Parallel?
°Recall: address translation is only for virtual page number, not page offset
° If cache index bits of PA “fit within” page offset of VA, then the index is not translated, so the cache block can be read while simultaneously accessing the TLB
°“Virtually indexed, physically tagged cache” (avoids aliasing problem)
[Figure: VA split into virtual page number and page offset; PA split into tag, index, and ofs, with index and ofs lying within the page offset]
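The “fit within” condition is a simple bit count. A minimal Python sketch, assuming the page size is a power of two; the function name and the example cache geometries are hypothetical.

```python
def can_access_in_parallel(page_size_bytes, index_bits, block_offset_bits):
    """True if the cache index and block offset lie inside the page offset."""
    page_offset_bits = page_size_bytes.bit_length() - 1   # log2 for powers of two
    return index_bits + block_offset_bits <= page_offset_bits
```

With the 1KB pages used earlier (10 offset bits), a cache with 10 index bits and 4 block-offset bits does not qualify, because 4 of its index bits would need translation first. A cache with 6 index bits and 4 offset bits does.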