introduction performance analysis single-cycle processor...
Post on 18-Jan-2021
2 Views
Preview:
TRANSCRIPT
1
Copyright © 2007 Elsevier 7-<1>
Chapter 7 :: Microarchitecture
Digital Design and Computer Architecture David Money Harris and Sarah L. Harris
Copyright © 2007 Elsevier 7-<2>
Chapter 7 :: Topics
• Introduction• Performance Analysis• Single-Cycle Processor• Multicycle Processor• Pipelined Processor• Exceptions• Advanced Microarchitecture
Copyright © 2007 Elsevier 7-<3>
Introduction
• Microarchitecture: how to implement an architecture in hardware
• Processor:– Datapath: functional blocks– Control: control signals
Physics
Devices
AnalogCircuits
DigitalCircuits
Logic
Micro-architecture
Architecture
OperatingSystems
ApplicationSoftware
electrons
transistorsdiodes
amplifiersfilters
AND gatesNOT gates
addersmemories
datapathscontrollers
instructionsregisters
device drivers
programs
Copyright © 2007 Elsevier 7-<4>
Microarchitecture
• Multiple implementations for a single architecture:– Single-cycle
• Each instruction executes in a single cycle– Multicycle
• Each instruction is broken up into a series of shorter steps– Pipelined
• Each instruction is broken up into a series of steps• Multiple instructions execute at once.
2
Copyright © 2007 Elsevier 7-<5>
Processor Performance
• Program execution time
Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
• Definitions:– Cycles/instruction = CPI– Seconds/cycle = clock period– 1/CPI = Instructions/cycle = IPC
• Challenge is to satisfy constraints of:– Cost– Power– Performance
Copyright © 2007 Elsevier 7-<6>
MIPS Processor
• We consider a subset of MIPS instructions:– R-type instructions: and, or, add, sub, slt– Memory instructions: lw, sw– Branch instructions: beq
• Later consider adding addi and j
Copyright © 2007 Elsevier 7-<7>
Architectural State
• Determines everything about a processor:– PC– 32 registers– Memory
Copyright © 2007 Elsevier 7-<8>
MIPS State Elements
CLK
A RDInstruction
Memory
A1
A3WD3
RD2RD1
WE3
A2
CLK
RegisterFile
A RDData
MemoryWD
WEPCPC'
CLK
32 3232 32
32
32
32 32
32
32
5
5
5
3
Copyright © 2007 Elsevier 7-<9>
Single-Cycle MIPS Processor
• Datapath• Control
Copyright © 2007 Elsevier 7-<10>
Single-Cycle Datapath: lw fetch
• First consider executing lw• STEP 1: Fetch instruction
CLK
A RDInstruction
Memory
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
A RDData
MemoryWD
WEPCPC' Instr
CLK
Copyright © 2007 Elsevier 7-<11>
Single-Cycle Datapath: lw register read
• STEP 2: Read source operands from register file
Instr
CLK
A RDInstruction
Memory
A1
A3WD3
RD2
RD1WE3
A2
CLK
RegisterFile
A RDData
MemoryWD
WEPCPC'
25:21
CLK
Copyright © 2007 Elsevier 7-<12>
Single-Cycle Datapath: lw immediate
• STEP 3: Sign-extend the immediate
SignImm
CLK
A RDInstruction
Memory
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
A RDData
MemoryWD
WEPCPC' Instr 25:21
15:0
CLK
4
Copyright © 2007 Elsevier 7-<13>
Single-Cycle Datapath: lw address
• STEP 4: Compute the memory address
SignImm
CLK
A RDInstruction
Memory
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
A RDData
MemoryWD
WEPCPC' Instr 25:21
15:0
SrcB
ALUResult
SrcA Zero
CLK
ALUControl2:0
ALU
010
Copyright © 2007 Elsevier 7-<14>
Single-Cycle Datapath: lw memory read
• STEP 5: Read data from memory and write it back to register file
A1
A3WD3
RD2
RD1WE3
A2
SignImm
CLK
A RDInstruction
Memory
CLK
Sign Extend
RegisterFile
A RDData
MemoryWD
WEPCPC' Instr 25:21
15:0
SrcB20:16
ALUResult ReadData
SrcA
RegWrite
Zero
CLK
ALUControl2:0
ALU
0101
Copyright © 2007 Elsevier 7-<15>
Single-Cycle Datapath: lw PC increment
• STEP 6: Determine the address of the next instruction
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
A RDData
MemoryWD
WEPCPC' Instr 25:21
15:0
SrcB20:16
ALUResult ReadData
SrcA
PCPlus4
Result
RegWrite
Zero
CLK
ALUControl2:0
ALU
0101
Copyright © 2007 Elsevier 7-<16>
Single-Cycle Datapath: sw
• Consider sw• Need to write data to memory
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
A RDData
MemoryWD
WEPCPC' Instr 25:21
20:16
15:0
SrcB20:16
ALUResult ReadData
WriteData
SrcA
PCPlus4
Result
MemWriteRegWrite
Zero
CLK
ALUControl2:0
ALU
10100
5
Copyright © 2007 Elsevier 7-<17>
Single-Cycle Datapath: R-type instructions
• Consider R-type instructions: add, sub, and, or• Write ALUResult to register file• Writing to rd field of instruction (instead of rt)
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PCPC' Instr 25:21
20:16
15:0
SrcB
20:16
15:11
ALUResult ReadData
WriteData
SrcA
PCPlus4WriteReg4:0
Result
RegDst MemWrite MemtoRegALUSrcRegWrite
Zero
CLK
ALUControl2:0
ALU
0varies1 001
Copyright © 2007 Elsevier 7-<18>
Single-Cycle Datapath: beq
• Consider branch instruction: beq• Determine whether register values are equal• Calculate branch target address (BTA) from sign-extended
immediate and PC+4
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
RegDst Branch MemWrite MemtoRegALUSrcRegWrite
Zero
PCSrc
CLK
ALUControl2:0
ALU
01100 x0x 1
Copyright © 2007 Elsevier 7-<19>
Complete Single-Cycle Processor
SignImm
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU
Copyright © 2007 Elsevier 7-<20>
Control Unit
RegDst
BranchMemWriteMemtoReg
ALUSrcOpcode5:0
ControlUnit
ALUControl2:0Funct5:0
MainDecoder
ALUOp1:0
ALUDecoder
RegWrite
6
Copyright © 2007 Elsevier 7-<21>
Review: ALU
ALU
N N
N3
A B
Y
F
A & B000A | B001A + B010not used011A & B100A | ~B101A - B110SLT111
FunctionF2:0
Copyright © 2007 Elsevier 7-<22>
Review: ALU
+
2 01
A B
Cout
Y3
01
F2
F1:0
[N-1] S
NN
N
N
N NNN
N
2
ZeroE
xtend
Copyright © 2007 Elsevier 7-<23>
Control Unit: ALU Decoder
Not Used11Look at Funct10Subtract01Add00MeaningALUOp1:0
111 (SLT)101010 (slt)1X001 (Or)100101 (or)1X000 (And)100100 (and)1X110 (Subtract)100010 (sub)1X010 (Add)100000 (add)1X110 (Subtract)XX1010 (Add)X00ALUControl2:0FunctALUOp1:0
Copyright © 2007 Elsevier 7-<24>
Control Unit: Main Decoder
000100beq
101011sw
100011lw
000000R-type
ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction
SignImm
CLK
A RDInstruction
Memory
+4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU
7
Copyright © 2007 Elsevier 7-<25>
Control Unit: Main Decoder
01X010X0000100beq
00X101X0101011sw
00000101100011lw
10000011000000R-type
ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction
Copyright © 2007 Elsevier 7-<26>
Single-Cycle Datapath Example: or
SignImm
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU
0010
01
0
0
1
0
Copyright © 2007 Elsevier 7-<27>
Extended Functionality: addi
• No change to datapath
SignImm
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU
Copyright © 2007 Elsevier 7-<28>
Control Unit: addi
001000addi
01X010X0000100beq
00X101X0101011sw
00100101100011lw
10000011000000R-type
ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction
8
Copyright © 2007 Elsevier 7-<29>
Control Unit: addi
00000101001000addi
01X010X0000100beq
00X101X0101011sw
00100101100011lw
10000011000000R-type
ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction
Copyright © 2007 Elsevier 7-<30>
Extended Functionality: j
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU
01
25:0 <<2
27:0 31:28
PCJump
Jump
Copyright © 2007 Elsevier 7-<31>
Control Unit: Main Decoder
000100j
0000
Jump
01X010X0000100beq
00X101X0101011sw
00100101100011lw
10000011000000R-type
ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction
Copyright © 2007 Elsevier 7-<32>
Control Unit: Main Decoder
1XXX0XXX0000100j
0000
Jump
01X010X0000100beq
00X101X0101011sw
00100101100011lw
10000011000000R-type
ALUOp1:0MemtoRegMemWriteBranchAluSrcRegDstRegWriteOp5:0Instruction
9
Copyright © 2007 Elsevier 7-<33>
Single-Cycle Performance
• Clock cycle time is limited by the critical path (lw)
SignImm
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU1
0100
1
0
1
0 0
Copyright © 2007 Elsevier 7-<34>
Single-Cycle Performance
• Single-cycle critical path:Tc = tpcq_PC + tmem + max(tRFread, tsext) + tmux + tALU + tmem + tmux + tRFsetup
• In most implementations, limiting paths are: – memory, ALU, register file. – Tc = tpcq_PC + 2tmem + tRFread + 2tmux + tALU + tRFsetup
Copyright © 2007 Elsevier 7-<35>
Single-Cycle Performance Example
Tc =
30tpcq_PCRegister clock-to-Q
Delay (ps)ParameterElement
20tsetupRegister setup
25tmuxMultiplexer
200tALUALU
250tmemMemory read
150tRFreadRegister file read
20tRFsetupRegister file setup
Copyright © 2007 Elsevier 7-<36>
Single-Cycle Performance Example
Tc = tpcq_PC + 2tmem + tRFread + 2tmux + tALU + tRFsetup
= [30 + 2(250) + 150 + 2(25) + 200 + 20] ps= 950 ps
30tpcq_PCRegister clock-to-Q
Delay (ps)ParameterElement
20tsetupRegister setup
25tmuxMultiplexer
200tALUALU
250tmemMemory read
150tRFreadRegister file read
20tRFsetupRegister file setup
10
Copyright © 2007 Elsevier 7-<37>
Single-Cycle Performance Example
• For a program with 100 billion instructions executing on a single-cycle MIPS processor,
Execution Time =
Copyright © 2007 Elsevier 7-<38>
Single-Cycle Performance Example
• For a program with 100 billion instructions executing on a single-cycle MIPS processor,
Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)= (100 × 109)(1)(950 × 10-12 s)= 95 seconds
Copyright © 2007 Elsevier 7-<39>
Multicycle MIPS Processor
• The single-cycle microarchitecture performs all steps in one cycle+ simple- cycle time limited by longest instruction (lw)- two adders and two memories
• Multicycle microarchitecture uses a variable number of shorter steps (ALU or memory)+ higher clock speed+ simpler instructions run faster+ reuse expensive hardware on multiple cycles- sequencing overhead paid many times
• Same design steps: datapath & controlCopyright © 2007 Elsevier 7-<40>
Multicycle State Elements
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
RegisterFile
PCPC'
WD
WE
CLK
EN
• Replace Instruction and Data memories with a single unified memory– More realistic
11
Copyright © 2007 Elsevier 7-<41>
Multicycle Datapath: instruction fetch
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
RegisterFile
PCPC' Instr
CLK
WD
WE
CLK
EN
IRWrite
• First consider executing lw• STEP 1: Fetch instruction
Copyright © 2007 Elsevier 7-<42>
Multicycle Datapath: lw register read
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
RegisterFile
PCPC' Instr 25:21
CLK
WD
WE
CLK CLK
A
EN
IRWrite
Copyright © 2007 Elsevier 7-<43>
Multicycle Datapath: lw immediate
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
PCPC' Instr 25:21
15:0
CLK
WD
WE
CLK CLK
A
EN
IRWrite
Copyright © 2007 Elsevier 7-<44>
Multicycle Datapath: lw address
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
PCPC' Instr 25:21
15:0
SrcB
ALUResult
SrcA
ALUOut
CLK
ALUControl2:0
ALU
WD
WE
CLK CLK
A CLK
EN
IRWrite
12
Copyright © 2007 Elsevier 7-<45>
Multicycle Datapath: lw memory read
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
PCPC' Instr 25:21
15:0
SrcB
ALUResult
SrcA
ALUOut
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A CLK
EN
IRWriteIorD
01
Copyright © 2007 Elsevier 7-<46>
Multicycle Datapath: lw write register
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
PCPC' Instr 25:21
15:0
SrcB20:16
ALUResult
SrcA
ALUOut
RegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A CLK
EN
IRWriteIorD
01
Copyright © 2007 Elsevier 7-<47>
Multicycle Datapath: increment PC
PCWrite
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01PCPC' Instr 25:21
15:0
SrcB20:16
ALUResult
SrcA
ALUOut
ALUSrcARegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A
00011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorD
01
Copyright © 2007 Elsevier 7-<48>
Multicycle Datapath: sw
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01PC 0
1
PC' Instr 25:21
20:16
15:0
SrcB20:16
ALUResult
SrcA
ALUOut
MemWrite ALUSrcARegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
A
00011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorDPCWrite
B
• Consider sw• Need to write data to memory
13
Copyright © 2007 Elsevier 7-<49>
Multicycle Datapath: R-type Instructions
01
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01PC 0
1
PC' Instr 25:21
20:16
15:0
SrcB20:16
15:11
ALUResult
SrcA
ALUOut
RegDstMemWrite MemtoReg ALUSrcARegWrite
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorDPCWrite
• Consider R-type instructions: add, sub, and, or• Write ALUResult to register file• Writing to rd field of instruction (instead of rt)
Copyright © 2007 Elsevier 7-<50>
• Consider branch instruction: beq• Determine whether register values are equal• Calculate branch target address (BTA) from sign-extended
immediate and PC+4
Multicycle Datapath: beq
SignImm
b
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg ALUSrcARegWrite
Zero
PCSrc
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorD PCWrite
PCEn
Copyright © 2007 Elsevier 7-<51>
Complete Multicycle Processor
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
Copyright © 2007 Elsevier 7-<52>
Control Unit
ALUSrcA
PCSrc
Branch
ALUSrcB1:0
Opcode5:0
ControlUnit
ALUControl2:0Funct5:0
MainController
(FSM)
ALUOp1:0
ALUDecoder
RegWrite
PCWrite
IorD
MemWriteIRWrite
RegDstMemtoReg
RegisterEnables
MultiplexerSelects
14
Copyright © 2007 Elsevier 7-<53>
Main Controller FSM: Fetch
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
0
1 1
0
X
X
00
01
0100
10
Reset
S0: Fetch
Copyright © 2007 Elsevier 7-<54>
Main Controller FSM: Fetch
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
0
1 1
0
X
X
00
01
0100
10
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
Reset
S0: Fetch
Copyright © 2007 Elsevier 7-<55>
Main Controller FSM: Decode
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
Reset
S0: Fetch S1: Decode
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
X
0 0
0
X
X
0X
XX
XXXX
00
Copyright © 2007 Elsevier 7-<56>
Main Controller FSM: Address Calculation
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
Reset
S0: Fetch
S2: MemAdr
S1: Decode
Op = LWor
Op = SW
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
X
0 0
0
X
X
01
10
010X
00
15
Copyright © 2007 Elsevier 7-<57>
Main Controller FSM: Address Calculation
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
Reset
S0: Fetch
S2: MemAdr
S1: Decode
Op = LWor
Op = SW
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
X
0 0
0
X
X
01
10
010X
00
Copyright © 2007 Elsevier 7-<58>
Main Controller FSM: lw
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemRead
Op = LWor
Op = SW
Op = LW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
Copyright © 2007 Elsevier 7-<59>
Main Controller FSM: sw
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1 IorD = 1MemWrite
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
Op = LWor
Op = SW
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
Copyright © 2007 Elsevier 7-<60>
Main Controller FSM: R-Type
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
Op = LWor
Op = SWOp = R-type
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
16
Copyright © 2007 Elsevier 7-<61>
Main Controller FSM: beq
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
Copyright © 2007 Elsevier 7-<62>
Complete Multicycle Controller FSM
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
Copyright © 2007 Elsevier 7-<63>
Main Controller FSM: addi
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
Op = ADDI
S9: ADDIExecute
S10: ADDIWriteback
Copyright © 2007 Elsevier 7-<64>
Main Controller FSM: addi
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 0
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 1
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
RegDst = 0MemtoReg = 0
RegWrite
Op = ADDI
S9: ADDIExecute
S10: ADDIWriteback
17
Copyright © 2007 Elsevier 7-<65>
Extended Functionality: j
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01PC 0
1
PC' Instr 25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg ALUSrcARegWrite
Zero
PCSrc1:0
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorD PCWrite
PCEn
000110
<<2
25:0 (jump)
31:28
27:0
PCJump
Copyright © 2007 Elsevier 7-<66>
Control FSM: j
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 00
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 01
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
RegDst = 0MemtoReg = 0
RegWrite
Op = ADDI
S9: ADDIExecute
S10: ADDIWriteback
Op = J
S11: Jump
Copyright © 2007 Elsevier 7-<67>
Control FSM: j
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 00
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 0RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 01
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 1
RegWrite
S4: MemWriteback
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
RegDst = 0MemtoReg = 0
RegWrite
Op = ADDI
S9: ADDIExecute
S10: ADDIWriteback
PCSrc = 10PCWrite
Op = J
S11: Jump
Copyright © 2007 Elsevier 7-<68>
Multicycle Performance
• Instructions take different number of cycles:– 3 cycles: beq, j– 4 cycles: R-Type, sw, addi– 5 cycles: lw
• CPI is weighted average• SPECINT2000 benchmark:
– 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type
Average CPI = (0.11 + 0.2)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
18
Copyright © 2007 Elsevier 7-<69>
Multicycle Performance
• Multicycle critical path:Tc =
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
Copyright © 2007 Elsevier 7-<70>
Multicycle Performance
• Multicycle critical path:Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01 0
1
PC 01
PC' Instr 25:21
20:16
15:0
5:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
31:26
RegD
st
Branch
MemWrite
Mem
toReg
ALUSrcARegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
Copyright © 2007 Elsevier 7-<71>
Multicycle Performance Example
Tc =
30tpcqRegister clock-to-Q
Delay (ps)ParameterElement
20tsetupRegister setup
25tmuxMultiplexer
200tALUALU
250tmemMemory read
150tRFreadRegister file read
20tRFsetupRegister file setup
Copyright © 2007 Elsevier 7-<72>
Multicycle Performance Example
Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup
= tpcq_PC + tmux + tmem + tsetup
= [30 + 25 + 250 + 20] ps= 325 ps
30tpcq_PCRegister clock-to-Q
Delay (ps)ParameterElement
20tsetupRegister setup
25tmuxMultiplexer
200tALUALU
250tmemMemory read
150tRFreadRegister file read
20tRFsetupRegister file setup
19
Copyright © 2007 Elsevier 7-<73>
Multicycle Performance Example
• For a program with 100 billion instructions executing on a multicycleMIPS processor– CPI = 4.12– Tc = 325 ps
Execution Time =
Copyright © 2007 Elsevier 7-<74>
Multicycle Performance Example
• For a program with 100 billion instructions executing on a multicycleMIPS processor– CPI = 4.12– Tc = 325 ps
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(4.12)(325 × 10-12)= 133.9 seconds
• This is slower than the single-cycle processor (95 seconds). Why?
Copyright © 2007 Elsevier 7-<75>
Multicycle Performance Example
• For a program with 100 billion instructions executing on a multicycleMIPS processor– CPI = 4.12– Tc = 325 ps
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(4.12)(325 × 10-12)= 133.9 seconds
• This is slower than the single-cycle processor (95 seconds). Why?– Not all steps the same length– Sequencing overhead for each step (tpcq + tsetup= 50 ps)
Copyright © 2007 Elsevier 7-<76>
Review: Single-Cycle MIPS Processor
SignImm
CLK
A RDInstruction
Memory
+4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01 PC' Instr 25:21
20:16
15:0
5:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
31:26
RegDst
BranchMemWriteMemtoReg
ALUSrc
RegWrite
OpFunct
ControlUnit
Zero
PCSrc
CLK
ALUControl2:0
ALU
01
25:0 <<2
27:0 31:28
PCJump
Jump
20
Copyright © 2007 Elsevier 7-<77>
Review: Multicycle MIPS Processor
ImmExt
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01PC 0
1
PC' Instr 25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
ZeroCLK
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN000110
<<2
25:0 (Addr)
31:28
27:0
PCJump
5:0
31:26
Branch
MemWrite
ALUSrcARegWrite
OpFunct
ControlUnit
PCSrc
CLK
ALUControl2:0
ALUSrcB1:0IRWrite
IorD
PCWritePCEn
RegD
st
Mem
toReg
Copyright © 2007 Elsevier 7-<78>
Pipelined MIPS Processor
• Temporal parallelism• Divide single-cycle processor into 5 stages:
– Fetch– Decode– Execute– Memory– Writeback
• Add pipeline registers between stages
Copyright © 2007 Elsevier 7-<79>
Single-Cycle vs. Pipelined Performance
Time (ps)Instr
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead / Write
WriteReg1
2
0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000
Instr
1
2
3
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead / Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
Single-Cycle
Pipelined
Copyright © 2007 Elsevier 7-<80>
Pipelining Abstraction
Time (cycles)
lw $s2, 40($0) RF 40
$0RF
$s2+ DM
RF $t2
$t1RF
$s3+ DM
RF $s5
$s1RF
$s4- DM
RF $t6
$t5RF
$s5& DM
RF 20
$s1RF
$s6+ DM
RF $t4
$t3RF
$s7| DM
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
1 2 3 4 5 6 7 8 9 10
add
IM
IM
IM
IM
IM
IM lw
sub
and
sw
or
21
Copyright © 2007 Elsevier 7-<81>
Single-Cycle and Pipelined Datapath
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PCF01
PC' InstrD25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLKCLK
CLK
SignImm
CLK
A RDInstruction
Memory
+4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PC01
PC' Instr 25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory WritebackCopyright © 2007 Elsevier 7-<82>
Corrected Pipelined Datapath
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PCF01
PC' InstrD 25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
WriteRegW4:0
ALU
WriteRegE4:0
CLKCLK
CLK
Fetch Decode Execute Memory Writeback
• WriteReg must arrive at the same time as Result
Copyright © 2007 Elsevier 7-<83>
Pipelined Control
SignImmE
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
01
01
A RDData
MemoryWD
WE01
PCF01
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
ZeroM
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
BranchE BranchM
RegDstE
ALUSrcE
WriteRegE4:0
Same control unit as single-cycle processor
Control delayed to proper pipeline stageCopyright © 2007 Elsevier 7-<84>
Pipeline Hazard
• Occurs when an instruction depends on results from previous instruction that hasn’t completed.
• Types of hazards:– Data hazard: register value not written back to register
file yet– Control hazard: next instruction not decided yet
(caused by branches)
22
Copyright © 2007 Elsevier 7-<85>
Data Hazard
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM add
or
sub
Copyright © 2007 Elsevier 7-<86>
Handling Data Hazards
• Insert nops in code at compile time• Rearrange code at compile time• Forward data at run time• Stall the processor at run time
Copyright © 2007 Elsevier 7-<87>
Compile-Time Hazard Elimination
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM add
or
sub
nop
nop
RF RFDMnopIM
RF RFDMnopIM
9 10
• Insert enough nops for result to be ready• Or move independent useful instructions forward
Copyright © 2007 Elsevier 7-<88>
Data Forwarding
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM add
or
sub
23
Copyright © 2007 Elsevier 7-<89>
Data Forwarding
SignImmE
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
01
01
A RDData
MemoryWD
WE
10
PCF01
PC' InstrD 25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Forw
ardA
E
Forw
ardB
E
20:16 RtE
RsD
RdD
RtD
Reg
Writ
eM
Reg
Writ
eW
Hazard Unit
PCPlus4E
BranchE BranchM
ZeroM
Copyright © 2007 Elsevier 7-<90>
Data Forwarding
• Forward to Execute stage from either:– Memory stage or– Writeback stage
• Forwarding logic for ForwardAE:if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01else ForwardAE = 00
• Forwarding logic for ForwardBE same, but replace rsEwith rtE
Copyright © 2007 Elsevier 7-<91>
Stalling
Time (cycles)
lw $s0, 40($0) RF 40
$0RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM lw
or
sub
Trouble!
Copyright © 2007 Elsevier 7-<92>
Stalling
Time (cycles)
lw $s0, 40($0) RF 40
$0RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM lw
or
sub
9
RF $s1
$s0
IM or
Stall
24
Copyright © 2007 Elsevier 7-<93>
Stalling Hardware
SignImmE
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
01
01
A RDData
MemoryWD
WE
10
PCF01
PC' InstrD 25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Stal
lF
Stal
lD
Forw
ardA
E
Forw
ardB
E
20:16 RtE
RsD
RdD
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flus
hE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Copyright © 2007 Elsevier 7-<94>
Stalling Hardware
• Stalling logic:lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
Copyright © 2007 Elsevier 7-<95>
Control Hazards
• beq: – branch is not determined until the fourth stage of the pipeline– Instructions after the branch are fetched before branch occurs– These instructions must be flushed if the branch happens
• Branch misprediction penalty– number of instruction flushed when branch is taken– May be reduced by determining branch earlier
Copyright © 2007 Elsevier 7-<96>
Control Hazards: Original Pipeline
SignImmE
CLK
A RDInstruction
Memory
+4
A1
A3WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
01
01
A RDData
MemoryWD
WE
10
PCF01
PC' InstrD 25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Stal
lF
Stal
lD
Forw
ardA
E
Forw
ardB
E
20:16 RtE
RsD
RdD
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flus
hE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
25
Copyright © 2007 Elsevier 7-<97>
Control Hazards
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1RF- DM
RF $s1
$s0RF& DM
RF $s0
$s4RF| DM
RF $s5
$s0RF- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM lw
or
sub
20
24
28
2C
30
...
...
9
Flushthese
instructions
64 slt $t3, $s2, $s3 RF $s3
$s2RF
$t3slt DMIM slt
Copyright © 2007 Elsevier 7-<98>
Control Hazards: Early Branch Resolution
EqualD
SignImmE
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
01
01
A RDData
MemoryWD
WE
10
PCF01
PC' InstrD 25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
=
SignImmD
Stal
lF
Stal
lD
Forw
ardA
E
Forw
ardB
E
20:16 RtE
RsD
RdE
RtD
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Hazard Unit
Flus
hE
EN
EN
CLR
CLR
Introduced another data hazard in Decode stage
Copyright © 2007 Elsevier 7-<99>
Control Hazards with Early Branch Resolution
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1RF- DM
RF $s1
$s0RF& DMand $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
andIM
IM lw20
24
28
2C
30
...
...
9
Flushthis
instruction
64 slt $t3, $s2, $s3 RF $s3
$s2RF
$t3slt DMIM slt
Copyright © 2007 Elsevier 7-<100>
Handling Data and Control Hazards
EqualD
SignImmE
CLK
A RDInstruction
Memory
+
4
A1
A3WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
01
01
A RDData
MemoryWD
WE
10
PCF01
PC' InstrD 25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
01
01
=
SignImmD
Stal
lF
Stal
lD
Forw
ardA
E
Forw
ardB
E
Forw
ardA
D
Forw
ardB
D
20:16 RtE
RsD
RdD
RtD
Reg
Writ
eE
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Bran
chD
Hazard Unit
Flus
hE
EN
EN
CLR
CLR
26
Copyright © 2007 Elsevier 7-<101>
Control Forwarding and Stalling Hardware
• Forwarding logic:ForwardAD = (rsD !=0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD !=0) AND (rtD == WriteRegM) AND RegWriteM
• Stalling logic:branchstall = BranchD AND RegWriteE AND
(WriteRegE == rsD OR WriteRegE == rtD) OR BranchD AND MemtoRegM AND
(WriteRegM == rsD OR WriteRegM == rtD)
StallF = StallD = FlushE = lwstall OR branchstall
Copyright © 2007 Elsevier 7-<102>
Branch Prediction
• Guess whether branch will be taken– Backward branches are usually taken (loops)– Perhaps consider history of whether branch was
previously taken to improve the guess
• Good prediction reduces the fraction of branches requiring a flush
Copyright © 2007 Elsevier 7-<103>
Pipelined Performance Example
• Ideally CPI = 1• But need to handle stalling (caused by loads and branches)• SPECINT2000 benchmark:
– 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type
• Suppose:– 40% of loads used by next instruction– 25% of branches mispredicted
• What is the average CPI?
Copyright © 2007 Elsevier 7-<104>
Pipelined Performance Example
• SPECINT2000 benchmark: – 25% loads– 10% stores – 11% branches– 2% jumps– 52% R-type
• Suppose:– 40% of loads used by next instruction– 25% of branches mispredicted– All jumps flush next instruction
• What is the average CPI?– Load/Branch CPI = 1 when no stalling, 2 when stalling. Thus,– CPIlw = 1(0.6) + 2(0.4) = 1.4– CPIbeq = 1(0.75) + 2(0.25) = 1.25– Thus,
Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1)
= 1.15
27
Copyright © 2007 Elsevier 7-<105>
Pipelined Performance
• Pipelined processor critical path:Tc = max {
tpcq + tmem + tsetup
2(tRFread + tmux + teq + tAND + tmux + tsetup )tpcq + tmux + tmux + tALU + tsetup
tpcq + tmemwrite + tsetup
2(tpcq + tmux + tRFwrite) }
Copyright © 2007 Elsevier 7-<106>
Pipelined Performance Example
100 pstRFwriteRegister file write220TmemwriteMemory write15tANDAND gate40teqEquality comparator
30tpcq_PCRegister clock-to-QDelay (ps)ParameterElement
20tsetupRegister setup25tmuxMultiplexer200tALUALU250tmemMemory read150tRFreadRegister file read20tRFsetupRegister file setup
Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )= 2[150 + 25 + 40 + 15 + 25 + 20] ps = 550 ps
Copyright © 2007 Elsevier 7-<107>
Pipelined Performance Example
• For a program with 100 billion instructions executing on a pipelined MIPS processor,
• CPI = 1.15• Tc = 550 ps
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(1.15)(550 × 10-12)= 63 seconds
1.5163Pipelined0.71133Multicycle195Single-cycle
Speedup(single-cycle is baseline)
Execution Time(seconds)Processor
Copyright © 2007 Elsevier 7-<108>
Review: Exceptions
• Unscheduled procedure call to the exception handler• Casued by:
– Hardware, also called an interrupt, e.g. keyboard– Software, also called traps, e.g. undefined instruction
• When exception occurs, the processor:– Records the cause of the exception (Cause register)– Jumps to the exception handler at instruction address 0x80000180– Returns to program (EPC register)
28
Copyright © 2007 Elsevier 7-<109>
Example Exception
Copyright © 2007 Elsevier 7-<110>
Exception Registers
• Not part of the register file.– Cause
• Records the cause of the exception• Coprocessor 0 register 13
– EPC (Exception PC)• Records the PC where the exception occurred• Coprocessor 0 register 14
• Move from Coprocessor 0– mfc0 $t0, Cause– Moves the contents of Cause into $t0
00000 $t0 (8) Cause (13) 00000000000
mfc0
31:26 25:21 20:16 15:11 10:0
010000
Copyright © 2007 Elsevier 7-<111>
Exception Causes
0x00000030Arithmetic Overflow
0x00000028Undefined Instruction
0x00000024Breakpoint / Divide by 0
0x00000020System Call
0x00000000Hardware Interrupt
CauseException
We extend the multicycle MIPS processor to handle the last two types of exceptions.
Copyright © 2007 Elsevier 7-<112>
Exception Hardware: EPC & Cause
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01PC 0
1
PC' Instr 25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg ALUSrcARegWrite
Zero
PCSrc1:0
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
01
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorD PCWrite
PCEn
<<2
25:0 (jump)
31:28
27:0
PCJump
00011011
0x8000 0180
Overflow
CLK
EN
EPCWrite
CLK
EN
CauseWrite
01
IntCause
0x30
0x28EPC
Cause
29
Copyright © 2007 Elsevier 7-<113>
Exception Hardware: mfc0
SignImm
CLK
ARD
Instr / DataMemory
A1
A3
WD3
RD2RD1
WE3
A2
CLK
Sign Extend
RegisterFile
01
01PC 0
1
PC' Instr 25:21
20:16
15:0
SrcB20:16
15:11
<<2
ALUResult
SrcA
ALUOut
RegDst BranchMemWrite MemtoReg1:0 ALUSrcARegWrite
Zero
PCSrc1:0
CLK
ALUControl2:0
ALU
WD
WE
CLK
Adr
0001
Data
CLK
CLK
AB 00
011011
4
CLK
ENEN
ALUSrcB1:0IRWriteIorD PCWrite
PCEn
<<2
25:0 (jump)
31:28
27:0
PCJump
00011011
0x8000 0180
CLK
EN
EPCWrite
CLK
EN
CauseWrite
01
IntCause
0x30
0x28EPC
Cause
Overflow
...01101
01110
...15:11
10
C0
Copyright © 2007 Elsevier 7-<114>
Control FSM with Exceptions
IorD = 0AluSrcA = 0
ALUSrcB = 01ALUOp = 00PCSrc = 00
IRWritePCWrite
ALUSrcA = 0ALUSrcB = 11ALUOp = 00
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
IorD = 1RegDst = 1
MemtoReg = 00RegWrite
IorD = 1MemWrite
ALUSrcA = 1ALUSrcB = 00ALUOp = 10
ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCSrc = 01
Branch
Reset
S0: Fetch
S2: MemAdr
S1: Decode
S3: MemReadS5: MemWrite
S6: Execute
S7: ALUWriteback
S8: Branch
Op = LWor
Op = SWOp = R-type
Op = BEQ
Op = LWOp = SW
RegDst = 0MemtoReg = 01
RegWrite
S4: MemWriteback
ALUSrcA = 1ALUSrcB = 10ALUOp = 00
RegDst = 0MemtoReg = 00
RegWrite
Op = ADDI
S9: ADDIExecute
S10: ADDIWriteback
PCSrc = 10PCWrite
Op = J
S11: Jump
Overflow OverflowS13:
OverflowPCSrc = 11
PCWriteIntCause = 0CauseWriteEPCWrite
Op = others
PCSrc = 11PCWrite
IntCause = 1CauseWriteEPCWrite
S12: Undefined
RegDst = 0Memtoreg = 10
RegWrite
Op = mfc0
S14: MFC0
Copyright © 2007 Elsevier 7-<115>
Advanced MicroArchitecture
• Deep Pipelining• Branch Prediction• Superscalar Processors• Out of Order Processors• Register Renaming• SIMD• Multithreading• Multiprocessors
Copyright © 2007 Elsevier 7-<116>
Deep Pipelining
• 10-20 stages typical• Number of stages limited by:
– Pipeline hazards– Sequencing overhead– Power– Cost
30
Copyright © 2007 Elsevier 7-<117>
Branch Prediction
• Ideal pipelined processor: CPI = 1• Branch misprediction increases CPI• Static branch prediction:
– Check direction of branch (forward or backward)– If backward, predict taken– Otherwise, predict not taken
• Dynamic branch prediction:– Keep history of last (several hundred) branches in a
branch target buffer which holds:• Branch destination• Whether branch was taken
Copyright © 2007 Elsevier 7-<118>
Branch Prediction Example
add $s1, $0, $0 # sum = 0add $s0, $0, $0 # i = 0addi $t0, $0, 10 # $t0 = 10
for:beq $t0, $t0, done # if i == 10, branchadd $s1, $s1, $s0 # sum = sum + iaddi $s0, $s0, 1 # increment ij for
done:
Copyright © 2007 Elsevier 7-<119>
1-Bit Branch Predictor
• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
Copyright © 2007 Elsevier 7-<120>
2-Bit Branch Predictor
• Only mispredicts last branch of loop
stronglytaken
predicttaken
weaklytaken
predicttaken
weaklynot taken
predictnot taken
stronglynot taken
predictnot taken
taken taken taken
takentakentaken
taken
taken
31
Copyright © 2007 Elsevier 7-<121>
Superscalar
• Multiple copies of datapath execute multiple instructions at once
• Dependencies make it tricky to issue multiple instructions at once
CLK CLK CLK CLK
ARD A1
A2RD1A3
WD3WD6
A4A5A6
RD4
RD2RD5
InstructionMemory
RegisterFile Data
Memory
ALU
s
PC
CLK
A1A2
WD1WD2
RD1RD2
Copyright © 2007 Elsevier 7-<122>
Superscalar Example
lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 2or $t3, $s5, $s6sw $s7, 80($t3)
Time (cycles)
1 2 3 4 5 6 7 8
RF40
$s0
RF
$t0+
DMIM
lw
add
lw $t0, 40($s0)
add $t1, $s1, $s2
sub $t2, $s1, $s3
and $t3, $s3, $s4
or $t4, $s1, $s5
sw $s5, 80($s0)
$t1$s2
$s1
+
RF$s3
$s1
RF
$t2-
DMIM
sub
and $t3$s4
$s3&
RF$s5
$s1
RF
$t4|
DMIM
or
sw80
$s0
+ $s5
Copyright © 2007 Elsevier 7-<123>
Superscalar Example with Dependencies
lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/5 = 1.17or $t3, $s5, $s6sw $s7, 80($t3)
Stall
Time (cycles)
1 2 3 4 5 6 7 8
RF40
$s0
RF
$t0+
DMIM
lwlw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
sw $s7, 80($t3)
RF$s1
$t0add
RF$s1
$t0
RF
$t1+
DM
RF$t0
$s4
RF
$t2&
DMIM
and
IMor
and
sub
|$s6
$s5$t3
RF80
$t3
RF+
DMsw
IM
$s7
9
$s3
$s2
$s3
$s2-
$t0
oror $t3, $s5, $s6
IM
Copyright © 2007 Elsevier 7-<124>
Out of Order Processor
• Looks ahead across multiple instructions to issue as many as possible at once
• Issues instructions out of order as long as no dependencies• Dependencies:
– RAW (read after write): one instruction writes, and later instruction reads a register
– WAR (write after read): one instruction reads, and a later instruction writes a register (also called an antidependence)
– WAW (write after write): one instruction writes, and a later instruction writes a register (also called an output dependence)
32
Copyright © 2007 Elsevier 7-<125>
Out of Order Processor
• Instruction level parallelism: the number of instruction that can be issued simultaneously (in practice average < 3)
• Scoreboard: table that keeps track of:– Instructions waiting to issue– Available functional units– Dependencies
Copyright © 2007 Elsevier 7-<126>
Out of Order Processor Example
lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/4 = 1.5or $t3, $s5, $s6sw $s7, 80($t3)
Time (cycles)
1 2 3 4 5 6 7 8
RF40
$s0
RF
$t0+
DMIM
lwlw $t0, 40($s0)
add $t1, $t0, $s1
sub $t0, $s2, $s3
and $t2, $s4, $t0
sw $s7, 80($t3)
or|$s6
$s5$t3
RF80
$t3
RF+
DMsw $s7
or $t3, $s5, $s6
IM
RF$s1
$t0
RF
$t1+
DMIM
add
sub-$s3
$s2$t0
two cycle latencybetween load anduse of $t0
RAW
WAR
RAW
RF$t0
$s4
RF&
DMand
IM
$t2
RAW
Copyright © 2007 Elsevier 7-<127>
Register Renaming
Time (cycles)
1 2 3 4 5 6 7
RF40
$s0
RF
$t0+
DMIM
lwlw $t0, 40($s0)
add $t1, $t0, $s1
sub $r0, $s2, $s3
and $t2, $s4, $r0
sw $s7, 80($t3)
sub-$s3
$s2$r0
RF$r0
$s4
RF&
DMand
$s7
or $t3, $s5, $s6IM
RF$s1
$t0
RF
$t1+
DMIM
add
sw+80
$t3
RAW
$s6
$s5|
or
2-cycle RAW
RAW
$t2
$t3
lw $t0, 40($s0)add $t1, $t0, $s1sub $t0, $s2, $s3 Ideal IPC: 2and $t2, $s4, $t0 Actual IPC: 6/3 = 2or $t3, $s5, $s6sw $s7, 80($t3)
Copyright © 2007 Elsevier 7-<128>
SIMD
• Single Instruction Multiple Data (SIMD)– Single instruction acts on multiple pieces of data at once– Common application: graphics– Perform short arithmetic operations (also called packed
arithmetic)
• For example, add four 8-bit elements• Must modify ALU to eliminate carries between 8-
bit valuespadd8 $s2, $s0, $s1
a0
0781516232432 Bit position
$s0a1a2a3
b0 $s1b1b2b3
a0 + b0 $s2a1 + b1a2 + b2a3 + b3
+
33
Copyright © 2007 Elsevier 7-<129>
Multithreading: First Some Definitions
• Process: program running on a computer• Multiple processes can run at once: e.g., surfing Web,
playing music, writing a paper• Thread: part of a program• Each process has multiple threads: e.g., a word processor
may have threads for typing, spell checking, printing• In a conventional processor:
– One thread runs at once– When one thread stalls (for example, waiting for memory):
• Architectural state of that thread is stored• Architectural state of waiting thread is loaded into processor and it runs• Called context switching
– But appears to user like all threads running simultaneously
Copyright © 2007 Elsevier 7-<130>
Multithreading
• Multiple copies of architectural state• Multiple threads active at once:
– When one thread stalls, another runs immediately (no need to store or restore architectural state)
– If one thread can’t keep all execution units busy, another thread can use them
• Does not increase ILP of single thread, but does increase throughput
Copyright © 2007 Elsevier 7-<131>
Multiprocessors
• Multiple processors (cores) with a method of communication between them
• Types of multiprocessing:– Symmetric multiprocessing (SMT): multiple cores with a shared
memory– Asymmetric multiprocessing: separate cores for different tasks
(for example, DSP and CPU in cell phone)– Clusters: each core has its own memory system
Copyright © 2007 Elsevier 7-<132>
Other Resources
• Patterson & Hennessy’s: Computer Architecture: A Quantitative Approach
• Conferences:– www.cs.wisc.edu/~arch/www/
– ISCA (International Symposium on Computer Architecture)
– HPCA (International Symposium on High Performance Computer Architecture)
top related