csc 252: computer organization spring 2020: lecture 13 · 2020-02-28 · carnegie mellon 3...
TRANSCRIPT
CSC 252: Computer OrganizationSpring 2020: Lecture 13
Substitute Instructor: Sandhya Dwarkadas
Department of Computer ScienceUniversity of Rochester
Carnegie Mellon
2
Announcement• Programming assignment 3 due soon
• Details: https://www.cs.rochester.edu/courses/252/spring2020/labs/assignment3.html
• Due on Feb. 28, 11:59 PM• You (may still) have 3 slip days
Today Due
Carnegie Mellon
3
Announcement• Programming assignment 3 is in x86 assembly language. Seek
help from TAs.• TAs are best positioned to answer your questions about
programming assignments!!!• Programming assignments do NOT repeat the lecture materials.
They ask you to synthesize what you have learned from the lectures and work out something new.
CS:APP3e2/28/20 4
Building Blocks
Combinational Logic■ Compute Boolean functions of
inputs■ Continuously respond to input
changes■ Operate on data and implement
control
Storage Elements■ Store bits■ Addressable memories■ Non-addressable registers■ Loaded only as clock rises
Registerfile
A
B
W dstW
srcA
valA
srcB
valB
valW
Clock
ALU
fun
A
B
MUX0
1
=
Clock
Carnegie Mellon
Combinational Logic
Microarchitecture (So far): Single Cycle
5
Register File
FlagsZ S O
PC
Clock
Memory
Inst. Rd/Wr Reg. IDs
CurrentReg.
Values
Cur. Flag ValuesEnable?
New Flag Values
Cur. PC
NewPC
New Reg. Valus
Enable?
ALU
Logic for generating ALU select signal
Logic for generating new PC value
Logic for generating new flag value
Logic for deciding all the enable signal values
Read current_states;next_states = f(current_states);
When clock rises, current_states = next_states;
Carnegie Mellon
Executing a MOV instruction
6
ALU
Select
Clock
Register File
Write Reg. ID
Read Reg. ID 1
Read Reg. ID 2
RA
RB
newData
Enable
Logic 1
Logic 2
71012
s0s1s2s3
FlagsZ S O
PC
Logic 3
Clock
nPCoPC
Logic 4
3…
s4s5…
• How do we modify the hardware to execute a move instruction?
Logic 5 EnableF
Flags [s2…s9]
Drmmovq rA, D(rB) 4 0 rA rB
rmmovq %rsi,0x41c(%rsp) 40 64 1c 04 00 00 00 00 00 00
Memory
move rA to the memory address rB + D
Carnegie Mellon
7
ALU
Select
Clock
Register File
Write Reg. ID
Read Reg. ID 1
Read Reg. ID 2
RA
RB
newData
Enable
Logic 1
Logic 2
40641
s0s1s2s3
FlagsZ S O
PC
Logic 3
Clock
nPCoPC
Logic 4
c…
s4s5… Logic 5 EnableF
Flags [s2…s9]
Memory
• Need new logic (Logic 6) to select the input to the ALU for Enable.• How about other logics?
MUX
[s4…s19]
Logic 6
Address
Drmmovq rA, D(rB) 4 0 rA rB
Data to write
move rA to the memory address rB + D
Enable
Carnegie Mellon
8
ALU
Select
Clock
Register File
Write Reg. ID
Read Reg. ID 1
Read Reg. ID 2
RA
RB
newData
Enable
Logic 1
Logic 2
40641
s0s1s2s3
FlagsZ S O
PC
Logic 3
Clock
nPCoPC
Logic 4
c…
s4s5… Logic 5 EnableF
Flags [s2…s9]
Memory
MUX
[s4…s19]
Address
How About Memory to Register MOV?
Dmrmovq D(rB), rA 4 0 rA rB
move data at memory address rB + D to rA
Data to writeEnable Logic 6
Carnegie Mellon
9
ALU
Select
Clock
Register File
Write Reg. ID
Read Reg. ID 1
Read Reg. ID 2
RA
RB
newData
Enable
Logic 1
Logic 2
40641
s0s1s2s3
FlagsZ S O
PC
Logic 3
Clock
nPCoPC
Logic 4
c…
s4s5… Logic 5 EnableF
Flags [s2…s9]
Memory
MUX
[s4…s19]
Address
How About Memory to Register MOV?
Dmrmovq D(rB), rA 4 0 rA rB
move data at memory address rB + D to rA
Data to writeEnable Logic 6
MUX
Data read back
Logic 7
Carnegie Mellon
Combinational Logic
Microarchitecture (with MOV)
10
Register File
FlagsZ S O
PC
Clock
Memory
Inst.Rd/Wr
Reg. IDsCurrent
Reg. Values
Cur. Flag ValuesEnable?
New Flag Values
Cur. PC
NewPC
New Reg. Valus
Enable?
Read current_states;next_states = f(current_states);
When clock rises, current_states = next_states;
DataNew Data Addr.
next_states has to be ready before the clock rises again
Carnegie Mellon
Microarchitecture OverviewThink of it as a state machineEvery cycle, one instruction gets
executed. At the end of the cycle, architecture states get modified.
States (All updated as clock rises)
■ PC register■ Cond. Code register■ Data memory■ Register file
11
Combinationallogic
Datamemory
Registerfile
%rbx = 0x100
PC0x014
CC100
Readports
Writeports
Read Write
Carnegie Mellon
• state set according to second irmovq instruction
• combinational logic starting to react to state changes
12
0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000
0x016: je dest # Not taken
0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300
Cycle 3:
Cycle 4:
Cycle 5:
0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200Cycle 2:
0x000: irmovq $0x100,%rbx # %rbx <-- 0x100Cycle 1:
ClockCycle 1
① ③ ④②
Cycle 2 Cycle 3 Cycle 4
Combinationallogic
Datamemory
Registerfile
%rbx = 0x100
PC0x014
CC100
Readports
Writeports
Read Write
Carnegie Mellon
• state set according to second irmovq instruction
• combinational logic generates results for addq instruction
13
Combinationallogic
Datamemory
Registerfile
%rbx = 0x100
PC0x014
CC100
Readports
Writeports
0x016
000 %rbx<--
0x300
Read Write
0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000
0x016: je dest # Not taken
0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300
Cycle 3:
Cycle 4:
Cycle 5:
0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200Cycle 2:
0x000: irmovq $0x100,%rbx # %rbx <-- 0x100Cycle 1:
ClockCycle 1
① ③ ④②
Cycle 2 Cycle 3 Cycle 4
Carnegie Mellon
• state set according to addqinstruction
• combinational logic starting to react to state changes
14
0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000
0x016: je dest # Not taken
0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300
Cycle 3:
Cycle 4:
Cycle 5:
0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200Cycle 2:
0x000: irmovq $0x100,%rbx # %rbx <-- 0x100Cycle 1:
ClockCycle 1
① ③ ④②
Cycle 2 Cycle 3 Cycle 4
Combinationallogic
Datamemory
Registerfile
%rbx = 0x300
PC0x016
CC000
Readports
Writeports
Read Write
Carnegie Mellon
• state set according to addqinstruction
• combinational logic generates results for je instruction
15
0x014: addq %rdx,%rbx # %rbx <-- 0x300 CC <-- 000
0x016: je dest # Not taken
0x01f: rmmovq %rbx,0(%rdx) # M[0x200] <-- 0x300
Cycle 3:
Cycle 4:
Cycle 5:
0x00a: irmovq $0x200,%rdx # %rdx <-- 0x200Cycle 2:
0x000: irmovq $0x100,%rbx # %rbx <-- 0x100Cycle 1:
ClockCycle 1
① ③ ④②
Cycle 2 Cycle 3 Cycle 4
Combinationallogic
Datamemory
Registerfile
%rbx = 0x300
PC0x016
CC000
Readports
Writeports
0x01f
Read Write
Carnegie Mellon
Principles:• Execute each instruction one at a time, one after another• Express every instruction as series of simple steps• Dedicated hardware structure for completing each step• Follow same general flow for each instruction type
Fetch: Read instruction from instruction memoryDecode: Read program registersExecute: Compute value or addressMemory: Read or write dataWrite Back: Write program registersPC: Update program counter
16
Another Way to Look At the Microarchitecture
Carnegie Mellon
Fetch■ Read instruction from instruction memoryDecode■ Read program registersExecute■ Compute value or addressMemory■ Read or write dataWrite Back■ Write program registersPC■ Update program counter
17
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
icode, ifunrA , rB
valC
Registerfile
Registerfile
A B M
E
Registerfile
Registerfile
A B M
E
PC
valP
srcA, srcBdstA, dstB
valA, valB
aluA, aluB
Cnd
valE
Addr, Data
valM
PCvalE, valM
newPC
Carnegie Mellon
Stage Computation: Arith/Log. Ops
18
OPq rA, rBicode:ifun ¬ M1[PC]rA:rB ¬ M1[PC+1]
valP ¬ PC+2
Fetch
Read instruction byteRead register byte
Compute next PCvalA ¬ R[rA]valB ¬ R[rB]
DecodeRead operand ARead operand B
valE ¬ valB OP valASet CC
ExecutePerform ALU operationSet condition code register
MemoryR[rB] ¬ valEWrite
backWrite back result
PC ¬ valPPC update Update PC
OPq rA, rB 6 fn rA rB
Carnegie Mellon
Stage Computation: rmmovq
19
rmmovq rA, D(rB)icode:ifun ¬ M1[PC]rA:rB ¬ M1[PC+1]valC ¬ M8[PC+2]valP ¬ PC+10
Fetch
Read instruction byteRead register byteRead displacement DCompute next PC
valA ¬ R[rA]valB ¬ R[rB]
DecodeRead operand ARead operand B
valE ¬ valB + valCExecute
Compute effective address
M8[valE] ¬ valAMemory Write value to memory Writeback
PC ¬ valPPC update Update PC
rmmovq rA, D(rB) 4 0 rA rB D
Carnegie Mellon
Stage Computation: Jumps
• Compute both addresses• Choose based on setting of condition codes and branch condition
20
jXX Desticode:ifun ¬ M1[PC]
valC ¬ M8[PC+1]valP ¬ PC+9
Fetch
Read instruction byte
Read destination addressFall through address
Decode
Cnd ¬ Cond(CC,ifun)Execute
Take branch?MemoryWriteback
PC ¬ Cnd ? valC : valPPC update Update PC
Carnegie Mellon
21
Processor Microarchitecture• Sequential, single-cycle microarchitecture implementation
• Basic idea• Hardware implementation
• Pipelined microarchitecture implementation• Basic Principles• Difficulties: Control Dependency• Difficulties: Data Dependency
Carnegie Mellon
Real-World Pipelines: Car Washes
Idea• Divide process into independent stages• Move objects through stages in sequence• At any given times, multiple objects being processed
22
Sequential Pipelined
Carnegie Mellon
Computational Example
System• Computation requires total of 300 picoseconds• Additional 20 picoseconds to save result in register• Must have clock cycle time of at least 320 ps
23
Combinationallogic
Reg
300 ps 20 ps
Clock
Delay = 320 psThroughput = 3.12 Giga Instructions Per second (GIPS)
Carnegie Mellon
3-Stage Pipelined Version
System• Divide combinational logic into 3 blocks of 100 ps each• Can begin new operation as soon as previous one passes through
stage A.• Begin new operation every 120 ps
• Overall latency increases• 360 ps from start to finish
24
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Delay = 360 psThroughput = 8.33 GIPS
Carnegie Mellon
Pipeline Diagrams
Unpipelined
• Cannot start new operation until previous one completes
3-Stage Pipelined
• Up to 3 operations in process simultaneously
25
Time
OP1OP2OP3
Time
A B CA B C
A B C
OP1OP2OP3
CS:APP3e
Operating a Pipeline
26
Time
OP1OP2OP3
A B CA B C
A B C
0 120 240 360 480 640
Clock
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
239
CS:APP3e
Time
OP1OP2OP3
A B CA B C
A B C
0 120 240 360 480 640
Clock
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
241
Operating a Pipeline
27
CS:APP3e
Time
OP1OP2OP3
A B CA B C
A B C
0 120 240 360 480 640
Clock
Reg
Reg
Reg
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Comb.logic
A
Comb.logic
B
Comb.logic
C
Clock
300
Operating a Pipeline
28
CS:APP3e
Time
OP1OP2OP3
A B CA B C
A B C
0 120 240 360 480 640
Clock
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
359
Operating a Pipeline
29
CS:APP3e
Pipeline Trade-offsPros:Increase throughput. Can process more instructions in a given time span.Cons: Increase latency as new registers are needed between pipeline stages.
30
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Combinationallogic
Reg
300 ps 20 ps
Clock
CS:APP3e
Unbalanced PipelineA pipeline’s delay is limited by the slowest stage. This
limits the cycle time and the throughput
31
Reg
Clock
Reg
Comb.logic
B
Reg
Comb.logic
C
50 ps 20 ps 150 ps 20 ps 100 ps 20 ps
Comb.logicA
Reg
Clock
Comb.logic
A
Reg
Comb.logic
B
Reg
Comb.logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 psCycle time: 120 ps
Delay: 360 ps
Cycle time: 170 ps
Delay: 510 ps
CS:APP3e
Unbalanced PipelineA pipeline’s delay is limited by the slowest stage. This
limits the cycle time and the throughput
32
Reg
Clock
Reg
Comb.logic
B
Reg
Comb.logic
C
50 ps 20 ps 150 ps 20 ps 100 ps 20 ps
Comb.logicA
Cycle time: 170 ps
Delay: 510 ps
Time
OP1OP2OP3
A B CA B C
A B C
170 ps
CS:APP3e
Mitigating Unbalanced PipelineSolution 1: Further pipeline the slow stages
■ Not always possible. What to do if we can’t further pipeline a stage?
Solution 2: Use multiple copies of the slow component
33
Reg
Reg
Comb.logic
B
Reg
50 ps 20 ps 100 ps 20 ps 50 ps 20 ps
Comb.logicA
Comb.logic
B
MUX
Reg
What Logic?
Clock
What logic do you need there?Hint: it needs to control the clock signals of the two registers and the select signal of the MUX.
select
Comb.logicC
Copy 1
Copy 2
CS:APP3e
Mitigating Unbalanced PipelineData sent to copy 1 in odd cycles and to copy 2 in even cycles.This is called 2-way interleaving. Effectively the same as pipelining
Comb. logic B into two sub-stages.The cycle time is reduced to 70 ps (as opposed to 120 ps) at the
cost of extra hardware.
34
Reg
Reg
Comb.logic
B
Reg
50 ps 20 ps 100 ps 20 ps 50 ps 20 ps
Comb.logicA
Comb.logic
B
MUX
Reg
What Logic?
Clock
select
Comb.logicC
Copy 1
Copy 2
CS:APP3e2/28/20 36
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
icode, ifunrA, rB
valC
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
pState
valP
srcA, srcBdstA, dstB
valA, valB
aluA, aluB
Bch
valE
Addr, Data
valM
PC
valE, valM
valM
icode, valCvalP
PC
Adding Pipeline Registers
PCincrement
PCincrement
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
valP
d_srcA, d_srcB
valA, valB
aluA, aluB
Bch valE
Addr, Data
valM
PC
W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM
icode, ifun,rA, rB, valC
E
M
W
F
D
valP
f_PC
predPC
Instructionmemory
Instructionmemory
M_icode, M_Bch, M_valA
CS:APP3e
Pipeline StagesFetch
■ Select current PC■ Read instruction■ Compute incremented PC
Decode■ Read program registers
Execute■ Operate ALU
Memory■ Read or write data memory
Write Back■ Update register file
37
CS:APP3e2/28/20 38
Predicting the PC
■ Start fetch of new instruction after current one has completed fetch stage! Not enough time to reliably determine next instruction
■ Guess which instruction will follow! Recover if prediction was incorrect
F
D rB
M_icode
PredictPC
valC valPicode ifun rA
Instructionmemory
Instructionmemory
PCincrement
PCincrement
predPC
Needregids
NeedvalC
Instrvalid
AlignAlignSplitSplit
Bytes 1-5Byte 0
SelectPC
M_BchM_valA
W_icodeW_valM
CS:APP3e2/28/20 39
One Prediction StrategyInstructions that Don’t Transfer Control
■ Predict next PC to be valP■ Always reliable
Call and Unconditional Jumps■ Predict next PC to be valC (destination)■ Always reliable
Conditional Jumps■ Predict next PC to be valC (destination)■ Only correct if branch is taken
! Typically right 60% of time
Return Instruction■ Don’t try to predict
CS:APP3e
Today: Making the Pipeline Really WorkControl Dependencies
■ What is it?■ Software mitigation: Inserting Nops■ Software mitigation: Delay Slots
Data Dependencies■ What is it?■ Software mitigation: Inserting Nops
40
CS:APP3e
nop
41
Control Dependency
1
F
2
DF
3
EDF
4
MEDF
5
MW
EDFF
Definition: Outcome of instruction A determines whether or not instruction B should be executed or not.
Jump instruction example below:■ jne L1 determines whether irmovq $1, %rax should be
executed■ But jne doesn’t know its outcome until after its Execute stage
xorg %rax, %rax jne L1 # Not taken
irmovq $3, %rax # Target + 1
irmovq $1, %rax # Fall ThroughL1 irmovq $4, %rcx # Target
nop
6
WMEDD
7
WMEE
8
WMM
9
WWFF DD EE MM WW
FF DD EE MM