ECE 550D Fundamentals of Computer Systems and Engineering
Fall 2016
Datapaths
Tyler Bletsch
Duke University
Slides are derived from work by Andrew Hilton (Duke) and Amir Roth (Penn)
2
Last time
• What did we do last time?
• MIPS Assembly
• Practice translating C to assembly together
• Using functions
• Calling conventions
• jal (call)
• jr (return)
3
Now confluence of MIPS + digital logic
• Start of semester: Digital Logic
• Building blocks of digital design
• Most recently: MIPS assembly, ISA
• Lowest level software
• Now: where they meet…
• Datapaths: hardware implementation of processors
• By the way: homework 4 = build a datapath
• With some components from the TAs…
4
Necessary ingredient: the ALU
• ALU: Arithmetic/Logic Unit
• Performs any supported math or logic operation on two inputs
• Which operation is chosen by a third input
[Figure: ALU block with inputs A, B, and op, and output out]
5
Add/Subtract With Overflow Detection
[Figure: chain of n full adders with inputs a0, b0 … an-1, bn-1 and an Add/Sub control input, producing sum bits S0 … Sn-1 and an Overflow output]
6
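ALU Slice
[Figure: one ALU slice with data inputs a and b, carry in (C in) and carry out (C out), an Add/sub control input (column A below), and a 2-bit select F driving a 4-input mux (inputs 0 to 3) that produces output Q]
A  F  Q
0  0  a + b
1  0  a - b
-  1  NOT b
-  2  a OR b
-  3  a AND b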
7
The ALU
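[Figure: n ALU slices side by side, all driven by the ALU control signals; slice i takes a i and b i and produces Q i, and the combined outputs also yield Overflow and an "Is non-zero?" flag. Together they form the ALU block with inputs A, B, op and output out]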
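The slice table and the n-slice ALU can be modeled behaviorally in a few lines of Python. This is only a sketch under assumptions: the function name `alu`, the 32-bit default width, and the choice to report signed overflow of the adder are illustrative and are not taken from the slides.

```python
def alu(a, b, add_sub, f, width=32):
    """Behavioral model of the ALU; returns (q, overflow, is_nonzero).

    add_sub: 0 = add, 1 = subtract (only matters when f == 0)
    f:       0 = adder, 1 = NOT b, 2 = a OR b, 3 = a AND b
    """
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    overflow = False
    if f == 0:
        # Subtraction is a + (NOT b) + 1: invert b and use carry-in = 1.
        operand = (~b & mask) if add_sub else b
        q = (a + operand + add_sub) & mask
        # Signed overflow: both adder inputs share a sign the result lacks.
        overflow = bool(~(a ^ operand) & (a ^ q) & sign)
    elif f == 1:
        q = ~b & mask
    elif f == 2:
        q = a | b
    elif f == 3:
        q = a & b
    else:
        raise ValueError("f must be 0..3")
    return q, overflow, q != 0
```

For example, `alu(5, 3, 1, 0)` models `a - b` and yields a result of 2 with no overflow.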
8
Datapath for MIPS ISA
• Consider only the following instructions
add $1,$2,$3
addi $1,$3,2
lw $1,4($3)
sw $1,4($3)
beq $1,$2,PC_relative_target
j absolute_target
• Why only these?
• Most other instructions are the same from datapath viewpoint
• The ones that aren’t are left for you to figure out
9
Remember The von Neumann Model?
[Figure: loop of Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction]
• Instruction Fetch: Read instruction bits from memory
• Decode: Figure out what those bits mean
• Operand Fetch: Read registers (+ mem to get sources)
• Execute: Do the actual operation (e.g., add the #s)
• Result Store: Write result to register or memory
• Next Instruction: Figure out mem addr of next insn, repeat
10
Start With Fetch
• Same for all instructions (don’t know insn yet)
• PC and instruction memory
• A +4 incrementer computes default next instruction PC
• Details of Insn Mem: later…
• For now: just assume a bunch of DFFs
[Figure: PC feeding instruction memory (Insn Mem), with a +4 adder computing the next PC]
11
First Instruction: add
• Add register file and ALU
[Figure: fetch datapath extended with a register file (ports s1, s2, d) feeding an ALU]
• Decoding: very easy in MIPS
R-type: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)
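Because the fields sit at fixed bit positions, decoding is just shifting and masking. A minimal sketch, assuming a 32-bit instruction word held in a Python int; the function name `decode` and the dictionary keys are illustrative, and the I-type and J-type immediates are extracted alongside the R-type fields for convenience.

```python
def decode(insn):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "op":    (insn >> 26) & 0x3F,   # bits 31..26
        "rs":    (insn >> 21) & 0x1F,   # bits 25..21
        "rt":    (insn >> 16) & 0x1F,   # bits 20..16
        "rd":    (insn >> 11) & 0x1F,   # bits 15..11 (R-type)
        "sh":    (insn >> 6) & 0x1F,    # bits 10..6  (R-type)
        "func":  insn & 0x3F,           # bits 5..0   (R-type)
        "imm16": insn & 0xFFFF,         # bits 15..0  (I-type)
        "imm26": insn & 0x3FFFFFF,      # bits 25..0  (J-type)
    }

# add $1,$2,$3 encodes as 0x00430820: op=0, rs=2, rt=3, rd=1, func=0x20
fields = decode(0x00430820)
```

The hardware does the same thing with wires: each field is simply a fixed slice of the instruction bits.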
12
Second Instruction: addi
• Destination register can now be either Rd or Rt
• Add sign extension unit and mux into second ALU input
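[Figure: datapath with a sign extension unit (SX) and a mux on the second ALU input; another mux selects Rd or Rt as the destination register]
I-type: Op(6) Rs(5) Rt(5) Immed(16)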
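The sign extension unit and the second-input mux can be sketched as follows. The names `sign_extend`, `alu_input_b`, and `use_imm` are hypothetical, chosen for this illustration; values are kept as unsigned 32-bit patterns.

```python
def sign_extend(value, bits=16, width=32):
    """Replicate the top bit of a `bits`-wide value up to `width` bits."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):       # negative: top bit is set
        value -= 1 << bits
    return value & ((1 << width) - 1)   # return as an unsigned bit pattern

def alu_input_b(rt_value, imm16, use_imm):
    """Mux on the second ALU input: register value or sign-extended immediate."""
    return sign_extend(imm16) if use_imm else rt_value & 0xFFFFFFFF
```

With this, `sign_extend(0xFFFC)` gives `0xFFFFFFFC`, the 32-bit pattern for -4.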
13
Third Instruction: lw
• Add data memory, address is ALU output
• Add register write data mux to select memory output or ALU output
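[Figure: datapath with data memory added; the ALU output drives the memory address input (a), and a mux selects the memory data output (d) or the ALU output as the register write data]
I-type: Op(6) Rs(5) Rt(5) Immed(16)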
14
Fourth Instruction: sw
• Add path from second input register to data memory data input
[Figure: datapath with a path from the second register read port to the data memory data input (d)]
I-type: Op(6) Rs(5) Rt(5) Immed(16)
15
Fifth Instruction: beq
• Add left shift unit and adder to compute PC-relative branch target
• Add PC input mux to select PC+4 or branch target
• Note: shifting by a fixed amount is very simple
[Figure: datapath with a left-shift-by-2 unit (<< 2) and an adder computing the branch target, a PC input mux, and the ALU zero output (z)]
I-type: Op(6) Rs(5) Rt(5) Immed(16)
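The branch target arithmetic can be written out directly. This is a sketch under assumptions: the helper names are made up for illustration, and the mux select is modeled as "branch instruction AND ALU zero output", which is how beq is usually resolved when the ALU subtracts the two registers.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc, imm16):
    """PC-relative target: PC+4 plus the sign-extended immediate shifted left by 2."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

def next_pc(pc, imm16, is_branch, zero):
    """PC input mux: take the branch target only for a taken branch."""
    if is_branch and zero:
        return branch_target(pc, imm16)
    return (pc + 4) & 0xFFFFFFFF
```

An immediate of 3 at PC 0x1000 targets 0x1010; an immediate of 0xFFFF (that is, -1) targets 0x1000, the branch itself.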
16
Sixth Instruction: j
• Add shifter to compute left shift of 26-bit immediate
• Add additional PC input mux for jump target
[Figure: datapath with a second left-shift-by-2 unit for the 26-bit immediate and an additional PC input mux for the jump target]
J-type: Op(6) Immed(26)
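A sketch of the jump target computation. The slide only mentions the shift; in MIPS the shifted 26-bit immediate fills bits 27..0 and the top four bits are taken from PC+4, so that concatenation is included here as background rather than something drawn in the figure. The function name is illustrative.

```python
def jump_target(pc, imm26):
    """Absolute jump target: top 4 bits of PC+4, then the immediate shifted left by 2."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((imm26 & 0x3FFFFFF) << 2)
```

For instance, `jump_target(0x10000000, 1)` gives `0x10000004`.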
17
More Instructions…
• Figure out datapath modifications for
• jal (J-type)
• jr (R-type)
[Figure: the six-instruction datapath: PC, insn memory, register file, sign extender, ALU, data memory, and branch/jump target logic]
18
jal
• For jal, need to get PC+4 to RF write mux
[Figure: datapath with PC+4 routed to the register file write data mux]
19
jr
• For jr, need to get the RF read value to the next-PC mux
[Figure: datapath with a register file read value routed to the next-PC mux]
20
Good practice: Try other insns
• Pick other MIPS instructions, contemplate how to add them
[Figure: the six-instruction datapath: PC, insn memory, register file, sign extender, ALU, data memory, and branch/jump target logic]
21
“Continuous Read” Datapath Timing
• Works because writes (PC, RegFile, DMem) are independent
• And because no read logically follows any write
[Figure: datapath annotated with the timing of one cycle: Read IMem, Read Registers, Read DMEM, then Write DMEM, Write Registers, Write PC]
22
What Is Control?
• 8 signals control flow of data through this datapath
• MUX selectors, or register/memory write enable signals
• A real datapath has 300-500 control signals
[Figure: datapath with its 8 control signals labeled: Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
23
Example: Control for add
• Control for an instruction:
• Values of all control signals to correctly execute it
[Figure: datapath annotated with the control values for add]
BR=0, JP=0, Rwd=0, DMwe=0, ALUop=0, ALUinB=0, Rdst=1, Rwe=1
24
Example: Control for sw
• Difference between sw and add is 5 signals
• 3 if you don’t count the X (don’t care) signals
[Figure: datapath annotated with the control values for sw]
Rwe=0, ALUinB=1, DMwe=1, JP=0, ALUop=0, BR=0, Rwd=X, Rdst=X
25
Example: Control for beq
• Difference between sw and beq is only 4 signals
[Figure: datapath annotated with the control values for beq]
Rwe=0, ALUinB=0, DMwe=0, JP=0, ALUop=1, BR=1, Rwd=X, Rdst=X
26
You all figure out LW
• How would these control signals be set for LW?
[Figure: datapath with the 8 control signals left blank: Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
27
Example: Control for LW
[Figure: datapath annotated with the control values for lw]
BR=0, JP=0, Rwd=1, DMwe=0, ALUop=0, ALUinB=1, Rdst=0, Rwe=1
28
How Is Control Implemented?
[Figure: datapath with a "Control?" block driving the 8 control signals: Rwe, ALUinB, DMwe, JP, ALUop, BR, Rwd, Rdst]
29
Implementing Control
• Each insn has a unique set of control signals
• Most are function of opcode
• Some may be encoded in the instruction itself
• E.g., the ALUop signal is some portion of the MIPS Func field
+ Simplifies controller implementation
• Requires careful ISA design
30
Control Implementation: ROM
• ROM (read only memory): think rows of bits
• Bits in data words are control signals
• Lines indexed by opcode
• Example: ROM control for 6-insn MIPS datapath
• X is “don’t care”
opcode  BR  JP  ALUinB  ALUop  DMwe  Rwe  Rdst  Rwd
add     0   0   0       0      0     1    1     0
addi    0   0   1       0      0     1    0     0
lw      0   0   1       0      0     1    0     1
sw      0   0   1       0      1     0    X     X
beq     1   0   0       1      0     0    X     X
j       0   1   0       0      0     0    X     X
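A ROM-style controller is just a lookup table. The sketch below makes several assumptions: it is indexed by mnemonic instead of the real 6-bit opcode, `None` stands in for X, and the Rdst polarity follows the worked examples for add (Rdst=1) and lw (Rdst=0). The helper names are hypothetical.

```python
SIGNALS = ("BR", "JP", "ALUinB", "ALUop", "DMwe", "Rwe", "Rdst", "Rwd")
X = None  # don't care

# One ROM word per instruction, in SIGNALS order.
CONTROL_ROM = {
    "add":  (0, 0, 0, 0, 0, 1, 1, 0),
    "addi": (0, 0, 1, 0, 0, 1, 0, 0),
    "lw":   (0, 0, 1, 0, 0, 1, 0, 1),
    "sw":   (0, 0, 1, 0, 1, 0, X, X),
    "beq":  (1, 0, 0, 1, 0, 0, X, X),
    "j":    (0, 1, 0, 0, 0, 0, X, X),
}

def control(insn):
    """Look up the control word for an instruction as a name -> value dict."""
    return dict(zip(SIGNALS, CONTROL_ROM[insn]))

def differing_signals(insn_a, insn_b, count_dont_care=True):
    """Names of the signals whose values differ between two instructions."""
    diffs = []
    for name, va, vb in zip(SIGNALS, CONTROL_ROM[insn_a], CONTROL_ROM[insn_b]):
        if va == vb:
            continue
        if (va is X or vb is X) and not count_dont_care:
            continue
        diffs.append(name)
    return diffs
```

Calling `differing_signals("sw", "add")` lists 5 signals, or 3 with `count_dont_care=False`, and `differing_signals("sw", "beq")` lists 4, matching the counts quoted in the examples.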
31
Control Implementation: Random Logic
• Real machines have 100+ insns and 300+ control signals
• 30,000+ control bits (~4KB)
• Not huge, but hard to make faster than datapath (important!)
• Alternative: random logic (random = ‘non-repeating’)
• Exploits the observation: many signals have few 1s or few 0s
• Example: random logic control for 6-insn MIPS datapath
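[Figure: random logic control: the opcode is decoded into one line per instruction (add, addi, lw, sw, beq, j), and each control signal (BR, JP, ALUinB, DMwe, Rwd, Rdst, ALUop, Rwe) is derived from those lines]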
Yes, “random logic” is a very dumb and
misleading name for this concept. Sorry.
32
Datapath and Control Timing
[Figure: datapath plus a "Control ROM/random logic" block, annotated with the timing of one cycle: Read IMem, Read Registers (alongside Read Control ROM), Read DMEM, then Write DMEM, Write Registers, Write PC]
33
Single-Cycle Datapath Performance
• Goes against make common case fast (MCCF) principle
+ Low Cycles Per Instruction (CPI): 1
– Long clock period: to accommodate slowest insn
[Figure: single-cycle datapath with Control ROM/random logic]
34
Interlude: Performance
• Previous slide alludes to something new: Performance
• Don’t just want it to work…
• But want it to go fast!
• Three components to performance:
Number of instructions
x Cycles per instruction (CPI)
x Clock Period (1 / Clock frequency)
(Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) = Seconds / Program
35
Interlude: Performance
• Three components to performance:
Number of instructions <- Compiler’s Job
x Cycles per instruction (CPI)
x Clock Period (1 / Clock frequency)
(Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle) = Seconds / Program
• Insns/Program: determined by compiler + ISA
• Generally assume fixed program when doing micro-architecture
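The three-factor equation translates directly into code. A minimal sketch; the function names and the one-million-instruction program in the usage line are made-up values for illustration only.

```python
def execution_time(insn_count, cpi, clock_period_s):
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    return insn_count * cpi * clock_period_s

def clock_period(frequency_hz):
    """Clock period in seconds is the reciprocal of the clock frequency."""
    return 1.0 / frequency_hz

# Hypothetical program: 1,000,000 insns at CPI = 1 on a 20 MHz (50 ns) clock.
t = execution_time(1_000_000, 1, clock_period(20e6))
```

With the instruction count held fixed, only CPI and clock period are left for the micro-architecture to improve.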
36
Micro-architectural factors
• Micro-architecture:
• The details of how the ISA is implemented
• Affects CPI and Clock frequency
• Often will look at fixed program, and consider MIPS
• Million Instructions Per Second
• MIPS = IPC * Frequency (in MHz)
• IPC = Instructions Per Cycle (1 / CPI)
• Gives “Bigger is better” number
(Instructions / Cycle) x (Cycles / Second) = Instructions / Second
(IPC) x (Frequency) = (Throughput)
The use of “MIPS” to mean “Millions of Instructions Per Second” has nothing to do with the CPU architecture also called “MIPS”, which actually stands for “Microprocessor without Interlocked Pipeline Stages”. The fact that a major CPU architecture shares a name with an important metric for performance is incredibly confusing and dumb, and I apologize. I blame the cocaine-fueled CPU architects of the 1980s.
37
“Best” IPC
• For now, best we can do: IPC = 1 (CPI = 1)
• Do 1 instruction every cycle
• Later:
• Real processors can do multiple instructions at once!
• Potentially: IPC > 1! (CPI < 1!)
• Best possible IPC depends on design
38
Performance vs ….
• 1990s: Performance at all cost
• Actually more “clock frequency” at all cost…
• Now: Care about other things
• Energy (electric bill, battery life)
• Power (cooling, also affects energy)
• Area (chip cost)
• Reliability (tolerance of transient faults: e.g., charged particle strikes)
• …
• Important metric these days: “Performance / Watt”
• Throughput divided by power consumption
• Why?
39
Performance Modeling and Analysis
• Speaking of performance
• Making a processor takes time (years) and money (millions)
• Want to know it will perform well before you finish
• If it’s wrong, doing it all over is painful…
• Performance can be simulated in software
• Estimate what IPC will be
• Guide design
41
Alternative: Multi-Cycle Datapath
• Multi-cycle datapath: attacks high clock period
• Cut datapath into multiple stages (5 here), isolate using FFs
• FSM control “walks” insns thru stages (by staging control signals)
+ Insns can bypass stages and exit early
[Figure: multi-cycle datapath cut into stages by registers labeled IR, A, B, O, and D, with staged control signals s3, s4, s5]
42
Multi-cycle Datapath FSM
• First state: Get a New Instruction
• Output signals to fetch (e.g., read enable IMEM)
• Next State: Always Decode
[Figure: FSM with states Next Insn -> Decode Insn]
43
Multi-cycle Datapath FSM
• Second State: Decode
• Output signals to decode instruction (RdEn RegFile)
• Go to Next Insn if NOP
• Otherwise Execute
[Figure: FSM adds Execute Insn after Decode Insn, plus a NOP edge from Decode Insn back to Next Insn]
44
Multi-cycle Datapath FSM
• Execute State
• Execute Insn (varies by insn type)
• Next State: Also depends on insn type
• Branches: Next Insn
[Figure: FSM adds a Branch edge from Execute Insn back to Next Insn]
45
Multi-cycle Datapath FSM
• Execute State
• Execute Insn (varies by insn type)
• Next State: Also depends on insn type
• ALU op: write register
[Figure: FSM adds a Writeback state, reached from Execute Insn on the ALU edge]
46
Multi-cycle Datapath FSM
• Execute State
• Execute Insn (varies by insn type)
• Next State: Also depends on insn type
• Load: Read Memory
[Figure: FSM adds a Read DMEM state, reached from Execute Insn on the Load edge]
47
Multi-cycle Datapath FSM
• Execute State
• Execute Insn (varies by insn type)
• Next State: Also depends on insn type
• Store: Write Memory
[Figure: FSM adds a Write DMEM state, reached from Execute Insn on the Store edge]
48
Multi-cycle Datapath FSM
• Read DMEM State
• Control signals enable DMEM Read
• Next state is writeback
[Figure: FSM with states Next Insn, Decode Insn, Execute Insn, Read DMEM, Write DMEM, Writeback; Read DMEM leads to Writeback]
49
Multi-cycle Datapath FSM
• Writeback state
• Control signals enable regfile write
• Next state: Next Insn
[Figure: FSM with states Next Insn, Decode Insn, Execute Insn, Read DMEM, Write DMEM, Writeback; Writeback leads to Next Insn]
50
Multi-cycle Datapath FSM
• Write DMEM state
• Control signals enable memory write
• Next state: Next Insn
[Figure: FSM with states Next Insn, Decode Insn, Execute Insn, Read DMEM, Write DMEM, Writeback; Write DMEM leads to Next Insn]
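The state machine built up over these slides can be sketched as a next-state function. The state names and the instruction-kind strings (`"nop"`, `"branch"`, `"alu"`, `"load"`, `"store"`) are labels chosen for this illustration; the transitions follow the FSM bullets.

```python
def next_state(state, kind):
    """Next FSM state given the current state and the instruction kind."""
    if state == "NextInsn":
        return "Decode"                      # always decode after fetch
    if state == "Decode":
        return "NextInsn" if kind == "nop" else "Execute"
    if state == "Execute":
        return {
            "branch": "NextInsn",
            "alu": "Writeback",
            "load": "ReadDMEM",
            "store": "WriteDMEM",
        }[kind]
    if state == "ReadDMEM":
        return "Writeback"
    if state in ("Writeback", "WriteDMEM"):
        return "NextInsn"
    raise ValueError(f"unknown state: {state}")

def cycles(kind):
    """Count the states an instruction visits before the FSM fetches again."""
    state, n = "NextInsn", 0
    while True:
        n += 1
        state = next_state(state, kind)
        if state == "NextInsn":
            return n
```

Walking each kind through the FSM gives 3 states for branches, 4 for ALU ops and stores, and 5 for loads.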
51
Multi-Cycle Datapath Example: Add
• Example: Add
• Cycle 1: Read IMEM
• Cycle 2: Decode + Read RF
• Cycle 3: ALU
• Cycle 4: Writeback + Increment PC
[Figure: multi-cycle datapath with stage registers IR, A, B, O, and D]
55
Multi-Cycle Datapath Performance
• Opposite performance split of single-cycle datapath
+ Short clock period
– High CPI
[Figure: multi-cycle datapath with stage registers IR, A, B, O, and D]
56
Multi-cycle Datapath CPI
• CPI depends on instructions
• Branches / Jumps: 3 cycles
• ALU: 4 cycles
• Stores: 4 cycles
• Loads: 5 cycles
• Overall CPI is weighted average
• Example:
• 20% loads, 15% stores, 20% branches, 45% ALU
CPI = 0.20 * 5 + 0.15 * 4 + 0.20 * 3 + 0.45 * 4 = 4.0
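The weighted average is easy to check in code. A small sketch using the cycle counts and instruction mix from this example; the function and variable names are illustrative.

```python
CYCLES = {"load": 5, "store": 4, "branch": 3, "alu": 4}

def average_cpi(mix, cycles=CYCLES):
    """Weighted-average CPI for an instruction mix given as fractions summing to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())

mix = {"load": 0.20, "store": 0.15, "branch": 0.20, "alu": 0.45}
cpi = average_cpi(mix)  # 1.0 + 0.6 + 0.6 + 1.8
```

Shifting the mix toward branches lowers the average, and shifting it toward loads raises it.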
60
Multi-cycle Datapath Performance
• Single-cycle
• Clock period = 50ns, CPI = 1
• Performance = 50 ns/insn
• Multi-cycle
• Clock period = 10ns
• CPI = (0.2*3+0.2*5+0.6*4) = 4
• Performance = 40 ns/insn
• But wait…
61
Multi-Cycle Datapath Performance
• Did not just cut up existing logic into 5 pieces
• Also added logic (flip flops)
• So clock period not 1/5 of single cycle, but slightly longer
[Figure: multi-cycle datapath with stage registers IR, A, B, O, and D]
62
Multi-cycle Datapath Performance
• Single-cycle
• Clock period = 50ns, CPI = 1
• Performance = 50 ns/insn
• Multi-cycle
• Clock period = 12ns
• CPI = (0.2*3+0.2*5+0.6*4) = 4
• Performance = 48 ns/insn
• Better, but not as exciting…
• Can we do better still?
• Have our cake (low CPI) and eat it too (high clock frequency)?
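The comparison boils down to clock period times CPI. A quick sketch with the numbers from this example; the function name is made up for illustration.

```python
def ns_per_insn(clock_period_ns, cpi):
    """Average time per instruction = clock period x cycles per instruction."""
    return clock_period_ns * cpi

multi_cpi = 0.2 * 3 + 0.2 * 5 + 0.6 * 4          # branches, loads, everything else

single = ns_per_insn(50, 1)                       # single-cycle datapath
multi_ideal = ns_per_insn(10, multi_cpi)          # if the period were exactly 1/5
multi_real = ns_per_insn(12, multi_cpi)           # with flip-flop overhead added
speedup = single / multi_real
```

The flip-flop overhead shrinks the gain from 50 vs 40 ns/insn to 50 vs 48 ns/insn, a speedup of only about 1.04x.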
63
Clock Period and CPI
• Single-cycle datapath
+ Low CPI: 1
– Long clock period: to accommodate slowest insn
• Multi-cycle datapath
+ Short clock period
– High CPI
• Can we have both low CPI and short clock period?
– No good way to make a single insn go faster
+ Insn latency doesn’t matter anyway … insn throughput matters
• Key: exploit inter-insn parallelism
[Figure: timelines for insn0 and insn1. Single-cycle: each insn does fetch, dec, exec in one long cycle. Multi-cycle: insn0.fetch, insn0.dec, insn0.exec, then insn1.fetch, insn1.dec, insn1.exec, each in a short cycle]
64
Pipelining
• Pipelining: important performance technique
• Improves insn throughput rather than insn latency
• Exploits parallelism at insn-stage level to do so
• Begin with multi-cycle design
• When insn advances from stage 1 to 2, next insn enters stage 1
• Individual insns take same number of stages
+ But insns enter and leave at a much faster rate
• Physically breaks “atomic” VN loop ... but must maintain illusion
• Revisit at end of semester (hopefully)
[Figure: timelines for insn0 and insn1. Multi-cycle: insn1.fetch starts only after insn0.exec. Pipelined: insn1.fetch overlaps insn0.dec, and insn1.dec overlaps insn0.exec]
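The throughput gain can be estimated with a simple cycle count. This sketch assumes an idealized pipeline in which every insn uses every stage and nothing ever stalls; those assumptions and the function names are not from the slides.

```python
def multicycle_cycles(n_insns, n_stages):
    """Multi-cycle: each insn finishes all its stages before the next one starts."""
    return n_insns * n_stages

def pipelined_cycles(n_insns, n_stages):
    """Ideal pipeline: fill it once, then one insn completes every cycle."""
    if n_insns == 0:
        return 0
    return n_stages + (n_insns - 1)
```

For two insns and three stages (fetch, dec, exec) this gives 6 cycles without pipelining and 4 with it. For 1000 insns and 5 stages it is 5000 versus 1004, so the effective CPI approaches 1 while the clock period stays short.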
65
Summary
• Datapaths:
• Single Cycle
• What do we need?
• Control
• How control is implemented
• Multi-cycle
• Faster clock (yay!)
• Worse CPI (boooo)
• Performance:
• IPC
• Performance / Watt
• CPU Performance Equation
• Pipelining
• Teaser for later!