TRANSCRIPT
12.1
CS356 Unit 12
Processor Hardware Organization
Pipelining
12.2
BASIC HW
12.3
Logic Circuits
• Combinational logic
– Performs a specific function (mapping of 2^n input combinations to desired output combinations)
– No internal state or feedback
• Given a set of inputs, we will always get the same output after some time (propagation) delay
• Sequential logic (Storage devices)
– Registers are the fundamental building blocks
• Remembers a set of bits for later use
• Acts like a variable from software
• Controlled by a "clock" signal
[Figure: Inputs -> Combinational Logic (usually operations like +, -, *, /, &, |, <<) -> Outputs. Outputs depend only on current inputs.]
[Figure: Sequential Logic. Current inputs and a register holding "state" feed the combinational logic, which produces the outputs. Sequential values feed back and provide "memory". Outputs depend on current inputs and previous inputs (previous inputs summarized via state).]
12.4
Combinational Logic Gates
• Main Idea: Circuits called "gates" perform logic operations to produce desired outputs from some digital inputs
[Figure: Circuit of OR, AND, and NOT gates connecting signals N, P, R, V, TB, and C, annotated with sample 0/1 values]
12.5
Propagation Delay
• Main Idea: All digital logic circuits have propagation delay
– Time it takes for output to change when inputs are changed
[Figure: The same gate circuit after an input changes: 4 gate delays for the input to propagate to the outputs]
12.6
Combinational Logic Functions
• Map input combinations of n-bits to desired m-bit output
• Can describe function with a truth table and then find its circuit implementation
[Figure: Logic circuit block with inputs In0, In1, In2 and its outputs]
IN0 | IN1 | IN2 | OUT0 | OUT1
 0  |  0  |  0  |  0   |  0
 0  |  0  |  1  |  1   |  0
 …
 1  |  1  |  1  |  0   |  1
12.7
ALUs
• Perform a selected operation on two input numbers.
– FS[5:0] select the desired operation
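[Figure: 32-bit ALU with inputs A = X[31:0] and B = Y[31:0], a 6-bit function select FS[5:0] (Func), an input labeled C0, result RES[31:0], and flag outputs OF, ZF, CF. The example shows FS = 100010 with the values 0x00000001, 0x80000000, 0x7fffffff and flag values 1, 0, 0.]
Func. Code | Op.          | Func. Code | Op.
00_0000    | A SHL B      | 10_0000    | A+B
00_0010    | A SHR B      | 10_0010    | A-B
00_0011    | A SAR B      | …          | …
…          | …            | 10_0100    | A AND B
01_1000    | A * B        | 10_0101    | A OR B
01_1001    | A * B (uns.) | 10_0110    | A XOR B
01_1010    | A / B        | 10_0111    | A NOR B
01_1011    | A / B (uns.) | …          | …
…          | …            | 10_1010    | A < B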
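As a rough illustration of how a function-select code picks the operation, here is a minimal C sketch of a 32-bit ALU. The function codes come from the table above. The names `AluOut` and `alu`, treating `A < B` as a signed comparison, and leaving out the multiply/divide codes and the CF flag are all assumptions made to keep the sketch short.

```c
#include <stdint.h>

/* Function-select codes FS[5:0], copied from the table above. */
enum {
    FS_SHL = 0x00, /* 00_0000 */
    FS_SHR = 0x02, /* 00_0010 */
    FS_SAR = 0x03, /* 00_0011 */
    FS_ADD = 0x20, /* 10_0000 */
    FS_SUB = 0x22, /* 10_0010 */
    FS_AND = 0x24, /* 10_0100 */
    FS_OR  = 0x25, /* 10_0101 */
    FS_XOR = 0x26, /* 10_0110 */
    FS_NOR = 0x27, /* 10_0111 */
    FS_SLT = 0x2A  /* 10_1010 */
};

/* Hypothetical result bundle: RES plus two of the flags. */
typedef struct {
    uint32_t res;
    int zf; /* 1 when res == 0 */
    int of; /* 1 on signed overflow (add/sub only in this sketch) */
} AluOut;

AluOut alu(uint32_t a, uint32_t b, unsigned fs)
{
    AluOut out = {0, 0, 0};
    unsigned s = b & 31u; /* shift amount limited to 0..31 */

    switch (fs) {
    case FS_SHL: out.res = a << s; break;
    case FS_SHR: out.res = a >> s; break;
    case FS_SAR: /* arithmetic shift: copy the sign bit into vacated bits */
        out.res = a >> s;
        if (a & 0x80000000u) out.res |= ~(0xFFFFFFFFu >> s);
        break;
    case FS_ADD: /* overflow: operands share a sign, result differs */
        out.res = a + b;
        out.of = (int)((~(a ^ b) & (a ^ out.res)) >> 31);
        break;
    case FS_SUB: /* overflow: operands differ in sign, result differs from a */
        out.res = a - b;
        out.of = (int)(((a ^ b) & (a ^ out.res)) >> 31);
        break;
    case FS_AND: out.res = a & b; break;
    case FS_OR:  out.res = a | b; break;
    case FS_XOR: out.res = a ^ b; break;
    case FS_NOR: out.res = ~(a | b); break;
    case FS_SLT: /* signed compare done by flipping the sign bits */
        out.res = (a ^ 0x80000000u) < (b ^ 0x80000000u);
        break;
    default: break; /* codes not modeled here return 0 */
    }
    out.zf = (out.res == 0);
    return out;
}
```

Calling `alu(0x80000000u, 1u, FS_SUB)` subtracts 1 from the most negative 32-bit value, so the sketch reports a result of 0x7fffffff with the overflow flag set and the zero flag clear.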
12.8
Sequential Devices (Registers)
• Registers capture the D input value when a control input (aka the clock signal) transitions from 0->1 (clock edge) and store that value at the Q output until the next clock edge
• A register is like a variable in software. It stores a value for later use
• We can choose to only clock the register at "certain" times when we want the register to capture a new value (i.e. when it is the destination of an instruction)
• Key Idea: Registers store data while we operate on those values
[Figure: Block Diagram of a Register with data input D (could be many bits), data output Q (could be many bits), and clock pulse input CP]
[Timing diagram: d(t) is some input value changing over time, d(1) through d(12). Clock pulses arrive at t = 1 ns, 5 ns, 7 ns, and 10 ns. q(t) is unknown (unk) before the first pulse, then holds d(1), d(5), d(7), and d(10) in turn. The clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.]
[Figure: %rax feeds the ALU and the ALU sum is written back to %rax, for the sequence add %rbx,%rax then add %rcx,%rax]
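The sample-and-hold behavior described above can be modeled in a few lines of C. This is only a sketch: the `Register` struct and `reg_step` function are hypothetical names, and a real register is hardware, not a function call.

```c
#include <stdint.h>

/* Hypothetical software model of a positive-edge-triggered register. */
typedef struct {
    uint64_t q;        /* value currently held at the Q output */
    int      prev_clk; /* clock level seen on the previous call */
} Register;

/* Present the current clock level and D input, and return Q.
 * Q changes only when clk goes from 0 to 1 (the clock edge). */
uint64_t reg_step(Register *r, int clk, uint64_t d)
{
    if (r->prev_clk == 0 && clk == 1) {
        r->q = d; /* rising edge: sample and hold D */
    }
    r->prev_clk = clk;
    return r->q;
}
```

Calling `reg_step` repeatedly with the clock held at 1 or at 0 leaves Q unchanged however D varies, which matches the timing diagram: q(t) only picks up d(t) at a clock pulse.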
12.9
Clock Signal
• Alternating high/low voltage pulse train
• Controls the ordering and timing of operations performed in the processor
• 1 cycle is usually measured from rising edge to rising edge
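• Clock frequency = # of cycles per second (e.g. 2.8 GHz = 2.8 * 10^9 cycles per second)
[Figure: Processor driven by a clock signal alternating between 0 (0V) and 1 (5V). 1 cycle spans rising edge to rising edge, with Op. 1, Op. 2, Op. 3 performed in successive cycles.]
2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle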
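The frequency-to-period conversion is a reciprocal, sketched here in C. The function name `clock_period_ns` is made up for illustration.

```c
/* Clock period in nanoseconds for a frequency given in GHz.
 * 1 GHz = 10^9 cycles per second and 1 s = 10^9 ns,
 * so the period in ns is simply 1 / f_GHz. */
double clock_period_ns(double freq_ghz)
{
    return 1.0 / freq_ghz;
}
```

For the 2.8 GHz example, `clock_period_ns(2.8)` gives roughly 0.357 ns per cycle.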
12.10
FROM X86 TO RISC
Basic HW organization for a simplified instruction set
12.11
From CISC to RISC
• Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute
– Benefit is fewer instructions are needed to accomplish a task
• Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps
– It often requires more instructions to describe the overall task (larger code size)
• See the example below
• RISC makes the hardware design easier so let's tweak our x86 instructions to be more RISC-like
// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax
// RISC Equiv. w/ 1 mem. or ALU op. per instruction
mov %rsi, %rbx     # use %rbx as a temp.
shl $2, %rbx       # %rsi * 4
add %rdi, %rbx     # %rdi + (%rsi*4)
add $0x40, %rbx    # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax   # %rax = *%rbx
CISC vs. RISC Equivalents
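To check that the two forms compute the same address, the arithmetic can be written out in C. This is an illustrative sketch: `ea_cisc` and `ea_risc` are hypothetical helper names, and the variables named after registers are ordinary integers.

```c
#include <stdint.h>

/* One-step (CISC-style) address: disp + base + index * scale. */
uint64_t ea_cisc(uint64_t base, uint64_t index, uint64_t scale, uint64_t disp)
{
    return disp + base + index * scale;
}

/* The same address built one ALU operation at a time, following the
 * RISC sequence above for scale = 4 and disp = 0x40. */
uint64_t ea_risc(uint64_t rdi, uint64_t rsi)
{
    uint64_t rbx = rsi; /* copy %rsi into the temporary %rbx */
    rbx <<= 2;          /* shift left by 2: %rsi * 4         */
    rbx += rdi;         /* %rdi + (%rsi*4)                   */
    rbx += 0x40;        /* 0x40 + %rdi + (%rsi*4)            */
    return rbx;         /* address the final load would use  */
}
```

Both functions return the same value for any inputs when the scale is 4 and the displacement is 0x40. The RISC form takes four simple steps to do what the CISC addressing mode does in one.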
12.12
A RISC Subset of x86
• Split mov instructions that access memory into separate instructions:
– ld = Load/Read from memory
– st = Store/Write to memory
• Limit ld & st instructions to use at most indirect w/ displacement
– No ld 0x04(%rdi, %rsi, 4), %rax
• Too much work
– At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)
• Limit arithmetic & logic instructions to only operate on registers
– No add (%rsp), %rax since this implicitly accesses (dereferences) memory
– Only add %reg1, %reg2
// CISC instruction
add %rax, (%rsp)
// Equiv. RISC sequence (w/ ld and st)
ld 0(%rsp), %rbx
add %rax, %rbx
st %rbx, 0(%rsp)

// 3 x86 memory read instructions
mov (%rdi), %rax           // 1
mov 0x40(%rdi), %rax       // 2
mov 0x40(%rdi,%rsi), %rax  // 3
// Equiv. load sequences
ld 0x0(%rdi), %rax         // 1
ld 0x40(%rdi), %rax        // 2
mov %rsi, %rbx             // 3a
add %rdi, %rbx             // 3b
ld 0x40(%rbx), %rax        // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)           // 1
mov %rax, 0x40(%rdi)       // 2
mov %rax, 0x40(%rdi,%rsi)  // 3
// Equiv. store sequences
st %rax, 0x0(%rdi)         // 1
st %rax, 0x40(%rdi)        // 2
mov %rsi, %rbx             // 3a
add %rdi, %rbx             // 3b
st %rax, 0x40(%rbx)        // 3c
12.13
Developing a Processor Organization
• Identify which hardware components each instruction type would use and in what order: ALU-Type, LD, ST, Jump
[Figure: Hardware components: PC, I-Cache / I-MEM (Addr., Data), Registers (%rax, %rbx, etc., aka RegFile), ALU (Res., Zero), Cond. Codes, D-Cache / D-MEM (Addr., Data)]
ALU-Type: add %rax,%rbx
1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax,%rbx
4. ALU: Sum %rax+%rbx
5. Registers: Save result to %rbx
LD: ld 8(%rax),%rbx
1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax
4. ALU: Sum %rax+8
5. D-Cache: Read data
6. Registers: Save data to %rbx
ST: st %rbx, 8(%rax)
1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax
4. ALU: Sum %rax+8
5. D-Cache: Write %rbx data
JE: je label/displacement
1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax
4. ALU: If cond=TRUE, PC = PC+disp.
12.14
Processor Block Diagram
[Figure: Processor datapath in five stages: Fetch (PC, I-Cache), Decode (Decode logic, Registers aka RegFile), Exec. (ALU with condition codes ZF, OF, CF, SF), Mem (D-Cache with Addr and Data), WB. Values passed along: Instruction (Machine Code); Operands; ALU Output (Addr. or Result); Data to write to dest. register. Decode also produces the Control Signals (e.g. ALU operation, Read/Write D-Cache, etc.).]
Clock Cycle Time = Sum of delay through worst case pathway = 50 ns (10 ns + 10 ns + 10 ns + 10 ns + 10 ns)
12.15
Processor Execution (add)
[Figure: Processor block diagram executing add %rax,%rdx (Machine Code: 48 01 c2)]
• Fetch: Instruction (Machine Code) read from the I-Cache
• Decode: Operands %rax and %rdx read from the RegFile
• Exec.: ALU computes %rax+%rdx and sets ZF, OF, CF, SF
• Mem: D-Cache not used
• WB: %rdx = %rax+%rdx
12.16
Processor Execution (load)
[Figure: Processor block diagram executing ld 0x40(%rbx),%rax (Machine Code: 48 8b 43 40)]
• Fetch: Instruction (Machine Code) read from the I-Cache
• Decode: Operands %rbx and 0x40
• Exec.: ALU computes addr = %rbx + 0x40
• Mem: D-Cache reads data at addr
• WB: %rax = data
12.17
Processor Execution (store)
[Figure: Processor block diagram executing st %rax,0x40(%rbx) (Machine Code: 48 89 43 40)]
• Fetch: Instruction (Machine Code) read from the I-Cache
• Decode: Operands %rbx, 0x40, and %rax
• Exec.: ALU computes addr = %rbx + 0x40
• Mem: D-Cache writes the %rax data at addr
• WB: Nothing written to the RegFile
12.18
Processor Execution (branch/jump)
[Figure: Processor block diagram executing je L1 (disp. = 0x08) (Machine Code: 74 08)]
• Fetch: Instruction (Machine Code) read from the I-Cache
• Decode: Operands PC and 0x08
• Exec.: ALU computes PC + 0x08
• Condition codes shown: ZF = 1, OF = 0, CF = 1, SF = 0
• The PC is updated to PC + 0x08 when the condition is true
12.19
PIPELINING
12.20
Example
for(i=0; i < 100; i++)
C[i] = (A[i] + B[i]) / 4;
10 ns per input set = 1000 ns total
[Figure: Memory holding arrays A, B, and C, indexed by a counter (Cntr) i. A[i] and B[i] are read from memory for each i.]
12.21
Pipelining Example
for(i=0; i < 100; i++)
  C[i] = (A[i] + B[i]) / 4;
        | Stage 1     | Stage 2
Clock 0 | A[0] + B[0] |
Clock 1 | A[1] + B[1] | (A[0] + B[0]) / 4
Clock 2 | A[2] + B[2] | (A[1] + B[1]) / 4
• Pipelining refers to insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e. create an assembly line)
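A back-of-the-envelope timing comparison can be sketched in C. The 10 ns per input set and the 100 iterations come from the example. Splitting the logic into two balanced 5 ns stages with no register overhead is an assumption, and the function names are hypothetical.

```c
/* Total time (ns) when each input set passes through all of the logic
 * before the next one starts. */
double unpipelined_ns(int n_inputs, double logic_delay_ns)
{
    return n_inputs * logic_delay_ns;
}

/* Total time (ns) for an ideal pipeline: the first result needs
 * n_stages cycles, then one more result completes every cycle. */
double pipelined_ns(int n_inputs, int n_stages, double stage_delay_ns)
{
    return (n_stages + n_inputs - 1) * stage_delay_ns;
}
```

With these assumptions, `unpipelined_ns(100, 10.0)` reproduces the 1000 ns total, while `pipelined_ns(100, 2, 5.0)` comes to 505 ns. That is close to the ideal 2x speedup for a 2-stage pipeline.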
12.22
Need for Registers
• Provides separation between combinational functions
– Without registers, fast signals could “catch-up” to data values in the next operation stage
• Performing an operation yields signals with different paths and delays
• We don’t want signals from two different data values mixing. Therefore we must collect and synchronize the values from the previous operation before passing them on to the next
[Figure: Two clocked registers (CLK) separating logic stages. Signal i takes 5 ns and Signal j takes 2 ns.]
12.23
Processors & Pipelines
• Overlaps execution of multiple instructions
• Natural breakdown into stages
– Fetch, Decode, Execute
• Fetch an instruction, while decoding another, while executing another
Pipelining (Instruction View)
       | CLK 1 | CLK 2  | CLK 3   | CLK 4
Inst 1 | Fetch | Decode | Execute |
Inst 2 |       | Fetch  | Decode  | Execute
Inst 3 |       |        | Fetch   | Decode
Inst 4 |       |        |         | Fetch
Pipelining (Stage View)
      | Fetch   | Decode  | Exec.
Clk 1 | Inst. 1 |         |
Clk 2 | Inst. 2 | Inst. 1 |
Clk 3 | Inst. 3 | Inst. 2 | Inst. 1
Clk 4 | Inst. 4 | Inst. 3 | Inst. 2
Clk 5 | Inst. 5 | Inst. 4 | Inst. 3
12.24
Balancing Pipeline Stages
• Clock period must equal the LONGEST delay from register to register
• Fig. 1: If total logic delay is 20ns => 50MHz
– Throughput: 1 instruc. / 20 ns
• Fig. 2: Unbalanced stage delays limit the clock speed to the slowest stage (worst case)
– Throughput: 1 instruc. / 10 ns => 100MHz
• Fig. 3: Better to split into more, balanced stages
– Throughput: 1 instruc. / 5 ns => 200MHz
• Fig. 4: Are more stages better?
– Ideally: 2x stages => 2x throughput
– Throughput: 1 instruc. / 2.5 ns => 400MHz
– Each register adds extra delay so at some point deeper pipelines don't pay off
[Fig. 1: Processor Logic (Fetch + Decode + Execute) followed by a register, 20 ns]
[Fig. 2: Fetch (5 ns), Decode (5 ns), Exec (10 ns), each followed by a register]
[Fig. 3: Fetch, Decode, and two Exec. stages, 5 ns each, separated by registers]
[Fig. 4: Eight stages F1, F2, D1, D2, E1a, E1b, E2a, E2b, 2.5 ns each]
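The rule that the slowest stage sets the clock can be sketched in C as below. The `reg_overhead_ns` parameter is an assumption added to model the "each register adds extra delay" remark. The slides give no value for it, and the function names are hypothetical.

```c
#include <stddef.h>

/* Clock period (ns) = slowest stage delay + per-register overhead. */
double clock_period_from_stages(const double *stage_ns, size_t n,
                                double reg_overhead_ns)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (stage_ns[i] > worst) {
            worst = stage_ns[i];
        }
    }
    return worst + reg_overhead_ns;
}

/* Throughput in MHz: one instruction per clock period (period in ns). */
double throughput_mhz(double period_ns)
{
    return 1000.0 / period_ns;
}
```

With zero overhead, stage delays of 5, 5, and 10 ns give a 10 ns period (100 MHz), and four 5 ns stages give 200 MHz, matching Fig. 2 and Fig. 3. Any nonzero overhead pushes the eight-stage case below the ideal 400 MHz, which is the diminishing-returns effect.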
12.25
Balancing Pipeline Stages
Main Points:
• Latency of any single instruction is unaffected
• Throughput and thus overall program performance can be dramatically improved
– Ideally a K-stage pipeline will lead to a throughput increase by a factor of K
– Reality is splitting stages adds some delay and thus hits a point of diminishing returns
[Fig. 1: Non-pipelined Processor Logic (Fetch + Decode + Execute), 20 ns (Latency = 20ns, Throughput = 1x)]
[Fig. 2: 4 Stage Pipeline, 5 ns per stage (Latency = 20ns, Throughput = 4x)]
[Fig. 3: 8 Stage Pipeline (F1, F2, D1, D2, E1a, E1b, E2a, E2b), 2.5 ns per stage (Latency = 20ns, Throughput = 8x)]
12.26
5-Stage Pipeline
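[Figure: 5-stage pipeline: Fetch (PC, I-Cache), Decode (Decode logic, Reg. File), Exec. (ALU with condition codes ZF, OF, CF, SF), Mem (D-Cache), WB. The pipeline registers between stages hold, in order: Instruction (Machine Code) | Operands & Control Signals | ALU Output (Addr. or Result) | Result of Instruction.]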
12.27
Pipelining
• Let's see how a sequence of instructions can be executed
Instruction
ld 0x8(%rbx), %rax
add %rcx,%rdx
je L1
12.28
Sample Sequence - 1
[Figure: 5-stage pipeline snapshot]
• Fetch (LD): Fetch LD
• Pipeline registers: Instruction (Machine Code) | Operands & Control Signals | ALU Output (Addr. or Result) | Result of Instruction
12.29
Sample Sequence - 2
[Figure: 5-stage pipeline snapshot]
• Fetch (ADD): Fetch ADD
• Decode (LD): Decode instruction and fetch operands
• Pipeline registers: LD 0x8(%rbx),%rax | Operands & Control Signals | ALU Output (Addr. or Result) | Result of Instruction
12.30
Sample Sequence - 3
[Figure: 5-stage pipeline snapshot]
• Fetch (JE): Fetch JE
• Decode (ADD): Decode instruction and fetch operands
• Exec. (LD): Add displacement 0x8 to %rbx
• Pipeline registers: ADD %rcx,%rdx | 0x8 / %rbx / READ | ALU Output (Addr. or Result) | Result of Instruction
12.31
Sample Sequence - 4
[Figure: 5-stage pipeline snapshot]
• Fetch (i+1): Fetch instruc i+1
• Decode (JE): Decode instruction and fetch operands
• Exec. (ADD): Add %rcx+%rdx
• Mem (LD): Read word from memory
• Pipeline registers: JE / displacement | %rcx / %rdx / ADD | %rbx+0x8 / READ | Result of Instruction
12.32
Sample Sequence - 5
[Figure: 5-stage pipeline snapshot]
• Fetch (i+2): Fetch next instruc i+2
• Decode (i+1): Decode instruction i+1 and fetch operands
• Exec. (JE): Check if condition is true
• Mem (ADD): Just pass sum to next stage
• WB (LD): Write word to %rax
• Pipeline registers: Instruc. i+1 Machine Code | PC / Displacement | %rdx / %rcx + %rdx | %rax / Value from Memory
12.33
Sample Sequence - 6
[Figure: 5-stage pipeline snapshot]
• Fetch (i+3): Fetch next instruc i+3
• Decode (i+2): Decode instruction i+2 and fetch operands
• Exec. (i+1): Use the ALU
• Mem (JE): Update PC
• WB (ADD): Write word to %rdx
• Pipeline registers: Instruc. i+2 Machine Code | Instruc i+1 operands | New PC | %rdx / sum
12.34
Sample Sequence - 7
[Figure: 5-stage pipeline snapshot]
• Fetch (target): Fetch the instruction at the jump target
• Decode (i+3): Delete i+3
• Exec. (i+2): Delete i+2
• Mem (i+1): Delete i+1
• WB (JE): Do nothing
• Pipeline registers: Deleted | Deleted | Deleted | %rdx / sum
12.35
HAZARDS
Problems from overlapping instruction execution…
12.36
Hazards
• Hazards prevent parallel or overlapped execution!
• Control Hazards
– Problem: We don't know what instruction to fetch next, but we need to fetch something
– Examples: Jumps (branches) and calls
• Data Hazards / Data Dependencies
– Problem: A later instruction needs data from a previous instruction
– Examples:
  • sub %rdx,%rax
  • add %rax,%rcx
• Structural Hazards
– Problem: Due to limited resources, the HW doesn't support overlapping a certain sequence of instructions
– Examples: See next slides
12.37
Structural Hazards
• Example structural hazard: A single cache rather than separate instruction & data caches
– Structural hazard any time an instruction needs to perform a data access (i.e. ld or st) since we always want to fetch a new instruction each clock cycle
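[Figure: Pipeline with a single shared Cache, PC, Reg. File, and ALU. LD is in the Mem stage accessing the cache for data while i+3 is being fetched from the same cache (i+2 and i+1 are in Decode and Exec.). Hazard!]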
12.38
Data Hazard - 1
sub %rdx,%rax
add %rax,%rcx
[Figure: 5-stage pipeline snapshot]
• Fetch (i+1): Fetch i+1
• Decode (ADD): Decode and get register operands (Do we get the desired %rax value?)
• Exec. (SUB): Perform %rax-%rdx
• Pipeline registers: ADD %rax,%rcx | %rdx / %rax / SUB | ALU Output (Addr. or Result) | Result of Instruction
12.39
Data Hazard - 2
sub %rdx,%rax
add %rax,%rcx
[Figure: 5-stage pipeline snapshot]
• Fetch (i+2): Fetch i+2
• Decode (i+1): Decode i+1
• Exec. (ADD): Perform %rax+%rcx using the wrong value!
• Mem (SUB): New value for %rax has not been written back yet
• Pipeline registers: i+1 | %rax / %rcx / ADD | New value for %rax | Result of Instruction
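The dependency check behind this hazard amounts to comparing register numbers. The following C sketch assumes a made-up register numbering and an `Instr` record with two sources and one destination. Real decode logic compares the register fields of the instruction encodings.

```c
#include <stdbool.h>

/* Hypothetical register numbering for this sketch. */
enum { RAX, RBX, RCX, RDX, NO_REG = -1 };

typedef struct {
    int src1; /* first source register, or NO_REG  */
    int src2; /* second source register, or NO_REG */
    int dest; /* destination register, or NO_REG   */
} Instr;

/* True when `later` reads a register that `earlier` writes,
 * i.e. a read-after-write data dependency. */
bool raw_hazard(Instr earlier, Instr later)
{
    if (earlier.dest == NO_REG) {
        return false;
    }
    return later.src1 == earlier.dest || later.src2 == earlier.dest;
}
```

For the pair on this slide, sub %rdx,%rax would be `{RDX, RAX, RAX}` and add %rax,%rcx would be `{RAX, RCX, RCX}`. `raw_hazard(sub, add)` is true because the add reads the %rax that the sub writes.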
12.40
Stalling
• Solution 1: Halt/Stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance
sub %rdx,%rax
add %rax,%rcx
[Figure: 5-stage pipeline snapshot]
• Fetch (i+1)
• Decode (ADD)
• Exec. (nop)
• Mem (nop)
• WB (SUB)
• Pipeline registers: ADD %rax,%rcx | nop | nop | New value of %rax
12.41
Forwarding
• Solution 2: Create new hardware paths to hand-off (forward) the data from the producing instruction in the pipeline to the consuming instruction
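sub %rdx,%rax
add %rax,%rcx
[Figure: 5-stage pipeline snapshot with a forwarding path from the new value for %rax back to the ALU input]
• Fetch (i+2)
• Decode (i+1)
• Exec. (ADD)
• Mem (SUB)
• WB
• Pipeline registers: i+1 | %rax / %rcx / ADD | New value for %rax | Result of Instruction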
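The forwarding decision can be pictured as a multiplexer in front of each ALU input. The C sketch below is one plausible way to express that choice. The `StageResult` struct and `forward_operand` function are hypothetical names, and the priority order (newest in-flight result first) is an assumption consistent with the figure.

```c
#include <stdbool.h>
#include <stdint.h>

/* A result sitting in a later pipeline stage, not yet written back. */
typedef struct {
    bool     valid; /* stage holds an instruction that writes a register */
    int      dest;  /* destination register number                       */
    uint64_t value; /* result already produced by that instruction       */
} StageResult;

/* Choose the ALU operand for register `src`: prefer the newest
 * in-flight result (Mem stage), then the older one (WB stage),
 * and otherwise use the value read from the register file. */
uint64_t forward_operand(int src, uint64_t regfile_value,
                         StageResult mem, StageResult wb)
{
    if (mem.valid && mem.dest == src) {
        return mem.value;
    }
    if (wb.valid && wb.dest == src) {
        return wb.value;
    }
    return regfile_value;
}
```

In the sub/add example, the SUB result is in the Mem stage when the ADD executes. The first branch hands the new %rax to the ALU, where the register file would still supply the stale value.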
12.42
Solving Data Hazards
• Key Point: Data dependencies (i.e. instructions needing values produced by earlier ones) limit performance
• Forwarding solves many of the data hazards (data dependencies) that exist
– It allows instructions to continue to flow through the pipeline without the need to stall and waste time
– The cost is additional hardware and added complexity
• Even forwarding cannot solve all the issues
– A data hazard still forces a stall when a LD reads a value needed by the next instruction
12.43
LD + Dependent Instruction Hazard
• Even forwarding cannot prevent the need to stall when a Load instruction produces a value needed by the instruction behind it
– Would require performing 2 cycles worth of work in only a single cycle
ld 8(%rdx),%rax
add %rax,%rcx
[Figure: 5-stage pipeline snapshot. The new value for %rax only becomes available after the LD finishes its Mem stage.]
• Fetch (i+2)
• Decode (i+1)
• Exec. (ADD)
• Mem (LD)
• WB
• Pipeline registers: i+1 | old %rax / %rcx / ADD | Address (8+%rdx) / READ | Result of Instruction
12.44
LD + Dependent Instruction Hazard
• We would need to introduce 1 stall cycle (nop) into the pipeline to get the timing correct
• Keep this in mind as we move through the next slides
ld 8(%rdx),%rax
add %rax,%rcx
[Figure: 5-stage pipeline snapshot with 1 nop inserted]
• Fetch (i+2)
• Decode (i+1)
• Exec. (ADD)
• Mem (nop)
• WB (LD): New value for %rax
• Pipeline registers: i+1 | old %rax / %rcx / ADD | nop | New Value of %rax
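The stall counts shown so far can be collected into a small C helper. The numbers follow the pipeline as drawn on these slides (2 nops on the Stalling slide, 1 nop for a load with forwarding) and apply only to a dependent instruction immediately behind its producer. The function name is hypothetical.

```c
#include <stdbool.h>

/* Stall cycles (nops) inserted between a producer and the dependent
 * instruction immediately behind it, for the pipeline drawn here. */
int stall_cycles(bool producer_is_load, bool has_forwarding)
{
    if (!has_forwarding) {
        /* Consumer waits in Decode until the producer reaches WB. */
        return 2;
    }
    /* With forwarding, an ALU result is handed over in time,
     * but a loaded value still arrives one cycle too late. */
    return producer_is_load ? 1 : 0;
}
```

For sub followed by add, this gives 2 stall cycles without forwarding and 0 with it. For ld followed by add it gives 1 even with forwarding.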
12.45
Control Hazards
• Branches/Jumps require us to know
– Where we want to jump to (aka branch/jump target location)…really just the new value of the PC
– If we should branch or not (checking the jump condition)
• Problem: We often don't know those values until deep in the pipeline and thus we are not sure what instructions should be fetched in the interim
– Requires us to flush unwanted instructions and waste time
12.46
Control Hazard - 1
[Figure: 5-stage pipeline snapshot]
• Fetch (i+3): Fetch next instruc i+3
• Decode (i+2): Decode instruction i+2 and fetch operands
• Exec. (i+1): Use the ALU
• Mem (JE): Update PC
• WB (ADD): Write word to %rdx
• Pipeline registers: Instruc. i+2 Machine Code | Instruc i+1 operands | New PC | %rdx / sum
12.47
Control Hazard - 2
[Figure: 5-stage pipeline snapshot]
• Fetch (target): Fetch the instruction at the jump target
• Decode (i+3): Delete i+3
• Exec. (i+2): Delete i+2
• Mem (i+1): Delete i+1
• WB (JE): Do nothing
• Pipeline registers: Deleted | Deleted | Deleted | %rdx / sum
Need to "flush" wrongly fetched instructions
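The cost of these flushes can be estimated with a little arithmetic, sketched here in C. The penalty of 3 comes from the three deleted instructions (i+1, i+2, i+3) in the figure. The function names are hypothetical, and the CPI formula assumes one cycle per instruction with no other stalls.

```c
/* Instructions flushed per taken branch in the pipeline drawn above:
 * i+1, i+2 and i+3 are deleted once JE updates the PC in Mem. */
#define FLUSH_PENALTY_CYCLES 3

/* Total cycles wasted, given how many executed branches were taken. */
long wasted_cycles(long taken_branches)
{
    return taken_branches * FLUSH_PENALTY_CYCLES;
}

/* Average cycles per instruction, assuming 1 cycle per instruction
 * apart from flushes (no data or structural stalls). */
double average_cpi(long instructions, long taken_branches)
{
    return 1.0 + (double)wasted_cycles(taken_branches) / (double)instructions;
}
```

For instance, 250 taken branches in a run of 1000 instructions waste 750 cycles under these assumptions, raising the average from 1 to 1.75 cycles per instruction.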
12.48
A FIRST LOOK: CODE REORDERING
Enlisting the help of the compiler
12.49
Two Sides of the Coin
• If the hardware has some problems it just can't solve, can software (i.e. the compiler) help?
– Yes!!
• Compilers can re-order instructions to take best advantage of the processor (pipeline) organization
• Identify the dependencies that will incur stalls and slow performance
– Load followed by add
– Jump instructions
void sum(int* data, int n, int x)
{
  for(int i=0; i < n; i++){
    data[i] += x;
  }
}
sum:
  mov $0x0,%ecx
L1:
  cmp %esi,%ecx
  jge L2
  ld 0(%rdi), %eax
  add %edx, %eax
  st %eax, 0(%rdi)
  add $4, %rdi
  add $1, %ecx
  j L1
L2:
  retq
C code and its assembly translation
12.50
How Can the Compiler Help
• Compilers are written with general parsing and semantic representation front ends but architecture-specific backends that generate code optimized for a particular processor
• Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high level code indicates
• A: By finding independent instructions and reordering the code
– Could we have moved any other instruction into that slot? No!
Original Code (incurring 1 stall cycle)
sum:
  mov $0x0,%ecx
L1:
  cmp %esi,%ecx
  jge L2
  ld 0(%rdi), %eax
  stall/nop
  add %edx, %eax
  st %eax, 0(%rdi)
  add $4, %rdi
  add $1, %ecx
  j L1
L2:
  retq
Updated Code (w/ Compiler reordering)
sum:
  mov $0x0,%ecx
L1:
  cmp %esi,%ecx
  jge L2
  ld 0(%rdi), %eax
  add $1, %ecx
  add %edx, %eax
  st %eax, 0(%rdi)
  add $4, %rdi
  j L1
L2:
  retq
12.51
Taken or Not Taken: Branch Behavior
• When a conditional jump/branch is
– True, we say it is Taken
– False, we say it is Not Taken
• Currently our pipeline will fetch sequentially and then potentially flush if the branch is taken
– Effectively, our pipeline "predicts" that each branch is Not Taken
• The j L1 instruction is always taken and thus will incur wasted clock cycles each time it is executed
• Most of the time the jge L2 will be not taken and perform well
C code and its assembly translation
sum:
  mov $0x0,%ecx
L1:
  cmp %esi,%ecx
  jge L2            # NT most of the time
  ld 0(%rdi), %eax
  add $1, %ecx
  add %edx, %eax
  st %eax, 0(%rdi)
  add $4, %rdi
  j L1              # T every time
L2:
  retq
12.52
Branch Delay Slots
• Problem: After a jump/branch we fetch instructions that we are not sure should be executed
• Idea: Find an instruction(s) that should ALWAYS be executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is
• Branch delay slot(s) = # of instructions that the HW will always execute (not flush) after a jump/branch instruction
12.53
Branch Delay Slot Example
Assume a single instruction delay slot
“Before” Code
ld 0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rdx
je NEXT
delay slot instruc.
NOT TAKEN CODE…
NEXT:
TAKEN CODE
“After” Code
ld 0(%rdi), %rcx
cmp %rcx, %rdx
je NEXT
add %rbx, %rax      # now in the delay slot
NOT TAKEN CODE…
NEXT:
TAKEN CODE
Move an ALWAYS executed instruction down into the delay slot and let it execute no matter what
[Figure: Flowchart perspective of the delay slot: ld 0(%rdi), %rcx; add %rbx, %rax; cmp %rcx, %rdx; then je. A Delay Slot follows on both the Taken (T) and Not Taken (NT) paths, before the Taken Path Code or the Not Taken Path Code.]
12.54
Implementing Branch Delay Slots
• HW will define the number of branch delay slots (usually a small number…1 or 2)
• Compiler will be responsible for arranging instructions to fill the delay slots
– Must find instructions that the branch does NOT DEPEND on
– If no instructions can be rearranged, can always insert 'nop' and just waste those cycles
ld 0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rdx
je NEXT
delay slot instruc.
Cannot move ‘ld’ into delay slot because je needs the %rcx value generated by it
ld 0(%rdi), %rcx
add %rbx, %rax
cmp %rcx, %rax
je NEXT
delay slot instruc.
If no instruction can be found a 'nop' can be inserted by the compiler
12.55
A Look Ahead: Branch Prediction
• Currently our pipeline assumes Not Taken and fetches down the sequential path after a jump/branch
• Could we build a pipeline that could predict taken?
– Not yet! Location to jump to (branch target) not known until later stages
• But suppose we could overcome those problems, would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?
• We could allow a static prediction per instruction (give a hint with the branch that indicates T or NT)
• We could allow dynamic prediction per instruction (use its runtime history)
[Figure: Loops. A Loop Body ends in a loop branch (T: loop, NT: done) followed by the NT Code. High probability of being Taken. Prediction can be static.]
[Figure: If Statements. An if..else branch (T: if, NT: else) leads to the T Code or the NT Code and then to the After Code. May exhibit data dependent behavior. Prediction may need to be dynamic.]
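As one illustration of dynamic prediction from runtime history, here is a C sketch of a 2-bit saturating counter, a commonly taught scheme. The slides do not name a specific predictor, so this choice and the names `Predictor`, `predict_taken`, `predictor_update`, and `count_correct` are assumptions.

```c
#include <stdbool.h>

/* 2-bit saturating counter: 0 and 1 predict Not Taken,
 * 2 and 3 predict Taken. */
typedef struct {
    unsigned counter; /* always in the range 0..3 */
} Predictor;

bool predict_taken(const Predictor *p)
{
    return p->counter >= 2;
}

/* Train with the actual outcome once the branch resolves. */
void predictor_update(Predictor *p, bool taken)
{
    if (taken) {
        if (p->counter < 3) p->counter++;
    } else {
        if (p->counter > 0) p->counter--;
    }
}

/* Count correct predictions over a recorded outcome history. */
int count_correct(Predictor *p, const bool *outcomes, int n)
{
    int correct = 0;
    for (int i = 0; i < n; i++) {
        if (predict_taken(p) == outcomes[i]) {
            correct++;
        }
        predictor_update(p, outcomes[i]);
    }
    return correct;
}
```

Starting from a counter of 0, a loop-like history of four Taken outcomes, one Not Taken, then three more Taken yields 5 correct predictions out of 8. The two warm-up misses and the single loop exit are the only mistakes, and the lone Not Taken does not flip the prediction for the next iteration.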
12.56
Demo
12.57
Summary 1
• Pipelining is an effective and important technique to improve the throughput of a processor
• Overlapping execution creates hazards which lead to stalls or wasted cycles
– Data, Control, Structural
– More hardware can be added to attempt to mitigate the stalls (e.g. forwarding)
• The compiler can help reorder code to avoid stalls and perform useful work (e.g. delay slots)