cs356 unit 12 · 12.3 logic circuits • combinational logic – performs a specific function...

12.1

CS356 Unit 12

Processor Hardware Organization

Pipelining

12.2

BASIC HW

12.3

Logic Circuits

• Combinational logic

– Performs a specific function (mapping of 2^n input combinations to desired output combinations)

– No internal state or feedback

• Given a set of inputs, we will always get the same output after some time (propagation) delay

• Sequential logic (Storage devices)

– Registers are the fundamental building blocks

• Remembers a set of bits for later use

• Acts like a variable from software

• Controlled by a "clock" signal

Combinational logic: outputs depend only on current inputs

[Figure: Inputs → Combinational Logic (usually operations like +, -, *, /, &, |, <<) → Outputs]

Sequential logic: outputs depend on current inputs and previous inputs (previous inputs summarized via state)

[Figure: current inputs and the fed-back state enter the Combinational Logic, which produces the outputs; a register holding "state" feeds sequential values back and provides "memory"]

12.4

Combinational Logic Gates

• Main Idea: Circuits called "gates" perform logic operations to produce desired outputs from some digital inputs

[Figure: example circuit with signals N, P, R, V, TB, and C wired through OR, AND, and NOT gates, annotated with sample 0/1 values on each wire]

12.5

Propagation Delay

• Main Idea: All digital logic circuits have propagation delay

– Time it takes for output to change when inputs are changed

[Figure: gate circuit with signals N, P, R, V, TB, and C after an input changes: 4 gate delays for the input to propagate to the outputs]

12.6

Combinational Logic Functions

• Map input combinations of n-bits to desired m-bit output

• Can describe function with a truth table and then find its circuit implementation

[Figure: Inputs → Logic Circuit → Outputs]

IN0  IN1  IN2 | OUT0  OUT1
 0    0    0  |  0     0
 0    0    1  |  1     0
 …
 1    1    1  |  0     1

[Figure: gate-level circuit taking In0, In1, In2 and producing Out1]

12.7

ALUs

• Perform a selected operation on two input numbers.

– FS[5:0] select the desired operation

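[Figure: ALU block with ports A, B, C0, RES, OF, ZF, CF. The 32-bit inputs are X[31:0] and Y[31:0], the 32-bit output is RES[31:0], and the 6-bit function select Func is FS[5:0].]

Func. Code | Op.           | Func. Code | Op.
00_0000    | A SHL B       | 10_0000    | A+B
00_0010    | A SHR B       | 10_0010    | A-B
00_0011    | A SAR B       | …          | …
…          | …             | 10_0100    | A AND B
01_1000    | A * B         | 10_0101    | A OR B
01_1001    | A * B (uns.)  | 10_0110    | A XOR B
01_1010    | A / B         | 10_0111    | A NOR B
01_1011    | A / B (uns.)  | …          | …
…          | …             | 10_1010    | A < B

Example values shown on the diagram: FS = 100010 with operands 0x80000000 and 0x00000001, RES = 0x7fffffff, and flag values 1, 0, 0.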
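The function-select idea can be sketched in C. This is only an illustrative model, not the course's ALU: the type `alu_out_t`, the function `alu`, and the `FS_*` names are hypothetical, only a few codes from the table are modeled, and the carry flag on subtraction is assumed to mean "borrow" (the x86 convention).

```c
#include <stdint.h>

/* Function-select codes copied from the FS[5:0] table (e.g. 10_0010 = 0x22). */
enum { FS_SHL = 0x00, FS_SHR = 0x02, FS_ADD = 0x20, FS_SUB = 0x22,
       FS_AND = 0x24, FS_OR  = 0x25, FS_XOR = 0x26, FS_NOR = 0x27 };

/* Hypothetical result bundle: 32-bit result plus the ZF/OF/CF flags. */
typedef struct { uint32_t res; int zf; int of; int cf; } alu_out_t;

alu_out_t alu(uint32_t a, uint32_t b, unsigned fs)
{
    alu_out_t o = {0, 0, 0, 0};
    switch (fs) {
    case FS_SHL: o.res = a << (b & 31); break;
    case FS_SHR: o.res = a >> (b & 31); break;
    case FS_ADD:
        o.res = a + b;
        o.cf = o.res < a;                              /* unsigned carry out */
        o.of = (int)((~(a ^ b) & (a ^ o.res)) >> 31);  /* signed overflow */
        break;
    case FS_SUB:
        o.res = a - b;
        o.cf = a < b;                                  /* assumed: CF = borrow */
        o.of = (int)(((a ^ b) & (a ^ o.res)) >> 31);   /* signed overflow */
        break;
    case FS_AND: o.res = a & b;    break;
    case FS_OR:  o.res = a | b;    break;
    case FS_XOR: o.res = a ^ b;    break;
    case FS_NOR: o.res = ~(a | b); break;
    default:     break;            /* codes not modeled in this sketch */
    }
    o.zf = (o.res == 0);
    return o;
}
```

Calling `alu(0x80000000, 0x00000001, FS_SUB)` in this model gives `res = 0x7fffffff` with `of = 1`, `zf = 0`, `cf = 0`.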

12.8

Sequential Devices (Registers)

• Registers capture the D input value when a control input (aka the clock signal) transitions from 0->1 (clock edge) and store that value at the Q output until the next clock edge

• A register is like a variable in software. It stores a value for later use

• We can choose to only clock the register at "certain" times when we want the register to capture a new value (i.e. when it is the destination of an instruction)

• Key Idea: Registers store data while we operate on those values

[Figure: block diagram of a register with data input D and data output Q (each could be many bits) and a clock pulse input CP]

[Timing diagram: some input value d(t) changes over time through d(1), d(2), …, d(12). Clock pulses (positive edges) arrive at t = 1 ns, 5 ns, 7 ns, and 10 ns. q(t) is unknown at t = 0 ns and then holds d(1), d(5), d(7), and d(10) in turn: the clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.]

[Figure: register %rax receiving the ALU sum while executing add %rbx,%rax followed by add %rcx,%rax]
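The sample-and-hold behavior can be modeled in a few lines of C. This is a minimal software sketch, not a hardware description; the names `edge_register` and `reg_update` are made up for the illustration.

```c
#include <stdint.h>

/* Hypothetical model of a positive-edge-triggered register. */
typedef struct {
    uint64_t q;      /* value currently held at the Q output */
    int prev_clk;    /* clock level seen on the previous call */
} edge_register;

/* Present the current clock level and D input. Q changes only when the
   clock goes from 0 to 1; otherwise the old value is held. */
uint64_t reg_update(edge_register *r, int clk, uint64_t d)
{
    if (r->prev_clk == 0 && clk == 1)
        r->q = d;            /* rising edge: sample D */
    r->prev_clk = clk;
    return r->q;             /* hold until the next rising edge */
}
```

Feeding a changing `d` while `clk` stays high or low leaves `q` untouched; only the call that sees the 0->1 transition captures a new value.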

12.9

Clock Signal

• Alternating high/low voltage pulse train

• Controls the ordering and timing of operations performed in the processor

• 1 cycle is usually measured from rising edge to rising edge

• Clock frequency = # of cycles per second (e.g. 2.8 GHz = 2.8 * 10^9 cycles per second)

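[Figure: clock signal driving the processor, alternating between 0 (0V) and 1 (5V); 1 cycle spans rising edge to rising edge, with Op. 1, Op. 2, Op. 3 performed in successive cycles]

2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle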

12.10

FROM X86 TO RISC

Basic HW organization for a simplified instruction set

12.11

From CISC to RISC

• Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute

– Benefit is fewer instructions are needed to accomplish a task

• Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps

– It often requires more instructions to describe the overall task (larger code size)

• See example to the right

• RISC makes the hardware design easier so let's tweak our x86 instructions to be more RISC-like

CISC vs. RISC Equivalents

// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax

// RISC Equiv. w/ 1 mem. or ALU op. per instruction
mov %rsi, %rbx      # use %rbx as a temp.
shl $2, %rbx        # %rsi * 4
add %rdi, %rbx      # %rdi + (%rsi*4)
add $0x40, %rbx     # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax    # %rax = *%rbx
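To see that the RISC-style sequence really computes the same address as the single CISC addressing mode, here is a small C sketch. The function names `cisc_address` and `risc_address` are hypothetical, and each C statement stands in for one ALU instruction of the sequence.

```c
#include <stdint.h>

/* What the addressing mode disp(base, index, scale) computes in one step. */
uint64_t cisc_address(uint64_t base, uint64_t index, uint64_t scale, uint64_t disp)
{
    return disp + base + index * scale;
}

/* The same address for 0x40(%rdi, %rsi, 4), built one ALU operation at a time. */
uint64_t risc_address(uint64_t rdi, uint64_t rsi)
{
    uint64_t rbx = rsi;   /* mov %rsi, %rbx   (use %rbx as a temp) */
    rbx <<= 2;            /* shl by 2         (%rsi * 4) */
    rbx += rdi;           /* add %rdi, %rbx   (%rdi + %rsi*4) */
    rbx += 0x40;          /* add $0x40, %rbx  (0x40 + %rdi + %rsi*4) */
    return rbx;           /* address then used by the final load */
}
```

For example, with `%rdi = 0x1000` and `%rsi = 3`, both functions return `0x104c`.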

12.12

A RISC Subset of x86

• Split mov instructions that access memory into separate instructions:

– ld = Load/Read from memory

– st = Store/Write to memory

• Limit ld & st instructions to use at most indirect w/ displacement

– No ld 0x04(%rdi, %rsi, 4), %rax

• Too much work

– At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)

• Limit arithmetic & logic instructions to only operate on registers

– No add (%rsp), %rax since this implicitly accesses (dereferences) memory

– Only add %reg1, %reg2

// CISC instruction
add %rax, (%rsp)

// Equiv. RISC sequence (w/ ld and st)
ld 0(%rsp), %rbx
add %rax, %rbx
st %rbx, 0(%rsp)

// 3 x86 memory read instructions
mov (%rdi), %rax             // 1
mov 0x40(%rdi), %rax         // 2
mov 0x40(%rdi,%rsi), %rax    // 3

// Equiv. load sequences
ld 0x0(%rdi), %rax           // 1
ld 0x40(%rdi), %rax          // 2
mov %rsi, %rbx               // 3a
add %rdi, %rbx               // 3b
ld 0x40(%rbx), %rax          // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)             // 1
mov %rax, 0x40(%rdi)         // 2
mov %rax, 0x40(%rdi,%rsi)    // 3

// Equiv. store sequences
st %rax, 0x0(%rdi)           // 1
st %rax, 0x40(%rdi)          // 2
mov %rsi, %rbx               // 3a
add %rdi, %rbx               // 3b
st %rax, 0x40(%rbx)          // 3c

12.13

Developing a Processor Organization

• Identify which hardware components each instruction type would use and in what order: ALU-Type, LD, ST, Jump

[Figure: hardware components: PC; I-Cache / I-MEM (Addr., Data); Registers (%rax, %rbx, etc.) (aka RegFile); ALU (Res., Zero); Cond. Codes; D-Cache / D-MEM (Addr., Data)]

ALU-Type: add %rax,%rbx

1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax,%rbx
4. ALU: Sum %rax+%rbx
5. Registers: Save result to %rbx

LD: ld 8(%rax),%rbx

1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax
4. ALU: Sum %rax+8
5. D-Cache: Read data
6. Registers: Save data to %rbx

ST: st %rbx, 8(%rax)

1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax
4. ALU: Sum %rax+8
5. D-Cache: Write %rbx data

JE: je label/displacement

1. PC: Addr. of Instruc
2. I-Cache: Fetch Instruc
3. Registers: Get %rax
4. ALU: If cond=TRUE, PC = PC+disp.

12.14

Processor Block Diagram

[Figure: processor block diagram. PC → I-Cache → Decode and Registers (aka RegFile) → ALU (with condition codes ZF, OF, CF, SF) → D-Cache (Addr, Data) → back to the register file. Values passed between blocks: Instruction (Machine Code); Operands; ALU Output (Addr. or Result); Data to write to dest. register. Decode also produces Control Signals (e.g. ALU operation, Read/Write D-Cache, etc.)]

Stage:  Fetch   Decode  Exec.   Mem     WB
Delay:  10 ns   10 ns   10 ns   10 ns   10 ns

Clock Cycle Time = Sum of delay through worst case pathway = 50 ns

12.15

Processor Execution (add)

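add %rax,%rdx [Machine Code: 48 01 c2]

[Figure: processor block diagram (PC, I-Cache, Decode, Registers aka RegFile, ALU with ZF, OF, CF, SF, D-Cache) tracing the add. The register file supplies %rax and %rdx, the ALU computes %rax+%rdx, and the result is written back to the register file: %rdx = %rax+%rdx]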

12.16

Processor Execution (load)

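ld 0x40(%rbx),%rax [Machine Code: 48 8b 43 40]

[Figure: processor block diagram (PC, I-Cache, Decode, Registers aka RegFile, ALU with ZF, OF, CF, SF, D-Cache) tracing the load. The register file supplies %rbx, the ALU adds 0x40 to form addr, the D-Cache returns the data at that address, and the data is written back to the register file: %rax = data]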

12.17

Processor Execution (store)

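st %rax,0x40(%rbx) [Machine Code: 48 89 43 40]

[Figure: processor block diagram (PC, I-Cache, Decode, Registers aka RegFile, ALU with ZF, OF, CF, SF, D-Cache) tracing the store. The register file supplies %rbx and %rax, the ALU adds 0x40 to %rbx to form addr, and %rax is sent to the D-Cache as the Data to write at that address]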

12.18

Processor Execution (branch/jump)

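je L1 (disp. = 0x08) [Machine Code: 74 08]

[Figure: processor block diagram (PC, I-Cache, Decode, Registers aka RegFile, ALU, D-Cache) tracing the jump. The ALU receives PC and 0x08 and computes PC + 0x08; the condition codes shown are ZF = 1, OF = 0, CF = 1, SF = 0]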
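The four execution walkthroughs (add, load, store, jump) can be condensed into a tiny C model that performs one whole instruction per call. Everything here is a hypothetical simplification for illustration: the names `instr_t`, `cpu_t`, and `step`, the word-addressed 64-entry data memory, the PC counted in instructions rather than bytes, and the jump displacement taken relative to the following instruction.

```c
#include <stdint.h>

enum op { OP_ADD, OP_LD, OP_ST, OP_JE };

/* add src,dst | ld disp(src),dst | st src,disp(dst) | je disp */
typedef struct { enum op op; int src; int dst; int64_t disp; } instr_t;

typedef struct {
    uint64_t pc;        /* index of the next instruction (stand-in for the PC) */
    uint64_t reg[16];   /* register file */
    uint64_t mem[64];   /* word-addressed data memory (stand-in for the D-Cache) */
    int zf;             /* zero flag, the only condition code modeled */
} cpu_t;

void step(cpu_t *c, const instr_t *imem)
{
    instr_t in = imem[c->pc];              /* Fetch: PC -> I-Cache */
    uint64_t next_pc = c->pc + 1;
    uint64_t a = c->reg[in.src];           /* Decode: read register operands */
    uint64_t b = c->reg[in.dst];

    switch (in.op) {
    case OP_ADD: {                         /* Exec: ALU sum, WB: save to dst */
        uint64_t sum = a + b;
        c->zf = (sum == 0);
        c->reg[in.dst] = sum;
        break;
    }
    case OP_LD:                            /* Exec: addr = src + disp, Mem: read, WB */
        c->reg[in.dst] = c->mem[(a + (uint64_t)in.disp) % 64];
        break;
    case OP_ST:                            /* Exec: addr = dst + disp, Mem: write src */
        c->mem[(b + (uint64_t)in.disp) % 64] = a;
        break;
    case OP_JE:                            /* if cond is true, PC = PC + disp */
        if (c->zf) next_pc = c->pc + 1 + (uint64_t)in.disp;
        break;
    }
    c->pc = next_pc;
}
```

Each `case` touches the same components, in the same order, as the corresponding walkthrough: only `OP_LD` and `OP_ST` use the data memory, and only `OP_JE` changes the PC by anything other than one instruction.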

12.19

PIPELINING

12.20

Example

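for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

[Figure: a counter (Cntr) produces i, which indexes arrays A, B, and C in memory; A[i] and B[i] are read and combined to produce C[i]]

10 ns per input set = 1000 ns total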

12.21

Pipelining Example

for(i=0; i < 100; i++)
    C[i] = (A[i] + B[i]) / 4;

          Stage 1        Stage 2
Clock 0   A[0] + B[0]
Clock 1   A[1] + B[1]    (A[0] + B[0]) / 4
Clock 2   A[2] + B[2]    (A[1] + B[1]) / 4

Pipelining refers to insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e. create an assembly line)
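The two-stage overlap can be imitated in software. The sketch below is only an analogy, assuming one hypothetical pipeline register (`sum_reg`) between the adder stage and the divide stage; the function name `pipelined_avg` is invented for the example.

```c
#include <stddef.h>

/* Runs C[i] = (A[i] + B[i]) / 4 through a two-stage pipeline model.
   Returns the number of clock cycles used. */
size_t pipelined_avg(const int *A, const int *B, int *C, size_t n)
{
    int sum_reg = 0;        /* pipeline register between Stage 1 and Stage 2 */
    size_t clocks = 0;

    if (n == 0)
        return 0;
    for (size_t clk = 0; clk < n + 1; clk++) {
        /* Stage 2: divide the sum captured on the previous clock. */
        if (clk >= 1)
            C[clk - 1] = sum_reg / 4;
        /* Stage 1: compute the next sum; the register captures it at the edge. */
        if (clk < n)
            sum_reg = A[clk] + B[clk];
        clocks++;
    }
    return clocks;
}
```

For n inputs this model uses n + 1 clocks: one extra clock to fill the pipeline, after which one result completes per clock.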

12.22

Need for Registers

• Provides separation between combinational functions

– Without registers, fast signals could “catch-up” to data values in the next operation stage

[Figure: two combinational blocks, each followed by a register clocked by CLK. Performing an operation yields signals with different paths and delays (e.g. Signal i takes 5 ns, Signal j takes 2 ns).]

We don’t want signals from two different data values mixing. Therefore we must collect and synchronize the values from the previous operation before passing them on to the next.

12.23

Processors & Pipelines

• Overlaps execution of multiple instructions

• Natural breakdown into stages

– Fetch, Decode, Execute

• Fetch an instruction, while decoding another, while executing another

Pipelining (Instruction View)

          CLK 1    CLK 2    CLK 3     CLK 4
Inst 1    Fetch    Decode   Execute
Inst 2             Fetch    Decode    Execute
Inst 3                      Fetch     Decode
Inst 4                                Fetch

Pipelining (Stage View)

         Fetch     Decode    Exec.
Clk 1    Inst. 1
Clk 2    Inst. 2   Inst. 1
Clk 3    Inst. 3   Inst. 2   Inst. 1
Clk 4    Inst. 4   Inst. 3   Inst. 2
Clk 5    Inst. 5   Inst. 4   Inst. 3
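Both views follow one rule: instruction k enters Fetch on clock k and advances one stage per clock. A short C sketch of that rule follows, assuming no stalls; the names `stage_of` and `total_clocks` are hypothetical.

```c
/* Stage names for the 3-stage pipeline (Fetch, Decode, Execute). */
enum stage { NOT_IN_PIPE = -1, FETCH = 0, DECODE = 1, EXECUTE = 2 };

/* Instructions and clocks are numbered from 1, as in the diagrams. */
enum stage stage_of(int inst, int clk)
{
    int s = clk - inst;              /* instruction k enters Fetch on clock k */
    if (s < FETCH || s > EXECUTE)
        return NOT_IN_PIPE;          /* not fetched yet, or already finished */
    return (enum stage)s;
}

/* Clocks needed to finish n instructions on a k-stage pipeline, no stalls. */
int total_clocks(int n, int k)
{
    return n == 0 ? 0 : n + k - 1;
}
```

For instance, `stage_of(3, 4)` is `DECODE`, and four instructions finish on the 3-stage pipeline after `total_clocks(4, 3) = 6` clocks.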

12.24

Balancing Pipeline Stages

• Clock period must equal the LONGEST delay from register to register

• Fig. 1: If total logic delay is 20ns => 50MHz

– Throughput: 1 instruc. / 20 ns

• Fig. 2: Unbalanced stage delays limit the clock speed to the slowest stage (worst case)

– Throughput: 1 instruc. / 10 ns => 100MHz

• Fig. 3: Better to split into more, balanced stages

– Throughput: 1 instruc. / 5 ns => 200MHz

• Fig. 4: Are more stages better?

– Ideally: 2x stages => 2x throughput

– Throughput: 1 instruc. / 2.5 ns => 400MHz

– Each register adds extra delay so at some point deeper pipelines don't pay off

[Fig. 1: Register → Processor Logic (Fetch + Decode + Execute), 20 ns → Register]

[Fig. 2: Fetch (5 ns) → Register → Decode (5 ns) → Register → Exec (10 ns)]

[Fig. 3: Fetch → Decode → Exec. 1 → Exec. 2, 5 ns each, separated by registers]

[Fig. 4: F1 → F2 → D1 → D2 → E1a → E1b → E2a → E2b, 2.5 ns each, separated by registers]
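The clock-period rule can be written as a short calculation. This is a sketch under stated assumptions: `clock_period_ns` and `freq_mhz` are hypothetical helper names, and `reg_overhead_ns` is an illustrative parameter standing for the extra delay each pipeline register adds (the figures effectively use 0).

```c
#include <stddef.h>

/* Clock period (ns) = longest stage delay plus per-register overhead. */
double clock_period_ns(const double *stage_ns, size_t n, double reg_overhead_ns)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++)
        if (stage_ns[i] > worst)
            worst = stage_ns[i];      /* the slowest stage sets the clock */
    return worst + reg_overhead_ns;
}

/* Clock frequency in MHz for a period given in ns. */
double freq_mhz(double period_ns)
{
    return 1000.0 / period_ns;
}
```

With stage delays {5, 5, 10} and no overhead the period is 10 ns (100 MHz), matching Fig. 2; with four 5 ns stages it is 5 ns (200 MHz), matching Fig. 3. If a hypothetical 0.5 ns register overhead is assumed, eight 2.5 ns stages give a 3 ns period, roughly 333 MHz rather than the ideal 400 MHz, which is the diminishing return the last bullet describes.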

12.25

Balancing Pipeline Stages

Main Points:

• Latency of any single instruction is unaffected

• Throughput and thus overall program performance can be dramatically improved

– Ideally K stage pipeline will lead to throughput increase by a factor of K

– Reality is splitting stages adds some delay and thus hits a point of diminishing returns

[Fig. 1: Non-pipelined: Processor Logic (Fetch + Decode + Execute), 20 ns (Latency = 20ns, Throughput = 1x)]

[Fig. 2: 4 Stage Pipeline: Fetch, Decode, Exec. 1, Exec. 2 at 5 ns each (Latency = 20ns, Throughput = 4x)]

[Fig. 3: 8 Stage Pipeline: F1, F2, D1, D2, E1a, E1b, E2a, E2b at 2.5 ns each (Latency = 20ns, Throughput = 8x)]

12.26

5-Stage Pipeline

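[Figure: 5-stage pipeline. PC → I-Cache → Decode / Reg. File → ALU (ZF, OF, CF, SF) → D-Cache → write back to the Reg. File. The pipeline registers between stages carry, in order: Instruction (Machine Code); Operands & Control Signals; ALU Output (Addr. or Result); Result of Instruction]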

12.27

Pipelining

• Let's see how a sequence of instructions can be executed

Instruction

ld 0x8(%rbx), %rax

add %rcx,%rdx

je L1

12.28

Sample Sequence - 1

Stage    | Instruction | Action
Fetch    | LD          | Fetch LD
Decode   | (empty)     |
Exec.    | (empty)     |
Mem      | (empty)     |
WB       | (empty)     |

12.29

Sample Sequence - 2

Stage    | Instruction | Action
Fetch    | ADD         | Fetch ADD
Decode   | LD          | Decode instruction and fetch operands
Exec.    | (empty)     |
Mem      | (empty)     |
WB       | (empty)     |

Pipeline register after Fetch: LD 0x8(%rbx), %rax

12.30

Sample Sequence - 3

Stage    | Instruction | Action
Fetch    | JE          | Fetch JE
Decode   | ADD         | Decode instruction and fetch operands
Exec.    | LD          | Add displacement 0x8 to %rbx
Mem      | (empty)     |
WB       | (empty)     |

Pipeline registers: after Fetch: ADD %rcx, %rdx; after Decode: 0x8 / %rbx / READ

12.31

Sample Sequence - 4

Stage    | Instruction | Action
Fetch    | i+1         | Fetch instruc i+1
Decode   | JE          | Decode instruction and fetch operands
Exec.    | ADD         | Add %rcx+%rdx
Mem      | LD          | Read word from memory
WB       | (empty)     |

Pipeline registers: after Fetch: JE / displacement; after Decode: %rcx / %rdx / ADD; after Exec.: %rbx+0x8 / READ

12.32

Sample Sequence - 5

Stage    | Instruction | Action
Fetch    | i+2         | Fetch next instruc i+2
Decode   | i+1         | Decode instruction i+1 and fetch operands
Exec.    | JE          | Check if condition is true
Mem      | ADD         | Just pass sum to next stage
WB       | LD          | Write word to %rax

Pipeline registers: after Fetch: Instruc. i+1 Machine Code; after Decode: PC / Displacement; after Exec.: %rdx / %rcx + %rdx; after Mem: %rax / Value from Memory

12.33

Sample Sequence - 6

Stage    | Instruction | Action
Fetch    | i+3         | Fetch next instruc i+3
Decode   | i+2         | Decode instruction i+2 and fetch operands
Exec.    | i+1         | Use the ALU
Mem      | JE          | Update PC
WB       | ADD         | Write word to %rdx

Pipeline registers: after Fetch: Instruc. i+2 Machine Code; after Decode: Instruc i+1 operands; after Exec.: New PC; after Mem: %rdx / sum

12.34

Sample Sequence - 7

Stage    | Instruction | Action
Fetch    | target      | Fetch the instruction at the jump target
Decode   | i+3         | Delete i+3
Exec.    | i+2         | Delete i+2
Mem      | i+1         | Delete i+1
WB       | JE          | Do nothing

Pipeline registers after Fetch, Decode, and Exec.: Deleted

12.35

HAZARDS

Problems from overlapping instruction execution…

12.36

Hazards

• Hazards prevent parallel or overlapped execution!

• Control Hazards

– Problem: We don't know what instruction to fetch but we need to

– Examples: Jumps (branches) and calls

• Data Hazards / Data Dependencies

– Problem: When a later instruction needs data from a previous instruction

– Examples:

• sub %rdx,%rax

• add %rax,%rcx

• Structural Hazards

– Problem: Due to limited resources, the HW doesn't support overlapping a certain sequence of instructions

– Examples: See next slides

12.37

Structural Hazards

• Example structural hazard: A single cache rather than separate instruction & data caches

– Structural hazard any time an instruction needs to perform a data access (i.e. ld or st) since we always want to fetch a new instruction each clock cycle

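[Figure: pipeline with PC, a single shared Cache, Reg. File, and ALU. LD needs the cache for its data access in the same clock cycle in which instruction i+3 must be fetched from that cache, with i+2 and i+1 in the stages between: Hazard!]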

12.38

Data Hazard - 1

sub %rdx,%rax
add %rax,%rcx

Stage    | Instruction | Action
Fetch    | i+1         | Fetch i+1
Decode   | ADD         | Decode and get register operands (Do we get the desired %rax value?)
Exec.    | SUB         | Perform %rax-%rdx
Mem      | (earlier)   |
WB       | (earlier)   |

Pipeline registers: after Fetch: ADD %rax, %rcx; after Decode: %rdx / %rax / SUB

12.39

Data Hazard - 2

Stage    | Instruction | Action
Fetch    | i+2         | Fetch i+2
Decode   | i+1         | Decode i+1
Exec.    | ADD         | Perform %rax+%rcx using the wrong value!
Mem      | SUB         | New value for %rax has not been written back yet
WB       | (earlier)   |

Pipeline registers: after Fetch: i+1; after Decode: %rax / %rcx / ADD; after Exec.: New value for %rax

12.40

Stalling

• Solution 1: Halt/Stall the ADD instruction in the DECODE stage and insert nops into the pipeline until the new value of the needed register is present at the cost of lower performance

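Stage    | Instruction | Action
Fetch    | i+1         | Held while ADD is stalled
Decode   | ADD         | Stalled until the new %rax is available
Exec.    | nop         |
Mem      | nop         |
WB       | SUB         | Writes the new value of %rax

Pipeline registers: after Fetch: ADD %rax, %rcx; after Decode: nop; after Exec.: nop; after Mem: New value of %rax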


12.41

Forwarding

• Solution 2: Create new hardware paths to hand-off (forward) the data from the producing instruction in the pipeline to the consuming instruction


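Stage    | Instruction | Action
Fetch    | i+2         |
Decode   | i+1         |
Exec.    | ADD         | Receives the forwarded %rax value
Mem      | SUB         | Holds the new value for %rax
WB       | (earlier)   |

Pipeline registers: after Fetch: i+1; after Decode: %rax / %rcx / ADD; after Exec.: New value for %rax (forwarded back to the ALU input); after Mem: Result of Instruction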

12.42

Solving Data Hazards

• Key Point: Data dependencies (i.e. instructions needing values produced by earlier ones) limit performance

• Forwarding solves many of the data hazards (data dependencies) that exist

– It allows instructions to continue to flow through the pipeline without the need to stall and waste time

– The cost is additional hardware and added complexity

• Even forwarding cannot solve all the issues

– A data hazard still exists when a LD reads a value needed by the next instruction

12.43

LD + Dependent Instruction Hazard

• Even forwarding cannot prevent the need to stall when a Load instruction produces a value needed by the instruction behind it

– Would require performing 2 cycles worth of work in only a single cycle

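ld 8(%rdx),%rax
add %rax,%rcx

Stage    | Instruction | Action
Fetch    | i+2         |
Decode   | i+1         |
Exec.    | ADD         | Needs the new value for %rax now
Mem      | LD          | Still reading the new value for %rax from memory
WB       | (earlier)   |

Pipeline registers: after Fetch: i+1; after Decode: old %rax / %rcx / ADD; after Exec.: Address (8+%rdx) / READ; after Mem: Result of Instruction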

12.44

LD + Dependent Instruction Hazard

• We would need to introduce 1 stall cycle (nop) into the pipeline to get the timing correct

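• Keep this in mind as we move through the next slides

ld 8(%rdx),%rax
add %rax,%rcx

Stage    | Instruction | Action
Fetch    | i+2         |
Decode   | i+1         |
Exec.    | ADD         | Receives the new value for %rax
Mem      | nop         |
WB       | LD          | Holds the new value for %rax

Pipeline registers: after Fetch: i+1; after Decode: old %rax / %rcx / ADD; after Exec.: nop; after Mem: New Value of %rax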
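The stall rules from the last few slides can be summarized as a small decision function. This is a sketch under assumptions drawn from the slides: two nops for a back-to-back dependency when there is no forwarding (as drawn on the Stalling slide), none with forwarding for an ALU producer, and one for a load producer. The names `hz_instr` and `stall_cycles` and the use of small integers for register numbers are hypothetical.

```c
enum kind { K_ALU, K_LD, K_OTHER };

/* Register numbers are small non-negative integers; -1 means "no register". */
typedef struct { enum kind kind; int src1; int src2; int dst; } hz_instr;

/* Stall cycles needed between a producer and the instruction right behind it. */
int stall_cycles(hz_instr producer, hz_instr consumer, int forwarding)
{
    int depends = producer.dst >= 0 &&
                  (consumer.src1 == producer.dst || consumer.src2 == producer.dst);

    if (!depends)
        return 0;                    /* independent instructions overlap freely */
    if (!forwarding)
        return 2;                    /* wait for write-back: two nops */
    return producer.kind == K_LD ? 1 /* load-use: value arrives a cycle late */
                                 : 0;/* ALU result forwarded with no stall */
}
```

With `%rax = 0`, `%rcx = 1`, `%rdx = 2`, the pair `sub %rdx,%rax` / `add %rax,%rcx` costs 2 stall cycles without forwarding and 0 with it, while `ld 8(%rdx),%rax` / `add %rax,%rcx` still costs 1 even with forwarding.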

12.45

Control Hazards

• Branches/Jumps require us to know

– Where we want to jump to (aka branch/jump target location)…really just the new value of the PC

– If we should branch or not (checking the jump condition)

• Problem: We often don't know those values until deep in the pipeline and thus we are not sure what instructions should be fetched in the interim

– Requires us to flush unwanted instructions and waste time

12.46

Control Hazard - 1

Stage    | Instruction | Action
Fetch    | i+3         | Fetch next instruc i+3
Decode   | i+2         | Decode instruction i+2 and fetch operands
Exec.    | i+1         | Use the ALU
Mem      | JE          | Update PC
WB       | ADD         | Write word to %rdx

Pipeline registers: after Fetch: Instruc. i+2 Machine Code; after Decode: Instruc i+1 operands; after Exec.: New PC; after Mem: %rdx / sum

12.47

Control Hazard - 2

Stage    | Instruction | Action
Fetch    | target      | Fetch the instruction at the jump target
Decode   | i+3         | Delete i+3
Exec.    | i+2         | Delete i+2
Mem      | i+1         | Delete i+1
WB       | JE          | Do nothing

Pipeline registers after Fetch, Decode, and Exec.: Deleted

Need to "flush" wrongly fetched instructions

12.48

A FIRST LOOK: CODE REORDERING

Enlisting the help of the compiler

12.49

Two Sides of the Coin

• If the hardware has some problems it just can't solve, can software (i.e. the compiler) help?

– Yes!!

• Compilers can re-order instructions to take best advantage of the processor (pipeline) organization

• Identify the dependencies that will incur stalls and slow performance

– Load followed by add

– Jump instructions

C code and its assembly translation

void sum(int* data, int n, int x)
{
    for(int i=0; i < n; i++){
        data[i] += x;
    }
}

sum:
    mov $0x0,%ecx
L1:
    cmp %esi,%ecx
    jge L2
    ld 0(%rdi), %eax
    add %edx, %eax
    st %eax, 0(%rdi)
    add $4, %rdi
    add $1, %ecx
    j L1
L2:
    retq

12.50

How Can the Compiler Help

• Compilers are written with general parsing and semantic representation front ends but architecture-specific backends that generate code optimized for a particular processor

• Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high level code indicates

• A: By finding independent instructions and reordering the code

– Could we have moved any other instruction into that slot? No!

Original Code (incurring 1 stall cycle)

sum:
    mov $0x0,%ecx
L1:
    cmp %esi,%ecx
    jge L2
    ld 0(%rdi), %eax
    stall/nop
    add %edx, %eax
    st %eax, 0(%rdi)
    add $4, %rdi
    add $1, %ecx
    j L1
L2:
    retq

Updated Code (w/ Compiler reordering)

sum:
    mov $0x0,%ecx
L1:
    cmp %esi,%ecx
    jge L2
    ld 0(%rdi), %eax
    add $1, %ecx
    add %edx, %eax
    st %eax, 0(%rdi)
    add $4, %rdi
    j L1
L2:
    retq

12.51

Taken or Not Taken: Branch Behavior

• When a conditional jump/branch is

– True, we say it is Taken

– False, we say it is Not Taken

• Currently our pipeline will fetch sequentially and then potentially flush if the branch is taken

– Effectively, our pipeline "predicts" that each branch is Not Taken

• The j L1 instruction is always taken and thus will incur wasted clock cycles each time it is executed

• Most of the time the jge L2 will be not taken and perform well


sum:
    mov $0x0,%ecx
L1:
    cmp %esi,%ecx
    jge L2              # NT (most of the time)
    ld 0(%rdi), %eax
    add $1, %ecx
    add %edx, %eax
    st %eax, 0(%rdi)
    add $4, %rdi
    j L1                # T (always)
L2:
    retq
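A rough count of the cycles this loop wastes under the implicit "predict Not Taken" policy can be sketched as follows. The flush penalty of 3 cycles is an assumption taken from the earlier control hazard walkthrough, where three wrongly fetched instructions are deleted; the function names are hypothetical.

```c
/* Cycles lost to flushing when every branch is predicted Not Taken. */
long flushed_cycles(long taken_branches, int flush_penalty)
{
    return taken_branches * flush_penalty;
}

/* Taken branches executed by the sum loop for n >= 0 iterations:
   `j L1` is taken once per iteration, and `jge L2` is taken once,
   when the loop finally exits. */
long sum_loop_taken(long n)
{
    return n + 1;
}
```

For n = 100 the loop executes 101 taken branches, so with a 3-cycle penalty about 303 cycles go to flushed instructions, almost all of them caused by `j L1`.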

12.52

Branch Delay Slots

• Problem: After a jump/branch we fetch instructions that we are not sure should be executed

• Idea: Find an instruction(s) that should ALWAYS be executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is

• Branch delay slot(s) = # of instructions that the HW will always execute (not flush) after a jump/branch instruction

12.53

Branch Delay Slot Example

Assume a single instruction delay slot

“Before” Code

    ld 0(%rdi), %rcx
    add %rbx, %rax
    cmp %rcx, %rdx
    je NEXT
    delay slot instruc.
    NOT TAKEN CODE
    …
NEXT:
    TAKEN CODE

“After” Code

    ld 0(%rdi), %rcx
    cmp %rcx, %rdx
    je NEXT
    add %rbx, %rax
    NOT TAKEN CODE
    …
NEXT:
    TAKEN CODE

Move an ALWAYS executed instruction down into the delay slot and let it execute no matter what

[Figure: flowchart perspective of the delay slot. The block ld 0(%rdi), %rcx / add %rbx, %rax / cmp %rcx, %rdx leads to je, followed by the Delay Slot, and then either the Taken Path Code (T) or the Not Taken Path Code (NT)]

12.54

Implementing Branch Delay Slots

• HW will define the number of branch delay slots (usually a small number…1 or 2)

• Compiler will be responsible for arranging instructions to fill the delay slots

– Must find instructions that the branch does NOT DEPEND on

– If no instructions can be rearranged, can always insert 'nop' and just waste those cycles

    ld 0(%rdi), %rcx
    add %rbx, %rax
    cmp %rcx, %rdx
    je NEXT
    delay slot instruc.

Cannot move ‘ld’ into delay slot because je needs the %rcx value generated by it

    ld 0(%rdi), %rcx
    add %rbx, %rax
    cmp %rcx, %rax
    je NEXT
    delay slot instruc.

If no instruction can be found a 'nop' can be inserted by the compiler

12.55

A Look Ahead: Branch Prediction

• Currently our pipeline assumes Not Taken and fetches down the sequential path after a jump/branch

• Could we build a pipeline that could predict taken?

– Not yet! Location to jump to (branch target) not known until later stages

• But suppose we could overcome those problems, would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?

• We could allow a static prediction per instruction (give a hint with the branch that indicates T or NT)

• We could allow dynamic prediction per instruction (use its runtime history)

[Figure: Loops. A loop body ends in a loop branch (T: loop, NT: done) followed by the After Code. High probability of being Taken. Prediction can be static.]

[Figure: If Statements. An if..else branch selects between the T Code (T: if) and the NT Code (NT: else). May exhibit data dependent behavior. Prediction may need to be dynamic.]
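As one possible illustration of "use its runtime history", the sketch below counts the mispredictions of a last-outcome (1-bit) predictor, which simply guesses that a branch will do what it did the previous time. This scheme is an assumption chosen for simplicity, not something these slides specify, and the function name `mispredictions` is hypothetical.

```c
#include <stddef.h>

/* Counts mispredictions of a last-outcome (1-bit) predictor over one
   branch's history; outcomes[i] is 1 for Taken and 0 for Not Taken. */
size_t mispredictions(const int *outcomes, size_t n, int initial_guess)
{
    int guess = initial_guess;
    size_t wrong = 0;

    for (size_t i = 0; i < n; i++) {
        if (outcomes[i] != guess)
            wrong++;
        guess = outcomes[i];     /* remember what the branch just did */
    }
    return wrong;
}
```

For a loop branch with history T, T, T, T, NT and an initial guess of Not Taken, this predictor is wrong twice (the first and last executions), whereas always guessing Not Taken would be wrong four times. On a strictly alternating history it is wrong every time, which hints at why data dependent branches are harder to predict.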

12.56

Demo

12.57

Summary 1

• Pipelining is an effective and important technique to improve the throughput of a processor

• Overlapping execution creates hazards which lead to stalls or wasted cycles

– Data, Control, Structural

– More hardware can be invested to attempt to mitigate the stalls (e.g. forwarding)

• The compiler can help reorder code to avoid stalls and perform useful work (e.g. delay slots)
