L16 – Pipelining | Comp 411 – Spring 2011 | 4/20/2011
Pipelining
Between 411 problem sets, I haven’t had a minute to do laundry.
Now that’s what I call dirty laundry.
Read Chapter 4.5-4.8
Forget 411… Let’s Solve a “Relevant Problem”
Device: Washer
Function: Fill, Agitate, Spin
WasherPD = 30 mins
Device: Dryer
Function: Heat, Spin
DryerPD = 60 mins
INPUT: dirty laundry
OUTPUT: 4 more weeks
One Load at a Time
Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do.
The fact is, doing laundry one load at a time is not smart.
(Sorry Mom, but you were wrong about this one!)
Step 1:
Step 2:
Total = WasherPD + DryerPD = 90 mins
Doing N Loads of Laundry
Here’s how they do laundry at Duke, the “combinational” way.
(Actually, this is just an urban legend. No one at Duke actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry, and return it all pressed and starched by dinner.)
Step 1:
Step 2:
Step 3:
Step 4:
Total = N*(WasherPD + DryerPD) = N*90 mins
…
Doing N Loads… the UNC way
UNC students “pipeline” the laundry process.
That’s why we wait!
Step 1:
Step 2:
Step 3:
Total = N * Max(WasherPD, DryerPD) = N*60 mins
…
Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
Recall Our Performance Measures
Latency: The delay from when an input is established until the output associated with that input becomes valid.
(Duke Laundry = 90 mins)  (UNC Laundry = 120 mins)
Throughput: The rate at which inputs or outputs are processed.
(Duke Laundry = 1/90 outputs/min)  (UNC Laundry = 1/60 outputs/min)
Assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available.
Even though we increase latency, it takes less time per load.
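These latency/throughput figures can be reproduced with a short calculation (a sketch in Python; the 30- and 60-minute device times come from the slides):

```python
# Latency and throughput for the two laundry strategies.
WASHER_PD = 30  # minutes, from the slides
DRYER_PD = 60   # minutes, from the slides

# "Duke" (combinational): each load runs start to finish by itself.
duke_latency = WASHER_PD + DRYER_PD           # 90 minutes
duke_throughput = 1 / duke_latency            # 1/90 loads per minute

# "UNC" (pipelined): a finished wash waits (wet) until the dryer frees up,
# so per-load latency stretches to two dryer periods in steady state.
unc_latency = 2 * max(WASHER_PD, DRYER_PD)    # 120 minutes
unc_throughput = 1 / max(WASHER_PD, DRYER_PD) # 1/60 loads per minute

print(duke_latency, unc_latency)       # 90 120
print(unc_throughput > duke_throughput)  # True: pipelining wins on throughput
```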
Okay, Back to Circuits…
[Figure: combinational circuit: input X feeds F and G, whose outputs feed H to produce P(X)]
For combinational logic: latency = tPD, throughput = 1/tPD.
We can’t get the answer faster, but are we making effective use of our hardware at all times?
[Timing: X changes, then F(X) and G(X) settle, then P(X) becomes valid]
F & G are “idle”, just holding their outputs stable while H performs its computation
Pipelined Circuits
Use registers to hold H’s inputs stable!
[Figure: pipelined circuit: input X feeds F (15 ns) and G (20 ns); registers capture their outputs and feed H (25 ns), producing P(X)]
Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):
                  Latency         Throughput
Unpipelined       45 ns           1/45 per ns
2-stage pipeline  50 ns (worse)   1/25 per ns (better)
Pipelining uses registers to improve the throughput of combinational circuits.
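The same arithmetic applies to the circuit (a sketch in Python using the 15/20/25 ns delays and the ideal zero-delay registers from the slide):

```python
# Stage delays for the 2-stage pipeline example, in ns (from the slide).
T_F, T_G, T_H = 15, 20, 25

# Unpipelined: the critical path is max(F, G) followed by H.
unpipelined_latency = max(T_F, T_G) + T_H   # 45 ns
unpipelined_throughput = 1 / unpipelined_latency

# Pipelined (ideal registers): the clock period is set by the slowest stage,
# stage 1 = F and G in parallel, stage 2 = H.
clock = max(max(T_F, T_G), T_H)             # 25 ns
pipelined_latency = 2 * clock               # 50 ns: worse latency
pipelined_throughput = 1 / clock            # 1/25 per ns: better throughput

print(unpipelined_latency, pipelined_latency)  # 45 50
```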
Pipeline Diagrams
Clock cycle:   i    i+1     i+2      i+3      i+4
Input:         Xi   Xi+1    Xi+2     Xi+3     …
F Reg:         -    F(Xi)   F(Xi+1)  F(Xi+2)  …
G Reg:         -    G(Xi)   G(Xi+1)  G(Xi+2)  …
H Reg:         -    -       H(Xi)    H(Xi+1)  H(Xi+2)

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
[Figure: the same 2-stage pipelined circuit: F (15 ns) and G (20 ns) feed registers, then H (25 ns)]
This is an example of parallelism. At any instant we are computing 2 results.
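The diagonal flow in the diagram can also be generated programmatically (a sketch in Python; modeling the parallel F/G pair as a single first stage is my simplification):

```python
def pipeline_diagram(stages, n_inputs):
    """Map each pipeline register to its contents in every clock cycle.

    Input Xc enters at cycle c; the register after stage `depth` holds the
    result for input (cycle - depth), matching the diagonal in the slide.
    """
    n_cycles = n_inputs + len(stages)
    rows = {}
    for depth, stage in enumerate(stages, start=1):
        rows[stage] = [
            f"{stage}(X{c - depth})" if 0 <= c - depth < n_inputs else "-"
            for c in range(n_cycles)
        ]
    return rows

rows = pipeline_diagram(["FG", "H"], 3)  # F and G treated as one stage
for stage, cells in rows.items():
    print(f"{stage} Reg:", "  ".join(cells))
```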
Pipelining Summary
Advantages:
– Higher throughput than combinational system
– Different parts of the logic work on different parts of the problem…
Disadvantages:
– Generally, increases latency
– Only as good as the *weakest* link (often called the pipeline’s BOTTLENECK)
Review of CPU Performance
MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
MIPS = Freq / CPI
To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? State-of-the-art multiple instruction issue
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improving performance.
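As a quick sanity check on the formula (a sketch in Python; the 500 MHz and CPI figures below are made-up examples, not from the slides):

```python
def mips(freq_mhz, cpi):
    """MIPS = Freq (MHz) / CPI: millions of instructions per second."""
    return freq_mhz / cpi

# A hypothetical 500 MHz single-issue RISC machine with CPI = 1.0:
print(mips(500, 1.0))   # 500.0 MIPS
# The same clock with a multicycle CPI of 4 retires 4x fewer instructions/sec:
print(mips(500, 4.0))   # 125.0 MIPS
```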
Where Are the Bottlenecks?
Pipelining goal: Break LONG combinational paths ⇒ put the memories and the ALU in separate stages.
[Figure: single-cycle miniMIPS datapath: PC, +4 adder, Instruction Memory, Register File, ALU, Data Memory, sign extension, and control logic driving PCSEL, WASEL, ASEL, BSEL, WDSEL, ALUFN, Wr, and WERF]
miniMIPS Timing
Different instructions use various parts of the data path.
Program execution order (over time):
add $4, $5, $6
beq $1, $2, 40
lw  $3, 30($0)
jal 20000
sw  $2, 20($4)

The above scenario is possible only if the system could vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!

1 instr every 14 ns (add), 14 ns (beq), 20 ns (lw), 9 ns (jal), 19 ns (sw)

Component delays: Instruction Fetch 6 ns, Instruction Decode 2 ns, Register Prop Delay 2 ns, ALU Operation 5 ns, Branch Target 4 ns, Data Access 6 ns, Register Setup 1 ns
Uniform miniMIPS Timing
With a fixed clock period, we have to allow for the worst case.
Program execution order (over time):
add $4, $5, $6
beq $1, $2, 40
lw  $3, 30($0)
jal 20000
sw  $2, 20($4)

By accounting for the “worst case” path (i.e. allowing time for each possible combination of operations) we can implement a fixed clock period. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!

Isn’t the net effect just a slower CPU?

1 instr EVERY 20 ns

Component delays: Instruction Fetch 6 ns, Instruction Decode 2 ns, Register Prop Delay 2 ns, ALU Operation 5 ns, Branch Target 4 ns, Data Access 6 ns, Register Setup 1 ns
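A quick calculation contrasts the two clocking schemes (a sketch in Python using the per-instruction times given on these slides):

```python
# Per-instruction datapath times from the slides, in ns:
instr_time = {"add": 14, "beq": 14, "lw": 20, "jal": 9, "sw": 19}

# Variable clock: each instruction takes exactly as long as its own path.
variable_clock_total = sum(instr_time.values())    # 76 ns for 5 instructions

# Fixed clock: every instruction pays for the slowest path (lw, 20 ns).
fixed_clock = max(instr_time.values())             # 20 ns
fixed_clock_total = fixed_clock * len(instr_time)  # 100 ns

print(variable_clock_total, fixed_clock_total)     # 76 100
# The fixed clock looks slower per instruction, but it is what makes
# pipelining possible: every instruction fits in one uniform period.
```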
Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:
IF (Instruction Fetch stage): Maintains the PC, fetches one instruction per cycle, and passes it to the ID/RF stage.
ID/RF (Instruction Decode/Register File stage): Decodes control lines and selects source operands.
ALU (ALU stage): Performs the specified operation and passes the result to the MEM stage.
MEM (Memory stage): If it’s a lw, uses the ALU result as an address and passes the memory data (or the ALU result if not lw) to the WB stage.
WB (Write-Back stage): Writes the result back into the register file.
5-Stage miniMIPS

[Figure: 5-stage miniMIPS datapath: pipeline registers (PCREG/IRREG, PCALU/IRALU, PCMEM/IRMEM/YMEM/WDMEM, PCWB/IRWB/YWB) divide the datapath into Instruction Fetch, Register File, ALU, Memory, and Write-Back stages; branch comparison (=, BZ) happens in the Register File stage]
The address is available right after the instruction enters the Memory stage, but the data is needed just before the rising clock edge at the end of the Write-Back stage: almost 2 clock cycles later.
• Omits some details
• NO bypass or interlock logic
Pipelining
Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) each passing through Instruction fetch, Reg, ALU, Data access, Reg. Unpipelined: one instruction starts every 8 ns. Pipelined: one instruction starts every 2 ns.]
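The “do we achieve the ideal speedup?” question has a concrete answer for these numbers (a sketch in Python; 8 ns per sequential instruction, a 2 ns pipeline clock, and 5 stages, per the figure):

```python
def pipelined_time(n, stages, clock):
    """Time to run n instructions on an ideal pipeline: fill + steady state."""
    return (stages + n - 1) * clock

n = 1_000_000
sequential = n * 8                               # ns: one instruction per 8 ns
pipelined = pipelined_time(n, stages=5, clock=2) # ns
print(sequential / pipelined)
# The speedup approaches 4x, not the ideal 5x: the 2 ns clock is set by the
# slowest stage, not by a perfectly balanced 8 ns / 5 = 1.6 ns per stage.
```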
Pipelining
What makes it easy?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
Individual instructions still take the same number of cycles, but we’ve improved the throughput by increasing the number of simultaneously executing instructions.
Structural Hazards
[Figure: four overlapped instructions, each passing through Inst Fetch, Reg Read, ALU, Data Access, Reg Write; with only one memory, a later instruction’s Inst Fetch would collide with an earlier instruction’s Data Access]
Data Hazards
Problem with starting the next instruction before the first is finished: dependencies that “go backward in time” are data hazards.
[Figure: pipeline diagram over clock cycles CC 1–CC 9 (IM, Reg, DM, Reg per instruction) for:
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Value of register $2: 10 10 10 10 10/-20 -20 -20 -20 -20; the and and or read $2 before sub has written it back]
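These backward-going dependencies can be found with a simple scan (a sketch in Python; the rule that a read within the 2 following instructions is hazardous is my assumption, based on the usual 5-stage convention that the register file writes in the first half of a cycle and reads in the second):

```python
def raw_hazards(program, distance=2):
    """Find read-after-write hazards: a source register read within
    `distance` instructions of the instruction that writes it."""
    hazards = []
    for i, (dest, _srcs) in enumerate(program):
        for j in range(i + 1, min(i + 1 + distance, len(program))):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards

# (dest, (sources)) for the slide's sequence; sw writes no register:
prog = [
    ("$2",  ("$1", "$3")),    # sub $2, $1, $3
    ("$12", ("$2", "$5")),    # and $12, $2, $5
    ("$13", ("$6", "$2")),    # or  $13, $6, $2
    ("$14", ("$2", "$2")),    # add $14, $2, $2
    (None,  ("$2", "$15")),   # sw  $15, 100($2)
]
print(raw_hazards(prog))  # [(0, 1, '$2'), (0, 2, '$2')]
```

By this model the and and or hazard on $2, while add (3 slots later) reads the freshly written value within the same cycle, matching the diagram.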
Software Solution
Have the compiler guarantee no hazards.
Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)
Problem: this really slows us down!
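A compiler could place the nops mechanically (a sketch in Python; it assumes a read is safe once 3 slots separate it from the write, so 2 nops fix the sub/and pair above; the distance rule is my assumption about this 5-stage pipeline):

```python
def insert_nops(program, distance=2):
    """Pad with nops so every read is more than `distance` slots after
    the instruction that wrote the register it reads."""
    out = []         # list of (dest, srcs); nops carry dest == "nop"
    last_write = {}  # register -> index in `out` of its most recent writer
    for dest, srcs in program:
        # nops needed until all source registers are safely written back:
        need = max(
            (last_write[s] + distance + 1 - len(out) for s in srcs if s in last_write),
            default=0,
        )
        out.extend([("nop", ())] * max(need, 0))
        if dest:
            last_write[dest] = len(out)
        out.append((dest, srcs))
    return out

prog = [
    ("$2",  ("$1", "$3")),   # sub $2, $1, $3
    ("$12", ("$2", "$5")),   # and $12, $2, $5
    ("$13", ("$6", "$2")),   # or  $13, $6, $2
]
padded = insert_nops(prog)
print(sum(1 for d, _ in padded if d == "nop"))  # 2: both go before the and
```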
Forwarding
Use temporary results, don’t wait for them to be written:
- register file forwarding to handle read/write to the same register
- ALU forwarding
[Figure: the same five-instruction sequence, CC 1–CC 9, with forwarding paths from the pipeline registers to the ALU inputs.
Value of register $2: 10 10 10 10 10/-20 -20 -20 -20 -20
Value of EX/MEM:      X  X  X  -20  X  X  X  X  X
Value of MEM/WB:      X  X  X  X  -20  X  X  X  X]
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction.
[Figure: pipeline diagram, CC 1–CC 9, for:
lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
The and needs $2 one cycle before the lw’s data is available, so forwarding alone cannot fix it]
Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: the same sequence with a one-cycle “bubble” inserted: the and is held in its stage for an extra cycle, stretching the diagram to CC 10]
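The stall rule on this slide can be captured in a few lines (a sketch in Python; the one-bubble-per-load-use rule assumes full forwarding everywhere else, as in these slides):

```python
def count_load_use_bubbles(program):
    """Count one-cycle bubbles: instruction i+1 reads the dest of a lw at i."""
    bubbles = 0
    for i in range(len(program) - 1):
        op, dest, _srcs = program[i]
        if op == "lw" and dest in program[i + 1][2]:
            bubbles += 1
    return bubbles

# (op, dest, (sources)) for the slide's sequence:
prog = [
    ("lw",  "$2", ("$1",)),        # lw  $2, 20($1)
    ("and", "$4", ("$2", "$5")),   # and $4, $2, $5  <- needs $2 too early
    ("or",  "$8", ("$2", "$6")),   # or  $8, $2, $6  (forwarded, no stall)
    ("add", "$9", ("$4", "$2")),   # add $9, $4, $2
]
print(count_load_use_bubbles(prog))  # 1
```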
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting “branch not taken”; we need to add hardware for flushing instructions if we are wrong.
[Figure: pipeline diagram, CC 1–CC 9, for:
40 beq $1, $3, 7
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)
The three instructions after the beq are already in the pipeline when the branch resolves]
5-Stage miniMIPS

[Figure: the 5-stage miniMIPS datapath again, now with early branch resolution (=, BZ) in the Register File stage, a NOP injection path into IRALU, and clock enables on the IF/RF pipeline registers]
We wanted a simple, clean pipeline, but…
• added CLK EN to freeze the IF/RF stages so we can wait for lw to reach the WB stage
• broke the sequential semantics of the ISA by adding a branch delay slot and early branch resolution logic
• added A/B bypass muxes to get data before it’s written to the regfile
Pipeline Summary (I)
• Started with an unpipelined implementation
– direct execute, 1 cycle/instruction
– it had a long cycle time: mem + regs + alu + mem + wb
• We ended up with a 5-stage pipelined implementation
– increased throughput (3x???)
– delayed branch decision (1 cycle): choose to execute the instruction after the branch
– delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward the correct value
– memory data available only in the WB stage: introduce NOPs at IRALU to stall the IF and RF stages until the lw result is ready
Pipeline Summary (II)
Fallacy #1: Pipelining is easy.
Smart people get it wrong all of the time!
Fallacy #2: Pipelining is independent of ISA.
Many ISA decisions impact how easy/costly it is to implement pipelining (e.g. branch semantics, addressing modes).
Fallacy #3: Increasing pipeline stages improves performance.
Diminishing returns. Increasing complexity.