1 comp 206: computer architecture and implementation montek singh wed, oct 19, 2005 topic:...

1

COMP 206:COMP 206:Computer Architecture and Computer Architecture and ImplementationImplementation

Montek SinghMontek Singh

Wed, Oct 19, 2005Wed, Oct 19, 2005

Topic: Topic: Instruction-Level ParallelismInstruction-Level Parallelism

(Multiple-Issue, Speculation)(Multiple-Issue, Speculation)

2

OutlineOutline Multiple-Issue ArchitecturesMultiple-Issue Architectures

Superscalar processorsSuperscalar processors VLIW (very long instruction word) processorsVLIW (very long instruction word) processors

SchedulingScheduling Statically scheduled (using compiler techniques)Statically scheduled (using compiler techniques) Dynamically scheduled (using variants of Tomasulo’s Dynamically scheduled (using variants of Tomasulo’s

alg.)alg.)

Reading: HP3, Sections 3.6-3.7Reading: HP3, Sections 3.6-3.7

3

Multiple IssueMultiple Issue Eliminating data and control stalls can achieve CPI of 1Eliminating data and control stalls can achieve CPI of 1 Can we decrease CPI below 1?Can we decrease CPI below 1?

Not if we issue only one instruction per clock cycleNot if we issue only one instruction per clock cycle Multiple-issue processors allow multiple instructions to Multiple-issue processors allow multiple instructions to

issue in a clock cycleissue in a clock cycle Superscalar: issue varying numbers of instructions per Superscalar: issue varying numbers of instructions per

clock (dynamic issue)clock (dynamic issue) Statically scheduled by compilerStatically scheduled by compiler Dynamically scheduled by hardwareDynamically scheduled by hardware

VLIW: issue fixed number of instructions per clock (static VLIW: issue fixed number of instructions per clock (static issue)issue) Statically scheduled by compilerStatically scheduled by compiler

ExamplesExamples Superscalar: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP Superscalar: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP

80008000 VLIW: Intel/HP ItaniumVLIW: Intel/HP Itanium

4

A Superscalar Version of MIPSA Superscalar Version of MIPS Two instructions can be issued per clock cycleTwo instructions can be issued per clock cycle

One can be load/store/branch/integer operationOne can be load/store/branch/integer operation Other can be any FP operationOther can be any FP operation

Need to fetch and decode 64 bits per cycleNeed to fetch and decode 64 bits per cycle Instructions paired and aligned on 64-bit boundaryInstructions paired and aligned on 64-bit boundary Integer instruction appears firstInteger instruction appears first

Dynamic issueDynamic issue First instruction issues if independent and satisfies First instruction issues if independent and satisfies

other criteriaother criteria Second instruction issues only if first one does, and is Second instruction issues only if first one does, and is

independent and satisfies similar criteriaindependent and satisfies similar criteria LimitationLimitation

One-cycle delay for loads and branches now turns into One-cycle delay for loads and branches now turns into three-instruction delay!three-instruction delay!… … because instructions are now squeezed closer togetherbecause instructions are now squeezed closer together

5

Performance of Static SuperscalarPerformance of Static Superscalar

LOOP:LD F0, 0(R1)ADDD F4, F0, F2SD 0(R1), F4SUBI R1, R1, 8BNEZ R1, LOOP

LOOP:LD F0, 0(R1)ADDD F4, F0, F2SD 0(R1), F4SUBI R1, R1, 8BNEZ R1, LOOP

LOOP: LD F0, 0(R1)LD F6, -8(R1)LD F10, -16(R1) ADDD F4, F0, F2LD F14, -24(R1) ADDD F8, F6, F2LD F18, -32(R1) ADDD F12, F10, F2SD 0(R1), F4 ADDD F16, F14, F2SD -8(R1), F8 ADDD F20, F18, F2SD -16(R1), F12SUBI R1, R1, 40SD 16(R1), F16BNEZ R1, LOOPSD 8(R1), F20

LOOP: LD F0, 0(R1)LD F6, -8(R1)LD F10, -16(R1) ADDD F4, F0, F2LD F14, -24(R1) ADDD F8, F6, F2LD F18, -32(R1) ADDD F12, F10, F2SD 0(R1), F4 ADDD F16, F14, F2SD -8(R1), F8 ADDD F20, F18, F2SD -16(R1), F12SUBI R1, R1, 40SD 16(R1), F16BNEZ R1, LOOPSD 8(R1), F20

Loop unrolled five times and scheduled staticallyLoop unrolled five times and scheduled statically 6 cycles per element in original scheduled code6 cycles per element in original scheduled code 2.4 cycles per element in superscalar code (2.5x)2.4 cycles per element in superscalar code (2.5x) Loop unrolling gets us from 6 to 3.5 cycles per element (1.7x)Loop unrolling gets us from 6 to 3.5 cycles per element (1.7x) Superscalar execution from 3.5 to 2.4 cycles per element (1.5x)Superscalar execution from 3.5 to 2.4 cycles per element (1.5x)

6

Multiple-Issue with Dynamic Multiple-Issue with Dynamic SchedulingScheduling Extend Tomasulo’s algorithmExtend Tomasulo’s algorithm

support issuing 2 instr/cycle: 1 integer, 1 FPsupport issuing 2 instr/cycle: 1 integer, 1 FP

Simple approach: Simple approach: separate Tomasulo Control for Integer and FP units:separate Tomasulo Control for Integer and FP units:

one set of reservation stations for Integer unit and one for one set of reservation stations for Integer unit and one for FP unitFP unit

How to do instruction issue with two How to do instruction issue with two instructions and keep in-order instruction issue instructions and keep in-order instruction issue for Tomasulo?for Tomasulo? issue logic runs in one-half clock cycleissue logic runs in one-half clock cycle can do two in-order issues in one clock cyclecan do two in-order issues in one clock cycle

7

Performance of Dynamic Performance of Dynamic SuperscalarSuperscalar

Iter. no.Iter. no. InstructionsInstructions Issues Issues ExecutesExecutesWrites Writes resultresult

(clock-cycle (clock-cycle number)number)

11 L.D L.D F0F0,0(R1),0(R1) 11 22 44

11 ADD.D ADD.D F4F4,,F0F0,F2,F2 11 55 88

11 S.D 0(R1),S.D 0(R1),F4F4 22 99

11 SUBI R1,R1,8SUBI R1,R1,8 33 44 55

11 BNEZ R1,LOOPBNEZ R1,LOOP 44 55

22 L.D F0,0(R1)L.D F0,0(R1) 55 66 88

22 ADD.D F4,F0,F2ADD.D F4,F0,F2 55 99 1212

22 S.D 0(R1),F4S.D 0(R1),F4 66 13 13

22 SUBI R1,R1,8SUBI R1,R1,8 77 88 99

22 BNEZ R1,LOOPBNEZ R1,LOOP 88 99

4 clocks per iteration4 clocks per iteration

8

Limits of SuperscalarLimits of Superscalar While Integer/FP split is simple to implement, While Integer/FP split is simple to implement,

we get CPI of 0.5 only for programs with:we get CPI of 0.5 only for programs with: Exactly 50% FP operationsExactly 50% FP operations No hazardsNo hazards

If more instructions issue at same time, If more instructions issue at same time, greater difficulty of decode and issuegreater difficulty of decode and issue Even 2-scalar machine has to do a lot of workEven 2-scalar machine has to do a lot of work

Examine 2 opcodesExamine 2 opcodesExamine 6 register specifiersExamine 6 register specifiersDecide if 1 or 2 instructions can issueDecide if 1 or 2 instructions can issue If 2 are issuing, issue them in-orderIf 2 are issuing, issue them in-order

9

Very Long Instruction Word Very Long Instruction Word (VLIW)(VLIW) VLIW: tradeoff instruction space for simple VLIW: tradeoff instruction space for simple

decodingdecoding The long instruction word has room for many The long instruction word has room for many

operationsoperations By definition, all the operations the compiler puts in By definition, all the operations the compiler puts in

the long instruction word can execute in parallelthe long instruction word can execute in parallel E.g., 2 integer operations, 2 FP operations, 2 memory E.g., 2 integer operations, 2 FP operations, 2 memory

references, 1 branchreferences, 1 branch16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168

bits widebits wide Need very sophisticated compiling technique …Need very sophisticated compiling technique …

… … that schedules across several branchesthat schedules across several branches

10

Loop Unrolling in VLIWLoop Unrolling in VLIWMemory Memory MemoryMemory FPFP FPFP Int. op/Int. op/ ClockClockreference 1reference 1 reference 2reference 2 operation 1operation 1 op. 2 op. 2 branchbranch

LD LD F0F0,0(R1),0(R1) LD F6,-8(R1)LD F6,-8(R1) 11

LD F10,-16(R1)LD F10,-16(R1)LD F14,-24(R1)LD F14,-24(R1) 22

LD F18,-32(R1)LD F18,-32(R1)LD F22,-40(R1)LD F22,-40(R1) ADDD ADDD F4F4,,F0F0,F2,F2 ADDD F8,F6,F2ADDD F8,F6,F2 33

LD F26,-48(R1)LD F26,-48(R1) ADDD F12,F10,F2ADDD F12,F10,F2 ADDD F16,F14,F2ADDD F16,F14,F244

ADDD F20,F18,F2ADDD F20,F18,F2 ADDD F24,F22,F2ADDD F24,F22,F255

SD 0(R1),SD 0(R1),F4F4 SD -8(R1),F8SD -8(R1),F8 ADDD F28,F26,F2ADDD F28,F26,F2 66

SD -16(R1),F12SD -16(R1),F12 SD -24(R1),F16SD -24(R1),F16 77

SD -32(R1),F20SD -32(R1),F20 SD -40(R1),F24SD -40(R1),F24 SUBI R1,R1,#48SUBI R1,R1,#48 88

SD -0(R1),F28SD -0(R1),F28 BNEZ R1,LOOPBNEZ R1,LOOP 99

Unrolled 7 times to avoid delaysUnrolled 7 times to avoid delays 7 results in 9 clocks, or 1.3 clocks per iteration7 results in 9 clocks, or 1.3 clocks per iteration Need more registers in VLIWNeed more registers in VLIW

11

Limits to Multi-Issue MachinesLimits to Multi-Issue Machines Inherent limitations of ILPInherent limitations of ILP

1 branch in 5 instructions: How to keep a 5-way VLIW 1 branch in 5 instructions: How to keep a 5-way VLIW busy?busy?

Latencies of units: many operations must be scheduledLatencies of units: many operations must be scheduled #independent instr. needed: Pipeline Depth #independent instr. needed: Pipeline Depth Number Number

of FUof FU~15-20 independent instructions!~15-20 independent instructions!

Difficulties in building the underlying hardwareDifficulties in building the underlying hardware Large increase in bandwidth of memory and register-fileLarge increase in bandwidth of memory and register-file

Limitations specific to either SS or VLIW Limitations specific to either SS or VLIW implementationimplementation Decode issue in SSDecode issue in SS VLIW code size: unroll loops + wasted fields in VLIWVLIW code size: unroll loops + wasted fields in VLIW VLIW lock step VLIW lock step 1 hazard & all instructions stall 1 hazard & all instructions stall VLIW & binary compatibility is practical weaknessVLIW & binary compatibility is practical weakness

12

Hardware Support for More ILPHardware Support for More ILP Avoid branch prediction by turning branches Avoid branch prediction by turning branches

into conditionally executed instructions:into conditionally executed instructions:if (x) then A = B op C else NOPif (x) then A = B op C else NOP

If false, then do nothing at allIf false, then do nothing at allneither store result nor cause exceptionneither store result nor cause exception

Expanded ISA of Alpha, MIPS, PowerPC, SPARC have Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional moveconditional move

PA-RISC: any register-register instruction can nullify PA-RISC: any register-register instruction can nullify following instruction, making it conditionalfollowing instruction, making it conditional

Drawbacks to conditional instructionsDrawbacks to conditional instructions Still takes a clock even if “annulled”Still takes a clock even if “annulled” Stall if condition evaluated lateStall if condition evaluated late Complex conditions reduce effectiveness; Complex conditions reduce effectiveness;

condition becomes known late in pipelinecondition becomes known late in pipeline

13

Hardware Support for More ILPHardware Support for More ILP Speculation:Speculation: allow an instruction to issue that allow an instruction to issue that

is dependent on branch predicted to be taken is dependent on branch predicted to be taken without any consequences (including without any consequences (including exceptions) if branch is not actually taken exceptions) if branch is not actually taken Hardware needs to provide an “undo” operation = Hardware needs to provide an “undo” operation =

squashsquash

Often try to combine with dynamic schedulingOften try to combine with dynamic scheduling Tomasulo: separate Tomasulo: separate speculative bypassingspeculative bypassing of of

results from results from real bypassingreal bypassing of results of results When instruction no longer speculative, write results When instruction no longer speculative, write results

(instruction (instruction commitcommit)) execute out-of-order but execute out-of-order but commit in ordercommit in order Example: PowerPC 620, MIPS R10000, Intel P6, AMD Example: PowerPC 620, MIPS R10000, Intel P6, AMD

K5 …K5 …

14

Hardware support for More ILPHardware support for More ILP

• Need HW buffer for results of uncommitted instructions: reorder buffer (ROB)

– Reorder buffer can be operand source

– Once instruction commits, result is found in register

– 3 fields: instr. type, destination, value

– Use reorder buffer number instead of reservation station

– Instructions commit in order

– As a result, its easy to undo speculated instructions on mispredicted branches or on exceptions

ReorderBuffer

FP Regs

InstrQueue

FP Adder FP Mult

Res Stations Res Stations

Figure 3.29, page 228

15

Four Steps of Speculative Tomasulo Four Steps of Speculative Tomasulo AlgorithmAlgorithm1.1.Issue—get instruction from FP Op QueueIssue—get instruction from FP Op Queue

If reservation station and reorder buffer slot free, issue If reservation station and reorder buffer slot free, issue instruction & send operands & reorder buffer no. for instruction & send operands & reorder buffer no. for destination; each RS now also has a field for ROB#.destination; each RS now also has a field for ROB#.

2.2.Execution—operate on operands (EX)Execution—operate on operands (EX) When both operands ready then execute; if not ready, When both operands ready then execute; if not ready,

watch CDB for result; when both in reservation station, watch CDB for result; when both in reservation station, executeexecute

3.3.Write result—finish execution (WB)Write result—finish execution (WB) Write on Common Data Bus to all awaiting RS’s & ROB; Write on Common Data Bus to all awaiting RS’s & ROB;

mark RS availablemark RS available

4.4.Commit—update register with reorder resultCommit—update register with reorder result When instruction at head of reorder buffer & result When instruction at head of reorder buffer & result

present, update register with result (or store to memory) present, update register with result (or store to memory) and remove instruction from reorder bufferand remove instruction from reorder buffer

16

Result Shift Register and Reorder Result Shift Register and Reorder BufferBuffer General solution to three problemsGeneral solution to three problems

Precise exceptionsPrecise exceptions Speculative executionSpeculative execution Register renamingRegister renaming

Solution in three stepsSolution in three steps In-order initiation, out-of-order termination (using RSRa)In-order initiation, out-of-order termination (using RSRa) In-order initiation, in-order termination (using RSRb)In-order initiation, in-order termination (using RSRb) In-order initiation, in-order termination, with renaming In-order initiation, in-order termination, with renaming

(using ROB)(using ROB) Architectural modelArchitectural model

Essentially MIPS FP pipelineEssentially MIPS FP pipeline Add takes 2 clock cycles, multiplication 5, division 10Add takes 2 clock cycles, multiplication 5, division 10 Memory accesses take 1 clock cycleMemory accesses take 1 clock cycle Integer instructions take 1 clock cycleInteger instructions take 1 clock cycle 1 branch delay slot, delayed branches1 branch delay slot, delayed branches

17

Step I: I-O Initiation, O-O Termination Step I: I-O Initiation, O-O Termination (RSRa)(RSRa)LOOP:LD F6, 32(R2)LD F2, 48(R3)MULTD F0, F2, F4ADDI R2, R2, 8ADDI R3, R3, 8SUBD F8, F6, F2DIVD F10, F10, F0ADDD F6, F8, F6BLEZ R4, LOOPADDI R4, R4, 1

LOOP:LD F6, 32(R2)LD F2, 48(R3)MULTD F0, F2, F4ADDI R2, R2, 8ADDI R3, R3, 8SUBD F8, F6, F2DIVD F10, F10, F0ADDD F6, F8, F6BLEZ R4, LOOPADDI R4, R4, 1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 2210 D9 D8 D7 D6 D5 M D4 M D3 M D2 M S A D1 L1 L2 A1 A2 M S A B A3 DM L1 L2 A1 A2 M S A B A3 DW L1 L2 A1 A2 M S A B A3 D

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22LD F6, 32(R2) F D E M WLD F2, 48(R3) F D E M WMULTD F0, F2, F4 F D - X X X X X M WADDI R2, R2, 8 F - D E M WADDI R3, R3, 8 F D E M WSUBD F8, F6, F2 F D - X X M WDIVD F10, F10, F0 F - D X X X X X X X X X X M WADDD F6, F8, F6 F D X X M WBLEZ R4, LOOP F D - E M WADDI R4, R4, 1 F - D E M W

18

Step II: I-O Initiation, I-O Termination Step II: I-O Initiation, I-O Termination (RSRb)(RSRb)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 2810 D9 - D8 - - D7 - - - D6 - - - - D5 M - - - - - D4 - M - - - - - - D3 - - M - - - - - - - D2 - - - M S - - - - - - - - D A1 L1 L2 - - - - M A1 A2 S - - - - - - - - D A B A3M L1 L2 M A1 A2 S D A B A3W L1 L2 M A1 A2 S D A B A3

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28LD F6, 32(R2) F D E M WLD F2, 48(R3) F D E M WMULTD F0, F2, F4 F D - X X X X X M WADDI R2, R2, 8 F - D - - - - E M WADDI R3, R3, 8 F - - - - D E M WSUBD F8, F6, F2 F D X X M WDIVD F10, F10, F0 F D X X X X X X X X X X M WADDD F6, F8, F6 F D - - - - - - - - X X M WBLEZ R4, LOOP F - - - - - - - - D - E M WADDI R4, R4, 1 F - D E M W

19

Step III: Use Re-order Buffer Step III: Use Re-order Buffer (ROB)(ROB) Combine benefits of early issue and in-order update of stateCombine benefits of early issue and in-order update of state Obtained from RSRa by adding a renaming mechanism to itObtained from RSRa by adding a renaming mechanism to it Add a FIFO to RSRa (implement as circular buffer)Add a FIFO to RSRa (implement as circular buffer) When RSRa allows issuing of new instructionWhen RSRa allows issuing of new instruction

Enter instruction at tail of circular bufferEnter instruction at tail of circular buffer Buffer entry has multiple fieldsBuffer entry has multiple fields

[Result; Valid Bit; Destination Register Name; PC value; Exceptions][Result; Valid Bit; Destination Register Name; PC value; Exceptions]

Termination happens when result is produced, broadcast on Termination happens when result is produced, broadcast on CDB, written into circular buffer (replace M with T)CDB, written into circular buffer (replace M with T) Written ROB entry can serve as source of operands from now onWritten ROB entry can serve as source of operands from now on

Commit happens when value is moved from circular buffer Commit happens when value is moved from circular buffer to register (replace W with C)to register (replace W with C) Happens when instruction reaches head of circular buffer and has Happens when instruction reaches head of circular buffer and has

completed execution with no exceptionscompleted execution with no exceptions

20

ROB: I-O Initiation, I-O ROB: I-O Initiation, I-O TerminationTerminationLOOP:LD F6, 32(R2)LD F2, 48(R3)MULTD F0, F2, F4ADDI R2, R2, 8ADDI R3, R3, 8SUBD F8, F6, F2DIVD F10, F10, F0ADDD F6, F8, F6BLEZ R4, LOOPADDI R4, R4, 1


CDB L1 L2 A1 A2 M S B A3 D1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

10 D9 D8 D7 D6 D5 M D4 M D3 M D2 M S A D1 L1 L2 A1 A2 M S A B A3 DT L1 L2 A1 A2 M S A B A3 DC L1 L2 M A1 A2 S D A B A3

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25LD F6, 32(R2) F D E T CLD F2, 48(R3) F D E T CMULTD F0, F2, F4 F D - X X X X X T CADDI R2, R2, 8 F - D E T - - - - CADDI R3, R3, 8 F D X T - - - - CSUBD F8, F6, F2 F D - X X T - - CDIVD F10, F10, F0 F - D X X X X X X X X X X T CADDD F6, F8, F6 F D X X T - - - - - - - - CBLEZ R4, LOOP F D - E T - - - - - - - - CADDI R4, R4, 1 F - D E T - - - - - - - - C

LD F D E TLD F D EMULTD F DADDI F

21

States of Circular BufferStates of Circular Buffer



State at cycle 17Result Valid Dest PC Exc

LD xxx yes F6 0LD xxx yes F2 4MULTD xxx yes F0 8ADDI xxx yes R2 12ADDI xxx yes R3 16SUBD xxx yes F8 20DIVD no F10 24ADDD xxx yes F6 28BLEZ xxx yes 32ADDI xxx yes R4 36

LD xxx yes F6 0LD no F2 4

•Entry in yellow is at head of buffer•Entry in green is tail of buffer, i.e., next instruction goes here•Greyed instructions have committed

22

Complexity of ROBComplexity of ROB Assume dual-issue superscalarAssume dual-issue superscalar

Load/Store machine with three-operand instructionsLoad/Store machine with three-operand instructions 64 registers64 registers 16-entry circular buffer16-entry circular buffer

Hardware support needed for ROBHardware support needed for ROB For each buffer entryFor each buffer entry

One write portOne write portFour read ports (two source operands of two instructions)Four read ports (two source operands of two instructions)Four 6-bit comparators for associative lookupFour 6-bit comparators for associative lookup

For each read portFor each read port16-way “priority” encoder with wrap-around (to get latest 16-way “priority” encoder with wrap-around (to get latest

value)value)

Limited capacity of ROB is a structural hazard Limited capacity of ROB is a structural hazard Repeated writes to same register actually Repeated writes to same register actually

happen happen This is not the case in “classical” TomasuloThis is not the case in “classical” Tomasulo

1 comp 206: computer architecture and implementation montek singh wed, oct 19, 2005 topic:...

Documents

multiple issue

f2 sd 0r1

f2 sd 8r1

loop loop

f2 sd 16r1

f2 ld f14

f2 ld f18

loop sd 8r1