dynamic scheduling ii - yeditepe Üniversitesi bilgisayar...

72
COMPSCI 220 / ECE 252 Lecture Notes Dynamic Scheduling II 1 © 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith, Vijaykumar, Lipasti Dynamic Scheduling II • so far: dynamic scheduling (out-of-order execution) • Scoreboard • Tomasulo’s algorithm • register renaming: removing artificial dependences (WAR/WAW) • now: out-of-order execution + precise state • advanced topic: dynamic load scheduling • PentiumII vs. Pentium4 • limits of ILP

Upload: duongtruc

Post on 27-Jun-2019

216 views

Category:

Documents


0 download

TRANSCRIPT

Dynamic Scheduling II

• so far: dynamic scheduling (out-of-order execution)

• Scoreboard• Tomasulo’s algorithm• register renaming: removing artificial dependences (WAR/WAW)

• now: out-of-order execution + precise state• advanced topic: dynamic load scheduling • PentiumII vs. Pentium4• limits of ILP

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

1© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Readings

H+P

• chapter 3 (not required)

Readings in Computer Architecture• Sohi and Vajapeyam: “Instruction Issue Logic”• Yeager: “The MIPS R10000 Superscalar Microprocessor”

Recent Research Papers• Complexity-Effective Superscalar• Optimal Pipeline Depth• DIVA• Power

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

2© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Superscalar + Out-of-Order + Speculation

superscalar + out-of-order + speculation

• three concepts that work well (best?) when used together• CPI >= 1?

• overcome with superscalar

• superscalar increases hazards? • overcome with dynamic scheduling

• RAW dependences still a problem?• overcome with a large instruction window

• branches a problem for filling large window?• overcome with speculation

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

3© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Speculation and Precise Interrupts

Q: why are we discussing these together?

• sequential (von Neumann) semantics for interrupts• all instructions before interrupt should be complete• all instructions after interrupt should look as if never started (abort)• basically, we also want the same thing for a mis-predicted branch

• what makes precise interrupts hard? • out-of-order completion ⇒ must undo post-interrupt writebacks• in-order ⇒ no post-branch writebacks before branch completes• out-of-order ⇒ can happen

A: with out-of-order, precise interrupts and mis-speculation recovery are same problem ⇒ same solution

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

4© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Solution: Precise State

• speculative execution requirements

• ability to abort & restart at every branch

• precise synchronous interrupt requirements• ability to abort & restart at every load, store, FP divide, ??

• precise asynchronous interrupt requirements• ability to abort & restart at every ??

• just bite the bullet • implement ability to abort & restart at every instruction• called “precise state”

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

5© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Ways to Implement Precise State

• imprecise state: ignore the problem!

– makes page faults (any restartable exceptions) hard– makes speculative execution almost impossible• IEEE standard strongly suggests precise state

• force in-order completion (WB): stall pipe if necessary– slow

• precise state in software– even slower - would require a trap for every misprediction

• precise state in hardware: save recovery info internally+ everything is better in hardware

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

6© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

The Problem with Precise State

problem is in the writeback stage (WB)

• mixes two things together that should be separate• (1) broadcasts values to RS, forwards to other instructions

• OK for this to be out-of-order

• (2) writes values to registers• would like this to be in-order

solution to every functionality problem? add a level of indirection

• have already seen this for out-of-order execution• split ID into in-order DS and out-of-order IS• separate using instruction buffer (scoreboard, reservation stations)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

7© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Re-Order Buffer (ROB)

I$

regfile

DSIF CMEX

F/D D/XPC X/C

IS

ROB

RT

instruction buffer ⇒ re-order buffer (ROB)

• buffers completed results en route to register file and D$!!!• may be combined with RS or separate (combined in the picture)

• split writeback (WB) into two stages: Complete and Retire

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

8© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Complete and Retire

I$

regfile

DSIF CMEX

F/D D/XPC X/C

IS

ROB

RT

• CM (complete)• completed values write results to ROB out-of-order• out-of-order stage ⇒ hazards result in waits

• RT (retire, but sometimes called “commit” or “graduate”)• ROB writes results to register file in-order• in-order stage ⇒ hazards result in stalls

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

9© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Memory Ordering Buffer (MOB)

ROB makes register writes in-order, but what about stores?

• same as before (i.e., to D$ in MEM stage)?• bad idea: imprecise memory worse than imprecise registers• must do same trick for stores

Memory Ordering Buffer (MOB)• a.k.a. store buffer, store queue, load/store queue (LSQ)• completed (but not retired) stores write to MOB• on RT, head of MOB written to D$• loads look at MOB and D$ in parallel

• forward from MOB if matching store (i.e. to same address)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

10© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

ROB+MOB

I$

regfile

DSIF CMEX

F/D D/XPC X/C

IS

ROB

RT

D$MOBstores

loads/storesstores loads

modulo some gross simplifications, this picture is almost realistic!!

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

11© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Tomasulo+ROB

• add ROB to Tomasulo’s algorithm

• combined ROB and RS are called RUU (or Sohi’s method)• RUU = register update unit

• separate ROB and RS are called P6-style (Intel P6 = Pentium Pro)

• our example: Simple-P6• separate ROB and RS• same RS organization as before: 1 ALU, 1 load, 1 store, 2 3-cycle FP

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

12© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6-style Organization

value

V1 V2

T+

T1 T2T

R

==

FUT

taildispatch

retirehead

value

RSROB

reg status

dispatch

RF

CD

B.V

CD

B.T

• instruction fields and ready bits• tags• values

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

13© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Data Structures

• RS are the same as before• ROB

• head, tail: to keep sequential order• R: output register of instruction, V: output value of instruction

• tags are different• was: RS# → now: ROB#

• register status table is different• T+: tag + “ready-in-ROB” bit• tag == 0 ⇒ result ready in register file• tag != 0 ⇒ result not ready• tag != 0 + ⇒ result ready in ROB

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

14© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Data Structures

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store No4 FP1 No

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1)2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0f2f4r1

CDBV T

5 FP2 No

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

15© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Pipeline

new pipeline structure: IF, DS, IS, EX, CM, RT

• DS (dispatch)• (RS/ROB/MOB full) ? (stall) : • {allocate RS/ROB/MOB entries, set RS tag to ROB#, set register status entry to ROB# with “ready-in-ROB” bit off, read ready registers into RS}

• EX (execute)• free RS entry• used to be done at WB• can be earlier now because RS# are not tags

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

16© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Pipeline

• CM (complete)

• (CDB not available) ? (wait) :• {write value into ROB entry indicated by RS tag, mark ROB entry complete, mark register status entry “ready-in-ROB” bit (+)}

• RT (retire, commit, graduate)• (ROB head not complete) ? (stall) :• {write ROB head result to register file, if store, then write MOB head to D$, handle any exceptions, free ROB/MOB entries}

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

17© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Dispatch (DS) part I

value

V1 V2

T+

T1 T2T

valueR

==

FUT

taildispatch

retirehead

RSROB

reg status

dispatch

RF

CD

B.V

CD

B.T

• stall if RS or ROB or MOB is full• allocate RS+ROB entries (assign ROB# to RS output tag)• set register status entry to ROB# and “ready-in-ROB” bit to 0

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

18© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Dispatch (DS) part II

value

V1 V2

T+

T1 T2T

valueR

==

FUT

taildispatch

retirehead

RSROB

reg status

dispatch

RF

CD

B.V

CD

B.T

• read tags for register inputs from register status table• if tag==0: copy value from RF (not shown)• if tag!=0: copy tag to RS • if tag!=0 +: copy value from ROB

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

19© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Complete (CM)

value

V1 V2

T+

T1 T2T

valueR

==

FUT

tailfetch

retirehead

RSROB

reg status

fetch

RF

CD

B.V

CD

B.T

• wait for CDB• broadcast <result,tag> on CDB• write result into ROB, set reg. status “ready-in-ROB” bit (+)• match tags, write CDB.V into RS of dependent instructions

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

20© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6: Retire (RT)

value

V1 V2

T+

T1 T2T

R

==

FUT

tailfetch

retirehead

value

RSROB

reg status

fetch

RF

CD

B.V

CD

B.T

• stall until instruction at ROB head has completed• write ROB head result to reg-file (D$ if store), clear reg. status entry• free ROB entry

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

21© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

22© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#1 REG[r1]3 store No4 FP1 No5 FP2 No

P6 Example: Cycle 1

hdtl

ROB + MOB# instruction R V addr IS EX CM

ht 1 ldf f0,X(r1) f0 &X[0]2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0 ROB#1f2f4r1

allocate RS

CDBV T

set reg. status

before ROB,this was RS#2

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

23© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 2

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#1 REG[r1]3 store No4 FP1 Yes mulf ROB#2 REG[f2] ROB#15 FP2 No

allocate RS,

Reg. Statusreg T+f0 ROB#1f2f4 ROB#2r1

CDBV T

set reg. status

allocate ROB,

hdtl

ROB + MOB# instruction R V addr IS EX CM

h 1 ldf f0,X(r1) f0 &X[0] c2t 2 mulf f4,f0,f2 f4

3 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

24© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 3

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 REG[r1] ROB#24 FP1 Yes mulf ROB#2 REG[f2] ROB#15 FP2 No

allocatefree

hdtl

ROB + MOB# instruction R V addr IS EX CM

h 1 ldf f0,X(r1) f0 &X[0] c2 c32 mulf f4,f0,f2 f4

t 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0 ROB#1f2f4 ROB#2r1

CDBV T

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

25© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 4

RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load No3 store Yes stf ROB#3 REG[r1] ROB#24 FP1 Yes mulf ROB#2 CDB.V REG[f2] ROB#15 FP2 No

allocate

ldf finished1. write result to ROB

f0 ready

hdtl

ROB + MOB# instruction R V addr IS EX CM

h 1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 c43 stf f4,Z(r1) &Z[0]

t 4 add r1,r1,#8 r15 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

Reg. Statusreg T+f0 ROB#1+f2f4 ROB#2r1 ROB#4

CDBV T

[f0] ROB#1

2. CDB broadcast

grab from CDB

3. set ready-in-ROB bit

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

26© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 5

RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load Yes ldf ROB#5 ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 No

retire, write ROBresult into regfile

Reg. Statusreg T+f0 ROB#5f2f4 ROB#2r1 ROB#4

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 c5

t 5 ldf f0,X(r1) f06 mulf f4,f0,f27 stf f4,Z(r1)

allocate

free

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

27© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 6

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#5 ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5 allocate

free

Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 c4 c5+3 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 c5 c65 ldf f0,X(r1) f0

t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

28© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 7

RS# FU busy op T V1 V2 T1 T21 ALU No2 load Yes ldf ROB#5 CDB.V ROB#43 store Yes stf ROB#3 REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5

stall DS, no free store RS

r1 ready

Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4+

CDBV T

[r1] ROB#4

write result into ROB, CDB

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 c4 c5+3 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7

t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)

grab from CDB

add finished

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

29© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 8

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 CDB.V REG[r1] ROB#24 FP1 No5 FP2 Yes mulf ROB#6 REG[f2] ROB#5

stall DS, no free store RS

stall RT

free

Reg. Statusreg T+f0 ROB#5f2f4 ROB#6r1 ROB#4+

CDBV T

[f4] ROB#2

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1) &Z[0] c84 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8

t 6 mulf f4,f0,f2 f47 stf f4,Z(r1)

f4 readygrab from CDB

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

30© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 9

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 Yes mulf ROB#6 CDB.V REG[f2] ROB#5

stall RT

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

[f0] ROB#5

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c8 c94 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9

t 7 stf f4,Z(r1) &Z[1]

f0 readygrab from CDB

free (ROB#3)allocate (ROB#7)

read from ROBnot reg. file (+)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

31© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 10

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 No

stall RT

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c8 c9 c104 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9 c10

t 7 stf f4,Z(r1)

free

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

32© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Example: Cycle 11

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 No

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

hdtl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1) &Z[0] c8 c9 c10

h 4 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9 c10

t 7 stf f4,Z(r1)retire stf

Precise State with ROB

point of ROB is maintaing precise state

• how does that work?• easy as 1,2,3

• 1: wait until precise condition reaches retire (RT) stage• 2: clear contents of ROB, RS & register status table• 3: start over

• works because zeros signify the right state for start-over• 0 in ROB/RS ⇒ entry is empty• tag==0 in register status table ⇒ register is written

• and because register file and D$ writes take place at RT• let’s look at example page fault at stf ...

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

33© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

34© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example: Cycle 9

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#7 ROB#4.V ROB#64 FP1 No5 FP2 Yes mulf ROB#6 CDB.V REG[f2] ROB#5

Reg. Statusreg T+f0 ROB#5+f2f4 ROB#6r1 ROB#4+

CDBV T

[f0] ROB#5

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c8 c94 add r1,r1,#8 r1 [r1] c5 c6 c75 ldf f0,X(r1) f0 &X[1] c7 c8 c96 mulf f4,f0,f2 f4 c9

t 7 stf f4,Z(r1)PAGE FAULT

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

35© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example:Cycle 10

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store No4 FP1 No5 FP2 No

faulting instruction at

Reg. Statusreg Tf0f2f4r1

CDBV T

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c83 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1) ROB head?

CLEAR EVERYTHING

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

36© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Precise State Example: Cycle 11

RS# FU busy op T V1 V2 T1 T21 ALU No2 load No3 store Yes stf ROB#3 REG[f4] REG[r1]4 FP1 No5 FP2 No

Reg. Statusreg Tf0f2f4r1

CDBV T

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

ht 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1) START OVER

P6 Precise State Example: Cycle 12

RS# FU busy op T V1 V2 T1 T21 ALU Yes add ROB#4 REG[r1]2 load No3 store Yes stf ROB#3 REG[f4] REG[r1]4 FP1 No

Reg. Statusreg Tf0f2f4r1 ROB#4

CDBV T

hd tl

ROB + MOB# instruction R V addr IS EX CM1 ldf f0,X(r1) f0 [f0] &X[0] c2 c3 c42 mulf f4,f0,f2 f4 [f4] c4 c5+ c8

h 3 stf f4,Z(r1) &Z[0] c12t 4 add r1,r1,#8 r1

5 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

5 FP2 No

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

37© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Performance

in other words: what is the cost of precise interrupts?

+ in general: same performance as plain Tomasulo+ maybe a little better (RS freed earlier -> fewer RS structural hazards)

– unless ROB is very small– in which case ROB structural hazards become a problem

• rules of thumb for ROB size• >= superscalar width (N) * number of stages between DS and RT• at least N * thit-L2• what’s the rationale behind these rules?

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

38© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 Organization Summary

• popular design for a while

+ easy (relatively) to implement correctly• to recover, just zero out everything• examples: PentiumPro, PowerPC, AMD K6

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

39© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Problems with P6

value

V1 V2

T+

T1 T2T

valueR

==

FUT

– problems for high performance implementations• too much value movement (regfile/ROB→RS→ROB→regfile) • too many muxes/buses (values from too many places)• RS mixes values with control tags (long data paths make clock slow)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

40© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Alternative Implementation: MIPS R10K

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

• separate control (ROB/RS) from data (registers/FU)• one big physical register file (PRF) holds all data → no copies• ROB and RS used only for control and tags → small• register file close to FUs, everything else is on the side

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

41© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Register Renaming

• no architectural register file!• physical register file holds all values

• #physical registers = #architectural registers + #ROB• map architectural registers to physical registers• removes WAW&WAR hazards (physical registers replace RS copies)

• register status table replaced by register map table• mappings cannot be 0 (there is no architectural register file)

• free list keeps track of unallocated physical registers• ROB responsible for returning physical registers to free list

• conceptually: true register renaming• have seen an example a few lectures ago ... here it is again ...

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

42© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Register Renaming Example

• names: r1,r2,r3, locations: l1,l2,l3,l4,l5,l6,l7• original mapping: r1→l1, r2→l2, r3→l3 (l4-l7 “free”)

map tableraw instruction r1 r2 r3 free locations renamed instruction

l1 l2 l3 l4,l5,l6,l7add r1,r2,r3 l4 l2 l3 l5,l6,l7 add l4,l2,l3sub r3,r2,r1 l4 l2 l5 l6,l7 sub l5,l2,l4mul r1,r2,r3 l6 l2 l5 l7 mul l6,l2,l5div r2,r1,r3 l6 l7 l5 div l7,l6,l5

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

43© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Freeing Registers in R10K

issue: freeing physical registers

• P6• no need to free speculative storage explicitly• temporary storage comes with ROB entry• RT: copy value from ROB to register file, free ROB entry

• R10K• can’t free physical register when writing instruction retires• no architectural register to copy value to• but...• we can free phys register previously mapped to same logical register• why? all instructions that will ever read that value have retired

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

44© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Freeing Registers in R10K

• when add commits: free l1• when sub commits: free l3• when mul commits: free ?• when div commits: free ?• see the pattern?

map tableraw instruction r1 r2 r3 free locations renamed instruction

l1 l2 l3 l4,l5,l6,l7add r1,r2,r3 l4 l2 l3 l5,l6,l7 add l4,l2,l3sub r3,r2,r1 l4 l2 l5 l6,l7 sub l5,l2,l4mul r1,r2,r3 l6 l2 l5 l7 mul l6,l2,l5div r2,r1,r3 l6 l7 l5 div l7,l6,l5

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

45© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Pipeline

same pipeline structure: IF,DS,IS,EX,CM,RT

• DS (dispatch)• (RS or ROB or MOB full or no physical registers) ? (stall) : • (allocate RS and ROB entries AND physical register)

• IS (issue)• (read physical registers)

• CM (completion)• (writeback destination register, mark ROB entry complete)

• RT (retire, commit, graduate)• (ROB head not complete) ? (stall) :• {if store then write MOB head to D$, handle any exceptions, free ROB/MOB entries, free previous physical register}

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

46© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K: Dispatch (DS)

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

• stall if no free RS, ROB, MOB or physical register (preg)• allocate RS and ROB entry• read physical register tags for input registers, store in RS• allocate new preg for output, set in RS, ROB and map table

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

47© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K: Complete (CM)

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

• wait for free CDB• broadcast tag on CDB.T• set instruction’s output register ready bit in map table• set ready bits for matching input tags in RS

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

48© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K: Retire (RT)

valueT+

T1+ T2+T

R

==

FUT

T ToldT

RS

dispatch

map table

ROB

PRF

head

tail

CDB.T

freelist

• stall until instruction at ROB head is complete• return Told of ROB head to free list• free ROB head entry

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

49© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

50© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Data Structures

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store No4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#1+f2 PR#2+f4 PR#3+r1 PR#4+

Free ListPR#5,PR#6, PR#7, PR#8

Told for freeing oldphysical register

always T mappings, evenwitn no in-flight instructions

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1)2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

only tags in RS and CDBno values, so need ready bits

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

51© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 1

RS# FU busy op T T1+ T2+1 ALU No2 load Yes ldf PR#5 PR#4+3 store No4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#3+r1 PR#4+

Free ListPR#5, PR#6, PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

ht 1 ldf f0,X(r1) PR#5 PR#1 &X[0]2 mulf f4,f0,f23 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

allocate new physical register to f0remember old physical registermapped to f0

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

52© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 2

RS# FU busy op T T1+ T2+1 ALU No2 load Yes ldf PR#5 PR#4+3 store No4 FP1 Yes mulf PR#6 PR#5 PR#2+5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4+

Free ListPR#6,

PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2t 2 mulf f4,f0,f2 PR#6 PR#3

3 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

allocate new physical register to f0remember old physical registermapped to f4

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

53© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 3

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store Yes stf PR#6 PR#4+4 FP1 Yes mulf PR#6 PR#5 PR#2+5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4+

Free List

PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c32 mulf f4,f0,f2 PR#6 PR#3

t 3 stf f4,Z(r1) &Z[0]4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

allocate RSfree RS

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

54© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 4

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load No3 store Yes stf PR#6 PR#4+4 FP1 Yes mulf PR#6 PR#5+ PR#2+5 FP2 No

Map Tablereg T+f0 PR#5+f2 PR#2+f4 PR#6r1 PR#7

Free List

PR#7, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM

h 1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c42 mulf f4,f0,f2 PR#6 PR#3 c43 stf f4,Z(r1) &Z[0]

t 4 add r1,r1,#8 PR#7 PR#45 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

PR#5

match PR#5 tag from CDB & issue

allocate RS

ldf finishedset ready bit, CDB broadcast

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

55© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Example: Cycle 5

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load Yes ldf PR#8 PR#73 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#8f2 PR#2+f4 PR#6r1 PR#7

Free ListPR#1 PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 PR#7 PR#4 c5

t 5 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

free RS

allocate RS

ldf retiresreturn PR#1 to free list,no one will ever read it again

R10K Precise State

problem with R10K way? precise state is more difficult

– registers already written• but that’s OK• why? because there is no architectural register file• we can “free” written registers and restore old ones

• we need to restore register map table to the way it was• option 1: roll back ROB serially (slow)• option 2: restore from a checkpoint (fast)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

56© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 5

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load Yes ldf PR#8 PR#73 store Yes stf PR#6 PR#4+4 FP1 No

Map Tablereg T+f0 PR#8f2 PR#2+f4 PR#6r1 PR#7

Free ListPR#1

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c53 stf f4,Z(r1) &Z[0]4 add r1,r1,#8 PR#7 PR#4 c5

t 5 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo instructions 3–5

5 FP2 No

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

57© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

58© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 6

RS# FU busy op T T1+ T2+1 ALU Yes add PR#7 PR#4+2 load No3 store Yes stf PR#6 PR#4+4 FP1 No5 FP2 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#7

Free ListPR#1, PR#8

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+3 stf f4,Z(r1) &Z[0]

t 4 add r1,r1,#8 PR#7 PR#4 c55 ldf f0,X(r1) PR#8 PR#56 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo ldf1. free load RS2. free T (PR#8)3. restore f0 mapping to Told (PR#5)4. free ROB entry #5

R10K Rollback Example: Cycle 7

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store Yes stf PR#6 PR#4+4 FP1 No

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4

Free ListPR#1,PR#8

PR#7

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

h 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+t 3 stf f4,Z(r1) &Z[0]

4 add r1,r1,#8 PR#7 PR#4 c55 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo add1. free ALU RS2. free T (PR#7)3. restore r1 mapping to Told (PR#3)4. free ROB entry #4

5 FP2 No

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

59© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

R10K Rollback Example: Cycle 8

Map Tablereg T+f0 PR#5f2 PR#2+f4 PR#6r1 PR#4

Free ListPR#1,PR#8

PR#7

hd tl

ROB + MOB# instruction T Told addr IS EX CM1 ldf f0,X(r1) PR#5 PR#1 &X[0] c2 c3 c4

ht 2 mulf f4,f0,f2 PR#6 PR#3 c4 c5+ c83 stf f4,Z(r1)4 add r1,r1,#85 ldf f0,X(r1)6 mulf f4,f0,f27 stf f4,Z(r1)

CDBT

undo stf1. free store RS2. free ROB entry #33. no registers to restore/free4. how is store undone?

RS# FU busy op T T1+ T2+1 ALU No2 load No3 store No4 FP1 No

5 FP2 No

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

60© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

P6 vs. R10K

• R10K style is more popular today• e.g., Alpha, Pentium4

feature P6 R10Kvalue storage RF, ROB, RS PRF

register read DS: RF/ROB → RS IS: PRF → FU

register write RT: ROB → RF CM: FU → PRF

spec value free RT: automatic with ROB when overwriting instructionretires

datapaths RF/ROB→RS, RS→FU, FU→ROB, ROB→RF

PRF→FU, FU→PRF

precise state simple, zero all structures complex: serial/checkpoint

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

61© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Advanced Topic: Load Scheduling

• all instructions except for loads are easy in Tomasulo

• register inputs only• register renaming captures all dependences• tags tell you exactly when you can execute

• loads not so easy• must check for older active stores with same address• register renaming doesn’t tell you that

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

62© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

The Data Memory FU

COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

address valueL/S

hea

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

• MOB+D$ = memory FU

• just like any other FU• 2 reg inputs: addr, data_in• 1 reg output: data_out

• what actually happens?

d

SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

63

Store/Load Dispatch

COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

address valueL/S

hea

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

• allocate MOB entry (tail)

• indicate store/load• remember MOB# in RS

d

SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

64

Store/Load Retire

COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

address valueL/S

hea

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

• free MOB entry (head)

• load? • done

• store?• address+value to D$/TLB

d

SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

65

Store Execute

COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

address valueL/S

hea

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

• address+value to MOB

• can be done separately• e.g., PentiumII: store→2µops

d

SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

66

Load Execute

COMP© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

address valueL/S

hea

tail

==

D$/TLB

MOB

data

_out

data

_in

addr

ess

in parallel

• address to D$• read value

• address to MOB• compare to older store addrs

• no match ⇒ D$ value• match +value ⇒ MOB value

• multiple matches? youngest• forwarding or bypassing• same latency as D$ hit (why?)

• match –value ⇒ stall• try executing load again later

d

SCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

67

Memory Disambiguation Problem

© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

address L/S

==

addr

ess

lo

h

at load execution• what if older store address unknown?

• how to determine match?• can’t determine, have to guess• called “memory disambiguation”

• what if older load address unknown?• who cares

ad

ead

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

68

Memory Disambiguation Alternatives

• conservative: loads in-order with respect to stores

• don’t know address? assume match, wait+ pretty simple– many unnecessary waits on non-matching stores

• opportunistic: out-of-order loads• don’t know address? assume no match, go+ higher performance (most cases are not matching)– mis-speculations: went too soon? recover (complex+expensive!)

• selective: combination of conservative and opportunistic• start out opportunistic• load mis-speculation? remember PC in table, next time conservative+ pretty accurate prediction ⇒ pretty good performance

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

69© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Pentium II vs. Pentium4

Pentium II Pentium4

Peak Clock 450MHz (Xeon) 2.4 (4.8 internal) GHzPipeline stages 14 22

Branch Prediction 512 entry local 2K entry hybridBTB 512 entries 4K entriesI$ 16KB 64KB T$ + 8KB I$D$ 16KB 8KBL2 512KB–2MB 256KB–2MB

Fetch Width 16 bytes 3 µops (16 bytes on miss)Rename/Retire Width 3 µops 3 µops

Execute Width 5 µops 7 µops (X2)Reservation Stations 20 60

ROB Size 40 128Register Renaming P6-style MIPS R10K-style

Memory Disambiguation Conservative Predictor-BasedAnything else? No Multithreading

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

70© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Limits of Instruction Level Parallelism (ILP)

• no need to build bigger superscalar if there isn’t ILP

• ultimately, how much ILP is there?

• ILP study• assume perfect/infinite hardware• successively refine to more realistic hardware• e.g.: [Wall’88], [Wilson+Lam’92]

• some (surprising) results• perfect/infinite/single-cycle everything: int ILP: >50!, FP ILP: >150!!• on actual machines: int ILP: ~2!!, FP ILP: ~3!• culprits? branch prediction, memory latency, finite ROB, issue width• read on your own (H+P, 3.8, 3.9)

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

71© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti

Summary

• dynamic scheduling + precise state + speculation

• re-order buffer (ROB)• WB ⇒ CM (out-of-order) + RT (in-order)• P6 vs. R10K• how is recovery done?

• load scheduling using • MOB+D$, bypassing values from MOB• memory disambiguation

• ILP limits• read on your own

next up: static (compiler) exploitation of ILP

COMPSCI 220 / ECE 252 Lecture NotesDynamic Scheduling II

72© 2004 by Lebeck, Sorin, Roth, Hill, Wood, Sohi, Smith,

Vijaykumar, Lipasti