pipelining 4 / caches 1 - university of virginia school of...

60
Pipelining 4 / Caches 1 1

Upload: others

Post on 30-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

Pipelining 4 / Caches 1

1

Page 2: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

after forwarding/prediction

where do we still need to stall?

memory output needed in fetchret followed by anything

memory output needed in exceutemrmovq or popq + use(in immediatelly following instruction)

2

Page 3: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

overall CPU

5 stage pipeline

1 instruction completes every cycle — except hazards

most data hazards: solved by forwarding

load/use hazard: 1 cycle of stalling

jXX control hazard: branch prediction + squashing2 cycle penalty for misprediction(correct misprediction after jXX finishes execute)

ret control hazard: 3 cycles of stalling(fetch next instruction after ret finishes memory)

3

Page 4: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret paths

pred.PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

jmp target(from other stage)

very long critical path

4

Page 5: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret paths

pred.PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

jmp target(from other stage)

very long critical path

4

Page 6: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret paths

pred.PC

Instr.Mem.

register filesrcA

srcB

R[srcA]R[srcB]

dstE

next R[dstE]

dstM

next R[dstM]

DataMem.

ZF/SF

Stat

Data in

Addr inData out

valC

0xF

0xF%rsp

%rsp

0xF

0xF%rsp

rA

rB

ALUaluA

aluBvalE8

0

add/subxor/and(functionof instr.)

write?

functionof opcode

PC+9

instr.length+

fetch

decode executememory

writeback

jmp target(from other stage) very long critical path

4

Page 7: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

unsolved problemcycle # 0 1 2 3 4 5 6 7 8

mrmovq 0(%rax), %rbx F D E M Wsubq %rbx, %rcx F D E M Wsubq %rbx, %rcx F D D E M W

stall

5

Page 8: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

unsolved problemcycle # 0 1 2 3 4 5 6 7 8

mrmovq 0(%rax), %rbx F D E M Wsubq %rbx, %rcx F D E M Wsubq %rbx, %rcx F D D E M W

stall

5

Page 9: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

solveable problemcycle # 0 1 2 3 4 5 6 7 8

mrmovq 0(%rax), %rbx F D E M Wrmmovq %rbx, 0(%rcx) F D E M W

common for real processors to do thisbut our textbook only forwards to the end of decode

6

Page 10: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

fetch/fetch logic — advance or not

predicted PC

MUXfrom incremented PC

should we stall?

7

Page 11: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

fetch/decode logic — bubble or not

rA

MUXno-op value — 0xF

should we sendno-op value (“bubble”)?

8

Page 12: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

HCLRS signals

register aB {...

}

HCLRS: every register bank has these MUXes built-in

stall_B: keep old value for all registersregister input → register output

bubble_B: use default value for all registersregister input → default value

9

Page 13: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exerciseregister aB {

value : 8 = 0xFF;}...time a_value B_value stall_B bubble_B0 0x01 0xFF 0 01 0x02 ??? 1 02 0x03 ??? 0 03 0x04 ??? 0 14 0x05 ??? 0 05 0x06 ??? 0 06 0x07 ??? 1 07 0x08 ??? 1 08 ???

stall: keep old valuebubble: store default value

10

Page 14: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exercise resultregister aB {

value : 8 = 0xFF;}...time a_value B_value stall_B bubble_B0 0x01 0xFF 0 01 0x02 0x01 1 02 0x03 0x01 0 03 0x04 0x03 0 14 0x05 0xFF 0 05 0x06 0x05 0 06 0x07 0x06 1 07 0x08 0x06 1 08 0x06

11

Page 15: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exercise: squash + stall (1)time fetch decode execute memory writeback

1 E D C B A

2 E nop C nop B

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

? ? ? ? ?

S B S B N

exercise: what are the ?swrite down your answers,then compare with your neighbors

12

Page 16: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exercise: squash + stall (1)time fetch decode execute memory writeback

1 E D C B A

2 E nop C nop B

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

? ? ? ? ?

S B S B N

exercise: what are the ?swrite down your answers,then compare with your neighbors

12

Page 17: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exercise: squash + stall (2)time fetch decode execute memory writeback

1 E D C B A

2 F E C nop B

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

? ? ? ? ?

N N S B N

exercise: what are the ?swrite down your answers,then compare with your neighbors

13

Page 18: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exercise: squash + stall (2)time fetch decode execute memory writeback

1 E D C B A

2 F E C nop B

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

? ? ? ? ?

N N S B N

exercise: what are the ?swrite down your answers,then compare with your neighbors

13

Page 19: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

exercise: squash + stall (2)time fetch decode execute memory writeback

1 E D C B A

2 F E C nop B

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

? ? ? ? ?

N N S B N

exercise: what are the ?swrite down your answers,then compare with your neighbors

13

Page 20: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret stalltime fetch decode execute memory writeback

0 call

1 ret call

2 wait for ret ret call

3 wait for ret nothing ret call (store)

4 wait for ret nothing nothing ret (load) call

5 addq nothing nothing nothing ret

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

S N N

S B N N

S B N N N

N B N N N

14

Page 21: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret stalltime fetch decode execute memory writeback

0 call

1 ret call

2 wait for ret ret call

3 wait for ret nothing ret call (store)

4 wait for ret nothing nothing ret (load) call

5 addq nothing nothing nothing ret

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

S N N

S B N N

S B N N N

N B N N N

14

Page 22: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret stalltime fetch decode execute memory writeback

0 call

1 ret call

2 wait for ret ret call

3 wait for ret nothing ret call (store)

4 wait for ret nothing nothing ret (load) call

5 addq nothing nothing nothing ret

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

S N N

S B N N

S B N N N

N B N N N

14

Page 23: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret stalltime fetch decode execute memory writeback

0 call

1 ret call

2 wait for ret ret call

3 wait for ret nothing ret call (store)

4 wait for ret nothing nothing ret (load) call

5 addq nothing nothing nothing ret

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

S N N

S B N N

S B N N N

N B N N N

14

Page 24: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

ret stalltime fetch decode execute memory writeback

0 call

1 ret call

2 wait for ret ret call

3 wait for ret nothing ret call (store)

4 wait for ret nothing nothing ret (load) call

5 addq nothing nothing nothing ret

stall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

S N N

S B N N

S B N N N

N B N N N

14

Page 25: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

HCLRS bubble example

register fD {icode : 4 = NOP;rA : 4 = REG_NONE;rB : 4 = REG_NONE;...

};wire need_ret_bubble : 1;need_ret_bubble = ( D_icode == RET ||

E_icode == RET ||M_icode == RET );

bubble_D = ( need_ret_bubble ||... /* other cases */ );

15

Page 26: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

squashing with stall/bubbletime fetch decode execute memory writeback

1 subq

2 jne subq

3 addq [?] jne subq (set ZF)

4 rmmovq [?] addq [?] jne (use ZF) subq

5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

N N N

N N N N

N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq

16

Page 27: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

squashing with stall/bubbletime fetch decode execute memory writeback

1 subq

2 jne subq

3 addq [?] jne subq (set ZF)

4 rmmovq [?] addq [?] jne (use ZF) subq

5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

N N N

N N N N

N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq

16

Page 28: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

squashing with stall/bubbletime fetch decode execute memory writeback

1 subq

2 jne subq

3 addq [?] jne subq (set ZF)

4 rmmovq [?] addq [?] jne (use ZF) subq

5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

N N N

N N N N

N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq

16

Page 29: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

squashing with stall/bubbletime fetch decode execute memory writeback

1 subq

2 jne subq

3 addq [?] jne subq (set ZF)

4 rmmovq [?] addq [?] jne (use ZF) subq

5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

N N N

N N N N

N B B N N

can compute bubble signal based on execute phasewon’t even start CC write for addq

16

Page 30: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

squashing with stall/bubbletime fetch decode execute memory writeback

1 subq

2 jne subq

3 addq [?] jne subq (set ZF)

4 rmmovq [?] addq [?] jne (use ZF) subq

5 xorq nothing nothing jne (done) subqstall (S) = keep old value; normal (N) = use new valuebubble (B) = use default (no-op);

N N

N N N

N N N N

N B B N Ncan compute bubble signal based on execute phasewon’t even start CC write for addq

16

Page 31: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

squashing HCLRS

just_detected_mispredict =e_icode == JXX && !branchTaken;

bubble_D = just_detected_mispredict || ...;bubble_E = just_detected_mispredict || ...;

17

Page 32: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

missing pieces

multi-cycle memories

beyond pipelining: out-of-order, multiple issue

18

Page 33: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

missing pieces

multi-cycle memories

beyond pipelining: out-of-order, multiple issue

19

Page 34: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

multi-cycle memories

ideal case for memories: single-cycle

achieved with caches (next topic)fast access to small number of things

typical performance:90+% of the time: single-cycle

sometimes many cycles (3–400+)

20

Page 35: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

variable speed memoriescycle # 0 1 2 3 4 5 6 7 8

mrmovq 0(%rbx), %r8 F D E M Wmrmovq 0(%rcx), %r9 F D E M Waddq %r8, %r9 F D D E M W

mrmovq 0(%rbx), %r8 F D E M M M M M Wmrmovq 0(%rcx), %r9 F D E E E E E M M M M Maddq %r8, %r9 F D D D D D D D D D D

memory is fast: (cache “hit”; recently accessed?)

memory is slow: (cache “miss”)

21

Page 36: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

missing pieces

multi-cycle memories

beyond pipelining: out-of-order, multiple issue

22

Page 37: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

beyond pipelining: multiple issue

start more than one instruction/cycle

multiple parallel pipelines; many-input/output register file

hazard handling much more complexcycle # 0 1 2 3 4 5 6 7 8

addq %r8, %r9 F D E M Wsubq %r10, %r11 F D E M Wxorq %r9, %r11 F D E M Wsubq %r10, %rbx F D E M W…

23

Page 38: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

beyond pipelining: out-of-order

find later instructions to do instead of stallinglists of available instructions in pipeline registers

take any instruction with available values

provide illusion that work is still done in ordermuch more complicated hazard handling logic

cycle # 0 1 2 3 4 5 6 7 8mrmovq 0(%rbx), %r8 F D E M M M Wsubq %r8, %r9 F D E Waddq %r10, %r11 F D E Wxorq %r12, %r13 F D E W…

24

Page 39: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

better branch prediction

forward (target > PC) not taken; backward taken

intuition: loops:

LOOP: ......je LOOP

LOOP: ...jne SKIP_LOOP...jmp LOOP

SKIP_LOOP:

25

Page 40: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

predicting ret: extra copy of stack

predicting ret — stack in processor registers

different than real stack/out of room? just slowerbaz saved registers

baz return address

bar saved registers

bar return address

foo local variables

foo saved registers

foo return address

foo saved registers

stack in memory

baz return address

bar return address

foo return address

(partial?) stackin CPU registers

26

Page 41: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

prediction before fetch

real processors can take multiple cycles to read instruction memory

predict branches before reading their opcodes

how — more extra data structurestables of recent branches (often many kilobytes)

27

Page 42: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cacheL2 cache

Branch Prediction(approximate)

Image: approx 2004 AMD press image of Opteron die;approx register/branch prediction location via chip-architect.org (Hans de Vries) 28

Page 43: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

stalling/misprediction and latency

hazard handling where pipeline latency matterslonger pipeline — larger penaltypart of Intel’s Pentium 4 problem (c. 2000)

on release: 50% higher clock rate, 2-3x pipeline stages of competitors

out-of-order, multiple issue processorfirst-generation review quote:

Review quote: Anand Lai Shimpi, “Intel Pentium 4 1.4 & 1.5 GHz”, AnandTech, 20 November 2000 29

Page 44: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPU

RegistersL1 cache

L2 cacheL3 cache

mainmemory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 45: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cacheL2 cache

L3 cachemain

memory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 46: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cache

L2 cacheL3 cache

mainmemory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 47: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cacheL2 cache

L3 cachemain

memory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 48: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cacheL2 cache

L3 cachemain

memory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 49: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cacheL2 cache

L3 cachemain

memory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 50: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

2004 CPURegisters

L1 cacheL2 cache

L3 cachemain

memory

< 1 ns

∼ 1 ns

∼ 5 ns

∼ 20 ns

∼ 100 ns

Image: approx 2004 AMD press image of Opteron die;approx register location via chip-architect.org (Hans de Vries) 30

Page 51: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

cache: real memory

Data MemoryAKA

L1 Data Cache

address

input (if writing)write enable

value

ready?

L2 Cache

31

Page 52: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

cache: real memory

Data MemoryAKA

L1 Data Cache

address

input (if writing)write enable

value

ready?

L2 Cache

31

Page 53: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

the place of cache

CPU Cache

RAMor

anothercache

read 0xABCD?read 0x1234?

0xABCD is 10000x1234 is 4000

read 0xABCD?

0xABCD is 1000

32

Page 54: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

memory hierarchy goals

performance of the fastest (smallest) memoryhide 100x latency difference? 99+% hit (= value found in cache) rate

capacity of the largest (slowest) memory

33

Page 55: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

memory hierarchy assumptions

temporal locality“if a value is accessed now, it will be accessed again soon”

caches should keep recently accessed values

spatial locality“if a value is accessed now, adjacent values will be accessed soon”

caches should store adjacent values at the same time

natural properties of programs — think about loops

34

Page 56: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

locality examples

double computeMean(int length, double *values) {double total = 0.0;for (int i = 0; i < length; ++i) {

total += values[i];}return total / length;

}

temporal locality: machine code of the loop

spatial locality: machine code of most consecutive instructions

temporal locality: total, i, length accessed repeatedly

spatial locality: values[i+1] accessed after values[i]35

Page 57: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

backup slides

36

Page 58: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

HCLRS pipelining debugging: intro

debugging pipelines is consistently one of the biggest sources ofdifficulty in this class

37

Page 59: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

HCL2D pipeline debugging (1)

draw a picture of the state of the instructions

get -d outputredirect to a file./hclrs file.hcl -d input.yo >output.txt

check each stage of the broken instruction

expect forwarding/hazard-handling problems

38

Page 60: Pipelining 4 / Caches 1 - University of Virginia School of ...cr4bd/3330/S2018/slides/20180315--sli… · retpaths pred. PC Instr. Mem. registerfile srcA srcB R[srcA] R[srcB] dstE

HCL2D pipeline debugging (2)

write assembly — not just supplied test caseswrite file.ys — make file.yo to assembleremove anything not involved in the errorfind a minimal test casedon’t spend time looking at irrelevant instructions

draw the pipeline stageswhat instructions are in fetch/decode/etc. when?what values should be in pipelien registers when?what forwarding should happen?

39