CS 152, Spring 2010 Section 5 Andrew Waterman University of California, Berkeley

Post on 23-Jan-2016

Mystery Die

• NVIDIA GTX280
• 240 cores * 1.296 GHz * 3 flops/cycle
  – 933 GFLOPS (Nhm is 8*3G*8 = 192 GFLOPS)
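The peak-throughput arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `peak_gflops` is made up for illustration, and the second call reads the slide's "8*3G*8" as 8 cores at 3 GHz with 8 flops/cycle, which is an assumption about the shorthand.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Peak GFLOPS = cores * clock in GHz * flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

# Numbers taken from the slide.
gtx280 = peak_gflops(240, 1.296, 3)   # the slide rounds this to 933
nhm = peak_gflops(8, 3.0, 8)          # assumed reading of "8*3G*8"

print(f"GTX280: {gtx280:.0f} GFLOPS, Nhm: {nhm:.0f} GFLOPS")
```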

Agenda

• Quiz 1 Post-Mortem
• VM & Caches
• Return PS1
  – Graded only for completeness

Quiz 1, Q1

• Microcode for JALM offset(rs)
• Corner case didn’t hurt performance
• Straightforward sol’n: (27/29 points)
  – A <- R[rs]
  – B <- sExt16(imm)
  – MA <- A+B
  – A <- PC // PC = PC+4 already happened
  – R[31] <- A
  – PC <- M[MA]

Quiz 1, Q1

• Cleverer sol’n:
  – B <- R[rs] // use commutative property
  – R[31] <- A+4 // A still has old PC
  – A <- sExt16(imm)
  – MA <- A+B
  – PC <- M[MA]
• AFAIK, this is the only 5-line solution
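To see that the two microcode sequences leave the same architectural state, here is a small Python sketch that steps through both. The state layout (a dict with A, B, MA, PC, R, M) and the helper `after_fetch` are assumptions made for illustration; the only facts taken from the slides are that A holds the old PC and PC has already been incremented when the JALM microcode starts.

```python
MASK = 0xFFFFFFFF

def sext16(imm):
    """Sign-extend a 16-bit immediate."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def after_fetch(pc, regs, mem):
    """Assumed state on entry: A holds the old PC, PC is already PC+4."""
    return {"A": pc, "B": 0, "MA": 0, "PC": (pc + 4) & MASK,
            "R": dict(regs), "M": mem}

def jalm_straightforward(s, rs, imm):
    s["A"] = s["R"][rs]
    s["B"] = sext16(imm)
    s["MA"] = (s["A"] + s["B"]) & MASK
    s["A"] = s["PC"]                   # PC = PC+4 already happened
    s["R"][31] = s["A"]
    s["PC"] = s["M"][s["MA"]]
    return s

def jalm_clever(s, rs, imm):
    s["B"] = s["R"][rs]                # operands swapped; A+B is commutative
    s["R"][31] = (s["A"] + 4) & MASK   # A still has the old PC
    s["A"] = sext16(imm)
    s["MA"] = (s["A"] + s["B"]) & MASK
    s["PC"] = s["M"][s["MA"]]
    return s

# Hypothetical test values: rs = r5 = 0x1000, imm = -4, M[0xFFC] = 0x4000.
regs = {5: 0x1000, 31: 0}
mem = {0x0FFC: 0x4000}
a = jalm_straightforward(after_fetch(0x200, regs, mem), rs=5, imm=0xFFFC)
b = jalm_clever(after_fetch(0x200, regs, mem), rs=5, imm=0xFFFC)
print(hex(a["R"][31]), hex(a["PC"]), hex(b["R"][31]), hex(b["PC"]))
```

Both versions read R[rs] before writing R[31], so the sketch also behaves the same when rs is r31.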

Quiz 1, Q1

• Common problems:
  – Forgetting that A already had the old PC, so took an extra cycle
  – Forgetting that PC was already incremented, so did R[31] <- oldPC+8
  – Being overly conservative with don’t-cares
    • Can destroy IR as soon as you’ve read rs, imm
    • Can set load-enable to DC in the cycle the value is used
• Almost all points deducted were nit-picks

Quiz 1, Q2

• 6-stage pipeline; new writeback at end of EX
• When ALUop has proceeded to M1, the writeback value is available to insn in ID
  – Second write port doesn’t help the immediately-subsequent insn, just the one after it
  – Example insn sequence that benefits from it:
    • add r1, r2, r3
    • sub r11, r12, r13
    • add r21, r1, r23
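The timing behind this example can be tabulated with a short Python sketch. It assumes the six stages are named IF, ID, EX, M1, M2, WB (IF is a guess; the others appear on the slides), one insn issued per cycle, and no stalls. The helper `stage_at` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]  # "IF" is an assumed name

def stage_at(insn_index, cycle):
    """Stage occupied by the insn issued in cycle `insn_index`, or None."""
    pos = cycle - insn_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None

program = ["add r1, r2, r3", "sub r11, r12, r13", "add r21, r1, r23"]
for cycle in range(len(program) + len(STAGES) - 1):
    row = [stage_at(i, cycle) or "--" for i in range(len(program))]
    print(cycle, row)
```

In the cycle where the first add is in M1, the sub is already in EX (too late to read the register file), while the second add is in ID and can read r1 from the register file.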

Quiz 1, Q2

• 6-stage pipeline; new writeback at end of EX
• When ALUop has proceeded to M1, the writeback value is available to insn in ID
  – Can remove bypass from end of M1 to end of ID
    • Equivalently, start of M2 to start of EX
  – Can also remove *ALU* bypass from end of M2 to end of ID, and end of WB to end of ID
    • Still needed for bypassing load results
    • Didn’t require this answer

Quiz 1, Q2

• 6-stage pipeline; new writeback at end of EX
• Problem with precise state:
  – Memory address exceptions not detected until M2
  – By then, a subsequent ALU op has written back
    • lw r1,-1(r0) // misaligned address
    • xor r2,r3,r4 // r2 modified anyway
  – Fix with interlock:
    • Stall any ALU op immediately following any load/store
    • Actually reduces control logic (interlock is already there for a load followed by a dependent ALU op)
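A rough cycle count shows why the early writeback breaks precise state and why a one-cycle stall is enough. The sketch assumes the stage names IF, ID, EX, M1, M2, WB, one insn issued per cycle, and that a trap detected during M2 can suppress a register write at the end of that same cycle; the last point is an assumption, not something the slides state.

```python
STAGES = ["IF", "ID", "EX", "M1", "M2", "WB"]  # "IF" is an assumed name

def cycle_in_stage(insn_index, stage, stalls=0):
    """Cycle in which an insn reaches `stage`, with `stalls` bubbles ahead of it."""
    return insn_index + stalls + STAGES.index(stage)

lw_trap_detected = cycle_in_stage(0, "M2")             # lw r1,-1(r0)
xor_write_no_stall = cycle_in_stage(1, "EX")           # xor writes at end of EX
xor_write_stalled = cycle_in_stage(1, "EX", stalls=1)  # with the interlock

# True means the write lands before the trap is known (state is corrupted).
print(xor_write_no_stall < lw_trap_detected)
print(xor_write_stalled < lw_trap_detected)
```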

Quiz 1, Q2

• 6-stage pipeline; new writeback at end of EX
• Problem with precise state:
  – Memory address exceptions not detected until M2
  – By then, a subsequent ALU op has written back
    • lw r1,-1(r0) // misaligned address
    • xor r2,r3,r4 // r2 modified anyway
  – Fix with additional read port:
    • Use read port to read *rd* (r2 in above example)
    • If lw causes trap, can then restore old value of rd
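The read-port fix amounts to saving the old value of rd before the early write and undoing the write if the older load traps. A minimal Python sketch of that idea follows; the function `early_writeback` and the register values are hypothetical, and real hardware would hold the saved value in a pipeline register rather than a local variable.

```python
def early_writeback(regs, rd, new_value, load_traps):
    """Write rd early, but restore its old value if the older load traps."""
    old_value = regs[rd]      # extra read port reads rd
    regs[rd] = new_value      # early writeback at end of EX
    if load_traps:            # trap detected later, in M2
        regs[rd] = old_value  # undo, so state is precise again
    return regs

# xor r2,r3,r4 following a load; register contents are made up.
regs = {2: 0xAAAA, 3: 0x0F0F, 4: 0x00FF}
early_writeback(regs, rd=2, new_value=regs[3] ^ regs[4], load_traps=True)
print(hex(regs[2]))
```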

Quiz 1, Q3

• Reducing number of registers in ISA
  – Increases instructions/program because more registers must be spilled to the stack
  – Increases CPI because of load-use delay (these loads will be harder to schedule around)
    • Little penalty for “no effect”
    • Subtle: could decrease CPI for some programs with bad D$ hit rates; stack accesses will almost always hit
  – Smaller RF could shorten critical path

Quiz 1, Q3

• Adding a branch delay slot
  – Compiler can’t always fill delay slot usefully, so more NOPs => more insns/program
  – CPI decreases because fewer control hazards are possible. Also, new NOPs have low CPI
  – Small critical path reduction: don’t need control signal to squash instructions after a taken branch
    • Credit still given for “no effect”

Quiz 1, Q3

• Merging Execute and Memory Stages
  – No effect on insns/program: not ISA visible
  – Decreases CPI: eliminates load-use delay
    • NOT just because the pipeline depth is reduced
  – Address calculation added to critical path

Quiz 1, Q3

• Microcoded CISC -> pipelined RISC
  – Increases insns/program: CISCs take fewer insns to encode a given program
  – Decreases CPI: RISC pipelines can sustain CPIs close to 1, whereas microcoded machines take several clocks per insn
  – Toss-up on seconds/cycle
    • Bypasses and extra control signals in pipeline are slow
    • Shared bus in microcoded machine could be slow, too
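The three quantities graded in Q3 (insns/program, CPI, seconds/cycle) multiply into execution time, so a change that hurts one factor can still win overall. The Python sketch below illustrates this with made-up numbers; they are not measurements and are not from the quiz.

```python
def exec_time(insns, cpi, seconds_per_cycle):
    """Execution time = insns/program * cycles/insn * seconds/cycle."""
    return insns * cpi * seconds_per_cycle

# Hypothetical values: the RISC runs 1.5x the insns at a much lower CPI,
# with the cycle time held equal (the "toss-up" case).
cisc = exec_time(insns=1.0e6, cpi=6.0, seconds_per_cycle=1e-9)
risc = exec_time(insns=1.5e6, cpi=1.2, seconds_per_cycle=1e-9)

print(f"CISC: {cisc * 1e3:.2f} ms, RISC: {risc * 1e3:.2f} ms")
```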

VM & Caches

Organization                     Performance     Aliasing
Virtual, DM, size <= pgsize      OK with ASID    No
Virtual, SA, size/A <= pgsize    OK with ASID    Yes
Virtual, others                  OK with ASID    Yes
VIPT, DM, size <= pgsize         OK              No
VIPT, SA, size/A <= pgsize       OK              No
VIPT, others                     OK              Yes
Physical, any organization       Bad             No
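The Aliasing column can be expressed as a small predicate. The Python sketch below simply encodes the table's rows; the function name `can_alias` and the tagging labels are made up for illustration, and sizes are in bytes.

```python
def can_alias(tagging, cache_size, assoc, page_size):
    """Return True if the table lists aliasing as possible for this cache.

    tagging: "virtual" (virtually indexed and tagged), "vipt", or "physical".
    assoc == 1 means direct-mapped (DM).
    """
    way_size = cache_size // assoc          # "size/A" in the table
    fits_in_page = way_size <= page_size    # index bits lie in the page offset
    if tagging == "physical":
        return False
    if tagging == "vipt":
        return not fits_in_page
    if tagging == "virtual":
        # Only the DM case with size <= pgsize is alias-free.
        return not (fits_in_page and assoc == 1)
    raise ValueError(f"unknown tagging: {tagging}")

# Example: an 8 KB, 2-way cache with 4 KB pages.
print(can_alias("virtual", 8192, 2, 4096), can_alias("vipt", 8192, 2, 4096))
```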