rireview and ad dad vanced concepts - ece.ucsb.edustrukov/ece154winter2012/lecture17_18.pdf · 4...

R i d Ad d CReview and Advanced Concepts

Adapted from instructor’s supplementary material from Computersupplementary material from ComputerOrganization and Design, 4th Edition,Patterson & Hennessy, © 2008, MK]y, , ]

PC IF/ID ID/EX EX/M M/WB

Pipelining Review

IF ID

PC IF/ID ID/EX EX/M M/WB

EX M W

WW

IF IDPC IF/ID

ID/EX

EX/M

M/W

B

EX M W

I0: add R4 = R1 + R0I1: sub R9 = R3 – R4I2: add R4 = R5 + R6I3: lw R2, 100 (R3)I4: lw R2, 0 (R2)I5: sw R2, 100 (R4)I6: and R2 = R2 & R1I7: beq R9 == R1, TARGETI8: and R9 = R9 & R1

IF ID EX M WI0

c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13

IF ID EX M W

IF ID EX M W

IF ID EX M W

I1

I2

IF ID EX M W

IF ID EX M W

IF ID EX M W

I3

I4

I4/I5

IF ID EX M W

IF ID EX M W

IF ID EX M W

/

I5

I6

IF ID EX M W

IF ID EX M W

I7

I8

IF ID EX M WIF ID EX M W

I0I1


IF ID EX M W

I1I2I3I4



I4/I5I5I6

C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C18


I7I8

I0IF ID EX

MEM

WB

I1IF ID EX

MEM

WB

I2IF ID EX

MEM

WB

I3 ME WI3IF ID EX

MEM

WB

I4IF ID X EX

MEM

WB

I5IF X ID EX

MEM

WB

I6 MEI6X IF ID EX

MEM

WB

I7IF ID EX

MEM

WB

I8IF ID EX

MEM

WB

Memory Hierarchy Review

VA

VA PA misshit ¾ t¼ t

TLBLookup

Cache MainMemory

VA PA miss

hitmissL1 M S

HW VM

VA

data

Trans‐lation

L1 M S

4 Questions for the Memory Hierarchy

• Q1: Where can a entry be placed in the upper level? (Entry placement)

• Q2: How is a entry found if it is in the upper level?level?(Entry identification)

• Q3: Which entry should be replaced on a miss? (Entry replacement)

• Q4: What happens on a write? (Write strategy)(Write strategy)

Q1&Q2: Where can a entry be placed/found?

# of sets Entries per setDirect mapped # of entries 1ppSet associative (# of entries)/ associativity Associativity (typically

2 to 16)Fully associative 1 # of entriesFully associative 1 # of entries

Location method # of comparisonsLocation method # of comparisonsDirect mapped Index 1Set associative Index the set; compare

t’ tDegree of

i ti itset’s tags associativityFully associative Compare all entries’ tags

Separate lookup (page) # of entries0p p (p g )

table

Q3: Which entry should be replaced on a miss?• Easy for direct mapped – only one choice• Set associative or fully associative

– Random– LRU (Least Recently Used)

• For a 2‐way set associative, random replacement has a miss rate about 1.1 times higher than LRU

• LRU is too costly to implement for high levels of associativity (> 4‐way) since tracking the usage information is costlyinformation is costly

Q4: What happens on a write?• Write‐through – The information is written to the entry inWrite through The information is written to the entry in

the current memory level and to the entry in the next level of the memory hierarchy– Always combined with a write buffer so write waits to next levelAlways combined with a write buffer so write waits to next level

memory can be eliminated (as long as the write buffer doesn’t fill)

• Write‐back – The information is written only to the entry in y ythe current memory level. The modified entry is written to next level of memory only when it is replaced.– Need a dirty bit to keep track of whether the entry is clean or y p y

dirty– Virtual memory systems always use write‐back of dirty pages to

disk• Pros and cons of each?

– Write‐through: read misses don’t result in writes (so are simpler and cheaper), easier to implement

– Write‐back: writes run at the speed of the cache; repeated writes require only one write to lower level

Improving Performancep g

• Performance = Instr Count * CPI * Clock time

Average CPI > Ideal CPI

Get CPI close to the ideal one(Forwarding, branch prediction, multithreading, prefetch)

E l i I i l l ll li (ILP) Exploit Instruction level parallelism (ILP) (pipelining, superscalar, out of order)

Exploit data level parallelism (DLP) Exploit data level parallelism (DLP)(vector processors)

(will not discuss multicore or multiprocessors)

Instruction‐Level Parallelism (ILP)

• Pipelining: executing multiple instructions in parallelT i ILP• To increase ILP– Deeper pipeline

• Less work per stage shorter clock cycle

CPI * Clock time

Less work per stage shorter clock cycle

Multiple issue– Multiple issue• Replicate pipeline stages multiple pipelines• Start multiple instructions per clock cycleE 4GH 4 l i l i• E.g., 4GHz 4‐way multiple‐issue

– 16 BIPS, peak CPI = 0.25, peak IPC = 4

• But dependencies reduce this in practice

Multiple IssueMultiple Issue

• Static multiple issuep– Compiler groups instructions to be issued together

– Packages them into “issue slots”

– Compiler detects and avoids hazards

• Dynamic multiple issue– CPU examines instruction stream and chooses instructions to issue each cycle

Compiler can help by reordering instructions– Compiler can help by reordering instructions

– CPU resolves hazards using advanced techniques at runtime

MIPS with Static Dual Issue• Two‐issue packets

– One ALU/branch instructionO e U/b a c st uct o

– One load/store instruction

– 64‐bit aligned• ALU/branch, then load/store

• Pad an unused instruction with nop

Address Instruction type Pipeline Stages

n ALU/branch IF ID EX MEM WB

n + 4 Load/store IF ID EX MEM WB

n + 8 ALU/branch IF ID EX MEM WB


n + 16 ALU/branch IF ID EX MEM WB


MIPS with Static Dual Issue

Static Multiple Issue

• Compiler groups instructions into “issueCompiler groups instructions into issue packets”– Group of instructions that can be issued on a psingle cycle

– Determined by pipeline resources required

• Think of an issue packet as a very long instruction– Specifies multiple concurrent operations– Very Long Instruction Word (VLIW)

Scheduling Static Multiple IssueScheduling Static Multiple Issue

• Compiler must remove some/all hazardsCompiler must remove some/all hazards– Reorder instructions into issue packets

No dependencies with a packet– No dependencies with a packet

– Possibly some dependencies between packets• Varies between ISAs; compiler must know!• Varies between ISAs; compiler must know!

– Pad with nop if necessary

Hazards in the Dual‐Issue MIPS

• More instructions executing in parallelMore instructions executing in parallel

• EX data hazard– Forwarding avoided stalls with single‐issueForwarding avoided stalls with single issue

– Now can’t use ALU result in load/store in same packet• add $t0, $s0, $s1l d $ 2 0($t0)load $s2, 0($t0)

• Split into two packets, effectively a stall

• Load‐use hazardLoad use hazard– Still one cycle use latency, but now two instructions

• More aggressive scheduling requiredgg g q

Scheduling Example• Schedule this for dual‐issue MIPS

Loop: lw $t0, 0($s1) # $t0=array elementaddu $t0, $t0, $s2 # add scalar in $s2sw $t0, 0($s1) # store resultaddi $s1, $s1,–4 # decrement pointerbne $s1, $zero, Loop # branch $s1!=0

ALU/b h L d/ t lALU/branch Load/store cycleLoop: nop lw $t0, 0($s1) 1

addi $s1, $s1,–4 nop 2

dd $ 0 $ 0 $ 2 3addu $t0, $t0, $s2 nop 3

bne $s1, $zero, Loop sw $t0, 4($s1) 4

IPC 5/4 1 25 (c f peak IPC 2) IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Loop UnrollingLoop Unrolling

• Replicate loop body to expose moreReplicate loop body to expose more parallelism– Reduces loop control overhead– Reduces loop‐control overhead

• Use different registers per replicationC ll d “ i i ”– Called “register renaming”

– Avoid loop‐carried “anti‐dependencies”f ll d b l d f h• Store followed by a load of the same register

• Aka “name dependence”– Reuse of a register nameReuse of a register name

Loop Unrolling Example

ALU/branch Load/store cycleLoop: addi $s1, $s1,–16 lw $t0, 0($s1) 1

nop lw $t1, 12($s1) 2

addu $t0, $t0, $s2 lw $t2, 8($s1) 3

addu $t1, $t1, $s2 lw $t3, 4($s1) 4

addu $t2, $t2, $s2 sw $t0, 16($s1) 5

addu $t3, $t4, $s2 sw $t1, 12($s1) 6

nop sw $t2, 8($s1) 7

bne $s1, $zero, Loop sw $t3, 4($s1) 8

• IPC = 14/8 = 1.75– Closer to 2, but at cost of registers and code size

Speculation

• “Guess” what to do with an instruction– Start operation as soon as possible– Check whether guess was right

f l h• If so, complete the operation• If not, roll‐back and do the right thing

• Common to static and dynamic multiple issuey p• Examples

– Speculate on branch outcome• Roll back if path taken is different

– Speculate on load• Roll back if location is updatedp

Compiler/Hardware Speculation

• Compiler can reorder instructionsCompiler can reorder instructions– e.g., move load before branch– Can include “fix‐up” instructions to recover from pincorrect guess

• Hardware can look ahead for instructions to execute– Buffer results until it determines they are actually needed

– Flush buffers on incorrect speculation

Speculation and ExceptionsSpeculation and Exceptions

• What if exception occurs on a speculativelyWhat if exception occurs on a speculatively executed instruction?– e.g., speculative load before null‐pointer checkg , p p

• Static speculation– Can add ISA support for deferring exceptionspp g p

• Dynamic speculation– Can buffer exceptions until instruction completionCan buffer exceptions until instruction completion (which may not occur)

Chapter 4 — The Processor —23

Dynamic Multiple Issue

• “Superscalar” processorsSuperscalar processors

• CPU decides whether to issue 0, 1, 2, … each cyclecycle– Avoiding structural and data hazards

• Avoids the need for compiler scheduling– Though it may still help

– Code semantics ensured by the CPU

Dynamic Pipeline Scheduling

• Allow the CPU to execute instructions out of d id llorder to avoid stalls

– But commit result to registers in order

• Examplelw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3lti $t5 $ 4 20slti $t5, $s4, 20

– Can start sub while addu is waiting for lw

Dynamically Scheduled CPUPreserves dependencies

Hold pending doperands

Results also sent to any waiting reservation stations

Reorders buffer for register writes

Can supply operands for issued instructions

Register Renaming

• Reservation stations and reorder buffer effectively provide register renaming

• On instruction issue to reservation station– If operand is available in register file or reorder buffer

• Copied to reservation stationCopied to reservation station• No longer required in the register; can be overwritten

– If operand is not yet available• It will be provided to the reservation station by a function unit

• Register update may not be required

Why Do Dynamic Scheduling?Why Do Dynamic Scheduling?

• Why not just let the compiler schedule code?Why not just let the compiler schedule code?

• Not all stalls are predicableh i– e.g., cache misses

• Can’t always schedule around branches– Branch outcome is dynamically determined

• Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work?

b h ’d l kThe BIG Picture

• Yes, but not as much as we’d like

• Programs have real dependencies that limit ILP

• Some dependencies are hard to eliminate• Some dependencies are hard to eliminate

– e.g., pointer aliasing

• Some parallelism is hard to exposeSo e pa a e s s a d to e pose

– Limited window size during instruction issue

• Memory delays and limited bandwidth

– Hard to keep pipelines full

• Speculation can help if done well

The Opteron X4 Microarchitecture

72 physical p yregisters

The Opteron X4 Pipeline Flow• For integer operations

FP is 5 stages longer

Up to 106 RISC‐ops in progress

Bottlenecks Complex instructions with long dependencies

Branch mispredictions

M d l Memory access delays

Multithreading• Performing multiple threads of execution in parallel

– Replicate registers, PC, etc.– Fast switching between threads

• Fine‐grain multithreading– Switch threads after each cycle– Interleave instruction execution– If one thread stalls others are executedIf one thread stalls, others are executed

• Coarse‐grain multithreading– Only switch on long stall (e.g., L2‐cache miss)– Simplifies hardware, but doesn’t hide short stalls (eg, data hazards)

Fine‐Grain Multithreading

• Switch CPU Threads with minimal (zero?)Switch CPU Threads with minimal (zero?) overhead

• Multithreading now helps resolve fine‐grainMultithreading now helps resolve fine grain dependencies (e.g. forwarding?)

1 2 3 4 5 6 71 2 3 4 5 6 7

Inst a IF ID EX MEM WB

Inst M IF ID EX MEM WB

Inst b IF ID EX MEM WB

Inst N IF ID EX MEM

Inst c IF ID EX

Inst P IF ID

http://www.cs.manchester.ac.uk/ugt/2010/COMP25212/

Fine Grain Multithreading

• What about cache misses?What about cache misses?

1 2 3 4 5 6 7



4 5 6 7

M MISS

EX

4 5 6 7

M MISS Miss Miss WB

EX MEM WB

4 5 6 7

M MISS Miss Miss WB

EX MEM WB

Inst b IF ID EX MEM WB

Inst N IF ID EX MEM

I t *** IF ID EX

ID

IF

ID (ID) (ID) EX

IF ID EX MEM

ID (ID) (ID) EX

IF ID EX MEM

IF (IF) IDInst c*** IF ID EX

Inst P IF ID

h h h d f l

IF (IF) ID

() IF

• This has the advantage of simplicityhttp://www.cs.manchester.ac.uk/ugt/2010/COMP25212/

Coarse Grain Multithreading

• Alternatively if 1 CPU thread stalled issueAlternatively, if 1 CPU thread stalled, issue every clock from alternate thread

1 2 3 4 5 6 7

Inst a IF ID EX M‐MISS Miss Miss WB

Inst M IF ID EX MEM WBInst M IF ID EX MEM WB

Inst b IF ID (ID) (ID) EX

Inst N IF ID EX MEM

Inst P IF ID EX

Inst Q IF ID


CPU Support for Fine Grain MT

VA MappingA

Address

VA MappingB

AddressTranslation

Data CacheInst Cache

PCA

Fetch L

Fetch LD

ecode L

Fetch LE

xec Lo

Fetch LM

em Lo

Write LoPCB ogic

ogicLogic

ogicogic

ogicogic

ogic

GPRsA

GPRsB http://www.cs.manchester.ac.uk/ugt/2010/COMP25212/

Simultaneous Multithreading

• In multiple‐issue dynamically scheduled p y yprocessor– Schedule instructions from multiple threads– Instructions from independent threads execute when function units are available

– Within threads dependencies handled byWithin threads, dependencies handled by scheduling and register renaming

• Example: Intel Pentium‐4 HT– Two threads: duplicated registers, shared function units and caches

Multithreading Example

Simultaneous MultiThreading

• Let’s look simply at instruction issue:1 2 3 4 5 6 7 8 9 10


Inst b IF ID EX MEM WBInst b IF ID EX MEM WB


Inst N IF ID EX MEM WB

Inst c IF ID EX MEM WB

Inst P IF ID EX MEM WB

Inst Q IF ID EX MEM WBQ

Inst d IF ID EX MEM WB

Inst e IF ID EX MEM WB

I t R IF ID EX MEM WBInst R IF ID EX MEM WB


Simultaneous Multi‐Threading

“permit different threads to occupy the same pipeline stage at the same time”pipeline stage at the same time

• This makes most sense with superscalar issuepInst Cache Data Cache

PCA

Inst Iss

FetcD

ecode+

FetcM

em

Write

PCB

sue Logic

h Logic+R

egisters

h Logicm

Logic

e Logic

s


SMT issues

• Asymmetric pipeline stallAsymmetric pipeline stall– One part of pipeline stalls – we want other pipeline to continue

• Overtaking – want unstalled thread to make progress

• Pipeline overcrowding – may need extra wide pipeline registers (why?)

• Existing implementations (mainly) on O‐o‐O, register renamed architectures


Future of MultithreadingFuture of Multithreading

• Will it survive? In what form?Will it survive? In what form?

• Power considerations simplified microarchitecturesmicroarchitectures– Simpler forms of multithreading

• Tolerating cache‐miss latency– Thread switch may be most effective

• Multiple simple cores might share resources more effectively

Chapter 7 —Multicores, Multiprocessors, and Clusters — 42

Instruction and Data Streams• An alternate classification

Data StreamsSingle Multiple

Instruction Single SISD: SIMD: SSEInstruction Streams

Single SISD:Intel Pentium 4

SIMD: SSE instructions of x86

Multiple MISD:No examples today

MIMD:Intel Xeon e5345No examples today Intel Xeon e5345

SPMD: Single Program Multiple Data A parallel program on a MIMD computer

Conditional code for different processorsp

SIMD

• Operate elementwise on vectors of datap– E.g., MMX and SSE instructions in x86

• Multiple data elements in 128‐bit wide registers

• All processors execute the same instruction at the same time

Each with different data address etc– Each with different data address, etc.• Simplifies synchronization• Reduced instruction control hardware• Reduced instruction control hardware• Works best for highly data‐parallel applicationsapplications

Vector Processors

• Highly pipelined function unitsg y p p• Stream data from/to vector registers to units

– Data collected from memory into registers– Results stored from registers to memory

• Example: Vector extension to MIPS32 64 l i (64 bi l )– 32 × 64‐element registers (64‐bit elements)

– Vector instructions• lv, sv: load/store vector• addv.d: add vectors of double• addvs.d: add scalar to each element of vector of double

• Significantly reduces instruction‐fetch bandwidth• Significantly reduces instruction‐fetch bandwidth

Example: DAXPY (Y = a × X + Y)

• Conventional MIPS codel d $f0 ($ ) l d l l.d $f0,a($sp) ;load scalar aaddiu r4,$s0,#512 ;upper bound of what to load

loop: l.d $f2,0($s0) ;load x(i)mul.d $f2,$f2,$f0 ;a × x(i)l d $f4 0($s1) ;load y(i)l.d $f4,0($s1) ;load y(i)add.d $f4,$f4,$f2 ;a × x(i) + y(i)s.d $f4,0($s1) ;store into y(i)addiu $s0,$s0,#8 ;increment index to xaddiu $s1,$s1,#8 ;increment index to y

b $ $ b dsubu $t0,r4,$s0 ;compute boundbne $t0,$zero,loop ;check if done

• Vector MIPS codel.d $f0,a($sp) ;load scalar al.d $f0,a($sp) ;load scalar alv $v1,0($s0) ;load vector xmulvs.d $v2,$v1,$f0 ;vector-scalar multiplylv $v3,0($s1) ;load vector yaddv.d $v4,$v2,$v3 ;add y to product

$ 4 0($ 1) t th ltsv $v4,0($s1) ;store the result

Vector vs. Scalar

• Vector architectures and compilersp– Simplify data‐parallel programming– Explicit statement of absence of loop‐carried dependencesdependences

• Reduced checking in hardware– Regular access patterns benefit from interleaved and b tburst memory

– Avoid control hazards by avoiding loops• More general than ad‐hoc media extensionsMore general than ad hoc media extensions (such as MMX, SSE)– Better match with compiler technology

rireview and ad dad vanced concepts - ece.ucsb.edustrukov/ece154winter2012/lecture17_18.pdf · 4...

Documents