Embedded Computer Architectures
Post on 05-Jan-2016
Embedded Computer Architectures
Hennessy & Patterson, Chapter 3
Instruction-Level Parallelism and Its Dynamic Exploitation
Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
André Kokkeler (Zilverling 4096), kokkeler@utwente.nl
Contents
• Introduction
• Hazards <= dependencies
• Instruction-Level Parallelism; Tomasulo's approach
• Branch prediction
Dependencies
• True data dependency
• Name dependency
—Antidependency
—Output dependency
• Control dependency
Data Dependency
[Figure: chain of instructions i, i+1, i+2; each instruction's result feeds a data dependence into the next]
Two instructions are data dependent => risk of a RAW hazard
Name Dependency
• Antidependence
[Figure: instruction i reads a register or memory location that a later instruction j writes]
Two instructions are antidependent => risk of a WAR hazard
• Output dependence
[Figure: instruction i writes a register or memory location that a later instruction j also writes]
Two instructions are output dependent => risk of a WAW hazard
Control Dependency
• Branch condition determines whether instruction i is executed => i is control dependent on the branch
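The three data-hazard classes above can be detected mechanically from the register sets of two instructions. A minimal sketch (the function and the dict layout are illustrative, not from the slides):

```python
# Classify the potential hazards when 'second' follows 'first' in program
# order, given each instruction's sets of read and written registers.
def classify_hazards(first, second):
    hazards = []
    if first["writes"] & second["reads"]:
        hazards.append("RAW")   # true data dependence
    if first["reads"] & second["writes"]:
        hazards.append("WAR")   # antidependence
    if first["writes"] & second["writes"]:
        hazards.append("WAW")   # output dependence
    return hazards

# ADD F0,F2,F4 followed by SUB F6,F0,F8: SUB reads the F0 that ADD writes.
print(classify_hazards({"reads": {"F2", "F4"}, "writes": {"F0"}},
                       {"reads": {"F0", "F8"}, "writes": {"F6"}}))  # ['RAW']
```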
Instruction Level Parallelism
• Pipelining = ILP
• Other approach: dynamic scheduling => out-of-order execution
• Instruction Decode stage split into:
—Issue (decode, check for structural hazards)
—Read Operands
Instruction Level Parallelism
• Scoreboard:
—Sufficient resources
—No data dependencies
• Tomasulo's approach:
—Minimize RAW hazards
—Register renaming to minimize WAW and WAR hazards
[Figure: pipeline front end split into Issue and Read Operands, with a Reservation Station that parks instructions while they wait for operands]
Tomasulo’s approach
• Register Renaming
[Figure: timeline of four instructions — 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0 — with an arrow from the start of each instruction to the moment it uses register F0]
Tomasulo’s approach
• Register Renaming
[Figure: the same timeline; with out-of-order execution the arrows can cross]
Problems arise if the arrows cross.
Tomasulo’s approach
• Register Renaming
[Figure: the same timeline with crossed arrows]
Instructions 2, 3, … will be stalled. Note that instructions 2 and 3 are stalled only because instruction 1 is not ready; if not for instruction 1, they could be executed earlier.
Tomasulo’s approach
• Register Renaming
[Figure: register F0 renamed per instruction, giving separate copies "Instr 1. Register F0" and "Instr 3. Register F0"]
How is it arranged that the value is written into Instr 3's Register F0 and not into Instr 1's Register F0?
Tomasulo’s approach
• Register Renaming
[Figure: each renamed copy of F0 carries an F0Source field; Instr 3's F0Source holds the tag "Instr. 2", Instr 1's F0Source holds the tag "Instr. k"]
The result of Instr 2 is labelled with the tag 'Instr. 2'. The hardware checks whether an instruction is waiting for the result (by checking the F0Source fields of instructions) and places the result in the correct place.
Tomasulo’s approach
• Register Renaming
[Figure: reservation-station entry for the read operation of Instr 3, with fields F0Data and F0Source; F0Source holds the tag "Instr. 2"]
Tomasulo’s approach
• Register Renaming
[Figure: reservation-station entries for both read operations, each with its own F0Data and F0Source fields]
Tomasulo’s approach
• Register Renaming
[Figure: the complete Reservation Station after Issue, holding entries for the two read and two write operations; the source fields are filled during Issue, the data fields are filled during execution]
Tomasulo’s approach
• Effects
—Register renaming: prevents WAW and WAR hazards
—Execution starts when operands are available (data fields are filled): prevents RAW hazards
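The renaming mechanism sketched above can be illustrated in a few lines: each issued instruction that writes F0 becomes the new "source" of F0, so a later read waits on that instruction's tag instead of on the register itself. This is a sketch under that assumption; the function and tuple layout are illustrative, not Tomasulo's full algorithm:

```python
# Issue a straight-line program, renaming registers via producer tags.
def issue(program):
    reg_source = {}   # register -> tag of the instruction producing it
    stations = []
    for tag, (op, reg) in enumerate(program, start=1):
        if op == "read":
            # operand comes from the last writer's tag, or the register file
            stations.append((tag, op, reg_source.get(reg, "regfile")))
        else:  # write: this instruction is now the source of the register
            stations.append((tag, op, None))
            reg_source[reg] = tag
    return stations

prog = [("read", "F0"), ("write", "F0"), ("read", "F0"), ("write", "F0")]
for entry in issue(prog):
    print(entry)
# instruction 1 reads the register file; instruction 3 waits on instruction 2's tag
```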
Tomasulo’s approach
• Issue in more detail (issue is done sequentially)
[Figure: the four instructions (1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0) being issued into the Reservation Station. Entry format: label | operation | data | source. The entry for read1 has source '?????': at issue time it is not yet known where its value of F0 comes from]
This is the only information you have: during issue, you have to keep track of which instruction changed F0 last!
Tomasulo’s approach
• Issue in more detail (continued)
[Figure: the Reservation Station after issuing all four instructions (read1, write1, read2, write2); read2's source field holds write1. A register-status table, updated at each issue, records which instruction will write F0 last: initially unknown (????), then write1, then write2]
Keeping track of register status during issue is done for every register.
Tomasulo’s approach
• Definitions for the MIPS
—For each reservation station:
Name | Busy | Operation | Vj | Vk | Qj | Qk | A
—Name = label of the reservation station
—Busy = in execution or not
—Operation = the instruction's operation
—Vj, Vk = operand values
—Qj, Qk = operand sources
—A = memory address (Load, Store)
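The fields above map naturally onto a record type. A minimal sketch (the class and its `ready` helper are illustrative; the field meanings are from the slide):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                        # label of the station
    busy: bool = False               # in execution or not
    operation: Optional[str] = None  # the instruction's operation
    vj: Optional[float] = None       # first operand value, once known
    vk: Optional[float] = None       # second operand value, once known
    qj: Optional[str] = None         # station producing the first operand
    qk: Optional[str] = None         # station producing the second operand
    a: Optional[int] = None          # memory address (Load, Store)

    def ready(self):
        # execution may start once both operand values are present
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation("Add1", busy=True, operation="ADD", vj=1.0, qk="Mult1")
print(rs.ready())  # False: still waiting on Mult1 for the second operand
```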
Tomasulo’s approach; hardware view
[Figure: hardware view. From the instruction queue, the Issue hardware performs register renaming and fills the Reservation Stations. The "Execution Control Hardware" determines of which instructions the operands and corresponding execution units are available, and transports the operands to the execution units. Results, together with the identification of the instruction producing each result, travel over the Common Data Bus; the "Reservation Fill Hardware" puts the data in the correct place in the reservation stations.]
Branch prediction
• Data hazards => Tomasulo's approach
• Branch (control) hazards => branch prediction
—Goal: resolve the outcome of a branch early => prevent stalls because of control hazards
Branch prediction; 1 history bit
• Example:

Outerloop: …
           R = 10
Innerloop: …
           R = R - 1
           BNZ R, Innerloop
           …
           Branch Outerloop

History bit: was the branch taken previously or not:
- predict taken: fetch from 'Innerloop'
- predict not taken: fetch the next instruction
Actual outcome of the branch:
- taken: set the history bit to 'taken'
- not taken: set the history bit to 'not taken'
In this situation: correct prediction in 80% of branch evaluations (per pass of the outer loop, the inner branch mispredicts twice: once at loop exit, and once on re-entry because the history bit still says 'not taken').
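The 80% figure can be checked with a short simulation of the nested loop above, where the inner branch is taken 9 times and then not taken once per pass of the outer loop (a sketch; the function name is illustrative):

```python
# 1-bit predictor: the history bit simply remembers the last outcome.
def one_bit_accuracy(passes):
    outcomes = ([True] * 9 + [False]) * passes   # True = taken
    state = True                                  # history bit, start 'taken'
    correct = 0
    for actual in outcomes:
        correct += (state == actual)
        state = actual                            # remember last outcome
    return correct / len(outcomes)

print(one_bit_accuracy(1000))
# two mispredictions per pass (loop exit, then the stale bit on re-entry),
# so accuracy approaches 80%
```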
Branch prediction; 2 history bits
• Example: the same nested loop (R = 10; BNZ R, Innerloop; Branch Outerloop).
[Figure: 2-bit predictor state machine with four states — two "predict taken" states and two "predict not taken" states; a 'taken' outcome moves the state toward predict taken, a 'not taken' outcome moves it toward predict not taken]
In this application: correct prediction in 90% of branch evaluations (only the loop-exit evaluation is mispredicted; a single 'not taken' no longer flips the prediction).
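The state machine above is the classic 2-bit saturating counter. Simulating it on the same loop shows the 90% figure (a sketch; the counter encoding — 0-1 predict not taken, 2-3 predict taken — is one common convention):

```python
# 2-bit saturating counter: each outcome moves the counter one step
# toward the observed direction; the top half of the range predicts taken.
def two_bit_accuracy(passes):
    outcomes = ([True] * 9 + [False]) * passes   # True = taken
    state = 3                                     # start at 'strongly taken'
    correct = 0
    for actual in outcomes:
        correct += ((state >= 2) == actual)
        state = min(3, state + 1) if actual else max(0, state - 1)
    return correct / len(outcomes)

print(two_bit_accuracy(1000))
# only the loop-exit branch is mispredicted: accuracy approaches 90%
```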
Branch prediction; Correlating branch predictors
If (aa == 2) aa = 0;
If (bb == 2) bb = 0;
If (aa != bb) …

The results of the first two branches are used in the prediction of the last branch.
Example: suppose aa == 2 and bb == 2; then the condition for the last 'if' is always false => if the previous two branches are not taken, the last branch is taken.
Branch prediction; Correlating branch predictors
• Mechanism: suppose the result of the 3 previous branches is used to influence the decision.
• 8 possible sequences:

br-3 br-2 br-1 | br
NT   NT   NT   | T
NT   NT   T    | NT
....           |
T    T    T    | T

• Depending on the outcome of the branch under consideration, the prediction is changed:
—1-bit history: (3,1) predictor
• For the sequence (NT NT NT) the prediction is that the branch will be taken => the processor fetches from the branch destination.
Branch prediction; Correlating branch predictors
• 2-bit history: (3,2) predictor. The prediction for each sequence is represented by 2 bits: 2 combinations indicate 'predict taken', 2 combinations indicate 'predict not taken', updated by means of a state machine.
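An (m, n) correlating predictor can be sketched as a global history register of the last m outcomes selecting one of 2**m n-bit saturating counters (a sketch, with n = 2 as in the (3,2) predictor above; the class and its interface are illustrative):

```python
# (m, n) correlating predictor: global branch history indexes a table
# of n-bit saturating counters; the selected counter gives the prediction.
class CorrelatingPredictor:
    def __init__(self, m=3, n=2):
        self.m, self.n = m, n
        self.history = 0                          # last m outcomes as bits
        self.max_count = 2 ** n - 1
        self.counters = [self.max_count] * (2 ** m)

    def predict(self):
        # upper half of the counter range means 'predict taken'
        return self.counters[self.history] >= 2 ** (self.n - 1)

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = (min(self.max_count, c + 1) if taken
                                       else max(0, c - 1))
        self.history = ((self.history << 1) | taken) & (2 ** self.m - 1)

p = CorrelatingPredictor()
for taken in [True, True, False] * 20:   # repeating T T NT pattern
    p.update(taken)
# each history context now has its own trained counter, so the
# period-3 pattern is predicted correctly from here on
```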
Branch Target Buffer
• Solutions:
—Delayed branch
—Branch Target Buffer

BNEZ R1, Loop   IF ID EX MEM WB
next instr         IF ID EX MEM WB
branch target         IF ID EX MEM WB

Even with a good prediction, we don't know where to branch to until the branch has been decoded, and by then we have already fetched the next instruction.
Branch Target Buffer
[Figure: Branch Target Buffer. The Program Counter addresses both the memory (instruction cache) and a buffer that stores addresses of branch instructions together with their corresponding branch targets. On a hit (qualified by information from the instruction-decode hardware), the stored branch target is selected into the PC, so after the IF stage the branch address is already in the PC.]
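The buffer's behaviour can be sketched as a lookup table keyed by branch address (illustrative; a real BTB is a set-associative hardware structure, and the +4 fall-through assumes fixed 4-byte instructions):

```python
# Branch Target Buffer as a dictionary: branch address -> predicted target.
class BranchTargetBuffer:
    def __init__(self):
        self.table = {}

    def record(self, branch_pc, target):
        self.table[branch_pc] = target

    def next_pc(self, pc):
        # hit: fetch from the predicted target; miss: fall through
        if pc in self.table:
            return self.table[pc], True
        return pc + 4, False

btb = BranchTargetBuffer()
btb.record(0x100, 0x40)               # a branch at 0x100 targeting 0x40
target, hit = btb.next_pc(0x100)      # hit: fetch continues at 0x40
target, hit = btb.next_pc(0x104)      # miss: fall through to 0x108
```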
Branch Folding
[Figure: Branch Folding. Same structure as the Branch Target Buffer, but the buffer stores the instruction at the branch target instead of the target address, so on a hit the target instruction is delivered directly.]
Unconditional branches: this effectively removes the branch instruction (a penalty of -1).
Return Address Predictors
• Indirect branches: the branch address is known only at run time.
• 80% of the time these are return instructions.
• Small fast stack: a procedure call pushes its return address, a procedure return (RET) pops it as the predicted target.
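The small prediction stack can be sketched as follows (illustrative; the fixed depth and overflow policy of dropping the oldest entry are assumptions, real designs vary):

```python
# Return-address predictor stack: calls push, returns pop the prediction.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:   # small fixed-size stack:
            self.stack.pop(0)               # the oldest entry is lost
        self.stack.append(return_address)

    def on_return(self):
        # predicted target of the indirect return branch
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x104)            # call at 0x100 will return to 0x104
ras.on_call(0x208)            # nested call will return to 0x208
print(hex(ras.on_return()))   # prints 0x208: innermost return first
```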
Multiple Issue Processors
Goal: issue multiple instructions in a clock cycle.
• Superscalar: issue a varying number of instructions per clock
—Statically scheduled
—Dynamically scheduled
• VLIW: issue a fixed number of instructions per clock
—Statically scheduled
Multiple Issue Processors
• Example (one integer and one FP instruction issued per clock; each pair starts one cycle after the previous pair):

Instruction type   Pipe stages
Integer            IF ID EX MEM WB
FP                 IF ID EX EX  EX  WB
Integer               IF ID EX  MEM WB
FP                    IF ID EX  EX  EX  WB
Integer                  IF ID  EX  MEM WB
FP                       IF ID  EX  EX  EX  WB
Integer                     IF  ID  EX  MEM WB
FP                          IF  ID  EX  EX  EX  …
Hardware Based Speculation
• Multiple-issue processors => nearly 1 branch every clock cycle
• Dynamic scheduling + branch prediction: speculative fetch + issue
• Dynamic scheduling + branch speculation: speculative fetch + issue + execution
• KEY: do not perform updates that cannot be undone until you are sure the corresponding operation really should be executed.
Hardware Based Speculation
• Tomasulo:
[Figure: instruction stream around a branch predicted not taken. Operations before the branch (operation i) are finished and have updated the register file; operations after the branch (operation k) are only issued. Operation k has its operand available, but its execution is postponed until it is clear whether the branch is taken.]
Hardware Based Speculation
• Tomasulo (continued):
[Figure: the same stream once the branch has finished. Depending on the outcome of the branch: flush the reservation stations (misprediction) or start execution of the postponed operations.]
Hardware Based Speculation
• Speculation:
[Figure: with speculation, operation k executes as soon as its operand is available; its result waits in the Reorder Buffer. Results of operations before the branch are committed, i.e. moved from the reorder buffer to the register file. Commit happens sequentially, in program order.]
Hardware Based Speculation
• Speculation (continued):
[Figure: successive frames showing the commit point advancing sequentially through the Reorder Buffer; an operation's result is committed to the register file only after all earlier operations have committed.]
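In-order commit from the reorder buffer can be sketched as a queue in program order (illustrative; a real ROB also tracks exception and speculation state, and a misprediction would flush the uncommitted tail):

```python
from collections import deque

# Results may be produced out of order, but they reach the register
# file strictly in program order.
class ReorderBuffer:
    def __init__(self):
        self.entries = deque()       # entries kept in program order
        self.regfile = {}

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        # commit sequentially: stop at the first unfinished operation
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.regfile[e["dest"]] = e["value"]

rob = ReorderBuffer()
a = rob.issue("F0")
b = rob.issue("F2")
b["value"], b["done"] = 2.0, True    # the later operation finishes first
rob.commit()
print(rob.regfile)                   # {}: F2 must wait for F0 to commit
a["value"], a["done"] = 1.0, True
rob.commit()
print(rob.regfile)                   # {'F0': 1.0, 'F2': 2.0}
```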
Hardware Based Speculation
• Some aspects
—Instructions that cause a lot of work are wasteful to execute in vain => restrict the allowed actions in speculative mode
—The ILP of a program is limited
—Realistic branch predictors are easier to implement => less efficient than ideal ones
Pentium Pro Implementation
• Pentium Family
Processor         Year  Clock rate (MHz)  L1 cache (instr, data)  L2 cache
Pentium Pro       1995  100-200           8 KB, 8 KB              256 KB - 1024 KB
Pentium II        1998  233-450           16 KB, 16 KB            256 KB - 512 KB
Pentium II Xeon   1999  400-450           16 KB, 16 KB            512 KB - 2 MB
Celeron           1999  500-900           16 KB, 16 KB            128 KB
Pentium III       1999  450-1100          16 KB, 16 KB            256 KB - 512 KB
Pentium III Xeon  2000  700-900           16 KB, 16 KB            1 MB - 2 MB
Pentium Pro Implementation
• I486: CISC => problems with pipelining
• Two observations:
—CISC instructions are translated into sequences of microinstructions
—Microinstructions are of equal length
• Solution: pipeline the microinstructions
Pentium Pro Implementation
[Figure: micro-program control flow. The Fetch cycle routine ends with a jump to the Indirect or Execute routine; the Indirect cycle routine jumps to Execute; the Interrupt cycle routine jumps to Fetch; the Execute cycle begins with a jump to the op-code routine (e.g. the AND routine or the ADD routine), each of which ends with a jump to Fetch or Interrupt.]
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program.
Pentium Pro Implementation
• All RISC features are implemented on the execution of microinstructions instead of machine instructions
—Microinstruction-level pipeline with dynamically scheduled microoperations:
– Fetch machine instruction (3 stages)
– Decode machine instruction into microinstructions (2 stages)
– Issue microinstructions (2 stages; register renaming and reorder-buffer allocation performed here)
– Execute microinstructions (1 stage; floating-point units pipelined; execution takes between 1 and 32 cycles)
– Write back (3 stages)
– Commit (3 stages)
—Superscalar: can issue up to 3 microoperations per clock cycle
—Reservation stations (20 of them) and multiple functional units (5 of them)
—Reorder buffer (40 entries) and speculation used
Pentium Pro Implementation
• Execution units have the following latencies (in clock cycles):
—Integer ALU: 1
—Integer load: 3
—Integer multiply: 4
—FP add: 3
—FP multiply: 5 (partially pipelined: multiplies can start every other cycle)
—FP divide: 32 (not pipelined)
Thread-Level Parallelism
• ILP: parallelism at the instruction level
• Thread-level parallelism: at a higher level
—Server applications
—Database queries
• Thread: has all information (instructions, data, PC, register state, etc.) needed to execute
—On a separate processor
—As a process on a single processor
Thread-Level Parallelism
• Potentially high efficiency
• Desktop applications:
—Costly to switch to applications reprogrammed for thread-level parallelism
—Thread-level parallelism is often hard to find
=> ILP continues to be the focus for desktop-oriented processors (for embedded processors, the situation is different)