Embedded Computer Architectures
Post on 05-Jan-2016
Embedded Computer Architectures
Hennessy & Patterson, Chapter 3
Instruction-Level Parallelism and Its Dynamic Exploitation
Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
André Kokkeler (Zilverling 4096), kokkeler@utwente.nl
Contents
• Introduction
• Hazards <= dependencies
• Instruction-Level Parallelism; Tomasulo's approach
• Branch prediction
Dependencies
• True data dependency
• Name dependency
—Antidependency
—Output dependency
• Control dependency
Data Dependency
[Figure: chain of instructions i, i+1, i+2; each instruction's result feeds a data dependence into the next]
Two instructions are data dependent => risk of a RAW hazard
Name Dependency
• Antidependence
[Figure: instruction i reads a register or memory location that a later instruction j writes]
Two instructions are antidependent => risk of a WAR hazard
• Output dependence
[Figure: instruction i writes a register or memory location that a later instruction j also writes]
Two instructions are output dependent => risk of a WAW hazard
Control Dependency
• Branch condition determines whether instruction i is executed => i is control dependent on the branch
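The three data-hazard classes above can be detected mechanically from the register sets of two instructions. A minimal sketch (the function and the dict layout are illustrative, not from the slides):

```python
# Classify the potential hazards when 'second' follows 'first' in program
# order, given each instruction's sets of read and written registers.
def classify_hazards(first, second):
    hazards = []
    if first["writes"] & second["reads"]:
        hazards.append("RAW")   # true data dependence
    if first["reads"] & second["writes"]:
        hazards.append("WAR")   # antidependence
    if first["writes"] & second["writes"]:
        hazards.append("WAW")   # output dependence
    return hazards

# ADD F0,F2,F4 followed by SUB F6,F0,F8: SUB reads the F0 that ADD writes.
print(classify_hazards({"reads": {"F2", "F4"}, "writes": {"F0"}},
                       {"reads": {"F0", "F8"}, "writes": {"F6"}}))  # ['RAW']
```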
Instruction Level Parallelism
• Pipelining = ILP
• Other approach: dynamic scheduling => out-of-order execution
• Instruction Decode stage split into:
—Issue (decode, check for structural hazards)
—Read Operands
Instruction Level Parallelism
• Scoreboard:
—Sufficient resources
—No data dependencies
• Tomasulo's approach:
—Minimize RAW hazards
—Register renaming to minimize WAW and WAR hazards
[Figure: pipeline front end split into Issue and Read Operands, with a Reservation Station that parks instructions while they wait for operands]
Tomasulo’s approach
• Register Renaming
[Figure: timeline of four instructions — 1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0 — with an arrow from the start of each instruction to the moment it uses register F0]
Tomasulo’s approach
• Register Renaming
[Figure: the same timeline; with out-of-order execution the arrows can cross]
Problems arise if the arrows cross.
Tomasulo’s approach
• Register Renaming
[Figure: the same timeline with crossed arrows]
Instructions 2, 3, … will be stalled. Note that instructions 2 and 3 are stalled only because instruction 1 is not ready; if not for instruction 1, they could be executed earlier.
Tomasulo’s approach
• Register Renaming
[Figure: register F0 renamed per instruction, giving separate copies "Instr 1. Register F0" and "Instr 3. Register F0"]
How is it arranged that the value is written into Instr 3's Register F0 and not into Instr 1's Register F0?
Tomasulo’s approach
• Register Renaming
[Figure: each renamed copy of F0 carries an F0Source field; Instr 3's F0Source holds the tag "Instr. 2", Instr 1's F0Source holds the tag "Instr. k"]
The result of Instr 2 is labelled with the tag 'Instr. 2'. The hardware checks whether an instruction is waiting for the result (by checking the F0Source fields of instructions) and places the result in the correct place.
Tomasulo’s approach
• Register Renaming
[Figure: reservation-station entry for the read operation of Instr 3, with fields F0Data and F0Source; F0Source holds the tag "Instr. 2"]
Tomasulo’s approach
• Register Renaming
[Figure: reservation-station entries for both read operations, each with its own F0Data and F0Source fields]
Tomasulo’s approach
• Register Renaming
[Figure: the complete Reservation Station after Issue, holding entries for the two read and two write operations; the source fields are filled during Issue, the data fields are filled during execution]
Tomasulo’s approach
• Effects
—Register renaming: prevents WAW and WAR hazards
—Execution starts when operands are available (data fields are filled): prevents RAW hazards
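The renaming mechanism sketched above can be illustrated in a few lines: each issued instruction that writes F0 becomes the new "source" of F0, so a later read waits on that instruction's tag instead of on the register itself. This is a sketch under that assumption; the function and tuple layout are illustrative, not Tomasulo's full algorithm:

```python
# Issue a straight-line program, renaming registers via producer tags.
def issue(program):
    reg_source = {}   # register -> tag of the instruction producing it
    stations = []
    for tag, (op, reg) in enumerate(program, start=1):
        if op == "read":
            # operand comes from the last writer's tag, or the register file
            stations.append((tag, op, reg_source.get(reg, "regfile")))
        else:  # write: this instruction is now the source of the register
            stations.append((tag, op, None))
            reg_source[reg] = tag
    return stations

prog = [("read", "F0"), ("write", "F0"), ("read", "F0"), ("write", "F0")]
for entry in issue(prog):
    print(entry)
# instruction 1 reads the register file; instruction 3 waits on instruction 2's tag
```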
Tomasulo’s approach
• Issue in more detail (issue is done sequentially)
[Figure: the four instructions (1. Read F0, 2. Write F0, 3. Read F0, 4. Write F0) being issued into the Reservation Station. Entry format: label | operation | data | source. The entry for read1 has source '?????': at issue time it is not yet known where its value of F0 comes from]
This is the only information you have: during issue, you have to keep track of which instruction changed F0 last!
Tomasulo’s approach
• Issue in more detail (continued)
[Figure: the Reservation Station after issuing all four instructions (read1, write1, read2, write2); read2's source field holds write1. A register-status table, updated at each issue, records which instruction will write F0 last: initially unknown (????), then write1, then write2]
Keeping track of register status during issue is done for every register.
Tomasulo’s approach
• Definitions for the MIPS
—For each reservation station:
Name | Busy | Operation | Vj | Vk | Qj | Qk | A
—Name = label of the reservation station
—Busy = in execution or not
—Operation = the instruction's operation
—Vj, Vk = operand values
—Qj, Qk = operand sources
—A = memory address (Load, Store)
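The fields above map naturally onto a record type. A minimal sketch (the class and its `ready` helper are illustrative; the field meanings are from the slide):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                        # label of the station
    busy: bool = False               # in execution or not
    operation: Optional[str] = None  # the instruction's operation
    vj: Optional[float] = None       # first operand value, once known
    vk: Optional[float] = None       # second operand value, once known
    qj: Optional[str] = None         # station producing the first operand
    qk: Optional[str] = None         # station producing the second operand
    a: Optional[int] = None          # memory address (Load, Store)

    def ready(self):
        # execution may start once both operand values are present
        return self.busy and self.qj is None and self.qk is None

rs = ReservationStation("Add1", busy=True, operation="ADD", vj=1.0, qk="Mult1")
print(rs.ready())  # False: still waiting on Mult1 for the second operand
```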
Tomasulo’s approach; hardware view
[Figure: hardware view. From the instruction queue, the Issue hardware performs register renaming and fills the Reservation Stations. The "Execution Control Hardware" determines of which instructions the operands and corresponding execution units are available, and transports the operands to the execution units. Results, together with the identification of the instruction producing each result, travel over the Common Data Bus; the "Reservation Fill Hardware" puts the data in the correct place in the reservation stations.]
Branch prediction
• Data hazards => Tomasulo's approach
• Branch (control) hazards => branch prediction
—Goal: resolve the outcome of a branch early => prevent stalls because of control hazards
Branch prediction; 1 history bit
• Example:

Outerloop: …
           R = 10
Innerloop: …
           R = R - 1
           BNZ R, Innerloop
           …
           Branch Outerloop

History bit: was the branch taken previously or not:
- predict taken: fetch from 'Innerloop'
- predict not taken: fetch the next instruction
Actual outcome of the branch:
- taken: set the history bit to 'taken'
- not taken: set the history bit to 'not taken'
In this situation: correct prediction in 80% of branch evaluations (per pass of the outer loop, the inner branch mispredicts twice: once at loop exit, and once on re-entry because the history bit still says 'not taken').
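The 80% figure can be checked with a short simulation of the nested loop above, where the inner branch is taken 9 times and then not taken once per pass of the outer loop (a sketch; the function name is illustrative):

```python
# 1-bit predictor: the history bit simply remembers the last outcome.
def one_bit_accuracy(passes):
    outcomes = ([True] * 9 + [False]) * passes   # True = taken
    state = True                                  # history bit, start 'taken'
    correct = 0
    for actual in outcomes:
        correct += (state == actual)
        state = actual                            # remember last outcome
    return correct / len(outcomes)

print(one_bit_accuracy(1000))
# two mispredictions per pass (loop exit, then the stale bit on re-entry),
# so accuracy approaches 80%
```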
Branch prediction; 2 history bits
• Example: the same nested loop (R = 10; BNZ R, Innerloop; Branch Outerloop).
[Figure: 2-bit predictor state machine with four states — two "predict taken" states and two "predict not taken" states; a 'taken' outcome moves the state toward predict taken, a 'not taken' outcome moves it toward predict not taken]
In this application: correct prediction in 90% of branch evaluations (only the loop-exit evaluation is mispredicted; a single 'not taken' no longer flips the prediction).
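The state machine above is the classic 2-bit saturating counter. Simulating it on the same loop shows the 90% figure (a sketch; the counter encoding — 0-1 predict not taken, 2-3 predict taken — is one common convention):

```python
# 2-bit saturating counter: each outcome moves the counter one step
# toward the observed direction; the top half of the range predicts taken.
def two_bit_accuracy(passes):
    outcomes = ([True] * 9 + [False]) * passes   # True = taken
    state = 3                                     # start at 'strongly taken'
    correct = 0
    for actual in outcomes:
        correct += ((state >= 2) == actual)
        state = min(3, state + 1) if actual else max(0, state - 1)
    return correct / len(outcomes)

print(two_bit_accuracy(1000))
# only the loop-exit branch is mispredicted: accuracy approaches 90%
```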
Branch prediction; Correlating branch predictors
If (aa == 2) aa = 0;
If (bb == 2) bb = 0;
If (aa != bb) …

The results of the first two branches are used in the prediction of the last branch.
Example: suppose aa == 2 and bb == 2; then the condition for the last 'if' is always false => if the previous two branches are not taken, the last branch is taken.
Branch prediction; Correlating branch predictors
• Mechanism: suppose the result of the 3 previous branches is used to influence the decision.
• 8 possible sequences:

br-3 br-2 br-1 | br
NT   NT   NT   | T
NT   NT   T    | NT
....           |
T    T    T    | T

• Depending on the outcome of the branch under consideration, the prediction is changed:
—1-bit history: (3,1) predictor
• For the sequence (NT NT NT) the prediction is that the branch will be taken => the processor fetches from the branch destination.
Branch prediction; Correlating branch predictors
• 2-bit history: (3,2) predictor. The prediction for each sequence is represented by 2 bits: 2 combinations indicate 'predict taken', 2 combinations indicate 'predict not taken', updated by means of a state machine.
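An (m, n) correlating predictor can be sketched as a global history register of the last m outcomes selecting one of 2**m n-bit saturating counters (a sketch, with n = 2 as in the (3,2) predictor above; the class and its interface are illustrative):

```python
# (m, n) correlating predictor: global branch history indexes a table
# of n-bit saturating counters; the selected counter gives the prediction.
class CorrelatingPredictor:
    def __init__(self, m=3, n=2):
        self.m, self.n = m, n
        self.history = 0                          # last m outcomes as bits
        self.max_count = 2 ** n - 1
        self.counters = [self.max_count] * (2 ** m)

    def predict(self):
        # upper half of the counter range means 'predict taken'
        return self.counters[self.history] >= 2 ** (self.n - 1)

    def update(self, taken):
        c = self.counters[self.history]
        self.counters[self.history] = (min(self.max_count, c + 1) if taken
                                       else max(0, c - 1))
        self.history = ((self.history << 1) | taken) & (2 ** self.m - 1)

p = CorrelatingPredictor()
for taken in [True, True, False] * 20:   # repeating T T NT pattern
    p.update(taken)
# each history context now has its own trained counter, so the
# period-3 pattern is predicted correctly from here on
```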
Branch Target Buffer
• Solutions:
—Delayed branch
—Branch Target Buffer

BNEZ R1, Loop   IF ID EX MEM WB
next instr         IF ID EX MEM WB
branch target         IF ID EX MEM WB

Even with a good prediction, we don't know where to branch to until the branch has been decoded, and by then we have already fetched the next instruction.
Branch Target Buffer
[Figure: Branch Target Buffer. The Program Counter addresses both the memory (instruction cache) and a buffer that stores addresses of branch instructions together with their corresponding branch targets. On a hit (qualified by information from the instruction-decode hardware), the stored branch target is selected into the PC, so after the IF stage the branch address is already in the PC.]
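The buffer's behaviour can be sketched as a lookup table keyed by branch address (illustrative; a real BTB is a set-associative hardware structure, and the +4 fall-through assumes fixed 4-byte instructions):

```python
# Branch Target Buffer as a dictionary: branch address -> predicted target.
class BranchTargetBuffer:
    def __init__(self):
        self.table = {}

    def record(self, branch_pc, target):
        self.table[branch_pc] = target

    def next_pc(self, pc):
        # hit: fetch from the predicted target; miss: fall through
        if pc in self.table:
            return self.table[pc], True
        return pc + 4, False

btb = BranchTargetBuffer()
btb.record(0x100, 0x40)               # a branch at 0x100 targeting 0x40
target, hit = btb.next_pc(0x100)      # hit: fetch continues at 0x40
target, hit = btb.next_pc(0x104)      # miss: fall through to 0x108
```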
Branch Folding
[Figure: Branch Folding. Same structure as the Branch Target Buffer, but the buffer stores the instruction at the branch target instead of the target address, so on a hit the target instruction is delivered directly.]
Unconditional branches: this effectively removes the branch instruction (a penalty of -1).
Return Address Predictors
• Indirect branches: the branch address is known only at run time.
• 80% of the time these are return instructions.
• Small fast stack: a procedure call pushes its return address, a procedure return (RET) pops it as the predicted target.
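The small prediction stack can be sketched as follows (illustrative; the fixed depth and overflow policy of dropping the oldest entry are assumptions, real designs vary):

```python
# Return-address predictor stack: calls push, returns pop the prediction.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:   # small fixed-size stack:
            self.stack.pop(0)               # the oldest entry is lost
        self.stack.append(return_address)

    def on_return(self):
        # predicted target of the indirect return branch
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x104)            # call at 0x100 will return to 0x104
ras.on_call(0x208)            # nested call will return to 0x208
print(hex(ras.on_return()))   # prints 0x208: innermost return first
```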
Multiple Issue Processors
Goal: issue multiple instructions in a clock cycle.
• Superscalar: issue a varying number of instructions per clock
—Statically scheduled
—Dynamically scheduled
• VLIW: issue a fixed number of instructions per clock
—Statically scheduled
Multiple Issue Processors
• Example (one integer and one FP instruction issued per clock; each pair starts one cycle after the previous pair):

Instruction type   Pipe stages
Integer            IF ID EX MEM WB
FP                 IF ID EX EX  EX  WB
Integer               IF ID EX  MEM WB
FP                    IF ID EX  EX  EX  WB
Integer                  IF ID  EX  MEM WB
FP                       IF ID  EX  EX  EX  WB
Integer                     IF  ID  EX  MEM WB
FP                          IF  ID  EX  EX  EX  …
Hardware Based Speculation
• Multiple-issue processors => nearly 1 branch every clock cycle
• Dynamic scheduling + branch prediction: speculative fetch + issue
• Dynamic scheduling + branch speculation: speculative fetch + issue + execution
• KEY: do not perform updates that cannot be undone until you are sure the corresponding operation really should be executed.
Hardware Based Speculation
• Tomasulo:
[Figure: instruction stream around a branch predicted not taken. Operations before the branch (operation i) are finished and have updated the register file; operations after the branch (operation k) are only issued. Operation k has its operand available, but its execution is postponed until it is clear whether the branch is taken.]
Hardware Based Speculation
• Tomasulo (continued):
[Figure: the same stream once the branch has finished. Depending on the outcome of the branch: flush the reservation stations (misprediction) or start execution of the postponed operations.]
Hardware Based Speculation
• Speculation:
[Figure: with speculation, operation k executes as soon as its operand is available; its result waits in the Reorder Buffer. Results of operations before the branch are committed, i.e. moved from the reorder buffer to the register file. Commit happens sequentially, in program order.]
Hardware Based Speculation
• Speculation (continued):
[Figure: successive frames showing the commit point advancing sequentially through the Reorder Buffer; an operation's result is committed to the register file only after all earlier operations have committed.]
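In-order commit from the reorder buffer can be sketched as a queue in program order (illustrative; a real ROB also tracks exception and speculation state, and a misprediction would flush the uncommitted tail):

```python
from collections import deque

# Results may be produced out of order, but they reach the register
# file strictly in program order.
class ReorderBuffer:
    def __init__(self):
        self.entries = deque()       # entries kept in program order
        self.regfile = {}

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        # commit sequentially: stop at the first unfinished operation
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.regfile[e["dest"]] = e["value"]

rob = ReorderBuffer()
a = rob.issue("F0")
b = rob.issue("F2")
b["value"], b["done"] = 2.0, True    # the later operation finishes first
rob.commit()
print(rob.regfile)                   # {}: F2 must wait for F0 to commit
a["value"], a["done"] = 1.0, True
rob.commit()
print(rob.regfile)                   # {'F0': 1.0, 'F2': 2.0}
```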
Hardware Based Speculation
• Some aspects
—Instructions that cause a lot of work are wasteful to execute in vain => restrict the allowed actions in speculative mode
—The ILP of a program is limited
—Realistic branch predictors are easier to implement => less efficient than ideal ones
Pentium Pro Implementation
• Pentium Family
Processor         Year  Clock rate (MHz)  L1 cache (instr, data)  L2 cache
Pentium Pro       1995  100-200           8 KB, 8 KB              256 KB - 1024 KB
Pentium II        1998  233-450           16 KB, 16 KB            256 KB - 512 KB
Pentium II Xeon   1999  400-450           16 KB, 16 KB            512 KB - 2 MB
Celeron           1999  500-900           16 KB, 16 KB            128 KB
Pentium III       1999  450-1100          16 KB, 16 KB            256 KB - 512 KB
Pentium III Xeon  2000  700-900           16 KB, 16 KB            1 MB - 2 MB
Pentium Pro Implementation
• I486: CISC => problems with pipelining
• Two observations:
—CISC instructions are translated into sequences of microinstructions
—Microinstructions are of equal length
• Solution: pipeline the microinstructions
Pentium Pro Implementation
[Figure: micro-program control flow. The Fetch cycle routine ends with a jump to the Indirect or Execute routine; the Indirect cycle routine jumps to Execute; the Interrupt cycle routine jumps to Fetch; the Execute cycle begins with a jump to the op-code routine (e.g. the AND routine or the ADD routine), each of which ends with a jump to Fetch or Interrupt.]
Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program.
Pentium Pro Implementation
• All RISC features are implemented on the execution of microinstructions instead of machine instructions
—Microinstruction-level pipeline with dynamically scheduled microoperations:
– Fetch machine instruction (3 stages)
– Decode machine instruction into microinstructions (2 stages)
– Issue microinstructions (2 stages; register renaming and reorder-buffer allocation performed here)
– Execute microinstructions (1 stage; floating-point units pipelined; execution takes between 1 and 32 cycles)
– Write back (3 stages)
– Commit (3 stages)
—Superscalar: can issue up to 3 microoperations per clock cycle
—Reservation stations (20 of them) and multiple functional units (5 of them)
—Reorder buffer (40 entries) and speculation used
Pentium Pro Implementation
• Execution units have the following latencies (in clock cycles):
—Integer ALU: 1
—Integer load: 3
—Integer multiply: 4
—FP add: 3
—FP multiply: 5 (partially pipelined: multiplies can start every other cycle)
—FP divide: 32 (not pipelined)
Thread-Level Parallelism
• ILP: parallelism at the instruction level
• Thread-level parallelism: at a higher level
—Server applications
—Database queries
• Thread: has all information (instructions, data, PC, register state, etc.) needed to execute
—On a separate processor
—As a process on a single processor
Thread-Level Parallelism
• Potentially high efficiency
• Desktop applications:
—Costly to switch to applications reprogrammed for thread-level parallelism
—Thread-level parallelism is often hard to find
=> ILP continues to be the focus for desktop-oriented processors (for embedded processors, the situation is different)