Structure of Computer Systems
Course 5: The Central Processing Unit (CPU)

Upload: kobe

Post on 06-Jan-2016


Page 1: Structure of Computer Systems

Structure of Computer Systems

Course 5

The Central Processing Unit - CPU

Page 2: Structure of Computer Systems

Solutions for hazard cases
• Scoreboard method
• Tomasulo’s method
• Branch prediction

Page 3: Structure of Computer Systems

Scoreboard method
General considerations (wiki):
• first used in the CDC 6600 computer (1964)
• used for dynamically scheduling a pipeline so that instructions can execute out of order when there are no conflicts and the hardware is available (no structural hazard is present)
• the data dependencies of every instruction are logged
• instructions are released only when the scoreboard determines that there are no conflicts with previously issued and incomplete instructions
• if an instruction is stalled because it is unsafe to continue, the scoreboard monitors the flow of executing instructions until all dependencies have been resolved; only then is the stalled instruction issued

Page 4: Structure of Computer Systems

Scoreboard method
Implementation of the scoreboard method: every instruction goes through 4 stages:
• Issue (ID1)
  • decode the instruction
  • check for structural and WAW hazards
  • stall until structural and WAW hazards are resolved
• Read operands (ID2)
  • wait until there are no RAW hazards
  • then read the operands
• Execution (EX)
  • operate on the operands
  • may take multiple cycles; notify the scoreboard when done
• Write result (WB)
  • finish execution
  • stall if there is a WAR hazard

Page 5: Structure of Computer Systems

Scoreboard method
Scoreboard structure:
• Instruction status
  • indicates which of the 4 steps the instruction is in: ID1, ID2, EX, or WB
• Functional unit status: indicates the state of the functional unit (FU)
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or –)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing the source registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
• Register result status
  • indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
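The three tables and the hazard checks of the four stages can be sketched in a few lines of Python. This is only an illustrative model, not the CDC 6600 logic: the class and method names (`FunctionalUnit`, `Scoreboard`, `can_issue`, and so on) are assumptions made for the example, each unit holds one instruction, and timing is ignored.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FunctionalUnit:
    """One row of the functional unit status table (field names follow the slide)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units that will produce Fj / Fk (None: no pending writer)
    qk: Optional[str] = None
    rj: bool = False           # True while Fj / Fk is ready but not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.units: Dict[str, FunctionalUnit] = {n: FunctionalUnit() for n in unit_names}
        self.result: Dict[str, str] = {}   # register result status: register -> unit

    def can_issue(self, unit, dest):
        """ID1: the unit must be free (structural) and dest must have no pending writer (WAW)."""
        return not self.units[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, src1, src2):
        if not self.can_issue(unit, dest):
            raise RuntimeError("issue must stall")
        qj, qk = self.result.get(src1), self.result.get(src2)
        self.units[unit] = FunctionalUnit(True, op, dest, src1, src2,
                                          qj, qk, qj is None, qk is None)
        self.result[dest] = unit

    def can_read_operands(self, unit):
        """ID2: both sources must be ready (no RAW hazard)."""
        fu = self.units[unit]
        return fu.rj and fu.rk

    def read_operands(self, unit):
        fu = self.units[unit]
        fu.rj = fu.rk = False   # operands have been read

    def can_write_result(self, unit):
        """WB: stall while another unit still has to read the old value of Fi (WAR)."""
        fi = self.units[unit].fi
        return not any(f.busy and ((f.fj == fi and f.rj) or (f.fk == fi and f.rk))
                       for name, f in self.units.items() if name != unit)

    def write_result(self, unit):
        for f in self.units.values():   # wake up instructions waiting for this unit
            if f.qj == unit:
                f.qj, f.rj = None, True
            if f.qk == unit:
                f.qk, f.rk = None, True
        del self.result[self.units[unit].fi]
        self.units[unit] = FunctionalUnit()
```

For example, after issuing MUL F0,F2,F4 and then DIV F10,F0,F6, `can_read_operands("div")` is false (RAW on F0); a later ADD F6,F8,F2 may execute but cannot write its result until the DIV has read F6 (WAR).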

Page 6: Structure of Computer Systems

Scoreboard method
• Speedup from the scoreboard
  • 1.7 for FORTRAN programs
  • 2.5 for hand-coded assembly language programs
• Hardware
  • scoreboard hardware is approximately the same size as one FPU
  • main cost: buses (4 times the normal amount)
  • could be more severe for modern processors

Page 7: Structure of Computer Systems

Scoreboard and Tomasulo’s algorithm
• Issues with the scoreboard method:
  • it does not solve structural hazards
  • no forwarding logic
  • it introduces stall phases when a required functional unit is busy; the stall affects the following instructions too
• Tomasulo’s algorithm
  • avoids structural hazards and also resolves WAR and WAW dependencies with register renaming and a common data bus (CDB)
  • first used in the IBM System/360 Model 91 (1967)
  • Register renaming – keeps multiple physical copies of the same architectural (program-visible) register
    • avoids dependencies that are caused by the limited number of registers and not by a real data dependency
  • Common data bus – a result is put on a common bus as soon as it is available, avoiding an unnecessary stall until the data is written into the destination register
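Register renaming can be illustrated with a short sketch: every write gets a fresh physical name, so only real (RAW) dependencies remain. The function name `rename`, the `P0, P1, ...` naming scheme, the sample program and the tuple format `(op, dest, src1, src2)` are assumptions made for this example, and the supply of physical registers is treated as unlimited.

```python
from itertools import count


def rename(program, arch_regs):
    """Give every destination a fresh physical register; sources follow the latest mapping."""
    mapping = {r: r for r in arch_regs}        # architectural name -> current physical name
    fresh = (f"P{i}" for i in count())         # hypothetical unlimited pool of physical registers
    renamed = []
    for op, dest, src1, src2 in program:
        s1, s2 = mapping[src1], mapping[src2]  # read sources first: true dependencies are kept
        mapping[dest] = next(fresh)            # a new copy for every write removes WAR and WAW
        renamed.append((op, mapping[dest], s1, s2))
    return renamed


program = [
    ("DIV", "F0", "F2", "F4"),
    ("ADD", "F6", "F0", "F8"),    # RAW on F0: a real dependency
    ("SUB", "F8", "F10", "F14"),  # WAR on F8 with the ADD
    ("MUL", "F6", "F10", "F8"),   # WAW on F6 with the ADD, RAW on F8 with the SUB
]
renamed = rename(program, [f"F{i}" for i in range(0, 16, 2)])
for instr in renamed:
    print(instr)
```

After renaming, the SUB writes P2 instead of F8 and the MUL writes P3 instead of F6, so neither has to wait for the ADD; the MUL still reads P2, which preserves its real dependency on the SUB.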

Page 8: Structure of Computer Systems

Tomasulo’s algorithm
Instruction stages:
• Issue – an instruction is issued if the required functional unit is available; otherwise it is stalled, and the next instruction is tested and issued if possible. If the real value of an operand is not yet available, a virtual value (a reference to the unit that will produce it) stands in until the real value becomes available
  • registers are renamed to avoid WAR and WAW hazards
• Execute – the instruction is carried out once the necessary operands are available or present on the CDB; special care must be given to Load and Store instructions, which require access to memory
• Write result – the result of the executed instruction is written back into the destination register, and Store operations write to memory (see the commit stage later)

Page 9: Structure of Computer Systems

Tomasulo’s algorithm
Reservation stations
• buffers that fetch and store instruction operands as they become available
• a reservation station holds the data and the result of an instruction
• it points to registers (if the data is available) or to other reservation stations that will contain the necessary data as soon as it becomes available (before it is written back into the register)
• the reservation station stores the result of an instruction’s execution and releases the functional unit as soon as the instruction is executed; the result becomes available to other reservation stations; in this way WAR stalls are avoided and RAW stalls are shortened
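A minimal sketch of how a reservation station waits for an operand and captures it from the common data bus. The field names `vj`, `vk`, `qj`, `qk` follow common textbook usage and, like the class name, the `broadcast` helper and the station names `Mult1` and `Add1`, are assumptions for illustration rather than part of the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # names of the stations that will produce missing operands
    qk: Optional[str] = None

    def ready(self):
        """The instruction may execute when no operand is still pending."""
        return self.qj is None and self.qk is None

    def snoop(self, tag, value):
        """Capture a result broadcast on the common data bus if this station waits for it."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None


def broadcast(stations, tag, value):
    """Model of the CDB: every station sees the (tag, value) pair at the same time."""
    for rs in stations:
        rs.snoop(tag, value)


mult1 = ReservationStation("Mult1", "MUL", vj=2.0, vk=3.0)
add1 = ReservationStation("Add1", "ADD", vj=1.0, qk="Mult1")   # waits for Mult1's result
ready_before = add1.ready()
broadcast([mult1, add1], "Mult1", mult1.vj * mult1.vk)
ready_after = add1.ready()
```

The ADD receives the product directly from the bus, before (and independently of) the write into the destination register.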

Page 10: Structure of Computer Systems

Tomasulo’s algorithm
• To avoid structural hazards, redundant functional units are used, such as multiple integer ALUs, floating-point ALUs or address-computing ALUs
• Example: the P6 architecture (Pentium II and III) contains 7 ALUs: 2 IEU, 1 FEU, 1 MMX, 3 AGU
• In front of every functional unit, a buffer or a list may store the requests (instructions) destined for that unit; e.g. the NetBurst architecture (Pentium 4) has a list of requests for every reservation station
• In this way every functional unit is scheduled in advance and can work almost without stalling

Page 11: Structure of Computer Systems

Tomasulo’s algorithm
Commit – an extra stage in the instruction execution sequence, besides issue, execute and write result
• used to further improve Tomasulo’s solution
• in the Write result stage the result is written into the re-order buffer (ROB) and not directly into the destination register or memory; all data in the ROB may be used by other instructions; in this way some stall periods may be avoided
• Re-order buffer (ROB) – used to commit instructions that were executed out of order
  • contains data about the instructions in their original order; some entries may be filled in ahead of time as a result of out-of-order execution
  • the instructions are committed in their original order
  • the ROB is useful for roll-back procedures in case of a branch misprediction or an exception
• in the commit stage, data from the re-order buffer is copied into the real registers or into memory in the order specified by the program, not in the order of execution
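In-order commit from a re-order buffer can be sketched as a queue whose head must be complete before anything retires. The class `ReorderBuffer`, its method names and the dictionary-based entries are assumptions for this example; stores, exceptions and a size limit are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()            # program order, oldest instruction first

    def allocate(self, dest):
        """Issue: reserve an entry in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Write result: may happen out of order; the register file is not touched yet."""
        entry["value"], entry["done"] = value, True

    def commit(self, registers):
        """Commit: retire completed entries from the head only, i.e. in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            registers[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed

    def flush(self):
        """Roll-back after a misprediction or exception: discard uncommitted work."""
        self.entries.clear()


registers = {}
rob = ReorderBuffer()
first = rob.allocate("R1")
second = rob.allocate("R2")
rob.write_result(second, 20)              # the younger instruction finishes first
early = rob.commit(registers)             # nothing commits: the head is not done
rob.write_result(first, 10)
late = rob.commit(registers)              # both commit, in program order
```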

Page 12: Structure of Computer Systems

Branch prediction
• A method for solving control hazards
• Problem: a branch in the program disturbs pipeline execution; if the branch is taken, the pipeline must be flushed and refilled with instructions from the target address
• Principle: try to guess the direction of a branch instruction (mainly a conditional branch) and load the pipeline with instructions from the predicted path
• Methods:
  • Static prediction – based on the nature of the branch instruction
  • Dynamic prediction – takes into consideration the history of the branch instruction (whether it was taken or not in the past may predict its future behavior)

Page 13: Structure of Computer Systems

Branch prediction
Static prediction – based on the nature of the branch instruction
• Cases:
  • Procedure calls - are taken
  • Unconditional jumps - are taken
  • Backward branches - are taken (considered as loops in the program)
  • Forward branches - are not taken (considered exceptions from normal execution)
• Advantages:
  • it is simple and fast
  • works well for programs having many loops
• Drawback:
  • does not work well if there are a lot of conditional jumps
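The static rules above fit in one small function. The function name and the `kind` labels (`"call"`, `"jump"`, `"cond"`) are assumptions for this sketch; a real decoder would derive them from the opcode.

```python
def static_predict(kind, pc, target):
    """Return True if the branch is predicted taken, using only the instruction itself."""
    if kind in ("call", "jump"):   # procedure calls and unconditional jumps
        return True
    return target < pc             # backward branch (loop): taken; forward: not taken


loop_back = static_predict("cond", pc=0x120, target=0x100)   # backward: predicted taken
skip_ahead = static_predict("cond", pc=0x120, target=0x140)  # forward: predicted not taken
```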

Page 14: Structure of Computer Systems

Branch prediction
Dynamic prediction – takes into consideration the history of the branch instructions
• Principle: use previous executions of a conditional jump to better predict its next executions
• Methods:
  • Next line predictor – stores the pointer to the next instruction (or group of instructions, if multiple instructions are fetched at the same time); the method stores the decision as well as the target (pointer) of the branch
  • Saturating counters – store the decisions made before in 1 or 2 bits; a 2-bit counter has 4 states:
    • Strongly not taken (00) – “not taken” is predicted
    • Weakly not taken (01) – “not taken” is predicted
    • Weakly taken (10) – “taken” is predicted
    • Strongly taken (11) – “taken” is predicted
    every occurrence of the branch updates the state of the counter
[Figure: state diagram of the 2-bit saturating counter with states 00, 01, 10 and 11; transitions labelled “Taken” and “Not taken”]
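A saturating counter is easy to model: it moves one step toward 11 on a taken branch and one step toward 00 otherwise, and the upper half of its range predicts “taken”. The class name, the helper `count_mispredictions` and the loop pattern below are assumptions for illustration.

```python
class SaturatingCounter:
    def __init__(self, bits=2, state=0):
        self.max = (1 << bits) - 1          # 3 for a 2-bit counter
        self.state = state                  # 0 = strongly not taken ... max = strongly taken

    def predict(self):
        return self.state > self.max // 2   # upper half of the states predicts taken

    def update(self, taken):
        if taken:
            self.state = min(self.state + 1, self.max)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken 9 times, then not taken once; the loop is run twice.
loop = ([True] * 9 + [False]) * 2
misses = count_mispredictions(SaturatingCounter(bits=2), loop)
```

Starting from state 00 the counter mispredicts twice while warming up and once at each loop exit, so `misses` is 4; a 1-bit counter would additionally mispredict the first iteration after every exit.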

Page 15: Structure of Computer Systems

Branch prediction
Dynamic prediction – methods (cont.)
• store the decision and the target address for every executed conditional jump in a BHT (Branch History Table) and a BTB (Branch Target Buffer); this information helps predict the next executions of the same instructions with approx. 90% probability
• the BHT and BTB are indexed with the least significant bits of the instruction address (PC); the number of bits used determines the size of the tables
• Two-level adaptive predictor
  • necessary for alternating and nested conditional jumps
  • idea: memorize jump sequence patterns; the prediction is based on a pattern of taken (1) and not taken (0) branches
  • a two-level adaptive predictor with an n-bit history can predict any repetitive sequence with any period if all n-bit sub-sequences are different
[Figure: an n-bit branch history (e.g. 0 1 0 0) indexes the pattern history table; the selected 2-bit counter gives the prediction]
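The two-level scheme can be sketched as an n-bit shift register that indexes a table of 2-bit counters. The class `TwoLevelPredictor`, the helper `mispredictions`, the initial counter value (weakly not taken) and the test pattern are assumptions for this example.

```python
class TwoLevelPredictor:
    def __init__(self, n):
        self.n = n
        self.history = 0                   # last n outcomes, 1 = taken (first level)
        self.pht = [1] * (1 << n)          # pattern history table of 2-bit counters (second level)

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        c = self.pht[self.history]
        self.pht[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.n) - 1)


def mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Repeating pattern taken, taken, not taken: the 2-bit sub-sequences 11, 10 and 01
# are each followed by a different, fixed outcome, so a 2-bit history is enough.
pattern = [True, True, False] * 20
predictor = TwoLevelPredictor(n=2)
warmup_misses = mispredictions(predictor, pattern)
steady_misses = mispredictions(predictor, pattern)
```

Once every history pattern has been seen, the predictor makes no further mistakes on this sequence, whereas a single 2-bit counter would mispredict every “not taken” outcome.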

Page 16: Structure of Computer Systems

Branch prediction
Dynamic prediction – methods (cont.)
• Local branch prediction
  • a separate history buffer for each conditional jump instruction
  • it may use a two-level branch predictor with a common or an individual pattern history table
  • Pentium II and III have local branch predictors with a local 4-bit history and a local pattern history table with 16 entries for each conditional jump
• Global branch predictor
  • keeps a shared (global) history of all conditional jumps
  • any correlation between two branches is used for prediction
  • poor results if branches are not correlated
  • usually not as good as local predictors
  • variants:
    • “gshare” predictor
    • “gselect” predictor

Page 17: Structure of Computer Systems

Branch prediction
Dynamic prediction – methods (cont.)
• Global branch predictor – possible implementation: a two-level adaptive predictor with a globally shared history buffer and pattern history table
  • “gshare” predictor – the index into the pattern history table is the XOR of the global history buffer and the jump address
  • “gselect” predictor – the index is obtained by concatenating the history buffer and the jump’s address
  • Pentium M, Core 2 and AMD processors use global branch prediction
• Combinations of local and global predictors:
  • Alloyed branch prediction – concatenates the local and global branch history buffers, sometimes also with the address of the jump
  • Agree predictor – makes a XOR between the local and global predictor (used in Pentium 4)
  • Hybrid predictor – a combination of predictors; the result is selected through voting or from the predictor with the best hit rate
  • Loop predictor – detects whether a conditional jump is a loop: it is taken N-1 times and not taken once; it may use a counter for the loop; it may be part of a hybrid predictor
  • Prediction of indirect jumps – when the target of a jump has multiple choices, store the previous targets and more bits in the prediction history buffer for such a jump
  • Prediction of function returns – stores a copy of the stack that contains the return addresses of the executed functions
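The difference between gshare and gselect is only in how the table index is formed. A sketch under simple assumptions: the function names and bit widths are chosen for the example, and the raw low-order address bits are used (real designs often drop the lowest bits, which are constant for aligned instructions).

```python
def gshare_index(pc, history, bits):
    """XOR the jump address with the global history, keep the low `bits` bits."""
    mask = (1 << bits) - 1
    return (pc ^ history) & mask


def gselect_index(pc, history, pc_bits, hist_bits):
    """Concatenate the low pc_bits of the address with the low hist_bits of the history."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part


pc, history = 0b10110110, 0b11001010
shared = gshare_index(pc, history, bits=8)                    # 0b01111100
selected = gselect_index(pc, history, pc_bits=4, hist_bits=4)  # 0b01101010
```

With the same 8-bit index, gshare mixes 8 address bits with 8 history bits, while gselect can only afford 4 of each.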

Page 18: Structure of Computer Systems

Branch prediction
Correlated prediction
• example of a combination of local and global prediction
• how it works:
  • every entry in the history table has 4 predictors (e.g. 2-bit counters)
  • the 2-bit global history buffer selects between the 4 predictors
  • the state of the selected predictor is updated according to the decision made
  • the global branch history gives the context, and the local predictors store the behavior of the different jump instructions
• (2,2) predictor – 2-bit counters and a 2-bit history buffer
[Figure: the branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit recent global branch history (e.g. 01 = not taken then taken) selects one predictor in the row, which gives the prediction]
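A (2,2) predictor can be sketched as a table with one row per branch address and one 2-bit counter per global-history value. The class name, the helper `simulate_correlated`, the initial counter value and the two-branch scenario are assumptions made for illustration.

```python
class CorrelatingPredictor:
    def __init__(self, addr_bits=4, hist_bits=2):
        self.addr_mask = (1 << addr_bits) - 1
        self.hist_mask = (1 << hist_bits) - 1
        self.history = 0                                   # recent global branch outcomes
        # One row per branch address; one 2-bit counter per history value in each row.
        self.table = [[1] * (1 << hist_bits) for _ in range(1 << addr_bits)]

    def predict(self, pc):
        return self.table[pc & self.addr_mask][self.history] >= 2

    def update(self, pc, taken):
        row = self.table[pc & self.addr_mask]
        c = row[self.history]
        row[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask

    def storage_bits(self):
        return len(self.table) * len(self.table[0]) * 2


def simulate_correlated(predictor, a_outcomes, pc_a=0x4, pc_b=0x8):
    """Branch B always goes the same way as branch A, which executes just before it."""
    misses_b = 0
    for a in a_outcomes:
        a = bool(a)
        predictor.update(pc_a, a)
        if predictor.predict(pc_b) != a:
            misses_b += 1
        predictor.update(pc_b, a)
    return misses_b


outcomes_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1] * 10
misses_b = simulate_correlated(CorrelatingPredictor(), outcomes_a)
```

Because the most recent history bit is A's outcome, B is mispredicted only while its counters warm up (twice in this run), however irregular A is. With 1024 rows, `storage_bits()` gives 1024 × 4 × 2 = 8192 bits, the same as a 4096-entry table of 2-bit counters.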

Page 19: Structure of Computer Systems

Misprediction statistics for SPEC tests
Predictors compared:
1. 4096 entries, 2-bit BHT
2. Unlimited entries, 2-bit BHT
3. 1024 entries, local and global prediction, (2,2) BHT
Predictors 1 and 3 require the same amount of memory: 8 Kbits

[Figure: bar chart of the frequency of mispredictions (vertical axis, 0% to 20%) per benchmark (legible labels: eqntott, gcc) for three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2). The bar values range from 0% to 18%.]

Page 20: Structure of Computer Systems

Branch prediction
Tournament predictor
• a 2-bit local predictor fails on important branches; by adding global information, performance may be improved
• tournament predictors use two predictors, one based on global information and one based on local information, and combine them with a selector
• the aim is to select the right predictor for the right branch (or the right context of a branch)
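A tournament predictor can be sketched with three small tables of 2-bit counters: local, global and a per-branch selector that is trained only when the two components disagree. The names `TournamentPredictor` and `bump`, the table sizes and the test pattern are assumptions for this example, not a description of a particular processor.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter one step up or down."""
    return min(counter + 1, 3) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self, index_bits=4):
        size = 1 << index_bits
        self.mask = size - 1
        self.local = [1] * size       # 2-bit counters indexed by branch address
        self.global_ = [1] * size     # 2-bit counters indexed by global history
        self.selector = [1] * size    # per branch: >= 2 means "trust the global predictor"
        self.history = 0

    def predict(self, pc):
        i = pc & self.mask
        if self.selector[i] >= 2:
            return self.global_[self.history] >= 2
        return self.local[i] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.global_[self.history] >= 2) == taken
        if local_ok != global_ok:     # train the selector only when the components disagree
            self.selector[i] = bump(self.selector[i], global_ok)
        self.local[i] = bump(self.local[i], taken)
        self.global_[self.history] = bump(self.global_[self.history], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# An alternating branch defeats a lone 2-bit counter but not the global component.
tournament = TournamentPredictor()
warmup_misses = run(tournament, 0x3, [True, False] * 5)
steady_misses = run(tournament, 0x3, [True, False] * 15)
```

On this pattern the local counter keeps flipping between its two middle states and is always wrong, so the selector moves to the global side and the steady-state misprediction count drops to zero.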

Page 21: Structure of Computer Systems

Misprediction statistics
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three predictors: local 2-bit counters, correlating (2,2) scheme, and tournament]

Page 22: Structure of Computer Systems

Branch prediction
Branch Target Buffer (BTB): contains the targets of taken branches
• an associative-access memory
• contains:
  • jump instruction address
  • target address
  • prediction state

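[Figure: flowchart of branch-target buffer use across the IF, ID and EX stages; each BTB entry holds the jump address, the target and the prediction, and a hit supplies the new address for the PC]
• IF: send the PC to memory and to the branch-target buffer. Entry found in the branch-target buffer?
  • Yes: send out the predicted PC
  • No: fetch continues normally
• ID, entry not found: is the instruction a taken branch?
  • No: normal instruction execution
  • Yes: (EX) enter the branch instruction address and the next PC into the branch-target buffer
• EX, entry found: is the branch taken?
  • Yes: branch correctly predicted; continue execution with no stalls
  • No: mispredicted branch; kill the fetched instruction, restart the fetch at the other target, and delete the entry from the target buffer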
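The BTB behavior described above can be sketched with a dictionary standing in for the associative memory. The class and method names, the fixed 4-byte instruction size and the omission of the prediction-state field are simplifying assumptions for this example.

```python
class BranchTargetBuffer:
    def __init__(self, instr_size=4):
        self.entries = {}                 # branch instruction address -> predicted target
        self.instr_size = instr_size      # assumed fixed instruction length

    def fetch(self, pc):
        """IF: on a hit send out the predicted PC, otherwise the next sequential address."""
        return self.entries.get(pc, pc + self.instr_size)

    def resolve(self, pc, taken, target):
        """Branch outcome known: update the buffer; return True if the fetch was mispredicted."""
        predicted = self.fetch(pc)
        actual = target if taken else pc + self.instr_size
        if taken:
            self.entries[pc] = target     # enter the branch address and its next PC
        else:
            self.entries.pop(pc, None)    # a not-taken branch loses its entry
        return predicted != actual


btb = BranchTargetBuffer()
first_fetch = btb.fetch(0x100)                      # miss: falls through to 0x104
first_miss = btb.resolve(0x100, True, 0x200)        # taken but not predicted: mispredicted
second_fetch = btb.fetch(0x100)                     # hit: predicted PC is 0x200
second_miss = btb.resolve(0x100, True, 0x200)       # correctly predicted, no stall
third_miss = btb.resolve(0x100, False, 0x200)       # predicted taken, was not: entry deleted
```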