
1

Chapter 3: ILP and Its Dynamic Exploitation

• Review simple static pipeline

• Dynamic scheduling, out-of-order execution

• Dynamic branch prediction, instruction issue unit

• Multiple issue (superscalar)

• Hardware-based speculation

• ILP limitations

• Intel Core i7 and ARM Cortex-A8

2

Multiple Issue

• Goal: Enable multiple instructions to be issued in a single clock cycle. (Can achieve CPI < 1, i.e., IPC > 1!)

• Two basic “flavors” of multiple issue:

– Superscalar:

• Maintains the ordinary serial instruction-stream format.

• Instructions per clock (IPC) varies widely.

• Instruction issue can be dynamic or static (in-order).

– VLIW (Very Long Instruction Word), a.k.a. EPIC (Explicitly Parallel Instruction Computing); see the sketch after this list:

• New format: parallel instructions grouped into blocks.

• Instructions per block are fixed (by the block size).

• Mostly statically scheduled by the compiler.
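A minimal sketch of the VLIW idea, under illustrative assumptions (a 3-slot format with latencies ignored for clarity; this is not a real ISA encoding):

; Block format: [ integer op | memory op | FP op ], issued as one unit
[ ADD R1,R2,R3 | LD F0,0(R4) | MULD F2,F4,F6 ]   ; three independent ops fill one block
[ NOP          | NOP         | ADDD F8,F2,F0 ]   ; the dependent op waits for a later block; unfilled slots hold NOPs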

3

Superscalar Pipeline

• Typical superscalar: 1-8 insts. issued per cycle

– Actual IPC depends on dependences and hazards

• Simple example: 2 insts./cycle, static scheduling (see the schedule after the code example on the next slide)

– Instructions statically pre-paired to ease decoding:

• 1st: one load/store/branch/integer-ALU op.

• 2nd: one floating-point op.

4

Code Example to be Used

• C code fragment:

double *p;
do { *(p--) += c; } while (p);

• MIPS code fragment:

Loop: LD   F0,0(R1)   ; F0 = *p
      ADDD F4,F0,F2   ; F4 = F0 + c
      SD   0(R1),F4   ; *p = F4
      ADDI R1,R1,#-8  ; p--
      BNEZ R1,Loop    ; until p == 0
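As a hedged illustration of the static pairing from the previous slide, here is one possible dual-issue schedule of this loop body. The latencies (a one-cycle load-use delay, a three-cycle FP add) and the compiler’s hoisting of the ADDI (which changes the SD offset to 8) are illustrative assumptions, not from the original slides:

;        slot 1 (int/mem/branch)   slot 2 (FP)
; cyc 1: LD   F0,0(R1)             ---              ; ADDD must wait for F0
; cyc 2: ADDI R1,R1,#-8            ---              ; F0 ready after the load delay
; cyc 3: ---                       ADDD F4,F0,F2    ; FP add occupies cycles 3-5
; cyc 4: ---                       ---
; cyc 5: ---                       ---
; cyc 6: SD   8(R1),F4             ---              ; offset 8: R1 was decremented early
; cyc 7: BNEZ R1,Loop              ---              ; only one slot-1 op per cycle

Even with two issue slots, dependences leave most slots empty here (5 instructions in 7 cycles), which is exactly why superscalar IPC “varies widely.”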

5

Multiple Issue + Dynamic Sched.

• Why? The usual advantages of dynamic scheduling…

– Compiler-independent, data-dependent scheduling

• Multiple-issue Tomasulo:

– Issue 1 integer + 1 FP instruction to reservation stations each cycle

– Problem (again): issuing multiple instructions simultaneously

• If the instructions are dependent, hazard detection is complex.

– Two solutions to this problem (see the sketch below):

• Enter instructions into the tables in only half a clock cycle.

• Build hardware that issues two instructions in parallel; it must be careful to detect dependences within the pair.

– Memory dependences: load/store dependences are tracked through the load/store queue.
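A minimal C sketch of the second solution above: issue hardware that accepts a pair of instructions per cycle and checks for a dependence within the pair. All structures and names are illustrative assumptions, not the textbook’s hardware, and the int-vs-FP slot restriction is ignored for brevity.

#include <stdbool.h>

enum { NREGS = 32, NRS = 8, FREE = -1 };

typedef struct { int dst, src1, src2; } Inst;  /* register numbers */
typedef struct { bool busy; int qj, qk; } RS;  /* qj/qk: producing RS, or FREE if the value is ready */

static RS  rs[NRS];
static int reg_status[NREGS];                  /* RS that will write each register, or FREE */

static void init(void) {
    for (int i = 0; i < NREGS; i++) reg_status[i] = FREE;  /* no pending writers */
}

static int alloc_rs(void) {
    for (int i = 0; i < NRS; i++)
        if (!rs[i].busy) return i;
    return FREE;                               /* structural hazard: no free station */
}

/* Try to issue a pair in one cycle; returns how many issued (0, 1, or 2). */
int issue_pair(Inst i1, Inst i2) {
    int r1 = alloc_rs();
    if (r1 == FREE) return 0;                  /* stall: issue stays in order */
    rs[r1].busy = true;
    rs[r1].qj = reg_status[i1.src1];           /* ordinary Tomasulo renaming for inst1 */
    rs[r1].qk = reg_status[i1.src2];

    int r2 = alloc_rs();
    if (r2 == FREE) { reg_status[i1.dst] = r1; return 1; }
    rs[r2].busy = true;
    /* The extra hazard check: does inst2 read the register inst1 writes?
     * If so, tag the operand with inst1's station, not the register file. */
    rs[r2].qj = (i2.src1 == i1.dst) ? r1 : reg_status[i2.src1];
    rs[r2].qk = (i2.src2 == i1.dst) ? r1 : reg_status[i2.src2];

    /* Update writers in program order, so a WAW pair leaves the younger
     * instruction (inst2) as the visible producer. */
    reg_status[i1.dst] = r1;
    reg_status[i2.dst] = r2;
    return 2;
}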

6

Example of Dual-Issue Tomasulo

• The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline (no speculation)

7

Example of Dual-Issue Tomasulo

• Resource usage table for the previous figure

8

Example of Dual-Issue Tomasulo

• The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline with an additional ALU and CDB

9

Example of Dual-Issue Tomasulo

• Resource usage table for the previous figure

10

Hardware-Based Speculation

• Dynamic scheduling + speculative execution:

– Dynamic branch prediction chooses which instructions will be pre-executed.

– Speculation executes instructions conditionally early (before branch conditions are resolved).

– Dynamic scheduling handles the scheduling of the different dynamic sequences of basic blocks encountered.

• Dataflow execution: execute instructions as soon as their operands are available. Results may be canceled if the prediction is incorrect!
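A tiny illustration using the earlier loop code (the prediction outcome is an assumption for the example):

; With BNEZ predicted taken, fetch continues at Loop, so the next
; iteration's load can issue and execute before the branch resolves.
      BNEZ R1,Loop      ; predicted taken; actually resolves several cycles later
Loop: LD   F0,0(R1)     ; executed speculatively; squashed if the prediction was wrong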

11

Advantages of HW-based Spec.

• Allows more overlap of instruction execution.

• Dynamic speculation can disambiguate memory references, so a load can be moved before a store (if the locations addressed are different); see the example after this list.

• Speculation works better as more accurate dynamic branch predictions become available.

• Precise exception handling is maintained even for speculated instructions.

• No extra bookkeeping code (speculation bits, register-renaming code) in the program.

• Program code is independent of the implementation.
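A small example of the memory-disambiguation point (registers and offsets are illustrative):

SD  0(R4),F2    ; earlier store
LD  F0,0(R5)    ; may start before the SD completes, but only once the
                ; hardware can check that the two addresses differ (R5 != R4)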

12

Implementing HW-based Spec.

• Separate the execution of speculative instructions (including the dataflow between them) from the committing of results permanently to registers/memory (done only when the speculation proves correct).

• A new structure called the reorder buffer holds the results of instructions that have executed speculatively (or non-speculatively) but cannot yet be committed (commit is in order).

– The reorder buffer provides non-programmer-visible temporary storage, like the reservation stations in Tomasulo’s algorithm.
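A minimal C sketch of such a reorder buffer, with illustrative type and field names (not the textbook’s): a circular FIFO that instructions enter at the tail in program order and commit from at the head, also in program order.

#include <stdbool.h>
#include <stdint.h>

enum { ROB_SIZE = 16 };

typedef enum { TYPE_REG, TYPE_STORE, TYPE_BRANCH } RobType;

typedef struct {
    bool     busy;          /* entry is in use */
    bool     ready;         /* result has been written back */
    RobType  type;          /* where the result goes at commit */
    int      dest;          /* destination register, or the store address */
    uint64_t value;         /* speculative result, held here until commit */
    bool     mispredicted;  /* branches only: the prediction was wrong */
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    int head, tail, count;  /* head = oldest entry, the only one that may commit */
} ROB;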

13

Steps of Execution in HWBS

• Issue (or dispatch):

– Get the next fetched instruction (in order).

– Issue only if a reservation station and a reorder-buffer entry are both free.

• Execute:

– Monitor the CDB for operands until they are ready, then execute.

• Write result:

– Write to the CDB, the reorder buffer, and the reservation stations.

• Commit:

– When an instruction is first in the reorder buffer (and wasn’t mispredicted), commit its value to the register/memory.

• Committing a mispredicted branch flushes the reorder buffer.
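Continuing the reorder-buffer sketch above, a hedged version of the commit step (write_register, write_memory, and restart_fetch are hypothetical helpers, declared but not defined here; they reuse the ROB/RobEntry types from the earlier sketch):

/* Hypothetical helpers: real definitions are outside this sketch. */
void write_register(int reg, uint64_t value);
void write_memory(uint64_t addr, uint64_t value);
void restart_fetch(uint64_t correct_target);

/* Commit at most one instruction per call, always from the ROB head,
 * so results reach registers/memory strictly in program order. */
void commit_step(ROB *rob) {
    if (rob->count == 0) return;
    RobEntry *e = &rob->entry[rob->head];
    if (!e->ready) return;                       /* head not finished: everyone waits */

    if (e->type == TYPE_BRANCH && e->mispredicted) {
        rob->head = rob->tail = rob->count = 0;  /* flush all younger, speculative work */
        restart_fetch(e->value);                 /* value holds the correct target here */
        return;
    }
    if (e->type == TYPE_REG)   write_register(e->dest, e->value);
    if (e->type == TYPE_STORE) write_memory((uint64_t)e->dest, e->value);

    e->busy = false;                             /* retire the entry in order */
    rob->head = (rob->head + 1) % ROB_SIZE;
    rob->count--;
}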


14

HWBS Implementation Sketch


15

A Simple Example (Fig 3.12)

(Figure annotations: one entry is ready to commit; completed entries behind the MUL cannot commit yet.)


16

Loop Example with Reorder Buffer

(Figure annotation: some instructions have completed but are not yet able to commit.)


17


18

Comparison with/without Speculation


19

Comparison with/without Speculation

20

ILP Limitations

• An ideal processor has:

– Infinite registers for renaming

– Perfect branch and jump prediction

– Perfect memory disambiguation

21

Increasing the Window Size and Maximum Issue Count

• How close can a real dynamically scheduled, speculative processor come to the ideal one? It would have to:

– Look arbitrarily far ahead, predicting all branches

– Rename all register uses to avoid WAR/WAW hazards (see the example below)

– Determine data dependences

– Determine memory dependences

– Have enough parallel functional units
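A small example of the renaming bullet above (F14 is an arbitrary fresh register):

; WAR hazard: the ADDD must not overwrite F4 before the SD has read it.
SD   0(R1),F4     ; reads the old F4
ADDD F4,F0,F2     ; write-after-read on F4 limits reordering
; Renaming the write (and its later readers) to a fresh register removes it:
SD   0(R1),F4
ADDD F14,F0,F2    ; no name conflict; the two ops can now be reordered or overlapped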


22

Limitation on Window Size


23

Effect of Branch Prediction

24

Effect of Finite Registers

25

Effect of Memory Disambiguation


ARM Cortex-A8 Pipeline

26

Figure 3.36 The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled.


Decode Stage

27

Figure 3.37 The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.


Execution Stage

28


CPI

29

Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1- and L2-generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.


Intel Core i7

30

Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.


Wasted Work in Core i7

31

Figure 3.42 The amount of “wasted work” is plotted by taking the ratio of dispatched micro-ops that do not graduate to all dispatched micro-ops. For example, the ratio is 25% for sjeng, meaning that 25% of the dispatched and executed micro-ops are thrown away. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.


CPI of Intel Core i7

32

Figure 3.43 The CPI for the 19 SPECCPU2006 benchmarks shows an average CPI of 0.83 for both the FP and integer benchmarks, although the behavior is quite different. In the integer case, the CPI values range from 0.44 to 2.66 with a standard deviation of 0.77, while the variation in the FP case is from 0.62 to 1.38 with a standard deviation of 0.25. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.


Relative Performance and Energy Efficiency

33

Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) Hotspot Java VM. Only one core is active on the i7, and the rest are in deep power saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency.