
1

Chapter 3: ILP and Its Dynamic Exploitation

• Review simple static pipeline

• Dynamic scheduling, out-of-order execution

• Dynamic branch prediction, instruction issue unit

• Multiple issue (superscalar)

• Hardware-based speculation

• ILP limitations

• Intel Core i7 and ARM Cortex-A8

2

Multiple Issue

• Goal: Enable multiple instructions to be issued in a single clock cycle. (Can achieve CPI < 1, i.e., IPC > 1!)

• Two basic “flavors” of multiple issue:

– Superscalar:

• Maintains the ordinary serial instruction-stream format.

• Instructions per clock (IPC) varies widely.

• Instruction issue can be dynamic or static (in-order).

– VLIW (Very Long Instruction Word), a.k.a. EPIC (Explicitly Parallel Instruction Computing); see the sketch after this list:

• New format: parallel instructions grouped into blocks.

• Instructions per block are fixed (by the block size).

• Mostly statically scheduled by the compiler.
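A minimal sketch of the VLIW idea, under illustrative assumptions (a 3-slot format with latencies ignored for clarity; this is not a real ISA encoding):

; Block format: [ integer op | memory op | FP op ], issued as one unit
[ ADD R1,R2,R3 | LD F0,0(R4) | MULD F2,F4,F6 ]   ; three independent ops fill one block
[ NOP          | NOP         | ADDD F8,F2,F0 ]   ; the dependent op waits for a later block; unfilled slots hold NOPs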

3

Superscalar Pipeline

• Typical superscalar: 1-8 insts. issued per cycle

– Actual IPC depends on dependences and hazards

• Simple example: 2 insts./cycle, static scheduling (see the schedule after the code example on the next slide)

– Instructions statically pre-paired to ease decoding:

• 1st: one load/store/branch/integer-ALU op.

• 2nd: one floating-point op.

4

Code Example to be Used

• C code fragment:

double *p;
do { *(p--) += c; } while (p);

• MIPS code fragment:

Loop: LD   F0,0(R1)   ; F0 = *p
      ADDD F4,F0,F2   ; F4 = F0 + c
      SD   0(R1),F4   ; *p = F4
      ADDI R1,R1,#-8  ; p--
      BNEZ R1,Loop    ; until p == 0
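As a hedged illustration of the static pairing from the previous slide, here is one possible dual-issue schedule of this loop body. The latencies (a one-cycle load-use delay, a three-cycle FP add) and the compiler’s hoisting of the ADDI (which changes the SD offset to 8) are illustrative assumptions, not from the original slides:

;        slot 1 (int/mem/branch)   slot 2 (FP)
; cyc 1: LD   F0,0(R1)             ---              ; ADDD must wait for F0
; cyc 2: ADDI R1,R1,#-8            ---              ; F0 ready after the load delay
; cyc 3: ---                       ADDD F4,F0,F2    ; FP add occupies cycles 3-5
; cyc 4: ---                       ---
; cyc 5: ---                       ---
; cyc 6: SD   8(R1),F4             ---              ; offset 8: R1 was decremented early
; cyc 7: BNEZ R1,Loop              ---              ; only one slot-1 op per cycle

Even with two issue slots, dependences leave most slots empty here (5 instructions in 7 cycles), which is exactly why superscalar IPC “varies widely.”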

5

Multiple Issue + Dynamic Sched.

• Why? The usual advantages of dynamic scheduling…

– Compiler-independent, data-dependent scheduling

• Multiple-issue Tomasulo:

– Issue 1 integer + 1 FP instruction to reservation stations each cycle

– Problem (again): issuing multiple instructions simultaneously

• If the instructions are dependent, hazard detection is complex.

– Two solutions to this problem (see the sketch below):

• Enter instructions into the tables in only half a clock cycle.

• Build hardware that issues two instructions in parallel; it must be careful to detect dependences within the pair.

– Memory dependences: load/store dependences are tracked through the load/store queue.
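A minimal C sketch of the second solution above: issue hardware that accepts a pair of instructions per cycle and checks for a dependence within the pair. All structures and names are illustrative assumptions, not the textbook’s hardware, and the int-vs-FP slot restriction is ignored for brevity.

#include <stdbool.h>

enum { NREGS = 32, NRS = 8, FREE = -1 };

typedef struct { int dst, src1, src2; } Inst;  /* register numbers */
typedef struct { bool busy; int qj, qk; } RS;  /* qj/qk: producing RS, or FREE if the value is ready */

static RS  rs[NRS];
static int reg_status[NREGS];                  /* RS that will write each register, or FREE */

static void init(void) {
    for (int i = 0; i < NREGS; i++) reg_status[i] = FREE;  /* no pending writers */
}

static int alloc_rs(void) {
    for (int i = 0; i < NRS; i++)
        if (!rs[i].busy) return i;
    return FREE;                               /* structural hazard: no free station */
}

/* Try to issue a pair in one cycle; returns how many issued (0, 1, or 2). */
int issue_pair(Inst i1, Inst i2) {
    int r1 = alloc_rs();
    if (r1 == FREE) return 0;                  /* stall: issue stays in order */
    rs[r1].busy = true;
    rs[r1].qj = reg_status[i1.src1];           /* ordinary Tomasulo renaming for inst1 */
    rs[r1].qk = reg_status[i1.src2];

    int r2 = alloc_rs();
    if (r2 == FREE) { reg_status[i1.dst] = r1; return 1; }
    rs[r2].busy = true;
    /* The extra hazard check: does inst2 read the register inst1 writes?
     * If so, tag the operand with inst1's station, not the register file. */
    rs[r2].qj = (i2.src1 == i1.dst) ? r1 : reg_status[i2.src1];
    rs[r2].qk = (i2.src2 == i1.dst) ? r1 : reg_status[i2.src2];

    /* Update writers in program order, so a WAW pair leaves the younger
     * instruction (inst2) as the visible producer. */
    reg_status[i1.dst] = r1;
    reg_status[i2.dst] = r2;
    return 2;
}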

6

Example of Dual-Issue Tomasulo

• The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline (no speculation)

7

Example of Dual-Issue Tomasulo

• Resource usage table for the previous figure

8

Example of Dual-Issue Tomasulo

• The clock cycles of issue, execute, and write-back for a dual-issue Tomasulo pipeline with an additional ALU and CDB

9

Example of Dual-Issue Tomasulo

• Resource usage table for the previous figure

10

Hardware-Based Speculation

• Dynamic scheduling + speculative execution:

– Dynamic branch prediction chooses which instructions will be pre-executed.

– Speculation executes instructions conditionally early (before branch conditions are resolved).

– Dynamic scheduling handles the scheduling of the different dynamic sequences of basic blocks encountered.

• Dataflow execution: execute instructions as soon as their operands are available. Results may be canceled if the prediction is incorrect!
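A tiny illustration using the earlier loop code (the prediction outcome is an assumption for the example):

; With BNEZ predicted taken, fetch continues at Loop, so the next
; iteration's load can issue and execute before the branch resolves.
      BNEZ R1,Loop      ; predicted taken; actually resolves several cycles later
Loop: LD   F0,0(R1)     ; executed speculatively; squashed if the prediction was wrong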

11

Advantages of HW-based Spec.

• Allows more overlap of instruction execution.

• Dynamic speculation can disambiguate memory references, so a load can be moved before a store (if the locations addressed are different); see the example after this list.

• Speculation works better as more accurate dynamic branch predictions become available.

• Precise exception handling is maintained even for speculated instructions.

• No extra bookkeeping code (speculation bits, register-renaming code) in the program.

• Program code is independent of the implementation.
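A small example of the memory-disambiguation point (registers and offsets are illustrative):

SD  0(R4),F2    ; earlier store
LD  F0,0(R5)    ; may start before the SD completes, but only once the
                ; hardware can check that the two addresses differ (R5 != R4)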

12

Implementing HW-based Spec.

• Separate the execution of speculative instructions (including the dataflow between them) from the committing of results permanently to registers/memory (done only when the speculation proves correct).

• A new structure called the reorder buffer holds the results of instructions that have executed speculatively (or non-speculatively) but cannot yet be committed (commit is in order).

– The reorder buffer provides non-programmer-visible temporary storage, like the reservation stations in Tomasulo’s algorithm.
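A minimal C sketch of such a reorder buffer, with illustrative type and field names (not the textbook’s): a circular FIFO that instructions enter at the tail in program order and commit from at the head, also in program order.

#include <stdbool.h>
#include <stdint.h>

enum { ROB_SIZE = 16 };

typedef enum { TYPE_REG, TYPE_STORE, TYPE_BRANCH } RobType;

typedef struct {
    bool     busy;          /* entry is in use */
    bool     ready;         /* result has been written back */
    RobType  type;          /* where the result goes at commit */
    int      dest;          /* destination register, or the store address */
    uint64_t value;         /* speculative result, held here until commit */
    bool     mispredicted;  /* branches only: the prediction was wrong */
} RobEntry;

typedef struct {
    RobEntry entry[ROB_SIZE];
    int head, tail, count;  /* head = oldest entry, the only one that may commit */
} ROB;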

13

Steps of Execution in HWBS

• Issue (or dispatch):

– Get the next fetched instruction (in order).

– Issue only if a reservation station and a reorder-buffer entry are both free.

• Execute:

– Monitor the CDB for operands until they are ready, then execute.

• Write result:

– Write to the CDB, the reorder buffer, and the reservation stations.

• Commit:

– When an instruction is first in the reorder buffer (and wasn’t mispredicted), commit its value to the register/memory.

• Committing a mispredicted branch flushes the reorder buffer.
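Continuing the reorder-buffer sketch above, a hedged version of the commit step (write_register, write_memory, and restart_fetch are hypothetical helpers, declared but not defined here; they reuse the ROB/RobEntry types from the earlier sketch):

/* Hypothetical helpers: real definitions are outside this sketch. */
void write_register(int reg, uint64_t value);
void write_memory(uint64_t addr, uint64_t value);
void restart_fetch(uint64_t correct_target);

/* Commit at most one instruction per call, always from the ROB head,
 * so results reach registers/memory strictly in program order. */
void commit_step(ROB *rob) {
    if (rob->count == 0) return;
    RobEntry *e = &rob->entry[rob->head];
    if (!e->ready) return;                       /* head not finished: everyone waits */

    if (e->type == TYPE_BRANCH && e->mispredicted) {
        rob->head = rob->tail = rob->count = 0;  /* flush all younger, speculative work */
        restart_fetch(e->value);                 /* value holds the correct target here */
        return;
    }
    if (e->type == TYPE_REG)   write_register(e->dest, e->value);
    if (e->type == TYPE_STORE) write_memory((uint64_t)e->dest, e->value);

    e->busy = false;                             /* retire the entry in order */
    rob->head = (rob->head + 1) % ROB_SIZE;
    rob->count--;
}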


14

HWBS Implementation Sketch


15

A Simple Example (Fig 3.12)

(Figure annotations: one entry is ready to commit; completed entries behind the MUL cannot commit yet.)


16

Loop Example with Reorder Buffer

(Figure annotation: some instructions have completed but are not yet able to commit.)


17


18

Comparison with/without Speculation


19

Comparison with/without Speculation

20

ILP Limitations

• An ideal processor has:

– Infinite registers for renaming

– Perfect branch and jump prediction

– Perfect memory disambiguation

21

Increasing the Window Size and Maximum Issue Count

• How close can a real dynamically scheduled, speculative processor come to the ideal one? It would have to:

– Look arbitrarily far ahead, predicting all branches

– Rename all register uses to avoid WAR/WAW hazards (see the example below)

– Determine data dependences

– Determine memory dependences

– Have enough parallel functional units
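A small example of the renaming bullet above (F14 is an arbitrary fresh register):

; WAR hazard: the ADDD must not overwrite F4 before the SD has read it.
SD   0(R1),F4     ; reads the old F4
ADDD F4,F0,F2     ; write-after-read on F4 limits reordering
; Renaming the write (and its later readers) to a fresh register removes it:
SD   0(R1),F4
ADDD F14,F0,F2    ; no name conflict; the two ops can now be reordered or overlapped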


22

Limitation on Window Size


23

Effect of Branch Prediction

24

Effect of Finite Registers

25

Effect of Memory Disambiguation


ARM Cortex-A8 Pipeline

26

Figure 3.36 The basic structure of the A8 pipeline is 13 stages. Three cycles are used for instruction fetch and four for instruction decode, in addition to a five-cycle integer pipeline. This yields a 13-cycle branch misprediction penalty. The instruction fetch unit tries to keep the 12-entry instruction queue filled.


Decode Stage

27

Figure 3.37 The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.


Execution Stage

28


CPI

29

Figure 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline stalls are the primary addition to the base CPI. eon deserves some special mention, as it does integer-based graphics calculations (ray tracing) and has very few cache misses. It is computationally intensive with heavy use of multiplies, and the single multiply pipeline becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and penalties to compute the L1- and L2-generated stalls per instruction. These are subtracted from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls include all three hazards plus minor effects such as way misprediction.


Intel Core i7

30

Figure 3.41 The Intel Core i7 pipeline structure shown with the memory system components. The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles. There are 48 load and 32 store buffers. The six independent functional units can each begin execution of a ready micro-op in the same cycle.


Wasted Work in Core i7

31

Figure 3.42 The amount of “wasted work” is plotted by taking the ratio of dispatched micro-ops that do not graduate to all dispatched micro-ops. For example, the ratio is 25% for sjeng, meaning that 25% of the dispatched and executed micro-ops are thrown away. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.


CPI of Intel Core i7

32

Figure 3.43 The CPI for the 19 SPECCPU2006 benchmarks shows an average CPI of 0.83 for both the FP and integer benchmarks, although the behavior is quite different. In the integer case, the CPI values range from 0.44 to 2.66 with a standard deviation of 0.77, while the variation in the FP case is from 0.62 to 1.38 with a standard deviation of 0.25. The data in this section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana State University.


Relative Performance and Energy Efficiency

33

Figure 3.45 The relative performance and energy efficiency for a set of single-threaded benchmarks shows the i7 920 is 4 to over 10 times faster than the Atom 230 but that it is about 2 times less power efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. [2011]. The SPEC benchmarks were compiled with optimization on using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) Hotspot Java VM. Only one core is active on the i7, and the rest are in deep power saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency.