9th Lecture: Branch Prediction (rest), Predication, Intel Pentium II/III, Intel Pentium 4


Upload: marshall-bradley

Post on 19-Jan-2018


TRANSCRIPT

1 9th Lecture
- Branch prediction (rest)
- Predication
- Intel Pentium II/III
- Intel Pentium 4

2 Hybrid Predictors
- The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.
- A combining or hybrid predictor needs two or more component predictors and a predictor selection mechanism:
  - McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
  - Young and Smith: compiler-based static branch prediction combined with a two-level adaptive predictor,
  - and many more combinations!
- Hybrid predictors are often better than single-type predictors.

3 Simulations of Grunwald 1998
- Table 1.1: SAg, gshare and McFarling's combining predictor

4 Results
- A simulation by Keeton et al. using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
- The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.
- Two different conclusions may be drawn from these simulation results:
  - branch predictors should be further improved,
  - and/or branch prediction is only effective if the branch is predictable.
- If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior.
- Question: how much confidence can be placed in a branch prediction?

5 Predicated Instructions and Multipath Execution - Confidence Estimation
- Confidence estimation is a technique for assessing the quality of a particular prediction.
- Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.
- A low-confidence branch is a branch that frequently changes its direction in an irregular way, making its outcome hard to predict or even unpredictable.
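The McFarling-style combining predictor described above can be sketched in C. This is a simplified model: the table sizes, the index hash, and the chooser update policy are illustrative assumptions, not the published design.

```c
#include <stdint.h>

/* Illustrative table size -- real predictors use larger tables. */
#define ENTRIES 1024
#define MASK    (ENTRIES - 1)

/* Three tables of 2-bit saturating counters (0..3, "taken" if >= 2). */
static uint8_t bimodal_t[ENTRIES]; /* per-branch two-bit predictor     */
static uint8_t gshare_t[ENTRIES];  /* two-level adaptive component     */
static uint8_t chooser[ENTRIES];   /* selects which component to trust */
static uint32_t ghist;             /* global branch history register   */

static void bump(uint8_t *c, int taken) {
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}

/* Predict: chooser >= 2 means "trust gshare", else trust bimodal. */
int predict(uint32_t pc) {
    uint32_t gi = (pc ^ ghist) & MASK;
    int pb = bimodal_t[pc & MASK] >= 2;
    int pg = gshare_t[gi] >= 2;
    return chooser[pc & MASK] >= 2 ? pg : pb;
}

/* Update: train both components; when they disagreed, move the
 * chooser toward whichever component was correct. */
void update(uint32_t pc, int taken) {
    uint32_t gi = (pc ^ ghist) & MASK;
    int pb = bimodal_t[pc & MASK] >= 2;
    int pg = gshare_t[gi] >= 2;
    if (pb != pg) bump(&chooser[pc & MASK], pg == taken);
    bump(&bimodal_t[pc & MASK], taken);
    bump(&gshare_t[gi], taken);
    ghist = (ghist << 1) | (uint32_t)taken;
}
```

The chooser is only trained when the two components disagree, which is the essential trick: it learns, per branch, which class of predictor handles that branch better.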
- Four classes are possible:
  - correctly predicted with high confidence C(HC),
  - correctly predicted with low confidence C(LC),
  - incorrectly predicted with high confidence I(HC), and
  - incorrectly predicted with low confidence I(LC).

6 Implementation of a confidence estimator
- Information from the branch prediction tables is used:
  - Use of saturation counter information to construct a confidence estimator: speculate more aggressively when the confidence level is higher.
  - Use of a miss distance counter table (MDC): each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise.
  - A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
- Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.

7 Predicated Instructions
- Provide predicated or conditional instructions and one or more predicate registers.
- Predicated instructions use a predicate register as an additional input operand.
- The Boolean result of a condition test is recorded in a (one-bit) predicate register.
- Predicated instructions are fetched, decoded and placed in the instruction window like non-predicated instructions.
- How far a predicated instruction proceeds speculatively in the pipeline before its predication is resolved depends on the processor architecture:
  - A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
  - Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but commits only if the predicate is true; otherwise the result is discarded.
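The miss distance counter (MDC) scheme from the confidence-estimator slide above can be sketched in a few lines; the table size, counter width, and threshold are illustrative assumptions.

```c
#include <stdint.h>

#define CONF_ENTRIES   512  /* illustrative table size */
#define CONF_THRESHOLD 4    /* illustrative threshold  */

/* One "miss distance" counter per entry: the number of correct
 * predictions observed since that branch was last mispredicted. */
static uint8_t mdc[CONF_ENTRIES];

/* Called after each branch resolves: reset on a miss, count up
 * (saturating) on a correct prediction. */
void mdc_update(uint32_t pc, int predicted_correctly) {
    uint8_t *c = &mdc[pc % CONF_ENTRIES];
    if (predicted_correctly) { if (*c < 255) (*c)++; }
    else                     { *c = 0; }
}

/* High confidence when the counter exceeds the threshold. */
int high_confidence(uint32_t pc) {
    return mdc[pc % CONF_ENTRIES] > CONF_THRESHOLD;
}
```

A long run of correct predictions thus earns a branch high confidence, and a single misprediction immediately demotes it, which is exactly the behavior wanted for speculation control.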
8 Predication Example

if (x == 0) {   /* branch b1 */
    a = b + c;
    d = e - f;
}
g = h * i;      /* instruction independent of branch b1 */

Pred = (x == 0);         /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c;  /* the operations are only performed */
if Pred then d = e - f;  /* if Pred is set to true */
g = h * i;

9 Predication
(+) Able to eliminate a branch and therefore the associated branch prediction, increasing the distance between mispredictions.
(+) The run length of a code block is increased, allowing better compiler scheduling.
(-) Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
(-) Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.
- Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
- The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

10 Eager (Multipath) Execution
- Execution proceeds down both paths of a branch, and no prediction is made.
- When a branch resolves, all operations on the not-taken path are discarded.
- Oracle execution: eager execution with unlimited resources
  - gives the same theoretical maximum performance as perfect branch prediction.
- With limited resources, the eager execution strategy must be employed carefully.
- A mechanism is required that decides when to employ prediction and when eager execution: e.g. a confidence estimator.
- Rarely implemented (IBM mainframes), but some research projects:
  - Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution.
(Figure: (a) single path speculative execution, (b) full eager execution, (c) disjoint eager execution)

Prediction of Indirect Branches
- Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
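The transformation in the predication example above can be imitated in C source with guarded conditional assignments, which a compiler may lower to conditional-move instructions instead of branches. This is a sketch: the function wrapper is illustrative, and whether branch-free code is actually emitted depends on the compiler and target.

```c
/* Branch-free formulation of the slide's example: each assignment is
 * guarded by the one-bit predicate, analogous to "if Pred then ...".
 * Variable names follow the slide. */
void predicated_block(int x, int *a, int b, int c,
                      int *d, int e, int f, int *g, int h, int i) {
    int pred = (x == 0);     /* predicate for branch b1           */
    *a = pred ? b + c : *a;  /* takes effect only if pred is true */
    *d = pred ? e - f : *d;
    *g = h * i;              /* independent of branch b1          */
}
```

Note that when pred is false the guarded statements still occupy fetch and execute resources, mirroring the cost noted on the Predication slide.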
- Indirect branches occur frequently in machine code compiled from object-oriented programs like C++ and Java programs.
- One simple solution is to update the PHT to include the branch target addresses.

13 Branch handling techniques and implementations

Technique                              Implementation examples
No branch prediction                   Intel 8086
Static prediction
  always not taken                     Intel i486
  always taken                         Sun SuperSPARC
  backward taken, forward not taken    HP PA-7x00
  semistatic with profiling            early PowerPCs
Dynamic prediction:
  1-bit                                DEC Alpha 21064, AMD K5
  2-bit                                PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                   Intel PentiumPro, Pentium II, AMD K6, Athlon
Hybrid prediction                      DEC Alpha
Predication                            Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201 and many others
Eager execution (limited)              IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution               none yet

14 High-Bandwidth Branch Prediction
- Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle,
  - e.g. the GAg predictor is independent of the branch address.
- When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
  - Possible solution: trace cache in combination with next-trace prediction.
- Most likely a combination of branch handling techniques will be applied,
  - e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.

15 The Intel P5 and P6 family
(Figure: P5, P6 including L2 cache, NetBurst)

16 Micro-Dataflow in PentiumPro 1995
"The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order."
(R.P. Colwell, R.L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid-State Circuits Conference, Feb.)

17 PentiumPro and Pentium II/III
- The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family.
- This three-way superscalar, pipelined micro-architecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages.
- The Pentium II/III processor has twelve stages, with a pipestage time 33 percent less than that of the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.
- A wide instruction window using an instruction pool.
- Optimized scheduling requires the fundamental execute phase to be replaced by decoupled issue/execute and retire phases. This allows instructions to be started in any order but always retired in the original program order.
- Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.

18 Pentium Pro Processor and Pentium II/III Microarchitecture
(Figure: Pentium II/III microarchitecture)

20 Pentium II/III: The In-Order Section
- The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit.
- The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs.
- Branch prediction:
  - two-level adaptive scheme of Yeh and Patt,
  - the BTB contains 512 entries and maintains branch history information and the predicted branch target address.
  - Branch misprediction penalty: at least 11 cycles, on average 15 cycles.
- The instruction decoder unit (IDU) is composed of three separate decoders.

21 Pentium II/III: The In-Order Section (Continued)
- A decoder breaks an IA-32 instruction down into µops, each comprising an opcode, two source operands and one destination operand. These µops are of fixed length.
  - Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
  - some instructions are decoded into one to four µops (by the general decoder),
  - more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
- The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 register references are converted into references to physical registers.
- Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).

22 The Fetch/Decode Unit
(Figure: (a) the in-order section: I-cache, instruction fetch unit, Next_IP, branch target buffer; (b) the instruction decoder unit (IDU): an alignment stage feeding two simple decoders and one general decoder producing µop1-µop3, plus the microcode instruction sequencer and the register alias table)

23 The Out-of-Order Execute Section
- When the µops flow into the ROB, they effectively take a place in program order.
- µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
- µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
- After completion, the result goes to two different places, RSU and ROB.
- The RSU has five ports and can issue at a peak rate of 5 µops per cycle.
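The dataflow issue policy described above can be sketched with a toy model. The entry layout, scan order, and single-port issue are illustrative simplifications; the real RSU arbitrates five ports in parallel.

```c
/* A tiny model of reservation-station issue: a uop may issue as soon
 * as both source operands are ready, regardless of program order. */
#define N_RS 20   /* the P6 RSU holds 20 reservation stations */

struct rs_entry {
    int valid;       /* slot holds a waiting uop        */
    int src1_ready;  /* operand availability (dataflow) */
    int src2_ready;
    int seq;         /* original program order, for show */
};

/* Pick one ready uop to issue; returns its program-order number,
 * or -1 if nothing can issue on this port in this cycle. */
int issue_one(struct rs_entry rs[N_RS]) {
    for (int i = 0; i < N_RS; i++) {
        if (rs[i].valid && rs[i].src1_ready && rs[i].src2_ready) {
            rs[i].valid = 0;   /* free the reservation station */
            return rs[i].seq;
        }
    }
    return -1;
}
```

The point of the model: an older µop still waiting for an operand does not block a younger, ready µop, which is precisely "dataflow order" as described in the Colwell quote.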
24 Latencies and throughput for Pentium II/III FUs
(Figure: the reservation station unit issues over ports 0-4 to the MMX, floating-point, integer, jump, load, and store functional units, connected to/from the reorder buffer)

26 The In-Order Retire Section
- A µop can be retired
  - if its execution is completed,
  - if it is its turn in program order,
  - and if no interrupt, trap, or misprediction occurred.
- Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
- Three µops per clock cycle can be retired.

27 Retire Unit
(Figure: the retirement register file connected to the reservation station unit, memory interface unit, reorder buffer, and D-cache)

28 The Pentium II/III Pipeline
(Figure: (a) fetch/decode: BTB access (BTB0, BTB1), I-cache access and predecode (IFU0-IFU2), decode (IDU0, IDU1), register renaming (RAT), reorder buffer read (ROB read); (b) issue: reservation station (RSU), ports 0-4, execution and completion; (c) retirement: reorder buffer write-back (ROB write), retirement register file (RRF))

29 Pentium Pro Processor Basic Execution Environment
(Figure: eight 32-bit general purpose registers, six 16-bit segment registers, the EFLAGS register and EIP (instruction pointer register); the address space can be flat or segmented)

30 Application Programming Registers

31 Pentium III

32 Pentium II/III summary and offsprings
- Pentium III in 1999, initially at 450 MHz (0.25 micron technology), former code name Katmai
- two 32 kB caches, faster floating-point performance
- Coppermine is a shrink of the Pentium III down to 0.18 micron
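The retirement conditions from the In-Order Retire Section above can be modeled in a few lines. This is a sketch: ROB indexing is simplified to start at the head, and the interrupt/trap checks are abstracted away.

```c
/* In-order retirement from the reorder buffer: starting at the ROB
 * head, retire completed uops in program order and stop at the first
 * uop that has not finished executing. The P6 ROB has 40 entries and
 * retires at most 3 uops per cycle. */
#define ROB_SIZE     40
#define RETIRE_WIDTH 3

/* completed[i] != 0 means the uop at ROB offset i (from the head) has
 * finished executing; valid_count is the number of occupied entries.
 * Returns how many uops retire this cycle. */
int retire_cycle(const int completed[ROB_SIZE], int valid_count) {
    int retired = 0;
    while (retired < RETIRE_WIDTH && retired < valid_count
           && completed[retired]) {
        retired++;  /* write result to the RRF, free the ROB entry */
    }
    return retired;
}
```

Stopping at the first incomplete µop is what makes retirement in-order: a completed younger µop cannot commit past an older one that is still executing.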
33 Pentium 4
- was announced for mid-2000 under the code name Willamette
- native IA-32 processor with Pentium III processor core
- running at 1.5 GHz
- 42 million transistors
- 0.18 µm
- 20 pipeline stages (integer pipeline), IF and ID not included
- trace execution cache (TEC) for the decoded µops
- NetBurst micro-architecture

34 Pentium 4 Features
- Rapid Execution Engine
  - Intel: Arithmetic Logic Units (ALUs) run at twice the processor frequency.
  - Fact: two ALUs running at processor frequency, connected with a multiplexer running at twice the processor frequency.
- Hyper Pipelined Technology
  - twenty-stage pipeline to enable high clock rates,
  - frequency headroom and performance scalability.

35 Advanced Dynamic Execution
- Very deep, out-of-order, speculative execution engine
  - up to 126 instructions in flight (3 times more than the Pentium III processor),
  - up to 48 loads and 24 stores in the pipeline (2 times more than the Pentium III processor).
- Branch prediction based on µops
  - 4K-entry branch target array (8 times larger than in the Pentium III processor),
  - a new algorithm (not specified) reduces mispredictions by about one third compared to the gshare of the P6 generation.

36 First level caches
- 12k-µop Execution Trace Cache (~100 k)
  - the Execution Trace Cache removes decoder latency from the main execution loops,
  - the Execution Trace Cache integrates the path of program execution flow into a single line.
- Low-latency 8 kByte data cache with 2 cycle latency.

37 Second level caches
- included on the die: 256 kB full-speed, unified, 8-way 2nd-level Advanced Transfer Cache
- 256-bit data bus to the level 2 cache
- delivers ~45 GB/s data throughput (at 1.4 GHz processor frequency)
- bandwidth and performance increase with processor frequency

38 NetBurst Micro-Architecture

39 Streaming SIMD Extensions 2 (SSE2) Technology
- SSE2 extends MMX and SSE technology with the addition of 144 new instructions, which include support for:
  - 128-bit SIMD integer arithmetic operations,
  - 128-bit SIMD double-precision floating-point operations,
  - cache and memory management operations.
- Further enhances and accelerates video, speech, encryption, image and photo processing.

40 400 MHz Intel NetBurst micro-architecture system bus
- provides 3.2 GB/s throughput (3 times faster than the Pentium III processor)
- quad-pumped 100 MHz scalable bus clock to achieve 400 MHz effective speed
- split-transaction, deeply pipelined
- 128-byte lines with 64-byte accesses

41 Pentium 4 data types

42 Pentium 4

43 Pentium 4 offsprings
Foster
- Pentium 4 with external L3 cache and DDR-SDRAM support
- provided for servers
- clock rate GHz
- to be launched in Q2/2001
Northwood
- 0.13 µm technology
- new 478-pin socket
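The 128-bit SIMD integer operations that SSE2 introduces can be exercised from C through compiler intrinsics. A minimal sketch using the standard <emmintrin.h> header; it requires an SSE2-capable x86 target, and the wrapper function is illustrative.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add four pairs of 32-bit integers with one 128-bit SIMD operation
 * (the PADDD instruction behind _mm_add_epi32). */
void add4(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);      /* 4 adds in parallel */
    _mm_storeu_si128((__m128i *)out, vr);
}
```

One instruction replacing four scalar adds is the kind of speedup behind the "enhances and accelerates video, speech, encryption, image and photo processing" claim above.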