![Page 1: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/1.jpg)
Microarchitecture of Superscalars (4) Decoding
Dezső Sima
Fall 2007
(Ver. 2.0) Dezső Sima, 2007
![Page 2: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/2.jpg)
Overview
1. Overview
2. Straightforward parallel decoding
3. Predecoding
4. Decoding with CISC/RISC conversion
4.1 Overview
4.2 Decoding into µops
4.3 Decoding into macroops
5. Using a trace cache
6. Decoding with instruction grouping
6.1 Overview
6.2 Grouping of RISC instructions
6.3 Grouping of CISC instructions
![Page 3: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/3.jpg)
1. Overview
Decoding techniques used in superscalars:
• Straightforward parallel decoding (1. gen. RISC superscalars)
• Predecoding (beginning with 2. gen. superscalars)
• Decoding with CISC/RISC conversion (beginning with 2. gen. superscalar CISCs)
  • Decoding into µops (Intel, beginning with the Pentium Pro)
  • Decoding into macroops of up to two µops (AMD, beginning with the K7)
• Using a trace cache (P4-family)
• Decoding with instruction grouping
  • Grouping of RISC instructions (POWER4, POWER5; K7 (Athlon), K8 (Hammer))
  • Grouping of CISC instructions (Pentium M, Core)
![Page 4: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/4.jpg)
2 Straightforward parallel decoding
Figure 2.1: The PowerPC 601’s front end
Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug. 2004. http://arstechnica.com/articles
![Page 5: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/5.jpg)
3 Predecoding (1)
Figure 3.1: Contrasting the decoding and instruction issues in a scalar and a 4-way superscalar processor
[Figure labels. Scalar issue (typical FX-pipeline layout: F, D/I, ...): I-cache → instruction buffer → Decode / Issue / Check. Superscalar issue (F, D, I, ...): I-cache → instruction buffer → Decode / Issue / Check.]
![Page 6: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/6.jpg)
3 Predecoding (1)
Figure 3.2: The principle of predecoding
Second-level cache (or memory) → Predecode unit → I-cache
Typically 128 bits/cycle enter the predecode unit and, e.g., 148 bits/cycle are written into the I-cache.
When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction.
AMD’s CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte).
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
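The 128 and 148 bits/cycle figures are consistent with four 32-bit RISC instructions per cycle, each extended by 5 predecode bits. The instruction width and count are assumptions, not stated on the slide, as is the 128-bit fetch width used for the AMD per-byte cases. A minimal sketch of the arithmetic, with hypothetical function names:

```python
def width_with_bits_per_instruction(fetch_bits, instr_bits, extra_bits):
    """I-cache write width when predecode bits are appended to each instruction."""
    n_instr = fetch_bits // instr_bits          # instructions per fetch block
    return n_instr * (instr_bits + extra_bits)


def width_with_bits_per_byte(fetch_bits, extra_bits):
    """I-cache write width when predecode bits are appended to each byte."""
    n_bytes = fetch_bits // 8
    return n_bytes * (8 + extra_bits)


# Assumed: 128-bit fetch block holding four 32-bit RISC instructions, 5 bits each.
risc_width = width_with_bits_per_instruction(128, 32, 5)

# Assumed: 128-bit (16-byte) fetch block for the per-byte schemes.
k6_width = width_with_bits_per_byte(128, 5)   # K5, K6: 5 bits/byte
k8_width = width_with_bits_per_byte(128, 3)   # K7, K8: 3 bits/byte

print(risc_width, k6_width, k8_width)
```

Under these assumptions the per-byte schemes cost 62.5% (5/8) and 37.5% (3/8) extra storage, against roughly 16% (5/32) for the per-instruction RISC case.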
![Page 7: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/7.jpg)
3 Predecoding (2)
Figure 3.3: The introduction of predecoding
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
![Page 8: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/8.jpg)
3. Predecoding (3)
Figure 3.4: Variable length instruction decoding in the Athlon
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com
![Page 9: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/9.jpg)
3 Predecoding (4)
Figure 3.5: Opteron’s instruction cache and decoding
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com
![Page 10: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/10.jpg)
4 Decoding with CISC/RISC conversion
4.1 Overview
[Figure flow: CISC instructions → Decoding with CISC/RISC conversion → RISC core → Retiring with RISC/CISC conversion → Modification of the program state after RISC/CISC re-conversion. Examples: PPro, K6. The internal operations are µops or macroops.]
Figure 4.1: Principle of decoding with CISC/RISC conversion
Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
![Page 11: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/11.jpg)
4.2 Decoding into µops (1)
Figure 4.2: The Microarchitecture of the Pentium Pro
Source: Shanley, T., „Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997
![Page 12: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/12.jpg)
4.2 Decoding into µops (2)
Figure 4.3: Basic misprediction pipeline of the Pentium III
Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001
![Page 13: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/13.jpg)
4.2 Decoding into µops (3)
Figure 4.4: Decoding in AMD’s K6
Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor”, IEEE Computer Society Press, 1998
![Page 14: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/14.jpg)
4.2 Decoding into µops (4)
Figure 4.5: The Microarchitecture of the Pentium M (Yonah)
Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.
![Page 15: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/15.jpg)
4.2 Decoding into µops (5)
Figure 4.6: The Microarchitecture of the Core processor family
Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.
![Page 16: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/16.jpg)
4.3 Decoding into macroops (1)
Figure 4.7: The Microarchitecture of the AMD Athlon™
Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998
![Page 17: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/17.jpg)
4.3 Decoding into macroops (2)
Figure 4.8: Decoding in the Athlon (1)
Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998
![Page 18: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/18.jpg)
4.3 Decoding into macroops (3)
Figure 4.9: Decoding in the Athlon (2)
Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998
![Page 19: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/19.jpg)
4.3 Decoding into macroops (4)
Each MacroOp: 1 or 2 operations (OPs)
e.g.: ADD EAX, EBX: 1 ADD OP
      AND EAX, [EBX+16]: 1 LOAD OP + 1 AND OP
Up to 3 MacroOps per cycle with up to 3 FX + 2 L/S OPs (dual ported D$!) per cycle
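The two examples above can be modeled with a toy decoder. This is only a sketch under simplifying assumptions: the `MacroOp` class and both function names are hypothetical, only two-operand instructions with an optional memory source are handled, and every non-LOAD OP is counted as a fixed-point (FX) OP.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MacroOp:
    text: str        # original x86 instruction
    ops: List[str]   # 1 or 2 OPs carried by this MacroOp


def decode_to_macroop(instr: str) -> MacroOp:
    """Map a two-operand x86 instruction to a MacroOp of 1 or 2 OPs."""
    mnemonic, operands = instr.split(maxsplit=1)
    _dest, src = [s.strip() for s in operands.split(",", 1)]
    if src.startswith("["):
        # Memory source operand: a LOAD OP plus the arithmetic/logic OP.
        return MacroOp(instr, ["LOAD", mnemonic])
    return MacroOp(instr, [mnemonic])


def fits_in_one_cycle(macroops: List[MacroOp]) -> bool:
    """Check the per-cycle limits: 3 MacroOps, 3 FX OPs, 2 L/S OPs."""
    ls = sum(1 for m in macroops for op in m.ops if op == "LOAD")
    fx = sum(1 for m in macroops for op in m.ops if op != "LOAD")
    return len(macroops) <= 3 and fx <= 3 and ls <= 2


reg_form = decode_to_macroop("ADD EAX, EBX")
mem_form = decode_to_macroop("AND EAX, [EBX+16]")
print(reg_form.ops, mem_form.ops)
```

In this model a cycle with one register-form and two memory-form instructions fits the limits, while three memory-form instructions would exceed the two L/S OPs available per cycle.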
![Page 20: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/20.jpg)
4.3 Decoding into macroops (5)
Figure 4.10: The Microarchitecture of the Hammer
Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001
![Page 21: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/21.jpg)
5 Using a trace cache (1)
Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)
![Page 22: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/22.jpg)
5 Using a trace cache (2)
Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette)
Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001
![Page 23: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/23.jpg)
5 Using a trace cache (3)
Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott)
Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.
![Page 24: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/24.jpg)
6. Decoding with instruction grouping
6.1 Overview
Decoding with instruction grouping:
• Grouping of RISC instructions (POWER4, POWER5; K7 (Athlon), K8 (Hammer))
• Grouping of CISC instructions (Pentium M, Core arch.)
![Page 25: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/25.jpg)
6.2 Grouping of RISC instructions (1)
Operation of the Reorder Buffer (ROB)
[Figure: ROB lines with index 1 to 12, each holding lane 0, lane 1 and lane 2. Legend: out-of-order finished instructions (results still speculative); instructions being retired now; retired instructions (not speculative anymore).]
Figure 5.3: Instruction grouping in the K7 and K8
Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com
Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated a line in the ROB.
The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed.
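The line-based ROB described on this slide (24 lines of 3 entries, retiring the oldest line only when all its MacroOps are complete) can be sketched as follows. The class and method names are hypothetical, and the model ignores everything except allocation, completion and retirement.

```python
from collections import deque

LINES, LANES = 24, 3  # sizes taken from the slide


class LineROB:
    """Toy reorder buffer that allocates and retires whole lines."""

    def __init__(self):
        self.lines = deque()  # oldest line at the left

    def allocate(self, macroops):
        """Allocate one line for the MacroOps decoded in a cycle."""
        if not 1 <= len(macroops) <= LANES:
            raise ValueError("a line holds 1 to 3 MacroOps")
        if len(self.lines) == LINES:
            return False  # ROB full: decoding would have to stall
        self.lines.append({name: False for name in macroops})
        return True

    def complete(self, name):
        """Mark a MacroOp as finished (possibly out of order)."""
        for line in self.lines:
            if name in line:
                line[name] = True
                return
        raise KeyError(name)

    def retire(self):
        """Retire the oldest line if all of its MacroOps are complete."""
        if self.lines and all(self.lines[0].values()):
            return list(self.lines.popleft())
        return []


rob = LineROB()
rob.allocate(["m1", "m2", "m3"])
rob.allocate(["m4", "m5"])
rob.complete("m4")
rob.complete("m5")
blocked = rob.retire()  # younger line is done, but the oldest line is not
```

Even though the second line finished first, nothing retires until `m1`, `m2` and `m3` are all complete, which is what keeps retirement in program order.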
![Page 26: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/26.jpg)
6.2 Grouping of RISC instructions (2)
[Figure: Decoders → Schedulers → EUs]
Figure 6.1: Out of order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007)
(The K8L scheduler has 8*3 entries vs. 6*3 in the K8)
Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L”, Aug. 2006.
![Page 27: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/27.jpg)
6.2 Grouping of RISC instructions (3)
Figure 6.1: The principle of instruction grouping in IBM’s POWER4 and POWER5 processors
[Figure labels: instruction groups, issue queues, execution units (EU), ROB, retire]
• Dispatch instruction groups in-order, forward individual instructions to the issue queues
• Execute individual instructions ooo
• Retire instruction groups in-order, modify program state
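The three steps (in-order group dispatch, out-of-order execution of individual instructions, in-order group retirement) can be illustrated with a small simulation. It is a sketch only: the function name is hypothetical, all groups are dispatched up front, and a random shuffle stands in for the operand-readiness logic that decides the real execution order.

```python
import random


def run_grouped(groups, seed=0):
    """groups: list of instruction-name lists, in program order."""
    rng = random.Random(seed)

    # 1. Dispatch groups in order; forward individual instructions
    #    to a single (simplified) issue queue.
    pending = {}       # group id -> instructions not yet executed
    issue_queue = []
    for gid, group in enumerate(groups):
        pending[gid] = set(group)
        issue_queue.extend((gid, ins) for ins in group)

    # 2. Execute individual instructions out of order (random order here).
    rng.shuffle(issue_queue)
    executed, retired = [], []
    next_gid = 0       # oldest group that has not retired yet
    for gid, ins in issue_queue:
        executed.append(ins)
        pending[gid].discard(ins)
        # 3. Retire whole groups in order once they are fully executed.
        while next_gid < len(groups) and not pending[next_gid]:
            retired.append(groups[next_gid])
            next_gid += 1
    return executed, retired


program = [["i1", "i2"], ["i3"], ["i4", "i5", "i6"]]
executed, retired = run_grouped(program, seed=1)
```

Whatever order the instructions execute in, `retired` always lists the groups in dispatch order, since a group can only retire after every older group has retired.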
![Page 28: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/28.jpg)
6.2 Grouping of RISC instructions (4)
Figure 6.2: Implementation of instruction grouping in IBM’s POWER 5 processor
Source: Sinharoy, B. et al., „POWER5 system microarchitecture”, IBM J. Res. & Dev., July/Sept. 2005.
![Page 29: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/29.jpg)
6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion)
x86 instructions: macro-ops
internal instructions: μops
Macro-op fusion: combines two macro-ops into a single μop.
Specifically: x86 compare or test instructions are fused with x86 jumps to produce a single μop.
Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.
In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle (four decoders, plus one extra instruction when a pair is fused).
Macro-op fusion can reduce the number of μops by about 10%.
Macro-op fusion was introduced in the Core architecture.
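A toy decoder illustrates the fusion rule and the 4+1 decode bandwidth. The sketch rests on simplifying assumptions that go beyond the slide: each unfused x86 instruction yields exactly one μop, any mnemonic starting with "J" counts as a jump, and a pair split across two decode cycles is not fused. All names are hypothetical.

```python
FUSIBLE_FIRST = {"CMP", "TEST"}


def is_jump(mnemonic):
    # Hypothetical classifier: any mnemonic starting with "J" is a jump.
    return mnemonic.startswith("J")


def decode_cycle(instrs, decoders=4):
    """Decode one cycle; return (uops, number of x86 instructions consumed)."""
    uops, i, fused = [], 0, False
    while i < len(instrs) and len(uops) < decoders:
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if not fused and cur in FUSIBLE_FIRST and nxt is not None and is_jump(nxt):
            uops.append(f"{cur}+{nxt}")  # two macro-ops become one uop
            fused = True                 # at most one fusion per cycle
            i += 2
        else:
            uops.append(cur)
            i += 1
    return uops, i


def count_uops(instrs, decoders=4):
    """Total uops produced for a whole instruction stream."""
    total, pos = 0, 0
    while pos < len(instrs):
        uops, used = decode_cycle(instrs[pos:], decoders)
        total += len(uops)
        pos += used
    return total


uops, consumed = decode_cycle(["MOV", "ADD", "CMP", "JNE", "SUB"])
```

Here four decoders consume five x86 instructions in one cycle because `CMP` and `JNE` share a single μop, which is the 4+1 case.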
![Page 30: Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007](https://reader035.vdocument.in/reader035/viewer/2022062407/56649dcf5503460f94ac4124/html5/thumbnails/30.jpg)
6.3 Grouping of CISC instructions (2)
Benefits:
• Fewer μops: increased performance
• ooo execution becomes more effective, as the instruction window now includes more (~10%) x86 instructions