
Microarchitecture of Superscalars (4): Decoding

Dezső Sima

Fall 2007

(Ver. 2.0) Dezső Sima, 2007

Overview

1. Overview
2. Straightforward parallel decoding
3. Predecoding
4. Decoding with CISC/RISC conversion
    4.1 Overview
    4.2 Decoding into µops
    4.3 Decoding into macroops
5. Using a trace cache
6. Decoding with instruction grouping
    6.1 Overview
    6.2 Grouping of RISC instructions
    6.3 Grouping of CISC instructions

1. Overview

Decoding techniques used in superscalars

• Straightforward parallel decoding (1. gen. RISC superscalars)
• Predecoding (beginning with 2. gen. superscalars)
• Decoding with CISC/RISC conversion (beginning with 2. gen. superscalar CISCs)
    - Decoding into µops (Intel, beginning with the Pentium Pro)
    - Decoding into macroops (AMD, up to two µops, beginning with the K7)
• Using a trace cache (P4-family)
• Decoding with instruction grouping
    - Grouping of RISC instructions (POWER4, POWER5)
    - Grouping of CISC instructions (Pentium M, Core; K7 (Athlon), K8 (Hammer))

2. Straightforward parallel decoding

Figure 2.1: The PowerPC 601’s front end

Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug. 2004, http://arstechnica.com/articles

3. Predecoding (1)

Figure 3.1: Contrasting the decoding and instruction issues in a scalar and a 4-way superscalar processor

[Figure: in both cases instructions pass from the I-cache through an instruction buffer to the Decode / Issue / Check stage. Scalar issue, typical FX-pipeline layout: F, D/I, ... Superscalar issue: F, D, I, ...]

3. Predecoding (1)

Figure 3.2: The principle of predecoding

[Figure: second-level cache (or memory) → typically 128 bits/cycle → predecode unit → e.g. 148 bits/cycle → I-cache]

When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction.

AMD’s CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte).

Source: Sima, D. et al., „ACA”, Addison-Wesley 1997
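The two bandwidth figures quoted for the predecode unit (128 bits/cycle in, e.g. 148 bits/cycle out) are consistent with simple arithmetic. The following is a minimal sketch, assuming 32-bit fixed-length RISC instructions and 5 predecode bits per instruction (one value from the 4-7 range given above). The 128-bit fetch block used for the AMD cases is also an assumption, and the function names are hypothetical.

```python
def predecoded_width_per_instruction(fetch_bits, instr_bits, extra_bits):
    """Block width after appending `extra_bits` to every fixed-length instruction."""
    n_instr = fetch_bits // instr_bits
    return n_instr * (instr_bits + extra_bits)


def predecoded_width_per_byte(fetch_bits, extra_bits):
    """Block width after appending `extra_bits` to every byte (per-byte predecoding)."""
    n_bytes = fetch_bits // 8
    return n_bytes * (8 + extra_bits)


# RISC case: 4 instructions * (32 + 5) bits (instruction width and 5 bits are assumptions)
risc_width = predecoded_width_per_instruction(128, 32, 5)

# Per-byte cases with an assumed 128-bit (16-byte) block
k5_k6_width = predecoded_width_per_byte(128, 5)  # 16 * (8 + 5)
k7_k8_width = predecoded_width_per_byte(128, 3)  # 16 * (8 + 3)
```

Under these assumptions the RISC case gives 148 bits, matching the example on the slide; the per-byte cases show how much wider the predecoded stream becomes when bits are appended to every byte rather than to every instruction.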

3. Predecoding (2)

Figure 3.3: The introduction of predecoding


3. Predecoding (3)

Figure 3.4: Variable length instruction decoding in the Athlon

Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept. 2003, http://www.chip-architect.com

3. Predecoding (4)

Figure 3.5: Opteron’s instruction cache and decoding

Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com

4. Decoding with CISC/RISC conversion

4.1 Overview

Figure 4.1: Principle of decoding with CISC/RISC conversion

[Figure: CISC instructions → decoding with CISC/RISC conversion (into µops or macroops) → RISC core → retiring with RISC/CISC conversion → modification of the program state after RISC/CISC re-conversion. Examples: PPro, K6]

4.2 Decoding into µops (1)

Figure 4.2: The Microarchitecture of the Pentium Pro

Source: Shanley, T., „Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997


Figure 4.3: Basic misprediction pipeline of the Pentium III

Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

Figure 4.4: Decoding in AMD’s K6

Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor”, IEEE Computer Society Press, 1998


Figure 4.5: The Microarchitecture of the Pentium M (Yonah)


Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.


Figure 4.6: The Microarchitecture of the Core processor family

Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

4.3 Decoding into macroops (1)

Figure 4.7: The Microarchitecture of the AMD Athlon

Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998


Figure 4.8: Decoding in the Athlon (1)



Figure 4.9: Decoding in the Athlon (2)


Each MacroOp: 1 or 2 operations (OPs)

e.g.: ADD EAX, EBX → 1 ADD OP
      AND EAX, [EBX+16] → 1 LOAD OP + 1 AND OP

Up to 3 MacroOps per cycle with up to 3 FX + 2 L/S OPs (dual-ported D$!) per cycle
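The two MacroOp examples can be reproduced with a toy decoder. This is a minimal sketch, not AMD's decoding logic: it handles only two-operand instructions of the form `MNEMONIC dst, src`, emits a LOAD OP whenever the source is a memory operand in square brackets, and caps a decode cycle at 3 MacroOps. The names `MacroOp`, `decode_to_macroop` and `decode_cycle` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

MACROOPS_PER_CYCLE = 3  # from the slide: up to 3 MacroOps per cycle


@dataclass(frozen=True)
class MacroOp:
    text: str              # original x86 instruction
    ops: Tuple[str, ...]   # 1 or 2 operations (OPs)


def decode_to_macroop(instr: str) -> MacroOp:
    """Map one 'MNEMONIC dst, src' instruction to a MacroOp with 1 or 2 OPs."""
    mnemonic, operands = instr.split(None, 1)
    _dst, src = [part.strip() for part in operands.split(",", 1)]
    ops = []
    if src.startswith("["):        # memory source operand: needs a LOAD OP first
        ops.append("LOAD")
    ops.append(mnemonic.upper())   # the FX operation itself
    return MacroOp(instr, tuple(ops))


def decode_cycle(window: List[str]) -> List[MacroOp]:
    """Decode at most MACROOPS_PER_CYCLE instructions from the front of the window."""
    return [decode_to_macroop(i) for i in window[:MACROOPS_PER_CYCLE]]


reg_form = decode_to_macroop("ADD EAX, EBX")
mem_form = decode_to_macroop("AND EAX, [EBX+16]")
```

Here `reg_form.ops` holds a single ADD OP, while `mem_form.ops` holds a LOAD OP followed by an AND OP, mirroring the slide's examples.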



Figure 4.10: The Microarchitecture of the Hammer

Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001

5. Using a trace cache (1)

Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)


Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette)

Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001


Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott)

Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

6. Decoding with instruction grouping

6.1 Overview

Decoding with instruction grouping

• Grouping of RISC instructions: POWER4, POWER5
• Grouping of CISC instructions: Pentium M, Core arch.; K7 (Athlon), K8 (Hammer)

Operation of the Reorder Buffer (ROB)

[Figure: ROB lines with index 1 to 12, each holding lane 0, lane 1 and lane 2. Legend: out-of-order finished instructions, results still speculative; instructions being retired now; retired instructions, not speculative anymore.]

Figure 5.3: Instruction grouping in the K7 and K8

Source: de Vries, H., „Understanding the detailed Architecture of AMD’s 64 bit Core”, Sept.2003, http://www.chip-architect.com

Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated one line in the ROB.

The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed.
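The line-based retirement rule can be sketched in a few lines. This is an illustrative model under stated assumptions, not the K7/K8 implementation: MacroOps are identified by unique strings, and the class name `LineROB` and its methods are hypothetical. Only the 24 x 3 geometry and the retirement condition come from the slide.

```python
from collections import deque

ROB_LINES = 24   # from the slide
ROB_LANES = 3    # entries per line, from the slide


class LineROB:
    def __init__(self):
        self.lines = deque()   # oldest line on the left

    def allocate(self, macroops):
        """Allocate one line for the MacroOps decoded in one cycle; False if the ROB is full."""
        if not 1 <= len(macroops) <= ROB_LANES:
            raise ValueError("a line holds 1 to 3 MacroOps")
        if len(self.lines) == ROB_LINES:
            return False
        self.lines.append({m: False for m in macroops})  # False = not yet completed
        return True

    def complete(self, macroop):
        """Mark a MacroOp as completed; completion may happen out of order."""
        for line in self.lines:
            if macroop in line:
                line[macroop] = True
                return
        raise KeyError(macroop)

    def retire(self):
        """Retire the oldest line only if all of its MacroOps are completed."""
        if self.lines and all(self.lines[0].values()):
            return list(self.lines.popleft())
        return None


rob = LineROB()
rob.allocate(["m0", "m1", "m2"])
rob.allocate(["m3", "m4"])
rob.complete("m3")
rob.complete("m4")
blocked = rob.retire()          # younger line is done, but the oldest line is not
for m in ("m2", "m0", "m1"):
    rob.complete(m)
first = rob.retire()
second = rob.retire()
```

In this run the younger line finishes first but cannot retire until the oldest line has completed, which is the in-order retirement behaviour described above.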

6.2 Grouping of RISC instructions (1)

Figure 6.1: Out-of-order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007)

(The K8L scheduler has 8*3 entries vs. 6*3 in the K8)

Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L”, Aug. 2006.

[Figure labels: Schedulers, Decoders, EUs]

Figure 6.1: The principle of instruction grouping in IBM’s POWER4 and POWER5 processors

[Figure labels: instruction groups, issue queues, execution units (EUs), ROB, retire]

• Dispatch instruction groups in order, forward individual instructions to the issue queues
• Execute individual instructions out of order (ooo)
• Retire instruction groups in order, modify program state
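The three phases (in-order group dispatch, out-of-order execution of individual instructions, in-order group retirement) can be traced with a small simulation. This is a sketch under assumptions: groups are given as lists of unique instruction names, the out-of-order completion sequence is supplied by the caller, and the function name `retire_events` is hypothetical. Group sizes and formation rules of the real processors are not modelled.

```python
def retire_events(groups, completion_order):
    """Return (step, group_index) pairs showing when each group retires.

    groups: instruction groups in program (dispatch) order.
    completion_order: individual instructions in the order they finish executing.
    """
    group_of = {ins: g for g, grp in enumerate(groups) for ins in grp}
    remaining = [len(grp) for grp in groups]   # unfinished instructions per group
    next_to_retire = 0                         # groups retire strictly in order
    events = []
    for step, ins in enumerate(completion_order):
        remaining[group_of[ins]] -= 1
        # Retire as many of the oldest groups as are fully completed.
        while next_to_retire < len(groups) and remaining[next_to_retire] == 0:
            events.append((step, next_to_retire))
            next_to_retire += 1
    return events


# Group 1 finishes first (steps 0 and 1), but must wait for group 0.
events = retire_events([["a", "b"], ["c", "d"]], ["c", "d", "b", "a"])
```

In this example both groups retire at step 3: group 1 is complete after step 1, yet program state is only modified once the older group 0 has finished.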


Figure 6.2: Implementation of instruction grouping in IBM’s POWER5 processor

Source: Sinharoy, B. et al., „POWER5 system microarchitecture”, IBM J. Res. & Dev., July/Sept. 2005.

6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion)

x86 instructions: macro-ops
Internal instructions: μops

Macro-op fusion: combines two macro-ops into a single μop.

Specifically: x86 compare or test instructions are fused with x86 jumps to produce a single μop.

Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle.

In the Core architecture the max. decode bandwidth is 4+1 x86 instructions/cycle.

Macro-op fusion can reduce the number of μops by about 10%.

Introduced in the Core architecture.
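The "4+1" decode bandwidth follows from the fusion rule: four μops per cycle, one of which may cover two x86 instructions. The sketch below illustrates this under simplifying assumptions that are not from the slides: every unfused x86 instruction yields exactly one μop, the decoder count of 4 is inferred from the "4+1" figure, and only conditional jumps (mnemonics starting with J, excluding JMP) are treated as fusible. The names `is_fusible` and `decode_cycle` are hypothetical.

```python
DECODERS = 4              # assumption, inferred from the "4+1" figure
MAX_FUSIONS_PER_CYCLE = 1 # from the slide: one macro-op fusion per cycle


def is_fusible(first, second):
    """CMP or TEST followed by a conditional jump (simplified mnemonic check)."""
    m1 = first.split()[0].upper()
    m2 = second.split()[0].upper()
    return m1 in ("CMP", "TEST") and m2.startswith("J") and m2 != "JMP"


def decode_cycle(instrs):
    """Decode one cycle from the front of instrs; return (uops, x86_instructions_consumed)."""
    uops, i, fusions = [], 0, 0
    while len(uops) < DECODERS and i < len(instrs):
        can_fuse = (fusions < MAX_FUSIONS_PER_CYCLE
                    and i + 1 < len(instrs)
                    and is_fusible(instrs[i], instrs[i + 1]))
        if can_fuse:
            uops.append(instrs[i] + " + " + instrs[i + 1])  # two macro-ops, one μop
            i += 2
            fusions += 1
        else:
            uops.append(instrs[i])
            i += 1
    return uops, i


stream = ["MOV EAX, 1", "CMP EAX, EBX", "JNE loop",
          "ADD ECX, 1", "TEST EDX, EDX", "JZ done"]
uops, consumed = decode_cycle(stream)
```

With this stream the cycle produces 4 μops from 5 x86 instructions: the CMP/JNE pair is fused, while the later TEST is decoded on its own because the single fusion for that cycle has already been used.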

6.3 Grouping of CISC instructions (2)

Benefits:

• Fewer μops, hence increased performance

• ooo execution becomes more effective, as the instruction window now includes more (~10%) x86 instructions
