stamatis vassiliadis symposium sept. 28, 2007 j. e. smith

Stamatis Vassiliadis Symposium

Sept. 28, 2007

J. E. Smith

Future Superscalar Processors Based on Instruction Compounding


Instruction Compounding (Fusing)

Instruction compounding, or “fusing”, has become a key idea in high-performance microprocessors.

“A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions”

“Instructions composing a compound instruction need not be consecutive.”

-- S. Vassiliadis et al. IBM Journal of R and D, Jan. 1994


The Future Processor: Three Key Aspects

Instruction compounding or fusing
• Based on S. Vassiliadis's work
• Employs compounding and 3-input ALU

Co-designed VM for dynamic translation/fusing
• Concealed from all software
• Optimized (fused) instructions held in code cache

Dual decoder front-end for fast startup
• Hardware front-end decoder for fast startup
• Software translator for sustained high performance


Processor Micro-architecture

[Figure: processor block diagram. Labeled components: Conventional Memory, Concealed Memory, Data, x86 Code, Code Cache (V-code), Translation Software, I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Labeled code paths: x86 code, V-code, H-code.]


Fusible Instruction Set

RISC-ops with unique features:

• A fusible bit per instruction fuses two dependent instructions

• Dense instruction encoding, 16/32-bit ISA design

Special Features to Support the x86 ISA

• Condition codes

• Addressing modes

• Aware of long immediate & displacement values

Fusible ISA Instruction Formats (every format carries the fusible bit F)

Core 32-bit instruction formats
• F, 10b opcode, 21-bit immediate / displacement
• F, 10b opcode, 16-bit immediate / displacement, 5b Rds
• F, 10b opcode, 11b immediate / displacement, 5b Rsr, 5b Rds
• F, 16-bit opcode, 5b Rsr, 5b Rsr, 5b Rds

Add-on 16-bit instruction formats for code density
• F, 5b op, 10b immediate / displacement
• F, 5b op, 5b Rsr, 5b Rds
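As a rough illustration of how the fusible bit could share a 32-bit word with the other fields, the sketch below packs and unpacks the format with a 10b opcode, an 11b immediate and two 5b register fields. The slide gives field widths only; the bit positions (F in the most significant bit, fields packed left to right) and all names such as `FORMAT_32_REG_REG_IMM`, `pack` and `unpack` are assumptions made for this example.

```python
# Hypothetical layout: widths come from the slide, bit positions are assumed.
FORMAT_32_REG_REG_IMM = (
    ("fusible", 1),   # F bit: this op is fused with the op that follows
    ("opcode", 10),
    ("imm", 11),
    ("rsr", 5),
    ("rds", 5),
)


def width(fields):
    """Total number of bits used by a format."""
    return sum(bits for _, bits in fields)


def pack(fields, values):
    """Pack named field values into one integer, first field most significant."""
    word = 0
    for name, bits in fields:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        word = (word << bits) | value
    return word


def unpack(fields, word):
    """Inverse of pack: split an integer back into named fields."""
    out = {}
    for name, bits in reversed(fields):
        out[name] = word & ((1 << bits) - 1)
        word >>= bits
    return out


example = {"fusible": 1, "opcode": 0x12, "imm": 0x7C, "rsr": 3, "rds": 17}
encoded = pack(FORMAT_32_REG_REG_IMM, example)
decoded = unpack(FORMAT_32_REG_REG_IMM, encoded)
```

The widths sum to 32 (1 + 10 + 11 + 5 + 5), which is why a single flag bit is enough to mark a fused pair without widening the instruction word.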


Microarchitecture: Macro-op Execution

• Enhanced OOO superscalar microarchitecture
– Process & execute fused macro-ops as single instructions throughout the entire pipeline

[Figure: macro-op execution pipeline. Stages: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire. Labels: fuse bit, cache ports, 3-1 ALUs. Annotations: increased effective bandwidth; pipelined scheduling; wider effective window; higher effective bandwidth; simpler forward logic; simpler ROB tracking.]


Macro-op Fusing Algorithm

Objectives:
• Maximize fused dependent pairs
• Simple & Fast

Heuristics:
• Pipelined Scheduler: Only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops
• Criticality: Fuse instructions that are “close” in the original sequence. ALU-ops criticality is easier to estimate.
• Simplicity: 2 or fewer distinct register operands per fused pair

Solution: Two-pass Fusing Algorithm:
• The 1st pass, forward scan, prioritizes ALU ops, i.e. for each ALU-op tail candidate, look backward in the scan for its head
• The 2nd pass considers all kinds of RISC-ops as tail candidates


Fusing Algorithm: Example

x86 asm:

-----------------------------------------------------------

1. lea eax, DS:[edi + 01]

2. mov [DS:080b8658], eax

3. movzx ebx, SS:[ebp + ecx << 1]

4. and eax, 0000007f

5. mov edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:
-----------------------------------------------------
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17+0x7c]


Instruction Fusing Profile

[Figure: stacked bar chart. Y-axis: Percentage of Dynamic Instructions (0% to 100%). Categories: ALU, FP or NOPs, BR, ST, LD, Fused.]

• More than 55% of RISC-ops are fused, which increases effective ILP by a factor of about 1.4.
• Only 6% single-cycle ALU ops are left un-fused.
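One way to read the 1.4 figure is as simple arithmetic, assuming "fused" means that 55% of dynamic RISC-ops sit inside two-op macro-ops and that each macro-op occupies one pipeline slot. The function names below are illustrative only.

```python
def macro_ops_per_risc_op(fused_fraction):
    """Pipeline entities per RISC-op: un-fused ops count 1, fused ops count 1/2."""
    return (1.0 - fused_fraction) + fused_fraction / 2.0


def effective_width_gain(fused_fraction):
    """RISC-ops carried per pipeline slot, relative to a machine with no fusing."""
    if not 0.0 <= fused_fraction <= 1.0:
        raise ValueError("fused_fraction must be between 0 and 1")
    return 1.0 / macro_ops_per_risc_op(fused_fraction)


gain = effective_width_gain(0.55)
print(round(gain, 2))  # 1.38
```

At a fused fraction of 0.55 the pipeline handles 0.725 entities per RISC-op, so each slot carries about 1.38 RISC-ops, consistent with the roughly 1.4 quoted on the slide.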


Other DBT Software Profile

Of all fused macro-ops:
• 50% ALU-ALU pairs.
• 30% fused condition test & conditional branch pairs.
• Others mostly ALU-MEM ops pairs.

Of all fused macro-ops:
• 70+% are inter-x86-instruction fusion.
• 46% access two distinct source registers.
• Only 15% (6% of all instruction entities) write two distinct destination registers.

Translation Overhead Profile
• About 1000 instructions per translated hotspot instruction.


Co-designed x86 Processor Performance

[Figure: line chart. X-axis: issue window size (16, 32, 48, 64). Y-axis: Relative IPC performance (0.5 to 1.3). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]


Dual Decoder Front-End

[Figure: processor block diagram. Labeled components: Conventional Memory, Concealed Memory, Data, x86 Code, Code Cache (V-code), Translation Software, I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Labeled code paths: x86 code, V-code, H-code.]


Evaluation: Startup Performance


Activity of HW x86 Decoder

[Figure: X-axis: Time: Cycles (log scale, 1 to 100,000,000, then Finish). Y-axis: HW Assist Activity (%) (0 to 100). Legend: Superscalar, VM.soft, VM.dual.]


Important Research Issues

Profiling
• Probe insertion via software translator not feasible

Multi-core
• Shared code cache
• SMT designs

Memory consistency
• Stores can be done in-order
• Re-scheduled loads may be important for performance

Precise traps
• Potential HW assist?
