Stamatis Vassiliadis Symposium, Sept. 28, 2007, J. E. Smith: Future Superscalar Processors Based on Instruction Compounding


Page 1: Stamatis Vassiliadis Symposium Sept. 28, 2007 J. E. Smith

Stamatis Vassiliadis Symposium

Sept. 28, 2007

J. E. Smith

Future Superscalar Processors Based on Instruction Compounding

Page 2: Instruction Compounding (Fusing)

Instruction compounding, or “fusing”, has become a key idea in high-performance microprocessors.

“A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions”

“Instructions composing a compound instruction need not be consecutive.”

-- S. Vassiliadis et al., IBM Journal of Research and Development, Jan. 1994

Page 3: The Future Processor: Three Key Aspects

Instruction compounding or fusing

• Based on S. Vassiliadis's work

• Employs compounding and a 3-input ALU

Co-designed VM for dynamic translation/fusing

• Concealed from all software

• Optimized (fused) instructions held in the code cache

Dual decoder front-end for fast startup

• Hardware front-end decoder for fast startup

• Software translator for sustained high performance
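The interplay of the two decoders and the code cache can be sketched as a toy model. This is only an illustration under stated assumptions: the class `CoDesignedVM`, the method names, and the `HOT_THRESHOLD` value are hypothetical and do not come from the slides, which do not say how hotspots are detected.

```python
from dataclasses import dataclass, field

HOT_THRESHOLD = 3  # hypothetical: executions before a block counts as a hotspot


@dataclass
class CoDesignedVM:
    """Toy model of the two execution paths; not the actual VM software."""
    code_cache: dict = field(default_factory=dict)   # block address -> fused code
    exec_counts: dict = field(default_factory=dict)  # block address -> times seen

    def translate(self, block_addr: int) -> str:
        # Stand-in for the software translator that produces fused code.
        return f"fused-code@{block_addr:#x}"

    def run_block(self, block_addr: int) -> str:
        """Return the name of the path that executes this block."""
        if block_addr in self.code_cache:
            return "code-cache"        # optimized (fused) code, sustained performance
        count = self.exec_counts.get(block_addr, 0) + 1
        self.exec_counts[block_addr] = count
        if count >= HOT_THRESHOLD:
            # Hot block: translate it so later executions hit the code cache.
            self.code_cache[block_addr] = self.translate(block_addr)
        return "hw-decoder"            # hardware x86 decoder, fast startup


vm = CoDesignedVM()
paths = [vm.run_block(0x400000) for _ in range(5)]
print(paths)
```

In this model the first executions of a block go through the hardware decoder with no translation delay, and only blocks that prove hot pay the translation cost.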

Page 4: Processor Micro-architecture

[Figure: processor block diagram. Blocks: Conventional Memory (Data, x86 Code), Concealed Memory (Code Cache (V-code), Translation Software), I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Path labels: x86 code, V-code, H-code.]

Page 5: Fusible Instruction Set

RISC-ops with unique features:

• A fusible bit per instruction fuses two dependent instructions

• Dense instruction encoding, 16/32-bit ISA design

Special Features to Support the x86 ISA

• Condition codes

• Addressing modes

• Aware of long immediate & displacement values

[Figure: Fusible ISA Instruction Formats. Every format starts with a fusible bit (F); fields are listed as labeled in the figure.]

Core 32-bit instruction formats

• F, 10b opcode, 21-bit immediate / displacement

• F, 10b opcode, 16-bit immediate / displacement, 5b Rds

• F, 10b opcode, 11b immediate / displacement, 5b Rsr, 5b Rds

• F, 16-bit opcode, 5b Rsr, 5b Rsr, 5b Rds

Add-on 16-bit instruction formats for code density

• F, 5b op, 10b immediate / displacement

• F, 5b op, 5b Rsr, 5b Rds (two formats carry these field labels)
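As a rough illustration of how a fusible bit can share a 32-bit word with the other fields, the sketch below packs and unpacks the format with a 10-bit opcode, an 11-bit immediate and two 5-bit register fields. The bit positions (F in the top bit, then opcode, immediate, Rsr, Rds) and the function names `encode` and `decode` are assumptions made for this example; the slide gives only the field widths.

```python
def encode(fusible: bool, opcode: int, imm: int, rsr: int, rds: int) -> int:
    """Pack one instruction; the bit layout is assumed, not taken from the slide."""
    if not 0 <= opcode < (1 << 10):
        raise ValueError("opcode must fit in 10 bits")
    if not 0 <= imm < (1 << 11):
        raise ValueError("immediate must fit in 11 bits")
    if not (0 <= rsr < 32 and 0 <= rds < 32):
        raise ValueError("register numbers must fit in 5 bits")
    # 1 + 10 + 11 + 5 + 5 = 32 bits in total.
    return (int(fusible) << 31) | (opcode << 21) | (imm << 10) | (rsr << 5) | rds


def decode(word: int) -> dict:
    """Unpack a word produced by encode()."""
    return {
        "fusible": bool((word >> 31) & 0x1),
        "opcode": (word >> 21) & 0x3FF,
        "imm": (word >> 10) & 0x7FF,
        "rsr": (word >> 5) & 0x1F,
        "rds": word & 0x1F,
    }


word = encode(fusible=True, opcode=0x155, imm=0x7C, rsr=3, rds=17)
fields = decode(word)
print(f"{word:#010x}", fields)
```

With a layout like this, the fuse stage only needs to inspect a single bit of the head instruction to know that the following instruction is its tail.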

Page 6: Microarchitecture: Macro-op Execution

• Enhanced OOO superscalar microarchitecture

– Process & execute fused macro-ops as single instructions throughout the entire pipeline

[Figure: macro-op execution pipeline. Stage labels: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire. Other labels: fuse bit, cache ports, 3-1 ALUs. Callouts: "Increased effective bandwidth"; "Pipelined scheduling; wider effective window; higher effective bandwidth"; "Higher effective bandwidth"; "Higher effective bandwidth; simpler forward logic"; "Simpler ROB tracking".]

Page 7: Macro-op Fusing Algorithm

Objectives:

• Maximize fused dependent pairs

• Simple & fast

Heuristics:

• Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops.

• Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.

• Simplicity: 2 or fewer distinct register operands per fused pair

Solution: two-pass fusing algorithm

• The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.

• The 2nd pass considers all kinds of RISC-ops as tail candidates.
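A minimal sketch of the two-pass pairing, under several assumptions that are not in the slides: the `Op` record, the `MAX_DISTANCE` window for "close", the operand limit counted over source registers, the rule that memory and branch tails are never moved past other memory or branch ops, and the premise that the translator renames destinations with temporaries so only true dependences matter. It selects pairs only and does not rewrite the code. The input is the six RISC-op sequence from the fusing example on the next page.

```python
from dataclasses import dataclass
from typing import Optional

MAX_DISTANCE = 4  # hypothetical window for "close" instructions


@dataclass
class Op:
    kind: str                 # "ALU", "LD", "ST" or "BR"
    dst: Optional[str]        # destination register, if any
    srcs: tuple               # source registers
    single_cycle: bool = True


def producers(ops):
    """For each op, map each source register to the index of its last writer."""
    last_writer, result = {}, []
    for i, op in enumerate(ops):
        result.append({r: last_writer[r] for r in op.srcs if r in last_writer})
        if op.dst is not None:
            last_writer[op.dst] = i
    return result


def can_fuse(ops, prod, head, tail):
    h, t = ops[head], ops[tail]
    if not (h.kind == "ALU" and h.single_cycle):
        return False              # only single-cycle ALU ops can be a head
    if tail - head > MAX_DISTANCE or head not in prod[tail].values():
        return False              # must be close and truly dependent
    if any(p > head for p in prod[tail].values()):
        return False              # another input is produced after the head
    if t.kind != "ALU" and any(ops[k].kind != "ALU" for k in range(head + 1, tail)):
        return False              # assumed rule: keep memory/branch order
    distinct = set(h.srcs) | (set(t.srcs) - {h.dst})
    return len(distinct) <= 2     # simplicity heuristic


def fuse(ops):
    """Return sorted (head, tail) index pairs chosen by the two passes."""
    prod, used, pairs = producers(ops), set(), []
    for alu_only in (True, False):    # pass 1: ALU tails, pass 2: any tail
        for tail, op in enumerate(ops):
            is_alu = op.kind == "ALU" and op.single_cycle
            if tail in used or (alu_only and not is_alu):
                continue
            # Look backward for a head, nearest producer first.
            for head in sorted(set(prod[tail].values()), reverse=True):
                if head not in used and can_fuse(ops, prod, head, tail):
                    used.update((head, tail))
                    pairs.append((head, tail))
                    break
    return sorted(pairs)


ops = [
    Op("ALU", "Reax", ("Redi",)),          # ADD Reax, Redi, 1
    Op("ST", None, ("Reax", "R22")),       # ST Reax, mem[R22]
    Op("LD", "Rebx", ("Rebp", "Recx")),    # LD.zx Rebx, mem[Rebp + Recx << 1]
    Op("ALU", "Reax", ("Reax",)),          # AND Reax, 0000007f
    Op("ALU", "R17", ("Reax", "Resi")),    # ADD R17, Reax, Resi
    Op("LD", "Redx", ("R17",)),            # LD Redx, mem[R17 + 0x7c]
]
print(fuse(ops))  # [(0, 3), (4, 5)]
```

The first pass pairs the ADD with the AND (indices 0 and 3), and the second pass pairs the address ADD with its load (indices 4 and 5), the same two macro-ops the example slide shows.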

Page 8: Fusing Algorithm: Example

x86 asm:

-----------------------------------------------------------

1. lea eax, DS:[edi + 01]

2. mov [DS:080b8658], eax

3. movzx ebx, SS:[ebp + ecx << 1]

4. and eax, 0000007f

5. mov edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:

-----------------------------------------------------------

1. ADD Reax, Redi, 1

2. ST Reax, mem[R22]

3. LD.zx Rebx, mem[Rebp + Recx << 1]

4. AND Reax, 0000007f

5. ADD R17, Reax, Resi

6. LD Redx, mem[R17 + 0x7c]

After fusing: Macro-ops

-----------------------------------------------------------

1. ADD R18, Redi, 1 :: AND Reax, R18, 007f

2. ST R18, mem[R22]

3. LD.zx Rebx, mem[Rebp + Recx << 1]

4. ADD R17, Reax, Resi :: LD Redx, mem[R17 + 0x7c]

Page 9: Instruction Fusing Profile

[Figure: stacked bar chart. Y-axis: Percentage of Dynamic Instructions (0% to 100%). Categories: ALU, FP or NOPs, BR, ST, LD, Fused.]

• With 55+% of RISC-ops fused, effective ILP increases by a factor of about 1.4.

• Only 6% single-cycle ALU ops are left un-fused.
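The 1.4 figure can be checked with a short calculation. The sketch assumes that every fused RISC-op belongs to a two-op macro-op and that the gain is measured as the reduction in instruction entities the pipeline handles; the function name `effective_ilp_gain` is made up for this example.

```python
def effective_ilp_gain(fused_fraction: float) -> float:
    """Ratio of RISC-ops to instruction entities when fused ops pair up 2-to-1."""
    # Unfused ops stay single entities; each pair of fused ops becomes one entity.
    entities = (1.0 - fused_fraction) + fused_fraction / 2.0
    return 1.0 / entities


print(round(effective_ilp_gain(0.55), 2))  # 1.38
```

With 55% of RISC-ops fused, 100 RISC-ops become 72.5 entities, which gives the factor of roughly 1.4.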

Page 10: Other DBT Software Profile

Of all fused macro-ops:

• 50% are ALU-ALU pairs.

• 30% are fused condition test & conditional branch pairs.

• The others are mostly ALU-MEM op pairs.

Of all fused macro-ops:

• 70+% are inter-x86-instruction fusions.

• 46% access two distinct source registers.

• Only 15% (6% of all instruction entities) write two distinct destination registers.

Translation Overhead Profile

• About 1000 instructions per translated hotspot instruction.
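The two percentages in the destination-register bullet imply what share of all instruction entities are fused macro-ops, and that share can be compared with the 55% fused figure from the fusing profile. The sketch below does that arithmetic; it assumes two RISC-ops per macro-op, and the variable and function names are made up for illustration.

```python
# 15% of fused macro-ops equal 6% of all instruction entities,
# so macro-ops make up 0.06 / 0.15 of all entities.
macro_op_share = 0.06 / 0.15


def macro_op_share_from_fused(fused_fraction: float) -> float:
    """Share of entities that are macro-ops, assuming two RISC-ops per macro-op."""
    macro_ops = fused_fraction / 2.0
    entities = (1.0 - fused_fraction) + macro_ops
    return macro_ops / entities


print(round(macro_op_share, 2))                   # 0.4
print(round(macro_op_share_from_fused(0.55), 2))  # 0.38
```

The two estimates, about 40% and about 38%, agree to within the rounding of the slide figures.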

Page 11: Co-designed x86 Processor Performance

[Figure: line chart. X-axis: issue window size (16, 32, 48, 64). Y-axis: Relative IPC performance (0.5 to 1.3). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]

Page 12: Dual Decoder Front-End

[Figure: processor block diagram. Blocks: Conventional Memory (Data, x86 Code), Concealed Memory (Code Cache (V-code), Translation Software), I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Path labels: x86 code, V-code, H-code.]

Page 13: Evaluation: Startup Performance

Page 14: Activity of HW x86 Decoder

[Figure: line chart. X-axis: Time: Cycles (log scale, 1 to 100,000,000, then Finish). Y-axis: HW Assist Activity (%) (0 to 100). Series: Superscalar, VM.soft, VM.dual.]

Page 15: Important Research Issues

Profiling

• Probe insertion via software translator not feasible

Multi-core

• Shared code cache

• SMT designs

Memory consistency

• Stores can be done in-order

• Re-scheduled loads may be important for performance

Precise traps

• Potential HW assist?