Stamatis Vassiliadis Symposium
Sept. 28, 2007
J. E. Smith
Future Superscalar Processors Based on Instruction Compounding
Future Microprocessors 2
Instruction Compounding (Fusing)
Instruction compounding, or “fusing,” has become a key idea in high-performance microprocessors.
“A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions”
“Instructions composing a compound instruction need not be consecutive.”
-- S. Vassiliadis et al., IBM Journal of Research and Development, Jan. 1994
The Future Processor: Three Key Aspects
Instruction compounding or fusing
• Based on S. Vassiliadis's work
• Employs compounding and 3-input ALU
Co-designed VM for dynamic translation/fusing
• Concealed from all software
• Optimized (fused) instructions held in code cache
Dual decoder front-end for fast startup
• Hardware front-end decoder for fast startup
• Software translator for sustained high performance
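The interplay between the hardware decoder, the software translator, and the code cache can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `DualDecoderVM`, the `translate` callback, and the hotness threshold are hypothetical and do not come from the slides, which say only that a hardware decoder handles startup and that optimized (fused) code is held in a code cache.

```python
HOT_THRESHOLD = 3  # hypothetical: executions before a block is translated


class DualDecoderVM:
    """Toy model: hardware decode first, software translation once hot."""

    def __init__(self, translate):
        self.translate = translate   # stand-in for the software translator
        self.code_cache = {}         # pc -> translated (fused) code
        self.exec_counts = {}        # pc -> number of hardware-decoded runs

    def fetch(self, pc, x86_block):
        # Translated code is served straight from the code cache.
        if pc in self.code_cache:
            return "code-cache", self.code_cache[pc]
        # Otherwise use the hardware x86 decoder and count the execution.
        count = self.exec_counts.get(pc, 0) + 1
        self.exec_counts[pc] = count
        if count >= HOT_THRESHOLD:
            # The block is now considered hot: translate it for later runs.
            self.code_cache[pc] = self.translate(x86_block)
        return "hw-decoder", x86_block
```

With a threshold of 3, the first three fetches of a block go through the hardware path and later fetches hit the code cache.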
Processor Micro-architecture
[Figure: processor block diagram. Block labels: Conventional Memory, Concealed Memory, Data, x86 Code, Code Cache (V-code), Translation Software, I-Cache, Vertical x86 Decoder, Horizontal V-code Decoder, Pipelined Rename/Dispatch, Issue Buffer, Pipelined Execution Backend. Path labels: x86 code, V-code, H-code.]
Fusible Instruction Set
RISC-ops with unique features:
• A fusible bit per instruction fuses two dependent instructions
• Dense instruction encoding, 16/32-bit ISA design
Special Features to Support the x86 ISA
• Condition codes
• Addressing modes
• Aware of long immediate & displacement values
Fusible ISA Instruction Formats (F = fusible bit)
Core 32-bit instruction formats:
• F | 10b opcode | 21-bit Immediate / Displacement
• F | 10b opcode | 5b Rds | 16-bit Immediate / Displacement
• F | 10b opcode | 5b Rds | 5b Rsr | 11b Immediate / Disp
• F | 16-bit opcode | 5b Rds | 5b Rsr | 5b Rsr
Add-on 16-bit instruction formats for code density:
• F | 5b op | 10b Immd / Disp
• F | 5b op | 5b Rds | 5b Rsr (two formats with these fields)
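To make the role of the fusible bit concrete, the sketch below packs and unpacks one core 32-bit format (F, 10-bit opcode, two 5-bit registers, 11-bit immediate). The slide gives only the field widths, so the bit positions chosen here (F in the top bit, then opcode, Rds, Rsr, immediate) and the function names `encode` and `decode` are assumptions for illustration.

```python
# Hypothetical layout, most significant bit first:
#   F(1) | opcode(10) | Rds(5) | Rsr(5) | imm(11)   = 32 bits
F_SHIFT, OP_SHIFT, RDS_SHIFT, RSR_SHIFT = 31, 21, 16, 11


def encode(fuse, opcode, rds, rsr, imm):
    """Pack the fields into one 32-bit word, checking each field width."""
    for value, bits in ((fuse, 1), (opcode, 10), (rds, 5), (rsr, 5), (imm, 11)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    return ((fuse << F_SHIFT) | (opcode << OP_SHIFT) |
            (rds << RDS_SHIFT) | (rsr << RSR_SHIFT) | imm)


def decode(word):
    """Unpack a 32-bit word into its fields."""
    return {
        "fuse": (word >> F_SHIFT) & 0x1,      # 1 = fused with the next op
        "opcode": (word >> OP_SHIFT) & 0x3FF,
        "rds": (word >> RDS_SHIFT) & 0x1F,
        "rsr": (word >> RSR_SHIFT) & 0x1F,
        "imm": word & 0x7FF,
    }
```

A decoder only has to inspect one bit per instruction to know whether the next instruction belongs to the same macro-op, which is what keeps fuse detection cheap in the front end.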
Microarchitecture: Macro-op Execution
• Enhanced OOO superscalar microarchitecture
– Process & execute fused macro-ops as single instructions throughout the entire pipeline
[Figure: macro-op pipeline. Stage labels: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire. Annotations: fuse bit; 3-1 ALUs; cache ports; increased effective bandwidth; pipelined scheduling, wider effective window, higher effective bandwidth; higher effective bandwidth, simpler forward logic; simpler ROB tracking.]
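The idea behind executing a dependent pair as one unit on a 3-input ALU can be modelled functionally: the head's result feeds the tail directly, so the pair takes three external inputs and produces both results together. The sketch below is a behavioural model only; the function name, the operation table, and the 32-bit width are assumptions, and nothing here reflects actual circuit timing.

```python
import operator

MASK32 = 0xFFFFFFFF  # assumed 32-bit datapath

# Small illustrative set of single-cycle ALU operations.
ALU_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
}


def execute_macro_op(head_op, a, b, tail_op, c):
    """Evaluate head: t = a head_op b, then tail: r = t tail_op c.

    Returns (head_result, tail_result), both wrapped to 32 bits.
    """
    head_result = ALU_OPS[head_op](a, b) & MASK32
    tail_result = ALU_OPS[tail_op](head_result, c) & MASK32
    return head_result, tail_result


# An ADD feeding an AND, evaluated as one fused pair.
print(execute_macro_op("ADD", 0x1234, 1, "AND", 0x7F))  # prints (4661, 53)
```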
Macro-op Fusing Algorithm
Objectives:
• Maximize fused dependent pairs
• Simple & fast
Heuristics:
• Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops.
• Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.
• Simplicity: 2 or fewer distinct register operands per fused pair
Solution: two-pass fusing algorithm
• The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
• The 2nd pass considers all kinds of RISC-ops as tail candidates.
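The two passes can be sketched as a simplified pairing loop. This is not the authors' implementation: the `Op` record, the `MAX_DISTANCE` window standing in for “close”, and the reading of the simplicity rule as “at most two distinct source registers besides the forwarded value” are all assumptions, and the register renaming and reordering legality checks a real translator needs are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_DISTANCE = 4  # hypothetical window for "close" instructions


@dataclass
class Op:
    name: str
    kind: str                  # "ALU", "LD", "ST", "BR", ...
    dst: Optional[str]         # destination register, if any
    srcs: Tuple[str, ...]      # source registers
    single_cycle: bool = True


def can_fuse(ops, h, t, partner):
    head, tail = ops[h], ops[t]
    if partner[h] is not None or partner[t] is not None:
        return False           # each op joins at most one pair
    if head.kind != "ALU" or not head.single_cycle:
        return False           # only single-cycle ALU ops can be a head
    if head.dst is None or head.dst not in tail.srcs:
        return False           # tail must depend on the head
    if any(ops[k].dst == head.dst for k in range(h + 1, t)):
        return False           # value was overwritten in between
    regs = set(head.srcs) | (set(tail.srcs) - {head.dst})
    return len(regs) <= 2      # assumed reading of the simplicity rule


def fuse(ops):
    """Return sorted (head_index, tail_index) pairs."""
    partner = [None] * len(ops)
    for alu_tails_only in (True, False):   # pass 1, then pass 2
        for t, tail in enumerate(ops):
            if alu_tails_only and not (tail.kind == "ALU" and tail.single_cycle):
                continue
            # Look backward, nearest candidate head first.
            for h in range(t - 1, max(-1, t - 1 - MAX_DISTANCE), -1):
                if can_fuse(ops, h, t, partner):
                    partner[h], partner[t] = t, h
                    break
    return sorted((h, t) for h, t in enumerate(partner)
                  if t is not None and h < t)


example = [
    Op("ADD", "ALU", "Reax", ("Redi",)),
    Op("ST", "ST", None, ("Reax", "R22")),
    Op("LD.zx", "LD", "Rebx", ("Rebp", "Recx")),
    Op("AND", "ALU", "Reax", ("Reax",)),
    Op("ADD", "ALU", "R17", ("Reax", "Resi")),
    Op("LD", "LD", "Redx", ("R17",)),
]
print(fuse(example))  # prints [(0, 3), (4, 5)]
```

On this six-op sequence the first pass pairs the ADD with the AND, and the second pass pairs the remaining ADD with the load that consumes its result.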
Fusing Algorithm: Example
x86 asm:
-----------------------------------------------------------
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
-----------------------------------------------------
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]
After fusing: Macro-ops
-----------------------------------------------------
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17+0x7c]
Instruction Fusing Profile
[Figure: stacked bar chart. Y-axis: Percentage of Dynamic Instructions (0% to 100%). Categories: ALU, FP or NOPs, BR, ST, LD, Fused.]
55+% of RISC-ops are fused, which increases effective ILP by 1.4x. Only 6% single-cycle ALU ops are left un-fused.
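One plausible way to relate the two numbers, assuming the ILP figure is the ratio of RISC-ops to issued entities when every fused pair occupies a single slot: if a fraction f of RISC-ops are fused, the pipeline handles 1 - f/2 entities per original RISC-op. The slide does not state how the 1.4 figure was derived, so this is a back-of-the-envelope check rather than the authors' calculation, and the function name is hypothetical.

```python
def effective_ilp_gain(fused_fraction):
    """RISC-ops per issued entity when fused pairs share one slot."""
    if not 0.0 <= fused_fraction <= 1.0:
        raise ValueError("fused_fraction must be between 0 and 1")
    # Each fused pair turns two RISC-ops into one entity.
    return 1.0 / (1.0 - fused_fraction / 2.0)


print(f"{effective_ilp_gain(0.55):.2f}")  # prints 1.38
```

With 55% fused the ratio is about 1.38, which is consistent with a rounded 1.4 for “55+%”.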
Other DBT Software Profile
Of all fused macro-ops:
• 50% ALU-ALU pairs.
• 30% fused condition test & conditional branch pairs.
• Others mostly ALU-MEM op pairs.
Of all fused macro-ops:
• 70+% are inter-x86-instruction fusion.
• 46% access two distinct source registers.
• Only 15% (6% of all instruction entities) write two distinct destination registers.
Translation Overhead Profile
• About 1000 instructions per translated hotspot instruction.
Co-designed x86 Processor Performance
[Figure: line chart. X-axis: issue window size (16, 32, 48, 64). Y-axis: Relative IPC performance (0.5 to 1.3). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]
Dual Decoder Front-End
[Figure: same block diagram as on the Processor Micro-architecture slide, showing the Vertical x86 Decoder and the Horizontal V-code Decoder between the I-Cache and Pipelined Rename/Dispatch, with paths labeled x86 code, V-code, and H-code.]
Evaluation: Startup Performance
Activity of HW x86 Decoder
[Figure: line chart. X-axis: Time: Cycles (log scale, 1 to 100,000,000, then Finish). Y-axis: HW Assist Activity (%) (0 to 100). Series: Superscalar, VM.soft, VM.dual.]
Important Research Issues
Profiling
• Probe insertion via software translator not feasible
Multi-core
• Shared code cache
• SMT designs
Memory consistency
• Stores can be done in-order
• Re-scheduled loads may be important for performance
Precise traps
• Potential HW assist?