heads and tails a variable-length instruction format supporting parallel fetch and decode
DESCRIPTION
Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode. Heidi Pan and Krste Asanović MIT Laboratory for Computer Science CASES Conference, Nov. 2001. Motivation. Tight space constraints Cost, power consumption, space constraints Program code size - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/1.jpg)
Heads and TailsA Variable-Length Instruction Format Supporting Parallel Fetch and Decode
Heidi Pan and Krste AsanovićMIT Laboratory for Computer Science
CASES Conference, Nov. 2001
![Page 2: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/2.jpg)
Motivation• Tight space constraints
• Cost, power consumption, space constraints• Program code size• Variable-length instructions: more compact but less efficient to
fetch and decode
• High performance• Deep pipelines or superscalar issue• Fixed-length instructions: easy to fetch and decode but less compact
• Heads and Tails (HAT) instruction format• Easy to fetch and decode AND compact
![Page 3: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/3.jpg)
Related Work• 16-bit version of existing RISC ISAs• Compressed instructions in main memory• Dictionary compression• CISC
![Page 4: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/4.jpg)
16-Bit Versions• Examples
• MIPS16 (MIPS), Thumb (Arm)
• Feature(s)• Dynamic switching between full-width & half-width
![Page 5: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/5.jpg)
16-Bit Versions, cont’d.• Advantages
• Simple decompression of just mapping 16-bit to 32-bit instructions
• Static code size reduced by ~30-40%
• Disadvantages• Can only encode limited subset of operations and
operands; more dynamic instructions needed• Shorter instructions can sometimes compensate for
the increased number of instructions, but performance of systems with instruction cache reduced by ~20%
![Page 6: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/6.jpg)
Compression in Memory• Examples
• CCRP, Kemp, Lekatsas, etc.
• Feature(s)• Hold compressed instructions in memory
then decompress when refilling cache
![Page 7: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/7.jpg)
Compression in Memory, cont’d.• Advantages
• Processor unchanged (see regular instructions)• Avoids latency & energy consumption of
decompression on cache hits
• Disadvantages• Decrease effective capacity of cache & increase
energy used to fetch cached instructions• Cache miss latencies increase
– Translate pc ; block decompressed sequentially
![Page 8: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/8.jpg)
Dictionary Compression• Examples
• Araujo, Benini, Lefurgy, Liao, etc.
• Features• Fixed-length code words in instruction stream point
to a dictionary holding common instruction sequences
• Branch address modified to point in compressed instruction stream
![Page 9: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/9.jpg)
Dictionary Compression, cont’d.• Advantage(s)
• Decompression is just fast table lookup
• Disadvantages• Table fetch adds latency to pipeline, increasing
branch mispredict penalties• Variable-length codewords interleaved with
uncompressed instructions• More energy to fetch codeword on top of full-length
instruction
![Page 10: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/10.jpg)
CISC
• Examples• x86, VAX
• Feature(s)• More compact base instruction set
• Advantage(s)• Don’t need to dynamically compress and
decompressing instructions
![Page 11: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/11.jpg)
CISC cont’d.• Disadvantages
• Not designed for parallel fetch and decode
• Solutions• P6: brute-force strategy of speculative decodes at
every byte position; wastes energy• AMD Athlon: predecodes instruction during cache
refill to mark boundaries between instructions; still need several cycles after instruction fetch to scan & align
• Pentium-4: caches decoded micro-ops in trace cache; but cache misses longer latency and still full-size micro-ops
![Page 12: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/12.jpg)
Heads and Tails Design Goals
• Variable-length instructions that are easily fetched and decoded
• Compact instructions in memory and cache• Format applicable for both compressing
existing fixed-length ISA or creating new variable-length ISA
![Page 13: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/13.jpg)
Heads and Tails Format
• Each instruction split into two portions:fixed-length head & variable-length tail
• Multiple instructions packed into a fixed-length bundle
• A cache line can have multiple bundles
![Page 14: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/14.jpg)
Heads and Tails Format
5 H0 H1 H2 H3 H4 H5 T4 T3 T2 T0
4 H0 H1 H2 H3 H4 T4 T3 T2 T1 T0
6 H0 H1 H2 H3 H4 H5 H6 T6 T4 T3 T1 T0
unused
last instr #heads
tails
not all heads must have tails tails at fixed granularity granularity of tails independent of size of heads
![Page 15: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/15.jpg)
Heads and Tails Format
bundle # instruction #
5 H0 H1 H2 H3 H4 H5 T4 T3 T2 T0
4 H0 H1 H2 H3 H4 T4 T3 T2 T1 T0
6 H0 H1 H2 H3 H4 H5 H6 T6 T4 T3 T1 T0
last instr #heads
tails
PC
sequential: pc incremented end of bundle: bundle # incremented; inst # reset to 0 branch: inst # checked
![Page 16: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/16.jpg)
Length Decoding• Fixed-length heads enable parallel fetch and decode• Heads contain information to locate corresponding
tail• Even though head must be decoded before finding
tail, still faster than conventional variable-length schemes
• Also, tails generally contain less critical information needed later in the pipeline
![Page 17: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/17.jpg)
Conventional VL Length-Decoding
Length1
Length2 Length
3
Instr 1 Instr 2 Instr 3
++
![Page 18: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/18.jpg)
Length2
Conventional VL Length-Decoding
Length1
Instr 1 Instr 2 Instr 3
2nd length decoder needs to know Length1 first
![Page 19: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/19.jpg)
Length2
Conventional VL Length-Decoding
Length1
Length3
Instr 1 Instr 2 Instr 3
+
3rd length decoder needs to know Length1 & Length2
![Page 20: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/20.jpg)
+
Length2
Conventional VL Length-Decoding
Length1
Length3
Instr 1 Instr 2 Instr 3
+
Need to know all 3 lengths to fetch and align more instructions.
![Page 21: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/21.jpg)
HAT Length-Decoding
Length1
Length2
Length3
Head1 Head2 Head3 Tail3 Tail2 Tail1
Length decoding done in parallel
![Page 22: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/22.jpg)
HAT Length-Decoding
Length1
Length2
Length3
Head1 Head2 Head3 Tail3 Tail2 Tail1
Length decoding done in parallel Only tail-length adders dependent on previous length information
![Page 23: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/23.jpg)
HAT Length-Decoding
Length1
Length2
Length3
+
Head1 Head2 Head3 Tail3 Tail2 Tail1
Length decoding done in parallel Only tail-length adders dependent on previous length information
![Page 24: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/24.jpg)
HAT Length-Decoding
Length1
Length2
Length3
+
+
Head1 Head2 Head3 Tail3 Tail2 Tail1
Length decoding done in parallel Only tail-length adders dependent on previous length information
![Page 25: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/25.jpg)
Branches in HAT• When branching into middle of line, only
head located, need to find tail• Could scan all earlier heads and sum
corresponding tail lengths, but substantial delay & energy penalty
![Page 26: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/26.jpg)
• Approach 1: Tail-Start Bit Vector• Indicates starting locations of tails• Does not increase static code size, but increases
cache area (cache refill time)• Requires that every head has a tail
Branches in HAT
5 H0 H1 H2 H3 H4 H5 T5 T4 T3 T1 T0
0 1 1 0 1 1 0 1
should be T2
![Page 27: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/27.jpg)
Branches in HAT
• Approach 2: Tail Pointers• Uses extra field per head to store pointer to tail
(filled in by linker at link time)• Removes latency, increases code size slightly• Cannot be used for indirect jumps
(target address not known until run time)– Expand PCs to include tail pointer– Restrict indirect jumps to only be at beginning
of bundle
![Page 28: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/28.jpg)
Branches in HAT
• Approach 3: BTB for HAT Branches• Store target tail pointer info in branch target buffer• Resort back to scanning from the beginning of the
bundle if prediction fails• Does not increase code size, but increases BTB size
and branch mispredict penalty
![Page 29: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/29.jpg)
HAT Advantages• Fetch & decode of multiple variable-length
instructions can be pipelined or parallelized• PC granularity independent of instruction
length granularity (less bits for branch offsets)• Variable alignment muxes smaller than in
conventional VL scheme• No instruction straddles cache line or page
boundary
![Page 30: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/30.jpg)
MIPS-HAT
• Example of HAT format: compressed variable-length re-encoding of MIPS
• Simple compression techniques• based on previous scheme by Panich99
• HAT format can be applied to many other types of instruction encoding
![Page 31: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/31.jpg)
• 5-bit tail fields (register fields not split)• 15-40 bit instructions
• 10-bit heads (to enable Tail-Start Bit Vector)• Every head has a tail
MIPS-HAT Design Decisions
![Page 32: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/32.jpg)
MIPS-HAT Format
op reg1 reg2 op2/imm (imm) (imm) (imm) (imm)
op reg1 reg2 (op2)
op reg1 reg2 reg3 (op2)
I-Type op reg1 op2/imm (imm) (imm) (imm) (imm)
R-Type op reg1 op2
J-Type op op2/imm imm (imm) (imm) (imm) (imm) (imm)
Heads Tails
![Page 33: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/33.jpg)
• Combine MIPS opcode fields• Opcode determines length
• 6 possible lengths; could use 3 overhead bits per instruction
• Instead include size information in opcode butnumber of possible opcodes substantially increased
• But only small subset frequently used
• Use 1-2 opcode fields• Most popular opcodes in primary opcode field (head)• All other opcodes use escape opcode and secondary
opcode field (tail)
MIPS-HAT Opcodes
![Page 34: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/34.jpg)
MIPS-HAT Compression• Use the minimum number of 5-bit fields to
encode immediates• Eliminate unused operand fields• New opcodes for frequently used operands• Two address versions of instructions with
same source & destination registers• Common instruction sequences re-encoded
as a single instruction
![Page 35: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/35.jpg)
MIPS-HAT Format
op reg1 reg2 op2/imm (imm) (imm) (imm) (imm)
op reg1 reg2 (op2)
op reg1 reg2 reg3 (op2)
I-Type op reg1 op2/imm (imm) (imm) (imm) (imm)
R-Type op reg1 op2
J-Type op op2/imm imm (imm) (imm) (imm) (imm) (imm)
Heads Tails
![Page 36: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/36.jpg)
MIPS-HAT Bundle Format
128-bit bundle
256-bit bundle
# instr (3b)
# instr (4b)
50x5b units
25x5b units
8x10b heads
16x5b tail units
16x10b heads
32x5b tail units
![Page 37: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/37.jpg)
Instruction Size Distribution
22.1%
13.0%
47.5%
3.8%
3.3%
10.4%15b20b25b30b35b40b
Most instructions fit in 25 bits or less.
![Page 38: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/38.jpg)
Compression Ratios
75.5% 78.5%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
256b 128bBundle Size
Static
Static Compression Ratio = compressed code size
original code size
relatively more overhead & internal fragmentation
![Page 39: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/39.jpg)
Compression Ratios
75.5% 75.0% 78.5% 75.5%
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
256b 128bBundle Size
Static Dynamic
Static Compression Ratio = compressed code size
original code size
Dynamic Fetch Ratio = new bits fetchedoriginal bits fetched
![Page 40: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/40.jpg)
Impact of Branch Schemes on Compression
75.0% 75.5%
0%10%20%30%40%50%60%70%80%90%
100%
256b 128bBundle Size
Normal
Dynamic Fetch Ratios
![Page 41: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/41.jpg)
75.0%86.5%
75.5%81.1%
0%10%20%30%40%50%60%70%80%90%
100%
256b 128bBundle Size
NormalBrBV
Tail-Start Bit Vector: Large increase in dynamic fetch ratio.Only have to fetch 16b BrBV rather than 32b BrBV each time
Tail-Start Bit Vector Effects
![Page 42: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/42.jpg)
75.0%86.5%
77.1% 75.5%81.1%
77.6%
0%10%20%30%40%50%60%70%80%90%
100%
256b 128bBundle Size
NormalBrBVBrTail
Tail Pointer: Much lower cost than tail-start bit vector...
Tail Pointer Effects
![Page 43: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/43.jpg)
75.0%86.5% 75.5% 81.1%
0%
20%
40%
60%
80%
100%
256b 128bBundle Size
Static Compression Ratios
NormalBrTail
Tail Pointer: But increases static code size.
Tail Pointer Effects
![Page 44: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/44.jpg)
Comparison to Related Schemes
0%
20%
40%
60%
80%
100%
HAT-MIPS CCRP MIPS16 SAMC/SADC
Compression Ratios
![Page 45: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/45.jpg)
Conclusion• New heads-and-tails instruction format
• High code density in both memory & cache• Allows parallel fetch & decode
• MIPS-HAT• Simple compression scheme to illustrate HAT• Static compression ratio = 75.5%• Dynamic fetch ratio = 75.0%• Several branching schemes introduced
![Page 46: Heads and Tails A Variable-Length Instruction Format Supporting Parallel Fetch and Decode](https://reader030.vdocument.in/reader030/viewer/2022020322/56814ad0550346895db7ec10/html5/thumbnails/46.jpg)
Future Work• HAT format can be applied to many other
types of instruction encoding• Aggressive instruction compression techniques• New instruction sets that take advantage of HAT to
increase performance w/o sacrificing code density