configuration bitstream reduction for sram-based fpgas by enumerating lut input permutations the...
TRANSCRIPT
![Page 1: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/1.jpg)
Configuration Bitstream Reductionfor SRAM-based FPGAs
by Enumerating LUT Input Permutations
The University ofBritish Columbia © 2011 Guy Lemieux
Ameer Abdelhadi Guy Lemieux
![Page 2: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/2.jpg)
Modern FPGAs are HUGE!
• Altera Stratix IV “4SGX530” device (45nm)– 200k 6-LUTs in one die
• FPGAs keep getting bigger– Moore’s Law
• 2x every 18-24 months• 28nm devices announced (Stratix V, Virtex 7)
– More than Moore• Xilinx 1,200k 6-LUTs using stacked silicon
2
![Page 3: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/3.jpg)
Configuration Bits
• Growing bitstream storage & load time– Altera Stratix IV (4SGX, largest device)
• 4 seconds to boot up using serial EEPROM• 35 seconds to configure using USB 2.0• 20 MByte bitstream
• Multiple configurations per FPGA– Even more bitstream storage
3
![Page 4: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/4.jpg)
Bitstream Compression
• Obvious solution: bitstream compression– Many methods in research
• Architecture-aware: switch blocks (eg, avoid 5:1 muxes), logic blocks (eg, ULMs, coarse-grained) , readback, frame reordering, wildcard techniques
• Architecture-agnostic: general compression (eg: huffman, LZ, arithmetic), “don’t care”
– Not yet applied in practice
• Can we do better?4
![Page 5: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/5.jpg)
Main Idea
• Bitstream Reduction– Throw away redundant bits
• Do not store them in configuration EEPROM• Do not transmit them
– Regenerate missing bits (in the FPGA)• Done on-the-fly, during bitstream load• Does not reduce # SRAM bits inside FPGA
• This is not bitstream compression– Can still compress after reduction
5
![Page 6: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/6.jpg)
Main Idea
• Where to find redundant bits?– LUT configuration bits (this paper)– Interconnect (future work)
• Eg, 4-input LUT:– 16 config bits (2^k)– 65,536 configurations (2^(2^k))– 222 unique (npn-equivalent) 4-input functions– Theoretically, 8 config bits is sufficient
6
![Page 7: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/7.jpg)
Equivalence Classes
• What is npn-equivalent ? First n = input negation (inversion), p = input permutations,Second n = output negation
• Eg, 4-input LUT–4! = 24 input permutations for the same function–2^k = 16 input inversions–2^1 = 2 output inversions
7
![Page 8: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/8.jpg)
Example: LUT input ordering
8
![Page 9: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/9.jpg)
Example: LUT input ordering
9
![Page 10: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/10.jpg)
Example: LUT input ordering
10
![Page 11: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/11.jpg)
Example: LUT input ordering
11
![Page 12: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/12.jpg)
Example: LUT input ordering
12
![Page 13: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/13.jpg)
Example: LUT input ordering
13
![Page 14: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/14.jpg)
Example: LUT input ordering
14
![Page 15: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/15.jpg)
Example: LUT input ordering
15
![Page 16: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/16.jpg)
Example 2-LUT
16
![Page 17: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/17.jpg)
Example 2-LUT
17
Bit e0
-- removed from bitstream-- regenerated at load-time-- storage bit still exists in LUT
![Page 18: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/18.jpg)
Example 2-LUT
18
Bit e0
-- removed from bitstream-- regenerated at load-time-- storage bit still exists in LUT
Regeneration circuit-- shared by all LUTs-- regenerates missing bits
![Page 19: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/19.jpg)
Example 2-LUT
19
Bit e0
-- removed from bitstream-- regenerated at load-time-- storage bit still exists in LUT
Regeneration circuit-- shared by all LUTs-- regenerates missing bits
Problem: changing order of inputs reorders LUT contents
changes value of bit e0
![Page 20: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/20.jpg)
Configuration Chaining
20
Problem: changing order of inputs reorders LUT contents
changes value of bit e0
Solution:
![Page 21: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/21.jpg)
Configuration Chaining
21
1. Copy value ofbit e0 to be saved
Problem: changing order of inputs reorders LUT contents
changes value of bit e0
Solution:
![Page 22: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/22.jpg)
Configuration Chaining
22
1. Copy value ofbit e0 to be saved
Problem: changing order of inputs reorders LUT contents
changes value of bit e0
Solution:
2. Encode value ofbit e0 in the next LUT
![Page 23: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/23.jpg)
General Approach
Produce a LUT chain
• Input ordering of next LUT generates removed bits for current LUT
Approach
1. Enumerate # of input permutations of LUTk=2 2! = 2 k=3 3! = 6 k=4 4! = 24
2. Define number of bits (pk) to be replaced
k=2 log2(2)=1 k=3 log2(6)=2 k=4 log2(24)=4
• For each LUT– Copy P = pk bits from the current LUT
– Re-order next LUT inputs according to value of P
– Adjust next LUT configuration bits to match new input ordering
– Remove pk configuration bits from current LUT
23
![Page 24: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/24.jpg)
LUT Savingsk # LUT bits # bits Saved
= Log2(k!)% saved
2 4 1 25%
3 8 2 25%
4 16 4 25%
5 32 6 19%
6 64 9 14%
7 128 12 9.4%
8 256 15 5.9%
24
How complex is bit regeneration?
![Page 25: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/25.jpg)
General Structure
25
![Page 26: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/26.jpg)
Regeneration Circuits (k=3,4,5)
26
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)
![Page 27: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/27.jpg)
Regeneration Circuits (k=3,4,5)
27
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)
• 6 x 6b comparators• 3b (full) adder• 2b (half) adder• Saves 4 bits (25%)
![Page 28: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/28.jpg)
Regeneration Circuits (k=3,4,5)
28
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)
• 6 x 6b comparators• 3b (full) adder• 2b (half) adder• Saves 4 bits (25%)
• 10 x 6b comparators• 2 x 3b (full) adder• 3 x 2b (half) adder• Saves 6 bits (25%)
![Page 29: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/29.jpg)
Regeneration Circuits (k=3,4,5)
29
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
• 3 x 6b comparators• 2b (half) adder• Saves 2 bits (25%)
• 6 x 6b comparators• 3b (full) adder• 2b (half) adder• Saves 4 bits (25%)
• 10 x 6b comparators• 2 x 3b (full) adder• 3 x 2b (half) adder• Saves 6 bits (25%)
Note: area cost is only once per device
(shared across all LUTs in entire FPGA)
![Page 30: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/30.jpg)
Regeneration Circuits (k=6)
30
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
![Page 31: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/31.jpg)
Regeneration Circuits (k=7)
31
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
![Page 32: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/32.jpg)
Regeneration Circuits (k=8)
32
k=3
e9e8e5 e7e6 e14e13e12e11e10e3e2e1e0 e4
Σ6 Σ5 Σ4 Σ3 Σ2Σ7
π0 π1 π2 π3 π4 π5 π6 π7
e6e5e4e3 e11e10
Σ6 Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5 π6
e3e2e1e0
Σ5 Σ4 Σ3 Σ2
π0 π1 π2 π3 π4 π5
e2e0 e1e1e0
Σ4 Σ3 Σ2
π0 π1 π2 π3 π4
Σ3 Σ2
π0 π1 π2 π3
Σ2
π0 π1 π2
k=4
k=5
k=6
k=7
k=8
e7 e8 e9e6e5e4 e7 e8e3e2 e5e4
e1e0 e3e2e1e0
![Page 33: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/33.jpg)
Regeneration Circuits
All designs are freelyavailable for download.
33
![Page 34: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/34.jpg)
Cost of Regeneration – Summaryk # LUT
bits# bits Saved
= Log2(k!)
# Compare (6 bit
compare)
# Full Adders
# Half Adders
# 2-input Gates
2 4 1 1 0 1 0
3 8 2 3 0 2 0
4 16 4 6 1 1 0
5 32 6 10 2 3 0
6 64 9 15 4 4 4
7 128 12 21 7 5 11
8 256 15 28 11 5 11
34
Note: area cost is only once per device
(shared across all LUTs in entire FPGA)
![Page 35: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/35.jpg)
Implementation
• 65nm standard cell implementation– Use minimum-size drive strength (not timing critical)
• Regeneration circuits are very small– Shared across entire device
35
k=3 4 5 6 7 8
Area (μm2) 64 129 220 333 502 702
Transistors 198 404 694 1,058 1,576 2,208
![Page 36: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/36.jpg)
Future Workk # LUT bits # bits Saved
= Log2(k!)% saved
2 4 1 25%
3 8 2 25%
4 16 4 25%
5 32 6 19%
6 64 9 14%
7 128 12 9.4%
8 256 14 5.5%
36
Can we reach 25% savings with 5+ inputs ?
![Page 37: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/37.jpg)
Future Workk # LUT bits # bits Saved
= Log2(k!)% saved
2 4 1 25%
3 8 2 25%
4 16 4 25%
5 32 6 19%
6 64 9 14%
7 128 12 9.4%
8 256 14 5.5%
37
Can we reach 25% savings with 5+ inputs ?
Eg, combine two 4-LUTs to make one 5-LUT
![Page 38: Configuration Bitstream Reduction for SRAM-based FPGAs by Enumerating LUT Input Permutations The University of British Columbia© 2011 Guy Lemieux Ameer](https://reader035.vdocument.in/reader035/viewer/2022062321/56649f155503460f94c2adb2/html5/thumbnails/38.jpg)
Thank you!
38