![Page 1: Architecture-Specific Packing for Virtex-5 FPGAs Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal February 25th, 2008](https://reader035.vdocument.in/reader035/viewer/2022062313/56649c765503460f9492a623/html5/thumbnails/1.jpg)
Architecture-Specific Packing for Virtex-5 FPGAs
Taneem Ahmed, Paul Kundarewich, Jason Anderson, Brad Taylor, Rajat Aggarwal
February 25th, 2008
Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Results
• Summary
Simplified FPGA Logic Element
[Figure: a 4-LUT (inputs A1 to A4, output O4) paired with a flip-flop (FF)]
Simplified FPGA Logic Block
[Figure: a logic block of four 4-LUT/FF pairs, each connected to the general interconnect]
Virtex-5 Logic Block
[Figure: a CLB containing two SLICEs, each with four 6-LUT/FF pairs, connected to the general interconnect]
Dual-Output 6-LUT
[Figure: a 6-LUT with inputs A1 to A6 and two outputs, O6 and O5]
Dual-Output 6-LUT Usage
[Figure: the 6-LUT drawn as two 5-LUTs that share inputs A1 to A5, with input A6 and outputs O6 and O5]
Dual-Output Packing
[Figure: two functions, X = LogicX(x, y) and Y = LogicY(a, b). Unpacked, each occupies its own 6-LUT. Number of 6-LUTs used: 2. Packed, both share a single 6-LUT with A6 tied to VCC, one function per 5-LUT, with results on O6 and O5. Number of 6-LUTs used: 1!]
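The packing condition behind this figure can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes, based on the Virtex-5 architecture rather than on these slides, that two functions fit in one dual-output 6-LUT when together they read at most five distinct signals, since A6 is tied to VCC. The names `can_share_lut` and `MAX_SHARED_INPUTS` are hypothetical.

```python
MAX_SHARED_INPUTS = 5  # A1..A5 are shared; A6 is tied to VCC in dual-output mode


def can_share_lut(inputs_x, inputs_y):
    """Return True if two functions could share one dual-output 6-LUT.

    Each argument is the set of signal names that a function reads.
    Assumption: the only constraint modeled is the distinct-input count.
    """
    return len(set(inputs_x) | set(inputs_y)) <= MAX_SHARED_INPUTS


# The pair from the figure: X = LogicX(x, y) and Y = LogicY(a, b).
print(can_share_lut({"x", "y"}, {"a", "b"}))            # four distinct inputs
print(can_share_lut({"a", "b", "c"}, {"d", "e", "f"}))  # six distinct inputs
```

A real packer would also have to check the routing and timing effects discussed on the following slides; this check only covers the input count.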
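Virtex-5 LUT/FF Pair
[Figure: one LUT/FF pair in a SLICE, showing the 6-LUT with outputs O6 and O5, the carry logic (CY, XOR, CIN), the F7 mux, the AX input, the FF with output AQ, and the outputs A and AMUX]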
Dual-Output Packing Tradeoff
[Figure: a LUT/FF pair showing the paths of the 6-LUT outputs O6 and O5, together with the AX input, the F7 mux, and the FF]
Dual-Output Packing in Placer
• Goal: To reduce area without performance hit
  – Can be done pre-placement
    • Will be sub-optimal without delay estimates
  – Use delay estimates available during placement to make good decisions on when to merge two LUTs
• Approach:
  – Allow second 5-LUT to be used, when performance impact is small
  – Incorporate LUT packing in placer’s cost function
Placer Cost Function
• Previous cost function:
  – Cost = a * W + b * T
  – W: wirelength cost; T: timing performance cost
• Extend cost function with two new terms
  – One based on 6-LUT utilization (L)
  – One based on SLICE utilization (S)
  – Cost = a * W + b * T + c * L + d * S
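The extended cost is a plain weighted sum, which the following sketch evaluates for a hypothetical placer move. The weights and the W, T, L, S values are made-up numbers chosen only to show the trade-off. The slides do not give the real coefficients, and the names `Weights` and `placer_cost` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    a: float  # weight of the wirelength cost W
    b: float  # weight of the timing performance cost T
    c: float  # weight of the 6-LUT utilization term L
    d: float  # weight of the SLICE utilization term S


def placer_cost(wirelength, timing, lut_util, slice_util, k):
    """Cost = a * W + b * T + c * L + d * S."""
    return (k.a * wirelength + k.b * timing
            + k.c * lut_util + k.d * slice_util)


# Hypothetical move: merging two LUTs lowers L but slightly raises T.
weights = Weights(a=1.0, b=1.0, c=0.5, d=0.5)
before = placer_cost(100.0, 50.0, 20.0, 10.0, weights)
after = placer_cost(100.0, 51.0, 16.0, 10.0, weights)
print(before, after, after - before)
```

With these example numbers the merge lowers the total cost, so a cost-driven placer would favor it. A larger timing weight `b` would tip the decision the other way.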
6-LUT Utilization Term
• L is computed based on all the used 6-LUT slots
• Where
SLICE Utilization Term
• S is computed based on all the available SLICEs
• Let:
  – N_i = Number of used 5-LUTs in SLICE i (at most 8)

$$S = \sum_{i=0}^{m} S_i$$
Performance Recovery
• It is helpful to prohibit packing in certain cases for performance reasons
• Other used elements in a SLICE may block the “good” path from the O5 output to the external interconnect
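One way to picture such a rule is a simple veto on the pack. The sketch assumes that the blocking elements are the carry-chain XOR and the F7 mux, which are the two cases named in the slide titles that follow, and that the placer rejects the pack outright when either is in use. The actual rule in the tool may be more nuanced. `LutSlot` and `pack_allowed` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LutSlot:
    """Usage flags for one LUT/FF position in a SLICE (illustrative model)."""
    xor_used: bool = False  # carry-chain XOR is in use at this position
    f7_used: bool = False   # F7 mux is in use at this position


def pack_allowed(slot):
    """Assumed rule: veto the pack when XOR or F7 is already in use."""
    return not (slot.xor_used or slot.f7_used)


print(pack_allowed(LutSlot()))               # free position
print(pack_allowed(LutSlot(xor_used=True)))  # XOR in use
print(pack_allowed(LutSlot(f7_used=True)))   # F7 in use
```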
Performance Recovery: XOR
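[Figure: the Virtex-5 LUT/FF pair (LUT6 with outputs O6 and O5, carry logic CY/XOR/CIN, F7 mux, AX input, FF with output AQ, outputs A and AMUX), illustrating the case where the XOR is in use]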
Performance Recovery: F7
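[Figure: the Virtex-5 LUT/FF pair (LUT6 with outputs O6 and O5, carry logic CY/XOR/CIN, F7 mux, AX input, FF with output AQ, outputs A and AMUX), illustrating the case where the F7 mux is in use]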
6-LUT Reduction
[Figure: chart of % 6-LUT reduction (y-axis, 0 to 16) against benchmark design # (x-axis). Annotation: 5.5% 6-LUT reduction]
SLICE Reduction
[Figure: chart of % SLICE reduction (y-axis, 0 to 25) against benchmark design # (x-axis). Annotation: 10.23% SLICE reduction]
Performance Results
[Figure: plot of performance loss (%) (y-axis, -15 to 25) against SLICE reduction (%) (x-axis, 0 to 25). Annotation: 3.3% performance degradation]
Overview
• Virtex-5 6-LUT Packing
• Virtex-5 DSP and Block RAM Packing
• Summary
New Type of Packing Problem
• Traditionally, packing is considered to be a problem of just LUTs and flops
• However, Virtex-5 contains large IP blocks that present their own packing problem
Virtex-5 Block RAMs
[Figure: a 36 Kb block RAM tile shown as one 36 Kb RAM and as two 18 Kb RAMs]
• A 36 Kbit block RAM tile can store:
  a) a single 36 Kb RAM
  b) two independent 18 Kb RAMs
• Block RAM has configurable “aspect ratio”
• 18 Kb RAM can be configured as: 16K x 1, 8K x 2, 2K x 9, or 1K x 18
• Tools decide which independent 18 Kb block RAMs to locate in which tile
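The aspect ratios listed above can be checked with a little arithmetic. The sketch below multiplies depth by width for each configuration. The x1 and x2 shapes come to 16 Kb, while the x9 and x18 shapes reach the full 18 Kb because those widths include the parity bits. The helper name `capacity_bits` is illustrative.

```python
# (depth, width) pairs for the 18 Kb configurations listed on the slide
ASPECT_RATIOS = [(16 * 1024, 1), (8 * 1024, 2), (2 * 1024, 9), (1024, 18)]


def capacity_bits(depth, width):
    """Total number of bits addressable in a depth x width configuration."""
    return depth * width


for depth, width in ASPECT_RATIOS:
    bits = capacity_bits(depth, width)
    print(f"{depth // 1024}K x {width}: {bits} bits = {bits // 1024} Kb")
```

Whatever shape each 18 Kb RAM takes, two of them fit in one 36 Kb tile, which is what makes the pairing decision a packing problem.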
Virtex-5 DSP48E Block
• A multiply-accumulate operation, pervasive in DSP circuits, can be realized in a single DSP48E.
• Multiple DSP48Es can be chained together to form more complex functions through the PCIN and PCOUT ports
[Figure: DSP48E block diagram with inputs A (25-bit), B (18-bit), C (48-bit), and PCIN; a 25x18 multiplier (X); a 48-bit ALU; pattern detect (=); optional pipeline register/routing logic stages; and outputs P and PCOUT]
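As a rough behavioral picture of the multiply-accumulate and the PCIN/PCOUT cascade, the sketch below models one DSP48E as P = A * B + PCIN with the operand widths from the diagram. It is a simplification under stated assumptions. Pipeline registers, the C input, pattern detect, and the ALU operating modes are ignored, and the function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def dsp48e_mac(a, b, pcin=0):
    """Simplified model: P = A * B + PCIN, wrapped to 48 bits.

    A is treated as 25-bit signed and B as 18-bit signed.
    """
    product = to_signed(a, 25) * to_signed(b, 18)
    return to_signed(product + pcin, 48)


# Two cascaded blocks: the first block's result enters the second via PCIN.
p0 = dsp48e_mac(3, 4)
p1 = dsp48e_mac(5, 6, pcin=p0)
print(p0, p1)  # 12 42
```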
Block RAM and DSP Floorplan
• Block RAM and DSP48E tiles are organized in columns
[Figure: floorplan with columns of block RAM tiles and a column of Virtex-5 DSP tiles, each DSP tile containing two DSP48Es]
Block RAM/DSP Packing
• Problem: Placer algorithms are heuristic and sometimes do not find an optimal block RAM packing
• Goal: Leverage preferred block RAM packing patterns to achieve high performance
• Target area: DSP designs
  – DSP designs make heavy use of block RAMs and DSP blocks
DSP Block RAM Designs
• The most common DSP application is the Finite Impulse Response (FIR) filter
  – FIR filters have multiple instances of a “tap”, each involving DSP blocks and block RAMs
FIR Filter
• A Finite Impulse Response or FIR filter is a digital filter that takes a weighted sum of the signals in a delay line
• An N-tap filter can be expressed as:
  y[n] = c0*x[n] + c1*x[n-1] + … + c(N-1)*x[n-N+1]
  – Where:
    • y[n] is the output of the filter at time n
    • x[n] is the data input “signal” at time n
    • ci is the i-th coefficient
• Each coefficient/data product in the sum is referred to as a “tap”
  – DSP units used for the multiply and accumulate
  – Block RAMs used to store the data and coefficients
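The N-tap expression can be written directly as a short software reference model. This is only a sketch of the arithmetic, with each multiply-accumulate in the inner loop standing for one tap. It says nothing about how the taps map onto DSP48Es or block RAMs, and `fir_filter` is an illustrative name.

```python
def fir_filter(coeffs, samples):
    """Compute y[n] = sum of coeffs[i] * x[n - i] over all taps.

    Samples before the start of the input are taken to be zero.
    """
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        # Shift the new sample into the delay line, dropping the oldest one.
        delay_line = [x] + delay_line[:-1]
        acc = 0
        for c, d in zip(coeffs, delay_line):
            acc += c * d  # one tap: a multiply followed by an accumulate
        outputs.append(acc)
    return outputs


# 2-tap example with coefficients c0 = 1 and c1 = 2
print(fir_filter([1, 2], [1, 0, 0, 3]))  # [1, 2, 0, 3]
```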
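FIR Designs – Use Case 1
• 2-tap FIR filter involving small block RAMs
[Figure: 2-tap FIR datapath between data input and data output. Data RAMs (RAMD0, RAMD1) and coefficient RAMs (RAMC0, RAMC1) are 18 Kb block RAMs that connect to the A and B inputs of DSP0 (Tap 0) and DSP1 (Tap 1). The two DSPs are cascaded through PCOUT and PCIN. Pairs of 18 Kb block RAMs are grouped into a 36 Kb block RAM tile.]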
Packing for Use Case 1
• Packing both 18k Block RAMs into a Block RAM tile permits a natural alignment between the DSP and Block RAMs
[Figure: floorplan with a column of block RAM tiles next to a column of Virtex-5 DSP tiles (two DSP48Es per tile). Each block RAM tile operates as two independent 18 Kb block RAMs. Annotation: High performance!]
FIR Designs – Use Case 2
• 2-tap FIR filter involving larger block RAMs
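[Figure: 2-tap FIR datapath with a data RAM group (RAMD0, RAMD1) and a coefficient RAM group (RAMC0, RAMC1), labeled with 18 Kb block RAM and 36 Kb block RAM, connected to the A and B inputs of DSP0 (Tap 0) and DSP1 (Tap 1), which are cascaded through PCOUT and PCIN]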
Packing for Use Case 2
• Two Block RAM columns feed one DSP column
• Again provides a natural alignment between the DSP and Block RAMs
[Figure: floorplan with a column of Virtex-5 DSP tiles (two DSP48Es per tile) and two columns of block RAM tiles]
Block RAM Chains
• Use Case: the data output pins of one 18k block RAM are connected to the data input pins of another (e.g. FIFO)
• Algorithm: Look for such chains and pack them together into a single block RAM tile
• Special Case: 18k block RAMs separated by registers
[Figure: two chained 18 Kb block RAMs between in and out: RAM0 (ports dia, doa, addra) and RAM1 (ports dib, dob, addrb)]
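The chain-packing idea can be sketched as a greedy pass over the netlist. This is not the authors' algorithm, only a minimal illustration. It assumes the netlist has already been reduced to a mapping from each 18 Kb RAM to the RAM its data output feeds, and it pairs neighbors in the order given. The name `pack_chains` and the data layout are hypothetical.

```python
def pack_chains(rams, feeds):
    """Greedily pair chained 18 Kb RAMs into 36 Kb tiles.

    rams:  list of RAM names in netlist order
    feeds: dict mapping a RAM to the RAM driven by its data output
    Returns a list of tuples, one per tile (two RAMs, or one left unpaired).
    """
    packed = set()
    tiles = []
    for ram in rams:
        successor = feeds.get(ram)
        if ram in packed or successor is None:
            continue
        if successor in packed or successor == ram:
            continue
        tiles.append((ram, successor))  # both halves of one tile
        packed.update((ram, successor))
    # RAMs that found no partner each occupy a tile on their own.
    tiles.extend((ram,) for ram in rams if ram not in packed)
    return tiles


print(pack_chains(["RAM0", "RAM1", "RAM2"], {"RAM0": "RAM1", "RAM1": "RAM2"}))
# [('RAM0', 'RAM1'), ('RAM2',)]
```

The register-separated special case from the slide is not handled here. It would need the chain search to look through intervening registers.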
Block RAM/DSP Packing Results
| Circuit | Perf. RAM Packing (MHz) | Perf. Baseline (MHz) | Percent Improvement |
|---|---|---|---|
| Circuit 1 | 500 | 400 | 25% |
| Circuit 2 | 450 | 365 | 23% |
| Circuit 3 | 500 | 470 | 6% |
| Circuit 4 | 425 | 435 | -2% |
| Circuit 5 | 215 | 200 | 8% |
| Geomean | 400 | 359 | 11% |
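The Geomean row can be re-derived from the five per-circuit frequencies. The sketch below is a consistency check, not part of the original results. It computes both geometric means and the resulting improvement, and `geomean` is an illustrative helper.

```python
import math

# Per-circuit frequencies in MHz, copied from the results table
packed = [500, 450, 500, 425, 215]
baseline = [400, 365, 470, 435, 200]


def geomean(values):
    """Geometric mean, computed as the exponential of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


g_packed = geomean(packed)
g_baseline = geomean(baseline)
improvement = (g_packed / g_baseline - 1) * 100
print(round(g_packed), round(g_baseline), round(improvement))  # 400 359 11
```

The rounded values match the table: 400 MHz, 359 MHz, and an 11% improvement.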
Summary
• Described two architecture-specific packing approaches for a 65nm commercial FPGA: Xilinx Virtex-5
  – Dual-output LUT packing in placement:
    • Achieves 10.2% SLICE reduction and 5.5% LUT reduction
  – Packing for DSPs and block RAMs:
    • Achieves 11% performance improvement
Questions