-
A CAD Framework for MALIBU: An FPGA with
Time-multiplexed Coarse-Grained Elements
David Grant
Supervisor: Dr. Guy Lemieux
FPGA 2011 -- Feb 28, 2011
-
2
Motivation
● Growing Industry Trend: Large FPGA Circuits
➔ Often from C-to-Hardware or system generators
● ex. molecular dynamics, rendering, nuclear simulation
➔ Word-oriented
➔ Millions of gates
-
3
Motivation
● Problems with FPGAs
➔ CAD runtime can take hours or days
➔ Fixed capacity
● A large circuit may not fit
➔ Inefficient use of resources
● Resources sit idle most of the time
-
4
Motivation
● Benchmark: chem
-
5
Motivation
● Benchmark: chem
[Figure: chem benchmark. Labels: Available Spatial Parallelism; Time-Multiplexing - Free; TM - Slow]
-
6
Motivation
● Solution
➔ Divide up the circuit and run it on an array of processors
● Preserve the coarse-grained features of the circuit
➔ Create coarse-grained-aware CAD tools
-
7
Time-Multiplexing
● Time-Multiplexed FPGAs
➔ Look-up Table (LUT)
[Figure: LUT with config bits and inputs from other LUTs]
-
8
Time-Multiplexing
● Time-Multiplexed FPGAs
➔ Time-multiplexed LUT
➔ Multiplexer is shared
[Figure: LUT whose config bits change over time; inputs from other LUTs]
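A minimal sketch of the time-multiplexed LUT idea on this slide: one shared multiplexer, with a different set of config bits selected at each time step. The class name, the truth-table encoding, and the two example contexts are illustrative assumptions, not taken from the talk.

```python
class TimeMultiplexedLUT:
    """One physical LUT that holds several configurations (contexts).

    Each context is a truth table stored as an integer: bit i is the
    output for the input combination whose index is i. This encoding
    is an assumption made for the sketch.
    """

    def __init__(self, num_inputs, contexts):
        self.num_inputs = num_inputs
        self.contexts = list(contexts)

    def evaluate(self, time_step, inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        # Select this time step's config bits; the schedule wraps around.
        config_bits = self.contexts[time_step % len(self.contexts)]
        # The shared multiplexer picks one config bit using the inputs.
        index = sum(bit << i for i, bit in enumerate(inputs))
        return (config_bits >> index) & 1


# Hypothetical 2-input LUT acting as AND at step 0 and XOR at step 1.
lut = TimeMultiplexedLUT(2, [0b1000, 0b0110])
and_result = lut.evaluate(0, (1, 1))
xor_result = lut.evaluate(1, (1, 1))
```

The same physical multiplexer is reused at every step; only the selected config bits differ.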
-
9
Datapath FPGAs
● Datapath FPGAs
➔ Config bit sharing
➔ 1.1x density, same performance
[Figure: a Routing Mux with separate config bits beside a Datapath Routing Mux with one shared config bit; both route a[0], a[1], a[3], b[0], b[1], b[3]]
-
10
Time-Multiplexing and Datapath
● Where we're going
➔ Coarse-grained (datapath) time-multiplexed resources
➔ ALU is shared
[Figure: ALU shared over time. Labels: 32; time; inputs from other ALUs]
-
11
Overview
● Motivation
● Malibu Architecture
● Synthesis
● Results
[Figure: Traditional Island-Style FPGA]
-
12
Overview
● Motivation
● Malibu Architecture
● Synthesis
● Results
[Figure: Malibu Architecture]
-
13
Malibu Architecture
● Traditional FPGA CLB
-
14
Malibu Architecture
● Add coarse-grained inputs and outputs
-
15
Malibu Architecture
● Add an ALU and register file
-
16
Malibu Architecture
● Add coarse-grained communication
● Each CG has a schedule
➔ SL instructions
➔ Fmax = 1GHz / SL
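The Fmax relation above can be worked through as a small sketch. It reads SL as the number of instructions in a CG's schedule and takes the 1 GHz figure from the slide as the internal clock; the function name is hypothetical.

```python
INTERNAL_CLOCK_MHZ = 1000.0  # the 1 GHz figure quoted on the slide


def user_fmax_mhz(schedule_length):
    """User clock in MHz for a schedule of `schedule_length` instructions.

    Each user clock cycle takes one internal cycle per scheduled
    instruction, so Fmax = 1 GHz / SL.
    """
    if schedule_length < 1:
        raise ValueError("schedule length must be at least 1")
    return INTERNAL_CLOCK_MHZ / schedule_length


# A 3-instruction schedule, like the one in the circuit execution example.
fmax_for_three = user_fmax_mhz(3)
```

A shorter schedule gives a faster user clock, which is where the area versus performance tradeoff comes from: spreading the circuit over more CLBs shortens each schedule.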
-
17
Malibu Architecture
● Circuit Execution Example
Time | CLB0 CG          | CLB0 FG | CLB1 CG              | CLB1 FG
0    | EQ N0,W0 -> CGO0 |         | EQ N0,E0 -> CGO0     |
1    | NOP              | LUT     | XOR N0,E0 -> R0      |
2    | ADD N0,W0 -> E0  |         | MUX W0,R0,CGI0 -> E0 |
[Figure: CLB0 and CLB1 with signals a, b, c, d, x and N/W/E/S ports]
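The schedule on this slide can be traced with a small behavioural sketch. Several details are assumptions made for illustration and are not stated in the talk: a and b arrive at CLB0, c and d at CLB1; the fine-grained LUT ANDs the two CGO0 bits and drives CLB1's CGI0; CLB0's E0 output is what CLB1 reads on W0 in the following user cycle; and words are 32 bits wide.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit coarse-grained word


class TwoClbExample:
    """Behavioural trace of the 3-step schedule for CLB0 and CLB1."""

    def __init__(self):
        # Value CLB0 last drove on E0; CLB1 sees it on W0 (assumption).
        self.clb0_east = 0

    def user_cycle(self, a, b, c, d):
        # Time 0: both CG units compare their inputs.
        clb0_cgo0 = int(a == b)          # CLB0: EQ N0,W0 -> CGO0
        clb1_cgo0 = int(c == d)          # CLB1: EQ N0,E0 -> CGO0
        # Time 1: CLB0's CG idles while its FG LUT combines the two bits.
        clb1_cgi0 = clb0_cgo0 & clb1_cgo0  # CLB0 FG: LUT (assumed AND)
        clb1_r0 = (c ^ d) & MASK32         # CLB1: XOR N0,E0 -> R0
        # Time 2: CLB1 selects between the value on W0 and R0,
        # then CLB0's new sum replaces the value on the link.
        x = self.clb0_east if clb1_cgi0 else clb1_r0  # MUX W0,R0,CGI0 -> E0
        self.clb0_east = (a + b) & MASK32             # ADD N0,W0 -> E0
        return x


sim = TwoClbExample()
x1 = sim.user_cycle(3, 3, 5, 5)  # select is 1, link still holds 0
x2 = sim.user_cycle(1, 2, 7, 4)  # select is 0, output is 7 ^ 4
x3 = sim.user_cycle(4, 4, 9, 9)  # select is 1, output is the sum from cycle 2
```

Each call to `user_cycle` corresponds to one user clock period, which takes three internal time steps under this schedule.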
-
20
Overview
● Motivation
● Malibu Architecture
● Synthesis
● Results
[Figure: CAD flow from Verilog to Bitstream]
-
21
Overview
● Motivation
● Malibu Architecture
● Synthesis
● Results
[Figure: CAD flow from Verilog to Bitstream, with M-CAD and M-HOT]
-
22
Front-End Synthesis
● Parse and Elaborate
➔ Use Verilator
➔ Construct a CDFG
➔ Optimize
● Coarse-Grained Synthesis
➔ Map CDFG to Malibu instructions
➔ Various CDFG transformations
● Fine-Grained Synthesis
➔ Extract signals ≤ Wf
➔ Use Odin II and ABC to synthesize to LUTs
-
23
Back-End Synthesis
● M-CAD
➔ Traditional FPGA-CAD flow
➔ Separate Placement, Routing, Scheduling
● M-HOT
➔ Integrated placement, routing, scheduling
➔ Divides problem into levels, place+route each level
● Both Approaches
➔ Can target any-sized architecture
➔ Can trade area for performance
➔ Fast
-
24
Synthesis
● Example
assign c1 = (a == b) ? 1'b1 : 1'b0;
assign c2 = (c == d) ? 1'b1 : 1'b0;
assign t2 = c ^ d;
assign x = (c1 & c2) ? t1 : t2;
always @(posedge clk) begin
t1
-
25
Synthesis
● CDFG and ALAP output of Front-End Synthesis
[Figure: CDFG with nodes arranged by ALAP level: 2, 1, 0]
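A minimal sketch of ALAP (as-late-as-possible) levelling on a small graph, using the node labels that appear on the M-HOT slide (4: ADD, 5: EQ, 6: EQ, 7: XOR, 8: AND, 9: MUX). The edge list is my reading of the Verilog example and is an assumption; in particular, ADD is treated as feeding only a register, so it has no successor within the cycle. Level 0 is taken to be the last step.

```python
def alap_levels(successors):
    """Return {node: level} for a DAG given as {node: [successor, ...]}.

    Level 0 is the latest step. A node with successors sits one level
    above its deepest successor, which is as late as it can be placed.
    """
    levels = {}

    def visit(node):
        if node not in levels:
            succ = successors.get(node, [])
            levels[node] = 0 if not succ else 1 + max(visit(s) for s in succ)
        return levels[node]

    for node in successors:
        visit(node)
    return levels


# Assumed edges for the example circuit (illustrative, see lead-in).
EXAMPLE_GRAPH = {
    "4: ADD": [],          # assumed to feed only the t1 register
    "5: EQ": ["8: AND"],
    "6: EQ": ["8: AND"],
    "7: XOR": ["9: MUX"],
    "8: AND": ["9: MUX"],
    "9: MUX": [],
}

levels = alap_levels(EXAMPLE_GRAPH)
for level in sorted(set(levels.values()), reverse=True):
    names = sorted(n for n, lv in levels.items() if lv == level)
    print(level, names)
```

Under these assumptions the two EQ nodes land on level 2, AND and XOR on level 1, and ADD and MUX on level 0, which matches the three levels shown on the slide.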
-
26
Synthesis
● M-HOT Place, Route, Schedule each height
Heights placed | CLB0                  | CLB1
1              | 5: EQ, -, -           | 6: EQ, -, -
2              | 5: EQ, 8: LUT, -      | 6: EQ, 7: XOR, -
3              | 5: EQ, 8: AND, 4: ADD | 6: EQ, 7: XOR, 9: MUX
-
27
Overview
● Motivation
● Malibu Architecture
● Synthesis
● Results
-
28
Results
● Frequency (MHz) for each benchmark
-
29
Results
● Frequency (MHz) for each benchmark
Coarse-grained circuits
-
30
Results
● Area vs. Performance tradeoff
-
31
Results
● Results (compared to Quartus II / Stratix III)
                             M-HOT   M-CAD
Synthesis Time Improvement:  30.9x   77.0x
User Clock Speed:            0.12x   0.07x
Density:                     1.48x   0.67x
10x = 10 times better than the Quartus result
1x = same as Quartus
0.1x = 1/10th the Quartus result (10 times worse)
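The ratio convention in the legend can be inverted to read a value below 1x as "times worse". A small sketch using the numbers from this slide; the dictionary layout and function name are illustrative.

```python
# Ratios relative to Quartus II / Stratix III, copied from the slide.
RESULTS = {
    "M-HOT": {"synthesis_time": 30.9, "user_clock_speed": 0.12, "density": 1.48},
    "M-CAD": {"synthesis_time": 77.0, "user_clock_speed": 0.07, "density": 0.67},
}


def times_worse(ratio):
    """Convert a ratio such as 0.1x into 'times worse' (10 for 0.1x)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 1.0 / ratio


for flow, metrics in RESULTS.items():
    slowdown = times_worse(metrics["user_clock_speed"])
    print(f"{flow}: user clock about {slowdown:.1f} times slower than Quartus")
```

By this reading, the M-HOT user clock is roughly 8 times slower than the Quartus result and the M-CAD user clock roughly 14 times slower.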
-
32
Future Work
● Improve Front-End Synthesis
➔ Needs to be both coarse-grain and fine-grain aware
➔ Coarse-grained optimizations
● Improve M-CAD and M-HOT
➔ Possibly 2x-3x performance improvement
-
33
Thanks
● Purpose
➔ Implement a circuit on a coarse-grained/fine-grained architecture
● Malibu Architecture
➔ FPGA with time-multiplexed coarse-grained resources
➔ Can trade density for performance
● Synthesis (M-CAD and M-HOT)
➔ Fast, up to 250x faster than Quartus II
➔ Fmax about 1/10th that of an FPGA (0.12x for M-HOT, 0.07x for M-CAD)