a cad framework for malibu: an fpga with time ...a cad framework for malibu: an fpga with...

33
A CAD Framework for MALIBU: An FPGA with Time-multiplexed Coarse-Grained Elements David Grant Supervisor: Dr. Guy Lemieux FPGA 2011 -- Feb 28, 2011

Upload: others

Post on 07-Feb-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • A CAD Framework for MALIBU: An FPGA with

    Time-multiplexed Coarse-Grained Elements

    David Grant

    Supervisor: Dr. Guy Lemieux

    FPGA 2011 -- Feb 28, 2011

  • 2

    Motivation

    ● Growing Industry Trend: Large FPGA Circuits➔ Often from C-to-Hardware or system generators

    ● ex. molecular dynamics, rendering, nuclear simulation

    ➔ Word-oriented

    ➔ Millions of gates

  • 3

    Motivation

    ● Problems with FPGAs➔ CAD runtime can take hours or days

    ➔ Fixed capacity● A large circuit may not fit

    ➔ Inefficient use of resources● Resources sit idle most of the time

  • 4

    Motivation

    ● Benchmark: chem

  • 5

    Motivation

    ● Benchmark: chem

    Tim

    e-M

    ultip

    lexi

    ng –

    Fre

    e

    Available Spatial Parallelism

    TM - Slow

  • 6

    Motivation

    ● Solution➔ Divide up the circuit and run it on an array of processors

    ● Preserve the coarse-grained features of the circuit

    ➔ Create coarse-grained-aware CAD tools

  • 7

    Time-Multiplexing

    ● Time-Multiplexed FPGAs➔ Look-up Table (LUT)

    config bits

    inputs fromother LUTs

  • 8

    Time-Multiplexing

    ● Time-Multiplexed FPGAs➔ Time-multiplexed LUT➔ Multiplexer is shared

    time

    config bits

    inputs fromother LUTs

  • 9

    Datapath FPGAs

    ● Datapath FPGAs➔ Config bit sharing➔ 1.1x density, same performance

    a[0]

    a[1]

    a[3]

    b[0]

    b[1]

    b[3]

    a[0]

    a[1]

    a[3]

    b[0]

    b[1]

    b[3]

    config bits config bit

    Routing Mux Datapath Routing Mux

  • 10

    Time-Multiplexing and Datapath

    ● Where we're going➔ Coarse-grained(datapath) time-multiplexed resources➔ ALU is shared

    32

    time

    inputs fromother ALUs

  • 11

    Overview

    ● Motivation● Malibu Architecture● Synthesis● Results

    Traditional Island-Style FPGA

  • 12

    Overview

    ● Motivation● Malibu Architecture● Synthesis● Results

    Malibu Architecture

  • 13

    Malibu Architecture

    ● Traditional FPGA CLB

  • 14

    Malibu Architecture

    ● Add Coarse-Grainedinputs and outputs

  • 15

    Malibu Architecture

    ● Add an ALU and register file

  • 16

    Malibu Architecture

    ● Add coarse-grainedcommunication

    ● Each CG has a schedule➔ SL instructions➔ Fmax = 1GHz / SL

  • 17

    CLB0 CLB1Time CG FG CG FG

    0 EQ N0,W0 -> CGO0 EQ N0,E0 -> CGO01 NOP LUT XOR N0,E0 -> R02 ADD N0,W0 -> E0 MUX W0,R0,CGI0->E0

    Malibu Architecture

    ● Circuit Execution Example

    CLB0

    a

    CLB1

    x

    b c

    d

    CLB0

    a

    CLB1

    x

    b c

    d

    N

    W E

    S

    N

    W E

    S

  • 18

    CLB0 CLB1Time CG FG CG FG

    0 EQ N0,W0 -> CGO0 EQ N0,E0 -> CGO01 NOP LUT XOR N0,E0 -> R02 ADD N0,W0 -> E0 MUX W0,R0,CGI0->E0

    Malibu Architecture

    ● Circuit Execution Example

    CLB0

    a

    CLB1

    x

    b c

    d

    CLB0

    a

    CLB1

    x

    b c

    d

    N

    W E

    S

    N

    W E

    S

  • 19

    CLB0 CLB1Time CG FG CG FG

    0 EQ N0,W0 -> CGO0 EQ N0,E0 -> CGO01 NOP LUT XOR N0,E0 -> R02 ADD N0,W0 -> E0 MUX W0,R0,CGI0->E0

    Malibu Architecture

    ● Circuit Execution Example

    CLB0

    a

    CLB1

    x

    b c

    d

    CLB0

    a

    CLB1

    x

    b c

    d

    N

    W E

    S

    N

    W E

    S

  • 20

    Overview

    ● Motivation● Malibu Architecture● Synthesis● Results

    Verilog

    Bitstream

  • 21

    Overview

    ● Motivation● Malibu Architecture● Synthesis● Results

    M-CAD M-HOT

    Verilog

    Bitstream

  • 22

    Front-End Synthesis

    ● Parse and Elaborate➔ Use Verilator ➔ Construct a CDFG➔ Optimize

    ● Coarse-Grained Synthesis➔ Map CDFG to Malibu instructions➔ Various CDFG transformations

    ● Fine-Grained Synthesis➔ Extract signals ≤ Wf➔ Use OdinII and ABC to synthize to LUTs

  • 23

    Back-End Synthesis

    ● M-CAD➔ Traditional FPGA-CAD flow➔ Separate Placement, Routing, Scheduling

    ● M-HOT➔ Integrated placement, routing, scheduling➔ Divides problem into levels, place+route each level

    ● Both Approaches➔ Can target any-sized architecture➔ Can trade area for performance➔ Fast

  • 24

    Synthesis

    ● Example

    assign c1 = (a == b) ? 1'b1 : 1'b0;assign c2 = (c == d) ? 1'b1 : 1'b0;assign t2 = c ^ d;assign x = (c1 & c2) ? t1 : t2;

    always @(posedge clk) begint1

  • 25

    Synthesis

    ● CDFG and ALAP output of Front-End Synthesis Level

    2

    1

    0

  • 26

    Synthesis

    5: EQ--

    6: EQ--

    CLB0 CLB1

    5: EQ8: LUT

    -

    6: EQ7: XOR

    -

    5: EQ8: AND4: ADD

    6: EQ7: XOR9: MUX

    ● M-HOT Place, Route, Schedule each height

  • 27

    Overview

    ● Motivation● Malibu Architecture● Synthesis● Results

  • 28

    Results

    ● Frequency (MHz) for each benchmark

  • 29

    Results

    ● Frequency (MHz) for each benchmark

    Coarse-grained circuits

  • 30

    Results

    ● Area vs. Performance tradeoff

  • 31

    Results

    ● Results (compared to Quartus II / Stratix III)

    M-HOT M-CAD

    Synthesis Time Improvement: 30.9x 77.0x

    User Clock Speed: 0.12x 0.07x

    Density: 1.48x 0.67x

    10x = 10 times better than the Quartus result 1x = same as Quartus0.1x = 1/10th the Quartus result (10 times worse)

  • 32

    Future Work

    ● Improve Front-End Synthesis➔ Needs to be both coarse-grain and fine-grain aware➔ Coarse-grained optimizations

    ● Improve M-CAD and M-HOT➔ Possibly 2x-3x performance improvement

  • 33

    Thanks

    ● Purpose➔ Implement a circuit on a coarse-grained/fine-grained

    architecture

    ● Malibu Architecture➔ FPGA with time-multiplexed coarse-grained resources➔ Can trade density for performance

    ● Synthesis (M-CAD and M-HOT)➔ Fast, up to 250x faster than QuartusII➔ Fmax results within 1/10th of an FPGA

    Slide 1Slide 2Slide 3Slide 4Slide 5Slide 6Slide 7Slide 8Slide 9Slide 10Slide 11Slide 12Slide 13Slide 14Slide 15Slide 16Slide 17Slide 18Slide 19Slide 20Slide 21Slide 22Slide 23Slide 24Slide 25Slide 26Slide 27Slide 28Slide 29Slide 30Slide 31Slide 32Slide 33