reconfigurable

Upload: kimmitera

Post on 10-Apr-2018


  • 8/8/2019 Reconfigurable

    1/85

    Reconfigurable Computing

    Eduardo Sanchez

    HEIG-VD

    Eduardo Sanchez 2

    List of Contributors

    Laura Pozzi, Faculty of Informatics, University of Lugano, Lugano, Switzerland (Chapter 9)

    Brian C. Richards, Department of Electrical Engineering and Computer Sciences, University of California-Berkeley, Berkeley, California (Chapter 8)

    Eduardo Sanchez, School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne; and Reconfigurable and Embedded Digital Systems Institute, Haute Ecole d'Ingenierie et de Gestion du Canton de Vaud, Lausanne, Switzerland (Chapter 33)

    Lesley Shannon, School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada (Chapter 2)

    Satnam Singh, Programming Principles and Tools Group, Microsoft Research, Cambridge, United Kingdom (Chapter 16)

    Greg Stitt, Department of Computer Science and Engineering, University of California-Riverside, Riverside, California (Chapter 26)

    Russell Tessier, Department of Computer and Electrical Engineering, University of Massachusetts, Amherst, Massachusetts (Chapter 30)

    Keith D. Underwood, Computation, Computers, Information and Mathematics Center, Sandia National Laboratories, Albuquerque, New Mexico (Chapter 31)

    Andres Upegui, Logic Systems Laboratory, School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland (Chapter 33)

    Frank Vahid, Department of Computer Science and Engineering, University of California-Riverside, Riverside, California (Chapter 26)

    John Wawrzynek, Department of Electrical Engineering and Computer Sciences, University of California-Berkeley, Berkeley, California (Chapters 8 and 9)

    Nicholas Weaver, International Computer Science Institute, Berkeley, California (Chapter 18)

    Joseph Yeh, Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, Massachusetts (Chapter 9)

    Peixin Zhong, Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan (Chapter 29)

    Reconfigurable computing

    Methods for execution of algorithms:
    - hardwired technology: high performance
    - software-programmed microprocessors: high flexibility

    Why are hardwired solutions faster than software solutions?

    Reconfigurable computing is intended to fill the gap between hard and soft, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware (Compton and Hauck, Reconfigurable computing, ACM Computing Surveys, June 2002)

    Reconfigurable computing: systems incorporating some form of hardware programmability
    - when we talk about reconfigurable computing, we are usually talking about FPGA-based systems design

    Main motivations:
    - accelerators for computing-intensive applications
    - tools for system validation: prototyping, emulation

    Moore's law


    Transistors (predicted for 2008) | Transistors (actual in 2004)

    Moore's law (2x - 12 months): 438.4 trillion
    Moore's law (2x - 18 months): 14 billion
    Moore's law (2x - 24 months): 128 million
    Pentium 4: 800 million
    Itanium 9150: 1.7 billion


    Intel introduces a new chip-fabrication process every two years:
    - 2001: 0.13 micron
    - 2003: 90 nm
    - 2005: 65 nm
    - 2007: 45 nm
    - 2009: 32 nm
    - ...

    Problem: Wirth's law

    - Software is slowing faster than hardware is accelerating
    - Expressed in Biblical cadences: "Grove giveth, and Gates taketh away" (Andy Grove, Intel Chairman; Bill Gates, Microsoft President)

    Computing requirements increase even faster than Moore's law

    Problem: time to market

    Problem: power consumption

    [Figure labels: 400 MHz, 200 MHz, 100 MHz, 50 MHz]

    Embedded systems

    Embedded systems, which are hidden from the user and cannot usually be manipulated or reprogrammed, are found in virtually all electronic equipment used today, from wireless telephones and DVD players to cars and airplanes

    A fifth of the value of each car produced in the EU is due to embedded electronics, a value that is expected to rise to about 40 percent by 2015

    Pervasive computing

    "The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it" (Mark Weiser, "The Computer for the 21st Century", Scientific American, September 1991)

    In the near future, the computer will disappear by being everywhere: it will be ubiquitous, pervasive

    Pervasive systems will be so integrated with their users that they will be invisible; they will disappear

    Evolution of computer systems:
    - mainframes: one computer, many users
    - PCs: one computer, one user
    - pervasive systems: many computers, one user

    distributed systems:
    - remote communication
    - fault tolerance
    - high availability
    - remote information access
    - distributed security

    mobile computing:
    - mobile networking
    - adaptive applications
    - energy-aware systems
    - mobile information access
    - location sensitivity

    pervasive systems:
    - smart spaces
    - invisibility
    - localized scalability
    - uneven conditioning

    Smart Dust project

    A 20 MIPS CPU embedded in a shoe
    - Four times more powerful than early Silicon Graphics workstations (Motorola 68000)

    A lot of new devices to design

    [Figure: classification of integrated circuits: standard circuits; full-custom (ASIC): hand-made, libraries; semi-custom: mask programmable (gate array, ROM) and field programmable (PROM, PAL, PLA, CPLD, FPGA)]

    Field Programmable Gate Arrays

    - Array of logic cells
    - Each cell is able to implement a logic function, chosen among several possible functions: the choice is made by programming
    - Interconnections between cells are also programmable
    - Two types, depending on the cell complexity:
      - fine grain
      - coarse grain
    - Two types, depending on the programming mode:
      - RAM: every logic cell contains a LUT (look-up table), accompanied by a flip-flop, and all interconnected with programmable routing pathways
      - anti-fuses

    [Figure: FPGA structure: logic cells and I/O cells, with programmable functions and programmable interconnections set by the configuration]

    [Figure: programmable logic blocks connected by programmable interconnect]

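    Required function: y = (a & b) | !c

    [Figure: an AND gate (&) on a and b feeds an OR gate (|) together with the inverted c, producing y]

    Truth table:

    a b c | y
    0 0 0 | 1
    0 0 1 | 0
    0 1 0 | 1
    0 1 1 | 0
    1 0 0 | 1
    1 0 1 | 0
    1 1 0 | 1
    1 1 1 | 1

    Programmed LUT: eight SRAM cells hold the y column of the truth table, and an 8:1 multiplexer driven by a, b and c selects one of them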
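    The LUT mechanism can be sketched in a few lines of Python. This is only an illustrative model, not vendor code: the names program_lut, read_lut and required are made up for this example, and the bit ordering (a as the most significant select bit) is an assumption.

```python
# Minimal model of a 3-input LUT: 8 SRAM cells read through an 8:1 multiplexer.
# All names are illustrative; bit order (a = MSB of the select index) is assumed.

def program_lut(func):
    """Fill the 8 SRAM cells by evaluating func for every (a, b, c) combination."""
    cells = []
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        cells.append(func(a, b, c) & 1)
    return cells


def read_lut(cells, a, b, c):
    """8:1 multiplexer: the inputs a, b, c select one SRAM cell."""
    return cells[(a << 2) | (b << 1) | c]


def required(a, b, c):
    """The required function y = (a & b) | !c, on single-bit values."""
    return (a & b) | (1 - c)


lut = program_lut(required)
print(lut)                      # contents of the SRAM cells, address 000 first
print(read_lut(lut, 1, 1, 1))   # value of y for a = b = c = 1
```

    Changing the function implemented by the cell only means rewriting the eight stored bits; the multiplexer itself never changes.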

    An example of logic cell:

    [Figure: logic cell with a LUT (inputs G1 to G4), carry and control logic (Cin, Cout, BY) and a flip-flop (D, EC, SP, RC, Q); outputs Y, YB and YQ]

    Functional frequencies are design-dependent

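    [Figure: simplified logic cell: a block usable as a 4-input LUT, a 16x1 RAM or a 16-bit SR (inputs a, b, c, d; output y), followed by a mux (input e) and a flip-flop (output q) with clock, clock enable and set/reset]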

    [Figure: a slice contains two logic cells (LC), each made of a LUT (4-input LUT, 16x1 RAM or 16-bit SR), a MUX and a REG]

    [Figure: a configurable logic block (CLB) contains four slices of two logic cells each]

    [Figure: columns of embedded RAM blocks among arrays of programmable logic blocks]

    [Figure: RAM blocks, multipliers and logic blocks]

    [Figure: MAC (multiply-and-accumulate): a multiplier with inputs A[n:0] and B[n:0], followed by an adder and an accumulator, with output Y[(2n - 1):0]]

    [Figure: "The Stripe", holding the microprocessor core, special RAM, peripherals and I/O, etc., next to the main FPGA fabric]

    [Figure: (a) one embedded core, (b) four embedded cores]

    [Figure: configuration chain of SRAM cells and I/O pins/pads, from "configuration data in" to "configuration data out"]

    Mode pins | Mode
    0 0       | Serial load with FPGA as master
    0 1       | Serial load with FPGA as slave
    1 0       | Parallel load with FPGA as master
    1 1       | Parallel load with FPGA as slave

    [Figure: serial configuration: a memory device, under FPGA control, sends configuration data to the FPGA's Cdata In pin; Cdata Out passes configuration data out]

    [Figure: parallel configuration with an address bus: the FPGA sends control and address signals to the memory device and receives configuration data [7:0] on Cdata In[7:0]]

    [Figure: parallel configuration without an address bus: the FPGA controls the memory device and receives configuration data [7:0] on Cdata In[7:0]]

    [Figure: configuration through a microprocessor: the microprocessor reads the memory device (control, address, data) and drives the FPGA's Cdata In[7:0] through a peripheral port, etc.]

    Total area = active logic + configuration memory + interconnect

    Advantages over PLDs:
    - enhanced flexibility
    - reduced board space, power and cost
    - increased performance

    Advantages over ASICs:
    - reprogrammability
    - off-the-shelf availability
    - zero NRE (non-recurring engineering) costs
    - reduced time-to-market
    - ease of use

    [Figure: cumulative NRE + unit cost versus cumulative volume (K units), for ASIC .25, ASIC .15, FPGA .25 and FPGA .15]

    - ASIC costs start higher, but the slope is flatter
    - For each technology advance, FPGAs become more cost-effective
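    The cost comparison can be made concrete with a small sketch. It assumes a simple linear model, cumulative cost = NRE + unit cost x volume, which is one reading of the figure; every number below is hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Linear cost model (an assumption): cumulative cost = NRE + unit cost * volume.
# All monetary figures below are hypothetical, not taken from the slides.

def total_cost(nre, unit_cost, volume):
    """Cumulative cost of producing `volume` units."""
    return nre + unit_cost * volume


def break_even_volume(asic_nre, asic_unit, fpga_nre, fpga_unit):
    """Volume at which ASIC and FPGA cumulative costs are equal.

    Returns None when the FPGA unit cost is not higher than the ASIC one,
    because the two cost lines then never cross in the ASIC's favour.
    """
    if fpga_unit <= asic_unit:
        return None
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit)


ASIC_NRE, ASIC_UNIT = 1_000_000.0, 5.0   # high start, flat slope
FPGA_NRE, FPGA_UNIT = 0.0, 25.0          # zero NRE, steeper slope

volume = break_even_volume(ASIC_NRE, ASIC_UNIT, FPGA_NRE, FPGA_UNIT)
print(volume)
print(total_cost(ASIC_NRE, ASIC_UNIT, volume), total_cost(FPGA_NRE, FPGA_UNIT, volume))
```

    Below the break-even volume the FPGA is cheaper; above it the ASIC wins. A cheaper FPGA unit cost pushes the break-even point to higher volumes.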

    As performance requirements increase, the implementation of control elements in embedded applications is moving from 8 bits to 32 bits

    At the same time, the implementation vehicle of choice for embedded applications is moving from ASICs to FPGAs due to cost and time-to-market pressures

    Synthesis methodology

    [Figure: design entry (schematic from a graphic editor, or VHDL), followed by partition, placement and routing, producing the configuration bit-string]

    [Figure: register transfer level (RTL) flow: RTL, logic synthesis, gate-level netlist, place-and-route; a logic simulator performs RTL functional verification and gate-level functional verification]

    [Figure: design entry styles: graphical state diagram, graphical flowchart, block-level schematic and textual HDL, under a top-level block-level schematic]

    Textual HDL:

    When clock rises
      If (s == 0)
        then y = (a & b) | c;
        else y = c & !(d ^ e);
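    The textual HDL fragment of this slide ("When clock rises ...") can be mimicked in software to check its behaviour. The sketch below is a plain Python model, not HDL: the class name ClockedY and its tick method are invented for illustration, and all signals are assumed to be single bits (0 or 1).

```python
# Software model of the slide's textual HDL; names are illustrative only.

def next_y(s, a, b, c, d, e):
    """Value that y takes at a rising clock edge (all signals are 0 or 1)."""
    if s == 0:
        return (a & b) | c
    return c & (1 - (d ^ e))   # 1 - x plays the role of !x on a single bit


class ClockedY:
    """Holds y between clock edges, like the register implied by the HDL."""

    def __init__(self):
        self.y = 0

    def tick(self, s, a, b, c, d, e):
        """Model one rising clock edge."""
        self.y = next_y(s, a, b, c, d, e)
        return self.y


reg = ClockedY()
print(reg.tick(0, 1, 1, 0, 0, 0))   # s == 0 branch: (a & b) | c
print(reg.tick(1, 0, 0, 1, 1, 1))   # s == 1 branch: c & !(d ^ e)
```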


    Intellectual property (IP)

    A semiconductor IP block is a predesigned function to be implemented in a semiconductor device. In some cases, the functions are parameterizable, allowing a degree of customization. These functions include physical library functions (analog or digital), basic blocks (such as counters and muxes) and system-level macros (also known as cores or virtual components), including memory blocks

    Market:
    - 1999: 442 million dollars (semiconductors total: 196,136 M$)
    - 2000: 620 million dollars (semiconductors total: 231,601 M$)
    - 2004: 2,940 million dollars (semiconductors total: 339,545 M$)

    System-on-a-chip (SOC)

    [Figure: SOC implemented as ASIC or FPGA; labels: expensive circuit, lower performance, higher consumption; lower development cost, faster adaptation to change]

    ASIC SOC

                     2000      2004      2011
    Technology       0.18      0.09      0.05
    Gates/cm2        5x10^6    30x10^6   200x10^6
    MIPS/watt        1000      2200      4000
    SRAM Mb/cm2      20        120       240

    SoC today

    [Figure: dual-core die with power management / frequency boost (Foxton Technology)]

    [Figure: complexity (log scale) versus year (1982, 1992, 2002, 2012); labels: processors / DSPs, batteries, gap, new architectures]

    Microelectronics systems tomorrow (2012): technology

    Platform based design

    [Figure: SoC platform with arbitration, CPU, RAM, ROM, Flash, DSP, MPEG2, PCI, PCI-X and USB 2.0 controller blocks]

    - TEST: BIST (built-in self test)
    - REUSE: IP core assembling
    - RECONFIGURABLE ARCHITECTURES: reconfigurable cores, FPGA, ...

    Platform based design: SoC example

    [Figure: B-CDMA(TM) SoC layout: ARM7 RISC processor, 90 MIPS TI DSP sub-system (C54x), 20 MHz A/D converters, SARAM (8K, 8K, 4K) and DARAM (3 x 2K) blocks, analog and digital PLL/VCOs, Viterbi decoder with state metric RAM and traceback RAM, B-CDMA(TM) modem logic]

    Context: motivations

    Main motivations for the use of reconfigurable architectures:

    Accelerators for computing-intensive applications
    - Why can they be faster than processors?
    - What kind of reconfigurable architecture should be used?
    - Should they be used as stand-alone solutions?
    - Why aren't they used today?

    Tools for system validation: prototyping (emulation)
    - Why is simulation not suitable?
    - Should prototyping replace simulation?
    - How much confidence should be put in prototyping results?

    MicroBlaze soft processor

    - Thirty-two 32-bit general purpose registers
    - 32-bit instruction word with three operands and two addressing modes
    - Separate 32-bit instruction and data buses that conform to IBM's OPB (On-chip Peripheral Bus) specification
    - Separate 32-bit instruction and data buses with direct connection to on-chip block RAM through an LMB (Local Memory Bus)
    - 32-bit address bus
    - Single issue, 3-stage pipeline (instruction fetch, operand fetch, execution)
    - Hardware multiplier
    - Big-endian

    Virtex-II Pro family

    - Virtex-II Pro FPGAs provide up to four embedded 32-bit IBM PowerPC 405 RISC processors, each delivering over 420 Dhrystone MIPS at 300 MHz:
      - 16KB data / 16KB instruction caches
      - memory management unit with variable page size (1KB-16MB)
      - five-stage datapath pipeline
      - integer multiply/divide unit
      - 32x32 bit general purpose registers
      - dedicated on-chip memory interface
      - it takes up as little as 2% of the total die area of the XC2VP50
      - it does not have a hardware floating point unit
    - Up to twenty-four on-chip 3.125 Gbps Rocket I/O transceivers
    - Based on a 0.13 µm, 9-layer copper/low-K dielectric technology

    Virtex-4 family from Xilinx

    Columnar architecture

    A Configurable Logic Block (CLB) contains 4 interconnected slices

    A simplified view of the slice is:

    BRAM, multipliers (DSP) blocks

    - Integrated PowerPC 405
    - Fully integrated Ethernet Media Access Controller (EMAC)
    - Bitstreams encrypted with 256-bit AES algorithm
    - 90-nm, 11-layer technology
    - 500 MHz for memory and multipliers
    - Lowest power

    Three platforms:
    - LX platform (logic, memory, DCMs, DSP): optimized for high-performance logic
    - SX platform (logic, memory, DCMs, DSP): optimized for high-performance signal processing
    - FX platform (logic, memory, DCMs, DSP, RocketIO, PowerPC): optimized for embedded processing and high-speed serial connectivity

    Device      Logic cells  Block RAM [Kb]  DCM  SelectIO  XtremeDSP slices  PowerPC  10/100/1000 EMAC  RocketIO transceivers
    XC4VLX15    13,824       864             4    320       32                -        -                 -
    XC4VLX25    24,192       1,296           8    448       48                -        -                 -
    XC4VLX40    41,472       1,728           8    640       64                -        -                 -
    XC4VLX60    59,904       2,880           8    640       64                -        -                 -
    XC4VLX80    80,640       3,600           12   768       80                -        -                 -
    XC4VLX100   110,592      4,320           12   960       96                -        -                 -
    XC4VLX160   152,064      5,184           12   960       96                -        -                 -
    XC4VLX200   200,448      6,048           12   960       96                -        -                 -
    XC4VSX25    23,040       2,304           4    320       128               -        -                 -
    XC4VSX35    34,560       3,456           8    448       192               -        -                 -
    XC4VSX55    55,296       5,760           8    640       512               -        -                 -
    XC4VFX12    12,312       648             4    320       32                1        2                 -
    XC4VFX20    19,224       1,224           4    320       32                1        2                 8
    XC4VFX40    41,904       2,592           8    448       48                2        4                 12
    XC4VFX60    56,880       4,176           12   576       128               2        4                 16
    XC4VFX100   94,896       6,768           12   768       160               2        4                 20
    XC4VFX140   142,128      9,936           20   896       192               2        4                 24

    Stratix II family from Altera

    Each Logic Array Block (LAB) contains 8 Adaptive Logic Modules (ALM)

    Nios soft processor

    - RISC-like processor
    - Full 32-bit instruction set, data path and address space
    - 32 general-purpose registers
    - 32 external interrupt sources
    - Single-instruction 32x32 multiply and divide producing a 32-bit result
    - Single-instruction barrel shifter
    - 6-stage pipeline
    - Branch prediction

    Fusion family from Actel

    - This FPGA family integrates the standard programmable logic with configurable analog and Flash memory
    - Configurable analog-to-digital converter (ADC), supporting resolutions up to 12 bits, and sample rates up to 600 k samples per second
    - A 32-bit ARM7 soft-core is available

    VersaTile configurations:

    Virtex-5 family from Xilinx

    Stratix III family from Altera

    Coarse grain RAs

    Statement:

    Digital signal processing is more and more arithmetic-oriented, that is to say that operations are more at the word level (+, -, *, ...) than at the bit level

    Example: digital filtering (FIR): K multiplies, K sums
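    As a reminder of why a FIR filter is word-level arithmetic, here is a minimal direct-form sketch. It assumes integer samples, K coefficients held in a list, and zero input before the first sample; the function name fir is illustrative.

```python
# Direct-form FIR filter: each output needs K multiplies and K accumulations.
# Assumptions: integer data, x[n] = 0 for n < 0, names chosen for illustration.

def fir(samples, coeffs):
    """Return y[n] = sum over k of coeffs[k] * samples[n - k]."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * samples[n - k]   # one multiply and one sum per tap
        out.append(acc)
    return out


print(fir([1, 0, 0, 0], [1, 2, 3]))   # impulse response: the coefficients themselves
print(fir([1, 1, 1, 1], [1, 2, 3]))   # step response
```

    Every operation in the inner loop is a full-word multiply or add, which is the kind of work a word-level block handles directly and a bit-level LUT fabric has to assemble from many cells.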

    Coarse grain RAs

    Computation example: S = A·x² + B·y + C

    - Inputs: x, y, A, B, C
    - Intermediate results: x², B·y, A·x², B·y + C
    - Result: S = A·x² + B·y + C
    - #CYCLES: 3

    Example targeting FPGA: S = A·x² + B·y + C

    - The arithmetic operator is highly combinational, with high logic complexity
    - It is spread over many logic cells (LC)
    - Higher propagation delay than dedicated logic
    - Higher propagation delay than non-programmable interconnect

    Computation time: N x Tc
    - N: number of cycles
    - Tc: cycle time, which cannot be shorter than Tp
    - Tp: propagation time (critical path)
    - Tp = TLUT + TROUTING, higher than in dedicated circuits

    Spatial execution (parallel), but higher cycle time
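    The timing relation of this example (computation time N x Tc, with Tp = TLUT + TROUTING) can be tried out numerically. The sketch below assumes the cycle time is set equal to the critical-path delay Tp; the delay values are invented for illustration and do not come from any datasheet.

```python
# Computation time = N x Tc, assuming Tc = Tp = TLUT + TROUTING.
# The nanosecond values used below are hypothetical.

def propagation_time_ns(t_lut_ns, t_routing_ns):
    """Critical-path delay Tp = TLUT + TROUTING."""
    return t_lut_ns + t_routing_ns


def computation_time_ns(n_cycles, t_lut_ns, t_routing_ns):
    """N cycles, each as long as the critical path."""
    return n_cycles * propagation_time_ns(t_lut_ns, t_routing_ns)


# Hypothetical FPGA mapping: 2 ns of LUT delay and 8 ns of routing delay.
t_p = propagation_time_ns(2.0, 8.0)
print(t_p)                                  # critical-path delay in ns
print(8.0 / t_p)                            # share of the delay spent in routing
print(computation_time_ns(1, 2.0, 8.0))     # spatial execution: a single long cycle
```

    With these made-up numbers routing accounts for most of the cycle, so shortening the LUT delay alone would change little.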

    Coarse grain RAs

    Motivations:

    FPGAs suffer from:
    - low functional frequencies (relatively low performance), mainly because of routing delays: ~80% of the propagation time
    - high reconfiguration cost (50-100:1), because ~90% of the area is occupied for reconfigurable purposes (LUT + routing)
    - higher power consumption (compared to ASICs), probably because of programmable interconnect and LUT memory
    - routing problems: it is hard to wire every net in the design, and usually hard to go above 90% of the FPGA capacity

    FPGAs: reconfiguration cost

    [Figure: transistors per chip (log scale, 1000 to 10 000 000 000 000) versus year (1980 to 2010), for memory, microprocessors, FPGA physical, FPGA logical and FPGA routed]

    Coarse grain RAs

    Basic idea: replace bit-level logic blocks by word-level logic blocks (CLB, LC -> RC, CFB)

    FINE GRAIN: CLB
    - granularity: BIT
    - adapted to: encryption, prototyping
    - built from LUTs
    - high reconfiguration cost overhead
    - low functional frequencies

    COARSE GRAIN: RC
    - granularity: WORD
    - adapted to: DSP, data-oriented processing
    - built from ALU, MULT, registers, MUXes
    - low reconfiguration cost overhead
    - high level of performance

    Coarse grain RAs

    From CLBs to CFBs (Configurable Function Blocks)

    [Figure: 4 x 4 array of CFBs, each containing ALU, MULT, registers and MUXes]

    - A single CFB can handle an arithmetic operation on its own (e.g. multiplication)
    - Since multipliers and adders are hardwired, they are much more efficient (higher frequencies, lower area)
    - Each CFB is like a small µP / DSP

    LET'S REVIEW THE MICROPROCESSOR ARCHITECTURES

    Coarse grain RAs

    A simple coarse grain processing unit = small CPU controller

    Constitution:
    - Optimized datapath (16 bits)
    - Register file (4 x 16 bits)
    - Hardwired ALU and multiplier

    Features:
    - Complex computations in local mode (FIR, IIR, WT)
    - Low silicon area (0.07 mm², 0.18 µm CMOS process)
    - Single-cycle operations (e.g. MAC + register load)

    Coarse grain RAs

    Microprocessor basics

    [Figure: a CONTROLLER (state machine, steps t1, t2, t3) drives a DATAPATH in which registers A, B and C and the ALU share a BUS]

    - Arithmetic and Logic Unit (ALU)
    - Register file
    - Tristate components (inputs / outputs)

    Coarse grain RAs

    A single accumulator machine

    [Figure: memory (n-bit words, m-bit addresses) with load path, store path and instruction path; MAR, PC (with incr), IR (opcode and address operand), FSM issuing LD and function controls, branch path; legend: data flow, control signals]

    MAR: Memory Address Register
    IR: Instruction Register
    PC: Program Counter register

    Instruction:
    - bits 15-14: Opcode (00: Load, 01: Store, 10: Add, 11: Branch)
    - bits 13-0: Address

    Single address instruction: one of the registers is fixed (= accumulator); AC is an implicit operand

    AC := AC op Memory(Address)
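    The instruction format of the single accumulator machine (a 2-bit opcode in bits 15-14 and a 14-bit address in bits 13-0) can be decoded with two bit operations. The sketch below is illustrative: the names OPCODES and decode are not from the slides.

```python
# Decoding the 16-bit single-address instruction: 2-bit opcode, 14-bit address.
# The helper names are assumptions made for this example.

OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(instruction):
    """Split a 16-bit instruction word into (opcode name, address)."""
    opcode = (instruction >> 14) & 0b11    # bits 15-14
    address = instruction & 0x3FFF         # bits 13-0
    return OPCODES[opcode], address


name, address = decode(0b1000110100110011)
print(name, format(address, "014b"))
```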

    Coarse grain RAs

    A single accumulator machine: executing one instruction

    Memory: 16 bits wide, 16K words (14-bit address)
    PC = 10110100110011
    Instruction in memory: 1000110100110011 (opcode 10 = Add, address 00110100110011)

    1. Instruction fetch:
    - PC is moved into MAR
    - Read from memory
    - Load instruction into IR

    2. Instruction decode:
    - Opcode bits to FSM (ADD)
    - Rest of bits is operand address

    3. Operand fetch:
    - IR -> MAR
    - Read data from memory (0101010101110001)

    4. Instruction execute:
    - Memory to ALU B
    - AC to ALU A
    - ALU Add
    - S to AC (0011001101110110 + 0101010101110001 = 1000100011100111)

    5. Housekeeping:
    - Increment PC (10110100110011 -> 10110100110100)

    MAR: Memory Address Register
    IR: Instruction Register
    PC: Program Counter register
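    The five steps of the single accumulator machine (fetch, decode, operand fetch, execute, housekeeping) can be replayed with a small simulator using the bit patterns from the slides. This is a sketch under assumptions: the class and method names are invented, memory is modelled as a Python dict, Branch is taken to be an unconditional jump, and Store is assumed to write AC to memory, none of which the slides spell out.

```python
# Toy simulator of the single accumulator machine (16-bit words, 14-bit addresses).
# Assumptions: dict-based memory, unconditional Branch, Store writes AC to memory.

LOAD, STORE, ADD, BRANCH = 0b00, 0b01, 0b10, 0b11
ADDRESS_MASK = 0x3FFF
WORD_MASK = 0xFFFF


class AccumulatorMachine:
    def __init__(self, memory, pc=0, ac=0):
        self.memory = memory
        self.pc, self.ac = pc, ac
        self.mar = self.ir = 0

    def step(self):
        # 1. Instruction fetch: PC -> MAR, read memory, load IR
        self.mar = self.pc
        self.ir = self.memory[self.mar]
        # 2. Instruction decode: opcode to the FSM, the rest is the operand address
        opcode = (self.ir >> 14) & 0b11
        # 3. Operand fetch: address part of IR -> MAR
        self.mar = self.ir & ADDRESS_MASK
        # 4. Instruction execute
        if opcode == LOAD:
            self.ac = self.memory[self.mar]
        elif opcode == STORE:
            self.memory[self.mar] = self.ac
        elif opcode == ADD:
            self.ac = (self.ac + self.memory[self.mar]) & WORD_MASK
        else:  # BRANCH (assumed unconditional)
            self.pc = self.mar
            return
        # 5. Housekeeping: increment PC
        self.pc = (self.pc + 1) & ADDRESS_MASK


memory = {
    0b10110100110011: 0b1000110100110011,   # Add instruction at the PC address
    0b00110100110011: 0b0101010101110001,   # operand
}
machine = AccumulatorMachine(memory, pc=0b10110100110011, ac=0b0011001101110110)
machine.step()
print(format(machine.ac, "016b"), format(machine.pc, "014b"))
```

    After one step the accumulator and program counter hold the values shown on the slides for the execute and housekeeping phases.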

    Coarse grain RAs

    A simple microprocessor : Architecture

    [Figure: datapath built around a block of 16x16 registers, with an address bus to memory, a data bus to/from memory, and two signals going to the controller (FSM).]


    Coarse grain RAs

    A simple microprocessor : Instruction format

    [Table: columns "Instruction format", "Instruction" and "Action"; the rows did not survive extraction (remaining labels: shift, or).]

    Coarse grain RAs

    A simple microprocessor : Instruction format


    Coarse grain RAs

    A simple microprocessor : Instruction format

    0000 7C0A ;

    0001 8C00 ; LOAD RC, #A

    0002 7B04 ; ...

    0003 7A0A ; ...

    0004 9C7C ; ...

    0005 611A ; ...

    0006 614B ; ...

    ...

    Coarse grain RAs

    A simple microprocessor : test program

    What will it do ?


    [Figure: a 4 x 4 array of CFBs; each CFB contains an ALU and multiplier (ALU, MULT), registers and MUXes.]

    Coarse grain RAs

    From CLBs to CFBs (Configurable Function Block)

    - A single CFB can handle an arithmetic operation (e.g., multiplication) on its own
    - Since multipliers and adders are hardwired, they are much more efficient (higher frequency, smaller area)
    - Set up an array of CFBs whose architecture is similar to a µP datapath

    We reviewed the µP basics; the idea is as follows:

    In abstract :

    - Instructions configure both the PEs and the interconnect every cycle

    In reality :

    - The instruction bandwidth and memory this requires are too high, so a COMPROMISE is needed

    Coarse grain RAs

    Coarse grain RA model
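The abstract model, in which an instruction configures both the PEs and the interconnect every cycle, can be sketched as follows. Everything here is hypothetical and chosen for illustration: the `run_cycle` function, the `(op, src_a, src_b)` configuration tuple, and the three operations are assumptions, not a description of any real coarse-grain array.

```python
import operator

# Operations a CFB can be configured to perform (illustrative choice).
OPS = {"add": operator.add, "mul": operator.mul, "pass": lambda a, b: a}


def run_cycle(config, regs, inputs):
    """Apply one configuration to every CFB and return the new register values.

    config holds one (op, src_a, src_b) tuple per CFB. A source is
    ("in", i) for external input i or ("cfb", j) for the registered
    output of CFB j, which models the configurable interconnect.
    """
    def read(src):
        kind, index = src
        return inputs[index] if kind == "in" else regs[index]

    # Every CFB reads the old register values; all registers update together.
    return [OPS[op](read(a), read(b)) for op, a, b in config]


# Two configurations (one per cycle) computing (x * y) + z on a 2-CFB array.
program = [
    [("mul", ("in", 0), ("in", 1)), ("pass", ("in", 2), ("in", 2))],
    [("add", ("cfb", 0), ("cfb", 1)), ("pass", ("cfb", 1), ("cfb", 1))],
]
regs = [0, 0]
for config in program:
    regs = run_cycle(config, regs, inputs=[3, 4, 5])
print(regs[0])  # prints 17
```

Even in this toy, one configuration tuple per CFB is needed every cycle, which hints at why the instruction bandwidth grows quickly with the size of the array.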


    [Figure: the actual available hardware holds the instructions currently in hardware; the remaining instructions are paged out.]

    Coarse grain RAs

    Reconfigurable computing: dynamic reconfiguration
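One way to picture instructions being paged in and out of a limited amount of hardware is a small cache of configurations. The sketch below is an assumption-laden illustration: the `ConfigCache` class, the least-recently-used replacement policy, and the instruction names and "bitstream" strings are all hypothetical, and the slides do not say which policy a real system uses.

```python
from collections import OrderedDict


class ConfigCache:
    """Holds at most `slots` instruction configurations in hardware."""

    def __init__(self, slots):
        self.slots = slots
        self.loaded = OrderedDict()   # instruction name -> configuration
        self.reconfigurations = 0

    def execute(self, name, library):
        if name in self.loaded:
            self.loaded.move_to_end(name)        # mark as most recently used
        else:
            if len(self.loaded) >= self.slots:
                self.loaded.popitem(last=False)  # page out the least recently used
            self.loaded[name] = library[name]    # page in (reconfigure)
            self.reconfigurations += 1
        return self.loaded[name]


# Hypothetical library of instruction configurations.
library = {"fir": "bitstream-fir", "fft": "bitstream-fft", "aes": "bitstream-aes"}
cache = ConfigCache(slots=2)
for name in ["fir", "fft", "fir", "aes", "fft"]:
    cache.execute(name, library)
print(list(cache.loaded), cache.reconfigurations)  # prints ['aes', 'fft'] 4
```

With two slots and three instructions, four of the five calls trigger a reconfiguration, which illustrates why the choice of what stays in hardware matters.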

    Relationship of communication among processors

    - Shared clock (Pipelined)
    - Shared registers (VLIW)
    - Shared memory (SMM)
    - Shared network
    - Shared bus
    - Something not shared, and that's better

    Communications

    Coarse grain RAs


    [Figure: a single instruction applied to a single data stream (Data in -> Data out).]

    Coarse grain RAs

    SISD : Single Instruction Single Data

    [Figure: a single instruction applied to multiple data streams (Data in -> Data out).]

    Coarse grain RAs

    SIMD : Single Instruction Multiple Data


    [Figure: three instructions (instruction1, instruction2, instruction3) applied to a single data stream (Data in -> Data out).]

    Coarse grain RAs

    MISD : Multiple Instructions Single Data

    [Figure: multiple instructions, each applied to its own data stream (Data in -> Data out).]

    Coarse grain RAs

    MIMD : Multiple Instructions Multiple Data
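The four classes can be contrasted with a minimal sketch in which an "instruction" is just a function and a "data stream" is a value or a list of values. The function names are hypothetical, and the MISD variant assumes each unit applies its own instruction to the same input, which is only one reading of the figure.

```python
def sisd(instr, x):
    """Single instruction, single data."""
    return instr(x)


def simd(instr, xs):
    """Single instruction applied to every data element."""
    return [instr(x) for x in xs]


def misd(instrs, x):
    """Several instructions applied to the same data element (assumed reading)."""
    return [instr(x) for instr in instrs]


def mimd(instrs, xs):
    """Each instruction applied to its own data element."""
    return [instr(x) for instr, x in zip(instrs, xs)]


def double(v):
    return 2 * v


def negate(v):
    return -v


print(sisd(double, 3))                  # prints 6
print(simd(double, [1, 2, 3]))          # prints [2, 4, 6]
print(misd([double, negate], 3))        # prints [6, -3]
print(mimd([double, negate], [3, 4]))   # prints [6, -4]
```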


    Vector Processing {SIMD}

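    [Figure: a scalar add combines registers r1 and r2 into r3; a vector add combines v1 and v2 element by element into v3 over the vector length.]

    SCALAR (1 operation): add r3, r1, r2

    VECTOR (N operations): add.vv v3, v1, v2

    Vector processors have high-level operations that work on linear arrays of numbers: "vectors"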
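The difference between the two instructions can be sketched in a few lines. The Python function names `add` and `add_vv` simply mirror the mnemonics on the slide; the loop inside `add_vv` is sequential here, whereas a vector processor would perform the N additions with a single instruction.

```python
def add(r1, r2):
    """Scalar add: one operation, like `add r3, r1, r2`."""
    return r1 + r2


def add_vv(v1, v2):
    """Vector add: N element-wise operations, like `add.vv v3, v1, v2`."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return [a + b for a, b in zip(v1, v2)]


r3 = add(1, 2)
v3 = add_vv([1, 2, 3, 4], [10, 20, 30, 40])
print(r3, v3)  # prints 3 [11, 22, 33, 44]
```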

    Coarse grain RAs

    My Position [Jonathan Rose]

    ASICs are dead

    They Just Don't Know It Yet!

    Existing reconfigurable systems


    [Figure: full-custom design flow with the stages ARCHITECTURE SPECIFICATIONS, SCHEMATICS, BOARD, VERIFICATION, MASKS DRAWING and FABRICATION.]

    History: Circuit design (full custom) < 1980

    [Figure: full-custom design flow with the stages ARCHITECTURE SPECIFICATIONS, SCHEMATICS, ELECTRICAL SIMULATION (SPICE), VERIFICATION, MASKS DRAWING, DESIGN RULE CHECK (DRC) and FABRICATION.]

    History: Circuit design (full custom) 1980


    [Figure: ASIC design flow from APPLICATION SPECIFICATIONS to FABRICATION of a system built from PROCESSOR, MEM and ASIC blocks. Abstraction levels: behavioral, logical, physical, electrical; one step is marked "?", and placement and routing is another. CAD tools: functional simulation, schematics editor, logical simulation, post-layout simulation, verifications, test.]

    History: ASIC design 1980-1990

    [Figure: ASIC design flow from APPLICATION SPECIFICATIONS to FABRICATION of a system built from PROCESSOR, MEM and ASIC blocks. Levels: behavioral (SystemC, virtual prototypes), RTL, logical, physical, circuit. Steps and concerns: co-design, architectural synthesis, logic synthesis, physical synthesis, test, low power; one step is marked "?".]

    ASIC design - today


    Digital Systems Design

    Main VLSI design styles

    [Figure: tree of design styles with the labels Custom, Semicustom, Cell-Based, Array-Based, Hierarchical, Standard Cells, Macro Cells, Memory, PLA, Gate matrix, Gate Arrays, Sea-of-gates, Prediffused, Prewired, Antifuse based, Memory based.]

    SoCs can integrate all of these

    Digital Systems Design

    Main VLSI design styles

    [Figure: the same tree of design styles, with Semicustom, Cell-Based, Hierarchical and Standard Cells marked.]

    SoCs can integrate all of these


    So, why Hardware Design Languages?

    2 answers!

    - Because we need a means of modeling, and therefore simulating, complex digital systems before fabricating them. Logic simulators are used for that.
    - Because drawing 1B transistors by hand takes time. Logic synthesizers do part of the job for us!

    So, why Hardware Design Languages?

    Domains and levels of modeling


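    [Figure: standard-cell ASIC design. A behavioral description (Circuit.vhd) becomes a netlist (schematics) with signals a, b, c, d, e and s, which is mapped onto a standard-cell library and laid out as an ASIC; the steps are numbered 1 to 4.]

    Standard-cell ASIC design

    Digital Systems Design

    1. Design (VHDL, Verilog, ...)
    2. Logic synthesis
    3. Technological mapping
    4. Placement & Routing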

    Simulation at different levels:

    - Logic
    - Logic with delays
    - Electrical

    Logic synthesis


    Specifications

    -> DESIGN CAPTURE -> Behavioral HDL (Simulation)

    -> SYNTHESIS -> Netlist (Simulation)

    -> PLACE & ROUTE -> GDSII (Simulation)

    ASIC design flow

    So, why Hardware Design Languages?

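    Given that architecture:

    [Figure: a circuit with inputs a and b and outputs c=a+b and d=a-b.]

    In C we would write:

    int main(void) {
        int a = 3, b = 1, c, d;   /* example input values */
        c = a + b;                /* runs first */
        d = a - b;                /* runs second */
        return 0;
    }

    Sequential execution

    But hardware intrinsically operates in parallel

    Concurrent execution

    We need a language that can express concurrency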
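The contrast between sequential and concurrent execution can be sketched with a tiny evaluator in which every assignment reads the same snapshot of the signals and all results are committed together. This is a simplified illustration, not VHDL semantics in full; the function `concurrent_step` and the dictionary-of-signals representation are assumptions made for the example.

```python
def concurrent_step(signals, assignments):
    """Evaluate every assignment against the same snapshot, then commit all."""
    snapshot = dict(signals)
    updates = {target: expr(snapshot) for target, expr in assignments.items()}
    return {**signals, **updates}


# The adder and the subtractor from the slide, both active at the same time.
assignments = {
    "c": lambda s: s["a"] + s["b"],   # adder
    "d": lambda s: s["a"] - s["b"],   # subtractor
}
signals = concurrent_step({"a": 7, "b": 2, "c": 0, "d": 0}, assignments)
print(signals["c"], signals["d"])  # prints 9 5

# Order does not matter: a swap works without a temporary variable,
# which sequential code such as x = y; y = x; would get wrong.
swapped = concurrent_step(
    {"x": 1, "y": 2},
    {"x": lambda s: s["y"], "y": lambda s: s["x"]},
)
print(swapped)  # prints {'x': 2, 'y': 1}
```

For the adder and subtractor the final values are the same as in the C program; what changes is that neither result waits for the other.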


    VHDL basics

    VHDL : VHSIC Hardware Description Language

    - VHSIC : Very High Speed Integrated Circuits

    Hardware description language :

    - Allows description of concurrent tasks

    Two main goals :

    - Modeling (simulation)
    - Description (synthesis)

    Only a subset of VHDL is synthesizable