-
8/8/2019 Reconfigurable
1/85
Reconfigurable Computing
Eduardo Sanchez
HEIG-VD
List of Contributors
Laura Pozzi, Faculty of Informatics, University of Lugano, Lugano, Switzerland (Chapter 9)
Brian C. Richards, Department of Electrical Engineering and Computer Sciences, University of California-Berkeley, Berkeley, California (Chapter 8)
Eduardo Sanchez, School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne; and Reconfigurable and Embedded Digital Systems Institute, Haute Ecole d'Ingenierie et de Gestion du Canton de Vaud, Lausanne, Switzerland (Chapter 33)
Lesley Shannon, School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada (Chapter 2)
Satnam Singh, Programming Principles and Tools Group, Microsoft Research, Cambridge, United Kingdom (Chapter 16)
Greg Stitt, Department of Computer Science and Engineering, University of California-Riverside, Riverside, California (Chapter 26)
Russell Tessier, Department of Computer and Electrical Engineering, University of Massachusetts, Amherst, Massachusetts (Chapter 30)
Keith D. Underwood, Computation, Computers, Information and Mathematics Center, Sandia National Laboratories, Albuquerque, New Mexico (Chapter 31)
Andres Upegui, Logic Systems Laboratory, School of Computer and Communication Sciences, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland (Chapter 33)
Frank Vahid, Department of Computer Science and Engineering, University of California-Riverside, Riverside, California (Chapter 26)
John Wawrzynek, Department of Electrical Engineering and Computer Sciences, University of California-Berkeley, Berkeley, California (Chapters 8 and 9)
Nicholas Weaver, International Computer Science Institute, Berkeley, California (Chapter 18)
Joseph Yeh, Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, Massachusetts (Chapter 9)
Peixin Zhong, Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan (Chapter 29)
Reconfigurable computing
Methods for execution of algorithms:
- hardwired technology: high performance
- software-programmed microprocessors: high flexibility
Why are hardwired solutions faster than software solutions?
Reconfigurable computing is intended to fill the gap between hard and soft, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware (Compton and Hauck, Reconfigurable computing, ACM Computing Surveys, June 2002)
Reconfigurable computing: systems incorporating some form of hardware programmability
- when we talk about reconfigurable computing we are usually talking about FPGA-based systems design
Main motivations:
- accelerators for computing-intensive applications
- tools for system validation: prototyping, emulation
Moore's law
                                Transistors             Transistors
                                (predicted for 2008)    (actual in 2004)
Moore's law (2x - 12 months)    438.4 trillion
Moore's law (2x - 18 months)    14 billion
Moore's law (2x - 24 months)    128 million
Pentium 4                                               800 million
Itanium 9150                                            1.7 billion
Intel introduces a new chip-fabrication process every two years:
- 2001: 0.13 micron
- 2003: 90 nm
- 2005: 65 nm
- 2007: 45 nm
- 2009: 32 nm
- ...
Problem: Wirth's law
Software is slowing faster than hardware is accelerating
Expressed in Biblical cadences: "Grove giveth and Gates taketh away"
Andy Grove, Intel Chairman
Bill Gates, Microsoft President
Computing requirements increase even faster than Moore's law
Problem: time to market
Problem: power consumption
[Figure: curves labelled 400 MHz, 200 MHz, 100 MHz and 50 MHz]
Embedded systems
Embedded systems, which are hidden from the user and cannot usually be manipulated or reprogrammed, are found in virtually all electronic equipment used today, from wireless telephones and DVD players to cars and airplanes
A fifth of the value of each car produced in the EU is due to embedded electronics, a value that is expected to rise to about 40 percent by 2015
Pervasive computing
"The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it"
Mark Weiser, "The Computer for the 21st Century", Scientific American, September 1991
In the near future, the computer will disappear by being everywhere: it will be ubiquitous, pervasive
Pervasive systems will be so integrated with their users that they will be invisible; they will disappear
Evolution of computer systems:
- mainframes: one computer, many users
- PCs: one computer, one user
- pervasive systems: many computers, one user
[Figure: from distributed systems to mobile computing to pervasive systems]
- distributed systems: remote communication, fault tolerance, high availability, remote information access, distributed security
- mobile computing: mobile networking, adaptive applications, energy-aware systems, mobile information access, location sensitivity
- pervasive systems: smart spaces, invisibility, localized scalability, uneven conditioning
Smart Dust project
A 20 MIPS CPU embedded in a shoe
- four times more powerful than early Silicon Graphics workstations (Motorola 68000)
A lot of new devices to design
Integrated circuits
- Full-custom (ASIC)
  - hand-made
  - libraries
- Semi-custom
  - mask programmable: gate array, ROM
  - field programmable: PROM, PAL, PLA, CPLD, FPGA
- Standard circuits
Field Programmable Gate Arrays
- Array of logic cells
- Each cell is able to implement a logic function, chosen among several possible functions: the choice is made by programming
- Interconnections between cells are also programmable
- Two types, depending on the cells' complexity:
  - fine grain
  - coarse grain
- Two types, depending on the programming mode:
  - RAM: every logic cell contains a LUT (look-up table), accompanied by a flip-flop, all interconnected with programmable routing pathways
  - anti-fuses
[Figure: FPGA structure: logic cells and I/O cells, with programmable functions and programmable interconnections set by the configuration]
[Figure: programmable logic blocks surrounded by programmable interconnect]
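Required function: y = (a & b) | !c
[Figure: AND gate on a and b, followed by an OR gate with the inverted c, output y]
Truth table:
a b c | y
0 0 0 | 1
0 0 1 | 0
0 1 0 | 1
0 1 1 | 0
1 0 0 | 1
1 0 1 | 0
1 1 0 | 1
1 1 1 | 1
Programmed LUT: eight SRAM cells hold the y column, and an 8:1 multiplexer driven by a, b, c selects one of them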
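A LUT can be modelled in software as a small memory addressed by its inputs. The Python sketch below is only an illustration of that idea for the function y = (a & b) | !c; the names program_lut and lut_read are hypothetical and do not come from any vendor tool.

```python
def program_lut(func, n_inputs=3):
    """Return the SRAM contents: one output bit per input combination."""
    table = []
    for index in range(2 ** n_inputs):
        # Split the index into input bits, most significant first (a, b, c).
        bits = [(index >> (n_inputs - 1 - i)) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Model the multiplexer: the inputs form the address of one SRAM cell."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]


# "Programming" the LUT means filling the SRAM cells with the truth table.
sram = program_lut(lambda a, b, c: (a & b) | (1 - c))
print(sram)                      # contents of the 8 SRAM cells
print(lut_read(sram, 1, 1, 1))   # y for a = b = c = 1
```

Changing the implemented function only changes the contents of sram; lut_read, which stands for the fixed hardware, stays the same.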
An example of logic cell:
[Figure: logic cell with a LUT (inputs G1 to G4), carry and control logic (Cin, Cout, BY) and a flip-flop; outputs Y, YB and YQ]
Functional frequencies are design-dependent
[Figure: logic cell built from a 4-input LUT (also usable as a 16x1 RAM or a 16-bit shift register) with inputs a, b, c, d, a mux with input e, and a flip-flop with clock, clock enable and set/reset; outputs y and q]
Slice: two logic cells (LC), each made of a LUT (usable as a 4-input LUT, a 16x1 RAM or a 16-bit shift register), a MUX and a REG
Configurable logic block (CLB): four slices, each containing two logic cells
[Figure: columns of embedded RAM blocks between arrays of programmable logic blocks]
[Figure: RAM blocks, multipliers and logic blocks]
[Figure: multiply-and-accumulate (MAC): inputs A[n:0] and B[n:0] feed a multiplier, followed by an adder and an accumulator; output Y[(2n - 1):0]]
[Figure: the main FPGA fabric next to "the stripe", which holds the microprocessor core (uP), special RAM, peripherals and I/O, etc.]
[Figure: (a) one embedded core, (b) four embedded cores]
[Figure: configuration chain of SRAM cells between the configuration data in and configuration data out I/O pins/pads]
Mode pins   Mode
0 0         Serial load with FPGA as master
0 1         Serial load with FPGA as slave
1 0         Parallel load with FPGA as master
1 1         Parallel load with FPGA as slave
[Figure: serial load: a memory device, driven by control signals, feeds configuration data in to the FPGA (Cdata In); the FPGA passes configuration data out on Cdata Out]
[Figure: parallel load: control and address lines go to the memory device, which returns configuration data [7:0] on Cdata In[7:0]]
[Figure: parallel load without address lines: control to the memory device, configuration data [7:0] on Cdata In[7:0]]
[Figure: load through a microprocessor: the microprocessor reads the memory device (address, data, control) and drives the FPGA's Cdata In[7:0] through a peripheral port]
Total area = active logic + configuration memory + interconnect
Advantages over PLDs:
- enhanced flexibility
- reduced board space, power and cost
- increased performance
Advantages over ASICs:
- reprogrammability
- off-the-shelf availability
- zero NRE (non-recurring engineering) costs
- reduced time-to-market
- ease-of-use
[Figure: cumulative cost (NRE + unit cost) versus cumulative volume (K units), for ASIC and FPGA at 0.25 and 0.15 micron]
- ASIC costs start higher, but the slope is flatter
- For each technology advance, FPGAs become more cost effective
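The cost comparison between ASIC and FPGA can be made concrete with a small calculation: cumulative cost = NRE + unit cost x volume, and the two lines cross at a break-even volume. The Python sketch below uses purely hypothetical prices chosen to keep the arithmetic simple; none of the figures come from the slides.

```python
import math


def cumulative_cost(nre, unit_cost, volume):
    """Cumulative cost = NRE + unit cost x number of units."""
    return nre + unit_cost * volume


def break_even_volume(nre_asic, unit_asic, nre_fpga, unit_fpga):
    """Smallest volume at which the ASIC is at least as cheap as the FPGA."""
    if unit_fpga <= unit_asic:
        return None  # the FPGA line never crosses the ASIC line
    return math.ceil((nre_asic - nre_fpga) / (unit_fpga - unit_asic))


# Hypothetical figures (assumptions, not data from the slides).
NRE_ASIC, UNIT_ASIC = 500_000, 5   # high NRE, cheap units: flat slope
NRE_FPGA, UNIT_FPGA = 0, 30        # zero NRE, more expensive units

volume = break_even_volume(NRE_ASIC, UNIT_ASIC, NRE_FPGA, UNIT_FPGA)
print(volume)  # 20000 units
```

Below the break-even volume the FPGA is cheaper; a cheaper FPGA unit price moves the crossing point to higher volumes.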
As performance requirements increase, the implementation of control elements in embedded applications is moving from 8 bits to 32 bits
At the same time, the implementation vehicle of choice for embedded applications is moving from ASICs to FPGAs due to cost and time-to-market pressures
Synthesis methodology
[Figure: design entry (schematic from a graphic editor, or VHDL) -> partition -> placement -> routing -> configuration bit-string]
[Figure: register transfer level (RTL) -> logic simulator (RTL functional verification) -> logic synthesis -> gate-level netlist -> logic simulator (gate-level functional verification) -> place-and-route]
[Figure: a top-level block-level schematic whose blocks are described by a graphical state diagram, a graphical flowchart, textual HDL or a further block-level schematic]
Textual HDL:
When clock rises
  If (s == 0) then y = (a & b) | c;
  else y = c & !(d ^ e);
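The textual HDL fragment ("when clock rises, if s == 0 then y = (a & b) | c, else y = c & !(d ^ e)") describes a register that only changes on a rising clock edge. A minimal Python model of that behaviour, assuming single-bit signals and a hypothetical class name ClockedBlock, could look like this:

```python
class ClockedBlock:
    """Software model of the clocked HDL fragment (single-bit signals)."""

    def __init__(self):
        self.y = 0
        self._previous_clock = 0

    def tick(self, clock, s, a, b, c, d, e):
        # y is updated only when the clock goes from 0 to 1.
        if clock == 1 and self._previous_clock == 0:
            if s == 0:
                self.y = (a & b) | c
            else:
                self.y = c & (1 - (d ^ e))
        self._previous_clock = clock
        return self.y


block = ClockedBlock()
block.tick(0, s=0, a=1, b=1, c=0, d=0, e=0)         # no edge: y keeps its value
print(block.tick(1, s=0, a=1, b=1, c=0, d=0, e=0))  # rising edge: y = (1 & 1) | 0
```

Between two rising edges the inputs may change freely without affecting y, which is what distinguishes this description from a purely combinational one.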
Intellectual property (IP)
A semiconductor IP block is a predesigned function to be implemented in a semiconductor device. In some cases, the functions are parametrisable, allowing a degree of customization. These functions include physical library functions (analog or digital), basic blocks (such as counters and muxes) and system-level macros (also known as cores or virtual components), including memory blocks
Market:
- 1999: 442 million dollars (semiconductors total: 196136 M$)
- 2000: 620 million dollars (semiconductors total: 231601 M$)
- 2004: 2940 million dollars (semiconductors total: 339545 M$)
System-on-a-chip (SOC)
An SOC can be built as an ASIC or on an FPGA. Compared with the ASIC, the FPGA means:
- expensive circuit
- lower performance
- higher consumption
- lower development cost
- faster adaptation to change
ASIC SOC
                   2000       2004       2011
Technology (um)    0.18       0.09       0.05
Gates/cm2          5x10^6     30x10^6    200x10^6
MIPS/watt          1000       2200       4000
SRAM Mb/cm2        20         120        240
SoC today
[Figure: dual-core processor die with power management / frequency boost (Foxton Technology)]
[Figure: complexity (log scale) versus year, 1982 to 2012, for processors / DSPs and batteries; the GAP calls for new architectures]
Microelectronics systems tomorrow (2012): technology:
Platform based design
[Figure: SoC assembled around an arbitrated interconnect: CPU, RAM, ROM, Flash, DSP, MPEG2, PCI, PCI-X, USB 2.0 controller, TEST (BIST)]
- TEST: BIST (Built-in Self Test)
- REUSE: IP core assembling
- RECONFIGURABLE ARCHITECTURES: reconfigurable cores, FPGA, ...
Platform based design: SoC example
[Figure: B-CDMA(TM) SoC layout: 90 MIPS TI DSP sub-system (C54x), ARM7 RISC processor, two 20 MHz A/D converters, SARAM and DARAM blocks, analog and digital PLL/VCOs, Viterbi decoder with state metric RAM and traceback RAM, B-CDMA(TM) modem logic]
Context: Motivations
Main motivations for the use of reconfigurable architectures
Accelerators for computing-intensive applications
- Why can they be faster than processors?
- What kind of reconfigurable architecture should be used?
- Should they be used as stand-alone solutions?
- Why aren't they used today?
Tools for system validation: prototyping (emulation)
- Why is simulation not adapted?
- Should prototyping replace simulation?
- How much confidence should be put in prototyping results?
MicroBlaze soft processor
- Thirty-two 32-bit general purpose registers
- 32-bit instruction word with three operands and two addressing modes
- Separate 32-bit instruction and data buses that conform to IBM's OPB (On-chip Peripheral Bus) specification
- Separate 32-bit instruction and data buses with direct connection to on-chip block RAM through an LMB (Local Memory Bus)
- 32-bit address bus
- Single issue, 3-stage pipeline (instruction fetch, operand fetch, execution)
- Hardware multiplier
- Big-endian
Virtex-II Pro family
Virtex-II Pro FPGAs provide up to four embedded 32-bit IBM PowerPC 405 RISC processors, each delivering over 420 Dhrystone MIPS at 300 MHz:
- 16KB data / 16KB instruction caches
- memory management unit
- variable page size (1KB-16MB)
- five-stage datapath pipeline
- integer multiply/divide unit
- 32 x 32-bit general purpose registers
- dedicated on-chip memory interface
- it takes up as little as 2% of the total die area of the XC2VP50
- it does not have a hardware floating point unit
Up to twenty-four on-chip 3.125 Gbps Rocket I/O transceivers
Based on a 0.13 µm, 9-layer copper/low-K dielectric technology
Virtex-4 family from Xilinx
Columnar architecture
A Configurable Logic Block (CLB) contains 4 interconnected slices
A simplified view of the slice is:
BRAM and multiplier (DSP) blocks
- Integrated PowerPC 405
- Fully integrated Ethernet Media Access Controller (EMAC)
- Bitstreams encrypted with 256-bit AES algorithm
- 90-nm, 11-layer technology
- 500 MHz for memory and multipliers
- Lowest power
Three platforms:
- LX platform: optimized for high-performance logic
- SX platform: optimized for high-performance signal processing
- FX platform: optimized for embedded processing and high-speed serial connectivity
[Figure: resource mix of each platform: logic, memory, DCMs, DSP, plus RocketIO and PowerPC on the FX platform]
Device      Logic cells  Block RAM [Kb]  DCM  SelectIO  XtremeDSP slices  PowerPC  10/100/1000 EMAC  RocketIO transceivers
XC4VLX15    13,824       864             4    320       32                -        -                 -
XC4VLX25    24,192       1,296           8    448       48                -        -                 -
XC4VLX40    41,472       1,728           8    640       64                -        -                 -
XC4VLX60    59,904       2,880           8    640       64                -        -                 -
XC4VLX80    80,640       3,600           12   768       80                -        -                 -
XC4VLX100   110,592      4,320           12   960       96                -        -                 -
XC4VLX160   152,064      5,184           12   960       96                -        -                 -
XC4VLX200   200,448      6,048           12   960       96                -        -                 -
XC4VSX25    23,040       2,304           4    320       128               -        -                 -
XC4VSX35    34,560       3,456           8    448       192               -        -                 -
XC4VSX55    55,296       5,760           8    640       512               -        -                 -
XC4VFX12    12,312       648             4    320       32                1        2                 -
XC4VFX20    19,224       1,224           4    320       32                1        2                 8
XC4VFX40    41,904       2,592           8    448       48                2        4                 12
XC4VFX60    56,880       4,176           12   576       128               2        4                 16
XC4VFX100   94,896       6,768           12   768       160               2        4                 20
XC4VFX140   142,128      9,936           20   896       192               2        4                 24
Stratix II family from Altera
Each Logic Array Block (LAB) contains 8 Adaptive Logic Modules (ALM)
Nios soft processor
- RISC-like processor
- Full 32-bit instruction set, data path and address space
- 32 general-purpose registers
- 32 external interrupt sources
- Single-instruction 32x32 multiply and divide producing a 32-bit result
- Single-instruction barrel shifter
- 6-stage pipeline
- Branch prediction
Fusion family from Actel
This FPGA family integrates the standard programmable logic with configurable analog and Flash memory
Configurable analog-to-digital converter (ADC), supporting resolutions up to 12 bits and sample rates up to 600 k samples per second
A 32-bit ARM7 soft-core is available
VersaTile configurations:
Virtex-5 family from Xilinx
Stratix III family from Altera
Coarse grain RAs
Statement:
Digital signal processing is more and more arithmetic-oriented, that is, operations are at the word level (+, -, *, ...) rather than at the bit level
Example: digital filtering (FIR): K multiplies, K sums
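The "K multiplies, K sums" count for a FIR filter can be checked with a short Python sketch. It assumes the usual definition y[n] = h[0].x[n] + h[1].x[n-1] + ... + h[K-1].x[n-K+1]; the function name fir_output and the sample values are illustrative only.

```python
def fir_output(coefficients, recent_samples):
    """Compute one FIR output sample and count the word-level operations.

    coefficients:   h[0] .. h[K-1]
    recent_samples: x[n], x[n-1], .. x[n-K+1] (most recent first)
    """
    accumulator = 0
    multiplies = 0
    sums = 0
    for h, x in zip(coefficients, recent_samples):
        product = h * x          # one word-level multiply
        multiplies += 1
        accumulator += product   # one word-level sum (accumulate)
        sums += 1
    return accumulator, multiplies, sums


y, n_mul, n_add = fir_output([1, 2, 3], [4, 5, 6])
print(y, n_mul, n_add)  # 32 3 3
```

Each loop iteration is one multiply-accumulate (MAC), which is why hardwired multipliers and adders matter so much for this class of application.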
Computation example: S = A.x^2 + B.y + C
Inputs: x, A, B, y, C
- cycle 1: x^2 and B.y
- cycle 2: A.x^2 and B.y + C
- cycle 3: S = A.x^2 + B.y + C
#CYCLES: 3
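The three-cycle count for S = A.x^2 + B.y + C comes from grouping independent word-level operations in the same cycle. The Python sketch below contrasts that spatial schedule with a one-operation-per-step sequential evaluation; the function names are hypothetical and each operator is assumed to take exactly one cycle.

```python
def evaluate_spatial(A, B, C, x, y):
    """Independent operations share a cycle, as on parallel operators."""
    # Cycle 1: two multipliers work side by side.
    x_squared, b_y = x * x, B * y
    # Cycle 2: one multiplier and one adder work side by side.
    a_x_squared, b_y_plus_c = A * x_squared, b_y + C
    # Cycle 3: final addition.
    s = a_x_squared + b_y_plus_c
    return s, 3


def evaluate_sequential(A, B, C, x, y):
    """One operation per step, as on a single shared ALU."""
    steps = 0
    t = x * x; steps += 1
    t = A * t; steps += 1
    u = B * y; steps += 1
    u = u + C; steps += 1
    s = t + u; steps += 1
    return s, steps


print(evaluate_spatial(2, 3, 4, 5, 6))     # (72, 3)
print(evaluate_sequential(2, 3, 4, 5, 6))  # (72, 5)
```

Both versions compute the same value; only the number of steps differs, because the spatial version uses several operators at once.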
Example targeting FPGA: S = A.x^2 + B.y + C
Inputs: x, A, B, y, C
- Arithmetic operator: highly combinational, high logic complexity
- The operator is spread over many logic cells (LC)
- Higher propagation delay than dedicated logic
- Higher propagation delay than non-programmable interconnect
Computation time: N x Tc
- N: number of cycles
- Tc: cycle time, which cannot be shorter than Tp
- Tp: propagation time (critical path)
- Tp = T_LUT + T_ROUTING, higher than in dedicated circuits
Spatial execution (parallel), but higher cycle time
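The timing relations (computation time = N x Tc, with the propagation time Tp = T_LUT + T_ROUTING on the critical path) can be illustrated numerically. The delays below are hypothetical values, chosen so that routing is 80% of the propagation time, and the sketch assumes the clock is set exactly by the critical path (Tc = Tp).

```python
def propagation_time(t_lut_ns, t_routing_ns):
    """Tp = T_LUT + T_ROUTING along the critical path."""
    return t_lut_ns + t_routing_ns


def computation_time(n_cycles, t_cycle_ns):
    """Computation time = N x Tc."""
    return n_cycles * t_cycle_ns


# Hypothetical delays in nanoseconds (assumptions, not measured values).
T_LUT_NS = 1.0
T_ROUTING_NS = 4.0

t_p = propagation_time(T_LUT_NS, T_ROUTING_NS)
t_c = t_p                          # assumption: cycle time equals Tp
total = computation_time(3, t_c)   # a 3-cycle computation
routing_share = T_ROUTING_NS / t_p

print(t_p, total, routing_share)   # 5.0 15.0 0.8
```

With these numbers, shortening the routing delay has four times more effect on the cycle time than shortening the LUT delay by the same proportion.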
Motivations: FPGAs suffer from:
- low functional frequencies (relatively low performance)
  BECAUSE: mainly routing delays, ~80% of the propagation time
- high reconfiguration cost (50-100:1)
  BECAUSE: ~90% of the area is occupied for reconfigurable purposes (LUT + routing)
- higher power consumption (compared to ASICs)
  BECAUSE: probably the programmable interconnect and the LUT memory
- routing problems
  BECAUSE: it is hard to wire every net in the design; it is usually hard to go above 90% of the FPGA capacity
FPGAs: reconfiguration cost
[Figure: transistors per chip (log scale, 1000 to 10 000 000 000 000) versus year (1980 to 2010) for memory, microprocessors, FPGA physical, FPGA logical and FPGA routed]
Basic idea: replace the bit-level logic block by a word-level logic block (CLB, LC -> RC, CFB)
FINE GRAIN: CLB (LUTs)
- granularity: BIT
- adapted to: encryption, prototyping
- high reconfiguration cost overhead
- low functional frequencies
COARSE GRAIN: RC (ALU, MULT, registers, MUXes)
- granularity: WORD
- adapted to: DSP, data-oriented processing
- low reconfiguration cost overhead
- high level of performance
From CLBs to CFBs (Configurable Function Blocks)
[Figure: 4x4 array of CFBs, each containing an ALU, a multiplier, registers and MUXes]
- A single CFB can handle an arithmetic operation on its own (e.g. a multiplication)
- Since multipliers and adders are hardwired, they are much more efficient (higher frequencies, lower area)
- Each CFB is like a small µP / DSP
LET'S REVIEW THE MICROPROCESSOR ARCHITECTURES
A simple coarse grain processing unit = small CPU controller
Constitution:
- optimized datapath (16 bits)
- register file (4 x 16 bits)
- hardwired ALU and multiplier
Features:
- complex computations in local mode (FIR, IIR, WT)
- low silicon area (0.07 mm², 0.18 µm CMOS process)
- single-cycle operations (e.g. MAC + register load)
Microprocessor basics
[Figure: a CONTROLLER (state machine stepping through t1, t2, t3) drives a DATAPATH in which registers A, B, C and the ALU share a BUS]
- Arithmetic and Logic Unit (ALU)
- Register file
- Tristate components (inputs / outputs)
A single accumulator machine
[Figure: PC (with incrementer), MAR, IR and FSM around the memory; the instruction path loads IR, the opcode feeds the FSM (function controls), the address operand and branch target feed MAR and PC; load and store paths connect the memory to the datapath; data flow and control signals are drawn separately]
MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register
Instruction (16 bits):
- bits 15-14: opcode (00: Load, 01: Store, 10: Add, 11: Branch)
- bits 13-0: address
Single-address instruction: one of the registers is fixed (= accumulator); AC is an implicit operand
AC := AC op Memory(Address)
Example: executing the instruction 1000110100110011 (opcode 10 = ADD, address 00110100110011)
Memory: 16 bits wide, 16K words (14-bit address)
Initial state: PC = 10110100110011, AC = 0011001101110110, Memory(00110100110011) = 0101010101110001
1. Instruction fetch:
- PC is moved into MAR
- read from memory
- load instruction into IR (IR = 1000110100110011)
2. Instruction decode:
- opcode bits to FSM (ADD)
- the rest of the bits is the operand address
3. Operand fetch:
- IR -> MAR (MAR = 00110100110011)
- read data from memory (0101010101110001)
4. Instruction execute:
- memory to ALU B
- AC to ALU A
- ALU add
- S to AC (AC = 1000100011100111)
5. Housekeeping:
- increment PC (PC = 10110100110100)
MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register
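The five steps of the single accumulator machine (fetch, decode, operand fetch, execute, housekeeping) can be replayed in a few lines of Python. This is a sketch under stated assumptions: the class name AccumulatorMachine is hypothetical, the memory is modelled as a dictionary, and Branch is assumed to be unconditional. The bit patterns are the ones shown on the slides.

```python
LOAD, STORE, ADD, BRANCH = 0b00, 0b01, 0b10, 0b11
ADDRESS_MASK = 0x3FFF   # 14-bit address field
WORD_MASK = 0xFFFF      # 16-bit data words


class AccumulatorMachine:
    def __init__(self, memory, pc=0, ac=0):
        self.memory = memory   # dict: address -> 16-bit word
        self.pc = pc
        self.ac = ac

    def step(self):
        # 1. Instruction fetch: PC -> MAR, read memory, load IR.
        mar = self.pc
        ir = self.memory[mar]
        # 2. Instruction decode: 2 opcode bits, 14 address bits.
        opcode = ir >> 14
        address = ir & ADDRESS_MASK
        # 3 and 4. Operand fetch and execute.
        next_pc = (self.pc + 1) & ADDRESS_MASK
        if opcode == LOAD:
            self.ac = self.memory[address]
        elif opcode == STORE:
            self.memory[address] = self.ac
        elif opcode == ADD:
            self.ac = (self.ac + self.memory[address]) & WORD_MASK
        else:  # BRANCH, assumed unconditional in this sketch
            next_pc = address
        # 5. Housekeeping: update PC.
        self.pc = next_pc


memory = {
    0b10110100110011: 0b1000110100110011,  # ADD instruction at the PC
    0b00110100110011: 0b0101010101110001,  # operand
}
machine = AccumulatorMachine(memory, pc=0b10110100110011, ac=0b0011001101110110)
machine.step()
print(format(machine.ac, "016b"))  # 1000100011100111
print(format(machine.pc, "014b"))  # 10110100110100
```

The printed accumulator and program counter match the values shown after the execute and housekeeping steps.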
A simple microprocessor: architecture
[Figure: datapath with a 16x16 register file, address to memory, data to/from memory, and connections to the controller (FSM)]
A simple microprocessor: instruction format
[Table: instruction, instruction format and action for each instruction; contents not recoverable]
A simple microprocessor: test program
0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...
What will it do?
From CLBs to CFBs (Configurable Function Blocks)
We reviewed the µP basics; the idea is as follows: set up an array of CFBs that have an architecture similar to a µP datapath
Coarse grain RA model
- In abstract: instructions configure both PE and interconnect every cycle
- In reality: the instruction bandwidth / memory is too high, so COMPROMISE
Reconfigurable computing: dynamic reconfiguration
[Figure: instructions currently in hardware versus instructions paged out, relative to the actual available hardware]
Communications
Relationship of communication among processors:
- shared clock (pipelined)
- shared registers (VLIW)
- shared memory (SMM)
- shared network
- shared bus
- something not shared, and that's better
SISD: Single Instruction, Single Data
[Figure: one instruction applied to one data in / data out stream]
SIMD: Single Instruction, Multiple Data
[Figure: one instruction applied to several data in / data out streams]
MISD: Multiple Instructions, Single Data
[Figure: instruction1, instruction2 and instruction3 applied to one data in / data out stream]
MIMD: Multiple Instructions, Multiple Data
[Figure: several instructions, each with its own data in / data out stream]
Vector Processing {SIMD}
Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2
VECTOR (N operations): add.vv v3, v1, v2 computes v3 = v1 + v2 element by element, over the vector length
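The difference between the scalar add and the vector add.vv can be sketched in Python. The functions add_scalar and add_vv are hypothetical stand-ins for the two instructions; a real vector unit would perform the N element additions in hardware rather than in a loop.

```python
def add_scalar(r1, r2):
    """add r3, r1, r2: one instruction, one addition."""
    return r1 + r2


def add_vv(v1, v2):
    """add.vv v3, v1, v2: one instruction, one addition per element."""
    if len(v1) != len(v2):
        raise ValueError("both vectors must have the same vector length")
    return [a + b for a, b in zip(v1, v2)]


r3 = add_scalar(2, 3)
v3 = add_vv([1, 2, 3, 4], [10, 20, 30, 40])
print(r3)  # 5
print(v3)  # [11, 22, 33, 44]
```

A single add.vv replaces a loop of N scalar add instructions, which reduces the instruction bandwidth needed per operation.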
Existing reconfigurable systems
My position [Jonathan Rose]: ASICs are dead. They just don't know it yet!
History: circuit design (full custom), before 1980
[Figure: design flow with specifications, architecture, schematics, verification (board), masks drawing and fabrication]
History: circuit design (full custom), 1980
[Figure: design flow with specifications, architecture, schematics, electrical simulation (SPICE), verification, masks drawing, design rule check (DRC) and fabrication]
History: ASIC design, 1980-1990
[Figure: from application specifications to fabrication of a system made of processor, memories (MEM) and ASICs; levels: behavioral, logical, electrical and physical (placement, routing); CAD tools: functional simulation, schematics editor, logical simulation, post-layout simulation, verifications, test]
ASIC design today
[Figure: from application specifications to fabrication of a system made of processor, memories (MEM) and ASICs; levels: behavioral (SystemC, virtual prototypes), RTL, logical, physical and circuit level; steps: co-design, architectural synthesis, logic synthesis, physical synthesis, test, low power]
Digital Systems Design
Main VLSI design styles (SoCs can integrate all of these):
- Custom
- Semicustom
  - Cell-based
    - Standard cells (hierarchical)
    - Macro cells: memory, PLA, gate matrix
  - Array-based
    - Prediffused: gate arrays, sea-of-gates
    - Prewired: antifuse based, memory based
So, why Hardware Description Languages?
2 answers!
- Because we need a means for modeling, and therefore simulating, complex digital systems before fabricating them
  - logic simulators are used for that
- Because drawing 1B transistors by hand takes time
  - logic synthesizers do part of the job for us!
Domains and levels of modeling
Standard-cell ASIC design
1. Design (VHDL, Verilog, ...): behavioral description (Circuit.vhd)
2. Logic synthesis: netlist (schematics)
3. Technological mapping: std. cells library
4. Placement & routing: ASIC
Simulation at different levels: logic, logic with delays, electrical
ASIC design flow
Specifications -> DESIGN CAPTURE -> behavioral HDL (simulation) -> SYNTHESIS -> netlist (simulation) -> PLACE & ROUTE -> GDSII (simulation)
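So, why Hardware Description Languages?
Given that architecture:
[Figure: inputs a and b feed an adder and a subtractor side by side, producing c = a + b and d = a - b]
In C we would write:
int main(void) {
    int a = 0, b = 0, c, d;  /* a and b stand for the two inputs */
    c = a + b;               /* executed first */
    d = a - b;               /* executed second */
    return 0;
}
Sequential execution
But hardware intrinsically operates in parallel: concurrent execution
Need for a language that can express concurrency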
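The difference between sequential and concurrent execution shows up as soon as two assignments depend on each other. The Python sketch below is an illustration only: sequential_updates mimics statement-by-statement software execution, while concurrent_updates mimics hardware, where every right-hand side is evaluated from the old values before any output changes. Both function names are hypothetical.

```python
def sequential_updates(a, b):
    """Software order: the second statement sees the result of the first."""
    a = b
    b = a
    return a, b


def concurrent_updates(a, b):
    """Hardware-like: both right-hand sides are read before any update."""
    new_a, new_b = b, a
    return new_a, new_b


def adder_and_subtractor(a, b):
    """The two operators of the slide work on the same inputs at once."""
    return a + b, a - b


print(sequential_updates(1, 2))    # (2, 2): the old value of a is lost
print(concurrent_updates(1, 2))    # (2, 1): the two values are swapped
print(adder_and_subtractor(5, 3))  # (8, 2)
```

A hardware description language needs the second kind of semantics built in, so that statements describe operators working side by side rather than steps executed in order.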
VHDL basics
VHDL: VHSIC Hardware Description Language
VHSIC: Very High Speed Integrated Circuits
Hardware description language:
- allows description of concurrent tasks
Two main goals:
- modeling (simulation)
- description (synthesis)
Only a subset of VHDL is synthesizable