18-545: ADVANCED DIGITAL DESIGN PROJECT
FALL 2016
BRANDON LUCIA
L2: FPGA HARDWARE
18-545: FALL 2016
Admin stuff
Project Proposals happen on Monday
Be prepared to give an in-class presentation
Lab 1 is due Wednesday, Sept. 14th
Reading Assignment #1 due today
Submit a PDF/text file; don't fill in the web form
Team assignments are done
Admin Stuff
Status reports due today
No Word docs, please!
Be specific about what happened/is going to happen
Talk about what YOU did/will do, not just what your group did
Grades on the way, as general feedback
Game Plan
Overview
Why use FPGAs?
FPGA Internals
Caveat: I will use Xilinx-specific terminology since that’s the FPGA company you will be using. Beware that other companies use different terms.
FPGA Overview
Field Programmable Gate Array
Array of generic logic gates
Gates where logic function can be programmed
Programmable interconnection between gates
Fielded systems can be programmed, i.e. post-fabrication
Xilinx Virtex-5 FPGA
Design Platform
Virtex-5 Development System
Xilinx XC5VLX110T FPGA
17280 slices of CLB goodness
256MB DDR2 (SODIMM)
DVI Video port
VGA port is for input
10/100/1000 Ethernet port
Audio Codec (AC97)
USB2 port
16x2 LCD, RS-232
Compact Flash card slot
Expansion connectors
Game Plan
Overview
Why use FPGAs?
FPGA Internals
Why use FPGAs?
System designers have a Goldilocks problem
Off-the-shelf parts are not efficient enough
Custom ASICs cost too much
Need a “just right” solution
ASIC Design
Difficult to design
Large and complex
Issues in advanced processes
Interconnect delay
Device leakage
Power density constraints
Expensive to design / fabricate
Mask set costs
Non-recurring engineering costs
Need a high-volume, high-profit market to justify costs!
Efficiency View
An efficiency gap exists between ASICs and CPUs
N. Zhang et al., “The Cost of Flexibility in Systems on a Chip Design for Signal Processing Applications”
[Figure: Energy Efficiency (MOPS/mW) and Area Efficiency (MOPS/mm2) on a log scale from 0.01 to 10000, x-axis numbered 1 to 20, with regions labeled Microprocessors, DSPs, and ASICs]
Economic View
FPGAs: High package costs ($300+), low NRE costs
ASICs: Low package costs (pennies), high NRE costs ($600K+)
[Figure: Development Cost + Device Cost vs. Total Units, with an ASIC trend line and an FPGA trend line (Courtesy Xilinx, Inc.)]
Below the crossover point, the FPGA solution has a lower total cost
Above the crossover point, the ASIC solution has a lower total cost
Decreasing FPGA unit cost is pushing the crossover point to the right
Additional ASIC costs:
•Increasing NRE charge
•58% are late to market, which impacts total volumes shipped
•ASIC cycle longer than some market windows
•Over 50% need to be respun
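The crossover point in the cost chart can be estimated with a simple linear model: total cost = NRE + unit cost x units. The following is a minimal sketch, not data from the slide. It uses the slide's round numbers ($300 per FPGA, $600K ASIC NRE) and assumes a zero FPGA NRE and a $0.50 ASIC unit cost purely for illustration; the function names are hypothetical.

```python
def total_cost(nre, unit_cost, units):
    """Linear cost model: one-time NRE plus a per-unit cost."""
    return nre + unit_cost * units


def crossover_units(fpga_nre, fpga_unit, asic_nre, asic_unit):
    """Volume at which the FPGA and ASIC total costs are equal."""
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit)


# Assumed figures (only $300 and $600K come from the slide).
FPGA_NRE, FPGA_UNIT = 0.0, 300.0
ASIC_NRE, ASIC_UNIT = 600_000.0, 0.50

breakeven = crossover_units(FPGA_NRE, FPGA_UNIT, ASIC_NRE, ASIC_UNIT)

# A cheaper FPGA (assumed $150) moves the crossover to a higher volume.
cheaper_breakeven = crossover_units(FPGA_NRE, 150.0, ASIC_NRE, ASIC_UNIT)

if __name__ == "__main__":
    print(f"Break-even volume: about {breakeven:.0f} units")
    print(f"With a $150 FPGA: about {cheaper_breakeven:.0f} units")
```

Under these assumed numbers the FPGA wins for volumes of a couple of thousand units or fewer, and halving the FPGA unit cost roughly doubles the break-even volume.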
FPGA Advantages
Higher performance than CPU solution
Lower power than CPU solution (usually)
Low NRE costs
Off-the-shelf part designed by FPGA vendor
You are sharing NRE costs with all other customers
Fast design time
Low time-to-market
Fast re-design / re-fabrication time
Easy to correct an error or add functionality in response to a spec change
Can even change product after deployment
FPGA Disadvantages
High per-part costs
Good for low to middle volume applications
High volume applications should consider ASICs
Perhaps use FPGA for prototyping
Lower performance than ASIC
Higher power than ASIC
More specialized design skills than programming a CPU
Example uses of FPGAs
Rapid Prototyping
Emulation of ASIC design
Design exploration
Shipping product
Networking
Military
Microsoft Bing Datacenters
Reconfigurable Computing
Research!
(http://parallel.princeton.edu/openpiton/)
Game Plan
Overview
Why use FPGAs?
FPGA Internals
FPGA Breakdown
3 Basic components
Configurable Logic Blocks
General purpose interconnect
I/O Blocks
Advanced components
Hard macros
CPUs
Block RAM
Multipliers
Specialized components
DSP blocks
VIRTEX-II PRO
XILINX XC3020
CLB (64 TOTAL)
I/O BLOCK (64 TOTAL)
GENERAL PURPOSE INTERCONNECT
IOBS HAVE DIRECT ACCESS TO ADJACENT CLBS
SWITCH MATRIX
(COURTESY XILINX, INC.)
ZOOMED IN VIEW OF THE CLB MATRIX OF THE FPGA
SPECIFIC INGRESS AND EGRESS CONNECTION OPTIONS (BLACK DOTS) ARE AVAILABLE
EVEN MORE ZOOMED IN VIEW
(COURTESY XILINX, INC.)
ROUTING
EACH MATRIX HAS 5 CONNECTIONS PER SIDE
(COURTESY XILINX, INC.)
ROUTING: THE SWITCH MATRIX
ONLY CERTAIN CONNECTION PATTERNS ARE POSSIBLE
(COURTESY XILINX, INC.)
ROUTING: THE SWITCH MATRIX
EACH MATRIX HAS 5 CONNECTIONS PER SIDE
Hierarchical Routing
Spartan-2 and more recent have different length connections between switch matrices
Local roads, limited access roads, interstate highways
Routes across entire chip don’t burn lots of short connections
Detailed Routing (Spartan 2)
Configurable Logic Blocks
CLBs get more and more stuff crammed in them over time
XC3K family had LUT (5 variable input, 2 FF values, 2 outputs), 2 FFs, clock enable, FF reset (direct / global) and 9 muxes
~51 bits of configuration SRAM per CLB
(COURTESY XILINX, INC.)
What’s a Look-up-table (LUT)?
A direct implementation of a truth table, using memory
LUT inputs are memory address values
LUT outputs are the memory data value
A B C D F
0 0 0 0 1
0 0 0 1 1
0 0 1 0 1
0 0 1 1 1
0 1 0 0 1
0 1 0 1 1
0 1 1 0 1
0 1 1 1 1
1 0 0 0 1
1 0 0 1 1
1 0 1 0 1
1 0 1 1 1
1 1 0 0 0
1 1 0 1 0
1 1 1 0 0
1 1 1 1 0
A B C D F
0 0 0 0 0
0 0 0 1 1
0 0 1 0 0
0 0 1 1 0
0 1 0 0 0
0 1 0 1 1
0 1 1 0 0
0 1 1 1 1
1 0 0 0 0
1 0 0 1 1
1 0 1 0 0
1 0 1 1 0
1 1 0 0 1
1 1 0 1 1
1 1 1 0 0
1 1 1 1 0
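The two truth tables above can be modeled directly as 16-bit memories. The following is a minimal sketch in which the inputs form the memory address and the stored bit is the output. The helper name `make_lut` and the choice of A as the most significant address bit are assumptions for illustration, not Xilinx's INIT convention.

```python
def make_lut(init, num_inputs=4):
    """Return a function that reads one bit of `init` per input combination."""
    def lut(*inputs):
        assert len(inputs) == num_inputs
        address = 0
        for bit in inputs:  # first input is the most significant address bit
            address = (address << 1) | (bit & 1)
        return (init >> address) & 1
    return lut


# First table: F = NOT (A AND B). Rows 0-11 hold 1, rows 12-15 hold 0.
first_table = make_lut(0x0FFF)

# Second table: F is 1 only in rows 1, 5, 7, 9, 12 and 13.
second_table = make_lut(0x32A2)

if __name__ == "__main__":
    print("A B C D F")
    for row in range(16):
        a, b, c, d = (row >> 3) & 1, (row >> 2) & 1, (row >> 1) & 1, row & 1
        print(a, b, c, d, second_table(a, b, c, d))
```

Changing the function only means changing the 16 stored bits; the lookup hardware stays the same.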
Another View of LUTs
Can view LUT as 16:1 mux
Inputs are mux select
Config sets mux data inputs
Logically same as 16x1 memory
Can compact logic if you can route inputs to mux data inputs
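The mux view can be checked against the memory view with a small model. The following is a sketch that builds a 16:1 mux as a tree of 2:1 muxes, with the LUT inputs as select lines and the configuration bits as data inputs. The function names and the select-bit ordering (s3 most significant) are assumptions for illustration.

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def mux16(select_bits, data):
    """16:1 mux built from four levels of 2:1 muxes.

    select_bits is (s3, s2, s1, s0) with s3 most significant;
    data holds the 16 configuration bits.
    """
    assert len(select_bits) == 4 and len(data) == 16
    level = list(data)
    for sel in reversed(select_bits):  # s0 chooses between neighbors first
        level = [mux2(sel, level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


# Configuration bits for an arbitrary example function.
config = [0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0]

if __name__ == "__main__":
    for address in range(16):
        bits = tuple((address >> k) & 1 for k in (3, 2, 1, 0))
        print(bits, mux16(bits, config))
```

For every select combination the mux output equals `config[address]`, which is why the two views are logically the same.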
Look Up Table Additional Functionality
Can be configured as:
Shift register (16 regs)
Small memory (16 bits)
“Distributed RAM”
Some other FPGAs use muxes instead of memories to implement the core combinational logic
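The shift-register mode can be pictured as the same 16 storage bits with a different write path. The following is a behavioral sketch only, not a model of any specific Xilinx primitive; the class and method names are hypothetical.

```python
class ShiftRegisterLUT:
    """Behavioral sketch of a 16-bit LUT used as an addressable shift register."""

    def __init__(self, depth=16):
        self.bits = [0] * depth

    def clock(self, data_in):
        """One clock edge: every stored bit moves one position deeper."""
        self.bits = [data_in & 1] + self.bits[:-1]

    def read(self, address):
        """The LUT inputs select a tap: address 0 is the newest bit."""
        return self.bits[address]


if __name__ == "__main__":
    srl = ShiftRegisterLUT()
    for bit in [1, 0, 1, 1]:
        srl.clock(bit)
    # Tap 3 holds the bit that was shifted in four clocks ago.
    print([srl.read(tap) for tap in range(4)])
```

Reading a fixed tap gives a delay line of programmable length, which is a common use of this mode.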
Spartan-2 CLB
Spartan-2 has 2 LUTs (4 input each) feeding a 3rd LUT, 2 FFs (with Preset/Reset, Enable, posedge or negedge clocks) and 16 muxes
12 inputs (plus clock), 4 outputs
(COURTESY XILINX, INC.)
Spartan-3
CLBs are composed of 4 slices
Organized as 2 pairs, one of which is optimized for memory access
Each slice has 2 FFs and 2 LUTs
(COURTESY XILINX, INC.)
FPGA Families extend Architecture
❏Devices are built with more capability, but around the same basic architecture
❏Some additional capabilities
◆Low voltage versions
◆Faster clock rates
◆Different packaging options
(Courtesy Xilinx, Inc.)
The need for more stuff
❏CompEs cannot design on logic, routing, I/O alone
❏Extreme case from early 90s
◆16 port ATM switch, designed on a single board
◆Design is limited by I/O to memory chips, so bring them on-chip
[Figure labels: FPGAs (XC3Ks), FIFO memory chips]
Other “Stuff”
❏Clock managers
◆Global clock buffering, distribution
◆Digital Clock Manager (DCM): eliminate skew, phase shifts, multiply or divide clock
❏Memory
◆Block RAM
◆Distributed RAM (repurposed LUTs)
❏Shift Registers
❏Dedicated Multiplexers
❏Carry Look-Ahead Generators
❏I/O Blocks
◆SelectIO supports 18 standards (single, differential, various voltage levels, ...)
❏Embedded Multipliers
Hard Macros
Hard macros
Block RAMs
Multipliers
CPUs
DSPs
Soft macros
HDL IP Blocks
Block RAMs
Distributed RAM
Use LUTs as memories
Low density
Poor performance
Block RAM
Large-ish dedicated memory blocks
Xilinx BRAMs = 18Kb
Some configurability
Dual-port
Data width / depth
FIFO, CAM, etc.
Multipliers
18x18 signed 2’s-complement multiplier
Two 18b inputs
One 36b output
18b enough for many DSP applications
Can gang multiple units together for wider data
Faster and lower power than multiplier from CLBs
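Ganging multipliers for wider data can be sketched as a sum of partial products. The example below is one possible decomposition, not necessarily what the vendor tools generate: it splits each unsigned 34-bit operand into 17-bit chunks so that every chunk fits in the non-negative range of a signed 18-bit input. The function names are hypothetical.

```python
def mult18x18(a, b):
    """Model of one signed 18x18 multiplier block with a 36-bit product."""
    lo, hi = -(1 << 17), (1 << 17) - 1
    assert lo <= a <= hi and lo <= b <= hi, "operands must fit in 18 bits"
    return a * b


def mult34x34_unsigned(a, b):
    """34x34 unsigned multiply built from four 18x18 blocks."""
    assert 0 <= a < (1 << 34) and 0 <= b < (1 << 34)
    mask = (1 << 17) - 1
    a_lo, a_hi = a & mask, a >> 17  # 17-bit chunks are valid signed 18-bit values
    b_lo, b_hi = b & mask, b >> 17
    return ((mult18x18(a_hi, b_hi) << 34)
            + ((mult18x18(a_hi, b_lo) + mult18x18(a_lo, b_hi)) << 17)
            + mult18x18(a_lo, b_lo))


if __name__ == "__main__":
    x, y = 12_345_678_901, 9_876_543_210
    print(mult34x34_unsigned(x, y) == x * y)
```

The shifts and adds that combine the four partial products would be built from CLB logic or dedicated adders, so a wide multiply costs more than just the four multiplier blocks.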
CPUs – PowerPC 405
XC2VP30 has 2 Embedded PowerPC 405 cores
Embedded L1 I and D caches
No FPU
CPU Connectivity: PLB and OPB
IBM CoreConnect
Processor Local Bus (PLB) - fast on-chip communication
On-Chip Peripheral Bus (OPB) - optimized for peripherals (UART, etc.)
Device Control Register bus (DCR) - used to send and set configuration
CPU Connectivity: PLB and OPB (cont.)
CPU Connectivity: OCM
On-Chip Memory controller
CPU block RAM
2 OCMs – I and D
Direct, fast interface
Can use dual-port BRAMs for producer-consumer link to FPGA fabric
CPU Links
A lot more details on the embedded CPU
http://www.xilinx.com/bvdocs/userguides/ppc_ref_guide.pdf
http://direct.xilinx.com/bvdocs/userguides/ug018.pdf
http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture
Zynq 7000
Advanced Microcontroller Bus Architecture (AMBA) + Advanced eXtensible Interface (AXI)
To memory, FPGA fabric, I/O & Peripherals
AMBA = ARM’s attempt at The One True Interface
Configuration Storage
Lots of configuration bits
LUTs, routing, I/O configuration
Xilinx XC2VP30 has >11Mb
Configuration storage technologies
Volatile
SRAM cells
Non-volatile
FLASH, EEPROM
Anti-fuse
Actel anti-fuse
[Figure: 6T SRAM cell with word line WL and bit lines bit and bit_b]
Configuration
How to load (scan) configuration bits (bitstream)
Connect all configuration registers into single long shift register
Serially clock in configuration bits
Most designs use standard scan interface (JTAG) developed for test
Bitstream source
Non-volatile memory
On-board FLASH, EEPROM, serial memory
External media (CF card)
Attached workstation
Can encrypt bitstream to conceal configuration
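The serial loading scheme described above (all configuration registers chained into one long shift register) can be sketched in a few lines. This is a behavioral illustration only and does not model JTAG or any real bitstream format; the class and method names are hypothetical.

```python
class ConfigChain:
    """All configuration cells connected as one long shift register."""

    def __init__(self, length):
        self.cells = [0] * length

    def clock_in(self, bit):
        """One configuration clock: shift every cell along by one position."""
        self.cells = [bit & 1] + self.cells[:-1]

    def load(self, bitstream):
        """Serially clock in a whole bitstream, one bit per clock."""
        for bit in bitstream:
            self.clock_in(bit)


if __name__ == "__main__":
    chain = ConfigChain(8)
    chain.load([1, 0, 1, 1, 0, 0, 1, 0])
    # The first bit clocked in ends up in the cell farthest from the input.
    print(chain.cells)
```

Because the first bit clocked in travels the farthest, the bitstream must be ordered to match the physical order of the chain.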
Major FPGA Vendors
SRAM-based FPGAs
Xilinx
Altera
Atmel
Lattice Semiconductor
Flash & antifuse FPGAs
Actel Corp.
QuickLogic Corp.
Lattice Semiconductor
Xilinx (system-in-a-package solution)
Xilinx and Altera share over 60% of the market