FPGA, a history of interconnect
Ivo Bolsens
CTO, Xilinx
Xilinx Confidential
“The Architecture of the Month”
[Figure: FPGA architecture families. Labels: TC, IBM, CL, CAL, QL, ACT, CP, MPA, ERA, Virtex, Apex, XC4000, XC2000, XC3000, AT, 6200, Am4000, ORCA, DL, XC5200, FLEX]
The Architectural Shakeout
Mass extinctions in the mid 1990s
• Xilinx: 8100, 6200, 4700, Prizm, …
• Plessey, Toshiba, Motorola, IBM, ...
We were hit by fast-moving CMOS process technology, particularly multiple metal layers.
Trends
[Figure: LUTs and Wire plotted against year, 1985 to 2005. Image courtesy and copyright of UMC.]
Interconnect and ease of use
Longer wire reach gave dramatic improvement for ease of design
[Figure: Delay (Tilo + interconnect), in ps, versus LUTs within reach, for the 4036EX, 4000XL, Virtex, Virtex II, and Virtex-II Pro. Lower is better.]
State-of-the-Art 65nm FPGA
• High-Performance 6-LUT Fabric
• 36Kbit Dual-Port Block RAM / FIFO with ECC
• SelectIO with ChipSync + XCITE DCI
• 550 MHz Clock Management Tile: DCM + PLL
• 25x18 Multiplier DSP Slice with Integrated ALU
• More Configuration Options
Virtex-4 Interconnect Pattern
Predictable Interconnect
[Figure: routing reach from 1 CLB. Legend: Fast Connect, 1 Hop, 2 Hops, 3 Hops]
Symmetric routing pattern reaches more CLBs with fewer hops
Dramatically increases design performance
Layers of Interconnect
[Figure: signal layers between ground planes. Ball pitch: 1 mm. Bump pitch: 175 um.]
12 layers on chip, 10 layers in package, 10 layers in PCB.
82% of noise is determined by the pin-out and found in package balls and PCB vias
Sparse Chevron Pin-out Pattern
[Figure: pin-out grid. Legend: GND, Core Vcc, I/O Vcc, I/O Signal (X)]
• Every SelectIO adjacent to return path
• Achieves near-ideal return current loop
IO pins in a regular array of return path pins
Off chip interconnect
• Growing gap between number of logic gates and I/O
• Technology scaling favors logic density
15x drop in I/O-to-logic ratio by 2020
Source: ITRS
I/O to Logic Ratio
Comparison of the number of I/O per 1000 logic cells in the largest FPGA in each family
~60x decrease in I/O-to-logic ratio
Logic to I/O Gap in Microprocessors
A microprocessor's chip area is dominated by cache memory to overcome the lack of I/O bandwidth
[Figure: trend plotted by Year]
The Move to Serial Connectivity
Source: Intel/Xilinx
Arrays of serial Gb transceivers
[Figure: Signaling rate by decade (1980s, 1990s, 2000s). Parallel interconnect: ISA, PCI, VESA, PCI-X; annotations: up to 66 MHz, 1 Gbps parallel bus limit. Serial interconnect (many standards): GbE, FC-1/2, SATA, PCI Ex, OC-48, FC-4, 10 GbE. Programmable transceivers; rate markers at 3 Gbps and 6 Gbps.]
Die-Stacking Landscape (Connection Density, Number of Device Layers)
• Chip-Stacking (wire-bonding): 10^2-10^3/cm2, 4+ device layers
• Chip-Stacking (Through Silicon Vias): 10^4-10^6/cm2, 4+ device layers
• Monolithic 3-D integration: >10^6/cm2, 3-4 device layers
Photos: Tezzaron, Amkor
Imagine…
With 3D Interconnect
Specialized layers of DSP fabric, Memory fabric, FPGA fabric, …
Optimized for technology: 130nm, 90nm, 32nm, …
Benefits:
• Single package of heterogeneous die
• Multiple configurations of standard products
• Lower cost
• Lower power
• Ultimate customization
• Ultimate flexibility
And at its heart… an FPGA SoC with:
• Embedded Processing
• Embedded DSP
• High-speed Serial Connectivity
• Reprogrammable FPGA Logic Fabric as the Base
Wafer Bonding
1) Wafers from Foundry
2) Prep Wafers; Dice Daughter
3) Attach
4) Fuse
[Figure labels: Daughter Wafer, Mother Wafer, Daughter IC; dimensions: 4 to 15 µm, 7 to 50 µm, < 5 µm]
Source: Cubic Wafer
Inter-Die Connection Performance
[Figure: connection between Die 1 and Die 2; timing scale 1 ns]
• 100X less dynamic power than conventional single-ended I/O
• Link performance ~ 1 Gigabit/sec
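As a rough sizing sketch, the ~1 Gigabit/sec per-link figure can be turned into a link count for a target aggregate bandwidth. The function name is hypothetical, and the 2 Terabit/sec target is the FPGA-to-RAM bandwidth quoted elsewhere in this talk, used here only for illustration. The calculation assumes every link sustains its full rate with no protocol overhead.

```python
import math


def links_needed(target_bps: float, per_link_bps: float = 1e9) -> int:
    """Number of inter-die links needed to reach an aggregate bandwidth.

    Assumption: ideal links, each sustaining per_link_bps with no overhead.
    """
    if target_bps < 0 or per_link_bps <= 0:
        raise ValueError("target must be non-negative and link rate positive")
    return math.ceil(target_bps / per_link_bps)


if __name__ == "__main__":
    # 2 Terabit/sec aggregate at ~1 Gigabit/sec per link
    print(links_needed(2e12))
```

Under these assumptions, a multi-terabit interface needs on the order of thousands of inter-die connections, which is the regime where connection density per cm2 matters.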
2-D FPGA Fabric
[Figure labels: Routing Switch; Look Up Table (LUT), a programmable logic block]
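To make the idea of a LUT as a programmable logic block concrete, here is a minimal software sketch: a k-input LUT is a table of 2^k configuration bits, and the inputs select which bit appears at the output. The class and method names are hypothetical, and the bit ordering (first input is the least significant index bit) is an assumption of this sketch, not something the slide specifies.

```python
from typing import Sequence


class LUT:
    """A k-input look-up table holding 2**k configuration bits."""

    def __init__(self, k: int, config_bits: Sequence[int]):
        if len(config_bits) != 2 ** k:
            raise ValueError("need exactly 2**k configuration bits")
        self.k = k
        self.bits = [b & 1 for b in config_bits]

    def evaluate(self, inputs: Sequence[int]) -> int:
        """Return the configured output bit for the given input values."""
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        index = 0
        for position, value in enumerate(inputs):
            # Assumption: inputs[0] is the least significant index bit.
            index |= (value & 1) << position
        return self.bits[index]


if __name__ == "__main__":
    xor2 = LUT(2, [0, 1, 1, 0])   # "programming" the block as XOR
    and2 = LUT(2, [0, 0, 0, 1])   # same hardware shape, different function
    print(xor2.evaluate([1, 0]), and2.evaluate([1, 1]))
```

Changing only the configuration bits changes the logic function, which is what makes the block programmable; the routing switches then decide which signals reach the inputs.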
3-D FPGA Fabric
• Shorter wire-length and delay
• Higher logic density
A. Rahman, et al., IEEE TVLSI, 11(1), 2003
Performance Improvement
• Integration of RAM with FPGA by high bandwidth die-stacking
• 2 Terabit/sec bandwidth between FPGA and RAM
High Performance Applications: Sparse Matrix/Vector Multiply, Molecular Dynamics, Traffic Simulation, Radiative Heat Transfer
Projected Performance Improvement: 4X-8X over 2.4GHz Pentium; 170X over 2.2GHz Opteron; 20X over 1.7GHz Pentium; ~20X over 3 GHz Processor (Molecular Dynamics)
(Courtesy of Maya Gokhale, LANL)
The FPGA System
[Figure: FPGA resources (MGTs, I/Os, Memory, PowerPC, Logic, Emulation, DSP) and system blocks (Communication Port, Custom Logic, Internal Memory, External Memory Port, DSP Accelerator, µP)]
MicroBlaze’s Flexible Acceleration Interface
• FSL = Fast Simplex Link
• Eliminates bus signaling overhead
  – No address decode
  – No arbitration
  – No acknowledge cycles
• Simple instruction programming
• Flexible number of master and slave FSL ports
  – Configurable depth FIFO in FSL
  – Input and output FSL port width is configurable as 8, 16, or 32 bits
• Dedicated MicroBlaze instructions
  – Get: from input FSL M to Reg N
  – Put: to output FSL M from Reg N
  – Blocking and non-blocking support
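The FIFO behaviour described above can be sketched in software. This is a toy model, not the FSL signalling or the MicroBlaze instruction encoding: the class name, method names, and the default depth of 16 are assumptions made for illustration. It uses Python's standard `queue.Queue`, which already offers blocking and non-blocking put and get.

```python
import queue


class FSLModel:
    """Toy model of one unidirectional FIFO link with a fixed word width."""

    def __init__(self, depth: int = 16, width_bits: int = 32):
        # The slide lists 8, 16, or 32 bits as the configurable widths.
        if width_bits not in (8, 16, 32):
            raise ValueError("width must be 8, 16, or 32 bits")
        self._mask = (1 << width_bits) - 1
        self._fifo = queue.Queue(maxsize=depth)

    def put(self, value: int, blocking: bool = True) -> bool:
        """Write one word. A non-blocking put returns False if the FIFO is full."""
        try:
            self._fifo.put(value & self._mask, block=blocking)
            return True
        except queue.Full:
            return False

    def get(self, blocking: bool = True):
        """Read one word. A non-blocking get returns None if the FIFO is empty."""
        try:
            return self._fifo.get(block=blocking)
        except queue.Empty:
            return None


if __name__ == "__main__":
    link = FSLModel(depth=2, width_bits=8)
    print(link.put(0x1FF, blocking=False))  # word is truncated to 8 bits
    print(link.get(blocking=False))
```

In this model a blocking call simply waits until space or data is available, which loosely mirrors a processor stalling on the link; there is no address decode, arbitration, or acknowledge step anywhere in the transfer.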
Application-Specific Hardware Acceleration
• When the processor core begins to reach software task capacity, Fabric Acceleration comes to the rescue
  – Use Fast Simplex Link (FSL) to interface to customer-defined accelerators
  – Enables dramatic improvements in performance
FPGA/processor
• CoreConnect Architecture
  – Processor Local Bus (PLB)
    • Ideal for burst transfers
    • Memory, High Speed Peripherals, Cache Interface
    • 32-bit address, 64-bit data
    • 2.1 GB/s Max BW @ 133 MHz
  – On-Chip Peripheral Bus (OPB)
    • Low speed peripherals
    • 32-bit address, 32-bit data
  – Device Control Register Bus (DCR)
    • For peripheral setup and control
• On Chip Memory Interface (OCM)
  – 4 Processor Cycle Latency
  – Lowest Latency and good data rate
  – Not a bus interface
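The PLB bandwidth figure can be checked with simple arithmetic. A 64-bit data path at 133 MHz moves about 1.06 GB/s in one direction; the quoted 2.1 GB/s is consistent with counting read and write transfers running at the same time on separate data buses. That reading of the number is an assumption of this sketch, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(data_bits: int, clock_mhz: float,
                        concurrent_buses: int = 1) -> float:
    """Peak bandwidth in GB/s (10**9 bytes/s), one transfer per clock per bus."""
    bytes_per_transfer = data_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * concurrent_buses / 1e9


if __name__ == "__main__":
    one_way = peak_bandwidth_gb_s(64, 133)
    # Assumption: simultaneous read and write on separate 64-bit data buses.
    both_ways = peak_bandwidth_gb_s(64, 133, concurrent_buses=2)
    print(f"{one_way:.3f} GB/s one direction, {both_ways:.3f} GB/s combined")
```

No clock rate is given for the OPB, so the same calculation is not applied to it here.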
Accelerate Performance Beyond the Core
[Figure: Hard Processor Block (PowerPC with APU Control, OCM, and PLB) connected through the APU I/F FPGA interface to a Soft Auxiliary Processor in the FPGA Fabric]
• Extends PPC 405 Instruction Set
  – Floating point support
  – User Defined Instructions
• Offloads CPU intensive operations
  – Matrix calculations
    • Video processing
  – Floating point mathematics
    • 3D data processing
• Direct interface to HW accelerators
  – High Bandwidth
  – Low Latency
• Reduce number of bus cycles by factor of 10X
• Increase performance by over 20X
Comparison with Traditional Bus-based
APU (Processor Block to Soft Aux. Processor through the APU I/F):
• Write Instruction and operands: 1 APU cycle
• Execution: NEX APU cycle
• Read Result and Status: 1 APU cycle + 1 CPU cycle
PLB (Processor Block to Soft Aux. Processor over the bus):
• Write Operand1: 5 PLB cycles + 2 CPU cycles
• Write Operand2 and Instruction: 5 PLB cycles + 2 CPU cycles
• Execution: NEX PLB cycle
• Read Status: 6 PLB cycles + 3 CPU cycles
• Read Result: 6 PLB cycles + 3 CPU cycles
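The cycle counts in this comparison can be totalled to see where the "factor of 10X" claim comes from. The pairing of counts with steps below is an assumed reading of the figure: on the APU side, one cycle to write the instruction and operands and one APU cycle plus one CPU cycle to read the result and status; on the PLB side, two writes at 5 PLB + 2 CPU cycles each and two reads at 6 PLB + 3 CPU cycles each. Execution time (NEX) is left out because it appears on both sides. All names are hypothetical.

```python
# Each step: (description, interface cycles, CPU cycles).
# Assumption: this mapping of counts to steps is inferred from the figure.
APU_STEPS = [
    ("write instruction and operands", 1, 0),
    ("read result and status", 1, 1),
]
PLB_STEPS = [
    ("write operand1", 5, 2),
    ("write operand2 and instruction", 5, 2),
    ("read status", 6, 3),
    ("read result", 6, 3),
]


def overhead(steps):
    """Return (interface cycles, CPU cycles) summed over all steps."""
    interface_cycles = sum(step[1] for step in steps)
    cpu_cycles = sum(step[2] for step in steps)
    return interface_cycles, cpu_cycles


if __name__ == "__main__":
    apu_bus, apu_cpu = overhead(APU_STEPS)
    plb_bus, plb_cpu = overhead(PLB_STEPS)
    print(f"APU: {apu_bus} interface cycles, {apu_cpu} CPU cycles")
    print(f"PLB: {plb_bus} interface cycles, {plb_cpu} CPU cycles")
    print(f"interface-cycle ratio: {plb_bus / apu_bus:.1f}x")
```

Under this reading the PLB path spends 22 bus cycles against 2 on the APU path, roughly the order-of-magnitude reduction the talk cites.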
Processor Performance and Fabric Acceleration
[Figure: D-MIPS (1,000 to 3,000) versus year (2002 to 2010). "Traditional" Frequency Scaling: Virtex-II Pro; Virtex-4 (PowerPC 405 at 450 MHz, APU); Virtex-5 (PowerPC, APU); next generation PowerPC. Fabric Acceleration: PowerPC with APU, MicroBlaze with FSLs.]
Full System Customization & High Performance
[Figure: PowerPC architecture in the Virtex-II Pro fabric. PowerPC 405 with ISOCM and DSOCM connected to BRAM; instruction and data sides on the Processor Local Bus (PLB) with PLB Arbiter, serving PCI 64/66, GE MAC, and a Memory Controller (DDR SDRAM, SDRAM, ZBT SSRAM memory); PLB/OPB Bridge; On-Chip Peripheral Bus (OPB) with OPB Arbiter, serving GPIO, UART, 10/100 Ethernet, and a User Peripheral (OPB IPIF, User Logic); Device Control Register Bus (DCR); JTAG Block; System Reset.]
Scalability
[Figure: Processing rate versus year (2005, 2007, 2009)]
• FPGA: flexible system on chip used for tailored system architectures; scalable in performance, and re-usable solution
• Fixed processor: processor / memory bottlenecks worsen; each next fixed architecture means no re-use, architecture dependent
Configurable Interconnect
Cell Processor
• 85GB External Bandwidth
• 205GB internal bus
• 400GB register file bandwidth
• 400GB internal memory (cache)
• 60-130 Watts
FPGA
• 80GB External Bandwidth
• 2TB internal memory (BRAM)
• 2.5TB internal embedded mem
• >10TB ALU<->register<->ALU
• 10-20 Watts
You must customize your program to make use of the busses of the Cell processor
The FPGA connectivity can be customized to fit your program
Flexibility
[Figure: Computation (GOPS) and Memory Bandwidth (GB/sec), on a 0 to 200 scale, for CPU, V2Pro, and V5]
• Woodcrest = 80 Watt; Virtex-5 = 10 Watt
• Woodcrest: 24 GFlops @ 3 GHz; Virtex-5: 60 GFlops (70 GFlops-SP) @ 350 MHz
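The power and throughput figures above combine into a performance-per-watt comparison. The sketch below uses only the numbers on this slide (24 GFlops at 80 Watt for Woodcrest, 60 GFlops at 10 Watt for Virtex-5); the function name is hypothetical, and the result is only as good as those peak figures.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Peak compute efficiency in GFlops per Watt."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return gflops / watts


if __name__ == "__main__":
    woodcrest = gflops_per_watt(24, 80)
    virtex5 = gflops_per_watt(60, 10)
    print(f"Woodcrest: {woodcrest:.2f} GFlops/W")
    print(f"Virtex-5:  {virtex5:.2f} GFlops/W")
    print(f"Ratio:     {virtex5 / woodcrest:.0f}x")
```

On these numbers the FPGA delivers 2.5 times the peak throughput at one eighth of the power, which is a 20-fold difference in GFlops per Watt.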
Internal Memory Bandwidth
[Figure: Memory (KB), 0 to 1000, versus Bandwidth (Tbps), 0 to 60, for 4VLX200 @ 250 MHz, 2V6000, and 3.73GHz iPxe965; FPGA memory shown as BRAM and LUT-RAM]
Conclusions
• It is all about connectivity
• Important aspects
  – Off-chip interconnect goes serial
    • Latency, power
  – On-chip interconnect has to be
    • Scalable
    • Hierarchical
    • Flexible
    • Easy to use
  – System interconnect heterogeneous and tailored to application