TRANSCRIPT
The CoRAM FPGA Architecture for Reconfigurable Computing
Eric S. Chung, Weinan Ma, Michael Papamichael, Gabriel Weisz, James C. Hoe, Ken Mai
Computer Architecture Lab at Carnegie Mellon University
CMU/ECE/CALCM/Hoe IWLS, June 2011, slide‐1
What we “know” about the future
[Chart: 2009 Intl. Technology Roadmap for Semiconductors. Area density, supply voltage, and device power reduction (log scale, normalized to 40nm) vs. technology node, 40nm down to 11nm. Takeaway: 16X area density, but only 4X lower device power.]
Must do more Ops/second for less Joules/second
In-Core Performance and Energy
MMM (matrix-matrix multiply):

Device                   GFLOP/s (actual)   (GFLOP/s)/mm2 (norm. to 40nm)   GFLOP/J (norm. to 40nm)
Intel Core i7 (45nm)     96                 0.50                            1.14
Nvidia GTX285 (55nm)     425                2.40                            6.78
Nvidia GTX480 (40nm)     541                1.28                            3.52
ATI R5870 (40nm)         1491               5.95                            9.87
Xilinx V6-LX760 (40nm)   204                0.53                            3.62

• CPU and GPU benchmarking was compute-bound; FPGA and Std Cell effectively compute-bound (no off-chip I/O)
• Power (switching+leakage) measurements isolated the core from the system
• For details see [Chung, et al. MICRO 2010]
More of the Same

FFT (2^10-point):

Device                   GFLOP/s   (GFLOP/s)/mm2   GFLOP/J
Intel Core i7 (45nm)     67        0.35            0.71
Nvidia GTX285 (55nm)     250       1.41            4.2
Nvidia GTX480 (40nm)     453       1.08            4.3
ATI R5870 (40nm)         -         -               -
Xilinx V6-LX760 (40nm)   380       0.99            6.5

Black-Scholes:

Device                   Mopts/s   (Mopts/s)/mm2   Mopts/J
Intel Core i7 (45nm)     487       2.52            4.88
Nvidia GTX285 (55nm)     10756     60.72           189
Nvidia GTX480 (40nm)     -         -               -
ATI R5870 (40nm)         -         -               -
Xilinx V6-LX760 (40nm)   7800      20.26           138

For details see [Chung, et al. MICRO 2010]
Why doesn’t everybody want to compute with FPGAs?
"Traditionally, FPGAs have been the bastard step-brother of ASICs…" (Proceedings of ISFPGA 2004)

• FPGAs today are NOT designed for computing:
  - tools and languages difficult to use
  - users exposed to device and platform details
  - no application portability
Review of FPGA Anatomy

[Diagram: programmable lookup tables (LUT) and flip-flops (FF), aka "soft logic" or "fabric"; 4KB Block RAMs with local address, read-data, and write-data ports; I/O interconnect and I/O pins.]
FPGA "Computers" Today

[Diagram: FPGA die ringed by I/O pads, memory controllers, SRAM, and to/from-network links; application logic and memory control logic implemented in the fabric.]

User responsible for:
• application and data
• platform I/O and memory interfaces
• data distribution
• memory control logic
The CoRAM FPGA

[Diagram: fabric holding application logic, SRAMs, and control threads, bounded above and below by general-purpose Networks-on-Chip and memory interfaces (global address space).]

• Hard logic for data distribution (NoC)
• "Software"-managed memory hierarchy
• Simple, portable abstraction for the user
Simple Example

Verilog:
  module top(…);
    coram c0 (/*ports*/);
    cofifo f0 (/*ports*/);
    …
  endmodule

Control thread:
  ctrlthread() {
    cohandle c0 = get_sram("c0");
    c_coram_write(c0, sram_address, global_address, size);   // (1)
    c_fifo_write(1);                                         // (2)
  }

[Diagram: (1) the control thread transfers data from global memory into SRAM C0; (2) it then signals the application through a channel FIFO.]
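As a sanity check, the two-step control-thread semantics above can be mocked in plain C. This is an illustrative sketch, not the CoRAM runtime: the SRAM is a plain buffer, global memory is an array, and the channel FIFO is a counter; the sizes and the `SRAM_WORDS` constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock of the slide's control-thread API (illustrative only). */
#define SRAM_WORDS 16

typedef struct { float data[SRAM_WORDS]; } cohandle;

static cohandle c0_sram;               /* mock SRAM "c0"            */
static float    global_memory[64];     /* mock global address space */
static int      channel_fifo;          /* tokens pending for the app */

static cohandle *get_sram(const char *name) { (void)name; return &c0_sram; }

/* Step (1): copy `size` bytes from global memory into the SRAM. */
static void c_coram_write(cohandle *h, size_t sram_address,
                          const void *global_address, size_t size) {
    memcpy(&h->data[sram_address], global_address, size);
}

/* Step (2): signal the application logic through the channel FIFO. */
static void c_fifo_write(int token) { channel_fifo += token; }

static void ctrlthread(void) {
    cohandle *c0 = get_sram("c0");
    c_coram_write(c0, 0, &global_memory[32], SRAM_WORDS * sizeof(float));
    c_fifo_write(1);
}
```

The point of the abstraction is visible even in the mock: the application logic never issues memory requests itself; it only sees filled SRAM contents and FIFO tokens.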
Design Case Study: MMMult

MMM in hardware (C = A·B): N compute engines (1 per row of matrix), each with a multiply-add datapath, A/B SRAMs, a C SRAM, and control. The engines are joined by a custom data network: a per-engine packet processor (PktIn/PktOut, Ren/Wen) plus a shared packet generator and address generator move A, B, and C blocks to and from DRAM.

Packet format: OP | PE_ID | DATA, where OP is load or writeback.
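The OP | PE_ID | DATA packet word can be sketched in C. The slide gives only the field order, not the widths; the 2/6/24-bit split below is purely hypothetical, as are the helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the MMMult network packet: 2-bit OP,
   6-bit PE_ID, 24-bit DATA packed into one 32-bit word. */
enum { OP_LOAD = 0u, OP_WRITEBACK = 1u };

static uint32_t pack_pkt(uint32_t op, uint32_t pe_id, uint32_t data) {
    return ((op & 0x3u) << 30) | ((pe_id & 0x3Fu) << 24) | (data & 0xFFFFFFu);
}

static uint32_t pkt_op(uint32_t p)    { return p >> 30; }
static uint32_t pkt_pe_id(uint32_t p) { return (p >> 24) & 0x3Fu; }
static uint32_t pkt_data(uint32_t p)  { return p & 0xFFFFFFu; }
```

In the hardware design this packing and unpacking would be wires in the packet processor; the struct-of-bitfields view is just a convenient way to reason about it.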
MMMult with CoRAM

Control thread program:

  void ctrl_thread() {
    for (j = 0; j < N; j += NB)
      for (i = 0; i < N; i += NB)
        for (k = 0; k < N; k += NB) {
          c_fifo_read(…);                    // wait until PEs ready
          for (m = 0; m < NB; m++) {
            c_collective_write(ramsA, m*NB, A + i*N+k + m*N, NB*dsz);
            …
          }
          c_fifo_write(…);                   // start compute
        }
  }
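The blocked loop nest above can be exercised with a self-contained C mock. This is a sketch under stated assumptions, not the CoRAM runtime: `c_collective_write` is modeled as a memcpy into a per-PE SRAM row, the FIFO handshakes are counters, and N and NB are small illustrative values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N  8           /* matrix dimension (illustrative) */
#define NB 4           /* block size (illustrative)       */

static float ramsA[NB][NB];          /* mock A SRAMs, one row per PE */
static int   fifo_reads, fifo_writes;

static void c_fifo_read(void)  { fifo_reads++;  }   /* PEs-ready token  */
static void c_fifo_write(void) { fifo_writes++; }   /* start-compute    */

/* Mock collective write: copy `bytes` from global memory `src`
   into the SRAM array at word offset `off`. */
static void c_collective_write(float sram[][NB], int off,
                               const float *src, size_t bytes) {
    memcpy(&sram[off / NB][off % NB], src, bytes);
}

/* The slide's blocked loop nest: for each (j, i, k) block, wait for the
   PEs, stream NB rows of the A block into the SRAMs, then kick off
   the compute. */
static void ctrl_thread(const float *A) {
    for (int j = 0; j < N; j += NB)
        for (int i = 0; i < N; i += NB)
            for (int k = 0; k < N; k += NB) {
                c_fifo_read();
                for (int m = 0; m < NB; m++)
                    c_collective_write(ramsA, m * NB,
                                       A + (size_t)(i + m) * N + k,
                                       NB * sizeof(float));
                c_fifo_write();
            }
}
```

Note what the mock makes explicit: the control thread handles all addressing (`A + (i+m)*N + k`) and synchronization, so the compute engines themselves contain no address-generation logic.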
What we are doing now

• Eric Chung: architecture and API; microarchitecture and RTL design
• Weinan Ma (advised by Prof. Mai): circuit-level design
• Michael Papamichael: on-chip data network
• Gabe Weisz: high-level programming environment
Beneath the Abstraction

[Diagram: layered view of the CoRAM FPGA. Application: control thread programs and SRAMs. Architecture. Microarchitecture: fabric clusters, each with a control unit, over memory interfaces (DRAM or cache) and bulk data distribution (Network-on-Chip).]
Control Threads: Soft or Hard?

Soft methods (Option 1: synthesis to fabric; Option 2: compile to soft cores)
• rely only on the RL fabric, but must share logic with the application
• efficiency/performance depends on the quality of C-based synthesis and compilation

Hard method (Option 3: dedicated silicon for cores)
• high performance/efficiency
• but hard micro-controllers consume area whether used or not
RTL Prototyping (on FPGA)

[Block diagram: a Microblaze (100MHz) with I/O devices, I/O subsystem, and drivers on a 100MHz AXI memory bus; a Xilinx platform adapter, memory controller (Xilinx MIG), and arbiter with 512b/128b links; a NoC of 128b routers (100MHz and 200MHz clock domains) connecting DRAM cache banks to CoRAM clusters; each cluster contains MMMultiply compute elements (multiply-add datapaths) with control units.]
Is CoRAM A Good Idea?

What is the impact on
• programmability and portability?
• application performance?
• area and power?

Some important questions:
• should CoRAM support be hard, soft, or hybrid?
• how to incorporate I/O devices, multi-FPGA?
• what applications can CoRAM support?
Is CoRAM A Good Idea? You can be the judge!

Tool release planned for late 2011. Visit us at: www.ece.cmu.edu/~coram

Relevant papers:
• Eric S. Chung, James C. Hoe, and Ken Mai. CoRAM: An In-Fabric Memory Architecture for FPGA-Based Computing. In Proceedings of FPGA-19, 2011.
• Eric S. Chung, Peter Milder, James C. Hoe, and Ken Mai. Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, GPGPUs, and ASICs? In Proceedings of MICRO-43, 2010.
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/CALCM