TRANSCRIPT
The CoRAM FPGA Architecture for Reconfigurable Computing
Eric S. Chung, Weinan Ma, Michael Papamichael, Gabriel Weisz, James C. Hoe, Ken Mai
Computer Architecture Lab at Carnegie Mellon University
CMU/ECE/CALCM/Hoe IWLS, June 2011, slide‐1
What we “know” about the future
[Chart: 2009 Intl. Technology Roadmap for Semiconductors. Area density, supply voltage, and device power reduction (log scale, normalized to 40nm) vs. technology node, 40nm down to 11nm. Takeaway: 16X area density, but only 4X lower device power.]
Must do more Ops/second for less Joules/second
In-Core Performance and Energy
MMM (matrix-matrix multiply):

Device                   GFLOP/s (actual)   (GFLOP/s)/mm2 (norm. to 40nm)   GFLOP/J (norm. to 40nm)
Intel Core i7 (45nm)     96                 0.50                            1.14
Nvidia GTX285 (55nm)     425                2.40                            6.78
Nvidia GTX480 (40nm)     541                1.28                            3.52
ATI R5870 (40nm)         1491               5.95                            9.87
Xilinx V6-LX760 (40nm)   204                0.53                            3.62

• CPU and GPU benchmarking was compute-bound; FPGA and Std Cell effectively compute-bound (no off-chip I/O)
• Power (switching+leakage) measurements isolated the core from the system
• For details see [Chung, et al. MICRO 2010]
More of the Same

FFT (2^10-point):

Device                   GFLOP/s   (GFLOP/s)/mm2   GFLOP/J
Intel Core i7 (45nm)     67        0.35            0.71
Nvidia GTX285 (55nm)     250       1.41            4.2
Nvidia GTX480 (40nm)     453       1.08            4.3
ATI R5870 (40nm)         -         -               -
Xilinx V6-LX760 (40nm)   380       0.99            6.5

Black-Scholes:

Device                   Mopts/s   (Mopts/s)/mm2   Mopts/J
Intel Core i7 (45nm)     487       2.52            4.88
Nvidia GTX285 (55nm)     10756     60.72           189
Nvidia GTX480 (40nm)     -         -               -
ATI R5870 (40nm)         -         -               -
Xilinx V6-LX760 (40nm)   7800      20.26           138

For details see [Chung, et al. MICRO 2010]
Why doesn’t everybody want to compute with FPGAs?
"Traditionally, FPGAs have been the bastard step-brother of ASICs…" (Proceedings of ISFPGA 2004)

• FPGAs today are NOT designed for computing:
  - tools and languages difficult to use
  - users exposed to device and platform details
  - no application portability
Review of FPGA Anatomy

[Diagram: programmable lookup tables (LUT) and flip-flops (FF), aka "soft logic" or "fabric"; 4KB Block RAMs with local address, read-data, and write-data ports; I/O interconnect and I/O pins.]
FPGA "Computers" Today

[Diagram: FPGA die ringed by I/O pads, memory controllers, SRAM, and to/from-network links; application logic and memory control logic implemented in the fabric.]

User responsible for:
• application and data
• platform I/O and memory interfaces
• data distribution
• memory control logic
The CoRAM FPGA

[Diagram: fabric holding application logic, SRAMs, and control threads, bounded above and below by general-purpose Networks-on-Chip and memory interfaces (global address space).]

• Hard logic for data distribution (NoC)
• "Software"-managed memory hierarchy
• Simple, portable abstraction for the user
Simple Example

Verilog:
  module top(…);
    coram c0 (/*ports*/);
    cofifo f0 (/*ports*/);
    …
  endmodule

Control thread:
  ctrlthread() {
    cohandle c0 = get_sram("c0");
    c_coram_write(c0, sram_address, global_address, size);   // (1)
    c_fifo_write(1);                                         // (2)
  }

[Diagram: (1) the control thread transfers data from global memory into SRAM C0; (2) it then signals the application through a channel FIFO.]
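As a sanity check, the two-step control-thread semantics above can be mocked in plain C. This is an illustrative sketch, not the CoRAM runtime: the SRAM is a plain buffer, global memory is an array, and the channel FIFO is a counter; the sizes and the `SRAM_WORDS` constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock of the slide's control-thread API (illustrative only). */
#define SRAM_WORDS 16

typedef struct { float data[SRAM_WORDS]; } cohandle;

static cohandle c0_sram;               /* mock SRAM "c0"            */
static float    global_memory[64];     /* mock global address space */
static int      channel_fifo;          /* tokens pending for the app */

static cohandle *get_sram(const char *name) { (void)name; return &c0_sram; }

/* Step (1): copy `size` bytes from global memory into the SRAM. */
static void c_coram_write(cohandle *h, size_t sram_address,
                          const void *global_address, size_t size) {
    memcpy(&h->data[sram_address], global_address, size);
}

/* Step (2): signal the application logic through the channel FIFO. */
static void c_fifo_write(int token) { channel_fifo += token; }

static void ctrlthread(void) {
    cohandle *c0 = get_sram("c0");
    c_coram_write(c0, 0, &global_memory[32], SRAM_WORDS * sizeof(float));
    c_fifo_write(1);
}
```

The point of the abstraction is visible even in the mock: the application logic never issues memory requests itself; it only sees filled SRAM contents and FIFO tokens.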
Design Case Study: MMMult

MMM in hardware (C = A·B): N compute engines (1 per row of matrix), each with a multiply-add datapath, A/B SRAMs, a C SRAM, and control. The engines are joined by a custom data network: a per-engine packet processor (PktIn/PktOut, Ren/Wen) plus a shared packet generator and address generator move A, B, and C blocks to and from DRAM.

Packet format: OP | PE_ID | DATA, where OP is load or writeback.
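The OP | PE_ID | DATA packet word can be sketched in C. The slide gives only the field order, not the widths; the 2/6/24-bit split below is purely hypothetical, as are the helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the MMMult network packet: 2-bit OP,
   6-bit PE_ID, 24-bit DATA packed into one 32-bit word. */
enum { OP_LOAD = 0u, OP_WRITEBACK = 1u };

static uint32_t pack_pkt(uint32_t op, uint32_t pe_id, uint32_t data) {
    return ((op & 0x3u) << 30) | ((pe_id & 0x3Fu) << 24) | (data & 0xFFFFFFu);
}

static uint32_t pkt_op(uint32_t p)    { return p >> 30; }
static uint32_t pkt_pe_id(uint32_t p) { return (p >> 24) & 0x3Fu; }
static uint32_t pkt_data(uint32_t p)  { return p & 0xFFFFFFu; }
```

In the hardware design this packing and unpacking would be wires in the packet processor; the struct-of-bitfields view is just a convenient way to reason about it.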
MMMult with CoRAM

Control thread program:

  void ctrl_thread() {
    for (j = 0; j < N; j += NB)
      for (i = 0; i < N; i += NB)
        for (k = 0; k < N; k += NB) {
          c_fifo_read(…);                    // wait until PEs ready
          for (m = 0; m < NB; m++) {
            c_collective_write(ramsA, m*NB, A + i*N+k + m*N, NB*dsz);
            …
          }
          c_fifo_write(…);                   // start compute
        }
  }
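The blocked loop nest above can be exercised with a self-contained C mock. This is a sketch under stated assumptions, not the CoRAM runtime: `c_collective_write` is modeled as a memcpy into a per-PE SRAM row, the FIFO handshakes are counters, and N and NB are small illustrative values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N  8           /* matrix dimension (illustrative) */
#define NB 4           /* block size (illustrative)       */

static float ramsA[NB][NB];          /* mock A SRAMs, one row per PE */
static int   fifo_reads, fifo_writes;

static void c_fifo_read(void)  { fifo_reads++;  }   /* PEs-ready token  */
static void c_fifo_write(void) { fifo_writes++; }   /* start-compute    */

/* Mock collective write: copy `bytes` from global memory `src`
   into the SRAM array at word offset `off`. */
static void c_collective_write(float sram[][NB], int off,
                               const float *src, size_t bytes) {
    memcpy(&sram[off / NB][off % NB], src, bytes);
}

/* The slide's blocked loop nest: for each (j, i, k) block, wait for the
   PEs, stream NB rows of the A block into the SRAMs, then kick off
   the compute. */
static void ctrl_thread(const float *A) {
    for (int j = 0; j < N; j += NB)
        for (int i = 0; i < N; i += NB)
            for (int k = 0; k < N; k += NB) {
                c_fifo_read();
                for (int m = 0; m < NB; m++)
                    c_collective_write(ramsA, m * NB,
                                       A + (size_t)(i + m) * N + k,
                                       NB * sizeof(float));
                c_fifo_write();
            }
}
```

Note what the mock makes explicit: the control thread handles all addressing (`A + (i+m)*N + k`) and synchronization, so the compute engines themselves contain no address-generation logic.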
What we are doing now

• Eric Chung: architecture and API; microarchitecture and RTL design
• Weinan Ma (advised by Prof. Mai): circuit-level design
• Michael Papamichael: on-chip data network
• Gabe Weisz: high-level programming environment
Beneath the Abstraction

[Diagram: layered view of the CoRAM FPGA. Application: control thread programs and SRAMs. Architecture. Microarchitecture: fabric clusters, each with a control unit, over memory interfaces (DRAM or cache) and bulk data distribution (Network-on-Chip).]
Control Threads: Soft or Hard?

Soft methods (Option 1: synthesis to fabric; Option 2: compile to soft cores)
• rely only on the RL fabric, but must share logic with the application
• efficiency/performance depends on the quality of C-based synthesis and compilation

Hard method (Option 3: dedicated silicon for cores)
• high performance/efficiency
• but hard micro-controllers consume area whether used or not
RTL Prototyping (on FPGA)

[Block diagram: a Microblaze (100MHz) with I/O devices, I/O subsystem, and drivers on a 100MHz AXI memory bus; a Xilinx platform adapter, memory controller (Xilinx MIG), and arbiter with 512b/128b links; a NoC of 128b routers (100MHz and 200MHz clock domains) connecting DRAM cache banks to CoRAM clusters; each cluster contains MMMultiply compute elements (multiply-add datapaths) with control units.]
Is CoRAM A Good Idea?

What is the impact on
• programmability and portability?
• application performance?
• area and power?

Some important questions:
• should CoRAM support be hard, soft, or hybrid?
• how to incorporate I/O devices, multi-FPGA?
• what applications can CoRAM support?
Is CoRAM A Good Idea? You can be the judge!

Tool release planned for late 2011. Visit us at: www.ece.cmu.edu/~coram

Relevant papers:
• Eric S. Chung, James C. Hoe, and Ken Mai. CoRAM: An In-Fabric Memory Architecture for FPGA-Based Computing. In Proceedings of FPGA-19, 2011.
• Eric S. Chung, Peter Milder, James C. Hoe, and Ken Mai. Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, GPGPUs, and ASICs? In Proceedings of MICRO-43, 2010.
Computer Architecture Lab (CALCM)
Carnegie Mellon University
http://www.ece.cmu.edu/CALCM