smartcell reconfigurable architecture for low-power stream ... · system with 16 cell units in a 4...

29
SmartCell Reconfigurable Architecture for Low-Power Stream Processing Cao Liang and Xinming Huang Embedded Computing Lab Worcester Polytechnic Institute http://computing.wpi.edu MAPLD Conference September 15-18, 2008 Annapolis, MD

Upload: others

Post on 02-Aug-2020

3 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

SmartCell Reconfigurable Architecture for Low-Power

Stream Processing

Cao Liang and Xinming HuangEmbedded Computing Lab

Worcester Polytechnic Institutehttp://computing.wpi.edu

MAPLD ConferenceSeptember 15-18, 2008Annapolis, MD

Page 2: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 2

Outline

Introduction and Motivation

SmartCell Architecture

SmartCell Prototype with 64 PEs

Benchmark Applications and Performance

Conclusions

Page 3: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 3

ChallengesMajor driving force in embedded computing

Multimedia signal and image processing Wireless communicationsMilitary and space applications

Design challenges: Low power (power efficiency)High performanceFlexibility (Programmability or reconfigurability)

Game Console

Radar imaging

Software radio PDA Image processing

Multimedia TV Data encryption

Scientific computing

Page 4: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 4

Existing Computing PlatformsGeneral purpose processors (GPP)Application specific integrated circuit (ASIC)Reconfigurable architecture

Dominated by Field Programmable Gate Array (FPGA)

New architectures: CellBE, GPU

Reconfigurablesystems (FPGA)

Performance, Power Efficiency

Flex

ibili

ty General Purpose

Processor

Customized ASICs

Page 5: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 5

Coarse-Grained Reconfigurable Architecture (CGRA)

Motivations of the SmartCell architectureCoarse-grained computing operatorsReconfigurable interconnectionDomain specific, e.g. stream processing

Bridging the gap between FPGA and ASIC

[Bjerregaard 2006]

Page 6: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 6

Overview of SmartCell ArchitectureComputing units are tiled in a 2D structure

Page 7: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 7

Design of Processor Element

Processor Element (PE)16-bit input, 36-bit output Logic, Shift, and Arithmetic operations

Page 8: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 8

Design of Cell UnitInclude 4 PEs to form a quad structureFully connected cross-bar (S_Box) for date exchangeSerial peripheral interface (SPI) for instruction configuration

Page 9: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 9

On-chip Interconnection Design

Modified CMesh On-chip NetworkTraditional CMesh Hierarchical CMesh

Page 10: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 10

Control Logic Design

Four types of control signals Program counter control Datapath/delay control Operation controlNetwork-on-Chip control

Format of the instruction code (64-bit/instruct)

Page 11: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 11

Configuration Structure

System configuration

Page 12: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 12

Prototype Chip Design

Implementation of a seedling SmartCell system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs

RTL level design and simulationFPGA prototypingStandard cell ASIC implementation with TSMC .13 μm technology Total area is about 8.2 mm2

Runs up to 107 MHz Configuration time is within 12μs

Page 13: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 13

SmartCell Features A combination of the following features makes SmartCell a unique approach in CGRA families

Dynamic reconfiguration Deep pipeline and parallelism Hardware virtualization Explicit synchronization Unique system topology

PE1

PE2

PE3

PE4 PE1

PE2

PE3

PE4

PE1

PE2

PE3

PE4PE1

PE2

PE3

PE4

PE1

PE2

PE3

PE4 PE1

PE2

PE3

PE4

PE1

PE2

PE3

PE4PE1

PE2

PE3

PE4

PE1

PE2

PE3

PE4 PE1

PE2

PE3

PE4

PE1

PE2

PE3

PE4PE1

PE2

PE3

PE4

1D Systolic array 2D Systolic array SIMD structure

Page 14: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 14

Application Domain and Benchmarks

Application Domain

Test Benches

Signal processing

64-tap FIR64-tap IIR

Multimedia and image processing

32-point FFT8*8 2D-DCT,

8 by 8 Motion Estimation (ME) in 24 by 24 searching area

Scientific computing

128 by 128 Matrix Multiplication (MMM), 64th-order Polynomial Evaluation (PoE)

RC5 Data Encryption

Page 15: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 15

Benchmark Mapping Infinite Impulse Response (IIR) filter

Biquad cascaded-IIR structure on a single Cell

Page 16: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 16

Benchmark Mapping (cont’)2D Discrete Cosine Transform (2D DCT)

Decomposed into two 1D DCTs

Page 17: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 17

Experimental SetupEvaluation Metrics

Area & TimingPower consumptionThroughput and power efficiencyComparing with RaPiD, Altera’s Stratix II FPGA and ASIC

System dimension 4 by 4

Design tools ModelSim, Synopsys

Library TSMC .13μm process

Voltage 1 V

Simulation freq. 100 MHz

Page 18: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 18

Area and Power Consumption

Generated at 100 MHz with fully operational circuits

Page 19: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 19

Power Consumption and Efficiency

On average 156 mW power consumption @ 100 MHz31 GOPS/W energy efficiency

only arithmetic & logic operations, excluding I/O power

FIR IIR 2D-DCT RC5 MMM FFT PoE ME

PDyn <mW> 144 181 153 132 135 161 142 137

PCore <mW> 152 189 161 140 143 169 150 145

EEff <GOPS/W>

42.1 33.9 39.8 45.7 11.2 18.9 42.7 11.0

Page 20: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 20

Compare with RaPid, FPGA, and ASICPower and system throughput comparison

Power consumption of RaPiD has been scaled down to the same process technology of SmartCell system

Page 21: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 21

Compare with Rapid and FPGA

52% average power reduction compared with RaPiD75% average power reduction compared with FPGA

Page 22: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 22

Power Efficiency Comparison

* Compare to 90nm Stratix-II FPGAs

Page 23: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 23

Conclusions

An interesting CGRA architecture is proposed and developed – namely SmartCellThe architecture is reconfigurable and can be targeted for different computing systemsA prototype design with 64 PEs shows both throughput and power efficiency in benchmarks of data streaming applicationsSmartCell may have the potential to bridge the gap between high-power FPGAs and inflexible ASICs

Page 24: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 24

The following are backup slides

Page 25: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 25

Existing CGRAs

Page 26: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 26

Characteristic Comparison

Page 27: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 27

SmartCell: Tiled Architecture, Processor Design and Interconnect

Many cells are titled in 2D layoutEach cell has 4 PEs (N,W,S,E)Simplified processor with memA crossbar within the cell; on-chip interconnect uses CMesh

Page 28: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 28

Features and Application Domain

FFT Butterfly

RC5 data encryption

Polynomial Evaluation

FIR filter

2D DCT

Page 29: SmartCell Reconfigurable Architecture for Low-Power Stream ... · system with 16 cell units in a 4 by 4 mesh structure, with a total of 64 PEs RTL level design and simulation FPGA

Xinming Huang SmartCell Reconfigurable Architecture for Low-Power Stream Processing 29

System Design and PerformancePrototype chip design: 4x4 cells (64 PEs), .13 TSMC,8.2mm2, 1V, about 156mW @100MHzBenchmark with RaPiD, Stratix-II (90nm), and ASIC

Acknowledgement:Dr. Michael Fritz, DARPA/MTO YFA ProgramCao Liang, WPI graduate assistant, now with AMD