
Japanese 2nd Generation Dynamically Reconfigurable Processors

ERSA2009

Invited Speech

Hideharu Amano

Keio Univ.

Commercial Products using Dynamically Reconfigurable Processors

• SONY PMW EX-1/3 professional camcorder: NEC Electronics' STP engine
• Panasonic's professional camcorder: DFabric
• Multifunction printers: IPFlex's DAPDNA-2
• SONY PSP: VME (Virtual Mobile Engine)

Short history of Dynamically Reconfigurable Processors

[Figure: timeline from 1990 to 2005 with two tracks, "FPGA with Dynamic Reconfiguration" and "Processor with Reconfigurable Instructions", and two eras, "The 1st Generation" and "The 2nd Generation", annotated "A lot of commercial systems"]

Systems on the timeline: MPLD (Fujitsu), WASMII (Keio), DISC (Brigham Young Univ.), Time Multiplexed FPGA (Xilinx), DRL (NEC), GARP (UCB), CHIMAERA (Northwestern Univ.), Xpp (PACT), CS2112 (Chameleon), DRP (NEC elec.), DAPDNA/2 (IPFlex), DFabric (Elixent), Kilocore (Rapport), PipeRench (CMU), X-bridge (NEC elec.), DAPDNA/IMX (IPFlex), S-5 (Stretch), S-6 (Stretch), FE-GA (Hitachi)

Product            Vendor            Context          Data (bits)   PE
D-Fabrix           Panasonic         Deliver          4             Homo
Xpp                PACT              Deliver          24            Homo
S5/S6 engine       Stretch           Deliver          4/8           Hetero
CS2112             Chameleon         Multi-C (8)      16/32         Homo
DAPDNA-2           IPFlex            Multi-C (4)      32            Hetero
DRP-1              NEC electronics   Multi-C (16)     8             Homo
STP-engine         NEC electronics   Multi-C (32)     8             Homo
Kilocore           Rapport           Multi-C          8             Homo
ADRES              IMEC              Multi-C (32)     16            Homo
FE-GA              Hitachi           Multi-C          16            Hetero
For Car-tuners     SANYO             Multi-C (4)      24            Homo
FlexSword (SAKE)   Toshiba           Multi-C (4/16)   16            Homo
Cluster            Fujitsu           Multi-C          16            Hetero

Context: Deliver = configuration delivered from a central memory; Multi-C (n) = multicontext with n contexts. PE: Homo = homogeneous, Hetero = heterogeneous.

Most Japanese semiconductor companies have their own projects!

Outline

• Why Dynamically Reconfigurable Processors?
– A solution to recent SoC problems.

• What is a Dynamically Reconfigurable Processor?
– Coarse Grain Structure
– Dynamic Reconfiguration
– C-level programming

• What are the main advantages/limitations?
– Comparison with other architectures
– Low power consumption

• The 2nd generation examples

Why Dynamically Reconfigurable Processors?

A solution to problems in SoCs (System-on-a-Chip)

[Figure: SoC block diagram: CPU, Memory, Application Specific Hardware, I/O]

SoC (System-on-a-Chip)

The brain of various IT products, e.g. cellular phones, network controllers, mobile terminals, video cameras, car electronics…

Problem!
• Performance depends on the application-specific hardware.
• Various new techniques keep coming up.
• The design/mask cost of leading-edge semiconductor processes has increased greatly.

A powerful but flexible, low-power, low-cost off-load engine is required!

How about using common FPGAs?

[Figure: SoC block diagram: CPU, Memory, Application Specific Hardware, I/O, Common FPGA]

A common FPGA is flexible. Xilinx's FPGAs with PowerPC (e.g. Virtex-4/FX) are widely used. Of course, Altera's are also popular.

But
• A System on a Programmable Device tends to be too expensive and too power-hungry for most consumer products.
• These drawbacks come from the static, fine-grain architecture.

What is a Dynamically Reconfigurable Processor?

[Figure: SoC block diagram: CPU, Memory, Application Specific Hardware, I/O, Dynamically Reconfigurable Processor]

Flexible accelerators in SoCs

• Coarse grain structure → High performance
• Dynamic reconfiguration → High area efficiency
• C-level programming → Easy to design


1. Coarse Grain Structure

An example of PE array

[Figure: MuCCRA-1 by Keio Univ. (ASSCC2007): an array of PEs, each paired with a switching element (SE), plus memory modules (MEM) and multipliers (MULT)]

• Island style like FPGAs
• Various types of array structures are used

An example of PE (Processing Element)

[Figure: PE of MuCCRA-1: SMU, ALU and RFile connected through input selectors (e.g. smuasel, smubsel, aluasel, alubsel, rfsel), with ina/inb/inc inputs and out/outc outputs]

• 24bit data, 2bit carry
• ALU: Add/Sub/Mult/CMP
• SMU: Shift/Mask/Constant
• RFile: Register Files

2. Dynamic Reconfiguration

• The operations of PEs and interconnections are defined by configuration data stored in the configuration memory, as in FPGAs.

• Changing the configuration data dynamically → The data path for various applications can be switched quickly.

• How are the configuration data changed?

– High speed delivery from the central configuration memory.

– Multicontext dynamic reconfiguration → One-clock dynamic reconfiguration

Quick delivery of instructions/configuration from on-chip memory

[Figure: on-chip memories delivering configuration to the PE/SE arrays]

• Delivery takes tens of microseconds
• PACT Xpp
• Panasonic (Elixent's) DFabric

Dynamic reconfiguration is done mainly for task switching.

Multicontext Function

[Figure: SRAM slots 1, 2, ..., n (contexts) feed a multiplexer that configures each PE/SE; input data flows through the PE/SEs to output data]

A number of configuration memory slots are provided.

They can be switched in one clock cycle

→ The hardware structure is changed in one clock cycle

→ Hardware context switching

Practical implementation of multicontext structure

[Figure: a Context Pointer selects one entry of the Context Memory attached to each PE or Switching Element]
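The multicontext structure above can be sketched in a few lines of Python. This is only an illustrative model, not the design of any product in the slides: the class name `MulticontextPE`, the operation table `OPS` and the use of an operation name as the "configuration word" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical operation set; the slides only say that the PE operation
# is defined by configuration data.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


@dataclass
class MulticontextPE:
    """A PE whose operation is selected by a context pointer."""
    context_memory: List[str]  # one configuration word (here: an op name) per slot
    context_pointer: int = 0

    def switch(self, slot: int) -> None:
        # In this model, reconfiguration is nothing more than moving the pointer.
        if not 0 <= slot < len(self.context_memory):
            raise IndexError("context slot not loaded")
        self.context_pointer = slot

    def execute(self, a: int, b: int) -> int:
        op = OPS[self.context_memory[self.context_pointer]]
        return op(a, b)


pe = MulticontextPE(context_memory=["add", "mul", "sub"])
results = []
for slot in (0, 1, 2):
    pe.switch(slot)           # one "clock": select another context
    results.append(pe.execute(6, 3))
print(results)                # [9, 18, 3]
```

The same operands produce different results as the pointer moves, which is the sense in which the hardware structure changes in one clock cycle.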

3. C-level programming

• The programming environment is a mixture of a traditional C compiler and an FPGA design tool.

• The C code is divided into data flow and control.

• The assignment of contexts, PEs and memory modules can be done automatically.

• Place-and-route sometimes takes a long time, as in FPGA design.

• Programming is easy only if the data to be processed can be mapped onto the memory modules.

Example: DRP Compiler (NEC)

• Compiling C source code into DRP object code

Behavioral Description Language (BDL)

• High level synthesis: generates finite state machines (FSMs) and associated datapath planes
– The ASIC behavioral design tool Cyber is modified and used.

• Mapper: maps the FSMs and datapath planes to the STC and the PEs, respectively

• Place & Router: physically locates the PEs, memories and the interconnections between them

Flow: C Source Code → High Level Synthesis (FSM, Datapath) → Technology Mapper → Place & Router → Code Generation → Object Code
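To make the split into control (FSM) and data flow (datapath planes) concrete, here is a hand-written sketch for a dot-product loop. It does not reproduce the output of the DRP compiler; the state names `INIT`, `MAC`, `DONE` and the function `run_fsm` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Regs = Dict[str, int]


def run_fsm(a: List[int], b: List[int]) -> Tuple[int, List[str]]:
    """Dot product written as an FSM that selects one datapath plane per step."""
    regs: Regs = {"i": 0, "acc": 0}

    # Datapath planes: pure data flow, no decisions.
    def plane_init(r: Regs) -> None:
        r["i"], r["acc"] = 0, 0

    def plane_mac(r: Regs) -> None:
        r["acc"] += a[r["i"]] * b[r["i"]]
        r["i"] += 1

    planes: Dict[str, Callable[[Regs], None]] = {
        "INIT": plane_init,
        "MAC": plane_mac,
    }

    # Control: the transition function only inspects a loop condition.
    def next_state(r: Regs) -> str:
        return "MAC" if r["i"] < len(a) else "DONE"

    state = "INIT"
    trace: List[str] = []
    while state != "DONE":
        trace.append(state)
        planes[state](regs)       # the active plane plays the role of a context
        state = next_state(regs)
    return regs["acc"], trace


total, trace = run_fsm([1, 2, 3], [4, 5, 6])
print(total, trace)
```

In this picture each FSM state corresponds to one context of the PE array, and the state transition logic is the part that a controller such as the STC would hold.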


Dynamically Reconfigurable Processors vs. other architectures

vs. Multi-core/Many-core architectures
– No instruction fetch/cache mechanism
– Less flexible but much smaller area → 16 PEs in 1.5 mm square at 90 nm (MuCCRA-2)

vs. SIMD (Single Instruction Stream, Multiple Data Streams)
– The operations and interconnections can be customized for each PE and SE. → Efficient for complicated algorithms.
– The number of instructions/contexts is small

vs. VLIW (Very Long Instruction Word)
– A larger degree of parallelism can be utilized. → Higher performance can be obtained.
– The number of instructions/contexts is small

MuCCRA-2 Floor Plan

• ASPLA's 90nm process
• 2.5 mm x 2.5 mm die (core: 1.5 mm x 1.5 mm)

The total PE array is smaller than one core of recent multi/many-core processors.


[Figure: Granularity vs. Num. of Cores vs. Num. of HW-contexts. Axes: number of HW-contexts (1, 3, 8, 16, Many), number of cores (10, 100, 1000), granularity of core (4bit, 8bit, 16bit, 32bit). Plotted: Common Processor, VLIW, Multi-Core processor, DAPDNA-2, DRP, DRL, CS2112, Xbridge, FE-GA, DFabric, Xpp, FPGA, FPGA extension, with a region labeled "Dynamically Reconfigurable Processors"]

Main Advantage: Low power consumption

Why low power?

1. No redundant hardware
– There are no instruction fetch mechanisms, caches, TLBs, etc. → Of course, it cannot be a general-purpose engine, but it is enough for an accelerator.
– A bare datapath works only for computation.

2. Parallel execution with a number of PEs
– A much lower clock frequency can be used to achieve the same performance as other architectures.
– The main problem is leakage power, but it can be suppressed by power gating techniques.

About 10X more energy efficient than DSPs, and 5-50X more than FPGAs. Sometimes similar to hardwired logic.
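Why a lower clock frequency helps can be seen from the usual first-order model of CMOS dynamic power. The model and the symbols below (activity factor $\alpha$, switched capacitance $C$, supply voltage $V$, clock $f$, number of PEs $N$, reduced voltage $V'$) are textbook assumptions and do not come from the slides; leakage is ignored.

```latex
\begin{align*}
P_{\mathrm{single}} &= \alpha\, C\, V^{2} f \\
P_{\mathrm{parallel}} &= \alpha\, (N C)\, V'^{2}\, \frac{f}{N} = \alpha\, C\, V'^{2} f \\
\frac{P_{\mathrm{parallel}}}{P_{\mathrm{single}}} &= \left(\frac{V'}{V}\right)^{2}
\end{align*}
```

Under these assumptions, $N$ PEs running at $f/N$ deliver the same throughput, and the saving comes from the lower supply voltage that the slower clock permits: for example, $V'/V = 0.7$ gives a ratio of $0.49$. The figures quoted in the slides also include the effect of removing instruction fetch and cache hardware, which this model does not capture.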

[Figure: bar chart of energy consumption (nJ, axis from 0 to 7000) for DCT, Viterbi and SHA-1 on MuCCRA, FPGA and ASIC]

The comparison uses 0.18 um implementations.

The main limitations as an accelerator in SoCs

• The data must be stored in the memory modules placed around the PE array.
– If the data is larger than the memory, it is hard to handle.

• If more contexts are required than fit in the context memory, the operational speed is much degraded.
– A virtual hardware mechanism is provided, but it has certain limitations.

• Performance does not improve much for problems without parallelism.
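The second limitation can be illustrated with a small cycle-count model. Everything here is an assumption made for the sketch: the one-cycle cost per context, the `load_cycles` penalty for fetching a context that is not resident, and the LRU replacement policy. The slides do not describe how the actual virtual hardware mechanism chooses what to replace.

```python
from collections import OrderedDict
from typing import Iterable


def count_cycles(trace: Iterable[int], slots: int, load_cycles: int) -> int:
    """Cycles to run a sequence of contexts: 1 cycle each, plus a reload
    penalty whenever the context is not resident (LRU replacement)."""
    resident: "OrderedDict[int, None]" = OrderedDict()
    cycles = 0
    for ctx in trace:
        if ctx in resident:
            resident.move_to_end(ctx)         # mark as most recently used
        else:
            cycles += load_cycles             # fetch from central memory
            if len(resident) >= slots:
                resident.popitem(last=False)  # evict least recently used
            resident[ctx] = None
        cycles += 1                           # execute the context
    return cycles


loop = [0, 1, 2, 3] * 100                     # a task cycling through 4 contexts
fits = count_cycles(loop, slots=4, load_cycles=50)
overflow = count_cycles(loop, slots=3, load_cycles=50)
print(fits, overflow)                         # 600 20400
```

With four slots only the first pass pays the load penalty; with three slots every access misses under this policy, so the same task takes 34 times as many cycles in this model.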


Dynamically Reconfigurable Processors: The 2nd generation

• Customized for a specific target application area
– SANYO car tuner → Tuner
– Fujitsu → Wireless communication
– Toshiba SAKE → Multi-media
– NEC electronics X-bridge → Multi-media

• Multi-core structure with small PE arrays rather than a big array
– Cooperation with various types of cores

• Integrated design environment

• Low power design → The main advantage!

X-bridge: NEC electronics (2008)

[Figure: X-bridge block diagram. A MIPS CPU (I-C/D-C, JTAG), UART x2, CSI, GPIO, INTC, general port (8b x 4), work RAM (1 kB), peripheral I/F, 10/100 Ether MAC, PCI host/target, two PCIexp HB/EP (1-lane) ports, DDR2 SDRAM controller, DMA controllers and SPL blocks are connected by a 64bit on-chip bus (266MHz) and a 64bit memory switch (266MHz); the STP Engine is attached through Nconnect]

• STP Engine: Dynamically Reconfigurable Core, 256 PEs (8bit), 32 contexts
• Provides the virtual hardware mechanism
• The DMA controller hides the communication overhead

From an invited talk at Design Gaia 2008

Mixture of SIMD and DRP units: Toshiba's FlexSword

[Figure: "Our Architecture". Blocks: Host Processor, Host I/F, System Memory, Control, I/O Buffer (Data RAM), Code Buffer (Code RAM), Formatter0, Formatter1, Write, AUX0, AUX1, SIMD Units, Inter-Unit Buffer (Data Registers), with separate code and data paths]

• Dynamically Reconfigurable Units (independently controlled)
• Optimized for stream processing

From FPT2007 Tutorial session

The Architecture (Formatter)

[Figure: Formatter data path: data A and data B inputs, Xbar In, Shuffle, 16-bit ALU x 8 PE, Xbar Out, Cfg Mem, Cfg Controller, Code Mem, valid/ID signals; a PE variant without Shuffle is also shown (numeric labels: 128, 128, 64)]

Simple Hardware
• Pipeline registers only
• No intra-PE data transfer
• PE: 4 cfgs, Xbar: 16 cfgs
• ALU, shift & absolute ops only

Xbar In: Formatter0 only
Xbar Out: Formatter1 only

Suitable for butterfly operations

From FPT2007 Tutorial session

SANYO's Car tuner DRP

[Figure: an ALU array (rows L1 to L4) with main memory, In/Out ports, a feedback path, a sequencer and command memory; a schedule chart shows the steps of threads Th1 to Th4 (Th1-1, Th1-2, ...) interleaved across rows L1 to L4]

Pipelined execution of 4 threads

Fine carrier frequency offset estimation/correction

[Figure: three mappings onto Cluster0 to Cluster6, with I/Q inputs and I/Q outputs to the FFT. Steps named in the figure: self-correlation, phase offset calculation (DIV, ATAN), correction offset calculation in phase, polar conversion, complex multiply, data out control & clip; Cluster6 is passed through and a register in Cluster0 holds intermediate values]

a) Fine carrier frequency offset estimation for LT1

b) Fine carrier frequency offset estimation for LT2

c) Fine carrier frequency offset correction for SIGNAL and DATA
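The steps named in the figure (self-correlation, phase calculation with an arctangent, correction by complex multiply) match the standard way of estimating a carrier frequency offset from a repeated training symbol. The sketch below shows that standard technique in plain Python; it is not the cluster mapping from the slides, and the function names, the sample rate, the symbol length and the test signal are assumptions chosen for illustration.

```python
import cmath
import math
from typing import List


def estimate_cfo(rx: List[complex], period: int, sample_rate: float) -> float:
    """Estimate the carrier frequency offset (Hz) from two consecutive
    copies of the same training symbol, each `period` samples long."""
    # Self-correlation between the first copy and the second copy.
    corr = sum(rx[n].conjugate() * rx[n + period] for n in range(period))
    # Phase rotation accumulated over one period (the ATAN step).
    phase = cmath.phase(corr)
    return phase * sample_rate / (2.0 * math.pi * period)


def correct_cfo(rx: List[complex], cfo_hz: float, sample_rate: float) -> List[complex]:
    """Remove the offset by multiplying with a counter-rotating phasor."""
    return [x * cmath.exp(-2j * math.pi * cfo_hz * n / sample_rate)
            for n, x in enumerate(rx)]


# Synthetic check with illustrative values (not taken from the slides).
fs, period, true_cfo = 20e6, 64, 50e3
symbol = [cmath.exp(2j * math.pi * ((3 * k * k) % 64) / 64) for k in range(period)]
tx = symbol + symbol                              # the symbol sent twice
rx = [x * cmath.exp(2j * math.pi * true_cfo * n / fs) for n, x in enumerate(tx)]

est = estimate_cfo(rx, period, fs)
fixed = correct_cfo(rx, est, fs)
print(round(est))
```

The estimate is unambiguous only while the phase accumulated over one period stays within ±π, which bounds the offset this method can measure.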

Hitachi's FE-GA

[Figure: FE-GA block diagram. Computational Cell Array with 24 ALU cells and 8 MLT cells; Crossbar Network; 10 Load/Store Cells (LS); 10 Local Memory blocks (MEM); Configuration Manager; Sequence Manager; Bus Interface; I/O port; Interrupt/DMA request]

Heterogeneous Multi-Core using FE-GA

[Figure: four SH-4 cores (CPU0 to CPU3) and four FE-GA cores (DRP0 to DRP3), each with LPM, FVR, LDM, DSM, DTU and a Network Interface, connected to an On-Chip CSM through its Network Interface]

The code is generated by a parallelizing compiler and standard APIs.

Summary

• The 2nd generation Dynamically Reconfigurable Processors are being embedded into consumer electronics products.

• The main advantage is low power consumption.

• The main limitation is data memory → use is limited to a kind of stream computing.

• Especially active in Japan

– Major Japanese consumer electronics companies are all trying to develop such systems.

Yes. Japanese culture loves dynamic reconfiguration!

Thank you! A part of our own project will be presented in the later sessions.

PE architecture

• Simple structure
• Executes up to 4 instructions in parallel

[Figure: PE block diagram. Inputs from and outputs to the upper, lower, left and right cells; Input Switch; Delay Adjustment; ALU (arithmetic, logical, flow control); SFT (shift); THR (data control); Output Switch; Transfer Register (TREG); Configuration Register (x4); control bus. Width label as extracted: "1-bit x 48-bit x 4 w/ valid bit"]

DRP Programming

[Figure: contexts 0 to 5 between data input and data output]

1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context

• 3-dimensional flexibility
• The functional optimizer works efficiently
• Efficient C-level programming
• Context is controlled with a state machine

Time multiplexed execution

[Figure: target hardware mapped onto smaller real hardware]

• Most of the hardware works only part of the time.
• With simple time multiplexing the area becomes 1/n, but the performance also becomes 1/n.
• A single task can be executed with multiple contexts. → Area efficiency is improved!
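The area/time trade-off can be made concrete with a deliberately simple model. The assumptions are mine, not the slides': the task is a layered dataflow graph, each layer is executed in one or more contexts, and a context can hold at most `num_pes` operations. The function name `time_multiplex` is hypothetical.

```python
import math
from typing import List, Tuple


def time_multiplex(ops_per_level: List[int], num_pes: int) -> Tuple[int, float]:
    """Return (number of contexts, PE utilization) when each dataflow level
    is split into contexts of at most `num_pes` operations."""
    contexts = sum(math.ceil(n / num_pes) for n in ops_per_level)
    utilization = sum(ops_per_level) / (contexts * num_pes)
    return contexts, utilization


# A 16-input reduction tree: 8, 4, 2 and 1 operations per level (15 in total).
tree = [8, 4, 2, 1]
for pes in (8, 4):
    contexts, utilization = time_multiplex(tree, pes)
    print(pes, contexts, utilization)
```

In this example halving the array from 8 to 4 PEs costs only one extra context (5 instead of 4), while utilization rises from about 47% to 75%, which is the sense in which executing one task over several contexts improves area efficiency.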

A wide research field of reconfigurable architectures

• Two major extremes of multiple-core architectures:
– FPGAs
  • Fine-grained multiple-core architectures with a huge number of cores
  • Basically static: 1 hardware context
– Many-core processors
  • Very coarse-grained multiple-core architectures
  • Fully programmable: infinite number of hardware contexts

• Between them: a WIDE RESEARCH FIELD

Our environment for architectural exploration: MuCCRA array design environment [FPL07]

[Figure: design flow]

Tools:
• DRPA Verilog-HDL Generator
• Retargetable Compiler "Black Diamond"
• Logic synthesis (Synopsys Design Compiler)
• Placement and routing (Synopsys Astro)
• RTL/Net/Chip simulation (Cadence NC-Verilog)
• Timing analysis (Synopsys PrimeTime)

Data: architecture parameters, Template Library, Verilog HDL description, CMOS standard cell library, netlists, GDSII, application programs, test bench and test vectors

Extremely Low Power Design

• Now the major benefit of Dynamically Reconfigurable Processors
– 1/8 to 1/10 of a DSP [ASSCC07]
– The main reason why SONY uses VME (Virtual Mobile Engine) in the PSP (PlayStation Portable) and X-bridge in professional video systems.

• Applying traditional techniques / reducing the overhead of context switching [FPL08]
– Operand isolation is quite effective

• Context-oriented voltage control [Schweizer:FPT07]
• Fine-grained power gating [FPT08 Poster]
• Dual Vth

Networks on Chip for reconfigurable systems

• For inner-core connection
– Island style/direct interconnection
– New style of interconnection?

• For inter-core connection
– A network similar to those of many-core systems may be used?

• Three-dimensional/wireless
– A new possibility

3-dimensional wirelessly connected dies: MuCCRA-Cube

• Each plane corresponds to an array like MuCCRA-2 (4×4 PEs)
• 4 planes are connected with very high speed inductive wireless interconnection (3 Gbit/sec per channel)
• Planes are connected in the flipped direction
• 16 channels are provided in the 3-D direction

[Figure: direction of planes and channels]

MuCCRA-Cube Prototype

• STARC/ASPLA 90nm
• 2.5 mm x 5 mm die
• Verilog-HDL is used for design
• Synthesis: Synopsys Design Compiler 2006.06-SP2
• Place & Route: Synopsys Astro 2007.03-SP3
• Simulation: Cadence Verilog-XL 5.7

[Figure: die layout with DATA MEM, TCC, PE/SE, CSC, Transceiver (Data), Transceiver (CLK)]

Summary

• There is a wide field for architectural exploration between FPGAs and many-core processors

• Keywords
– Application Configurable
– Low Power Techniques
– Interconnection Networks including Three-dimensional/Wireless
– Integrated Design Environment

IMEC ADRES

[Figure: an array of function units (FU), most with a local register file (RF). Reconfigurable array view: the whole FU/RF array. VLIW view: a row of FUs sharing one RF, with Instruction Fetch, Instruction Dispatch, Instruction Decode and Data Cache]

Rapport Kilocore

[Figure: stripes of PEs, each stripe with its own Interconnect; Configuration Controller; Input Controller; Output Controller. Fabric: 16 PEs x 16 PEs. Width labels: 128bits, 128bits, 672bits, 32bits]

Stretch S5 engine

[Figure: MMU, Inst Cache, Data Cache, Inst Unit, Load/Store Unit; FP Unit (FR, FPU); Integer Unit (AR, ALU); Extension Unit (WR, ISEF)]

Implementation

[Screen shot of Context Menu]

[Screen shot of Code Menu: User Function, Library Function]

[Screen shot of Pointer: Input, Output]

[Screen shot of MuCCRA-1: Memory Modules, Multiply Modules, Switching Element, PE]

[Screen shot of MuCCRA-2: PE, Switching Element, Memory Modules]

DRP Tile structure

[Figure: a PE array with a State Transition Controller, HMEM blocks, VMEM blocks and VMEM controllers (VMEM ctrl)]

• HMEM: 1 port, 8bit × 8K entries
• VMEM: 2 port, 8bit × 256 entries

Task and context control in MuCCRA [FPL08]

• Context control
– Multicontext switching with a Context Pointer

• Task control
– Multiple tasks, each consisting of multiple contexts, are loaded from the centralized memory
– A Virtual Hardware Mechanism

[Figure: the TCC (Task Configuration Controller) takes configuration data (contexts) for target tasks A to D from the Configuration Data Memory; the CSC (Context Switching Controller) drives the Context Pointer and control signals; each PE of the MuCCRA PE array (SMU, ALU, RFile) has a Context Memory with entries 0 to 63]

