Japanese 2nd generation Dynamically Reconfigurable Processors
ERSA2009
Invited Speech
Hideharu Amano
Keio Univ.
Commercial Products using Dynamically Reconfigurable Processors
• SONY PMW EX-1/3 professional camcorder: NEC electronics' STP engine
• Panasonic's professional camcorder: DFabric
• Multifunction printers: IP Flex's DAPDNA-2
• SONY PSP: VME (Virtual Mobile Engine)
Short history of Dynamically Reconfigurable Processors
[Figure: timeline from 1990 to 2005]
• FPGA with dynamic reconfiguration: MPLD (Fujitsu), WASMII (Keio), Time Multiplexed FPGA (Xilinx), DRL (NEC)
• Processor with reconfigurable instructions: DISC (Brigham Young Univ.), GARP (UCB), CHIMAERA (Northwestern Univ.)
• Dynamically reconfigurable processors, from the 1st generation to the 2nd generation: Xpp (PACT), CS2112 (Chameleon), DRP (NEC elec.), DAPDNA/2 (IPFlex), DFabric (Elixent), Kilocore (Rapport), PipeRench (CMU), X-bridge (NEC elec.), DAPDNA/IMX (IPFlex), S-5 and S-6 (Stretch), FE-GA (Hitachi)
• A lot of commercial systems
Product           Vendor            Context          Data    PE
D-Fabrix          Panasonic         Deliver          4       Homo
Xpp               PACT              Deliver          24      Homo
S5/S6 engine      Stretch           Deliver          4/8     Hetero
CS2112            Chameleon         Multi-C (8)      16/32   Homo
DAPDNA-2          IPFlex            Multi-C (4)      32      Hetero
DRP-1             NEC electronics   Multi-C (16)     8       Homo
STP-engine        NEC electronics   Multi-C (32)     8       Homo
Kilocore          Rapport           Multi-C          8       Homo
ADRES             IMEC              Multi-C (32)     16      Homo
FE-GA             Hitachi           Multi-C          16      Hetero
For Car-tuners    SANYO             Multi-C (4)      24      Homo
FlexSword (SAKE)  Toshiba           Multi-C (4/16)   16      Homo
Cluster           Fujitsu           Multi-C          16      Hetero
Most Japanese semiconductor companies have their own projects!
Outline
• Why Dynamically Reconfigurable Processors?
– A solution to recent SoC problems
• What is a Dynamically Reconfigurable Processor?
– Coarse grain structure
– Dynamic reconfiguration
– C-level programming
• What are the main advantages/limitations?
– Comparison with other architectures
– Low power consumption
• The 2nd generation examples
Why Dynamically Reconfigurable Processors?
A solution to problems on SoC (System-on-a-Chip)
[Figure: SoC block diagram with CPU, Memory, Application Specific Hardware, and I/O]
The SoC is the brain of various IT products, e.g. cellular phones, network controllers, mobile terminals, video cameras, car electronics…
Problem!
• The performance depends on the application specific hardware.
• Various new techniques keep coming up.
• The design/mask cost of leading-edge semiconductor processes has increased considerably.
A powerful but flexible, low power/cost off-load engine is required!
How about using common FPGAs?
[Figure: the SoC block diagram (CPU, Memory, Application Specific Hardware, I/O) with a common FPGA added]
Common FPGAs are flexible. Xilinx's FPGAs with PowerPC (e.g. Virtex-4/FX) are popularly used. Of course, Altera's are also popular.
But
• A system on a programmable device tends to be too expensive and too power consuming for most consumer products.
• These drawbacks come from the static fine-grain architecture.
What is a Dynamically Reconfigurable Processor?
[Figure: the SoC block diagram (CPU, Memory, Application Specific Hardware, I/O) with a Dynamically Reconfigurable Processor added]
Flexible accelerators in SoCs
• Coarse grain structure → high performance
• Dynamic reconfiguration → high area efficiency
• C-level programming → easy to design
1. Coarse Grain Structure
An example of PE array
[Figure: PE array of MuCCRA-1 by Keio Univ. (ASSCC2007), built from PEs with switching elements (SE), memory modules (MEM), and multipliers (MULT)]
• Island style like FPGAs
• Various types of array structures are used
An example of PE (Processing Element)
[Figure: PE of MuCCRA-1. An SMU, an ALU, and an RFile are fed through input selectors and controlled by configuration signals; the data path is 24-bit data plus 2-bit carry.]
• ALU: Add/Sub/Mult/CMP
• SMU: Shift/Mask/Constant
• RFile: Register Files
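A minimal sketch of how configuration data could select what such a PE does in a given cycle. Only the 24-bit data width and the ALU/SMU/RFile roles come from the slide; the `PEConfig` fields, the operation names, and the register count are hypothetical, and the 2-bit carry path is not modeled.

```python
from dataclasses import dataclass

MASK24 = (1 << 24) - 1  # 24-bit data path, as stated on the slide

# Hypothetical operation tables; the real MuCCRA-1 encoding is not given.
ALU_OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mult": lambda a, b: a * b,
    "cmp": lambda a, b: int(a == b),
}
SMU_OPS = {
    "shl": lambda a, b: a << min(b, 24),
    "shr": lambda a, b: a >> min(b, 24),
    "mask": lambda a, b: a & b,
    "const": lambda a, b: b,
}


@dataclass(frozen=True)
class PEConfig:
    unit: str  # "alu" or "smu" (illustrative field)
    op: str    # key into the table of the selected unit
    dst: int   # register file address that receives the result


class PE:
    def __init__(self, num_regs=8):
        self.rfile = [0] * num_regs

    def step(self, cfg, a, b):
        """Apply the configured operation and wrap the result to 24 bits."""
        table = ALU_OPS if cfg.unit == "alu" else SMU_OPS
        result = table[cfg.op](a & MASK24, b & MASK24) & MASK24
        self.rfile[cfg.dst] = result
        return result
```

For example, `PE().step(PEConfig("alu", "add", 0), 0xFFFFFF, 1)` wraps around to 0 because the sketch keeps only 24 bits.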
2. Dynamic Reconfiguration
• The operations of PEs and interconnections are defined by configuration data stored in configuration memory, as in FPGAs.
• Changing the configuration data dynamically → the data path can be switched quickly for various applications.
• How are the configuration data changed?
– High-speed delivery from the central configuration memory
– Multicontext dynamic reconfiguration → one-clock dynamic reconfiguration
Quick delivery of instructions/configuration from on-chip memory
[Figure: on-chip memory delivering configuration data to the PE/SE array]
• Delivery takes tens of microseconds
• PACT Xpp
• Panasonic (Elixent's) DFabric
Dynamic reconfiguration is done mainly for task switching
Multicontext Function
[Figure: SRAM slots 1 to n feed a multiplexer that selects the context of each PE/SE, which sits between input data and output data]
• A number of configuration memory slots are provided.
• They can be switched in one clock.
→ The hardware structure is changed in one clock
→ Hardware context switching
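The idea can be sketched in software, assuming each slot simply holds a ready-made operation and a context pointer picks one slot per clock. The class and function names are hypothetical, and real slots hold configuration bits rather than functions.

```python
import operator


class MulticontextPE:
    """One PE/SE with several configuration slots (illustrative model)."""

    def __init__(self, slots):
        # Every slot is filled in advance, so switching needs no data transfer.
        self.slots = list(slots)

    def clock(self, context_pointer, a, b):
        # The multiplexer selects one slot; the selected slot defines
        # the hardware function for this clock only.
        return self.slots[context_pointer](a, b)


def run(pe, schedule, a, b):
    """Apply one context per clock, in the order given by the schedule."""
    return [pe.clock(ptr, a, b) for ptr in schedule]


pe = MulticontextPE([operator.add, operator.sub, operator.mul])
print(run(pe, [0, 2, 1, 0], 6, 3))
```

Because all slots are resident, changing the pointer from one clock to the next costs nothing extra in this model, which is the point of the multicontext structure.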
Practical implementation of multicontext structure
[Figure: a Context Pointer addresses the Context Memory of each PE or Switching Element]
3. C-level programming
• The programming environment is a mixture of a traditional C compiler and an FPGA design tool.
• The C code is divided into data flow and control.
• The assignment of contexts, PEs, and memory modules can be done automatically.
• Place-and-route sometimes takes a long time, as in FPGA design.
• Programming is easy only if the data to be processed can be mapped onto the memory modules.
Example: DRP Compiler (NEC)
• Compiles C source code into DRP object code
– Behavioral Description Language (BDL)
• High level synthesis: generates finite state machines (FSMs) and associated datapath planes
– The ASIC behavioral design tool Cyber is modified and used.
• Mapper: maps the FSMs and datapath planes to the STC (State Transition Controller) and the PEs, respectively
• Place & Router: physically locates the PEs, memories, and the interconnections between them
Flow: C Source Code → High Level Synthesis (FSM, Datapath) → Technology Mapper → Place & Router → Code Generation → Object Code
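To illustrate the split into an FSM and datapath planes, here is a hand-written decomposition of a small loop, `for (i = 0; i < n; i++) acc += x[i] * x[i];`. This is not output of the DRP compiler; the state names, the register dictionary, and the two planes are assumptions made only for this sketch.

```python
def plane_init(regs):
    # Datapath plane 0: clear the loop counter and the accumulator.
    regs["i"] = 0
    regs["acc"] = 0


def plane_body(regs):
    # Datapath plane 1: one multiply-accumulate and one increment.
    regs["acc"] += regs["x"][regs["i"]] ** 2
    regs["i"] += 1


# Each FSM state selects one datapath plane (one context).
PLANES = {"INIT": plane_init, "BODY": plane_body}


def next_state(regs):
    # Control part: the loop condition decides the next context.
    return "BODY" if regs["i"] < len(regs["x"]) else "DONE"


def run(x):
    regs = {"x": list(x)}
    state = "INIT"
    trace = []
    while state != "DONE":
        trace.append(state)
        PLANES[state](regs)
        state = next_state(regs)
    return regs["acc"], trace


print(run([1, 2, 3]))
```

In the flow above, the role of `next_state` would fall to the state transition controller and the planes would be mapped onto PEs; the sketch only shows why the loop condition and the arithmetic can be separated.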
Dynamically Reconfigurable Processors vs. other architectures
vs. Multi-core/Many-core architectures
– No instruction fetch/cache mechanism
– Less flexible but much smaller area → 16 PEs in 1.5 mm square at 90 nm (MuCCRA-2)
vs. SIMD (Single Instruction stream, Multiple Data streams)
– The operations and interconnections can be customized for each PE and SE. → Efficient for complicated algorithms.
– The number of instructions/contexts is small
vs. VLIW (Very Long Instruction Word)
– A larger degree of parallelism can be utilized. → Higher performance can be obtained.
– The number of instructions/contexts is small
MuCCRA-2 Floor Plan
• ASPLA's 90 nm process
• 2.5 mm × 2.5 mm die (core: 1.5 mm × 1.5 mm)
The total PE array < one PE of recent multi/many-core processors
Granularity vs. Num. of Cores vs. Num. of HW-contexts
[Figure: design-space chart. Axes: number of HW-contexts (1, 3, 8, 16, many), number of cores (10, 100, 1000), and granularity of core (4 bit, 8 bit, 16 bit, 32 bit). Labels: Common Processor, VLIW, Multi-Core processor, FPGA, FPGA extension, and Dynamically Reconfigurable Processors (DAPDNA-2, DRP, DRL, CS2112, Xbridge, FE-GA, DFabric, Xpp).]
Main Advantage: Low power consumption
Why low power?
1. No redundant hardware
– There are no instruction fetch mechanisms, caches, TLBs, etc. → Of course, it cannot be a general purpose engine, but it is enough for an accelerator.
– A bare datapath works only for computation.
2. Parallel execution with a number of PEs
– A much lower clock frequency can be used to achieve the same performance as other architectures.
– The main problem is leakage power, but it can be suppressed by power gating techniques.
About 10X more energy efficient than DSPs, and 5-50X more than FPGAs. Sometimes similar to hardwired logic.
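A rough back-of-the-envelope sketch of the second point, using the standard CMOS dynamic power estimate P = C × Vdd² × f. All numbers are illustrative and are not measurements from the talk. The sketch also assumes that the lower clock permits a lower supply voltage, which the slide does not state, and that every PE completes one operation per cycle.

```python
def required_clock_hz(ops_per_second, parallel_pes):
    """Clock needed if every PE completes one operation per cycle."""
    return ops_per_second / parallel_pes


def dynamic_power_w(switched_cap_f, vdd_v, clock_hz):
    """Standard CMOS dynamic power estimate: P = C * Vdd^2 * f."""
    return switched_cap_f * vdd_v ** 2 * clock_hz


# Illustrative numbers only.
TARGET_OPS = 1.6e9  # operations per second to sustain

# One fast unit: 1 nF switched capacitance at 1.2 V.
single = dynamic_power_w(1e-9, 1.2, required_clock_hz(TARGET_OPS, 1))

# Sixteen PEs: 16x the capacitance, 1/16 of the clock, 0.8 V supply.
array = dynamic_power_w(16e-9, 0.8, required_clock_hz(TARGET_OPS, 16))

print(required_clock_hz(TARGET_OPS, 16))  # clock of the 16-PE array in Hz
print(array / single)                     # power ratio, array vs. single
```

In this model the capacitance and frequency factors cancel, so the saving comes entirely from the reduced supply voltage; with an unchanged voltage the two estimates would be equal. Leakage is ignored here.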
[Figure: bar chart of energy consumption (nJ, axis from 0 to 7000) for DCT, Viterbi, and SHA-1 on MuCCRA, FPGA, and ASIC]
The comparison uses 0.18 µm implementations.
The main limitations as an accelerator in SoCs
• The data must be stored in the memory modules placed around the PE array.
– If the data do not fit in the memory, they are hard to handle.
• If more contexts are required than the context memory can hold, the operational speed is much degraded.
– The virtual hardware mechanism is provided, but it has certain limitations.
• The performance does not improve much for problems without parallelism.
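A small sketch of why exceeding the context memory hurts. It counts how often a context must be loaded from outside the context memory, assuming a hypothetical least-recently-used replacement policy; the talk does not describe how the actual virtual hardware mechanism chooses what to replace.

```python
from collections import OrderedDict


def count_context_loads(schedule, slots):
    """Count context loads for a schedule of context IDs.

    `slots` is the number of contexts the context memory can hold.
    The LRU replacement policy is an assumption made for this sketch.
    """
    resident = OrderedDict()
    loads = 0
    for ctx in schedule:
        if ctx in resident:
            resident.move_to_end(ctx)  # hit: switch without loading
        else:
            loads += 1                 # miss: configuration must be loaded
            if len(resident) == slots:
                resident.popitem(last=False)  # evict least recently used
            resident[ctx] = True
    return loads


schedule = [0, 1, 2, 3] * 3  # a task cycling through four contexts
print(count_context_loads(schedule, slots=4))
print(count_context_loads(schedule, slots=3))
```

With four slots each context is loaded once. With three slots, every switch in this cyclic schedule becomes a load, so the one-clock switching advantage is lost.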
Dynamically Reconfigurable Processors: The 2nd generation
• Customized for a specific target application area
– SANYO car tuner → Tuner
– Fujitsu → Wireless communication
– Toshiba SAKE → Multi-media
– NEC electronics X-bridge → Multi-media
• Multi-core structure with small PE arrays rather than a big array
– Cooperation with various types of cores
• Integrated design environment
• Low power design → The main advantage!
X-bridge: NEC electronics (2008)
[Figure: X-bridge block diagram. Blocks: MIPS CPU (I-C, D-C, JTAG), UART × 2, CSIG, PIO, INTC, General Port (8b × 4), Work RAM (1 kB), Periph I/F, 10/100 Ether MAC, PCI Host/Target, PCIexp HB/EP (1-lane) × 2, DDR2 SDRAM CTR, DMA, SPL blocks, Nconnect, and the STP Engine, connected by a 64-bit on-chip bus (266 MHz) and a 64-bit memory switch (266 MHz).]
STP Engine: dynamically reconfigurable core, 256 PEs (8 bit), 32 contexts
• Provides the virtual hardware mechanism
• The DMA controller hides the communication overhead
From the invited talk at Design Gaia 2008
Mixture of SIMD and DRP units: Toshiba's FlexSword
[Figure: "Our Architecture". Blocks: Host Processor, System Memory, Host I/F, I/O Buffer (Data RAM), Code Buffer (Code RAM), Write Control, Formatter0, Formatter1, AUX0, AUX1, Inter-Unit Buffer (Data Registers), SIMD Units, and Dynamically Reconfigurable Units (independently controlled), with separate code and data paths.]
Optimized for stream processing
From FPT2007 Tutorial session
The Architecture (Formatter)
[Figure: Formatter datapath. Blocks: data A, data B, Cfg Mem, Shuffle, 16-bit ALU × 8 PE, Xbar In, Xbar Out, Cfg Controller, Code Mem, valid and ID signals, PE, and PE w/o Shuffle; bus widths of 128, 128, and 64 are marked.]
Simple hardware
• Pipeline registers only
• No intra-PE data transfer
• PE: 4 cfgs, Xbar: 16 cfgs
• ALU, shift & absolute ops only
Xbar In: Formatter0 only. Xbar Out: Formatter1 only.
Suitable for butterfly operations
From FPT2007 Tutorial session
SANYO's Car tuner DRP
[Figure: an ALU array of 4 rows × 6 ALUs (labels L1 to L4) with main memory (In, Out, Feedback), a sequencer, and command memory; a timing chart shows threads Th1 to Th4 (Th1-1, Th1-2, … Th2-7) interleaved across L1 to L4.]
Pipelined execution of 4 threads
Fine carrier frequency offset estimation/correction
[Figure: three mappings onto Cluster0 to Cluster6 with I/Q inputs and I/Q outputs to the FFT. Labels: self-correlation, phase offset calculation, DIV, ATAN, Reg in Cluster0, correction offset calculation in phase, polar, complex multiply (Cluster2), data out control & clip (Cluster3), Cluster6 (through).]
a) Fine carrier frequency offset estimation for LT1
b) Fine carrier frequency offset estimation for LT2
c) Fine carrier frequency offset correction for SIGNAL and DATA
Hitachi's FE-GA
[Figure: FE-GA block diagram. A Computational Cell Array (24 ALU cells and 8 MLT cells as drawn), a Crossbar Network, Load/Store Cells (10 LS), Local Memory (10 MEM), a Configuration Manager, a Sequence Manager, a Bus Interface, an I/O port, and Interrupt/DMA request signals.]
Heterogeneous Multi-Core using FE-GA
[Figure: CPU0 to CPU3 (SH-4) and DRP0 to DRP3 (FE-GA), each shown with LPM, FVR, LDM, DSM, DTU, and a Network Interface, together with an On-Chip CSM and its Network Interface.]
The codes are generated by a parallelizing compiler and standard APIs.
Summary
• The 2nd generation Dynamically Reconfigurable Processors are going to be embedded into consumer electronics products.
• The main advantage is low power consumption.
• The main limitation is data memory → limited to a kind of stream computing.
• Especially active in Japan
– Major Japanese consumer electronics companies are all trying to develop such systems.
Yes. Japanese culture loves dynamic reconfiguration!
Thank you! A part of our own project will be presented in the later sessions.
PE architecture
• Simple structure
• Can execute up to 4 instructions in parallel
[Figure: a cell with an Input Switch and an Output Switch connected to and from the upper, lower, left, and right cells. Labels: ALU (Arithmetic, Logical, Flow Control), SFT (Shift), THR (Data Control), Delay Adjustment, Transfer Register (TREG), Configuration Register (×4), 1-bit x 48-bit x 4 w/ valid bit, control bus.]
DRP Programming
[Figure: contexts 0 to 5 between data input and data output]
1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context
• 3-dimensional flexibility
• The functional optimizer works efficiently.
• Efficient C-level programming
The context is controlled with a state machine.
Time multiplexed execution
• Area becomes 1/n, but performance also becomes 1/n.
[Figure: target hardware mapped onto smaller real hardware]
• A single task can be executed with multiple contexts.
→ Area efficiency is improved!
[Figure: target hardware vs. real hardware under time multiplexed execution]
Most of the hardware works only partially.
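The area/performance trade-off can be sketched with a toy model, assuming the target hardware is a pipeline of n stages and the real hardware is a single unit that runs one stage (one context) per cycle. The function names and cycle counting are illustrative, not taken from any of the processors above.

```python
def run_spatial(stages, inputs):
    """Target hardware: every stage exists physically as a pipeline."""
    results = []
    for x in inputs:
        for stage in stages:
            x = stage(x)
        results.append(x)
    # After the pipeline fills, one result leaves per cycle.
    cycles = len(inputs) + len(stages) - 1 if inputs else 0
    return results, cycles, len(stages)  # len(stages) hardware units


def run_time_multiplexed(stages, inputs):
    """Real hardware: one unit, one context (stage) per cycle."""
    results = []
    cycles = 0
    for x in inputs:
        for stage in stages:  # the context switches every cycle
            x = stage(x)
            cycles += 1
        results.append(x)
    return results, cycles, 1  # a single hardware unit


stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(run_spatial(stages, [0, 1, 2, 3]))
print(run_time_multiplexed(stages, [0, 1, 2, 3]))
```

Both versions produce the same results. The time multiplexed one uses one unit instead of three and needs about three times as many cycles for long input streams, which matches the 1/n area and 1/n performance statement.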
A wide research field of reconfigurable architectures
• Two major extremes of multiple-core architectures:
– FPGAs
  • Fine-grained multiple-core architectures with a huge number of cores
  • Basically static: 1 hardware context
– Many-core processors
  • Very coarse-grained multiple-core architectures
  • Fully programmable: infinite number of hardware contexts
[Figure label: WIDE RESEARCH FIELD between the two extremes]
Our environment for architectural exploration
MuCCRA array design environment [FPL07]
[Figure: design flow. Elements: Architecture parameters, Template Library, DRPA Verilog-HDL Generator, Verilog HDL description, Logic Synthesis (Synopsys Design Compiler), CMOS standard cell library, Netlist, Placement and Routing (Synopsys Astro), Netlist, GDSII, Timing Analysis (Synopsys PrimeTime), RTL/Net/Chip simulation (Cadence NC-Verilog), Test Bench and Test Vector, Application Programs, Retargetable Compiler Black Diamond.]
Extremely Low Power Design
• Now the major benefit of Dynamically Reconfigurable Processors
– 1/8 to 1/10 compared with a DSP [ASSCC07]
– The main reason why SONY uses VME (Virtual Mobile Engine) in the PSP (PlayStation Portable) and X-bridge in professional video systems
• Applying traditional techniques / reducing the overhead of context switching [FPL08]
– Operand isolation is quite effective
• Context oriented voltage control [Schweizer:FPT07]
• Fine-grained power gating [FPT08 Poster]
• Dual Vth
Network on Chips for reconfigurable systems
• For inner-core connection
– Island style/direct interconnection
– A new style of interconnection?
• For inter-core connection
– A network similar to those of many-core systems may be used?
• Three dimensional/Wireless
– A new possibility
3 Dimensional wireless connected dies: MuCCRA-Cube
• Each plane corresponds to an array like MuCCRA-2 (4 × 4 PEs).
• 4 planes are connected with very high speed inductive wireless interconnection (3 Gbit/sec per channel).
• Planes are connected in the flipped direction.
• 16 channels are provided in the 3-D direction.
[Figure: direction of planes and channels]
MuCCRA-Cube Prototype
• STARC/ASPLA 90 nm
• 2.5 mm × 5 mm die
• Verilog-HDL is used for design
• Synthesis: Synopsys Design Compiler 2006.06-SP2
• Place & Route: Synopsys Astro 2007.03-SP3
• Simulation: Cadence Verilog-XL 5.7
[Figure: chip layout with DATAMEM, TCC, PE/SE, CSC, Transceiver (Data), and Transceiver (CLK)]
Summary
• There is a wide field for architectural exploration between FPGAs and many-core processors
• Keywords
– Application configurable
– Low power techniques
– Interconnection networks including three dimensional/wireless
– Integrated design environment
IMEC ADRES
[Figure: an array of function units with register files (FU/RF), shown as a Reconfigurable Array View and a VLIW View. The VLIW view is labeled with a row of FUs, an RF, Instruction Fetch, Instruction Dispatch, Instruction Decode, and Data Cache.]
Rapport Kilocore
[Figure: stripes of PEs (PE PE PE …) separated by interconnect, with a Configuration Controller, an Input Controller, and an Output Controller. Fabric: 16 PEs × 16 PEs. Bus widths marked: 128 bits, 128 bits, 672 bits, 32 bits.]
Stretch S5 engine
[Figure: block diagram with MMU, Inst Cache, Data Cache, Inst Unit, and Load/Store Unit; FR and FPU, AR and ALU, and WR and ISEF blocks, labeled FP Unit, Integer Unit, and Extension Unit.]
Implementation
[Screen shot of the Context Menu]
[Screen shot of the Code Menu: User Function, Library Function]
[Screen shot of the Pointer: Input, Output]
[Screen shot of MuCCRA-1: Memory Modules, Multiply Modules, Switching Element, PE]
[Screen shot of MuCCRA-2: PE, Switching Element, Memory Modules]
DRP Tile structure
[Figure: an array of PEs with a State Transition Controller, HMEM blocks (8 as drawn), VMEM blocks (16 as drawn), and VMEM controllers (VMEM ctrl)]
• HMEM: 1 port, 8 bit × 8K entries
• VMEM: 2 port, 8 bit × 256 entries
Task and context control in MuCCRA [FPL08]
• Context control
– Multicontext switching with a Context Pointer
• Task control
– Multiple tasks, each consisting of multiple contexts, are loaded from the centralized memory
– A virtual hardware mechanism
[Figure: target tasks A to D are held in the Configuration Data Memory. The TCC (Task Configuration Controller) and the CSC (Context Switching Controller) supply configuration data (contexts), the Context Pointer, and control signals to the MuCCRA PE array; each PE (SMU, ALU, RFile) has a Context Memory with entries 0, 1, 2, 3, … 63.]