The last lesson: Recent Embedded Architectures
Hideharu Amano
Embedded processors
• Cost/power-centric; performance is tuned for a specific application
• RISC processors
– ARM (ARM)
– MIPS (MIPS)
– SH (Hitachi/Renesas)
• Shrunk (compressed) instruction sets are provided
• Clock frequency ranges from 60 MHz to 800 MHz, depending on the application
→ This performance was sufficient until the 1990s
MOPS (Million Operations Per Second) for various embedded applications
[Figure: chart on a logarithmic MOPS axis (10, 100, 1000, 10000), grouped by field.
Image processing: MPEG2/4 decoding, MPEG2/4 encoding, JPEG encoding/decoding.
Voice/music: MP3 encoding/decoding, Dolby encoding/decoding, 100K-word recognition, 5K-sentence translation.
Graphics: 3-dimensional image generation, 2-dimensional image generation.
Communication: VoIP modem, CDMA modem.]
→ The performance of a simple RISC processor is not enough
Performance enhancement techniques for CPU
• Instruction-level parallel processing
– Superscalar
– VLIW (Very Long Instruction Word)
– SMT (Simultaneous Multithreading)
• Dynamic scheduling of instructions
• Using a high clock frequency
• Sophisticated branch prediction
• Thread-level parallel processing
– MIMD (Multiple Instruction streams, Multiple Data streams)
– SIMD (Single Instruction stream, Multiple Data streams)
– Chip multiprocessors
→ Efficient for every CPU and, of course, useful for embedded CPUs
→ But they increase cost and power consumption
Embedded CPU + Hardware Accelerator
[Figure: an embedded CPU, a hardware accelerator, RAM, and I/O modules connected by an on-chip bus or on-chip network.]
• A hardware accelerator is suitable for achieving high performance in a specific application
• Various types of architectures exist for embedded processing
Amdahl’s Law
• Total speedup:
\[
\text{Total SpeedUp} = \frac{1}{(1 - \text{ratio of acceleration}) + \dfrac{\text{ratio of acceleration}}{\text{SpeedUp of acceleration}}}
\]
• Example: 100 times acceleration.
• If the ratio of acceleration is 50%, the total speedup is only 1/(0.5 + 0.5/100) ≈ 1.98 times; it can never exceed 2.
• Fortunately, the ratio is large in media processing.
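The formula can be checked numerically. The following is a minimal sketch; the function name `total_speedup` and the sample ratios are illustrative choices, not part of the lecture.

```python
def total_speedup(ratio: float, accel: float) -> float:
    """Amdahl's law.

    ratio: fraction of the original execution time that can be accelerated (0..1)
    accel: speedup of the accelerated part (> 0)
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if accel <= 0.0:
        raise ValueError("accel must be positive")
    # The non-accelerated part keeps its time; the accelerated part shrinks by `accel`.
    return 1.0 / ((1.0 - ratio) + ratio / accel)


if __name__ == "__main__":
    # A 100x accelerator applied to different fractions of the work.
    for ratio in (0.5, 0.8, 0.95):
        print(f"ratio={ratio:.2f}  total speedup={total_speedup(ratio, 100.0):.2f}")
```

The sweep shows why the ratio matters more than the raw accelerator speed: the non-accelerated part quickly dominates the total time.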
Various embedded architectures
[Figure: a spectrum of architectures, ranging from "high performance for a wide application field" to "high performance for a narrow application field".
Processors: general-purpose CPU, DSP, configurable processor, tile processor, homogeneous chip multiprocessor, heterogeneous multiprocessor.
Branch labels from the general-purpose CPU: special instructions, multiple cores.
Special-purpose processors: stream processor, graphics processor, network processor.
Programmable hardware: dynamically reconfigurable processors, FPGA and reconfigurable systems.
Dedicated hardware.]
Hardware/Software Co-design
[Figure: design flow. Specification analysis → system spec. → hardware/software division → hardware spec. and software spec. → hardware functional synthesis, interface generation, and program generation → hardware design, interface design, and program → co-verification → system design.]
• High-level design cost can be reduced.
• Recently, low-level design cost has been increasing.
Configurable Processor/Integrated Platform
• Configurable processor
– Hardware accelerators and special-purpose processors can be combined as special instructions.
  • ARC (ARC)
  • Xtensa (Tensilica)
  • MeP (Toshiba)
  • Triton (Poseidon Design Systems)
– Various types of interconnection are possible.
– Integrated software environment
• Integrated platform → Standard components
– UniPhier (Matsushita)
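Configurable Processor: MeP
[Figure: MeP block diagram. A 32-bit processor core with configuration options (optional instructions, memory size, interrupt, debugging, ...) and MeP core extensions (extended instructions: UCI, DSP, VLIW, ...), together with a hardware engine, attached through a bus interface to a local bus. Modules MM1, MM2, ..., MMn are connected by a global bus.]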
Multi-Core/Multiprocessor
• Heterogeneous processors
– Special-purpose processors for each application
– High performance/cost
– Different programming for different processors → Complicated bugs!
• Homogeneous processors
– Multiple general-purpose processors
– Programming environments for servers can be introduced.
  • Parallel OS, parallel compilers
– Dynamic voltage control/dynamic frequency control → Necessary performance with optimized power.
• Each processor executes its own task ⇔ Different from tile processors
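Why dynamic voltage/frequency control saves power can be sketched with the usual CMOS dynamic-power approximation P ≈ C·V²·f. This is a rough model only; the capacitance, voltages, and frequencies below are hypothetical values, not figures for any processor in this lesson.

```python
def dynamic_power(c_eff: float, voltage: float, freq_hz: float) -> float:
    """Approximate dynamic power of CMOS logic: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * freq_hz


def energy_for_work(c_eff: float, voltage: float, freq_hz: float, cycles: float) -> float:
    """Energy to execute a fixed number of cycles: power * (cycles / f)."""
    return dynamic_power(c_eff, voltage, freq_hz) * (cycles / freq_hz)


# Hypothetical operating points (effective capacitance, supply voltage, clock).
FAST = (1e-9, 1.2, 400e6)
SLOW = (1e-9, 0.9, 200e6)

if __name__ == "__main__":
    power_ratio = dynamic_power(*SLOW) / dynamic_power(*FAST)
    energy_ratio = energy_for_work(*SLOW, cycles=1e6) / energy_for_work(*FAST, cycles=1e6)
    print(f"power at the slow point:  {power_ratio:.3f} of the fast point")
    print(f"energy for the same work: {energy_ratio:.3f} of the fast point")
```

Halving the frequency alone halves the power but not the energy per task; the energy saving in this model comes from the lower voltage that the lower frequency permits.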
NEC MP211
[Figure: block diagram. Three ARM926 cores (PE0 to PE2) and an SPX-K602 DSP; accelerators (3D, rotator, image, security); DMAC, USB OTG, camera/DTV interface, and LCD interface; on-chip SRAM (640KB), instruction RAM, SDRAM controller, SRAM interface, bus interface, and scheduler; two asynchronous bridges and two APB bridges to the peripherals (IIC, UART, TIM0 to TIM3, WDT, memory card, PCM, PMU, INTC, GPIO, SIO, SMU, uWIRE, PLL, OSC); external camera, LCD, FLASH, and DDR SDRAM.]
Cell (IBM/SONY/Toshiba)
[Figure: block diagram. PPE: an IBM Power CPU core (PXU) with L1 cache (32KB+32KB) and L2 cache (512KB). Eight SPEs (Synergistic Processing Elements, SIMD cores), each with an SXU, a local store (LS; labeled "512KB Local Store" on the slide), and a DMA unit. EIB: 2+2 ring bus connecting the PPE, the SPEs, the MIC (to external DRAM), and the BIC (to Flex I/O).]
NUMA machines which share a single address
space
MPCore (ARM+NEC)
[Figure: block diagram. Four CPU/VFP cores, each with L1 memory and its own CPU interface, timer, and watchdog. An interrupt distributor delivers IRQs to the cores, alongside private FIQ lines. The Snoop Control Unit (SCU) holds duplicated L1 tags and links the cores through a coherence control bus; a private peripheral bus and a private 64-bit AXI read/write bus lead to the L2 cache.]
Tile Processor/Processor Array
• Each PE has its own PC and fetches instructions from its own instruction memory. → Falls into NORMA machines
• However, it is close to the dynamically reconfigurable processors shown later.
– A single task is executed with all PEs ⇔ Multiprocessors
– Heterogeneous PEs
– A lot of homogeneous PEs
– The program is embedded.
– Simple interconnection network
– The concept of context switching
– The target is image processing and media processing.
– MIT RAW
– Quicksilver’s ACM
– MorphTech’s rDSP
– PicoChip’s PC101
MIT’s RAW
[Figure: one RAW tile. A computing processor (8 stages, 32-bit, single issue, in order) with a 4-stage pipelined FPU, a 96KB I-cache, and a 32KB D-cache, plus a communication processor with eight 32-bit channels.]
• On-chip NORMA system for embedded applications
ACM (Quicksilver)
[Figure: adaptive nodes, domain nodes, and programmable nodes grouped hierarchically into level-1, level-2, and level-3 clusters by a matrix interconnect network.]
Dynamically Reconfigurable Processors
• Reconfigurable systems → Previous lesson
– Flexible, but dynamic reconfiguration takes tens of milliseconds.
• Dynamically reconfigurable processors
– Improve area efficiency by changing the hardware structure.
– Used as IPs in various SoCs.
– History
  • Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
  • Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
  • Functional-level synthesis
– Various commercial products have been available since 2000
  • IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp, Elixent D-Fabrix
– SONY’s VME (Virtual Mobile Engine) is embedded in the Network Walkman and PSP
– Recently, many Japanese vendors have started to develop commercial products
  • Fujitsu
  • Hitachi
  • Lucent
  • Sanyo
  • Toshiba (MeP + D-Fabrix)
Processing Element
• Specialized for media/stream processing
• Coarse grain ⇔ Fine grain: LUT of FPGAs
• Components
– ALU
– Shifter + mask unit
– Multiplexers
– Registers
• Operations and interconnections between components are changeable
• No instruction fetch mechanism: each PE is a part of a large datapath
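Chameleon CS2112 DPU
[Figure: DPU datapath. Two routing MUXes select the inputs, each followed by a register & mask unit; a barrel shifter and an OP unit controlled by an instruction follow, with registers at the output. Data widths: 32 bit and 16 bit.]
• OP: operations described in C or Verilog
• SIMD arrays and pipelines are formed with multiple DPUs.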
Dynamic reconfiguration
• Compared with FPGAs, a coarse-grain PE is area-efficient for media/stream processing.
→ However, the flexible part requires semiconductor area: not comparable with hardware accelerators
• But it is flexible!
→ Dynamic reconfiguration
By changing the hardware structure, the same semiconductor area can be used for multiple tasks.
Instructions/Configuration data delivery
[Figure: configuration data is delivered from on-chip memory to the PEs.]
• Takes tens of microseconds
• PACT Xpp
• Elixent’s D-Fabrix
• Multiple tasks can be switched → High area efficiency
Xpp (PACT Informationstechnologie)
[Figure: an array of PACs surrounded by I/O blocks; each PAC has a CM, and one SCM supervises the CMs. Inside a PAC: PAEs (RAM, ALU, I/O) and a configuration manager.]
PAC: Processing Array Cluster, CM: Configuration Manager, SCM: Supervising CM
• Xpp64 (8x8 PAC) is available
• Configuration requires hundreds of clock cycles
• 24-bit data, 40 MHz clock
Elixent D-Fabrix
[Figure: D-Fabrix processing array. A regular array of 4-bit ALUs and registers (R) interleaved with RAM-based switch boxes, plus RAM blocks with 8-bit address and 8-bit data.]
Stretch S5 engine
[Figure: block diagram. MMU, instruction cache, data cache, instruction unit, and load/store unit; an FP unit (FR, FPU), an integer unit (AR, ALU), and an extension unit (WR, ISEF).]
Multicontext reconfiguration
[Figure: n SRAM slots (contexts 1, 2, ..., n) feed a multiplexer that selects the configuration of the logic cells between input data and output data. A context pointer selects which entry of the context memory drives the PEs and switches.]
• Multiple sets of configuration data can be switched within a clock cycle.
• Context memory is integrated into the PEs/switches
• Fujitsu’s MPLD using ROMs (1990), WASMII using RAM (1992), Xilinx’s proposal (1997), NEC’s DRL (1998), Chameleon CS2112 (2000)
Double buffering using multicontext devices
• Tasks are switched without overhead
[Figure: timeline. While Task N executes, the configuration data for Task N+1 is loaded in the background; while Task N+1 executes, the configuration data for Task N+2 is loaded.]
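The benefit of overlapping configuration loading with execution can be sketched with a simple timing model. This is a minimal sketch under stated assumptions: each task is a hypothetical (load_time, exec_time) pair in arbitrary units, only one configuration can be loaded at a time, and the very first load cannot be hidden.

```python
from typing import List, Tuple

Task = Tuple[int, int]  # (configuration load time, execution time), arbitrary units


def serial_time(tasks: List[Task]) -> int:
    """Single-context device: every task is loaded, then executed."""
    return sum(load + run for load, run in tasks)


def double_buffered_time(tasks: List[Task]) -> int:
    """Multicontext device: the next task is loaded while the current one runs."""
    if not tasks:
        return 0
    total = tasks[0][0]  # the first configuration load is exposed
    for i, (_, run) in enumerate(tasks):
        next_load = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        # The next task can start only when both the current execution
        # and its own background load have finished.
        total += max(run, next_load)
    return total


if __name__ == "__main__":
    tasks = [(10, 40), (10, 30), (10, 50)]  # hypothetical workload
    print("serial:         ", serial_time(tasks))
    print("double-buffered:", double_buffered_time(tasks))
```

As long as each load is shorter than the execution it overlaps, only the first load remains visible, which is the "without overhead" case on the slide.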
IPFlex’s DAPDNA-2
[Figure: block diagram. DAP (RISC core) with DMA controller, interrupt controller, timer, SROM interface, GPIO, UART, and serial interface; DDR SDRAM interface (64 bit, 166 MHz) and PCI interface (32 bit, 66 MHz); BSU; DNA matrix with DNA load buffer, DNA store buffer, and DNA direct I/O (asynchronous in and out).]
• Heterogeneous: 368 PEs (ALU, memory, delay)
Time-multiplexing execution of a single task
[Figure: the target hardware is divided and mapped onto a smaller reconfigurable device in successive time steps.]
• If the performance simply becomes 1/n, the performance/area is not increased.
• Even dedicated hardware cannot do everything in a single clock cycle. In this example, it takes 4 clock cycles.
• The dynamically reconfigurable processor requires 8 clock cycles → The performance/area is improved.
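The performance/area argument can be made concrete with a small calculation. The cycle counts (4 and 8) come from the slide; the relative areas are hypothetical, assuming the reconfigurable array needs only a quarter of the dedicated hardware's area because it holds one part of the design at a time.

```python
def perf_per_area(cycles: float, area: float) -> float:
    """Performance (1 / cycles) divided by area."""
    return 1.0 / (cycles * area)


# Cycle counts from the slide; areas are assumed, in units of the dedicated design.
DEDICATED = {"cycles": 4, "area": 1.0}
RECONFIGURABLE = {"cycles": 8, "area": 0.25}

gain = perf_per_area(**RECONFIGURABLE) / perf_per_area(**DEDICATED)

if __name__ == "__main__":
    print(f"performance/area gain of the reconfigurable device: {gain:.1f}x")
```

Under this assumption the device is half as fast but four times smaller, so performance/area doubles. If the slowdown were as large as the area reduction, the gain would vanish, which is the point of the first bullet.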
NEC Electronics’ DRP (Dynamically Reconfigurable Processor)
• Multicontext reconfiguration
– 16 contexts
– Controlled by an FSM (Finite State Machine)
– Background loading of configuration data
• 8x8 PEs + distributed memory modules → A tile
• DRP-1 consists of 8 tiles → 512 PEs
• 8-bit data width
• State transition/configuration is controlled within each tile.
• A single task is executed with multiple contexts.
DRP-1
[Figure: chip layout showing the tiles with their VMEM and HMEM blocks.]
DRP Tile Structure
[Figure: an 8x8 array of PEs with a state transition controller, 16 VMEMs with their VMEM controllers, and 8 HMEMs.]
• VMEM (2-port memory): 8 bit × 256 entries
• HMEM (1-port memory): 8 bit × 8192 entries
Context control with an FSM
[Figure: state transition diagram with states 0 to 5 between data input and data output.]
1. Context switching
2. Parallel processing in a context
3. Sequential execution in a context
• The DRP compiler automatically generates the diagram from a C-like language: BDL.
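The idea of an FSM selecting one context per step can be sketched in software. This is only an illustration of the control scheme, not the DRP hardware or its tools: the three contexts, the `regs` dictionary standing in for datapath registers, and the `HALT` marker are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

HALT = -1  # hypothetical marker: no further context


def ctx_init(regs: dict) -> int:
    """Context 0: clear the loop counter and the accumulator."""
    regs["i"] = 0
    regs["acc"] = 0
    return 1


def ctx_accumulate(regs: dict) -> int:
    """Context 1: add one input value; stay here until all inputs are consumed."""
    regs["acc"] += regs["data"][regs["i"]]
    regs["i"] += 1
    return 1 if regs["i"] < len(regs["data"]) else 2


def ctx_output(regs: dict) -> int:
    """Context 2: publish the result and stop."""
    regs["out"] = regs["acc"]
    return HALT


# Each context configures the "datapath" and tells the controller which context is next.
CONTEXTS: Dict[int, Callable[[dict], int]] = {0: ctx_init, 1: ctx_accumulate, 2: ctx_output}


def run(data: List[int], max_steps: int = 1000) -> Tuple[int, List[int]]:
    """Run the context sequence on non-empty input data; return (result, context trace)."""
    regs = {"data": list(data)}
    trace: List[int] = []
    context = 0
    while context != HALT:
        if len(trace) >= max_steps:
            raise RuntimeError("context sequence did not terminate")
        trace.append(context)
        context = CONTEXTS[context](regs)
    return regs["out"], trace


if __name__ == "__main__":
    result, trace = run([1, 2, 3, 4])
    print("result:", result)
    print("context trace:", trace)
```

The trace shows both behaviors from the slide: switching between contexts, and sequential execution that stays in one context for several steps.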
IMEC ADRES
[Figure: an array of functional units (FU), most of them paired with a local register file (RF). In the VLIW view, a row of FUs shares a register file and is served by instruction fetch, instruction dispatch, instruction decode, and a data cache. In the reconfigurable array view, the whole FU/RF array is used.]
Rapport Kilocore
[Figure: a fabric of 16 × 16 PEs organized in stripes (rows of PEs joined by interconnect), with a configuration controller, an input controller, and an output controller. Bus widths shown: 128 bits, 128 bits, 672 bits, and 32 bits.]
Hitachi’s FE-GA
[Figure: block diagram. A computational cell array of 24 ALU cells and 8 MLT cells, connected through a crossbar switch to 10 load/store (LS) cells and 10 local memories (MEM); a configuration manager, a sequence manager, a bus interface, interrupt/DMA request lines, and an I/O port.]
Product          Vendor           Configuration      Data width (bits)  PE type
Xpp-64           PACT             Delivery           24                 Homogeneous
D-Fabrix         Elixent          Delivery           4                  Homogeneous
S5 engine        Stretch          Delivery           4/8                Heterogeneous
PCA-2            NTT              Delivery           9                  Homogeneous
CS2112           Chameleon        Multicontext (8)   16/32              Heterogeneous
DAPDNA-2         IPFlex           Multicontext (4)   32                 Heterogeneous
DRP-1            NEC Electronics  Multicontext (16)  8                  Homogeneous
Kilocore         Rapport          Multicontext       8                  Homogeneous
ADRES            IMEC             Multicontext (32)  16                 Homogeneous
FE-GA            Hitachi          Multicontext (4)   16                 Heterogeneous
Cluster machine  Fujitsu          Multicontext       16                 Heterogeneous
Dynamically Reconfigurable Processors
[Figure: design-space chart. Labels: number of nodes (10 to 10M, logarithmic), gate count per node (10, 100, 1000; from 4- or 5-input LUTs through 8-bit ALU/registers to 32-bit ALU/registers), degree of time multiplexing (1, 3, 8, 16, many), and cost. Plotted: simple RISC, superscalar, VLIW, chip multiprocessor, FPGA, ACM, DAPDNA-2, DRP-1, Kilocore, DRL, CS2112, rDSP, PC101, PARS, and ADRES, with the dynamically reconfigurable processors marked as one region.]
C-level design (DRP)
• Behavioral Description Language (BDL): C-like
– Bit width, pragma
– Pointer use is limited.
• Functional synthesis: the FSM and the datapath are generated.
– Synthesis tools for ASICs can be used.
• Mapping: FSM → STC, datapath → PE array
• Place & route
• Configuration data generation
[Figure: tool flow. C source code → high-level synthesis → FSM and datapath → technology mapper → place & router → code generation → object code.]
BDL code example

mem(0:16) d0[8], d1[8], d2[8], d3[8], d4[8], d5[8], d6[8], d7[8];
void row() {
    ter(0:16) SUMT0, SUMT1, SUMT2, SUMT3;
    reg(0:16) SUB0, SUB1, SUB2, SUB3;
    ter(0:16) z0, z1, z2, z3, z4, z5, z6, z7;
    reg(0:8) i=0;
    $
    for(; i < 8; i++) {
        d0[i], d1[i], d2[i], d3[i], d4[i], d5[i], d6[i], d7[i];
        $
        SUMT0 = d0[i] + d7[i];  SUB0 = d0[i] - d7[i];
        SUMT1 = d1[i] + d6[i];  SUB1 = d1[i] - d6[i];
        . . . . .
        z0 = A * SUMT0 + A * SUMT1 + A * SUMT2 + A * SUMT3;
        z2 = B * SUMT0 + C * SUMT1 - C * SUMT2 - B * SUMT3;
        . . . . .
        $
        z1 = D * SUB0 + E * SUB1 + F * SUB2 + G * SUB3;
        z3 = E * SUB0 - G * SUB1 - D * SUB2 - F * SUB3;
        . . . . .
        $
    }
}

• mem(0:16): 16-bit memory, allocated to VMEM
• ter, reg: terminals and registers
• $: delimiter for the state/context
• The statement "d0[i], d1[i], ..., d7[i];" is a memory access that gives the address
• Terminals must be used in the state/context in which they are assigned
• Registers can be used in the following states/contexts
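For comparison, the visible part of the BDL row computation can be written as a plain software reference. This is a sketch under loudly stated assumptions: the elided SUMT2, SUMT3, SUB2, and SUB3 assignments are assumed to follow the same pattern as the two shown pairs (d2 with d5, d3 with d4), z1 is assumed to be a sum of four products, and the coefficients A to G are passed in because the slide does not give their values. Only z0 to z3 are computed, since the expressions for z4 to z7 are not shown.

```python
from typing import Sequence, Tuple


def row_reference(d: Sequence[int], A: int, B: int, C: int, D: int,
                  E: int, F: int, G: int) -> Tuple[int, int, int, int]:
    """Software reference for one loop iteration; d holds d0[i] .. d7[i]."""
    if len(d) != 8:
        raise ValueError("expected 8 input values")
    # First state: butterfly sums and differences (pairing assumed for k = 2, 3).
    sumt = [d[k] + d[7 - k] for k in range(4)]
    sub = [d[k] - d[7 - k] for k in range(4)]
    # Second state: outputs built from the sums (terminals in the BDL code).
    z0 = A * sumt[0] + A * sumt[1] + A * sumt[2] + A * sumt[3]
    z2 = B * sumt[0] + C * sumt[1] - C * sumt[2] - B * sumt[3]
    # Third state: outputs built from the differences (held in registers across states).
    z1 = D * sub[0] + E * sub[1] + F * sub[2] + G * sub[3]
    z3 = E * sub[0] - G * sub[1] - D * sub[2] - F * sub[3]
    return z0, z1, z2, z3


if __name__ == "__main__":
    # Placeholder coefficients, chosen only to exercise the arithmetic.
    print(row_reference([1, 2, 3, 4, 5, 6, 7, 8], A=1, B=2, C=3, D=4, E=5, F=6, G=7))
```

The split mirrors the declarations in the BDL code: the sums are consumed in the state right after they are produced, while the differences must survive one more state boundary, which is why they are declared as registers.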
Various embedded architectures: trends
[Figure: the architecture spectrum shown earlier (general-purpose CPU, DSP, configurable processor, tile processor, homogeneous chip multiprocessor, heterogeneous multiprocessor, special-purpose processors, dynamically reconfigurable processors, FPGA and reconfigurable systems, dedicated hardware), now annotated with three trend labels: "Now going major", "Next going major", and "Going major?".]
Glossary
• Most of the terms in this lesson have appeared before, and most are used as-is in Japanese.
• Tile Processor: タイルプロセッサ
• Dynamically reconfigurable processor: 動的再構成可能(リコンフィギャラブル)プロセッサ
• FSM (Finite State Machine): 有限状態マシン
• Multicontext: マルチコンテキスト型 (possibly also written マルチコンテクスト)
• Functional Synthesis: 機能合成
• Time-multiplexed Execution: 時分割多重実行
Exercise
• Assume that the dynamically reconfigurable processor executes 1000 times faster than the host processor.
• Compute the total speedup when it can be used for 90% of the total task.