Introducing the ConnX D2 DSP Engine
Introduced: August 24, 2009
Copyright © 2009, Tensilica, Inc.
Fastest Growing Processor / DSP IP Company
Customizable Dataplane Processor/DSP IP Licensing
– Leading provider of customizable Dataplane Processor Units (DPUs)
– Unique combination of processor & DSP IP cores + software design tools
– Customization enables improved power, cost, performance
– Standard DPU solutions for audio, video/imaging & baseband comms
– Dominant patent portfolio for configurable processor technology
Broad-Based Success
– 150+ licensees, including 5 of the top 10 semiconductor companies
– Shipping in high volume today (>200M/yr rate)
– Fastest growing semiconductor processor IP company (per Gartner, Jan-09)
  • 21% revenue growth in 2007, 25% in 2008
Focus: Dataplane Processing Units (DPUs)
Embedded Controller for Dataplane Processing
Main Applications CPU
Tensilica focus: Dataplane Processors
DPUs: Customizable CPU+DSP delivering 10 to 100x higher performance than CPU or DSP and providing better flexibility & verification than RTL
Communications DSP Trends / Challenges
Markets Changing Faster
• Market requirements in flux as economy wobbles
• Emerging standards evolve faster in the Internet age
Development Teams Shrink
• SOC development schedules tightening
• Tightening resource constraints (do more with less)
Code Size Increases
• Communications standards growing in number & complexity
• DSP algorithm code heavily integrated with more (and more complex) control code
Maintenance and flexibility push DSP algorithms towards C code
Trends Within Licensable DSP Architectures
1st Generation Licensable DSP Cores
• Modest/medium performance (single/dual MAC)
• Simple architecture (single issue, compound instructions)
• Limited or no compiler support (mostly hand coded)
2nd Generation Licensable DSP Cores
• Added RISC-like architecture features (register arrays)
• Improved compiler targets, but still assembly
• Some offer wide VLIW for performance
  • Large area; code bloat
• Some offer wide SIMD for performance
  • Good area/performance tradeoff
  • No performance when vectorization fails
Vectorization Benefits (SIMD)
• Loop counts can be reduced
• Data computation can be done in parallel
• Cheapest (hardware cost) method to get higher performance
Example: 2-way SIMD performance benefit
[Figure: Before vectorization, single execution processes Data0 through Data7 one element at a time. After vectorization, 2-way SIMD execution processes the even elements (Data0, Data2, Data4, Data6) and the odd elements (Data1, Data3, Data5, Data7) in two parallel lanes.]
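As a rough illustration of the loop-count reduction, the sketch below adds two arrays of 16-bit samples first one element per iteration and then two elements per iteration. It is plain portable C that only mimics the 2-way lane layout. The function names `add_scalar` and `add_2way` are made up for this example, and no SIMD hardware or ConnX D2 intrinsic is involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form: one element per loop iteration.
 * Returns the number of loop iterations executed. */
size_t add_scalar(const int16_t *x, const int16_t *y, int16_t *out, size_t n)
{
    size_t iters = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)(x[i] + y[i]);
        iters++;
    }
    return iters;
}

/* 2-way form: each iteration handles an even/odd pair, the way a 2-way SIMD
 * unit would process two lanes at once. A leftover odd element is handled
 * in one extra scalar iteration. */
size_t add_2way(const int16_t *x, const int16_t *y, int16_t *out, size_t n)
{
    size_t iters = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        out[i]     = (int16_t)(x[i] + y[i]);         /* lane 0 */
        out[i + 1] = (int16_t)(x[i + 1] + y[i + 1]); /* lane 1 */
        iters++;
    }
    if (i < n) {
        out[i] = (int16_t)(x[i] + y[i]);
        iters++;
    }
    return iters;
}
```

For eight elements the scalar loop runs eight times and the 2-way loop four times, matching the halved loop count in the figure.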
VLIW Technology
• Parallel execution of instructions
• Effective use of multiple ALUs/MACs
• Compiler allocates instructions to VLIW slots
• Orthogonal allocation yields more flexibility
[Figure: Without VLIW, a single execution ALU runs Instructions #1 to #4 in sequence. With 2-way VLIW, ALU1 runs Instructions #1 and #3 while ALU2 runs Instructions #2 and #4.]
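The slot allocation idea can be sketched with a toy in-order bundler. This is only an illustration under simplifying assumptions: every instruction takes one cycle, each has at most one dependency on an earlier instruction, and any instruction may use any slot. The name `vliw_cycles` and the dependency encoding are invented for this example and say nothing about how the Tensilica compiler actually schedules code.

```c
#define MAX_INSTR 64

/* dep[i] is the index of the earlier instruction that instruction i depends
 * on, or -1 if it is independent. Instructions are issued in order; a new
 * bundle (cycle) is started when the current bundle is full or when the
 * dependency was issued in the current bundle.
 * Returns the cycle count, or -1 if the input is out of range. */
int vliw_cycles(const int *dep, int n, int slots)
{
    int cycle_of[MAX_INSTR];
    int cycle = 0;
    int used = 0;

    if (n < 0 || n > MAX_INSTR || slots < 1)
        return -1;
    if (n == 0)
        return 0;

    for (int i = 0; i < n; i++) {
        int need_new = (used == slots);
        if (dep[i] >= i)
            return -1; /* a dependency must point at an earlier instruction */
        if (dep[i] >= 0 && cycle_of[dep[i]] == cycle)
            need_new = 1;
        if (need_new) {
            cycle++;
            used = 0;
        }
        cycle_of[i] = cycle;
        used++;
    }
    return cycle + 1;
}
```

Four independent instructions take four cycles with one slot and two cycles with two slots, as in the figure. A fully dependent chain gains nothing from the second slot.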
Ideal 3rd Generation Licensable DSP
Ideal Characteristics
• VLIW capability for good performance on general code
  • Parallelization of independent operations
• SIMD capability for good performance on loop code
  • Data parallel execution
• Good C compiler target
  • Reduce or eliminate need for assembly programming
  • Productivity benefit
• Small, compact size
  • Keep costs down in brutally competitive markets
Tensilica - the Stealth DSP Company
[Figure: Tensilica DSP portfolio arranged by MAC count (Single MAC, Dual MAC, Quad MAC, 8 MAC, 8 MAC and more) and by market (Comms, Audio, Video, Xtensa: Other Markets). Labels shown: ConnX D2, ConnX Vectra LX, ConnX 545CK DSP, ConnX BBE (16 MAC), HiFi 2, 388VDO, MAC16, MUL32, DIV32, Single Precision Floating Point Unit, Double Precision Acceleration, Floating Point HW, Xtensa TIE, Custom DSPs, DSP Building Blocks.]
ConnX D2 DSP Engine - Overview
• Dual 16b MAC architecture with hybrid SIMD / VLIW
  • Optimum performance on a wide range of algorithms
  • SIMD offers high data computation rate for DSP algorithms
  • 2-way VLIW allows parallel instruction execution on SIMD and scalar code
• “Out of the Box” industry standard software compatibility
  • TI C6x fixed-point C intrinsics supported
  • Fully bit-for-bit equivalent with TI C6x
  • ITU reference code fixed-point C intrinsics directly supported
• Goals: ease of use, low area/cost
  • Click and go “Out of the Box” performance from standard C code
  • Standard C and fixed-point data types: 16-bit, 32-bit and 40-bit
  • Advanced optimizing, vectorizing compiler
  • Less than 70K gates (under 0.2 mm² in 65nm)
Target Applications: ConnX D2
• Embedded control
• VoIP gateways, voice-over-networks (including VoIP codecs)
• Femto-cell and pico-cell base stations
• Next generation disk drives, data storage
• Mobile terminals and handsets
• Home entertainment devices
• Computer peripherals, printers
General purpose 16-bit DSP for a wide range of applications
ConnX D2 DSP: An Ingredient of an Xtensa DPU
Hardware Use Model
• Click-button configuration option within Xtensa LX core
• Part of the Tensilica configurable core deliverable package
• Two reference configurations
• Typical DSP solution for high performance
• Small size for cost and power sensitive applications
• Full tool support from Tensilica
• High level simulators (SystemC), ISS and RTL
• Debugger and Trace
• Compiler, IDE and Operating Systems
ConnX D2 Processor Block Diagram (Typical)
ConnX D2 Engine Architecture
[Figure: ConnX D2 engine architecture. Recoverable labels: AR register bank (32 bits); local memory and/or cache; two load/store units with 32-bit paths; XDU alignment registers (4 x 32 bits); XDD register file (8 x 40 bits); overflow state; carry state; Hi / Lo 16-bit select; data formats including 16-bit vector, 8-bit fields, 40-bit, 32-bit and 16-bit fixed, 40-bit, 32-bit and 16-bit integer, and complex (16-bit real, 16-bit imaginary); two multiply-accumulate paths, each with a 16b x 16b multiply, add, shift / saturation, rounding, and an accumulator of up to 40 bits.]
Addressing Modes
• Immediate
• Immediate updating
• Indexed
• Indexed updating
• Aligning updating
• Circular (instruction)
• Bit-reversed (instruction)
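To make the "accumulator (up to 40 bits)" and "saturation" labels concrete, here is a minimal sketch of a 40-bit saturating multiply-accumulate in portable C. It models only the arithmetic idea, using a 64-bit integer to hold the 40-bit value. The names `sat40` and `mac40` are hypothetical, and the exact rounding and flag behavior of the ConnX D2 datapath is not described in the slides and is not modeled here.

```c
#include <stdint.h>

#define ACC40_MAX ((int64_t)0x7FFFFFFFFFLL) /* 2^39 - 1 */
#define ACC40_MIN (-ACC40_MAX - 1)          /* -2^39 */

/* Clamp a wide intermediate value to the signed 40-bit range. */
int64_t sat40(int64_t v)
{
    if (v > ACC40_MAX)
        return ACC40_MAX;
    if (v < ACC40_MIN)
        return ACC40_MIN;
    return v;
}

/* 16b x 16b multiply, accumulated into a 40-bit value with saturation.
 * acc is assumed to already lie within the 40-bit range. */
int64_t mac40(int64_t acc, int16_t a, int16_t b)
{
    return sat40(acc + (int64_t)a * (int64_t)b);
}
```

The eight guard bits above a 32-bit product are what let a long sum of 16 x 16 products grow for many iterations before saturation is needed.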
DSP specific instructions
Add-Bit-Reverse-Base and Add-Subtract : Useful for FFT implementation
Add-Compare-Exchange : Useful for Viterbi implementation
Add-Modulo : Circular buffer implementation. Useful for FIR implementation
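The operations these instructions accelerate can be written out in plain C. The sketch below shows a circular-buffer index update, a bit-reversed index as used for FFT data reordering, and the add-compare-select step common in Viterbi decoders. All three function names are invented for this example, and the code illustrates the underlying arithmetic only, not the operand formats or exact semantics of the ConnX D2 instructions.

```c
/* Add-modulo: advance a circular-buffer index and wrap at the buffer length.
 * length must be non-zero. */
unsigned add_modulo(unsigned index, unsigned step, unsigned length)
{
    return (index + step) % length;
}

/* Reverse the low `bits` bits of value, as used to reorder FFT inputs or
 * outputs. */
unsigned bit_reverse(unsigned value, unsigned bits)
{
    unsigned result = 0;
    for (unsigned b = 0; b < bits; b++) {
        result = (result << 1) | (value & 1u);
        value >>= 1;
    }
    return result;
}

/* Viterbi add-compare-select: add a branch metric to each of two path
 * metrics, keep the smaller sum, and record which path survived
 * (0 = first path, 1 = second path). */
int add_compare_select(int metric0, int branch0,
                       int metric1, int branch1, int *decision)
{
    int cand0 = metric0 + branch0;
    int cand1 = metric1 + branch1;
    *decision = (cand1 < cand0);
    return *decision ? cand1 : cand0;
}
```

In software each of these costs several instructions per sample or per trellis state, which is why folding them into single instructions helps FIR, FFT and Viterbi inner loops.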
ConnX D2 : Instruction Allocation Options
• 16-bit instructions: base ISA
• 24-bit instructions: base ISA or ConnX D2
• VLIW instructions (64 bits)
  • Slot 0: ConnX D2 or base ISA
  • Slot 1: ConnX D2 or base ISA (register moves & C ops on register data)
• Flexible allocation of instructions available to compiler
• Optimum use of VLIW slots (ConnX D2 or base ISA instructions)
• Improved performance and no code bloat (reduced NOPs)
• Reduce code size when algorithm is less performance intensive
• Modeless switching between instruction formats
ConnX D2 : SIMD with VLIW – Extra Performance
Example : Energy Calculation
$$A = \sum_{n=0}^{127} X_n \cdot X_n$$
Combining SIMD and VLIW can give 6 times the performance
• Vectorization and SIMD give double the data computation performance
• VLIW gives 2 pipeline executions (one is SIMD) with auto-increment loads
• ConnX D2 architecture gives this combination and performance
C algorithm: 128 iterations
Base Xtensa configuration: 416 cycles
    loopgtz a3,.LBB52_energy    # [3]
    l16si   a3,a2,2             # [0*II+0] id:16 a+0x0
    l16si   a5,a2,4             # [0*II+1] id:16 a+0x0
    l16si   a6,a2,6             # [0*II+2] id:16 a+0x0
    l16si   a7,a2,8             # [0*II+3] id:16 a+0x0
    mul16s  a3,a3,a3            # [0*II+4]
    mul16s  a5,a5,a5            # [0*II+5]
    mul16s  a6,a6,a6            # [0*II+6]
    mul16s  a7,a7,a7            # [0*II+7]
    addi.n  a2,a2,8             # [0*II+8]
    add.n   a3,a4,a3            # [0*II+9]
    add.n   a3,a3,a5            # [0*II+10]
    add.n   a3,a3,a6            # [0*II+11]
    add.n   a4,a3,a7            # [0*II+12]
ConnX D2: 64 cycles
One instruction (64-bit VLIW instruction). Slot0: instruction execution (control). Slot1: SIMD computation.
    loop { # format XD2_FLIX_FORMAT
        xd2_la.d16x2s.iu xdd0,xdu0,a4,4;
        xd2_mulaa40.d16s.ll.hh xdd1,xdd0,xdd0
    }
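The energy sum on this slide can be written in C in both forms. The sketch below is a plain-C model, assuming 16-bit samples and a wide accumulator: `energy_scalar` performs one multiply-accumulate per iteration, while `energy_dual_mac` performs two per iteration, loosely mirroring a dual-MAC instruction that multiplies the low and high halves of a packed pair. Both function names are made up for this example, and a 64-bit integer stands in for the 40-bit accumulator.

```c
#include <stdint.h>

/* One multiply-accumulate per iteration: n iterations for n samples. */
int64_t energy_scalar(const int16_t *x, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * x[i];
    return acc;
}

/* Two multiply-accumulates per iteration (low and high element of a pair).
 * Stores the number of loop iterations in *iterations. */
int64_t energy_dual_mac(const int16_t *x, int n, int *iterations)
{
    int64_t acc = 0;
    int iters = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        int32_t lo = (int32_t)x[i] * x[i];         /* low-half product  */
        int32_t hi = (int32_t)x[i + 1] * x[i + 1]; /* high-half product */
        acc += (int64_t)lo + hi;
        iters++;
    }
    if (i < n) { /* odd leftover sample */
        acc += (int32_t)x[i] * x[i];
        iters++;
    }
    *iterations = iters;
    return acc;
}
```

For 128 samples the paired loop runs 64 times, which lines up with the 64-cycle figure if each iteration issues as a single VLIW bundle, as the slide states. The 416-cycle base figure is consistent with the 13-instruction loop body above covering four samples per pass (13 x 32 = 416).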
When Vectorization is Not Possible
Performance for scalar code bases
    int energy(short *a, int col, int cols, int rows)
    {
        int i;
        int sum = 0;
        for (i = 0; i < rows; i++) {
            sum += a[cols*i+col] * a[cols*i+col];
        }
        return sum;
    }
• Energy computation of column ‘col’ in 2-D array
• Above code loop cannot be vectorized
  • Non-contiguous memory accesses thwart vectorizers
  • Regular compilers cannot map this code into traditional SIMD DSPs
When Vectorization is Not Possible
Performance for scalar code bases
• Confirmed that ConnX D2 and TI C6x compilers cannot vectorize this code
• ConnX D2 compiler can however use VLIW to increase performance
C Code
    int energy(short *a, int col, int cols, int rows)
    {
        int i;
        int sum = 0;
        for (i = 0; i < rows; i++) {
            sum += a[cols*i+col] * a[cols*i+col];
        }
        return sum;
    }
Generated Assembly Code
    entry   a1,32
    blti    a5,1,.Lt_0_2306
    addx2   a2,a3,a2
    slli    a3,a4,1
    addi.n  a4,a5,-1
    sub     a2,a2,a3
    { # format XD2_FLIX_FORMAT
        xd2_l.d16s.xu xdd0,a2,a3
        xd2_movi.d40 xdd1,0
    }
    loopgtz a4,.LBB43_energy
    { # format XD2_FLIX_FORMAT
        xd2_l.d16s.xu xdd0,a2,a3 ; xd2_mula32.d16s.ll_s1 xdd1,xdd0,xdd0
    }
    …………
ConnX D2 : One cycle within loop
• Load scalar 16 bits: xdd0 is loaded with the memory contents at the address held in the a2 register. The a2 register value is updated by the value in a3.
• MAC operation on lower 16 bits: multiplies xdd0 with xdd0. The accumulated result is stored in xdd1.
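The load-with-update pattern described above amounts to walking the column with a running offset instead of recomputing `cols*i+col` on every pass. The C sketch below shows that equivalent form. The function name `energy_strided` is invented for this example, and the mapping to the assembly (a2 as the running address, a3 as the stride) is a reading of the slide's annotations rather than a documented fact about the instruction.

```c
#include <stddef.h>

/* Column energy using a running offset. The offset advances by `cols`
 * elements per iteration, which mirrors a load that updates its address
 * register by a fixed stride. Each iteration is one load plus one
 * multiply-accumulate, which is the pair the VLIW bundle issues together. */
int energy_strided(const short *a, int col, int cols, int rows)
{
    size_t offset = (size_t)col;
    int sum = 0;
    for (int i = 0; i < rows; i++) {
        int sample = a[offset];
        sum += sample * sample;
        offset += (size_t)cols;
    }
    return sum;
}
```

For a 3 x 4 array holding 1 to 12 in row-major order, column 1 contains 2, 6 and 10, so the result is 4 + 36 + 100 = 140.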
Optimization with ITU / TI Intrinsics
Performance for generic code bases
Energy calculation loop: 1000 iterations, using the L_mac ITU intrinsic
    #define ASIZE 1000
    extern int a[ASIZE];
    extern int red;
    void energy()
    {
        int i;
        int red_0 = red;
        for (i = 0; i < ASIZE; i++) {
            red_0 = L_mac(red_0, a[i], a[i]);
        }
        red = red_0;
    }
Generated Assembly Code
    entry   a1,32
    l32r    a2,.LC1_40_18
    l32r    a5,.LC0_40_17
    xd2_l.d16x2s.iu xdd0,a2,4    test_arr_1+0x0
    l32i.n  a3,a5,0              test_global_red_0+0x0
    { # format XD2_ARUSEDEF_FORMAT
        xd2_mov.d32.a32s xdd1,a3
        movi a3,499
    }
    loopgtz a3,
    { # format XD2_FLIX_FORMAT
        xd2_l.d16x2s.iu xdd0,a2,4; xd2_mulaa.fs32.d16s.ll.hh xdd1,xdd0,xdd0
    }
    ..........
• L_mac maps to one ConnX D2 instruction
• Compiler further optimizes by using SIMD to accelerate the loop
• VLIW further accelerates the loop with parallel loads
• 1000-iteration C loop optimized to a 500-cycle loop
Sustained 3 operations / cycle
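For readers unfamiliar with the intrinsic, the sketch below models what `L_mac` computes as it is commonly defined in the ITU-T basic operators: a 16 x 16 multiply, doubled and saturated to 32 bits, then added to the accumulator with 32-bit saturation. The `model_` names are hypothetical stand-ins chosen to avoid clashing with a real basic-operator library, and the overflow flag that the reference operators maintain is deliberately left out.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX)
        return INT32_MAX;
    if (v < INT32_MIN)
        return INT32_MIN;
    return (int32_t)v;
}

/* Model of L_mult: doubled 16 x 16 product. The only overflowing case is
 * (-32768) * (-32768), which saturates to INT32_MAX. */
int32_t model_L_mult(int16_t a, int16_t b)
{
    return sat32((int64_t)a * b * 2);
}

/* Model of L_add: 32-bit addition with saturation. */
int32_t model_L_add(int32_t a, int32_t b)
{
    return sat32((int64_t)a + b);
}

/* Model of L_mac: acc + 2*a*b, saturating at each step. */
int32_t model_L_mac(int32_t acc, int16_t a, int16_t b)
{
    return model_L_add(acc, model_L_mult(a, b));
}
```

Because each step saturates, a compiler can only fuse or vectorize such a loop if the hardware reproduces the same saturation behavior, which is what the slide's bit-exactness claim is about.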
“Out of the Box” Performance - Results
Comparison to TI C55x
(TI C55x is an industry benchmark Dual-MAC, 2-way VLIW)
• 20% more performance (256 point complex FFT)
Cycle count (lower is better)
• ConnX D2, "Out of the Box" C code: 3740
• TI C55x, optimized assembly: 4786 #
Why better?
• FFT specific instructions
• Dual write to register files
• Advanced compiler
• SIMD and VLIW performance
Comparison to other DSP IP vendors
• Almost twice the performance
Required MHz for AMR-NB (VAD2), Encode + Decode
• ConnX D2 (Out of the Box ITU reference code): 27.7 MHz
• CEVA-X1620 (Out of the Box ITU reference code): 48 MHz *
Why better?
• 1 to 1 mapping of ITU intrinsics
• SIMD and VLIW performance
• Flexibility in VLIW allocation
• VLIW performance for scalar code
* - 2008, from CEVA published whitepaper
# - Dec 2008, www.ti.com
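The ratios behind the two comparisons can be worked out directly from the figures quoted on the slide. Reading "performance" as the inverse of cycle count or of required MHz is an assumption made here; the slide does not say how its percentages were derived.

```latex
\[
\frac{4786\ \text{cycles}}{3740\ \text{cycles}} \approx 1.28,
\qquad
\frac{48\ \text{MHz}}{27.7\ \text{MHz}} \approx 1.73
\]
```

On that reading, the TI C55x needs about 28% more cycles for the 256-point complex FFT, and the CEVA-X1620 needs about 1.73 times the clock rate for AMR-NB encode plus decode.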
Small, Low Power, & High Performance
Optimized for low area / low cost applications
• Less than 70,000 gates
• 0.18 mm² in 65nm GP *
Low power
• 52 µW/MHz power consumption
• 65nm GP, measured running AMR-NB algorithm
Very high performance
• 600 MHz in 65nm GP **
* - After full place and route, when optimized for area/power. Size is for the full Xtensa core including the D2 DSP option
** - After full place and route, when optimized for speed
Flexible and Customizable
Configure memory subsystems to exact requirements
• Up to 4 local memories
  • Instruction memory, data memory
  • RAM and ROM options
  • DMA path into these memories
• Instruction and data cache configurations
• MMU and memory region protection
• Memory port interface
• Option of dual load/store architecture
Full customization
• Instruction set extensions
• Custom I/O interfaces
  • TIE ports, queues and lookup memory interfaces
ConnX D2 DSP Engine: Summary
• Small size
• Low power
• Excellent performance on wide range of code
• Easy to use – C programming centric
  • “Out of the Box” performance
  • Reduce development time – reduced cost
• ITU and TI C intrinsic support – large existing code base
  • Bit equivalent to TI C6x
  • Take current TI code, port and get same functionality on ConnX D2
• Flexible & customizable