digital signal processors application architecture

61
1 Kurt Keutzer Lecture 9: Digital Signal Processors: Applications and Architectures Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen, Prof. David Patterson

Upload: anjugadu

Post on 07-Dec-2015

33 views

Category:

Documents


8 download

DESCRIPTION

Digital Signal Processors Application Architecture

TRANSCRIPT

  • Lecture 9: Digital Signal Processors:Applications and ArchitecturesPrepared by: Professor Kurt KeutzerComputer Science 252, Spring 2000With contributions from:Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen, Prof. David Patterson

    Kurt Keutzer

  • Processor ApplicationsGeneral Purpose - high performancePentiums, Alphas, SPARCUsed for general purpose software Heavy weight OS - UNIX, NTWorkstations, PCsEmbedded processors and processor coresARM, 486SX, Hitachi SH7000, NEC V800Single programLightweight, often realtime OSDSP supportCellular phones, consumer electronics (e.g. CD players) Microcontrollers Extremely cost sensitiveSmall word size - 8 bit commonHighest volume processors by farAutomobiles, toasters, thermostats, ... IncreasingCostIncreasingvolume

    Kurt Keutzer

  • Processor Markets$30B$9.3B/31%$5.7B/19%$10B/33%8-bitmicro16-bitmicroDSP32-bitmicro$5.2B/17%$1.2B/4%32 bit DSP

    Kurt Keutzer

  • The Processor Design SpaceCostPerformanceMicroprocessorsPerformance iseverything& Software rulesEmbeddedprocessorsMicrocontrollersCost is everythingApplication specific architecturesfor performance

    Kurt Keutzer

  • Market for DSP ProductsMixed/SignalAnalogDSPDSP is the fastest growing segment of the semiconductor market

    Kurt Keutzer

  • DSP Applications Audio applications MPEG Audio Portable audioDigital camerasWireless Cellular telephones Base stationNetworking Cable modems ADSL VDSL

    Kurt Keutzer

  • Another Look at DSP ApplicationsHigh-endWireless Base Station - TMS320C6000Cable modem gatewaysMid-endCellular phone - TMS320C540Fax/ voice serverLow endStorage products - TMS320C27Digital camera - TMS320C5000Portable phonesWireless headsetsConsumer audioAutomobiles, toasters, thermostats, ... IncreasingCostIncreasingvolume

    Kurt Keutzer

  • Serving a range of applications

    Kurt Keutzer

  • Worlds Cellular SubscribersMillionsYearDigitalAnalogSource: Ericsson Radio Systems, Inc.Will providea ubiquitousinfrastructurefor wirelessdata as wellas voice

    Kurt Keutzer

  • CELLULAR TELEPHONE SYSTEMPHYSICALLAYERPROCESSINGRF MODEMCONTROLLER 1 2 3 4 5 67 8 90415-555-1212SPEECHDECODESPEECHENCODEA/DBASEBANDCONVERTERDAC

    Kurt Keutzer

  • HW/SW/IC PARTITIONINGPHYSICALLAYERPROCESSINGRF MODEMCONTROLLER 1 2 3 4 5 67 8 90415-555-1212SPEECHDECODESPEECHENCODEA/DBASEBANDCONVERTERDACANALOG ICDSPASICMICROCONTROLLER

    Kurt Keutzer

  • Mapping onto a system on a chip CRAMASICLOGICphonebookprotocolkeypadintfccontrolS/PDMAspeechqualityenhancmentde-intl &decodervoicerecognitionRPE-LTPspeech decoder demodulatorand synchronizer

    Viterbi equalizer

    Kurt Keutzer

  • Example Wireless Phone OrganizationC540ARM7

    Kurt Keutzer

  • Multimedia I/O Architecture Low Power BusEmbedded ProcessorFifoVideoDecompVideoAudioFBFifoGraphicsPenSched ECC PactInterfaceDataFlowSRAM

    Kurt Keutzer

  • Multimedia System on a ChipFuture chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/OE.g. Multimedia terminal electronics

    Kurt Keutzer

  • Requirements of the Embedded ProcessorsOptimized for a single program - code often in on-chip ROM or off chip EPROMMinimum code size (one of the motivations initially for Java)Performance obtained by optimizing datapathLow costLowest possible areaTechnology behind the leading edgeHigh level of integration of peripherals (reduces system cost)Fast time to marketCompatible architectures (e.g. ARM) allows reuseable codeCustomizable coreLow power if application requires portability

    Kurt Keutzer

  • Area of processor cores = CostNintendo processorCellular phones

    Kurt Keutzer

  • Another figure of meritComputation per unit areaNintendo processorCellular phones???

    Kurt Keutzer

  • Code sizeIf a majority of the chip is the program stored in ROM, then code size is a critical issueThe Piranha has 3 sized instructions - basic 2 byte, and 2 byte plus 16 or 32 bit immediate

    Kurt Keutzer

  • BENCHMARKS - DSPstoneZIVOJNOVIC, VERLADE, SCHLAGER: UNIVERSITY OF AACHENAPPLICATION BENCHMARKSADPCM TRANSCODER - CCITT G.721REAL_UPDATECOMPLEX_UPDATESDOT_PRODUCTMATRIX_1X3CONVOLUTIONFIRFIR2DIMHR_ONE_BIQUADLMSFFT_INPUT_SCALED

    Kurt Keutzer

  • Evolution of GP and DSPGeneral Purpose Microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)DSP evolved from Analog Signal Processors, using analog hardware to transform phyical signals (classical electrical engineering)ASP to DSP becauseDSP insensitive to environment (e.g., same response in snow or desert if it works at all)DSP performance identical even with variations in components; 2 analog systems behavior varies even if built with same components with 1% variationDifferent history and different applications led to different terms, different metrics, some new inventionsConvergence of markets will lead to architectural showdown

    Kurt Keutzer

  • Embedded Systems vs. General Purpose Computing - 1Embedded System

    Runs a few applications often known at design timeNot end-user programmableOperates in fixed run-time constraints, additional performance may not be useful/valuableGeneral purpose computing

    Intended to run a fully general set of applicationsEnd-user programmable Faster is always better

    Kurt Keutzer

  • Embedded Systems vs. General Purpose Computing - 2Embedded System

    Differentiating features:powercostspeed (must be predictable)

    General purpose computing

    Differentiating featuresspeed (need not be fully predictable)speeddid we mention speed? cost (largest component power)

    Kurt Keutzer

  • DSP vs. General Purpose MPUDSPs tend to be written for 1 program, not many programs. Hence OSes are much simpler, there is no virtual memory or protection, ...DSPs sometimes run hard real-time appsYou must account for anything that could happen in a time slot All possible interrupts or exceptions must be accounted for and their collective time be subtracted from the time interval. Therefore, exceptions are BAD!DSPs have an infinite continuous data stream

    Kurt Keutzer

  • DSP vs. General Purpose MPUThe MIPS/MFLOPS of DSPs is speed of Multiply-Accumulate (MAC). DSP are judged by whether they can keep the multipliers busy 100% of the time.The "SPEC" of DSPs is 4 algorithms: Inifinite Impule Response (IIR) filtersFinite Impule Response (FIR) filtersFFT, and convolversIn DSPs, algorithms are king!Binary compatability not an issueSoftware is not (yet) king in DSPs. People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.

    Kurt Keutzer

  • TYPES OF DSP PROCESSORSDSP Multiprocessors on a die TMS320C80 TMS320C600032-BIT FLOATING POINTTI TMS320C4XMOTOROLA 96000AT&T DSP32CANALOG DEVICES ADSP2100016-BIT FIXED POINTTI TMS320C2XMOTOROLA 56000AT&T DSP16ANALOG DEVICES ADSP2100

    Kurt Keutzer

  • Note of Caution on DSP ArchitecturesSuccessful DSP architectures have two aspects:Key architectural and micro-architectural features that enabled product success in key parameters Speed Code density Low power Architectural and micro-architectural features that are artifacts of the era in which they were designed

    We will focus on the former!

    Kurt Keutzer

  • Architectural Features of DSPsData path configured for DSP Fixed-point arithmeticMAC- Multiply-accumulateMultiple memory banks and buses - Harvard ArchitectureMultiple data memoriesSpecialized addressing modes Bit-reversed addressing Circular buffersSpecialized instruction set and execution control Zero-overhead loopsSupport for MACSpecialized peripherals for DSPTHE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!

    Kurt Keutzer

  • DSP Data Path: ArithmeticDSPs dealing with numbers representing real world => Want reals/ fractionsDSPs dealing with numbers for addresses => Want integersSupport fixed point as well as integersS.radix point-1 x < 1S.radix point2N1 x < 2N1

    Kurt Keutzer

  • DSP Data Path: PrecisionWord size affects precision of fixed point numbersDSPs have 16-bit, 20-bit, or 24-bit data wordsFloating Point DSPs cost 2X - 4X vs. fixed point, slower than fixed pointDSP programmers will scale values inside codeSW LibrariesSeparate explicit exponentBlocked Floating Point single exponent for a group of fractionsFloating point support simplify development

    Kurt Keutzer

  • DSP Data Path: Overflow?DSP are descended from analog : what should happen to output when peg an input? (e.g., turn up volume control knob on stereo)Modulo Arithmetic???Set to most positive (2N11) or most negative value(2N1) : saturationMany algorithms were developed in this model

    Kurt Keutzer

  • DSP Data Path: MultiplierSpecialized hardware performs all key arithmetic operations in 1 cycle 50% of instructions can involve multiplier => single cycle latency multiplierNeed to perform multiply-accumulate (MAC)n-bit multiplier => 2n-bit product

    Kurt Keutzer

  • DSP Data Path: AccumulatorDont want overflow or have to scale accumulatorOption 1: accumalator wider than product: guard bitsMotorola DSP: 24b x 24b => 48b product, 56b AccumulatorOption 2: shift right and round product before adder

    Kurt Keutzer

  • DSP Data Path: RoundingEven with guard bits, will need to round when store accumulator into memory3 DSP standard optionsTruncation: chop results => biases results upRound to nearest: < 1/2 round down, 1/2 round up (more positive) => smaller biasConvergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias IEEE 754 calls this round to nearest even

    Kurt Keutzer

  • Data PathDSP Processor

    Specialized hardware performs all key arithmetic operations in 1 cycle.Hardware support for managing numeric fidelity:ShiftersGuard bitsSaturation

    General-Purpose Processor

    Multiplies often take>1 cycleShifts often take >1 cycleOther operations (e.g., saturation, rounding) typically take multiple cycles.

    Kurt Keutzer

  • 320C54x DSP Functional Block Diagram

    Kurt Keutzer

  • FIR Filtering: A Motivating ProblemM most recent samples in the delay line (Xi)New sample moves data down delay lineTap is a multiply-addEach tap (M+1 taps total) nominally requires: Two data fetches Multiply Accumulate Memory write-back to update delay lineGoal: 1 FIR Tap / DSP instruction cycle

    Kurt Keutzer

  • BENCHMARKS - FIR FILTERFINITE-IMPULSE RESPONSE FILTER. . . .

    Kurt Keutzer

  • Micro-architectural impact - MACelement of finite-impulse response filter computationMPYXYACC REGADD/SUB

    Kurt Keutzer

  • The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycleThis means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback 123D546Mapping of the filter onto a DSP execution unit

    Kurt Keutzer

  • MAC Eg. - 320C54x DSP Functional Block Diagram

    Kurt Keutzer

  • DSP MemoryFIR Tap implies multiple memory accessesDSPs want multiple data portsSome DSPs have ad hoc techniques to reduce memory bandwdith demandInstruction repeat buffer: do 1 instruction 256 timesOften disables interrupts, thereby increasing interrupt response timeSome recent DSPs have instruction cachesEven then may allow programmer to lock in instructions into cacheOption to turn cache into fast program memoryNo DSPs have data cachesMay have multiple data memories

    Kurt Keutzer

  • Conventional ``Von Neumann memory

    Kurt Keutzer

  • HARVARD ARCHITECTURE in DSPPROGRAMMEMORYX MEMORYY MEMORYGLOBALP DATAX DATAY DATA

    Kurt Keutzer

  • Memory ArchitectureDSP ProcessorHarvard architecture2-4 memory accesses/cycleNo caches-on-chip SRAM

    General-Purpose ProcessorVon Neumann architectureTypically 1 access/cycleMay use cachesProcessorProgramMemoryDataMemoryProcessorMemory

    Kurt Keutzer

  • Eg. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture

    Kurt Keutzer

  • Eg. 320C62x/67x DSP

    Kurt Keutzer

  • DSP AddressingHave standard addressing modes: immediate, displacement, register indirectWant to keep MAC datapth busyAssumption: any extra instructions imply clock cycles of overhead in inner loop => complex addressing is good => dont use datapath to calculate fancy addressAutoincrement/Autodecrement register indirectlw r1,0(r2)+ => r1
  • DSP Addressing: FFTFFTs start or end with data in weird bufferfly order0 (000)=> 0 (000)1 (001)=> 4 (100)2 (010)=> 2 (010)3 (011)=> 6 (110)4 (100)=> 1 (001)5 (101)=> 5 (101)6 (110)=> 3 (011)7 (111)=> 7 (111)What can do to avoid overhead of address checking instructions for FFT?Have an optional bit reverse address addressing mode for use with autoincrement addressingMany DSPs have bit reverse addressing for radix-2 FFT

    Kurt Keutzer

  • BIT REVERSED ADDRESSINGData flow in the radix-2 decimation-in-time FFT algorithm

    Kurt Keutzer

  • DSP Addressing: BuffersDSPs dealing with continuous I/OOften interact with an I/O buffer (delay lines)To save memory, buffer often organized as circular bufferWhat can do to avoid overhead of address checking instructions for circular buffer?Option 1: Keep start register and end register per address register for use with autoincrement addressing, reset to start when reach end of bufferOption 2: Keep a buffer length register, assuming buffers starts on aligned address, reset to start when reach endEvery DSP has modulo or circular addressing

    Kurt Keutzer

  • CIRCULAR BUFFERSInstructions accomodate three elements: buffer address buffer size incrementAllows for cyling through: delay elements coefficients in data memory

    Kurt Keutzer

  • AddressingDSP ProcessorDedicated address generation unitsSpecialized addressing modes; e.g.:AutoincrementModulo (circular)Bit-reversed (for FFT)Good immediate data supportGeneral-Purpose ProcessorOften, no separate address generation unitGeneral-purpose addressing modes

    Kurt Keutzer

  • Address calculation unit for DSPSupports modulo and bit reversal arithmeticOften duplicated to calculate multiple addresses per cycle

    Kurt Keutzer

  • DSP Instructions and ExecutionMay specify multiple operations in a single instructionMust support Multiply-Accumulate (MAC)Need parallel move supportUsually have special loop support to reduce branch overheadLoop an instruction or sequence0 value in register usually means loop maximum number of timesMust be sure if calculate loop count that 0 does not mean 0May have saturating shift left arithmeticMay have conditional execution to reduce branches

    Kurt Keutzer

  • ADSP 2100: ZERO-OVERHEAD LOOPAddress Generation PCS = PC + 1if (PC = x && ! condition) PC = PCSelse PC = PC +1DO UNTIL condition Eliminates a few instructions in loops - Important in loops with small bodies

    Kurt Keutzer

  • Instruction SetDSP Processor

    Specialized, complex instructionsMultiple operations per instructionGeneral-Purpose Processor

    General-purpose instructionsTypically only one operation per instructionmac x0,y0,a x: (r0) + ,x0 y: (r4) + ,y0mov *r0,x0mov *r1,y0 mpy x0, y0, aadd a, bmov y0, *r2inc r0inc rl

    Kurt Keutzer

  • Specialized Peripherals for DSPsSynchronous serial portsParallel portsTimersOn-chip A/D, D/A convertersHost portsBit I/O portsOn-chip DMA controllerClock generatorsOn-chip peripherals often designed for background operation, even when core is powered down.

    Kurt Keutzer

  • Specialized peripherals

    Kurt Keutzer

  • TMS320C203/LC203 BLOCK DIAGRAM DSP Core Approach - 1995

    Kurt Keutzer

  • Summary of Architectural Features of DSPsData path configured for DSP Fixed-point arithmeticMAC- Multiply-accumulateMultiple memory banks and buses - Harvard ArchitectureMultiple data memoriesSpecialized addressing modes Bit-reversed addressing Circular buffersSpecialized instruction set and execution control Zero-overhead loopsSupport for MACSpecialized peripherals for DSPTHE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!

    Kurt Keutzer