7/27/2019 DSP Hardware
EKT353 Lecture Notes by Professor Dr. Farid Ghani
DSP Hardware
Introduction:
Since their introduction in the early 1980s, DSP processors have grown substantially in complexity and sophistication to enhance their capability and range of applicability. This has also led to a substantial increase in the number of DSP processors available. To reflect this, the features of successive generations of fixed- and floating-point DSP processors, and the factors that affect the choice of a DSP processor, are considered in the following pages.
For convenience, DSP processors can be divided into two broad categories: general purpose and special purpose. General-purpose DSP processors include fixed-point devices such as the Texas Instruments TMS320C54x and Motorola DSP563x processors, and floating-point processors such as the Texas Instruments TMS320C4x and Analog Devices ADSP21xxx SHARC processors.
There are two types of special-purpose hardware:

1. Hardware designed for efficient execution of specific DSP algorithms, such as digital filters or the fast Fourier transform. This type of special-purpose hardware is sometimes called an algorithm-specific digital signal processor.
2. Hardware designed for specific applications, for example telecommunications, digital audio, or control applications. This type of hardware is sometimes called an application-specific digital signal processor.
In most cases application-specific digital signal processors execute specific algorithms, such as PCM encoding/decoding, but are also required to perform other application-specific operations. Examples of special-purpose DSP processors are Cirrus's
processor for digital audio sampling rate converters (CS8420), Mitel's multi-channel telephony voice echo canceller (MT9300), the stand-alone FFT processor (PDSP16515A) and the programmable FIR filter (PDSP16256).
Both general-purpose and special-purpose processors can be designed with single chips or with individual blocks of multipliers, ALUs, memories, and so on. First, we will discuss the architectural features of digital signal processors that have made real-time DSP possible in many areas.
Most general-purpose processors available today are based on the Von Neumann concept, where operations are performed sequentially. Figure 1 shows a simplified architecture for a standard Von Neumann processor. When an instruction is processed in such a processor, the units of the processor not involved in each phase of the instruction wait idly until control is passed on to them.
[Figure: address generator, ALU, accumulator, multiplier with product register, program and data memory, and I/O devices, all sharing a single address bus and a single data bus]
Figure 1. A simplified architecture of a standard microprocessor.
An increase in processor speed is achieved by making the individual units operate faster, but there is a limit on how fast they can be made to operate. If it is to operate in real time, a DSP processor must have its architecture optimized for executing DSP functions.
[Figure: an arithmetic unit (ALU, shifter, multiplier, accumulator) and a memory unit (X data, Y data and program memories) linked by separate X data, Y data and program data buses, with I/O devices]
Figure 2. Basic generic hardware architecture for signal processing.
Figure 2 shows a generic hardware architecture suitable for real-time DSP. It is characterized by the following:

Multiple bus structure with separate memory spaces for data and program instructions. Typically the data memories hold input data, intermediate data values and output samples, as well as fixed coefficients for, for example, digital filters or FFTs. The program instructions are stored in the program memory.

The I/O port provides a means of passing data to and from external devices such as the ADC and DAC, or for passing
digital data to other processors. Direct memory access (DMA), if available, allows for rapid transfer of blocks of data directly to or from data RAM, typically under external control.

Arithmetic units for logical and arithmetic operations, which include an ALU, a hardware multiplier and shifters (or a multiplier-accumulator).
Why is such an architecture necessary? Most DSP algorithms (such as filtering, correlation and the fast Fourier transform) involve repetitive arithmetic operations such as multiply, add and memory accesses, and heavy data flow through the CPU. The architecture of standard microprocessors is not suited to this type of activity. An important goal in DSP hardware design is to optimize both the hardware architecture and the instruction set for DSP operations. In digital signal processors, this is achieved by making extensive use of the concept of parallelism. In particular, the following techniques are used:
1. Harvard architecture;
2. pipelining;
3. fast, dedicated hardware multiplier/accumulator;
4. special instructions dedicated to DSP;
5. replication;
6. on-chip memory/cache;
7. extended parallelism: SIMD, VLIW and static superscalar processing.
For successful DSP design, it is important to understand these
key architectural features.
Harvard architecture:
The principal feature of the Harvard architecture is that the program and data memories lie in two separate spaces, permitting a full overlap of instruction fetch and execution.
Standard microprocessors, such as the MOS Technology 6502, are characterized by a single bus structure for both data and instructions, as shown in Figure 1.
Suppose that in a standard microprocessor we wish to read a value op1 at address ADR1 in memory into the accumulator and then store it at two other addresses, ADR2 and ADR3. The instructions could be

LDA ADR1    load the operand op1 into the accumulator from ADR1
STA ADR2    store op1 in address ADR2
STA ADR3    store op1 in address ADR3
Typically, each of these instructions would involve three distinct steps: instruction fetch, instruction decode, and instruction execute.
In our case, the instruction fetch involves fetching the next instruction from memory, and the instruction execute involves either reading data from or writing data into memory. In a standard processor without Harvard architecture, the program instructions (that is, the program code) and the data (operands) are held in one memory space; see Figure 3. Thus the fetching of the next instruction while the current one is executing is not allowed, because the fetch and execute phases each require a memory access.
[Figure: an MPU with program counter and instruction register fetching the instructions LDA ADR1, STA ADR2 and STA ADR3 from a single memory that holds both the instructions and the operand addresses ADR1, ADR2 and ADR3; the timing diagram shows the fetch, decode and execute of each instruction proceeding strictly in sequence]
Figure 3. An illustration of instruction fetch, decode and execute in a non-Harvard architecture with a single memory space: (a) instruction fetch from memory; (b) timing diagram.
In a Harvard architecture (Figure 4), since the program instructions and data lie in separate memory spaces, the fetching of the next instruction can overlap the execution of the current instruction; see Figure 5. Normally, the program memory holds the program code, while the data memory stores variables such as the input data samples.
[Figure: a digital signal processor connected to separate program and data memories via independent program memory address, program data, data memory address and data buses]
Figure 4. Basic Harvard architecture with separate data and program memory spaces.
It may be seen from Figure 4 that data and program instruction fetches can be overlapped, as two independent memories are used in the architecture. This is explained with the help of the timing diagram shown in Figure 5 below.
[Figure: timing diagram in which the fetch of each instruction (LDA ADR1, STA ADR2, STA ADR3) overlaps the decode and execute phases of the preceding instructions on successive clock cycles]
Figure 5. An illustration of the instruction overlap made possible by the Harvard architecture.
A strict Harvard architecture is used by some digital signal processors (for example, the Motorola DSP56000), but most use a modified Harvard architecture (for example, the TMS320 family of processors). In the modified architecture used by the TMS320, for example, separate program and data memory spaces are still maintained, but communication between the two memory spaces is permissible, unlike in the strict Harvard architecture.
Pipelining
Pipelining is a technique which allows two or more operations to overlap during execution. In pipelining, a task is broken down into a number of distinct subtasks which are then overlapped during execution. It is used extensively in digital signal processors to increase speed. A pipeline is akin to a typical production line in a factory, such as a car or television assembly plant. As in the production line, the task is broken down into small, independent subtasks called pipe stages. The pipe stages are connected in series to form a pipe, and the stages are executed sequentially. As we have seen in the last example, an instruction can be broken down into three steps. Each step in the instruction can be
regarded as a stage in a pipeline and so can be overlapped. By overlapping the instructions, a new instruction is started at the start of each clock cycle, as shown in Figure 6(a).
[Figure: three instructions, each passing through pipe stages 1 to 3, with a new instruction entering the pipe at the start of every clock cycle]
Figure 6(a).
Figure 6(b) gives the timing diagram for a three-stage pipeline, drawn to highlight the instruction steps. Typically, each step in the pipeline takes one machine cycle.
[Figure: timing diagram of the three-stage pipeline; during clock cycle i the processor fetches instruction i, decodes instruction i-1 and executes instruction i-2]
Figure 6(b).
Thus during a given cycle up to three different instructions may be active at the same time, although each will be at a different stage of completion. The key to an instruction pipeline is that the three
parts of the instruction (that is, fetch, decode and execute) are independent, so the execution of multiple instructions can be overlapped. In Figure 6(b), it is seen that, at the ith cycle, the processor could be simultaneously fetching the ith instruction, decoding the (i-1)th instruction and at the same time executing the (i-2)th instruction.
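The cycle counts implied by Figure 6(b) can be checked with a short sketch (a hypothetical cycle counter, not modelled on any particular processor): without pipelining, each instruction must pass through all three stages before the next can start, while a full three-stage pipeline retires one instruction per cycle after a two-cycle fill.

```python
def cycles_nonpipelined(n_instr, n_stages=3):
    # every instruction runs fetch, decode and execute back to back
    return n_instr * n_stages

def cycles_pipelined(n_instr, n_stages=3):
    # the first instruction fills the pipe (n_stages cycles);
    # every later instruction retires one cycle after the previous one
    return n_stages + (n_instr - 1)

for n in (1, 3, 100):
    print(n, cycles_nonpipelined(n), cycles_pipelined(n))
```

For a long instruction stream the pipelined count approaches one cycle per instruction, which is the ideal speedup of three discussed below.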
The three-stage pipelining discussed above is based on the technique used in the Texas Instruments TMS320 processors. As in other applications of pipelining, in the TMS320 a number of registers are used to achieve the pipeline: a pre-fetch counter holds the address of the next instruction to be fetched, an instruction register holds the instruction to be executed, and a queue instruction register stores the instructions to be executed if the current instruction is still executing. The program counter contains the address of the next instruction to execute.
By exploiting the inherent parallelism in the instruction stream, pipelining leads to a significant reduction, on average, of the execution time per instruction. The throughput of a pipeline
machine is determined by the number of instructions through the pipe per unit time. As in a production line, all the stages in the pipeline must be synchronized. The time for moving an instruction from one step to another within the pipe (see Figure 6(a)) is one cycle, and depends on the slowest stage in the pipeline. In a perfect pipeline, the average time per instruction is given by

average time per instruction (pipelined) = time per instruction (non-pipelined) / number of pipe stages    (1)
In the ideal case, the speed increase is equal to the number of pipe stages. In practice, the speed increase will be less because of the overheads in setting up the pipeline, delays in the pipeline registers, and so on.
Example 1
In a non-pipelined machine, the instruction fetch, decode, and execute take 35 ns, 25 ns, and 40 ns, respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns pipeline overhead at each stage, and ignore other delays.
In the non-pipelined machine, the average instruction time is simply the sum of the execution times of all the steps: 35 + 25 + 40 ns = 100 ns. However, if we assume that the processor has a fixed machine cycle, with the instruction steps synchronized to the system clock, then each instruction would take three machine cycles to complete:

40 ns x 3 = 120 ns (since the slowest step takes 40 ns)

This corresponds to a throughput of 8.3 x 10^6 instructions per second.
In the pipelined machine, the clock speed is determined by the speed of the slowest stage plus the overheads. In our case, the
machine cycle is 40 + 5 = 45 ns. This places a limit on the average instruction execution time. The throughput (when the pipeline is full) is 22.2 x 10^6 instructions per second. Then

speedup = average instruction time (non-pipelined) / average instruction time (pipelined)
        = 120/45 = 2.67 times (assuming the non-pipelined machine executes in three cycles)
In the pipelined machine, each instruction still takes three clock cycles, but at each cycle the processor is executing up to three different instructions. Pipelining increases the system throughput, but not the execution time of each instruction on its own. Typically, there is a slight increase in the execution time of each instruction because of the pipeline overhead.
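The figures in Example 1 can be reproduced directly; Python is used here purely as a calculator for the quantities defined above.

```python
def throughput(cycle_ns):
    # instructions per second when one instruction completes per machine cycle
    return 1e9 / cycle_ns

stage_ns = [35, 25, 40]     # fetch, decode, execute times
overhead_ns = 5             # pipeline register overhead per stage

# non-pipelined, clocked machine: three cycles, each set by the slowest step
non_pipe_cycle = max(stage_ns)                   # 40 ns
non_pipe_instr = non_pipe_cycle * len(stage_ns)  # 120 ns per instruction

# pipelined: one instruction retires per cycle once the pipe is full
pipe_cycle = max(stage_ns) + overhead_ns         # 45 ns

print(round(throughput(non_pipe_instr) / 1e6, 1))  # 8.3 MIPS
print(round(throughput(pipe_cycle) / 1e6, 1))      # 22.2 MIPS
print(round(non_pipe_instr / pipe_cycle, 2))       # speedup: 2.67
```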
Pipelining has a major impact on the system memory. The number of memory accesses in a pipelined machine increases, essentially by the number of stages. In DSP, the use of the Harvard architecture, where data and instructions lie in separate memory spaces, promotes pipelining.
When a slow unit, such as a data memory, and an arithmetic element are connected in series, the arithmetic unit often waits idly for a good deal of the time for data. Pipelining may be used in such cases to allow a better utilization of the arithmetic unit. The next example illustrates this concept.
Example 2
Most DSP algorithms are characterized by multiply-and-accumulate operations typified by the following equation:
a0 x(n) + a1 x(n-1) + a2 x(n-2) + ... + aN-1 x(n-(N-1))
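This sum is the inner product computed by a multiply-accumulate unit; a minimal Python sketch follows (the three-tap coefficients and samples are illustrative values, not taken from the notes).

```python
def fir_output(coeffs, samples):
    # y(n) = a0*x(n) + a1*x(n-1) + ...; samples are given newest first
    acc = 0
    for a, x in zip(coeffs, samples):
        acc += a * x      # one multiply-accumulate per tap
    return acc

# three-tap example: a = [0.5, 0.25, 0.25], x(n) = 4, x(n-1) = 2, x(n-2) = 0
print(fir_output([0.5, 0.25, 0.25], [4, 2, 0]))   # 2.5
```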
Figure 7 shows a non-pipelined configuration of an arithmetic element for executing the above equation. Assume a transport delay of 200 ns, 100 ns, and 100 ns, respectively, for the memory, multiplier and accumulator.
[Figure: a coefficient memory (a0 ... aN-1) and a data memory (x(n) ... x(n-(N-1))) feeding a multiplier and accumulator connected in series]
Figure 7. Non-pipelined MAC configuration. Products are clocked into the accumulator every 400 ns.
1. What is the system throughput?
2. Reconfigure the system with pipelining to give a speed increase of 2:1. Illustrate the operation of the new
configuration with a timing diagram.
Solution:
1. The coefficients and the data arrays are stored in memory as shown in Figure 7. In the non-pipelined mode, the coefficients and data are accessed sequentially and applied to the multiplier. The products are summed in the accumulator. Successive multiply-accumulate (MAC) operations will be performed once every 400 ns (200 + 100 + 100), giving a throughput of 2.5 x 10^6 operations per second.
2. The arithmetic operations involved can be broken up into three distinct steps: memory read, multiply, and accumulate. To improve speed, these steps can be overlapped. A speed improvement of 2:1 can be achieved by inserting pipeline registers between the memory and multiplier and between the multiplier and accumulator, as shown in Figure 8.
[Figure: the coefficient and data memories of Figure 7 feeding pipeline registers, the multiplier, a product register and the accumulator in series]
Figure 8. Pipelined MAC configuration. The pipeline registers serve as a temporary store for each coefficient and data sample pair; the product register serves as a temporary store for the product.
The timing diagram for the pipelined configuration is shown in Figure 9. As is evident in the timing diagram, a MAC operation is performed once every 200 ns. The limiting factor is the basic transport delay through the slowest element, in this case the memory. Pipeline overheads have been ignored.
[Figure: timing diagram showing the read, multiply and accumulate stages of successive MACs overlapped, e.g. a0x(0) being accumulated while a1x(1) is being multiplied and x(2) is being read]
Figure 9. Timing diagram for a pipelined MAC unit. When the pipeline is full, a MAC operation is performed every clock cycle (200 ns).
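The schedule of Figure 9 can be sketched with a toy model: each stage is assumed to take one 200 ns cycle, and operand pair k is read in cycle k, multiplied in cycle k+1 and accumulated in cycle k+2.

```python
def pipelined_mac_schedule(n_ops, stage_ns=200):
    # read, multiply and accumulate each take one 200 ns cycle and run
    # concurrently on successive operand pairs, as in Figure 9
    events = []
    for cycle in range(n_ops + 2):          # two extra cycles drain the pipe
        t = cycle * stage_ns
        stages = {
            "read": cycle if cycle < n_ops else None,
            "multiply": cycle - 1 if 0 <= cycle - 1 < n_ops else None,
            "accumulate": cycle - 2 if 0 <= cycle - 2 < n_ops else None,
        }
        events.append((t, stages))
    return events

# three MACs: once the pipe is full (cycle 2), all three stages are busy
for t, stages in pipelined_mac_schedule(3):
    print(t, stages)
```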
DSP algorithms are often repetitive but highly structured, making them well suited to multilevel pipelining. For example, the FFT requires the continuous calculation of butterflies. Although each butterfly requires different data and coefficients, the basic butterfly arithmetic operations are identical. Thus arithmetic units such as FFT processors can be tailored to take advantage of this. Pipelining ensures a steady flow of instructions to the CPU and, in general, leads to a significant increase in system throughput.
However, on occasion pipelining may cause problems. For example, in some digital signal processors, pipelining may cause an unwanted instruction to be executed, especially near branch instructions, and the designer should be aware of this possibility.
Hardware multiplier-accumulator:
The basic numerical operations in DSP are multiplications and additions. Multiplication, in software, is notoriously time consuming. Additions are even more time consuming if floating-point arithmetic is used. To make real-time DSP possible, a fast, dedicated hardware multiplier-accumulator (MAC) using fixed- or floating-point arithmetic is mandatory. A fixed- or floating-point hardware MAC is now standard in all digital signal processors. In a fixed-point processor, the hardware multiplier typically accepts two 16-bit 2's complement fractional numbers and computes a 32-bit product in a single cycle (25 ns, typically). The average MAC instruction time can be significantly reduced through the use of special repeat instructions.
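The 16-bit fractional multiply described above can be illustrated in Python; the Q15 scaling and the helper names are assumptions for illustration, not taken from any particular device.

```python
def q15(x):
    # quantize a value in [-1, 1) to a 16-bit 2's complement fraction (Q15)
    return max(-32768, min(32767, int(round(x * 32768))))

def q15_multiply(a, b):
    # 16-bit x 16-bit -> full 32-bit product, as the hardware multiplier
    # produces it; |a*b| <= 2^30, so the product always fits in 32 bits
    return a * b

a, b = q15(0.5), q15(-0.25)
p = q15_multiply(a, b)
print(a, b, p, p / 2**30)   # the product scaled back to a fraction
```

Dividing the raw product by 2^30 recovers the fractional value (0.5 x -0.25 = -0.125), which is what the accumulator effectively sums.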
A typical DSP hardware MAC configuration is depicted in Figure 10. In this configuration the multiplier has a pair of input registers that hold the inputs to the multiplier, and a 32-bit product register which holds the result of a multiplication. The output of the P (product) register is connected to a double-precision accumulator, where the products are accumulated.
[Figure: 16-bit X and Y data inputs feed the X and Y input registers of a multiplier; the 32-bit product goes to the P (product) register and then to a 32-bit R (result) accumulator register]
Figure 10. A typical MAC configuration in DSPs.
The principle is very much the same for hardware floating-point multiplier-accumulators, except that the inputs and products are normalized floating-point numbers. Floating-point MACs allow fast computation of DSP results with minimal errors. DSP algorithms such as FIR and IIR filtering suffer from the effects of finite word length (coefficient quantization and arithmetic errors). Floating point offers a wide dynamic range and reduced arithmetic errors, although for many applications the dynamic range provided by the fixed-point representation is adequate.
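The dynamic-range comparison can be made concrete with standard textbook figures, roughly 6 dB per bit for fixed point; the use of IEEE single-precision limits below is an assumption, since the notes do not name a floating-point format.

```python
import math

# fixed point: a 16-bit 2's complement fraction (Q15); dynamic range is the
# ratio of full scale (~1.0) to the smallest step (2^-15): about 90 dB
fixed_range_db = 20 * math.log10(2 ** 15)

# IEEE single-precision float (assumed format): normalized magnitudes span
# roughly 2^-126 to 2^128, giving a far wider dynamic range
float_range_db = 20 * math.log10(2.0 ** 128 / 2.0 ** -126)

print(round(fixed_range_db))    # 90
print(round(float_range_db))    # 1529
```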
General-purpose digital signal processors:
General-purpose digital signal processors are basically high-speed microprocessors with hardware architectures and instruction sets optimized for DSP operations. These processors make extensive use of parallelism, Harvard architecture, pipelining and dedicated hardware wherever possible to perform time-consuming operations, such as shifting/scaling, multiplication, and so on.
General-purpose DSPs have evolved substantially over the last decade as a result of the never-ending quest to find better ways to perform DSP operations in terms of computational efficiency, ease of implementation, cost, power consumption, size, and application-specific needs. The insatiable appetite for improved computational efficiency has led to substantial reductions in instruction cycle times and, more importantly, to increasing sophistication in the hardware and software architectures. It is now common to have dedicated on-chip arithmetic hardware units (e.g. to support fast multiply-accumulate operations), large on-chip memory with multiple access, and special instructions for efficient execution of inner-core computations in DSP. There is also a trend towards increased data word sizes (e.g. to maintain signal quality) and increased parallelism (to increase both the number of instructions executed in one cycle and the number of
operations performed per instruction). Thus, in newer general-purpose DSP processors, increasing use is made of multiple data paths and arithmetic units to support parallel operations. DSP processors based on SIMD (Single Instruction, Multiple Data), VLIW (Very Long Instruction Word) and superscalar architectures are being introduced to support efficient parallel processing. In some DSPs, performance is enhanced further by using specialized on-chip co-processors to speed up specific DSP algorithms such as FIR filtering and Viterbi decoding. The explosive growth in communications and digital audio technologies has had a major influence on the evolution of DSPs, as has growth in embedded DSP processor applications.
Fixed-Point Digital Signal Processors:

Fixed-point DSP processors available today differ in their detailed architecture and the on-board resources provided. A summary of the key architectures of four generations of fixed-point DSP processors from four leading semiconductor manufacturers is given in Table 1. The classification of DSP processors into the four generations is partly based on historical reasons, architectural features, and computational performance.

The basic architecture of the first generation fixed-point DSP processor family (TMS320C1x), first introduced in 1982 by Texas Instruments, is depicted in Figure 11.
[Figure: a 16 x 16-bit multiplier with 16-bit input registers, a 32-bit ALU and a 32-bit accumulator, connected to separate program and data memories over a program memory bus and 16-bit data buses]
Figure 11. A simplified architecture of a first generation fixed-point DSP processor (Texas Instruments TMS320C10).
Key features of the TMS320C1x are the dedicated arithmetic units, which include a multiplier and an accumulator. The processor family has a modified Harvard architecture with two separate memory spaces for programs and data. It has an on-chip memory
and special instructions for the execution of basic DSP algorithms, although these are limited.
Second generation fixed-point DSPs have substantially enhanced features compared to the first generation. In most cases, these include much larger on-chip memories and more special instructions to support efficient execution of DSP algorithms. As a result, the computational performance of second generation DSP processors is four to six times that of the first generation.
Typical second generation DSP processors include the Texas Instruments TMS320C5x, Motorola DSP5600x, Analog Devices ADSP21xx and Lucent Technologies DSP16xx families. Texas Instruments first and second generation DSPs have a lot in common architecturally, but second generation DSPs have more features and increased speed. The internal architecture that typifies the TMS320C5x family of processors is shown in Figure 12 in a simplified form to emphasize the dual internal memory spaces which are characteristic of the Harvard architecture.
The Motorola DSP5600x processor is a high-precision fixed-point digital signal processor. Its architecture is depicted in Figure 13.
[Figure: program memory (ROM/RAM) and X and Y data memories connected by 24-bit X data, Y data and global data buses; the arithmetic units comprise a 24 x 24-to-56-bit MAC and two 56-bit accumulators; a data bus switch multiplexes the internal buses onto a single 24-bit external data bus]
Figure 13. A simplified architecture of a second generation fixed-point DSP (Motorola DSP56002).
Internally, it has two independent data memory spaces, the X-data and Y-data memory spaces, and one program memory space. Having two separate data memory spaces allows a natural partitioning of data for DSP operations and facilitates the
execution of the algorithm. For example, in graphics applications data can be stored as X and Y data, in FIR filtering as coefficients and data, and in the FFT as real and imaginary parts. During program execution, pairs of data samples can be fetched from or stored in internal memory simultaneously in one cycle. Externally, the two data spaces are multiplexed onto a single data bus, reducing somewhat the benefits of the dual internal data memory. The arithmetic units consist of two 56-bit accumulators and a single-cycle, fixed-point hardware multiplier-accumulator (MAC). The MAC accepts 24-bit inputs and produces a 56-bit product. The 24-bit word length provides sufficient accuracy for representing most DSP variables, while the 56-bit accumulator (including eight guard bits) prevents arithmetic overflows. These word lengths are adequate for most applications, including digital audio, which imposes stringent requirements. The DSP5600x processors provide special instructions that allow zero-overhead looping and a bit-reversed addressing capability for scrambling input data before the FFT, or unscrambling the fast-Fourier-transformed data.
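Bit-reversed addressing of the kind mentioned above can be sketched in Python; this is a generic radix-2 index-reversal routine, not the DSP5600x addressing mode itself.

```python
def bit_reverse(index, n_bits):
    # reverse the n_bits-wide binary representation of index
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(n):
    # addressing order used to unscramble radix-2 FFT output (n a power of 2)
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]

print(bit_reversed_order(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

A hardware bit-reversed address generator produces this sequence with no per-access overhead, which is what makes the unscrambling step effectively free on such processors.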
The Analog Devices ADSP21xx is another family of second generation fixed-point DSP processors, with two separate external memory spaces: one holds data only, and the other holds program code as well as data. A simplified block diagram of the internal architecture of the ADSP21xx is depicted in Figure 14.
[Figure: program and data memories connected over a 24-bit program memory path and a 16-bit data memory path; the arithmetic units comprise an ALU, a MAC and a shifter]
Figure 14. A simplified architecture of a second generation fixed-point DSP (Analog Devices ADSP2100).
The main components are the ALU, multiplier-accumulator, and shifters. The MAC accepts 16 x 16-bit inputs and produces a 32-bit product in one cycle. The accumulator of the ADSP21xx has eight guard bits which may be used for extended precision. The ADSP21xx departs from the strict Harvard architecture, as it allows the storage of both data and program instructions in the program memory. A signal line (the data access signal) is used to indicate when data, and not program instructions, are being fetched from the program memory. Storage of data in the program memory inhibits a steady data flow through the CPU, as data and instruction fetches cannot occur simultaneously. To avoid a bottleneck, the ADSP21xx family has an on-chip program memory cache which holds the last 16 instructions executed. This eliminates the need, especially when executing program loops, for repeated instruction fetches from the program memory. The ADSP21xx provides special instructions for zero-overhead looping and
supports a bit-reversed addressing facility for the FFT. The processor family has a large on-chip memory (up to 64 Kbytes of internal RAM is provided for increased data transfer). The processor has excellent support for DMA: external devices can transfer data and instructions to or from the DSP processor RAM without processor intervention.
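The effect of such a small program cache on a loop can be sketched with a toy model; the cache organization below (a simple FIFO of the last 16 instructions) is an assumption for illustration, not the actual ADSP21xx design.

```python
class ProgramCache:
    # toy model of a cache holding the last `size` instructions executed,
    # so loop bodies are refetched from the cache rather than program memory
    def __init__(self, size=16):
        self.size = size
        self.lines = {}        # address -> instruction word
        self.order = []        # fetch order, oldest first
        self.memory_fetches = 0

    def fetch(self, addr, program):
        if addr in self.lines:
            return self.lines[addr]            # hit: no memory access needed
        self.memory_fetches += 1               # miss: go to program memory
        if len(self.order) == self.size:
            self.lines.pop(self.order.pop(0))  # evict the oldest line
        self.lines[addr] = program[addr]
        self.order.append(addr)
        return self.lines[addr]

program = {a: f"instr_{a}" for a in range(8)}
cache = ProgramCache(16)
for _ in range(100):            # a 100-iteration loop of 8 instructions
    for addr in range(8):
        cache.fetch(addr, program)
print(cache.memory_fetches)     # 8: only the first pass touches memory
```

Because the 8-instruction loop body fits in the 16-entry cache, only the first iteration generates program memory fetches; the remaining 99 iterations leave the program memory free for data accesses.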
The Lucent Technologies DSP16xx family of fixed-point DSPs (see Figure 15) is targeted at the telecommunications and modem market.
[Figure: program memory with cache and data memory on 16-bit X and Y data buses; the arithmetic units comprise a 16 x 16-bit multiplier, an ALU and two 36-bit accumulators]
Figure 15. A simplified architecture of the Lucent Technologies DSP16xx fixed-point DSP.
In terms of computational performance, it is one of the most powerful second generation processors. The processor has a Harvard architecture and, like most of the other second generation processors, it has two data paths, the X and Y data paths. Its data arithmetic units include a dedicated 16 x 16-bit multiplier, a 36-bit ALU/shifter (which includes four guard bits) and dual accumulators. Special instructions, such as those for zero-overhead single- and block-instruction looping, are provided.
Third generation fixed-point DSPs are essentially enhancements of second generation DSPs. In general, performance enhancements are achieved by increasing and/or making more effective use of the available on-chip resources. Compared to the second generation DSPs, features of the third generation DSPs include more data paths (typically three, compared to two in the second generation), wider data paths, larger on-chip memory and instruction cache, and in some cases a dual MAC. As a result, the performance of third generation DSPs is typically two or three times that of the second generation DSP processors of the same family. Simplified architectures of three third generation DSP processors, the TMS320C54x, DSP563x and DSP16000, are depicted in Figures 16, 17 and 18.
[Figure: 16K-word program ROM and 8K- and 24K-word program/data RAMs on multiple data buses (program, C and D data buses); the arithmetic units comprise a MAC (17 x 17-bit multiplier, 40-bit adder, round/scale logic, two 40-bit accumulators) and an ALU section (40-bit ALU, 40-bit shifter, Viterbi accelerator)]
Figure 16. A simplified architecture of a third generation fixed-point DSP (Texas Instruments TMS320C54x).
[Figure: a 4K-word program cache and 2K-word X and Y data RAMs on program, X and Y data buses; the data ALU comprises a 24 x 24-bit MAC, two 56-bit accumulators and a shifter]
Figure 17. A simplified architecture of a third generation fixed-point DSP (Motorola DSP56300).
[Figure: program and data memories on 32-bit X and Y data buses; the arithmetic unit comprises two 16 x 16 MACs, an ALU, an adder and eight 40-bit accumulators]
Figure 18. A simplified architecture of a third generation fixed-point DSP (Lucent Technologies DSP16000).
Most of the third generation fixed-point DSP processors are aimed at applications in digital communication and digital audio, reflecting the enormous growth and influence of these application areas on DSP processor development. Thus there are features in some of the processors that support these applications. In the third generation processors, semiconductor manufacturers have also taken the issue of power consumption seriously because of its importance in portable and handheld devices.
Fourth generation fixed-point processors with their new architectures are primarily aimed at large and/or emerging multi-channel applications, such as digital subscriber loops, remote access server modems, wireless base stations, third generation mobile systems and medical imaging. The new fixed-point architecture that has attracted a great deal of attention in the DSP community is the very long instruction word (VLIW). The new architecture makes extensive use of parallelism whilst retaining some of the good features of previous DSP processors. Compared to previous generations, fourth generation fixed-point DSP processors, in general, have wider instruction words, wider data paths, more registers, larger instruction caches and multiple arithmetic units, enabling them to execute many more instructions and operations per cycle.
The Texas Instruments TMS320C62x family of fixed-point DSP processors is based on the VLIW architecture, as shown in Figure 19.
[Block diagram: on-chip memory units (program RAM, data RAM); instruction fetch, dispatch and decode unit; data path 1 (register file 1; execution units L1, S1, M1, D1) and data path 2 (register file 2; execution units L2, S2, M2, D2); 256-bit program data bus; 32-bit data buses A and B.]

Figure 19. A simplified architecture of a fourth generation fixed-point, very long instruction word, DSP processor (Texas Instruments TMS320C62x). Note the two independent arithmetic data paths, each with four execution units: L1, S1, M1 and D1; L2, S2, M2 and D2.
The core processor has two independent arithmetic paths, each with four execution units: a logic unit (Li), a shifter/logic unit (Si), a multiplier (Mi) and a data address unit (Di). Typically, the core
processor fetches eight 32-bit instructions at a time, giving an instruction width of 256 bits (and hence the term very long instruction word). With a total of eight execution units, four in each data path, the TMS320C62x can execute up to eight instructions in parallel in one cycle. The processor has large program and data cache memories (typically 4 Kbyte of level 1 program/data caches and 64 Kbyte of level 2 program/data cache). Each data path has its own register file (sixteen 32-bit registers), but can also access registers on the other data path. Advantages of VLIW architectures include simplicity and high computational performance. Disadvantages include increased program memory usage (organization of code to match the inherent parallelism of the processor may lead to inefficient use of memory). Further, optimum processor performance can only be achieved when all the execution units are busy, which is not always possible because of data dependencies, instruction delays and restrictions on the use of the execution units. However, sophisticated programming tools are available for code packing, instruction scheduling, resource assignment and, in general, to exploit the vast potential of the processor.
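The code-packing idea described above can be sketched in a few lines: given a stream of operations, each needing one execution-unit type (L, S, M or D), a greedy scheduler packs as many operations as unit availability allows into each fetch packet. This is an illustrative model only, not the C62x compiler's scheduling algorithm; it ignores data dependencies, and the function names are hypothetical.

```python
# Illustrative greedy packing of operations into VLIW fetch packets.
# Units per cycle mirror the C62x layout: two each of types L, S, M, D.
# Dependency checks are deliberately omitted for simplicity.

from collections import Counter

UNITS_PER_CYCLE = Counter({"L": 2, "S": 2, "M": 2, "D": 2})

def pack(ops):
    """Pack a list of unit-type requests into per-cycle packets."""
    packets = []
    free = UNITS_PER_CYCLE.copy()
    current = []
    for op in ops:
        if free[op] == 0:          # no unit of this type left this cycle
            packets.append(current)
            current, free = [], UNITS_PER_CYCLE.copy()
        free[op] -= 1
        current.append(op)
    if current:
        packets.append(current)
    return packets

# Eight independent ops matching unit availability fit in one cycle:
print(pack(["L", "L", "S", "S", "M", "M", "D", "D"]))  # 1 packet
# Four multiplies need two cycles, since only two M units exist:
print(pack(["M", "M", "M", "M"]))                      # 2 packets
```

This also illustrates why utilization suffers in practice: a stream dominated by one operation type leaves the other units idle.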
Floating-point digital signal processors:
The ability of DSP processors to perform high speed, high precision DSP operations using floating-point arithmetic has been a welcome development. It minimizes the finite word length effects inherent in DSP, such as overflows, round-off errors and coefficient quantization errors. It also facilitates algorithm development, as a designer can develop an algorithm on a large computer in a high level language and then port it to a DSP device more readily than with a fixed-point device.
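The round-off effect mentioned above can be demonstrated with a small simulation: accumulating many samples in 16-bit (Q15) fixed point, where every intermediate result is quantized, versus double-precision floating point. The signal values and the helper name are made up for illustration.

```python
# Sketch: accumulated round-off error in fixed-point (Q15) versus
# floating-point arithmetic. Q15 here means a 16-bit two's-complement
# fraction with a quantization step of 2**-15.

def to_q15(x):
    """Quantize x (|x| < 1) to the nearest Q15 level."""
    return round(x * 2**15) / 2**15

# Accumulate 1000 values that do not fall on Q15 grid points.
values = [0.3 / 997 for _ in range(1000)]

float_sum = sum(values)                      # double precision reference
q15_sum = 0.0
for v in values:
    q15_sum = to_q15(q15_sum + to_q15(v))    # quantize after every step

error = abs(q15_sum - float_sum)
print(f"float sum = {float_sum:.9f}")
print(f"Q15 sum   = {q15_sum:.9f}  (error {error:.2e})")
```

The per-sample quantization error is tiny, but it accumulates over the loop; this is exactly the effect that wider words, guard bits or floating-point arithmetic are meant to suppress.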
Floating-point DSP processors retain key features of fixed-pointprocessors such as special instructions for DSP operations andmultiple data paths for multiple operations. As in the case of fixed-
point DSP processors, the floating-point DSP processors available differ significantly in their architectures.
The TMS320C3x is perhaps the best known family of first generation general-purpose floating-point DSPs. The C3x family are 32-bit single chip digital signal processors and support both integer and floating-point arithmetic operations. They have a large memory space and are equipped with many on-chip peripheral facilities to simplify system design. These include a program cache to improve the execution of commonly used code, and on-chip dual access memories. The large memory space caters for memory intensive applications, for example graphics and image processing. In the TMS320C30, a floating-point multiplication takes 32-bit operands and produces a 40-bit normalized floating-point product. Integer multiplication takes 24-bit inputs and yields 32-bit results. Three floating-point formats are supported. The first is a 16-bit short floating-point format, with a 4-bit exponent, 1 sign bit and 11 bits for the mantissa; this format is for immediate floating-point operations. The second is a single-precision format with an 8-bit exponent, 1 sign bit and a 23-bit fraction (32 bits in total). The third is a 40-bit extended precision format which has an 8-bit exponent, 1 sign bit and a 31-bit fraction. The floating-point representation differs from the IEEE standard, but facilities are provided to allow conversion between the two formats. The TMS320C3x combines the features of a Harvard architecture (separate buses for program instructions, data and I/O) and a von Neumann processor (unified address space).
The emphasis in the second generation general-purpose floating-point DSPs is on multiprocessing and multiprocessor support. Key issues in multiprocessor support include inter-processor communication, DMA transfers and global memory sharing. The best known second generation floating-point DSP families are the Texas Instruments TMS320C4x and Analog Devices ADSP-2106x SHARC (Super Harvard Architecture Computer). The C4x
shares some of the architectural features of the C3x, but it was designed for multiprocessing. The C4x family has good I/O capabilities: it has six COMM ports for inter-processor communication and six 32-bit wide DMA channels for rapid data transfers. The architecture allows multiple operations to be performed in parallel in one instruction cycle. The C4x family supports both floating- and fixed-point arithmetic. The native floating-point data format in the C40 differs from the IEEE 754/854 standard, although conversion between them can be readily accomplished.
Analog Devices ADSP-2106x SHARC DSP processors are also 32-bit floating-point devices. They have large internal memories and impressive I/O capability: 10 DMA channels to allow access to internal memory without processor intervention, and six link ports for high speed inter-processor communication. The architecture allows shared global memory, making it possible for up to six SHARC processors to access each other's internal RAM at up to the full data rate. The ADSP-2106x family supports both fixed-point and floating-point arithmetic. Its single precision floating-point format complies with the single precision IEEE 754/854 floating-point standard (24-bit mantissa and 8-bit exponent). The architecture also supports multiple operations per cycle.
Third generation floating-point DSP processors take the concepts of parallelism much further, increasing both the number of instructions and the number of operations per cycle to meet the challenges of multichannel and computationally intensive applications. This is achieved by the use of new architectures, the VLIW (very long instruction word) and superscalar architectures in particular. The two leading third generation floating-point DSP processor families are the Texas Instruments TMS320C67x and the Analog Devices ADSP-TS001. The TMS320C67x family has the
same VLIW architecture as the advanced fourth generation fixed-point DSP processor family, the TMS320C62x.
The TigerSHARC DSP family supports mixed arithmetic types (fixed- and floating-point arithmetic) and data types (8-, 16- and 32-bit numbers). This flexibility makes it possible to use the arithmetic and data types most appropriate for a given application to enhance performance. As with the TMS320C67x, the TigerSHARC is aimed at large-scale, multi-channel applications, such as third generation mobile systems (3G wireless), digital subscriber lines (xDSL) and remote, multiple access server modems for Internet services. The TigerSHARC, with its static superscalar architecture, combines the good features of VLIW architectures, conventional DSP architectures and RISC computers. The processor has two computation blocks, each with a multiplier, an ALU and a 64-bit shifter. The processor can execute up to eight MAC operations per cycle with 16-bit inputs and 40-bit accumulation, two 40-bit MACs on 16-bit complex data, or two 80-bit MACs with 32-bit data. With 8-bit data, the TigerSHARC can issue up to 16 operations in a cycle. The TigerSHARC has a wide memory bandwidth, with its memory organized in three 128-bit wide banks. Access to data can be in variable data sizes: normal 32-bit words, long 64-bit words or quad 128-bit words. Up to four 32-bit instructions can be issued in one cycle. To avoid the use of large NOPs (a disadvantage of VLIW designs), the large instruction words may be broken down into separate short instructions which are issued to each unit independently.
Selecting digital signal processors:
The choice of a DSP processor for a given application has become an important issue in recent years because of the wide range of processors available (Levy, 1999; Berkeley Design Technology, 1996, 1999). Specific factors that may be considered when selecting a DSP processor for an application include architectural features, execution speed, type of arithmetic and word length.
1. Architectural features
Most DSP processors available today have good architectural features, but these may not be adequate for a specific application. Key features of interest include the size of on-chip memory, special instructions and I/O capability. On-chip memory is an essential requirement in most real time DSP applications for fast access to data and rapid program execution. For memory hungry applications (e.g. digital audio, FAX/modem, MPEG coding/decoding), the size of internal RAM may become an important distinguishing factor. Where internal memory is insufficient, it can be augmented by high speed, off-chip memory, although this may add to system costs. For applications that require fast and efficient communication or data flow with the outside world, I/O features such as interfaces to ADCs and DACs, DMA capability and support for multiprocessing may be important. Depending on the application, a rich set of special instructions to support DSP operations is important, e.g. zero-overhead looping capability, dedicated DSP instructions, and circular addressing.
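Circular addressing, mentioned above, lets a DSP implement an FIR delay line by wrapping a buffer pointer in hardware instead of shifting samples. A software analogue can be sketched with a modulo index; the class and names below are illustrative, not taken from any particular processor's toolchain.

```python
# Software analogue of circular addressing: an FIR delay line that
# wraps its write index with a modulo instead of shifting samples.
# On a DSP this wrap happens in dedicated address-generation hardware
# at zero cycle cost.

class CircularDelayLine:
    def __init__(self, taps):
        self.taps = list(taps)          # FIR coefficients h[0..N-1]
        self.buf = [0.0] * len(taps)    # delay line holding x[n-k]
        self.head = 0                   # next write position

    def process(self, x):
        """Insert sample x and return one FIR output y[n]."""
        self.buf[self.head] = x
        n = len(self.buf)
        # h[k] pairs with x[n-k]: read backwards from the head, wrapping
        y = sum(self.taps[k] * self.buf[(self.head - k) % n]
                for k in range(n))
        self.head = (self.head + 1) % n  # circular wrap
        return y

# 4-point moving average: a step input settles to 1.0 after 4 samples
fir = CircularDelayLine([0.25, 0.25, 0.25, 0.25])
print([fir.process(1.0) for _ in range(6)])
# → [0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

The point of the hardware feature is that the `% n` wrap, an explicit operation here, costs nothing on a DSP with circular addressing registers.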
2. Execution speed
The speed of DSP processors is an important measure of performance because of the time-critical nature of most DSP tasks. Traditionally, the two main units of measurement for this are the clock speed of the processor, in MHz, and the number of instructions performed, in millions of instructions per second (MIPS) or, in the case of floating-point DSP processors, in millions of floating-point operations per second (MFLOPS). However, such measures may be inappropriate in some cases because of significant differences in the way different DSP processors operate, with most able to perform multiple operations in one machine instruction. For example, the C62x family of processors can execute as many as eight instructions in a cycle. The number of operations performed in each cycle also differs from processor to processor. Thus, comparison of the execution speed of processors based on such measures may not be meaningful. An alternative measure is based on the execution speed of benchmark algorithms, e.g. DSP kernels such as the FFT, and FIR and IIR filters (Levy, 1998; Berkeley Design Technology, 1999).
3. Type of arithmetic
The two most common types of arithmetic used in modern DSP processors are fixed- and floating-point arithmetic. Floating-point arithmetic is the natural choice for applications with wide and variable dynamic range requirements (dynamic range may be defined as the difference between the largest and smallest signal levels that can be represented, or the difference between the largest signal and the noise floor, measured in decibels). Fixed-point processors are favored in low cost, high volume applications (e.g. cellular phones and computer disk drives). The use of fixed-point arithmetic
raises issues associated with dynamic range constraints which the designer must address. In general, floating-point processors are more expensive than fixed-point processors, although the cost difference has fallen significantly in recent years. Most floating-point DSP processors available today also support fixed-point arithmetic.
4. Word length
The processor data word length is an important parameter in DSP as it can have a significant impact on signal quality: it determines how accurately parameters and results of DSP operations can be represented. In general, the longer the data word, the lower the errors introduced by digital signal processing. In fixed-point audio processing, for example, a processor word length of at least 24 bits is required to keep the smallest signal level sufficiently above the noise floor generated by signal processing to maintain CD quality. A variety of processor word lengths are used in fixed-point DSP processors, depending on the application. Fixed-point DSP processors aimed at telecommunications markets tend to use a 16-bit word length (e.g. TMS320C54x), whereas those aimed at high quality audio applications tend to use 24 bits (e.g. DSP56300). In recent years there has been a trend towards the use of more bits for the ADC and DAC (e.g. the Cirrus 24-bit audio codec, CS4228) as the cost of these devices falls to meet the insatiable demand for increased quality. Thus, there is likely to be an increased demand for larger processor word lengths for audio processing. In fixed-point processors, it may also be necessary to provide guard bits (typically 1 to 8 bits) in the accumulators to prevent arithmetic overflows during extended multiply and accumulate operations. The extra bits effectively extend the dynamic range available in the DSP processor. In most floating-point DSP processors, a 32-bit
data size (24-bit mantissa and 8-bit exponent) is used for single-precision arithmetic. This size is also compatible with the IEEE floating-point format (IEEE 754). Most floating-point DSP processors also have fixed-point arithmetic capability, and often support variable data size fixed-point arithmetic.
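Two standard rules of thumb underlie the word-length discussion above: each bit contributes about 6.02 dB of dynamic range, and summing N worst-case products can grow the result by log2(N) bits, which is what the accumulator guard bits absorb. The helper names below are illustrative.

```python
# Rules of thumb behind word length and guard bits:
# (1) dynamic range of a b-bit word ~ 20*log10(2**b) ~ 6.02*b dB;
# (2) accumulating n_terms full-scale products needs about
#     ceil(log2(n_terms)) guard bits to avoid overflow.

import math

def dynamic_range_db(bits):
    """Approximate dynamic range of a bits-wide fixed-point word."""
    return 20 * math.log10(2**bits)

def guard_bits_needed(n_terms):
    """Headroom bits to sum n_terms worst-case products safely."""
    return math.ceil(math.log2(n_terms))

print(f"16-bit word: ~{dynamic_range_db(16):.1f} dB")   # ~96.3 dB
print(f"24-bit word: ~{dynamic_range_db(24):.1f} dB")   # ~144.5 dB
# A 256-tap MAC loop needs 8 guard bits, within the 1 to 8 bits
# typically provided in fixed-point DSP accumulators:
print(guard_bits_needed(256))                           # → 8
```

These two numbers explain the section's figures: 16 bits (~96 dB) suits telephony, while CD-quality audio processing pushes toward 24-bit words (~144 dB) to keep processing noise below the signal floor.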
TMS320C6416 DSP Board
Functional block and DSP core diagram for TMS320C6416 DSP