Post on 24-Jun-2020
Introduction to Digital Signal Processors (DSPs)
Outline/objectives
• Identify the most important DSP processor
architecture features and how they relate
to DSP applications
• Understand the types of code appropriate
for DSP implementation
What is a DSP?
• A specialized microprocessor for real-time DSP applications
– Digital filtering (FIR and IIR)
– FFT
– Convolution, Matrix Multiplication etc
[Figure: signal chain. Analog input → ADC → DSP → DAC → analog output; the DSP can also take digital input and produce digital output directly.]
Hardware used in DSP
                   ASIC       FPGA     GPP      DSP
Performance        Very high  High     Medium   Medium-high
Flexibility        Very low   High     High     High
Power consumption  Very low   Low      Medium   Low-medium
Development time   Long       Medium   Short    Short
Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units)
• Single-Instruction Multiple-Data (SIMD) and Very Long Instruction Word (VLIW) architectures
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA
Harvard Architecture
• Physically separate memories and paths for instruction and data
[Figure: a CPU connected by separate paths to a data memory and a program memory.]
Single-Cycle MAC unit
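[Figure: MAC datapath. A multiplier forms the product a_i x_i, an adder adds it to the running sum a_{i-1} x_{i-1} + …, and a register holds the accumulated result and feeds it back to the adder.]
$\sum_{i=0}^{n} a_i x_i$
Can compute a sum of n products in n cycles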
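The accumulation performed by the MAC unit can be sketched in C. This is only an illustration under assumed 16-bit fixed-point inputs. The function name `mac_sum` and the data types are hypothetical, not taken from the slides. Each loop iteration corresponds to one multiply-accumulate step, which a hardware MAC unit would complete in a single cycle.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products: acc = a[0]*x[0] + ... + a[n-1]*x[n-1].
   mac_sum is a hypothetical name used for illustration only. */
int64_t mac_sum(const int16_t *a, const int16_t *x, size_t n)
{
    int64_t acc = 0;                      /* plays the role of the accumulator register */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)a[i] * x[i];      /* multiply, then add into the accumulator */
    }
    return acc;
}
```

The wide accumulator mirrors the guard bits that DSP accumulators typically provide so that a long sum of products does not overflow.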
Single Instruction - Multiple Data (SIMD)
• A technique for data-level parallelism by employing a number of processing elements working in parallel
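The idea of one operation acting on several data elements at once can be imitated in portable C by packing four 8-bit lanes into one 32-bit word. This is a software sketch, not how a SIMD unit is built. The name `add4x8` is hypothetical, and real DSPs expose dedicated packed-add instructions instead.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed in a 32-bit word.
   Each lane wraps around independently; no carry crosses a lane boundary.
   add4x8 is a hypothetical name used for illustration only. */
uint32_t add4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7F7F7F7Fu;            /* low 7 bits of every lane */
    const uint32_t top  = 0x80808080u;            /* top bit of every lane */
    uint32_t partial = (a & low7) + (b & low7);   /* carries stay inside each lane */
    return partial ^ ((a ^ b) & top);             /* add the top bits without carry-out */
}
```

A single call performs four additions, which is the data-level parallelism the slide describes.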
Very Long Instruction Word (VLIW)
• A technique for
instruction-level
parallelism by executing
instructions without
dependencies (known at
compile-time) in parallel
• Example of a single
VLIW instruction:
F=a+b; c=e/g; d=x&y; w=z*h;
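[Figure: one VLIW instruction with four slots (F=a+b, c=e/g, d=x&y, w=z*h). Each slot is dispatched to its own processing unit (PU), which reads its operands (a, b; e, g; x, y; z, h) and produces its result (F, c, d, w) in parallel with the others.]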
CISC vs. RISC vs. VLIW
Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages, each with a number of phases (cycles):
– Fetch
  • Program Address Generate (PG)
  • Program Address Send (PS)
  • Program ready wait (PW)
  • Program receive (PR)
– Decode
  • Dispatch (DP)
  • Decode (DC)
– Execute
  • 6 to 10 phases
Saturation Arithmetic
• Fixed range for operations like addition and multiplication
• Overflow and underflow produce the maximum and minimum allowed value, respectively, instead of wrapping around
• Associativity and distributivity no longer apply
• One-byte signed saturation arithmetic examples (range -128 to 127):
– 64 + 69 = 127
– -127 - 5 = -128
– (64 + 70) - 25 = 102 ≠ 64 + (70 - 25) = 109
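The clamping behaviour in the examples above can be sketched in C for signed one-byte operands. This is a minimal illustration. The name `sat_add8` is hypothetical, and a DSP would do this in hardware as part of the add instruction.

```c
#include <stdint.h>

/* Saturating addition of two signed 8-bit values.
   sat_add8 is a hypothetical name used for illustration only. */
int8_t sat_add8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;              /* compute in a wider type so nothing is lost */
    if (sum > INT8_MAX) return INT8_MAX;    /* overflow clamps to 127 */
    if (sum < INT8_MIN) return INT8_MIN;    /* underflow clamps to -128 */
    return (int8_t)sum;
}
```

Grouping matters: `sat_add8(sat_add8(64, 70), -25)` saturates in the inner call before the subtraction, while `sat_add8(64, sat_add8(70, -25))` never leaves the representable range, which is why associativity fails.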
Examples
• Perform the following operations using one-byte saturation arithmetic:
– 0x77 + 0x99 =
– 0x4 * 0x42 =
– 0x3 * 0x51 =
Zero Overhead Looping
• Hardware support for loops with a
constant number of iterations using
hardware loop counters and loop buffers
• No branching
• No loop overhead
• No pipeline stalls or branch prediction
• No need for loop unrolling
Hardware Circular Addressing
• Hardware support for circular buffers: a circular buffer is a data structure implementing a fixed-length queue of fixed-size objects, where objects are added at the head of the queue and removed from the tail.
• Requires at least 2 pointers (head and tail)
• Extensively used in digital filtering
$y[n] = a_0 x[n] + a_1 x[n-1] + \dots + a_k x[n-k]$
[Figure: a four-entry circular buffer holding x[n], x[n-1], x[n-2], x[n-3], shown at cycle 1 and at cycle 2, with the head and tail pointers marked in each cycle.]
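A software version of the circular delay line used by an FIR filter might look like the sketch below. It is only an illustration under assumptions. `TAPS`, `fir_state` and `fir_step` are hypothetical names, and because the buffer is always full, a single head index is kept and the tail is implicitly the slot after it. With hardware circular addressing the modulo wrap-around costs no extra instructions.

```c
#include <stddef.h>

#define TAPS 4   /* hypothetical filter length (k + 1 coefficients) */

typedef struct {
    float  x[TAPS];   /* delay line holding the most recent TAPS input samples */
    size_t head;      /* index of the newest sample */
} fir_state;

/* Push one input sample and return y[n] = sum over k of coeff[k] * x[n-k]. */
float fir_step(fir_state *s, const float coeff[TAPS], float sample)
{
    s->head = (s->head + 1) % TAPS;        /* advance the head, wrapping around */
    s->x[s->head] = sample;                /* the newest sample overwrites the oldest */

    float  y   = 0.0f;
    size_t idx = s->head;
    for (size_t k = 0; k < TAPS; k++) {
        y  += coeff[k] * s->x[idx];        /* a_k * x[n-k] */
        idx = (idx + TAPS - 1) % TAPS;     /* step back one sample, wrapping around */
    }
    return y;
}
```

Starting from a zero-initialised state and feeding a unit impulse followed by zeros returns the coefficients one per call, which is a quick way to check the indexing.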
Direct Memory Access (DMA)
• A feature that allows peripherals to access main memory without the intervention of the CPU
• Typically, the CPU initiates the DMA transfer, does other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is complete.
• Can create cache coherency problems (the data in the cache may be different from the data in the external memory after DMA)
• Requires a DMA controller
Cache memory
• Separate instruction and data L1 caches
(Harvard architecture)
• Cache coherence protocols required,
since most systems use DMA
DSP vs. Microcontroller
• DSP
– Harvard architecture
– VLIW/SIMD (parallel execution units)
– No bit-level operations
– Hardware MACs
– DSP applications
• Microcontroller
– Mostly von Neumann architecture
– Single execution unit
– Flexible bit-level operations
– No hardware MACs
– Control applications
Examples
• Estimate how long the following code fragment will take to execute on:
– A general-purpose processor with a 1 GHz operating frequency, a five-stage pipeline, 5 cycles required for multiplication and 1 cycle for addition
– A DSP running at 500 MHz with zero-overhead looping, 6 independent ALUs and 2 independent single-cycle MAC units
for (i=0; i<8; i++)
{
a[i] = 2*i + 3;
b[i] = 3*i + 5;
}
Review Questions
• Which of the following code fragments is appropriate for SIMD implementation?
Fragment 1:
a[0]=b[0]+c[0];
a[2]=b[2]+c[2];
a[4]=b[4]+c[4];
a[6]=b[6]+c[6];
Fragment 2:
a[0]=b[0]&c[0];
a[0]=b[0]%c[0];
a[0]=b[0]+c[0];
a[0]=b[0]/c[0];
• Can the following instructions be merged into one VLIW instruction? If not, into how many?
– a=b+c;
– d=c/e;
– f=d&a;
– g=b%c;
Examples
• How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), with 3 ALUs and 1 MAC available in each, and 8 instructions per word? How many cycles will it take to execute if these are the first instructions in the program and all instructions require 1 cycle, assuming the pipeline architecture of slide 10 with 6 phases of execution?
ADD a1,a2,a3 ;a3 = a1+a2
SUB b1,b3,b4 ;b4 = b1-b3
MUL a2,a3,a5 ;a5 = a2*a3
MUL b3,b4,b2 ;b2 = b3*b4
AND a7,a0,a1 ;a1 = a7 AND a0
MUL a3,a4,a5 ;a5 = a3*a4
OR a6,a3,a2 ;a2 = a6 OR a3