Texas Instruments
Course DSP
Module 1 Outline
• Introduction
• TI DSP Families
• C6000 Architecture
• C6000 Roadmap

Learning Objectives
• Why process signals digitally?
• Definition of a real-time application.
• Why use Digital Signal Processing processors?
• What are the typical DSP algorithms?
• Parameters to consider when choosing a DSP processor.
• Programmable vs ASIC DSP.
Present Day Applications
• Consumer audio: stereo A/D and D/A, PLL, mixers
• Multimedia: stereo audio, imaging, graphics palette, voltage regulation
• Wireless / cellular: voice-band audio, RF codecs, voltage regulation
• HDD: PRML read channel, MR pre-amp, servo control, SCSI transceivers
• Automotive: digital radio A/D/A, active suspension, voltage regulation
• DTAD: speech synthesizer
[Figure: the DSP as technology enabler, shown as a mixed-signal processor at the center of these applications]

System Considerations
• Performance
• Interfacing
• Power
• Size
• Ease of use: programming, interfacing, debugging
• Integration: memory, peripherals
• Cost: device cost, system cost, development cost, time to market
Different Needs? Multiple Families!
• C2000 (C20x/24x/28x), earlier 'C1x and 'C2x: lowest cost, for control systems (motor control, storage, digital control systems)
• C5000 (C54x/55x), earlier 'C5x: efficiency, with the best MIPS per watt / dollar / size (wireless phones, internet audio players, digital still cameras, modems, telephony, VoIP)
• C6000 (C62x/64x/67x), earlier 'C3x, 'C4x and 'C8x: max performance with best ease of use, for multi-channel and multi-function applications (comm infrastructure, wireless base stations, DSL, imaging, multimedia servers, video)
Why go digital?
Digital signal processing techniques are now so powerful that it is sometimes extremely difficult, if not impossible, for analogue signal processing to achieve similar performance.
Examples:
• FIR filter with linear phase.
• Adaptive filters.
Analogue signal processing is achieved using analogue components such as:
• Resistors.
• Capacitors.
• Inductors.
The inherent tolerances of these components, together with temperature changes, voltage changes and mechanical vibration, can dramatically affect the performance of analogue circuitry.

With DSP it is easy to:
• Change applications.
• Correct applications.
• Update applications.
Additionally, DSP reduces:
• Noise susceptibility.
• Chip count.
• Development time.
• Cost.
• Power consumption.
Why NOT go digital?
Some high-frequency signals cannot be processed digitally, for two reasons:
• Analog-to-digital converters (ADCs) cannot work fast enough.
• The application can be too complex to be performed in real time.
Real-time processing
DSP processors have to perform tasks in real time, so how do we define real time?
The definition of real time depends on the application.
Example: a 100-tap FIR filter is performed in real time if the DSP can perform and complete the following operation between two samples:
$$y(n) = \sum_{k=0}^{99} a_k\, x(n-k)$$
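A minimal C sketch of that per-sample operation follows. It assumes floating-point coefficients and a simple shift-register delay line in which `hist[k]` holds x(n-k); both are choices made for illustration, and `fir_sample`/`fir_push` are hypothetical helpers, not TI library routines. A production version would more likely use a circular buffer than shift 100 values per sample.

```c
#include <stddef.h>

#define NTAPS 100  /* the 100-tap filter from the example above */

/* y(n) = sum_{k=0}^{99} a[k] * x(n-k), where hist[k] holds x(n-k)
 * (hist[0] is the newest sample). One multiply-accumulate per tap. */
float fir_sample(const float a[NTAPS], const float hist[NTAPS])
{
    float y = 0.0f;
    for (size_t k = 0; k < NTAPS; k++) {
        y += a[k] * hist[k];
    }
    return y;
}

/* Shift the delay line by one position and insert the newest input. */
void fir_push(float hist[NTAPS], float x_new)
{
    for (size_t k = NTAPS - 1; k > 0; k--) {
        hist[k] = hist[k - 1];
    }
    hist[0] = x_new;
}
```

Each new sample costs one `fir_push` plus one `fir_sample`; both must finish before the next sample arrives for the filter to count as real time.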
We can say that we have a real-time application if: Waiting Time ≥ 0
[Figure: timeline between sample n and sample n+1 (one sample time): the processing time is followed by the waiting time]
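The waiting-time condition can be turned into a rough cycle budget: the sample time expressed in CPU cycles, minus the cycles the filter needs. This is a back-of-the-envelope sketch under stated assumptions: processing time is approximated as ceil(taps / MACs per cycle), ignoring loop overhead and memory stalls, and the clock rate, sample rate and MACs-per-cycle values are caller-supplied inputs, not datasheet figures.

```c
#include <stdint.h>

/* Sample time expressed in CPU cycles (sample_hz must be nonzero). */
uint32_t cycles_per_sample(uint32_t clock_hz, uint32_t sample_hz)
{
    return clock_hz / sample_hz;
}

/* Waiting time in cycles for an N-tap FIR (macs_per_cycle must be > 0).
 * Processing time is modeled as ceil(ntaps / macs_per_cycle) only.
 * A negative result means the filter cannot keep up in real time. */
int64_t waiting_cycles(uint32_t clock_hz, uint32_t sample_hz,
                       uint32_t ntaps, uint32_t macs_per_cycle)
{
    int64_t available  = cycles_per_sample(clock_hz, sample_hz);
    int64_t processing = (ntaps + macs_per_cycle - 1) / macs_per_cycle;
    return available - processing;
}
```

For example, a hypothetical 150 MHz clock at a 48 kHz sample rate gives 3125 cycles per sample, so a 100-tap filter at 2 MACs per cycle leaves 3075 cycles of waiting time.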
Why not use a General Purpose Processor (GPP) such as a Pentium instead of a DSP processor?
• What is the power consumption of a Pentium and of a DSP processor?
• What is the cost of a Pentium and of a DSP processor?
Why do we need DSP processors?
Use a DSP processor when the following are required:
• Cost saving.
• Smaller size.
• Low power consumption.
• Processing of many "high"-frequency signals in real time.
Use a GPP when the following are required:
• Large memory.
• Advanced operating systems.
What are the typical DSP algorithms?
• Finite Impulse Response (FIR) filter: $y(n) = \sum_{k=0}^{M} a_k\, x(n-k)$
• Infinite Impulse Response (IIR) filter: $y(n) = \sum_{k=0}^{M} a_k\, x(n-k) + \sum_{k=1}^{N} b_k\, y(n-k)$
• Convolution: $y(n) = \sum_{k=0}^{N} x(k)\, h(n-k)$
• Discrete Fourier Transform (DFT): $X(k) = \sum_{n=0}^{N-1} x(n) \exp\left[-j(2\pi/N)\,nk\right]$
• Discrete Cosine Transform (DCT): $F(u) = \sum_{x=0}^{N-1} c(u)\, f(x) \cos\left[\frac{\pi (2x+1) u}{2N}\right]$
The Sum of Products (SOP) is the key element in most DSP algorithms.
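To illustrate the SOP claim, the sketch below writes one generic sum-of-products routine and builds the IIR equation from the table out of two calls to it (an FIR output is just a single call). The array layout for the history buffers is an assumption made for this example: `xhist[k]` holds x(n-k) for k = 0..M, and `yhist[k]` holds y(n-1-k), so `b[k]` in the code corresponds to b_{k+1} in the equation.

```c
#include <stddef.h>

/* Generic sum of products: returns sum_{k=0}^{len-1} c[k] * v[k]. */
float sop(const float *c, const float *v, size_t len)
{
    float acc = 0.0f;
    for (size_t k = 0; k < len; k++)
        acc += c[k] * v[k];
    return acc;
}

/* IIR output as two sums of products:
 *   y(n) = sum_{k=0}^{M} a_k x(n-k) + sum_{k=1}^{N} b_k y(n-k)
 * m is M (so M + 1 feed-forward terms), n is N (feedback terms).
 * An FIR output is sop(a, xhist, m + 1) on its own. */
float iir_sample(const float *a, const float *xhist, size_t m,
                 const float *b, const float *yhist, size_t n)
{
    return sop(a, xhist, m + 1) + sop(b, yhist, n);
}
```

Because every routine funnels into the same inner loop, speeding up that one multiply-accumulate loop speeds up all of them, which is the design point the rest of this module builds on.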
Module 1: 'C6000 Architecture (CPU, buses and peripherals)
[Figure: 'C6000 system block diagram: the CPU (functional units .L1, .S1, .M1, .D1 with register file A0-A15, and .L2, .S2, .M2, .D2 with register file B0-B15) connected through internal buses to internal memory, peripherals and external memory]
What Problem Are We Trying To Solve?
Digital sampling of an analog signal: x → ADC → DSP → DAC → Y
Most DSP algorithms can be expressed with a multiply-accumulate (MAC):
$$Y = \sum_{i=1}^{count} a_i \cdot x_i$$
In C, with zero-based arrays:
    for (i = 0; i < count; i++) {
        sum += m[i] * n[i];
    }
What does it take to do this fast … and easy?
• Fastest execution of MACs: the 'C6x roadmap ... from 200 to 2400 MMACs
• Ease of C programming: even using natural C, the 'C6000 architecture can perform 2 to 4 MACs per cycle
• The compiler generates 80-100% efficient code
Multiply-Accumulate (MAC) in Natural C Code
    for (i = 0; i < count; i++) {
        sum += m[i] * n[i];
    }
Fast MAC using only C: how does the 'C6000 achieve such performance from C?

Sample Compiler Benchmarks
[Table: sample compiler benchmarks comparing completely natural C against hand-coded assembly, by cycle count]
• Great out-of-box experience
• Completely natural C code (non-'C6x specific)
• Code available at: www.ti.com/sc/c6000compiler
• Compared with hand-coded assembly, based on cycle count
'C6000 Architecture: Built for Speed
[Figure: 'C6000 CPU: memory feeding a controller/decoder; register file A (A0-A15, extended to A31 on 'C64x) serves units .M1, .L1, .D1, .S1, and register file B (B0-B15, extended to B31 on 'C64x) serves units .M2, .L2, .D2, .S2]
The 'C6000 compiler excels at natural C:
• While the dual MAC speeds up math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing.
• All 'C6000 instructions are conditional, allowing efficient hardware pipelining.
• Instruction set and CPU hardware orthogonality allow the compiler to achieve 80-100% efficiency.
Fastest MAC using Natural C
C source:
    float mac(float *m, float *n, int count)
    {
        int i;
        float sum = 0;

        for (i = 0; i < count; i++) {
            sum += m[i] * n[i];
        }
        …

Compiler output (software-pipelined loop kernel):
    ;** --------------------------------------------------*
    LOOP:   ; PIPED LOOP KERNEL
            LDDW   .D1   A4++,A7:A6
    ||      LDDW   .D2   B4++,B7:B6
    ||      MPYSP  .M1X  A6,B6,A5
    ||      MPYSP  .M2X  A7,B7,B5
    ||      ADDSP  .L1   A5,A8,A8
    ||      ADDSP  .L2   B5,B8,B8
    || [A1] B      .S2   LOOP
    || [A1] SUB    .S1   A1,1,A1
    ;** --------------------------------------------------*
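Read back into C, each pass of that kernel loads two floats from each array (LDDW), does two multiplies (MPYSP on .M1 and .M2) and updates two independent running sums (A8 and B8). The sketch below is a simplified, hypothetical restatement of that data flow, not the compiler's actual output: the real kernel is software-pipelined, so loads, multiplies and adds from different iterations overlap. Because the two partial sums are combined only at the end, floating-point rounding can differ slightly from the sequential loop.

```c
/* Two-accumulator MAC mirroring the kernel's A-side and B-side sums. */
float mac_dual(const float *m, const float *n, int count)
{
    float sum_a = 0.0f;   /* like A8: even-indexed products */
    float sum_b = 0.0f;   /* like B8: odd-indexed products  */
    int i;

    for (i = 0; i + 1 < count; i += 2) {
        sum_a += m[i]     * n[i];       /* A-side multiply and add */
        sum_b += m[i + 1] * n[i + 1];   /* B-side multiply and add */
    }
    if (i < count)                      /* odd count: one product left */
        sum_a += m[i] * n[i];

    return sum_a + sum_b;               /* combine after the loop */
}
```

Splitting the sum this way is what lets the two halves of the CPU work in parallel without waiting on each other's additions.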
Looking at the internal buses ...
'C6000 Internal Buses
• Program address (from the PC): 32 bits
• Program data: 256 bits
• Data path T1 (A registers): data address 32 bits, data 32/64 bits
• Data path T2 (B registers): data address 32 bits, data 32/64 bits
• DMA: separate read address, read data, write address and write data buses
These buses connect the CPU and DMA to internal memory, external memory and peripherals.
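As a quick check on the program data bus width, assuming 32-bit instructions (the 'C6000 instruction size), one 256-bit program fetch carries eight instructions, one for each of the eight functional units. This is a back-of-the-envelope reading of the bus width, not a statement of the dispatch rules:

$$\frac{256\ \text{bits per fetch}}{32\ \text{bits per instruction}} = 8\ \text{instructions per fetch}$$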
Next, the internal memory ...
'C6711 Memory
Memory map (addresses in hex):
    0000_0000   64KB internal
    0180_0000   on-chip peripherals
    8000_0000   128MB External0
    9000_0000   128MB External1
    A000_0000   128MB External2
    B000_0000   128MB External3
    FFFF_FFFF   end of address space
Internal memory: the CPU is served by a 4K program cache and a 4K data cache (Level 1), both backed by 64K of program/data memory (Level 2).
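The map above can be expressed as an address decoder. This is an illustrative sketch: the region names are hypothetical, and because the map gives only the start address of the on-chip peripheral space, anything outside the internal and external ranges is simply reported as "other" (peripherals or unmapped).

```c
#include <stdint.h>

typedef enum {
    REGION_INTERNAL,   /* 64KB internal at 0000_0000 */
    REGION_CE0,        /* 128MB External0 at 8000_0000 */
    REGION_CE1,        /* 128MB External1 at 9000_0000 */
    REGION_CE2,        /* 128MB External2 at A000_0000 */
    REGION_CE3,        /* 128MB External3 at B000_0000 */
    REGION_OTHER       /* on-chip peripherals (from 0180_0000) or unmapped */
} region_t;

#define INTERNAL_SIZE 0x00010000u   /* 64KB  */
#define EXTERNAL_SIZE 0x08000000u   /* 128MB */

region_t classify_address(uint32_t addr)
{
    if (addr < INTERNAL_SIZE)
        return REGION_INTERNAL;

    if (addr >= 0x80000000u && addr < 0xC0000000u) {
        uint32_t rel    = addr - 0x80000000u;
        uint32_t space  = rel >> 28;          /* 0..3: which external space */
        uint32_t offset = rel & 0x0FFFFFFFu;  /* offset within its 256MB window */
        if (offset < EXTERNAL_SIZE)           /* only the first 128MB is mapped */
            return (region_t)(REGION_CE0 + space);
    }
    return REGION_OTHER;
}
```

For example, 0x9000_0004 falls in External1, while 0x8800_0000 lies just past the 128MB External0 range and is reported as "other".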
'C6711 Cache Logic
1. The CPU requests data.
2. Is the data in L1? If yes, send the data to the CPU.
3. If not, is the data in L2? If yes, copy the data from L2 to L1, then send it to the CPU.
4. If not, copy the data from external memory to L2, then from L2 to L1, then send it to the CPU.
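The flowchart can be restated as a small behavioral model. This is a sketch only: the hit/miss flags stand in for the real tag lookups, and the function and type names are hypothetical. It reports where the data came from and how many line copies the request caused.

```c
#include <stdbool.h>

typedef enum { FROM_L1, FROM_L2, FROM_EXTERNAL } source_t;

/* Follows the cache-logic steps above; *copies receives the number of
 * line copies (external -> L2, L2 -> L1) triggered by this request. */
source_t serve_request(bool in_l1, bool in_l2, int *copies)
{
    *copies = 0;
    if (in_l1)
        return FROM_L1;            /* L1 hit: send data straight to the CPU */
    if (!in_l2)
        (*copies)++;               /* L2 miss: copy external memory to L2 */
    (*copies)++;                   /* copy data from L2 to L1 */
    return in_l2 ? FROM_L2 : FROM_EXTERNAL;   /* then send data to the CPU */
}
```

The model makes the cost asymmetry visible: an L1 hit needs no copies, an L2 hit one, and a full miss two.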
'C6711 Cache Details
Level 1 Program
• Always cache
• 1-way cache (direct mapped)
• Zero wait-state
• Line size: 512 bits (or 16 instructions)
Level 1 Data
• Always cache
• 2-way cache
• Zero wait-state
• Line size: 256 bits
Level 2
• Unified (program or data)
• RAM or cache
• 1-4 way cache
• 32 data bytes in 4 cycles
• 16 instructions in 5 cycles
• Line size: 1024 bits (or 128 bytes)
[Figure: the CPU connects to L1P (4KB) over 256 bits and to L1D (4KB) over 8/16/32/64 bits; L1D connects to the unified L2 (64KB) over 128 bits and L1P over 256 bits]
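Using the L1P figures above (4KB, direct mapped, 512-bit = 64-byte lines, hence 64 lines), an address splits into offset, line index and tag. The sketch below shows that arithmetic. The field and function names are illustrative, and the split is derived from the stated size and line length rather than from a TI register description.

```c
#include <stdint.h>

#define L1P_SIZE   4096u                   /* 4KB program cache */
#define LINE_BYTES 64u                     /* 512-bit line */
#define NUM_LINES  (L1P_SIZE / LINE_BYTES) /* 64 lines, 1 way */

typedef struct {
    uint32_t tag;     /* which 4KB-aligned block the line belongs to */
    uint32_t index;   /* the single line slot this address can use */
    uint32_t offset;  /* byte position inside the 64-byte line */
} l1p_fields_t;

l1p_fields_t l1p_split(uint32_t addr)
{
    l1p_fields_t f;
    f.offset = addr % LINE_BYTES;
    f.index  = (addr / LINE_BYTES) % NUM_LINES;
    f.tag    = addr / L1P_SIZE;
    return f;
}
```

Because the cache is direct mapped, addresses 0x0044 and 0x1044 share index 1 with different tags, so code at those two addresses would evict each other.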
Looking at each peripheral ...
'C6000 Peripherals
[Figure: peripherals around the CPU and internal buses: McBSPs, Utopia, GPIO, VCP/TCP, DMA/EDMA (boot), timers, PLL, XBUS/PCI/host port, and the EMIF to external memory]

EMIF
[Figure: the EMIF connects the internal buses to SDRAM, asynchronous memory and SBSRAM]
External Memory Interface (EMIF)
• Glueless access to async/sync memory
• Works with PC100 SDRAM (cheap, fast, and easy!)
• Byte-wide data access
• 16, 32, or 64-bit bus widths
Internal and External Memory
Devices                     Internal                             EMIF (A) range         EMIFB range
C6201, C6204, C6205, C6701  P = 64 KB, D = 64 KB                 52 MB (32 bits wide)   N/A
C6202                       P = 256 KB, D = 128 KB               52 MB (32 bits wide)   N/A
C6203                       P = 384 KB, D = 512 KB               52 MB (32 bits wide)   N/A
C6211, C6711                L1P = 4 KB, L1D = 4 KB, L2 = 64 KB   128 MB (32 bits wide)  N/A
C6712                       L1P = 4 KB, L1D = 4 KB, L2 = 64 KB   64 MB (16 bits wide)   N/A
C6414, C6415, C6416         L1P = 16 KB, L1D = 16 KB, L2 = 1 MB  256 MB (64 bits wide)  64 MB (16 bits wide)
HPI / XBUS / PCI
Parallel Peripheral Interface
• HPI: a dedicated, slave-only, async 16/32-bit bus that allows host-µP access to C6000 memory
• XBUS: similar to HPI but provides:
  • master/slave and sync modes
  • a glueless interface to FIFOs (up to single-cycle transfer rate)
• PCI: standard 32-bit, 33 MHz PCI interface
These interfaces provide a means to bootstrap the C6000.
GPIO
General Purpose Input/Output (GPIO)
• 'C64x provides 8 or 16 bits of general-purpose bitwise I/O
• Use it to observe or control the signal on a single pin
McBSP and Utopia
Multi-Channel Buffered Serial Port (McBSP)
• 2 (or 3) full-duplex, synchronous serial ports
• Up to 100 Mb/s performance
• Supports multi-channel operation (T1, E1, MVIP, …)
Utopia ('C64x)
• ATM connection
• 50 MHz wide-area network connectivity
DMA / EDMA
Direct Memory Access (DMA / EDMA)
• Transfers any set of memory locations to another
• 4 / 16 / 64 channels (transfer parameter sets)
• Transfers can be triggered by any interrupt (sync)
• Operates independently of the CPU
• On reset, provides bootstrap from memory
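To make "transfer parameter set" concrete, here is a software model of what one channel's transfer does. The struct and its fields are hypothetical and do not reproduce TI's DMA/EDMA parameter layout; on real hardware the controller performs this copy while the CPU keeps executing, which is the point of the peripheral.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer parameter set (illustrative field names only). */
typedef struct {
    const uint32_t *src;      /* first source element */
    uint32_t       *dst;      /* first destination element */
    size_t          count;    /* number of 32-bit elements to move */
    size_t          src_step; /* element stride through the source */
    size_t          dst_step; /* element stride through the destination */
} xfer_params_t;

/* Software stand-in for one triggered transfer. */
void run_transfer(const xfer_params_t *p)
{
    for (size_t i = 0; i < p->count; i++)
        p->dst[i * p->dst_step] = p->src[i * p->src_step];
}
```

Separate source and destination strides let one parameter set gather, say, every other word of a buffer into a packed array.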
Timer / Counter
• Two (or three) 32-bit timer/counters
• Can generate interrupts
• Both input and output pins
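A common use of a 32-bit timer is a periodic interrupt, whose period value is the counter clock divided by the desired interrupt rate. In this sketch the counter clock is a caller-supplied assumption: which clock feeds the counter, and whether the hardware expects "period" or "period - 1", is device specific and not stated here.

```c
#include <stdint.h>

/* Period count for a periodic interrupt (irq_hz must be nonzero).
 * Returns 0 when the result is 0 or does not fit in 32 bits. */
uint32_t timer_period(uint64_t timer_clk_hz, uint64_t irq_hz)
{
    uint64_t period = timer_clk_hz / irq_hz;
    return (period == 0 || period > UINT32_MAX) ? 0u : (uint32_t)period;
}
```

For instance, with a hypothetical 37.5 MHz counter clock, a 1 kHz interrupt needs a period of 37500 counts, comfortably within 32 bits.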
VCP / TCP: 3G Wireless
Turbo Coprocessor (TCP)
• Supports 35 data channels at 384 kbps
• 3GPP / IS2000 Turbo coder
• Programmable parameters include mode, rate and frame length
Viterbi Coprocessor (VCP)
• Supports >500 voice channels at 8 kbps
• Programmable decoder parameters include constraint length, code rate, and frame length
PLL
• External clock multiplier
• Reduces EMI and cost
• Pin selectable
Input: CLKIN
Outputs:
• CLKOUT1: output rate of the PLL, i.e. the instruction (MIP) rate
• CLKOUT2: 1/2 the rate of CLKOUT1
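The clock relationships above reduce to simple arithmetic. In this sketch `multiplier` is a hypothetical parameter standing in for the pin-selected PLL setting; which multipliers a particular device supports is not stated here. The caller must keep the product within 32 bits.

```c
#include <stdint.h>

typedef struct {
    uint32_t clkout1_hz;  /* PLL output = instruction (MIP) rate */
    uint32_t clkout2_hz;  /* half of CLKOUT1 */
} clocks_t;

clocks_t pll_clocks(uint32_t clkin_hz, uint32_t multiplier)
{
    clocks_t c;
    c.clkout1_hz = clkin_hz * multiplier;  /* external clock multiplied */
    c.clkout2_hz = c.clkout1_hz / 2;       /* CLKOUT2 runs at 1/2 rate */
    return c;
}
```

With a hypothetical 25 MHz CLKIN and a x6 setting, CLKOUT1 is 150 MHz and CLKOUT2 is 75 MHz.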
C6000 Roadmap
[Figure: C6000 roadmap plotting performance against time; all devices are software compatible]
• 1st generation: C62x (C6201, C6202, C6203, C6204, C6205, C6211) and C67x floating point (C6701, C6711, C6712)
• 2nd generation C64x: general purpose (C6414), media gateway (C6415), 3G wireless infrastructure (C6416)
• Toward highest performance: C64x DSPs at 1.1 GHz, multi-core and floating-point devices
TI Floating-Point Innovation
TI Floating Point: A History of Firsts
• 'C30 (1987): first commercially successful floating-point DSP
• 'C40 (1991): first floating-point DSP with multiprocessing support
• 'C32 (1995): first $10 floating-point DSP
• 'C6701 (1998): first 1-GFLOPS DSP
• 'C33 (1999): first $5 floating-point DSP
• 'C6711 (1999): first floating-point DSP with a two-level cache
• 'C6712 (2000): first to offer 600 MFLOPS for under $10
Optional Topics
• 'C6000 CPU Details
• 'C6000 Instruction Summaries
What Problem Are We Trying To Solve?
Digital sampling of an analog signal: x → ADC → DSP → DAC → Y
Most DSP algorithms can be expressed as:
$$Y = \sum_{i=1}^{40} a_i \cdot x_i$$
What are the two primary instructions?
The Core of DSP: Sum of Products
$$y = \sum_{n=1}^{40} a_n \cdot x_n$$
        MPY   a, x, prod      ; multiplier
        ADD   y, prod, y      ; ALU
Where are the variables?
The 'C6000 is designed to handle DSP's math-intensive calculations:
        MPY   .M   a, x, prod
        ADD   .L   y, prod, y
Note: you don't have to specify the functional units (.M or .L).
Working Variables: The Register File
$$y = \sum_{n=1}^{40} a_n \cdot x_n$$
[Figure: register file A, 16 registers of 32 bits, holding a, x, prod and y for the .M and .L units]
        MPY   .M   a, x, prod
        ADD   .L   y, prod, y
How is the number of iterations specified?
Loops: Coding on a RISC Processor
$$y = \sum_{n=1}^{40} a_n \cdot x_n$$
1. Program flow: the branch instruction
        B     loop
2. Initialization: setting the loop count
        MVK   40, cnt
3. Decrement: subtract 1 from the loop counter
        SUB   cnt, 1, cnt
The "S" Unit: For Standard Operations
        MVK   .S   40, cnt
loop:
        MPY   .M   a, x, prod
        ADD   .L   y, prod, y
        SUB   .L   cnt, 1, cnt
        B     .S   loop
[Figure: register file A, 16 registers of 32 bits, holding a, x, prod, y and cnt for the .M, .L and .S units]
How is the loop terminated?
Conditional Instruction Execution
To minimize branching, all instructions are conditional:
        [condition]  B   loop
Execution is based on the zero/non-zero value of the specified variable:
        Code syntax    Execute if
        [ cnt ]        cnt ≠ 0
        [ !cnt ]       cnt = 0
Note: if the condition is false, the instruction is replaced with a NOP.
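The predication rule can be modeled in C as a select rather than a branch. The sketch below is a behavioral model of a predicated "ADD y, prod, y" (the function name is hypothetical): when the condition is false the instruction behaves like a NOP, so y comes back unchanged. A C compiler may or may not emit a branch for the `?:`; the point is the semantics, not the generated code.

```c
#include <stdbool.h>

/* [cnt]  ADD y, prod, y  -> executes when cnt != 0 (negate = false)
 * [!cnt] ADD y, prod, y  -> executes when cnt == 0 (negate = true)
 * A false condition leaves y unchanged, like a NOP. */
int predicated_add(int cnt, bool negate, int y, int prod)
{
    bool execute = negate ? (cnt == 0) : (cnt != 0);
    return execute ? y + prod : y;
}
```

The same rule applied to B is what turns "[cnt] B loop" into a loop that exits once cnt reaches zero.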
Loop Control via Conditional Branch
$$y = \sum_{n=1}^{40} a_n \cdot x_n$$
        MVK   .S   40, cnt
loop:
        MPY   .M   a, x, prod
        ADD   .L   y, prod, y
        SUB   .L   cnt, 1, cnt
 [cnt]  B     .S   loop
[Figure: register file A (32 bits wide) holding a, x, prod, y and cnt for the .M, .L and .S units]
How are the a and x array values brought in from memory?
Memory Access via ".D" Unit
$$y = \sum_{n=1}^{40} a_n \cdot x_n$$
        MVK   .S   40, cnt
loop:
        LDH   .D   *ap, a
        LDH   .D   *xp, x
        MPY   .M   a, x, prod
        ADD   .L   y, prod, y
        SUB   .L   cnt, 1, cnt
 [cnt]  B     .S   loop
Data memory: x(40), a(40), y
[Figure: register file A, 16 registers, holding a, x, prod, y, cnt and the pointers *ap, *xp, *yp for the .M, .L, .S and .D units]
How do we increment through the arrays?
Auto-Increment of PointersAuto-Increment of Pointers
Register File ARegister File A
aaxx
prodprodyy
cntcnt
*ap*ap
*xp*xp*yp*yp
y =y =4040
aann x xnnn = 1n = 1
**
MVKMVK .S.S 40, cnt40, cnt
loop:loop:
LDHLDH .D.D *ap*ap++++, a, a
LDHLDH .D.D *xp*xp++++, x, x
MPYMPY .M.M a, x, proda, x, prod
ADDADD .L .L y, prod, yy, prod, y
SUBSUB .L.L cnt, 1, cntcnt, 1, cnt
[cnt][cnt] BB .S.S looploop
How do we store results back to memory?How do we store results back to memory?
Storing Results Back to Memory
        MVK   .S   40, cnt        ; cnt = 40
loop:
        LDH   .D   *ap++, a       ; load a(n), advance ap
        LDH   .D   *xp++, x       ; load x(n), advance xp
        MPY   .M   a, x, prod     ; prod = a * x
        ADD   .L   y, prod, y     ; y = y + prod
        SUB   .L   cnt, 1, cnt    ; cnt = cnt - 1
 [cnt]  B     .S   loop           ; branch back while cnt != 0
        STW   .D   y, *yp         ; after the loop: store the 32-bit result y at *yp
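Putting the pieces together gives a C sketch of the whole routine, under two stated assumptions. First, y starts at zero, since the slide never shows it being cleared. Second, accumulation wraps modulo 2^32 like a 32-bit ADD. Like the slide code, the sketch models only the data flow, not instruction timing. sum_of_products is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y = sum of a[n] * x[n], finishing with the store (STW y, *yp). */
static void sum_of_products(const int16_t *ap, const int16_t *xp,
                            int32_t *yp, int32_t cnt) {
    assert(cnt > 0 && ap != NULL && xp != NULL && yp != NULL);
    uint32_t y = 0;                    /* assumption: y cleared before the loop */
    do {
        /* LDH, LDH, MPY: a 16 x 16 product always fits in 32 bits */
        int32_t prod = (int32_t)*ap++ * (int32_t)*xp++;
        y += (uint32_t)prod;           /* ADD .L y, prod, y (wraps mod 2^32) */
    } while (--cnt != 0);              /* SUB .L cnt, 1, cnt / [cnt] B .S loop */
    /* STW .D y, *yp; the cast back to signed is two's-complement on
       common compilers (implementation-defined in ISO C) */
    *yp = (int32_t)y;
}
```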
But wait, that’s only half the story...
Dual Resources: Twice as Nice
[Figure: two datapaths. Side 1: units .M1, .L1, .S1, .D1 with Register File A (A0-A15, 32 bits each; A0-A7 hold an, xn, prd, sum, cnt and the pointers *a, *x, *y). Side 2: units .M2, .L2, .S2, .D2 with Register File B (B0-B15, 32 bits each).]
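One way to use both sides, not necessarily the approach taken later in the course, is to split the sum into two partial sums. Each partial sum lives in its own register file, and the two are combined at the end. Below is a C sketch assuming an even element count. sop_two_sides is hypothetical, and the A/B assignment in the comments is only illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Even-index terms accumulate in sumA (think .M1/.L1, register file A),
   odd-index terms in sumB (.M2/.L2, register file B). */
static int32_t sop_two_sides(const int16_t *a, const int16_t *x, int n) {
    assert(n > 0 && n % 2 == 0);       /* assumption: even count, e.g. 40 */
    uint32_t sumA = 0, sumB = 0;       /* unsigned: wrap like a 32-bit ADD */
    for (int i = 0; i < n; i += 2) {
        sumA += (uint32_t)((int32_t)a[i] * x[i]);          /* side A */
        sumB += (uint32_t)((int32_t)a[i + 1] * x[i + 1]);  /* side B */
    }
    return (int32_t)(sumA + sumB);     /* combine the partial sums once */
}
```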
Our final view of the sum of products example...
‘C6000 System Block Diagram
[Figure: the CPU contains two datapaths, .D1/.M1/.L1/.S1 with Register Set A and .D2/.M2/.L2/.S2 with Register Set B, connected through internal buses to internal memory, external memory and peripherals]
To summarize each unit’s instructions...
‘C62x RISC-Like Instruction Set
.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.M Unit: MPY, MPYH, MPYLH, MPYHL, SMPY, SMPYH
.D Unit: ADD, ADDAB/ADDAH/ADDAW, LDB/LDH/LDW, MV, NEG, STB/STH/STW, SUB, SUBAB/SUBAH/SUBAW, ZERO
No Unit Used: IDLE, NOP
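The C below sketches what the .M unit multiplies compute on 32-bit register values. In MPY, MPYH, MPYLH and MPYHL, the L/H letters select the low or high signed 16 bits of each source. SMPY doubles the low-half product (a left shift by 1, Q15 style) and saturates the one overflowing case to 0x7FFFFFFF. The function names mirror the mnemonics but are hypothetical C helpers, and the CSR saturation flag is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of a register value. */
static int32_t sext16(uint32_t bits) {
    int32_t v = (int32_t)(bits & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}
static int32_t lsb16(uint32_t r) { return sext16(r); }        /* low half  */
static int32_t msb16(uint32_t r) { return sext16(r >> 16); }  /* high half */

static int32_t mpy(uint32_t s1, uint32_t s2)   { return lsb16(s1) * lsb16(s2); }
static int32_t mpyh(uint32_t s1, uint32_t s2)  { return msb16(s1) * msb16(s2); }
static int32_t mpylh(uint32_t s1, uint32_t s2) { return lsb16(s1) * msb16(s2); }
static int32_t mpyhl(uint32_t s1, uint32_t s2) { return msb16(s1) * lsb16(s2); }

/* SMPY: (lsb16 * lsb16) doubled; only -32768 * -32768 overflows. */
static int32_t smpy(uint32_t s1, uint32_t s2) {
    int32_t p = mpy(s1, s2);                 /* |p| <= 2^30, always fits */
    if (p == 0x40000000) return INT32_MAX;   /* saturate 0x80000000 -> 0x7FFFFFFF */
    return p * 2;                            /* left shift by 1, no overflow here */
}
```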
‘C67x: Superset of the Fixed-Point ‘C62x
.S Unit: all ‘C62x .S instructions, plus ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
.L Unit: all ‘C62x .L instructions, plus ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP
.M Unit: all ‘C62x .M instructions, plus MPYSP, MPYDP, MPYI, MPYID
.D Unit: all ‘C62x .D instructions, plus LDDW
No Unit Used: IDLE, NOP
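RCPSP returns only an approximate reciprocal, with about 8 bits of mantissa accuracy. Software therefore commonly refines it with Newton-Raphson steps x = x(2 - vx), and each step roughly doubles the number of correct bits. The C sketch below uses a seed argument to stand in for an RCPSP result, which is an assumption because portable C has no RCPSP. refine_reciprocal is hypothetical.

```c
#include <assert.h>

/* Newton-Raphson refinement of a reciprocal estimate of 1/v.
   Converges when 0 < seed < 2/v; if v*x = 1 - e, one step gives 1 - e*e. */
static float refine_reciprocal(float v, float seed, int iterations) {
    assert(v != 0.0f && iterations >= 0);
    float x = seed;                    /* stands in for the RCPSP output */
    for (int i = 0; i < iterations; i++)
        x = x * (2.0f - v * x);        /* on a 'C67x: MPYSP, SUBSP, MPYSP */
    return x;
}
```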
‘C64x: Superset of ‘C62x
New or extended instructions per unit:
.S Unit: SADD2, SADDUS2, SADD4, ANDN, SHR2, SHRU2, SHLMB, SHRMB, CMPEQ2, CMPEQ4, CMPGT2, CMPGT4, BDEC, BPOS, BNOP, ADDKPC, PACK2, PACKH2, PACKLH2, PACKHL2, UNPKHU4, UNPKLU4, SWAP2, SPACK2, SPACKU4
.L Unit: ABS2, ADD2, ADD4, MAX, MIN, SUB2, SUB4, SUBABS4, ANDN, SHLMB, SHRMB, MVK (5-bit), PACK2, PACKH2, PACKLH2, PACKHL2, PACKH4, PACKL4, UNPKHU4, UNPKLU4, SWAP2/4
.M Unit: MPY2/SMPY2, DOTP2, DOTPN2, DOTPRSU2, DOTPNRSU2, DOTPU4, DOTPSU4, GMPY4, XPND2/4, MPYHI, MPYLI, MPYHIR, MPYLIR, AVG2, AVG4, ROTL, SSHVL, SSHVR, BITC4, BITR, DEAL, SHFL, MVD
.D Unit: LDDW, LDNW, LDNDW, STDW, STNW, STNDW, MVK (5-bit), ADD2, SUB2, AND, ANDN, OR, XOR, ADDAD
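Many ‘C64x additions operate on packed data, meaning two 16-bit values (or four 8-bit values) held in one 32-bit register. The C sketch below models two of them. ADD2 performs two independent 16-bit adds with no carry between the halves. DOTP2 sums the two signed 16 x 16 products. pack2, add2 and dotp2 are hypothetical helpers. The 32-bit wrap in dotp2 is an assumption that only matters for the extreme overflow case.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of a register value. */
static int32_t sext16(uint32_t bits) {
    int32_t v = (int32_t)(bits & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Place two signed 16-bit values in one 32-bit register (hi:lo). */
static uint32_t pack2(int16_t hi, int16_t lo) {
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* ADD2: add the low halves and the high halves separately. */
static uint32_t add2(uint32_t s1, uint32_t s2) {
    uint32_t lo = (s1 + s2) & 0xFFFFu;                  /* carry out is discarded */
    uint32_t hi = ((s1 >> 16) + (s2 >> 16)) & 0xFFFFu;  /* high halves, independent */
    return (hi << 16) | lo;
}

/* DOTP2: hi(s1)*hi(s2) + lo(s1)*lo(s2), signed 16-bit operands. */
static int32_t dotp2(uint32_t s1, uint32_t s2) {
    int32_t hi = sext16(s1 >> 16) * sext16(s2 >> 16);
    int32_t lo = sext16(s1) * sext16(s2);
    return (int32_t)((uint32_t)hi + (uint32_t)lo);      /* assumed 32-bit wrap */
}
```

The carry isolation is the point of ADD2. Adding 0x0001FFFF and 0x00020001 with a plain 32-bit ADD gives 0x00040000, while ADD2 gives 0x00030000.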
Additional ‘C64x features:
Double-size register sets: A16-A31 and B16-B31 added (32 registers per side)
Advanced instruction packing (minimizes code size)
Advanced emulation features