RICE UNIVERSITY
Implementing the Viterbi algorithm on programmable processors
Sridhar Rajagopal
Elec 696
Motivation
Viterbi decoding - One of the major bottlenecks in baseband processing [PHY]
Need for flexibility in the algorithm parameters due to different protocols (read: “programmable”)
No architecture developed yet to meet real-time requirements of 3G systems.
2 - 8 Mbps range for wideband CDMA
100 Mbps range for wireless LAN
Today
Background
Advanced DSP architectures -- TI C6x [15]
Viterbi algorithm basics [10]
Viterbi on TI DSPs [10]
A programmable processor specifically designed for Viterbi [15]
TI C6x architecture
VLIW [Very Long Instruction Word] arch.
Similar to a vector processor -- but multiple instructions -> multiple Func. Units
FUs are not all the same
32-bit architecture, 8 functional units
[Figure: 4-wide VLIW; Inst 1 to Inst 4 issued to FU 1 to FU 4]
8 VelociTI principles
Parallel fetch, decode and execute
Pipelined enough to make ADD critical path
Instructions based on RISC
Load - Store architecture
Orthogonal - Instruction Set and Reg. File
Determinism
Conditional Instructions
Instruction Packing
2 * 4 = 8 Functional Units
.M Multiplication unit
  16 bit x 16 bit, signed/unsigned, packed/unpacked
.L Arithmetic logic unit
  Comparisons and logic operations
  Saturation arithmetic and absolute value
.S Shifter unit
  Bit manipulation (set, get, shift, rotate)
  Branching, addition and packed addition
.D Data unit
  Load/store to memory
  Addition and pointer arithmetic
How powerful am I?
8 instructions per cycle
Max:
  6 adds per cycle
  2 multiplies per cycle
  2 load/stores per cycle
  2 branches per cycle
The idea is that code should use instructions in these ratios to get full FU utilization.
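To make the utilization point concrete, here is a minimal sketch of a resource-only cycle bound for a block of instructions. It assumes that adds can issue on the .L, .S and .D units (which is how the 6-adds figure arises), that loads/stores use only .D, branches only .S and multiplies only .M. It ignores data dependencies, cross-path limits and register pressure, and the function names are hypothetical.

```python
from math import ceil

def min_cycles(adds=0, muls=0, ldsts=0, branches=0):
    """Resource-only lower bound on cycles for a block of instructions."""
    return max(
        ceil(muls / 2),                       # 2 multiplies per cycle (.M)
        ceil(ldsts / 2),                      # 2 load/stores per cycle (.D)
        ceil(branches / 2),                   # 2 branches per cycle (.S)
        ceil((adds + ldsts + branches) / 6),  # adds share .L, .S and .D
    )

def utilization(adds=0, muls=0, ldsts=0, branches=0):
    """Fraction of the 8 issue slots filled at the bound above."""
    cycles = min_cycles(adds, muls, ldsts, branches)
    total = adds + muls + ldsts + branches
    return total / (8 * cycles) if cycles else 0.0

print(min_cycles(adds=4, muls=2, ldsts=2))   # 1
print(utilization(adds=6))                   # 0.75
```

Under this simplified model, a multiply-free kernel such as add-compare-select cannot fill more than 6 of the 8 slots per cycle, since the two .M units stay idle.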
C6x DSP Core
C6x Datapath
C6x Resource Constraints
Instructions using the same FU
  1 inst. / FU
Cross paths
  only 1 operand from other reg. file to (L,S,M)
Loads and stores
  2 loads and stores from 2 different reg. files
Reads and writes
  max 4 reads from the same register
  No 2 writes to the same register :)
Instruction Packing
[Figure: Fetch Packet vs. Execute Packet]
Avoid NOPs in the instruction code
Multi-cycle NOPs if absolutely necessary
LSB - “p” bit of instruction for packing
A || B || C, D || E, F, G || H
8 instructions instead of 32
Instruction: A B C D E F G H
p-bit:       1 1 0 1 0 0 1 0
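As a rough illustration of how the p-bits split one fetch packet into execute packets, the sketch below chains instructions while p = 1 and closes the packet on p = 0. It assumes the usual C6x convention that a p-bit of 1 means the next instruction executes in parallel with the current one. The function name `execute_packets` is hypothetical.

```python
def execute_packets(instructions, p_bits):
    """Group the instructions of a fetch packet into execute packets.

    A p-bit of 1 chains the NEXT instruction into the same execute packet;
    a p-bit of 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p in zip(instructions, p_bits):
        current.append(name)
        if p == 0:
            packets.append(current)
            current = []
    if current:  # chain left open at the end of the fetch packet
        packets.append(current)
    return packets

packets = execute_packets("ABCDEFGH", [1, 1, 0, 1, 0, 0, 1, 0])
print(" , ".join(" || ".join(p) for p in packets))
# A || B || C , D || E , F , G || H
```

With the p-bits from the slide, the eight instructions form four execute packets and no padding NOPs are needed. Padding each packet to full width would take 4 x 8 = 32 instruction slots.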
Conditional Instructions
All instructions can be conditioned based on the value in registers A1,A2,B0,B1,B2
Avoids branch latencies
If condition not met by end of first phase of execution, results not written back to reg. file
Conditional loads/stores squashed before data phase
C6x Pipeline
Fetch (if necessary) - 4 phases
  Address Generate
  Address Send
  Access Ready Wait
  Fetch Packet Receive
Decode - 2 phases
  Instruction dispatch (if necessary)
  Instruction decode
Execute - up to 10 phases
  Most instructions need only 1 phase
Some interesting instructions
Saturation
Bit-counting -- Image coding
Integer-comparison
Bit-manipulation
Seed generation for reciprocal instructions
Other details
64 KB internal program and data memory
DMA - peripherals to memory
Intrinsics in code for better programming
  similar to using “VIS” in UltraSPARC
Software pipelining of loops
PERFORMANCE: 5-10X
  higher clock -- deeper pipeline (2-4X)
  Additional ALUs
Additional features in C64x
SIMD support
Communication-specific instructions
  interleaving, Galois field multiply
Bit count and rotate hardware
64 32-bit registers
Fewer resource constraints
No more padding NOPs needed [execute packets can cross fetch-packet boundaries]
C64x DSP Core
Today
Background
Advanced DSP architectures -- TI C6x [15]
Viterbi algorithm basics [10]
Viterbi on TI DSPs [10]
A programmable processor specifically designed for Viterbi [15]
Viterbi Decoding
[Figure: k bits -> Encoder -> n bits (n > k) -> Decoder -> k bits]
Rate k/n = 1/2 Convolutional Encoder
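A rate 1/2 convolutional encoder can be sketched in a few lines. The slides do not give the code parameters, so the constraint length K = 3 and the generator polynomials 7 and 5 (octal) used here are assumptions chosen because they are a common textbook example. The names `conv_encode` and `parity` are hypothetical.

```python
K = 3                  # assumed constraint length (not given on the slide)
G = (0b111, 0b101)     # assumed generator polynomials, 7 and 5 in octal

def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def conv_encode(bits, generators=G, k=K):
    """Encode a list of bits; emits len(generators) output bits per input bit."""
    state = 0          # the k - 1 most recent input bits, newest in the MSB
    out = []
    for b in bits:
        reg = (b << (k - 1)) | state            # shift register contents
        out.extend(parity(reg & g) for g in generators)
        state = reg >> 1                        # drop the oldest bit
    return out

print(conv_encode([1, 0, 1, 1]))   # [1, 1, 1, 0, 0, 0, 0, 1]
```

Each input bit produces two output bits, which is what rate k/n = 1/2 means: the redundancy is what the decoder later uses to correct errors.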
Error Protection
Number of states = 2^(number of flip-flops) = 2^(Constraint Length - 1)
Cannot go from any state to any state (a rate 1/2 code allows only two successors per state)
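The state count and the restricted transitions can be checked with a small sketch. It assumes a one-input-bit-per-stage encoder whose state is the shift register contents with the newest bit in the most significant position. That convention and the function names are assumptions for illustration.

```python
def num_states(constraint_length):
    """Number of trellis states: 2^(constraint length - 1)."""
    return 1 << (constraint_length - 1)

def successors(state, constraint_length):
    """States reachable from `state` for input bit 0 and input bit 1."""
    m = constraint_length - 1                  # number of flip-flops
    return [((bit << m) | state) >> 1 for bit in (0, 1)]

print(num_states(3), num_states(9))            # 4 256
for s in range(num_states(3)):
    print(s, "->", successors(s, 3))
# 0 -> [0, 2]
# 1 -> [0, 2]
# 2 -> [1, 3]
# 3 -> [1, 3]
```

Only two of the four states are reachable from any given state, which is the structure the trellis diagrams rely on.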
Trellis for decoding
Trellis for an input sequence
Error detection
Branch metric = “Distance” between received symbol pair and possible symbol pairs
Path metric = Accumulated error metric
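The two usual choices of "distance" can be sketched as follows: Hamming distance for hard-decision symbols and squared Euclidean distance for soft-decision samples. The function names and the sample values are hypothetical.

```python
def hamming_metric(received, expected):
    """Hard decision: XOR the packed symbol bits, then count the ones."""
    return bin(received ^ expected).count("1")

def euclidean_metric(received, expected):
    """Soft decision: squared distance between two sample sequences."""
    return sum((r - e) ** 2 for r, e in zip(received, expected))

# Received symbol pair 10 against the four possible pairs 00, 01, 10, 11.
print([hamming_metric(0b10, cand) for cand in range(4)])   # [1, 2, 0, 1]

# Soft samples compared against an expected (+1, -1) pair.
print(round(euclidean_metric((0.9, -0.2), (1.0, -1.0)), 2))  # 0.65
```

The path metric of a trellis path is then the running sum of the branch metrics along it.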
Error-correction
Stages in Viterbi Decoding
Calculate Branch metrics for all states every stage
Update Path metrics for all states every stage
At the end, Traceback the trellis to get the decoded bits
Computations
Branch metrics:
  Hamming distance: (XOR) and count 1’s
  Euclidean distance: squared distance
Path metrics:
  Add branch metrics to existing path metrics
  Compare for minimum and select minimum
Survivor traceback:
  Linked list / pointer chasing
Memory intensive / sequential operations
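The three computations can be put together in a minimal hard-decision decoder sketch. It assumes a rate 1/2 code with K = 3 and generators 7 and 5 (octal), an encoder that starts in state 0, and Hamming branch metrics. None of these parameters come from the slides, and all names are hypothetical.

```python
K = 3                        # assumed constraint length
G = (0b111, 0b101)           # assumed generators, 7 and 5 in octal
N_STATES = 1 << (K - 1)
INF = float("inf")

def parity(x):
    return bin(x).count("1") & 1

def step(state, bit):
    """Return (next state, 2-bit output symbol) for one encoder step."""
    reg = (bit << (K - 1)) | state
    symbol = 0
    for g in G:
        symbol = (symbol << 1) | parity(reg & g)
    return reg >> 1, symbol

def encode(bits):
    state, out = 0, []
    for b in bits:
        state, symbol = step(state, b)
        out.append(symbol)
    return out

def viterbi_decode(symbols):
    metrics = [0] + [INF] * (N_STATES - 1)     # encoder starts in state 0
    survivors = []                             # per stage: (previous state, bit)
    for rx in symbols:
        new_metrics = [INF] * N_STATES
        decisions = [None] * N_STATES
        for state in range(N_STATES):
            if metrics[state] == INF:          # state not reachable yet
                continue
            for bit in (0, 1):
                nxt, expected = step(state, bit)
                bm = bin(rx ^ expected).count("1")   # branch metric: XOR, count 1's
                cand = metrics[state] + bm           # add
                if cand < new_metrics[nxt]:          # compare
                    new_metrics[nxt] = cand          # select
                    decisions[nxt] = (state, bit)
        metrics = new_metrics
        survivors.append(decisions)
    # Traceback: chase the survivor pointers from the best final state.
    state = min(range(N_STATES), key=metrics.__getitem__)
    bits = []
    for decisions in reversed(survivors):
        state, bit = decisions[state]
        bits.append(bit)
    return bits[::-1]

message = [1, 0, 1, 1, 0, 0]       # two trailing zeros flush the encoder
received = encode(message)
received[2] ^= 0b01                # inject a single bit error
print(viterbi_decode(received) == message)   # True
```

The inner loop is the add-compare-select step, and the final loop is the pointer-chasing traceback. Both touch memory on every stage, which is why the slides describe the algorithm as memory intensive and sequential.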
Today
Background
Advanced DSP architectures -- TI C6x [15]
Viterbi algorithm basics [10]
Viterbi on TI DSPs [10]
A programmable processor specifically designed for Viterbi [15]
Viterbi support in different processors
C54x
  Special hardware accelerator
  ACS unit with 2 ACC and split ALU
  Viterbi butterfly (2 ACS) in 4 cycles
C62x
  nothing special
C6416
  Viterbi coprocessor
  K = 5-9, Rate = 1/2, 1/3, 1/4
Viterbi Coprocessor in C6416
SM, SD and HD memory not accessible to DSP
Today
Background
Advanced DSP architectures -- TI C6x [15]
Viterbi algorithm basics [10]
Viterbi on TI DSPs [10]
A programmable processor specifically designed for Viterbi [15]
Need for VSP architecture
Large amount of memory access
Traceback decoding
Not efficient on a GPP
The number of program instructions on a GPP is of a higher order than the complexity of the algorithm
VSP architecture
Branch Metric Calculation
Path Metric Calculation
Traceback Unit
Traceback with survivor updates
Start Filling the Trellis
Start Traceback: 5 * Constraint Length
Symbol Decoded
Update Survivor Path for most recent symbol
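The flow above (fill the trellis, trace back over 5 * constraint length stages, emit a symbol, then add the survivors for the newest symbol) can be sketched in software as a sliding-window decoder. This is only an illustration of the idea, not the VSP hardware: the K = 3 code with generators 7 and 5 (octal), the Hamming metric and all function names are assumptions.

```python
from collections import deque

K = 3                           # assumed constraint length
G = (0b111, 0b101)              # assumed generators, 7 and 5 in octal
N_STATES = 1 << (K - 1)
DEPTH = 5 * K                   # traceback depth from the slide
INF = float("inf")

def parity(x):
    return bin(x).count("1") & 1

def step(state, bit):
    """Return (next state, 2-bit output symbol) for one encoder step."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, (parity(reg & G[0]) << 1) | parity(reg & G[1])

def encode(bits):
    state, out = 0, []
    for b in bits:
        state, symbol = step(state, b)
        out.append(symbol)
    return out

def acs_stage(metrics, rx):
    """One trellis stage: new path metrics and per-state (previous state, bit)."""
    new, dec = [INF] * N_STATES, [None] * N_STATES
    for s in range(N_STATES):
        for bit in (0, 1):
            nxt, expected = step(s, bit)
            cand = metrics[s] + bin(rx ^ expected).count("1")
            if cand < new[nxt]:
                new[nxt], dec[nxt] = cand, (s, bit)
    return new, dec

def oldest_bit(window, metrics):
    """Trace back through the whole window; return the bit of its oldest stage."""
    state = min(range(N_STATES), key=metrics.__getitem__)
    bit = None
    for dec in reversed(window):
        state, bit = dec[state]
    return bit

def stream_decode(symbols):
    metrics = [0] + [INF] * (N_STATES - 1)     # encoder starts in state 0
    window, out = deque(), []
    for rx in symbols:
        metrics, dec = acs_stage(metrics, rx)  # survivor update, newest symbol
        window.append(dec)
        if len(window) == DEPTH:               # trellis filled: decode one symbol
            out.append(oldest_bit(window, metrics))
            window.popleft()
    while window:                              # flush the tail at end of input
        out.append(oldest_bit(window, metrics))
        window.popleft()
    return out
```

Because only the last `DEPTH` stages of survivor decisions are kept, memory use stays constant regardless of message length, at the cost of one full traceback per decoded bit in this naive form.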
Survivor Path Updates
Circular updates
Software Programming
Small but specialized instruction set
  LOAD, ACS
Shorter execution time
All 3 subprocessors programmed independently
10 ns cycle (100 MHz) in 1990 to get 1.5 Mbps
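The quoted figures imply a cycle budget per decoded bit, which can be worked out directly. The extrapolation to 100 Mbps assumes the same number of cycles per bit and no added parallelism, which is a simplification for illustration only. The function names are hypothetical.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles available for each decoded bit."""
    return clock_hz / bit_rate_bps

def required_clock_hz(target_bps, cycles_per_decoded_bit):
    """Clock needed to hit a target rate at a fixed cycles-per-bit cost."""
    return target_bps * cycles_per_decoded_bit

budget = cycles_per_bit(100e6, 1.5e6)
print(round(budget, 1))                                   # 66.7
print(round(required_clock_hz(100e6, budget) / 1e9, 2))   # 6.67
```

At roughly 67 cycles per bit, a 100 Mbps target would need a clock of several GHz unless the work per bit is spread across parallel units.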
Conclusions
The Viterbi algorithm is important to implement in a programmable communication receiver
Approaches so far have been co-processor support for DSPs or specialized processors.
We are yet to design programmable processors that meet real-time requirements for 100 Mbps applications.