RICE UNIVERSITY

Implementing the Viterbi algorithm on programmable processors

Sridhar Rajagopal

Elec [email protected]

Motivation

Viterbi decoding is one of the major bottlenecks in baseband (PHY) processing

Different protocols need flexibility in the algorithm parameters (read: programmable)

No architecture has yet been developed that meets the real-time requirements of 3G systems.

2 - 8 Mbps range for wideband CDMA

100 Mbps range for wireless LAN

Today

Background

Advanced DSP architectures -- TI C6x [15]

Viterbi algorithm basics [10]

Viterbi on TI DSPs [10]

A programmable processor specifically designed for Viterbi [15]

TI C6x architecture

VLIW (Very Long Instruction Word) architecture

Similar to a vector processor, but multiple instructions issue to multiple functional units (FUs)

FUs are not all the same

32-bit architecture, 8 functional units

[Figure: 4-wide VLIW, with Inst 1 to Inst 4 issued to FU 1 to FU 4]

8 VelociTI principles

Parallel fetch, decode and execute

Pipelined enough to make ADD critical path

Instructions based on RISC

Load - Store architecture

Orthogonal - Instruction Set and Reg. File

Determinism

Conditional Instructions

Instruction Packing

2 * 4 = 8 Functional Units

.M Multiplication unit
  16 bit x 16 bit signed/# packed/#

.L Arithmetic logic unit
  Comparisons and logic operations
  Saturation arithmetic and absolute value

.S Shifter unit
  Bit manipulation (set, get, shift, rotate)
  Branching, addition and packed addition

.D Data unit
  Load/store to memory
  Addition and pointer arithmetic

How powerful am I?

8 instructions per cycle

Max:
  6 adds per cycle
  2 multiplies per cycle
  2 load/stores per cycle
  2 branches per cycle

The idea is to use instructions in these ratios to get full FU utilization.
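
The peak rates above give a quick lower bound on how many cycles a loop body needs. The following is a minimal sketch, assuming only the per-cycle maxima quoted on this slide; the function name, the operation classes and the example loop body are hypothetical, and data dependences, cross-path limits and register pressure are ignored.

```python
from math import ceil

# Peak operations per cycle, as quoted on the slide.
PEAK_PER_CYCLE = {"add": 6, "mul": 2, "mem": 2, "branch": 2}
ISSUE_WIDTH = 8  # at most 8 instructions per cycle


def min_cycles(op_counts):
    """Resource-only lower bound on the cycles needed for one loop iteration."""
    per_class = [
        ceil(op_counts.get(kind, 0) / peak)
        for kind, peak in PEAK_PER_CYCLE.items()
    ]
    issue_bound = ceil(sum(op_counts.values()) / ISSUE_WIDTH)
    return max(per_class + [issue_bound])


# Hypothetical loop body: 12 adds, 2 multiplies, 8 loads/stores, 1 branch.
loop_body = {"add": 12, "mul": 2, "mem": 8, "branch": 1}
print(min_cycles(loop_body))  # 4: the loads/stores are the bottleneck
```

A memory-heavy mix like this one is limited by the two load/store slots rather than by the six adders, which is why the instruction ratios matter for full utilization.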

C6x DSP Core

C6x Datapath

C6x Resource Constraints

Instructions using the same FU
  1 instruction per FU

Cross paths
  Only 1 operand from the other register file to (L, S, M)

Loads and stores
  2 loads and stores from 2 different register files

Reads and writes
  Max 4 reads from the same register
  No 2 writes to the same register

Instruction Packing

Fetch Packet Execute Packet

Avoid NOPs in the instruction code

Multi-cycle NOPs if absolutely necessary

LSB: “p” bit of instruction for packing

A || B || C, D || E, F, G || H: 8 instructions instead of 32

Instruction:  A B C D E F G H
p-bit:        1 1 0 1 0 0 1 0
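
The p-bit pattern above can be turned into execute packets mechanically. This is a minimal sketch, assuming that a p-bit of 1 means "the next instruction runs in parallel with this one" and a p-bit of 0 closes the current execute packet; the function name is hypothetical.

```python
def split_execute_packets(instructions, p_bits):
    """Group one fetch packet into execute packets using the p-bits."""
    packets, current = [], []
    for name, p in zip(instructions, p_bits):
        current.append(name)
        if p == 0:  # p-bit clear: this instruction ends the execute packet
            packets.append(current)
            current = []
    if current:  # last p-bit was 1: packet is still open at the end
        packets.append(current)
    return packets


packets = split_execute_packets(list("ABCDEFGH"), [1, 1, 0, 1, 0, 0, 1, 0])
print(packets)  # [['A', 'B', 'C'], ['D', 'E'], ['F'], ['G', 'H']]
```

The four execute packets match A || B || C, D || E, F, G || H. Without packing, each of the four packets would be padded with NOPs to 8 slots, giving the 32 instructions mentioned on the slide.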

Conditional Instructions

All instructions can be conditioned based on the value in registers A1,A2,B0,B1,B2

Avoids branch latencies

If condition not met by end of first phase of execution, results not written back to reg. file

Conditional loads/stores squashed before data phase

C6x Pipeline

Fetch (if necessary) - 4 phases
  Address Generate
  Address Send
  Access Ready Wait
  Fetch Packet Receive

Decode - 2 phases
  Instruction dispatch (if necessary)
  Instruction decode

Execute - 10 phases
  Most instructions need 1 phase

Some interesting instructions

Saturation

Bit-counting -- Image coding

Integer-comparison

Bit-manipulation

Seed generation for reciprocal instructions

Other details

64 KB internal program and data memory

DMA - peripherals to memory

Intrinsics in code for better programming
  Similar to using “VIS” in UltraSPARC

Software pipelining of loops

PERFORMANCE: 5-10X higher clock -- higher pipeline (2-4X) Additional ALUs

Additional features in C64x

SIMD support

Communication-specific instructions

Interleaving, Galois field multiply

Bit count and rotate hardware

64 32-bit registers

Fewer resource constraints

No more NOPs needed ever [no fetch packet boundaries for execute packets]

C64x DSP Core

Viterbi Decoding

[Figure: Encoder takes k bits and produces n > k bits; Decoder takes n bits and recovers k bits]

Rate k/n = 1/2 Convolutional Encoder
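
A minimal sketch of a rate 1/2 convolutional encoder. The slide does not give the encoder's parameters, so the constraint length of 3 and the generator polynomials 7 and 5 (octal) are assumptions taken from the usual textbook example; the function name is hypothetical.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Encode a bit list; each input bit yields len(generators) output bits."""
    state = 0  # the K-1 most recent input bits, newest in the top position
    out = []
    for b in bits:
        # Shift register contents: new bit on top of the current state.
        reg = (b << (constraint_length - 1)) | state
        for g in generators:
            # Each output bit is the parity (XOR) of the tapped positions.
            out.append(bin(reg & g).count("1") % 2)
        state = reg >> 1  # drop the oldest bit
    return out


print(conv_encode([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```

Four input bits become eight coded bits, which is the k/n = 1/2 rate named above.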

Error Protection

States = 2^(FFs) = 2^(Constraint Length - 1)

Cannot go from any state to any state
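
The state count and the restricted transitions can be enumerated directly. This is a sketch under the assumption that the state holds the K-1 most recent input bits, with the newest bit in the top position; the function name and the choice of K = 3 are hypothetical.

```python
def next_states(state, constraint_length):
    """Return the two states reachable from `state`, for input bit 0 and 1."""
    m = constraint_length - 1  # number of flip-flops
    return [(bit << (m - 1)) | (state >> 1) for bit in (0, 1)]


K = 3
num_states = 2 ** (K - 1)
print(num_states)  # 4
for s in range(num_states):
    print(format(s, "02b"), "->", [format(n, "02b") for n in next_states(s, K)])
```

Each state has exactly two successors, one per input bit, so most state-to-state transitions are impossible. That restriction is what lets the decoder rule out invalid paths.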

Trellis for decoding

Trellis for an input sequence

Error detection

Branch metric = “Distance” between received symbol pair and possible symbol pairs

Path metric = Accumulated error metric

Error-correction

Stages in Viterbi Decoding

Calculate Branch metrics for all states every stage

Update Path metrics for all states every stage

At the end, trace back through the trellis to get the decoded bits

Computations

Branch metrics:
  Hamming distance: (XOR) and count 1’s
  Euclidean distance: squared distance

Path metrics:
  Add branch metrics to existing path metrics
  Compare for minimum and select minimum

Survivor traceback:
  Linked list / pointer chasing

Memory intensive / sequential operations
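
The three computations above fit together as in the following hard-decision sketch. It assumes a rate 1/2 code with constraint length 3 and generators 7 and 5 (octal), an encoder that starts in state 0, and Hamming-distance branch metrics; these parameters and all function names are illustrative assumptions, not taken from the slides.

```python
def hamming(a, b):
    """Branch metric: number of positions where two symbols differ."""
    return sum(x != y for x, y in zip(a, b))


def branch(state, bit, generators, K):
    """Expected output symbol and next state for one trellis branch."""
    reg = (bit << (K - 1)) | state
    return [bin(reg & g).count("1") % 2 for g in generators], reg >> 1


def viterbi_decode(received, generators=(0b111, 0b101), K=3):
    n = len(generators)
    num_states = 1 << (K - 1)
    inf = float("inf")
    path_metric = [0] + [inf] * (num_states - 1)  # start in state 0
    survivors = []  # per stage: (previous state, input bit) for every state
    for i in range(0, len(received), n):
        symbol = received[i:i + n]
        new_metric = [inf] * num_states
        decision = [None] * num_states
        for state in range(num_states):
            if path_metric[state] == inf:
                continue  # state not reachable yet
            for bit in (0, 1):
                expected, nxt = branch(state, bit, generators, K)
                metric = path_metric[state] + hamming(symbol, expected)  # add
                if metric < new_metric[nxt]:  # compare
                    new_metric[nxt] = metric  # select
                    decision[nxt] = (state, bit)
        path_metric = new_metric
        survivors.append(decision)
    # Traceback: start from the best final state and chase the pointers back.
    state = min(range(num_states), key=path_metric.__getitem__)
    decoded = []
    for decision in reversed(survivors):
        state, bit = decision[state]
        decoded.append(bit)
    return decoded[::-1]


clean = [1, 1, 1, 0, 0, 0, 0, 1]  # encoding of 1 0 1 1 under the assumed code
noisy = [1, 1, 0, 0, 0, 0, 0, 1]  # same sequence with the third bit flipped
print(viterbi_decode(clean))  # [1, 0, 1, 1]
print(viterbi_decode(noisy))  # [1, 0, 1, 1]
```

The traceback loop reads one survivor entry per stage, in reverse order, and each read depends on the previous one. That is the pointer-chasing, memory-intensive behaviour the slide refers to.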

Viterbi support in different processors

C54x
  Special hardware accelerator
  ACS unit with 2 ACC and split ALU
  Viterbi butterfly (2 ACS) in 4 cycles

C62x
  Nothing special

C6416
  Viterbi coprocessor
  K = 5-9, Rate = 1/2, 1/3, 1/4
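
A minimal sketch of what one Viterbi butterfly (2 ACS) computes. It assumes the common formulation in which two old states feed two new states and the branch metrics within a butterfly differ only in sign; the function names, the sign convention and the minimum-metric choice are illustrative assumptions, not the C54x instruction sequence.

```python
def acs(metric_a, metric_b):
    """Compare two candidate path metrics and select the smaller one.

    Returns (surviving metric, decision bit), where the decision bit records
    which candidate won and is what the traceback later reads.
    """
    return (metric_a, 0) if metric_a <= metric_b else (metric_b, 1)


def butterfly(pm_even, pm_odd, bm):
    """Two add-compare-select operations sharing two old path metrics."""
    new_low = acs(pm_even + bm, pm_odd - bm)
    new_high = acs(pm_even - bm, pm_odd + bm)
    return new_low, new_high


print(butterfly(10, 12, 3))  # ((9, 1), (7, 0))
```

Each butterfly needs four additions, two comparisons and two selections. A split ALU with two accumulators can pair these operations, which is consistent with the 4-cycle figure quoted for the C54x.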

Viterbi Coprocessor in C6416

State metric (SM), soft decision (SD) and hard decision (HD) memories are not accessible to the DSP

Need for VSP architecture

Large amount of memory access

Traceback decoding

Not efficient on a general-purpose processor (GPP)

The number of program instructions on a GPP is of a higher order than the complexity of the algorithm

VSP architecture

Branch Metric Calculation

Path Metric Calculation

Traceback Unit

Traceback with survivor updates

Start Filling the Trellis

Start Traceback (5 * Constraint Length)

Symbol Decoded

Update Survivor Path for most recent symbol

Survivor Path Updates

Circular updates

Software Programming

Small but specialized instruction set: LOAD, ACS

Shorter execution time

All 3 subprocessors programmed independently

10 ns cycle time (100 MHz) in 1990 to get 1.5 Mbps
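
The cycle time and throughput above imply a per-bit cycle budget. This is a back-of-envelope sketch; the function name is hypothetical, and the second figure assumes, purely for illustration, that the same number of cycles per decoded bit would be needed at 100 Mbps.

```python
def cycles_per_bit(cycle_time_ns, bit_rate_bps):
    """Clock cycles available for each decoded bit."""
    clock_hz = 1e9 / cycle_time_ns
    return clock_hz / bit_rate_bps


budget = cycles_per_bit(10, 1.5e6)
print(f"{budget:.1f} cycles per decoded bit")

# Clock needed for 100 Mbps if the per-bit cycle count stayed the same.
required_clock_hz = budget * 100e6
print(f"{required_clock_hz / 1e9:.2f} GHz")
```

At roughly 67 cycles per bit, simply scaling the clock is not a realistic route to 100 Mbps, so the cycle count per decoded bit would have to drop as well.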

Conclusions

The Viterbi algorithm is an important part of a programmable communication receiver

Approaches so far have been co-processor support for DSPs or specialized processors.

We are yet to design programmable processors that meet real-time requirements for 100 Mbps applications.
