www.eecs.umich.edu/~sdrg

Design and Implementation of Turbo Decoders for Software Defined Radio

Yuan Lin¹, Scott Mahlke¹, Trevor Mudge¹, Chaitali Chakrabarti², Alastair Reid³, Krisztian Flautner³

¹ Advanced Computer Architecture Lab, University of Michigan
² Department of Electrical Engineering, Arizona State University
³ ARM, Ltd.


Advantages of Software Defined Radio

• Multi-mode operations
• Lower costs
  – Faster time to market
  – Prototyping and bug fixes
  – Chip volumes
  – Longevity of platforms
• Protocol complexity favors software-dominated solutions
• Enables future wireless communication innovations
  – Cognitive radio

[Figure: SDR as a single platform for UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, and 802.11n]


SDR Design Objectives for W-CDMA

• Programmable processor
  – Same hardware should support the Turbo decoder as well as other DSP algorithms
• Throughput requirements
  – 2 Mbps
• Power constraints
  – 100 mW to 500 mW


SODA: DSP Processor for SDR

[Figure: SODA DSP block diagram. Numbered blocks: 1. SIMD pipeline; 2. Scalar pipeline; 3. Local memory; 4. AGU pipeline; 5. DMA. Component labels: SIMD Reg. File, SIMD ALU + Mult, SIMD Shuffle Network (SSN), Pred. Regs, SIMD-to-Scalar (VtoS) ALU, Scalar RF, Scalar ALU, STV, VTS, Local SIMD Memory, Local Scalar Memory, AGU RF, AGU ALU, EX and WB stages, DMA, Global Memory, System Interconnect.]

• 2-issue LIW (400 MHz)
  – SIMD + (Scalar or AGU)
  – 2 local scratchpad memories
• Global scratchpad memory
  – Accessible only through DMA
• DMA
  – Global memory I/O
  – Inter-processor communication


SODA PE SIMD Pipeline

[Figure: SODA DSP block diagram with the SIMD pipeline expanded into a 32-way SIMD datapath. Each lane is drawn with a 16-entry, 16-bit register file, a 16-bit multiplier, a 16-bit ALU, and EX and WB stages.]

SIMD:
– 16-bit datapath
– 32 wide
– Predicated execution
– Predicated negation
– 32 40-bit accumulators (ACC)
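
The predicated SIMD operations listed for this pipeline can be illustrated with a small lane-by-lane model. This is only a sketch: the function names, the per-lane boolean predicate format, and the wrap-around overflow rule are assumptions for illustration, not SODA's documented instruction semantics.

```python
LANES = 32  # SIMD width stated on the slide

def wrap16(x):
    """Wrap an integer into the signed 16-bit range (assumed overflow rule)."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def predicated_move(dst, src, pred):
    """Lane i takes src[i] where pred[i] is set, otherwise keeps dst[i]."""
    return [s if p else d for d, s, p in zip(dst, src, pred)]

def predicated_negate(vec, pred):
    """Lane i is negated where pred[i] is set, otherwise passed through."""
    return [wrap16(-v) if p else v for v, p in zip(vec, pred)]

a = list(range(LANES))
b = [100 + i for i in range(LANES)]
even_lanes = [i % 2 == 0 for i in range(LANES)]  # hypothetical predicate

moved = predicated_move(a, b, even_lanes)
negated = predicated_negate(a, even_lanes)
print(moved[:4])    # [100, 1, 102, 3]
print(negated[:4])  # [0, 1, -2, 3]
```

A predicated move of this kind is what the later trellis slides label as the "SIMD predicated move" used for interleaving two vectors.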


SODA PE SIMD Shuffle Network

[Figure: SODA DSP block diagram with the SIMD Shuffle Network (SSN) highlighted.]

SIMD Shuffle Network
– Shuffle Exchange (SE)
– Inverse Shuffle Exchange (ISE)
– Exchange Only (EX)
– Iterative feedback

• Any 32-wide SIMD permutation can be done with 9 iterations of either 32-wide SE or 32-wide ISE
• Includes SE, ISE, and EX to reduce the number of iterations for trellis permutation patterns
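
The three permutation patterns can be sketched as list operations. The definitions below are the textbook ones (perfect shuffle, its inverse, and adjacent-pair exchange) and are an assumption; the slides do not specify how SODA's SSN wires or combines them, and the function names are illustrative.

```python
def shuffle(v):
    """Perfect shuffle: interleave the lower and upper halves of v."""
    half = len(v) // 2
    out = []
    for i in range(half):
        out.append(v[i])
        out.append(v[half + i])
    return out

def inverse_shuffle(v):
    """Inverse perfect shuffle: even-indexed lanes first, then odd-indexed."""
    return v[0::2] + v[1::2]

def exchange(v):
    """Exchange: swap each adjacent pair of lanes."""
    out = list(v)
    for i in range(0, len(v) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

lanes = list(range(8))
print(shuffle(lanes))          # [0, 4, 1, 5, 2, 6, 3, 7]
print(inverse_shuffle(lanes))  # [0, 2, 4, 6, 1, 3, 5, 7]
print(exchange(lanes))         # [1, 0, 3, 2, 5, 4, 7, 6]
```

With iterative feedback, the output of one pass is fed back as the input of the next, so a longer permutation is built from repeated passes through the same network.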


SODA PE Scalar Pipeline

[Figure: SODA DSP block diagram with the scalar pipeline highlighted.]

Scalar:
– One 16-bit datapath
– No multiply unit

Scalar memory:
– 16-bit port
– 1 read/write port
– 4 KBytes

Pipeline links:
– Scalar-to-Vector (STV)
– Vector-to-Scalar (VTS)


Turbo Decoder on SODA

• Most computationally intensive algorithm in W-CDMA
• Hardest algorithm to parallelize
• Implementation outline
  – Max-Log-MAP trellis computation with SIMD operations
  – Parallelizing trellis computations through sliding window
  – Interleaver implementation

[Figure: Turbo decoder block diagram. Blocks: Demux, SISO decoder 1, Interleaver, SISO decoder 2, De-Interleaver. Signals: Input, ys & yp1, ys & yp2, L1ex, L2ex, Output.]


Trellis Computation on SODA

• Two types of trellis diagram configurations
  – Blue edges: 0-branch; red edges: 1-branch
• Mapping a trellis of size S onto a SODA of SIMD size T

[Figure: forward trellis and backward trellis, each with 8 states (0 to 7) drawn at state[t], state[t+1], and state[t+2] along the time axis.]


Forward Trellis on SODA (S = T)

[Figure: forward trellis step on 8 SIMD lanes. Two 4-bit inputs (In) and per-lane coefficients m0 to m7 feed multiply units (M) that produce branch metrics b0 to b7. The branch metrics are added (+) to state metrics s0 to s7, and a compare (>) produces the new s0 to s7.]

Branch metric calculations (BMC)
M: b[i] = In[0]*m[i][0] + In[1]*m[i][1]

Add-Compare-Select calculations (ACS)
– Misaligned SIMD operation
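
The BMC formula and one ACS step can be sketched in scalar form as follows. The 4-state trellis, the ±1 coefficient patterns, and all function and variable names are hypothetical and chosen only to keep the example small; they are not the W-CDMA trellis. The max operation stands in for the Max-Log-MAP compare-select.

```python
def branch_metrics(inp, m):
    """b[i] = inp[0]*m[i][0] + inp[1]*m[i][1], one value per lane."""
    return [inp[0] * mi[0] + inp[1] * mi[1] for mi in m]

def acs_forward(alpha, b0, b1, prev0, prev1):
    """One forward add-compare-select step.

    prev0[s] / prev1[s] are the source states of the 0-branch and 1-branch
    entering state s; b0[s] / b1[s] are the metrics of those branches.
    """
    new_alpha = []
    for s in range(len(alpha)):
        cand0 = alpha[prev0[s]] + b0[s]      # add
        cand1 = alpha[prev1[s]] + b1[s]      # add
        new_alpha.append(max(cand0, cand1))  # compare-select
    return new_alpha

inp = [3, -2]                                # two soft inputs
m0 = [(1, 1), (1, -1), (1, 1), (1, -1)]      # assumed 0-branch coefficients
m1 = [(-1, -1), (-1, 1), (-1, -1), (-1, 1)]  # assumed 1-branch coefficients
prev0 = [0, 3, 1, 2]                         # assumed trellis connectivity
prev1 = [1, 2, 0, 3]

b0 = branch_metrics(inp, m0)
b1 = branch_metrics(inp, m1)
alpha = acs_forward([0, -5, -3, -8], b0, b1, prev0, prev1)
print(b0, b1)  # [1, 5, 1, 5] [-1, -5, -1, -5]
print(alpha)   # [1, -3, -1, 2]
```

The indexed reads `alpha[prev0[s]]` and `alpha[prev1[s]]` are where the misalignment comes from: on a SIMD datapath the operands for lane s do not sit in lane s, so they have to be permuted into place before a lane-wise compare.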


Handling SIMD Misalignment

[Figure: resolving the misaligned ACS operands for state metrics s0 to s7 in three steps. Intermediate lane orders shown: s0 s2 s4 s6 s1 s3 s5 s7 and s1 s3 s5 s7 s0 s2 s4 s6.]

Step 1: SIMD interleaving (SIMD predicated move)
Step 2: SIMD permutation (ISE, EX)
Step 3: SIMD compare-select
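
The idea behind permuting before comparing can be checked with a small model: move each candidate metric into the lane of its destination state, then take a lane-wise maximum. The sketch below uses a generic index permutation in place of the SSN's ISE/EX passes and folds the interleaving step into that permutation; the 4-state trellis and all names are hypothetical.

```python
def permute(v, src):
    """Return out where out[j] = v[src[j]] (a generic lane permutation)."""
    return [v[k] for k in src]

def compare_select(a, b):
    """Lane-wise maximum, standing in for a SIMD compare-select."""
    return [max(x, y) for x, y in zip(a, b)]

# cand0[i] / cand1[i]: path metrics leaving state i on its 0-branch / 1-branch.
cand0 = [1, -4, 2, -3]
cand1 = [-1, -6, -8, -13]
# next0[i] / next1[i]: destination state of each branch (assumed trellis).
next0 = [0, 2, 3, 1]
next1 = [2, 0, 1, 3]

# Scalar reference: send every candidate to its destination, keep the maximum.
ref = [float("-inf")] * 4
for i in range(4):
    ref[next0[i]] = max(ref[next0[i]], cand0[i])
    ref[next1[i]] = max(ref[next1[i]], cand1[i])

# SIMD-style: invert the destination maps so lane j reads the candidates
# entering state j, then compare-select lane by lane.
src0 = [next0.index(j) for j in range(4)]
src1 = [next1.index(j) for j in range(4)]
aligned = compare_select(permute(cand0, src0), permute(cand1, src1))

print(aligned)  # [1, -3, -1, 2]
```

Because the trellis connectivity is fixed, the permutation pattern is the same at every time step and can be set up once.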


Sliding Window on SODA

• Problem:
  – W-CDMA uses K = 4, giving an 8-wide trellis
  – SODA has 32-wide SIMD
• Solution:
  – Parallelize the trellis computation by implementing a sliding window
    • Fully utilizes the SIMD width
    • Achieves higher throughput in the process


Sliding Window Parallelization

[Figure: timelines comparing the SODA sliding window implementation with a typical ASIC sliding window implementation. The SODA timeline shows blocks b00 to b22 grouped into window1, window2, and window3; the ASIC timeline shows blocks b0 to b6. Legend: LLC; beta metric stored; alpha and beta calculation; alpha and beta dummy calculation.]
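
The partitioning behind a parallel sliding window can be sketched as follows: the block is cut into sub-blocks that are processed side by side, and each sub-block runs a dummy (warm-up) calculation over the preceding samples to initialize its metrics. The function name, the forward-only view, and the example sizes are assumptions for illustration.

```python
def split_windows(block_len, num_windows, dummy_len):
    """Split [0, block_len) into num_windows contiguous sub-blocks.

    Returns (warmup_start, start, end) per sub-block for a forward (alpha)
    recursion: metrics are warmed up over [warmup_start, start) and kept
    over [start, end).
    """
    size = -(-block_len // num_windows)  # ceiling division
    windows = []
    for w in range(num_windows):
        start = w * size
        end = min(start + size, block_len)
        if start >= end:
            break
        warmup_start = max(0, start - dummy_len)
        windows.append((warmup_start, start, end))
    return windows

print(split_windows(block_len=40, num_windows=4, dummy_len=6))
# [(0, 0, 10), (4, 10, 20), (14, 20, 30), (24, 30, 40)]
```

The example uses four windows because an 8-wide trellis fits four times into a 32-wide SIMD datapath. For the backward (beta) recursion the warm-up region would lie after the sub-block rather than before it.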


Sliding Window on SODA (S < T)

[Figure: two trellis sub-blocks packed into one SIMD operation. Lanes s0 to s3 hold the trellis state for the 1st sub-block and lanes t0 to t3 hold the trellis state for the 2nd sub-block.]

Step 1: SIMD interleaving (SIMD predicated move)
Step 2: SIMD permutation (EX, SE, ISE)
Step 3: SIMD compare-select (CS)


Turbo Decoder System Operations

[Figure: three Turbo decoder schedules drawn along the time axis.]
– Serial: Interleaving, Beta, Alpha+LLC stages
– Parallel: Interleaving, Dummy+Beta, Dummy+Alpha+LLC stages
– Parallel + Overlap: Interleaving, Dummy+Beta, Dummy+Alpha+LLC stages


SODA DMA Modifications

• Traditional DMA controller
  – Designed for block data transfer
  – 1 source and 1 destination address per block
• Modified DMA controller
  – Adds data interleaving functionality to the DMA
  – Needs to handle scalar data transfers
  – 1 source and 1 destination address per scalar
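
The difference between the two addressing schemes can be sketched with plain lists standing in for memories. The function names and the 4-element interleaver pattern are hypothetical; the real W-CDMA interleaver pattern and the DMA programming interface are not given in the slides.

```python
def dma_block_copy(src, src_addr, dst, dst_addr, length):
    """Block transfer: one source and one destination address per block."""
    dst[dst_addr:dst_addr + length] = src[src_addr:src_addr + length]

def dma_scalar_copy(src, src_addrs, dst, dst_addrs):
    """Scalar transfer: one source and one destination address per element."""
    for s, d in zip(src_addrs, dst_addrs):
        dst[d] = src[s]

# Hypothetical 4-element interleaver pattern (not the W-CDMA interleaver).
pattern = [2, 0, 3, 1]
data = [10, 11, 12, 13]

interleaved = [None] * 4
dma_scalar_copy(data, range(4), interleaved, pattern)      # interleave
restored = [None] * 4
dma_scalar_copy(interleaved, pattern, restored, range(4))  # de-interleave
print(interleaved)  # [11, 13, 10, 12]
print(restored)     # [10, 11, 12, 13]
```

With per-scalar addresses, the same transfer engine performs interleaving and de-interleaving just by swapping which side uses the pattern.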


Achieved Performance on SODA

• SODA operates at 400 MHz

• Can achieve 2.08 Mbps with I = 5 Turbo iterations

[Equation: Turbo decoder throughput model (image not preserved). Annotated terms:]
– Overall Turbo decoder throughput
– SODA operation frequency
– Number of Turbo iterations
– Number of sliding windows processed in parallel
– Average number of cycles for one trellis block (Tblock)
– Size of one trellis block (L)
– Cycles for 1-bit trellis computation = Tblock / L
– Dummy calculation
– 1 bit of Alpha, Beta and LLC computation
– Data memory access
– Extrinsic scaling
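
As a sanity check on the reported figures, the cycle budget they imply can be worked out directly. This is plain arithmetic on the numbers from the slide, not a reconstruction of the throughput equation; the variable names are illustrative.

```python
clock_hz = 400e6         # SODA clock frequency (from the slide)
throughput_bps = 2.08e6  # reported Turbo decoder throughput
iterations = 5           # number of Turbo iterations, I

# Cycles available per decoded bit, in total and per Turbo iteration.
cycles_per_bit = clock_hz / throughput_bps
cycles_per_bit_per_iteration = cycles_per_bit / iterations

print(f"{cycles_per_bit:.1f} cycles per decoded bit")
print(f"{cycles_per_bit_per_iteration:.1f} cycles per bit per iteration")
```

At 2.08 Mbps the decoder spends roughly 192 cycles per decoded bit, or about 38 cycles per bit in each of the five iterations, which is just above the 2 Mbps W-CDMA requirement stated earlier.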


Conclusion & Future Work

• Implementation summary
  – SODA consumes <100 mW in 90 nm
  – Meets W-CDMA throughput requirements
  – Hardware features
    • Wide SIMD execution
    • SIMD permutation network
    • Smart DMA
• Beyond 3G
  – Support for higher-throughput 3G+ protocols
    • Multi-processor SODA for Turbo decoder
  – LDPC decoding


Questions?

• www.eecs.umich.edu/~sdrg
