Post on 22-Dec-2015
www.eecs.umich.edu/~sdrg
Design and Implementation of Turbo Decoders for Software Defined Radio
Yuan Lin (1), Scott Mahlke (1), Trevor Mudge (1), Chaitali Chakrabarti (2), Alastair Reid (3), Krisztian Flautner (3)
(1) Advanced Computer Architecture Lab, University of Michigan
(2) Department of Electrical Engineering, Arizona State University
(3) ARM, Ltd.
Advantages of Software Defined Radio
• Multi-mode operations
• Lower costs
– Faster time to market
– Prototyping and bug fixes
– Chip volumes
– Longevity of platforms
• Protocol complexity favors software dominated solutions
• Enables future wireless communication innovations
– Cognitive radio
[Figure: SDR surrounded by the protocols it targets: UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, 802.11n]
SDR Design Objectives for W-CDMA
• Programmable processor
– Same hardware should support Turbo decoder as well as other DSP algorithms
• Throughput requirements
– 2 Mbps
• Power constraints
– 100 mW ~ 500 mW
SODA: DSP Processor for SDR
[Figure: SODA DSP block diagram. Labeled parts: (1) SIMD pipeline: SIMD register file, SIMD ALU + multiplier, SIMD shuffle network (SSN), predicate registers; (2) scalar pipeline: scalar register file, scalar ALU; (3) local memory: local SIMD memory, local scalar memory; (4) AGU pipeline: AGU register file, AGU ALU; (5) DMA. Also shown: scalar-to-vector (STV) and vector-to-scalar (VTS) paths, a SIMD-to-scalar (VtoS) ALU, global memory, and the system interconnect.]
• 2 issue LIW (400MHz)
– SIMD + (Scalar or AGU)
– 2 local scratchpad memory
• Global scratchpad memory
– accessible only through DMA
• DMA
– global memory IO
– inter-processor communication
SODA PE SIMD Pipeline
[Figure: SODA PE block diagram, detailing the SIMD pipeline. The 32-way SIMD unit is drawn as replicated lanes, each with a 16-bit, 16-entry register file, a 16-bit multiplier, and a 16-bit ALU across the EX and WB stages.]
SIMD:
– 16-bit datapath
– 32 wide
– predicated exec.
– predicated neg.
– 32 40-bit ACC
SODA PE SIMD Shuffle Network
[Figure: SODA PE block diagram, detailing the SIMD shuffle network (SSN)]
SIMD Shuffle Network:
• Shuffle Exchange (SE)
• Inverse Shuffle Exchange (ISE)
• Exchange Only (EX)
• Iterative Feedback
– Any 32-wide SIMD permutation can be done with 9 iterations of either 32-wide SE or 32-wide ISE
– Includes SE, ISE, and EX to shorten the number of iterations for trellis permutation patterns
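The three permutation primitives named on this slide can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, each primitive is modeled as a fixed permutation of a lane list, and the exchange is assumed to swap every adjacent pair (any per-pair control in the real network is not modeled).

```python
def shuffle(v):
    """Perfect shuffle: interleave the first half of the lanes with the second half."""
    half = len(v) // 2
    out = []
    for lo, hi in zip(v[:half], v[half:]):
        out += [lo, hi]
    return out


def inverse_shuffle(v):
    """Inverse perfect shuffle: even-indexed lanes first, then odd-indexed lanes."""
    return v[0::2] + v[1::2]


def exchange(v):
    """Swap the two lanes of every adjacent pair (assumed behavior of EX)."""
    out = list(v)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


lanes = list(range(8))
print(inverse_shuffle(lanes))                    # [0, 2, 4, 6, 1, 3, 5, 7]
print(shuffle(lanes))                            # [0, 4, 1, 5, 2, 6, 3, 7]
print(exchange(lanes))                           # [1, 0, 3, 2, 5, 4, 7, 6]
print(shuffle(inverse_shuffle(lanes)) == lanes)  # True
```

The inverse shuffle of eight lanes gives the even-then-odd order that the later trellis slides use for the state metrics.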
SODA PE Scalar Pipeline
[Figure: SODA PE block diagram, detailing the scalar pipeline]
Scalar:
– One 16-bit datapath
– No mult unit
Scalar memory:
– 16bit port
– 1 read/write port
– 4 KBytes
Scalar-to-Vector (STV)
Vector-to-Scalar (VTS)
Turbo Decoder on SODA
• Most computationally intensive algorithm in W-CDMA
• Hardest algorithm to parallelize
• Implementation outline
– MaxLogMAP trellis computation with SIMD operations
– Parallelizing trellis computations through sliding window
– Interleaver implementation
[Figure: Turbo decoder block diagram. Blocks: Demux, SISO decoder 1, Interleaver, SISO decoder 2, De-Interleaver. Signals: Input, ys & yp1, ys & yp2, L1ex, L2ex, Output.]
Trellis Computation on SODA
• Two types of trellis diagram configurations
– Blue edges: (0-branch), Red edges: (1-branch)
• Mapping trellis of size S onto SODA of SIMD size T
[Figure: two 8-state trellis diagrams, Forward Trellis and Backward Trellis, each drawn over state[t], state[t+1], state[t+2] along the time axis]
Forward Trellis on SODA (S = T)
[Figure: one forward trellis step on SODA for S = T, drawn on 8 lanes]
– Branch metric calculations (BMC): from 2 4-bit inputs In and the per-lane values m0..m7, each lane computes its branch metric b0..b7
M : b[i] = In[0]*m[i][0] + In[1]*m[i][1]
– Add-Compare-Select calculations (ACS): b0..b7 are added (+) to the state metrics s0..s7, then a compare (>) selects the new s0..s7; this is a misaligned SIMD operation
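The branch metric formula on this slide, b[i] = In[0]*m[i][0] + In[1]*m[i][1], can be written out as a minimal Python sketch. The +1/-1 values used for m and the sample inputs are assumptions made for illustration; the slides do not list them.

```python
def branch_metrics(inp, m):
    """Compute b[i] = In[0]*m[i][0] + In[1]*m[i][1] for every lane i."""
    return [inp[0] * mi[0] + inp[1] * mi[1] for mi in m]


# Hypothetical +1/-1 patterns for 8 lanes (not taken from the slides).
m = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)] * 2

# Two signed soft inputs; a 4-bit signed range of -8..7 is assumed.
soft_inputs = (3, -5)

b = branch_metrics(soft_inputs, m)
print(b)  # [-2, 8, -8, 2, -2, 8, -8, 2]
```

On a SIMD machine the list comprehension corresponds to one multiply-accumulate per lane, with all lanes sharing the same two input values.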
Handling SIMD Misalignment
[Figure: handling SIMD misalignment on 8 lanes, in three steps]
– step 1: SIMD interleaving (SIMD predicated move)
– step 2: SIMD permutation (ISE and EX), reordering lanes s0..s7 into orders such as s0 s2 s4 s6 s1 s3 s5 s7 and s1 s3 s5 s7 s0 s2 s4 s6
– step 3: SIMD compare select (>), producing s0..s7
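Why the compare-select is misaligned can be seen in a small scalar model of one forward step. The sketch below is not the SODA instruction sequence: the routing loop stands in for the interleaving and permutation steps, and the trellis connectivity (state s moves to (2*s + u) mod 8 on input bit u) is a hypothetical shift-register pattern, not the W-CDMA constituent code.

```python
NUM_STATES = 8


def next_state(s, u):
    """Hypothetical connectivity: shift the state left and append input bit u."""
    return (2 * s + u) % NUM_STATES


def acs_forward(alpha, b0, b1):
    """One max-log forward step: add, route to destination lanes, compare-select."""
    # Add: one candidate per source lane and per branch type.
    sum0 = [alpha[s] + b0[s] for s in range(NUM_STATES)]
    sum1 = [alpha[s] + b1[s] for s in range(NUM_STATES)]

    # Route: candidates sit in their source lanes, but the comparison must
    # happen in the destination lane. This is the misalignment that the
    # interleaving and permutation steps resolve on the SIMD hardware.
    candidates = [[] for _ in range(NUM_STATES)]
    for s in range(NUM_STATES):
        candidates[next_state(s, 0)].append(sum0[s])
        candidates[next_state(s, 1)].append(sum1[s])

    # Compare-select: keep the larger of the two incoming path metrics.
    return [max(c) for c in candidates]


alpha = [0, 1, 2, 3, 4, 5, 6, 7]
zero = [0] * NUM_STATES
print(acs_forward(alpha, zero, zero))  # [4, 4, 5, 5, 6, 6, 7, 7]
```

With this connectivity each destination state j receives exactly two candidates, from source states j // 2 and j // 2 + 4, which is why the two operand vectors need different lane orders before a lane-wise maximum can be taken.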
Sliding Window on SODA
• Problem:
– W-CDMA uses K=4, 8 wide trellis
– SODA has 32-wide SIMD
• Solution:
– parallelize trellis computation by implementing sliding window
  • fully utilize SIMD width
  • achieving higher-throughput in the process
Sliding Window Parallelization
[Figure: two sliding window schedules drawn along a time axis. Legend: LLC; beta metric stored; alpha and beta calculation; alpha and beta dummy calculation.]
– Typical ASIC sliding window implementation: sub-blocks b0..b6
– SODA sliding window implementation: sub-blocks b00..b02, b10..b12, b20..b22 across window1, window2, window3
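A minimal Python sketch of how a block might be cut into sliding windows with dummy (warm-up) regions, and how windows could be grouped to fill the SIMD width. The block length, window length, and dummy length are hypothetical values chosen for illustration; only the 8-wide trellis and the 32-wide SIMD come from the slides.

```python
def window_plan(block_len, window_len, dummy_len):
    """Return one (alpha_dummy_start, start, end, beta_dummy_end) tuple per window.

    The alpha recursion of a window warms up on [alpha_dummy_start, start),
    the beta recursion warms up on [end, beta_dummy_end), and both ranges are
    clipped to the block. All end indices are exclusive.
    """
    plan = []
    for start in range(0, block_len, window_len):
        end = min(start + window_len, block_len)
        alpha_dummy_start = max(0, start - dummy_len)
        beta_dummy_end = min(block_len, end + dummy_len)
        plan.append((alpha_dummy_start, start, end, beta_dummy_end))
    return plan


def parallel_groups(plan, simd_width=32, trellis_width=8):
    """Group windows so that each group fills the SIMD lanes (32 // 8 = 4 windows)."""
    per_group = simd_width // trellis_width
    return [plan[i:i + per_group] for i in range(0, len(plan), per_group)]


plan = window_plan(block_len=100, window_len=32, dummy_len=8)
for window in plan:
    print(window)
print(len(parallel_groups(plan)))  # 1
```

The dummy regions are the price of starting every window without knowing the true boundary metrics; they add computation but let the windows run side by side.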
Sliding Window on SODA (S < T)
[Figure: sliding window ACS on SODA for S < T, drawn with two 4-state sub-blocks packed into 8 lanes: s0..s3 (trellis state for 1st sub-block) and t0..t3 (trellis state for 2nd sub-block)]
– step 1: SIMD interleaving (SIMD predicated move)
– step 2: SIMD permutation (EX, SE, and ISE), giving lane orders such as s0 s2 t0 t2 s1 s3 t1 t3 and s1 s3 t1 t3 s0 s2 t0 t2
– step 3: SIMD CS (>), producing s0..s3 and t0..t3
Turbo Decoder System Operations
[Figure: timelines of three decoder schedules along a time axis, with the operations shown in each]
– Serial: Interleaving, Alpha+LLC, Beta
– Parallel: Interleaving, Dummy+Alpha+LLC, Dummy+Beta
– Parallel + Overlap: Interleaving, Dummy+Alpha+LLC, Dummy+Beta
SODA DMA Modifications
• Traditional DMA controller
– Designed for block data transfer
– 1 source and 1 destination address per block
• Modified DMA controller
– Adding data interleaving functionality to DMA
– Needs to handle scalar data transfers
– 1 source and 1 destination address per scalar
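The difference between the two DMA styles can be illustrated with a short Python model of memory as a list. The function names and the four-element address pattern are hypothetical; the real W-CDMA interleaver pattern is defined by the standard and is not reproduced here.

```python
def block_transfer(src, dst, src_addr, dst_addr, length):
    """Traditional DMA: one source and one destination address for the whole block."""
    dst[dst_addr:dst_addr + length] = src[src_addr:src_addr + length]


def interleaved_transfer(src, dst, src_addrs, dst_addrs):
    """Modified DMA: one source and one destination address per scalar element."""
    for s, d in zip(src_addrs, dst_addrs):
        dst[d] = src[s]


data = [10, 11, 12, 13]
pattern = [3, 0, 2, 1]  # hypothetical interleaver addresses

# Interleave: read in permuted order, write sequentially.
interleaved = [None] * 4
interleaved_transfer(data, interleaved, pattern, range(4))
print(interleaved)  # [13, 10, 12, 11]

# De-interleave: read sequentially, write in permuted order.
restored = [None] * 4
interleaved_transfer(interleaved, restored, range(4), pattern)
print(restored)     # [10, 11, 12, 13]
```

Moving the permutation into the transfer itself means the data arrives already interleaved, so no separate reordering pass over the block is needed.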
Achieved Performance on SODA
• SODA operates at 400MHz
• Can achieve 2.08Mbps with I = 5
[Equations: throughput model. The formulas did not survive extraction; only the annotations on their terms remain.]
– Average number of cycles for one trellis block
– dummy calculation
– size of one trellis block
– 1 bit of Alpha, Beta and LLC computation
– data memory access
– Number of sliding windows processed in parallel
– Overall Turbo decoder throughput
– SODA operation frequency
– Number of Turbo iterations
– Cycles for 1 bit trellis computation = Tblock/L
– Extrinsic scaling
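The two reported figures (400MHz clock, 2.08Mbps with I = 5) imply a cycle budget per decoded bit. The Python arithmetic below only back-computes that budget; it is not the slide's throughput model, and it assumes that every clock cycle is spent on decoding and that the cost is spread evenly over the iterations.

```python
clock_hz = 400e6        # SODA operating frequency (from the slide)
throughput_bps = 2.08e6 # reported Turbo decoder throughput (from the slide)
iterations = 5          # I = 5 Turbo iterations (from the slide)

# Total cycles available for each decoded bit, across all iterations.
cycles_per_bit = clock_hz / throughput_bps

# Even split over iterations (an assumption made for this estimate).
cycles_per_bit_per_iteration = cycles_per_bit / iterations

print(round(cycles_per_bit, 1))                # 192.3
print(round(cycles_per_bit_per_iteration, 1))  # 38.5
```

Each iteration's share has to cover the alpha, beta, and LLC computation, the dummy calculations, and the memory accesses listed in the annotations.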
Conclusion & Future Work
• Implementation summary
– SODA consumes <100mW in 90nm
– Meets W-CDMA throughput requirements
– Hardware features
  • wide SIMD execution
  • SIMD permutation network
  • smart DMA
• Beyond 3G
– Support for higher throughput 3G+ protocols
  • Multi-processor SODA for Turbo decoder
– LDPC decoding