Design of a transport triggered vector processor for turbo decoding
Shahriar Shahabuddin • Janne Janhunen •
Markku Juntti • Amanullah Ghazi •
Olli Silven
Received: 29 March 2013 / Revised: 24 August 2013 / Accepted: 20 September 2013 / Published online: 15 October 2013
© Springer Science+Business Media New York 2013
Abstract In order to meet the requirement of high data
rates for next generation wireless systems, efficient
implementations of receiver algorithms are essential. On
the other hand, faster time-to-market motivates the inves-
tigation of programmable implementations. This paper
presents a novel design of a programmable turbo decoder
as an application-specific instruction-set processor (ASIP)
using transport triggered architecture (TTA). The processor
architecture is designed in such a manner that it can be
programmed with high level language to support different
suboptimal maximum a posteriori (MAP) algorithms in a
single TTA processor. The design enables the designer to
change the algorithms according to the frame error rate
performance requirement. A quadratic polynomial permu-
tation interleaver is used for contention-free memory
access and to make the processor 3GPP LTE compliant.
Several optimization techniques to enable real time pro-
cessing on programmable platforms are introduced. The
essential parts of the turbo decoding algorithm are designed
with vector function units. Unlike most other turbo decoder
ASIPs, high level language is used to program the pro-
cessor to meet the time-to-market requirements. With a
single iteration, 68.35 Mbps decoding speed is achieved for
the max-log-MAP algorithm at a clock frequency of
210 MHz on 90 nm technology.
Keywords TTA · ASIP · MAP · FER · QPP
1 Introduction
The turbo coding scheme [1] has been adopted for the air
interface standard called long term evolution (LTE), which has
been defined by the 3rd Generation Partnership Project (3GPP)
[2]. The decoding algorithms for the constituent decoders
are not specified by 3GPP. Different suboptimal algorithms
have been used for the turbo decoding and new solutions
with less complexity are still being invented quite fre-
quently. A programmable implementation simplifies the
necessary support for different suboptimal algorithms of
constituent decoders and different interleaving methods.
The hardware designs of turbo decoders are at a mature
stage due to the extensive efforts of researchers. The
hardware implementations provide high throughput, but the
development time is not as rapid as the processor based
implementations. Besides, the hardware implementations
suffer from inflexibility. On the other hand, a software
based implementation provides the required flexibility to
support multistandard solutions, but requires a careful
design to achieve the target throughput. Programmable
accelerators, which enable software-hardware co-design
method might be an attractive solution to overcome these
bottlenecks. The design of software and hardware together
to achieve the best performance and ensure programma-
bility is not a straightforward task. The designer needs a
very efficient tool, which can be used to design the pro-
cessor easily for a particular application.
In this paper, we propose a design of a processor based
on the transport triggered architecture (TTA) for a flexible
turbo decoder. TTA is a processor template for a pro-
grammable application-specific instruction-set processor
S. Shahabuddin (✉) · J. Janhunen · M. Juntti
Department of Communications Engineering and Centre for
Wireless Communications, University of Oulu, Oulu, Finland
e-mail: [email protected]
A. Ghazi · O. Silven
Department of Computer Science and Engineering, University of
Oulu, Oulu, Finland
Analog Integr Circ Sig Process (2014) 78:611–622
DOI 10.1007/s10470-013-0183-y
(ASIP). The TTA based codesign environment (TCE) tool
enables the designer to write an application with a high
level language and design the target processor in a graph-
ical user interface at the same time [3].
The turbo decoder based on the TTA was implemented
by Salmela et al. [4] and by us [5]. The throughput of the
processor in [4] was comparable to the turbo decoders
based on monolithic hardware design. However, the design
was less flexible in comparison with the design presented
in this paper. The interleaver of the processor was not
presented according to the 3GPP LTE standard [2].
Besides, the processor was programmed with assembly
language while the high level language is applied herein.
The design in [5] was flexible and able to support four
suboptimal algorithms. However, the design used more
computational units and provided less throughput than the
design presented in this paper. Besides, no synthesis
results were presented.
A TTA processor with vector function units is presented
in this paper. The main computational parts of the turbo
decoding algorithm use the vector units. The scalar func-
tion units are also used, for example, in loop indexing. The
quadratic permutation polynomial (QPP) interleaving pat-
tern is used for the interleaver block. The processor is
programmed with C language to shorten the time-to-mar-
ket. The processor achieves a maximum of 210 MHz
operating frequency and takes an area of 85.74 kgates with
90 nm technology. With a single full iteration 68.35 Mbps
throughput is achieved for the max-log-MAP algorithm for
a clock frequency of 210 MHz.
The main contribution of the work is the design of a
processor that is able to change the suboptimal turbo
decoding algorithms. If the channel condition is bad and
causes more errors, the more complex log-MAP algorithm
can be used. If the channel condition is good, the simpler
max-log-MAP algorithm can be used to achieve higher
throughput. Similarly, for different channel conditions
different suboptimal algorithms can be used. Therefore, a
decoder is needed which can accommodate all the subop-
timal algorithms. Under these circumstances, a program-
mable ASIP is a better choice than monolithic hardware
decoders, as it provides the required flexibility and reuse
capability of the hardware. In this paper, we propose a
processor for turbo decoding with all the well-known
suboptimal SISO algorithms.
The rest of the paper is organized as follows: In Sect. 2
an overview of the turbo decoder is presented. In Sect. 3
the simplification techniques of the turbo decoding algo-
rithm are considered. A brief description of the tool chain
to design a TTA processor is presented in Sect. 4. In Sect. 5
the vector special function units based on the simplification
technique are presented. The top level architecture of the
TTA processor and the programming techniques are
addressed in Sect. 6. In Sect. 7 the performance evaluation
of the design is presented. The conclusions are drawn in
Sect. 8.
2 Turbo decoder
2.1 Turbo decoder structure
The turbo decoder consists of two soft-input soft-output
(SISO) decoders, with interleavers and de-interleavers
between them as shown in Fig. 1. The inputs of the turbo
decoder come from the soft demodulator which produces
the log-likelihood ratios (LLR) for the systematic and
parity bits. The LLRs of the systematic bits, LuI, and the first
parity bits, LcI1, are fed to the first SISO decoder. The SISO
decoder produces soft outputs, i.e., the a posteriori prob-
ability and so called extrinsic information, based on these
LLRs. These soft outputs are used in the second SISO
decoder as the additional a priori information. The inputs
of the second SISO decoder are the LLRs coming from the
systematic bits, the second parity bits denoted by LcI2 and
the output of the first SISO decoder. The LLRs of the
systematic bits are reordered this time with the same
interleaving pattern used at the encoder. Similarly, the soft
outputs coming from the first SISO decoder are also
processed with the same interleaving pattern and used as a
priori values for the second SISO decoder.
The heart of the turbo coding is the iterative decoding
procedure. The output of the second SISO decoder does not
produce the hard outputs immediately, but the soft output is
used again in the first SISO decoder for more accurate
approximation. The process continues in a similar fashion
in an iterative manner. A single iteration by both the first
and the second SISO decoder is referred to as a full iter-
ation. On the other hand, the operation performed by a
Fig. 1 Block diagram of the turbo decoder
single SISO decoder can be referred to as a half iteration.
At the beginning of the first iteration, the a priori values
are set to zero. Six to eight full iterations are used to
achieve sufficient performance [1].
2.2 MAP algorithm for component decoder
The MAP algorithm for the constituent decoder applied here
has been proposed by Benedetto et al. [6]. The decoding is
done in smaller windows so that the decoding process can be
done in parallel and the decoder does not have to wait for the
whole block to arrive before starting the decoding process.
This windowing is sometimes referred to as a sliding window
method. The algorithm can be stated as follows:
1. Initialize the values of the forward state metric as
$\alpha_0(s) = 0$ if $s = S_0$ and $\alpha_0(s) = -\infty$ otherwise.
2. Calculate all the forward state metric values of the
same window through the forward recursion according
to
$\alpha_k(s) = \max_e^*\big(\alpha_{k-1}[s^S(e)] + u(e) L_{uI,k-1} + c_1(e) L_{cI1,k-1} + c_2(e) L_{cI2,k-1}\big), \quad (1)$
where $s^S(e)$ denotes the starting state and $s^E(e)$ denotes the
ending state of edge $e$. The input bits are represented by $u(e)$ and the
output bits are represented by $c_1(e)$ and $c_2(e)$.
3. Initialize the values of the backward state metric as
$\beta_n(s) = 0$ if $s = S_n$ and $\beta_n(s) = -\infty$ otherwise.
4. Calculate all the backward state metric values of the
same window through the backward recursion as
$\beta_k(s) = \max_e^*\big(\beta_{k+1}[s^E(e)] + u(e) L_{uI,k+1} + c_1(e) L_{cI1,k+1} + c_2(e) L_{cI2,k+1}\big). \quad (2)$
5. The LLR values for the information and both parity
bits can be calculated as
$\mathrm{LLR}(\cdot\,; O) = \max_e^*\big(\alpha_{k-1}[s^S(e)] + c_1(e) L_{cI1,k-1} + c_2(e) L_{cI2,k-1} + \beta_k[s^E(e)]\big). \quad (3)$
The forward and backward metrics increase in each step,
which is why they need to be normalized to avoid an
overflow in the computation.
2.3 Suboptimal MAP algorithms
The MAP algorithm can take several suboptimal forms
depending on the implementation of the Jacobian loga-
rithms. Four types of suboptimal forms of MAP algorithm
are used in this paper and they are explained below. A
more detailed overview can be found in [7, 8] and [9].
Log-MAP The Jacobian logarithm takes the following
form in log-MAP algorithm,
$\max^*(a, b) = \max(a, b) + \ln(1 + e^{-|b-a|}). \quad (4)$
The Jacobian logarithm equals the sum of the
maximum of two arguments and a correction function [7].
The correction function is implemented using a look-up
table [10, 11].
Max-log-MAP The max-log-MAP is the simplest of all
the suboptimal forms of the MAP algorithm [7]. The
Jacobian logarithm is expressed with the maximum of two
arguments in max-log-MAP algorithm as
$\max^*(a, b) = \max(a, b). \quad (5)$
Constant-log-MAP The approximation of log-MAP used
in [8] is called a constant-log-MAP algorithm and can be
expressed as
$\max^*(a, b) = \max(a, b) + \begin{cases} 0 & \text{if } |b - a| > M \\ C & \text{if } |b - a| \le M. \end{cases} \quad (6)$
Linear-log-MAP The linear approximation of Jacobian
logarithm used in [9] is called a linear-log-MAP algorithm
and can be expressed as
$\max^*(a, b) = \max(a, b) + \begin{cases} 0 & \text{if } |b - a| > M \\ T(|b - a| - M) & \text{if } |b - a| \le M. \end{cases} \quad (7)$
In [9], the values of parameters T and M were chosen for
a fixed-point implementation.
2.4 QPP interleaver
The QPP interleaver has been adopted for the 3GPP LTE
standard [2]. Unlike the earlier 3G interleavers, the QPP
interleaver is based on algebraic constructions. The QPP
interleaver can be expressed by a simple mathematical
formula. The relationship between the output index x and
the input index f(x) can be derived as
$f(x) = (f_2 x^2 + f_1 x) \bmod N. \quad (8)$
The values of the parameters f1 and f2 of (8) are integers
and depend on the block size N. For a different N, a dif-
ferent set of parameters has been defined in [12]. The block
sizes are divisible by 4 and 8. The value of f1 is an odd
integer while the value of f2 is even. Several algebraic
properties of the QPP interleaver have been listed in [13].
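The address generation of (8) can be sketched in a few lines of C; a 64-bit intermediate keeps f2·x² from overflowing even for the largest LTE block size N = 6,144.

```c
#include <stdint.h>

/* QPP interleaver address generation per (8):
 * f(x) = (f2*x^2 + f1*x) mod n, computed with a 64-bit
 * intermediate to avoid overflow for large block sizes. */
uint32_t qpp_interleave(uint32_t x, uint32_t f1, uint32_t f2, uint32_t n) {
    uint64_t xx = x;
    return (uint32_t)((f2 * xx * xx + f1 * xx) % n);
}
```

For a valid parameter pair, f is a permutation of {0, ..., N-1}, which is what makes the memory access contention-free; the check below assumes the pair f1 = 3, f2 = 10 listed for N = 40 in [12].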
3 Simplified MAP algorithm
The MAP algorithm described above is complex and needs
to be simplified. Repeated calculations should be avoided
in order to reduce the latency of the processor. The reduced
calculations used in this work will be described in this
section. A similar idea has been presented in [14] where a
simplified version of max-log-MAP has been proposed.
The reduced design is tested in a link level simulator using
Matlab. The Matlab design is converted to C language in
such a way that it can be efficiently implemented in the
TCE tool.
There are 16 branch metric computations between two
states for the forward metric, the backward metric and the
LLR calculations. From the trellis structure of the 3GPP
turbo code of Fig. 2, it can be seen that four calculations of
the branch metrics are repeated, resulting in a total of sixteen
calculations. Instead of multiplying with 1 and -1, the
branch metrics can be calculated directly based on these
four calculations,
$\gamma_k(1) = L_{uI,k} + L_{cI1,k} + L_{cI2,k}$
$\gamma_k(2) = -L_{uI,k} - L_{cI1,k} + L_{cI2,k}$
$\gamma_k(3) = L_{uI,k} + L_{cI1,k} - L_{cI2,k}$
$\gamma_k(4) = -L_{uI,k} - L_{cI1,k} - L_{cI2,k}, \quad (9)$
where $\gamma_k(4)$ can be represented as $-\gamma_k(1)$ and $\gamma_k(3)$ can be
represented as $-\gamma_k(2)$. For every forward metric, backward
metric and LLR computation, calculating two branch metrics
is therefore sufficient.
The forward metric calculations are shown as,
$\alpha_k(1) = \max^*(\alpha_{k-1}(1) - \gamma_{k-1}(1),\; \alpha_{k-1}(2) + \gamma_{k-1}(1))$
$\alpha_k(5) = \max^*(\alpha_{k-1}(1) + \gamma_{k-1}(1),\; \alpha_{k-1}(2) - \gamma_{k-1}(1))$
$\alpha_k(2) = \max^*(\alpha_{k-1}(3) + \gamma_{k-1}(2),\; \alpha_{k-1}(4) - \gamma_{k-1}(2))$
$\alpha_k(6) = \max^*(\alpha_{k-1}(3) - \gamma_{k-1}(1),\; \alpha_{k-1}(4) + \gamma_{k-1}(1))$
$\alpha_k(3) = \max^*(\alpha_{k-1}(5) - \gamma_{k-1}(1),\; \alpha_{k-1}(6) + \gamma_{k-1}(1))$
$\alpha_k(7) = \max^*(\alpha_{k-1}(5) + \gamma_{k-1}(1),\; \alpha_{k-1}(6) - \gamma_{k-1}(1))$
$\alpha_k(4) = \max^*(\alpha_{k-1}(7) + \gamma_{k-1}(1),\; \alpha_{k-1}(8) - \gamma_{k-1}(1))$
$\alpha_k(8) = \max^*(\alpha_{k-1}(7) - \gamma_{k-1}(1),\; \alpha_{k-1}(8) + \gamma_{k-1}(1)). \quad (10)$
Similarly, the backward metric calculations are shown
as,
$\beta_k(1) = \max^*(\beta_{k+1}(1) - \gamma_{k+1}(1),\; \beta_{k+1}(5) + \gamma_{k+1}(1))$
$\beta_k(2) = \max^*(\beta_{k+1}(1) + \gamma_{k+1}(1),\; \beta_{k+1}(5) - \gamma_{k+1}(1))$
$\beta_k(3) = \max^*(\beta_{k+1}(2) + \gamma_{k+1}(2),\; \beta_{k+1}(6) - \gamma_{k+1}(2))$
$\beta_k(4) = \max^*(\beta_{k+1}(2) - \gamma_{k+1}(2),\; \beta_{k+1}(6) + \gamma_{k+1}(2))$
$\beta_k(5) = \max^*(\beta_{k+1}(3) - \gamma_{k+1}(2),\; \beta_{k+1}(7) + \gamma_{k+1}(2))$
$\beta_k(6) = \max^*(\beta_{k+1}(3) + \gamma_{k+1}(2),\; \beta_{k+1}(7) - \gamma_{k+1}(2))$
$\beta_k(7) = \max^*(\beta_{k+1}(4) + \gamma_{k+1}(1),\; \beta_{k+1}(8) - \gamma_{k+1}(1))$
$\beta_k(8) = \max^*(\beta_{k+1}(4) - \gamma_{k+1}(1),\; \beta_{k+1}(8) + \gamma_{k+1}(1)). \quad (11)$
Finally, the LLR calculation can be expressed as
$\mathrm{LLR}_k = \max^*\big(\max^*(\alpha_{k-1}(1) + \beta_k(1),\; \alpha_{k-1}(2) + \beta_k(5)) + \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(7) + \beta_k(8),\; \alpha_{k-1}(8) + \beta_k(4)) + \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(3) + \beta_k(6),\; \alpha_{k-1}(4) + \beta_k(2)) + \gamma_{k-1}(2),$
$\qquad \max^*(\alpha_{k-1}(5) + \beta_k(3),\; \alpha_{k-1}(6) + \beta_k(7)) + \gamma_{k-1}(2)\big)$
$\quad - \max^*\big(\max^*(\alpha_{k-1}(1) + \beta_k(5),\; \alpha_{k-1}(2) + \beta_k(1)) - \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(5) + \beta_k(7),\; \alpha_{k-1}(6) + \beta_k(3)) - \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(3) + \beta_k(2),\; \alpha_{k-1}(4) + \beta_k(6)) - \gamma_{k-1}(2),$
$\qquad \max^*(\alpha_{k-1}(7) + \beta_k(4),\; \alpha_{k-1}(8) + \beta_k(8)) - \gamma_{k-1}(2)\big). \quad (12)$
4 TTA processor design methodology
4.1 TTA processor design
TTA is a processor design philosophy where the program
directly controls the internal data transport between dif-
ferent function units of a processor. A typical processor
consists of function units to perform computation, register
files to store the data and control unit for program flow
Fig. 2 Trellis of 3GPP turbo code
control. Buses, ports and sockets make the interconnection
network of the processor. In a TTA processor, one of the
input ports of a function unit is marked as a triggering port.
Writing data to the triggering port of a function unit starts
the computation. The TTA processor works with a single
move instruction and does not require a complex instruc-
tion decoding unit. Most of the instruction scheduling
decisions are made at compile time, which further reduces
the processing logic of the TTA processor [15, 16].
TCE provides an excellent toolset for designing,
implementing and simulating a TTA processor. The toolset
includes a processor design tool with an extensive library
of function units. It also includes a C/C++ compiler to
compile the implemented software for the designed TTA
processor. A graphical and command line simulator is
provided to analyze the execution of the program on the
processor with detailed information about the resource
usages and cycle counts during the execution [3].
If the designed processor does not meet the target per-
formance, the designer can go back to the source code and
design the processor again. In order to find the actual run
time, the processor must be synthesized or mapped on a
field programmable gate array. A rich hardware database
(HDB) is provided in TCE that includes VHSIC hardware
definition language (VHDL) description for most of the
common TTA function units, register files, sockets etc.
TCE also allows the designer to build more complicated
special function units to reduce the processing latency. In
that case, the designer has to write the VHDL description
for the special function unit and add it to the HDB. A
diagram of the TTA processor design methodology is given
in Fig. 3.
4.2 Programming TTA processor
TCE allows the TTA processor to be programmed with a
high level language like C or with low level assembly
language. As stated earlier, a TTA assembly program has only one
instruction, called move. In Fig. 4, part of a TTA processor with an
adder and a multiplier is shown. The black horizontal
straight lines represent the buses of the processor. The
vertical rectangular blocks represent the sockets. The
connections between the function units and the buses are
illustrated by black dots in the sockets.
Part of an assembly program for this TTA processor in
Fig. 4 is shown in Table 1. The program adds some
constant values and multiplies the sums with
other values.
It can be seen that the programmer needs to carefully
choose the function units and buses in case of assembly
programming. The more the architecture includes buses
and function units, the more complicated the programming
becomes. If the design needs to be changed for some
reason, e.g., buses need to be removed or added, the whole
program has to be rewritten. In case of high
level programming the compiler takes care of the sched-
uling. The designer can add or remove function units
without worrying about the programming. Therefore, the
Fig. 3 TTA processor design methodology
Fig. 4 Part of a TTA processor
Table 1 Part of the assembly program for the TTA processor in
Fig. 4

Bus 1                 Bus 2              Bus 3
74 → ADD.in1t.add     71 → ADD.in2       174 → MUL.in2
238 → ADD.in1t.add    231 → ADD.in2      ADD.out1 → MUL.in1t.mul
60 → ADD.in2          38 → MUL.in2       ADD.out1 → MUL.in1t.mul
work presented in this paper used high level programming
language to reduce the development time.
5 Vector special function units
The vector special function units designed for this imple-
mentation are described in this section. The vector units are
designed in such a way that they facilitate programming
in a high level language. In our case, C programming is used.
The word lengths from [4] and [11] are used for the input
LLRs.
5.1 PACK special function unit
A special function unit, PACK, is used to merge the three
8-bit LLR inputs, namely the a priori LLR, the LLR for the
systematic bits and the LLR for the parity bits. An opcode signal,
which is used to select different suboptimal algorithms, can
also be merged with these three input signals. These signals
will constitute a 32-bit packed input that can be used for
the vector units that compute forward metrics, backward
metrics and output LLRs.
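A minimal sketch of the packing operation; placing the opcode in the top byte is an illustrative assumption, not the unit's documented bit layout.

```c
#include <stdint.h>

/* PACK sketch: three signed 8-bit LLRs and the algorithm-select
 * opcode merged into one 32-bit word (opcode in the top byte is
 * an assumed, illustrative field order). */
uint32_t pack_llrs(int8_t a_priori, int8_t systematic, int8_t parity,
                   uint8_t opcode) {
    return ((uint32_t)opcode << 24) |
           ((uint32_t)(uint8_t)a_priori << 16) |
           ((uint32_t)(uint8_t)systematic << 8) |
           (uint32_t)(uint8_t)parity;
}
```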
5.2 METRIC special function unit
A function unit can be made that would calculate eight
forward metrics based on the earlier state forward metrics
and the LLR input. In this work, a vector METRIC unit is
designed which takes three 32-bit inputs to calculate the
forward metric. The eight forward metrics of the earlier
state can be packed into two 32-bit inputs. These two 32-bit
inputs and the input coming from the PACK unit are fed to
the METRIC unit to calculate the eight forward metrics for
the next state. These eight forward metric outputs are also
packed as two 32-bit outputs and used for further pro-
cessing. A block diagram of the METRIC unit is given in
Fig. 5. The diagram shows two forward metric calculations
of a single butterfly pair. The rest of the calculations are not
shown for readability.
The forward and backward metrics calculations are not
the same. However, from (10) and (11) it can be seen that
the operations needed to calculate the forward and back-
ward metrics are similar. If the inputs are arranged prop-
erly, the same METRIC unit can be used to calculate the
backward metrics. For example, ak-1(1), ak-1(2), ak-1(3)
and ak-1(4) are merged as one 32-bit input in the METRIC
unit. From these four 8-bit inputs and with the help of the
LLR inputs, four forward metrics can be calculated. They
are ak(1), ak(5), ak(2) and ak(6).
Similarly, bk+1(1), bk+1(5), bk+1(2) and bk+1(6) can be
merged as one 32-bit input to calculate bk(1), bk(2), bk(3)
and bk(4). Note that, due to the structure of the METRIC
unit, bk(1) and bk(3) will reside inside the first 32-bit output
and bk(2) and bk(4) will reside inside the second 32-bit
output.
The normalization is needed inside the METRIC unit to
avoid an overflow. The normalization is done
following the technique suggested by Valenti et al. [17].
Typically, the normalization is done in every step by
subtracting the minimum value of the forward or
backward metrics of the same column from every forward or
backward metric value. Valenti et al. suggested instead
subtracting the first value of the column from all the values.
Thus, there is no need to store the first row and this
constitutes a 12.5 percent saving in memory compared to the
Fig. 5 Metric function unit
other normalization methods. The normalization is not
shown in Fig. 5 for simplicity.
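A minimal sketch of this normalization, with plain integer metrics standing in for the unit's internal fixed-point registers:

```c
/* Normalization per [17]: every metric of a column is reduced by
 * the first one, so the first metric is always zero afterwards and
 * never needs to be stored (the 12.5 percent memory saving). */
void normalize(int metrics[8]) {
    int first = metrics[0];
    for (int i = 0; i < 8; i++)
        metrics[i] -= first;
}
```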
The METRIC unit supports four operations for max-log-
MAP, linear-log-MAP, constant-log-MAP and log-MAP
respectively. The max-log-MAP needs a straightforward
maximum selection operation, as can be seen in (5). From
(4), (6) and (7) it can be seen that the other algorithms use
the maximum selection and some additional calculations.
All the operations in the METRIC unit can exploit the same
maximum selection logic. The additional calculations are
different for different algorithms and need some additional
logic. Thus, the four operations need different numbers of
internal registers and the latencies of the four operations
are different.
5.3 ARRANGE special function unit
The drawback of using the same METRIC unit for both
forward and backward metric is the fact that an additional
unit is needed to arrange the outputs of the METRIC unit
during the backward metric computation. It can be seen
from Fig. 5, that the METRIC unit would provide two
32-bit outputs with one consisting of bk(1), bk(3), bk(5)
and bk(7) and the other consisting of bk(2), bk(4), bk(6)
and bk(8). To use the same METRIC unit again for the
next backward metric calculation, two 32-bit inputs are
needed, with one input consisting of bk(1), bk(5), bk(2)
and bk(6) and the other consisting of bk(3), bk(7), bk(4)
and bk(8). That is the functionality of the ARRANGE
special function unit that changes the relative positions of
the inputs. The functionality of the ARRANGE unit is
shown in Fig. 6.
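The reordering can be sketched as follows; taking byte 0 of a packed word as its first metric is an illustrative assumption.

```c
#include <stdint.h>

/* ARRANGE sketch: the two packed METRIC outputs, (b1,b3,b5,b7) and
 * (b2,b4,b6,b8), are reordered into the input layout (b1,b5,b2,b6)
 * and (b3,b7,b4,b8) expected by the next backward step. */
void arrange(uint32_t in_odd, uint32_t in_even,
             uint32_t *out_lo, uint32_t *out_hi) {
    uint8_t b[9]; /* b[1]..b[8] hold beta_k(1)..beta_k(8) */
    for (int i = 0; i < 4; i++) {
        b[1 + 2 * i] = (uint8_t)(in_odd  >> (8 * i)); /* b1, b3, b5, b7 */
        b[2 + 2 * i] = (uint8_t)(in_even >> (8 * i)); /* b2, b4, b6, b8 */
    }
    *out_lo = (uint32_t)b[1] | ((uint32_t)b[5] << 8) |
              ((uint32_t)b[2] << 16) | ((uint32_t)b[6] << 24);
    *out_hi = (uint32_t)b[3] | ((uint32_t)b[7] << 8) |
              ((uint32_t)b[4] << 16) | ((uint32_t)b[8] << 24);
}
```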
5.4 LLROUT special function unit
Another special function unit named LLROUT is designed
to calculate the output LLRs after the forward and back-
ward metric computation. It can be seen from (12) that
eight forward metric and eight backward metric and three
LLR values are needed for the output LLR computation.
Four packed 32-bit outputs coming from the METRIC unit
after the forward and backward metric computations are
used as the inputs of the unit. Besides, the packed LLR
inputs with the opcode are used as another input. Therefore,
the LLR special function unit takes five inputs and pro-
duces one 8-bit LLR output. The four suboptimal algo-
rithms can be chosen with the opcode which is used in the
packed input. The unit supports four operations with dif-
ferent latency for the four different algorithms.
6 TTA processor for turbo decoding
6.1 Top level architecture of the processor
A part of the processor designed for the turbo decoder is
illustrated in Fig. 7. For readability, the whole processor
figure is not given. The blocks on the upper part of the
figure represent the function units and the register files of
the processor.
Fig. 6 Inputs and outputs of ARRANGE special function unit
Fig. 7 Implemented processor with reduced number of function units
The fixed point processor includes a load/store unit (LSU),
an arithmetic logic unit (ALU), a global control unit and
register files. Based on the resource requirements of the high
level language code, more function units and register files are added.
The LLR inputs are read from a first-in-first-out (FIFO)
memory buffer by using the function unit called STREAM.
The STREAM units can read every input sample in one
clock cycle. Three STREAM units are used to get the input
LLRs simultaneously. One STREAM unit is used to write
the output LLRs into the FIFO buffer.
Like the Viterbi decoder, the turbo decoder needs
numerous add-compare-select operations. One way to
design the processor for these algorithms is to use the
appropriate number of adder and maximum selection units.
Another way is to design a special function unit to calculate
all the necessary next state metrics based on the earlier
state metrics and branch metrics.
The latter way is followed in our design and a special
function unit named METRIC is designed with three inputs
and two outputs. The functionality of this unit is already
described in Sect. 2 The PACK unit is used which takes the
three LLR inputs and the opcode to use the particular
suboptimal algorithm. One ARRANGE unit is used to
calculate the backward metrics.
Two LSU units are used to support the memory accesses,
reading and writing the data memory. The
memory can be read in three clock cycles and can be
written in a single cycle. The ALU unit is used to perform
the basic arithmetic operations like addition, subtraction
etc. Operations like shifting right or left are also included
in the ALU.
Thirteen buses are used in the design. Several register
files are used to save the intermediate results. In terms of
the power consumption, registers can be more expensive
than the memory, but to meet the latency requirement
register files are needed. A single Boolean register file is
included in the processor design.
6.2 Programming the processor
The processor is programmed with the high level language C.
The architecture of the processor and the structures of the
vector special function units allow the use of 8-bit char-
acters and 32-bit integers, which greatly simplifies the soft-
ware design. Macros are used to call the special function
units.
The turbo decoding algorithm uses three blocks of 6,144
input LLRs. The blocks are divided into smaller windows
to reduce the memory requirements. The forward metrics
are calculated and stored first by following the simplifica-
tion in (10).
The backward metrics are calculated next using the
ARRANGE and METRIC units. As soon as the backward metric
calculations of a single column are finished, the LLRs are
calculated using the LLROUT unit with the help
of the stored forward metrics. Therefore,
there is no need to save the backward metrics [17].
The turbo decoding algorithm contains many data
dependencies, so efficient code is needed to utilize
the available computing resources. The code is written
to reduce memory accesses by holding
values inside the register files. It is possible to eliminate
the need for accessing the forward metric vectors con-
stantly. Instead, temporary variables are used to hold
the values for the next iteration. The code for the backward
metric and the LLR calculations is written in the same
manner.
A significant improvement in throughput is achieved by
partially unrolling the main loops of the code. The code is
unrolled such that the main loop body processes 100
samples to improve the opportunities for parallelism across
the samples and reduce the amount of time spent evaluating
the loop condition. A static multi-issue machine needs
statically schedulable parallel operations to utilize the
hardware fully. One way is to schedule operations from
multiple loop iterations in parallel. Unrolling the loops
increases the instruction memory size due to the increased
code size, but the achieved benefit is significant.
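A minimal sketch of the idea, with a factor-4 unroll and a stand-in per-sample operation (scaling by 2) instead of the actual trellis-stage computation; the real implementation unrolls over 100 samples.

```c
/* Partial loop unrolling: four samples per trip expose independent
 * operations for static scheduling and quarter the loop-condition
 * overhead; a remainder loop handles the leftover samples. */
void scale_window(int n, const int *in, int *out) {
    int k = 0;
    for (; k + 4 <= n; k += 4) {      /* unrolled main loop */
        out[k]     = 2 * in[k];
        out[k + 1] = 2 * in[k + 1];
        out[k + 2] = 2 * in[k + 2];
        out[k + 3] = 2 * in[k + 3];
    }
    for (; k < n; k++)                /* remainder loop */
        out[k] = 2 * in[k];
}
```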
In theory, the critical path to process one sample in the for-
ward metric loop is four clock cycles. Furthermore, the
theoretical pipeline can take a new input and output a
sample every clock cycle. At the moment our implemen-
tation has a six clock cycle pipeline. The bottleneck
is the data access, which is very common for
algorithms processing large amounts of data. There are two
possible solutions for this limitation. Because the algo-
rithm requires three inputs at once, there should be three
data accesses. These can be handled either with two memory
spaces, one with a single port and one with two ports,
or by stream units with FIFO buffers. In many solutions
the data bottlenecks are solved with FIFO buffers. We rely
on streaming inputs and outputs using the STREAM units,
but the intermediate results between the forward and
backward metric calculations use the memory.
A look-up table, holding the values of the f1 and f2
parameters, is used to implement the QPP interleaver. The
rest of the calculations are quite straightforward.
7 Results and discussion
The designed processor takes 56,624 clock cycles to pro-
cess three blocks of 6,144 samples for a full iteration for
the max-log-MAP mode. According to the 3GPP inter-
leaver specifications, the size of the block could be up to
6,144 samples, which is why the size of the input
block is chosen as 6,144. The size of the block can be
changed using the program code.
The throughput for one full iteration for the max-log-
MAP with a clock frequency of 210 MHz can be calculated
as,
Throughput = (6,144 bits × 3 × 210 MHz) / 56,624 clock cycles
           = 68.35 Mbps.
The number of clock cycles needed for a single trellis stage
of the max-log-MAP can be calculated as

CycleStage = 56,624 / (2 × 3 × 6,144) = 1.53 cycles.
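The two figures above can be checked with a couple of lines of C:

```c
/* Worked check of the throughput and cycles-per-stage figures:
 * 3 x 6,144 bits decoded in 56,624 clock cycles at 210 MHz. */
double throughput_mbps(void) {
    return 6144.0 * 3.0 * 210e6 / 56624.0 / 1e6; /* approx. 68.35 Mbps */
}

double cycles_per_stage(void) {
    return 56624.0 / (2.0 * 3.0 * 6144.0); /* approx. 1.53 cycles */
}
```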
It can be seen from Table 2 that the processor takes
more clock cycles if the max-log-MAP algorithm is not used.
The reason lies in the fact that the constant-log-MAP,
linear-log-MAP and log-MAP algorithms need to invoke
conditional statements, resulting in branches during
execution. Thus, the latency increases compared to the
max-log-MAP algorithm.
Linear-log-MAP is slower than the constant-log-MAP
because multiplication operations are needed. The log-
MAP algorithm provides a better FER performance than
the other algorithms. Log-MAP can be used in cases when
the latency requirements are not strict and a high FER
performance is needed. A scenario such as this is simulated
with a link level Matlab simulator in Fig. 8. Log-MAP and
max-log-MAP based turbo decoders are used in a 4 × 4
system. The modulation method and the detection algorithm
used for both of the decoders are 16-QAM and
LMMSE detection, respectively. The channel model is
based on the 3GPP vehicular A (3GPP-VA) model rec-
ommended by International Telecommunication Union.
The mobile velocity is set to 120 km/h in the simulation. A
5 MHz bandwidth corresponding to 512 orthogonal frequency division multiplexing (OFDM) subcarriers is applied. One
frame is equal to one OFDM symbol in the simulator.
Thus, one frame consists of 512 subcarriers. Here, 300
subcarriers are loaded with data and the rest are used as
guard interval. Six turbo iterations are used in the
simulation.
It can be seen from Fig. 8 that the FER performance
difference between log-MAP and max-log-MAP is nearly 2
dB in the high SNR region. The log-MAP is more complex
than max-log-MAP. However, for this particular scenario the additional complexity is compensated by the FER performance gain. In [17] the FER performance of log-MAP, linear-log-MAP, constant-log-MAP and max-log-MAP has been shown for different scenarios.
The buses of the processor are efficiently utilized to achieve the best possible result, thanks to efficient scheduling. The numbers of some of the operations during algorithm execution are summarized in Table 3.
The number of addition operations represents not only the additions of the algorithm itself but also those for several other purposes, such as loop indexing. The multiplication is used for the QPP interleaving sequence generation.
The throughput can be increased by using dedicated
special function units working as accelerators for the pro-
cessor. On the other hand, the more special function units
are used, the less flexible the processor becomes. As an
example, a special function unit dedicated to calculate the
backward metric and the LLRs could be designed to reduce latency. However, it might be useless for some other functionalities. A common special function unit to calculate both the forward and backward metrics could be better utilized in this case. The TTA processor designed here also works for Viterbi decoding with reasonable throughput. The BRANCH unit for the max-log-MAP mode is able to calculate the add-compare-select operations needed for Viterbi decoding.

Fig. 8 Decoding performance comparison for the log-MAP and max-log-MAP algorithms

Table 2 Number of clock cycles for a single iteration with three simulation blocks

Mode  Algorithm         Clock cycles
1     max-log-MAP       56,624
2     linear-log-MAP    67,244
3     constant-log-MAP  80,298
4     log-MAP           104,628

Table 3 Number of operations for a full iteration

Operation  # of ops
ADD        98,208
PACK       12,288
ARRANGE    12,288
MUL        29,576
LLROUT     12,288
METRIC     24,576
STREAM     49,152
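That add-compare-select recursion can be sketched as follows; the radix-2 shift-register trellis wiring and the function name are illustrative, not the processor's actual function-unit interface.

```python
def acs_step(path_metrics, branch_metrics):
    """One add-compare-select step over all states of a radix-2 trellis.

    path_metrics[s] is the accumulated metric of state s;
    branch_metrics[(p, s)] is the metric of the transition p -> s.
    Returns the updated metrics and survivor decisions for traceback.
    """
    n = len(path_metrics)
    new_metrics = [0.0] * n
    survivors = [0] * n
    for s in range(n):
        # Assumed shift-register wiring: state s is reached from states
        # s >> 1 and (s >> 1) + n/2 in a standard radix-2 butterfly.
        p0, p1 = s >> 1, (s >> 1) + n // 2
        m0 = path_metrics[p0] + branch_metrics[(p0, s)]  # add
        m1 = path_metrics[p1] + branch_metrics[(p1, s)]  # add
        # Compare and select the survivor path in one step.
        new_metrics[s], survivors[s] = max((m0, 0), (m1, 1))
    return new_metrics, survivors
```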
The designed TTA processor is synthesized using UMC
90 nm standard cell library (fsd0k_generic_core 1d0vtc).
Synopsys Design Compiler is used to estimate gate count
and maximum frequency for the processor. The operating
conditions (temperature, operating voltage, manufacturing
process quality) for synthesis are set to default value
(TCCOM). The maximum clock frequency achieved dur-
ing the synthesis for the processor is 210 MHz. The total
gate count for the processor at 210 MHz is 87.11 kgates. A
comparison of the complexity and throughput for different
clock frequencies is given in Table 4.
A comparison with other implementations of turbo decoders is presented in Table 5. The software programmable solutions like [17, 18] and [19] provide very low throughput. A comparatively higher throughput
software implementation can be found in [21] where the
decoder has been implemented on a graphics processing unit (GPU) with parallel programming.
processor provides very good throughput compared to the
software programmed decoders and ASIPs. An efficient
ASIP implementation can be found in [22]. The processor was designed to decode turbo and low density parity-check (LDPC) codes. The processor was programmed in assembly, and still the throughput is lower than that of the implementation presented in this paper.
The throughput of the TTA processor presented in [4] is higher than that of our proposed solution, and it requires fewer resources. However, that processor was programmed in assembly, and its TTA architecture had dedicated function units built only for the max-log-MAP algorithm. Thus the design presented in [4] is closer to dedicated hardware and is unable to exploit the benefits of a processor-based solution. Naturally its performance is also close to that of a dedicated hardware solution, as can be seen from [23]. A key point is that the processor did not support the log-MAP algorithm, which makes it more error prone in poor channel conditions.
Our implementation is more flexible and is able to
utilize the programmability better than the solution given
in [4] by supporting several algorithms. As the processor
presented in this paper is programmed in C, the program
can be modified very easily, which would be very difficult
for the processor presented in [4]. A processor design
similar to [4] would increase the development time
significantly.
Recent hardware implementations of turbo decoders use parallel SISO decoders, which can be compared with multi-core solutions. An example of this can be seen in [24], where eight SISO decoders working in parallel
provided a very high throughput. A multi-core solution of several ASIPs working in parallel would be a competitive alternative to a parallel turbo decoder ASIC.

Table 4 Complexity and throughput of the TTA processor

Clock frequency (MHz)  Area (kgates)  Throughput (Mbps)
100                    70.91          32.55
125                    72.13          40.69
200                    83.22          65.10
210                    85.74          68.35

Table 5 Comparison of turbo decoder implementations

Category            Reference  Architecture        Algorithm    Throughput (normalized)  Area (kgates)  Clock frequency (MHz)
SW design           [18]       Motorola 56603      max-log-MAP  243 Kbps                 -              80
SW design           [17]       Intel Pentium       max-log-MAP  366 Kbps                 -              933
SW design           [19]       TMS320C6201 DSP     max-log-MAP  2 Mbps                   -              -
SW and HW codesign  [20]       VLIW ASIP           max-log-MAP  5 Mbps                   -              80
SW and HW codesign  [4]        TTA proc. for UMTS  max-log-MAP  14.1 Mbps                20.8           210
SW and HW codesign  [5]        TTA proc.           max-log-MAP  31.21 Mbps               -              -
SW design           [21]       Nvidia C1060        max-log-MAP  33.85 Mbps               -              -
SW and HW codesign  [22]       VLIW ASIP           max-log-MAP  46.5 Mbps                -              400
SW and HW codesign  Proposed   TTA proc.           max-log-MAP  65.10 Mbps               85.74          210
HW design           [23]       130 nm technology   max-log-MAP  90.3 Mbps                44.1           246
SW and HW codesign  [4]        TTA proc.           max-log-MAP  98 Mbps                  43.2           277
HW design           [24]       90 nm technology    max-log-MAP  1469.2 Mbps (with 8 SISO)  -            152
8 Conclusion
The paper described a novel design of a turbo decoder on a TTA processor using a high level language. The reference turbo decoder design is simulated on a Matlab link level simulator. The design is then converted into C language and mapped onto the TTA processor using the TCE tool. The design demonstrates the feasibility of implementing several decoding techniques on a single TTA processor. As the turbo decoding algorithm is very complex, it is difficult to achieve the LTE target throughput even with pure hardware designs. A parallel multi-core turbo decoder is a natural choice for next generation wireless systems. The target throughput could also be reached by a multi-core TTA processor. The flexibility gained from such a processor could provide very interesting results and would be a fruitful direction for future research. The energy efficiency of the processor would also be an interesting topic to study.
Acknowledgements This research was supported by the Finnish
Funding Agency for Technology and Innovation (Tekes), Renesas
Mobile Europe, Nokia Siemens Networks and Xilinx. Special thanks
are due to Dr. Perttu Salmela from Tampere University of Technology
and Jarkko Huusko from Centre for Wireless Communications
(CWC), University of Oulu for sharing their insights on program-
mable turbo decoder implementations.
References
1. Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near
Shannon limit error-correcting coding and decoding: Turbo-codes.
In IEEE international conference on communication, Geneva,
Switzerland. (Vol. 2, pp. 1064–1070).
2. Multiplexing and channel coding, 3GPP TS 36.212 version 10.5.0.
3. Jaaskelainen, P., Guzma, V., Cilio, A., Pitkanen, T., & Takala, J.
(2007). Codesign toolset for application-specific instruction-set
processors. In Multimedia on mobile devices 2007, Proceedings
of SPIE, San Jose, CA. (vol. 6507, pp. 1–11).
4. Salmela, P., Sorokin, H., & Takala, J. (2008). A programmable
max-log-MAP turbo decoder implementation. Hindawi VLSI
Design, 2008, 636–640.
5. Shahabuddin, S., Janhunen, J., & Juntti, M. (2013). Design of a
transport triggered architecture processor for a flexible iterative
turbo decoder. In Proceedings of wireless innovation forum
conference on wireless communications technologies and software radio (SDR-WInnComm), Washington, D.C.
6. Benedetto, S., Montorsi, G., Divsalar, D., & Pollara, F. (1996). A
soft-input soft-output maximum a posteriori (MAP) module to
decode parallel and serial concatenated codes. In JPL TDA
Progress Report, (vol. 42–127, pp. 1–20). Pasadena, CA: Jet
Propulsion Lab.
7. Robertson, P., Hoeher, P., & Villebrun, E. (1997). Optimal and
suboptimal maximum a posteriori algorithms suitable for turbo
decoding. In European transactions on telecommunication, (vol.
8, pp. 119–125).
8. Classon, B., Blankenship, K., & Desai, V. (2000). Turbo
decoding with the constant-log-MAP algorithm. In Proceedings
of the second international symposium on turbo codes and related topics (pp. 467–470). Brest, France.
9. Cheng, J.-F., & Ottosson, T. (2000). Linearly approximated log-
MAP algorithms for turbo coding. In Proceedings of IEEE
vehicular technology conference, Houston, TX.
10. Wang, Z. (2007). High-speed recursion architectures for MAP-
based turbo decoders. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 15(4), 470–474.
11. Li, L., Maunder, R. G., Al-Hashimi, B. M., & Hanzo, L. (2011).
A low-complexity turbo decoder architecture for energy-efficient
wireless sensor networks. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, PP(99), 1–9.
12. Nimbalker, A., Blankenship, Y. W., Classon, B. K., & Blan-
kenship, K. T. (2008). ARP and QPP Interleavers for LTE Turbo
coding. In IEEE wireless communications and networking con-
ference (pp. 1032–1037).
13. Sun, Y., & Cavallaro, J. R. (2010). Efficient hardware implementation of a highly-parallel 3GPP LTE, LTE-Advanced turbo
decoder. Integration, The VLSI Journal 44(1), 1–11.
14. Salmela, P., Jarvinen, T., & Takala, J. (2007). Simplified max-
log-MAP decoder structure. In Proceedings of the 1st joint
workshop on mobile future and the symposium on trends in
communications (SympoTIC ’06), (pp. 1–11). San Jose, CA.
15. Corporaal, H. (1998). Microprocessor architectures: from VLIW
to TTA. West Sussex: Wiley.
16. Esko, O., Jaaskelainen, P., Huerta, P., Lama, C. S. L., Takala, J., &
Martinez, J. I. (2010). Customized exposed datapath soft-core
design flow with compiler support. In 20th international con-
ference on field programmable logic and applications, August–
September, 2010, Milano, Italy.
17. Valenti, M. C., & Sun, J. (2001). The UMTS turbo code and an
efficient decoder implementation suitable for software defined
radios. International Journal on Wireless Information Networks,
8(4), 203–216.
18. Michel, H., Worm, A., Munch, M., & Wehn, N. (2002). Hard-
ware software trade-offs for advanced 3G channel coding. In
Proceedings of design, automation and test in Europe. Paris:
IEEE Computer Society
19. Song, Y., Liu, G., & Huiyang (2005). The implementation of
turbo decoder on DSP in W-CDMA system. In International
conference on wireless communications, networking and mobile
computing (pp. 1281–1283).
20. Ituero, P., & Lopez-Vallejo, M. (2006). New schemes in clustered
VLIW processors applied to turbo decoding. In Proceedings of
international conference on application-specific systems, archi-
tectures and processors (ASAP’06) (pp. 291–296). Colo:
Steamboat Springs.
21. Wu, M., Sun, Y., & Cavallaro, J. R. (2010). Implementation of a
3GPP LTE turbo decoder accelerator on GPU. In 2010 IEEE
workshop on signal processing systems (SIPS) (pp. 192–197), San
Francisco, CA.
22. Alles, M., Vogt, T., & Wehn, N. (2008). FlexiChaP: A recon-
figurable ASIP for convolutional, turbo, and LDPC code decod-
ing. In Proceedings of international symposium on turbo coding
(pp. 13–18). Lausanne.
23. Benkeser, C., Burg, A., Cupaiuolo, T., & Huang, Q. (2009).
Design and optimization of an HSDPA turbo decoder ASIC.
IEEE Journal of Solid-State Circuits, 44(1), 98–106.
24. Lin, C. H., Chen, C. Y., Chang, E. J., & Wu, A. Y. (2013).
Reconfigurable parallel turbo decoder design for multiple high-
mobility 4G systems. Springer Journal of Signal Processing
Systems, 73(2), 109–122.
Shahriar Shahabuddin received his B.Sc. degree from the University of Dhaka (Bangladesh) and M.Sc. degree from the University of Oulu (Finland), in 2009 and 2012 respectively. Currently, he is working as a doctoral researcher at the communication signal processing group of Centre for Wireless Communications (CWC), University of Oulu, Finland. His research interests include application specific processor and monolithic hardware design for digital signal processing.
Janne Janhunen received the M.Sc. (Tech.) and Dr.Sc. (Tech.) degrees in Electrical and Information Engineering and Communication Engineering from the University of Oulu, Oulu, Finland, in 2007 and 2011, respectively. His main research interests are energy efficient programmable architectures and cross-layer optimization in the area of wireless communication and video processing.
Markku Juntti (S’93–M’98–SM’04) received the M.Sc. (Tech.) and Dr.Sc. (Tech.) degrees in Electrical Engineering from the University of Oulu, Oulu, Finland, in 1993 and 1997, respectively. He was with the University of Oulu from 1992 to 1998. In academic year 1994–1995, he was a Visiting Scholar at Rice University, Houston, TX. From 1999 to 2000, he was a Senior Specialist with Nokia Networks. He has been a Professor of communications engineering at the Department of Communication Engineering and Centre for Wireless Communications (CWC), University of Oulu, since 2000. His research interests include signal processing for wireless networks as well as communication and information theory. He is an author or coauthor of some 200 papers published in international journals and conference records as well as of the book WCDMA for UMTS (Wiley, 2000). He is also an Adjunct Professor in the Department of Electrical and Computer Engineering, Rice University. Dr. Juntti is an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS and was an Associate Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY from 2002 to 2008. He was Secretary of the IEEE Communication Society Finland Chapter in 1996–1997 and its Chairman for 2000–2001. He has been Secretary of the Technical Program Committee (TPC) of the 2001 IEEE International Conference on Communications (ICC’01), and the Co-Chair of the Technical Program Committees of the 2004 Nordic Radio Symposium and the 2006 IEEE International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC 2006). He is the General Chair of the 2011 IEEE Communication Theory Workshop (CTW 2011).
Amanullah Ghazi holds a Master's degree in Wireless Communication Engineering from the University of Oulu, Finland, and a bachelor's degree in Computer Engineering from Aligarh Muslim University, India. Before pursuing his master's degree, he worked in the telecommunication industry for 3 years. Amanullah is currently working as a Doctoral Researcher at the Department of Computer Science & Engineering at the University of Oulu. His research interest is application specific processor design for SDR platforms.
Olli Silven received the M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Oulu in 1982 and 1988, respectively. Since 1996 he has been the Professor of Signal Processing Engineering at the University of Oulu. His main research interests are in embedded signal processing and machine vision system design. He has contributed to the development of numerous solutions from real-time 3-D imaging in reverse vending machines to IP blocks for mobile video coding.