Design of a transport triggered vector processor for turbo decoding
Shahriar Shahabuddin • Janne Janhunen •
Markku Juntti • Amanullah Ghazi •
Olli Silven
Received: 29 March 2013 / Revised: 24 August 2013 / Accepted: 20 September 2013 / Published online: 15 October 2013
© Springer Science+Business Media New York 2013
Abstract In order to meet the requirement of high data
rates for next generation wireless systems, efficient
implementations of receiver algorithms are essential. On
the other hand, faster time-to-market motivates the inves-
tigation of programmable implementations. This paper
presents a novel design of a programmable turbo decoder
as an application-specific instruction-set processor (ASIP)
using transport triggered architecture (TTA). The processor
architecture is designed in such a manner that it can be
programmed with high level language to support different
suboptimal maximum a posteriori (MAP) algorithms in a
single TTA processor. The design enables the designer to
change the algorithms according to the frame error rate
performance requirement. A quadratic polynomial permu-
tation interleaver is used for contention-free memory
access and to make the processor 3GPP LTE compliant.
Several optimization techniques to enable real time pro-
cessing on programmable platforms are introduced. The
essential parts of the turbo decoding algorithm are designed
with vector function units. Unlike most other turbo decoder
ASIPs, high level language is used to program the pro-
cessor to meet the time-to-market requirements. With a
single iteration, 68.35 Mbps decoding speed is achieved for
the max-log-MAP algorithm at a clock frequency of
210 MHz on 90 nm technology.
Keywords TTA · ASIP · MAP · FER · QPP
1 Introduction
The turbo coding scheme [1] has been adopted for the air
interface standard called long term evolution (LTE), which has
been defined by the 3rd Generation Partnership Project (3GPP)
[2]. The decoding algorithms for the constituent decoders
are not specified by 3GPP. Different suboptimal algorithms
have been used for the turbo decoding and new solutions
with less complexity are still being invented quite fre-
quently. A programmable implementation simplifies the
necessary support for different suboptimal algorithms of
constituent decoders and different interleaving methods.
The hardware designs of turbo decoders are at a mature
stage due to the extensive efforts of researchers. The
hardware implementations provide high throughput, but the
development time is not as rapid as the processor based
implementations. Besides, the hardware implementations
suffer from inflexibility. On the other hand, a software
based implementation provides the required flexibility to
support multistandard solutions, but requires a careful
design to achieve the target throughput. Programmable
accelerators, which enable software-hardware co-design
method might be an attractive solution to overcome these
bottlenecks. The design of software and hardware together
to achieve the best performance and ensure programma-
bility is not a straightforward task. The designer needs a
very efficient tool, which can be used to design the pro-
cessor easily for a particular application.
In this paper, we propose a design of a processor based
on the transport triggered architecture (TTA) for a flexible
turbo decoder. TTA is a processor template for a pro-
grammable application-specific instruction-set processor
S. Shahabuddin (✉) · J. Janhunen · M. Juntti
Department of Communications Engineering and Centre for
Wireless Communications, University of Oulu, Oulu, Finland
e-mail: [email protected]
A. Ghazi · O. Silven
Department of Computer Science and Engineering, University of
Oulu, Oulu, Finland
Analog Integr Circ Sig Process (2014) 78:611–622
DOI 10.1007/s10470-013-0183-y
(ASIP). The TTA based codesign environment (TCE) tool
enables the designer to write an application with a high
level language and design the target processor in a graph-
ical user interface at the same time [3].
The turbo decoder based on the TTA was implemented
by Salmela et al. [4] and by us [5]. The throughput of the
processor in [4] was comparable to the turbo decoders
based on monolithic hardware design. However, the design
was less flexible in comparison with the design presented
in this paper. The interleaver of the processor was not
presented according to the 3GPP LTE standard [2].
Besides, the processor was programmed with assembly
language while the high level language is applied herein.
The design in [5] was flexible and able to support four
suboptimal algorithms. However, the design used more
computational units and provided less throughput than the
design presented in this paper. Besides, no synthesis
results were presented.
A TTA processor with vector function units is presented
in this paper. The main computational parts of the turbo
decoding algorithm use the vector units. The scalar func-
tion units are also used, for example, in loop indexing. The
quadratic permutation polynomial (QPP) interleaving pat-
tern is used for the interleaver block. The processor is
programmed with C language to shorten the time-to-mar-
ket. The processor achieves a maximum of 210 MHz
operating frequency and takes an area of 85.74 kgates with
90 nm technology. With a single full iteration 68.35 Mbps
throughput is achieved for the max-log-MAP algorithm for
a clock frequency of 210 MHz.
The main contribution of the work is the design of a
processor that is able to change the suboptimal turbo
decoding algorithms. If the channel condition is bad and
causes more errors, the more complex log-MAP algorithm
can be used. If the channel condition is good, the simpler
max-log-MAP algorithm can be used to achieve higher
throughput. Similarly, for different channel conditions
different suboptimal algorithms can be used. Therefore, a
decoder is needed which can accommodate all the subop-
timal algorithms. Under these circumstances, a program-
mable ASIP is a better choice than monolithic hardware
decoders, as it provides the required flexibility and reuse
capability of the hardware. In this paper, we propose a
processor for turbo decoding with all the well-known
suboptimal SISO algorithms.
The rest of the paper is organized as follows: In Sect. 2
an overview of the turbo decoder is presented. In Sect. 3
the simplification techniques of the turbo decoding algo-
rithm are considered. A brief description of the tool chain
to design a TTA processor is presented in Sect. 4. In Sect. 5
the vector special function units based on the simplification
technique are presented. The top level architecture of the
TTA processor and the programming techniques are
addressed in Sect. 6. In Sect. 7 the performance evaluation
of the design is presented. The conclusions are drawn in
Sect. 8.
2 Turbo decoder
2.1 Turbo decoder structure
The turbo decoder consists of two soft-input soft-output
(SISO) decoders, with interleavers and de-interleavers
between them as shown in Fig. 1. The inputs of the turbo
decoder come from the soft demodulator which produces
the log-likelihood ratios (LLR) for the systematic and
parity bits. The LLRs of the systematic bits, LuI, and the first
parity bits, LcI1, are fed to the first SISO decoder. The SISO
decoder produces soft outputs, i.e., the a posteriori prob-
ability and so called extrinsic information, based on these
LLRs. These soft outputs are used in the second SISO
decoder as the additional a priori information. The inputs
of the second SISO decoder are the LLRs coming from the
systematic bits, the second parity bits denoted by LcI2 and
the output of the first SISO decoder. The LLRs of the
systematic bits are reordered this time with the same
interleaving pattern used at the encoder. Similarly, the soft
outputs coming from the first SISO decoder are also
processed with the same interleaving pattern and used as a
priori values for the second SISO decoder.
The heart of the turbo coding is the iterative decoding
procedure. The output of the second SISO decoder does not
produce the hard outputs immediately, but the soft output is
used again in the first SISO decoder for more accurate
approximation. The process continues in a similar fashion
in an iterative manner. A single iteration by both the first
and the second SISO decoder is referred to as a full iter-
ation. On the other hand, the operation performed by a
Fig. 1 Block diagram of the turbo decoder
single SISO decoder can be referred to as a half iteration.
At the beginning of the first iteration, the a priori values
are set to zero. Six to eight full iterations are used to
achieve sufficient performance [1].
2.2 MAP algorithm for component decoder
The MAP algorithm for the constituent decoder applied here
has been proposed by Benedetto et al. [6]. The decoding is
done in smaller windows so that the decoding process can be
done in parallel and the decoder does not have to wait for the
whole block to arrive before starting the decoding process.
This windowing is sometimes referred to as a sliding window
method. The algorithm can be stated as follows:
1. Initialize the values of the forward state metric as
$\alpha_0(s) = 0$ if $s = S_0$ and $\alpha_0(s) = -\infty$ otherwise.
2. Calculate all the forward state metric values of the
same window through the forward recursion according
to
$\alpha_k(s) = \max_e^*\big(\alpha_{k-1}[s^S(e)] + u(e) L_{uI,k-1} + c_1(e) L_{cI1,k-1} + c_2(e) L_{cI2,k-1}\big), \quad (1)$
where $s^S(e)$ denotes the starting state and $s^E(e)$ denotes the
ending state of edge $e$. The input bits are represented by $u(e)$ and the
output bits are represented by $c_1(e)$ and $c_2(e)$.
3. Initialize the values of the backward state metric as
$\beta_n(s) = 0$ if $s = S_n$ and $\beta_n(s) = -\infty$ otherwise.
4. Calculate all the backward state metric values of the
same window through the backward recursion as
$\beta_k(s) = \max_e^*\big(\beta_{k+1}[s^E(e)] + u(e) L_{uI,k+1} + c_1(e) L_{cI1,k+1} + c_2(e) L_{cI2,k+1}\big). \quad (2)$
5. The LLR values for the information and both parity
bits can be calculated as
$\mathrm{LLR}(\cdot\,; O) = \max_e^*\big(\alpha_{k-1}[s^S(e)] + c_1(e) L_{cI1,k-1} + c_2(e) L_{cI2,k-1} + \beta_k[s^E(e)]\big). \quad (3)$
The forward and backward metrics increase in each step,
which is why they need to be normalized to avoid an
overflow in the computation.
2.3 Suboptimal MAP algorithms
The MAP algorithm can take several suboptimal forms
depending on the implementation of the Jacobian loga-
rithms. Four types of suboptimal forms of MAP algorithm
are used in this paper and they are explained below. A
more detailed overview can be found in [7, 8] and [9].
Log-MAP The Jacobian logarithm takes the following
form in log-MAP algorithm,
$\max^*(a, b) = \max(a, b) + \ln(1 + e^{-|b-a|}). \quad (4)$
The Jacobian logarithm equals the sum of the
maximum of two arguments and a correction function [7].
The correction function is implemented using a look-up
table [10, 11].
Max-log-MAP The max-log-MAP is the simplest of all
the suboptimal forms of the MAP algorithm [7]. The
Jacobian logarithm is expressed with the maximum of two
arguments in max-log-MAP algorithm as
$\max^*(a, b) = \max(a, b). \quad (5)$
Constant-log-MAP The approximation of log-MAP used
in [8] is called a constant-log-MAP algorithm and can be
expressed as
$\max^*(a, b) = \max(a, b) + \begin{cases} 0 & \text{if } |b - a| > M \\ C & \text{if } |b - a| \le M. \end{cases} \quad (6)$
Linear-log-MAP The linear approximation of Jacobian
logarithm used in [9] is called a linear-log-MAP algorithm
and can be expressed as
$\max^*(a, b) = \max(a, b) + \begin{cases} 0 & \text{if } |b - a| > M \\ T(|b - a| - M) & \text{if } |b - a| \le M. \end{cases} \quad (7)$
In [9], the values of parameters T and M were chosen for
a fixed-point implementation.
2.4 QPP interleaver
The QPP interleaver has been adopted for the 3GPP LTE
standard [2]. Unlike the earlier 3G interleavers, the QPP
interleaver is based on algebraic constructions. The QPP
interleaver can be expressed by a simple mathematical
formula. The relationship between the output index x and
the input index f(x) can be derived as
$f(x) = (f_2 x^2 + f_1 x) \bmod N. \quad (8)$
The values of the parameters f1 and f2 of (8) are integers
and depend on the block size N. For a different N, a dif-
ferent set of parameters has been defined in [12]. The block
sizes are divisible by 4 and 8. The value of f1 is an odd
integer while the value of f2 is even. Several algebraic
properties of the QPP interleaver have been listed in [13].
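The address generation of (8) can be sketched in a few lines of C; a 64-bit intermediate keeps f2·x² from overflowing even for the largest LTE block size N = 6,144.

```c
#include <stdint.h>

/* QPP interleaver address generation per (8):
 * f(x) = (f2*x^2 + f1*x) mod n, computed with a 64-bit
 * intermediate to avoid overflow for large block sizes. */
uint32_t qpp_interleave(uint32_t x, uint32_t f1, uint32_t f2, uint32_t n) {
    uint64_t xx = x;
    return (uint32_t)((f2 * xx * xx + f1 * xx) % n);
}
```

For a valid parameter pair, f is a permutation of {0, ..., N-1}, which is what makes the memory access contention-free; the check below assumes the pair f1 = 3, f2 = 10 listed for N = 40 in [12].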
3 Simplified MAP algorithm
The MAP algorithm described above is complex and needs
to be simplified. Repeated calculations should be avoided
in order to reduce the latency of the processor. The reduced
calculations used in this work will be described in this
section. A similar idea has been presented in [14] where a
simplified version of max-log-MAP has been proposed.
The reduced design is tested in a link level simulator using
Matlab. The Matlab design is converted to C language in
such a way that it can be efficiently implemented in the
TCE tool.
There are 16 branch metric computations between two
states for the forward metric, the backward metric and the
LLR calculations. From the trellis structure of the 3GPP
turbo code of Fig. 2, it can be seen that four calculations of
the branch metrics are repeated, resulting in a total of sixteen
calculations. Instead of multiplying with 1 and -1, the
branch metrics can be calculated directly based on these
four calculations,
$\gamma_k(1) = L_{uI,k} + L_{cI1,k} + L_{cI2,k}$
$\gamma_k(2) = -L_{uI,k} - L_{cI1,k} + L_{cI2,k}$
$\gamma_k(3) = L_{uI,k} + L_{cI1,k} - L_{cI2,k}$
$\gamma_k(4) = -L_{uI,k} - L_{cI1,k} - L_{cI2,k}, \quad (9)$
where $\gamma_k(4)$ can be represented as $-\gamma_k(1)$ and $\gamma_k(3)$ can be
represented as $-\gamma_k(2)$. For every forward metric, backward
metric and LLR computation, calculating two branch metrics
is therefore sufficient.
The forward metric calculations are shown as,
$\alpha_k(1) = \max^*(\alpha_{k-1}(1) - \gamma_{k-1}(1),\; \alpha_{k-1}(2) + \gamma_{k-1}(1))$
$\alpha_k(5) = \max^*(\alpha_{k-1}(1) + \gamma_{k-1}(1),\; \alpha_{k-1}(2) - \gamma_{k-1}(1))$
$\alpha_k(2) = \max^*(\alpha_{k-1}(3) + \gamma_{k-1}(2),\; \alpha_{k-1}(4) - \gamma_{k-1}(2))$
$\alpha_k(6) = \max^*(\alpha_{k-1}(3) - \gamma_{k-1}(1),\; \alpha_{k-1}(4) + \gamma_{k-1}(1))$
$\alpha_k(3) = \max^*(\alpha_{k-1}(5) - \gamma_{k-1}(1),\; \alpha_{k-1}(6) + \gamma_{k-1}(1))$
$\alpha_k(7) = \max^*(\alpha_{k-1}(5) + \gamma_{k-1}(1),\; \alpha_{k-1}(6) - \gamma_{k-1}(1))$
$\alpha_k(4) = \max^*(\alpha_{k-1}(7) + \gamma_{k-1}(1),\; \alpha_{k-1}(8) - \gamma_{k-1}(1))$
$\alpha_k(8) = \max^*(\alpha_{k-1}(7) - \gamma_{k-1}(1),\; \alpha_{k-1}(8) + \gamma_{k-1}(1)). \quad (10)$
Similarly, the backward metric calculations are shown
as,
$\beta_k(1) = \max^*(\beta_{k+1}(1) - \gamma_{k+1}(1),\; \beta_{k+1}(5) + \gamma_{k+1}(1))$
$\beta_k(2) = \max^*(\beta_{k+1}(1) + \gamma_{k+1}(1),\; \beta_{k+1}(5) - \gamma_{k+1}(1))$
$\beta_k(3) = \max^*(\beta_{k+1}(2) + \gamma_{k+1}(2),\; \beta_{k+1}(6) - \gamma_{k+1}(2))$
$\beta_k(4) = \max^*(\beta_{k+1}(2) - \gamma_{k+1}(2),\; \beta_{k+1}(6) + \gamma_{k+1}(2))$
$\beta_k(5) = \max^*(\beta_{k+1}(3) - \gamma_{k+1}(2),\; \beta_{k+1}(7) + \gamma_{k+1}(2))$
$\beta_k(6) = \max^*(\beta_{k+1}(3) + \gamma_{k+1}(2),\; \beta_{k+1}(7) - \gamma_{k+1}(2))$
$\beta_k(7) = \max^*(\beta_{k+1}(4) + \gamma_{k+1}(1),\; \beta_{k+1}(8) - \gamma_{k+1}(1))$
$\beta_k(8) = \max^*(\beta_{k+1}(4) - \gamma_{k+1}(1),\; \beta_{k+1}(8) + \gamma_{k+1}(1)). \quad (11)$
Finally, the LLR calculation can be expressed as
$\mathrm{LLR}_k = \max^*\big(\max^*(\alpha_{k-1}(1) + \beta_k(1),\; \alpha_{k-1}(2) + \beta_k(5)) + \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(7) + \beta_k(8),\; \alpha_{k-1}(8) + \beta_k(4)) + \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(3) + \beta_k(6),\; \alpha_{k-1}(4) + \beta_k(2)) + \gamma_{k-1}(2),$
$\qquad \max^*(\alpha_{k-1}(5) + \beta_k(3),\; \alpha_{k-1}(6) + \beta_k(7)) + \gamma_{k-1}(2)\big)$
$\quad - \max^*\big(\max^*(\alpha_{k-1}(1) + \beta_k(5),\; \alpha_{k-1}(2) + \beta_k(1)) - \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(5) + \beta_k(7),\; \alpha_{k-1}(6) + \beta_k(3)) - \gamma_{k-1}(1),$
$\qquad \max^*(\alpha_{k-1}(3) + \beta_k(2),\; \alpha_{k-1}(4) + \beta_k(6)) - \gamma_{k-1}(2),$
$\qquad \max^*(\alpha_{k-1}(7) + \beta_k(4),\; \alpha_{k-1}(8) + \beta_k(8)) - \gamma_{k-1}(2)\big). \quad (12)$
4 TTA processor design methodology
4.1 TTA processor design
TTA is a processor design philosophy where the program
directly controls the internal data transport between dif-
ferent function units of a processor. A typical processor
consists of function units to perform computation, register
files to store the data and control unit for program flow
Fig. 2 Trellis of 3GPP turbo code
control. Buses, ports and sockets make the interconnection
network of the processor. In a TTA processor, one of the
input ports of a function unit is marked as a triggering port.
Writing data to the triggering port of a function unit starts
the computation. The TTA processor works with a single
move instruction and does not require a complex instruc-
tion decoding unit. Most of the instruction scheduling
decisions are made at compile time, which further reduces
the processing logic of the TTA processor [15, 16].
TCE provides an excellent toolset for designing,
implementing and simulating a TTA processor. The toolset
includes a processor design tool with an extensive library
of function units. It also includes a C/C++ compiler to
compile the implemented software for the designed TTA
processor. A graphical and command line simulator is
provided to analyze the execution of the program on the
processor with detailed information about the resource
usages and cycle counts during the execution [3].
If the designed processor does not meet the target per-
formance, the designer can go back to the source code and
design the processor again. In order to find the actual run
time, the processor must be synthesized or mapped on a
field programmable gate array. A rich hardware database
(HDB) is provided in TCE that includes VHSIC hardware
definition language (VHDL) description for most of the
common TTA function units, register files, sockets etc.
TCE also allows the designer to build more complicated
special function units to reduce the processing latency. In
that case, the designer has to write the VHDL description
for the special function unit and add it to the HDB. A
diagram of the TTA processor design methodology is given
in Fig. 3.
4.2 Programming TTA processor
TCE allows the TTA processor to be programmed with a
high level language like C or with low level assembly
language. As stated earlier, a TTA assembly program has only one
instruction, called move. In Fig. 4, part of a TTA processor with an
adder and a multiplier is shown. The black horizontal
straight lines represent the buses of the processor. The
vertical rectangular blocks represent the sockets. The
connections between the function units and the buses are
illustrated by black dots in the sockets.
Part of an assembly program for this TTA processor in
Fig. 4 is shown in Table 1. The program adds some
constant values and multiplies the sums with
other values.
It can be seen that the programmer needs to carefully
choose the function units and buses in case of assembly
programming. The more the architecture includes buses
and function units, the more complicated the programming
becomes. If the design needs to be changed for some
reason, e.g., buses need to be removed or added, the whole
program has to be rewritten. In case of high
level programming the compiler takes care of the sched-
uling. The designer can add or remove function units
without worrying about the programming. Therefore, the
Fig. 3 TTA processor design methodology
Fig. 4 Part of a TTA processor
Table 1 Part of the assembly program for the TTA processor in
Fig. 4

Bus 1                 Bus 2              Bus 3
74 → ADD.in1t.add     71 → ADD.in2       174 → MUL.in2
238 → ADD.in1t.add    231 → ADD.in2      ADD.out1 → MUL.in1t.mul
60 → ADD.in2          38 → MUL.in2       ADD.out1 → MUL.in1t.mul
work presented in this paper used high level programming
language to reduce the development time.
5 Vector special function units
The vector special function units designed for this imple-
mentation are described in this section. The vector units are
designed in such a way that they facilitate programming
in a high level language. In our case, C programming is used.
The word lengths from [4] and [11] are used for the input
LLRs.
5.1 PACK special function unit
A special function unit, PACK, is used to merge the three
8-bit LLR inputs, namely the a priori LLR, the LLR for the
systematic bits and the LLR for the parity bits. An opcode signal,
which is used to select different suboptimal algorithms, can
also be merged with these three input signals. These signals
will constitute a 32-bit packed input that can be used for
the vector units that compute forward metrics, backward
metrics and output LLRs.
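A minimal sketch of the packing operation; placing the opcode in the top byte is an illustrative assumption, not the unit's documented bit layout.

```c
#include <stdint.h>

/* PACK sketch: three signed 8-bit LLRs and the algorithm-select
 * opcode merged into one 32-bit word (opcode in the top byte is
 * an assumed, illustrative field order). */
uint32_t pack_llrs(int8_t a_priori, int8_t systematic, int8_t parity,
                   uint8_t opcode) {
    return ((uint32_t)opcode << 24) |
           ((uint32_t)(uint8_t)a_priori << 16) |
           ((uint32_t)(uint8_t)systematic << 8) |
           (uint32_t)(uint8_t)parity;
}
```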
5.2 METRIC special function unit
A function unit can be made that would calculate eight
forward metrics based on the earlier state forward metrics
and the LLR input. In this work, a vector METRIC unit is
designed which takes three 32-bit inputs to calculate the
forward metric. The eight forward metrics of the earlier
state can be packed into two 32-bit inputs. These two 32-bit
inputs and the input coming from the PACK unit are fed to
the METRIC unit to calculate the eight forward metrics for
the next state. These eight forward metric outputs are also
packed as two 32-bit outputs and used for further pro-
cessing. A block diagram of the METRIC unit is given in
Fig. 5. The diagram shows two forward metric calculations
of a single butterfly pair. The rest of the calculations are not
shown for readability.
The forward and backward metrics calculations are not
the same. However, from (10) and (11) it can be seen that
the operations needed to calculate the forward and back-
ward metrics are similar. If the inputs are arranged prop-
erly, the same METRIC unit can be used to calculate the
backward metrics. For example, ak-1(1), ak-1(2), ak-1(3)
and ak-1(4) are merged as one 32-bit input in the METRIC
unit. From these four 8-bit inputs and with the help of the
LLR inputs, four forward metrics can be calculated. They
are ak(1), ak(5), ak(2) and ak(6).
Similarly, bk+1(1), bk+1(5), bk+1(2) and bk+1(6) can be
merged as one 32-bit input to calculate bk(1), bk(2), bk(3)
and bk(4). Note that, due to the structure of the METRIC
unit, bk(1) and bk(3) will reside inside the first 32-bit output
and bk(2) and bk(4) will reside inside the second 32-bit
output.
The normalization is needed inside the METRIC unit to
avoid an overflow. The normalization is done
following the technique suggested by Valenti et al. [17].
Typically, the normalization is done in every step by
subtracting the minimum value of the forward or
backward metrics of the same column from every forward or
backward metric value. Valenti et al. suggested instead
subtracting the first value of the column from all the values.
Thus, there is no need to store the first row and this
constitutes a 12.5 percent saving in memory compared to the
Fig. 5 Metric function unit
other normalization methods. The normalization is not
shown in Fig. 5 for simplicity.
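A minimal sketch of this normalization, with plain integer metrics standing in for the unit's internal fixed-point registers:

```c
/* Normalization per [17]: every metric of a column is reduced by
 * the first one, so the first metric is always zero afterwards and
 * never needs to be stored (the 12.5 percent memory saving). */
void normalize(int metrics[8]) {
    int first = metrics[0];
    for (int i = 0; i < 8; i++)
        metrics[i] -= first;
}
```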
The METRIC unit supports four operations for max-log-
MAP, linear-log-MAP, constant-log-MAP and log-MAP
respectively. The max-log-MAP needs a straightforward
maximum selection operation, as can be seen in (5). From
(4), (6) and (7) it can be seen that the other algorithms use
the maximum selection and some additional calculations.
All the operations in the METRIC unit can exploit the same
maximum selection logic. The additional calculations are
different for different algorithms and need some additional
logic. Thus, the four operations need different numbers of
internal registers and the latencies of the four operations
are different.
5.3 ARRANGE special function unit
The drawback of using the same METRIC unit for both
forward and backward metric is the fact that an additional
unit is needed to arrange the outputs of the METRIC unit
during the backward metric computation. It can be seen
from Fig. 5, that the METRIC unit would provide two
32-bit outputs with one consisting of bk(1), bk(3), bk(5)
and bk(7) and the other consisting of bk(2), bk(4), bk(6)
and bk(8). To use the same METRIC unit again for the
next backward metric calculation, two 32-bit inputs are
needed, with one input consisting of bk(1), bk(5), bk(2)
and bk(6) and the other consisting of bk(3), bk(7), bk(4)
and bk(8). That is the functionality of the ARRANGE
special function unit that changes the relative positions of
the inputs. The functionality of the ARRANGE unit is
shown in Fig. 6.
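The reordering can be sketched as follows; taking byte 0 of a packed word as its first metric is an illustrative assumption.

```c
#include <stdint.h>

/* ARRANGE sketch: the two packed METRIC outputs, (b1,b3,b5,b7) and
 * (b2,b4,b6,b8), are reordered into the input layout (b1,b5,b2,b6)
 * and (b3,b7,b4,b8) expected by the next backward step. */
void arrange(uint32_t in_odd, uint32_t in_even,
             uint32_t *out_lo, uint32_t *out_hi) {
    uint8_t b[9]; /* b[1]..b[8] hold beta_k(1)..beta_k(8) */
    for (int i = 0; i < 4; i++) {
        b[1 + 2 * i] = (uint8_t)(in_odd  >> (8 * i)); /* b1, b3, b5, b7 */
        b[2 + 2 * i] = (uint8_t)(in_even >> (8 * i)); /* b2, b4, b6, b8 */
    }
    *out_lo = (uint32_t)b[1] | ((uint32_t)b[5] << 8) |
              ((uint32_t)b[2] << 16) | ((uint32_t)b[6] << 24);
    *out_hi = (uint32_t)b[3] | ((uint32_t)b[7] << 8) |
              ((uint32_t)b[4] << 16) | ((uint32_t)b[8] << 24);
}
```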
5.4 LLROUT special function unit
Another special function unit named LLROUT is designed
to calculate the output LLRs after the forward and back-
ward metric computation. It can be seen from (12) that
eight forward metric and eight backward metric and three
LLR values are needed for the output LLR computation.
Four packed 32-bit outputs coming from the METRIC unit
after the forward and backward metric computations are
used as the inputs of the unit. Besides, the packed LLR
inputs with the opcode are used as another input. Therefore,
the LLR special function unit takes five inputs and pro-
duces one 8-bit LLR output. The four suboptimal algo-
rithms can be chosen with the opcode which is used in the
packed input. The unit supports four operations with dif-
ferent latency for the four different algorithms.
6 TTA processor for turbo decoding
6.1 Top level architecture of the processor
A part of the processor designed for the turbo decoder is
illustrated in Fig. 7. For readability, the whole processor
figure is not given. The blocks on the upper part of the
figure represent the function units and the register files of
the processor.
Fig. 6 Inputs and outputs of ARRANGE special function unit
Fig. 7 Implemented processor with reduced number of function units
The fixed point processor includes a load/store unit (LSU),
an arithmetic logic unit (ALU), a global control unit and
register files. Based on the resource requirements of the high
level language code, more function units and register files are added.
The LLR inputs are read from a first-in-first-out (FIFO)
memory buffer by using the function unit called STREAM.
The STREAM units can read every input sample in one
clock cycle. Three STREAM units are used to get the input
LLRs simultaneously. One STREAM unit is used to write
the output LLRs into the FIFO buffer.
Like the Viterbi decoder, the turbo decoder needs
numerous add-compare-select operations. One way to
design the processor for these algorithms is to use the
appropriate number of adder and maximum selection units.
Another way is to design a special function unit to calculate
all the necessary next state metrics based on the earlier
state metrics and branch metrics.
The latter way is followed in our design and a special
function unit named METRIC is designed with three inputs
and two outputs. The functionality of this unit is already
described in Sect. 2 The PACK unit is used which takes the
three LLR inputs and the opcode to use the particular
suboptimal algorithm. One ARRANGE unit is used to
calculate the backward metrics.
Two LSU units are used to support the memory accesses,
reading and writing the data memory. The
memory can be read in three clock cycles and can be
written in a single cycle. The ALU unit is used to perform
the basic arithmetic operations like addition, subtraction
etc. Operations like shifting right or left are also included
in the ALU.
Thirteen buses are used in the design. Several register
files are used to save the intermediate results. In terms of
the power consumption, registers can be more expensive
than the memory, but to meet the latency requirement
register files are needed. A single Boolean register file is
included in the processor design.
6.2 Programming the processor
The processor is programmed with the high level language C.
The architecture of the processor and the structures of the
vector special function units allow the use of 8-bit char-
acters and 32-bit integers, which greatly simplifies the soft-
ware design. Macros are used to call the special function
units.
The turbo decoding algorithm uses three blocks of 6,144
input LLRs. The blocks are divided into smaller windows
to reduce the memory requirements. The forward metrics
are calculated and stored first by following the simplifica-
tion in (10).
The backward metrics are calculated next using the
ARRANGE and METRIC units. As soon as the backward metric
calculations of a single column are finished, the LLRs are
calculated using the LLROUT unit with the help
of the stored forward metrics. Therefore,
there is no need to save the backward metrics [17].
The turbo decoding algorithm contains many data
dependencies, so efficient code is needed to utilize
the available computing resources. The code is written
to reduce memory accesses by holding
values inside the register files. It is possible to eliminate
the need for accessing the forward metric vectors con-
stantly. Instead, temporary variables are used to hold
the values for the next iteration. The code for the backward
metric and the LLR calculations is written in the same
manner.
A significant improvement in throughput is achieved by
partially unrolling the main loops of the code. The code is
unrolled such that the main loop body processes 100
samples to improve the opportunities for parallelism across
the samples and reduce the amount of time spent evaluating
the loop condition. A static multi-issue machine needs
statically schedulable parallel operations to utilize the
hardware fully. One way is to schedule operations from
multiple loop iterations in parallel. Unrolling the loops
increases the instruction memory size due to the increased
code size, but the achieved benefit is significant.
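A minimal sketch of the idea, with a factor-4 unroll and a stand-in per-sample operation (scaling by 2) instead of the actual trellis-stage computation; the real implementation unrolls over 100 samples.

```c
/* Partial loop unrolling: four samples per trip expose independent
 * operations for static scheduling and quarter the loop-condition
 * overhead; a remainder loop handles the leftover samples. */
void scale_window(int n, const int *in, int *out) {
    int k = 0;
    for (; k + 4 <= n; k += 4) {      /* unrolled main loop */
        out[k]     = 2 * in[k];
        out[k + 1] = 2 * in[k + 1];
        out[k + 2] = 2 * in[k + 2];
        out[k + 3] = 2 * in[k + 3];
    }
    for (; k < n; k++)                /* remainder loop */
        out[k] = 2 * in[k];
}
```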
In theory, the critical path to process one sample in the for-
ward metric loop is four clock cycles. Furthermore, the
theoretical pipeline can take a new input and output a
sample every clock cycle. At the moment our implemen-
tation has a six clock cycle pipeline. The bottleneck
is the data access, which is very common for
algorithms processing large amounts of data. There are two
possible solutions for this limitation. Because the algo-
rithm requires three inputs at once, there should be three
data accesses. These can be handled either with two memory
spaces, one with a single port and one with two ports,
or by stream units with FIFO buffers. In many solutions
the data bottlenecks are solved with FIFO buffers. We rely
on streaming inputs and outputs using the STREAM units,
but the intermediate results between the forward and
backward metric calculations use the memory.
A look-up table, holding the values of the f1 and f2
parameters, is used to implement the QPP interleaver. The
rest of the calculations are quite straightforward.
7 Results and discussion
The designed processor takes 56,624 clock cycles to pro-
cess three blocks of 6,144 samples for a full iteration for
the max-log-MAP mode. According to the 3GPP inter-
leaver specifications, the size of the block could be up to
6,144 samples, which is why the size of the input
block is chosen as 6,144. The size of the block can be
changed using the program code.
The throughput for one full iteration for the max-log-
MAP with a clock frequency of 210 MHz can be calculated
as,
Throughput = (6,144 bits × 3 × 210 MHz) / 56,624 clock cycles
           = 68.35 Mbps.
The number of clock cycles needed for a single trellis stage
of the max-log-MAP can be calculated as

CycleStage = 56,624 / (2 × 3 × 6,144) = 1.53 cycles.
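The two figures above can be checked with a couple of lines of C:

```c
/* Worked check of the throughput and cycles-per-stage figures:
 * 3 x 6,144 bits decoded in 56,624 clock cycles at 210 MHz. */
double throughput_mbps(void) {
    return 6144.0 * 3.0 * 210e6 / 56624.0 / 1e6; /* approx. 68.35 Mbps */
}

double cycles_per_stage(void) {
    return 56624.0 / (2.0 * 3.0 * 6144.0); /* approx. 1.53 cycles */
}
```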
It can be seen from Table 2 that the processor takes
more clock cycles if the max-log-MAP algorithm is not used.
The reason lies in the fact that the constant-log-MAP,
linear-log-MAP and log-MAP algorithms need to invoke
conditional statements, resulting in branches during
execution. Thus, the latency increases compared to the
max-log-MAP algorithm.
Linear-log-MAP is slower than the constant-log-MAP
because multiplication operations are needed. The log-
MAP algorithm provides a better FER performance than
the other algorithms. Log-MAP can be used in cases when
the latency requirements are not strict and a high FER
performance is needed. A scenario such as this is simulated
with a link level Matlab simulator in Fig. 8. Log-MAP and
max-log-MAP based turbo decoders are used in a 4 × 4
system. The modulation method and the detection algorithm
used for both of the decoders are 16-QAM and
LMMSE detection, respectively. The channel model is
based on the 3GPP vehicular A (3GPP-VA) model rec-
ommended by International Telecommunication Union.
The mobile velocity is set to 120 km/h in the simulation. A
5 MHz bandwidth corresponding to 512 orthogonal frequency division multiplexing (OFDM) subcarriers is applied. One
frame is equal to one OFDM symbol in the simulator.
Thus, one frame consists of 512 subcarriers. Here, 300
subcarriers are loaded with data and the rest are used as
guard interval. Six turbo iterations are used in the
simulation.
It can be seen from Fig. 8 that the FER performance
difference between log-MAP and max-log-MAP is nearly 2
dB in the high SNR region. The log-MAP is more complex
than max-log-MAP. However, for this particular scenario the additional complexity is compensated by the FER performance gain. In [17] the FER performance of log-MAP, linear-log-MAP, constant-log-MAP and max-log-MAP has been shown for different scenarios.
The buses of the processor are efficiently utilized to achieve the best possible result, thanks to efficient scheduling. The numbers of some of the operations during algorithm execution are summarized in Table 3.
The number of addition operations represents not only the additions of the algorithm itself but also those for several other purposes, such as loop indexing. The multiplication is used for the QPP interleaving sequence generation.
The throughput can be increased by using dedicated
special function units working as accelerators for the pro-
cessor. On the other hand, the more special function units
are used, the less flexible the processor becomes. As an
example, a special function unit dedicated to calculate the
backward metric and the LLRs could be designed to reduce latency. However, it might be useless for some other functionalities. A common special function unit to calculate both the forward and backward metrics could be better utilized in this case. The TTA processor designed here also works for Viterbi decoding with reasonable throughput. The BRANCH unit for the max-log-MAP mode is able to calculate the add-compare-select operations needed for Viterbi decoding.

Fig. 8 Decoding performance comparison for the log-MAP and max-log-MAP algorithms

Table 2 Number of clock cycles for a single iteration with three simulation blocks

Mode  Algorithm         Clock cycles
1     max-log-MAP       56,624
2     linear-log-MAP    67,244
3     constant-log-MAP  80,298
4     log-MAP           104,628

Table 3 Number of operations for a full iteration

Operation  # of ops
ADD        98,208
PACK       12,288
ARRANGE    12,288
MUL        29,576
LLROUT     12,288
METRIC     24,576
STREAM     49,152
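That add-compare-select recursion can be sketched as follows; the radix-2 shift-register trellis wiring and the function name are illustrative, not the processor's actual function-unit interface.

```python
def acs_step(path_metrics, branch_metrics):
    """One add-compare-select step over all states of a radix-2 trellis.

    path_metrics[s] is the accumulated metric of state s;
    branch_metrics[(p, s)] is the metric of the transition p -> s.
    Returns the updated metrics and survivor decisions for traceback.
    """
    n = len(path_metrics)
    new_metrics = [0.0] * n
    survivors = [0] * n
    for s in range(n):
        # Assumed shift-register wiring: state s is reached from states
        # s >> 1 and (s >> 1) + n/2 in a standard radix-2 butterfly.
        p0, p1 = s >> 1, (s >> 1) + n // 2
        m0 = path_metrics[p0] + branch_metrics[(p0, s)]  # add
        m1 = path_metrics[p1] + branch_metrics[(p1, s)]  # add
        # Compare and select the survivor path in one step.
        new_metrics[s], survivors[s] = max((m0, 0), (m1, 1))
    return new_metrics, survivors
```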
The designed TTA processor is synthesized using UMC
90 nm standard cell library (fsd0k_generic_core 1d0vtc).
Synopsys Design Compiler is used to estimate gate count
and maximum frequency for the processor. The operating
conditions (temperature, operating voltage, manufacturing
process quality) for synthesis are set to default value
(TCCOM). The maximum clock frequency achieved dur-
ing the synthesis for the processor is 210 MHz. The total
gate count for the processor at 210 MHz is 87.11 kgates. A
comparison of the complexity and throughput for different
clock frequencies is given in Table 4.
A comparison with other implementations of turbo decoders is presented in Table 5. The software programmable solutions like [17, 18] and [19] provide very low throughput. A comparatively higher throughput
software implementation can be found in [21] where the
decoder has been implemented on a graphics processing unit (GPU) with parallel programming.
processor provides very good throughput compared to the
software programmed decoders and ASIPs. An efficient
ASIP implementation can be found in [22]. The processor was designed to decode turbo and low density parity-check (LDPC) codes. The processor was programmed in assembly, and still the throughput is lower than that of the implementation presented in this paper.
The throughput of the TTA processor presented in [4] is higher than that of our proposed solution, and it requires fewer resources. However, that processor was programmed in assembly, and its TTA architecture had dedicated function units built only for the max-log-MAP algorithm. Thus the design presented in [4] is closer to dedicated hardware and is unable to exploit the benefits of a processor-based solution. Naturally its performance is also close to that of a dedicated hardware solution, as can be seen from [23]. A key point is that the processor did not support the log-MAP algorithm, which makes it more error prone in poor channel conditions.
Our implementation is more flexible and is able to
utilize the programmability better than the solution given
in [4] by supporting several algorithms. As the processor
presented in this paper is programmed in C, the program
can be modified very easily, which would be very difficult
for the processor presented in [4]. A processor design
similar to [4] would increase the development time
significantly.
Recent hardware implementations of turbo decoders use parallel SISO decoders, which can be compared with multi-core solutions. An example of this can be seen in [24], where eight SISO decoders working in parallel
provided a very high throughput. A multi-core solution of several ASIPs working in parallel would be a competitive alternative to a parallel turbo decoder ASIC.

Table 4 Complexity and throughput of the TTA processor

Clock frequency (MHz)  Area (kgates)  Throughput (Mbps)
100                    70.91          32.55
125                    72.13          40.69
200                    83.22          65.10
210                    85.74          68.35

Table 5 Comparison of turbo decoder implementations

Category            Reference  Architecture        Algorithm    Throughput (normalized)  Area (kgates)  Clock frequency (MHz)
SW design           [18]       Motorola 56603      max-log-MAP  243 Kbps                 -              80
SW design           [17]       Intel Pentium       max-log-MAP  366 Kbps                 -              933
SW design           [19]       TMS320C6201 DSP     max-log-MAP  2 Mbps                   -              -
SW and HW codesign  [20]       VLIW ASIP           max-log-MAP  5 Mbps                   -              80
SW and HW codesign  [4]        TTA proc. for UMTS  max-log-MAP  14.1 Mbps                20.8           210
SW and HW codesign  [5]        TTA proc.           max-log-MAP  31.21 Mbps               -              -
SW design           [21]       Nvidia C1060        max-log-MAP  33.85 Mbps               -              -
SW and HW codesign  [22]       VLIW ASIP           max-log-MAP  46.5 Mbps                -              400
SW and HW codesign  Proposed   TTA proc.           max-log-MAP  65.10 Mbps               85.74          210
HW design           [23]       130 nm technology   max-log-MAP  90.3 Mbps                44.1           246
SW and HW codesign  [4]        TTA proc.           max-log-MAP  98 Mbps                  43.2           277
HW design           [24]       90 nm technology    max-log-MAP  1469.2 Mbps (with 8 SISO)  -            152
8 Conclusion
The paper described a novel design of a turbo decoder on a TTA processor using a high level language. The reference turbo decoder design is simulated on a Matlab link level simulator. The design is then converted into C language and mapped onto the TTA processor using the TCE tool. The design demonstrates the feasibility of implementing several decoding techniques on a single TTA processor. As the turbo decoding algorithm is very complex, it is difficult to achieve the LTE target throughput even with pure hardware designs. A parallel multi-core turbo decoder is a natural choice for next generation wireless systems. The target throughput could also be reached by a multi-core TTA processor. The flexibility gained from such a processor could provide very interesting results and would be a fruitful direction for future research. The energy efficiency of the processor would also be an interesting topic to study.
Acknowledgements This research was supported by the Finnish
Funding Agency for Technology and Innovation (Tekes), Renesas
Mobile Europe, Nokia Siemens Networks and Xilinx. Special thanks
are due to Dr. Perttu Salmela from Tampere University of Technology
and Jarkko Huusko from Centre for Wireless Communications
(CWC), University of Oulu for sharing their insights on program-
mable turbo decoder implementations.
References
1. Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near
Shannon limit error-correcting coding and decoding: Turbo-codes.
In IEEE international conference on communication, Geneva,
Switzerland. (Vol. 2, pp. 1064–1070).
2. Multiplexing and channel coding, 3GPP TS 36.212 version 10.5.0.
3. Jaaskelainen, P., Guzma, V., Cilio, A., Pitkanen, T., & Takala, J.
(2007). Codesign toolset for application-specific instruction-set
processors. In Multimedia on mobile devices 2007, Proceedings
of SPIE, San Jose, CA. (vol. 6507, pp. 1–11).
4. Salmela, P., Sorokin, H., & Takala, J. (2008). A programmable
max-log-MAP turbo decoder implementation. Hindawi VLSI
Design, 2008, 636–640.
5. Shahabuddin, S., Janhunen, J., & Juntti, M. (2013). Design of a
transport triggered architecture processor for a flexible iterative
turbo decoder. In Proceedings of wireless innovation forum
conference on wireless communications technologies and software radio (SDR-WInnComm), Washington, D.C.
6. Benedetto, S., Montorsi, G., Divsalar, D., & Pollara, F. (1996). A
soft-input soft-output maximum a posteriori (MAP) module to
decode parallel and serial concatenated codes. In JPL TDA
Progress Report, (vol. 42–127, pp. 1–20). Pasadena, CA: Jet
Propulsion Lab.
7. Robertson, P., Hoeher, P., & Villebrun, E. (1997). Optimal and
suboptimal maximum a posteriori algorithms suitable for turbo
decoding. In European transactions on telecommunication, (vol.
8, pp. 119–125).
8. Classon, B., Blankenship, K., & Desai, V. (2000). Turbo
decoding with the constant-log-MAP algorithm. In Proceedings
of the second international symposium on turbo codes and related topics (pp. 467–470). Brest, France.
9. Cheng, J.-F., & Ottosson, T. (2000). Linearly approximated log-
MAP algorithms for turbo coding. In Proceedings of IEEE
vehicular technology conference, Houston, TX.
10. Wang, Z. (2007). High-speed recursion architectures for MAP-
based turbo decoders. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 15(4), 470–474.
11. Li, L., Maunder, R. G., Al-Hashimi, B. M., & Hanzo, L. (2011).
A low-complexity turbo decoder architecture for energy-efficient
wireless sensor networks. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, PP(99), 1–9.
12. Nimbalker, A., Blankenship, Y. W., Classon, B. K., & Blan-
kenship, K. T. (2008). ARP and QPP Interleavers for LTE Turbo
coding. In IEEE wireless communications and networking con-
ference (pp. 1032–1037).
13. Sun, Y., & Cavallaro, J. R. (2010). Efficient hardware implementation of a highly-parallel 3GPP LTE, LTE-Advanced turbo
decoder. Integration, The VLSI Journal 44(1), 1–11.
14. Salmela, P., Jarvinen, T., & Takala, J. (2007). Simplified max-
log-MAP decoder structure. In Proceedings of the 1st joint
workshop on mobile future and the symposium on trends in
communications (SympoTIC ’06), (pp. 1–11). San Jose, CA.
15. Corporaal, H. (1998). Microprocessor architectures: from VLIW
to TTA. West Sussex: Wiley.
16. Esko, O., Jaaskelainen, P., Huerta, P., Lama, C. S. L., Takala, J., &
Martinez, J. I. (2010). Customized exposed datapath soft-core
design flow with compiler support. In 20th international con-
ference on field programmable logic and applications, August–
September, 2010, Milano, Italy.
17. Valenti, M. C., & Sun, J. (2001). The UMTS turbo code and an
efficient decoder implementation suitable for software defined
radios. International Journal on Wireless Information Networks,
8(4), 203–216.
18. Michel, H., Worm, A., Munch, M., & Wehn, N. (2002). Hard-
ware software trade-offs for advanced 3G channel coding. In
Proceedings of design, automation and test in Europe. Paris:
IEEE Computer Society
19. Song, Y., Liu, G., & Huiyang (2005). The implementation of
turbo decoder on DSP in W-CDMA system. In International
conference on wireless communications, networking and mobile
computing (pp. 1281–1283).
20. Ituero, P., & Lopez-Vallejo, M. (2006). New schemes in clustered
VLIW processors applied to turbo decoding. In Proceedings of
international conference on application-specific systems, archi-
tectures and processors (ASAP’06) (pp. 291–296). Colo:
Steamboat Springs.
21. Wu, M., Sun, Y., & Cavallaro, J. R. (2010). Implementation of a
3GPP LTE turbo decoder accelerator on GPU. In 2010 IEEE
workshop on signal processing systems (SIPS) (pp. 192–197), San
Francisco, CA.
22. Alles, M., Vogt, T., & Wehn, N. (2008). FlexiChaP: A recon-
figurable ASIP for convolutional, turbo, and LDPC code decod-
ing. In Proceedings of international symposium on turbo coding
(pp. 13–18). Lausanne.
23. Benkeser, C., Burg, A., Cupaiuolo, T., & Huang, Q. (2009).
Design and optimization of an HSDPA turbo decoder ASIC.
IEEE Journal of Solid-State Circuits, 44(1), 98–106.
24. Lin, C. H., Chen, C. Y., Chang, E. J., & Wu, A. Y. (2013).
Reconfigurable parallel turbo decoder design for multiple high-
mobility 4G systems. Springer Journal of Signal Processing
Systems, 73(2), 109–122.
Shahriar Shahabuddin received his B.Sc. degree from the University of Dhaka (Bangladesh) and M.Sc. degree from the University of Oulu (Finland), in 2009 and 2012 respectively. Currently, he is working as a doctoral researcher at the communication signal processing group of Centre for Wireless Communications (CWC), University of Oulu, Finland. His research interests include application specific processor and monolithic hardware design for digital signal processing.
Janne Janhunen received the M.Sc. (Tech.) and Dr.Sc. (Tech.) degrees in Electrical and Information Engineering and Communication Engineering from the University of Oulu, Oulu, Finland, in 2007 and 2011, respectively. His main research interests are energy efficient programmable architectures and cross-layer optimization in the area of wireless communication and video processing.
Markku Juntti (S’93–M’98–SM’04) received the M.Sc. (Tech.) and Dr.Sc. (Tech.) degrees in Electrical Engineering from the University of Oulu, Oulu, Finland, in 1993 and 1997, respectively. He was with the University of Oulu from 1992 to 1998. In academic year 1994–1995, he was a Visiting Scholar at Rice University, Houston, TX. From 1999 to 2000, he was a Senior Specialist with Nokia Networks. He has been a Professor of communications engineering at the Department of Communication Engineering and Centre for Wireless Communications (CWC), University of Oulu, since 2000. His research interests include signal processing for wireless networks as well as communication and information theory. He is an author or coauthor of some 200 papers published in international journals and conference records as well as of the book WCDMA for UMTS (Wiley, 2000). He is also an Adjunct Professor in the Department of Electrical and Computer Engineering, Rice University. Dr. Juntti is an Editor of the IEEE TRANSACTIONS ON COMMUNICATIONS and was an Associate Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY from 2002 to 2008. He was Secretary of the IEEE Communication Society Finland Chapter in 1996–1997 and its Chairman for 2000–2001. He has been Secretary of the Technical Program Committee (TPC) of the 2001 IEEE International Conference on Communications (ICC’01), and the Co-Chair of the Technical Program Committees of the 2004 Nordic Radio Symposium and the 2006 IEEE International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC 2006). He is the General Chair of the 2011 IEEE Communication Theory Workshop (CTW 2011).
Amanullah Ghazi holds a Master's degree in Wireless Communication Engineering from the University of Oulu, Finland, and a bachelor's degree in Computer Engineering from Aligarh Muslim University, India. Before pursuing his master's degree, he worked in the telecommunication industry for 3 years. Amanullah is currently working as a Doctoral Researcher at the Department of Computer Science & Engineering at the University of Oulu. His research interest is application specific processor design for SDR platforms.
Olli Silven received the M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Oulu in 1982 and 1988, respectively. Since 1996 he has been the Professor of Signal Processing Engineering at the University of Oulu. His main research interests are in embedded signal processing and machine vision system design. He has contributed to the development of numerous solutions from real-time 3-D imaging in reverse vending machines to IP blocks for mobile video coding.