

UNIVERSITY OF SOUTHAMPTON

Faculty of Engineering, Science and Mathematics

School of Electronics and Computer Science

A progress report submitted for continuation towards a PhD

Supervisors: Dr. Rob Maunder, Prof. Bashir M. Al-Hashimi and Prof. Lajos Hanzo

Examiner: Dr. Soon Xin Ng

Analysis of Low Power Implementational

Issues of Turbo-like Codes in Body Area

Networks

by Liang Li

November 3, 2009


UNIVERSITY OF SOUTHAMPTON

ABSTRACT

FACULTY OF ENGINEERING, SCIENCE AND MATHEMATICS

SCHOOL OF ELECTRONICS AND COMPUTER SCIENCE

A progress report submitted for continuation towards a PhD

by Liang Li

Body Area Networks (BANs) are a promising application of wireless sensor networks (WSNs) which is attracting a lot of research interest. A BAN is a WSN located in the vicinity of a human body for the continual monitoring of certain parameters of the body, and it can provide healthcare services in a more comfortable, convenient and economical way than conventional methods. The extremely low power and high reliability requirements of BANs make communication challenging. In this report, a state-of-the-art investigation of the research on communication technologies in BANs is given. Based on this investigation, a proposal to use Turbo-like codes as the channel coding scheme of BANs is discussed. Because of the low power requirement of BAN applications, the low power implementation issues of Turbo decoding schemes are discussed. A method to determine the optimal data width specification in a fixed-point implementation of a Turbo decoder from a low power point of view is presented. A framework to compare and evaluate different Turbo-like codes from the energy consumption point of view is proposed.


Contents

Acknowledgements

1 Introduction
1.1 Introduction of Body Area Networks (BANs)
1.2 Communication in BANs
1.2.1 Communication requirements
1.2.1.1 Frequency conditions
1.2.1.2 Network scale and communication range
1.2.1.3 Data rate
1.2.1.4 Reliability, accuracy and latency
1.2.1.5 Energy consumption
1.2.1.6 Network topology
1.2.1.7 Security
1.3 Candidate options for Body Area Networks
1.3.1 Turbo-like Codes
1.4 Outline of the report

2 Turbo-like Code Solutions in BANs
2.1 Introduction
2.2 Turbo codes and BCJR decoding algorithm
2.2.1 UMTS encoder and decoder architecture
2.2.2 BCJR algorithm
2.2.2.1 Log-BCJR algorithm
2.3 EXIT chart analysis
2.4 Fixed-point representation in a Turbo decoder

3 Optimal Data-width Settings for Fixed-point Implementation
3.1 Introduction
3.2 Fixed-point EXIT chart analysis of UMTS Turbo Decoder
3.3 Simulation and Analysis Results
3.3.1 Comparison between different Logarithm methods
3.3.2 Comparison and Analysis in Fixed-point simulation
3.3.2.1 Wrapping Technique
3.3.2.2 Saturation Technique
3.3.2.3 Normalisation Technique
3.3.2.4 Final validation

4 Energy Estimation Decoding Algorithm
4.1 Introduction
4.2 Previous works
4.3 A framework for quantifying the energy consumption of a Turbo-like decoder
4.3.1 Level 1 of the framework
4.3.2 Future work: Level 2 of the framework

5 Conclusions and Further Works

Bibliography


List of Figures

1.1 A typical BAN architecture.
1.2 The two ways of concatenating Turbo-like codes.
1.3 The two decoding schemes of the two types of Turbo-like codes.

2.1 Transmission scheme of serially concatenated codes.
2.2 A typical BER chart for Turbo codes.
2.3 Performance comparison of a Turbo code and a convolutional code [?].
2.4 A classical Turbo encoder.
2.5 A classical Turbo decoder.
2.6 A classical SC decoder.
2.7 Scheme of the UMTS Turbo encoder.
2.8 Scheme of the convolutional encoder and the trellis diagram.
2.9 Trellis diagram of a transition sequence.
2.10 An example transition sequence.
2.11 Scheme of the UMTS Turbo decoder.
2.12 An example trellis of a short terminated trellis code.
2.13 Scheme of the EXIT chart generation.
2.14 One EXIT curve I(ae) = F(I(aa)) of the UMTS Turbo code using BPSK to transmit over an AWGN channel having an SNR of -4 dB.
2.15 EXIT chart of the UMTS Turbo decoder.
2.16 The decoding trajectories in the EXIT chart.

3.1 Correction function.
3.2 A possible accumulation route in the trellis.
3.3 Example of difference calculation in two's complement representation.
3.4 Three different Turbo codes in previous works.
3.5 EXIT chart of different log algorithms.
3.6 EXIT chart of different fraction lengths.
3.7 Scheme of the UMTS Turbo decoder.
3.8 EXIT chart of different integer lengths with the wrapping technique - 1.
3.9 EXIT chart of different integer lengths with the wrapping technique - 2.
3.10 EXIT chart of different integer lengths with the wrapping technique - 3.
3.11 EXIT chart of different integer lengths with the wrapping technique - 4.
3.12 EXIT chart of different integer lengths with the saturation technique.
3.13 EXIT chart of different integer lengths with the normalisation technique.
3.14 Simulation results of 5114-bit block length in fixed-point with normalisation and floating-point.
3.15 Simulation results of 40-bit block length in fixed-point with normalisation and floating-point.
3.16 Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with normalisation and floating-point.
3.17 Simulation results of 5114-bit block length in fixed-point with the wrapping technique and floating-point.
3.18 Simulation results of 40-bit block length in fixed-point with the wrapping technique and floating-point.
3.19 Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with the wrapping technique and floating-point.

4.1 The dependencies between the different stages.
4.2 Flowchart of the energy estimation framework.


List of Tables

1.1 Data rate requirements of different applications in BANs

2.1 Different representation methods for integer numbers
2.2 Two's complement representation method for fractional numbers

3.1 Different representation methods for integer numbers


Acknowledgements

I would like to express my gratitude to all those who helped me to complete this report. I would like to give my special thanks to my supervisors, Dr. Rob Maunder, Prof. Bashir M. Al-Hashimi and Prof. Lajos Hanzo. Their insightful guidance, directions and wise advice allowed this work to become a reality. I am deeply indebted to Dr. Rob Maunder; his stimulating suggestions and encouragement helped me throughout the research for and writing of this report.

Many thanks also to my colleagues and the staff of the Communications Group and the Electronic System Design Group for the useful discussions and comments throughout my research. Special thanks to my colleagues Amit Acharyya and Dr. Jos Akhtman for their technical support.

I would also like to express my appreciation to my parents, who taught me all the good things that really matter in life. Especially, I would like to thank my girlfriend Nuofei Lu, whose patient love enabled me to complete this work.


Chapter 1

Introduction

A wireless sensor network (WSN) is a network composed of a number of sensor devices that are able to communicate with each other or with higher-level networks. A sensor node in a WSN consists of data sensing, data processing and communication components. The sensors are deployed either inside the phenomenon of interest or very close to it. Typically, one or more central devices are included in a WSN for collecting data from the sensors and communicating with higher-level networks. With the development of wireless communication technologies, WSNs are starting to play an important role in supporting networks of different scales that connect a person anywhere, at any time and with anybody. In this report, a promising application of WSNs, Body Area Networks (BANs), is introduced. The communication requirements of BANs are investigated based on a literature review. The candidate technologies for wireless communication in BANs are discussed, including the potential of using Turbo-like coding schemes in BANs, which is advocated in this report. Finally, the outline of the report is given.

1.1 Introduction of Body Area Networks (BANs)

Body Area Networks (BANs) or Wireless Body Area Networks (WBANs) are a promising application of short range wireless sensor networks (WSNs) in the healthcare industry. The basic scenario is to locate a number of wireless sensors on the human body for the continual monitoring of physiological parameters such as heart rate, ElectroCardioGram (ECG) data, ElectroEncephaloGraphy (EEG) data, blood pressure, body temperature, the levels of certain chemicals such as sugar, oxygen and medications in the blood, motion, etc. [1]. These parameters can be important or even life critical for some people or patients, such as the ageing population, chronic disease patients, and cerebrovascular and cardiovascular disease patients. Long-term monitoring and logging of the physiological parameters of such patients could help doctors to treat them or to discover risks earlier. The purpose of deploying BANs on such groups of people is to create a more comfortable, convenient and economical way to perform the required monitoring and logging missions, either in hospitals or in a home-based healthcare system. The concept of (W)BANs was first introduced by T. G. Zimmerman in 1996 [2]. Over the past few years, advancements in electronic systems and wireless technologies have enabled the development of small and intelligent medical sensors which can be attached to or implanted into the human body. The healthcare industry is becoming increasingly interested in using such technologies to develop practical BANs [3]. Hence, many topics in this research area are being widely researched, including energy harvesting, signal processing and communication. Recently the IEEE 802.15 working group established a study group on body area networks (802.15.TG6) to develop guidelines for using wireless technologies in BAN applications for various healthcare services. This report is focused on the wireless communication technology for BANs. Furthermore, the concept of BANs can be divided into two categories depending on the operating environment. One type mainly operates on the surface or in the vicinity of the human body, namely wearable BANs; the other operates inside the body, namely implantable BANs. Owing to the different operating environments, the communication requirements of these two types of application are different, which leads to different strategies for developing the communication technology in the networks. Most recent works and the IEEE 802.15.TG6 group target wearable BANs, which are also the focus of this report. In the rest of this report, the term BANs refers to wearable BANs.

1.2 Communication in BANs

To discuss the wireless communication issues in BANs, we start with the scenarios of BAN applications. A typical BAN scenario is given in Figure 1.1. It consists of a number of sensor nodes attached to the human body, performing different functions to collect different physiological parameters. A central device such as a PDA or a mobile phone takes the data from the sensor nodes via a wireless network formed between them, namely a BAN. The sensors are supposed to be as small as possible for reasons of comfort and convenience, so their functions are limited. The basic assumption is that the sensors only collect the data from the body, perform the necessary signal processing and transmit the data to the central device in real time. The central device then performs further processing of the data. It might connect to an external network for communicating with a higher-level system, such as a hospital, depending on the requirements of the particular scenario. The number of sensors required varies between applications. For one particular disease, usually only a few (< 3) sensor nodes are required [3]. However, for more complicated situations, more sensors might be required. In particular, when motion detection is involved, for example for people who need help recovering their mobility after treatment, sensors might be required on every movable part of the body, and more sensors mean more accurate motion detection. According to [1], typically no more than 20 sensors will be used for any one person.

Figure 1.1: A typical BAN architecture (sensor nodes and a central node).

Based on such a scenario for the applications of BANs, there are a couple of features that need to be considered carefully while developing communication technologies for BANs.

• Firstly, the resources on the sensors are limited because of their small size. The nodes need to be light, small, wireless and long-lived, since one of the purposes of BANs is to create a convenient healthcare system for patients without support from professionals. The wireless sensors on the human body need to be battery operated or powered by energy harvesting, and they are required to last a long time without any need for maintenance. This leads to an extremely demanding energy-efficiency requirement.

• Secondly, in this scenario, all the sensors and the central device are always around the patient, which leads to a very short coverage-range requirement for BANs.

• Thirdly, in this scenario, the data only needs to be transmitted from the sensors to the central node. Since the function of a sensor is simple and limited, there is almost no need for two-way communication in the system; one-way communication (from the sensors to the central node) is sufficient for BANs. Admittedly, in some scenarios there may be an overall energy saving from letting nodes listen to each other, and including relays in the network might require nodes to receive. Nevertheless, the basic assumption that no transmission from the central node to the sensors is required is still valid.

Based on this discussion of the basic BAN scenario, the communication requirements of BANs are discussed in the next section.

1.2.1 Communication requirements

To develop new technologies of a particular type of WSN, the basic requirements need

to be considered first. In this section, we will discuss the communication requirements

of BANs, which include frequency bands, network scale, communication range, data

rate, reliability, energy consumption, network topology and security issues.

1.2.1.1 Frequency conditions

Currently, there is no clearly defined frequency range that could be used for BANs. According to the latest news [4], the Federal Communications Commission (FCC) is considering several possible frequency bands for use by BANs:

• 2300-2305 MHz and 2360-2395 MHz Band: The 802.15.TG6 Group and GE Healthcare (GEHC) propose to use this band for BANs. However, this band is currently used by several other services, including Aeronautical Mobile Telemetry (AMT), federal radio location and amateur radio users. This can be a problem regarding interference and security. The FCC is considering the proposed potential use of these bands by BANs on a coexistence and non-interference basis.

• 2400-2483.5 MHz Band: This band is used by Industrial, Scientific and Medical

(ISM) equipment on a non-licensed basis under the FCC’s rules. The FCC seeks

comment on whether BANs could operate in this band under current rules or

whether new rules would be required to regulate BANs using this band.

• Other Frequency Bands: The FCC seeks comment on whether other frequency bands may be appropriate for BANs, including the 5150-5250 MHz band, which is now allocated for federal and non-federal aeronautical navigation, non-federal fixed-satellite use and Unlicensed National Information Infrastructure (U-NII) devices.

An alternative solution is to use Ultra-WideBand (UWB) technology, which is authorised for communication between 3.1 GHz and 10.6 GHz [5]. The potential and the advantages of applying UWB technology to BANs are discussed in [6].


1.2.1.2 Network scale and communication range

As discussed in Section 1.2, a BAN is a small-scale network which could include up to 20 sensor nodes. One important feature of BANs is that all the devices in the network are around a human body. This greatly limits the required communication range of the network. A widely agreed communication range for BANs is 2-5 meters [1, 7, 8], which is shorter than the coverage range of any existing WSN application. An obvious advantage of such a short communication range is that it directly leads to a low emission level from the transmitter. On the other hand, all the devices are in each other's vicinity, which could induce an interference problem.

1.2.1.3 Data rate

Most of the previous works agree that BANs require real-time, low data rate communication, but their detailed investigations of the data rate requirement differ. For example, the results from [3] and [7] are summarised in Table 1.1.

Healthcare applications B. Zhen’s work [3] H. Li’s work [7]

Heartbeat <0.1 kps 0.05 kpsBody temperature <0.1 kps 0.05 kps

Electrocardiogram (ECG) 2.5 kps 72 kpsElectroencephalography (EEG) 0.54 kps 131.1 kps

Electromyography (EMG) 1152 kpsBlood pressure <0.1 kps 0.05 kps

Blood sugar level <0.1 kpsBlood analysis 8.192 kps

Table 1.1: Data rate requirement of different applications in BANs

Note that for the applications with low data rate requirements, such as heartbeat, body temperature and blood pressure, the investigation results are quite close. However, for more complicated parameters, such as ECG and EEG, the results are very different. The reason could be different assumptions about the pre-processing of the signals before transmission; for example, if a sensor transmits compressed data rather than the raw data, the data rate requirement can be reduced. Among the previous works reviewed, the highest data rate assumption is given by [8], which claims that a data rate of up to 1 Mbps is required by BANs. This conclusion still gives a data rate requirement lower than that of any existing WSN application. For example, for Wireless Personal Area Networks (WPANs), another low-speed short-range WSN application, the required data rate is up to 10 Mbps. However, some previous works include video stream transmission in BAN scenarios, in which case the data rate requirement could increase to 100 Mbit/s [1].


1.2.1.4 Reliability, accuracy and latency

The required performance of BANs, including reliability, accuracy and latency, is still under discussion. Since the monitored signals are life critical, faulty data transmission or a delay of a few minutes may be fatal for the patient, so the performance requirements of BANs should be relatively high. Many previous works agree that delays and communication errors should stay within strictly defined limits in order to avoid disastrous behaviour [3, 9, 10]. According to the officially released report of the 802.15.TG6 group [11], a fast reaction of < 1 second with a reliability of 99.99% is expected, the latency should be < 250 ms and the jitter < 50 ms. The performance can be evaluated through the delay profile, the information loss rate, the bit error rate (BER) and the frame error rate (FER), which need to be considered carefully during the design of a communication system. In addition, when a human body moves, the sensors change position relative to each other, and when the environment changes, the channel conditions also change and affect the network performance. For BANs, the system must maintain its reliability under all possible conditions. Different states of the human body must be considered, such as walking, running and turning. Different environments will be crossed in practice, where multiple BANs will coexist or different sources of interference will be present, such as tunnels, subways and parks. The design of BANs must be prepared for all realistic scenarios.

1.2.1.5 Energy consumption

Energy consumption is a crucial issue for the sensor nodes in BANs due to their limited energy resources and the long lifetime requirement. As discussed in Section 1.2, the sensors in BANs have to be operated from batteries, or depend on energy harvesting, and must not need to be recharged frequently. Since the sensors are expected to be as small as possible, the energy resources that can be provided on them are extremely limited. This requires every function on the sensors to be energy efficient, including the wireless communication mechanism. In addition, a distinguishing feature of BANs is that the sensors are attached to the human body, whereas previous works suggested that, for safety, wireless devices should be separated from the human body by a distance of at least 30 cm [3]. Hence, extremely low transmission power is required to protect human tissue. It is widely agreed that this low power requirement is one of the most challenging issues in developing BANs [1, 3, 12]. Because of the small separation distances and the special requirement of protecting human tissue, BANs are not covered by existing wireless standards [1].


1.2.1.6 Network topology

Because of the small network scale and the one-way communication assumption, a star topology is a natural solution for BANs [13, 14]. The advantages of a star topology are its simple architecture and the fact that the system complexity is concentrated in the central node, which are suitable features for BANs. However, the devices are located on a human body that may be in motion, so BANs should be robust against frequent changes in the network topology. Moreover, human bodies strongly attenuate RF signals [10]. Both of these reasons lead to benefits from using a multi-hop network topology in BANs. In addition, using multi-hop transmission instead of direct transmission can achieve lower energy consumption in communication [12]. Hence, many recent works have focused on multi-hop network solutions for BANs [15, 16].

1.2.1.7 Security

Security is a major issue in medical applications [8]. Safety and privacy concern all involved parties, including doctors, nurses, patients, administrative personnel and medical service providers. Devices also need authentication for security, and interference from external devices as well as intentional attacks must be considered as safety issues. On the other hand, due to the limited resources on the sensors and the user-friendliness requirement, the system must be simple: the insertion and removal of a node in a BAN must be easy for the user.

1.3 Candidate options for Body Area Networks

Based on the discussion in Section 1.2.1, the distinctions of BANs from other networks are their shorter communication range and their extremely low power, high accuracy and high reliability requirements. Since existing WSN technologies are not suitable for BANs, a new IEEE study group, 802.15.TG6, is working on developing a new standard and protocol for BAN applications. For creating a new protocol suitable for BANs, there is a choice between defining a new PHY/MAC and evaluating and improving currently available or emerging technologies. Some papers suggest modifying existing standards [1, 3]. For low power purposes, a possible approach is to scale down existing standards, for example by turning down the transmit power or introducing a duty cycle mechanism [1]. IEEE 802.15.4a/b is a popular standard to be evaluated or improved for BAN applications in previous works [8, 17-19]. Some other existing standards, such as the Medical Implant Communication Service (MICS) and the Wireless Medical Telemetry Service (WMTS), have also been suggested [20]. Based on an investigation of previous works on using existing standards in BANs [8, 17-23], we found that most of these works focus on evaluating the performance of the standards in BAN applications. Despite the importance of the low power requirement of BANs, the power issue has not been addressed much in these works. As pointed out in [23], the energy consumption of the candidate standards is hard to investigate because the platforms currently available for evaluating the existing standards are not designed with the low power techniques appropriate for extremely low power applications such as BANs. Therefore, it is not fair to evaluate the energy consumption based on these platforms, since an existing standard could be further scaled down, for example by turning down the transmission power or introducing a duty cycle mechanism, for low power purposes, as suggested in [1]. However, scaling down a standard must lead to a degradation in performance. For example, although some previous works claim that the 802.15.4 standard provides a sufficient performance for BAN applications, it cannot be guaranteed that this performance can be maintained after such scaling-down techniques are applied to the standard.

Another option is to define a new standard for BANs, since the existing low power standards cannot meet the ultra-low power requirement of BANs. How to further scale down the energy consumption of a WSN while keeping its high performance becomes a challenge. Two problems challenge the performance of the communication in BANs. One is that the transmission power of the sensors should ideally be as low as possible in order to protect human tissue. The other is that human tissue is composed primarily of water molecules, which tend to absorb RF energy [24]. These problems induce a requirement for a proper Error Correction Code (ECC) in the channel coding scheme in order to maintain high reliability and accuracy under such difficult conditions. However, owing to the reduction of the transmission power in the system, the energy consumed by channel coding could contribute a larger fraction of the total, which makes a low power design of the channel coding scheme desirable. To overcome this challenge, we investigate the potential of using Turbo-like codes in the communication of BANs. The novel way of applying Turbo-like codes in BANs and its advantages are discussed in Section 1.3.1.

1.3.1 Turbo-like Codes

We will give an overall discussion of the Turbo principle and Turbo-like codes in the next chapter. In this section, for the purpose of discussing the potential advantages of applying Turbo-like codes in the channel coding scheme of BANs, we give a brief introduction to some distinctive features of Turbo-like codes.

Turbo-like codes refer to a type of ECC that includes two component codes in one coding scheme. The concatenation of the component codes in the encoding process can be parallel or serial, and an interleaver is used between the component encoders. The two types of concatenation in Turbo-like codes are illustrated in Figure 1.2. The success of Turbo-like codes comes from the iterative decoding process they introduce in order to approach the best decoding result. The two decoding schemes corresponding to the two encoding schemes are given in Figure 1.3. An iterative decoding process is performed between the two concatenated decoders by feeding the decoded results back to each other's input. Under such a scheme, the decoded result improves in each iteration, until the best result is achieved after a certain number of iterations. The details of Turbo-like codes are given in the next chapter.

Figure 1.2: The two ways of concatenating Turbo-like codes (parallel and serial).

Figure 1.3: The two decoding schemes of the two types of Turbo-like codes (parallel and serial).

The advantages of Turbo-like codes are their high reliability and their near-Shannon-capacity performance [25, 26], which could conquer the harsh transmission conditions in BANs. However, their disadvantage is the relatively high complexity of the decoding scheme. Turbo-like codes are usually not considered appropriate for low power communication, since the iterative decoding process consumes a lot of energy [27]. However, the one-way communication assumption in BANs provides an opportunity to apply Turbo-like codes in their communication system. In contrast to their complicated decoding schemes, the encoding schemes of Turbo-like codes are simple and lend themselves to a low power implementation. Therefore, in a star topology network, the decoding scheme does not need to be implemented on the sensors, owing to the one-way communication assumption. Although it does need to be implemented on the central node, in BANs the central node usually has sufficient energy resources, since it is usually a much larger device than the sensors, such as a PDA or a smart mobile phone. Hence, Turbo-like codes are naturally suited to a star topology BAN. Furthermore, as we discussed, a multihop network is necessary in some BAN scenarios. In a multihop network, relays are used to reduce the transmission distance, so the transmission power of each node can be reduced, which is an especially desirable feature for BANs. On the other hand, extra receiving, transmitting and coding processes are induced on the relays, which might increase their overall energy consumption and reduce their lifetime. From the channel coding point of view, one way to avoid inducing an extra coding process on the relays is to simply amplify and forward the received signals without any other processing. However, this is not a desirable solution, since the noise in the transmission is also amplified at the relay, which might increase the decoding error rate at the central node and degrade the communication performance. As an alternative solution, we propose a novel decoding scheme for the relay which includes fewer decoding iterations and transmits the resulting sub-optimal decoding result. It may be possible to find a balance point at which, with a certain number of decoding iterations on the relays, the energy consumption of the extra coding process and the overall communication performance are both acceptable, while the transmission power is reduced by the multihop mechanism in the network.

To further research the proposed novel decoding scheme, we need to investigate the energy consumption of low power implementations of Turbo-like codes, such as fixed-point ASIC implementations. In this report, we propose a method to determine the optimal data width specification for a fixed-point implementation of a Turbo-like code from a low power point of view. As discussed in Section 1.3, the likely energy consumption of a communication standard or a coding scheme for low power applications such as BANs is hard to evaluate. In this report, we therefore also propose a framework to compare and evaluate the likely energy consumption of different Turbo-like decoding schemes. To sum up, in this report we discuss the possibility of introducing Turbo-like codes into BAN communication systems, introduce the related requirements for a method of evaluating the energy consumption of a Turbo-like decoding system, and explore the data width specification, an important issue in the low power realisation of Turbo-like decoding algorithms in fixed-point implementations. The outline of the report is given in the next section.

1.4 Outline of the report

The later chapters in this report are organised as follows.

• Chapter 2 is a background chapter. A full introduction to the Turbo principle and Turbo-like codes is presented. The fixed-point implementation issues of hardware designs for the algorithms are also introduced.

• Chapter 3 proposes a method to determine the optimal data width specification in

fixed-point implementations of Turbo-like codes from a low power point of view.

• Chapter 4 proposes a framework to evaluate and compare the different Turbo-like

codes from the energy consumption point of view. Part of the framework is the

future work of the project, which is also discussed in this chapter.


• Chapter 5 gives the conclusion of this report.


Chapter 2

Turbo-like Code Solutions in

BANs

In this chapter, we introduce the background information for this report. It includes a brief introduction to the Turbo principle and its encoding and decoding algorithms, the EXIT chart analysis tool and the basic theory of fixed-point representation in hardware implementations.

2.1 Introduction

The Turbo principle is a concept in error correcting codes (ECCs) that involves iterative decoding processes, also referred to as turbo decoding processes, such as serial or parallel concatenated codes [25, 28] and LDPC codes [29]. A unique feature of Turbo-like codes is that two or more component codes are concatenated in the scheme. Such codes are called concatenated codes and were first proposed by [30]. The first version of concatenated codes was serially concatenated (SC) codes, which include two or more component codes concatenated in a serial structure. A famous example consists of a Reed-Solomon code [31] as the outer code (applied first and removed last) followed by a convolutional code [32] as the inner code (applied last and removed first) [33]. In the early concatenated coding schemes, despite including two or more component codes, there was no iterative decoding process in the decoder: the decoder generated hard decisions (i.e. the determined bit results) directly. In a communication receiver, the demodulator usually produces soft decisions in the demodulation process. Soft decisions correspond to hard decisions, but instead of giving the decoded bit results, they are reliability information expressed by the a posteriori probability of each bit. Soft decisions express not just what the most likely value of a bit is, but also how likely it is, while hard decisions only express the former. Before the Turbo principle was discovered, a typical decoder utilised the soft decisions in the decoding process and generated hard decisions at its output. Such a decoder can be called a Soft-in Hard-out (SIHO) decoder. Therefore, a straightforward way of decoding the SC codes involves the concatenated use of a SIHO decoder as the inner decoder and a HIHO (Hard-in Hard-out) decoder as the outer decoder. If a convolutional encoder is concerned, a Viterbi Algorithm (VA) [34] decoder is used at the corresponding place to give hard decisions.
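As a concrete illustration of the difference between hard and soft decisions, the following small Python sketch (added for illustration; it is not part of the original report) generates soft decisions in the form of log-likelihood ratios (LLRs) for BPSK transmission over an AWGN channel and derives the corresponding hard decisions. The expression LLR = 2y/sigma^2 is the standard result for this channel and is assumed here rather than quoted from the report.

import random

# Illustrative sketch (not from the report): soft versus hard decisions for
# BPSK over an AWGN channel. For x in {+1, -1} and y = x + n with
# n ~ N(0, sigma^2), the soft decision (LLR) of each bit is 2*y/sigma^2.

def transmit(bits, sigma):
    return [(1.0 if b else -1.0) + random.gauss(0.0, sigma) for b in bits]

def soft_decisions(received, sigma):
    # The sign indicates the most likely bit, the magnitude its reliability.
    return [2.0 * y / sigma ** 2 for y in received]

def hard_decisions(llrs):
    # Keeps only the sign and discards the reliability information.
    return [1 if llr > 0 else 0 for llr in llrs]

bits = [random.randint(0, 1) for _ in range(10)]
llrs = soft_decisions(transmit(bits, sigma=0.8), sigma=0.8)
print(bits, hard_decisions(llrs))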

As discussed in [35], the first drawback of such a structure is that the inner decoder generates hard decisions, thus preventing the outer decoder from utilising its ability to accept soft decisions at its input. The second drawback is that if the inner decoder produces a continual error sequence, the outer decoder is unable to correct the errors. The second drawback can be overcome by inserting an interleaver between the inner and the outer encoder and, correspondingly, a deinterleaver between the inner and the outer decoder. The function of an interleaver is to rearrange the order of a sequence in a pseudo-random way. The function of a deinterleaver, with knowledge of the rearranging method of the corresponding interleaver, is to restore the order of an interleaved sequence. Thus, a continual error sequence from the inner decoder becomes dispersed in the input to the outer decoder. The transmission scheme is shown in Figure 2.1.

Figure 2.1: Transmission scheme of serially concatenated codes (input bits, outer encoder, interleaver, inner encoder, channel, inner decoder, deinterleaver, outer decoder, decoded bits).

However, if errors occur at the output of the outer decoder, these remain in the final decoding results. A Turbo-like code can be considered a refinement of the concatenated encoding schemes with an improved decoding process including iterative algorithms. The concept of turbo decoding is, for a system with two component codes, to pass soft decisions from the output of one decoder to the input of the other decoder, and to iterate this process many times to produce more reliable decisions.

it required that the two decoders feed soft decisions to each other. It is because using

hard decisions as an input of a decoder degrades system performance compared with soft

decisions [36]. Therefore, it requires Soft-in Soft-out (SISO) decoders for the decoding

of each component code. The introduction of Turbo codes in [25] is also the first time

of introduction of parallel concatenation (PC) codes. It was reported that the scheme

can achieve a bit error rate (BER) of 10−5 using a rate 1/2 code over an additive white

Gaussian noise (AWGN) channel and BPSK modulation at an Eb/N0 of 0.7 dB [25,26].

According to the discussion in [25,26], the Shannon capacity for a binary modulation is

the error probability Pe = 0 (Pe = 10−5 can be taken as a reference here) for Eb/N0 =

0 dB. Hence the performance is 0.7 dB from Shannon capacity. Most importantly, with

the iterative decoding scheme, the complexity of a Turbo decoder is much less than

that of a non-iterative decoder having the same performance. According to [37], the

complexity required to allow the earlier codes to approach the Shannon capacity would

Page 27: Liang Li Nine Month Report

Chapter 2 Turbo-like Code Solutions in BANs 15

be not feasible to implement. The discovery of Turbo codes has revolutionised the field of

error correcting codes since it first time achieved the performance very close to Shannon

capacity in practice.

To evaluate the performance of a Turbo or Turbo-like code, a Bit Error Rate (BER) chart is a commonly used tool. A typical BER chart of a Turbo code looks like Figure 2.2. The Y axis is the BER of the decoding result after a certain number of decoding iterations and the X axis is Eb/N0, where Eb is the energy per bit and N0 is the noise power spectral density (i.e. the noise power in a 1 Hz bandwidth). As shown in Figure 2.2, a typical Turbo code can achieve a very low BER once Eb/N0 reaches a certain point. In the figure, the point at which the BER curve starts to decrease is called the threshold Eb/N0, the region where the BER curve falls fast is called the turbo cliff region and the region where the BER curve is flat at a very low value is called the error floor region.

Figure 2.2: A typical BER chart for Turbo codes (BER versus Eb/N0, showing the threshold Eb/N0, the turbo cliff and the error floor).

To understand how Turbo codes outperform the earlier coding schemes, we quote Figure 2.3 from [?]. It shows simulation results of the original rate R = 1/2 turbo code presented in [25] and a maximum free distance (MFD) R = 1/2, memory ν = 14 (2, 1, 14) convolutional code with Viterbi decoding. The simulation results show that the Turbo code outperforms the convolutional code by 1.7 dB at a BER of 10^-5. The comparison is striking, especially since a detailed complexity analysis reveals that the complexity of the Turbo decoder is much smaller than that of the Viterbi decoder used for the convolutional code.

A classical Turbo encoder is composed of two recursive systematic convolutional (RSC) encoders, as shown in Figure 2.4. The input information sequence is encoded twice by the two RSC encoders. The first encoder processes the information in its original order, while the second encoder processes the same sequence in a different order obtained by an interleaver. In this scheme the systematic bit sequence is also transmitted to the decoder. As shown in the figure, sequences c and d are the outputs of the two encoders, sequence a is the systematic bit sequence and b is the interleaved systematic bit sequence. Note that only a is transmitted, since b can be obtained by an identical interleaver at the decoder.

Figure 2.3: Performance comparison of a Turbo code and a convolutional code [?].

Figure 2.4: A classical Turbo encoder.

In the decoding process, as shown in Figure 2.5, two a posteriori probability

(APP) decoders are used, one for each of the two convolutional encoders in the encoding scheme, to obtain the minimal bit error probability. In the figure, a, c and d are the soft decision sequences, obtained by the demodulator, corresponding to the output sequences a, c and d in Figure 2.4. The purpose of an APP decoder is to compute a posteriori probabilities of either the information bits or the encoded symbols. Its application in Turbo-like codes has made it the major representative of the SISO decoders. The algorithm was originally invented by Bahl, Cocke, Jelinek and Raviv in 1972 and is therefore called the BCJR algorithm [38]. Its capability of generating soft decisions is well suited to iterative decoding schemes. In Figure 2.5, the two decoders work alternately in an iterative way. To get the correct order of the input sequences, an interleaver identical to the one used in the encoding scheme and a corresponding deinterleaver are used between the decoders. An extra interleaver is used to provide the systematic sequence for both decoders. The main advantage of this decoding process compared with using VA decoders is that it utilises the ability of the decoders to accept soft decisions at their input. However, in iterative decoding schemes, the information provided to one decoder by the other is extrinsic information, not a posteriori information. The extrinsic information represents the new information obtained by a decoder. The reason for using extrinsic information is to prevent the decoding scheme from becoming a positive feedback amplifier [35]. As shown in Figure 2.5, the a priori information from the systematic sequence is added to the input of the decoders; since the a posteriori information already includes the a priori information from the previous decoding step of the other decoder, feeding it back directly would create a positive feedback loop. By using extrinsic information instead of a posteriori information, this problem is avoided. Therefore, the output of the decoders in Figure 2.5 is extrinsic information. It can be obtained by a simple subtraction of the a priori information from the a posteriori information.
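The overall data flow of this exchange can be summarised by the following Python-style sketch (a structural illustration only, not the report's implementation). The function siso_decode is a stand-in for a real APP/BCJR decoder, perm is an arbitrary interleaver pattern, and the subtraction of the a priori input from the a posteriori output is the extrinsic-information step just described.

def siso_decode(sys_llrs, par_llrs, apriori_llrs):
    # Placeholder SISO decoder: a real implementation would run the BCJR
    # algorithm over the code trellis. Only the data flow matters here.
    return [s + a for s, a in zip(sys_llrs, apriori_llrs)]

def turbo_decode(sys_llrs, par1_llrs, par2_llrs, perm, iterations=8):
    n = len(sys_llrs)
    extrinsic2 = [0.0] * n                                    # nothing known before the first pass
    for _ in range(iterations):
        # Decoder 1 works in natural order; its a priori input is decoder 2's
        # extrinsic output, deinterleaved back into natural order.
        apriori1 = [extrinsic2[perm.index(i)] for i in range(n)]
        app1 = siso_decode(sys_llrs, par1_llrs, apriori1)
        extrinsic1 = [p - a for p, a in zip(app1, apriori1)]  # subtract the a priori part
        # Decoder 2 works in interleaved order.
        sys_perm = [sys_llrs[perm[i]] for i in range(n)]
        apriori2 = [extrinsic1[perm[i]] for i in range(n)]
        app2 = siso_decode(sys_perm, par2_llrs, apriori2)
        extrinsic2 = [p - a for p, a in zip(app2, apriori2)]
    # Final hard decisions, deinterleaved back into natural order.
    return [1 if app2[perm.index(i)] > 0 else 0 for i in range(n)]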

Alternatively, it can be generated directly by a modified BCJR algorithm. By receiving the new extrinsic information from the other decoder, the reliability of the decoding increases in each iteration. The whole decoding process stops when the required reliability is reached or when no further reliability can be gleaned.

Figure 2.5: A classical Turbo decoder.

In practice, the modified BCJR algorithm avoids the final subtraction operations, which is more suitable for iterative decoding schemes. Moreover, a further improved version of the BCJR algorithm, called the Log-APP or Log-BCJR algorithm, is a version of the BCJR algorithm transferred into the logarithmic domain. Its purpose is to avoid the mass of multiplication operations in the BCJR algorithm and, more importantly, the Log-BCJR algorithm has variables with a much more manageable dynamic range than those of the BCJR algorithm, reducing the memory requirement and allowing fixed-point processing to be used. Since it avoids the complex circuit implementation caused by the many multipliers required by the original BCJR algorithm and requires much less memory, the Log-BCJR algorithm is widely used in practice. Hence, since in this report we only investigate the application of the BCJR algorithm in iterative decoding schemes, the algorithm we discuss and simulate is the modified Log-BCJR algorithm that generates the extrinsic information directly. We will discuss the details of the algorithm in the next section.
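The essence of working in the logarithmic domain can be illustrated by the Jacobian logarithm, often written as the max* operation. This is a standard identity shown here for illustration rather than something taken from the report: sums of exponentials (i.e. sums of probabilities) become a maximisation plus a bounded correction term, while products become simple additions.

import math

# The Jacobian logarithm ("max*") used by Log-MAP / Log-BCJR decoders:
# ln(e**a + e**b) = max(a, b) + ln(1 + e**(-|a - b|)).
# The correction term is bounded by ln(2) and can be stored in a small
# lookup table, or dropped entirely, giving the cheaper Max-Log approximation.

def max_star(a, b):
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a, b):
    return max(a, b)  # correction term omitted

a, b = 1.2, 0.7
print(math.log(math.exp(a) + math.exp(b)), max_star(a, b), max_log(a, b))
# max_star matches the exact value; max_log differs by the correction term.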

The Turbo principle can also be applied to SC codes, which become another primary category of Turbo-like codes, serially concatenated convolutional codes (SCCC) [28]. Instead of using the decoding scheme of Figure 2.1, a scheme similar to the Turbo decoder is used, as shown in Figure 2.6. The two SIHO decoders are replaced with SISO decoders, and an interleaver and a deinterleaver are required to form the iterative decoding loop. According to [35], serial Turbo codes perform better than parallel Turbo codes in the error floor region. On the other hand, in the turbo cliff region, parallel Turbo codes perform better at the same overall coding rate.

Figure 2.6: A classical SC decoder.

2.2 Turbo codes and BCJR decoding algorithm

Turbo-like codes generally have a simple encoding scheme and a relatively complicated decoding scheme. In this section, we use the 3rd Generation Partnership Project (3GPP) Universal Mobile Telecommunications System (UMTS) Standard [39] as an example to introduce a typical Turbo coding scheme, the convolutional code it includes and the SISO decoding algorithm for convolutional codes, the BCJR algorithm. The UMTS Turbo code and the BCJR algorithm are also used as examples to present my work in this report, so the descriptions in this section are referred to in later chapters.

2.2.1 UMTS encoder and decoder architecture

To simplify the description, we assume that BPSK modulation is used, so each transmitted symbol is a bit. For other modulation methods, the transmitted bits discussed here would be replaced by transmitted symbols. According to [39], the constituent RSC encoder of the UMTS Turbo code is a rate R = 1/2 convolutional code with constraint length K = 4 and memory m = 3. Two such identical encoders form a rate 1/3, 8-state PCCC, as illustrated in Figure 2.7. In the RSC encoder, the three memory bits form an 8-state finite-state machine (FSM). We use the notation Na to represent the block length of the encoded sequence a. Before the encoding of the bit sequence a commences, the shift registers of each constituent convolutional code are initialised to a state that is known to the receiver.

Figure 2.7: Scheme of the UMTS Turbo encoder.

Typically, the m = 3 memory elements of each shift register are

initialised with logic-zeros, placing them in what is referred to as the “all zeros” state.

However, following the encoding of the Na bits in the sequence a, the shift registers will

enter states that are not inherently known to the receiver. A number of techniques have

been proposed to cope with this [35]:

• No termination: In this case, in the decoding process, the end of a block sequence is considered to have an equal probability of being in each possible state. No information about the final state needs to be provided. The decoding process is then less effective for the last encoded data and the performance may be reduced. The degradation is a function of the block length; however, for some applications the degradation may be acceptable.

• Termination: This method appends several extra bits at the end of each block sequence to force the encoder to return to the "all zero" state. The UMTS Turbo code of Figure 2.7 is one example of this technique. The extra tail bits also need to be sent to the decoder. This method overcomes the uncertain final state issue but introduces two further drawbacks. Firstly, extra redundant information is added to the transmission; nevertheless, the redundancy is negligible except for very short blocks and it is useful for error correction. Secondly, for parallel codes, the tail bits are not identical for the two constituent codes, which means that in the iterative decoding process the extrinsic information of the tail bits cannot be exchanged between the decoders. Hence, the data at the end of the block sequence benefits less from the Turbo decoding process. The SCCC has a similar problem.

• Tail-biting: [40] introduced a technique that allows any state of the encoder to act as the initial state. This method involves a double encoding process. Firstly, a normal encoding of the sequence starting from the "all zero" state is performed, but the output of the encoder is ignored and only the final state of the encoder is stored. Secondly, the encoding process is performed again in order to actually generate the output; in this step, the initial state is a function of the final state previously stored. The result of this process is that the final state of the encoder is equal to its initial state. The advantage of this method is that no extra bits have to be added and transmitted. However, the double encoding process is its main drawback. In addition, it only works for convolutional codes, for which the BCJR algorithm is especially well adapted.

In the UMTS Turbo code, the termination technique is used, as shown in Figure 2.7. The initial states of the shift registers are all set to zero when starting to encode a bit block a. Note that after encoding the Na bits of the source sequence a, the two switches in the figure switch down to form a closed loop in the two encoders. Following this, m = 3 bits are encoded in order to reset the contents of each shift register to the "all zero" state. The output of the Turbo encoder is a, c, e, d and f, where a is the systematic bit sequence, c and d are the encoded bit sequences of the two encoders, respectively, and e and f are the termination sequences of the two encoders, respectively. The termination is performed by taking the tail bits from the shift register feedback after all information bits have been encoded. It takes m bits to force the final state back to the "all zero" state for each encoder. Therefore, in the case where a comprises Na bits, c and d will comprise Nc = Nd = Na + m bits, while e and f will comprise Ne = Nf = m bits. In the UMTS Standard, the possible block length of the Turbo code (i.e. the length of the bit sequence a) is Na ∈ [40, 5114]. For the interleaved sequence b, the length is Nb = Na. The termination bits e and f have a length equal to the number of memory bits in the RSC encoders, Ne = Nf = m = 3. Consequently, for the encoded sequences c and d, we have Nc = Nd = Na + 3. Note that the additional termination bits (e and f) make the coding rate R of the encoder lower than 1/3, namely R = Na/(Na + Nc + Nd + Ne + Nf) = Na/(3Na + 4m).
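As a quick sanity check of these lengths (added for illustration; it is not part of the original report), the following Python snippet evaluates the number of transmitted bits and the effective code rate R = Na/(3Na + 4m) for the shortest and the longest UMTS block lengths.

# Illustrative check of the UMTS sequence lengths and effective code rate
# R = Na / (3*Na + 4*m) for the terminated rate-1/3 PCCC described above.

M = 3  # memory of each constituent RSC encoder

def umts_lengths(Na):
    Nc = Nd = Na + M                 # each parity stream gains m termination parities
    Ne = Nf = M                      # tail (termination) bits of each encoder
    total = Na + Nc + Nd + Ne + Nf   # all transmitted bits
    return total, Na / total         # (transmitted bits, effective code rate)

for Na in (40, 5114):                # shortest and longest UMTS block lengths
    total, rate = umts_lengths(Na)
    print(Na, total, round(rate, 4)) # 40 -> 132 bits, R = 0.303; 5114 -> 15354 bits, R = 0.3331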

To understand the operation of the FSM, the state transition can be shown as a trellis

diagram in Figure 2.8. an, cn and end are the input sequence and the output sequence.

S1, S2 and S3 are the current state of the three memory bits in the encoder. S+1 , S+

2

and S+3 are the next state of the three memory bits. The transition of the states and

the decoding results can be expressed as the following equations.

• For encoding bits:

S+1 = ak ⊕ S2 ⊕ S3 (2.1)

S+2 = S1 (2.2)

S+3 = S2 (2.3)

ck = S+1 ⊕ S1 ⊕ S2 (2.4)

Page 33: Liang Li Nine Month Report

Chapter 2 Turbo-like Code Solutions in BANs 21

[Figure 2.8: Scheme of the convolutional encoder and the trellis diagram. Left panel: transition trellis for encoding bits (branch labels ak/ck); right panel: transition trellis for termination bits (branch labels ek/ck). State1 to State8 correspond to the memory contents S1S2S3 from 000 to 111.]

• For termination bits:

S+1 = 0 (2.5)

S+2 = S1 (2.6)

S+3 = S2 (2.7)

ek = S2 ⊕ S3 (2.8)

ck = 0 ⊕ S1 ⊕ S3 (2.9)

The eight possible states correspond to State1 to State8 as shown in the figure. The trellis diagrams give all the possible transitions of the FSM: the left diagram shows the transitions for the encoding bits of a sequence, and the right diagram shows the transitions for the termination bits. Note that the first state of a transition sequence is the "all zero" state, which is State1 in the figure. With the termination technique, the last state of the sequence is also forced back to State1. As a result, the possible transitions in the first three steps and the last three steps are limited. A transition trellis diagram of a complete transition sequence is shown in Figure 2.9. The input sequence is an, and the output sequences cn and en can be obtained by tracking the state transitions in Figure 2.9. For instance, for a 5-bit input sequence a = [0, 1, 1, 0, 1], the transitions in the trellis are shown in Figure 2.10. Note that there are 8 steps in the trellis, since the three termination bits are included. The encoded bit sequence is c = [0, 1, 0, 0, 1, 0, 1, 1] and the actually transmitted systematic bit sequence is [a, e] = [0, 1, 1, 0, 1, 0, 0, 1]. The trellis diagram is not only helpful for understanding the encoding operations of a convolutional code, but also useful for explaining the BCJR decoding algorithm, as we shall discuss later.


[Figure 2.9: Trellis diagram of a transition sequence: the Na encoding transitions a1/c1 ... aNa/cNa are followed by the three termination transitions e1/cNa+1, e2/cNa+2 and e3/cNa+3 between the eight states.]

[Figure 2.10: An example transition sequence; the traversed branches, labelled yn/cn, are 0/0, 1/1, 1/0, 0/0, 1/1, 0/0, 0/1, 1/1.]
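As an illustration of these equations, the following minimal Python sketch (our own, not part of the UMTS specification) implements the encoding and termination equations (2.1)-(2.9) for one RSC constituent encoder and reproduces the example sequence above.

def rsc_encode(a):
    """Encode bit sequence a with the UMTS RSC constituent code and terminate
    the trellis, following equations (2.1)-(2.9).  Returns (c, e): the parity
    sequence (length Na + 3) and the three termination (tail) input bits."""
    s1 = s2 = s3 = 0                      # shift register starts in the "all zero" state
    c = []
    # Encoding phase, equations (2.1)-(2.4)
    for ak in a:
        s1_next = ak ^ s2 ^ s3            # (2.1): feedback
        ck = s1_next ^ s1 ^ s3            # (2.4): parity output
        c.append(ck)
        s1, s2, s3 = s1_next, s1, s2      # (2.2), (2.3): shift
    # Termination phase, equations (2.5)-(2.9): the tail bits come from the feedback
    e = []
    for _ in range(3):
        ek = s2 ^ s3                      # (2.8): tail input bit
        ck = 0 ^ s1 ^ s3                  # (2.9): parity during termination
        e.append(ek)
        c.append(ck)
        s1, s2, s3 = 0, s1, s2            # (2.5)-(2.7)
    assert (s1, s2, s3) == (0, 0, 0)      # encoder driven back to the all-zero state
    return c, e

# The 5-bit example from the text: a = [0, 1, 1, 0, 1]
c, e = rsc_encode([0, 1, 1, 0, 1])
print(c, e)                               # c = [0, 1, 0, 0, 1, 0, 1, 1], e = [0, 0, 1]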

The architecture of the decoder is shown in Figure 2.11. A data transmission loop is formed between decoder 1 and decoder 2 to realise the iterative decoding process. Each iteration consists of two half iterations, one for each constituent RSC code. The two decoders operate alternately, since the input of one decoder includes the output of the other decoder from the previous half iteration. The operation of the RSC decoder (i.e. the BCJR algorithm) is described in Section 2.2.2. In the figure, the input of the decoding scheme is assumed to be in soft decision form, which implies that the channel gain and noise variance have been properly taken into account.


[Figure 2.11: Scheme of the UMTS Turbo decoder. The channel input is demultiplexed into the soft decisions ac, cc, dc, ec and fc; Decoder 1 and Decoder 2 exchange extrinsic information (ae, be) through the interleaver π and deinterleaver π−1, forming the uncoded inputs ya and za, and the a posteriori output ap is produced at decoder 1.]

The five inputs ac, cc, dc, ec and fc are the soft decisions corresponding to the encoder outputs a, c, d, e and f. Each decoder receives two information sequences. One is the soft decisions of the encoded sequence, which is received from the transmission channel directly (cc for decoder 1 and dc for decoder 2). The other is the uncoded input information from the other decoder, which is formed by adding the a priori information provided by the other decoder to the received systematic information. The a priori information is the extrinsic information generated by the other decoder, after its order has been rearranged by the appropriate interleaver (π) or deinterleaver (π−1). For decoder 1, the uncoded input ya is the sum of aa and ac, followed by ec, as shown in the figure. Because the two encoders have independent tails, the soft decisions of the tail bits are not passed between the decoders, so the information of the termination bits needs to be handled carefully. The systematic information of decoder 1's termination bits, ec, needs to be appended at the end to form a complete ya; therefore, ya = [aa + ac, ec]. On the other hand, for the extrinsic information generated by decoder 1, ye, the information of the termination bits needs to be removed before it is interleaved and passed to decoder 2, as shown in the figure. The length of ya and ye is Ny = Na + m. For decoder 2, correspondingly, the uncoded information is the sum of ba (i.e. the interleaved ae) and the interleaved systematic information bc; therefore, za = [ba + bc, fc], and the same treatment of the termination bits is applied to za and ze. The length of za and ze is Nz = Nb + 3 = Na + 3. In the first iteration, be is initialised with a sequence of zero-valued LLRs, which implies that the values of the corresponding bits are completely unknown; ac is the received systematic information. Note that two copies of the interleaver used in the encoding scheme and a corresponding deinterleaver are used between the decoders to restore the correct order of the input sequences. As discussed before, in the BCJR algorithm we use, the extrinsic information is generated directly by the decoding algorithm inside the decoder. After all the iterations are completed, the a posteriori output of the decoding scheme is obtained by adding the final extrinsic output to the final a priori input of decoder 1, as shown in the figure, and the SISO decoding process is then complete. Based on these soft decisions, hard bit decisions can be taken to give the final decoding result.
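The data flow just described can be summarised by the following Python sketch (our own structural illustration; the bcjr argument stands for the constituent SISO decoder of Section 2.2.2 and is not implemented here, and the a posteriori combination follows the description above).

import numpy as np

def turbo_decode(ac, cc, dc, ec, fc, pi, n_iterations, bcjr):
    """Structural sketch of the iterative loop of Figure 2.11.
    ac, cc, dc, ec, fc are channel LLR arrays; pi is the interleaver pattern
    (a permutation of range(Na)); bcjr(uncoded, coded) returns the extrinsic
    LLRs of the uncoded bits (including the tail positions, which are stripped
    before being passed on)."""
    Na = len(ac)
    inv = np.argsort(pi)                        # deinterleaver pattern
    be = np.zeros(Na)                           # first iteration: no a priori knowledge
    for _ in range(n_iterations):
        aa = be[inv]                            # deinterleave extrinsic info from decoder 2
        ya = np.concatenate([aa + ac, ec])      # uncoded input of decoder 1: ya = [aa + ac, ec]
        ye = bcjr(ya, cc)[:Na]                  # extrinsic output, tail LLRs removed
        ba = ye[pi]                             # interleave for decoder 2
        za = np.concatenate([ba + ac[pi], fc])  # za = [ba + bc, fc], with bc the interleaved ac
        be = bcjr(za, dc)[:Na]                  # extrinsic output of decoder 2
    ap = ac + ye + be[inv]                      # a posteriori LLRs of the systematic bits
    return ap

Only the BCJR routine and the interleaver pattern need to change to adapt this skeleton to a different block length or constituent code.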

2.2.2 BCJR algorithm

As we discussed, the APP algorithm we investigate is the Log-BCJR algorithm. In this section, we give a brief description of the Approx-Log-BCJR algorithm. The original BCJR algorithm is introduced in detail in [37].

2.2.2.1 Log-BCJR algorithm

There are two main advantages of introducing the Log-BCJR algorithm. Firstly, the original BCJR algorithm consists of many multiplication operations, which lead to very complex circuits when implemented in hardware. The Log-BCJR algorithm avoids the multiplications by transforming the algorithm into the logarithmic domain, where multiplications become additions. Secondly, the values of the soft decisions in the normal domain can have a very large and theoretically unbounded dynamic range, which would require a large amount of memory space in practice. Transforming them to the logarithmic domain reduces the dynamic range of the soft decisions, and consequently of all the internal variables in the algorithm. Hence, this approach significantly reduces the memory required to implement the algorithm. We use the notation y to represent the systematic bits in the encoder including the termination bits, which means y = [a, e], and ya to represent the received uncoded LLR sequence in our BCJR decoder, according to Figure 2.11. We have y = {yn}, n = 1, ..., Ny. In the normal domain, the soft decision of a received bit is defined as:

yan = P(yn = 0)/P(yn = 1) (2.10)

where yan is the soft decision of the received bit yn. In the logarithmic domain, the soft decisions become log-likelihood ratios (LLRs), defined as:

yan = ln(P(yn = 0)/P(yn = 1)) (2.11)

The two basic operations in the original BCJR algorithm are addition and multiplication. For A = ln(a) and B = ln(b), a multiplication in the normal domain becomes an addition in the logarithmic domain:

ln(ab) = ln(eA eB) = A + B (2.12)


An addition in the normal domain can be handled by the Jacobian logarithm in the logarithmic domain, which we denote by the max∗ function:

ln(a + b) = ln(eA + eB) = max(A,B) + ln(1 + e−|A−B|) = max∗(A,B) (2.13)

When there are more than two terms, the max∗ function is computed by successive pairwise operations. In practice, the correction function fc = ln(1 + e−|A−B|) can be implemented by a Look-Up-Table (LUT), so that it reduces to a select operation in the LUT. The LUT-based version of the Log-BCJR algorithm is called the Approx-Log-BCJR algorithm. In the Approx-Log-BCJR, the max function is realised by a compare operation between A and B. Thus, the only operations required in the Log-BCJR algorithm are "add", "compare" and "select", the so-called ACS operations.
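As an illustration of the ACS structure, the following Python sketch (our own; the 8-entry LUT and its 0.125 step are assumed design choices, not values fixed by the text) implements the max∗ operation of (2.13) with a LUT-based correction term.

import math

# Correction term fc(d) = ln(1 + e^-d), tabulated at an assumed step of 0.125
LUT_STEP = 0.125
LUT = [math.log(1.0 + math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star(a, b):
    """Jacobian logarithm max*(a, b) = max(a, b) + ln(1 + e^-|a-b|),
    with the correction term replaced by a LUT look-up (Approx-Log-BCJR)."""
    diff = abs(a - b)
    index = int(diff / LUT_STEP)                 # "select": pick the LUT entry
    correction = LUT[index] if index < len(LUT) else 0.0
    return max(a, b) + correction                # "compare" + "add"

def max_star_many(values):
    """max* of more than two terms, computed by successive pairwise operations."""
    result = values[0]
    for v in values[1:]:
        result = max_star(result, v)
    return result

# ln(e^1.0 + e^1.5) is about 1.974; the LUT version is very close
print(max_star(1.0, 1.5), math.log(math.exp(1.0) + math.exp(1.5)))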

To present the Log-BCJR algorithm, we use the convolutional code of the UMTS Turbo code as an example. Figure 2.12 shows the example trellis diagram introduced in Section 2.2. Here yn are the systematic bits and cn are the encoded bits. There are three tail bits used for termination, as shown in the trellis, which drive the encoder back into the "all zero" state, State1. Note that we use the notation for decoder 1 of Figure 2.11 here, but by simply replacing the sequences y and c with z and b, the same decoding trellis also applies to decoder 2 of Figure 2.11.

[Figure 2.12: An example trellis of a short terminated trellis code. The states are numbered S1 to S38 from top to bottom and left to right, and the transitions T1 to T60 in the same order; the traversed path for the example sequence yn/cn = 0/0, 1/1, 1/0, 0/0, 1/1, 0/0, 0/1, 1/1 is highlighted.]

In the trellis diagram, there are 16 possible transitions in each step, except during the three initial steps at the start and the three termination steps at the end. For a given input systematic sequence yn and the corresponding encoded sequence cn, only one transition is used in each step of the encoding trellis, as exemplified in Figure 2.12. The systematic bit and encoded bit corresponding to each transition are also given in the figure. To label the states in Figure 2.12, we use the notation {S1, S2, S3, ..., S38} for the possible states in the trellis, following the order from top to bottom and from left to right, as shown in the figure. Similarly, we use the notation {T1, T2, T3, ..., T60} for the possible transitions in the trellis, following the same order. In addition, we use tn to denote the transition employed in the encoder trellis for the nth bit, and sn to denote the state entered by the encoder after the nth bit. Therefore, for the example sequences y and c in Figure 2.12, the traced transitions are {tn} for n = 1, ..., Ny, namely {T1, T4, T12, T28, T46, T54, T58, T60}, and the traced states are {sn} for n = 0, ..., Ny, namely {S1, S2, S6, S14, S23, S31, S35, S37, S38}. To describe the algorithm, we define the following notation.

• fr(T) is the starting state of the transition T. For example, in Figure 2.12, fr(T1) = S1 and fr(T3) = S2.

• to(T) is the ending state of the transition T. For example, in Figure 2.12, to(T1) = S2 and to(T2) = S3.

• fr(S) is the set of all transitions starting from state S. For example, in Figure 2.12, fr(S2) = {T3, T4}.

• to(S) is the set of all transitions ending at state S. For example, in Figure 2.12, to(S38) = {T59, T60}.

• y(T) is the value of the bit in y that is implied by the transition T. For example, in Figure 2.12, y(T1) = 0, since t1 = T1 implies that y1 = 0. Similarly, y(T4) = 1.

• c(T) is the value of the bit in c that is implied by the transition T. For example, in Figure 2.12, c(T1) = 0 and c(T2) = 1.

• n(T) is the bit index associated with the transition T. For example, in Figure 2.12, n(T1) = n(T2) = 1 and n(T3) = n(T4) = n(T5) = n(T6) = 2.

With the notation above, we can now describe the Log-BCJR algorithm. The ultimate purpose of the algorithm is to calculate the extrinsic LLRs ye of the decoded sequence. However, within the algorithm it is more immediate to calculate the probability that the encoder traversed a specific transition in the trellis. The calculation of the extrinsic LLRs ye therefore leads to the calculation of three further groups of internal variables, γ, α and β.

• The γ values are conditional transition probabilities. In our case, the γ values are divided into two sub-groups, the a priori transition probabilities γy and the channel transition probabilities γc. They correspond to the transitions in the trellis: for each transition in each step there is a γy(T) and a γc(T). γy(T) represents the log-domain probability ln[P(tn(T) = T | yan(T))], and γc(T) represents ln[P(tn(T) = T | ccn(T))].


• The α values correspond to the states in each step of the trellis. α(S) is the conditional log-domain probability that, in step n(S) (i.e. when the decoding process is working on the trellis step corresponding to the received yan and ccn), the traversed transition starts from the particular state S, given all the LLRs received up to and including that step; that is, α(S) represents ln[P(sn(S) = S | {yan} for n = 1, ..., n(S), {ccn} for n = 1, ..., n(S))].

• A β value, on the other hand, is the conditional log-domain probability that a traversed transition ends in the particular state S, given all the LLRs received in the later steps; that is, β(S) represents ln[P(sn(S) = S | {yan} for n = n(S)+1, ..., Ny, {ccn} for n = n(S)+1, ..., Nc)].

Finally, these three groups of variables can be used to calculate the probability that the encoder traversed a specific transition T in the trellis. We use δ to represent such a probability. For calculating the extrinsic information, δy is considered here, where δy(T) represents the log-domain probability ln[P(tn(T) = T | {yan} for n = 1, ..., Ny with n ≠ n(T), {ccn} for n = 1, ..., Nc)]. It is obtained from the corresponding γc, α and β values of the transition T.

To compute all the variables above, the Log-BCJR algorithm is composed of the following steps.

1. γ calculation: The values of γ depend on the inputs of the convolutional decoder. There are two inputs, the encoded LLR input and the uncoded LLR input. As shown in Figure 2.11, the encoded LLR input is the LLR sequence of the encoded sequence received from the channel, ccn, and the uncoded LLR input is ya. For a transition T, γy and γc are calculated as:

γy(T) = (1 − y(T)) yan(T) (2.14)

γc(T) = (1 − c(T)) ccn(T) (2.15)

2. α calculation: The values of α depend on the γ values and the α values of the previous step in the trellis. Hence, a forward recursion through the trellis is required to obtain all the α values. For a state S in step n, α is calculated as:

α(S) = max*_{T ∈ to(S)} [γy(T) + γc(T) + α(fr(T))] (2.16)

where α(S1) = 0.

3. β calculation: The values of β depend on the γ values and the β values of the next step in the trellis. Hence, a backward recursion through the trellis is required to obtain all the β values. For a state S in step n, β is calculated as:

β(S) = max*_{T ∈ fr(S)} [γy(T) + γc(T) + β(to(T))] (2.17)

where β(S38) = 0.


4. δy calculation: The values of δy are calculated according to (2.18):

δy(T) = γc(T) + α(fr(T)) + β(to(T)) (2.18)

5. Finally, the extrinsic information can be calculated from the δ values. The extrinsic LLR of the nth uncoded bit is:

yen = max*_{T | y(T)=0} δy(T) − max*_{T | y(T)=1} δy(T) (2.19)

where both max∗ operations are taken over the transitions T of step n, i.e. those with n(T) = n.

This completes the algorithm.
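The five steps above translate almost directly into code. The following Python sketch (our own illustration; the trellis is represented generically as a per-step list of transitions rather than by the S/T numbering of Figure 2.12, and the exact correction term is used instead of a LUT) computes γ, α, β, δy and the extrinsic LLRs for one terminated constituent decoder.

import numpy as np

NEG_INF = -1e30   # log-domain "zero probability"

def max_star(a, b):
    # Jacobian logarithm, here with the exact correction term
    return max(a, b) + np.log1p(np.exp(-abs(a - b)))

def log_bcjr(trellis, ya, cc):
    """trellis[n] is the list of transitions of step n (n = 0..N-1), each a tuple
    (from_state, to_state, y_bit, c_bit).  ya and cc are the uncoded and encoded
    input LLRs.  Returns the extrinsic LLRs ye of the uncoded bits."""
    n_states = 1 + max(max(t[0], t[1]) for step in trellis for t in step)
    N = len(trellis)

    # 1. gamma values, equations (2.14) and (2.15)
    gamma_y = [[(1 - y) * ya[n] for (_, _, y, _) in trellis[n]] for n in range(N)]
    gamma_c = [[(1 - c) * cc[n] for (_, _, _, c) in trellis[n]] for n in range(N)]

    # 2. forward recursion for alpha, equation (2.16); alpha of the initial state is 0
    alpha = np.full((N + 1, n_states), NEG_INF)
    alpha[0, 0] = 0.0
    for n in range(N):
        for i, (s_from, s_to, _, _) in enumerate(trellis[n]):
            metric = gamma_y[n][i] + gamma_c[n][i] + alpha[n, s_from]
            alpha[n + 1, s_to] = max_star(alpha[n + 1, s_to], metric)

    # 3. backward recursion for beta, equation (2.17); a terminated trellis ends in state 0
    beta = np.full((N + 1, n_states), NEG_INF)
    beta[N, 0] = 0.0
    for n in range(N - 1, -1, -1):
        for i, (s_from, s_to, _, _) in enumerate(trellis[n]):
            metric = gamma_y[n][i] + gamma_c[n][i] + beta[n + 1, s_to]
            beta[n, s_from] = max_star(beta[n, s_from], metric)

    # 4./5. delta values, equation (2.18), and extrinsic LLRs, equation (2.19)
    ye = np.zeros(N)
    for n in range(N):
        num, den = NEG_INF, NEG_INF      # max* accumulators for y(T)=0 and y(T)=1
        for i, (s_from, s_to, y, _) in enumerate(trellis[n]):
            delta = gamma_c[n][i] + alpha[n, s_from] + beta[n + 1, s_to]
            if y == 0:
                num = max_star(num, delta)
            else:
                den = max_star(den, delta)
        ye[n] = num - den
    return ye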

2.3 EXIT chart analysis

As we mentioned, the BER chart is a powerful tool to analyse the performance of a Turbo-like code. However, it is unable to characterise the convergence behaviour of a Turbo-like code, for example at the onset of the turbo cliff. This requires a different analysis tool, namely the extrinsic information transfer (EXIT) chart [41]. An EXIT chart uses mutual information (MI) measurements to quantify the quality of the extrinsic information exchanged between the constituent decoders in an iterative decoding system. It comprises two curves, one for each decoder in the system. Each curve plots the mutual information of the extrinsic LLRs versus the mutual information of the a priori LLRs of one decoder, which essentially measures the quality of the input and the output of that decoder. Taking the UMTS decoding scheme as an example, for the first decoder the EXIT curve plots I(ae, a) as a function of I(aa, a), as shown in Figure 2.13, where I(aa, a) is the mutual information between aa and a, while I(ae, a) is the mutual information between ae and a. To draw the EXIT curve, we use a simulator to generate sequences of a priori LLRs aa having a range of mutual information values (0 < I(aa, a) < 1). Using simulations that include the channel model, the modulation model and the BCJR decoder, the extrinsic output ae can be obtained and measured. If we use I(ae) to represent I(ae, a) and I(aa) to represent I(aa, a), the EXIT function I(ae) = F(I(aa)) of the UMTS Turbo code is shown in Figure 2.14. In this simulation, we use the exact convolutional code shown in Figure 2.13, with BPSK modulation and an AWGN channel. The Signal-to-Noise Ratio (SNR) is -4 dB, where the SNR is defined as:

SNR = Es/N0 (2.20)

Another EXIT curve can be drawn for the other decoder in the same way. For a Turbo code, owing to the symmetry of the two concatenated codes, the EXIT function of the lower convolutional code is identical to that of the upper convolutional code.


[Figure 2.13: Scheme of the EXIT chart generation. The input sequence a is encoded (shift registers D D D), modulated and transmitted over the channel; an LLR generator produces a priori LLRs aa with a chosen mutual information I(aa, a), and the extrinsic output ae of the BCJR decoder is passed to a mutual information measurement block that records I(ae, a).]

[Figure 2.14: One EXIT curve I(ae) = F(I(aa)) of the UMTS Turbo code using BPSK to transmit over an AWGN channel having an SNR of -4 dB.]

In the EXIT chart, the second curve is displayed with swapped axes, that is, the horizontal axis is the mutual information of the extrinsic output and the vertical axis is the mutual information of the a priori input. The reason for displaying the second curve with swapped axes is that, in the iterative decoding process, the output of one decoder is the input of the other decoder in the next iteration. By putting the input of one decoder and the output of the other decoder on the same axis, the interaction of the two concatenated decoders can be predicted on an EXIT chart. The complete EXIT chart of the UMTS Turbo code generated by our simulation results is given in Figure 2.15. The iterative decoding process of the Turbo code can be revealed by decoding trajectories in the EXIT chart, as shown in Figure 2.16.

[Figure 2.15: EXIT chart of the UMTS Turbo decoder, plotting I(ae) against I(aa).]

[Figure 2.16: The decoding trajectories in the EXIT chart.]

A decoding trajectory starts at the (0,0) point, since at the start of the decoding process there is no a priori information coming from the other decoder. The mutual information of the output of the first decoder can be obtained from the upper curve in the EXIT chart and is provided as the input of the second decoder. Based on the mutual information provided by the first decoder, the mutual information of the output of the second decoder can be obtained from the lower curve in the EXIT chart. The decoding performance of the next iteration can be obtained in the same way, and thus a decoding trajectory can be traced. Note that the decoding trajectory only has a high probability of reaching the (1,1) point if the EXIT chart has an open tunnel. On reaching the (1,1) point, the maximum likelihood decoding result has been found and the BER will be in the error floor region. However, the EXIT chart is a statistical result obtained from a large number of simulation samples, and in practice individual trajectories deviate from it. As shown in Figure 2.16, the three trajectories are all different and depart from the EXIT curves. The EXIT chart gives the average convergence behaviour of the investigated code. An EXIT chart allows the two concatenated codes to be considered in isolation from each other. Since EXIT charts can predict the iterative interaction of the two codes, the iterative decoding process does not need to be simulated in order to draw an EXIT chart. Thus, EXIT charts can be obtained faster than BER/FER charts.
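Because the second curve is simply the first with its axes swapped, the trajectory prediction can be sketched in a few lines of Python (our own illustration; the EXIT function F and the example lambda below are hypothetical stand-ins for a curve obtained by simulation).

def predict_trajectory(F, max_iterations=20, tol=1e-3):
    """Staircase prediction of the decoding trajectory for a symmetric Turbo code.
    F maps the a priori mutual information I(aa) of one constituent decoder to
    its extrinsic mutual information I(ae); the other decoder uses the same
    function with the axes swapped."""
    points = [(0.0, 0.0)]                  # the trajectory starts at (0, 0)
    ia = 0.0
    for _ in range(max_iterations):
        ie1 = F(ia)                        # first half iteration (upper curve)
        points.append((ia, ie1))
        ie2 = F(ie1)                       # second half iteration (lower, swapped curve)
        points.append((ie2, ie1))
        if abs(ie2 - ia) < tol:            # converged: tunnel closed or (1,1) reached
            break
        ia = ie2
    return points

# Example with a hypothetical EXIT function (for illustration only)
traj = predict_trajectory(lambda i: min(1.0, 0.3 + 0.75 * i))
print(traj[-1])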

There are a number of different methods for measuring the mutual information. The first is the averaging method, which uses the equation:

I(a, a) = 1 + (1/Na) Σ_{n=1}^{Na} Σ_{a′=0}^{1} [e^{(1−a′) an} / (1 + e^{an})] log2[e^{(1−a′) an} / (1 + e^{an})] (2.21)

where an denotes the nth LLR of the measured sequence.

This method has the advantage of not requiring any knowledge of the bit sequence a. This is achieved by assuming that the LLRs in a satisfy the consistency condition, that is, the LLRs express neither too much nor too little confidence. Since the averaging method "believes" what the LLRs say, it does not need to consider the true values of the bits in a. However, this assumption is only valid if there are no sub-optimalities in the receiver design. This requires perfect channel estimation, perfect carrier recovery, perfect synchronisation, perfect equalisation and optimal decoding using the Log-BCJR algorithm. The histogram method of measuring mutual information does not make this assumption and is therefore better suited when a sub-optimal receiver is employed. This method uses knowledge of the true values of the bits in a to avoid having to "believe" what the LLRs say.
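A minimal Python sketch of the averaging measurement of (2.21) (our own illustration; the consistent Gaussian LLR model used in the example is an assumption) is given below.

import numpy as np

def mutual_information_averaging(llrs):
    """Averaging measurement of the mutual information between a bit sequence
    and its LLRs, as in equation (2.21).  It assumes the LLRs satisfy the
    consistency condition, so the true bit values are not needed.
    The convention L = ln(P(bit = 0) / P(bit = 1)) is used."""
    llrs = np.asarray(llrs, dtype=float)
    info = 1.0
    for a_prime in (0, 1):
        # P(bit = a' | L) = e^((1 - a') L) / (1 + e^L)
        p = np.exp((1 - a_prime) * llrs) / (1.0 + np.exp(llrs))
        info += np.mean(p * np.log2(np.maximum(p, 1e-300)))
    return info

# Consistent Gaussian LLRs for all-zero bits: mean sigma^2/2, variance sigma^2
sigma2 = 4.0
sample = np.random.normal(sigma2 / 2, np.sqrt(sigma2), 100000)
print(mutual_information_averaging(sample))   # roughly 0.56 for sigma^2 = 4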

2.4 Fixed-point representation in a Turbo decoder

In this section, we give an introduction to fixed-point representation in hardware design. Fixed-point representation, compared with floating-point representation, is easily implemented in a small memory space and is fast to execute. Therefore, it is well suited to real-time or low-power applications. Internally, fixed-point computations treat the values as integers, but the integer part and the fraction part are considered separately, divided by an imaginary binary point.

Two's complement representation is the most widely used fixed-point representation in practice. A two's complement binary number is divided into three parts: a sign bit, an integer part and a fraction part. First, let us consider the two's complement representation of signed integers before considering the representation of numbers having a fraction part. The most significant bit is used as the sign bit, where 0 represents a positive sign and 1 represents a negative sign. The remaining bits represent the magnitude of the number. For a negative number, its two's complement representation is obtained by complementing the magnitude bit by bit and incrementing the result by 1. For example, the 3-bit representation of 2 is 010; the complement of this is 101, and adding 1 gives the two's complement representation of -2, namely 110. The complete set of 3-bit two's complement representations is given in Table 2.1. In addition, two other signed integer representation methods, sign and absolute value notation and one's complement notation, are also given in the table for comparison.

Binary number            000   001   010   011   100   101   110   111
Sign and absolute value  +0    +1    +2    +3    -0    -1    -2    -3
One's complement         +0    +1    +2    +3    -3    -2    -1    -0
Two's complement         +0    +1    +2    +3    -4    -3    -2    -1

Table 2.1: Different representation methods for integer numbers

As shown in the table, compared with the other two methods, two's complement notation avoids the double representation of zero. As a consequence, the range of negative values extends one resolution step further than the range of positive values. The main advantage of two's complement notation is the ability to perform the addition of negative numbers without needing to take the signs of the operands into consideration. In two's complement notation, subtraction is achieved by complementing and adding. For example, in the 3-bit representation, 2 − 3 can be computed as the sum of 2 (010) and -3 (101):

2 − 3 = 2 + (−3) = 010 + 101 = 111 = −1 (2.22)

For subtractions in two's complement notation, it is necessary to let the result overflow. Take the following calculation as an example:

3 − 3 = 3 + (−3) = 011 + 101 = (1)000 = 000 = 0 (2.23)

Since the overflowed part is lost, the calculation naturally gives the correct result. In contrast, subtraction in the other two notation methods is more complicated, since complementing and adding does not give the correct result. For example, in one's complement notation:

3 − 2 = 3 + (−2) = 011 + 101 = 000 = 0 (2.24)


Therefore, additions involving operands of different signs need to be treated carefully and extra correction steps are required.
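The modulo behaviour of two's complement addition can be reproduced with a few lines of Python (our own illustration, using the 3-bit word length of the examples above).

def twos_complement(value, bits=3):
    """Encode a signed integer as an unsigned n-bit two's complement word."""
    return value & ((1 << bits) - 1)

def to_signed(word, bits=3):
    """Interpret an unsigned n-bit word as a signed two's complement value."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word

def add(x, y, bits=3):
    """n-bit addition: any carry out of the most significant bit is discarded."""
    word = (twos_complement(x, bits) + twos_complement(y, bits)) & ((1 << bits) - 1)
    return to_signed(word, bits)

print(add(2, -3))   # 2 - 3 = -1, as in (2.22)
print(add(3, -3))   # 3 - 3 = 0, the carry out is simply lost, as in (2.23)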

For a fractional fixed-point number A, an imaginary point is set at a certain position. An example of this method for a 3-bit two's complement fixed-point number with a 2-bit fraction part is given in Table 2.2. In this table, the imaginary point is placed after the most significant bit of the binary representation.

Binary number     0.00   0.01   0.10   0.11   1.00   1.01   1.10   1.11
Two's complement  +0.00  +0.25  +0.50  +0.75  -1.00  -0.75  -0.50  -0.25

Table 2.2: Two's complement representation method for fraction numbers

For an n-bit two's complement representation, we use the notation Qp.q to describe the position of the point, where p is the number of bits in the integer part and q is the number of bits in the fraction part; the total number of bits is n = p + q + 1. For example, consider the 8-bit two's complement number with the imaginary point after the 5th bit, 01100.010: the integer part is 12 and the fraction part is 0.25, so the value of 01100.010 is 12.25. The maximum and minimum limits of the representation are given by (2.25), and the resolution r is given by (2.26).

−2^{p+q}/2^q ≤ A ≤ (2^{p+q} − 1)/2^q (2.25)

r = 2^{−q} (2.26)
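To illustrate the Qp.q convention, the following short Python sketch (our own) quantises a real value into a Qp.q two's complement word and back, saturating at the limits given by (2.25); it reproduces the 01100.010 example above.

def to_fixed(value, p, q):
    """Quantise a real value to Qp.q two's complement (1 sign bit, p integer
    bits, q fraction bits), returning the underlying integer.  Values outside
    the range of (2.25) are saturated."""
    scale = 1 << q                      # resolution r = 2^-q, so scale by 2^q
    raw = int(round(value * scale))
    lo, hi = -(1 << (p + q)), (1 << (p + q)) - 1
    return max(lo, min(hi, raw))

def from_fixed(raw, q):
    """Convert the underlying integer of a Qp.q number back to a real value."""
    return raw / (1 << q)

# The 8-bit example from the text: Q4.3 representation of 12.25
raw = to_fixed(12.25, p=4, q=3)
print(raw, bin(raw & 0xFF), from_fixed(raw, q=3))   # 98, 0b1100010, 12.25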


Chapter 3

Optimal Data-width Settings for Fixed-point Implementation

3.1 Introduction

In Turbo-like decoding schemes, the algorithms are usually specified in the floating-point domain. However, in practical implementations, a fixed-point number representation is mandatory for energy efficiency in most architectures, such as DSP systems, FPGA or VLSI implementations [42], since a fixed-point implementation allows significant energy consumption reductions with only insignificant reductions in performance [43]. As discussed in Chapter 2, one of the advantages of the Log-BCJR algorithm is the reduced dynamic range of the internal variables and the LLRs. In practice, this allows a fixed-point representation to be used. In a fixed-point implementation, the hardware complexity increases linearly with the internal bit-width of the data representation, since this bit-width determines the width of all the databuses and of the computing resources in the datapath structure [35]. Moreover, the iterative decoding process of Turbo-like coding schemes requires a large amount of memory space to store the internal variables. Using fewer bits for each variable can significantly reduce the memory requirement and hence the energy consumption of the decoder. Therefore, for a low-power implementation, minimising the number of bits required for representing the fixed-point quantities in the algorithm is a very important issue. However, the information lost by reducing the data width degrades the performance. There is therefore a trade-off between communication performance and hardware complexity, which needs to be explored for a low-power design.

Many papers have investigated the fixed-point implementation issues of Turbo decoders by exploring the minimum data width of the different quantities that gives acceptable degradation in the BER/FER chart [42-50]. However, no universal conclusion has been obtained. Even though some of the papers used the same simulation specification, namely the UMTS Turbo decoder with BPSK modulation simulated in an AWGN channel, the conclusions differ [42, 43, 45, 47, 49, 50]. The reason is that, in a fixed-point implementation, there are different issues that affect the decoding performance and different techniques for dealing with them.

The performance degradation caused by a fixed-point implementation is due to lost information, that is, to underflow and overflow. For underflow, the fraction bit-width limits the accuracy of the calculations in the algorithm. In particular, since we investigate the Log-BCJR algorithm [51] using a Look-Up-Table (LUT) to realise the Jacobian logarithm, the precision of the fixed-point representation directly determines the number of elements in the LUT. As discussed in Chapter 2, the max∗ operator of the Jacobian algorithm is defined as:

max∗(x, y) = ln(e^x + e^y) (3.1)
= max(x, y) + ln(1 + e^{−|y−x|}) (3.2)
= max(x, y) + fc(|y − x|) (3.3)

The function fc is a quantised version of the function ln(1 + e^{−|y−x|}), which is implemented by a LUT. Therefore, the bit-width of the fraction part determines the largest useful number of elements in the LUT, as shown in Figure 3.1. For example, a 3-bit fraction gives the LUT a resolution of 0.125, which makes the largest possible LUT contain 7 elements. By the same reasoning, a 2-bit fraction gives a 4-element LUT and a 1-bit fraction gives only a 2-element LUT, as shown in Figure 3.1.

[Figure 3.1: The correction function fc(|y−x|) = ln(1 + e^{−|y−x|}) and its LUT implementations with 3-bit, 2-bit and 1-bit fraction parts.]

Hence, the fraction bit-width affects not only the width of the databuses, the computing resources and the memory requirement, as discussed, but also the complexity of the LUT used in the algorithm.

The occurrence of overflow depends on the dynamic range of the variables in the algorithm and on the number of bits used for the integer part of the fixed-point representation. In the event of overflow, the lost information can be fatal to the system performance. However, the dynamic range of the variables is difficult to predict and sometimes quite large, requiring a large number of bits in the fixed-point representation to guarantee that the range is covered. In the Log-BCJR decoder, there are only three different operations, "add", "compare" and "select", the ACS operations mentioned above. The "compare" and "select" operations cannot induce any overflow, but any "add" can overflow. Taking the Log-BCJR algorithm used in the UMTS decoder of Chapter 2 as an example, in the decoding trellis each α is the sum of two γ values, an α from the previous step and a correction function value fc from (3.3). Since it includes an α from the previous step, the calculation of α forms an accumulation of α values along the trellis. Therefore, the α values would increase without limit as the block length increases, and the resulting overflow in a limited data width is the most significant effect that needs to be considered. The calculation of β has the same problem. The δ calculation is the sum of an α, a β and a γ, so it can also overflow. To deal with these issues, a number of different techniques have been proposed [45, 48].

The first approach is to saturate the overflowing data during processing. This method is widely used in fixed-point digital filters [52, 53]. A disadvantage of this approach is that it requires additional saturation hardware on each computing unit that could cause an overflow, such as the adders. Our simulation results showed that this technique is not suitable for the Log-BCJR algorithm on its own, but that it can work well in combination with a second technique, namely normalisation [45].

Normalisation is applied in the Log-BCJR algorithm specifically to deal with the overflow of the α and β internal variables. It scales down the growing metrics in each step, in order to prevent them from increasing without bound. This reduces the occurrence of overflow and allows the data width used to represent the variables to be further reduced. As discussed, the α and β values accumulate along the decoding trellis. Taking α as an example, each α is the sum of a previous α, two γ values and a correction function value. For each α there is an accumulation history route in the trellis; an example is shown in Figure 3.2. Based on the algorithm described in Chapter 2, α(S4), α(S5), α(S6) and α(S7) accumulate from α(S2) and α(S3), which in turn accumulate from α(S1). This accumulation continues as the forward recursion proceeds, with subsequent α values typically becoming higher and higher. In this way, overflow can occur for the α values calculated towards the end of the forward recursion. However, the extrinsic LLRs that are generated by the Log-BCJR are not sensitive to the particular value of any α, only to the differences between the α values of states having the same bit index [48]. For example, the Log-BCJR is not sensitive to the values of α(S4), α(S5), α(S6) and α(S7), only to the differences α(S4)−α(S5), α(S4)−α(S6), α(S4)−α(S7) and so on. The same conclusion also applies to the β values. As shown in (2.19), the basic operation of the extrinsic LLR calculation is to select two δ values and to calculate their difference, where each δ is the sum of a particular α, β and γ in a single step.

[Figure 3.2: A possible accumulation route in the trellis: α(S4) to α(S7) accumulate from α(S2) and α(S3), which in turn accumulate from α(S1), via the transitions T1 to T6.]

Therefore, if the α values from the previous step are reduced by a common value before the α values of the current step are calculated, the differences of interest remain unchanged, but the growth of the α values is slowed down. The same method also works for the β values; this is exactly what the normalisation technique achieves. However, the normalisation process needs extra calculations and operations, increasing the datapath complexity. In addition, there are different normalisation approaches. The most widely used is subtractive normalisation [45, 48, 54], in which the path metrics are normalised by subtracting a constant from all the metrics at particular times. Even this method has different versions. In [45], the path metrics are reduced by the respective minimum metric in each step, whereas in [48] they are reduced by the maximum metric of the step; both versions require extra computations to find the required metric and to perform the subtractions. In [54], a modified version is mentioned in which, instead of searching for the smallest or largest metric at each step, a fixed state metric is subtracted from all path metrics, so the comparison operations for finding the required metric can be avoided. These different versions end up with different data width requirements in the papers' conclusions.
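As a sketch of subtractive normalisation (our own illustration, following the variant of [48] that subtracts the maximum metric of each step):

import numpy as np

def normalise_step(alpha_step):
    """Subtract a common value (here the maximum, as in [48]) from all state
    metrics of one trellis step.  The differences between the metrics, which
    are all the Log-BCJR result depends on, are unchanged, but the metrics no
    longer grow without bound along the recursion."""
    alpha_step = np.asarray(alpha_step, dtype=float)
    return alpha_step - alpha_step.max()

before = np.array([41.5, 39.0, 40.25, 38.75])
after = normalise_step(before)
print(after)                                        # [ 0.   -2.5  -1.25 -2.75]
print(before[0] - before[1], after[0] - after[1])   # pairwise differences preserved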

The third approach exploits the nature of the two's complement representation [45]. It was first introduced for Viterbi decoders [55] and later applied to SISO decoders. The Log-BCJR decoding process is only concerned with the differences between the path metrics, and it can be proven that all possible differences between pairs of path metrics are upper bounded [55]. Therefore, in two's complement representation, as long as the difference between two metrics does not exceed the largest value that can be represented by the specified data width, the subtraction can be performed correctly using modulo 2^n arithmetic, by simply ignoring the overflow of the operands. Three examples of difference calculations using this method are given in Figure 3.3. Note that in the calculations of (1+3)−2 and (2+2+2)−3, the results in the brackets both overflow in the 3-bit two's complement representation, but the equivalent calculation in two's complement representation still gives the correct answer, since the final result does not overflow. However, for the third calculation the difference in the last step overflows; in this situation, the two's complement representation cannot keep the result correct, as shown in the figure. The modulo 2^n arithmetic is implemented naturally in a VLSI architecture, so no additional hardware is required for this approach.

[Figure 3.3: Example of difference calculation in two's complement representation (3-bit number wheel from -4 to 3):
(1+3)−2 = 2: (001+011)−010 = 100−010 = 100+110 = 010 = 2
(2+2+2)−3 = 3: (010+010+010)−011 = 110−011 = 110+101 = 011 = 3
(2+2+2)−1 = 5: (010+010+010)−001 = 110−001 = 110+111 = 101 = −3]

According to [45], 1-2 bits more data width may be required for this approach compared with subtractive normalisation, and it is shown that for a high-speed MAP decoder one additional bit results in approximately 25% higher area and power consumption.
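The behaviour illustrated in Figure 3.3 can be checked with the short Python sketch below (our own illustration): metrics are accumulated modulo 2^n and the differences are taken in the same arithmetic.

BITS = 3
MASK = (1 << BITS) - 1

def wrap(x):
    """Keep only the lowest BITS bits (modulo 2^n arithmetic)."""
    return x & MASK

def signed(word):
    """Interpret a BITS-bit word as a signed two's complement value."""
    return word - (1 << BITS) if word & (1 << (BITS - 1)) else word

def wrapped_difference(accumulated, reference):
    """Difference of two wrapped metrics; correct as long as the true
    difference fits in the BITS-bit two's complement range."""
    return signed(wrap(accumulated - reference))

print(wrapped_difference(wrap(1 + 3), 2))        # (1+3)-2 = 2, correct despite overflow
print(wrapped_difference(wrap(2 + 2 + 2), 3))    # (2+2+2)-3 = 3, correct despite overflow
print(wrapped_difference(wrap(2 + 2 + 2), 1))    # (2+2+2)-1 = 5, overflows: gives -3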

In conclusion, when implementing a Turbo decoding algorithm in fixed-point representation, different choices of the related techniques lead to different optimal data width requirements. In the eight similar previous works we investigated [42-47, 49, 50], the different environments, designs and implementation configurations led to different conclusions. Some of them did not even provide a clear configuration of their simulations, which makes the results unreproducible. A brief summary of the configurations of the eight papers is given in Table 3.1. Three similar Turbo codes are considered in these papers, as shown in Figure 3.4; in the figure, Type-2 corresponds to the UMTS Turbo encoder discussed in Chapter 2.

Only a few papers discussed the effects of this issue based on mathematical proofs [45, 47, 48]. However, it turns out that mathematical proofs are not sufficient to decide the optimal data width specifications in practice. Some of the mathematical proofs give upper bounds on the path metrics that are never reached in practice [48], and when saturation and normalisation techniques are applied, the data width requirement can be reduced further with a tolerable decrease in communication performance (i.e. BER/FER degradation) [45]. Indeed, our simulation results show that the actual dynamic range used in a fixed-point implementation can be less than the theoretical bounds predicted by mathematical analysis.


[Figure 3.4: Three different Turbo codes (Type-1, Type-2 and Type-3) considered in previous works, each built from shift-register (D) constituent encoders concatenated through an interleaver π.]

authors              J. Hsu [44]     G. Montorsi [47]   H. Michel [42]    A. Worm [45]
encoder              Type-1          Type-2             Type-2            Type-2
modulation           BPSK            BPSK/PAM           BPSK              BPSK
channel              AWGN            AWGN               AWGN/Rayleigh     AWGN/Rayleigh
interleaver          helical         N/A                N/A               3GPP compliant
block length (bit)   216             4828               600               600
iteration times      5               10                 5/10              5/7/10
normalisation        Yes             N/A                N/A               Yes
wrapping/saturation  N/A             saturation         N/A               saturation
Look-Up-Table        16 elements     22 elements        7/10 elements     N/A

authors              M. A. Castellon [43]   M. A. Castellon [50]   T. K. Blankenship [46]   R. Hoshyar [49]
encoder              Type-2                 Type-2                 Type-3                   Type-2
modulation           BPSK                   BPSK                   BPSK                     BPSK
channel              AWGN                   AWGN                   AWGN/Rayleigh            AWGN/Rayleigh
interleaver          block prime            N/A                    N/A                      ideal
block length (bit)   1024                   N/A                    640                      2896
iteration times      3/8                    5/8                    N/A                      7/18 half
normalisation        N/A                    N/A                    N/A                      Yes
wrapping/saturation  saturation             N/A                    N/A                      N/A
Look-Up-Table        2 elements             7 elements             2/4/8 elements           N/A

Table 3.1: Configurations of the eight previous works on fixed-point Turbo decoder implementation

As a result, the data-width decisions for a decoding algorithm cannot be made based on mathematical analysis alone. Traditional BER/FER chart simulation is time consuming. Moreover, different types of variables in a decoding scheme sometimes have different optimal data-widths, which leads to a large number of combinations that have to be tested when using simulation to find the optimal settings. If the effects of using the different techniques are also considered, the amount of simulation required to draw BER/FER charts becomes unacceptable. In addition, BER/FER chart analysis does not give any insight into the iterative decoding convergence process. Hence, to fully investigate the optimal data width in the fixed-point implementation of a decoding algorithm, we propose a method based on EXIT chart [41] analysis to determine the optimal fixed-point specification of a Turbo-like decoder in practical implementations. Our method is less time consuming than the previous works that used BER/FER charts for the same analysis. Moreover, our results show that the EXIT chart provides more useful information than the BER/FER chart when determining the optimal fixed-point specification. Instead of only giving the performance result, the EXIT chart shows the convergence behaviour of the decoder, and the reasons for the performance degradation caused by an insufficient bit-width can be analysed. Hence, appropriate techniques to prevent the degradation can be introduced to further optimise the system. To present our method, we investigate the 3GPP UMTS Turbo decoder [39]; the optimal data width specifications for its fixed-point implementation are concluded and compared with previous works. It is easy to apply this method to any Turbo code, and potentially to any Turbo-like code comprising an iterative decoding scheme that can be analysed by an EXIT chart.

As introduced in Chapter 2, EXIT chart analysis is a powerful tool for analysing and optimising the convergence behaviour of iterative systems such as Turbo-like decoders. Unlike BER/FER simulation, it is less time consuming, since the simulation of the interleaver in the decoder and of the actual iterative decoding process is not required. Although the effects of a sub-optimal interleaver on the performance cannot be revealed, an interleaver only changes the order of the information sequence, and no information is lost in such a process through fixed-point implementation. Since our purpose is to investigate the performance degradation caused by fixed-point implementation, leaving the interleaver out of consideration does not affect the result of our method. Moreover, a BER plot can only give the performance for a particular number of iterations of the decoding process, while an EXIT chart traces the convergence behaviour of the decoder, allowing an arbitrary number of iterations to be considered. Our results show that, based on EXIT chart analysis, not only can the performance be investigated, but the reasons causing the performance degradation under fixed-point implementation can also be identified. Hence, the proper combination of techniques can be chosen to improve the performance. One drawback is that an EXIT chart only considers a fixed Signal-to-Noise Ratio (SNR), while a BER/FER chart covers a wide range of SNR or Eb/N0. However, since the EXIT simulation is far less time consuming than BER/FER simulation, it is possible to draw EXIT charts at different SNRs if necessary. To sum up, the EXIT chart is more suitable than the BER/FER chart for finding the optimal data-width setting of a fixed-point implemented decoding scheme. Also, we have tested an SNR where the tunnel is narrow and the performance is most sensitive to the limitations of the fixed-point representation.

The idea of using the EXIT chart to analyse the impact of finite-precision arithmetic on Turbo codes was first introduced in [56]; however, no convincing analysis procedure or conclusion was given. In this chapter, we introduce for the first time a detailed analysis method that uses the EXIT chart to determine the optimal data width specification for the fixed-point implementation of Turbo-like decoders, by giving a full investigation of the UMTS Turbo decoder [39]. In Section 3.2, to demonstrate our method, we use it to select the optimal data-width specification for the UMTS Turbo decoder with a comprehensive consideration of fixed-point implementation techniques; the conclusion is then compared with previous works. In the last section, our conclusion is given.


3.2 Fixed-point EXIT chart analysis of UMTS Turbo Decoder

To present our method, we use EXIT chart analysis to investigate the fixed-point effects on the UMTS Turbo decoder implemented by the Log-BCJR algorithm with the Jacobian logarithm. The specification and structure of the UMTS encoder and decoder were presented in Chapter 2. In our simulations, BPSK modulation is assumed with an AWGN channel. We first use an SNR of -4 dB, at which the EXIT chart of the UMTS Turbo code has a moderately open tunnel, so that the degradation of the performance is easy to observe. We also chose an SNR of -4.83 dB, where the tunnel is almost closed (i.e. the onset of the Turbo cliff) and the performance is most sensitive to the limitations of the fixed-point representation, to validate our optimal data width specification. Random bit sequences are applied to the input of the Turbo encoder. We use a 453-bit frame length (i.e. interleaver length), which is the geometric mean of the minimum and maximum block lengths of the UMTS standard. Since the performance degradation of Turbo codes scales with the block length in the logarithmic domain, we use this length to investigate the optimal data-width specification. The shortest (40-bit) and longest (5114-bit) frame lengths of the UMTS Standard are then simulated under the optimal specification to investigate the effect of the frame length on the performance. In addition, we gathered the different conclusions from eight previous works [42-47, 49, 50] as a comparison to show the validity of our work.

Firstly, the effects on the EXIT chart of using three BCJR algorithms are simulated in floating-point representation. The three algorithms are the Log-BCJR using exact calculation of the Jacobian logarithm, the Log-BCJR using an 8-element look-up-table (LUT) Jacobian logarithm, and the Max-Log-BCJR [57]. It has been shown that the performance loss incurred by using the LUT-based Jacobian logarithm is less than 0.1 dB relative to the exact log calculation, which is usually considered acceptable [54]. The performance degradation of the Max-Log-BCJR UMTS Turbo decoder is also well explored. According to [58], the Eb/N0 performance degradation at a BER of 10^-5 between the Log-BCJR and the Max-Log-BCJR is 0.3 dB for a 640-bit block length and 0.54 dB for a 5114-bit block length in an AWGN channel, and worse in a Rayleigh fading channel, which is considered significant (not acceptable). In our analysis process, we aim to obtain fixed-point EXIT charts as close to the floating-point Log-BCJR result as possible, and we consider a degradation similar to that of the floating-point Max-Log-BCJR result unacceptable.

Secondly, we investigate the effects of a limited fraction part in the fixed-point representation. Since the limitation of the fraction part length also limits the number of elements in the LUT, the number of LUT elements is considered at the same time. Thirdly, we investigate the effects of a limited integer part in the fixed-point representation; the three overflow control approaches discussed before are all investigated.


Finally, based on this analysis, the optimal combination of fraction length and integer length is investigated under different block lengths, and the effect of the termination technique is examined. The conclusion is given and compared with previous works.

3.3 Simulation and Analysis Results

3.3.1 Comparison between different Logarithm methods

Figure 3.5 gives the EXIT charts of the UMTS Turbo decoder using the three logarithm algorithms mentioned before. The tunnel between the two curves is narrower due to the information lost by the Max-Log-BCJR. Therefore, after a certain number of decoding iterations, the mutual information will be lower than for the Log-BCJR implementation; in other words, more decoding iterations may be required to reach a given target BER. We can therefore assert that the BER degradation due to the information lost in the implementation is reflected in the EXIT chart. Our further simulation results confirm this conclusion.

[Figure 3.5: EXIT chart of different log algorithms (Log-BCJR, Jacobi-Log-BCJR and Max-Log-BCJR); axes I(aa)/I(be) and I(ae)/I(ba).]

3.3.2 Comparison and Analysis in Fixed-point simulation

To analyse the effects of fixed-point representation, fixed-point data types are used for all the variables in the simulations. Later, we will use a long bit-width (32 bits) for the fraction part but a limited bit length for the integer part, in order to investigate the degradation caused by a limited dynamic range. First, however, we consider the opposite, in order to investigate the performance of limited precision with a sufficient integer bit-width (32 bits). Note that the effect on the LUT in the Log-BCJR is also considered here: for an n-bit fraction part, up to 2^n elements are used in the LUT, as described in Section 3.1. The EXIT chart results are shown in Figure 3.6. The simulation results show that using a 1-bit fraction part with 2 elements in the LUT gives observable degradation in the EXIT chart, whereas a 2-bit fraction part with 4 elements in the LUT gives almost no observable degradation.

[Figure 3.6: EXIT chart of different fraction lengths (0-bit, 1-bit, 2-bit and 3-bit fraction parts compared with floating point).]

As shown in the figure, the 2-bit fraction length gives a result indistinguishable from the floating-point result. The 1-bit fraction length also gives an EXIT chart very close to the floating-point result, and the degradation is much less than that of the floating-point Max-Log-BCJR result. Note that a 0-bit fraction part effectively removes the LUT, transforming the decoder from the Approx-Log-BCJR into the Max-Log-BCJR; its EXIT chart degradation is nevertheless worse than that of the Max-Log-BCJR because of the low resolution used for the BCJR variables. Considering the trade-off between energy consumption and performance, both the 1-bit and the 2-bit fraction parts are potentially sufficient for most applications; the final decision should refer to the later combined simulation results with limited fraction and integer lengths. BER chart analyses for different fraction lengths are given in [43, 47, 50]. Both [43] and [47] concluded that a 2-bit fraction length approaches the performance of the floating-point decoder and could be chosen as the optimal specification, although [50] stated that a 3-bit fraction length gives better performance, incurring a penalty of only 0.015 dB. The simulation results of [47] showed that a 1-bit fraction length only causes a loss of 0.1 dB at medium-to-low SNR and has no consequences for the error floor performance. Of the eight papers we selected, five determined that a 2-bit fraction length is the optimal choice and three chose a 3-bit fraction length.

For determining the optimal bit-width of the integer part, many papers investigated the optimal bit-width settings for the different internal variables (input LLRs, α, β, γ and λ) separately. However, in practice it is not convenient to store different variables in memory blocks of different data-widths, and using different settings for different variables does not decrease the memory requirement compared with using a single data-width setting. Although [42] claimed that further bit-width minimisation for the individual variables can reduce the switching activity, which influences the energy consumption, as the process technology scales down the contribution of dynamic power to the total power consumption becomes smaller and smaller; thus, the benefit of considering the different variables separately is reduced. On the other hand, such a strategy requires additional extension and clipping mechanisms on the databuses in the datapath, which increase the design complexity of the datapath. Therefore, we consider a single data-width setting in our analysis. However, it is valuable and necessary to consider the input LLRs and the internal variables of the SISO decoder separately, because the limit on the input LLRs directly affects the dynamic range of the internal variables, such as α and β. According to [47], the possible differences between pairs of path metrics ∆MAX (i.e. the possible differences between the α values, or between the β values, in a single step of the decoding trellis), which are significantly important in the BCJR decoder as discussed in the last section, are upper-bounded by a function of the dynamic range of the input LLRs:

∆MAX = min_w (wMu + dmin(w)Mc) (3.4)

where dmin(w) is the minimum weight of the code sequences generated by input sequences of weight w, and ±Mu and ±Mc are the dynamic ranges of the two inputs of the SISO decoder, namely the extrinsic information and the LLRs received from the soft demodulator. Hence, dmin(w) depends on the considered code, while ±Mu and ±Mc are simply related to the bit-width of the integer part of the input LLRs. As discussed in the previous section, it is important to preserve the differences between pairs of metrics in order to maintain the performance, and different overflow control techniques require different data-widths to guarantee this condition. Therefore, based on the discussion in [47], for a sufficient data-width specification the bit-width of the internal variables typically needs to be a couple of bits larger than the bit-width of the input LLRs.

As discussed, with different bit-width settings for different variables, the transformation between data of different lengths needs to be managed carefully. Transforming shorter data to longer data does not cause any problem, since the value of the data remains unaltered; a simple extension mechanism that sign-extends the value into the extra most significant bits solves the problem, is easy to realise in hardware, and requires no extra operations in our simulation. However, the transformation in the other direction may cause information loss. Moreover, the most significant bit of a two's complement representation determines the sign of the value, which means that simply discarding the extra bits during the transformation may not only reduce the value but also change its sign, which would significantly affect the correctness of the decoding process. Hence, a clipping mechanism with saturation is required during the transformation.

[Figure 3.7: Scheme of the UMTS Turbo decoder with clipping (clip) operations inserted where data is transformed between different data-widths.]

If the original data value exceeds the range of the target data width, the transformed data is set to the corresponding limit value, so that the information loss is kept to a minimum. Such a method requires extra hardware to realise in practice and extra operations in our simulation. Figure 3.7 shows the clipping operations we used in the decoding scheme to simulate the data transformations between different data widths.
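As a concrete illustration of the two transformations, the following C sketch shows one possible realisation of the sign-extension and clipping-with-saturation operations for two's-complement values; the helper names and parameterisation are our own and are only meant to mirror the behaviour described above.

#include <stdint.h>

/* Sign-extend a two's-complement value held in the low 'bits' bits of x.
 * Widening loses no information: the numerical value is unchanged. */
int32_t sign_extend(int32_t x, int bits)
{
    int32_t m = 1 << (bits - 1);
    x &= (1 << bits) - 1;
    return (x ^ m) - m;
}

/* Clip a value into the range representable with 'bits' two's-complement
 * bits, saturating at the positive/negative limits. Simply dropping the
 * extra high bits could flip the sign of the value; saturation keeps the
 * information loss to a minimum when narrowing an internal variable. */
int32_t clip(int32_t x, int bits)
{
    int32_t max = (1 << (bits - 1)) - 1;
    int32_t min = -(1 << (bits - 1));
    if (x > max) return max;
    if (x < min) return min;
    return x;
}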

3.3.2.1 Wrapping Technique

We investigated the optimal bit-width settings for the integer part under different overflow control techniques by using EXIT chart analysis. As discussed before, two's complement representation can naturally avoid the effect of overflow in the BCJR algorithm, since the overflowed data can be considered as wrapping around a circle, so the distance between two values is preserved. The benefit of this wrapping technique is that no extra operation or hardware is required. Therefore, it is suitable for cases where memory is sufficient or a simple datapath is required. Note that for the input LLRs saturation is still required, since they have a shorter bit-width than the internal variables, as discussed before; the wrapping technique is only applied to the internal variables. We use the notation (LLR:X,VAR:Y) to describe the integer length settings, where X is the integer length of the input LLRs of a BCJR decoder and Y is the integer length of all the other internal variables. Figure 3.8 shows the EXIT chart results of settings (LLR:5,VAR:7),


(LLR:4,VAR:7) and (LLR:3,VAR:7).

Figure 3.8: EXIT chart of different integer lengths with wrapping technique - 1.

The simulation results show that with setting (LLR:5,VAR:7) there is almost no degradation in the EXIT chart compared with the floating-point result. It is obvious that the EXIT chart of setting (LLR:3,VAR:7) fails to create an open tunnel to the (1,1) point, which means that the BER of the decoding result would be significantly degraded. The settings (LLR:5,VAR:7) and (LLR:4,VAR:7) also give different results in the EXIT chart analysis, as shown in Figure 3.9, which is a zoomed-in version of Figure 3.8. For setting (LLR:5,VAR:7), the curve of the EXIT function Ie(Ia)

Figure 3.9: EXIT chart of different integer lengths with wrapping technique - 2.


reaches a peak value at a certain Ia and then starts decreasing, which closes the tunnel. Although the closing point is very near to the (1,1) point, it means that the best possible decoding result of (LLR:5,VAR:7) cannot match that of (LLR:4,VAR:7). Correspondingly, in a BER simulation there would be a degradation of the BER when moving from (LLR:4,VAR:7) to (LLR:5,VAR:7). Note that the tunnels of (LLR:5,VAR:7) and (LLR:3,VAR:7) close in different ways, which points to different causes of the closures. For (LLR:3,VAR:7), the closure is caused by a slower increase of the curve Ie(Ia). Since the only difference between (LLR:3,VAR:7) and (LLR:4,VAR:7) is a 1-bit shorter integer part for the input LLRs, it can be conjectured that the lower Ie(Ia) is due to the information lost from the LLRs by the reduced bit-width. Thus, a 4-bit integer length is the minimum sufficient bit-width setting for the input LLRs. For (LLR:5,VAR:7), the closure is due to the reduction of the curve Ie(Ia) after its peak. The result of (LLR:4,VAR:7) proves that such a bit-width setting is sufficient for preserving the valid information in all the variables, so the performance degradation of (LLR:5,VAR:7) is due to an insufficient bit-width difference between the input LLRs and the internal variables. As the number of iterations increases, the mutual information in the a priori LLRs increases, which means that the average absolute value of the LLRs increases. Due to the accumulating additions of the input LLRs in the BCJR algorithm, an insufficient difference of bit-widths between the input LLRs and the internal variables may cause a serious overflow problem in the calculation of the internal variables. Therefore, towards the end of the EXIT chart the function Ie(Ia) starts decreasing. It is the overflow of the internal variables exceeding the tolerance of the wrapping technique that causes the EXIT chart to fail to reach the (1,1) point.

Such an effect is shown more clearly in Figure 3.10 and Figure 3.11. In Figure 3.10, for the results of (LLR:5,VAR:7) and (LLR:6,VAR:7) the peak of the curves occurs earlier due to the even smaller bit-width difference between the input LLRs and the internal variables. As the difference increases, the performance reaches its best point at (LLR:4,VAR:7). On the other hand, in Figure 3.11, when the integer length of the input LLRs becomes shorter than 4 bits, the performance gets worse again. However, since the difference is now sufficient, no reductions occur in the curves of (LLR:3,VAR:7) and (LLR:2,VAR:7); only the increase of the curves is slower, due to the insufficient bit-width of the input LLRs, which causes the closure of the tunnel. If we further reduce the integer bit-width of the internal variables to 6 bits, the tunnel in the EXIT chart always closes before the (1,1) point, irrespective of the integer bit-width of the LLRs.

In conclusion, a 4-bit integer width for the input LLRs is the minimum acceptable setting for UMTS Turbo decoders. With the wrapping technique, the minimum difference between the input LLRs and the internal variables is 3 bits. Therefore, the optimal integer length setting is (LLR:4,VAR:7).
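The tolerance of the wrapping technique can be illustrated with a small C example: even after two path metrics have overflowed an 8-bit word, their subtraction in the same word length still yields the correct difference, provided the true difference itself fits in the word. The metric values below are arbitrary and only serve to demonstrate the effect; they are not taken from the decoder.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Two path metrics that have both grown beyond the 8-bit range. */
    int32_t alpha_true = 1000, beta_true = 970;

    /* Stored in 8-bit two's complement they wrap around the "circle". */
    int8_t alpha_wrapped = (int8_t)alpha_true;   /* 1000 wraps to -24 */
    int8_t beta_wrapped  = (int8_t)beta_true;    /*  970 wraps to -54 */

    /* The difference survives the wrapping because |1000 - 970| < 128. */
    int8_t diff = (int8_t)(alpha_wrapped - beta_wrapped);
    printf("true diff = %d, wrapped diff = %d\n",
           (int)(alpha_true - beta_true), (int)diff);
    return 0;
}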


Figure 3.10: EXIT chart of different integer lengths with wrapping technique - 3.

Figure 3.11: EXIT chart of different integer lengths with wrapping technique - 4.

3.3.2.2 Saturation Technique

The wrapping technique is effectively a "do nothing" technique: no additional operation or hardware is used to deal with overflow in the internal variables. Another simple overflow control technique is the saturation technique. As mentioned before, the input LLRs are already forced to be saturated, because they have a shorter data width than the internal variables in our specification. The same technique can also be applied to the internal variables.


The problem is that the BCJR algorithm relies on the distances between the metrics, which are the internal variables, as described before. The saturation technique limits all overflowed data values to the maximum or minimum value, which changes the differences between the variables: the differences between overflowed values become 0, and the differences between overflowed and non-overflowed values are also decreased. The simulation results show that this problem makes the results under the saturation technique even worse than those using the wrapping technique. However, when subtractive normalisation, referred to earlier as the rescaling normalisation technique, is applied, saturation is a necessary condition for obtaining the benefit of normalisation [54]. Our further simulation results show that the normalisation technique without saturation cannot reduce the data-width requirement below that of the wrapping technique. Figure 3.12 gives the simulation results of using the saturation technique. Since the conditions for the input LLRs are not changed, their minimum integer width remains 4 bits. However, the required bit-width difference between the input LLRs and the internal variables is significantly increased by the application of the saturation technique.

Figure 3.12: EXIT chart of different integer lengths with saturation technique.

Although it can be observed that for setting (LLR:4,VAR:12) the tunnel closes before the (1,1) point, as shown in the figure, the closing point is very close to the (1,1) point and the EXIT curves are almost indistinguishable from the floating-point result. Hence, (LLR:4,VAR:12) is the optimal integer bit-width setting for the saturation technique. In the result of (LLR:4,VAR:11) the tunnel closes far before the (1,1) point. Note that, unlike the results using the wrapping technique, when the integer length of the internal variables is reduced to 11 bits the function Ie(Ia) falls to values near 0 very soon after the peak. As discussed before, the reason that causes an EXIT chart curve (i.e. the function Ie(Ia)) to start


decreasing is an insufficient difference between the integer lengths of the input LLRs and the internal variables. Since the EXIT chart of (LLR:4,VAR:13) can reach the (1,1) point, a 4-bit integer length for the input LLRs is still sufficient under the saturation technique. Hence, the saturation technique increases the required integer length difference between the input LLRs and the internal variables. As mentioned before, the decoding result only depends on the differences between the path metrics (i.e. the internal variables). The saturation technique fixes overflowed variables to the positive and negative limits. It can be speculated that while the overflowed internal variables are fixed at the limit values, the differences between path metrics become 0, so reliable soft outputs cannot be obtained. When a certain proportion of the internal variables overflow, the EXIT chart curves drop to 0 very quickly, as shown in the result of (LLR:4,VAR:11). Therefore, the saturation technique on its own is not well suited to convolutional decoding. However, it is a precondition for applying the normalisation technique. Our simulation results with the normalisation technique show that it is important to combine the saturation and normalisation techniques to obtain the optimal bit-width specification.

3.3.2.3 Normalisation Technique

The limitation of the wrapping technique is that if the difference between the path metrics exceeds the dynamic range, the subtraction no longer gives the correct result. The purpose of the saturation technique is to fix this problem. However, as the saturation simulation results showed, it induces another problem, in which many overflowed variables are fixed at the same value. The normalisation technique is introduced to deal with this problem. Our simulation results show that, with the combination of saturation and normalisation, the required integer bit-width of the internal variables can be further reduced. In our simulation, the monotonically increasing variables α and β are normalised by subtracting the largest value of each group at each step. The EXIT chart results are shown in Figure 3.13. The optimal integer length setting is (LLR:4,VAR:5), which requires two fewer bits for the internal variables.
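A minimal sketch of this rescaling step is given below, assuming the state metrics of one trellis step are held in a small array (8 states for the UMTS constituent code); the function name is ours. Subtracting the largest metric leaves all the differences, which are what the BCJR algorithm actually uses, unchanged while pulling the absolute values back towards zero.

#include <stdint.h>

/* Subtractive (rescaling) normalisation of the state metrics of one
 * trellis step. */
#define NUM_STATES 8

void normalise_metrics(int32_t metrics[NUM_STATES])
{
    int32_t max = metrics[0];
    for (int s = 1; s < NUM_STATES; s++)
        if (metrics[s] > max)
            max = metrics[s];

    /* Differences between metrics are unchanged; the absolute values are
     * pulled back towards zero, so fewer integer bits are needed. */
    for (int s = 0; s < NUM_STATES; s++)
        metrics[s] -= max;
}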

3.3.2.4 Final validation

To finally determine and validate the optimal data width specification for the fixed-point implementation of the UMTS Turbo code, we investigate the EXIT chart performance considering the combination of limited integer and fraction lengths. Since the simulation results of Figure 3.6, which only limited the fixed-point fraction length, were not sufficient to determine the optimal fraction length, we consider both 1-bit and 2-bit fraction lengths in our final validation simulations. We combine the fraction length settings with the optimal integer length setting for the wrapping technique, (LLR:4,VAR:7), and for the normalisation technique, (LLR:4,VAR:5).


Figure 3.13: EXIT chart of different integer lengths with normalisation technique.

We use the notation (LLR:X,VAR:Y,FRC:Z) to represent the settings in our simulation results, where Z is the length of the fraction part. Moreover, each setting is simulated in different situations, including the longest block length (5114 bits), the shortest block length (40 bits) and the most performance-sensitive SNR (SNR = -4.83 dB), where the tunnel in the EXIT chart is only just open.
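For reference, the following C sketch shows one straightforward way of quantising a floating-point value into a two's-complement fixed-point number with a given integer and fraction length, saturating at the representable limits; the function name and the rounding choice are our own illustration rather than a prescribed part of the simulation.

#include <stdint.h>
#include <math.h>

/* Quantise a floating-point value into a fixed-point number with
 * 'int_bits' integer bits and 'frac_bits' fraction bits (plus sign),
 * saturating at the representable limits. */
int32_t to_fixed(double x, int int_bits, int frac_bits)
{
    int32_t scaled = (int32_t)lround(x * (1 << frac_bits));
    int32_t max = (1 << (int_bits + frac_bits)) - 1;
    int32_t min = -(1 << (int_bits + frac_bits));
    if (scaled > max) return max;
    if (scaled < min) return min;
    return scaled;
}

/* Example under the (LLR:4, FRC:2) setting: to_fixed(3.7, 4, 2) gives 15,
 * i.e. the quantised value 3.75. */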

For the normalisation technique, the final validation results are shown in Figure 3.14 for the longest block length, Figure 3.15 for the shortest block length and Figure 3.16 for the most sensitive SNR of -4.83 dB. According to the results, setting (LLR:4,VAR:5,FRC:2) gives almost the same performance as the floating-point results in the different situations, while setting (LLR:4,VAR:5,FRC:1) shows further degradation due to the combined effects of the limited integer and fraction lengths, although the degradation is not as bad as that of the Max-Log-BCJR. Moreover, according to the simulation results, the block length does not have a significant effect on the EXIT chart. Consequently, a single optimal specification can work for any block length. Therefore, for the normalisation technique, we conclude that (LLR:4,VAR:5,FRC:2) is the optimal specification for the UMTS Turbo decoder.

For the wrapping technique, the final validation results are shown in Figure 3.17 for the longest block length, Figure 3.18 for the shortest block length and Figure 3.19 for the most sensitive SNR of -4.83 dB. The results in these figures clearly show that a 2-bit fraction part is the best option in our case. Hence, for the wrapping technique, (LLR:4,VAR:7,FRC:2) is the optimal specification for the UMTS Turbo decoder.


Figure 3.14: Simulation results of 5114-bit block length in fixed-point with normalisation and floating-point.

Figure 3.15: Simulation results of 40-bit block length in fixed-point with normalisation and floating-point.

To compare our results with previous works: for the input LLRs, [42, 47, 49] claimed that a 3-bit integer length is sufficient. However, they only considered the input LLRs received from the channel. In the EXIT chart simulation, the considered input LLRs are the a priori inputs of the constituent decoders, which include the channel input and the extrinsic information from the other decoder. Since both are inputs of the constituent decoders, it is more reasonable to consider that they have the same bit-width. Hence, our results show that a 4-bit integer length is the optimal setting for the input LLRs. [46] gave the same conclusion about the integer bit-width of the input LLRs.


Figure 3.16: Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with normalisation and floating-point.

Figure 3.17: Simulation results of 5114-bit block length in fixed-point with wrapping technique and floating-point.

The other papers mentioned before did not consider the input LLRs separately. For the internal variables, [44, 46] considered all the different internal variables (i.e. γ, α, β and δ) separately; their widest variables require 8 bits for the integer part. [50] concluded that 7 bits is the optimal setting, the conclusion of [49] is 6 bits, and [42, 43] reach the same conclusion as our result, namely that 5 bits is the optimal setting. The reason for so many different conclusions is the different circumstances used in the simulations,


Figure 3.18: Simulation results of 40-bit block length in fixed-point with wrapping technique and floating-point.

Figure 3.19: Simulation results of SNR=-4.83dB/453-bit block length in fixed-point with wrapping technique and floating-point.

as discussed before. For example, [49] used normalisation in their simulation, but only at a few specific steps in the decoding process, whereas we used it in every step; hence, they concluded that one more bit is required for the internal variables than in our conclusion. By using the EXIT chart, we have shown that, with proper overflow control techniques, the optimal bit-width specification for the UMTS Turbo decoder is (integer: 4-bit, fraction: 2-bit) for the input LLRs and (integer: 5-bit, fraction: 2-bit) for the internal variables.


In conclusion, we introduced a method, based on EXIT chart analysis, to determine the optimal data width specification for implementing a Turbo code in a low power fixed-point system. By applying the method to the UMTS Turbo code, we demonstrated its advantage compared with the conventional method based on BER chart analysis. The different techniques for reducing the data width requirement of a fixed-point Turbo code implementation were also discussed.


Chapter 4

Energy Estimation Decoding Algorithm

In this chapter, a framework to estimate the energy consumption of an encoder/decoder at the algorithmic level is proposed.

4.1 Introduction

There are different aspects to evaluating a system's power/energy consumption. For instance, the average power is directly related to chip heating and temperature issues, while the worst-case instantaneous power affects the voltage drop problem [59]. In low power WSN applications, such as Body Area Networks, a long life-time is the most important motivation for applying low power techniques. Since many advanced techniques, such as clock gating and power gating, can help to increase the life-time of a system without changing the average power while the system is fully operating, estimating the power consumption of a design at an early design stage is not sufficient to investigate its potential life-time. Hence, energy consumption estimation is more suitable for this issue. Indeed, the latest works on long life-time WSN design issues are focused more on energy-consumption-based design [60–62].

Power/energy estimation is required at all levels of abstraction in the design flow, with different purposes [63]. At the later stages, such as the gate level or transistor level, very accurate estimation can be given, since most of the implementation information is available. On the other hand, most of the design effort has already been invested by this stage, so not much power reduction can be achieved after the estimation. The purposes of power/energy estimation at these stages are only to fine-tune the design and verify that the power constraints have been met. Therefore, to design an extremely low-power system, power/energy estimation is more important at the early stages. By being aware


of the forecast energy consumption during the early design stages, more energy reduction can be achieved. However, at the very early design stages, such as algorithm level design, most knowledge of the physical parameters affecting the energy consumption is not available, which makes energy estimation at this stage very difficult. Hence, in a traditional design flow, communication engineers estimate the computational complexity of an algorithm, instead of its energy, to evaluate an algorithm design. The complexity indicates the computing resources an algorithm needs. However, in the case of a Turbo decoder, for example, [64] demonstrated that memory access rather than computational complexity is the most critical part of the decoder in terms of energy consumption. The same assertion applies to many other types of system design [65, 66]. There are also other components in the implementation of a system, such as the datapath selection logic, the internal registers and the controller; again, their contribution to the energy consumption cannot be predicted from the computational complexity of an algorithm. Therefore, a lower complexity algorithm cannot guarantee a lower energy consumption implementation. For low power design, energy estimation provides more information than complexity estimation.

As discussed in Chapter 1, in short-range, low-power WSNs such as BANs, due to the low transmission power, the energy consumption of the physical layer, especially of the channel coding scheme, can make a significant contribution to the energy consumption of the whole system. In this chapter, we propose a framework for estimating the energy consumption of a channel coding system at a very early stage, namely the algorithm level. We focus on the energy consumption rather than the average power consumption of a channel coding system, and many decisions are made in algorithm level design which can affect the energy consumption of the final implementation. Moreover, after this stage the basic scheme is fixed, so the potential reduction of power consumption is then limited. Therefore, the decisions made in algorithm level design are very important to a low power design. Our framework aims to rank various coding scheme design options and thus helps in selecting the one that is potentially more effective from the energy point of view. Since encoding algorithms are typically of low complexity and energy consumption, we are particularly interested in the energy consumption of decoding algorithms, which is typically much higher. The framework is suitable for all Turbo-like algorithms and even other types of algorithms in channel coding systems, such as equalisation, interference cancellation and MIMO (Multiple-Input, Multiple-Output) detection. Knowledge of the hardware design of later design stages is not required when applying the framework.

There are two classic approaches to implementing a coding scheme, namely DSP (Digital Signal Processing) implementation and ASIC (Application-Specific Integrated Circuit) implementation. A DSP system is based on a general purpose processor with an instruction set; thus, the algorithm is realised by an assembly language program. An ASIC system, on the other hand, is a system designed specifically for a particular application; therefore, the hardware design can be highly optimised for the target algorithm. DSP


implementation is widely used in traditional WSN applications due to its general applicability. However, compared with an ASIC implementation, the long execution time and low hardware usage efficiency of a DSP implementation are not suitable for low power systems. Moreover, the lower bound on the energy consumption of a coding scheme is an important issue which needs to be considered in physical layer design. Therefore, our framework aims to estimate the possible energy consumption of an algorithm in ASIC implementations.

4.2 Previous works

In previous works, power/energy estimation at an early design stage, referred to as high-level power/energy estimation, can be divided into two categories. One is based on DSP or FPGA implementations [60, 67, 68]. More specifically, in order to simplify the problem, these methods assume the fixed architectural templates offered by DSPs and FPGAs. The benefit of such methods is that it is easy to make the approaches suitable for a wide range of algorithms. However, DSP or FPGA implementations are not suitable for extremely low power applications, since the unique characteristics of an algorithm cannot be exploited in these architectures. Such characteristics may be utilised in a dedicated hardware implementation, which is very important for low power design and is why the lower bound on the energy consumption is of particular interest to communication engineers.

To investigate the distinct characteristics of an algorithm, energy estimation of possible

ASIC implementations is required. This is the other category of high-level power/energy

estimation, mostly referred to as behavioural level power/energy estimation, which is based on executable behavioural descriptions [63].

"Algorithm level", which usually refers to a clear mathematical description of an algorithm, is a more general concept than the behavioural level: the term algorithm level is widely used in the communications area, whereas "behavioural level" is a concept from the hardware design area, meaning an executable program or a clear flowchart description with detailed operation requirements. However, from the power/energy estimation point of view they have no clear distinction: both refer to a clear description of an algorithm with no knowledge of the architecture of the implementation.

One type of behavioural level power estimation method is the activity-based model,

which typically assumes some architectural style or template and produces physical

capacitance and switching activity estimations of the resources based on it [69]. The

dynamic power is then expressed as:

P = Σ_{r ∈ {all resources}} fr · Cr · Vdd^2    (4.1)


where fr is the access frequency of resource r, which is produced by activity prediction, Cr is the switched capacitance of r and Vdd is the supply voltage [69]. The equation has a couple of equivalent transformations in different methods, but they are all based on (4.1). Typically, in such methods [70–72], only the dynamic power is considered. However, as IC process technology enters deep submicron sizes, an exponential increase in the subthreshold leakage current arises, which makes the leakage power of CMOS circuits non-negligible. Moreover, switching activity estimation for sequential circuits is difficult and time consuming, which makes this type of method difficult to use at a practical

algorithm design stage. An alternative approach is offered by the complexity-based model, which considers the power/energy consumption of a system to be the sum of the power/energy consumption of different entities. In [73], the power consumption of cryptographic algorithms is estimated based on how many different components (registers, adders, etc.) are used and what type of memory is chosen. In [74], the energy consumption of a digital CMOS circuit is expressed as:

E = µNgatesEgate (4.2)

where µ is the circuit activity, Egate is the energy consumption per switching gate of a reference cell (e.g. a 2-input NAND gate) in a particular technology, and Ngates is the approximate equivalent gate count of the design in terms of the reference cell. These parameters can be obtained from the specification parameters of the technology or by simulation. Hence, all three types of power components (datapath, memory and controller) in a CMOS circuit are considered automatically. The drawback of these methods is that the activity of the circuit is roughly estimated using only one parameter. The framework we propose overcomes this drawback by considering the different components separately.

The other challenge of high-level power/energy estimation is that, unlike a DSP or FPGA implementation, an ASIC design can be optimised for the specific algorithm. As a result, the possible implementation can be difficult to predict at the algorithm level. Some previous works obtained the specifics of the hardware implementation by using high-level synthesis tools [73, 74]. Others transform the behavioural description into a more complicated description, which may include Boolean functions, truth tables or a circuit design [70, 72]. These approaches require knowledge of hardware design and synthesis processes. Furthermore, the required programming and simulation processes are time consuming, which is not desirable for an algorithm level design stage. To overcome this challenge, our framework relies on the designer to specify the algorithm partitioning and resource constraints, but avoids an actual hardware design process. In this way the framework not only estimates the potential energy consumption of an algorithm but also provides feedback on the quality of a design strategy.


4.3 A framework for quantifying the energy consumption of a Turbo-like decoder

In this section, a framework that allows us to compare and estimate the energy consumption of a Turbo-like decoder design at the algorithm level is proposed. Traditional low power system design methods can only guide the algorithm design through computational complexity analysis; our framework provides ample opportunity for feeding the energy estimation results back into the algorithm choice or design steps. In line with the purpose of our work, the comparison and evaluation of different Turbo-like code algorithms, we develop the framework at two levels. For the comparison of different algorithms, it is not necessary to estimate all the possible energy consumption of the implementation of the algorithms, since this could be time consuming at the algorithm level. Therefore, in level 1 of our framework, we aim to provide a quick method which allows communication engineers to compare different algorithms from the energy consumption point of view with little extra effort. We only consider the two main parts of the possible energy consumption when implementing an algorithm in hardware: the energy consumed by all the operations in the algorithm and the energy consumed by the memory required by the algorithm. In our case, for a Turbo-like code, all the operations in the algorithm are ACS operations. The reason we select these two parts of the system energy consumption in the level 1 framework is that only these two parts are directly related to the target algorithms. The other parts of the system energy consumption, such as the contributions of the controller and the datapath structure, can vary depending on the design strategy, so only an approximate estimate can be given at the algorithm level. In the level 2 framework, however, we aim to give an energy estimate which considers all the possible energy contributions in the system. The level 1 framework is presented in Section 4.3.1. The level 2 framework is still at the future work planning stage and is discussed in Section 4.3.2.

4.3.1 Level 1 of the framework

To consider the energy consumption of the computing operations in the target algorithm, a conventional complexity analysis of the algorithm needs to be performed. We take one convolutional decoder of the UMTS Turbo decoding scheme, whose algorithm was introduced in Chapter 2, as an example. We categorise all the ACS operations into two types of operation: additions (including subtractions) and max∗ operations. For an n-bit decoding frame, the complexity analysis shows that the decoding algorithm includes 97n − 10 additions and 30n − 20 max∗ operations.


To determine the energy consumption of each operation in the algorithm, we implemented the addition and the max∗ operation as gate-level designs based on the STMicroelectronics 0.12 µm technology standard cell library. We consider a max∗ operation supported by a 4-element LUT here. The data width of the operation units is 8 bits, which is sufficient for the target convolutional decoding process according to the results in Chapter 3. The energy consumption of each operation was then analysed with the power analysis tool Synopsys PrimeTime [75]; this procedure strictly follows the standard ASIC design flow. Under the assumptions that the critical path of the implementation contains no more than 10 adders and that the system clock is lower than 10 MHz, our power analysis gives a typical energy consumption of an addition operation in our specification of Eadd = 0.04591 pJ. With the same specification, the typical energy consumption of a max∗ operation is Emax∗ = 175 pJ. Note that a max∗ operation includes more than one comparison and addition operation as well as a 4-element LUT, so it consumes much more energy than a single addition. A conventional complexity analysis cannot take such differences between the operations into account. Therefore, based on

the complexity analysis and our power analysis results, the total energy consumption by

the operations in the UMTS decoding algorithm can be calculated by:

Eoperations = Eadd × (97n − 10) + Emax∗ × (30n − 20) (4.3)

The result is Eoperations = 5254.4n − 3500 pJ. Therefore, for a 40-bit frame the decoding energy consumed by the operations is 2.07 × 10^5 pJ, and for a 5114-bit frame it is 2.69 × 10^7 pJ.
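The evaluation of (4.3) is easily reproduced; the short C sketch below combines the operation counts with the per-operation energies quoted above (the function and variable names are ours).

#include <stdio.h>

/* Level-1 estimate of the operation energy of one constituent decoder,
 * following (4.3): 97n - 10 additions and 30n - 20 max* operations. */
static double operations_energy_pj(long n)
{
    const double E_add  = 0.04591;   /* pJ per 8-bit addition          */
    const double E_maxs = 175.0;     /* pJ per max* (with 4-entry LUT) */
    return E_add * (97.0 * n - 10.0) + E_maxs * (30.0 * n - 20.0);
}

int main(void)
{
    printf("40-bit frame:   %.3g pJ\n", operations_energy_pj(40));    /* ~2.07e5 */
    printf("5114-bit frame: %.3g pJ\n", operations_energy_pj(5114));  /* ~2.69e7 */
    return 0;
}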

To consider the energy consumption due to the memory requirement of the algorithm, we first need to determine the total memory requirement of the algorithm, that is, how many variables are required to be stored during the decoding. This requires analysing the dependences between the different stages of the algorithm. According to the introduction in Chapter 2, in the UMTS Turbo decoder the decoding algorithm of the convolutional code includes five stages: the calculation of γ, α, β, δ and ye. The dependences between the stages are shown in Figure 4.1. As shown in the figure, the γ values are required to be stored for the calculation of α, β and δ, and the γ, α and β values all need to be stored for the calculation of δ. Since the input can be used to calculate γ straight away, and δ can be used to calculate ye and the output straight away, there is no need to store these variables. The energy consumption of the memory basically depends on the number of memory reads and writes. The number of writes is equal to the number of variables that need to be stored, since each variable only needs to be written to the memory once. The number of reads is equal to the number of times the variables are used in the algorithm, since the variables are only available from the memory: every use of a stored variable requires one read from the memory. According to the algorithm, for an n-bit frame there are 32n − 40 γ values, 8n − 10 α values, and 8n − 10 β values. Therefore, 48n − 60 write accesses are required during the decoding.


Figure 4.1: The dependence between the different stages.

Each γ is used once in the α calculation and once in the β calculation, and only half of the γ values are used once in the δ calculation; therefore, the total number of γ reads is 2.5 × (32n − 40) = 80n − 100. Each α is used once for the δ calculation, inducing 8n − 10 reads, and each β is used once for the δ calculation, inducing another 8n − 10 reads. In total, 96n − 120 reads are required for the decoding. Due to the lack of memory standard cells in our standard cell library, the power analysis of a memory unit cannot be performed. Therefore, we used the datasheet of a 64 Mbit memory product, the NEC uPD4564163 [76], to calculate the energy consumption of the memory. According to the datasheet, assuming no read misses occur during the decoding process, each read or write operation requires at least 2 clock cycles and consumes 9900 pJ. Note that this product is outdated compared with the technology we used to estimate the operation energy consumption; with a more appropriate memory product, the energy consumption might be reduced. In this case, a read and a write consume the same amount of energy in the memory. The total energy consumption can then be calculated simply as:

Ememory = 9900 × (96n − 120) = 9.5 × 10^5 · n − 1.19 × 10^6 (pJ)    (4.4)

Therefore, for a 40-bit frame, the total energy consumption of the memory is Ememory = 3.68 × 10^7 pJ. For a 5114-bit frame, Ememory = 4.86 × 10^9 pJ.
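The memory-access bookkeeping and (4.4) can likewise be written down directly, as the following C sketch illustrates for an n-bit frame; the names are ours, and the 9900 pJ figure is the per-access energy taken from the datasheet above.

#include <stdio.h>

/* Level-1 memory energy estimate for one constituent decoder. */
static void memory_energy(long n)
{
    long writes = (32 * n - 40) + (8 * n - 10) + (8 * n - 10);   /* 48n - 60 */
    long reads  = (long)(2.5 * (32 * n - 40))                    /* gamma reads: 80n - 100 */
                + (8 * n - 10) + (8 * n - 10);                   /* alpha, beta reads      */
    const double E_access = 9900.0;                              /* pJ per read or write   */

    /* (4.4) charges the read accesses with the per-access energy figure. */
    double E_mem = E_access * reads;
    printf("n=%ld: %ld writes, %ld reads, E_memory = %.3g pJ\n",
           n, writes, reads, E_mem);
}

int main(void)
{
    memory_energy(40);     /* ~3.68e7 pJ */
    memory_energy(5114);   /* ~4.86e9 pJ */
    return 0;
}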

At this level of the framework, the analysis results allow the comparison of different Turbo decoding algorithms from both the operation energy point of view and the memory energy


point of view. If the same technology library is available for both the memory and the standard cells, it is reasonable to consider the sum of these two parts as the energy consumption estimate directly related to the algorithm. As discussed, the remaining part of the system energy consumption depends strongly on the design strategy and hence cannot be accurately estimated at the algorithm level.

In the next section, we discuss the future work on level 2 of our framework, which aims to give a total energy consumption estimate considering all the possible energy contributions in the system. As discussed, since accurate estimation of the total system energy consumption is impossible at the algorithm level, reasonable assumptions are required for level 2 of the framework.

4.3.2 Future work: Level 2 of the framework

The level 2 framework we propose is based on complexity, memory and parallelism analysis of the mathematical description of the algorithm. By converting the description into factor graphs, the computing resources, memory requirement and control unit parameters can be obtained. The energy estimation is based on a look-up table of the energy consumption of the different entities in the design. The look-up table is built using simulations of a particular technology library, in our case the STMicroelectronics 0.12 µm process standard cell library. A digital circuit system can be divided into three components, namely the datapath architecture, the system memory and the controller. Hence, the total energy consumption of a system is divided into three parts, as expressed in (4.5):

Etotal = Edatapath + Ememory + Econtroller (4.5)

Our framework estimates the energy using clock-cycle-accurate analysis and timing analysis. The cycle-accurate analysis considers the energy consumption of the different hardware components in different modes of operation, typically the operating mode and the idle mode. The timing analysis considers the required number of operations of the different components and the total time taken to process a typical task (e.g. decoding a data frame). The energy consumption of the system, such as the average energy consumption per clock cycle or the energy consumption of a particular task, can then be obtained.

A flowchart of the framework is shown in Figure 4.2. The mathematical description of an algorithm is the basic input of the framework; an executable program is not required. A complicated algorithm is usually divided into many steps for ease of implementation, so a partitioning analysis of the algorithm is needed. After this, the algorithm can be converted into a factor-graph-based description. In our framework, a factor graph is used to describe the computational complexity of the algorithm and an overall flowchart is used to describe the dependences between the steps. The required computing resources can be estimated based on the factor graph.


Figure 4.2: Flowchart of energy estimation framework.

Note that the computing resource estimation includes all the entities in the datapath. By considering the resource requirement of each step and the dependence information in the overall flowchart, the overall resource constraint can be obtained by analysis. In addition, estimates of the control signals, controller states, memory requirement and memory accesses are produced. Timing analysis is performed with the information from the factor graph and the overall flowchart; this generates the total clock cycle requirement, which gives the clock frequency constraint and is used for the later cycle-accurate estimation. Finally, the three parts of the energy consumption, for the datapath, the memory and the controller, can be obtained, as shown in Figure 4.2.


Chapter 5

Conclusions and Further Works

In this report, we have given an investigation of the state of the art in the development of wireless communication systems for BANs. Based on the investigation of previous works, we proposed applying Turbo-like codes as a promising solution for the channel coding scheme of a BAN communication system. Based on this proposal, we identified the need to explore the fixed-point low power implementation of Turbo-like codes and to evaluate the different Turbo-like codes from the energy consumption point of view.

Therefore, in Chapter 3 we proposed a method based on EXIT chart analysis to determine the optimal data width specification of a Turbo-like decoding algorithm for a fixed-point low power implementation. This issue is significantly important for the energy consumption of the implementation. We presented our method by applying it to the UMTS Turbo decoder, considered the different ways of handling the overflow issue in the implementation, and compared our results with previous works. The advantages of our method compared with the conventional BER/FER chart analysis method were revealed.

In Chapter 4, we proposed a framework to evaluate the different Turbo-like codes from the energy consumption point of view. The framework has two levels. Level 1 of the framework considers the energy consumption of the required ACS operations in the algorithm and the related memory requirement. Level 1 has a relatively simple procedure, which can be easily applied to the target algorithms; it offers a better evaluation of the algorithms, from the energy consumption point of view, than a conventional complexity evaluation, but with little extra effort required. Level 2 of the framework is the future work plan of the project. It aims to create a procedure that allows gate-level energy consumption estimation of the target algorithms without requiring hardware design knowledge. Some details of this level of the framework were discussed.

The work presented in this report is the preparation for exploring the novel decoding scheme on the relays in BANs proposed in Chapter 1. Therefore, the future work is to


apply the two proposed methods to Turbo-like codes considered suitable for BAN applications. Taking into consideration the other contributions to the energy consumption on the relays, including the receiving and transmission power and the modulation and demodulation schemes, our novel decoding scheme can then be investigated.


Bibliography

[1] S. Drude, “Requirements and applications scenarios for body area networks,” in

Mobile and Wireless Communications Summit, 2007. 16th IST, 2007.

[2] T. G. Zimmerman, “Personal area networks: Nearfield intrabody communication,”

IBM System Journal, vol. 35, pp. 609–617, 1996.

[3] B. Zhen, H. Li, and R. Kohno, “IEEE body area networks for medical applica-

tions,” in Wireless Communication Systems, 2007. ISWCS 2007. 4th International

Symposium on, 2007.

[4] M. L. R. Fox, H. Symons, S. Berson, and H. Westphal, “Fcc pro-

poses rules for body area networks (mban),” Jul. 2009, access

from:http://mobihealthnews.com/3078/fcc-proposes-rules-for-body-area-networks-

mban/.

[5] “Revision of part 15 regarding ultra-wideband transmission systems. first report and

order, et docket, 98-153, fcc 02-48,” Federal Communications Commission (FCC),

Tech. Rep., 2002.

[6] J. Ryckaert, C. Desset, V. de Heyn, M. Badaroglu, P. Wambacq, G. V. der Plas,

and B. V. Poucke, “Ultra-wideband transmitter for wireless body area networks,”

in Proceeding on 14th IST Mobile & Wireless Communications Summit, Jun. 2005.

[7] H. Li, K. Takizawa, B. Zhen, and R. Kohno, “Body area network and its standard-

ization at IEEE 802.15.MBAN,” in Mobile and Wireless Communications Summit,

2007. 16th IST, Jul. 2007, pp. 1–5.

[8] J. A. D. Moutinho, “Wireless body area network,” 2009, rECIN2009.

[9] V. M. Jones, R. G. A. Bults, D. Konstantas, and P. A. M. Vierhout, “Healthcare

pans: Personal area networks for trauma care and home care,” in In 4th Inter-

national Symposium on Wireless Personal Multimedia Communications (WPMC),

2001, pp. 1369–1374.

[10] M. Soini, J. Nummela, P. Oksa, L. Ukkonen, and L. Sydnheimo, “Wireless body area

network for hip rehabilitation system,” Ubiquitous Computing and Communication

Journal, vol. 3, p. 7, 2008.


[11] B. Zhen, “Ban technical requirements,” IEEE 802.15.TG6, Tech. Rep., Sep. 2008.

[12] B. Latr, I. Moerman, B. Dhoedt, and P. Demeester, “Networking in wireless body

area networks,” in in 5th FTW PHD Symposium, Interactive poster session, Dec.

2004, p. 113.

[13] C. K. Singh and A. Kumar, “Performance evaluation of an IEEE 802.15.4 sensor

network with a star topology,” Wireless Networks, vol. 14, no. 4, pp. 543–568, Aug.

2008.

[14] S. Choi, S. Song, K. Sohn, H. Kim, J. Kim, J. Yoo, and H. Yoo, “A low-power star-

topology body area network controller for periodic data monitoring around and

inside the human body,” in 2006 10th IEEE International Symposium on Wearable

Computers, Oct. 2006, pp. 139–140.

[15] A. G. Ruzzelli, R. Jurdak, G. M. P. O’Hare, and P. V. D. Stok, “Energy-efficient

multi-hop medical sensor networking,” in Proceedings of the 1st ACM SIGMOBILE

international workshop on Systems and networking support for healthcare and as-

sisted living environments, 2007, pp. 37–42.

[16] B. Latre, B. Braem, I. Moerman, C. Blondia, E. Reusens, W. Joseph, and P. De-

meester, “A low-delay protocol for multihop wireless body area networks,” in Mo-

biQuitous 2007. Fourth Annual International Conference on Mobile and Ubiquitous

Systems: Networking & Services, Aug. 2007, pp. 1–8.

[17] J. Misic, “Enforcing patient privacy in healthcare WSNs using ECC implemented on

802.15.4 beacon enabled clusters,” in Pervasive Computing and Communications,

2008. PerCom 2008. Sixth Annual IEEE International Conference on, 2008.

[18] J. Rousselot, A. El-Hoiydi, and J.-D. Decotignie, “Performance evaluation of the

IEEE 802.15.4a UWB physical layer for body area networks,” in Computers and

Communications, 2007. ISCC 2007. 12th IEEE Symposium on, 2007.

[19] D. Domenicali and M.-G. D. Benedetto, “Performance analysis for a body area net-

work composed of IEEE 802.15.4a devices,” in Proceedings of 4th Workshop on Po-

sitioning, Navigation and Communication 2007(WPNC’07), Hannover, Germany,

Mar. 2007, pp. 273–276.

[20] M. R. Yuce, “Implementation of body area networks based on MICS/WMTS med-

ical bands for healthcare systems,” in IEEE Engineering in Medicine and Biology

Society Conference (IEEE EMBC08), Aug 2008, pp. 3417–3421.

[21] S. Stoa, I. Balasingham, and T. A. Ramstad, “Data throughput optimization in

the ieee 802.15.4 medical sensor networks,” in ISCAS 2007. IEEE International

Symposium on Circuits and Systems, May. 2007, pp. 1361–1364.


[22] X. Liang and I. Balasingham, “Performance analysis of the IEEE 802.15.4 based ecg

monitoring network,” in Proceeding of The Seventh IASTED International Confer-

ences on Wireless and Optical Communications (WOC’07), 2007.

[23] R. C. Shah, L. Nachman, and C. Wan, “On the performance of bluetooth and ieee

802.15.4 radios in a body area network,” in Proceedings of the ICST 3rd interna-

tional conference on Body area networks, Tempe, Arizona, 2008.

[24] D. D. Arumugam and D. W. Engels, “Impacts of rf radiation on the human body

in a passive rfid environment,” in 2008 IEEE Antennas and Propagation Society

International Symposium, Jul. 2008, pp. 1–4.

[25] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near shannon limit error correcting

coding and decoding: Turbo codes,” in IEEE Proceedings of the Int. Conf. on

Communications, 1993.

[26] C. Berrou and A. Glavieux, “Near optimum error correcting coding and decoding:

Turbo-codes,” IEEE Trans. on Communications, vol. 44, no. 10, pp. 1261–1271,

Oct. 1996.

[27] I. Joe, “Energy efficiency maximization for wireless sensor networks,” International

Federation for Information Processing, vol. 211, pp. 115–122, 2006.

[28] S. Benedetto and G. Montorsi, “Serial concatenated of block and convolutional

codes,” Electronics Letters, vol. 32, no. 10, pp. 887–888, May. 1996.

[29] R. Gallager, “Low-density parity-check codes,” IRE Transaction on Information

Theory, vol. 8, no. 1, pp. 21–28, Jan. 1962.

[30] J. G. D. Forney, “Concatenated codes,” Massachusetts Institute of Technology Re-

search Lab of Electronics, Tech. Rep., 1966.

[31] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” SIAM

Journal of Applied Math, vol. 8, pp. 300–304, 1960.

[32] P. Elias, “Coding for noisy channels,” in IRE Convention Record Pt. 4, 1955, pp.

37–37.

[33] J. H. Yuen, M. K. Simon, W. Miller, F. Pollara, C. R. Ryan, D. Divsalar, and

J. C. Morakis, “Modulation and coding for satellite and space communications,” in

Proceedings of the IEEE, vol. 78, no. 7, Jul. 1990, pp. 1250–1265.

[34] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum

decoding algorithm,” IEEE Transactions on Information Theory, vol. IT-13, pp.

493–497, Apr. 1967.

[35] E. Boutillon, C. Douillard, and G. Montorsi, “Iterative decoding of concatenated

convolutional codes: Implementation issues,” in Proceedings of the IEEE, 2007.


[36] B. Sklar, Fundamentals of Turbo Codes, Digital Communications: Fundamentals

and Applications, Second Edition. Prentice-Hall, 2001.

[37] C. Schlegel and L. Perez, Trellis and Turbo coding, ser. IEEE Press series on digital

& mobile communication, J. B. Anderson, Ed. John Wiley & Sons, 2004.

[38] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for

minimizing symbol error rate,” IEEE Transactions on Information Theory, vol. 20,

no. 3, pp. 284–287, Mar. 1974.

[39] “3rd generation partnership project; technical specification group radio access net-

work; multiplexing and channel coding (tdd) (release 7),” 3GPP Organizational

Partners (ARIB, ATIS, CCSA, ETSI, TTA, TTC), Tech. Rep., 2008.

[40] C. Weiss, C. Bettstetter, and S. Riedel, “Code construction and decoding of parallel

concatenated tail-biting codes,” IEEE Transactions on Information Theory, vol. 47,

no. 10, pp. 366–386, Jan. 2001.

[41] S. ten Brink, “Convergence behavior of iteratively decoded parallel concatenated

codes,” IEEE Transactions on Communications, vol. 49, no. 10, pp. 1727–1737,

Oct. 2001.

[42] H. Michel and N. Wehn, “Turbo-decoder quantization for umts,” IEEE Communi-

cation Letters, vol. 5, no. 2, pp. 55–57, Feb. 2001.

[43] M. A. Castellon, I. J. Fair, and D. G. Elliott, “Fixed-point Turbo decoder implemen-

tation suitable for embedded applications,” in Electrical and Computer Engineering,

2005. Canadian Conference on, May. 2005, pp. 1065–1068.

[44] J. Hsu and C. Wang, “On finite-precision implementation of a decoder for turbo

codes,” in Proceedings of the 1999 IEEE International Symposium on, vol. 4, Or-

lando, FL, USA, Jul. 1999, pp. 423–426.

[45] A. Worm, H. Michel, F. Gilbert, G. Kreiselmaier, M. Thul, and N. Wehn, “Ad-

vanced implementation issues of turbo-decoders,” in Proc. 2nd International Sym-

posium on Turbo-Codes and Related Topics, 2000, pp. 351–354.

[46] T. K. Blankenship and B. Classon, “Fixed-point performance of low-complexity

turbo decoding algorithms,” in Vehicular Technology Conference, 2001. IEEE VTS

53rd, vol. 2, Rhodes, Greece, 2001, pp. 1483–1487.

[47] G. Montorsi and S. Benedetto, “Design of fixed-point iterative decoders for concate-

nated codes with interleavers,” IEEE Journal on Selected Areas in Communications,

vol. 19, pp. 871–882, 2001.

[48] Y. Wu, B. D. Woerner, and T. K. Blankenship, “Data width requirements in siso

decoding with modulo normalization,” IEEE Transactions on Communications,

vol. 49, no. 11, pp. 1861–1868, Nov. 2001.


[49] R. Hoshyar, A. R. S. Bahai, and R. Tafazolli, “Finite precision Turbo decoding,”

in Proc. 3rd International Symposiumon Turbo Codes and Related Topics, Brest,

France, Sep. 2003, pp. 483–486.

[50] A. Morales-Cortes, R. Parra-Michel, L. F. Gonzalez-Perez, and T. G. Cervantes,

“Finite precision analysis of the 3gpp standard turbo decoder for fixed-point im-

plementation in fpga devices,” in Reconfigurable Computing and FPGAs, 2008.

International Conference on, Dec. 2008, pp. 43–48.

[51] S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, “A soft-input soft-output

app module for iterative decoding of concatenated codes,” IEEE Communications

Letters, vol. 1, no. 1, pp. 22–24, Jan. 1997.

[52] V. Singh, “Elimination of overflow oscillations in fixed-point state-space digital filters using saturation arithmetic,” IEEE Transactions on Circuits and Systems,

vol. 37, no. 6, pp. 814–818, Jun. 1990.

[53] D. A. Balley and A. A. Beer, “Simulation of filter structures for fixed-point imple-

mentation,” in Proceeding of the 28th Southeastern Symposium on System Theory,

Baton Rouge, LA, USA, 1996, pp. 270–274.

[54] G. Masera, Turbo Code Applications: a journey from a paper to realization, K. Srip-

imanwat, Ed. Springer Netherlands, 2005.

[55] A. Hekstra, “An alternative to metric rescaling in viterbi decoders,” IEEE Tran-

scations on Communications, vol. 37, pp. 1220–1222, Nov. 1989.

[56] B. Riaz and J. Bajcsy, “Impact of finite precision arithmetics on exit chart analysis

of turbo codes,” in 5th IEEE Consumer Communications and Networking Confer-

ence, 2008. CCNC 2008., 2008.

[57] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and sub-

optimal map decoding algorithm operating in the log domain,” in Proceeding of

IEEE International Conference of Communication, 1995, pp. 1009–1013.

[58] M. C. Valenti and J. Sun, “The UMTS Turbo code and an efficient dcoder imple-

mentation suitable for software-defined radios,” International Journal of Wireless

Information Networks, vol. 8, no. 4, pp. 203–215, Oct. 2001.

[59] F. N. Jajm, “A survey of power estimation techniques in VLSI circuits,” IEEE

Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp.

446–455, Dec. 1994.

[60] O. Celebican, T. S. Rosing, and V. J. M. III, “Energy estimation of peripheral

devices in embedded systems,” in Proceedings of the 14th ACM Great Lakes sym-

posium on VLSI, Boston, MA, USA, 2004, pp. 430 – 435.

[61] J. Kaza and C. Chakrabarti, “Design and implementation of low-energy turbo decoders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 968–977, Sep. 2004.

[62] S. Chouhan, R. Bose, and M. Balakrishnan, “A framework for energy-consumption-based design space exploration for wireless sensor nodes,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 7, pp. 1017–1024, Jul. 2009.

[63] E. Macii, “CAD algorithms, methods and tools for low-power circuits and systems,” IEEE Technology Surveys, Tech. Rep., 2006.

[64] G. Masera, M. Mazza, G. Piccinini, F. Viglione, and M. Zamboni, “Architectural strategies for low-power VLSI Turbo decoders,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 3, pp. 279–285, Jun. 2002.

[65] K. Hildingsson, T. Arslan, and A. T. Erdogan, “Energy evaluation methodology for platform based system-on-chip design,” in Proceedings of the IEEE Computer Society Annual Symposium, Feb. 2004, pp. 61–68.

[66] T. V. Aa, M. Jayapala, F. Barat, H. Corporaal, F. Catthoor, and G. Deconinck, “A high-level memory energy estimator based on reuse distance,” in Proceedings of the 3rd Workshop on Optimizations for DSP and Embedded Systems (ODES’05), San Jose, CA, USA, Mar. 2005.

[67] J. Laurent, E. Senn, N. Julien, and E. Martin, “High-level energy estimation for DSP systems,” in Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS 2001), 2001, pp. 311–316.

[68] C. Menn, O. Bringmann, and W. Rosenstiel, “Controller estimation for FPGA target architectures during high-level synthesis,” in Proceedings of the 15th International Symposium on System Synthesis, 2002.

[69] P. Landman, “High-level power estimation,” in Proceedings of ISLPED, 1996, pp. 29–35.

[70] P. Surti and L. Chao, “Controller power estimation using information from behavioral description,” in ISCAS ’96, vol. 4, May 1996, pp. 679–682.

[71] J. N. Kozhaya and F. N. Najm, “Accurate power estimation for large sequential circuits,” in Proceedings of the 1997 IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, USA, 1997, pp. 488–493.

[72] M. Lesser and V. Ohm, “Accurate power estimation for sequential CMOS circuits using graph-based methods,” VLSI Design, vol. 12, pp. 187–203, 2001.

[73] M. Khaddour and O. Hammami, “High level energy consumption estimation of cryptographic algorithms,” in 3rd International Conference on Information and Communication Technologies (ICTTA 2008), Apr. 2008, pp. 1–6.

[74] A. B. A. Garcia, J. Gobert, T. Dombek, H. Mehrez, and F. Petrot, “Energy estimations in high level cycle-accurate descriptions of embedded systems,” in Proceedings of the 5th International Workshop on Design and Diagnostics of Electronic Circuits and Systems (DDECS 2002), Brno, Czech Republic, Apr. 2002, pp. 228–235.

[75] “PrimeTime datasheet,” Synopsys, Tech. Rep., 2009.

[76] “NEC uPD4564163 datasheet: 64 Mbit synchronous DRAM, 4-bank, LVTTL,” NEC, Tech. Rep., 1998.