
EURASIP Journal on Wireless Communications and Networking

Advances in Error Control Coding Techniques

Guest Editors: Yonghui Li, Jinhong Yuan, Andrej Stefanov, and Branka Vucetic



Copyright © 2008 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2008 of "EURASIP Journal on Wireless Communications and Networking." All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief

Luc Vandendorpe, UCL, Belgium

Associate Editors

Thushara D. Abhayapala, Australia
Farid Ahmed, USA
Mohamed H. Ahmed, Canada
Alagan Anpalagan, Canada
Carles Anton-Haro, Spain
Anthony C. Boucouvalas, Greece
Lin Cai, Canada
Yuh-Shyan Chen, Taiwan
Biao Chen, USA
Pascal Chevalier, France
Chia-Chin Chong, South Korea
Huaiyu Dai, USA
Soura Dasgupta, USA
Ibrahim Develi, Turkey
Petar M. Djuric, USA
Mischa Dohler, Spain
Abraham O. Fapojuwo, Canada
Michael Gastpar, USA
Alex B. Gershman, Germany
Wolfgang Gerstacker, Germany
David Gesbert, France
Fary Ghassemlooy, UK
Christian Hartmann, Germany
Stefan Kaiser, Germany
G. K. Karagiannidis, Greece
Chi Chung Ko, Singapore
Visa Koivunen, Finland
Nicholas Kolokotronis, Greece
Richard Kozick, USA
S. Lambotharan, UK
Vincent Lau, Hong Kong
David I. Laurenson, UK
Tho Le-Ngoc, Canada
Wei Li, USA
Tongtong Li, USA
Zhiqiang Liu, USA
Steve McLaughlin, UK
Sudip Misra, India
Ingrid Moerman, Belgium
Marc Moonen, Belgium
Eric Moulines, France
Sayandev Mukherjee, USA
Kameswara Rao Namuduri, USA
Amiya Nayak, Canada
Claude Oestges, Belgium
A. Pandharipande, The Netherlands
Phillip Regalia, France
A. Lee Swindlehurst, USA
Sergios Theodoridis, Greece
George S. Tombras, Greece
Lang Tong, USA
Athanasios Vasilakos, Greece
Ping Wang, Canada
Weidong Xiang, USA
Yang Xiao, USA
Xueshi Yang, USA
Lawrence Yeung, Hong Kong
Dongmei Zhao, Canada
Weihua Zhuang, Canada


Contents

Advances in Error Control Coding Techniques, Yonghui Li, Jinhong Yuan, Andrej Stefanov, and Branka Vucetic
Volume 2008, Article ID 574783, 3 pages

Structured LDPC Codes over Integer Residue Rings, Elisa Mo and Marc A. Armand
Volume 2008, Article ID 598401, 9 pages

Differentially Encoded LDPC Codes—Part I: Special Case of Product Accumulate Codes, Jing Li (Tiffany)
Volume 2008, Article ID 824673, 14 pages

Differentially Encoded LDPC Codes—Part II: General Case and Code Optimization, Jing Li (Tiffany)
Volume 2008, Article ID 367287, 10 pages

Construction and Iterative Decoding of LDPC Codes Over Rings for Phase-Noisy Channels, Sridhar Karuppasami and William G. Cowley
Volume 2008, Article ID 385421, 9 pages

New Technique for Improving Performance of LDPC Codes in the Presence of Trapping Sets, Esa Alghonaim, Aiman El-Maleh, and Mohamed Adnan Landolsi
Volume 2008, Article ID 362897, 12 pages

Distributed Generalized Low-Density Codes for Multiple Relay Cooperative Communications, Changcai Han and Weiling Wu
Volume 2008, Article ID 852397, 9 pages

Reed-Solomon Turbo Product Codes for Optical Communications: From Code Optimization to Decoder Design, Raphael Le Bidan, Camille Leroux, Christophe Jego, Patrick Adde, and Ramesh Pyndiah
Volume 2008, Article ID 658042, 14 pages

Complexity Analysis of Reed-Solomon Decoding over GF(2^m) without Using Syndromes, Ning Chen and Zhiyuan Yan
Volume 2008, Article ID 843634, 11 pages

Efficient Decoding of Turbo Codes with Nonbinary Belief Propagation, Charly Poulliat, David Declercq, and Thierry Lestable
Volume 2008, Article ID 473613, 10 pages

Space-Time Convolutional Codes over Finite Fields and Rings for Systems with Large Diversity Order, Mario de Noronha-Neto and B. F. Uchoa-Filho
Volume 2008, Article ID 624542, 7 pages

Joint Decoding of Concatenated VLEC and STTC System, Huijun Chen and Lei Cao
Volume 2008, Article ID 890194, 8 pages


Average Throughput with Linear Network Coding over Finite Fields: The Combination Network Case, Ali Al-Bashabsheh and Abbas Yongacoglu
Volume 2008, Article ID 329727, 7 pages

MacWilliams Identity for Codes with the Rank Metric, Maximilien Gadouleau and Zhiyuan Yan
Volume 2008, Article ID 754021, 13 pages


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 574783, 3 pages
doi:10.1155/2008/574783

Editorial

Advances in Error Control Coding Techniques

Yonghui Li,1 Jinhong Yuan,2 Andrej Stefanov,3 and Branka Vucetic1

1 School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW 2006, Australia
2 School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia
3 Department of Electrical and Computer Engineering, Polytechnic University, 6 Metrotech Center, Brooklyn, NY 11201, USA

Correspondence should be addressed to Yonghui Li, [email protected]

Received 9 September 2008; Accepted 9 September 2008

Copyright © 2008 Yonghui Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the past decade, significant progress has been reported in the field of error control coding. In particular, the innovation of turbo codes and the rediscovery of LDPC codes have been recognized as two significant breakthroughs in this field. The distinct features of these capacity-approaching codes have enabled them to be widely proposed and/or adopted in existing wireless standards. Furthermore, the invention of space-time coding significantly increased the capacity of wireless systems, and these codes have been widely applied in broadband communication systems. Recently, new coding concepts exploiting the distributed nature of networks have been developed, such as network coding and distributed coding techniques. They have great potential for applications in wireless, sensor, and ad hoc networks. Despite recent advances, many challenging problems still remain. This special issue is intended to present the state-of-the-art results in the theory and applications of coding techniques.

The special issue received twenty-six submissions, from which thirteen papers were finally selected after a rigorous review process. They reflect recent advances in the area of error control coding.

In the first paper, "Structured LDPC codes over integer residue rings," Mo and Armand designed a new class of low-density parity-check (LDPC) codes over integer residue rings. The codes are constructed from regular Tanner graphs by using Latin squares over a multiplicative group of a Galois ring, rather than a finite field. The proposed approach is suitable for the design of codes with a wide range of rates. One feature of these codes is that their minimum pseudocodeword weights are equal to their minimum Hamming distances.

The next two-part series of papers, "Differentially encoded LDPC codes—Part I: special case of product accumulate codes" and "Differentially encoded LDPC codes—Part II: general case and code optimization," by Jing Li (Tiffany), studies the theory and practice of differentially encoded low-density parity-check (DE-LDPC) codes in the context of noncoherent detection. Part I studies a special class of DE-LDPC codes, product accumulate codes. The more general case of DE-LDPC codes, where the LDPC part may take arbitrary degree profiles, is studied in Part II. The analysis reveals that a conventional LDPC code is not well suited to differential coding and does not in general deliver a desirable performance when detected noncoherently. Through extrinsic information transfer (EXIT) analysis and a modified "convergence constraint" density evolution (DE) method, a characterization of suitable LDPC degree profiles is provided. The convergence-constraint method provides a useful extension to the conventional "threshold-constraint" method, and can match an outer LDPC code to any given inner code with the imperfectness of the inner decoder taken into consideration.

In the fourth paper, "Construction and iterative decoding of LDPC codes over rings for phase-noisy channels," Karuppasami and Cowley propose a design and decoding method for LDPC codes over channels with phase noise. The new code applies blind or turbo estimators to provide signal phase estimates over each observation interval. It is resilient to phase rotations of 2π/M, where M is the number of phase symmetries in the signal set, and estimates phase ambiguities in each observation interval.

A novel approach for enhancing decoder performance in the presence of trapping sets, based on a new concept called trapping set neutralization, is proposed in the fifth paper, "New technique for improving performance of LDPC codes in the presence of trapping sets," by E. Alghonaim et al. The effect of a trapping set can be eliminated by setting the intrinsic and extrinsic values of its variable nodes to zero. After a trapping set is neutralized, the estimated values of variable nodes are affected only by external messages from nodes outside the trapping set. The most harmful trapping sets are identified by means of simulation. To neutralize identified trapping sets, a simple algorithm is introduced to store trapping-set configuration information in variable and check nodes.

The design of efficient distributed coding schemes for cooperative communication networks has recently attracted significant attention. A distributed generalized low-density (GLD) coding scheme for multiple-relay cooperative communications is developed by Han and Wu in the sixth paper, "Distributed generalized low-density codes for multiple relay cooperative communications." Using the partial error-detecting and error-correcting capabilities of the GLD code, each relay node decodes and forwards some of the constituent codes of the GLD code to cooperatively form a distributed GLD code. The scheme works effectively and keeps a fixed overall code rate when the number of relay nodes varies. Furthermore, partial decoding at the relays is allowed, and a progressive processing procedure is proposed to reduce the complexity and adapt to source-relay channel variations. Simulation results verify that distributed GLD codes with various numbers of relay nodes can obtain significant performance gains in quasistatic fading channels compared with the strategy without cooperation.

Since the early 1990s, the progressive introduction of inline optical amplifiers and the advent of wavelength division multiplexing (WDM) have accelerated the use of FEC in optical fiber communications to reduce system costs and improve margins against various line impairments, such as beam noise, channel crosstalk, and nonlinear dispersion. In contrast to the first and second generations of FEC codes for optical communications, which are based on Reed-Solomon (RS) codes and concatenated codes with hard-decision decoding, third-generation FEC codes with soft-decision decoding are attractive for reducing costs by relaxing the requirements on expensive optical devices in high-capacity systems. In this regard, the seventh paper, "Reed-Solomon turbo product codes for optical communications: from code optimization to decoder design," by Le Bidan et al., investigates the use of turbo product codes with Reed-Solomon component codes for 40 Gb/s optical transport networks and 10 Gb/s passive optical networks. The issues of code design and a novel ultra-high-speed parallel decoding architecture are developed. The complexity and performance trade-off of the scheme is also carefully addressed in this paper.

Recently, there has been renewed interest in decoding Reed-Solomon (RS) codes without using syndromes. In the eighth paper, "Complexity analysis of Reed-Solomon decoding over GF(2^m) without using syndromes," Chen and Yan investigated the complexity of syndromeless decoding for RS codes and compared it to that of syndrome-based decoding algorithms. The complexity analysis in their paper mainly focuses on RS codes over characteristic-2 fields, for which some multiplicative FFT techniques are not applicable. Their findings show that for high-rate RS codes, syndromeless decoding algorithms require more field operations, and have higher hardware costs and lower throughput, than syndrome-based decoding algorithms. They also derived tighter bounds on the complexities of fast polynomial multiplications based on Cantor's approach and the fast extended Euclidean algorithm.

In the ninth paper, "Efficient decoding of turbo codes with nonbinary belief propagation," by Poulliat et al., a new approach to decoding turbo codes with a nonbinary belief propagation algorithm is proposed. The approach consists in representing groups of turbo code binary symbols by a nonbinary Tanner graph and applying group belief iterative decoding. The parity-check matrices of the turbo codes need to be preprocessed to ensure good topological properties of the code. This preprocessing introduces an additional diversity, which is exploited to improve the decoding performance.

The tenth paper, "Space-time convolutional codes over finite fields and rings for systems with large diversity order," by Uchoa-Filho and Noronha-Neto, proposes a convolutional encoder over the finite ring of integers to generate a space-time convolutional code (STCC). Under this structure, the paper proves three interesting properties of the generator matrix of the convolutional code that can be used to simplify the code search procedure for STCCs over the finite ring of integers. The properties establish equivalences among STCCs, so that many convolutional codes can be discarded in the code search without any loss.

Providing high-quality multimedia services has become an attractive application of wireless communication systems. In the eleventh paper, "Joint decoding of concatenated VLEC and STTC system," Chen and Cao propose a joint source-channel coding scheme for wireless fading channels, which combines variable-length error-correcting codes (VLECs) and space-time trellis codes (STTCs) to provide bandwidth-efficient data compression, as well as coding and diversity gains. At the receiver, an iterative joint source and space-time decoding algorithm is developed to utilize the redundancy in both the STTC and the VLEC to improve overall decoding performance. In their paper, various issues, such as the inseparable systematic information at the symbol level, the asymmetric trellis structure of the VLEC, information exchange between the bit and symbol domains, and the rate allocation between the STTC and the VLEC, have been investigated.

In the twelfth paper, “Average throughput with linearnetwork coding over finite fields: the combination networkcase,” Al-Bashabsheh and Yongacoglu extend the averagecoding throughput measure to include linear coding overarbitrary finite fields. They characterize the average linearnetwork coding throughput for the combination networkwith min-cut 2 over an arbitrary finite field, and providea network code, which is completely specified by the fieldsize and achieves the average coding throughput for thecombination network.

The MacWilliams identity and related identities for linear codes with the rank metric are derived in the thirteenth paper, "MacWilliams identity for codes with the rank metric," by Gadouleau and Yan. It is shown that, similar to the MacWilliams identity for the Hamming metric, the rank weight distribution of any linear code can be expressed as a functional transformation of that of its dual code, and that the rank weight enumerator of the dual of any vector depends only on the rank weight of the vector and is related to the rank weight enumerator of a maximum rank distance code.

Yonghui Li
Jinhong Yuan
Andrej Stefanov
Branka Vucetic


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 598401, 9 pages
doi:10.1155/2008/598401

Research Article

Structured LDPC Codes over Integer Residue Rings

Elisa Mo and Marc A. Armand

Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576

Correspondence should be addressed to Marc A. Armand, [email protected]

Received 31 October 2007; Revised 31 March 2008; Accepted 3 June 2008

Recommended by Jinhong Yuan

This paper presents a new class of low-density parity-check (LDPC) codes over Z_{2^a} represented by regular, structured Tanner graphs. These graphs are constructed using Latin squares defined over a multiplicative group of a Galois ring, rather than a finite field. Our approach yields codes for a wide range of code rates and, more importantly, codes whose minimum pseudocodeword weights equal their minimum Hamming distances. Simulation studies show that these structured codes, when transmitted using matched signal sets over an additive white Gaussian noise channel, can outperform their random counterparts of similar length and rate.

Copyright © 2008 E. Mo and Marc A. Armand. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The study of nonbinary LDPC codes over GF(q) was initiated by Davey and MacKay [1]. However, the symbols of a nonbinary code over a finite field cannot be matched to any signal constellation. In other words, it is not possible to obtain a geometrically uniform code (wherein every codeword has the same error probability) from a nonbinary, finite-field code. The subject of geometrically uniform codes has been well studied by various authors, including Slepian [2] and Forney Jr. [3]. More recently, Sridhara and Fuja [4] introduced geometrically uniform, nonbinary LDPC codes over certain rings, including integer residue rings. Their codes are, however, unstructured. Structured LDPC codes, which include the family of finite geometry (FG) codes [5] and balanced incomplete block design (BIBD) codes [6], are favored over their random counterparts due to the reduction in storage space for the parity-check matrix and the ease of performance analysis they provide, while achieving relatively similar performance. The structured nonbinary LDPC codes proposed thus far, however, are constructed over finite fields, for example, [1, 7], and therefore cannot be geometrically uniform.

This paper therefore addresses the problem of designing structured, geometrically uniform, nonbinary LDPC codes over integer residue rings. Motivated by the fact that short nonbinary LDPC codes can outperform their binary counterparts [8-10], we focus our investigations on codes of short length. Studies of the so-called pseudocodewords arising from finite covers of a Tanner graph, for example, [11-13], have revealed that while a code's performance under maximum-likelihood (ML) decoding is dictated by its (Hamming) weight distribution, its performance under iterative decoding is dictated by the weight distribution of the pseudocodewords associated with its Tanner graph. More specifically, the presence of pseudocodewords of low weight, particularly those of weight less than the minimum Hamming distance of the code, is detrimental to a code's performance under iterative decoding. We therefore adopt the Latin-squares-based approach of Kelley et al. [14] to construct structured codes, as their method aims at maximizing the minimum pseudocodeword weight of a code. While we maintain the pseudocodeword framework used there, our work nevertheless differs from [14] primarily because our construction relies on an extension of the notion of Latin squares to multiplicative groups of a Galois ring, a key contribution of this paper.

We note that codes based on Latin squares were also studied in [7, 15-17]. However, the authors of these works did not do so within the pseudocodeword framework. Codes constructed using other combinatorial approaches, such as those presented in [6, 18, 19], were similarly not investigated using the notion of pseudocodewords. Specifically, these related works focused on the optimization of design parameters such as girth, expansion, diameter, and stopping sets. Our work therefore differs from these earlier studies in this regard.

For practical reasons, we only consider linear codes over Z_{2^a}. In the next section, we provide an overview of codes over Z_{2^a} and their natural mapping to a matched signal constellation, that is, the 2^a-PSK constellation. Section 3 introduces the notion of Latin squares over finite fields, followed by our extension of Latin squares to multiplicative groups of a Galois ring. A method to construct Tanner graphs using Latin squares (over a multiplicative group of a Galois ring) is presented in Section 4. We show that from these graphs, a wide range of code rates may be obtained. We further derive in the same section certain properties of the corresponding codes and, in particular, show that their minimum pseudocodeword weights equal their minimum Hamming distances. This is one of our main results. Finally, Section 5 presents computer simulations which demonstrate that our codes, when mapped to matched signal sets and transmitted over the additive white Gaussian noise (AWGN) channel, outperform their random counterparts of similar length and rate.

2. CODES OVER Z_{2^a}

2.1. An overview

Let C be a Z_{2^a}-submodule of the free Z_{2^a}-module Z_{2^a}^n. Its n_G × n generator matrix G can be expressed in the form [20]

$$
G = \begin{bmatrix} 2^{\lambda_1} g_1 \\ 2^{\lambda_2} g_2 \\ \vdots \\ 2^{\lambda_{n_G}} g_{n_G} \end{bmatrix}, \qquad (1)
$$

where 0 ≤ λ_i ≤ a − 1 for i = 1, 2, ..., n_G and {g_1, g_2, ..., g_{n_G}} ⊂ Z_{2^a}^n is a set of linearly independent elements. The rate of C is

$$
r = \frac{1}{n} \sum_{i=1}^{n_G} \frac{a - \lambda_i}{a} = \frac{n_G}{n} - \frac{\sum_{i=1}^{n_G} \lambda_i}{an}. \qquad (2)
$$

The dual code C⊥ is generated by the n_H × n parity-check matrix of C, which can be expressed in the form

$$
H = \begin{bmatrix} 2^{\mu_1} h_1 \\ 2^{\mu_2} h_2 \\ \vdots \\ 2^{\mu_{n_H}} h_{n_H} \end{bmatrix}, \qquad (3)
$$

where 0 ≤ μ_i ≤ a − 1 for i = 1, 2, ..., n_H and {h_1, h_2, ..., h_{n_H}} ⊂ Z_{2^a}^n is a set of linearly independent elements. The rate of C can also be obtained by

$$
r = 1 - \frac{1}{n} \sum_{i=1}^{n_H} \frac{a - \mu_i}{a} = 1 - \frac{n_H}{n} + \frac{\sum_{i=1}^{n_H} \mu_i}{an}. \qquad (4)
$$

If G (or H) is not already in the form of (1) (or of (3)), one can perform Gaussian elimination, without dividing a row by a zero divisor, to obtain the n_G (or n_H) linearly independent rows.
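As a quick illustration of the rate formula (2), the sketch below computes r from the row exponents λ_i. The helper name `code_rate` and the toy parameters are ours, not from the paper; this is a minimal check of the formula, not part of the authors' construction.

```python
from fractions import Fraction

def code_rate(a, n, lambdas):
    """Rate of a Z_{2^a} code whose generator rows are 2^{lambda_i} g_i, per Eq. (2).

    a: the ring is Z_{2^a}; n: code length; lambdas: one exponent per
    generator row, with 0 <= lambda_i <= a - 1.
    """
    assert all(0 <= l <= a - 1 for l in lambdas)
    # r = (1/n) * sum_i (a - lambda_i) / a
    return Fraction(sum(a - l for l in lambdas), a * n)

# A length-4 code over Z_8 (a = 3) with rows scaled by 2^0, 2^1, 2^2:
r = code_rate(3, 4, [0, 1, 2])
print(r)  # 1/2, i.e. (3 + 2 + 1) / (3 * 4)
```

When every λ_i = 0 the formula reduces to the familiar n_G/n, matching Remark 1's free-module case.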

Remark 1. C is a free Z_{2^a}-submodule if λ_i = 0 for i = 1, 2, ..., n_G. This also implies that μ_i = 0 for i = 1, 2, ..., n_H.

2.2. The matched signal set

The 2^a-PSK signal set contains 2^a points that are equidistant from the origin while maximally spread apart in a two-dimensional space. Projecting one dimension on the real axis and the other on the imaginary axis, a symbol x ∈ Z_{2^a} is mapped to the signal point s_x = \sqrt{E_s} \exp(j 2\pi x / 2^a), where E_s is the energy assigned to each symbol [4].

The 2^a-PSK constellation is matched to Z_{2^a} because for any x, y ∈ Z_{2^a},

$$
d_E^2(s_x, s_y) = d_E^2(s_{x-y}, s_0), \qquad (5)
$$

where d_E^2(s_x, s_y) denotes the squared Euclidean distance between s_x and s_y [21].

Let c_x, c_y ∈ C, where c_x = [x_1, x_2, ..., x_n] and c_y = [y_1, y_2, ..., y_n]. They are mapped symbol by symbol to [s_{x_1}, s_{x_2}, ..., s_{x_n}] and [s_{y_1}, s_{y_2}, ..., s_{y_n}], respectively. The squared Euclidean distance between these two signal vectors is

$$
d_E^2\big([s_{x_1}, \ldots, s_{x_n}], [s_{y_1}, \ldots, s_{y_n}]\big)
= \sum_{i=1}^{n} d_E^2(s_{x_i}, s_{y_i})
= \sum_{i=1}^{n} d_E^2(s_{x_i - y_i}, s_0)
= d_E^2\big([s_{x_1-y_1}, \ldots, s_{x_n-y_n}], [s_0, \ldots, s_0]\big). \qquad (6)
$$

Observe that the Hamming distance between two code-words is mapped proportionally to the Euclidean distancebetween their corresponding signal vectors.
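The matching property (5) is easy to check numerically. The following sketch is ours, not the paper's; it verifies the identity exhaustively for 8-PSK (a = 3) under the natural mapping defined above.

```python
import cmath
import itertools

a = 3                       # 8-PSK, matched to Z_8
M = 2 ** a
Es = 1.0                    # symbol energy

def s(x):
    # Natural mapping of x in Z_{2^a} onto the PSK circle.
    return cmath.sqrt(Es) * cmath.exp(2j * cmath.pi * x / M)

def d2(u, v):
    # Squared Euclidean distance between two signal points.
    return abs(u - v) ** 2

# Eq. (5): the distance between s_x and s_y depends only on x - y mod 2^a.
for x, y in itertools.product(range(M), repeat=2):
    assert abs(d2(s(x), s(y)) - d2(s((x - y) % M), s(0))) < 1e-12
print("8-PSK is matched to Z_8")
```

The same loop with a non-PSK mapping (e.g. 8-QAM-like points) would fail, which is precisely why the ring alphabet Z_{2^a} is paired with 2^a-PSK here.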

3. LATIN SQUARES

3.1. Definition and application to Galois fields

The following definition and example are taken from [22,Chapter 17].

Definition 1. A Latin square of order q is denoted as (R, C, S; L), where R, C, and S are sets of cardinality q and L is a mapping L(i, j) = k, where i ∈ R, j ∈ C, and k ∈ S, such that given any two of i, j, and k, the third is unique.

A Latin square can be written as a q × q array for which the cell in row i and column j contains the symbol L(i, j). Two Latin squares with mapping functions L and L′ are orthogonal if the pair (L(i, j), L′(i, j)) is unique for each (i, j). Further, a complete family of q − 1 mutually orthogonal Latin squares (MOLS) exists for q = p^s, where p is prime.

The notion of Latin squares can be easily applied to Galois fields by setting R = C = S = GF(p^s) and using the mapping functions L_β(i, j) = i + βj for β ∈ GF(p^s) \ {0}.


Example 1. Let R = C = S = GF(2^2) = {0, 1, α, α^2}. The mapping functions L_1(i, j) = i + j, L_α(i, j) = i + αj, and L_{α^2}(i, j) = i + α^2 j yield a complete family of three MOLS:

$$
M_1 = \begin{bmatrix} 0 & 1 & \alpha & \alpha^2 \\ 1 & 0 & \alpha^2 & \alpha \\ \alpha & \alpha^2 & 0 & 1 \\ \alpha^2 & \alpha & 1 & 0 \end{bmatrix}, \quad
M_\alpha = \begin{bmatrix} 0 & \alpha & \alpha^2 & 1 \\ 1 & \alpha^2 & \alpha & 0 \\ \alpha & 0 & 1 & \alpha^2 \\ \alpha^2 & 1 & 0 & \alpha \end{bmatrix}, \quad
M_{\alpha^2} = \begin{bmatrix} 0 & \alpha^2 & 1 & \alpha \\ 1 & \alpha & 0 & \alpha^2 \\ \alpha & 1 & \alpha^2 & 0 \\ \alpha^2 & 0 & \alpha & 1 \end{bmatrix}, \qquad (7)
$$

respectively.
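These three MOLS can be reproduced and checked programmatically. The sketch below is ours: it encodes GF(4) with the convention 0, 1, 2, 3 for 0, 1, α, α² (addition is XOR of the coefficient bits; multiplication uses the log table of α, with α² = α + 1) and verifies the Latin and orthogonality properties.

```python
# GF(4) arithmetic: elements 0, 1, a, a^2 encoded as 0, 1, 2, 3.
EXP = [1, 2, 3]                # a^0, a^1, a^2  (a^2 = a + 1 -> code 3)
LOG = {1: 0, 2: 1, 3: 2}

def add(u, v): return u ^ v    # characteristic 2: addition is XOR
def mul(u, v): return 0 if 0 in (u, v) else EXP[(LOG[u] + LOG[v]) % 3]

def latin(beta):
    # Square with entries L_beta(i, j) = i + beta * j, as in Example 1.
    return [[add(i, mul(beta, j)) for j in range(4)] for i in range(4)]

def is_latin(m):
    return all(sorted(r) == [0, 1, 2, 3] for r in m) and \
           all(sorted(c) == [0, 1, 2, 3] for c in zip(*m))

def orthogonal(m1, m2):
    # Orthogonal iff all 16 entry pairs are distinct.
    pairs = {(m1[i][j], m2[i][j]) for i in range(4) for j in range(4)}
    return len(pairs) == 16

squares = [latin(b) for b in (1, 2, 3)]       # beta ranges over GF(4) \ {0}
assert all(is_latin(m) for m in squares)
assert all(orthogonal(squares[i], squares[j])
           for i in range(3) for j in range(i + 1, 3))
print("complete family of 3 MOLS over GF(4)")
```

Under this encoding, `latin(1)` reproduces M_1 of (7) row for row.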

3.2. Extension to multiplicative groups of a Galois ring

Extending the notion of Latin squares to integer residue rings is not trivial. Setting R = C = S = Z_{2^s} with mapping functions L_β(i, j) = i + βj for β ∈ Z_{2^s} \ {0} does not yield a complete family of 2^s − 1 MOLS.

Example 2. Let R = C = S = Z_{2^2} = {0, 1, 2, 3} and let the mapping functions be L_1(i, j) = i + j, L_2(i, j) = i + 2j, and L_3(i, j) = i + 3j. The matrices

$$
M_1 = \begin{bmatrix} 0 & 1 & 2 & 3 \\ 1 & 2 & 3 & 0 \\ 2 & 3 & 0 & 1 \\ 3 & 0 & 1 & 2 \end{bmatrix}, \quad
M_2 = \begin{bmatrix} 0 & 2 & 0 & 2 \\ 1 & 3 & 1 & 3 \\ 2 & 0 & 2 & 0 \\ 3 & 1 & 3 & 1 \end{bmatrix}, \quad
M_3 = \begin{bmatrix} 0 & 3 & 2 & 1 \\ 1 & 0 & 3 & 2 \\ 2 & 1 & 0 & 3 \\ 3 & 2 & 1 & 0 \end{bmatrix}, \qquad (8)
$$

are obtained, respectively. Since the elements in each row of M_2 are not unique, M_2 is not a Latin square. Therefore, we do not have a complete family of three MOLS.
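A short check (ours, not the paper's) confirms the failure: β = 2 is a zero divisor in Z_4, so the rows of M_2 repeat symbols.

```python
# Over Z_4 the naive maps L_beta(i, j) = i + beta*j (beta = 1, 2, 3) do NOT
# all give Latin squares: beta = 2 is a zero divisor, so 2*j takes only
# the values {0, 2} and each row of M_2 repeats.
def square(beta):
    return [[(i + beta * j) % 4 for j in range(4)] for i in range(4)]

def is_latin(m):
    return all(sorted(r) == [0, 1, 2, 3] for r in m) and \
           all(sorted(c) == [0, 1, 2, 3] for c in zip(*m))

print([is_latin(square(b)) for b in (1, 2, 3)])  # [True, False, True]
```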

Hence, we propose an alternative way of constructing Latin squares over integer residue rings. Let the extension ring be R = GR(2^a, s) = Z_{2^a}[y]/⟨φ(y)⟩, where φ(y) is a degree-s basic irreducible polynomial over Z_{2^a}. Embedded in R is a multiplicative group G_{2^s−1} of units of order 2^s − 1. Further, we let a′ < a, define \bar{z} = z mod 2^{a′} for z ∈ R, and extend this notation to n-tuples and matrices over R.

Example 3. Let R = GR(2^2, 2) = Z_4[y]/⟨y^2 + y + 3⟩. Embedded in R is G_3 = {1, α, α^2} = {1, y + 2, 3y + 1}, generated by α = y + 2. Let R = C = G_3 ∪ {0}. The mapping functions L_1(i, j) = i + j, L_α(i, j) = i + αj, and L_{α^2}(i, j) = i + α^2 j yield the matrices

$$
M_1 = \begin{bmatrix} 0 & 1 & y+2 & 3y+1 \\ 1 & 2 & y+3 & 3y+2 \\ y+2 & y+3 & 2y & 3 \\ 3y+1 & 3y+2 & 3 & 2y+2 \end{bmatrix}
$$

$$
M_\alpha = \begin{bmatrix} 0 & y+2 & 3y+1 & 1 \\ 1 & y+3 & 3y+2 & 2 \\ y+2 & 2y & 3 & y+3 \\ 3y+1 & 3 & 2y+2 & 3y+2 \end{bmatrix}
$$

$$
M_{\alpha^2} = \begin{bmatrix} 0 & 3y+1 & 1 & y+2 \\ 1 & 3y+2 & 2 & y+3 \\ y+2 & 3 & y+3 & 2y \\ 3y+1 & 2y+2 & 3y+2 & 3 \end{bmatrix}, \qquad (9)
$$

respectively. Since G_3 ∪ {0} is not closed under R-addition, the symbol set S satisfies S ⊂ R with |S| ≠ |R| = |C| = 2^s. Thus, none of the three matrices is a Latin square.

To overcome this problem, the mapping functions have to be altered slightly such that they map i ∈ R and j ∈ C uniquely to L_β(i, j) ∈ S with |R| = |C| = |S|.

Definition 2. L^{(a)}_β(i, j) = ((i)^{1/2^{a−1}} + (βj)^{1/2^{a−1}})^{2^{a−1}}, where i, j ∈ G_{2^s−1} ∪ {0} and β ∈ G_{2^s−1}.

Theorem 1. L^{(a)}_β(i, j) ∈ G_{2^s−1} ∪ {0}.

Proof. It is apparent that (i)^{1/2^{a−1}}, (βj)^{1/2^{a−1}} ∈ G_{2^s−1} ∪ {0}. Since G_{2^s−1} ∪ {0} is not closed under R-addition, (i)^{1/2^{a−1}} + (βj)^{1/2^{a−1}} = u + 2v, where u ∈ G_{2^s−1} ∪ {0} and v ∈ R. Using the binomial expansion, the mapping function can be expressed as

$$
L^{(a)}_\beta(i, j) = (u + 2v)^{2^{a-1}} = \sum_{x=0}^{2^{a-1}} \binom{2^{a-1}}{x} u^{2^{a-1}-x} (2v)^x. \qquad (10)
$$

Observe that \binom{2^{a-1}}{x} u^{2^{a-1}-x} (2v)^x ≡ 0 (mod 2^a) for x = 1, 2, ..., 2^{a−1}. Thus, L^{(a)}_β(i, j) = u^{2^{a−1}} ∈ G_{2^s−1} ∪ {0}.
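The vanishing of the x ≥ 1 binomial terms can be sanity-checked over plain integers, where the same 2-adic argument applies (each term carries a factor 2^x from (2v)^x plus the 2-power inside the binomial coefficient). This numerical check is ours, not the paper's.

```python
from math import comb

# Integer analogue of the key step in the proof of Theorem 1: every term of
# the binomial expansion of (u + 2v)^{2^{a-1}} with x >= 1 vanishes mod 2^a,
# hence (u + 2v)^{2^{a-1}} = u^{2^{a-1}} (mod 2^a).
for a in (2, 3, 4):
    e, mod = 2 ** (a - 1), 2 ** a
    for u in range(mod):
        for v in range(mod):
            assert pow(u + 2 * v, e, mod) == pow(u, e, mod)
            # Term by term: C(e, x) * u^(e-x) * (2v)^x = 0 mod 2^a for x >= 1.
            for x in range(1, e + 1):
                assert comb(e, x) * u ** (e - x) * (2 * v) ** x % mod == 0
print("binomial terms vanish mod 2^a for a = 2, 3, 4")
```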

Theorem 2. Consider two multiplicative groups G_{2^s−1} ⊂ GR(2^a, s) = Z_{2^a}[y]/⟨φ(y)⟩ and G′_{2^s−1} ⊂ GR(2^{a′}, s) = Z_{2^{a′}}[y]/⟨\bar{φ}(y)⟩, where φ(y) is a degree-s basic irreducible polynomial over Z_{2^a}. Let i, j ∈ G_{2^s−1} ∪ {0} and β ∈ G_{2^s−1}; then \bar{i}, \bar{j} ∈ G′_{2^s−1} ∪ {0} and \bar{β} ∈ G′_{2^s−1}. Then, L^{(a′)}_{\bar{β}}(\bar{i}, \bar{j}) = \overline{L^{(a)}_β(i, j)}.

Proof. Using the binomial expansion,

$$
\overline{L^{(a)}_\beta(i, j)} = \sum_{x=0}^{2^{a-1}} \binom{2^{a-1}}{x} \left((i)^{1/2^{a-1}}\right)^{2^{a-1}-x} \left((\beta j)^{1/2^{a-1}}\right)^{x} \bmod 2^{a'}. \qquad (11)
$$

Now, observe that

$$
\binom{2^{a-1}}{x} \bmod 2^{a'} =
\begin{cases}
\dbinom{2^{a'-1}}{y}, & x = y \cdot 2^{a-a'}, \text{ where } y \text{ is an integer}, \\
0, & \text{otherwise}.
\end{cases} \qquad (12)
$$

Thus,

$$
\overline{L^{(a)}_\beta(i, j)}
= \sum_{y=0}^{2^{a'-1}} \binom{2^{a'-1}}{y} \left((i)^{1/2^{a-1}}\right)^{2^{a-1} - y \cdot 2^{a-a'}} \left((\beta j)^{1/2^{a-1}}\right)^{y \cdot 2^{a-a'}} \bmod 2^{a'}
= \sum_{y=0}^{2^{a'-1}} \binom{2^{a'-1}}{y} \left((\bar{i})^{1/2^{a'-1}}\right)^{2^{a'-1}-y} \left((\bar{\beta}\bar{j})^{1/2^{a'-1}}\right)^{y}
= L^{(a')}_{\bar{\beta}}(\bar{i}, \bar{j}). \qquad (13)
$$

Remark 2. When a′ = 1, the mapping function L^{(1)}_{\bar{β}}(\bar{i}, \bar{j}) = \bar{i} + \bar{β}\bar{j} coincides with the mapping function applied to Galois fields. Since L^{(1)}_{\bar{β}}(\bar{i}, \bar{j}) = \overline{L^{(a)}_β(i, j)} (from Theorem 2), L^{(a)}_β(i, j) is unique for a given pair (i, j). It follows that two Latin squares constructed by L^{(a)}_{β_0}(i, j) and L^{(a)}_{β_1}(i, j), where β_0, β_1 ∈ G_{2^s−1} and β_0 ≠ β_1, are orthogonal.

Let R = C = S = G_{2^s−1} ∪ {0}. A complete family {(R, C, S; L^{(a)}_β) : β ∈ G_{2^s−1}} of MOLS is obtained by defining L^{(a)}_β(i, j) = ((i)^{1/2^{a−1}} + (βj)^{1/2^{a−1}})^{2^{a−1}}.

Example 4. Let R = C = S = G_3 ∪ {0} ⊂ GR(2^2, 2) and take the mapping functions L^{(2)}_1(i, j) = ((i)^{1/2} + (j)^{1/2})^2, L^{(2)}_α(i, j) = ((i)^{1/2} + (αj)^{1/2})^2, and L^{(2)}_{α^2}(i, j) = ((i)^{1/2} + (α^2 j)^{1/2})^2. The resultant MOLS are

$$
M_1 = \begin{bmatrix} 0 & 1 & \alpha & \alpha^2 \\ 1 & 0 & \alpha^2 & \alpha \\ \alpha & \alpha^2 & 0 & 1 \\ \alpha^2 & \alpha & 1 & 0 \end{bmatrix}, \quad
M_\alpha = \begin{bmatrix} 0 & \alpha & \alpha^2 & 1 \\ 1 & \alpha^2 & \alpha & 0 \\ \alpha & 0 & 1 & \alpha^2 \\ \alpha^2 & 1 & 0 & \alpha \end{bmatrix}, \quad
M_{\alpha^2} = \begin{bmatrix} 0 & \alpha^2 & 1 & \alpha \\ 1 & \alpha & 0 & \alpha^2 \\ \alpha & 1 & \alpha^2 & 0 \\ \alpha^2 & 0 & \alpha & 1 \end{bmatrix}, \qquad (14)
$$

respectively. A complete family of three MOLS is obtained. In addition, the mapping function $L_0(i,j) = i$ yields the matrix
$$M_0 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ \alpha & \alpha & \alpha & \alpha \\ \alpha^2 & \alpha^2 & \alpha^2 & \alpha^2 \end{bmatrix} \tag{15}$$

Figure 1: Portion of parity check matrix constructed in each step.

which is orthogonal to each Latin square in the complete family of MOLS.
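Example 4 can be reproduced mechanically. The sketch below is our own illustration (element representation and helper names are assumptions, not from the paper): it implements arithmetic in $\mathrm{GR}(4,2) = \mathbb{Z}_4[y]/\langle y^2 + y + 1\rangle$, builds the three Latin squares, and checks mutual orthogonality.

```python
# Explicit arithmetic in GR(4, 2) = Z4[y]/<y^2 + y + 1>; an element c0 + c1*y
# is stored as the pair (c0, c1) with coefficients mod 4.
M = 4

def add(u, v):
    return ((u[0] + v[0]) % M, (u[1] + v[1]) % M)

def mul(u, v):
    # reduce with y^2 = -y - 1 = 3y + 3 (mod 4)
    return ((u[0] * v[0] + 3 * u[1] * v[1]) % M,
            (u[0] * v[1] + u[1] * v[0] + 3 * u[1] * v[1]) % M)

zero, one, alpha = (0, 0), (1, 0), (0, 1)
alpha2 = mul(alpha, alpha)              # alpha^2 = 3 + 3y
T = [zero, one, alpha, alpha2]          # G3 ∪ {0} (the Teichmuller set)

def sqrt_T(u):
    return mul(u, u)                    # u^4 = u on G3 ∪ {0}, so u^(1/2) = u^2

def L(beta, i, j):
    # L_beta^{(2)}(i, j) = ((i)^(1/2) + (beta*j)^(1/2))^2
    t = add(sqrt_T(i), sqrt_T(mul(beta, j)))
    return mul(t, t)

squares = {b: [[L(b, i, j) for j in T] for i in T] for b in (one, alpha, alpha2)}

for Ma in squares.values():
    # images stay inside the Teichmuller set, and each square is Latin
    assert all(e in T for row in Ma for e in row)
    assert all(len(set(row)) == 4 for row in Ma)
    assert all(len({Ma[r][c] for r in range(4)}) == 4 for c in range(4))

# mutual orthogonality: superimposing any two squares gives 16 distinct pairs
for b0 in squares:
    for b1 in squares:
        if b0 != b1:
            pairs = {(squares[b0][r][c], squares[b1][r][c])
                     for r in range(4) for c in range(4)}
            assert len(pairs) == 16
print("complete family of 3 MOLS of order 4 verified")
```

Since every $u \in G_3 \cup \{0\}$ satisfies $u^4 = u$, the square root within the Teichmüller set is simply $u \mapsto u^2$, which is what `sqrt_T` uses.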

4. STRUCTURED LDPC CODES OVER $\mathbb{Z}_{2^a}$

4.1. Construction of graphs using Latin squares

The construction method proposed in [14, Section IV-A] can be generalized to construct graphs for different values of $a$ and $s$ by altering the mapping functions according to the value of $a$. The procedure is stated here for easy reference in Theorem 3, which follows. The graph is a tree with three layers enumerated from its root: the root is a variable node, the first layer has $2^s + 1$ check nodes, the second layer has $2^s(2^s + 1)$ variable nodes, and the third layer has $2^{2s}$ check nodes. Thus there are $2^{2s} + 2^s + 1$ variable nodes and the same number of check nodes. The nodes are connected in the following steps.

(1) The root variable node is connected to each of the check nodes in the first layer.

(2) Each check node in the first layer is connected to $2^s$ consecutive variable nodes in the second layer.

(3) Each of the first $2^s$ variable nodes in the second layer is connected to $2^s$ consecutive check nodes in the third layer.

(4) For $i, j, k, \beta \in G_{2^s-1} \cup \{0\}$, label the remaining variable nodes in the second layer $(\beta, i)$ and all check nodes in the third layer $(j, k)$. If $\beta = 0$, variable node $(0, i)$ is connected to check node $(j, i)$. If $\beta \in G_{2^s-1}$, variable node $(\beta, i)$ is connected to check node $(j, L_\beta^{(a)}(i, j))$. The tree is completed once all possible combinations of $(i, j, k, \beta)$ are exhausted.
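For $a = 2$, $s = 2$, the four steps can be turned into an explicit incidence pattern. The sketch below is ours (the ordering of node labels within each layer is an assumed convention, not spelled out in the paper); it builds the $21 \times 21$ support of $H(2, 2)$ and checks that every row and column has weight $2^s + 1 = 5$.

```python
# Build the binary incidence pattern of H(2, 2) from connection steps (1)-(4)
# and check that the Tanner graph is (2^s + 1)-regular, i.e., degree 5 for s = 2.
M = 4

def add(u, v):
    return ((u[0] + v[0]) % M, (u[1] + v[1]) % M)

def mul(u, v):
    # GR(4, 2) = Z4[y]/<y^2 + y + 1>, with y^2 = 3y + 3 (mod 4)
    return ((u[0] * v[0] + 3 * u[1] * v[1]) % M,
            (u[0] * v[1] + u[1] * v[0] + 3 * u[1] * v[1]) % M)

T = [(0, 0), (1, 0), (0, 1), (3, 3)]    # 0, 1, alpha, alpha^2

def L(beta, i, j):                       # ((i)^(1/2) + (beta*j)^(1/2))^2
    sq = lambda u: mul(u, u)             # u^(1/2) = u^2 on G3 ∪ {0}
    t = add(sq(i), sq(mul(beta, j)))
    return mul(t, t)

s, q = 2, 4                              # q = 2^s
n = q * q + q + 1                        # 21 variable and 21 check nodes
H = [[0] * n for _ in range(n)]

for m in range(q + 1):                   # step (1): root variable is column 0
    H[m][0] = 1
for m in range(q + 1):                   # step (2): 2nd-layer vars are cols 1..20
    for t in range(q):
        H[m][1 + q * m + t] = 1
for t in range(q):                       # step (3): first q 2nd-layer variables
    for u in range(q):
        H[q + 1 + q * t + u][1 + t] = 1
for b in range(q):                       # step (4): remaining vars labeled (beta, i)
    for ii in range(q):
        var = 1 + q * (1 + b) + ii
        for jj in range(q):
            k = ii if b == 0 else T.index(L(T[b], T[ii], T[jj]))
            H[q + 1 + q * jj + k][var] = 1

assert all(sum(row) == q + 1 for row in H)                        # check degrees
assert all(sum(H[r][c] for r in range(n)) == q + 1 for c in range(n))
print("H(2, 2) support is 21 x 21 with uniform row/column weight 5")
```

The regularity check exercises the Latin-square property: for each $\beta \in G_3$ and fixed $j$, the map $i \mapsto L_\beta^{(a)}(i,j)$ is a bijection, so every third-layer check node ends up with exactly one edge per value of $\beta$.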


E. Mo and Marc A. Armand

Let $T(a, s)$ denote the resultant tree constructed using the complete family of MOLS derived from $G_{2^s-1} \cup \{0\} \subseteq R$. $T(a, s)$ is a degree-$(2^s + 1)$ regular tree. By reading the variable (check) nodes as columns (rows) of a matrix $H(a, s) \in \mathbb{Z}_{2^a}^{(2^{2s}+2^s+1) \times (2^{2s}+2^s+1)}$ in a top-bottom, left-right manner, while setting the edge weights to be randomly chosen units from $\mathbb{Z}_{2^a}$, the portion of $H(a, s)$ constructed at each step is as illustrated in Figure 1. The null space of $H(a, s)$ yields an LDPC code $C(a, s)$ over $\mathbb{Z}_{2^a}$.

Example 5. Let $a = 2$ and $s = 2$. The Latin squares are shown in Example 4. Steps (1)–(3) are illustrated in Figure 2(a); as observed, this part can be perceived as the nonrandom portion of the parity-check matrix. Step (4), on the other hand, produces the pseudorandom portion that is commonly seen in most LDPC parity-check matrices. The resultant tree is shown in Figure 2(b).

4.2. Properties of C(a, s)

$C(a, s)$ is a length-$n(s) = 2^{2s} + 2^s + 1$ regular LDPC code represented by $H(a, s)$ (or $T(a, s)$). We denote the minimum distance of $C(a, s)$ by $d_{\min}(a, s)$. Following the definition given in [14], $w_{\min}(a, s)$ denotes the minimum pseudocodeword weight arising from the Tanner graph of $C(a, s)$ for the $2^a$-ary symmetric channel.

Theorem 3. Let $\overline{T}(a, s)$ denote the graph resulting from reducing, mod $2^{a'}$, all edge weights of $T(a, s)$. Then $\overline{T}(a, s) = T(a', s)$; that is, $\overline{H}(a, s) = H(a', s)$.

Proof. First, the connection procedure is independent of $a$ in steps (1)–(3), and similarly for $\beta = 0$ in step (4). Since $L_\beta^{(a')}(i,j) = L_\beta^{(a)}(i,j)$ (from Theorem 2), the edge $((\beta, i), (j, L_\beta^{(a)}(i,j)))$ in $T(a, s)$ is equivalent to the edge $((\beta, i), (j, L_\beta^{(a')}(i,j)))$ in $T(a', s)$.

Remark 3. The graphs constructed by setting $a = 1$ yield binary codes that are the same as those in [14, Section IV-A]. Further, it has also been shown that these codes are the binary projective geometry (PG) LDPC codes introduced in [5]. Thus, it is known that $d_{\min}(1, s) = 2^s + 2$.

Before deriving $d_{\min}(a, s)$, we state two relationships between the codewords in $C(a, s)$ and $C(a', s)$.

Corollary 1. (i) If $c \in C(a, s)$, then $\bar{c} \in C(a', s)$, where $\bar{c} = c \bmod 2^{a'}$.

(ii) If $c \in C(a, s)$ can be expressed as $c = 2^{a-a'} c'$, where $c' \in \mathbb{Z}_{2^{a'}}^n$, then $c' \in C(a', s)$ and is unique.

Proof. Corollary 1(i) is a simple consequence of Theorem 3, while for Corollary 1(ii),
$$\begin{aligned}
2^{a-a'} c' H^T(a, s) = 0 \bmod 2^a
&\Longrightarrow c' H^T(a, s) = 0 \bmod 2^{a'} \\
&\Longrightarrow c' H^T(a', s) = 0 \bmod 2^{a'} \quad \text{(from Theorem 3)}.
\end{aligned} \tag{16}$$
The uniqueness of $c'$ follows from the natural group embedding $\mathrm{GR}(2^{a'}, s) \to R : r \mapsto 2^{a-a'} r$.

Theorem 4. $d_{\min}(a, s) = d_{\min}(1, s)$.

Proof. Let $d_c$ be the Hamming weight of $c \in C(a, s) \setminus \{0\}$.

Case 1. $c$ contains at least one unit. From Corollary 1(i) with $a' = 1$, $\bar{c} \in C(1, s)$. Further, $d_c \ge d_{\bar{c}}$. Since $d_{\bar{c}} \ge d_{\min}(1, s)$, we have $d_c \ge d_{\min}(1, s)$.

Case 2.1. $c$ can be expressed as $c = 2^{a-a'} c'$, where $c'$ contains at least one unit of $\mathbb{Z}_{2^{a'}}$. From Corollary 1(ii), $c' \in C(a', s)$. Further, $d_c = d_{c'}$, and from Case 1, $d_{c'} \ge d_{\min}(1, s)$. When $a' = 1$, $c = 2^{a-1} c'$ and $c' \in C(1, s)$; if $d_{c'} = d_{\min}(1, s)$, then $d_c = d_{\min}(1, s)$.

Case 2.2. $c$ can be expressed as $c = 2^{a-a'} c'$, where $c'$ does not contain any unit of $\mathbb{Z}_{2^{a'}}$. Similarly, from Corollary 1(ii), $c' \in C(a', s)$; therefore $d_c = d_{c'}$, and the bounds on $d_{c'}$ follow Case 2.1.

Thus, $d_{\min}(a, s) = d_{\min}(1, s)$.

It has already been shown in [14, Section IV-A] that $w_{\min}(1, s) = d_{\min}(1, s)$. The following theorem states the relationship between $w_{\min}(a, s)$ and $d_{\min}(a, s)$.

Theorem 5. $w_{\min}(a, s) = d_{\min}(a, s)$.

Proof. Since $T(1, s) = \overline{T}(a, s)$ when $a' = 1$ (from Theorem 3) and all edge weights in $T(a, s)$ are units of $\mathbb{Z}_{2^a}$, $w_{\min}(a, s)$ and $w_{\min}(1, s)$ share the same tree bound [14]; that is, $w_{\min}(a, s) \ge 2^s + 2$ for all $a$. Further, $d_{\min}(a, s) = d_{\min}(1, s) = 2^s + 2$ (from Theorem 4). Thus,
$$2^s + 2 \le w_{\min}(a, s) \le d_{\min}(a, s) = 2^s + 2 \Longrightarrow w_{\min}(a, s) = d_{\min}(a, s) = 2^s + 2. \tag{17}$$

The code rate $r(a, s)$ has to be computed by first reducing $H(a, s)$ to the form discussed in Section 2.1. $r(a, s)$ is bounded by
$$\frac{2^{2s} + 2^s - 3^s}{a\left(2^{2s} + 2^s + 1\right)} \le r(a, s) \le \frac{2^{2s} + 2^s - 3^s}{2^{2s} + 2^s + 1}, \tag{18}$$
where the upper bound corresponds to the code rates of the binary PG-LDPC codes [5]. We observe that by setting the edge weights of $T(a, s)$ to be randomly chosen units from $\mathbb{Z}_{2^a}$, $r(a, s)$ tends to the lower bound, which results in codes suitable for low-rate applications. On the other hand, by setting all edge weights to unity, $r(a, s)$ increases significantly. The corresponding codes can thus be deployed in moderate-rate applications. Table 1 compiles the properties of $C(a, s)$ for various values of $a$ and $s$.
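As a quick sanity check (ours, not from the paper), the two sides of (18) reproduce the rate entries of Table 1:

```python
# Evaluate the two sides of (18); Table 1 lists these to four decimal places.
def rate_bounds(a, s):
    n = 2 ** (2 * s) + 2 ** s + 1          # codelength n(s)
    k = 2 ** (2 * s) + 2 ** s - 3 ** s     # numerator shared by both bounds
    return k / (a * n), k / n              # (lower bound, upper bound)

lo, hi = rate_bounds(2, 2)
assert abs(lo - 0.2619) < 5e-5 and abs(hi - 0.5238) < 5e-5
lo, hi = rate_bounds(4, 5)
assert abs(lo - 0.1923) < 5e-5 and abs(hi - 0.7692) < 5e-5
print("rate bounds match Table 1 entries")
```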

Table 1: Properties of C(a, s).

a   s   n(s)   Degree of T(a, s)   dmin(a, s) = wmin(a, s)   r(a, s) (lower bound)   r(a, s) (unity edge weights)
1   2   21     5                   6                         0.5238                  0.5238
2   2   21     5                   6                         0.2619                  0.4762
3   2   21     5                   6                         0.1746                  0.3175
4   2   21     5                   6                         0.1309                  0.2381
1   3   73     9                   10                        0.6164                  0.6164
2   3   73     9                   10                        0.3082                  0.5548
3   3   73     9                   10                        0.2055                  0.4932
4   3   73     9                   10                        0.1541                  0.3699
1   4   273    17                  18                        0.6996                  0.6996
2   4   273    17                  18                        0.3498                  0.6337
3   4   273    17                  18                        0.2332                  0.5653
4   4   273    17                  18                        0.1749                  0.4982
1   5   1057   33                  34                        0.7692                  0.7692
2   5   1057   33                  34                        0.3846                  0.7053
3   5   1057   33                  34                        0.2564                  0.6367
4   5   1057   33                  34                        0.1923                  0.5669

Figure 2: Tree constructed for a = 2, s = 2 after (a) steps (1)–(3), and (b) step (4) (the final structure). Second-layer variable nodes are labeled $(\beta, i)$ and third-layer check nodes $(j, k)$, with $\beta, i, j, k \in \{0, 1, \alpha, \alpha^2\}$.

5. SIMULATION RESULTS

Figures 3 and 4 show the bit error rate (BER) and symbol error rate (SER) performance of our structured codes over the AWGN channel. In Figure 3(a), the corresponding edge weights of the codes simulated are randomly chosen units of $\mathbb{Z}_4$, while those in Figures 3(b) and 4 are set to unity. The codewords are transmitted using the matched signals discussed in Section 2.2. The received signals are decoded using the sum-product algorithm. The performance of random, near-regular LDPC codes with a constant variable node degree of 3 is also shown. These codes have codelengths and rates similar to those of the structured codes. For each data point, $10^4$ error bits are collected, with a maximum of 100 iterations allowed for decoding each received signal vector.

Figure 3: Performance of structured and random LDPC codes over $\mathbb{Z}_4$ with QPSK signaling over the AWGN channel. (a) a = 2, s = 2, random edge weights; (b) a = 2, s = 5, unity edge weights. (Curves: BER and SER of the structured and random codes versus Eb/N0 in dB.)

Figure 3(a) shows our structured $\mathbb{Z}_4$ code outperforming the random code when the codelength is small, that is, 42 bits. On the other hand, Figure 3(b) shows our structured code performing worse than its random counterpart when the codelength is much larger, specifically, 2114 bits. At a glance, it therefore appears that our structured codes are only better than random codes at short codelengths. To get a clearer picture of how our codes fare in comparison to their random counterparts, we turn to Figures 4(a) and 4(b), which summarize the BER performance of random and structured codes over $\mathbb{Z}_4$ and $\mathbb{Z}_8$, for increasing codelengths of 42, 146, and 546 bits and 63, 219, and 819 bits, respectively. From these empirical results, we conclude that our codes significantly outperform their random counterparts over a wide BER range for very small codelengths, that is, less than 100 bits. On the other hand, for larger codelengths, random codes perform better in the higher-BER region, while our structured codes are superior at lower BERs, specifically $10^{-4}$ and below for codelengths close to 1000 bits, and $10^{-6}$ and below for larger codelengths exceeding 2000 bits. This phenomenon may be attributed to the fact that the minimum distance of our codes grows linearly with the square root of their codelength. On the other hand, from [23, Theorem 26], we have that the minimum distance of a random, regular LDPC code with a constant variable node degree of 3 grows linearly with its codelength with high probability. As the random codes considered here are near-regular, we believe that they have superior minimum distances compared to our structured codes.

6. CONCLUSION

To summarize, we have extended the notion of Latin squares to multiplicative groups of a Galois ring. Using the generalized mapping function, we have constructed Tanner graphs representing a family of structured LDPC codes over $\mathbb{Z}_{2^a}$ spanning a wide range of code rates. In addition, we


Figure 4: Performance of structured and random LDPC codes transmitted using matched signals over the AWGN channel. (a) a = 2, unity edge weights, transmitted using QPSK signaling; (b) a = 3, unity edge weights, transmitted using 8-PSK signaling. (Curves: BER versus Eb/N0 in dB for s = 2, 3, 4.)

have shown that the minimum pseudocodeword weight of these codes is equal to their minimum Hamming distance—a desirable attribute under iterative decoding. Finally, our simulation results show that these codes, when transmitted by matched signal sets over the AWGN channel, can significantly outperform their random counterparts of similar length and rate at BERs of practical interest.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their helpful comments, which led to significant improvements in Sections 1 and 5 of this paper. The authors also gratefully acknowledge financial support from the Ministry of Education ACRF Tier 1 Research Grant no. R-263-000-361-112.

REFERENCES

[1] M. C. Davey and D. J. C. MacKay, "Low density parity check codes over GF(q)," IEEE Communications Letters, vol. 2, no. 5, pp. 159–166, 1998.

[2] D. Slepian, "Group codes for the Gaussian channel," Bell System Technical Journal, vol. 47, pp. 575–602, 1968.

[3] G. D. Forney Jr., "Geometrically uniform codes," IEEE Transactions on Information Theory, vol. 37, no. 5, pp. 1241–1260, 1991.

[4] D. Sridhara and T. E. Fuja, "LDPC codes over rings for PSK modulation," IEEE Transactions on Information Theory, vol. 51, no. 9, pp. 3209–3220, 2005.

[5] Y. Kou, S. Lin, and M. P. C. Fossorier, "Low-density parity-check codes based on finite geometries: a rediscovery and new results," IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2711–2736, 2001.

[6] B. Vasic and O. Milenkovic, "Combinatorial constructions of low-density parity-check codes for iterative decoding," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1156–1176, 2004.

[7] I. B. Djordjevic and B. Vasic, "Nonbinary LDPC codes for optical communication systems," IEEE Photonics Technology Letters, vol. 17, no. 10, pp. 2224–2226, 2005.

[8] A. Bennatan and D. Burshtein, "Design and analysis of nonbinary LDPC codes for arbitrary discrete-memoryless channels," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 549–583, 2006.

[9] X.-Y. Hu and E. Eleftheriou, "Binary representation of cycle Tanner-graph GF(2^b) codes," in Proceedings of IEEE International Conference on Communications (ICC '04), vol. 1, pp. 528–532, Paris, France, June 2004.

[10] C. Poulliat, M. Fossorier, and D. Declercq, "Using binary image of nonbinary LDPC codes to improve overall performance," in Proceedings of IEEE International Symposium on Turbo Codes, Munich, Germany, April 2006.

[11] C. A. Kelley, D. Sridhara, and J. Rosenthal, "Pseudocodeword weights for non-binary LDPC codes," in Proceedings of IEEE International Symposium on Information Theory (ISIT '06), pp. 1379–1383, Seattle, Wash, USA, July 2006.

[12] R. Koetter and P. O. Vontobel, "Graph-covers and iterative decoding of finite length codes," in Proceedings of the 3rd IEEE International Symposium on Turbo Codes and Applications, pp. 75–82, Brest, France, September 2003.

[13] N. Wiberg, Codes and decoding on general graphs, Ph.D. thesis, Linköping University, Linköping, Sweden, 1996.

[14] C. A. Kelley, D. Sridhara, and J. Rosenthal, "Tree-based construction of LDPC codes having good pseudocodeword weights," IEEE Transactions on Information Theory, vol. 53, no. 4, pp. 1460–1478, 2007.

[15] I. B. Djordjevic and B. Vasic, "MacNeish-Mann theorem based iteratively decodable codes for optical communication systems," IEEE Communications Letters, vol. 8, no. 8, pp. 538–540, 2004.

[16] O. Milenkovic and S. Laendner, "Analysis of the cycle-structure of LDPC codes based on Latin squares," in Proceedings of IEEE International Conference on Communications (ICC '04), vol. 2, pp. 777–781, Paris, France, June 2004.

[17] B. Vasic, I. B. Djordjevic, and R. K. Kostuk, "Low-density parity check codes and iterative decoding for long-haul optical communication systems," Journal of Lightwave Technology, vol. 21, no. 2, pp. 438–446, 2003.

[18] I. B. Djordjevic and B. Vasic, "Iteratively decodable codes from orthogonal arrays for optical communication systems," IEEE Communications Letters, vol. 9, no. 10, pp. 924–926, 2005.

[19] O. Milenkovic, N. Kashyap, and D. Leyba, "Shortened array codes of large girth," IEEE Transactions on Information Theory, vol. 52, no. 8, pp. 3707–3722, 2006.

[20] G. Caire and E. M. Biglieri, "Linear block codes over cyclic groups," IEEE Transactions on Information Theory, vol. 41, no. 5, pp. 1246–1256, 1995.

[21] H. A. Loeliger, "Signal sets matched to groups," IEEE Transactions on Information Theory, vol. 37, no. 6, pp. 1675–1682, 1991.

[22] J. H. van Lint and R. M. Wilson, A Course in Combinatorics, Cambridge University Press, Cambridge, UK, 2nd edition, 2001.

[23] G. Como and F. Fagnani, "Average spectra and minimum distances of low density parity check codes over cyclic groups," http://calvino.polito.it/~fagnani/groupcodes/ldpcgroupcodes.pdf.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 824673, 14 pages
doi:10.1155/2008/824673

Research Article
Differentially Encoded LDPC Codes—Part I: Special Case of Product Accumulate Codes

Jing Li (Tiffany)

Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA

Correspondence should be addressed to Jing Li (Tiffany), [email protected]

Received 19 November 2007; Accepted 6 March 2008

Recommended by Yonghui Li

Part I of a two-part series investigates product accumulate codes, a special class of differentially encoded low-density parity-check (DE-LDPC) codes with high performance and low complexity, on flat Rayleigh fading channels. In the coherent detection case, Divsalar's simple bounds and iterative thresholds using density evolution are computed to quantify the code performance at finite and infinite lengths, respectively. In the noncoherent detection case, a simple iterative differential detection and decoding (IDDD) receiver is proposed and shown to be robust for different Doppler shifts. Extrinsic information transfer (EXIT) charts reveal that, with pilot symbol assisted differential detection, the widespread practice of inserting pilot symbols to terminate the trellis actually incurs a loss in capacity, and a more efficient way is to separate the pilots from the trellis. Through analysis and simulations, it is shown that PA codes perform very well with both coherent and noncoherent detections. The more general case of DE-LDPC codes, where the LDPC part may take arbitrary degree profiles, is studied in Part II (Li, 2008).

Copyright © 2008 Jing Li (Tiffany). This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The discovery of turbo codes and the rediscovery of low-density parity-check (LDPC) codes have renewed the research frontier of capacity-achieving codes [1, 2]. They also revolutionized coding theory by establishing a new soft-iterative paradigm, where long powerful codes are constructed from short simple codes and decoded through iterative message exchange and successive refinement between component decoders. Compared to turbo codes, LDPC codes boast a lower decoding complexity, a richer variety of code constructions, and freedom from patents.

One important application of LDPC codes is wireless communications, where sender and receiver communicate through, for example, a non-line-of-sight land-mobile channel characterized by the Rayleigh fading model. It is well recognized that LDPC codes perform remarkably well on Rayleigh fading channels, assuming the carrier phase is perfectly synchronized and coherent detection is performed; but what if otherwise?

It should be noted that, due to practical issues like complexity, acquisition time, sensitivity to tracking errors,

and phase ambiguity, coherent detection may become expensive or infeasible in some cases. In the context of noncoherent detection, the technique of differential encoding becomes immediately relevant. Differential encoding admits simple noncoherent differential detection, which resolves phase ambiguity and requires only frequency synchronization (often more readily available than phase synchronization). Viewed from the coding perspective, performing differential encoding is essentially concatenating the original code with an accumulator, or, a recursive convolutional code in the form of 1/(1 + D).
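Concretely, the accumulator and its two-symbol differential detector are a few lines of code. A minimal sketch of our own (GF(2) hard bits, not the paper's receiver):

```python
# Differential encoding as the rate-1 accumulator 1/(1 + D) over GF(2):
# y_k = x_k + y_{k-1} (mod 2); the inverse is x_k = y_k + y_{k-1} (mod 2).
def diff_encode(bits):
    state, out = 0, []
    for b in bits:
        state ^= b
        out.append(state)
    return out

def diff_decode(bits):
    prev, out = 0, []
    for y in bits:
        out.append(y ^ prev)
        prev = y
    return out

msg = [1, 0, 1, 1, 0, 0, 1]
enc = diff_encode(msg)
assert diff_decode(enc) == msg

# A global inversion of the encoded stream (the BPSK phase ambiguity)
# corrupts at most the first recovered bit:
assert diff_decode([1 ^ y for y in enc])[1:] == msg[1:]
```

This is the structural reason why differential detection needs only frequency synchronization, not absolute phase, for BPSK.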

In this series of two-part papers, we investigate the theory and practice of LDPC codes with differential encoding. We start with a special class of differentially encoded LDPC (DE-LDPC) codes, namely, product accumulate (PA) codes (Part I), and then we move to the general case where an arbitrary (random) LDPC code is concatenated with an accumulator (Part II) [3].

Product accumulate codes, proposed in [4] and depicted in Figure 1, are a class of serially concatenated codes, where the inner code is a differential encoder, and the outer code is a parallel concatenation of two branches of single-parity


Figure 1: PA codes. (a) Code structure: the outer code is a parallel concatenation of two single-parity-check branches (SPC1, SPC2), followed by the inner accumulator 1/(1 + D). (b) Graph representation: the TPC/SPC outer code feeds, through a random interleaver π, the inner 1/(1 + D) code; x denotes the bits input to 1/(1 + D) (output from TPC/SPC), and y denotes the observations from the channel.

check (SPC) codes, or equivalently a structured LDPC code comprising degree-1 and degree-2 variable nodes. Since the accumulator can also be described using a sparse bipartite graph, a PA code is, overall, an LDPC code. Alternatively, it may also be regarded as a differentially encoded LDPC code, to emphasize the impact of the inner differential encoder. The reasons to study PA codes are multifold. First, PA codes exhibit an interesting threshold property and remarkable performance, and are well established as a class of "good" codes with rates ≥ 1/2 and performance within a few tenths of a dB from the Shannon limit [4]. Here, "good" is in the sense defined by MacKay [2]. Second, PA codes are desirable for their simplicity. They are simple to describe, simple to encode and decode, and simple enough to allow rigorous theoretical analysis [4]. Comparatively, a random LDPC code can be expensive to describe and expensive to implement in VLSI (due to the difficulty of routing and wiring). Finally, PA codes are intrinsically differentially encoded, which naturally permits noncoherent differential detection without needing additional components.

The primary interest is the noncoherent detection case, but for completeness of investigation and for comparison, we also include the case of coherent detection. Under the assumption that phase information is known, we compute Divsalar's simple bounds to benchmark the performance of PA codes at finite code lengths [5], and we evaluate iterative thresholds using density evolution (DE) to benchmark the performance of PA codes at infinite code lengths. The asymptotic thresholds reveal that PA codes are about 0.6 to 0.7 dB better than regular LDPC codes, but 0.5 dB worse than optimal irregular LDPC codes (whose maximal left degree is 50), on Rayleigh fading channels with coherent detection. Simulations at fairly long block lengths show good agreement with the analytical results.

When phase information is unavailable, the decoder/detector either proceeds without phase information (completely blind) or entails some (coarse) estimation and compensation in the decoding process. We regard either case as noncoherent detection. The presence of a differential encoder in the code structure readily lends PA codes to noncoherent differential detection. Conventional differential detection (CDD) operates on two symbol intervals and recovers the information by subtracting the phase of the previous signal sample from that of the current signal sample. It is cheap to implement, but suffers a loss of as much as 4 to 5 dB in bit error rate (BER) performance [6]. Closing the gap between CDD and differentially encoded coherent detection generally requires extending the observation window beyond two symbol intervals. The result is multisymbol differential detection (MSDD), exemplified by maximum-likelihood (ML) multisymbol detection, trellis-based multisymbol detection with per-survivor processing, and their variations [7, 8]. MSDD performs significantly better than CDD, at the cost of a considerably higher complexity which increases exponentially with the window size. To preserve the simplicity of PA codes, here we propose an efficient iterative differential detection and decoding (IDDD) receiver which is robust against various Doppler spreads and can perform, for example, within 1 dB of coherent detection on fast fading channels.

We investigate the impact of pilot spacing and filter lengths, and we show that the proposed PA IDDD receiver requires a very moderate number of pilot symbols compared to, for example, turbo codes [6]. It is quite expected that the percentage of pilots directly affects the performance, especially on very fast fading channels, but much less expected is that how these pilot symbols are inserted also makes a huge difference. Through extrinsic information transfer (EXIT) analysis [9], we show that the widespread practice of inserting pilot symbols to periodically terminate the trellis of the differential encoder [6, 7] inevitably incurs a loss in code capacity. We attribute this to what we call the "trellis segmentation" effect, namely, error events are made much shorter in the periodically terminated trellis than otherwise. We propose that pilot symbols be separated from the trellis structure, and simulation confirms the efficiency of the new method.

From analysis and simulation, it is fair to say that PA codes perform well with both coherent and noncoherent detection. In Part II of this series of papers, we will show that conventional LDPC codes, such as regular LDPC codes with uniform column weight of 3 and optimized irregular ones reported in the literature, actually perform poorly with noncoherent differential detection. We will discuss why, how, and how much we can change the situation.

The rest of the paper is organized as follows. Section 2 introduces PA codes and the channel model. Section 3 analyzes the coherently detected PA codes on fading channels


using Divsalar's simple bounds and iterative thresholds. Section 4 discusses noncoherent detection and decoding of PA codes and performs EXIT analysis. Finally, Section 5 summarizes the paper.

2. PA CODES AND CHANNEL MODEL

2.1. Channel model

We consider binary phase shift-keying (BPSK) signaling ($0 \to +1$, $1 \to -1$) over flat Rayleigh fading channels. Assuming proper sampling of the outputs from the matched filter, the received discrete-time baseband signal can be modeled as $r_k = \alpha_k e^{j\theta_k} s_k + n_k$, where $s_k$ is the BPSK-modulated signal and $n_k$ is the i.i.d. complex AWGN with zero mean and variance $\sigma^2 = N_0/2$ in each dimension. The fading amplitude $\alpha_k$ is modeled as a normalized Rayleigh random variable with $E[\alpha_k^2] = 1$ and pdf $p_A(\alpha_k) = 2\alpha_k \exp(-\alpha_k^2)$ for $\alpha_k > 0$, and the fading phase $\theta_k$ is uniformly distributed over $[0, 2\pi]$.

For fully interleaved channels, the $\alpha_k$'s and $\theta_k$'s are independent for different time indexes $k$. For insufficiently interleaved channels, they are correlated. We use Jakes' isotropic scattering land-mobile Rayleigh channel model to describe the correlated Rayleigh process, which has autocorrelation $R_k = \frac{1}{2}J_0(2\pi k f_d T_s)$, where $f_d T_s$ is the normalized Doppler spread and $J_0(\cdot)$ is the zeroth-order Bessel function of the first kind.
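As an illustration (ours, not from the paper), the autocorrelation can be evaluated from the integral form of $J_0$ to see how quickly the fading decorrelates for different normalized Doppler spreads:

```python
from math import cos, pi, sin

# J0 via its integral form: J0(x) = (1/pi) * int_0^pi cos(x sin t) dt
# (trapezoidal rule; more than accurate enough for the arguments used here).
def j0(x, steps=2000):
    h = pi / steps
    ends = 0.5 * (cos(x * sin(0.0)) + cos(x * sin(pi)))
    return (ends + sum(cos(x * sin(i * h)) for i in range(1, steps))) * h / pi

def jakes_autocorr(k, fdTs):
    # R_k = (1/2) J0(2*pi*k*fd*Ts)
    return 0.5 * j0(2 * pi * k * fdTs)

# Slow fading (fdTs = 0.01) stays strongly correlated over 10 symbols,
# while faster fading (fdTs = 0.1) has essentially decorrelated by k = 4.
assert abs(jakes_autocorr(0, 0.01) - 0.5) < 1e-9
assert jakes_autocorr(10, 0.01) > 0.4
assert abs(jakes_autocorr(4, 0.1)) < 0.1
```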

Throughout the paper, $\theta_k$ is assumed perfectly known to the receiver/decoder in the coherent detection case, and unknown (and needing to be worked around) in the noncoherent detection case. Further, the receiver is said to have channel state information (CSI) if $\alpha_k$ is known (irrespective of $\theta_k$), and no CSI otherwise.

2.2. PA codes and decoding analysis

A product accumulate code, as illustrated in Figure 1(a), consists of an accumulator (or a differential encoder) as the inner code, and a parallel concatenation of two branches of single-parity check codes as the outer code. PA codes are decoded through a soft-iterative process where soft extrinsic information is exchanged between component decoders, conforming to the turbo principle. The outer code, modeled as a structured LDPC code, is decoded using the message-passing algorithm. The inner code, taking the convolutional form of $1/(1+D)$, may be decoded either using the trellis-based BCJR algorithm or a graph-based message-passing algorithm. The latter, thanks to the cycle-free code graph of $1/(1+D)$, performs as optimally as the BCJR algorithm but at several times lower complexity [4, 10]. Thus, the entire code can be efficiently decoded through a unified message-passing algorithm, driven by the initial log-likelihood ratio (LLR) values extracted from the channel [4]. For Rayleigh fading channels with perfect CSI, that is, $\alpha_k$ known for all $k$, the initial channel LLRs are computed using
$$L_{ch}^{\mathrm{CSI}}(s_k) = \frac{4\alpha_k}{N_0} r_k, \tag{1}$$
and for Rayleigh fading channels without CSI,
$$L_{ch}^{\mathrm{NCSI}}(s_k) = \frac{4E[\alpha_k]}{N_0} r_k, \tag{2}$$
where $E[\alpha] = \sqrt{\pi}/2$ is the mean of $\alpha$. Due to space limitations, we omit the details of the overall message-passing algorithm and refer readers to [4].
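A minimal sketch of the channel LLRs in (1)-(2) (our own illustration; it models only the real baseband with perfect phase, and the 3 dB operating point is arbitrary):

```python
import random
from math import pi, sqrt

# Channel LLRs for BPSK (0 -> +1, 1 -> -1) on a flat Rayleigh channel,
# per (1) and (2); positive LLR favors bit 0.
def llr_csi(r, alpha, N0):
    return 4.0 * alpha * r / N0

def llr_no_csi(r, N0):
    return 4.0 * (sqrt(pi) / 2.0) * r / N0   # E[alpha] = sqrt(pi)/2

random.seed(1)
N0, trials, errors = 0.5, 20000, 0           # Es/N0 = 2, i.e., 3 dB
for _ in range(trials):
    bit = random.randint(0, 1)
    s = 1.0 - 2.0 * bit
    alpha = sqrt(random.expovariate(1.0))    # normalized Rayleigh, E[a^2] = 1
    r = alpha * s + random.gauss(0.0, sqrt(N0 / 2.0))
    hard = 0 if llr_csi(r, alpha, N0) > 0 else 1
    errors += (hard != bit)
print("uncoded hard-decision BER with CSI at Es/N0 = 3 dB:", errors / trials)
```

The measured uncoded BER should sit near the textbook Rayleigh/BPSK value $\tfrac{1}{2}\left(1 - \sqrt{\bar\gamma/(1+\bar\gamma)}\right) \approx 0.09$ for average SNR $\bar\gamma = 2$.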

3. COHERENT DETECTION

This section investigates the coherent detection case on Rayleigh fading channels. We employ Divsalar's simple bounds and the iterative threshold to analyze the ensemble-average performance of PA codes, and simulate individual PA codes at short and long lengths.

3.1. Simple bounds

Union bounds are simple to compute but are rather loose at low SNRs. Divsalar's simple bound is possibly one of the best closed-form bounds [5]. Like many other tight bounds, the simple bound is based on Gallager's second bounding technique [1]. By using numerical integration instead of a Chernoff bound, and by reducing the number of codewords included in the bound, Divsalar was able to tighten the bound to overcome the cutoff-rate limitation. Since the simple bound requires knowledge of the distance spectrum, a hard-to-attain property especially for concatenated codes, it has not seen wide application. Here, the simplicity of PA codes permits an accurate computation of the ensemble-average distance spectrum (whose details can be found in [4]), and thus enables the exploitation of the simple bound.

The technique of the simple bound allows for the computation of either a maximum-likelihood (ML) threshold in the asymptotic sense [4, 5] or a performance upper bound for a given finite length. Divsalar derived the general form of the simple bound for independent Rayleigh fading channels with perfect CSI. Following a similar line of reasoning, below we extend it to the case of no CSI.

3.1.1. Gallager’s second bounding technique

Gallager's second bounding technique sets the base for many tight bounds, including the simple bounds [1]. It states that
$$\Pr(\text{error}) \le \Pr(\text{error}, r \in \mathcal{R}) + \Pr(r \notin \mathcal{R}), \tag{3}$$
where $r = \gamma\alpha s + n$ is the received codeword ($N$-dimensional noise-corrupted vector), $s$ is the transmitted codeword vector, $n$ is the noise vector whose components are i.i.d. Gaussian random variables with zero mean and unit variance, $\gamma$ is the known constant (in modulation), $\alpha$ is the $N \times N$ matrix containing the fading coefficients ($\alpha$ is an identity matrix for AWGN channels), and $\mathcal{R}$ denotes a region in the observed space around the transmitted codeword. To get a tight bound, optimization and integration are usually needed to determine a meaningful $\mathcal{R}$.


3.1.2. Divsalar's simple bound for independent Rayleigh fading channels with CSI

For Rayleigh fading channels, the decision metric is based on the minimization of the norm ‖r − γαs‖, where s, r, and α are the transmitted signal, the received signal, and the fading amplitudes in vector form, respectively, and γ is the amplitude of the transmitted signal such that γ²/2 = Es/N0.

For a good approximation of the error using (3), and for computational simplicity, the decision region R was chosen as an N-dimensional hypersphere centered at ηγαs and with radius √N R, where η and R are the parameters to be optimized [5].

When perfect CSI is available, the effect of fading can be compensated through a linear transformation on γαs. In particular, a rotation e^{jϕ} and a rescaling ζ have been shown to yield a good and analytically feasible solution [5]:

R = {r | ‖r − ζe^{jϕ}γαs‖² ≤ NR²}, (4)

which leads to the upper bound on the error probability of an (N, K, R) code [5]:

$$
P(e) \le \sum_{h=2}^{2\sqrt{N-K+1}} \min\left\{ e^{-N E(c,\delta,\rho,\beta,\kappa,\phi)},\;
e^{N\gamma_N(\delta)}\,\frac{1}{\pi}\int_0^{\pi/2}\left[\frac{\sin^2\theta}{\sin^2\theta+c}\right]^{h} d\theta \right\},
\qquad (5)
$$

where

$$
\begin{aligned}
E(c,\delta,\rho,\beta,\kappa,\phi)
={}& -\rho\gamma_N(\delta) + \frac{\rho}{2}\log\frac{\beta}{\rho}
   + \frac{1-\rho}{2}\log\frac{1-\beta}{1-\rho}
   + \rho\delta\log\bigl(1 + c(1-2\kappa\phi)\bigr) \\
&+ \rho(1-\delta)\log\bigl[1 + c\bigl(1-2\kappa\phi-(1-\kappa^2)\rho\beta\bigr)\bigr] \\
&+ (1-\rho)\log\left[1 + c\left(\frac{1-\rho(1-2\kappa\phi)}{1-\rho}
   - \frac{\bigl(1-\rho(1-\kappa)\bigr)^2}{(1-\rho)(1-\beta)}\right)\right],
\end{aligned}
\qquad (6)
$$

$$
c = \frac{\gamma^2}{2} = \frac{E_s}{N_0} = R\,\frac{E_b}{N_0}, \qquad (7)
\qquad\qquad
\delta = \frac{h}{N}, \qquad (8)
$$

$$
\gamma_N(\delta) = \gamma_N\!\left(\frac{h}{N}\right) =
\begin{cases}
\dfrac{1}{N}\log A_h, & \text{for word error rate},\\[2mm]
\dfrac{1}{N}\log\left(\displaystyle\sum_{w}\frac{w}{K}\,A_{w,h}\right), & \text{for bit error rate}.
\end{cases}
\qquad (9)
$$
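The second argument of the min in (5) involves only a one-dimensional integral over θ and can be evaluated by straightforward quadrature; a minimal numpy sketch (the rate and SNR below are illustrative choices, not values from the paper):

```python
import numpy as np

def trapezoid(y: np.ndarray, x: np.ndarray) -> float:
    # plain trapezoidal rule (portable across numpy versions)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def union_term(c: float, h: int, num: int = 2001) -> float:
    """(1/pi) * integral_0^{pi/2} [sin^2(t) / (sin^2(t) + c)]^h dt,
    the numerically integrated pairwise term for weight-h codewords in (5)."""
    t = np.linspace(0.0, np.pi / 2.0, num)
    s2 = np.sin(t) ** 2
    return trapezoid((s2 / (s2 + c)) ** h, t) / np.pi

# Example: c = Es/N0 = R * Eb/N0 with R = 0.5 at Eb/N0 = 2 dB (illustrative)
c = 0.5 * 10.0 ** (2.0 / 10.0)
terms = [union_term(c, h) for h in (2, 4, 8)]
```

As expected, the term decreases with the codeword weight h, and each value is upper-bounded by the Chernoff-style estimate 0.5·(1 + c)^{−h} that the numerical integration tightens.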

3.1.3. Extension of the simple bound to the case of no CSI

Another simple and reasonable choice of the decision region is an ellipsoid centered at ηγs, which can be obtained by rescaling each coordinate of r so as to compensate for the effect of fading:

R = {r | ‖α⁻¹r − ηγs‖² ≤ NR²}, (10)

where η and R are optimized. For independent Rayleigh channels without CSI, since accurate information on α is unavailable, we approximate α⁻¹ by (E[α])⁻¹ I = (1/0.8862) I in (10), where I is an identity matrix (for Rayleigh fading with E[α²] = 1, E[α] = √π/2 ≈ 0.8862). By replicating the computations described in [5], we obtain the upper bound on the bit error rate for independent Rayleigh channels without CSI:

$$
P(e) \le \sum_{h=2}^{2\sqrt{N-K+1}} \min\left\{ e^{-N E(c,\delta,\rho)},\;
\exp\!\left(\frac{h\upsilon N\gamma_N(\delta)}{c}\right)
\left(1 - \frac{2}{\sqrt{1+2/\upsilon}+1}\right)^{\!h} \right\},
\qquad (11)
$$

where

$$
E(c,\delta,\rho) = -\frac{1}{2}\log\left(1-\rho+\rho\, e^{2\gamma_N(\delta)}\right)
+ c\left(1 + \frac{1-\delta}{\delta}\left(1 + \frac{1-\rho}{\rho}\, e^{-2\gamma_N(\delta)}\right)\right)^{-1},
$$

$$
\rho = \left(1 + \frac{1-\beta}{\beta}\, e^{2\gamma_N(\delta)}\right)^{-1},
$$

$$
\beta = \left\{2c\,\frac{1-\delta}{\delta}\left(1-e^{-2\gamma_N(\delta)}\right)^{-1}
+ \left(\frac{1-\delta}{\delta}\right)^{2}\left[(1+c)^2-1\right]\right\}^{1/2}
- (1+c)\,\frac{1-\delta}{\delta},
$$

$$
c = \frac{E^2[\alpha]\,\gamma^2}{2} = 0.8862^2\, R\,\frac{E_b}{N_0},
\qquad
\upsilon = \sqrt{(\gamma^2/2)^2 - 1} = \sqrt{(R\,E_b/N_0)^2 - 1},
\qquad (12)
$$

and δ and γN(δ) are the same as in (8) and (9).

Please note that the above extension to the fading case with no CSI slightly loosens the simple bound, but it preserves the computational simplicity. A more sophisticated transformation could yield a tighter bound, but not necessarily a feasible analytical expression.

Figure 2 plots the simulated BER performance and the simple bound of a (1024, 512) PA code on independent Rayleigh fading channels with and without CSI. Since an optimal ML decoder is assumed and the ensemble-average distance spectrum is used in the computation, the simple bound represents the best ensemble-average performance, and may not accurately reflect the individual PA code being simulated. Nevertheless, we see that the bound is fairly tight. It provides a useful indication of the code performance at SNRs below the cutoff rate, and, at high SNRs, it joins the union bound to predict the error floor.

3.2. Threshold computation via the iterative analysis

The ML performance bound evaluated in the previous subsection factors in the finite length of a PA code ensemble,


Jing Li (Tiffany) 5

Figure 2: Divsalar simple bounds for R = 0.5 PA codes. BER versus Eb/N0 for a K = 512, R = 0.5 PA code on independent Rayleigh fading: simulation and the Divsalar bound, each with and without CSI.

but the assumption of an ML decoder may be optimistic. Below, we account for the iterative nature of the practical decoder and compute an asymptotic iterative threshold using the renowned method of density evolution [11].

A useful tool for analyzing the iterative decoding process of sparse-graph codes, density evolution examines the probability density function (pdf) of the messages exchanged in each step and can, literally speaking, track the entire decoding process. In general, we are more interested in the asymptotic SNR threshold, η, defined as the critical channel condition required for the decoding process to converge to the correct decision:

$$
\eta\;(\mathrm{dB}) = \min_{\mathrm{SNR}}\left\{\lim_{l\to\infty}\int_{-\infty}^{0} f^{(l)}_{L_y}(\zeta)\, d\zeta = 0\right\},
\qquad (13)
$$

where y = ±1 is the BPSK-modulated signal, and f^{(l)}_{L_y} denotes the pdf of the LLR information on y after the lth decoding iteration.

Tracking the density of the messages requires the computation of the initial pdf of the LLR messages from the channel, and the transformation of the message pdf's in each step of the decoding process. Although the Gaussian approximation is reported to incur only very little inaccuracy on AWGN channels [12, 13], the deviation is larger on fading channels, since the pdf of the initial LLRs from a fading channel differs visibly from a Gaussian distribution. Hence, exact density evolution is used to preserve accuracy.
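The threshold in (13) is found by a bisection over the channel parameter on a monotone convergence predicate. Since the PA-code recursion itself is deferred to the Appendix, the sketch below substitutes a stand-in predicate whose answer is well known: the (3,6)-regular LDPC ensemble on the binary erasure channel, with DE recursion x_{l+1} = ε(1 − (1 − x_l)⁵)² and threshold ε* ≈ 0.4294. The search loop is unchanged if converges() is replaced by the PA-code density evolution (with the bracketing direction reversed, since convergence improves as SNR grows):

```python
def converges(eps: float, iters: int = 2000, tol: float = 1e-9) -> bool:
    """Density-evolution recursion of the (3,6)-regular LDPC ensemble on the
    BEC: erasure probability x_{l+1} = eps * (1 - (1 - x_l)**5)**2.
    Returns True if the erasure probability is driven to (near) zero."""
    x = eps
    for _ in range(iters):
        x = eps * (1.0 - (1.0 - x) ** 5) ** 2
        if x < tol:
            return True
    return False

def threshold(lo: float = 0.0, hi: float = 1.0, steps: int = 40) -> float:
    """Bisection for the largest channel parameter at which DE still converges
    (on the BEC the channel worsens as eps grows; for an SNR search in (13)
    the bracketing direction is simply reversed)."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if converges(mid) else (lo, mid)
    return lo

eps_star = threshold()
```

The returned eps_star lands just below the analytical threshold 0.4294, the small gap being due to the finite iteration cap near the critical point.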

3.2.1. Initial LLR pdf from the channel

Hou et al. showed in [14] that the pdf of the LLRs from independent Rayleigh channels with perfect CSI is given by (assuming BPSK signaling and that the all-zero sequence is transmitted)

$$
\begin{aligned}
f^{\mathrm{CSI}}_{L_{ch},y}(\zeta)
&= \int_0^{\infty} \mathcal{N}\!\left(\frac{4\alpha^2}{N_0},\,\frac{8\alpha^2}{N_0}\right) p(\alpha)\, d\alpha\\
&= \sqrt{\frac{N_0}{4\pi}}\,\exp\!\left(-\frac{\zeta\left(\sqrt{N_0+1}-1\right)}{2}\right)
\int_0^{\infty}\exp\!\left(-\frac{\bigl(\zeta N_0/(4\alpha)-\alpha\sqrt{N_0+1}\bigr)^2}{N_0}\right) d\alpha.
\end{aligned}
\qquad (14)
$$

Using integrals from [15], we further simplify (14) to

$$
f^{\mathrm{CSI}}_{L_{ch},y}(\zeta) = \frac{N_0}{4\sqrt{1+N_0}}\,
\exp\!\left(\frac{\zeta - |\zeta|\sqrt{1+N_0}}{2}\right).
\qquad (15)
$$

For the case when CSI is not available at the receiver, we assume that the Rayleigh-faded and AWGN-corrupted signals follow a Gaussian distribution in the most probable region. The pdf of the initial messages is then derived as

$$
f^{\mathrm{NCSI}}_{L_{ch},y}(\zeta) = \frac{\Delta^2\sqrt{N_0}\,\kappa N_0}{\pi}
\left(\kappa + \sqrt{2}\,\Delta\zeta\, Q\!\left(-\frac{\Delta\zeta}{\sqrt{\pi}}\right)\right),
\qquad (16)
$$

where Δ = √N0 / (2(N0 + 1)), κ = exp(−Δ²ζ²/2π), and Q(x) = (1/√(2π)) ∫_x^∞ e^{−z²/2} dz.

3.2.2. Evolution of LLR pdf in the decoder

To track the evolution of the pdf's along the iterative process, one can either employ Monte Carlo simulation or, more accurately and more efficiently, proceed analytically through discretized density evolution. The latter is possible due to the simplicity of the code structure and of the decoding algorithm of PA codes. To keep the discussion self-contained, we summarize the major steps of the discretized density evolution of PA codes in the Appendix; for details, please refer to [4].

Using (15) for the perfect-CSI case or (16) for the no-CSI case (i.e., substituting them into (A.4) and (A.5) in the Appendix), the thresholds of PA codes on Rayleigh channels can be computed through (A.3) to (A.12) in the Appendix. The computed thresholds are a good indication of the performance limit as the code length and the number of iterations increase without bound.

Figure 3 plots the thresholds as well as the simulation results of PA codes on independent Rayleigh channels with and without CSI. We see that the analytical results are consistent with the simulation results for fairly large block sizes. Here, simulation results are evaluated after the 50th iteration. As the block size and the number of iterations continue to increase, we expect the actual performance to converge to the thresholds.

Table 1 compares the thresholds of PA codes with those of LDPC codes for several code rates. The ergodic capacity of the independent Rayleigh fading channel is also listed as a reference. We see that the thresholds of PA codes are about 0.6 dB from the channel capacity, and simulations of fairly


Table 1: Thresholds (Eb/N0 in dB) of PA codes on Rayleigh channels ((3, ρ) LDPC data by courtesy of Hou et al. [14]).

              Flat Rayleigh, CSI              Flat Rayleigh, no CSI
Rate    Capacity    PA      LDPC        Capacity    PA      LDPC
0.5     1.8         2.42    3.06        2.6         3.33    4.06
0.6     3.0         3.56    —           3.8         4.48    —
2/3     3.7         4.34    4.72        4.4         5.15    5.74

Figure 3: Thresholds computed using density evolution and simulations (data block size K = 64 K). BER versus Eb/N0 for R = 1/2 with CSI, R = 1/2 without CSI, and R = 2/3 with CSI.

large block sizes are about 0.3-0.4 dB from the thresholds. Compared to the thresholds of LDPC codes reported in [14], rate-1/2 PA codes are about 0.6-0.7 dB better (asymptotically) than (3, 6)-regular LDPC codes, but about 0.5 dB worse (asymptotically) than irregular LDPC codes. It should be noted that these irregular LDPC codes are specifically optimized for Rayleigh fading channels and have a maximum variable-node degree of 50. It is fair to say that PA codes perform on par with LDPC codes (using coherent detection).

3.3. Simulation with coherent detection

To benchmark the performance of coherently detected PA codes, several PA configurations are simulated on correlated and independent Rayleigh fading channels. In each global iteration (i.e., iteration between the inner decoder and the outer decoder), two local iterations of the outer decoding are performed. This scheduling is found to strike the best trade-off between complexity and performance (with coherent detection).

3.3.1. Coherent BPSK on independent Rayleigh channels

Figure 4 shows the performance of rate-1/2 PA codes on independent Rayleigh fading channels with and without channel state information, respectively. Bit error rates after 20, 30, and 50 (global) iterations are plotted, and data block sizes from short to large (512, 1 K, 4 K, and 64 K) are evaluated to demonstrate the interleaving gain. For comparison purposes, the corresponding channel capacities are also shown. The simulated performance degradation due to the lack of CSI is about 0.9 dB, which is consistent with the gap between the respective channel capacities.

Compared to the (3, 6)-regular LDPC codes reported in [14], the performance of this rate-1/2, codeword length N = 128 × 1024 ≈ 1.3 × 10⁵ PA code is about 0.4 and 0.25 dB better than that of regular LDPC codes of length N = 10⁵ and 10⁶ on independent Rayleigh channels. It is possible that optimized irregular LDPC codes will outperform PA codes (as indicated by their thresholds), but among regular codes, PA codes appear to be one of the best.

3.3.2. Coherent BPSK on correlated Rayleigh channels

Figure 5 shows the performance of PA codes on correlated fading channels. Perfect CSI is assumed available at the receiver, and an interleaver is placed between the PA code and the channel (to partially break up the correlation between neighboring bits). Short PA codes with rates 1/2 and 3/4 are simulated for two common fading scenarios with normalized Doppler spreads fdTs = 0.01 and 0.001, respectively. As expected, the performance deteriorates rapidly as fdTs decreases, since a slower Doppler rate brings a smaller diversity order. Due to the interleaver between the PA code and the channel, the impact of a slow Doppler rate is less severe for larger block sizes than for smaller ones. Whereas the K = 1 K PA code loses about 7 dB at BER = 10⁻⁴ as fdTs changes from 0.01 to 0.001, the loss with the K = 4 K PA code is less than 5 dB.

To illuminate how well short PA codes perform on correlated channels, we compare them with turbo codes (which are the best-known codes at short code lengths) in Figure 5. The competing turbo code has 16-state component convolutional codes whose generator polynomial is (1, 35/23)_oct and which are decoded using the log-domain BCJR algorithm. The code rate is 0.75, the data block size is 4 K, and S-random interleavers are used in both codes to lower the possible error floors. Curves are plotted for PA codes at the 10th iteration and for turbo codes at the 6th iteration. We observe that turbo codes perform about 0.6 and 0.7 dB better than PA codes for fdTs = 0.001 and 0.01, respectively. However, it should be noted that this performance gain comes at the price of considerably higher complexity. While the message-passing decoding of a rate-0.75 PA code at the 10th iteration requires about 267 operations per data bit [4], the log-domain BCJR decoding of a rate-0.75 turbo code at the 6th iteration requires as many as 9720 operations per data


Figure 4: Performance of PA codes on independent Rayleigh fading channels. Code rate 0.5, data block sizes 512, 1 K, 4 K, and 64 K; BER after 20, 30, and 50 iterations, with the Shannon limits marked. (a) With CSI; (b) without CSI.

Figure 5: Performance of PA codes on correlated Rayleigh fading channels with CSI. Data block length 4 K, normalized Doppler rates fdTs = 0.01 and 0.001, PA code rates 0.5 and 0.75, turbo code rate 0.75 with component codes (1, 35/23)_oct; 10 iterations for PA codes and 6 iterations for turbo codes.

bit, a complexity 35 times larger. Hence, PA codes remain attractive for providing good performance at low cost.

4. NONCOHERENT DETECTION OF PA CODES

This section considers noncoherent detection. The channel model of interest is a Rayleigh fading channel with correlated fading coefficients.

4.1. Iterative differential detection and decoding

PA codes are inherently differentially encoded, which makes them convenient for noncoherent differential detection. Although multiple-symbol differential detection is possible, for complexity reasons we consider a simple iterative differential detection and decoding (IDDD) receiver, whose structure is shown in Figure 6. The IDDD receiver consists of a conventional differential detector with a 2-symbol observation window (the current and the previous symbol), a phase-tracking filter, and the original PA decoder (the one used in coherent detection [4]). A trellis structure is employed to assist the detection and decoding of the inner differential code 1/(1 + D), but unlike the case of multiple-symbol detection, the trellis is not expanded and has only 2 states. Soft information is passed back and forth among the different parts of the receiver according to the turbo principle. Let x denote the input to the inner differential encoder (i.e., the output of the outer code), and let y denote the output of the differential encoder (i.e., the symbol put on the channel); see Figure 6. The differential encoder implements y_k = x_k y_{k−1} for x_k, y_k ∈ {±1} (BPSK signal mapping 0 → +1, 1 → −1). The channel reception is given by r_k = α_k e^{jθ_k} y_k + n_k, where the channel amplitudes (α_k) and phases (θ_k) are correlated, and the complex white Gaussian noise samples (n_k) are independent.

In theory, differential decoding does not require pilot symbols. In practice, however, pilot symbols are inserted periodically, even with multiple-symbol detection, to avoid catastrophic error propagation in differential decoding. This is particularly so in the fast-fading case, where the phases (θ_k) change rapidly (as will be shown later). Hence, some of the r_k's (and y_k's) in the received sequence are pilot symbols.

We use L to denote LLR information, superscript (q) to denote the qth (global) iteration, and subscripts i, o, ch, and e to denote quantities associated with the inner


Figure 6: Structure of the iterative differential detection and decoding receiver. (Block diagram: the channel output r = αe^{jθ}y + n feeds a conventional differential detector and a 1/(1 + D) inner decoder/detector; a channel estimator (filter) and the outer decoder exchange soft information through the interleaver π and de-interleaver π⁻¹.)

code, the outer code, the fading channel, and the extrinsic information, respectively.

4.1.1. IDDD receiver

Here is a sketch of how the proposed IDDD receiver operates. In the first iteration, the switch in Figure 6 is flipped up. The samples of the received symbols, r_k, are fed into the conventional differential detector, which computes u_k = Real(r_k r*_{k−1}) and subsequently the soft LLR L_ch(x_k) from u_k. Here, * denotes the complex conjugate. L_ch(x_k) is then treated as

L^(1)_{e,i}(x_k) and fed into the outer decoder, which, in return, generates L^(1)_{e,o}(x_k) and passes it to the inner decoder for use in the next detection/decoding iteration. Starting from the second iteration, the switch in Figure 6 is flipped down, and channel estimation for α_k and θ_k is performed before the "coherent" detection and decoding of the inner and outer codes. After Q iterations, a decision is made by combining the extrinsic information from both the inner and outer decoders: x̂_k = sign(L^(Q)_{e,i}(x_k) + L^(Q)_{e,o}(x_k)). In the above discussion, we have ignored the existence of the random interleaver, but it is understood that proper interleaving and de-interleaving are performed wherever needed.

4.1.2. Conventional differential detector for the first decoding iteration

With the assumption that the carrier phases are nearly constant between two neighboring symbols, the conventional differential detector (in the first iteration) computes u_k ≜ Real(r_k r*_{k−1}). A hard decision on x_k is obtained by simply checking the sign of u_k. Computing the soft information L_ch(x_k) from u_k requires knowledge of the pdf of u_k. The conditional pdf of u_k given α_k and x_k is [16]

$$
f_{U|\alpha,X}(u\,|\,\alpha,x) =
\begin{cases}
\dfrac{1}{2N_0}\exp\!\left(\dfrac{xu-\alpha^2/2}{N_0}\right), & -\infty < xu \le 0,\\[3mm]
\dfrac{1}{2N_0}\exp\!\left(\dfrac{xu-\alpha^2/2}{N_0}\right)
Q\!\left(\sqrt{\dfrac{\alpha^2}{N_0}},\,\sqrt{\dfrac{4xu}{N_0}}\right), & 0 < xu < \infty,
\end{cases}
\qquad (17)
$$

where Q(a, b) is the Marcum Q-function. It is then possible to get the true pdf of u_k using

$$
f_{U|X}(u\,|\,x) = \int_0^{\infty} f_{U|\alpha,X}(u\,|\,\alpha,x)\, f_\alpha(\alpha)\, d\alpha
= 2\int_0^{\infty} f_{U|\alpha,X}(u\,|\,\alpha,x)\, \alpha\, e^{-\alpha^2} d\alpha.
\qquad (18)
$$

Since the computation of the Marcum Q-function is slow and does not always converge at large arguments, an exact evaluation of (18), and hence the computation of L_ch(x_k), can be difficult. We propose a simple approximation which evaluates (17) with α substituted by its mean E[α]. This leads to

$$
f_{U|X}(u\,|\,x) \approx
\begin{cases}
\dfrac{1}{2N_0}\exp\!\left(\dfrac{xu-\pi/8}{N_0}\right), & -\infty < xu \le 0,\\[3mm]
\dfrac{1}{2N_0}\exp\!\left(\dfrac{xu-\pi/8}{N_0}\right)
Q\!\left(\sqrt{\dfrac{\pi}{4N_0}},\,\sqrt{\dfrac{4xu}{N_0}}\right), & 0 < xu < \infty.
\end{cases}
\qquad (19)
$$

The corresponding LLR from the channel can then be computed as

$$
L_{ch}(x_k) = \log\frac{\Pr(u_k\,|\,x_k=+1)}{\Pr(u_k\,|\,x_k=-1)}
\approx \mathrm{sign}(u_k)\left(\frac{2|u_k|}{N_0}
+ \log Q\!\left(\sqrt{\frac{\pi}{4N_0}},\,\sqrt{\frac{4|u_k|}{N_0}}\right)\right).
\qquad (20)
$$

An even more convenient compromise is to assume that u_k is Gaussian distributed, as is done in [17] and a few other papers. Under this Gaussian assumption, we get

$$
f_{U|X}(u\,|\,x) \approx \mathcal{N}\!\left(x,\; 2N_0+N_0^2\right),
\qquad (21)
$$

$$
L_{ch}(x_k) \approx \frac{2u_k}{2N_0+N_0^2}.
\qquad (22)
$$

Alternatively, instead of using conventional differential decoding in the first iteration, channel estimation followed by decoding of the inner 1/(1 + D) code can


Figure 7: Distribution of u_k = Re{r_k r*_{k−1}} in conventional differential detection (assuming "+1" transmitted), at Es/N0 = 6 dB. Curves: the true pdf f(u) (Monte Carlo), the approximation f(u | α) with E[α] substituted for α, and the Gaussian approximation N(1, 2N0 + N0²).

be used, which makes the first iteration exactly the same as subsequent iterations. This third option leads to pilot-symbol-assisted modulation (PSAM), which has a slightly higher complexity than using differential detection in the first iteration.

To see how accurate the above treatments are, we plot in Figure 7 several curves approximating the pdf of u_k. From the most sharp and asymmetric to the least, these curves are the exact pdf f_{U|X}(u | x = +1) from Monte Carlo simulations (a histogram, which can be regarded as a numerical evaluation of (18)), the "mean-α approximated" pdf from (19), and the Gaussian-approximated pdf from (21). From the figure, the Gaussian approximation does not reflect the true pdf well, but this inaccuracy turns out not to severely affect the overall IDDD performance. As shown later in Figure 13, all three treatments (Gaussian approximation, mean-α approximation, and PSAM) result in very similar decoding performance. We attribute this to the fact that the inaccuracy affects mostly the first iteration, and subsequent iterations help mitigate the loss. Thus, the Gaussian approximation still presents itself as a simple and viable approach for noncoherent differential decoding.

4.1.3. Channel estimator

The channel estimator in the IDDD receiver (Figure 6) may be implemented in several ways. Here, we use a linear filter with (2L + 1) taps to estimate the α_k's and θ_k's in the qth iteration:

$$
\hat{\alpha}^{(q)}_k\, e^{j\hat{\theta}^{(q)}_k} = \sum_{l=-L}^{L} p_l\, \hat{y}^{(q-1)}_{k-l}\, r_{k-l},
\qquad (23)
$$

where p_l denotes the coefficient of the lth filter tap, and ŷ^(q−1)_k denotes the estimate of y_k fed back from the previous iteration. For soft feedback, ŷ^(q−1)_k is computed as ŷ^(q−1)_k = tanh(L^(q−1)_{e,i}(y_k)/2), and for hard feedback, ŷ^(q−1)_k = sign(L^(q−1)_{e,i}(y_k)). The LLR message L^(q−1)_{e,i}(y_k) is generated together with L^(q−1)_{e,i}(x_k) by the inner decoder in the (q − 1)th decoding iteration (please refer to [4] for the step-by-step message-passing decoding algorithm of the 1/(1 + D) code). In the first iteration, the L^(0)_{e,i}(y_k)'s are initialized to zero for coded bits and to a large positive number (i.e., +∞) for pilot symbols.

Regarding the choice of the filter, we take a Wiener filter, since it is known to be optimal for estimating the channel gain in the minimum mean-square-error (MMSE) sense when the correlations of the fading process, R_k, are known [18]. The filter coefficients, p_{−L}, p_{−L+1}, ..., p_L, are obtained from the Wiener-Hopf equation

$$
\begin{pmatrix}
R_0-N_0 & R_1 & \cdots & R_{L-1}\\
R_1 & R_0-N_0 & \cdots & R_{L-2}\\
\vdots & \vdots & \ddots & \vdots\\
R_{L-1} & R_{L-2} & \cdots & R_0-N_0
\end{pmatrix}
\begin{pmatrix}
p_{-L}\\ p_{-(L-1)}\\ \vdots\\ p_L
\end{pmatrix}
=
\begin{pmatrix}
R_{-L}\\ R_{-L+1}\\ \vdots\\ R_L
\end{pmatrix},
\qquad (24)
$$

where R_k = (1/2)J0(2kπ fdTs). Since the computation of the p_l's from (24) involves a matrix inversion (a one-time job), it may not be computable when the matrix becomes (near-)singular, which occurs when the channel is very slowly fading. In such cases, a low-pass filter or a simple "moving average" can be used [6].
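A sketch of the tap computation for (23): build the Jakes autocorrelation R_k = (1/2)J0(2kπ fdTs), with J0 evaluated from its integral representation so that only numpy is needed, and solve the resulting Toeplitz system. One deliberate deviation is flagged in the code: it uses the conventional +N0 diagonal loading so that the system stays well conditioned, sidestepping (rather than reproducing) the near-singular behavior discussed after (24):

```python
import numpy as np

def bessel_j0(x: np.ndarray) -> np.ndarray:
    """J0 via its integral representation (1/pi) * int_0^pi cos(x sin t) dt."""
    t = np.linspace(0.0, np.pi, 2001)
    vals = np.cos(np.outer(x, np.sin(t)))
    dt = t[1] - t[0]
    return dt * (vals.sum(axis=1) - 0.5 * (vals[:, 0] + vals[:, -1])) / np.pi

def wiener_taps(L: int, fd_ts: float, N0: float) -> np.ndarray:
    """Solve for the (2L+1) filter taps given the Jakes autocorrelation
    R_k = 0.5 * J0(2*pi*k*fd_ts).  NOTE: +N0 diagonal loading (standard MMSE
    form) is used here for numerical robustness."""
    k = np.arange(0, 2 * L + 1)
    R = 0.5 * bessel_j0(2.0 * np.pi * k * fd_ts)
    A = R[np.abs(np.subtract.outer(k, k))] + N0 * np.eye(2 * L + 1)
    b = R[np.abs(np.arange(-L, L + 1))]   # R_{-L}, ..., R_L (R is even in k)
    return np.linalg.solve(A, b)

p = wiener_taps(L=5, fd_ts=0.01, N0=0.1)
```

Because both the Toeplitz matrix and the right-hand side are symmetric under index reversal, the resulting taps are symmetric, p_l = p_{−l}, as expected for a noncausal smoother.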

4.2. Analysis of pilot insertion through EXIT charts

4.2.1. EXIT charts

We perform an EXIT analysis [9] to generate further insight into PA codes and the proposed noncoherent IDDD receiver. In EXIT charts, the exchange of extrinsic information is visualized as a decoding/detection trajectory, allowing the prediction of decoding convergence and thresholds [9]. Several quantities, such as the bit error rate, the mean of the extrinsic LLR information, and the equivalent SNR value, have previously been used to depict the characteristics and relations of the component decoders, but mutual information is shown to be the most robust of all [9]. The mutual information between a binary bit y_k and its corresponding LLR value is defined as

$$
\begin{aligned}
I\bigl(Y, L(Y)\bigr)
&\triangleq \frac{1}{2}\sum_{y=\pm 1}\int_{-\infty}^{\infty} f_{L(Y)}(\eta\,|\,Y=y)\,
\log_2\frac{2 f_{L(Y)}(\eta\,|\,Y=y)}{f_{L(Y)}(\eta\,|\,Y=+1)+f_{L(Y)}(\eta\,|\,Y=-1)}\, d\eta\\
&= \int_{-\infty}^{\infty} f_{L(Y)}(\eta\,|\,Y=+1)\,
\log_2\frac{2 f_{L(Y)}(\eta\,|\,Y=+1)}{f_{L(Y)}(\eta\,|\,Y=+1)+f_{L(Y)}(-\eta\,|\,Y=+1)}\, d\eta\\
&= 1 - \int_{-\infty}^{\infty} f_{L(Y)}(\eta\,|\,Y=+1)\,\log_2\!\left(1+e^{-\eta}\right) d\eta,
\end{aligned}
\qquad (25)
$$


Figure 8: Trellis diagram of binary differential PSK with pilot insertion (pilot positions marked p). (a) Pilot symbols periodically terminate the trellis. (b) Pilot symbols are separated from the trellis structure.

where L(Y) is either the a priori information L_a(Y) or the extrinsic information L_e(Y), and f_{L(Y)}(η | Y = y) is the conditional pdf. The second equality holds when the channel is output-symmetric, such that f_{L(Y)}(η | Y = −y) = f_{L(Y)}(−η | Y = y), and the third equality holds when the received messages satisfy the consistency condition (also known as the symmetry condition): f_{L(Y)}(η | Y = y) = f_{L(Y)}(−η | Y = y) e^{yη} [11]. Note that the consistency condition is an invariant of the message-passing process on a number of channels, including the AWGN channel and the independent Rayleigh fading channel with perfect CSI; but it is not preserved on fading channels without CSI or with estimated (thus imperfect) CSI, since the initial density function evaluated in the latter cases is but an approximation of the actual pdf of the LLR messages. Thus, (25) should be used to compute the mutual information in those cases. We use the x-axis to represent the mutual information into the inner code (a priori) or from the outer code (extrinsic), denoted I_{a,i}/I_{e,o}, and the y-axis to represent the mutual information from the inner code or into the outer code, denoted I_{e,i}/I_{a,o}.
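The last line of (25) yields a one-line Monte Carlo estimator of the mutual information whenever LLR samples for Y = +1 are available. The sketch below applies it to the consistency-condition Gaussian density N(σ²/2, σ²) commonly used to model a priori inputs in EXIT analysis (the σ values are illustrative):

```python
import numpy as np

def mutual_info_mc(llr: np.ndarray) -> float:
    """Estimate I(Y; L) = 1 - E[log2(1 + exp(-L))] from LLR samples
    conditioned on Y = +1 (the last line of (25))."""
    return float(1.0 - np.mean(np.log2(1.0 + np.exp(-llr))))

rng = np.random.default_rng(2)
sigmas = [0.1, 1.0, 3.0, 8.0]          # illustrative a priori LLR "strengths"
# Consistency-condition Gaussian LLRs: mean sigma^2 / 2, variance sigma^2
infos = [mutual_info_mc(rng.normal(s * s / 2.0, s, 500_000)) for s in sigmas]
```

The estimate increases monotonically from near 0 (uninformative LLRs) toward 1 (near-perfect knowledge), which is exactly the sweep used to trace an EXIT curve.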

4.2.2. Pilot symbol insertion

A practical issue with noncoherent detection is pilot insertion. The number of pilot symbols inserted should be sufficient to maintain a reasonable track of the channel, but not excessive. Many researchers have reported that excessive pilot symbols not only cause wasteful bandwidth expansion, but actually degrade the overall performance, since the energy compensation for the rate loss due to excess pilots more than outweighs the gain obtainable from finer channel tracking. This trade-off has long been noted in the literature, but little attention has been paid to another issue of no less importance, namely, how pilots should be inserted when differential encoding or another trellis-based coding/modulation front end is used.

There exist at least two ways to insert pilot symbols in a differential encoder. The widespread approach is to periodically terminate the trellis [6, 7], as shown in Figure 8(a), such that pilot symbols are used to estimate the channel and at the same time participate in the trellis decoding. Although seemingly plausible, this turns out to be a bad strategy, since segmenting the trellis into small chunks significantly increases the number of short error events, and consequently incurs a loss in performance.

The negative effect of trellis segmentation is best illustrated by the EXIT chart in Figure 9. EXIT curves corresponding to the differential decoder with 0%, 4%, 10%, and 20% pilot insertion are plotted for two different SNR values. To eliminate the impact of other factors, the four curves in each SNR set are given the same energy per transmitted symbol, and perfect knowledge of the fading phase and amplitude is provided to all the decoders (irrespective of the number of pilot symbols). Thus, the difference between the curves in each family is due only to the difference in pilot spacing. At the left end of the curves (when the input mutual information is small), a larger number of pilot symbols corresponds to better performance (higher output mutual information). This is because, when little information is provided by the outer code, pilot symbols are the primary contributor of a priori information. However, the situation is completely reversed toward the right end of the EXIT curves. We see that more pilot symbols actually degrade the performance, the reason being that, given sufficient information from the outer code, pilot symbols no longer constitute the key source of a priori information; on the other hand, they segment the trellis and shorten error events, rendering an effect opposite to spectrum thinning and thus deteriorating the performance. The performance loss is more severe when more pilot symbols are inserted and when the code operates at a relatively low SNR level. It is worth noting, for example, that with 20% pilot insertion (a pilot spacing of 5), even provided with a perfect

Figure 9: The effect of pilot symbols segmenting the trellis on the performance of the differential decoder. EXIT curves (I_{e,i}/I_{a,o} versus I_{a,i}/I_{e,o}) of the differential decoder with 0%, 4%, 10%, and 20% pilots at Es/N0 = 4.75 dB and 0.5 dB, together with the outer-code curves of R = 0.75 and R = 0.5 PA codes; normalized Doppler rate fdTs = 0.01, perfect CSI.


mutual information from the outer code (I_{a,i} = 1, while the channel remains noisy), the trellis decoder nevertheless fails to produce sufficient output mutual information I_{e,i}. As such, the inner EXIT curve is bound to intersect the outer EXIT curve at a rather early stage of the iterative process, causing the iterative decoder to fail at a high BER level (not to mention that this EXIT curve has 20% more energy consumption than the no-pilot case).

The implication of this EXIT analysis is that the widespread approach of inserting pilot symbols as part of the trellis can cause deficiency for differential encoding (and other serially concatenated schemes with inner trellis codes). Specifically, unless the outer code is itself a capacity-achieving code at some SNR, the inner and outer EXIT curves will intersect, resulting in convergence failure and causing error floors. We observe that the more pilot symbols, the higher the error floor; and the lower the code rate (the lower the SNR), the more severe the impact. It is therefore particularly important to keep the number of pilot symbols in such schemes minimal, so that error floors do not occur too early. This analysis also suggests an alternative, and potentially better-performing, way of pilot insertion, namely, separating the pilots from the trellis so that they do not affect error events; see Figure 8(b).

It should be pointed out that the level of impact caused by trellis segmentation may be very different for different outer codes. Many (outer) codes, including single parity-check codes, block turbo codes (i.e., turbo product codes), and convolutional codes, will see a large impact, since these (outer) codes require sufficient input information in order to produce perfect output information; put another way, these codes alone are not "good" codes (good in the sense MacKay defined in [2]). However, "good" codes like LDPC codes will likely see a much smaller impact. This is because an ideal LDPC code has an EXIT curve shaped like a box (e.g., see [3, Figure 3]), which can produce perfect output information as long as the input information is above some threshold (without requiring I_{a,i} = 1). Alternatively, one may interpret this as: ideal LDPC codes have large minimum distances and are capable of correcting short error events, including those caused by the segmentation effect.

To verify the analytical results, we simulate the performance of a rate-1/2, data block size K = 32 K PA code with different pilot insertion strategies; see Figure 10. The normalized Doppler spread is fdTs = 0.01, and error rates are evaluated after 10 decoding iterations. Solid lines represent the cases where perfect channel knowledge is available at the receiver, and dashed lines represent the case where noncoherent detection is used. Comparing the solid curves, we see that a drastic performance gap results from the different pilot insertion strategies. In this specific case, by segmenting the trellis every 10 symbols, trellis-segmented pilot insertion loses more than 3 dB at a BER of 10⁻⁴ compared to the alternative. The dashed curve corresponds to the same PA code noncoherently detected via the IDDD receiver discussed before, where 10% pilot symbols are inserted using the strategy of Figure 8(b) and where an 81-tap Wiener filter is used to estimate the channel. It is interesting to note that if one overlooks the impact of pilot insertion strategies, one might arrive

Figure 10: Performance of PA codes with different pilot insertion strategies. Normalized Doppler rate fdTs = 0.01, code rate 0.5, data block size 32 K, 0% or 10% pilot insertion, 10 iterations.

at the paradoxical result that noncoherent detection (dashed line) performs (noticeably) better than coherent detection (rightmost solid line)!

4.3. Impact of the pilot symbol spacing and filter length

We now investigate how the number of pilot symbols and the length of the estimation filter affect the performance of noncoherent detection. Figure 11 illustrates the impact of different pilot spacings on the BER performance over fast fading channels where the normalized Doppler spread takes fdTs = 0.05, 0.02, or 0.01. We observe the following: (1) The IDDD receiver is rather robust across different Doppler rates. (2) Very small pilot spacing, such as fewer than 6 symbols, is undesirable, since the additional energy it consumes more than outweighs any gain it may bring. (3) The code performance at high Doppler rates is more sensitive to pilot spacing than that at lower Doppler rates. At the normalized Doppler rate of 0.01 (already fast fading), noncoherently detected PA codes tolerate pilot spacings as small as 6 symbols and as large as 45 to 50 symbols (bandwidth considerations aside); but at the very fast Doppler rate of 0.05, pilot spacings beyond 7–9 symbols soon cause drastic performance degradation. For comparison, we also plot the case where pilot symbols periodically terminate the trellis (dashed line), which, due to trellis segmentation, experiences inferior performance when the pilot spacing is small. Compared to differentially encoded turbo codes [6], PA codes appear to require fewer pilot symbols (we note that in the study of differentially encoded turbo codes in [6], the authors terminated the trellis periodically with pilot symbols, which may have made the


12 EURASIP Journal on Wireless Communications and Networking

Figure 11: Effect of the number of pilot symbols on the performance of noncoherently detected PA codes on correlated Rayleigh channels with fdTs = 0.01. Code rate 0.75, data block size 1 K, filter length 65, 10 (global) iterations, 4 (local) iterations within the outer code of PA codes.

Figure 12: Comparison of BER performance for several noncoherent receiver strategies on correlated Rayleigh channels with fdTs = 0.01. Code rate 0.75, data block size 1 K, 4% of bandwidth expansion, filter length 65, 10 (global) iterations each with 4 (local) iterations for the outer decoding.

tolerant range of pilot spacing (at the small-spacing end) smaller than otherwise).

The impact of the length of the channel tracking filter is also studied. We observe that while the filter length affects the overall performance, the impact is limited compared to that of pilot spacing. This is consistent with what has been reported

in other studies [6] and is not a new discovery. Hence, we omit the plot.

4.4. Simulation results of noncoherent detection

The performance of noncoherently detected PA codes on fast Rayleigh fading channels is presented below. Unless otherwise indicated, the BER curves shown are after 10 global iterations, and in each global iteration, 4 to 6 local iterations of the outer code are performed. We have chosen these parameters on the basis of a set of simulations and a trade-off between performance and complexity.

4.4.1. Noncoherent detection of PA codes with different receiver strategies

We compare the BER performance of 4 types of IDDD strategies for a K = 1 K, R = 3/4 PA code on a fdTs = 0.01 Rayleigh fading channel in Figure 12. "IDDD-1" uses conventional differential detection with the Gaussian approximation (22) to compute Lch(xk) in the first iteration, and soft feedback of yk in all iterations to assist channel estimation; "IDDD-2" uses conventional differential detection with the "mean-α" approximation (20) in the first iteration and soft feedback in all iterations; "IDDD-3" is PSAM with soft feedback; and "IDDD-4" is PSAM with hard feedback. In all cases, 4% pilot symbols are inserted and the curves shown are after 10 iterations. Different decoding strategies in the first iteration do not affect the performance much, and the performance is not very sensitive to hard or soft feedback either. Although not shown, simulations of a long PA code (K = 48 K) of the same (high) rate (R = 3/4) reveal a similar phenomenon. It is possible, however, that other codes may be more sensitive to the difference in decoding strategies, especially the difference in the feedback information [6].

4.4.2. Comparison of noncoherent detection with coherent detection

Figure 13 shows the performance of rate-3/4 PA codes after 10 iterations on fast Rayleigh fading channels with Doppler rate fdTs = 0.01. A short block size of 1 K and a large block size of 48 K are evaluated. In each case, a family of 5 BER-versus-Eb/N0 curves, accounting for the rate loss due to pilot insertion, is plotted. The three leftmost curves are the ideal coherent case with knowledge of fading amplitudes and phases provided to the receiver, and the two rightmost curves are the noncoherent case where IDDD is used to track amplitudes and phases. In both the coherent and the noncoherent case, trellis segmentation incurs a small performance loss, but since the pilot spacing is not very small (every 25 symbols), the effect is not as drastic as in Figure 10. The noncoherent cases are about 1 dB and 0.55 dB away from the ideal coherent case at a BER of 10−4

for block sizes of 48 K and 1 K, respectively. This satisfying performance is achieved with only 4% pilot insertion and a very low-complexity IDDD receiver.


Jing Li (Tiffany) 13

Figure 13: Comparison of BER performance for several transmission/reception strategies for PA codes of large and small block sizes on correlated Rayleigh channels with fdTs = 0.01. Code rate 0.75, data block sizes 48 K and 1 K, 4% of bandwidth expansion, filter length 65, 10 (global) iterations each with 4 (local) iterations for the outer decoding.

5. CONCLUSION

Previous work has established product accumulate codes as a class of provably "good" codes on AWGN channels, with low, linear-time complexity and performance close to the Shannon limit. This paper performs a comprehensive study of product accumulate codes on Rayleigh fading channels with both coherent and noncoherent detection. Useful analytical tools, including Divsalar's simple bounds, density evolution, and EXIT charts, are employed, and extensive simulations are conducted. It is shown that PA codes not only perform remarkably well with coherent detection, but the embedded differential encoder also makes them naturally suitable for noncoherent detection. A simple iterative differential detection and decoding (IDDD) strategy allows PA codes to perform only 1 dB away from the coherent case. Another useful finding reveals that the widespread practice of inserting pilot symbols to terminate the trellis actually incurs a performance loss compared to when pilot symbols are inserted separately from the trellis.

We conclude by proposing product accumulate codes as a promising low-cost candidate for wireless applications. The advantages of PA codes include: (i) they perform very well with coherent and noncoherent detection (especially at high rates); (ii) their performance is comparable to turbo and LDPC codes, yet PA codes require much less decoding complexity than turbo codes and much less encoding complexity and memory than random LDPC codes; and (iii) the regular structure of PA codes makes low-cost implementation in hardware possible.

APPENDIX

DISCRETIZED DENSITY EVOLUTION FOR PA CODES

Using message-passing decoding, the relevant operations on the messages (in LLR form) include the sum in the real domain and the tanh operation (also known as the check operation or ⊞ operation). When independent messages add together, the resulting pdf of the sum is the discrete convolution (denoted by ∗) of the component pdfs, which can be efficiently implemented using a fast Fourier transform (FFT). For the tanh operation on messages, define

γ = α ⊞ β ≜ Q(2 tanh^{−1}(tanh(α/2) tanh(β/2))),

where α, β, and γ are quantized messages, and Q denotes the quantization operation. The pdf of γ, f_γ, can be computed using

f_γ[k] = Σ_{(i,j): kΔ = iΔ ⊞ jΔ} f_α[i] · f_β[j], (A.1)

where Δ is the quantization interval. To simplify the notation, we denote the operation in (A.1) as f_γ = R(f_α, f_β) and, using induction on the above equation, we further denote

R^k(f_α) ≜ R(f_α, R(f_α, . . . , R(f_α, f_α) · · · )), with k − 1 nested applications of R. (A.2)

The following notations are also used:

(i) f_{Lch,y}: pdf of the messages of the received signals y obtained from the channel (see Figure 1(b));

(ii) f^{(k)}_{Lo,x}: pdf of the (a priori) messages of the input x to the inner 1/(1 + D) code in the kth iteration (obtained from the outer code in the (k − 1)th iteration) (see Figure 1(b));

(iii) f^{(k)}_{Le,x}: pdf of the (extrinsic) messages passed from the inner code to the outer code in the kth iteration;

(iv) f^{(k)}_{Le1,(·)} and f^{(k)}_{Le2,(·)}: pdfs of the extrinsic information computed from the upper and lower branches of the outer code in the kth iteration, respectively. Subscripts d and p denote data and parity bits, respectively.

Obviously, f^{(0)}_{Le2,d} = f^{(0)}_{Le2,p} = δ(0), the Kronecker delta function.

The discretized density evolution of a rate t/(t + 2) PA code can then be summarized as follows [4]:

initialization: f^{(0)}_{Lo,x} = f^{(0)}_{Le,y} = f^{(0)}_{Le1,d} = f^{(0)}_{Le2,d} = δ(0), (A.3)

inner code: f^{(k)}_{Le,y} = R(f^{(k−1)}_{Lo,x}, f_{Lch,y} ∗ f^{(k−1)}_{Le,y}), (A.4)

f^{(k)}_{Le,x} = R^2(f_{Lch,y} ∗ f^{(k)}_{Le,y}), (A.5)

inner-to-outer: f^{(k)}_{Lo,d} = f^{(k)}_{Le,x}, (A.6)

f^{(k)}_{Lo,p} = f^{(k)}_{Le,x}, (A.7)

outer code: f^{(k)}_{Le1,d} = R(f^{(k)}_{Lo,p}, R^{t−1}(f^{(k)}_{Lo,d} ∗ f^{(k−1)}_{Le2,d})), (A.8)

f^{(k)}_{Le1,p} = R^t(f^{(k)}_{Lo,d} ∗ f^{(k−1)}_{Le2,d}), (A.9)

f^{(k)}_{Le2,d} = R(f^{(k)}_{Lo,p}, R^{t−1}(f^{(k)}_{Lo,d} ∗ f^{(k)}_{Le1,d})), (A.10)

f^{(k)}_{Le2,p} = R^t(f^{(k)}_{Lo,d} ∗ f^{(k)}_{Le1,d}), (A.11)

outer-to-inner: f^{(k+1)}_{Lo,x} = [t (f^{(k)}_{Le1,d} ∗ f^{(k)}_{Le2,d}) + f^{(k)}_{Le1,p} + f^{(k)}_{Le2,p}] / (t + 2). (A.12)

Although the outer code of PA codes can be viewed as an LDPC code, it is desirable to take a serial update procedure as described above rather than a parallel one as in a conventional LDPC code, since this allows the checks corresponding to the two SPC branches to take turns updating, which leads to faster convergence [4].
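The check operation on pdfs in (A.1) can be sketched numerically. The following minimal Python illustration is our own (the uniform quantizer Q, its step Δ, and the LLR clip range are assumptions for the sketch, not values from the paper): it computes the pdf of γ = α ⊞ β by brute-force enumeration over the quantized grid, i.e., the sum in (A.1) without any table or FFT speedup.

```python
import numpy as np

# Sketch of the quantized check operation R(f_alpha, f_beta) from (A.1).
# Assumptions (not from the paper): uniform quantizer with step DELTA over
# [-LMAX, LMAX]; brute-force O(n^2) enumeration instead of an optimized method.
DELTA = 0.25
LMAX = 10.0
grid = np.arange(-LMAX, LMAX + DELTA / 2, DELTA)   # quantized LLR values

def quantize(x):
    """The operator Q: map a real LLR to the index of the nearest grid point."""
    return int(np.clip(round((x + LMAX) / DELTA), 0, len(grid) - 1))

def check_op_pdf(f_a, f_b):
    """pdf of gamma = Q(2*atanh(tanh(a/2)*tanh(b/2))), a ~ f_a, b ~ f_b independent."""
    f_g = np.zeros_like(f_a)
    for i, a in enumerate(grid):
        for j, b in enumerate(grid):
            if f_a[i] == 0.0 or f_b[j] == 0.0:
                continue                      # skip zero-probability pairs
            g = 2.0 * np.arctanh(np.tanh(a / 2.0) * np.tanh(b / 2.0))
            f_g[quantize(g)] += f_a[i] * f_b[j]
    return f_g
```

Because α ⊞ 0 = 0 and |α ⊞ β| ≤ min(|α|, |β|), a point mass at LLR 0 is absorbing and total probability mass is conserved; both make quick sanity checks for any implementation of R.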

ACKNOWLEDGMENTS

This research work is supported in part by the National Science Foundation under Grants no. CCF-0430634 and CCF-0635199, and by the Commonwealth of Pennsylvania through the Pennsylvania Infrastructure Technology Alliance (PITA).

REFERENCES

[1] R. G. Gallager, Low Density Parity Check Codes, MIT Press, Cambridge, Mass, USA, 1963.

[2] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 399–431, 1999.

[3] J. Li, "Differentially encoded LDPC codes—part II: general case and code optimization," to appear in EURASIP Journal on Wireless Communications and Networking.

[4] J. Li, K. R. Narayanan, and C. N. Georghiades, "Product accumulate codes: a class of codes with near-capacity performance and low decoding complexity," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 31–46, 2004.

[5] D. Divsalar and E. Biglieri, "Upper bounds to error probabilities of coded systems beyond the cutoff rate," in Proceedings of the IEEE International Symposium on Information Theory (ISIT '00), p. 288, Sorrento, Italy, June 2000.

[6] M. C. Valenti and B. D. Woerner, "Iterative channel estimation and decoding of pilot symbol assisted turbo codes over flat-fading channels," IEEE Journal on Selected Areas in Communications, vol. 19, no. 9, pp. 1697–1705, 2001.

[7] P. Hoeher and J. Lodge, ""Turbo DPSK": iterative differential PSK demodulation and channel decoding," IEEE Transactions on Communications, vol. 47, no. 6, pp. 837–843, 1999.

[8] M. Peleg and S. Shamai, "Iterative decoding of coded and interleaved noncoherent multiple symbol detected DPSK," Electronics Letters, vol. 33, no. 12, pp. 1018–1020, 1997.

[9] S. T. Brink, "Convergence behavior of iteratively decoded parallel concatenated codes," IEEE Transactions on Communications, vol. 49, no. 10, pp. 1727–1737, 2001.

[10] P. Robertson, E. Villebrun, and P. Hoeher, "A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain," in Proceedings of the IEEE International Conference on Communications (ICC '95), vol. 2, pp. 1009–1013, Seattle, Wash, USA, June 1995.

[11] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.

[12] S.-Y. Chung, T. J. Richardson, and R. L. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 657–670, 2001.

[13] K. Xie and J. Li, "On accuracy of Gaussian assumption in iterative analysis for LDPC codes," in Proceedings of the IEEE International Symposium on Information Theory (ISIT '06), pp. 2398–2402, Seattle, Wash, USA, July 2006.

[14] J. Hou, P. H. Siegel, and L. B. Milstein, "Performance analysis and code optimization of low density parity-check codes on Rayleigh fading channels," IEEE Journal on Selected Areas in Communications, vol. 19, no. 5, pp. 924–934, 2001.

[15] I. S. Gradshteyn and I. M. Ryzhik, Tables of Integrals, Series and Products, Academic Press, New York, NY, USA, 1980.

[16] G. L. Stuber, Principles of Mobile Communications, Kluwer Academic Publishers, Norwell, Mass, USA, 1996.

[17] M. K. Simon and M.-S. Alouini, Digital Communication over Fading Channels, John Wiley & Sons, New York, NY, USA, 2000.

[18] P. Hoeher, S. Kaiser, and P. Robertson, "Two-dimensional pilot-symbol-aided channel estimation by Wiener filtering," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '97), pp. 1845–1848, Munich, Germany, April 1997.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 367287, 10 pages
doi:10.1155/2008/367287

Research Article
Differentially Encoded LDPC Codes—Part II: General Case and Code Optimization

Jing Li (Tiffany)

Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA

Correspondence should be addressed to Jing Li (Tiffany), [email protected]

Received 19 November 2007; Accepted 6 March 2008

Recommended by Yonghui Li

This two-part series of papers studies the theory and practice of differentially encoded low-density parity-check (DE-LDPC) codes, especially in the context of noncoherent detection. Part I showed that a special class of DE-LDPC codes, product accumulate codes, performs very well with both coherent and noncoherent detection. The analysis here reveals that a conventional LDPC code, however, is not well suited to differential coding and does not, in general, deliver a desirable performance when detected noncoherently. Through extrinsic information transfer (EXIT) analysis and a modified "convergence-constraint" density evolution (DE) method developed here, we provide a characterization of the type of LDPC degree profiles that work in harmony with differential detection (or a recursive inner code in general), and demonstrate how to optimize these LDPC codes. The convergence-constraint method provides a useful extension to the conventional "threshold-constraint" method, and can match an outer LDPC code to any given inner code with the imperfectness of the inner decoder taken into consideration.

Copyright © 2008 Jing Li (Tiffany). This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

With the increasingly mature status of sparse-graph coding technology in a theoretical context, the very pervasive scope of their well-proven practical applications, and the wide-scale availability of software radio, low-density parity-check (LDPC) codes have become and continue to be a favorable coding strategy for researchers and practitioners. Their superb performance on various channel models and with various modulation schemes has been documented in many papers. While the existing literature has shed great light on the theory and practice of LDPC codes, the investigation was largely carried out from a pure coding perspective, where the prevailing assumption is that synchronization and channel estimation are handled perfectly by the front-end receiver.

In wireless communications, accurate phase estimation may in many cases be very expensive or infeasible, which calls for noncoherent detection. Practical noncoherent detection is generally performed in one of two ways: inserting pilot symbols directly in the coded and modulated sequence to help track the channel (it is possible to insert either pilot tones or pilot symbols, but the latter is found to be

more effective and is what is of relevance to this paper), or employing differential coding. Considering that the former may result in a nontrivial expansion of bandwidth, especially on fast-changing channels, many wireless systems adopt the latter, including satellite and radio-relay communications.

The problem we wish to investigate is: LDPC codes perform remarkably well with coherent detection, but what about their performance with noncoherent detection, and noncoherent differential detection in particular? This two-part series of papers aims to generate useful insight and engineering rules. In Part I of the series [1], we considered a special class of differentially encoded LDPC (DE-LDPC) codes, product accumulate (PA) codes [2]. The outer code of a (p(t + 2), pt) PA code is a simple, structured LDPC code with left (variable) degree profile λ(x) = 1/(t + 1) + (t/(t + 1))x and right (check) degree profile ρ(x) = x^t; and the inner code is a differential encoder 1/(1 + D). We showed that, despite their simplicity, PA codes perform quite well with coherent detection as well as noncoherent differential detection [1]. This motivates us, in Part II of this series, to study the general case of differentially encoded LDPC codes. The question of how LDPC codes perform with differential coding is a worthy one [3–6], and directly relates to other


interesting problems. For example, what is the best strategy to apply LDPC codes in noncoherent detection: should differential coding be used or not? Modulation schemes such as minimum phase shift keying (MPSK) have equivalent realizations in recursive and nonrecursive forms; is one form preferred over the other in the context of LDPC coding? What other DE-LDPC configurations, besides PA codes, are good for differential coding, and how can we find them?

Since the conventional differential detector (CDD) operating on two symbol intervals incurs a nontrivial performance loss [7], and since multiple symbol differential detectors (MSDD) [8] have a rather high complexity that increases exponentially with the window size, we developed, in Part I of this series, a simple iterative differential detection and decoding (IDDD) receiver, whose structure is shown in [1, Figure 6]. The IDDD receiver comprises a CDD with a 2-symbol observation window (the current and the previous symbol), a phase-tracking Wiener filter, a message-passing decoder for the accumulator 1/(1 + D) [2], and a message-passing decoder configured for the (outer) LDPC code. The CDD, coupled with the phase-tracking unit and the 1/(1 + D) decoder, acts as the front end, or the inner decoder, of the serially concatenated system, and the succeeding LDPC decoder acts as the outer decoder. Soft reliability information in the form of log-likelihood ratios (LLR) is exchanged between the inner and the outer decoders to successively refine the decision. In the sequel, unless otherwise stated, we take the IDDD receiver as the default noncoherent receiver in our discussion of DE-LDPC codes.
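To make the inner code concrete, here is a minimal sketch (our own illustration, not the paper's implementation) of the 1/(1 + D) differential encoder and a hard-decision two-symbol CDD: because the detector works on r_k r*_{k−1}, a constant unknown carrier phase cancels out.

```python
import numpy as np

# Sketch (illustration only): differential encoding y_k = y_{k-1} XOR x_k, i.e.
# the accumulator 1/(1+D) over GF(2), followed by BPSK, and a hard-decision
# two-symbol conventional differential detector (CDD).

def diff_encode(bits):
    y = [0]                      # reference symbol (assumed here to start the recursion)
    for b in bits:
        y.append(y[-1] ^ b)      # y_k = y_{k-1} + x_k (mod 2)
    return y

def cdd(r):
    """x_hat_k = 1 iff the carrier phase flips between consecutive symbols."""
    return [int((r[k] * np.conj(r[k - 1])).real < 0) for k in range(1, len(r))]

bits = [1, 0, 1, 1, 0, 0, 1]
tx = np.array([1 - 2 * y for y in diff_encode(bits)], dtype=complex)  # BPSK
rx = np.exp(1j * 0.7) * tx       # unknown but constant carrier phase
assert cdd(rx) == bits           # recovered without any phase reference
```

With a time-varying phase θ_k, the product r_k r*_{k−1} picks up a factor e^{j(θ_k − θ_{k−1})} plus noise cross-terms, which is why the CDD alone degrades on fast fading and why the Wiener filter and iterative feedback are added around it.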

We study the convergence property of IDDD for a general DE-LDPC code through extrinsic information transfer (EXIT) charts [9, 10]. A somewhat unexpected finding is that, while a high-rate PA code yields desirable performance with noncoherent (differential) detection, a general DE-LDPC code does not. We attribute the reason to the mismatch of the convergence behavior between a conventional LDPC code and a differential decoder. This suggests that conventional LDPC codes, while an excellent choice for coherent detection, are not as desirable for noncoherent detection. It also gives rise to the question of what special LDPC codes, possibly performing poorly in the conventional scenario (such as the outer code of the PA code), may turn out right for differential modulation and detection.

One remarkable property of LDPC codes is the possibility of designing their degree profiles, through density evolution [11], to match a specific channel or a specific inner code [12–15]. To make LDPC codes work in harmony with the noncoherent differential decoder of interest, here we develop a convergence-constraint density evolution method. The conventional threshold-constraint method [11, 16] targets the best asymptotic threshold; the new method effectively captures the interaction and convergence between the inner and the outer EXIT curves through a set of "sample points." In doing so, it makes it possible to optimize LDPC codes to match an (arbitrary) inner code/modulation with the imperfectness of the inner decoder/demodulator taken into account. Our study reveals that LDPC codes may be divided into two groups. Those having a minimum left degree of ≥2 are generally suitable for a nonrecursive inner code/modulator

but not for a differential detector or any recursive inner code. On the other hand, the LDPC codes that perform well with a recursive receiver always have degree-1 (and degree-2) variable nodes. Further, when the code rate is high, these degree-1 and -2 nodes become dominant. This also explains why high-rate PA codes, whose outer code has degree-1 and degree-2 nodes only, perform remarkably with (noncoherent) differential detection [1].
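As a quick consistency check on the PA outer degree profiles, assuming the edge-perspective form λ(x) = 1/(t + 1) + (t/(t + 1))x (coefficients summing to 1) and ρ(x) = x^t, the LDPC design rate 1 − ∫ρ/∫λ works out to exactly t/(t + 2), the rate of the outer code of a (p(t + 2), pt) PA code. A small sketch (the helper name and bookkeeping are ours):

```python
from fractions import Fraction

# Sketch: design rate R = 1 - int(rho)/int(lambda) for the PA outer profile,
# with lambda(x) = 1/(t+1) + (t/(t+1))x and rho(x) = x^t (edge perspective).

def pa_outer_design_rate(t):
    int_lam = Fraction(1, t + 1) + Fraction(t, t + 1) * Fraction(1, 2)  # sum lam_i / i
    int_rho = Fraction(1, t + 1)                                        # sum rho_i / i
    return 1 - int_rho / int_lam

print(pa_outer_design_rate(6))   # -> 3/4, matching the rate-3/4 outer code used below
```

For t = 6 this also reproduces the profile λ(x) = 1/7 + (6/7)x quoted in Section 2.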

The channel model of interest here is the flat Rayleigh fading channel with additive white Gaussian noise (AWGN), the same as discussed in Part I [1]. Let rk be the noisy signal at the receiver, let sk be the binary phase shift keying (BPSK) modulated signal at the transmitter, let nk be the i.i.d. complex AWGN with zero mean and variance σ2 = N0/2 in each dimension, and let αk e^{jθk} be the fading coefficient with Rayleigh distributed amplitude αk and uniformly distributed phase θk. We have rk = αk e^{jθk} sk + nk. Throughout the paper, θk is assumed known perfectly to the receiver/decoder in the coherent detection case, and unknown (and needs to be worked around) in the noncoherent detection case. Further, the receiver is said to have channel state information (CSI) if αk is known (irrespective of θk), and no CSI otherwise.

We consider correlated channel fading coefficients (so that noncoherent detection is possible). Applying Jakes' isotropic scattering land mobile Rayleigh channel model, the autocorrelation of αk is characterized by the 0th-order Bessel function of the first kind

Rk = (1/2) J0(2πk fdTs), (1)

and the power spectral density (PSD) is given by

S(f) = P / (π √(1 − (f/fd)^2)), for |f| < fd, (2)

where fdTs is the normalized Doppler spread, f is the frequency, τ is the lag parameter, and P is a constant that depends on the average received power, given a specific antenna and the distribution of the angles of the incoming power.
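For intuition about (1), the following sketch (our own; J0 is evaluated from its integral representation, so no special-function library is needed) computes the lag-k autocorrelation for a given normalized Doppler spread:

```python
import numpy as np

# Sketch: fading autocorrelation under Jakes' model, R_k = (1/2) J0(2*pi*k*fd*Ts).
# J0(x) = (1/pi) * integral_0^pi cos(x sin t) dt, evaluated by a midpoint rule.

def bessel_j0(x, n=20001):
    t = (np.arange(n) + 0.5) * (np.pi / n)      # midpoints of [0, pi]
    return float(np.mean(np.cos(x * np.sin(t))))

def jakes_autocorr(k, fd_ts):
    """Autocorrelation of the fading process at a lag of k symbols."""
    return 0.5 * bessel_j0(2.0 * np.pi * k * fd_ts)
```

At fdTs = 0.01 adjacent symbols remain almost fully correlated (R_0 = 0.5, R_1 ≈ 0.4995), which is what makes two-symbol differential detection and pilot spacings of tens of symbols workable; the correlation decays several times faster at fdTs = 0.05.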

The rest of the paper is organized as follows. Section 2 evaluates the performance of a conventional LDPC code with noncoherent detection, and compares it with that of PA codes. Section 3 proposes the convergence-constraint method to optimize LDPC codes to match a given inner code, and in particular a differential detector. Section 4 concludes the paper.

2. CODES MATCHED TO DIFFERENTIAL CODING

Part I showed that PA codes, a special class of DE-LDPC codes, perform quite well with coherent detection as well as noncoherent detection [1]. This section examines whether or not this also holds for general DE-LDPC codes, and, if not, the subtle reasons why.

The analysis makes essential use of EXIT charts [9, 10], which are obtained through a repeated application of density evolution at different decoding stages. Although they were initially proposed solely as a visualization tool,


recent studies have revealed surprisingly elegant and useful properties of EXIT charts. Specifically, the convergence property states that, in order for the iterative decoder to converge successfully, the outer EXIT curve should stay strictly below the inner EXIT curve, leaving an open tunnel between the two curves. The area property states that the area under the EXIT curve, A = ∫₀¹ Ie dIa, corresponds to the rate of the code [10], where Ia and Ie denote the a priori (input) mutual information to and the extrinsic (output) mutual information from a particular subdecoder, respectively. When the auxiliary channel is an erasure channel and the subdecoder is an optimal one, the relation is exact; otherwise, it is a good approximation [10]. The immediate implication of these properties is that, to fully harness the capacity (achievable rate) provided by the (noncoherent) inner differential decoder, the outer code must have an EXIT curve closely matched in shape and in position to that of the inner code.
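The convergence property can be illustrated with a toy staircase computation. The two curves below are hypothetical stand-ins, not measured EXIT functions from this paper: iterating I ← f_outer(f_inner(I)) climbs through an open tunnel toward I = 1, but stalls at the first crossing point when the curves intersect.

```python
# Toy illustration of the EXIT convergence property (hypothetical curves,
# not the measured EXIT functions of this paper).

def trajectory(inner, outer, iters=300):
    """Decoding trajectory: a priori info starts at 0 and is passed back and forth."""
    i = 0.0
    for _ in range(iters):
        i = outer(inner(i))
    return i

outer = lambda x: x ** 2                 # hypothetical outer-code EXIT curve
good_inner = lambda x: 0.55 + 0.45 * x   # stays above the outer curve: open tunnel
bad_inner = lambda x: 0.35 + 0.45 * x    # crosses the outer curve: tunnel closes

print(round(trajectory(good_inner, outer), 3))  # -> 1.0 (converges)
print(round(trajectory(bad_inner, outer), 3))   # -> 0.189 (stuck at the crossing)
```

The stall point is simply the first fixed point of the composite map, which is why an intersecting pair of EXIT curves manifests as decoder failure in the discussion below.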

With this in mind, we evaluate a few examples of (DE-)LDPC codes. (The computation of EXIT charts specific to DE-LDPC codes with the IDDD receiver is discussed in [1].) We consider two configurations of the inner code:

(1) a differential decoder for 1/(1 + D); and
(2) a direct detector, that is, a BPSK detector;

and three configurations of the outer code:

(1) the outer code of a PA code, which has degree profile

λ(x) = 1/7 + (6/7)x, ρ(x) = x^7; (3)

(2) a (3,12)-regular LDPC code; and

(3) an optimized irregular LDPC code reported in [17], whose threshold is 0.6726 (about 0.0576 dB away from the AWGN capacity) and whose degree profile is

γ(x) = 0.1510x + 0.1978x^2 + 0.2201x^6 + 0.0353x^7 + 0.3958x^29, ρ(x) = x^20. (4)

All three outer codes have rate 3/4, and the channel is a correlated Rayleigh fading channel with AWGN and a normalized Doppler rate of fdTs = 0.01.

The EXIT curves, plotted in Figure 1, demonstrate that the outer code of the PA code and the differential decoder match quite well, but a conventional LDPC code, regular or irregular, will either intersect with the differential decoder curve, causing decoder failure, or leave a huge area between the curves, causing a capacity loss. On the other hand, LDPC codes, especially the (optimized) irregular ones, agree very well with the direct detector. This suggests that (conventional) LDPC codes perform better as a single code than concatenated with a recursive inner code. Put another way, an LDPC code that is optimal in the usual sense, for example, for BPSK modulation and memoryless channels, may become quite suboptimal when operated together with a recursive inner code or a recursive modulation, such

Figure 1: EXIT curves of LDPC codes, the outer code of PA codes, the differential decoder, and the direct detector of Rayleigh channels. Normalized Doppler rate 0.01, Eb/N0 = 5.32 dB, code rate 3/4, (3, 12)-regular LDPC code, and optimized irregular LDPC code with ρ(x) = x^20 and γ(x) = 0.1510x + 0.1978x^2 + 0.2201x^6 + 0.0353x^7 + 0.3958x^29.

as a differential encoder. On the other hand, not using differential coding generally requires more pilot symbols in order to track the channel well, especially in fast-fading environments. Hence, it is fair to say that (conventional) LDPC codes that boast outstanding performance under coherent detection may not be nearly as advantageous under noncoherent detection, since they either suffer a performance loss (with differential encoding) or incur a large bandwidth expansion (without differential encoding). In comparison, PA codes can make use of their (intrinsic) differential code for noncoherent detection, and therefore present a better choice for bandwidth-limited wireless applications.

Before providing simulations to confirm our findings, we note that the EXIT curves of both inner codes in Figure 1 are computed using perfect knowledge of the fading coefficients. We used this genie-aided case in the discussion to remove the artifact of coarse channel estimation and to better contrast the differences between the recursive differential detector and the nonrecursive direct detector. If the amplitude and phase information is to be estimated and handled by the inner code, as in actual noncoherent detection, then the EXIT curve of the direct detector will show a small rising slope at the left end instead of being a flat straight line all the way through, and the EXIT curve of the differential decoder will also exhibit a steeper slope at the left end.


Figure 2 plots the BER performance curves of the same three codes specified in Figure 1 on Rayleigh channels with noncoherent detection. All the codes have data block size K = 1 K and code rate 3/4. Soft feedback is used in IDDD, the normalized Doppler spread is 0.01, and 2% or 4% pilot symbols are inserted to help track the channel. The two LDPC codes are evaluated either with or without a differential inner code. From the most power efficient to the least power efficient, the curves shown are (i) the PA code with 4% pilot symbols, (ii) the PA code with 2% pilot symbols, (iii) the BPSK-coded irregular LDPC code with 4% pilot symbols, (iv) the BPSK-coded regular LDPC code with 4% pilot symbols, (v) the BPSK-coded irregular LDPC code with 2% pilot symbols, and (vi) the differentially encoded irregular LDPC code with 4% pilot symbols. It is evident that (conventional) LDPC codes suffer from a differential inner code. For example, with 4% of bandwidth expansion, the BPSK-coded irregular and regular LDPC codes perform about 0.5 and 1 dB worse than PA codes at a BER of 10−4, respectively, but the differentially encoded irregular LDPC code falls short by more than 2.2 dB. Further, while the irregular LDPC code (not differentially coded) is moderately (0.5 dB) behind the PA code with 4% pilot symbols, the gap becomes much more significant when the pilot symbols are reduced by half. For PA codes, 2% pilot symbols remain adequate to support a desirable performance, but they become insufficient to track the channel for nondifferentially encoded LDPC codes, causing a considerable performance loss and an error floor as high as a BER of 10−3. Thus, the advantages of PA codes over (conventional) LDPC codes are rather apparent, especially in cases where noncoherent detection is required and only limited bandwidth expansion is allowed.

3. CODE DESIGN FROM THE CONVERGENCE PROPERTY

3.1. Problem formulation

EXIT analysis and computer simulations in the previous section show that a conventional LDPC code does not fit differential coding, but special cases such as the outer code of PA codes do. This raises more interesting questions: what other (special) LDPC codes are also in harmony with differential encoding? What degree profiles do they have? Is it possible to characterize and optimize the degree profiles, and how?

The fundamental tool to solve these questions lies in convex optimization. In [11], the optimization of irregular LDPC degree profiles on AWGN channels was formulated as a duality-based convex optimization problem, and an iterative method termed density evolution was proposed to solve it. In [16], a Gaussian approximation was applied to the density evolution method, which reduces the problem to a linear optimization problem. Density evolution has since been exploited, in different flavors and possibly combined with differential evolution [18], to design good LDPC ensembles for a variety of communication channels and modulation schemes,

[Figure 2 (BER versus Eb/N0 in dB): Comparison of PA codes and LDPC codes on fast-fading Rayleigh channels with noncoherent detection and decoding. Solid lines: PA codes (2% and 4% pilots); dashed lines: LDPC codes (regular and irregular, 2% and 4% pilots, with and without differential decoding). Code rate 0.75, data block size 1K, filter length 65, normalized Doppler spread 0.01, 10 global iterations, and 4 (local) iterations within the LDPC codes or the outer code of the PA codes inside each global iteration.]

see, for example, [12–15] and the references therein. The results reported in these previous papers are excellent, but they almost exclusively aimed at the asymptotic threshold; namely, their cost functions were set to minimize the SNR threshold for a target code rate or, equivalently, to maximize the code rate for a target SNR threshold. This is well justified, since in these papers the primary involvement of the channel is to provide the initial LLR information to trigger the start of the density evolution process.

However, the problem we consider here is somewhat different. Our goal is to design codes that can fully achieve the capacity provided by the given inner receiver, and the noncoherent differential decoder in particular. Considering that the inner receiver, due to the lack of channel knowledge or other practical constraints, may not be an optimal receiver, it is of paramount importance to control the interaction between the inner and the outer code, that is, the convergence behavior as reflected in the matching of the shape and position of the corresponding EXIT curves. To emphasize the difference, we hereafter refer to the conventional density evolution method as the "threshold-constraint" method, and propose a "convergence-constraint" method as a useful extension to the conventional method.

The key idea of the proposed method is to sample the inner EXIT curve and design an (outer) EXIT curve that matches these sample points, or "control points." Suppose we choose a set of M control points in the EXIT plane, denoted (v1, w1), (v2, w2), ..., (vM, wM). Let To(·) be the input-output mutual information transfer function of the outer LDPC code (whose exact expression


Jing Li (Tiffany) 5

will be defined later in (17)), the optimization problem isformulated as

$$\max_{\substack{\sum_{i=1}^{D_v}\lambda_i=1,\\ \sum_{j=2}^{D_c}\rho_j=1}}\left\{\, R = 1-\frac{\sum_{j=2}^{D_c}\rho_j/j}{\sum_{i=1}^{D_v}\lambda_i/i} \;\middle|\; T_o(w_k)\ge v_k,\; k=1,2,\dots,M \right\}, \tag{5}$$

where R denotes the code rate of the outer LDPC code, and λi and ρj denote the fractions of edges that connect to variable nodes of degree i and check nodes of degree j, respectively.

The formulation in (5) assumes that the LLR messages at the input of the inner and the outer decoder are Gaussian distributed, and that the output extrinsic mutual information (MI) of an irregular LDPC code corresponds to a linear combination of the extrinsic MI from a set of regular codes. As reported in the literature, the Gaussian assumption for LLR messages is not far from reality on AWGN channels but is less accurate on Rayleigh fading channels [12]. Nevertheless, the Gaussian assumption is used for several reasons. The first is simplicity and tractability. Tracking and optimizing the exact message pdf's involves tedious computation, which is exacerbated by the fact that the proposed new method is governed by a set of control points, rather than a single control point as in the conventional method. Second, recall that computing EXIT curves inevitably uses the Gaussian approximation; thus, it seems acceptable to adopt the same approximation when shaping and positioning an EXIT curve. Finally, characterizing and representing EXIT curves using mutual information helps stabilize the process and alleviate the inaccuracy caused by the Gaussian approximation and other factors. As confirmed by many previous papers as well as this one, the optimization generates very good results in spite of the Gaussian approximation.

3.2. The optimization method

Below we detail the convergence-constraint design method formulated in (5). We conform to the notation and the graphic framework presented in [16]. Let $\lambda(x)=\sum_{i=1}^{D_v}\lambda_i x^{i-1}$ and $\rho(x)=\sum_{j=2}^{D_c}\rho_j x^{j-1}$ be the degree profiles from the edge perspective, where $D_v$ and $D_c$ are the maximum variable node and check node degrees, and $\lambda_i$ and $\rho_j$ are the fractions of edges incident to variable nodes of degree $i$ and check nodes of degree $j$. Similarly, let $\lambda'(x)=\sum_{i=1}^{D_v}\lambda'_i x^{i-1}$ and $\rho'(x)=\sum_{j=2}^{D_c}\rho'_j x^{j-1}$ be the degree profiles from the node perspective. Let $R$ be the code rate. The following relations hold:

$$\lambda'_i=\frac{\lambda_i/i}{\sum_{j=1}^{D_v}\lambda_j/j},\qquad \rho'_j=\frac{\rho_j/j}{\sum_{k=2}^{D_c}\rho_k/k},\qquad R=1-\frac{\sum_{j=2}^{D_c}\rho_j/j}{\sum_{i=1}^{D_v}\lambda_i/i}. \tag{6}$$
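The relations in (6) are straightforward to exercise numerically. The following Python sketch (illustrative helper names, not from the paper) converts an edge-perspective degree profile to the node perspective and computes the design rate:

```python
# Sketch of the relations in (6). Degree profiles are represented as
# dicts mapping degree -> edge fraction (edge perspective).

def node_perspective(edge_profile):
    """lambda'_i = (lambda_i / i) / sum_j (lambda_j / j); same for rho."""
    total = sum(frac / deg for deg, frac in edge_profile.items())
    return {deg: (frac / deg) / total for deg, frac in edge_profile.items()}

def design_rate(lam, rho):
    """R = 1 - (sum_j rho_j / j) / (sum_i lambda_i / i)."""
    return 1.0 - sum(f / d for d, f in rho.items()) / sum(f / d for d, f in lam.items())

# Regular (3,6) ensemble: lambda(x) = x^2, rho(x) = x^5 (edge perspective)
lam, rho = {3: 1.0}, {6: 1.0}
print(node_perspective(lam))   # {3: 1.0} -- every variable node has degree 3
print(design_rate(lam, rho))   # 0.5
```

The same two helpers apply to any irregular profile appearing later in the section.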

Let superscript (l) denote the lth LDPC decoding iteration, and subscripts v and c denote quantities pertaining to variable nodes and check nodes, respectively. Further, define two functions that will be useful in the discussion:

$$I(x) = 1-\int_{-\infty}^{\infty}\frac{1}{\sqrt{4\pi x}}\,e^{-(z-x)^2/4x}\,\log_2\!\left(1+e^{-z}\right)dz, \tag{7}$$

$$\phi(x)=\begin{cases} 1-\dfrac{1}{\sqrt{4\pi x}}\displaystyle\int_{-\infty}^{\infty}\tanh\frac{z}{2}\,e^{-(z-x)^2/4x}\,dz, & x>0,\\[2mm] 1, & x=0. \end{cases} \tag{8}$$

Function I(x) maps the message mean x to the corresponding mutual information (under the Gaussian assumption), and φ(x) describes how the message mean evolves under the tanh(y/2) operation, where y follows a Gaussian distribution with mean x and variance 2x.
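Both functions can be evaluated by direct numerical integration. The sketch below assumes the "consistent Gaussian" LLR model (mean x, variance 2x); the integration window of 12 standard deviations, the grid size, and the base-2 logarithm are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def mi_gauss(x, n=100_001):
    """I(x): mutual information carried by an LLR ~ N(x, 2x)."""
    if x <= 0:
        return 0.0
    s = np.sqrt(2.0 * x)                      # standard deviation of the LLR
    z = np.linspace(x - 12 * s, x + 12 * s, n)
    pdf = np.exp(-((z - x) ** 2) / (4.0 * x)) / np.sqrt(4.0 * np.pi * x)
    # log2(1 + e^{-z}) evaluated stably via logaddexp
    integrand = pdf * np.logaddexp(0.0, -z) / np.log(2.0)
    return 1.0 - np.sum(integrand) * (z[1] - z[0])

def phi(x, n=100_001):
    """phi(x) = 1 - E[tanh(z/2)] with z ~ N(x, 2x); phi(0) = 1."""
    if x <= 0:
        return 1.0
    s = np.sqrt(2.0 * x)
    z = np.linspace(x - 12 * s, x + 12 * s, n)
    pdf = np.exp(-((z - x) ** 2) / (4.0 * x)) / np.sqrt(4.0 * np.pi * x)
    return 1.0 - np.sum(np.tanh(z / 2.0) * pdf) * (z[1] - z[0])
```

As expected, mi_gauss increases from 0 toward 1 as the mean grows, while phi decreases monotonically from 1 toward 0.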

The complete design process takes a dual-constraint optimization process that progressively optimizes the variable node degree profile λ(x) and the check node degree profile ρ(x) based on each other. Despite the duality in the formulation and the steps, optimizing λ(x) is far more critical to the code performance than optimizing ρ(x), largely because the optimal check node degree profile is shown to follow the concentration rule [16]:

ρ(x) = Δxk + (1− Δ)xk+1. (9)

It is therefore common practice to preset ρ(x) according to (9) and the code rate R, and to optimize λ(x) only. For this reason, below we focus our discussion on optimizing λ(x) for a given ρ(x); interested readers can formulate the optimization of ρ(x) in a similar way.
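Presetting a concentrated ρ(x) per (9) amounts to choosing the two adjacent check degrees whose mix achieves a target average check-node degree. A small sketch (the function name and dict representation are ours):

```python
import math

def concentrated_rho(avg_check_degree):
    """Pick rho(x) = D*x^(k-1) + (1-D)*x^k, i.e., check degrees k and k+1
    (dict maps degree -> edge fraction), whose node-perspective average
    check degree equals avg_check_degree. Uses sum_j rho_j / j = 1/abar."""
    k = int(math.floor(avg_check_degree))
    if k == avg_check_degree:
        return {k: 1.0}
    target = 1.0 / avg_check_degree
    # Solve D/k + (1-D)/(k+1) = 1/abar for the degree-k edge fraction D
    delta = (target - 1.0 / (k + 1)) / (1.0 / k - 1.0 / (k + 1))
    return {k: delta, k + 1: 1.0 - delta}

# A rate-1/2 code with average column weight 3 needs average row weight 6:
print(concentrated_rho(6.0))   # {6: 1.0}, i.e., rho(x) = x^5
```

This reproduces the ρ(x) = x^5 used for the rate-0.5 design later in this section.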

3.2.1. Threshold-constraint method (optimizing λ(x))

Under the assumption that the messages passed along all the edges are i.i.d. and Gaussian distributed, the average messages that variable nodes receive from their neighboring check nodes follow a mixed Gaussian distribution. From the (l−1)th to the lth local iteration (in the LDPC decoder), the mean of the messages associated with the variable nodes, mv, evolves as

$$m_v^{(l)} = \sum_{i=2}^{D_v}\lambda_i\,\mathcal{N}\!\left(m_{v,i}^{(l)},\,2m_{v,i}^{(l)}\right) \tag{10}$$

$$\;\;\; = \sum_{i=2}^{D_v}\lambda_i\,\phi\!\left(m_0+(i-1)\sum_{j=2}^{D_c}\rho_j\,\phi^{-1}\!\left(1-\left(1-m_v^{(l-1)}\right)^{j-1}\right)\right), \tag{11}$$

where m0 denotes the mean of the initial messages received from the inner code (or the channel). Let us define

$$h_i(m_0,r) \triangleq \phi\!\left(m_0+(i-1)\sum_{j=2}^{D_c}\rho_j\,\phi^{-1}\!\left(1-(1-r)^{j-1}\right)\right),\qquad h(m_0,r) \triangleq \sum_{i=2}^{D_v}\lambda_i\, h_i(m_0,r). \tag{12}$$

Then (11) can be rewritten as

$$r_l = h\!\left(m_0,r_{l-1}\right)=\sum_{i=2}^{D_v}\lambda_i\,h_i\!\left(m_0,r_{l-1}\right). \tag{13}$$
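The recursion (13) can be run directly once φ and its inverse are available numerically. The following self-contained sketch (our own numerical choices: grid-based integration for φ, bisection for φ⁻¹, iteration counts and test means chosen for illustration) tracks r_l for the regular (3,6) ensemble:

```python
import numpy as np

def phi(x, n=10_001):
    """phi(x) = 1 - E[tanh(z/2)], z ~ N(x, 2x), by numerical integration."""
    if x <= 0:
        return 1.0
    s = np.sqrt(2.0 * x)
    z = np.linspace(x - 12 * s, x + 12 * s, n)
    pdf = np.exp(-((z - x) ** 2) / (4.0 * x)) / np.sqrt(4.0 * np.pi * x)
    return 1.0 - np.sum(np.tanh(z / 2.0) * pdf) * (z[1] - z[0])

def phi_inv(y, lo=1e-9, hi=400.0):
    """Invert the decreasing function phi by bisection."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if phi(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ga_density_evolution(m0, lam, rho, iters=120):
    """Run r_l = h(m0, r_{l-1}) from (13); r -> 0 means successful decoding."""
    r = phi(m0)
    for _ in range(iters):
        u = sum(f * phi_inv(1.0 - (1.0 - r) ** (j - 1)) for j, f in rho.items())
        r = sum(f * phi(m0 + (i - 1) * u) for i, f in lam.items())
    return r

# (3,6) ensemble: lambda(x) = x^2, rho(x) = x^5.  m0 = 3.5 is comfortably
# above the GA threshold (about m0 = 2.61), m0 = 2.0 is below it.
r_good = ga_density_evolution(3.5, {3: 1.0}, {6: 1.0})
r_bad = ga_density_evolution(2.0, {3: 1.0}, {6: 1.0})
print(r_good, r_bad)   # r_good collapses toward 0; r_bad stalls at a fixed point
```

The qualitative behavior (collapse above threshold, stall below) is exactly what the threshold-constraint condition (14) below formalizes.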


The conventional threshold-constraint density evolution guarantees that the degree profile converges asymptotically to the zero-error state at the given initial message mean m0. This is achieved by enforcing [16]

$$r > h\!\left(m_0,r\right),\quad \forall r\in\left(0,\phi\!\left(m_0\right)\right]. \tag{14}$$

Viewed from the EXIT chart, the threshold-constraint method implicitly uses a single control point (v, w) = (1, I(m0)), such that the resultant EXIT curve will stay below it.

3.2.2. Convergence-constraint method (optimizing λ(x))

The proposed convergence-constraint method extends the conventional threshold-constraint method by introducing a set of control points, which may be placed at arbitrary positions in the EXIT plane, to control the shape and the position of the EXIT curve. Each control point (v, w) ∈ [0, 1]² ensures that the EXIT curve will, at the input a priori mutual information w, produce extrinsic mutual information greater than v. This is reflected in the optimization process by changing (14) to

$$r > h\!\left(m_0,r\right),\quad \forall r\in\left(r^{*},\phi\!\left(m_0\right)\right], \tag{15}$$

where r* (≥ 0) is the threshold value that satisfies To(w) ≥ v. We can reformulate the problem as follows: for a given check node degree profile ρ(x) and a control point (v, w) in the EXIT chart, where 0 ≤ v, w ≤ 1,

$$\max_{\sum_{i=1}^{D_v}\lambda_i=1}\;\sum_{i=1}^{D_v}\frac{\lambda_i}{i},$$

$$\text{subject to: (i)}\;\sum_{i=1}^{D_v}\lambda_i=1,\qquad \text{(ii)}\;\sum_{i=1}^{D_v}\lambda_i\left(h_i\!\left(m_0,r\right)-r\right)<0,\quad \forall r\in\left(r^{*},\phi\!\left(m_0\right)\right], \tag{16}$$

where $m_0 = I^{-1}(w)$ and $r^*$ satisfies

$$T_o(w) \triangleq \sum_{i=1}^{D_v}\lambda'_i\, I\!\left(i\sum_{j=2}^{D_c}\rho_j\,\phi^{-1}\!\left(1-\left(1-r^{*}\right)^{j-1}\right)\right)\ge v. \tag{17}$$

Apparently, when v = 1 we get r* = 0, and the case reduces to that of the conventional threshold-constraint design.

Hence, given a set of M control points (v1, w1), (v2, w2), ..., (vM, wM), where 0 ≤ v1 < v2 < ··· < vM ≤ 1 and 0 ≤ w1 ≤ w2 ≤ ··· ≤ wM ≤ 1, one can combine the constraints associated with each individual control point and perform a joint optimization to control the shape and the position of the resulting EXIT curve. Specifically, when the set of control points are proper samples from the inner EXIT curve, the resultant EXIT curve represents an optimized LDPC ensemble matched to the inner code.

3.2.3. Linear programming

The basic idea of the convergence-constraint design, as discussed above, is simple. Complication arises from the fact that constraint (ii) in (16) is a nonlinear function of the λi's. Furthermore, observe that determining the optimization range, that is, computing r* from (17), requires knowledge of λ(x), which is yet to be optimized. One possible approach to overcome this chicken-and-egg dilemma is to use an approximate λ(x) in (17) to compute r*. Specifically, we propose accounting for the two lowest-degree variable nodes λ_{i1} and λ_{i2}, and approximating the degree profile as

$$\lambda(x)=\lambda_{i_1}x^{i_1-1}+\lambda_{i_2}x^{i_2-1}+O\!\left(\lambda_{i_2+1}x^{i_2}\right) \approx \lambda_{i_1}x^{i_1-1}+\left(1-\lambda_{i_1}\right)x^{i_2-1} \tag{18}$$

in (17). First, this approximate λ(x) is used only in (17) to tentatively determine r*, so that the optimization process can get started; the exact λ(x) in (16), (i) and (ii), is still to be optimized. Second, the values of i1 and λ_{i1} (or λ'_{i1}) in the approximate λ(x) are calculated in one of the following two ways.

Case 1. A conventional LDPC ensemble has i1 = 2, that is, no degree-1 variable nodes. This is because the outbound messages from degree-1 variable nodes do not improve over the message-passing process. In this case, we consider only degree-2 and degree-3 nodes (λ_{i1=2} and λ_{i2=3}), upper bound the fraction of degree-2 nodes by λ2*, and treat all the rest as degree-3 nodes. The stability condition [11, 16] states that there exists a value ξ > 0 such that, given an initial symmetric message density P0 satisfying

$$\int_{-\infty}^{0}P_0(x)\,dx<\xi,$$

the necessary and sufficient condition for density evolution to converge to the zero-error state is $\lambda'(0)\rho'(1)<e^{\gamma}$, where $\gamma \triangleq -\log\!\left(\int_{-\infty}^{\infty}P_0(x)\,e^{-x/2}\,dx\right)$. Applying the stability condition to Gaussian messages with initial mean $m_0$, we get $\gamma=m_0/4$ and $\lambda_2^{*}=e^{m_0/4}/\sum_{j=2}^{D_c}(j-1)\rho_j$, or, equivalently,

$$\lambda_2^{*}(w)=\frac{e^{I^{-1}(w)/4}}{\sum_{j=2}^{D_c}(j-1)\rho_j}. \tag{19}$$

It should be noted that not all values of wk from the M preselected control points are suitable for computing λ2* via (19). Since the stability condition ensures asymptotic convergence to the zero-error state for a given input message density, λ2 ≤ λ2*(w*) is valid and required only when the output mutual information approaches 1 at the input mutual information w*. What this implies for sampling the inner EXIT curve is that at least one control point, say, the rightmost point (vM, wM), should roughly satisfy (vM, wM) ≈ (1, wM). This value of wM is then used in (19) to compute λ2* = λ2*(wM), which is subsequently used in λ(x) ≈ λ2*x + (1 − λ2*)x² to compute r* from (17). This r* is then applied to all M control points.


[Figure 3 (factor graph of an LDPC code concatenated with a differential encoder): Defect for λ'1 > 1 − R. When the four bits associated with the solid circles (two degree-1 variable nodes, p and q, connected to the same check) flip altogether, another valid codeword results, and the decoder is unable to tell (an undetectable error).]

It is also worth mentioning that when a Gaussian approximation is used on the message pdf's, the stability condition reduces to

$$\lambda_2^{*}(w)=\frac{e^{I^{-1}(w)/4}}{\prod_{j=2}^{D_c}(j-1)^{\rho_j}}, \tag{20}$$

which is a weaker condition than (19). Since we use the Gaussian approximation primarily for complexity reduction, unnecessary applications of it should be avoided; thus (19), rather than (20), is used in our design process.

Case 2. Consider the case when an LDPC code is iteratively decoded together with a differential encoder, or another recursive inner code or modulation with memory. Since the inner code imposes another level of checks on all the variable nodes, degree-1 variable nodes in the outer LDPC code will get extrinsic information from the inner code, and their estimates will improve with decoding iterations. Thus, without loss of generality, we let the first and the second nonzero λi's be λ1 and λ2. No analytical bounds on λ1 or λ2 have been reported in the literature for this case. We propose to bound λ'1 by λ'1 ≤ 1 − R, where R is the code rate (the exact code rate depends on the optimization result, and may differ slightly from the target code rate). The rationale is that, if λ'1 > 1 − R, then there exist at least two degree-1 variable nodes, say the pth node and the qth node, which connect to the same check. When the LDPC code operates alone, these two variable nodes are apparently useless and wasteful, and can be removed altogether. When the LDPC code is combined with an inner recursive code, as shown in Figure 3, these two degree-1 variable nodes will cause a minimum distance of 4 for the entire codeword, irrespective of the code length. Using this empirical bound on λ1, we can

employ the approximation λ(x) = (1 − R) + Rx in (17), which leads to the computation of (a lower bound on) r*. Code optimization as formulated by the convergence-constraint method can thus be solved using linear programming.
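To make the linear program concrete, the sketch below solves the threshold-constraint special case of (16) (a single control point with v = 1, so r* = 0) with scipy.optimize.linprog, discretizing constraint (ii) on a grid of r values. All numerical choices (grid sizes, eps margin, Dv, the preset ρ(x) = x^5, and the bisection-based φ⁻¹) are illustrative, and the stability cap (19) on λ2 is omitted for brevity; this is a sketch of the formulation, not the author's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def phi(x, n=10_001):
    """phi(x) = 1 - E[tanh(z/2)], z ~ N(x, 2x), by numerical integration."""
    if x <= 0:
        return 1.0
    s = np.sqrt(2.0 * x)
    z = np.linspace(x - 12 * s, x + 12 * s, n)
    pdf = np.exp(-((z - x) ** 2) / (4.0 * x)) / np.sqrt(4.0 * np.pi * x)
    return 1.0 - np.sum(np.tanh(z / 2.0) * pdf) * (z[1] - z[0])

def phi_inv(y, lo=1e-9, hi=400.0):
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if phi(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimize_lambda(m0, Dv=15, rho={6: 1.0}, grid=50, eps=1e-4):
    """Maximize sum_i lambda_i / i s.t. sum_i lambda_i = 1 and, on a grid of
    r values, sum_i lambda_i * (h_i(m0, r) - r) <= -eps  (constraint (ii))."""
    degrees = list(range(2, Dv + 1))
    rs = np.linspace(1e-3, phi(m0), grid)
    u = [sum(f * phi_inv(1.0 - (1.0 - r) ** (j - 1)) for j, f in rho.items())
         for r in rs]
    # One inequality row per r sample, one column per variable degree
    A = np.array([[phi(m0 + (i - 1) * u[k]) - rs[k] for i in degrees]
                  for k in range(grid)])
    res = linprog(c=[-1.0 / i for i in degrees],      # maximize sum lambda_i / i
                  A_ub=A, b_ub=-eps * np.ones(grid),
                  A_eq=[[1.0] * len(degrees)], b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * len(degrees))
    lam = dict(zip(degrees, res.x))
    rate = 1.0 - sum(f / j for j, f in rho.items()) / sum(f / i for i, f in lam.items())
    return lam, rate

lam, rate = optimize_lambda(3.0)
print(rate)   # exceeds the rate 0.5 of the feasible regular (3,6) point
```

Adding the extra rows for each control point (vk, wk), with r restricted to (r*, φ(m0)], turns this into the full convergence-constraint program.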

It is to be expected that the choice of the control points directly affects the optimization results. The set of control points need not be large; in fact, an excessive number of control points makes the optimization process converge slowly and, at times, poorly. We suggest choosing 3 to 5 control points that reasonably characterize the shape of the inner EXIT curve. Our experiments show that the proposed method generates EXIT curves whose shape matches very well what we desire, but whose position is slightly lower, indicating that the resultant code rate is slightly pessimistic. This can be compensated for by presetting the control points slightly higher than we actually want them to be.

3.3. Optimization results and useful findings

For complexity reasons, instead of performing the dual optimization, we apply the concentration theorem in (9) and preselect a ρ(x) that makes the average column weight approximately 3. The left degree profile λ(x) is optimized through the convergence-constraint method discussed in the previous subsection. We now discuss some observations and findings from our optimization experiments.

First, the LDPC ensemble optimal for differential coding always contains degree-1 and degree-2 variable nodes. For high rates above 0.75, these nodes are dominant and, in some cases, are the only types of variable nodes in the degree profile. For medium rates around 0.5, there is also a good portion of high-degree variable nodes. Considering that the outer code of a PA code has only degree-1 and degree-2 variable nodes, λ(x) = (1 − R)/(1 + R) + (2R/(1 + R))x, where R ≥ 1/2 is the code rate, it is fair to say that PA codes are (near-)optimal at high rates, but less optimal at medium rates (the optimized LDPC ensemble contains a slightly different degree distribution than that of the PA code, but the difference is very small in either asymptotic thresholds or finite-length simulations). This is well reflected in the EXIT charts. At rate 3/4 (see Figure 1), the area between the outer code of the PA code and the inner differential code is very small, leaving not much room for improvement. In comparison, at a rate around 0.5 (see Figure 4), the area becomes much bigger, indicating that an optimized outer code could acquire more information rate for the same SNR threshold or, for the same information rate, achieve a better SNR threshold.

The optimization result for a target rate of 0.5 is shown in Figure 4. We consider an inner differential code, operating at 0.25 dB on a Rayleigh fading channel with fdTs = 0.01, decoded using the noncoherent IDDD receiver with the help of 10% pilot symbols. The optimized LDPC ensemble has code rate R = 0.5037 and degree profile

$$\lambda(x) = 0.0672 + 0.4599x + 0.0264x^{8} + 0.0495x^{9} + 0.0720x^{10} + 0.0828x^{11} + 0.0855x^{12} + 0.0807x^{13} + 0.0760x^{14},$$

$$\rho(x) = x^{5}. \tag{21}$$
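The reported rate is easy to sanity-check by applying R = 1 − (Σⱼ ρⱼ/j)/(Σᵢ λᵢ/i) from (6) to the profile in (21):

```python
# Verify the design rate of the optimized ensemble (21).
lam = {1: 0.0672, 2: 0.4599, 9: 0.0264, 10: 0.0495, 11: 0.0720,
       12: 0.0828, 13: 0.0855, 14: 0.0807, 15: 0.0760}
rho = {6: 1.0}   # rho(x) = x^5

rate = 1.0 - sum(f / j for j, f in rho.items()) / sum(f / i for i, f in lam.items())
print(round(rate, 4))   # 0.5038 -- matches the reported R = 0.5037 up to rounding
```

Note that the coefficient of x^{i-1} in λ(x) is the fraction of edges attached to degree-i variable nodes, which is why the dict keys run from 1 to 15.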

We see that the two EXIT curves match each other very well. Here the inner EXIT curve is computed through Monte Carlo simulations, with the sequences taken in blocks of N = 10^6 bits, and the power penalty due to the pilot symbols is also compensated for.

[Figure 4 (EXIT chart, Ie,i/Ia,o versus Ia,i/Ie,o): EXIT chart of a rate-0.5 LDPC ensemble optimized using the convergence-constraint method for differential coding. Normalized Doppler rate 0.01; 10% pilot symbols are assumed to assist noncoherent differential detection. Curves shown: inner differential code at 1.26 dB and at 0.25 dB (both with 10% pilots), the optimized LDPC ensemble of rate R = 0.5037, and the rate-0.5 outer code of the PA codes. Degree profile of the optimized LDPC ensemble: λ(x) = 0.0672 + 0.4599x + 0.0264x^8 + 0.0495x^9 + 0.0720x^10 + 0.0828x^11 + 0.0855x^12 + 0.0807x^13 + 0.0760x^14, ρ(x) = x^5.]

The optimized LDPC ensemble requires 0.25 − 10 log10(0.5037) = 3.2283 dB asymptotically in order for the iterative process to converge successfully. Compared to a rate-0.50 PA code, which requires 1.26 − 10 log10(0.5) = 4.2703 dB (Figure 4), the optimized LDPC ensemble is about 1.04 dB better asymptotically. However, as the tunnel between the inner and the outer EXIT curves becomes narrower, the message-passing decoder takes a larger number of iterations to arrive at the zero-error state. The increased computational complexity and processing time are the price we pay for reaching out toward the limit.
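The quoted thresholds are simple Eb/N0 conversions of the decoders' operating points at their respective code rates, via Eb/N0 = SNR − 10 log10(R):

```python
import math

ldpc = 0.25 - 10 * math.log10(0.5037)   # optimized DE-LDPC ensemble
pa = 1.26 - 10 * math.log10(0.5)        # rate-1/2 PA code

print(round(ldpc, 4), round(pa, 4), round(pa - ldpc, 2))   # 3.2283 4.2703 1.04
```

This reproduces the 3.2283 dB and 4.2703 dB figures above and the roughly 1.04 dB asymptotic gain.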

The optimized LDPC ensemble is good in the asymptotic sense, that is, with infinite or very long code lengths. In practice, we are also concerned with finite-length implementations or individual code realizations. According to the concentration rule, at long lengths all code realizations perform close to each other, and they all tend toward the asymptotic threshold as the length increases without bound. At short lengths, however, the concentration rule fails, and the performance may vary rather noticeably from one code realization to another. Good realizations have better neighborhood conditions than others, including a larger girth (achieved, e.g., through the progressive edge growth algorithm), a smaller number of short cycles, or a smaller trapping set.

Figure 5 shows simulations of the optimized rate-0.5037 LDPC code with differential encoding and noncoherent differential detection and decoding. The Rayleigh channel and the inner differential decoder (the IDDD receiver) are the same as those discussed for Figure 4. We chose a long codeword length of N = 64K to test how well the simulation agrees with the analytical threshold. As mentioned before, a large number of iterations (e.g., 100) is preferred to fully harness the coding gain, but considering the complexity and delay affordable in a practical system, we simulated only 15 iterations. In the figure, the leftmost curve corresponds to the optimized DE-LDPC code using ideal detection (perfect knowledge of the fading phases and amplitudes), but with 10% pilot symbols. These wasteful pilot symbols are included in this coherent detection case to offset the curve and permit a fair comparison with all the other noncoherently detected curves, which use 10% pilot insertion. The three circled curves to the right of this ideal-detection curve correspond to the noncoherent performance using iterative differential detection and decoding at the 5th, 10th, and 15th iteration. We see that the optimized differentially encoded LDPC code performs only about 0.3 dB worse than with coherent detection, which is very encouraging. Further, the simulated performance is only 0.75 dB away from the asymptotic threshold of 3.23 dB (discussed before), showing good agreement between theory and practice.

[Figure 5 (BER versus Eb/N0 in dB): Simulations of the optimized LDPC code with differential coding and iterative differential detection and decoding. Code rate 0.5037, codeword length (64K, 32K), normalized Doppler rate 0.01, 10% pilot insertion, degree profile λ(x) = 0.0672 + 0.4599x + 0.0264x^8 + 0.0495x^9 + 0.0720x^10 + 0.0828x^11 + 0.0855x^12 + 0.0807x^13 + 0.0760x^14 and ρ(x) = x^5; 15 (global) iterations, each with 6 (local) iterations in the outer LDPC decoding. Curves: optimized LDPC with ideal detection (10% pilots, 15 iterations); optimized LDPC with differential decoding (10% pilots, at 5, 10, and 15 iterations); PA code (10% pilots, 15 iterations); conventional LDPC with nondifferential decoding (10% pilots, 15 iterations); the analytical threshold is also marked, with gaps of 0.75 dB, 1.4 dB, and 1.5 dB between adjacent curves.]

For reference, we also plot in Figure 5 the performance of a PA code and of a conventional LDPC code without differential coding (recall that conventional LDPC codes perform worse with differential coding than without), both having code rates around 0.5 and both noncoherently detected. We see that the


PA code outperforms the conventional LDPC code by 1.5 dB,but the optimized DE-LDPC code outperforms the PA codeby another 1.4 dB!

4. CONCLUSION

Part I of this two-part series of papers [1] studied product accumulate codes, a special case of differentially encoded LDPC codes, with coherent detection and especially noncoherent detection. It showed that PA codes perform very well in both cases. Here in Part II, we generalize the study from PA codes to arbitrary differentially encoded LDPC codes.

The remarkable performance of LDPC codes with coherent detection has been extensively studied, but much less work has been carried out on noncoherently detected LDPC codes. In general, a noncoherently detected system may or may not employ differential encoding. The former leads to a differential encoding and noncoherent differential detection architecture, and the latter requires the insertion of (many) pilot symbols in order to track the (fast-changing) channel well. A rather unexpected finding here is that a conventional LDPC code actually suffers in either case: in the former because of an EXIT mismatch between the (outer) LDPC code and the (inner) differential code, and in the latter because of the large bandwidth expansion. Here, by conventional we mean an LDPC code that delivers superb performance in the usual setting with coherent detection and possibly channel state information.

Further investigation shows that it is not only possible, but highly beneficial, to optimize an LDPC code to match a differential decoder. The optimization is achieved through a new convergence-constraint density evolution method developed here. The resultant optimized degree profiles are rather unconventional, as they contain (many) degree-1 and degree-2 variable nodes. This is in sharp contrast to the conventional LDPC case (i.e., coherent detection), where degree-1 variable nodes are deemed highly undesirable.

The effectiveness of the new DE method is confirmed by the fact that the optimized DE-LDPC code gains an additional 1.4 dB and 2.9 dB over the existing PA code and the conventional LDPC code, respectively (when noncoherent detection is used). The proposed DE optimization procedure is very useful: it provides a practical way to tune the shape and the position of an EXIT curve, and can therefore match an LDPC code to virtually any front-end processor, with the imperfection of the processor taken into explicit consideration.

We conclude by stating that LDPC codes can, after all, perform very well with differential encoding (or any other recursive inner code or modulation), but the degree profiles need to be carefully (re)designed, using, for example, the convergence-constraint density evolution developed here, and one should expect the optimized degree profile to contain many degree-1 (and degree-2) variable nodes.

ACKNOWLEDGMENTS

This research work was supported in part by the National Science Foundation under Grants no. CCF-0430634 and CCF-0635199, and by the Commonwealth of Pennsylvania through the Pennsylvania Infrastructure Technology Alliance (PITA).

REFERENCES

[1] J. Li, "Differentially-encoded LDPC codes: part I—special case of product accumulate codes," to appear in EURASIP Journal on Wireless Communications and Networking.

[2] J. Li, K. R. Narayanan, and C. N. Georghiades, "Product accumulate codes: a class of codes with near-capacity performance and low decoding complexity," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 31–46, 2004.

[3] V. T. Nam, P.-Y. Kam, and Y. Xin, "LDPC codes with BDPSK and differential detection over flat Rayleigh fading channels," in Proceedings of the 50th IEEE Global Telecommunications Conference (GLOBECOM '07), pp. 3245–3249, Washington, DC, USA, November 2007.

[4] H. Tatsunami, K. Ishibashi, and H. Ochiai, "On the performance of LDPC codes with differential detection over Rayleigh fading channels," in Proceedings of the 63rd IEEE Vehicular Technology Conference (VTC '06), vol. 5, pp. 2388–2392, Melbourne, Victoria, Australia, May 2006.

[5] M. Franceschini, G. Ferrari, R. Raheli, and A. Curtoni, "Serial concatenation of LDPC codes and differential modulations," IEEE Journal on Selected Areas in Communications, vol. 23, no. 9, pp. 1758–1768, 2005.

[6] J. Mitra and L. Lampe, "Simple concatenated codes using differential PSK," in Proceedings of the 49th IEEE Global Telecommunications Conference (GLOBECOM '06), pp. 1–6, San Francisco, Calif, USA, November 2006.

[7] M. C. Valenti and B. D. Woerner, "Iterative channel estimation and decoding of pilot symbol assisted turbo codes over flat-fading channels," IEEE Journal on Selected Areas in Communications, vol. 19, no. 9, pp. 1697–1705, 2001.

[8] M. Peleg and S. Shamai, "Iterative decoding of coded and interleaved noncoherent multiple symbol detected DPSK," Electronics Letters, vol. 33, no. 12, pp. 1018–1020, 1997.

[9] S. ten Brink, "Convergence behavior of iteratively decoded parallel concatenated codes," IEEE Transactions on Communications, vol. 49, no. 10, pp. 1727–1737, 2001.

[10] A. Ashikhmin, G. Kramer, and S. ten Brink, "Extrinsic information transfer functions: model and erasure channel properties," IEEE Transactions on Information Theory, vol. 50, no. 11, pp. 2657–2673, 2004.

[11] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.

[12] J. Hou, P. H. Siegel, and L. B. Milstein, "Performance analysis and code optimization of low density parity-check codes on Rayleigh fading channels," IEEE Journal on Selected Areas in Communications, vol. 19, no. 5, pp. 924–934, 2001.

[13] A. Shokrollahi and R. Storn, "Design of efficient erasure codes with differential evolution," in Proceedings of the IEEE International Symposium on Information Theory, p. 5, Sorrento, Italy, June 2000.

[14] S. ten Brink, G. Kramer, and A. Ashikhmin, "Design of low-density parity-check codes for modulation and detection," IEEE Transactions on Communications, vol. 52, no. 4, pp. 670–678, 2004.

[15] R.-R. Chen, R. Koetter, U. Madhow, and D. Agrawal, "Joint noncoherent demodulation and decoding for the block fading channel: a practical framework for approaching Shannon capacity," IEEE Transactions on Communications, vol. 51, no. 10, pp. 1676–1689, 2003.

[16] S.-Y. Chung, T. J. Richardson, and R. L. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 657–670, 2001.

[17] http://lthcwww.epfl.ch/research/.

[18] R. Storn and K. Price, "Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces," Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.


Hindawi Publishing Corporation, EURASIP Journal on Wireless Communications and Networking, Volume 2008, Article ID 385421, 9 pages, doi:10.1155/2008/385421

Research Article: Construction and Iterative Decoding of LDPC Codes Over Rings for Phase-Noisy Channels

Sridhar Karuppasami and William G. Cowley

Institute for Telecommunications Research, University of South Australia, Mawson Lakes, SA 5095, Australia

Correspondence should be addressed to Sridhar Karuppasami, [email protected]

Received 1 November 2007; Revised 7 March 2008; Accepted 27 March 2008

Recommended by Branka Vucetic

This paper presents the construction and iterative decoding of low-density parity-check (LDPC) codes for channels affected by phase noise. The LDPC code is based on integer rings and designed to converge under phase-noisy channels. We assume that phase variations are small over short blocks of adjacent symbols. A part of the constructed code is inherently built with this knowledge and is hence able to withstand a phase rotation of 2π/M radians, where M is the number of phase symmetries in the signal set, that may occur at different observation intervals. Another part of the code estimates the phase ambiguity present in every observation interval. The code makes use of simple blind or turbo phase estimators to provide phase estimates over every observation interval. We propose an iterative decoding schedule to apply the sum-product algorithm (SPA) on the factor graph of the code for its convergence. To illustrate the new method, we present the performance results of an LDPC code constructed over Z4 with quadrature phase shift keying (QPSK) modulated signals transmitted over a static channel, but affected by phase noise, which is modeled by a Wiener (random-walk) process. The results show that the code can withstand phase noise of 2° standard deviation per symbol with small loss.

Copyright © 2008 S. Karuppasami and W. G. Cowley. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

In the past decade, a great deal of work has been done on the construction and decoding of LDPC codes [1]. In general, code construction techniques have been motivated by reduced encoding complexity and better bit-error-rate (BER) performance. The channels considered are generally either additive white Gaussian noise (AWGN) or binary erasure channels. However, many real systems are affected by phase noise (e.g., DVB-S2). The severity of the phase noise depends on the quality of the local oscillators and the symbol rate. Hence the performance of codes on channels with phase disturbances is of practical significance.

Over the past few years, iterative decoding for channels with phase disturbance has received much attention [2–7]. In [2, 3], the authors proposed algorithms that operate on a factor graph model incorporating the phase noise process. They used canonical distributions to deal with the continuous phase probability density functions. In particular, their approach based on the Tikhonov distribution yields good performance. In [4], the authors developed algorithms for noncoherent decoding of turbo-like codes over phase-noisy channels. These schemes make use of pilot symbols for either estimation or decoding. In [5], the authors showed the rotational robustness of certain codes over a constant-phase-offset channel with cycle slips present only during the initial part of the codeword.

In [6], the authors used smaller observation intervals to tackle varying frequency offset in the context of serially concatenated convolutional codes (SCCCs). They used blind and turbo phase estimators to provide a phase estimate for every sub-block. Since the phase estimates obtained from the blind phase estimator (BPE) are phase-ambiguous, each sub-block is affected by an ambiguity of 2π/M radians. By differentially encoding the sub-blocks independently, the authors tackled the phase ambiguity. However, using an inner differential encoder along with an LDPC code incurs a performance loss, and the degree distributions of the LDPC code need to be optimized [7].

The concept of smaller observation intervals in the presence of phase disturbances is attractive and offers low complexity as well. Intuitively, as the observation interval gets smaller, more phase variation may be tackled. On the other hand, phase estimators produce poorer estimates with smaller observation intervals. However, if the phase estimation error is smaller than π/M, the decoder may be able to converge correctly.

In our earlier work [8], we used sub-blocks in a binary LDPC-coded receiver to tackle residual frequency offset. The received symbol vector was split into many sub-blocks and a BPE was used to provide a phase estimate across every sub-block. We introduced the concept of "local check nodes" (LCNs) to resolve the phase ambiguity created by the BPE on the sub-blocks. Local check nodes are odd-degree check nodes connected to the variable nodes present within a single sub-block. In (1), the local check nodes correspond to the top four rows of the parity-check matrix, in which the bottom (dotted) part is connected according to a random construction. In this small example, the LCN degree (d_c^L) is three and, if the sub-block size (N_b) is six symbols, the parity-check matrix provides N_b/d_c^L = 2 LCNs to resolve the phase ambiguity in each sub-block:

    H = [ 1 1 1 0 0 0 0 0 0 0 0 0
          0 0 0 1 1 1 0 0 0 0 0 0
          0 0 0 0 0 0 1 1 1 0 0 0
          0 0 0 0 0 0 0 0 0 1 1 1
          · · · · · · · · · · · · ] .    (1)

The phase-ambiguity-resolved vector is decoded by an LDPC decoder. Turbo phase/frequency estimates (e.g., [9]) are obtained during iterations to facilitate convergence. The quality of the phase ambiguity estimate is better with more LCNs. Hence, with reduced sub-block sizes, the phase ambiguity estimate is less reliable and the code suffers performance degradation.

Following [6, 8], but with a different perspective, we addressed the problem of phase noise for BPSK signals in a binary LDPC-coded system [10]. In particular, we incorporated the observation that, even under large phase disturbances, the variation in phase over adjacent symbols is normally small. We created a set of check nodes, called "global check nodes" (GCNs), that converge irrespective of phase rotations (0 or π radians) in any sub-block. We used a BPE or TPE to provide a phase estimate in each sub-block. After the convergence of the GCNs, we used only one LCN per sub-block to resolve the phase ambiguity present in the sub-block. We found that, even under relatively large phase noise and observation intervals, the method provided good performance for BPSK signals. We did not make use of pilot symbols, and the complexity is low. However, we found that the extension of the above approach to higher-order modulations was very difficult with a binary LDPC code. In particular, with a binary LDPC code, constructing global check nodes that converge irrespective of a phase rotation (a multiple of 2π/M radians) in the sub-blocks was difficult.

This paper addresses the problem of extending the above code construction technique to higher-order signal constellations based on integer rings. Specifically, we construct LDPC codes over rings with certain constraints on the placement of edges and edge gains such that they, along with sub-block phase estimation techniques, provide good performance under phase-noisy channels with low complexity. Under a noiseless channel, we present edge constraints based on integer rings, generalized for any phase-symmetric modulation scheme, under which the convergence of the global check nodes is guaranteed in the presence of phase ambiguities in any sub-block. Similarly, we present generalized edge constraints for the local check nodes such that they are able to resolve the phase ambiguity in the sub-block. To illustrate the concepts discussed in this paper under a phase-noisy channel, we show the performance of an LDPC code constructed over Z4 with codewords mapped onto QPSK modulation, where the transmitted symbol s_k ∈ {s_k^m = e^{j((π/2)m + π/4)}}, m ∈ {0, 1, 2, 3}.

The remainder of the paper is organized as follows. In Section 2, we discuss the channel model considered in our simulations. In Section 3, we address the effects of phase ambiguity on the check nodes and discuss the construction of global and local check nodes. In Section 4, we explain the code construction and present a matrix inversion technique to obtain the generator matrix. In Section 5, we explain the receiver architecture and detail the iterative decoding schedule for the convergence of these codes. We also show the additional computational complexity required by the phase estimation process. In Section 6, we discuss the BER performance of the proposed receiver under phase noise conditions using the code constructed over Z4 for the QPSK signal set. In Section 7, we discuss the benefits of the blind phase estimator in reducing the computational complexity involved in turbo phase estimation, and also show the BER performance of the low-complexity iterative receiver with the Z4 code under phase noise conditions. We conclude in Section 8 by summarizing the results of this paper.

2. CHANNEL MODEL

An information sequence is encoded by an (N, K) nonbinary LDPC code constructed over the integer ring Z_M, where N and K represent the length and dimension of the code, and Z_M denotes the integers {0, 1, 2, ..., M − 1} under addition modulo M. The alphabet of Z_M is mapped onto complex symbols s using phase shift keying (PSK) modulation with M phase symmetries. The complex symbols are transmitted over a channel affected by carrier phase disturbance and complex additive white Gaussian noise.

Ideal timing and frame synchronization are assumed and, henceforth, all simulations assume one sample per symbol. At the receiver, after matched filtering and ideal sampling, we have

    r_k = s_k e^{jθ_k} + n_k,  k = 0, 1, ..., N_s − 1,    (2)

where s_k, r_k, θ_k, and n_k are the kth components of the vectors s, r, θ, and n, each of length N_s. The noise samples n_k have uncorrelated real and imaginary parts with zero mean and two-sided power spectral density (PSD) N_0/2.


The phase noise process θ_k is generated using the Wiener (random-walk) model described by

    θ_k = θ_{k−1} + Δ_k,  k = 1, 2, ..., N_s − 1,    (3)

where Δ_k is a white real Gaussian process with standard deviation σΔ, and θ_0 is drawn uniformly from (−π, π).
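To make the model in (2)-(3) concrete, the following sketch (our illustration, not the authors' simulation code; the function name and the Es/N0 parameterization are our assumptions) passes a short QPSK burst through the Wiener phase-noise channel:

```python
import numpy as np

def wiener_phase_channel(s, sigma_delta_deg, esn0_db, rng=None):
    """Apply the channel of (2)-(3): Wiener (random-walk) phase plus complex AWGN."""
    rng = np.random.default_rng(0) if rng is None else rng
    ns = len(s)
    # Phase increments Delta_k (k = 1..Ns-1) are white Gaussian with std sigma_delta;
    # theta_0 is drawn uniformly from (-pi, pi), per (3).
    delta = np.deg2rad(sigma_delta_deg) * rng.standard_normal(ns)
    delta[0] = 0.0
    theta = rng.uniform(-np.pi, np.pi) + np.cumsum(delta)
    # Complex AWGN with two-sided PSD N0/2 per real dimension (unit symbol energy).
    n0 = 10.0 ** (-esn0_db / 10.0)
    noise = np.sqrt(n0 / 2) * (rng.standard_normal(ns) + 1j * rng.standard_normal(ns))
    return s * np.exp(1j * theta) + noise

# QPSK symbols s_k = exp(j((pi/2)m + pi/4)), m in {0, 1, 2, 3}
m = np.array([0, 1, 2, 3, 2, 0])
s = np.exp(1j * (np.pi / 2 * m + np.pi / 4))
r = wiener_phase_channel(s, sigma_delta_deg=2.0, esn0_db=5.0)
print(r.shape)          # -> (6,)
```

Unit symbol energy is assumed, so N0 follows directly from the Es/N0 value in dB.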

Let us divide the received symbol vector r of length N_s into B sub-blocks of length N_b. Assuming small phase variations over adjacent symbols, we may approximate the phase variation on the symbols in the lth sub-block by a mean phase offset θ_l ∈ (−π, π). Similar to (2), the received sequence can be expressed as

    r_{k′} ≈ s_{k′} e^{jθ_l} + n_{k′},  l = 0, 1, ..., B − 1,    (4)

where k′ = N_b l + k, k = 0, 1, ..., N_b − 1. While the channel model in (2) is used in our simulations, we use the approximate model in (4) for code construction and receiver-side processing. The approximate phase offset over the lth sub-block, θ_l ∈ (−π, π), can be represented as the sum of an ambiguous phase offset φ_l ∈ (−π/M, π/M) and the phase ambiguity α_l ∈ {0, 2π/M, 4π/M, ..., 2(M − 1)π/M}.

The proposed receiver tackles modest to high levels of phase noise. For instance, the phase noise considered in this paper (Wiener model with σΔ of 1° and 2°) is several times larger than the phase noise mentioned in the European Space Agency model (Wiener model with σΔ = 0.3° per symbol [2]). However, due to the assumptions made in (4), the proposed receiver will not be able to tackle large amounts of phase noise, such as the Wiener model with σΔ = 6° per symbol in [2, 3].

3. EFFECT OF PHASE AMBIGUITIES ON THE CHECK NODES

In this section, we address the effect of phase rotations that are multiples of 2π/M radians on the global and local check nodes of an LDPC code constructed over Z_M. Let H_{i,j} be the elements of the parity-check matrix participating in the ith check node, such that

    ∑_{j=1}^{d_c} H_{i,j} x_j = 0 (mod M),    (5)

where d_c is the degree of the check node, x_j is the jth symbol participating in the ith check node, and the value of H_{i,j} is chosen from the nonzero elements of Z_M. In the remaining subsections, we denote the degrees of the GCN and LCN by d_c^G and d_c^L, respectively.

3.1. Global check nodes

Unlike local check nodes, the edges of the GCN are spread across many sub-blocks. Let p be the number of global check node edges connected to symbols present within one sub-block. Say all symbols in that sub-block are rotated by 2πt/M radians, where t ∈ {0, 1, ..., M − 1}. As a result, the check equation in (5) becomes

    ∑_{j=1}^{p} H_{i,j}(x_j + t) + ∑_{j=p+1}^{d_c^G} H_{i,j} x_j = ∑_{j=1}^{p} H_{i,j} t + ∑_{j=1}^{d_c^G} H_{i,j} x_j = t ∑_{j=1}^{p} H_{i,j}.    (6)

Thus, for an arbitrary integer t, (6) becomes zero only if

    ∑_{j=1}^{p} H_{i,j} = 0 (mod M).    (7)

In the case of a binary LDPC code, p should be even in order to satisfy (7). For LDPC codes over higher-order rings, p can be either odd or even depending on the values of H_{i,j}. In this work, we select the values of H_{i,j} from the set of nonzero divisors of Z_M ({1, 3} for Z4) to avoid problems during matrix inversion. As a result, p becomes even in the case of LDPC codes over integer rings, which in turn makes d_c^G even as well.

Example 1. Assume an LDPC code constructed over Z4 with B = 4 sub-blocks. Consider a degree-8 GCN whose edges are connected to two symbols per sub-block (p = 2) and whose edge gains are g = [1, 3, 1, 3, 3, 1, 1, 3]. One set of symbols that satisfies this check is x = [3, 2, 3, 1, 1, 3, 0, 1]. Let us assume that sub-blocks one and four are rotated by π/2 and π radians, respectively. The sub-block-rotated version of x is then x_r = [0, 3, 3, 1, 1, 3, 2, 3]. It can be seen that x_r still satisfies the parity-check equation with the same g. Note that each sub-block has one edge with gain "1" and another with "3", whose sum is 0 (mod 4), as required by (7).
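Example 1 can be checked numerically. The following sketch (our illustration, not code from the paper) verifies that the rotated vector x_r still satisfies the global check:

```python
M = 4                                   # code over Z4 / QPSK
g = [1, 3, 1, 3, 3, 1, 1, 3]            # edge gains of the degree-8 GCN
x = [3, 2, 3, 1, 1, 3, 0, 1]            # symbols satisfying the check
# Sub-blocks 1 and 4 rotated by pi/2 (t = 1) and pi (t = 2); with p = 2,
# consecutive pairs of edges belong to the same sub-block.
t_sub = [1, 0, 0, 2]
xr = [(xj + t_sub[j // 2]) % M for j, xj in enumerate(x)]

def check(gains, syms):
    # evaluate the parity check of (5): sum of gain * symbol, modulo M
    return sum(gi * xi for gi, xi in zip(gains, syms)) % M

print(xr, check(g, x), check(g, xr))    # -> [0, 3, 3, 1, 1, 3, 2, 3] 0 0
```

Both checks evaluate to zero because, per (7), the two gains inside each sub-block sum to 1 + 3 = 4 ≡ 0 (mod 4).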

3.2. Local check nodes

Local check nodes resolve the phase ambiguity present in a sub-block. Let the elements H_{i,j} participating in check i be selected from a single sub-block such that

    ∑_{j=1}^{d_c^L} H_{i,j} ≠ 0 (mod 2).    (8)

Alternatively, (8) states that the element ∑_{j=1}^{d_c^L} H_{i,j} is chosen from the set of nonzero divisors of Z_M, which is achieved by performing the summation modulo 2 rather than modulo M. If modulo M is used, the check node will not resolve certain phase ambiguities, as explained below.

If all the symbols x_j participating in the ith local check node are rotated by 2πt/M radians, then using (5) and (8) we can show that for every t there exists a distinct residue (mod M) which provides a solution for the phase ambiguity present on the participating symbols x_j. Considering all the operations below to be modulo M,

    ∑_{j=1}^{d_c^L} H_{i,j}(x_j + t) = ∑_{j=1}^{d_c^L} H_{i,j} x_j + ∑_{j=1}^{d_c^L} H_{i,j} t = t ∑_{j=1}^{d_c^L} H_{i,j}.    (9)


Hence t can be written as

    t = ( ∑_{j=1}^{d_c^L} H_{i,j}(x_j + t) ) · ( ∑_{j=1}^{d_c^L} H_{i,j} )^{−1}  (mod M).    (10)

In the case where ∑_{j=1}^{d_c^L} H_{i,j} does not have a multiplicative inverse in Z_M (say, ∑_{j=1}^{d_c^L} H_{i,j} equals a zero divisor), (9) is satisfied by any t ∈ {zero divisors in Z_M} and hence the phase ambiguity estimate is not unique. Thus, choosing ∑_{j=1}^{d_c^L} H_{i,j} with a multiplicative inverse in Z_M ensures phase ambiguity resolution. Further, by selecting the edge gains of the LCN from the nonzero divisors of Z_M, which are odd integers less than M, we require an odd number of edges to satisfy (8). Hence the degree of the local check node, d_c^L, is always taken to be odd in this work.

Example 2. Let us consider the code and rotations of Example 1. Let the code include a degree-3 LCN whose edges, with gains [1, 3, 1], are connected to the first sub-block. A set of symbols that satisfies this check is x = [3, 0, 1]. Due to the rotation of π/2 radians in the first sub-block, x_r = [0, 1, 2]. Using (10), we can evaluate that t = 1, which corresponds to π/2 radians.
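Example 2 corresponds to evaluating (10) directly. A minimal sketch (ours; the brute-force inverse helper is for illustration only):

```python
M = 4
g_lcn = [1, 3, 1]       # LCN edge gains; their sum 5 ≡ 1 (mod 4) has an inverse in Z4
xr = [0, 1, 2]          # hard decisions from the rotated first sub-block

def mod_inverse(a, m):
    # brute-force multiplicative inverse in Z_m (exists when a is a unit)
    return next(b for b in range(1, m) if (a * b) % m == 1)

# (10): t = (sum of gains * rotated symbols) * (sum of gains)^(-1)  (mod M)
syndrome = sum(gi * xi for gi, xi in zip(g_lcn, xr)) % M
t = (syndrome * mod_inverse(sum(g_lcn) % M, M)) % M
print(t)                # -> 1, i.e. a phase ambiguity of 2*pi*t/M = pi/2 radians
```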

4. NONBINARY CODE CONSTRUCTION

We apply the above set of principles in constructing codes that are beneficial in dealing with phase-noise channels. Similar to [11], we construct a binary code and choose the nonzero divisors of Z_M as edge gains such that the check conditions described in Section 3 are satisfied.

4.1. Code construction

Following Section 2, let us say we have B sub-blocks of length N_b. A binary parity-check matrix H of size (N − K) × N is constructed such that it involves two parts:

    H = [ H_resolving
          · · · · · · ·
          H_converging ] .    (11)

The upper (B × N) part of the matrix, called H_resolving, involves B local check nodes, in contrast to N_b/d_c LCNs as in our previous method [8], which are used to resolve the phase ambiguity in the B sub-blocks. The lower ((N − K − B) × N) part of the matrix, called H_converging, contains N − K − B check nodes whose neighbours are selected such that their convergence is independent of the phase ambiguities in the sub-blocks. We assume the degrees of all local (global) check nodes to be equal to d_c^L (d_c^G). The codes are designed to be check-biregular (i.e., with two different check degrees, d_c^L and d_c^G). However, there is no constraint on the variable node degree.

We construct the code as per the following procedure.

(1) Construction of local check nodes: the edges of the local check node are connected to the first d_c^L symbols of the sub-block for which it resolves the phase ambiguity. For example, assuming d_c^L = 3, let H_{i,j} = 1, where j corresponds to the first 3 columns of each sub-block. However, we could arbitrarily choose the set of d_c^L symbols from any part of the sub-block.

(2) Construction of global check nodes: for every symbol, the parity checks in which the symbol participates are randomly chosen based on its degree and (7). As in Example 1, every global check node participates in only two symbols from a sub-block. Care was taken to avoid short cycles after constructing every column.

To illustrate the local and global check nodes, a small parity-check matrix H is shown in (12). The first four rows, corresponding to the local check nodes (H_resolving), are shown at the top. The two rows below the local check nodes are connected globally and also have p = 2 edges connected to symbols from each sub-block. The restriction of two edges per sub-block provides better connectivity in the code. The same technique is continued to construct the remaining global check nodes in the dotted part of the matrix. The local and global check nodes shown in the first and fifth rows of the H-matrix are used in the previous examples. A portion of the Tanner graph of the H-matrix in (12) is shown in Figure 1. Local check nodes (shaded checks) and their edges (solid lines) are distinguished from the global check nodes and their edges (dash-dotted lines):

    H = [ 1 3 1 0 0 0   0 0 0 0 0 0   0 0 0 0 0 0   0 0 0 0 0 0
          0 0 0 0 0 0   1 1 1 0 0 0   0 0 0 0 0 0   0 0 0 0 0 0
          0 0 0 0 0 0   0 0 0 0 0 0   3 3 3 0 0 0   0 0 0 0 0 0
          0 0 0 0 0 0   0 0 0 0 0 0   0 0 0 0 0 0   3 1 3 0 0 0
          · · · · · · · · · · · · · · · · · · · · · · · ·
          1 0 0 3 0 0   0 1 0 3 0 0   0 3 0 1 0 0   0 0 1 0 3 0
          0 1 0 3 0 0   0 3 1 0 0 0   0 0 3 1 0 0   1 3 0 0 0 0
          · · · · · · · · · · · · · · · · · · · · · · · · ] .    (12)
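The edge-gain constraints behind (12) can be expressed as simple predicates. The sketch below (our illustration; the function names are ours) checks the first LCN row and the first GCN row of (12) against (8) and (7), respectively:

```python
import numpy as np

M, Nb, B = 4, 6, 4      # toy dimensions of (12): N = B * Nb = 24 symbols

def is_valid_gcn(row):
    # (7): the gains within every sub-block must sum to 0 (mod M)
    return all(int(row[l * Nb:(l + 1) * Nb].sum()) % M == 0 for l in range(B))

def is_valid_lcn(row, l):
    # (8): all edges inside sub-block l, with a gain sum that is odd (a unit of Z4)
    outside = np.concatenate([row[:l * Nb], row[(l + 1) * Nb:]])
    inside_sum = int(row[l * Nb:(l + 1) * Nb].sum())
    return outside.sum() == 0 and inside_sum % 2 == 1

lcn0 = np.array([1, 3, 1] + [0] * 21)                                   # first row of (12)
gcn0 = np.array([1,0,0,3,0,0, 0,1,0,3,0,0, 0,3,0,1,0,0, 0,0,1,0,3,0])   # fifth row of (12)
print(is_valid_lcn(lcn0, 0), is_valid_gcn(gcn0))                        # -> True True
```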

4.2. Some comments on encoding

[Figure 1: Tanner graph of the H-matrix in (12), illustrating local and global check nodes (B = 4 sub-blocks, p = 2 GCN edges per sub-block, LCNs of degree d_c^L = 3, GCNs of degree d_c^G = 8).]

We used the Gaussian elimination (GE) approach to obtain a systematic generator matrix. Even though the edge gains of the parity-check matrix are nonzero divisors, we encountered zero divisors ({2} in the case of Z4) during GE in the diagonal part of the matrix. To avoid this problem, we interchanged columns across the parity-check matrix so as to obtain a generator matrix (G) corresponding to the column-interchanged parity-check matrix (H′). Since we wanted to use the original H-matrix instead of H′, we created a permutation table (P) to record the columns that are interchanged during inversion. Alternate inversion techniques may avoid the use of the permutation table P.

A summary of the communication system used in the simulations is given in Figure 2. The message is encoded by the generator matrix G to produce the codeword c. The codeword c undergoes inverse permutation to produce c′. The codeword is transmitted through the composite channel. Since the permuted-encoded symbols are codewords of the original code H, the decoder decodes the codeword. The decoded codeword c′ is again permuted to give the original codeword c.

5. RECEIVER ARCHITECTURE AND ITERATIVE DECODING SCHEDULE

The receiver architecture to tackle large phase disturbances is shown in Figure 3. We used the SPA for LDPC codes over rings, similar to [12]. In the case of an AWGN channel, the SPA may be applied over the entire code for convergence. However, in the presence of phase disturbances, phase estimators provide an ambiguous phase estimate, and hence the SPA is applied only over the rotationally invariant part of the factor graph, that is, the graph involving global check nodes only.

This section discusses the application of the SPA on the factor graph of the code with a phase offset on every sub-block, such that the benefits of local and global check nodes are achieved. Thus we split the decoding into three phases, as described below.

(1) Converging phase.

(a) The likelihood vector, of length M, for the kth variable node is initialized with the channel likelihoods p(r_k | s_k = s_k^m) = (1/(2πσ²)) exp{−|r_k − s_k^m|²/(2σ²)}, where m ∈ {0, 1, ..., M − 1}, k ∈ {0, 1, ..., N_s − 1}, and σ² is the noise variance.

(b) The SPA is applied over the H_converging part of the code alone. Local check nodes are not used; the messages coming from these nodes are assigned to be equiprobable.

(c) After every d iterations, the turbo phase estimator (TPE) [9] estimates the phase offset φ_l, which is given by

    φ_l = arg( ∑_{k′} r_{k′} a*_{k′} ),    (13)

where k′, as defined in (4), indexes the kth component of the lth sub-block and a*_{k′} is the complex conjugate of the soft symbol estimate. The soft symbol estimate a_{k′} of the symbol s_{k′} is given by

    a_{k′} = ∑_{m=0}^{M−1} s_{k′}^m p(s_{k′} = s_{k′}^m | r_{k′}),    (14)

where p(s_{k′} = s_{k′}^m | r_{k′}) is the a posteriori probability that s_{k′} = s_{k′}^m. The received symbol vector corresponding to the lth sub-block is corrected using the turbo phase estimate φ_l.

(d) The likelihoods are recalculated from r after phase correction and are used to update the messages that are passed on to the global check nodes.

(e) Steps (b)–(d) are repeated until all the global check nodes are satisfied.

(2) Resolving phase.

(a) As the symbol a posteriori probabilities at the variable nodes are good enough at the end of the converging phase, a hard decision is taken on the symbols, which corresponds to (x_j + t) in (10). These hard decisions are used to evaluate the sub-block phase ambiguity estimates α_l = 2πt/M using the local check nodes as in (10), which are further used to correct the received symbol values, giving r′. In general, the decoder converges at the end of this stage.

(3) Final phase.

(a) If required, the SPA is continued over the entire code, involving both H_resolving and H_converging, until either the syndrome (Hc′ᵀ = 0) is satisfied or a specified number of iterations is reached. Turbo phase estimation and phase ambiguity resolution are not required in this phase.
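Step 1(c) of the schedule, i.e. (13)-(14), reduces to a few lines. The sketch below is our paraphrase (the function name and the constellation ordering are assumptions), shown for the QPSK set defined in Section 1:

```python
import numpy as np

def turbo_phase_estimate(r_sub, app):
    """Sub-block phase estimate per (13)-(14).
    r_sub: received symbols of one sub-block, shape (Nb,)
    app:   a posteriori symbol probabilities from the decoder, shape (Nb, M)"""
    M = app.shape[1]
    # s^m = exp(j(2*pi*m/M + pi/M)); for M = 4 this is the QPSK set of Section 1
    const = np.exp(1j * (2 * np.pi * np.arange(M) / M + np.pi / M))
    a = app @ const                                 # soft symbol estimates, (14)
    return np.angle(np.sum(r_sub * np.conj(a)))     # (13)

# "Certain" decoder output on a 4-symbol sub-block rotated by 0.1 rad:
msyms = np.array([0, 2, 1, 3])
app = np.eye(4)[msyms]
s = np.exp(1j * (np.pi / 2 * msyms + np.pi / 4))
phi = turbo_phase_estimate(s * np.exp(1j * 0.1), app)
print(round(phi, 3))                                # -> 0.1
```

With perfect a posteriori probabilities, the soft symbols equal the transmitted ones and the estimator recovers the sub-block rotation exactly.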

5.1. Comments on turbo phase estimation

In general, turbo phase estimation can provide a phase estimate in the range (−π, π). However, during the converging phase of this code, the decoder converges to a codeword which is rotationally equivalent to the transmitted codeword. Hence the turbo phase estimator provides a phase estimate whose range lies within (−π/M, π/M). This is illustrated in Figure 4, which shows the mean trajectories of the turbo phase estimates over a sub-block of 100 symbols at Eb/N0 = 2 dB under a constant phase offset θ.

[Figure 2: Communication system: message → G → codeword c → P⁻¹ → c′ → mapper and channel model as in (2) → LDPC receiver (see Figure 3) → P → c.]

[Figure 3: Proposed LDPC receiver architecture: the received vector r is corrected by the sub-block turbo phase estimates e^{−jφ_l} every d iterations until all GCNs are satisfied; the phase ambiguity resolver then applies e^{−jα_l} over the sub-blocks to give r′.]

[Figure 4: Evolution of turbo phase estimates over sub-blocks during convergence, for constant phase offsets θ = 0°, 15°, 30°, 60°, 75°, and 90°.]

5.2. Computational complexity

The computational complexity of the proposed LDPC receiver can be evaluated as the sum of the complexities of the LDPC decoder and the phase estimator/ambiguity resolver. The computational complexity of the nonbinary LDPC decoder is dominated by the check node decoder, with O(M²) operations. Reducing the computational complexity of nonbinary LDPC decoders is an active area of research [13, 14]. In this paper, we concentrate only on the additional complexity introduced in the receiver by the turbo phase estimation in (13) and the ambiguity resolution.

Since the decoding algorithm works in the probability domain, the a posteriori probabilities of the symbols, p(s_k = s_k^m | r_k), are directly available from the decoder. Given the a posteriori probability vector of length M for the kth symbol, the soft symbol estimate of the symbol s_k can be calculated according to (14). To compute N_s soft symbol estimates, we require 2(M − 1)N_s real additions and 2MN_s real multiplications. Given the soft symbol estimates, the evaluation of the turbo phase estimates for B sub-blocks requires an additional 4N_s real multiplications, 2(N_s − B) real additions, and B lookup table (LUT) accesses for evaluating the arg function. Correcting each symbol with the turbo phase estimate requires 4 real multiplications and 2 real additions. Thus the total complexity of estimating and correcting a symbol for its phase offset using the turbo phase estimator, per iteration (O_TPE), is given as

    O_TPE = [2M + 8]_× + [2M + 4 − 2B/N_s]_+ + [B/N_s]_LUT,    (15)


where [·]_×, [·]_+, and [·]_LUT correspond to the numbers of real multiplications, real additions, and look-up table accesses, respectively. The complexity involved in resolving the phase ambiguity per symbol is very small; moreover, phase ambiguity resolution is required only once per decoding.

Thus the additional complexity of the receiver, mainly due to turbo phase estimation, is relatively small. For the LDPC code described in Section 6, the additional complexity per symbol per iteration is approximately ([16]_× + [12]_+) operations, assuming d = 1.
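The quoted figure follows directly from (15) with the parameters of the Section 6 code (M = 4, B = 30, N_s = 3000), as this small check illustrates:

```python
M, B, Ns = 4, 30, 3000
mults = 2 * M + 8                  # real multiplications per symbol per iteration
adds = 2 * M + 4 - 2 * B / Ns      # real additions per symbol per iteration
luts = B / Ns                      # LUT accesses per symbol per iteration
print(mults, round(adds, 2), luts) # -> 16 11.98 0.01
```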

6. BER PERFORMANCE OF THE PROPOSED RECEIVER

We constructed a binary LDPC code with N = 3000, K = 1500, R = 0.5 for a sub-block size of N_b = 100 symbols. Through simulations, we found that the code with a sub-block size of 100 symbols gives the best BER performance for the amounts of phase noise considered in this paper. The degree distributions of this binary code were obtained through EXIT charts [15] such that they converged at an Eb/N0 of 1.3 dB. The variable node and check node distributions, from a node perspective, were λ(x) = 0.8047x³ + 0.0067x⁴ + 0.1887x⁸ and ρ(x) = 0.02x³ + 0.98x⁸, respectively. The code corresponds to B = 30 sub-blocks over the codeword. We replaced the edge gains of this code with nonzero divisors of Z4 such that they follow the constraints discussed in Section 3. Turbo phase estimation was done after every iteration (d = 1), only during the converging phase. Iterations are performed until the codeword converges, or to a maximum of 200 iterations. However, we found that, on average, fewer than 40 iterations are required for convergence in the waterfall region. Simulations are performed either until 100 codeword errors are found or up to 500,000 transmissions.

Simulation results in Figure 5 show the performance of the receiver of Figure 3 under phase noise conditions. For a constant phase offset, there is a small degradation of around 0.3 dB from the coherent performance at a BER of 10⁻⁵. This loss is due to the proposed application schedule of the SPA on the code, which does not include local check nodes during the converging phase, and to the degraded performance of the turbo phase estimator with reduced sub-block size. Thereafter, with a small loss, the code is able to tolerate phase noise with σΔ = 2° per symbol.

7. LOWER COMPLEXITY ITERATIVE RECEIVER

In this section, we show that the computational complexity involved in turbo phase estimation can be reduced by using a blind phase estimator just once, before the iterative receiver proposed in Figure 3.

7.1. Comments on initial phase estimation

The performance of the LCN-based phase ambiguity resolution (PAR) algorithm degrades with the amount of phase offset present on the symbols participating in the LCN. Hence, in our earlier work [8], we used a BPE to provide a phase estimate for every sub-block of symbols before resolving the phase ambiguity using the local check nodes. However, in the current work, we are able to delay the PAR on the sub-blocks, since the code can converge with the phase-ambiguous estimates obtained from the TPE alone. Hence the proposed architecture does not require the use of a blind phase estimator. However, by employing an initial BPE for coarse phase estimation and correction of the sub-blocks, the number of iterations required for convergence can be reduced. Figure 6 illustrates the benefit of blind phase estimation at Eb/N0 = 2.1 dB with Wiener phase noise of 1° standard deviation per symbol.

[Figure 5: Performance of the proposed receiver of Figure 3 with QPSK and the Wiener phase model (AWGN reference; σΔ = 0°, 1°, 2°).]

[Figure 6: Convergence improvement due to an initial blind phase estimator: probability of decoder convergence versus number of iterations for BPE + TPE (d = 10), TPE (d = 10), and TPE (d = 1).]
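The paper does not spell out the form of the BPE; a standard choice for M-PSK is the M-th power (Viterbi-and-Viterbi style) estimator, sketched below under our assumptions (a noiseless sub-block and the QPSK constellation of Section 1, for which s^4 = −1):

```python
import numpy as np

def mth_power_bpe(r_sub, M=4):
    """Blind phase estimate over one sub-block via the M-th power method.
    Raising r to the M-th power strips the M-PSK data modulation; for the
    offset constellation s_m = exp(j(2*pi*m/M + pi/M)) we have s^M = -1,
    hence the negation. The estimate is ambiguous modulo 2*pi/M."""
    return np.angle(-np.sum(r_sub ** M)) / M

m = np.array([0, 1, 2, 3, 1, 2])
s = np.exp(1j * (np.pi / 2 * m + np.pi / 4))
phi = mth_power_bpe(s * np.exp(1j * 0.2))     # sub-block rotated by 0.2 rad
print(round(phi, 3))                          # -> 0.2
```

Any residual rotation that is a multiple of 2π/M is invisible to this estimator, which is exactly the ambiguity the local check nodes resolve.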

Figure 6 also shows that the computational complexity due to the TPE can be reduced, by approximately a factor of 10, by using the BPE once before the iterative receiver and then applying the turbo phase estimator only periodically.

[Figure 7: Performance of the low-complexity receiver discussed in Section 7 under phase noise with σΔ = 2° per symbol, comparing AWGN, TPE (d = 1), BPE + TPE (d = 10), TPE (d = 10), and TPE (d = 1 until the 10th iteration, then d = 10).]

7.2. BER performance

The code described in Section 6 was used to simulate the BER performance of the iterative receiver with low computational complexity. The blind phase estimator was used to estimate and correct the phase disturbance present in each sub-block of the received symbol vector, after which the phase-corrected symbol vector was fed into the iterative receiver of Figure 3. During the converging phase, turbo phase estimates were obtained once every 10 iterations (d = 10). At σΔ = 2° per symbol, Figure 7 shows the advantage of the blind phase estimator in terms of BER performance. The result compares three distinct cases with the normal receiver, where turbo phase estimation is performed in every iteration. The presence of the blind phase estimator allows us to invoke the turbo phase estimator only once every 10 iterations, with a small loss of 0.05 dB. However, without the blind phase estimator, performing turbo phase estimation only once every 10 iterations shows significant degradation. As shown, the performance can be improved by including turbo phase estimation for more iterations, particularly during the early stages of decoding, when the LDPC decoder provides a lot of new information about the symbols.

8. CONCLUSION

In this paper, we addressed the problem of LDPC code-based iterative decoding under phase noise channels from a code perspective. We proposed the construction of ring-based codes for higher-order modulations that work well with low-complexity sub-block phase estimation techniques. The code was constructed using the new constraints outlined in Section 3, such that it not only converges under sub-block phase rotations, but also estimates them. We also showed, in a generalized manner, the behaviour of ring-based check nodes in the presence of phase ambiguity, based on their edge gains. As part of our future work, we are looking at ways to construct codes without explicitly constructing local check nodes for PAR. The sub-block size used in the simulation results shown earlier has not been optimized, and we believe that the method can be extended to adjust the observation interval and phase model depending on the amount of phase noise.

ACKNOWLEDGMENTS

The authors wish to acknowledge helpful discussions with Dr. Steven S. Pietrobon on this topic and also thank the reviewers for their useful comments.

REFERENCES

[1] R. Gallager, “Low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962.

[2] G. Colavolpe, A. Barbieri, and G. Caire, “Algorithms for iterative decoding in the presence of strong phase noise,” IEEE Journal on Selected Areas in Communications, vol. 23, no. 9, pp. 1748–1757, 2005.

[3] A. Barbieri, G. Colavolpe, and G. Caire, “Joint iterative detection and decoding in the presence of phase noise and frequency offset,” IEEE Transactions on Communications, vol. 55, no. 1, pp. 171–179, 2007.

[4] I. Motedayen-Aval and A. Anastasopoulos, “Polynomial-complexity noncoherent symbol-by-symbol detection with application to adaptive iterative decoding of turbo-like codes,” IEEE Transactions on Communications, vol. 51, no. 2, pp. 197–207, 2003.

[5] R. Nuriyev and A. Anastasopoulos, “Rotationally invariant and rotationally robust codes for the AWGN and the noncoherent channel,” IEEE Transactions on Communications, vol. 51, no. 12, pp. 2001–2010, 2003.

[6] W. G. Cowley and M. S. C. Ho, “Transmission design for Doppler-varying channels,” in Proceedings of the 7th Australian Communications Theory Workshop (AusCTW ’06), pp. 110–113, Perth, Australia, February 2006.

[7] M. Franceschini, G. Ferrari, R. Raheli, and A. Curtoni, “Serial concatenation of LDPC codes and differential modulations,” IEEE Journal on Selected Areas in Communications, vol. 23, no. 9, pp. 1758–1768, 2005.

[8] S. Karuppasami and W. G. Cowley, “LDPC code-aided phase-ambiguity resolution for QPSK signals affected by a frequency offset,” in Proceedings of the 8th Australian Communications Theory Workshop (AusCTW ’07), pp. 47–50, Adelaide, Australia, February 2007.

[9] N. Noels, V. Lottici, A. Dejonghe, et al., “A theoretical framework for soft-information-based synchronization in iterative (turbo) receivers,” EURASIP Journal on Wireless Communications and Networking, vol. 2005, no. 2, pp. 117–129, 2005.


[10] S. Karuppasami, W. G. Cowley, and S. S. Pietrobon, “LDPC code construction and iterative receiver techniques for channels with phase noise,” in Proceedings of the 67th IEEE Vehicular Technology Conference (VTC ’08), Singapore, May 2008.

[11] D. Sridhara and T. E. Fuja, “LDPC codes over rings for PSK modulation,” IEEE Transactions on Information Theory, vol. 51, no. 9, pp. 3209–3220, 2005.

[12] M. C. Davey and D. MacKay, “Low-density parity check codes over GF(q),” IEEE Communications Letters, vol. 2, no. 6, pp. 165–167, 1998.

[13] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinary LDPC codes over GF(q),” IEEE Transactions on Communications, vol. 55, no. 4, pp. 633–643, 2007.

[14] A. Voicila, D. Declercq, F. Verdier, M. Fossorier, and P. Urard, “Low-complexity, low-memory EMS algorithm for nonbinary LDPC codes,” in Proceedings of IEEE International Conference on Communications (ICC ’07), pp. 671–676, Glasgow, Scotland, UK, June 2007.

[15] S. ten Brink, G. Kramer, and A. Ashikhmin, “Design of low-density parity-check codes for modulation and detection,” IEEE Transactions on Communications, vol. 52, no. 4, pp. 670–678, 2004.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 362897, 12 pages
doi:10.1155/2008/362897

Research Article
New Technique for Improving Performance of LDPC Codes in the Presence of Trapping Sets

Esa Alghonaim,1 Aiman El-Maleh,1 and Mohamed Adnan Landolsi2

1 Computer Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran 31261, Kingdom of Saudi Arabia
2 Electrical Engineering Department, King Fahd University of Petroleum & Minerals, Dhahran 31261, Kingdom of Saudi Arabia

Correspondence should be addressed to Esa Alghonaim, [email protected]

Received 2 December 2007; Revised 18 February 2008; Accepted 21 April 2008

Recommended by Yonghui Li

Trapping sets are considered the primary factor for degrading the performance of low-density parity-check (LDPC) codes in the error-floor region. The effect of trapping sets on the performance of an LDPC code becomes worse as the code size decreases. One approach to tackle this problem is to minimize trapping sets during LDPC code design. However, while trapping sets can be reduced, their complete elimination is infeasible due to the presence of cycles in the underlying LDPC code bipartite graph. In this work, we introduce a new technique based on trapping sets neutralization to minimize the negative effect of trapping sets under belief propagation (BP) decoding. Simulation results for random, progressive edge growth (PEG), and MacKay LDPC codes demonstrate the effectiveness of the proposed technique. The hardware cost of the proposed technique is also shown to be minimal.

Copyright © 2008 Esa Alghonaim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Forward error correcting (FEC) codes are an essential component of modern state-of-the-art digital communication and storage systems. Indeed, in many of the recently developed standards, FEC codes play a crucial role in improving the error performance of digital transmission over noisy and interference-impaired communication channels.

Low-density parity-check (LDPC) codes, originally introduced in [1], have recently been the subject of intense research and are now widely considered to be one of the leading families of FEC codes. LDPC codes demonstrate performance very close to the information-theoretic bounds predicted by Shannon theory, while at the same time having the distinct advantage of low-complexity, near-optimal iterative decoding.

As with other types of codes decoded by iterative decoding algorithms (such as turbo codes), LDPC codes can suffer from the presence of undesirable error floors at increasing SNR levels (although these are found to be relatively lower than the error floors encountered with turbo codes [2]). In the case of LDPC codes, trapping sets [2–4] have been identified as one of the main factors causing error floors at high SNR values. The analysis of trapping sets and their impact on LDPC codes has been addressed in [3, 5–9]. The

main approaches for mitigating the impact of trapping sets on LDPC codes are based either on introducing algorithms to minimize their presence during code design, as in [5, 7, 9], or on enhancing decoder performance in the presence of trapping sets, as in [3, 6, 8]. The main disadvantage of the first approach, in addition to putting tight constraints on code design, is that trapping sets cannot be totally eliminated due to the “unavoidable” existence of cycles in the underlying bipartite Tanner graphs, especially for relatively short block length codes (which are the focus of this work). In addition, LDPC codes designed to reduce trapping sets may result in large interconnect complexity, increasing hardware implementation overhead. The second approach is therefore considered more applicable for our purpose and is the basis of the contributions presented in this paper.

In order to enhance decoder performance in the presence of (unavoidable) trapping sets, an algorithm is introduced in [3] based on flipping the hard decoded bits in trapping sets. First, trapping sets are identified and stored in a lookup table based on BP decoding simulation. Whenever decoding fails, the decoder searches the lookup table, using the unsatisfied parity checks as the key, to determine whether a known failure has been detected. If a match occurs, the decoder simply flips the hard decision values of the trapping bits. This approach


suffers from the following disadvantages: (1) the decoder has to identify exactly the trapping set variable nodes in order to flip them; (2) extra time is needed to search the lookup table for a trapping set; (3) the technique is not amenable to practical hardware implementation.

In [6, 8], the concept of averaging partial results is used to overcome the negative effect of trapping sets in the error-floor region. The variable node message update of the conventional BP decoder is modified to make it less sensitive to oscillations in the messages received from check nodes: the variable node equation becomes the average of the current and previous signal values received from check nodes. While this approach is effective in handling oscillating error patterns, it does not improve decoder performance in the case of constant error patterns.

In this paper, we propose a novel approach for enhancing decoder performance in the presence of trapping sets by introducing a new concept called trapping sets neutralization. The effect of a trapping set can be eliminated by setting the intrinsic and extrinsic values of its variable nodes to zero, that is, neutralizing them. After a trapping set is neutralized, the estimated values of its variable nodes are affected only by external messages from nodes outside the trapping set.

The most harmful trapping sets are identified by means of simulation. To be able to neutralize the identified trapping sets, a simple algorithm is introduced to store trapping set configuration information in variable and check nodes.

The remainder of this paper is organized as follows. In Section 2, we give an overview of LDPC codes and the BP algorithm. Trapping set identification and neutralization are introduced in Section 3. Section 4 presents the algorithm for trapping set neutralization based on learning. Experimental results are given in Section 5. In Section 6, we conclude the paper.

2. OVERVIEW OF LDPC CODES

LDPC codes are a class of linear block codes that use a sparse, random-like parity-check matrix H [1, 10]. An LDPC code defined by the parity-check matrix H represents the parity equations in a linear form, where any given codeword u satisfies the set of parity equations such that u × H = 0. Each column in the matrix represents a codeword bit, while each row represents a parity-check equation.

LDPC codes can also be represented by bipartite graphs, usually called Tanner graphs, having two types of nodes: variable nodes and check nodes, interconnected by edges whenever a given information bit appears in the parity-check equation of the corresponding check bit, as shown in Figure 1.

The properties of an (N, K) LDPC code specified by an M × N H matrix can be summarized as follows.

– Block size: number of columns (N) in the H matrix.

– Number of information bits: given by K = N − M.

– Rate: the ratio of information bits to block size. It equals 1 − M/N, given that there are no linearly dependent rows in the H matrix.

[Figure 1: The two representations of LDPC codes: graph form and matrix form. Variable nodes v1, . . . , v5 are connected to check nodes c1, c2, c3 according to

H = | 1 1 1 1 0 |
    | 1 0 1 0 1 |
    | 0 1 0 1 1 | ]

– Check node degree: number of 1’s in the corresponding row of the H matrix. The degree of a check node cj is referred to as d(cj).

– Variable node degree: number of 1’s in the corresponding column of the H matrix. The degree of a variable node vi is referred to as d(vi).

– Regularity: an LDPC code is said to be regular if d(vi) = p for 1 ≤ i ≤ N and d(cj) = q for 1 ≤ j ≤ M. In this case, the code is a (p, q) regular LDPC code. Otherwise, the code is considered irregular.

– Code girth: the minimum cycle length in the Tanner graph of the code.
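As a concrete illustration, these parameters can be read directly off the Figure 1 parity-check matrix. The following plain-Python sketch is our own illustration, not code from the paper:

```python
# Sketch: derive the listed code parameters from the H matrix of Figure 1.
H = [[1, 1, 1, 1, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1]]

M, N = len(H), len(H[0])      # M parity equations, N codeword bits
K = N - M                     # number of information bits
rate = 1 - M / N              # valid if H has no linearly dependent rows
check_degrees = [sum(row) for row in H]                           # d(c_j)
var_degrees = [sum(H[j][i] for j in range(M)) for i in range(N)]  # d(v_i)
regular = len(set(check_degrees)) == 1 and len(set(var_degrees)) == 1
```

For this small code, N = 5, K = 2, and the rate is 0.4; since the check degrees differ (4, 3, 3), the code is irregular.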

The iterative message-passing belief propagation (BP) algorithm [1, 10] is commonly used for decoding LDPC codes and is shown to achieve optimum performance when the underlying code graph is cycle-free. In the following, a brief summary of the BP algorithm is given. Following the notation and terminology used in [11], we define the following:

(i) ui: transmitted bit in a codeword, ui ∈ {0, 1}.

(ii) xi: a transmitted channel symbol, with value

xi = +1 when ui = 0, and xi = −1 when ui = 1. (1)

(iii) yi: a received channel symbol, yi = xi + ni, where ni is a zero-mean additive white Gaussian noise (AWGN) random variable with variance σ².

(iv) For the jth row of the H matrix, the set of column locations having 1’s is given by Rj = {i : hji = 1}. The set of column locations having 1’s, excluding location i, is given by Rj\i = Rj \ {i}.

(v) For the ith column of the H matrix, the set of row locations having 1’s is given by Ci = {j : hji = 1}. The set of row locations having 1’s, excluding location j, is given by Ci\j = Ci \ {j}.


[Figure 2: (a) Variable-to-check message qij(b) from vi to cj; (b) check-to-variable message rji(b) from cj to vi.]

(vi) qij(b): message (extrinsic information) to be passed from variable node vi to check node cj regarding the probability of ui = b, b ∈ {0, 1}, as shown in Figure 2(a). It equals the probability that ui = b given extrinsic information from all check nodes except node cj.

(vii) rji(b): message to be passed from check node cj to variable node vi, which is the probability that the jth check equation is satisfied given that bit ui = b and the other bits have separable (independent) distributions given by {qij′}j′≠j, as shown in Figure 2(b).

(viii) Qi(b): the probability that ui = b, b ∈ {0, 1}.

(ix) The intrinsic information for node vi:

L(ui) ≡ log[Pr(xi = +1 | yi)/Pr(xi = −1 | yi)] = log[Pr(ui = 0 | yi)/Pr(ui = 1 | yi)]. (2)

(x) L(rji) ≡ log[rji(0)/rji(1)], L(qij) ≡ log[qij(0)/qij(1)]. (3)

(xi) L(Qi) ≡ log[Qi(0)/Qi(1)]. (4)

The BP algorithm involves one initialization step and three iterative steps, as shown below.

Initialization step

Set the initial value of each variable node signal as follows: L(qij) = L(ui) = 2yi/σ², where σ² is the variance of the noise in the AWGN channel.

Iterative steps

The three iterative steps are as follows.

(i) Update check nodes as follows:

L(rji) = ( ∏_{i′∈Rj\i} αi′j ) × φ( ∑_{i′∈Rj\i} φ(βi′j) ), (5)

where αij = sign(L(qij)), βij = |L(qij)|, and

φ(x) = −log(tanh(x/2)) = log[(e^x + 1)/(e^x − 1)]. (6)

(ii) Update variable nodes as follows:

L(qij) = L(ui) + ∑_{j′∈Ci\j} L(rj′i). (7)

(iii) Compute estimated variable nodes as follows:

L(Qi) = L(ui) + ∑_{j∈Ci} L(rji). (8)

Based on L(Qi), the estimated value of the received bit (ui) is given by

ui = 1 if L(Qi) < 0, and ui = 0 otherwise. (9)

During LDPC decoding, the iterative steps (i) to (iii) are repeated until one of the following two events occurs:

(i) the estimated vector u = (u1, . . . , uN) satisfies the check equations, that is, u · H = 0;

(ii) the maximum number of iterations is reached.
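The initialization and update rules above can be collected into a compact software model. The following plain-Python sketch (our illustration, not the authors' implementation) runs log-domain BP on the small H matrix of Figure 1 and stops when the syndrome check u × H = 0 passes:

```python
import math

# Parity-check matrix of the Figure 1 example (M = 3 checks, N = 5 bits).
H = [[1, 1, 1, 1, 0],
     [1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1]]

def phi(x):
    # phi(x) = -log(tanh(x/2)), eq. (6); clip to avoid overflow near x = 0
    x = min(max(x, 1e-12), 50.0)
    return -math.log(math.tanh(x / 2.0))

def bp_decode(y, sigma2, max_iter=50):
    M, N = len(H), len(H[0])
    Lu = [2.0 * yi / sigma2 for yi in y]           # initialization step
    q = {(i, j): Lu[i] for j in range(M) for i in range(N) if H[j][i]}
    r = {}
    u_hat = [0] * N
    for _ in range(max_iter):
        for j in range(M):                         # (i) check update, eq. (5)
            Rj = [i for i in range(N) if H[j][i]]
            for i in Rj:
                others = [q[(i2, j)] for i2 in Rj if i2 != i]
                sgn = 1.0
                for v in others:
                    sgn = sgn if v >= 0 else -sgn
                r[(j, i)] = sgn * phi(sum(phi(abs(v)) for v in others))
        for i in range(N):                         # (ii) variable update, eq. (7)
            Ci = [j for j in range(M) if H[j][i]]
            for j in Ci:
                q[(i, j)] = Lu[i] + sum(r[(j2, i)] for j2 in Ci if j2 != j)
        LQ = [Lu[i] + sum(r[(j, i)] for j in range(M) if H[j][i])
              for i in range(N)]                   # (iii) estimates, eq. (8)
        u_hat = [1 if L < 0 else 0 for L in LQ]    # hard decision, eq. (9)
        if all(sum(H[j][i] * u_hat[i] for i in range(N)) % 2 == 0
               for j in range(M)):                 # u x H = 0: valid codeword
            break
    return u_hat
```

For instance, an all-zero codeword received with one unreliable symbol, y = [0.8, 1.1, −0.2, 0.9, 1.0] at σ² = 0.5, is corrected back to the all-zero codeword.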

3. TRAPPING SETS

In BP decoding of LDPC codes, dominant decoding failures are, in general, caused by a combination of multiple cycles [4]. In [2], a combination of error bits that leads to a decoder failure is defined as a trapping set. In [3], it is shown that the dominant trapping sets are formed by a combination of short cycles present in the bipartite graph.

In the following, we adopt the terminology and notation related to trapping sets as originally introduced in [8]. Let H be the parity-check matrix of an (N, K) LDPC code, and let G(H) denote its corresponding Tanner graph.

Definition 1. A (z, w) trapping set T is a set of z variable nodes for which the subgraph induced by the z variable nodes and the check nodes directly connected to them contains exactly w odd-degree check nodes.
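Definition 1 can be checked mechanically. The helper below is our own sketch (not from the paper), assuming H is stored as a list of rows; it returns the (z, w) profile of a candidate set of variable-node indices:

```python
def trapping_set_profile(H, var_set):
    """Per Definition 1: return (z, w), where z = |var_set| and w counts
    the neighboring check nodes whose degree is odd in the subgraph
    induced by var_set."""
    w = 0
    for row in H:
        deg = sum(1 for i in var_set if row[i])  # edges into the subgraph
        if deg % 2 == 1:
            w += 1
    return (len(var_set), w)
```

For the Figure 1 matrix, for example, the variable set {v1, v2} has profile (2, 2), while the full set {v1, v2, v3, v4} has profile (4, 0).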

The next example illustrates the behavior of trapping sets and why they are harmful.

Example 2. Consider a regular (N, K) LDPC code with degree (3, 6). Figure 3 shows a trapping set T(4, 2) in the code graph. Assume that an all-zero codeword (u = 0) is sent through an AWGN channel and all bits are received correctly (i.e., have positive intrinsic values) except the 4 bits in the trapping set T(4, 2), that is, L(ui) < 0 for 1 ≤ i ≤ 4 and L(ui) > 0 for 4 < i ≤ N. (Assume logic 0 is encoded as +1, while logic 1 is encoded as −1.)


[Figure 3: Trapping set example of T(4, 2): variable nodes v1–v4 and check nodes c1–c7.]

Based on (8), the estimated value of a variable node is the sum of its intrinsic information and the messages received from its three neighboring check nodes. Therefore, the estimation equation of each variable node contains four summation terms: the intrinsic information and three information messages. In this case, the estimated values for v1 (and v3) will be incorrect because all four summation terms of the estimation equation are negative. For v2 (and v4), three of the four summation terms in the estimation equation have negative values; therefore, v2 (and v4) has a high probability of being incorrectly estimated. In this case, the decoder becomes trapped and will remain trapped unless the positive signals from c1 and/or c2 are strong enough to change the polarities of the estimated values of v2 and/or v4. This example illustrates a trapping set causing a constant error pattern.

As a first step to investigate the effect of trapping sets on LDPC code performance, extensive simulations of LDPC codes over AWGN channels at various SNR values have been performed. A frame is considered to be in error if the maximum decoding iteration is reached without satisfying the check equations, that is, the syndrome u × H is nonzero. Error frames are classified based on observing the behavior of the LDPC decoder at each decoding iteration: at the end of each iteration, the bits in error are counted. Based on this, error frames are classified into three patterns, described as follows.

(i) Constant error pattern: the bit error count becomes constant after only a few decoding iterations.

(ii) Oscillating error pattern: the bit error count follows a nearly periodic change between maximum and minimum values. An important feature of this error pattern is the high variation in bit error count as a function of the decoding iteration number.

(iii) Random-like error pattern: the bit error count evolution follows a random shape, characterized by a low variation range.
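A per-iteration bit-error-count trace can be sorted into these three patterns with a simple heuristic. The sketch below is our own illustration; its thresholds (window length, spread-versus-mean test) are assumptions, not the paper's exact classification rule:

```python
def classify_error_pattern(err_counts, tail=20):
    """Heuristically classify a per-iteration bit-error-count trace into
    one of the three patterns (thresholds are illustrative assumptions)."""
    window = err_counts[-tail:]
    spread = max(window) - min(window)
    if spread == 0:
        return "constant"       # error count settled to a fixed value
    mean = sum(window) / len(window)
    # oscillating traces show large swings; random-like ones a low range
    return "oscillating" if spread > mean else "random-like"
```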

Figure 4 shows one example of each of the three error patterns. In a constant error pattern, the bit error count becomes constant after several decoding iterations (10 iterations in the example of Figure 4). In this case, the decoder becomes stuck due to the presence of a trapping set T(z, w): the number of bits in error equals z, and all check nodes are satisfied except w check nodes.

[Figure 4: Illustration of the three types of error patterns: number of error bits versus decoding iteration for oscillating, constant, and random-like patterns.]

Table 1: Percentages of error patterns in the error-floor region.

Code             Constant   Oscillating   Random-like
HE(1024,512)     59%        38%           3%
RND(1024,512)    95%        4%            1%
PEG(100,50)      90%        5%            5%

The major difference between a trapping set T(z, w) causing a constant error pattern and a trapping set T(e, f) causing other patterns is the number of odd-degree check nodes: based on extensive simulations, it is found that w ≤ f. This result is interpreted logically as follows: if the variable nodes of a trapping set are in error, only the odd-degree check nodes send correct messages to the variable nodes of the trapping set. Therefore, as the number of odd-degree check nodes decreases, the probability of breaking the trap decreases. As an extreme example, a trapping set with no odd-degree check nodes results in decoder convergence to a codeword other than the transmitted one and thus causes an undetected decoder failure.

Table 1 shows examples of the percentages of the three error patterns for three LDPC codes, based on simulating the codes in error-floor regions. The first LDPC code, HE(1024,512) [12], is constructed to be interconnected efficiently for fully parallel hardware implementation. The RND(1024,512) LDPC code is randomly constructed, avoiding cycles of size 4. The PEG(100,50) LDPC code is constructed using the PEG algorithm [7], which maximizes the size of cycles in the code graph. From Table 1, it is evident that constant error patterns are significant in some LDPC codes, including short length codes.


This observation motivates the need for a technique that enhances decoder performance in the presence of trapping sets of the constant error pattern type. For trapping sets that cause constant error patterns, once a trap occurs, the values of the check equations do not change in subsequent iterations. Thus, a decoder trap can be detected from the check equation results, and the unsatisfied check nodes can be used to reach the trapping set variable nodes.

3.1. BP decoder trapping sets detection

In order to eliminate the effect of trapping sets during the iterations of the BP decoder, a mechanism is needed to detect the presence of a trapping set. The proposed trapping set detection technique is based on monitoring the state of the check equations vector u × H. At the end of each decoding iteration, a new value of u × H is computed. If the u × H value is nonzero and remains unchanged (stable) for a predetermined number of iterations, then a decoder trap is detected. We call this number the stability parameter (d), and it is normally set to a small value; based on experimental results, it is found that d = 3 is a good choice. The implementation of trap detection is similar to the implementation of valid codeword detection, with some extra logic in each check node. Figure 5 shows an implementation of trapping set detection for a decoder with M check nodes. The output si for a check node ci is logic zero if the check equation result is the same as in the previous iteration, that is, no change in the check equation result. The output S is zero if no check equation changed between the current and the previous iteration.
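In software, the same syndrome-stability test can be sketched as follows; this is our illustration of the monitoring idea behind the Figure 5 circuit, and the data layout is an assumption:

```python
def syndrome(H, u_hat):
    # u x H: one parity bit per check equation
    return tuple(sum(h * u for h, u in zip(row, u_hat)) % 2 for row in H)

class TrapDetector:
    """Flags a decoder trap when a nonzero syndrome stays unchanged for
    d consecutive iterations (the paper's stability parameter, d = 3)."""
    def __init__(self, d=3):
        self.d, self.prev, self.stable = d, None, 0

    def update(self, s):
        # s: syndrome tuple computed at the end of a decoding iteration
        self.stable = self.stable + 1 if s == self.prev else 1
        self.prev = s
        return any(s) and self.stable >= self.d
```

Called once per iteration with the current syndrome, `update` returns True only after the same nonzero syndrome has been seen d times in a row; an all-zero syndrome (a valid codeword) never triggers it.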

3.2. Trapping sets neutralization

In this section, we introduce a new technique to overcome the detrimental effect of trapping sets during BP decoding. To overcome the negative impact of a trapping set T(z, w), the basic idea is to neutralize the z variable nodes in the trapping set. Neutralizing a variable node involves setting its intrinsic value and its extrinsic message values to zero. Specifically, neutralizing a variable node vi involves the following two steps:

(1) L(ui) = 0;

(2) L(qij) = 0 for 1 ≤ j ≤ d(vi).
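These two steps amount to a few assignments in a software decoder. The sketch below is our own illustration, assuming intrinsic LLRs are held in a list Lu and extrinsic messages in a dict q keyed by (variable, check):

```python
def neutralize(Lu, q, i, H):
    """Neutralize variable node v_i: zero its intrinsic LLR (step (1))
    and every outgoing extrinsic message (step (2))."""
    Lu[i] = 0.0                        # step (1): intrinsic value
    for j, row in enumerate(H):
        if row[i]:                     # v_i participates in check c_j
            q[(i, j)] = 0.0            # step (2): extrinsic message
```

After this, the estimate of v_i is driven only by incoming messages from nodes outside the trapping set, as described above.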

The neutralization concept is illustrated by the following example.

Example 3. For the trapping set T(4, 2) in Example 2, it has been shown that when all code bits are received correctly except the T(4, 2) bits, the decoder fails to correct the codeword, resulting in an error pattern of the constant type.

Now, consider neutralizing the trapping set variable nodes by setting their intrinsic and extrinsic values to zero. After neutralization, the decoder converges to a valid codeword within two iterations, as follows. In the first iteration after neutralization, for v2 and v4, two extrinsic messages become positive due to positive messages from nodes c1 and c2, which shifts the estimated values of v2 and v4 to the correct positive values. For nodes v1 and v3, all extrinsic values are zero and their estimated values remain zero. In the second iteration after neutralization, for v1 and v3, two extrinsic messages become positive due to positive extrinsic messages from nodes v2 and v4, which shifts the estimated values of v1 and v3 to the correct positive values.

The proposed neutralization technique has three important characteristics. (1) It is not necessary to determine exactly the variable nodes in a trapping set, as in the trapping set bit-flipping technique used in [3]. In the previous example, if only 3 out of the 4 trapping set variables are neutralized, the decoder will still be able to recover from the trap. (2) If some nodes outside a trapping set are neutralized (due to inexact identification of the trapping set), their estimates are expected to recover to correct values quickly, thanks to correct messages from neighboring nodes; this is because most of the extrinsic messages are correct in the error-floor region. (3) Neutralization is performed during the BP decoding iterations as soon as a trapping set is detected, which enables the decoder to converge to a valid codeword within the allowed maximum number of iterations.

As an example, for the near-constant error pattern in Figure 4, a trap occurs at iteration 10 and is detected at iteration 13 (assuming d = 3). In this case, the decoder has plenty of time to neutralize the trapping set before reaching the maximum of 100 iterations. In general, based on our simulations, a decoder trap is detected during the early decoding iterations.

4. BP DECODER WITH TRAPPING SETS NEUTRALIZATION BASED ON LEARNING

In this section, we introduce an algorithm to correct constant error pattern types (causing error floors) associated with LDPC BP decoding. The proposed algorithm involves two parts: (1) a preprocessing phase called the learning phase and (2) the actual decoding phase. The learning phase is an offline computation process in which trapping sets are identified; variable and check nodes are then configured according to the identified trapping sets. In the actual decoding phase, the proposed decoder runs as a standard BP decoder with the ability to detect and neutralize trapping sets using the variable and check node configuration information obtained during the learning phase. When a trapping set is detected, the decoder stops running BP iterations and switches to a neutralization process, in which the detected trapping set is neutralized. Upon completion of the neutralization process, the decoder resumes normal BP iterations. The neutralization process involves forwarding messages between the trapping set check and variable nodes.

Before proceeding with the details of the proposed decoder, we give an example of how variable and check nodes are configured during the learning phase and how this configuration is used to neutralize a trapping set during actual decoding.


[Figure 5: Decoder trap detection circuit: per-check-node stability outputs s1, . . . , sM from check nodes c1, . . . , cM are combined by control logic and a counter into the trap detection signal S.]

[Figure 6: Tree structure for the trapping set T(4, 2), rooted at the odd-degree check nodes c1 and c2, with variable nodes v1–v4, check nodes c3–c7 below them, and check node link indices marked on the edges.]

Example 4. Given the trapping set T(4, 2) of the previous example, we show the following: (a) how the nodes of this trapping set are configured; (b) how the neutralization process is performed during the actual decoding phase.

(a) In the learning phase, the trapping set nodes {c1, c2, c3, c4, c5, c6, c7, v1, v2, v3, v4} are configured for neutralization. First, a tree corresponding to the trapping set is built, starting with the odd-degree check nodes as the first level of the tree, as shown in Figure 6. The reason for starting from the odd-degree check nodes is that they are the only gates leading into a trapping set when the decoder is in a trap: when the decoder is stuck due to a trapping set, all check nodes are satisfied except the odd-degree check nodes of the trapping set. Therefore, the odd-degree check nodes of a trapping set are the keys to the neutralization process.

Degree-one check nodes in the trapping set (c1 and c2 in this example) are configured to initiate messages to their neighboring variable nodes requesting them to perform neutralization. We call these messages neutralization initiation messages. In Figure 6, arrows pointing out from a node indicate that the node is configured to forward a neutralization message to its neighbor. The task of neutralization message forwarding in a trapping set is to deliver a neutralization message to every variable node in the trapping set. In our example, c1 and c2 are configured for neutralization message initiation, while v2, c3, and c6 are configured for neutralization message forwarding. This configuration is enough to forward neutralization messages to all variable nodes in the trapping set. Another possible configuration is that c1 and c2 are configured for neutralization message initiation while v4, c4, and c7 are configured for neutralization message forwarding. Thus, in general, there is no need to configure all trapping set nodes.

(b) Now, assume that the proposed decoder is running BP iterations and falls into a trap due to T(4, 2). Next, we show how the preconfigured nodes are able to neutralize the trapping set T(4, 2) in this example. First, the decoder detects a trap event; it then stops running BP iterations and switches to a neutralization process. The decoder runs the neutralization process for a fixed number of cycles and then resumes running the BP iterations. In the first cycle of the neutralization process, the unsatisfied check nodes initiate a neutralization message according to the configuration stored in them during the learning phase. Because the decoder failure is due to T(4, 2), all check nodes in the decoder are satisfied except the two check nodes c1 and c2. Therefore, only c1 and c2 initiate neutralization messages, to nodes v2 and v4, respectively. In the second neutralization cycle, variable nodes v2 and v4 receive the neutralization messages and perform neutralization, and v2 forwards the neutralization message to c3 and c6. In the third neutralization cycle, c3 and c6 receive and forward the neutralization messages to v1 and v3, respectively, which in turn perform neutralization but do not forward neutralization messages. After that, no message forwarding is possible until the neutralization cycles end. After the neutralization process, the decoder resumes running BP iterations and converges to a valid codeword within two iterations, as previously shown in Example 3.

Before discussing the neutralization algorithm, a description of the configuration parameters used in variable and check nodes is given, followed by an illustrative example. Each variable node vi is assigned a bit γi, and each check node cj is assigned a bit β_j^q and a word α_j^q for each of its links q. The following is a description of these parameters.

γi: message forwarding configuration bit assigned to a variable node vi. When a variable node vi receives a neutralization message, it acts as follows: if γi = 1, then vi forwards the received neutralization message to all neighboring check nodes except the one that sent the message; otherwise, it does not forward the received message.

β_j^q: message initiation configuration bit assigned to the link indexed q in a check node cj, where 1 ≤ q ≤ d(cj).


Inputs: LDPC code;
(γi, β^q_j, α^q_j): nodes configuration;
result of the check equation in each check node cj;
nt_cycles: number of neutralization cycles.
Output: some variable nodes are neutralized.

1. For each check node cj with an unsatisfied equation do:
   for 1 ≤ q ≤ d(cj), if β^q_j = 1 then initiate a neutralization message through link q.
2. l = 1 // current neutralization cycle number
3. While l ≤ nt_cycles do:
   For each variable node vi that received a neutralization message do the following:
   – perform node neutralization on vi;
   – if γi = 1 then forward the message to all neighbors.
   For every check node cj that received a neutralization message through link p do the following:
   – for 1 ≤ q ≤ d(cj), if the bit α^q_j(p) is set then forward the message through link q.
   l = l + 1

Algorithm 1: Trapping sets neutralization algorithm.
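The neutralization pass of Algorithm 1 can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the adjacency-list data structures, the 0-based link indexing, and the `neutralize` function name are assumptions, and node neutralization is reduced to recording which variable nodes are affected.

```python
def neutralize(check_nbrs, var_nbrs, unsatisfied, gamma, beta, alpha, nt_cycles):
    """Run one neutralization process; return the set of neutralized variables.

    check_nbrs[j] : ordered list of variable nodes linked to check node j
                    (position q in the list is the link indexed q, 0-based here)
    var_nbrs[i]   : list of check nodes linked to variable node i
    unsatisfied   : check nodes whose parity equation failed
    gamma[i]      : forwarding bit of variable node i
    beta[j][q]    : initiation bit of link q of check node j
    alpha[j][q]   : forwarding word of link q of check node j (bit mask over
                    incoming links p)
    """
    # Initiation: unsatisfied check nodes send messages on configured links.
    to_vars = set()                     # (check j, outgoing link q) pairs
    for j in unsatisfied:
        for q in range(len(check_nbrs[j])):
            if beta[j][q]:
                to_vars.add((j, q))

    neutralized = set()
    for _ in range(nt_cycles):
        # Variable-node half-cycle: neutralize, optionally forward to checks.
        to_checks = set()               # (check j, incoming link p) pairs
        for j, q in to_vars:
            i = check_nbrs[j][q]
            neutralized.add(i)          # stands in for the real neutralization
            if gamma[i]:
                for j2 in var_nbrs[i]:
                    if j2 != j:         # do not echo back to the sender
                        to_checks.add((j2, check_nbrs[j2].index(i)))
        # Check-node half-cycle: forward on links whose alpha bit for p is set.
        to_vars = set()
        for j, p in to_checks:
            for q in range(len(check_nbrs[j])):
                if (alpha[j][q] >> p) & 1:
                    to_vars.add((j, q))
    return neutralized
```

Each loop iteration performs one variable-node half-cycle followed by one check-node half-cycle, mirroring step 3 of the listing; messages that reach a node with no matching configuration bit simply die out, as in the T(4, 2) walkthrough.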

α^q_j: message forwarding configuration word assigned to the link indexed q in a check node cj, where 1 ≤ q ≤ d(cj). The size of α^q_j in bits equals d(cj). If a check node cj has to forward a neutralization message received at the link indexed p through the link indexed q, then α^q_j is configured by setting the bit number p to 1, that is, the bit α^q_j(p) is set to 1. For example, if a degree-6 check node cj has to forward a neutralization message received at the link indexed 2 through the link indexed 3, α^3_j is configured to (000010)_2, that is, α^3_j(2) = 1.
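The bit-indexing convention can be made concrete with a small sketch (the helper name is hypothetical; links and bit positions are 1-indexed as in the text, so bit p maps to position p − 1 of a machine word):

```python
def set_forwarding(alpha_j, q, p):
    """Configure check node c_j so that a message received at link p is
    forwarded through link q: set bit p of the word alpha^q_j.
    alpha_j is the list of forwarding words, one per link (1-indexed links)."""
    alpha_j[q - 1] |= 1 << (p - 1)

# Degree-6 check node c_j: forward messages received at link 2 through link 3.
alpha_j = [0] * 6
set_forwarding(alpha_j, q=3, p=2)
assert alpha_j[3 - 1] == 0b000010   # alpha^3_j = (000010)_2, i.e., bit 2 set
```

Because each word is only d(cj) bits wide, several incoming links can be ORed into the same word, which is exactly how α^3_1 = (000011)_2 in Example 6 forwards messages arriving from either link 1 or link 2.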

The following example illustrates variable and check node configuration values for a given trapping set.

Example 5. Assume that the trapping set T(4, 2) in Figure 6 is identified in a regular (3,6) LDPC code. Check node link indices are indicated on the links; for example, in c1, the (c1, v2) link has index 5. The configuration for this trapping set is shown in Table 2.

Algorithm 1 lists the proposed trapping set neutralization algorithm. Since the decoder does not know how many cycles are needed to neutralize a trapping set, it performs neutralization and message forwarding cycles for a preset number (nt_cycles). For example, two neutralization cycles are needed to neutralize the trapping set shown in Figure 6. The number of neutralization cycles is preset during the learning phase to the maximum number of neutralization cycles required over all trapping sets. Based on simulation results, it is found that a small number of neutralization cycles is often sufficient. For example, 5 neutralization cycles are found sufficient to neutralize trapping sets of 20 variable nodes.

Inputs: LDPC code;
no_failures: number of processed decoder failures.
Output: TS_List.

1. TS_List = ∅, failures = 0
2. While failures ≤ no_failures do:
   u = 0, x = +1, y = x + n // transmit a codeword
   Decode y using the standard BP decoder.
   If u · H = 0 then goto 2 // valid codeword
   failures = failures + 1
   Re-decode y, observing the trap detection indicator.
   If a decoder trap is not detected then goto 2.
   TS = list of variable nodes vi in error (ui = 1) and unsatisfied check nodes.
   If TS ∈ TS_List then increment the weight of TS;
   else add TS to TS_List and set its weight to 1.

Algorithm 2: Trapping sets identification algorithm.
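The identification loop above can be sketched as follows. The BP decoder, the channel model, and the trap detector are placeholders passed in as functions (their names and signatures are assumptions for illustration); only the bookkeeping of Algorithm 2 is shown.

```python
from collections import Counter

def identify_trapping_sets(decode_bp, awgn_channel, detect_trap, no_failures):
    """Collect trapping sets, weighted by how many decoder failures each causes.

    decode_bp(y)   -> (u, trapped): hard decisions and the trap-detection flag
                      (assumed to re-run decoding with the indicator observed)
    awgn_channel() -> received vector y for the all-zero codeword
    detect_trap(u) -> (vars_in_error, unsatisfied_checks) for a trapped frame
    """
    ts_weights = Counter()
    failures = 0
    while failures < no_failures:
        y = awgn_channel()                 # transmit the all-zero codeword
        u, trapped = decode_bp(y)
        if not any(u):                     # u . H = 0: valid codeword
            continue
        failures += 1
        if not trapped:                    # non-constant error pattern: skip
            continue
        vars_err, unsat = detect_trap(u)
        ts = (frozenset(vars_err), frozenset(unsat))
        ts_weights[ts] += 1                # weight = failures caused by this TS
    return ts_weights
```

Returning a `Counter` keyed by the (variable nodes, unsatisfied checks) pair makes the weight bookkeeping of the last two lines of the listing a one-line update.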

4.1. Trapping sets learning phase

The trapping sets learning phase involves two steps. First, the trapping sets of a given LDPC code are identified. Then, variable and check nodes are configured based on the identified trapping sets.

4.1.1. Trapping sets identification

Trapping sets can be identified based on two approaches: (1) by performing decoding simulations and observing decoder failures [2]; (2) by using graph search methods [3]. The first approach is adopted in this work as it provides information on the frequency of occurrence of each trapping set, considered as its weight. This weight is computed based on how many decoder failures occur due to that trapping set and is used to measure its negative impact compared to other trapping sets. The priority of configuring nodes for a trapping set is assigned according to its weight; more harmful trapping sets are given higher configuration priority.

Algorithm 2 lists the proposed trapping sets identification algorithm. Decoding simulations of an all-zeros codeword with AWGN are performed until a decoder failure is observed. Then, the received frame y that caused the decoding failure is identified, and decoding iterations are redone while observing the trap detection indicator. If a trap is not detected, then decoding simulations are continued, searching for another decoder failure. However, if a trap is detected, then the trapping set TS is identified as follows. First, the unsatisfied check nodes are considered the odd-degree check nodes in the trapping set TS, while the variable nodes with hard decision errors (ui = 1) are considered the variable nodes of the trapping set. Finally, if the identified trapping set TS is already in the trapping sets list, TS_List, then its weight is incremented by one; otherwise the identified trapping set is added to the trapping sets list, TS_List, and its weight is set to one.


8 EURASIP Journal on Wireless Communications and Networking

Table 2: Nodes configuration for T(4, 2).

Configuration            Meaning
β^5_1 = 1                c1 initiates a message through link 5 (i.e., initiates a message to v2).
β^3_2 = 1                c2 initiates a message through link 3 (i.e., initiates a message to v4).
γ2 = 1                   v2 forwards incoming messages to all neighbors.
α^2_3 = (000001)_2       c3 forwards incoming messages from link 1 to link 2 (i.e., from v2 to v1).
α^1_6 = (001000)_2       c6 forwards incoming messages from link 4 to link 1 (i.e., from v2 to v3).

Inputs: TS_List; LDPC code of size (N, K).
Outputs: γi, 1 ≤ i ≤ N;
β^q_j, α^q_j, 1 ≤ j ≤ N − K, 1 ≤ q ≤ d(cj).

1. γi = 0 for 1 ≤ i ≤ N;
   β^q_j = 0 and α^q_j = 0 for 1 ≤ j ≤ N − K and 1 ≤ q ≤ d(cj).
2. Sort TS_List according to trapping set weights in descending order.
3. k = 1
4. While (k ≤ size of TS_List) do:
   Update the configuration so that it includes TS_k.
   Compute ωj for 1 ≤ j ≤ k.
   If ωj ≤ T for 1 ≤ j ≤ k then accept the configuration update;
   else reject TS_k and reject the configuration update.
   k = k + 1

Algorithm 3: Nodes configuration algorithm.
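The greedy accept/rollback structure of Algorithm 3 can be sketched as below. Everything graph-specific is abstracted away: `ts.node_configuration()`, `ts.variables`, and the `simulate_neutralization` callback are hypothetical helpers standing in for the (γ, β, α) updates and the ω_j simulation described in the text.

```python
def configure_nodes(ts_list, simulate_neutralization, n_vars, threshold):
    """Accept trapping sets in order of decreasing weight, rolling back any
    set whose configuration over-neutralizes a previously accepted set.

    ts_list : list of (trapping_set, weight) pairs
    simulate_neutralization(config, ts) -> set of variable nodes neutralized
        when only the unsatisfied check nodes of `ts` initiate messages
    threshold : T, the maximum allowed omega_j
    """
    config = {}        # accepted (gamma, beta, alpha) entries
    accepted = []
    for ts, _weight in sorted(ts_list, key=lambda x: -x[1]):   # step 2
        trial = dict(config)
        trial.update(ts.node_configuration())   # tentative update for TS_k
        ok = True
        for prev in accepted + [ts]:            # omega_j for 1 <= j <= k
            spurious = simulate_neutralization(trial, prev) - prev.variables
            if len(spurious) / n_vars > threshold:
                ok = False                      # reject TS_k, keep old config
                break
        if ok:
            config = trial                      # accept the update
            accepted.append(ts)
    return config, accepted
```

Keeping the trial configuration in a separate dictionary makes the "restore to the state before the last update" step a simple discard of `trial`.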

4.1.2. Nodes configuration

The second step in the trapping sets learning phase is to configure variable and check nodes in order for the decoder to be able to neutralize identified trapping sets during decoding iterations.

Before discussing the configuration algorithm, we discuss the case when two trapping sets have common nodes and its impact on the neutralization process. Then, we propose a solution to overcome this problem. This is illustrated through the following example.

Example 6. Figure 7 shows partial nodes of two trapping sets TS1 and TS2 in a regular (3,6) LDPC code: {v1, v3, v5} ∈ TS1 and {v2, v3, v4} ∈ TS2, with v3 a node common to TS1 and TS2. The configuration values after configuring nodes for TS1 and TS2 are as follows:

α^3_1 = (000011)_2 (link 3 in c1 forwards messages received from link 1 or link 2);
γ3 = 1 (v3 forwards messages to neighbors);
α^2_2 = (000001)_2 (link 2 in c2 forwards messages received from link 1);
α^2_3 = (000001)_2 (link 2 in c3 forwards messages received from link 1).

Therefore, when the decoder performs a neutralization process due to TS1, node v4 will be neutralized although it is not a member of TS1. Similarly, performing neutralization

Figure 7: Example of common nodes between two trapping sets (variable nodes v1–v5 and check nodes c1–c3, with check node link indices marked on the edges).

process due to TS2 causes node v5 (which is not a member of TS2) to be neutralized. Fortunately, as mentioned in Section 3.1, when the decoder is in a trap due to a trapping set TS, the decoder converges to a valid codeword even if some variable nodes outside TS have been unnecessarily neutralized. However, based on simulation results, neutralizing a large number of variable nodes other than the desired nodes leads to a decoder failure.

Having introduced the trapping sets common nodes problem, we next present the proposed solution to this problem. Define ωj for each trapping set TS_j as follows:

ωj: ratio of neutralized variable nodes outside the set TS_j to the total number of variable nodes (N).

Define T as the maximum allowed value for ωj. The proposed solution is as follows: after configuring a trapping set TS_k, we compute ωj for 1 ≤ j ≤ k. If ωj ≤ T for 1 ≤ j ≤ k, then we accept the new configuration; otherwise, TS_k is rejected and the configuration is restored to its state before configuring TS_k.

Algorithm 3 lists the nodes configuration algorithm. Initially, in step 1, the configurations of all variable and check nodes are set to zero. This means that no node is allowed to initiate or forward a neutralization message. Sorting in step 2 is important to give more harmful trapping sets (with greater weight) configuration priority over less harmful trapping sets. Step 4 processes the trapping sets in TS_List one by one. For each trapping set TS_k, the nodes configuration is updated by setting the configuration parameters (γi, β^q_j, α^q_j) related to the variable and check nodes in TS_k. Then, for each previously



Inputs: LDPC code;
nodes configuration (γi, β^q_j, α^q_j);
data received from the channel;
max_iter: maximum iterations;
nt_cycles: number of neutralization cycles.
Output: decoded codeword.

1. iter = 0, nt_done = 0
2. iter = iter + 1
3. Run a normal BP decoding iteration.
4. If u · H = 0 then stop // valid codeword
5. If iter = max_iter then stop // decoder failure
6. If a decoder trap is not detected then goto step 2
7. If (iter + nt_cycles < max_iter) and (nt_done = 0) then do:
   – perform neutralization // Algorithm 1
   – iter = iter + nt_cycles
   – nt_done = 1
8. Goto step 2

Algorithm 4: The proposed learning-based decoder.
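The control flow of Algorithm 4 can be sketched as a plain loop. The BP iteration, syndrome check, trap detector, and neutralization step are injected as callbacks (assumed interfaces, for illustration only); the sketch shows the one-shot `nt_done` guard and the iteration-budget accounting.

```python
def learning_based_decode(bp_iteration, syndrome_ok, trap_detected,
                          neutralize, max_iter, nt_cycles):
    """BP decoding with one-shot trapping-set neutralization.

    bp_iteration()  : runs one BP iteration, returns current hard decisions u
    syndrome_ok(u)  : True when u . H = 0 (valid codeword)
    trap_detected() : trap-detection flag after the last iteration
    neutralize()    : runs the neutralization process of Algorithm 1
    """
    it, nt_done = 0, False
    u = None
    while it < max_iter:
        it += 1
        u = bp_iteration()                  # step 3: one normal BP iteration
        if syndrome_ok(u):
            return u, True                  # step 4: valid codeword
        if trap_detected() and not nt_done and it + nt_cycles < max_iter:
            neutralize()                    # pause BP, run neutralization
            it += nt_cycles                 # charge the cycles to the budget
            nt_done = True                  # step 7: never neutralize twice
    return u, False                         # step 5: decoder failure
```

When no trap is ever detected, the loop degenerates to plain BP decoding, matching the remark that the proposed decoder is then identical to the conventional one.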

configured trapping set TS_j, 1 ≤ j ≤ k, we compute ωj. The parameter ωj for a trapping set TS_j is computed as follows: the check equations of all check nodes of the decoder are set as satisfied (i.e., assigned zero values) except the odd-degree check nodes in TS_j, and then a neutralization process is performed as in Algorithm 1. The actual number of neutralized variable nodes outside the trapping set variable nodes is divided by N (the code size) to get ωj. If the ωj parameter for all previously configured trapping sets is less than or equal to the threshold T, then the new configuration is accepted; otherwise TS_k is rejected (ignored) and the nodes configuration is restored to the state before the last update.

4.2. The proposed learning-based decoder

The algorithm of the proposed learning-based decoder is listed in Algorithm 4. The algorithm is similar to the conventional BP decoding algorithm with the addition of trapping set detection and neutralization. Note that if a trapping set is not detected during decoding iterations, then the proposed algorithm becomes identical to the conventional BP decoder. After each decoding iteration, the trap detection flag is checked (step 6). If a trap is detected, then normal decoding iterations are paused, the decoder performs a neutralization process based on Algorithm 1, the iteration number is increased by the number of neutralization cycles to compensate for the time spent in the neutralization process, and finally the decoder resumes conventional BP iterations. In step 7, before performing a neutralization process, the decoder checks nt_done to make sure that no neutralization process has been performed in the previous iterations. This condition guarantees that the decoder will not keep running into the same trap and perform the neutralization process repeatedly, which may happen when a trap is redetected before the decoder is able to get out of it. Upon trap detection and before deciding to perform a neutralization process, the decoder must check another condition: it must ensure that the decoding iterations left before reaching the maximum are enough to perform a neutralization process (step 7). For example, consider a decoder with 64 maximum decoding iterations and 5 neutralization cycles. If a trapping set is detected at iteration number 62, the decoder will not have enough time to complete the neutralization process.

4.3. Hardware cost

The hardware cost of the proposed algorithm is low. For trapping set storage, we need to assign one bit to each variable node (the message forwarding bit). For each check node ci, we need to assign one bit for message initiation and one word of size d(ci) for message forwarding. Fortunately, the communication links needed to forward neutralization messages between the check and variable nodes of the trapping sets already exist as part of BP decoding. Therefore, no extra hardware cost is added for the communication between trapping set nodes. What is needed is simple control logic that decides to perform message initiation and forwarding based on the stored forwarding information. The decoder trap detection, shown in Figure 5, is implemented as a logic tree similar to the tree of the valid codeword detection implementation. The cost is low, as it mainly consists of a simple logic circuit within the check nodes, in addition to an OR gate tree combining the logic outputs from the check nodes. Using a simple multiplexer, the valid codeword detection logic and the trap detection logic can share most of their components. It is worth emphasizing that it is not necessary to store configuration information for all variable and check nodes. Only the subset included in the learned trapping sets is used, which further reduces the required overhead.

5. EXPERIMENTAL RESULTS

In order to demonstrate the effectiveness of the proposed technique, extensive simulations have been performed on several LDPC code types and sizes over a BPSK-modulated AWGN channel. The maximum number of iterations is set to 64. Due to the required CPU-intensive simulations, especially at high SNR, a parallel computing simulation platform was developed to run the LDPC decoding simulations on 170 nodes on a departmental LAN network [13].

The following is a brief description of the LDPC codes used in the simulation.

-HE(1024,512): a near-regular LDPC code of size (1024,512) constructed to be interconnect-efficient for fully parallel hardware implementation [12].

-RND(1024,512): a regular (3,6) LDPC code of size (1024,512), randomly generated while avoiding cycles of size 4.

-PEG(1024,512): an irregular LDPC code of size (1024,512) generated by the PEG algorithm [7]. This algorithm maximizes graph cycles and implicitly minimizes trapping sets of constant type.



Figure 8: Performance results for RND(1024,512) LDPC code (frame error rate versus SNR in dB, for conventional BP decoding, the average decoding algorithm, the proposed algorithm, and the proposed algorithm on top of average decoding).

-PEG(100,50): similar to the previous code, but of size (100,50).

-MacKay(204,102): a regular LDPC code of size (204,102) from MacKay's website [14], labeled 204.33.484.txt.

For each of the five codes, we compare the performance of the proposed algorithm with conventional BP decoding and with the average decoding algorithm proposed in [8]. The average decoding algorithm is a modified version of the BP algorithm in which messages are averaged over several decoding iterations in order to prevent sudden magnitude changes in the values of the variable node messages. We also add another curve showing the performance of the proposed algorithm on top of the average decoding algorithm. Using the proposed algorithm on top of the averaging algorithm is identical to the proposed algorithm listed in Algorithm 4, except that in step 3 an average decoding iteration takes place instead of a normal BP decoding iteration. In the learning phase of each LDPC code, we set the trapping set detection parameter (d) to 3 and the threshold value (T) to 10%.
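One plausible form of the message averaging described above is a running mean of each variable-node message over the decoding iterations; this is only a sketch of the idea behind the averaging decoder of [8], not its exact update rule.

```python
def averaged_message(prev_avg, new_msg, n):
    """Running average of a variable-node message over n decoding iterations,
    damping sudden magnitude changes between consecutive iterations."""
    return prev_avg + (new_msg - prev_avg) / n

# The averaged message converges to the mean of the per-iteration messages.
m = 0.0
for n, msg in enumerate([4.0, -2.0, 1.0], start=1):
    m = averaged_message(m, msg, n)
assert abs(m - 1.0) < 1e-12   # mean of 4, -2, 1
```

The incremental form avoids storing the full message history, which matters when the update runs once per edge per iteration.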

Figure 8 shows the performance results for RND(1024,512). It is evident that the proposed learning-based algorithm outperforms the average decoder in the error-floor region. In the low SNR region, the average decoding algorithm is better than the proposed algorithm; the reason is the few occurrences of constant trapping sets in the low SNR region. As SNR increases, constant error frames increase until they become dominant in the error-floor region. The proposed algorithm on top of average decoding shows the best results in all SNR regions. This is because it combines the advantages of the two algorithms, learning-based and average decoding, as it improves both constant and nonconstant types of error patterns.

Figures 9 and 10 show the performance results for the two LDPC codes PEG(100,50) and PEG(1024,512).

Figure 9: Performance results for PEG(100,50) LDPC code (frame error rate versus SNR in dB, for conventional BP decoding, the average decoding algorithm, the proposed algorithm, and the proposed algorithm on top of average decoding).

Figure 10: Performance results for PEG(1024,512) LDPC code (frame error rate versus SNR in dB, for conventional BP decoding, the average decoding algorithm, the proposed algorithm, and the proposed algorithm on top of average decoding).

While there is significant improvement for the proposed algorithm on PEG(100,50), there is almost no improvement on PEG(1024,512). The low improvement gain on PEG(1024,512) is due to the low percentage (not more than 8%) of trapping sets that cause constant error patterns. However, it is hard to implement PEG(1024,512) codes using fully parallel architectures. As can be seen from the PEG code construction algorithm [7], when a new connection is to be added to a variable node, the check node selected for connection is the one in the farthest level of the tree originated



Figure 11: Performance results for HE(1024,512) LDPC code (frame error rate versus SNR in dB, for conventional BP decoding, the average decoding algorithm, the proposed algorithm, and the proposed algorithm on top of average decoding).

Table 3: Results after the learning phase of HE(1024,512) LDPC code.

 i    TS_i size    TS_i weight    ω_j
 1    (8,2)        106            0%
 2    (8,2)        49             0%
 3    (12,2)       13             0%
 4    (10,3)       9              1%
 5    (8,3)        8              2%
 6    (10,2)       7              0%
 7    (7,3)        5              2%
 8    (7,3)        5              3%
 9    (7,3)        4              0%
10    (15,2)       3              0%

Table 4: Identified trapping sets and configuration percentages for different LDPC codes.

CODE              #TS    %V        %C
HE(1024,512)      55     27.15%    13.46%
RND(1024,512)     50     18.46%    9.9%
PEG(1024,512)     8      6.74%     3.42%
PEG(100,50)       57     60%       31.67%
MacKay(204,102)   40     50%       27.94%

from the variable node. This results in interconnections even denser than pure random construction methods.

Figure 11 shows the performance of an interconnect-efficient LDPC code, HE(1024,512) [12], that has been implemented in a fully parallel hardware architecture. This LDPC code is designed to strike a balance between decoder throughput and error performance. The figure shows that the

Figure 12: Performance results for MacKay(204,102) LDPC code (frame error rate versus SNR in dB, for conventional BP decoding, the average decoding algorithm, the proposed algorithm, and the proposed algorithm on top of average decoding).

best performance is obtained using the proposed algorithm on top of the average decoding algorithm. The performance at 3.25 dB is not drawn due to the excessive simulation time needed at this point.

Based on the results of all simulated codes, it is clearly demonstrated that the application of the proposed algorithm on top of average decoding achieves significant performance improvements in comparison with conventional LDPC decoding. In particular, one can observe that the performance improvements are most pronounced for LDPC codes with relatively low performance under a conventional LDPC decoder. This allows LDPC code design techniques to relax some of the design constraints and focus on reducing hardware complexity, such as creating interconnect-efficient codes.

Table 3 lists part of the trapping sets that are identified during the learning phase of the HE(1024,512) LDPC code. The complete number of identified trapping sets is 55. One may note that the trapping sets with the highest weights have a small number of variable and odd-degree check nodes. Table 4 shows the number of identified trapping sets and the percentages of check and variable nodes configured to perform neutralization message forwarding. It is clear that only a subset of the variable and check nodes is configured, which further decreases the hardware cost.

6. CONCLUSION

In this paper, we have introduced a new technique to enhance the performance of LDPC decoders, especially in the error-floor region. This technique is based on identifying trapping sets of constant error pattern and reducing their negative impact by neutralizing them. The proposed technique, in addition to enhancing performance, has a simple hardware architecture with reasonable overhead. Based on extensive



simulations on different LDPC code designs and sizes, it is shown that the proposed technique achieves significant performance improvements for (1) short LDPC codes and (2) LDPC codes designed under additional constraints, such as interconnect-efficient codes. It is also demonstrated that the application of the proposed technique on top of average decoding achieves significant performance improvements over conventional LDPC decoding for all of the investigated codes. This makes LDPC codes even more attractive for adoption in various applications and enables the design of codes that optimize hardware implementation without compromising the required performance.

ACKNOWLEDGMENT

The authors would like to thank King Fahd University of Petroleum & Minerals for supporting this work under Project no. IN070376.

REFERENCES

[1] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, Mass, USA, 1963.

[2] T. Richardson, "Error floors of LDPC codes," in Proceedings of the 41st Annual Allerton Conference on Communication, Control, and Computing, Monticello, Ill, USA, October 2003.

[3] E. Cavus and B. Daneshrad, "A performance improvement and error floor avoidance technique for belief propagation decoding of LDPC codes," in Proceedings of the 16th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC '05), vol. 4, pp. 2386–2390, Berlin, Germany, September 2005.

[4] T. Tian, C. Jones, J. D. Villasenor, and R. D. Wesel, "Construction of irregular LDPC codes with low error floors," in Proceedings of the IEEE International Conference on Communications (ICC '03), vol. 5, pp. 3125–3129, Anchorage, Alaska, USA, May 2003.

[5] T. Tian, C. R. Jones, J. D. Villasenor, and R. D. Wesel, "Selective avoidance of cycles in irregular LDPC code construction," IEEE Transactions on Communications, vol. 52, no. 8, pp. 1242–1247, 2004.

[6] S. Gounai, T. Ohtsuki, and T. Kaneko, "Modified belief propagation decoding algorithm for low-density parity check code based on oscillation," in Proceedings of the 63rd IEEE Vehicular Technology Conference (VTC '06), vol. 3, pp. 1467–1471, Melbourne, Australia, May 2006.

[7] X.-Y. Hu, E. Eleftheriou, and D.-M. Arnold, "Progressive edge-growth Tanner graphs," in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '01), vol. 2, pp. 995–1001, San Antonio, Tex, USA, November 2001.

[8] S. Landner and O. Milenkovic, "Algorithmic and combinatorial analysis of trapping sets in structured LDPC codes," in Proceedings of the IEEE International Conference on Wireless Networks, Communications and Mobile Computing (WirelessCom '05), vol. 1, pp. 630–635, Maui, Hawaii, USA, June 2005.

[9] G. Richter and A. Hof, "On a construction method of irregular LDPC codes without small stopping sets," in Proceedings of the IEEE International Conference on Communications (ICC '06), vol. 3, pp. 1119–1124, Istanbul, Turkey, June 2006.

[10] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 399–431, 1999.

[11] W. Ryan, "A Low-Density Parity-Check Code Tutorial, Part II: The Iterative Decoder," Electrical and Computer Engineering Department, The University of Arizona, Tucson, Ariz, USA, April 2002.

[12] M. Mohiyuddin, A. Prakash, A. Aziz, and W. Wolf, "Synthesizing interconnect-efficient low density parity check codes," in Proceedings of the 41st Annual Design Automation Conference (DAC '04), pp. 488–491, San Diego, Calif, USA, June 2004.

[13] E. Alghonaim, A. El-Maleh, and M. Adnan Al-Andalusi, "Parallel computing platform for evaluating LDPC codes performance," in Proceedings of the IEEE International Conference on Signal Processing and Communications (ICSPC '07), pp. 157–160, Dubai, United Arab Emirates, November 2007.

[14] D. J. C. MacKay, online collection of sparse graph codes, http://www.inference.phy.cam.ac.uk/mackay/codes/.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 852397, 9 pages
doi:10.1155/2008/852397

Research Article
Distributed Generalized Low-Density Codes for Multiple Relay Cooperative Communications

Changcai Han and Weiling Wu

Key Laboratory of Information Processing and Intelligent Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

Correspondence should be addressed to Changcai Han, [email protected]

Received 1 November 2007; Revised 17 March 2008; Accepted 9 July 2008

Recommended by Yonghui Li

As a class of pseudorandom error correcting codes, generalized low-density (GLD) codes exhibit excellent performance over both additive white Gaussian noise (AWGN) and Rayleigh fading channels. In this paper, distributed GLD codes are proposed for multiple relay cooperative communications. Specifically, using the partial error detecting and error correcting capabilities of the GLD code, each relay node decodes and forwards some of the constituent codes of the GLD code to cooperatively form a distributed GLD code, which can work effectively and keep a fixed overall code rate when the number of relay nodes varies. Also, at the relay nodes, a progressive processing procedure is proposed to reduce the complexity and adapt to the source-relay channel variations. At the destination, the soft information from different paths is combined for the GLD decoder; thus diversity gain and coding gain are achieved simultaneously. Simulation results verify that distributed GLD codes with various numbers of relay nodes can obtain significant performance gains in quasistatic fading channels compared with the strategy without relays, and the performance is further improved when more relays are employed.

Copyright © 2008 C. Han and W. Wu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Cooperative communications can increase achievable rates and decrease susceptibility to channel variations [1–3], and have potential practical applications in cellular systems, wireless ad hoc networks, and sensor networks. In cooperative communications, several relay protocols, such as amplify-and-forward (AF) [4], decode-and-forward (DF) [1], and coded cooperation [5, 6], have been proposed.

Based on these relay protocols, various coding strategies can be devised using rate-compatible punctured convolutional (RCPC) codes, product codes, or concatenated codes, and coding scheme design has become a hot topic in the literature [7–12]. Specifically, low-density parity-check (LDPC) codes [13] are employed for relay networks in [7–9], and distributed turbo codes are presented in [10–12].

Most of the cooperative strategies above are devised for the classical three-node relay channel model, that is, the network with only one relay. However, it has been theoretically shown that the diversity gain increases when more relays participate in cooperation [14]. Moreover, in wireless relay networks, the number of relays participating in cooperation may vary from time to time due to the random mobility of nodes [15]. Therefore, coding schemes should be easily adjustable when the relay number varies. Although distributed turbo codes can be extended to networks with different numbers of relays using multiple turbo codes [16], the overall code rate decreases as the relay number increases [11]. In this paper, based on the DF relay protocol, a novel coding scheme with a fixed overall code rate is proposed for cooperative relay networks using generalized low-density (GLD) codes [17–19].

GLD codes were first introduced by Tanner in [17] and were further investigated in [18–20]. GLD codes, which generalize Gallager's LDPC codes, are constructed by replacing the parity-check constraints in LDPC codes with block code constraints. Similar to LDPC codes, GLD codes can also be iteratively decoded and exhibit excellent performance over both additive white Gaussian noise (AWGN) [18, 19] and Rayleigh fading channels [20].

In the proposed scheme, each relay is only responsible for forwarding one or several constituent codes of the GLD



Figure 1: Cooperative communication scenario (source s transmits to destination d; phase 1 is the broadcast phase, phase 2 the relaying phase).

codes according to the number of available relays, using the partial error detecting and error correcting capabilities of GLD codes. Unlike distributed turbo codes in [11], the overall code rate of distributed GLD codes is fixed when the relay number varies. Moreover, a progressive processing strategy is proposed for relay nodes, which allows partial decoding of the received codeword to reduce complexity in good source-relay channel conditions and guarantees the robustness of the system in bad conditions. At the destination, a combiner is added to collect the soft information from different nodes for the GLD decoder, adding only a little complexity to the destination node. The significant performance of distributed GLD codes over quasistatic fading channels is further verified by simulations.

The remainder of this paper is organized as follows. Section 2 briefly describes the system model of cooperative communications with multiple relays in a cluster. In Section 3, distributed GLD codes are proposed, and the processing algorithms at the relays and the destination are presented. Section 4 gives the simulation results of distributed GLD codes. The conclusions are drawn in Section 5.

2. SYSTEM MODEL

In this paper, we investigate the scenario in which the source node transmits data to a distant destination aided by some nearby nodes, as depicted in Figure 1. We further assume that the source and relay nodes are located geographically in a small region forming a transmit cluster, and thus the quality of the channels from the source to the relays is usually good. This is approximately equivalent to the cooperative network with multiple relays presented in [15].

In this scenario, the cooperative relay group can be assigned by some central node or via some other distributed protocol. For example, the source node may send a "hello" message to the surrounding nodes, and those nodes whose responses are properly verified according to some criteria form a transmit cluster, as introduced in [15]. Once a cluster is formed, the relay set R is given. Let L denote the number of available relays in the set R, where L equals the cardinality of the set R, denoted |R|. Note that the cluster needs to be re-formed from time to time due to the random mobility of nodes; that is, the relay number L may vary over time in cooperative relay networks.

All the channels from nodes in the cluster to the distant destination are usually modeled as independent quasistatic Rayleigh fading channels. All the internode channels, that is, the channels between nodes in the same cluster, may be modeled as independent AWGN channels due to strong line-of-sight components [21], although this is not a critical element of the scheme. We assume that all the nodes transmit signals on orthogonal channels (e.g., CDMA, FDMA, or TDMA) and are constrained to the half-duplex mode.

We assume the source s transmits signals to the destination d aided by the relay nodes in the set R. The cooperative relay protocol in this scenario usually consists of two phases, as illustrated in Figure 1. At the source node s, let c = [c_1, c_2, ..., c_N] denote the encoded bit vector, where N is the code length. It is then modulated with the binary phase shift keying (BPSK) constellation to obtain x = [x_1, x_2, ..., x_N].

In the first phase, the source s broadcasts its data x_i, 1 \le i \le N, and the received signal at the distant destination d is given by

y_i^{sd} = \sqrt{P_s}\, h_i^{sd}\, x_i + \eta_i^{sd},   (1)

where P_s is the power of each symbol from the source, h_i^{sd} is the Rayleigh fading coefficient of the channel from the source to the destination, and \eta_i^{sd} denotes the AWGN with variance \sigma^2.

Simultaneously, the broadcast data from the source is also received by the relay nodes in the set R, and the received signal at relay l is

y_{i,l}^{sr} = \sqrt{P_s}\, h_{i,l}^{sr}\, x_i + \eta_{i,l}^{sr},   (2)

where h_{i,l}^{sr} represents the fading gain of the path from the source node to relay node l and \eta_{i,l}^{sr} is the AWGN. In the scenario of this paper, the internode channels are modeled as AWGN channels; that is, for a pair of nodes belonging to the same cluster, h_{i,l}^{sr} = 1. Relay nodes decode the data, and error detecting codes such as cyclic redundancy check (CRC) codes or other linear block codes are used to verify the decoding results.
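As a concrete illustration of the signal models (1) and (2), the sketch below simulates one broadcast frame: a quasistatic Rayleigh fading source-destination link (a single fading draw held constant over the frame) and an AWGN source-relay link with h^{sr} = 1. The function names and parameter values are illustrative, not taken from the paper.

```python
import math
import random

def rayleigh_coeff(rng):
    # |h| is Rayleigh-distributed when h = (g1 + j*g2)/sqrt(2) with g ~ N(0, 1);
    # for real-valued BPSK processing only the magnitude is needed here.
    return math.sqrt(rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2) / math.sqrt(2)

def transmit_frame(bits, Ps, sigma, rng):
    """Broadcast phase: y_i^{sd} = sqrt(Ps)*h^{sd}*x_i + noise (eq. 1), with
    h^{sd} constant over the whole frame (quasistatic fading), and
    y_i^{sr} = sqrt(Ps)*x_i + noise, i.e. h^{sr} = 1 (AWGN internode link, eq. 2)."""
    h_sd = rayleigh_coeff(rng)               # one fading draw per frame
    x = [1.0 if b else -1.0 for b in bits]   # BPSK mapping
    y_sd = [math.sqrt(Ps) * h_sd * xi + rng.gauss(0, sigma) for xi in x]
    y_sr = [math.sqrt(Ps) * xi + rng.gauss(0, sigma) for xi in x]
    return h_sd, y_sd, y_sr

rng = random.Random(1)
h_sd, y_sd, y_sr = transmit_frame([1, 0, 1, 1], Ps=1.0, sigma=0.1, rng=rng)
```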

In the second phase, those relay nodes which decode the data correctly aid the source in forwarding data to the destination. Let x_i^l denote the signal transmitted by relay l, which is received at the destination as

y_{i,l}^{rd} = \sqrt{P_l^{rd}}\, h_{i,l}^{rd}\, x_i^l + \eta_{i,l}^{rd},   (3)

where P_l^{rd} is the power of each symbol from relay l, h_{i,l}^{rd} is the Rayleigh fading coefficient of the channel from relay l to the destination, and \eta_{i,l}^{rd} denotes the AWGN with variance \sigma_l^2.

We further assume that all the fading coefficients, such as h_i^{sd} and h_{i,l}^{rd}, are constant during a transmit frame and vary from frame to frame, that is, quasistatic fading channels. Various cooperative coding schemes can be designed to achieve performance gains over quasistatic fading channels by designing the transmit signals of the source and relay nodes.

C. Han and W. Wu

Figure 2: Structure of the parity-check matrix H of a GLD code. H_1 is block diagonal with copies of the constituent parity-check matrix H_0, and H is formed by stacking the submatrices H_1, H_2, H_3, ..., where H_{j+1} = \pi_j(H_1) for column permutations \pi_1, \pi_2, ....

For fair comparison of different strategies, the total transmit power of each bit c_i is usually fixed as

P = P_s + \sum_{l \in R} P_l^{rd}.   (4)

3. DISTRIBUTED GLD CODES FOR COOPERATIVE COMMUNICATIONS

3.1. GLD codes and distributed GLD codes

In this part, generalized low-density codes are introduced and distributed GLD codes are proposed for cooperative networks with multiple relays. Following the construction in [17–19], GLD codes are defined by a sparse matrix H constructed by replacing each row of a sparse parity-check matrix, which itself defines an LDPC code, with n − k rows comprising one copy of the parity-check matrix H_0 of the constituent code C_0(n, k). Here, C_0(n, k) is usually a block code of code length n and information bit length k, such as a BCH code or a Reed-Solomon (RS) code.

For an (N, J, n) GLD code with code length N, the matrix H consists of J submatrices H_1, ..., H_J, where H_{j+1} = \pi_j(H_1) and \pi_j, j = 1, ..., J − 1, denote pseudorandom column permutations, that is, bit-level interleavers [19], as illustrated in Figure 2. Therefore, an (N, J, n) GLD code C can be considered as the intersection of J supercodes C_1, ..., C_J, that is, C = \cap_{j=1}^{J} C_j, where C_1 = C_0 \oplus \cdots \oplus C_0 and C_{j+1} = \pi_j(C_1). The code rate of (N, J, n) GLD codes is thus R = 1 − J(1 − r), where r = k/n is the code rate of the constituent code C_0.
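The rate formula above can be checked numerically with the parameters used later in this paper; a small sketch (the helper name is ours) using exact rational arithmetic:

```python
from fractions import Fraction

def gld_rate(J: int, n: int, k: int) -> Fraction:
    """Code rate R = 1 - J*(1 - r) of an (N, J, n) GLD code,
    where r = k/n is the constituent-code rate."""
    r = Fraction(k, n)
    return 1 - J * (1 - r)

# The (420, 2, 15) GLD code built from (15, 11) BCH constituent codes:
print(gld_rate(J=2, n=15, k=11))  # -> 7/15
```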

The parity-check matrix H of the GLD code is rearranged into systematic form using Gaussian elimination, from which the generator matrix G is obtained, and the information bits are encoded using G. GLD codes can be iteratively decoded based on the soft-input soft-output (SISO) decoders of the constituent codes [22, 23]. Specifically, the first supercode C_1 is decoded using N/n SISO decoders executed in parallel, since it is composed of N/n constituent codes. The extrinsic messages of the coded bits are then interleaved and fed to the decoder of the second supercode C_2 as a priori information. Excellent performance is obtained by iterating this process over the supercodes [19], that is, C_1 → C_2 → ··· → C_J → C_1 → ···.
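The decoding schedule above can be sketched as follows for J = 2. Here `siso_decode` is only a stand-in for a real SISO constituent decoder (e.g., BCJR), so the sketch shows the message flow C1 → C2 → C1 → ··· rather than actual error correction; all names are ours.

```python
import random

def siso_decode(channel_llrs, priors):
    # Placeholder for a SISO constituent decoder: a real decoder would return
    # extrinsic LLRs computed from the code constraints. Here we merely damp
    # the incoming information so that the schedule is runnable.
    return [0.5 * (c + p) for c, p in zip(channel_llrs, priors)]

def decode_gld(llr_ch, n, perm, iterations):
    """Iterative decoding of a J = 2 GLD code: each supercode is N/n parallel
    constituent decoders; extrinsic LLRs pass through the bit-level
    interleaver `perm` between the two supercodes."""
    N = len(llr_ch)
    ext = [0.0] * N                        # extrinsic messages, start at zero
    for _ in range(iterations):
        # Supercode C1: decode the N/n constituent blocks in parallel.
        ext = [v for s in range(0, N, n)
               for v in siso_decode(llr_ch[s:s + n], ext[s:s + n])]
        # Interleave, decode supercode C2 = pi(C1), then deinterleave.
        llr_pi = [llr_ch[perm[i]] for i in range(N)]
        ext_pi = [ext[perm[i]] for i in range(N)]
        out_pi = [v for s in range(0, N, n)
                  for v in siso_decode(llr_pi[s:s + n], ext_pi[s:s + n])]
        for i in range(N):
            ext[perm[i]] = out_pi[i]
    # Final hard decision on channel plus extrinsic information.
    return [1 if llr_ch[i] + ext[i] > 0 else 0 for i in range(N)]

rng = random.Random(0)
N, n = 30, 15
perm = list(range(N)); rng.shuffle(perm)
bits = decode_gld([rng.uniform(-1, 1) for _ in range(N)], n, perm, iterations=4)
```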

In the following, distributed GLD codes are devised for cooperative relay networks using (N, 2, n) GLD codes, since the performance of GLD codes with J = 2 is asymptotically good, as shown in [19]. To keep the description general, we still employ (N, J, n) to denote the GLD code in the following.

In the proposed scheme, the source node encodes the data using an (N, J, n) GLD encoder and then broadcasts modulated symbols to the sink and simultaneously to all the relay nodes in the first phase. Then, the GLD code is decoded and forwarded in a distributed manner by the relay nodes using its partial error detecting and error correcting capabilities. Specifically, the protocol assigns n_l, 1 ≤ l ≤ L, different constituent codes of the GLD code to relay l, according to the relay number L in the cluster. Since an (N, J, n) GLD code consists of J·N/n constituent codes, we configure the n_l to satisfy

\sum_{l=1}^{L} n_l = \frac{J \cdot N}{n}.   (5)

To use the transmit power efficiently, we assume L ≤ J·N/n and that each constituent code is allocated to only one relay in this scheme. Each relay then decodes the constituent codes for which it is responsible. The decoding results of the constituent codes that are decoded correctly are forwarded to the destination by the associated relays in the second phase. Note that relay nodes do not reencode the data, which reduces the complexity of the relay nodes compared with the distributed turbo codes in [11].

In this way, all the constituent codes forwarded by the relays construct a distributed GLD code. If all the constituent codes are forwarded successfully, each code symbol x_i, 1 ≤ i ≤ N, is in fact forwarded J times by J relays, which constitute a relay set R(c_i) ⊆ R for the associated code bit c_i, where |R(c_i)| = J. Therefore, the J copies of bit c_i from the relays in R(c_i) can be combined with the copy from the source to achieve diversity gain at the destination.

One advantage of the proposed scheme is that it can adapt to variation of the relay number L by simply adjusting the n_l. Note that, for each code bit c_i, the total power consumed by the source and the associated relays in R(c_i) remains fixed at P as the relay number L varies. Moreover, contrary to distributed turbo codes [11], the overall code rate R of the system is independent of the relay number L and is given by

R = \frac{1 - J(1 - r)}{1 + J}.   (6)

Therefore, the scheme is well suited to cooperative networks where the number of active relay nodes may vary from time to time. In contrast, distributed turbo codes may increase the traffic of the network when more relays are employed to improve the performance.
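Under these assumptions, both the allocation constraint (5) and the overall rate (6) are easy to check numerically; the even-split allocation helper below is a hypothetical illustration, not a protocol from the paper.

```python
from fractions import Fraction

def allocate_constituents(J, N, n, L):
    """Split the J*N/n constituent codes of an (N, J, n) GLD code as evenly as
    possible over L relays; the n_l always satisfy sum(n_l) = J*N/n (eq. 5)."""
    total = J * N // n
    assert L <= total, "at most one relay per constituent code"
    base, extra = divmod(total, L)
    return [base + (1 if l < extra else 0) for l in range(L)]

def overall_rate(J, n, k):
    """Overall system rate R = (1 - J*(1 - r)) / (1 + J) (eq. 6), with r = k/n."""
    r = Fraction(k, n)
    return (1 - J * (1 - r)) / (1 + J)

# (420, 2, 15) GLD code with (15, 11) BCH constituents: 56 constituent codes.
print(allocate_constituents(J=2, N=420, n=15, L=8))  # eight relays, 7 codes each
print(overall_rate(J=2, n=15, k=11))                 # -> 7/45
```

Note that the sum of the allocation is 56 regardless of L, which is exactly why the overall rate stays fixed as relays join or leave.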

Another advantage of the proposed scheme is that each relay node is only responsible for relaying one or several constituent codes to the destination, according to the assignment of the protocol. In this way, each relay consumes only a little energy to relay data, and significant diversity gain can be achieved at the destination, since the fading at different locations may vary. In general, the constituent codes are uniformly allocated to the available relay nodes in the set R so as to balance the power consumption and data payload of each relay node; the same quality for each bit is also achieved in this way. This also suggests improving the system performance by allowing relays that happen to experience good channel conditions toward the destination to forward more constituent codes than the others, using adaptive protocols. Such an adaptive strategy is not covered in this paper. The other design aspects and advantages of distributed GLD codes are addressed in the following parts.

Figure 3: Flow chart of progressive decoding for relay nodes. Hard decision decoding of C_0 is tried first; if the check H_0·Ĉ_0 = 0 fails, MAP decoding of C_0 follows; if the check still fails, the ith iterative decoding of the whole GLD code runs for i = 1, 2, ... until the check passes or i = I_max, and then the decoder stops.

3.2. Progressive decoding for relay nodes

Generally speaking, the internode channels in the same cluster can be modeled as AWGN channels and are usually of high quality; for example, the channels between different receiving nodes in the same cluster are modeled as error-free channels in [24]. To reduce the decoding complexity and adapt well to channel variations, a progressive decoding strategy is proposed for relay nodes, as illustrated in Figure 3, using the partial error detecting and correcting capabilities of the GLD code.

Take the scenario in which one relay node forwards one constituent code C_0 as an example; the process can be summarized as follows. First, decode the constituent code C_0(n, k) using a hard decision algorithm based on the received n symbols and obtain the hard decision Ĉ_0 of the codeword. Then, use the parity-check matrix H_0 to verify whether the codeword is correct. If H_0·Ĉ_0 = 0, the decoder stops and Ĉ_0 is forwarded to the destination. Otherwise, the codeword is decoded with a maximum a posteriori (MAP) decoder, that is, the BCJR algorithm [22, 23], and the same check criterion is applied to the decoding result. If H_0·Ĉ_0 = 0, the relay stops decoding and forwards Ĉ_0 to the destination. Otherwise, the relay executes iterative decoding of the whole GLD code based on all N symbols received from the source. During the iterative decoding, the same check criterion is executed after each iteration; once the check result is correct or the iteration count reaches the maximum I_max, the decoding process stops.
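The escalation logic of this progressive decoding can be sketched as follows. The three decoder arguments are placeholders for a real hard-decision decoder, BCJR decoder, and iterative GLD decoder; only the GF(2) syndrome check H_0·Ĉ_0 = 0 is shown in full, and the toy single-parity-check H_0 is ours.

```python
def syndrome_ok(H0, c_hat):
    """Check H0 * c_hat = 0 over GF(2)."""
    return all(sum(h & c for h, c in zip(row, c_hat)) % 2 == 0 for row in H0)

def progressive_decode(rx, H0, hard_dec, map_dec, gld_iter, i_max):
    """Progressive decoding at a relay: escalate from hard-decision decoding
    to MAP decoding to full iterative GLD decoding only as needed."""
    c_hat = hard_dec(rx)                 # cheapest attempt first
    if syndrome_ok(H0, c_hat):
        return c_hat, "hard"
    c_hat = map_dec(rx)                  # then a single MAP (BCJR-style) pass
    if syndrome_ok(H0, c_hat):
        return c_hat, "map"
    for i in range(1, i_max + 1):        # last resort: iterate on the GLD code
        c_hat = gld_iter(rx, i)
        if syndrome_ok(H0, c_hat):
            return c_hat, f"gld-{i}"
    return c_hat, "fail"                 # I_max reached without a valid codeword

# Toy single-parity-check example: H0 = [1 1 1], so valid words have even weight.
H0 = [[1, 1, 1]]
decoded, stage = progressive_decode(
    rx=[0.9, 0.8, -0.7],
    H0=H0,
    hard_dec=lambda r: [1 if v > 0 else 0 for v in r],   # stand-in decoders
    map_dec=lambda r: [1, 1, 0],
    gld_iter=lambda r, i: [1, 1, 0],
    i_max=10,
)
```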

In theory, the check criterion can let undetected errors through, but this is ignored in this paper owing to the good internode channels within a cluster and the good error detecting capability of the constituent code. In addition, the probability of decoding failure at the relays may be very low thanks to the significant performance of GLD codes.

3.3. Processing at the destination

In the proposed scheme, several independent copies associated with one symbol are obtained from the source and relay nodes, and the signals from the different paths are combined before being input to the GLD decoder at the destination. With this scheme, coding gain and diversity gain are achieved with little additional complexity compared with strategies without relays.

For each bit c_i in a GLD codeword, the log-likelihood ratio (LLR) can be written as

L(c_i) = \log \frac{P(c_i = 1 \mid y_i^{sd}, Y_i^{rd})}{P(c_i = 0 \mid y_i^{sd}, Y_i^{rd})} = \log \frac{P(y_i^{sd}, Y_i^{rd} \mid c_i = 1)}{P(y_i^{sd}, Y_i^{rd} \mid c_i = 0)},   (7)

where the set Y_i^{rd} = \{ y_{i,l}^{rd} \mid l \in R(c_i) \}. Since all the paths are independent, we have

L(c_i) = \log \frac{P(y_i^{sd} \mid c_i = 1)}{P(y_i^{sd} \mid c_i = 0)} + \sum_{l \in R(c_i)} \log \frac{P(y_{i,l}^{rd} \mid c_i = 1)}{P(y_{i,l}^{rd} \mid c_i = 0)} = L^{sd}(c_i) + \sum_{l=1}^{J} L_l^{rd}(c_i).   (8)

In (8), L^{sd}(c_i) is the LLR from the source node to the destination, given by

L^{sd}(c_i) = \log \frac{P(y_i^{sd} \mid c_i = 1)}{P(y_i^{sd} \mid c_i = 0)}.   (9)

The LLR from relay l, l \in R(c_i), to the destination is

L_l^{rd}(c_i) = \log \frac{P(y_{i,l}^{rd} \mid c_i = 1)}{P(y_{i,l}^{rd} \mid c_i = 0)}.   (10)

Therefore, the receiver structure is depicted in Figure 4.


In the receiver, the fading factors and the Gaussian noise parameters of each channel to the destination need to be estimated before the combination of soft information. If BPSK modulation is adopted in the system as described above, the LLR from the source to the destination is given by

L^{sd}(c_i) = \frac{2 \sqrt{P_s}\, h_i^{sd}\, y_i^{sd}}{\sigma^2},   (11)

and the LLR from relay l, l \in R(c_i), to the destination is

L_l^{rd}(c_i) = \frac{2 \sqrt{P_l^{rd}}\, h_{i,l}^{rd}\, y_{i,l}^{rd}}{\sigma_l^2}.   (12)
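A minimal sketch of the combiner defined by (8), (11), and (12), assuming BPSK and perfectly estimated channel parameters; the function and variable names are ours.

```python
import math

def llr_sd(y, h, Ps, sigma2):
    """Source-destination LLR (eq. 11): L = 2*sqrt(Ps)*h*y / sigma^2."""
    return 2.0 * math.sqrt(Ps) * h * y / sigma2

def llr_rd(y, h, Prd, sigma2):
    """Relay-destination LLR (eq. 12), one term per relay in R(c_i)."""
    return 2.0 * math.sqrt(Prd) * h * y / sigma2

def combine(sd_obs, rd_obs):
    """Combined LLR (eq. 8): sum of the independent per-path LLRs of bit c_i.
    sd_obs = (y, h, Ps, sigma2); rd_obs = list of (y, h, Prd, sigma2)."""
    return llr_sd(*sd_obs) + sum(llr_rd(*o) for o in rd_obs)

# One bit observed over the direct path and two relay paths:
L = combine((0.4, 0.9, 1/3, 0.5), [(1.1, 1.2, 1/3, 0.5), (-0.2, 0.3, 1/3, 0.5)])
```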

Then, the combined LLRs are sent to the GLD decoder for iterative decoding. In this way, diversity and coding gain are achieved using distributed GLD codes with low complexity at the destination. Specifically, the trellis-based MAP algorithm [22, 23] can be employed to decode the constituent codes of the GLD code in parallel. Compared with multiple turbo codes [16], the decoding latency of GLD codes is shorter due to the parallel decoding of the N/n constituent codes in each supercode. Moreover, in the proposed scheme, the destination node always decodes the same (N, J, n) GLD code even when the relay number L varies. In distributed turbo codes, however, the receiver needs to decode a multiple turbo code with L + 1 constituent RSC codes, whose code length also increases with the relay number L and is usually much longer than the (N, J, n) GLD code length N. Therefore, the complexity and decoding latency of the proposed scheme can be greatly reduced compared with distributed turbo codes, especially when L is large.

In conclusion, Section 3 presents a novel strategy using distributed GLD codes for multiple-relay cooperative communications. Firstly, it is a flexible scheme which adapts well to cooperative networks with different numbers of relays. Secondly, the complexity of the relay nodes is low, since the progressive decoding algorithm allows partial decoding of the GLD code and no reencoding is needed. When there are many relays, the power consumption can be approximately balanced across the relay nodes, which is essential for relays especially in wireless sensor networks. Finally, diversity gain and coding gain are achieved with little additional complexity compared with the strategy without relays.

4. SIMULATION RESULTS

In this section, the performance of the proposed scheme is simulated and compared with other cooperative coding schemes. In the simulations, the (420, 2, 15) GLD code is employed, which takes (15, 11) BCH codes as constituent codes and has a code rate of R = 7/15.

Figure 4: The receiver structure at the destination. A combiner sums the LLRs L_1^{rd}(c_i), ..., L_J^{rd}(c_i), and L^{sd}(c_i) to form L(c_i), which is fed to the GLD decoder.

Firstly, we evaluate the progressive processing at the relay nodes. Figure 5 compares the bit error rate (BER) performance over the AWGN channel of the (15, 11) BCH code with hard decision decoding and with MAP decoding, and of the (420, 2, 15) GLD code under different numbers of iterations, since the internode channels in a cluster are usually modeled as AWGN channels. Here, the horizontal axis E_s/N_0 denotes the signal-to-noise ratio (SNR) of symbols after encoding and modulation. The results illustrate that a proper decoding scheme in the progressive processing may be chosen according to the source-relay channel conditions.

Figure 5: Performance of the progressive decoding at relays over the AWGN channel (BER versus E_s/N_0 for BCH hard decision, BCH MAP, and the GLD code with 1, 2, 4, and 10 iterations).

Secondly, the performance of distributed GLD codes is simulated under different conditions. In the following, we assume the internode channels in the same cluster are perfect, as in [11, 24]. The source and relay nodes face independent and identically distributed quasistatic Rayleigh fading channels toward the distant destination. Here, the source and each relay use the same energy to transmit each symbol: if the transmit power of each symbol is P, the source broadcasts the symbol using P/3 and the two relay nodes related to this symbol share the remaining 2P/3, since each symbol is relayed twice by two relays in the designed distributed GLD code. If there is no relay node, that is, L = 0, all the transmit power P is allocated to the source node. Therefore, the comparison is fair and the overall code rate of the system is R = 7/45. The GLD decoder iterates 10 times at the receiver. We simulate scenarios with and without the source-destination path.

Figure 6: Performance of distributed GLD codes without source-destination path (BER versus E_b/N_0 for L = 0, 2, 4, 8, 14, 28, 56).

When the source-destination path does not exist, the BER performance of distributed GLD codes with different numbers of relays is illustrated in Figure 6. The horizontal axis in Figure 6 is E_b/N_0, denoting the SNR of information bits; the horizontal axes in the performance figures below are the same. As seen from Figure 6, distributed GLD codes can significantly improve the system performance, and the BER performance improves further as the relay number increases. In particular, when L = 56, the distributed GLD code achieves about 35 dB gain at a BER of 10^{-5} over the scheme without relays. In the scheme with 56 relays, each relay only needs to forward a single (15, 11) BCH codeword, that is, 15 symbols, so little latency, complexity, and power consumption is incurred at each relay node.

When the source-destination path is included, Figure 7 shows the BER performance under different numbers of relay nodes. The performance again improves as the relay number increases. Take the scheme with two relays as an example, in which each relay in fact forwards one supercode of the GLD code: it achieves about 25 dB gain over the scheme without relay nodes at a BER of 10^{-5}.

The performance of distributed GLD codes with and without the source-destination path is further compared in Figure 8. The source-destination path improves the performance for a system with the same number of relay nodes, especially when L is small, such as L = 2, 4. However, as the relay number increases, the gap decreases. This may be because, when L is small, the diversity from the source-destination path is prominent in the overall performance.

Figure 7: Performance of distributed GLD codes with source-destination path (BER versus E_b/N_0 for L = 0, 2, 4, 8, 14, 28, 56).

Figure 8: Performance comparison of distributed GLD codes with source-destination path (SD) and without source-destination path (No SD), for L = 2, 4, 14, 56.

Thirdly, the performance of the proposed scheme is further compared with two other cooperative coding schemes when the source-destination path exists. First, Figure 9 compares the performance of distributed GLD codes and distributed turbo codes. In the simulations, rate 1/2 (7, 5)_8 recursive systematic convolutional (RSC) codes are used at the source and all the relay nodes to construct the distributed turbo code following [11]. Specifically, the source broadcasts the RSC code with a code length of 420 bits, equal to the

length of the GLD code at the source, and each relay node only transmits the parity-check bits of its own RSC code. For fair comparison, four relay nodes are used to construct the distributed turbo code with an overall rate of R = 1/6, approximately equal to the 7/45 of the distributed GLD code. The transmit power is allocated to the source and relay nodes according to the same rule as in the distributed GLD code. Considering the complexity of the receiver, we choose the soft-output Viterbi algorithm (SOVA) to decode each RSC code of the multiple turbo code at the destination.

Figure 9: Performance comparison of distributed GLD codes (R = 7/45, L = 2, 4, 56) and the distributed turbo code (R = 1/6, L = 4).

Figure 9 illustrates that distributed GLD codes outperform the distributed turbo code. Here, the destination node in distributed GLD codes always decodes the (420, 2, 15) GLD code for any relay number L. In distributed turbo codes, however, the receiver needs to decode a multiple turbo code of code length 420 × (1 + L/2) bits, which consists of L + 1 constituent (7, 5)_8 RSC codes. The complexity and decoding latency of the proposed scheme can thus be greatly reduced compared with distributed turbo codes, especially when L is large. In addition, distributed GLD codes may be used to provide different qualities of service (QoS) by employing different numbers of relays while the network traffic remains constant, since they can easily adapt to the variation of the relay number and keep a fixed overall code rate.

Then, Figure 10 compares the proposed scheme with another relaying coding scheme that uses turbo codes working in a manner similar to GLD codes, called the turbo multiple relay (TMR) scheme in this paper. In the TMR scheme, the source node broadcasts a rate 1/2 turbo code using rate 1/2 (7, 5)_8 RSC codes as constituent codes. The turbo code length is 420 bits, and the codeword is also forwarded twice by relay nodes to achieve the overall rate R = 1/6. It is observed that the TMR scheme exhibits slightly better performance in the waterfall region but is worse in the error floor region. In fact, the TMR scheme has some drawbacks; for example, it is difficult for a relay to decode and check only part of the codeword, while the proposed scheme can ingeniously exploit the intrinsic partial error detecting and error correcting capabilities of the GLD code.

Figure 10: Performance comparison of distributed GLD codes (R = 7/45) and the turbo multiple relay scheme (TMR, R = 1/6), for L = 14, 56.

Finally, different power allocation strategies for distributed GLD codes are investigated. In the simulations above, the powers allocated to the source and the two associated relays are P/3 and 2P/3, respectively, so each symbol from either the source or a relay has the same power level. In practical situations, the destination node is usually located far from the source, while the relay nodes surround the source in a cluster. When large-scale path loss is considered, the power levels received at relay nodes may be much higher than at the destination. Thus, unequal power allocation (UPA) may be considered to further improve the performance of distributed GLD codes.

Consider a network topology as illustrated in Figure 11. Here, the transmit cluster is limited to a region with a radius of 50 meters, and the destination is 250 meters away from the source node, similar to the configuration in [25]. Generally, relay nodes are uniformly distributed within the cluster; however, we simplify the situation by assuming that all the relay nodes in the cluster lie on a circle of radius 50 meters, 250 meters away from the destination, with the source at the center of the circle. We assume that the average large-scale path loss is a function of the separation distance with a path loss exponent γ = 2. Therefore, we allocate transmit power 2P/25 to the source and let the two associated relays share the remaining 23P/25 for each symbol. In this way, the received E_b/N_0 at a relay node is still about 6.4 dB higher than at the destination, and thus the reliability of the internode channels in the cluster can still be guaranteed.

Figure 11: Topology example of the cooperative network: the source s sits in a transmit cluster of radius 50 m, and the destination d is 250 m away.

Figure 12: Performance of distributed GLD codes with the UPA scheme, with and without source-destination path (L = 0, 2, 8, 14, 56).

Figure 12 shows the BER performance of distributed GLD codes with UPA. The source-destination path clearly still improves the performance for a system with the same number of relay nodes, especially when L is small, such as L = 2. However, compared with Figure 8, the improvement from the source-destination path in the UPA scheme is narrower, especially when L is large, such as L = 14, 56. For systems with many relays, the diversity of the source-destination path can almost be ignored.

When the source-destination path does not exist, the performance of distributed GLD codes with the two power allocation schemes is compared in Figure 13. The UPA strategy with P_s = 2P/25 improves the performance over the scheme with P_s = P/3 when the system has the same number of relays. Furthermore, for the different relay numbers L, the improvement of the UPA strategy is in each case about 1.3 dB at a BER of 10^{-5}.

When the source-destination path is included, the BER performance of distributed GLD codes with the two power allocation strategies is compared in Figure 14. Interestingly, when L is small, such as 2 and 4, the UPA scheme suffers a performance loss compared with the scheme with P_s = P/3. However, when L is large, such as 14 and 56, the UPA scheme instead exhibits better performance than the scheme with P_s = P/3. This may be because, when L is small, the UPA scheme weakens the diversity contributed by the source-destination path, while it strengthens the contribution of the relays for large L such as 14 and 56.

Figure 13: Performance comparison between the two power allocation schemes (P_s = P/3 versus P_s = 2P/25) without source-destination path (L = 2, 4, 8, 56).

Figure 14: Performance comparison between the two power allocation schemes (P_s = P/3 versus P_s = 2P/25) with source-destination path (L = 2, 4, 8, 14, 56).


5. CONCLUSION

Distributed generalized low-density codes are constructed for multiple-relay cooperative communications. The proposed scheme adapts well to variation of the relay number in a wireless relay network while the overall code rate of the system remains fixed. In the scheme, each relay is responsible for forwarding one or several constituent codes of the GLD code, so the complexity and power consumption of each relay node are quite limited. Moreover, a progressive decoding strategy is proposed for the relay nodes to further reduce complexity and adapt to source-relay channel variations. At the destination, the soft information is first combined, and then iterative decoding of the GLD code is performed to achieve coding gain and diversity gain. The significant performance improvements have also been verified by simulations over quasistatic fading channels.

In the future, the performance may be further improved by elaborately allocating constituent codes and transmit power to different relay nodes in view of their distances to the destination and the channel variations.

ACKNOWLEDGMENTS

The authors thank the editors and reviewers for their valuable comments and suggestions. This work was supported by the National Basic Research Program of China (2007CB310604) and the NSFC (60772108).

REFERENCES

[1] A. Sendonaris, E. Erkip, and B. Aazhang, "Increasing uplink capacity via user cooperation diversity," in Proceedings of IEEE International Symposium on Information Theory (ISIT '98), p. 156, Cambridge, Mass, USA, August 1998.

[2] A. Sendonaris, E. Erkip, and B. Aazhang, "User cooperation diversity—part I: system description," IEEE Transactions on Communications, vol. 51, no. 11, pp. 1927–1938, 2003.

[3] A. Sendonaris, E. Erkip, and B. Aazhang, "User cooperation diversity—part II: implementation aspects and performance analysis," IEEE Transactions on Communications, vol. 51, no. 11, pp. 1939–1948, 2003.

[4] J. N. Laneman, G. W. Wornell, and D. N. C. Tse, "An efficient protocol for realizing cooperative diversity in wireless networks," in Proceedings of IEEE International Symposium on Information Theory (ISIT '01), p. 294, Washington, DC, USA, June 2001.

[5] T. E. Hunter and A. Nosratinia, "Cooperation diversity through coding," in Proceedings of IEEE International Symposium on Information Theory (ISIT '02), p. 220, Lausanne, Switzerland, June-July 2002.

[6] T. E. Hunter and A. Nosratinia, "Diversity through coded cooperation," IEEE Transactions on Wireless Communications, vol. 5, no. 2, pp. 283–289, 2006.

[7] J. Ezri and M. Gastpar, "On the performance of independently designed LDPC codes for the relay channel," in Proceedings of IEEE International Symposium on Information Theory (ISIT '06), pp. 977–981, Seattle, Wash, USA, July 2006.

[8] P. Razaghi and W. Yu, "Bilayer low-density parity-check codes for decode-and-forward in relay channels," IEEE Transactions on Information Theory, vol. 53, no. 10, pp. 3723–3739, 2007.

[9] A. Chakrabarti, A. de Baynast, A. Sabharwal, and B. Aazhang, "Low density parity check codes for the relay channel," IEEE Journal on Selected Areas in Communications, vol. 25, no. 2, pp. 280–291, 2007.

[10] B. Zhao and M. C. Valenti, "Distributed turbo coded diversity for relay channel," Electronics Letters, vol. 39, no. 10, pp. 786–787, 2003.

[11] M. C. Valenti and B. Zhao, "Distributed turbo codes: towards the capacity of the relay channel," in Proceedings of the 58th IEEE Vehicular Technology Conference (VTC '03), vol. 1, pp. 322–326, Orlando, Fla, USA, October 2003.

[12] Y. Li, B. Vucetic, T. F. Wong, and M. Dohler, "Distributed turbo coding with soft information relaying in multihop relay networks," IEEE Journal on Selected Areas in Communications, vol. 24, no. 11, pp. 2040–2050, 2006.

[13] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, Mass, USA, 1963.

[14] K. Azarian, H. El Gamal, and P. Schniter, "On the achievable diversity-multiplexing tradeoff in half-duplex cooperative channels," IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4152–4172, 2005.

[15] A. K. Sadek, W. Su, and K. J. R. Liu, "Clustered cooperative communications in wireless networks," in Proceedings of IEEE Global Telecommunications Conference (GLOBECOM '05), vol. 3, pp. 1–5, St. Louis, Mo, USA, November-December 2005.

[16] D. Divsalar and F. Pollara, "Multiple turbo codes," in Proceedings of IEEE Military Communications Conference (MILCOM '95), vol. 1, pp. 279–285, San Diego, Calif, USA, November 1995.

[17] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Transactions on Information Theory, vol. 27, no. 5, pp. 533–547, 1981.

[18] M. Lentmaier and K. S. Zigangirov, "Iterative decoding of generalized low-density parity-check codes," in Proceedings of IEEE International Symposium on Information Theory (ISIT '98), p. 149, Cambridge, Mass, USA, August 1998.

[19] J. Boutros, O. Pothier, and G. Zemor, "Generalized low density (Tanner) codes," in Proceedings of IEEE International Conference on Communications (ICC '99), vol. 1, pp. 441–445, Vancouver, Canada, June 1999.

[20] O. Pothier, L. Brunel, and J. Boutros, "A low complexity FEC scheme based on the intersection of interleaved block codes," in Proceedings of the 49th IEEE Vehicular Technology Conference (VTC '99), vol. 1, pp. 274–278, Houston, Tex, USA, May 1999.

[21] M. Yuksel and E. Erkip, "Diversity-multiplexing tradeoff in cooperative wireless systems," in Proceedings of the 40th Annual Conference on Information Sciences and Systems (CISS '06), pp. 1062–1067, Princeton, NJ, USA, March 2006.

[22] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 284–287, 1974.

[23] T. Matsumoto, "Decoding performance of linear block codes using a trellis in digital mobile radio," IEEE Transactions on Vehicular Technology, vol. 39, no. 1, pp. 68–74, 1990.

[24] X. Li, T. F. Wong, and J. M. Shea, “Performance analysisfor collaborative decoding with least-reliable-bits exchange onAWGN channels,” IEEE Transactions on Communications, vol.56, no. 1, pp. 58–69, 2008.

[25] S. Yi, B. Azimi-Sadjadi, S. Kalyanaraman, and V. Subrama-nian, “Error control code combining techniques in cluster-based cooperative wireless networks,” in Proceedimgs of IEEEInternational Conference on Communications (ICC ’05), vol. 5,pp. 3510–3514, Seoul, South Korea, May 2005.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 658042, 14 pages
doi:10.1155/2008/658042

Research Article

Reed-Solomon Turbo Product Codes for Optical Communications: From Code Optimization to Decoder Design

Raphael Le Bidan, Camille Leroux, Christophe Jego, Patrick Adde, and Ramesh Pyndiah

Institut TELECOM, TELECOM Bretagne, CNRS Lab-STICC, Technopole Brest-Iroise, CS 83818, 29238 Brest Cedex 3, France

Correspondence should be addressed to Raphael Le Bidan, [email protected]

Received 31 October 2007; Accepted 22 April 2008

Recommended by Jinhong Yuan

Turbo product codes (TPCs) are an attractive solution to improve link budgets and reduce system costs by relaxing the requirements on expensive optical devices in high-capacity optical transport systems. In this paper, we investigate the use of Reed-Solomon (RS) turbo product codes for 40 Gbps transmission over optical transport networks and 10 Gbps transmission over passive optical networks. An algorithmic study is first performed in order to design RS TPCs that are compatible with the performance requirements imposed by the two applications. Then, a novel ultrahigh-speed parallel architecture for turbo decoding of product codes is described. A comparison with binary Bose-Chaudhuri-Hocquenghem (BCH) TPCs is performed. The results show that high-rate RS TPCs offer a better complexity/performance tradeoff than BCH TPCs for low-cost Gbps fiber-optic communications.

Copyright © 2008 Raphael Le Bidan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The field of channel coding has undergone major advances over the last twenty years. With the invention of turbo codes [1], followed by the rediscovery of low-density parity-check (LDPC) codes [2], it is now possible to approach the fundamental limit of channel capacity within a few tenths of a decibel over several channel models of practical interest [3]. Although this has been a major step forward, there is still a need for improvement in forward-error correction (FEC), notably in terms of code flexibility, throughput, and cost.

In the early 1990s, coinciding with the discovery of turbo codes, the deployment of FEC began in optical fiber communication systems. For a long time, there was no real incentive to use channel coding in optical communications, since the bit error rate (BER) in lightwave transmission systems can be as low as 10^-9–10^-15. Then, the progressive introduction of in-line optical amplifiers and the advent of wavelength division multiplexing (WDM) technology accelerated the use of FEC, up to the point that it is now considered almost routine in optical communications. Channel coding is seen as an efficient technique to reduce system costs and to improve margins against various line impairments such as beat noise, channel cross-talk, or nonlinear dispersion. On

the other hand, the design of channel codes for optical communications poses remarkable challenges to the system engineer. Good codes are indeed expected to provide at the same time a low overhead (high code rate) and guaranteed large coding gains at very low BER [4]. Furthermore, the issue of decoding complexity should not be overlooked, since data rates have now reached 10 Gbps and beyond (up to 40 Gbps), calling for FEC devices with low power consumption.

FEC schemes for optical communications are commonly classified into three generations. The reader is referred to [5, 6] for an in-depth historical perspective of FEC for optical communication. First-generation FEC schemes mainly relied on the (255, 239) Reed-Solomon (RS) code over the Galois field GF(256), with only 6.7% overhead. In particular, this code was recommended by the ITU for long-haul submarine transmissions. Then, the development of WDM technology provided the impetus for moving to second-generation FEC systems, based on concatenated codes with higher coding gains [7]. Third-generation FEC based on soft-decision decoding is now the subject of intense research, since stronger FEC is seen as a promising way to reduce costs by relaxing the requirements on expensive optical devices in high-capacity transport systems.


Figure 1: Codewords of the product code P = C1 ⊗ C2. (The N1 × N2 array comprises the K1 × K2 information symbols, the checks on rows, the checks on columns, and the checks on checks.)

First introduced in [8], turbo product codes (TPCs) based on binary Bose-Chaudhuri-Hocquenghem (BCH) codes are an efficient and mature technology that has found its way into several (either proprietary or public) wireless transmission systems [9]. Recently, BCH TPCs have received considerable attention for third-generation FEC in optical systems, since they show good performance at high code rates and have a high minimum distance by construction. Furthermore, their regular structure is amenable to very-high-data-rate parallel decoding architectures [10, 11]. Research on TPCs for lightwave systems culminated recently with the experimental demonstration of a record coding gain of 10.1 dB at a BER of 10^-13, using a (144, 128) × (256, 239) BCH turbo product code with 24.6% overhead [12]. This gain was measured using a turbo-decoding very-large-scale-integration (VLSI) circuit operating on 3-bit soft inputs at a data rate of 12.4 Gbps. LDPC codes are also considered serious candidates for third-generation FEC. Impressive coding gains have notably been demonstrated by Monte Carlo simulation [13]. To date, however, to the best of the authors' knowledge, no high-rate LDPC decoding architecture has been proposed to demonstrate the practicality of LDPC codes for Gbps optical communications.

In this work, we investigate the use of Reed-Solomon TPCs for third-generation FEC in fiber-optic communication. Two specific applications are envisioned, namely 40 Gbps line-rate transmission over optical transport networks (OTNs) and 10 Gbps data transmission over passive optical networks (PONs). These two applications have different requirements with respect to FEC. An algorithmic study is first carried out in order to design RS product codes for the two applications. In particular, it is shown that high-rate RS TPCs based on carefully designed single-error-correcting RS codes realize an excellent performance/complexity trade-off for both scenarios, compared to binary BCH TPCs of similar code rate. In a second step, a novel parallel decoding architecture is introduced. This architecture allows decoding of turbo product codes at data rates of 10 Gbps and beyond. Complexity estimations show that RS TPCs trade off area and throughput better than BCH TPCs for full-parallel

decoding architectures. An experimental setup based on field-programmable gate array (FPGA) devices has been successfully designed for 10 Gbps data transmission. This prototype demonstrates the practicality of RS TPCs for next-generation optical communications.

The remainder of the paper is organized as follows. Construction and properties of RS product codes are introduced in Section 2. Turbo decoding of RS product codes is described in Section 3. Product code design for optical communication and related algorithmic issues are discussed in Section 4. The challenging issue of designing a high-throughput parallel decoding architecture for product codes is developed in Section 5. A comparison of throughput and complexity between decoding architectures for RS and BCH TPCs is carried out in Section 6. Section 7 describes the successful realization of a turbo decoder prototype for 10 Gbps transmission. Conclusions are finally given in Section 8.

2. REED-SOLOMON PRODUCT CODES

2.1. Code construction and systematic encoding

Let C1 and C2 be two linear block codes over the Galois field GF(2^m), with parameters (N1, K1, D1) and (N2, K2, D2), respectively. The product code P = C1 ⊗ C2 consists of all N1 × N2 matrices such that each column is a codeword in C1 and each row is a codeword in C2. It is well known that P is an (N1N2, K1K2) linear block code with minimum distance D1D2 over GF(2^m) [14]. The direct product construction thus offers a simple way to build long block codes with relatively large minimum distance using simple, short component codes with small minimum distance. When C1 and C2 are two RS codes over GF(2^m), we obtain an RS product code over GF(2^m). Similarly, the direct product of two binary BCH codes yields a binary BCH product code.

Starting from a K1 × K2 information matrix, systematic encoding of P is easily accomplished by first encoding the K1 information rows using a systematic encoder for C2. Then, the N2 columns are encoded using a systematic encoder for C1, resulting in the N1 × N2 coded matrix shown in Figure 1.
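The row-then-column encoding order can be sketched generically. The helper below is illustrative only: `spc_encode` is a single parity-check stand-in for the paper's SEC RS component encoders, chosen purely to keep the example short and self-contained.

```python
def product_encode(info, encode_row, encode_col):
    """Systematically encode a K1 x K2 information matrix into an
    N1 x N2 product codeword: rows first (code C2), then columns (code C1)."""
    rows = [encode_row(r) for r in info]               # K1 rows of length N2
    coded_cols = [encode_col(list(c)) for c in zip(*rows)]  # N2 cols of length N1
    return [list(r) for r in zip(*coded_cols)]         # back to row-major N1 x N2

# Stand-in component code over GF(2): append one parity bit (hypothetical,
# in place of a systematic RS encoder).
def spc_encode(word):
    return word + [sum(word) % 2]

info = [[1, 0, 1],
        [0, 1, 1]]
cw = product_encode(info, spc_encode, spc_encode)
# Every row and every column of cw has even parity, including the
# "checks on checks" corner of Figure 1.
```

Note that encoding columns after rows also encodes the row-check columns, which is what produces the consistent "checks on checks" block.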

2.2. Binary image of RS product codes

Binary modulation is commonly used in optical communication systems. A binary expansion of the RS product code is then required for transmission. The extension field GF(2^m) forms a vector space of dimension m over GF(2). A binary image Pb of P is thus obtained by expanding each code symbol in the product code matrix into m bits using some basis B for GF(2^m). The polynomial basis B = {1, α, ..., α^(m−1)}, where α is a primitive element of GF(2^m), is the usual choice, although other bases exist [15, Chapter 8]. By construction, Pb is a binary linear code with length mN1N2, dimension mK1K2, and minimum distance d at least as large as the symbol-level minimum distance D = D1D2 [14, Section 10.5].
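Over the polynomial basis, the expansion of a symbol is just its m base-2 digits, with bit i the coefficient of α^i. A minimal sketch (storing field elements as integers is an implementation convention, not part of the paper):

```python
def to_bits(symbol, m):
    """Expand a GF(2^m) symbol (stored as an integer) into m bits over the
    polynomial basis {1, alpha, ..., alpha^(m-1)}."""
    return [(symbol >> i) & 1 for i in range(m)]

def binary_image(matrix, m):
    """Bit-level expansion of an N1 x N2 symbol matrix (m bits per symbol)."""
    return [[bit for s in row for bit in to_bits(s, m)] for row in matrix]
```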


3. TURBO DECODING OF RS PRODUCT CODES

Product codes usually have high dimension, which precludes maximum-likelihood (ML) soft-decision decoding. Yet the particular structure of the product code lends itself to an efficient iterative "turbo" decoding algorithm offering close-to-optimum performance at high-enough signal-to-noise ratios (SNRs).

Assume that a binary transmission has taken place over a binary-input channel. Let Y = (y_{i,j}) denote the matrix of samples delivered by the receiver front-end. The turbo decoder soft input is the channel log-likelihood ratio (LLR) matrix R = (r_{i,j}), with

r_{i,j} = A ln( f1(y_{i,j}) / f0(y_{i,j}) ).  (1)

Here A is a suitably chosen constant, and fb(y) denotes the probability of observing the sample y at the channel output given that bit b has been transmitted.

Turbo decoding is realized by successively decoding the rows and columns of the channel matrix R using soft-input soft-output (SISO) decoders, and by exchanging reliability information between the decoders until a reliable decision can be made on the transmitted bits.

3.1. SISO decoding of the component codes

In this work, SISO decoding of the RS component codes is performed at the bit level using the Chase-Pyndiah algorithm. First introduced in [8] for binary BCH codes and later extended to RS codes in [16], the Chase-Pyndiah decoder consists of a soft-input hard-output Chase-2 decoder [17] augmented by a soft-output computation unit.

Given a soft-input sequence r = (r1, ..., r_{mN}) corresponding to a row (N = N2) or column (N = N1) of R, the Chase-2 decoder first forms a binary hard-decision sequence y = (y1, ..., y_{mN}). The reliability of the hard decision yi on the ith bit is measured by the magnitude |ri| of the corresponding soft input. Then, Nep error patterns are generated by testing different combinations of 0 and 1 in the Lr least reliable bit positions. In general, Nep ≤ 2^Lr, with equality if all combinations are considered. These error patterns are added modulo 2 to the hard-decision sequence y to form candidate sequences. Algebraic decoding of the candidate sequences returns a list of at most Nep distinct candidate codewords. Among them, the codeword d at minimum Euclidean distance from the input sequence r is selected as the final decision.

Soft-output computation is then performed as follows. For a given bit i, the list of candidate codewords is searched for a competing codeword c at minimum Euclidean distance from r and such that ci ≠ di. If such a codeword exists, then the soft output r′i on the ith bit is given by

r′i = ( (‖r − c‖^2 − ‖r − d‖^2) / 4 ) × di,  (2)

Figure 2: Block diagram of the turbo decoder at the kth half-iteration. (The row/column SISO decoder takes Rk, formed from the channel matrix R and the scaled extrinsic matrix αkWk, and outputs the new extrinsic matrix Wk+1 and the hard decisions Dk.)

where ‖·‖^2 denotes the squared norm of a sequence. Otherwise, the soft output is computed as

r′i = ri + β × di,  (3)

where β is a positive value computed on a per-codeword basis, as suggested in [18]. Following the so-called "turbo principle," the soft input ri is finally subtracted from the soft output r′i to obtain the extrinsic information

wi = r′i − ri,  (4)

which will be sent to the next decoder.
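The steps above can be condensed into a short sketch. The decoder interface (`decode`, a hard-input algebraic decoder returning a codeword or None) and the bit-to-symbol mapping 0 → +1, 1 → −1 are illustrative assumptions; the repetition-code decoder used below is a toy stand-in for the RS component decoder.

```python
import itertools

def chase_pyndiah(r, decode, Lr=4, beta=0.5):
    """One SISO pass of the Chase-Pyndiah algorithm on a soft-input word r.
    Assumes at least one test pattern decodes successfully."""
    n = len(r)
    y = [0 if x >= 0 else 1 for x in r]                  # hard decisions
    least = sorted(range(n), key=lambda i: abs(r[i]))[:Lr]
    cands = set()
    for flips in itertools.product((0, 1), repeat=Lr):   # Nep = 2^Lr patterns
        t = y[:]
        for i, f in zip(least, flips):
            t[i] ^= f
        c = decode(t)
        if c is not None:
            cands.add(tuple(c))
    def metric(c):                        # squared Euclidean distance to r
        return sum((x - (1 - 2 * b)) ** 2 for x, b in zip(r, c))
    d = min(cands, key=metric)            # final decision codeword
    w = []
    for i in range(n):
        di = 1 - 2 * d[i]
        comp = [c for c in cands if c[i] != d[i]]        # competing codewords
        if comp:
            c = min(comp, key=metric)
            soft = (metric(c) - metric(d)) / 4 * di      # eq. (2)
        else:
            soft = r[i] + beta * di                      # eq. (3)
        w.append(soft - r[i])                            # extrinsic, eq. (4)
    return list(d), w

# Toy component code: length-3 repetition code (stand-in for SEC RS).
rep_decode = lambda t: [0, 0, 0] if sum(t) < 2 else [1, 1, 1]
d, w = chase_pyndiah([1.0, 0.8, -0.2], rep_decode, Lr=2)
# The unreliable third bit is corrected: d is the all-zero codeword.
```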

3.2. Iterative decoding of the product code

The block diagram of the turbo decoder at the kth half-iteration is shown in Figure 2. A half-iteration stands for a row or column decoding step, and one iteration comprises two half-iterations. The input of the SISO decoder at half-iteration k is given by

Rk = R + αk · Wk,  (5)

where αk is a scaling factor used to attenuate the influence of the extrinsic information during the first iterations, and Wk = (w_{i,j}) is the extrinsic information matrix delivered by the SISO decoder at the previous half-iteration. The decoder outputs an updated extrinsic information matrix Wk+1, and possibly a matrix Dk of hard decisions. Decoding stops when a given maximum number of iterations has been performed, or when an early-termination condition (stop criterion) is met.

The use of a stop criterion can improve the convergence of the iterative decoding process and also reduce the average power consumption of the decoder by decreasing the average number of iterations required to decode a block. An efficient stop criterion taking advantage of the structure of product codes was proposed in [19]. Another simple and effective solution is to stop when the hard decisions do not change between two successive half-iterations (i.e., when no further corrections are made).
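The half-iteration update (5), together with the decision-stability stop criterion, fits in a few lines. This is a structural sketch only: `siso_row`/`siso_col` stand for any SISO decoder returning an (extrinsic matrix, hard-decision matrix) pair, and the αk schedule is a placeholder.

```python
def turbo_decode(R, siso_row, siso_col, alphas, max_iter=8):
    """Iterative turbo decoding of a product code. R: channel LLR matrix."""
    W = [[0.0] * len(R[0]) for _ in R]       # extrinsic info, initially zero
    D_prev = None
    for k in range(2 * max_iter):            # two half-iterations per iteration
        a = alphas[min(k, len(alphas) - 1)]
        Rk = [[r + a * w for r, w in zip(rr, ww)]
              for rr, ww in zip(R, W)]       # eq. (5): Rk = R + alpha_k * Wk
        W, D = (siso_row if k % 2 == 0 else siso_col)(Rk)
        if D == D_prev:                      # stop: decisions unchanged
            break
        D_prev = D
    return D

# Placeholder SISO: zero extrinsic output, sign-based hard decisions.
sign_siso = lambda Rk: ([[0.0] * len(row) for row in Rk],
                        [[0 if x >= 0 else 1 for x in row] for row in Rk])
D = turbo_decode([[2.0, -1.0], [-0.5, 3.0]], sign_siso, sign_siso,
                 alphas=[0.5, 0.7, 1.0])
```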

4. RS PRODUCT CODE DESIGN FOR OPTICAL COMMUNICATIONS

Two optical communication scenarios have been identified as promising applications for third-generation FEC based on RS TPCs: 40 Gbps data transport over OTN, and 10 Gbps data transmission over PON. In this section, we first review


the expectations of each application with respect to FEC. Then, we discuss the algorithmic issues that have been encountered and solved in order to design RS TPCs compatible with these requirements.

4.1. FEC design for data transmission over OTN and PON

40 Gbps transport over OTN calls for both high coding gains and low overhead (<10%). High coding gains are required in order to ensure high data integrity, with BER in the range 10^-13–10^-15. A low overhead limits the optical transmission impairments caused by bandwidth extension. Note that these two requirements usually conflict with each other to some extent. The complexity and power consumption of the decoding circuit are also important issues. A possible solution, proposed in [6], is to multiplex in parallel four powerful FEC devices at 10 Gbps. However, 40 Gbps low-cost line cards are a key to the deployment of 40 Gbps systems. Furthermore, the cost of line cards is primarily dominated by the electronics and optics operating at the serial line rate. Thus, a single low-cost 40 Gbps FEC device could compete favorably with the former solution if the loss in coding gain (if any) remains small enough.

For data transmission over PON, channel codes with low cost and low latency (small block size) are preferred to long codes (>10 kbits) with high coding gain. BER requirements are less stringent than for OTN and are typically of the order of 10^-11. High coding gains result in an increased link budget [20]. On the other hand, decoding complexity should be kept to a minimum in order to reduce the cost of the optical network units (ONUs) deployed at the end-user side. Channel codes for PON are also expected to be robust against burst errors.

4.2. Choice of the component codes

On the basis of the above-mentioned requirements, we have chosen to focus on RS product codes with less than 20% overhead. Higher overheads lead to a larger signal bandwidth, thereby increasing in return the complexity of the electronic and optical components. Since the rate of the product code is the product of the individual rates of the component codes, RS component codes with code rate R ≥ 0.9 are necessary. Such code rates can be obtained by considering multiple-error-correcting RS codes over large Galois fields, that is, GF(256) and beyond. Another solution is to use single-error-correcting (SEC) RS codes over Galois fields of smaller order (32 or 64). The latter solution has been retained in this work since it leads to low-complexity SISO decoders.

First, it is shown in [21] that 16 error patterns are sufficient to obtain near-optimum performance with the Chase-Pyndiah algorithm for SEC RS codes. In contrast, more sophisticated SISO decoders are required with multiple-error-correcting RS codes (e.g., see [22] or [23]), since the number of error patterns necessary to obtain near-optimum performance with the Chase-Pyndiah algorithm grows exponentially with mt for a t-error-correcting RS code over GF(2^m).

In addition, SEC RS codes admit low-complexity algebraic decoders. This feature further contributes to reducing the complexity of the Chase-Pyndiah algorithm. For multiple-error-correcting RS codes, the Berlekamp-Massey algorithm and the Euclidean algorithm are the preferred algebraic decoding methods [15]. But they introduce unnecessary overhead computations for SEC codes. Instead, a simpler decoder is obtained from the direct decoding method devised by Peterson, Gorenstein, and Zierler (PGZ decoder) [24, 25]. First, the two syndromes S1 and S2 are calculated by evaluating the received polynomial r(x) at the two code roots α^b and α^(b+1):

Si = r(α^(b+i−1)) = Σ_{ℓ=0}^{N−1} rℓ · α^(ℓ(b+i−1)),  i = 1, 2.  (6)

If S1 = S2 = 0, r(x) is a valid codeword and decoding stops. If only one of the two syndromes is zero, a decoding failure is declared. Otherwise, the error locator X is calculated as

X = S2 / S1,  (7)

from which the error location i is obtained by taking the discrete logarithm of X. The error magnitude E is finally given by

E = S1 / X^b.  (8)

Hence, apart from the syndrome computation, at most two divisions over GF(2^m) are required to obtain the error position and value with the PGZ decoder (only one is needed when b = 0). The overall complexity of the PGZ decoder is usually dominated by the initial syndrome computation step. Fortunately, the syndromes need not be fully recomputed at each decoding attempt in the Chase-2 decoder. Rather, they can be updated in a very simple way by taking into account only the bits that are flipped between successive error patterns [26]. This optimization further alleviates the SISO decoding complexity.
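Equations (6)–(8) can be exercised directly. The sketch below implements the SEC PGZ decoder over GF(32); the primitive polynomial x^5 + x^2 + 1 and the integer representation of field elements are implementation choices made here for illustration, not mandated by the paper.

```python
# GF(32) log/antilog tables, primitive polynomial x^5 + x^2 + 1 (assumption;
# any primitive degree-5 polynomial works).
M, PRIM = 5, 0b100101
EXP, LOG = [0] * 62, [0] * 32
x = 1
for i in range(31):
    EXP[i] = EXP[i + 31] = x
    LOG[x] = i
    x <<= 1
    if x & 0b100000:
        x ^= PRIM

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gdiv(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 31]

def pgz_sec(r, b=1):
    """PGZ decoding of one SEC RS(31, 29) word r (31 integer symbols).
    Code roots alpha^b, alpha^(b+1); returns corrected word or None."""
    def syndrome(p):                       # S = r(alpha^p), eq. (6)
        s = 0
        for l, rl in enumerate(r):
            if rl:
                s ^= gmul(rl, EXP[(l * p) % 31])
        return s
    S1, S2 = syndrome(b), syndrome(b + 1)
    if S1 == 0 and S2 == 0:
        return r[:]                        # valid codeword
    if S1 == 0 or S2 == 0:
        return None                        # decoding failure
    X = gdiv(S2, S1)                       # error locator, eq. (7)
    i = LOG[X]                             # error position (discrete log)
    E = gdiv(S1, EXP[(LOG[X] * b) % 31])   # error magnitude S1 / X^b, eq. (8)
    out = r[:]
    out[i] ^= E                            # additive correction in GF(2^m)
    return out
```

Setting b = 0 gives the alternate codes discussed in Section 4.3, for which eq. (8) reduces to E = S1.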

On the basis of the above arguments, two RS product codes have been selected for the two envisioned applications. The (31, 29)^2 RS product code over GF(32) has been retained for PON systems, since it combines a moderate overhead of 12.5% with a moderate code length of 4805 coded bits. This is only twice the code length of the classical (255, 239) RS code over GF(256). On the other hand, the (63, 61)^2 RS product code over GF(64) has been preferred for OTN, since it has a smaller overhead (6.3%), similar to the one introduced by the standard (255, 239) RS code, and also a larger coding gain, as we will see later.

4.3. Performance analysis and code optimization

RS product codes built from SEC RS component codes are very attractive from the decoding-complexity point of view. On the other hand, they have a low minimum distance D = 3 × 3 = 9 at the symbol level. Therefore, it is of capital importance to verify that this low minimum distance


does not introduce error flares in the code performance curve that would penalize the effective coding gain at low BER. Monte Carlo simulations can be used to evaluate the code performance down to BER of 10^-10–10^-11 within a reasonable computation time. For lower BER, analytical bounding techniques are required.

In the following, binary on-off keying (OOK) intensity modulation with direct detection over an additive white Gaussian noise (AWGN) channel is assumed. This model was adopted here as a first approximation which simplifies the analysis and also facilitates the comparison with other channel codes. More sophisticated models of optical systems for the purpose of assessing the performance of channel codes are developed in [27, 28]. Under the previous assumptions, the BER of the RS product code at high SNRs and under ML soft-decision decoding is well approximated by the first term of the union bound:

BER ≈ (d / (mN1N2)) · (Bd / 2) · erfc( Q √(d/2) ),  (9)

where Q is the input Q-factor (see [29, Chapter 5]), d is the minimum distance of the binary image Pb of the product code, and Bd the corresponding multiplicity (the number of codewords with minimum Hamming weight d in Pb). This expression shows that the asymptotic performance of the product code is determined by the bit-level minimum distance d of the product code, not by the symbol minimum distance D1D2.
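As a numerical illustration of (9), the bound is one line of Python; the conversion Q[dB] = 20·log10(Q) follows the usual optical-communications convention, and the example parameters are the narrow-sense (31, 29)^2 values from Table 1.

```python
import math

def union_bound_ber(q_db, d, Bd, m, N1, N2):
    """First-term union bound (9) on the bit error rate of the product code."""
    Q = 10 ** (q_db / 20)                       # Q-factor: dB -> linear
    return d * Bd / (2 * m * N1 * N2) * math.erfc(Q * math.sqrt(d / 2))

# Narrow-sense (31, 29)^2 code: d = 9, Bd = 217186, m = 5, N1 = N2 = 31
ber_9dB = union_bound_ber(9.0, 9, 217186, 5, 31, 31)
```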

The knowledge of the quantities d and Bd is required in order to predict the asymptotic performance of the code in the high Q-factor (low-BER) region using (9). These parameters depend in turn on the basis B used to represent the 2^m-ary symbols as bits, and are usually unknown. Computing the exact binary weight enumerator of RS product codes is indeed a very difficult problem. Even the symbol weight enumerator is hard to find, since it is not completely determined by the symbol weight enumerators of the component codes [30]. An average binary weight enumerator for RS product codes was recently derived in [31]. This enumerator is simple to calculate. However, simulations are still required to assess the tightness of the bounds for a particular code realization. A computational method that allows the determination of d and Bd under certain conditions was recently suggested in [32]. This method exploits the fact that product codewords with minimum symbol weight D1D2 are readily constructed as the direct product of a minimum-weight row codeword with a minimum-weight column codeword. Specifically, there are exactly

A_{D1D2} = (2^m − 1) · C(N1, D1) · C(N2, D2)  (10)

distinct codewords with symbol weight D1D2 in the product code C1 ⊗ C2, where C(N, D) denotes the binomial coefficient. They can be enumerated with the help of a computer, provided the number A_{D1D2} of such codewords is not too large. Estimates d̂ and B̂d are then obtained by computing the Hamming weight of the binary expansion

Table 1: Minimum distance d and multiplicity Bd for the binary image of the (31, 29)^2 and (63, 61)^2 RS product codes as a function of the first code root α^b.

Product code   | mK^2  | mN^2  | R     | b | d  | Bd
(31, 29, 3)^2  | 4205  | 4805  | 0.875 | 1 | 9  | 217,186
               |       |       |       | 0 | 14 | 6,465,608
(63, 61, 3)^2  | 22326 | 23814 | 0.937 | 1 | 9  | 4,207,140
               |       |       |       | 0 | 14 | 88,611,894

of those codewords. Necessarily, d ≤ d̂. If it can be shown that product codewords of symbol weight > D1D2 necessarily have binary weight > d̂ at the bit level (this is not always the case, depending on the value of d̂), then it follows that d = d̂ and Bd = B̂d.

This method has been used to obtain the binary minimum distance and multiplicity of the (31, 29)^2 and (63, 61)^2 RS product codes using narrow-sense component codes with generator polynomial g(x) = (x − α)(x − α^2). This is the classical definition of SEC RS codes that can be found in most textbooks. The results are given in Table 1. We observe that, in both cases, we are in the most unfavorable situation where the bit-level minimum distance d is equal to the symbol-level minimum distance D, and no greater. Simulation results for the two RS TPCs after 8 decoding iterations are shown in Figures 3 and 4, respectively. The corresponding asymptotic performance calculated using (9) is plotted in dashed lines. For comparison purposes, we have also included the performance of algebraic decoding of RS codes of similar code rate over GF(256). We observe that the low minimum distance introduces error flares at BER of 10^-8 and 10^-9 for the (31, 29)^2 and (63, 61)^2 product codes, respectively. Clearly, the two RS TPCs do not match the BER requirements imposed by the envisioned applications.
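The enumeration behind Table 1 starts from the count in (10), which is easy to check by computer (a small sketch; `math.comb` is the binomial coefficient):

```python
from math import comb

def min_weight_count(m, N1, D1, N2, D2):
    """Number of product codewords of minimum symbol weight D1*D2, eq. (10)."""
    return (2 ** m - 1) * comb(N1, D1) * comb(N2, D2)

# (31, 29, 3)^2 over GF(32): 31 * C(31, 3)^2 minimum-weight codewords
count_31 = min_weight_count(5, 31, 3, 31, 3)   # 626,355,775
```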

One solution to increase the minimum distance of the product code is to resort to code extension or expurgation. However, this approach increases the overhead. It also increases decoding complexity, since a higher number of error patterns is then required to maintain near-optimum performance with the Chase-Pyndiah algorithm [21]. In this work, another approach has been considered. Specifically, investigations have been conducted in order to identify code constructions that can be mapped into binary images with minimum distance larger than 9. One solution is to investigate different bases B. How to find a basis that maps a nonbinary code into a binary code with bit-level minimum distance strictly larger than the symbol-level designed distance remains a challenging research problem. Thus, the problem was relaxed by fixing the basis to be the polynomial basis, and studying instead the influence of the choice of the code roots on the minimum distance of the binary image. Any SEC RS code over GF(2^m) can be compactly described by its generator polynomial

g(x) = (x − α^b)(x − α^(b+1)),  (11)


Figure 3: BER performance of the (31, 29)^2 RS product code as a function of the first code root α^b, after 8 iterations. (Bit error rate versus Q-factor in dB; curves: uncoded OOK, RS(255, 223), RS(31, 29)^2 with b = 1, RS(31, 29)^2 with b = 0, and eBCH(128, 120)^2.)

where b is an integer in the range 0, ..., 2^m − 2. Narrow-sense RS codes are obtained by setting b = 1 (which is the usual choice for most applications). Note, however, that different values of b generate different sets of codewords, and thus different RS codes with possibly different binary weight distributions. In [32], it is shown that alternate SEC RS codes obtained by setting b = 0 have minimum distance d = D + 1 = 4 at the bit level. This is a notable improvement over classical narrow-sense (b = 1) RS codes, for which d = D = 3. This result suggests that RS product codes should preferably be built from two RS component codes with first root α^0. RS product codes constructed in this way will be called alternate RS product codes in the following.

We have computed the binary minimum distance d̂ and multiplicity B̂d of the (31, 29)^2 and (63, 61)^2 alternate RS product codes. The values are reported in Table 1. Interestingly, the alternate product codes have a minimum distance d as high as 14 at the bit level, at the expense of an increase in the error coefficient Bd. Thus, we get most of the gain offered by extended or expurgated codes (for which d = 16, as verified by computer search) but without reducing the code rate. It is also worth noting that this extra coding gain is obtained without increasing decoding complexity: the same SISO decoder is used for both narrow-sense and alternate SEC RS codes. In fact, the only modifications occur in (6)–(8) of the PGZ decoder, which actually simplify when b = 0. Simulated performance and asymptotic bounds for the alternate RS product codes are shown in Figures 3 and 4. A notable improvement is observed in comparison with the performance of the narrow-sense product codes, since the error flare is pushed down by several decades in both cases. By extrapolating the simulation results, the net coding gain (as defined in [5]) at a BER of 10^-13 is estimated to be

Figure 4: BER performance of the (63, 61)^2 RS product code as a function of the first code root α^b, after 8 decoding iterations. (Bit error rate versus Q-factor in dB; curves: uncoded OOK, RS(255, 239), RS(63, 61)^2 with b = 1, RS(63, 61)^2 with b = 0, and eBCH(256, 247)^2.)

around 8.7 dB and 8.9 dB for the RS(31, 29)^2 and RS(63, 61)^2, respectively. As a result, the two selected RS product codes are now fully compatible with the performance requirements imposed by the respective envisioned applications. More importantly, this achievement has been obtained at no cost.

4.4. Comparison with BCH product codes

A comparison with BCH product codes is in order, since BCH product codes have already found application in optical communications. A major limitation of BCH product codes is that very large block lengths (>60000 coded bits) are required to achieve high code rates (R > 0.9). On the other hand, RS product codes can achieve the same code rate as BCH product codes, but with a block size about 3 times smaller [21]. This is an interesting advantage since, as shown later in the paper, large block lengths increase the decoding latency and also the memory complexity of the decoder architecture. RS product codes are also expected to be more robust to error bursts than BCH product codes. Both coding schemes inherit burst-correction properties from the row-column interleaving in the direct product construction. But RS product codes also benefit from the fact that, in the most favorable case, m consecutive erroneous bits may cause only a single symbol error in the received word.
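The block-size argument can be checked with a few lines of arithmetic, using only the code parameters quoted in the text (figures are illustrative):

```python
# Code rate and binary block length of a product code built from
# two identical (N, K) component codes over GF(2^m).
def product_code(n, k, m):
    rate = (k / n) ** 2
    bits = n * n * m          # total coded bits in one product-code block
    return rate, bits

# eBCH(256, 247)^2 is binary, so m = 1 (extension parity ignored here).
r_bch, n_bch = product_code(256, 247, 1)
# RS(63, 61)^2 over GF(64): 6 bits per symbol.
r_rs, n_rs = product_code(63, 61, 6)

print(f"eBCH: R = {r_bch:.3f}, {n_bch} bits")   # R ~ 0.931, 65536 bits
print(f"RS:   R = {r_rs:.3f}, {n_rs} bits")     # R ~ 0.938, 23814 bits
print(f"size ratio = {n_bch / n_rs:.2f}")       # ~2.75, i.e., about 3x smaller
```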

A performance comparison has been carried out between the two selected RS product codes and extended BCH (eBCH) product codes of similar code rate: the eBCH(128, 120)² and the eBCH(256, 247)². Code extension has been used for BCH codes since it increases the minimum distance without increasing decoding complexity or significantly decreasing the code rate, in contrast to RS codes. Both eBCH TPCs have minimum distance 16 with


Raphael Le Bidan et al. 7

[Figure: bit error rate versus Q-factor (dB). Curves: uncoded OOK, OOK + RS(255, 239), OOK + RS(63, 61)² unquantized, OOK + RS(63, 61)² 3-bit, and OOK + RS(63, 61)² 4-bit.]

Figure 5: BER performance for the (63, 61)² RS product code as a function of the number of quantization bits for the soft input (sign bit included).

multiplicities 853442 and 6908802, respectively. Simulation results after 8 iterations are shown in Figures 3 and 4. The corresponding asymptotic bounds are plotted in dashed lines. We observe that eBCH TPCs converge at lower Q-factors. As a result, a 0.3-dB gain is obtained at BERs in the range 10⁻⁸–10⁻¹⁰. However, the large multiplicities of eBCH TPCs introduce a change of slope in the performance curves at lower BER. In fact, examination of the asymptotic bounds shows that alternate RS TPCs are expected to perform at least as well as eBCH TPCs in the BER range of interest for optical communication, for example, 10⁻¹⁰–10⁻¹⁵. Therefore, we conclude that RS TPCs compare favorably with eBCH TPCs in terms of performance. We will see in the next sections that RS TPCs have additional advantages in terms of decoding complexity and throughput for the target applications.

4.5. Soft-input quantization

The previous performance study assumed unquantized soft values. In a practical receiver, a finite number q of bits (sign bit included) is used to represent soft information. Soft-input quantization is performed by an analog-to-digital converter (ADC) in the receiver front-end. The very high bit rate in fiber optical systems makes ADC a challenging issue. It is therefore necessary to study the impact of soft-input quantization on performance. Figure 5 presents simulation results for the (63, 61)² alternate RS product code using q = 3 and q = 4 quantization bits, respectively. For comparison purposes, the performance without quantization is also shown. Using q = 4 bits yields virtually no degradation with respect to ideal (infinite-precision) quantization, whereas q = 3 bits introduces a 0.5 dB penalty.

Similar conclusions have been obtained with the (31, 29)² RS product code and also with various eBCH TPCs, as reported in [27, 33], for example.
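A uniform q-bit quantizer of the kind implied here (sign bit included, symmetric around zero) can be sketched as follows. This is a hypothetical illustration, not the authors' ADC model; the clipping level `clip` is a free design parameter not specified in the text.

```python
def quantize(llr, q, clip):
    """Map a real soft value to a signed integer on q bits (sign included).

    The output lies in [-(2^(q-1) - 1), +(2^(q-1) - 1)] after symmetric
    saturation, i.e., 7 magnitude levels for q = 4 and 3 levels for q = 3.
    """
    levels = 2 ** (q - 1) - 1             # largest representable magnitude
    step = clip / levels                  # uniform step size up to the clip level
    v = round(llr / step)
    return max(-levels, min(levels, v))   # saturate out-of-range inputs

# 4-bit quantization keeps a fine grid and caused virtually no degradation
# in the reported simulations; 3 bits is noticeably coarser (0.5 dB penalty).
print(quantize(0.8, 4, clip=2.0))   # a mid-range level
print(quantize(5.0, 3, clip=2.0))   # saturates at the top level
```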

5. FULL-PARALLEL TURBO DECODING ARCHITECTURE DEDICATED TO PRODUCT CODES

Designing turbo decoding architectures compatible with the very high line rates imposed by fiber-optic systems at reasonable cost is a challenging issue. Parallel decoding architectures are the only solution to achieve data rates above 10 Gbps. A simple architectural solution is to duplicate the elementary decoders in order to achieve the given throughput. However, this solution results in a turbo decoder with unacceptable cumulative area. Thus, smarter parallel decoding architectures have to be designed in order to better trade off performance and complexity under the constraint of a high throughput. In the following, we focus on an (N², K²) product code obtained from two identical (N, K) component codes over GF(2^m). For 2^m-ary RS codes, m > 1, whereas m = 1 for binary BCH codes.

5.1. Previous work

Many turbo decoder architectures for product codes have been proposed in the literature. The classical approach involves decoding all the rows or all the columns of a matrix before the next half-iteration. When an application requires high-speed decoders, an architectural solution is to cascade SISO elementary decoders for each half-iteration. In this case, memory blocks are necessary between each half-iteration to store channel data and extrinsic information. Each memory block is composed of four memories of mN² soft values. Thus, duplicating a SISO elementary decoder results in duplicating the memory block, which is very costly in terms of silicon area. In 2002, a new architecture for turbo decoding product codes was proposed [10]. The idea is to store several data at the same address and to perform semiparallel decoding to increase the data rate. However, it is necessary to process these data by row and by column. Let us consider l adjacent rows and l adjacent columns of the initial matrix. The l² data constitute a word of the new matrix, which has l² times fewer addresses. This data organization does not require any particular memory architecture. The results obtained show that the turbo decoding throughput is increased by l² when l elementary decoders processing l data simultaneously are used. Turbo decoding latency is divided by l. The area of the l elementary decoders is increased by l/2 while the memory is kept constant.
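The data organization of [10] can be illustrated by packing each l × l tile of the N × N matrix into a single memory word. This is a hypothetical sketch of the addressing idea only, not the authors' implementation:

```python
def pack(matrix, l):
    """Store each l x l tile of an N x N matrix at one address.

    The packed memory has (N/l)^2 addresses, each holding l^2 soft values,
    so l row (or column) decoders can each fetch l values in one access.
    """
    n = len(matrix)
    assert n % l == 0
    mem = {}
    for bi in range(n // l):
        for bj in range(n // l):
            mem[(bi, bj)] = [matrix[bi * l + u][bj * l + v]
                             for u in range(l) for v in range(l)]
    return mem

n, l = 8, 2
mem = pack([[i * n + j for j in range(n)] for i in range(n)], l)
assert len(mem) == (n // l) ** 2            # l^2 = 4 times fewer addresses
assert all(len(w) == l * l for w in mem.values())
```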

5.2. Full-parallel decoding principle

All rows (or all columns) of a matrix can be decoded in parallel. If the architecture is composed of 2N elementary decoders, an appropriate treatment of the matrix allows the elimination of the matrix reconstruction step between each half-iteration. Specifically, let i and j be the indices of a row and a column of the N × N matrix. In full-parallel processing, the row decoder i begins the

Page 81: downloads.hindawi.comdownloads.hindawi.com › journals › specialissues › 204575.pdfEditor-in-Chief Luc Vandendorpe, UCL, Belgium AssociateEditors Thushara D. Abhayapala, Australia

8 EURASIP Journal on Wireless Communications and Networking

[Figure: N × N matrix of soft values; row decoders step through index (i + 1) mod N while column decoders step through index (j − 1) mod N.]

Figure 6: Full-parallel decoding of a product code matrix.

decoding with the soft value in the ith position. Moreover, each row decoder processes the soft values by increasing the index by one modulo N. Similarly, the column decoder j begins the decoding with the soft value in the jth position. In addition, each column decoder processes the soft values by decreasing the index by one modulo N. In fact, full-parallel decoding of turbo product codes is possible thanks to the cyclic property of BCH and RS codes. Indeed, every cyclic shift c′ = (c_{N−1}, c_0, ..., c_{N−3}, c_{N−2}) of a codeword c = (c_0, c_1, ..., c_{N−2}, c_{N−1}) is also a valid codeword in a cyclic code. Therefore, only one clock period is necessary between two successive matrix decoding operations. The full-parallel decoding of an N × N product code matrix is described in Figure 6. A similar strategy was previously presented in [34], where memory access conflicts are resolved by means of an appropriate treatment of the matrix.
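The conflict-free nature of this schedule is easy to verify in simulation: at clock period t, row decoder i reads position (i + t) mod N of its row, and column decoder j reads position (j − t) mod N of its column, so each per-column and per-row memory bank serves exactly one decoder per cycle. A small illustrative check (not the authors' HDL):

```python
N = 31  # e.g., the (31, 29) RS component code length

for t in range(N):  # one half-iteration lasts N clock periods
    # Cells read by the N row decoders: decoder i reads column (i + t) mod N.
    row_reads = [(i, (i + t) % N) for i in range(N)]
    # Cells read by the N column decoders: decoder j reads row (j - t) mod N.
    col_reads = [((j - t) % N, j) for j in range(N)]
    # No two row decoders hit the same column bank, and no two column
    # decoders hit the same row bank: the accesses are conflict-free.
    assert len({c for _, c in row_reads}) == N
    assert len({r for r, _ in col_reads}) == N

print("no memory conflicts over", N, "clock periods")
```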

The elementary decoder latency depends on the structure of the decoder (i.e., the number of pipeline stages) and the code length N. Here, since the matrix reconstruction step is removed, the latency between row and column decoding is zero.

5.3. Full-parallel architecture for product codes

The major advantage of our full-parallel architecture is that it enables the memory block of 4mN² soft values between each half-iteration to be removed. However, the codeword soft values exchanged between the row and column decoders have to be routed. One solution is to use a connection network for this task. In our case, we have chosen an Omega network. The Omega network is one of several connection networks used in parallel machines [35]. It is composed of log₂N stages, each having N/2 exchange elements. The Omega network complexity in terms of number of connections and of 2 × 2 switch transfer blocks is N log₂N and (N/2) log₂N, respectively. For example, the equivalent gate complexity of a 31 × 31 network can be estimated at 200 logic gates per exchanged bit. Figure 7 depicts a full-parallel architecture for the turbo decoding of product codes. It is composed of cascaded modules, each dedicated to one iteration, although it is possible to process several iterations with the same module. In our approach, 2N elementary decoders and 2 connection blocks are necessary for one module. A connection block is composed of 2 Omega networks exchanging the R and R_k soft values. Since the Omega network has low complexity, the full-parallel turbo decoder complexity essentially depends on the complexity of the elementary decoder.
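For a network with n = 2^k inputs, the counts quoted above follow directly: log₂n stages of n/2 two-by-two exchange elements, with n links per stage. A quick count for n = 32 (the power of two closest to the 31 soft values of a row; a sketch, not a routing implementation):

```python
import math

def omega_size(n):
    """Stage, switch, and link counts of an n-input Omega network (n a power of 2)."""
    stages = int(math.log2(n))
    switches = (n // 2) * stages   # 2x2 exchange elements: (n/2) * log2(n)
    links = n * stages             # connections: n * log2(n)
    return stages, switches, links

stages, switches, links = omega_size(32)
print(stages, switches, links)     # 5 stages, 80 switches, 160 links
```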

5.4. Elementary SISO decoder architecture

The block diagram of an elementary SISO decoder is shown in Figure 2, where k stands for the current half-iteration number. R_k is the soft-input matrix computed from the previous half-iteration, whereas R denotes the initial matrix delivered by the receiver front-end (R_k = R for the first half-iteration). W_k is the extrinsic information matrix. α_k is a scaling factor that depends on the current half-iteration and is used to mitigate the influence of the extrinsic information during the first iterations. The decoder architecture is structured in three pipelined stages identified as reception, processing, and transmission units [36]. During each stage, the N soft values of the received word R_k are processed sequentially in N clock periods. The reception stage computes the initial syndromes S_i and finds the L_r least reliable bits in the received word. The main function of the processing stage is to build and then to correct the N_ep error patterns obtained from the initial syndrome and to combine the least reliable bits. Moreover, the processing stage also has to produce a metric (the Euclidean distance between error pattern and received word) for each error pattern. Finally, a selection function identifies the maximum-likelihood codeword d and the competing codewords c (if any). The transmission stage performs different functions: computing the reliability of each binary soft value, computing the extrinsic information, and correcting the received soft values. The N soft values of the codeword are thus corrected sequentially. The decoding process needs to access the R and R_k soft values during the three decoding phases. For this reason, these words are implemented in six random access memories (RAMs) of size q × m × N controlled by a finite-state machine. In summary, a full-parallel TPC decoder architecture requires low-complexity decoders.

6. COMPLEXITY AND THROUGHPUT ANALYSIS OF THE FULL-PARALLEL REED-SOLOMON TURBO DECODERS

Increasing the throughput regardless of the turbo decoder complexity is not relevant. In order to compare the throughput and complexity of RS and BCH turbo decoders, we propose to measure the efficiency η of a parallel architecture by the ratio

η = T/C, (12)

where T is the throughput and C is the complexity of the design. An efficient architecture is expected to have a high η ratio, that is, a high throughput with low hardware complexity. In this section, we determine and compare the efficiency of TPC decoders based on SEC BCH and RS component codes, respectively.



[Figure: cascaded decoding modules; each module performs one iteration and contains N elementary row decoders, N elementary column decoders, and connection blocks between them.]

Figure 7: Full-parallel architecture for decoding of product codes.

6.1. Turbo decoder complexity analysis

A turbo decoder for a product code corresponds to the cumulative area of computation resources, memory resources, and communication resources. In a full-parallel turbo decoder, the main part of the complexity consists of memory and computation resources. Indeed, the major advantage of our full-parallel architecture is that it enables the memory blocks between each half-iteration to be replaced by Omega connection networks. Communication resources thus represent less than 1% of the total area of the turbo decoder. Consequently, the following study focuses only on memory and computation resources.

6.1.1. Complexity analysis of computation resources

The computation resources of an elementary decoder are split into three pipelined stages. The reception and transmission stages have O(log(N)) complexity. For these two stages, replacing a BCH code by an RS code of the same code length N (at the symbol level) over GF(2^m) results in an increase of both complexity and throughput by a factor m. As a result, efficiency is constant in these parts of the decoder. However, the hardware complexity of the processing stage increases linearly with the number N_ep of error patterns. Consequently, the increase in the local parallelism rate has no influence on the area of this stage and thus increases the efficiency of an RS SISO decoder. In order to verify these general considerations, turbo decoders for the (15, 13)², (31, 29)², and (63, 61)² RS product codes were described in HDL language and synthesized. Logic syntheses were performed using the Synopsys Design Compiler tool with an STMicroelectronics 90 nm CMOS process. All designs were clocked at 100 MHz. The complexity of BCH turbo decoders was estimated thanks to a generic complexity model which can deliver an estimate of the gate count for any code size and any set of decoding parameters. Therefore, taking into account the implementation and performance constraints, this model can be used to select a code size N and a set of decoding parameters [37]. In particular, the number of error patterns N_ep and the number of competing codewords kept for soft-output computation directly affect both the hardware complexity and the decoding performance. Increasing these parameter values improves performance but also increases complexity.

Table 2: Computation resource complexity of selected TPC decoders in terms of gate count.

Code              Rate   Elementary decoder   Full-parallel module
(32, 26)² BCH     0.66   2,791                178,624
(64, 57)² BCH     0.79   3,139                401,792
(128, 120)² BCH   0.88   3,487                892,672
(15, 13)² RS      0.75   3,305                99,150
(31, 29)² RS      0.88   4,310                267,220
(63, 61)² RS      0.94   6,000                756,000

Table 2 summarizes the computation resource complexity in terms of gate count for different BCH and RS product codes. Firstly, the complexity of an elementary decoder for each product code is given. The results clearly show that RS elementary decoders are more complex than BCH elementary decoders over the same Galois field. Complexity results for a full-parallel module of the turbo decoding process are also given in Table 2. As described in Figure 7, a full-parallel module is composed of 2N elementary decoders and 2 connection blocks for one iteration. In this case, full-parallel modules composed of RS elementary decoders are seen to be less complex than full-parallel modules composed of BCH elementary decoders when comparing eBCH and RS product codes of similar code rate R. For instance, for a code rate R = 0.88, the computation resource complexities in terms of gate count are about 892,672 and 267,220 for the BCH(128, 120)² and RS(31, 29)², respectively. This is due to the fact that RS codes need a smaller code length N (at the symbol level) to achieve a given code rate, in contrast to binary BCH codes. Considering again the previous example, only 31 × 2 decoders are necessary in the RS case for full-parallel decoding, compared to 128 × 2 decoders in the BCH case. Similarly,



[Figure: computation logic gate count (Mgates) versus degree of parallelism, for BCH and RS block turbo decoders.]

Figure 8: Comparison of computation resource complexity.

Figure 8 gives the computation resource area of BCH and RS turbo decoders for 1 iteration and different parallelism degrees. We verify that a higher P (i.e., higher throughput) can be obtained with fewer computation resources using RS turbo decoders. This means that RS product codes are more efficient in terms of computation resources for full-parallel architectures dedicated to turbo decoding.

6.1.2. Complexity analysis of memory resources

A half-iteration of a parallel turbo decoder contains N banks of q × m × N bits. The internal memory complexity of a parallel decoder for one half-iteration can be approximated by

SRAM ≈ γ × q × m × N², (13)

where γ is a technological parameter specifying the number of equivalent gate counts per memory bit, q is the number of quantization bits for the soft values, and m is the number of bits per Galois field element. Using (17), it can also be expressed as

SRAM = γ × (P²/m) × q, (14)

where P is the parallelism degree, defined as the number of generated bits per clock period (t₀).

Let us consider a BCH code and an RS code of similar code length N = 2^m − 1. For BCH codes, a symbol corresponds to 1 bit, whereas it is made of m bits for RS codes. Calculating the SISO memory area for both BCH and RS gives the following ratio:

SRAM(BCH)/SRAM(RS) = m = log₂(N + 1). (15)

This result shows that RS turbo decoders have lower memory complexity for a given parallelism rate. This is confirmed by the memory area estimates shown in Figure 9. The random access memory (RAM) areas of BCH and RS turbo decoders for a half-iteration and different parallelism degrees
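Plugging numbers into (14) and (15) confirms the factor-of-m saving for codes of equal parallelism degree; γ is left symbolic (set to 1) and q = 5 bits, as used later in the prototype:

```python
def sram_gates(gamma, q, m, p):
    # Eq. (14): internal memory for one half-iteration, with P = N * m.
    return gamma * (p ** 2 / m) * q

gamma, q, p = 1.0, 5, 155                 # P = 31 * 5 for the RS(31, 29) code
rs = sram_gates(gamma, q, m=5, p=p)       # RS over GF(32): m = 5 bits/symbol
bch = sram_gates(gamma, q, m=1, p=p)      # binary BCH: 1 bit per symbol
print(bch / rs)   # ratio m = 5 = log2(N + 1) for N = 31, as in Eq. (15)
```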

[Figure: RAM gate count (Mgates) versus degree of parallelism, for BCH and RS block turbo decoders.]

Figure 9: Comparison of internal RAM complexity.

are plotted using a memory area estimation model provided by STMicroelectronics. We can observe that a higher P (i.e., higher throughput) can be obtained with less memory when using an RS turbo decoder. Thus, full-parallel decoding of RS codes is more memory-efficient than BCH turbo decoding.

6.2. Turbo decoder throughput analysis

In order to maximize the data rate, decoding resources are assigned to each decoding iteration. The throughput of a turbo decoder can be defined as

T = P × R × f₀, (16)

where R is the code rate and f₀ = 1/t₀ is the maximum frequency of an elementary SISO decoder. Ultrahigh throughput can be reached by increasing these three parameters.

(i) R is a parameter that exclusively depends on the code considered. Thus, using codes with a higher code rate (e.g., RS codes) provides larger throughput.

(ii) In a full-parallel architecture, maximum throughput is obtained by duplicating N elementary decoders, each generating m soft values per clock period. The parallelism degree can be expressed as

P = N × m. (17)

Therefore, an enhanced parallelism degree can be obtained by using nonbinary codes (e.g., RS codes) with a larger code length N.

(iii) Finally, in a high-speed architecture, each elementary decoder has to be optimized in terms of working frequency f₀. This is accomplished by including pipeline stages within each elementary SISO decoder. RS and BCH turbo decoders of equivalent code size have equivalent working frequency f₀ since RS decoding is performed by introducing some local parallelism at the soft value level. This result was verified during logic synthesis. The main drawback of pipelining elementary decoders is the extra complexity generated by the internal memory requirement.
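Combining (16) and (17) reproduces the throughput column of Table 3 for the two selected RS codes (f₀ = 100 MHz; the small discrepancies with the table come from rounding R to two decimals):

```python
f0 = 100e6  # Hz, the synthesis clock used throughout

def throughput(n, k, m):
    p = n * m                 # Eq. (17): parallelism degree in bits per clock
    r = (k / n) ** 2          # product-code rate
    return p * r * f0         # Eq. (16), in bits per second

print(throughput(31, 29, 5) / 1e9)   # ~13.6 Gbps for RS(31, 29)^2
print(throughput(63, 61, 6) / 1e9)   # ~35.4 Gbps for RS(63, 61)^2
```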



Table 3: Hardware efficiency of selected TPC decoders.

Code              R      P     T (Gbps)   C (kgates)   η (kbps/gate)
(32, 26)² BCH     0.66   32    2.11       201          10.5
(64, 57)² BCH     0.79   64    5.06       508          9.97
(128, 120)² BCH   0.88   128   11.26      1361         8.27
(15, 13)² RS      0.75   60    4.5        128          35.0
(31, 29)² RS      0.88   155   13.64      396          34.4
(63, 61)² RS      0.94   378   35.5       1312         27

Since RS codes have higher P and R for equivalent f₀, an RS turbo decoder can reach a higher data rate than an equivalent BCH turbo decoder. However, the increase in throughput cannot be considered independently of the turbo decoder complexity.

6.3. Turbo product code comparison: throughput versus complexity

The efficiency η between the decoder throughput and the decoder complexity can be used to compare eBCH and RS turbo product codes. We have reported in Table 3 the code rate R, the parallelism degree P, the throughput T (Gbps), the complexity C (kgates), and the efficiency η (kbps/gate) for each code. All designs have been clocked at f₀ = 100 MHz for the computation of the throughput T. An average ratio of 3.5 between RS and BCH decoder efficiency is observed.
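The efficiency figures can be recomputed directly from the T and C columns of Table 3 (T in Gbps, C in kgates, η = T/C in kbps/gate); recomputing from the rounded table values gives an average ratio close to the quoted 3.5:

```python
# (code, T in Gbps, C in kgates), copied from Table 3.
bch = [("(32,26)^2 BCH", 2.11, 201), ("(64,57)^2 BCH", 5.06, 508),
       ("(128,120)^2 BCH", 11.26, 1361)]
rs = [("(15,13)^2 RS", 4.5, 128), ("(31,29)^2 RS", 13.64, 396),
      ("(63,61)^2 RS", 35.5, 1312)]

def eta(t_gbps, c_kgate):
    return t_gbps * 1e6 / (c_kgate * 1e3)   # kbps per gate

for name, t, c in bch + rs:
    print(f"{name}: eta = {eta(t, c):.1f} kbps/gate")

avg_bch = sum(eta(t, c) for _, t, c in bch) / len(bch)
avg_rs = sum(eta(t, c) for _, t, c in rs) / len(rs)
print(f"average RS/BCH efficiency ratio = {avg_rs / avg_bch:.1f}")  # ~3.4
```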

The good compromise between performance, throughput, and complexity clearly makes RS product codes good candidates for next-generation PON and OTN. In particular, the (31, 29)² RS product code is compatible with the 10 Gbps line rate envisioned for PON evolutions. Similarly, the (63, 61)² RS product code can be used for data transport over OTN at 40 Gbps, provided the turbo decoder is clocked at a frequency slightly higher than 100 MHz.

7. IMPLEMENTATION OF AN RS TURBO DECODER FOR ULTRA-HIGH-THROUGHPUT COMMUNICATION

An experimental setup based on FPGA devices has been designed in order to show that RS TPCs can effectively be used in the physical layer of 10 Gbps optical access networks. Based on the previous analysis, the (31, 29)² RS TPC was selected since it offers the best compromise between performance and complexity for this kind of application.

7.1. 10 Gbps experimental setup

The experimental setup is composed of a board that includes 6 Xilinx Virtex-5 LX330 FPGAs [38]. A Xilinx Virtex-5 LX330 FPGA contains 51,840 slices that can emulate up to 12 million gates of logic. It should be noted that Virtex-5 slices are organized differently from previous generations. Each Virtex-5 slice contains four look-up tables (LUTs) and four flip-flops, instead of two LUTs and two flip-flops in previous-generation devices. The board is hosted on a 64-bit, 66 MHz PCI bus that enables communication at full PCI bandwidth with a computer. An FPGA-embedded memory block containing 10 encoded and noisy product code matrices is used to generate input data for the turbo decoder. This memory block exchanges data with a computer through the PCI bus.

One decoding iteration was implemented on each FPGA, resulting in a 6 full-iteration turbo decoder as shown in Figure 10. Each decoding module corresponds to a full-parallel architecture dedicated to the decoding of a matrix of 31 × 31 coded soft values. We recall here that a coded soft value over GF(32) is mapped onto 5 LLR values, each LLR being quantized on 5 bits. Besides, the decoding process needs to access the 31 coded soft values from each of the matrices R and R_k during the three decoding phases of a half-iteration, as explained in Section 4. For these reasons, 31 × 5 × 5 × 2 = 1,550 bits have to be exchanged between the decoding modules during each clock period (f₀ = 65 MHz). The board offers 200 chip-to-chip LVDS signals for each FPGA-to-FPGA interconnect. Unfortunately, this number of LVDS signals is insufficient to enable the transmission of all the bits between the decoding modules. To overcome this implementation constraint, we have chosen to add SERializer/DESerializer (SERDES) modules for the parallel-to-serial and serial-to-parallel conversions in each FPGA. Indeed, a SERDES is a pair of functional blocks commonly used in high-speed communications to convert data between parallel and serial interfaces in each direction. The SERDES modules are clocked at f₁ = 2 × f₀ = 130 MHz and operate at 8:1 serialization or 1:8 deserialization. In this way, all data can be exchanged between the different decoding modules. Finally, the total occupation rate of the FPGA that contains the most complex design (decoding module + two SERDES modules + memory block + PCI protocol module) is slightly higher than 66%. This corresponds to 34,215 Virtex-5 slices. Note that the decoding module represents only 37% of the total design complexity. More details are given in the next section.
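The inter-FPGA bit budget can be checked as follows; the lane count and the 8:1 serialization factor come from the text, while the exact lane timing is a board-level detail not specified here:

```python
# Soft values exchanged between two decoding modules per clock period f0.
n, llr_per_symbol, bits_per_llr = 31, 5, 5
matrices = 2                      # both R and Rk must be forwarded
bits_per_clock = n * llr_per_symbol * bits_per_llr * matrices
print(bits_per_clock)             # 1550 bits every f0 = 65 MHz period

lvds_lanes = 200                  # chip-to-chip LVDS pairs per interconnect
assert bits_per_clock > lvds_lanes            # parallel wiring alone is insufficient

serdes_factor = 8                 # 8:1 serialization on each lane
assert lvds_lanes * serdes_factor >= bits_per_clock   # 1600 >= 1550: it fits
```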

Currently, a new design phase of the experimental setup is in progress. The objective is to include a channel emulator and BER measurement facilities in order to verify the decoding performance of the turbo decoder by plotting BER curves, as in our previous experimental setup [37].

7.2. Characteristics and performance of the implemented decoding module

A decoding module for one iteration is composed of 31 × 2 = 62 elementary decoders and 2 connection blocks. Each elementary decoder uses information quantized on 5 bits, with N_ep = 8 error patterns and only 1 competing codeword. These reduced parameter values decrease the required area for a performance degradation that remains below 0.5 dB. Thus, a (31, 29) RS elementary decoder occupies 729 slice LUTs, 472 slice flip-flops, and 3 BlockRAMs of 18 Kbits. A connection block occupies only 2,325 slice LUTs. The computation resources of a decoding module take up 29,295 slice flip-flops and 49,848 slice LUTs. This means that the occupation rates are about 14% and 24% of a Xilinx Virtex-5 LX330 FPGA for slice registers and slice LUTs, respectively. Besides, the memory resources for



[Figure: six cascaded decoding modules, one per Xilinx XC5VLX330 FPGA, linked by 200 LVDS signals through SERDES modules; each module contains N elementary row decoders, N elementary column decoders, and two connection blocks; a BlockRAM feeds the first module; global clock f₀ = 65 MHz.]

Figure 10: 10 Gbps experimental setup for turbo decoding of the (31, 29)² RS product code.

the decoding module take up 186 BlockRAMs of 18 Kbits. This represents 32% of the total BlockRAM available in the Xilinx Virtex-5 LX330 FPGA. Note that one BlockRAM of 18 Kbits is allocated by the Xilinx ISE tool to memorize only 31 × 5 × 5 = 775 bits in our design. The occupation rate of each BlockRAM of 18 Kbits is then only about 4%. Input data are clocked at f₀ = 65 MHz, resulting in a data rate of T_in = 10 Gbps at the turbo decoder input. Taking into account the code rate R = 0.87, the information rate becomes T_out = 8.7 Gbps. In conclusion, the implementation results show that a turbo decoder dedicated to the (31, 29)² RS product code can effectively be integrated into the physical layer of a 10 Gbps optical access network.
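The quoted input and output rates follow directly from the parallelism degree and the clock; the text rounds R to 0.87, which gives the 8.7 Gbps figure (the exact rate (29/31)² gives slightly more):

```python
n, m, f0 = 31, 5, 65e6
p = n * m                          # 155 bits enter the decoder per clock period
t_in = p * f0                      # ~10.1e9: the ~10 Gbps input line rate
r = (29 / 31) ** 2                 # exact product-code rate, ~0.875
t_out = t_in * r                   # ~8.8e9 useful bits/s (~8.7 Gbps as quoted)
print(t_in / 1e9, t_out / 1e9)
```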

7.3. (63, 61)² RS TPC complexity estimation for a 40 Gbps transmission over OTN

A similar prototype based on the (63, 61)² RS TPC can be designed for 40 Gbps transmission over OTN. Indeed, the architecture of one decoding iteration is the same for the two RS TPCs considered in this work. For the (63, 61)² RS product code, a decoding module for one iteration is now composed of 63 × 2 = 126 elementary decoders and 2 connection blocks. Logic syntheses were performed using the Xilinx ISE tool to estimate the complexity of a (63, 61) RS elementary decoder. This decoder occupies 1,070 slice LUTs, 660 slice flip-flops, and 3 BlockRAMs of 18 Kbits. These estimates immediately give the complexity of a decoding module dedicated to one iteration. The computation resources of a (63, 61)² RS decoding module take up 83,160 slice flip-flops and 134,820 slice LUTs. The occupation rates are then about 40% and 65% of a Xilinx Virtex-5 LX330 FPGA for slice registers and slice LUTs, respectively. The memory resources of a (63, 61)² RS decoding module take up 378 BlockRAMs of 18 Kbits, which represents 65% of the total BlockRAM available

in the considered FPGA device. One BlockRAM of 18 Kbits is allocated by the Xilinx ISE tool to memorize only 63 × 6 × 5 = 1,890 bits. For a (63, 61) RS elementary decoder, the occupation rate of each BlockRAM of 18 Kbits is thus only about 10.5%.

8. CONCLUSION

We have investigated the use of RS product codes for forward error correction in high-capacity fiber-optic transport systems. A complete study considering all aspects of the problem, from code optimization to turbo product code implementation, has been performed. Two specific applications were envisioned: 40 Gbps line rate transmission over OTN and 10 Gbps data transmission over PON. Algorithmic issues have been identified and solved in order to design RS turbo product codes that are compatible with the respective requirements of the two transmission scenarios. A novel full-parallel turbo decoding architecture has been introduced. This architecture allows decoding of TPCs at data rates of 10 Gbps and beyond. In addition, a comparative study has been carried out between eBCH and RS TPCs in the context of optical communications. The results have shown that high-rate RS TPCs offer similar performance at reduced hardware complexity. Finally, we have described the successful realization of an RS turbo decoder prototype for 10 Gbps data transmission. This experimental setup demonstrates the practicality and the benefits offered by RS TPCs in lightwave systems. Although only fiber-optic communications have been considered in this work, RS TPCs may also be attractive FEC solutions for next-generation free-space optical communication systems.

ACKNOWLEDGMENTS

The authors wish to acknowledge the financial support of France Telecom R&D. They also thank Gerald Le Mestre for his significant help during the experimental setup design phase. This paper was presented in part at the IEEE International Conference on Communications, Glasgow, Scotland, June 2007.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 843634, 11 pages
doi:10.1155/2008/843634

Research Article
Complexity Analysis of Reed-Solomon Decoding over GF(2^m) without Using Syndromes

Ning Chen and Zhiyuan Yan

Department of Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA

Correspondence should be addressed to Zhiyuan Yan, [email protected]

Received 15 November 2007; Revised 29 March 2008; Accepted 6 May 2008

Recommended by Jinhong Yuan

There has been renewed interest recently in decoding Reed-Solomon (RS) codes without using syndromes. In this paper, we investigate the complexity of syndromeless decoding and compare it to that of syndrome-based decoding. Aiming to provide guidelines for practical applications, our complexity analysis focuses on RS codes over characteristic-2 fields, for which some multiplicative FFT techniques are not applicable. Due to the moderate block lengths of RS codes in practice, our analysis is complete, without big-O notation. In addition to fast implementation using additive FFT techniques, we also consider direct implementation, which is still relevant for RS codes with moderate lengths. For high-rate RS codes, when compared to syndrome-based decoding algorithms, not only do syndromeless decoding algorithms require more field operations regardless of implementation, but decoder architectures based on their direct implementations also have higher hardware costs and lower throughput. We also derive tighter bounds on the complexities of fast polynomial multiplication based on Cantor's approach and of the fast extended Euclidean algorithm.

Copyright © 2008 N. Chen and Z. Yan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Reed-Solomon (RS) codes are among the most widely used error control codes, with applications in space communications, wireless communications, and consumer electronics [1]. As such, efficient decoding of RS codes is of great interest. The majority of applications of RS codes use syndrome-based decoding algorithms such as the Berlekamp-Massey algorithm (BMA) [2] or the extended Euclidean algorithm (EEA) [3]. Alternative hard-decision decoding methods for RS codes that do not use syndromes were considered in [4–6]. As pointed out in [7, 8], these algorithms belong to the class of frequency-domain algorithms and are related to the Welch-Berlekamp algorithm [9]. In contrast to syndrome-based decoding algorithms, these algorithms do not compute syndromes and avoid the Chien search and Forney's formula. Clearly, this difference leads to the question of whether these algorithms offer lower complexity than syndrome-based decoding, especially when fast Fourier transform (FFT) techniques are applied [6].

The asymptotic complexity of syndromeless decoding was analyzed in [6], and in [7] it was concluded that syndromeless decoding has the same asymptotic complexity O(n log²n) (note that all logarithms in this paper are to base two) as syndrome-based decoding [10]. However, existing asymptotic complexity analysis is limited in several aspects. For example, for RS codes over Fermat fields GF(2^{2^r} + 1) and other prime fields [5, 6], efficient multiplicative FFT techniques lead to an asymptotic complexity of O(n log²n). However, such FFT techniques do not apply to characteristic-2 fields, and hence this complexity is not applicable to RS codes over characteristic-2 fields. For RS codes over arbitrary fields, the asymptotic complexity of syndromeless decoding based on multiplicative FFT techniques was shown to be O(n log²n log log n) [6]. Although these techniques are applicable to RS codes over characteristic-2 fields, the complexity has large coefficients, and multiplicative FFT techniques are less efficient than fast implementations based on additive FFT for RS codes with moderate block lengths [6, 11, 12]. As such, asymptotic complexity analysis provides little help to practical applications.


In this paper, we analyze the complexity of syndromeless decoding and compare it to that of syndrome-based decoding. Aiming to provide guidelines for system designers, we focus on the decoding complexity of RS codes over GF(2^m). Since RS codes in practice have moderate lengths, our complexity analysis provides not only the coefficients of the most significant terms, but also the following terms. Due to these moderate lengths, our comparison is based on two types of implementations of syndromeless decoding and syndrome-based decoding: direct implementation and fast implementation based on FFT techniques. Direct implementations are often efficient when decoding RS codes with moderate lengths and have widespread applications; thus, we consider both computational complexities, in terms of field operations, and hardware costs and throughputs. For fast implementations, we consider their computational complexities only; their hardware implementations are beyond the scope of this paper. We use additive FFT techniques based on Cantor's approach [13], since this approach achieves small coefficients [6, 11] and hence is more suitable for moderate lengths. In contrast to some previous works [12, 14], which count field multiplications and additions together, we differentiate the multiplicative and additive complexities in our analysis.

The main contributions of the paper are as follows.

(i) We derive a tighter bound on the complexity of fast polynomial multiplication based on Cantor's approach.

(ii) We obtain a tighter bound on the complexity of the fast extended Euclidean algorithm (FEEA) for general partial greatest common divisor (GCD) computation.

(iii) We evaluate the complexities of syndromeless decoding based on different implementation approaches and compare them with their counterparts for syndrome-based decoding. Both errors-only and errors-and-erasures decoding are considered.

(iv) We compare the hardware costs and throughputs of direct implementations of syndromeless decoders with those of syndrome-based decoders.

The rest of the paper is organized as follows. To make this paper self-contained, in Section 2 we briefly review FFT algorithms over finite fields, fast algorithms for polynomial multiplication and division over GF(2^m), the FEEA, and syndromeless decoding algorithms. Section 3 presents both the computational complexity and the decoder architectures of direct implementations of syndromeless decoding, and compares them with their counterparts for syndrome-based decoding algorithms. Section 4 compares the computational complexity of fast implementations of syndromeless decoding with that of syndrome-based decoding. In Section 5, case studies on two RS codes are provided and errors-and-erasures decoding is discussed. Conclusions are given in Section 6.

2. BACKGROUND

2.1. Fast Fourier transform over finite fields

For any n (n | q − 1) distinct elements a0, a1, ..., an−1 ∈ GF(q), the transform from f = (f0, f1, ..., fn−1)^T to F ≜ (f(a0), f(a1), ..., f(an−1))^T, where f(x) = Σ_{i=0}^{n−1} fi x^i ∈ GF(q)[x], is called a discrete Fourier transform (DFT), denoted by F = DFT(f). Accordingly, f is called the inverse DFT of F, denoted by f = IDFT(F). An asymptotically fast Fourier transform (FFT) algorithm over GF(2^m) was proposed in [15]. Reduced-complexity cyclotomic FFT (CFFT) was shown to be efficient for moderate lengths in [16].
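As a toy illustration of these definitions (ours, not from the paper), the sketch below computes a DFT/IDFT pair by direct polynomial evaluation over the prime field GF(17), with the n = 8 points chosen as the powers of an element of order 8; the paper's GF(2^m) setting is avoided purely to keep the field arithmetic trivial:

```python
# Toy DFT/IDFT over the prime field GF(17) (our illustration; the paper's
# codes live in GF(2^m)). With n = 8 dividing q - 1 = 16, the points
# a_i = w^i for an element w of order 8 give the familiar inverse transform
# f_i = n^-1 * F(w^-i).
q, n, w = 17, 8, 9          # 9 has multiplicative order 8 modulo 17

def poly_eval(f, x):
    """Horner evaluation of f(x) = sum_i f[i] x^i over GF(q)."""
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % q
    return acc

def dft(f):
    return [poly_eval(f, pow(w, i, q)) for i in range(n)]

def idft(F):
    n_inv, w_inv = pow(n, -1, q), pow(w, -1, q)
    return [n_inv * poly_eval(F, pow(w_inv, i, q)) % q for i in range(n)]

f = [1, 5, 0, 3, 0, 0, 2, 7]
assert idft(dft(f)) == f     # round trip recovers the coefficients
```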

2.2. Polynomial multiplication over GF(2^m) by Cantor's approach

A fast polynomial multiplication algorithm using additive FFT was proposed by Cantor [13] for GF(q^{q^m}), where q is prime, and it was generalized to GF(q^m) in [11]. Instead of evaluating and interpolating over multiplicative subgroups as in multiplicative FFT techniques, Cantor's approach uses additive subgroups. It relies on two algorithms: multipoint evaluation (MPE) [11, Algorithm 3.1] and multipoint interpolation (MPI) [11, Algorithm 3.2].

Suppose the degree of the product of two polynomials over GF(2^m) is less than h (h ≤ 2^m); the product can then be obtained as follows. First, the two operand polynomials are evaluated using the MPE algorithm. The evaluation results are then multiplied pointwise. Finally, the product polynomial is obtained by the MPI algorithm, which interpolates the pointwise multiplication results. The polynomial multiplication requires at most (3/2)h log²h + (15/2)h log h + 8h multiplications over GF(2^m) and (3/2)h log²h + (29/2)h log h + 4h + 9 additions over GF(2^m) [11]. For simplicity, henceforth in this paper all arithmetic operations are over GF(2^m) unless specified otherwise.
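To get a feel for when this bound beats schoolbook multiplication, the following back-of-the-envelope script (ours) locates the crossover length. It compares only the multiplication counts, ignores additions and hidden constants, and reads the bound with log²h meaning (log h)²:

```python
import math

def cantor_mul_bound(h):
    """Multiplication-count bound quoted above: (3/2)h log^2 h + (15/2)h log h + 8h."""
    lg = math.log2(h)
    return 1.5 * h * lg**2 + 7.5 * h * lg + 8 * h

def schoolbook_mul(h):
    """Schoolbook product of two polynomials of degree < h: h^2 multiplications."""
    return h * h

# The bound only pays off for fairly long polynomials: first power of two
# where it undercuts the schoolbook count.
crossover = next(k for k in range(2, 20)
                 if cantor_mul_bound(2**k) < schoolbook_mul(2**k))
```

With these formulas, the crossover lands at h = 2^8 = 256, consistent with the remark that additive-FFT methods target moderate-to-long block lengths.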

2.3. Polynomial division by Newton iteration

Suppose a, b ∈ GF(q)[x] are two polynomials of degrees d0 + d1 and d1 (d0, d1 ≥ 0), respectively. To find the quotient polynomial q and the remainder polynomial r satisfying a = qb + r, where deg r < d1, a fast polynomial division algorithm is available [12]. With rev_h(a) ≜ x^h a(1/x), the fast algorithm first computes the inverse of rev_{d1}(b) mod x^{d0+1} by Newton iteration. Then the reverse quotient is given by q* = rev_{d0+d1}(a) rev_{d1}(b)^{−1} mod x^{d0+1}. Finally, the actual quotient and remainder are given by q = rev_{d0}(q*) and r = a − qb.

Thus, the complexity of dividing a polynomial a of degree d0 + d1 by a monic polynomial b of degree d1, with remainder, is at most 4M(d0) + M(d1) + O(d1) multiplications/additions when d1 ≥ d0 [12, Theorem 9.6], where M(h) denotes the number of multiplications/additions required to multiply two polynomials of degree less than h.
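The reversal-and-Newton-iteration recipe above can be sketched as follows. This is our own illustrative code over the prime field GF(17) rather than GF(2^m), with b assumed monic, a assumed to have a nonzero leading coefficient, and coefficient lists stored lowest degree first:

```python
# Hedged sketch of division by Newton iteration over GF(17) (ours; the paper
# works over GF(2^m)). Polynomials are coefficient lists, lowest degree first.
q = 17

def rev(a, h):
    """rev_h(a) = x^h * a(1/x): reverse the coefficients up to degree h."""
    return list(reversed(a + [0] * (h + 1 - len(a))))

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % q
    return out

def inv_mod_x(b, l):
    """Newton iteration for g with g * b = 1 (mod x^l); needs b[0] != 0."""
    g = [pow(b[0], -1, q)]
    k = 1
    while k < l:
        k *= 2
        e = poly_mul(g, poly_mul(g, b[:k])[:k])[:k]       # g^2 * b mod x^k
        e = e + [0] * (k - len(e))
        g = g + [0] * (k - len(g))
        g = [(2 * gi - ei) % q for gi, ei in zip(g, e)]   # g <- 2g - g^2 b
    return g[:l]

def poly_divmod(a, b):
    """Quotient and remainder of a by a monic b, via q* = rev(a) * rev(b)^-1."""
    d1 = len(b) - 1
    d0 = len(a) - 1 - d1
    if d0 < 0:
        return [0], a
    rb_inv = inv_mod_x(rev(b, d1), d0 + 1)
    q_star = poly_mul(rev(a, d0 + d1), rb_inv)[:d0 + 1]
    quo = rev(q_star, d0)                                 # q = rev_{d0}(q*)
    qb = poly_mul(quo, b)
    rem = [(ai - qb[i]) % q for i, ai in enumerate(a)]    # r = a - q*b
    while len(rem) > 1 and rem[-1] == 0:
        rem.pop()
    return quo, rem
```

For instance, dividing 6x^5 + 5x^4 + 4x^3 + 3x^2 + 2x + 1 by x^2 + 3 yields quotient 6x^3 + 5x^2 + 3x + 5 and remainder 10x + 3 over GF(17).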


2.4. Fast extended Euclidean algorithm

Let r_0 and r_1 be two monic polynomials with deg r_0 > deg r_1, and assume s_0 = t_1 = 1 and s_1 = t_0 = 0. Step i (i = 1, 2, ..., l) of the EEA computes ρ_{i+1}r_{i+1} = r_{i−1} − q_i r_i, ρ_{i+1}s_{i+1} = s_{i−1} − q_i s_i, and ρ_{i+1}t_{i+1} = t_{i−1} − q_i t_i, so that the r_i form a sequence of monic polynomials with strictly decreasing degrees. If the GCD of r_0 and r_1 is desired, the EEA terminates when r_{l+1} = 0. For 1 ≤ i ≤ l, define R_i ≜ Q_i ··· Q_1 R_0, where Q_i = [0, 1; 1/ρ_{i+1}, −q_i/ρ_{i+1}] and R_0 = [1, 0; 0, 1] (rows separated by semicolons). It is then easily verified that R_i = [s_i, t_i; s_{i+1}, t_{i+1}] for 0 ≤ i ≤ l. In RS decoding, the EEA stops when the degree of r_i falls below a certain threshold for the first time; we refer to this as partial GCD.

The FEEA in [12, 17] costs no more than (22M(h) + O(h)) log h multiplications/additions when n_0 ≤ 2h [14].
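The partial GCD just described can be sketched with a plain (non-fast) EEA that stops once the remainder degree drops below a threshold, tracking only the coefficient t_i needed in RS decoding. This toy version (ours) works over GF(17) and skips the monic normalization by ρ_{i+1}, which does not affect the congruence being tracked:

```python
# Plain (non-fast) extended Euclidean "partial GCD" over GF(17) (our sketch).
# Polynomials are coefficient lists, lowest degree first.
q = 17

def deg(a):
    for d in range(len(a) - 1, -1, -1):
        if a[d]:
            return d
    return -1

def ev(a, x):
    acc = 0
    for c in reversed(a):
        acc = (acc * x + c) % q
    return acc

def poly_sub_scaled(a, b, c, shift):
    """Return a - c * x^shift * b over GF(q)."""
    out = a + [0] * max(0, shift + len(b) - len(a))
    for i, bi in enumerate(b):
        out[shift + i] = (out[shift + i] - c * bi) % q
    return out

def partial_gcd(r0, r1, dstop):
    """EEA on (r0, r1) until deg r < dstop; returns (g, v) with
    v * r1 = g (mod r0)."""
    r_prev, r_cur = r0, r1
    t_prev, t_cur = [0], [1]
    while deg(r_cur) >= dstop:
        while deg(r_prev) >= deg(r_cur):       # one polynomial long division
            d = deg(r_prev) - deg(r_cur)
            c = r_prev[deg(r_prev)] * pow(r_cur[deg(r_cur)], -1, q) % q
            r_prev = poly_sub_scaled(r_prev, r_cur, c, d)
            t_prev = poly_sub_scaled(t_prev, t_cur, c, d)
        r_prev, r_cur = r_cur, r_prev
        t_prev, t_cur = t_cur, t_prev
    return r_cur, t_cur

# Example: r0 = (x-1)(x-2)...(x-5), r1 arbitrary of degree 4, threshold 2.
r0 = [1]
for i in range(1, 6):
    r0 = poly_sub_scaled([0] + r0, r0, i, 0)   # multiply by (x - i)
r1 = [3, 1, 4, 1, 5]
g, v = partial_gcd(r0, r1, 2)
```

Since v·r1 − g is a multiple of r0, the identity can be checked by evaluating both sides at the roots 1, ..., 5 of r0.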

2.5. Syndrome-based and syndromeless decoding

Over a finite field GF(q), suppose a0, a1, ..., an−1 are n (n ≤ q) distinct elements, and define g0(x) ≜ ∏_{i=0}^{n−1} (x − ai). Let us consider an RS code over GF(q) with length n, dimension k, and minimum Hamming distance d = n − k + 1. A message polynomial m(x) of degree less than k is encoded to a codeword (c0, c1, ..., cn−1) with ci = m(ai), and the received vector is given by r = (r0, r1, ..., rn−1).

Syndrome-based hard-decision decoding consists of the following steps: syndrome computation, key equation solver, the Chien search, and Forney's formula. Further details are omitted; interested readers are referred to [1, 2, 18]. We also consider the following two syndromeless algorithms.

Algorithm 1 ([4, 5], [6, Algorithm 1]).

(1.1) Interpolation. Construct a polynomial g1(x) with deg g1(x) < n such that g1(ai) = ri for i = 0, 1, ..., n − 1.

(1.2) Partial GCD. Apply the EEA to g0(x) and g1(x), and find g(x) and v(x) that maximize deg g(x) while satisfying v(x)g1(x) ≡ g(x) mod g0(x) and deg g(x) < (n + k)/2.

(1.3) Message recovery. If v(x) | g(x), the message polynomial is recovered as m(x) = g(x)/v(x); otherwise, output "decoding failure."

Algorithm 2 ([6, Algorithm 1a]).

(2.1) Interpolation. Construct a polynomial g1(x) with deg g1(x) < n such that g1(ai) = ri for i = 0, 1, ..., n − 1.

(2.2) Partial GCD. Find s0(x) and s1(x) satisfying g0(x) = x^{n−d+1}s0(x) + r0(x) and g1(x) = x^{n−d+1}s1(x) + r1(x), where deg r0(x) ≤ n − d and deg r1(x) ≤ n − d. Apply the EEA to s0(x) and s1(x), and stop when the remainder g(x) has degree less than (d − 1)/2. Thus, we have v(x)s1(x) + u(x)s0(x) = g(x).

(2.3) Message recovery. If v(x) ∤ g0(x), output "decoding failure"; otherwise, first compute q(x) = g0(x)/v(x) and then obtain m′(x) = g1(x) + q(x)u(x). If deg m′(x) < k, output m′(x); otherwise, output "decoding failure."

Compared with Algorithm 1, the partial GCD step of Algorithm 2 is simpler, but its message recovery step is more complex [6].
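Algorithm 1 is small enough to run end to end. The sketch below is our own toy, not the paper's implementation: it assumes an (18, 12) RS code over the prime field GF(19) with evaluation points 1, ..., 18 (so t = 3 errors are correctable), and uses plain Lagrange interpolation and long division in place of the fast routines:

```python
# End-to-end toy of Algorithm 1 (interpolation, partial GCD, message
# recovery) over GF(19), ours and purely illustrative. Coefficient lists
# are lowest degree first.
q, n, k = 19, 18, 12
pts = list(range(1, n + 1))

def ev(f, x):
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % q
    return acc

def deg(f):
    return max((i for i, c in enumerate(f) if c), default=-1)

def pmul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % q
    return out

def pdivmod(a, b):
    """Long division a = quo * b + rem over GF(q); b need not be monic."""
    a, db = a[:], deg(b)
    quo = [0] * max(1, deg(a) - db + 1)
    inv = pow(b[db], -1, q)
    while deg(a) >= db:
        d = deg(a) - db
        c = a[deg(a)] * inv % q
        quo[d] = c
        for i in range(db + 1):
            a[d + i] = (a[d + i] - c * b[i]) % q
    return quo, a

def interpolate(ys):
    """Lagrange interpolation: g1 with g1(pts[i]) = ys[i]."""
    f = [0] * n
    for xi, yi in zip(pts, ys):
        li = [1]
        for xj in pts:
            if xj != xi:
                li = pmul(li, [(-xj) % q, 1])
        c = yi * pow(ev(li, xi), -1, q) % q
        f = [(fi + c * lij) % q for fi, lij in zip(f, li)]
    return f

def decode(r):
    g0 = [1]
    for xi in pts:
        g0 = pmul(g0, [(-xi) % q, 1])
    g1 = interpolate(r)                   # Step (1.1)
    rp, rc, tp, tc = g0, g1, [0], [1]     # invariant: tc * g1 = rc (mod g0)
    while deg(rc) >= (n + k) // 2:        # Step (1.2): stop at deg g < (n+k)/2
        quo, rem = pdivmod(rp, rc)
        qt = pmul(quo, tc)
        tnew = [((tp[i] if i < len(tp) else 0) - (qt[i] if i < len(qt) else 0)) % q
                for i in range(max(len(tp), len(qt)))]
        rp, rc, tp, tc = rc, rem, tc, tnew
    m, rest = pdivmod(rc, tc)             # Step (1.3): m = g / v if v | g
    if any(rest):
        return None                       # decoding failure
    return (m + [0] * k)[:k]
```

Encoding a message, flipping up to t = 3 received symbols, and decoding returns the original message coefficients.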

3. DIRECT IMPLEMENTATION OF SYNDROMELESS DECODING

3.1. Complexity analysis

We analyze the complexity of direct implementation of Algorithms 1 and 2. For simplicity, we assume n − k is even and hence d − 1 = 2t.

First, g1(x) in Steps (1.1) and (2.1) is given by IDFT(r). Direct implementation of Steps (1.1) and (2.1) follows Horner's rule and requires n(n − 1) multiplications and n(n − 1) additions [19].
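The n(n − 1) figure comes from evaluating a degree-(n − 1) polynomial by Horner's rule, which costs n − 1 multiplications and n − 1 additions per point, at n points. A minimal sketch (ours) that counts the operations explicitly:

```python
# Operation count behind the n(n-1) figure: Horner's rule evaluates a
# degree-(n-1) polynomial with n-1 multiplications and n-1 additions, and
# Steps (1.1)/(2.1) perform n such evaluations. Plain integers are used
# here; the counts are the same over GF(2^m).
def horner(coeffs, x):
    """Return (f(x), multiplications used, additions used)."""
    acc, muls, adds = coeffs[-1], 0, 0
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c
        muls += 1
        adds += 1
    return acc, muls, adds

n = 15
coeffs = list(range(1, n + 1))                  # some degree-(n-1) polynomial
total_muls = sum(horner(coeffs, x)[1] for x in range(n))
assert total_muls == n * (n - 1)
```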

Steps (1.2) and (2.2) both use the EEA. The Sugiyama tower (ST) [3, 20] is well known as an efficient direct implementation of the EEA. For Algorithm 1, the ST is initialized with g1(x) and g0(x), whose degrees are at most n. Since the number of iterations is 2t, Step (1.2) requires 4t(n + 2) multiplications and 2t(n + 1) additions. For Algorithm 2, the ST is initialized with s0(x) and s1(x), whose degrees are at most 2t, and the number of iterations is at most 2t.

Step (1.3) requires one polynomial division, which can be implemented using k iterations of cross multiplications in the ST. Since v(x) is actually the error locator polynomial [6], deg v(x) ≤ t. Hence, this requires k(k + 2t + 2) multiplications and k(t + 2) additions. However, the result of the polynomial division is scaled by a nonzero constant; that is, cross multiplications lead to m̃(x) = a m(x). To remove the scaling factor a, we can first compute 1/a = lc(g(x))/(lc(m̃(x)) lc(v(x))), where lc(f) denotes the leading coefficient of a polynomial f, and then obtain m(x) = (1/a)m̃(x). This process requires one inversion and k + 2 multiplications.

Step (2.3) involves one polynomial division, one polynomial multiplication, and one polynomial addition, whose complexities depend on the degrees of v(x) and u(x), denoted dv and du, respectively. In the polynomial division, let the result of the ST be q̃(x) = a q(x). The scaling factor is recovered by 1/a = 1/(lc(q̃(x)) lc(v(x))). Thus, obtaining q(x) requires one inversion, (n − dv + 1)(n + dv + 3) + n − dv + 2 multiplications, and (n − dv + 1)(dv + 2) additions. The polynomial multiplication needs (n − dv + 1)(du + 1) multiplications and (n − dv + 1)(du + 1) − (n − dv + du + 1) additions, and the polynomial addition needs n additions since g1(x) has degree at most n − 1. The total complexity of Step (2.3) is thus (n − dv + 1)(n + dv + du + 5) + 1 multiplications, (n − dv + 1)(dv + du + 2) + n − du additions, and one inversion. Consider the worst case for multiplicative complexity, in which dv should be as small as possible. Since dv > du, the highest multiplicative complexity is (n − du)(n + 2du + 6) + 1, which is maximized when du = (n − 6)/4; recall also that du < dv ≤ t. Let R denote the code rate. For RS codes with R > 1/2, the maximum complexity is n² + nt − 2t² + 5n − 2t + 5 multiplications, 2nt − 2t² + 2n + 2 additions, and one inversion. For codes with R ≤ 1/2, the maximum complexity is (9/8)n² + (9/2)n + 11/2 multiplications, (3/8)n² + (3/2)n + 3/2 additions, and one inversion.

Table 1 lists the complexity of direct implementation of Algorithms 1 and 2, in terms of operations in GF(2^m). The complexity of syndrome-based decoding is given in Table 2. The numbers for syndrome computation, the Chien search, and Forney's formula are from [21]. We assume that the EEA is used for the key equation solver, since it was shown to be equivalent to the BMA [22]; the ST is used to implement the EEA. Note that the overall complexity of syndrome-based decoding can be reduced by sharing computations between the Chien search and Forney's formula. However, this is not taken into account in Table 2.

3.2. Complexity comparison

For any application with fixed parameters n and k, the comparison between the algorithms is straightforward using the complexities in Tables 1 and 2. Below, we try to determine which algorithm is more suitable for a given code rate. The comparison between different algorithms is complicated by the three different types of field operations. However, the complexity is dominated by the number of multiplications: in hardware implementations, both multiplication and inversion over GF(2^m) require an area-time complexity of O(m²) [23], whereas an addition requires an area-time complexity of O(m); the complexity due to inversions is negligible, since the required number of inversions is much smaller than that of multiplications; and the numbers of multiplications and additions are both O(n²). Thus, we focus on the number of multiplications for simplicity.

Since t = (1/2)(1 − R)n and k = Rn, the multiplicative complexities of Algorithms 1 and 2 are (3 − R)n² + (3 − R)n + 2 and (1/2)(3R² − 7R + 8)n² + (7 − 3R)n + 5, respectively, while the complexity of syndrome-based decoding is (1/2)(5R² − 13R + 8)n² + (2 − 3R)n. It is easy to verify that in all these complexities, the quadratic and linear coefficients are of the same order of magnitude; hence, we consider only the quadratic terms. Considering only the quadratic terms, Algorithm 1 is less efficient than syndrome-based decoding when R > 1/5. If the Chien search and Forney's formula share computations, this threshold will be even lower. Comparing the highest-order terms, Algorithm 2 is less efficient than the syndrome-based algorithm regardless of R. It is also easy to verify that the most significant term of the difference between Algorithms 1 and 2 is (1/2)(1 − R)(3R − 2)n², so, when implemented directly, Algorithm 1 is less efficient than Algorithm 2 when R > 2/3. Thus, Algorithm 1 is more suitable for codes with very low rate, while syndrome-based decoding is the most efficient for high-rate codes.
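The rate thresholds above can be double-checked numerically from the quadratic coefficients alone; the following sketch (ours) does exactly that:

```python
# Numeric double-check (ours) of the rate thresholds quoted above, using only
# the n^2 coefficients of the multiplication counts with t = (1-R)n/2, k = Rn.
def alg1(R):  return 3 - R                     # Algorithm 1
def alg2(R):  return (3*R*R - 7*R + 8) / 2     # Algorithm 2
def synd(R):  return (5*R*R - 13*R + 8) / 2    # syndrome-based decoding

# Syndrome-based decoding overtakes Algorithm 1 at R = 1/5 ...
assert synd(0.19) > alg1(0.19) and synd(0.21) < alg1(0.21)
# ... Algorithm 2 overtakes Algorithm 1 at R = 2/3 ...
assert alg1(0.66) < alg2(0.66) and alg1(0.68) > alg2(0.68)
# ... and syndrome-based decoding beats Algorithm 2 at every rate in (0, 1),
# since alg2(R) - synd(R) = R(3 - R) > 0 there.
assert all(synd(r / 100) < alg2(r / 100) for r in range(1, 100))
```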

3.3. Hardware costs, latency, and throughput

We have compared the computational complexities of syndromeless decoding algorithms with those of syndrome-based algorithms. We now compare the two types of decoding algorithms from a hardware perspective, that is, the hardware costs, latency, and throughput of decoder architectures based on direct implementations of these algorithms. Since our goal is to compare syndrome-based algorithms with syndromeless algorithms, we select our architectures so that the comparison is on a level field. Thus, among the various decoder architectures available for syndrome-based decoders in the literature, we consider the hypersystolic architecture in [20]. Not only is it an efficient architecture for syndrome-based decoders, but some of its functional units can also be easily adapted to implement syndromeless decoders. Decoder architectures for both types of decoding algorithms therefore share the same structure, with some functional units in common; this allows us to focus on the difference between the two types of algorithms. For the same reason, we do not try to optimize the hardware costs, latency, or throughput using circuit-level techniques, since such techniques would benefit the architectures for both types of decoding algorithms in a similar fashion and hence do not affect the comparison.

The hypersystolic architecture [20] contains three functional units: the power sums tower (PST), computing the syndromes; the ST, solving the key equation; and the correction tower (CT), performing the Chien search and Forney's formula. The PST consists of 2t systolic cells, each of which comprises one multiplier, one adder, five registers, and one multiplexer. The ST has δ + 1 systolic cells (δ is the maximal degree of the input polynomials), each of which contains one multiplier, one adder, five registers, and seven multiplexers. The latency of the ST is 6γ clock cycles [20], where γ is the number of iterations. For the syndrome-based decoder architecture, δ and γ are both 2t. The CT consists of 3t + 1 evaluation cells and two delay cells, along with two joiner cells, which also perform inversions. Each evaluation cell needs one multiplier, one adder, four registers, and one multiplexer. Each delay cell needs one register. The two joiner cells altogether need two multipliers, one inverter, and four registers. Table 3 summarizes the hardware costs of the decoder architecture for syndrome-based decoders described above. For each functional unit, we also list the latency (in clock cycles), as well as the number of clock cycles it needs to process one received word, which is proportional to the inverse of the throughput. In theory, the computational complexities of the steps of RS decoding depend on the received word, and the total complexity is obtained by first summing the complexities of all the steps and then considering the worst-case scenario (cf. Section 3.1). In contrast, the hardware costs, latency, and throughput of every functional unit are dominated by the worst case; the numbers in Table 3 all correspond to the worst-case scenario. The critical path delay (CPD) is the same, Tmult + Tadd + Tmux, for the PST, ST, and CT. In addition to the registers required by the PST, ST, and CT, the total number of registers in Table 3


Table 1: Direct implementation complexities of syndromeless decoding algorithms.

                                  Multiplications                  Additions                      Inversions
Interpolation                     n(n − 1)                         n(n − 1)                       0
Partial GCD (Algorithm 1)         4t(n + 2)                        2t(n + 1)                      0
Partial GCD (Algorithm 2)         4t(2t + 2)                       2t(2t + 1)                     0
Message recovery (Algorithm 1)    (k + 2)(k + 1) + 2kt             k(t + 2)                       1
Message recovery (Algorithm 2)    n² + nt − 2t² + 5n − 2t + 5      2nt − 2t² + 2n + 2             1
Total (Algorithm 1)               2n² + 2nt + 2n + 2t + 2          n² + 3nt − 2t² + n − 2t        1
Total (Algorithm 2)               2n² + nt + 6t² + 4n + 6t + 5     n² + 2nt + 2t² + n + 2t + 2    1

Table 2: Direct implementation complexity of syndrome-based decoding.

                          Multiplications        Additions        Inversions
Syndrome computation      2t(n − 1)              2t(n − 1)        0
Key equation solver       4t(2t + 2)             2t(2t + 1)       0
Chien search              n(t − 1)               nt               0
Forney's formula          2t²                    t(2t − 1)        t
Total                     3nt + 10t² − n + 6t    3nt + 6t² − t    t

also accounts for the registers needed by the delay line called Main Street [20].

Both the PST and the ST can be adapted to implement decoder architectures for syndromeless decoding algorithms. Similar to syndrome computation, the interpolation in syndromeless decoders can be implemented by Horner's rule, and thus the PST can be easily adapted to implement this step. For the architectures based on syndromeless decoding, the PST contains n cells, and the hardware costs of each cell remain the same. The partial GCD is implemented by the ST, which can implement the polynomial division in message recovery as well. In Step (1.3), the maximum polynomial degree in the polynomial division is k + t and the number of iterations is at most k. As mentioned in Section 3.1, the degree of v(x) in Step (2.3) ranges from 1 to t. In the polynomial division g0(x)/v(x), the maximum polynomial degree is n and the number of iterations is at most n − 1. Given the maximum polynomial degree and the number of iterations, the hardware costs and latency of the ST can be determined as for the syndrome-based architecture.

The other operations of syndromeless decoders do not have corresponding functional units available in the hypersystolic architecture, and we choose to implement them in a straightforward way. In the polynomial multiplication q(x)u(x), u(x) has degree at most t − 1 and the product has degree at most n − 1. Thus, it can be done by n multiply-and-accumulate circuits and n registers in t cycles (see, e.g., [24]). The polynomial addition in Step (2.3) can be done in one clock cycle with n adders and n registers. To remove the scaling factor, Step (1.3) is implemented in four cycles with at most one inverter, k + 2 multipliers, and k + 3 registers; Step (2.3) is implemented in three cycles with at most one inverter, n + 1 multipliers, and n + 2 registers. We summarize the hardware costs, latency, and throughput of the decoder architectures based on Algorithms 1 and 2 in Table 4.

Now we compare the hardware costs of the three decoder architectures based on Tables 3 and 4. The hardware costs are measured by the numbers of various basic circuit elements. All three decoder architectures need only one inverter. The syndrome-based decoder architecture requires fewer multiplexers than the decoder architecture based on Algorithm 1, regardless of the rate, and fewer multipliers, adders, and registers when R > 1/2. The syndrome-based decoder architecture requires fewer registers than the decoder architecture based on Algorithm 2 when R > 21/43, and fewer multipliers, adders, and multiplexers regardless of the rate. Thus, for high-rate codes, the syndrome-based decoder has lower hardware costs than syndromeless decoders. The decoder architecture based on Algorithm 1 requires fewer multipliers and adders than that based on Algorithm 2, regardless of the rate, but more registers and multiplexers when R > 9/17.

In these algorithms, each step starts with the results of the previous step. Due to this data dependency, their corresponding functional units have to operate in a pipelined fashion. Thus, the decoding latency is simply the sum of the latencies of all the functional units. The decoder architecture based on Algorithm 2 has the longest latency, regardless of the rate. The syndrome-based decoder architecture has shorter latency than the decoder architecture based on Algorithm 1 when R > 1/7.

All three decoders have the same CPD, so the throughput is determined by the number of clock cycles. Since the functional units in each decoder architecture are pipelined, the throughput of each decoder architecture is determined by the functional unit that requires the largest number of cycles. Regardless of the rate, the decoder based on Algorithm 2 has the lowest throughput. When R > 1/2, the syndrome-based decoder architecture has higher throughput than the decoder architecture based on Algorithm 1. When the rate is lower, they have the same throughput.

Hence, for high-rate RS codes, the syndrome-based decoder architecture requires less hardware and achieves higher throughput and shorter latency than those based on syndromeless decoding algorithms.

4. FAST IMPLEMENTATION OF SYNDROMELESS DECODING

In this section, we implement the three steps of Algorithms 1 and 2: interpolation, partial GCD, and message recovery,


6 EURASIP Journal on Wireless Communications and Networking

Table 3: Decoder architecture based on syndrome-based decoding (CPD is T_mult + T_add + T_mux)

Step                  Multipliers  Adders  Inverters  Registers     Muxes    Latency  Throughput⁻¹
Syndrome computation  2t           2t      0          10t           2t       n + 6t   6t
Key equation solver   2t + 1       2t + 1  0          10t + 5       14t + 7  12t      12t
Correction            3t + 3       3t + 1  1          12t + 10      3t + 1   3t       3t
Total                 7t + 4       7t + 2  1          n + 53t + 15  19t + 8  n + 21t  12t

Table 4: Decoder architectures based on syndromeless decoding (CPD is T_mult + T_add + T_mux)

Step              Algorithm    Multipliers      Adders          Inverters  Registers           Muxes              Latency            Throughput⁻¹
Interpolation     1 and 2      n                n               0          5n                  n                  4n                 3n
Partial GCD       Algorithm 1  n + 1            n + 1           0          5n + 5              7n + 7             12t                12t
                  Algorithm 2  2t + 1           2t + 1          0          10t + 5             14t + 7            12t                12t
Message recovery  Algorithm 1  2k + t + 3       k + t + 1       1          6k + 5t + 8         7k + 7t + 7        6k + 4             6k
                  Algorithm 2  3n + 2           3n + 1          1          7n + 7              7n + 7             6n + t − 2         6n
Total             Algorithm 1  2n + 2k + t + 4  2n + k + t + 2  1          10n + 6k + 5t + 13  8n + 7k + 7t + 14  4n + 6k + 12t + 4  6k
                  Algorithm 2  4n + 2t + 3      4n + 2t + 2     1          12n + 10t + 12      8n + 14t + 14      10n + 13t − 2      6n

by fast algorithms described in Section 2 and evaluate their complexities. Since both the polynomial division by Newton iteration and the FEEA depend on efficient polynomial multiplication, the decoding complexity relies on the complexity of polynomial multiplication. Thus, in addition to field multiplications and additions, the complexities in this section are also expressed in terms of polynomial multiplications.

4.1. Polynomial multiplication

We first derive a tighter bound on the complexity of fast polynomial multiplication based on Cantor's approach.

Let the degree of the product of two polynomials be less than n. The polynomial multiplication can be done by two FFTs and one inverse FFT if a length-n FFT is available over GF(2^m), which requires n | 2^m − 1. If n ∤ 2^m − 1, one option is to pad the polynomials to length n′ (n′ > n) with n′ | 2^m − 1. Compared with fast polynomial multiplication based on multiplicative FFT, Cantor's approach uses additive FFT and does not require n | 2^m − 1, so it is more efficient than FFT multiplication with padding for most degrees. For n = 2^m − 1, their complexities are similar. Although asymptotically worse than Schönhage's algorithm [12], which has O(n log n log log n) complexity, Cantor's approach has small implicit constants, and hence it is more suitable for practical implementation of RS codes [6, 11]. Gao claimed an improvement on Cantor's approach in [6], but we do not pursue this due to lack of details.

A tighter bound on the complexity of Cantor's approach is given in Theorem 1. Here we make the same assumption as in [11] that the auxiliary polynomials s_i and the values s_i(β_j) are precomputed. The complexity of precomputation was given in [11].

Theorem 1. By Cantor's approach, two polynomials a, b ∈ GF(2^m)[x] whose product has a degree less than h (1 ≤ h ≤ 2^m) can be multiplied using less than (3/2)h log² h + (7/2)h log h − 2h + log h + 2 multiplications, (3/2)h log² h + (21/2)h log h − 13h + log h + 15 additions, and 2h inversions over GF(2^m).

Proof. There exists 0 ≤ p ≤ m satisfying 2^{p−1} < h ≤ 2^p. Since both the MPE and MPI algorithms are recursive, we denote the numbers of additions of the MPE and MPI algorithms for input i (0 ≤ i ≤ p) as SE(i) and SI(i), respectively. Clearly, SE(0) = SI(0) = 0. Following the approach in [11], it can be shown that for 1 ≤ i ≤ p,

SE(i) ≤ i(i + 3)2^{i−2} + (p − 3)(2^i − 1) + i,    (1)

SI(i) ≤ i(i + 5)2^{i−2} + (p − 3)(2^i − 1) + i.    (2)

Let ME(h) and AE(h) denote the numbers of multiplications and additions, respectively, that the MPE algorithm requires for polynomials of a degree less than h. When i = p in the MPE algorithm, f(x) has a degree less than h ≤ 2^p, while s_{p−1} is of degree 2^{p−1} and has at most p nonzero coefficients. Thus, g(x) has a degree less than h − 2^{p−1}. Therefore, the numbers of multiplications and additions for the polynomial division in [11, Step 2 of Algorithm 3.1] are both p(h − 2^{p−1}), while r1(x) = r0(x) + s_{i−1}(β_i)g(x) needs at most h − 2^{p−1} multiplications and the same number of additions. Substituting the bound on ME(2^{p−1}) in [11], we obtain ME(h) ≤ 2ME(2^{p−1}) + p(h − 2^{p−1}) + h − 2^{p−1}, and thus ME(h) is at most (1/4)p²2^p − (1/4)p2^p − 2^p + (p + 1)h. Similarly, substituting the bound on SE(p − 1) in (1), we obtain AE(h) ≤ 2SE(p − 1) + p(h − 2^{p−1}) + h − 2^{p−1}, and hence AE(h) is at most (1/4)p²2^p + (3/4)p2^p − 4·2^p + (p + 1)h + 4.

Let MI(h) and AI(h) denote the numbers of multiplications and additions, respectively, which the MPI algorithm requires when the interpolated polynomial has a degree less than h. When i = p in the MPI algorithm, f(x) has a degree less than h ≤ 2^p. It implies that r0(x) + r1(x) has a degree less than h − 2^{p−1}. Thus, it requires at most h − 2^{p−1} additions to obtain r0(x) + r1(x) and h − 2^{p−1} multiplications for s_{i−1}(β_i)^{−1}(r0(x) + r1(x)).



The numbers of multiplications and additions for the polynomial multiplication in [11, Step 3 of Algorithm 3.2] to obtain f(x) are both p(h − 2^{p−1}). Adding r0(x) also needs 2^{p−1} additions. Substituting the bound on MI(2^{p−1}) in [11], we have MI(h) ≤ 2MI(2^{p−1}) + p(h − 2^{p−1}) + h − 2^{p−1}, and hence MI(h) is at most (1/4)p²2^p − (1/4)p2^p − 2^p + (p + 1)h. Similarly, substituting the bound on SI(p − 1) in (2), we have AI(h) ≤ 2SI(p − 1) + p(h − 2^{p−1}) + h + 1, and hence AI(h) is at most (1/4)p²2^p + (5/4)p2^p − 4·2^p + (p + 1)h + 5. The interpolation step also needs 2^p inversions.

Let M(h1, h2) be the complexity of multiplication of two polynomials of degrees less than h1 and h2. Using Cantor's approach, M(h1, h2) includes ME(h1) + ME(h2) + MI(h) + 2^p multiplications, AE(h1) + AE(h2) + AI(h) additions, and 2^p inversions, when h = h1 + h2 − 1. Finally, we replace 2^p by 2h as in [11].

Compared with the results in [11], our results have the same highest-degree term but smaller terms for lower degrees.

By Theorem 1, we can easily compute M(h1) ≜ M(h1, h1). A by-product of the above proof is the bounds for the MPE and MPI algorithms. We also observe some properties of the complexity of fast polynomial multiplication that hold not only for Cantor's approach but also for other approaches. These properties will be used in our complexity analysis next. Since all fast polynomial multiplication algorithms have higher-than-linear complexities, 2M(h) ≤ M(2h). Also note that M(h + 1) is no more than M(h) plus 2h multiplications and 2h additions [12, Exercise 8.34]. Since the complexity bound is determined only by the degree of the product polynomial, we assume M(h1, h2) ≤ M(⌈(h1 + h2)/2⌉). We note that the complexities of Schönhage's algorithm as well as Schönhage and Strassen's algorithm, both based on multiplicative FFT, are also determined by the degree of the product polynomial [12].
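To get a feel for the multiplication count in Theorem 1, it can be evaluated numerically and set against schoolbook multiplication, which needs about h² multiplications. A small sketch, restricted to powers of two so that log h is exact:

```python
import math

def cantor_mult_bound(h):
    """Multiplication count from Theorem 1:
    (3/2)h log^2 h + (7/2)h log h - 2h + log h + 2."""
    lg = math.log2(h)
    return (3 / 2) * h * lg ** 2 + (7 / 2) * h * lg - 2 * h + lg + 2

for h in (64, 256, 1024):
    print(h, int(cantor_mult_bound(h)), h * h)   # bound vs. schoolbook h^2
```

For h = 64 the bound (4680) still exceeds h² = 4096, while for h = 256 (31242 vs. 65536) and beyond it is well below, consistent with the remark that Cantor's approach has small but non-negligible constants.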

4.2. Polynomial division

Similar to [12, Exercise 9.6], in characteristic-2 fields, the complexity of Newton iteration is at most

∑_{0≤j≤r−1} ( M(⌈(d0 + 1)2^{−j}⌉) + M(⌈(d0 + 1)2^{−j−1}⌉) ),    (3)

where r = ⌈log(d0 + 1)⌉. Since ⌈(d0 + 1)2^{−j}⌉ ≤ ⌊(d0 + 1)2^{−j}⌋ + 1 and M(h + 1) is no more than M(h) plus 2h multiplications and 2h additions [12, Exercise 8.34], it requires at most ∑_{1≤j≤r}( M(⌊(d0 + 1)2^{−j}⌋) + M(⌊(d0 + 1)2^{−j−1}⌋) ), plus ∑_{0≤j≤r−1}( 2⌊(d0 + 1)2^{−j}⌋ + 2⌊(d0 + 1)2^{−j−1}⌋ ) multiplications and the same number of additions. Since 2M(h) ≤ M(2h), Newton iteration costs at most ∑_{0≤j≤r−1}( (3/2)M(⌊(d0 + 1)2^{−j}⌋) ) ≤ 3M(d0 + 1), 6(d0 + 1) multiplications, and 6(d0 + 1) additions. The second step, computing the quotient, needs M(d0 + 1); the last step, computing the remainder, needs M(d1 + 1, d0 + 1) and d1 + 1 additions. By M(d1 + 1, d0 + 1) ≤ M(⌈(d0 + d1)/2⌉ + 1), the total cost is at most 4M(d0) + M(⌈(d0 + d1)/2⌉), 15d0 + d1 + 7 multiplications, and 11d0 + 2d1 + 8 additions. Note that this bound does not require d1 ≥ d0 as in [12].

4.3. Partial GCD

The partial GCD step can be implemented in three approaches: the ST, the classical EEA with fast polynomial multiplication and Newton iteration, and the FEEA with fast polynomial multiplication and Newton iteration. The ST is essentially the classical EEA. The complexity of the classical EEA is asymptotically worse than that of the FEEA. Since the FEEA is more suitable for long codes, we will use the FEEA in our complexity analysis of fast implementations.
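For reference, the classical EEA with early termination (the "partial GCD") looks as follows. To keep the field arithmetic out of the way, this sketch works over a small prime field GF(97) rather than the GF(2^m) used for RS codes; the structure, iterating division with remainder while tracking v so that v·r1 ≡ r_j (mod r0) and stopping once deg r_j falls to a threshold, is the same.

```python
P = 97   # small prime field for readability; the paper's codes live over GF(2^m)

def trim(a):
    """Drop trailing zero coefficients (in place); coefficients are low to high."""
    while a and a[-1] == 0:
        a.pop()
    return a

def pdivmod(a, b):
    """Quotient and remainder of a divided by b over GF(P)."""
    a = trim(a[:])
    q = [0] * max(1, len(a) - len(b) + 1)
    inv = pow(b[-1], -1, P)               # inverse of the leading coefficient
    while len(a) >= len(b):
        shift = len(a) - len(b)
        c = a[-1] * inv % P
        q[shift] = c
        for i, bi in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bi) % P
        trim(a)                           # the leading term cancels exactly
    return trim(q), a

def pmul(a, b):
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] = (prod[i + j] + ai * bj) % P
    return trim(prod)

def partial_gcd(r0, r1, dmax):
    """Classical EEA, stopped as soon as deg r_j <= dmax; returns (v, r_j)
    with v * r1 = r_j (mod r0), the form used by the partial-GCD step."""
    r_prev, r = trim(r0[:]), trim(r1[:])
    v_prev, v = [0], [1]
    while len(r) - 1 > dmax:
        q, rem = pdivmod(r_prev, r)
        qv = pmul(q, v)
        width = max(len(v_prev), len(qv))
        v_next = [((v_prev[i] if i < len(v_prev) else 0)
                   - (qv[i] if i < len(qv) else 0)) % P for i in range(width)]
        r_prev, r = r, rem
        v_prev, v = v, trim(v_next)
    return v, r

r0 = [0, 0, 0, 0, 1]       # x^4
r1 = [5, 2, 7, 1]          # x^3 + 7x^2 + 2x + 5
v, r = partial_gcd(r0, r1, 1)
assert len(r) - 1 <= 1                      # degree reduced to the target
assert pdivmod(pmul(v, r1), r0)[1] == r     # v * r1 = r (mod r0)
```

The FEEA computes the same (v, r_j) pair, but obtains the quotient sequence recursively from truncated operands instead of by full-length divisions.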

In order to derive a tighter bound on the complexity of the FEEA, we first present a modified FEEA in Algorithm 3. Let η(h) ≜ max{ j : ∑_{i=1}^{j} deg q_i ≤ h }, which is the number of steps of the EEA satisfying deg r0 − deg r_{η(h)} ≤ h < deg r0 − deg r_{η(h)+1}. For f(x) = f_n x^n + ··· + f_1 x + f_0 with f_n ≠ 0, the truncated polynomial f(x) ↾ h ≜ f_n x^h + ··· + f_{n−h+1} x + f_{n−h}, where f_i = 0 for i < 0. Note that f(x) ↾ h = 0 if h < 0.
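With coefficients stored high to low, the truncation operator is a slice that keeps the h + 1 leading coefficients (padding with zeros when h exceeds the degree); a minimal sketch:

```python
def truncate(f, h):
    """f given high to low as [f_n, ..., f_0]; return the coefficient list of
    the truncation f_n x^h + ... + f_{n-h+1} x + f_{n-h}."""
    if h < 0:
        return [0]              # f(x) truncated at h < 0 is 0
    padded = f + [0] * max(0, h + 1 - len(f))   # f_i = 0 for i < 0
    return padded[:h + 1]

# f = 3x^4 + x^3 + 2x + 5: truncation at h = 2 keeps f_4, f_3, f_2
assert truncate([3, 1, 0, 2, 5], 2) == [3, 1, 0]
```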

Algorithm 3 (modified fast extended Euclidean algorithm)

Input: two monic polynomials r0 and r1, with deg r0 = n0 > n1 = deg r1, as well as an integer h (0 < h ≤ n0).
Output: l = η(h), ρ_{l+1}, R_l, r_l, and r_{l+1}.

(3.1) If r1 = 0 or h < n0 − n1, then return 0, 1, [1 0; 0 1], r0, and r1.

(3.2) h1 = ⌈h/2⌉, r0* = r0 ↾ 2h1, r1* = r1 ↾ (2h1 − (n0 − n1)).

(3.3) (j − 1, ρ_j*, R*_{j−1}, r*_{j−1}, r_j*) = FEEA(r0*, r1*, h1).

(3.4) [r_{j−1}; r_j] = R*_{j−1}[r0 − r0* x^{n0−2h1}; r1 − r1* x^{n0−2h1}] + [r*_{j−1} x^{n0−2h1}; r_j* x^{n0−2h1}], R_{j−1} = [1 0; 0 1/lc(r_j)] R*_{j−1}, ρ_j = ρ_j* lc(r_j), r_j = r_j/lc(r_j), n_j = deg r_j.

(3.5) If r_j = 0 or h < n0 − n_j, then return j − 1, ρ_j, R_{j−1}, r_{j−1}, and r_j.

(3.6) Perform polynomial division with remainder as r_{j−1} = q_j r_j + r_{j+1}, ρ_{j+1} = lc(r_{j+1}), r_{j+1} = r_{j+1}/ρ_{j+1}, n_{j+1} = deg r_{j+1}, R_j = [0 1; 1/ρ_{j+1} −q_j/ρ_{j+1}] R_{j−1}.

(3.7) h2 = h − (n0 − n_j), r_j* = r_j ↾ 2h2, r*_{j+1} = r_{j+1} ↾ (2h2 − (n_j − n_{j+1})).

(3.8) (l − j, ρ*_{l+1}, S*, r*_{l−j}, r*_{l−j+1}) = FEEA(r_j*, r*_{j+1}, h2).

(3.9) [r_l; r_{l+1}] = S*[r_j − r_j* x^{n_j−2h2}; r_{j+1} − r*_{j+1} x^{n_j−2h2}] + [r*_{l−j} x^{n_j−2h2}; r*_{l−j+1} x^{n_j−2h2}], S = [1 0; 0 1/lc(r_{l+1})] S*, ρ_{l+1} = ρ*_{l+1} lc(r_{l+1}).

(3.10) Return l, ρ_{l+1}, S R_j, r_l, r_{l+1}.

(Here [a b; c d] denotes a 2 × 2 matrix, [a; b] a column vector, and lc(·) the leading coefficient.)

It is easy to verify that Algorithm 3 is equivalent to the FEEA in [12, 17]. The difference between Algorithm 3 and the FEEA in [12, 17] lies in Steps (3.4), (3.5), (3.9), and (3.10): in Steps (3.5) and (3.10), two additional polynomials are returned, and they are used in the updates of Steps (3.4) and (3.9) to reduce complexity. The modification in Step (3.4) was suggested in [14] and the modification in Step (3.9) follows the same idea.

In [12, 14], the complexity bounds of the FEEA are established assuming n0 ≤ 2h. Thus, we first establish a bound of the FEEA for the case n0 ≤ 2h below in Theorem 2,



using the bounds we developed in Sections 4.1 and 4.2. The proof is similar to those in [12, 14] and is hence omitted; interested readers should have no difficulty filling in the details.

Theorem 2. Let T(n0, h) denote the complexity of the FEEA. When n0 ≤ 2h, T(n0, h) is at most 17M(h) log h plus (48h + 2) log h multiplications, (51h + 2) log h additions, and 3h inversions. Furthermore, if the degree sequence is normal, T(2h, h) is at most 10M(h) log h, ((55/2)h + 6) log h multiplications, and ((69/2)h + 3) log h additions.

Compared with the complexity bounds in [12, 14], our bound not only is tighter but also specifies all terms of the complexity and avoids the big-O notation. The saving over [14] is due to the lower complexities of Steps (3.6), (3.9), and (3.10), as explained above. The saving for the normal case over [12] is due to the lower complexity of Step (3.9).

Applying the FEEA to g0(x) and g1(x) to find v(x) and g(x) in Algorithm 1, we have n0 = n and h ≤ t since deg v(x) ≤ t. For RS codes, we always have n > 2t. Thus, the condition n0 ≤ 2h for the complexity bound in [12, 14] is not valid. It was pointed out in [6, 12] that s0(x) and s1(x) as defined in Algorithm 2 can be used instead of g0(x) and g1(x), which is the difference between Algorithms 1 and 2. Although such a transform allows us to use the results in [12, 14], it introduces extra cost for message recovery [6]. To compare the complexities of Algorithms 1 and 2, we establish a more general bound in Theorem 3.

Theorem 3. The complexity of the FEEA is no more than 34M(⌈h/2⌉) log⌈h/2⌉ + M(⌈n0/2⌉) + 4M(⌈n0/2 − h/4⌉) + 2M(⌈(n0 − h)/2⌉) + 4M(h) + 2M(⌈(3/4)h⌉) + 4M(⌈h/2⌉), (48h + 4) log⌈h/2⌉ + 9n0 + 22h multiplications, (51h + 4) log⌈h/2⌉ + 11n0 + 17h + 2 additions, and 3h inversions.

The proof is also omitted for brevity. The main difference between this case and Theorem 2 lies in the top-level call of the FEEA. The total complexity is obtained by adding 2T(h, ⌈h/2⌉) and the top-level cost.

It can be verified that, when n0 ≤ 2h, Theorem 3 presents a tighter bound than Theorem 2 since the saving on the top level is accounted for. Note that the complexity bounds in Theorems 2 and 3 assume that the FEEA solves s_{l+1}r0 + t_{l+1}r1 = r_{l+1} for both t_{l+1} and s_{l+1}. If s_{l+1} is not necessary, the complexity bounds in Theorems 2 and 3 are further reduced by 2M(⌈h/2⌉), 3h + 1 multiplications, and 4h + 1 additions.

4.4. Complexity comparison

Using the results in Sections 4.1, 4.2, and 4.3, we first analyze and then compare the complexities of Algorithms 1 and 2 as well as syndrome-based decoding under fast implementations.

In Steps (1.1) and (2.1), g1(x) can be obtained by an inverse FFT when n | 2^m − 1 or by the MPI algorithm. In the latter case, the complexity is given in Section 4.1. By Theorem 3, the complexity of Step (1.2) is T(n, t) minus the complexity to compute s_{l+1}. The complexity of Step (2.2) is T(2t, t). The complexity of Step (1.3) is given by the bound in Section 4.2. Similarly, the complexity of Step (2.3) is readily obtained by using the bounds on polynomial division and multiplication.

All the steps of syndrome-based decoding can be implemented using fast algorithms. Both syndrome computation and the Chien search can be done by n-point evaluations. Forney's formula can be done by two t-point evaluations plus t inversions and t multiplications. To use the MPE algorithm, we choose to evaluate on all n points. By Theorem 3, the complexity of the key equation solver is T(2t, t) minus the complexity to compute s_{l+1}.

Note that to simplify the expressions, the complexities are given in terms of three kinds of operations: polynomial multiplications, field multiplications, and field additions. Of course, with our bounds on the complexity of polynomial multiplication in Theorem 1, the complexities of the decoding algorithms can be expressed in terms of field multiplications and additions.

Given the code parameters, the comparison among these algorithms is quite straightforward with the above expressions. As in Section 3.2, we attempt to compare the complexities using only R. Such a comparison is of course not accurate, but it sheds light on the comparative complexity of these decoding algorithms without getting entangled in the details. To this end, we make four assumptions. First, we treat the complexity bounds on the decoding algorithms as approximate decoding complexities. Second, we use the complexity bound in Theorem 1 as an approximate polynomial multiplication complexity. Third, since the numbers of multiplications and additions are of the same degree, we only compare the numbers of multiplications. Fourth, we focus on the difference of the second-highest-degree terms, since the highest-degree terms are the same for all three algorithms. This is because the partial GCD steps of Algorithms 1 and 2, as well as the key equation solver in syndrome-based decoding, differ only in the top level of the recursion of the FEEA. Hence, Algorithms 1 and 2 as well as the key equation solver in syndrome-based decoding have the same highest-degree term.

We first compare the complexities of Algorithms 1 and 2. Using Theorem 1, the difference between the second-highest-degree terms is given by (3/4)(25R − 13)n log² n, so Algorithm 1 is less efficient than Algorithm 2 when R > 0.52. Similarly, the complexity difference between syndrome-based decoding and Algorithm 1 is given by (3/4)(1 − 31R)n log² n. Thus, syndrome-based decoding is more efficient than Algorithm 1 when R > 0.032. Comparing syndrome-based decoding and Algorithm 2, the complexity difference is roughly −(9/2)(2 + R)n log² n. Hence, syndrome-based decoding is more efficient than Algorithm 2 regardless of the rate.
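The quoted thresholds follow from the signs of these difference terms; a quick check with exact rational arithmetic:

```python
from fractions import Fraction

def alg1_minus_alg2(R):
    """Difference of the second-highest-degree terms (coefficient of the
    n log^2 n term), Algorithm 1 minus Algorithm 2."""
    return Fraction(3, 4) * (25 * R - 13)

def synd_minus_alg1(R):
    """Syndrome-based decoding minus Algorithm 1."""
    return Fraction(3, 4) * (1 - 31 * R)

assert alg1_minus_alg2(Fraction(13, 25)) == 0   # crossover at R = 0.52
assert synd_minus_alg1(Fraction(1, 31)) == 0    # crossover at R = 1/31 ≈ 0.032
assert alg1_minus_alg2(Fraction(9, 10)) > 0     # Algorithm 2 wins at high rate
```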

We remark that the conclusion of the above comparison is similar to those obtained in Section 3.2, except that the thresholds are different. Based on fast implementations, Algorithm 1 is more efficient than Algorithm 2 for low-rate codes, and syndrome-based decoding is more efficient than Algorithms 1 and 2 in virtually all cases.



Table 5: Complexity of syndromeless decoding (entries are Mult. / Add. / Inv. / Overall)

(255, 223) RS code:
Step              Direct, Algorithm 1             Direct, Algorithm 2             Fast, Algorithm 1             Fast, Algorithm 2
Interpolation     64770 / 64770 / 0 / 1101090     64770 / 64770 / 0 / 1101090     586 / 6900 / 0 / 16276        586 / 6900 / 0 / 16276
Partial GCD       16448 / 8192 / 0 / 271360       2176 / 1056 / 0 / 35872         8224 / 8176 / 16 / 140016     1392 / 1328 / 16 / 23856
Message recovery  57536 / 4014 / 1 / 924606       69841 / 8160 / 1 / 1125632      3791 / 3568 / 1 / 64240       8160 / 7665 / 1 / 138241
Total             138754 / 76976 / 1 / 2297056    136787 / 73986 / 1 / 2262594    12601 / 18644 / 17 / 220532   10138 / 15893 / 17 / 178373

(511, 447) RS code:
Step              Direct, Algorithm 1             Direct, Algorithm 2             Fast, Algorithm 1             Fast, Algorithm 2
Interpolation     260610 / 260610 / 0 / 4951590   260610 / 260610 / 0 / 4951590   1014 / 23424 / 0 / 41676      1014 / 23424 / 0 / 41676
Partial GCD       65664 / 32768 / 0 / 1214720     8448 / 4160 / 0 / 156224        32832 / 32736 / 32 / 624288   5344 / 5216 / 32 / 101984
Message recovery  229760 / 15198 / 1 / 4150896    277921 / 31680 / 1 / 5034276    14751 / 14304 / 1 / 279840    31680 / 30689 / 1 / 600947
Total             556034 / 308576 / 1 / 10317206  546979 / 296450 / 1 / 10142090  48597 / 70464 / 33 / 945804   38038 / 59329 / 33 / 744607

Table 6: Complexity of syndrome-based decoding (entries are Mult. / Add. / Inv. / Overall)

(255, 223) RS code:
Step                  Direct implementation         Fast implementation
Syndrome computation  8128 / 8128 / 0 / 138176      149 / 4012 / 0 / 6396
Key equation solver   2176 / 1056 / 0 / 35872       1088 / 1040 / 16 / 18704
Chien search          3825 / 4080 / 0 / 65280       586 / 6900 / 0 / 16276
Forney's formula      512 / 496 / 16 / 8944         512 / 496 / 16 / 8944
Total                 14641 / 13760 / 16 / 248272   2335 / 12448 / 32 / 50320

(511, 447) RS code:
Step                  Direct implementation         Fast implementation
Syndrome computation  32640 / 32640 / 0 / 620160    345 / 16952 / 0 / 23162
Key equation solver   8448 / 4160 / 0 / 156224      4224 / 4128 / 32 / 80736
Chien search          15841 / 16352 / 0 / 301490    1014 / 23424 / 0 / 41676
Forney's formula      2048 / 2016 / 32 / 39456      2048 / 2016 / 32 / 39456
Total                 58977 / 55168 / 32 / 1117330  7631 / 46520 / 64 / 185030

5. CASE STUDY AND DISCUSSIONS

5.1. Case study

We examine the complexities of Algorithms 1 and 2 as well as syndrome-based decoding for the (255, 223) CCSDS RS code [25] and a (511, 447) RS code, which have roughly the same rate R = 0.87. Again, both direct and fast implementations are investigated. Due to the moderate lengths, in some cases direct implementation leads to lower complexity, and hence in such cases the complexity of direct implementation is used for both.

Tables 5 and 6 list the total decoding complexities of Algorithms 1 and 2 as well as syndrome-based decoding, respectively. In the fast implementations, cyclotomic FFT [16] is used for interpolation, syndrome computation, and the Chien search. The classical EEA with fast polynomial multiplication and division is used in the fast implementations since it is more efficient than the FEEA for these lengths. We assume a normal degree sequence, which represents the worst-case scenario [12]. The message recovery steps use long division in the fast implementations since it is more efficient than Newton iteration for these lengths. We use Horner's rule for Forney's formula in both direct and fast implementations.

We note that for each decoding step, Tables 5 and 6 not only provide the numbers of finite field multiplications, additions, and inversions, but also list the overall complexities to facilitate comparisons. The overall complexities are computed based on the assumptions that multiplication and inversion are of equal complexity and that, as in [15], one multiplication is equivalent to 2m additions. The latter assumption is justified by both hardware and software implementations of finite field operations. In hardware implementations, a multiplier over GF(2^m) generated by trinomials requires m² − 1 XOR and m² AND gates [26], while an adder requires m XOR gates. Assuming that XOR and AND gates have the same complexity, the complexity of a multiplier is about 2m times that of an adder over GF(2^m). In software implementations, the complexity can be measured by the number of word-level operations [27]. Using the shift-and-add method as in [27], a multiplication requires m − 1 shift and m XOR word-level operations, respectively, while an addition needs only one XOR word-level operation. Hence, in software implementations the complexity of a multiplication over GF(2^m) is also roughly 2m times that of an addition. Thus, the total complexity of each decoding step in Tables 5 and 6 is obtained by N = 2m(N_mult + N_inv) + N_add, which is in terms of field additions.
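The overall figures in Tables 5 and 6 can be reproduced from this formula; for instance, with m = 8 for the (255, 223) code:

```python
def overall(m, n_mult, n_add, n_inv):
    """Overall complexity in field additions: N = 2m(N_mult + N_inv) + N_add."""
    return 2 * m * (n_mult + n_inv) + n_add

# Fast syndrome-based decoding of the (255, 223) code (Table 6)
assert overall(8, 2335, 12448, 32) == 50320
# Fast implementation of Algorithm 2 (Table 5)
assert overall(8, 10138, 15893, 17) == 178373
```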



Comparisons between direct and fast implementations of each algorithm show that fast implementations considerably reduce the complexities of both syndromeless and syndrome-based decoding, as shown in Tables 5 and 6. The comparison between these tables shows that for these two high-rate codes, both direct and fast implementations of syndromeless decoding are not as efficient as their counterparts of syndrome-based decoding. This observation is consistent with our conclusions in Sections 3.2 and 4.4.

For these two codes, the hardware costs and throughput of decoder architectures based on direct implementations of syndrome-based and syndromeless decoding can be easily obtained by substituting the parameters into Tables 3 and 4; thus for these two codes, the conclusions in Section 3.3 apply.

5.2. Errors-and-erasures decoding

The complexity analysis of RS decoding in Sections 3 and 4 has assumed errors-only decoding. We extend our complexity analysis to errors-and-erasures decoding below.

Syndrome-based errors-and-erasures decoding has been well studied, and we adopt the approach in [18]. In this approach, the erasure locator polynomial and the modified syndrome polynomial are computed first. After the error locator polynomial is found by the key equation solver, the errata locator polynomial is computed and the error and erasure values are computed by Forney's formula. This approach is used in both direct and fast implementations.

Syndromeless errors-and-erasures decoding can be carried out in two approaches. Let us denote the number of erasures as ν (0 ≤ ν ≤ 2t); up to f = ⌊(2t − ν)/2⌋ errors can be corrected given ν erasures. As pointed out in [5, 6], the first approach is to ignore the ν erased coordinates, thereby transforming the problem into errors-only decoding of an (n − ν, k) shortened RS code. This approach is more suitable for direct implementation. The second approach is similar to the syndrome-based errors-and-erasures decoding described above, which uses the erasure locator polynomial [5]. In the second approach, only the partial GCD step is affected, while the same fast implementation techniques described in Section 4 can be used in the other steps. Thus, the second approach is more suitable for fast implementation.

We readily extend our complexity analysis for errors-only decoding in Sections 3 and 4 to errors-and-erasures decoding. Our conclusions for errors-and-erasures decoding are the same as those for errors-only decoding: Algorithm 1 is the most efficient only for very low-rate codes; syndrome-based decoding is the most efficient algorithm for high-rate codes. For brevity, we omit the details; interested readers will have no difficulty filling them in.

6. CONCLUSION

We analyze the computational complexities of two syndromeless decoding algorithms for RS codes using both direct implementation and fast implementation, and compare them with their counterparts of syndrome-based decoding. With either direct or fast implementation, the syndromeless algorithms are more efficient than the syndrome-based algorithms only for RS codes with very low rate. When implemented in hardware, syndrome-based decoders also have lower complexity and higher throughput. Since RS codes in practice are usually high-rate codes, syndromeless decoding algorithms are not suitable for these codes. Our case study also shows that fast implementations can significantly reduce the decoding complexity. Errors-and-erasures decoding is also investigated, although the details are omitted for brevity.

ACKNOWLEDGMENTS

This work was supported in part by Thales Communications Inc. and in part by a grant from the Commonwealth of Pennsylvania, Department of Community and Economic Development, through the Pennsylvania Infrastructure Technology Alliance (PITA). The authors are grateful to Dr. Jürgen Gerhard for valuable discussions. The authors would also like to thank the reviewers for their constructive comments, which have resulted in significant improvements in the manuscript. The material in this paper was presented in part at the IEEE Workshop on Signal Processing Systems, Shanghai, China, October 2007.

REFERENCES

[1] S. B. Wicker and V. K. Bhargava, Eds., Reed–Solomon Codes and Their Applications, IEEE Press, New York, NY, USA, 1994.

[2] E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, New York, NY, USA, 1968.

[3] Y. Sugiyama, M. Kasahara, S. Hirasawa, and T. Namekawa, “A method for solving key equation for decoding Goppa codes,” Information and Control, vol. 27, no. 1, pp. 87–99, 1975.

[4] A. Shiozaki, “Decoding of redundant residue polynomial codes using Euclid’s algorithm,” IEEE Transactions on Information Theory, vol. 34, no. 5, part 1, pp. 1351–1354, 1988.

[5] A. Shiozaki, T. K. Truong, K. M. Cheung, and I. S. Reed, “Fast transform decoding of nonsystematic Reed–Solomon codes,” IEE Proceedings: Computers and Digital Techniques, vol. 137, no. 2, pp. 139–143, 1990.

[6] S. Gao, “A new algorithm for decoding Reed–Solomon codes,” in Communications, Information and Network Security, V. K. Bhargava, H. V. Poor, V. Tarokh, and S. Yoon, Eds., pp. 55–68, Kluwer Academic Publishers, Norwell, Mass, USA, 2003.

[7] S. V. Fedorenko, “A simple algorithm for decoding Reed–Solomon codes and its relation to the Welch–Berlekamp algorithm,” IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 1196–1198, 2005.

[8] S. V. Fedorenko, “Correction to ‘A simple algorithm for decoding Reed–Solomon codes and its relation to the Welch–Berlekamp algorithm’,” IEEE Transactions on Information Theory, vol. 52, no. 3, p. 1278, 2006.

[9] L. R. Welch and E. R. Berlekamp, “Error correction for algebraic block codes,” US patent 4633470, September 1983.

[10] J. Justesen, “On the complexity of decoding Reed–Solomon codes,” IEEE Transactions on Information Theory, vol. 22, no. 2, pp. 237–238, 1976.

[11] J. von zur Gathen and J. Gerhard, “Arithmetic and factorization of polynomials over F2,” Tech. Rep. tr-rsfb-96-018, University of Paderborn, Paderborn, Germany, 1996, http://www-math.uni-paderborn.de/∼aggathen/Publications/gatger96a.ps.

[12] J. von zur Gathen and J. Gerhard, Modern Computer Algebra, Cambridge University Press, Cambridge, UK, 2nd edition, 2003.

[13] D. G. Cantor, “On arithmetical algorithms over finite fields,” Journal of Combinatorial Theory, Series A, vol. 50, no. 2, pp. 285–300, 1989.

[14] S. Khodadad, Fast rational function reconstruction, M.S. thesis, Simon Fraser University, Burnaby, BC, Canada, 2005.

[15] Y. Wang and X. Zhu, “A fast algorithm for the Fourier transform over finite fields and its VLSI implementation,” IEEE Journal on Selected Areas in Communications, vol. 6, no. 3, pp. 572–577, 1988.

[16] N. Chen and Z. Yan, “Reduced-complexity cyclotomic FFT and its application in Reed–Solomon decoding,” in Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS ’07), pp. 657–662, Shanghai, China, October 2007.

[17] S. Khodadad and M. Monagan, “Fast rational function reconstruction,” in Proceedings of the International Symposium on Symbolic and Algebraic Computation (ISSAC ’06), pp. 184–190, ACM Press, Genoa, Italy, July 2006.

[18] T. K. Moon, Error Correction Coding: Mathematical Methods and Algorithms, John Wiley & Sons, Hoboken, NJ, USA, 2005.

[19] J. J. Komo and L. L. Joiner, “Adaptive Reed–Solomon decoding using Gao’s algorithm,” in Proceedings of the IEEE Military Communications Conference (MILCOM ’02), vol. 2, pp. 1340–1343, Anaheim, Calif, USA, October 2002.

[20] E. Berlekamp, G. Seroussi, and P. Tong, “A hypersystolic Reed–Solomon decoder,” in Reed–Solomon Codes and Their Applications, S. B. Wicker and V. K. Bhargava, Eds., pp. 205–241, IEEE Press, New York, NY, USA, 1994.

[21] D. Mandelbaum, “On decoding of Reed–Solomon codes,” IEEE Transactions on Information Theory, vol. 17, no. 6, pp. 707–712, 1971.

[22] A. E. Heydtmann and J. M. Jensen, “On the equivalence of the Berlekamp–Massey and the Euclidean algorithms for decoding,” IEEE Transactions on Information Theory, vol. 46, no. 7, pp. 2614–2624, 2000.

[23] Z. Yan and D. V. Sarwate, “New systolic architectures for inversion and division in GF(2^m),” IEEE Transactions on Computers, vol. 52, no. 11, pp. 1514–1519, 2003.

[24] T. Park, “Design of the (248, 216) Reed–Solomon decoder with erasure correction for Blu-ray disc,” IEEE Transactions on Consumer Electronics, vol. 51, no. 3, pp. 872–878, 2005.

[25] “Telemetry Channel Coding,” CCSDS Std. 101.0-B-6, October 2002.

[26] B. Sunar and Ç. K. Koç, “Mastrovito multiplier for all trinomials,” IEEE Transactions on Computers, vol. 48, no. 5, pp. 522–527, 1999.

[27] A. Mahboob and N. Ikram, “Lookup table based multiplication technique for GF(2^m) with cryptographic significance,” IEE Proceedings: Communications, vol. 152, no. 6, pp. 965–974, 2005.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 473613, 10 pages
doi:10.1155/2008/473613

Research Article

Efficient Decoding of Turbo Codes with Nonbinary Belief Propagation

Charly Poulliat,1 David Declercq,1 and Thierry Lestable2

1 ETIS Laboratory, UMR 8051-ENSEA/UCP/CNRS, Cergy-Pontoise 95014, France
2 Samsung Electronics Research Institute, Communications House, South Street, Staines, Middlesex TW18 4QE, UK

Correspondence should be addressed to Charly Poulliat, [email protected]

Received 31 October 2007; Revised 25 February 2008; Accepted 27 March 2008

Recommended by Branka Vucetic

This paper presents a new approach to decode turbo codes using a nonbinary belief propagation decoder. The proposed approach can be decomposed into two main steps. First, a nonbinary Tanner graph representation of the turbo code is derived by clustering the binary parity-check matrix of the turbo code. Then, a group belief propagation decoder runs several iterations on the obtained nonbinary Tanner graph. We show in particular that it is necessary to add a preprocessing step on the parity-check matrix of the turbo code in order to ensure good topological properties of the Tanner graph, and therefore good iterative decoding performance. Finally, by capitalizing on the diversity which comes from the existence of distinct efficient preprocessings, we propose a new decoding strategy, called decoder diversity, that intends to take benefit from this diversity through collaborative decoding schemes.

Copyright © 2008 Charly Poulliat et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Turbo codes and low-density parity-check (LDPC) codes have long been recognized to belong to the family of modern error-correcting codes. Although often opponents in standards and applications, these two classes of codes share common properties, the most important one being that they have a sparse graph representation that allows them to be decoded efficiently and iteratively, using the maximum a posteriori (MAP) algorithm [1] for turbo codes or the belief propagation (BP) algorithm for LDPC codes [2], as well as their low-complexity iterative decoders.

Moreover, LDPC and turbo codes are two coding candidates which are often options within the same system [3, 4]. It is thus interesting to investigate a common architecture/algorithm at the receiver side to enable switching easily among them, whilst still maintaining reasonable cost and area size.

Even if turbo codes effectively exhibit a sparse factor graph representation for which the BP decoder is equivalent to the so-called turbo decoder [5, 6], this factor graph representation is composed of different types of nodes, both for variable and for function nodes, which are not reduced to parity-check constraints (see [5] for more details). Later, some researchers have tried to use a factor graph representation of the turbo code based only on parity-check equations [7]. We will refer to a factor graph with only parity-check constraints for the function nodes (binary or not) as a Tanner graph in the rest of the paper [8].

The classical BP algorithm (sometimes called sum-product) on the Tanner graph of a turbo code does not perform sufficiently well to compete with the turbo decoder performance [7]. This is mainly due to the inherent presence of many short cycles of length 4, which lead to a poor convergence behavior inducing a loss of performance. In order to solve the problem of these short cycles, the authors of [9, 10] propose to use special convolutional codes, called low-density convolutional codes, as components of the turbo code; an iterative decoder based on their Tanner graph experiences less statistical dependence and therefore exhibits very good performance.

Our approach is different from [10], since we aim at having a generic BP decoder which performs close to the best performance, without imposing any constraint on the component code. In this paper, we present a new approach to decode parallel turbo codes (i.e., binary, duobinary, punctured or not, etc.) using a nonbinary belief propagation decoder. The generic structure of the proposed iterative decoder is illustrated in Figure 1. The general approach can be decomposed into two main steps: the first step consists in building a nonbinary Tanner graph of the turbo code using only parity-check nodes defined over a certain finite group, and symbol nodes representing groups of bits. The Tanner graph is obtained by a proper clustering of order p of the binary parity-check matrix of the turbo code, called the "binary image." However, the clustering of the commonly used binary representation of turbo codes appears not to be suitable for building a nonbinary Tanner graph representation that leads to good performance under iterative decoding. Thus, we will show in the paper that there exist suitable preprocessing functions of the parity-check matrix (first block of Figure 1) for which, after the bit clustering (second block of Figure 1), the corresponding nonbinary Tanner graphs have good topological properties. This preliminary two-round step is necessary to obtain good Tanner graph representations that outperform the classical representations of turbo codes under iterative decoding. The second step is a BP-based decoding stage (last block in Figure 1) and consists in running several iterations of group belief propagation (group BP), as introduced in [11], on the nonbinary Tanner graph. Furthermore, we will show that the decoder can also fully benefit from the decoding diversity that inherently arises from concurrent extended Tanner graph representations, leading to the general concept of decoder diversity. The proposed algorithms show very good performance, as opposed to the binary BP decoder, and serve as a first step toward viewing LDPC and turbo codes within a unified framework from the decoder point of view, which strengthens the idea of handling them with a common approach.

Figure 1: Block representation of the generic turbo decoder based on a group BP decoder. The code parameters (or parity matrix H) feed the preprocessing and clustering blocks; the group BP decoder then operates on the channel likelihoods.

The remainder of the paper is organized as follows. In Section 2, we describe how to decode turbo codes with a group BP decoder. To this end, we review how to derive the binary representation of the parity-check matrix Htc of a parallel turbo code. Then, we explain how to build the nonbinary Tanner graph of a turbo code based on a clustering technique and describe the group BP decoding algorithm based on this representation. In Section 3, we discuss how to choose a posteriori good matrix representations and how to take advantage of the inherent diversity offered by concurrent preprocessings in the decoding process. To this end, we present some choices for the required preprocessing of the matrix Htc before clustering, in order to build a Tanner graph with good topological properties that performs well under group BP decoding. Then, we introduce in Section 4 the concept of decoder diversity and show how it can be used to further enhance performance. Finally, conclusions and perspectives are drawn in Section 5.

2. DECODING A TURBO CODE AS A NONBINARY LDPC CODE

In this section, we present the different key elements that enable decoding turbo codes as nonbinary LDPC codes defined over some extended binary groups. First, we briefly review how to derive the binary representation of the parity-check matrix Htc of a parallel turbo code based on the parity-check matrix of a component code. Then, we explain how to build the nonbinary Tanner graph of a turbo code based on a clustering technique and describe how the group BP decoding algorithm can be used to efficiently decode turbo codes based on this extended representation.

2.1. Binary parity-check matrix of a turbo code

The first step in our approach consists in deriving a binary parity-check matrix representation of the turbo code. In this paper, we will only focus on parallel turbo codes with identical component codes.

2.1.1. Parity-check matrix of convolutional codes

The binary image of the turbo code is essentially based on the binary representation of the parity-check matrices of its component codes. Following the derivations presented in [12], the parity-check matrix for both feedforward convolutional encoders and their equivalent recursive systematic form is generally derived using the Smith decomposition of the polynomial generator matrix G(D), where G(D) is a k × n matrix that gives the transfer of the k inputs into the n outputs of the convolutional encoder and D is defined as the delay operator (please refer to [12] for more details about this decomposition). From this decomposition, the polynomial syndrome former matrix H^T(D) [12], of dimensions n × (n − k), can be derived and expanded as

H^T(D) = H_0^T + H_1^T D + ··· + H_ms^T D^ms,   (1)

where H_i^T, 0 ≤ i ≤ ms, is a matrix with dimensions n × (n − k), and ms is the maximum degree of the polynomials in H^T(D). For both feedforward convolutional encoders and their recursive systematic form, it is possible to derive the binary image from the semi-infinite matrix H^T given by

        ⎛ H_0^T  H_1^T  ···  H_ms^T                     ⎞
H^T  =  ⎜        H_0^T  H_1^T  ···  H_ms^T              ⎟ .   (2)
        ⎝                  ⋱                   ⋱        ⎠

When direct truncation is used, it is possible to derive from H^T the finite-length binary parity-check matrix of dimension (N − K) × N, given by

      ⎛ H_0                    ⎞
      ⎜ H_1   H_0              ⎟
      ⎜  ⋮      ⋱              ⎟
H  =  ⎜ H_ms  ···  H_0         ⎟ ,   (3)
      ⎜        ⋱        ⋱      ⎟
      ⎝        H_ms  ···  H_0  ⎠

where N and K are the codeword and information block lengths, respectively.
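The banded construction of (3) is mechanical enough to sketch in code. The following Python sketch (illustrative, not the paper's software) stacks shifted copies of the blocks H_0, ..., H_ms under direct truncation; the 1 × 2 example blocks at the end are hypothetical, chosen only to show the banded shape for a rate-1/2 component with ms = 1.

```python
# Illustrative sketch: build the direct-truncation parity-check matrix of
# eq. (3) from the blocks H_0, ..., H_ms.  Each block is an (n - k) x n
# 0/1 matrix given as a list of rows.

def truncated_parity_check(blocks, n_sections):
    """Place block H_i at block-row t, block-column t - i, for every
    trellis section t; terms with t - i < 0 are dropped (truncation)."""
    rb = len(blocks[0])       # rows per block: n - k
    cb = len(blocks[0][0])    # columns per block: n
    H = [[0] * (cb * n_sections) for _ in range(rb * n_sections)]
    for t in range(n_sections):
        for i, Hi in enumerate(blocks):
            if t - i < 0:
                continue
            for r in range(rb):
                for c in range(cb):
                    H[t * rb + r][(t - i) * cb + c] = Hi[r][c]
    return H

# Hypothetical rate-1/2 blocks with ms = 1, for illustration only:
H0 = [[1, 1]]
H1 = [[0, 1]]
H = truncated_parity_check([H0, H1], n_sections=4)
```

Each row after the first is the previous band shifted right by n columns, which is exactly the staircase visible in (3).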

Under some length restrictions for the recursive case [13, 14], it is also possible to derive the binary image of the parity-check matrix of the tail-biting code Htb from the parity-check matrix H [15], for feedforward convolutional encoders and their recursive systematic form. This can finally be represented as follows using the so-called "wrap around" technique:

        ⎛ H_0                H_ms  ···  H_1  ⎞
        ⎜ H_1   H_0                 ⋱    ⋮   ⎟
        ⎜  ⋮      ⋱                     H_ms ⎟
Htb  =  ⎜ H_ms  ···  H_0                     ⎟ .   (4)
        ⎜        ⋱        ⋱                  ⎟
        ⎝        H_ms  H_ms−1  ···  H_0      ⎠

Note that, in each case, both systematic and nonsystematic encoders give the same codewords and thus share the same parity-check matrix [12, 16].

2.1.2. Parity-check matrix of turbo codes

For recursive systematic convolutional codes of rate k/(k + 1), which mainly compose the classical turbo codes in the standards, the matrix H^T(D) is simply given by [12]

            ⎛ h_1^T(D)     ⎞
            ⎜ h_2^T(D)     ⎟
H^T(D)  =   ⎜     ⋮        ⎟ ,   (5)
            ⎝ h_{k+1}^T(D) ⎠

where h_i^T(D), 1 ≤ i ≤ k, are in fact the feedforward polynomials and h_{k+1}^T(D) is the feedback polynomial defining the recursive systematic convolutional code. Then, for this kind of component codes, the binary parity-check matrix can simply be derived using (2)–(4).

As the recursive component codes of turbo codes are systematic, the columns of the associated parity-check matrix H, with dimension (N − K) × N, can be assigned to information bits and to redundancy bits. Note that when using the preceding expressions of H, the output bits of the convolutional encoder are supposed to be ordered alternately within the codeword. After some column permutations, we can rewrite H as H = [Hi Hr], where Hi and Hr contain the columns of H relative to information and redundancy bits, respectively. Using this notation, we can easily derive the parity-check matrix of a turbo code as follows for the case of two component codes in parallel [17, 18]:

        ⎛ Hi      Hr   0  ⎞
Htc  =  ⎝ Hi·Π^T  0    Hr ⎠ ,   (6)

where Π^T is the transpose of the interleaver permutation matrix at the input of the second component encoder. In that case, Htc has dimensions 2(N − K) × (2N − K). Of course, this technique can be easily generalized to more than two components.
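As a sketch of (6), the following illustrative Python function (not the authors' software) assembles Htc from Hi, Hr, and an interleaver given as an index list `pi`. The convention that Hi·Π^T selects column pi[j] of Hi is an assumption of this sketch, as are the toy inputs.

```python
# Illustrative sketch of eq. (6): stack [Hi Hr 0] over [Hi·Pi^T 0 Hr]
# for a two-component parallel turbo code.  All matrices are lists of
# 0/1 rows; pi is the interleaver as an index list (assumed convention).

def turbo_parity_check(Hi, Hr, pi):
    zeros = [[0] * len(Hr[0]) for _ in range(len(Hi))]
    # Permute the information columns of Hi according to pi:
    Hi_perm = [[row[pi[j]] for j in range(len(pi))] for row in Hi]
    top = [hi + hr + z for hi, hr, z in zip(Hi, Hr, zeros)]
    bottom = [hp + z + hr for hp, z, hr in zip(Hi_perm, zeros, Hr)]
    return top + bottom

# Hypothetical component matrices with K = 2 and N - K = 2:
Htc = turbo_parity_check(Hi=[[1, 0], [0, 1]],
                         Hr=[[1, 0], [1, 1]],
                         pi=[1, 0])
# Htc has 2(N - K) = 4 rows and 2N - K = 6 columns, matching the text.
```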

2.1.3. Example

To illustrate this section, we consider an R = 1/3 turbo code with two rate one-half component codes with parameters given in octal by (1, 23/35). Under direct truncation, the parity-check matrices of a component code and of a corresponding turbo code are, respectively, given by the matrices H and Htc as follows:

     ⎛ 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎞
     ⎜ 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
     ⎜ 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 ⎟
H  = ⎜ 1 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 ⎟ ,
     ⎜ 1 1 1 0 0 1 0 1 1 1 0 0 0 0 0 0 ⎟
     ⎜ 0 0 1 1 1 0 0 1 0 1 1 1 0 0 0 0 ⎟
     ⎜ 0 0 0 0 1 1 1 0 0 1 0 1 1 1 0 0 ⎟
     ⎝ 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 1 ⎠

       ⎛ 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎞
       ⎜ 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
       ⎜ 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
       ⎜ 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
       ⎜ 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 ⎟
       ⎜ 0 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 ⎟
       ⎜ 0 0 1 1 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 ⎟
Htc  = ⎜ 0 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 ⎟ .   (7)
       ⎜ 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ⎟
       ⎜ 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 ⎟
       ⎜ 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 ⎟
       ⎜ 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 ⎟
       ⎜ 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 ⎟
       ⎜ 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 ⎟
       ⎜ 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 ⎟
       ⎝ 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 ⎠

2.2. Clustering and preprocessing

Once the parity-check matrix H of a turbo code has been derived, we obtain a nonbinary Tanner graph by applying a clustering technique, which is essentially the same as the one described in [11].

The matrix H is decomposed in groups of p rows and p columns. Each group of p rows represents a generalized parity-check node in the Tanner graph, defined in the finite group G(2^p), and each group of columns represents a symbol node, built from the concatenation of p bits (p-tuples) defining elements in G(2^p).

A cluster is then defined as a p × p submatrix h_ij of H, and each time a cluster contains nonzero values (ones in this case), an edge connecting the corresponding group of rows and group of columns is created in the Tanner graph. To each nonzero cluster is associated a linear function f_ij(·) from G(2^p) to G(2^p) which has h_ij as matrix representation. Using this notation, the ith generalized parity-check equation defined over G(2^p) can be written as

Σ_j f_ij(c_j) ≡ 0,   (8)

where c_j is the jth coordinate of a codeword having symbols defined in G(2^p).
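The clustering step can be made concrete with a short sketch. The Python below (illustrative only, with a hypothetical toy matrix) cuts a binary matrix into p × p clusters and classifies each over GF(2) into the three cases that matter for the Tanner graph: zero (no edge), full rank (f_ij is a permutation of G(2^p)), and deficient rank (a projection).

```python
# Illustrative sketch: cluster a binary parity-check matrix into p x p
# submatrices h_ij and classify each cluster over GF(2).

def gf2_rank(bitrows):
    """Rank over GF(2) of rows given as integer bitmasks."""
    rank = 0
    rows = list(bitrows)
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot        # lowest set bit of the pivot row
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def classify_clusters(H, p):
    stats = {"zero": 0, "full": 0, "deficient": 0}
    for I in range(0, len(H), p):
        for J in range(0, len(H[0]), p):
            sub = [row[J:J + p] for row in H[I:I + p]]
            r = gf2_rank([int("".join(map(str, row)), 2) for row in sub])
            if r == 0:
                stats["zero"] += 1      # no edge in the Tanner graph
            elif r == p:
                stats["full"] += 1      # permutation function
            else:
                stats["deficient"] += 1 # projection
    return stats

# Hypothetical 4 x 4 matrix, clustered with p = 2:
stats = classify_clusters([[1, 1, 0, 0],
                           [0, 1, 0, 0],
                           [1, 0, 1, 1],
                           [0, 0, 0, 1]], p=2)
```

The same counting, run on a full turbo-code matrix, would yield statistics of the kind reported later in Table 1.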

To illustrate the impact of clustering on the Tanner graph representation, and to give some insight that motivates extending the representation from the binary domain to a nonbinary one, we consider as a simple example the clustering of the recursive systematic convolutional code with polynomial representation given in octal by (1, 5/7). We assume that 12 information bits have been sent using direct truncation. Then, a 4 × 4 clustering is applied to the binary parity-check matrix. Using the representation of (3), the resulting clustered matrix is given by

     ⎛ 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎞
     ⎜ 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
     ⎜ 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
     ⎜ 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
     ⎜ 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ⎟
H  = ⎜ 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 ⎟ .   (9)
     ⎜ 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 ⎟
     ⎜ 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 ⎟
     ⎜ 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 ⎟
     ⎜ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 ⎟
     ⎜ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 ⎟
     ⎝ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 ⎠

We are now able to associate a nonbinary Tanner graph representation of H, with generalized parity-check constraints now applying to 4-tuple binary vectors. The Tanner graph corresponding to our example is finally given in Figure 2(b), and it is compared with the Tanner graph associated with the binary image defined by H (Figure 2(a)).

Figure 2: Comparison of different Tanner graph representations of the recursive systematic convolutional code with bit clustering of order p = 4 versus those of the binary image defined by H.

Through this example, we can see that, for convolutional codes, when using the representation given in (3), we can still ensure a sparse graph condition and even reach a tree representation when increasing the order of the representation. In fact, for rate one-half codes, it has been observed that there exists a minimum value of p for which we can obtain a tree. This implies that a BP-like decoder will lead to maximum a posteriori symbol decoding and, in that case, it has been verified that BP and the MAP have the same performance. Unfortunately, this tree condition does not hold anymore when we use the alternative representation H of the parity-check matrix of a convolutional code as used in turbo code parity-check matrices, as can be seen in the Tanner graph representation of our previous example in Figure 2(c). This representation introduces cycles even in the extended representation of the convolutional code using bit clustering and, as a result, in the extended representation of the turbo codes. Moreover, when tail biting is used, there is no possibility to ensure a tree condition, due to the nonzero elements in the right-hand corner of the tail-biting parity-check matrix of the component code. Thus, a remaining issue is how to derive a "good" extended Tanner graph representation. To this end, we will present in Section 3 how to overcome these problems and ensure fair performance under BP decoding by applying an efficient preprocessing of the parity-check matrix of the turbo code.

2.3. Nonbinary group belief propagation decoding

The Tanner graph obtained by preprocessing and clustering the binary image does not correspond to a usual code defined over a finite field GF(q = 2^p), but can be defined on a finite group G(2^p) of the same order (see [11] for more details). We will refer to the belief propagation decoder on group codes as the group BP decoder. The group BP decoder is very similar in nature to regular BP in finite fields. The only difference is that the nonzero values of a parity-check equation are replaced with more general linear functions from G(2^p) to G(2^p), defined by the binary matrices which form the clusters. In particular, it is shown in [11] that group BP can be implemented in the Fourier domain with a reasonable decoding complexity.

Figure 3: Tanner graph of a nonbinary LDPC code defined over a finite group G(q), showing the information symbols c_i, the interleaver Π, the linear function nodes, the parity-check nodes, and the messages of size q (U_vp, U_pc, V_cp, V_pv) exchanged between them.

We briefly review the main steps of the group BP decoder and its application to the nonbinary Tanner graph of a turbo code. The modified Tanner graph of an LDPC code over a finite group is depicted in Figure 3, in which we indicate the notations we use for the vector messages. In addition to the classical variable and check nodes, we add function nodes to represent the effect of the linear transformations deduced from the clusters, as explained in the previous section.

The group BP decoder has four main steps, which use q = 2^p-dimensional probability messages.

(i) Data node update: the output extrinsic message is obtained from the term-by-term product of all input messages, including the channel-likelihood message, except the one carried on the same branch of the Tanner graph.

(ii) Function node update: the messages are updated through the function nodes f_ij(·). This message update reduces to a cyclic permutation in the case of a finite field code; in the case of a more general linear function from G(2^p) to G(2^p), denoted β = f_ij(α), the update operation is

U_pc[β_j] = Σ_i U_vp[α_i],   j = 0, ..., q − 1,   β_j = f_ij(α_i),   (10)

where the sum runs over all α_i whose image under f_ij is β_j.

(iii) Check node update: this step is identical to the BP decoder over finite fields and can be efficiently implemented using a fast Fourier transform; see, for example, [11, 19] for more details.

(iv) Inverse function node update: with the use of the function f_ij(·) backwards, that is, by identifying the values α_i which have the same image β_j, the update equation is

V_pv[α_i] = V_cp[β_j]   ∀α_i : β_j = f_ij(α_i).   (11)

These four steps define one decoding iteration of a general parity-check code on a finite group, which is the case for a clustered convolutional or turbo code as described previously. Note that the function node update is simply a reordering of the values, both in the finite field case and when the cluster defining the function f_ij(·) is full rank. When the cluster has deficient rank r < p, which is often the case when clustering a turbo code, only 2^r entries of the message U_pc are filled and the remaining entries are set to zero.
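The two function-node updates (10) and (11) amount to pushing a length-q message through the map β = f_ij(α) and reading it back. A minimal Python sketch (illustrative, not the paper's implementation), assuming symbols are p-bit integers and the cluster acts by matrix-vector multiplication over GF(2); the rank-deficient toy cluster is hypothetical:

```python
# Illustrative sketch of the function-node updates (10) and (11).
# Symbols alpha, beta in G(2^p) are p-bit integers; bit c of alpha is
# its c-th binary coordinate.  h is the p x p cluster (0/1 entries).

def apply_cluster(h, alpha, p):
    """beta = f(alpha): multiply the bit vector of alpha by h over GF(2)."""
    beta = 0
    for r in range(p):
        bit = 0
        for c in range(p):
            bit ^= h[r][c] & (alpha >> c) & 1
        beta |= bit << r
    return beta

def function_node_forward(U_vp, h, p):
    """Eq. (10): U_pc[beta] = sum of U_vp[alpha] over alpha with f(alpha) = beta.
    Rank-deficient clusters leave some entries of U_pc at zero."""
    U_pc = [0.0] * (1 << p)
    for alpha in range(1 << p):
        U_pc[apply_cluster(h, alpha, p)] += U_vp[alpha]
    return U_pc

def function_node_backward(V_cp, h, p):
    """Eq. (11): V_pv[alpha] = V_cp[f(alpha)] for every alpha."""
    return [V_cp[apply_cluster(h, alpha, p)] for alpha in range(1 << p)]

# Hypothetical rank-1 cluster (r = 1 < p = 2): only 2^r = 2 entries of
# the forward message get filled, as described in the text.
h_def = [[1, 0], [0, 0]]
U_pc = function_node_forward([1.0, 2.0, 3.0, 4.0], h_def, 2)
```

For a full-rank cluster the forward step is a pure permutation of the message entries, matching the remark above.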

Note that we do not discuss decoding complexity issues in this paper, but rather focus on the feasibility of decoding with a BP decoder. Of course, a nonbinary BP decoder is naturally much more computationally intensive than a binary BP or a turbo decoder. However, reduced-complexity nonbinary decoders have recently been proposed, which exhibit a good complexity/performance tradeoff even compared to binary decoders [20]. The reduced-complexity decoder can be easily adapted to codes on finite groups, since the function node update is no more complex in the group case than in the field case.

3. COMPARISON OF BINARY IMAGES OBTAINED WITH DIFFERENT PREPROCESSINGS

In this section, we discuss some relevant issues related to the improvement of performance when the group BP decoder is used. We show in particular that some preprocessing functions can lead to interesting Tanner graph topologies and good performance under iterative decoding.

3.1. Selection of preprocessing for an efficient sparse graph representation

It should be noted that the performance of the group BP decoder depends highly on the structure of the nonbinary Tanner graph. In our framework, it is possible to apply some specific transformations to the binary image H before the clustering operation, so that the Tanner graph has desirable properties. Indeed, any row linear transformation A and column permutation Π applied to H do not change the code space, but do change the topology of the clustered Tanner graph. Let us denote by H′ = Pc(H) = A·H·Π the preprocessed binary parity-check matrix. We propose in this paper two preprocessing techniques that we found attractive in terms of Tanner graph properties; they are described below and depicted in Figure 4.

(Pc1)

This preprocessing is defined by alternating the information bits and the redundancy bits of the first convolutional code of the parallel turbo code. With this technique we obtain two parts in the parity-check matrix, each of which has an upper triangular form with a diagonal (or near diagonal for the rectangular part of H′), therefore reducing the number of nonzero clusters in the nonbinary Tanner graph deduced from H′. Note that a second preprocessing of this type can be considered by alternating the information bits and the redundancy bits of the second convolutional encoder.

(Pc2)

This preprocessing is obtained by column permutations with the aim of having the most concentrated diagonal in the parity-check matrix, that is, minimizing the number of clusters that will be created on the diagonal. This is expected to be a good choice, since the clusters on the diagonal are the densest in the Tanner graph and are assumed to contribute the most to the performance degradation of the BP decoder when they participate in cycles. Indeed, we have verified by simulations on several turbo codes that, on the diagonal, the preprocessing Pc2 produces fewer nonzero clusters of a given size than the preprocessing Pc1. Note that by properly choosing the columns to be permuted, several images of this type could be created.

Note that the two proposed preprocessing techniques are restricted to column permutations, that is, to the special case A = Id, where Id is the identity transformation. This case is the simplest one: the transformation keeps the binary Tanner graph of the code unchanged, but the nonbinary clustered Tanner graph is modified after preprocessing. We will show through simulations that this has an important impact on the decoder performance. Although Figure 4 plots examples of rate R = 1/3 turbo codes, the exact same preprocessing strategies can be applied to any type of turbo code, that is, to any rate and to punctured and/or multibinary turbo codes.
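Since A = Id here, a preprocessing is just a column permutation, and its quality can be judged by how many nonzero clusters survive. The following Python sketch (illustrative only; the toy matrix and permutation are hypothetical) applies a permutation and counts nonzero p × p clusters before and after:

```python
# Illustrative sketch of preprocessing restricted to A = Id:
# H' = H * Pi is a pure column permutation, and we count how many
# p x p clusters of the result are nonzero (i.e., become Tanner-graph
# edges).  Toy matrix and permutation are hypothetical.

def permute_columns(H, perm):
    """Column j of H' is column perm[j] of H."""
    return [[row[perm[j]] for j in range(len(perm))] for row in H]

def nonzero_clusters(H, p):
    count = 0
    for I in range(0, len(H), p):
        for J in range(0, len(H[0]), p):
            if any(H[i][j] for i in range(I, I + p)
                           for j in range(J, J + p)):
                count += 1
    return count

H = [[1, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0],
     [0, 1, 1, 0]]
perm = [0, 3, 1, 2]   # gathers the scattered ones into two clusters
before = nonzero_clusters(H, 2)
after = nonzero_clusters(permute_columns(H, perm), 2)
```

On this toy matrix the permutation halves the number of nonzero clusters, which is precisely the effect the Pc2-type preprocessings aim for on real turbo-code matrices.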

3.2. Simulation results with different preprocessings

In this section, we apply the different preprocessing techniques presented in the previous section to duobinary turbo codes with rate R = 0.5 and sizes N = {848, 3008} coded bits, taken from the DVB-RCS standard [21, 22]. The frame sizes we used correspond to ATM and MPEG frame sizes with K = {53, 188} information bytes, respectively. Note that these turbo codes have sizes which are not particularly well suited to clustering. A size of N = 864 would have been preferable for cluster size p = 8, to ensure a proper clustering of each part of the turbo code parity-check matrix corresponding to each component code, but we wanted to

Figure 4: Three different binary representations of the same rate R = 1/3 turbo code TC(23, 35). The first one is the natural representation (see (6)), with column blocks for the information bits, the two redundancy parts, and the interleaved information Π(info); the second one corresponds to the preprocessing Pc1 (alternating bits for the first component code); and the third one to the preprocessing Pc2.



Figure 5: Performance of the group BP decoding algorithm based on different preprocessing functions for the (R = 0.5, N = 848) duobinary turbo code, and comparison with the turbo decoder as used in the DVB standard. (Curves: turbo decoder; group BP with no preprocessing, with Pc1, and with Pc2; binary BP. Axes: frame error rate versus Eb/N0 in dB.)

keep the original interleaver and frame sizes from the standard [21, 22]. These codes have been terminated using tail biting, and their minimum distances are dmin = {18, 19}. For both duobinary turbo codes, H^T(D) is given by

            ⎛ 1 + D^2 + D^3     ⎞
H^T(D)  =   ⎜ 1 + D + D^2 + D^3 ⎟ .   (12)
            ⎝ 1 + D + D^3       ⎠

In the following, we consider the additive white Gaussian noise (AWGN) channel for our simulations. For this channel, we compare the group decoder performance with various preprocessing functions, a clustering size of p = 8, and a floating-point implementation of the group BP decoder using shuffle scheduling [23]. As a reference, we simulated the turbo decoder based on MAP component decoders in floating-point precision, in order to have the best results that one can obtain with a turbo decoding strategy.

The curves plotted in Figure 5 are related to the R = 1/2 turbo code with parameter N = 848 and correspond to the natural representation of the code and to two preprocessings (one of type Pc1 and one of type Pc2).

In order to illustrate that the preprocessing has an influence on the nonbinary factor graph, we have counted the number of nonzero clusters, and also the number of full-rank clusters, for the two matrices tested in this section and for the two types of preprocessings Pc1 and Pc2. The statistics are reported in Table 1. Remember that a nonzero cluster corresponds to an edge in the Tanner graph, and that a full-rank cluster corresponds to a permutation function, while a rank-deficient cluster corresponds to a projection. We can see that the number of nonzero clusters is much lower in the case of the proposed preprocessing,

Table 1: Cluster statistics on the turbo codes from the DVB standard, with a clustering size of p = 8.

Code                        P     Total clusters   Nonzero clusters   Full-rank clusters
Turbo R = 1/2, N = 848      Pc1   5618             506                26
Turbo R = 1/2, N = 848      Pc2   5618             426                0
Turbo R = 1/2, N = 3008     Pc1   70688            1786               94
Turbo R = 1/2, N = 3008     Pc2   70688            1504               0

but also that there are no full-rank clusters. This indicates that the preprocessing Pc2 has concentrated the ones of the parity-check matrix Hb in a better way than Pc1. Our simulation results show that this better concentration has a direct influence on the error-correction capability of the group BP decoder.

All group BP simulations used a maximum of 100 iterations, but the average number of iterations is as low as 3-4 iterations for frame error rates below 10^−3. Simulations were run until at least 100 frames had been found in error. As expected, the preprocessing of type Pc2 is far better than the other preprocessings, which is explained by the fact that the corresponding Tanner graph has fewer nonzero clusters. It can be seen that with a good preprocessing function, a turbo code can be efficiently decoded using a BP decoder, and can even slightly beat the turbo decoder in the waterfall region. The turbo decoder remains better in the error-floor region, which is due to the fact that the group BP decoder has many more detected errors (due to decoder failures) in this region than the turbo decoder. Although we are aware that the group BP decoder is much more complex than the turbo decoder, this result is quite encouraging, since it was long thought that turbo codes could not be decoded using an LDPC-like decoder. As a drastic example, we have plotted the very poor performance of a binary BP decoder on the binary image of the turbo code, which does not converge at all for any of the SNRs under consideration.

We also simulated the same curves for a longer code with N = 3008 in order to show the robustness of our approach. The results are shown in Figure 6, and the same comments as for the N = 848 code apply, with an even larger performance gain when using the best preprocessing function.

4. IMPROVING PERFORMANCE BY CAPITALIZING ON THE PREPROCESSING DIVERSITY

As more than one nonbinary Tanner graph can be built from the same code through different preprocessing functions, this raises the question whether it is possible to improve the decoding performance by using this diversity of graph representations. Actually, we have noticed that for the same noise realization, the group BP decoder on a specific Tanner graph can either (i) converge to the right codeword, (ii) converge to a wrong codeword (undetected error), or (iii) diverge after a fixed maximum number of iterations. If we accept some additional complexity, using several instances of iterative decoding based on several preprocessing functions and a


8 EURASIP Journal on Wireless Communications and Networking

[Figure: frame error rate versus Eb/N0 (dB). Curves: turbo decoder; group BP with no preprocessing; group BP with P1; group BP with P2; binary BP.]

Figure 6: Performance of the group BP decoding algorithm based on different preprocessing functions for the (R = 0.5, N = 3008) duobinary turbo code and comparison with the turbo decoder as used in the DVB standard.

proper results merging strategy is likely to improve the error-correction performance.

In this paper, we will not address the problem of finding a good set of preprocessing functions, and we restrict ourselves to Nd = 5 different Tanner graphs obtained with preprocessing functions of type Pc2. There are various possible merging methods to use the outputs of each decoder, with associated performance-complexity tradeoffs. Aside from the two natural merging strategies depicted below, one can think of more elaborate choices.

Serial merging

The Nd decoders are potentially used in a sequential manner. Assuming that we check the value of the syndrome at each iteration, when a decoder fails to converge to the right codeword or to a wrong codeword after a given number of iterations, we switch to another decoder; that is, another Tanner graph is computed with a different preprocessing, and we restart the decoder from scratch with the new graph and the permuted likelihoods. The process stops when one decoder converges to a codeword (either the sent codeword or not).
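Serial merging can be sketched in a few lines. The function and decoder interface below are illustrative placeholders, not the authors' implementation: each preprocessing builds a Tanner graph and permutes the channel likelihoods, and `decode` runs group BP on that graph, reporting whether the syndrome check passed.

```python
def serial_merging(channel_llrs, preprocessings, decode, max_iter=100):
    """Try the Nd group BP decoders one after another.

    `preprocessings` is a list of Nd functions, each returning a Tanner
    graph and the permuted likelihoods; `decode` returns a pair
    (codeword, converged). All names here are illustrative.
    """
    for prep in preprocessings:
        graph, permuted_llrs = prep(channel_llrs)   # new graph, restart from scratch
        codeword, converged = decode(graph, permuted_llrs, max_iter)
        if converged:          # syndrome check passed: a codeword was found
            return codeword    # (possibly the wrong one -> undetected error)
    return None                # decoder failure: no graph converged
```

Note that, as in the text, the loop stops at the first codeword found, right or wrong.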

Parallel merging

The Nd decoders are used in parallel and a maximum-likelihood (ML) decision is taken among the ones that have converged to a codeword. If nb, with nb ≤ Nd, is the number of decoders that have converged to a codeword in less than the maximum number of iterations, then the nb associated likelihoods are computed and the codeword with the maximum likelihood is selected. Note that the nb candidate codewords are not necessarily distinct.
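A minimal sketch of the ML selection step, under the assumption of AWGN with equal noise variance so that maximum likelihood reduces to minimum Euclidean distance to the received sequence (names are illustrative):

```python
import numpy as np

def parallel_merging(received, candidates):
    """Pick the ML codeword among the n_b decoders that converged.

    `candidates` holds the modulated codewords returned by the
    converged decoders (duplicates allowed); the one closest in
    Euclidean distance to the received sequence wins.
    """
    if not candidates:
        return None                       # every decoder failed
    received = np.asarray(received)
    dists = [float(np.sum(np.abs(received - np.asarray(c)) ** 2))
             for c in candidates]
    return candidates[int(np.argmin(dists))]
```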

Lower bound on merging strategies

In order to study the potential of the decoder diversity approach regardless of the merging strategy, we define the following lower bound. Among the Nd decoders in the diversity set, we check whether at least one decoder converges to the right codeword. A decoder failure is declared if none of the Nd decoders has converged after the maximum number of iterations. Note that this method does not exhibit any undetected errors. It is called a lower bound on merging strategies because it assumes that if there exists at least one Tanner graph on which the decoder converges to the right codeword, one can think of a smart procedure to select this graph. This is of course not always possible, especially if the codeword sent is not the ML codeword. This lower bound also provides a possibly tight estimate of the parallel merging case, without having to simulate all Nd decoders.

The extra complexity induced by serial merging is negligible since the other Tanner graphs are used only when the first one fails to converge; that is, at an FER of 10−3 for the first decoder, the decoder diversity is used only 0.1% of the time. Parallel merging is much more complex since it uses Nd times more computations, but one can argue that it is simpler to parallelize on a chip. We did not simulate parallel merging in this work. In the worst case, the extra latency of serial merging obviously grows linearly with the number Nd of different Tanner graphs.

In Figures 7 and 8, we report simulation results on the AWGN channel for the two turbo codes studied in the previous section. Of course, the results with no diversity are similar to those observed in Figures 5 and 6 for the preprocessing of type Pc2, and we do not plot them in the new figures. If we focus on the maximum performance gain that one can hope for by looking at the lower bound curves, it is clear that using several decoders can significantly improve the performance, both in the waterfall and the error floor regions. For the small code as well as for the longer code, group BP decoding with decoder diversity gains between 0.25 dB and 0.4 dB compared to the turbo decoder using MAP component decoders, which was until now considered the best decoder proposed for turbo codes. This result shows in particular that it is possible to build iterative decoders which are more powerful, and therefore closer to the maximum-likelihood decoder, than the classical turbo decoder.

Interestingly, serial merging, which is the most obvious merging strategy and also requires the least additional complexity, achieves the full decoder diversity gain in the waterfall region, that is, above FER = 10−3. This is particularly useful for wireless standards which use ARQ-based transmission and, therefore, hardly require error correction below FER = 10−3. In the error floor region though, we can see in both Figures 7 and 8 that more elaborate merging solutions should be used to achieve the full diversity gain and obtain a substantial gain compared with the turbo decoder. Note, however, that with serial merging


Charly Poulliat et al. 9

[Figure: frame error rate versus Eb/N0 (dB). Curves: turbo decoder; 5 group decoders with serial merging; 5 group decoders, lower bound.]

Figure 7: Decoding performance when diversity is applied to the (R = 0.5, N = 848) duobinary turbo code.

[Figure: frame error rate versus Eb/N0 (dB). Curves: turbo decoder; 5 group decoders with serial merging; 5 group decoders, lower bound.]

Figure 8: Decoding performance when diversity is applied to the (R = 0.5, N = 3008) duobinary turbo code.

and for the N = 3008 turbo code, the results are better than those of the turbo decoder for all SNRs, even in the error floor region.

5. CONCLUSION

In this paper, we have proposed a new approach to efficiently decode turbo codes using a nonbinary belief propagation decoder. It has been shown that this generic method is fully efficient if a preprocessing step on the parity-check matrix of the code is added to the decoding process, in order to ensure good topological properties of the Tanner graph and thereby good iterative decoding performance. Using this extended representation, we show that the proposed algorithm exhibits very good performance in both the waterfall and the error floor regions when compared to a classical turbo decoder. Moreover, using the inherent diversity induced by the existence of several concurrent extended Tanner graph representations, we show that the performance can be further improved, and we introduce the concept of decoder diversity. This study shows that this decoding strategy (i.e., the joint use of preprocessing, group BP, and diversity decoding) is a key step that makes it possible to consider LDPC and turbo codes within a unified framework from the decoder point of view.



Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 624542, 7 pages
doi:10.1155/2008/624542

Research Article
Space-Time Convolutional Codes over Finite Fields and Rings for Systems with Large Diversity Order

Mario de Noronha-Neto^1 and B. F. Uchoa-Filho^2

1 Telecommunications Systems Research and Development Group, Federal Center of Technological Education of Santa Catarina, 88103-310 Sao Jose, SC, Brazil

2 Communications Research Group-GPqCom, Department of Electrical Engineering, Federal University of Santa Catarina, 88040-900 Florianopolis, SC, Brazil

Correspondence should be addressed to B. F. Uchoa-Filho, [email protected]

Received 26 October 2007; Revised 18 March 2008; Accepted 6 May 2008

Recommended by Yonghui Li

We propose a convolutional encoder over the finite ring of integers modulo p^k, Z_{p^k}, where p is a prime number and k is any positive integer, to generate a space-time convolutional code (STCC). Under this structure, we prove three properties related to the generator matrix of the convolutional code that can be used to simplify the code search procedure for STCCs over Z_{p^k}. Some STCCs of large diversity order (≥4) designed under the trace criterion for n = 2, 3, and 4 transmit antennas are presented for various PSK signal constellations.

Copyright © 2008 M. de Noronha-Neto and B. F. Uchoa-Filho. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Since the discovery of space-time trellis codes (STTCs) by Tarokh et al. [1], much research has been done in this area. Some authors [2–5] have concentrated their efforts to generate STTCs through an encoding structure wherein the inputs are binary symbols, the encoding operations are realised modulo 2^k, where k is any positive integer, and the 2^k-ary outputs are matched to a 2^k-ary signal constellation. Although this encoding structure facilitates the code search procedure, this search becomes prohibitively complex as the number of transmit antennas, states, or modulation size increases.

In order to simplify the design of STTCs, Abdool-Rassool et al. [6] have proven two theorems that allow one to significantly reduce the computational effort of the code search. In [7], utilising an alternative structure, the authors have considered STTCs generated by a convolutional encoder over the Galois field GF(p) ≡ Z_p, p a prime, where the information symbols, the convolutional encoder tap gains, and the output symbols are elements of Z_p, allowing for a spectral efficiency of log_2(p) b/s/Hz. These codes are referred to as space-time convolutional codes (STCCs). Using the structure proposed in [7], Hong and Chung [8] and Noronha-Neto and Uchoa-Filho [9] have presented some new STCCs over GF(p) for two transmit antennas.

The design of good STTCs is based on the well-known rank and determinant criteria [1] or the trace criterion [2, 3], depending on the system's diversity order. If the diversity order is greater than or equal to 4, the trace criterion should be used in place of the determinant criterion, while the rank criterion may be relaxed.

In this paper, utilising a nonsystematic feedforward convolutional encoder over the finite ring of integers modulo p^k, Z_{p^k}, and inspired by the results in [6], we prove three properties related to the generator matrix of convolutional codes over Z_{p^k} that can simplify the code search procedure for STCCs over Z_{p^k}. Essentially, the properties establish equivalences among STCCs so that many convolutional codes can be discarded in the code search without losing anything. Herein we focus on systems with large diversity order, so only STCCs designed under the trace criterion are considered. By exploiting the structure of the convolutional encoder over Z_{p^k} and the simplifications coming from the properties, we obtain some good STCCs over finite fields (k = 1) and rings based on the trace criterion for 3-, 4-, 5-, 7-, 8-, and 9-PSK modulations, n = 2, 3, 4 transmit antennas, and encoder memories 1, 2, and 3.




Figure 1: Rate 1/n convolutional encoder over Z_{p^k} with memory order K. The multiplier Ψ controls the number of encoder states.

We should mention the important work of Carrasco and Pereira [10], which considers nonbinary space-time convolutional codes. There are significant differences between the present work and [10]. First, Carrasco and Pereira considered a systematic feedback convolutional encoder, which has approximately the same number of nonbinary coefficients as the nonsystematic feedforward encoder proposed in this paper. However, our structure gives rise to the three properties mentioned above for code equivalences, by which we can reduce the code search effort. Another important difference between our work and the work of Carrasco and Pereira is that in [10] they consider the determinant criterion regardless of the number of receive antennas.

The remainder of this paper is organised as follows. In Section 2, we describe the proposed space-time coded system based on convolutional codes over Z_{p^k}, and present the design criteria for obtaining good STCCs. In Section 3, we prove the three properties mentioned above and present guidelines for finding good STCCs over Z_{p^k}. The new STCCs found with the code search are tabulated in Section 4. Also provided in that section is the frame error rate (FER) for some of the new STCCs, obtained from computer simulations. Comparison results with existing STTCs are also given. Finally, in Section 5, we conclude the paper and make some final comments.

Throughout this paper, the conjugate, transpose, and Hermitian (conjugate transpose) of a matrix/vector A are denoted by A^*, A^T, and A^H, respectively.

2. THE SPACE-TIME CONVOLUTIONALLY CODED SYSTEM AND DESIGN CRITERIA

We consider a space-time coded system employing n transmit antennas and m receive antennas. In the transmitter, at each discrete time t, a Z_{p^k}-valued information symbol u_t is encoded by a rate 1/n convolutional encoder over Z_{p^k} with encoder memory K, shown in Figure 1. The encoder output at time t is a block of n coded symbols over Z_{p^k}, (v_t^1, v_t^2, ..., v_t^n), where

    v_t^i ≡ Σ_{x=0}^{K} u_{t−x} g_{x,i}  (mod p^k),                    (1)

for i = 1, ..., n. The encoder tap gain associated with transmit antenna i and memory depth x is denoted by g_{x,i}. The coded symbols are mapped into a complex p^k-PSK signal constellation and transmitted simultaneously via the n transmit antennas. A complex codeword c of length l of the space-time code is a sequence of blocks

    c = {(c_t^1, c_t^2, ..., c_t^n)}
      = {(e^{j(2π/p^k) v_t^1}, e^{j(2π/p^k) v_t^2}, ..., e^{j(2π/p^k) v_t^n})}   (2)

for t = 1, 2, ..., l, where c_t^i is the signal transmitted from the ith antenna at time t. The set of all codewords c is called the STCC, and is denoted by C.

Note that in Figure 1 there is a multiplier Ψ between the (K − 1)th and the Kth memory depths. This multiplier, a positive integer that divides p^k, has the purpose of controlling the number of encoder states. A similar structure has been adopted for the Gaussian channel by Massey and Mittelholzer in [11]. For Ψ = 1, the number of encoder states is p^{kK}. But for Ψ > 1 the number of encoder states is reduced, due to the ring property that the product of two nonzero ring elements may be zero, which reduces the range of possible integer values that can be stored in the Kth encoder memory. We set the value of this multiplier to p^{k−z}, where z = 1, 2, ..., k − 1, to obtain encoders with an intermediate number of states between powers of p^k. The number of encoder states becomes p^{kK}/Ψ. For example, the encoders over Z_4 with (K = 1, Ψ = 1), (K = 2, Ψ = 2), and (K = 2, Ψ = 1) have 4, 8, and 16 states, respectively.
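The encoder of (1) and the PSK mapping of (2) can be sketched as follows. The names are illustrative, and the handling of the multiplier Ψ (scaling the symbol as it enters the Kth memory cell) is our reading of Figure 1, not code from the paper.

```python
import numpy as np

def stcc_encode(u, G, p, k, psi=1):
    """Rate 1/n convolutional encoding over Z_{p^k}, following (1).

    G is the n x (K+1) tap-gain matrix of (7). The multiplier psi of
    Figure 1 is applied to the symbol entering the Kth memory cell
    (our reading of the figure). Returns an l x n array of coded symbols.
    """
    q = p ** k
    G = np.asarray(G) % q
    n, K1 = G.shape
    K = K1 - 1
    state = [0] * K                      # u_{t-1}, ..., u_{t-K}
    out = []
    for ut in u:
        window = [ut % q] + state        # u_t, u_{t-1}, ..., u_{t-K}
        out.append([int(sum(window[x] * G[i, x] for x in range(K + 1)) % q)
                    for i in range(n)])
        if K > 0:                        # shift the delay line
            state = [ut % q] + state[:-1]
            state[-1] = (psi * state[-1]) % q
    return np.array(out)

def psk_map(v, q):
    """Map Z_q symbols onto the q-PSK constellation of (2)."""
    return np.exp(2j * np.pi * np.asarray(v) / q)
```

For example, the 4-state code over Z_4 with G = [1 1 ; 1 2] from Table 2 can be run through `stcc_encode` and then `psk_map` to obtain the transmitted signals.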

In the space-time coded system, the signal received by the jth antenna at time t, d_t^j, is given by

    d_t^j = Σ_{i=1}^{n} α_{i,j} c_t^i √Es + η_t^j,                     (3)

where Es is the average energy of the transmitted signal, η_t^j is a zero-mean complex white Gaussian noise with variance N0/2 per dimension, and α_{i,j} denotes the flat fading coefficient of the channel from the ith transmit antenna to the jth receive antenna. Under the Rayleigh fading assumption, the α_{i,j}, for 1 ≤ i ≤ n and 1 ≤ j ≤ m, are modelled as independent samples of a zero-mean complex Gaussian random process with variance 0.5 per dimension.



In practice, to achieve independent fading the antennas must be physically separated by a distance on the order of a few wavelengths. For the quasistatic, flat-fading channel, it is assumed that the fading coefficients remain constant during a frame and change independently from one frame to another.
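The quasistatic Rayleigh channel of (3) can be sketched directly (illustrative names; one fading draw per frame):

```python
import numpy as np

rng = np.random.default_rng(0)

def quasistatic_rayleigh(c, m, Es, N0, rng=rng):
    """Received signals of (3) on the quasistatic flat Rayleigh channel.

    c is the l x n matrix of transmitted PSK signals. The fading matrix
    alpha (n x m) is drawn once per frame, zero-mean complex Gaussian
    with variance 0.5 per dimension; the noise is CN(0, N0).
    """
    l, n = c.shape
    alpha = (rng.standard_normal((n, m))
             + 1j * rng.standard_normal((n, m))) * np.sqrt(0.5)
    noise = (rng.standard_normal((l, m))
             + 1j * rng.standard_normal((l, m))) * np.sqrt(N0 / 2)
    return np.sqrt(Es) * (c @ alpha) + noise
```

The matrix product c @ alpha computes Σ_i α_{i,j} c_t^i for every (t, j) pair at once.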

Also, we assume that the receiver perfectly knows the channel state information and that the Viterbi algorithm with the Euclidean metric is used in the decoder. Under these conditions, and for high signal-to-noise ratio (SNR), Tarokh et al. [1] have shown that the pairwise error probability is upper bounded by

    P(c → e) ≤ (Π_{i=1}^{r} λ_i)^{−m} (Es/(4N0))^{−rm},                (4)

where r is the rank of the difference matrix of complex codewords (arranged as a matrix):

    B(c, e) ≜ ⎡ e_1^1 − c_1^1   e_2^1 − c_2^1   ···   e_l^1 − c_l^1 ⎤
              ⎢ e_1^2 − c_1^2   e_2^2 − c_2^2   ···   e_l^2 − c_l^2 ⎥
              ⎢       ⋮               ⋮          ⋱          ⋮       ⎥
              ⎣ e_1^n − c_1^n   e_2^n − c_2^n   ···   e_l^n − c_l^n ⎦,   (5)

and λ_i, for i = 1, ..., r, are the nonzero eigenvalues of A(c, e) ≜ B(c, e) B(c, e)^H. To minimise P(c → e) in (4), we should maximise the minimum rank r of the matrix B(c, e) over all pairs of distinct complex codewords (rank criterion), and maximise the minimum geometric mean (η_det) of the nonzero eigenvalues of the matrix A(c, e) over all pairs of distinct complex codewords with minimum rank (determinant criterion).

As shown by Chen et al. [2, 3], the rank and the determinant criteria should be adopted whenever rm < 4. If rm ≥ 4, they have shown that the pairwise error probability is upper bounded by

    P(c → e) ≤ (1/4) exp(−m (Es/(4N0)) Σ_{i=1}^{n} Σ_{j=1}^{l} |e_j^i − c_j^i|^2),   (6)

which indicates that to minimise P(c → e) we should maximise the minimum squared Euclidean distance over all pairs of distinct complex codewords (trace criterion). It should be noted that the squared Euclidean distance between c and e is equal to the trace of A(c, e), denoted by η_tr. In this paper, we consider only systems with rm ≥ 4, but r need not be equal to n.
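The three figures of merit in (4)-(6) can be computed directly from a codeword pair. A sketch, with the numerical rank taken using a small eigenvalue threshold (an assumption of ours, not a choice made in the paper):

```python
import numpy as np

def design_metrics(c, e, tol=1e-9):
    """Rank, eta_det, and eta_tr for a codeword pair, following (4)-(6).

    c and e are l x n complex codewords; B is the n x l difference
    matrix of (5) and A = B B^H. eta_det is the geometric mean of the
    nonzero eigenvalues of A, eta_tr its trace, i.e. the squared
    Euclidean distance between c and e.
    """
    B = (np.asarray(e) - np.asarray(c)).T          # n x l
    A = B @ B.conj().T
    eigs = np.linalg.eigvalsh(A)                   # real, ascending
    nonzero = eigs[eigs > tol]
    r = len(nonzero)                               # rank of B(c, e)
    eta_det = float(np.prod(nonzero)) ** (1.0 / r) if r else 0.0
    eta_tr = float(np.trace(A).real)
    return r, eta_det, eta_tr
```

A code search then minimises over all distinct codeword pairs and keeps the generator matrix maximising the relevant minimum.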

3. GUIDELINES FOR FINDING GOOD SPACE-TIME CONVOLUTIONAL CODES OVER Z_{p^k}

In this section, we prove three properties that can be used to reduce the code search procedure for STCCs over Z_{p^k}. But first, let us denote by G the n × (K + 1) scalar generator matrix of the rate 1/n convolutional encoder over Z_{p^k} of Figure 1, which is defined in this paper as

    G ≜ ⎡ g_{0,1}  g_{1,1}  ···  g_{K,1} ⎤
        ⎢ g_{0,2}  g_{1,2}  ···  g_{K,2} ⎥
        ⎢    ⋮        ⋮      ⋱      ⋮    ⎥
        ⎣ g_{0,n}  g_{1,n}  ···  g_{K,n} ⎦.                            (7)

The first property is based on a result in [6, Section 3.2] for STCCs generated by an encoder with binary input and 2^k-ary tap gains. Herein, this result is extended to the case of a convolutional encoder over Z_{p^k}.

Property 1. Consider an STCC C over Z_{p^k} generated by a generator matrix G with coefficients g_{x,i}, for x = 0, 1, ..., K and i = 1, 2, ..., n. Let C̄ be the STCC over Z_{p^k} generated by the generator matrix Ḡ with coefficients ḡ_{x,i} = p^k − g_{x,i}, for x = 0, 1, ..., K and i = 1, 2, ..., n. Then, every pair of codewords c, e ∈ C is associated with a pair c̄, ē ∈ C̄ such that A(c, e) = B(c, e) B(c, e)^H and Ā(c̄, ē) = B̄(c̄, ē) B̄(c̄, ē)^H have the same rank, determinant, and trace. Therefore, the two STCCs C and C̄ are entirely equivalent.

Proof. Consider that the output of the encoder shown in Figure 1 is as given in (1). Changing the encoder coefficients to p^k − g_{x,i} yields the following output:

    v̄_t^i ≡ Σ_{x=0}^{K} u_{t−x} (p^k − g_{x,i})  (mod p^k)
          ≡ Σ_{x=0}^{K} (u_{t−x} p^k) − (u_{t−x} g_{x,i})  (mod p^k)
          ≡ Σ_{x=0}^{K} −u_{t−x} g_{x,i}  (mod p^k) ≡ −v_t^i  (mod p^k)
          ≡ p^k − v_t^i  (mod p^k).                                    (8)

Each element of B(c, e) is a difference of complex numbers of the form

    b_{i,j} = exp(j2πv/p^k) − exp(j2πw/p^k).

The associated element of B̄(c̄, ē) is

    b̄_{i,j} = exp(j2π(p^k − v)/p^k) − exp(j2π(p^k − w)/p^k)
            = exp(−j2πv/p^k) − exp(−j2πw/p^k)
            = b_{i,j}^*.                                               (9)

From (9), we can conclude that

    Ā(c̄, ē) = B̄(c̄, ē) B̄(c̄, ē)^H
            = B(c, e)^* (B(c, e)^*)^H
            = B(c, e)^* B(c, e)^T
            = (B(c, e) B(c, e)^H)^*
            = A(c, e)^*.                                               (10)



Table 1: New good STCCs over finite fields based on the trace criterion.

p^k  n  ϑ    G                                          rank  η_tr   η_det
3    3  3    [1 1 ; 1 2 ; 2 1]                          2     18     —
3    3  9    [1 1 1 ; 1 1 2 ; 1 2 1]                    3     27     3
3    3  27   [1 0 1 2 ; 1 1 1 1 ; 1 1 2 1]              3     33     4.32
3    4  3    [1 1 ; 1 1 ; 1 1 ; 1 2]                    2     24     —
3    4  9    [0 2 1 ; 1 1 1 ; 1 2 1 ; 2 2 1]            3     33     —
3    4  27   [2 1 2 2 ; 2 0 2 1 ; 1 1 2 2 ; 2 2 2 1]    4     45     3
5    3  5    [1 1 ; 1 2 ; 2 2]                          2     15     —
5    3  25   [1 1 1 ; 1 3 2 ; 2 3 1]                    3     21.38  1
5    4  5    [1 2 ; 1 2 ; 2 1 ; 2 1]                    2     20     —
7    3  7    [2 4 ; 3 5 ; 6 1]                          2     14     —
7    4  7    [1 1 ; 1 2 ; 2 3 ; 3 3]                    2     17.19  —

Since A(c, e) is Hermitian, A(c, e) and A(c, e)^* have the same rank, determinant, and trace.

Note that by this property it is possible to reduce by approximately one half the number of STCCs to be tested, without any sacrifice in terms of finding the best code.

Now, we present the second property, which is also an extension of a result in [6, Theorem 2] to the ring Z_{p^k}.

Property 2. Consider an STCC C over Z_{p^k} generated by a generator matrix G. Any STCC over Z_{p^k} generated by a generator matrix whose rows correspond to a permutation of the rows of G is entirely equivalent to C.

Proof. A permutation of the rows of G implies a permutation of the encoder outputs in Figure 1, and also induces the same permutation of the rows of B(c, e). It is easy to show that the rank, determinant, and trace of the corresponding matrix A(c, e) are not affected by any permutation of the rows of B(c, e).

Observe that with Property 2 it is possible to obtain a reduction of the code search space by a factor of approximately n!. In this paper, we utilised Properties 1 and 2 to reduce the code search effort under the trace criterion, but they can also be applied to the rank and the determinant criteria. The last property, presented next, applies to the trace criterion only.

Property 3. Consider an STCC over Z_{p^k} generated by a matrix G with coefficients g_{x,i}, for x = 0, 1, ..., K and i = 1, 2, ..., n. Changing the coefficients g_{x,i} of χ rows of G to p^k − g_{x,i}, where 1 ≤ χ ≤ n, does not affect the trace of the matrix A(c, e) for any pair of STCC codewords c and e.

Proof. Consider a rate R = 1/n convolutional encoder over Z_{p^k} with scalar generator matrix G. As proved in Property 1, if the coefficients g_{x,i}, where x = 0, 1, ..., K, of the ith row of the matrix G are changed to their corresponding complements modulo p^k, that is, p^k − g_{x,i}, then the ith row of the matrix B(c, e) changes to its complex conjugate. Since A = B B^H, the ith diagonal element a_{i,i} of the matrix A is the sum of the squared moduli of the elements of the ith row of B. Since |b_{i,j}|^2 = |b_{i,j}^*|^2, and the trace of a matrix is the sum of its diagonal elements, Property 3 is proved.

By utilising Property 3, it is possible to reduce the code search space by a factor of 2^n. Note that when all rows of G are changed to their corresponding complements modulo p^k, that is, when χ = n, this property becomes Property 1.
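Properties 2 and 3 can be exploited in a search by testing only one canonical representative per equivalence class. A brute-force sketch (our illustration, exponential in n, so intended only for the small n considered here):

```python
from itertools import permutations, product

def canonical(G, q):
    """Smallest representative of the trace-equivalence class of G.

    By Properties 2 and 3, permuting the rows of G or replacing any
    subset of rows by their complements modulo q leaves eta_tr
    unchanged, so a trace-criterion search needs only one matrix per
    class. Brute force over the up to n! * 2^n equivalent variants.
    """
    n = len(G)
    best = None
    for perm in permutations(range(n)):
        rows = [G[i] for i in perm]
        for flips in product([False, True], repeat=n):
            cand = tuple(tuple((q - g) % q if f else g % q for g in row)
                         for row, f in zip(rows, flips))
            if best is None or cand < best:
                best = cand
    return best
```

During the search, a candidate matrix is simply skipped whenever its canonical form has already been tested.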

It is worth mentioning that the structure of convolutional encoders over Z_{p^k}, adopted in this paper, offers a reduced search space as compared to the structure based on binary inputs. For our structure, the number of possible codes is p^{kn(K+1)}, while for the structure with binary input (standard structure) this number is p^{k^2 n(K+1)}. This reduction is possible because the structure over Z_{p^k} yields a smaller number of coefficients. Of course, since we consider a smaller search space, it is possible that in some cases the standard structure will produce better codes. On the other hand, the code search based on the standard structure becomes prohibitive as the number of transmit antennas, states, or constellation size increases, and quite often only partial (nonexhaustive) search results are presented (see, e.g., [12]). The STCCs presented herein have, in many cases, the same performance parameters as the STCCs found with the standard structure for the same number of antennas and the same complexity. For the cases where the STCC is over GF(p), that is, k = 1, the structure utilised in this paper becomes the only option.

We should also mention that a computer program routine to discard the equivalent codes, according to the three properties, can be easily prepared. So the cut in the search effort is quite significant.

As a final consideration, we should note that although in this paper we utilise only PSK constellations, quadrature amplitude modulation (QAM) constellations could also be used. However, Properties 1 and 3 would not hold for QAM, and the search space reduction provided by these properties would be lost. On the other hand, Property 2 could still be used without any modification if QAM signal constellations were adopted. It is well known that QAM has better Euclidean distance properties than PSK. So, using



Table 2: New good STCCs over finite rings based on the trace criterion.

p^k  n  ϑ    G                                          rank  η_tr   η_det
4    2  4    [1 1 ; 1 2]                                2     10     2
4    2  8    [1 1 0 ; 2 1 1]                            2     12     3.46
4    2  16   [1 1 2 ; 2 1 3]                            2     16     3.46
4    2  64   [1 0 1 2 ; 1 1 2 1]                        2     18     5.29
4    3  4    [1 1 ; 1 1 ; 1 2]                          2     16     —
4    3  8    [3 3 0 ; 1 0 1 ; 1 3 1]                    2     18     —
4    3  16   [1 1 1 ; 1 2 2 ; 2 1 3]                    2     24     —
4    3  64*  [2 2 3 3 ; 1 2 1 3 ; 1 1 3 2]              3     32     2.88
4    4  4    [1 1 ; 1 1 ; 1 2 ; 1 2]                    2     20     —
4    4  8    [1 0 1 ; 1 1 0 ; 1 1 1 ; 1 3 1]            2     26     —
4    4  16   [1 1 1 ; 1 1 2 ; 1 2 2 ; 2 1 3]            3     32     —
4    4  64*  [1 3 2 3 ; 1 2 1 1 ; 2 2 1 2 ; 3 3 1 0]    4     40     2
8    2  8    [1 2 ; 4 3]                                2     7.17   1.41
8    2  16   [2 1 0 ; 3 0 1]                            2     8      2
8    2  64*  [5 1 6 ; 1 1 3]                            2     10.58  1.17
8    3  8    [1 1 ; 2 2 ; 3 4]                          2     12     —
8    4  8    [1 1 ; 1 2 ; 2 3 ; 3 4]                    2     16.52  —
9    3  9    [1 3 ; 6 4 ; 7 2]                          2     12     —

Table 3: Comparison of STCCs found with different encoder structures.

p^k  n  ϑ    η_tr [12]  η_det [12]  η_tr   η_det
4    2  4    10         2           10     2
4    2  8    12         2.82        12     3.46
4    2  16   16         2.82        16     3.46
4    2  64   18         4           18     5.29
4    3  4    16         —           16     —
4    3  8    20         —           18     —
4    3  16   24         —           24     —
4    3  64*  28         —           32     2.88
4    4  4    20         —           20     —
4    4  8    26         —           26     —
4    4  16   32         —           32     —
4    4  64*  38         —           40     2
8    2  8    7.17       1.41        7.17   1.41
8    2  16   8          0.82        8      2
8    3  8    12         —           12     —
8    4  8    16.58      —           16.58  —

the encoding structure proposed in this paper, it is possible that better STCCs exist for QAM constellations than for PSK constellations of the same size. However, since the demonstration of algebraic properties to reduce the code search effort constitutes an important part of this paper, QAM will not be considered herein.

4. CODE SEARCH AND SIMULATION RESULTS

In this section, we present some new STCCs generated by a rate 1/n convolutional encoder over Zpk, and show their performance on the quasistatic flat Rayleigh fading channel. Since we are considering large diversity orders, the code search was based only on the trace criterion. Tables 1 and 2 show the search results for STCCs over finite fields and rings, respectively, with various pk-PSK modulations, numbers of states (ϑ), and numbers of transmit antennas (n). In these tables, the STCCs marked with ∗ are the result of a partial search. All other codes are optimal for the structure of Figure 1. In [12], we can find STCCs for the 4- and 8-PSK modulations. For the same number of states and number of transmit antennas, those codes in most cases have the same


6 EURASIP Journal on Wireless Communications and Networking

Figure 2: FER versus SNR for the STCCs over Z3 for 3-PSK based on the trace criterion with n = 3, m = 2, 3, and 3, 9, and 27 states.

Figure 3: FER versus SNR for STCCs over Z4 for 4-PSK based on the trace criterion with n = 3, m = 2, 3, and 4, 8, and 16 states.

trace as the STCCs presented in Table 2. Table 3 compares STCCs for the 4- and 8-PSK modulations found with different structures. It can be seen that with the proposed structure we obtained an STCC with an improved trace in two cases, a worse trace in one case, and the same trace in all other cases.

All the new STCCs in Table 1 and the STCC for 9-PSK in Table 2 have no corresponding competitors in the literature. The STCCs over GF(p) with two transmit antennas based on the trace criterion can be found in [9].

In Figures 2 and 3, we show the FER versus SNR (in decibels) curves for the STCCs over GF(3) and Z4,

Figure 4: Performance comparison of STCCs for 4-PSK obtained with different encoder structures for n = 3, m = 2, and 4, 8, and 64 states.

respectively, where we can observe the performance of the codes for different numbers of states and receive antennas. In Figure 4, we show the performance comparison of the STCCs for 4-PSK found with the encoder structure over Zpk and with the standard structure. For n = 3 transmit antennas, m = 2 receive antennas, and 4, 8, and 64 states, we can observe that the performances of these codes are very similar, although the codes have been generated by different encoder structures and have different traces in the cases of 8 and 64 states. In all simulations presented in this section, we considered a frame length of l = 130 symbols.

5. CONCLUSION AND FINAL COMMENTS

In this paper, we have considered space-time convolutional codes over finite fields and rings for the quasistatic, flat Rayleigh fading channel. Based on this encoding structure, we proved three properties that can be used to simplify the code search based on the trace criterion. Good STCCs for n = 2, 3, 4 transmit antennas and various pk-PSK constellations were presented. The resulting spectral efficiencies, namely, log2(pk) b/s/Hz, can serve a wide range of multimedia applications.

As the STCCs presented herein are designed by the trace criterion, they do not achieve the optimal diversity-multiplexing gain (DM-G) tradeoff [13, 14] for systems with more than one receive antenna. Therefore, it is possible that STTCs constructed to achieve the optimum DM-G tradeoff perform better than the codes in this paper, under the same spectral efficiency.

ACKNOWLEDGMENTS

This work was supported by CEFET/SC under a research grant and by the Brazilian National Council for Scientific



and Technological Development (CNPq) under Grants no.484391/2006-2 and 303938/2007-2.

REFERENCES

[1] V. Tarokh, N. Seshadri, and A. R. Calderbank, “Space-time codes for high data rate wireless communication: performance criterion and code construction,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, 1998.

[2] Z. Chen, J. Yuan, and B. Vucetic, “Improved space-time trellis coded modulation scheme on slow Rayleigh fading channels,” Electronics Letters, vol. 37, no. 7, pp. 440–441, 2001.

[3] Z. Chen, B. Vucetic, J. Yuan, and K. L. Lo, “Space-time trellis codes for 4-PSK with three and four transmit antennas in quasi-static flat fading channels,” IEEE Communications Letters, vol. 6, no. 2, pp. 67–69, 2002.

[4] S. Baro, G. Bauch, and A. Hansmann, “Improved codes for space-time trellis-coded modulation,” IEEE Communications Letters, vol. 4, no. 1, pp. 20–22, 2000.

[5] R. S. Blum, “Some analytical tools for the design of space-time convolutional codes,” IEEE Transactions on Communications, vol. 50, no. 10, pp. 1593–1599, 2002.

[6] B. Abdool-Rassool, M. R. Nakhai, F. Heliot, L. Revelly, and H. Aghvami, “Search for space-time trellis codes: novel codes for Rayleigh fading channels,” IEE Proceedings: Communications, vol. 151, no. 1, pp. 25–31, 2004.

[7] M. de Noronha-Neto, R. D. Souza, and B. F. Uchoa-Filho, “Space-time convolutional codes over GF(p) achieving full 2-level diversity,” in Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC ’03), vol. 1, pp. 408–413, New Orleans, La, USA, March 2003.

[8] S. K. Hong and J.-M. Chung, “Prime valued space-time convolutional Zw code achieving full 2-level diversity,” Electronics Letters, vol. 40, no. 4, pp. 253–254, 2004.

[9] M. de Noronha-Neto and B. F. Uchoa-Filho, “Space-time convolutional codes over GF(p) for two transmit antennas,” IEEE Transactions on Communications, vol. 56, no. 3, pp. 356–358, 2008.

[10] R. A. Carrasco and A. Pereira, “Space-time ring TCM codes for QAM over fading channels,” IEE Proceedings: Communications, vol. 151, no. 4, pp. 316–321, 2004.

[11] J. L. Massey and T. Mittelholzer, “Convolutional codes over rings,” in Proceedings of the 4th Joint Swedish-Soviet International Workshop on Information Theory, pp. 14–18, Gotland, Sweden, August-September 1989.

[12] B. Vucetic and J. Yuan, Space-Time Coding, John Wiley & Sons, New York, NY, USA, 2003.

[13] L. Zheng and D. N. C. Tse, “Diversity and multiplexing: a fundamental tradeoff in multiple-antenna channels,” IEEE Transactions on Information Theory, vol. 49, no. 5, pp. 1073–1096, 2003.

[14] R. Vaze and B. S. Rajan, “On space-time trellis codes achieving optimal diversity multiplexing tradeoff,” IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 5060–5067, 2006.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 890194, 8 pages
doi:10.1155/2008/890194

Research Article
Joint Decoding of Concatenated VLEC and STTC System

Huijun Chen and Lei Cao

Department of Electrical Engineering, The University of Mississippi, University, MS 38677, USA

Correspondence should be addressed to Huijun Chen, [email protected]

Received 1 November 2007; Revised 26 March 2008; Accepted 6 May 2008

Recommended by Jinhong Yuan

We consider the decoding of wireless communication systems with both source coding in the application layer and channel coding in the physical layer for high-performance transmission over fading channels. Variable length error correcting codes (VLECs) and space time trellis codes (STTCs) are used to provide bandwidth efficient data compression as well as coding and diversity gains. At the receiver, an iterative joint source and space time decoding scheme is developed to utilize the redundancy in both the STTC and the VLEC to improve the overall decoding performance. Issues such as the inseparable systematic information at the symbol level, the asymmetric trellis structure of the VLEC, and information exchange between the bit and symbol domains have been considered in the maximum a posteriori probability (MAP) decoding algorithm. Simulation results indicate that the developed joint decoding scheme achieves a significant decoding gain over separate decoding in fading channels, whether or not the channel information is perfectly known at the receiver. Furthermore, how rate allocation between the STTC and the VLEC affects the performance of the joint source and space-time decoder is investigated. Different systems with a fixed overall information rate are studied. It is shown that for a system with more redundancy dedicated to the source code and a higher order modulation of the STTC, joint decoding yields better performance, though with increased complexity.

Copyright © 2008 H. Chen and L. Cao. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Providing multimedia service has become an attractive application in wireless communication systems. Due to bandwidth limitations and harsh wireless channel conditions, reliable source transmission over wireless channels remains a challenging problem. Space time codes and variable length source codes are two key enabling techniques in the physical and application layers, respectively.

Tarokh introduced space time trellis codes (STTCs) [1] in multiple-input multiple-output (MIMO) systems, which obtain a bandwidth efficiency four times higher than that of diversity systems without space time coding. While these STTCs are designed to achieve the maximum diversity in the space dimension, the coding gain in the time dimension, on the other hand, may still be improved.

Variable length error correcting codes (VLECs) [2] are a family of error correcting codes used in source coding. A VLEC maps source symbols to codewords of variable length according to the source statistics. Compared to Huffman codes, which aim for high compression efficiency, a VLEC has inherent redundancy and some error resilience capability. However, a VLEC is still sensitive to channel errors, and a single bit error may cause continuous source symbol partition errors due to the well-known synchronization problem.

Shannon’s classical separation theory states that we can optimize a system by designing the optimal source code and channel code separately. However, this theorem holds only for infinite packet sizes. Therefore, under delay and computation resource constraints, joint optimization of source and channel coding or decoding often yields better performance in realistic systems. Joint source channel decoding (JSCD) basically focuses on using the redundancy in the source coded stream to improve the overall decoding performance. Constraint JSCD (C-JSCD) is discussed in [3, 4], in which the output from the channel decoder is modeled as the output of a binary symmetric channel (BSC) and the source decoder exploits the statistical character of the BSC as a constraint in the maximum a posteriori probability (MAP) algorithm. Integrated JSCD (I-JSCD), proposed in [5, 6], merges the trellises of the source code and the channel code into one integrated trellis and carries out MAP decoding based on the combined trellis. The drawback of I-JSCD is that the decoding complexity dramatically increases with the number of states in the combined trellis. Recently, iterative JSCD



[7, 8] adopts an iterative decoding structure with information exchange between the source decoder and the channel decoder. It has attracted increasing attention because of its relatively low decoding complexity.

Joint decoding schemes with space time components have also been considered recently. A mega concatenation system of multiple-level code, trellis-coded modulation (TCM), and STTC is proposed in [9] to provide unequal error protection for MPEG4 streams. Variable length space time-coded modulation (VL-STCM) is proposed in [10, 11] by concatenating VLC and BLAST in MIMO systems. An iterative detection structure is proposed in [12] for a concatenated system with reversible variable length code (RVLC), TCM, and diagonal block space time trellis code (DBSTTC). In this paper, we consider another type of system where recursive STTCs (Rec-STTCs) with full transmit diversity gain and some coding gain are concatenated with source VLECs. For this type of system, we design an iterative decoding scheme to fully utilize the redundancy in both the source code and the space time code. Modifications of the MAP decoding algorithms and information exchange between the symbol and bit domains of the two component decoders are addressed. This iterative decoding is evaluated in both quasi static and rapid fading channels, when either perfect channel information is available or channel estimation errors exist. The results show a significant decoding gain over noniterative decoding in the tested cases. Furthermore, we study the rate allocation issue of how to allocate the redundancy between the STTC and the VLEC for better decoding performance under the overall bandwidth and transmission power constraint. We find that, with increased decoding complexity, the joint decoding system performance can be improved by introducing more redundancy into the source code while using a higher-order modulation in the STTC.

The rest of the paper is organized as follows. The concatenation structure of VLEC and STTC is described in Section 2. The joint source and space time decoding algorithm is discussed in detail in Section 3. Performance in the case of perfect channel estimation is provided in Section 4. Performance in the presence of channel estimation errors is presented in Section 5. The rate allocation issue is then investigated in Section 6. Finally, conclusions are drawn in Section 7.

2. SYSTEM WITH VLEC AND STTC

The encoder block diagram is depicted in Figure 1. We assume ai, i = 0, 1, . . . , K − 1, is one packet of digital source symbols, drawn from a finite alphabet set {0, 1, . . . , N − 1}, where K is the packet length and N is the source alphabet size. The VLEC encoder maps each source symbol to a variable length codeword at a code rate RVLEC = H/l, where l is the average VLEC codeword length and H is the entropy of the source. The generated bit sequence is bj, j = 0, 1, . . . , L − 1. A bit interleaver is inserted before the STTC for time diversity. In this paper, we use a random interleaver.

Assuming 2^p-ary modulation is used, the bit stream is grouped every p bits and converted to the symbol stream ct, t = 0, 1, . . . , ⌈L/p⌉ − 1, as the input to the STTC encoder. The output from the STTC is NT modulated symbol sequences

Figure 1: Serial concatenation of VLEC and STTC.

Table 1: Examples of VLEC [8].

Symbol | Probability | Huffman | C1 | C2
0 | 0.33 | 00 | 00 | 11
1 | 0.3 | 11 | 11 | 001
2 | 0.18 | 10 | 101 | 0100
3 | 0.1 | 010 | 010 | 0101100
4 | 0.09 | 011 | 0110 | 0001010
E[l] (H = 2.14) | | 2.19 | 2.46 | 3.61
d_f | | 1 | 2 | 3

d_t^i, i = 0, . . . , NT − 1; t = 0, 1, . . . , M − 1 (M = ⌈L/p⌉), which are sent to the radio channel through NT transmit antennas. The overall effective information rate is pH/l bit/s/Hz.

Suppose there are NR antennas at the receiver; at time t, the signal on the jth receive antenna is

r_t^j = Σ_{i=0}^{NT−1} √Es · f_t^{i,j} · d_t^i + η_t^j, (1)

where i = 0, . . . , NT − 1; j = 0, . . . , NR − 1; Es is the average power of the transmitted signal; and f_t^{i,j} is the path gain between the ith transmit antenna and the jth receive antenna at time t. We consider two fading cases: the quasi static fading channel (also referred to as block fading), in which the path gain keeps constant over one packet, and the rapid fading channel, in which the path gain changes from one symbol to the next. η_t^j is the additive complex white Gaussian noise on the jth receive antenna at time t, with zero mean and variance N0/2 per dimension.
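The channel model of (1) can be simulated directly for one time index. This is a minimal sketch under the stated model; the nested-list layout of the path gains `f[i][j]` is an illustrative choice, not notation from the paper.

```python
import random

def received_signal(d, f, Es=1.0, N0=0.2):
    """r_t^j = sum_i sqrt(Es) * f_t^{i,j} * d_t^i + eta_t^j for one time index t.
    d[i]: complex symbol from transmit antenna i; f[i][j]: path gain i -> j.
    Returns the list of received samples over the NR receive antennas."""
    NT, NR = len(d), len(f[0])
    sigma = (N0 / 2.0) ** 0.5            # N0/2 noise variance per dimension
    out = []
    for j in range(NR):
        s = sum((Es ** 0.5) * f[i][j] * d[i] for i in range(NT))
        eta = complex(random.gauss(0, sigma), random.gauss(0, sigma))
        out.append(s + eta)
    return out
```

Setting N0 = 0 recovers the noiseless superposition of the two transmitted symbols, which is the "additive effect of two symbols" discussed later for the branch metric.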

2.1. Variable length error correcting code

In [2], Buttigieg introduced the variable length error correcting code (VLEC). It is similar to a block error correcting code in that each source symbol is mapped to a codeword, but the codewords have different lengths: the more frequent symbols are assigned shorter codewords. The codewords are designed so that a minimum free distance is guaranteed. With a larger free distance, a VLEC has stronger error resilience capability; at the same time, however, more redundancy is introduced and the average length per symbol increases, which reduces the overall effective information rate. Table 1 gives examples of a Huffman code and two VLECs of the same source with different free distances, from [8].
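The variable-length mapping can be illustrated with a hard-decision sketch of code C1 from Table 1. The greedy prefix-matching decoder below is illustrative only (actual decoding in this paper is trellis based), but it shows why a single bit error can desynchronize all subsequent codeword boundaries.

```python
# Codewords of C1 as listed in Table 1.
C1 = {0: "00", 1: "11", 2: "101", 3: "010", 4: "0110"}

def vlec_encode(symbols, table=C1):
    """Concatenate the variable-length codewords of the source symbols."""
    return "".join(table[s] for s in symbols)

def vlec_decode(bits, table=C1):
    """Greedy hard-decision decoding: emit a symbol as soon as the running
    bit buffer matches a codeword. A bit error can shift every boundary
    that follows (the synchronization problem)."""
    inv = {v: k for k, v in table.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return out
```

For an error-free stream the round trip is exact, e.g. `vlec_decode(vlec_encode([0, 1, 2, 3, 4]))` returns `[0, 1, 2, 3, 4]`.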

Since a bit-based trellis representation was proposed for VLEC [13], the MAP decoding algorithm can also be adopted for bit-level VLEC decoding. Figure 2 gives the tree structure and the bit-level trellis representation of VLEC C1. Each


H. Chen and L. Cao 3

Figure 2: Example of VLEC tree structure and bit-level trellis [7].

interior node on the VLEC coding tree is represented by “Ii”. The root node and the leaf nodes are classified as terminal nodes and denoted by the “T” states in the trellis. The branches in the trellis describe the state transitions at any bit instant along the source coded sequence.

2.2. Recursive space time trellis code

The recursive nature of the component encoders is critical to the excellent decoding performance of turbo codes. General rules for designing parallel and serial concatenated convolutional codes have been presented in [14, 15]. In both cases, a recursive convolutional code is required.

In [16], Tujkovic proposed the recursive space time trellis code (Rec-STTC) with full diversity gain for parallel concatenated space time codes. Figure 3 gives examples of the Rec-STTCs in [16] for two transmit antennas. The upper part is a 4-state, QPSK modulated Rec-STTC (ST1) with bandwidth efficiency 2 bit/s/Hz, and the lower part is an 8-state, 8PSK modulated Rec-STTC (ST2) with bandwidth efficiency 3 bit/s/Hz. Each line represents a transition from the current state to the next state. The numbers on the left and right sides of the slashes are the corresponding input symbols and the two output symbols, respectively.

3. JOINT VLEC AND SPACE TIME DECODER

Consider the above serially concatenated source and space time coding system. Conventionally, separate decoding stops after one round of STTC decoding followed by VLEC decoding. In this paper, we utilize both the redundancy in the VLEC and the error correcting ability of the STTC in the time dimension to facilitate each other's decoding through an iterative process, and hence to improve the overall decoding performance.

Figure 4 illustrates the iterative joint source and space time decoding structure. We assume that the packet has been synchronized and that the side information of the packet length in bits after the VLEC encoder is known at the receiver. The soft-input soft-output MAP algorithm [17] is used in both the VLEC and STTC decoders.

Figure 3: Trellis graphs of QPSK and 8PSK recursive STTCs.

3.1. MAP in symbol and bit domains

The MAP decoder takes the received sequences as soft inputs together with a priori probability sequences, and outputs an optimal estimate of each symbol (or bit) in the sense of maximizing its a posteriori probability. The a posteriori probability is calculated through the coding constraints represented by the trellis.

Given the received streams,

r = [ r_0^0, . . . , r_t^0, . . . ;
      . . . ;
      r_0^{NR−1}, . . . , r_t^{NR−1}, . . . ], (2)

and assume perfect channel information f = [f_t^{i,j}], i = 0, . . . , NT − 1, j = 0, . . . , NR − 1, known at the receiver. At each time index t, the space time decoder generates symbol domain log-likelihood values for all symbols in the signal constellation Q = {q}, q = 0, . . . , 2^p − 1, as follows:

L(ct = q | r) = ln Σ_{(s′,s)⇒q} α_{t−1}(s′) γ_t(s′, s) β_t(s), (3)

where (s′, s) represents the state transition from time t − 1 to time t on the STTC trellis, and

γ_t(s′, s) = P(r_t | (s′, s)) P(s | s′). (4)

r_t = {r_t^j, j = 0, . . . , NR − 1} is the array of received signals on the NR receive antennas at index t. The first part on the right-hand side of (4) involves the channel information, given by

ln P(r_t | (s′, s)) = −C Σ_{j=0}^{NR−1} | r_t^j − Σ_{i=0}^{NT−1} f_t^{i,j} d_t^i |^2, (5)

where d_t^i, i = 0, . . . , NT − 1, are the transmitted signals associated with transition branch (s′, s) at time t, and C is a constant that depends on the channel condition at time t and is the same for all possible transition branches. P(s | s′) is a



Figure 4: Joint source space time decoder.

priori information and is equal to P(q : (s′, s) ⇔ q). Without any a priori information, every symbol in the constellation is considered equally likely and P(s | s′) is set to 1/2^p.
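The branch metric of (5) is a per-branch squared-distance computation. A minimal sketch (the constant C is folded in as an argument; the data layout mirrors the earlier channel notation and is illustrative):

```python
def branch_metric_ln(r_t, d_branch, f_t, C=1.0):
    """ln P(r_t | (s', s)) up to the constant C, for one trellis branch (eq. (5)).
    r_t[j]: received sample on receive antenna j;
    d_branch[i]: symbol sent from transmit antenna i on this branch;
    f_t[i][j]: path gain from antenna i to antenna j at time t."""
    NT, NR = len(d_branch), len(r_t)
    total = 0.0
    for j in range(NR):
        # Noiseless superposition predicted for this branch hypothesis.
        pred = sum(f_t[i][j] * d_branch[i] for i in range(NT))
        total += abs(r_t[j] - pred) ** 2
    return -C * total
```

A branch whose hypothesized symbols reproduce the received superposition exactly scores 0; any mismatch is penalized by the squared Euclidean distance.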

α_t(s) is the probability that the state at time t is s and that the received signal sequences up to time t are r_{k<t+1}. It can be calculated by a forward pass as

α_t(s) = Σ_{s′} γ_t(s′, s) α_{t−1}(s′). (6)

β_{t−1}(s′) is the probability that the state at time t − 1 is s′ and that the received data sequences after time t − 1 are r_{k>t−1}; it can be calculated by a backward pass as

β_{t−1}(s′) = Σ_s β_t(s) γ_t(s′, s). (7)

The initial values are α_0(0) = β_{Ns}(0) = 1 (Ns is the packet length in modulated symbols), assuming tail symbols are added to force the encoder registers back to the zero state.
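The forward/backward recursions (6)-(7) can be sketched on a generic trellis. This is a probability-domain sketch with the standard initialization (state 0 at both ends, as the tail symbols enforce); the branch metrics γ are assumed precomputed, one dict per time step keyed by (previous state, next state).

```python
def forward_backward(trellis, gammas, S, T):
    """alpha/beta recursions of eqs. (6)-(7).
    trellis: list of (s_prev, s_next) branches; S: number of states;
    T: number of trellis steps; gammas[t][(s', s)]: branch metric at step t."""
    alpha = [[0.0] * S for _ in range(T + 1)]
    beta = [[0.0] * S for _ in range(T + 1)]
    alpha[0][0] = 1.0                 # encoder starts in state 0
    beta[T][0] = 1.0                  # tail symbols force a return to state 0
    for t in range(1, T + 1):         # forward pass, eq. (6)
        for (sp, sn) in trellis:
            alpha[t][sn] += alpha[t - 1][sp] * gammas[t - 1].get((sp, sn), 0.0)
    for t in range(T - 1, -1, -1):    # backward pass, eq. (7)
        for (sp, sn) in trellis:
            beta[t][sp] += beta[t + 1][sn] * gammas[t].get((sp, sn), 0.0)
    return alpha, beta
```

Combining α, γ, and β over the branches that emit a given symbol or bit then yields the a posteriori quantities of (3) and (10).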

It needs to be pointed out that L(ct = q | r) in (3) is a log-likelihood value, not the log-likelihood ratio of conventional MAP decoding, because multiple candidate symbols exist in the STTC constellation. Besides, the systematic and parity information can no longer be separated in (5), because the two output symbols in any trellis transition are sent through two transmit antennas simultaneously; the received signal on any receive antenna is an additive effect of the two symbols and the noise. Equation (3) can be rewritten as

L(ct = q | r) = La_STTC + Le_STTC, (8)

where

La_STTC = ln P(ct = q),
Le_STTC = ln Σ_{(s′,s)⇒q} α_{t−1}(s′) P(r_t | (s′, s)) β_t(s). (9)

As a result, each symbol domain log-likelihood value comprises only two parts: extrinsic information and a priori information. The extrinsic information of the STTC is sent to the VLEC decoder as a priori information.

The bit-indexed soft input sequence Y to the VLEC decoder is the extrinsic information from the channel decoder in the first iteration. The VLEC MAP decoder calculates the bit domain log-likelihood ratio for each coded bit uk as

L(uk | Y) = ln [P(uk = 1 | Y) / P(uk = 0 | Y)]
          = ln [ Σ_{(s′,s)⇒uk=1} α_{k−1}(s′) γ_k(s′, s) β_k(s) / Σ_{(s′,s)⇒uk=0} α_{k−1}(s′) γ_k(s′, s) β_k(s) ]. (10)

The forward and backward calculations of α and β are similar to those of STTC MAP decoding. Since this is a serially concatenated system without separable systematic information, Y is taken as Lp_STTC minus the a priori information of the STTC decoding in the first iteration, and remains the same for all iterations. The a priori information of the VLEC decoder (La_VLEC) in the following iterations is updated with the extrinsic information from the STTC decoder. The calculation of γ can be written as

γ_k(s′, s) = P(Yk | (s′, s)) P(uk), uk ∈ {1, 0}, (11)

where uk is the output bit from the VLEC encoder associated with the transition from previous state s′ to current state s at instant k along the trellis. Equation (10) can be further represented as

L(uk | Y) = La_VLEC + Le_VLEC, (12)

where

La_VLEC = ln [P(uk = 1) / P(uk = 0)],
Le_VLEC = ln [ Σ_{(s′,s)⇒uk=1} α_{k−1}(s′) P(Yk | (s′, s)) β_k(s) / Σ_{(s′,s)⇒uk=0} α_{k−1}(s′) P(Yk | (s′, s)) β_k(s) ]. (13)

Therefore, once the VLEC log-likelihood ratio is calculated, the extrinsic information Le_VLEC is extracted and sent to the STTC decoder as the new a priori information.

3.2. Iterative information exchange

The principle of iterative decoding is to update the a priori information of each component decoder with the extrinsic information from the other component decoder, back and forth. Through iterative information exchange, the decoder can



make full use of the coding gain in the coding trellises of the component codes to remove channel noise progressively. During the first iteration, the a priori probability to the Rec-STTC decoder, La_STTC, is set to be equally distributed over every possible symbol. The log-likelihood output from the space time decoder, Lp_STTC, is separated into two parts: soft information (including the systematic and extrinsic information, since systematic information is not separable in the space time coding scheme) and a priori information, which, in later iterations, is the extrinsic information from the VLEC decoder. The soft symbol information Le_STTC is extracted and converted to log-likelihood ratios in the bit domain. After de-interleaving, it is sent to the VLEC decoder as a priori information La_VLEC. The a posteriori probability output of the VLEC decoder, Lp_VLEC, consists of two parts: a priori information and extrinsic information Le_VLEC. Only the extrinsic information is interleaved and converted to a priori information in the symbol domain for the Rec-STTC decoder in the next iteration. After the final iteration, Viterbi VLEC decoding is carried out on Lp_VLEC to estimate the source symbol sequence.

The conversion between the bit domain log-likelihood ratio and the symbol domain log-likelihood value is implemented based on the mapping method and the modulation mode. Each symbol q consists of p bits, q ↔ (w0, w1, . . . , w_{p−1}), wi ∈ {0, 1}. For a group of p bits y0 y1 · · · y_{p−1}, the relation between L(q), q = 0, 1, . . . , 2^p − 1, and the corresponding L(yi), i = 0, . . . , p − 1, is as follows:

L(yi) = ln [P(yi = 1) / P(yi = 0)] = ln [ Σ_{q: wi=1} P(q) / Σ_{q: wi=0} P(q) ] = ln [ Σ_{q: wi=1} e^{L(q)} / Σ_{q: wi=0} e^{L(q)} ], (14)

L(q) = ln P(q) = ln Π_{i=0}^{p−1} P(wi) = Σ_{i=0}^{p−1} ln [ e^{wi L(yi)} / (1 + e^{L(yi)}) ]. (15)

In (15), we use the conversion pair between the LLR L(a) and the absolute probabilities P(a = 1) and P(a = 0):

P(a = 1) = e^{L(a)} / (1 + e^{L(a)}), P(a = 0) = 1 / (1 + e^{L(a)}). (16)

4. PERFORMANCE OVER FADING CHANNELS

Throughout this paper, a MIMO system with two transmit antennas and two receive antennas is used to transmit the VLEC coded source stream. A symbol stream is first generated and fed to the source encoder. Each symbol is drawn from a 5-ary alphabet with the probability distribution shown in Table 1. Each input packet has 100 source symbols. We use the VLEC (C1, C2) schemes in Table 1 and the Rec-STTCs (ST1, ST2) with the signal constellations in Figure 3. The average transmitted signal power is set to one (Es = 1), and the amplitudes of QPSK and 8PSK are both equal to one (√Es = 1). The output bit stream from the VLEC

Figure 5: SER performance of joint source and space time decoder over Rayleigh fading channels (separate decoding and joint decoding after 4 and 8 iterations, C2+ST1, rapid and block fading).

encoder is padded with “0” if necessary so that its length is divisible by p. Tail symbols are added so that the Rec-STTC encoder registers return to the zero state. A random interleaver is used between the VLEC encoder and the Rec-STTC encoder. We adopt a Rayleigh distributed channel model for both the rapid fading case and the quasi static fading case. Following the iterative decoding and information conversion described in the previous section, the end-to-end system performance is measured by the symbol error rate (SER) after each pass of the VLEC SOVA decoder. The SER is measured in terms of the Levenshtein distance [18], which is the minimum number of insertions, deletions, and substitutions required to transform one sequence into another.
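The Levenshtein distance used for the SER measure is the standard edit-distance dynamic program (a generic sketch, not code from [18]); it is the right error count here because a VLEC desynchronization can insert or delete symbols rather than merely substitute them.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions needed
    to transform sequence a into sequence b (single-row DP)."""
    prev = list(range(len(b) + 1))       # distances from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution / match
        prev = cur
    return prev[-1]
```

The per-packet SER is then the distance between the decoded and transmitted symbol sequences divided by the packet length.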

In this section, we study VLEC C2 concatenated with the QPSK modulated Rec-STTC ST1. The overall effective information rate is 1.1856 bit/s/Hz. Figure 5 shows the SER performance comparison between the joint VLEC and space time decoder and the separate space time and VLEC decoder over the quasi static (i.e., block) Rayleigh fading channel and the rapid Rayleigh fading channel. The joint source space time decoder achieves more than 2 dB gain in SER over separate decoding in the rapid fading channel and about 0.8 dB gain in the quasi static fading channel. In particular, at 6 dB in the rapid fading channel, after the 8th iteration, the SER drops to 10^−3 of the SER of separate decoding.
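The quoted rate follows directly from the pH/l formula and the Table 1 figures, as this small check confirms:

```python
# Effective information rate pH/l for VLEC C2 concatenated with QPSK ST1:
# p = 2 bits per QPSK symbol, H = 2.14 bits/symbol, average length l = 3.61.
p, H, l_avg = 2, 2.14, 3.61
rate = p * H / l_avg          # ~1.1856 bit/s/Hz, matching the text
```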

We also observe that the concatenated VLEC and STTC system has a smaller performance gain in the quasi static fading channel than in the rapid fading channel, as shown in Figure 5. This is reasonable because rapid fading channels, which are also called interleaved fading channels, can provide additional diversity gain compared with the quasi static channel.



5. PERFORMANCE IN PRESENCE OF CHANNEL ESTIMATION ERRORS

In this section, we evaluate the joint source and space time decoding in more realistic scenarios. In Section 3, the decoder assumed that the channel state information (CSI) is perfectly known at the receiver. However, in real communication systems, regardless of what method is used, there are always errors in the channel estimation. How the joint source and space time decoder performs in the presence of channel estimation errors is examined here.

Considering imperfect channel estimation, the actual channel fading matrix $f$ used to calculate the metric in (5) becomes the estimated channel fading matrix $\hat{f}$. We model each estimated channel fading coefficient $\hat{f}_t^{i,j}$ between the $i$th transmit antenna and the $j$th receive antenna at time $t$ as a noisy version of the actual channel fading coefficient $f_t^{i,j}$:

$$\hat{f}_t^{i,j} = f_t^{i,j} + \eta_t^{i,j}, \quad (17)$$

where $\eta_t^{i,j}$ is the channel estimation error, modeled as a complex Gaussian random variable with zero mean and variance $\sigma_\eta^2$, independent of $f_t^{i,j}$. The correlation coefficient $\rho$ between $\hat{f}_t^{i,j}$ and $f_t^{i,j}$ is given by

$$\rho = \frac{1}{\sqrt{1 + \sigma_\eta^2}}. \quad (18)$$
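The error model (17)–(18) can be checked numerically. The following sketch is illustrative only (antenna and time indices are dropped, and the function names are ours): it draws unit-variance complex Gaussian fading coefficients, perturbs them with estimation noise of variance $\sigma_\eta^2$, and compares the empirical correlation with equation (18).

```python
import math
import random

def complex_gauss(var):
    """Zero-mean circularly symmetric complex Gaussian sample with variance var."""
    s = math.sqrt(var / 2.0)
    return complex(random.gauss(0.0, s), random.gauss(0.0, s))

def estimated_rho(sigma_eta2, n=200_000, seed=1):
    """Empirical correlation between f and f_hat = f + eta (unit-variance f)."""
    random.seed(seed)
    num = 0.0          # accumulates Re E[f * conj(f_hat)]
    pf = pfh = 0.0     # accumulate E|f|^2 and E|f_hat|^2
    for _ in range(n):
        f = complex_gauss(1.0)
        fh = f + complex_gauss(sigma_eta2)
        num += (f * fh.conjugate()).real
        pf += abs(f) ** 2
        pfh += abs(fh) ** 2
    return num / math.sqrt(pf * pfh)

sigma_eta2 = 0.1
rho_theory = 1.0 / math.sqrt(1.0 + sigma_eta2)   # equation (18)
rho_empirical = estimated_rho(sigma_eta2)
```

Under (18), the simulated cases ρ = 0.98 and ρ = 0.95 correspond to $\sigma_\eta^2 = 1/\rho^2 - 1 \approx 0.041$ and $\approx 0.108$, respectively.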

We use VLEC C1 and Rec-STTC ST1 for this simulation; the other simulation parameters remain the same. Figure 6 shows the SER performance over quasi static fading channels. When the channel information is accurately estimated (ρ = 1.0), the SER decreases through iterations. There is about 0.7 dB gain at the $10^{-3}$ SER level over separate VLEC and STTC decoding. In both cases of channel estimation error (ρ = 0.98, case I, and ρ = 0.95, case II), the joint RVLC and STTC decoding still achieves iterative decoding gain. After 8 iterations, the joint decoding scheme achieves a performance gain of more than 0.7 dB at the $10^{-3}$ SER level in case I, compared with separate decoding. In case II, a performance gain of 3.5 dB at the $10^{-2}$ SER level is achieved after 8 iterations.

The decoding performance in case I and case II over rapid fading channels, shown in Figure 7, exhibits a similar result. Although channel estimation for rapid fading channels is not practical in real systems, the result provides some theoretical perspective on the joint VLEC and STTC decoding. A similar decoding gain is observed. After 8 iterations, the joint decoding scheme achieves a performance gain of 1.5 dB at the $10^{-3}$ SER level with perfect channel estimation, a gain of nearly 4 dB at the $10^{-2}$ SER level in case I, and a gain of more than 5 dB at the $10^{-1}$ SER level in case II, compared with separate VLEC and STTC decoding.

It can be seen that in both the quasi static fading channel and the rapid fading channel, from ρ = 1 to ρ = 0.95, the decoding gain increases. When channel estimation is less accurate, the channel information fed to the space time decoder deviates further from the true channel and causes more errors.

[Figure 6: SER performance of joint source and space time decoding over the quasi static fading channel with channel estimation error (ρ = 1, 0.98, 0.95; separate decoding vs. 8 iterations; SER vs. Eb/N0 from 8 to 13 dB).]

[Figure 7: SER performance of joint source and space time decoding over the rapid fading channel with channel estimation error (ρ = 1, 0.98, 0.95; separate decoding vs. 8 iterations; SER vs. Eb/N0 from 8 to 13 dB).]

The iterative decoder can still achieve significant improvement over separate decoding through iterations. Therefore, the joint source space time decoder is robust to channel estimation errors to some extent. The result is also consistent with the decoder's convergence characteristic. After 6 iterations, the iterative decoding algorithm has


H. Chen and L. Cao 7

[Figure 8: SER performance comparison between system I (C1+ST1) and system II (C2+ST2) over the rapid fading channel (separate decoding vs. joint decoding after 3 and 6 iterations; SER vs. Eb/N0 from 8 to 13 dB).]

little improvement in the case of ρ = 1, while iterative gain is still observed in the case of ρ = 0.95. However, we also ran simulations with ρ ≤ 0.65, which corresponds to very poor channel estimation, and did not find much improvement from iterative decoding. In this situation, the estimate does not reflect the actual channel, and the space time component decoder cannot work effectively to extract correct information for iterative use.

6. RATE ALLOCATION BETWEEN STTC AND VLEC

The frequency bandwidth available to a communication system is always limited, so the overall effective data rate that can be transmitted from the antennas is constrained. Power efficiency is measured by the energy required to transmit one bit: when communicating at a rate $R$ with transmit power $E$, the power efficiency is defined as $E/R$. The overall effective data rate depends on both the modulation order of the Rec-STTC and the average codeword length of the VLEC. On one hand, for a source with given entropy $H$ and a fixed power efficiency, the overall effective information rate is given by $pH/l$. It increases with the modulation order $p$ of the Rec-STTC; however, the decoding performance decreases due to the smaller average Euclidean distance between pairs of signal points in the modulation constellation. On the other hand, a VLEC with a larger average length $l$ helps to increase the error resilience capability due to the extra redundancy introduced.

[Figure 9: SER performance comparison between system I (C1+ST1) and system II (C2+ST2) over the quasi static fading channel (separate decoding vs. joint decoding after 3 and 6 iterations; SER vs. Eb/N0 from 8 to 13 dB).]

However, this decoding performance is improved at the cost of a data rate loss, which needs to be compensated later, for example, by increasing the modulation order. As a result, one interesting question is, given the overall effective information rate and transmit power, whether introducing more redundancy in the VLEC or reducing the modulation order of the Rec-STTC gives more performance improvement. This question is partially answered in the following simulation.

We study the iterative source space time decoding performance of two different concatenated systems. System I concatenates VLEC C1 with the QPSK Rec-STTC ST1. System II concatenates VLEC C2 with the 8PSK Rec-STTC code ST2. With a source entropy of 2.14, the average bit lengths per source symbol of C1 and C2 equal 2.46 and 3.61, respectively. The bandwidth efficiencies of QPSK and 8PSK equal 2 bit/s/Hz and 3 bit/s/Hz. System II has a slightly higher overall effective information rate (1.7784 bit/s/Hz) than system I (1.7398 bit/s/Hz). By assigning unit power to each modulated symbol, system II also has a slightly higher power efficiency (1/1.7784 = 0.5623/bit) than system I (1/1.7398 = 0.5748/bit), which means that system II uses less average power to transmit one bit of source information.
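The rate and power-efficiency figures above follow directly from the definitions (overall rate $pH/l$, and, with unit-power symbols, power efficiency $1/\text{rate}$). A quick arithmetic check, with variable names of our own choosing:

```python
H = 2.14  # source entropy (bits per source symbol)

def effective_rate(bandwidth_eff, avg_codeword_len):
    """Overall effective information rate p*H/l in bit/s/Hz."""
    return bandwidth_eff * H / avg_codeword_len

rate_I = effective_rate(2, 2.46)    # system I: VLEC C1 + QPSK Rec-STTC ST1
rate_II = effective_rate(3, 3.61)   # system II: VLEC C2 + 8PSK Rec-STTC ST2

# With unit power per modulated symbol, the energy per source bit is 1/rate.
eff_I, eff_II = 1 / rate_I, 1 / rate_II
```

This reproduces the quoted values: 1.7398 and 1.7784 bit/s/Hz, with per-bit energies 0.5748 and 0.5623.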

Figure 8 shows the SER performance comparison between system I and system II over rapid fading channels. The simulation system configuration is otherwise the same. System II outperforms system I by almost 4 dB at an SER of 7 × 10^-5. The performance comparison between system I and system II in quasi static channels shows a similar result, as in Figure 9.


Therefore, given roughly the same overall information rate and power efficiency, allocating more redundancy to the source code gives the joint source and space time decoding more iterative decoding gain. However, it should also be noted that the better performance of system II is achieved at the cost of higher computational complexity, because the number of states in both the VLEC trellis and the STTC trellis increases. The complexity of system II is roughly 4 times that of system I in the STTC decoder and 2 times in the VLEC decoder. Also, unlike the rapid fading channel, the quasi static channel provides no additional diversity gain. As a result, system II has a smaller performance gain over system I in quasi static fading channels.

7. CONCLUSIONS

In this paper, a joint decoder is proposed for serially concatenated source and space time codes. VLEC and Rec-STTC are employed, with redundancy in both codes. By iterative information exchange, the concatenated system achieves additional decoding gain without bandwidth expansion. Simulations show that the SER of the joint decoding scheme is greatly reduced compared to the separate decoding system in both quasi static and rapid fading channels. The proposed decoder is also shown to be effective under channel estimation errors. Finally, we find that given a certain overall effective information rate and transmit power, introducing redundancy in the source code can provide more decoding gain than reducing the bandwidth efficiency of the STTC, though with increased decoding complexity.

REFERENCES

[1] V. Tarokh, N. Seshadri, and A. R. Calderbank, “Space-time codes for high data rate wireless communication: performance criterion and code construction,” IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 744–765, 1998.

[2] V. Buttigieg and P. G. Farrell, “Variable-length error-correcting codes,” IEE Proceedings: Communications, vol. 147, no. 4, pp. 211–215, 2000.

[3] N. Demir and K. Sayood, “Joint source/channel coding for variable length codes,” in Proceedings of the Data Compression Conference (DCC ’98), pp. 139–148, Snowbird, Utah, USA, March-April 1998.

[4] K. P. Subbalakshmi and J. Vaisey, “Optimal decoding of entropy coded memoryless sources over binary symmetric channels,” in Proceedings of the Data Compression Conference (DCC ’98), p. 573, Snowbird, Utah, USA, March-April 1998.

[5] A. H. Murad and T. E. Fuja, “Joint source-channel decoding of variable-length encoded sources,” in Proceedings of the IEEE Information Theory Workshop (ITW ’98), pp. 94–95, Killarney, Ireland, June 1998.

[6] Q. Chen and K. P. Subbalakshmi, “An integrated joint source-channel decoder for MPEG-4 coded video,” in Proceedings of the 58th IEEE Vehicular Technology Conference (VTC ’03), vol. 1, pp. 347–351, Orlando, Fla, USA, October 2003.

[7] R. Bauer and J. Hagenauer, “On variable length codes for iterative source/channel decoding,” in Proceedings of the Data Compression Conference (DCC ’01), pp. 273–282, Snowbird, Utah, USA, March 2001.

[8] A. Hedayat and A. Nosratinia, “Performance analysis and design criteria for finite-alphabet source-channel codes,” IEEE Transactions on Communications, vol. 52, no. 11, pp. 1872–1879, 2004.

[9] S. X. Ng, J. Y. Chung, and L. Hanzo, “Turbo-detected unequal protection MPEG-4 wireless video telephony using multi-level coding, trellis coded modulation and space-time trellis coding,” IEE Proceedings: Communications, vol. 152, no. 6, pp. 1116–1124, 2005.

[10] S. X. Ng, J. Wang, L.-L. Yang, and L. Hanzo, “Variable length space time coded modulation,” in Proceedings of the 62nd IEEE Vehicular Technology Conference (VTC ’05), pp. 1049–1053, Dallas, Tex, USA, September 2005.

[11] S. X. Ng, J. Wang, M. Tao, L.-L. Yang, and L. Hanzo, “Iteratively decoded variable length space-time coded modulation: code construction and convergence analysis,” IEEE Transactions on Wireless Communications, vol. 6, no. 5, pp. 1953–1962, 2007.

[12] S. X. Ng, F. Guo, and L. Hanzo, “Iterative detection of diagonal block space time trellis codes, TCM and reversible variable length codes for transmission over Rayleigh fading channels,” in Proceedings of the 60th IEEE Vehicular Technology Conference (VTC ’04), vol. 2, pp. 1348–1352, Los Angeles, Calif, USA, September 2004.

[13] V. B. Balakirsky, “Joint source-channel coding with variable length codes,” in Proceedings of the IEEE International Symposium on Information Theory (ISIT ’97), p. 419, Ulm, Germany, June-July 1997.

[14] S. Benedetto and G. Montorsi, “Unveiling turbo codes: some results on parallel concatenated coding schemes,” IEEE Transactions on Information Theory, vol. 42, no. 2, pp. 409–428, 1996.

[15] D. Divsalar and F. Pollara, “Serial and hybrid concatenated codes with applications,” in Proceedings of the International Symposium on Turbo Codes, pp. 80–87, Brest, France, September 1997.

[16] D. Tujkovic, “Recursive space-time trellis codes for turbo coded modulation,” in Proceedings of the IEEE Global Communication Conference (GLOBECOM ’00), vol. 2, pp. 1010–1015, San Francisco, Calif, USA, November-December 2000.

[17] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 284–287, 1974.

[18] T. Okuda, E. Tanaka, and T. Kasai, “A method for the correction of garbled words based on the Levenshtein metric,” IEEE Transactions on Computers, vol. 25, no. 2, pp. 172–178, 1976.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 329727, 7 pages
doi:10.1155/2008/329727

Research Article
Average Throughput with Linear Network Coding over Finite Fields: The Combination Network Case

Ali Al-Bashabsheh and Abbas Yongacoglu

School of Information Technology and Engineering, University of Ottawa, Ottawa, Canada K1N 6N5

Correspondence should be addressed to Ali Al-Bashabsheh, [email protected]

Received 4 November 2007; Revised 17 March 2008; Accepted 27 March 2008

Recommended by Andrej Stefanov

We characterize the average linear network coding throughput, $T_c^{\mathrm{avg}}$, for the combination network with min-cut 2 over an arbitrary finite field. We also provide a network code, completely specified by the field size, achieving $T_c^{\mathrm{avg}}$ for the combination network.

Copyright © 2008 A. Al-Bashabsheh and A. Yongacoglu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

For a set of sinks in a directed multicast network, it was shown in [1] that if the network can achieve a certain throughput to each sink individually, then it can achieve the same throughput to all sinks simultaneously by allowing coding at intermediate nodes. Such an argument is possible since information is an abstract entity rather than a physical one. Thus, in addition to repetition and forwarding, nodes can manipulate the symbols available from their in-edges and apply functions of such symbols to their out-edges. The collection of edge functions can be referred to as the network code. The work in [1] shows that a natural bound on the achievable coding rate is the least of the min-cuts between the source and each of the sinks. We refer to a code achieving the min-cut rate as a solution.

It is known that every multicast network has a linear solution over a sufficiently large finite field [2]. Sanders et al. [3] showed for linear network coding that an alphabet of size O(|T|) is sufficient to achieve the min-cut rate, where |T| is the number of sinks in the network. Rasala-Lehman and Lehman [4] indicated that this bound is tight by devising a solvable multicast network requiring an alphabet of size Ω(√|T|) to achieve the min-cut rate. This shows that some multicast networks might require alphabets of huge size to achieve their min-cut throughputs. Coding over large alphabets is not always desirable since it introduces complexity and latency concerns. Hence, this motivates working below min-cut rates (i.e., relaxing the constraint of operating at network capacity) but with significantly smaller alphabet sizes (if possible) [5].

Chekuri et al. [6] introduced the measure of average routing throughput by relaxing the constraint that all sinks must receive the same rate. By decoupling the problem of maximizing the average rate from the problem of balancing rates toward sinks, they showed that average routing rates can significantly exceed the maximum achievable common routing rates in the network. They also argued that the majority rather than the minority of multicast applications experience different rates at the receivers. In [7], the concept of average linear network coding throughput was introduced under the constraint that the source alphabet and linear network coding are restricted to the binary field. In this work, we extend the average coding throughput measure to include linear coding over arbitrary finite fields. Such an extension is an important step toward practical network coding. To see this, we first remark that in [7] the motivation to restrict the alphabet to the binary field was to present a simple coding scheme where nodes are not required to perform operations over a field larger than F2. Such a cut in processing complexity came at the price of reducing the total network throughput. In practice, although nodes might not possess the capability to perform operations over the field necessary to achieve the min-cut throughput, they might still have computation capability beyond the binary field. In such situations, it is more reasonable to design codes compatible with the nodes' computation capability and thus get closer to the min-cut throughput.


In the literature, two different variations of the problem of nonuniform coding throughputs at the terminals have been considered. The general connection model [5, 8] considers the problem where each sink specifies its set of demanded messages. On the other hand, the nonuniform demand problem refers to the problem where each sink specifies only the size of its demand (i.e., the number of messages), and this demanded size might vary from sink to sink [9]. In this work, a sink specifies neither the identities nor the number of demanded messages. The objective is to maximize the average throughput achievable with linear network coding under the additional constraint that messages and network coding are restricted to the finite field Fq (where Fq might not be sufficiently large to achieve the min-cut throughput).

2. DEFINITIONS AND PROBLEM FORMULATION

In general, assume an information source consisting of h unit-rate messages x1, x2, ..., xh, where the messages are symbols from a finite field Fq. Also assume that the symbols carried by edges belong to the same field Fq. For the comparison between average throughputs and common throughputs to be fair and meaningful, it is important that the number of messages h at the source does not exceed the min-cut. In the more general case where the min-cuts from the source to the sinks are not equal, h can be set such that it does not exceed the smallest of these min-cuts.

A directed network N on V nodes and E links can be modeled as a directed graph G(V, E). In multicast networks, a node s ∈ V broadcasts a set of messages x1, ..., xh to a set of sinks T = {t1, ..., tn} ⊆ V \ {s}, where h is the smallest min-cut between s and t, for all t ∈ T. For any edge e ∈ E, we denote by δin(e) the set of edges entering the node from which e departs. In some parts of this work, we find it more convenient to deal with the valuation (defined below) induced by the network code rather than the network code itself.

Definition 1. Given a network code $C$ defined by a collection of functions $f_e$, for all $e \in E$, such that

$$f_e : \mathbb{F}_q^{|\delta_{\mathrm{in}}(e)|} \longrightarrow \mathbb{F}_q, \quad (1)$$

the code valuation induced by $C$ is the collection of functions

$$f'_e : \mathbb{F}_q^h \longrightarrow \mathbb{F}_q, \quad (2)$$

where $f'_e$ is the value of $f_e$ as a function of $x_1, \ldots, x_h$.

Consider a multicast network $N$ with a set of messages at the source whose size does not exceed the smallest of the min-cuts between the source and each sink. Linear network coding over a sufficiently large field allows every sink $t \in T$ to recover the entire set of messages. In this work, we somewhat reverse the story; that is, we restrict the field size and allow sinks to recover subsets of the set of messages. Since sinks no longer experience the same throughput, the average throughput per sink becomes a natural measure to evaluate the performance of a given network code. Hence, the objective is to decide on a linear network code over the specified alphabet which maximizes the average throughput, or equivalently the sum of the throughputs experienced by all sinks. More formally, we define the maximum average linear coding throughput over $\mathbb{F}_q$ as

$$T_c^{\mathrm{avg}} = \frac{1}{|T|} \max_{Q \in \mathcal{Q}} \left[ \sum_{t \in T} T_c^t(Q) \right], \quad (3)$$

where the maximization is over $\mathcal{Q}$, the set of all possible linear coding schemes over $\mathbb{F}_q$, and $T_c^t(Q)$ is the throughput at sink $t$ under linear coding scheme $Q \in \mathcal{Q}$. In contrast, the maximum average routing throughput was defined in [6] and is repeated here for convenience:

$$T_i^{\mathrm{avg}} = \frac{1}{|T|} \max_{P \in \mathcal{P}} \left[ \sum_{t \in T} T_i^t(P) \right], \quad (4)$$

where the maximization is over $\mathcal{P}$, the set of all possible integer routing schemes, and $T_i^t(P)$ is the integer routing throughput at sink $t$ under routing scheme $P \in \mathcal{P}$.

In what follows, we restrict our attention to the family of combination networks with N intermediate nodes and min-cut 2. Such networks are sufficient to develop the ideas we need to present in this work.

3. COMPLEXITY VERSUS ALPHABET SIZE

Consider a multicast network that requires a field Fq of size q to achieve the min-cut throughput. Thus, all operations to compute edge functions must be done over the field Fq. In other words, each node in the network must have a memory of Θ(log2(q)) bits to store and manipulate the received symbols. In practice, each edge can deliver a fixed number of bits per unit time. Hence, the assumption that edges can deliver one symbol from Fq per network use implies a latency of Θ(log2(q)).

4. AVERAGE THROUGHPUT OF COMBINATION NETWORK WITH MIN-CUT 2

Consider a combination network N with min-cut 2 and messages $x_1, x_2 \in \mathbb{F}_q$ at the source. A combination network with min-cut 2 consists of three layers of nodes: the source s, a set of N intermediate nodes, and a set of $\binom{N}{2}$ sinks. The source has an out-edge to each intermediate node. Finally, each distinct pair of intermediate nodes is connected to a unique sink via a pair of edges directed into the sink. In this section, we derive an expression for the maximum average throughput of N over an arbitrary finite field $\mathbb{F}_q$. It is known that the network N is solvable when N ≤ q + 1, where q is the field size. Hence, the average throughput is equivalent to the min-cut throughput in this case. On the other hand, for q = 2, the problem was solved in [7]. Therefore, in most of the mathematical treatment which follows, we assume 2 < q < N − 1. In spite of this assumption, whenever applicable, we use the previously obtained results for the binary field and the fact that N is solvable for q ≥ N − 1 to present the results for any q ≥ 2.


A. Al-Bashabsheh and A. Yongacoglu 3

Definition 2. Let $F$ be a collection of functions such that

$$f : \mathbb{F}_q \times \mathbb{F}_q \longrightarrow \mathbb{F}_q \quad (5)$$

for each $f \in F$. Then an average throughput $T_c^{\mathrm{avg}}$ is said to be achievable over $F$ if there exists a network code valuation $C = \{f_e(x_1, x_2) : f_e(x_1, x_2) \in F \text{ for all } e \in E\}$ achieving $T_c^{\mathrm{avg}}$.

Let $G = \{\alpha x_1 + \beta x_2 : \alpha, \beta \in \mathbb{F}_q\}$ be the set of all linear functions (combinations) of $x_1$ and $x_2$ over $\mathbb{F}_q$. Also let $F = \{x_2\} \cup \{x_1 + \beta x_2 : \beta \in \mathbb{F}_q\}$.

Lemma 1. Let $f(x_1, x_2) = \alpha x_1 + \beta x_2$ and $f'(x_1, x_2) = \alpha' x_1 + \beta' x_2$ be two functions in $G$ with $\alpha, \beta, \alpha', \beta' \in \mathbb{F}_q \setminus \{0\}$. Then $x_1$ is recoverable from $f$ and $f'$ if and only if $x_2$ is recoverable from $f$ and $f'$ (i.e., either both messages are recoverable or neither of them is).

Proof. See the appendix.

Corollary 1. Let $f(x_1, x_2) = x_1 + \beta x_2$ and $f'(x_1, x_2) = x_1 + \beta' x_2$ be two functions in $F$ with $\beta, \beta' \in \mathbb{F}_q \setminus \{0\}$. Then $x_1$ is recoverable from $f$ and $f'$ if and only if $x_2$ is recoverable from $f$ and $f'$.

Proof. The corollary follows from Lemma 1 and the fact that $F \subset G$.

Lemma 2. An average throughput $T_c^{\mathrm{avg}}$ is achievable over $F$ if and only if it is achievable over $G$.

Proof. See the appendix.

Remarks

(i) Lemma 2 suggests that it is sufficient to consider the set of functions $F$ and there is no gain in considering $G$.

(ii) With slight modifications in the proofs, it is easy to show that Lemmas 1 and 2 are still valid even if $G$ were the set of all affine functions of $x_1$ and $x_2$ over $\mathbb{F}_q$.

From the definition of $F$, we see that $|F| = q + 1$, and with the aid of Corollary 1, it is straightforward to show that any sink receiving two distinct functions from $F$ will be able to recover both messages. Thus, for $N \le q + 1$ the network is solvable [10], and the average throughput is equal to the min-cut throughput. For $N > q + 1$, a simple pigeonhole argument shows that some source edges will be carrying the same combination of messages. Thus, a receiver with both in-edges carrying the same function of the messages will be able to recover one or none of the messages.
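The claim that any sink receiving two distinct functions from $F$ recovers both messages can be verified exhaustively for a small field. The sketch below is our own check, restricted to prime q so that integer arithmetic mod q forms a field:

```python
from itertools import combinations, product

q = 5  # a small prime, so that arithmetic mod q gives the field F_q

# Represent each function in F by its coefficient pair (alpha, beta):
# f(x1, x2) = alpha*x1 + beta*x2 (mod q).
F = [(0, 1)] + [(1, b) for b in range(q)]  # {x2} U {x1 + beta*x2 : beta in F_q}

def recoverable(f, g):
    """Both messages are recoverable from (f, g) iff the map
    (x1, x2) -> (f(x1,x2), g(x1,x2)) is injective over F_q^2."""
    images = {((f[0] * x1 + f[1] * x2) % q, (g[0] * x1 + g[1] * x2) % q)
              for x1, x2 in product(range(q), repeat=2)}
    return len(images) == q * q

# Every pair of distinct functions in F jointly determines (x1, x2).
assert all(recoverable(f, g) for f, g in combinations(F, 2))
```

Conversely, a sink seeing the same function on both in-edges fails this test, matching the pigeonhole argument above.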

The next proposition determines how the functions $f(x_1, x_2) \in F$ must be distributed among the source out-edges in order to maximize the average throughput (i.e., minimize the loss in throughput). Let $m_i$ be the number of source edges carrying $f_i(x_1, x_2) = x_1 + \beta_i x_2$, for $1 \le i \le q - 1$, with $\beta_i \in \mathbb{F}_q \setminus \{0\}$ and $\beta_i \neq \beta_j$ for all $i \neq j$. With such an assignment of functions to source out-edges, we still have $N - \sum_{i=1}^{q-1} m_i$ unused source out-edges and two remaining functions in $F$, namely $f(x_1, x_2) = x_1$ and $f(x_1, x_2) = x_2$. Let $m_0$ and $m_0'$ be the number of source edges carrying $x_1$ and $x_2$, respectively.

Since there is no preference in recovering one message over the other (both messages are equally important to each destination), a maximum-average-throughput achieving assignment must have $N - \sum_{i=1}^{q-1} m_i$ equally divided between $m_0$ and $m_0'$, that is, $m_0 = \lfloor (N - \sum_{i=1}^{q-1} m_i)/2 \rfloor$ and $m_0' = \lfloor (N - \sum_{i=1}^{q-1} m_i + 1)/2 \rfloor$. Now, if a sink has both its in-edges carrying $x_1$, then it cannot recover $x_2$; thus there are $\binom{m_0}{2}$ sinks which cannot recover $x_2$. Similarly, there are $\binom{m_0'}{2}$ destinations which cannot recover $x_1$. Finally, a destination receiving $f_i(x_1, x_2) = x_1 + \beta_i x_2$, $\beta_i \neq 0$, on both of its in-edges will not recover either message, and so there is a loss of $2\binom{m_i}{2}$. Hence, the total loss in throughput is given by

$$L(m_1, \ldots, m_k) = \binom{\left\lfloor \frac{N - \sum_{i=1}^{k} m_i}{2} \right\rfloor}{2} + \binom{\left\lfloor \frac{N - \sum_{i=1}^{k} m_i + 1}{2} \right\rfloor}{2} + 2 \sum_{i=1}^{k} \binom{m_i}{2}, \quad (6)$$

where $k = q - 1$, and the average throughput, as a function of $m_1, \ldots, m_k$, is given by

$$T_c^{\mathrm{avg}}(m_1, \ldots, m_k) = \frac{1}{\binom{N}{2}} \left[ 2\binom{N}{2} - L(m_1, \ldots, m_k) \right]. \quad (7)$$
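Equations (6) and (7) are straightforward to evaluate. The following sketch (our own brute force, not part of the paper) computes $L$ and $T_c^{\mathrm{avg}}$ for a given assignment and searches all assignments of a small instance; consistent with Proposition 1 below, an all-equal assignment attains the minimum loss.

```python
from itertools import product
from math import comb

def loss(N, ms):
    """Total throughput loss L(m1,...,mk) of equation (6)."""
    rem = N - sum(ms)                   # edges left for x1 and x2
    m0, m0p = rem // 2, (rem + 1) // 2  # floor(rem/2) and floor((rem+1)/2)
    return comb(m0, 2) + comb(m0p, 2) + 2 * sum(comb(m, 2) for m in ms)

def t_avg(N, ms):
    """Average throughput T_c^avg(m1,...,mk) of equation (7)."""
    pairs = comb(N, 2)                  # number of sinks
    return (2 * pairs - loss(N, ms)) / pairs

# Brute-force search of a small instance: q = 5 (so k = 4), N = 9.
N, k = 9, 4
best = min((ms for ms in product(range(N + 1), repeat=k) if sum(ms) <= N),
           key=lambda ms: loss(N, ms))
```

For this instance the all-equal assignment (1, 1, 1, 1) achieves the minimum loss found by the search (ties with unequal assignments are possible, so we compare losses, not tuples).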

Before we present the proposition, we need the following lemma.

Lemma 3. Let

$$A = \begin{pmatrix} a & 1 & \cdots & 1 \\ 1 & a & \cdots & 1 \\ \vdots & & \ddots & \vdots \\ 1 & \cdots & 1 & a \end{pmatrix} \quad (8)$$

be a $k \times k$ matrix with $a \in \mathbb{R}$, $a \neq 1$ and $a \neq -(k-1)$. Then

$$A^{-1} = \frac{1}{(a-1)(a+k-1)} \begin{pmatrix} a + (k-2) & -1 & \cdots & -1 \\ -1 & a + (k-2) & \cdots & -1 \\ \vdots & & \ddots & \vdots \\ -1 & \cdots & -1 & a + (k-2) \end{pmatrix}. \quad (9)$$

Proof. See the appendix.
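Lemma 3 can also be sanity-checked numerically. The sketch below (a stdlib-only check of ours, independent of the proof in the appendix) builds $A$ and the claimed inverse from (9) and verifies that their product is the identity:

```python
def lemma3_check(a, k, tol=1e-9):
    """Verify that the claimed inverse of A = (a on the diagonal,
    1 elsewhere) from equation (9) satisfies A @ A_inv == I."""
    A = [[a if i == j else 1.0 for j in range(k)] for i in range(k)]
    c = 1.0 / ((a - 1) * (a + k - 1))
    A_inv = [[c * ((a + k - 2) if i == j else -1.0) for j in range(k)]
             for i in range(k)]
    for i in range(k):
        for j in range(k):
            s = sum(A[i][l] * A_inv[l][j] for l in range(k))
            if abs(s - (1.0 if i == j else 0.0)) > tol:
                return False
    return True
```

The case a = 5 is the one used in the proof of Proposition 1 below.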

Proposition 1. The average linear network coding throughput of network $N$ is maximized when $m_1 = m_2 = \cdots = m_k := m^*_{\mathrm{int}}$, where $\lfloor (N+1)/(k+4) \rfloor \le m^*_{\mathrm{int}} \le \lfloor (N+1)/(k+4) \rfloor + 1$.


Proof. From (6), we can write

$$L(m_1, \ldots, m_k) = \frac{1}{2}\left[\left(\frac{N - \sum_{i=1}^{k} m_i - \delta}{2}\right)\left(\frac{N - \sum_{i=1}^{k} m_i - \delta}{2} - 1\right) + \left(\frac{N - \sum_{i=1}^{k} m_i + \delta}{2}\right)\left(\frac{N - \sum_{i=1}^{k} m_i + \delta}{2} - 1\right) + \frac{4\sum_{i=1}^{k} m_i(m_i - 1)}{2}\right], \quad (10)$$

where $\delta = 0$ if $N - \sum_{i=1}^{k} m_i$ is even and $\delta = 1$ if it is odd. This reduces to

$$L(m_1, \ldots, m_k) = \frac{1}{4}\left[\left(N - \sum_{i=1}^{k} m_i\right)^2 - 2\left(N - \sum_{i=1}^{k} m_i\right) + 4\sum_{i=1}^{k} m_i(m_i - 1) + \delta\right]. \quad (11)$$

For the moment, we relax the constraint that $m_1, \ldots, m_k$ must be integer valued. We also relax the constraint that $(N - \sum_{i=1}^{k} m_i)/2$ and $(N - \sum_{i=1}^{k} m_i + 1)/2$ must be integers (note that with this relaxation $\delta$ disappears from (11)). Now, we compute the partial derivative of (11) with respect to $m_j$, for $1 \le j \le k$, and equate it to zero. Thus, we obtain

$$5 m_j + \sum_{i \neq j} m_i = N + 1. \quad (12)$$

This can equivalently be written as

$$(m_1\ m_2\ \cdots\ m_k)\, A = (N + 1)(1\ 1\ \cdots\ 1), \quad (13)$$

where $A = 4I + \mathbf{1}$, $I$ is the $k \times k$ identity matrix, and $\mathbf{1}$ is the all-ones $k \times k$ matrix. Thus,

$$(m_1\ m_2\ \cdots\ m_k) = (N + 1)(1\ 1\ \cdots\ 1)\, A^{-1}, \quad (14)$$

and from Lemma 3 (using $a = 5$), we obtain

$$(m_1\ m_2\ \cdots\ m_k) = \frac{N + 1}{k + 4}(1\ 1\ \cdots\ 1), \quad (15)$$

that is, $m_1 = m_2 = \cdots = m_k = (N + 1)/(k + 4)$. Noting that $L(m_1, \ldots, m_k)$ is a convex function of $m_1, \ldots, m_k$, we know that the integer value $m^*_{\mathrm{int}}$ of $m_1, \ldots, m_k$ which minimizes $L(m_1, \ldots, m_k)$ is bounded as

$$\left\lfloor \frac{N + 1}{k + 4} \right\rfloor \le m^*_{\mathrm{int}} \le \left\lfloor \frac{N + 1}{k + 4} \right\rfloor + 1, \quad (16)$$

as required.

Since $T_c^{\mathrm{avg}} = \max_{m_1, \ldots, m_k} T_c^{\mathrm{avg}}(m_1, \ldots, m_k)$ and $T_c^{\mathrm{avg}}(m_1, \ldots, m_k)$ is maximized by the choice of $m_1, \ldots, m_k$ as in Proposition 1, $T_c^{\mathrm{avg}}$ is completely specified by $m^*_{\mathrm{int}}$. The following proposition establishes a simple relation between $m^*_{\mathrm{int}}$ and the field size $q$.

Proposition 2. In a combination network with $N$ intermediate nodes, the value $m^*_{\mathrm{int}}$ that maximizes the average linear network coding throughput over $\mathbb{F}_q$ is given by

$$m^*_{\mathrm{int}} = \left\lfloor \frac{N + 2 + \lfloor q/2 \rfloor}{q + 3} \right\rfloor, \quad (17)$$

where $2 < q < N - 1$.

Proof. From (7) and (11), and by substituting $m_1 = m_2 = \cdots = m_k := m$, we obtain

$$T_c^{\mathrm{avg}}(m) = \frac{1}{2N(N-1)}\left[4N(N-1) - (N - km)^2 + 2(N - km) - 4km(m - 1) - \delta\right], \quad (18)$$

where $\delta = 0$ if $N - km$ is even and $\delta = 1$ if $N - km$ is odd. Since $\delta$ plays a role in the next derivations, we will write $\delta(N, m)$ to emphasize its dependence on $N$ and $m$. From (18), we can write

$$T_c^{\mathrm{avg}}(m) = \frac{1}{2N(N-1)}\, h(m), \quad (19)$$

where

$$h(m) = -(k^2 + 4k)m^2 + 2(N + 1)km + N(3N - 2) - \delta(N, m). \quad (20)$$

Let $m_a = \lfloor (N+1)/(k+4) \rfloor$ and $m_b = m_a + 1 = \lfloor (N+1)/(k+4) \rfloor + 1$; then from Proposition 1 we know that $m^*_{\mathrm{int}}$ is either $m_a$ or $m_b$. Thus, we can write

$$m^*_{\mathrm{int}} = \arg\max_{m \in \{m_a, m_b\}} T_c^{\mathrm{avg}}(m) = \arg\max_{m \in \{m_a, m_b\}} h(m). \quad (21)$$

In what follows, we assume $k > 1$ (the case $k = 1$ was solved in [7]). Now consider the following two possibilities.

Possibility A (k is odd)

Note that in this case the field size $q$ is even, since $k = q - 1$. Thus $q = 2^n$ for some integer $n > 1$; that is, $\mathbb{F}_q$ is an extension field of characteristic 2. Depending on the parities of $N$ and $m_a$ (or equivalently $m_b$), the following four cases arise.

Case 1. Both $N$ and $m_a$ are even. Noting that for this case $\delta(N, m_a) = 0$ and $\delta(N, m_b) = 1$, from (20) and the fact that $m_b = m_a + 1$ we obtain

$$h(m_b) = h(m_a) - k(k + 4)(2m_a + 1) + 2(N + 1)k - 1. \quad (22)$$

Since $m_a = \lfloor (N+1)/(k+4) \rfloor = (N + 1 - \varepsilon)/(k + 4)$ for some $\varepsilon \in \{0, 1, \ldots, k + 3\}$, we obtain

$$h(m_b) = h(m_a) + k\big(2\varepsilon - (k + 4)\big) - 1. \quad (23)$$

From this and (21), we see that $m^*_{\mathrm{int}} = m_a$ if $k(2\varepsilon - (k + 4)) - 1 < 0$, and $m^*_{\mathrm{int}} = m_b$ if $k(2\varepsilon - (k + 4)) - 1 > 0$. But $k(2\varepsilon - (k + 4)) - 1 > 0$ if and only if $\varepsilon > (k + 4)/2 + 1/2k$. Thus,

$$m^*_{\mathrm{int}} = \begin{cases} m_a, & \varepsilon < \dfrac{k + 4}{2} + \dfrac{1}{2k}, \\[4pt] m_b, & \varepsilon > \dfrac{k + 4}{2} + \dfrac{1}{2k}. \end{cases} \quad (24)$$

Now, we impose more structure on $\varepsilon$. Since $m_a = (N + 1 - \varepsilon)/(k + 4)$, and noting that $m_a$ is even while $N + 1$ and $k + 4$ are odd, $\varepsilon$ must be odd; that is, $\varepsilon \in \{1, 3, \ldots, k, k + 2\}$. From this and (24), we get

$$m^*_{\mathrm{int}} = \begin{cases} m_a, & \varepsilon \in \left\{1, 3, 5, \ldots, \dfrac{k - 1}{2}, \dfrac{k + 3}{2}\right\}, \\[4pt] m_b, & \varepsilon \in \left\{\dfrac{k + 7}{2}, \dfrac{k + 11}{2}, \ldots, k, k + 2\right\}. \end{cases} \quad (25)$$

Case 2. Both $N$ and $m_a$ are odd. A result identical to (25) can be obtained.

Case 3. N is even, and ma is odd. For this case, we haveδ(N ,ma) = 1 and δ(N ,mb) = 0 and from (20) we can write

h(mb) = h

(ma)

+ k(2ε − (k + 4)

)+ 1, (26)

sincema = (N +1−ε)/(k+4) and from the fact that ma, N +1, and k + 4 are all odd, then ε must be even, that is, ε ∈{0, 2, . . . , k + 1, k + 3}.

Now, k(2ε − (k + 4)) + 1 > 0 if and only if ε > (k + 4)/2 − 1/(2k). Thus,

m∗_int = m_a if ε ∈ {0, 2, 4, . . . , (k − 3)/2, (k + 1)/2}, and m∗_int = m_b if ε ∈ {(k + 5)/2, (k + 9)/2, . . . , k + 1, k + 3}. (27)

Case 4. N is odd and m_a is even. An identical result to (27) can be obtained.

Combining the previous four cases, we can write, for any odd k > 1,

m∗_int = m_a = ⌊(N + 1)/(k + 4)⌋ if ε ∈ {0, 1, 2, 3, . . . , (k + 1)/2, (k + 3)/2}, and m∗_int = m_b = ⌊(N + 1)/(k + 4)⌋ + 1 if ε ∈ {(k + 5)/2, (k + 7)/2, . . . , k + 2, k + 3}. (28)

Or, more compactly,

m∗_int = ⌊(N + 2 + (k + 1)/2)/(k + 4)⌋. (29)

Possibility B (k is even)

Note that in this case the field size q is odd. Thus, q = p^n for some prime p ≠ 2 and integer n ≥ 1. As in Possibility A, the following four cases arise.

Case 1. Both N and m_a are even. In this case, δ(N, m_a) = δ(N, m_b) = 0, and from (20) we can write

h(m_b) = h(m_a) + k(2ε − (k + 4)). (30)

Since m_a is even, N + 1 is odd, and k + 4 is even, we deduce that ε must be odd. Combining this with (30), (21), and the fact that k(2ε − (k + 4)) < 0 if and only if ε < (k + 4)/2, we obtain, for k/2 even (i.e., k divisible by 4),

m∗_int = m_a if ε ∈ {1, 3, 5, . . . , (k − 2)/2, (k + 2)/2}, and m∗_int = m_b if ε ∈ {(k + 6)/2, (k + 10)/2, . . . , k + 3}, (31)

and, for k/2 odd,

m∗_int = m_a if ε ∈ {1, 3, 5, . . . , k/2, (k + 4)/2}, and m∗_int = m_b if ε ∈ {(k + 8)/2, (k + 12)/2, . . . , k + 3}. (32)

Case 2. Both N and m_a are odd. Following the same steps as before and noting that ε is even in this case, we obtain, for k/2 even,

m∗_int = m_a if ε ∈ {0, 2, 4, . . . , k/2, (k + 4)/2}, and m∗_int = m_b if ε ∈ {(k + 8)/2, (k + 12)/2, . . . , k, k + 2}, (33)

and, for k/2 odd,

m∗_int = m_a if ε ∈ {0, 2, 4, . . . , (k − 2)/2, (k + 2)/2}, and m∗_int = m_b if ε ∈ {(k + 6)/2, (k + 10)/2, . . . , k, k + 2}. (34)

Case 3. N is even and m_a is odd. It can be shown that m∗_int in this case is given by relations identical to (31) and (32).

Case 4. N is odd and m_a is even. This case can be shown to be similar to Case 2, that is, m∗_int is given by (33) and (34).

Combining the previous four cases, we can write, for any even k > 1,

m∗_int = m_a = ⌊(N + 1)/(k + 4)⌋ if ε ∈ {0, 1, 2, 3, . . . , k/2, (k + 2)/2, (k + 4)/2}, and m∗_int = m_b = ⌊(N + 1)/(k + 4)⌋ + 1 if ε ∈ {(k + 6)/2, (k + 8)/2, . . . , k + 2, k + 3}. (35)

This can also be written as

m∗_int = ⌊(N + 2 + k/2)/(k + 4)⌋. (36)

From (29) and (36), we obtain, for any 1 < k < N − 2,

m∗_int = ⌊(N + 2 + ⌊(k + 1)/2⌋)/(k + 4)⌋. (37)

Substituting k = q − 1, the proposition follows.
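As a quick numerical cross-check of the case analysis (the helper names below are ours, not the paper's), the following sketch verifies that the parity-based selections (28) and (35) agree with the unified closed form (37) for all 1 < k < N − 2:

```python
def m_star_closed(N, k):
    # Closed form (37): m*_int = floor((N + 2 + floor((k + 1)/2)) / (k + 4)).
    return (N + 2 + (k + 1) // 2) // (k + 4)

def m_star_cases(N, k):
    # Parity-based rules (28) (k odd) and (35) (k even), expressed via the
    # remainder eps defined by N + 1 = (k + 4) * m_a + eps.
    m_a = (N + 1) // (k + 4)
    eps = (N + 1) - (k + 4) * m_a
    threshold = (k + 3) // 2 if k % 2 == 1 else (k + 4) // 2
    return m_a if eps <= threshold else m_a + 1

# The two descriptions agree for every 1 < k < N - 2.
for N in range(6, 60):
    for k in range(2, N - 2):
        assert m_star_cases(N, k) == m_star_closed(N, k), (N, k)
print("(28) and (35) agree with the closed form (37)")
```
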


From the work in [7] for q = 2 and the results presented in this work, we obtain the following corollary, which characterizes T^avg_c over any finite field.

Corollary 2. For a combination network with N intermediate nodes and min-cut 2, the maximum achievable average linear network coding throughput over F_q is given by

T^avg_c = 2 for q ≥ N − 1, and, writing C(a, 2) = a(a − 1)/2,

T^avg_c = (1/C(N, 2))[2 C(N, 2) − L(m∗_q)] for 2 ≤ q < N − 1, (38)

where

L(m∗_q) = C(⌊(N − k m∗_q)/2⌋, 2) + C(⌊(N − k m∗_q + 1)/2⌋, 2) + 2k C(m∗_q, 2),

m∗_q = ⌊(N + 2 + ⌊q/2⌋)/(q + 3)⌋ for 2 < q < N − 1, and m∗_q = ⌊(N + 2)/5⌋ for q = 2. (39)

Proof. For q ≥ N − 1, the result is immediate since the network is solvable. For 2 < q < N − 1, the claim follows from (6) and (7), where from Proposition 1 we know that the average throughput is maximized when m_1, . . . , m_k are equal. Hence, we substitute m∗_q for m_1, . . . , m_k in (6), where m∗_q is the integer value which maximizes the average throughput, obtained in Proposition 2 (where it was denoted m∗_int). Finally, for q = 2 the result was proven in [7].
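Corollary 2 translates directly into a short computation. The sketch below (function name ours) evaluates T^avg_c from (38) and (39) using Python's math.comb for the binomial coefficients:

```python
from math import comb

def t_avg_c(N, q):
    # Average linear network coding throughput from (38)-(39), with k = q - 1.
    if q >= N - 1:
        return 2.0
    k = q - 1
    m = (N + 2) // 5 if q == 2 else (N + 2 + q // 2) // (q + 3)
    L = (comb((N - k * m) // 2, 2) + comb((N - k * m + 1) // 2, 2)
         + 2 * k * comb(m, 2))
    return (2 * comb(N, 2) - L) / comb(N, 2)

print(t_avg_c(5, 2))   # with N = 5, q = 2: m* = 1 and L(m*) = 2, giving 1.8
```

For example, t_avg_c(5, 2) = 2 − 2/10 = 1.8, consistent with the q = 2 curve in Figure 1.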

Figure 1 shows T^avg_c as a function of the number of intermediate nodes N for different field sizes. It also shows the average integer routing throughput T^avg_i.

APPENDIX

PROOFS

Proof of Lemma 1. Assume x_1 is recoverable from f and f′. Then x_2 = β^{−1}(f − αx_1), where β^{−1} exists since β ≠ 0 and F_q \ {0} is a group under multiplication. Thus, x_2 is recoverable. The proof of the other direction is similar.

Proof of Lemma 2. The forward implication is obvious since F ⊂ G. To prove the reverse implication, assume that there exists a code valuation C = {g_e(x_1, x_2) : g_e(x_1, x_2) ∈ G, for all e ∈ E} achieving average throughput T^avg_c. Consider an indexing set I = {1, 2, . . . , N} on the source out-edges (and, equivalently, on the intermediate nodes). Since each intermediate node in N has only one in-edge, intermediate nodes merely forward what they receive on their in-edges to their out-edges. Thus, C is uniquely specified by the functions carried by the source out-edges. Hence, we may consider C = {g_i(x_1, x_2) : g_i(x_1, x_2) ∈ G, for all i ∈ I}. For any i ∈ I, if g_i(x_1, x_2) = α_i x_1 + β_i x_2 ∈ C with α_i = β_i = 0,

[Plot omitted: average throughput (vertical axis, 1.5 to 2) versus N (horizontal axis, 0 to 50), with curves for q = 32, 16, 13, 11, 9, 8, 7, 5, 4, 3, 2 and for T^avg_i.]

Figure 1: Average linear network coding throughput.

then g_i can be replaced with any function without reducing the average throughput. Thus, we can assume that α_i and β_i are not both zero. Now, let A be the subset of I whose elements are the indices of source edges carrying functions g_i(x_1, x_2) = α_i x_1, that is, β_i = 0 for all i ∈ A. Similarly, let B be the subset of I such that g_i = β_i x_2 for all i ∈ B. Obviously, A and B are disjoint since α_i and β_i are not both zero for any i ∈ I. Finally, let C be the subset of I whose elements are the indices of all source edges carrying functions of the form g_i = α_i x_1 + β_i x_2 with α_i ≠ 0 and β_i ≠ 0. Clearly, A, B, and C partition I.

Now, design a code over F such that f_i(x_1, x_2) = x_1 for all i ∈ A, f_i(x_1, x_2) = x_2 for all i ∈ B, and f_i(x_1, x_2) = x_1 + α_i^{−1}β_i x_2 for all i ∈ C. The existence of f_i in the given form for i ∈ C is guaranteed since α_i ≠ 0, β_i ≠ 0, and F_q \ {0} is a group under multiplication.

Now, for all i, j ∈ I, consider the sink t_ij ∈ T whose incoming edges originate from intermediate nodes i and j.

(i) If i, j ∈ A, then t_ij is able to recover only x_1 from g_i and g_j, which is still the case if g_i and g_j are replaced by f_i and f_j. The same argument holds if i, j ∈ B, with x_2 replacing x_1.

(ii) If i ∈ A and j ∈ B, then f_i and f_j make x_1 and x_2 available to t_ij, so both messages are recoverable and there is no loss in throughput due to replacing g_i and g_j with f_i and f_j.

(iii) If i ∈ A and j ∈ C, then f_i makes x_1 available to t_ij, which can be used with f_j to recover x_2. Thus, both messages are recoverable, and there is no loss in considering F instead of G. A similar argument holds for i ∈ B and j ∈ C.

(iv) If i, j ∈ C, then from g_i and g_j we can write γx_2 = α_i g_j − α_j g_i, where γ = α_i β_j − α_j β_i. Hence, x_2 is not recoverable if and only if γ = 0. From this and Lemma 1 (since α_i, β_i, α_j, β_j ∈ F_q \ {0}), we know that both x_1 and x_2 are not recoverable if and only if γ = 0. Also, from f_i and f_j we can write γ′x_2 = f_j − f_i, where γ′ = α_j^{−1}β_j − α_i^{−1}β_i, and the same argument for γ holds for γ′. Thus, we need to show that γ = 0 if and only if γ′ = 0. To this end, note that

γ = 0 ⟺ α_i β_j = α_j β_i ⟺ β_j = α_i^{−1}α_j β_i ⟺ β_j = α_j α_i^{−1}β_i ⟺ α_j^{−1}β_j = α_i^{−1}β_i ⟺ γ′ = 0. (A.1)

Hence, there is no loss in considering F instead of G in any of the previous cases. The lemma follows by noting that the previous cases represent all possibilities of receiving a pair of functions by any sink.
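The chain of equivalences (A.1) can also be confirmed by brute force over a small prime field; the check below uses F_5 (variable names are ours):

```python
# Brute-force check of (A.1) over the prime field F_p, here p = 5: for
# nonzero alpha_i, beta_i, alpha_j, beta_j, the determinant
# gamma = alpha_i*beta_j - alpha_j*beta_i vanishes exactly when
# gamma' = alpha_j^(-1)*beta_j - alpha_i^(-1)*beta_i does.
p = 5
units = range(1, p)
for ai in units:
    for bi in units:
        for aj in units:
            for bj in units:
                gamma = (ai * bj - aj * bi) % p
                gamma_prime = (pow(aj, -1, p) * bj - pow(ai, -1, p) * bi) % p
                assert (gamma == 0) == (gamma_prime == 0)
print("gamma = 0 iff gamma' = 0 over F_5")
```
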

Remarks

(i) With a slight modification in the last step of the proof, the lemma can be shown to remain true even if the alphabet is a finite division ring (a skew field) instead of a field.

(ii) It is possible to show that the lemma holds for any multicast network N with min-cut 2. The proof in this case can be of the same nature as the proof presented in [11] for the sufficiency of homogeneous functions.

Proof of Lemma 3. Note that A can be written as A = (a − 1)I + 1, where I is the k × k identity matrix and 1 is the all-ones k × k matrix (1_ij = 1 for all 1 ≤ i, j ≤ k). Let B be another k × k matrix such that AB = I. Assume that B can be written as B = (bI − 1)c, where b and c are scalars. Thus,

AB = ((a − 1)I + 1)(bI − 1)c = ((a − 1)bI + b1 − (a − 1)1 − k1)c. (A.2)

For the product AB to equal I, we need b1 − (a − 1)1 − k1 = 0, where 0 is the all-zero matrix. This is satisfied by choosing

b = a + k − 1. (A.3)

We also need

c(a − 1)b = c(a − 1)(a + k − 1) = 1. (A.4)

Since a ≠ 1 and a ≠ −(k − 1), we obtain c = 1/((a − 1)(a + k − 1)). Thus,

B = (1/((a − 1)(a + k − 1)))((a + k − 1)I − 1). (A.5)

It is easy to check that BA = I as well; thus A^{−1} = B.
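The inverse formula (A.5) is easy to verify with exact rational arithmetic; the sketch below (helper name ours) rebuilds A and B for sample values of k and a and checks AB = I:

```python
from fractions import Fraction

def check_inverse(k, a):
    # A = (a - 1)I + 1 (all-ones matrix added to a scaled identity);
    # B from (A.5). Requires a != 1 and a != -(k - 1) so c is well defined.
    c = Fraction(1, (a - 1) * (a + k - 1))
    A = [[(a - 1) * (i == j) + 1 for j in range(k)] for i in range(k)]
    B = [[c * ((a + k - 1) * (i == j) - 1) for j in range(k)] for i in range(k)]
    AB = [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(k)]
          for i in range(k)]
    return all(AB[i][j] == (i == j) for i in range(k) for j in range(k))

assert check_inverse(4, 3) and check_inverse(5, -2)
print("A^(-1) matches (A.5)")
```
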

REFERENCES

[1] R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung, “Network information flow,” IEEE Transactions on Information Theory, vol. 46, no. 4, pp. 1204–1216, 2000.

[2] S.-Y. R. Li, R. W. Yeung, and N. Cai, “Linear network coding,” IEEE Transactions on Information Theory, vol. 49, no. 2, pp. 371–381, 2003.

[3] P. Sanders, S. Egner, and L. Tolhuizen, “Polynomial time algorithms for network information flow,” in Proceedings of the 15th Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’03), pp. 286–294, San Diego, Calif, USA, June 2003.

[4] A. Rasala-Lehman and E. Lehman, “Complexity classification of network information flow problems,” in Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’04), pp. 142–150, New Orleans, La, USA, January 2004.

[5] A. Rasala-Lehman, “Network coding,” Ph.D. dissertation, Department of Electrical Engineering and Computer Science, MIT, Cambridge, Mass, USA, 2005.

[6] C. Chekuri, C. Fragouli, and E. Soljanin, “On average throughput and alphabet size in network coding,” IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2410–2424, 2006.

[7] A. Al-Bashabsheh and A. Yongacoglu, “Average throughput with linear network coding over the binary field,” in Proceedings of the IEEE Information Theory Workshop (ITW ’07), pp. 90–95, Tahoe City, Calif, USA, September 2007.

[8] R. Dougherty, C. Freiling, and K. Zeger, “Insufficiency of linear coding in network information flow,” IEEE Transactions on Information Theory, vol. 51, no. 8, pp. 2745–2759, 2005.

[9] Y. Cassuto and J. Bruck, “Network coding for non-uniform demands,” in Proceedings of the International Symposium on Information Theory (ISIT ’05), pp. 1720–1724, Adelaide, SA, Australia, September 2005.

[10] C. Fragouli and E. Soljanin, “Information flow decomposition for network coding,” IEEE Transactions on Information Theory, vol. 52, no. 3, pp. 829–848, 2006.

[11] R. Dougherty, C. Freiling, and K. Zeger, “Linearity and solvability in multicast networks,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2243–2256, 2004.


Hindawi Publishing Corporation
EURASIP Journal on Wireless Communications and Networking
Volume 2008, Article ID 754021, 13 pages
doi:10.1155/2008/754021

Research Article
MacWilliams Identity for Codes with the Rank Metric

Maximilien Gadouleau and Zhiyuan Yan

Department of Electrical and Computer Engineering, Lehigh University, Bethlehem, PA 18015, USA

Correspondence should be addressed to Maximilien Gadouleau, [email protected]

Received 10 November 2007; Accepted 3 March 2008

Recommended by Andrej Stefanov

The MacWilliams identity, which relates the weight distribution of a code to the weight distribution of its dual code, is useful in determining the weight distribution of codes. In this paper, we derive the MacWilliams identity for linear codes with the rank metric, and our identity has a different form than that by Delsarte. Using our MacWilliams identity, we also derive related identities for rank metric codes. These identities parallel the binomial and power moment identities derived for codes with the Hamming metric.

Copyright © 2008 M. Gadouleau and Z. Yan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The MacWilliams identity for codes with the Hamming metric [1], which relates the Hamming weight distribution of a code to the weight distribution of its dual code, is useful in determining the Hamming weight distribution of codes. This is because if the dual code has a small number of codewords or equivalence classes of codewords under some known permutation group, its weight distribution can be obtained by exhaustive examination. It also leads to other identities for the weight distribution, such as the Pless identities [1, 2].

Although the rank has long been known to be a metric implicitly and explicitly (e.g., see [3]), the rank metric was first considered for error-control codes (ECCs) by Delsarte [4]. The potential applications of rank metric codes to wireless communications [5, 6], public-key cryptosystems [7], and storage equipment [8, 9] have motivated a steady stream of works [8–20] that focus on their properties. The majority of previous works focus on rank distance properties, code construction, and efficient decoding of rank metric codes, and the seminal works in [4, 9, 10] have made significant contributions to these topics. Independently in [4, 9, 10], a Singleton bound (up to some variations) on the minimum rank distance of codes was established, and a class of codes achieving the bound with equality was constructed. We refer to this class of codes as Gabidulin codes henceforth. In [4, 10], analytical expressions to compute the weight distribution of linear codes achieving the Singleton bound with equality were also derived. In [8], it was shown that Gabidulin codes are optimal for correcting crisscross errors (referred to as lattice-pattern errors in [8]). In [9], it was shown that Gabidulin codes are also optimal in the sense of a Singleton bound in crisscross weight, a metric considered in [9, 12, 21] for crisscross errors. Decoding algorithms were introduced for Gabidulin codes in [9, 10, 22, 23].

In [4], the counterpart of the MacWilliams identity, which relates the rank distance enumerator of a code to that of its dual code, was established using association schemes. However, Delsarte's work lacks an expression of the rank weight enumerator of the dual code as a functional transformation of the enumerator of the code. In [24, 25], Grant and Varanasi defined a different rank weight enumerator and established a functional transformation between the rank weight enumerator of a code and that of its dual code.

In this paper, we show that, similar to the MacWilliams identity for the Hamming metric, the rank weight distribution of any linear code can be expressed as a functional transformation of that of its dual code. It is remarkable that our MacWilliams identity for the rank metric has a form similar to that for the Hamming metric. Similarly, an intermediate result of our proof is that the rank weight enumerator of the dual of any vector depends on only the rank weight of the vector and is related to the rank weight enumerator of a maximum rank distance (MRD) code. We also derive additional identities that relate moments of the rank weight distribution of a linear code to those of its dual code.


Our work in this paper differs from those in [4, 24, 25] in several aspects.

(i) In this paper, we consider a rank weight enumerator different from that in [24, 25], and solve the original problem of determining the functional transformation of rank weight enumerators between dual codes as defined by Delsarte.

(ii) Our proof, based on character theory, does not require the use of association schemes as in [4] or combinatorial arguments as in [24, 25].

(iii) In [4], the MacWilliams identity is given between the rank distance enumerator sequences of two dual array codes using the generalized Krawtchouk polynomials. Our identity is equivalent to that in [4] for linear rank metric codes, although our identity is expressed using different parameters, which are shown to be the generalized Krawtchouk polynomials as well. We also present this identity in the form of a functional transformation (cf. Theorem 1). In such a form, the MacWilliams identities for both the rank and the Hamming metrics are similar to each other.

(iv) The functional transformation form allows us to derive further identities (cf. Section 4) between the rank weight distributions of linear dual codes. We would like to stress that the identities between the moments of the rank distribution proved in this paper are novel and were not considered in the aforementioned papers.

We remark that both the matrix form [4, 9] and the vector form [10] for rank metric codes have been considered in the literature. Following [10], in this paper the vector form over GF(q^m) is used for rank metric codes, although their rank weight is defined by their corresponding code matrices over GF(q) [10]. The vector form is chosen in this paper since our results and their derivations for rank metric codes can be readily related to their counterparts for Hamming metric codes.

The rest of the paper is organized as follows. Section 2 reviews some necessary background. In Section 3, we establish the MacWilliams identity for the rank metric. We finally study the moments of the rank distributions of linear codes in Section 4.

2. PRELIMINARIES

2.1. Rank metric, MRD codes, and rank weight enumerator

Consider an n-dimensional vector x = (x_0, x_1, . . . , x_{n−1}) ∈ GF(q^m)^n. The field GF(q^m) may be viewed as an m-dimensional vector space over GF(q). The rank weight of x, denoted as rk(x), is defined to be the maximum number of coordinates in x that are linearly independent over GF(q) [10]. Note that all ranks are with respect to GF(q) unless otherwise specified in this paper. The coordinates of x thus span a linear subspace of GF(q^m), denoted as S(x), with dimension equal to rk(x). For all x, y ∈ GF(q^m)^n, it is easily verified that d_R(x, y) def= rk(x − y) is a metric over GF(q^m)^n [10], referred to as the rank metric henceforth. The minimum rank distance of a code C, denoted as d_R(C), is simply the minimum rank distance over all possible pairs of distinct codewords. When there is no ambiguity about C, we denote the minimum rank distance as d_R.

Combining the bounds in [10, 26] and generalizing slightly to account for nonlinear codes, we can show that the cardinality K of a code C over GF(q^m) with length n and minimum rank distance d_R satisfies

K ≤ min{q^{m(n−d_R+1)}, q^{n(m−d_R+1)}}. (1)

In this paper, we call the bound in (1) the Singleton bound for codes with the rank metric, and refer to codes that attain the Singleton bound as maximum rank distance (MRD) codes. We refer to MRD codes over GF(q^m) with length n ≤ m and with length n > m as Class-I and Class-II MRD codes, respectively. For any given parameter set n, m, and d_R, explicit constructions for linear or nonlinear MRD codes exist. For n ≤ m and d_R ≤ n, generalized Gabidulin codes [16] constitute a subclass of linear Class-I MRD codes. For n > m and d_R ≤ m, a Class-II MRD code can be constructed by transposing a generalized Gabidulin code of length m and minimum rank distance d_R over GF(q^n), although this code is not necessarily linear over GF(q^m). When n = lm (l ≥ 2), linear Class-II MRD codes of length n and minimum distance d_R can be constructed by a Cartesian product G^l def= G × · · · × G of an (m, k) linear Class-I MRD code G [26].

For all v ∈ GF(q^m)^n with rank weight r, the rank weight function of v is defined as f_R(v) = y^r x^{n−r}. Let C be a code of length n over GF(q^m). Suppose there are A_i codewords in C with rank weight i (0 ≤ i ≤ n). Then the rank weight enumerator of C, denoted as W^R_C(x, y), is defined to be

W^R_C(x, y) def= Σ_{v∈C} f_R(v) = Σ_{i=0}^n A_i y^i x^{n−i}. (2)

2.2. Hadamard transform

Definition 1 (see [1]). LetC be the field of complex numbers.Let a ∈ GF(qm) and let {1,α1, . . . ,αm−1} be a basis set ofGF(qm). We thus have a = a0 +a1α1 +· · ·+am−1αm−1, whereai ∈ GF(q) for 0 ≤ i ≤ m − 1. Finally, letting ζ ∈ C be a

primitive qth root of unity, χ(a)def= ζa0 maps GF(qm) to C.

Definition 2 (Hadamard transform [1]). For a mapping f from GF(q^m)^n to C, the Hadamard transform of f, denoted as f̂, is defined to be

f̂(v) def= Σ_{u∈GF(q^m)^n} χ(u · v) f(u), (3)

where u · v denotes the inner product of u and v.


2.3. Notations

In order to simplify notations, we will occasionally denotethe vector space GF(qm)n as F. We denote the number ofvectors of rank u (0 ≤ u ≤ min{m,n}) in GF(qm)n asNu(qm,n). It can be shown thatNu(qm,n) = [ nu ]α(m,u) [10],

where α(m, 0)def= 1 and α(m,u)

def= ∏u−1i=0 (qm − qi) for u ≥ 1.

The [ nu ] term is often referred to as a Gaussian polynomial

[27], defined as [ nu ]def= α(n,u)/α(u,u). Note that [ nu ] is

the number of u-dimensional linear subspaces of GF(q)n.

We also define β(m, 0)def= 1 and β(m,u)

def= ∏u−1i=0 [m−i1 ]

for u ≥ 1. These terms are closely related to Gaussianpolynomials: β(m,u) = [mu ]β(u,u) and β(m + u,m + u) =[m+u

u ]β(m,m)β(u,u). Finally, σidef= i(i− 1)/2 for i ≥ 0.

3. MACWILLIAMS IDENTITY FOR THE RANK METRIC

3.1. q-product, q-transform, and q-derivative

In order to express the MacWilliams identity in polynomial form, as well as to derive other identities, we introduce several operations on homogeneous polynomials.

Let a(x, y; m) = Σ_{i=0}^r a_i(m) y^i x^{r−i} and b(x, y; m) = Σ_{j=0}^s b_j(m) y^j x^{s−j} be two homogeneous polynomials in x and y of degrees r and s, respectively, with coefficients a_i(m) and b_j(m), respectively. The a_i(m) and b_j(m) for i, j ≥ 0 are in turn real functions of m, and are assumed to be zero unless otherwise specified.

Definition 3 (q-product). The q-product of a(x, y; m) and b(x, y; m) is defined to be the homogeneous polynomial of degree r + s given by c(x, y; m) def= a(x, y; m) ∗ b(x, y; m) = Σ_{u=0}^{r+s} c_u(m) y^u x^{r+s−u}, with

c_u(m) = Σ_{i=0}^u q^{is} a_i(m) b_{u−i}(m − i). (4)

We will denote the q-product by ∗ henceforth. For n ≥ 0, the nth q-power of a(x, y; m) is defined recursively: a(x, y; m)^{[0]} = 1 and a(x, y; m)^{[n]} = a(x, y; m)^{[n−1]} ∗ a(x, y; m) for n ≥ 1.

We provide some examples to illustrate the concept. It is easy to verify that x ∗ y = yx, y ∗ x = q yx, yx ∗ x = q yx², and yx ∗ (q^m − 1)y = (q^m − q)y²x. Note that x ∗ y ≠ y ∗ x. It is easy to verify that the q-product is neither commutative nor distributive in general. However, it is commutative and distributive in some special cases, as described below.

Lemma 1. Suppose a(x, y; m) = a is a constant independent of m. Then a(x, y; m) ∗ b(x, y; m) = b(x, y; m) ∗ a(x, y; m) = a·b(x, y; m). Also, if deg[c(x, y; m)] = deg[a(x, y; m)], then [a(x, y; m) + c(x, y; m)] ∗ b(x, y; m) = a(x, y; m) ∗ b(x, y; m) + c(x, y; m) ∗ b(x, y; m), and b(x, y; m) ∗ [a(x, y; m) + c(x, y; m)] = b(x, y; m) ∗ a(x, y; m) + b(x, y; m) ∗ c(x, y; m).

The homogeneous polynomials a_l(x, y; m) def= [x + (q^m − 1)y]^{[l]} and b_l(x, y; m) def= (x − y)^{[l]} are very important to our derivations below. The following lemma provides the analytical expressions of a_l(x, y; m) and b_l(x, y; m).

Lemma 2. For l ≥ 0, y^{[l]} = q^{σ_l} y^l and x^{[l]} = x^l. Furthermore,

a_l(x, y; m) = Σ_{u=0}^l [l u] α(m, u) y^u x^{l−u},
b_l(x, y; m) = Σ_{u=0}^l [l u] (−1)^u q^{σ_u} y^u x^{l−u}. (5)

Note that a_l(x, y; m) is the rank weight enumerator of GF(q^m)^l. The proof of Lemma 2, which goes by induction on l, is easy and hence omitted.
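The q-product can be implemented directly on coefficient sequences, keeping each coefficient as a function of m as Definition 3 requires. The sketch below (helper names ours) computes q-powers via the recursion above and confirms the expansion of a_l stated in Lemma 2 for small l:

```python
from math import prod

def alpha(q, m, u):
    return prod(q**m - q**i for i in range(u))

def gaussian(q, n, u):
    return alpha(q, n, u) // alpha(q, u, u)

def q_product(q, a, b):
    # a and b are lists of coefficient functions of m: a[i](m) multiplies
    # y^i x^(r-i). Implements (4): c_u(m) = sum_i q^(i s) a_i(m) b_{u-i}(m-i).
    r, s = len(a) - 1, len(b) - 1
    def c(u):
        return lambda m: sum(q**(i * s) * a[i](m) * b[u - i](m - i)
                             for i in range(max(0, u - s), min(u, r) + 1))
    return [c(u) for u in range(r + s + 1)]

def q_power(q, a, l):
    # a^[l] computed from the recursion a^[l] = a^[l-1] * a.
    p = [lambda m: 1]
    for _ in range(l):
        p = q_product(q, p, a)
    return p

q, m = 2, 5
base = [lambda m: 1, lambda m: q**m - 1]   # x + (q^m - 1)y
for l in range(5):
    coeffs = q_power(q, base, l)
    for u in range(l + 1):
        # Lemma 2: the y^u x^(l-u) coefficient of a_l is [l u] alpha(m, u).
        assert coeffs[u](m) == gaussian(q, l, u) * alpha(q, m, u)
print("Lemma 2 expansion of a_l verified for l <= 4")
```
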

Definition 4 (q-transform). We define the q-transform of a(x, y; m) = Σ_{i=0}^r a_i(m) y^i x^{r−i} as the homogeneous polynomial ā(x, y; m) = Σ_{i=0}^r a_i(m) y^{[i]} ∗ x^{[r−i]}.

Definition 5 (q-derivative [28]). For q ≥ 2, the q-derivative at x ≠ 0 of a real-valued function f(x) is defined as

f^{(1)}(x) def= (f(qx) − f(x))/((q − 1)x). (6)

For any real number a, [f(x) + a g(x)]^{(1)} = f^{(1)}(x) + a g^{(1)}(x) for x ≠ 0. For ν ≥ 0, we will denote the νth q-derivative (with respect to x) of f(x, y) as f^{(ν)}(x, y). The 0th q-derivative of f(x, y) is defined to be f(x, y) itself.

Lemma 3. For 0 ≤ ν ≤ l, (x^l)^{(ν)} = β(l, ν) x^{l−ν}. The νth q-derivative of f(x, y) = Σ_{i=0}^r f_i y^i x^{r−i} is given by f^{(ν)}(x, y) = Σ_{i=0}^{r−ν} f_i β(r − i, ν) y^i x^{r−i−ν}. Also,

a^{(ν)}_l(x, y; m) = β(l, ν) a_{l−ν}(x, y; m),
b^{(ν)}_l(x, y; m) = β(l, ν) b_{l−ν}(x, y; m). (7)

The proof of Lemma 3, which goes by induction on ν, is easy and hence omitted.

Lemma 4 (Leibniz rule for the q-derivative). For two homogeneous polynomials f(x, y) and g(x, y) of degrees r and s, respectively, the νth (ν ≥ 0) q-derivative of their q-product is given by

[f(x, y) ∗ g(x, y)]^{(ν)} = Σ_{l=0}^ν [ν l] q^{(ν−l)(r−l)} f^{(l)}(x, y) ∗ g^{(ν−l)}(x, y). (8)

The proof of Lemma 4 is given in Appendix A. The q^{−1}-derivative is similar to the q-derivative.

Definition 6 (q^{−1}-derivative). For q ≥ 2, the q^{−1}-derivative at y ≠ 0 of a real-valued function g(y) is defined as

g^{{1}}(y) def= (g(q^{−1}y) − g(y))/((q^{−1} − 1)y). (9)

For any real number a, [f(y) + a g(y)]^{{1}} = f^{{1}}(y) + a g^{{1}}(y) for y ≠ 0. For ν ≥ 0, we will denote the νth q^{−1}-derivative (with respect to y) of g(x, y) as g^{{ν}}(x, y). The 0th q^{−1}-derivative of g(x, y) is defined to be g(x, y) itself.

Lemma 5. For 0 ≤ ν ≤ l, the νth q^{−1}-derivative of y^l is (y^l)^{{ν}} = q^{ν(1−l)+σ_ν} β(l, ν) y^{l−ν}. Also,

a^{{ν}}_l(x, y; m) = β(l, ν) q^{−σ_ν} α(m, ν) a_{l−ν}(x, y; m − ν),
b^{{ν}}_l(x, y; m) = (−1)^ν β(l, ν) b_{l−ν}(x, y; m). (10)

The proof of Lemma 5 is similar to that of Lemma 3 and is hence omitted.
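The monomial rules of Lemmas 3 and 5 can be spot-checked with exact rational arithmetic. In the sketch below (helper names ours), one q-derivative and one q^{−1}-derivative of a monomial are compared against the stated closed forms:

```python
from fractions import Fraction as F

def beta(q, l, nu):
    # beta(l, nu) = prod_{i=0}^{nu-1} [l-i 1], with [n 1] = (q^n - 1)/(q - 1).
    out = 1
    for i in range(nu):
        out *= (q**(l - i) - 1) // (q - 1)
    return out

q, l = 2, 4
x0, y0 = F(3), F(5)              # arbitrary nonzero evaluation points

# One q-derivative of x^l (Definition 5) should equal beta(l, 1) x^(l-1).
d1 = ((q * x0)**l - x0**l) / ((q - 1) * x0)
assert d1 == beta(q, l, 1) * x0**(l - 1)

# One q^(-1)-derivative of y^l (Definition 6) should equal
# q^(1-l) beta(l, 1) y^(l-1), i.e., Lemma 5 with nu = 1 (sigma_1 = 0).
qinv = F(1, q)
d1inv = ((qinv * y0)**l - y0**l) / ((qinv - 1) * y0)
assert d1inv == F(q)**(1 - l) * beta(q, l, 1) * y0**(l - 1)
print("monomial q- and q^(-1)-derivative rules verified")
```
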

Lemma 6 (Leibniz rule for the q^{−1}-derivative). For two homogeneous polynomials f(x, y; m) and g(x, y; m) of degrees r and s, respectively, the νth (ν ≥ 0) q^{−1}-derivative of their q-product is given by

[f(x, y; m) ∗ g(x, y; m)]^{{ν}} = Σ_{l=0}^ν [ν l] q^{l(s−ν+l)} f^{{l}}(x, y; m) ∗ g^{{ν−l}}(x, y; m − l). (11)

The proof of Lemma 6 is given in Appendix B.

3.2. The dual of a vector

As an important step toward our main result, we derive the rank weight enumerator of 〈v〉⊥, where v ∈ GF(q^m)^n is an arbitrary vector and 〈v〉 def= {av : a ∈ GF(q^m)}. Note that 〈v〉 can be viewed as an (n, 1) linear code over GF(q^m) with generator matrix v. It is remarkable that the rank weight enumerator of 〈v〉⊥ depends on only the rank of v.

Berger [14] has determined that the linear isometries for the rank distance are given by scalar multiplication by a nonzero element of GF(q^m) and multiplication on the right by a nonsingular matrix B ∈ GF(q)^{n×n}. We say that two codes C and C′ are rank-equivalent if there exists a linear isometry f for the rank distance such that f(C) = C′.

Lemma 7. Suppose v has rank r ≥ 1. Then L = 〈v〉⊥ is rank-equivalent to C̄ × GF(q^m)^{n−r}, where C̄ is an (r, r − 1, 2) MRD code and × denotes the Cartesian product.

Proof. We can express v as v = v̄B, where v̄ = (v̄_0, . . . , v̄_{r−1}, 0, . . . , 0) has rank r and B ∈ GF(q)^{n×n} has full rank. Remark that v̄ is the parity check of C̄ × GF(q^m)^{n−r}, where C̄ = 〈(v̄_0, . . . , v̄_{r−1})〉⊥ is an (r, r − 1, 2) MRD code. It can be easily checked that u ∈ L if and only if ū def= uB^T ∈ 〈v̄〉⊥. Therefore, 〈v̄〉⊥ = LB^T, and hence L is rank-equivalent to 〈v̄〉⊥ = C̄ × GF(q^m)^{n−r}.

We hence derive the rank weight enumerator of an (r, r − 1, 2) MRD code. Note that the rank weight distribution of linear Class-I MRD codes has been derived in [4, 10]. However, we will not use the results in [4, 10], and instead derive the rank weight enumerator of an (r, r − 1, 2) MRD code directly.

Proposition 1. Suppose v_r ∈ GF(q^m)^r has rank r (0 ≤ r ≤ m). The rank weight enumerator of L_r = 〈v_r〉⊥ depends on only r and is given by

W^R_{L_r}(x, y) = q^{−m}{[x + (q^m − 1)y]^{[r]} + (q^m − 1)(x − y)^{[r]}}. (12)

Proof. We first prove, by induction on r (r ≥ 1), that the number of vectors of rank r in L_r, denoted as A_{r,r}, depends only on r and is given by

A_{r,r} = q^{−m}[α(m, r) + (q^m − 1)(−1)^r q^{σ_r}]. (13)

Equation (13) clearly holds for r = 1. Suppose (13) holds for r − 1.

We consider all the vectors u = (u_0, . . . , u_{r−1}) ∈ L_r such that the first r − 1 coordinates of u are linearly independent. Remark that u_{r−1} = −v_{r−1}^{−1} Σ_{i=0}^{r−2} u_i v_i is completely determined by u_0, . . . , u_{r−2}. Thus there are N_{r−1}(q^m, r − 1) = α(m, r − 1) such vectors u. Among these vectors, we will enumerate the vectors t whose last coordinate is a linear combination of the first r − 1 coordinates, that is, t = (t_0, . . . , t_{r−2}, Σ_{i=0}^{r−2} a_i t_i) ∈ L_r, where a_i ∈ GF(q) for 0 ≤ i ≤ r − 2.

Remark that t ∈ L_r if and only if (t_0, . . . , t_{r−2}) · (v_0 + a_0 v_{r−1}, . . . , v_{r−2} + a_{r−2} v_{r−1}) = 0. It is easy to check that v(a) = (v_0 + a_0 v_{r−1}, . . . , v_{r−2} + a_{r−2} v_{r−1}) has rank r − 1. Therefore, if a_0, . . . , a_{r−2} are fixed, then there are A_{r−1,r−1} such vectors t. Also, suppose Σ_{i=0}^{r−2} t_i v_i + v_{r−1} Σ_{i=0}^{r−2} b_i t_i = 0. Hence Σ_{i=0}^{r−2} (a_i − b_i) t_i = 0, which implies a = b since the t_i's are linearly independent. That is, 〈v(a)〉⊥ ∩ 〈v(b)〉⊥ = {0} if a ≠ b. We conclude that there are q^{r−1} A_{r−1,r−1} vectors t. Therefore, A_{r,r} = α(m, r − 1) − q^{r−1} A_{r−1,r−1} = q^{−m}[α(m, r) + (q^m − 1)(−1)^r q^{σ_r}].

Denote the number of vectors of rank p in L_r as A_{r,p}. We have A_{r,p} = [r p] A_{p,p} [10], and hence A_{r,p} = [r p] q^{−m}[α(m, p) + (q^m − 1)(−1)^p q^{σ_p}]. Thus, W^R_{L_r}(x, y) = Σ_{p=0}^r A_{r,p} x^{r−p} y^p = q^{−m}{[x + (q^m − 1)y]^{[r]} + (q^m − 1)(x − y)^{[r]}}.

We comment that Proposition 1 in fact provides the rank weight distribution of any (r, r − 1, 2) MRD code.
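Proposition 1 yields the full rank distribution A_{r,p} = [r p] A_{p,p}. The sketch below (helper names ours) checks that these numbers are nonnegative integers summing to the code size q^{m(r−1)}, with no codewords of rank 1, as expected for minimum rank distance 2:

```python
from math import prod

def alpha(q, m, u):
    return prod(q**m - q**i for i in range(u))

def gaussian(q, n, u):
    return alpha(q, n, u) // alpha(q, u, u)

def A_rp(q, m, r, p):
    # A_{r,p} = [r p] q^(-m) [alpha(m, p) + (q^m - 1)(-1)^p q^(sigma_p)];
    # the factor q^m must divide exactly for the count to be an integer.
    num = gaussian(q, r, p) * (alpha(q, m, p)
                               + (q**m - 1) * (-1)**p * q**(p * (p - 1) // 2))
    assert num % q**m == 0
    return num // q**m

# The weights of an (r, r-1, 2) MRD code sum to its size q^(m(r-1)),
# and there are no codewords of rank 1 (minimum rank distance is 2).
for q, m, r in [(2, 3, 2), (2, 4, 3), (3, 3, 3)]:
    dist = [A_rp(q, m, r, p) for p in range(r + 1)]
    assert all(a >= 0 for a in dist)
    assert sum(dist) == q**(m * (r - 1))
    assert dist[1] == 0
print("Proposition 1 rank distribution is consistent")
```
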

Lemma 8. Let C_0 ⊆ GF(q^m)^r be a linear code with rank weight enumerator W^R_{C_0}(x, y), and for s ≥ 0, let W^R_{C_s}(x, y) be the rank weight enumerator of C_s def= C_0 × GF(q^m)^s. Then W^R_{C_s}(x, y) is given by

W^R_{C_s}(x, y) = W^R_{C_0}(x, y) ∗ [x + (q^m − 1)y]^{[s]}. (14)

Proof. For s ≥ 0, denote W^R_{C_s}(x, y) = Σ_{u=0}^{r+s} B_{s,u} y^u x^{r+s−u}. We will prove, by induction on s, that

B_{s,u} = Σ_{i=0}^u q^{is} B_{0,i} [s u−i] α(m − i, u − i). (15)

Equation (15) clearly holds for s = 0. Now assume (15) holds for s − 1. For any x_s = (x_0, . . . , x_{r+s−1}) ∈ C_s, we define x_{s−1} = (x_0, . . . , x_{r+s−2}) ∈ C_{s−1}. Then rk(x_s) = u if and only if either rk(x_{s−1}) = u and x_{r+s−1} ∈ S(x_{s−1}), or rk(x_{s−1}) = u − 1 and x_{r+s−1} ∉ S(x_{s−1}). This implies B_{s,u} = q^u B_{s−1,u} + (q^m − q^{u−1}) B_{s−1,u−1} = Σ_{i=0}^u q^{is} B_{0,i} [s u−i] α(m − i, u − i).
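Formula (15) can be sanity-checked on a case where the answer is known in closed form: taking C_0 = GF(q^m)^r gives C_s = GF(q^m)^{r+s}, whose rank distribution is N_u(q^m, r + s). The sketch below (helper names ours) performs this check:

```python
from math import prod

def alpha(q, m, u):
    return prod(q**m - q**i for i in range(u))

def gaussian(q, n, u):
    if u < 0 or u > n:
        return 0
    return alpha(q, n, u) // alpha(q, u, u)

def Nu(q, m, n, u):
    # N_u(q^m, n) = [n u] alpha(m, u): number of rank-u vectors in GF(q^m)^n.
    return gaussian(q, n, u) * alpha(q, m, u)

def closed15(q, m, r, s, u):
    # Right-hand side of (15) with B_{0,i} = N_i(q^m, r).
    return sum(q**(i * s) * Nu(q, m, r, i)
               * gaussian(q, s, u - i) * alpha(q, m - i, u - i)
               for i in range(u + 1))

q, m, r, s = 2, 3, 2, 2
for u in range(min(m, r + s) + 1):
    assert closed15(q, m, r, s, u) == Nu(q, m, r + s, u)
print("(15) reproduces the rank distribution of GF(q^m)^(r+s)")
```
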

Combining Lemma 7, Proposition 1, and Lemma 8, the rank weight enumerator of 〈v〉⊥ can be determined at last.

Proposition 2. For v ∈ GF(q^m)^n with rank r ≥ 0, the rank weight enumerator of L = 〈v〉⊥ depends on only r and is given by

W^R_L(x, y) = q^{−m}{[x + (q^m − 1)y]^{[n]} + (q^m − 1)(x − y)^{[r]} ∗ [x + (q^m − 1)y]^{[n−r]}}. (16)

3.3. MacWilliams identity for the rank metric

Using the results in Section 3.2, we now derive the MacWilliams identity for rank metric codes. Let C be an (n, k) linear code over GF(q^m), let W^R_C(x, y) = Σ_{i=0}^n A_i y^i x^{n−i} be its rank weight enumerator, and let W^R_{C⊥}(x, y) = Σ_{j=0}^n B_j y^j x^{n−j} be the rank weight enumerator of its dual code C⊥.

Theorem 1. For any (n, k) linear code C and its dual code C⊥ over GF(q^m),

W^R_{C⊥}(x, y) = (1/|C|) W̄^R_C(x + (q^m − 1)y, x − y), (17)

where W̄^R_C is the q-transform of W^R_C. Equivalently,

Σ_{j=0}^n B_j y^j x^{n−j} = q^{−mk} Σ_{i=0}^n A_i (x − y)^{[i]} ∗ [x + (q^m − 1)y]^{[n−i]}. (18)

Proof. We have rk(λu) = rk(u) for all λ ∈ GF(q^m)∗ and all u ∈ GF(q^m)^n. We want to determine f̂_R(v) for all v ∈ GF(q^m)^n. By Definition 2, we can split the summation in (3) into two parts:

f̂_R(v) = Σ_{u∈L} χ(u · v) f_R(u) + Σ_{u∈F\L} χ(u · v) f_R(u), (19)

where L = 〈v〉⊥. If u ∈ L, then χ(u · v) = 1 by Definition 1, and the first summation is equal to W^R_L(x, y). For the second summation, we divide the vectors into groups of the form {λu_1}, where λ ∈ GF(q^m)∗ and u_1 · v = 1. We remark that for u ∈ F \ L (see [1, Chapter 5, Lemma 9]):

Σ_{λ∈GF(q^m)∗} χ(λu_1 · v) f_R(λu_1) = f_R(u_1) Σ_{λ∈GF(q^m)∗} χ(λ) = −f_R(u_1). (20)

Hence the second summation is equal to −(1/(q^m − 1)) W^R_{F\L}(x, y). This leads to f̂_R(v) = (1/(q^m − 1))[q^m W^R_L(x, y) − W^R_F(x, y)]. Using W^R_F(x, y) = [x + (q^m − 1)y]^{[n]} and Proposition 2, we obtain f̂_R(v) = (x − y)^{[r]} ∗ [x + (q^m − 1)y]^{[n−r]}, where r = rk(v).

By [1, Chapter 5, Lemma 11], any mapping f from F to C satisfies Σ_{v∈C⊥} f(v) = (1/|C|) Σ_{v∈C} f̂(v). Applying this result to f_R(v) and using Definition 4, we obtain (17) and (18).

Also, the $B_j$'s can be explicitly expressed in terms of the $A_i$'s.

Corollary 1. It holds that
\[
B_j = \frac{1}{|C|} \sum_{i=0}^{n} A_i P_j(i; m, n), \tag{21}
\]
where
\[
P_j(i; m, n) \overset{\mathrm{def}}{=} \sum_{l=0}^{j} {i \brack l} {n-i \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(m-l, j-l). \tag{22}
\]

Proof. We have $(x-y)^{[i]} \ast \bigl(x + (q^m - 1)y\bigr)^{[n-i]} = \sum_{j=0}^{n} P_j(i; m, n)\, y^j x^{n-j}$. The result follows from Theorem 1.

Note that although the analytical expression in (21) is similar to that in [4, (3.14)], the $P_j(i; m, n)$ in (22) are different from the $P_j(i)$ in [4, (A10)] and their alternative forms in [29]. We can show the following:

Proposition 3. The $P_j(x; m, n)$ in (22) are the generalized Krawtchouk polynomials.

The proof is given in Appendix C. Proposition 3 shows that the $P_j(x; m, n)$ in (22) are an alternative form of the $P_j(i)$ in [4, (A10)], and hence our results in Corollary 1 are equivalent to those in [4, Theorem 3.3]. Also, it was pointed out in [29] that $P_j(x; m, n)/P_j(0; m, n)$ is actually a basic hypergeometric function.
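Corollary 1 lends itself to a direct numerical sanity check. The sketch below (an illustrative example, not from the paper) takes $q = 2$ and $m = n = 2$, builds the one-dimensional code over $\mathrm{GF}(4)$ generated by $(1, \omega)$ (with $\omega$ encoded as the integer 2), computes the rank distributions of the code and of its dual exhaustively, and compares the counted $B_j$ against (21) using the $P_j(i; m, n)$ of (22).

```python
# Numerical check of Corollary 1 for q = 2, m = n = 2 over GF(4).
q, m, n = 2, 2, 2

def gf4_mul(a, b):
    """Multiply two GF(4) elements encoded as 2-bit ints (modulus x^2 + x + 1)."""
    p = 0
    for i in range(2):
        if (b >> i) & 1:
            p ^= a << i
    if p & 4:            # reduce x^2 -> x + 1
        p ^= 0b111
    return p

def rank(vec):
    """Rank over GF(2) of a vector in GF(4)^n; coordinates are bit columns.
    For m = 2, distinct nonzero columns of GF(2)^2 are automatically independent."""
    cols = {c for c in vec if c != 0}
    return min(len(cols), 2)

def alpha(M, u):
    r = 1
    for t in range(u):
        r *= q**M - q**t
    return r

def qbinom(a, b):
    if b < 0 or b > a:
        return 0
    return alpha(a, b) // alpha(b, b)

def P(j, i):
    """Generalized Krawtchouk polynomial from definition (22)."""
    return sum(qbinom(i, l) * qbinom(n - i, j - l) * (-1)**l
               * q**(l * (l - 1) // 2) * q**(l * (n - i)) * alpha(m - l, j - l)
               for l in range(j + 1))

g = (1, 2)                                  # generator (1, w) over GF(4)
C = [tuple(gf4_mul(lam, gi) for gi in g) for lam in range(4)]
dual = [(v0, v1) for v0 in range(4) for v1 in range(4)
        if all(gf4_mul(c[0], v0) ^ gf4_mul(c[1], v1) == 0 for c in C)]

A = [sum(1 for c in C if rank(c) == i) for i in range(n + 1)]
B = [sum(1 for v in dual if rank(v) == j) for j in range(n + 1)]
B_pred = [sum(A[i] * P(j, i) for i in range(n + 1)) // len(C) for j in range(n + 1)]
print(A, B, B_pred)     # B and B_pred agree
```

In this toy example both the code and its dual have rank distribution $(A_0, A_1, A_2) = (1, 0, 3)$, and (21) reproduces the dual distribution exactly.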

4. MOMENTS OF THE RANK DISTRIBUTION

4.1. Binomial moments of the rank distribution

In this section, we investigate the relationship between moments of the rank distribution of a linear code and those of its dual code. Our results parallel those in [1, page 131].

Proposition 4. For $0 \leq \nu \leq n$,
\[
\sum_{i=0}^{n-\nu} {n-i \brack \nu} A_i = q^{m(k-\nu)} \sum_{j=0}^{\nu} {n-j \brack n-\nu} B_j. \tag{23}
\]

Proof. First, applying Theorem 1 to $C^\perp$, we obtain
\[
\sum_{i=0}^{n} A_i y^i x^{n-i} = q^{m(k-n)} \sum_{j=0}^{n} B_j\, b_j(x,y;m) \ast a_{n-j}(x,y;m). \tag{24}
\]
Next, we apply the $q$-derivative with respect to $x$ to (24) $\nu$ times. By Lemma 3 the left-hand side (LHS)


becomes $\sum_{i=0}^{n-\nu} \beta(n-i, \nu) A_i y^i x^{n-i-\nu}$, while the RHS reduces to $q^{m(k-n)} \sum_{j=0}^{n} B_j \psi_j(x,y)$ by Lemma 4, where
\[
\psi_j(x,y) \overset{\mathrm{def}}{=} \bigl[b_j(x,y;m) \ast a_{n-j}(x,y;m)\bigr]^{(\nu)} = \sum_{l=0}^{\nu} {\nu \brack l} q^{(\nu-l)(j-l)}\, b_j^{(l)}(x,y) \ast a_{n-j}^{(\nu-l)}(x,y;m). \tag{25}
\]
By Lemma 3, $b_j^{(l)}(x,y;m) = \beta(j,l)(x-y)^{[j-l]}$ and $a_{n-j}^{(\nu-l)}(x,y;m) = \beta(n-j, \nu-l)\, a_{n-j-\nu+l}(x,y;m)$. It can be verified that for any homogeneous polynomial $b(x,y;m)$ and for any $s \geq 0$, $(b \ast a_s)(1,1;m) = q^{ms}\, b(1,1;m)$. Also, for $x = y = 1$, $b_j^{(l)}(1,1;m) = \beta(j,j)\,\delta_{j,l}$. We hence have $\psi_j(1,1) = 0$ for $j > \nu$, and $\psi_j(1,1) = {\nu \brack j} \beta(j,j) \beta(n-j, \nu-j)\, q^{m(n-\nu)}$ for $j \leq \nu$. Since $\beta(n-j, \nu-j) = {n-j \brack \nu-j} \beta(\nu-j, \nu-j)$ and $\beta(\nu,\nu) = {\nu \brack j} \beta(j,j) \beta(\nu-j, \nu-j)$, then $\psi_j(1,1) = {n-j \brack \nu-j} \beta(\nu,\nu)\, q^{m(n-\nu)}$. Applying $x = y = 1$ to the LHS and rearranging both sides using $\beta(n-i, \nu) = {n-i \brack \nu} \beta(\nu,\nu)$, we obtain (23).

Proposition 4 can be simplified if $\nu$ is less than the minimum distance of the dual code.

Corollary 2. Let $d_{\mathrm{R}}'$ be the minimum rank distance of $C^\perp$. If $0 \leq \nu < d_{\mathrm{R}}'$, then
\[
\sum_{i=0}^{n-\nu} {n-i \brack \nu} A_i = q^{m(k-\nu)} {n \brack \nu}. \tag{26}
\]

Proof. We have $B_0 = 1$ and $B_1 = \cdots = B_\nu = 0$.

Using the $q^{-1}$-derivative, we obtain another identity.

Proposition 5. For $0 \leq \nu \leq n$,
\[
\sum_{i=\nu}^{n} {i \brack \nu} q^{\nu(n-i)} A_i = q^{m(k-\nu)} \sum_{j=0}^{\nu} {n-j \brack n-\nu} (-1)^j q^{\sigma_j} \alpha(m-j, \nu-j)\, q^{j(\nu-j)} B_j. \tag{27}
\]

The proof of Proposition 5 is similar to that of Proposition 4, and is given in Appendix D. Following [1], we refer to the LHS of (23) and (27) as binomial moments of the rank distribution of $C$. Similarly, when either $\nu$ is less than the minimum distance $d_{\mathrm{R}}'$ of the dual code, or $\nu$ is greater than the diameter (maximum distance between any two codewords) $\delta_{\mathrm{R}}'$ of the dual code, Proposition 5 can be simplified.

Corollary 3. If $0 \leq \nu < d_{\mathrm{R}}'$, then
\[
\sum_{i=\nu}^{n} {i \brack \nu} q^{\nu(n-i)} A_i = q^{m(k-\nu)} {n \brack \nu} \alpha(m, \nu). \tag{28}
\]
For $\delta_{\mathrm{R}}' < \nu \leq n$,
\[
\sum_{i=0}^{\nu} {n-i \brack n-\nu} (-1)^i q^{\sigma_i} \alpha(m-i, \nu-i)\, q^{i(\nu-i)} A_i = 0. \tag{29}
\]

Proof. Apply Proposition 5 to $C$, and use $B_1 = \cdots = B_\nu = 0$ to prove (28). Apply Proposition 5 to $C^\perp$, and use $B_\nu = \cdots = B_n = 0$ to prove (29).

4.2. Pless identities for the rank distribution

In this section, we consider the analogues of the Pless identities [1, 2], in terms of Stirling numbers. The $q$-Stirling numbers of the second kind $S_q(\nu, l)$ are defined [30] to be
\[
S_q(\nu, l) \overset{\mathrm{def}}{=} \frac{q^{-\sigma_l}}{\beta(l,l)} \sum_{i=0}^{l} (-1)^i q^{\sigma_i} {l \brack i} {l-i \brack 1}^{\nu}, \tag{30}
\]
and they satisfy
\[
{m \brack 1}^{\nu} = \sum_{l=0}^{\nu} q^{\sigma_l} S_q(\nu, l)\, \beta(m, l). \tag{31}
\]

The following proposition can be viewed as a $q$-analogue of the Pless identity with respect to $x$ [2, P2].

Proposition 6. For $0 \leq \nu \leq n$,
\[
q^{-mk} \sum_{i=0}^{n} {n-i \brack 1}^{\nu} A_i = \sum_{j=0}^{\nu} B_j \sum_{l=0}^{\nu} {n-j \brack n-l} \beta(l,l) S_q(\nu,l)\, q^{-ml+\sigma_l}. \tag{32}
\]

Proof. We have
\begin{align}
\sum_{i=0}^{n} {n-i \brack 1}^{\nu} A_i &= \sum_{i=0}^{n} A_i \sum_{l=0}^{\nu} q^{\sigma_l} S_q(\nu,l) {n-i \brack l} \beta(l,l) \tag{33} \\
&= \sum_{l=0}^{\nu} q^{\sigma_l} \beta(l,l) S_q(\nu,l) \sum_{i=0}^{n} {n-i \brack l} A_i \notag \\
&= \sum_{l=0}^{\nu} q^{\sigma_l} \beta(l,l) S_q(\nu,l)\, q^{m(k-l)} \sum_{j=0}^{l} {n-j \brack n-l} B_j \tag{34} \\
&= q^{mk} \sum_{j=0}^{\nu} B_j \sum_{l=0}^{\nu} {n-j \brack n-l} q^{\sigma_l} \beta(l,l) S_q(\nu,l)\, q^{-ml},
\end{align}
where (33) follows (31) and (34) is due to Proposition 4.

Proposition 6 can be simplified when $\nu$ is less than the minimum distance of the dual code.

Corollary 4. For $0 \leq \nu < d_{\mathrm{R}}'$,
\begin{align}
q^{-mk} \sum_{i=0}^{n} {n-i \brack 1}^{\nu} A_i &= \sum_{l=0}^{\nu} \beta(n,l) S_q(\nu,l)\, q^{-ml+\sigma_l} \tag{35} \\
&= q^{-mn} \sum_{i=0}^{n} {n-i \brack 1}^{\nu} {n \brack i} \alpha(m,i). \tag{36}
\end{align}


Proof. Since $B_0 = 1$ and $B_1 = \cdots = B_\nu = 0$, (32) directly leads to (35). Since the right-hand side of (35) is transparent to the code, without loss of generality we choose $C = \mathrm{GF}(q^m)^n$ and (36) follows naturally.

Unfortunately, a $q$-analogue of the Pless identity with respect to $y$ [2, P1] cannot be obtained due to the presence of the $q^{\nu(n-i)}$ term in the LHS of (27). Instead, we derive its $q^{-1}$-analogue. We denote $p \overset{\mathrm{def}}{=} q^{-1}$ and define the functions $\alpha_p(m,u)$, ${n \brack u}_p$, and $\beta_p(m,u)$ similarly to the functions introduced in Section 2.3, only replacing $q$ by $p$. It is easy to relate these $q^{-1}$-functions to their counterparts: $\alpha(m,u) = p^{-mu-\sigma_u}(-1)^u \alpha_p(m,u)$, ${n \brack u} = p^{-u(n-u)} {n \brack u}_p$, and $\beta(m,u) = p^{-u(m-u)-\sigma_u} \beta_p(m,u)$.

Proposition 7. For $0 \leq \nu \leq n$,
\[
p^{mk} \sum_{i=0}^{n} {i \brack 1}_p^{\nu} A_i = \sum_{j=0}^{\nu} B_j\, p^{j(m+n-j)} \sum_{l=j}^{\nu} \beta_p(l,l) S_p(\nu,l) (-1)^l {n-j \brack n-l}_p \alpha_p(m-j, l-j). \tag{37}
\]

The proof of Proposition 7 is given in Appendix E.

Corollary 5. For $0 \leq \nu < d_{\mathrm{R}}'$,
\[
p^{mk} \sum_{i=0}^{n} {i \brack 1}_p^{\nu} A_i = \sum_{l=0}^{\nu} \beta_p(n,l) S_p(\nu,l)\, \alpha_p(m,l) (-1)^l. \tag{38}
\]

Proof. Note that $B_0 = 1$ and $B_1 = \cdots = B_\nu = 0$.

4.3. Further results on the rank distribution

For nonnegative integers $\lambda$, $\mu$, and $\nu$, and a linear code $C$ with rank weight distribution $\{A_i\}$, we define
\[
T_{\lambda,\mu,\nu}(C) \overset{\mathrm{def}}{=} q^{-mk} \sum_{i=0}^{n} {i \brack \lambda}^{\mu} q^{\nu(n-i)} A_i, \tag{39}
\]
whose properties are studied below. We refer to
\[
T_{0,0,\nu}(C) \overset{\mathrm{def}}{=} q^{-mk} \sum_{i=0}^{n} q^{\nu(n-i)} A_i \tag{40}
\]
as the $\nu$th $q$-moment of the rank distribution of $C$. We remark that for any code $C$, the $0$th-order $q$-moment of its rank distribution is equal to $1$. We first relate $T_{\lambda,1,\nu}(C)$ and $T_{1,\mu,\nu}(C)$ to $T_{0,0,\nu}(C)$.

Lemma 9. For nonnegative integers $\lambda$, $\mu$, and $\nu$,
\begin{align}
T_{\lambda,1,\nu}(C) &= \frac{1}{\alpha(\lambda,\lambda)} \sum_{l=0}^{\lambda} {\lambda \brack l} (-1)^l q^{\sigma_l} q^{n(\lambda-l)}\, T_{0,0,\nu-\lambda+l}(C), \tag{41} \\
T_{1,\mu,\nu}(C) &= (1-q)^{-\mu} \sum_{a=0}^{\mu} \binom{\mu}{a} (-1)^a q^{an}\, T_{0,0,\nu-a}(C). \tag{42}
\end{align}

The proof of Lemma 9 is given in Appendix F. We now consider the case where $\nu$ is less than the minimum distance of the dual code.

Proposition 8. For $0 \leq \nu < d_{\mathrm{R}}'$,
\begin{align}
T_{0,0,\nu}(C) &= \sum_{j=0}^{\nu} {\nu \brack j} \alpha(n,j)\, q^{-mj} \tag{43} \\
&= q^{-mn} \sum_{i=0}^{n} {n \brack i} \alpha(m,i)\, q^{\nu(n-i)} \tag{44} \\
&= q^{-m\nu} \sum_{l=0}^{\nu} {\nu \brack l} \alpha(m,l)\, q^{n(\nu-l)}. \tag{45}
\end{align}

The proof of Proposition 8 is given in Appendix G. Proposition 8 hence shows that the $\nu$th $q$-moment of the rank distribution of a code is transparent to the code when $\nu < d_{\mathrm{R}}'$. As a corollary, we show that $T_{\lambda,1,\nu}(C)$ and $T_{1,\mu,\nu}(C)$ are also transparent to the code when $0 \leq \lambda, \mu \leq \nu < d_{\mathrm{R}}'$.
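The three expressions (43)-(45) for the $\nu$th $q$-moment are equal as identities in $q$, independently of any code. The sketch below (an illustration, not from the paper) confirms this numerically with exact rational arithmetic for small parameters.

```python
from fractions import Fraction

def alpha(q, M, u):
    r = 1
    for t in range(u):
        r *= q**M - q**t
    return r

def qbinom(q, a, b):
    if b < 0 or b > a:
        return 0
    return alpha(q, a, b) // alpha(q, b, b)

def T43(q, m, n, nu):   # right-hand side of (43)
    return sum(Fraction(qbinom(q, nu, j) * alpha(q, n, j), q**(m * j))
               for j in range(nu + 1))

def T44(q, m, n, nu):   # right-hand side of (44)
    return Fraction(sum(qbinom(q, n, i) * alpha(q, m, i) * q**(nu * (n - i))
                        for i in range(n + 1)), q**(m * n))

def T45(q, m, n, nu):   # right-hand side of (45)
    return Fraction(sum(qbinom(q, nu, l) * alpha(q, m, l) * q**(n * (nu - l))
                        for l in range(nu + 1)), q**(m * nu))

for q in (2, 3):
    for m in range(1, 4):
        for n in range(1, 4):
            for nu in range(n + 1):
                assert T43(q, m, n, nu) == T44(q, m, n, nu) == T45(q, m, n, nu)
print("(43) = (44) = (45) verified for the tested parameters")
```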

Corollary 6. For $0 \leq \lambda, \mu \leq \nu < d_{\mathrm{R}}'$,
\begin{align}
T_{\lambda,1,\nu}(C) &= q^{-mn} {n \brack \lambda} \sum_{i=\lambda}^{n} {n-\lambda \brack i-\lambda} q^{\nu(n-i)} \alpha(m,i), \notag \\
T_{1,\mu,\nu}(C) &= q^{-mn} \sum_{i=0}^{n} {i \brack 1}^{\mu} q^{\nu(n-i)} {n \brack i} \alpha(m,i). \tag{46}
\end{align}

Proof. By Lemma 9 and Proposition 8, $T_{\lambda,1,\nu}(C)$ and $T_{1,\mu,\nu}(C)$ are transparent to the code. Thus, without loss of generality we assume $C = \mathrm{GF}(q^m)^n$ and (46) follows.

4.4. Rank weight distribution of MRD codes

The rank weight distribution of linear Class-I MRD codes was given in [4, 10]. Based on our results in Section 4.1, we provide an alternative derivation of the rank distribution of linear Class-I MRD codes, which can also be used to determine the rank weight distribution of Class-II MRD codes.

Proposition 9 (rank distribution of linear Class-I MRD codes). Let $C$ be an $(n, k, d_{\mathrm{R}})$ linear Class-I MRD code over $\mathrm{GF}(q^m)$ $(n \leq m)$, and let $W_C^{\mathrm{R}}(x,y) = \sum_{i=0}^{n} A_i y^i x^{n-i}$ be its rank weight enumerator. We then have $A_0 = 1$ and, for $0 \leq i \leq n - d_{\mathrm{R}}$,
\[
A_{d_{\mathrm{R}}+i} = {n \brack d_{\mathrm{R}}+i} \sum_{j=0}^{i} (-1)^{i-j} q^{\sigma_{i-j}} {d_{\mathrm{R}}+i \brack d_{\mathrm{R}}+j} \bigl(q^{m(j+1)} - 1\bigr). \tag{47}
\]

Proof. It can be shown that for two sequences of real numbers $\{a_j\}_{j=0}^{l}$ and $\{b_i\}_{i=0}^{l}$ such that $a_j = \sum_{i=0}^{j} {l-i \brack l-j} b_i$ for $0 \leq j \leq l$, we have $b_i = \sum_{j=0}^{i} (-1)^{i-j} q^{\sigma_{i-j}} {l-j \brack l-i} a_j$ for $0 \leq i \leq l$.

By Corollary 2, we have $\sum_{i=0}^{j} {n-d_{\mathrm{R}}-i \brack n-d_{\mathrm{R}}-j} A_{d_{\mathrm{R}}+i} = {n \brack n-d_{\mathrm{R}}-j} \bigl(q^{m(j+1)} - 1\bigr)$ for $0 \leq j \leq n - d_{\mathrm{R}}$. Applying the result above to $l = n - d_{\mathrm{R}}$,


$a_j = {n \brack n-d_{\mathrm{R}}-j} (q^{m(j+1)} - 1)$, and $b_i = A_{d_{\mathrm{R}}+i}$, we obtain
\[
A_{d_{\mathrm{R}}+i} = \sum_{j=0}^{i} (-1)^{i-j} q^{\sigma_{i-j}} {n \brack d_{\mathrm{R}}+i} {d_{\mathrm{R}}+i \brack d_{\mathrm{R}}+j} \bigl(q^{m(j+1)} - 1\bigr). \tag{48}
\]

We remark that the above rank distribution is consistent with that derived in [4, 10]. Since Class-II MRD codes can be constructed by transposing linear Class-I MRD codes, and the transposition operation preserves the rank weight, the weight distributions of Class-II MRD codes can be obtained accordingly.
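Formula (47) can be exercised on the degenerate MRD code $C = \mathrm{GF}(q^m)^n$ (so $d_{\mathrm{R}} = 1$ and $k = n$), where the rank distribution is known in closed form: $A_i$ must equal the number of vectors of rank $i$, namely ${n \brack i}\alpha(m,i)$, and the distribution must sum to $q^{mk}$. The following sketch (an illustration, not from the paper) checks both.

```python
def alpha(q, M, u):
    r = 1
    for t in range(u):
        r *= q**M - q**t
    return r

def qbinom(q, a, b):
    if b < 0 or b > a:
        return 0
    return alpha(q, a, b) // alpha(q, b, b)

def sigma(l):
    return l * (l - 1) // 2

def mrd_distribution(q, m, n, dR):
    """Rank weight distribution of an (n, k, dR) linear Class-I MRD code, per (47)."""
    k = n - dR + 1
    A = [0] * (n + 1)
    A[0] = 1
    for i in range(n - dR + 1):
        A[dR + i] = qbinom(q, n, dR + i) * sum(
            (-1)**(i - j) * q**sigma(i - j) * qbinom(q, dR + i, dR + j)
            * (q**(m * (j + 1)) - 1) for j in range(i + 1))
    return A, k

q, m, n = 2, 3, 3
A, k = mrd_distribution(q, m, n, dR=1)     # whole space GF(q^m)^n
assert sum(A) == q**(m * k)
assert A == [qbinom(q, n, i) * alpha(q, m, i) for i in range(n + 1)]
print(A)   # → [1, 49, 294, 168]
```

For $d_{\mathrm{R}} = 2$ the same routine gives a distribution summing to $q^{m(n-1)}$, as expected from $|C| = q^{mk}$.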

APPENDICES

The proofs in this section use some well-known properties of Gaussian polynomials [27]: ${n \brack k} = {n \brack n-k}$, ${n \brack k}{k \brack l} = {n \brack l}{n-l \brack n-k}$, and
\begin{align}
{n \brack k} &= {n-1 \brack k} + q^{n-k} {n-1 \brack k-1} \tag{A.1} \\
&= q^k {n-1 \brack k} + {n-1 \brack k-1} \tag{A.2} \\
&= \frac{q^n - 1}{q^{n-k} - 1} {n-1 \brack k} \tag{A.3} \\
&= \frac{q^{n-k+1} - 1}{q^k - 1} {n \brack k-1}. \tag{A.4}
\end{align}
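These properties are straightforward to confirm numerically; the sketch below (an illustration, not part of the paper) checks the symmetry, the product identity, and (A.1)-(A.4) for small parameters at q = 2.

```python
q = 2

def qbinom(a, b):
    """Gaussian binomial coefficient [a b] at the global q."""
    if b < 0 or b > a:
        return 0
    num = den = 1
    for t in range(b):
        num *= q**(a - t) - 1
        den *= q**(t + 1) - 1
    return num // den

for n in range(1, 8):
    for k in range(n + 1):
        g = qbinom(n, k)
        assert g == qbinom(n, n - k)                                        # symmetry
        assert g == qbinom(n - 1, k) + q**(n - k) * qbinom(n - 1, k - 1)    # (A.1)
        assert g == q**k * qbinom(n - 1, k) + qbinom(n - 1, k - 1)          # (A.2)
        if k < n:   # (A.3), cross-multiplied to stay in integers
            assert g * (q**(n - k) - 1) == (q**n - 1) * qbinom(n - 1, k)
        if k >= 1:  # (A.4), cross-multiplied to stay in integers
            assert g * (q**k - 1) == (q**(n - k + 1) - 1) * qbinom(n, k - 1)
        for l in range(k + 1):                                              # product identity
            assert qbinom(n, k) * qbinom(k, l) == qbinom(n, l) * qbinom(n - l, n - k)
print("Gaussian binomial identities verified")
```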

A. PROOF OF LEMMA 4

We consider homogeneous polynomials $f(x,y;m) = \sum_{i=0}^{r} f_i y^i x^{r-i}$ and $u(x,y;m) = \sum_{i=0}^{r} u_i y^i x^{r-i}$ of degree $r$, as well as $g(x,y;m) = \sum_{j=0}^{s} g_j y^j x^{s-j}$ and $v(x,y;m) = \sum_{j=0}^{s} v_j y^j x^{s-j}$ of degree $s$. First, we need a technical lemma.

Lemma 10. If $u_r = 0$, then
\[
\frac{1}{x}\bigl(u(x,y;m) \ast v(x,y;m)\bigr) = \frac{u(x,y;m)}{x} \ast v(x,y;m). \tag{A.5}
\]
If $v_s = 0$, then
\[
\frac{1}{x}\bigl(u(x,y;m) \ast v(x,y;m)\bigr) = u(x,qy;m) \ast \frac{v(x,y;m)}{x}. \tag{A.6}
\]

Proof. Suppose $u_r = 0$. Then $u(x,y;m)/x = \sum_{i=0}^{r-1} u_i y^i x^{r-1-i}$. Hence
\[
\frac{u(x,y;m)}{x} \ast v(x,y;m) = \sum_{k=0}^{r+s-1} \Biggl( \sum_{l=0}^{k} q^{ls} u_l(m)\, v_{k-l}(m-l) \Biggr) y^k x^{r+s-1-k} = \frac{1}{x}\bigl(u(x,y;m) \ast v(x,y;m)\bigr). \tag{A.7}
\]
Suppose $v_s = 0$. Then $v(x,y;m)/x = \sum_{j=0}^{s-1} v_j y^j x^{s-1-j}$. Hence
\[
u(x,qy;m) \ast \frac{v(x,y;m)}{x} = \sum_{k=0}^{r+s-1} \Biggl( \sum_{l=0}^{k} q^{l(s-1)} q^l u_l(m)\, v_{k-l}(m-l) \Biggr) y^k x^{r+s-1-k} = \frac{1}{x}\bigl(u(x,y;m) \ast v(x,y;m)\bigr). \tag{A.8}
\]

We now give a proof of Lemma 4.

Proof. In order to simplify notations, we omit the dependence of the polynomials $f$ and $g$ on the parameter $m$. The proof goes by induction on $\nu$. For $\nu = 0$, the result is trivial. For $\nu = 1$, we have
\begin{align}
\bigl[f(x,y) \ast g(x,y)\bigr]^{(1)} &= \frac{1}{(q-1)x} \bigl[f(qx,y) \ast g(qx,y) - f(qx,y) \ast g(x,y) + f(qx,y) \ast g(x,y) - f(x,y) \ast g(x,y)\bigr] \notag \\
&= \frac{1}{(q-1)x} \bigl[f(qx,y) \ast \bigl(g(qx,y) - g(x,y)\bigr) + \bigl(f(qx,y) - f(x,y)\bigr) \ast g(x,y)\bigr] \notag \\
&= f(qx,qy) \ast \frac{g(qx,y) - g(x,y)}{(q-1)x} + \frac{f(qx,y) - f(x,y)}{(q-1)x} \ast g(x,y) \tag{A.9} \\
&= q^r f(x,y) \ast g^{(1)}(x,y) + f^{(1)}(x,y) \ast g(x,y), \tag{A.10}
\end{align}
where (A.9) follows Lemma 10. Now suppose (8) is true for a given $\nu$. In order to further simplify notations, we omit the dependence of the various polynomials on $x$ and $y$. We have
\begin{align}
(f \ast g)^{(\nu+1)} &= \sum_{l=0}^{\nu} {\nu \brack l} q^{(\nu-l)(r-l)} \bigl[f^{(l)} \ast g^{(\nu-l)}\bigr]^{(1)} \notag \\
&= \sum_{l=0}^{\nu} {\nu \brack l} q^{(\nu-l)(r-l)} \bigl(q^{r-l} f^{(l)} \ast g^{(\nu-l+1)} + f^{(l+1)} \ast g^{(\nu-l)}\bigr) \tag{A.11} \\
&= \sum_{l=0}^{\nu} {\nu \brack l} q^{(\nu+1-l)(r-l)} f^{(l)} \ast g^{(\nu-l+1)} + \sum_{l=1}^{\nu+1} {\nu \brack l-1} q^{(\nu+1-l)(r-l+1)} f^{(l)} \ast g^{(\nu-l+1)} \notag \\
&= \sum_{l=1}^{\nu} \Bigl( {\nu \brack l} + q^{\nu+1-l} {\nu \brack l-1} \Bigr) q^{(\nu+1-l)(r-l)} f^{(l)} \ast g^{(\nu-l+1)} + q^{(\nu+1)r} f \ast g^{(\nu+1)} + f^{(\nu+1)} \ast g \notag \\
&= \sum_{l=0}^{\nu+1} {\nu+1 \brack l} q^{(\nu+1-l)(r-l)} f^{(l)} \ast g^{(\nu-l+1)}, \tag{A.12}
\end{align}
where (A.11) follows (A.10), and (A.12) follows (A.1).


B. PROOF OF LEMMA 6

We consider homogeneous polynomials $f(x,y;m) = \sum_{i=0}^{r} f_i y^i x^{r-i}$ and $u(x,y;m) = \sum_{i=0}^{r} u_i y^i x^{r-i}$ of degree $r$, as well as $g(x,y;m) = \sum_{j=0}^{s} g_j y^j x^{s-j}$ and $v(x,y;m) = \sum_{j=0}^{s} v_j y^j x^{s-j}$ of degree $s$. First, we need a technical lemma.

Lemma 11. If $u_0 = 0$, then
\[
\frac{1}{y}\bigl(u(x,y;m) \ast v(x,y;m)\bigr) = q^s\, \frac{u(x,y;m)}{y} \ast v(x,y;m-1). \tag{B.1}
\]
If $v_0 = 0$, then
\[
\frac{1}{y}\bigl(u(x,y;m) \ast v(x,y;m)\bigr) = u(x,qy;m) \ast \frac{v(x,y;m)}{y}. \tag{B.2}
\]

Proof. Suppose $u_0 = 0$. Then $u(x,y;m)/y = \sum_{i=0}^{r-1} u_{i+1} x^{r-1-i} y^i$. Hence
\[
q^s\, \frac{u(x,y;m)}{y} \ast v(x,y;m-1) = q^s \sum_{k=0}^{r+s-1} \Biggl( \sum_{l=0}^{k} q^{ls} u_{l+1}\, v_{k-l}(m-1-l) \Biggr) x^{r+s-1-k} y^k = q^s \sum_{k=1}^{r+s} \Biggl( \sum_{l=1}^{k} q^{(l-1)s} u_l\, v_{k-l}(m-l) \Biggr) x^{r+s-k} y^{k-1} = \frac{1}{y}\bigl(u(x,y;m) \ast v(x,y;m)\bigr). \tag{B.3}
\]
Suppose $v_0 = 0$. Then $v(x,y;m)/y = \sum_{j=0}^{s-1} v_{j+1} x^{s-1-j} y^j$. Hence
\[
u(x,qy;m) \ast \frac{v(x,y;m)}{y} = \sum_{k=0}^{r+s-1} \Biggl( \sum_{l=0}^{k} q^{l(s-1)} q^l u_l\, v_{k-l+1}(m-l) \Biggr) x^{r+s-1-k} y^k = \sum_{k=1}^{r+s} \Biggl( \sum_{l=0}^{k-1} q^{ls} u_l\, v_{k-l}(m-l) \Biggr) x^{r+s-k} y^{k-1} = \frac{1}{y}\bigl(u(x,y;m) \ast v(x,y;m)\bigr). \tag{B.4}
\]

We now give a proof of Lemma 6.

Proof. The proof goes by induction on $\nu$, and is similar to that of Lemma 4. For $\nu = 0$, the result is trivial. For $\nu = 1$ we can easily show, by using Lemma 11, that
\[
\bigl[f(x,y;m) \ast g(x,y;m)\bigr]^{\{1\}} = f(x,y;m) \ast g^{\{1\}}(x,y;m) + q^s f^{\{1\}}(x,y;m) \ast g(x,y;m-1). \tag{B.5}
\]
It is thus easy to verify the claim by induction on $\nu$.

C. PROOF OF PROPOSITION 3

Proof. It was shown in [29] that the generalized Krawtchouk polynomials are the only solutions to the recurrence
\[
P_{j+1}(i+1; m+1, n+1) = q^{j+1} P_{j+1}(i; m, n) - q^j P_j(i; m, n) \tag{C.1}
\]
with initial conditions $P_j(0; m, n) = {n \brack j} \alpha(m, j)$. Clearly, our polynomials satisfy these initial conditions. We hence show that the $P_j(i; m, n)$ satisfy the recurrence in (C.1). We have
\begin{align}
P_{j+1}(i+1; m+1, n+1) &= \sum_{l=0}^{i+1} {i+1 \brack l} {n-i \brack j+1-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(m+1-l, j+1-l) \notag \\
&= \sum_{l=0}^{i+1} {i+1 \brack l} {m+1-l \brack j+1-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j+1-l) \notag \\
&= \sum_{l=0}^{i+1} \Bigl\{ q^l {i \brack l} + {i \brack l-1} \Bigr\} \Bigl\{ q^{j+1-l} {m-l \brack j+1-l} + {m-l \brack j-l} \Bigr\} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j+1-l) \tag{C.2} \\
&= \sum_{l=0}^{i} {i \brack l} q^{j+1} {m-l \brack j+1-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j+1-l) \notag \\
&\quad + \sum_{l=0}^{i} q^l {i \brack l} {m-l \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j+1-l) \notag \\
&\quad + \sum_{l=1}^{i+1} {i \brack l-1} q^{j+1-l} {m-l \brack j+1-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j+1-l) \notag \\
&\quad + \sum_{l=1}^{i+1} {i \brack l-1} {m-l \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j+1-l), \tag{C.3}
\end{align}
where (C.2) follows (A.2). Let us denote the four summations in the right-hand side of (C.3) as $A$, $B$, $C$, and $D$, respectively. We have $A = q^{j+1} P_{j+1}(i; m, n)$, and
\begin{align}
B &= \sum_{l=0}^{i} {i \brack l} {m-l \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j-l) \bigl(q^{n-i+l} - q^j\bigr), \tag{C.4} \\
C &= \sum_{l=0}^{i} {i \brack l} q^{j-l} {m-l-1 \brack j-l} (-1)^{l+1} q^{\sigma_{l+1}} q^{(l+1)(n-i)} \alpha(n-i, j-l) \notag \\
&= -q^{j+n-i} \sum_{l=0}^{i} {i \brack l} {m-l \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j-l)\, \frac{q^{m-j} - 1}{q^{m-l} - 1}, \tag{C.5}
\end{align}


\begin{align}
D &= -q^{n-i} \sum_{l=0}^{i} {i \brack l} {m-l \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j-l)\, q^l\, \frac{q^{j-l} - 1}{q^{m-l} - 1}, \tag{C.6}
\end{align}
where (C.5) follows (A.3) and (C.6) follows both (A.3) and (A.4). Combining (C.4), (C.5), and (C.6), we obtain
\begin{align}
B + C + D &= \sum_{l=0}^{i} {i \brack l} {m-l \brack j-l} (-1)^l q^{\sigma_l} q^{l(n-i)} \alpha(n-i, j-l) \times \Bigl\{ q^{n-i+l} - q^j - q^{n-i}\, \frac{q^m - q^j}{q^{m-l} - 1} - q^{n-i}\, \frac{q^j - q^l}{q^{m-l} - 1} \Bigr\} \notag \\
&= -q^j P_j(i; m, n). \tag{C.7}
\end{align}
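The recurrence (C.1) and its initial conditions can also be confirmed numerically from definition (22). The sketch below (an illustration, not from the paper) uses the form $P_{j+1}(i+1; m+1, n+1) = q^{j+1} P_{j+1}(i; m, n) - q^j P_j(i; m, n)$, consistent with the recurrence of [29].

```python
def alpha(q, M, u):
    r = 1
    for t in range(u):
        r *= q**M - q**t
    return r

def qbinom(q, a, b):
    if b < 0 or b > a:
        return 0
    return alpha(q, a, b) // alpha(q, b, b)

def P(q, j, i, m, n):
    """Generalized Krawtchouk polynomial from definition (22)."""
    return sum(qbinom(q, i, l) * qbinom(q, n - i, j - l) * (-1)**l
               * q**(l * (l - 1) // 2 + l * (n - i)) * alpha(q, m - l, j - l)
               for l in range(j + 1))

q = 2
for m in range(4, 6):          # m kept large enough that alpha's first argument stays >= 0
    for n in range(1, 4):
        for j in range(n):
            assert P(q, j, 0, m, n) == qbinom(q, n, j) * alpha(q, m, j)  # initial conditions
            for i in range(n):
                lhs = P(q, j + 1, i + 1, m + 1, n + 1)
                rhs = q**(j + 1) * P(q, j + 1, i, m, n) - q**j * P(q, j, i, m, n)
                assert lhs == rhs                                        # recurrence (C.1)
print("recurrence (C.1) verified for the tested parameters")
```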

D. PROOF OF PROPOSITION 5

Before proving Proposition 5, we need two technical lemmas.

Lemma 12. For all $m$, $\nu$, and $j$,
\[
\delta(m, \nu, j) \overset{\mathrm{def}}{=} \sum_{i=0}^{j} {j \brack i} (-1)^i q^{\sigma_i} \alpha(m-i, \nu) = \alpha(\nu, j)\, \alpha(m-j, \nu-j)\, q^{j(m-j)}. \tag{D.1}
\]

Proof. The proof goes by induction on $j$. The claim trivially holds for $j = 0$. Let us suppose it holds for $j$. We have
\begin{align}
\delta(m, \nu, j+1) &= \sum_{i=0}^{j+1} {j+1 \brack i} (-1)^i q^{\sigma_i} \alpha(m-i, \nu) \notag \\
&= \sum_{i=0}^{j+1} \Bigl( q^i {j \brack i} + {j \brack i-1} \Bigr) (-1)^i q^{\sigma_i} \alpha(m-i, \nu) \tag{D.2} \\
&= \sum_{i=0}^{j} q^i {j \brack i} (-1)^i q^{\sigma_i} \alpha(m-i, \nu) + \sum_{i=1}^{j+1} {j \brack i-1} (-1)^i q^{\sigma_i} \alpha(m-i, \nu) \notag \\
&= \sum_{i=0}^{j} q^i {j \brack i} (-1)^i q^{\sigma_i} \alpha(m-i, \nu) - \sum_{i=0}^{j} {j \brack i} (-1)^i q^{\sigma_{i+1}} \alpha(m-1-i, \nu) \notag \\
&= \sum_{i=0}^{j} q^i {j \brack i} (-1)^i q^{\sigma_i} \alpha(m-1-i, \nu-1)\, q^{m-1-i} (q^{\nu} - 1) \notag \\
&= q^{m-1}(q^{\nu} - 1)\, \delta(m-1, \nu-1, j) \notag \\
&= \alpha(\nu, j+1)\, \alpha(m-j-1, \nu-j-1)\, q^{(j+1)(m-j-1)},
\end{align}
where (D.2) follows (A.2).

Lemma 13. For all $n$, $\nu$, and $j$,
\[
\theta(n, \nu, j) \overset{\mathrm{def}}{=} \sum_{l=0}^{j} {j \brack l} {n-j \brack \nu-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j-l) = (-1)^j q^{\sigma_j} {n-j \brack n-\nu}. \tag{D.3}
\]

Proof. The proof goes by induction on $j$. The claim trivially holds for $j = 0$. Let us suppose it holds for $j$. We have
\begin{align}
\theta(n, \nu, j+1) &= \sum_{l=0}^{j+1} {j+1 \brack l} {n-1-j \brack \nu-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j+1-l) \notag \\
&= \sum_{l=0}^{j+1} \Bigl( {j \brack l} + q^{j+1-l} {j \brack l-1} \Bigr) {n-1-j \brack \nu-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j+1-l) \tag{D.4} \\
&= \sum_{l=0}^{j} {j \brack l} {n-1-j \brack \nu-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j-l) \bigl(q^{\nu-l} - q^{j-l}\bigr) \notag \\
&\quad + \sum_{l=1}^{j+1} q^{j-l+1} {j \brack l-1} {n-1-j \brack \nu-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j-l+1), \tag{D.5}
\end{align}
where (D.4) follows (A.1). Let us denote the first and second summations in the right-hand side of (D.5) as $A$ and $B$, respectively. We have
\begin{align}
A &= \bigl(q^{\nu} - q^{j}\bigr) \sum_{l=0}^{j} {j \brack l} {n-1-j \brack \nu-l} q^{l(n-1-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j-l) \notag \\
&= \bigl(q^{\nu} - q^{j}\bigr)\, \theta(n-1, \nu, j) = \bigl(q^{\nu} - q^{j}\bigr) (-1)^j q^{\sigma_j} {n-1-j \brack n-1-\nu}, \tag{D.6} \\
B &= \sum_{l=0}^{j} q^{j-l} {j \brack l} {n-1-j \brack \nu-1-l} q^{(l+1)(n-\nu)} (-1)^{l+1} q^{\sigma_{l+1}} \alpha(\nu-1-l, j-l) \notag \\
&= -q^{j+n-\nu} \sum_{l=0}^{j} {j \brack l} {n-1-j \brack \nu-1-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-1-l, j-l) \notag \\
&= -q^{j+n-\nu}\, \theta(n-1, \nu-1, j) = -q^{j+n-\nu} (-1)^j q^{\sigma_j} {n-1-j \brack n-\nu}. \tag{D.7}
\end{align}


Combining (D.4), (D.6), and (D.7), we obtain
\begin{align}
\theta(n, \nu, j+1) &= (-1)^j q^{\sigma_j} \Bigl\{ \bigl(q^{\nu} - q^{j}\bigr) {n-1-j \brack n-1-\nu} - q^{j+n-\nu} {n-1-j \brack n-\nu} \Bigr\} \notag \\
&= (-1)^{j+1} q^{\sigma_{j+1}} {n-1-j \brack n-\nu} \Bigl\{ -\bigl(q^{\nu-j} - 1\bigr) \frac{q^{n-\nu} - 1}{q^{\nu-j} - 1} + q^{n-\nu} \Bigr\} \tag{D.8} \\
&= (-1)^{j+1} q^{\sigma_{j+1}} {n-1-j \brack n-\nu}, \tag{D.9}
\end{align}
where (D.8) follows (A.4).

We now give a proof of Proposition 5.

Proof. We apply the $q^{-1}$-derivative with respect to $y$ to (24) $\nu$ times, and then apply $x = y = 1$. By Lemma 5, the LHS becomes
\[
\sum_{i=\nu}^{n} q^{\nu(1-i)+\sigma_{\nu}} \beta(i, \nu) A_i = q^{\nu(1-n)+\sigma_{\nu}} \beta(\nu, \nu) \sum_{i=\nu}^{n} {i \brack \nu} q^{\nu(n-i)} A_i. \tag{D.10}
\]
The RHS becomes $q^{m(k-n)} \sum_{j=0}^{n} B_j \psi_j(1,1)$, where
\begin{align}
\psi_j(x,y) &\overset{\mathrm{def}}{=} \bigl[b_j(x,y;m) \ast a_{n-j}(x,y;m)\bigr]^{\{\nu\}} \notag \\
&= \sum_{l=0}^{\nu} {\nu \brack l} q^{l(n-j-\nu+l)}\, b_j^{\{l\}}(x,y;m) \ast a_{n-j}^{\{\nu-l\}}(x,y;m-l) \tag{D.11} \\
&= \sum_{l=0}^{\nu} {\nu \brack l} q^{l(n-j-\nu+l)} (-1)^l \beta(j,l) \beta(n-j, \nu-l)\, q^{-\sigma_{\nu-l}} \times b_{j-l}(x,y;m) \ast \alpha(m-l, \nu-l)\, a_{n-j-\nu+l}(x,y;m-\nu) \notag \\
&= \beta(\nu,\nu)\, q^{-\sigma_{\nu}} \sum_{l=0}^{\nu} {j \brack l} {n-j \brack \nu-l} q^{l(n-j)} (-1)^l q^{\sigma_l} \times b_{j-l}(x,y;m) \ast \alpha(m-l, \nu-l)\, a_{n-j-\nu+l}(x,y;m-\nu), \tag{D.12}
\end{align}
where (D.11) and (D.12) follow Lemmas 6 and 5, respectively. We have
\begin{align}
\bigl[b_{j-l} \ast \alpha(m-l, \nu-l)\, a_{n-j-\nu+l}\bigr](1,1;m-\nu) &= \sum_{u=0}^{n-\nu} \Biggl[ \sum_{i=0}^{u} q^{i(n-j-\nu+l)} {j-l \brack i} (-1)^i q^{\sigma_i} \alpha(m-i-l, \nu-l) {n-j-\nu+l \brack u-i} \alpha(m-\nu-i, u-i) \Biggr] \notag \\
&= q^{(m-\nu)(n-\nu-j+l)} \sum_{i=0}^{j-l} {j-l \brack i} (-1)^i q^{\sigma_i} \alpha(m-l-i, \nu-l) \notag \\
&= q^{(m-\nu)(n-\nu-j+l)}\, \alpha(\nu-l, j-l)\, \alpha(m-j, \nu-j)\, q^{(j-l)(m-j)}, \tag{D.13}
\end{align}
where (D.13) follows Lemma 12. Hence
\begin{align}
\psi_j(1,1) &= \beta(\nu,\nu)\, q^{m(n-\nu)+\nu(1-n)+\sigma_{\nu}}\, \alpha(m-j, \nu-j)\, q^{j(\nu-j)} \sum_{l=0}^{j} {j \brack l} {n-j \brack \nu-l} q^{l(n-\nu)} (-1)^l q^{\sigma_l} \alpha(\nu-l, j-l) \notag \\
&= \beta(\nu,\nu)\, q^{m(n-\nu)+\nu(1-n)+\sigma_{\nu}}\, \alpha(m-j, \nu-j)\, q^{j(\nu-j)} (-1)^j q^{\sigma_j} {n-j \brack n-\nu}, \tag{D.14}
\end{align}
where (D.14) follows Lemma 13. Incorporating this expression for $\psi_j(1,1)$ into the definition of the RHS and rearranging both sides, we obtain the result.

E. PROOF OF PROPOSITION 7

Proof. Equation (27) can be expressed in terms of the $\alpha_p(m,u)$ and ${n \brack u}_p$ functions as
\[
\sum_{i=\nu}^{n} {i \brack \nu}_p A_i = (-1)^{\nu} p^{-mk-\sigma_{\nu}} \sum_{j=0}^{\nu} {n-j \brack n-\nu}_p p^{j(m+n-j)} \alpha_p(m-j, \nu-j)\, B_j. \tag{E.1}
\]
We obtain
\begin{align}
p^{mk} \sum_{i=0}^{n} {i \brack 1}_p^{\nu} A_i &= p^{mk} \sum_{l=0}^{\nu} p^{\sigma_l} \beta_p(l,l) S_p(\nu,l) \sum_{i=l}^{n} {i \brack l}_p A_i \tag{E.2} \\
&= \sum_{l=0}^{\nu} \beta_p(l,l) S_p(\nu,l) (-1)^l \sum_{j=0}^{l} {n-j \brack n-l}_p p^{j(m+n-j)} \alpha_p(m-j, l-j)\, B_j \tag{E.3} \\
&= \sum_{j=0}^{\nu} B_j\, p^{j(m+n-j)} \sum_{l=j}^{\nu} \beta_p(l,l) S_p(\nu,l) (-1)^l {n-j \brack n-l}_p \alpha_p(m-j, l-j),
\end{align}
where (E.2) and (E.3) follow (31) and (E.1), respectively.

F. PROOF OF LEMMA 9

Proof. We first prove (41):
\begin{align}
q^{-mk} \sum_{i=0}^{n} {i \brack \lambda} q^{\nu(n-i)} A_i &= \frac{q^{-mk}}{\alpha(\lambda,\lambda)} \sum_{i=0}^{n} q^{\nu(n-i)} A_i \sum_{l=0}^{\lambda} {\lambda \brack l} (-1)^l q^{\sigma_l} q^{i(\lambda-l)} \tag{F.1} \\
&= \frac{q^{-mk}}{\alpha(\lambda,\lambda)} \sum_{l=0}^{\lambda} {\lambda \brack l} (-1)^l q^{\sigma_l} q^{n(\lambda-l)} \sum_{i=0}^{n} q^{(\nu-\lambda+l)(n-i)} A_i \notag \\
&= \frac{1}{\alpha(\lambda,\lambda)} \sum_{l=0}^{\lambda} {\lambda \brack l} (-1)^l q^{\sigma_l} q^{n(\lambda-l)}\, T_{0,0,\nu-\lambda+l}(C),
\end{align}


where (F.1) follows $\alpha(i,\lambda) = \sum_{l=0}^{\lambda} {\lambda \brack l} (-1)^l q^{\sigma_l} q^{i(\lambda-l)}$. We now prove (42): since
\[
{i \brack 1}^{\mu} = \Bigl( \frac{1 - q^i}{1 - q} \Bigr)^{\mu} = \frac{1}{(1-q)^{\mu}} \sum_{a=0}^{\mu} \binom{\mu}{a} (-1)^a q^{ia}, \tag{F.2}
\]
we obtain
\begin{align}
T_{1,\mu,\nu}(C) &= \frac{q^{-mk}}{(1-q)^{\mu}} \sum_{i=0}^{n} q^{\nu(n-i)} A_i \sum_{a=0}^{\mu} \binom{\mu}{a} (-1)^a q^{ia} \notag \\
&= \frac{q^{-mk}}{(1-q)^{\mu}} \sum_{a=0}^{\mu} \binom{\mu}{a} (-1)^a q^{an} \sum_{i=0}^{n} q^{(\nu-a)(n-i)} A_i \notag \\
&= (1-q)^{-\mu} \sum_{a=0}^{\mu} \binom{\mu}{a} (-1)^a q^{an}\, T_{0,0,\nu-a}(C). \tag{F.3}
\end{align}

G. PROOF OF PROPOSITION 8

Proof. From [27, (3.3.6)], we obtain ${n-i \brack \nu} = \frac{1}{\alpha(\nu,\nu)} \sum_{l=0}^{\nu} {\nu \brack l} (-1)^{\nu-l} q^{\sigma_{\nu-l}} q^{l(n-i)}$, and hence
\begin{align}
q^{-mk} \sum_{i=0}^{n} {n-i \brack \nu} A_i &= q^{-mk} \sum_{i=0}^{n} A_i\, \frac{1}{\alpha(\nu,\nu)} \sum_{l=0}^{\nu} {\nu \brack l} (-1)^{\nu-l} q^{\sigma_{\nu-l}} q^{l(n-i)} \notag \\
&= \frac{q^{-mk}}{\alpha(\nu,\nu)} \sum_{l=0}^{\nu} {\nu \brack l} (-1)^{\nu-l} q^{\sigma_{\nu-l}} \sum_{i=0}^{n} q^{l(n-i)} A_i \notag \\
&= \frac{1}{\alpha(\nu,\nu)} \sum_{l=0}^{\nu} {\nu \brack l} (-1)^{\nu-l} q^{\sigma_{\nu-l}}\, T_{0,0,l}(C), \tag{G.1}
\end{align}
where (G.1) follows (40). By Corollary 2, we have for $\nu < d_{\mathrm{R}}'$, $\sum_{l=0}^{\nu} {\nu \brack l} (-1)^{\nu-l} q^{\sigma_{\nu-l}}\, T_{0,0,l}(C) = q^{-m\nu} \alpha(n,\nu)$, and we obtain
\begin{align}
\sum_{j=0}^{\nu} {\nu \brack j} \alpha(n,j)\, q^{-mj} &= \sum_{j=0}^{\nu} {\nu \brack j} \sum_{l=0}^{j} {j \brack l} (-1)^{j-l} q^{\sigma_{j-l}}\, T_{0,0,l}(C) \notag \\
&= \sum_{l=0}^{\nu} T_{0,0,l}(C) {\nu \brack l} \sum_{j=0}^{\nu} {\nu-l \brack j-l} (-1)^{j-l} q^{\sigma_{j-l}} = T_{0,0,\nu}(C), \tag{G.2}
\end{align}
where (G.2) follows $\sum_{j=0}^{\nu-l} {\nu-l \brack j} (-1)^j q^{\sigma_j} = \delta_{\nu,l}$, which in turn is a special case of [27, (3.3.6)]. This proves (43). Thus, $T_{0,0,\nu}(C)$ is transparent to the code, and (44) can be shown by choosing $C = \mathrm{GF}(q^m)^n$ without loss of generality.

Suppose $S(\nu,n,m) \overset{\mathrm{def}}{=} \sum_{j=0}^{\nu} {\nu \brack j} \alpha(n,j)\, q^{-mj}$. Then $S(\nu,n,m) = S(n,\nu,m)$ since ${\nu \brack j} \alpha(n,j) = {n \brack j} \alpha(\nu,j)$. Also, combining (43) and (44) yields $S(\nu,n,m) = q^{n(\nu-m)} S(n,m,\nu)$. Therefore, we obtain $S(\nu,n,m) = q^{\nu(n-m)} S(\nu,m,n)$, which proves (45).

ACKNOWLEDGMENTS

This work was supported in part by Thales Communications Inc. and in part by a grant from the Commonwealth of Pennsylvania, Department of Community and Economic Development, through the Pennsylvania Infrastructure Technology Alliance (PITA). The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Nice, France, June 24-29, 2007.

REFERENCES

[1] F. MacWilliams and N. Sloane, The Theory of Error-Correcting Codes, North-Holland, Amsterdam, The Netherlands, 1977.

[2] V. Pless, "Power moment identities on weight distributions in error correcting codes," Information and Control, vol. 6, no. 2, pp. 147-152, 1963.

[3] L. Hua, "A theorem on matrices over a field and its applications," Chinese Mathematical Society, vol. 1, no. 2, pp. 109-163, 1951.

[4] P. Delsarte, "Bilinear forms over a finite field, with applications to coding theory," Journal of Combinatorial Theory A, vol. 25, no. 3, pp. 226-241, 1978.

[5] V. Tarokh, N. Seshadri, and A. R. Calderbank, "Space-time codes for high data rate wireless communication: performance criterion and code construction," IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 744-765, 1998.

[6] P. Lusina, E. M. Gabidulin, and M. Bossert, "Maximum rank distance codes as space-time codes," IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2757-2760, 2003.

[7] E. M. Gabidulin, A. V. Paramonov, and O. V. Tretjakov, "Ideals over a non-commutative ring and their application in cryptology," in Proceedings of the Workshop on the Theory and Application of Cryptographic Techniques (EUROCRYPT '91), vol. 547 of Lecture Notes in Computer Science, pp. 482-489, Brighton, UK, April 1991.

[8] E. M. Gabidulin, "Optimal codes correcting lattice-pattern errors," Problems of Information Transmission, vol. 21, no. 2, pp. 3-11, 1985.

[9] R. M. Roth, "Maximum-rank array codes and their application to crisscross error correction," IEEE Transactions on Information Theory, vol. 37, no. 2, pp. 328-336, 1991.

[10] E. M. Gabidulin, "Theory of codes with maximum rank distance," Problems of Information Transmission, vol. 21, no. 1, pp. 1-12, 1985.

[11] K. Chen, "On the non-existence of perfect codes with rank distance," Mathematische Nachrichten, vol. 182, no. 1, pp. 89-98, 1996.

[12] R. M. Roth, "Probabilistic crisscross error correction," IEEE Transactions on Information Theory, vol. 43, no. 5, pp. 1425-1438, 1997.

[13] W. B. Vasantha and N. Suresh Babu, "On the covering radius of rank-distance codes," Ganita Sandesh, vol. 13, pp. 43-48, 1999.

[14] T. P. Berger, "Isometries for rank distance and permutation group of Gabidulin codes," IEEE Transactions on Information Theory, vol. 49, no. 11, pp. 3016-3019, 2003.

[15] E. M. Gabidulin and P. Loidreau, "On subcodes of codes in rank metric," in Proceedings of IEEE International Symposium on Information Theory (ISIT '05), pp. 121-123, Adelaide, Australia, September 2005.

[16] A. Kshevetskiy and E. M. Gabidulin, "The new construction of rank codes," in Proceedings of IEEE International Symposium on Information Theory (ISIT '05), pp. 2105-2108, Adelaide, Australia, September 2005.

[17] M. Gadouleau and Z. Yan, "Properties of codes with the rank metric," in Proceedings of IEEE Global Telecommunications Conference (GLOBECOM '06), pp. 1-5, San Francisco, Calif, USA, November 2006.

[18] M. Gadouleau and Z. Yan, "Decoder error probability of MRD codes," in Proceedings of IEEE Information Theory Workshop (ITW '06), pp. 264-268, Chengdu, China, October 2006.

[19] M. Gadouleau and Z. Yan, "On the decoder error probability of bounded rank-distance decoders for maximum rank distance codes," IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3202-3206, 2008.

[20] P. Loidreau, "Properties of codes in rank metric," http://arxiv.org/pdf/cs.DM/0610057/.

[21] M. Schwartz and T. Etzion, "Two-dimensional cluster-correcting codes," IEEE Transactions on Information Theory, vol. 51, no. 6, pp. 2121-2132, 2005.

[22] G. Richter and S. Plass, "Fast decoding of rank-codes with rank errors and column erasures," in Proceedings of IEEE International Symposium on Information Theory (ISIT '04), p. 398, Chicago, Ill, USA, June-July 2004.

[23] P. Loidreau, "A Welch-Berlekamp like algorithm for decoding Gabidulin codes," in Proceedings of the 4th International Workshop on Coding and Cryptography (WCC '05), vol. 3969, pp. 36-45, Bergen, Norway, March 2005.

[24] D. Grant and M. Varanasi, "Weight enumerators and a MacWilliams-type identity for space-time rank codes over finite fields," in Proceedings of the 43rd Allerton Conference on Communication, Control, and Computing, pp. 2137-2146, Monticello, Ill, USA, October 2005.

[25] D. Grant and M. Varanasi, "Duality theory for space-time codes over finite fields," to appear in Advances in Mathematics of Communications.

[26] P. Loidreau, "Etude et optimisation de cryptosystemes a cle publique fondes sur la theorie des codes correcteurs," Ph.D. dissertation, Ecole Polytechnique, Paris, France, May 2001.

[27] G. E. Andrews, The Theory of Partitions, vol. 2 of Encyclopedia of Mathematics and Its Applications, Addison-Wesley, Reading, Mass, USA, 1976.

[28] G. Gasper and M. Rahman, Basic Hypergeometric Series, vol. 96 of Encyclopedia of Mathematics and Its Applications, Cambridge University Press, New York, NY, USA, 2nd edition, 2004.

[29] P. Delsarte, "Properties and applications of the recurrence $F(i+1, k+1, n+1) = q^{k+1} F(i, k+1, n) - q^k F(i, k, n)$," SIAM Journal on Applied Mathematics, vol. 31, no. 2, pp. 262-270, 1976.

[30] L. Carlitz, "q-Bernoulli numbers and polynomials," Duke Mathematical Journal, vol. 15, no. 4, pp. 987-1000, 1948.