
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011 2943

Construction and Hardware-Efficient Decoding of Raptor Codes

Hady Zeineddine, Mohammad M. Mansour, Senior Member, IEEE, and Ranjit Puri

Abstract—Raptor codes are a class of concatenated codes composed of a fixed-rate precode and a Luby-transform (LT) code that can be used as rateless error-correcting codes over communication channels. These codes have the atypical features of dynamic code rate, highly irregular Tanner-graph check-degree distribution, random LT-code structure, and LT-precode concatenation, which render a hardware-efficient decoder implementation achieving good error-correcting performance a challenging task. In this paper, the design of hardware-efficient Raptor decoders with good performance is addressed through joint optimizations targeting 1) the code construction, 2) the decoding schedule, and 3) the decoder architecture. First, random encoding is decoupled by developing a two-stage LT-code construction scheme that embeds structural features in the LT-graph that are amenable to efficient implementation while guaranteeing good performance. An LT-aware LDPC precode construction methodology that ensures architectural compatibility with the structured LT code is also proposed. Second, a decoding schedule is optimized to reduce memory cost and account for the processing-workload variability caused by the varying code rate. Third, to address the problems of check-degree irregularity and hardware underutilization, a novel reconfigurable check unit that attains a constant throughput while processing a varying number of LT and LDPC nodes is presented. These design steps are collectively employed to generate serial and partially parallel decoder architectures. A Raptor code instance constructed using the proposed method, having an LT data-block length of 1210, is shown to outperform or closely match the performance of conventional LDPC codes over a range of code rates. The corresponding serial hardware decoder is synthesized using 65-nm CMOS technology and achieves a throughput of 22 Mb/s at rate 0.4, dissipates an average power of 222 mW at 1.2 V, and occupies an area of 1.77 mm².

Index Terms—Algorithms, code construction, decoder architecture, iterative decoding, low-density parity-check codes, Raptor codes, rateless codes.

I. INTRODUCTION

A Raptor code is constructed by concatenating a fixed-rate precode to a rateless LT-code [1], [2]. The code is therefore rateless and its "rate" is determined on a frame-by-frame

Manuscript received June 23, 2010; revised December 15, 2010; accepted February 02, 2011. Date of publication February 14, 2011; date of current version May 18, 2011. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Warren J. Gross.

H. Zeineddine and M. M. Mansour are with the Department of Electrical and Computer Engineering, American University of Beirut, Beirut 1107 2020, Lebanon (e-mail: [email protected]; [email protected]).

R. Puri is with the Microsoft Corporation, Redmond, WA 98052 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2011.2114655

basis, or even changed for the same frame upon a decoding failure. Similar to other rateless codes, Raptor codes were initially designed to operate on binary erasure channels (BECs). In a BEC [3], a code symbol is either erased with an erasure probability or received correctly by the receiver. This behavior is an adequate model for packet transmission over computer networks, where a corrupted or un-received packet is considered an erased symbol. Through coding, the erased packets in a frame can be recovered by applying erasure-correcting decoding on its received (unerased) symbols. Coding is thus advantageous compared to acknowledgement-based protocols, especially under poor channel conditions or when transmitting from one server to multiple recipients. By using rateless codes over varying channels, the common problem in fixed-rate erasure coding of over- or under-estimating the channel loss rate is avoided [2].

LT-codes [1] constitute a class of rateless codes in which n output symbols (n being potentially any value) are produced randomly and independently from k input symbols according to a degree distribution. A decoding algorithm is then applied at the receiver side to recover the k input symbols from the received output symbols. The LT-decoder is efficient in terms of the block size required to achieve decoding success with high probability. The main drawback of LT codes is that, for vanishing error probability over erasure channels, the average degree of the output symbols grows logarithmically with the number of input symbols k (when n is close to k). This makes it hard to design a linear-time encoder and decoder for LT codes [2], [4].

Raptor codes [2] are an extension of LT-codes specifically targeted to solve the problem of nonlinear-time encoding and decoding. In Raptor codes, the input symbols are encoded using a fixed-rate code prior to LT encoding. Raptor codes solve the transmission problem over an unknown erasure channel in an almost optimal manner [2].

The application of LT/Raptor codes over binary-input memoryless symmetric channels was studied in [4] and [5], with emphasis on the theoretical behavior and on the design and analysis of check-degree distributions in [4]. In these codes, the symbols are bits, not packets. The rateless nature of Raptor codes leads to a dynamic coding performance, which can be viewed as an application metric rather than simply as a design parameter. It provides flexibility in making time-dependent optimal trade-offs among transmission time, bandwidth, and power. In addition, it allows the development of efficient methods/protocols to deal with a decoding failure, where overcoming such a failure is simply achieved by sending additional bits (i.e., by instantaneously decreasing the rate). These characteristics can be advantageous in rapidly changing environments, such as ad hoc networks, multiuser communication channels, and channels

1053-587X/$26.00 © 2011 IEEE



where noise levels are not known a priori to the sender. The precode is needed to attain a high code minimum distance and thus avoid error floors at relatively high bit-error rates (BERs).

LT codes can be efficiently decoded using the iterative two-phase message-passing (TPMP) algorithm typically used in LDPC decoding [6]. If the fixed-rate precode is an LDPC code, applying the TPMP algorithm on the concatenated code (LT and LDPC) yields significantly better decoding performance than applying a two-stage decoding process [7]. In addition, joint decoding results in a lower average number of LT-decoding iterations due to better stopping criteria, and enables utilizing the same hardware resources for LT and LDPC decoding. This motivates the need for a hardware-efficient decoder architecture for Raptor codes having an LDPC precode.

The peculiar features of Raptor codes impose serious challenges when it comes to a hardware-efficient decoder implementation. These features include the varying code rate, random LT-encoding, variable check-degree distribution, and joint decoding of the LT code and LDPC precode. These irregularity and randomness features lead to low resource utilization, high control overhead, and complex data-movement patterns, in addition to stringent memory requirements, thus resulting in a highly inefficient decoder implementation.

In this paper, a class of decoders for Raptor codes that are efficient in terms of both hardware complexity and algorithmic error-correcting performance is proposed. To address the problems of random encoding, nondeterministic rate, variable degree distribution, and LDPC-LT architectural compatibility, the proposed method involves decoder design and optimizations at three different levels, namely, code construction, decoding procedure, and decoder architecture. First, a two-stage LT-code construction method is proposed that yields short-cycle-free LT codes, decouples code structuring from random encoding, and allows an LT-LDPC-compatible design. Moreover, a replication technique to construct girth-8 structured LT-graphs is developed. A method to construct a class of 4-cycle-free LDPC codes that are LT-compatible is also presented, and a subset of this class, obtained by row merging, is shown to yield 4-cycle-free Raptor codes. Second, serial and partially parallel memory-aware decoding schedules to jointly decode both LDPC and LT codes are proposed along with the corresponding architectures. Third, three novel reconfigurable check-node unit designs, corresponding to three different message-update algorithms, are developed to process messages corresponding to irregular check degrees at a constant throughput.

Making use of the above optimizations, the decoding procedure can then be mapped into row processing of a regular matrix. The decoding schedule hence is made simple, regular, and identical across both LT and LDPC codes. The problems related to hardware utilization, stringent memory requirements, and interconnect complexity are largely resolved. The constructed codes preserve good error-correcting performance over low to medium rates. An instance code constructed using the proposed method outperforms randomly built rate-0.4 (3,5) LDPC codes and closely matches the LDPC codes defined in the IEEE 802.16 PHY layer [8] at comparable rates, down to low BERs. Hardware simulations show that the serial decoder implementation achieves a throughput of 20 Mb/s, has an area of 1.7 mm², and dissipates an average power of 222 mW at a 1.2-V supply voltage.

The remainder of the paper is organized as follows. Section II describes Raptor codes and their decoding algorithm. Moreover, it presents an overview of the challenges facing hardware-efficient decoder implementations and the proposed solutions to tackle them. Architecture-aware LT-code and LDPC-precode construction techniques are presented in Section III. In Section IV, serial and partially-parallel decoder architectures, in addition to the reconfigurable check function unit design, are proposed. Section V presents hardware simulation results for the resulting serial architecture, and Section VI concludes the paper.

II. RAPTOR CODES

A Raptor code is composed of a fixed-rate precode concatenated with a rateless LT-code [2]. The LT-code has a minimum distance that is bounded by the minimum bit-node degree and hence exhibits an error floor at relatively high BERs. The precode is thus needed to attain a high code minimum distance and avoid high-BER error floors.

A. Encoding

Given a frame of k data bits, the encoding process is done in two stages. First, the frame is encoded using a fixed precode into a new frame of n bits. Next, the n-bit encoded frame is re-encoded with an LT-code to generate a number of output bits determined by a dynamically chosen code rate. Each output bit in the LT-code frame is generated independently as follows. A predesigned degree distribution on the integers 1, …, n is sampled to obtain an integer d, called the output degree; then d bits of the n-bit frame are chosen at random and their modulo-2 sum (XOR) is transmitted. The design of the degree distribution is crucial for the resulting code to yield good error-correcting performance.
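The per-output-bit encoding step above can be sketched as follows; the function name and the small illustrative degree distribution are ours, not from the paper:

```python
import random

def lt_encode_bit(frame, degree_dist, rng=random):
    """Generate one LT output bit from an n-bit precoded frame.

    degree_dist: list of (degree, probability) pairs -- a stand-in for
    the predesigned output-degree distribution described in the text.
    """
    degrees, probs = zip(*degree_dist)
    d = rng.choices(degrees, weights=probs, k=1)[0]   # sample the output degree
    neighbors = rng.sample(range(len(frame)), d)      # choose d distinct bits at random
    out = 0
    for b in neighbors:                               # modulo-2 sum (XOR)
        out ^= frame[b]
    return out, neighbors

# Example: generate 16 output bits from a small 8-bit precoded frame.
frame = [1, 0, 1, 1, 0, 0, 1, 0]
dist = [(1, 0.1), (2, 0.5), (3, 0.4)]                 # illustrative values only
outputs = [lt_encode_bit(frame, dist) for _ in range(16)]
```

Because each output bit records its chosen neighbors, the same routine also yields the graph connectivity needed by the decoder.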

Similar to LDPC codes [6], an LT-code can be represented by a bipartite (Tanner) graph [9], with variable (or bit) nodes representing the input bits on one side and check nodes representing the output bits on the other (see Fig. 1). An edge exists between check-node c and bit-node b if bit b is an input to the XOR whose output is check-node c. In this case, nodes b and c are said to be neighbors. The degree of a node is the number of edges connected to it. If all nodes in both partitions of the graph have the same degree, the graph (code) is called regular; otherwise it is irregular. The girth is defined to be the minimum cycle length in the graph.

An LT-code can be equivalently represented by a binary matrix G, where G(c, b) = 1 if bit-node b is connected to check-node c and 0 otherwise. The Hamming weight of a row or column vector of G is defined as the number of nonzero entries in the vector. If the precode is an LDPC code, the Raptor code can be represented by a bipartite graph with the check-node partition composed of LT and LDPC check nodes, as shown in Fig. 1.
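As a minimal illustration of this matrix view, the following sketch (function name ours) builds the binary incidence matrix from per-check neighbor lists; row weights equal check degrees and column weights equal bit degrees by construction:

```python
import numpy as np

def incidence_matrix(check_neighbors, n_bits):
    """Build the binary LT matrix: entry (c, b) = 1 iff bit b feeds check c."""
    G = np.zeros((len(check_neighbors), n_bits), dtype=np.uint8)
    for c, nbrs in enumerate(check_neighbors):
        G[c, nbrs] = 1
    return G

# Three check nodes of degrees 2, 3, and 1 over four bit nodes.
G = incidence_matrix([[0, 1], [1, 2, 3], [0]], n_bits=4)
```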

B. Decoding

A Raptor code can be decoded using Gallager's TPMP algorithm used for LDPC codes [6], where two types of messages are exchanged between bit-nodes and check-nodes in an iterative manner. Let λ_c be the intrinsic channel reliability value of the c-th check-node, CTB^(l)(c, b) the check-to-bit message from check-node c to bit-node b at iteration l, and BTC^(l)(b, c) the bit-to-check message from bit-node b to check-node c at iteration l.



Fig. 1. Bipartite graph and parity-check matrix of a Raptor code. A length-4 cycle is highlighted on both.

We denote by C_b the index set of the check-node neighbors of bit-node b, and by B_c the index set of the bit-node neighbors of check-node c. We use the notation CTB^(l)(·, b) to denote the vector of check messages to bit-node b at iteration l, and CTB^(l) to denote all vectors of check messages at iteration l; BTC^(l)(·, c) and BTC^(l) are similarly defined. For simplicity, we drop the subscripts and iteration index when the context is clear or when arbitrary nodes are considered. The decoding algorithm is described as follows:

1) At iteration l:
Phase 1: Compute for each bit-node b the message BTC^(l)(b, c) to every check-node c ∈ C_b according to

BTC^(l)(b, c) = Σ_{c' ∈ C_b \ {c}} CTB^(l-1)(c', b)    (1)

with initial conditions BTC^(1)(b, c) = 0 for all b and c.
Phase 2: Compute for each check-node c the message CTB^(l)(c, b) to every bit-node b ∈ B_c:
• If node c is an LT check-node, then the check node's own channel value enters the update:

CTB^(l)(c, b) = 2 tanh^{-1}( tanh(λ_c / 2) · Π_{b' ∈ B_c \ {b}} tanh(BTC^(l)(b', c) / 2) ).    (2)

• If node c is an LDPC check-node, then

CTB^(l)(c, b) = 2 tanh^{-1}( Π_{b' ∈ B_c \ {b}} tanh(BTC^(l)(b', c) / 2) ).    (3)

2) Decision phase: At the final iteration L, bit b is set to 0 or 1 according to the sign of Σ_{c ∈ C_b} CTB^(L)(c, b).

Decoding is terminated either when the maximum number of decoding iterations is reached or when the decoded bits satisfy all the check constraints of the LDPC precode. Using the latter stopping criterion in joint decoding results in a lower average number of iterations per frame than in the case of two-stage decoding, where the LT-decoding stage needs a constant number of iterations.
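The two phases and the LT/LDPC distinction can be sketched in a floating-point software model. This is an illustrative implementation of the standard tanh-rule updates (the data structures and names are ours), not the paper's fixed-point hardware:

```python
import math

def tpmp_iteration(lam, lt_checks, ldpc_checks, ctb):
    """One two-phase message-passing iteration on a Raptor graph.

    lam[c]      : channel LLR of LT check-node c (LDPC checks have none)
    lt_checks   : {c: [bit neighbors]} for LT check nodes
    ldpc_checks : {c: [bit neighbors]} for LDPC check nodes
    ctb         : {(c, b): message} from the previous iteration (empty at l=1)
    Returns the updated check-to-bit messages.
    """
    checks = {**lt_checks, **ldpc_checks}
    # Phase 1: bit-to-check messages -- sum of extrinsic check messages.
    btc = {}
    for c, nbrs in checks.items():
        for b in nbrs:
            btc[(b, c)] = sum(ctb.get((c2, b), 0.0)
                              for c2, nb2 in checks.items()
                              if b in nb2 and c2 != c)
    # Phase 2: check-to-bit messages via the tanh rule; for LT checks the
    # check node's own channel LLR enters the product.
    new_ctb = {}
    for c, nbrs in checks.items():
        for b in nbrs:
            prod = math.tanh(lam[c] / 2) if c in lt_checks else 1.0
            for b2 in nbrs:
                if b2 != b:
                    prod *= math.tanh(btc[(b2, c)] / 2)
            prod = max(min(prod, 0.999999), -0.999999)  # numerical guard
            new_ctb[(c, b)] = 2 * math.atanh(prod)
    return new_ctb
```

Note how a degree-1 LT check simply forwards its channel LLR, which is what seeds the very first iteration when all bit-to-check messages are zero.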

From a coding point of view, the code construction method must generate codes that have high girth and minimum distance or, more generally, minimize the number of short cycles and low-weight codewords. Short cycles in a bipartite graph create high correlations between supposedly independent messages, which prevent suboptimal iterative decoding from converging to maximum-likelihood decoding (MLD) or near-MLD performance. The existence of low-weight codewords, on the other hand, degrades the code's algorithmic performance and causes error floors at relatively high signal-to-noise ratios (SNRs).

C. Overview of Decoder Architectures

In principle, LT and LDPC codes have similar decoding algorithms, and hence their decoders share many similarities. The iterative decoder architecture, shown in Fig. 2, is composed of three main components: 1) a check-node processor composed of a number of check function units (CFUs) that compute check-to-bit messages (serially or in parallel); 2) a bit-node processor composed of a number of bit function units (BFUs) that compute bit-to-check messages; and 3) a network that communicates messages between the check- and bit-node processors. Message communication can be done through a complex interconnect that mimics the graph topology in a parallel architecture and/or through memory in a serial or partially-parallel architecture.

Fig. 2. An iterative decoder architecture overview. The interconnect network communicates a fixed number of messages per clock cycle in either direction.

The design and implementation of LDPC decoders have been studied extensively in the literature (e.g., [10]–[17]). Several challenges were resolved to obtain efficient decoders. One of these challenges is the randomness of the LDPC code structure, leading to a very complex interconnect in the case of a parallel architecture and high memory overhead in serial and partially parallel architectures. Other challenges include the high complexity of the check-to-bit message update, the performance degradation due to quantization, and the decoding convergence speed. To solve the code-randomness problem, several classes of structured codes, such as quasi-cyclic (QC) [18] and AA [10] codes, were introduced. In general, the parity-check matrices of such codes are composed of zero and permutation matrices. While these codes have algorithmic performance comparable to that of randomly built codes for short to moderate block lengths, their structure simplifies memory partitioning and access and significantly reduces the interconnect complexity. Consequently, instances of quasi-cyclic codes were deployed in several systems, including the IEEE 802.16e standard [8]. A convergence speedup by a factor of nearly 2, in addition to significant memory savings, was obtained by applying a turbo-decoding message-passing (TDMP) algorithm [19]. In the TDMP algorithm, the neighboring bit posterior probabilities are updated upon the processing of each check node, and the updated values are used in the following check-to-bit message computations of the same iteration. The layered structure of the deployed AA and QC codes makes the application of TDMP hardware-efficient. Two main check-message update algorithms were proposed to simplify the check-node unit and avoid the need for complex nonlinear computation. One is based on a simplified form of the BCJR algorithm [20] tailored to the trellis of a single parity-check code [10]. Another is the Min-Sum algorithm [21], [22], which degrades the performance but substantially reduces the check-node memory requirements [13], [14]. Collectively, these optimizations allowed for efficient high-throughput multi-mode/rate LDPC decoder implementations (e.g., [13]–[15], [23], and [24]).

While similar to LDPC codes, Raptor codes have inherent features that render a direct application of the aforementioned LDPC-code optimizations a nontrivial task and ultimately prohibit an efficient decoder implementation. These features, and the proposed steps to tackle them, are listed below.

1. Dynamic code rate: The hardware resources of a parallel architecture will be highly underutilized when the code rate exceeds the minimum rate. This fact favors the implementation of serial or partially-parallel decoder architectures. The proposed decoding procedure stores only the check-to-bit messages and implements (1) in a serial manner similar to [19] and [25], cutting down memory cost and eliminating complications associated with irregular and variable bit-node degrees, caused nonexclusively by the code's rateless nature.

2. Random selection of input bits: This means that several random locations of the bit-node memory (as many as the number of bit-to-check messages communicated each cycle) are accessed simultaneously. Such a stringent requirement would lead to prohibitively complex memory design and read/write networks. Similar to LDPC codes, irregular quasi-cyclic LT codes can be constructed to resolve the problem. Three issues, however, must be resolved to make this solution plausible. The first, a design-related issue, is that the LT-matrix has to be redesigned whenever the rate and check-degree distribution vary, possibly in real time. The second, a storage-related issue, is that the shift offsets and the locations of the nonzero submatrices of the LT-matrix must be stored for every code instance. In addition, due to the code structure, the degree-distribution polynomial coefficients are quantized, with the quantization step equal to the ratio of the permutation-submatrix size to the block length. The proposed code construction method utilizes one quasi-cyclic source matrix to pseudo-randomly generate all code instances via row splitting, according to any given check-set distribution, as will be explained in Section III. This resolves, to a great extent, the design-related and storage issues. In the proposed method, the optimal check-degree distribution is approximated by a check-set distribution; such an approximation does not have the same dependency on the permutation-matrix size as in the custom quasi-cyclic code.

3. Check-degree randomness and variable check-degree distribution: To attain a constant message throughput under variable check degrees, reconfigurable CFUs capable of processing a constant number of messages per cycle are utilized in the Raptor decoder architecture. Serial CFUs [13], [14], [23] are more power-efficient, since they involve less data movement and multiplexing, and they possess the appropriate flexibility to process variable-degree check nodes. However, their latency is proportional to the corresponding check-node degree, which is considerably high in the high-rate LDPC precode. In addition, the row-splitting transformation involves pseudorandom permutation of the communicated messages, which, in serial CFU processing, is either constrained or incurs extra latency and complexity compared to parallel CFU processing. In partially parallel decoder architectures, this high latency results in idle time between subsequent iterations that is comparable to the decoding time. Therefore, reconfigurable parallel CFUs are utilized in the architecture, and three novel CFU designs implementing the conventional, BCJR, and Min-Sum algorithms are developed.

4. LT-code and LDPC-precode architectural compatibility: Hardware reuse (i.e., memory, function units, and interconnect) in both LT and LDPC decoding is made possible in this work by constructing high-rate LDPC codes that share the same structure with the LT-generating source matrix and have check nodes whose degrees are multiples of the maximum possible LT-node degree and whose bit neighbors are evenly distributed, as discussed in Section III. The reconfigurable CFUs are designed to process the LDPC check nodes in a quasi-serial manner, attaining a constant message throughput per cycle.
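As a rough software model of the Min-Sum variant of such a reconfigurable unit, the sketch below consumes all messages of one source row in a single pass and applies the extrinsic sign/minimum update per sub-check, mimicking a CFU whose throughput is independent of how the row is split. The interface is hypothetical, and the channel term of LT check nodes is omitted for brevity:

```python
def minsum_cfu(btc_msgs, split_set):
    """Min-Sum check update for one source row split into sub-checks.

    btc_msgs : list of d incoming bit-to-check messages (one source row)
    split_set: degrees of the sub-checks after row splitting; must sum to d.
    Returns one check-to-bit message per input, computed per sub-check, so
    all d messages are produced in one pass regardless of the split.
    """
    assert sum(split_set) == len(btc_msgs)
    out = []
    start = 0
    for d in split_set:
        group = btc_msgs[start:start + d]
        for i, m in enumerate(group):
            others = group[:i] + group[i + 1:]
            if not others:              # degree-1 sub-check: no extrinsic input
                out.append(0.0)
                continue
            sign = 1
            for o in others:            # product of extrinsic signs
                if o < 0:
                    sign = -sign
            out.append(sign * min(abs(o) for o in others))
        start += d
    return out
```

Reconfiguring the unit amounts to changing `split_set`; the number of messages consumed per call stays fixed at the source-row weight.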



Fig. 3. Random splitting of a row is done in two steps: 1) random permutation of the nonzero entries and 2) sampling from the partition set to obtain the resulting row weights. For simplicity, sampling is performed here by reading the partition set in a circular fashion.

The design steps mentioned above, targeted at resolving the challenges facing an efficient Raptor decoder implementation, are discussed in the subsequent sections.

III. ARCHITECTURE-AWARE RAPTOR CODE CONSTRUCTION

The proposed method to construct architecture-aware Raptor codes is summarized by the following three steps:

1) A structured source matrix is constructed to have high girth. This embeds favorable regularity features and guarantees good girth properties in the LT-code formed by the random encoding process in Step 2.

2) Pseudo-random row-splitting is applied on the source matrix, yielding an expanded matrix. A rate-R LT-code is obtained by selecting the leading rows of this matrix, in a number determined by the rate R.

3) A code-construction methodology is applied to form a class of matrices that are structurally identical to the source matrix and describe 4-cycle-free LDPC graphs. A regular code from this class can be obtained by applying row-merging on the source matrix. The resulting Raptor graph is 4-cycle-free.

The following subsections describe the above three steps and present BER simulation curves to demonstrate the effectiveness of the proposed codes.

A. Row Splitting Transformation

Random row splitting transforms a bipartite graph having constant check-degree into an LT-graph having a check-degree distribution that is close to optimal, while still preserving the underlying structure and girth properties of the initial graph. It is described by the following procedure:

Input: A matrix having uniform row weight .
Output: A matrix describing an LT-code.
Procedure:
1. Design an LT check-degree distribution , with , that is optimal with respect to a preselected metric such as error-correcting performance or convergence speed.
2. Approximate the check-degree distribution with a distribution over the space of sets , having the property that the elements of each set are positive integers that sum to .
3. For every row of , random sampling from is done according to . Denote the chosen set by . The row vector is then split into rows, , , such that and the Hamming weight of is . Let the resulting matrix be . is formed by selecting the first rows of .

Step 1 is beyond the scope of this work. One way to perform the approximation in Step 2 is to minimize the weighted quadratic distance between and the check-degree distribution resulting from . Let be a matrix whose entries give the fraction of check nodes of each degree produced by each partition set. The check-degree distribution resulting from is then the product of this matrix with the mixture vector over the sets, and finding that vector can be formulated as a quadratic optimization problem subject to nonnegativity and normalization constraints, where the diagonal weight matrix has each entry set proportional to the sensitivity of the preselected metric in Step 1 to a change in the corresponding degree fraction. This minimizes the deviation of the metric from the optimal one due to the approximation error. Steps 1 and 2 can involve an iterative co-design of the distribution and the partition sets to reach the best approximation, while keeping the quantization of the distribution parameters compatible with the number of split rows.
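The exact form of the quadratic program was lost in extraction; under the plausible reading that it minimizes a weighted least-squares distance between the target degree distribution and the distribution induced by a mixture over partition sets, subject to the mixture being a probability vector, a minimal numerical sketch is (the matrix `M`, target `omega`, and solver choice are assumptions, not the paper's exact formulation):

```python
import numpy as np

def fit_partition_mixture(M, omega, weights, iters=3000, eta=0.5):
    """Weighted least-squares fit of a mixture x over partition sets so
    that the induced check-degree distribution M @ x approximates the
    target omega, with x >= 0 and sum(x) == 1 (assumed problem form).

    Solved by exponentiated gradient (mirror descent) on the simplex.
    """
    n = M.shape[1]
    x = np.full(n, 1.0 / n)
    W = np.diag(weights)
    for _ in range(iters):
        grad = 2.0 * M.T @ W @ (M @ x - omega)  # gradient of the quadratic cost
        x = x * np.exp(-eta * grad)             # multiplicative update keeps x > 0
        x /= x.sum()                            # renormalize onto the simplex
    return x
```

Exponentiated gradient is used only because it keeps the iterates feasible without an explicit projection step; any convex QP solver would serve equally well.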

One simple way to perform the sampling in Step 3 is to sequentially read a circular array of sets. These sets are chosen to approximate the ideal distribution and thus can be easily reconfigured in hardware for an arbitrary distribution. The row-splitting technique is illustrated in Fig. 3.
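The row-splitting procedure with circular partition-set sampling can be sketched as follows; the matrix size and the partition sets are illustrative assumptions, not values from the paper:

```python
import numpy as np

def split_rows(H, partition_sets):
    """Split each uniform-weight row of H into several lower-weight rows.

    partition_sets is a circular array of integer tuples, each summing to
    the uniform row weight d; reading it sequentially approximates the
    target LT check-degree distribution (the sets here are illustrative).
    """
    d = int(H[0].sum())
    out = []
    for r, row in enumerate(H):
        cols = np.random.permutation(np.flatnonzero(row))  # step 1: permute nonzeros
        parts = partition_sets[r % len(partition_sets)]    # step 2: circular sampling
        assert sum(parts) == d
        start = 0
        for w in parts:                                    # one new row per set element
            new_row = np.zeros(H.shape[1], dtype=np.uint8)
            new_row[cols[start:start + w]] = 1
            out.append(new_row)
            start += w
    return np.array(out)
```

Note that splitting a row redistributes its 1-entries without creating or destroying any, so the column weights of the original matrix are preserved.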

Row-splitting is implemented in hardware by randomly permuting the bit-to-check messages and then applying check-node processing by reconfiguring the CFU, which has a constant throughput of messages per cycle, according to the chosen partition set. Random sampling of the bit-node neighbors of a check node is thus distributed over two phases: the construction phase of the source matrix and the row-splitting phase. The stringent bit-node memory requirements can therefore be relaxed by imposing a desired structure on the source matrix. At the algorithmic level, the girth of the graph described by the source matrix is preserved after row-splitting, as Lemma 1 states.

Lemma 1: If the graph obtained by row splitting has a cycle of length , then the graph prior to row splitting has a cycle of length .

Proof: Let be a cycle in the graph described by . Let be the node in the graph , described by , which yields node after row-splitting. If the edge exists in , then the edge exists in . Therefore, is a path in and contains a cycle whose length is less than the length of the path .

B. Construction of the Source Matrix

The structure and girth properties of the LT-graph are determined by constructing a graph described by . Matrix is constructed to have a quasi-cyclic structure composed of zero and shifted identity matrices. As is the case with LDPC decoders, this allows partitioning the bit-node memory into one- or two-port memory arrays where, in serial decoding, one message is read/written from each memory array per cycle, and it simplifies memory address generation. An example of such a matrix is the regular matrix , with

(4)

where is a design parameter and is the identity matrix cyclically shifted to the right by positions. When is prime, describes a 4-cycle-free graph. Similar to the case of quasi-cyclic codes, can be equivalently represented using a smaller base matrix, where a zero submatrix in is replaced by a entry and a shifted identity matrix is replaced by its shift value.
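Equation (4) is garbled in this copy; a standard quasi-cyclic construction consistent with the surrounding description (an assumption on our part) places at block position (i, j) the identity shifted by i·j mod p, which is 4-cycle-free for prime p. A sketch with a direct 4-cycle check:

```python
import numpy as np

def shifted_identity(p, s):
    """p x p identity cyclically shifted to the right by s positions."""
    return np.roll(np.eye(p, dtype=np.uint8), s, axis=1)

def quasi_cyclic(p, n_block_rows, n_block_cols):
    """Quasi-cyclic matrix whose (i, j) block is I shifted by (i*j) mod p.

    The i*j shift pattern is an assumption standing in for the garbled (4);
    for prime p, two rows from different block rows can share at most one
    column, so the graph is 4-cycle-free.
    """
    return np.block([[shifted_identity(p, (i * j) % p)
                      for j in range(n_block_cols)]
                     for i in range(n_block_rows)])

def four_cycle_free(H):
    """4-cycle-free iff no pair of distinct rows overlaps in >= 2 columns."""
    overlap = H.astype(int) @ H.T.astype(int)
    np.fill_diagonal(overlap, 0)
    return overlap.max() <= 1
```

The primality argument mirrors the paper's: two rows share a column in block j only when a linear congruence in j mod p holds, and for a prime p that congruence has at most one solution.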

The graph described by above is regular, since both check- and bit-nodes have degree . Irregular graphs, with irregular bit-degrees and constant check-degree , can be designed using the following procedure:

1) Choose a set of pair-wise prime integers and set .

2) For , construct a matrix by vertically stacking identity matrices of size .

3) Form a matrix by horizontally appending the matrices and then retaining only the first rows of the resulting matrix.

Fig. 4 illustrates an example of an irregular matrix constructed using the above procedure for , , .
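The stacking procedure above can be sketched as follows; since the paper's size symbols were lost in extraction, the overall stack height n = prod(primes) is an assumption made for illustration:

```python
import numpy as np

def stacked_source(primes, m):
    """Irregular 4-cycle-free source matrix: for pairwise-prime block
    sizes p_i, vertically stack n // p_i identity blocks of size p_i,
    append the stacks horizontally, and keep the first m rows.

    Two rows share a column in stack i only when their indexes are
    congruent mod p_i; sharing two columns would require congruence mod
    two coprime sizes at once, which is impossible (cf. Lemma 2).
    """
    n = int(np.prod(primes))
    stacks = [np.vstack([np.eye(p, dtype=np.uint8)] * (n // p)) for p in primes]
    return np.hstack(stacks)[:m]
```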

Lemma 2: The resulting matrix has regular check-node degree and irregular bit-node degrees of or for . Moreover, it describes a 4-cycle-free bipartite graph.

Proof: Suppose a 4-cycle exists in and let and be the row indexes corresponding to the check-nodes in the cycle. Then , such that , and , which is impossible by the pair-wise primality of and .

1) Girth-Oriented Replication/Girth-8 LT-Code Construction: The idea is to apply an additional replication step to a 4-cycle-free quasi-cyclic matrix, targeted to avoid length-6 cycles. Each node of the base graph is replicated into nodes,

Fig. 4. Irregular matrix constructed using the above procedure (example parameters; symbols lost in extraction).

forming a new "child" graph , and each edge between bit-node and check-node in is replaced by edges between the children nodes of and in , respectively. Formally, the matrix describing is constructed using the following procedure.

Input: A prime and a regular matrix identical to the matrix described in (4).
Output: A matrix describing a girth-8 graph.
Procedure: Replace every scalar entry in by a shifted identity matrix , where ; , for . The resulting LT source matrix is given by .

Theorem 1: The graph obtained from the graph-replication method has girth .

Proof: See Appendix I.

Replicating the number of block rows and columns allows for flexible construction of LT-aware LDPC codes having high check-degrees while still being 4-cycle-free. The choice of the shift offsets is targeted, along with avoiding 6-cycles, at allowing the construction of a subclass of 4-cycle-free Raptor codes. Partially parallel decoding can be applied by processing rows of simultaneously, while incurring no extra control or scheduling overhead. In the rest of this paper, the matrix described here is used as the source matrix for LT-code construction.

C. LT-Compatible Precode Construction

The constructed LT-code has a minimum distance that is upper bounded by the minimum bit-node degree. The concatenation of a precode with an LT code has an approximately multiplicative effect on the codeword weights; a codeword of weight in the precode is mapped, with high probability, to a codeword of weight , with being the bit-node degree in the LT code.

Fig. 5. Row merging example. The base matrix of the matrix defined in (4) is used.

Fig. 6. Regular-to-irregular graph transformation in a submatrix. The nonzero entries in the first three rows corresponding to one shift value are replaced by nonzero entries corresponding to another. Note that the column weight remains unchanged.

From an algorithmic performance perspective, the LDPC precode design is motivated by the need to avoid short cycles and low-weight codewords. At the architectural level, both LT and LDPC decoding utilize the same hardware resources. Therefore, the LT-compatible precode is formed by the replication of a base matrix, using -sized submatrices, as follows:

Input: A vector of values, where .
Output: A matrix described by the base matrix , where row in has weight .
Procedure: Graph-Replication Construction Technique
Step 1: For :
i. Construct a matrix such that 1) it describes a 4-cycle-free bipartite graph (with vertices on each side), 2) the weight of row equals , and 3) the code it describes has no very-low-weight codewords.
ii. Replace every 0-entry with and every 1-entry with the value , for .
Step 2: Form a base matrix by vertically concatenating the matrices, for .
Step 3: Return to Step 1.i and permute the rows, to avoid very low minimum distances in the code described by .

Theorem 2: The graph , described by , is 4-cycle-free.

Proof: See Appendix II.

Besides being composed of -size submatrices, the constructed precodes have another architecture-aware property. The Hamming weight of each row of is a multiple of , say , and the 1-entries of the row are distributed evenly among the -wide blocks of the bit-node side. Hence, the 1-entries of each row can be partitioned into sets, each of size , having exactly one element in every bit-node block. The bit-to-check messages corresponding to each of the sets can then be accessed from bit-node memory in one clock cycle. Therefore, bit-memory access corresponding to one LDPC check-node takes clock cycles. Thus, no overhead exists in the bit-node memory organization, which is partitioned into blocks corresponding to the bit-node blocks, or in the interconnect network. The reconfigurable CFU is designed to process messages of check-degree in a quasi-serial manner, with a throughput of messages per cycle. Since both and are composed of -size submatrices, this scheme can be easily generalized to the partially-parallel decoding case.

The LDPC code generated by the graph-replication technique is 4-cycle-free, but the resulting Raptor graph formed from the LT and LDPC graphs need not be. A subclass of these LDPC codes that yields a 4-cycle-free Raptor graph can be obtained directly from via a row-merging transformation, illustrated in Fig. 5, as follows:

Procedure: Row Merging Transformation
Step 1: Form a set of integers with maximum cardinality such that:
i. .
ii. , if , then the pair of elements is identical to the pair .
Step 2: For :
i. Form the matrix by adding modulo 2 the shifted identity matrices , for , where is the th element of .
ii. Replace every 0-entry of with and every 1-entry with the quantity , for .
Step 3: Form an LDPC base matrix by concatenating the matrices , for .
Step 4: , omit the rows from whose indexes are , to form a matrix. Then omit the first columns. The resulting matrix describes an LT source matrix.
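The core of Step 2.i, merging several shifted identities into one higher-degree block by modulo-2 addition, can be sketched as follows (the shift values are illustrative):

```python
import numpy as np

def merge_rows(p, shifts):
    """Row merging (sketch): the modulo-2 sum of several p x p shifted
    identity matrices collapses their single-parity rows into one
    higher-degree check row per identity row. With distinct shifts,
    every merged row has weight len(shifts)."""
    acc = np.zeros((p, p), dtype=np.uint8)
    for s in shifts:
        acc ^= np.roll(np.eye(p, dtype=np.uint8), s, axis=1)
    return acc
```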

Theorem 3: The graph , described by , is 4-cycle-free.


Fig. 7. FER and BER versus SNR curves for the rate-0.4 LDPC and Raptor codes.

Proof: See Appendix III.

The Raptor code designed above has the following properties. Its precode has rate , check-degree , and bit-degree . Its LT code has maximum check-degree and maximum bit-degree .

1) Regular-to-Irregular Graph Transformation: The base matrix of the regular LDPC code obtained by row merging can be used as a starting point for designing the base submatrices of irregular 4-cycle-free LDPC precodes. This can be done by simply omitting positive entries from the regular base matrix. However, if the matrix is small, the effect on the code's minimum distance will be significant. Alternatively, the row weights of the submatrices obtained from Step 2.i of the row-merging method can be changed via 1-entry substitutions across the matrix. This is done by picking an integer that satisfies the following property: such that with , and : . Then, in a submatrix, the 1-entries corresponding to one shift value can be interchangeably replaced by those of another in a column-per-column or row-per-row fashion. The result is an irregular 4-cycle-free LDPC graph. This procedure is illustrated in Fig. 6. By applying identical row shuffling in and in the submatrices of [Section III-B-2)] that share the same bit-nodes, a 4-cycle-free Raptor graph with an irregular LDPC subgraph is obtained. However, the resulting LT graph is no longer 6-cycle-free.

Only regular LDPC precodes are considered in the decoder architecture discussed in the next section. However, only minor modifications in the control, memory address generation, and reconfigurable CFU operation are needed to make the architecture suitable for irregular LDPC decoding as well. In addition, is assumed to have columns, i.e., the first columns are omitted as indicated in the row-merging transformation.

D. Code Design Examples and Simulation Results

In this section, the coding performance of Raptor codes constructed using the proposed techniques is compared to that of conventional LDPC codes at rates 0.4, 0.5, and 0.66. An LT code instance with and is first constructed. For each rate, LT codes obtained by several partition-set arrays are compared, and the array yielding the best error-correcting performance is chosen. Table I shows the details of the codes used. Both the regular and irregular LDPC precodes are 4-cycle-free and have a minimum distance of 6. Only the Raptor codes RAP and RAP containing regular precodes are 4-cycle-free. The irregular precode outperforms the regular precode at rate 0.66 and was therefore used there. Three metrics were used to evaluate code performance: the bit-error rate (BER), frame-error rate (FER), and average number of iterations required for successful decoding, with a maximum of 100 iterations. The results are plotted in Figs. 7–9.

The Raptor code outperforms the rate-0.4 LDPC code, compares favorably to the rate-0.5 LDPC code, and compares unfavorably to the rate- LDPC codes. However, the relatively low minimum distance of the precode implies that an error floor would appear at low error rates (at an FER of or lower, or equivalently a BER of ). To push the error floor to lower error rates, the precode has to be redesigned, or alternatively, is set to 13 (i.e., the next


Fig. 8. FER and BER versus SNR curves for the rate-0.5 and rate-0.66 LDPC and Raptor codes.

Fig. 9. Average number of iterations required until decoding convergence versus SNR for all codes in Table I.


TABLE I
PARAMETERS OF THE SIMULATED RAPTOR CODES AND LDPC CODES

TABLE II
NUMBER OF EDGES IN THE MATCHING RAPTOR AND LDPC TANNER GRAPHS

prime after 11), thereby increasing the block length so that regular codes of minimum distance 8 can be obtained.

The code structure affects the decoding throughput through two metrics, namely, the number of decoding iterations needed to converge to a valid codeword and the message-processing workload required per iteration. The latter metric is directly related to the number of edges in the code graph (i.e., to the matrix sparsity). While comparable for rates 0.4 and 0.5, the average number of iterations in Raptor decoding is significantly higher than in LDPC decoding at rate . This is due to the nature of the LT code, where the transmitted bits are modulo-2 sums of the bit values instead of the bit values themselves. For rate , the designed Raptor code requires 9 iterations to converge even in favorable channel conditions. The number of edges in the corresponding LDPC and Raptor graphs, compared in Table II, indicates that the LDPC graphs are sparser by a factor of than their Raptor counterparts.

IV. RAPTOR DECODER ARCHITECTURES

Matrix can be viewed as a concatenation of -deep layers; the precode, likewise, consists of -deep layers. The underlying layered structure of the code is convenient for applying the turbo decoding algorithm. If two layers are connected via a nonzero entry to a bit-node , the posterior reliability value of has to be updated upon processing the former of the two layers, before being forwarded to the latter layer for processing. This scheduling order causes idle time, which significantly affects the decoder throughput and constrains the pipelining of the interconnect and computational units. This inconvenience is resolved, in LDPC decoding, by reordering layers, block-columns, or messages within the layers [26], or by processing all layers in parallel, serially within each layer, and adding offsets to the submatrices' shift values [15]. For serial LT decoding, layer reordering in , considered in row merging, can be done efficiently without affecting the bit-node degree distribution, assuming the latency is less than . The problem is more involved in precode decoding due to the small depth of the layers compared to the latency and the high number of nonzero submatrices within each layer. The algorithmic modification needed in the precode construction and its hardware cost, as opposed to the performance degradation caused by ignoring the interdependence of adjacent layers, is beyond the scope of this work. The problem becomes more relevant in partially-parallel decoding. The proposed architectures in this paper therefore apply the TPMP decoding schedule.

Fig. 10. A serial Raptor decoder architecture. The parameter in the intrinsic memory is chosen so that the maximum block-size, or equivalently the minimum achievable LT-code rate, is attained.

A. Serial Decoder Architecture

Fig. 10 illustrates the architecture of a serial decoder for Raptor codes. The decoder processes the check-to-bit or bit-to-check messages corresponding to one row of per cycle. For a Raptor code composed of an LDPC precode of check-degree and an LT code formed by row-splitting rows of (or equivalently ), the serial decoder performs subiterations per decoding iteration, as described next. For clarity of exposition, let be a submatrix of composed of the rows used to generate the LT graph, horizontally concatenated with the rows corresponding to the LDPC precode. The rows corresponding to each LDPC check-node are grouped into one block row.

The operation of the serial decoder proceeds as follows. At subiteration of iteration , for , the decoder performs the following steps (refer to Fig. 10):

1) Forwarding: A -dimensional message vector is read from bit-node memory. Assuming is the edge corresponding to the th nonzero entry of row of , denotes the posterior extrinsic reliability value of the connected bit-node, obtained at the end of iteration .

2) Permuting: is sent to a pseudo-random permuter to generate .

3) operation: The -dimensional vector is read from check-node memory, and new bit-to-check messages are computed as .

4) operation: New check-to-bit messages are generated by performing LT decoding using (2) if , or LDPC decoding using (3) if .

5) Inverse permuting and check-memory write-back: The vector is written back to check-node memory and simultaneously inverse-permuted using the -block to generate .

6) Posterior reliability update: The extrinsic posterior messages are updated as follows: ; .

7) Accumulation: is written back to the bit-node memory.

Steps 3), 6), and 7) compute the bit-to-check messages serially, thereby avoiding the underutilization and scheduling problems that may arise from bit-node degree variability, which in turn is caused by the rate and check-degree variability. This allows the bit-node block to store a constant number of messages regardless of the code rate, thus saving memory.
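The per-row message arithmetic of the subiteration can be sketched numerically; the tanh-rule update below is a generic stand-in for the LT/LDPC update equations (2) and (3), which are not reproduced in this copy:

```python
import numpy as np

def check_update(btc):
    """Check-to-bit messages via the tanh rule, used here as a generic
    stand-in for the check-update equations (2) and (3)."""
    t = np.tanh(btc / 2.0)
    prod = np.prod(t)
    safe_t = np.where(np.abs(t) > 1e-12, t, 1e-12)   # guard division
    return 2.0 * np.arctanh(np.clip(prod / safe_t, -0.999999, 0.999999))

def subiteration(posterior, ctb_old):
    """One serial-decoder subiteration for a single row:
    Step 3) BTC = posterior - CTB_old (subtractor block),
    Step 4) new CTB from the CFU,
    Steps 6)-7) posterior update and write-back."""
    btc = posterior - ctb_old
    ctb_new = check_update(btc)
    return btc + ctb_new, ctb_new
```

Each output message excludes its own input (the extrinsic principle), which is what makes the serial accumulation in Steps 6) and 7) valid.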

The main components of the serial decoder are described indetail below.

Bit-Node Blocks 1 and 2: These blocks independently perform the forwarding [Step 1)] and accumulation [Step 7)]. The blocks interchange their roles every iteration. Using two blocks instead of one allows simultaneous communication of check-to-bit and bit-to-check messages, hence doubling the throughput.

Each memory block is composed of banks, where each bank holds -bit words and has one read port and one write port. The banks are accessed using the address , where . The intermediate value , updated every cycles, corresponds to the index of the accessed bit-node in matrix [see Section III-B-2)]; , updated every cycle, corresponds to the bit index in the child matrix resulting from the girth-oriented replication. Since is composed solely of shifted identity matrices, the computation of both and during LT decoding is done by either computing the shift-offset of the corresponding submatrix or incrementing the previous value. During LDPC decoding, is alternated between different values, which are updated every cycles. The update of is similar to that in LT-decoding but is done every cycles.

Communication Network: It consists mainly of a permuter and an inverse permuter, as well as adders and subtractors. In the permuter design, a tradeoff exists between the number of possible permutations and hardware complexity. One possible permuter applies a linear permutation, generating possible permutations. It can be implemented using linear feedback shift registers to generate the permutation coefficients pseudo-randomly, a -wide cyclic shifter, and -bit multiplexers. The subtractor blocks form the bit-to-check messages [Step 3)], while the adders accumulate the posterior extrinsic reliability values [Step 6)].
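A linear permutation of the kind mentioned above can be sketched as follows; the coefficients a and b stand in for the values an LFSR would supply in hardware (an illustrative assumption):

```python
def linear_permutation(n, a, b):
    """Linear permuter pi(i) = (a*i + b) mod n; a bijection whenever
    gcd(a, n) == 1."""
    return [(a * i + b) % n for i in range(n)]

def inverse_linear_permutation(n, a, b):
    """Inverse permuter, using the modular inverse of a."""
    a_inv = pow(a, -1, n)                 # requires gcd(a, n) == 1
    return [(a_inv * (i - b)) % n for i in range(n)]
```

The inverse permuter needs only the same coefficients as the forward one, which is why the decoder can inverse-permute with a mirrored copy of the network.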

Check-Node Block: The check-node block is composed of four main components.

1) Check-Node Memory: It consists of banks, each storing up to words and performing one read and one write per cycle. The memory banks are accessed sequentially.

2) Partition Table: A buffer that stores the check-node degrees resulting from the different row-splitting scenarios (i.e., the partition set in Fig. 3). In the simplest case, it can be accessed sequentially as a circular buffer. For each splitting scenario, a -bit partition vector showing the degrees of the formed check nodes is stored. For example, if a row with Hamming weight 10 is split into four check nodes of degrees 3, 2, 4, 1, then the partition vector is 0001100001. Equality of bits and of the vector indicates that messages and belong to the same check-node.

3) Intrinsic Values Memory (IVM): It stores the intrinsic channel reliability values of the check-nodes. Since the number of words read every subiteration is variable but at most , the IVM block is divided into banks. The reliability value of check-node is written at location in bank . For correct operation, the values read from the memory banks are reorganized before being forwarded to the reconfigurable CFU, such that if bits and of the partition vector differ, the th value of the IVM output vector must equal the intrinsic reliability value of the check-node whose message index starts at . This is done through a decompressor network, similar to that shown in Fig. 12(a), composed of a -wide -bit cyclic shifter to apply the shifting, followed by muxes to obtain the correct distribution of the intrinsic values over the locations.

4) Reconfigurable CFU: It receives the bit-to-check messages from the subtractor block, the messages from the IVM, and the partition vector. It computes the check-to-bit messages corresponding to one row of . The detailed design of the CFU is presented in Section IV-C.
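The partition-vector encoding from the worked example above (a weight-10 row split into check nodes of degrees 3, 2, 4, 1) can be sketched as:

```python
def degrees_to_partition_vector(degrees):
    """Encode check-node degrees as an alternating-bit partition vector:
    consecutive messages belong to the same check node iff their bits
    are equal, e.g. degrees (3, 2, 4, 1) -> '0001100001'."""
    bits, level = [], 0
    for d in degrees:
        bits.extend([str(level)] * d)
        level ^= 1                        # flip the bit at every node boundary
    return "".join(bits)

def partition_vector_to_degrees(vec):
    """Recover the degree list by counting runs of equal bits."""
    degrees, run = [], 1
    for a, b in zip(vec, vec[1:]):
        if a == b:
            run += 1
        else:
            degrees.append(run)
            run = 1
    degrees.append(run)
    return degrees
```

Because only bit transitions carry information, a single d-bit vector reconfigures the CFU for any splitting scenario without storing explicit degree fields.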

B. Partially-Parallel Decoder Architecture

To increase throughput by a factor of , the messages corresponding to rows of , or equivalently one row of , are processed simultaneously. The resulting partially-parallel architecture has the same organization as the serial architecture. The required modifications are the following. The bit-node memories have their banks organized as -bit partitions, and access is done row-wise. The check-node (and IVM) memory is partitioned into blocks, each sized down by a factor of but retaining the access rate and pattern of its counterpart in the serial architecture. For permutation, each -message output of a memory bank is cyclically shifted prior to applying independent random permutations on the -message vectors corresponding to the processed rows of . The number of CFUs and adders/subtractors is replicated by the same factor.

Page 12: IEEE TRANSACTIONS ON SIGNAL PROCESSING, …staff.aub.edu.lb/~mm14/pdf/journals/2011_IEEE_TSP...IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011 2943 Construction and

2954 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 59, NO. 6, JUNE 2011

TABLE III
HARDWARE RESOURCES AND THROUGHPUT OF THE SERIAL AND PARTIALLY PARALLEL ARCHITECTURES. THROUGHPUT IS MEASURED AS THE NUMBER OF COMPLETED ITERATIONS/s; IS THE CLOCK FREQUENCY. THE MEMORY ADDRESS DECODERS ARE NOT INCLUDED

The hardware resources and throughput of the serial and partially-parallel architectures are summarized in Table III, given per message-bit of precision. In total, processing one cycle of requires reading values from memory ( is the average number of nodes formed by one row-split), writing values, and applying one CFU operation and additions/subtractions.

C. Reconfigurable Check Function Unit Design

The design of the reconfigurable CFUs presented in this section is based on their fixed-degree counterpart implementations. Three basic algorithms for the CFU operation exist, namely, the conventional one implementing (2) and (3), the BCJR-based [10], [19], and the Min-Sum [21], [27] algorithms. The latter two give reduced-complexity approximations of the check-update equations and thus need no LUTs to compute the function. The Min-Sum algorithm, in particular, reduces the check-memory requirements. However, this comes at the expense of a varying degradation in performance for the Min-Sum update, and an increase in the complexity of the computational units and in latency in the CFU for the BCJR-based update. In this section, denotes the message of index corresponding to processed row , and denotes the intrinsic reliability value of the check node derived from row to which edge is connected.

1) Accumulator-Based CFU Architecture: The function and its inverse in (2) and (3) are implemented using LUTs. The operation is also applied to the intrinsic reliability values only once, and the generated values are stored in the IVM. The proposed reconfigurable CFU is based on the design of a degree- CFU that implements the following transformed equation:

CTB_i = (sum over j of BTC_j) - BTC_i  (computed in the transform domain)

This equation can be implemented using a tree of adders of depth to compute the total sum, followed by subtractors that extract the individual messages BTC from it. At level of the adder tree, the bit-to-check messages are partitioned into sets such that 1) each set includes messages with consecutive indices, 2) the sum of the operations on the messages in a set is computed in a stage (call it the output value of the set), and 3) the output values of these sets are fed to the adder subtree starting from the next stage, which in turn computes their sum. The left (right) boundary of a set is defined as the minimum (maximum) index of the messages in the set.

In LT-decoding mode, intermediate values are maintained at each level of the adder tree. The th value at a level stores the sum of the operations on messages that 1) belong to the same partition as message at that level and 2) have their corresponding LT-graph edges connected to the check node to which the th edge is connected. The algorithm for updating these values is shown below, and the corresponding architecture is shown in Fig. 11(a).

Algorithm 1: Accumulator-Based Reconfigurable CFU Algorithm

BTC

for to do

for all s.t. an adder in fixed-rate mode adds the output values of and at stage , with , do

if the right boundary of and the left boundary of correspond to the same check-node then

else

end if

if and share the same check-node, then

if and share the same check-node, then

end for

end for

In LDPC-decoding mode, the bit-to-check messages corresponding to one check-node are fed to the CFU in consecutive cycles. By using an extra accumulator to add the output values into BTC , where is the index of the first row in corresponding to the current check-node, and by delaying subtraction for the extra cycles required to obtain the latter sum, the CFU attains a constant throughput of messages/cycle. The registers for storing the intermediate vector during LT-processing are now reused to store the vector for the additional cycles. The extra latency of cycles is due to the quasi-serial mode of CFU operation in LDPC decoding. This is unlike the LT-decoding latency, which is caused by pipelining targeted at enhancing the operating frequency.
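The sum-minus-own-message principle behind the accumulator-based CFU can be sketched as follows (the transform-domain conversion via LUTs is assumed to have already been applied to the inputs):

```python
import numpy as np

def accumulator_ctb(btc, partition_vector):
    """Accumulator-style check update: within each run of equal
    partition-vector bits, i.e. each check node formed by row-splitting,
    CTB_i = (sum of the node's BTC values) - BTC_i."""
    btc = np.asarray(btc, dtype=float)
    ctb = np.empty_like(btc)
    start = 0
    for i in range(1, len(partition_vector) + 1):
        if i == len(partition_vector) or partition_vector[i] != partition_vector[start]:
            s = btc[start:i].sum()            # total sum over one check node
            ctb[start:i] = s - btc[start:i]   # subtract own message
            start = i
    return ctb
```

Computing one total per node and then subtracting each input reproduces all leave-one-out sums with a single adder tree plus one subtractor per output, which is what keeps the CFU throughput constant.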

2) Forward–Backward BCJR-Based CFU Architecture: In [10] and [19], a SISO message-processing unit (MPU) was proposed to implement check-message processing. The


check-to-bit messages are computed using a simplified form of the BCJR algorithm [20]. Let denote an operator which performs the operation . Then the message from a degree- check-node to bit-node at iteration , where , is computed as [10] CTB BTC . Moreover, a simple yet fairly accurate approximation of this operator was proposed; the unit implementing the approximation is called the Max-Quartet MPU. The resulting CFU implements the BCJR algorithm on the syndrome trellis of an -SPC code, where intermediate forward and backward metrics are propagated each cycle. At stage , two metrics (forward) and (backward) are computed according to the recursions BTC BTC and BTC BTC , respectively. Then, starting from stage , the check-to-bit messages CTB are generated as CTB .
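The forward-backward scheme can be sketched as follows. The Max-Quartet approximation itself was lost in extraction, so the exact pairwise check-node operator is used in its place (an assumption); the recursion structure is what matters here:

```python
import numpy as np

def boxplus(a, b):
    """Exact pairwise check-node operator; the Max-Quartet unit in the
    paper is a low-complexity approximation of this."""
    return 2.0 * np.arctanh(np.clip(np.tanh(a / 2.0) * np.tanh(b / 2.0),
                                    -0.999999, 0.999999))

def bcjr_check(btc):
    """Forward-backward (BCJR-style) computation of all check-to-bit
    messages of one degree-d check node in O(d) boxplus operations."""
    d = len(btc)
    fwd = [btc[0]]                        # forward (alpha) recursion
    for i in range(1, d):
        fwd.append(boxplus(fwd[-1], btc[i]))
    bwd = [btc[-1]]                       # backward (beta) recursion
    for i in range(d - 2, -1, -1):
        bwd.append(boxplus(bwd[-1], btc[i]))
    bwd.reverse()                         # bwd[i] combines btc[i..d-1]
    ctb = [bwd[1]]                        # each output excludes its own input
    for i in range(1, d - 1):
        ctb.append(boxplus(fwd[i - 1], bwd[i + 1]))
    ctb.append(fwd[d - 2])
    return np.array(ctb)
```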

A reconfigurable version of the SISO MPU is proposed in Fig. 11(b). LT-node processing capability is achieved by multiplexing the inputs to the Max-Quartet units implementing the - and -recursions. The forward and backward metrics are computed as follows:

BTC if and have the same check node; otherwise .

BTC if and have the same check node; otherwise .

In LDPC-decoding mode, the metric BTC corresponding to row in (one of the rows merged to form the check-node) is computed and then forwarded to a degree- serial CFU. The CFU output is then forwarded to Max-Quartet units to compute .

3) Min-Sum CFU Architecture: A reduced-complexity approximation of the check-to-bit message computation is given in [21] as

CTB_i = ( prod over j != i of sign(BTC_j) ) * min over j != i of |BTC_j|  (5)

The Min-Sum approximation results in a degradation of the error-correcting performance. To reduce the performance gap, a correction step, consisting of subtracting an offset or multiplying by a normalization factor [27], is applied to the resulting minimum value.
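The normalized Min-Sum update of (5) can be sketched as follows; the factor alpha = 0.8 is an illustrative normalization value, not one taken from the paper:

```python
import numpy as np

def normalized_min_sum(btc, alpha=0.8):
    """Normalized Min-Sum check update, cf. (5):
    CTB_i = alpha * prod_{j != i} sign(BTC_j) * min_{j != i} |BTC_j|.
    Only the minimum, the second minimum, and the minimum's index are
    needed to regenerate all output magnitudes."""
    btc = np.asarray(btc, dtype=float)
    mag = np.abs(btc)
    order = np.argsort(mag)
    m1, m2 = mag[order[0]], mag[order[1]]            # min and second min
    total_sign = np.prod(np.sign(btc))               # sign of whole node
    out_mag = np.where(np.arange(len(btc)) == order[0], m2, m1)
    return alpha * total_sign * np.sign(btc) * out_mag
```

The three-value sufficiency noted in the text is visible here: every output magnitude is either the minimum or, for the minimum's own edge, the second minimum.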

The reconfigurable CFU implementing the Min-Sum approximation is based on the constant-degree implementation given in [16]. The architecture has a tree structure similar to the accumulator-based CFU, where the addition operation is substituted by a 4-input, 2-output partial-sorting function that computes the minimum and second minimum of two sorted input pairs. The final outputs of the tree are the minimum value, its index, and the second minimum value of the inputs. During LT-decoding, two intermediate values instead of one are needed for every edge per level. These are the minimum and second minimum values of the messages that 1) belong to the same partition as message at that level and 2) have their corresponding LT-graph edges connected to the check node to which the th edge is connected. To reduce the number of intermediate values per edge to approximately one, the intermediate value of each degree-1 LT-node edge is set to the corresponding intrinsic channel reliability value. For a degree-2 LT-node edge , the intermediate value holds the corresponding minimum value if is odd, or the second minimum value if is even. The indexes of the minimum values, corresponding to the row check nodes, are tracked across the CFU by updating a -bit minimum-index vector after each stage. In the final stage, the correction step is applied to the intermediate values, and the CTB messages are generated using the partition and minimum-index vectors. In LDPC-decoding mode, the minimum-index vector and the odd- and even-indexed outputs of the correction step are forwarded to an additional partial-sorter. The minimum, second minimum, and minimum index over the rows of every corresponding check node are thus computed with an extra latency of cycles and forwarded to the CTB block.

The Min-Sum algorithm changes the check-node memory requirements, since up to three values are sufficient to regenerate the absolute values of the messages corresponding to one node. This results in significant savings in the LDPC memory. Three LT-memory organization schemes are considered: 1) the LT-memory retains its conventional organization; 2) the LT-memory stores two values per node (the minimum and second minimum values) and a -bit minimum-index vector per row; and 3) the LT-memory stores three values per LT-node. In the two latter schemes, the LT-memory storing the absolute values of the CTB messages is composed of two blocks, one of which stores the odd/even-indexed values output by the correction step, prior to the CTB generation. The odd/even-indexed memory block has access patterns, and thus an organization, similar to those of the IVM memory, but contains , instead of , memory banks. A -wide compressor is needed to store the CFU output in memory and a -wide decompressor is needed to regenerate the corresponding messages in the next iteration. Fig. 12(a) shows the CFU interconnect under this scheme. Fig. 12(b) compares the reduction in check-node memory size in the three respective schemes versus the average LT-node degree. As the average LT-node degree increases, the memory-size reduction brought by the Min-Sum algorithm becomes more significant.
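The memory saving rests on the fact that, under Min-Sum, a check node's outgoing magnitudes take only two distinct values. A minimal sketch of the compress/decompress pair (function names are illustrative; the hardware additionally stores sign bits, which are omitted here):

```python
def compress(mags):
    """Reduce a check node's incoming message magnitudes to the
    (min, 2nd-min, min-index) triple held in check-node memory."""
    idx = min(range(len(mags)), key=lambda i: mags[i])
    second = min(mags[i] for i in range(len(mags)) if i != idx)
    return mags[idx], second, idx

def decompress(m1, m2, idx, degree):
    """Regenerate all outgoing magnitudes: every edge sees the minimum
    m1, except the minimum edge itself, which sees the 2nd-minimum m2."""
    return [m2 if i == idx else m1 for i in range(degree)]
```

Storage per node thus drops from one value per edge to a constant three values plus the index, which is why the saving grows with the node degree.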

Table IV compares the hardware complexity of the fixed-degree and reconfigurable implementations of the three message-update algorithms. The main advantage of the BCJR-based and Min-Sum implementations is the elimination of lookup tables. In the BCJR-based CFU, the numbers of registers and relatively complex function units increase by and are higher than in the other two algorithms. On the other hand, the accumulator-based and Min-Sum implementations involve intensive multiplexing and have their register count replicated.

V. PERFORMANCE EVALUATIONS AND SIMULATION RESULTS

1) Quantization Analysis: A bit-accurate C++ simulator of the serial Raptor decoder was developed to analyze the quantization effect on the coding performance. The decoding procedure of the rate-0.4 Raptor code (RAP in Table I) was simulated for message quantization of 5, 6, 7, and 28 (almost-ideal) bits, assuming a white Gaussian noise channel with BPSK modulation. Three check-message update equations were considered: the conventional [using (2) and (3)], the BCJR-based, and the Min-Sum with offset [22]. In the latter algorithm, the offset was obtained per check node by reading from a lookup table, according to the corresponding check degree and the average bit posterior reliability value resulting from the previous iteration.

Fig. 11. (a) Accumulator-based CFU architecture. The gray-shaded muxes are needed for LDPC-decoding. Registers used to propagate messages from the accumulator to the subtractor are omitted. (b) BCJR-based CFU. Blocks labeled perform the Max-Quartet operation.

As illustrated in Fig. 13, quantization results in less steep curves compared to the almost-ideal case. The performance gap between the conventional and BCJR-based algorithms, already small in the near-ideal case, disappears at the other quantization levels. The Min-Sum algorithm incurs a performance loss in the -dB range for message widths greater than 5. In all algorithms, a large performance loss results from going to 5-bit messages; therefore, the message quantization is set to 6 in the synthesis step discussed next.
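A uniform symmetric quantizer of the kind typically assumed in such bit-accurate fixed-point simulations can be sketched as follows (the function name and the step size used in the examples are illustrative; the paper's exact quantizer parameters are not restated here):

```python
def quantize(x, nbits, step):
    """Uniform symmetric quantizer: round to the nearest multiple of
    `step` and clip to the magnitude representable with `nbits` bits."""
    qmax = (2 ** (nbits - 1) - 1) * step  # largest representable magnitude
    q = round(x / step) * step
    return max(-qmax, min(qmax, q))
```

Narrower messages shrink both the resolution (`step`) budget and the clipping range (`qmax`), which is the mechanism behind the loss observed at 5-bit messages.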

2) Synthesis Results: The serial decoder datapath was modeled in Verilog, its datapath bit-width was set to , and it was then synthesized using a 65-nm, 1.2-V customized CMOS library. The Design Compiler tool from Synopsys [28] was used to synthesize the logic components. Area and power estimates of the various memory blocks of the decoder were computed using the CACTI software tool [29], after some customizations to fit the decoder requirements. The resulting area and power figures of the basic decoder components are summarized in Table V. The critical path of the decoder was estimated to be 3.26 ns; hence, the clock frequency was set at 300 MHz.

The decoder occupies an area of mm and dissipates 222 mW when operating at 1.2 V and 300 MHz. Memory accesses dissipate 85% of the total power, partially due to having memory accesses per cycle. The check-node memory accesses, which account for 40% of the overall memory accesses, dissipate 49% of the total power. This is due to the large size of the check-node memory, which occupies 70% of the total decoder area. This motivates the need to optimize the check-node memory through partitioning and by exploiting the fact that the check-node memory is accessed sequentially. Regarding area, 92% of the area is occupied by memory, mainly due to the low message-processing throughput of the decoder ( messages per cycle) and, consequently, the low area of the combinational components. However, in partially-parallel architectures, the area ratio of the interconnect and combinational function units to memory increases by a factor , as Table III suggests. The throughput of the synthesized serial decoder is plotted versus SNR in Fig. 14. Unlike the case for LDPC codes, no throughput advantage is achieved when the code rate increases. This is due to the increase in the number of iterations required for convergence that accompanies the rate increase.
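The last observation can be made concrete with a back-of-the-envelope throughput model (illustrative only; the cycle and iteration counts below are placeholders, not the synthesized decoder's measured figures): decoded-information throughput scales inversely with the iteration count, so a higher code rate buys nothing if convergence needs proportionally more iterations.

```python
def info_throughput(k_bits, f_clk_hz, cycles_per_iter, n_iters):
    """Decoded-information throughput (bits/s) of an iterative decoder:
    information bits per codeword divided by the decode time."""
    return k_bits * f_clk_hz / (cycles_per_iter * n_iters)

# A rate increase that doubles the required iterations cancels any
# per-codeword gain (all operand values here are placeholders):
low_rate = info_throughput(1210, 300e6, 1000, 10)
high_rate = info_throughput(1210, 300e6, 1000, 20)
```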

VI. CONCLUSION

Hardware-efficient Raptor decoders have been developed through a methodology that encompasses code construction, decoding scheduling, and architectural optimizations in a single design cycle. The proposed method decouples the code structure from the random encoding process and utilizes a memory-and-variability-aware decoding procedure and reconfigurable check processing. The decoding procedure is hence mapped to row-processing of a regular matrix. The resulting Raptor codes can be employed in wireless communication channels, where the code rate becomes a real-time parameter. The rateless nature of a Raptor code, added to the simple LT-encoding process, can be exploited to enhance communication under varying and poor channel conditions such as multiuser channels and ad hoc networks. Future work includes designing Raptor codes at higher rates, applying the turbo-decoding approach, and investigating the possibility of enhancing the decoding convergence speed relative to LDPC codes.

Fig. 12. (a) CFU operation and interconnect of the Min-Sum implementation. Only the LT-decoding mode is shown for clarity. (b) Check-node memory size of the three schemes in the Min-Sum implementation, relative to the memory organization discussed in Section IV-A.

TABLE IV
HARDWARE RESOURCES OF FIXED-DEGREE AND RECONFIGURABLE ARCHITECTURES OF THE THREE ALGORITHMS. THE MIN-SUM CORRECTION STEP AND THE SIGN-PRODUCING LOGIC ARE NOT INCLUDED

APPENDIX I
PROOF OF THEOREM 1

Let be a node of graph . By abuse of notation, let also be the index of the corresponding node in matrix . is the node in the graph from which is formed by graph replication. Node can be described by three quantities , where , and .

We first show that is 4-cycle free. Assume has a length-4 cycle , where the 's are bit-nodes and the 's are check-nodes. Two cases exist. Case 1: If and , then a length-4 cycle exists in , which is impossible by construction of . Case 2: If , then edges and result from one edge in , which is impossible by the replication method.

We next prove that is 6-cycle free. Assume has a length-6 cycle . By construction of , then , and . Therefore, , or equivalently

(6)

By the replication method, then , and . Therefore, , or equivalently

(7)

Similarly, by construction of , then , and . Substituting in (7) yields

(8)

Solving (6) and (8) gives or or , which are impossible by construction of .

Fig. 13. (a) Almost-ideal performance of the three algorithms considered. Performance of (b) conventional, (c) BCJR-based, and (d) Min-Sum algorithms for various quantization levels. Since the max-quartet approximation involves a constant, the internal datapath width of the BCJR-based CFU is chosen so that the corresponding quantization step is a multiple of .

APPENDIX II
PROOF OF THEOREM 2

Assume , described by matrix , has a length-4 cycle , where the 's are bit-nodes and the 's are check-nodes. Two cases exist.

TABLE V
POWER AND AREA ESTIMATES OF THE BASIC COMPONENTS OF THE SYNTHESIZED SERIAL DECODER. THE ACCUMULATOR-BASED CFU IS USED IN THE ARCHITECTURE. IVM MEMORY CAN PERFORM READS PER CYCLE. THE INTERCONNECT NETWORK CONSISTS OF THE PERMUTER/DEPERMUTER, IVM INTERCONNECT, ADDERS, AND OTHER AUXILIARY LOGIC



Fig. 14. Synthesized decoder throughput (in number of decoded information bits/s) versus SNR for Raptor codes RAP , RAP , RAP .

Case 1) If , then a length-4 cycle exists in the graph described by , which is impossible by Step 1.i of the construction method.

Case 2) If , then is connected to , in the 4-cycle, which implies that . Similarly, is connected to , in the 4-cycle; hence . Therefore, , which implies that by construction.

APPENDIX III
PROOF OF THEOREM 3

Assume , described by matrix , has a length-4 cycle , where the 's are bit-nodes and the 's are check-nodes. Three cases exist.

Case 1) Both and are LT-check nodes. By Theorem 1, this case is impossible.

Case 2) Both and are LDPC-check nodes. If , then such that and , which is impossible by condition 1.ii) of the row-merging transformation. On the other hand, if , then this case is impossible by an argument similar to that used in the proof of Case 2 in Theorem 2.

Case 3) is an LT check-node and is an LDPC check-node. That is connected to , in the 4-cycle implies that and . Also, that is connected to , implies that . Therefore, . Hence, in the matrix constructed by graph replication, bit-node , where (i.e., does not correspond to any of the first columns), is connected to two check nodes having different -coordinates but equal -coordinates, which is impossible by the graph-replication method.

REFERENCES

[1] M. Luby, "LT codes," in Proc. 43rd Annu. IEEE Symp. Foundations Comput. Sci., Vancouver, BC, Canada, Nov. 2002, pp. 271–280.

[2] A. Shokrollahi, "Raptor codes," IEEE Trans. Inf. Theory, vol. 52, no. 6, pp. 2551–2567, Jun. 2006.

[3] P. Elias, "Coding for two noisy channels," in Proc. 3rd London Symp. Inf. Theory, Sep. 1955, pp. 61–76.

[4] O. Etesami and A. Shokrollahi, "Raptor codes on binary memoryless symmetric channels," IEEE Trans. Inf. Theory, vol. 52, no. 5, pp. 2033–2051, May 2006.

[5] R. Palanki and J. S. Yedidia, "Rateless codes on noisy channels," in Proc. Int. Symp. Inf. Theory, Chicago, IL, Jul. 2004, p. 37.

[6] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.

[7] G. D. Forney, Concatenated Codes. Cambridge, MA: MIT Press, 1966.

[8] IEEE 802.16-2009, IEEE Standard for Local and Metropolitan Area Networks—Part 16: Air Interface for Broadband Wireless Access Systems, IEEE 802.16, May 2009 [Online]. Available: http://www.ieee802.org/16

[9] R. M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inf. Theory, vol. IT-27, pp. 533–547, Sep. 1981.

[10] M. M. Mansour and N. R. Shanbhag, "High-throughput LDPC decoders," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 6, pp. 976–996, Dec. 2003.

[11] X. Y. Hu et al., "Efficient implementations of the sum-product algorithm for decoding LDPC codes," in Proc. IEEE Global Commun. Conf., San Antonio, TX, Nov. 2001, pp. 1036–1036E.

[12] C. Howland and A. Blanksby, "Parallel decoding architectures for low density parity check codes," in Proc. IEEE Int. Symp. Circuits Syst., Sydney, Australia, May 2001, pp. 742–745.

[13] K. Gunnam et al., "VLSI architectures for layered decoding for irregular LDPC codes of WiMAX," in Proc. IEEE Int. Conf. Commun., Glasgow, Scotland, Jun. 2007, pp. 4542–4547.

[14] B. Xiang et al., "An area-efficient and low-power multirate decoder for quasi-cyclic low-density parity-check codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1447–1460, Oct. 2010.

[15] K. Zhang, X. Huang, and Z. Wang, "High-throughput layered decoder implementation for quasi-cyclic LDPC codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 6, pp. 985–994, Aug. 2009.

[16] N. Jiang et al., "High-throughput QC-LDPC decoders," IEEE Trans. Broadcast., vol. 55, no. 2, pp. 251–259, Jun. 2009.

[17] J.-B. Doré, M.-H. Hamon, and P. Pénard, "On flexible design and implementation of structured LDPC codes," in Proc. 18th Annu. IEEE Int. Symp. Personal, Indoor Mobile Radio Commun., Athens, Sep. 2007, pp. 1–5.

[18] R. Townsend and E. Weldon, "Self-orthogonal quasi-cyclic codes," IEEE Trans. Inf. Theory, vol. IT-13, no. 2, pp. 183–195, Apr. 1967.

[19] M. M. Mansour, "A turbo-decoding message-passing algorithm for sparse parity-check matrix codes," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4376–4392, Nov. 2006.

[20] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, pp. 284–287, Mar. 1974.

[21] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, May 1999.

[22] J. Chen et al., "Reduced-complexity decoding of LDPC codes," IEEE Trans. Commun., vol. 53, no. 8, pp. 1288–1299, Aug. 2005.

[23] M. M. Mansour and N. R. Shanbhag, "A 640-Mb/s 2048-bit programmable LDPC decoder chip," IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 684–698, Mar. 2006.

[24] C.-H. Liu et al., "An LDPC decoder chip based on self-routing network for IEEE 802.16e applications," IEEE J. Solid-State Circuits, vol. 43, pp. 684–694, Mar. 2008.

[25] S. Kim, G. E. Sobelman, and H. Lee, "A reduced-complexity architecture for LDPC layered decoding schemes," IEEE Trans. VLSI Syst., pp. 1–5, 2010, DOI: 10.1109/TVLSI.2010.2043965, preprint.

[26] M. Rovini, G. Gentile, F. Rossi, and L. Fanucci, "A minimum-latency block-serial architecture of a decoder for IEEE 802.11n LDPC codes," in Proc. IFIP Int. Conf. Very Large Scale Integr., Atlanta, GA, Oct. 2007, pp. 236–241.

[27] J. Chen and M. Fossorier, "Density evolution for two improved BP-based decoding algorithms of LDPC codes," IEEE Commun. Lett., vol. 6, no. 5, pp. 208–210, May 2002.

[28] Synopsys Design Compiler [Online]. Available: http://www.synopsys.com

[29] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, "CACTI 6.0: A tool to model large caches," in Proc. Int. Symp. Microarchitecture, Chicago, IL, Dec. 2007, pp. 3–14.



Hady Zeineddine received the B.E. degree from the American University of Beirut (AUB), Beirut, Lebanon, in 2006 and the M.S. degree from the University of Texas at Austin in 2009, both in electrical and computer engineering. He is currently working toward the Ph.D. degree at AUB.

His research interests are in the design of algorithms and architectures for efficient IC implementation of communication and digital signal processing applications.

Mohammad M. Mansour (S'98–M'03–SM'08) received the B.E. degree with distinction and the M.E. degree, both in computer and communications engineering, from the American University of Beirut (AUB), Beirut, Lebanon, in 1996 and 1998, respectively, and the M.S. degree in mathematics and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, IL, in 2002 and 2003, respectively.

In 1996, he was a Teaching Assistant at the same department. In 1997, he was a Research Assistant at the ECE Department at AUB. From 1998 to 2003, he was a Research Assistant at the Coordinated Science Laboratory (CSL) at UIUC. During summer 2000, he worked at National Semiconductor Corporation, San Francisco, CA, with the wireless research group. From December 2006 to August 2008, he was on research leave with QUALCOMM Flarion Technologies, Bridgewater, NJ, where he worked on modem design and implementation for 3GPP-LTE, 3GPP-UMB, and peer-to-peer wireless networking PHY-layer standards. He is currently an Associate Professor of electrical and computer engineering with the Electrical and Computer Engineering (ECE) Department at AUB, Beirut, Lebanon. His research interests are VLSI design and implementation for embedded signal processing and wireless communications systems, coding theory and its applications, digital signal processing systems, and general-purpose computing systems.

Prof. Mansour is a member of the Design and Implementation of Signal Processing Systems Technical Committee of the IEEE Signal Processing Society. He has been serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II (TCAS-II) since April 2008, Associate Editor for ISRN Applied Mathematics since December 2010, and Associate Editor for the IEEE TRANSACTIONS ON VLSI SYSTEMS since January 2011. He is the Technical Co-Chair of the IEEE SiPS 2011 workshop. He is the recipient of the Phi Kappa Phi Honor Society Award twice, in 2000 and 2001, and the recipient of the Hewlett Foundation Fellowship Award in March 2006. He joined the faculty at AUB in October 2003.

Ranjit Puri received the B.E. degree with distinction from the Electronics and Telecommunication Department, University of Mumbai, India, in 2007 and the M.S. degree in electrical engineering from the University of Texas at Austin in 2009.

He currently works as a Software Engineer in the Windows division at Microsoft Corporation, Redmond, WA. His current research interests include systems software for virtual machine management and embedded systems.

Mr. Puri was awarded the J. N. Tata and Dorabji Tata Trust Scholarship in 2007 and is a member of Phi Kappa Phi and Gamma Beta Phi.