column layered decoding.pdf

7/29/2019 column layered decoding.pdf

1/10

Published in IET Communications

Received on 7th November 2010

Revised on 7th June 2011

doi: 10.1049/iet-com.2010.1002

ISSN 1751-8628

Reduced-complexity column-layered decodingand implementation for LDPC codesZ. Cui1 Z. Wang2 X. Zhang3

1Qualcomm Inc., San Diego, CA 92121, USA2Broadcom Corp., Irvine, CA 92617, USA3Department of EECS, Case Western Reserve University, Cleveland, OH 44106-7071, USA

E-mail: [email protected]

Abstract: Layered decoding is well appreciated in low-density parity-check (LDPC) decoder implementation since it can achieveeffectively high decoding throughput with low computation complexity. This work, for the first time, addresses low-complexitycolumn-layered decoding schemes and very-large-scale integration (VLSI) architectures for multi-Gb /s applications. At first, themin-sum algorithm is incorporated into the column-layered decoding. Then algorithmic transformations and judiciousapproximations are explored to minimise the overall computation complexity. Compared to the original column-layereddecoding, the new approach can reduce the computation complexity in check node processing for high-rate LDPC codes byup to 90% while maintaining the fast convergence speed of layered decoding. Furthermore, a relaxed pipelining scheme is

presented to enable very high clock speed for VLSI implementation. Equipped with these new techniques, an efficientdecoder architecture for quasi-cyclic LDPC codes is developed and implemented with 0.13 mm VLSI implementationtechnology. It is shown that a decoding throughput of nearly 4 Gb /s at a maximum of 10 iterations can be achieved for a(4096, 3584) LDPC code. Hence, this work has facilitated practical applications of column-layered decoding and particularlymade it very attractive in high-speed, high-rate LDPC decoder implementation.

1 Introduction

Conventionally, low-density parity-check (LDPC) codes aredecoded using the sum-product algorithm (SPA) [1] or themodified min-sum algorithm (MSA) [2]. In general, the SPAhas the best decoding performance. The MSA is anapproximation of the SPA aimed to reduce computationcomplexity. It is widely employed in the LDPC decoderdesign [36]. Both algorithms are based on the two-phasemessage-passing (TPMP) scheme. In the literature, variationsof the SPA and MSA decoding approaches have beeninvestigated to reduce the interconnect complexity in VLSIimplementation [7, 8]. Recently, layered LDPC decodingschemes [912] have attracted much attention in bothacademy and industry because they can effectively speed upthe convergence of LDPC decoding and thus reduce therequired maximum number of decoding iterations. At

present, two kinds of layered decoding approaches, that is,row-layered decoding [911] and column-layered decoding[11, 12] have been proposed. In row-layered decoding, the

parity check matrix of the LDPC code is partitioned intomultiple row layers. The message updating is performed rowlayer by row layer. The column-layered decoding employsthe similar idea except that the parity check matrix is

partitioned into multiple column layers. The messagecomputation is performed column layer by column layer.

The implementation of an LDPC decoder with row-layereddecoding has been widely studied [1316]. The SPA-based

column-layered decoding approach [12] was proposed in2005. In the same year, Radosavljevic et al. [11] proposeda simplification of the SPA-based column-layered decodingapproach. It has been reported that column-layereddecoding has a similar convergence speed as row-layereddecoding. However, the column-layered decoding algorithmhas attracted much less attention because of its inherenthigh computation complexity. In this work, we investigateand explore the benefits of the column-layered decodingapproach. We first incorporate the MSA into the column-layered message passing scheme because it has much lowercomplexity than the SPA. In addition, by deeplyinvestigating the message updating process in the Min-Sum-

based column-layered decoding, we develop a simplifiedcolumn-layered decoding scheme, which maximallyeliminates the redundant computations in the originalscheme and significantly reduces the computationcomplexity further with judicious algorithmicapproximation. It is shown that up to 90% of computationsin check node processing can be saved for high-rate LDPCcodes.

For high-speed VLSI implementation, the proposed designhas significant advantages over the conventional row-layereddecoding. First, the column-layered decoding inherently has a

shorter critical path. For a row-layered decoder, the messagesassociated with multiple sub-blocks in a row layer can be

processed in one cycle to increase throughput. However, itwill increase the complexity of the check node unit (CNU)

IET Commun., 2011, Vol. 5, Iss. 15, pp. 2177 2186 2177

doi: 10.1049/iet-com.2010.1002 & The Institution of Engineering and Technology 2011

www.ietdl.org


2/10

and require serial concatenation of multiple comparison andselection stages in VLSI implementation. It has beenreported that significant hardware overhead is required tooptimise the corresponding circuitry for high clock speed[14]. In column-layered decoding, the major implementation complexity is associated with variable nodeunit (VNU), particularly when the messages correspondingto multiple sub-blocks in a column layer are processed in

parallel. As only addition operations are performed in aVNU, it is very convenient to employ arithmeticoptimisation to minimise the critical path. Second, in the

proposed design, the overall pipeline latency in decoding acode block is equal to the number of pipeline stages whilein row-layered decoding, the same amount of pipelinelatency is introduced for the message updating of everylayer. Moreover, the intrinsic message loading latency isminimised in the column-layered decoding because thedecoding can start as soon as the intrinsic messagescorresponding to one block column are available. Insummary, the proposed design is well suited for very-high-decoding throughput and low-power LDPC decoderimplementation.

2 Simplified column-layered decoding withmin-sum algorithm

The column-layered decoding scheme (also known as theshuffled iterative decoding) for LDPC codes was proposedin [12]. The decoding scheme is based on the SPAalgorithm. Similar to row-layered decoding [911], themaximum number of decoding iterations can besignificantly reduced for the same decoding performance.Since the MSA has much lower implementation complexitythan the SPA algorithm, it is widely utilised in hardware

implementation. In this paper, the MSA is incorporated intothe column-layered decoding for the first time. Then,algorithmic transformations and intelligent approximationsare explored to significantly reduce the computationcomplexity and memory requirement.

2.1 Min-sum algorithm-based column-layered

decoding

Let C be a binary (N, K) LDPC code specified by a parity-check matrix H with M rows and N columns. Each row ofthe parity check matrix is associated with a check node andeach column is associated with a variable node. Let

N(c) = {v:Hcv = 1} denote the set of variable nodes thatparticipate in check node c and M(v) = {c:Hcv = 1} denotethe set of check nodes associated to variable node v. Let Ivdenote the intrinsic message for variable node v and Rcvrepresent the check-to-variable message conveyed fromcheck node c to variable node v and Lcv represent thevariable-to-check message conveyed from variable node vto check node c. Assume that the N bits of a code-word are

divided into G groups of the same size, N0, N1, . . . ,NG1.Accordingly, the parity-check matrix is divided into G

block columns. The proposed column-layered decodingbased on the MSA is described with the pseudo-code inFig. 1.

The optimum value of the scaling factora is about 0.8 [2].For the convenience of VLSI implementation, it is set as 0.75in this work.

Assume that all variable nodes are divided into four groups(i.e., G 4). The computation flow of the column-layereddecoding for one iteration is illustrated in Fig. 2a. Theshaded sub-matrices indicate coverage of computation indecoding each layer, where the computation in (1) must be

Fig. 1 Min-sum-based column-layered decoding algorithm

2178 IET Commun., 2011, Vol. 5, Iss. 15, pp. 21772186

& The Institution of Engineering and Technology 2011 doi: 10.1049/iet-com.2010.1002

www.ietdl.org


3/10

carried out for all block columns except for the currentupdating block column. Hence, the computation complexityof the original column-layered decoding scheme periteration is many times more than that of the conventionalTPMP scheme whether the MSA or SPA is used.

2.2 Low-complexity decoding scheme

2.2.1 Algorithm reformulation:With a close study of the

computation flow of the column-layered decoding, it can beobserved that a significant amount of redundantcomputation is performed in the original column-layereddecoding algorithm. To improve the computation efficiency,the computation in (1) for consecutive column layers can beincrementally performed. In LDPC decoding, each variablenode sends a variable-to-check message to everyneighbouring check node. Assume that a check node c hasdc variable node neighbours and hence receives dc softmessages from its neighbouring variable nodes. To clarifythe main procedure of the reformulated column-layereddecoding, we assume that each check node has only onevariable node neighbour in each block column of the paritycheck matrix. It should be noted that this constraint is not

indispensable for column-layered decoding in general.Let m(g)c = [m1 m2 . . . mdc ] be a sorted vector of themagnitudes of the soft messages received by check node cin ascending order. The superscript g indicates that the softmessage was generated when the gth block column is

processed. Similarly, let S(g)c be Pn[N(c)sign(Lcn) when thedecoding for the layer g is completed. To reduce thecomputation complexity of min-sum-based column-layereddecoding, m

(g)c can be computed from m

(g1)c in three steps.

1. For each variable node v in Ng, remove the old /Lcv/ fromm

(g1)c to obtain a temporary sorted vector m

(g)c . In addition,

remove the old sign(Lcv) from S(g1)c and obtain a temporary

sign-product

S

(g)

c . Since the smallest value, m1, inm

(g)

c isminn[N(c)/v(/Lcn/) and S(g)c is Pn[N(c)\vsign(Lcn), the value of

Rcv can be computed as S(g1)

c m1. Send the Rcv messageto the variable node v.

2. Perform variable-to-check message computations for allvariable nodes belonging to Ng. The new Lcv messages aresent back to corresponding check nodes.3. For each check node, insert the updated/Lcv/ into m

(g)c in a

sorted order to obtain m(g)c .

The reformulated column-layered decoding procedure

is summarised as in Fig. 3, m(0)c and S

(0)

c in a decoding

iteration are computed from m(G1)

c and S(G1)c , respectively,

which are obtained in the previous iteration. The new

computation flow for one full iteration is depicted in Fig. 2b.

2.2.2 Simplification of min-sum-based column-layered decoding: The algorithm reformulation presentedabove removes a large amount of redundant computation inthe original column-layered decoding scheme and thussignificantly reduces the overall computation complexity.However, because every vector m

(g)c contains dc values, the

reformulated algorithm still requires a considerable amountof memory to store check-to-variable messages. For row c,only the two smallest values in the sorted vector m

(g)c are

directly involved in the message updating Step-A fromlayer g2 1 to layer g. For example, if the smallest value inm

(g1)c is from layer g in the previous iteration, it is

removed from the vectorm(g1)c in horizontal Step A. Then

the second-smallest value is used as the magnitude ofvariable-to-check message. In horizontal Step B, m

(g)c is

computed with the new value, /Lcv/, from variable node andthe pre-sorted vector m(g)c from horizontal Step A. The newvalue, /Lcv/, could take any index in m

(g)c . If it takes a very

small index, it has more chance to be used in Step A offurther computation. Otherwise, it has much less chance to

be one of the two smallest values in the remainingcomputation. In other words, though m

(g)c contains dc

values, most of the values in the end of the sorted vectorare less likely to be used as reliability information for

message updating. Thus, it is reasonable to reduce thelength of the vector m(g)c to further reduce theimplementation complexity. Our simulations show that ifthe lengths of the vectors m

(g)c and m

(g)c are set to be 3, the

Fig. 2 Computation flow in an iteration of column-layered decoding

a Original algorithmb After algorithm reformulation



www.ietdl.org


4/10

decoding convergence speed and performance have almost nodegradation compared to the standard Min-Sum basedcolumn-layered decoding. In such a case, we name thescheme as three-min column-layered decoding. Similarly,

the lengths of the vectors m(g)c and m

(g)c can be set as 2 or

even 1. Correspondingly, more penalties in convergencespeed and performance are expected.

In this approximation, m(g)c contains three values in most

cases. It may contain two valid values if one value isremoved from the vectorm

(g1)c in horizontal Step A. In the

computation of horizontal Step-B, at most three comparisonoperations are required to sort the new value /Lcv/ fromvariable node and the sorted values in m(g)c . If m

(g)c contains

three valid values, the third comparison is needed todetermine whether the /Lcv/ is the third smallest value or itshould be discarded. Table 1 shows the average number ofm

(g)c updating events in one iteration for the decoding of a

(4096, 3584) (4, 32) QC-LDPC code at the SNR of 4.1 dB.In one iteration, the total number of m(g)c updating events

for a check node is 32. A message updating event can becategorised into one of three types. Type I, a value isremoved from m

(g1)c and then a new value is inserted into

m(g)c . Type II, no value is removed from m

(g1)c andm

(g)c but

is updated. Type III, no value is removed from m(g1)c and

/Lcv/ is discarded. It can be seen from Table 1 that if novalue is removed from m

(g1)c , /Lcv/ is much more likely to

be discarded than being the third-smallest value in m(g)c . To

further reduce the decoding complexity, the thirdcomparison in horizontal Step B is eliminated. In themodification, /Lcv/ is discarded if no value is removedfrom m(g

1)c and /Lcv/ is larger than the second value in

m(g)c . This results in the simplified three-min column-

layered decoding. In this case, a check node requires threeequal-comparisons to remove the old /Lcv/ from vectorm

(g1)c in horizontal Step A and requires two regular

comparisons to update m(g)c vector with the new /Lcv/ in

horizontal Step B.The major difference between the three-min decoding and

the simplified three-min decoding is that the three-mindecoding requires three comparisons for sorting inhorizontal Step B. The simplified three-min requires twocomparisons for sorting in horizontal Step B. Oursimulation shows that the additional approximationintroduced by the simplified three-min decoding onlycauses very small performance loss. Table 1 shows that asorted vector mc only gets updated about 4 times in aniteration of the three-min decoding. On the contrary, forTPMP or row-layered decoding, the sorting computation tofind the smallest and the second smallest magnitudes for acheck node re-starts in every iteration. For the same LDPC

code, the average number of updating activities for a sortedvector is more than 16 per iteration. It leads to significant

power savings for the three-min column-layered decodingin check node message updating.

Fig. 3 Reformulated column-layered decoding

Table 1 Average number ofmc(g) updating events in one

iteration

In the third

iteration

In the sixth

iteration

m(g)c =m

(g1)c 2.804 2.783

m(g)c =m

(g1)c , /Lcv/ being the

smallest value in m(g)c

0.036 0.050

m(g)c =m


second-smallest value in m(g)c

0.140 0.170

m(g)c =m


third-smallest value in m(g)c

0.880 1.005

m(g)c = m

(g1)c =m

(g1)c , /Lcv/

being discarded

28.140 27.992



www.ietdl.org


5/10

2.2.3 Pipelining of column-layered decoding:Pipelining is a common practice in VLSI implementation toincrease clock speed and thus to speed up data processingthroughput. In general, pipelining can only be applied tofeed-forward data paths to maintain the original function ofVLSI circuitry. In LDPC column-layered decodingalgorithms, data dependency exists between consecutivelayers. Thus, the VLSI circuitry for check node, variable

node and message memories forms a logic loop andpipelining cannot be directly applied to increase theeffective clock speed of LDPC decoders. In this section, arelaxed pipelining scheme of column-layered LDPCdecoding is proposed.

In the original column-layered decoding, the changebetween m(g

1)c andm

(g1+P)c is very small if the value of P

is not large. Thus, m(g1)

c can be used as the estimation ofm

(g1+P)c . An approximation of R

(g+P)cv can be calculated

from m(g1)

c before m(g1+P)c are obtained. Then, it is

immediately used for computing L(g+P)

cv . The approximationallows P clock cycles to complete the message computationfor each layer. When |L(g

+P)cv | is obtained, the m

(g1+P)c is

already available. Thus, the horizontal step for layer g+Pcan be undertaken with m(g1+P)c and |L

(g+P)cv |. The updating

method of m(g+P)

c is the same as before. The pipelinedcolumn-layered decoding algorithm is formulated as thefollowing (see Fig. 4).

2.2.4 Impact to the complexity of VLSI implementation: For the standard min-sum basedcolumn-layered decoding, a check node requires dc2 2regular comparators to compute the check-to-variablemessage for a new column layer. To execute (1) and (2),

the variable-to-check messages corresponding to the wholeH matrix must be stored. For the simplified three-mincolumn-layered decoding algorithm, a check node onlyrequires two regular comparators. The magnitudes ofvariable-to-check messages can be used on-the-fly and arenot stored. Only the sign bits of variable-to-check messagesshould be stored. For each check node, three magnitudes,three indices and one sign bit need to be saved. For a

(N, N2

M) (4, 32) structured LDPC code, approximately,(302 2)/30 93% of computation can be reduced.Assuming that each message is quantised as 4 bits, thenumber of memory bits needed by a check node is3 (3+ 5)+ 1 25. The percentage of memory bits thatcan be saved for extrinsic messages is[(32 42 322 25) M]/(32 4 M) 55.4%.

As the proposed column-layered decoding approaches donot save the magnitude of variable-to-check messages, theyare inherently memory efficient. Equipped with the

proposed decoding techniques, an efficient column-layereddecoder architecture for quasi-cyclic LDPC codes isdeveloped with architectural and arithmetic optimisations.The details are provided in Section 4.

3 Performance simulation

To evaluate the decoding performance and convergencespeed of the proposed algorithms, three LDPC codes aresimulated. Codes I and II are rate 1/2, (2304, 1152) LDPCcode and rate 5/6, (2304, 1920) LDPC code, respectively,adopted in WiMax (802.16 e) standard. Code III is a (4096,3584) (4, 32)-regular QC-LDPC code, which is constructedusing the progressive edge growth method (PEG) [17, 18].

Fig. 4 Pipelined column-layered decoding



www.ietdl.org


6/10

The code is also used to illustrate the VLSI architecturedesign in Section 4. In each simulation, at least 50 frameerrors are observed.

Fig. 5 shows the frame error rates in decoding the WiMaxrate-1/2 (2304, 1152) LDPC code with four different-layereddecoding algorithms, that is, (i) the row-layered decoding,(ii) the original column-layered decoding, (iii) the three-mincolumn-layered decoding and (iv) the simplified three-min

column-layered decoding. The maximum number ofdecoding iterations is set as 10 and 20, respectively, for alldecoding approaches. It can be observed that all the fourdecoding algorithms have almost the same decoding

performance with maximum of 20 iterations. With themaximum of 10 decoding iterations, only subtle performancedifference can be observed. Fig. 6 shows the frame errorrates in decoding the WiMax rate-5/6 (2304, 1920) LDPCcode with maximum of 10 and 20 iterations, respectively. Itshows again that the performance differences among various-layered decoding algorithms are negligible.

Fig. 7 shows the frame error rate (FER) in decoding the(4096, 3584) (4, 32) QC-LDPC code. The number ofcolumns in each column layer is 128. The maximum

number of iteration is set as 10. Compared to the standardmin-sum-based column-layered decoding algorithm, thethree-min column-layered decoding algorithm has almost no

performance loss. The simplified three-min approach has nomore than 0.02 dB performance loss. The performance of

pipelined three-min decoding with one-stage pipeline(P 1) has very minimal difference compared to that ofthe standard min-sum-based column-layered decoding. With

two-stage pipeline (P 2), the pipelined three-minapproach has about 0.02 dB performance loss.

4 Proposed column-layered decoderarchitectures for QC-LDPC codes

In this section, an optimised QC-LDPC decoder architecturefor a (4096, 3584) (4, 32)-regular QC-LDPC code using the

pipelined three-min column-layered decoding scheme ispresented. The architecture efficiently enables very highdecoding parallelism for layered decoding while havingvery short critical path. In one clock cycle, the messagescorresponding to four circulant matrices are processed in

parallel. Two-stage pipelining is employed to improve theclock frequency. In order to further reduce the critical pathdelay, we rearrange the additions in VNU to maximallytake advantage of carry-save addition. With a simplemodification on CNU, the simplified three-min decodingscheme can be realised for even less hardware requirementat the sacrifice of small performance loss.

4.1 Column-layered decoder architecture forQC-LDPC codes

The parity check matrix of the QC-LDPC code to beconsidered in this work consists of an array of

4 32 cyclically shifted permutation matrices of dimension128 128. The decoding steps discussed in Section 2.2 are

performed for one block column at a time. That means allextrinsic messages corresponding to a block column of theH matrix are processed in parallel in one clock cycle. Thearchitecture of pipelined decoder with two pipeline stages isshown in Fig. 8. Each M component in an M-register arrayon the left side of Fig. 8 represents a 25-bit register forstoring a sorted vector associated with a row of the Hmatrix. The Get Rcv component performs (8) to get check-to-variable messages for layer g+P, where P 2.The CNUa component (the portion of CNU for horizontalStep-A) performs (9) to remove the old variable-to-checkmessage associated to layer g from the sorted vector. TheCNUb component (the portion of CNU for horizontal StepB) is utilised to perform (10) and (11). The total numbersof M-register, Get Rcv, CNUa and CNUb are all 512. Thenumber of VNUs in the VNU array is 128. A VNU isrequired to perform the vertical step (2) and (3) for avariable node. The barrel shifters in the left side of theVNU array align the check-to-variable message from roworder to column order. Similarly, the barrel shifters in theright side of VNU array align the variable-to-check messagefrom column order to row order. It can be observed fromFig. 8 that two types of loops are formed in column-layereddecoding. The first-type of loop consists of CNUa andCNUb components. The second-type of loop is composed

of a Get Rcv, a CNUb, a VNU and two barrel shifters. Withthe pipelined column-layered decoding approach, thecritical path in the second-type loop is drastically reducedwhen applying two-stage pipelining.

Fig. 6 Frame error rate of various layered decoding approaches

in decoding WiMax rate-5/6 (2304, 1920) LDPC code

Fig. 5 Frame error rate of various layered decoding approaches

in decoding WiMax rate-1/2 (2304, 1152) LDPC code



www.ietdl.org


7/10

The Get Rcv component is needed because its outputs arethe check-to-variable messages for layer g+P. On thecontrary, the outputs of CNUa are intermediate values forlayer g. For non-pipelined decoding, the Get Rcv componentcan be eliminated because the output of CNUa contains thecheck-to-variable message for layer g. We only need onestage of barrel shifter at the input of CNUa array tominimise critical path. The connections among the array ofCNUa, CNUb andVNUhave no change during the decoding.

4.2 Architecture of CNU

Fig. 9 shows the structure of the CNU for the three-mindecoding scheme. The sub-matrices in a block row of the H

matrix are processed one at a time. The CNU is composedof two concatenated stages. The first stage, CNUa,computes the m

(g)c and the second stage, CNUb, generates

m(g)c . The data in Fig. 9 are associated with the vectors in

the horizontal decoding step as follows:

m(g1)c = [min1 old, min2 old, min3 old]

m(g)c = [min1 temp, min2 temp, min3 temp]

m(g)c = [min1 new, min2 new, min3 new]

I(g1)c = [idx1 old, idx2 old, idx3 old]

I(g)

c = [idx1 temp, idx2 temp, idx3 tmp]

I(g)c = [idx1 new, idx2 new, idx3 new]

In each M-register for a sorted vector, three smallestmagnitudes and their indices are stored. In the decoding ofa column layer, if the column index is not in the vector, themagnitudes and indices in the vector are directly passedthrough the select-logic-A to the second stage. Otherwise,min1_temp and min2_temp get values from the tworemaining smallest magnitudes. The value of min3_temp

becomes void. The temporary index values are determinedin the same way. It is clear that min1_temp is themagnitude of the Rcv message for the column layer in non-

pipelined decoding. After the new Lcv value is sent backfrom VNU, it is used to compute the new value for thesorted vector. The structure of a Get Rcv component isshown in Fig. 9. This component is not required fornon-pipelined decoding.

For the proposed simplified three-min decoding approach,the adder for the computation of abs(Lcv)2min3_temp is not

needed. Instead, a simple three-input OR gate can be used todisable M-register update in the condition shown in row 5 of

Fig. 7 Frame error rate of various decoding approaches in

decoding a (4096, 3584) (4, 32) QC-LDPC code

Fig. 8 Top-level block diagram of pipelined column-layered decoding



www.ietdl.org


8/10

Table 1. The three inputs for the OR gate are the outputs ofthe three comparators in CNUa. It is shown in Fig. 6 thatthe removal of the adder results in less than 0.02 dB

performance loss.

4.3 Architecture of optimised VNU

Shown in Fig. 10 is the structure of an optimised VNU that can

simultaneously process 4 check-to-variable messages. Theaddition operations are rearranged such that the advantage ofcarry-save adder can be maximally taken. In the beginning ofa VNU, the check-to-variable messages in signed-magnitude

format are converted to 2s complement representation. Inthe data conversion for each signed-magnitude number, thesign bit is not immediately added to the bitwise-not of lower

bits. Instead, all sign bits are added through the adder array.Then, each two-bit sign-sum is sent to an adder in thesecond addition stage for final summation. As each adder inthe second and third stages has three inputs, it can beimplemented using a carry-save adder and a regular binary

adder. The right-shift operations 1 and 2 are used forperforming the scalar multiplication of 0.75. The above-mentioned arithmetic optimisations aim to significantlyreduce the logic delay in a VNU.

To illustrate the data flow of the optimised VNU, let us takethe computation of L3v as an example. Assume that R4v is anegative number and other inputs are positive numbers. Thevalue before the shift of the scaling operation of L3v is (/R1v/+/R2v/+bit_inverse(/R4v/)+ 0+ 0+ 1). The 0+ 0+ 1 computation is performed by a 1-bit full adder in theadder array block associated with L3v message. After the finalshift and addition stage, the output of L3v is 0.75 (R1v+

R2v+R4v)+Iv.

Fig. 9 FER in decoding the (4096, 3584) (4, 32) QC-LDPC code

a CNU architecture for the three-min column-layered decodingb Structure of the Get Rcv component

Fig. 10 Architecture of the optimised VNU Table 2 Hardware resource for (4096, 3584) (4, 32) QC-LDPC

decoders

VNU 128

CNU 512

Get_Rcv 512

intrinsic message, bits 4 4096 16 384

sorted vector, bits 25 512 12 800

signs of variable-to-check messages 4 4096 16 384

area per VNU, mm2 5597.8

area per CNU, mm2 2100.9

area per Get_Rcv, mm2 152.1

clock frequency, MHz 388

synthesis area, mm2 6.755

information decoding throughput, Gb/s 3.928



www.ietdl.org


9/10

5 Hardware requirement and decodingthroughput

The pipelined column-layered decoder with two pipelinestages is modelled using Verilog RTL and synthesisedusing Fujitsu 0.13 um standard library. The requiredhardware resource and the synthesis result are summarisedin Table 2. Each intrinsic message is quantised as 4 bits. Ittakes 32 clock cycles to compute the initial sorted vectorsand 32 10 320 clock cycles for ten decoding iterations.The number of clock cycles for pipeline latency is 2. Let

fclk denote the clock frequency of the decoder, theinformation (source data) decoding throughput is

fclk (40962 512)/(32+ 320+ 2). Thus, the decoder

achieves a decoding throughput of 3.928 Gb/s at a clockspeed of 388 MHz.

Table 3 compares the proposed column-layered decoderwith the state-of-the-art row-layered decoders. To mitigatethe discrepancy introduced from different implementationtechnologies, the area and clock speed of all these designsare scaled to 65 nm. For the implementation technology of0.18 mm, 0.13 mm and 90 nm, the area scaling down factoris set as 8, 4 and 2, respectively. The corresponding clockfrequency is scaled up by 1.253, 1.252 and 1.25,respectively. The maximum decoding iteration is set as 10for all decoders with layered decoding algorithms. For areaof synthesis result, a scaling factor of 1/0.7 is applied toapproximate the layout area.

Considering the normalised decoding throughput,throughput/area ratio and decoding performance amongvarious designs, it can be concluded that the proposedsimplified column-layered decoding algorithm andarchitecture have significant advantages in high-throughputLDPC decoder implementation.

6 Conclusion

In this paper, various techniques have been explored to reducethe computation complexity of the column-layered decoding.As a result, the proposed method can drastically reduce theoverall computation complexity of the original scheme

while largely maintaining decoding performance andconvergence speed. In addition, a relaxed pipelining schemehas been shown to break the data dependency betweenadjacent column layers, and thus enhance the clock speed.

Combining all the proposed techniques, a low-complexity,high-speed LDPC decoder architecture for generic QC-LDPC codes has been developed and the implementationresult for a specific example has demonstrated that the

proposed column-layered decoder architecture is verycompetitive to state-of-the-art row-layered LDPC decoderdesigns.

7 References

1 Gallager, R.G.: Low-density parity-check codes, IRE Trans. Inf.Theory, 1962, IT-8, pp. 2128

2 Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M.P.C., Hu, X.-Y.:Reduced-complexity decoding of LDPC codes, IEEE Trans.Commun., 2005, 53, (8), pp. 12881299

3 Wang, Z., Cui, Z.: A memory efficient partially parallel decoderarchitecture for QC-LDPC codes. Proc. 39th Asilomar Conf. Signals,Systems & Computers, 2005, pp. 729733

4 Lin, C., Lin, K., Chan, H., Lee, C.: A 3.33 Gb/s (1200, 720) low-density parity check code decoder. Proc. 31st European Solid-StateCircuits Conf., September 2005, pp. 211214

5 Oh, D., Parhi, K.K.: Nonuniformly quantized Min-Sum decoderarchitecture for low-density parity-check codes. Proc. 18th ACMGreat Lakes symp. on VLSI, 2008, pp. 451456

6 Shih, X., Zhan, C., Lin, C., Wu, A.: An 8.29 mm2 52 mW multi-modeLDPC decoder design for mobile WiMAX system in 0.13 mm CMOSprocess, IEEE J. Solid-State Circuits, 2008, 43, (3), pp. 672683

7 Darabiha, A., Carusone, A.C., Kschischang, F.R.: Block-interlacedLDPC decoders with reduced interconnect complexity, IEEE Trans.Circuits and Systems II, 2008, 55, (1), pp. 7478

8 Mohsenin, T., Truong, D., Baas, B.: Multi-split-row threshold decodingimplementations for LDPC codes. 2009 IEEE Int. Symp. on Circuitsand Systems, 2009, pp. 24492452

9 Sharon, E., Litsyn, S., Goldberger, J.: An efficient message-passingschedule for LDPC decoding. Proc. 23rd IEEE Convention ofElectrical and Electronics Engineers Israel, September 2004,pp. 223 226

10 Hocevar, D.E.: A reduced complexity decoder architecture via layereddecoding of LDPC codes. IEEE Workshop Signal Processing Systems,2004, pp. 107112

11 Radosavljevic, P., Baynast, A., Cavallaro, J.R.: Optimized messagepassing schedules for LDPC decoding. Proc. 39th Asilomar Conf.Signals, Systems and Computers, 2005, pp. 591595

12 Zhang, J., Fossorier, M.P.C.: Shuffled iterative decoding, IEEE Trans.Commun., 2005, 53, (2), pp. 209213

13 Brack, T., Alles, M., Lehnigk-Emden, T., et al.: Low complexity LDPC

code decoders for next generation standards. Proc. DesignationAutomation Test in Europe, April 2007, pp. 16

14 Sun, Y., Cavallaro, J.R.: A low-power 1-Gbps reconfigurable LDPCdecoder design for multiple 4 G wireless standards. 2008 IEEE Int.SOC Conf., 2008, pp. 367370

Table 3 Design comparisons with recent high-speed LDPC decoders

This work [13] [16] [14] [15]

code 4096, 3584 9600, 7200 802.11 n 802.16 e, 802.11 n 2048-bits

rate 7/8 3/4 1/25/6 1/25/6 1/27/8

algorithm column-layered

min-sum

row-layered

min-sum

row-layered

min-sum

row-layered BP row-layered BCJR

LLR message quantisation 4-bits 6-bits 5-bit 4-bits

max. iteration 10 10 5 10 10max decoding parallelism 512 80 81 2 96

technology 0.13 mm 65 nm 0.18 mm 90 nm 0.18 mm

clock frequency, MHz 388 500 208 450 125

area, mm2 6.755 (synthesis) 0.504 (synthesis) 3.39 3.5 14.3

max info. throughput, Gb/s 3.928 1.08 0.78 1.0 0.64

clock frequency (scaled to 65 nm) 606 500 406 562 244

layout area, scaled to 65 nm 2.41 0.72 0.42 1.75 1.79

normalised info. throughput,

scaled to 65 nm

6.13 1.08 0.76

(10 iterations)

1.25 1.25



www.ietdl.org


10/10

15 Mansour, M.M., Shanbhag, N.R.: A 640-Mb/s 2048-bit programmableLDPC decoder chip, IEEE J. Solid-State Circuits, 2006, 41, (3),pp. 684 698

16 Studer, C., Preyss, N., Roth, C., Burg, A.: Configurable high-throughput decoder architecture for quasi-cyclic LDPC codes. 42ndAsilomar Conf. Signals, Systems and Computers, 2008, pp. 11371142

17 Hu, X.-Y., Eleftheriou, E., Arnold, D.M.: Regular and irregularprogressive edge-growth tanner graphs, IEEE Trans. Inf. Theory,2005, 51, (1), pp. 386398

18 Li, Z., Kumar, B.V.K.V.: A class of good quasi-cyclic low-density paritycheck codes based on progressive edge growth graph. 38th AsilomarConf. Signals, Systems and Computers, 2004, vol. 2, pp. 19901994



www.ietdl.org

column layered decoding.pdf

Documents