column layered decoding.pdf

Upload: lakkepogu

Post on 14-Apr-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/29/2019 column layered decoding.pdf

    1/10

    Published in IET Communications

    Received on 7th November 2010

    Revised on 7th June 2011

    doi: 10.1049/iet-com.2010.1002

    ISSN 1751-8628

    Reduced-complexity column-layered decodingand implementation for LDPC codesZ. Cui1 Z. Wang2 X. Zhang3

    1Qualcomm Inc., San Diego, CA 92121, USA2Broadcom Corp., Irvine, CA 92617, USA3Department of EECS, Case Western Reserve University, Cleveland, OH 44106-7071, USA

    E-mail: [email protected]

    Abstract: Layered decoding is well appreciated in low-density parity-check (LDPC) decoder implementation since it can achieveeffectively high decoding throughput with low computation complexity. This work, for the first time, addresses low-complexitycolumn-layered decoding schemes and very-large-scale integration (VLSI) architectures for multi-Gb /s applications. At first, themin-sum algorithm is incorporated into the column-layered decoding. Then algorithmic transformations and judiciousapproximations are explored to minimise the overall computation complexity. Compared to the original column-layereddecoding, the new approach can reduce the computation complexity in check node processing for high-rate LDPC codes byup to 90% while maintaining the fast convergence speed of layered decoding. Furthermore, a relaxed pipelining scheme is

    presented to enable very high clock speed for VLSI implementation. Equipped with these new techniques, an efficientdecoder architecture for quasi-cyclic LDPC codes is developed and implemented with 0.13 mm VLSI implementationtechnology. It is shown that a decoding throughput of nearly 4 Gb /s at a maximum of 10 iterations can be achieved for a(4096, 3584) LDPC code. Hence, this work has facilitated practical applications of column-layered decoding and particularlymade it very attractive in high-speed, high-rate LDPC decoder implementation.

    1 Introduction

    Conventionally, low-density parity-check (LDPC) codes aredecoded using the sum-product algorithm (SPA) [1] or themodified min-sum algorithm (MSA) [2]. In general, the SPAhas the best decoding performance. The MSA is anapproximation of the SPA aimed to reduce computationcomplexity. It is widely employed in the LDPC decoderdesign [36]. Both algorithms are based on the two-phasemessage-passing (TPMP) scheme. In the literature, variationsof the SPA and MSA decoding approaches have beeninvestigated to reduce the interconnect complexity in VLSIimplementation [7, 8]. Recently, layered LDPC decodingschemes [912] have attracted much attention in bothacademy and industry because they can effectively speed upthe convergence of LDPC decoding and thus reduce therequired maximum number of decoding iterations. At

    present, two kinds of layered decoding approaches, that is,row-layered decoding [911] and column-layered decoding[11, 12] have been proposed. In row-layered decoding, the

    parity check matrix of the LDPC code is partitioned intomultiple row layers. The message updating is performed rowlayer by row layer. The column-layered decoding employsthe similar idea except that the parity check matrix is

    partitioned into multiple column layers. The messagecomputation is performed column layer by column layer.

    The implementation of an LDPC decoder with row-layereddecoding has been widely studied [1316]. The SPA-based

    column-layered decoding approach [12] was proposed in2005. In the same year, Radosavljevic et al. [11] proposeda simplification of the SPA-based column-layered decodingapproach. It has been reported that column-layereddecoding has a similar convergence speed as row-layereddecoding. However, the column-layered decoding algorithmhas attracted much less attention because of its inherenthigh computation complexity. In this work, we investigateand explore the benefits of the column-layered decodingapproach. We first incorporate the MSA into the column-layered message passing scheme because it has much lowercomplexity than the SPA. In addition, by deeplyinvestigating the message updating process in the Min-Sum-

    based column-layered decoding, we develop a simplifiedcolumn-layered decoding scheme, which maximallyeliminates the redundant computations in the originalscheme and significantly reduces the computationcomplexity further with judicious algorithmicapproximation. It is shown that up to 90% of computationsin check node processing can be saved for high-rate LDPCcodes.

    For high-speed VLSI implementation, the proposed designhas significant advantages over the conventional row-layereddecoding. First, the column-layered decoding inherently has a

    shorter critical path. For a row-layered decoder, the messagesassociated with multiple sub-blocks in a row layer can be

    processed in one cycle to increase throughput. However, itwill increase the complexity of the check node unit (CNU)

    IET Commun., 2011, Vol. 5, Iss. 15, pp. 2177 2186 2177

    doi: 10.1049/iet-com.2010.1002 & The Institution of Engineering and Technology 2011

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    2/10

    and require serial concatenation of multiple comparison andselection stages in VLSI implementation. It has beenreported that significant hardware overhead is required tooptimise the corresponding circuitry for high clock speed[14]. In column-layered decoding, the major implementation complexity is associated with variable nodeunit (VNU), particularly when the messages correspondingto multiple sub-blocks in a column layer are processed in

    parallel. As only addition operations are performed in aVNU, it is very convenient to employ arithmeticoptimisation to minimise the critical path. Second, in the

    proposed design, the overall pipeline latency in decoding acode block is equal to the number of pipeline stages whilein row-layered decoding, the same amount of pipelinelatency is introduced for the message updating of everylayer. Moreover, the intrinsic message loading latency isminimised in the column-layered decoding because thedecoding can start as soon as the intrinsic messagescorresponding to one block column are available. Insummary, the proposed design is well suited for very-high-decoding throughput and low-power LDPC decoderimplementation.

    2 Simplified column-layered decoding withmin-sum algorithm

    The column-layered decoding scheme (also known as theshuffled iterative decoding) for LDPC codes was proposedin [12]. The decoding scheme is based on the SPAalgorithm. Similar to row-layered decoding [911], themaximum number of decoding iterations can besignificantly reduced for the same decoding performance.Since the MSA has much lower implementation complexitythan the SPA algorithm, it is widely utilised in hardware

    implementation. In this paper, the MSA is incorporated intothe column-layered decoding for the first time. Then,algorithmic transformations and intelligent approximationsare explored to significantly reduce the computationcomplexity and memory requirement.

    2.1 Min-sum algorithm-based column-layered

    decoding

    Let C be a binary (N, K) LDPC code specified by a parity-check matrix H with M rows and N columns. Each row ofthe parity check matrix is associated with a check node andeach column is associated with a variable node. Let

    N(c) = {v:Hcv = 1} denote the set of variable nodes thatparticipate in check node c and M(v) = {c:Hcv = 1} denotethe set of check nodes associated to variable node v. Let Ivdenote the intrinsic message for variable node v and Rcvrepresent the check-to-variable message conveyed fromcheck node c to variable node v and Lcv represent thevariable-to-check message conveyed from variable node vto check node c. Assume that the N bits of a code-word are

    divided into G groups of the same size, N0, N1, . . . ,NG1.Accordingly, the parity-check matrix is divided into G

    block columns. The proposed column-layered decodingbased on the MSA is described with the pseudo-code inFig. 1.

    The optimum value of the scaling factora is about 0.8 [2].For the convenience of VLSI implementation, it is set as 0.75in this work.

    Assume that all variable nodes are divided into four groups(i.e., G 4). The computation flow of the column-layereddecoding for one iteration is illustrated in Fig. 2a. Theshaded sub-matrices indicate coverage of computation indecoding each layer, where the computation in (1) must be

    Fig. 1 Min-sum-based column-layered decoding algorithm

    2178 IET Commun., 2011, Vol. 5, Iss. 15, pp. 21772186

    & The Institution of Engineering and Technology 2011 doi: 10.1049/iet-com.2010.1002

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    3/10

    carried out for all block columns except for the currentupdating block column. Hence, the computation complexityof the original column-layered decoding scheme periteration is many times more than that of the conventionalTPMP scheme whether the MSA or SPA is used.

    2.2 Low-complexity decoding scheme

    2.2.1 Algorithm reformulation:With a close study of the

    computation flow of the column-layered decoding, it can beobserved that a significant amount of redundantcomputation is performed in the original column-layereddecoding algorithm. To improve the computation efficiency,the computation in (1) for consecutive column layers can beincrementally performed. In LDPC decoding, each variablenode sends a variable-to-check message to everyneighbouring check node. Assume that a check node c hasdc variable node neighbours and hence receives dc softmessages from its neighbouring variable nodes. To clarifythe main procedure of the reformulated column-layereddecoding, we assume that each check node has only onevariable node neighbour in each block column of the paritycheck matrix. It should be noted that this constraint is not

    indispensable for column-layered decoding in general.Let m(g)c = [m1 m2 . . . mdc ] be a sorted vector of themagnitudes of the soft messages received by check node cin ascending order. The superscript g indicates that the softmessage was generated when the gth block column is

    processed. Similarly, let S(g)c be Pn[N(c)sign(Lcn) when thedecoding for the layer g is completed. To reduce thecomputation complexity of min-sum-based column-layereddecoding, m

    (g)c can be computed from m

    (g1)c in three steps.

    1. For each variable node v in Ng, remove the old /Lcv/ fromm

    (g1)c to obtain a temporary sorted vector m

    (g)c . In addition,

    remove the old sign(Lcv) from S(g1)c and obtain a temporary

    sign-product

    S

    (g)

    c . Since the smallest value, m1, inm

    (g)

    c isminn[N(c)/v(/Lcn/) and S(g)c is Pn[N(c)\vsign(Lcn), the value of

    Rcv can be computed as S(g1)

    c m1. Send the Rcv messageto the variable node v.

    2. Perform variable-to-check message computations for allvariable nodes belonging to Ng. The new Lcv messages aresent back to corresponding check nodes.3. For each check node, insert the updated/Lcv/ into m

    (g)c in a

    sorted order to obtain m(g)c .

    The reformulated column-layered decoding procedure

    is summarised as in Fig. 3, m(0)c and S

    (0)

    c in a decoding

    iteration are computed from m(G1)

    c and S(G1)c , respectively,

    which are obtained in the previous iteration. The new

    computation flow for one full iteration is depicted in Fig. 2b.

    2.2.2 Simplification of min-sum-based column-layered decoding: The algorithm reformulation presentedabove removes a large amount of redundant computation inthe original column-layered decoding scheme and thussignificantly reduces the overall computation complexity.However, because every vector m

    (g)c contains dc values, the

    reformulated algorithm still requires a considerable amountof memory to store check-to-variable messages. For row c,only the two smallest values in the sorted vector m

    (g)c are

    directly involved in the message updating Step-A fromlayer g2 1 to layer g. For example, if the smallest value inm

    (g1)c is from layer g in the previous iteration, it is

    removed from the vectorm(g1)c in horizontal Step A. Then

    the second-smallest value is used as the magnitude ofvariable-to-check message. In horizontal Step B, m

    (g)c is

    computed with the new value, /Lcv/, from variable node andthe pre-sorted vector m(g)c from horizontal Step A. The newvalue, /Lcv/, could take any index in m

    (g)c . If it takes a very

    small index, it has more chance to be used in Step A offurther computation. Otherwise, it has much less chance to

    be one of the two smallest values in the remainingcomputation. In other words, though m

    (g)c contains dc

    values, most of the values in the end of the sorted vectorare less likely to be used as reliability information for

    message updating. Thus, it is reasonable to reduce thelength of the vector m(g)c to further reduce theimplementation complexity. Our simulations show that ifthe lengths of the vectors m

    (g)c and m

    (g)c are set to be 3, the

    Fig. 2 Computation flow in an iteration of column-layered decoding

    a Original algorithmb After algorithm reformulation

    IET Commun., 2011, Vol. 5, Iss. 15, pp. 2177 2186 2179

    doi: 10.1049/iet-com.2010.1002 & The Institution of Engineering and Technology 2011

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    4/10

    decoding convergence speed and performance have almost nodegradation compared to the standard Min-Sum basedcolumn-layered decoding. In such a case, we name thescheme as three-min column-layered decoding. Similarly,

    the lengths of the vectors m(g)c and m

    (g)c can be set as 2 or

    even 1. Correspondingly, more penalties in convergencespeed and performance are expected.

    In this approximation, m(g)c contains three values in most

    cases. It may contain two valid values if one value isremoved from the vectorm

    (g1)c in horizontal Step A. In the

    computation of horizontal Step-B, at most three comparisonoperations are required to sort the new value /Lcv/ fromvariable node and the sorted values in m(g)c . If m

    (g)c contains

    three valid values, the third comparison is needed todetermine whether the /Lcv/ is the third smallest value or itshould be discarded. Table 1 shows the average number ofm

    (g)c updating events in one iteration for the decoding of a

    (4096, 3584) (4, 32) QC-LDPC code at the SNR of 4.1 dB.In one iteration, the total number of m(g)c updating events

    for a check node is 32. A message updating event can becategorised into one of three types. Type I, a value isremoved from m

    (g1)c and then a new value is inserted into

    m(g)c . Type II, no value is removed from m

    (g1)c andm

    (g)c but

    is updated. Type III, no value is removed from m(g1)c and

    /Lcv/ is discarded. It can be seen from Table 1 that if novalue is removed from m

    (g1)c , /Lcv/ is much more likely to

    be discarded than being the third-smallest value in m(g)c . To

    further reduce the decoding complexity, the thirdcomparison in horizontal Step B is eliminated. In themodification, /Lcv/ is discarded if no value is removedfrom m(g

    1)c and /Lcv/ is larger than the second value in

    m(g)c . This results in the simplified three-min column-

    layered decoding. In this case, a check node requires threeequal-comparisons to remove the old /Lcv/ from vectorm

    (g1)c in horizontal Step A and requires two regular

    comparisons to update m(g)c vector with the new /Lcv/ in

    horizontal Step B.The major difference between the three-min decoding and

    the simplified three-min decoding is that the three-mindecoding requires three comparisons for sorting inhorizontal Step B. The simplified three-min requires twocomparisons for sorting in horizontal Step B. Oursimulation shows that the additional approximationintroduced by the simplified three-min decoding onlycauses very small performance loss. Table 1 shows that asorted vector mc only gets updated about 4 times in aniteration of the three-min decoding. On the contrary, forTPMP or row-layered decoding, the sorting computation tofind the smallest and the second smallest magnitudes for acheck node re-starts in every iteration. For the same LDPC

    code, the average number of updating activities for a sortedvector is more than 16 per iteration. It leads to significant

    power savings for the three-min column-layered decodingin check node message updating.

    Fig. 3 Reformulated column-layered decoding

    Table 1 Average number ofmc(g) updating events in one

    iteration

    In the third

    iteration

    In the sixth

    iteration

    m(g)c =m

    (g1)c 2.804 2.783

    m(g)c =m

    (g1)c , /Lcv/ being the

    smallest value in m(g)c

    0.036 0.050

    m(g)c =m

    (g1)c , /Lcv/ being the

    second-smallest value in m(g)c

    0.140 0.170

    m(g)c =m

    (g1)c , /Lcv/ being the

    third-smallest value in m(g)c

    0.880 1.005

    m(g)c = m

    (g1)c =m

    (g1)c , /Lcv/

    being discarded

    28.140 27.992

    2180 IET Commun., 2011, Vol. 5, Iss. 15, pp. 21772186

    & The Institution of Engineering and Technology 2011 doi: 10.1049/iet-com.2010.1002

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    5/10

    2.2.3 Pipelining of column-layered decoding:Pipelining is a common practice in VLSI implementation toincrease clock speed and thus to speed up data processingthroughput. In general, pipelining can only be applied tofeed-forward data paths to maintain the original function ofVLSI circuitry. In LDPC column-layered decodingalgorithms, data dependency exists between consecutivelayers. Thus, the VLSI circuitry for check node, variable

    node and message memories forms a logic loop andpipelining cannot be directly applied to increase theeffective clock speed of LDPC decoders. In this section, arelaxed pipelining scheme of column-layered LDPCdecoding is proposed.

    In the original column-layered decoding, the changebetween m(g

    1)c andm

    (g1+P)c is very small if the value of P

    is not large. Thus, m(g1)

    c can be used as the estimation ofm

    (g1+P)c . An approximation of R

    (g+P)cv can be calculated

    from m(g1)

    c before m(g1+P)c are obtained. Then, it is

    immediately used for computing L(g+P)

    cv . The approximationallows P clock cycles to complete the message computationfor each layer. When |L(g

    +P)cv | is obtained, the m

    (g1+P)c is

    already available. Thus, the horizontal step for layer g+Pcan be undertaken with m(g1+P)c and |L

    (g+P)cv |. The updating

    method of m(g+P)

    c is the same as before. The pipelinedcolumn-layered decoding algorithm is formulated as thefollowing (see Fig. 4).

    2.2.4 Impact to the complexity of VLSI implementation: For the standard min-sum basedcolumn-layered decoding, a check node requires dc2 2regular comparators to compute the check-to-variablemessage for a new column layer. To execute (1) and (2),

    the variable-to-check messages corresponding to the wholeH matrix must be stored. For the simplified three-mincolumn-layered decoding algorithm, a check node onlyrequires two regular comparators. The magnitudes ofvariable-to-check messages can be used on-the-fly and arenot stored. Only the sign bits of variable-to-check messagesshould be stored. For each check node, three magnitudes,three indices and one sign bit need to be saved. For a

    (N, N2

    M) (4, 32) structured LDPC code, approximately,(302 2)/30 93% of computation can be reduced.Assuming that each message is quantised as 4 bits, thenumber of memory bits needed by a check node is3 (3+ 5)+ 1 25. The percentage of memory bits thatcan be saved for extrinsic messages is[(32 42 322 25) M]/(32 4 M) 55.4%.

    As the proposed column-layered decoding approaches donot save the magnitude of variable-to-check messages, theyare inherently memory efficient. Equipped with the

    proposed decoding techniques, an efficient column-layereddecoder architecture for quasi-cyclic LDPC codes isdeveloped with architectural and arithmetic optimisations.The details are provided in Section 4.

    3 Performance simulation

    To evaluate the decoding performance and convergencespeed of the proposed algorithms, three LDPC codes aresimulated. Codes I and II are rate 1/2, (2304, 1152) LDPCcode and rate 5/6, (2304, 1920) LDPC code, respectively,adopted in WiMax (802.16 e) standard. Code III is a (4096,3584) (4, 32)-regular QC-LDPC code, which is constructedusing the progressive edge growth method (PEG) [17, 18].

    Fig. 4 Pipelined column-layered decoding

    IET Commun., 2011, Vol. 5, Iss. 15, pp. 2177 2186 2181

    doi: 10.1049/iet-com.2010.1002 & The Institution of Engineering and Technology 2011

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    6/10

    The code is also used to illustrate the VLSI architecturedesign in Section 4. In each simulation, at least 50 frameerrors are observed.

    Fig. 5 shows the frame error rates in decoding the WiMaxrate-1/2 (2304, 1152) LDPC code with four different-layereddecoding algorithms, that is, (i) the row-layered decoding,(ii) the original column-layered decoding, (iii) the three-mincolumn-layered decoding and (iv) the simplified three-min

    column-layered decoding. The maximum number ofdecoding iterations is set as 10 and 20, respectively, for alldecoding approaches. It can be observed that all the fourdecoding algorithms have almost the same decoding

    performance with maximum of 20 iterations. With themaximum of 10 decoding iterations, only subtle performancedifference can be observed. Fig. 6 shows the frame errorrates in decoding the WiMax rate-5/6 (2304, 1920) LDPCcode with maximum of 10 and 20 iterations, respectively. Itshows again that the performance differences among various-layered decoding algorithms are negligible.

    Fig. 7 shows the frame error rate (FER) in decoding the(4096, 3584) (4, 32) QC-LDPC code. The number ofcolumns in each column layer is 128. The maximum

    number of iteration is set as 10. Compared to the standardmin-sum-based column-layered decoding algorithm, thethree-min column-layered decoding algorithm has almost no

    performance loss. The simplified three-min approach has nomore than 0.02 dB performance loss. The performance of

    pipelined three-min decoding with one-stage pipeline(P 1) has very minimal difference compared to that ofthe standard min-sum-based column-layered decoding. With

    two-stage pipeline (P 2), the pipelined three-minapproach has about 0.02 dB performance loss.

    4 Proposed column-layered decoderarchitectures for QC-LDPC codes

    In this section, an optimised QC-LDPC decoder architecturefor a (4096, 3584) (4, 32)-regular QC-LDPC code using the

    pipelined three-min column-layered decoding scheme ispresented. The architecture efficiently enables very highdecoding parallelism for layered decoding while havingvery short critical path. In one clock cycle, the messagescorresponding to four circulant matrices are processed in

    parallel. Two-stage pipelining is employed to improve theclock frequency. In order to further reduce the critical pathdelay, we rearrange the additions in VNU to maximallytake advantage of carry-save addition. With a simplemodification on CNU, the simplified three-min decodingscheme can be realised for even less hardware requirementat the sacrifice of small performance loss.

    4.1 Column-layered decoder architecture forQC-LDPC codes

    The parity check matrix of the QC-LDPC code to beconsidered in this work consists of an array of

    4 32 cyclically shifted permutation matrices of dimension128 128. The decoding steps discussed in Section 2.2 are

    performed for one block column at a time. That means allextrinsic messages corresponding to a block column of theH matrix are processed in parallel in one clock cycle. Thearchitecture of pipelined decoder with two pipeline stages isshown in Fig. 8. Each M component in an M-register arrayon the left side of Fig. 8 represents a 25-bit register forstoring a sorted vector associated with a row of the Hmatrix. The Get Rcv component performs (8) to get check-to-variable messages for layer g+P, where P 2.The CNUa component (the portion of CNU for horizontalStep-A) performs (9) to remove the old variable-to-checkmessage associated to layer g from the sorted vector. TheCNUb component (the portion of CNU for horizontal StepB) is utilised to perform (10) and (11). The total numbersof M-register, Get Rcv, CNUa and CNUb are all 512. Thenumber of VNUs in the VNU array is 128. A VNU isrequired to perform the vertical step (2) and (3) for avariable node. The barrel shifters in the left side of theVNU array align the check-to-variable message from roworder to column order. Similarly, the barrel shifters in theright side of VNU array align the variable-to-check messagefrom column order to row order. It can be observed fromFig. 8 that two types of loops are formed in column-layereddecoding. The first-type of loop consists of CNUa andCNUb components. The second-type of loop is composed

    of a Get Rcv, a CNUb, a VNU and two barrel shifters. Withthe pipelined column-layered decoding approach, thecritical path in the second-type loop is drastically reducedwhen applying two-stage pipelining.

    Fig. 6 Frame error rate of various layered decoding approaches

    in decoding WiMax rate-5/6 (2304, 1920) LDPC code

    Fig. 5 Frame error rate of various layered decoding approaches

    in decoding WiMax rate-1/2 (2304, 1152) LDPC code

    2182 IET Commun., 2011, Vol. 5, Iss. 15, pp. 21772186

    & The Institution of Engineering and Technology 2011 doi: 10.1049/iet-com.2010.1002

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    7/10

    The Get Rcv component is needed because its outputs arethe check-to-variable messages for layer g+P. On thecontrary, the outputs of CNUa are intermediate values forlayer g. For non-pipelined decoding, the Get Rcv componentcan be eliminated because the output of CNUa contains thecheck-to-variable message for layer g. We only need onestage of barrel shifter at the input of CNUa array tominimise critical path. The connections among the array ofCNUa, CNUb andVNUhave no change during the decoding.

    4.2 Architecture of CNU

    Fig. 9 shows the structure of the CNU for the three-mindecoding scheme. The sub-matrices in a block row of the H

    matrix are processed one at a time. The CNU is composedof two concatenated stages. The first stage, CNUa,computes the m

    (g)c and the second stage, CNUb, generates

    m(g)c . The data in Fig. 9 are associated with the vectors in

    the horizontal decoding step as follows:

    m(g1)c = [min1 old, min2 old, min3 old]

    m(g)c = [min1 temp, min2 temp, min3 temp]

    m(g)c = [min1 new, min2 new, min3 new]

    I(g1)c = [idx1 old, idx2 old, idx3 old]

    I(g)

    c = [idx1 temp, idx2 temp, idx3 tmp]

    I(g)c = [idx1 new, idx2 new, idx3 new]

    In each M-register for a sorted vector, three smallestmagnitudes and their indices are stored. In the decoding ofa column layer, if the column index is not in the vector, themagnitudes and indices in the vector are directly passedthrough the select-logic-A to the second stage. Otherwise,min1_temp and min2_temp get values from the tworemaining smallest magnitudes. The value of min3_temp

    becomes void. The temporary index values are determinedin the same way. It is clear that min1_temp is themagnitude of the Rcv message for the column layer in non-

    pipelined decoding. After the new Lcv value is sent backfrom VNU, it is used to compute the new value for thesorted vector. The structure of a Get Rcv component isshown in Fig. 9. This component is not required fornon-pipelined decoding.

    For the proposed simplified three-min decoding approach,the adder for the computation of abs(Lcv)2min3_temp is not

    needed. Instead, a simple three-input OR gate can be used todisable M-register update in the condition shown in row 5 of

    Fig. 7 Frame error rate of various decoding approaches in

    decoding a (4096, 3584) (4, 32) QC-LDPC code

    Fig. 8 Top-level block diagram of pipelined column-layered decoding

    IET Commun., 2011, Vol. 5, Iss. 15, pp. 2177 2186 2183

    doi: 10.1049/iet-com.2010.1002 & The Institution of Engineering and Technology 2011

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    8/10

    Table 1. The three inputs for the OR gate are the outputs ofthe three comparators in CNUa. It is shown in Fig. 6 thatthe removal of the adder results in less than 0.02 dB

    performance loss.

    4.3 Architecture of optimised VNU

    Shown in Fig. 10 is the structure of an optimised VNU that can

    simultaneously process 4 check-to-variable messages. Theaddition operations are rearranged such that the advantage ofcarry-save adder can be maximally taken. In the beginning ofa VNU, the check-to-variable messages in signed-magnitude

    format are converted to 2s complement representation. Inthe data conversion for each signed-magnitude number, thesign bit is not immediately added to the bitwise-not of lower

    bits. Instead, all sign bits are added through the adder array.Then, each two-bit sign-sum is sent to an adder in thesecond addition stage for final summation. As each adder inthe second and third stages has three inputs, it can beimplemented using a carry-save adder and a regular binary

    adder. The right-shift operations 1 and 2 are used forperforming the scalar multiplication of 0.75. The above-mentioned arithmetic optimisations aim to significantlyreduce the logic delay in a VNU.

    To illustrate the data flow of the optimised VNU, let us takethe computation of L3v as an example. Assume that R4v is anegative number and other inputs are positive numbers. Thevalue before the shift of the scaling operation of L3v is (/R1v/+/R2v/+bit_inverse(/R4v/)+ 0+ 0+ 1). The 0+ 0+ 1 computation is performed by a 1-bit full adder in theadder array block associated with L3v message. After the finalshift and addition stage, the output of L3v is 0.75 (R1v+

    R2v+R4v)+Iv.

    Fig. 9 FER in decoding the (4096, 3584) (4, 32) QC-LDPC code

    a CNU architecture for the three-min column-layered decodingb Structure of the Get Rcv component

    Fig. 10 Architecture of the optimised VNU Table 2 Hardware resource for (4096, 3584) (4, 32) QC-LDPC

    decoders

    VNU 128

    CNU 512

    Get_Rcv 512

    intrinsic message, bits 4 4096 16 384

    sorted vector, bits 25 512 12 800

    signs of variable-to-check messages 4 4096 16 384

    area per VNU, mm2 5597.8

    area per CNU, mm2 2100.9

    area per Get_Rcv, mm2 152.1

    clock frequency, MHz 388

    synthesis area, mm2 6.755

    information decoding throughput, Gb/s 3.928

    2184 IET Commun., 2011, Vol. 5, Iss. 15, pp. 21772186

    & The Institution of Engineering and Technology 2011 doi: 10.1049/iet-com.2010.1002

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    9/10

    5 Hardware requirement and decodingthroughput

    The pipelined column-layered decoder with two pipelinestages is modelled using Verilog RTL and synthesisedusing Fujitsu 0.13 um standard library. The requiredhardware resource and the synthesis result are summarisedin Table 2. Each intrinsic message is quantised as 4 bits. Ittakes 32 clock cycles to compute the initial sorted vectorsand 32 10 320 clock cycles for ten decoding iterations.The number of clock cycles for pipeline latency is 2. Let

    fclk denote the clock frequency of the decoder, theinformation (source data) decoding throughput is

    fclk (40962 512)/(32+ 320+ 2). Thus, the decoder

    achieves a decoding throughput of 3.928 Gb/s at a clockspeed of 388 MHz.

    Table 3 compares the proposed column-layered decoderwith the state-of-the-art row-layered decoders. To mitigatethe discrepancy introduced from different implementationtechnologies, the area and clock speed of all these designsare scaled to 65 nm. For the implementation technology of0.18 mm, 0.13 mm and 90 nm, the area scaling down factoris set as 8, 4 and 2, respectively. The corresponding clockfrequency is scaled up by 1.253, 1.252 and 1.25,respectively. The maximum decoding iteration is set as 10for all decoders with layered decoding algorithms. For areaof synthesis result, a scaling factor of 1/0.7 is applied toapproximate the layout area.

    Considering the normalised decoding throughput,throughput/area ratio and decoding performance amongvarious designs, it can be concluded that the proposedsimplified column-layered decoding algorithm andarchitecture have significant advantages in high-throughputLDPC decoder implementation.

    6 Conclusion

    In this paper, various techniques have been explored to reducethe computation complexity of the column-layered decoding.As a result, the proposed method can drastically reduce theoverall computation complexity of the original scheme

    while largely maintaining decoding performance andconvergence speed. In addition, a relaxed pipelining schemehas been shown to break the data dependency betweenadjacent column layers, and thus enhance the clock speed.

    Combining all the proposed techniques, a low-complexity,high-speed LDPC decoder architecture for generic QC-LDPC codes has been developed and the implementationresult for a specific example has demonstrated that the

    proposed column-layered decoder architecture is verycompetitive to state-of-the-art row-layered LDPC decoderdesigns.

    7 References

    1 Gallager, R.G.: Low-density parity-check codes, IRE Trans. Inf.Theory, 1962, IT-8, pp. 2128

    2 Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M.P.C., Hu, X.-Y.:Reduced-complexity decoding of LDPC codes, IEEE Trans.Commun., 2005, 53, (8), pp. 12881299

    3 Wang, Z., Cui, Z.: A memory efficient partially parallel decoderarchitecture for QC-LDPC codes. Proc. 39th Asilomar Conf. Signals,Systems & Computers, 2005, pp. 729733

    4 Lin, C., Lin, K., Chan, H., Lee, C.: A 3.33 Gb/s (1200, 720) low-density parity check code decoder. Proc. 31st European Solid-StateCircuits Conf., September 2005, pp. 211214

    5 Oh, D., Parhi, K.K.: Nonuniformly quantized Min-Sum decoderarchitecture for low-density parity-check codes. Proc. 18th ACMGreat Lakes symp. on VLSI, 2008, pp. 451456

    6 Shih, X., Zhan, C., Lin, C., Wu, A.: An 8.29 mm2 52 mW multi-modeLDPC decoder design for mobile WiMAX system in 0.13 mm CMOSprocess, IEEE J. Solid-State Circuits, 2008, 43, (3), pp. 672683

    7 Darabiha, A., Carusone, A.C., Kschischang, F.R.: Block-interlacedLDPC decoders with reduced interconnect complexity, IEEE Trans.Circuits and Systems II, 2008, 55, (1), pp. 7478

    8 Mohsenin, T., Truong, D., Baas, B.: Multi-split-row threshold decodingimplementations for LDPC codes. 2009 IEEE Int. Symp. on Circuitsand Systems, 2009, pp. 24492452

    9 Sharon, E., Litsyn, S., Goldberger, J.: An efficient message-passingschedule for LDPC decoding. Proc. 23rd IEEE Convention ofElectrical and Electronics Engineers Israel, September 2004,pp. 223 226

    10 Hocevar, D.E.: A reduced complexity decoder architecture via layereddecoding of LDPC codes. IEEE Workshop Signal Processing Systems,2004, pp. 107112

    11 Radosavljevic, P., Baynast, A., Cavallaro, J.R.: Optimized messagepassing schedules for LDPC decoding. Proc. 39th Asilomar Conf.Signals, Systems and Computers, 2005, pp. 591595

    12 Zhang, J., Fossorier, M.P.C.: Shuffled iterative decoding, IEEE Trans.Commun., 2005, 53, (2), pp. 209213

    13 Brack, T., Alles, M., Lehnigk-Emden, T., et al.: Low complexity LDPC

    code decoders for next generation standards. Proc. DesignationAutomation Test in Europe, April 2007, pp. 16

    14 Sun, Y., Cavallaro, J.R.: A low-power 1-Gbps reconfigurable LDPCdecoder design for multiple 4 G wireless standards. 2008 IEEE Int.SOC Conf., 2008, pp. 367370

    Table 3 Design comparisons with recent high-speed LDPC decoders

    This work [13] [16] [14] [15]

    code 4096, 3584 9600, 7200 802.11 n 802.16 e, 802.11 n 2048-bits

    rate 7/8 3/4 1/25/6 1/25/6 1/27/8

    algorithm column-layered

    min-sum

    row-layered

    min-sum

    row-layered

    min-sum

    row-layered BP row-layered BCJR

    LLR message quantisation 4-bits 6-bits 5-bit 4-bits

    max. iteration 10 10 5 10 10max decoding parallelism 512 80 81 2 96

    technology 0.13 mm 65 nm 0.18 mm 90 nm 0.18 mm

    clock frequency, MHz 388 500 208 450 125

    area, mm2 6.755 (synthesis) 0.504 (synthesis) 3.39 3.5 14.3

    max info. throughput, Gb/s 3.928 1.08 0.78 1.0 0.64

    clock frequency (scaled to 65 nm) 606 500 406 562 244

    layout area, scaled to 65 nm 2.41 0.72 0.42 1.75 1.79

    normalised info. throughput,

    scaled to 65 nm

    6.13 1.08 0.76

    (10 iterations)

    1.25 1.25

    IET Commun., 2011, Vol. 5, Iss. 15, pp. 2177 2186 2185

    doi: 10.1049/iet-com.2010.1002 & The Institution of Engineering and Technology 2011

    www.ietdl.org

  • 7/29/2019 column layered decoding.pdf

    10/10

    15 Mansour, M.M., Shanbhag, N.R.: A 640-Mb/s 2048-bit programmableLDPC decoder chip, IEEE J. Solid-State Circuits, 2006, 41, (3),pp. 684 698

    16 Studer, C., Preyss, N., Roth, C., Burg, A.: Configurable high-throughput decoder architecture for quasi-cyclic LDPC codes. 42ndAsilomar Conf. Signals, Systems and Computers, 2008, pp. 11371142

    17 Hu, X.-Y., Eleftheriou, E., Arnold, D.M.: Regular and irregularprogressive edge-growth tanner graphs, IEEE Trans. Inf. Theory,2005, 51, (1), pp. 386398

    18 Li, Z., Kumar, B.V.K.V.: A class of good quasi-cyclic low-density paritycheck codes based on progressive edge growth graph. 38th AsilomarConf. Signals, Systems and Computers, 2004, vol. 2, pp. 19901994

    2186 IET Commun., 2011, Vol. 5, Iss. 15, pp. 21772186

    & The Institution of Engineering and Technology 2011 doi: 10.1049/iet-com.2010.1002

    www.ietdl.org