
Programming Graphics Processing Units for the Decoding of Low-Density Parity-Check Codes

F. C. M. Lau∗ and L. Shi

Department of Electronic and Information Engineering, Hong Kong Polytechnic University, Hong Kong

∗Email: [email protected]

Abstract—Simulating the error performance of low-density parity-check (LDPC) codes usually takes a lot of computation time. Inexpensive graphics processing units (GPUs) have recently been used to accelerate the decoding process by allowing the smaller and identical tasks to be performed in a highly parallel manner. In this paper, we propose decoding a number of LDPC codewords at the same time using GPUs. We show that the decoding speed is improved with our proposed method.

Index Terms—Error-correction code, GPU computing, LDPC code.

I. INTRODUCTION

Low-density parity-check (LDPC) codes are a kind of error-correction code with excellent error performance [1], [2]. The decoding algorithm, which involves the iterative updating of bit-node-to-check-node (B2C) messages and check-node-to-bit-node (C2B) messages, is relatively complex [3], [4]. In particular, when a large number of codewords need to be decoded, a CPU requires much time to complete the decoding process. Sometimes a few weeks are needed to evaluate the error performance of an LDPC code with an extremely low error rate [5], [6].

Compute Unified Device Architecture (CUDA) [7] is a computing architecture that makes use of Graphics Processing Units (GPUs) [8] to perform computations in a highly parallel manner. In the updating of B2C (or C2B) messages, the messages are independent of one another. Hence we can divide the updating process into many smaller and identical tasks. Moreover, such tasks can be carried out in parallel and are well suited to implementation in CUDA. In [9], such a CUDA program that decodes LDPC codes using GPUs has been designed. In the program, the message updating processes are accomplished efficiently in parallel by the hundreds of cores in a GPU. The program, moreover, has been found to run much faster than a corresponding decoder program run on a CPU. The drawback of the program is that it decodes only one LDPC codeword at a time.

In this paper, we design a CUDA program which decodes multiple LDPC codewords at the same time. Instead of computing the smaller tasks in parallel for the same codeword, we compute the same task for different codewords in parallel. By decoding a large number of codewords simultaneously, we aim to improve the decoding speed by fully utilizing all the resources available. We organize the paper as follows. To make the paper self-contained, we briefly review the LDPC codes and the decoding algorithm in Section II. In Section III, we describe the details of our proposed program architecture. In Section IV, we present our simulation results.

Fig. 1. A parity-check matrix with a maximum row weight of 3 and a maximum column weight of 2. BN: bit node; CN: check node.

II. REVIEW OF LDPC CODES

Each LDPC code is associated with a parity-check matrix (PCM), usually denoted by H. In H, most of the entries are zeros and the number of ones is linearly proportional to the number of columns. Figure 1 shows an example of a PCM with a maximum row weight of 3 and a maximum column weight of 2. Each column corresponds to a bit node and each row corresponds to a check node. Each non-zero element in H indicates that the corresponding bit node and check node are connected. Consequently, associated with each non-zero element are a B2C message and a C2B message used in the iterative decoding algorithm [9], [10].
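The relationship between H and the bit-node/check-node connections can be illustrated with a small sketch. The 4 x 6 matrix below is a hypothetical example chosen only to have a maximum row weight of 3 and a maximum column weight of 2; it is not the matrix of Fig. 1, and the function and variable names are illustrative.

```python
# Hypothetical 4 x 6 parity-check matrix (NOT the matrix shown in Fig. 1).
# Rows are check nodes, columns are bit nodes.
H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

def neighbours(H):
    """Return (bits_of_check, checks_of_bit) adjacency lists derived from H."""
    M, N = len(H), len(H[0])
    bits_of_check = [[i for i in range(N) if H[j][i]] for j in range(M)]
    checks_of_bit = [[j for j in range(M) if H[j][i]] for i in range(N)]
    return bits_of_check, checks_of_bit

bits_of_check, checks_of_bit = neighbours(H)
max_row_weight = max(len(b) for b in bits_of_check)
max_col_weight = max(len(c) for c in checks_of_bit)
# Each non-zero entry is one edge, carrying one B2C and one C2B message.
num_edges = sum(len(b) for b in bits_of_check)
```

Only the 12 non-zero entries (edges) carry messages, which is why the number of messages grows with the number of ones in H rather than with the full M x N size of the matrix.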

In the decoding algorithm, messages are first received from the channel. We assume that an additive white Gaussian noise channel is used. Supposing there are N bit nodes (equal to the number of columns in H), N channel messages, denoted by λi (i = 1, ..., N), are received for each codeword. The B2C messages in the ith column are then all initialized to λi, i.e., qij = λi, where qij denotes the B2C message from bit node i to check node j. Assume that there are M check nodes. In each iteration, the following steps are performed. (For details of the algorithm, please refer to [9], [10].)
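The initialization qij = λi can be sketched as follows. The paper does not state the modulation or the exact form of λi; this sketch assumes BPSK (bit 0 mapped to +1, bit 1 to -1) and takes λi to be the log-likelihood ratio 2y/σ², which is a common choice for the AWGN channel rather than something confirmed by the source. The adjacency list and function names are hypothetical.

```python
import random

# Hypothetical adjacency: checks_of_bit[i] lists the check nodes connected to bit node i.
checks_of_bit = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]

def channel_llrs(bits, sigma, rng):
    """Assumed model: BPSK (0 -> +1, 1 -> -1) plus Gaussian noise; LLR = 2*y/sigma^2."""
    llrs = []
    for b in bits:
        y = (1.0 - 2.0 * b) + rng.gauss(0.0, sigma)
        llrs.append(2.0 * y / sigma ** 2)
    return llrs

def init_b2c(llrs, checks_of_bit):
    """Initialise every B2C message q[(i, j)] to the channel message of bit node i."""
    return {(i, j): llrs[i]
            for i, checks in enumerate(checks_of_bit)
            for j in checks}

rng = random.Random(1)          # fixed seed so the example is repeatable
llrs = channel_llrs([0] * 6, sigma=0.8, rng=rng)
q = init_b2c(llrs, checks_of_bit)
```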

1) Check node j (j = 1, 2, ..., M), based on all the incoming B2C messages qij (i = 1, 2, ...), computes/updates the C2B message (denoted by rji) to bit node i (i = 1, 2, ...).

2) Bit node i (i = 1, 2, ..., N), based on the incoming C2B messages rji (j = 1, 2, ...) and the channel message λi, computes/updates the B2C message qij to check node j (j = 1, 2, ...).

3) Bit node i (i = 1, 2, ..., N), based on the incoming C2B messages rji (j = 1, 2, ...) and the channel message λi, decodes its bit value. Then the decoded codeword is checked for validity. If the decoded codeword is a valid codeword, the decoding process terminates; otherwise, the next iteration begins. The iteration also stops when the number of iterations exceeds a preset maximum number.

Fig. 2. Alignment of the channel messages for K codewords. Each codeword has N channel messages.

Fig. 3. Domain conversion matrices corresponding to the PCM in Fig. 1. (a) Check-node-domain-to-bit-node-domain (CND-to-BND); (b) bit-node-domain-to-check-node-domain (BND-to-CND).
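The three steps of one decoding iteration listed above can be sketched in a few lines. The source defers the exact update rules to [9], [10]; this sketch substitutes the min-sum approximation for the check-node update, which is an assumption made to keep the example short, not necessarily the rule used by the authors. The matrix H and the channel messages are hypothetical, and every check node is assumed to connect to at least two bit nodes.

```python
def decode(H, llrs, max_iters=20):
    """Min-sum message passing (assumed rule). Returns (bits, iterations_used, valid)."""
    M, N = len(H), len(H[0])
    bits_of_check = [[i for i in range(N) if H[j][i]] for j in range(M)]
    checks_of_bit = [[j for j in range(M) if H[j][i]] for i in range(N)]
    # B2C messages start at the channel message of their bit node.
    q = {(i, j): llrs[i] for i in range(N) for j in checks_of_bit[i]}
    r = {}
    bits = [1 if v < 0 else 0 for v in llrs]
    for it in range(1, max_iters + 1):
        # Step 1: each check node updates its C2B messages from the other B2C inputs.
        for j in range(M):
            for i in bits_of_check[j]:
                others = [q[(k, j)] for k in bits_of_check[j] if k != i]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                r[(j, i)] = sign * min(abs(v) for v in others)
        # Step 2: each bit node updates its B2C messages from the other C2B inputs.
        for i in range(N):
            for j in checks_of_bit[i]:
                q[(i, j)] = llrs[i] + sum(r[(k, i)] for k in checks_of_bit[i] if k != j)
        # Step 3: hard decision per bit node, then validity (parity) check.
        totals = [llrs[i] + sum(r[(j, i)] for j in checks_of_bit[i]) for i in range(N)]
        bits = [1 if t < 0 else 0 for t in totals]
        if all(sum(bits[i] for i in bits_of_check[j]) % 2 == 0 for j in range(M)):
            return bits, it, True
    return bits, max_iters, False

# Hypothetical PCM and channel messages; bit node 1 is received weakly in error.
H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
result = decode(H, [2.0, -0.5, 1.5, 1.0, 2.5, 1.8])
```

In this example the hard decision on the raw channel messages fails the first parity check, and a single iteration is enough to recover the all-zero codeword.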

III. PROPOSED CUDA PROGRAM

A. Memory allocation

In our proposed CUDA program, we decode K codewords at the same time using K separate threads. In order to access the data in memory efficiently, we have to align the messages in such a way that coalesced reads and writes can be performed. In Fig. 2, we show the arrangement of the channel messages for the K codewords in the memory. Such an arrangement facilitates the use of coalesced reads. Note that there are N channel messages for each codeword.
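A minimal sketch of one arrangement that permits coalesced access is given here. It assumes that message n of codeword k is stored at index n * K + k, so that the K threads reading message n touch consecutive addresses; this interleaving is an assumption about what Fig. 2 depicts, and the function names are illustrative.

```python
def interleave(messages):
    """Flatten messages[k][n] into a single list indexed by n * K + k (assumed layout)."""
    K, N = len(messages), len(messages[0])
    flat = [0.0] * (K * N)
    for k in range(K):
        for n in range(N):
            flat[n * K + k] = messages[k][n]
    return flat

def addresses_for_step(n, K):
    """Indices touched when threads 0..K-1 each read message n of their own codeword."""
    return [n * K + k for k in range(K)]

# Four codewords (K = 4) with three messages each (N = 3); value 10*k + n tags the source.
messages = [[10 * k + n for n in range(3)] for k in range(4)]
flat = interleave(messages)
step1 = addresses_for_step(1, 4)   # consecutive addresses, one per thread
```

With the alternative codeword-major layout (index k * N + n), the same K reads would be N elements apart, which is the strided pattern that coalescing is meant to avoid.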

Fig. 4. Alignment of (a) B2C messages and (b) C2B messages for K codewords.

Fig. 5. Flow chart of the main program.

In many cases, the parity-check matrices (PCMs) used are very large. A lot of memory will be required if we are to store the whole PCM, including its zero elements. It is therefore more efficient to store only the information of the non-zero elements in the PCM. Here, we compress the PCM and store it in two different domains: one in the check-node domain (CND) and the other in the bit-node domain (BND), as has been proposed in [9]. In addition, the B2C messages and the C2B messages are stored in compressed form, i.e., in BND and in CND, respectively. Figure 3 illustrates the conversion matrices [9] used for converting the B2C messages or C2B messages from one domain to the other. Finally, for K codewords, we align the B2C messages and the C2B messages as in Fig. 4. Again, the reason for doing so is to enable the efficient use of coalesced reads and writes.
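The idea of storing only the non-zero elements in two domains, together with conversion tables between them, can be sketched as follows. The exact layout of the compressed PCMs and conversion matrices in [9] and Fig. 3 is not reproduced here; this sketch assumes that the CND orders edges row by row, the BND orders them column by column, and each conversion table is a permutation of edge positions. All names and the matrix H are hypothetical.

```python
def compress(H):
    """Return (cnd, bnd, cnd_to_bnd, bnd_to_cnd) for the non-zero entries of H."""
    M, N = len(H), len(H[0])
    # Edge (j, i) joins check node j and bit node i.
    cnd = [(j, i) for j in range(M) for i in range(N) if H[j][i]]  # row-by-row order
    bnd = [(j, i) for i in range(N) for j in range(M) if H[j][i]]  # column-by-column order
    pos_in_bnd = {edge: p for p, edge in enumerate(bnd)}
    cnd_to_bnd = [pos_in_bnd[edge] for edge in cnd]
    bnd_to_cnd = [0] * len(cnd)
    for p, t in enumerate(cnd_to_bnd):
        bnd_to_cnd[t] = p
    return cnd, bnd, cnd_to_bnd, bnd_to_cnd

def convert(values, table):
    """Reorder per-edge values between domains: out[table[p]] = values[p]."""
    out = [None] * len(values)
    for p, t in enumerate(table):
        out[t] = values[p]
    return out

H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
cnd, bnd, cnd_to_bnd, bnd_to_cnd = compress(H)
c2b_in_cnd = [float(p) for p in range(len(cnd))]   # placeholder C2B values, CND order
c2b_in_bnd = convert(c2b_in_cnd, cnd_to_bnd)       # same values, BND order
```

For this 4 x 6 example, 12 edge entries are stored instead of 24 matrix entries; the saving grows with the sparsity of the PCM.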

Fig. 6. Hardware and software platforms used in the simulation. (a) Platform one; (b) platform two.

TABLE I. DETAILS OF THE LDPC CODES USED IN THE SIMULATIONS.

B. Program flow

Figure 5 shows the flow diagram of the CUDA program. The program first reads a .PCM file which contains the details of the parity-check matrix. Based on this information, the compressed PCMs in the BND and CND are formed, and the domain conversion matrices are created. Then the K codewords, each of length N, are generated and mixed with noise. All the aforementioned data are subsequently loaded into the GPU memory. The CUDA kernel is called and all K threads are instructed to do the same calculations, with one thread assigned to decode one codeword.

Fig. 7. Relative simulation time taken to decode a fixed number of codewords of the “1000x504pchk” LDPC code. Platform one is used, and the number of codewords K decoded simultaneously varies from 32 to 8000 in the proposed approach. The signal-to-noise ratio is (a) 0.5 dB, (b) 1.2 dB and (c) 1.5 dB.

Fig. 8. Relative simulation time taken to decode a fixed number of codewords of the “Wimax” LDPC code. Platform two is used, and the number of codewords K decoded simultaneously varies from 128 to 6400 in the proposed approach. The signal-to-noise ratio is 1.2 dB.

Referring to Fig. 5, an initial guess on each received codeword is first made based on the channel messages. If some decoded codewords (out of the K codewords) are valid, the corresponding entries in the truth table (of size K) in the GPU memory are set to “0”; otherwise, they are set to “1”. Codewords that are not valid go through the iterative process: C2B message updating (first-half iteration), conversion of the C2B messages from CND to BND, B2C message updating (second-half iteration), bit-node decision, and validity check of the decoded codeword. Again, the corresponding entries in the truth table are updated (set to “0”) when valid codewords are decoded. The remaining codewords then proceed to the next iteration: conversion of the B2C messages from BND to CND, C2B message updating (first-half iteration), conversion of the C2B messages from CND to BND, B2C message updating (second-half iteration), bit-node decision, and validity check of the decoded codeword. The process continues until all the codewords have been decoded or a preset maximum number of iterations has been reached.
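The control flow described above (initial guess, truth table, and iteration over the still-invalid codewords) can be sketched on the host side as follows. This is a sequential emulation in which the loop over k stands in for the K GPU threads; the min-sum update, the dictionary-based message storage, and all names are assumptions made for illustration, and the domain conversions are omitted because the messages are indexed by edge directly.

```python
def syndrome_ok(H, bits):
    """True when every parity check of H is satisfied by bits."""
    return all(sum(h * b for h, b in zip(row, bits)) % 2 == 0 for row in H)

def min_sum_iteration(H, llrs, q):
    """One iteration (assumed min-sum rule): returns new B2C messages and hard decisions."""
    M, N = len(H), len(H[0])
    r = {}
    for j in range(M):                       # first-half iteration: C2B update
        row = [i for i in range(N) if H[j][i]]
        for i in row:
            others = [q[(k, j)] for k in row if k != i]
            sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
            r[(j, i)] = sign * min(abs(v) for v in others)
    new_q, bits = {}, []
    for i in range(N):                       # second-half iteration and bit decision
        col = [j for j in range(M) if H[j][i]]
        total = llrs[i] + sum(r[(j, i)] for j in col)
        for j in col:
            new_q[(i, j)] = total - r[(j, i)]
        bits.append(1 if total < 0 else 0)
    return new_q, bits

def decode_batch(H, batch, max_iters=20):
    """Decode K codewords; truth[k] is 0 once codeword k is valid, 1 otherwise."""
    M, N = len(H), len(H[0])
    decisions = [[1 if v < 0 else 0 for v in llrs] for llrs in batch]  # initial guess
    truth = [0 if syndrome_ok(H, d) else 1 for d in decisions]
    q = [{(i, j): llrs[i] for i in range(N) for j in range(M) if H[j][i]}
         for llrs in batch]
    iters = 0
    while any(truth) and iters < max_iters:
        iters += 1
        for k in range(len(batch)):          # one GPU thread per codeword in the paper
            if truth[k] == 0:
                continue                     # already valid: nothing left to do
            q[k], decisions[k] = min_sum_iteration(H, batch[k], q[k])
            if syndrome_ok(H, decisions[k]):
                truth[k] = 0
    return decisions, truth, iters

H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
batch = [
    [2.0, 1.0, 1.5, 1.0, 2.5, 1.8],    # valid at the initial guess
    [2.0, -0.5, 1.5, 1.0, 2.5, 1.8],   # needs one iteration
]
decisions, truth, iters = decode_batch(H, batch)
```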

IV. RESULTS AND DISCUSSIONS

We use two different platforms to test our program. The first one uses a GTS450 GPU while the second one uses a GTX295 GPU [8]. Details of the two platforms are shown in Fig. 6. We compare the simulation times of our approach and the one proposed by Yau et al. [9]. In running the GPU program proposed by Yau et al., we set the number of threads per block to 64 in the decoding process.

First, we simulate the “1000x504pchk” LDPC code, which contains 1008 bit nodes and 504 check nodes. It is a relatively simple code and we run it on platform one. The signal-to-noise ratios (SNRs) used are 0.5 dB, 1.2 dB and 1.5 dB. In our approach, the number of codewords being decoded at the same time varies from 32 to 8000. The relative simulation times taken to decode a fixed number of codewords are shown in Fig. 7. We observe that our proposed approach requires less time to decode the same number of codewords than the algorithm used by Yau et al. Further, in our proposed approach the simulation time decreases as the number of codewords being decoded at the same time increases. The reason is that, with the new approach, more resources (i.e., memory and cores) in the GPU can be utilized for the decoding process at the same time. The effect is more prominent when more codewords are decoded simultaneously.

Next, we simulate the “Wimax” LDPC code [11], [12], [13]. It has twice the number of edges compared with the “1000x504pchk” code. In our approach, the number of codewords being decoded at the same time varies from 128 to 64000. The SNR used is 1.2 dB. This time, we run it on platform two and plot the relative simulation times in Fig. 8. We arrive at the same conclusions as above, i.e., the simulation time is reduced compared with the method used by Yau et al., and the time decreases as the number of threads increases.

V. CONCLUSIONS

This paper proposes a new GPU-based approach to decoding LDPC codes. The proposed approach decodes multiple codewords at the same time and utilizes more of the resources available in the GPU. Simulation results have shown that, compared with the method used by Yau et al., which decodes one codeword at a time, our proposed approach reduces the time taken to decode the LDPC codes. Moreover, the speed-up of our approach becomes more pronounced as the number of codewords being decoded at the same time increases.

ACKNOWLEDGEMENTS

The work described in this paper was partially supported by a grant from the RGC of the Hong Kong SAR, China (Project No. PolyU 521809).

REFERENCES

[1] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low density parity check codes,” Electronics Letters, vol. 32, pp. 1645–1646, August 1996.

[2] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 619–637, 2001.

[3] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative decoding of low-density parity check codes based on belief propagation,” IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, 1999.

[4] M. P. C. Fossorier, “Iterative reliability-based decoding of low-density parity check codes,” IEEE Journal on Selected Areas in Communications, vol. 19, no. 5, pp. 908–917, 2001.

[5] L. Dinoi, F. Sottile, S. Benedetto, I. Superiore, M. Boella, and I. Torino, “Design of variable-rate irregular LDPC codes with low error floor,” in Proc. IEEE Int. Conf. Commun., pp. 647–651, Seoul, Korea, 2005.

[6] X. Zheng, F. C. M. Lau, and C. K. Tse, “Constructing Short-Length Irregular LDPC Codes with Low Error Floor,” IEEE Transactions on Communications, vol. 58, no. 10, pp. 2823–2834, Oct. 2010.

[7] C. Nvidia, “Nvidia CUDA C programming guide version 3.2,” tech. rep., NVIDIA Corporation, 2010.

[8] The Current Generation CUDA Architecture, Code Named Fermi. Available at http://www.nvidia.com/object/fermi_architecture.html.

[9] S. F. Yau, T. L. Wong, and F. C. M. Lau, “Extremely fast simulator for decoding LDPC codes,” in Proc. The 13th International Conference on Advanced Communication Technology, Phoenix Park, Korea, Feb. 2011.

[10] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 599–618, 2001.

[11] T. Brack, M. Alles, F. Kienle, and N. Wehn, “A synthesizable IP Core for WiMax 802.16e LDPC code decoding,” in Proc. Personal Indoor and Radio Communications Conference, pp. 1–5, Helsinki, Finland, 2006.

[12] W. M. Tam, F. C. M. Lau, and C. K. Tse, “A class of QC-LDPC codes with low encoding complexity and good error performance,” IEEE Communications Letters, vol. 14, no. 2, pp. 169–171, 2010.

[13] IEEE, “IEEE standard for local and metropolitan area networks. Part 16: Air interface for fixed and mobile broadband wireless access systems, Amendment 2: Physical and medium access control layers for combined fixed and mobile operation in licensed bands,” IEEE Standard 802.16e-2005, 28 Feb. 2006.
