


TSINGHUA SCIENCE AND TECHNOLOGY  ISSN 1007-0214 04/10 pp577-587  Volume 18, Number 6, December 2013

Comparison of Parallelization Strategies for Min-Sum Decoding of Irregular LDPC Codes

Hua Xu*, Wei Wan, Wei Wang, Jun Wang, Jiadong Yang, and Yun Wen

Abstract: Low-Density Parity-Check (LDPC) codes are powerful error correcting codes. LDPC decoders have been efficiently implemented on dedicated VLSI hardware architectures in recent years. This paper describes two strategies to parallelize min-sum decoding of irregular LDPC codes. The first implements min-sum LDPC decoders on multicore platforms using OpenMP, while the other uses the Compute Unified Device Architecture (CUDA) to parallelize LDPC decoding on Graphics Processing Units (GPUs). Empirical studies on data of various scales show that the performance of these decoding processes is improved by these parallel strategies and that the GPU provides the more efficient, faster decoder implementation.

Key words: Low-Density Parity-Check (LDPC) codes; multicore; OpenMP; Graphic Processor Unit (GPU); Compute Unified Device Architecture (CUDA)

1 Introduction

With the development of advanced computer hardware technologies, multicore architectures are offering powerful computing platforms for different high-performance computation applications in many fields. There are many parallelization methods for multicore platforms, such as OpenMP[1], multithreading, and MPI. In recent years, Graphics Processing Units (GPUs) have also developed very rapidly. Parallel GPUs have begun making computational inroads against CPUs, and General-Purpose computing on GPUs (GPGPU)[2] is used in fields as diverse as scientific image processing, linear algebra, and 3-D reconstruction. Today, more and

* Hua Xu, Wei Wan, Wei Wang, Jiadong Yang, and Yun Wen are with the State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected].
* Jun Wang is with the Department of Electronics Engineering, Tsinghua University, Beijing 100084, China. E-mail: [email protected].
* To whom correspondence should be addressed.
Manuscript received: 2012-02-28; revised: 2013-04-02; accepted: 2013-05-06

more applications in many fields have been developed for these multicore platforms because these platforms provide tremendous processing power. LDPC decoding is just one of these fields.

Low-Density Parity-Check (LDPC) codes were first proposed by Gallager[3] in 1963 and rediscovered by MacKay and Neal[4] in 1996. LDPC codes are excellent error correcting codes with performance close to the Shannon limit[5]. Currently, LDPC decoders are implemented in high speed, low-complexity hardware, because LDPC codes can be processed with fully parallel operations[6]. LDPC codes also have good flexibility and a low error floor[7, 8]. LDPC codes have been widely used in error control coding in recent years. Due to their outstanding performance and great potential, LDPC codes have been adopted in emerging standards for digital communication and storage applications, such as the DVB-S2 standard, Chinese Digital Terrestrial/Television Multimedia Broadcasting (DTMB), WiMAX (802.16e), Wi-Fi (802.11n), and 10 Gbit Ethernet (802.3an).

There are many algorithms for LDPC decoding for different bit-error rates and system complexities, such as Majority-Logic (MLG) decoding[9], Bit-Flipping (BF) decoding[10, 11], Iterative Decoding based on Belief Propagation (IDBP)[12, 13] (also called the Sum-Product Algorithm (SPA) or message passing[14]), and the Min-Sum Algorithm (MSA)[15]. Among these algorithms, SPA has excellent error correction performance, but is relatively complex. Although the MSA error performance is in general a few tenths of a decibel worse than that of SPA, it is much simpler to implement.

MSA LDPC decoding is computationally intensive. To achieve real-time processing, VLSI hardware processors are used. Howland and Blanksby[16] described a fully parallel VLSI architecture that achieves LDPC decoding with excellent throughput. Shimizu et al.[17] implemented a parallel LDPC decoder on a Field-Programmable Gate Array (FPGA) and simulated its decoding performance. However, hardware solutions are neither flexible nor scalable. Decoding different codes usually requires a complete change of the hardware architecture and many resources, leading to long and expensive development programs. Thus, general PC platforms are now being considered for LDPC decoding. Falcao et al.[18-20] developed LDPC decoders on various multicore processors, such as off-the-shelf general-purpose x86 processors, GPUs, and the CELL Broadband Engine (CELL/B.E.). Their decoders achieve LDPC decoding with excellent throughput. Wang et al.[21] also implemented a parallel algorithm for LDPC decoding using the Compute Unified Device Architecture (CUDA).

This study evaluated whether LDPC decoders can be implemented on general-purpose multicore architectures using two efficient parallel MSAs for LDPC decoding. One uses OpenMP to implement the parallel MSA on x86 general-purpose multicores, while the other implements LDPC decoding on GPUs using the NVIDIA CUDA programming model. The parallel approaches are evaluated using six matrices of different sizes on different platforms. The results show that the parallel GPU approach runs significantly faster than the approach on the x86 general-purpose multicore platform. The main contribution of this paper is a comparison of various parallelization strategies for LDPC decoding with an efficient programmable solution.

2 LDPC Decoding Algorithm

LDPC codes are linear $(N, K)$ block codes. They can be defined by an $M \times N$ sparse parity check matrix H, where M and N represent the numbers of Check-Nodes (CNs) and Bit-Nodes (BNs). M is equal to $N - K$. If c is a valid codeword, then $Hc^{T} = 0$. The code rate is computed as $R = K/N$. The weight of a column is defined as the number of 1s in the column, and the weight of a row is the number of 1s in the row. Regular LDPC codes have both equal column weights and equal row weights, while irregular LDPC codes do not. Irregular LDPC codes can be represented by their degree distributions as
$$\lambda(x) = \sum_{j=1}^{d_{v\max}} \lambda_j x^{j-1} \quad \text{and} \quad \rho(x) = \sum_{j=1}^{d_{c\max}} \rho_j x^{j-1},$$
where $\lambda_j$ represents the proportion of columns with weight j, $\rho_j$ represents the proportion of rows with weight j, $d_{v\max}$ represents the maximum column weight, and $d_{c\max}$ represents the maximum row weight. These two equations specify the degree distributions of bit nodes and check nodes in irregular LDPC codes. A Tanner graph[22] is an intuitive way to represent LDPC codes. It is formed by bit nodes and check nodes linked by bidirectional edges. Each row of the parity check matrix is a check node in the corresponding Tanner graph and each column of the parity check matrix is a bit node. Figure 1 is an example of a Tanner graph of a $4 \times 8$ matrix.

QC-LDPC codes[23, 24] have lower implementation complexities for both encoding and decoding compared with random LDPC codes. They can be efficiently encoded using simple feedback-shift registers with low complexity due to their special structure. Well-designed QC-LDPC codes can be as good as computer-generated regular or irregular random LDPC codes in terms of bit-error performance, word-error performance, and error floor. A QC-LDPC code with index t is made up of many $t \times t$ square blocks, including all-zero matrices and circulant permutation matrices. A circulant permutation matrix with parameter s is generated by shifting an identity matrix to the right s times. In other words, this is a square matrix where each row is obtained by a cyclic right-shift of the previous row. Equation (1) shows an example of a circulant permutation matrix A with $t = 6$ and $s = 2$. A QC-LDPC code can be represented by

Fig. 1 Tanner graph example.


sets of parameters s and t. The parity check matrix H of a QC-LDPC code, shown in Eq. (2), is made up of a series of $t \times t$ matrices, named T. The subscripts of T indicate the position of the block in matrix H, and the value of T denotes the character of the block. If $|T| = -1$, the block is an all-zero matrix; otherwise, it is an identity matrix shifted t times. The value of $|T|$ depends on the actual need in different applications. These characteristics of the matrix of a QC-LDPC code facilitate hardware implementation of decoders. Therefore, QC-LDPC codes are widely used, such as in the Chinese DTMB[25].

$$A = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0
\end{pmatrix} \tag{1}$$

$$H = \begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,n-1} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,n-1} \\
A_{2,0} & A_{2,1} & \cdots & A_{2,n-1} \\
\vdots & \vdots & \ddots & \vdots \\
A_{m-1,0} & A_{m-1,1} & \cdots & A_{m-1,n-1}
\end{pmatrix} \tag{2}$$

Figures 2 and 3 show models of the DTMB transmitter[26] and receiver[27]. The DTMB takes advantage of the latest technical breakthroughs, such as LDPC coding for better error correction capability, a long time interleaver to reduce impulsive noise, and spread spectrum protection of the system information. The input data is first scrambled. Then the system uses a Forward Error Correction (FEC) code, which is a concatenation of a BCH outer code and an LDPC inner code. After that, the output binary sequence is mapped to M-QAM symbols before convolutional interleaving. Some system information is added to transmit the necessary terrestrial encoding and modulation information before the frame body and the frame head are combined into the signal frame. The "frame body processing" module applies an Inverse Fast Fourier Transform (IFFT) to the frame body for multi-carrier modulation, with no changes to the frame body for single-carrier modulation. Finally, the baseband processing and the up-conversion are completed. At the receiver side, as shown in Fig. 3, the channel state information obtained via synchronization and channel estimation is used to equalize the frame body, which is then processed by the corresponding inverse operations of the transmitter. LDPC provides superior error correction capability for better sensitivity, especially at higher code rates. Therefore, DTMB provides much larger coverage or better availability. More detailed information can be found in Ref. [26]. FEC decoding, as depicted in Fig. 3, consists of BCH decoding and LDPC decoding. This paper focuses on the LDPC decoding steps.

LDPC decoding is based on the belief propagation of messages between connected nodes. The decoding algorithm requires intensive computation. SPA is an efficient LDPC decoding algorithm with excellent error correction performance, but it is quite complex. MSA approximates the calculation at the check nodes with a simple minimum operation, which reduces the complexity compared to SPA. This balances complexity against error correction performance, and MSA has been widely applied[28]. This algorithm can iteratively

Fig. 2 DTMB transmitter system.

Fig. 3 DTMB receiver system.


decode the LDPC codes. Suppose an LDPC code is applied to an AWGN channel with noise having zero mean and variance $\sigma^2$. Assume the signal is BPSK modulated with unit energy. Then $y = (y_1, y_2, \cdots, y_n)$ is the soft input information from the channel and $D = (D_1, D_2, \cdots, D_n)$ is the decoding result. The MSA is depicted as Algorithm 1.

Each iteration mainly consists of two intensive processing blocks, horizontal and vertical. In the horizontal processing, the algorithm updates the message from each check node to each bit node, while the vertical processing does the converse. Equation (3) in Algorithm 1 defines the horizontal processing that updates the message from each $CN_m$ to $BN_n$. For each iteration, the $R_{mn}$ values are updated according to Eq. (3). Similarly, Eq. (4) defines the vertical processing that computes the messages sent from $BN_n$ to $CN_m$. In this case, the $Q_{mn}$ values hold the updated information from Eq. (4). At the end of each iteration, the algorithm performs tentative decoding, as indicated by Eq. (5). The iterative procedure is stopped if the decoded word D verifies all the parity check equations of the code ($HD^{T} = 0$) or i reaches the maximum number of iterations.

Algorithm 1  MSA
 1: Initialization:
 2:   $P_n = -2y_n/\sigma^2$;  $Q_{mn} = P_n$;  $i = 0$;
 3: repeat
 4:   {Horizontal processing:}
 5:   for each node pair ($BN_n$, $CN_m$), corresponding to $H_{mn} = 1$ in the parity check matrix H of the code, do
 6:     $R_{mn} = \Big(\prod_{n' \in N(m)\setminus n} \mathrm{sgn}(Q_{mn'})\Big) \min_{n' \in N(m)\setminus n} |Q_{mn'}|$;   (3)
 7:   end for
 8:   {Vertical processing:}
 9:   for each node pair ($BN_n$, $CN_m$), corresponding to $H_{mn} = 1$ in the parity check matrix H of the code, do
10:     $Q_{mn} = P_n + \sum_{m' \in M(n)\setminus m} R_{m'n}$;   (4)
11:   end for
12:   {Tentative decoding:}
13:   for each bit $D_n$ in the decoding result D do
14:     $Q_n = P_n + \sum_{m' \in M(n)} R_{m'n}$;  if $Q_n > 0$ then $D_n = 1$; else $D_n = 0$;   (5)
15:   end for
16:   $i \leftarrow i + 1$;
17: until ($HD^{T} = 0$ $\vee$ $i >$ max number of iterations)

A loop-carried dependency is the dependence of a loop iteration on the output of one or more previous iterations, which means that computations in a given iteration of a loop cannot be completed without knowing the values calculated in earlier iterations. Loop-carried dependencies prevent parallelization of the loops. The horizontal and vertical processings in Algorithm 1 represent the most intensive processing steps in MSA. Both blocks are based on nested loops, and each loop updates different data. Thus there are no loop-carried dependencies, so each can be processed in parallel in a high performance code-specific computing engine or in a highly parallel programmable device. Currently, most LDPC decoders are implemented on VLSI because VLSI-based decoding methods can provide real-time operation. Since these two processing steps can be executed in parallel, the LDPC decoders developed here use OpenMP on multicore platforms and CUDA on GPUs.

3 Parallel MSA LDPC Decoding

As described in the previous section, the parallel MSA for LDPC decoding can be implemented on multicore architectures in two ways. This section first introduces the data structures used to represent the H matrix. Then, the parallelization approach is given for the LDPC decoder on general-purpose x86 architectures using OpenMP. Finally, a multithreading-based approach is given for GPUs using CUDA.

3.1 Data structures

The parallel MSA uses special data structures to represent the H matrix of an LDPC code. The edges of the Tanner graph defined by the H matrix depict the bidirectional flow of messages exchanged between bit nodes and check nodes. Two separate two-dimensional arrays are used to represent the H matrix. One array represents the check nodes for horizontal processing, while the other array represents the bit nodes for vertical processing. Figure 4 describes the data structures representing the Tanner graph in Fig. 1, which is an irregular LDPC code. The present solution can be applied not only to regular LDPC codes but also to irregular LDPC codes, so the data structures are suitable for representing both. Some values in the two-dimensional arrays in Fig. 4 are null because these arrays represent an irregular LDPC code, which has


Fig. 4 Data structures representing the Tanner graph in Fig. 1.

different column weights and row weights. The most important data for parallelization is the information for each edge. The edges in a Tanner graph are represented by a linear array, where each element is a special edge structure. This edge structure contains information about the bit node, the check node, and the $R_{mn}$ and $Q_{mn}$ values related to each edge. $R_{mn}$ and $Q_{mn}$ are the intermediate results of the horizontal and vertical processing steps. These structures allow parallel execution because there are no loop-carried dependencies in the horizontal and vertical processing, and the related data is grouped into consecutive memory locations.

3.2 Parallel MSA on multiple cores using OpenMP

OpenMP[1] is a multithreading implementation that allows the compiler to generate code for task and data parallelism, implemented through compiler directives that instruct segments of code to be run as parallel threads. In this model, the master thread forks a specified number of slave threads and divides the task among them. After execution of the parallelized code, the threads join back into the master thread, which continues onward to the end of the program. Each thread executes the parallelized section of code independently by default. The runtime environment allocates threads to processors according to usage, machine load, and other factors. OpenMP provides an effective, straightforward approach for programming general-purpose x86 multicores. More information about OpenMP can be found at openmp.org.

The most costly loops are identified when parallelizing an application using OpenMP. If there are no loop-carried dependencies in the loop iterations, the iterations can be parallelized via the #pragma omp parallel for directive. In MSA, the horizontal and vertical processings represent the most intensive processing steps. However, each loop updates different data, and both are nested loops with independent iterations, so they can be parallelized. Figure 5 shows how the parallel MSA is executed on a general-purpose x86 multicore machine using OpenMP. An iterative cyclical process is executed after a series of preprocessing steps. Each loop contains horizontal processing, vertical processing, and tentative decoding. At the end, the results are output from the system.

Horizontal processing and vertical processing are performed in parallel using OpenMP as depicted in Algorithm 2. The horizontal and vertical steps are themselves sequential: the horizontal processing is done first and then the vertical processing. In the decoding process, each step is divided and executed in parallel on different cores. This approach uses the #pragma omp parallel for directive to first parallelize the horizontal step and then the vertical step. Only

Fig. 5 Parallel MSA on multicores using OpenMP.

Algorithm 2  Parallel MSA on multiple cores using OpenMP
 1: Initialization: ...
 2: repeat
 3:   #pragma omp parallel for
 4:   {Horizontal processing:}
 5:   ...
 6:   #pragma omp parallel for
 7:   {Vertical processing:}
 8:   ...
 9:   #pragma omp parallel for
10:   {Tentative decoding:}
11:   ...
12:   $i \leftarrow i + 1$;
13: until ($HD^{T} = 0$ $\vee$ $i >$ max number of iterations)


one codeword is decoded at a time. Although multiple codeword decoding was tried early on, it is not convenient for calculating the average decoding time for each codeword, so the speedup could not be calculated; therefore, multiple codeword decoding is not used here. Each loop iteration in the algorithm processes one edge: it reads data from the two-dimensional arrays and the edge information array, calculates the results, and writes $R_{mn}$ and $Q_{mn}$ back to the edge information array. Different iterations write different data, and no iteration needs to read data from another iteration's results. For multiple codeword decoding, both the #pragma omp parallel sections directive and the #pragma omp parallel for directive could be used to launch several decoders in parallel on multicore platforms, since the different cores do not need to share any data. However, a GPU using CUDA cannot launch multiple decoders in this way.

3.3 Parallel MSA on GPU using CUDA

GPUs are powerful tools for parallel computing. In recent years, intensive efforts have focused on how to use GPUs for general-purpose computing (GPGPU). NVIDIA introduced CUDA[29, 30] to enable GPUs to solve complex computing problems. CUDA is a parallel programming model and software environment designed for GPUs that has an easy learning curve for programmers familiar with standard programming languages such as C. CUDA is a streaming computing platform where geometry, pixel, and vertex programs share common Stream Processors (SPs). The compiler, the software development kit, the architecture documentation, and the language extensions are available on the NVIDIA website.

GPUs are specialized for highly parallel computation, so the CUDA parallel computing model operates with tens of thousands of lightweight threads grouped into thread blocks. These threads must execute the same function on different data. The function that contains the computations and runs in parallel in a large number of instances is called the kernel. Using a small number of threads or a small number of blocks to execute a kernel would be very inefficient. This paper elaborates how to develop the MSA on GPUs using CUDA (Fig. 6).

The parallel algorithm implemented on the GPU, described in Algorithm 3, is just like the algorithm implemented on a CPU. The <<< ... >>>(...) syntax in Algorithm 3 is a special syntax used in the CUDA model introduced by NVIDIA. GPUs are designed for highly parallel computations with tens of thousands of lightweight threads grouped into thread blocks. For example, in the expression HorizontalProcessing<<<grids_1, threads_1>>>(p_1, ..., p_n), grids_1 represents the number of blocks, threads_1 represents the number of threads in each block, and p_1, ..., p_n are the parameters. So there are a total of grids_1 × threads_1 threads executing the same function on different data. Each processor performs the same task in the horizontal and vertical steps of the MSA on different pieces of distributed data, which means that each thread

Algorithm 3  Parallel MSA on GPUs using CUDA
 1: Initialization: ...
 2: copy the data from host memory to global memory in the GPU;
 3: repeat
 4:   Initialize grids_1, threads_1 for horizontal processing;
 5:   HorizontalProcessing<<<grids_1, threads_1>>>(p_1, ..., p_n);
 6:   Initialize grids_2, threads_2 for vertical processing;
 7:   VerticalProcessing<<<grids_2, threads_2>>>(p_1, ..., p_n);
 8:   Initialize grids_3, threads_3 for tentative decoding;
 9:   TentativeDecoding<<<grids_3, threads_3>>>(p_1, ..., p_n);
10:   $i \leftarrow i + 1$;
11: until ($HD^{T} = 0$ $\vee$ $i >$ max number of iterations)
12: copy the data from global memory back to host memory;

Fig. 6 Parallel MSA on GPUs using CUDA.


executes the same instructions; therefore, Data-Level Parallelism (DLP, also known as loop-level parallelism) is used with the GPU. A thread-per-edge approach is used in both the horizontal step and the vertical step of the algorithm: information is exchanged between each node pair ($BN_n$, $CN_m$) corresponding to $H_{mn} = 1$ in the parity check matrix H. This algorithm is also scalable for future GPUs with more cores. In the decoding process, each block contains 256 lightweight threads. In this algorithm, the data required for the GPU computation is copied from the host memory to the GPU's global memory so that all threads can access it there. Each lightweight thread, corresponding to an edge in the Tanner graph, does not require complex processing in MSA, so the data is not moved from global memory to the fast shared memory or registers, which means the small shared memory does not limit the processing of matrices with different sizes. When processing different kernels, each thread just reads data from the GPU's global memory and directly writes the result data ($R_{mn}$ and $Q_{mn}$) back to global memory. Different lightweight threads write different data, and no thread needs to read another thread's results. At the end of the algorithm, the data is copied from the GPU's global memory back to the host memory.

4 Tests

This section describes test results for the parallel MSA for decoding LDPC codes on multiple cores and GPUs. The speedups and throughputs of these parallel strategies are then evaluated. The algorithms are implemented in the C language, with a series of tests performed on three different parallel processing platforms.

The test setups are described in Table 1. The first platform is a PC with two Intel Core i7 920 2.67 GHz CPUs. Each of these CPUs contains four cores, which means this PC contains eight cores. This PC also has an NVIDIA GTX 295, on which the CUDA experiments are performed. The second platform is a PC with an Intel Core2 Quad 2.93 GHz CPU, which contains four cores. The last platform is a PC with an Intel Core2 Duo 3.0 GHz CPU. The first platform should have the best performance.

Six irregular LDPC codes were used to test the parallel strategies. The parity check matrices H of these codes are characterized in Table 2. Matrices C, D, and E are specific codes used in practical industry applications (the Chinese DTMB). Only matrix A is not a QC-LDPC code. The square blocks of matrices C, D, and E have the same size of 127 × 127. Matrices B and F have the same square block size of 128 × 128.

4.1 Results on multicore platforms using OpenMP

The decoding rates on x86 multicore platforms using OpenMP are shown in Table 3, including both the serial and parallel decoding times (ms), speedups, and throughputs (Mbit/s). The speedups and throughputs are shown only for 30 iterations in Table 3. The x86 multicore platforms were programmed using OpenMP directives and compiled with Microsoft Visual Studio 2005.

Table 1 Experimental setup.

Platform | Processor | Number of cores | Clock speed (GHz) | Memory (GB) | Language | OS
Platform 1 (CPU) | Intel Core i7 920 | 8 | 2.67 | 8 | C + OpenMP | Windows XP SP2
Platform 1 (GPU) | NVIDIA GTX 295 | 480 | 1.242 (per SP) | 1.75 | C + CUDA | Windows XP SP2
Platform 2 (CPU) | Intel Core2 Quad | 4 | 2.93 | 4 | C + OpenMP | Windows XP SP2
Platform 3 (CPU) | Intel Core2 Duo | 2 | 3.0 | 4 | C + OpenMP | Windows XP SP2

Table 2 LDPC codes for the tests.

Matrix | Rate | Size (M × N) | Edges | Block size | Row weights | Column weights
A | 0.5 | 128 × 256 | 664 | — | 5, 6 | 2, 3, 6
B | 0.5 | 1280 × 2560 | 6784 | 128 × 128 | 5, 6 | 2, 3, 6
C | 0.4 | 4445 × 7493 | 34 925 | 127 × 127 | 7, 8 | 3, 4, 11
D | 0.6 | 2921 × 7493 | 37 592 | 127 × 127 | 12, 13 | 3, 4, 7, 16
E | 0.8 | 1397 × 7493 | 37 338 | 127 × 127 | 26, 27 | 3, 4, 11
F | 0.5 | 12 800 × 25 600 | 67 840 | 128 × 128 | 5, 6 | 2, 3, 6


Table 3 Decoding performance on x86 multicore platforms using OpenMP.

Matrix | Serial time (ms): 10 / 20 / 30 it. | Parallel time (ms): 10 / 20 / 30 it. | Speedup (30 it.) | Throughput (30 it., Mbit/s)

Platform 1:
A | 0.41 / 0.83 / 1.24 | 0.78 / 1.56 / 2.34 | 0.53 | 0.11
B | 4.36 / 8.73 / 13.09 | 1.56 / 3.12 / 4.68 | 2.80 | 0.55
C | 24.62 / 49.33 / 74.04 | 6.19 / 12.35 / 18.54 | 3.99 | 0.40
D | 22.93 / 43.91 / 64.79 | 6.39 / 12.53 / 18.63 | 3.48 | 0.40
E | 24.19 / 48.46 / 72.67 | 6.21 / 12.37 / 18.59 | 3.91 | 0.40
F | 43.74 / 87.55 / 131.34 | 9.76 / 19.51 / 29.15 | 4.51 | 0.88

Platform 2:
A | 0.38 / 0.76 / 1.14 | 0.90 / 1.80 / 2.70 | 0.42 | 0.09
B | 4.06 / 8.13 / 12.21 | 2.94 / 5.86 / 8.76 | 1.39 | 0.29
C | 22.77 / 45.67 / 68.43 | 12.44 / 24.74 / 36.96 | 1.85 | 0.20
D | 21.79 / 42.27 / 62.51 | 13.70 / 26.60 / 39.42 | 1.59 | 0.19
E | 21.79 / 43.59 / 65.38 | 12.84 / 25.43 / 37.99 | 1.72 | 0.20
F | 41.37 / 82.69 / 124.08 | 18.07 / 35.33 / 52.51 | 2.36 | 0.49

Platform 3:
A | 0.40 / 0.78 / 1.17 | 0.52 / 1.05 / 1.59 | 0.74 | 0.16
B | 4.24 / 8.56 / 12.83 | 2.50 / 5.03 / 7.55 | 1.70 | 0.34
C | 23.80 / 47.68 / 71.41 | 14.76 / 29.45 / 43.82 | 1.63 | 0.17
D | 23.21 / 44.92 / 66.36 | 14.09 / 27.56 / 40.83 | 1.63 | 0.18
E | 22.83 / 45.76 / 68.60 | 13.46 / 26.90 / 40.45 | 1.70 | 0.19
F | 43.60 / 86.82 / 130.61 | 28.08 / 56.60 / 84.92 | 1.54 | 0.30

4.2 Results on GPUs using CUDA

The CUDA tests were also performed on Platform 1. An NVIDIA GeForce GTX 295 was used to test the parallel strategies. This GPU has 480 1242-MHz stream processors and was programmed using the CUDA programming interface (version 2.2). Table 4 shows the CUDA test results, including both the CPU and GPU decoding times (ms) for different numbers of iterations, the speedups for 30 iterations, and the throughputs (Mbit/s) for different numbers of iterations.

5 Comparison

Figure 7 compares the speedups and throughputs of the parallel MSA LDPC decoding using OpenMP. Among the three platforms, Platform 1 gave the best performance because it has the most cores. The results show that with this parallel approach, the speedup increases as the number of cores increases: the speedup was approximately 1.65 with two cores and approximately 4.0 with eight cores. However, the general-purpose x86 multicore platforms are not suitable for implementing LDPC decoders because the throughputs achieved are still far from those required by real-time applications. In addition, it is interesting that the parallel decoding results of matrix A show worse performance than the serial results on all three platforms. An important reason is that this matrix is very small and does not have many edges, so the computation of each thread is not complex, but the thread creation and synchronization

Table 4 GPU decoding performance using CUDA.

Matrix  CPU time (ms)            GPU time (ms)            Speedup    Throughput (Mbit/s)
        10 it.  20 it.  30 it.   10 it.  20 it.  30 it.   (30 it.)   10 it.  20 it.  30 it.
A         0.41    0.83    1.24     0.24    0.48    0.72     1.72       1.07    0.53    0.36
B         4.36    8.73   13.09     0.24    0.48    0.72    18.18      10.67    5.33    3.56
C        24.66   49.38   74.16     0.25    0.49    0.73   101.59      29.97   15.29   10.26
D        22.93   43.91   64.79     0.25    0.49    0.73    88.75      29.97   15.29   10.26
E        24.19   48.46   72.67     0.25    0.49    0.73    99.55      29.97   15.29   10.26
F        43.74   87.55  131.34     0.26    0.50    0.74   177.49      98.46   51.20   34.59


Hua Xu et al.: Comparison of Parallelization Strategies for Min-Sum Decoding of Irregular LDPC Codes 585

Fig. 7 Results of OpenMP.

require too much time with OpenMP. It can also be noted that the throughput of matrix B is higher than the throughputs of matrices C, D, and E, because matrix B has fewer edges than these three matrices and requires much less decoding time.
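The trade-off described above, per-thread work versus thread management overhead, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the compressed-row layout (row_ptr) and the function name are assumptions made for the example.

```c
#include <float.h>
#include <math.h>

/* Sketch of the OpenMP strategy: the check-node (row) updates of
 * min-sum decoding are independent, so the outer loop is distributed
 * across cores with a parallel-for pragma. For a small matrix like A,
 * the fork/join cost of the parallel region can exceed the work inside
 * it, which is consistent with the sub-1.0 speedups reported above. */
void check_node_update(int n_rows, const int *row_ptr,
                       const double *v2c, double *c2v)
{
    #pragma omp parallel for schedule(static)
    for (int m = 0; m < n_rows; m++) {
        for (int i = row_ptr[m]; i < row_ptr[m + 1]; i++) {
            double min_mag = DBL_MAX;
            int sign = 1;
            /* min-sum rule: magnitude is the minimum over the row's
             * other edges, sign is the product of their signs */
            for (int j = row_ptr[m]; j < row_ptr[m + 1]; j++) {
                if (j == i) continue;
                if (fabs(v2c[j]) < min_mag) min_mag = fabs(v2c[j]);
                if (v2c[j] < 0.0) sign = -sign;
            }
            c2v[i] = sign * min_mag;
        }
    }
}
```

Compiled with OpenMP support (e.g., gcc -fopenmp) the row loop is split across cores; without it the pragma is ignored and the code runs serially, mirroring the serial baseline.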

Figure 8 shows the test results on the GPU using CUDA, with the parallel approach on the GPU giving a large speedup. Figure 8a shows the decoding times for 30 iterations, including serial LDPC decoding on Platform 1's CPU and parallel decoding on the GPU. Since the decoding time on the GPU is very short, the Y axis uses a logarithmic scale. Figure 8b shows the speedups for 30 iterations. Matrices C, D, and E have similar speedups because they have similar numbers of edges. The highest speedup is approximately 180, for matrix F with 67 840 edges. The speedups increase as the number of edges increases. A thread-per-edge approach is used, and each thread does not require complex processing with the MSA, so the parallel decoding performance on the GPU with tens of thousands of threads is reasonably good.
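The thread-per-edge mapping can be illustrated by the work a single thread performs. The sketch below writes that per-edge computation as a plain C function so it can be read and run on its own; in the CUDA version, each GPU thread would evaluate one edge index in parallel. The edge_row/row_ptr layout and the names are assumptions for illustration, not the authors' kernel.

```c
#include <float.h>
#include <math.h>

/* Work of one "thread" in the thread-per-edge scheme: produce the
 * check-to-variable message for edge e. edge_row[e] gives the check
 * row that edge e belongs to; row_ptr delimits each row's edges in the
 * flat per-edge message array v2c. The loop is only a few compares
 * long, which is why tens of thousands of such lightweight threads
 * map well onto the GPU. */
double edge_message(int e, const int *edge_row, const int *row_ptr,
                    const double *v2c)
{
    int m = edge_row[e];
    double min_mag = DBL_MAX;
    int sign = 1;
    for (int j = row_ptr[m]; j < row_ptr[m + 1]; j++) {
        if (j == e) continue;              /* exclude the edge's own input */
        if (fabs(v2c[j]) < min_mag) min_mag = fabs(v2c[j]);
        if (v2c[j] < 0.0) sign = -sign;
    }
    return sign * min_mag;
}
```

Because every edge's message depends only on read-only inputs from the previous half-iteration, no synchronization is needed within the update, unlike the OpenMP version's fork/join per region.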

Figure 8c shows the throughputs for various numbers of iterations. Matrix F gives the highest throughput because it has the largest size. Matrices C, D, and E are used in the Chinese DTMB standard. In real-time Chinese DTMB applications, the throughput should be at least 11.98 Mbit/s and the average number of iterations is usually between 15 and 20. The results show that the throughputs for these three matrices are approximately 15.29 Mbit/s for 20 iterations, which means that they meet the real-time processing requirement. Therefore, GPU-based LDPC decoders can be used to implement software LDPC decoders.
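The throughput figures follow directly from the codeword length and the decoding time. As a sketch of the arithmetic (the 7493-bit DTMB codeword length used below is an assumption for this example, since the text does not restate it):

```c
/* Throughput arithmetic: Mbit/s = codeword bits / decoding time.
 * codeword_bits = 7493.0 (assumed DTMB LDPC codeword length) and
 * time_ms = 0.49 (20-iteration GPU time from Table 4) reproduce the
 * ~15.29 Mbit/s figure reported for matrices C, D, and E. */
double throughput_mbps(double codeword_bits, double time_ms)
{
    return codeword_bits / (time_ms * 1e-3) / 1e6;
}
```

With these numbers, throughput_mbps(7493.0, 0.49) gives roughly 15.29 Mbit/s, matching the reported value and exceeding the 11.98 Mbit/s real-time requirement.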

Comparing Figs. 7 and 8 shows that CUDA on the GPU gives better speedup and throughput than OpenMP on multicore CPUs, but it is more sensitive to the matrix size. Generally speaking, although the multicore methods give large improvements, the GPU-based parallel approach is a more efficient way to provide intensive decoding of LDPC codes.

6 Conclusions

This paper compares two strategies to parallelize MSA LDPC decoding on multicore architectures. Tests show

Fig. 8 CUDA results.


586 Tsinghua Science and Technology, December 2013, 18(6): 577-587

that the LDPC decoder can be implemented in software on general-purpose multiprocessor or multicore PC platforms using the MSA with OpenMP, or on a GPU using CUDA. The tests demonstrate that both parallel approaches reduce the decoding run times, with the GPU giving the best performance.

Although the decoding performance on x86 multicore platforms using OpenMP improves with more cores, LDPC decoders still cannot be implemented in software on general-purpose x86 multicore platforms for actual applications, because the throughputs are far too low for real-time use. In contrast, the decoding performance on the GPU using CUDA is quite good, which shows that the GPU-based parallel approach is an efficient method for decoding LDPC codes and can replace hardware decoding methods such as VLSI.

Acknowledgements

This work was supported by the Agilent Technology Foundation (No. 912-CHN09), the National Natural Science Foundation of China (No. 61175110), the National Key Basic Research and Development (973) Program of China (No. 2012CB316305), and the National Key Projects of Science and Technology of China (No. 2011ZX02101-004).

References

[1] B. Chapman, G. Jost, and R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, The MIT Press, 2008.

[2] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell, A survey of general-purpose computation on graphics hardware, Computer Graphics Forum, vol. 26, no. 1, pp. 80-113, Mar. 2007.

[3] R. G. Gallager, Low-density parity-check codes, IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21-28, 1962.

[4] D. J. C. MacKay and R. M. Neal, Near Shannon limit performance of low density parity check codes, Electronics Letters, vol. 32, no. 18, pp. 1645-1646, 1996.

[5] C. E. Shannon, A mathematical theory of communication, ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, no. 1, pp. 3-55, 2001.

[6] G. Al-Rawi, J. Cioffi, R. Motwani, and M. Horowitz, Optimizing iterative decoding of low-density parity check codes on programmable pipelined parallel architectures, presented at the IEEE Global Telecommunications Conference, Houston, USA, 2001.

[7] Z. He, P. Fortier, and S. Roy, A class of irregular LDPC codes with low error floor and low encoding complexity, IEEE Communications Letters, vol. 10, no. 5, pp. 372-374, 2006.

[8] T. Tian, C. Jones, J. D. Villasenor, and R. D. Wesel, Construction of irregular LDPC codes with low error floors, presented at the 6th IEEE International Conference on Communications, Anchorage, USA, 2003.

[9] V. D. Kolesnik, Probabilistic decoding of majority codes, Problemy Peredachi Informatsii, vol. 7, no. 3, pp. 3-12, 1971.

[10] Y. Kou, S. Lin, and M. Fossorier, Low-density parity-check codes based on finite geometries: A rediscovery and new results, IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2711-2736, 2001.

[11] Z. Liu and D. A. Pados, Low complexity decoding of finite geometry LDPC codes, presented at the 6th IEEE International Conference on Communications, Anchorage, USA, 2003.

[12] J. Chen and M. Fossorier, Near optimum universal belief propagation based decoding of low-density parity check codes, IEEE Transactions on Communications, vol. 50, no. 3, pp. 406-414, 2002.

[13] D. J. C. MacKay, Good error-correcting codes based on very sparse matrices, IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 399-431, 1999.

[14] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, Factor graphs and the sum-product algorithm, IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498-519, 2001.

[15] M. Fossorier, M. Mihaljevic, and H. Imai, Reduced complexity iterative decoding of low-density parity check codes based on belief propagation, IEEE Transactions on Communications, vol. 47, no. 5, pp. 673-680, 1999.

[16] C. Howland and A. Blanksby, Parallel decoding architectures for low density parity check codes, presented at the IEEE International Symposium on Circuits and Systems, Sydney, Australia, 2001.

[17] K. Shimizu, T. Ishikawa, N. Togawa, T. Ikenaga, and S. Goto, A parallel LSI architecture for LDPC decoder improving message-passing schedule, presented at the IEEE International Symposium on Circuits and Systems, Island of Kos, Greece, 2006.

[18] G. Falcao, L. Sousa, and V. Silva, Massive parallel LDPC decoding on GPU, presented at the Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 2008.

[19] G. Falcao, L. Sousa, V. Silva, and J. Marinho, Parallel LDPC decoding on the Cell/B.E. processor, in High Performance Embedded Architectures and Compilers, 2009, pp. 389-403.

[20] G. Falcao, L. Sousa, and V. Silva, Massively LDPC decoding on multicore architectures, IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 309-322, 2011.

[21] S. Wang, S. Cheng, and Q. Wu, A parallel decoding algorithm of LDPC codes using CUDA, presented at the 42nd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2008.

[22] R. M. Tanner, A recursive approach to low-complexity codes, IEEE Transactions on Information Theory, vol. 27, no. 5, pp. 533-547, 1981.

[23] M. Fossorier, Quasi-cyclic low-density parity-check codes from circulant permutation matrices, IEEE Transactions on Information Theory, vol. 50, no. 8, pp. 1788-1793, 2004.

[24] L. Chen, J. Xu, I. Djurdjevic, and S. Lin, Near-Shannon-limit quasi-cyclic low-density parity-check codes, IEEE Transactions on Communications, vol. 52, no. 7, pp. 1038-1042, 2004.

[25] D. Niu, K. Peng, J. Song, C. Pan, and Z. Yang, Multi-rate LDPC decoder implementation for China digital television terrestrial broadcasting standard, presented at the IEEE International Conference on Communications, Circuits and Systems, Guilin, China, 2007.

[26] J. Song, Z. Yang, L. Yang, K. Gong, C. Pan, J. Wang, and Y. Wu, Technical review on Chinese digital terrestrial television broadcasting standard and measurements on some working modes, IEEE Transactions on Broadcasting, vol. 53, no. 1, pp. 1-7, 2007.

[27] X. Wang, J. Wang, J. Wang, Y. Li, S. Tang, and J. Song, Embedded transmission of multi-service over DTMB system, IEEE Transactions on Broadcasting, vol. 56, no. 4, pp. 504-513, 2010.

[28] Q. Hong, J. Wang, and W. Lei, A resource-efficient decoder architecture for LDPC codes, presented at the International Conference on Electrical and Control Engineering (ICECE), Wuhan, China, 2010.

[29] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, Parallel computing experiences with CUDA, IEEE Micro, vol. 28, no. 4, pp. 13-27, 2008.

[30] CUDA homepage, http://developer.nvidia.com/object/cuda.html, 2013.

Hua Xu received his BEng degree from Xi'an Jiaotong University in 1998 and his MEng and PhD degrees from Tsinghua University in 2000 and 2003, respectively. He is now an associate professor in the Department of Computer Science and Technology, Tsinghua University. His research fields include data mining, intelligent information processing, and advanced process controllers for IC manufacturing equipment. He has published over 50 academic papers, holds 10 invention patents on advanced controllers, and is the copyright owner of 6 software systems. He has received the 2nd Prize of National Science and Technology Progress of China, the 1st Prize of Beijing Science and Technology, and the 3rd Prize of Chongqing Science and Technology.

Wei Wan received his BEng degree in 2009 and his MEng degree in 2012, both in computer science and technology from Tsinghua University. His research interests span the areas of parallel processing, data mining, machine learning, and natural language processing.

Wei Wang is a master student in the Department of Computer Science and Technology, Tsinghua University. He received his BEng degree in computer science and technology from Xiamen University in 2010. His research interests span the areas of data mining, machine learning, and natural language processing.

Jun Wang received his PhD degree from Tsinghua University in 2003. He is now an associate professor in the Department of Electronics Engineering, Tsinghua University. His research fields include broadband wireless transmission channel coding, modulation, and receiving technology.

Jiadong Yang received his BEng degree in 2006 and MEng degree in 2008 from Tianjin University, and his PhD degree from Tsinghua University in 2012, all in computer science and technology. He now works at Jike.com. His research interests include machine learning and evolutionary computation.

Yun Wen received his MEng degree in computer science and technology from Tsinghua University in 2011. He now works at Jike.com. His research interests include machine learning and evolutionary computation.