joint detection and decoding for mimo systems using convolutional codes: algorithm and vlsi...




IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 59, NO. 9, SEPTEMBER 2012 1919

Joint Detection and Decoding for MIMO Systems Using Convolutional Codes: Algorithm and VLSI Architecture

Chitaranjan Pelur Sukumar, Chung-An Shen, and Ahmed M. Eltawil, Member, IEEE

Abstract—In this paper, we present a novel approach to perform joint detection and decoding for spatial multiplexing multiple-input multiple-output (MIMO) systems which utilize convolutional codes. The bit error rate (BER) performance of the proposed approach is significantly better than that of systems which utilize separate detection and decoding blocks. Formal algorithms with two possible system setups are presented and their performance documented. In particular, for a reference 4 × 4, 16-QAM system using a rate 1/2 convolutional code with generator polynomial [247, 371] and a constraint length of 8, improvements in signal-to-noise ratio (SNR) of 2.5 dB and 3 dB are achieved over conventional soft decoding at a BER of 10 . The proof-of-concept VLSI architecture for one algorithm is provided and a novel way to reduce memory usage is demonstrated. Results indicate that better performance over conventional systems is achievable with comparable hardware complexity. The proposed design was synthesized and laid out in 65-nm CMOS technology at a 181-MHz clock frequency. An average throughput of 216.9 Mbps at an SNR of 13 dB with an area equivalent to 553 Kgates was achieved.

Index Terms—Error correction codes, joint detection and decoding, K-best, multiple-input multiple-output (MIMO), spatial multiplexing, sphere decoding, tree search.

I. INTRODUCTION

Multiple-input multiple-output (MIMO) wireless communication systems are widely recognized as a means of increasing data rates [1], [2] and have been rapidly adopted by a large number of industry standards such as WiMax and LTE. Increased capacity is achieved by transmitting multiple data streams at the same time (spatial multiplexing). While the design of the transmitter for such systems is fairly straightforward, choosing the architecture of the receiver involves making design tradeoffs regarding optimality (in terms of bit error rates) and hardware complexity. The two receiver architectures which are on opposite ends of this design spectrum are the zero-forcing (ZF) and maximum-likelihood (ML) decoders. While ZF decoders are simpler to implement than ML decoders, their bit error rate (BER) performance is far worse due to noise enhancement. As a result, considerable effort has been directed towards reducing the complexity of ML receivers. The sphere decoding algorithm (SDA) [3]–[7] and the related K-best decoding algorithm [8] have been adopted as decoders of choice because of their ability to implement ML (or close to ML) decoding with significantly reduced complexity. These approaches can be effectively expressed as a tree-searching class of algorithms visiting a subset of the tree. Recently, there has been a significant amount of work in related fields, both in the theoretical and algorithmic domain [9]–[11] as well as in the architecture and implementation context [13]–[18].

Manuscript received December 17, 2010; revised May 21, 2011 and September 12, 2011; accepted November 14, 2011. Date of publication January 24, 2012; date of current version August 24, 2012. This work was supported in part by the Center for Pervasive Communications and Computing at the University of California, Irvine, in part by the National Science Foundation under Grant ECCS 0955157, and in part by Mindspeed Technologies, Inc. This paper was recommended by Associate Editor G. Sobelman.
The authors are with the University of California, Irvine, CA 92697 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2011.2180442

Another pervasive feature of all modern communication systems is the use of error correcting codes, which introduce structured redundancy to the information bits. Two approaches have been proposed in the literature to utilize the interdependent behavior of MIMO detection and the decoding of error correction codes (recovery of the information bits). In the first and more common method, the detection and decoding problems are treated as two separate blocks, in which the modulation symbols are first detected and the information bits are then recovered by using error correction blocks such as Viterbi decoders. Furthermore, to improve the performance, the extrinsic information (log-likelihood ratios) of each bit can be exchanged iteratively between the detector and decoder multiple times before a final decision is made [13]. On the other hand, the best performing receiver architectures are those which conduct detection and decoding in a single-stage process, that is, which perform joint detection and decoding. In [12], the authors describe the method to achieve this for MIMO systems which use linear block codes.

This paper presents a technique to perform joint detection and decoding in MIMO systems using convolutional error correction codes. The main contributions of this work can be summarized as follows.

• The concept and approach to perform MIMO detection and decoding of convolutional codes in a single stage is presented. This technique is based on a tree-searching algorithm where the decoding of the convolutional code is an integral part of the tree search process.

• The means by which each coded bit is mapped to different levels of the tree (which correspond to the modulation points sent to specific transmit antennas) needs to be designed carefully, especially when interleaving is applied. To prove the concept, two possible bit mapping strategies associated with different interleaving structures are suggested and the joint detection-decoding algorithm for each setup is presented. In the first proposed algorithm, interleaving is performed on the modulated symbols (symbol interleaving), while in the second, interleaving is performed on the coded bits (bit interleaving) as well as on the modulated symbols. BER performances for both approaches are documented and generalizations to other system setups are discussed.

• The VLSI architecture and implementation results for the first algorithm (symbol interleaving) are documented. Techniques for memory management are discussed that can reduce the complexity of the architecture, especially for the very long trees that are typical of this approach. It is shown that the BER performance of the joint detector and decoder is superior to that of conventional separate detector and decoder systems with comparable throughput and complexity. To the best of our knowledge, this is the first paper to present the algorithm and VLSI architecture to perform joint detection and decoding for MIMO systems using convolutional codes.

1549-8328/$31.00 © 2012 IEEE

The remainder of the paper is organized as follows. Section II reviews the system model and background of MIMO systems with convolutional error correction codes. Section III presents the suggested system setup and joint detection-decoding algorithms. Section IV discusses the generalizations of the algorithms for arbitrary system setups and Section V presents the simulation results for the bit error rate performance of the proposed algorithms. In addition, Section VI describes the VLSI architecture of a decoder based on the joint detection-decoding algorithm, while Section VII demonstrates the implementation results of the proposed decoder. Finally, the paper is concluded in Section VIII.

II. SYSTEM MODEL/BACKGROUND

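In a MIMO system with $N_T$ transmit antennas, $N_T$ consecutive modulation points are arranged in a transmit vector $\tilde{\mathbf{s}}$. If the number of receive antennas is given by $N_R$, the $N_R$-dimensional received signal vector $\tilde{\mathbf{y}}$ is given by

$$\tilde{\mathbf{y}} = \tilde{\mathbf{H}}\tilde{\mathbf{s}} + \tilde{\mathbf{n}} \tag{1}$$

where the $N_R \times N_T$ matrix $\tilde{\mathbf{H}}$ denotes the channel matrix. This equation represents a single-tap MIMO system. The elements of $\tilde{\mathbf{H}}$ are independent, complex Gaussian random variables with zero mean and unit variance. The vector $\tilde{\mathbf{n}}$ represents the additive white Gaussian noise with zero mean and variance $\sigma^2$. With a constellation size of $M$, the total number of bits contained in each transmitted vector is given by $N_T \log_2 M$. In this system model, the estimated transmitted ML signal vector $\hat{\mathbf{s}}_{\mathrm{ML}}$ is given by

$$\hat{\mathbf{s}}_{\mathrm{ML}} = \arg\min_{\tilde{\mathbf{s}} \in \Omega^{N_T}} \left\| \tilde{\mathbf{y}} - \tilde{\mathbf{H}}\tilde{\mathbf{s}} \right\|^2 \tag{2}$$

where $\Omega$ is the set containing the modulation points. The matrices in (1) can be transformed to their real matrix representation $\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}$, i.e.,

$$\begin{bmatrix} \Re(\tilde{\mathbf{y}}) \\ \Im(\tilde{\mathbf{y}}) \end{bmatrix} = \begin{bmatrix} \Re(\tilde{\mathbf{H}}) & -\Im(\tilde{\mathbf{H}}) \\ \Im(\tilde{\mathbf{H}}) & \Re(\tilde{\mathbf{H}}) \end{bmatrix} \begin{bmatrix} \Re(\tilde{\mathbf{s}}) \\ \Im(\tilde{\mathbf{s}}) \end{bmatrix} + \begin{bmatrix} \Re(\tilde{\mathbf{n}}) \\ \Im(\tilde{\mathbf{n}}) \end{bmatrix} \tag{3}$$

where $\Re(\cdot)$ and $\Im(\cdot)$ denote the real and imaginary parts of $(\cdot)$. Considering the QR-decomposed channel model, that is, $\mathbf{H} = \mathbf{Q}\mathbf{R}$, the norm term of (2) can be rewritten in a recursive process as follows:

$$\left\| \hat{\mathbf{y}} - \mathbf{R}\mathbf{s} \right\|^2 = \sum_{i=1}^{m} \Big| \hat{y}_i - \sum_{j=i}^{m} R_{ij} s_j \Big|^2 \tag{4}$$

where $\hat{y}_i$ represents the $i$th element of vector $\hat{\mathbf{y}} = \mathbf{Q}^T\mathbf{y}$, $R_{ij}$ denotes the $(i,j)$th element of the matrix $\mathbf{R}$, and $s_j$ is the $j$th element of vector $\mathbf{s}$. The revised problem of (3) can be represented by a tree structure with $m = 2N_T$ levels, and each node in the tree contains $\sqrt{M}$ child nodes. It is noted here that level $m$ represents the highest level of the tree; level $m-1$ represents the second highest level, etc. Therefore, the ML solution can be achieved by finding the path with the smallest path metric in the tree constructed by (4). With $\mathbf{s}^{(i)} = [s_i, s_{i+1}, \ldots, s_m]^T$ denoting the partial symbol vector from level $m$ down to level $i$, equation (4) can be further rewritten as

$$T_i(\mathbf{s}^{(i)}) = T_{i+1}(\mathbf{s}^{(i+1)}) + \left| e_i(\mathbf{s}^{(i)}) \right|^2 \tag{5}$$

and

$$\left| e_i(\mathbf{s}^{(i)}) \right|^2 = \left| b_{i+1}(\mathbf{s}^{(i+1)}) - R_{ii} s_i \right|^2 \tag{6}$$

where $T_i(\mathbf{s}^{(i)})$ is the accumulated path metric to level $i$, $T_{i+1}(\mathbf{s}^{(i+1)})$ is the accumulated path metric to level $i+1$, and $b_{i+1}(\mathbf{s}^{(i+1)})$, typically called the search center, is given by

$$b_{i+1}(\mathbf{s}^{(i+1)}) = \hat{y}_i - \sum_{j=i+1}^{m} R_{ij} s_j \tag{7}$$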
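The real decomposition and the level-by-level path metric recursion described above can be illustrated with a short sketch. This is a minimal pure-Python illustration, not the authors' implementation: the function names `real_decomposition` and `path_metric`, the 0-based indexing, and the numeric values of `R`, `s3`, and `y_hat` are all assumptions chosen only for the demonstration.

```python
def real_decomposition(H_c, y_c):
    """Build the real-valued model from a complex channel matrix and received vector.

    H_c is a list of rows of Python complex numbers, y_c a list of complex numbers.
    """
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H_c]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H_c]
    y = [v.real for v in y_c] + [v.imag for v in y_c]
    return top + bottom, y


def path_metric(R, y_hat, s):
    """Accumulate the path metric from the top tree level down to the last one.

    Indices are 0-based here, so tree level m corresponds to index m - 1.
    """
    m = len(s)
    T = 0.0
    for i in range(m - 1, -1, -1):
        # Search center: received value minus interference of already-fixed levels.
        b = y_hat[i] - sum(R[i][j] * s[j] for j in range(i + 1, m))
        # Branch increment for the symbol chosen at this level.
        T += (b - R[i][i] * s[i]) ** 2
    return T


# Real decomposition check on a 1 x 1 complex channel (noiseless).
H, y = real_decomposition([[1 + 2j]], [(1 + 2j) * (3 - 1j)])
s = [3.0, -1.0]  # real part, imaginary part of the transmitted point
reconstructed = [sum(H[r][c] * s[c] for c in range(2)) for r in range(2)]

# Path metric on a small upper-triangular R (arbitrary illustrative values).
R = [[2.0, 0.5, -1.0],
     [0.0, 1.5, 0.3],
     [0.0, 0.0, 1.2]]
s3 = [1.0, -3.0, 3.0]
y_hat = [0.4, -4.0, 3.5]
metric = path_metric(R, y_hat, s3)
direct = sum((y_hat[i] - sum(R[i][j] * s3[j] for j in range(3))) ** 2 for i in range(3))
```

The recursive metric agrees with the directly computed squared norm, which is the property a tree search relies on: each level adds one non-negative increment that depends only on symbols already chosen at higher levels.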

In soft output detection, a set of possible output paths are generated and the log likelihood ratio (LLR) values for each coded bit are computed accordingly. These values are then sent to the decoder to perform error correction. This might result in the MIMO detector generating bit sequences which are not valid codewords. In contrast, in this paper, the proposed approach is to identify the lattice points which are part of a valid codeword sequence closest to the received signal. This can be considered as performing detection (find the lattice point closest to the received signal) and decoding (find the valid codeword sequence) concurrently. This idea is illustrated in Fig. 1. In conventional sphere decoding, the entire sphere indicated in Fig. 1(a) is searched to find the closest point, whereas in the proposed technique [Fig. 1(b)], only a subset of points which are part of a valid code sequence are explored. This results in increasing the sparsity of the sphere around the received vector and thus increasing the reliability of the decoding. Furthermore, since there is no separate block for the decoding, computations for the LLR values of coded bits are no longer required.

Fig. 1. Difference between (a) conventional sphere decoding and (b) proposed idea (dark dots are valid points).

The idea of joint detection and decoding can be realized by tracking the state of the encoder at every level of the tree and then enumerating only the valid modulation points (corresponding to valid states) before proceeding to deeper levels. The means by which each coded bit is mapped to different levels of the tree (which correspond to the modulation points sent to specific transmit antennas) needs to be designed carefully, especially when interleaving is applied to guard against channel fades. Using an arbitrary interleaver structure could make it extremely problematic to track the encoder states while searching the tree. For example, the state of the bits at higher levels of the tree might depend on that of the bits at any lower level of the tree. In the paper, we present two possible interleaving structures that allow for a practical and realizable implementation of the system while providing significant gains. Algorithm I assumes no bit interleaver, such that the bit mappings (and the corresponding encoder states) follow the level-by-level sequence of the tree searching direction. Algorithm II employs a carefully designed bit interleaver such that the encoder states follow a tractable sequence during the tree search process. In subsequent sections of the paper, it will be shown that the BER performance of the proposed algorithms is better than that of the separate approach with a conventional row-in-column-out interleaver. It is important to note that the main intention of the paper is to demonstrate that the concept of joint detection and decoding does not only yield theoretical gains but is both realizable and practical. While the authors acknowledge that the interleavers used are not compliant with any specific standard, they are presented purely as a vehicle to demonstrate the gains achievable via joint detection and decoding.
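The state-tracking idea can be sketched for the reference code with generator polynomial [247, 371] and constraint length 8. The sketch below is an illustration under stated assumptions: the shift-register convention (newest bit in the most significant position), the Gray mapping in `PAM`, and the names `step` and `valid_children` are hypothetical and are not taken from the paper.

```python
G = (0o247, 0o371)  # generator polynomials (octal) of the reference rate 1/2 code
K_LEN = 8           # constraint length, so the encoder state has K_LEN - 1 = 7 bits

# Assumed Gray mapping of two coded bits to one real 4-PAM amplitude
# (the real or imaginary part of a 16-QAM point).
PAM = {(0, 0): -3, (0, 1): -1, (1, 1): 1, (1, 0): 3}


def step(state, bit):
    """One trellis transition: return (next_state, coded output bits)."""
    reg = (bit << (K_LEN - 1)) | state  # newest bit enters at the MSB (assumption)
    out = tuple(bin(reg & g).count("1") & 1 for g in G)  # parity of tapped bits
    return reg >> 1, out


def valid_children(state):
    """Enumerate only the amplitudes that a valid encoder transition can produce."""
    children = []
    for bit in (0, 1):
        nxt, out = step(state, bit)
        children.append((bit, nxt, PAM[out]))
    return children
```

With a rate 1/2 code and 16-QAM, each tree level carries two coded bits, so a node would have four children in a conventional search. Once the encoder state is known, only the two children returned by `valid_children` need to be expanded, which is the sparsity the text refers to.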

III. ALGORITHMS

Conventionally, in systems that use separate detection and decoding, each channel realization is processed independently to determine the transmitted modulation points. However, in the proposed joint detection-decoding scheme, the modulation points across channel realizations need to be considered together. To take advantage of this, it is proposed to stack the received signal vectors rotated by $\mathbf{Q}^T$ (as in (4)) across the $L$ channel realizations as follows:

$$\begin{bmatrix} \hat{\mathbf{y}}_1 \\ \vdots \\ \hat{\mathbf{y}}_L \end{bmatrix} = \begin{bmatrix} \mathbf{R}_1 & & \\ & \ddots & \\ & & \mathbf{R}_L \end{bmatrix} \begin{bmatrix} \mathbf{s}_1 \\ \vdots \\ \mathbf{s}_L \end{bmatrix} + \begin{bmatrix} \hat{\mathbf{n}}_1 \\ \vdots \\ \hat{\mathbf{n}}_L \end{bmatrix} \tag{8}$$

where $\hat{\mathbf{y}}_l = \mathbf{Q}_l^T \mathbf{y}_l$ and $\mathbf{H}_l = \mathbf{Q}_l \mathbf{R}_l$. As a result of this stacking, the dimension of the search tree becomes $mL$. For nominal values of $N$ (the length of the encoded bits), the value of $L$ is large and therefore the dimension of the search tree becomes extremely large. The objective here is to find the shortest path through this expanded tree which results in a valid codeword(s). We now present two algorithms to perform tree search for two sample system setups as follows. For both approaches, the K-best search algorithm [8] is used.
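To give a sense of how large the stacked tree becomes, the short sketch below computes the number of channel realizations and the resulting tree depth. It assumes a rate $1/n$ code with no termination bits and a real-valued decomposition with two tree levels per transmit antenna; the function name `tree_dimensions` and its parameters are hypothetical.

```python
def tree_dimensions(info_bits, rate_inv, n_tx, bits_per_symbol):
    """Return (channel realizations, stacked tree depth) for one code block.

    rate_inv is n for a rate 1/n code; bits_per_symbol is log2 of the
    constellation size (4 for 16-QAM).
    """
    coded_bits = info_bits * rate_inv
    bits_per_use = n_tx * bits_per_symbol  # coded bits carried by one transmit vector
    if coded_bits % bits_per_use:
        raise ValueError("block does not fill an integer number of channel uses")
    realizations = coded_bits // bits_per_use
    depth = 2 * n_tx * realizations        # two real-valued levels per antenna
    return realizations, depth


small = tree_dimensions(8192, 2, 4, 4)    # 4 x 4, 16-QAM, rate 1/2
large = tree_dimensions(65536, 2, 4, 4)
```

For the 4 × 4, 16-QAM, rate 1/2 reference system, 65 536 information bits give 8192 channel realizations, which matches the figure quoted in the simulation section, and a tree that is 65 536 levels deep.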

A. Algorithm I

1) System Setup: In a system equipped with a convolutional encoder with rate $1/n$, a block of $N_I$ input information bits is encoded and $N = nN_I$ output encoded bits are generated. These encoded bits are collected in a vector $\mathbf{c}$. For the sake of clarity, we initially present the case where the modulation order and the output rate of the encoder are set such that successive levels of the resulting tree correspond to successive states of the encoder. Therefore, the modulation order of the system is given by $M = 2^{2n}$, where the factor of two appears due to real decomposition and hence the number of bits in each level of the tree is half of that in one modulation point. In Section IV, a technique to generalize the proposed algorithm for arbitrary code rates and modulation orders is discussed. This setup ensures that each level of the tree will correspond to the different states of the encoder. The total number of channel realizations required to transmit these bits is given by $L = N/(2nN_T)$.

The transmit vector after real decomposition during the $l$th channel realization is denoted by $\mathbf{s}_l$ and the transmit–receive relationship for this channel use can be written as

$$\mathbf{y}_l = \mathbf{H}_l \mathbf{s}_l + \mathbf{n}_l \tag{9}$$

where $\mathbf{H}_l$ and $\mathbf{n}_l$ represent the channel matrix and noise vector, respectively. For example, in a system using a rate 1/2 code, the modulation scheme is 16-QAM, and for systems using a rate 1/3 code, the modulation scheme of choice is 64-QAM. The output bits are partitioned into $N/n$ vectors, each of length $n$ bits. The $k$th such vector is denoted by $\mathbf{c}_k$. The vector $\mathbf{c}_k$ is mapped to the transmission vector element $s_{l,i}$ as follows:

$$s_{l,i} = \mathcal{M}_i(\mathbf{c}_k) \tag{10}$$

where

$$l = L - \left\lfloor \frac{k-1}{m} \right\rfloor \tag{11}$$

and

$$i = m - \big((k-1) \bmod m\big) \tag{12}$$

The function $\mathcal{M}_i(\cdot)$ is the modulator, which simply outputs the real or imaginary part of the constellation points corresponding to $\mathbf{c}_k$ depending on $i$. Thus, the bits transmitted in the $m$th level of the last transmit vector $\mathbf{s}_L$ correspond to the first transition of the encoder, the $(m-1)$th level bits of $\mathbf{s}_L$ correspond to the next transition of states, and so on.

The structure of this system setup is illustrated in Fig. 2(a), with an example of a possible bit mapping for a given trellis structure shown in Fig. 2(b).

Fig. 2. (a) System structure for Algorithm I and (b) an example of bit mapping for one input bit.

2) Algorithm: In this system setup, the algorithm to perform joint detection-decoding is presented as follows.

Initialize: Set the level of the tree, $i = mL$, to the root node. Let the state $S_k$ of the $k$th survivor path be initialized to zero for all $k$.

1) For a given level $i$, set $k = 1$ and perform:

2) For survivor path $k$ perform:

3) Given state $S_k$, calculate the valid states for level $i$ from the code trellis for each of the valid input bit sequences.

4) Calculate the path metric for each of the valid modulation points.

5) If $k = K$, set $k = 1$ and go to step 6; else set $k = k + 1$ and go to step 2.

6) Find the set of $K$ best survivors in level $i$. For each survivor, store the corresponding information bit which gave rise to the modulation point. Also update the state variables $S_k$.

7) If $i = 1$, output the information bits corresponding to the shortest path; else set $i = i - 1$ and go to step 1.

The idea of Algorithm I is illustrated in Fig. 3. It should be noted that the trellis displayed is only for representational purposes and the algorithm presented is agnostic to the trellis structure.

Fig. 3. Decoding approach for Algorithm I.
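The steps of Algorithm I can be sketched as a K-best search in which every survivor carries its encoder state and its decoded information bits. This is a simplified illustration under several assumptions, not the authors' implementation: it uses a small constraint-length-3 code with generators [5, 7] to keep the run short, an assumed Gray mapping `PAM`, randomly generated upper-triangular matrices standing in for the QR-preprocessed channel, and a noiseless channel. The names `step` and `joint_kbest` are hypothetical, and `blocks` is assumed to be listed in search order.

```python
import random

G, K_LEN = (0o5, 0o7), 3  # small rate 1/2 code, used only to keep the demo fast
PAM = {(0, 0): -3.0, (0, 1): -1.0, (1, 1): 1.0, (1, 0): 3.0}  # assumed Gray map


def step(state, bit):
    """One trellis transition: return (next_state, coded output bits)."""
    reg = (bit << (K_LEN - 1)) | state
    out = tuple(bin(reg & g).count("1") & 1 for g in G)
    return reg >> 1, out


def joint_kbest(blocks, k_best):
    """K-best search over stacked channel realizations with encoder-state tracking.

    blocks is a list of (R, y_hat) pairs in search order. Each survivor is
    (path metric, encoder state, information bits, amplitudes in current block).
    """
    survivors = [(0.0, 0, [], [])]
    for R, y_hat in blocks:
        m = len(y_hat)
        # R is block diagonal across realizations, so interference restarts here.
        survivors = [(t, st, bits, []) for t, st, bits, _ in survivors]
        for i in range(m - 1, -1, -1):  # highest level first
            candidates = []
            for t, st, bits, amps in survivors:
                # amps[idx] holds the amplitude chosen for level m - 1 - idx.
                interference = sum(R[i][m - 1 - idx] * a for idx, a in enumerate(amps))
                for bit in (0, 1):  # expand valid trellis transitions only
                    nxt, out = step(st, bit)
                    a = PAM[out]
                    e = y_hat[i] - interference - R[i][i] * a
                    candidates.append((t + e * e, nxt, bits + [bit], amps + [a]))
            candidates.sort(key=lambda c: c[0])
            survivors = candidates[:k_best]  # keep the K best paths
    return survivors[0][2]


random.seed(1)
m, L = 4, 6
info = [random.randint(0, 1) for _ in range(m * L)]
state, blocks = 0, []
for _ in range(L):
    s = [0.0] * m
    for i in range(m - 1, -1, -1):  # one trellis transition per tree level
        state, out = step(state, info[len(blocks) * m + (m - 1 - i)])
        s[i] = PAM[out]
    R = [[random.uniform(0.5, 1.5) if j == i else random.gauss(0, 0.3) if j > i else 0.0
          for j in range(m)] for i in range(m)]
    y_hat = [sum(R[i][j] * s[j] for j in range(m)) for i in range(m)]  # noiseless
    blocks.append((R, y_hat))

decoded = joint_kbest(blocks, k_best=8)
```

In this noiseless setting the transmitted path keeps a path metric of essentially zero at every level, so it is never pruned and the information bits are recovered directly from the best survivor, with no separate Viterbi stage.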

Fig. 4. System setup for Algorithm II.

B. Algorithm II

1) System Setup: In this setup, the idea is to place different codes on the real and imaginary components of the signal from each antenna across channel realizations. If an error is made in decoding a bit at a higher level of the tree, the probability that a decoding error is made at a lower level (in the same channel realization) is reduced. Thus, this can be thought of as a bit interleaver.

In this architecture, the input bitstream of length is divided into sections of length . Each of these sections is encoded by one of independent convolutional encoders. The output encoded bits of the th encoder is denoted by the vector of length . Furthermore, the th blocks of consecutive bits of vector are denoted by . Mapping in this system is performed as follows:

(13)

where

(14)

and

(15)

The structure of this setup is illustrated in Fig. 4. Thus, successive channel uses of the th elements of correspond to different states of one of the convolutional encoders.

2) Algorithm: The algorithm is presented as follows.


Fig. 5. Decoding approach for Algorithm II.

Initialize: Set the level of the tree, to the root node. Let the state of the th survivor path for level be initialized to zero for all .

1) For a given level , set and perform:

2) For survivor path perform:

3) Given state , calculate the valid states for level from the code trellis for each of the valid input bit sequences. If , .

4) Calculate the path metric for each of the valid modulation points.

5) If , set , go to step 6; else set , go to step 2.

6) Find the set of best survivors in level . For each survivor, store the corresponding information bit which gave rise to the modulation point. Also update the state variables .

7) If , output the information bits corresponding to the shortest path; else , go to step 1.

The idea of this algorithm is illustrated in Fig. 5.
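The per-level state bookkeeping that distinguishes Algorithm II can be sketched as follows. The sketch assumes one independent encoder per real-valued tree level (the simulation section mentions 8 information bit sequences for the 4 × 4 system, which is consistent with this reading), assumes that the transmit vectors are visited in encoding order, and uses a small [5, 7] code and an assumed Gray mapping. The names `transmit_vectors` and `walk` are hypothetical, and `walk` is a noiseless greedy pass, not the K-best search itself.

```python
G, K_LEN = (0o5, 0o7), 3  # small rate 1/2 code, for illustration only
PAM = {(0, 0): -3.0, (0, 1): -1.0, (1, 1): 1.0, (1, 0): 3.0}  # assumed Gray map


def step(state, bit):
    """One trellis transition: return (next_state, coded output bits)."""
    reg = (bit << (K_LEN - 1)) | state
    out = tuple(bin(reg & g).count("1") & 1 for g in G)
    return reg >> 1, out


def transmit_vectors(sections):
    """Encode each section with its own encoder; element i of every vector
    comes from encoder i."""
    m, steps = len(sections), len(sections[0])
    states = [0] * m
    vectors = []
    for l in range(steps):
        vec = []
        for i in range(m):
            states[i], out = step(states[i], sections[i][l])
            vec.append(PAM[out])
        vectors.append(vec)
    return vectors


def walk(vectors):
    """Traverse the tree level by level, keeping one encoder state per level."""
    m = len(vectors[0])
    states = [0] * m  # one state variable per tree level, all initialized to zero
    decoded = [[] for _ in range(m)]
    for vec in vectors:
        for i in range(m - 1, -1, -1):  # highest level first
            for bit in (0, 1):
                nxt, out = step(states[i], bit)
                if PAM[out] == vec[i]:  # the valid transition that matches
                    states[i] = nxt
                    decoded[i].append(bit)
                    break
    return decoded


sections = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 1]]
recovered = walk(transmit_vectors(sections))
```

The point of the sketch is the shape of the survivor record: instead of a single encoder state, each path carries a vector of states, and visiting level `i` advances only the `i`th entry.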

IV. GENERALIZATION TO ARBITRARY SYSTEM SETUPS

A. Generalization to Different System Dimensions

In the previous sections, it was assumed that the output rate of the encoder is equal to the number of bits (or modulation order) per level of the tree after real decomposition. In this section, the proposed techniques are generalized for arbitrary modulation orders, code rates and channel ranks. The central concept is that if there is an overflow of bits (the number of bits from the encoder cannot fit in one level of the tree) or underflow of bits (the number of bits from the encoder is less than what can fit in one level of the tree), multiple levels of the tree must be searched. This is because in either case, each level of the tree no longer corresponds to one encoder state transition. Thus, the number of levels to be considered is the minimum number of levels which leads to a valid convolutional encoder state. For an arbitrary modulation order $M$, the number of bits that are transmitted per level is given by

$$b = \tfrac{1}{2} \log_2 M \tag{16}$$

Therefore, the number of bits that must be considered together to decide a valid state is given by

$$B = \mathrm{lcm}(b, n) \tag{17}$$

where $\mathrm{lcm}(b, n)$ stands for the least common multiple of $b$ and $n$, and $n$ is the number of coded bits produced per trellis stage (rate $1/n$). Thus, the number of levels that must be lumped together for searching is given by

$$N_{\mathrm{lev}} = B / b \tag{18}$$

To determine the output state, the number of trellis stages that must be considered is given by

$$N_{\mathrm{st}} = B / n \tag{19}$$

The set of possible output states are the combinations of all valid states after $N_{\mathrm{st}}$ trellis stages. The path metric must be updated by adding the partial Euclidean distance for each of the lumped tree levels to the parent path metric. For example, for a system with 64-QAM using a rate 1/2 code, the number of bits that are transmitted per level is $b = 3$. The number of bits that must be considered together to decide a valid state is $B = \mathrm{lcm}(3, 2) = 6$. The number of levels that must be lumped together for searching is $N_{\mathrm{lev}} = 6/3 = 2$. Finally, to determine the output state, the number of trellis stages that must be considered is $N_{\mathrm{st}} = 6/2 = 3$. The other issue that must be taken into consideration is the situation where the total number of tree levels is not an integer multiple of $N_{\mathrm{lev}}$. In this case the last set of information bits that are sent into the encoder could be made zero. The number of such zero information bits is given by

(20)

where $\mathrm{mod}(a, b)$ is the modulus operator, which gives the remainder of the division of $a$ by $b$. Thus, for the last stage of search, the search space is further constrained by the knowledge that the last set of information bits are zero.
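The lumping rule can be checked numerically with a few lines of code. This is a sketch under the assumption of a square QAM constellation and a rate $1/n$ code; the function name `lumping` and the order of the returned tuple are illustrative choices.

```python
from math import gcd


def lumping(mod_order, rate_inv):
    """Return (bits per level, bits considered together, levels lumped, trellis stages).

    mod_order is the QAM constellation size (a power of 4); rate_inv is n for
    a rate 1/n code.
    """
    bits_per_symbol = mod_order.bit_length() - 1  # log2 of a power of two
    if 1 << bits_per_symbol != mod_order or bits_per_symbol % 2:
        raise ValueError("expected a square QAM constellation size")
    bits_per_level = bits_per_symbol // 2         # real decomposition halves the bits
    bits_together = bits_per_level * rate_inv // gcd(bits_per_level, rate_inv)  # lcm
    return (bits_per_level,
            bits_together,
            bits_together // bits_per_level,      # tree levels searched as one unit
            bits_together // rate_inv)            # trellis stages per unit


example = lumping(64, 2)   # 64-QAM with a rate 1/2 code
matched = lumping(16, 2)   # 16-QAM with a rate 1/2 code
```

For 64-QAM with a rate 1/2 code this gives 3 bits per level, 6 bits considered together, 2 lumped levels and 3 trellis stages. For 16-QAM with a rate 1/2 code every quantity collapses to the one-level, one-stage case assumed by Algorithm I.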

B. Symbol Interleaver

In the system setup for Algorithm I, no bit interleaver is employed; however, symbol interleaving can be used instead. To achieve symbol interleaving, two consecutive symbol vectors corresponding to one codeword must be transmitted such that they are separated by a period which allows the channel realizations to be sufficiently decorrelated. This is illustrated in Fig. 6. If the th coded bits block of length is denoted by , the corresponding transmit symbol vectors are . As seen from the figure, two consecutive transmit symbols , are temporally separated by a period equal to the decorrelation time. In the figure, it is assumed that and are separated in time sufficiently such that they can be considered uncorrelated. The intervening channel realizations can be used to transmit symbol vectors belonging to other codewords. Therefore, a system designer would interleave the symbol vectors. Accordingly, the receiver is equipped with an appropriate symbol de-interleaver. While the latency of the decoding process is increased, the throughput and performance remain unaffected.

Fig. 6. Two consecutive symbol vectors corresponding to one codeword are separated by a period greater than the decorrelation time.

Algorithm II does have a bit interleaver and so the level of protection is higher than in Algorithm I. However, the degree of independence between bits is not as high as that achieved by conventional bit interleavers such as the row-in-column-out or random interleaver. To further guard against performance degradation, the same symbol interleaver structure as described above could also be used.
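One simple way to realize the symbol interleaving described in this subsection is a round-robin schedule over several codewords. The sketch below is an assumption for illustration, not the interleaver used by the authors: `interleave`, `deinterleave`, and the string labels standing in for symbol vectors are hypothetical.

```python
def interleave(codewords):
    """Send one symbol vector from each codeword in turn (round-robin)."""
    depth, length = len(codewords), len(codewords[0])
    return [codewords[c][v] for v in range(length) for c in range(depth)]


def deinterleave(stream, depth):
    """Regroup the received stream into per-codeword sequences."""
    return [stream[c::depth] for c in range(depth)]


# Three codewords of four symbol vectors each; strings stand in for the vectors.
codewords = [[f"cw{c}_s{v}" for v in range(4)] for c in range(3)]
stream = interleave(codewords)
gap = stream.index("cw0_s1") - stream.index("cw0_s0")
restored = deinterleave(stream, 3)
```

With a depth of three codewords, consecutive symbol vectors of the same codeword are three channel uses apart. In a real system the depth would be chosen so that this gap exceeds the channel decorrelation time, at the cost of the added latency noted in the text.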

C. Turbo-Coded Systems

One of the important reasons why the proposed technique works well in the case of convolutional codes is that the output of a convolutional encoder is dependent purely on the current state of the encoder and the input bit. Due to the design of the interleavers presented previously, it is possible to ensure that the state of the encoder at any tree level is dependent on the information bits that have been decoded in previous tree levels, or is at least a function of one of the survivor paths. However, this design approach does not carry over in a straightforward manner to turbo-coded systems. For example, in a turbo encoder with two parallel convolutional encoders, the input to the second encoder is an interleaved version of the input to the first encoder. Thus, while traversing the tree, keeping track of the states of both encoders is difficult, as the current state of either encoder at a particular tree level could be dependent on information bits that the sphere decoder has not decoded or encountered yet. In other words, there are two encoder states to be tracked that are not independent of each other, which creates a challenge in finding the right compromise between the required control logic and state storage requirements versus design complexity. An alternative approach is to design an interleaver where this problem is either eliminated or mitigated. Both of these techniques are currently being actively investigated by the authors.

V. SIMULATION RESULTS

A. Results

We consider the case study of a 4 × 4 16-QAM MIMO system with a code rate of 1/2 and a generator polynomial [247, 371] with a constraint length of 8. The preprocessor is assumed to be a regular QR decomposition, where each channel use is assumed to be independent. In this section, we quantify the BER performance for both of the proposed algorithms and compare it to conventional separate soft and hard decision decoding.

Fig. 7. BER performance of the proposed algorithm along with conventional separate decoding.

Fig. 7 shows the BER performance of the proposed algorithms along with conventional separate decoding. For the hard and soft decision cases, a row-in-column-out bit interleaver was used. The value of $K$ associated with the K-best search was chosen to be 512. From the graph, it can be seen that all the systems with joint detection and decoding schemes perform better than conventional decoding with no iteration for practical signal-to-noise ratio (SNR) values. At a BER of , the difference between Algorithm II with 8192 bits and soft decision decoding is approximately 3 dB. Also, at a BER of , the difference between Algorithm I with 8192 bits and soft decision decoding is approximately 2.5 dB. However, for lower values of SNR, it can be seen that the performance of Algorithm I is superior to that of Algorithm II. This is because the number of information bits in Algorithm I is higher (8192) as compared to each of the 8 information bit sequences in Algorithm II of length 1024 (as described in Section III). As the SNR increases, this factor is no longer an important contributor to performance, and instead, the presence of a bit interleaver in Algorithm II improves its performance. Furthermore, the slope of Algorithm I's curve is low, indicating the impact on performance due to the lack of a bit interleaver. As the simulation results show, the expected drop in performance in the proposed Algorithm I due to the lack of a conventional bit interleaver is more than compensated by the increase in sparseness of the search space introduced by these algorithms. Fig. 7 also shows that the BER performance of the disjoint scheme can be greatly improved via multiple iterations of the detection-decoding process. However, it is important to notice that such an iterative process also incurs extra hardware complexity as well as delay and throughput degradation. To be specific, a soft-input-soft-output (SISO) decoder such as BCJR or SOVA is needed, which is much more complicated than the Viterbi decoder. In addition, performing interleaving as well as de-interleaving between MIMO detection and SISO decoding leads to significant area and power consumption and increased latency.

Furthermore, the Algorithm II case with 65 536 information bits (8192 channel realizations) is also plotted. This curve is plotted to quantify the performance of this algorithm if 8192 information bits are present in each of the convolutional codes, the same as in the Algorithm I case. While for the low and mid SNR regions the 8192 bits cases (Algorithm II) outperform this curve, it can be seen that for higher SNRs, the slope of this curve is higher. In fact, at high SNRs, the 65 536 bits case outperforms both the 8192 bits cases. Furthermore, the slope generated by this algorithm is steeper than the slope for the conventional techniques. The worse performance in the low and mid SNR regions can be explained by the fact that the tree depth becomes large and thus requires a higher $K$ to provide performance that is competitive with Algorithm I. The better performance in the high SNR regions is due to the fact that the length of the convolutional codes is larger.

Fig. 8. Difference in BER due to change in the value of $K$.

B. Different Values of K

Fig. 8 illustrates the difference in BER due to change in the value of K. For systems using Algorithm II, going from to results in an increase of 2 dB in the SNR required to achieve a BER of . However, it should be noted that systems with lower K have far less complexity than those with higher K. For Algorithm I, as the value of K is increased from to , the performance in the mid and low SNR regions improves marginally, while the performance in the high SNR region is practically the same.
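The dependence of complexity on K can be seen in a generic K-best level update. The following is a minimal sketch and not the WPE architecture of [22]: the names `k_best_level` and `expand` are hypothetical, and path metrics are assumed to accumulate additively. Every retained survivor is expanded at each level, so the work per level grows with K.

```python
import heapq

def k_best_level(survivors, expand, k):
    """Advance a K-best tree search by one level.

    survivors: list of (path_metric, path) tuples, path being a tuple of symbols.
    expand:    hypothetical callback returning (metric_increment, symbol) pairs
               for the valid children of a given path.
    k:         number of survivors retained per level.
    """
    candidates = [
        (metric + increment, path + (symbol,))
        for metric, path in survivors
        for increment, symbol in expand(path)
    ]
    # Keep the k candidates with the smallest accumulated path metric.
    return heapq.nsmallest(k, candidates, key=lambda c: c[0])


# Toy usage: two children per node, mirroring the two valid successors
# of each encoder state in the joint scheme.
def toy_expand(path):
    return [(0.5, 0), (1.0 + len(path), 1)]

level = [(0.0, ())]
for _ in range(3):
    level = k_best_level(level, toy_expand, k=2)
```

With K = 2 the toy search keeps two paths per level; raising K widens the candidate list proportionally, which is the complexity cost discussed above.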

Fig. 9. BER of the proposed Algorithm I for a codeword length of 512 bits.

C. Shorter Codewords

Fig. 9 documents the performance of the proposed Algorithm I for a codeword length of 512 bits, which is a common codeword length. The performance of this system setup is illustrated here to describe the performance of the system that will be implemented in Section VII. A codeword length of 512 bits corresponds to 64 channel realizations assuming a 4 × 4 16-QAM system. The trends presented here are largely the same as those presented in Fig. 7. As a byproduct of having a shorter codeword length, the K used in the K-best search can be reduced. The proposed technique in systems with outperforms the conventional soft decoder with by 2 dB at a BER of . It is noted that the BER performance of the conventional soft decoder degrades by 1 dB when K is reduced from 512 to 8. Also in systems with , the proposed algorithm outperforms the conventional system by about 2 dB at a BER of . The BER of this system is clearly worse than the BER of the systems in Fig. 7 at comparable mid to high SNRs. This is because the length of the codewords in this system is shorter.
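The relation between codeword length and channel realizations can be checked from the system parameters. In the short calculation below, the symbols $N_T$ (transmit antennas), $M$ (constellation size), and $R$ (code rate) are introduced here for illustration, and the codeword length is counted in information bits, which is the reading consistent with the 64 realizations quoted above.

```latex
\begin{align*}
b_{\text{info}} &= N_T \cdot \log_2 M \cdot R = 4 \cdot 4 \cdot \tfrac{1}{2}
  = 8 \ \text{information bits per channel realization},\\[2pt]
\frac{512}{8} &= 64, \qquad \frac{8192}{8} = 1024, \qquad \frac{65\,536}{8} = 8192
  \ \text{channel realizations}.
\end{align*}
```

The last value matches the 8192 channel realizations quoted for the 65 536-bit Algorithm II case in Section V-A.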

VI. SYSTEM ARCHITECTURE

Based on the proposed algorithm, a MIMO decoder that can perform joint detection and decoding was designed and synthesized in 65 nm CMOS technology. This section discusses the system architecture of the proposed MIMO decoder. A high-level overview of the decoder structure and its operation schemes are introduced. This architecture is based on the pipelined WPE K-Best decoder [22]. The primary design challenge in adapting this structure to the proposed architecture relates to the large number of tree levels that need to be traversed and the resulting large memory requirement. The proposed methods for addressing these issues to achieve an efficient design are given. Moreover, performance-complexity tradeoffs are discussed.

A. High-Level Overview of the Architecture

The pipelined WPE K-Best decoder enumerates the tree nodes at each level in ascending order of their path metrics. This structure reduces the hardware implementation overhead for data sorting and, moreover, the pipelined scheme can maximize the achievable decoding throughput. Interested readers can find a detailed description of this architecture in the authors’ previous work [22]. In order to jointly perform detection and decoding, the required modifications to the conventional K-Best decoder structure are introduced as follows.

Fig. 10. System architecture of the proposed joint MIMO decoder.

• An additional module is required to determine, based on the current state of the encoder, the valid states for the next level of the tree and the corresponding input/output bits. From the output bits, the valid modulation points that need to be processed in the next tree level are determined. These operations can be realized through simple lookup tables and combinational logic. The architecture of the modified K-Best decoder is depicted in Fig. 10. As shown in this figure, at any level of the tree, once a winner (survivor) is identified, its corresponding encoder state will be read out from the dedicated registers ( in Fig. 10). This serves as the input state to the block, in which the two possible output states and the corresponding input/output bits are identified. The two sets of possible output bits are then converted to the modulation points (tree nodes) and their path metrics are computed. These two nodes are the valid child nodes of the given winner and will be saved to the dedicated registers ( in Fig. 10). Furthermore, the path metrics, information bits, and the output states associated with these two valid child nodes will also be saved into registers ( , , and , respectively, in Fig. 10). This process will be performed iteratively until K survivors are determined, and the whole decoding process is finished when the bottom level of the tree is reached.

• Another aspect that needs modification from the conventional design is the storage of intermediate results and the decoder outputs. In this joint decoder, the state of the survivor nodes for one level of the tree must be stored. Thus, a total of K states are saved. Moreover, information bits (associated with survivors) need to be stored and carried through the enumeration into deeper levels. It should also be noted that, despite the stacked channel realization, the modulation points (tree nodes) of only levels (and not every level of the tree) need to be stored as the required history to compute the path metrics. This is because of the upper-triangular feature of the stacked channel matrix, where only the channel coefficients of the current channel realization (and not all the stacked channel coefficients) are used. Moreover, the outputs of this decoder are the information bits themselves, and no longer the modulation points.
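The lookup step described in the first bullet can be sketched in software. The following is a minimal illustration, assuming that the generators [247, 371] are given in octal, that the newest input bit occupies the most significant bit of the 8-bit encoder register, and that the state consists of the 7 most recent past inputs. The bit-ordering convention and the name `build_trellis` are assumptions made here and are not taken from the paper.

```python
G = (0o247, 0o371)      # generator polynomials, assumed octal
CONSTRAINT_LEN = 8      # constraint length from the case study
NUM_STATES = 1 << (CONSTRAINT_LEN - 1)  # 128 encoder states

def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1

def build_trellis():
    """Map (state, input_bit) -> (next_state, (out0, out1))."""
    table = {}
    for state in range(NUM_STATES):
        for bit in (0, 1):
            # Shift register contents: newest bit first (assumed convention).
            reg = (bit << (CONSTRAINT_LEN - 1)) | state
            outputs = tuple(parity(reg & g) for g in G)
            next_state = reg >> 1  # drop the oldest bit
            table[(state, bit)] = (next_state, outputs)
    return table

trellis = build_trellis()
```

Each state has exactly two outgoing transitions, which is why each winner contributes only two valid child nodes. The mapping from output bits to modulation points depends on the bit mapping used by each algorithm and is omitted here.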

B. Memory Organization

1) Challenges: In this design, there are two major challenges with respect to memory organization that need to be addressed. They are described as follows.

• The first issue is with the tracking of the information bits and their associated path history (i.e., list of parent nodes for each survivor). Particularly, the parent history list for a survivor node might be completely different at a particular level as compared to the previous level. In previous designs [22], this was handled by rearranging the parent history list along with the nodes, such that the node and its associated list with the least weight were in a predetermined location, and the node and associated list with the second least weight were in another predetermined location and so on. While this is an acceptable method for designs with a small number of tree levels, this approach is untenable in this case due to the large tree depth where a large amount of overhead will be spent in reorganizing the memory.

• The second issue is that of memory usage. Due to the large number of channel uses that must be processed before a result is obtained, the storage of a large number of survivors and their history is a significant contributor to the area of the design. Therefore, to achieve an efficient design an effective way of dealing with this complexity is paramount.

2) Alternative Architecture for History-Tracking: To tackle the problem of tracking survivors and their history, the following technique is proposed. Instead of reorganizing the entire history list, the address of the parent node for that particular survivor node is stored in a memory block separate from the information-bits memory block. Using a technique inspired by linked lists in computer data architectures, the entire list can be rebuilt by iteratively using these addresses from this new memory block. Thus, the new memory block serves as a list of “pointers.” However, this technique requires a large amount of memory (precisely bits) for each tree level. Thus, as a tradeoff, a two-stage structure was employed. In the following, this two-stage structure of memory organization is described in detail.

Fig. 11. Architecture of win_infobits_mems and pstore.

• Stage 1: For each of the levels (single-channel realization), register banks are used to store the information bits. Once a winner is identified, its corresponding information bit will be stored into one location within this register bank. In addition, all the winner bits and the bits on their corresponding parent history paths will be stored in the register location that is indexed by the same number. For example, the current bit and all the parent bits for the first winner will be stored in the location that is indexed as 1 and those for the second winner will be indexed as 2, etc. Thus, within every single channel realization, one index number can indicate information bits (which are associated with the levels for that channel realization). This requires an update of all the current and previous registers according to the newest winner path.

• Stage 2: Once one channel realization is processed, the entire bit stream for all the levels (thus, bits) will be stored to a memory block, in which each single line is associated with one channel realization. Therefore, the depth of this memory block is equal to the number of stacked channel realizations and the width is equal to bits. Another memory block is employed for indicating the connections of winner paths between two channel realizations (i.e., two memory lines). The architecture of the proposed memory organization is depicted in Fig. 11, in which win_infobits_mems represents the memory block for storing information bits and pstore is the memory block for indicating the connections. In this figure, the indicates one line in win_infobits_mems of channel realization and indicates the line of channel realization . The indicates one line in the pstore memory representing the connections between and . For example, as depicted in Fig. 11, the first location of (2 in this case) indicates that the first winner path for the line of in the memory comes from the second winner path for the line of , and the second location of (1 in this case) indicates that the second survivor path for the line of comes from the first survivor path for the line of . The contents of pstore will be updated when the processing of each level within the is finished, in which the winners and their corresponding parent paths are determined. The update logic for the pstore can also be seen from Fig. 11. Once the bottom level of the tree is reached, the content of the pstore will be used as a trace-back reference where all the memory locations and the corresponding bit stream on the final winner path should be sent out as the decoder output.

3) Techniques to Reduce Memory Usage: It can be seen that the win_infobits_mems and the pstore memories (and the associated logic circuits) occupy a significant portion of the decoder. Therefore, it is essential to reduce the overhead of these components to further reduce the complexity of the decoder. In this context, the following observations can be made and applied. In K-Best decoding, all the survivor paths after a certain level tend to have the same parent path history. In other words, all the survivor tree branches have merged at that level. An example is shown in Fig. 12, where all the survivor paths in the channel realization come from the same survivor path (survivor path 1) in the channel realization . In such a case, it can be said that all the survivor paths are merged at the level . This implies that when a merge happens, the results corresponding to the channel realization where the merge occurred can be sent out immediately. This is because, no matter how the decoder proceeds down the depth of the tree, the information bits until that level are precisely the same for all survivors. Therefore, the memory utilized for all the channel realizations at and before the one where the merge occurred can be reused by the rest of the channel realizations. In other words, the total number of required memory locations can actually be smaller than the total number of channel realizations, and one line in this memory can be shared between multiple channel realizations.

Based on the aforementioned observation, in this design, the size of win_infobits_mems and the pstore is set to store only a part of the total channel realizations. While a pstore is updated, the merge check operation will be performed and, once a merge occurs, the corresponding information bits for the merged channel realization will be sent out. However, it should be noted that the channel realization at which the merge occurs is random and is a function of the individual channel realizations. As a result, the number of pstore locations that need to be updated is also random. Assume that the number of clock cycles to process one level of the tree is . If the number of clock cycles required to update the pstore locations is smaller than , the decoder can proceed as in the typical case and there is no extra cycle overhead.
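The merge check can be modeled in a few lines. This sketch assumes that `pstore_rows` is a list of pointer rows, oldest first, in which each row maps a survivor index in one channel realization to its parent survivor index in the previous one (0-based). The names are hypothetical, and the hardware may check only adjacent realizations rather than tracing back through the whole stored window as done here.

```python
def find_merge(pstore_rows, num_survivors):
    """Return (realization, survivor) where all current survivor paths merge.

    pstore_rows[r][i] is the parent (in realization r - 1) of survivor i in
    realization r; pstore_rows[0] is unused. Returns None if the paths have
    not merged within the stored window.
    """
    alive = set(range(num_survivors))   # survivors of the newest realization
    for r in range(len(pstore_rows) - 1, 0, -1):
        alive = {pstore_rows[r][i] for i in alive}  # step to the parents
        if len(alive) == 1:
            # All paths pass through one survivor of realization r - 1, so
            # the bits up to and including r - 1 are final and can be flushed.
            return r - 1, next(iter(alive))
    return None


rows = [None, [0, 0, 1], [1, 1, 0], [2, 0, 0]]
merge = find_merge(rows, num_survivors=3)
```

Once a merge point is found, every memory line at or before it holds decided bits and can be released for later channel realizations, which is what allows the memory to be dimensioned for fewer lines than the codeword spans.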


Fig. 12. Idea of the path-merge at certain level of the tree.

Fig. 13. BER performance of the implementation result.

VII. IMPLEMENTATION RESULTS

A. Implementation Results

The proposed joint MIMO decoder was designed and synthesized to support a 4 × 4 16-QAM MIMO system incorporating the convolutional code. The code structure is the same as the one used in Section V. This design is targeted to process a system using Algorithm I with 512 bits (i.e., 64 channel realizations). The BER performance of such a system was presented in Fig. 9. The size of the memory is designed to accommodate 16 channel realizations (i.e., 1/4 of the total channel realizations). The impact of using less memory than required is quantified later in the paper. The design was first implemented on a Xilinx Virtex2-Pro FPGA and tested against fixed point models for functionality verification. All channel entries are represented using 7 bits for the integral part and 7 bits for the fractional part, and the path metrics are represented using 10 bits for the integral part and 5 bits for the fractional part. Fig. 13 shows the BER performance obtained by performing experiments on the FPGA. As illustrated by the figure, the experimental BER degrades marginally from the floating point simulated BER, which verifies our previous discussion that the performance loss due to memory overflow can be negligible.

Furthermore, the design was synthesized using TSMC standard CMOS cell libraries with 65-nm technology. Synopsys Design Compiler was used for synthesis and Cadence SOC Encounter was used for placement, routing, and layout. In order to reduce power consumption, automatic clock gating was used with a minimum bitwidth of 16. The RTL, post-synthesis, and post-layout simulations were compared against the fixed point models to ensure functionality. PrimeTime PX was used for static timing analysis and power consumption estimation from the post-layout results. The frequency of operation was set to 181 MHz. Estimates for a single decoder core in terms of latency, throughput, area, and power are reported in Table I. Area is reported in kilo gate equivalents (kGE) to normalize the difference in technology, where a single two-input NAND gate with drive strength of one is used as the reference gate. As shown in the table, the proposed architecture achieves an average throughput of approximately 18 Mbps across the tested SNR range with a total size of 46 kGE. As discussed in Section VI, the latency and corresponding decoding throughput vary with the SNR because the channel realization at which the merge occurs is random and is a function of the individual channel realizations. However, it should be noted that the observed variability is negligible and can be easily managed by system design techniques such as data buffering. Of the total area of the decoder, the memory occupies 13 kGE, which is approximately 28% of the design. To target a throughput higher than 200 Mbps, 12 instances of the decoder core are executed in parallel. To estimate the complexity, a module with 12 instantiated cores was synthesized and laid out. The complexity of such a system is summarized in Table II, where the area estimation is based on the results of layout and the power consumption is based on the post-layout simulations. It can be seen from Table II that complexity increases almost linearly, that is, 12×. This is due to the fact that 12 copies of the decoder core were implemented and executed independently in parallel. It is important to note that complexity (area and power) can be reduced by sharing common components at the expense of a more complicated control policy.
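The word lengths quoted in this subsection can be mimicked in a floating-point model when comparing against the fixed-point implementation. The following is a minimal sketch; the sign handling (a separate sign for channel entries, unsigned path metrics), round-to-nearest, and saturation are assumptions, since the text specifies only the integer and fractional bit counts. The name `quantize` is hypothetical.

```python
def quantize(value, int_bits, frac_bits, signed=True):
    """Round value to a fixed-point grid and saturate at the format limits."""
    scale = 1 << frac_bits
    max_code = (1 << (int_bits + frac_bits)) - 1
    min_code = -max_code - 1 if signed else 0
    code = int(round(value * scale))
    code = max(min_code, min(max_code, code))  # saturate instead of wrapping
    return code / scale

# Channel entries: 7 integer + 7 fractional bits (sign handling assumed).
h_q = quantize(0.7071, int_bits=7, frac_bits=7)
# Path metrics: 10 integer + 5 fractional bits, assumed unsigned.
pm_q = quantize(1500.3, int_bits=10, frac_bits=5, signed=False)
```

In this model a path metric above the representable range saturates at the largest code instead of wrapping around.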

B. Memory Complexity

Following the discussion in Section VI-B3, the size of the memory is reduced from supporting 64 channel realizations to one which needs to store only 16 channel realizations. To demonstrate the reduction in the complexity of memory, a scheme with conventional memory usage is also estimated, in which the memory is dimensioned to accommodate 64 channel realizations and no merge check operation is performed. The results and the comparison of the memory area are tabulated in Table III. As shown in this table, the memory area is reduced by about 33% when the partial memory usage with merge check is applied. It is also noted here that the slight increase in the size of logic for the partial memory usage case is due to the extra components required to perform the merge check. This also indicates that for systems utilizing a large number of channel realizations, the memory savings can potentially be even larger.

TABLE I. IMPLEMENTATION RESULTS OF THE PROPOSED K-BEST DECODER CORE

TABLE II. IMPLEMENTATION RESULTS OF THE PROPOSED K-BEST DECODER WITH 12 CORES

TABLE III. IMPLEMENTATION RESULTS OF THE PROPOSED K-BEST DECODER WITH CONVENTIONAL MEMORY

It should be noted that for Algorithm II, the states of levels must be maintained as opposed to just the state of the previous level in Algorithm I, thus further increasing the memory area and power requirements.

C. Comparisons With Conventional Schemes

Table IV further compares the performance and complexity of the proposed joint MIMO decoder and the conventional separate detection-decoding scheme. In the conventional scheme, the MIMO detector is assumed to be the K-Best decoder containing the same design entities as the design presented above, except that the blocks involved in identifying the encoder states, the storage elements used for the stacked channel structure, and the merge checking logic are not present. In other words, a regular K-Best detector is considered with only the required relevant blocks taken from the design presented above. Based on this approach, it was found that the size of the detector required to support the target throughput of 200 Mbps is 318 kGE. After MIMO detection, a Viterbi decoder is employed for recovering the information bits. Thus, the overall area accounts for both the MIMO detector and the Viterbi decoder. The size of the Viterbi decoder is obtained from the implementation results of the “TB” decoder reported in [23], which runs at a throughput of 200 Mbps. Thus, the total size of a system using separate detection and decoding is approximately 514 kGE. This is comparable to the size of the joint detector and decoder, whose total size is 558 kGE, a difference of 8.5%.

TABLE IV. AREA OF CONVENTIONAL SEPARATE MIMO RECEIVERS IN kGE

The proposed design is also compared with the work presented in [25]. The system proposed there is a 64-state Viterbi decoder capable of operating at 54 Mbps. Assuming a doubling in size for a doubling in the number of states and scaling for the speed, an equivalent design size of 362 kGE is obtained. The system size then becomes 650 kGE. The difference in size of the total system then becomes 21% in favor of the proposed technique. The next comparison is with the “RE” decoder using relaxed adaptive Viterbi presented in [23]. The size of the Viterbi decoder was reported to be 108.1 kGE. This results in a total system size of 426.1 kGE, a difference of 31%.

D. Comparisons With Prior Work

In the preceding section, we utilized a MIMO decoder that was previously designed by the authors and used that to provide common ground when comparing to other convolution decoder implementations. Since, to the best of the authors’ knowledge, there is no prior architecture that presents a joint (non-iterative) detection and decoding scheme, in this section, we expand the comparison to prior work, by normalizing the throughput to that of the proposed structure (216.9 Mbps) and scaling the power and area accordingly. It has been shown in Tables I and II that linear scaling provides fairly close approximations given that the exact copies of identical cores are assumed. In order to consider the difference in supply voltage and technology, the power is scaled in a manner similar to that presented in [24].

(21)
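The linear area normalization used in these comparisons can be written as a short helper. This is a sketch that assumes area scales in direct proportion to throughput, as argued from Tables I and II; it does not reproduce the power scaling of (21), and the function name `scale_area` is hypothetical.

```python
TARGET_MBPS = 216.9  # throughput of the proposed 12-core design

def scale_area(area_kge, throughput_mbps, target_mbps=TARGET_MBPS):
    """Linearly scale an area figure to a target throughput."""
    return area_kge * target_mbps / throughput_mbps

# Sanity check with the single-core figures quoted in Section VII-A:
# 46 kGE at roughly 18 Mbps.
twelve_core_estimate = scale_area(46.0, 18.0)
```

The resulting estimate of about 554 kGE is within roughly 1% of the 558 kGE quoted for the joint detector and decoder, which supports using the same linear rule for the prior designs.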

Specifically, in [16], a K-Best decoder for 4 × 4 16-QAM systems with was implemented. When scaling the reported throughput of 106.6 Mbps to 216.9 Mbps, the size of the detector becomes approximately 197 kGE. Therefore, the size of the detector and decoder becomes 410 kGE. However, assuming that only a typical QR decomposition is employed (as implemented in the proposed design), this design achieves a BER of at an SNR of 18 dB, which is 3.5 dB worse than this work. In [20], a soft-output sphere decoder was implemented. As the throughput is variable between 10 and 95 Mbps, a 50 Mbps average throughput is assumed so that the BER degradation is minimal. Under these assumptions, the size required by this design to scale to 216.9 Mbps is around 245 kGE. Including the “TB” decoder in [23], also scaled to 216.9 Mbps, the system size becomes 459 kGE. Assuming no loss of BER in the scaled design, a gain of 3.5 dB is achieved by the proposed architecture. In [24], a configurable soft-output detector was implemented. For a 4 × 4 16-QAM system, the achievable throughput is 287.9 Mbps before decoding (which is 144 Mbps of information bits). Since this design does not support a joint scheme, a separate decoding block is needed. Thus, the system size becomes 739.5 kGE as shown in Table V. While operating in 4 × 4 64-QAM mode, the throughput will increase and hence the estimated scaled area and power will decrease. The reason for this large size is that the design is configurable to extend to an 8 × 8 64-QAM system. Furthermore, since the employed algorithm contains a variable search space, the decoding throughput is heavily dependent on the SNR as well as on the instantaneous channel realizations. Thus, the guaranteed minimum decoding throughput could be much lower than the reported number. In [27], a configurable design with an uncoded BER performance was reported. The reported highest throughput can reach 1.1 Gbps before decoding in the 4 × 4 64-QAM operating mode. For the 4 × 4 16-QAM setup, the throughput is expected to be reduced. The coded BER performance could be heavily degraded assuming hard output is given. Finally, in [26], a high throughput detector was reported to support up to 4 × 4 64-QAM systems with a throughput of up to 2 Gbps, when operated at a clock frequency of 833 MHz. The summary and power consumption for each of the designs discussed above are listed in Table V.

TABLE V. COMPARISON WITH PRIOR WORK

This design can support antenna size and modulation scheme 16-QAM 64-QAM.
This design can support antenna size and modulation scheme QPSK 64-QAM.
Based on the post-layout results; estimated based on linear scaling.

As shown in Table V, the proposed scheme can achieve significant performance gains (3.5 to 4.5 dB) as compared to conventional systems with comparable throughput and hardware complexity. However, it is important to note that the previously discussed works are all non-iterative techniques. Furthermore, the Viterbi decoder used is a hard output decoder. Clearly, if iteration is used, performance will be improved. However, to perform multiple iterations, a more complicated soft output decoder is required. In addition, a de-interleaver/interleaver structure is needed between the detector and decoder. The addition of these blocks will degrade the throughput and will require a further increase in size and power consumption to compensate for this loss in throughput. To gain better insight into the impact of iteration, we discuss the work presented in [28], which presents an FPGA implementation of an iterative receiver for MIMO-OFDM systems. While the system setup presented in [28] (4 × 4 QPSK system using an MMSE detector and Turbo code) is not exactly the same as that used in this paper, it provides sufficient and valuable insights for studying the tradeoffs of iterative schemes. As shown in that paper, the required SNR to achieve a packet error rate (PER) of improves by at least 2 dB by performing two iterations; the data rate also drops accordingly from approximately 25 Mbps to 12.5 Mbps when no parallelism is used. Furthermore, it is also illustrated that, in order to support the iterative scheme, the required logic elements increase by approximately 70% and the memory requirement increases by approximately 5×. Therefore, it can be concluded that conventional schemes can improve performance by performing multiple iterations between the detector and the convolutional decoder, albeit at a significant cost in hardware and throughput.

VIII. CONCLUSION

The algorithm and VLSI architecture to perform joint MIMO detection and decoding for systems using convolutional codes are presented. Two bit mapping structures are demonstrated and performance is presented and compared to conventional soft output decoding. It is shown that both algorithms outperform the conventional soft output decoder with comparable architecture complexity. As the algorithm involves a deep tree search, an architecture optimization to conserve memory usage was proposed and implemented. The throughput, area, as well as power consumption estimates of the synthesized decoder are given and compared to state of the art implementations.

REFERENCES[1] D. Gesbert, M. Shafi, D-S. Shiu, P. J. Smith, and A. Naguib, “From

theory to practice: An overview of MIMO space-time coded wirelesssystems,” IEEE J. Sel. Areas Commun., vol. 21, no. 3, pp. 281–302,Apr. 2003.

[2] G. J. Foschini, “Layered space-time architecture for wireless commu-nication in a fading environment when using multi-element antennas,”Bell Labs. Tech. J., vol. 1, no. 2, pp. 41–59, 1996.

[3] B. Hassibi and H. Vikalo, “On the sphere-decoding algorithm: Ex-pected complexity,” IEEE Trans. Signal Process., vol. 53, no. 8, pp.2806–2818, Aug. 2005.

Page 13: Joint Detection and Decoding for MIMO Systems Using Convolutional Codes: Algorithm and VLSI Architecture

SUKUMAR et al.: JOINT DETECTION AND DECODING FOR MIMO SYSTEMS 1931

[4] U. Fincke and M. Phost, “Improved methods for calculating vectorsof short length in a lattice, including a complexity analysis,” MathComput., vol. 44, no. 8, 1985.

[5] E. Viberto and J. Boutros, “A universal lattice code decoder for fadingchannels,” IEEE Trans. Inf. Theory, vol. 45, no. 5, pp. 1639–1642, Jul.1999.

[6] B. Hochwald and S. Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399,Mar. 2003.

[7] A. D. Murugan, H. El Gamal, M. O. Damen, and G. Caire, “A uni-fied framework for tree search decoding: Rediscovering the sequentialdecoder,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 933–953, Mar.2006.

[8] K. Wong, C. Ysui, S. Cheng, and W. Mow, “A VLSI architecture of aK-Best lattice decoding algorithm for MIMO channels,” in Proc. IEEEInt. Symp. Circuits Syst. (ISCAS), 2002, pp. 273–276.

[9] W. Zhao and G. B. Giannakis, “Reduced complexity closest pointdecoding algorithms for random lattices,” IEEE Trans. WirelessCommun., vol. 5, no. 1, pp. 101–111, Jan. 2006.

[10] L. Azzam and E. Ayanoglu, “Reduced complexity sphere decoding viaa reordered lattice representation,” IEEE Trans. Commun., vol. 57, no.9, pp. 2564–2569, Sep. 2009.

[11] C. S. Park, K. K. Parhi, and S. C. Park, “Probabilistic spherical detec-tion and VLSI implementation for multiple-antenna systems,” IEEETrans. Circuits Syst. I, vol. 56, no. 3, pp. 685–698, Mar. 2009.

[12] H. Vikalo and B. Hassibi, “On joint detection and decoding of linearblock codes on Gaussian vector channels,” IEEE Trans. SignalProcess., vol. 54, no. 9, pp. 3330–3342, Sep. 2006.

[13] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H.Bolcskei, “VLSI implementation of MIMO detection using the spheredecoding,” IEEE J. Solid-State Circuits, vol. 40, no. 7, pp. 1566–1577,Jul. 2005.

[14] L. G. Barbero and J. S. Thompson, “A fixed-complexity MIMO de-tector based on the complex sphere decoder,” in Proc. IEEE Int. Work-shop Signal Process. Adv. Wireless Commun., Jul. 2006.

[15] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, “K-BestMIMO detection VLSI architectures achieving up to 424 Mbps,” inProc. IEEE ISCAS, May 2006, pp. 1151–1154.

[16] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-Bestsphere decoding for MIMO detection,” IEEE J. Sel. Areas. Commun.,vol. 24, no. 3, pp. 491–503, Mar. 2006.

[17] S. Chen and T. Zhang, “Relaxed K-Best MIMO signal detector de-sign and VLSI implementation,” IEEE Trans. Very Large-Scale Integr.Syst., vol. 15, no. 3, pp. 328–1337, Mar. 2007.

[18] S. Modal, A. M. Eltawil, and K. N. Salama, “Architectural optimiza-tions for low-power K-Best MIMO decoders,” IEEE Trans. Veh. Tech.,vol. 58, no. 7, pp. 3145–3153, Sep. 2009.

[19] S. Modal, A. M. Eltawil, C.-A. Shen, and K. N. Salama, “Design andimplementation of a sort free K-Best sphere decoder,” IEEE Trans.Very Large-Scale Integr. Syst., vol. 58, no. 7, pp. 3145–3153, Sep.2009.

[20] C. Studer, A. Burg, and H. Bolcskei, “Soft-output sphere decoding:Algorithms and VLSI implementation,” IEEE J. Sel. Areas Commun.,vol. 26, no. 2, pp. 290–300, Feb. 2008.

[21] C.-H. Yang and D. Markovic, “A flexible DSP architecture for MIMO sphere decoding,” IEEE Trans. Circuits Syst. I, vol. 56, no. 10, pp. 2301–2314, Oct. 2009.

[22] C.-A. Shen and A. M. Eltawil, “A radius adaptive K-Best decoder with early termination: Algorithm and VLSI architecture,” IEEE Trans. Circuits Syst. I, vol. 57, no. 9, pp. 2476–2486, Sep. 2010.

[23] F. Sun and T. Zhang, “Low-power state-parallel relaxed adaptive Viterbi decoder,” IEEE Trans. Circuits Syst. I, vol. 54, no. 5, pp. 1060–1068, May 2007.

[24] C.-H. Liao, T.-P. Wang, and T.-D. Chiueh, “A 74.8 mW soft-output detector IC for 8×8 spatial-multiplexing MIMO communications,” IEEE J. Solid-State Circuits, vol. 45, no. 2, pp. 411–421, Feb. 2010.

[25] C.-C. Lin, Y.-H. Shih, H.-C. Chang, and C.-Y. Lee, “Design of a power-reduction Viterbi decoder for WLAN applications,” IEEE Trans. Circuits Syst. I, vol. 52, no. 6, pp. 1148–1156, Jun. 2005.

[26] D. Patel, V. Smolyakov, M. Shabany, and P. G. Gulak, “VLSI implementation of a WiMAX/LTE compliant low-complexity high-throughput soft-output K-best MIMO detector,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2010, pp. 593–596.

[27] L. Liu, F. Ye, X. Ma, T. Zhang, and J. Ren, “A 1.1-Gb/s 115-pJ/bit configurable MIMO detector using 0.13-μm CMOS technology,” IEEE Trans. Circuits Syst. II, vol. 57, no. 9, pp. 701–705, Sep. 2010.

[28] L. Boher, R. Rabineau, and M. Helard, “FPGA implementation of an iterative receiver for MIMO-OFDM systems,” IEEE J. Sel. Areas Commun., vol. 26, no. 6, pp. 857–866, Aug. 2008.

Chitaranjan Pelur Sukumar was born in Chennai, India. He received the B.E. degree from the University of Madras, Chennai, in 2003, the M.S. degree in electrical engineering from North Carolina State University, Raleigh, in 2005, and the Ph.D. degree from the University of California, Irvine, in 2011. His interests include MIMO-OFDM channel estimation and MIMO detection.

Chung-An Shen received the B.Sc. degree from National Taiwan University of Science and Technology, Taipei, in 2000 and the M.Sc. degree from The Ohio State University, Columbus, in 2003, both in electrical engineering. He is currently pursuing the Ph.D. degree in electrical engineering at the University of California, Irvine. His research interests include the development and design of algorithms and VLSI architectures for wireless communication systems.

Ahmed M. Eltawil (S’97–M’03) received the B.Sc. and M.Sc. degrees with honors from Cairo University, Giza, Egypt, in 1997 and 1999, respectively, and the Doctorate degree from the University of California, Los Angeles, in 2003. He joined the Department of Electrical Engineering and Computer Science at the University of California, Irvine, in 2005, where he is currently an Associate Professor. He is the founder and director of the Wireless Systems and Circuits Laboratory (WSCL) (http://newport.eecs.uci.edu/~aeltawil/), a member laboratory of the Center for Pervasive Communications and Computing (CPCC). He also holds a visiting professorship with King Saud University, Saudi Arabia. His current research interests are in low-power digital circuit and signal processing architectures for wireless communication systems, a subject on which he has published more than 80 technical papers, including four book chapters.

Dr. Eltawil has been on the technical program committees and steering committees for numerous workshops, symposia, and conferences in the area of VLSI and communication system design. He has received several distinguished awards, including the NSF CAREER award in 2010 supporting his research in low-power systems. Since 2006, he has been a member of the Association of Public Safety Communications Officials (APCO) and has been actively involved in efforts towards integrating advanced communication technologies in critical first responder networks. He has held several industry positions, including director of ASIC Engineering at Innovics Wireless (2000–2003) and Silvus Communications (2003–2005).