
Wireless Pers Commun (2011) 57:695–705
DOI 10.1007/s11277-009-9871-4

A Parallel Early-Pruned K-Best MIMO Signal Detector Up to 1.9 Gb/s

Liang Liu · Junyan Ren · Xiaojing Ma · Fan Ye

Published online: 20 November 2009
© Springer Science+Business Media, LLC. 2009

Abstract Multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) technology is regarded as a promising solution for ultra-high data rates in wireless communications. This paper presents a field-programmable gate array (FPGA) implementation of an early-pruned K-Best detection algorithm applicable to ultra-high data throughput MIMO-OFDM communication systems. The algorithm significantly simplifies the computation compared with the conventional K-Best algorithm, with negligible bit error ratio (BER) degradation. A fully parallel structure is implemented on an FPGA platform; it achieves a detection throughput of 1.9 Gb/s, about three times that of previous implementations. Moreover, a pre-processing method is used to reduce the number of multipliers inside the detector, which shrinks the critical path delay to 8.32 ns. Together with a candidate-sharing and early-pruning architecture that further saves hardware cost, a high-speed, compact MIMO signal detector is demonstrated.

Keywords MIMO · OFDM · Signal detection · Early-pruned K-Best

1 Introduction

Because bandwidth is limited by nature, improving spectral efficiency is a challenging issue for future wireless networks. Because the multiple-input multiple-output (MIMO) technique provides a linear increase in channel capacity with respect to the number of transmitter and receiver antennas [1], it is considered a promising solution for high spectral efficiency. Multi-path fading makes the general wireless channel frequency-selective, which motivates the orthogonal frequency-division multiplexing (OFDM) technique. Recently, market demand for ultra-high data rate wireless applications, such as uncompressed high-definition video streaming and flash file downloading, has been increasing significantly. The combination of MIMO and OFDM technology offers a

L. Liu · J. Ren (B) · X. Ma · F. Ye
State Key Laboratory of ASIC & System, Micro/Nano-Electronics Innovation Platform, Fudan University, Shanghai 201203, People’s Republic of China
e-mail: [email protected]


Fig. 1 Basic block diagram of MIMO-OFDM system

system solution that boosts throughput up to the gigabit-per-second range [2]. Such ultra-high system throughput challenges the hardware realization of a high-speed, low-cost MIMO system, especially the signal detection implementation at the receiver side. Therefore, an efficient algorithm and a compact circuit architecture are necessary to realize an economical MIMO receiver with gigabit/s throughput.

There are three main categories of MIMO detection algorithms. (1) Maximum likelihood (ML) detection provides the optimal BER performance. However, ML applies a brute-force search over the signal space, and its computational complexity grows exponentially with the number of transmitter antennas and the modulation size, which makes hardware implementation difficult. (2) Linear detection (LD) [3] solves for the unknown transmit symbols by linear manipulations. LD has the lowest complexity, but its performance is limited by error propagation. (3) Sub-ML algorithms, such as the sphere decoder (SD) and K-Best detection, reduce the complexity while maintaining close-to-ML performance, and so allow efficient implementation.

Several detectors have been implemented based on Sub-ML algorithms [4–8]. Burg et al. [4] proposed a one-node-per-clock structure for the depth-first algorithm. It features low hardware cost, but its variable throughput makes it unsuitable for high data rate implementation. In Wenk et al. [5], the K-Best detector achieves a throughput of only up to 424 Mb/s. Barbero and Thompson [6] proposed a fixed-throughput sphere decoder algorithm and implemented it on an FPGA platform with a throughput of 600 Mb/s. All of the above implementations demonstrate data rates of only hundreds of megabits per second instead of the expected gigabit per second. In this paper, we address these limitations by means of a fully parallel implementation of an early-pruned K-Best detector together with supporting techniques such as the candidate-sharing structure and the pre-processing scheme.

The remainder of this paper is organized as follows. Section 2 gives a brief introduction to the MIMO-OFDM system. Section 3 elaborates the new complexity-reduced detection algorithm. Section 4 explains the circuit design of the detection algorithm. Section 5 presents the FPGA implementation results of this detector and compares them with prior art. Section 6 concludes the paper.

2 System Model

The basic block diagram of a MIMO-OFDM wireless communication system is illustrated in Fig. 1. The serial encoded bit stream is decomposed into Nt parallel data streams through a


serial-to-parallel converter (S to P), and then transmitted through Nt different antennas after OFDM modulation. At the receiver side, Nr FFT processors perform the OFDM demodulation, and a signal detector then decodes the Nt parallel transmitted data streams using the Nr received data streams and the channel estimation information H.

In this paper, we consider a MIMO-OFDM system with N transmitters and N receivers. The corresponding base-band complex model is given as

y = HS + n (1)

where H is the complex channel matrix of N × N-dimension, y is the N × 1 received vector,S is an N × 1 transmitter vector, and n is the vector of additive noise, whose elements areindependent, identically distributed (i.i.d.) complex Gaussian random variables.

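The real-value decomposition (RVD) is applied to the complex signal model according to

$$
\begin{bmatrix} \mathrm{Re}(y) \\ \mathrm{Im}(y) \end{bmatrix}
=
\begin{bmatrix} \mathrm{Re}(H) & -\mathrm{Im}(H) \\ \mathrm{Im}(H) & \mathrm{Re}(H) \end{bmatrix}
\begin{bmatrix} \mathrm{Re}(S) \\ \mathrm{Im}(S) \end{bmatrix}
+
\begin{bmatrix} \mathrm{Re}(n) \\ \mathrm{Im}(n) \end{bmatrix}
\tag{2}
$$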

where Re(·) and Im(·) denote the real part and the imaginary part of a complex number, respectively. The N-dimensional complex problem is thus converted into a 2N-dimensional real problem. Furthermore, the channel matrix H is QR decomposed as H = QR, where Q is unitary and R is upper triangular. Multiplying by the conjugate transpose Q^H of the unitary matrix Q, the system model is rewritten as y′ = RS + n′, where y′ = Q^H y and n′ = Q^H n.
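The block structure of (2) can be checked numerically. The following is a minimal sketch in plain Python; the helper names `rvd_matrix`, `rvd_vector`, and `matvec`, and the example values of H and S, are illustrative assumptions and do not come from the paper. It verifies, for the noise-free case, that the stacked real system reproduces the complex product y = HS.

```python
def rvd_matrix(H):
    """Real-value decomposition of a complex matrix: [[Re, -Im], [Im, Re]]."""
    top = [[h.real for h in row] + [-h.imag for h in row] for row in H]
    bottom = [[h.imag for h in row] + [h.real for h in row] for row in H]
    return top + bottom

def rvd_vector(v):
    """Stack the real parts on top of the imaginary parts."""
    return [x.real for x in v] + [x.imag for x in v]

def matvec(A, x):
    """Plain matrix-vector product (works for real or complex entries)."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# Hypothetical 2 x 2 channel and 16-QAM symbols, noise-free for the check.
H = [[1 + 2j, 0.5 - 1j],
     [0 - 1j, 2 + 0j]]
S = [1 - 3j, -1 + 1j]

y = matvec(H, S)                                   # complex model (1), n = 0
lhs = rvd_vector(y)                                # [Re(y); Im(y)]
rhs = matvec(rvd_matrix(H), rvd_vector(S))         # real model (2)
max_error = max(abs(a - b) for a, b in zip(lhs, rhs))
```

After RVD, each entry of the real symbol vector for 16-QAM takes one of the four values −3, −1, 1, 3, which is what the later sections rely on.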

The following assumptions are made throughout the rest of the paper. First, the channel state information is perfectly known at the receiver, and its elements are modeled as complex Gaussian random variables with zero mean and unit variance. Second, the entries of S are chosen independently from a set Ω of constellation points.

3 Proposed MIMO Detection Algorithm

3.1 Traditional K-Best Algorithm

The ML algorithm solves the closest-point-search problem:

$$
\hat{S} = \arg\min_{S \in \Omega^{2N}} \sum_{i=1}^{2N} \left| y'_i - \sum_{j=i+1}^{2N} R_{ij} S_j - R_{ii} S_i \right|^2 \tag{3}
$$

The K-Best algorithm transforms the closest-point-search problem of ML detection into a tree-search problem by rewriting (3) in a progressive way:

$$
\mathrm{PED}_i = \mathrm{PED}_{i+1} + \mathrm{inc}_i, \qquad
\mathrm{inc}_i = \left| y'_i - \sum_{j=i+1}^{2N} R_{ij} S_j - R_{ii} S_i \right|^2 \tag{4}
$$

Figure 2 shows a simple example of the tree-search structure. In the search tree, each node has a partial Euclidean distance (PED, Mi), and inc_i is the Euclidean distance increment (EDI) between a father node (FN) and its child nodes (CNs). In K-Best detection, all the CNs of the K survivors are expanded, and the K CNs with the smallest PEDs are selected to be extended next.
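The layer-by-layer search described by (4) can be sketched in software as follows. This is a minimal illustration, not the paper's hardware: the function name `k_best_detect`, the 4 × 4 example matrix R, and the test symbols are assumptions, and the real symbol set {−3, −1, 1, 3} corresponds to 16-QAM after RVD.

```python
def k_best_detect(R, y, K, symbols=(-3, -1, 1, 3)):
    """Breadth-first K-Best search for y = R S + n, R real upper triangular.

    Returns (detected symbol list, PED of the best leaf)."""
    n = len(R)
    survivors = [(0.0, ())]          # (PED, symbols S_{i+1}..S_{n-1} decided so far)
    for i in range(n - 1, -1, -1):   # start at the last row of R
        children = []
        for ped, path in survivors:
            # Interference-cancelled value, common to all children of this father.
            c = y[i] - sum(R[i][j] * path[j - i - 1] for j in range(i + 1, n))
            for s in symbols:
                inc = (c - R[i][i] * s) ** 2          # EDI of this child
                children.append((ped + inc, (s,) + path))
        children.sort(key=lambda node: node[0])       # sort by PED
        survivors = children[:K]                      # keep the K best
    best_ped, best_path = survivors[0]
    return list(best_path), best_ped

# Hypothetical noise-free example.
R = [[2.0, 0.5, -0.3, 0.1],
     [0.0, 1.5, 0.4, -0.2],
     [0.0, 0.0, 1.2, 0.3],
     [0.0, 0.0, 0.0, 1.0]]
S_true = [1, -3, 3, -1]
y = [sum(R[i][j] * S_true[j] for j in range(4)) for i in range(4)]
S_hat, ped = k_best_detect(R, y, K=4)
```

The explicit sort over all expanded children in every layer is the step that the text identifies as costly in hardware.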


Fig. 2 Tree search of K-Best and K-ZF algorithm

3.2 Proposed Early-Pruned K-Best Algorithm

In high-throughput implementations, K-Best is preferred because of its fixed throughput, its one-direction search scheme, and its close-to-ML BER performance. However, the heavy computation introduced by the sorting procedure and the PED calculations degrades the timing performance and incurs a large hardware overhead. On the other hand, the zero-forcing (ZF) algorithm is more suitable in high-SNR scenarios, because it has the lowest computational complexity among the detection algorithms and offers acceptable BER performance when the noise is not too severe [3]. The statistical channel model is further exploited in the analysis: for the random matrix H = QR, the diagonal elements (r_{i,i}) of the upper triangular matrix R are distributed independently according to the standard Gamma distribution with n = 2N + 1 − i [9], which means that the probability that r_{i,i}^2 takes small values increases with i. In other words, the lower the layer, the higher its equivalent SNR.

Based on the above analysis, we propose an early-pruned K-Best algorithm, namely K-Best zero-forcing (K-ZF) detection, to reduce the complexity. Using a design parameter M, the K-ZF algorithm divides the search tree into two parts, the low-SNR part and the high-SNR part. As shown in Fig. 2, K-ZF performs the traditional K-Best algorithm in the first (2N − M) layers, forming K sub-trees, each with a depth of M. For each sub-tree in the remaining M layers, K-ZF performs a zero-forcing process in which the search tree is pruned early, so that only the one CN with the smallest EDI of each FN is calculated and preserved. This results in a saving of (P − 1)/P of the PED calculations compared with the K-Best method, where P is the real constellation size after RVD. Furthermore, all the calculated nodes and their PEDs in these M layers are preserved, so the complex sorting procedure is omitted in these M layers.
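A software sketch of the two-phase search may help make the role of M concrete. The code below is an illustration under stated assumptions, not the authors' implementation: `k_zf_detect` and the example data are hypothetical names and values, the ZF layers are taken to be the last M layers visited (row indices below M in 0-based indexing), and the nearest-symbol choice stands in for the hardware ZF unit.

```python
def k_zf_detect(R, y, K, M, symbols=(-3, -1, 1, 3)):
    """K-Best in the first (n - M) layers, then one child per father in the last M."""
    n = len(R)
    survivors = [(0.0, ())]
    for i in range(n - 1, -1, -1):
        zf_layer = i < M                       # last M layers visited use ZF
        children = []
        for ped, path in survivors:
            c = y[i] - sum(R[i][j] * path[j - i - 1] for j in range(i + 1, n))
            if zf_layer:
                # Early pruning: only the closest symbol is expanded, so a
                # single PED is computed per father and no sorting is needed.
                s = min(symbols, key=lambda cand: abs(c - R[i][i] * cand))
                children.append((ped + (c - R[i][i] * s) ** 2, (s,) + path))
            else:
                for s in symbols:
                    children.append((ped + (c - R[i][i] * s) ** 2, (s,) + path))
        if not zf_layer:
            children.sort(key=lambda node: node[0])
            children = children[:K]
        survivors = children
    best_ped, best_path = min(survivors, key=lambda node: node[0])  # final select
    return list(best_path), best_ped

# Hypothetical noise-free example.
R = [[2.0, 0.5, -0.3, 0.1],
     [0.0, 1.5, 0.4, -0.2],
     [0.0, 0.0, 1.2, 0.3],
     [0.0, 0.0, 0.0, 1.0]]
S_true = [-1, 3, 1, -3]
y = [sum(R[i][j] * S_true[j] for j in range(4)) for i in range(4)]
results = [k_zf_detect(R, y, K=3, M=m)[0] for m in range(5)]
```

With M = 0 the sketch reduces to plain K-Best, and with M equal to the number of layers it keeps a single path, which matches the two extremes of the trade-off discussed in the text.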

Clearly, by varying M between 0 and 2N, one can trade off the computational complexity against the BER performance of the K-ZF detector. When M is small, fewer layers utilize



Fig. 3 BER performance of the MIMO detection

Table 1 Computational complexity of the MIMO detection

| MIMO detection | K-Best | K-ZF M = 2 | K-ZF M = 3 | K-ZF M = 4 | K-ZF M = 5 | ZF |
|---|---|---|---|---|---|---|
| Ave. number of expanded nodes | 164 | 128 | 110 | 92 | 74 | 8 |

ZF processing and the BER performance is closer to that of the K-Best algorithm. On the other hand, a large M means less hardware consumption and better timing performance, but worse BER performance. This makes the K-ZF detector well suited to configurable systems in which M can be adaptively adjusted according to the channel condition.

3.3 Performance Simulation

The performance of the proposed K-ZF detector is evaluated by computer simulation. We consider an uncoded 4 × 4 spatially multiplexed MIMO-OFDM system using 16-QAM modulation. For each simulation, 1,000 packets, each containing 1,024 bytes of information bits, were transmitted, and each packet transmission spanned one realization of H. Figure 3 compares the BER performance of the proposed K-ZF algorithm for different M with that of the exhaustive-search ML algorithm, the ZF algorithm, and the K-Best algorithm. Table 1 compares the computational complexity of these detectors, expressed as the average number of expanded nodes in each detection process. The BER performance and the complexity of the proposed detector lie between those of K-Best and ZF. The complexity reduction relative to K-Best demonstrates that the K-ZF detector can efficiently balance BER performance and computational complexity by adjusting the parameter M. It is also found that when M equals 2 or 3, the BER performance of K-ZF practically matches that of K-Best and is close to that of ML detection.
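The node counts in Table 1 are consistent with a simple counting model, sketched below. The model is an assumption on our part rather than something the paper states: K = 6 survivors (the value given for the hardware in Sect. 4.1), P = 4 real symbols per layer, 2N = 8 layers, every survivor expanded to P children in a K-Best layer, and one child per survivor in a ZF layer. The function name `expanded_nodes` is hypothetical.

```python
def expanded_nodes(K, P, layers, M):
    """Count expanded nodes: K-Best in the first (layers - M) layers, ZF in the last M."""
    total = 0
    survivors = 1                                # the root of the search tree
    for depth in range(layers):
        if depth < layers - M:
            total += survivors * P               # every survivor expands P children
            survivors = min(K, survivors * P)    # keep at most K of them
        else:
            total += survivors                   # ZF: one child per survivor
    return total

# Assumed parameters: K = 6, P = 4 (16-QAM after RVD), 2N = 8 layers.
counts = {M: expanded_nodes(K=6, P=4, layers=8, M=M) for M in (0, 2, 3, 4, 5, 8)}
```

Under these assumptions M = 0 corresponds to the K-Best column and M = 8 to the ZF column of Table 1.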


Fig. 4 Overall architecture of the K-ZF detector with M = 3

4 Circuit Design for K-ZF Detector

4.1 Fully Parallel Pipelined Structure

Figure 4 shows the overall block diagram of the proposed K-ZF MIMO detector for a 4 × 4 16-QAM MIMO system with K = 6 and M = 3. To achieve ultra-high throughput, we design a fully parallel pipelined hardware structure. It consists of five stages of K-Best units, three stages of ZF units, a final select unit (FSU), and register banks that store the channel information (R) and pipeline the received signal (y). The PEDs and the corresponding nodes (S) are processed and passed between neighboring layers. Each K-Best layer consists of a PED calculation unit (PCU), responsible for node expansion and PED calculation, and a K-Best select unit (KSU), which selects the K smallest PEDs from the results of the PCU. The ZF block expands the best CN for each FN and calculates its corresponding PED. The FSU generates the final detection result by selecting the node with the smallest PED. The highest throughput is achieved because the K nodes in each layer are calculated in a fully parallel fashion. Furthermore, the processing elements are deeply pipelined: the K-Best unit has six pipeline stages and the ZF unit has four. To save hardware cost in fully parallel implementations, we propose two complexity-reduced calculation units.

4.2 Complexity Reduced PED Calculation Unit

PED calculation is the main computation in a K-Best detector and plays a critical role in throughput, area, and power consumption, especially in parallel implementations. A straightforward parallel PCU implementation incurs unacceptable hardware cost. One of the most effective methods to save area in a parallel architecture is resource sharing. Taking advantage of the facts that the tree-search process is incremental and that the real symbols of the QAM constellation points belong to a finite set of odd integers, we use two hardware reuse structures to share as many calculations as possible. (1) Shared father node unit (FNU): We rewrite (4) as follows:

$$
\mathrm{inc}_i = \left| c_i - R_{ii} S_i \right|^2 \quad \text{with} \quad c_i = y'_i - \sum_{j=i+1}^{2N} R_{ij} S_j \tag{5}
$$

According to (5), c_i is independent of S_i and is common to all the CNs expanded from the same FN. Hence, in the proposed PCU, c_i is computed only once per FN rather than being calculated for each CN. (2) Candidate-sharing structure: The most expensive process



Fig. 5 Structure diagram of a PCU block

in the PCU is the computation of c_i, which requires (2N − i) multipliers, (2N − i − 1) adders, and a subtracter. In the parallel structure, K c_i calculation units are duplicated to process K FNs simultaneously, which makes the PCU implementation impractical. To overcome this, we propose a candidate-sharing structure (CSS). As shown in Fig. 5, c_i is not completely calculated for each FN. Instead, we use the candidate generation unit (CSU) to generate all four candidates of R_ij S_j. The K FNs share one CSU, and each FN uses only a MUX to select the right candidate of R_ij S_j and then finishes the c_i calculation via adders and a subtracter. With the CSS, considerable hardware resources are saved in the fully parallel scenario.
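The idea behind the candidate-sharing structure can be mimicked in software with a lookup table that is built once per layer and then indexed by every father node. The sketch below is only an analogy to the hardware in Fig. 5; the names `candidate_table` and `c_for_fathers`, the example row of R, and the father symbol lists are assumptions made for illustration.

```python
SYMBOLS = (-3, -1, 1, 3)

def candidate_table(R_row, i):
    """Shared unit: all products R[i][j] * s for j > i, computed once per layer."""
    return {j: {s: R_row[j] * s for s in SYMBOLS}
            for j in range(i + 1, len(R_row))}

def c_for_fathers(R_row, i, y_i, fathers):
    """Compute c_i for every father using only table lookups and subtractions.

    Each father is a full-length list whose entries j > i hold its symbols S_j."""
    table = candidate_table(R_row, i)           # built once, shared by all fathers
    results = []
    for symbols_so_far in fathers:
        c = y_i
        for j in range(i + 1, len(R_row)):
            c -= table[j][symbols_so_far[j]]    # "MUX": select a precomputed product
        results.append(c)
    return results

# Hypothetical layer i = 1 of a 4-layer problem with three father nodes.
R_row = [0.0, 1.5, 0.4, -0.2]
fathers = [[None, None, 3, -1], [None, None, -1, 1], [None, None, 1, 3]]
c_shared = c_for_fathers(R_row, 1, 0.7, fathers)
c_direct = [0.7 - sum(R_row[j] * f[j] for j in (2, 3)) for f in fathers]
```

The number of products computed is 4 per column j regardless of how many fathers there are, whereas the direct form computes one product per column per father.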

In hardware implementations, the multiplier is one of the most complex components. It prolongs the critical path delay and consumes a large amount of resources. As mentioned above, (2N − i) multipliers are needed in the PCU to generate the products R_ij S_j. To avoid this expensive processing, we use a pre-processing method. Given that the entries of S are chosen from a fixed set of constellation points, we pre-arrange the received signal y so that the entries of S are chosen from {−3, −1, 1, 3} after RVD. In this way the calculation of R_ij S_j is multiplier-free, because the operation R × 3 can be replaced with shift and add operations.
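The shift-and-add replacement can be illustrated on fixed-point integers. This is a minimal sketch, assuming that R_ij is stored as a signed integer in some fixed-point format (the word length is not specified here); the function name `times_symbol` is hypothetical.

```python
def times_symbol(r, s):
    """Multiply a fixed-point integer r by s in {-3, -1, 1, 3} without a multiplier."""
    if abs(s) == 1:
        magnitude = r
    else:
        magnitude = (r << 1) + r      # 3r = 2r + r: one shift and one add
    return magnitude if s > 0 else -magnitude   # negative symbols: negate only

# Exhaustive check against ordinary multiplication over a range of integers.
mismatches = [(r, s) for r in range(-1000, 1001) for s in (-3, -1, 1, 3)
              if times_symbol(r, s) != r * s]
```

Only the symbols ±3 need the extra adder; ±1 are a pass-through or a negation.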

4.3 Early-Pruned Calculation Unit

The ZF block in the detector finds the best CN of an FN:

$$
\hat{S}_i = \arg\min_{S_i \in \Omega} \left| y'_i - \sum_{j=i+1}^{2N} R_{ij} S_j - R_{ii} S_i \right|^2 \tag{6}
$$

From (6), expanding the best CN in 16-QAM modulation needs four EDI calculation units and a four-input comparator, leading to the same hardware consumption as that in the K-Best


Fig. 6 Structure diagram of a zero forcing block

unit. Because the real-valued candidate nodes S_i are located symmetrically on both sides of zero, we can prune the nodes early, before the EDI is fully calculated. As shown in Fig. 6, we divide the minimization into two steps: (1) expand only the positive CNs and then make the sign decision according to the sign of c_i; (2) instead of generating the final result after the complete inc_i computation, place the two-input comparator in front of the square calculation unit (SCU), which saves the hardware cost of one SCU.
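The two steps can be sketched in software and compared with a brute-force minimization over all four symbols. The sketch assumes that the diagonal element R_ii is positive (the QR decomposition can be arranged this way, but the paper does not state it) and uses the hypothetical names `zf_early_pruned` and `zf_brute_force`.

```python
def zf_early_pruned(c, r_ii):
    """Best child among {-3, -1, 1, 3} using the sign of c and a single squaring.

    Assumes r_ii > 0."""
    a = abs(c)
    d1 = abs(a - r_ii)            # distance to the positive candidate 1
    d3 = abs(a - 3 * r_ii)        # distance to the positive candidate 3
    # Compare before squaring, so only the winner is squared.
    magnitude, d = (1, d1) if d1 <= d3 else (3, d3)
    s = magnitude if c >= 0 else -magnitude       # sign decision from c
    return s, d * d

def zf_brute_force(c, r_ii):
    """Reference: evaluate all four EDIs and keep the smallest."""
    return min((c - r_ii * s) ** 2 for s in (-3, -1, 1, 3))

# Hypothetical sweep of c values for one assumed r_ii.
r_ii = 1.3
sweep = [k * 0.1 for k in range(-60, 61)]
worst_gap = max(abs(zf_early_pruned(c, r_ii)[1] - zf_brute_force(c, r_ii))
                for c in sweep)
```

Candidates whose sign differs from that of c are never closer than their mirrored counterparts, which is why only two of the four candidates need to be compared.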

5 FPGA Implementation and Performance Analysis

The proposed K-ZF detector has been implemented in a Xilinx Virtex-5 FPGA (XC5VLX330). The Timing Analyzer and the Floorplanner of ISE 10.1i were used to analyze the timing performance and hardware cost. Table 2 shows the key performance figures of the K-ZF detector. The proposed detector uses no more than 7% of the device resources and achieves a maximum clock rate of 120 MHz, which supports a 1.9 Gb/s detection throughput and thus meets the requirement of future ultra-high-speed wireless applications. The bit throughput of the proposed fully parallel detector in Table 2 is given by

$$
\text{Throughput} = f_c \times \log_2 \Omega \times N \tag{7}
$$

where Ω is the constellation size (16 for 16-QAM modulation), N is the number of antennas, and f_c is the clock frequency.
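As a quick arithmetic check of (7), the sketch below plugs in the 120 MHz clock rate listed in Table 2, 16-QAM, and N = 4 antennas, and also converts the 8.32 ns critical path delay quoted in the abstract into a clock frequency. The function name `throughput_bps` is a hypothetical helper.

```python
import math

def throughput_bps(f_clk_hz, constellation_size, n_antennas):
    """Bit throughput of a detector that outputs one symbol vector per clock cycle."""
    return f_clk_hz * math.log2(constellation_size) * n_antennas

rate = throughput_bps(120e6, 16, 4)     # 120 MHz x 4 bits x 4 antennas
f_max_mhz = 1.0 / 8.32e-9 / 1e6         # clock rate implied by an 8.32 ns path
```

The result is 1.92 Gb/s, which the paper rounds to 1.9 Gb/s, and the 8.32 ns critical path corresponds to roughly 120 MHz.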

Table 3 compares the proposed K-ZF detector with previous 16-QAM signal detectors in terms of throughput and hardware cost. Although an FPGA platform is used instead of an ASIC implementation, our detector achieves the highest throughput, 1.9 Gb/s, owing to the fully parallel structure and the efficient design of the calculation units. Moreover, the hardware resources consumed by the proposed K-ZF detector do not increase linearly with the throughput. In summary,



Table 2 FPGA resource use of the K-ZF detector

| Xilinx XC5VLX330 FPGA | K-ZF detector |
|---|---|
| Number of slice registers | 9,978 (4%) |
| Number of slice LUTs | 15,108 (7%) |
| Max. clock rate | 120 MHz |
| Max. throughput | 1.9 Gb/s |

Table 3 Comparison of detectors for 16-QAM system

| MIMO detector | SD1 [4] | SD2 [4] | K-Best [7] | K-Best [5] | K-Best [6] | SD [8] | K-ZF |
|---|---|---|---|---|---|---|---|
| Clock frequency (MHz) | 51 | 73 | 100 | 132 | 150 | 250 | 120 |
| Throughput (Mb/s) | 71 | 169 | 53 | 424 | 600 | 167 | 1920 |
| Gate count (KG) | 117 | 50 | 91 | 93 | 96 | 45 | 120 |
| Throughput/resource (Mb/s per KG) | 0.61 | 3.38 | 0.58 | 4.56 | 6.25 | 3.7 | 16 |
| Platform | ASIC | ASIC | ASIC | ASIC | FPGA | ASIC | FPGA |

the proposed K-ZF detector achieves the highest throughput/resource ratio and balances hardware expenditure and speed well.
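The throughput/resource row of Table 3 appears to be the throughput in Mb/s divided by the gate count in KG. The short sketch below recomputes it from the other two rows; the dictionary layout and the name `ratio` are illustrative choices, and the numbers are copied from Table 3.

```python
# (throughput in Mb/s, gate count in KG), copied from Table 3.
detectors = {
    "SD1 [4]":    (71, 117),
    "SD2 [4]":    (169, 50),
    "K-Best [7]": (53, 91),
    "K-Best [5]": (424, 93),
    "K-Best [6]": (600, 96),
    "SD [8]":     (167, 45),
    "K-ZF":       (1920, 120),
}

# Throughput per unit of hardware resource, rounded to two decimals.
ratio = {name: round(mbps / kg, 2) for name, (mbps, kg) in detectors.items()}
best = max(ratio, key=ratio.get)
```

On these figures the K-ZF detector has the largest ratio, about 2.5 times that of the next best entry.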

6 Conclusion

This paper demonstrates an FPGA implementation of a signal detector for a 4 × 4 16-QAM MIMO-OFDM system. The proposed early-pruned K-ZF algorithm significantly reduces the computational complexity while maintaining close-to-ML BER performance. The fully parallel pipelined structure enables high-speed signal processing. The implementation results show that the proposed signal detector offers a fixed data throughput of up to 1.9 Gb/s and can thus satisfy the data transfer requirements of future wireless applications.

References

1. Paulraj, A. J., Gore, D. A., Nabar, R. U., & Bolcskei, H. (2004). An overview of MIMO communications - a key to gigabit wireless. Proceedings of the IEEE, 92(2), 198–218.

2. Baek, M.-S., et al. (2005). MB-OFDM UWB system with multiple antennas for high capacity transmission in wireless personal area network. IEEE International Conference on Consumer Electronics, 85–86.

3. Chen, C.-J., & Wang, L.-C. (2005). On the performance of the zero-forcing receiver operating in the multiuser MIMO system with reduced noise enhancement effect. IEEE Global Telecommunications Conference, 3, 5.

4. Burg, A., et al. (2005). VLSI implementation of MIMO detection using the sphere decoding algorithm. IEEE Journal of Solid-State Circuits, 40, 1566–1577.

5. Wenk, M., et al. (2006). K-Best MIMO detection VLSI architectures achieving up to 424 Mbps. Proceedings of the IEEE International Symposium on Circuits and Systems, 1151–1154.

6. Barbero, L. G., & Thompson, J. S. (2006). Rapid prototyping of a fixed-throughput sphere decoder for MIMO systems. IEEE International Conference on Communications, 7, 3082–3087.

7. Guo, Z., & Nilsson, P. (2006). Algorithm and implementation of the K-Best sphere decoding for MIMO detection. IEEE Journal on Selected Areas in Communications, 24(3), 491–503.

8. Cerato, B., Masera, G., & Viterbo, E. (2009). Decoding the golden code: A VLSI design. IEEE Transactions on VLSI Systems, 17, 156–160.

9. Zhao, W., & Giannakis, G. B. (2006). Reduced complexity closest point decoding algorithms for random lattices. IEEE Transactions on Wireless Communications, 5, 101–111.


Author Biographies

Liang Liu was born in Hunan, China, in 1983. He received the B.S. degree from the Department of Electronics Engineering, Fudan University, Shanghai, China, in 2005. He is currently working toward the Ph.D. degree at Fudan University. His current research interests include wireless communication systems and high-speed, low-power digital integrated circuit design.

Junyan Ren received the B.S. and M.S. degrees in Physics (1983) and Electronic Engineering (1986) from Fudan University, China. Since 1986, he has been with the State-Key Lab of ASIC and System and the Micro/Nano-Electronics Innovation Platform, Fudan University. He is currently a full professor in Microelectronics and vice director of the State-Key Lab of ASIC and System. He is the author or co-author of over 100 technical papers in conferences and journals, and he has filed over 20 Chinese patents. His research interests include RF/analog/mixed-signal integrated circuits with communication applications.

Xiaojing Ma received the B.S. degree from the Department of Electronics Engineering, Fudan University, Shanghai, China, in 2006. He is currently working toward the M.S. degree in the Department of Microelectronics at Fudan University. His current research interests include MIMO-OFDM systems and digital integrated circuit design.



Fan Ye was born in Shanghai, China, in 1978. He received the M.S. degree from the Department of Microelectronics, Fudan University, Shanghai, China, in 2003. He has worked as a lecturer since then and is now working toward the Ph.D. degree at Fudan University. His current research interests include digital communication systems and mixed-signal integrated circuit design.
