

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 21, NOVEMBER 1, 2013 5247

The Linear Model Under Mixed Gaussian Inputs: Designing the Transfer Matrix

John T. Flåm, Dave Zachariah, Mikko Vehkaperä, and Saikat Chatterjee

Abstract—Suppose a linear model $\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}$, where the inputs $\mathbf{x}$ and $\mathbf{n}$ are independent Gaussian mixtures. The problem is to design the transfer matrix $\mathbf{H}$ so as to minimize the mean square error (MSE) when estimating $\mathbf{x}$ from $\mathbf{y}$. This problem has important applications, but faces at least three hurdles. Firstly, even for a fixed $\mathbf{H}$, the minimum MSE (MMSE) has no analytical form. Secondly, the MMSE is generally not convex in $\mathbf{H}$. Thirdly, derivatives of the MMSE w.r.t. $\mathbf{H}$ are hard to obtain. This paper casts the problem as a stochastic program and invokes gradient methods. The study is motivated by two applications in signal processing. One concerns the choice of error-reducing precoders; the other deals with selection of pilot matrices for channel estimation. In either setting, our numerical results indicate improved estimation accuracy—markedly better than that obtained by optimal design based on standard linear estimators. Some implications of the non-convexity of the MMSE are noteworthy, yet, to our knowledge, not well known. For example, there are cases in which more pilot power is detrimental for channel estimation. This paper explains why.

Index Terms—Gaussian mixtures, minimum mean square error(MMSE), estimation.

I. PROBLEM STATEMENT

CONSIDER the following linear system
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}. \qquad (1)$$
Here $\mathbf{y}$ is a vector of observations, and $\mathbf{x}$ and $\mathbf{n}$ are mutually independent random vectors with known Gaussian Mixture (GM) distributions:
$$\mathbf{x} \sim \sum_{k \in \mathbb{K}} p_k\,\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad (2)$$
$$\mathbf{n} \sim \sum_{l \in \mathbb{L}} q_l\,\mathcal{N}(\boldsymbol{\nu}_l, \boldsymbol{\Omega}_l). \qquad (3)$$

Manuscript received October 10, 2012; revised February 28, 2013 and May 31, 2013; accepted August 06, 2013. Date of publication August 15, 2013; date of current version September 20, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Y.-W. Peter Hong. The work of J. T. Flåm was supported by the Research Council of Norway under the NORDITE/VERDIKT program, Project CROPS2 (Grant 181530/S10). The work of M. Vehkaperä was supported by the Swedish Research Council VR Grant 621-2011-1024.

J. T. Flåm was with the Department of Electronics and Telecommunications, Norwegian University of Science and Technology (NTNU), Trondheim, Norway. He is currently working at www.embida.com (e-mail: [email protected]).

D. Zachariah and S. Chatterjee are with the School of Electrical Engineering/ACCESS Linnaeus Center, Royal Institute of Technology (KTH), Stockholm 10044, Sweden (e-mail: [email protected]; [email protected]).

M. Vehkaperä is also with the School of Electrical Engineering/ACCESS Linnaeus Center, Royal Institute of Technology (KTH), Stockholm 10044, Sweden, and with the School of Electrical Engineering, Aalto University, Finland (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2013.2278812


In this work, we assume that $\mathbf{H}$ is a transfer matrix that we are at liberty to design, typically under some constraints. Specifically, our objective is to design $\mathbf{H}$ such that $\mathbf{x}$ can be estimated from $\mathbf{y}$ with minimum mean square error (MMSE). The MMSE, for a fixed $\mathbf{H}$, is by definition [1]
$$\mathrm{MMSE}(\mathbf{H}) = \int \|\mathbf{x} - \hat{\mathbf{x}}(\mathbf{y})\|^2\, p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y}. \qquad (4)$$
Here, $\|\cdot\|$ denotes the 2-norm, $p(\mathbf{x}, \mathbf{y})$ is the joint probability density function (PDF) of $(\mathbf{x}, \mathbf{y})$,
$$\hat{\mathbf{x}}(\mathbf{y}) = \mathbb{E}[\mathbf{x} \mid \mathbf{y}] = \int \mathbf{x}\, p(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x} \qquad (5)$$
is the MMSE estimator, and $p(\mathbf{x} \mid \mathbf{y})$ is the PDF of $\mathbf{x}$ given $\mathbf{y}$. The MMSE in (4) depends on $\mathbf{H}$ both through $p(\mathbf{x}, \mathbf{y})$ and $\hat{\mathbf{x}}(\mathbf{y})$. Our objective is to solve the following optimization problem
$$\min_{\mathbf{H} \in \mathcal{H}} \mathrm{MMSE}(\mathbf{H}), \qquad (6)$$

where $\mathcal{H}$ denotes a prescribed set of matrices that $\mathbf{H}$ must belong to.

The next section introduces two applications where this matrix design problem appears. Section III illustrates some of the basic properties of that problem through a simple example. Section IV spells out problem (6) in full detail. Section V argues that a sampling-based approach, involving stochastic gradients, is a viable option towards a solution. It also reviews how the Robbins-Monro method [2], [3] applies. Numerical results are provided in Section VI. Section VII concludes. A large part of the detailed analysis, concerning stochastic gradients, is deferred to the Appendix.

II. BACKGROUND AND MOTIVATION

Problem (6) appears in various applications of interest. Sections II.A and II.B present two of these, which are of particular interest to the signal processing community. Section II.C introduces and motivates the use of GM input distributions. Section II.D discusses the general properties of problem (6), and lists some of the contributions of this paper.

A. Linear Precoder Design

Consider a linear system model

$$\mathbf{y} = \mathbf{A}\mathbf{F}\mathbf{x} + \mathbf{n}, \qquad (7)$$



where $\mathbf{A}$ is an a priori known matrix, either given exactly or estimated, and $\mathbf{F}$ is a precoder matrix to be designed such that the mean square error (MSE) when estimating $\mathbf{x}$ from $\mathbf{y}$ becomes as small as possible. The vector $\mathbf{n}$ is random noise. If $\mathbf{x}$ and $\mathbf{n}$ are independent and GM distributed, this is a matrix design problem as described in Section I, where $\mathbf{F}$ is the design parameter. A typical constraint is to require that $\mathbf{F}\mathbf{x}$ cannot exceed a certain average power, i.e., $\mathbb{E}\|\mathbf{F}\mathbf{x}\|^2 \le P$. Then, in terms of the minimization problem given by (6), the feasible set is defined by
$$\mathcal{F} = \left\{\mathbf{F} : \mathbb{E}\|\mathbf{F}\mathbf{x}\|^2 \le P\right\}.$$

A linear model with known transfer matrix and GM distributed inputs is frequently assumed within speech and image processing. In these applications, the signal of interest often exhibits multi-modal behavior. That feature can be reasonably explained by assuming an underlying GM distribution. The noise is often modeled as Gaussian, which is a special case of a GM. Conveniently, with (2) and (3) as inputs, the MMSE estimator in (5) has a closed analytical form for any given $\mathbf{H}$. Selected works exploiting this include [4]–[8]. However, none of these works study MSE-reducing precoders. Such precoders have the potential to significantly improve the estimation accuracy, and should therefore be of interest.

B. Pilot Signal Design

Consider a multiple-input multiple-output (MIMO) communication model
$$\mathbf{y} = \mathbf{H}\mathbf{p} + \mathbf{n}, \qquad (8)$$
where $\mathbf{H}$ is a random channel matrix that we wish to estimate with as small MSE as possible, and $\mathbf{p}$ is a pilot signal to be designed for that purpose. As before, $\mathbf{n}$ is random noise. In order to estimate $\mathbf{H}$ with some confidence, we must transmit at least as many pilot vectors as there are columns in $\mathbf{H}$. In addition, we must assume that the realization of $\mathbf{H}$ does not change before all pilots have been transmitted. This assumption typically holds in flat, block-fading MIMO systems [9]–[11]. With multiple transmitted pilots, model (8) can be written in matrix form as
$$\mathbf{Y} = \mathbf{H}\mathbf{P} + \mathbf{N}. \qquad (9)$$
If $\mathbf{H}$ is $m \times n$, then this model can be vectorized into (Thm. 2, Ch. 2, [12])
$$\mathrm{vec}(\mathbf{Y}) = \left(\mathbf{P}^{\mathsf T} \otimes \mathbf{I}_m\right) \mathrm{vec}(\mathbf{H}) + \mathrm{vec}(\mathbf{N}). \qquad (10)$$

Here $\mathbf{I}_m$ denotes the $m \times m$ identity matrix, the $\mathrm{vec}(\cdot)$ operator stacks the columns of a matrix into a column vector, and $\otimes$ denotes the Kronecker product. Assuming that the channel $\mathrm{vec}(\mathbf{H})$ and noise $\mathrm{vec}(\mathbf{N})$ are independent and GM distributed, this is again a design problem as described in Section I, where the pilot matrix $\mathbf{P}$ is the design parameter. A natural constraint is to impose power limitations on the transmitted pilots, i.e., $\mathrm{tr}(\mathbf{P}\mathbf{P}^{\mathsf T}) = c$. Again, in terms of the minimization problem given by (6), the feasible set is then defined by
$$\mathcal{P} = \left\{\mathbf{P} : \mathrm{tr}(\mathbf{P}\mathbf{P}^{\mathsf T}) = c\right\}.$$

In (10), a GM distribution on the channel can account for multiple fading situations. This can be useful, for example, if the source is assumed to transmit from multiple locations. Then the commonly used Rice distribution is unlikely to accurately capture the channel statistics associated with all transmit locations (especially so in urban areas). In fact, it has been experimentally observed and reported in [13] that different transmit locations are indeed associated with different channel statistics. A GM distributed channel, with multiple modes, has the potential to capture this. As for the noise, one may either assume that $\mathbf{n}$ is pure background noise, or that $\mathbf{n}$ represents noise and interference. In the former case a Gaussian distribution may be justifiable, whereas in the latter a GM distribution may be more suitable [14], [15].

Note that the assumption that a channel realization can originate from several underlying distributions is not novel. For instance, all studies assuming channels governed by a Markov model make this assumption; see, e.g., [16], [17] and the references therein. A GM is a special case of a hidden Markov model, where subsequent observations are independent, rather than governed by a Markov process. In spite of this, to the best of our knowledge, pilot optimization for estimating channels governed by a GM distribution has not been considered in the literature.

C. Gaussian Mixture Distributions

While aimed at minimizing the MSE, most optimization studies on linear precoders [18], [19] or pilot signals [9]–[11], [20]–[22] utilize only the first and second moments of the input distributions. Commonly, the underlying motivation is that a linear MMSE (LMMSE) estimator is employed. The LMMSE estimator¹ relies only on first and second order statistics, which conveniently tends to simplify the associated matrix design problem. In fact, the desired matrix can often be obtained as the solution of a convex optimization problem. It is known, however, that the LMMSE estimator is optimal only in the special case when the random signals $\mathbf{x}$ and $\mathbf{n}$ are both Gaussian. For all other cases, the LMMSE estimator is suboptimal.

In practice, purely Gaussian inputs are rare. In general, the input distributions may be asymmetric, heavy-tailed and/or multi-modal. A type of distribution that can accommodate all of these cases is the Gaussian Mixture (GM) distribution. In fact, a GM can in theory represent any distribution with arbitrary accuracy [23], [24]. Therefore, in this work, we assume that the inputs are GM distributed as in (2) and (3). Notation (2) should be read in the distributional sense, where $\mathbf{x}$ results from an imagined composite experiment. First, source $k \in \mathbb{K}$ is activated with probability $p_k$. Second, that source generates a Gaussian signal with distribution law

¹Among all estimators which are linear (affine) in the observations, the LMMSE estimator obtains the smallest MSE.


$\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. For any realized $\mathbf{x}$, however, the underlying index $k$ is not observable. The noise $\mathbf{n}$ emerges in an entirely similar, but independent, manner. $\mathbb{K}$ and $\mathbb{L}$ are index sets. In theory, it suffices that these sets are countable, but in practice they must be finite. Clearly, when $\mathbb{K}$ and $\mathbb{L}$ are singletons, one falls back to the familiar case of Gaussian inputs. A minimal sampling sketch of this composite experiment is given below.
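The sketch (our notation; the two-component weights, means and covariances are placeholder values) draws GM samples by first drawing the component index and then the Gaussian, and then discards the index:

```python
import numpy as np

def sample_gm(weights, means, covs, size, rng):
    """Draw `size` samples from a Gaussian mixture by the composite
    experiment: pick component k with probability p_k, then draw from
    N(mu_k, Sigma_k). The realized index k is discarded, mirroring
    that it is not observable."""
    ks = rng.choice(len(weights), size=size, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])

rng = np.random.default_rng(1)
weights = [0.5, 0.5]                                  # placeholder mixture weights
means = [np.array([3.0, 0.0]), np.array([-3.0, 0.0])]  # placeholder means
covs = [np.eye(2), np.eye(2)]                          # placeholder covariances
x = sample_gm(weights, means, covs, size=1000, rng=rng)
```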

The mixture parameters, e.g., the weights $\{p_k\}$, means $\{\boldsymbol{\mu}_k\}$ and covariances $\{\boldsymbol{\Sigma}_k\}$, are rarely given a priori. Most often they must be estimated, which is generally a non-trivial task [23], [25]. A common approach is to estimate the GM parameters from training data. The expectation-maximization (EM) algorithm is well suited, and much used, for that purpose [1], [26], [27]. Briefly, the algorithm relies on observations drawn from the distribution we wish to parametrize, and some initial estimate of the parameters. The observations are used to update the parameters, iteratively, until convergence to a local maximum of the likelihood function. Because the resulting GM parameters depend on the initial estimates, the algorithm can alternatively be started from multiple initial estimates. This produces multiple sets of GM parameters, and each set can be assigned probabilities based on the training data. Our starting point is that the distributions in (2) and (3) have resulted from such, or similar, model fitting.
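Such model fitting is routine in practice. As a minimal sketch, assuming scikit-learn is available, its GaussianMixture class runs EM from several initial estimates and keeps the fit with the highest likelihood; the data array x is the placeholder sample from the previous sketch:

```python
from sklearn.mixture import GaussianMixture

# Fit a 2-component GM to training data x (samples in rows);
# n_init restarts EM from several initial estimates and keeps
# the fit with the highest likelihood.
gm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(x)
print(gm.weights_, gm.means_)    # estimated p_k and mu_k
print(gm.covariances_.shape)     # estimated Sigma_k
```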

D. Problem Properties and Contributions

Regardless of the underlying application, solving optimization problem (6) is not straightforward. In particular, three hurdles stand out. Firstly, with (2) and (3) as inputs to (1), the MMSE in (4) has no analytical closed form [28]. Thus, the effect of any matrix $\mathbf{H}$, in terms of MMSE, cannot be evaluated exactly. Secondly, as we shall see in Section III, the MMSE is generally not convex in $\mathbf{H}$. Thirdly, as will be argued in the Appendix, the first and second order derivatives of the MMSE w.r.t. $\mathbf{H}$ cannot be calculated exactly, and accurate approximations are hard to obtain. For these reasons, and in order to make progress, we cast the problem as a stochastic program and invoke the Robbins-Monro algorithm [2], [3]. Very briefly, our approach goes as follows: we draw samples and use these to compute stochastic gradients of the MMSE. These feed into an iterative gradient method that involves projection. The contributions of the paper are several:
• As always, for greater accuracy, it is preferable to use exact gradients instead of finite difference approximations. For this reason the paper spells out a procedure for exact realization of stochastic gradients, as given in Section A of the Appendix. Accordingly, the Robbins-Monro algorithm comes to replace the Kiefer-Wolfowitz procedure.

• In the design phase, we exploit the known input statistics and update $\mathbf{H}$ based on samples of the inputs $(\mathbf{x}, \mathbf{n})$, instead of the output $\mathbf{y}$. This yields a closed form stochastic gradient, and we prove in the initial part of the Appendix that it is unbiased.

• Numerical experiments indicate that our method has far better accuracy than methods which proceed via linear estimators. The main reason is that the optimal estimator, used here, is non-linear.

• It turns out that the non-convexities of the MMSE may have practical implications that deserve to be better known. Specifically, in channel estimation, it can be harmful to increase the power of the pilot signal. This paper offers an explanation.

Clearly, in many practical problems, the quantities in (1) are complex-valued. Throughout this paper, however, they will all be assumed real. For the analysis, this assumption introduces no loss of generality. This can be seen by considering the case when the quantities are indeed complex. Then (1) can be written as
$$\mathbf{y}_R + i\,\mathbf{y}_I = \left(\mathbf{H}_R + i\,\mathbf{H}_I\right)\left(\mathbf{x}_R + i\,\mathbf{x}_I\right) + \mathbf{n}_R + i\,\mathbf{n}_I, \qquad (11)$$
where subscripts $R$ and $I$ denote real and imaginary parts respectively, and $i = \sqrt{-1}$. Note that a complex GM distribution on $\mathbf{x}$ implies that $\mathbf{x}_R$ and $\mathbf{x}_I$ are jointly GM distributed. That is, the real vector $[\mathbf{x}_R^{\mathsf T}, \mathbf{x}_I^{\mathsf T}]^{\mathsf T}$ has a GM distribution. The same argument goes of course for the real vector $[\mathbf{n}_R^{\mathsf T}, \mathbf{n}_I^{\mathsf T}]^{\mathsf T}$. Separating the real and imaginary parts of (11), the system of equations can be written as
$$\begin{bmatrix} \mathbf{y}_R \\ \mathbf{y}_I \end{bmatrix} = \begin{bmatrix} \mathbf{H}_R & -\mathbf{H}_I \\ \mathbf{H}_I & \mathbf{H}_R \end{bmatrix} \begin{bmatrix} \mathbf{x}_R \\ \mathbf{x}_I \end{bmatrix} + \begin{bmatrix} \mathbf{n}_R \\ \mathbf{n}_I \end{bmatrix}. \qquad (12)$$
Clearly, there is a one-to-one correspondence between (11) and (12). The latter model consists exclusively of real quantities, and has a similar form as (1). The only difference is that the transfer matrix now has a specific structure. In terms of designing $\mathbf{H}$, any candidate matrix must therefore obey this structure. But this can be enforced through the set $\mathcal{H}$ of matrices that $\mathbf{H}$ must belong to. Thus, we are essentially back to the same problem as (6), with all quantities being real. A small numerical check of this real reformulation is sketched below.
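The check uses arbitrary complex data and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 2
H = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
noise = rng.standard_normal(m) + 1j * rng.standard_normal(m)
y = H @ x + noise

# Structured real transfer matrix of (12)
H_real = np.block([[H.real, -H.imag],
                   [H.imag,  H.real]])
x_real = np.concatenate([x.real, x.imag])
n_real = np.concatenate([noise.real, noise.imag])
y_real = H_real @ x_real + n_real

assert np.allclose(y_real, np.concatenate([y.real, y.imag]))
```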

Model (1) with GM inputs (2) and (3) is quite generic, and we have indicated two signal processing applications where the matrix design problem appears. The solution to that problem is, however, essentially application independent. It should therefore be of interest also to readers with different applications in mind. To the best of our knowledge, it has not been pursued in the literature.

III. AN ILLUSTRATIVE EXAMPLE

Here we study a special instance of the matrix design problem where the MMSE for all feasible precoders can be plotted. In general this is of course not possible, but the following simple example reveals some fundamental properties of the problem. Assume that we wish to design a precoder, as in (7), where $\mathbf{A} = \mathbf{I}$ and $\mathbf{F}$ is restricted to be an orthogonal matrix. Thus, the norm of the precoded signal is equal to that of the non-precoded signal. Orthogonal (unitary) precoders are reported, e.g., in [29], [30]. From an implementation viewpoint, orthogonal precoding implies that the



signal of interest needs no amplification—it is only rotated. Because $\mathbf{A} = \mathbf{I}$, (7) simplifies to $\mathbf{y} = \mathbf{F}\mathbf{x} + \mathbf{n}$. Further, let $\mathbf{x}$ and $\mathbf{n}$ be independent and identically GM distributed as a two-component mixture with component means $\pm c\,\mathbf{e}_1$, where $c$ is a scalar and $\mathbf{e}_1$ is the unit vector along the $x$-axis. Assume initially that $\mathbf{F} = \mathbf{I}$, which corresponds to no rotation. In this case, Fig. 1(a) illustrates the densities of $\mathbf{x}$ (full circles) and $\mathbf{n}$ (dashed circles), when seen from above. They are identical and sit on top of each other. Now, with a precoder that rotates $\mathbf{x}$ by $\pi/2$, the densities of $\mathbf{F}\mathbf{x}$ and $\mathbf{n}$ will look like in Fig. 1(b). The latter configuration is preferable from an estimation viewpoint. This is clear from Fig. 2, where the MMSE is displayed as a function of all rotation angles between 0 and $2\pi$. As can be seen, a significant gain can be obtained by rotating by $\pi/2$ (or by $3\pi/2$). This gain is not due to a particularly favorable signal-to-noise ratio (SNR); because $\mathbf{F}$ is orthogonal, the SNR remains equal for all rotation angles. The MMSE gain is instead due to the rotation producing a signal which tends to be orthogonal to the noise, as Fig. 1(b) indicates.

Fig. 1. (a) Densities without any rotation. (b) The effect of rotating $\mathbf{x}$ by $\pi/2$.

Fig. 2. MMSE versus rotation angle.

The above example is a special case of the matrix design problem, where $\mathbf{H}$ in (1) is restricted to be orthogonal. It is clear that $\mathbf{F}$ plays a decisive role in how accurately $\mathbf{x}$ can be estimated. An almost equally important observation, however, is that the MMSE is not convex in $\mathbf{F}$. Hence, in general, we cannot expect that first order optimality (zero gradient) implies a global minimum. When studying the channel estimation problem in Section VI.B, we will see an implication of this non-convexity, which is perhaps not well known: in certain cases the MMSE of the channel estimator does not decrease with increasing pilot power. On the contrary, the MMSE may in fact increase. A Monte Carlo sketch of the MMSE-versus-angle curve of this example is given below.
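The sketch (Python/NumPy/SciPy, our own function names) implements the standard closed-form GM posterior mean, which is the general form later given in (18)–(19), and sweeps the rotation angle by Monte Carlo. The parameters (equal weights, means $\pm c\,\mathbf{e}_1$ with $c = 3$, unit covariances) are placeholders, since the exact values behind Fig. 2 are not recoverable here; the qualitative dip of the MMSE near $\pi/2$ is what the sketch illustrates:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gm_posterior_mean(H, y, px, mux, Sx, pn, mun, Sn):
    """MMSE estimate x_hat(y) for y = Hx + n with GM inputs; the
    standard closed form that (18)-(19) express."""
    num, den = 0.0, 0.0
    for pk, mk, Sk in zip(px, mux, Sx):
        for ql, nl, Ol in zip(pn, mun, Sn):
            m = H @ mk + nl                          # component mean of y
            C = H @ Sk @ H.T + Ol                    # component covariance of y
            w = pk * ql * multivariate_normal.pdf(y, mean=m, cov=C)
            cond = mk + Sk @ H.T @ np.linalg.solve(C, y - m)  # Gaussian case
            num += w * cond
            den += w
    return num / den

def mmse_vs_angle(c=3.0, trials=4000, angles=np.linspace(0, 2 * np.pi, 25)):
    """Monte Carlo MMSE(theta) for the 2-D rotation example."""
    rng = np.random.default_rng(0)
    px = pn = [0.5, 0.5]
    mus = [np.array([c, 0.0]), np.array([-c, 0.0])]
    covs = [np.eye(2), np.eye(2)]
    curve = []
    for th in angles:
        F = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]])    # rotation precoder
        err = 0.0
        for _ in range(trials):
            kx, kn = rng.choice(2), rng.choice(2)    # composite experiment
            x = rng.multivariate_normal(mus[kx], covs[kx])
            n = rng.multivariate_normal(mus[kn], covs[kn])
            y = F @ x + n
            xhat = gm_posterior_mean(F, y, px, mus, covs, pn, mus, covs)
            err += np.sum((x - xhat) ** 2)
        curve.append(err / trials)
    return angles, curve
```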

In the next section we rewrite the original minimization problem into an equivalent but more compact maximization problem. Then, in Section V, we present a stochastic optimization approach which provides a solution.

IV. AN EQUIVALENT MAXIMIZATION PROBLEM

The applications in Sections II.A and II.B represent special cases of the matrix design problem; in each case, $\mathbf{H}$ depends on another matrix of interest. In general, however, $\mathbf{H}$ can stand on its own and need not be a function of some other argument. Our objective is to propose a solution to the general matrix design problem, not only special cases of it. From here onwards, we therefore assume that $\mathbf{H}$ is the matrix of interest and make no assumptions about functional dependencies. We will present a framework for solving this general matrix design problem. Special cases of it can be handled by the exact same framework simply by using appropriate substitutions for $\mathbf{H}$. The substitutions for the applications in Sections II.A and II.B are provided in Section A of the Appendix.

the general matrix design problem than given in (6). To thatend we first rewrite expression (4). Using the results of [28],it follows that for model (1), under independent GM inputs (2)and (3), and a fixed , the MMSE can be written as

(13)

In (13), denotes the trace operator and is a (GM)probability density function

(14)

where

(15)

(16)

(17)

In (15), denotes transposition, denotes the determinant,is short for and is the length of . The

MMSE estimator in (13) can be written as

(18)


where
$$\hat{\mathbf{x}}_{kl}(\mathbf{y}) = \boldsymbol{\mu}_k + \boldsymbol{\Sigma}_k\mathbf{H}^{\mathsf T}\mathbf{C}_{kl}^{-1}\left(\mathbf{y} - \mathbf{m}_{kl}\right). \qquad (19)$$

In what follows, it is convenient to define
$$g(\mathbf{H}, \mathbf{y}) = \|\hat{\mathbf{x}}(\mathbf{y})\|^2. \qquad (20)$$
This notation emphasizes that the squared norm of the MMSE estimate depends on both $\mathbf{H}$ and the observation $\mathbf{y}$. In (13), only the integral depends on $\mathbf{H}$. Exploiting this, and using (20), the minimizer of the MMSE is that $\mathbf{H}$ which maximizes
$$J(\mathbf{H}) = \int g(\mathbf{H}, \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}, \qquad (21)$$
subject to
$$\mathbf{H} \in \mathcal{H}. \qquad (22)$$
The integral in (21) cannot be evaluated analytically, even for a fixed and known $\mathbf{H}$ [28]. Moreover, as the example in Section III illustrated, the MMSE is generally not convex in $\mathbf{H}$, which implies that $J(\mathbf{H})$ is generally not concave. Hence, any optimization method that merely aims at first order optimality does in general not produce a global maximizer for $J$. Finally, as argued in Section B of the Appendix, neither first nor second order derivatives of $J(\mathbf{H})$ w.r.t. $\mathbf{H}$ can be computed exactly, and accurate approximations of these are hard to obtain. A sample-average approximation of (21) is sketched below.
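Although (21) has no analytical form, it can be approximated by a sample average over simulated inputs, which is precisely how the algorithm of Section V will proceed. A minimal sketch (our names; posterior_mean can be the gm_posterior_mean function from the Section III sketch):

```python
import numpy as np

def J_hat(H, sample_inputs, posterior_mean, T=5000):
    """Sample-average approximation of the objective (21),
    J(H) = E[ ||x_hat(y)||^2 ], using simulated inputs (x, n)."""
    total = 0.0
    for _ in range(T):
        x, n = sample_inputs()
        y = H @ x + n
        total += np.sum(posterior_mean(H, y) ** 2)   # g(H, y) of (20)
    return total / T
```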

V. THE ROBBINS-MONRO SOLUTION

The above observations suggest that a sampling-based approach is the only viable option. The problem of maximizing a non-analytical expectation $J(\mathbf{H}) = \mathbb{E}[g(\mathbf{H}, \mathbf{y})]$ over a parameter $\mathbf{H}$ falls under the umbrella of stochastic optimization. In particular, for our problem, the Robbins-Monro algorithm [2], [3] can be used to move iteratively from a chosen initial matrix $\mathbf{H}_0$ to a local maximizer. The philosophy is to update the current matrix using the gradient of the MMSE. Since that gradient cannot be calculated exactly, however, one instead relies on a stochastic approximation. Translated to our problem, the idea is briefly as follows. Although (21) cannot be computed analytically, it can be estimated from a set of generated sample vectors
$$\mathbf{y}_t = \mathbf{H}\mathbf{x}_t + \mathbf{n}_t, \quad t = 1, \ldots, T, \qquad (23)$$
as
$$\hat{J}(\mathbf{H}) = \frac{1}{T}\sum_{t=1}^{T} g(\mathbf{H}, \mathbf{y}_t). \qquad (24)$$
In (23), $\mathbf{x}_t$ and $\mathbf{n}_t$ are independent realizations of the signal and noise, as generated by (2) and (3) respectively. The derivative of (24) w.r.t. $\mathbf{H}$ represents an approximation of $\nabla J(\mathbf{H})$, which can be used to update the current $\mathbf{H}$. Each update is then projected onto the feasible set $\mathcal{H}$. This is the core idea of the much celebrated Robbins-Monro (RM) algorithm [2]. In our context, the algorithm can be outlined as follows.
• Let the current matrix be $\mathbf{H}_j$.

• Draw $(\mathbf{x}_j, \mathbf{n}_j)$ at random and compute $\mathbf{y}_j = \mathbf{H}_j\mathbf{x}_j + \mathbf{n}_j$.
• Calculate the update direction (stochastic gradient) as
$$\mathbf{G}_j = \left.\frac{\partial\, g(\mathbf{H}, \mathbf{y}_j)}{\partial \mathbf{H}}\right|_{\mathbf{H} = \mathbf{H}_j}, \qquad (25)$$
and the tentative update
$$\tilde{\mathbf{H}}_{j+1} = \mathbf{H}_j + \gamma_j\,\mathbf{G}_j, \qquad (26)$$
where $\{\gamma_j\}$ is an infinite sequence of step sizes satisfying $\sum_j \gamma_j = \infty$ and $\sum_j \gamma_j^2 < \infty$.
• Update the matrix as
$$\mathbf{H}_{j+1} = \Pi_{\mathcal{H}}\!\left(\tilde{\mathbf{H}}_{j+1}\right), \qquad (27)$$
where $\Pi_{\mathcal{H}}$ represents the projection onto the set of permissible matrices $\mathcal{H}$.

• Repeat all steps until convergence.
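A compact sketch of this loop follows (our names). Note the hedge in the comments: the paper's update uses the closed-form stochastic gradient of Lemma 1 in the Appendix, whereas this self-contained sketch substitutes two-sided finite differences, which strictly speaking makes it a Kiefer-Wolfowitz-type surrogate; the sampling, step-size and projection machinery is the same:

```python
import numpy as np

def robbins_monro(H0, sample_inputs, g, project, n_iter=2000,
                  gamma0=0.1, eps=1e-5):
    """Projected stochastic-approximation loop of Section V.

    Per the Appendix, inputs (x, n) are sampled so that the stochastic
    gradient of g_tilde(H) = g(H, Hx + n) is an unbiased estimate of
    the gradient of (21). Step sizes gamma_j = gamma0 / (j + 1)
    satisfy sum gamma_j = inf and sum gamma_j^2 < inf.

    Caveat: the paper derives the closed-form gradient (Lemma 1);
    the finite differences below are only a stand-in to keep the
    sketch self-contained.
    """
    H = H0.astype(float).copy()
    for j in range(n_iter):
        x, n = sample_inputs()
        G = np.zeros_like(H)
        for idx in np.ndindex(*H.shape):     # crude entrywise differences
            E = np.zeros_like(H)
            E[idx] = eps
            G[idx] = (g(H + E, (H + E) @ x + n)
                      - g(H - E, (H - E) @ x + n)) / (2 * eps)
        H = project(H + gamma0 / (j + 1) * G)   # update (26), projection (27)
    return H
```

Here $g(\mathbf{H}, \mathbf{y}) = \|\hat{\mathbf{x}}(\mathbf{y})\|^2$ per (20), computable via the gm_posterior_mean sketch, and project could be, e.g., the orthogonality projection of Section VI.A below.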

A. Remarks on the Robbins-Monro Algorithm

As for the starting point $\mathbf{H}_0$: if there is no reason to believe that a particular candidate is better than any other, then an arbitrary initial matrix can be chosen. On the other hand, if the designer is more informed, he may of course choose $\mathbf{H}_0$ more cleverly. Because the final solution does depend on the starting point, one possibility is to initiate the algorithm from multiple starting points and select the best outcome.

The above algorithm is presented in terms of $\mathbf{H}$. When deriving a linear precoder or a pilot matrix, we are more interested in the matrix $\mathbf{F}$ in (7), or $\mathbf{P}$ in (10). By invoking Lemma 1 in Section A of the Appendix, and the substitutions explained there, the above algorithm can be used in both cases. These substitutions guarantee that the underlying structure on $\mathbf{H}$ is taken into account, and they are straightforward to implement.

Recall that the input statistics (2), (3) are assumed known. Therefore, in a design phase, it is reasonable to assume that the inputs can be sampled to compute $\mathbf{y}_j$, as indicated in the second step of the algorithm. The alternative would be to sample $\mathbf{y}_j$ directly, leaving the underlying $\mathbf{x}_j$ and $\mathbf{n}_j$ unknown. The first approach is preferred because it guarantees that (25) becomes an unbiased estimate of $\nabla J(\mathbf{H})$, whereas the alternative does not. This important point is fully explained in the initial part of the Appendix. In general, the RM procedure does not require observing the input realizations $(\mathbf{x}_j, \mathbf{n}_j)$; the algorithm converges also when only outputs $\mathbf{y}_j$ are available. For this reason we write (25) in terms of $\mathbf{y}_j$, but for our implementation we will assume that the underlying inputs are fully known.

Because the stochastic gradient in (25) is central in the algorithm, its closed form expression is derived using Lemma 1 in Section A of the Appendix. Clearly, the stochastic gradient is random because it relies on a random realization of $\mathbf{y}$. It can be seen as an estimator of $\nabla J(\mathbf{H})$ based on a single random observation vector. Depending on the input, the stochastic gradient may point in various directions; also in directions opposite to $\nabla J(\mathbf{H})$. In order to increase the probability of a beneficial update, one can alternatively compute the stochastic gradient as


an average based on multiple ’s, as suggested in (24). Then(25) would modify to

In our implementation of the algorithm, however, we do notdo this. In fact, it was recognized by Robbins and Monro, thatchoosing large is generally inefficient. The reason is thatis only intermediate in the calculations, and as argued in theAppendix, regardless of the value of , the stochastic gradientcan be chosen such that it coincides with in expectation.The RM procedure does not rely on the existence of atall points. If this derivative is discontinuous, one can instead useany of its sub-gradients; all of which are defined. Consequently,if the local maximum towards which the algorithm convergeshas a discontinuous derivative, then the algorithm will oscil-late around this point. Due to the decaying step sizes, however,the oscillations will eventually become infinitesimal, and for allpractical purposes, the system comes to rest.For unconstrained problems, when there are no restrictions

on , the projection in (27) can be left out. The solu-tion for an unconstrained problem is an with infinite norm,magnifying the signal of interest beyond boundaries, and thusmarginalizing the effect of the noise. In terms of precoding orpilot signals, this translates to transmit amplifiers that consumeinfinite power and that cannot be realized. Unconstrained prob-lems are therefore hardly of any practical interest. In the morerealistic constrained case, the matrix in (26) may not re-side in the feasible set . In such cases, the projection

$\Pi_{\mathcal{H}}$ is necessary to guarantee that $\mathbf{H}_{j+1} \in \mathcal{H}$. This projection is commonly defined as
$$\Pi_{\mathcal{H}}(\tilde{\mathbf{H}}) = \arg\min_{\mathbf{H} \in \mathcal{H}} \|\mathbf{H} - \tilde{\mathbf{H}}\|_F. \qquad (28)$$
In words, one selects that matrix in $\mathcal{H}$ which has minimum Euclidean distance to $\tilde{\mathbf{H}}$ as the update. If $\tilde{\mathbf{H}}$ is already in $\mathcal{H}$, the solution to the projection is clearly $\tilde{\mathbf{H}}$ itself. Note that when the feasible set $\mathcal{H}$ is convex, the projection defined by (28) is the solution to a convex optimization problem. On the other hand, if $\mathcal{H}$ is non-convex, one may have to settle for local optimization methods. There are cases, however, where $\mathcal{H}$ is non-convex, but one can still solve (28) optimally. In fact, the two numerical examples of Section VI represent two such cases, and optimal projection becomes very simple in both of them. For the interested reader, more literature on projections can be found in [3] and the references therein.

only as . Therefore, in theory, the algorithmmust run for-ever in order to converge. The engineering solution, which tendsto work well in practice, is to terminate the algorithm whenever

, where is a chosen threshold, or simplyafter a predefined number of iterations. Still, the running timemay be non-negligible: If is an matrix, and we de-fine , then it can be verified that the computationalcomplexity per iteration of the RM algorithm is in the order of

multiplications and additions. Thereforethe Robins-Monro procedure is best suited for problems withlimited dimensions, and when the transfer matrix can be com-puted once and for all. The latter would imply stationary inputsignals.In general, for other problems than considered here, it may

happen that the functional form of is unknown, evenwhen its output can be observed for any and . In this case,

cannot be computed. Instead one may replace it by afinite difference approximation [31]. In some cases, this mayalso be preferable even when the derivative can be computed;especially so if computing requires much more effort.When finite difference approximation are used, the procedureis known as the Kiefer-Wolfowitz algorithm [3], [31]. If thederivative can be computed, however, the Kiefer-Wolfowitz al-gorithm is associated with more uncertainty (larger variance)than the Robbins-Monro procedure. For the interested reader,the present paper extends [32], which considers Kiefer-Wol-fowitz precoding.

VI. NUMERICAL RESULTS

In this section we will study two specific examples. One is on linear precoding; the other is on pilot design for channel estimation. In conformance with much of the literature, we will use the normalized MSE (NMSE) as performance measure². This is defined as
$$\mathrm{NMSE} = \frac{\mathbb{E}\|\mathbf{x} - \hat{\mathbf{x}}(\mathbf{y})\|^2}{\mathbb{E}\|\mathbf{x}\|^2}.$$
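A Monte Carlo version of this measure, under the zero-mean assumption of the footnote (our names), is straightforward:

```python
import numpy as np

def nmse_db(H, sample_inputs, estimator, trials=20000):
    """Monte Carlo NMSE = E||x - xhat||^2 / E||x||^2, in dB."""
    num = den = 0.0
    for _ in range(trials):
        x, n = sample_inputs()
        xhat = estimator(H, H @ x + n)
        num += np.sum((x - xhat) ** 2)
        den += np.sum(x ** 2)
    return 10 * np.log10(num / den)
```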

A. Precoder Design

Here we study the performance of a Robbins-Monro precoder. As in the simple example of Section III, we restrict the precoder to be orthogonal. Imposing the precoder to be an orthogonal matrix makes the projection in the RM algorithm particularly simple, as we shall see. For the current example we choose the following parameters.
• $\mathbf{A} = \mathbf{I}$.
• The signal $\mathbf{x}$ is GM distributed with fixed, known parameters.
• The noise $\mathbf{n}$ is Gaussian and distributed as
$$\mathbf{n} \sim \mathcal{N}\!\left(\mathbf{0}, \sigma^2\,\mathbf{C}_n\right), \qquad (29)$$

²Assuming $\mathbf{x}$ to be a zero mean signal, the NMSE is never larger than 1 (zero dB). The reason is that the MMSE estimator, $\hat{\mathbf{x}}(\mathbf{y})$, will coincide with the prior mean of $\mathbf{x}$ only when the SNR tends to zero. Hence, the prior mean is a worst case estimate of $\mathbf{x}$, and the NMSE describes the relative gain over the worst case estimate.


where $\sigma^2$ is a scalar that can account for any chosen SNR, and $\mathbf{C}_x$ and $\mathbf{C}_n$ are the covariance matrices of $\mathbf{x}$ and $\mathbf{n}$ respectively.
• We use a fixed $\mathbf{F}_0$ as the initial guess in the RM algorithm.
• As stopping criterion we use $\|\mathbf{F}_{j+1} - \mathbf{F}_j\|_F < \epsilon$ for a small threshold $\epsilon$.
The RM algorithm described in Section V is employed using Lemma 1 in Section A of the Appendix, and the substitution described there. For the projection in (27), we choose the nearest orthogonal matrix. This projection is the solution to the following optimization problem:
$$\Pi_{\mathcal{O}}(\tilde{\mathbf{F}}) = \arg\min_{\mathbf{F} \in \mathcal{O}} \|\mathbf{F} - \tilde{\mathbf{F}}\|_F,$$
where $\mathcal{O}$ is the set of orthogonal matrices. It can easily be verified that this set is non-convex. The optimal solution is nevertheless particularly simple, and exploits the singular value decomposition $\tilde{\mathbf{F}} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}$:
$$\Pi_{\mathcal{O}}(\tilde{\mathbf{F}}) = \mathbf{U}\mathbf{V}^{\mathsf T}.$$
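In code, this projection is two lines (a sketch; project_orthogonal is our name):

```python
import numpy as np

def project_orthogonal(F_tilde):
    """Nearest orthogonal matrix in Frobenius norm: with the SVD
    F_tilde = U S V^T, the minimizer of ||F - F_tilde||_F over
    orthogonal F is U V^T (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(F_tilde)
    return U @ Vt

F = project_orthogonal(np.random.default_rng(4).standard_normal((3, 3)))
assert np.allclose(F @ F.T, np.eye(3))   # sanity check: F is orthogonal
```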

In Fig. 3, the two lower curves display the NMSE with and without precoding, associated with the MMSE estimator, for increasing SNR levels. As can be seen, RM precoding provides a significant NMSE gain, especially at SNR levels between 0 and 10 dB. The upper curve shows the NMSE of the LMMSE estimator when its corresponding optimal precoder, implemented as in [19], is used. In order to make a 'fair' comparison with the MMSE estimator, we require for the LMMSE estimator that its precoder satisfies the same power constraint. Observe that this implies that the LMMSE precoder is not restricted to be an orthogonal matrix. Still, we see that the LMMSE estimator is highly suboptimal at intermediate SNR levels. In fact, it is even worse than doing MMSE estimation without any precoding. The reason is that the LMMSE estimator is linear in the data, and that it only makes use of first and second order statistics. Even though the LMMSE estimator is helped by a precoder, that precoder is inherently matched to a highly suboptimal estimator. The optimal MMSE estimator (18), on the other hand, is non-linear in the data and makes use of the full knowledge of the underlying distributions.

From [28], we know that the LMMSE estimator becomes the MMSE estimator as the SNR tends to zero and as it tends to infinity. The short explanation is that, at zero SNR, both estimators do not trust the observed data. Instead they rely on the prior knowledge and always select the mean of $\mathbf{x}$ as their estimate. At infinite SNR, the situation is reversed: both estimators will fully trust the observed data, and disregard the prior knowledge. Thus, in the case of a non-singular transfer matrix, both estimators will simply use its inverse to recover $\mathbf{x}$. In terms of precoders, the implication is that the optimal LMMSE precoder will be optimal also for the MMSE estimator in these two extreme cases. For intermediate SNR levels, however, Fig. 3 reminds us that the LMMSE estimator may be quite far from optimal. It is nevertheless widely used, because it is straightforward to implement.

The above example indicates that our method generates a reasonable precoder for a particular setup. Admittedly, there exist other examples for which the gain is much less significant. However, in all simulations we have carried out, the clear tendency is


that an RM precoder/MMSE receiver outperforms the LMMSE precoder/LMMSE receiver at intermediate SNR levels.

Fig. 3. The NMSE with and without precoding.

B. Pilot Design for Channel Estimation

Also for the channel estimation problem, the RM pilot matrix/MMSE receiver outperforms the LMMSE pilot matrix/LMMSE receiver at intermediate SNR levels. In the next example we choose parameters in order to highlight this, and one additional property. That property is a direct consequence of the non-convex nature of the MMSE. We believe it to be of interest, but not well known. The starting point is the channel estimation problem in (9), where we assume that all matrices are 2 × 2. In the corresponding vectorized model
$$\mathrm{vec}(\mathbf{Y}) = \left(\mathbf{P}^{\mathsf T} \otimes \mathbf{I}_2\right)\mathrm{vec}(\mathbf{H}) + \mathrm{vec}(\mathbf{N}), \qquad (30)$$
we assume the following parameters.
• The vectorized channel, $\mathrm{vec}(\mathbf{H})$, is Gaussian distributed.
• The vectorized noise, $\mathrm{vec}(\mathbf{N})$, is GM distributed with parameters
(31)
• As constraint we impose that $\mathrm{tr}(\mathbf{P}\mathbf{P}^{\mathsf T}) = c$, where $c$ is a positive scalar that can account for any chosen pilot power, and therefore also any SNR. Observe that this constraint defines a non-convex set.
• Stopping criterion: $\|\mathbf{P}_{j+1} - \mathbf{P}_j\|_F < \epsilon$ for a small threshold $\epsilon$.
Again, the RM algorithm described in Section V is employed using Lemma 1 in Section A of the Appendix, and the substitution described there. As starting point, we use two different candidates: (i) a scaled identity matrix satisfying the power constraint, and (ii) the pilot matrix that is optimal for the LMMSE receiver (satisfying the same power constraint). During the iterations we rely on the following simple



projection: if the candidate pilot matrix $\tilde{\mathbf{P}}$ has power $\tilde{c} = \mathrm{tr}(\tilde{\mathbf{P}}\tilde{\mathbf{P}}^{\mathsf T}) \ne c$, then $\mathbf{P} = \sqrt{c/\tilde{c}}\;\tilde{\mathbf{P}}$. Thus, if the pilot matrix does not use the entire power budget, the magnitudes of all its elements are increased. Similarly, if the pilot matrix has too much power, the magnitudes of its elements are decreased. Observe that this projection does indeed produce the closest matrix (in the Euclidean sense) in the feasible set. The constraint $\mathrm{tr}(\mathbf{P}\mathbf{P}^{\mathsf T}) = c$ defines the shell of a hypersphere. The projection drags any matrix residing off the shell onto the shell, using the shortest possible path (which is orthogonal to the shell). A minimal sketch of this projection is given below.
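The sketch (our names; it assumes the candidate matrix is nonzero):

```python
import numpy as np

def project_power_shell(P_tilde, c):
    """Projection onto {P : tr(P P^T) = c}: radial scaling onto the
    hypersphere shell ||P||_F = sqrt(c), the shortest (orthogonal)
    path to the shell."""
    c_tilde = np.sum(P_tilde ** 2)       # current power tr(P P^T)
    return np.sqrt(c / c_tilde) * P_tilde
```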

creasing values of ). It can bee seen that for SNR values be-tween dB and 0 dB, the commonly used LMMSE channelestimator/LMMSE pilot matrix, implemented as in [10], hasmuch poorer performance than the other schemes, which areall based on MMSE estimation. Again, the explanation is thatLMMSE estimator only makes use of lower order statistics,whereas the MMSE estimator incorporates the full knowledgeof the problem. As indicated in Fig. 4, for this range of SNR’s,it is better to simply transmit a scaled identity pilot matrix fol-lowed by MMSE estimation. As the SNR in creases and tendsto infinity, however, it is known that the LMMSE estimator be-comes optimal [28]. Correspondingly, the LMMSE pilot matrixbecomes optimal, also for the MMSE estimator.The curve marked with ‘RM-pilot(1), MMSE receiver’ de-

notes the Robbins-Monro approach with a scaled identity ma-trix as starting point. Observe that it yields best performancebetween dB and 0 dB, whereas for SNR larger than 0 dB theperformance is similar to actually transmitting a scaled identityas the pilot matrix. Hence, for dB, a scaled identity pilotmatrix may be a local minimizer, from which the RM algorithmdoes not easily escape. The curve marked with ‘RM-pilot(2),MMSE receiver’ also denotes Robbins-Monro approach, butwith the LMMSE pilot matrix as starting point. Observe that thisgives improved performance for dB, but slightly worseperformance in the range from dB to 0 dB. This underlinesthe starting point of the RM procedure may largely affect the


end result. One possibility is therefore to initiate the algorithm from multiple starting points, and select the best outcome for the SNR range of interest.

Fig. 5. The NMSE as a function of the scalar $c$.

The most striking observation in Fig. 4, however, is perhaps that the MMSE may become less accurate when the pilot power is increased! Moreover, this is not an artifact of the RM algorithm; the same tendency is clearly seen also when a scaled identity is used as the pilot matrix. We discuss this in more detail in the next section.

C. Increased Pilot Power ≠ Improved Channel Estimates

We believe that the above phenomenon is not well known, and that it deserves to be explained. In order to visualize what happens, we will consider an example of smaller dimensions, but with similar properties as in the previous subsection. Specifically, we will assume that the unknown channel matrix $\mathbf{h}$ is 2 × 1, and that the pilot signal is just a scalar, $p$. In this setup, we will not optimize anything; we will merely study the NMSE as a function of increasing values for $p$. Using (10) it follows that $\mathbf{y} = p\,\mathbf{h} + \mathbf{n}$. We assume the following parameters:
• $p^2 = c$, where $c$ is a scalar that we can vary.
• The signal (channel) $\mathbf{h}$ is Gaussian distributed.
• The noise $\mathbf{n}$ is GM distributed with parameters
(32)

In Fig. 5, the NMSE is plotted as a function of increasing values for the scalar $c$. We observe the same tendency as in Fig. 4: for increasing values of $c$ (corresponding to increasing pilot power in Fig. 4), the NMSE may increase. In Fig. 5, we also plot the NMSE that would be obtained by a genie-aided estimator [28]. Briefly, the genie-aided estimator knows from which underlying Gaussian source the noise originates for each observation $\mathbf{y}$. Accordingly, it can always produce the MMSE estimate corresponding to a purely Gaussian model. The genie-aided estimator can of course not be implemented in practice, but because it is



much better informed than the MMSE estimator, it provides a lower bound on the NMSE. Yet, from Fig. 5 we see that for small $c$, the MMSE estimator is able to pin-point the correct noise component. A plausible explanation is the following. For small $c$, almost all realizations of $p\,\mathbf{h}$ are close to the origin. Thus, observations $\mathbf{y}$ tend to appear in two distinct clusters; one cluster centered at each noise component. As a consequence, the active noise component can essentially always be identified. As $c$ grows, $p\,\mathbf{h}$ takes values in an increasingly larger area, and the cluster borders approach each other. There is a largest value of $c$ for which the clusters can still be 'perfectly' separated; this value corresponds to the local minimum in Fig. 5. Because we are considering 2-dimensional random vectors, we can actually visualize these clusters. The upper part of Fig. 6 shows how 400 independent $\mathbf{y}$'s form two nearby, but separable, clusters generated at that pilot power. When $c$ grows beyond this level, the receiver faces a larger identification problem: it is harder to tell which noise component was active. The lower part of Fig. 6 shows 400 independent $\mathbf{y}$'s generated at the pilot power corresponding to the local maximum in Fig. 5. Here the clusters largely overlap. As $c$ continues to grow, however, the average magnitude of a noise contribution becomes so small compared to the average magnitude of $p\,\mathbf{h}$ that near perfect recovery of $\mathbf{h}$ eventually becomes possible.

Fig. 6. Sampled observations for two values of $c$.

Translated to the channel estimation problem in Fig. 4, the interpretation is that there is a continuous range where increasing the pilot power is harmful. From Fig. 4, one observes that, unless one can spend an additional 15 dB (approximately) on pilot power, one will not improve from the local minimum solution. A small sampling sketch of the two-cluster effect is given below.
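The effect is easy to reproduce by sampling (a sketch with placeholder parameters, since the exact values behind Fig. 6 are not recoverable here):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 1.5                                        # scalar pilot (placeholder value)
h = rng.standard_normal((400, 2))              # Gaussian 2x1 channels (placeholder)
idx = rng.choice(2, size=400)                  # active noise component per sample
centers = np.array([[4.0, 0.0], [-4.0, 0.0]])  # placeholder GM noise means
n = centers[idx] + 0.3 * rng.standard_normal((400, 2))
y = p * h + n
# For small p the two clusters (grouped by idx) are separable, so the
# receiver can identify the active noise component; as p grows the
# clusters merge and identification degrades, as in Fig. 6.
```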

VII. CONCLUSION

We have provided a framework for solving the matrix design problem of the linear model under Gaussian mixture statistics. The study is motivated by two applications in signal processing. One concerns the choice of error-reducing precoders; the other deals with selection of pilot matrices for channel estimation. In either setting we use the RM procedure to arrive at a solution. Our numerical results indicate improved estimation accuracy at intermediate SNR levels; markedly better than that obtained by optimal design based on the LMMSE estimator.

Although the RM algorithm in theory only converges asymptotically, in practice we see that a hard stopping criterion may work well. The algorithm is still computationally demanding, and therefore best suited under stationary settings and for problems with limited dimensions.

We have explored an interesting implication of the non-convexity of the MMSE; namely a case where spending more pilot power gives worse channel estimates. This phenomenon is not linked to the stochastic optimization procedure. It can be observed without optimizing at all, and we have offered a plausible explanation.

APPENDIX

This Appendix explains how to derive a closed form expression for the gradient direction in (25), where $g(\mathbf{H}, \mathbf{y})$ is defined through (18)–(20). To that end, it is worth observing that when optimizing, it is beneficial if the designer can draw samples directly from the inputs $\mathbf{x}$ and $\mathbf{n}$, and not only the output $\mathbf{y}$. In order to see why, assume in what follows that the order of differentiation and integration can be interchanged such that
$$\frac{\partial}{\partial \mathbf{H}} \int g(\mathbf{H}, \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y} = \int \frac{\partial}{\partial \mathbf{H}}\!\left[g(\mathbf{H}, \mathbf{y})\, p(\mathbf{y})\right] d\mathbf{y}. \qquad (33)$$
Such a change of order is justified for our problem, and it can be verified by invoking Lebesgue's Dominated Convergence Theorem. Now, if we can only observe outputs $\mathbf{y}$, we have
$$\mathbb{E}\!\left[\frac{\partial\, g(\mathbf{H}, \mathbf{y})}{\partial \mathbf{H}}\right] = \int \frac{\partial\, g(\mathbf{H}, \mathbf{y})}{\partial \mathbf{H}}\, p(\mathbf{y})\, d\mathbf{y} \neq \nabla J(\mathbf{H}),$$
because the density $p(\mathbf{y})$ itself depends on $\mathbf{H}$, so the term involving $\partial p(\mathbf{y})/\partial \mathbf{H}$ in (33) is left out. Hence, in this case, the update direction is not an unbiased estimator of the desired gradient $\nabla J(\mathbf{H})$. In contrast, assume that we can draw inputs $(\mathbf{x}, \mathbf{n})$, and define
$$\tilde{g}(\mathbf{H}, \mathbf{x}, \mathbf{n}) = g(\mathbf{H}, \mathbf{H}\mathbf{x} + \mathbf{n});$$
then
$$\mathbb{E}\!\left[\frac{\partial\, \tilde{g}(\mathbf{H}, \mathbf{x}, \mathbf{n})}{\partial \mathbf{H}}\right] = \int \frac{\partial\, \tilde{g}}{\partial \mathbf{H}}\, p(\mathbf{x})\, p(\mathbf{n})\, d\mathbf{x}\, d\mathbf{n} = \frac{\partial}{\partial \mathbf{H}} \int \tilde{g}\, p(\mathbf{x})\, p(\mathbf{n})\, d\mathbf{x}\, d\mathbf{n} = \nabla J(\mathbf{H}).$$
Here, the second equality holds because $p(\mathbf{x})\, p(\mathbf{n})$ is independent of $\mathbf{H}$. Hence, $\partial \tilde{g}/\partial \mathbf{H}$ is an unbiased estimator of $\nabla J(\mathbf{H})$, which is of course desirable. Because it is beneficial to sample $\mathbf{x}$ and $\mathbf{n}$, rather than just $\mathbf{y}$, we will assume here that the designer can do this. In practice, this implies that the optimization of $\mathbf{H}$ is done off-line, as preparation for the subsequent estimation.


In what follows, we will derive a closed form expression for $\partial \tilde{g}/\partial \mathbf{H}$, and use this as the update direction in (25). Although we assume knowledge of $(\mathbf{x}, \mathbf{n})$ for each observed $\mathbf{y}$, we will write $g(\mathbf{H}, \mathbf{y})$ instead of $\tilde{g}(\mathbf{H}, \mathbf{x}, \mathbf{n})$, and $\mathbf{y}$ instead of $\mathbf{H}\mathbf{x} + \mathbf{n}$, simply to save space.

Using (18), $g(\mathbf{H}, \mathbf{y})$ can be written as

(34)

Recall the difference between (33) and (34). The former is the gradient of the left-hand side of (24), whereas the latter is the gradient of the summand in (24). In order to compute (34), we make use of the following theorem [12].

Theorem 1: For a scalar function $\phi(\mathbf{X})$ of a matrix argument, the differential has the form
$$d\phi = \mathrm{tr}\!\left(\mathbf{A}^{\mathsf T}\, d\mathbf{X}\right) \;\Longleftrightarrow\; \frac{\partial \phi}{\partial \mathbf{X}} = \mathbf{A}.$$
In our case, we take $\phi$ to be the expression in the large parenthesis of (34). We will identify its differential, and exploit Theorem 1 in order to obtain the derivative. For that purpose, it is convenient to define

(35)

(36)

(37)

Using these, the derivative in (34) can then be compactly written as the derivative of a fraction. Using the chain rule, the differential of this fraction is

(38)

Thus, we must identify the three differentials entering (38), which we do next. Notation: in the remainder of this Appendix we use $\mathrm{tr}(\cdot)$ to compactly denote the trace operator.

Computing the first differential:

(39)

The differential in (39) is the differential of a Gaussian probability density function. In our case it depends on the indexes $(k, l)$, but in order to enhance readability, we will disregard these indexes in what follows. Thus, for now, we will use (14)–(19) with all indexes removed, and reincorporate the indexes when needed. In addition, we will for now disregard the constant factor in (15), and consider only the differential of the exponential part of the density. This can be written as

(40)

In the second equality we have used the chain rule, and exploited that the kernel is an exponential function. The first differential in (40), provided the covariance matrix is full rank, is (Theorem 1, ch. 8, of [12])

(41)

(42)

(43)

(44)

(45)

In (43), we have rotated the first trace (performed a cyclic permutation of the matrix product), and transposed the second trace. Because the covariance matrices involved are symmetric, they are not affected by transposition. The trace operator is invariant to such rotations and transpositions, and therefore these operations are justified. In (44) we have rotated the second term. Such rotations and transpositions will be frequently employed throughout. Introducing a shorthand for the recurring matrix product, the second differential of (40) can be written

(46)

The first term of (46) is

(47)

(48)

(49)

Equation (48) results from Theorem 3, ch. 8, of [12]. Observe that the matrix in (49) is symmetric, playing the same role as the corresponding matrix in (42). Therefore, we can utilize (45) and conclude that


(50)

Recall the shorthand introduced before (46). The second term of (46) can therefore be written as

(51)

Using (45), (50) and (51), and inserting into (40), we find that

(52)

If we define

where we reincorporate the indexes and the constant factor from (15), we now find that

(53)

Accordingly, (39) becomes

(54)

Computing the second differential:

(55)

Apart from a rearrangement of the indexes, (55) containstwo similar terms. Hence it suffices to compute one of them.

Recalling that $\hat{\mathbf{x}}_{kl}(\mathbf{y})$ is defined by (19), we focus on the differential

(56)

(57)

(58)

We will resolve this term by term. The first term, (56), reads

(59)

The second term, (57), can be written as

(60)

The third term, (58), reads

(61)

Using (59), (60) and (61) we now define


Due to its two similar terms, the differential in (55) can then bewritten

(62)

Computing the third differential:

(63)

The last equation results immediately by employing (53).

A. Computing the Derivative

In order to obtain the derivative, we first formulate the following lemma.

Lemma 1: Utilizing (54), (62) and (63), expression (38), which is the differential of the objective function in the matrix design problem, can be written as

(64)

Now it is a matter of applying Theorem 1 to (64) in order to identify the derivative. This is generally straightforward. In the case of the precoder design problem (7), one makes the substitutions $\mathbf{H} = \mathbf{A}\mathbf{F}$ and $d\mathbf{H} = \mathbf{A}\, d\mathbf{F}$ throughout. In the case of the pilot design problem (10), $\mathbf{H} = \mathbf{P}^{\mathsf T} \otimes \mathbf{I}$. In addition, assuming that $\mathbf{P}$ is $n \times p$, we have

In the latter equality, $\mathbf{K}_{np}$ is the Magnus and Neudecker commutation matrix [12].

If $\mathbf{H}$ in (64) is not a function of some other argument, identifying the derivative from (64) becomes particularly simple. Compactly collecting the terms in (64), we find from (34), (64), and Theorem 1 that the stochastic gradient is given by

(65)

B. First and Second Order Derivatives of the Objective Function

When trying to compute
(66)
the mixture densities in the denominators of (65) will not simplify under the integral. An entirely similar argument provides the reason why (21) cannot be computed analytically in the first place [28]. Hence, (66) cannot be computed analytically, and a closed form derivative of (21) w.r.t. $\mathbf{H}$ does not exist. A similar argument holds also for the second order derivative.

REFERENCES

[1] S. M. Kay, Fundamentals of Statistical Signal Processing: EstimationTheory. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1993.

[2] H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951.

[3] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Al-gorithms and Applications. Berlin, Germany: Springer-Verlag, 2003,vol. 35.

[4] D. Persson and T. Eriksson, “Mixture model- and least squares-basedpacket video error concealment,” IEEE Trans. Image Process., vol. 18,no. 5, pp. 1048–1054, May 2009.

[5] A. Kundu, S. Chatterjee, and T. V. Sreenivas, “Subspace based speechenhancement using Gaussian mixture model,” in Proc. Interspeech2008, Brisbane, Australia, Sept. 2008, pp. 395–398.

[6] A. Kundu, S. Chatterjee, and T. V. Sreenivas, “Speech enhancementusing intra-frame dependency in DCT domain,” in Proc. 16th Eur.Signal Process. Conf. (EUSIPCO 2008), Lausanne, Switzerland, Aug.25–29, 2008.

[7] A. D. Subramaniam, W. R. Gardner, and B. D. Rao, “Low-complexitysource coding using Gaussian mixture models, lattice vector quantiza-tion, and recursive coding with application to speech spectrum quanti-zation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp.531–532, Mar. 2006.

[8] G. Yu, G. Sapiro, and S. Mallat, “Solving inverse problems with piece-wise linear estimators: From Gaussian mixture models to structuredsparsity,” IEEE Trans. Image Process., vol. 21, no. 5, pp. 2481–2499,May 2011.

[9] D. Katselis, E. Kofidis, and S. Theodoridis, “On training optimizationfor estimation of correlated MIMO channels in the presence of mul-tiuser interference,” IEEE Trans. Signal Process., vol. 56, no. 10, pp.4892–4904, Oct. 2008.

[10] M. Biguesh and A. B. Gershman, “Training-based MIMO channel es-timation: A study of estimator tradeoffs and optimal training signals,”IEEE Trans. Signal Processing, vol. 54, no. 3, pp. 884–893, Mar. 2006.

[11] E. Björnson and B. Ottersten, “A framework for training-based esti-mation in arbitrarily correlated rician MIMO channels with rician dis-turbance,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1807–1820,Mar. 2010.


[12] J. R. Magnus and H. Neudecker, Matrix Differential Calculus WithApplications in Statistics and Econometrics, 2nd ed. Hoboken, NJ,USA: Wiley, 1999.

[13] J. Medbo, H. Asplund, J. E. Berg, and N. Jaldén, “Directional channel characteristics in elevation and azimuth at an urban macrocell base station,” in Proc. 6th Eur. Conf. Antennas Propag., Mar. 2012.

[14] V. Cellini and G. Dona, “A novel joint channel and multi-user in-terference statistics estimator for UWB-IR based on Gaussian mix-ture model,” in 2005 IEEE Int. Conf. Ultra-Wideband, Sept. 2005, pp.660–660.

[15] G. E. Healey and R. Kondepudy, “Radiometric CCD camera calibra-tion and noise estimation,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 16, no. 3, pp. 267–276, Mar. 1994.

[16] H. S. Wang and N. Moayeri, “Finite-state Markov channel—A useful model for radio communication channels,” IEEE Trans. Veh. Technol., vol. 44, no. 1, pp. 167–171, Feb. 1995.

[17] P. Sadeghi, R. Kennedy, P. Rapajic, and R. Shams, “Finite-stateMarkov modeling of fading channels—A survey of principles andapplications,” IEEE Signal Process. Mag., vol. 25, no. 5, pp. 57–80,Sep. 2008.

[18] A. Scaglione, P. Stoica, S. Barbarossa, G. B. Giannakis, and H. Sam-path, “Optimal designs for space-time linear precoders and decoders,”IEEE Trans. Signal Process., vol. 50, no. 5, pp. 1051–1064, May 2002.

[19] D. H. Pham, H. D. Tuan, B.-N. Vo, and T. Q. Nguyen, “Jointly optimal precoding/postcoding for colored MIMO systems,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2006, vol. 4.

[20] F. A. Dietrich and W. Utschick, “Pilot-assisted channel estimationbased on second-order statistics,” IEEE Trans. Signal Process., vol.53, no. 3, pp. 1178–1193, Mar. 2005.

[21] J. H. Kotecha and A. M. Sayeed, “Transmit signal design for op-timal estimation of correlated MIMO channels,” IEEE Trans. SignalProcess., vol. 52, no. 2, pp. 546–557, Feb. 2004.

[22] Y. Liu, T. F. Wong, and W. W. Hager, “Training signal design forestimation of correlated MIMO channels with colored interference,”IEEE Trans. Signal Process., vol. 55, no. 4, pp. 1486–1497, Apr. 2007.

[23] J. Q. Li and A. R. Barron, “Mixture density estimation,” in Advances in Neural Information Processing Systems 12. Cambridge, MA, USA: MIT Press, 1999, pp. 279–285.

[24] H. W. Sorenson and D. L. Alspach, “Recursive Bayesian estimation using Gaussian sums,” Automatica, vol. 7, no. 4, pp. 465–479, 1971.

[25] S. Dasgupta, “Learning mixtures of Gaussians,” in Proc. 40th Annu.Symp. Found. Comput. Sci., 1999, pp. 634–644.

[26] Z. Ghahramani, “Solving inverse problems using an EM approachto density estimation,” in Proc. 1993 Connectionist Models SummerSchool, 1993, pp. 316–323.

[27] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge, U.K.: Cambridge University Press, 2003.

[28] J. Flåm, S. Chatterjee, K. Kansanen, and T. Ekman, “On MMSE es-timation—A linear model under Gaussian mixture statistics,” IEEETrans. Signal Process., 2012.

[29] D. J. Love and R. W. Heath, “Limited feedback unitary precoding forspatial multiplexing systems,” IEEE Trans. Inf. Theory, vol. 51, no. 8,pp. 2967–2976, Aug. 2005.

[30] I. Kim, S. Park, D. Love, and S. Kim, “Improved multiuser MIMO unitary precoding using partial channel state information and insights from the Riemannian manifold,” IEEE Trans. Wireless Commun., vol. 8, no. 8, pp. 4014–4023, Aug. 2009.

[31] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” Ann. Math. Statist., vol. 23, no. 3, pp. 462–466, Sept. 1952.

[32] J. T. Flåm, M. Vehkaperä, D. Zachariah, and E. Tsakonas, “Meansquare error reduction by precoding of mixed Gaussian input,” inProc. Int. Symp. Inf. Theory Appl., Honolulu, Hawaii, Oct. 28–Nov.01 2012.

John T. Flåm was born in Lærdal, Norway, in 1976. He received the M.S. degree in electrical engineering from the Norwegian University of Science and Technology (NTNU), Trondheim, Norway, in 2006. In the period 2006–2008 he worked as a software development engineer at www.nordicsemi.com. He received the Ph.D. degree in electronics and communications engineering from NTNU in 2013. Since March 2013 he has been working at www.embida.com. His research interests include statistical signal processing and communications.

Dave Zachariah was born in Stockholm, Sweden, in 1983. He received the M.S. degree in electrical engineering from the Royal Institute of Technology (KTH), Stockholm, Sweden, in 2007. During 2007–2008 he worked as a research engineer at Global IP Solutions, Stockholm. He received the Tech. Lic. and Ph.D. degrees in signal processing from KTH in 2011 and 2013, respectively. His research interests include statistical signal processing, sensor fusion, sparse signal processing and spectral analysis.

Mikko Vehkaperä (S'03–M'10) received the M.Sc. and Lic. Sc. degrees from the University of Oulu, Finland, in 2004 and 2006, respectively, and the Ph.D. degree from the Norwegian University of Science and Technology (NTNU), Trondheim, Norway, in 2010.

Since 2010 he has been with the School of Electrical Engineering and ACCESS Linnaeus Center at Royal Institute of Technology (KTH), Stockholm, Sweden. Between 2012 and 2013, he shared a research position between KTH and Aalto University School of Electrical Engineering, Finland. He held visiting appointments at Massachusetts Institute of Technology (MIT), US, Kyoto University, Japan, and University of Erlangen-Nuremberg, Germany. His research interests are in the field of wireless communications and signal processing.

Saikat Chatterjee is with the Communication Theory Lab and the Signal Processing Lab, KTH Royal Institute of Technology, Sweden. He was also with the Sound and Image Processing Lab at the same institution as a post-doctoral fellow for one year. Before moving to Sweden, he received the Ph.D. degree in 2009 from the Indian Institute of Science, India. He was a co-author of the paper that won the best student paper award at ICASSP 2010. His research interests include source coding, speech and audio processing, estimation and detection, sparse signal processing, compressive sensing, wireless communications and computational biology.