
LINEAR PROGRAM DIFFERENTIATION FOR SINGLE-CHANNEL SPEECH SEPARATION

Barak A. Pearlmutter*
Hamilton Institute
National University of Ireland Maynooth
Co. Kildare, Ireland

Rasmus K. Olsson†
Informatics and Mathematical Modelling
Technical University of Denmark
2800 Lyngby, Denmark

ABSTRACT

Many apparently difficult problems can be solved by reduction to linear programming. Such problems are often subproblems within larger systems. When gradient optimisation of the entire larger system is desired, it is necessary to propagate gradients through the internally-invoked LP solver. For instance, when an intermediate quantity z is the solution to a linear program involving constraint matrix A, a vector of sensitivities dE/dz will induce sensitivities dE/dA. Here we show how these can be efficiently calculated, when they exist. This allows algorithmic differentiation to be applied to algorithms that invoke linear programming solvers as subroutines, as is common when using sparse representations in signal processing. Here we apply it to gradient optimisation of overcomplete dictionaries for maximally sparse representations of a speech corpus. The dictionaries are employed in a single-channel speech separation task, leading to 5 dB and 8 dB target-to-interference ratio improvements for same-gender and opposite-gender mixtures, respectively. Furthermore, the dictionaries are successfully applied to a speaker identification task.

1. INTRODUCTION

Linear programming solvers (LP) are often used as subroutines within larger systems, in both operations research and machine learning [1, 2]. One very simple example of this is in sparse signal processing, where it is common to represent a vector as sparsely as possible in an overcomplete basis; this representation can be found using LP, and the sparse representation is then used in further processing [3-9].

To date it has not been practical to perform end-to-end gradient optimisation of algorithms of this sort. This is due to the difficulty of propagating intermediate gradients (adjoints) through the LP solver. We show below how these adjoint calculations can be done: how a sensitivity of the output can be manipulated to give a sensitivity of the inputs.

* Supported by Science Foundation Ireland grant 00/PI.1/C067 and the Higher Education Authority of Ireland.

† Thanks to Oticon Fonden for financial support for this work.


As usual in Automatic Differentiation (AD), these do not require much more computation than the original primal LP calculation; in fact, rather unusually, here they may require considerably less.

We first introduce our notational conventions for LP, and then give a highly condensed introduction to, and notation for, AD. We proceed to derive AD transformations for a simpler subroutine than LP: a linear equation solver. (This novel derivation is of independent interest, as linear equations are often constructed and solved within larger algorithms.) Armed with a general AD transformation for linear equation solvers along with suitable notation, we find the AD transformations for linear program solvers simple to derive. This is applied mechanically to yield AD rules for a linearly-constrained L1-optimiser.

The problem of finding an overcomplete signal dictionary tuned to a given stimulus ensemble, so that signals drawn from that ensemble will have sparse representations in the constructed dictionary, has received increasing attention, due to applications in both neuroscience and in the construction of efficient practical codes [10]. Here we derive a gradient method for such an optimisation, and apply it to learn a sparse representation of speech.

Single-channel speech separation, where the objective is to estimate the speech sources of the mixture, is a relevant task in hearing aids, as a speech recognition pre-processor, and in other applications which might benefit from better noise reduction. For this reason, there has been a flurry of interest in the problem [9, 11-17]. We encode the audio mixtures in the basis functions of the combined personalised dictionaries, which were adapted using the devised gradient method. The sparse code separates the signal into its sources, and reconstruction follows. Furthermore, we show that the dictionaries are truly personal, meaning that a given dictionary provides the sparsest fit for the particular speaker to which it was adapted. Hence we are able to correctly classify speech signals to their speaker.


2. BACKGROUND AND NOTATION

We develop a convenient notation while briefly reviewing the essentials of linear programming (LP) and algorithmic differentiation (AD).

2.1. Linear Programming

In order to develop a notation for LP, consider the general LP problem

\[ \arg\min_{z}\; w^{\mathsf T} z \quad \text{s.t.}\quad A z \le a \ \text{and}\ B z = b \tag{1} \]

We will denote the linear program solver lp, and write the solution as z = lp(w, A, a, B, b). It is important to see that lp(·) can be regarded as either a mathematical function which maps LP problems to their solutions, or as a computer program which actually solves LP problems. Our notation deliberately does not distinguish between these two closely related notions.

Assuming feasibility, boundedness, and uniqueness, the solution to this LP problem will satisfy a set of linear equalities consisting of a subset of the constraints, the active constraints [18-20]. An LP solver calculates two pieces of information: the solution itself, and the identity of the active constraints. We will find it convenient to refer to the active constraints by defining some very sparse matrices that extract the active constraints from the constraint matrices. Let α₁ < ⋯ < αₙ be the indices of the rows of A corresponding to active constraints, and β₁ < ⋯ < βₘ index the active rows of B. Without loss of generality, we assume that the total number of active constraints is equal to the dimensionality of the solution, n + m = dim z. We let P_α be a matrix with n rows, where the i-th row is all zeros except for a one in the α_i-th column, and P_β similarly have m rows, with its i-th row all zeros except for a one in the β_i-th column. So P_α A and P_β B hold the active rows of A and B, respectively. These can be combined into a single matrix,

\[ P = \begin{bmatrix} P_{\alpha} & 0 \\ 0 & P_{\beta} \end{bmatrix} . \]

Using these definitions, the solution z to (1), which presumably is already available having been computed by the algorithm that identified the active constraints, must be the unique solution of the system of linear constraints

\[ P \begin{bmatrix} A \\ B \end{bmatrix} z = P \begin{bmatrix} a \\ b \end{bmatrix} \]

or

\[ \mathrm{lp}(w, A, a, B, b) = \mathrm{lq}\!\left( P \begin{bmatrix} A \\ B \end{bmatrix},\; P \begin{bmatrix} a \\ b \end{bmatrix} \right) \tag{2} \]

where lq is a routine that efficiently solves a system of linear equations, lq(M, m) = M^{-1} m. For notational convenience we suppress the identity of the active constraints as an output of the lp routine. Instead we assume that it is available where necessary, so any function with access to the solution z found by the LP solver is also assumed to have access to the corresponding P.
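As a concrete (if naive) sketch of how an lp routine with this active-constraint bookkeeping might be realised, one can wrap an off-the-shelf solver. The code below is our illustration, not the solver used in the paper: scipy.optimize.linprog, the tolerance tol, and the helper name are our own choices, and degenerate solutions are not handled.

```python
import numpy as np
from scipy.optimize import linprog

def lp(w, A, a, B, b, tol=1e-9):
    """Solve  argmin_z w'z  s.t.  A z <= a  and  B z = b  (eq. (1)),
    returning the solution and selector matrices for the active rows
    of A and B (the P_alpha, P_beta of the text)."""
    res = linprog(c=w, A_ub=A, b_ub=a, A_eq=B, b_eq=b,
                  bounds=[(None, None)] * len(w), method="highs")
    if not res.success:
        raise ValueError(res.message)
    z = res.x
    active = np.where(np.abs(A @ z - a) < tol)[0]   # tight inequality rows
    P_alpha = np.eye(A.shape[0])[active]            # picks active rows of A
    P_beta = np.eye(B.shape[0])                     # equality rows are always active
    return z, P_alpha, P_beta
```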

2.2. Algorithmic Differentiation

AD is a process by which a numeric calculation specified in a computer programming language can be mechanically transformed so as to calculate derivatives (in the differential calculus sense) of the function originally calculated [21]. There are two sorts of AD transformations: forward accumulation [22] and reverse accumulation [23]. (A special case of reverse accumulation AD is referred to as backpropagation in the machine learning literature [24].) If the entire calculation is denoted y = h(x), then forward accumulation AD arises because a perturbation dx/dr induces a perturbation dy/dr, and reverse accumulation AD arises because a gradient dE/dy induces a gradient dE/dx. The Jacobian matrix plays a dominant role in reasoning about this process. This is the matrix J whose (i, j)-th entry is dh_i/dx_j. Forward AD calculates ý = J x́ = h⃗(x, x́), and reverse AD calculates x̀ = Jᵀ ỳ = h⃖(x, ỳ). The difficulty is that, in high dimensional systems, the matrix J is too large to actually calculate. In AD the above matrix-vector products are found directly and efficiently, without actually calculating the Jacobian.

The central insight is that calculations can be broken down into a chained series of assignments v := g(u), and transformed versions of these chained together. The transformed version of the above internal assignment statement would be v́ := g⃗(u, ú, v) in forward mode [22], or ù := g⃖(u, v, v̀) in reverse mode [23]. The most interesting property of AD, which results from this insight, is that the time consumed by the adjoint calculations can be the same as that consumed by the original calculation, up to a small constant factor. (This naturally assumes that the transformations of the primitives invoked also obey this property, which is in general true.)

We will refer to the adjoints of original variables introduced in forward accumulation (perturbations) using a forward-leaning accent, v → v́; to the adjoint variables introduced in the reverse-mode transformation (sensitivities) using a reverse-leaning accent, v → v̀; and to the forward- and reverse-mode transformations of functions using forward and reverse arrows, h → h⃗ and h → h⃖. A detailed introduction to AD is beyond the scope of this paper, but one form appears repeatedly in our derivations. This is V := A U B, where A and B are constant matrices and U and V are matrices as well. This transforms to V́ := A Ú B and Ù := Aᵀ V̀ Bᵀ.
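As a quick numerical sanity check of this rule (our own illustration, not part of the paper), the forward and reverse forms can be verified against the inner-product identity ⟨V̀, V́⟩ = ⟨Ù, Ú⟩ that any correct forward/reverse pair must satisfy:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((5, 2))
U = rng.standard_normal((4, 5))

def f(U):                        # the primal map  V = A U B
    return A @ U @ B

# Forward mode: a perturbation of U induces a perturbation of V.
U_dot = rng.standard_normal(U.shape)
V_dot = A @ U_dot @ B            # forward rule

# Reverse mode: a sensitivity of V induces a sensitivity of U.
V_bar = rng.standard_normal(f(U).shape)
U_bar = A.T @ V_bar @ B.T        # reverse rule

# Both rules must satisfy  <V_bar, V_dot> = <U_bar, U_dot>.
assert np.isclose(np.sum(V_bar * V_dot), np.sum(U_bar * U_dot))
```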


2.3. AD of a Lin. Eq. Solver

We first derive AD equations for a simple implicit function, namely a linear equation solver. We consider a subroutine lq which finds the solution z of M z = m, written z = lq(M, m). This assumes that M is square and full-rank, just as a division operation z = x/y assumes that y ≠ 0. We will derive formulae for both forward mode AD (the ź induced by Ḿ and ḿ) and reverse mode AD (the M̀ and m̀ induced by z̀).

For forward propagation of perturbations, we write ź = l⃗q(M, Ḿ, m, ḿ, z). Because (M + Ḿ)(z + ź) = m + ḿ, which reduces to M ź = ḿ - Ḿ z, we conclude that

\[ \overrightarrow{\mathrm{lq}}(M, \acute{M}, m, \acute{m}, z) = \mathrm{lq}(M, \acute{m} - \acute{M} z). \]

Note that lq is linear in its second argument, which is where the perturbations enter. For reverse propagation of sensitivities, we write

\[ [\grave{M} \;\; \grave{m}] = \overleftarrow{\mathrm{lq}}(M, m, z, \grave{z}). \tag{3} \]

First observe that z = M^{-1} m and hence m̀ = M^{-T} z̀, so

\[ \grave{m} = \mathrm{lq}(M^{\mathsf T}, \grave{z}). \]

For the remaining term we start with our previous forward perturbation induced by Ḿ, namely ź = -M^{-1} Ḿ z, and note that the reverse must be the transpose of this linear relationship, M̀ = -M^{-T} z̀ zᵀ, which is the outer product

\[ \grave{M} = -\grave{m}\, z^{\mathsf T}. \]
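The two transformations are short enough to write down directly. The following sketch is ours (hypothetical helper names, numpy rather than any particular solver library); it implements both modes and checks the standard forward/reverse inner-product identity:

```python
import numpy as np

def lq(M, m):
    """Solve M z = m."""
    return np.linalg.solve(M, m)

def lq_fwd(M, M_dot, m, m_dot, z):
    """Forward mode: z_dot = lq(M, m_dot - M_dot z)."""
    return np.linalg.solve(M, m_dot - M_dot @ z)

def lq_rev(M, m, z, z_bar):
    """Reverse mode: sensitivities of (M, m) induced by z_bar."""
    m_bar = np.linalg.solve(M.T, z_bar)   # m_bar = M^{-T} z_bar
    M_bar = -np.outer(m_bar, z)           # M_bar = -m_bar z^T
    return M_bar, m_bar

# Consistency check: <z_bar, z_dot> = <M_bar, M_dot> + <m_bar, m_dot>.
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n)) + n * np.eye(n)   # keep M well-conditioned
m = rng.standard_normal(n)
z = lq(M, m)
M_dot, m_dot = rng.standard_normal((n, n)), rng.standard_normal(n)
z_bar = rng.standard_normal(n)
z_dot = lq_fwd(M, M_dot, m, m_dot, z)
M_bar, m_bar = lq_rev(M, m, z, z_bar)
assert np.isclose(z_bar @ z_dot, np.sum(M_bar * M_dot) + m_bar @ m_dot)
```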

2.4. AD of Linear Programming

Applying equation (3), followed by some bookkeeping, yields

\[
\begin{bmatrix} \grave{A} & \grave{a} \\ \grave{B} & \grave{b} \end{bmatrix}
= \overleftarrow{\mathrm{lp}}(w, A, a, B, b, z, \grave{z})
= P^{\mathsf T}\, \overleftarrow{\mathrm{lq}}\!\left( P \begin{bmatrix} A \\ B \end{bmatrix},\; P \begin{bmatrix} a \\ b \end{bmatrix},\; z,\; \grave{z} \right),
\qquad \grave{w} = 0.
\]

Forward accumulation is similar, but is left out for brevity.
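In code, this bookkeeping amounts to solving the transposed active system and scattering the result back into the full constraint matrices. The sketch below is ours (hypothetical name and argument list, with the nondegeneracy assumptions of Section 2.1 taken for granted); it makes the rule concrete:

```python
import numpy as np

def lp_rev(A, a, B, b, z, z_bar, P_alpha, P_beta):
    """Reverse mode through the LP of eq. (1), via the active system of eq. (2).
    P_alpha, P_beta select the active rows of A and B (Section 2.1)."""
    M = np.vstack([P_alpha @ A, P_beta @ B])      # the active (square) system
    # Reverse through lq (Section 2.3): m_bar = M^{-T} z_bar, M_bar = -m_bar z^T.
    m_bar = np.linalg.solve(M.T, z_bar)
    M_bar = -np.outer(m_bar, z)
    # Scatter the sensitivities back; inactive rows receive zero.
    P = np.block([[P_alpha, np.zeros((P_alpha.shape[0], P_beta.shape[1]))],
                  [np.zeros((P_beta.shape[0], P_alpha.shape[1])), P_beta]])
    AB_bar = P.T @ M_bar                          # rows of A first, then rows of B
    ab_bar = P.T @ m_bar
    nA = A.shape[0]
    A_bar, B_bar = AB_bar[:nA], AB_bar[nA:]
    a_bar, b_bar = ab_bar[:nA], ab_bar[nA:]
    w_bar = np.zeros_like(z)      # w does not influence z away from degeneracies
    return w_bar, A_bar, a_bar, B_bar, b_bar
```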

2.5. Constrained L1 Optimisation

We can find AD equations for linearly constrained L1-norm optimisation via reduction to LP. Consider

\[ \arg\min_{c}\; \|c\|_1 \quad \text{s.t.}\quad D c = y. \]

Although ||c||_1 = Σ_i |c_i| is a nonlinear objective function, a change in parametrisation allows optimisation via LP. We name the solution c = l1opt(y, D), where

\[ \mathrm{l1opt}(y, D) = \begin{bmatrix} I & -I \end{bmatrix} \mathrm{lp}\!\left( 1,\; -I,\; 0,\; D \begin{bmatrix} I & -I \end{bmatrix},\; y \right) \]

in which 0 and 1 denote column vectors whose elements all contain the indicated number, and each I is an appropriately sized identity matrix.
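A sketch of this reduction (ours, not the paper's implementation) using an off-the-shelf solver; here the nonnegativity constraints -Iz ≤ 0 are expressed as variable bounds, which scipy.optimize.linprog handles directly:

```python
import numpy as np
from scipy.optimize import linprog

def l1opt(y, D):
    """argmin_c ||c||_1  s.t.  D c = y, via the split c = c_plus - c_minus."""
    L, N = D.shape
    # Variables z = [c_plus; c_minus] >= 0, objective 1'z, equality D [I -I] z = y.
    c_obj = np.ones(2 * N)
    A_eq = np.hstack([D, -D])
    res = linprog(c=c_obj, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * N), method="highs")
    if not res.success:
        raise ValueError(res.message)
    z = res.x
    return z[:N] - z[N:]          # c = [I -I] z
```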

The reverse-mode AD transformation follows immediately:

\[
\overleftarrow{\mathrm{l1opt}}(y, D, c, \grave{c}) = [\grave{D} \;\; \grave{y}] =
\begin{bmatrix} 0' & I \end{bmatrix}
\overleftarrow{\mathrm{lp}}\!\left( 1,\; -I,\; 0,\; D \begin{bmatrix} I & -I \end{bmatrix},\; y,\; z,\; \begin{bmatrix} I \\ -I \end{bmatrix} \grave{c} \right)
\begin{bmatrix} I & 0 \\ -I & 0 \\ 0^{\mathsf T} & 1 \end{bmatrix}
\]

where z is the solution of the internal LP problem and 0′ is an appropriately sized matrix of zeros.

3. DICTIONARIES OPTIMISED FOR SPARSITY

A major advantage of the LP differentiation framework, and more specifically of the reverse accumulation of the constrained L1-norm optimisation, is that it directly provides a learning rule for sparse representations in overcomplete dictionaries.

We assume an overcomplete dictionary in the columns of D, which is used to encode a signal represented in the column vector y using the column vector of coefficients c = l1opt(y, D), where each dictionary element has unit L2 length. A probabilistic interpretation of the encoding as a maximum a posteriori (MAP) estimate naturally follows from two assumptions: a Laplacian prior p(c), and a noise-free observation model y = Dc. This gives

\[ c = \arg\max_{c'}\; p(c' \mid y, D). \]
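To make the MAP step explicit (our expansion of the argument, writing the Laplacian prior as p(c) ∝ exp(-||c||_1); any positive scale on the exponent gives the same minimiser): the noise-free observation model restricts the maximisation to codes consistent with y, and the prior then turns the maximisation into L1 minimisation,

\[
c = \arg\max_{c':\, D c' = y} p(c')
  = \arg\max_{c':\, D c' = y} \exp\!\left(-\|c'\|_1\right)
  = \arg\min_{c':\, D c' = y} \|c'\|_1
  = \mathrm{l1opt}(y, D).
\]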

We would like to improve D for a particular distribution of signals, meaning change D so as to maximise the sparseness of the codes assigned. With y drawn from this distribution, an ideal dictionary will minimise the average code length, giving maximally sparse coefficients. We will update D so as to minimise E = ⟨ ||l1opt(y, D)||_1 ⟩ while keeping the columns of D at unit length. This can be regarded as a special case of Independent Component Analysis [25], where measures of independence across coefficients are optimised. We wish to use a gradient method, so we calculate ∇_D E_y, where E_y = ||l1opt(y, D)||_1, making E = ⟨E_y⟩. Invoking AD,

\[ \nabla_{D} E_y = \grave{D} = \overleftarrow{\mathrm{l1opt}}(y, D, c, \operatorname{sign}(c)) \begin{bmatrix} I \\ 0^{\mathsf T} \end{bmatrix} \tag{4} \]

where sign(x) = 1/0/-1 for x positive/zero/negative and applies elementwise to vectors.
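Putting the pieces of Sections 2.3-2.5 together, the gradient in (4) can be assembled directly from the active set of the internal LP. The following sketch is ours (hypothetical helper name; the solution is assumed nondegenerate, so the active system is square and invertible); it returns only D̀, which is all the dictionary update needs:

```python
import numpy as np

def l1opt_rev(y, D, c, c_bar, tol=1e-9):
    """Reverse-mode transform of l1opt (Section 2.5): given the sensitivity
    c_bar of the code, return the induced sensitivity D_bar of the dictionary."""
    L, N = D.shape
    G = np.hstack([np.eye(N), -np.eye(N)])            # c = G z,  z = [c+; c-]
    z = np.concatenate([np.maximum(c, 0), np.maximum(-c, 0)])
    A = -np.eye(2 * N)                                # -I z <= 0
    B = D @ G                                         # D G z = y
    active = np.where(z < tol)[0]                     # tight nonnegativity rows
    P_alpha = np.eye(2 * N)[active]
    P = np.block([[P_alpha, np.zeros((len(active), L))],
                  [np.zeros((L, 2 * N)), np.eye(L)]])
    M = np.vstack([P_alpha @ A, B])                   # active (square) system
    z_bar = G.T @ c_bar                               # sensitivity of the LP output
    m_bar = np.linalg.solve(M.T, z_bar)               # reverse through lq
    M_bar = -np.outer(m_bar, z)
    AB_bar = P.T @ M_bar                              # scatter back (A rows, then B rows)
    B_bar = AB_bar[2 * N:]                            # sensitivity of B = D G
    return B_bar @ G.T                                # chain rule: D_bar = B_bar G^T
```

With c_bar = np.sign(c), the returned matrix corresponds to ∇_D E_y of (4).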

We are now in a position to perform stochastic gradient optimisation [26], modified by the inclusion of a normalisation step to maintain the columns of D at unit length and non-negative.


Fig. 1. A sample of learnt dictionary entries for male (left) and female (right) speech in the Mel spectrum domain. Clearly, harmonic features emerge from the data, but some broad and narrow noise spectra can also be seen. The dictionaries were initialised to N = 256 delta-like pulses of length L = 80 and were adapted from T = 420 s of speech.

1. Draw y from the signal distribution.
2. Calculate E_y.
3. Calculate ∇_D E_y by (4).
4. Step D := D - η ∇_D E_y.
5. Set any negative element of D to zero.
6. Normalise the columns d_i of D to unit L2 norm.
7. Repeat to convergence of D.
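A minimal sketch of this loop (ours, not the authors' code): l1opt and l1opt_rev are the hypothetical helpers sketched earlier, draw_signal stands in for sampling the training distribution, and a fixed step size eta replaces the decreasing schedule used in the experiments.

```python
import numpy as np

def learn_dictionary(draw_signal, D0, eta=0.01, n_iter=40000):
    """Steps 1-7 above as a stochastic gradient loop."""
    D = D0.copy()
    for _ in range(n_iter):
        y = draw_signal()                          # 1. draw y
        c = l1opt(y, D)                            # 2. sparse code; E_y = ||c||_1
        grad = l1opt_rev(y, D, c, np.sign(c))      # 3. gradient of E_y w.r.t. D, eq. (4)
        D -= eta * grad                            # 4. gradient step
        D[D < 0] = 0.0                             # 5. clip negative elements
        norms = np.linalg.norm(D, axis=0, keepdims=True)
        D /= np.maximum(norms, 1e-12)              # 6. unit-L2 columns (guard against 0)
        # 7. repeat until D converges (fixed iteration budget here)
    return D
```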

This procedure can be regarded as a very efficient exact maximum likelihood treatment of the posterior integrated using a Gaussian approximation [7]. However, the formulation here can be easily and mechanically generalised to other objectives.

A set of personalised speech dictionaries were learnt by sparsity optimisation on the Grid Corpus [27], which is available at http://www.dcs.shef.ac.uk/spandh/gridcorpus. This corpus contains 1000 × 34 utterances of 34 speakers, confined to a limited vocabulary. The speech was preprocessed and represented to (essentially) transform the audio signals into a Mel time-frequency representation, as presented and discussed by Ellis and Weiss [14]. The data was downsampled to 8 kHz and high-pass filtered to bias our objective towards more accuracy in the high end of the spectrum. The short-time Fourier transform was computed from windowed data vectors of length 32 ms, corresponding to K = 256 samples, and subsequently mapped into L = 80 bands on the Mel scale. From T = 420 s of audio from each speaker, the non-zero time-frames were extracted for training and normalised to unit L2 norm. The remainder of the audio (> 420 s) was set aside for testing. The stochastic gradient optimisation of the linearly constrained L1 norm was run for 40,000 iterations. The step-size η was decreased throughout the training. The N = 256 columns of the dictionaries were initialised with narrow pulses distributed evenly across the spectrum, and non-negativity was enforced following each iteration. Figure 1 displays a randomly selected sample of learnt dictionary elements of one male and one female speaker. The dictionaries clearly capture a number of characteristics of speech, such as quasi-periodicity and dependencies across frequency bands.

3.1. Source Separation

This work was motivated by a particular application: single-channel source separation.¹ The aim is to recover R source signals from a one-dimensional mixture signal. In that context, an important technique is to perform a linearly constrained L1-norm optimisation in order to fit an observed signal using a sparse subset of coefficients over an overcomplete signal dictionary. A single column of the mixture spectrogram is the sum of the source spectra: y = Σᵢ yᵢ. In the interest of simplicity, this model assumes a 0 dB target-to-masker ratio (TMR). Generalisation to arbitrary TMR by the inclusion of weighting coefficients is straightforward.

As a generative signal model, it is assumed that yᵢ can be represented sparsely in the overcomplete dictionary D, which is the concatenation of the source dictionaries: D = [D_1 ⋯ D_i ⋯ D_R]. Assuming that the D_i are different in some sense, it can be expected that a sparse representation in the overcomplete basis D coincides with the separation of the sources, i.e. we compute

\[ \begin{bmatrix} c_1^{\mathsf T} & \cdots & c_R^{\mathsf T} \end{bmatrix}^{\mathsf T} = \mathrm{l1opt}(y, D) \]

where the c_i are the coefficients pertaining to the i-th source. The source estimates in the Mel spectrum domain are then re-synthesised as ŷᵢ = D_i c_i.

¹ The INTERSPEECH 2006 conference hosts a special session on this issue, based on the GRID speech corpus. See www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm



Fig. 2. Dependency of the separation performance, measured as signal-to-noise ratio (SNR), on the amount of training data (left) and on the dictionary size N (right). Only T = 7 s of speech is needed to attain near-optimal performance. The performance increases by about 0.5 dB per doubling of N.

Table 1. Monaural two-speaker signal-to-noise separation performance (mean ± stderr of SNR), by speaker gender. The simulated test data consisted of all possible combinations, T = 6 s, of the 34 speakers.

  Genders   SNR (dB)
  M/M       4.9 ± 1.2
  M/F       7.8 ± 1.3
  F/F       5.1 ± 1.4

Fig. 3. The maximum-likelihood correct-classification rate as computed in a T = 2 s test window on all combinations of the 34 speakers and 34 dictionaries. If all time-frames are included in the computation, the classification is perfect, but the performance decreases as smaller windows are used.

The conversion back to the time domain consists of mapping to the amplitude spectrogram and subsequently reconstructing the time-domain signal using the noisy phase of the mixture. Due to the sparsity of speech in the transformed domain, the degree of overlap of the sources is small, which causes the approximation to be fairly accurate. Useful software in this connection is available at http://www.ee.columbia.edu/~dpwe/. In the following, the quality of the ŷᵢ is evaluated in the time domain simply as the ratio of powers of the target to the reconstruction error, henceforth termed the signal-to-noise ratio (SNR).
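A sketch (ours) of the per-frame separation step and of the SNR figure of merit; l1opt is the hypothetical helper from Section 2.5, and the snr_db helper simply compares whatever pair of vectors it is given, whereas the paper evaluates the SNR on the re-synthesised time-domain signals.

```python
import numpy as np

def separate(y, dictionaries):
    """Sparse-code one mixture frame y in the concatenated dictionary
    D = [D_1 ... D_R] and re-synthesise each source in the Mel domain."""
    D = np.hstack(dictionaries)
    c = l1opt(y, D)
    splits = np.cumsum([Di.shape[1] for Di in dictionaries])[:-1]
    return [Di @ ci for Di, ci in zip(dictionaries, np.split(c, splits))]

def snr_db(target, estimate):
    """Ratio of target power to reconstruction-error power, in dB."""
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum((target - estimate) ** 2))
```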

In order to assess the convergence properties of the algorithm, the SNR was computed as a function of the amount of training data; see Figure 2. It was found that useful results could be achieved with a few seconds of training data, whereas optimal performance was only obtained after a few minutes. It was furthermore investigated how the SNR varies as a function of the number of dictionary elements, N. Each doubling of N brings an improvement, indicating the potential usefulness of increased computing power. The above results were obtained by simulating all possible mixtures of 8 speakers (4 male, 4 female) at 0 dB and computing the SNRs on 6 s segments. Performance figures were computed on the complete data set of 34 speakers, amounting to 595 combinations, with N = 256 and T = 420 s; see Table 1. The test data is available at www2.imm.dtu.dk/~rko/single-channel.

3.2. Speaker identification

In many potential applications of source separation the speakers of the mixture would be novel, and would have to be estimated from the audio stream. In order to perform truly blind separation, the system should be able to automatically apply the appropriate dictionaries. Here we attack a simpler subproblem: speaker identification in an audio signal with only a single speaker. Our approach is straightforward: select the dictionary that yields the sparsest code for the signal. Again, this can be interpreted as maximum-likelihood classification. Figure 3 displays the percentage of correctly classified sound snippets. The figures were computed on all combinations of speakers and dictionaries, that is 34 × 34 = 1156 combinations. The complete data (2 s) resulted in all speakers being correctly identified. Shorter windows carried a higher error rate. For the described classification framework to be successful in a source separation task, it is required that each speaker appears exclusively in parts of the audio signal. This is not at all unrealistic in normal conversation, depending on the politeness of the speakers.
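The classification rule is a one-liner on top of the sparse coder; a sketch under the same assumptions as before (our naming, with l1opt as sketched earlier):

```python
import numpy as np

def identify_speaker(frames, dictionaries):
    """Maximum-likelihood speaker identification as described above: pick the
    dictionary under which the signal's total code is sparsest. `frames` is a
    sequence of (unit-norm) Mel-spectrum columns."""
    costs = [sum(np.abs(l1opt(y, D)).sum() for y in frames) for D in dictionaries]
    return int(np.argmin(costs))          # index of the claimed speaker
```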

4. CONCLUSION AND OUTLOOK

Linear programming is often viewed as a black-box solver which cannot be fruitfully combined with gradient-based optimisation methods. As we have seen, this is not the case. LP can be used as a subroutine in a larger system, and perturbations can be propagated forwards and sensitivities propagated backwards through the LP solver. The only caution is that LP is by nature only piecewise differentiable, so care must be taken with regard to crossing through such discontinuities.


The figures carry evidence that the adapted Mel-scale dictionaries to a large extent perform the job, and that the generalisation of the results to spontaneous speech depends to a large extent on designing a sensible scheme for providing the algorithm with balanced training data. Furthermore, the system should be able to manage some difficult aspects of real-room conditions, in particular those in which the observed signal is altered by the room dynamics. We feel that a possible solution could build on the principles laid out in previous work [9], where a head-related transfer function (HRTF) is used to provide additional contrast between the sources.

We found that using spectrogram patches rather than power spectra improved the results only marginally, in agreement with previous reports using a related approach [16].

References

[1] O. L. Mangasarian, W. N. Street, and W. H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570-577, July-Aug. 1995.

[2] P. S. Bradley, O. L. Mangasarian, and W. N. Street. Clustering via concave minimization. In Adv. in Neu. Info. Proc. Sys. 9, pages 368-374. MIT Press, 1997.

[3] I. F. Gorodnitsky and B. D. Rao. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans. Signal Processing, 45(3):600-616, 1997.

[4] M. Lewicki and B. A. Olshausen. Inferring sparse, overcomplete image codes using an efficient coding framework. In Advances in Neural Information Processing Systems 10, pages 815-821. MIT Press, 1998.

[5] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.

[6] T.-W. Lee, M. S. Lewicki, M. Girolami, and T. J. Sejnowski. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 4(5):87-90, 1999.

[7] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neu. Comp., 12(2):337-365, 2000.

[8] M. Zibulevsky and B. A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neu. Comp., 13(4):863-882, Apr. 2001.

[9] B. A. Pearlmutter and A. M. Zador. Monaural source separation using spectral cues. In ICA, pages 478-485, 2004.

[10] K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski. Dictionary learning algorithms for sparse representation. Neu. Comp., 15(2):349-396, 2003.

[11] S. T. Roweis. One microphone source separation. In Adv. in Neu. Info. Proc. Sys. 13, pages 793-799. MIT Press, 2001.

[12] G.-J. Jang and T.-W. Lee. A maximum likelihood approach to single-channel source separation. J. of Mach. Learn. Research, 4:1365-1392, Dec. 2003.

[13] M. N. Schmidt and M. Mørup. Nonnegative matrix factor 2-D deconvolution for blind single channel source separation. In ICA, pages 123-123, 2006.

[14] D. P. W. Ellis and R. J. Weiss. Model-based monaural source separation using a vector-quantized phase-vocoder representation. In ICASSP, 2006.

[15] M. N. Schmidt and R. K. Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In Interspeech, 2006, submitted.

[16] S. T. Roweis. Factorial models and refiltering for speech separation and denoising. In Eurospeech, pages 1009-1012, 2003.

[17] F. Bach and M. I. Jordan. Blind one-microphone speech separation: A spectral learning approach. In Advances in Neural Information Processing Systems 17, pages 65-72, 2005.

[18] G. B. Dantzig. Programming in a linear structure. USAF, Washington D.C., 1948.

[19] S. I. Gass. An Illustrated Guide to Linear Programming. McGraw-Hill, 1970.

[20] R. Dorfman. The discovery of linear programming. Annals of the History of Computing, 6(3):283-295, July-Sep. 1984.

[21] A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Number 19 in Frontiers in Appl. Math. SIAM, Philadelphia, PA, 2000. ISBN 0-89871-451-6.

[22] R. E. Wengert. A simple automatic derivative evaluation program. Commun. ACM, 7(8):463-464, 1964.

[23] B. Speelpenning. Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois, Urbana-Champaign, Jan. 1980.

[24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.

[25] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neu. Comp., 7(6):1129-1159, 1995.

[26] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400-407, 1951.

[27] M. P. Cooke, J. Barker, S. P. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition, 2005. Submitted.
