1564 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 5, JULY 2007

Adaptation of Bayesian Models for Single-Channel Source Separation and its Application to Voice/Music Separation in Popular Songs

Alexey Ozerov, Pierrick Philippe, Frédéric Bimbot, and Rémi Gribonval

Abstract—Probabilistic approaches can offer satisfactory solutions to source separation with a single channel, provided that the models of the sources match accurately the statistical properties of the mixed signals. However, it is not always possible to train such models. To overcome this problem, we propose to resort to an adaptation scheme for adjusting the source models with respect to the actual properties of the signals observed in the mix. In this paper, we introduce a general formalism for source model adaptation which is expressed in the framework of Bayesian models. Particular cases of the proposed approach are then investigated experimentally on the problem of separating voice from music in popular songs. The obtained results show that an adaptation scheme can improve consistently and significantly the separation performance in comparison with nonadapted models.

Index Terms—Adaptive Wiener filtering, Bayesian model, expectation maximization (EM), Gaussian mixture model (GMM), maximum a posteriori (MAP), model adaptation, single-channel source separation, time–frequency masking.

I. INTRODUCTION

THIS PAPER deals with the general problem of source separation with a single channel, which can be formulated as follows. Let $s_1(n)$ and $s_2(n)$ be two sampled audio signals (also called sources) and $x(n)$ the sum of these two signals

$$x(n) = s_1(n) + s_2(n) \qquad (1)$$

also called mix. Given $x(n)$, the source separation problem in the case of a single channel consists in estimating the contributions $\hat s_k(n)$, $k = 1, 2$, of each of the two sources $s_k(n)$.

Several methods (for example [1]–[4]) have been proposed in the literature to approach this problem. In this paper, we consider the probabilistic framework, with a particular focus on Gaussian mixture models (GMMs) [5], [6]. The GMM-based approach offers the advantage of being sufficiently general and applicable to a wide variety of audio signals. These methods have indeed shown good results for the separation of speech signals [5] and some particular musical instruments [6].

Manuscript received July 12, 2006; revised March 6, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Te-Won Lee.

A. Ozerov was with Orange Labs, 35512 Cesson Sévigné Cedex, France, and IRISA (CNRS and INRIA), Metiss Group (Speech and Audio Processing), 35042 Rennes Cedex, France. He is now with the Sound and Image Processing (SIP) Laboratory, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden (e-mail: [email protected]).

P. Philippe is with Orange Labs, 35512 Cesson Sévigné Cedex, France (e-mail: [email protected]).

F. Bimbot and R. Gribonval are with IRISA (CNRS and INRIA), Metiss Group (Speech and Audio Processing), 35042 Rennes Cedex, France (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TASL.2007.899291

The underlying idea behind these techniques is to represent each source by a GMM, which is composed of a set of characteristic spectral patterns. Each GMM is learned on a training set, which contains samples of the corresponding audio class (for instance, speech, music, drums, etc.). In this paper, we refer to these models as general or a priori models, as they are supposed to cover the range of properties observable for sources belonging to the corresponding class.

An efficient model must be able to yield a rather accurate description of a given source or class of sources, in terms of a collection of spectral shapes corresponding to the various behaviors that can be observed in the source realizations. This requires GMMs with a large number of Gaussian functions, which raises a number of problems:

• trainability issues, linked to the difficulty in gathering and handling a representative set of examples for the sources or classes of sources involved in the mix;

• selectivity issues, arising from the fact that the particular sources in the mix may only span a small range of observations within the overall possibilities covered by the general models;

• sensor and channel variability, which may affect to a large extent the acoustic observations in the mix and cause a more or less important mismatch with the training conditions;

• computational complexity, which can become intractable with large source models, as the separation process requires factorial models [5], [6].

A typical situation which illustrates these difficulties arises for the separation of voice from music in popular songs. For such a task, it turns out to be particularly unrealistic to accurately model the entire population of music sounds with a tractable and efficient GMM. The problem is all the more acute as the actual realizations of music sounds within a given song cover much less acoustic diversity than the general population of music sounds.

The approach proposed in this paper is to resort to model adaptation in order to overcome the aforementioned difficulties. In a similar way as it is done, for instance, in speaker (or channel) adaptation for speech recognition, the proposed scheme consists in adjusting the source models to their realizations in the mix $x$. This process intends to specialize the adapted or a posteriori models to the particular properties of the sources as observed in the mix, while keeping the model complexity tractable.


In the first part of this article, we propose a general formalism for model adaptation in the case of mixed sources. This formalism is founded on Bayesian modeling and statistical estimation with missing data.

The second part of the work is dedicated to experiments and assessment of the proposed approach in the case of voice/music separation in popular songs. We show how separation performance can be significantly improved with model adaptation.

The remainder of the paper is structured as follows. In Section II, the principles of probabilistic single-channel source separation are presented, the limitations of this approach are discussed, and the problem studied in this paper is defined. Then, in Section III, a general formalism for source model adaptation is presented and further developed in the particular case of a maximum a posteriori (MAP) criterion. Section IV is dedicated to the customization of the proposed approach to the problem of voice/music separation in monophonic popular songs. Finally, Section V presents the experimental results, with simulations and evaluations which validate the proposed approach. All technical aspects of the paper, including the precise description of the adaptation algorithms, are gathered in an Appendix.

II. PROBABILISTIC SINGLE-CHANNEL SOURCE SEPARATION

A. Source Separation Based on Probabilistic Models: General Framework

The problem of source separation with a single channel, as formulated in (1), is fundamentally ill-posed. In fact, for any signal $s(n)$, the couple $\big(s(n),\, x(n) - s(n)\big)$ is a solution to the problem.

Therefore, it is necessary to express additional constraints or hypotheses to elicit a unique solution. In the case of the probabilistic approach, the sources $s_1$ and $s_2$ are supposed to have different statistical behaviors, corresponding to different known source models $\lambda_k$, $k = 1, 2$. Therefore, among all possible solutions to (1), one can choose the pair minimizing some distortion measure given these models. This can be expressed as the optimization of the following criterion, subject to the constraint (1):

$$(\hat s_1, \hat s_2) = \arg\min_{\hat s_1, \hat s_2}\; E\Big[ d\big((s_1, s_2), (\hat s_1, \hat s_2)\big) \,\Big|\, x, \lambda_1, \lambda_2 \Big] \qquad (2)$$

where $d(\cdot, \cdot)$ is a distortion measure between the sources and their estimates. Since the sources are not observed, the value of this function is replaced by its expectation conditionally on the observed mix and the source models.

The source models are generally trained on databases of examples of audio signals, the characteristics of which are close to those of the sources within the mix [5], [7]. In this paper, such models will be referred to as general models.

The separation problem of (1) can be reformulated in the short-time Fourier transform (STFT) domain [5], [6]. Since the STFT is a linear transform, we have

$$X(f,t) = S_1(f,t) + S_2(f,t) \qquad (3)$$

where $X(f,t)$, $S_1(f,t)$, and $S_2(f,t)$ denote the STFT of the time-domain signals $x$, $s_1$, and $s_2$ for each signal frame number $t = 1, \ldots, T$ and each frequency index $f = 0, \ldots, F$ ($F$ being the index of the Nyquist frequency). In the rest of the paper, the presentation will take place in the STFT domain, knowing that the OLA (overlap-add) method can be used to reconstruct the time-domain signal (see for instance [8]). Time-domain signals will be systematically denoted by a lower-case letter, while STFT-domain quantities will be denoted by their upper-case counterpart.

Fig. 1. Source separation based on general a priori probabilistic models.
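For illustration, the following minimal Python sketch (assuming NumPy and SciPy are available) computes a half-overlapped Hamming-window STFT and its OLA reconstruction, and checks the linearity property used in (3). The 1024-sample window is our choice to approximate the 93-ms window used in the experiments of Section V; it is not taken from the paper itself.

```python
import numpy as np
from scipy import signal

FS = 11025          # sampling rate used in the experiments (Section V)
NPERSEG = 1024      # roughly 93 ms at 11025 Hz, half-overlapped Hamming window

def stft(x):
    """STFT of a time-domain signal; returns a complex matrix of shape (F, T)."""
    _, _, X = signal.stft(x, fs=FS, window='hamming',
                          nperseg=NPERSEG, noverlap=NPERSEG // 2)
    return X

def istft_ola(X):
    """Time-domain reconstruction by overlap-add (OLA)."""
    _, x = signal.istft(X, fs=FS, window='hamming',
                        nperseg=NPERSEG, noverlap=NPERSEG // 2)
    return x

# The mix equals the sum of the sources in the STFT domain, cf. (3).
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(FS), rng.standard_normal(FS)
assert np.allclose(stft(s1 + s2), stft(s1) + stft(s2))
```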

The formulation of the problem in the STFT domain ismotivated by the fact that audio sources are generally weaklyoverlapping in the time–frequency domain. This property hasbeen illustrated for instance in [9]. In fact, if the sources do notoverlap at all in the STFT-domain, i.e., iffor any and , the following masking operation yields theexact solution for the estimation of the th source:

(4)

where , if , and ,otherwise.

As, in practice, the sources overlap partly, this approach canbe adjusted by using a masking function (or mask) that takescontinuous values and choosingclose to 1 if the th source is dominant in the time–frequencyregion defined by and close to 0 if the th source is dom-inated. In that case, the masking approach does not yield theexact solution, but an optimal one in some weighted least-squaresense. The operation expressed in (4) is called time–frequencymasking, and it also corresponds to an adaptive filtering process.
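As a simple illustration of (4), the sketch below builds an "oracle" soft mask from the true source spectrograms (something that is only possible when the sources are known, e.g., for evaluation) and applies it to the mix; the function names are ours, not the paper's.

```python
import numpy as np

def oracle_soft_mask(S1, S2, eps=1e-12):
    """Oracle soft mask for source 1: close to 1 where it dominates, close to 0 where
    it is dominated. Computable only when the true source spectrograms are known."""
    p1, p2 = np.abs(S1) ** 2, np.abs(S2) ** 2
    return p1 / (p1 + p2 + eps)

def apply_mask(M, X):
    """Time-frequency masking of the mix spectrogram, cf. (4)."""
    return M * X
```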

However, the main difficulty is that the knowledge of the respective dominance of the sources within the mix is not available, which makes it impossible to obtain an exact estimation of the optimal masks $M_k(f,t)$. In the conventional probabilistic approach, the source models are used to estimate masking functions according to the observed behavior of the sources.

Fig. 1 summarizes the general principles of probabilistic source separation. The general models $\lambda_1$ and $\lambda_2$ are trained independently on sets of examples $\bar s_1$ and $\bar s_2$. The source estimates $\hat s_1$ and $\hat s_2$ are obtained by filtering the mix (cf. (4)) with masks estimated from the general source models $\lambda_1$ and $\lambda_2$ and the mix $x$ itself.

1) GMM Source Model: As mentioned earlier, the approach reported on in this article is based on Gaussian mixture models (GMMs) of the audio sources. A number of recent works have been using GMMs or, more generally, hidden Markov models (HMMs) to account for the statistical properties of audio sources [5]–[7], [10]–[13], the latter being a rather natural extension of the former. The GMM/HMM-based framework allows one to model and to separate nonstationary audio sources, as considered here, assuming that each source is locally stationary and modeled by a particular Gaussian within the corresponding mixture of Gaussians.

The underlying idea is to represent each source as the realization of a random variable driven by a finite set of characteristic spectral shapes, i.e., "local" power spectral densities (PSDs). Each local PSD describes some particular sound event. Under the GMM formalism, the model $\lambda_k$ for the $k$th audio source is composed of $Q_k$ states corresponding to local PSDs $\sigma^2_{k,i}(f)$, $i = 1, \ldots, Q_k$. Conditionally to state $i$, the short-term spectrum $S_k(t) = [S_k(0,t), \ldots, S_k(F,t)]^T$ is viewed as some realization of a random Gaussian complex vector with zero mean and diagonal covariance matrix $\Sigma_{k,i} = \mathrm{diag}\big[\sigma^2_{k,i}(f)\big]_f$ corresponding to the local PSD. Such a GMM can be parameterized as $\lambda_k = \{\omega_{k,i}, \Sigma_{k,i}\}_{i=1}^{Q_k}$, where the $\omega_{k,i}$ are the weights of each Gaussian density, satisfying $\sum_i \omega_{k,i} = 1$. Altogether, the GMM probability density function (pdf) of the short-term spectrum $S_k(t)$ can be written as

$$p\big(S_k(t) \,\big|\, \lambda_k\big) = \sum_{i=1}^{Q_k} \omega_{k,i}\, N_c\big(S_k(t);\, 0,\, \Sigma_{k,i}\big) \qquad (5)$$

where $N_c(S; \mu, \Sigma)$ denotes the pdf of a complex Gaussian random vector with mean vector $\mu = [\mu(f)]_f$ and diagonal covariance matrix $\Sigma = \mathrm{diag}\big[\sigma^2(f)\big]_f$, defined as in [14] (pp. 503–504)

$$N_c(S; \mu, \Sigma) = \prod_f \frac{1}{\pi\, \sigma^2(f)} \exp\!\left( - \frac{|S(f) - \mu(f)|^2}{\sigma^2(f)} \right) \qquad (6)$$

2) Model Learning: A conventional framework for learning the GMM parameters $\lambda_k$ from training data $\bar s_k$ (for the $k$th source) is based on optimizing the maximum-likelihood (ML) criterion

$$\lambda_k = \arg\max_{\lambda} p\big(\bar S_k \,\big|\, \lambda\big) \qquad (7)$$

where $\bar S_k$ denotes the STFT of the training data $\bar s_k$.

This approach is used for source separation, for instance, in [5]–[7]. In practice, the optimization of the ML criterion is obtained with an expectation-maximization (EM) algorithm [15].
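For illustration, here is a minimal NumPy sketch of ML training by EM of such a zero-mean, diagonal-covariance GMM of STFT frames. It is not the authors' implementation, and it uses a random initialization instead of the K-means initialization reported in Section V.

```python
import numpy as np

def train_gmm_ml(S, Q=32, n_iter=50, floor=1e-8, seed=0):
    """ML training by EM of a zero-mean, diagonal-covariance GMM of STFT frames,
    i.e. a dictionary of Q local PSDs, from an STFT matrix S of shape (F, T)."""
    P = np.abs(S) ** 2                                         # frame power spectra, (F, T)
    T = S.shape[1]
    rng = np.random.default_rng(seed)
    sigma2 = P[:, rng.choice(T, Q, replace=False)].T + floor   # local PSDs, (Q, F)
    w = np.full(Q, 1.0 / Q)                                    # Gaussian weights

    for _ in range(n_iter):
        # E step: responsibilities gamma[i, t] of state i for frame t (log domain for stability)
        logpost = (np.log(w)[:, None]
                   - np.sum(np.log(np.pi * sigma2), axis=1)[:, None]
                   - (1.0 / sigma2) @ P)
        logpost -= logpost.max(axis=0, keepdims=True)
        gamma = np.exp(logpost)
        gamma /= gamma.sum(axis=0, keepdims=True)
        # M step: reestimate weights and local PSDs
        n = gamma.sum(axis=1)
        w = n / T
        sigma2 = (gamma @ P.T) / (n[:, None] + floor) + floor
    return w, sigma2
```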

3) Source Estimation: Once the source models are trained, the sources in the mix can be estimated in the minimum mean square error (MMSE) sense, i.e., with the distortion measure $d$ from (2) defined as the sum of the squared errors $\sum_k \| s_k - \hat s_k \|^2$. This leads to a variant of adaptive Wiener filtering, which is equivalent to the time–frequency masking operation (4) with the mask $M_1(f,t)$ being calculated as follows [6] (and similarly for $M_2(f,t)$):

$$M_1(f,t) = \sum_{i,j} \gamma_{i,j}(t)\, \frac{\sigma^2_{1,i}(f)}{\sigma^2_{1,i}(f) + \sigma^2_{2,j}(f)} \qquad (8)$$

where $\gamma_{i,j}(t)$ denotes the a posteriori probability that the state pair $(i,j)$ has emitted the frame $X(t)$, with the property that $\sum_{i,j} \gamma_{i,j}(t) = 1$, and

$$\gamma_{i,j}(t) \propto \omega_{1,i}\, \omega_{2,j}\, N_c\big(X(t);\, 0,\, \Sigma_{1,i} + \Sigma_{2,j}\big) \qquad (9)$$

where the symbol $\propto$ denotes proportionality, $X(t)$ the short-term spectrum of the mix, and $i$, $j$ the hidden states in models $\lambda_1$ and $\lambda_2$.
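A minimal sketch of this source estimation step, computing the posterior state-pair probabilities (9) and the resulting soft mask (8) by direct (unoptimized) enumeration of the factorial state pairs, could look as follows; the interface is ours.

```python
import numpy as np

def wiener_masks(X, w1, sig1, w2, sig2):
    """Soft masks of (8)-(9) from two trained GMMs (weights w, local PSDs sig of shape (Q, F)).
    X is the mix STFT of shape (F, T)."""
    P = np.abs(X) ** 2
    T = X.shape[1]
    Q1, Q2 = len(w1), len(w2)
    logpost = np.empty((Q1, Q2, T))
    for i in range(Q1):
        for j in range(Q2):
            s = sig1[i] + sig2[j]                        # PSD of the mix under state pair (i, j)
            logpost[i, j] = (np.log(w1[i] * w2[j])
                             - np.sum(np.log(np.pi * s))
                             - (P / s[:, None]).sum(axis=0))
    logpost -= logpost.max(axis=(0, 1), keepdims=True)
    gamma = np.exp(logpost)
    gamma /= gamma.sum(axis=(0, 1), keepdims=True)       # gamma_{ij}(t), cf. (9)
    ratio = sig1[:, None, :] / (sig1[:, None, :] + sig2[None, :, :])  # per-pair Wiener gains
    M1 = np.einsum('ijt,ijf->ft', gamma, ratio)          # cf. (8)
    return M1, 1.0 - M1                                  # the two gains sum to 1, so M2 = 1 - M1
```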

B. Problem Statement

In the approach presented in the previous subsections, a difficulty arises in practice from the fact that the source models tend to perform poorly in realistic cases, as there is generally a mismatch between the models and the actual properties of the sources in the mix.

To illustrate this issue, let us take an example where one of the sources is a voice signal (as, for instance, in [5], [7], [10], and [12]). Either the voice model has been trained on a particular voice, but its generalization ability to other voices tends to be poor, or it is trained on a group of voices, but then it requires a large number of parameters, and even then it tends to lack selectivity to a particular voice in a particular mix, not to mention the variability problems that can be caused by different recording and/or transmission conditions.

The same problem is also reported with other classes of signals, in particular musical instruments [6], [11], and is all the more acute as the separation problem is formulated with less a priori knowledge, for instance when separating singing voice from music, where the class of music signals is extremely wide.

Thus, the practical use of statistical approaches to source separation requires the following problems to be addressed:

1) deal with the scarcity of representative training data for wide classes of audio signals (for instance, the class of music sounds);

2) specialize a source model to the particular properties of a given source in a given mix (for instance, a particular instrument or combination of instruments);

3) account for recording and transmission variability, which can affect significantly the statistical behavior of a source in a mix, w.r.t. its observed properties in a training data set (for instance, the type of microphone, the room acoustics, the channel distortion, etc.);

4) control the computational complexity which arises when dealing with large-size statistical models (for instance, hundreds or thousands of Gaussian functions in a GMM).

These problems can be formulated more strictly in terms of statistical modeling. Suppose that the source $s_k$ observed in the mix is the realization of a random process $S_k$, and that the training data $\bar s_k$ is the realization of a (more or less slightly) different random process $\bar S_k$. Let us denote as $p_{S_k}(\cdot)$ and $p_{\bar S_k}(\cdot)$ the pdfs of these two processes.

In order to reliably estimate $\hat s_1$ and $\hat s_2$ (Section II-A3), the ideal situation would be to know the exact pdfs $p_{S_k}$ of the sources.


However, the sources are not observed separately, which makes it impossible to access or even estimate reliably their pdfs. These pdfs are therefore replaced by those of the training data $\bar s_k$ and approximated by GMMs $\lambda_k$ optimized according to the ML criterion as in (7). In summary

$$p_{S_k}(\cdot) \;\approx\; p_{\bar S_k}(\cdot) \;\approx\; p(\cdot \,|\, \lambda_k) \qquad (10)$$

Model learning with a training scheme requires the training data to be extremely representative of the actual source properties in the mix, which means very large databases with high coverage. However, the effective use of the models for source separation implies that they are also rather selective, i.e., well-fitted to the actual statistical properties of each source in the mix.

In order to overcome these limitations, we propose to resort, when possible, to an adaptation scheme which aims at adjusting a posteriori the models by tuning their characteristics to those of the sources actually observed in the mix. As will be detailed further, this approach makes it possible, under certain conditions, to improve the quality of the source model while keeping its dimensionality reasonable.

III. MODEL ADAPTATION WITH MISSING ACOUSTIC DATA

The goal of model adaptation is to replace the general models (which match well the properties of the training sources, but not necessarily those of the corresponding sources in the mix) with adapted models adjusted so as to better represent the sources in the mix, thus leading to an improved separation ability. In this section, model adaptation is introduced in a general form. The principle is then detailed in the case of a MAP adaptation criterion.

A. Principle

In contrast with the general models $\lambda_1$ and $\lambda_2$, adapted models have their characteristics tuned to those of the sources in the mix. Although adapted and general models have exactly the same structure, new notations are introduced for adapted models and for their parameters, in order to distinguish between these two types of models. Thus, the adapted models are denoted $\lambda'_1$ and $\lambda'_2$, and parameterized as $\lambda'_k = \{\omega'_{k,i}, \Sigma'_{k,i}\}_i$, with $\omega'_{k,i}$ being the weights of the Gaussians and $\Sigma'_{k,i}$ the covariance matrices.

The ideal situation for model adaptation would be to learn the models from the test data, i.e., from the separated sources $s_k$, or at least from some other sources having characteristics extremely similar to those of $s_k$. For example, Benaroya et al. [6] evaluate their algorithms in such a context. They learn the models from the separated sources (available in experimental conditions) issued from the first part of a musical piece, and then they separate the second part of the same piece. While the results are convincing, such a procedure is only possible in a rather artificial context.

Another interesting direction is to try to infer the model parameters directly from the mix $x$. For example, Attias [16] uses such an approach in the multichannel determined case, when there are at least as many channels (or mixes) as sources. In this case, the spatial diversity (i.e., the fact that the sources come from different directions) creates a situation which allows the models to be estimated without any other a priori knowledge. In the single-channel case studied here, this approach cannot be applied as it is, since the spatial diversity is not exploitable. Indeed, one could try to look for the models $\lambda'_1$ and $\lambda'_2$ optimizing the following ML criterion:

$$(\lambda'_1, \lambda'_2) = \arg\max_{\lambda'_1, \lambda'_2} p\big(x \,\big|\, \lambda'_1, \lambda'_2\big) \qquad (11)$$

but this would certainly not lead to any good model estimates, since in this criterion there is no a priori knowledge about the sources to distinguish between them. For example, swapping the models $\lambda'_1$ and $\lambda'_2$ in this criterion does not change the value of the likelihood, i.e., $p(x \,|\, \lambda'_1, \lambda'_2) = p(x \,|\, \lambda'_2, \lambda'_1)$.

Fig. 2. Source separation based on adapted a posteriori probabilistic models.

An alternative approach is to use the MAP adaptation approach [17], widely applied for speech recognition [18] and speaker verification [19] tasks. The MAP estimation criterion consists in maximizing the posterior $p(\lambda'_1, \lambda'_2 \,|\, x)$ rather than the likelihood $p(x \,|\, \lambda'_1, \lambda'_2)$, as in (11). Using the Bayes rule, this posterior can be represented as $p(\lambda'_1, \lambda'_2 \,|\, x) \propto p(x \,|\, \lambda'_1, \lambda'_2)\, p(\lambda'_1, \lambda'_2)$ with a proportionality factor which does not depend on the models $\lambda'_1$, $\lambda'_2$, and therefore has no influence on the optimization of the criterion. In contrast to the ML criterion, the model parameters are now considered as realizations of some random variables and their a priori (or prior) pdf $p(\lambda'_1, \lambda'_2)$ should be specified. We suppose that the parameters of model $\lambda'_1$ are independent from those of model $\lambda'_2$ and that the pdf of the parameters of each model depends on the parameters of the corresponding general model, which can be summarized as $p(\lambda'_1, \lambda'_2) = p(\lambda'_1 \,|\, \lambda_1)\, p(\lambda'_2 \,|\, \lambda_2)$. Finally, we have the following MAP criterion:

$$(\lambda'_1, \lambda'_2) = \arg\max_{\lambda'_1, \lambda'_2} p\big(x \,\big|\, \lambda'_1, \lambda'_2\big)\, p\big(\lambda'_1 \,\big|\, \lambda_1\big)\, p\big(\lambda'_2 \,\big|\, \lambda_2\big) \qquad (12)$$

Note that the MAP criterion (12), in contrast to the ML criterion (11), involves the prior pdfs $p(\lambda'_k \,|\, \lambda_k)$, which force the adapted models to stay attached to the general ones. Thus, the general models play the role of a priori knowledge about the sources.

Better separation performance may be achieved with the MAP criterion (12) and appropriate priors compared to what can be obtained with general models. Fig. 2 illustrates the integration of the a posteriori adaptation unit into the baseline separation scheme (Fig. 1).

For the sake of generality, we do not give here any functional form for the prior pdfs $p(\lambda'_k \,|\, \lambda_k)$. A discussion concerning the role of the priors is proposed in Section III-A1, together with some ideas on how to choose them. In Section IV-D1, these priors are represented as parametric constraints, thus introducing a class of constrained adaptation techniques. Two particular adaptation techniques (i.e., filter and PSD gains adaptation) belonging to this class are introduced in Sections IV-D2 and IV-D3 and evaluated in the experimental part of this paper.

Fig. 3. Bayesian networks representing model learning, source estimation, and a posteriori model adaptation (Fig. 2). Recall that $q_k = \{q_k(t)\}_t$, $k = 1, 2$, denote GMM state sequences. Shadings of nodes: observed nodes (black), estimated hidden nodes (gray), and other hidden nodes (white). (a) Learning. (b) Source estimation. (c) A posteriori model adaptation.

The general adaptation scheme can be represented as in Fig. 3 using Bayesian networks (or oriented graphical models) [20], [21] in order to give some graphical interpretation of the dependencies. Different shadings are used in order to distinguish between different types of nodes. Observed nodes are in black, hidden nodes estimated conditionally on the observed nodes are in gray, and all other hidden nodes (nonestimated ones) are in white.

We propose to call the approach presented in this article model adaptation with missing acoustic data. This expression reflects the following two ideas.

1) Model adaptation corresponds to the attachment of the adapted models to the general models, for instance by means of the prior pdfs $p(\lambda'_k \,|\, \lambda_k)$, as in (12).

2) The use of missing acoustic data corresponds to the fact that the model parameters are estimated from the mix $x$, whereas the actual acoustic data (the sources $s_k$) are unknown (i.e., missing). The adjective acoustic is added in order to avoid any confusion with missing data from the EM algorithm's terminology [15].

1) Role of Priors in the MAP Approach: In the case of the MAP approach, the choice of the prior pdfs $p(\lambda'_k \,|\, \lambda_k)$ results from a tradeoff. On the one hand, since the adaptation is carried out from the mix $x$, the priors should be restrictive enough to attach the adapted models well to the general ones $\lambda_k$. On the other hand, the priors should still give some freedom to the models, so that they can be adapted to the characteristics of the mixed sources. Two extreme cases of this tradeoff are as follows.

1) The models could be completely attached to the general models, i.e., there is no adaptation freedom and $\lambda'_k = \lambda_k$. This is equivalent to the separation scheme without adaptation (Fig. 1).

2) The adapted models could be completely free, i.e., the priors are noninformative uniform pdfs $p(\lambda'_k \,|\, \lambda_k) \propto \mathrm{const}$. This is the case of the ML criterion (11) which, as already discussed, may not lead to a satisfactory adaptation.

A good choice of the priors is therefore crucial, and some examples of potentially applicable priors could be inspired by the many adaptation techniques used for speech recognition and speaker verification, such as MAP [17], [19], maximum likelihood linear regression (MLLR) [22], [23], structural MAP (SMAP) [24], eigenspace-based MLLR (EMLLR) [25], etc. Lee and Huo [18] propose a review of all these methods.

Note that the MAP adaptation as presented in [17] and [19] corresponds to a particular choice of conjugate priors (normal inverse Wishart priors for covariance matrices, and Dirichlet priors for Gaussian weights). In this paper, we call MAP adaptation any procedure which can be represented in the form of the MAP criterion (12), whatever the priors.

2) Comparison With the State of the Art: For source separation with a single channel, some authors propose to introduce invariance to some physical characteristics into source modeling. For example, Benaroya et al. [26] use time-varying gain factors, thus introducing an invariance to the local signal energy. For musical instrument separation, Vincent et al. [11] propose to use other descriptive parameters representing the volume, the pitch, and the timbre of the corresponding musical note. These additional parameters are estimated a posteriori for each frame, since they are time varying. Thus, these approaches can also be considered as an adaptation process. Note, however, that this type of adaptation is based on the introduction of additional parameters which modify the initial structure of the models.

In order to complete the positioning of our work, it must be underlined that the approach formalized in this article groups two aspects together. The adaptation aspect is inspired by the adaptation techniques used, for instance, for speech recognition and speaker verification tasks [17]–[19], [22]–[25]. The inference of model parameters from the mix shares some common points with works concerning speaker identification in noise [27] and blind clustering of popular music recordings [28]. In these two works, the first model is estimated from the mix with the second one fixed a priori, but there is no notion of adaptation, i.e., no attachment of the estimated models to some general ones.

Therefore, our article proposes two main contributions. As developed above, it is possible to group in the same formalism the adaptation aspect and the inference of model parameters from the mix. Details on the corresponding algorithms, in the MAP framework, are provided in the Appendix.

The second contribution is the customization, experimentation, and application of this formalism in a particular case of single-channel source separation. This is detailed in the upcoming sections.

IV. APPLICATION TO VOICE/MUSIC SEPARATION IN POPULAR SONGS

The proposed formalism concerning model adaptation is further developed in this section, with the purpose of customizing it to a particular separation task: the separation of singing voice from music in popular songs.

This separation task is particularly useful for audio indexing. Indeed, the extraction of metadata used for indexing (such as melody, some keywords, singer identity, etc.) is likely to be much easier using separate voice and music signals rather than voice mixed with music.

Fig. 4. A posteriori model adaptation block (compare to Fig. 2) for voice/music separation.

As in (1), it is assumed that each song's recording $x$ is a mix of two sources, now denoted $s_V$ (for voice) and $s_M$ (for music). The problem is to estimate the contribution $\hat s_V$ of voice and that of music $\hat s_M$ given the mix $x$.

For this particular task, the source separation system is designed according to Fig. 2. The model learning and source estimation blocks are implemented as described in Sections II-A2 and II-A3. The remainder of this section is devoted to the description of the model adaptation block.

A. Overview of the Model Adaptation Block

In popular songs, there are usually some portions of the signal where music appears alone (free from voice). We call the corresponding temporal segments nonvocal parts, in contrast to vocal parts, i.e., parts that include voice. A key idea in our adaptation scheme, inspired by the work of Tsai et al. [28], is to use the nonvocal parts for music model adaptation. Then, the obtained music model and the general voice model are further adapted on the totality of the song.

The proposed model adaptation block is represented in Fig. 4 and consists of the following three steps.

1) The song is first segmented into vocal parts $\{X(t)\}_{t \in \Theta}$ and nonvocal parts $\{X(t)\}_{t \notin \Theta}$ (here $\Theta$ denotes the set of vocal frame indices).

2) An acoustically adapted music model $\bar\lambda_M$ is estimated from the nonvocal parts (see Section IV-C).

3) The acoustically adapted music model $\bar\lambda_M$ and the general voice model $\lambda_V$ are further adapted on the entire song with respect to adaptation of filters and PSD gains (presented in Section IV-D).

The resulting models are then used to separate the sources according to Fig. 2.

In summary, the music model is first adapted alone so that it better reflects the acoustic characteristics of the very type of music in the song, and then both music and voice models are adapted in terms of gain level and recording conditions.

The functional blocks of this adaptation scheme (Fig. 4) are described in the following sections.

B. Automatic Vocal/Nonvocal Segmentation

The practical problem of segmenting popular songs into vocal and nonvocal parts has already been studied [29]–[32], and some reported systems give reasonable segmentation performance.

In the work reported in this paper, a classical solution based on GMMs [28], [32] is used. The STFT of the processed song, which is a sequence of short-time spectra $\{X(t)\}_t$, is transformed into a sequence of acoustic parameters $\{y(t)\}_t$ (typically MFCCs [33]). Two GMMs $\tau_V$ and $\tau_N$, modeling, respectively, vocal and nonvocal frames, are used to decide whether the vector $y(t)$ is a vocal or a nonvocal one.¹ The GMMs $\tau_V$ and $\tau_N$ are learned from some training data, i.e., popular songs manually segmented into vocal and nonvocal parts. These models are used for segmentation without any preliminary adaptation to the characteristics of the processed song. These are indeed general segmentation models.

The vocal/nonvocal decision for the $t$th frame can be obtained by comparing the log-likelihood ratio with some threshold $\delta$:

$$\rho(t) = \log p\big(y(t) \,\big|\, \tau_V\big) - \log p\big(y(t) \,\big|\, \tau_N\big) \qquad (13)$$

However, the segmentation performance can be increased significantly by averaging the frame-based score over a block of several consecutive frames [28], [32]. For this block-based decision, the log-likelihood ratio (13) over each block $B$ of $L$ frames is computed as

$$\rho_B = \frac{1}{L} \sum_{t \in B} \rho(t) \qquad (14)$$
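A possible implementation of the frame-based score (13) and the block-based decision (14) is sketched below. It assumes the two segmentation models are objects exposing a `score_samples` method that returns per-frame log-likelihoods (e.g., fitted scikit-learn `GaussianMixture` models); the default block length of 21 frames roughly corresponds to 1 s with the STFT settings of Section V.

```python
import numpy as np

def vocal_nonvocal_decision(Y, gmm_vocal, gmm_nonvocal, block_len=21, delta=0.0):
    """Block-based vocal/nonvocal decision, cf. (13)-(14). Y is a (T, D) matrix of
    MFCC-based feature vectors; the two GMMs are assumed already fitted."""
    rho = gmm_vocal.score_samples(Y) - gmm_nonvocal.score_samples(Y)   # frame LLR, (13)
    is_vocal = np.empty(len(rho), dtype=bool)
    for start in range(0, len(rho), block_len):
        block = slice(start, start + block_len)
        is_vocal[block] = rho[block].mean() > delta                    # block decision, (14)
    return is_vocal
```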

C. Acoustic Adaptation of the Music Model

The acoustically adapted music model $\bar\lambda_M$ is estimated from the nonvocal parts using a MAP criterion

$$\bar\lambda_M = \arg\max_{\lambda} p\big(\{X(t)\}_{t \notin \Theta} \,\big|\, \lambda\big)\, p\big(\lambda \,\big|\, \lambda_M\big) \qquad (15)$$

where, following [17], the prior pdf $p(\lambda \,|\, \lambda_M)$ is chosen as the product of pdfs of conjugate priors for the model parameters (i.e., normal inverse Wishart priors for covariance matrices, and Dirichlet priors for Gaussian weights). These priors involve a relevance factor $r \ge 0$ as a parameter representing the degree of attachment of the adapted model $\bar\lambda_M$ to the general one $\lambda_M$. This MAP criterion, with such a choice for the priors, can be optimized using the EM algorithm [15], leading to the reestimation formulas which can be found in [17].

¹Note that the structure of these GMMs is slightly different from that of the GMMs $\lambda_V$ and $\lambda_M$ used for separation. In particular, the observation vectors are real (not complex), and the mean vectors are not zero.
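The exact conjugate-prior reestimation formulas are those of [17]. As a rough stand-in, the following sketch uses Reynolds-style relevance-factor updates of the weights and local PSDs, which exhibits the same behavior at the two extremes (a relevance factor of zero gives plain ML retraining, a very large one keeps the general model); it is an illustration, not the authors' implementation.

```python
import numpy as np

def map_adapt_gmm(S, w0, sigma2_0, r=16.0, n_iter=40, floor=1e-8):
    """MAP adaptation of a zero-mean, diagonal-covariance GMM (general weights w0 and
    local PSDs sigma2_0 of shape (Q, F)) to the nonvocal frames S (STFT, shape (F, T)),
    using relevance-factor updates in place of the conjugate-prior formulas of [17]."""
    P = np.abs(S) ** 2
    T = S.shape[1]
    w, sigma2 = w0.copy(), sigma2_0.copy()
    for _ in range(n_iter):
        # E step: responsibilities under the current adapted model
        logpost = (np.log(w)[:, None]
                   - np.sum(np.log(np.pi * sigma2), axis=1)[:, None]
                   - (1.0 / sigma2) @ P)
        logpost -= logpost.max(axis=0, keepdims=True)
        gamma = np.exp(logpost)
        gamma /= gamma.sum(axis=0, keepdims=True)
        # M step: interpolate the data statistics with the general (prior) model
        n = gamma.sum(axis=1)                        # soft counts per state
        alpha = n / (n + r + floor)                  # attachment: r = 0 -> pure retraining
        E = (gamma @ P.T) / (n[:, None] + floor)     # data-driven local PSDs
        sigma2 = alpha[:, None] * E + (1.0 - alpha[:, None]) * sigma2_0
        w = alpha * (n / T) + (1.0 - alpha) * w0
        w /= w.sum()
    return w, sigma2
```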


This way of estimating an acoustically adapted music model calls for both prior knowledge and auxiliary information, thus fitting in the general formalism introduced in the previous section:

• the pretrained model $\lambda_M$, which expresses prior statistical knowledge on the music source and translates into an attachment constraint of the observed source parameters to the general model;

• the segmentation of the mix between vocal and nonvocal parts, where the latter represent auxiliary information indicating when the mix $x$ can be considered as pertaining to the music source only, which can be resorted to for improving the estimation of the music model.

In the general case, these two sources of knowledge and information are combined in the maximization of criterion (15). However, two extreme cases may occur in particular practical situations.

• Full-Retrain ($r = 0$): The nonvocal parts are in a significant and sufficient quantity to allow a complete reestimation of the music model without resorting to the prior knowledge from the general music model: in that case, the MAP approach degenerates into an ML estimation of the adapted music model.

• No-Adapt ($r \to \infty$): Very few or even no nonvocal parts at all are detected in the mix $x$; no auxiliary information is thus available, and therefore the general model constitutes the only source of knowledge that is exploitable to constrain the solution of the separation procedure.

D. Adaptation of Filters and PSD Gains

In this section, an adaptation technique called adaptation of filters and PSD gains (Fig. 4) is presented. This technique falls within the proposed adaptation formalism (Section III).

It can be viewed as a constrained adaptation technique, which is presented below in a general form. Next, it is explained how such a technique fits in our proposed adaptation formalism. Then, the techniques of filter adaptation and PSD gain adaptation are introduced separately. Finally, it is shown how these techniques can be assembled to form a joint filter and PSD gain adaptation.

1) Constrained Adaptation: Constrained adaptation is based on the assumption that the parameters of each adapted model $\lambda'_k$ belong to some subset of admissible parameters $\Lambda_k$ which depends on the parameters of the corresponding general model $\lambda_k$. It is supposed as before that the parameters of the model $\lambda'_k$ possess some prior density, which is defined on this subset $\Lambda_k$ and depends also on the general model $\lambda_k$. For example, the parameters of the adapted model can depend on those of the general model via some parametric deformation $F_k$ with free parameters $C_k$, i.e., $\lambda'_k = F_k(\lambda_k; C_k)$. The goal of constrained adaptation is to find the free parameters $C_1$ and $C_2$ satisfying the following MAP criterion:

$$(C_1, C_2) = \arg\max_{C_1, C_2} p\big(x \,\big|\, \lambda'_1, \lambda'_2\big)\, p\big(C_1 \,\big|\, \lambda_1\big)\, p\big(C_2 \,\big|\, \lambda_2\big)$$
$$\text{subject to } \lambda'_1 = F_1(\lambda_1; C_1) \text{ and } \lambda'_2 = F_2(\lambda_2; C_2) \qquad (16)$$

where $p(C_1 \,|\, \lambda_1)$ and $p(C_2 \,|\, \lambda_2)$ are the prior pdfs for the free parameters.² The adapted models are then obtained as $\lambda'_1 = F_1(\lambda_1; C_1)$ and $\lambda'_2 = F_2(\lambda_2; C_2)$.

From a strict mathematical point of view, the MAP criterion (16) is different from criterion (12), but from a practical point of view they are similar. Indeed, the additional parametric constraints (i.e., $\lambda'_k = F_k(\lambda_k; C_k)$) play a similar role to that of the prior pdfs, and the EM algorithm (see the Appendix) is still applicable for criterion (16).

2) Filter Adaptation: In our previous work [34], we introduced a constrained adaptation technique consisting in the adaptation of one single filter. With this adaptation, the modeling becomes invariant to any cross-recording variation that can be represented by a global linear filter, for example a variation of room acoustics, of some microphone characteristics, etc. The mismatch between the general model $\lambda_k$ and the adapted one $\lambda'_k$ can be modeled as a linear filter. In other words, each source modeled by $\lambda'_k$ is considered as the result of filtering, with a filter $h_k$, of some other source modeled by $\lambda_k$. The filter $h_k$ is supposed to be unknown, and the goal of the filter adaptation technique is to estimate it.

Let $H_k(f)$ be the Fourier transform of the impulse response of the filter $h_k$. We have the following relation between the PSDs of the adapted and general models:

$$\sigma'^2_{k,i}(f) = |H_k(f)|^2\, \sigma^2_{k,i}(f) \qquad (17)$$

Introducing the diagonal matrix $H_k = \mathrm{diag}\big[\,|H_k(f)|^2\,\big]_f$ (hereafter, this matrix will be called a filter), expression (17) can be rewritten as follows, linking the model $\lambda'_k$ together with the model $\lambda_k$:

$$\Sigma'_{k,i} = H_k\, \Sigma_{k,i} \qquad (18)$$

In the context of constrained adaptation presented in the previous subsection, the filter $H_k$ plays the role of the free parameters $C_k$, and (18) plays the role of the parametric deformation $F_k(\lambda_k; C_k)$. The following criterion, corresponding to criterion (16), is used to estimate the filter $H_V$ of the voice model:

$$H_V = \arg\max_{H} p\big(x \,\big|\, \lambda'_V, \bar\lambda_M\big), \qquad \text{with } \Sigma'_{V,i} = H\, \Sigma_{V,i} \qquad (19)$$

Note that, since the adaptation is done in two steps (see Fig. 4), the acoustically adapted music model $\bar\lambda_M$ is used in this criterion (19) instead of some general music model $\lambda_M$. Let us also remark that there is no additional constraint on the filter, i.e., there is a noninformative uniform prior $p(H \,|\, \lambda_V) \propto \mathrm{const}$. However, thanks to constraint (18), the adapted model $\lambda'_V$ remains attached to the general one $\lambda_V$. In Appendix D1, we describe in detail how to perform the EM algorithm to optimize criterion (19).

Note that filter adaptation can be considered as a sort of constrained MLLR adaptation. Indeed, the MLLR technique [22], [23] consists in adapting an affine transform of the feature space, while for filter adaptation, only dilatations and contractions along the axes of the STFT (feature) space are allowed.

²For the particular constrained adaptation techniques introduced in this paper (cf. Sections IV-D2 and IV-D3), noninformative uniform priors are used, i.e., $p(C_k \,|\, \lambda_k) \propto \mathrm{const}$. In other words, no particular knowledge is assumed on the values taken by the free parameters $C_k$.

3) PSD Gains Adaptation: Each state of a GMM is described by some characteristic spectral pattern (or local PSD) corresponding to some particular sound event, for example, a musical note or chord. The relative mean energies of these sound events vary between recordings. For example, in one recording, the A note can be played on average louder than the D note, while it can be the opposite for another recording. In order to take into account this energy variation, a positive gain $g_{k,i}$ is associated to each PSD of the model $\lambda'_k$. This gain is called PSD gain and corresponds to the mean energy of the sound event represented by this PSD. Since each PSD is the diagonal of the corresponding covariance matrix $\Sigma_{k,i}$, this matrix is multiplied by the PSD gain $g_{k,i}$. Thus, the PSD gains adaptation technique consists in looking for the adapted model $\lambda'_k$ in the following form:

$$\lambda'_k = g_k \odot \lambda_k, \qquad \Sigma'_{k,i} = g_{k,i}\, \Sigma_{k,i} \qquad (20)$$

where $g_k = [g_{k,1}, \ldots, g_{k,Q_k}]$ is a vector of PSD gains and the symbol "$\odot$" denotes a nonstandard operation used here to distinguish the application of the PSD gains ($\lambda'_k = g_k \odot \lambda_k$) from that of the filter (18) ($\lambda'_k = H_k\, \lambda_k$).

Compared to the filter adaptation technique, where the goal is to adapt the energy for each frequency band $f$, the goal of the PSD gains adaptation is to adapt the energy for each PSD $i$.

The following explanation is very similar to the one given for filter adaptation. The PSD gains play the role of the free parameters, and the following criterion is used to estimate them:

$$g_V = \arg\max_{g} p\big(x \,\big|\, \lambda'_V, \bar\lambda_M\big), \qquad \text{with } \lambda'_V = g \odot \lambda_V \qquad (21)$$

Again, the EM algorithm can be used to reestimate the gains, as explained in Appendix D2.
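The two constraints (18) and (20) only rescale the local PSDs of a model. A minimal sketch of how they transform the covariance parameters follows (estimating the filter and the gains themselves requires the EM updates of Appendix D):

```python
import numpy as np

def apply_filter(sigma2, H2):
    """Filter adaptation constraint, cf. (18): every local PSD is rescaled by the
    squared magnitude response H2[f] = |H(f)|^2 of a single global filter."""
    return sigma2 * H2[None, :]      # sigma2: (Q, F), H2: (F,)

def apply_psd_gains(sigma2, g):
    """PSD gains adaptation constraint, cf. (20): each state's local PSD is rescaled
    by its own positive gain g[i] (mean energy of the corresponding sound event)."""
    return sigma2 * g[:, None]       # g: (Q,)
```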

4) Joint Filter and PSD Gains Adaptation: This section details how to adapt the filters and PSD gains jointly for both models. The adapted voice and music models are represented in the following form: $\lambda'_V = g_V \odot (H_V\, \lambda_V)$ and $\lambda'_M = g_M \odot (H_M\, \bar\lambda_M)$, where $H_M$ and $g_M$ denote, respectively, the filter and the PSD gains of the music model. The following criterion is used to estimate all these parameters:

$$(H_V, g_V, H_M, g_M) = \arg\max p\big(x \,\big|\, \lambda'_V, \lambda'_M\big) \qquad (22)$$

The direct application of the EM algorithm (29), (30) to optimize criterion (22) is not possible, since it is difficult to solve the M step (cf. (30) in the Appendix) jointly on the filters $\{H_V, H_M\}$ and the PSD gains $\{g_V, g_M\}$ (see Appendix D3).

One solution to this problem would be to use the space-alternating generalized EM (SAGE) algorithm [35], [36], alternating the EM iterations between $\{H_V, g_V\}$ and $\{H_M, g_M\}$. However, in contrast to the EM algorithm, this approach requires two EM iterations instead of one to reestimate all the parameters $\{H_V, g_V, H_M, g_M\}$ once. Thus, the computational complexity doubles.

Analyzing separately the computational complexities of the E and M steps (29), (30), we see that the M step computational complexity is negligible in comparison with that of the E step. Indeed, the complexity of the E step (calculation of the natural statistics expectations) grows with the product $Q_V Q_M$ of the numbers of states of the two models [see (38) and (39)], whereas that of the M step (parameters update) grows only with their sum [see, for example, (42) and (43)]. Thus, in order to avoid doubling the complexity, instead of using the SAGE algorithm, we propose for each iteration to do one E step followed by several M steps alternating between the updates of $\{H_V, g_V\}$ and $\{H_M, g_M\}$. Algorithm 2 in the Appendix summarizes this principle.

V. EXPERIMENTS

Experiments concerning model adaptation in the context of voice/music separation are presented in this section. First, the module for automatic vocal/nonvocal segmentation is evaluated independently from the adaptation block (Fig. 4). Then, the experiments on model adaptation and separation are developed, using a manual vocal/nonvocal segmentation in the first place, and an automatic segmentation in the second place.

A. Automatic Vocal/Nonvocal Segmentation

1) Data Description: The training set for learning the GMMs $\tau_V$ and $\tau_N$, modeling vocal and nonvocal parts, contains 52 popular songs. A set of 22 other songs is used to evaluate the segmentation performance. All recordings are mono, sampled at 11 025 Hz, and manually segmented into vocal and nonvocal parts.

2) Acoustic Parameters: Classical MFCC-based acoustic parameters are chosen for this segmentation task. In particular, the vector of parameters for each frame consists of the first 12 MFCC coefficients [33] and the energy (13 parameters), together with their first- and second-order derivatives (which represents 39 parameters in total). The MFCC coefficients are obtained from the STFT, which is computed using a half-overlapped 93-ms-length Hamming window. Parameters are normalized using cepstral mean subtraction (CMS) and variance normalization (VN) [37] in order to reduce the influence of convolutive and additive noises.
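A minimal sketch of the per-recording CMS and VN normalization applied to the feature matrix (the MFCC extraction itself is omitted):

```python
import numpy as np

def cms_vn(features):
    """Cepstral mean subtraction (CMS) and variance normalization (VN) applied
    per recording to a (T, D) matrix of MFCC-based feature vectors."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True) + 1e-12
    return (features - mu) / sigma
```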

3) Performance Measure: The performance of vocal/nonvocal segmentation is evaluated using detection error tradeoff (DET) curves [38]. For a given segmentation threshold $\delta$ [see (13), (14)], the segmentation performance can be evaluated in terms of two types of errors: the vocal miss error rate (VMER), which is the rate of vocal frames identified as nonvocal, and the vocal false alarm rate (VFAR), which is the rate of nonvocal frames identified as vocal.

These error measures are computed by comparing the automatic segmentation with a manual one. Frames localized in a 0.5-s interval around a manually marked switch-point are not taken into account for the calculation of VMER and VFAR. This tolerance is justified by the fact that it is difficult to mark accurately the switch-points between vocal and nonvocal parts by hand. The coordinates of each point of a DET curve are the VMER and the VFAR as the segmentation threshold $\delta$ varies.

4) Simulations: A 32-Gaussian GMM $\tau_V$ and a 32-Gaussian GMM $\tau_N$ are learned from the training data using 50 iterations of the EM algorithm,³ which is initialized by the K-means (or Lloyd) algorithm [39]. Segmentation results are represented in Fig. 5. With the frame-based decision (13), the equal error rate (EER) (i.e., when VMER = VFAR) is 29%. Note that a random segmentation gives an EER of 50%. When the block-based decision (14) with 1-s block length is used, the EER falls significantly, down to 17%.

Fig. 5. DET curves for vocal/nonvocal automatic segmentation. Dotted line: random segmentation, EER = 50%. Dashed line: frame-based decision [see (13)], EER = 29%. Solid line: block-based decision [see (14)] with 1-s block length, EER = 17%. Square: operating point chosen for model adaptation ($\delta$ = 0, VMER = 15%, VFAR = 19%).

Since one goal of this work is to improve the separation performance, the threshold $\delta$ (corresponding to some operating point on the DET curve) for the segmentation system integrated into the model adaptation scheme (Fig. 4) should be chosen on the basis of the separation performance. This issue is addressed in the following section.

Note that, in the choice of the segmentation threshold, there is a tradeoff between purity and quantity of data. Indeed, since the nonvocal parts are used for the acoustic adaptation of the music model (Fig. 4), on the one hand the nonvocal parts should be quite pure, i.e., not much disturbed by vocal frames detected by mistake, which means the VMER should be low. On the other hand, a sufficient quantity of nonvocal frames should be detected correctly in order to have enough data to adapt the music model, i.e., the VFAR should be low.

B. Adaptation and Separation

1) Data Description: The training database for the general voice model $\lambda_V$ includes 34 samples of "pure" singing voice from popular music. The general music model $\lambda_M$ is trained on 30 samples of popular music free from voice. Each sample is approximately one minute long. The test database contains six popular songs, for which voice and music tracks are available separately. It is therefore possible to evaluate the separation performance by comparing the estimated voice with the original one. The test items are manually segmented into vocal and nonvocal parts (automatic segmentation is also performed in the experiment). All recordings are mono and sampled at 11 025 Hz.

³The numbers of EM algorithm iterations (here 50) reported hereafter were found suitable for guaranteeing appropriate convergence of the algorithm in each particular implementation.

2) Parameters: As for segmentation, the STFT is computed using a half-overlapped 93-ms-length Hamming window.

3) Performance Measure: Separation performance is estimated using the normalized SDR (NSDR) [34], which measures the improvement of the source-to-distortion ratio (SDR) [40] in decibels

$$\mathrm{SDR}(\hat s, s) = 10 \log_{10} \frac{\| s \|^2}{\| s - \hat s \|^2} \qquad (23)$$

between the nonprocessed mix $x$ and the estimated source $\hat s$:

$$\mathrm{NSDR}(\hat s, x, s) = \mathrm{SDR}(\hat s, s) - \mathrm{SDR}(x, s) \qquad (24)$$

The aim of this normalization is to combine the absolute measure $\mathrm{SDR}(\hat s, s)$ and the "difficulty" of the separation task for the processed recording, $\mathrm{SDR}(x, s)$. This difficulty is expressed as the performance of "inactive separation," i.e., when the mix $x$ itself is taken instead of the estimate $\hat s$. The higher the NSDR, the better the separation performance.
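For illustration, the following sketch computes the SDR and the NSDR as reconstructed in (23) and (24); the SDR is taken here as a plain energy ratio rather than the full measure of [40].

```python
import numpy as np

def sdr(est, ref):
    """Source-to-distortion ratio in dB, taken here as the energy ratio between the
    reference source and the estimation error."""
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def nsdr(est, mix, ref):
    """Normalized SDR, cf. (24): SDR improvement over 'inactive separation',
    i.e. over simply keeping the mix as the estimate."""
    return sdr(est, ref) - sdr(mix, ref)
```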

In the context of audio indexing, we are mainly interested in voice estimation (Section IV). Therefore, the separation performance is evaluated using the voice NSDR (i.e., $\mathrm{NSDR}(\hat s_V, x, s_V)$) and not the music one. Note, at the same time, that the order of magnitude of the music NSDR is quite similar to that of the voice NSDR. The overall performance is estimated by averaging the voice NSDRs calculated for all songs from the test database.

4) Simulations: In order to estimate the efficiency of each step in the proposed adaptation scheme, as well as the efficiency of the adaptation of different parameter combinations (filters, PSD gains), the separation experiments are performed with a 32-state voice GMM and a 32-state music GMM in the following configurations.

1) General models: $\lambda_V$ and $\lambda_M$ are learned from external training data (50 iterations of the EM algorithm, initialized by the K-means algorithm).

2) Acoustically adapted models:

• Voice model: As mentioned in Section IV-A, the mix $x$ is segmented into vocal and nonvocal parts. The vocal parts correspond to portions of the signal that include voice, but only an insignificant part of these portions may contain only voice signals (most of them are composed of voice plus music). Therefore, these data are not used in the acoustic adaptation of the voice model. The voice GMM is kept constant, i.e., $\bar\lambda_V = \lambda_V$, which corresponds to the degenerate case "no-adapt" of Section IV-C.

• Music model: Experiments have been run to determine the optimal relevance factor $r$ (see Section IV-C) for adapting the music GMM in the MAP framework. For our test data set, the optimal value was observed to be zero, i.e., the "full-retrain" degenerate case of Section IV-C.⁴ The EM algorithm run in this context was iterated 40 times after initialization by the K-means algorithm.

⁴This situation may arise from the fact that the general music model is not very elaborate and that the number of music-only frames (about 200–500) segmented from the mix is of sufficient quantity for every song.

TABLE I. AVERAGE NSDR ON THE SIX SONGS OF THE TEST DATABASE OBTAINED WITH DIFFERENT MODEL TYPES ($Q_V = Q_M = 32$)

3) Filter/gain adapted models: $\lambda'_V$ and $\lambda'_M$ are obtained from the models $\lambda_V$ and $\bar\lambda_M$ via an adaptation of the following parameter combinations:⁵

a) filter-adapted for the voice model ($H_V$);
b) filter- and gain-adapted for the voice model ($H_V, g_V$);
c) filter-adapted for the voice model and gain-adapted for both models ($H_V, g_V, g_M$);
d) filter- and gain-adapted for both models ($H_V, g_V, H_M, g_M$).

(Five iterations of EM Algorithm 2 described in Appendix D, initialized as follows: the filters set to the identity matrix and all PSD gains set to 1.)

sources and , which are available for evaluation pur-poses (40 iterations of the EM algorithm, initialized by theK-means algorithm). The separation performance obtainedwith these “ideal” models (inaccessible in a real applica-tion context) acts as a kind of empirical upper bound forthe separation performance, which can be obtained withadapted models.

Since the estimation of the acoustically adapted music model is based on some vocal/nonvocal segmentation, the tests involving this model are performed using both manual and automatic segmentation.

Automatic segmentation is done by a block-based decision system (14) with 1-s block length and with a segmentation threshold $\delta = 0$. These parameters were chosen since they lead to the best separation performance with acoustically adapted models. The segmentation performance of this system is VMER = 15% and VFAR = 19% (see Fig. 5).

The average results on the six songs of the test database are summarized in Table I. The main performance improvement is obtained with the acoustic adaptation of the music model $\bar\lambda_M$ from the nonvocal parts. There is a 5.8 and 4.1 dB improvement for manual and automatic segmentation, respectively.

The voice model filter adaptation increases the performance further by 1.0 and 0.7 dB for the two types of segmentation. An additional adaptation of the PSD gains for this model also leads to a slight performance improvement. Adaptation of the music model parameters (i.e., $H_M$ and $g_M$) does not increase the performance any further. This can be explained by the fact that the music model is already quite well adapted by the acoustic adaptation step.

⁵Note that Algorithm 2 is still applicable with slight modifications when only a part of the parameters $\{H_V, g_V, H_M, g_M\}$ is adapted. For example, to adapt $\{H_V, H_M, g_V\}$, (45) should be skipped.

Fig. 6. Average NSDR on the six songs of the test database for different numbers of states $Q_V = Q_M = Q$ and for different types of models. Plain: general models. Dotted: adapted models with automatic segmentation. Dashed: adapted models with manual segmentation.

Altogether, compared with the general models, adaptation improves the separation performance by 7.4 dB with a manual segmentation and still by 5.1 dB when the segmentation is completely automatic. One can note that these results are 3.1 and 5.4 dB below the empirical upper bound obtained using ideal models. It remains a challenge to reduce this gap with improved model adaptation schemes.

The effect of model dimensionality (i.e., the number of states $Q = Q_V = Q_M$) on the separation performance is evaluated in the following configurations:

1) general models $\lambda_V$ and $\lambda_M$;

2) adapted models $\lambda'_V$ and $\lambda'_M$ (giving the best separation results according to Table I) with manual segmentation;

3) adapted models $\lambda'_V$ and $\lambda'_M$ with automatic segmentation.

The results are represented in Fig. 6. Note that increasing the number of states in the case of general models does not lead to a performance improvement compared with one-state GMMs ($Q = 1$). A one-state GMM consists of only one PSD; thus, for $Q_V = Q_M = 1$, the Wiener filter defined by (8) is a linear filter which does not vary in time. It was noticed that for voice estimation, this is merely a high-pass filter with its cutoff frequency around 300 Hz. Thus, for the voice/music separation task, the general models cannot give better performance than the 5-dB NSDR obtained with a simple high-pass filtering, and there is therefore no interest in using general models with several states. Probably, this is because of the problem of weak representativeness of training data for wide sound classes mentioned in Section II-B.

Fig. 7. Detailed NSDR on the six songs of the test database for different numbers of states $Q_V = Q_M = Q$ and for different types of models. Plain: general models. Dotted: adapted models with automatic segmentation. Dashed: adapted models with manual segmentation.

As illustrated by our experiments, model adaptation allows these limits to be overcome. Indeed, with adapted models, the separation performance can be significantly improved by increasing the number of model states, and the added computational complexity pays off. This experiment indicates that for source separation tasks with wide sound classes (such as music), model adaptation is essential.

As can be seen in Fig. 7, a deeper investigation of the behavior of the NSDR for each of the six test songs separately shows a consistent behavior of the proposed adaptation scheme.

Concerning computational complexity, it is worth mentioning that the proposed system needs about 4 h to separate 23 min of audio (the total duration of the six test songs) using a laptop equipped with a 1.7-GHz Pentium M processor, which is quite reasonable.

Note that the problem of voice/music separation in monophonic recordings is a very difficult task which has not been studied much ([34], [41], [42]). For this task, we have developed a separation system which, thanks to model adaptation, has the following advantages.

• Compared with general models, the separation performance is improved by 5 dB.

• The system is completely automatic.

• The computational complexity is quite reasonable (less than ten times real time).

• Experiments were carried out with no special restrictions on music style (within pop/rock songs) or on the language of the songs.

However, some limitations should also be mentioned. First, the processed song must contain nonvocal parts of reasonable length, in order to have enough data for the acoustic adaptation of the music model. Second, the music in the nonvocal parts should be quite similar to that in the vocal parts. Finally, it is preferable that there is only one singer at a time, i.e., no chorus or backing vocals. At first sight, a majority of popular songs satisfy these assumptions.

VI. CONCLUSION AND FURTHER WORK

In the context of probabilistic methods for source separation with a single channel, we have presented a general formalism consisting of a posteriori model adaptation. This formalism is introduced in the general form of Bayesian models, and further clarified in terms of a MAP adaptation criterion which can be optimized using the EM algorithm.

To show the relevance of model adaptation in practice, a model adaptation system derived from this formalism has been designed for the difficult task of separating voice from music in popular songs. This system is based on a vocal/nonvocal segmentation, on acoustically adapting a music model from the nonvocal parts, and on a final adaptation of the voice and music models from the mix using the filter and PSD gain adaptation technique.



Compared to general (nonadapted) models, adaptation consistently improves the separation performance in our experiments. It yields on average a 5-dB improvement, which bridges half of the gap between the use of general models on the one hand and ideal models on the other.

More generally, by formulating the adaptation process in a rather general way, which integrates prior knowledge, structural constraints, and a posteriori observations, the work reported in this paper may contribute to the solution of a number of problems, whether they pertain to blind, knowledge-based, or data-driven source separation.

APPENDIX

In this Appendix, the EM algorithm [15] is applied to optimize the MAP criterion (12). The algorithm is first presented in its general form, then specialized to the case of exponential families. Finally, some additional calculations are given for the GMMs studied in this article.

A. EM Algorithm in its General Form

The following notations are introduced, with their names given according to the terminology of the EM algorithm [15], [36]:

• the observed data;
• the complete data (recall that the GMM state sequences are part of the complete data);
• the estimated parameters;
• the prior pdf.

Note that the observed and complete data are chosen in an appropriate way to use EM. Indeed, according to (3), the observed data are expressed in a unique manner from the complete data. With these new notations, the MAP criterion (12) can be rewritten in a more compact form

(25)
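For readability, this compact criterion can also be written out in generic notation, with x denoting the observed data and \theta the model parameters (these symbols are ours and may differ from those used in the original typesetting):

    \theta^{*} \;=\; \arg\max_{\theta}\; \log p(x \mid \theta) \;+\; \log p(\theta).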

In order to optimize this MAP criterion, the EM algorithm is used. This algorithm is an iterative procedure, and in its general form can be expressed as follows [15], [36]:

(26)

(27)

where the superscript denotes the iteration index of the estimated parameters. The E step (expectation) (26) consists in computing an auxiliary function, and the M step (maximization) (27) consists in estimating the new parameters that maximize this function.
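Assuming the standard EM formulation for MAP estimation, which is what the surrounding text describes, the two steps (26)-(27) can be written in the same generic notation (x the observed data, z the complete data, \theta the parameters; all symbols ours):

    Q\big(\theta; \theta^{(\ell)}\big) \;=\; \mathbb{E}\Big[\log p(z \mid \theta) \,\Big|\, x,\, \theta^{(\ell)}\Big] \;+\; \log p(\theta) \qquad \text{(E step)}

    \theta^{(\ell+1)} \;=\; \arg\max_{\theta}\; Q\big(\theta; \theta^{(\ell)}\big) \qquad \text{(M step)}

with \theta^{(\ell)} the parameters estimated at iteration \ell.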

B. EM Algorithm for Exponential Families

The EM algorithm takes a particular form if the families of complete data pdfs are exponential families, as recalled in Definition 1 below. This is the case for the GMMs (as shown in Appendix C1), as well as for HMMs. In this paper, we present the EM algorithm for this particular case of exponential families, since we believe that in this form the algorithm is easier to understand, and its derivation for the GMMs, as well as for the HMMs, becomes very compact and straightforward.

Definition 1 ([15], [36]): A family of pdfs parameterized by the model parameters is called an exponential family if each of its members can be expressed in the following form:

(28)

where two of the involved functions are scalar-valued, two are vector-valued, and ⟨·, ·⟩ denotes the scalar product. The vector-valued function of the complete data appearing in this scalar product is called the natural statistics of the exponential family. The natural statistics are also sufficient [14] for the parameter. For any sufficient statistics, the following property is fulfilled.

Property 1: If a statistic is sufficient for a parameter, then the MAP estimator of that parameter must be a function of this statistic.
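For concreteness, the standard exponential-family form referred to in (28) can be written, in our own notation, as

    p(z \mid \theta) \;=\; \frac{b(z)}{a(\theta)}\, \exp\!\big( \langle \phi(\theta),\, t(z) \rangle \big),

where a(\theta) and b(z) are the scalar functions, \phi(\theta) and t(z) the vector functions, and t(z) the natural statistics; the letters a, b, \phi, and t are ours, not necessarily those of the paper.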

Here, denoting the natural statistics of the respective complete data pdfs, it can be shown [15], [36] that the EM algorithm (26), (27) can be represented in a form which is easier to understand, to interpret, and to use, specifically

(29)

(30)

with the functions defined as the solutions of the following complete data MAP criteria

(31)

The existence of such functions, depending only on the natural (sufficient) statistics, is guaranteed by Property 1. Note that the MAP criteria (31) correspond to the MAP criterion (12) under the assumption that the complete data are observed.
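In the same generic notation as above, the natural-statistics form of EM, i.e., the counterparts of (29)-(31), reads

    \hat{t}^{(\ell)} \;=\; \mathbb{E}\big[\, t(z) \;\big|\; x,\, \theta^{(\ell)} \big], \qquad \theta^{(\ell+1)} \;=\; \theta^{*}\big(\hat{t}^{(\ell)}\big),

where \theta^{*}(\cdot) denotes the solution of the complete data MAP criterion (31), expressed as a function of the natural statistics (Property 1 guarantees that such a function exists). This is our own compact restatement, not a quotation of the original equations.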

The following simple interpretation can be given to this EM algorithm. If the complete data were observed, we would use the complete data MAP criteria (31) directly. However, since the complete data are not observed, the values of the natural statistics are replaced by their expectations (29), conditionally on the observed data and on the models estimated at the previous iteration. Thus, the E step (29) consists of computing the conditional expectations of the sufficient statistics, and the M step (30) consists of estimating the new model parameters using these expectations.

C. EM Algorithm for GMMs

1) Natural Statistics for GMMs: For the GMMs used throughout this paper, the families of complete data pdfs are exponential families, and their natural statistics are

(32)

with

(33)

and

(34)



where δ denotes the Kronecker delta function, which equals 1 when its two arguments coincide and 0 otherwise.

Indeed, using the GMM definition (5), the log-likelihood of the complete data can be expressed as follows:

(35)

where the statistics are defined according to (33) and (34). Equation (35) can be rewritten as a scalar product between some vector function of the model parameters and the natural statistics defined in (32), which shows that the complete data pdf indeed belongs to an exponential family.

The first statistics count the number of times that each state has been observed, and the second statistics represent the energy of the STFT associated with each state, computed in each frequency band.
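In our own (assumed) notation, with q_t the state active in frame t, S(t,f) the source STFT, \omega_q the state weights, and \sigma_q^2(f) the state PSDs of a zero-mean complex Gaussian GMM, these statistics and the complete data log-likelihood of (35) take the standard form

    \gamma_q \;=\; \sum_{t} \delta_{q_t, q}, \qquad c_q(f) \;=\; \sum_{t} \delta_{q_t, q}\, \big| S(t,f) \big|^{2},

    \log p(S, q \mid \lambda) \;=\; \sum_{q} \gamma_q \log \omega_q \;-\; \sum_{q} \sum_{f} \Big[ \gamma_q \log\!\big(\pi\, \sigma_q^2(f)\big) \;+\; \frac{c_q(f)}{\sigma_q^2(f)} \Big],

which is indeed linear in the statistics (\gamma_q, c_q(f)).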

2) Conditional Expectations of Natural Statistics for GMMs: The conditional expectations (29) of the natural statistics (32)–(34) are calculated using Algorithm 1. Indeed, (36) is analogous to (9), and (37) can be found in the article by Rose et al. [27]. The proof of (39) is given in (40), using a shorthand notation, and (38) can be proven in a similar way.

Algorithm 1 Calculation of the conditional expectations of the natural statistics for one of the two source models (and similarly for the other)

1) Compute the weights, which sum to one and satisfy

(36)

2) Compute the expected PSD for each state:

(37)

3) Compute the conditional expectation of the state occupancy statistics:

(38)

4) Compute the conditional expectation of the energy statistics:

(39)

(40)
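The following Python sketch is our own reading of Algorithm 1 for one of the two GMMs. The variable names, the zero-mean complex Gaussian state model, and the expected-PSD expression (Wiener-filtered power plus posterior variance) are assumptions consistent with the surrounding text, not a verbatim transcription of (36)–(40).

    import numpy as np

    def expected_statistics(X, sigma2_v, sigma2_m, w_v, w_m):
        """Expected natural statistics for the first GMM (illustrative sketch).

        X        : (T, F) complex mix STFT
        sigma2_v : (Qv, F) state PSDs of the model of interest
        sigma2_m : (Qm, F) state PSDs of the other model
        w_v, w_m : state weights of the two GMMs
        """
        T, F = X.shape
        Qv, Qm = sigma2_v.shape[0], sigma2_m.shape[0]
        E_gamma = np.zeros(Qv)        # expected state counts
        E_c = np.zeros((Qv, F))       # expected per-state energies
        for t in range(T):
            # 1) posterior weights of the state pairs (cf. (36))
            logp = np.empty((Qv, Qm))
            for q in range(Qv):
                for qp in range(Qm):
                    s = sigma2_v[q] + sigma2_m[qp]
                    logp[q, qp] = (np.log(w_v[q] * w_m[qp])
                                   - np.sum(np.log(np.pi * s) + np.abs(X[t]) ** 2 / s))
            gam = np.exp(logp - logp.max())
            gam /= gam.sum()
            for q in range(Qv):
                for qp in range(Qm):
                    # 2) expected source PSD for the pair (q, q') (cf. (37)):
                    #    Wiener-filtered power plus posterior variance
                    wf = sigma2_v[q] / (sigma2_v[q] + sigma2_m[qp])
                    psd = (wf ** 2) * np.abs(X[t]) ** 2 + wf * sigma2_m[qp]
                    # 3)-4) accumulate the conditional expectations (cf. (38), (39))
                    E_gamma[q] += gam[q, qp]
                    E_c[q] += gam[q, qp] * psd
        return E_gamma, E_c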

D. EM for Filters and/or PSD Gains Adaptation

We now have tools to express the EM algorithm in the proposed framework for filter and/or PSD gain adaptation. In order to obtain the reestimation formulas using the EM algorithm (29), (30), one should solve the complete data MAP criteria (31) and express their solutions as functions of the natural statistics.

1) Filter Adaptation: In the case of filter adaptation (19), these MAP criteria reduce to a single criterion, since only one model is adapted:

(41)

Injecting the filter-adapted PSDs into expression (35) and zeroing the derivative with respect to the filter, one can show that the solution of (41) is a simple function of the natural statistics. Then, replacing these statistics by their conditional expectations (39), we obtain the reestimation formula

(42)

where the expectations are calculated using Algorithm 1, conditionally on the current estimates of the two models.
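A plausible explicit form of this update (our own derivation, under the assumptions that filter adaptation multiplies every state PSD \sigma_q^2(f) of the adapted model by |H(f)|^2 and that the prior on the filter is flat; T denotes the number of frames) is

    |H(f)|^{2} \;\leftarrow\; \frac{1}{T} \sum_{q} \frac{\mathbb{E}\big[\, c_q(f) \mid \text{mix},\ \text{current models} \big]}{\sigma_q^{2}(f)}.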

2) PSD Gains Adaptation: The very same reasoning yields the reestimation formula for PSD gains adaptation

(43)

with the expectations calculated using Algorithm 1, conditionally on the current estimates of the two models.

3) Joint Filters and PSD Gains Adaptation: Doing the same developments for the criterion (22), i.e., putting the jointly adapted PSDs into expression (35) and zeroing the derivatives with respect to the filter and to the PSD gains, one can show that the filter is expressed via the PSD gains and vice versa.



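A plausible form of these two coupled relations (our own reconstruction, assuming each adapted state PSD is g_q\, |H(f)|^2\, \sigma_q^2(f), with g_q the PSD gain of state q and \sigma_q^2(f) the nonadapted PSD; T and F denote the numbers of frames and of frequency bins) is

    |H(f)|^{2} \;=\; \frac{1}{T} \sum_{q} \frac{c_q(f)}{g_q\, \sigma_q^{2}(f)}, \qquad g_q \;=\; \frac{1}{F\, \gamma_q} \sum_{f} \frac{c_q(f)}{|H(f)|^{2}\, \sigma_q^{2}(f)}.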

Thus, we decide to look for the solution by alternating between these two expressions, which leads to the reestimation formulas (44)–(47) of Algorithm 2.

Algorithm 2 Joint filters and PSD gains adaptation for the two source models

1) E step: Compute the expectations of the natural statistics for the two models, conditionally on the current model estimates, using Algorithm 1.

2) M step: Update the parameters.
a) Initialize the filters and PSD gains with their current values;
b) Perform maximization steps: for each inner iteration, compute

(44)

(45)

(46)

(47)

c) Set the adapted filters and PSD gains to the values obtained at the last maximization step.
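A minimal Python sketch of the M step alternation for a single source model follows. It is our own illustration: the variable names, the number of inner iterations, and the closed forms used (the plausible expressions given above) are assumptions, not the paper's exact (44)–(47); in Algorithm 2 itself both models are updated jointly.

    import numpy as np

    def m_step_joint(E_gamma, E_c, sigma2, h, g, n_inner=5):
        """Alternate the filter and PSD-gain updates for one source model (sketch).

        E_gamma : (Q,)   expected state counts
        E_c     : (Q, F) expected per-state energies
        sigma2  : (Q, F) nonadapted state PSDs
        h       : (F,)   current squared filter magnitude |H(f)|^2
        g       : (Q,)   current PSD gains
        """
        T = E_gamma.sum()
        Q, F = sigma2.shape
        for _ in range(n_inner):
            # update the filter given the gains
            h = (E_c / (g[:, None] * sigma2)).sum(axis=0) / T
            # update the gains given the filter
            g = (E_c / (h[None, :] * sigma2)).sum(axis=1) / (F * E_gamma)
        return h, g

Alternating the two updates increases the complete data criterion at each pass, which is why a small fixed number of inner iterations is sufficient in practice.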

REFERENCES

[1] D. L. Wang and G. J. Brown, “Separation of speech from interfering sounds based on oscillatory correlation,” IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 684–697, May 1999.

[2] G.-J. Jang and T.-W. Lee, “A maximum likelihood approach to single-channel source separation,” J. Mach. Learning Res., no. 4, pp. 1365–1392, 2003.

[3] M. Reyes-Gomez, D. Ellis, and N. Jojic, “Multiband audio modeling for single-channel acoustic source separation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), May 2004, vol. 5, pp. 641–644.

[4] B. Pearlmutter and A. Zador, “Monaural source separation using spectral cues,” in Proc. 5th Int. Conf. Ind. Compon. Anal. (ICA’04), 2004, pp. 478–485.

[5] S. T. Roweis, “One microphone source separation,” in Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2001, vol. 13, pp. 793–799.

[6] L. Benaroya and F. Bimbot, “Wiener based source separation with HMM/GMM using a single sensor,” in Proc. Int. Conf. Ind. Compon. Anal. Blind Source Separation (ICA’03), Nara, Japan, Apr. 2003, pp. 957–961.

[7] T. Kristjansson, H. Attias, and J. Hershey, “Single microphone source separation using high resolution signal reconstruction,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), 2004, vol. 2, pp. 817–820.

[8] G. Peeters and X. Rodet, “SINOLA: A new analysis/synthesis method using spectrum peak shape distortion, phase and reassigned spectrum,” in Proc. Int. Comput. Music Conf. (ICMC’99), Oct. 1999, pp. 153–156.

[9] S. Rickard and O. Yilmaz, “On the approximate W-disjoint orthogonality of speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’02), Orlando, FL, May 2002, vol. 3, pp. 3049–3052.

[10] N. H. Pontoppidan and M. Dyrholm, “Fast monaural separation of speech,” in Proc. 23rd Conf. Signal Process. Audio Recording Reproduction Audio Eng. Soc. (AES), 2003.

[11] E. Vincent and X. Rodet, “Underdetermined source separation with structured source priors,” in Proc. Int. Conf. Ind. Compon. Anal. Blind Source Separation (ICA’04), Granada, Spain, Sep. 2004, pp. 327–334.

[12] T. Beierholm, B. D. Pedersen, and O. Winther, “Low complexity Bayesian single channel source separation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), 2004, vol. 5, pp. 529–532.

[13] D. Ellis and R. Weiss, “Model-based monaural source separation using a vector-quantized phase-vocoder representation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’06), Toulouse, France, May 2006, vol. 5, pp. 957–960.

[14] S. M. Kay, Fundamentals of Statistical Signal Processing, Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[15] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc., vol. 39, pp. 1–38, 1977.

[16] H. Attias, “New EM algorithms for source separation and deconvolution,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’03), 2003, vol. 5, pp. 297–300.

[17] J. Gauvain and C. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.

[18] C.-H. Lee and Q. Huo, “On adaptive decision rules and decision parameter adaptation for automatic speech recognition,” Proc. IEEE, vol. 88, no. 8, pp. 1241–1269, Aug. 2000.

[19] A. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., no. 10, pp. 19–41, 2000.

[20] K. P. Murphy, “Dynamic Bayesian networks: Representation, inference and learning,” Ph.D. dissertation, Univ. California Berkeley, Berkeley, CA, Jul. 2002.

[21] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Learning in Graphical Models, vol. 37, no. 2, pp. 183–233, 1999.

[22] C. Leggetter and P. Woodland, “Flexible speaker adaptation using maximum likelihood linear regression,” in Proc. ARPA Spoken Lang. Technol. Workshop, 1995, pp. 104–109.

[23] M. Gales, D. Pye, and P. Woodland, “Variance compensation within the MLLR framework for robust speech recognition and speaker adaptation,” in Proc. Int. Conf. Spoken Lang. Process. (ICSLP’96), Philadelphia, PA, 1996, vol. 3, pp. 1832–1835.

[24] K. Shinoda and C.-H. Lee, “Structural MAP speaker adaptation using hierarchical priors,” in Proc. IEEE Workshop Speech Recognition Understanding, Santa Barbara, CA, Dec. 1997, pp. 381–388.

[25] K.-T. Chen, W.-W. Liau, H.-M. Wang, and L.-S. Lee, “Fast speaker adaptation using eigenspace-based maximum likelihood linear regression,” in Proc. Int. Conf. Spoken Lang. Process. (ICSLP’00), Beijing, China, Oct. 2000, pp. 742–745.

[26] L. Benaroya, F. Bimbot, and R. Gribonval, “Audio source separation with a single sensor,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 191–199, Jan. 2006.

[27] R. C. Rose, E. M. Hofstetter, and D. A. Reynolds, “Integrated models of signal and background with application to speaker identification in noise,” IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 245–257, Apr. 1994.

[28] W.-H. Tsai, D. Rogers, and H.-M. Wang, “Blind clustering of popular music recordings based on singer voice characteristics,” Comput. Music J., vol. 28, no. 3, pp. 68–78, 2004.



[29] A. Berenzweig and D. P. W. Ellis, “Locating singing voice segments within music signals,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA’01), 2001, pp. 119–122.

[30] Y. E. Kim and B. Whitman, “Singer identification in popular music recordings using voice coding features,” in Proc. Int. Symp. Music Inf. Retrieval (ISMIR’02), Oct. 2002, pp. 164–169.

[31] T. L. Nwe, A. Shenoy, and Y. Wang, “Singing voice detection in popular music,” in Proc. ACM Multimedia Conf., New York, Oct. 2004, pp. 324–327.

[32] W. H. Tsai and H. M. Wang, “Automatic detection and tracking of target singer in multi-singer music recordings,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’04), Montreal, QC, Canada, 2004, vol. 4, pp. 221–224.

[33] R. Vergin, D. O’Shaughnessy, and A. Farhat, “Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition,” IEEE Trans. Speech Audio Process., vol. 7, no. 5, pp. 525–532, Sep. 1999.

[34] A. Ozerov, P. Philippe, R. Gribonval, and F. Bimbot, “One microphone singing voice separation using source-adapted models,” in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA’05), Mohonk, NY, Oct. 2005, pp. 90–93.

[35] J. A. Fessler and A. O. Hero, “Space-alternating generalized expectation-maximization algorithm,” IEEE Trans. Signal Process., vol. 42, no. 10, pp. 2664–2677, Oct. 1994.

[36] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: Wiley, 1997.

[37] C.-P. Chen, J. Bilmes, and K. Kirchhoff, “Low-resource noise-robust feature post-processing on Aurora 2.0,” in Proc. Int. Conf. Spoken Lang. Process. (ICSLP’02), 2002, pp. 2445–2448.

[38] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, “The DET curve in assessment of detection task performance,” in Proc. Eur. Conf. Speech Commun. Technol. (EuroSpeech’97), 1997, pp. 1895–1898.

[39] S. P. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.

[40] R. Gribonval, L. Benaroya, E. Vincent, and C. Févotte, “Proposals for performance measurement in source separation,” in Proc. Int. Conf. Ind. Compon. Anal. Blind Source Separation (ICA’03), Apr. 2003, pp. 763–768.

[41] S. Vembu and S. Baumann, “Separation of vocals from polyphonic audio recordings,” in Proc. Int. Symp. Music Inf. Retrieval (ISMIR’05), 2005, pp. 337–344.

[42] Y. Li and D. L. Wang, “Singing voice separation from monaural recordings,” in Proc. Int. Symp. Music Inf. Retrieval (ISMIR’06), 2006, pp. 176–179.

Alexey Ozerov received the M.Sc. degree in mathematics from the Saint Petersburg State University, Saint Petersburg, Russia, in 1999, the M.Sc. degree in applied mathematics from the University of Bordeaux 1, Bordeaux, France, in 2003, and the Ph.D. degree in signal processing from the University of Rennes 1, Rennes, France, in 2006.

He was with Orange Labs, Cesson Sévigné, France, and the IRISA, Rennes, from 2003 to 2006 while working towards the Ph.D. degree. From 1999 to 2002, he was an R&D software engineer with Terayon Communication Systems, first in Saint Petersburg and then in Prague, Czech Republic. He is currently a Postdoctoral Researcher in the Sound and Image Processing (SIP) Laboratory, KTH (Royal Institute of Technology), Stockholm, Sweden. His research interests include speech recognition, audio source separation, and source coding.

Pierrick Philippe received the Ph.D. degree in signal processing from the University of Paris, Orsay, France, in 1995.

Before joining Orange Labs, Cesson Sévigné, France, he was with Envivio (2001–2002) and TDF (1997–2001), where his activities were focused on audio signal processing, especially low bit-rate coding. Before that, he spent two years at Innovason, where he developed DSP effects and algorithms for digital mixing desks. He is now a Senior Audio Specialist at Orange Labs, developing audio algorithms especially related to standards. He is an active member of the MPEG audio subgroup. His main interests are signal processing, including low bit-rate audio coding, and sound analysis and processing.

Frédéric Bimbot received the B.A. degree in linguistics from Sorbonne Nouvelle University, Paris, France, in 1987, the telecommunication engineer degree from ENST, Paris, in 1985, and the Ph.D. degree in signal processing in 1988.

In 1990, he joined the French National Center for Scientific Research (CNRS) as a Permanent Researcher. He was with ENST for seven years and then moved to IRISA (CNRS and INRIA), Rennes, France. He also repeatedly visited AT&T Bell Laboratories between 1990 and 1999. He has been involved in several European projects: SPRINT (speech recognition using neural networks), SAM-A (assessment methodology), and DiVAN (audio indexing). He has also been the work-package Manager of research activities on speaker verification in the projects CAVE, PICASSO, and BANCA. His research is focused on audio signal analysis, speech modeling, speaker characterization and verification, speech system assessment methodology, and audio source separation. He is heading the METISS Research Group at IRISA, dedicated to selected topics in speech and audio processing.

Dr. Bimbot was Chairman of the Groupe Francophone de la Communication Parlée (now AFCP) from 1996 to 2000 and, from 1998 to 2003, a member of the Board of the International Speech Communication Association (ISCA), formerly known as ESCA.

Rémi Gribonval graduated from École Normale Supérieure, Paris, France, in 1997. He received the Ph.D. degree in applied mathematics from the University of Paris-IX Dauphine, Paris, France, in 1999.

From 1999 to 2001, he was a Visiting Scholar at the Industrial Mathematics Institute (IMI), Department of Mathematics, University of South Carolina, Columbia. He is currently a Research Associate with the French National Center for Computer Science and Control (INRIA) at IRISA, Rennes, France. His research interests are in adaptive techniques for the representation and classification of audio signals with redundant systems, with a particular emphasis on blind audio source separation.