

SEPARAKE: SOURCE SEPARATION WITH A LITTLE HELP FROM ECHOES

Robin Scheibler†, Diego Di Carlo‡, Antoine Deleforge‡, and Ivan Dokmanić♯

†Tokyo Metropolitan University, Tokyo, Japan
‡Inria Rennes - Bretagne Atlantique, France
♯Coordinated Science Lab, ECE, University of Illinois at Urbana-Champaign

ABSTRACT

It is commonly believed that multipath hurts various audio processing algorithms. At odds with this belief, we show that multipath in fact helps sound source separation, even with very simple propagation models. Unlike most existing methods, we neither ignore the room impulse responses nor attempt to estimate them fully. We rather assume that we know the positions of a few virtual microphones generated by echoes, and we show how this gives us enough spatial diversity to get a performance boost over the anechoic case. We show improvements for two standard algorithms: one that uses only magnitudes of the transfer functions, and one that also uses the phases. Concretely, we show that multichannel non-negative matrix factorization aided with a small number of echoes beats the vanilla variant of the same algorithm, and that with magnitude information only, echoes enable separation where it was previously impossible.

Index Terms— Source separation, echoes, room geometry, NMF, multi-channel.

1. INTRODUCTION

Source separation algorithms can be grouped according to how they deal with sound propagation: those that ignore it [1], those that assume a single anechoic path [2], those that model the room transfer functions (TFs) entirely [3, 4], and those that attempt to separately estimate the contribution of the early echoes and the contribution of the late tail [5]. In this paper we propose yet another route: we assume knowledge of the locations of a few walls relative to the microphone array, which enables us to exploit the associated virtual microphones. This assumption is easy to satisfy in living rooms and conference rooms, but the corresponding model incurs a significant mismatch with respect to the complete reverberation. We show that it nonetheless gives sizable performance boosts while being simple to estimate.

A typical setup is illustrated in Figure 1. We consider $J$ sources emitting from $J$ distinct directions of arrival (DOAs) $\{\theta_j\}_{j=1}^{J}$, and an array of $M$ microphones. The array is placed close to a wall or a corner. There are two reasons why this is useful: first, it renders echoes from the nearby walls significantly stronger than all other echoes; second, it keeps the resulting virtual array (real and virtual microphones) compact. The latter justifies the far-field assumption, which in turn simplifies exposition.

Real and virtual microphones form dipoles with diverse frequency-dependent directivity patterns. Our goal is to design algorithms which benefit from this known spatial diversity.

The research presented in this paper is reproducible. Code and data are available at dokmanic.ece.illinois.edu/separake.zip. Ivan Dokmanić was supported by a Google Faculty Research Award.

[Figure 1 shows the room with sources s1 and s2, microphones r1 and r2, and virtual microphones r11, r12, r21, r22.]

Fig. 1: Typical setup with two speakers recorded by two microphones. The illustration shows the virtual microphone model (grey microphones) with direct sound path (dotted lines) and resulting first-order echoes (colored lines).

Echoes have been used previously to enhance various audio processing tasks. It was shown that they improve indoor beamforming [6, 7, 8], aid in sound source localization [9], and enable low-resource microphone array self-localization [10]. They seem, however, to have been rarely analyzed in the context of source separation with non-negative source models. Since most speech communication takes place indoors, we believe that our findings are relevant for modern applications and voice-based assistants like Amazon Echo or Google Home.

1.1. Our Goal and Main Findings

Our emphasis here is different from that in [5]. Rather than fit the echo model, we aim to show that separation in the presence of echoes is in fact better than separation without echoes. We ask the following questions:

1. Is speech separation with echoes fundamentally easier than speech separation without echoes? Are there specific settings where this is true or false?

2. Is it necessary to fully model the reverberation, or can we get away with a geometric perspective where we know the locations of a few virtual microphones?

To answer these questions we set up several simple experiments. We take two standard, well-understood multi-channel source separation algorithms which estimate the channel (the TFs), and instead of updating the channel estimate we simply furnish the TFs of real and a few virtual microphones. The first algorithm, non-negative matrix factorization (NMF) via multiplicative updates (MU), only uses the magnitudes of the transfer functions, while the second one, expectation maximization (EM), also uses the phases.



In this initial investigation we look at the (over)determined case ($J \le M$); the analysis of the underdetermined case is postponed to a forthcoming journal version of this paper. Our findings can be summarized as follows:

• (MU) With magnitudes only, multi-channel anechoic separation is hardly any better than single-channel separation: as the magnitude of the transfer functions is the same at all microphones, channel modeling offers no diversity. The situation is different in rooms, where the direction-dependent magnitude of the TFs varies significantly from microphone to microphone. We show that replacing the transfer functions with a few echoes (even just one) gives significant performance gains compared to not modeling the TFs at all, but also that it does better than learning the TFs through multiplicative updates.

• (EM) With both phases and magnitudes, anechoic separation will be near-perfect since it corresponds to a determined linear system. Therefore, any uncertainty from imperfections in channel modeling will make things worse. Surprisingly, approximating the TFs with one echo matches learning them through EM updates, and using more echoes outperforms it.

For a sneak peek at the gains, fast forward to Figure 3.

2. MODELING

Suppose $J$ sources emit inside the room and we have $M$ microphones. Each microphone receives

$$y_m(t) = \sum_{j=1}^{J} c_{jm}(t),$$

with $c_{jm}$ being the spatial image of the $j$th source at the $m$th microphone. Spatial images are given as

$$c_{jm}(t) = (x_j * h_{jm})(t),$$

where $h_{jm}$ is the room impulse response between source $j$ and microphone $m$. The room impulse response is a central object in this paper. We model it as

$$h_{jm}(t) = \sum_{k=0}^{K} \alpha_{jm}^{k}\, \delta(t - t_{jm}^{k}) + e_{jm}(t),$$

where the sum comprises the line-of-sight propagation and the earliest $K$ echoes we want to account for (at most 6 in this paper), while the error term $e_{jm}(t)$ collects later echoes and the tail of the reverberation. We do not assume $e_{jm}(t)$ to be known. We assume that the sources are in the far field of real and virtual microphones, so the times $t_{jm}^{k}$ depend only on the source DOAs, which we assume are known. Assuming $K$ echoes per source are known, we can form an approximate TF from source $j$ to microphone $m$,

$$H_{jm}(e^{j\omega}) = \sum_{k=0}^{K} \alpha_{jm}^{k}\, e^{-i\omega t_{jm}^{k}}. \tag{1}$$

The far-field assumption implies that only the relative arrival times are known, so we can arbitrarily fix the delay of the direct path to zero. In addition, we assume all walls to be spectrally flat in the frequency range of interest and set all $\alpha_{jm}^{k}$ to the same constant.
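As a quick illustration of (1), the following minimal NumPy sketch evaluates the approximate TF on a frequency grid; the function name, the attenuation value, and the 3 ms echo delay are our own placeholder choices, not values from the experiments.

```python
import numpy as np

def approximate_tf(echo_times, omega, alpha=1.0):
    """Approximate TF of Eq. (1): sum of K+1 attenuated, delayed impulses.

    echo_times : (K+1,) relative arrival times in seconds, direct path at 0
    omega      : (F,) angular frequencies in rad/s
    alpha      : common attenuation (walls assumed spectrally flat)
    """
    # H[f] = sum_k alpha * exp(-1j * omega[f] * t_k)
    return (alpha * np.exp(-1j * np.outer(omega, echo_times))).sum(axis=1)

# Direct path plus one first-order echo arriving 3 ms later (placeholder).
omega = 2 * np.pi * np.linspace(0, 8000, 1025)  # bins up to Nyquist at 16 kHz
H = approximate_tf(np.array([0.0, 0.003]), omega)
```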

As usual, we process by frames. In the short-time Fourier transform (STFT) domain the $m$th microphone signal reads

$$Y_m[f, n] = \sum_{j=1}^{J} H_{jm}[f]\, X_j[f, n] + B_m[f, n] \tag{2}$$

with $f$ and $n$ being the frequency and frame index, $X_j[f, n]$ the STFT of the $j$th source signal, and $B_m[f, n]$ a term including noise and model mismatch. It is convenient to group the microphone observations in vector form,

$$\mathbf{Y}[f, n] = \mathbf{H}[f]\, \mathbf{X}[f, n] + \mathbf{B}[f, n], \tag{3}$$

where $\mathbf{Y}[f, n] = [Y_m[f, n]]_m$, $\mathbf{H}[f] = [H_{jm}[f]]_{m,j}$, $\mathbf{X}[f, n] = [X_j[f, n]]_j$, and $\mathbf{B}[f, n] = [B_m[f, n]]_m$. Let the squared magnitude of the spectrogram of the $j$th source be $\mathbf{P}_j = [\,|X_j[f, n]|^2\,]_{f,n}$. We postulate a non-negative factor model for $\mathbf{P}_j$:

$$\mathbf{P}_j = \mathbf{D}_j \mathbf{Z}_j, \tag{4}$$

where $\mathbf{D}_j$ is the non-negative dictionary, and the latent variables $\mathbf{Z}_j$ are called activations. Source separation can then be cast as an inference problem in which we maximize the likelihood of the observed $\mathbf{Y}$ over all possible non-negative factorizations (4). This normally involves learning the channel (frequency-domain mixing matrices). Instead of learning, we fix the channel to the earliest few echoes.
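To make the dimensions concrete, here is a toy NumPy sketch of the narrowband mixing model (3), applied independently in every frequency bin; all sizes and the random placeholders for H, X, and B are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, J, M = 1025, 100, 2, 3  # frequency bins, frames, sources, microphones

# Placeholder channel, source, and noise STFTs (complex-valued).
H = rng.standard_normal((F, M, J)) + 1j * rng.standard_normal((F, M, J))
X = rng.standard_normal((F, J, N)) + 1j * rng.standard_normal((F, J, N))
B = 0.01 * (rng.standard_normal((F, M, N)) + 1j * rng.standard_normal((F, M, N)))

# Eq. (3) in every bin f: Y[f] = H[f] @ X[f] + B[f].
Y = np.einsum('fmj,fjn->fmn', H, X) + B
```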

3. SOURCE SEPARATION BY NMF

To evaluate the usefulness of echoes in source separation, we modify the multi-channel NMF framework of Ozerov and Fevotte [3] as follows. First, we introduce a dictionary learned from available training data. We explore both speaker-specific and universal dictionaries [11]. Speaker-specific dictionaries can be beneficial when speakers are known in advance. A universal dictionary is more versatile but gives a weaker regularization prior. Second, rather than learning the TFs from the data, we use the approximate model of (1). In the following we briefly describe the two algorithms used.

3.1. NMF using Multiplicative Updates (MU-NMF)

Multiplicative updates for NMF only involve the magnitudes and are simpler than the EM updates. They were originally proposed by Lee and Seung [12]. We use the Itakura-Saito divergence [13] between the observed multi-channel squared magnitude spectra $\mathbf{V}_m = [\,|Y_m[f, n]|^2\,]_{f,n}$ and their non-negative factorizations,

$$\hat{\mathbf{V}}_m = \sum_{j=1}^{J} \operatorname{diag}(\mathbf{Q}_{jm})\, \mathbf{D}_j \mathbf{Z}_j, \quad m = 1, \ldots, M, \tag{5}$$

where $\mathbf{Q}_{jm} = [\,|H_{jm}[f]|^2\,]_f$ is the vector of squared magnitudes of the approximate TF between microphone $m$ and source $j$. We add an $\ell_1$-penalty term to promote sparsity in the activations due to the potentially large size of the universal dictionary [11]. The cost function is thus

$$C_{\text{MU}}(\mathbf{Z}_j) = \sum_{m,f,n} d_{IS}\big(V_m[f, n] \,\big|\, \hat{V}_m[f, n]\big) + \gamma \sum_j \|\mathbf{Z}_j\|_1, \tag{6}$$

where $d_{IS}(v \,|\, \hat{v}) = \frac{v}{\hat{v}} - \log \frac{v}{\hat{v}} - 1$.
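A direct transcription of the cost (6) in NumPy might look as follows; the helper names and the small epsilon guarding against division by zero are our additions.

```python
import numpy as np

def d_is(v, v_hat, eps=1e-10):
    """Element-wise Itakura-Saito divergence d_IS(v | v_hat)."""
    r = (v + eps) / (v_hat + eps)
    return r - np.log(r) - 1.0

def cost_mu(V, V_hat, Z, gamma):
    """Eq. (6): V, V_hat are (M, F, N) observed and modeled squared
    magnitudes; Z is a list of per-source activation matrices Z_j."""
    fit = d_is(V, V_hat).sum()  # sum over microphones, bins, and frames
    return fit + gamma * sum(np.abs(Zj).sum() for Zj in Z)
```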

Page 3: arXiv:1711.06805v1 [cs.SD] 18 Nov 2017]Coordinated Science Lab, ECE, University of Illinois at Urbana-Champaign ABSTRACT It is commonly believed that multipath hurts various audio

By adapting the original MU rule derivations from Ozerov and Fevotte, we obtain the following regularized MU update rule:

$$\mathbf{Z}_j \leftarrow \mathbf{Z}_j \odot \frac{\sum_m \big(\operatorname{diag}(\mathbf{Q}_{jm})\mathbf{D}_j\big)^{\top} \big(\mathbf{V}_m \odot \hat{\mathbf{V}}_m^{-2}\big)}{\sum_m \big(\operatorname{diag}(\mathbf{Q}_{jm})\mathbf{D}_j\big)^{\top} \hat{\mathbf{V}}_m^{-1} + \gamma}, \tag{7}$$

where the multiplication $\odot$, power, and division are element-wise. Importantly, neglecting the reverberation (or working in the anechoic regime) leads to a constant $\mathbf{Q}_{jm}$ for all $j$ and $m$. A consequence is that the MU-NMF framework breaks down with a universal dictionary. Indeed, (5) becomes the same for all $m$: $\hat{\mathbf{V}}_m = \sum_j \mathbf{D}\mathbf{Z}_j = \mathbf{D} \sum_j \mathbf{Z}_j$, so even with the correct atoms chosen, we can assign them to any source without changing $\hat{\mathbf{V}}_m$, and hence the cost. Therefore, anechoic multi-channel separation with a universal dictionary cannot work well. This intuitive reasoning is corroborated by numerical experiments in Section 4.3. The problem is overcome by the EM-NMF algorithm, which keeps the channel phase and is thus able to exploit the phase diversity across the array. Of course, in line with the message of this paper, it is also overcome by using echoes.
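For concreteness, here is a minimal sketch of the update (7) for one source $j$, assuming for brevity that the model $\hat{\mathbf{V}}_m$ is formed from source $j$ alone (a complete implementation would sum the contributions of all sources); all names are ours.

```python
import numpy as np

def mu_update(Z, D, Q, V, gamma, eps=1e-10):
    """One multiplicative update of Z_j as in Eq. (7).

    Z : (K, N) activations of source j    D : (F, K) dictionary of source j
    Q : (M, F) squared-magnitude TFs      V : (M, F, N) observed |Y_m|^2
    """
    num = np.zeros_like(Z)
    den = np.full_like(Z, gamma)
    for m in range(V.shape[0]):
        W = Q[m][:, None] * D             # diag(Q_jm) D_j, shape (F, K)
        V_hat = W @ Z + eps               # model at microphone m (source j only)
        num += W.T @ (V[m] * V_hat**-2)   # numerator sum of Eq. (7)
        den += W.T @ V_hat**-1            # denominator sum of Eq. (7)
    return Z * num / den                  # all operations element-wise
```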

3.2. NMF using Expectation Maximization (EM-NMF)

Unlike the MU algorithm, which independently maximizes the log-likelihood of the TF magnitudes, EM-NMF maximizes the joint log-likelihood over all complex-valued channels [3]. Hence, it takes into account the observed phases. Each source $j$ is modeled as the sum of components with complex Gaussian priors of the form $c_k[f, n] \sim \mathcal{CN}(0, d_{fk} z_{kn})$, such that

$$X_j[f, n] \sim \mathcal{CN}\big(0, (\mathbf{D}_j \mathbf{Z}_j)_{fn}\big), \tag{8}$$

and the magnitude spectrum $\mathbf{P}_j$ of (4) can be understood as the variance of source $j$. Under this model, and assuming uncorrelated noise, the microphone signals also follow a complex Gaussian distribution with covariance matrix

$$\boldsymbol{\Sigma}_{\mathbf{Y}}[f, n] = \mathbf{H}[f]\, \boldsymbol{\Sigma}_{\mathbf{X}}[f, n]\, \mathbf{H}^{H}[f] + \boldsymbol{\Sigma}_{\mathbf{B}}[f, n], \tag{9}$$

and the negative log-likelihood of the observed signal is

$$C_{\text{EM}}(\mathbf{Z}_j) = \sum_{f,n} \operatorname{trace}\big(\mathbf{Y}[f, n]\mathbf{Y}^{H}[f, n]\, \boldsymbol{\Sigma}_{\mathbf{Y}}^{-1}[f, n]\big) + \log \det \boldsymbol{\Sigma}_{\mathbf{Y}}[f, n].$$

This quantity can be efficiently minimized using the EM algorithm proposed in [3]. We modify the original algorithm by fixing the source dictionaries $\mathbf{D}_j$ and the early-echo channel model $\mathbf{H}[f]$ throughout the iterations.
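A per-bin sketch of the objective under the model (8)-(9) could read as follows; this is our illustration of the cost being minimized, not the EM updates of [3].

```python
import numpy as np

def em_cost_bin(Y, H, Sigma_X, Sigma_B):
    """Negative log-likelihood contribution of one (f, n) bin, Eqs. (8)-(9).

    Y : (M,) observed STFT vector       H : (M, J) fixed early-echo channel
    Sigma_X : (J, J) diag((D_j Z_j)_fn) Sigma_B : (M, M) noise covariance
    """
    Sigma_Y = H @ Sigma_X @ H.conj().T + Sigma_B  # Eq. (9)
    # trace(Y Y^H Sigma_Y^{-1}) reduces to Y^H Sigma_Y^{-1} Y for one vector
    quad = np.real(Y.conj() @ np.linalg.solve(Sigma_Y, Y))
    logdet = np.linalg.slogdet(Sigma_Y)[1]        # log det of Hermitian PSD
    return float(quad + logdet)
```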

4. NUMERICAL EXPERIMENTS

We test our hypotheses through computer simulations. In the following, we describe the simulation setup and the dictionary learning protocols, and we discuss the results.

4.1. Setup

An array of three microphones arranged on the corners of an equilateral triangle with edge length 0.3 m is placed in the corner of a 3D room with 7 walls. We select 40 sources at random locations at a distance ranging from 2.5 m to 4 m from the microphone array. Pairs of sources are chosen so that they are at least 1 m apart. The floor plan and the locations of the microphones are depicted in Figure 2.

[Figure 2: left panel, a typical simulated RIR with time axis 0 to 0.15 s; right panel, the floor plan with dimension labels 6 m, 3 m, 4 m, 5 m, 2.5 m, and 4 m.]

Fig. 2: On the left, a typical simulated RIR. On the right, the simulated scenario.

        learn   anechoic   K=0   K=1     K=2   K=3   K=4   K=5   K=6
γ       10^-1   10         10    10^-3   0     0     0     0     0

Table 1: Value of the regularization parameter γ used with the universal dictionary.

The scenario is repeated for every pair of two active sources out of the 780 possible pairs.

The sound propagation between sources and microphones is simulated using the image source model implemented in the pyroomacoustics Python package [14]. The wall absorption factor is set to 0.4, leading to a T60 of approximately 100 ms. An example RIR is shown in Figure 2. The sampling frequency is set to 16 kHz, the STFT frame size to 2048 samples with 50% overlap between frames, and we use a cosine window for analysis and synthesis. Partial TFs are then built from the K nearest image microphones. The global delay is discarded.
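A simulation of this kind can be set up along the following lines with pyroomacoustics; the corner coordinates and microphone positions below are placeholders rather than the exact geometry of Figure 2, and recent package versions replace the legacy absorption keyword with materials=pra.Material(0.4).

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
# Placeholder 7-wall floor plan (2D corners), extruded to a 3D room.
corners = np.array([[0, 0], [6, 0], [6, 3], [4, 5], [2, 5], [0, 4], [0, 2]]).T
room = pra.Room.from_corners(corners, fs=fs, max_order=6, absorption=0.4)
room.extrude(3.0)

# Equilateral triangle of microphones with 0.3 m edges, near a corner.
mics = np.array([[0.40, 0.40, 1.5], [0.70, 0.40, 1.5], [0.55, 0.66, 1.5]]).T
room.add_microphone_array(pra.MicrophoneArray(mics, fs))
room.add_source([3.0, 2.0, 1.5])  # one source; the experiments activate pairs

room.compute_rir()  # image source model RIRs, as described in Sec. 4.1
```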

With this setup, we perform three different experiments. In the first one, we evaluate MU-NMF with a universal dictionary. In the other two, we evaluate the performance of MU-NMF and EM-NMF with speaker-specific dictionaries. We vary K from 1 to 6 and use three baseline scenarios:

1. learn: The TFs are learned from the data along with the activations, as originally proposed [3].

2. anechoic: Anechoic conditions, with neither echoes nor model mismatch.

3. no echoes: Reverberation is present but ignored (i.e., K = 0).

With the universal dictionary, the large number of latent variables warrants the introduction of sparsity-inducing regularization. The value of the regularization parameter γ was chosen by a grid search on a holdout set with the signal-to-distortion ratio (SDR) as the figure of merit [15] (Table 1).

4.2. Dictionary Training, Test Set, and Implementation

Universal Dictionary: Following the methodology of [11], we select 25 male and 25 female speakers and use all available training sentences to form the universal dictionary $\mathbf{D} = [\mathbf{D}_1^{M} \cdots \mathbf{D}_{25}^{M}\; \mathbf{D}_1^{F} \cdots \mathbf{D}_{25}^{F}]$. The test signals were selected from speakers and utterances outside the training set. The number of latent variables per speaker is 10, so that with an STFT frame size of 2048 we have $\mathbf{D} \in \mathbb{R}^{1025 \times 500}$.

Speaker-Specific Dictionary: Two dictionaries were trained on one male and one female speaker. One utterance per speaker was excluded to be used for testing. The number of latent variables per speaker was set to 20.

All dictionaries were trained on samples from the TIMIT corpus [16] using the NMF solver in the scikit-learn Python package [17].
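A speaker dictionary can be learned along these lines with scikit-learn; the helper name and the epsilon added to avoid exact zeros (the Itakura-Saito loss requires strictly positive entries) are our own choices.

```python
import numpy as np
from sklearn.decomposition import NMF

def train_dictionary(P, n_atoms=20):
    """Learn a dictionary D from squared-magnitude STFT frames P, shape (F, N)."""
    model = NMF(n_components=n_atoms, beta_loss='itakura-saito', solver='mu',
                init='random', max_iter=500, random_state=0)
    D = model.fit_transform(P + 1e-10)  # P ~ D @ Z with Z = model.components_
    return D

# A universal dictionary stacks per-speaker dictionaries column-wise, e.g.:
# D = np.hstack([train_dictionary(P_s, n_atoms=10) for P_s in training_spectra])
```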


[Figure 3: three panels showing SDR (roughly 0 to 6 dB) and SIR (roughly 0 to 12 dB) for female and male speakers against learn, anechoic, and 0 to 6 echoes: (a) MU-NMF, universal dictionary; (b) MU-NMF, speaker-specific dictionary; (c) EM-NMF, speaker-specific dictionary.]

Fig. 3: Distribution of SDR and SIR for male and female speakers as a function of the number of echoes included in modeling, and comparison with the three baselines.

[Figure 4: median SDR (left, with an 11.7 dB SDR annotation) and median SIR (right, with a 12.9 dB SIR annotation) in dB against learn, anechoic, and 0 to 6 echoes, for MU-NMF with the universal dictionary, MU-NMF speaker-dependent, and EM-NMF speaker-dependent.]

Fig. 4: Summary of the median SDR and SIR for the different algorithms evaluated.

Implementation: The authors of [3] provide a Matlab implementation of the MU-NMF and EM-NMF methods for stereo separation. We ported their code to Python and extended it to an arbitrary number of input channels.1 The number of iterations for MU-NMF (EM-NMF) was set to 200 (300), and simulated annealing in the EM-NMF implementation was disabled.

4.3. Results

We evaluate the performance in terms of signal-to-distortion ratio (SDR) and source-to-interference ratio (SIR) as defined in [15]. We compute these metrics using the mir_eval toolbox [18].
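The metrics can be computed along these lines; the placeholder signals are ours, while bss_eval_sources is the mir_eval routine returning SDR, SIR, SAR, and the best permutation of estimates to references.

```python
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 16000))                    # true sources
estimated = reference + 0.1 * rng.standard_normal((2, 16000))  # noisy estimates

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimated)
```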

The distributions of SDR and SIR resulting from separation using MU-NMF and a universal dictionary are shown in Figure 3a, with a summary in Figure 4. We use the median performance to compare the results from different algorithms. First, we confirm that separation fails for flat TFs (anechoic and K = 0), with SIR around 0 dB. Learning the TFs performs somewhat better in terms of SIR than in terms of SDR, though both are low.

1 Our implementation and all experimental code are publicly available, in line with the philosophy of reproducible research.

Introducing approximate TFs dramatically improves performance: we outperform the learned approach even with a single echo. With up to six echoes, the gains are +2 dB SDR and +5 dB SIR. Interestingly, with more than one echo, $\ell_1$ regularization becomes unnecessary; non-negativity and echo priors are sufficient to separate the sources.

Separation with speaker-dependent dictionaries is less challenging since we have a stronger prior. Accordingly, as shown in Figures 3b and 4, MU-NMF now achieves some separation even without the channel information. The gains from using echoes are smaller, though one echo is still sufficient to match the median performance of learned TFs. Using an echo, however, results in a smaller variance. This was surprising, at least to the authors of this paper. Adding more echoes further improves SDR (SIR) by up to +2 dB (+3 dB).

In the same scenario, EM-NMF (Figure 3c) has near-perfect performance on anechoic signals, which is expected as the problem is overdetermined. As for MU, a single echo suffices to reach the performance of learned TFs, and more echoes further improve it. Moreover, echoes significantly improve separation power, as illustrated by up to a 3 dB improvement over learn. It is interesting to note that in all experiments the first three echoes almost saturate the metrics. This is good news, since higher-order echoes are hard to estimate.

5. CONCLUSION

In this paper we began studying the role of early echoes in computational auditory scene analysis, in particular in source separation. We found that a simple echo model not only improves performance, but also enables separation in conditions where it is not normally possible, for example with certain non-negative speaker-independent models. Echoes seem to play an essential role in magnitude-only algorithms like non-negative matrix factorization via multiplicative updates. They improve separation as measured by standard metrics, even when compared to approaches that learn the transfer functions. We believe these results are only a first step in understanding the potential of echoes in computational auditory scene analysis. They suggest that the simple models used in this paper could serve as regularizers in other common audio processing tasks. Ongoing work includes running real experiments, studying the underdetermined case, and blindly estimating the wall parameters.


6. REFERENCES

[1] J. Le Roux, J. R. Hershey, and F. Weninger, "Deep NMF for speech separation," in IEEE ICASSP. IEEE, 2015, pp. 66–70.

[2] S. Rickard, "The DUET blind source separation algorithm," Blind Speech Separation, pp. 217–241, 2007.

[3] A. Ozerov and C. Fevotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 550–563, 2010.

[4] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel audio source separation with deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1652–1664, 2016.

[5] S. Leglaive, R. Badeau, and G. Richard, "Multichannel audio source separation with probabilistic reverberation modeling," in IEEE WASPAA. IEEE, 2015, pp. 1–5.

[6] I. Dokmanic, R. Scheibler, and M. Vetterli, "Raking the Cocktail Party," IEEE J. Sel. Top. Signal Process., vol. PP, no. 99, pp. 1–12, 2015.

[7] R. Scheibler, I. Dokmanic, and M. Vetterli, "Raking Echoes in the Time Domain," in IEEE ICASSP, Brisbane, 2015.

[8] R. Scheibler, "Rake, Peel, Sketch," Ph.D. dissertation, IC, Lausanne, 2017.

[9] F. Ribeiro, D. E. Ba, and C. Zhang, "Turning Enemies Into Friends: Using Reflections to Improve Sound Source Localization," ICME, 2010.

[10] I. Dokmanic, L. Daudet, and M. Vetterli, "From acoustic room reconstruction to SLAM," in IEEE ICASSP. IEEE, 2016, pp. 6345–6349.

[11] D. L. Sun and G. J. Mysore, "Universal speech models for speaker independent single channel source separation," IEEE ICASSP, pp. 141–145, 2013.

[12] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds. MIT Press, 2001, pp. 556–562. [Online]. Available: http://papers.nips.cc/paper/1861-algorithms-for-non-negative-matrix-factorization.pdf

[13] C. Fevotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.

[14] R. Scheibler, E. Bezzam, and I. Dokmanic, "Pyroomacoustics: A Python package for audio room simulations and array processing algorithms," arXiv preprint arXiv:1710.04196, 2017.

[15] E. Vincent, H. Sawada, P. Bofill, S. Makino, and J. P. Rosca, "First stereo audio source separation evaluation campaign: data, algorithms and results," in International Conference on Independent Component Analysis and Signal Separation. Springer, 2007, pp. 552–559.

[16] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, vol. 10, no. 5, 1993.

[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[18] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), 2014.