SIP (2018), vol. 7, e5, page 1 of 12 © The Authors, 2018. This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work. doi:10.1017/ATSIP.2018.7

original paper

Noise masking method based on an effective ratio mask estimation in Gammatone channels

feng bao and waleed h. abdulla

In computational auditory scene analysis, the accurate estimation of the binary mask or ratio mask plays a key role in noise masking. An inaccurate estimation often leads to artifacts and temporal discontinuity in the synthesized speech. To overcome this problem, we propose a new ratio mask estimation method in terms of Wiener filtering in each Gammatone channel. In the reconstruction of the Wiener filter, we utilize the relationship of the speech and noise power spectra in each Gammatone channel to build the objective function for the convex optimization of the speech power. To improve the accuracy of estimation, the estimated ratio mask is further modified based on its adjacent time–frequency units, and then smoothed by interpolating with the estimated binary masks. The objective tests, including signal-to-noise ratio improvement, spectral distortion, and intelligibility, and a subjective listening test demonstrate the superiority of the proposed method compared with the reference methods.

Keywords: CASA, Noise masking, Ratio mask estimation, Convex optimization

Received 1 November 2017; Revised 6 April 2018

I. INTRODUCTION

Speech enhancement is a major topic in the speech signal-processing area, in which noise reduction and noise masking are often of concern. They aim to remove or mask a certain amount of background noise from noisy speech so that the enhanced speech has better quality and higher intelligibility. Low speech intelligibility in background noise remains a major source of discomfort and hearing fatigue for listeners. Although state-of-the-art monaural speech enhancement algorithms have certainly achieved appreciable noise suppression and improved speech quality, it is still a challenge for them, thus far, to improve the intelligibility of noise-degraded speech.

Monaural speech enhancement, i.e., speech enhancement from single-microphone recordings, is particularly challenging due to an infinite number of solutions. From the application point of view, monaural speech enhancement is perhaps most desirable compared with multi-microphone solutions, since a monaural system is less sensitive to room reverberation and spatial source configuration. In the past several decades, many monaural speech enhancement methods have been proposed, such as spectral subtraction [1, 2], Wiener filtering [3], and statistical-model-based methods [4] operating in the frequency domain. These

Department of Electrical and Computer Engineering, University of Auckland, Auckland 1010, New Zealand

Corresponding author: F. Bao, Email: [email protected]

approaches are more suitable for handling stationary noise (e.g., white or car noise) than non-stationary noise (e.g., babble and street noises). Musical noise and a vacuum feeling are usually caused by these typical methods. In the face of these problems, a very effective de-noising method [5] was proposed with a Generalized Gamma Prior (GammaPrior), which was extended from the minimum mean-square error (MMSE) estimation of the discrete Fourier transform (DFT) magnitude. In this method, two classes of generalized Gamma distributions were adopted for the complex-valued DFT coefficients. Better performance on subjective and objective tests was achieved compared with the typical Wiener filter and statistical-model-based methods. Zoghlami proposed an enhancement approach [6] for noise reduction based on non-uniform multi-band analysis. The noisy signal spectral band is divided into subbands using a gammatone filterbank, and the sub-band signals were individually weighted according to the power spectral subtraction technique and Ephraim and Malah's spectral attenuation algorithm. This method is a kind of sub-band spectral subtraction or Wiener filter method.

However, the difficulties of non-stationary noise are still not solved very well by the above methods. Thus, the baseline Hidden Markov Model (HMM) [7] and codebook-driven [8] methods, with a priori information (i.e., spectral envelopes or spectral shapes) about speech and noise, were proposed to overcome the situation of non-stationary noise. Some revised methods based on codebooks and HMMs were proposed in recent years. For example, the Sparse Autoregressive Hidden Markov Model (SARHMM) method [9] modeled linear prediction gains of speech and

Downloaded from https://www.cambridge.org/core. IP address: 54.39.106.173, on 03 Mar 2020 at 10:13:17, subject to the Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/ATSIP.2018.7


noise in non-stationary noise environments. The likelihood criterion was adopted to find the model parameters, combined with a priori information of speech and noise spectral shapes via a small number of HMM states. The sparsity of speech and noise modeling helped to improve the tracking performance of both the spectral shape and power level of non-stationary noise. Another very recent codebook-based approach with speech-presence probability (SPP) [10] also achieved a very good result on speech enhancement in non-stationary noise environments. This kind of codebook-based method with SPP (CBSPP) utilized a Markov process to model the correlation between adjacent code-vectors in the codebook for optimizing a Bayesian MMSE estimator. The correlation between adjacent linear prediction (LP) gains was also fully considered during the parameter estimation procedure. Through the introduction of SPP into the codebook-constrained Wiener filter, the proposed Wiener filter achieved the goal of noise reduction.

The aforementioned methods are focused on noise reduction. These noise reduction methods often apply a gain function to each time–frequency (T–F) bin to suppress the noise based on a T–F representation of the noisy speech. The usage of the gain function over all T–F bins can be considered as an attenuation of the noise magnitude in each T–F bin. These methods generally derive their gains as a function of the short-time signal-to-noise ratio (SNR) in the respective T–F bin; that is, the speech and noise powers at each T–F bin need to be estimated. With respect to noise masking, Computational Auditory Scene Analysis (CASA) [11, 12] is considered an effective approach. By incorporating an auditory perception model (i.e., a Gammatone filterbank), it can mask the noise based on an estimation of the binary mask in each T–F unit, which comprises a different number of the T–F bins concerned in noise reduction methods. Because the binary mask only takes the value 0 in a unit dominated by noise or the value 1 in a unit dominated by speech, these CASA methods based on the binary mask often wrongly remove weak T–F units dominated by speech along with the background noise, and seriously affect the hearing quality. A good solution to this shortcoming of the binary mask is the ideal ratio mask (IRM) [13, 14], in which a priori knowledge of speech and noise is known in advance in the derivation of the mask. The IRM can be considered a soft decision whose mask values vary continuously from 0 to 1, instead of the hard decision of 0 or 1 derived from the ideal binary mask (IBM) [15, 16]. The IRM is more reasonable for handling situations where the speech energy is larger or smaller than the noise energy in each T–F unit. These ratio mask estimators [13, 14] operate in the DFT domain. They are a kind of Wiener filtering solution in which the transfer function of the Wiener filter is obtained by estimating the speech and noise powers.

Because of the huge and complicated training process for a priori speech and noise information, noise reduction methods based on HMMs and codebooks, or deep-learning-based noise ratio masking methods, may have some limitations in solving practical issues. Particularly worth mentioning is that the noise types and a priori information are not easy, and often unrealistic, to predict in advance.

Therefore, we propose a noise-masking method without any a priori information or assumptions about the speech and noise signals. The proposed method can better mimic the hearing perception properties of human beings to improve the intelligibility of the enhanced speech. A soft decision factor, the ratio mask, is used to resynthesize the speech signal so that it avoids the wrong elimination of weak speech T–F units caused by the binary mask. Furthermore, the ratio mask in our proposed method is a kind of ratio between the estimated speech and noise powers. Considering its suitability for solving the minimization problem, the speech power is estimated by convex optimization [17, 18] in the proposed method. To further compensate the powers of weak speech components, the estimated ratio mask is modified and interpolated to recover parts of the speech components.

The remainder of this paper is organized as follows. In Section II, we present the overall principle of the proposed method. The performance evaluation is described in Section III, and Section IV provides the conclusions.

II. THE PRINCIPLE OF THE PROPOSED METHOD

Figure 1 describes the main block diagram of the proposed noise-masking method. First, the input noisy speech is decomposed into 128 channels by using a Gammatone filterbank [19, 20]. Then, the signal of each channel is windowed in the time domain and a fast Fourier transform (FFT) is applied to this windowed signal to obtain the power spectrum of the noisy speech. The feature extraction module is utilized to

Fig. 1. The block diagram of the proposed method.



calculate the noise power spectrum by Minima Controlled Recursive Averaging (MCRA) [21] and the normalized cross-correlation coefficient (NCCC) [22] between the spectra of the noisy speech and the noise. The NCCC is used to represent the proportion of noise power in the noisy speech, which contributes to the convex optimization of the speech power. The objective function used for convex optimization is built from the noisy speech power and the NCCC in each channel, and minimized by the gradient descent method [23]. The speech power is estimated by minimizing this objective function. After that, the powers of the estimated speech and noise are used to estimate the ratio mask. This ratio mask is then modified based on the adjacent T–F units and further smoothed by interpolating with the estimated binary mask to increase the accuracy of the ratio mask estimation. Finally, the enhanced speech is resynthesized from the smoothed ratio masks [24].

Based on the above block diagram, we mainly describe four key parts. In Section II-A, the speech synthesis mechanism will be described, which is the basic orientation of the proposed work. The estimation methods of the ratio mask and binary mask concerned in speech synthesis are given in Sections II-B and II-C, respectively. The speech power estimation, closely related to the ratio mask and binary mask estimation, is derived in Section II-D.

A) Speech synthesis mechanism

In the proposed method, the enhanced signal is resynthesized in the time domain based on the CASA model [24], which is different from frequency-domain synthesis methods such as the Wiener filter method. In the last stage of the speech enhancement system, the target speech is synthesized by means of the estimated ratio mask and the filter responses of the noisy speech signal in each Gammatone channel.

The Gammatone filter response of an arbitrary channel, Gc[y(t)], is

$$G_c[y(t)] = y(t) \ast g_c(t), \qquad (1)$$

where c is the Gammatone channel index and t is the time index. The symbol ∗ denotes convolution with the Gammatone filter [19]. gc(t) is the Gammatone filter impulse response [20], described as:

$$g_c(t) = \begin{cases} t^{a-1}\exp(-2\pi b t)\cos(2\pi f_c t), & t \ge 0 \\ 0, & \text{else} \end{cases} \qquad (2)$$

where a = 4 is the order of the filter and b is the equivalent rectangular bandwidth, which increases as the center frequency fc increases.
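As a concrete sketch, the impulse response of equation (2) can be generated as below. The paper does not fix b numerically, so this sketch assumes the common ERB-based choice b = 1.019 · 24.7 · (4.37 fc/1000 + 1); the helper name `gammatone_ir` is ours.

```python
import numpy as np

def gammatone_ir(fc, fs, a=4, duration=0.064):
    """Sketch of Eq. (2): g_c(t) = t^(a-1) exp(-2*pi*b*t) cos(2*pi*fc*t), t >= 0.

    The bandwidth b is assumed to follow the standard ERB rule, since the
    text does not give a numerical formula for it.
    """
    t = np.arange(0.0, duration, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # equivalent rectangular bandwidth (Hz)
    b = 1.019 * erb                          # grows with fc, as stated in the text
    g = t ** (a - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))             # peak-normalize for display

ir = gammatone_ir(fc=636.0, fs=8000)
```

A 64 ms response at 8 kHz is ample here, since the exponential envelope decays well before the window ends.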

Then, the first filtering response of the noisy speech signal, Gc[y(t)], is reversed in the time domain and filtered by the Gammatone filter again, yielding Fc[y(t)]:

$$F_c[y(t)] = \overline{G_c[y(t)]} \ast g_c(t), \qquad (3)$$

where $\overline{G_c[y(t)]}$ represents the time-reverse operation on Gc[y(t)].

These two time-reverse operations eliminate the phase difference between the filter outputs of the Gammatone channels. The phase-corrected output from each filter channel is then divided into time frames by a raised cosine window for the overlap-and-add. Figure 2 shows the block diagram of the speech synthesis mechanism.

The signal magnitude in each channel is then weighted by the corresponding ratio mask value at that time instant. The weighted filter responses are then summed across all Gammatone channels to yield a reconstructed speech waveform as follows:

$$\hat{x}(t) = \sum_{c=1}^{C} M(c,t) \cdot \overline{F_c[y(t)]} \cdot W(c,t), \qquad (4)$$

where c and C are the Gammatone channel index and the number of channels, respectively. M(c, t) is the estimated ratio mask, $\overline{F_c[y(t)]}$ is the time-reversed signal of Fc[y(t)], W(c, t) is a raised cosine window, and $\hat{x}(t)$ is the resynthesized speech signal.
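The per-frame weighting and cross-channel summation of equation (4) can be sketched as follows; the helper name and the toy responses are ours, and in practice the responses would be the twice-filtered, phase-corrected Gammatone outputs.

```python
import numpy as np

def resynthesize_frame(F_rev, mask, window):
    """Eq. (4) restricted to one frame: weight each channel's phase-corrected
    response by its ratio-mask value, apply the raised cosine window, and sum
    across the C channels."""
    return np.sum(mask[:, None] * F_rev * window[None, :], axis=0)

C, N = 4, 8
window = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(N) / N))  # raised cosine
F_rev = np.ones((C, N))                  # toy channel responses
mask = np.array([1.0, 0.5, 0.0, 0.25])   # one mask value per channel
frame = resynthesize_frame(F_rev, mask, window)
```

With identical unit responses, the output is simply the sum of the mask values times the window, which makes the weighting easy to check by hand.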

B) Ratio mask estimation

In our noise-masking method, the noisy speech signal with 4 kHz bandwidth is decomposed into T–F units by a 128-channel Gammatone filterbank [19, 20] whose center frequencies are quasi-logarithmically spaced from 80 to 4000 Hz. In the proposed method, we assume that the clean speech and noise are additive and statistically independent in each Gammatone channel. Thus, the powers of the noisy speech, speech, and noise in each channel are related as follows:

$$P_y(c,m) = P_x(c,m) + P_d(c,m), \qquad (5)$$

where c is the channel index, m is the frame index, and Py(c, m), Px(c, m), and Pd(c, m) indicate the powers of the noisy speech, clean speech, and noise in the cth channel of the mth frame, respectively. The ratio mask can be estimated as follows:

$$V_R(c,m) = \frac{P_x(c,m)}{P_x(c,m) + P_d(c,m)}, \qquad (6)$$

where VR(c, m) is the initial ratio mask estimate in the cth channel of the mth frame, and Px(·) and Pd(·) are the estimated powers of speech and noise, respectively. The speech power estimation will be given in Section II-D, and the noise power is obtained by the MCRA method [21].

As shown in equation (6), the ratio mask varies from 0 to 1, rather than being 0 or 1 as in the binary mask. The ratio mask value will be close to 0 if the current T–F unit is dominated by noise, that is, if the noise power is larger than the speech power. On the contrary, the ratio mask approaches 1 if speech dominates the current T–F unit. Thus, the soft discrimination factor, the ratio mask, can keep parts of the speech components in the weak voiced fragments by using a small ratio value instead of the 0 produced by the binary mask estimation.
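Equation (6) is a one-liner in code; the small guard for the case where both powers vanish is an implementation detail we add, and the helper name is ours.

```python
import numpy as np

def ratio_mask(P_x_hat, P_d_hat, eps=1e-12):
    """Eq. (6): V_R = P_x / (P_x + P_d) using the estimated powers.
    eps only guards against division by zero when both powers vanish."""
    return P_x_hat / (P_x_hat + P_d_hat + eps)

# Speech-dominated unit -> mask near 1; noise-dominated unit -> mask near 0.
speechy = ratio_mask(9.0, 1.0)
noisy = ratio_mask(1.0, 9.0)
```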

Fig. 2. The block diagram of the speech synthesis mechanism.

The speech components at low frequencies are more important than those at high frequencies, because low-frequency information contributes more to speech intelligibility. Therefore, the ratio mask at low frequencies is modified by its adjacent T–F units to further preserve the speech energy in the proposed method. The modified ratio mask, VR, is obtained as

$$V_R(c,m) = \begin{cases} V_a(c,m), & \text{if } c \in [1, 50] \\ V_R(c,m), & \text{otherwise,} \end{cases} \qquad (7)$$

where

$$V_a(c,m) = \frac{V_R(c+2,m) + V_R(c+1,m) + V_R(c,m)}{3}; \qquad (8)$$

here Va(c, m) represents the average of the ratio masks in three adjacent T–F units of the same frame. The purpose of averaging the adjacent T–F units is to eliminate outlier units and keep the speech energy. The initial ratio mask defined in equation (6) is only modified below the 50th channel, which corresponds to a center frequency of 636 Hz, because the speech components below this frequency are more important based on hearing perception. Thus, the basic idea of equation (7) is to recover the partial speech energy that has been masked in the initial ratio mask estimation.
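Equations (7)-(8) amount to a three-channel moving average applied to the lowest channels; a sketch with 0-based channel rows (the helper name is ours):

```python
import numpy as np

def modify_low_channels(VR, low=50):
    """Eqs. (7)-(8): for channels below `low`, replace the mask value with
    the average of itself and the two channels above it in the same frame.
    VR: (C, M) array of initial ratio masks; channels are rows."""
    out = VR.copy()
    C = VR.shape[0]
    for c in range(min(low, C - 2)):  # stop where c+2 would run off the top
        out[c] = (VR[c + 2] + VR[c + 1] + VR[c]) / 3.0
    return out

VR = np.array([[0.0], [0.3], [0.6], [0.9]])  # 4 toy channels, 1 frame
VM = modify_low_channels(VR, low=2)
```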

The binary mask can be deemed a hard decision and easily keeps more speech components due to its binary discriminant; however, parts of the noise components are also insufficiently masked and kept at the same time. Comparatively speaking, the ratio mask has a good ability to mask noise, but it simultaneously damages some speech components. Thus, combining the advantages of both the ratio mask and the binary mask, the linear interpolation between them given in equation (9) is utilized to smooth the ratio mask when the binary mask value is 1. The smoothed ratio mask is obtained by

$$V_R(c,m) = \begin{cases} \eta \cdot V_R(c,m) + (1-\eta) \cdot V_B(c,m), & \text{if } V_B(c,m) = 1 \\ V_R(c,m), & \text{if } V_B(c,m) = 0, \end{cases} \qquad (9)$$

where η = 0.2 is a smoothing factor obtained by extensive listening tests. Meanwhile, we also use the average HIT-False Alarm rate (HIT-FA) objective test [25] to determine the

Fig. 3. Average HIT-FA score histogram with respect to factor η.

credibility of η = 0.2. Figure 3 shows the HIT-FA score curve versus the smoothing factor η, which varies from 0 to 1.0. The speech signal is subjected to five types of noise (white, babble, office, street, and factory1 noise) under three SNRs (0, 5, and 10 dB). The highest point appears at η = 0.2 for all three SNRs. Thus, in this paper, we set the factor η to 0.2. VB(c, m) is the estimated binary mask value [25], which will be introduced in the following Section II-C, and VR(c, m) is the modified ratio mask given in equation (7).
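The interpolation of equation (9) vectorizes directly; since VB is 1 in the interpolated branch, the second term reduces to (1 − η). The helper name is ours.

```python
import numpy as np

def smooth_mask(VR_mod, VB, eta=0.2):
    """Eq. (9): where the binary mask is 1, pull the modified ratio mask
    toward 1 with weight (1 - eta); where it is 0, keep the ratio mask."""
    return np.where(VB == 1, eta * VR_mod + (1.0 - eta) * VB, VR_mod)

VR_mod = np.array([0.4, 0.4])
VB = np.array([1, 0])
VS = smooth_mask(VR_mod, VB)  # first unit interpolated, second unchanged
```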

C) Binary mask estimation

Due to power estimation errors, the estimated powers of speech and noise no longer satisfy equation (5). So, we introduce a factor to solve this issue; that is, the noisy speech power can be expressed as

$$P_y(c,m) = P_x(c,m) + \sigma(c,m) \cdot P_d(c,m), \qquad (10)$$

where Py(c, m), Px(c, m), and Pd(c, m) are the noisy speech power, the estimated speech power, and the estimated noise power in the cth channel of the mth frame, respectively. σ(c, m) is a factor to balance equation (10). Thus, the power estimation errors of the speech and noise are compensated by the factor σ(c, m). When the factor σ(c, m) increases, the noise components are reduced in the current T–F unit; this means that the current T–F unit is dominated by speech components when σ(c, m) has a larger value. Conversely, when the factor σ(c, m) has a smaller value, the noise components dominate the current unit.

Fig. 4. Average HIT-FA score histogram with respect to threshold ξ.

Based on equation (10), the factor σ(c, m) can be given as

$$\sigma(c,m) = \frac{P_y(c,m) - P_x(c,m)}{P_d(c,m)}. \qquad (11)$$

The difference between Py(c, m) and Px(c, m) in the numerator of equation (11) corresponds to the noise power derived from the speech power estimation. The denominator is the noise power obtained by the MCRA method [21]. The factor σ(c, m) makes the estimated powers, Px(·) and Pd(·), satisfy equation (5). σ(c, m) can be considered a boundary factor [25] to distinguish speech from noise T–F units. In our experiments, if the factor σ(c, m) is larger than the threshold ξ, the current T–F unit is dominated by speech components and labeled with the value 1; otherwise, the noise components dominate the current T–F unit, which is labeled with the value 0. Thus, the binary mask VB(c, m) used in equation (9) can be determined as

$$V_B(c,m) = \begin{cases} 1, & \text{if } \sigma(c,m) > \xi \\ 0, & \text{otherwise.} \end{cases} \qquad (12)$$

Through a HIT-FA objective test based on equation (12), we found that the HIT-FA gives a good result when the threshold ξ is chosen as 3. Figure 4 shows the average HIT-FA score histogram for threshold values of ξ from 1.6 to 3.8, based on five types of noise (white, babble, office, street, and factory1 noise) under three SNR conditions (0, 5, and 10 dB).
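Equations (11)-(12) combine into a short thresholding step, with ξ = 3 as selected by the HIT-FA test; the helper name is ours.

```python
import numpy as np

def binary_mask(P_y, P_x_hat, P_d_hat, xi=3.0, eps=1e-12):
    """Eq. (11): sigma = (P_y - P_x_hat) / P_d_hat, then Eq. (12): label the
    unit 1 (speech-dominated) when sigma > xi, else 0 (noise-dominated)."""
    sigma = (np.asarray(P_y, dtype=float) - P_x_hat) / (np.asarray(P_d_hat, dtype=float) + eps)
    return (sigma > xi).astype(int)

speech_unit = binary_mask(P_y=10.0, P_x_hat=2.0, P_d_hat=2.0)  # sigma = 4 > 3
noise_unit = binary_mask(P_y=10.0, P_x_hat=8.0, P_d_hat=2.0)   # sigma = 1
```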

D) Speech power estimation

The enhanced speech is resynthesized with the estimated ratio mask defined in equation (9), which relies on the powers of the estimated speech and noise. Thus, the speech power estimation is the key point of our proposed noise-masking method. It can be formulated as a minimization problem. We apply convex optimization [17], whose local optimal solution coincides with the global optimum, to solve this minimization problem.

For a real and positive speech power vector $p_x \in \mathbb{R}^n$ composed of 128 channels, if the objective function $J: \mathbb{R}^n \to \mathbb{R}$ is convex, $J(p_x)$ has a minimum with respect to $p_x$. The vector $p_x$ can be estimated by the minimization problem without constraints. The optimal solution can be reached when we optimize each element of $p_x$ individually. Thus, the optimal value $\hat{P}_x(c,m)$ of Px(c, m) can be obtained as follows:

$$\hat{P}_x(c,m) = \arg\min_{p_x} J\left[P_x(c,m)\right]. \qquad (13)$$

The above minimization problem is a kind of convex optimization with respect to the variable Px(c, m) based on the objective function J[Px(c, m)]. This convex optimization can easily be implemented by the gradient descent method. Here, the objective function J(·) is built as follows:

$$J\left[P_x(c,m)\right] = \sum_{c=1}^{128} \left[P_y(c,m) - P_d(c,m) - P_x(c,m)\right]^2 + \lambda \cdot \varphi\left(P_x(c,m)\right). \qquad (14)$$

The first term in equation (14) is defined in the sense of the minimum mean-square error, which is completely convex; i.e., the squared error between Py(·) and Pd(·) + Px(·) should equal 0 when Pd(·) and Px(·) are correctly estimated. Since estimation errors with respect to Pd(·) and Px(·) exist in practical situations, the second term in equation (14) is introduced as an error constraint. The function ϕ(·) is called the regularization or penalty function. Here we use the l1 norm as the penalty function, i.e., $\varphi(P_x) = \sum_{c=1}^{128} |P_x|$.

λ > 0 is the regularization parameter. By varying the parameter λ, we can trace out the optimal trade-off solution of equation (14). Due to the non-negativity of λ and ϕ(·), equation (14) is also a completely convex function. This means that the approximate value of Px(·) can be estimated by minimizing equation (14).

In equation (14), Py(·) is the noisy speech power obtained in the time domain. We assume that the noise has been pre-estimated by the MCRA [21] method. Thus, the objective function (14) has only one variable, Px(·). Then, we further utilize the relationship between the noise and the noisy speech to obtain the proportion of noise power within the noisy power. Multiplying the noisy power by this proportion, we get a modified objective function as follows:

$$J\left[P_x(c,m)\right] = \sum_{c=1}^{128} \left[P_y(c,m) - \rho(c,m) \cdot P_y(c,m) - P_x(c,m)\right]^2 + \lambda \cdot \sum_{c=1}^{128} P_x(c,m), \qquad (15)$$

where ρ(c, m) is the normalized cross-correlation coefficient [22] between the noise and noisy speech spectra, calculated in the frequency domain. ρ(c, m) · Py(c, m) is considered an approximate value of the noise power. In effect, the factor ρ(c, m) indicates the percentage of noise components within the noisy speech signal. Therefore, we apply this coefficient to represent the proportion of noise power in the noisy speech signal, instead of using the noise power spectrum directly. The common situation of noise overestimation can make the estimated speech power negative, which is impossible in a real application. The usage of the normalized cross-correlation coefficient cleverly avoids noise overestimation, because ρ(c, m) varies from 0 to 1. Therefore, ρ(c, m) · Py(c, m) is always smaller than the noisy speech power, Py(c, m).

Figure 5 shows an example in which two NCCCs vary with time, with respect to the channel index and frame index, in five channels. The channel indexes are 20, 50, 70, 100, and 120, which correspond to the center frequencies of 237, 636, 1077, 2195, and 3432 Hz, respectively. Figure 5(a) describes the NCCC, χ(c, m), between the true noise and the noisy speech, and Fig. 5(b) shows the NCCC, ρ(c, m), between the estimated noise and the noisy speech. From Fig. 5, we can find that each Gammatone channel has a different NCCC value because the signal energy of each channel is different, with more energy concentrated at low frequencies. The estimated ρ(c, m) approximately matches the ideal ratio trajectory χ(c, m). Thus, we are confident in applying the noise estimated with MCRA to obtain the proportion of noise

Fig. 5. An example of normalized cross-correlation coefficient in differentchannels (Input SNR = 5 dB, white noise). (a) True noise condition. (b) Esti-mated noise condition.

Table 1. Iterative algorithm for $\hat{P}_x$.

Input: Py(c, m) and ρ(c, m)
Output: $\hat{P}_x(c,m)$

For each frame and channel of the noisy speech:
  k = 0; iterative step size δ = 0.1
  While iterative error > θ:
    $\nabla = \mathrm{d}J(\hat{P}_x)/\mathrm{d}\hat{P}_x$
    $\hat{P}_x^{(k+1)} \leftarrow \hat{P}_x^{(k)} - \delta \cdot \nabla$
    iterative error = $J(\cdot)^{(k+1)} - J(\cdot)^{(k)}$
    k = k + 1
  End when error ≤ θ
Return $\hat{P}_x^{(k)}(c,m)$

power in the noisy speech. ρ(c, m) in the unvoiced fragments is larger than in voiced fragments, because the signal components in the unvoiced segments are more like the noise components. Moreover, [1 − ρ(c, m)] · Py(c, m) ensures that the difference between the noisy and noise signals is not negative, since ρ(c, m) is less than or equal to 1 based on the following normalized cross-correlation:

$$\rho(c,m) = \frac{\sum_{l=1}^{L} Y(c,m,l) \cdot D(c,m,l)}{\sqrt{\sum_{l=1}^{L} Y^2(c,m,l) \cdot \sum_{l=1}^{L} D^2(c,m,l)}}, \qquad (16)$$

where L is the number of FFT points, l is the frequency index, and Y(c, m, l) and D(c, m, l) are the spectral magnitudes of the noisy speech and the estimated noise in the cth channel of the mth frame, which are calculated by an FFT of size 256 and the MCRA method [21], respectively.
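Equation (16) is the standard normalized cross-correlation over the L FFT bins of one T–F unit; for non-negative magnitude spectra it always lies in [0, 1]. The helper name is ours.

```python
import numpy as np

def nccc(Y, D, eps=1e-12):
    """Eq. (16): normalized cross-correlation between the noisy-speech and
    estimated-noise magnitude spectra Y(c, m, :) and D(c, m, :)."""
    num = np.sum(Y * D)
    den = np.sqrt(np.sum(Y ** 2) * np.sum(D ** 2)) + eps
    return num / den

# Proportional spectra -> coefficient near 1; disjoint spectra -> near 0.
rho_hi = nccc(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
rho_lo = nccc(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```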

Equation (15) can be solved by the gradient descent method [23] to calculate an approximate value of the estimated speech power, Px(c, m), in each channel of each frame. The complete algorithm is presented in Table 1. The inputs of the iterative algorithm are the noisy power and ρ(c, m); the output is the optimal solution for the speech power. By taking the derivative of J(·) in equation (15) with respect to Px(c, m), we obtain the gradient ∇ of the objective function. Px(c, m) is then moved in the direction of the negative gradient to obtain the (k + 1)th iterate. After that, the iterative error between two adjacent iterations is computed to decide whether to terminate. The iteration stops if the iterative error is smaller than a threshold θ, which is set to 1 based on objective and subjective tests; otherwise, the iteration continues until convergence.
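The loop of Table 1 can be sketched as plain gradient descent. Since equation (15) is defined earlier in the paper, the objective J and its derivative are passed in as callables here; the function name and the initial guess (1 − ρ)·Py (the non-noise part of the noisy power) are our own assumptions:

```python
def estimate_speech_power(Py, rho, J, grad_J, delta=0.1, theta=1.0,
                          max_iter=60):
    """Gradient-descent iteration of Table 1 for one channel/frame.

    Py, rho  : noisy power Py(c, m) and NCCC rho(c, m)
    J, grad_J: the objective of eq. (15) and its derivative w.r.t. Px
    delta    : iterative step size (0.1 in Table 1)
    theta    : stopping threshold on the iterative error
    """
    Px = (1.0 - rho) * Py               # assumed initial guess
    J_prev = J(Px)
    for _ in range(max_iter):
        Px = Px - delta * grad_J(Px)    # move against the gradient
        J_cur = J(Px)
        if abs(J_cur - J_prev) <= theta:  # iterative error <= theta: stop
            break
        J_prev = J_cur
    return Px
```

With a toy quadratic objective J(Px) = (Px − t)², the loop converges toward t; the fixed 60-iteration budget mirrors the choice made in Section III-A.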

III. EXPERIMENTS AND RESULTS

In this section, we evaluate the performance of the proposed method, named ConvexRM, and discuss its enhancement results. The test clean speech signals are 50 utterances selected from the TIMIT database [26]. Five types of noise from the NOISEX-92 noise database [27] are used: white, babble, office, street, and factory1. The input SNR is set to −5, 0, 5, and 10 dB, respectively.



Fig. 6. Power error comparison of speech against the number of iterations.

The sampling rate of the noisy speech signal is 8 kHz. We apply the segmental SNR (SSNR) improvement measure [28], the log spectral distortion (LSD) measure [29], and short-time objective intelligibility (STOI) [30] to evaluate the objective quality of the enhanced signal. Meanwhile, the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) listening test [31] is utilized to measure the subjective performance. To put our results in perspective, GammaPrior [5], SARHMM [9], and CBSPP [10] are selected as reference enhancement methods, and their ready-made batch programs and source codes are used for the tests. The IRM represents the ideal situation in which the clean speech and noise signals are known in advance.

A) Experiment setup

In Section II-D, the objective function (15) was solved by the gradient descent method, so the number of iterations must be considered carefully. A large number of iterations can yield a better estimate but also makes the enhancement system very complicated; conversely, the enhancement performance may be degraded if the number of iterations is unreasonably constrained. Therefore, we analyze the average iteration error over all 128 channels between 30 and 70 iterations to determine a reasonable number of iterations. An example of the average errors in dB is shown in Fig. 6. In this error comparison, we use four input SNRs under five types of noise to determine the number of iterations. From Fig. 6, we can see that the power error reaches a low value when the number of iterations exceeds 60. Therefore, during the convex optimization of the speech power, the number of iterations in the gradient descent method is set to 60, which adequately satisfies the convergence condition.

Although the proposed method estimates the speech power iteratively, we set a fixed iteration number of 60 to reduce the computational complexity. SARHMM and CBSPP not only require prior information and a large database but also spend considerable time searching the mapping pairs during online enhancement. In contrast, the iterative estimation of our proposed method is very simple, with a small iteration number. Our proposed method is slightly

Fig. 7. The cochleogram comparison (input SNR = 5 dB, factory1 noise). (a) The cochleogram resynthesized by the ideal ratio mask; (b) the cochleogram resynthesized by the estimated binary mask VB; (c) the cochleogram resynthesized by the initial ratio mask VR; (d) the cochleogram resynthesized by the ratio mask modified with adjacent T–F units VR; (e) the cochleogram resynthesized by the ratio mask smoothed with the binary mask VR.

more complex than the GammaPrior approach, but the proposed method does not need the additive Gaussian noise assumption, and its complexity is tolerable.

An example of a cochleogram comparison based on the ratio mask estimates is given in Fig. 7 to observe the estimation performance. Figure 7(a) is the cochleogram based on the ideal ratio mask, for which the clean speech and noise are known in advance. Figure 7(b) is the cochleogram obtained through the binary mask estimated by equation



Fig. 8. Speech waveform comparison of five channels (Input SNR = 5 dB, white noise). (a) Clean speech; (b) Noisy speech; (c) Enhanced speech.

(12). Figure 7(c) is the cochleogram acquired via the initial estimation of the ratio mask by equation (6). Figure 7(d) is the cochleogram based on the ratio mask modified by adjacent T–F units, and the cochleogram of Fig. 7(e) is based on the final estimated ratio mask interpolated with the binary mask in equation (9). The speech components at low frequency usually contain some useful information, so it is unrealistic to totally remove the background noise at low frequency. Thus, in Figs 7(d) and 7(e), we deliberately keep a very small amount of the T–F unit energy as floor noise below the 12th Gammatone channel, which corresponds to a center frequency of 120 Hz. The binary mask estimation loses some speech T–F units at low frequency and keeps parts of the speech units at middle frequency. The initial ratio mask does relatively well at retaining the speech components at middle frequency but misses too many speech T–F units at low frequency. By using the modified

algorithm given by equation (7), the ratio masks at low frequency are recovered. Moreover, by combining the binary mask's better preservation of speech T–F units at middle frequency, the smoothed ratio mask given by equation (9) recovers and enlarges the edges of the speech T–F units. Based on the above analysis, the final estimated ratio mask (VR) achieves a better performance and is considered the key point of our proposed method (ConvexRM).

To further observe the noise-masking performance in each Gammatone channel, Fig. 8 shows an example of how the waveforms vary with time in five channels. The clean speech, noisy speech, and enhanced speech are shown in Figs 8(a)–8(c), respectively. From this figure, we can see that the enhanced speech waveforms match the clean speech waveforms well in the 50th, 70th, 100th, and 120th channels. In the 20th channel, the enhanced speech contains more background noise than the clean speech, because the Gammatone filter



tends to reinforce the energy at low frequency. In practice, this level of noise energy does not affect the hearing quality much.

To demonstrate the effectiveness of the proposed method with 60 iterations of the gradient descent method, the spectrograms of the speech enhanced by the different methods are shown in Fig. 9 to verify the noise masking.

From Fig. 9, we can see that the proposed ConvexRM method masks more background noise and keeps more speech components than the three reference approaches. The GammaPrior algorithm removes the least noise among these methods. Although SARHMM removes noise components to a certain degree, it also loses parts of the speech components at high frequency. With the CBSPP approach, parts of the weak speech components are eliminated after enhancement. Our ConvexRM method achieves better noise masking and speech energy preservation. To further assess the noise elimination performance, the SSNR, LSD, STOI, and MUSHRA tests are discussed in the next subsections.

B) SSNR improvement test results

The average SSNR measure [28] is often utilized to evaluate the denoising performance of a speech enhancement method. The input SNR and the output SNR are defined as follows, respectively. The average SSNR improvement is obtained by subtracting Sin from Sout.

$$S_{in} = 10\log_{10}\frac{\sum_{n=0}^{N-1} x^{2}(n)}{\sum_{n=0}^{N-1}\left[x(n)-y(n)\right]^{2}}, \qquad (17)$$

$$S_{out} = 10\log_{10}\frac{\sum_{n=0}^{N-1} x^{2}(n)}{\sum_{n=0}^{N-1}\left[x(n)-\hat{x}(n)\right]^{2}}, \qquad (18)$$

where N is the number of samples, x(n) is the original clean speech, y(n) represents the input noisy speech, and x̂(n) denotes the enhanced signal.

The average SSNR improvements of the various enhancement methods for stationary and non-stationary noise conditions are presented in Table 2 for different input SNRs (i.e., −5, 0, 5, and 10 dB). As Table 2 shows, the ConvexRM method generally scores higher than the three reference methods. In more detail, the proposed ConvexRM is slightly higher than the CBSPP method in babble and office noises. Both the CBSPP and SARHMM approaches outperform GammaPrior in reducing the background noise.
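Equations (17) and (18) translate directly into code. This sketch (function name ours) computes the improvement Sout − Sin for equal-length signals; note that, as printed, the equations are written over the whole signal, whereas the segmental variant [28] averages the same ratio over frames:

```python
import numpy as np

def snr_improvement(x, y, x_hat):
    """SNR improvement from eqs. (17)-(18): S_out - S_in, where x is
    the clean speech, y the noisy speech, and x_hat the enhanced
    signal (all 1-D arrays of N samples)."""
    s_in = 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))
    s_out = 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))
    return s_out - s_in
```

For example, an enhancer that halves the residual-noise amplitude raises the measure by 10·log10(4) ≈ 6.02 dB regardless of the input SNR.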

C) LSD test results

During signal enhancement, although parts of the background noise are removed, the signal may also be distorted at the same time. Therefore, to further check the spectral distortion of the enhanced signal, the LSD measure [29] is employed to evaluate the objective quality of the enhanced speech. It measures the similarity between the clean speech spectrum and the enhanced speech spectrum,

Fig. 9. Spectrogram comparison (input SNR = 5 dB, factory1 noise; "She had your dark suit in greasy wash water all year").



Table 2. SSNR improvement results.

Noise type   Methods      −5 dB    0 dB    5 dB   10 dB

White        IRM          19.89   15.03   12.77    9.32
             GammaPrior   10.01    8.95    7.62    5.94
             SARHMM       13.06   10.72    8.52    6.34
             CBSPP        16.01   13.02   10.07    6.86
             ConvexRM     16.19   13.48   10.42    7.09

Babble       IRM          15.12   14.12   11.51    7.99
             GammaPrior    7.24    6.86    5.65    4.47
             SARHMM        8.50    7.43    6.31    5.04
             CBSPP        10.49    9.03    7.41    5.04
             ConvexRM     10.78    9.29    7.53    5.40

Office       IRM          16.12   14.89   12.82    9.02
             GammaPrior    9.33    8.61    7.46    6.05
             SARHMM       10.96    9.55    7.41    6.30
             CBSPP        11.28    9.47    7.74    5.62
             ConvexRM     11.37    9.78    7.85    6.33

Street       IRM          18.29   16.12   13.92   11.65
             GammaPrior   11.89   10.96    9.54    7.54
             SARHMM       12.81   11.54    9.98    7.43
             CBSPP        12.92   10.92    8.67    7.34
             ConvexRM     15.85   13.19   10.51    8.06

Factory1     IRM          17.23   15.73   13.12   10.65
             GammaPrior    9.14    8.09    6.84    5.32
             SARHMM       10.33    8.86    7.38    5.62
             CBSPP        12.93   10.60    8.09    5.63
             ConvexRM     13.00   10.87    8.45    5.78

and is defined as

$$l = \frac{1}{M}\sum_{m=0}^{M-1}\sqrt{\frac{1}{K}\sum_{k=0}^{K-1}\left[10\log_{10}\frac{|X(m,k)|^{2}}{|\hat{X}(m,k)|^{2}}\right]^{2}}, \qquad (19)$$

where k is the index of the frequency bins, K = 512 is the FFT size, m is the frame index, and M is the total number of frames. |X(m, k)| denotes the DFT magnitude of the clean speech, and |X̂(m, k)| denotes the DFT magnitude of the enhanced speech.
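Equation (19) can be sketched as follows (function name ours; a small eps guards the logarithm, which the printed equation omits):

```python
import numpy as np

def lsd(X_clean, X_enh, eps=1e-12):
    """Log spectral distortion of eq. (19).  X_clean and X_enh are
    (M, K) magnitude spectrograms |X(m, k)| and |X_hat(m, k)| of the
    clean and enhanced speech (M frames, K bins; K = 512 in the paper)."""
    ratio_db = 10.0 * np.log10((X_clean ** 2 + eps) / (X_enh ** 2 + eps))
    # inner term: RMS over frequency per frame; outer: mean over frames
    return float(np.mean(np.sqrt(np.mean(ratio_db ** 2, axis=1))))
```

As a sanity check, identical spectrograms give an LSD of 0 dB, and a uniform 2× magnitude error gives |10·log10(1/4)| ≈ 6.02 dB.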

The LSD test results are given in Table 3. From Table 3, we can see that the ConvexRM method introduces less distortion than the other methods in the −5, 0, 5, and 10 dB conditions. Additionally, the SARHMM method shows almost the same level of spectral distortion as our proposed system in the 10 dB situation, where our method is still better than the other two reference approaches. This means that our ConvexRM method masks more background noise while causing less spectral distortion.

D) STOI test results

STOI [30] measures the correlation of short-time temporal envelopes between clean and enhanced speech, and has been shown to be highly correlated with human speech intelligibility scores. The STOI measure is derived from a correlation coefficient between the temporal envelopes of the clean and enhanced speech in short-time regions, and its score ranges from 0 to 1: the higher the STOI value, the higher the intelligibility. Table 4 shows the STOI results; the signal enhanced by ConvexRM holds the highest intelligibility among all methods, especially in the low SNR conditions (−5 and 0 dB). The three reference

Table 3. LSD results.

Noise type   Methods      −5 dB    0 dB    5 dB   10 dB

White        Noisy        14.58   12.85   11.48    8.66
             IRM           5.29    4.67    4.01    3.12
             GammaPrior    9.34    7.98    6.75    5.94
             SARHMM        7.78    7.07    6.46    5.92
             CBSPP         9.18    8.17    8.13    7.69
             ConvexRM      7.15    6.69    6.33    5.82

Babble       Noisy        10.87    9.03    7.44    6.04
             IRM           5.01    4.12    3.23    2.34
             GammaPrior    8.09    6.83    5.77    5.05
             SARHMM        7.84    6.76    5.85    5.01
             CBSPP         8.09    7.40    6.93    5.78
             ConvexRM      7.39    6.58    5.73    4.97

Office       Noisy        10.01    8.27    6.75    5.45
             IRM           4.45    3.67    3.05    2.21
             GammaPrior    7.23    6.06    5.21    4.60
             SARHMM        7.03    6.09    5.34    4.61
             CBSPP         7.35    6.73    6.32    5.42
             ConvexRM      6.56    5.79    5.16    4.58

Street       Noisy         9.17    7.53    6.18    5.05
             IRM           3.72    3.06    2.46    2.09
             GammaPrior    6.10    5.18    4.87    4.27
             SARHMM        6.62    5.74    4.95    4.29
             CBSPP         6.56    6.23    5.68    5.01
             ConvexRM      5.98    5.28    4.82    4.24

Factory1     Noisy        12.43   10.46    8.61    7.01
             IRM           5.59    4.88    3.75    3.05
             GammaPrior    8.35    7.02    5.96    5.35
             SARHMM        8.14    7.02    6.04    5.36
             CBSPP         8.48    7.02    6.04    5.36
             ConvexRM      7.28    6.49    5.87    5.34

methods even score lower than the noisy signal in several situations (e.g., 10 dB white noise; −5 and 0 dB office noise), because they lose parts of the speech components in the weak speech fragments. All the methods obtain high scores under street noise. The reason is that the street noise in the NOISEX-92 database is relatively stationary, with most of its energy at low frequency. In practice, street noise is easier to process than white noise, and much easier than babble or office noise.
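For intuition only, the envelope-correlation idea underlying STOI can be sketched as below. This is a deliberately simplified stand-in (the function name, crude envelope, and segment length are ours), not the full STOI of [30], which additionally uses a 1/3-octave band decomposition, 384 ms analysis segments, and clipping:

```python
import numpy as np

def envelope_correlation(x, x_hat, seg_len=256):
    """Average correlation coefficient between short-time amplitude
    envelopes of clean (x) and enhanced (x_hat) speech."""
    scores = []
    for i in range(0, len(x) - seg_len + 1, seg_len):
        a = np.abs(x[i:i + seg_len])          # crude temporal envelope
        b = np.abs(x_hat[i:i + seg_len])
        a, b = a - a.mean(), b - b.mean()
        den = np.linalg.norm(a) * np.linalg.norm(b)
        if den > 0:
            scores.append(float(np.dot(a, b) / den))
    return float(np.mean(scores))
```

An undistorted signal scores 1.0, and the score drops as the enhanced envelopes diverge from the clean ones, which is the behavior the tabulated STOI values reflect.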

E) Subjective listening test results

In our experiments, the MUSHRA listening test [31] is used to evaluate the subjective quality of the enhanced speech. The MUSHRA listening test consists of several successive experiments. Each experiment compares a high-quality reference speech signal (i.e., clean speech) to several test speech signals sorted in random order; the subjects are presented with the signals under test as well as the clean speech and a hidden anchor. Each subject grades all the test speech signals on a quality scale between 0 and 100.

During the MUSHRA test, we used as the hidden anchor a speech signal having an SNR 0 dB lower than that of the noisy speech to be enhanced. Seven male and seven female listeners participated in the tests, and each listener performed the test twice. The listeners were allowed to listen to each test



Table 4. STOI results.

Noise type   Methods      −5 dB   0 dB   5 dB   10 dB

White        Noisy         0.52   0.64   0.76   0.86
             IRM           0.77   0.80   0.84   0.90
             GammaPrior    0.52   0.64   0.76   0.85
             SARHMM        0.51   0.63   0.74   0.83
             CBSPP         0.52   0.63   0.73   0.81
             ConvexRM      0.56   0.68   0.79   0.86

Babble       Noisy         0.51   0.63   0.74   0.83
             IRM           0.76   0.79   0.83   0.89
             GammaPrior    0.51   0.63   0.74   0.82
             SARHMM        0.49   0.61   0.72   0.81
             CBSPP         0.49   0.61   0.72   0.79
             ConvexRM      0.52   0.65   0.77   0.85

Office       Noisy         0.63   0.72   0.80   0.86
             IRM           0.78   0.81   0.85   0.92
             GammaPrior    0.61   0.71   0.79   0.85
             SARHMM        0.55   0.66   0.75   0.82
             CBSPP         0.60   0.66   0.77   0.82
             ConvexRM      0.65   0.75   0.82   0.88

Street       Noisy         0.75   0.81   0.86   0.90
             IRM           0.84   0.88   0.91   0.95
             GammaPrior    0.76   0.82   0.86   0.90
             SARHMM        0.68   0.75   0.81   0.87
             CBSPP         0.74   0.79   0.88   0.86
             ConvexRM      0.80   0.85   0.88   0.92

Factory1     Noisy         0.51   0.63   0.75   0.84
             IRM           0.75   0.80   0.84   0.91
             GammaPrior    0.51   0.64   0.75   0.84
             SARHMM        0.47   0.61   0.73   0.83
             CBSPP         0.46   0.60   0.72   0.81
             ConvexRM      0.53   0.67   0.78   0.87

Fig. 10. The MUSHRA results for five types of noises.

speech signal several times and had access to the clean speech reference. The ten test utterances used were contaminated by the aforementioned five types of noise at a 5 dB input SNR. A statistical analysis of the test results was conducted for the different de-noising methods under the five noise conditions.
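The noisy test stimuli (and a lower-SNR hidden anchor) can be generated by scaling the noise to a target global SNR before mixing. A small sketch under that assumption (function name ours):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + g * noise, with g chosen so that the mixture has
    the requested global SNR in dB (e.g. 5 dB for the MUSHRA stimuli)."""
    g = np.sqrt(np.sum(clean ** 2) /
                (np.sum(noise ** 2) * 10 ** (snr_db / 10.0)))
    return clean + g * noise
```

Calling the same function with a lower `snr_db` than the test condition produces the degraded anchor signal.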

Figure 10 shows the averaged MUSHRA listening test results with 95% confidence intervals. Most of the listeners preferred the proposed ConvexRM method over the other methods under the five types of noise. SARHMM and CBSPP are the methods most competitive with the proposed one. Under white, street, and factory1 noise, ConvexRM obtains a clear preference over the three reference approaches. Under babble and office noise, most listeners still chose the proposed ConvexRM method despite a slight decline in its score.

In terms of hearing perception, the speech enhanced by the proposed method is more comfortable and continuous than that of the SARHMM and CBSPP methods. Meanwhile, the proposed method leaves less audible background noise than the GammaPrior method.

IV. CONCLUSIONS

In this paper, we proposed a novel noise-masking method based on an effective estimation of the ratio mask in the Gammatone domain instead of the DFT domain. A convex optimization algorithm was applied to estimate the speech power, combined with an adaptive factor, the NCCC. Adjacent T–F units were considered to keep the speech components at low frequency. To recover weak speech components, linear interpolation between the ratio mask and the binary mask was utilized to smooth the estimated ratio mask. In the objective measures and the subjective listening test, our proposed method showed better performance than the three reference methods; especially in the low SNR conditions, the improvement in intelligibility is obvious. This also implies that our noise-masking method is effective.

REFERENCES

[1] Boll, S.: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Process., ASSP-27 (2) (1979), 113–120.

[2] Li, C.; Liu, W.J.: A novel multi-band spectral subtraction method based on phase modification and magnitude compensation, in Proc. IEEE ICASSP, 2011, 4760–4763.

[3] Loizou, P.C.: Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, FL, USA, 2007.

[4] Ephraim, Y.; Malah, D.: Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process., 32 (6) (1984), 1109–1121.

[5] Erkelens, J.S.; Hendriks, R.C.; Heusdens, R.; Jensen, J.: Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors. IEEE Trans. Audio, Speech, Lang. Process., 15 (6) (2007), 1741–1752.

[6] Zoghlami, N.; Lachiri, Z.; Ellouze, N.: Speech enhancement using auditory spectral attenuation, in EUSIPCO 2009, Scotland, 24–28 August 2009.

[7] Zhao, D.Y.; Kleijn, W.B.: HMM-based gain modeling for enhancement of speech in noise. IEEE Trans. Audio, Speech, Lang. Process., 15 (3) (2007), 882–892.

[8] Srinivasan, S.; Samuelsson, J.; Kleijn, W.B.: Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process., 14 (1) (2006), 163–176.

[9] Deng, F.; Bao, C.C.; Kleijn, W.B.: Sparse hidden Markov models for speech enhancement in non-stationary noise environments. IEEE Trans. Audio, Speech, Lang. Process., 23 (11) (2015), 1973–1987.

[10] He, Q.; Bao, F.; Bao, C.C.: Multiplicative update of auto-regressive gains for codebook-based speech enhancement. IEEE Trans. Audio, Speech, Lang. Process., 25 (3) (2017), 457–468.

[11] Hu, G.; Wang, D.L.: Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw., 15 (5) (2004), 1135–1150.

[12] Bao, F.; Abdulla, W.H.: A noise masking method with adaptive thresholds based on CASA, in APSIPA, Jeju, South Korea, 2016.

[13] Wang, Y.; Narayanan, A.; Wang, D.L.: On training targets for supervised speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 22 (12) (2014), 1849–1858.

[14] Williamson, D.S.; Wang, Y.X.; Wang, D.L.: Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio, Speech, Lang. Process., 24 (3) (2016), 483–493.

[15] Madhu, N.; Spriet, A.; Jansen, S.; Koning, R.; Wouters, J.: The potential for speech intelligibility improvement using the ideal binary mask and the ideal Wiener filter in single channel noise reduction systems: application to auditory prostheses. IEEE/ACM Trans. Audio, Speech, Lang. Process., 21 (1) (2013), 63–72.

[16] Koning, R.; Madhu, N.; Wouters, J.: Ideal time–frequency masking algorithms lead to different speech intelligibility and quality in normal-hearing and cochlear implant listeners. IEEE/ACM Trans. Audio, Speech, Lang. Process., 62 (1) (2014), 331–341.

[17] Boyd, S.; Vandenberghe, L.: Convex Optimization, Cambridge University Press, 2004.

[18] Bao, F.; Abdulla, W.H.: A new IBM estimation method based on convex optimization for CASA. Speech Commun., 97 (2018), 51–65.

[19] Patterson, R.D.; Nimmo-Smith, I.; Holdsworth, J.; Rice, P.: An Efficient Auditory Filterbank Based on the Gammatone Function, Appl. Psychol. Unit, Cambridge Univ., Cambridge, UK, 1998.

[20] Abdulla, W.H.: Advance in Communication and Software Technologies, chapter Auditory Based Feature Vectors for Speech Recognition Systems, WSEAS Press, 2002, pp. 231–236.

[21] Cohen, I.: Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Process. Lett., 9 (1) (2002), 12–15.

[22] Bao, F.; Dou, H.J.; Jia, M.S.; Bao, C.C.: A novel speech enhancement method using power spectra smooth in Wiener filtering, in APSIPA, 2014.

[23] Gardner, W.A.: Learning characteristics of stochastic gradient-descent algorithms: a general study, analysis, and critique. Signal Process., 6 (2) (1984), 113–133.

[24] Weintraub, M.: A Theory and Computational Model of Auditory Monaural Sound Separation. Ph.D. dissertation, Dept. Elect. Eng., Stanford University, Stanford, CA, 1985.

[25] Bao, F.; Abdulla, W.H.: A convex optimization approach for time-frequency mask estimation, in WASPAA, 2017, pp. 31–35.

[26] Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L.: DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, US Department of Commerce, Washington, DC, 1993 (NISTIR Publication No. 4930).

[27] Varga, A.P.; Steeneken, H.J.M.; Tomlinson, M.; Jones, D.: The NOISEX-92 study on the effect of additive noise on automatic speech recognition. http://spib.rice.edu/spib/select, 1992.

[28] Quackenbush, S.R.; Barnwell, T.P.; Clements, M.A.: Objective Measures of Speech Quality, Prentice–Hall, Englewood Cliffs, NJ, 1988.

[29] Abramson, A.; Cohen, I.: Simultaneous detection and estimation approach for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process., 15 (8) (2007), 2348–2359.

[30] Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J.: An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio, Speech, Lang. Process., 19 (7) (2011), 2125–2136.

[31] Vincent, E.: MUSHRAM: A MATLAB interface for MUSHRA listening tests, [Online]. Available: http://www.elec.qmul.ac.uk/people/emmanuelv/mushram, 2005.

Feng Bao received the B.S. and M.S. degrees in Electronic Engineering from Beijing University of Technology in 2012 and 2015, respectively. In October 2015, he began pursuing a doctoral degree in the Department of Electrical and Computer Engineering, University of Auckland, Auckland, New Zealand. His research interests are in the areas of speech enhancement and speech signal processing. He is the author or coauthor of over 20 papers in journals and conferences.

Waleed H. Abdulla holds a Ph.D. degree from the University of Otago, New Zealand. He is currently an Associate Professor at the University of Auckland. He was Vice President for Member Relations and Development of APSIPA. He has published more than 170 refereed publications, one patent, and two books. He is on the editorial boards of six journals and has supervised over 25 postgraduate students. He is the recipient of many awards and funded projects, such as JSPS, ETRI, and Tsinghua fellowships, and received Excellent Teaching Awards in 2005 and 2012. He is also a Senior Member of IEEE. His research interests include human biometrics; signal, speech, and image processing; machine learning; and active noise control.
