ICASSP 2006 Robustness Techniques Survey ShihHsiang 2006


Page 1: ICASSP 2006 Robustness Techniques Survey

ICASSP 2006 Robustness Techniques Survey

ShihHsiang 2006

Page 2: ICASSP 2006 Robustness Techniques Survey

PARAMETRIC NONLINEAR FEATURE EQUALIZATION FOR ROBUST SPEECH RECOGNITION

Luz García, José C. Segura, Javier Ramírez, Angel de la Torre, Carmen Benítez
Dpto. Teoría de la Señal, Telemática y Comunicaciones (TSTC)

Universidad de Granada

Page 3: ICASSP 2006 Robustness Techniques Survey


Introduction

• HEQ has been successfully applied to deal with the nonlinear effect of the acoustic environment in the feature domain
– Normalizes the probability distributions of the features in such a way that the acoustic environment effects are (partially) removed

• HEQ still suffers from several limitations
– It relies on a local estimation of the probability distributions of the features, based on a reduced number of observations belonging to the single utterance to be equalized

– The nonlinear transformation is based on mapping the global CDF of each feature into a reference one

– The transformations are usually based on a component-by-component equalization of the feature vector, thus discarding any cross-information between features in the equalization process

Page 4: ICASSP 2006 Robustness Techniques Survey


Introduction (cont.)

• In this paper, a parametric nonlinear equalization technique is proposed
– It relies on a two-Gaussian model for the probability distribution of the features
– And on a simple Gaussian classifier to label the input frames as belonging to the speech or non-speech class

• Recognition experiments on the AURORA 4 database have been performed and the effectiveness of the algorithm is analyzed in comparison with other linear and nonlinear feature equalization techniques

Page 5: ICASSP 2006 Robustness Techniques Survey


Review Histogram Equalization

• For a given random variable $x$ with probability density function $p_x(x)$, HEQ finds a transformation $F(x)$ mapping $p_x(x)$ into a reference distribution $p_{\tilde{x}}(\tilde{x})$

$$C_{\tilde{x}}(\tilde{x}) = C_{\tilde{x}}(F(x)) = C_{x}(x)$$

$$\tilde{x} = F(x) = C_{\tilde{x}}^{-1}\bigl(C_{x}(x)\bigr)$$

where $C_{x}$ and $C_{\tilde{x}}$ denote the cumulative distribution functions of the original and reference variables
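As a concrete illustration (not from the paper), a minimal quantile-based HEQ sketch in Python/NumPy: each feature component is mapped through its empirical CDF and then through the inverse of a reference CDF estimated from clean training features. The function name and interface are hypothetical.

```python
import numpy as np

def heq_equalize(test_feats, ref_feats):
    """Component-wise histogram equalization (illustrative sketch).

    test_feats: (T, D) features of the utterance to equalize
    ref_feats:  (N, D) clean training features defining the reference CDF
    Each test value is mapped through the empirical test CDF and then
    through the inverse of the reference CDF (quantile mapping).
    """
    T, D = test_feats.shape
    out = np.empty_like(test_feats)
    for d in range(D):
        x = test_feats[:, d]
        # empirical CDF value of each test frame (rank / T)
        ranks = np.argsort(np.argsort(x))
        cdf_vals = (ranks + 0.5) / T
        # inverse reference CDF evaluated via reference quantiles
        out[:, d] = np.quantile(ref_feats[:, d], cdf_vals)
    return out
```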

Page 6: ICASSP 2006 Robustness Techniques Survey


Review Histogram Equalization (cont.)

• The relative content of non-speech frames is a cause of variability in the HEQ transformation
– Because an estimate of the global probability distribution is used, which takes into account both speech and non-speech frames

Page 7: ICASSP 2006 Robustness Techniques Survey


Review Histogram Equalization (cont.)

• The unwanted variability of the transformation induced by the variable proportion of non-speech frames in each utterance can be
– Reduced by removing non-speech frames before estimating the transformation
– Another possibility is to use different transformations for speech and non-speech frames

• Instead of using a single transformation to map the global CDFs of the features, separate mappings can be built for speech and non-speech frames

– As an alternative, the authors propose a parametric form of the equalization transform based on a two-Gaussian mixture model

• The first Gaussian is used to represent non-speech frames, while the second one represents speech frames

Page 8: ICASSP 2006 Robustness Techniques Survey


Two-class parametric equalization

• For each class, a parametric linear transformation is defined to map the clean and noisy representation spaces

• The clean Gaussians for speech and non-speech frames can be estimated from the training database, while the noisy Gaussians should be estimated from the utterance to be equalized

$$\hat{x} = \mu_{x,n} + \Sigma_{x,n}^{1/2}\,\Sigma_{y,n}^{-1/2}\,(y - \mu_{y,n}), \quad \text{if } y \text{ is non-speech}$$

$$\hat{x} = \mu_{x,s} + \Sigma_{x,s}^{1/2}\,\Sigma_{y,s}^{-1/2}\,(y - \mu_{y,s}), \quad \text{if } y \text{ is speech}$$

[Figure: clean Gaussians (speech and non-speech) and noisy Gaussians (noise and noisy speech) for the two classes]

Page 9: ICASSP 2006 Robustness Techniques Survey


Two-class parametric equalization (cont.)

• In order to select whether the current frame y is speech or non-speech, a voice activity detector could be used
– This implies a hard decision between the two linear transformations, which could create discontinuities at the boundary of the non-speech/speech decision

• Instead, a soft decision can be used

$$\hat{x} = P(n \mid y)\left[\mu_{x,n} + \Sigma_{x,n}^{1/2}\,\Sigma_{y,n}^{-1/2}\,(y - \mu_{y,n})\right] + P(s \mid y)\left[\mu_{x,s} + \Sigma_{x,s}^{1/2}\,\Sigma_{y,s}^{-1/2}\,(y - \mu_{y,s})\right]$$

The posterior probabilities P(n|y) and P(s|y) are obtained using a simple two-class Gaussian classifier on MFCC C0
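As an illustration only (not the authors' implementation), a minimal NumPy sketch of this soft-decision combination, assuming diagonal covariances so that the matrix square roots act element-wise; the function name, input layout, and dict-based statistics are hypothetical.

```python
import numpy as np

def parametric_equalize(y, post_n, clean_stats, noisy_stats):
    """Soft two-class parametric equalization of one utterance (sketch).

    y:           (T, D) noisy cepstral features
    post_n:      (T,)   P(non-speech | y_t) from the C0 Gaussian classifier
    clean_stats: dict with 'mu_n', 'var_n', 'mu_s', 'var_s' (reference, shape (D,))
    noisy_stats: dict with the same keys, estimated on this utterance
    Diagonal covariances are assumed, so Sigma^{1/2} is an element-wise sqrt.
    """
    post_s = 1.0 - post_n

    def linmap(cls):  # per-class linear map from the noisy to the clean space
        mu_x, var_x = clean_stats['mu_' + cls], clean_stats['var_' + cls]
        mu_y, var_y = noisy_stats['mu_' + cls], noisy_stats['var_' + cls]
        return mu_x + np.sqrt(var_x / var_y) * (y - mu_y)

    x_n = linmap('n')          # transformation for the non-speech class
    x_s = linmap('s')          # transformation for the speech class
    # soft interpolation by the frame posteriors
    return post_n[:, None] * x_n + post_s[:, None] * x_s
```

A hard VAD decision corresponds to posteriors forced to 0 or 1, which is exactly the discontinuity the soft weighting avoids.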

Page 10: ICASSP 2006 Robustness Techniques Survey


Two-class parametric equalization (cont.)

• Training the two-class Gaussian classifier (a sketch of this procedure follows below)
– Initially, frames with C0 below the mean value are assigned to the non-speech class and those with C0 above the mean are assigned to the speech class

– The EM algorithm is then iterated until convergence (usually, 10 iterations are enough) to obtain the final classifier

• This classifier is used to obtain the class probabilities P(n|y) and P(s|y) and also the mean vectors and covariance matrices μ_{n,y}, Σ_{n,y}, μ_{s,y} and Σ_{s,y} of the non-speech and speech classes for the given noisy input utterance
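A rough sketch of the frame classifier described above, using the standard scalar-Gaussian EM updates with the mean-threshold initialization; variable names and the small variance floor are illustrative, not from the paper.

```python
import numpy as np

def train_c0_classifier(c0, n_iter=10):
    """Two-class Gaussian classifier on C0 (illustrative sketch).

    c0: (T,) zeroth cepstral coefficient of the noisy utterance
    Returns P(non-speech | frame) for every frame.
    Initialization: frames with C0 below the mean -> non-speech, above -> speech.
    """
    labels = (c0 >= c0.mean()).astype(float)          # 1 = speech, 0 = non-speech
    w = np.array([1 - labels.mean(), labels.mean()])  # class priors
    mu = np.array([c0[labels == 0].mean(), c0[labels == 1].mean()])
    var = np.array([c0[labels == 0].var() + 1e-6, c0[labels == 1].var() + 1e-6])

    for _ in range(n_iter):                           # EM iterations
        # E-step: class posteriors under the two scalar Gaussians
        lik = w * np.exp(-0.5 * (c0[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and variances
        nk = post.sum(axis=0)
        w = nk / len(c0)
        mu = (post * c0[:, None]).sum(axis=0) / nk
        var = (post * (c0[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return post[:, 0]   # P(non-speech | frame)
```

The per-class noisy statistics μ_{n,y}, Σ_{n,y}, μ_{s,y}, Σ_{s,y} of the full feature vectors can then be computed as posterior-weighted means and covariances over the utterance.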

Page 11: ICASSP 2006 Robustness Techniques Survey


Two-class parametric equalization (cont.)

The two-Gaussian model for the C0 and C1 cepstral coefficients (used as reference model), along with the histograms of the speech and non-speech frames for a set of clean utterances

Page 12: ICASSP 2006 Robustness Techniques Survey


Experimental Results

• The proposed parametric equalization algorithm has been tested on the AURORA4 (WSJ0) database
– The recognition system used in all cases is based on continuous cross-word triphone models with 3 tied states and a mixture of 6 Gaussians per state

– The language model is the standard bigram for the WSJ0 task
– A feature vector of 13 cepstral coefficients is used as the basic parameterization of the speech signal, using C0 instead of the logarithmic energy

– The baseline reference system (BASE) uses sentence-by-sentence subtraction of the mean values of each cepstral coefficient (CMS)

– The parameters of the reference distribution have been obtained by averaging over the whole clean training set of utterances

Page 13: ICASSP 2006 Robustness Techniques Survey


Experimental Results

• First row (BASE) corresponds to the baseline system which is based on a simple CMS linear normalization technique.

• The second row (HEQ) shows the word error rates when using a standard quantile-based implementation of HEQ– relative word error reduction of 17.8%

• The performance of HEQ is clearly improved by PEQ as shown in the third row, with a relative word error reduction of 30.8%.

• This result is very close to the one obtained for the AFE, which yields a 31.4% reduction of the word error rate

• Moreover, PEQ outperforms AFE in half of the tests (i.e. 02, 06, 08, 09, 10, 11 and 13).

Page 14: ICASSP 2006 Robustness Techniques Survey


Conclusions and Future Work

• The transformation is based on a nonlinear interpolation of two independent linear transformations
– The linear transformations are obtained using a simple Gaussian model for the speech and non-speech feature classes

• The technique was evaluated on a complex continuous speech recognition task, showing competitive performance against linear and nonlinear feature equalization techniques like CMS and HEQ

• A study of the influence of within-class cross-correlations is currently under development

Page 15: ICASSP 2006 Robustness Techniques Survey

MODEL-BASED WIENER FILTER FOR NOISE ROBUST SPEECH RECOGNITION

Takayuki Arakawa, Masanori Tsujikawa and Ryosuke Isotani
Media and Information Research Laboratories, NEC Corporation, Japan

[email protected], [email protected], [email protected]

Page 16: ICASSP 2006 Robustness Techniques Survey


Introduction

• Various kinds of background noise exist in the real world
– Therefore robustness against various kinds of noise is quite important

• Several approaches have been proposed to deal with this issue
– Signal-processing-based spectral enhancement
• Spectral Subtraction (SS), Wiener Filter (WF)
• Low computational cost, but requires considerable tuning depending on the kind of noise and the signal-to-noise ratio (SNR)
– Statistical-model-based noise adaptation
• The acoustic model, i.e., a hidden Markov model (HMM), is adapted to the noisy environment
• It requires a huge computational cost to adapt the distributions to a noisy environment

Page 17: ICASSP 2006 Robustness Techniques Survey


Introduction (cont.)

– Statistical-model-based compensation
• Using a Gaussian mixture model (GMM)
• The computational cost is still much higher than that of signal-processing-based spectral enhancement

• In this paper, they propose the Model-Based Wiener filter (MBW)

Concept

Page 18: ICASSP 2006 Robustness Techniques Survey


Proposed Method (Cont.)

• A GMM with K Gaussian distributions is used as knowledge of clean speech in the cepstrum domain

MBW algorithm

$$P(C) = \sum_{k=1}^{K} P(k)\,P(C \mid k)$$

$$C = \mathrm{DCT}(\log S)$$
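For illustration only, a short sketch of how such a clean-speech cepstral GMM might be trained offline with scikit-learn; the number of mixtures, file name, and diagonal-covariance choice are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# clean_ceps: (N_frames, 13) cepstra (C0..C12) from the clean training set
clean_ceps = np.load("clean_mfcc.npy")          # hypothetical file name

# K diagonal-covariance Gaussians modelling P(C) = sum_k P(k) P(C | k)
gmm = GaussianMixture(n_components=256, covariance_type="diag",
                      max_iter=50, random_state=0)
gmm.fit(clean_ceps)

# gmm.weights_ -> P(k), gmm.means_ -> mu_k, gmm.covariances_ -> Sigma_k
```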

Page 19: ICASSP 2006 Robustness Techniques Survey


Proposed Method (Cont.)

• The noisy speech signal X(t) is modeled as

$$X(t) = S(t) + N(t) \qquad \text{(noisy speech = clean speech + noise)}$$

• Step 1: Perform Spectral Subtraction (SS) in the spectrum domain and convert the result to the cepstrum domain

$$\hat{S}(t) = \max\bigl(X(t) - \hat{N}(t),\; \alpha X(t)\bigr) \qquad (\hat{S}(t)\text{: temporary clean speech, } \hat{N}(t)\text{: estimated noise, } \alpha\text{: flooring parameter})$$

$$\hat{C}(t) = \mathrm{DCT}\bigl(\log \hat{S}(t)\bigr)$$

• Step 2: Derive the expected value of the clean speech (MMSE estimation)

$$\bar{S}(t) = \exp\Bigl(\sum_{k=1}^{K} P\bigl(k \mid \hat{C}(t)\bigr)\,\log S_k\Bigr), \qquad \log S_k = \mathrm{IDCT}(\mu_k)$$

$$P\bigl(k \mid \hat{C}(t)\bigr) = \frac{P(k)\,P\bigl(\hat{C}(t) \mid k\bigr)}{\sum_{k'=1}^{K} P(k')\,P\bigl(\hat{C}(t) \mid k'\bigr)}$$
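A simplified NumPy/SciPy sketch of Steps 1-2 for a single frame. It assumes, for simplicity, that the GMM means have the same dimensionality as the (mel) spectrum so that log S_k = IDCT(μ_k) applies directly, and that the GMM has diagonal covariances; all names are illustrative rather than the authors' code.

```python
import numpy as np
from scipy.fftpack import dct, idct

def mbw_steps_1_2(X, N_hat, gmm_weights, gmm_means, gmm_vars, alpha=0.1):
    """Sketch of MBW Steps 1-2 for one frame (illustrative assumptions).

    X, N_hat:   (B,) noisy spectrum and noise estimate for the frame
    gmm_*:      weights (K,), means (K, B), variances (K, B) of the clean GMM
    Returns the GMM-compensated clean-speech spectrum estimate S_bar.
    """
    # Step 1: spectral subtraction with flooring parameter alpha
    S_hat = np.maximum(X - N_hat, alpha * X)
    C_hat = dct(np.log(S_hat), norm='ortho')            # temporary cepstrum

    # Step 2: posteriors P(k | C_hat) under the clean-speech GMM
    diff2 = (C_hat - gmm_means) ** 2 / gmm_vars          # (K, B) via broadcasting
    log_lik = -0.5 * (diff2 + np.log(2 * np.pi * gmm_vars)).sum(axis=1)
    log_post = np.log(gmm_weights) + log_lik
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    # MMSE estimate: mix the clean log-spectra implied by each mixture mean
    log_S_k = idct(gmm_means, norm='ortho', axis=1)      # log S_k = IDCT(mu_k)
    S_bar = np.exp(post @ log_S_k)
    return S_bar
```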

Page 20: ICASSP 2006 Robustness Techniques Survey


Proposed Method (Cont.)

• Step 3: Calculate the Wiener gain

$$\lambda(t) = \beta\,\lambda(t-1) + (1-\beta)\,\frac{\bar{S}(t)}{\hat{N}(t)}, \qquad W(t) = \frac{\lambda(t)}{1 + \lambda(t)} \qquad (\beta\text{: smoothing parameter})$$

• Step 4: Get the final estimated clean speech

$$\tilde{S}(t) = W(t)\,X(t)$$
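A sketch of Steps 3-4 following the reconstruction of the slide's equations above (the recursive SNR-smoothing form is an assumption); inputs and names are illustrative.

```python
import numpy as np

def mbw_steps_3_4(X_frames, N_hat_frames, S_bar_frames, beta=0.98):
    """Sketch of MBW Steps 3-4 over an utterance (illustrative assumptions).

    X_frames, N_hat_frames, S_bar_frames: (T, B) noisy spectra, noise
    estimates, and GMM-compensated clean-speech estimates per frame.
    Returns the Wiener-filtered spectrum S_tilde.
    """
    T, B = X_frames.shape
    S_tilde = np.empty_like(X_frames)
    lam = np.zeros(B)                                  # smoothed SNR lambda(t)
    for t in range(T):
        snr = S_bar_frames[t] / np.maximum(N_hat_frames[t], 1e-10)
        lam = beta * lam + (1.0 - beta) * snr          # smoothing parameter beta
        W = lam / (1.0 + lam)                          # Wiener gain
        S_tilde[t] = W * X_frames[t]                   # final clean-speech estimate
    return S_tilde
```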

Page 21: ICASSP 2006 Robustness Techniques Survey


Experiments and Results

• Experimental conditions
– The Mel-frequency cepstral coefficients (MFCC) and their 1st and 2nd derivatives are used as the speech features (including C0)

– The feature vector for the GMM is composed of the 13-dimensional MFCCs only

– The flooring parameter α is set to 0.1, and the smoothing parameter β is set to 0.98

• The MBW method was tested on the Aurora2-J task
– It contains utterances (in Japanese) of consecutive digit strings recorded in clean environments
– The other conditions are the same as for Aurora2

Page 22: ICASSP 2006 Robustness Techniques Survey


Experiments and Results (cont.)

• The performance for different numbers of GMM mixture components (word accuracy, 5 dB restaurant noise)

– Performance saturates at 128 or 256 mixtures

Page 23: ICASSP 2006 Robustness Techniques Survey


Experiments and Results (cont.)

• The word accuracy for each SNR is almost equivalent to that of the AFE

Page 24: ICASSP 2006 Robustness Techniques Survey


Experiments and Results (cont.)

• The word accuracy over SNR for each kind of noise

These results show that the proposed method is much more robust than the AFE against various kinds of noise.

Page 25: ICASSP 2006 Robustness Techniques Survey


Conclusions

• Review of the MBW algorithm
– Roughly estimates the clean speech signal using SS
– Compensates it using a GMM to improve robustness against non-stationary noise
– The compensated speech signal is used to calculate the Wiener gain
– Wiener filtering is then performed

• The results show that the proposed method performs as well as the ETSI AFE

• These results demonstrate that the proposed method is robust against various kinds of noise