Introduction | Feature Compensation based on Stereo Data | Feature Compensation based on a Masking Model | Temporal Modelling and Uncertainty Decoding | Conclusions
Noise Robust Speech Recognition of Missing or Uncertain Data
Jose Andres Gonzalez Lopez
Advisors: Dr. Antonio M. Peinado Herreros, Dr. Angel M. Gomez García
Dpt. Signal Theory, Telecommunications and Networking, University of Granada
Ph.D. Defence, February 25th, 2013
1 / 49 Jose A. Gonzalez Noise Robust Speech Recognition of Missing or Uncertain Data
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
Robust ASR
The performance of ASR (Automatic Speech Recognition) systems degrades when training and testing conditions differ.
This mismatch can be due to different factors:
  - Language complexity: grammar, vocabulary, spontaneous speech, ...
  - Speaker variability: accent, age, gender, ...
  - Environmental factors: background noise, channel distortion, room acoustics, ...
In this work, we will focus on the environmental factors, especially on the background noise and the channel distortion.
Effect of noise on speech: noise modifies the speech distributions and causes loss of information.
Approaches for Noise Robustness
Different approaches to achieve noise robustness: robust feature extraction, model adaptation and feature modification.
Feature compensation enhances the noisy features used for speech recognition.
$\mathbf{y}_t$ and $\hat{\mathbf{x}}_t$ are, respectively, the feature vectors for noisy speech and estimated clean speech at time $t$.
Uncertainty: information about the reliability of $\hat{\mathbf{x}}_t$.
Objectives
Development of a set of compensation techniques for speech feature enhancement.
To do this, a Bayesian estimation framework is adopted here.
Two different approaches for estimating clean speech will be explored:
  - Feature compensation based on stereo data: clean and noisy recordings are used to derive a set of transformations applied to noisy speech.
  - Feature compensation based on a masking model: parametric models of speech degradation are used to estimate clean speech.
Finally, an uncertainty decoding approach and temporal modelling of speech will also be investigated.
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
Introduction
Stereo data: simultaneous recordings of clean and noisy speech signals,
\[ (\mathbf{X}, \mathbf{Y}) = \left( \langle \mathbf{x}_1, \mathbf{y}_1 \rangle, \langle \mathbf{x}_2, \mathbf{y}_2 \rangle, \ldots, \langle \mathbf{x}_T, \mathbf{y}_T \rangle \right) \]
The stereo data is used to learn the statistical relationship between the clean and noisy feature spaces.
As a result, a set of transformations is derived to enhance speech in a certain acoustic environment.
Acoustic environment: combination of additive and convolutional noises at a given SNR.
MMSE Estimation (I)
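MMSE estimation is chosen to obtain suitable estimates for the clean feature vectors,
\[ \hat{\mathbf{x}} = E[\mathbf{x} \mid \mathbf{y}] = \int \mathbf{x} \, p(\mathbf{x} \mid \mathbf{y}) \, d\mathbf{x} \]
Problem: $p(\mathbf{x} \mid \mathbf{y})$ must be expressed in a convenient form.
Solution: clean and noisy feature spaces are represented by VQ codebooks $\mathcal{M}_x$ and $\mathcal{M}_y$, respectively.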
MMSE Estimation (II)
Using these codebooks, the MMSE estimation can be expressed as
\[ \hat{\mathbf{x}} = \sum_{k_x=1}^{M_x} P(k_x \mid k_y^*) \, \hat{\mathbf{x}}^{(k_x)} \]
  - $P(k_x \mid k_y^*)$: mapping between the clean and noisy cells for a certain environment. Estimated using stereo data.
  - $\hat{\mathbf{x}}^{(k_x)} = E[\mathbf{x} \mid \mathbf{y}, k_x, k_y^*]$: 3 alternatives (Q-VQMMSE, S-VQMMSE and W-VQMMSE).
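A minimal sketch of how the cell mapping could be learned and applied, in plain Python. It assumes that $k_y^*$ is the noisy cell nearest to $\mathbf{y}$, that $P(k_x \mid k_y)$ is estimated by counting cell co-occurrences over the stereo pairs, and that the clean centroid is used as the partial estimate. The function names and the toy one-dimensional codebooks are hypothetical, not taken from the thesis.

```python
from collections import Counter

def nearest(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - a) ** 2 for c, a in zip(codebook[k], v)))

def train_mapping(stereo_pairs, cb_x, cb_y):
    """Estimate P(kx | ky) by counting cell co-occurrences in stereo data."""
    counts = Counter((nearest(cb_y, y), nearest(cb_x, x)) for x, y in stereo_pairs)
    totals = Counter()
    for (ky, _kx), c in counts.items():
        totals[ky] += c
    return {(ky, kx): c / totals[ky] for (ky, kx), c in counts.items()}

def vq_mmse(y, cb_x, cb_y, mapping):
    """x_hat = sum over kx of P(kx | ky*) times the clean centroid of cell kx."""
    ky_star = nearest(cb_y, y)
    est = [0.0] * len(cb_x[0])
    for (ky, kx), p in mapping.items():
        if ky == ky_star:
            est = [e + p * c for e, c in zip(est, cb_x[kx])]
    return est  # stays all-zero if ky* was never seen in training

# Toy data: (clean, noisy) pairs and two-cell codebooks (illustrative values).
cb_x = [[0.0], [10.0]]
cb_y = [[2.0], [12.0]]
stereo = [([0.0], [2.0]), ([0.5], [2.5]), ([10.0], [3.0]), ([10.0], [12.0])]
mapping = train_mapping(stereo, cb_x, cb_y)
x_hat = vq_mmse([2.2], cb_x, cb_y, mapping)
```

In this toy case the noisy cell 0 maps to clean cell 0 with probability 2/3 and to clean cell 1 with probability 1/3, so the estimate is a weighted average of the two clean centroids.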
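Computation of $\hat{\mathbf{x}}^{(k_x)}$
Q-VQMMSE
  - Assumes that both spaces are quantized.
  - Also, this approach assumes that the spaces are independent.
  - Then, $\hat{\mathbf{x}}^{(k_x)} = \boldsymbol{\mu}_x^{(k_x)}$.
S-VQMMSE
  - A correction is applied to $\mathbf{y}$,
\[ \hat{\mathbf{x}}^{(k_x)} = \mathbf{y} - \left( \boldsymbol{\mu}_y^{(k_y^*)} - \boldsymbol{\mu}_x^{(k_x)} \right) = \boldsymbol{\mu}_x^{(k_x)} + \underbrace{\left( \mathbf{y} - \boldsymbol{\mu}_y^{(k_y^*)} \right)}_{\Delta} \]
  - $\Delta$: quantization error.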
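The two partial estimates above can be compared on a small numeric example. This is only an illustrative sketch: the function names and the centroid values are hypothetical.

```python
def q_estimate(mu_x):
    """Q-VQMMSE partial estimate: the clean centroid itself."""
    return list(mu_x)

def s_estimate(y, mu_x, mu_y):
    """S-VQMMSE partial estimate: clean centroid plus the quantization error y - mu_y."""
    return [mx + (yi - my) for yi, mx, my in zip(y, mu_x, mu_y)]

y = [3.0, 1.0]      # noisy observation
mu_y = [2.5, 1.5]   # centroid of the noisy cell ky*
mu_x = [1.0, 0.0]   # centroid of the clean cell kx

x_q = q_estimate(mu_x)
x_s = s_estimate(y, mu_x, mu_y)
# Equivalent form from the slide: y minus the centroid shift (mu_y - mu_x).
x_s_alt = [yi - (my - mx) for yi, mx, my in zip(y, mu_x, mu_y)]
```

Unlike the Q variant, the S variant keeps the detail of the observation that quantization discards, which is consistent with the accuracy gap between the two in the results tables.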
Improving the Mapping Accuracy
Subregion modelling
$C_y^{(k_x,k_y)}$ is the subset of the noisy cell $k_y$ whose corresponding clean vectors belong to $k_x$.
Similarly, $C_x^{(k_x,k_y)}$ is the subset of $k_x$ whose corresponding noisy vectors are $C_y^{(k_x,k_y)}$.
Whitening-transformation based VQMMSE
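W-VQMMSE assumes that the subregions of both feature spaces are Gaussian distributed, e.g.
\[ C_x^{(k_x,k_y)} \sim \mathcal{N}\left( \boldsymbol{\mu}_x^{(k_x,k_y)}, \boldsymbol{\Sigma}_x^{(k_x,k_y)} \right) \]
Computation of $E[\mathbf{x} \mid \mathbf{y}, k_x, k_y]$: the following whitening transformation is applied
\[ E[\mathbf{x} \mid \mathbf{y}, k_x, k_y] = \boldsymbol{\mu}_x^{(k_x,k_y)} + \left( \boldsymbol{\Sigma}_x^{(k_x,k_y)} \right)^{1/2} \left( \boldsymbol{\Sigma}_y^{(k_x,k_y)} \right)^{-1/2} \left( \mathbf{y} - \boldsymbol{\mu}_y^{(k_x,k_y)} \right) \]
After some manipulations the MMSE estimation becomes
\[ \hat{\mathbf{x}} = \mathbf{A}^{(k_y^*)} \mathbf{y} + \mathbf{b}^{(k_y^*)} \]
where the parameters of the affine transformation can be precomputed offline for each noisy cell $k_y = 1, \ldots, M_y$.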
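A sketch of how the affine parameters could be precomputed for one noisy cell. It assumes diagonal covariances (in the spirit of the dW variant), so each matrix square root reduces to a per-dimension square root. Each subregion gives an affine map in $\mathbf{y}$, and weighting these maps by $P(k_x \mid k_y)$ yields the single pair $(\mathbf{A}, \mathbf{b})$ of the cell. All names and numbers are hypothetical.

```python
import math

def subregion_affine(mu_x, var_x, mu_y, var_y):
    """Diagonal whitening map: a = sqrt(var_x / var_y), b = mu_x - a * mu_y."""
    a = [math.sqrt(vx / vy) for vx, vy in zip(var_x, var_y)]
    b = [mx - ai * my for mx, ai, my in zip(mu_x, a, mu_y)]
    return a, b

def cell_affine(subregions):
    """Combine the subregions of one noisy cell, weighted by P(kx | ky)."""
    dim = len(subregions[0][1])
    A = [0.0] * dim
    B = [0.0] * dim
    for p, mu_x, var_x, mu_y, var_y in subregions:
        a, b = subregion_affine(mu_x, var_x, mu_y, var_y)
        A = [Ai + p * ai for Ai, ai in zip(A, a)]
        B = [Bi + p * bi for Bi, bi in zip(B, b)]
    return A, B

def apply_affine(A, B, y):
    """x_hat = A * y + b, element-wise because A is diagonal here."""
    return [a * yi + b for a, b, yi in zip(A, B, y)]

# One noisy cell with two subregions: (P(kx|ky), mu_x, var_x, mu_y, var_y).
subregions = [
    (0.75, [1.0], [4.0], [3.0], [1.0]),
    (0.25, [5.0], [1.0], [6.0], [4.0]),
]
A, B = cell_affine(subregions)
x_hat = apply_affine(A, B, [4.0])
```

At run time only the quantization of $\mathbf{y}$ and one affine map are needed, which is where the efficiency of the approach comes from.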
Experimental Setup
Recognition task: based on the Aurora2 noisy digits database.
Acoustic environments: 9 noises at 7 SNRs (clean, 20, 15, 10, 5, 0, and -5 dB).
Speech features: ETSI FE Standard (13 MFCCs + $\Delta$ + $\Delta^2$).
Front-end speech models: codebooks with 256 components.
SPLICE and MEMLIN are also evaluated (i.e. GMM-based MMSE estimation).
A priori knowledge on the acoustic environment is assumed.
FE Results
Word accuracy (%). Avg. is the mean over 20 dB to 0 dB.

System       Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB   Avg.
Baseline     99.02  90.79  75.53  50.70  25.86  11.27   6.18  50.83
Matched      99.02  98.66  98.29  97.02  92.16  75.78  34.88  92.38
SPLICE       99.02  98.09  95.87  88.88  70.62  39.04  15.99  78.50
MEMLIN       99.02  98.36  97.01  92.43  78.26  47.03  18.76  82.62
Q-VQMMSE     96.19  93.72  90.21  81.24  61.82  31.33  14.39  71.66
S-VQMMSE     99.02  97.93  96.28  90.57  74.70  43.02  18.57  80.50
iW-VQMMSE    99.02  98.23  96.79  91.60  76.82  46.60  20.02  82.01
dW-VQMMSE    99.02  98.33  97.06  92.43  78.70  48.88  20.26  83.08
fW-VQMMSE    99.02  98.37  97.15  92.88  79.61  50.04  20.89  83.61

Matched: HMMs trained under the same conditions as in testing.
iW-, dW-, fW-: identity, diagonal and full covariance matrices.
MEMLIN and iW-VQMMSE behave almost identically, but our proposal is more efficient.
When the dynamic features are also processed, MEMLIN and fW-VQMMSE achieve similar results: 87.67 % vs. 87.31 %.
AFE Results
Word accuracy (%). Avg. is the mean over 20 dB to 0 dB.

System       Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB   Avg.
Baseline     99.02  90.79  75.53  50.70  25.86  11.27   6.18  50.83
AFE          99.22  98.24  96.95  93.68  84.37  62.46  29.53  87.14
Matched      99.02  98.66  98.29  97.02  92.16  75.78  34.88  92.38
Q-VQMMSE     95.60  93.56  91.28  85.25  70.23  39.20  12.84  75.90
S-VQMMSE     99.22  98.32  97.39  94.71  86.30  63.07  27.46  87.96
iW-VQMMSE    99.22  98.61  97.93  95.89  89.19  69.46  32.62  90.22
dW-VQMMSE    99.22  98.70  98.05  96.19  89.93  71.47  34.94  90.87
fW-VQMMSE    99.22  98.65  97.99  96.10  89.92  72.29  36.57  90.99

AFE: ETSI Advanced Front-End.
The proposed techniques are applied to the features extracted by AFE.
The combined systems AFE+VQMMSE increase the robustness of AFE against noise.
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
Introduction
Speech degradation model: an analytical model that relates $\mathbf{y}$ with $\mathbf{x}$ and $\mathbf{n}$ (the additive noise vector).
Model-based compensation: the degradation model is used to derive the MMSE estimator.
  ✓ No stereo data is required.
  ✓ Thus, unknown distortions can be mitigated.
  × MMSE estimation only tackles the distortions considered in the degradation model, e.g. additive and convolutional noises.
  × Noise needs to be estimated.
We will only consider the robustness to additive noise here.
Speech Masking Model
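In the log-Mel domain, the degradation model is approximated by
\[ \mathbf{y} = \log\left( e^{\mathbf{x}} + e^{\mathbf{n}} \right) \]
This model can be rewritten as
\[ \mathbf{y} = \max(\mathbf{x}, \mathbf{n}) + \varepsilon(\mathbf{x}, \mathbf{n}) \]
Disregarding $\varepsilon(\mathbf{x}, \mathbf{n})$, the speech masking model is
\[ \mathbf{y} \approx \max(\mathbf{x}, \mathbf{n}) \]
[Figure: Distribution of $\varepsilon(x, n)$ in Aurora2 (horizontal axis: $\varepsilon(x, n)$; vertical axis: probability)]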
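For the approximate model $y = \log(e^x + e^n)$ the residual has a closed form, $\varepsilon(x, n) = \log(1 + e^{-|x - n|})$, which lies between 0 and $\log 2$ and vanishes quickly as the gap between speech and noise grows. The sketch below checks this numerically for single log-Mel channels. Function names are hypothetical, and the empirical distribution in the figure is measured on real data, so it need not follow this identity exactly.

```python
import math

def log_sum(x, n):
    """log(exp(x) + exp(n)), computed in a numerically stable way."""
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))

def residual(x, n):
    """epsilon(x, n): what the masking model y = max(x, n) leaves out."""
    return log_sum(x, n) - max(x, n)

eps_equal = residual(5.0, 5.0)    # worst case: speech and noise equally strong
eps_far = residual(10.0, 0.0)     # speech dominates by 10 (natural log units)
```

The worst case is an error of $\log 2 \approx 0.69$ when both energies are equal, which supports disregarding $\varepsilon$ in most time-frequency bins.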
Spectral Reconstruction: Problems
According to the speech masking model, the observation can be rearranged into $\mathbf{y} = (\mathbf{y}_r, \mathbf{y}_u)$.
  - Reliable features ($\mathbf{x}_r \approx \mathbf{y}_r$), i.e. speech is dominant.
  - Unreliable features ($-\infty \le \mathbf{x}_u \le \mathbf{y}_u$): speech is masked by noise.
Thus, feature compensation can be reformulated as two different problems:
  1 Segregation of the noisy spectra into speech and noise. This yields a mask where the reliable and unreliable features are identified.
  2 Spectral reconstruction, i.e. estimating the speech energy in the unreliable features.
Two alternative techniques are proposed here:
  - TGI only deals with problem 2.
  - MMSR addresses both 1 & 2.
Truncated-Gaussian based Imputation
TGI estimates the speech energy in the unreliable regions of the observed spectrogram.
To do this, the correlation between features is exploited.
Prerequisites: the segregation binary mask is known in advance.
After spectral reconstruction, MFCC features can be computed and used for recognition.
MMSE Estimation of the Unreliable Features
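MMSE estimation is used again to reconstruct the unreliable features,
\[ \hat{\mathbf{x}}_u = E[\mathbf{x}_u \mid \mathbf{x}_r = \mathbf{y}_r, -\infty \le \mathbf{x}_u \le \mathbf{y}_u] \]
Speech model: $p(\mathbf{x})$ is modelled as a Gaussian Mixture Model (GMM),
\[ p(\mathbf{x}) = \sum_{k=1}^{M} P(k) \, \mathcal{N}\left( \mathbf{x}; \boldsymbol{\mu}^{(k)}, \boldsymbol{\Sigma}^{(k)} \right) \]
Applying this model, the MMSE estimation is given by
\[ \hat{\mathbf{x}}_u = \sum_{k=1}^{M} P(k \mid \mathbf{y}_r, \mathbf{y}_u) \, \hat{\mathbf{x}}_u^{(k)} \]
Problem: computation of $P(k \mid \mathbf{y}_r, \mathbf{y}_u)$ and $\hat{\mathbf{x}}_u^{(k)}$.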
Posterior Computation
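After applying Bayes' rule, the posterior can be expressed as
\[ P(k \mid \mathbf{y}_r, \mathbf{y}_u) = \frac{p(\mathbf{y}_r, \mathbf{y}_u \mid k) \, P(k)}{\sum_{k'=1}^{M} p(\mathbf{y}_r, \mathbf{y}_u \mid k') \, P(k')} \]
$p(\mathbf{y}_r, \mathbf{y}_u \mid k)$ is factorized as the following product,
\[ p(\mathbf{y}_r, \mathbf{y}_u \mid k) = p(\mathbf{y}_r \mid k) \int_{-\infty}^{\mathbf{y}_u} p(\mathbf{x}_u \mid \mathbf{y}_r, k) \, d\mathbf{x}_u \]
  - $p(\mathbf{y}_r \mid k) = \mathcal{N}(\mathbf{y}_r; \boldsymbol{\mu}_r^{(k)}, \boldsymbol{\Sigma}_r^{(k)})$: marginal PDF of the reliable features.
  - $p(\mathbf{x}_u \mid \mathbf{y}_r, k) = \mathcal{N}(\mathbf{x}_u; \boldsymbol{\mu}_{u|r}^{(k)}, \boldsymbol{\Sigma}_{u|r}^{(k)})$: conditional PDF of the unreliable features given the reliable ones.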
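A sketch of the posterior computation under a simplifying assumption that is not stated on the slide: diagonal covariances. In that case the conditional PDF of each unreliable feature equals its marginal, and the integral factorizes into a product of Gaussian CDFs evaluated at $y_u$. Names and toy parameters are hypothetical. A practical implementation would work in the log domain to avoid underflow.

```python
import math

def norm_pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def posteriors(y, reliable, priors, means, variances):
    """P(k | y_r, y_u) for a diagonal-covariance GMM and a binary mask."""
    scores = []
    for pk, mu, var in zip(priors, means, variances):
        lik = pk
        for yi, rel, m, v in zip(y, reliable, mu, var):
            # Reliable feature: density at y. Unreliable: probability mass below y.
            lik *= norm_pdf(yi, m, v) if rel else norm_cdf(yi, m, v)
        scores.append(lik)
    total = sum(scores)
    return [s / total for s in scores]

priors = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
y = [4.0, 1.0]
mask = [True, False]   # first feature reliable, second masked by noise
post = posteriors(y, mask, priors, means, variances)
```

Here the reliable feature strongly favours the second component, while the unreliable one only contributes an upper bound, so the second component still receives most of the posterior mass.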
Partial Estimates
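According to the speech masking model $\mathbf{x}_u \in (-\infty, \mathbf{y}_u]$. Thus,
\[ \hat{\mathbf{x}}_u^{(k)} = \int_{-\infty}^{\mathbf{y}_u} \mathbf{x}_u \, p(\mathbf{x}_u \mid \mathbf{y}_r, k) \, d\mathbf{x}_u \]
Independence is assumed to solve the integral.
The partial estimate $\hat{\mathbf{x}}_u^{(k)} = \boldsymbol{\mu}^{(k)}(\mathbf{y})$ corresponds to the mean of a right-truncated Gaussian PDF.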
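For a scalar Gaussian the mean of the right-truncated PDF has a well-known closed form, $E[x \mid x \le y] = \mu - \sigma \, \phi(\alpha) / \Phi(\alpha)$ with $\alpha = (y - \mu)/\sigma$, i.e. the integral above normalized by the probability mass below $y$. A minimal sketch with a hypothetical function name:

```python
import math

def truncated_mean(y, mu, var):
    """Mean of N(mu, var) restricted to the interval (-inf, y]."""
    sigma = math.sqrt(var)
    alpha = (y - mu) / sigma
    pdf = math.exp(-0.5 * alpha ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))
    # Note: cdf underflows to 0 for extremely negative alpha; guard in real use.
    return mu - sigma * pdf / cdf

at_mean = truncated_mean(0.0, 0.0, 1.0)     # bound exactly at the mean
far_above = truncated_mean(10.0, 0.0, 1.0)  # bound barely constrains the PDF
below = truncated_mean(-1.0, 0.0, 1.0)      # bound below the mean
```

The estimate never exceeds the observed bound, and it tends to the unconstrained mean when the bound is far above it.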
Example
[Figure: four log-Mel spectrograms (Mel channel vs. time in s) of the utterance "eight six zero one one six two": clean, noisy (0 dB), oracle mask, and TGI reconstruction]
Experimental Setup
Databases: Aurora2 & Aurora4.
The 3 test sets (A, B and C) of Aurora2 are considered.
Aurora4: 5000-word recognition task based on the Wall Street Journal corpus. Two testing conditions:
  - Test 01-07 includes utterances with artificially added acoustic noise (random SNR between 10 dB and 20 dB).
  - Test 08-14: acoustic noise + different microphones.
TGI is evaluated using either oracle (OR) or estimated (EST) binary masks.
Noise estimation: linear interpolation of the first and last frames of the utterance.
Front-end speech model: GMM with 256 components.
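The noise estimation step can be sketched as follows. The slide only says that the first and last frames are linearly interpolated. The number of edge frames averaged (`n_edge`) and the function name are assumptions made for illustration.

```python
def interpolate_noise(frames, n_edge=10):
    """Per-frame noise estimate from the utterance edges.

    frames: list of log-Mel feature vectors (one list of floats per frame).
    n_edge: number of leading/trailing frames assumed to contain only noise
            (hypothetical default, not taken from the slides).
    """
    T = len(frames)
    n_edge = min(n_edge, T)
    dim = len(frames[0])
    head = [sum(f[d] for f in frames[:n_edge]) / n_edge for d in range(dim)]
    tail = [sum(f[d] for f in frames[-n_edge:]) / n_edge for d in range(dim)]
    if T == 1:
        return [head]
    # Straight line from the head average (t = 0) to the tail average (t = T - 1).
    return [[head[d] + (tail[d] - head[d]) * t / (T - 1) for d in range(dim)]
            for t in range(T)]

frames = [[1.0], [1.0], [9.0], [3.0], [3.0]]   # toy utterance, one Mel channel
noise = interpolate_noise(frames, n_edge=2)
```

An estimated binary mask could then be obtained by comparing each observed feature with this noise estimate, although the decision rule used in the experiments is not given on the slide.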
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, CBR-OR, TGI-OR, CBR-EST and TGI-EST]
CBR: Cluster-Based Reconstruction (Raj et al., 2004).
TGI outperforms CBR when oracle masks are used.
The difference is small when the masks are estimated.
Large margin for improvement between OR and EST ⇒ a more robust approach for speech/noise segregation is required.
Masking-Model based Spectral Reconstruction
As we have seen, TGI achieves excellent results when oracle masks are used.
However, its performance diminishes when the masks are estimated ⇒ the noise estimation errors can be magnified by the hard decision implemented by the binary masks.
MMSR uses the noise estimates directly in the MMSE estimation.
Advantages with respect to TGI:
  - No a priori segregation mask is required now.
  - Therefore, the feature reliability and the speech energy in the unreliable regions are jointly estimated.
MMSR: Diagram
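$\mathcal{M}_x$: GMM with $M_x$ Gaussians.
$\mathcal{M}_n$: GMM with $M_n$ Gaussians (alternatively a noise estimate $\mathbf{n}_t \sim \mathcal{N}_n(\hat{\mathbf{n}}_t, \boldsymbol{\Sigma}_{n,t})$ for each frame).
MMSE estimation:
\[ \hat{\mathbf{x}} = \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid \mathbf{y}) \, \hat{\mathbf{x}}^{(k_x,k_n)} \]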
Posterior Computation
Applying Bayes' rule, $P(k_x, k_n \mid \mathbf{y}) \propto p(\mathbf{y} \mid k_x, k_n) \, P(k_x) \, P(k_n)$.
Independence assumption: $p(\mathbf{y} \mid k_x, k_n)$ is expressed as the product of $p(y \mid k_x, k_n)$ for every observed feature $y$.
According to the masking model, $p(y \mid k_x, k_n)$ is computed as
\[ p(y \mid k_x, k_n) = \underbrace{p(x = y, n \le y \mid k_x, k_n)}_{p_x(y \mid k_x) \, P_n(n \le y \mid k_n)} + \underbrace{p(n = y, x < y \mid k_x, k_n)}_{p_n(y \mid k_n) \, P_x(x < y \mid k_x)} \]
The first term is the probability that speech is dominant; the second is the probability that noise is dominant.
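A scalar sketch of the observation likelihood for one pair of components, assuming Gaussian speech and noise components in a single log-Mel channel. Each term is a density multiplied by a CDF, and their sum is the density of $\max(x, n)$ for independent $x$ and $n$. Names and parameter values are hypothetical.

```python
import math

def norm_pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def masking_likelihood(y, mu_x, var_x, mu_n, var_n):
    """Return the two terms of p(y | kx, kn) under y = max(x, n)."""
    speech_dominant = norm_pdf(y, mu_x, var_x) * norm_cdf(y, mu_n, var_n)
    noise_dominant = norm_pdf(y, mu_n, var_n) * norm_cdf(y, mu_x, var_x)
    return speech_dominant, noise_dominant

# Speech component centred at 2, noise component centred at 0, observation at 2.
speech_term, noise_term = masking_likelihood(2.0, 2.0, 1.0, 0.0, 1.0)
```

Since the two terms together form a proper density in $y$, integrating their sum over $y$ gives 1, which is a convenient sanity check for an implementation.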
Partial Estimates
Contrary to TGI, the reliability of the observed feature $y$ is unknown in MMSR.
Hence, both the reliable and unreliable cases are taken into account,
\[ \hat{x}^{(k_x,k_n)} = w^{(k_x,k_n)} \, y + \left( 1 - w^{(k_x,k_n)} \right) \mu_x^{(k_x)} \]
  - $y$: estimate for high SNRs.
  - $\mu_x^{(k_x)}$: estimate for masked speech (i.e. truncated PDF mean).
  - $w^{(k_x,k_n)} = P(x = y, n < y \mid k_x, k_n)$ is the normalized speech presence probability.
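A scalar sketch of the partial estimate for one pair of components. Two assumptions are made here: the weight $w$ is obtained by dividing the speech-dominant likelihood term by the sum of both terms (one reading of "normalized"), and the masked-speech estimate is the closed-form mean of the speech Gaussian truncated at $y$. Names and values are hypothetical.

```python
import math

def norm_pdf(v, mu, var):
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(v, mu, var):
    return 0.5 * (1.0 + math.erf((v - mu) / math.sqrt(2.0 * var)))

def truncated_mean(y, mu, var):
    """Mean of N(mu, var) restricted to (-inf, y]."""
    return mu - var * norm_pdf(y, mu, var) / norm_cdf(y, mu, var)

def mmsr_partial_estimate(y, mu_x, var_x, mu_n, var_n):
    """Return (x_hat, w) for one (kx, kn) pair in one log-Mel channel."""
    speech = norm_pdf(y, mu_x, var_x) * norm_cdf(y, mu_n, var_n)
    noise = norm_pdf(y, mu_n, var_n) * norm_cdf(y, mu_x, var_x)
    w = speech / (speech + noise)   # assumed normalization of the speech term
    x_hat = w * y + (1.0 - w) * truncated_mean(y, mu_x, var_x)
    return x_hat, w

high_snr = mmsr_partial_estimate(10.0, 10.0, 1.0, 0.0, 1.0)  # speech well above noise
low_snr = mmsr_partial_estimate(5.0, 0.0, 1.0, 5.0, 1.0)     # speech well below noise
```

At high SNR the weight approaches 1 and the observation is kept, while at low SNR it approaches 0 and the estimate falls back to the truncated mean of the speech component.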
MMSR: Mask Estimation
MMSR can also be considered as a robust method for speech segregation.
To see this, we reproduce here the final expression for the MMSE estimator,
\[ \hat{x} = \underbrace{\left[ \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid \mathbf{y}) \, w^{(k_x,k_n)} \right]}_{m} y + \sum_{k_x=1}^{M_x} \sum_{k_n=1}^{M_n} P(k_x, k_n \mid \mathbf{y}) \left( 1 - w^{(k_x,k_n)} \right) \mu_x^{(k_x)} \]
$m \in [0, 1]$ acts as a soft-mask: $m \approx 1$ for the reliable features and $m \approx 0$ for the unreliable ones.
Advantages regarding other methods:
  - Parameter free.
  - Mask estimation is fully integrated within the reconstruction.
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, TGI-EST, MMSR and VTS]
VTS: well-known model-based compensation technique (Moreno, 1996).
MMSR outperforms TGI-EST and is upper-bounded by TGI-OR.
VTS is slightly better than MMSR ⇒ more accurate noise models can reduce the gap.
EM-based Noise Model Estimation
Objective: estimate the noise model used in MMSR.
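Noise model: GMM with $M_n$ Gaussians,
\[ \mathcal{M}_n = \left\{ \left\langle \pi_n^{(1)}, \boldsymbol{\mu}_n^{(1)}, \boldsymbol{\Sigma}_n^{(1)} \right\rangle, \ldots, \left\langle \pi_n^{(M_n)}, \boldsymbol{\mu}_n^{(M_n)}, \boldsymbol{\Sigma}_n^{(M_n)} \right\rangle \right\} \]
where $\pi_n^{(k_n)}$ ($k_n = 1, \ldots, M_n$) are the component priors.
Maximum Likelihood estimation:
\[ \hat{\mathcal{M}}_n = \operatorname*{argmax}_{\mathcal{M}_n} \; p(\mathbf{y}_1, \ldots, \mathbf{y}_T \mid \mathcal{M}_n, \mathcal{M}_x) \]
Direct optimization of this expression is unfeasible ⇒ an iterative EM approach is used.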
Overview
Problems
The oracle mask is unknown ⇒ the soft-mask estimated by MMSR is used.
Treatment of the speech-dominated regions: the noise in these regions can be estimated using the model obtained in the previous iteration.
Experimental Results
[Figure: WAcc (%) vs. number of GMM components on Aurora2 and Aurora4, comparing the estimated noise with the GMM noise model]
Small but consistent performance improvement is achieved when using GMM noise models in MMSR.
GMMs are worse than the estimated noise in 2 cases:
  - 1-Gaussian GMMs: unable to properly model non-stationary noises.
  - Complex GMMs: not enough data to robustly estimate the GMM parameters.
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
Temporal Modelling
More accurate MMSE estimates are obtained with better speech models.
Here, the temporal correlation of speech is considered.
Two alternative approaches:
  - Patch-based modelling: short segments of speech are modelled instead of single frames.
  - HMM modelling: the previous speech models (GMMs or VQ codebooks) are augmented with transition probabilities. Then,
\[ \hat{\mathbf{x}}_t = \sum_{k=1}^{M} P(k \mid \mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T) \, E[\mathbf{x} \mid \mathbf{y}_t, k] \]
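The state posteriors $P(k \mid \mathbf{y}_1, \ldots, \mathbf{y}_T)$ in an HMM are commonly obtained with the forward-backward algorithm. The sketch below is a generic scaled implementation, not necessarily the one used in the thesis. It takes precomputed observation likelihoods $p(\mathbf{y}_t \mid k)$ as input, and all names and toy values are hypothetical.

```python
def forward_backward(init, trans, obs_lik):
    """Return gamma[t][k] = P(k | y_1..y_T).

    init[k]: initial state probabilities, trans[i][j]: P(j | i),
    obs_lik[t][k]: p(y_t | k). Rescaled at every frame to avoid underflow.
    """
    T, K = len(obs_lik), len(init)
    alpha = [[init[k] * obs_lik[0][k] for k in range(K)]]
    s = sum(alpha[0])
    alpha[0] = [a / s for a in alpha[0]]
    for t in range(1, T):
        cur = [obs_lik[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(K))
               for j in range(K)]
        s = sum(cur)
        alpha.append([c / s for c in cur])
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        cur = [sum(trans[i][j] * obs_lik[t + 1][j] * beta[t + 1][j] for j in range(K))
               for i in range(K)]
        s = sum(cur)
        beta[t] = [c / s for c in cur]
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(K)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma

def hmm_mmse(gamma_t, partial_estimates):
    """x_hat_t = sum over k of gamma_t[k] * E[x | y_t, k] (scalar case)."""
    return sum(g * e for g, e in zip(gamma_t, partial_estimates))

init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
obs_lik = [[0.9, 0.1], [0.5, 0.5], [0.9, 0.1]]   # middle frame is ambiguous alone
gamma = forward_backward(init, trans, obs_lik)
```

In the toy example the middle frame is uninformative on its own, yet its posterior clearly favours state 0 because of the neighbouring frames. This is the kind of temporal redundancy the slide refers to.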
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for TGI-OR, PATCH-OR, HMM-OR, TGI-EST, PATCH-EST and HMM-EST]
The PATCH and HMM approaches are applied in combination with TGI.
Spectral reconstruction benefits from temporal redundancy, especially at low SNRs.
The HMM-based modelling achieves the best results.
Uncertainty Decoding (I)
The accuracy of MMSE estimation depends on many factors, such as the SNR of the signal, stationarity of the noise, etc.
Inaccurate $\hat{\mathbf{x}}_t$ can degrade the performance of ASR.
Two objectives:
  1 Estimate the uncertainty/reliability of $\hat{\mathbf{x}}_t$.
  2 Account for this information in the recognizer.
Uncertainty Decoding (II)
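Uncertainty of $\hat{\mathbf{x}}$
  - Depends on $p(\mathbf{x} \mid \mathbf{y})$ that appears in the MMSE estimator.
  - If $p(\mathbf{x} \mid \mathbf{y}) = \delta_{\mathbf{y}}(\mathbf{x})$, then we will consider that $\hat{\mathbf{x}}$ is fully reliable.
  - If $p(\mathbf{x} \mid \mathbf{y})$ is uniformly distributed, then $\hat{\mathbf{x}}$ is badly estimated.
How to measure the uncertainty of $\hat{\mathbf{x}}$?
  - Entropy of $p(\mathbf{x} \mid \mathbf{y})$.
  - Variance of the MMSE estimate: $\boldsymbol{\Sigma}_{\hat{x}}$.
Exploitation in the recognizer
  - Soft-data decoding: $\boldsymbol{\Sigma}_{\hat{x}}$ increases the variance of the Gaussians in the acoustic model.
  - Weighted Viterbi Algorithm: the exponential factor $\rho \in [0, 1]$ used to weight the observation probabilities of $\hat{\mathbf{x}}$ is obtained after applying a sigmoid function to $\mathrm{MSE} = \mathrm{tr}(\boldsymbol{\Sigma}_{\hat{x}})$.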
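Both ways of exploiting the uncertainty can be sketched for diagonal covariances. In soft-data decoding the estimate variance is added to the variance of each acoustic-model Gaussian. In the weighted Viterbi algorithm the observation log-likelihood is scaled by $\rho$, which corresponds to raising the probability to the power $\rho$. The sigmoid slope and midpoint below are hypothetical parameters, since the slide does not give them, and the function names are illustrative.

```python
import math

def soft_data_loglik(x_hat, var_hat, mu, var):
    """log N(x_hat; mu, var + var_hat) with diagonal covariances."""
    ll = 0.0
    for xh, vh, m, v in zip(x_hat, var_hat, mu, var):
        tot = v + vh   # the estimate uncertainty widens the model Gaussian
        ll += -0.5 * (math.log(2.0 * math.pi * tot) + (xh - m) ** 2 / tot)
    return ll

def wva_weight(var_hat, slope=1.0, midpoint=1.0):
    """rho in [0, 1]: decreasing sigmoid of MSE = trace of the estimate covariance."""
    mse = sum(var_hat)
    return 1.0 / (1.0 + math.exp(slope * (mse - midpoint)))

def weighted_loglik(loglik, rho):
    """Observation probability raised to rho, expressed in the log domain."""
    return rho * loglik

rho_reliable = wva_weight([0.0, 0.0], slope=5.0, midpoint=1.0)
rho_unreliable = wva_weight([10.0, 10.0], slope=5.0, midpoint=1.0)
certain = soft_data_loglik([3.0], [0.0], [0.0], [1.0])
uncertain = soft_data_loglik([3.0], [4.0], [0.0], [1.0])
```

With zero uncertainty both schemes reduce to standard decoding. With large uncertainty, soft-data decoding penalizes a mismatched estimate less, and the weighted Viterbi algorithm nearly ignores the frame.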
Experimental Results
[Figure: WAcc (%) on Aurora2 and Aurora4 for Baseline, TGI-OR, UD-OR, TGI-EST and UD-EST]
UD: TGI + Weighted Viterbi Algorithm.
OR vs. EST: oracle masks and oracle uncertainties vs. estimated masks and uncertainties.
The recognition performance is improved after accounting for the uncertainty, especially in Aurora4.
Outline
1 Introduction
2 Feature Compensation based on Stereo Data
3 Feature Compensation based on a Masking Model
4 Temporal Modelling and Uncertainty Decoding
5 Conclusions
Conclusions (I)
The performance of ASR is severely affected by noise.
To improve the robustness of ASR to noise, a feature compensation approach has been adopted in this thesis.
Stereo-data based compensation: stereo recordings are used to estimate a set of transformations that are later applied to noisy speech.
  - Excellent results for the environments seen during training.
  - Efficient implementation without a significant performance degradation when VQ codebooks are used.
  - The proposed techniques can be used to reduce the residual noise of other robust techniques.
Conclusions (II)
Model-based compensation: a model that considers the distortion of speech as a masking problem is used to derive two reconstruction techniques.
  - TGI estimates the masked regions in the noisy spectra. Good results if the masking pattern is perfectly known, otherwise its performance is significantly affected.
  - MMSR uses clean speech and noise models to enhance noisy speech. Unlike TGI, mask estimation is an integrated part of the reconstruction algorithm.
  - An EM-based iterative algorithm has been proposed to estimate the noise models used by MMSR.
Finally, several approaches to account for temporal correlations and to decode uncertain speech evidence were also investigated.
Future Work
Speech masking model vs. perceptual masking.
EM algorithm: joint estimation of additive and convolutional noises.
Using more information in MMSR, e.g. pitch, onset/offset position, etc.
Joint speaker and noise compensation.
Thank you!