![Page 1: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/1.jpg)
Signal Modeling for Robust Speech Recognition With Frequency
Warping and Convex Optimization
Yoon Kim
March 8, 2000
![Page 2: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/2.jpg)
Outline
• Introduction and Motivation
• Speech Analysis with Frequency Warping
• Speaker Normalization with Convex Optimization
• Experimental Results
• Conclusions
![Page 3: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/3.jpg)
Problem Definition
• Devise effective and robust features for speech recognition that are insensitive to mismatches in individual speaker acoustics and environment
• How can we process the signal so that the acoustic mismatch is minimized?
![Page 4: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/4.jpg)
Robust Signal Modeling
• Feature Extraction
– Derives a compact, yet effective representation
• Feature Normalization
– Compensates for the acoustic mismatch between the training and testing conditions
![Page 5: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/5.jpg)
Part I: Feature Extraction for Speech Recognition
![Page 6: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/6.jpg)
Cepstral Analysis of Speech
• Most popular choice for speech recognition
• Cepstrum is defined as the inverse Fourier transform of the log spectrum
• Truncated to length L (smoothes log spectrum)
$$c(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log S(e^{j\omega})\, e^{j\omega n}\, d\omega, \qquad n = 0, 1, \ldots, L-1$$
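As a rough illustration of the definition above, the integral can be approximated with an FFT. This is a minimal NumPy sketch, assuming S is the power spectrum of a single frame; the function name `fft_cepstrum`, the FFT size, and the small floor `eps` are illustrative choices that do not come from the slides.

```python
import numpy as np

def fft_cepstrum(frame, num_coeffs, n_fft=1024, eps=1e-12):
    """Approximate c(n), n = 0..num_coeffs-1, for one frame of samples."""
    spectrum = np.abs(np.fft.fft(frame, n_fft)) ** 2   # power spectrum S (assumed)
    log_spectrum = np.log(spectrum + eps)              # eps guards against log(0)
    cepstrum = np.fft.ifft(log_spectrum).real          # inverse transform of the log spectrum
    return cepstrum[:num_coeffs]                       # truncate to length L

# Toy input: impulse response of the one-pole filter 1 / (1 - phi z^-1).
phi = 0.5
frame = phi ** np.arange(64)
c = fft_cepstrum(frame, num_coeffs=4)
```

For this toy input the log power spectrum has the known cepstrum c(n) = phi^n / n for n ≥ 1, which gives a quick sanity check on the routine.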
![Page 7: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/7.jpg)
FFT-Based Feature Extraction
• Perceptually motivated FFT filterbank is used to emulate the auditory system
• Analysis is directly affected by fine harmonics
• Examples
– Mel Frequency Cepstral Analysis
– Perceptual Linear Prediction (PLP)
![Page 8: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/8.jpg)
LP-Based Feature Extraction
• Linear prediction provides a smooth spectrum mostly containing vocal-tract information
• Frequency warping is not straightforward
• Examples
– Frequency-Warped Linear Prediction
– Time-domain Warped Linear Prediction
![Page 9: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/9.jpg)
Part I: Non-uniform Linear Predictive Analysis of Speech
![Page 10: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/10.jpg)
Basic Ideas of the NLP Analysis
• Frequency warping of the vocal-tract spectrum using non-uniform DFT (NDFT)
• Bark-frequency scale is used for warping
• Pre- and post-warp linear prediction smoothing
![Page 11: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/11.jpg)
Bark Bilinear Transform
• Bark Bilinear Transform
$$A_\rho(z) = \frac{z - \rho}{1 - \rho z}$$
• For an appropriately chosen ρ, the mapping closely resembles a Bark mapping
![Page 12: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/12.jpg)
Figure: Bark-Frequency Warping
[Figure: optimal fit using the BBT vs. Bark frequency warping, fs = 31 kHz, ρ = 0.70777. x-axis: Linear Frequency (rad/π); y-axis: Warped Frequency (rad/π)]
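The warping curve in the figure can be approximated numerically from the phase of a first-order all-pass map. This is a minimal sketch, assuming the map is written as (z − ρ)/(1 − ρz), so that a positive ρ stretches low frequencies; that sign convention and the name `allpass_warp` are assumptions, while the value of ρ is the one quoted on the figure.

```python
import numpy as np

def allpass_warp(omega, rho):
    """Phase of (z - rho) / (1 - rho z) on the unit circle, z = exp(j * omega)."""
    return omega + 2.0 * np.arctan2(rho * np.sin(omega), 1.0 - rho * np.cos(omega))

rho = 0.70777                          # value quoted on the figure (fs = 31 kHz)
omega = np.linspace(0.0, np.pi, 11)    # linear frequency grid on [0, pi]
warped = allpass_warp(omega, rho)      # warped frequency, also on [0, pi]
```

The map fixes 0 and π and is monotonic in between, so it reshapes the frequency axis without folding it.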
![Page 13: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/13.jpg)
Pre-Warp Linear Prediction
• Vocal-tract transfer function H(z) can be represented by an all-pole model
$$H(z) = \frac{G}{A(z)} = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}$$
![Page 14: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/14.jpg)
NDFT Frequency Warping
• NDFT of the vocal-tract impulse response
• ωk : Frequency grid of Bark bilinear transform
$$\tilde{A}(k) = \sum_{n=0}^{p} a(n)\, e^{-j\omega_k n}, \qquad k = 0, 1, \ldots, M-1$$
$$a(n) = [1, a_1, a_2, \ldots, a_p]$$
![Page 15: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/15.jpg)
Post-Warp Linear Prediction
• Take the IDFT of the power spectrum to get the warped autocorrelation coefficients
• Durbin recursion to get new LP coefficients
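$$\tilde{P}(k) = |\tilde{H}(k)|^2 = \frac{G^2}{|\tilde{A}(k)|^2}, \qquad k = 0, 1, \ldots, M-1$$
$$\tilde{r}(n) = \frac{1}{M}\sum_{k=0}^{M-1} \tilde{P}(k)\, e^{j 2\pi k n / M}, \qquad n = 0, 1, \ldots, q$$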
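The pre-warp LP, NDFT warping, and post-warp LP steps can be chained roughly as follows. This is a minimal NumPy sketch under assumptions the slides do not spell out: A(z) = 1 + Σ a_k z^{-k}, the NDFT grid ω_k is a uniform grid on the warped axis mapped back to linear frequency through the inverse of the all-pass map (z − ρ)/(1 − ρz), and the helper names `warped_grid`, `levinson_durbin`, and `nlp_coefficients` are hypothetical.

```python
import numpy as np

def warped_grid(num_points, rho):
    """NDFT frequencies: uniform points on the warped axis mapped back (assumed grid)."""
    theta = 2.0 * np.pi * np.arange(num_points) / num_points
    return theta - 2.0 * np.arctan2(rho * np.sin(theta), 1.0 + rho * np.cos(theta))

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns ([1, a_1, ..., a_order], error power)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                               # reflection coefficient
        prev = a.copy()
        a[1:m] = prev[1:m] + k * prev[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err

def nlp_coefficients(a, gain, rho, order, num_points=512):
    """Warp the all-pole model G / A(z) and refit an all-pole model of the given order."""
    omega = warped_grid(num_points, rho)
    n = np.arange(len(a))
    a_warp = np.exp(-1j * np.outer(omega, n)) @ a    # NDFT of a(n) at omega_k
    power = gain ** 2 / np.abs(a_warp) ** 2          # warped power spectrum
    r = np.fft.ifft(power).real[: order + 1]         # warped autocorrelation via IDFT
    return levinson_durbin(r, order)

a_lp = np.array([1.0, -0.9, 0.5])                    # toy pre-warp LP polynomial
a_same, err_same = nlp_coefficients(a_lp, gain=1.0, rho=0.0, order=2)
a_nlp, err_nlp = nlp_coefficients(a_lp, gain=1.0, rho=0.70777, order=2)
```

With ρ = 0 the grid is uniform and the routine returns the input model, which is a convenient consistency check before a non-zero warp is applied.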
![Page 16: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/16.jpg)
Conversion to Cepstrum
• Convert warped LP parameters to a set of L cepstral parameters via recursion
$$c(0) = \ln G^2$$
$$c(n) = -\tilde{a}(n) - \frac{1}{n}\sum_{k=1}^{n-1} k\, c(k)\, \tilde{a}(n-k), \qquad n = 1, \ldots, L-1$$
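The LP-to-cepstrum recursion can be sketched in a few lines. The signs below assume the all-pole model is G / A(z) with A(z) = 1 + Σ a_k z^{-k}; that convention and the function name `lp_to_cepstrum` are assumptions rather than details confirmed by the slides.

```python
import numpy as np

def lp_to_cepstrum(a, gain, num_coeffs):
    """Cepstrum of G / A(z), with A(z) = 1 + sum_k a[k] z^-k and a[0] = 1."""
    p = len(a) - 1
    c = np.zeros(num_coeffs)
    c[0] = np.log(gain ** 2)                  # c(0) = ln G^2
    for n in range(1, num_coeffs):
        a_n = a[n] if n <= p else 0.0         # a(n) = 0 beyond the LP order
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += k * c[k] * a[n - k]
        c[n] = -a_n - acc / n
    return c

# Toy check: one pole at z = 0.5, whose cepstrum is 0.5**n / n for n >= 1.
c = lp_to_cepstrum(np.array([1.0, -0.5]), gain=1.0, num_coeffs=4)
```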
![Page 17: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/17.jpg)
NDFT Warping: Vowel /u/
[Figure: original LP power spectrum. x-axis: Normalized Frequency (rad/π); y-axis: Log Magnitude]
![Page 18: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/18.jpg)
NDFT Warping: Vowel /u/
[Figure: original LP power spectrum and NLP power spectrum. x-axis: Normalized Frequency (rad/π); y-axis: Log Magnitude]
![Page 19: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/19.jpg)
Clustering Measures
• Derive meaningful measures to assess how well the feature clusters of each class (vowel) can be separated and discriminated
• Three measures were considered
– Determinant measure
– Trace measure
– Inverse trace measure
![Page 20: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/20.jpg)
Scatter Matrices
• SW: Within-class scatter matrix
• SB : Between-class scatter matrix
• ST : Total scatter matrix
$$S_W = \sum_{i=1}^{c} \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^T$$
$$S_B = \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^T$$
$$S_T = S_W + S_B$$
![Page 21: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/21.jpg)
Determinant Measure
$$J_{\det} = \frac{\det(W^T S_B W)}{\det(W^T S_W W)}$$
• Ratio of the between-class and within-class scattering volume
• Larger the value, better the clustering
![Page 22: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/22.jpg)
Trace Measure
$$J_{\mathrm{tr}} = \mathrm{Tr}(S_W^{-1} S_B) = \sum_{i=1}^{c-1} \lambda_i(S_W^{-1} S_B)$$
• Ratio of the sum of scattering radii of between-class and within-class scatter
• Larger the better
![Page 23: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/23.jpg)
Inverse Trace Measure
$$J_{\mathrm{inv}} = \mathrm{Tr}(S_T^{-1} S_W) = \sum_{i=1}^{c-1} \frac{1}{1 + \lambda_i}$$
• Sum of within-class scattering radii normalized by the total scatter
• Smaller the better
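The scatter matrices and the three measures can be computed along the following lines. This is a minimal NumPy sketch on synthetic 2-D data; the function names, the optional projection argument `proj` (standing in for W, identity by default), and the toy clusters are all assumptions made for illustration.

```python
import numpy as np

def scatter_matrices(features, labels):
    """Return (S_W, S_B, S_T) for row-vector features and integer class labels."""
    overall_mean = features.mean(axis=0)
    dim = features.shape[1]
    s_w = np.zeros((dim, dim))
    s_b = np.zeros((dim, dim))
    for cls in np.unique(labels):
        x = features[labels == cls]
        m_i = x.mean(axis=0)
        centered = x - m_i
        s_w += centered.T @ centered                   # within-class scatter
        diff = (m_i - overall_mean)[:, None]
        s_b += len(x) * (diff @ diff.T)                # between-class scatter
    return s_w, s_b, s_w + s_b

def clustering_measures(s_w, s_b, proj=None):
    """Return (J_det, J_tr, J_inv); proj plays the role of W (identity if omitted)."""
    if proj is None:
        proj = np.eye(s_w.shape[0])
    s_t = s_w + s_b
    j_det = np.linalg.det(proj.T @ s_b @ proj) / np.linalg.det(proj.T @ s_w @ proj)
    j_tr = np.trace(np.linalg.solve(s_w, s_b))         # Tr(S_W^-1 S_B)
    j_inv = np.trace(np.linalg.solve(s_t, s_w))        # Tr(S_T^-1 S_W)
    return j_det, j_tr, j_inv

# Synthetic example: three well-separated 2-D classes, 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + 0.5 * rng.standard_normal((150, 2))
s_w, s_b, s_t = scatter_matrices(features, labels)
j_det, j_tr, j_inv = clustering_measures(s_w, s_b)
```

Note that det(S_B) vanishes whenever the feature dimension exceeds c − 1, which is one reason a projection W is needed for the determinant measure in practice.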
![Page 24: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/24.jpg)
Vowel Clustering Performance
• We compared the values of the scattering measures discussed to assess the clustering performance of the NLP cepstrum
• Mel, PLP and LP techniques were also tested for comparison
![Page 25: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/25.jpg)
Steady-State Vowel Database
• Eleven steady-state English vowels from 23 speakers (12 male, 9 female, 2 children)
• Sampling rate: 10 kHz
• Each speaker provided 6 frames of steady-state vowel segments
![Page 26: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/26.jpg)
Results: Vowel Clustering
| Method | J_det | J_tr | J_inv |
|---|---|---|---|
| LP | 7.01e-9 | 12.04 | 6.89 |
| PLP | 7.08e-7 | 12.36 | 6.73 |
| Mel | 2.37e-6 | 13.58 | 6.75 |
| NLP | 4.30e-5 | 14.72 | 6.49 |
![Page 27: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/27.jpg)
2-D Vowel Clusters: /a/ /i/ /o/
[Figure: 2-D feature clusters for /a/, /i/, /o/, one panel per method. LPC: Jtr = 20.29; PLP: Jtr = 22.72; Mel: Jtr = 25.06; NLP: Jtr = 28.82]
![Page 28: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/28.jpg)
2-D Vowel Clusters: /a/ /e/ /i/
[Figure: 2-D feature clusters for /a/, /e/, /i/, one panel per method. LPC: Jtr = 8.72; PLP: Jtr = 9.77; Mel: Jtr = 10.31; NLP: Jtr = 12.45]
![Page 29: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/29.jpg)
Part II: Feature Normalization for Speaker Acoustics Matching
![Page 30: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/30.jpg)
Speech Recognition Problem
• Given a sequence of acoustic feature vectors X extracted from speech, find the most likely word string that could have been uttered
$$\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W P(W)\, P(X \mid W)$$
![Page 31: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/31.jpg)
HMM Acoustic Model
• Hidden Markov Models (HMMs): Each phone unit is modeled as a sequence of hidden states
• Speech dynamics modeled as transitions from one state to another
• Each state has a feature probability distribution
• Goal: Guess the underlying state sequence (phone string) from the observable features
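Recovering the state sequence from the features is commonly done with the Viterbi algorithm. The slides do not name the decoder, so the following is an assumed, generic log-domain sketch; the 3-state left-to-right model and all probabilities are made-up toy values.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state path; log_obs[t, s] = log p(feature_t | state s)."""
    num_frames, num_states = log_obs.shape
    score = log_init + log_obs[0]
    backptr = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        cand = score[:, None] + log_trans        # cand[i, j]: leave state i, enter state j
        backptr[t] = np.argmax(cand, axis=0)     # best predecessor of each state
        score = cand.max(axis=0) + log_obs[t]
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):       # trace the best path backwards
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

with np.errstate(divide="ignore"):               # log(0) = -inf encodes forbidden moves
    log_init = np.log(np.array([1.0, 0.0, 0.0]))
    log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.6, 0.4],
                                 [0.0, 0.0, 1.0]]))
log_obs = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.1, 0.8]]))
path = viterbi(log_init, log_trans, log_obs)
```

In a recognizer the rows of `log_obs` would come from the per-state feature distributions rather than from a hand-written table.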
![Page 32: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/32.jpg)
Example: HMM Word Model
Digit: “one”
pause /w/ /Λ/ /n/ pause
1 2 3 4 5
![Page 33: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/33.jpg)
Why Speaker Normalization?
• Most speech recognition systems use statistical models trained using a large database with the hope that the testing conditions will be similar
• Acoustic mismatches between the speakers used in training and testing result in unacceptable degradation of recognition performance
![Page 34: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/34.jpg)
Prior Work in Speaker Normalization
• Normalization usually refers to modification of the features to fit a statistical model
• Vocal-tract length normalization (VTLN)
– Attempts to alter the resonant frequencies of the vocal tract by warping the frequency axis
– Linear warping
– All-pass warping (bilinear transform)
![Page 35: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/35.jpg)
Prior Work: Speaker Adaptation
• Adaptation usually refers to modification of the model parameters to fit the data
• Maximum Likelihood Bias
$$\tilde{\mu} = \mu + v$$
• ML Linear Regression (MLLR)
$$\tilde{\mu} = A\mu, \qquad A \in \mathbb{R}^{L \times L}$$
($\tilde{\mu}$: transformed mean, $\mu$: original mean)
![Page 36: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/36.jpg)
Part II: Speaker Normalization with Maximum-Likelihood Affine Cepstral Filtering
![Page 37: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/37.jpg)
Linear Cepstral Filtering (LCF)
• We propose the following linear, Toeplitz transformation of the cepstral feature vectors
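$$\tilde{c} = Hc, \qquad H = \begin{bmatrix} h_0 & 0 & \cdots & 0 \\ h_1 & h_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h_{L-1} & h_{L-2} & \cdots & h_0 \end{bmatrix} \in \mathbb{R}^{L \times L}$$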
![Page 38: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/38.jpg)
Linear Cepstral Filtering (LCF)
• H represents the linear cepstral transformation for normalizing speaker acoustics.
• The matrix operation corresponds to
– Convolution in the cepstral domain
– Log spectral filtering in the frequency domain
$$\tilde{c}(n) = \sum_{k=0}^{L-1} c(k)\, h(n-k)$$
$$\tilde{S}(\omega) = H(\omega)\, S(\omega)$$
![Page 39: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/39.jpg)
Maximum-Likelihood Estimation
• Find the optimal normalization H such that the transformed features yield maximum likelihood with respect to a given model Λ
• Only L parameters for estimation (instead of L²)
$$\hat{H} = \arg\max_H P(\tilde{c} \mid \Lambda) = \arg\max_h P(\tilde{c} \mid \Lambda)$$
$$\tilde{c} = Hc, \qquad h = [h_0, h_1, \ldots, h_{L-1}]^T$$
![Page 40: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/40.jpg)
Commutative Property of LCF
• Due to the commutative property of the convolution, the transformed cepstrum can also be expressed as a linear function of the filter h
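$$\tilde{c} = Hc = Ch, \qquad C = \begin{bmatrix} c_0 & 0 & \cdots & 0 \\ c_1 & c_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ c_{L-1} & c_{L-2} & \cdots & c_0 \end{bmatrix}$$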
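The commutative property is easy to confirm numerically. In this minimal NumPy sketch the helper `lower_toeplitz` and the example vectors are hypothetical; it builds H from h and C from c and compares both products against a truncated convolution.

```python
import numpy as np

def lower_toeplitz(v):
    """L x L lower-triangular Toeplitz matrix whose first column is v."""
    size = len(v)
    mat = np.zeros((size, size))
    for row in range(size):
        mat[row, : row + 1] = v[row::-1]   # entry (row, col) = v[row - col]
    return mat

c = np.array([1.0, 0.5, -0.2, 0.1])        # toy cepstral vector
h = np.array([0.9, 0.1, 0.0, -0.05])       # toy cepstral filter
H = lower_toeplitz(h)
C = lower_toeplitz(c)
c_tilde = H @ c                            # same vector as C @ h
```

Writing the transform as C h is what makes the likelihood a function of the L filter taps rather than of a full L × L matrix.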
![Page 41: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/41.jpg)
Solution: Single Gaussian Case
• Let c(i) be the i-th feature of the data (i=0,…,N-1)
• Let the distribution corresponding to c(i) be Gaussian with mean μi and covariance Σi
• Total log-likelihood of transformed feature data set is a concave, quadratic function of the filter h
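$$\sum_{i=0}^{N-1} \log P(\tilde{c}^{(i)} \mid \Lambda) = -\frac{1}{2}\sum_{i=0}^{N-1} (C^{(i)} h - \mu_i)^T\, \Sigma_i^{-1}\, (C^{(i)} h - \mu_i) + \text{const}$$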
![Page 42: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/42.jpg)
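Solution: Single Gaussian Case
• Since the negative of the log-likelihood is convex in h, there exists a unique ML solution h*
$$h^* = \arg\min_h \|Ah - b\|^2 = (A^T A)^{-1} A^T b$$
$$A = \begin{bmatrix} \Sigma_0^{-1/2} C^{(0)} \\ \Sigma_1^{-1/2} C^{(1)} \\ \vdots \\ \Sigma_{N-1}^{-1/2} C^{(N-1)} \end{bmatrix}, \qquad b = \begin{bmatrix} \Sigma_0^{-1/2} \mu_0 \\ \Sigma_1^{-1/2} \mu_1 \\ \vdots \\ \Sigma_{N-1}^{-1/2} \mu_{N-1} \end{bmatrix}$$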
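The stacked least-squares problem can be solved directly with a standard solver. This is a minimal NumPy sketch, assuming diagonal covariances (so each Σ_i^{-1/2} is an element-wise scaling) and assuming each frame has already been aligned to one Gaussian; `lower_toeplitz`, `ml_filter`, and the synthetic data are hypothetical.

```python
import numpy as np

def lower_toeplitz(v):
    """L x L lower-triangular Toeplitz matrix whose first column is v."""
    size = len(v)
    mat = np.zeros((size, size))
    for row in range(size):
        mat[row, : row + 1] = v[row::-1]
    return mat

def ml_filter(frames, means, variances):
    """ML filter h for frames aligned to diagonal-covariance Gaussians (assumed)."""
    rows, targets = [], []
    for c, mu, var in zip(frames, means, variances):
        w = 1.0 / np.sqrt(var)                        # diagonal Sigma_i^{-1/2}
        rows.append(w[:, None] * lower_toeplitz(c))   # Sigma_i^{-1/2} C^(i)
        targets.append(w * mu)                        # Sigma_i^{-1/2} mu_i
    A = np.vstack(rows)
    b = np.concatenate(targets)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)         # minimizes ||A h - b||^2
    return h

# Synthetic check: the means are exactly the filtered frames, so h is recoverable.
rng = np.random.default_rng(1)
frames = rng.standard_normal((20, 4))
h_true = np.array([1.1, -0.2, 0.05, 0.0])
means = np.array([lower_toeplitz(c) @ h_true for c in frames])
variances = rng.uniform(0.5, 2.0, size=(20, 4))
h_hat = ml_filter(frames, means, variances)
```

Using `lstsq` instead of forming (AᵀA)⁻¹ explicitly is a numerical-stability choice; both give the same minimizer when A has full column rank.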
![Page 43: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/43.jpg)
Case: Gaussian Mixture
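$$P(x; \Lambda) = \sum_{i=1}^{M} w_i\, N(x; \mu_i, \Sigma_i), \qquad w_i \ge 0, \quad \sum_i w_i = 1$$
• Log-likelihood is no longer a concave function of h, so ML estimation is no longer a convex problem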
• Approximation: We use the single Gaussian density for ML filter estimation
• Past studies support the validity of the approx.
![Page 44: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/44.jpg)
Case: Log-Concave PDFs
• For any distribution that is log-concave, ML estimation can be posed as a convex problem
• Examples
– Laplace: p(x) = (1/2a) exp(-|x|/a)
– Uniform: p(x) = 1/(2a) on [-a, a]
– Rayleigh: p(x) = (2/a) x exp(-x²/a), x > 0
![Page 45: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/45.jpg)
Affine Cepstral Filtering (ACF)
• We can extend the linear transformation to an affine form by adding a cepstral bias term v
• Bias models channel and other additive effects
• Joint optimization of filter and bias leads to a more flexible transformation of the cepstral space
$$\tilde{c} = Hc + v \quad \Longleftrightarrow \quad \tilde{S}(\omega) = H(\omega)\, S(\omega) + V(\omega)$$
![Page 46: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/46.jpg)
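Solution: Affine Transformation
• By combining the filter h and bias v into an augmented design vector x, the joint ML solution can be easily attained by extending the linear case
$$x^* = \arg\min_x \|Ax - b\|^2, \qquad x = \begin{bmatrix} h \\ v \end{bmatrix}$$
$$A = \begin{bmatrix} \Sigma_0^{-1/2} C^{(0)} & \Sigma_0^{-1/2} \\ \Sigma_1^{-1/2} C^{(1)} & \Sigma_1^{-1/2} \\ \vdots & \vdots \\ \Sigma_{N-1}^{-1/2} C^{(N-1)} & \Sigma_{N-1}^{-1/2} \end{bmatrix}, \qquad b = \begin{bmatrix} \Sigma_0^{-1/2} \mu_0 \\ \Sigma_1^{-1/2} \mu_1 \\ \vdots \\ \Sigma_{N-1}^{-1/2} \mu_{N-1} \end{bmatrix}$$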
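The affine case only changes how each block row of A is built. This is a minimal NumPy sketch, again assuming diagonal covariances and frames already aligned to single Gaussians; `lower_toeplitz`, `ml_affine`, and the synthetic data are hypothetical names and values.

```python
import numpy as np

def lower_toeplitz(v):
    """L x L lower-triangular Toeplitz matrix whose first column is v."""
    size = len(v)
    mat = np.zeros((size, size))
    for row in range(size):
        mat[row, : row + 1] = v[row::-1]
    return mat

def ml_affine(frames, means, variances):
    """Joint ML filter h and bias v, solved as one least-squares problem in x = [h; v]."""
    size = frames.shape[1]
    rows, targets = [], []
    for c, mu, var in zip(frames, means, variances):
        w = 1.0 / np.sqrt(var)                               # diagonal Sigma_i^{-1/2}
        block = np.hstack([lower_toeplitz(c), np.eye(size)]) # [C^(i)  I]
        rows.append(w[:, None] * block)
        targets.append(w * mu)
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(targets), rcond=None)
    return x[:size], x[size:]

# Synthetic check: means are filtered frames plus a fixed bias.
rng = np.random.default_rng(2)
frames = rng.standard_normal((20, 4))
h_true = np.array([1.1, -0.2, 0.05, 0.0])
v_true = np.array([0.3, -0.1, 0.0, 0.2])
means = np.array([lower_toeplitz(c) @ h_true + v_true for c in frames])
variances = rng.uniform(0.5, 2.0, size=(20, 4))
h_hat, v_hat = ml_affine(frames, means, variances)
```

The design vector doubles in length (2L unknowns), so enough distinct frames are needed for the stacked matrix to have full column rank.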
![Page 47: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/47.jpg)
Example: Vowel /ah/, No Warping, No Normalization
[Figure: log spectrum vs. normalized frequency for the reference speaker and the test speaker]
![Page 48: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/48.jpg)
Vowel /ah/: With NLP Warping, No Normalization
[Figure: log spectrum vs. normalized frequency for the reference speaker and the test speaker]
![Page 49: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/49.jpg)
Vowel /ah/: With NLP Warping and LCF Normalization
[Figure: log spectrum vs. normalized frequency for the reference speaker and the test speaker]
![Page 50: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/50.jpg)
Example: Vowel /oh/, No Warping, No Normalization
[Figure: log spectrum vs. normalized frequency for the reference speaker and the test speaker]
![Page 51: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/51.jpg)
Vowel /oh/: With NLP Warping, No Normalization
[Figure: log spectrum vs. normalized frequency for the reference speaker and the test speaker]
![Page 52: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/52.jpg)
Vowel /oh/: With NLP Warping and LCF Normalization
[Figure: log spectrum vs. normalized frequency for the reference speaker and the test speaker]
![Page 53: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/53.jpg)
Normalization in Training
• For each speaker in the training database, ML filter and bias vectors are estimated using the unnormalized model Λ
• ML transformation is applied to the feature vectors for each speaker
• A normalized Gaussian-mixture model Λ̃ is trained using the normalized features
![Page 54: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/54.jpg)
Normalization in Recognition
• Given a set of enrollment data, normalization parameters are estimated for each speaker
• We apply the speaker-dependent mapping to subsequent data from the speaker
• Transformation can be regarded as a statistical “spectral equalizer” applied to each speaker to optimally fit the normalized model
![Page 55: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/55.jpg)
Frame-Based Vowel Recognition
• Same vowel database used for clustering (23 speakers, 11 steady-state vowels)
• 4 speakers in the test set provided a total of 18 frames; 6 frames were used for estimation
• LP, PLP, Mel, and NLP features considered
• Recognition performance: Error rate (%)
![Page 56: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/56.jpg)
Results: Vowel Recognition
| Method | Baseline (No Norm) | Diag MLLR | ML Bias | ML LCF |
|---|---|---|---|---|
| LP | 35.7 | 45.5 | 30.0 | 26.4 |
| PLP | 32.0 | 32.7 | 30.6 | 27.4 |
| Mel | 32.8 | 30.1 | 24.6 | 20.5 |
| NLP | 25.9 | 26.3 | 21.7 | 19.2 |
| Avg. | 31.6 | 33.6 | 26.7 | 23.4 |
![Page 57: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/57.jpg)
Summary: Vowel Recognition
[Bar chart: average error rate (%). Baseline: 31.6; Diag MLLR: 33.6; ML Bias: 26.7; ML LCF: 23.4]
![Page 58: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/58.jpg)
HMM Digit Recognition
• TIDIGITS corpus: 326 speakers providing 77 digit sequences in a quiet environment
• Digits: 1-9, “zero” and “oh”
• 8-state HMM for each digit
• Varied # of Gaussians/state from 1 to 15, and the best result was selected
![Page 59: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/59.jpg)
Case: Adult Data on Adult Model
• HMM for each digit was trained with data from 112 adult speakers (55 male, 57 female)
• Another set of 113 adult speakers were used for testing (56 male, 57 female)
• One utterance per digit was used for estimating the normalization parameters for each speaker
![Page 60: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/60.jpg)
Digit Results: Adult on Adult
| Method | Baseline | Linear | Affine | GD model |
|---|---|---|---|---|
| Mel | 2.0 | 1.7 | 1.5 | 1.9 |
| NLP | 1.9 | 1.6 | 1.3 | 1.9 |
| Avg. | 2.0 | 1.6 | 1.4 | 1.9 |
![Page 61: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/61.jpg)
Case: Child Data on Adult Model
• Case of severe mismatch between the training and testing speaker acoustics
• Model: 112 adults
• Test set: 100 children (50 boys, 50 girls)
![Page 62: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/62.jpg)
Digit Results: Child on Adult
| Method | Baseline Model | Linear Norm. | Affine Norm. | Child Model |
|---|---|---|---|---|
| Mel | 23.1 | 18.7 | 15.5 | 4.5 |
| NLP | 19.5 | 18.2 | 15.6 | 2.3 |
| Avg. | 21.3 | 18.5 | 15.6 | 3.4 |
![Page 63: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/63.jpg)
Summary: Digit Recognition
[Bar charts: average error rate (%). Adult on Adult: Baseline 2.0, Linear 1.6, Affine 1.4. Child on Adult: Baseline 21.3, Linear 18.5, Affine 15.6]
![Page 64: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/64.jpg)
Conclusions
• Speaker normalization was achieved using NLP frequency warping and ML affine cepstral filtering
• A unified framework for optimizing the matrix and bias parameters was presented using simple convex programming
• Proposed signal modeling techniques gave considerable boost in recognition performance, even for severely mismatched conditions
![Page 67: Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization Yoon Kim March 8, 2000](https://reader031.vdocument.in/reader031/viewer/2022032309/56649d4b5503460f94a2886d/html5/thumbnails/67.jpg)
Future Research
• Compensation of noise and channel mismatches
• Joint optimization of frequency warping and affine transform parameters
• Investigation of other optimality criteria for stochastic matching