Post on 02-Jan-2016
Speech Signal Representations I
Seminar Speech Recognition 2002
F.R. Verhage
Decomposition of the speech signal (x[n]) as a source (e[n]) passed through a linear time-varying filter (h[n]).
Estimation of the filter, inspired by:
Speech production models
– Linear Predictive Coding (LPC)
– Cepstral analysis
Speech perception models (part II)
– Mel-frequency cepstrum
– Perceptual Linear Prediction (PLP)
Speech recognizers estimate filter characteristics and ignore the source
Short-Time Fourier Analysis
Spectrogram
– Representation of a signal highlighting several of its properties based on short-time Fourier analysis
– Two dimensional: time horizontal and frequency vertical
– Third ‘dimension’: gray or color level indicating energy
– Narrow band
  Long windows (> 20 ms) → Narrow bandwidth
  Lower time resolution, better frequency resolution
– Wide band
  Short windows (< 10 ms) → Wide bandwidth
  Good time resolution, lower frequency resolution
– Pitch synchronous
  Requires knowledge of local pitch period
[Figure: spectrogram]
Window analysis
– Series of short segments, analysis frames
– Short enough so that the signal is stationary
– Usually constant, 20-30 ms
– Overlaps possible
– Different types of window functions (w_m[n]):
  Rectangular (equal to no window function)
  Hamming
  Hanning
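$$X_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x_m[n]\, e^{-j\omega n} = \sum_{n=-\infty}^{\infty} w[m-n]\, x[n]\, e^{-j\omega n}$$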
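The windowed analysis above can be sketched as follows. This is a minimal frame-based illustration and not the slides' own implementation: it assumes NumPy, and the function name `short_time_spectra`, the 16 kHz sampling rate, the 25 ms frame length and the 10 ms hop are illustrative choices. Each frame is multiplied by a Hamming window and transformed with an FFT.

```python
import numpy as np

def short_time_spectra(x, frame_len, hop, window=np.hamming):
    """Return one magnitude spectrum per analysis frame of x."""
    w = window(frame_len)                        # window function, Hamming by default
    n_frames = 1 + (len(x) - frame_len) // hop   # only complete frames are used
    frames = np.stack([x[m * hop : m * hop + frame_len] * w
                       for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # |X_m| for every frame m

fs = 16000                                       # assumed sampling rate (Hz)
t = np.arange(fs) / fs                           # one second of signal
x = np.sin(2 * np.pi * 1000 * t)                 # 1 kHz test tone
frame_len = 25 * fs // 1000                      # 25 ms frame (assumed)
hop = 10 * fs // 1000                            # 10 ms hop, so frames overlap
S = short_time_spectra(x, frame_len, hop)
peak_hz = S[0].argmax() * fs / frame_len         # strongest bin of the first frame
```

Plotting `S` with time on the horizontal axis and frequency on the vertical axis would give a spectrogram of the kind described earlier.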
– Window size must be long enough (N: window length, M: pitch period)
  Rectangular: N ≥ M
  Hamming, Hanning: N ≥ 2M
– Pitch period not known in advance → prepare for the lowest pitch, i.e. the longest pitch period
– At least 20 ms for rectangular or 40 ms for Hamming/Hanning (50 Hz)
– But longer windows give a more averaged spectrum instead of distinct spectra → rectangular window has better time resolution
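The 20 ms and 40 ms figures follow directly from the pitch period at 50 Hz. A small sketch of that arithmetic, assuming the rule of one pitch period for a rectangular window and two for Hamming/Hanning (the function name `min_window_ms` is hypothetical):

```python
def min_window_ms(lowest_pitch_hz, window="rectangular"):
    """Shortest window (ms) that covers the longest expected pitch period."""
    period_ms = 1000.0 / lowest_pitch_hz                     # pitch period M in ms
    periods_needed = 1 if window == "rectangular" else 2     # N >= M or N >= 2M
    return periods_needed * period_ms

print(min_window_ms(50))               # 20.0
print(min_window_ms(50, "hamming"))    # 40.0
```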
– Frequency response not completely zero outside main lobe → spectral leakage
– Second lobe of a Hamming window is approx. 43 dB below main lobe → less spectral leakage
– Hamming, Hanning, triangular windows offer less spectral leakage → rectangular windows are rarely used despite their better time resolution
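The leakage difference can be checked numerically. The sketch below assumes NumPy; the function name `peak_sidelobe_db`, the window length of 64 samples and the FFT size are illustrative choices. It measures the highest sidelobe of a window relative to its main-lobe peak.

```python
import numpy as np

def peak_sidelobe_db(w, nfft=8192):
    """Highest sidelobe of window w, in dB relative to the main-lobe peak."""
    mag = np.abs(np.fft.rfft(w, nfft))            # zero-padded frequency response
    mag_db = 20 * np.log10(mag / mag.max() + 1e-12)
    i = 0
    while mag_db[i + 1] < mag_db[i]:              # walk down the main lobe to its first null
        i += 1
    return mag_db[i:].max()                       # largest value beyond the main lobe

N = 64                                            # assumed window length in samples
rect_db = peak_sidelobe_db(np.ones(N))            # rectangular window
hamm_db = peak_sidelobe_db(np.hamming(N))         # Hamming window
```

Under these assumptions the rectangular window should come out at roughly 13 dB below the main lobe and the Hamming window in the low 40s, in line with the figure quoted above.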
Figure: short-time spectrum of male voice speech
a) Time signal /ah/, local pitch 110 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window
Figure: short-time spectrum of female voice speech
a) Time signal /aa/, local pitch 200 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window
Figure: short-time spectrum of unvoiced speech
a) Time signal
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window
Linear Predictive Coding
LPC a.k.a. auto-regressive (AR) modeling
All-pole filter is a good approximation of speech, with p as the order of the LPC analysis:
$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$
Predicts current sample as linear combination of past p samples:
$$\tilde{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k]$$
To estimate the predictor coefficients (a_k), use a short-term analysis technique
Per segment, minimize the total squared prediction error:
$$E_m = \sum_n e_m^2[n] = \sum_n \left(x_m[n] - \tilde{x}_m[n]\right)^2 = \sum_n \left(x_m[n] - \sum_{k=1}^{p} a_k\, x_m[n-k]\right)^2$$
Take the derivative with respect to each a_i, equate it to 0; expressed as a set of p linear equations, the Yule-Walker equations:
$$\sum_{k=1}^{p} a_k\, \phi_m[i,k] = \phi_m[i,0], \quad i = 1, \ldots, p$$
with $\phi_m[i,k] = \sum_n x_m[n-i]\, x_m[n-k]$
Solution of the Yule-Walker equations:
– Any standard matrix inversion package
– Due to the special form of the matrix, efficient solutions:
  Covariance method
    using the Cholesky decomposition
  Autocorrelation method
    using windows, results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
  Lattice method
    equivalent to Levinson-Durbin recursion
    often used in fixed-point implementations because lack of precision doesn’t result in unstable filters
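For the autocorrelation method, the recursion can be sketched in a few lines of plain Python. This is an illustrative version, not the slides' code: the function names `autocorrelation` and `levinson_durbin` are hypothetical, the predictor is taken as x~[n] = sum_k a_k x[n-k], and the sign convention for the reflection coefficients is one of several in use.

```python
def autocorrelation(frame, p):
    """Autocorrelation values R[0..p] of one (windowed) analysis frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(p + 1)]

def levinson_durbin(r, p):
    """Solve sum_k a_k R[|i-k|] = R[i], i = 1..p, for the predictor coefficients."""
    a = [0.0] * (p + 1)              # a[1..p] hold the coefficients, a[0] is unused
    err = r[0]                       # prediction error energy of the order-0 model
    reflection = []
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient of order i
        reflection.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):        # update the lower-order coefficients
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k           # error shrinks whenever |k| < 1
    return a[1:], reflection, err

# Exact autocorrelation of an AR(2) process with a_1 = 0.9, a_2 = -0.5 (R[0] = 1).
coeffs, reflection, err = levinson_durbin([1.0, 0.6, 0.04], 2)
```

In practice `r` would come from `autocorrelation(frame, p)` applied to a windowed speech frame; the hand-picked values here only make the result easy to verify.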
Spectral analysis via LPC
– All-pole (IIR) filter
– Peaks at the roots of the denominator
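As a rough illustration of how the roots of the denominator relate to spectral peaks, the sketch below converts predictor coefficients into resonance frequencies and bandwidths. It assumes NumPy, the convention A(z) = 1 - sum_k a_k z^-k, and the common approximation B = -(fs/pi) ln|z| for the bandwidth of a pole z; the function name `resonances` and the test values are illustrative.

```python
import numpy as np

def resonances(a, fs):
    """Frequencies (Hz) and bandwidths (Hz) of the poles of 1/A(z)."""
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))   # z^p * A(z)
    roots = np.roots(poly)
    roots = roots[roots.imag > 0]                 # keep one root of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)    # pole angle -> frequency
    bws = -np.log(np.abs(roots)) * fs / np.pi     # pole radius -> approximate bandwidth
    order = np.argsort(freqs)
    return freqs[order], bws[order]

fs = 8000                                         # assumed sampling rate (Hz)
radius, theta = 0.95, 2 * np.pi * 1000 / fs       # one pole pair placed at 1 kHz
a = [2 * radius * np.cos(theta), -radius ** 2]    # matching second-order predictor
freqs, bws = resonances(a, fs)
```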
Prediction error
– Should be (approximately) the excitation
– Unvoiced speech, expect white noise; OK
– Voiced speech, expect impulse train; NOK
  All-pole assumption not altogether valid
  Real speech not perfectly periodic
  Pitch synchronous analysis gives better results
– LPC order
  Larger p gives lower prediction errors
  Too large a p results in fitting the individual harmonics → separation between filter and source will not be so good
– Inverse LPC filter gives residual signal
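A minimal sketch of inverse filtering, assuming NumPy and predictor coefficients that are known exactly (the function names `synthesize` and `residual` and the coefficient values are illustrative). An impulse train is passed through an all-pole filter and then recovered by applying A(z).

```python
import numpy as np

def synthesize(e, a):
    """All-pole filter 1/A(z): x[n] = e[n] + sum_k a_k x[n-k]."""
    x = np.zeros(len(e))
    for n in range(len(e)):
        past = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        x[n] = e[n] + past
    return x

def residual(x, a):
    """Inverse filter A(z): e[n] = x[n] - sum_k a_k x[n-k]."""
    inverse = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, inverse)[:len(x)]

a = [0.9, -0.5]                  # assumed (stable) predictor coefficients
e = np.zeros(200)
e[::50] = 1.0                    # impulse train as a stand-in for voiced excitation
x = synthesize(e, a)
e_hat = residual(x, a)           # residual signal after inverse filtering
```

With coefficients estimated from real speech rather than known in advance, the residual would only approximate the excitation, for the reasons listed above.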
Alternatives for the predictor coefficients
– Line Spectral Frequencies
  local sensitivity
  efficiency
– Reflection Coefficients
  Guaranteed stable → useful for coefficients interpolated over time
– Log-area ratios
  Flat spectral sensitivity
– Roots of the polynomial
  Represent resonance frequencies and bandwidths
Cepstral Processing
– A homomorphic transformation converts a convolution into a sum:
$$x[n] = e[n] * h[n]$$
$$\hat{x}[n] = \hat{e}[n] + \hat{h}[n]$$
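The convolution-to-sum property can be checked numerically. The sketch below assumes NumPy and uses the real cepstrum (log magnitude only), which is a simplification of the transformation named above; the function name `real_cepstrum`, the noise source and the decaying impulse response are illustrative choices.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """Inverse DFT of the log magnitude spectrum of x."""
    return np.fft.ifft(np.log(np.abs(np.fft.fft(x, nfft)))).real

rng = np.random.default_rng(0)
e = rng.standard_normal(256)         # noise-like source e[n]
h = 0.9 ** np.arange(32)             # decaying impulse response h[n]
x = np.convolve(e, h)                # x[n] = e[n] * h[n]
nfft = len(x)                        # long enough to hold the full convolution
c_x = real_cepstrum(x, nfft)
c_sum = real_cepstrum(e, nfft) + real_cepstrum(h, nfft)
```

Here `c_x` and `c_sum` agree up to rounding error, because the DFT turns the convolution into a product and the logarithm turns the product into a sum.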