Speech Signal Representations I

Seminar Speech Recognition 2002

F.R. Verhage

Decomposition of the speech signal (x[n]) as a source (e[n]) passed through a linear time-varying filter (h[n]).

Estimation of the filter, inspired by:
Speech production models
– Linear Predictive Coding (LPC)
– Cepstral analysis
Speech perception models (part II)
– Mel-frequency cepstrum
– Perceptual Linear Prediction (PLP)

Speech recognizers estimate the filter characteristics and ignore the source.
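The source-filter decomposition can be illustrated with a minimal sketch in which x[n] is the convolution of e[n] with h[n]. The impulse-train source, the short decaying filter and the function name `convolve` are assumptions chosen for illustration; the filter is kept fixed here, as one would assume within a single short frame.

```python
def convolve(e, h):
    """Return x[n] = sum_k h[k] * e[n - k] (full linear convolution)."""
    x = [0.0] * (len(e) + len(h) - 1)
    for n, e_n in enumerate(e):
        if e_n == 0.0:
            continue  # skip zero source samples
        for k, h_k in enumerate(h):
            x[n + k] += e_n * h_k
    return x

# Hypothetical voiced source: impulse train with a period of 8 samples
period = 8
e = [1.0 if n % period == 0 else 0.0 for n in range(32)]

# Hypothetical filter: short decaying impulse response
h = [0.5 ** k for k in range(4)]

# Each source impulse is replaced by a copy of the filter response
x = convolve(e, h)
```

Because the period is longer than the filter response, every pitch period of x[n] contains one undistorted copy of h[n]; with a longer response the copies would overlap.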

Short-Time Fourier Analysis

Spectrogram
– Representation of a signal highlighting several of its properties based on short-time Fourier analysis
– Two-dimensional: time horizontal and frequency vertical
– Third ‘dimension’: gray or color level indicating energy

Spectrogram
– Narrow band
    Long windows (> 20 ms) → narrow bandwidth
    Lower time resolution, better frequency resolution
– Wide band
    Short windows (< 10 ms) → wide bandwidth
    Good time resolution, lower frequency resolution
– Pitch synchronous
    Requires knowledge of local pitch period

Spectrogram [figure]

Window analysis
– Series of short segments, analysis frames
– Short enough so that the signal is approximately stationary within a frame
– Frame length usually constant, 20-30 ms
– Overlaps possible
– Different types of window functions (w_m[n]):
    Rectangular (equal to no window function)
    Hamming
    Hanning

$$X_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} x_m[n] \, e^{-j\omega n} = \sum_{n=-\infty}^{\infty} w[m-n] \, x[n] \, e^{-j\omega n}$$
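The short-time spectrum of one analysis frame can be sketched as follows: multiply the frame by a window and take the magnitude of its discrete Fourier transform. The 8 kHz sampling rate, the 500 Hz test tone, the 20 ms frame and the function names are assumptions for illustration. A direct (slow) DFT is used so that only the standard library is needed.

```python
import cmath
import math

def hamming(N):
    """Symmetric Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_spectrum(x, start, window):
    """Magnitude of the DFT of one windowed frame, bins 0..N/2.

    The frame is taken as x[start + n] * w[n]; for a symmetric window this
    matches the w[m - n] x[n] form up to a time shift, which only changes
    the phase, not the magnitude.
    """
    N = len(window)
    frame = [x[start + n] * window[n] for n in range(N)]
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N)))
        for k in range(N // 2 + 1)
    ]

fs = 8000          # assumed sampling rate in Hz
f0 = 500.0         # assumed test tone in Hz
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(800)]

N = 160            # 20 ms frame at 8 kHz, bin spacing fs / N = 50 Hz
mag = frame_spectrum(x, 0, hamming(N))
peak_bin = max(range(len(mag)), key=mag.__getitem__)
peak_hz = peak_bin * fs / N
```

Stacking such spectra for successive (possibly overlapping) frames gives the columns of a spectrogram. A longer frame narrows the bin spacing fs / N, a shorter one widens it.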

Window analysis
– Window size must be long enough (N: window length, M: pitch period, both in samples)
    Rectangular: N ≥ M
    Hamming, Hanning: N ≥ 2M
– Pitch period not known in advance →
– Prepare for the lowest pitch (longest pitch period) →
– At least 20 ms for rectangular or 40 ms for Hamming/Hanning (50 Hz)
– But longer windows give a more averaged spectrum instead of distinct spectra →
– Rectangular window has better time resolution
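The window-length rule is simple arithmetic on the pitch period. The sketch below restates it; the function name `min_window_ms` and its arguments are hypothetical.

```python
def min_window_ms(lowest_pitch_hz, window="rectangular"):
    """Minimum window duration in ms: one pitch period for a rectangular
    window, two pitch periods for Hamming or Hanning."""
    periods_needed = {"rectangular": 1, "hamming": 2, "hanning": 2}[window]
    pitch_period_ms = 1000.0 / lowest_pitch_hz
    return periods_needed * pitch_period_ms

rect_ms = min_window_ms(50.0)                 # lowest expected pitch: 50 Hz
hamming_ms = min_window_ms(50.0, "hamming")
male_hamming_ms = min_window_ms(110.0, "hamming")
```

For a 50 Hz lowest pitch this reproduces the 20 ms and 40 ms figures. For a local pitch of 110 Hz the Hamming requirement is about 18.2 ms, so a 15 ms Hamming window is too short to resolve the individual harmonics.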

Window analysis
– Frequency response not completely zero outside main lobe → spectral leakage
– Second lobe of a Hamming window is approx. 43 dB below main lobe → less spectral leakage
– Hamming, Hanning, triangular windows offer less spectral leakage →
– Rectangular windows are rarely used despite their better time resolution
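Spectral leakage can be made concrete by measuring the highest side lobe of a window's frequency response relative to its main lobe. This is a minimal sketch: the window length of 64 samples, the 1024-point frequency grid and the function names are assumptions, and the main-lobe edge is found with a simple first-local-minimum search.

```python
import cmath
import math

def rectangular(N):
    return [1.0] * N

def hamming(N):
    """Symmetric Hamming window of length N."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def peak_sidelobe_db(window, nfft=1024):
    """Level of the highest side lobe relative to the main-lobe peak, in dB."""
    # Zero-padded DFT magnitude on a dense grid from 0 to half the sampling rate
    mag = [
        abs(sum(w * cmath.exp(-2j * math.pi * k * n / nfft) for n, w in enumerate(window)))
        for k in range(nfft // 2 + 1)
    ]
    # Walk down the main lobe until the magnitude stops decreasing
    k = 1
    while k < len(mag) - 1 and mag[k + 1] < mag[k]:
        k += 1
    # Everything from the first minimum onwards belongs to the side lobes
    return 20.0 * math.log10(max(mag[k:]) / mag[0])

rect_db = peak_sidelobe_db(rectangular(64))
ham_db = peak_sidelobe_db(hamming(64))
```

The rectangular window should come out at roughly 13 dB below the main lobe and the Hamming window at roughly 40 dB or more below it, in line with the figure of about 43 dB quoted above. The exact values depend on the window length.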

Short-time spectrum of male voice speech [figure]
a) Time signal /ah/, local pitch 110 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

Short-time spectrum of female voice speech [figure]
a) Time signal /aa/, local pitch 200 Hz

Short-time spectrum of unvoiced speech [figure]
a) Time signal

Linear Predictive Coding

LPC a.k.a. auto-regressive (AR) modeling
An all-pole filter is a good approximation of speech, with p as the order of the LPC analysis:

$$H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

Predicts the current sample as a linear combination of the past p samples:

$$\tilde{x}[n] = \sum_{k=1}^{p} a_k \, x[n-k]$$

To estimate the predictor coefficients (a_k), use a short-term analysis technique.
Per segment m, minimize the total squared prediction error:

$$E_m = \sum_n e_m^2[n] = \sum_n \left( x_m[n] - \tilde{x}_m[n] \right)^2 = \sum_n \left( x_m[n] - \sum_{k=1}^{p} a_k \, x_m[n-k] \right)^2$$

Take the derivative with respect to each coefficient and equate it to 0; the result is a set of p linear equations, the Yule-Walker equations:

$$\sum_{k=1}^{p} a_k \, \phi_m[i,k] = \phi_m[i,0], \qquad i = 1, \dots, p$$

with the correlation coefficients $\phi_m[i,k] = \sum_n x_m[n-i] \, x_m[n-k]$.

Solution of the Yule-Walker equations:
– Any standard matrix inversion package
– Due to the special form of the matrix, efficient solutions:
    Covariance method
        using the Cholesky decomposition
    Autocorrelation method
        using windows, results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
    Lattice method
        equivalent to Levinson-Durbin recursion
        often used in fixed-point implementations because lack of precision doesn’t result in unstable filters
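The autocorrelation method can be sketched as below: compute the autocorrelation of a (windowed) frame and solve the resulting Toeplitz system with the Levinson-Durbin recursion. The function names and the example autocorrelation values are assumptions for illustration. The sign convention follows the predictor written as a sum of a_k x[n-k], so the system solved is sum_k a_k R[|i-k|] = R[i].

```python
def autocorrelation(frame, p):
    """R[k] = sum_n frame[n] * frame[n - k] for k = 0..p."""
    N = len(frame)
    return [sum(frame[n] * frame[n - k] for n in range(k, N)) for k in range(p + 1)]

def levinson_durbin(R, p):
    """Solve sum_k a_k R[|i - k|] = R[i], i = 1..p.

    Returns (a, refl, E): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p and the final prediction error energy.
    """
    a = [0.0] * (p + 1)   # a[0] is unused
    refl = []
    E = R[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k_i = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / E
        # Update the lower-order coefficients, then append the new one
        new_a = a[:]
        new_a[i] = k_i
        for j in range(1, i):
            new_a[j] = a[j] - k_i * a[i - j]
        a = new_a
        # The error energy shrinks by (1 - k_i^2) at every order
        E *= 1.0 - k_i * k_i
        refl.append(k_i)
    return a[1:], refl, E

# Hypothetical autocorrelation values R[0], R[1], R[2]
R = [1.0, 0.8, 0.5]
a, refl, E = levinson_durbin(R, 2)
```

The reflection coefficients fall out of the recursion as a by-product. This is the link to the lattice method and to the alternative parameter sets derived from the predictor coefficients.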

Spectral analysis via LPC
– All-pole (IIR) filter
– Spectral peaks near the frequencies of the roots of the denominator (the poles)
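An LPC spectrum is obtained by evaluating the all-pole filter on the unit circle. The sketch below assumes a hypothetical second-order predictor with a pole pair at radius 0.95 and 1000 Hz, an 8 kHz sampling rate, and a gain of 1; the function name is also an assumption.

```python
import cmath
import math

def lpc_magnitude_db(a, omega):
    """20*log10 |H(e^{j omega})| for H(z) = 1 / (1 - sum_k a_k z^-k)."""
    A = 1.0 - sum(a_k * cmath.exp(-1j * omega * (k + 1)) for k, a_k in enumerate(a))
    return -20.0 * math.log10(abs(A))

fs = 8000.0                                  # assumed sampling rate in Hz
r, theta = 0.95, 2 * math.pi * 1000.0 / fs   # assumed pole radius and angle
a = [2 * r * math.cos(theta), -r * r]        # predictor with poles at r*exp(+-j*theta)

# Evaluate the spectrum on a grid from 0 to fs / 2
n_points = 256
spectrum = [lpc_magnitude_db(a, math.pi * i / n_points) for i in range(n_points + 1)]
peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_index * (fs / 2) / n_points
```

The peak lands at the grid point closest to the pole angle. The closer the pole lies to the unit circle, the sharper the peak.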

Prediction error
– Should be (approximately) the excitation
– Unvoiced speech, expect white noise; OK
– Voiced speech, expect impulse train; NOK
    All-pole assumption not altogether valid
    Real speech not perfectly periodic
    Pitch synchronous analysis gives better results
– LPC order
    Larger p gives lower prediction errors
    Too large a p results in fitting the individual harmonics → separation between filter and source will not be so good

Prediction error
– Inverse LPC filter gives residual signal
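Inverse filtering can be sketched as follows: a signal is synthesized by driving a known all-pole filter with an impulse train, and the residual is then recovered by subtracting the prediction from each sample. The predictor coefficients, the pitch period of 40 samples and the function names are assumptions. With the exact coefficients the residual equals the excitation; with estimated coefficients on real speech it is only an approximation.

```python
import math

def predict(x, a, n):
    """Prediction of x[n] from the past samples: sum_k a_k * x[n - k]."""
    return sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)

def synthesize(e, a):
    """All-pole filter: x[n] = sum_k a_k x[n - k] + e[n]."""
    x = []
    for n, e_n in enumerate(e):
        x.append(predict(x, a, n) + e_n)
    return x

def residual(x, a):
    """Inverse (all-zero) filter: e[n] = x[n] - sum_k a_k x[n - k]."""
    return [x[n] - predict(x, a, n) for n in range(len(x))]

# Hypothetical second-order predictor (pole radius 0.95, angle pi / 4)
a = [2 * 0.95 * math.cos(math.pi / 4), -0.95 ** 2]

# Hypothetical voiced excitation: impulse train with a period of 40 samples
e = [1.0 if n % 40 == 0 else 0.0 for n in range(200)]

x = synthesize(e, a)
res = residual(x, a)
max_error = max(abs(r_n - e_n) for r_n, e_n in zip(res, e))
```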

Alternatives for the predictor coefficients
– Line Spectral Frequencies
    local sensitivity
    efficiency
– Reflection Coefficients
    Guaranteed stable (as long as every |k_i| < 1) → useful for coefficients interpolated over time
– Log-area ratios
    Flat spectral sensitivity
– Roots of the polynomial
    Represent resonance frequencies and bandwidths
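How a root maps to a resonance can be sketched for the second-order case, where the roots follow from the quadratic formula. A pole at radius r and angle theta corresponds to a centre frequency theta * fs / (2 * pi) and a bandwidth of -ln(r) * fs / pi. The predictor values, the 8 kHz sampling rate and the function names are assumptions; higher orders would need a general polynomial root finder.

```python
import cmath
import math

def second_order_poles(a1, a2):
    """Roots of z^2 - a1*z - a2 = 0, the poles of 1 / (1 - a1 z^-1 - a2 z^-2)."""
    disc = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + disc) / 2.0, (a1 - disc) / 2.0

def pole_to_resonance(pole, fs):
    """Centre frequency and bandwidth in Hz of the resonance for one pole."""
    freq_hz = abs(cmath.phase(pole)) * fs / (2.0 * math.pi)
    bandwidth_hz = -math.log(abs(pole)) * fs / math.pi
    return freq_hz, bandwidth_hz

fs = 8000.0                                            # assumed sampling rate in Hz
a1, a2 = 2 * 0.95 * math.cos(math.pi / 4), -0.95 ** 2  # hypothetical predictor

pole, conjugate_pole = second_order_poles(a1, a2)
freq_hz, bandwidth_hz = pole_to_resonance(pole, fs)
```

Both poles of the conjugate pair describe the same resonance, which is why the absolute value of the angle is used.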

Cepstral Processing

– A homomorphic transformation converts a convolution into a sum:

$$x[n] = e[n] * h[n] \;\;\rightarrow\;\; \hat{x}[n] = \hat{e}[n] + \hat{h}[n]$$
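The convolution-to-sum property can be checked numerically. The sketch uses the real cepstrum, i.e. the inverse DFT of the log magnitude spectrum, as one possible homomorphic transformation; that choice, the two short example sequences and the function names are assumptions. The DFT size is chosen large enough that zero-padding makes the product of the spectra exact.

```python
import cmath
import math

def dft(x, N):
    """N-point DFT of x, zero-padded to length N."""
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(len(x)))
        for k in range(N)
    ]

def real_cepstrum(x, N):
    """Inverse DFT of log |X[k]| (real cepstrum on an N-point grid)."""
    log_mag = [math.log(abs(X_k)) for X_k in dft(x, N)]
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
        for n in range(N)
    ]

def convolve(e, h):
    """Full linear convolution of e and h."""
    x = [0.0] * (len(e) + len(h) - 1)
    for n, e_n in enumerate(e):
        for k, h_k in enumerate(h):
            x[n + k] += e_n * h_k
    return x

# Hypothetical short sequences with no spectral zeros on the DFT grid
e = [1.0, 0.5]
h = [1.0, -0.3, 0.2]
x = convolve(e, h)

N = 8  # at least len(e) + len(h) - 1, so X[k] = E[k] * H[k] holds exactly
c_x = real_cepstrum(x, N)
c_e = real_cepstrum(e, N)
c_h = real_cepstrum(h, N)
max_diff = max(abs(cx - (ce + ch)) for cx, ce, ch in zip(c_x, c_e, c_h))
```

The difference stays at the level of floating-point rounding: the logarithm turns the product of the spectra into a sum, and the inverse transform is linear.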
