TRANSCRIPT
Automatic Speech Recognition: A summary of contributions from multiple disciplines
Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
October 6, 2004
What is ASR?
• Automatic Speech Recognition is:
– A system that converts a raw acoustic signal into phonetically meaningful text.
– A combination of engineering, linguistics, statistics, psychoacoustics, and computer science.
[Diagram: a waveform of the spoken word "seven" converted to the text "seven"]
Psychoacousticians provide expert knowledge about human acoustic perception.
Engineers provide efficient algorithms and hardware.
Linguists provide language rules.
What is ASR?
[Block diagram: Feature extraction → Classification → Language model]
Computer scientists and statisticians provide optimum modeling.
Feature extraction
• Acoustic-phonetic paradigm (pre-1980):
– Holistic features (voicing and frication measures, durations, formants and bandwidths)
– Difficult to construct robust classifiers
• Frame-based paradigm (1980 to today):
– Short (20 ms) sliding analysis window; assumes each speech frame is quasi-stationary
– Relies on the classifier to account for speech nonstationarity
– Allows for the inclusion of expert knowledge about speech perception
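The frame-based paradigm above can be sketched in a few lines of numpy. The 20 ms window follows the slide; the 10 ms hop and the Hamming window are common-practice assumptions, since the slide does not specify them:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Split signal x (sampled at fs Hz) into overlapping, Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 10 ms -> 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

fs = 16000
x = np.random.randn(fs)        # one second of noise as a stand-in for speech
frames = frame_signal(x, fs)   # shape (99, 320): a 320-sample frame every 160 samples
```

Each row of `frames` is then treated as quasi-stationary and passed to a feature extractor such as MFCC.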
Feature extraction algorithms
• Cepstrum (1962)
• Linear prediction (1967)
• Mel frequency cepstral coefficients (Davis & Mermelstein, 1980)
• Perceptual linear prediction (Hermansky, 1990)
• Human factor cepstral coefficients (Skowronski & Harris, 2002)
MFCC algorithm
[Block diagram: input x(t) of the word "seven" → Fourier transform → mel-scaled filter bank → log energy → DCT → cepstral domain]
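The block diagram can be written out as a minimal MFCC sketch. The filter count (26) and number of cepstral coefficients (13) are conventional choices, not values given on the slide:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with center frequencies evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """One frame through the pipeline: Fourier -> mel filter bank -> log energy -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2               # power spectrum
    energies = mel_filterbank(n_filters, n_fft, fs) @ spectrum
    log_e = np.log(energies + 1e-10)
    # DCT-II decorrelates the log filter-bank energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_e

ceps = mfcc_frame(np.random.randn(512), 16000)   # 13 cepstral coefficients per frame
```

This is a sketch of the textbook pipeline, not a production implementation; real systems add pre-emphasis, liftering, and delta features.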
Classification
• Operates on frame-based features
• Accounts for time variations of speech
• Uses training data to transform features into symbols (phonemes, bi-/tri-phones, words)
• Non-parametric: Dynamic time warping (DTW)
– No parameters to estimate
– Computationally expensive; scaling issues
• Parametric: Hidden Markov model (HMM)
– State-of-the-art model; complements features
– Data-intensive; scales well
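A minimal DTW sketch, assuming a Euclidean frame distance and the standard three-way recursion (the slide names the technique but gives no details):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # stretch a
                                 D[i, j - 1],            # stretch b
                                 D[i - 1, j - 1])        # match
    return D[n, m]

# a template matches a time-stretched copy of itself far better than a different pattern
t_short = np.sin(2 * np.pi * np.linspace(0, 1, 50))[:, None]
t_long = np.sin(2 * np.pi * np.linspace(0, 1, 80))[:, None]
```

The O(n·m) table per template pair is why DTW becomes expensive as the vocabulary grows, which is the scaling issue the slide refers to.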
HMM classification
A Hidden Markov Model is a piecewise stationary model of a nonstationary signal.
Model characteristics:
• States: represent domains of piecewise stationarity
• Interstate connections: define the model architecture
• Parameters: pdf means and covariances
HMM diagram
[Diagram: the same signal shown in the time domain, in state space, and in feature space]
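Decoding with such a model is typically done with the Viterbi algorithm. The sketch below uses a toy 2-state left-to-right HMM with 1-D Gaussian emissions; all parameter values are illustrative, not taken from the slides:

```python
import numpy as np

def viterbi(obs, log_A, means, variances, log_pi):
    """Most likely state path through an HMM with 1-D Gaussian emission pdfs."""
    n_states, T = len(means), len(obs)
    # log-likelihood of each observation under each state's Gaussian pdf
    log_b = (-0.5 * np.log(2 * np.pi * variances)
             - (obs[:, None] - means) ** 2 / (2 * variances))
    delta = np.zeros((T, n_states))          # best path log-score ending in each state
    psi = np.zeros((T, n_states), dtype=int) # backpointers
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (from_state, to_state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(n_states)] + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):           # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path

# toy left-to-right model: state 0 emits near 0.0, state 1 emits near 5.0
log_A = np.log(np.array([[0.7, 0.3], [1e-9, 1.0]]))
obs = np.array([0.1, -0.2, 0.0, 4.9, 5.2, 5.0])
path = viterbi(obs, log_A, means=np.array([0.0, 5.0]),
               variances=np.array([1.0, 1.0]), log_pi=np.log([0.999, 0.001]))
```

The decoded path switches states exactly where the observations jump, illustrating how states capture domains of piecewise stationarity.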
HMM output symbols

Symbol    # Models   Positive          Negative
Word      <1000      Coarticulation    Scaling
Phoneme   40         pdf estimation    Coarticulation
Biphone   1400
Triphone  40K        Coarticulation    pdf estimation

TRADEOFF: more context per symbol (triphones) captures coarticulation but makes pdf estimation data-hungry; fewer models (phonemes) are easy to estimate but miss coarticulation.
Language models
• Considers multiple output symbol hypotheses
• Delays making a hard decision on classifier output
• Uses language-based expert knowledge to predict meaningful words/phrases from the classifier's output symbols (N-phones or words)
• Major research topic since the early 1990s with the advent of large speech corpora
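A minimal example of language-based prediction is an add-one-smoothed bigram model; the tiny corpus below is invented for illustration:

```python
from collections import Counter

def train_bigram(corpus):
    """Return a bigram probability function with add-one smoothing."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    def prob(w_prev, w):
        # add-one smoothing so unseen word pairs keep nonzero probability
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))
    return prob

corpus = ["recognize speech", "recognize the speaker", "wreck a nice beach"]
p = train_bigram(corpus)
# word sequences seen in training score higher than unseen ones
```

In a full decoder, these probabilities are combined with the classifier's acoustic scores so that an implausible word sequence can be overruled even when it fits the audio slightly better.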
ASR Problems
• Test/train mismatch
• Speaker variations (gender, accent, mood)
• Weak model assumptions
• Noise: energetic or informational (babble)
• Current state of the art neither models the human brain nor functions with the accuracy or reliability of humans
• Most recent progress comes from faster computers, not new ideas
Conclusions
• Automatic speech recognition technology emerges from several diverse disciplines:
– Acousticians describe how speech is produced and perceived by humans
– Computer scientists create machine learning models for signal-to-symbol conversion
– Linguists provide language information
– Engineers optimize the algorithms, provide the hardware, and put the pieces together