TRANSCRIPT
Automatic Speech Recognition: A summary of contributions from multiple disciplines
Mark D. Skowronski
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
October 6, 2004
What is ASR?
• Automatic Speech Recognition is:
– A system that converts a raw acoustic signal into phonetically meaningful text.
– A combination of engineering, linguistics, statistics, psychoacoustics, and computer science.
[Diagram: a waveform of the spoken word "seven" converted to the text "seven"]
Psychoacousticians provide expert knowledge about human acoustic perception.
Engineers provide efficient algorithms and hardware.
Linguists provide language rules.
What is ASR?
[Block diagram: Feature extraction → Classification → Language model]
Computer scientists and statisticians provide optimum modeling.
Feature extraction
• Acoustic-phonetic paradigm (pre-1980):
– Holistic features (voicing and frication measures, durations, formants and bandwidths)
– Difficult to construct robust classifiers
• Frame-based paradigm (1980 to today):
– Short (20 ms) sliding analysis window; assumes each speech frame is quasi-stationary
– Relies on the classifier to account for speech nonstationarity
– Allows for the inclusion of expert knowledge about speech perception
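The frame-based paradigm above can be sketched in a few lines of numpy. The 20 ms window follows the slide; the 10 ms hop and the Hamming window are common-practice assumptions, since the slide does not specify them:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Split signal x (sampled at fs Hz) into overlapping, Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 10 ms -> 160 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

fs = 16000
x = np.random.randn(fs)        # one second of noise as a stand-in for speech
frames = frame_signal(x, fs)   # shape (99, 320): a 320-sample frame every 160 samples
```

Each row of `frames` is then treated as quasi-stationary and passed to a feature extractor such as MFCC.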
Feature extraction algorithms
• Cepstrum (1962)
• Linear prediction (1967)
• Mel frequency cepstral coefficients (Davis & Mermelstein, 1980)
• Perceptual linear prediction (Hermansky, 1990)
• Human factor cepstral coefficients (Skowronski & Harris, 2002)
MFCC algorithm
[Block diagram: input x(t) of the word "seven" → Fourier transform → mel-scaled filter bank → log energy → DCT → cepstral domain]
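The block diagram can be written out as a minimal MFCC sketch. The filter count (26) and number of cepstral coefficients (13) are conventional choices, not values given on the slide:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with center frequencies evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """One frame through the pipeline: Fourier -> mel filter bank -> log energy -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2               # power spectrum
    energies = mel_filterbank(n_filters, n_fft, fs) @ spectrum
    log_e = np.log(energies + 1e-10)
    # DCT-II decorrelates the log filter-bank energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    return dct @ log_e

ceps = mfcc_frame(np.random.randn(512), 16000)   # 13 cepstral coefficients per frame
```

This is a sketch of the textbook pipeline, not a production implementation; real systems add pre-emphasis, liftering, and delta features.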
Classification
• Operates on frame-based features
• Accounts for time variations of speech
• Uses training data to transform features into symbols (phonemes, bi-/tri-phones, words)
• Non-parametric: Dynamic time warping (DTW)
– No parameters to estimate
– Computationally expensive; scaling issues
• Parametric: Hidden Markov model (HMM)
– State-of-the-art model; complements features
– Data-intensive; scales well
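A minimal DTW sketch, assuming a Euclidean frame distance and the standard three-way recursion (the slide names the technique but gives no details):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences (frames x dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j],            # stretch a
                                 D[i, j - 1],            # stretch b
                                 D[i - 1, j - 1])        # match
    return D[n, m]

# a template matches a time-stretched copy of itself far better than a different pattern
t_short = np.sin(2 * np.pi * np.linspace(0, 1, 50))[:, None]
t_long = np.sin(2 * np.pi * np.linspace(0, 1, 80))[:, None]
```

The O(n·m) table per template pair is why DTW becomes expensive as the vocabulary grows, which is the scaling issue the slide refers to.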
HMM classification
A Hidden Markov Model is a piecewise stationary model of a nonstationary signal.
Model characteristics:
• States: represent domains of piecewise stationarity
• Interstate connections: define the model architecture
• Parameters: pdf means and covariances
HMM diagram
[Diagram: the same signal shown in the time domain, in state space, and in feature space]
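Decoding with such a model is typically done with the Viterbi algorithm. The sketch below uses a toy 2-state left-to-right HMM with 1-D Gaussian emissions; all parameter values are illustrative, not taken from the slides:

```python
import numpy as np

def viterbi(obs, log_A, means, variances, log_pi):
    """Most likely state path through an HMM with 1-D Gaussian emission pdfs."""
    n_states, T = len(means), len(obs)
    # log-likelihood of each observation under each state's Gaussian pdf
    log_b = (-0.5 * np.log(2 * np.pi * variances)
             - (obs[:, None] - means) ** 2 / (2 * variances))
    delta = np.zeros((T, n_states))          # best path log-score ending in each state
    psi = np.zeros((T, n_states), dtype=int) # backpointers
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # (from_state, to_state)
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(n_states)] + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):           # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path

# toy left-to-right model: state 0 emits near 0.0, state 1 emits near 5.0
log_A = np.log(np.array([[0.7, 0.3], [1e-9, 1.0]]))
obs = np.array([0.1, -0.2, 0.0, 4.9, 5.2, 5.0])
path = viterbi(obs, log_A, means=np.array([0.0, 5.0]),
               variances=np.array([1.0, 1.0]), log_pi=np.log([0.999, 0.001]))
```

The decoded path switches states exactly where the observations jump, illustrating how states capture domains of piecewise stationarity.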
HMM output symbols

Symbol    # Models   Positive          Negative
Word      <1000      Coarticulation    Scaling
Phoneme   40         pdf estimation    Coarticulation
Biphone   1400
Triphone  40K        Coarticulation    pdf estimation

TRADEOFF: more context per symbol (triphones) captures coarticulation but makes pdf estimation data-hungry; fewer models (phonemes) are easy to estimate but miss coarticulation.
Language models
• Considers multiple output symbol hypotheses
• Delays making a hard decision on classifier output
• Uses language-based expert knowledge to predict meaningful words/phrases from the classifier's output symbols (N-phones or words)
• Major research topic since the early 1990s with the advent of large speech corpora
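A minimal example of language-based prediction is an add-one-smoothed bigram model; the tiny corpus below is invented for illustration:

```python
from collections import Counter

def train_bigram(corpus):
    """Return a bigram probability function with add-one smoothing."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    def prob(w_prev, w):
        # add-one smoothing so unseen word pairs keep nonzero probability
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + len(vocab))
    return prob

corpus = ["recognize speech", "recognize the speaker", "wreck a nice beach"]
p = train_bigram(corpus)
# word sequences seen in training score higher than unseen ones
```

In a full decoder, these probabilities are combined with the classifier's acoustic scores so that an implausible word sequence can be overruled even when it fits the audio slightly better.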
ASR Problems
• Test/train mismatch
• Speaker variations (gender, accent, mood)
• Weak model assumptions
• Noise: energetic or informational (babble)
• Current state of the art neither models the human brain nor functions with the accuracy or reliability of humans
• Most recent progress comes from faster computers, not new ideas
Conclusions
• Automatic speech recognition technology emerges from several diverse disciplines:
– Acousticians describe how speech is produced and perceived by humans
– Computer scientists create machine learning models for signal-to-symbol conversion
– Linguists provide language information
– Engineers optimize the algorithms, provide the hardware, and put the pieces together