from last time …

From last time …From last time …

ASR System ArchitectureASR System Architecture

PronunciationLexicon

Signal Processing

ProbabilityEstimator

Decoder

RecognizedWords“zero”“three”“two”

Probabilities“z” -0.81“th” = 0.15“t” = 0.03

Cepstrum

SpeechSignal

Grammar

A Few Points about Human Speech Recognition

(See Chapter 18 for much more on this)

Human Speech Recognition

• Experiments dating from 1918 dealing with noise, reduced BW (Fletcher)

• Statistics of CVC perception• Comparisons between human and machine speech recognition

• A few thoughts

The Ear

The Cochlea

Assessing Recognition Accuracy

• Intelligibility• Articulation - Fletcher experiments

– CVC, VC, CV, syllables in carrier sentences

– Tests over different SNR, bands– Example: “The first group is `mav’ (forced choice between mav and nav)

– Used sharp lowpass and/or highpass filtered. For equal energy, crossover is 450 Hz; for equal articulation, 1550 Hz.

Results

• S = vc2

• Articulation Index (the original “AI”)

• Error independence between bands– Articulatory band ~ 1 mm along basilar membrane

– 20 filters between 300 and 8000 Hz– A single zero error band -> no error!– Robustness to a range of problems– AI = ∑k 1/K (SNRk / 30) where SNR saturates at 0 and 30

AI additivity

• s(a,b) = phone accuracy for band from a to b, a<b<c

• (1-s(a,c)) = (1-s(a,b))(1-s(b,c))

• log10(1-s(a,c)) = log10(1-s(a,b)) + log10(1-s(b,c))

• AI(s) = log10(1-s) / log10(1-smax)

• AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c))

Jont Allen interpretation:The Big Idea

• Humans don’t use frame-like spectral templates

• Instead, partial recognition in bands• Combined for phonetic (syllabic?) recognition

• Important for 3 reasons:– Based on decades of listening experiments– Based on a theoretical structure that matched the results

– Different from what ASR systems do

Questions about AI

• Based on phones - the right unit for fluent speech?

• Lost correlation between distant bands?

• Lippmann experiments, disjoint bands– Signal above 8 kHz helps a lot in combination with signal below 800 Hz

Human SR vs ASR: Quantitative Comparisons

• Lippmann compilation (see book): typically ~factor of 10 in WER

• Hasn’t changed too much since his study

• Keep in mind this caveat: “human” scores are ideal - under sustained real conditions people don’t pay perfect attention (especially after lunch)

Human SR vs ASR: Quantitative

Comparisons (2)

System 10 dB SNR 16 dB SNR “Quiet”

Baseline HMM ASR

77.4% 42.2% 7.2%

ASR w/ noise compensation

12.8% 10.0% -

Human Listener

1.1% 1.0% 0.9%

Word error rates for 5000 word Wall Street Journal read speech task using additive automotive noise(old numbers – ASR would be a bit better now)

Human SR vs ASR: Qualitative Comparisons

• Signal processing• Subword recognition • Temporal integration• Higher level information

Human SR vs ASR: Signal Processing

• Many maps vs one• Sampled across time-frequency vs sampled in time

• Some hearing-based signal processing already in ASR

Human SR vs ASR: Subword Recognition

• Knowing what is important (from the maps)

• Combining it optimally

Human SR vs ASR: Temporal Integration

• Using or ignoring duration (e.g., VOT)

• Compensating for rapid speech• Incorporating multiple time scales

Human SR vs ASR: Higher levels

• Syntax• Semantics• Pragmatics• Getting the gist• Dialog to learn more

Human SR vs ASR: Conclusions

• When we pay attention, human SR much better than ASR

• Some aspects of human models going into ASR

• Probably much more to do, when we learn how to do it right

from last time …

Documents