
Acoustic / Lexical Model

Derk Geene

Speech recognition

P(words|signal) = P(signal|words) P(words) / P(signal)

P(signal|words): Acoustic model

P(words): Language model

Idea: Maximize P(signal|words) P(words)

Today: Acoustic model
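The maximization idea can be sketched in a few lines. P(signal) is the same for every candidate word sequence, so it can be dropped when comparing candidates. The function name `decode` and all probability values below are hypothetical, chosen only to illustrate the rule; they do not come from a real recognizer.

```python
import math

def decode(candidates, acoustic_logprob, language_logprob):
    """Pick the candidate maximizing P(signal|words) * P(words), in log space."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + language_logprob[w])

# Hypothetical scores for two candidate transcriptions of the same signal.
acoustic = {"the copy land": math.log(0.020), "the Copeland": math.log(0.015)}
language = {"the copy land": math.log(0.001), "the Copeland": math.log(0.010)}

best = decode(list(acoustic), acoustic, language)
print(best)
```

In this made-up case the acoustic model slightly prefers the wrong words, and the language model prior tips the product toward the other candidate.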

Variability

Variation: speaker, pronunciation, environment, context

Static acoustic model will not work in real applications.

Dynamically adapt P(signal|words) while using the system.

Measuring errors (1)

500 sentences of 6 to 10 words each from 5 to 10 different speakers.

10% relative error reduction

Training set / Development set

First decide optimal parameter settings.

Measuring errors (2)

Word recognition errors: Substitution, Deletion, Insertion

Correct: Did mob mission area of the Copeland ever go to m4 in nineteen eighty one?

Recognized: Did mob mission area ** the copy land ever go to m4 in nineteen east one?

Measuring errors (3)

Correct: The effect is clear

Recognised: Effect is not clear

Error rate, one by one: 75%

Word error rate = 100% x (Subs + Dels + Ins) / (# words in correct sentence)

Word error rate: 100% x (0 + 1 + 1) / 4 = 50%
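The two error measures for this example can be reproduced with a short sketch. It assumes words are compared case-insensitively and that the minimum number of substitutions, deletions and insertions is found with a standard edit-distance table; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions + deletions + insertions turning ref into hyp."""
    # d[i][j] holds the edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j recognized words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)]

def word_error_rate(correct, recognized):
    """100% x (Subs + Dels + Ins) / number of words in the correct sentence."""
    ref, hyp = correct.lower().split(), recognized.lower().split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def one_by_one_error_rate(correct, recognized):
    """Compare words position by position, without aligning the sentences."""
    ref, hyp = correct.lower().split(), recognized.lower().split()
    wrong = sum(r != h for r, h in zip(ref, hyp)) + abs(len(ref) - len(hyp))
    return 100.0 * wrong / len(ref)

print(one_by_one_error_rate("The effect is clear", "Effect is not clear"))  # 75.0
print(word_error_rate("The effect is clear", "Effect is not clear"))        # 50.0
```

The aligned measure counts one deletion ("The") and one insertion ("not"), which is why it is lower than the position-by-position count.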

Units of speech (1)

Modeling is language dependent.

A modeling unit should be: Accurate, Trainable, Generalizable

Units of speech (2)

Whole-word models: only suitable for small vocabulary recognition

Phone models: suitable for large vocabulary recognition. Problem: they over-generalize and are less accurate.

Syllable models

Context dependency (1)

Recognition accuracy can be improved by using context-dependent parameters.

Important in fast / spontaneous speech.

Example: the phoneme /ee/

Peat

Wheel

Context dependency (2)

Triphone model: phonetic model that takes into consideration both the left and the right neighbouring phones.

If two phones have the same identity, but different left or right contexts, they are considered different triphones.
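A minimal sketch of how a phone sequence expands into triphones. The "left-center+right" label format, the "sil" boundary symbol, and the phone symbols (with "iy" standing for the /ee/ vowel) are assumptions made for illustration, not notation taken from the slides.

```python
def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbour as 'left-center+right'."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

# Assumed phone transcriptions of the two example words.
peat = to_triphones(["p", "iy", "t"])
wheel = to_triphones(["w", "iy", "l"])

print(peat)   # ['sil-p+iy', 'p-iy+t', 'iy-t+sil']
print(wheel)  # ['sil-w+iy', 'w-iy+l', 'iy-l+sil']
```

The same vowel yields two different triphones, `p-iy+t` and `w-iy+l`, because its neighbours differ.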

Interword context-dependent phones.

Place in the word: Beginning, Middle, End

Context dependency (3)

Stress: Longer duration, Higher pitch, More intensity

Word-level stress: Import – Import, Italy – Italian

Sentence-level stress: I did have dinner. I did have dinner.

Radio

Radio

Context dependency (4)

Very many triphones: 50^3 = 125,000

Many phonemes have the same effects:

/b/ & /p/: labial (pronounced using the lips)

/r/ & /w/: liquids

Clustered acoustic-phonetic units:

Is the left-context phone a fricative?

Is the right-context phone a front vowel?
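The count and the clustering idea can be illustrated with a small sketch. It assumes an inventory of 50 phonemes, as the count above implies, and uses only the two classes named on the slide; the function name, the class table, and the example triphones are hypothetical.

```python
from collections import defaultdict

PHONEME_COUNT = 50                   # assumed inventory size
triphone_count = PHONEME_COUNT ** 3  # every (left, center, right) combination
print(triphone_count)                # 125000

# Broad classes taken from the slide; a real system would use many more questions.
BROAD_CLASS = {"b": "labial", "p": "labial", "r": "liquid", "w": "liquid"}

def cluster_by_left_context(triphones):
    """Merge triphones whose left-context phones fall in the same broad class."""
    clusters = defaultdict(list)
    for left, center, right in triphones:
        key = (BROAD_CLASS.get(left, left), center, right)
        clusters[key].append((left, center, right))
    return dict(clusters)

triphones = [("b", "iy", "t"), ("p", "iy", "t"), ("r", "iy", "t"), ("w", "iy", "t")]
clusters = cluster_by_left_context(triphones)
print(len(clusters))  # 2 clustered units instead of 4 separate triphones
```

Sharing parameters within a cluster means fewer models have to be trained, at the cost of some context detail.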

Acoustic model

After feature extraction, we have a sequence of feature vectors, such as MFCC vectors, as input data.

Feature stream -> Phonemes / units -> Words

Feature stream to phonemes / units: Segmentation and labeling

Phonemes / units to words: Lexical access problem

Acoustic model: Signal -> Phonemes

Problem: phonemes can be pronounced differently

Speaker differences, Speaking rate, Microphone

Acoustic model: Phonemes -> Words

The three major ways to do this: Vector Quantization, Hidden Markov Models, Neural Networks
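As an illustration of the first of these, vector quantization replaces each feature vector by the index of its nearest codebook entry, turning the feature stream into a sequence of discrete symbols. The sketch below assumes squared Euclidean distance; the two-dimensional codebook and feature values are made up, and real feature vectors such as MFCCs have more dimensions.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of the closest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
            for vec in features]

# Hypothetical codebook with three entries and four incoming feature vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
features = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4), (1.1, 0.8)]

symbols = quantize(features, codebook)
print(symbols)  # [0, 1, 2, 1]
```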

Acoustic model

Problem: Multiple pronunciations

[Figure: two pronunciation networks for one word over the phones t, ow / ax, m, ey / aa, t, ow. Panel titles: Dialect variation; Coarticulation. Branch probabilities shown: 0.5, 0.5, 0.8, 0.2.]
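A pronunciation network like the one above can be sketched as a sequence of stages, each holding alternative phones with probabilities. The flattened figure does not show which probability belongs to which branch, so the assignment below (0.2 / 0.8 for ow / ax, 0.5 / 0.5 for ey / aa) is an assumption, as are the names `NETWORK` and `pronunciations`.

```python
from itertools import product

# One list per stage; each entry is (phone, probability).
# ASSUMED assignment of the figure's probabilities to branches.
NETWORK = [
    [("t", 1.0)],
    [("ow", 0.2), ("ax", 0.8)],  # coarticulation: full vs reduced vowel
    [("m", 1.0)],
    [("ey", 0.5), ("aa", 0.5)],  # dialect variation
    [("t", 1.0)],
    [("ow", 1.0)],
]

def pronunciations(network):
    """Enumerate every path through the network with its probability."""
    result = {}
    for path in product(*network):
        phones = " ".join(phone for phone, _ in path)
        prob = 1.0
        for _, p in path:
            prob *= p
        result[phones] = prob
    return result

variants = pronunciations(NETWORK)
for phones, prob in variants.items():
    print(phones, prob)
```

Under this assumed assignment there are four pronunciations, and their probabilities sum to one.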

The End
