Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology Mark Hasegawa-Johnson [email protected] University of Illinois at Urbana-Champaign, USA


Landmark-Based Speech Recognition:

Spectrogram Reading, Support Vector Machines,

Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson [email protected]

University of Illinois at Urbana-Champaign, USA


Lecture 2: Acoustics of Vowel and Glide Production

• One-Dimensional Linear Acoustics
  – The Acoustic Wave Equation
  – Transmission Lines
  – Standing Wave Patterns

• One-Tube Models
  – Schwa
  – Front cavity resonance of fricatives

• Two-Tube Models
  – The vowel /a/
  – Helmholtz Resonator
  – The vowels /u,i,e/

• Perturbation Theory
  – The vowels /u/, /o/ revisited
  – Glides


1. One-Dimensional Acoustic Wave Equation and Solutions


Acoustics: Constitutive Equations
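The equations on this slide were lost in extraction. As a reference sketch, the standard one-dimensional linear acoustic equations for a hard-walled tube are shown below in one common notation, assumed here rather than taken from the slide: p(x,t) is acoustic pressure, u(x,t) is volume velocity (particle velocity times area), A is the cross-sectional area, ρ is air density, and c is the speed of sound.

```latex
\begin{aligned}
-\frac{\partial p}{\partial x} &= \frac{\rho}{A}\,\frac{\partial u}{\partial t}
  && \text{(Newton's second law)}\\
-\frac{\partial u}{\partial x} &= \frac{A}{\rho c^{2}}\,\frac{\partial p}{\partial t}
  && \text{(conservation of mass)}
\end{aligned}
```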


Acoustic Plane Waves: Time Domain
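A sketch of the time-domain solution, assuming the standard linear equations −∂p/∂x = (ρ/A)∂u/∂t and −∂u/∂x = (A/ρc²)∂p/∂t (p pressure, u volume velocity, A area, ρ air density, c speed of sound; notation assumed) and x increasing from the glottis toward the lips. For constant A, eliminating u gives the wave equation, whose general solution is a forward-going wave p⁺ plus a backward-going wave p⁻:

```latex
\frac{\partial^{2} p}{\partial x^{2}} = \frac{1}{c^{2}}\,\frac{\partial^{2} p}{\partial t^{2}}
\quad\Longrightarrow\quad
\begin{aligned}
p(x,t) &= p^{+}\!\left(t - \tfrac{x}{c}\right) + p^{-}\!\left(t + \tfrac{x}{c}\right)\\
u(x,t) &= \frac{A}{\rho c}\left[p^{+}\!\left(t - \tfrac{x}{c}\right) - p^{-}\!\left(t + \tfrac{x}{c}\right)\right]
\end{aligned}
```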


Acoustic Plane Waves: Frequency Domain
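Assuming x runs from the glottis toward the lips, p and u denote pressure and volume velocity, and p(x,t) = p⁺(t − x/c) + p⁻(t + x/c) is the traveling-wave solution (notation assumed, not from the slide), a time delay of x/c becomes multiplication by e^{−jΩx/c} in the frequency domain, with Ω in radians per second:

```latex
\begin{aligned}
P(x, j\Omega) &= P^{+}(j\Omega)\,e^{-j\Omega x/c} + P^{-}(j\Omega)\,e^{\,j\Omega x/c}\\
U(x, j\Omega) &= \frac{A}{\rho c}\left[P^{+}(j\Omega)\,e^{-j\Omega x/c} - P^{-}(j\Omega)\,e^{\,j\Omega x/c}\right]
\end{aligned}
```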


Solution for a Tube with Constant Area and Hard Walls


2. One-Tube Models


Boundary Conditions

[Figure: tube diagram with its ends labeled 0 and L]


Resonant Frequencies
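As a worked example, assume the tube is closed (hard wall, zero volume velocity) at the glottis, x = 0, and open at the lips, x = L, idealized as zero pressure with no radiation impedance or end correction. Substituting these boundary conditions into the frequency-domain solution P(x) = P⁺e^{−jΩx/c} + P⁻e^{jΩx/c}, U(x) = (A/ρc)[P⁺e^{−jΩx/c} − P⁻e^{jΩx/c}] (notation assumed) gives the quarter-wave resonances:

```latex
\begin{aligned}
U(0, j\Omega) = 0 \;&\Longrightarrow\; P^{+} = P^{-}\\
P(L, j\Omega) = 2P^{+}\cos\!\left(\frac{\Omega L}{c}\right) = 0 \;&\Longrightarrow\; \frac{\Omega_n L}{c} = \frac{(2n-1)\pi}{2}\\
&\Longrightarrow\; F_n = \frac{\Omega_n}{2\pi} = \frac{(2n-1)\,c}{4L}, \qquad n = 1, 2, 3, \ldots
\end{aligned}
```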


Standing Wave Patterns


Standing Wave Patterns: Quarter-Wave Resonators

Tube Closed at the Left End, Open at the Right End


Standing Wave Patterns: Half-Wave Resonators

Tube Closed at Both Ends

Tube Open at Both Ends


Schwa and Inverted-V, /ə/ and /ʌ/ (the vowels in “a tug”)

F1 = 500 Hz = c/4L

F2 = 1500 Hz = 3c/4L

F3 = 2500 Hz = 5c/4L
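The schwa numbers can be checked with a short sketch. It assumes c = 35000 cm/s and a 17.5 cm vocal tract, the pair of values for which c/4L = 500 Hz; neither number is stated on the slide, and the function names are hypothetical. The half-wave case is included for comparison; 8.75 cm is simply the length that gives 2000 Hz under the assumed c.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound, cm/s


def quarter_wave_formants(length_cm, count=3, c=SPEED_OF_SOUND_CM_S):
    """Tube closed at one end, open at the other: F_n = (2n - 1) c / 4L."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, count + 1)]


def half_wave_formants(length_cm, count=3, c=SPEED_OF_SOUND_CM_S):
    """Tube closed (or open) at both ends: F_n = n c / 2L for n >= 1.

    A tube closed at both ends also has a 0 Hz mode, which is omitted here.
    """
    return [n * c / (2.0 * length_cm) for n in range(1, count + 1)]


print(quarter_wave_formants(17.5))  # [500.0, 1500.0, 2500.0]  (schwa, L = 17.5 cm)
print(half_wave_formants(8.75))     # [2000.0, 4000.0, 6000.0]
```

Because the formants scale as 1/L, a shorter tract raises all three quarter-wave formants by the same factor.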


Front Cavity Resonances of a Fricative

/s/: Front cavity resonance = 4500 Hz; 4500 Hz = c/4L if the front cavity length is L = 1.9 cm

/sh/: Front cavity resonance = 2200 Hz; 2200 Hz = c/4L if the front cavity length is L = 4.0 cm
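Inverting F = c/4L gives the front cavity length directly. This sketch assumes c = 35000 cm/s, which the slide does not state; front_cavity_length_cm is a hypothetical helper name.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound, cm/s


def front_cavity_length_cm(resonance_hz, c=SPEED_OF_SOUND_CM_S):
    """Quarter-wave front cavity: invert F = c / 4L to get L = c / 4F."""
    return c / (4.0 * resonance_hz)


for label, resonance_hz in [("/s/", 4500.0), ("/sh/", 2200.0)]:
    print(label, round(front_cavity_length_cm(resonance_hz), 1), "cm")
# /s/ 1.9 cm
# /sh/ 4.0 cm
```

With this value of c the results round to the 1.9 cm and 4.0 cm quoted on the slide.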


3. Two-Tube Models


Conservation of Mass at the Juncture of Two Tubes

[Figure: a tube of area A1 joined to a tube of area A2 = A1/2; velocity U1(x,t) in the first tube and U2(x,t) = 2U1(x,t) in the second]

Total liters/second transmitted = (velocity) × (tube area)


Two-Tube Model: Two Different Sets of Waves

[Figure: in tube 1, incident wave P1+ and reflected wave P1−; in tube 2, incident wave P2− and reflected wave P2+]


Two-Tube Model: Solution in the Time Domain


Two-Tube Model in the Frequency Domain
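One way to solve the two-tube model in the frequency domain, sketched under stated assumptions: the back tube (length l_back, area a_back) is closed at the glottis, the front tube (l_front, a_front) is open at the lips with P = 0, walls are hard and lossless, and c = 35000 cm/s. Requiring continuous pressure and volume velocity at the junction reduces the four equations to a_back·tan(k·l_back)·tan(k·l_front) = a_front, with k = 2πF/c. The code multiplies through by both cosines to remove the poles, then finds roots with a grid scan plus bisection; all function names are hypothetical.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound, cm/s


def two_tube_residual(f_hz, l_back, a_back, l_front, a_front, c=SPEED_OF_SOUND_CM_S):
    """Zero at resonance: a_back*sin*sin - a_front*cos*cos (the tan-tan condition
    a_back*tan(k*l_back)*tan(k*l_front) = a_front multiplied by both cosines)."""
    k = 2.0 * math.pi * f_hz / c  # wavenumber, rad/cm
    return (a_back * math.sin(k * l_back) * math.sin(k * l_front)
            - a_front * math.cos(k * l_back) * math.cos(k * l_front))


def two_tube_formants(l_back, a_back, l_front, a_front, f_max=5000.0, step=1.0):
    """Resonances below f_max (Hz): scan a grid for sign changes, refine by bisection."""
    args = (l_back, a_back, l_front, a_front)
    grid = [(i + 0.5) * step for i in range(int(f_max / step))]
    roots = []
    for lo, hi in zip(grid, grid[1:]):
        r_lo = two_tube_residual(lo, *args)
        if r_lo * two_tube_residual(hi, *args) >= 0:
            continue  # no sign change in this interval
        for _ in range(60):  # bisection: halve the bracketing interval
            mid = 0.5 * (lo + hi)
            r_mid = two_tube_residual(mid, *args)
            if r_lo * r_mid <= 0:
                hi = mid
            else:
                lo, r_lo = mid, r_mid
        roots.append(0.5 * (lo + hi))
    return roots


print([round(f) for f in two_tube_formants(8.0, 5.0, 9.5, 5.0, f_max=3000.0)])
# [500, 1500, 2500]
```

For equal areas the condition reduces to cos(k(l_back + l_front)) = 0, which is why the example returns the quarter-wave resonances of a 17.5 cm tube. With a very narrow back tube, e.g. two_tube_formants(8.8, 1.0, 12.6, 1000.0, f_max=1200.0), the roots land within a few hertz of the decoupled values c/4L_front ≈ 694 Hz and c/4L_back ≈ 994 Hz; smaller area ratios couple the tubes and move the roots away from those values. A 1 Hz grid step can miss two roots closer together than that.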


Approximate Solution of the Two-Tube Model, A1>>A2

Approximate solution: assume that the two tubes are completely decoupled, so that the formants include
- F(BACK CAVITY) = c/4LBACK
- F(FRONT CAVITY) = c/4LFRONT

[Figure: two-tube model with back-cavity length LBACK and front-cavity length LFRONT]


The Vowels /AA/, /AH/

LBACK = 8.8 cm: F2 = c/4LBACK = 1000 Hz

LFRONT = 12.6 cm: F1 = c/4LFRONT = 700 Hz


Acoustic Impedance

[Figure: tube diagrams labeled with acoustic impedance Z]


Low-Frequency Approximations of Acoustic Impedance
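A sketch of where these approximations come from, assuming hard lossless walls, a tube of length l and area A, air density ρ, speed of sound c, and impedance defined as pressure divided by the volume velocity flowing into the tube (notation assumed). The exact input impedance of a tube whose far end is closed, or open with P = 0, reduces for Ωl/c ≪ 1 (tube much shorter than a wavelength) via tan θ ≈ θ and cot θ ≈ 1/θ to an acoustic compliance C or an acoustic mass M, where V = Al is the tube volume:

```latex
\begin{aligned}
Z_{\text{closed}}(j\Omega) &= -j\,\frac{\rho c}{A}\cot\!\left(\frac{\Omega l}{c}\right)
  \approx \frac{1}{j\Omega C}, & C &= \frac{A l}{\rho c^{2}} = \frac{V}{\rho c^{2}}\\
Z_{\text{open}}(j\Omega) &= j\,\frac{\rho c}{A}\tan\!\left(\frac{\Omega l}{c}\right)
  \approx j\Omega M, & M &= \frac{\rho l}{A}
\end{aligned}
```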


Helmholtz Resonator

[Figure: Helmholtz resonator diagram with impedances Z1 and Z2; resonance condition −Z1 = Z2]
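Setting the total impedance at the junction to zero (−Z1 = Z2) and using the low-frequency approximations, with the back cavity acting as a compliance C = V/(ρc²) and the short front constriction acting as a mass M = ρl/A, gives the Helmholtz resonance. The symbols are assumed notation: V is the back-cavity volume, l and A are the constriction's length and area.

```latex
\frac{1}{j\Omega C} + j\Omega M = 0
\;\Longrightarrow\; \Omega_H = \frac{1}{\sqrt{MC}}
\;\Longrightarrow\; F_H = \frac{1}{2\pi\sqrt{MC}} = \frac{c}{2\pi}\sqrt{\frac{A}{l\,V}}
```

The air density ρ cancels, so F_H depends only on c, V, and the ratio l/A.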


The Vowel /i/

Back Cavity = Pharynx; resonances: 0 Hz, 2000 Hz, 4000 Hz
Front Cavity = Palatal Constriction; resonances: 0 Hz, 2500 Hz, 5000 Hz

Back Cavity Volume = 70 cm³

Front Cavity Length/Area = 7 cm⁻¹

1/(2π√(MC)) = 250 Hz

Helmholtz Resonance replaces all 0 Hz partial-tube resonances.

[Figure: resulting formants at 250 Hz, 2000 Hz, and 2500 Hz]
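A sketch of the 250 Hz calculation for /i/, assuming c = 35000 cm/s (not stated on the slide) and F = 1/(2π√(MC)) = (c/2π)√(A/(lV)) with M = ρl/A and C = V/(ρc²); helmholtz_hz is a hypothetical name.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound, cm/s


def helmholtz_hz(volume_cm3, length_over_area_per_cm, c=SPEED_OF_SOUND_CM_S):
    """Helmholtz resonance F = (c / 2pi) * sqrt(A / (l * V)).

    volume_cm3: back-cavity volume V in cm^3.
    length_over_area_per_cm: constriction length/area l/A in cm^-1.
    Air density cancels between M = rho*l/A and C = V/(rho*c^2).
    """
    return c / (2.0 * math.pi) * math.sqrt(1.0 / (length_over_area_per_cm * volume_cm3))


print(round(helmholtz_hz(70.0, 7.0)))  # 252, close to the slide's 250 Hz for /i/
```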


The Vowel /u/: A Two-Tube Model

Back Cavity = Mouth + Pharynx; resonances: 0 Hz, 1000 Hz, 2000 Hz
Front Cavity = Lips; resonances: 0 Hz, 18000 Hz, …

Back Cavity Volume = 200 cm³

Front Cavity Length/Area = 2 cm⁻¹

1/(2π√(MC)) = 250 Hz

Helmholtz Resonance replaces all 0 Hz partial-tube resonances.

[Figure: resulting formants at 250 Hz, 1000 Hz, and 2000 Hz]


The Vowel /u/: A Four-Tube Model

Two Helmholtz Resonators = Two Low-Frequency Formants! F1 = 250 Hz, F2 = 500 Hz

F3 = Pharynx resonance, c/2L = 2000 Hz

[Figure: four-tube model of /u/ with sections labeled Pharynx, Velar Tongue-Body Constriction, Mouth, and Lips; formants marked at 250 Hz, 500 Hz, and 2000 Hz]


4. Perturbation Theory


Perturbation Theory (Chiba and Kajiyama, The Vowel, 1940)

A(x) is constant everywhere, except for one small perturbation.

Method:
1. Compute the formants of the “unperturbed” vocal tract.
2. Perturb the formant frequencies to match the area perturbation.


Conservation of Energy Under Perturbation


Conservation of Energy Under Perturbation
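A standard first-order form of this energy argument, stated as a sketch in assumed notation: for mode n with pressure P_n(x) and volume velocity U_n(x), KE_n and PE_n are the time-averaged kinetic and potential energy per unit length, and ΔA(x) is a small area perturbation. Narrowing the tube (ΔA < 0) where kinetic energy dominates lowers F_n; narrowing it where potential energy dominates raises F_n.

```latex
\frac{\Delta F_n}{F_n} \approx
\frac{\displaystyle\int_0^L \bigl[\mathrm{KE}_n(x) - \mathrm{PE}_n(x)\bigr]\,\frac{\Delta A(x)}{A(x)}\,dx}
     {\displaystyle\int_0^L \bigl[\mathrm{KE}_n(x) + \mathrm{PE}_n(x)\bigr]\,dx},
\qquad
\mathrm{KE}_n(x) = \frac{\rho\,|U_n(x)|^{2}}{4A(x)},
\quad
\mathrm{PE}_n(x) = \frac{A(x)\,|P_n(x)|^{2}}{4\rho c^{2}}
```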


“Sensitivity” Functions


Sensitivity Functions for the Quarter-Wave Resonator (Lips Open)

[Figure: sensitivity functions plotted along a tube of length L, labeled /AA/, /ER/, /IY/, and /W/]


Sensitivity Functions for the Half-Wave Resonator (Lips Rounded)

[Figure: sensitivity functions plotted along a tube of length L, labeled /L/, /OW/, and /UW/]
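For a uniform tube closed at the glottis (x = 0), the energy-based first-order formula reduces to ΔF_n/F_n ≈ (1/L)∫S_n(x)·ΔA(x)/A dx, with S_n(x) = −cos((2n−1)πx/L) for the quarter-wave resonator (lips open) and S_n(x) = −cos(2nπx/L) when the lips are idealized as closed (half-wave resonator). The sketch below evaluates that integral numerically; it assumes c = 35000 cm/s, a 17.5 cm tract, small perturbations (|ΔA/A| ≪ 1), and hypothetical function names and area profiles.

```python
import math

SPEED_OF_SOUND_CM_S = 35000.0  # assumed speed of sound, cm/s


def sensitivity(n, x, length, lips_open=True):
    """S_n(x): (KE - PE) for mode n, divided by the tube-averaged total energy.

    Uniform tube closed at x = 0 (glottis). lips_open=True is the quarter-wave
    resonator; lips_open=False treats the lips as closed (half-wave resonator).
    """
    m = 2 * n - 1 if lips_open else 2 * n
    return -math.cos(m * math.pi * x / length)


def perturbed_formant(n, delta_area_ratio, length, lips_open=True, steps=2000,
                      c=SPEED_OF_SOUND_CM_S):
    """First-order estimate F_n * (1 + (1/L) * integral of S_n(x) * dA(x)/A dx).

    delta_area_ratio(x) returns dA(x)/A at position x (cm); keep it small.
    """
    f0 = (2 * n - 1) * c / (4.0 * length) if lips_open else n * c / (2.0 * length)
    dx = length / steps
    integral = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx  # midpoint rule
        integral += sensitivity(n, x, length, lips_open) * delta_area_ratio(x) * dx
    return f0 * (1.0 + integral / length)


L_CM = 17.5  # assumed vocal tract length, cm


def lip_narrowing(x):
    """Hypothetical profile: 20% area reduction over the last 1 cm before the lips."""
    return -0.2 if x > L_CM - 1.0 else 0.0


print(round(perturbed_formant(1, lip_narrowing, L_CM)))  # 494: lip narrowing lowers F1
```

The same call with a narrowing near the glottis (x < 1 cm) returns a value above 500 Hz, and a uniform change in area leaves every formant unchanged to first order, which matches the sign pattern the sensitivity charts are meant to show.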


Formant Frequencies of Vowels

From Peterson & Barney, 1952


Summary

• The acoustic wave equation is easiest to solve in the frequency domain, for example:
  – Solve the two boundary-condition equations for P+ and P−, or
  – Solve the two-tube model (four equations in four unknowns)

• Quarter-Wave Resonator: open at one end, closed at the other
  – Schwa or inverted-v /ʌ/ (“a tug”)
  – Front cavity resonance of a fricative or stop

• Half-Wave Resonator: closed at the glottis, nearly closed at the lips
  – /uw/

• Two-Tube Models
  – Exact solution: use the reflection coefficient
  – Approximate solution: decouple the tubes and solve each separately

• Helmholtz Resonator
  – When the two-tube model seems to have resonances at 0 Hz, use the Helmholtz resonance frequency instead, computed with low-frequency approximations of acoustic impedance
  – /iy/: F1 is a Helmholtz resonance
  – /uw/ and /ow/: both F1 and F2 are Helmholtz resonances

• Perturbation Theory
  – Perturbed area → perturbed formants
  – The sensitivity function explains most vowels and glides in one simple chart