
Page 1: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Mark Hasegawa-Johnson
AIVR Seminar
August 31, 2006

AVICAR is thanks to: Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, Sarah Borys, and to the Motorola Communications Center.

Some experiments and most good ideas in this talk thanks to Ming Liu, Karen Livescu, Kate Saenko, and Partha Lal.

Page 2: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Why AVSR is not like ASR

• Use of classifiers as features: e.g., the output of an AdaBoost lip tracker is a feature in a face constellation.
• Obstruction: the tongue is rarely visible, the glottis never.
• Asynchrony: visual evidence for a word can start long before the audio evidence.

Which digit is she about to say?

Page 3: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Why ASR is like AVSR

• Use of classifiers as features: e.g., neural networks or SVMs transform audio spectra into a phonetic feature space.
• Obstruction: lip closure “hides” tongue closure; a glottal stop “hides” lip or tongue position.
• Asynchrony: tongue, lips, velum, and glottis can be out of sync, e.g., “every” → “ervy”.

Page 4: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Discriminative Features in Face/Lip Tracking: AdaBoost

1. Each Haar wavelet defines a “weak classifier”: h_i(x) = 1 if f_i(x) > threshold, else h_i(x) = 0.
2. Start with equal weight for all M training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M.
3. For each learning iteration t:
• Find the i that minimizes the weighted training error ε_t.
• Decrease w_m if token m was correctly classified, otherwise increase it.
• α_t = log((1 − ε_t)/ε_t)
• The final “strong classifier” is H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t.
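The loop above translates almost directly into code. Below is a minimal sketch of discrete AdaBoost with threshold ("decision stump") weak classifiers over precomputed feature values f_i(x); it is a generic illustration of the algorithm, not the AVICAR tracker itself, and all names are illustrative.

```python
import numpy as np

def train_adaboost(F, y, n_rounds=10):
    """Discrete AdaBoost with threshold weak classifiers.
    F: (M, D) matrix of precomputed feature values f_i(x) per training token.
    y: (M,) labels in {0, 1}.  Returns a list of (feature index, threshold, alpha)."""
    M, D = F.shape
    w = np.full(M, 1.0 / M)                 # step 2: equal initial weights
    strong = []
    for t in range(n_rounds):               # step 3: boosting iterations
        best = None
        for i in range(D):                  # find the feature/threshold with lowest weighted error
            for theta in np.unique(F[:, i]):
                h = (F[:, i] > theta).astype(int)
                eps = np.sum(w * (h != y))
                if best is None or eps < best[0]:
                    best = (eps, i, theta)
        eps, i, theta = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)      # alpha_t = log((1 - eps_t) / eps_t)
        h = (F[:, i] > theta).astype(int)
        # after normalization this decreases the weights of correctly classified
        # tokens and increases the weights of misclassified ones
        w *= np.exp(alpha * (h != y))
        w /= w.sum()
        strong.append((i, theta, alpha))
    return strong

def strong_classifier(strong, x):
    """H(x) = 1 iff sum_t alpha_t h_t(x) > 0.5 * sum_t alpha_t."""
    vote = sum(a * (x[i] > th) for i, th, a in strong)
    return int(vote > 0.5 * sum(a for _, _, a in strong))
```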

Page 5: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Example Haar Wavelet Features Selected by AdaBoost

Page 6: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

AdaBoost in a Bayesian Context

• The AdaBoost “margin” M_D(x) = Σ_t α_t h_t(x) / Σ_t α_t is the α-weighted fraction of weak classifiers that vote 1.
• Guaranteed range: 0 ≤ M_D(x) ≤ 1.
• An inverse sigmoid transform yields nearly normal distributions.
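A sketch of the margin and its inverse-sigmoid (logit) transform, reusing the (feature, threshold, α) triples from the previous sketch; the margin is computed as the α-weighted fraction of weak classifiers voting 1, consistent with the guaranteed [0, 1] range stated above.

```python
import numpy as np

def adaboost_margin(strong, x):
    """M_D(x) = sum_t alpha_t h_t(x) / sum_t alpha_t, guaranteed to lie in [0, 1]."""
    num = sum(a * (x[i] > th) for i, th, a in strong)
    den = sum(a for _, _, a in strong)
    return num / den

def logit(m, eps=1e-6):
    """Inverse sigmoid transform: maps (0, 1) onto the real line so that the
    class-conditional distributions of the margin become closer to Gaussian."""
    m = np.clip(m, eps, 1 - eps)
    return np.log(m / (1 - m))
```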

Page 7: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Prior: Relative Position of Lips in the Face

p(r = r_lips | M_D(x)) ∝ p(r = r_lips) · p(M_D(x) | r = r_lips)
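Read as Bayes' rule, the posterior over candidate lip positions combines a geometric prior with the likelihood of the AdaBoost margin observed at each position. The sketch below shows one way such a MAP search might be scored, assuming Gaussian models for both the prior on lip position within the face box and the class-conditional logit-transformed margin; this is a construction consistent with the Gaussian facial-geometry model mentioned in the conclusions, not the exact scoring used in the talk, and all numbers and names are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def map_lip_position(candidates, margins,
                     prior_mean, prior_cov, lik_mean=2.0, lik_std=1.0):
    """Pick the candidate lip position r maximizing
    log p(r = r_lips) + log p(M_D(x_r) | r = r_lips).
    candidates: (N, 2) positions relative to the detected face box.
    margins:    (N,) logit-transformed AdaBoost margins of the patch at each r."""
    prior = multivariate_normal(mean=prior_mean, cov=prior_cov)
    scores = prior.logpdf(candidates) + norm(lik_mean, lik_std).logpdf(margins)
    return candidates[np.argmax(scores)]

# Hypothetical usage: positions normalized to the face box, margins precomputed.
r_hat = map_lip_position(
    candidates=np.array([[0.50, 0.78], [0.50, 0.70], [0.45, 0.80]]),
    margins=np.array([1.9, 0.2, 2.4]),
    prior_mean=[0.5, 0.75], prior_cov=[[0.01, 0.0], [0.0, 0.01]])
print(r_hat)
```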

Page 8: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Lip Tracking: a few results

Page 9: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Pixel-Based Features

Page 10: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Pixel-Based Features: Dimension

Page 11: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Model-Based Correction for Head-Pose Variability

• If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of the roll ρ, yaw ψ, pitch φ, true height ħ_F, and true width w_F… [equation shown on slide]
• … which can usefully be approximated as… [approximation shown on slide]

Page 12: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Robust Correction: Linear Regression

• The additive random part of the lip width, w_L(t) = w_L1 + ħ_L cos ψ(t) sin ρ(t), is proportional to the corresponding additive variation in the head width, w_F(t) = w_F1 + ħ_F cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).
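A sketch of the orthogonalization step: given per-frame lip-width and face-width tracks, regress w_L(t) on w_F(t) by least squares and keep the residual, which removes the shared cos ψ(t) sin ρ(t) head-pose term. Variable names and the synthetic data are illustrative, not the AVICAR pipeline itself.

```python
import numpy as np

def orthogonalize(w_lip, w_face):
    """Remove the component of the lip-width track that is linearly
    predictable from the face-width track (the shared head-pose variation).
    Returns the residual lip-width feature."""
    X = np.column_stack([np.ones_like(w_face), w_face])  # regressors [1, w_F(t)]
    beta, *_ = np.linalg.lstsq(X, w_lip, rcond=None)      # least-squares fit
    return w_lip - X @ beta                                # residual: w_L orthogonal to w_F

# Hypothetical usage with synthetic tracks sharing a head-pose term:
t = np.linspace(0.0, 1.0, 100)
pose = np.cos(0.3 * t) * np.sin(0.2 * t)        # shared cos(psi) sin(rho) variation
w_face = 50.0 + 30.0 * pose
w_lip = 20.0 + 12.0 * pose + np.random.randn(100)  # pose term plus genuine lip motion
w_lip_corrected = orthogonalize(w_lip, w_face)
```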

Page 13: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

WER Results from AVICAR (testing on the training data; 34 talkers, continuous digits)

[Bar chart of word error rates; legend:]
• LR = linear regression
• Model = model-based head-pose compensation
• LLR = log-linear regression
• 13+d+dd = 13 static features (plus deltas and delta-deltas)
• 39 = 39 static features
All systems have mean and variance normalization and MLLR.

Page 14: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Audio-Visual Asynchrony

For example, the tongue touches the teeth before acoustic speech onset in the word “three;” the lips are already round in anticipation of the /r/.

Page 15: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model (Chu and Huang, 2002)

[Figure: coupled HMM with an acoustic channel and a visual channel of hidden states over frames t = 1, 2, 3, …, T.]

Page 16: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

A Physical Model of Asynchrony (slide created by Karen Livescu)

Articulatory Phonology [Browman & Goldstein ’90]: the following 8 tract variables are independently & asynchronously controlled.

• LIP-LOC: Protruded, Labial, Dental
• LIP-OP: CLosed, CRitical, Narrow, Wide
• TT-LOC: Dental, Alveolar, Palato-Alveolar, Retroflex
• TB-LOC: Palatal, Velar, Uvular, Pharyngeal
• TT-OP, TB-OP: CLosed, CRitical, Narrow, Mid-Narrow, Mid, Wide
• GLO: CLosed (stop), CRitical (voiced), Open (voiceless)
• VEL: CLosed (non-nasal), Open (nasal)

[Figure: vocal-tract diagram labeling LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, and GLOTTIS.]

For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).
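As a simple illustration, the collapse can be written as a lookup from tract variable to stream. The slide names only the three streams, so the exact grouping below, and in particular where the velum is assigned, is an assumption for illustration rather than a detail given in the talk.

```python
# Assumed grouping of the 8 tract variables into the lips / tongue / glottis (LTG) streams.
# The slide names only the three streams; the velum assignment below is a guess.
STREAM_OF_TRACT_VARIABLE = {
    "LIP-LOC": "lips",
    "LIP-OP": "lips",
    "TT-LOC": "tongue",
    "TT-OP": "tongue",
    "TB-LOC": "tongue",
    "TB-OP": "tongue",
    "GLO": "glottis",
    "VEL": "glottis",   # assumption: grouped with the glottal stream
}

def collapse(tract_values):
    """Map a dict of tract-variable values onto the three recognition streams."""
    streams = {"lips": [], "tongue": [], "glottis": []}
    for var, val in tract_values.items():
        streams[STREAM_OF_TRACT_VARIABLE[var]].append((var, val))
    return streams

print(collapse({"LIP-OP": "CRitical", "TT-LOC": "Alveolar", "GLO": "Open"}))
```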

Page 17: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Motivation: Pronunciation Variation (slide created by Karen Livescu)

Each word is listed with its baseform and its observed surface (actual) pronunciations, with counts:

probably — baseform: p r aa b ax b l iy
(2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy

sense — baseform: s eh n s
(1) s eh n t s, (1) s ih t s

everybody — baseform: eh v r iy b ah d iy
(1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

don’t — baseform: d ow n t
(37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, ...

[Plot: number of pronunciations per word (0–80) vs. minimum number of occurrences (0–200).]

Page 18: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Explanation: Asynchrony of Tract Variables (based on a slide created by Karen Livescu)

[Figure: gestural scores for the word “sense.” The dictionary entry aligns the phones /s eh n s/ with tongue (T) feature values (crit/alveolar, mid/palatal, closed/alveolar + nasal, crit/alveolar) and glottis (G) feature values (open, critical, open). Surface variant #1, /s eh n t s/, arises from the same feature values with feature asynchrony; surface variant #2 arises from feature asynchrony plus substitution.]

Page 19: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Implementation: Multi-stream DBN (slide created by Karen Livescu)

• Phone-based: q (phonetic state) generates o (observation vector).
• Articulatory feature-based: L (state of lips), T (state of tongue), and G (state of glottis) jointly generate o (observation vector).

Page 20: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Baseline: Audio-only phone-based HMM (slide created by Partha Lal)

• positionInWordA ∈ {0, 1, 2, ...}
• stateTransitionA ∈ {0, 1}
• phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsA

Page 21: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Baseline: Video-only phone-based HMM (slide created by Partha Lal)

• positionInWordV ∈ {0, 1, 2, ...}
• stateTransitionV ∈ {0, 1}
• phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsV

Page 22: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Audio-visual HMM without asynchrony (slide created by Partha Lal)

• positionInWord ∈ {0, 1, 2, ...}
• stateTransition ∈ {0, 1}
• phoneState ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• observations obsA and obsV, both generated by the single shared phone state

Page 23: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Phoneme-Viseme CHMM (slide created by Partha Lal)

Audio chain:
• positionInWordA ∈ {0, 1, 2, ...}
• stateTransitionA ∈ {0, 1}
• phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsA

Video chain:
• positionInWordV ∈ {0, 1, 2, ...}
• stateTransitionV ∈ {0, 1}
• phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsV
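What the coupled structure buys can be sketched concretely: each chain keeps its own position in the word, and the coupling is a constraint on how far apart those positions may drift. The snippet below enumerates the legal joint states under an assumed |positionInWordA − positionInWordV| ≤ k constraint; it is a schematic illustration of the state space, not the actual DBN implementation used in the experiments.

```python
from itertools import product

def joint_states(word_length, max_async):
    """Enumerate legal joint (positionInWordA, positionInWordV) pairs for one word,
    subject to |positionInWordA - positionInWordV| <= max_async.
    max_async = 0 reproduces the synchronous audio-visual HMM; larger values let
    the visual stream lead or lag the audio stream by up to max_async states."""
    return [(a, v) for a, v in product(range(word_length), repeat=2)
            if abs(a - v) <= max_async]

# For a 3-state word: synchronous = 3 joint states, 1 state of asynchrony = 7,
# unlimited asynchrony (here, 2 states for a 3-state word) = 9.
for k in (0, 1, 2):
    print(k, len(joint_states(3, k)))
```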

Page 24: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Articulatory Feature CHMM

Lips chain:
• positionInWordL ∈ {0, 1, 2, ...}
• stateTransitionL ∈ {0, 1}
• L ∈ { /OP/1, /OP/2, /RND/1, … }

Tongue chain:
• positionInWordT ∈ {0, 1, 2, ...}
• stateTransitionT ∈ {0, 1}
• T ∈ { /CL-ALV/1, /CL-ALV/2, /MID-UV/1, … }

Glottis chain:
• positionInWordG ∈ {0, 1, 2, ...}
• stateTransitionG ∈ {0, 1}
• G ∈ { /OP/1, /OP/2, /CRIT/1, … }

Observations: obsA and obsV

Page 25: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Asynchrony Experiments: CUAVE

• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup:
– Training on clean data; number of Gaussians tuned on the clean dev set
– Audio/video stream weights tuned on noise-specific dev sets (see the fusion sketch after this list)
– Uniform (“zero-gram”) language model
– Decoding constrained to 10-word utterances (avoids language-model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for the initial baselines and audio observations
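The talk does not spell out the fusion rule, but a common way to realize tuned audio/video weights is a weighted combination of the per-stream log-likelihoods; the sketch below shows that standard form under assumed names, with an audio weight that would be tuned separately for each noise condition.

```python
import numpy as np

def fused_log_likelihood(loglik_audio, loglik_video, audio_weight):
    """Combine per-state log-likelihoods of the two streams:
    log p(o_A, o_V | q) ~ w * log p(o_A | q) + (1 - w) * log p(o_V | q),
    where the weight w is tuned on a noise-specific dev set."""
    return audio_weight * loglik_audio + (1.0 - audio_weight) * loglik_video

# Hypothetical per-state scores for one frame: when the audio is noisy,
# a smaller audio weight defers more to the video stream.
la = np.array([-12.0, -10.5, -9.0])
lv = np.array([-8.0, -11.0, -10.0])
print(fused_log_likelihood(la, lv, audio_weight=0.9))   # audio-dominated fusion
print(fused_log_likelihood(la, lv, audio_weight=0.3))   # video-dominated fusion
```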

Page 26: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 1: Should we use video?
Answer: Fusion WER < single-stream WER. (Novelty: none; many authors have reported this.)

[Bar chart: WER (%) for Audio, Video, and Audiovisual systems at CLEAN, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB; vertical axis 0–90.]

Page 27: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 2: Should the streams be asynchronous?
Answer: Asynchronous WER < synchronous WER (4% absolute at mid SNRs). (Novelty: first phone-based AVSR with inter-phone asynchrony.)

[Bar chart: WER (%) for No Asynchrony, 1 State Async, 2 States Async, and Unlimited Async at CLEAN, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB; vertical axis 0–70.]

Page 28: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 3: Should asynchrony be modeled using articulatory features?
Answer: Articulatory-feature WER = phoneme-viseme WER. (Novelty: first articulatory-feature model for AVSR.)

[Bar chart: WER (%) for Phone-viseme vs. Articulatory features at Clean, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB; vertical axis 0–80.]

Page 29: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 4: Can the AF system help the CHMM correct mistakes?
Answer: The combination AF + PV gives the best results on this database.
Details: Systems vote to determine the label of each word (NIST ROVER; see the voting sketch after this chart). PV = phone-viseme, AF = articulatory features.

[Bar chart: WER on devtest, averaged across SNRs (vertical axis 17–23%), for: Rover, best three w/ AF; Rover, best three w/o AF; PV, 2 states async; AF; PV, 1 state async.]
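The ROVER combination can be pictured as word-level voting. Below is a minimal sketch of the voting step, assuming the hypotheses have already been aligned word-by-word (the real NIST rover tool also constructs that alignment); the systems and example hypotheses here are illustrative.

```python
from collections import Counter

def vote(aligned_hypotheses):
    """aligned_hypotheses: list of word sequences, one per system, already aligned
    so that position i refers to the same slot in every hypothesis.
    Each slot is decided by majority vote (ties broken by system order)."""
    n_slots = len(aligned_hypotheses[0])
    result = []
    for i in range(n_slots):
        words = [hyp[i] for hyp in aligned_hypotheses]
        result.append(Counter(words).most_common(1)[0][0])
    return result

# Hypothetical 10-digit hypotheses from three systems (e.g., PV, AF, PV 2-states-async):
sys_a = "one two three four five six seven eight nine zero".split()
sys_b = "one two tree four five six seven eight nine zero".split()
sys_c = "one two three four nine six seven eight nine zero".split()
print(vote([sys_a, sys_b, sys_c]))
```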

Page 30: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Conclusions

• Classifiers as features: AdaBoost “margin” outputs can be used as features in a Gaussian model of facial geometry.
• Head-pose correction in noise: the best correction algorithm uses linear regression followed by model-based correction.
• Asynchrony matters: the best phone-based recognizer is a CHMM with two states of asynchrony allowed between audio and video.
• Articulatory feature models complement phone models: the two systems have identical WER, and the best result is obtained when systems of both types are combined using ROVER.