
Page 1: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Mark Hasegawa-Johnson
AIVR Seminar
August 31, 2006

AVICAR is thanks to: Bowon Lee, Ming Liu, Camille Goudeseune, Suketu Kamdar, Carl Press, Sarah Borys, and to the Motorola Communications Center.

Some experiments and most good ideas in this talk thanks to Ming Liu, Karen Livescu, Kate Saenko, and Partha Lal.

Page 2: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Why AVSR is not like ASR

• Use of classifiers as features: e.g., the output of an AdaBoost lip tracker is a feature in a face constellation.
• Obstruction: the tongue is rarely visible, the glottis never.
• Asynchrony: visual evidence for a word can start long before the audio evidence.

Which digit is she about to say?

Page 3: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Why ASR is like AVSR

• Use of classifiers as features: e.g., neural networks or SVMs transform audio spectra into a phonetic feature space.
• Obstruction: lip closure “hides” tongue closure; a glottal stop “hides” lip or tongue position.
• Asynchrony: tongue, lips, velum, and glottis can be out of sync, e.g., “every” → “ervy”.

Page 4: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Discriminative Features in Face/Lip Tracking: AdaBoost

1. Each Haar wavelet defines a “weak classifier”: h_i(x) = 1 if f_i(x) > threshold, else h_i(x) = 0.
2. Start with equal weight for all M training tokens: w_m(1) = 1/M, 1 ≤ m ≤ M.
3. For each learning iteration t:
• Find the i that minimizes the weighted training error ε_t.
• Decrease w_m if token m was correctly classified, otherwise increase it.
• α_t = log((1 − ε_t)/ε_t)
• The final “strong classifier” is H(x) = 1 iff Σ_t α_t h_t(x) > ½ Σ_t α_t.
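The loop above translates almost directly into code. Below is a minimal sketch of discrete AdaBoost with threshold ("decision stump") weak classifiers over precomputed feature values f_i(x); it is a generic illustration of the algorithm, not the AVICAR tracker itself, and all names are illustrative.

```python
import numpy as np

def train_adaboost(F, y, n_rounds=10):
    """Discrete AdaBoost with threshold weak classifiers.
    F: (M, D) matrix of precomputed feature values f_i(x) per training token.
    y: (M,) labels in {0, 1}.  Returns a list of (feature index, threshold, alpha)."""
    M, D = F.shape
    w = np.full(M, 1.0 / M)                 # step 2: equal initial weights
    strong = []
    for t in range(n_rounds):               # step 3: boosting iterations
        best = None
        for i in range(D):                  # find the feature/threshold with lowest weighted error
            for theta in np.unique(F[:, i]):
                h = (F[:, i] > theta).astype(int)
                eps = np.sum(w * (h != y))
                if best is None or eps < best[0]:
                    best = (eps, i, theta)
        eps, i, theta = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - eps) / eps)      # alpha_t = log((1 - eps_t) / eps_t)
        h = (F[:, i] > theta).astype(int)
        # after normalization this decreases the weights of correctly classified
        # tokens and increases the weights of misclassified ones
        w *= np.exp(alpha * (h != y))
        w /= w.sum()
        strong.append((i, theta, alpha))
    return strong

def strong_classifier(strong, x):
    """H(x) = 1 iff sum_t alpha_t h_t(x) > 0.5 * sum_t alpha_t."""
    vote = sum(a * (x[i] > th) for i, th, a in strong)
    return int(vote > 0.5 * sum(a for _, _, a in strong))
```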

Page 5: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Example Haar Wavelet Features Selected by AdaBoost

Page 6: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

AdaBoost in a Bayesian Context

• The AdaBoost “margin” M_D(x) = Σ_t α_t h_t(x) / Σ_t α_t is the α-weighted fraction of weak classifiers that vote 1.
• Guaranteed range: 0 ≤ M_D(x) ≤ 1.
• An inverse sigmoid transform yields nearly normal distributions.
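A sketch of the margin and its inverse-sigmoid (logit) transform, reusing the (feature, threshold, α) triples from the previous sketch; the margin is computed as the α-weighted fraction of weak classifiers voting 1, consistent with the guaranteed [0, 1] range stated above.

```python
import numpy as np

def adaboost_margin(strong, x):
    """M_D(x) = sum_t alpha_t h_t(x) / sum_t alpha_t, guaranteed to lie in [0, 1]."""
    num = sum(a * (x[i] > th) for i, th, a in strong)
    den = sum(a for _, _, a in strong)
    return num / den

def logit(m, eps=1e-6):
    """Inverse sigmoid transform: maps (0, 1) onto the real line so that the
    class-conditional distributions of the margin become closer to Gaussian."""
    m = np.clip(m, eps, 1 - eps)
    return np.log(m / (1 - m))
```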

Page 7: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Prior: Relative Position of Lips in the Face

p(r = r_lips | M_D(x)) ∝ p(r = r_lips) · p(M_D(x) | r = r_lips)
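Read as Bayes' rule, the posterior over candidate lip positions combines a geometric prior with the likelihood of the AdaBoost margin observed at each position. The sketch below shows one way such a MAP search might be scored, assuming Gaussian models for both the prior on lip position within the face box and the class-conditional logit-transformed margin; this is a construction consistent with the Gaussian facial-geometry model mentioned in the conclusions, not the exact scoring used in the talk, and all numbers and names are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def map_lip_position(candidates, margins,
                     prior_mean, prior_cov, lik_mean=2.0, lik_std=1.0):
    """Pick the candidate lip position r maximizing
    log p(r = r_lips) + log p(M_D(x_r) | r = r_lips).
    candidates: (N, 2) positions relative to the detected face box.
    margins:    (N,) logit-transformed AdaBoost margins of the patch at each r."""
    prior = multivariate_normal(mean=prior_mean, cov=prior_cov)
    scores = prior.logpdf(candidates) + norm(lik_mean, lik_std).logpdf(margins)
    return candidates[np.argmax(scores)]

# Hypothetical usage: positions normalized to the face box, margins precomputed.
r_hat = map_lip_position(
    candidates=np.array([[0.50, 0.78], [0.50, 0.70], [0.45, 0.80]]),
    margins=np.array([1.9, 0.2, 2.4]),
    prior_mean=[0.5, 0.75], prior_cov=[[0.01, 0.0], [0.0, 0.01]])
print(r_hat)
```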

Page 8: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Lip Tracking: a few results

Page 9: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Pixel-Based Features

Page 10: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Pixel-Based Features: Dimension

Page 11: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Model-Based Correction for Head-Pose Variability

• If the head is an ellipse, its measured width w_F(t) and height h_F(t) are functions of the roll ρ, yaw ψ, pitch φ, true height ħ_F, and true width w_F… [equation shown on slide]
• … which can usefully be approximated as… [approximation shown on slide]

Page 12: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Robust Correction: Linear Regression

• The additive random part of the lip width, w_L(t) = w_L1 + ħ_L cos ψ(t) sin ρ(t), is proportional to the corresponding additive variation in the head width, w_F(t) = w_F1 + ħ_F cos ψ(t) sin ρ(t), so we can eliminate it by orthogonalizing w_L(t) to w_F(t).
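A sketch of the orthogonalization step: given per-frame lip-width and face-width tracks, regress w_L(t) on w_F(t) by least squares and keep the residual, which removes the shared cos ψ(t) sin ρ(t) head-pose term. Variable names and the synthetic data are illustrative, not the AVICAR pipeline itself.

```python
import numpy as np

def orthogonalize(w_lip, w_face):
    """Remove the component of the lip-width track that is linearly
    predictable from the face-width track (the shared head-pose variation).
    Returns the residual lip-width feature."""
    X = np.column_stack([np.ones_like(w_face), w_face])  # regressors [1, w_F(t)]
    beta, *_ = np.linalg.lstsq(X, w_lip, rcond=None)      # least-squares fit
    return w_lip - X @ beta                                # residual: w_L orthogonal to w_F

# Hypothetical usage with synthetic tracks sharing a head-pose term:
t = np.linspace(0.0, 1.0, 100)
pose = np.cos(0.3 * t) * np.sin(0.2 * t)        # shared cos(psi) sin(rho) variation
w_face = 50.0 + 30.0 * pose
w_lip = 20.0 + 12.0 * pose + np.random.randn(100)  # pose term plus genuine lip motion
w_lip_corrected = orthogonalize(w_lip, w_face)
```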

Page 13: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

WER Results from AVICAR (testing on the training data; 34 talkers, continuous digits)

[Bar chart of word error rates; legend:]
• LR = linear regression
• Model = model-based head-pose compensation
• LLR = log-linear regression
• 13+d+dd = 13 static features (plus deltas and delta-deltas)
• 39 = 39 static features
All systems have mean and variance normalization and MLLR.

Page 14: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Audio-Visual Asynchrony

For example, the tongue touches the teeth before acoustic speech onset in the word “three;” the lips are already round in anticipation of the /r/.

Page 15: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Audio-Visual Asynchrony: the Coupled HMM is a typical Phoneme-Viseme Model (Chu and Huang, 2002)

[Figure: coupled HMM with an acoustic channel and a visual channel of hidden states over frames t = 1, 2, 3, …, T.]

Page 16: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

A Physical Model of Asynchrony (slide created by Karen Livescu)

Articulatory Phonology [Browman & Goldstein ’90]: the following 8 tract variables are independently & asynchronously controlled.

• LIP-LOC: Protruded, Labial, Dental
• LIP-OP: CLosed, CRitical, Narrow, Wide
• TT-LOC: Dental, Alveolar, Palato-Alveolar, Retroflex
• TB-LOC: Palatal, Velar, Uvular, Pharyngeal
• TT-OP, TB-OP: CLosed, CRitical, Narrow, Mid-Narrow, Mid, Wide
• GLO: CLosed (stop), CRitical (voiced), Open (voiceless)
• VEL: CLosed (non-nasal), Open (nasal)

[Figure: vocal-tract diagram labeling LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, and GLOTTIS.]

For speech recognition, we collapse these into 3 streams: lips, tongue, and glottis (LTG).
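As a simple illustration, the collapse can be written as a lookup from tract variable to stream. The slide names only the three streams, so the exact grouping below, and in particular where the velum is assigned, is an assumption for illustration rather than a detail given in the talk.

```python
# Assumed grouping of the 8 tract variables into the lips / tongue / glottis (LTG) streams.
# The slide names only the three streams; the velum assignment below is a guess.
STREAM_OF_TRACT_VARIABLE = {
    "LIP-LOC": "lips",
    "LIP-OP": "lips",
    "TT-LOC": "tongue",
    "TT-OP": "tongue",
    "TB-LOC": "tongue",
    "TB-OP": "tongue",
    "GLO": "glottis",
    "VEL": "glottis",   # assumption: grouped with the glottal stream
}

def collapse(tract_values):
    """Map a dict of tract-variable values onto the three recognition streams."""
    streams = {"lips": [], "tongue": [], "glottis": []}
    for var, val in tract_values.items():
        streams[STREAM_OF_TRACT_VARIABLE[var]].append((var, val))
    return streams

print(collapse({"LIP-OP": "CRitical", "TT-LOC": "Alveolar", "GLO": "Open"}))
```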

Page 17: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Motivation: Pronunciation Variation (slide created by Karen Livescu)

Each word is listed with its baseform and its observed surface (actual) pronunciations, with counts:

probably — baseform: p r aa b ax b l iy
(2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy

sense — baseform: s eh n s
(1) s eh n t s, (1) s ih t s

everybody — baseform: eh v r iy b ah d iy
(1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

don’t — baseform: d ow n t
(37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, ...

[Plot: number of pronunciations per word (0–80) vs. minimum number of occurrences (0–200).]

Page 18: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Explanation: Asynchrony of Tract Variables (based on a slide created by Karen Livescu)

[Figure: gestural scores for the word “sense.” The dictionary entry aligns the phones /s eh n s/ with tongue (T) feature values (crit/alveolar, mid/palatal, closed/alveolar + nasal, crit/alveolar) and glottis (G) feature values (open, critical, open). Surface variant #1, /s eh n t s/, arises from the same feature values with feature asynchrony; surface variant #2 arises from feature asynchrony plus substitution.]

Page 19: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Implementation: Multi-stream DBN (slide created by Karen Livescu)

• Phone-based: q (phonetic state) generates o (observation vector).
• Articulatory feature-based: L (state of lips), T (state of tongue), and G (state of glottis) jointly generate o (observation vector).

Page 20: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Baseline: Audio-only phone-based HMM (slide created by Partha Lal)

• positionInWordA ∈ {0, 1, 2, ...}
• stateTransitionA ∈ {0, 1}
• phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsA

Page 21: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Baseline: Video-only phone-based HMM (slide created by Partha Lal)

• positionInWordV ∈ {0, 1, 2, ...}
• stateTransitionV ∈ {0, 1}
• phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsV

Page 22: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Audio-visual HMM without asynchrony (slide created by Partha Lal)

• positionInWord ∈ {0, 1, 2, ...}
• stateTransition ∈ {0, 1}
• phoneState ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• observations obsA and obsV, both generated by the single shared phone state

Page 23: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Phoneme-Viseme CHMM (slide created by Partha Lal)

Audio chain:
• positionInWordA ∈ {0, 1, 2, ...}
• stateTransitionA ∈ {0, 1}
• phoneStateA ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsA

Video chain:
• positionInWordV ∈ {0, 1, 2, ...}
• stateTransitionV ∈ {0, 1}
• phoneStateV ∈ { /t/1, /t/2, /t/3, /u/1, /u/2, /u/3, … }
• obsV
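What the coupled structure buys can be sketched concretely: each chain keeps its own position in the word, and the coupling is a constraint on how far apart those positions may drift. The snippet below enumerates the legal joint states under an assumed |positionInWordA − positionInWordV| ≤ k constraint; it is a schematic illustration of the state space, not the actual DBN implementation used in the experiments.

```python
from itertools import product

def joint_states(word_length, max_async):
    """Enumerate legal joint (positionInWordA, positionInWordV) pairs for one word,
    subject to |positionInWordA - positionInWordV| <= max_async.
    max_async = 0 reproduces the synchronous audio-visual HMM; larger values let
    the visual stream lead or lag the audio stream by up to max_async states."""
    return [(a, v) for a, v in product(range(word_length), repeat=2)
            if abs(a - v) <= max_async]

# For a 3-state word: synchronous = 3 joint states, 1 state of asynchrony = 7,
# unlimited asynchrony (here, 2 states for a 3-state word) = 9.
for k in (0, 1, 2):
    print(k, len(joint_states(3, k)))
```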

Page 24: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Articulatory Feature CHMM

Lips chain:
• positionInWordL ∈ {0, 1, 2, ...}
• stateTransitionL ∈ {0, 1}
• L ∈ { /OP/1, /OP/2, /RND/1, … }

Tongue chain:
• positionInWordT ∈ {0, 1, 2, ...}
• stateTransitionT ∈ {0, 1}
• T ∈ { /CL-ALV/1, /CL-ALV/2, /MID-UV/1, … }

Glottis chain:
• positionInWordG ∈ {0, 1, 2, ...}
• stateTransitionG ∈ {0, 1}
• G ∈ { /OP/1, /OP/2, /CRIT/1, … }

Observations: obsA and obsV

Page 25: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Asynchrony Experiments: CUAVE

• 169 utterances used, 10 digits each
• NOISEX speech babble added at various SNRs
• Experimental setup:
– Training on clean data; number of Gaussians tuned on the clean dev set
– Audio/video stream weights tuned on noise-specific dev sets (see the fusion sketch after this list)
– Uniform (“zero-gram”) language model
– Decoding constrained to 10-word utterances (avoids language-model scale/penalty tuning)
• Thanks to Amar Subramanya at UW for the video observations
• Thanks to Kate Saenko at MIT for the initial baselines and audio observations
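The talk does not spell out the fusion rule, but a common way to realize tuned audio/video weights is a weighted combination of the per-stream log-likelihoods; the sketch below shows that standard form under assumed names, with an audio weight that would be tuned separately for each noise condition.

```python
import numpy as np

def fused_log_likelihood(loglik_audio, loglik_video, audio_weight):
    """Combine per-state log-likelihoods of the two streams:
    log p(o_A, o_V | q) ~ w * log p(o_A | q) + (1 - w) * log p(o_V | q),
    where the weight w is tuned on a noise-specific dev set."""
    return audio_weight * loglik_audio + (1.0 - audio_weight) * loglik_video

# Hypothetical per-state scores for one frame: when the audio is noisy,
# a smaller audio weight defers more to the video stream.
la = np.array([-12.0, -10.5, -9.0])
lv = np.array([-8.0, -11.0, -10.0])
print(fused_log_likelihood(la, lv, audio_weight=0.9))   # audio-dominated fusion
print(fused_log_likelihood(la, lv, audio_weight=0.3))   # video-dominated fusion
```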

Page 26: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 1: Should we use video?
Answer: Fusion WER < single-stream WER. (Novelty: none; many authors have reported this.)

[Bar chart: WER (%) for Audio, Video, and Audiovisual systems at CLEAN, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB; vertical axis 0–90.]

Page 27: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 2: Should the streams be asynchronous?
Answer: Asynchronous WER < synchronous WER (4% absolute at mid SNRs). (Novelty: first phone-based AVSR with inter-phone asynchrony.)

[Bar chart: WER (%) for No Asynchrony, 1 State Async, 2 States Async, and Unlimited Async at CLEAN, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB; vertical axis 0–70.]

Page 28: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 3: Should asynchrony be modeled using articulatory features?
Answer: Articulatory-feature WER = phoneme-viseme WER. (Novelty: first articulatory-feature model for AVSR.)

[Bar chart: WER (%) for Phone-viseme vs. Articulatory features at Clean, SNR 12 dB, 10 dB, 6 dB, 4 dB, and −4 dB; vertical axis 0–80.]

Page 29: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Results, part 4: Can the AF system help the CHMM correct mistakes?
Answer: The combination AF + PV gives the best results on this database.
Details: Systems vote to determine the label of each word (NIST ROVER; see the voting sketch after this chart). PV = phone-viseme, AF = articulatory features.

[Bar chart: WER on devtest, averaged across SNRs (vertical axis 17–23%), for: Rover, best three w/ AF; Rover, best three w/o AF; PV, 2 states async; AF; PV, 1 state async.]
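The ROVER combination can be pictured as word-level voting. Below is a minimal sketch of the voting step, assuming the hypotheses have already been aligned word-by-word (the real NIST rover tool also constructs that alignment); the systems and example hypotheses here are illustrative.

```python
from collections import Counter

def vote(aligned_hypotheses):
    """aligned_hypotheses: list of word sequences, one per system, already aligned
    so that position i refers to the same slot in every hypothesis.
    Each slot is decided by majority vote (ties broken by system order)."""
    n_slots = len(aligned_hypotheses[0])
    result = []
    for i in range(n_slots):
        words = [hyp[i] for hyp in aligned_hypotheses]
        result.append(Counter(words).most_common(1)[0][0])
    return result

# Hypothetical 10-digit hypotheses from three systems (e.g., PV, AF, PV 2-states-async):
sys_a = "one two three four five six seven eight nine zero".split()
sys_b = "one two tree four five six seven eight nine zero".split()
sys_c = "one two three four nine six seven eight nine zero".split()
print(vote([sys_a, sys_b, sys_c]))
```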

Page 30: Object Tracking and Asynchrony in Audio-Visual Speech Recognition

Conclusions

• Classifiers as features: AdaBoost “margin” outputs can be used as features in a Gaussian model of facial geometry.
• Head-pose correction in noise: the best correction algorithm uses linear regression followed by model-based correction.
• Asynchrony matters: the best phone-based recognizer is a CHMM with two states of asynchrony allowed between audio and video.
• Articulatory feature models complement phone models: the two systems have identical WER, and the best result is obtained when systems of both types are combined using ROVER.