MPhil transfer
Posted on 21-Oct-2014
Audio-visual speech reading system
Samuel Pachoud
MPhil transfer presentation

[Slide illustration: the McGurk effect, where audio /ba/ paired with visual /ga/ is typically perceived as /da/, showing that speech perception is bimodal.]
Why bimodality?
• Audio vs. video
– 43 phonemes map to only 15 visemes (British English)
• /k/, /ɡ/, /ŋ/ → viseme /k/
• /tʃ/, /ʃ/, /dʒ/, /ʒ/ → viseme /ch/
– Facial discrimination is more accurate than acoustic discrimination in some cases [Lander 2007]:
• /l/ and /r/ can sound quite similar: "grass" vs. "glass"
• Noisy environments:
– Which cue should we rely on?
Difficulties
• Representation and extraction
– Low-level "perceptual" information
– Filtering a noisy signal depends strongly on the nature of the noise
• Integration: adaptive and effective
– Must cope with degraded input signals
Current limitations
• Lip deformation, self-occlusion, distortion [Nefian 2002, Göcke 2005]
• Manual labelling or alignment between frames [Chen 2001, Aleksic 2002, Wang 2004]
• No explicit use of the close link between audio and video [Potamianos 2001, Gordan 2002]
• No studies with both audio and video degraded
– Except [Movellan 1996], which used a small data corpus (only 4 classes)
Our contributions
• Occlusion, missing data
– A set of built-in space-time visual features [CVPR 2008]
• Synchronization
– A similar structure for both audio and visual feature extraction [BMVC 2008]
• Degraded signals
– A discriminative model providing levels of confidence [BMVC 2009, under review]
– Canonical correlation to fuse audio and visual features
Overview

[Overview diagram: audio and video streams over time t feed audio feature selection and extraction and visual feature selection and extraction; the resulting features support audio-only ASR, visual-only ASR (lip reading), and, through audio-visual integration, audio-visual ASR.]
Spatio-temporal features
• Lip motion features:
– Image-to-image approach: two separate frames, each with three moving features
– Space-time volume modelling: a whole sequence described by several spatio-temporal features
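The space-time volume idea can be illustrated with a minimal sketch: treat the frame sequence as a 3-D (time, height, width) array and summarize it with gradient statistics along each axis. This is only a toy stand-in for the thesis' actual 2D + time descriptors; the function name and feature choice here are illustrative assumptions.

```python
import numpy as np

def spacetime_gradient_features(volume: np.ndarray) -> np.ndarray:
    """Summarize a (time, height, width) lip-region volume with simple
    space-time statistics: mean absolute gradient along each axis
    (temporal, vertical, horizontal motion energy)."""
    gt, gy, gx = np.gradient(volume.astype(float))
    return np.array([np.abs(g).mean() for g in (gt, gy, gx)])

# Toy 10-frame, 32x32 mouth-region sequence with a bright bar
# sweeping horizontally over time.
frames = np.zeros((10, 32, 32))
for t in range(10):
    frames[t, :, 3 * t : 3 * t + 3] = 1.0

feat = spacetime_gradient_features(frames)
print(feat.shape)  # (3,)
```

The point of the volume view is visible even in this toy: the motion of the bar shows up in the temporal-gradient statistic, which a frame-by-frame (image-to-image) representation would have to infer indirectly.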
Confidence factors
• Which cue to rely on?
– Levels of confidence of the audio and visual signals
– Estimated by comparing the feature distributions of the training and testing sets
• Confidence factors
– Provided by single-modality classification using a Support Vector Machine (SVM)
– Used to select the most effective strategy for integrating audio and visual features
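A minimal sketch of confidence-driven integration: train one multi-class SVM per modality and use how peaked its class posterior is as the confidence factor, then weight the two posteriors accordingly. Using Platt-scaled SVM probabilities as the confidence measure is an assumption for illustration, not necessarily the thesis' exact formulation, and all features here are synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 30)  # three digit classes, 30 samples each

# Hypothetical stand-ins for audio (8-D) and visual (12-D) features.
means_a = rng.normal(scale=3.0, size=(3, 8))
means_v = rng.normal(scale=3.0, size=(3, 12))
X_audio = means_a[y] + rng.normal(size=(90, 8))
X_video = means_v[y] + rng.normal(size=(90, 12))

# One probabilistic multi-class SVM per modality.
svm_a = SVC(probability=True, random_state=0).fit(X_audio, y)
svm_v = SVC(probability=True, random_state=0).fit(X_video, y)

# Simulate a heavily degraded audio test frame and a clean visual one.
xa = X_audio[0] + rng.normal(scale=5.0, size=8)
xv = X_video[0]

pa = svm_a.predict_proba(xa.reshape(1, -1))[0]
pv = svm_v.predict_proba(xv.reshape(1, -1))[0]

# Confidence factor per modality: how peaked its posterior is.
acf, vcf = pa.max(), pv.max()

# Confidence-weighted integration of the two streams.
fused = (acf * pa + vcf * pv) / (acf + vcf)
print(int(fused.argmax()))
```

When the audio stream is degraded its posterior flattens, its confidence factor drops, and the fusion automatically leans on the visual stream: exactly the "which cue to rely on" decision the slide describes.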
Adaptive fusion
• Correlate the linear relationship between audio and video
– Canonical Correlation Analysis (CCA)
• Create a canonical space from uncontaminated input signals
– Extract the dominant canonical factors from the training set
• Map and reconstruct the testing set
– Using the trained regression matrices and canonical factor pairs
Multiple multi-class SVM

[System diagram: noise-corrupted AUDIO and VIDEO streams are processed in parallel. Audio path: MFCC + SCF features, reduced with PCA. Video path: region of interest (ROI), described with 2D + time SIFT descriptors. A kCCA stage, trained on clean data (regression matrices and canonical projections between the audio and visual training sets), produces audio and visual feature estimates in a shared space. Per-modality multi-class SVMs supply audio and visual confidence factors (ACF, VCF), which drive the adaptive fusion of the joint feature; a joint-feature multi-class SVM performs the final recognition. Labelled stages: feature extraction, feature selection, adaptive fusion, confidence factors, decision making.]
Used database (digits)
• Clean data
• Degraded signals

[Examples: digit 0 (clean); digit 0 with salt-and-pepper noise and occlusion; digit 0 at SNR = -5 dB; extracted region of interest.]
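The two degradations shown on this slide are easy to reproduce; the sketch below uses synthetic stand-ins for a lip-region frame and an audio waveform, with the corruption amounts as illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def add_salt_and_pepper(img: np.ndarray, amount: float = 0.2) -> np.ndarray:
    """Flip a fraction of pixels to black or white (occlusion-like noise)."""
    out = img.copy()
    mask = rng.random(img.shape) < amount
    out[mask] = rng.choice([0.0, 1.0], size=int(mask.sum()))
    return out

def add_noise_at_snr(signal: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + rng.normal(scale=np.sqrt(p_noise), size=signal.shape)

frame = rng.random((32, 32))                              # stand-in ROI frame
audio = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # stand-in waveform

noisy_frame = add_salt_and_pepper(frame, 0.2)
noisy_audio = add_noise_at_snr(audio, -5.0)  # noise power is ~3x signal power

# Empirical SNR of the corrupted audio, in dB (should be close to -5).
snr = 10 * np.log10(np.mean(audio ** 2) / np.mean((noisy_audio - audio) ** 2))
print(snr)
```

Note how harsh SNR = -5 dB is: the noise carries roughly three times the power of the speech, which is why extracting residual information from such signals is non-trivial.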
Isolated word recognition

[Table: recognition rate (%) over 10 digits using kCCA and multiple MSVM, compared with the average generalization performance (%) from [Movellan 1996].]

1. Visual features are more robust to occlusion than to salt-and-pepper noise
2. Residual information in degraded signals is extracted and exploited
3. Higher recognition rate under occlusion than in [Movellan 1996]
Problems addressed
• Degraded signals
– A robust and accurate spatio-temporal feature representation
• Adaptive fusion
– Using Canonical Correlation Analysis (CCA)
– Capable of combining features given the conditions at hand
• Isolated word recognition (digits)
– Based on multiple multi-class Support Vector Machines (MSVM)
Discussion
• AV-ASR implies continuous speech recognition
– Sentences contain structure
• Difficult to do with MSVM
– Segmentation
– Scanning
• Remaining issues:
– Need for a structural system
– Need for a data corpus containing contextual information
Future work
• Extend to a structural model: Structured Support Vector Machine (SSVM) [Tsochantaridis 2005]
– Create a joint feature map
• Evaluation on the GRID audio-visual sentence corpus
– Six-word sentences with a fixed structure (command, colour, preposition, letter, digit, adverb; e.g. "place blue at F 2 now")
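The joint feature map mentioned for future work can be sketched with the standard SSVM construction: for each slot of a structured output, take the outer product of the input features with a one-hot encoding of that slot's label, and concatenate the blocks. The six slot vocabulary sizes below follow the GRID sentence structure, but the feature dimensions and function names are illustrative assumptions, not the thesis' design.

```python
import numpy as np

# Slot vocabulary sizes for a GRID-style six-word sentence:
# command, colour, preposition, letter, digit, adverb.
SLOTS = [4, 4, 4, 25, 10, 4]

def joint_feature_map(x_slots, y):
    """Psi(x, y): x_slots is a list of per-slot feature vectors,
    y a list of label indices, one per slot. Each slot contributes
    the outer product one_hot(label) (x) features, flattened."""
    psi = []
    for x, label, n_labels in zip(x_slots, y, SLOTS):
        block = np.zeros((n_labels, len(x)))
        block[label] = x
        psi.append(block.ravel())
    return np.concatenate(psi)

# Toy example: 8-D features per slot, one candidate labelling.
x = [np.ones(8) for _ in SLOTS]
y = [0, 1, 2, 3, 4, 0]
psi = joint_feature_map(x, y)
print(psi.shape)  # (8 * (4 + 4 + 4 + 25 + 10 + 4),) = (408,)
```

An SSVM then scores a labelling as w . Psi(x, y) and predicts by maximizing over y, which is what lets the model exploit the sentence structure that per-word MSVMs cannot.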
Thank you
Questions?