MPhil transfer
Posted on 21-Oct-2014
Audio-visual speech reading system
Samuel Pachoud
MPhil transfer presentation

[Slide illustration: the McGurk effect, where audio /ba/ paired with visual /ga/ is typically perceived as /da/, showing that speech perception is bimodal.]
Why bimodality?
• Audio vs. video
– 43 phonemes map to only 15 visemes (British English)
• /k/, /ɡ/, /ŋ/ → viseme /k/
• /tʃ/, /ʃ/, /dʒ/, /ʒ/ → viseme /ch/
– Facial discrimination is more accurate than acoustic discrimination in some cases [Lander 2007]:
• /l/ and /r/ can sound quite similar: "grass" vs. "glass"
• Noisy environments:
– Which cue should we rely on?
Difficulties
• Representation and extraction
– Low-level "perceptual" information
– Filtering a noisy signal depends strongly on the nature of the noise
• Integration: adaptive and effective
– Must cope with degraded input signals
Current limitations
• Lip deformation, self-occlusion, distortion [Nefian 2002, Göcke 2005]
• Manual labelling or alignment between frames [Chen 2001, Aleksic 2002, Wang 2004]
• No explicit use of the close link between audio and video [Potamianos 2001, Gordan 2002]
• No studies with both audio and video degraded
– Except [Movellan 1996], which used a small data corpus (only 4 classes)
Our contributions
• Occlusion, missing data
– A set of built-in space-time visual features [CVPR 2008]
• Synchronization
– A similar structure for both audio and visual feature extraction [BMVC 2008]
• Degraded signals
– A discriminative model providing levels of confidence [BMVC 2009, under review]
– Canonical correlation to fuse audio and visual features
Overview

[Overview diagram: audio and video streams over time t feed audio feature selection and extraction and visual feature selection and extraction; the resulting features support audio-only ASR, visual-only ASR (lip reading), and, through audio-visual integration, audio-visual ASR.]
Spatio-temporal features
• Lip motion features:
– Image-to-image approach: two separate frames, each with three moving features
– Space-time volume modelling: a whole sequence described by several spatio-temporal features
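The space-time volume idea can be illustrated with a minimal sketch: treat the frame sequence as a 3-D (time, height, width) array and summarize it with gradient statistics along each axis. This is only a toy stand-in for the thesis' actual 2D + time descriptors; the function name and feature choice here are illustrative assumptions.

```python
import numpy as np

def spacetime_gradient_features(volume: np.ndarray) -> np.ndarray:
    """Summarize a (time, height, width) lip-region volume with simple
    space-time statistics: mean absolute gradient along each axis
    (temporal, vertical, horizontal motion energy)."""
    gt, gy, gx = np.gradient(volume.astype(float))
    return np.array([np.abs(g).mean() for g in (gt, gy, gx)])

# Toy 10-frame, 32x32 mouth-region sequence with a bright bar
# sweeping horizontally over time.
frames = np.zeros((10, 32, 32))
for t in range(10):
    frames[t, :, 3 * t : 3 * t + 3] = 1.0

feat = spacetime_gradient_features(frames)
print(feat.shape)  # (3,)
```

The point of the volume view is visible even in this toy: the motion of the bar shows up in the temporal-gradient statistic, which a frame-by-frame (image-to-image) representation would have to infer indirectly.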
Confidence factors
• Which cue to rely on?
– Levels of confidence of the audio and visual signals
– Estimated by comparing the feature distributions of the training and testing sets
• Confidence factors
– Provided by single-modality classification using a Support Vector Machine (SVM)
– Used to select the most effective strategy for integrating audio and visual features
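A minimal sketch of confidence-driven integration: train one multi-class SVM per modality and use how peaked its class posterior is as the confidence factor, then weight the two posteriors accordingly. Using Platt-scaled SVM probabilities as the confidence measure is an assumption for illustration, not necessarily the thesis' exact formulation, and all features here are synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 30)  # three digit classes, 30 samples each

# Hypothetical stand-ins for audio (8-D) and visual (12-D) features.
means_a = rng.normal(scale=3.0, size=(3, 8))
means_v = rng.normal(scale=3.0, size=(3, 12))
X_audio = means_a[y] + rng.normal(size=(90, 8))
X_video = means_v[y] + rng.normal(size=(90, 12))

# One probabilistic multi-class SVM per modality.
svm_a = SVC(probability=True, random_state=0).fit(X_audio, y)
svm_v = SVC(probability=True, random_state=0).fit(X_video, y)

# Simulate a heavily degraded audio test frame and a clean visual one.
xa = X_audio[0] + rng.normal(scale=5.0, size=8)
xv = X_video[0]

pa = svm_a.predict_proba(xa.reshape(1, -1))[0]
pv = svm_v.predict_proba(xv.reshape(1, -1))[0]

# Confidence factor per modality: how peaked its posterior is.
acf, vcf = pa.max(), pv.max()

# Confidence-weighted integration of the two streams.
fused = (acf * pa + vcf * pv) / (acf + vcf)
print(int(fused.argmax()))
```

When the audio stream is degraded its posterior flattens, its confidence factor drops, and the fusion automatically leans on the visual stream: exactly the "which cue to rely on" decision the slide describes.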
Adaptive fusion
• Correlate the linear relationship between audio and video
– Canonical Correlation Analysis (CCA)
• Create a canonical space from uncontaminated input signals
– Extract the dominant canonical factors from the training set
• Map and reconstruct the testing set
– Using the trained regression matrices and canonical factor pairs
Multiple multi-class SVM

[System diagram: noise-corrupted AUDIO and VIDEO streams are processed in parallel. Audio path: MFCC + SCF features, reduced with PCA. Video path: region of interest (ROI), described with 2D + time SIFT descriptors. A kCCA stage, trained on clean data (regression matrices and canonical projections between the audio and visual training sets), produces audio and visual feature estimates in a shared space. Per-modality multi-class SVMs supply audio and visual confidence factors (ACF, VCF), which drive the adaptive fusion of the joint feature; a joint-feature multi-class SVM performs the final recognition. Labelled stages: feature extraction, feature selection, adaptive fusion, confidence factors, decision making.]
Used database (digits)
• Clean data
• Degraded signals

[Examples: digit 0 (clean); digit 0 with salt-and-pepper noise and occlusion; digit 0 at SNR = -5 dB; extracted region of interest.]
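The two degradations shown on this slide are easy to reproduce; the sketch below uses synthetic stand-ins for a lip-region frame and an audio waveform, with the corruption amounts as illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

def add_salt_and_pepper(img: np.ndarray, amount: float = 0.2) -> np.ndarray:
    """Flip a fraction of pixels to black or white (occlusion-like noise)."""
    out = img.copy()
    mask = rng.random(img.shape) < amount
    out[mask] = rng.choice([0.0, 1.0], size=int(mask.sum()))
    return out

def add_noise_at_snr(signal: np.ndarray, snr_db: float) -> np.ndarray:
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + rng.normal(scale=np.sqrt(p_noise), size=signal.shape)

frame = rng.random((32, 32))                              # stand-in ROI frame
audio = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # stand-in waveform

noisy_frame = add_salt_and_pepper(frame, 0.2)
noisy_audio = add_noise_at_snr(audio, -5.0)  # noise power is ~3x signal power

# Empirical SNR of the corrupted audio, in dB (should be close to -5).
snr = 10 * np.log10(np.mean(audio ** 2) / np.mean((noisy_audio - audio) ** 2))
print(snr)
```

Note how harsh SNR = -5 dB is: the noise carries roughly three times the power of the speech, which is why extracting residual information from such signals is non-trivial.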
Isolated word recognition

[Table: recognition rate (%) over 10 digits using kCCA and multiple MSVM, compared with the average generalization performance (%) from [Movellan 1996].]

1. Visual features are more robust to occlusion than to salt-and-pepper noise
2. Residual information in degraded signals is extracted and exploited
3. Higher recognition rate under occlusion than in [Movellan 1996]
Problems addressed
• Degraded signals
– A robust and accurate spatio-temporal feature representation
• Adaptive fusion
– Using Canonical Correlation Analysis (CCA)
– Capable of combining features given the conditions at hand
• Isolated word recognition (digits)
– Based on multiple multi-class Support Vector Machines (MSVM)
Discussion
• AV-ASR implies continuous speech recognition
– Sentences contain structure
• Difficult to do with MSVM
– Segmentation
– Scanning
• Remaining issues:
– Need for a structural system
– Need for a data corpus containing contextual information
Future work
• Extend to a structural model: Structured Support Vector Machine (SSVM) [Tsochantaridis 2005]
– Create a joint feature map
• Evaluation on the GRID audio-visual sentence corpus
– Six-word sentences with a fixed structure (command, colour, preposition, letter, digit, adverb; e.g. "place blue at F 2 now")
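The joint feature map mentioned for future work can be sketched with the standard SSVM construction: for each slot of a structured output, take the outer product of the input features with a one-hot encoding of that slot's label, and concatenate the blocks. The six slot vocabulary sizes below follow the GRID sentence structure, but the feature dimensions and function names are illustrative assumptions, not the thesis' design.

```python
import numpy as np

# Slot vocabulary sizes for a GRID-style six-word sentence:
# command, colour, preposition, letter, digit, adverb.
SLOTS = [4, 4, 4, 25, 10, 4]

def joint_feature_map(x_slots, y):
    """Psi(x, y): x_slots is a list of per-slot feature vectors,
    y a list of label indices, one per slot. Each slot contributes
    the outer product one_hot(label) (x) features, flattened."""
    psi = []
    for x, label, n_labels in zip(x_slots, y, SLOTS):
        block = np.zeros((n_labels, len(x)))
        block[label] = x
        psi.append(block.ravel())
    return np.concatenate(psi)

# Toy example: 8-D features per slot, one candidate labelling.
x = [np.ones(8) for _ in SLOTS]
y = [0, 1, 2, 3, 4, 0]
psi = joint_feature_map(x, y)
print(psi.shape)  # (8 * (4 + 4 + 4 + 25 + 10 + 4),) = (408,)
```

An SSVM then scores a labelling as w . Psi(x, y) and predicts by maximizing over y, which is what lets the model exploit the sentence structure that per-word MSVMs cannot.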
Thank you
Questions?