MULTIMODAL DEEP LEARNING PRESENTATION
Presented by Truong Thi Thu Hoai
MULTIMODAL DEEP LEARNING
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng
Computer Science Department, Stanford University
Department of Music, Stanford University
Computer Science & Engineering Division, University of Michigan, Ann Arbor
MCGURK EFFECT
In speech recognition, people are known to integrate audio-visual information in order to understand speech.
This was first demonstrated in the McGurk effect, where a visual /ga/ paired with a voiced /ba/ is perceived as /da/ by most subjects.
AUDIO-VISUAL SPEECH RECOGNITION
FEATURE CHALLENGE
[Pipeline diagram: audio and visual features are fed into a classifier, e.g. an SVM]
REPRESENTING LIPS
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
UNSUPERVISED FEATURE LEARNING
[Diagram: features learned from unlabeled data within a single modality]
MULTIMODAL FEATURES
[Diagram: features learned jointly over audio and video inputs]
CROSS-MODALITY FEATURE LEARNING
[Diagram: features for one modality learned with the help of the other modality]
FEATURE LEARNING MODELS
BACKGROUND
Sparse Restricted Boltzmann Machines (RBMs)
$-\log P(v,h) \propto E(v,h) = \frac{1}{2\sigma^2} v^\top v - \frac{1}{\sigma^2}\left(c^\top v + b^\top h + v^\top W h\right)$
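As a concrete reference, here is a minimal NumPy sketch of this energy function. The names (v, h, W, b, c, sigma) follow the equation above; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def gaussian_rbm_energy(v, h, W, b, c, sigma=1.0):
    """Energy of a Gaussian-Bernoulli RBM (see the equation above).

    v: real-valued visible units, shape (n_visible,)
    h: binary hidden units, shape (n_hidden,)
    W: visible-to-hidden weights, shape (n_visible, n_hidden)
    b: hidden biases; c: visible biases.
    """
    return (v @ v) / (2.0 * sigma**2) - (c @ v + b @ h + v @ W @ h) / sigma**2

# Tiny usage example with random placeholder parameters.
rng = np.random.default_rng(0)
v, h = rng.normal(size=20), rng.integers(0, 2, size=10)
W, b, c = rng.normal(size=(20, 10)), np.zeros(10), np.zeros(20)
print(gaussian_rbm_energy(v, h, W, b, c))
```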
FEATURE LEARNING WITH AUTOENCODERS
[Diagram: separate autoencoders for audio and video, each reconstructing its own input from a hidden layer]
BIMODAL AUTOENCODER
[Diagram: audio input and video input feed a single hidden representation, which produces an audio reconstruction and a video reconstruction]
SHALLOW LEARNING
[Diagram: shallow model with hidden units over concatenated video and audio inputs]
• Mostly unimodal features learned
BIMODAL AUTOENCODER
[Diagram: video input only; the hidden representation still reconstructs both audio and video]
Cross-modality learning: learn better video features by using audio as a cue
CROSS-MODALITY DEEP AUTOENCODER
[Diagram: a deep encoder over the video input produces a learned representation that reconstructs both audio and video; an analogous variant takes only the audio input]
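To make the architecture concrete, here is a minimal PyTorch sketch of the video-input variant. The layer sizes and names are illustrative assumptions, not the paper's actual dimensions or pretraining procedure: a deep encoder over video yields a representation from which two decoders reconstruct both modalities.

```python
import torch
import torch.nn as nn

class CrossModalityDeepAutoencoder(nn.Module):
    """Video in -> learned representation -> reconstruct audio AND video."""
    def __init__(self, video_dim=1200, audio_dim=600, hidden_dim=512):
        super().__init__()
        # Deep encoder over the video modality only (dims are placeholders).
        self.encoder = nn.Sequential(
            nn.Linear(video_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        # Separate decoders reconstruct each modality from the representation.
        self.audio_decoder = nn.Linear(hidden_dim, audio_dim)
        self.video_decoder = nn.Linear(hidden_dim, video_dim)

    def forward(self, video):
        z = self.encoder(video)  # learned representation
        return self.audio_decoder(z), self.video_decoder(z)

model = CrossModalityDeepAutoencoder()
video = torch.randn(8, 1200)  # dummy batch of video-frame features
audio_hat, video_hat = model(video)
```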
BIMODAL DEEP AUTOENCODERS
[Diagram: video "visemes" (mouth shapes) and audio "phonemes" are encoded into a shared representation that reconstructs both modalities; the same architecture can be fed video only or audio only]
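A hedged PyTorch sketch of the bimodal deep autoencoder, again with illustrative rather than the paper's dimensions: each modality has its own encoder, the two meet in a shared layer, and separate decoders reconstruct each modality from the shared representation.

```python
import torch
import torch.nn as nn

class BimodalDeepAutoencoder(nn.Module):
    """Audio and video encoders meet in a shared representation that must
    reconstruct both modalities (all dimensions are placeholder assumptions)."""
    def __init__(self, audio_dim=600, video_dim=1200, hidden_dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.Sigmoid())
        # Shared layer fuses the two modality-specific representations.
        self.shared = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())
        self.audio_dec = nn.Linear(hidden_dim, audio_dim)
        self.video_dec = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        z = self.shared(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1))
        return self.audio_dec(z), self.video_dec(z)
```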
TRAINING BIMODAL DEEP AUTOENCODER
[Diagrams: the same model trained with (1) audio input only, (2) video input only, and (3) both inputs; in every case it must reconstruct both audio and video from the shared representation]
• Train a single model to perform all 3 tasks
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
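A minimal sketch of this three-task training scheme, reusing the hypothetical BimodalDeepAutoencoder above: like a denoising autoencoder, the input is corrupted by zeroing one modality, yet both modalities must always be reconstructed.

```python
import torch
import torch.nn.functional as F

def training_step(model, audio, video, optimizer):
    """One step of the three-task scheme: zero out one modality (or neither)
    at the input, but always reconstruct both modalities at the output."""
    total_loss = 0.0
    for a_in, v_in in [(audio, torch.zeros_like(video)),  # audio-only input
                       (torch.zeros_like(audio), video),  # video-only input
                       (audio, video)]:                   # both modalities
        audio_hat, video_hat = model(a_in, v_in)
        total_loss = total_loss + F.mse_loss(audio_hat, audio) \
                                + F.mse_loss(video_hat, video)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()

# Usage with dummy data matching the placeholder dimensions above.
model = BimodalDeepAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio, video = torch.randn(8, 600), torch.randn(8, 1200)
loss = training_step(model, audio, video, opt)
```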
EVALUATIONS
VISUALIZATIONS OF LEARNED FEATURES
Audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms.
LEARNING SETTINGS
We consider three learning settings: cross-modality learning, multimodal fusion, and shared representation learning (summarized in Figure 1 of the paper).
LIP-READING WITH AVLETTERS
AVLetters: 26-way letter classification, 10 speakers, 60×80-pixel lip regions
Cross-modality learning:
[Diagram: cross-modality deep autoencoder with video input]

Feature Learning:    Audio + Video
Supervised Learning: Video
Testing:             Video
LIP-READING WITH AVLETTERS

Feature Representation                               Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002)  44.6%
Local Binary Pattern (Zhao & Barnard, 2009)          58.5%
Video-Only Learning (Single Modality Learning)       54.2%
Our Features (Cross-Modality Learning)               64.4%
LIP-READING WITH CUAVE
CUAVE: 10-way digit classification, 36 speakers

Cross-modality learning:
[Diagram: cross-modality deep autoencoder with video input]

Feature Learning:    Audio + Video
Supervised Learning: Video
Testing:             Video
LIP-READING WITH CUAVE

Feature Representation                               Classification Accuracy
Baseline Preprocessed Video                          58.5%
Video-Only Learning (Single Modality Learning)       65.4%
Our Features (Cross-Modality Learning)               68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009)    64.0%
Visemic AAM (Papandreou et al., 2009)                83.0%
MULTIMODAL RECOGNITION
CUAVE: 10-way digit classification, 36 speakers
Evaluate in clean and noisy audio scenarios. In the clean audio scenario, audio alone performs extremely well.

Feature Learning:    Audio + Video
Supervised Learning: Audio + Video
Testing:             Audio + Video

[Diagram: bimodal deep autoencoder with audio and video inputs, a shared representation, and reconstructions of both modalities]
MULTIMODAL RECOGNITION

Feature Representation                            Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM)                              75.8%
Our Best Video Features                           68.7%
Bimodal Deep Autoencoder                          77.3%
Bimodal Deep Autoencoder + Audio Features (RBM)   82.2%
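As a sketch of how such a feature combination might be evaluated (all data, dimensions, and names below are placeholders, not the paper's pipeline): concatenate the two feature sets and train a linear classifier on the result.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 500
bimodal_feats = rng.normal(size=(n, 512))    # placeholder bimodal deep AE features
audio_rbm_feats = rng.normal(size=(n, 256))  # placeholder audio RBM features
labels = rng.integers(0, 10, size=n)         # 10 digit classes, as in CUAVE

# Concatenate the two representations and train a linear classifier.
combined = np.hstack([bimodal_feats, audio_rbm_feats])
clf = LinearSVC().fit(combined, labels)
print("train accuracy:", clf.score(combined, labels))
```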
SHARED REPRESENTATION EVALUATION

[Diagram: a linear classifier is trained on the shared representation computed from one modality and tested on the shared representation computed from the other]

Feature Learning:    Audio + Video
Supervised Learning: Audio
Testing:             Video

Method: Learned Features + Canonical Correlation Analysis (CCA)

Feature Learning   Supervised Learning   Testing   Accuracy
Audio + Video      Audio                 Video     57.3%
Audio + Video      Video                 Audio     91.7%
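A brief sketch of the CCA step, using scikit-learn's CCA on placeholder features: both modalities are projected into a common space where they are maximally correlated, so a linear classifier trained on one modality's projections can be tested on the other's.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical learned features for aligned audio/video examples.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(1000, 100))  # placeholder audio representations
video_feats = rng.normal(size=(1000, 100))  # placeholder video representations

# Project both modalities into a common, maximally correlated space.
cca = CCA(n_components=50)
cca.fit(audio_feats, video_feats)
audio_shared, video_shared = cca.transform(audio_feats, video_feats)
# A linear classifier can now be trained on audio_shared and tested on video_shared.
```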
MCGURK EFFECT
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input   Video Input   Model Predictions
                            /ga/     /ba/     /da/
/ga/          /ga/          82.6%    2.2%     15.2%
/ba/          /ba/          4.4%     89.1%    6.5%
/ba/          /ga/          28.3%    13.0%    58.7%
CONCLUSION
• Applied deep autoencoders to discover features in multimodal data
• Cross-modality learning: obtained better video features (for lip-reading) by using audio as a cue
• Multimodal feature learning: learned representations that relate across audio and video data
[Diagrams: the cross-modality deep autoencoder (video input) and the bimodal deep autoencoder (shared representation over audio and video), as presented earlier]
THANK YOU FOR YOUR ATTENTION!