MULTIMODAL DEEP LEARNING PRESENTATION
Presented by Truong Thi Thu Hoai
MULTIMODAL DEEP LEARNING
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, Andrew Y. Ng
Computer Science Department, Stanford University
Department of Music, Stanford University
Computer Science & Engineering Division, University of Michigan, Ann Arbor
MCGURK EFFECT
In speech recognition, people are known to integrate audio-visual information in order to understand speech.
This was first demonstrated in the McGurk effect, where a visual /ga/ paired with a voiced /ba/ is perceived as /da/ by most subjects.
AUDIO-VISUAL SPEECH RECOGNITION
FEATURE CHALLENGE
[Pipeline diagram: audio and visual features are fed into a classifier, e.g. an SVM]
REPRESENTING LIPS
• Can we learn better representations for audio/visual speech recognition?
• How can multimodal data (multiple sources of input) be used to find better features?
UNSUPERVISED FEATURE LEARNING
[Diagram: features learned from unlabeled data within a single modality]
MULTIMODAL FEATURES
[Diagram: features learned jointly over audio and video inputs]
CROSS-MODALITY FEATURE LEARNING
[Diagram: features for one modality learned with the help of the other modality]
FEATURE LEARNING MODELS
BACKGROUND
Sparse Restricted Boltzmann Machines (RBMs)
$-\log P(v,h) \propto E(v,h) = \frac{1}{2\sigma^2} v^\top v - \frac{1}{\sigma^2}\left(c^\top v + b^\top h + v^\top W h\right)$
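As a concrete reference, here is a minimal NumPy sketch of this energy function. The names (v, h, W, b, c, sigma) follow the equation above; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def gaussian_rbm_energy(v, h, W, b, c, sigma=1.0):
    """Energy of a Gaussian-Bernoulli RBM (see the equation above).

    v: real-valued visible units, shape (n_visible,)
    h: binary hidden units, shape (n_hidden,)
    W: visible-to-hidden weights, shape (n_visible, n_hidden)
    b: hidden biases; c: visible biases.
    """
    return (v @ v) / (2.0 * sigma**2) - (c @ v + b @ h + v @ W @ h) / sigma**2

# Tiny usage example with random placeholder parameters.
rng = np.random.default_rng(0)
v, h = rng.normal(size=20), rng.integers(0, 2, size=10)
W, b, c = rng.normal(size=(20, 10)), np.zeros(10), np.zeros(20)
print(gaussian_rbm_energy(v, h, W, b, c))
```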
FEATURE LEARNING WITH AUTOENCODERS
[Diagram: separate autoencoders for audio and video, each reconstructing its own input from a hidden layer]
BIMODAL AUTOENCODER
[Diagram: audio input and video input feed a single hidden representation, which produces an audio reconstruction and a video reconstruction]
SHALLOW LEARNING
[Diagram: shallow model with hidden units over concatenated video and audio inputs]
• Mostly unimodal features learned
BIMODAL AUTOENCODER
[Diagram: video input only; the hidden representation still reconstructs both audio and video]
Cross-modality learning: learn better video features by using audio as a cue
CROSS-MODALITY DEEP AUTOENCODER
[Diagram: a deep encoder over the video input produces a learned representation that reconstructs both audio and video; an analogous variant takes only the audio input]
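To make the architecture concrete, here is a minimal PyTorch sketch of the video-input variant. The layer sizes and names are illustrative assumptions, not the paper's actual dimensions or pretraining procedure: a deep encoder over video yields a representation from which two decoders reconstruct both modalities.

```python
import torch
import torch.nn as nn

class CrossModalityDeepAutoencoder(nn.Module):
    """Video in -> learned representation -> reconstruct audio AND video."""
    def __init__(self, video_dim=1200, audio_dim=600, hidden_dim=512):
        super().__init__()
        # Deep encoder over the video modality only (dims are placeholders).
        self.encoder = nn.Sequential(
            nn.Linear(video_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        # Separate decoders reconstruct each modality from the representation.
        self.audio_decoder = nn.Linear(hidden_dim, audio_dim)
        self.video_decoder = nn.Linear(hidden_dim, video_dim)

    def forward(self, video):
        z = self.encoder(video)  # learned representation
        return self.audio_decoder(z), self.video_decoder(z)

model = CrossModalityDeepAutoencoder()
video = torch.randn(8, 1200)  # dummy batch of video-frame features
audio_hat, video_hat = model(video)
```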
BIMODAL DEEP AUTOENCODERS
[Diagram: video "visemes" (mouth shapes) and audio "phonemes" are encoded into a shared representation that reconstructs both modalities; the same architecture can be fed video only or audio only]
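A hedged PyTorch sketch of the bimodal deep autoencoder, again with illustrative rather than the paper's dimensions: each modality has its own encoder, the two meet in a shared layer, and separate decoders reconstruct each modality from the shared representation.

```python
import torch
import torch.nn as nn

class BimodalDeepAutoencoder(nn.Module):
    """Audio and video encoders meet in a shared representation that must
    reconstruct both modalities (all dimensions are placeholder assumptions)."""
    def __init__(self, audio_dim=600, video_dim=1200, hidden_dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.Sigmoid())
        # Shared layer fuses the two modality-specific representations.
        self.shared = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())
        self.audio_dec = nn.Linear(hidden_dim, audio_dim)
        self.video_dec = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        z = self.shared(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1))
        return self.audio_dec(z), self.video_dec(z)
```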
TRAINING BIMODAL DEEP AUTOENCODER
[Diagrams: the same model trained with (1) audio input only, (2) video input only, and (3) both inputs; in every case it must reconstruct both audio and video from the shared representation]
• Train a single model to perform all 3 tasks
• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
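A minimal sketch of this three-task training scheme, reusing the hypothetical BimodalDeepAutoencoder above: like a denoising autoencoder, the input is corrupted by zeroing one modality, yet both modalities must always be reconstructed.

```python
import torch
import torch.nn.functional as F

def training_step(model, audio, video, optimizer):
    """One step of the three-task scheme: zero out one modality (or neither)
    at the input, but always reconstruct both modalities at the output."""
    total_loss = 0.0
    for a_in, v_in in [(audio, torch.zeros_like(video)),  # audio-only input
                       (torch.zeros_like(audio), video),  # video-only input
                       (audio, video)]:                   # both modalities
        audio_hat, video_hat = model(a_in, v_in)
        total_loss = total_loss + F.mse_loss(audio_hat, audio) \
                                + F.mse_loss(video_hat, video)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()

# Usage with dummy data matching the placeholder dimensions above.
model = BimodalDeepAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio, video = torch.randn(8, 600), torch.randn(8, 1200)
loss = training_step(model, audio, video, opt)
```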
EVALUATIONS
VISUALIZATIONS OF LEARNED FEATURES
Audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms.
LEARNING SETTINGS
We consider three learning settings: cross-modality learning, multimodal fusion, and shared representation learning (summarized in Figure 1 of the paper).
LIP-READING WITH AVLETTERS
AVLetters: 26-way letter classification, 10 speakers, 60×80-pixel lip regions
Cross-modality learning:
[Diagram: cross-modality deep autoencoder with video input]

Feature Learning:    Audio + Video
Supervised Learning: Video
Testing:             Video
LIP-READING WITH AVLETTERS

Feature Representation                               Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002)  44.6%
Local Binary Pattern (Zhao & Barnard, 2009)          58.5%
Video-Only Learning (Single Modality Learning)       54.2%
Our Features (Cross-Modality Learning)               64.4%
LIP-READING WITH CUAVE
CUAVE: 10-way digit classification, 36 speakers

Cross-modality learning:
[Diagram: cross-modality deep autoencoder with video input]

Feature Learning:    Audio + Video
Supervised Learning: Video
Testing:             Video
LIP-READING WITH CUAVE

Feature Representation                               Classification Accuracy
Baseline Preprocessed Video                          58.5%
Video-Only Learning (Single Modality Learning)       65.4%
Our Features (Cross-Modality Learning)               68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009)    64.0%
Visemic AAM (Papandreou et al., 2009)                83.0%
MULTIMODAL RECOGNITION
CUAVE: 10-way digit classification, 36 speakers
Evaluate in clean and noisy audio scenarios. In the clean audio scenario, audio alone performs extremely well.

Feature Learning:    Audio + Video
Supervised Learning: Audio + Video
Testing:             Audio + Video

[Diagram: bimodal deep autoencoder with audio and video inputs, a shared representation, and reconstructions of both modalities]
MULTIMODAL RECOGNITION

Feature Representation                            Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM)                              75.8%
Our Best Video Features                           68.7%
Bimodal Deep Autoencoder                          77.3%
Bimodal Deep Autoencoder + Audio Features (RBM)   82.2%
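As a sketch of how such a feature combination might be evaluated (all data, dimensions, and names below are placeholders, not the paper's pipeline): concatenate the two feature sets and train a linear classifier on the result.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 500
bimodal_feats = rng.normal(size=(n, 512))    # placeholder bimodal deep AE features
audio_rbm_feats = rng.normal(size=(n, 256))  # placeholder audio RBM features
labels = rng.integers(0, 10, size=n)         # 10 digit classes, as in CUAVE

# Concatenate the two representations and train a linear classifier.
combined = np.hstack([bimodal_feats, audio_rbm_feats])
clf = LinearSVC().fit(combined, labels)
print("train accuracy:", clf.score(combined, labels))
```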
SHARED REPRESENTATION EVALUATION

[Diagram: a linear classifier is trained on the shared representation computed from one modality and tested on the shared representation computed from the other]

Feature Learning:    Audio + Video
Supervised Learning: Audio
Testing:             Video

Method: Learned Features + Canonical Correlation Analysis (CCA)

Feature Learning   Supervised Learning   Testing   Accuracy
Audio + Video      Audio                 Video     57.3%
Audio + Video      Video                 Audio     91.7%
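A brief sketch of the CCA step, using scikit-learn's CCA on placeholder features: both modalities are projected into a common space where they are maximally correlated, so a linear classifier trained on one modality's projections can be tested on the other's.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical learned features for aligned audio/video examples.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(1000, 100))  # placeholder audio representations
video_feats = rng.normal(size=(1000, 100))  # placeholder video representations

# Project both modalities into a common, maximally correlated space.
cca = CCA(n_components=50)
cca.fit(audio_feats, video_feats)
audio_shared, video_shared = cca.transform(audio_feats, video_feats)
# A linear classifier can now be trained on audio_shared and tested on video_shared.
```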
MCGURK EFFECT
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input   Video Input   Model Predictions
                            /ga/     /ba/     /da/
/ga/          /ga/          82.6%    2.2%     15.2%
/ba/          /ba/          4.4%     89.1%    6.5%
/ba/          /ga/          28.3%    13.0%    58.7%
CONCLUSION
• Applied deep autoencoders to discover features in multimodal data
• Cross-modality learning: obtained better video features (for lip-reading) by using audio as a cue
• Multimodal feature learning: learned representations that relate across audio and video data
[Diagrams: the cross-modality deep autoencoder (video input) and the bimodal deep autoencoder (shared representation over audio and video), as presented earlier]
THANK YOU FOR YOUR ATTENTION!