MULTIMODAL DEEP LEARNING PRESENTATION

Presenter: TRUONG THI THU HOAI

Uploaded by hoailn on 05-Dec-2014


TRANSCRIPT

Page 1: Multimodal deep learning

MULTIMODAL DEEP LEARNING PRESENTATION

TRUONG THI THU HOAI

Page 2: Multimodal deep learning

MULTIMODAL DEEP LEARNING

Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam,

Honglak Lee, Andrew Y. Ng

Computer Science Department, Stanford University

Department of Music, Stanford University

Computer Science & Engineering Division, University of Michigan, Ann Arbor

Page 3: Multimodal deep learning

MCGURK EFFECT

In speech recognition, people are known to integrate audio-visual information in order to understand speech.

This was first exemplified in the McGurk effect, where a visual /ga/ paired with an audio /ba/ is perceived as /da/ by most subjects.

Page 4: Multimodal deep learning

AUDIO-VISUAL SPEECH RECOGNITION

Page 5: Multimodal deep learning

FEATURE CHALLENGE

[Diagram: input features feeding a classifier (e.g., an SVM)]

Page 6: Multimodal deep learning

REPRESENTING LIPS

• Can we learn better representations for audio/visual speech recognition?

• How can multimodal data (multiple sources of input) be used to find better features?

Page 7: Multimodal deep learning

UNSUPERVISED FEATURE LEARNING

[Figure: separate audio and video feature vectors]

Page 8: Multimodal deep learning

UNSUPERVISED FEATURE LEARNING

[Figure: concatenated audio and video feature vector]

Page 9: Multimodal deep learning

MULTIMODAL FEATURES

[Figure: multimodal feature vector]

Page 10: Multimodal deep learning

CROSS-MODALITY FEATURE LEARNING

[Figure: video feature vector learned with audio as a cue]

Page 11: Multimodal deep learning

FEATURE LEARNING MODELS

Page 12: Multimodal deep learning

BACKGROUND

Sparse Restricted Boltzmann Machines (RBMs)

$-\log P(v,h) \propto E(v,h) = \frac{1}{2\sigma^2}\sum_i v_i^2 - \frac{1}{\sigma^2}\Big(\sum_{i,j} v_i w_{ij} h_j + \sum_i b_i v_i + \sum_j c_j h_j\Big)$
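The energy above is the standard Gaussian-Bernoulli form used for real-valued inputs such as spectrogram values. As a minimal sketch (NumPy, with made-up dimensions and random parameters; the sparsity penalty implied by the slide's title is omitted), the energy can be computed as:

```python
import numpy as np

def gb_rbm_energy(v, h, W, b, c, sigma=1.0):
    """Energy of a Gaussian-Bernoulli RBM:
    E(v, h) = (1 / (2 sigma^2)) * sum_i v_i^2
              - (1 / sigma^2) * (v^T W h + b . v + c . h),
    so that -log P(v, h) equals E(v, h) up to the log-partition constant."""
    quadratic = np.sum(v ** 2) / (2.0 * sigma ** 2)
    linear_and_interaction = v @ W @ h + b @ v + c @ h
    return quadratic - linear_and_interaction / sigma ** 2

rng = np.random.default_rng(0)
v = rng.normal(size=4)                        # real-valued visible units
h = rng.integers(0, 2, size=3).astype(float)  # binary hidden units
W = rng.normal(scale=0.1, size=(4, 3))
b = rng.normal(scale=0.1, size=4)             # visible biases
c = rng.normal(scale=0.1, size=3)             # hidden biases
E = gb_rbm_energy(v, h, W, b, c)
```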

Page 13: Multimodal deep learning

FEATURE LEARNING WITH AUTOENCODERS

[Diagram: two separate autoencoders: audio input → audio reconstruction, and video input → video reconstruction]

Page 14: Multimodal deep learning

BIMODAL AUTOENCODER

[Diagram: bimodal autoencoder: audio and video inputs → shared hidden representation → audio and video reconstructions]
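A minimal sketch of this architecture's forward pass, assuming one shared sigmoid hidden layer over the concatenated modalities and linear decoders (random untrained weights, illustrative shapes only, not the trained model from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BimodalAutoencoder:
    """Concatenated audio+video input -> shared sigmoid hidden layer
    -> separate linear decoders for audio and video reconstruction.
    Weights are random and untrained: this only illustrates data flow."""

    def __init__(self, d_audio, d_video, d_hidden, rng):
        self.W_enc = rng.normal(scale=0.1, size=(d_audio + d_video, d_hidden))
        self.W_dec_audio = rng.normal(scale=0.1, size=(d_hidden, d_audio))
        self.W_dec_video = rng.normal(scale=0.1, size=(d_hidden, d_video))

    def forward(self, audio, video):
        x = np.concatenate([audio, video])
        hidden = sigmoid(x @ self.W_enc)  # shared hidden representation
        return hidden, hidden @ self.W_dec_audio, hidden @ self.W_dec_video

rng = np.random.default_rng(0)
ae = BimodalAutoencoder(d_audio=10, d_video=20, d_hidden=5, rng=rng)
h, audio_rec, video_rec = ae.forward(rng.normal(size=10), rng.normal(size=20))
```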

Page 15: Multimodal deep learning

SHALLOW LEARNING

[Diagram: a single layer of hidden units over video and audio inputs]

• Mostly unimodal features learned

Page 16: Multimodal deep learning

BIMODAL AUTOENCODER

[Diagram: bimodal autoencoder with video-only input → hidden representation → audio and video reconstructions]

Cross-modality Learning: Learn better video features by using audio as a cue

Page 17: Multimodal deep learning

CROSS-MODALITY DEEP AUTOENCODER

[Diagram: cross-modality deep autoencoder: video input → learned representation → audio and video reconstructions]

Page 18: Multimodal deep learning

CROSS-MODALITY DEEP AUTOENCODER

[Diagram: cross-modality deep autoencoder: audio input → learned representation → audio and video reconstructions]

Page 19: Multimodal deep learning

BIMODAL DEEP AUTOENCODERS

[Diagram: bimodal deep autoencoder: audio and video inputs → shared representation → audio and video reconstructions; the video pathway captures "visemes" (mouth shapes) and the audio pathway captures "phonemes"]

Page 20: Multimodal deep learning

BIMODAL DEEP AUTOENCODERS

[Diagram: bimodal deep autoencoder with video-only input, mapping "visemes" (mouth shapes) to audio and video reconstructions]

Page 21: Multimodal deep learning

BIMODAL DEEP AUTOENCODERS

[Diagram: bimodal deep autoencoder with audio-only input, mapping "phonemes" to audio and video reconstructions]

Page 22: Multimodal deep learning

TRAINING BIMODAL DEEP AUTOENCODER

[Diagram: the same bimodal deep autoencoder trained on three input cases: audio-only input, video-only input, and both inputs, each reconstructing both audio and video through the shared representation]

• Train a single model to perform all 3 tasks

• Similar in spirit to denoising autoencoders (Vincent et al., 2008)
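A sketch of how one (audio, video) pair could be expanded into the three training cases, assuming the missing modality is simply zeroed out (a hypothetical helper for illustration, not the authors' exact augmentation code):

```python
import numpy as np

def three_task_examples(audio, video):
    """Expand one (audio, video) pair into the three training cases for
    the bimodal deep autoencoder: both modalities, audio only (video
    zeroed out), and video only (audio zeroed out). The target is always
    the full (audio, video) pair, so the network must reconstruct a
    missing modality from the one it sees, much like a denoising
    autoencoder reconstructs a corrupted input."""
    cases = [
        (audio, video),                  # bimodal input
        (audio, np.zeros_like(video)),   # audio-only input
        (np.zeros_like(audio), video),   # video-only input
    ]
    return [(a, v, (audio, video)) for a, v in cases]

examples = three_task_examples(np.ones(3), np.ones(5))
```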

Page 23: Multimodal deep learning

EVALUATIONS

Page 24: Multimodal deep learning

VISUALIZATIONS OF LEARNED FEATURES

[Figure: audio (spectrogram) and video features learned over 100 ms windows, sampled at 0, 33, 67, and 100 ms]
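For context, a magnitude spectrogram of the kind shown can be computed with a short-time FFT; this is a generic sketch (Hann window, hypothetical frame sizes), not the paper's exact preprocessing:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time FFT.
    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

# 100 ms of a 1 kHz tone at a 16 kHz sampling rate
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
```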

Page 25: Multimodal deep learning

LEARNING SETTINGS

We will consider the learning settings shown in Figure 1.

Page 26: Multimodal deep learning

LIP-READING WITH AVLETTERS

AVLetters: 26-way letter classification, 10 speakers, 60x80-pixel lip regions

Cross-modality learning

[Diagram: cross-modality deep autoencoder: video input → learned representation → audio and video reconstructions]

Feature Learning | Supervised Learning | Testing
Audio + Video | Video | Video


Page 29: Multimodal deep learning

LIP-READING WITH AVLETTERS

Feature Representation | Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002) | 44.6%
Local Binary Pattern (Zhao & Barnard, 2009) | 58.5%
Video-Only Learning (Single Modality Learning) | 54.2%
Our Features (Cross Modality Learning) | 64.4%

Page 30: Multimodal deep learning

LIP-READING WITH CUAVE

CUAVE: 10-way digit classification, 36 speakers

Cross-modality learning

[Diagram: cross-modality deep autoencoder: video input → learned representation → audio and video reconstructions]

Feature Learning | Supervised Learning | Testing
Audio + Video | Video | Video


Page 33: Multimodal deep learning

LIP-READING WITH CUAVE

Feature Representation | Classification Accuracy
Baseline Preprocessed Video | 58.5%
Video-Only Learning (Single Modality Learning) | 65.4%
Our Features (Cross Modality Learning) | 68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009) | 64.0%
Visemic AAM (Papandreou et al., 2009) | 83.0%

Page 34: Multimodal deep learning

MULTIMODAL RECOGNITION

CUAVE: 10-way digit classification, 36 speakers

Evaluate in clean and noisy audio scenarios. In the clean audio scenario, audio alone performs extremely well.

Feature Learning | Supervised Learning | Testing
Audio + Video | Audio + Video | Audio + Video

[Diagram: bimodal deep autoencoder: audio and video inputs → shared representation → audio and video reconstructions]


Page 37: Multimodal deep learning

MULTIMODAL RECOGNITION

Feature Representation | Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM) | 75.8%
Our Best Video Features | 68.7%
Bimodal Deep Autoencoder | 77.3%
Bimodal Deep Autoencoder + Audio Features (RBM) | 82.2%


Page 39: Multimodal deep learning

SHARED REPRESENTATION EVALUATION

[Diagram: a linear classifier is trained on the shared representation computed from one modality and tested on the shared representation computed from the other]

Method: Learned Features + Canonical Correlation Analysis

Feature Learning | Supervised Learning | Testing | Accuracy
Audio + Video | Audio | Video | 57.3%
Audio + Video | Video | Audio | 91.7%
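Canonical Correlation Analysis finds maximally correlated linear projections of two views; a small self-contained sketch (whitening plus SVD, run on a synthetic shared signal, not the authors' pipeline) is:

```python
import numpy as np

def cca_correlations(X, Y, reg=1e-6):
    """Canonical correlations between two views via whitening + SVD.
    X: (n, dx), Y: (n, dy). Returns correlations in [0, 1], sorted
    descending (singular values of the whitened cross-covariance)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.clip(np.linalg.svd(T, compute_uv=False), 0.0, 1.0)

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))                  # shared latent signal
X = np.hstack([z, rng.normal(size=(200, 2))])  # "audio" view
Y = np.hstack([z, rng.normal(size=(200, 2))])  # "video" view
corrs = cca_correlations(X, Y)
```

Because both views contain the same latent column, the first canonical correlation comes out near 1 while the rest stay small.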


Page 41: Multimodal deep learning

MCGURK EFFECT

A visual /ga/ combined with an audio /ba/ is often perceived as /da/.

Audio Input | Video Input | Predicted /ga/ | Predicted /ba/ | Predicted /da/
/ga/ | /ga/ | 82.6% | 2.2% | 15.2%
/ba/ | /ba/ | 4.4% | 89.1% | 6.5%
/ba/ | /ga/ | 28.3% | 13.0% | 58.7%

Page 42: Multimodal deep learning

CONCLUSION

Applied deep autoencoders to discover features in multimodal data

Cross-modality Learning: We obtained better video features (for lip-reading) using audio as a cue

Multimodal Feature Learning: Learned representations that relate audio and video data

[Diagrams: cross-modality deep autoencoder (video input → learned representation → audio and video reconstructions) and bimodal deep autoencoder (audio and video inputs → shared representation → audio and video reconstructions)]

Page 43: Multimodal deep learning

THANK YOU FOR YOUR ATTENTION!