Strong and Simple Baselines for Multimodal Utterance Embeddings Paul Pu Liang*, Yao Chong Lim*, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov and Louis-Philippe Morency


Page 1: Strong and Simple Baselines for Multimodal Utterance Embeddings (slides: pliang/slides/naacl2019_baselines_slides.pdf)

Strong and Simple Baselines for Multimodal Utterance Embeddings

Paul Pu Liang*, Yao Chong Lim*, Yao-Hung Hubert Tsai, Ruslan Salakhutdinov and Louis-Philippe Morency

Page 2:

Human Language is often multimodal

Language
• Word choice
• Syntax
• Pragmatics

Visual
• Facial expressions
• Body language
• Eye contact
• Gestures

Acoustic
• Tone
• Prosody
• Phrasing

Sentiment
• Positive/Negative
• Intensity

Emotion
• Anger
• Happiness
• Sadness
• Confusion
• Fear
• Surprise

Meaning
• Sarcasm
• Humor

Page 3:

Human Language is often multimodal

“This movie is great” + Neutral expression → Sentiment Intensity

Page 4:

Human Language is often multimodal

“This movie is great” + Neutral expression → Sentiment Intensity
“This movie is great” + Smile → Sentiment Intensity

Page 5:

Challenges in Multimodal ML

Page 6:

Challenges in Multimodal ML

1. Intramodal interactions

Smile + Head nod vs. Smile + Head shake

Page 7:

Challenges in Multimodal ML

1. Intramodal interactions

2. Crossmodal interactions

Smile + Head nod vs. Smile + Head shake

Bimodal: “This movie is great” + Smile

Page 8:

Challenges in Multimodal ML

1. Intramodal interactions

2. Crossmodal interactions

Smile + Head nod vs. Smile + Head shake

Bimodal: “This movie is great” + Smile
Trimodal: “This movie is GREAT” + Smile + “great” is emphasized, drawn-out (Sarcasm)

Page 9:

Multimodal Language Embedding

“This is unbelievable!” — language, visual, and acoustic (loud) streams

Utterance Embedding (intramodal + crossmodal interactions)

Downstream Tasks:
• Sentiment Analysis
• Emotion Recognition
• Speaker Trait Recognition
• …


Page 11:

Why fast models?

• Applications: robots, virtual agents, intelligent personal assistants
• Processing large amounts of multimedia data

Page 12:

Research Question

Can we make principled but simple models for multimodal utterance embeddings that perform competitively?

Page 13:

Research Question

Can we make principled but simple models for multimodal utterance embeddings that perform competitively?

(Performance vs. speed plot: current SOTA vs. our goal)

Page 14:

Research Question

Can we make principled but simple models for multimodal utterance embeddings that perform competitively?

(Performance vs. speed plot: current SOTA vs. our goal)

Our models:
• Fewer parameters
• Closed-form solutions
• Linear functions
• Competitive with SOTA!

Page 15:

A language-only solution

Arora et al. (2016, 2017):

𝑤" 𝑤# 𝑤$

Sentenceembedding𝑚&

This manual is helpful

𝑤'Word embeddings

Page 16:

A language-only solution

Arora et al. (2016, 2017):

𝑤" 𝑤# 𝑤$

Sentenceembedding𝑚&

This manual is helpful

𝑤'

𝑝 𝑤) 𝑚& ∝ exp(𝑤) ⋅ 𝑚&)

Word embeddings

Page 17:

A language-only solution

Arora et al. (2016, 2017):

𝑤" 𝑤# 𝑤$

Sentenceembedding𝑚&

This manual is helpful

𝑤'

𝑝 𝑤) 𝑚& ∝ exp(𝑤) ⋅ 𝑚&)

Fast: No learnable parameters.

Word embeddings
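Maximizing this log-linear likelihood gives a sentence embedding close to a frequency-weighted average of the word vectors; Arora et al. (2017) derive the smoothed-inverse-frequency (SIF) weighting. A minimal sketch, assuming precomputed word vectors and unigram probabilities (the names and toy data below are illustrative, and the full method additionally removes a common principal component across sentences):

```python
import numpy as np

def sif_embedding(word_vecs, word_probs, a=1e-3):
    # Smoothed-inverse-frequency weights: frequent words are down-weighted,
    # rare words get weight close to 1 (Arora et al., 2017).
    weights = a / (a + word_probs)
    # Weighted average of word vectors approximates the MAP estimate of the
    # sentence embedding m_s under p(w_i | m_s) ∝ exp(w_i · m_s).
    return (weights[:, None] * word_vecs).mean(axis=0)

# Toy usage: a 4-word sentence with 5-dimensional word embeddings.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 5))
probs = np.array([0.05, 0.001, 0.06, 0.002])  # unigram p(w) per word
m_s = sif_embedding(vecs, probs)
```

Because there are no learned parameters, the whole computation is a single weighted average per sentence, which is what makes this baseline fast.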

Page 18:

MMB1: Representing intramodal interactions

Page 19:

MMB1: Representing intramodal interactions

Utterance embedding m_s
Words w_1, w_2, w_3, …, w_n: “It doesn’t give help” (Arora et al.)

Page 20:

MMB1: Representing intramodal interactions

Visual: v_1, v_2, v_3, …, v_n → Gaussian parameters μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → Gaussian parameters μ_a, σ_a
Words: w_1, w_2, w_3, …, w_n

Utterance embedding m_s

Utterance-level feature distributions: visual and audio

Page 21:

MMB1: Representing intramodal interactions

Visual: v_1, v_2, v_3, …, v_n → μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → μ_a, σ_a

Utterance embedding m_s

Linear transformations

Page 22:

MMB1: Representing intramodal interactions

Visual: v_1, v_2, v_3, …, v_n → μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → μ_a, σ_a
Words: w_1, w_2, w_3, …, w_n: “It doesn’t give help” (Arora et al.)

Utterance embedding m_s

Linear transformations — small number of additional parameters!
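In outline, MMB1 summarizes each nonverbal stream by its per-utterance Gaussian statistics (mean and standard deviation), passes them through learned linear maps, and combines them with the word-based embedding. A minimal sketch — the projection sizes and the final concatenation are assumptions for illustration, not necessarily the paper's exact composition:

```python
import numpy as np

def mmb1_embedding(word_embed, visual, audio, Wv, Wa):
    """Sketch of MMB1: intramodal Gaussian statistics + linear maps.

    word_embed: (dw,) Arora-style embedding of the spoken words.
    visual:     (T, dv) per-timestep visual features.
    audio:      (T, da) per-timestep acoustic features.
    Wv, Wa:     linear maps applied to the stacked [mean, std] statistics.
    """
    v_stats = np.concatenate([visual.mean(axis=0), visual.std(axis=0)])
    a_stats = np.concatenate([audio.mean(axis=0), audio.std(axis=0)])
    # Combine the three modality summaries into one utterance embedding.
    return np.concatenate([word_embed, Wv @ v_stats, Wa @ a_stats])

# Toy usage: 10 timesteps, 4-dim visual and 3-dim acoustic features.
rng = np.random.default_rng(0)
m = mmb1_embedding(
    word_embed=rng.normal(size=5),
    visual=rng.normal(size=(10, 4)),
    audio=rng.normal(size=(10, 3)),
    Wv=rng.normal(size=(6, 8)),   # maps the 2*4 visual stats to 6 dims
    Wa=rng.normal(size=(6, 6)),   # maps the 2*3 acoustic stats to 6 dims
)
```

The only learnable parameters are the two small linear maps, which is why the additional cost over the language-only baseline stays negligible.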

Page 23:

Crossmodal interactions

Emotion:
“It didn’t help” + Neutral face + Stable voice → Disappointment
“It didn’t help” + Sad face + Shaky voice → Sadness

Page 24:

MMB2: Incorporating crossmodal interactions

Utterance embedding m_s

Concatenated inputs (plus unimodal):
W+A: [w_1, a_1] … [w_n, a_n]
V+A: [v_1, a_1] … [v_n, a_n]
W+V: [w_1, v_1] … [w_n, v_n]
W+V+A: [w_1, v_1, a_1] … [w_n, v_n, a_n]

Page 25:

MMB2: Incorporating crossmodal interactions

Utterance embedding m_s

Concatenated inputs (plus unimodal):
W+A: [w_1, a_1] … [w_n, a_n] → μ_wa, σ_wa
V+A: [v_1, a_1] … [v_n, a_n] → μ_va, σ_va
W+V: [w_1, v_1] … [w_n, v_n] → μ_wv, σ_wv
W+V+A: [w_1, v_1, a_1] … [w_n, v_n, a_n] → μ_wva, σ_wva

Page 26:

MMB2: Incorporating crossmodal interactions

Utterance embedding m_s

Concatenated inputs (plus unimodal):
W+A: [w_1, a_1] … [w_n, a_n] → μ_wa, σ_wa
V+A: [v_1, a_1] … [v_n, a_n] → μ_va, σ_va
W+V: [w_1, v_1] … [w_n, v_n] → μ_wv, σ_wv
W+V+A: [w_1, v_1, a_1] … [w_n, v_n, a_n] → μ_wva, σ_wva

Linear transformations
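The crossmodal part of MMB2 can be sketched the same way: concatenate the temporally aligned features for each modality pair and the triple at every timestep, then take per-utterance Gaussian statistics of each concatenated sequence. The dictionary layout and names here are illustrative:

```python
import numpy as np

def crossmodal_stats(words, visual, audio):
    """Per-utterance [mean, std] statistics of concatenated modality
    pairs and the triple, as in MMB2 (sketch).

    words, visual, audio: (T, d*) temporally aligned feature sequences.
    """
    combos = {
        "wv":  np.concatenate([words, visual], axis=1),
        "wa":  np.concatenate([words, audio], axis=1),
        "va":  np.concatenate([visual, audio], axis=1),
        "wva": np.concatenate([words, visual, audio], axis=1),
    }
    # Each combination is summarized by the mean and standard deviation
    # of its concatenated features over time.
    return {name: np.concatenate([x.mean(axis=0), x.std(axis=0)])
            for name, x in combos.items()}

# Toy usage: 4 timesteps; word, visual, acoustic dims of 5, 3, 2.
rng = np.random.default_rng(0)
stats = crossmodal_stats(rng.normal(size=(4, 5)),
                         rng.normal(size=(4, 3)),
                         rng.normal(size=(4, 2)))
```

Because the statistics are taken over concatenated features, they capture how the modalities co-vary within an utterance, not just each stream in isolation.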

Page 27:

How do we optimize the model?

Visual: v_1, v_2, v_3, …, v_n → μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → μ_a, σ_a
Words: w_1, w_2, w_3, …, w_n

Utterance embedding m_s

Coordinate ascent-style

Page 28:

How do we optimize the model?

Visual: v_1, v_2, v_3, …, v_n → μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → μ_a, σ_a
Words: w_1, w_2, w_3, …, w_n

Utterance embedding m_s

Coordinate ascent-style: two steps each iteration

Page 29:

How do we optimize the model?

Visual: v_1, v_2, v_3, …, v_n → μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → μ_a, σ_a
Words: w_1, w_2, w_3, …, w_n

Utterance embedding m_s

Coordinate ascent-style, two steps each iteration:
1. Fix transformation parameters, solve for m_s

Page 30:

How do we optimize the model?

Visual: v_1, v_2, v_3, …, v_n → μ_v, σ_v
Audio: a_1, a_2, a_3, …, a_n → μ_a, σ_a
Words: w_1, w_2, w_3, …, w_n

Utterance embedding m_s

Coordinate ascent-style, two steps each iteration:
1. Fix transformation parameters, solve for m_s
2. Fix m_s, update transformation parameters by gradient descent
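The alternating scheme can be illustrated on a toy least-squares objective — this stand-in loss and every name below are illustrative, not the paper's actual likelihood. Step 1 solves for the embedding-like variable m in closed form with the linear map fixed; step 2 takes a gradient step on the map with m fixed:

```python
import numpy as np

def alternating_fit(X, Y, n_iters=200, lr=0.1):
    """Coordinate-style minimization of ||X @ W + m - Y||^2 over (W, m)."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iters):
        # Step 1: with W fixed, the optimal offset m has a closed form
        # (the mean residual), mirroring the closed-form solve for m_s.
        m = (Y - X @ W).mean(axis=0)
        # Step 2: with m fixed, take one gradient step on the linear map W.
        grad = 2 * X.T @ (X @ W + m - Y) / X.shape[0]
        W -= lr * grad
    return W, m

# Toy usage: recover a known linear map and constant offset.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
W_true = rng.normal(size=(3, 2))
Y = X @ W_true + 0.5           # noiseless targets with a constant offset
W, m = alternating_fit(X, Y)
```

Alternating a cheap closed-form solve with a gradient step is what keeps each training iteration fast compared to fully backpropagating through a deep model.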

Page 31:

Datasets

CMU-MOSI (Zadeh et al., 2016)
• Multimodal sentiment analysis dataset
• 2199 English opinion segments (monologues) from online videos

Example: “I thought it was fun” — language, visual, and acoustic streams (elongation, emphasis)

Page 32:

Datasets

POM (Park et al., 2014)
• Multimodal speaker traits recognition
• 903 English videos annotated for speaker traits such as confidence, dominance, vividness, relaxation, nervousness, humor, etc.

Page 33:

Compared Models

Deep neural models:
• Early Fusion: EF-LSTM
• DF (Nojavanasghari et al., 2016)
• Multi-view Learning: MV-LSTM (Rajagopalan et al., 2016)
• Contextual LSTM: BC-LSTM (Poria et al., 2017)
• Tensor Fusion: TFN (Zadeh et al., 2017)
• Memory Fusion: MFN (Zadeh et al., 2018)

Page 34:

Experiments

(Bar chart: binary accuracy (%) on CMU-MOSI sentiment, y-axis 69–78, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines); labeled values include 74.6, 75.1, and 77.4.)

Page 35:

Experiments

(Bar chart: MAE on POM speaker traits recognition, y-axis 0.71–0.82, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines); labeled values include 0.774, 0.746, and 0.785.)

Page 36:

Speed Comparisons

(Bar chart: average inference time (s), log scale 0.001–10, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines).)

Page 37:

Conclusion

• Proposed two simple but strong baselines for learning embeddings of multimodal utterances
• Try strong baselines before working on complicated models!

Page 38:

The End!

(Scatter plot: CMU-MOSI accuracy (%), y-axis 72–78, vs. inferences per second, log scale 100–1,000,000, for deep neural models vs. our baselines.)

GitHub: yaochie/multimodal-baselines
Email: [email protected], [email protected]

Page 39:

Additional Results

Page 40:
Page 41:
Page 42:
Page 43:

Experiments

(Bar chart: correlation on CMU-MOSI sentiment, y-axis 0.4–0.65, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines).)

Page 44:

Experiments

(Bar chart: F1 score on CMU-MOSI sentiment, y-axis 69–78, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines).)

Page 45:

Experiments

(Bar chart: 7-class accuracy (%) on CMU-MOSI sentiment, y-axis 20–36, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines).)

Page 46:

Experiments

(Bar chart: MAE on CMU-MOSI sentiment, y-axis 0.85–1.2, for EF-LSTM, DF, MV-LSTM, BC-LSTM, TFN, MFN (deep neural models) vs. MMB1, MMB2 (our baselines).)