![Page 1: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/1.jpg)
Multimodal recognition of behavior and affect
Guest lecture for Affective Computing
Multimodal emotion recognition
Instructor: Mohammad Soleymani
![Page 2: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/2.jpg)
Multimodal emotion recognition
![Page 3: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/3.jpg)
• Observing external manifestations in short episodes
Affective states
Episodic
Emotions
Social signals, e.g., head nod
Mood
Sentiment
Personality
Lifetime
![Page 4: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/4.jpg)
Emotion
Emotions as componential constructs
[Figure: four physiological signals plotted against time (seconds): GSR, blood pressure, respiration pattern and EMG frontalis]
![Page 5: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/5.jpg)
• What does he feel?
Why multimodal?
Slide credit: Nicole Nelson
![Page 6: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/6.jpg)
Modality
• “Particular mode in which something exists or is experienced or expressed.”
• “a particular form of sensory perception.” for example, auditory, visual, touch
• Multimodal
• Examples:
• We also include other perceptual channels, for example, language
![Page 7: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/7.jpg)
Why multimodal?
• Complementary information
• Multimodal interaction
  • McGurk effect
• Robustness
  • Missing/noisy channels
![Page 8: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/8.jpg)
Human behavior sensing modalities
Behavior
Audio
Visual
Physiological response
Peripheral
Central
Language
Prosody
Face
Body
Gaze
Nonverbal
Verbal
![Page 9: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/9.jpg)
Unimodal, bimodal and trimodal interactions
Speaker’s behaviors and the sentiment intensity they convey:
Unimodal cues
• “This movie is sick”: ? (Ambiguous!)
• “This movie is fair”
• Smile
• Loud voice: ? (Ambiguous!)
Bimodal
• “This movie is sick” + Smile
• “This movie is sick” + Frown
Resolves ambiguity (bimodal interaction)
• “This movie is sick” + Loud voice: ? (Still ambiguous!)
Trimodal
• “This movie is sick” + Smile + Loud voice
• “This movie is fair” + Smile + Loud voice
Different trimodal interactions!
Slide credit: LP Morency
![Page 10: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/10.jpg)
Multimodal representation learning
for emotion recognition
![Page 11: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/11.jpg)
What are representations and why do they matter?
![Page 12: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/12.jpg)
Features are representations
![Page 13: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/13.jpg)
• Perceptron
• Multi-layer perceptron
Learning representations – neural networks
Learns representations
![Page 14: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/14.jpg)
Face encoders ConvNets – holistic methods
Levi and Hassner, ICMI 2015
Expression of emotion
Ensemble of CNNs
![Page 15: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/15.jpg)
ConvNets – patch based
• Ertugrul et al., 2019
• Use ZFace for 3D registration
• Create overlapping patches
• Pass them through CNN/3DCNN
![Page 16: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/16.jpg)
Convolutional nets – self-supervised
• Learning to rank, Lu et al., BMVC 2020
• Input: Sequence of frames extracted from a video.
• Network: ResNet-18 encoders with shared weights
• Loss: triplet losses between adjacent frames, summed into a ranking triplet loss
![Page 17: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/17.jpg)
Language encoders
• Language is sequential
• It has structure (grammar)
• It is also full of ambiguity
• Typical approach is to use a sequential encoder:
• CNNs
• Causal CNN
• RNNs
• Transformers
![Page 18: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/18.jpg)
How to learn word representations?
• Input and output are one-hot coded
He was walking away because …
He was running away because …
• The n-dim hidden layer learns a compact representation, e.g., 300d
Word2vec https://code.google.com/archive/p/word2vec/
GloVe https://nlp.stanford.edu/projects/glove/
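A minimal sketch of why the hidden layer acts as a word embedding: multiplying a one-hot input by the input weight matrix simply selects one row of that matrix. The five-word vocabulary and the 3-dimensional weights below are made-up values for illustration, not trained Word2vec or GloVe vectors.

```python
# Toy vocabulary and a hypothetical 3-d weight matrix (one row per word).
# Real models typically use a few hundred dimensions, e.g., 300.
vocab = ["he", "was", "walking", "running", "away"]
W = [
    [0.1, 0.0, 0.2],  # he
    [0.0, 0.3, 0.1],  # was
    [0.7, 0.6, 0.1],  # walking
    [0.8, 0.5, 0.1],  # running
    [0.2, 0.9, 0.4],  # away
]

def one_hot(word):
    """Encode a word as a vector with a single 1 at its vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden(one_hot_vec):
    """Hidden activation = one_hot^T W, i.e. the row of W for that word."""
    dim = len(W[0])
    return [
        sum(one_hot_vec[i] * W[i][d] for i in range(len(vocab)))
        for d in range(dim)
    ]

embedding = hidden(one_hot("running"))
```

After training on contexts such as the two example sentences, words that appear in similar contexts ("walking", "running") would be expected to end up with similar rows.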
![Page 19: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/19.jpg)
• Represent words as vectors
• Unsupervised method that learns from the neighboring words in text
• Word2vec and GloVe are popular examples
Word embedding
![Page 20: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/20.jpg)
CNN for text analysis
Kim, Y. “Convolutional Neural Networks for Sentence Classification”, EMNLP (2014)
![Page 21: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/21.jpg)
Recurrent neural networks
• You can use RNNs such as LSTM or GRU to encode language
• A famous example is sequence to sequence learning for translation
• Encoder-decoder architecture
![Page 22: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/22.jpg)
The age of Transformers!
• Unlike recurrent networks, it can process the whole sequence in parallel
• The main idea is to use multi-head self-attention
• Self-attention compares the elements of the input with one another to decide which ones should be taken into account
• Multiple attention heads can encode different information
• Position embeddings help retain the order; otherwise the input becomes a bag of words!
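The self-attention idea can be sketched in a few lines of plain Python. This is a simplified single-head version that omits the learned query, key and value projections and uses made-up 2-d token vectors; it is meant only to show the similarity-weighted mixing and why order is lost without position information.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention without projections."""
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for query in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, tokens))
            for d in range(len(tokens[0]))
        ])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token vectors
contextualized = self_attention(tokens)

# Reversing the input only reverses the output: without position
# embeddings the layer has no notion of word order.
reversed_out = self_attention(tokens[::-1])
```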
![Page 23: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/23.jpg)
BERT – Devlin et al 2019
• Unsupervised pre-training on multi-layer transformer
• Mask 15% of the words and train a model to predict them
• Predict the next sentence
• Can provide sentence-level and contextualized word embeddings
• BERT-base and BERT-large differ in model size: depth (12 vs. 24 layers), hidden size (768 vs. 1024) and number of attention heads (12 vs. 16)
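A rough sketch of the masking step, under simplifying assumptions: every selected token is replaced by a `[MASK]` placeholder, whereas BERT itself replaces only most of the selected tokens with the mask token and leaves or randomizes the rest. The function name, the seed and the example sentence are illustrative.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the tokens; return the masked sequence
    and a dict mapping masked positions to the original tokens."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]   # what the model should predict
        masked[i] = "[MASK]"     # what the model actually sees
    return masked, targets

sentence = ("the quick brown fox jumps over the lazy dog while the curious "
            "cat watches quietly from a warm sunny window").split()
masked, targets = mask_tokens(sentence)
```

The training objective is then to predict each entry of `targets` from the surrounding unmasked context.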
![Page 24: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/24.jpg)
Voice prosodic measurements
• Pitch/f0 tracking and contour
• Articulation rate and pause timings
• Mel frequency cepstral coefficients (MFCC)
• Compact representation of the spectrum
• Emulates human hearing
• Popular for speech recognition
![Page 25: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/25.jpg)
Deep spectrum features (voice)
S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, A. Baird and B. Schuller. Snore Sound Classification using Image-Based Deep Spectrum Features. In Proceedings of INTERSPEECH, 2017.
https://github.com/DeepSpectrum/DeepSpectrum
![Page 26: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/26.jpg)
• Representation learning application for you?
• When is it useful and when do you just use handcrafted features?
Questions?
![Page 27: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/27.jpg)
Multimodal fusion
![Page 28: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/28.jpg)
• Fusing information from multiple modalities
• Examples:
• Audiovisual speech recognition
• Audiovisual emotion recognition
• Multimodal biometrics (e.g., face and fingerprint)
• Fusion techniques
• Model free
• Early, late and hybrid
• Model-based
• Multiple kernel learning
• Neural networks
Multimodal fusion
![Page 29: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/29.jpg)
Model free approaches – early fusion
• Easy to implement – just concatenate the features
• Exploits dependencies between features
• Can end up very high dimensional
• More difficult to use if features have different framerates
Classifier
Modality 1
Modality 2
Modality n
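Early fusion amounts to concatenating the per-modality feature vectors before a single classifier sees them. The sketch below uses made-up feature values and assumes the features are already aligned to the same time step.

```python
def early_fusion(*modality_features):
    """Concatenate per-modality feature vectors into one input vector."""
    fused = []
    for features in modality_features:
        fused.extend(features)
    return fused

audio = [0.2, 0.7]         # hypothetical prosody features
visual = [0.9, 0.1, 0.4]   # hypothetical facial features
text = [0.5]               # hypothetical language feature

# The single classifier would be trained on x.
x = early_fusion(audio, visual, text)
```

The fused dimensionality is the sum of the unimodal dimensionalities, which is why early fusion can become very high dimensional.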
![Page 30: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/30.jpg)
Model free approaches – late fusion
• Train one unimodal predictor per modality and a multimodal fusion model on their outputs
• Requires multiple training stages
• Does not model low-level interactions between modalities
• Fusion mechanism can be voting, weighted sum or an ML approach
Modality 2
Classifier
Modality 1
Modality n
Fusion
mechanism
Classifier
Classifier
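Two of the fusion mechanisms mentioned for late fusion, weighted sum and voting, can be sketched as follows. The class probabilities and the weights are invented for illustration; in practice the weights would be tuned on validation data or replaced by a learned fusion model.

```python
from collections import Counter

def weighted_sum_fusion(probabilities, weights):
    """Weighted average of the per-modality class probabilities."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

def majority_vote(labels):
    """Label predicted by the largest number of unimodal classifiers."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical unimodal outputs: [P(negative), P(positive)]
audio_p, visual_p, text_p = [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]

fused = weighted_sum_fusion([audio_p, visual_p, text_p], weights=[1.0, 1.0, 2.0])
decision = fused.index(max(fused))
vote = majority_vote([p.index(max(p)) for p in (audio_p, visual_p, text_p)])
```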
![Page 31: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/31.jpg)
Model free approaches – hybrid fusion
Modality 2
Classifier
Modality 1
Fusion
mechanism
Classifier
Classifier
Modality 1
Modality 2
• Combine benefits of both early and late fusion mechanisms
![Page 32: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/32.jpg)
Model-based: Joint Representation
• For supervised learning tasks
• Joining the unimodal representations:
• Simple concatenation
• Element-wise multiplication or summation
• Multilayer perceptron
• How to explicitly model both unimodal and bimodal interactions?
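[Figure: text input $\boldsymbol{X}$ and image input $\boldsymbol{Y}$ are encoded into $\boldsymbol{h}_x$ and $\boldsymbol{h}_y$, joined into $\boldsymbol{h}_m$, and passed to a softmax layer, e.g., for sentiment]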
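The joining options listed for a joint representation can be sketched as below. The unimodal representations and the layer weights are made-up numbers; in a real model the encoders and the multilayer perceptron would be trained end to end on the supervised task.

```python
def concat(h_x, h_y):
    """Simple concatenation of the two unimodal representations."""
    return h_x + h_y

def elementwise_product(h_x, h_y):
    """Multiplicative joining; requires equal dimensionality."""
    return [a * b for a, b in zip(h_x, h_y)]

def elementwise_sum(h_x, h_y):
    """Additive joining; requires equal dimensionality."""
    return [a + b for a, b in zip(h_x, h_y)]

def dense_relu(h, weights, bias):
    """One perceptron layer on top of the joined vector."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, h)) + b)
        for row, b in zip(weights, bias)
    ]

h_x = [1.0, 2.0]    # hypothetical text representation
h_y = [0.5, -1.0]   # hypothetical image representation

joined = concat(h_x, h_y)
h_m = dense_relu(joined, [[0.5, 0.5, 0.5, 0.5], [1.0, -1.0, 1.0, -1.0]], [0.0, 0.0])
```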
![Page 33: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/33.jpg)
• Projecting different modalities/views into spaces where the correlation is maximized
• Can be sensitive to noise
• Not ideal if the information is complementary
Model-based: Canonical correlation analysis
Modality 1
Modality 2
Encoder 1
Encoder 2
CCA
$(a', b') = \underset{a,\,b}{\operatorname{argmax}}\ \operatorname{corr}(a^{T}X,\ b^{T}Y)$
Fusion
![Page 34: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/34.jpg)
Model-based: Multimodal Tensor Fusion Network (TFN)
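Can be extended to three modalities:
$\boldsymbol{h}_m = \begin{bmatrix} \boldsymbol{h}_x \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \boldsymbol{h}_y \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \boldsymbol{h}_z \\ 1 \end{bmatrix}$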
[Zadeh, Jones and Morency, EMNLP 2017]
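Explicitly models unimodal, bimodal and trimodal interactions!
[Figure: text $\boldsymbol{X}$, image $\boldsymbol{Y}$ and audio $\boldsymbol{Z}$ are encoded into $\boldsymbol{h}_x$, $\boldsymbol{h}_y$ and $\boldsymbol{h}_z$; the fused tensor contains the unimodal terms $\boldsymbol{h}_x$, $\boldsymbol{h}_y$, $\boldsymbol{h}_z$, the bimodal terms $\boldsymbol{h}_x \otimes \boldsymbol{h}_y$, $\boldsymbol{h}_x \otimes \boldsymbol{h}_z$, $\boldsymbol{h}_z \otimes \boldsymbol{h}_y$, and the trimodal term $\boldsymbol{h}_x \otimes \boldsymbol{h}_y \otimes \boldsymbol{h}_z$]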
Slide credit: LP Morency
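The three-way outer product with an appended constant 1 can be sketched in plain Python. The tiny representations below are made-up values chosen so that each entry of the fused tensor is easy to trace back to a unimodal, bimodal or trimodal term; this illustrates the fusion step only, not the full network.

```python
def append_one(h):
    """Append the constant 1 that keeps lower-order terms in the product."""
    return h + [1.0]

def tensor_fusion(h_x, h_y, h_z):
    """Three-way outer product of the 1-augmented representations."""
    x, y, z = append_one(h_x), append_one(h_y), append_one(h_z)
    return [[[a * b * c for c in z] for b in y] for a in x]

h_x = [2.0, 3.0]  # hypothetical text representation
h_y = [5.0]       # hypothetical image representation
h_z = [7.0]       # hypothetical audio representation

h_m = tensor_fusion(h_x, h_y, h_z)

# Index -1 along an axis picks the appended 1 for that modality.
unimodal_x = h_m[0][-1][-1]   # h_x[0]
bimodal_xy = h_m[0][0][-1]    # h_x[0] * h_y[0]
trimodal = h_m[0][0][0]       # h_x[0] * h_y[0] * h_z[0]
```

The size of the fused tensor is the product of the augmented dimensions, here 3 x 2 x 2, so it grows quickly with the size of the unimodal representations.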
![Page 35: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/35.jpg)
Model-based: Multimodal Transformer
Tsai et al., Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019
• Cross-modal attention for alignment
• Attention mechanism helps with (long-term) temporal dependency
• What is the limitation here?
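A simplified sketch of cross-modal attention between two unaligned streams: each step of the target modality queries every step of the source modality, so the sequences do not need to have the same length. Learned projections, multiple heads and stacking are omitted, and the word and audio vectors are made up; this is not the architecture of the cited paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(target, source):
    """For every target step, return a weighted mix of all source steps."""
    scale = math.sqrt(len(source[0]))
    attended, all_weights = [], []
    for query in target:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in source]
        weights = softmax(scores)
        attended.append([
            sum(w * step[d] for w, step in zip(weights, source))
            for d in range(len(source[0]))
        ])
        all_weights.append(weights)
    return attended, all_weights

# Hypothetical unaligned streams: 2 word vectors and 4 audio frames.
words = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]

attended, weights = cross_modal_attention(words, audio)
```

Every target step computes a score against every source step, so the cost grows with the product of the two sequence lengths.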
![Page 36: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/36.jpg)
• Imagine modalities have a high level of correlation/interaction – which fusion approach is better?
Question
![Page 37: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/37.jpg)
Case study 1
EEG Emotion recognition
![Page 38: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/38.jpg)
• The limbic system
  • Emotional significance
  • Coordination of emotional behavior
• Frontal brain lateralization
  • Right frontal: withdrawal
  • Left frontal: approach
• Other patterns
  • Synchronization of different neuronal populations
Brain and emotions
![Page 39: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/39.jpg)
Weak electrical activity from postsynaptic potentials generated in superficial layers of the cortex
Electroencephalogram (EEG)
![Page 40: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/40.jpg)
Based on Bashivan et al., ICLR, 2016
Convolutional neural nets for EEG
No pooling
![Page 41: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/41.jpg)
Cross-modal learning
• What if we use one modality with a stronger association for alignment?
• Behavior (facial expression) performs better for emotion recognition
Smile and EEG Correlation
![Page 42: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/42.jpg)
Cross-modal representation learning
• Jointly learn the other modality + class labels
• Representation can be applied to datasets without the behavioral modality
![Page 43: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/43.jpg)
Cross-modal representation learning
Rayatdoost, Rudrauf, Soleymani. Expression-guided EEG Representation Learning for Emotion Recognition, IEEE ICASSP 2020 (oral).
![Page 44: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/44.jpg)
Multimodal Gated Fusion
$H_{fusion} = [H_{EEG} \odot W_{EEG} \oplus H_{Face} \odot W_{Face}]$
Multimodal
1. S. Rayatdoost, D. Rudrauf, and M. Soleymani, “Multimodal Gated Information Fusion for Emotion Recognition from EEG Signals and Facial Behaviors”. In Proceedings of the 22nd ACM International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA. ACM, 2020.
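A minimal sketch of the gated fusion formula, assuming that ⊙ denotes element-wise multiplication, ⊕ denotes concatenation, and the gate weights are already available. How the gates are computed is not stated on the slide, so the values below are simply made up.

```python
def gated_fusion(h_eeg, w_eeg, h_face, w_face):
    """Scale each modality by its gate, then concatenate the results."""
    gated_eeg = [h * w for h, w in zip(h_eeg, w_eeg)]
    gated_face = [h * w for h, w in zip(h_face, w_face)]
    return gated_eeg + gated_face

h_eeg, w_eeg = [1.0, 2.0], [0.5, 0.0]     # hypothetical EEG features and gates
h_face, w_face = [4.0, -2.0], [1.0, 0.5]  # hypothetical face features and gates

h_fusion = gated_fusion(h_eeg, w_eeg, h_face, w_face)
```

A gate value near 0 suppresses a feature, for example from a noisy channel, while a value near 1 passes it through unchanged.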
![Page 45: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/45.jpg)
Fusion results – within database
| Classifier type | Test | Valence CR | Valence F1 | Arousal CR | Arousal F1 |
|---|---|---|---|---|---|
| EEG CNN | DAI-EF | 69.5 | 67.2 | 61.4 | 61.3 |
| MLP on face features | DAI-EF | 73.1 | 68.2 | 62.1 | 61.0 |
| Concatenate fusion | DAI-EF | 74.1 | 73.1 | 61.5 | 61.1 |
| Tensor fusion | DAI-EF | 74.1 | 73.0 | 61.1 | 60.8 |
| Gated Fusion | DAI-EF | 74.8 | 73.4 | 63.2 | 62.5 |
| Coordinated (Cosine) | DAI-EF | 71.1 | 68.8 | 61.9 | 61.7 |
| Gated coordinated fusion | DAI-EF | 75.4 | 74.1 | 63.9 | 63.3 |

Within-database results; the fusion rows are multimodal.
![Page 46: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/46.jpg)
Summary - EEG
• Behavior is a strong emotional signal
• Behavioral activity shows up in EEG signals
• Cross-modal relationship can be leveraged to improve emotion recognition from EEG
![Page 47: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/47.jpg)
• Where do you think EEG emotion recognition is useful?
Question
![Page 48: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/48.jpg)
Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion. In Proceedings of the 2020 International Conference on Multimodal Interaction (pp. 675-679).
Case study 2
Hierarchical fusion for detecting humorous utterances
![Page 49: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/49.jpg)
Detecting humorous utterances
• A context-aware hierarchical multi-modal fusion network for the task of punchline detection
Visual
“Nervous, I went down to the street to look for her. Now, I did not speak Portuguese. I did not know where the beach was. I could not call her on a cell phone because this was 1991, and the aliens had not given us that technology yet”
Acoustic Language
![Page 50: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/50.jpg)
Data and evaluation
• UR-FUNNY database
• 8257 humorous and non-humorous punchlines from TED talks
• Diverse in terms of topics and speakers
• 1866 videos, 1741 Speakers, 417 topics.
• Multimodal involving text, audio and visual modalities
• Each punchline is labelled humorous/non-humorous.
• Around 64%, 16%, and 20% of data was used for training, validation, and testing
![Page 51: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/51.jpg)
Hierarchical fusion
![Page 52: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/52.jpg)
Context modelling
![Page 53: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/53.jpg)
Results
• MFN: Memory Fusion Network
• TFN: Tensor Fusion Network
• EF: Early Fusion
• FF: Flat Fusion
• MF: Merge Fusion
![Page 54: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/54.jpg)
Summary
• Multimodal model better captures humor
• Language performs the best in unimodal models
• Hierarchical fusion better captures the inter-modality interactions for humor – maybe!
• Incorporating the context of punchline can boost the accuracy of prediction
![Page 55: Multimodal recognition of behavior and affect · 2021. 4. 14. · Choube, A. and Soleymani, M., 2020, October. Punchline Detection using Context-Aware Hierarchical Multimodal Fusion](https://reader035.vdocument.in/reader035/viewer/2022071515/613749d40ad5d20676488623/html5/thumbnails/55.jpg)
• Emotions are multi-faceted and have manifestations in different modalities
• Representation learning enables machine learning models to learn a useful representation without the need to handcraft new features
• Multimodal fusion increases robustness and takes advantage of complementary information
• Multimodal fusion often yields superior performance
Summary