Multimodal Analysis of Stand-up Comedians
Audio, Video and Lexical Analysis
Yash Singh, Madhav Sharan, Sree Priyanka Uppu, Nandan PC,
Harsh Fatepuria, Rahul Agrawal
Motivation
Data set Description
Feature Engineering
Feature Analysis
Machine Learning
Conclusion
Why Stand-up Comedians?
● We love watching stand-up
● They express a variety of emotions
● Audience feedback is available in the form of laughter
● Relatively new area of study
Motivation
Hypotheses
H1: Certain facial expressions could contribute to laughter
H2: Pauses and word elongation contribute to laughter
H3: Voice modulation (pitch and intensity changes) can also play a crucial role
H4: Laughter is sequential in nature, meaning small laughs can add up to bigger laughs
• We collected 3 hours 46 minutes of data from ‘The Tonight Show Starring Jimmy Fallon’ and ‘Late Night with Conan O’Brien’.
• 46 videos (11.76 GB), approximately 5 minutes each
• 27 male and 19 female artists
• The backdrop in the videos is dark
• For most of each video (80-90%), the artist faces the camera.
Data Collection
• For facial feature extraction, we manually blacked out frames in which the camera does not capture the artist’s face, setting the video features for those frames to 0.
• Audio features can also be 0, e.g. during a pause or while the audience is laughing.
Pre Processing
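The frame-blacking step above amounts to masking the frame-level feature matrix. A minimal sketch, assuming hypothetical `features` and `face_visible` arrays (not the authors' code):

```python
import numpy as np

# Frame-level video features: one row per frame, one column per feature
# (hypothetical values for illustration).
features = np.random.rand(6, 4)

# Boolean mask: True where the camera captures the artist's face.
face_visible = np.array([True, True, False, False, True, True])

# Zero out the feature rows for frames where the face is not visible,
# mirroring the manual blacking-out described above.
features[~face_visible] = 0.0

print(features[2])  # -> [0. 0. 0. 0.]
```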
• Manually segment the videos based on punch lines
• Annotate the laughter level in each segment, based on the product of mean pitch and mean intensity, as:
o Big (55% ~ 100% intensity)
o Small (36% ~ 55% intensity)
o No (0% ~ 36% intensity)
• A pitch range of 75 to 625 Hz covers a wide range of frequencies at a sampling rate of 10 ms.
• The pitch of laughter varies across videos, so it is normalized to the range [0, 1].
Data Annotation
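The annotation scheme above can be sketched as a simple bucketing function. This is a minimal illustration using the slide's thresholds; the function names and the min-max normalization choice are assumptions, not the authors' exact procedure:

```python
import numpy as np

def normalize(values):
    """Min-max normalize per-video values (e.g. pitch) to [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

def annotate_laughter(mean_pitch, mean_intensity):
    """Label a segment's laughter level from the product of its
    normalized mean pitch and mean intensity (both in [0, 1]),
    using the thresholds from the annotation scheme above."""
    score = mean_pitch * mean_intensity
    if score > 0.55:
        return "Big"
    elif score > 0.36:
        return "Small"
    return "No"

print(normalize([75, 350, 625]).tolist())  # -> [0.0, 0.5, 1.0]
print(annotate_laughter(0.9, 0.8))  # -> Big   (0.72)
print(annotate_laughter(0.7, 0.6))  # -> Small (0.42)
print(annotate_laughter(0.2, 0.5))  # -> No    (0.10)
```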
OpenSMILE
• Extracted 5 low-level descriptors:
✧ Musical chroma features (tone)
✧ Prosody features (loudness and pitch)
✧ Energy (1)
✧ MFCC (13 coefficients, MFCC 0-12, from 26 Mel-frequency bands)
• All these features were captured at a frame rate of 10 ms.
• Processing: aggregated the features by mean and standard deviation for each segment
Feature Engineering - Audio
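The per-segment aggregation described above (mean and standard deviation over 10 ms frames) can be sketched as follows; the frame values are hypothetical:

```python
import numpy as np

# Frame-level audio features (10 ms frames): rows = frames,
# columns = features such as pitch and loudness (made-up values).
frames = np.array([
    [0.2, 0.5],
    [0.4, 0.7],
    [0.6, 0.9],
])

# Aggregate the segment into a fixed-length vector: per-feature mean
# followed by per-feature standard deviation.
segment_vector = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

print(segment_vector.shape)  # -> (4,)
```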
OpenFace
• Extracted:
✧ eye gaze direction vectors in world coordinates for both eyes
✧ the location of the head with respect to the camera (in millimeters) and its rotation (in radians)
✧ 68 facial landmark locations in 2D pixel format (x, y)
✧ 33 rigid and non-rigid shape parameters
✧ 11 AU intensities and AU occurrences
• Processing: aggregated the features by mean and standard deviation for each segment
Feature Engineering - Video
• Analyzed features such as Action Units, gaze (y and z directions), pose (head rotation), various facial landmark points, frown, and eyebrow raise.
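A landmark-derived feature like the frown can be computed from the 68 2D landmark points. A minimal sketch, assuming the standard 68-point scheme in which indices 21 and 22 (0-based) are the inner eyebrow ends; the slide does not specify which points the authors used:

```python
import numpy as np

def frown_distance(landmarks):
    """Euclidean distance between the inner eyebrow ends.
    `landmarks` is a (68, 2) array of 2D pixel locations; in the
    standard 68-point scheme, indices 21 and 22 are the inner ends
    of the left and right eyebrows."""
    return float(np.linalg.norm(landmarks[21] - landmarks[22]))

# Hypothetical landmarks: only the two brow points matter here.
pts = np.zeros((68, 2))
pts[21] = [100.0, 80.0]
pts[22] = [112.0, 80.0]
print(frown_distance(pts))  # -> 12.0
```

A smaller inner-brow distance in a segment would indicate a stronger frown.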
IBM Watson
● Pauses
● Last pause
● Word elongation
● Sentiments
Feature Engineering - Textual
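The pause features above can be derived from word-level timestamps of the kind a speech-to-text service such as IBM Watson returns. A minimal sketch; the `(word, start, end)` tuple shape and the 0.3 s pause threshold are assumptions, not the authors' exact pipeline:

```python
def pause_features(timestamps, min_pause=0.3):
    """Compute pause features from word timings.
    `timestamps` is a list of (word, start, end) tuples in seconds.
    A gap between consecutive words longer than `min_pause` seconds
    counts as a pause."""
    pauses = []
    for (_, _, prev_end), (_, start, _) in zip(timestamps, timestamps[1:]):
        gap = start - prev_end
        if gap > min_pause:
            pauses.append(gap)
    last_pause = pauses[-1] if pauses else 0.0
    return {"num_pauses": len(pauses), "last_pause": last_pause}

words = [("so", 0.0, 0.2), ("I", 0.9, 1.0),
         ("said", 1.05, 1.4), ("nothing", 2.4, 2.9)]
feats = pause_features(words)
print(feats["num_pauses"], round(feats["last_pause"], 2))  # -> 2 1.0
```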
H1: AU-related features
Feature Analysis - Visual
AU 07 (Lid Tightener), AU 14 (Dimpler)
H1: Facial features
Feature Analysis - Visual
Frown (distance)
H2: Pause-related features
Feature Analysis - Textual
Last pause length, number of pauses
H3: Pitch, loudness, and energy-related features
Feature Analysis - Audio
Pitch variation, loudness variation
H3: Pitch, loudness, and energy-related features
Feature Analysis - Audio
Energy variation, energy mean
Multimodal analysis using a boosted decision tree classifier
Machine Learning
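The boosted-tree setup can be sketched as follows. The talk uses XGBoost; this minimal illustration swaps in scikit-learn's GradientBoostingClassifier on synthetic segment-level features (the data, feature count, and hyperparameters are all made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic segment-level vectors (e.g. concatenated audio, video,
# and text aggregates) with 3-way laughter labels.
X = rng.normal(size=(120, 8))
y = rng.integers(0, 3, size=120)  # 0 = No, 1 = Small, 2 = Big

# Boosted decision trees over the per-segment feature vectors.
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                 random_state=0)
clf.fit(X, y)

pred = clf.predict(X[:5])
print(pred.shape)  # -> (5,)
```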
H1: Certain facial expressions could contribute to laughter
H2: Pauses and word elongation contribute to laughter
Results
H3: Voice modulation (pitch and intensity changes) can also play a crucial role
XGBoost
Early Fusion:
● Min video frames = 100 frames/segment
● Min audio frames = 30 frames/segment
● Min text = 0 words/segment
● No good way of taking an equal number of frames from each modality → difficult to do early fusion
H4: Laughter is sequential in nature
LSTM - Challenge
H4: Laughter is sequential in nature
Late Fusion
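Late fusion combines per-modality predictions rather than per-modality features. A minimal sketch, averaging hypothetical class probabilities from the three modalities (the probability values and the averaging rule are illustrative assumptions):

```python
import numpy as np

# Per-modality class probabilities for one segment over the three
# laughter levels (No, Small, Big) -- hypothetical classifier outputs.
p_audio = np.array([0.2, 0.5, 0.3])
p_video = np.array([0.1, 0.3, 0.6])
p_text  = np.array([0.4, 0.4, 0.2])

# Late fusion: average the modality-level probabilities, then take
# the argmax as the fused prediction.
fused = (p_audio + p_video + p_text) / 3.0
labels = ["No", "Small", "Big"]
print(labels[int(np.argmax(fused))])  # -> Small
```

Weighted averaging or a meta-classifier over the three probability vectors are common variants of the same idea.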
To do
● Tune the LSTM
● Try other classifiers:
✧ SVM
✧ Naive Bayes
Thank you.