Multimodal Analysis of Stand-up Comedians
Audio, Video and Lexical Analysis
Yash Singh, Madhav Sharan, Sree Priyanka Uppu, Nandan PC,
Harsh Fatepuria, Rahul Agrawal
Motivation
Data set Description
Feature Engineering
Feature Analysis
Machine Learning
Conclusion
Why Stand-up Comedians?
● We love watching stand-up
● They express a variety of emotions
● Audience feedback is available in the form of laughter
● Relatively new area of study
Motivation
Hypotheses
H1: Certain facial expressions could contribute to laughter
H2: Pauses and word elongation contribute to laughter
H3: Voice modulation (pitch and intensity changes) can also play a crucial role
H4: Laughter is sequential in nature, meaning small laughs can add up to bigger laughs
• We collected 3 hours 46 minutes of data from ‘The Tonight Show Starring Jimmy Fallon’ and ‘Late Night with Conan O’Brien’.
• 46 videos (11.76 GB), approximately 5 minutes each
• 27 male and 19 female artists
• The backdrop in the videos is dark
• For most of each video (80-90%), the artist faces the camera.
Data Collection
• For facial feature extraction, we manually blacked out frames in which the camera does not capture the artist’s face, setting the video features for those frames to 0.
• Audio features can also be 0, e.g. during a pause or while the audience is laughing.
Pre Processing
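The frame-blacking step above amounts to masking the frame-level feature matrix. A minimal sketch, assuming hypothetical `features` and `face_visible` arrays (not the authors' code):

```python
import numpy as np

# Frame-level video features: one row per frame, one column per feature
# (hypothetical values for illustration).
features = np.random.rand(6, 4)

# Boolean mask: True where the camera captures the artist's face.
face_visible = np.array([True, True, False, False, True, True])

# Zero out the feature rows for frames where the face is not visible,
# mirroring the manual blacking-out described above.
features[~face_visible] = 0.0

print(features[2])  # -> [0. 0. 0. 0.]
```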
• Manually segment the videos based on punch lines
• Annotate the laughter level in each segment, based on the product of mean pitch and mean intensity, as:
o Big (55% ~ 100% intensity)
o Small (36% ~ 55% intensity)
o No (0% ~ 36% intensity)
• A pitch range of 75 to 625 Hz covers a wide range of frequencies at a sampling rate of 10 ms.
• The pitch of laughter varies across videos, so it is normalized to the range [0, 1].
Data Annotation
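The annotation scheme above can be sketched as a simple bucketing function. This is a minimal illustration using the slide's thresholds; the function names and the min-max normalization choice are assumptions, not the authors' exact procedure:

```python
import numpy as np

def normalize(values):
    """Min-max normalize per-video values (e.g. pitch) to [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

def annotate_laughter(mean_pitch, mean_intensity):
    """Label a segment's laughter level from the product of its
    normalized mean pitch and mean intensity (both in [0, 1]),
    using the thresholds from the annotation scheme above."""
    score = mean_pitch * mean_intensity
    if score > 0.55:
        return "Big"
    elif score > 0.36:
        return "Small"
    return "No"

print(normalize([75, 350, 625]).tolist())  # -> [0.0, 0.5, 1.0]
print(annotate_laughter(0.9, 0.8))  # -> Big   (0.72)
print(annotate_laughter(0.7, 0.6))  # -> Small (0.42)
print(annotate_laughter(0.2, 0.5))  # -> No    (0.10)
```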
OpenSMILE
• Extracted 5 low-level descriptors:
✧ Musical chroma features (tone)
✧ Prosody features (loudness and pitch)
✧ Energy (1)
✧ MFCC (13 coefficients, MFCC 0-12, from 26 Mel-frequency bands)
• All these features were captured at a frame rate of 10 ms.
• Processing: aggregated the features by mean and standard deviation for each segment
Feature Engineering - Audio
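The per-segment aggregation described above (mean and standard deviation over 10 ms frames) can be sketched as follows; the frame values are hypothetical:

```python
import numpy as np

# Frame-level audio features (10 ms frames): rows = frames,
# columns = features such as pitch and loudness (made-up values).
frames = np.array([
    [0.2, 0.5],
    [0.4, 0.7],
    [0.6, 0.9],
])

# Aggregate the segment into a fixed-length vector: per-feature mean
# followed by per-feature standard deviation.
segment_vector = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

print(segment_vector.shape)  # -> (4,)
```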
OpenFace
• Extracted:
✧ eye gaze direction vectors in world coordinates for both eyes
✧ the location of the head with respect to the camera (in millimeters) and its rotation (in radians)
✧ 68 facial landmark locations in 2D pixel format (x, y)
✧ 33 rigid and non-rigid shape parameters
✧ 11 AU intensities and AU occurrences
• Processing: aggregated the features by mean and standard deviation for each segment
Feature Engineering - Video
• Analyzed features such as Action Units, gaze (y and z directions), pose (head rotation), various facial landmark points, frown, and eyebrow raise.
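A landmark-derived feature like the frown can be computed from the 68 2D landmark points. A minimal sketch, assuming the standard 68-point scheme in which indices 21 and 22 (0-based) are the inner eyebrow ends; the slide does not specify which points the authors used:

```python
import numpy as np

def frown_distance(landmarks):
    """Euclidean distance between the inner eyebrow ends.
    `landmarks` is a (68, 2) array of 2D pixel locations; in the
    standard 68-point scheme, indices 21 and 22 are the inner ends
    of the left and right eyebrows."""
    return float(np.linalg.norm(landmarks[21] - landmarks[22]))

# Hypothetical landmarks: only the two brow points matter here.
pts = np.zeros((68, 2))
pts[21] = [100.0, 80.0]
pts[22] = [112.0, 80.0]
print(frown_distance(pts))  # -> 12.0
```

A smaller inner-brow distance in a segment would indicate a stronger frown.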
IBM Watson
● Pauses
● Last pause
● Word elongation
● Sentiments
Feature Engineering - Textual
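The pause features above can be derived from word-level timestamps of the kind a speech-to-text service such as IBM Watson returns. A minimal sketch; the `(word, start, end)` tuple shape and the 0.3 s pause threshold are assumptions, not the authors' exact pipeline:

```python
def pause_features(timestamps, min_pause=0.3):
    """Compute pause features from word timings.
    `timestamps` is a list of (word, start, end) tuples in seconds.
    A gap between consecutive words longer than `min_pause` seconds
    counts as a pause."""
    pauses = []
    for (_, _, prev_end), (_, start, _) in zip(timestamps, timestamps[1:]):
        gap = start - prev_end
        if gap > min_pause:
            pauses.append(gap)
    last_pause = pauses[-1] if pauses else 0.0
    return {"num_pauses": len(pauses), "last_pause": last_pause}

words = [("so", 0.0, 0.2), ("I", 0.9, 1.0),
         ("said", 1.05, 1.4), ("nothing", 2.4, 2.9)]
feats = pause_features(words)
print(feats["num_pauses"], round(feats["last_pause"], 2))  # -> 2 1.0
```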
H1: AU-related features
Feature Analysis - Visual
AU 07 (Lid Tightener), AU 14 (Dimpler)
H1: Facial features
Feature Analysis - Visual
Frown (distance)
H2: Pause-related features
Feature Analysis - Textual
Last pause length, number of pauses
H3: Pitch, loudness, and energy-related features
Feature Analysis - Audio
Pitch variation, loudness variation
H3: Pitch, loudness, and energy-related features
Feature Analysis - Audio
Energy variation, energy mean
Multimodal analysis using a boosted decision tree classifier
Machine Learning
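The boosted-tree setup can be sketched as follows. The talk uses XGBoost; this minimal illustration swaps in scikit-learn's GradientBoostingClassifier on synthetic segment-level features (the data, feature count, and hyperparameters are all made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic segment-level vectors (e.g. concatenated audio, video,
# and text aggregates) with 3-way laughter labels.
X = rng.normal(size=(120, 8))
y = rng.integers(0, 3, size=120)  # 0 = No, 1 = Small, 2 = Big

# Boosted decision trees over the per-segment feature vectors.
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                 random_state=0)
clf.fit(X, y)

pred = clf.predict(X[:5])
print(pred.shape)  # -> (5,)
```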
H1: Certain facial expressions could contribute to laughter
H2: Pauses and word elongation contribute to laughter
Results
H3: Voice modulation (pitch and intensity changes) can also play a crucial role
XGBoost
Early Fusion:
● Min video frames = 100 frames/segment
● Min audio frames = 30 frames/segment
● Min text = 0 words/segment
● No good way of taking an equal number of frames from each modality → difficult to do early fusion
H4: Laughter is sequential in nature
LSTM - Challenge
H4: Laughter is sequential in nature
Late Fusion
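Late fusion combines per-modality predictions rather than per-modality features. A minimal sketch, averaging hypothetical class probabilities from the three modalities (the probability values and the averaging rule are illustrative assumptions):

```python
import numpy as np

# Per-modality class probabilities for one segment over the three
# laughter levels (No, Small, Big) -- hypothetical classifier outputs.
p_audio = np.array([0.2, 0.5, 0.3])
p_video = np.array([0.1, 0.3, 0.6])
p_text  = np.array([0.4, 0.4, 0.2])

# Late fusion: average the modality-level probabilities, then take
# the argmax as the fused prediction.
fused = (p_audio + p_video + p_text) / 3.0
labels = ["No", "Small", "Big"]
print(labels[int(np.argmax(fused))])  # -> Small
```

Weighted averaging or a meta-classifier over the three probability vectors are common variants of the same idea.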
To do
● Tune the LSTM
● Try other classifiers:
✧ SVM
✧ Naive Bayes
Thank you.