
Page 1: Multimedia Information Retrieval and User Behavior

Date 22/10/2012

Multimedia Information Retrieval and User Behavior in Social Media
Eleonora Ciceri, [email protected]

Page 2: Multimedia Information Retrieval and User Behavior

Outline

✤ Multimedia Information Retrieval on large data sets

✤ The “giants” of photo uploads

✤ Image search

✤ Descriptors

✤ Bag of Visual Words

✤ Analyzing User Motivations in Video Blogging

✤ What is a video blog?

✤ Non-verbal communication

✤ Automatic processing pipeline

✤ Cues extraction & Results

✤ Cues vs. Social Attention

Page 3: Multimedia Information Retrieval and User Behavior

Multimedia Information Retrieval on large data sets

Page 4: Multimedia Information Retrieval and User Behavior

The “giants” of photo uploads

✤ Flickr uploads: (source: http://www.flickr.com/)

✤ 1.54 million photos per day on average

✤ 51 million users

✤ 6 billion images

✤ Facebook uploads: (source: http://thenextweb.com/)

✤ 250 million photos per day on average

✤ 845 million users in February 2012

✤ 90+ billion photos in August 2011

✤ “Flickr hits 6 billion total photos, Facebook does that every two months”

Page 5: Multimedia Information Retrieval and User Behavior

Image search

✤ Query by example: look for a particular object / scene / location in a collection of images

Page 6: Multimedia Information Retrieval and User Behavior

Image search

✤ Copy detection

✤ Annotation / Classification / Detection

(example labels: “dog”, “dog”?, “dog”, “child”)

Page 7: Multimedia Information Retrieval and User Behavior

Descriptors

✤ How can we look for similar images?

✤ Compute a descriptor: mathematical representation

✤ Find similar descriptors

✤ Problem: occlusions, changes in rotation, scale, and lighting

Page 8: Multimedia Information Retrieval and User Behavior

Descriptors

✤ How can we look for similar images?

✤ Compute a descriptor: mathematical representation

✤ Find similar descriptors

✤ Solution: invariant descriptors (to scale / rotation...)

Page 9: Multimedia Information Retrieval and User Behavior

Global descriptors

✤ Global descriptors: one descriptor per image (highly scalable)

✤ Color histogram: representation of the distribution of colors

✤ Pros: high invariance to many transformations

✤ Cons: high invariance to TOO many transformations (limited discriminative power)
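To make the color-histogram idea concrete, here is a minimal sketch of a global descriptor and its comparison using OpenCV; the color space, bin counts, and file names are illustrative choices, not something prescribed by the slide:

```python
import cv2

def color_histogram(image_path, bins=(8, 8, 8)):
    """Global descriptor: a normalized 3D color histogram in HSV space."""
    img = cv2.imread(image_path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    cv2.normalize(hist, hist)  # make the descriptor independent of image size
    return hist.flatten()

# Compare two images by histogram correlation (1.0 = identical distributions)
h1 = color_histogram("query.jpg")
h2 = color_histogram("candidate.jpg")
print(cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL))
```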

Page 10: Multimedia Information Retrieval and User Behavior

Local descriptors

✤ Local descriptors: find regions of interest that will be exploited for image comparison

✤ SIFT: Scale Invariant Feature Transform

✤ Extract key-points (maxima and minima in the Difference of Gaussian image)

✤ Assign orientation to key-points (result: rotation invariance)

✤ Generate the feature vector for each key-point
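As an illustration of the three SIFT steps listed above, a minimal sketch using OpenCV (this assumes an OpenCV version where cv2.SIFT_create is available; the image path is illustrative):

```python
import cv2

# SIFT works on intensity values, so load the image in grayscale
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()

# detectAndCompute runs all three steps:
#   1) key-point detection (DoG extrema), 2) orientation assignment,
#   3) one 128-dimensional feature vector per key-point
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), "key-points")
print(descriptors.shape)  # (number_of_keypoints, 128)
```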

Page 11: Multimedia Information Retrieval and User Behavior

Direct matching

✤ Assumptions:

✤ m = 1000 descriptors per image

✤ Each descriptor has d = 128 dimensions

✤ N > 1,000,000 images in the data set

✤ Search: a query is submitted; results are retrieved

✤ Each descriptor of the query image is tested against each descriptor in the image data set

✤ Complexity: m²Nd elementary operations; Required space: ???

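A back-of-the-envelope check of the operation count under the assumptions above; the storage figure additionally assumes 1 byte per descriptor dimension, which the slide does not state:

```latex
\[ m^2 N d = 1000^2 \cdot 10^{6} \cdot 128 \approx 1.3 \times 10^{14} \ \text{elementary operations} \]
\[ N \cdot m \cdot d = 10^{6} \cdot 1000 \cdot 128 \ \text{bytes} = 128\ \text{GB of raw descriptors (at 1 byte per dimension)} \]
```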

Page 12: Multimedia Information Retrieval and User Behavior

Bag of Visual Words

✤ Objective: “put the images into words” (visual words)

✤ What is a visual word? “A small part of the image that carries some kind of information related to the features” [Wikipedia]

✤ Text-image analogy:

✤ Visual word: small patch of the image

✤ Visual term: cluster of patches that convey the same information

✤ Bag of visual words: collection of words that, taken together, describe the meaning of the image

Page 13: Multimedia Information Retrieval and User Behavior

Bag of Visual Words

✤ How to build a visual dictionary?

✤ Local descriptors are clustered

✤ A local descriptor is assigned to its nearest neighbor:

q(x) = argmin_{w ∈ ω} ‖ x − μ_w ‖²

where ω is the visual dictionary, w ∈ ω is a visual word (cluster), and μ_w is the mean of cluster w.
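A minimal sketch of the dictionary-building and quantization steps, assuming SIFT descriptors have already been stacked into arrays (file names and the vocabulary size are illustrative; scikit-learn's KMeans stands in for whatever clustering is used in practice):

```python
import numpy as np
from sklearn.cluster import KMeans

# All local descriptors pooled from the training collection, shape (n, 128)
all_descriptors = np.load("all_descriptors.npy")

# Cluster the descriptors: the centroids mu_w form the visual dictionary
kmeans = KMeans(n_clusters=1000, random_state=0).fit(all_descriptors)

def quantize(descriptors, kmeans):
    """q(x) = argmin_w ||x - mu_w||^2 for every local descriptor x."""
    return kmeans.predict(descriptors)

# Bag-of-visual-words histogram for one image
image_descriptors = np.load("image_descriptors.npy")
words = quantize(image_descriptors, kmeans)
bow = np.bincount(words, minlength=kmeans.n_clusters)
```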

Page 14: Multimedia Information Retrieval and User Behavior

Why Visual Words?

✤ Pros:

✤ Much more compact representation

✤ We can apply text retrieval techniques (such as tf-idf weighting) to image retrieval systems

tfidf(t, d, D) = [ f(t, d) / max{ f(w, d) : w ∈ d } ] · log( |D| / |{d ∈ D : t ∈ d}| )

✤ Each image becomes a tf-idf weighted vector in ℝ^d; retrieval reduces to finding similar vectors and returning the corresponding results
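A sketch of the tf-idf weighting and vector comparison applied to the bag-of-visual-words histograms from the previous sketch, following the formula above (matrix layout and function names are illustrative):

```python
import numpy as np

def tfidf_matrix(bow_counts):
    """bow_counts: (n_images, vocab_size) matrix of raw visual-word counts."""
    # Term frequency, normalized by the most frequent word in each image
    tf = bow_counts / np.maximum(bow_counts.max(axis=1, keepdims=True), 1)
    # Inverse document frequency: down-weight words occurring in many images
    df = np.count_nonzero(bow_counts, axis=0)
    idf = np.log(bow_counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def cosine_search(query_vec, database, top_k=5):
    """Return indices of the top_k database vectors most similar to the query."""
    norms = np.linalg.norm(database, axis=1) * np.linalg.norm(query_vec)
    scores = database @ query_vec / np.maximum(norms, 1e-12)
    return np.argsort(-scores)[:top_k]
```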

Page 15: Multimedia Information Retrieval and User Behavior

Analyzing User Motivations in Video Blogging

Page 16: Multimedia Information Retrieval and User Behavior

What is a video blog?

✤ Video blog (vlog): conversational videos in which people (usually a single person) talk facing the camera, addressing the audience in a Skype-style fashion

✤ Examples: video testimonials (companies pay vloggers to test products), video advice (e.g., how to dress for a party), discussions

Page 17: Multimedia Information Retrieval and User Behavior

Why vlogs are used

✤ Community: high participation through comments, ratings, critique, and discussion

✤ Life documentary and daily interaction

✤ E-learning and entertainment

✤ Marketing and corporate communication

Page 18: Multimedia Information Retrieval and User Behavior

Why vlogs are studied

✤ Why are vlogs relevant?

✤ Automatic analysis of personal websites, blogs, and social networks (aimed at understanding users’ motivations) has so far been limited to text

✤ Vlog is a new type of social media (40% of the most viewed videos on YouTube): how to do automatic analysis?

✤ Study a real-life communication scenario

✤ Human judgments are based on first impressions: can we predict them?

Page 19: Multimedia Information Retrieval and User Behavior

Real communication vs. vlogs

✤ Real communication

✤ Synchronous

✤ Two (or more) people interact

✤ Vlog

✤ Asynchronous

✤ Monologue

✤ Metadata (comments, ratings)

Page 20: Multimedia Information Retrieval and User Behavior

Non-verbal communication

✤ The process of communication through sending and receiving wordless/visual cues between people

✤ Why? To express aspects of identity (age, occupation, culture, personality)

✤ Body: gestures, touch, body language, posture, facial expression, eye contact

✤ Speech: voice quality, rate, pitch, volume, rhythm, intonation

Page 21: Multimedia Information Retrieval and User Behavior
Page 22: Multimedia Information Retrieval and User Behavior

An example: dominance

✤ Power: capacity or right to control others

✤ Dominance: way of exerting power involving the motive to control others

✤ Behavior: talk louder, talk longer, speak first, interrupt more, gesture more, receive more visual attention

Page 23: Multimedia Information Retrieval and User Behavior

Automatic processing pipeline

✤ Pipeline (cf. Fig. 4 in Biel and Gatica-Perez: preprocessing, steps 1-3, and nonverbal cue extraction, steps 4-5):

✤ Shot boundary detection (based on color histogram differences)

✤ Face detection (Viola-Jones algorithm)

✤ Shot selection: discard non-conversational shots (too short, no faces, not talking)

✤ Audio and visual cue extraction (for each shot)

✤ Aggregation of shot-level cues into video-level cues

(From the embedded paper excerpt: vlogs contain diverse, partly non-conversational content, so the preprocessing discards openings, closings, and intermediate snippets before nonverbal cues are extracted from the remaining conversational, talking-head shots.)
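A minimal sketch of the first pipeline step, shot boundary detection, following the approach described in the embedded paper excerpts (thresholding the Bhattacharyya distance between color histograms of consecutive frames); the bin counts are illustrative, and the default threshold corresponds to the global threshold of 0.5 reported in the excerpt:

```python
import cv2

def shot_boundaries(video_path, threshold=0.5):
    """Detect hard cuts via histogram differences between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Bhattacharyya distance is large when consecutive frames differ a lot
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                boundaries.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return boundaries
```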

Page 24: Multimedia Information Retrieval and User Behavior

Visual cue extraction

✤ Weighted Motion Energy Images (wMEI):

✤ It indicates the visual activity of a pixel (accumulated motion through the video)

✤ Brighter pixels: regions with higher motion

(Embedded excerpt from Biel, Aran, and Gatica-Perez, AAAI 2011: 442 one-minute conversational vlogs collected from YouTube; personality impressions crowdsourced on Amazon Mechanical Turk with the TIPI questionnaire; audio cues extracted with the MIT Human Dynamics toolbox; visual cues based on weighted Motion Energy Images (wMEI), normalized by the maximum pixel value and summarized by entropy, mean, median, and center of mass. Fig. 2 of the paper shows wMEI images for two vlogs.)

wMEI = Σ_{f ∈ V} D_f

where D_f is the binary image containing the moving pixels in frame f, and V is the set of frames in the video.
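A minimal sketch of the wMEI computation via simple frame differencing; the differencing threshold is an illustrative choice, while the 320x240 resize and the normalization by the maximum pixel value follow the embedded excerpt:

```python
import cv2
import numpy as np

def weighted_mei(video_path, size=(320, 240), diff_threshold=25):
    """Accumulate binary motion masks D_f over all frames, then normalize."""
    cap = cv2.VideoCapture(video_path)
    wmei, prev = None, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # D_f: pixels whose intensity changed more than the threshold
            d_f = (cv2.absdiff(gray, prev) > diff_threshold).astype(np.float32)
            wmei = d_f if wmei is None else wmei + d_f
        prev = gray
    cap.release()
    if wmei is None or wmei.max() == 0:
        return wmei
    return wmei / wmei.max()  # brighter pixels = regions with higher motion
```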

Page 25: Multimedia Information Retrieval and User Behavior

Visual cue extraction

✤ It is difficult to estimate the actual direction of the eyes

✤ If the face is in a frontal position, the vlogger is most likely looking at the camera

(Embedded excerpt from Biel and Gatica-Perez, VlogSense, Sections 5.1-5.2: frontal faces are detected with the Viola-Jones algorithm (OpenCV implementation, faces as small as 20x20 pixels); shot boundaries are found by thresholding the Bhattacharyya distance between RGB color histograms of consecutive frames (global threshold 0.5, EER = 15%); conversational shots are then selected by thresholding s = α·r_f + β·r_d, a linear combination of the face detection rate r_f and the relative shot duration r_d, with α = 0.76, β = 0.24, threshold 0.29, EER = 7.5%.)

We are interested in frontal face detection

Page 26: Multimedia Information Retrieval and User Behavior

Visual cue extraction


✤ Looking time: looking activity (how much the vlogger looks at the camera)

✤ Looking segment length: persistence of the vlogger’s gaze

✤ Looking turns: looking activity (how often the vlogger turns toward the camera)

✤ Proximity to camera: choice of addressing the camera from close-ups

✤ Vertical framing: how much the vlogger shows the upper body

✤ Vertical head motion

Page 27: Multimedia Information Retrieval and User Behavior

Visual cue extraction

✤ Looking time: Σ_{L ∈ V} t_L / t_V

✤ Looking segment length: Σ_{L ∈ V} t_L / N_L

✤ Looking turns: N_L / t_V

✤ Proximity to camera: Σ_{f ∈ V} A_face(f) / (N_f · A(f))

✤ Vertical framing: Σ_{f ∈ V} ‖ c_face(f) − c(f) ‖ / (N_f · f_h)

✤ Vertical head motion: σ(‖ c_face(f) − c(f) ‖) / μ(‖ c_face(f) − c(f) ‖)

where t_L is the duration of looking segment L, N_L the number of looking segments, t_V the video duration, A_face(f) the face area in frame f, A(f) the frame area, N_f the number of frames containing a face, c_face(f) and c(f) the face and frame centers, and f_h the frame height.
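A sketch of how a few of these cues could be computed from per-frame face detections, under the same simplifying assumption as in the slides (a detected frontal face means the vlogger is looking at the camera); the input format, with one (x, y, w, h) box or None per frame, is an illustrative choice:

```python
import numpy as np

def visual_cues(face_boxes, frame_size, fps):
    """face_boxes: one Viola-Jones detection (x, y, w, h) per frame, or None."""
    frame_w, frame_h = frame_size
    looking = np.array([b is not None for b in face_boxes])
    video_seconds = len(face_boxes) / fps

    # Looking time: fraction of frames with a frontal face detected
    looking_time = looking.mean()

    # Looking turns: number of looking segments per second (N_L / t_V)
    starts = np.count_nonzero(np.diff(looking.astype(int)) == 1) + int(looking[0])
    looking_turns = starts / video_seconds

    # Proximity to camera: average face area relative to the frame area
    areas = [b[2] * b[3] for b in face_boxes if b is not None]
    proximity = np.mean(areas) / (frame_w * frame_h) if areas else 0.0

    return {"looking_time": looking_time,
            "looking_turns": looking_turns,
            "proximity": proximity}
```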

Page 28: Multimedia Information Retrieval and User Behavior

Audio cue extraction

✤ Speaking time: speaking activity (how much the vlogger talks)

✤ Speech segment avg. length: fluency (duration of silent pauses)

✤ Speaking turns: fluency (number of silent pauses)

✤ Voicing rate: fluency (number of phonemes, i.e., how fast the vlogger speaks)

✤ Speaking energy: emotional stability (how well the vlogger controls loudness)

✤ Pitch variation: emotional state (how well the vlogger controls tone)

Page 29: Multimedia Information Retrieval and User Behavior

Audio cue extraction

✤ Speaking time: Σ_{S ∈ V} t_S / t_V

✤ Speech segment avg. length: Σ_{S ∈ V} t_S / N_S

✤ Speaking turns: N_S / t_V

✤ Voicing rate: N_S / Σ_{S ∈ V} t_S

✤ Speaking energy: σ(S_energy) / μ(S_energy)

✤ Pitch variation: σ(pitch) / μ(pitch)

where t_S is the duration of speech segment S, N_S the number of speech segments, and t_V the video duration.
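A sketch of the aggregation step for the first three audio cues, starting from a speech/non-speech segmentation given as (start, end) times in seconds; the segmentation itself (done in the papers with a two-level HMM toolbox) is assumed to be available:

```python
import numpy as np

def audio_cues(speech_segments, video_duration):
    """speech_segments: list of (start, end) times in seconds for one video."""
    durations = np.array([end - start for start, end in speech_segments])
    total_speech = durations.sum()
    n_segments = len(speech_segments)

    return {
        # Speaking time: fraction of the video spent speaking
        "speaking_time": total_speech / video_duration,
        # Average length of a speech segment
        "avg_segment_length": total_speech / n_segments if n_segments else 0.0,
        # Speaking turns per second of video
        "speaking_turns": n_segments / video_duration,
    }

# Example: three speech segments in a 60-second vlog slice
print(audio_cues([(0.5, 3.0), (4.0, 9.5), (11.0, 20.0)], video_duration=60.0))
```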

Page 30: Multimedia Information Retrieval and User Behavior

Combining audio and visual cues

✤ Combine “looking at the camera” with “speaking”: four modalities

✤ These measures are used to determine dominance in dyadic conversations

✤ Looking-while-speaking: characteristic of dominant people

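A sketch of how the four looking/speaking modalities could be combined, assuming the looking and speaking segmentations have already been sampled onto a common per-frame grid (the function and key names are illustrative):

```python
import numpy as np

def multimodal_ratios(looking, speaking):
    """looking, speaking: boolean arrays with one value per frame."""
    looking, speaking = np.asarray(looking), np.asarray(speaking)
    l_and_s = np.count_nonzero(looking & speaking)     # looking while speaking
    l_and_ns = np.count_nonzero(looking & ~speaking)   # looking while silent
    nl_and_s = np.count_nonzero(~looking & speaking)
    nl_and_ns = np.count_nonzero(~looking & ~speaking)
    # Ratio discussed later in the slides: looking-while-speaking over
    # looking-while-not-speaking (high values suggest dominant behavior)
    ratio = l_and_s / max(l_and_ns, 1)
    return {"L&S": l_and_s, "L&NS": l_and_ns,
            "NL&S": nl_and_s, "NL&NS": nl_and_ns,
            "L&S/L&NS": ratio}
```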

Page 31: Multimedia Information Retrieval and User Behavior

Analysis: video editing elements

✤ Elements manually coded as a support to the core conversational part

✤ Snippets (opening, ending, intermediate), background music, objects brought toward the camera


Results:

✤ Snippets: 45% of vlogs (16%-20% with openings/endings, 32% with intermediate snippets)

✤ Videos without snippets are monologues

✤ Snippets tend to be a small fraction of the video content (~10%)

✤ Audio: 25% use a soundtrack on snippets, 12% use music over the entire video

✤ Objects: 26% of vloggers bring an object toward the camera

Page 32: Multimedia Information Retrieval and User Behavior

Analysis: non-verbal behavior

✤ Vloggers are mainly talking: 85% speak for more than half of the time

✤ Speaking segments tend to be short (hesitations and low fluency)

(Background: Fig. 8 from Biel and Gatica-Perez, “Selected nonverbal cue distributions for conversational shots in YouTube vlogs: four audio cues, three visual cues, and one multimodal.” The accompanying text reports median speaking time 0.65, median speech segment length 1.98 s, median looking time 0.68, and a median L&S/L&NS ratio of 2.25.)

Page 33: Multimedia Information Retrieval and User Behavior

Analysis: non-verbal behavior

✤ 50% of vloggers look at the camera over 90% of the time, at a “standard” distance from the camera (not too close, not too far), showing the upper body

✤ Vloggers look at the camera more frequently when they speak than when they are silent

✤ This is the behavior of dominant people


Page 34: Multimedia Information Retrieval and User Behavior

Social attention

✤ Social attention on YouTube is measured by considering the number of views received by a video

✤ The view count reflects the number of times the item has been accessed, resembling the way audiences are measured in traditional mainstream media

✤ There are other measures of popularity, but note that not everyone who accesses a video likes it!

✤ Popularity: borrowed from the Latin popularis in 1490, originally meaning “common”

Page 35: Multimedia Information Retrieval and User Behavior

Social attention

✤ Audio cues: vloggers who talk longer and faster and use fewer pauses receive more views from the audience

Page 36: Multimedia Information Retrieval and User Behavior

Social attention

✤ Visual cues:

✤ The time looking at the camera and the average duration of looking turns are positively correlated with attention

✤ Vloggers that are too close to the camera are penalized: the audience cannot perceive body language cues
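One plausible way (not necessarily the exact procedure used in the cited papers) to quantify such cue-attention associations is to correlate each cue with the logarithm of the view count, since raw view counts are heavily skewed; a minimal sketch with hypothetical data:

```python
import numpy as np
from scipy.stats import pearsonr

def cue_attention_correlation(cue_values, view_counts):
    """Correlation between one nonverbal cue and (log-scaled) social attention."""
    log_views = np.log1p(view_counts)  # compress the heavy-tailed view counts
    r, p_value = pearsonr(cue_values, log_views)
    return r, p_value

# Hypothetical example: looking-time ratios and view counts for five vlogs
r, p = cue_attention_correlation([0.9, 0.7, 0.95, 0.5, 0.8],
                                 [12000, 800, 25000, 300, 4000])
print(r, p)
```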

Page 37: Multimedia Information Retrieval and User Behavior

Future work (...or not?)

✤ Background analysis: does the background tell us something about the speaker?

Page 38: Multimedia Information Retrieval and User Behavior

Bibliography

Page 39: Multimedia Information Retrieval and User Behavior

Bibliography

✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘VlogSense: Conversational Behavior and Social Attention in YouTube’, ACM Transactions on Multimedia Computing, Communications and Applications, 2010

✤ Joan-Isaac Biel, Oya Aran, Daniel Gatica-Perez, ‘You Are Known by How You Vlog: Personality Impressions and Nonverbal Behavior in YouTube’, AAAI, 2011

✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘Voices of Vlogging’, AAAI, 2010

✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘Vlogcast Yourself: Nonverbal Behavior and Attention in Social Media’, ICMI-MLMI, 2010

Page 40: Multimedia Information Retrieval and User Behavior

Bibliography

✤ Joan-Isaac Biel, Daniel Gatica-Perez, ‘The Good, the Bad and the Angry: Analyzing Crowdsourced Impressions of Vloggers’, AAAI, 2012

✤ Hervé Jégou, ‘Very Large Scale Image/Video Search’, SSMS’12, Santorini

✤ Utkarsh, ‘SIFT: Scale Invariant Feature Transform’, http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/

✤ Wikipedia, ‘Bag of Words’ and ‘Visual Word’

✤ Wikipedia, ‘tf-idf’

✤ Wikipedia, ‘k-means clustering’

Page 41: Multimedia Information Retrieval and User Behavior

Bibliography

✤ Rong Yan, ‘Data mining and machine learning for large-scale social media’, SSMS’12, Santorini