Aligning Movies with Scripts by Exploiting Temporal Ordering Constraints

Iftekhar Naim, Abdullah Al Mamun, Young Chol Song, Jiebo Luo, Henry Kautz, Daniel Gildea
Department of Computer Science, University of Rochester, Rochester, NY

Abstract—Scripts provide rich textual annotation of movies, including dialogs, character names, and other situational descriptions. Exploiting such rich annotations requires aligning the sentences in the scripts with the corresponding video frames. Previous work on aligning movies with scripts predominantly relies on time-aligned closed-captions or subtitles, which are not always available. In this paper, we focus on automatically aligning faces in movies with their corresponding character names in scripts without requiring closed-captions/subtitles. We utilize the intuition that faces in a movie generally appear in the same sequential order as their names are mentioned in the script. We first apply standard techniques for face detection and tracking, and cluster similar face tracks together. Next, we apply a generative Hidden Markov Model (HMM) and a discriminative Latent Conditional Random Field (LCRF) to align the clusters of face tracks with the corresponding character names. Our alignment models (especially LCRF) significantly outperform the previous state-of-the-art on two different movie datasets and for a wide range of face clustering algorithms.

I. INTRODUCTION

Movie scripts provide rich textual descriptions of movie scenes and can be easily downloaded from the internet [1]. However, in order to exploit such a rich source of annotation, we need to know the alignment between the sentences in the scripts and their corresponding video frames. Aligning movies with scripts can be useful for movie search and retrieval [2], rapid scene browsing [3], and movie summarization [4]. Furthermore, it can help us acquire large-scale weakly-annotated video datasets for training supervised and semi-supervised computer vision models [5]. Manually segmenting movie scenes and aligning them with scripts is tedious, especially for large collections of movies, and therefore the need for automated alignment methods has become crucial. Existing methods for movie-to-script alignment predominantly rely on closed-captions or subtitles with precise time-stamps [1], [5], [3], [6]. Since there exists a strong correspondence between the scripts and the subtitles (as they both include the dialogs), aligning in the presence of closed-captions/subtitles is a significantly easier task, and has been shown to achieve high alignment accuracy. However, closed-captions/subtitles are not always available, especially for foreign language movies and TV shows [7]. In this paper, we focus on aligning movies with scripts in the absence of closed-captions/subtitles.

Existing methods for subtitle-free movie-to-script alignment usually apply unsupervised clustering techniques to automatically group similar face tracks in the movies, and match face clusters with character names by exploiting the correlation in their co-occurrence pattern [8], [9]. These methods are based on the following intuition: if a group of character names frequently appear together in the script, their faces also frequently appear together in the movie.

[Figure 1 diagram: the input movie passes through frontal face detection (detect frontal faces, extract corner points), KLT point tracking, and face track clustering, producing a sequence of face clusters; the input script produces a sequence of names; the two sequences are matched via temporal alignment with HMM and Latent CRF.]

Fig. 1. An overview of our pipeline for aligning face tracks in movies with character names in scripts.

However, these methods do not consider the temporal order in which the faces and names appear, and instead focus only on the aggregated co-occurrence statistics. We propose several methods for aligning faces in movies with their corresponding character names in scripts by exploiting the temporal ordering constraints between these two modalities.

Our work is similar to the state-of-the-art work by Zhang et al. [10], which also exploits temporal cues for aligning a sequence of face clusters extracted from a movie with the sequence of names extracted from the associated script. Their method divides each movie and script into an equal number of temporal bins, and aligns each face cluster with a name based on the symmetric Kullback-Leibler (KL) divergence [11] of their temporal distributions over those bins. The alignment accuracy of this method, however, is highly sensitive to the number of temporal bins, and it does not work well for less frequent movie characters. We extend the state-of-the-art by applying probabilistic latent variable models that align each movie segment with the corresponding script segment, and simultaneously align each face cluster within a movie segment with the corresponding character name in the script segment. While the previous methods focus on identifying the major characters only, we include all the face tracks in our evaluation.

The proposed alignment models are based on our prior work [12], [13], which aligns a small number of wetlab videos recorded in a controlled environment (12 videos, average duration of 5 minutes) with the corresponding natural language instructions in text protocols. In this study, we deal with movies, which have significantly longer duration and contain more diverse objects and actions.

The input to our system is a collection of movies, each paired with a script (Figure 1). We first detect frontal faces in each video frame, and track them using a Kanade-Lucas-Tomasi (KLT) point tracker [14]. Next, we apply face clustering to group similar faces together and obtain a sequence of face clusters from the movie. We experiment with a wide variety of clustering methods and feature spaces, including the state-of-the-art Convolutional Neural Network (CNN) based FaceNet embedding [15]. On the text side, we extract a sequence of character names from each script. Finally, we apply a generative Hidden Markov Model (HMM) and a discriminative Latent Variable Conditional Random Field (LCRF) to align the sequence of face clusters with the corresponding sequence of character names.

The primary contributions of this paper are as follows:

• We apply probabilistic latent variable alignment models, both generative (HMM) and discriminative (LCRF), for automatically aligning faces in movies with the character names in scripts without closed-captions or subtitles.

• We compare our methods with the state-of-the-art method for matching movie faces with names [10], which is significantly outperformed by our alignment methods (especially LCRF).

• We incorporate gender-based features in the LCRF model, and show the effectiveness of such features for improving face-to-name alignment.

• Finally, we experiment with several different face clustering methods (different features, distance metrics, and algorithms), and show comprehensive results regarding the impact of clustering accuracy on the accuracy of face-to-name alignment. We show that the convolutional neural network based FaceNet embedding and Landmark SIFT features provide the best alignment accuracies.

II. RELATED WORK

A. Closed-Caption or Subtitle-based Alignment

Previous papers on automatically associating faces in videos with the corresponding names in text primarily relied on closed-captions/subtitles with precise timestamps. Everingham et al. [1] aligned movie subtitles with scripts using dynamic time warping, and then used computer vision features to exploit similarities in faces and clothes. Cour et al. [3] proposed a generative model to jointly segment and parse a video into a hierarchy of shots and scenes, using both closed-captions and script. Ramanathan et al. [6] extended [3] by applying coreference resolution to resolve pronouns and other ambiguous mentions in the script. All these methods assume nearly perfect knowledge of the true alignment between script and movie, obtained through closed-captions/subtitles. However, movie-to-script alignment in the absence of closed-captions/subtitles remains an extremely challenging task, and has not been explored much.

B. Alignment without Closed-caption/Subtitles

A few previous papers attempted the challenging task of movie-to-script alignment without any subtitles/closed-captions. Sankar et al. [7] trained supervised classifiers to recognize the faces of all the major characters and applied the classifiers for aligning movies with scripts in the absence of subtitles. Training supervised classifiers, however, requires manually labeled face images for each of the main characters. Unsupervised methods for aligning movie faces with their names typically rely on face clustering and graph matching [8], [9]. These methods first cluster similar face tracks together and construct two separate weighted graphs, $G_{\text{faces}}$ and $G_{\text{names}}$, representing the co-occurrence structure among the face clusters and the script names respectively. The correspondences between the vertices of the two graphs are learned using bipartite graph matching. While these approaches consider the co-occurrence patterns between names and faces, they ignore the global temporal ordering constraints between movies and scripts.

In the literature, we found only two methods that considered the global temporal ordering constraints for movie-to-script alignment [16], [10]. Liang et al. [16] proposed a generative Hidden Semi-Markov model, TVParser, for automatically grouping movie shots into scenes and jointly aligning these scenes to their corresponding script segments. However, finer-grained segmentation and alignment between movie frames and script sentences were not considered. Zhang et al. [10] proposed a movie-to-script alignment model that explicitly considered the temporal ordering constraints between movie and script. They represented each input movie as a sequence of face clusters and the corresponding script as a sequence of character names. Each face cluster sequence and character name sequence was divided into an equal number of temporal bins, and a temporal distribution was estimated over these bins for each face cluster and character name. Finally, the name sequence was aligned with the corresponding face sequence based on the symmetric Kullback-Leibler (KL) divergence [11] between the temporal distributions of individual face clusters and character names. Zhang et al. used fixed-duration bins for estimating the temporal distributions for names and faces. Such arbitrary binning, however, is likely to introduce errors in alignment, especially for minor characters who do not appear frequently in the script.

We extend the state-of-the-art alignment approach [10] by applying rich probabilistic latent variable models, which explicitly learn a probability distribution over the alignments between names and faces using IBM Model 1 and a Hidden Markov Model (HMM). Furthermore, we apply a discriminative latent CRF model with many informative features.

C. Aligning Videos with Text

Recently, there has been a growing interest in automatically aligning videos with text documents without direct human supervision [12], [13], [17], [18]. Our algorithms are based on our prior work on aligning videos of biological experiments in wetlabs with the corresponding sentences in a text protocol [12], [13]. The wetlab videos were fairly short, contained activities by a single agent recorded in a fairly controlled environment (similar lighting and background), and the objects were distinguished by color. Movies and TV show episodes are usually more complex and diverse and tend to have longer duration. Furthermore, the wetlab videos were recorded via Kinect, which is not available for movies.

Fig. 2. Examples of faces detected in our dataset.

D. Face Clustering

A key bottleneck of our method is performing reasonably accurate face clustering, which aims to assign the faces of the same person to the same cluster and the faces of different people to different clusters. Face clustering is an extremely challenging task due to variations in viewpoint, pose, facial expression, illumination, scale, and occlusion (Figure 2). As a result, faces of the same person often have more appearance variation than the faces of different people.

Early works on face clustering [19], [20] focused on learning robust distance metrics invariant to translation, rotation, and other affine transformations. These methods did not exploit the inherent temporal constraints arising in videos: (1) faces that belong to the same track must be grouped in the same cluster (i.e., must-link) and (2) faces belonging to two different tracks that overlap in time must go to different clusters (i.e., cannot-link). Both of these constraints are exploited by more recent works [21], [22].

Over the last few years, several Convolutional Neural Network (CNN) based face recognition and verification algorithms have achieved near-human performance on standard benchmark datasets [23], [15]. Schroff et al. [15] proposed FaceNet, a CNN which embeds face images into a 128-dimensional Euclidean space such that the squared L2 distance between two faces in the embedding space directly corresponds to the dissimilarity between the faces. We use OpenFace [24], an open-source implementation of FaceNet.

The existing face clustering methods typically exclude all the faces belonging to the minor characters, and apply clustering to the faces of major characters only. However, removing the faces of minor characters requires knowledge of their identity, which is not usually available for real-world unsupervised learning tasks. We evaluate in a more realistic setting and consider all the face tracks in our experiments, without relying on their ground truth identities.

III. DATA PREPROCESSING PIPELINE

First, we detect frontal faces in every video frame, and use these faces to initialize a KLT point tracker, which extracts a collection of face tracks from the entire video. Next, we apply several different clustering methods to group similar face tracks together. Finally, we extract a sequence of face clusters in the order in which they appear in the movie and a sequence of character names in the order in which they appear in the script, and apply our alignment algorithms to align these two heterogeneous sequences.

A. Frontal Face Detection

We apply the standard Viola-Jones cascade face detector [25] to detect frontal faces in every video frame. To avoid false detections, we set a high merging threshold and apply a threshold on the minimum bounding box size of the face.
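For concreteness, here is a minimal sketch of this detection step, assuming OpenCV's stock Haar cascade. The paper does not report its exact threshold values, so MIN_NEIGHBORS and MIN_FACE_SIZE below are illustrative assumptions:

```python
# A sketch of the frontal face detection step using OpenCV's Viola-Jones
# cascade detector. Threshold values are illustrative, not the paper's.
import cv2

MIN_NEIGHBORS = 8         # high merging threshold to suppress false detections
MIN_FACE_SIZE = (40, 40)  # minimum bounding box size for a face

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_frontal_faces(frame):
    """Return a list of (x, y, w, h) face boxes for one video frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return detector.detectMultiScale(
        gray,
        scaleFactor=1.1,
        minNeighbors=MIN_NEIGHBORS,  # acts as the merging threshold
        minSize=MIN_FACE_SIZE)
```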

B. Face Tracking

We detect corner points within each of the detected frontal faces using the Minimum Eigenvalue algorithm [26], and track them using a KLT point tracker. We also keep track of the amount of overlap between the bounding box of the tracked feature points and any of the previously detected faces. If a detected face has more than 50% overlap with the bounding box containing the tracked feature points, we consider that face to be part of the existing track, and re-estimate the corner points from that face. Since tracks initiated from false positive face detections typically do not have enough temporal support, we filter out many spurious tracks by applying a threshold on the minimum track length (set to 20 in our experiments). This also allows us to filter out the face tracks detected in the title/cast segments of movies, because the title segments show each of the main characters for only a very short duration.
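The track-extension rule can be sketched as follows. Only the 50% overlap test and the minimum track length of 20 come from the text; the box representation and helper names are our assumptions:

```python
# A sketch of the track-extension rule: a newly detected face is absorbed into
# an existing track when its box overlaps the track's box by more than 50%.
MIN_TRACK_LENGTH = 20  # frames; shorter tracks are discarded as spurious

def overlap_fraction(face_box, track_box):
    """Fraction of the face box (x, y, w, h) covered by the track's box."""
    fx, fy, fw, fh = face_box
    tx, ty, tw, th = track_box
    ix = max(0, min(fx + fw, tx + tw) - max(fx, tx))
    iy = max(0, min(fy + fh, ty + th) - max(fy, ty))
    return (ix * iy) / float(fw * fh)

def assign_face_to_tracks(face_box, active_tracks):
    """Attach the detection to an overlapping track, else start a new one."""
    for track in active_tracks:
        if overlap_fraction(face_box, track["box"]) > 0.5:
            track["box"] = face_box   # re-estimate corner points from this face
            track["length"] += 1
            return track
    new_track = {"box": face_box, "length": 1}
    active_tracks.append(new_track)
    return new_track
```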

C. Face Track Clustering

We apply constrained spectral clustering to group face tracks belonging to the same character. A key challenge is to find a distance measure or a feature space invariant to undesired transformations (e.g., viewpoint, lighting, and pose variation). We experiment with standard features including SIFT, Sparse Coding, and pixel-level hue and saturation color features. Furthermore, we experiment with the state-of-the-art Convolutional Neural Network (CNN) face embedding system, FaceNet [15], using the pre-trained models from the open-source implementation by Amos et al. [24]. The distance measures we tried are listed below:

FaceNet Distance: Given two face images $I_1$ and $I_2$, we estimate two 128-dimensional feature vectors $v^{\text{FaceNet}}_{I_1}$ and $v^{\text{FaceNet}}_{I_2}$ for $I_1$ and $I_2$ respectively, by applying the pre-trained FaceNet models. The FaceNet distance $d_{\text{FaceNet}}(I_1, I_2)$ is:

$$d_{\text{FaceNet}}(I_1, I_2) = \|v^{\text{FaceNet}}_{I_1} - v^{\text{FaceNet}}_{I_2}\|_2^2 \qquad (1)$$

Landmark SIFT Distance: We extract SIFT features from 9 landmark points detected on the faces in each track [1]. Let $v^{\text{sift}}_I$ be the SIFT feature descriptor for the image $I$. The Landmark SIFT distance $d_{\text{sift}}(I_1, I_2)$ is estimated as:

$$d_{\text{sift}}(I_1, I_2) = \|v^{\text{sift}}_{I_1} - v^{\text{sift}}_{I_2}\|_2^2 \qquad (2)$$

Hue-Saturation Distance: The SIFT features are sensitive to illumination variations. Recent work has shown that the hue and saturation components in the HSV color space are relatively robust to illumination variation [27]. This motivated us to experiment with a distance measure $d_{\text{hs}}(I_1, I_2)$:

$$d_{\text{hs}}(I_1, I_2) = \|I^h_1 - I^h_2\|_2^2 + \|I^s_1 - I^s_2\|_2^2 \qquad (3)$$

Page 4: New Aligning Movies with Scripts by Exploiting Temporal Ordering …gildea/pubs/naim-icpr16.pdf · 2016. 9. 29. · Aligning Movies with Scripts by Exploiting Temporal Ordering Constraints

Mirrored Hue-Saturation Distance: To address the challenge of pose variation, we compute the distance between both the original image pair and its mirrored combinations, and choose the minimum distance (see the sketch after this list):

$$d_{\text{mir-hs}}(I_1, I_2) = \min\big\{ d_{\text{hs}}(I_1, I_2),\ d_{\text{hs}}(I^{\text{mir}}_1, I_2),\ d_{\text{hs}}(I_1, I^{\text{mir}}_2),\ d_{\text{hs}}(I^{\text{mir}}_1, I^{\text{mir}}_2) \big\} \qquad (4)$$

Low-dimensional Projection Distance: We also train a low-dimensional embedding space for face images using Sparse Coding [28], and estimate the distance $d_{\text{sc}}(I_1, I_2)$ between the sparse coefficient vectors.
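The following sketch illustrates the Mirrored Hue-Saturation distance of Eq. (4), assuming faces are cropped, resized to a common shape, and converted to HSV with OpenCV; the resizing step and function names are our assumptions:

```python
# A sketch of Eqs. (3) and (4): squared L2 distance over hue and saturation,
# minimized over all horizontal-mirror combinations of the two faces.
import cv2
import numpy as np

def hue_sat_distance(img1_hsv, img2_hsv):
    """Squared L2 distance over the hue and saturation channels (Eq. 3)."""
    d_hue = np.sum((img1_hsv[..., 0].astype(float) -
                    img2_hsv[..., 0].astype(float)) ** 2)
    d_sat = np.sum((img1_hsv[..., 1].astype(float) -
                    img2_hsv[..., 1].astype(float)) ** 2)
    return d_hue + d_sat

def mirrored_hue_sat_distance(img1_bgr, img2_bgr, size=(64, 64)):
    """Minimum hue/sat distance over all mirror combinations (Eq. 4)."""
    i1 = cv2.cvtColor(cv2.resize(img1_bgr, size), cv2.COLOR_BGR2HSV)
    i2 = cv2.cvtColor(cv2.resize(img2_bgr, size), cv2.COLOR_BGR2HSV)
    m1, m2 = i1[:, ::-1], i2[:, ::-1]  # horizontal flips
    return min(hue_sat_distance(a, b) for a in (i1, m1) for b in (i2, m2))
```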

We also incorporate the cannot-link and must-link constraints in the affinity matrix of spectral clustering. We noticed many cases where the faces of male and female characters were assigned to the same cluster. To avoid this obvious error, we add further cannot-link constraints indicating that two face tracks cannot link if they are detected as having different genders. To detect the gender of an individual track, we sample 5 face images from the track, detect the gender of each image using the SHORE framework [29], and take a majority vote. Due to the gender-based cannot-link constraints, the constrained spectral clustering performed well in discriminating between male and female faces, and each of the clusters consisted predominantly of either male or female faces, but not both. To address the variability due to head pose, we normalize each face via an affine transformation using the DLIB library (http://dlib.net/). In addition to the variants of spectral clustering, we also apply the HMRF clustering method [21] (code downloaded from https://sites.google.com/site/baoyuanwu2015/) and perform an extensive comparison.
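A sketch of how these constraints can be folded into the affinity matrix before spectral clustering follows. The RBF affinity and the scikit-learn call are our assumptions, as the paper does not specify them:

```python
# A sketch of constraint injection: the temporal cannot-link (co-occurring
# tracks) and the gender cannot-link both zero out the pairwise affinity.
import numpy as np
from sklearn.cluster import spectral_clustering

def constrained_affinity(dist, overlaps, genders, sigma=1.0):
    """dist: (n, n) pairwise face-track distances; overlaps[i][j] is True when
    tracks i and j overlap in time; genders: 'm', 'f', or None per track."""
    A = np.exp(-dist ** 2 / (2.0 * sigma ** 2))  # RBF affinity (assumed)
    n = dist.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j and (overlaps[i][j] or
                           (genders[i] and genders[j]
                            and genders[i] != genders[j])):
                A[i, j] = 0.0  # cannot-link: co-occurring or mixed-gender pair
    return A

# labels = spectral_clustering(constrained_affinity(dist, overlaps, genders),
#                              n_clusters=K)
```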

D. Extracting Character Names from Scripts

In our scripts, each name starts with a capital letter and is followed by a colon character ':'. We apply a simple text pattern search to extract all the character names from each script, and build a sequence of names in the order they appear in the script. Detailed parsing of the script sentences and dialogues [3], [6] is left as future work.
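A minimal sketch of this pattern search; the exact regular expression is our assumption:

```python
# A sketch of name extraction: a dialogue line starts with a capitalized name
# followed by a colon, e.g. "Sheldon: ...".
import re

NAME_PATTERN = re.compile(r"^([A-Z][A-Za-z. ]*?):")

def extract_name_sequence(script_lines):
    """Return character names in the order they appear in the script."""
    names = []
    for line in script_lines:
        match = NAME_PATTERN.match(line.strip())
        if match:
            names.append(match.group(1).strip())
    return names
```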

IV. ALIGNING CHARACTER NAME SEQUENCES WITH FACE CLUSTER SEQUENCES

The input to our system is a dataset containing $N$ pairs of observations $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ represents the $i$th script in our dataset and $\mathbf{y}_i$ represents the corresponding movie. We merge every two consecutive lines in a script to create a chunk and extract the character names mentioned in those lines. Each input script is represented as $\mathbf{x}_i = \{X_{i,1}, X_{i,2}, \ldots, X_{i,m_i}\}$, where $X_{i,m}$ is the set of character names mentioned in the $m$th script chunk of $\mathbf{x}_i$. We also divide each input movie into 2-second-long chunks, and extract the face clusters present in the frames of those chunks. Any movie chunk that does not overlap with any of the face tracks is ignored by our system. Each video is represented as $\mathbf{y}_i = \{Y_{i,1}, Y_{i,2}, \ldots, Y_{i,n_i}\}$, where $Y_{i,n}$ is the set of face clusters present in the $n$th movie chunk of $\mathbf{y}_i$. Let $m_i$ be the number of script chunks in $\mathbf{x}_i$ and $n_i$ the number of movie chunks in $\mathbf{y}_i$. We aim to align each movie chunk $Y_{i,n}$ to one of the script chunks $X_{i,m}$, and simultaneously align each face $y \in Y_{i,n}$ with the corresponding character name $x \in X_{i,m}$.

[Figure 3 diagram: a grid with the sets of face clusters in the movie, $Y_{i,1}, \ldots, Y_{i,n}, Y_{i,n+1}, \ldots, Y_{i,n_i}$, along one axis and the sets of character names in the script, $X_{i,1}, \ldots, X_{i,m}, X_{i,m+1}, \ldots, X_{i,m_i}$, along the other; the face clusters $y^{(1)}_{i,n}, y^{(2)}_{i,n}, \ldots, y^{(J)}_{i,n}$ in $Y_{i,n}$ are matched to the character names $x^{(1)}_{i,m}, x^{(2)}_{i,m}, \ldots, x^{(I)}_{i,m}$ in $X_{i,m}$.]

Fig. 3. An illustration of the proposed alignment models. We align each movie chunk to one of the script chunks. Furthermore, we align each face cluster in a movie chunk to a character name in the corresponding script chunk.
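A sketch of the chunking just described, reusing the hypothetical extract_name_sequence helper from the earlier name-extraction sketch; all data structures here are our assumptions:

```python
# A sketch of chunk construction: every two consecutive script lines form one
# script chunk X_{i,m} (a set of names), and every 2 seconds of video form one
# movie chunk Y_{i,n} (a set of face-cluster ids); empty movie chunks drop out.
def script_chunks(script_lines):
    """x_i: list of name sets, one per pair of consecutive script lines."""
    chunks = []
    for k in range(0, len(script_lines), 2):
        pair = script_lines[k:k + 2]
        chunks.append({name for line in pair
                       for name in extract_name_sequence([line])})
    return chunks

def movie_chunks(track_intervals, fps, num_frames, chunk_sec=2):
    """y_i: one face-cluster set per 2-second window; track_intervals holds
    (start_frame, end_frame, cluster_id) triples for every face track."""
    step = int(chunk_sec * fps)
    chunks = []
    for start in range(0, num_frames, step):
        clusters = {c for (t0, t1, c) in track_intervals
                    if t0 < start + step and t1 > start}
        if clusters:  # chunks overlapping no face track are ignored
            chunks.append(clusters)
    return chunks
```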

Let $\mathbf{h}_i$ be the latent variable representing the alignment between the script chunks in $\mathbf{x}_i$ and the movie chunks in $\mathbf{y}_i$. Formally, $h_{i,n} \in \{1, \ldots, m_i\}$ for $1 \le n \le n_i$, where $h_{i,n} = m$ indicates that the movie chunk $Y_{i,n}$ is aligned to the script chunk $X_{i,m}$. Let $V_X$ and $V_Y$ be the vocabularies of character names and face clusters respectively. Our models contain additional boolean latent variables representing the mapping of each face cluster $y \in V_Y$ to one of the character names $x \in V_X$. Our goal is to learn the overall alignment $\mathbf{h}_i$ between $\mathbf{x}_i$ and $\mathbf{y}_i$ and the global mapping variables between every face cluster and character name.

A. Hidden Markov Model (HMM)

We first apply a hierarchical generative model [12], which assumes that each movie chunk $Y_{i,n}$ is generated from one of the script chunks $X_{i,m}$ according to a Hidden Markov Model (HMM), and each face cluster $y \in Y_{i,n}$ is generated from one of the character names $x \in X_{i,m}$ according to IBM Model 1 (Figure 3). The model parameters include a matching probability table $T = \{p(y \mid x)\}$ for each $x \in V_X$ and $y \in V_Y$, representing the probability of observing the face cluster $y$ given the name $x$. The probability of generating a set of face clusters $Y_{i,n} = \{y^{(1)}_{i,n}, \ldots, y^{(J)}_{i,n}\}$ from the set of names $X_{i,m} = \{x^{(1)}_{i,m}, \ldots, x^{(L)}_{i,m}\}$ according to IBM Model 1 is:

$$P(Y_{i,n} \mid X_{i,m}) = \frac{\varepsilon}{L^J} \prod_{j=1}^{J} \sum_{l=1}^{L} p(y^{(j)}_{i,n} \mid x^{(l)}_{i,m}), \qquad (5)$$

which becomes the emission probability of our HMM at the alignment state $h_{i,n} = m$. Following the Markov assumption, the alignment state $h_{i,n} = m$ depends on the alignment state for the previous video segment, $h_{i,n-1} = m'$. The transition probability $P(h_{i,n} = m \mid h_{i,n-1} = m')$ is parameterized by the jump size $(m - m')$ between adjacent alignment points:

$$P(h_{i,n} = m \mid h_{i,n-1} = m') = c(m - m') \qquad (6)$$

where $c(k)$ represents the probability of a jump of distance $k$. We only allow monotonic transitions: if $h_{i,n} = m$, then $h_{i,n+1} \in \{m, m+1\}$.
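A sketch of Viterbi decoding under these constraints, combining the IBM Model 1 emission of Eq. (5) with the monotonic transitions of Eq. (6). We assume the alignment starts at the first script chunk; the EM learning of $T$ and $c$ is not shown:

```python
# A sketch of monotonic Viterbi alignment: state m at movie chunk n can only
# move to m or m+1 at chunk n+1. p_match[(y, x)] plays the role of T = {p(y|x)}.
import math

def emission_log_prob(Y_n, X_m, p_match, eps=1.0):
    """log P(Y_n | X_m) under IBM Model 1 (Eq. 5)."""
    L = max(len(X_m), 1)
    logp = math.log(eps) - len(Y_n) * math.log(L)
    for y in Y_n:
        logp += math.log(sum(p_match.get((y, x), 1e-6) for x in X_m) + 1e-12)
    return logp

def viterbi_align(x_chunks, y_chunks, p_match, log_c=(0.0, 0.0)):
    """Best monotonic alignment h with h[n] in {h[n-1], h[n-1]+1};
    log_c[k] is the log jump probability c(k) for a jump of size k."""
    M, N = len(x_chunks), len(y_chunks)
    NEG = float("-inf")
    score = [[NEG] * M for _ in range(N)]
    back = [[0] * M for _ in range(N)]
    score[0][0] = emission_log_prob(y_chunks[0], x_chunks[0], p_match)
    for n in range(1, N):
        for m in range(M):
            best, arg = NEG, 0
            for jump in (0, 1):          # monotonic: stay or advance by one
                prev = m - jump
                if 0 <= prev < M and score[n - 1][prev] > NEG:
                    s = score[n - 1][prev] + log_c[jump]
                    if s > best:
                        best, arg = s, prev
            if best > NEG:
                score[n][m] = best + emission_log_prob(
                    y_chunks[n], x_chunks[m], p_match)
                back[n][m] = arg
    m = max(range(M), key=lambda j: score[N - 1][j])
    h = [m]
    for n in range(N - 1, 0, -1):       # backtrace
        m = back[n][m]
        h.append(m)
    return list(reversed(h))            # h[n] = script chunk for movie chunk n
```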

TABLE I
The average accuracy (% of face tracks matched to the correct character names) of aligning face tracks with character names for the two datasets: TBBT and Friends. For each clustering method, we report the average accuracy (and standard deviation) for the best setting of K. The accuracy of Diagonal alignment is deterministic and does not depend on clustering.

Alignment Accuracy (%)
Dataset  System       FaceNet     Mir Hue/Sat  Hue/Sat     SIFT        SC          HMRF
TBBT     Diagonal     30.1 (0.0)  30.1 (0.0)   30.1 (0.0)  30.1 (0.0)  30.1 (0.0)  30.1 (0.0)
TBBT     KL-Div [10]  39.7 (0.7)  38.4 (2.4)   35.7 (2.6)  44.5 (1.1)  32.8 (0.0)  37.3 (2.2)
TBBT     HMM          54.7 (1.8)  46.2 (0.1)   42.9 (2.3)  64.6 (1.6)  46.5 (2.2)  54.3 (2.5)
TBBT     LCRF         55.9 (2.3)  50.1 (2.5)   50.3 (0.2)  65.1 (1.7)  49.6 (1.8)  56.5 (2.6)
Friends  Diagonal     23.0 (0.0)  23.0 (0.0)   23.0 (0.0)  23.0 (0.0)  23.0 (0.0)  23.0 (0.0)
Friends  KL-Div [10]  32.0 (1.9)  36.4 (0.0)   28.6 (0.0)  29.6 (1.2)  27.8 (0.6)  27.7 (1.5)
Friends  HMM          37.5 (2.1)  33.1 (0.0)   33.2 (0.0)  33.6 (0.0)  30.6 (2.2)  22.6 (1.6)
Friends  LCRF         43.4 (1.3)  41.5 (0.3)   40.8 (0.7)  37.8 (0.2)  38.8 (0.2)  27.8 (2.2)

The matching probability table $T = \{p(y \mid x)\}$ is initialized uniformly, i.e., we set $p(y \mid x) = 1/|V_Y|$ for all $y \in V_Y$ and $x \in V_X$. We also initialize the jump probabilities $c(k)$ uniformly. The matching probabilities $T$ and the jump probabilities $c$ are learned from the input dataset via the Expectation Maximization (EM) algorithm. The best alignment for $(\mathbf{x}_i, \mathbf{y}_i)$ is inferred using Viterbi-like dynamic programming.

B. Latent Conditional Random Field (LCRF)

Next, we apply a Latent Conditional Random Field (LCRF) [13] for aligning movies with scripts. We assume the script $\mathbf{x}_i$ is the observed input, and we aim to predict the movie $\mathbf{y}_i$. As in the HMM, the alignment $\mathbf{h}_i$ is treated as a latent variable. The feature function $\Phi(\mathbf{x}_i, \mathbf{y}_i, \mathbf{h}_i)$ maps the input observation $(\mathbf{x}_i, \mathbf{y}_i)$ and their latent alignment vector $\mathbf{h}_i$ to a $d$-dimensional feature vector. Our goal is to learn the weights $\mathbf{w} \in \mathbb{R}^d$ for these features.

Given the observed script $\mathbf{x}_i$ and a fixed video length $n_i$, the conditional probability of the output variable $\mathbf{y}_i$ is:

$$p(\mathbf{y}_i \mid \mathbf{x}_i, n_i) = \sum_{\mathbf{h}_i} p(\mathbf{y}_i, \mathbf{h}_i \mid \mathbf{x}_i, n_i) \qquad (7)$$

The conditional probability distribution $p(\mathbf{y}_i, \mathbf{h}_i \mid \mathbf{x}_i, n_i)$ is parameterized using a log-linear model:

$$p(\mathbf{y}_i, \mathbf{h}_i \mid \mathbf{x}_i, n_i) = \frac{\exp\, \mathbf{w}^T \Phi(\mathbf{x}_i, \mathbf{y}_i, \mathbf{h}_i)}{Z(\mathbf{x}_i, n_i)}, \qquad (8)$$

where $Z(\mathbf{x}_i, n_i) = \sum_{\mathbf{y}} \sum_{\mathbf{h}} \exp\, \mathbf{w}^T \Phi(\mathbf{x}_i, \mathbf{y}, \mathbf{h})$. The optimal values for the feature weights $\mathbf{w}$ are learned via stochastic gradient descent optimization.
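A toy numerical illustration of the log-linear form of Eq. (8): the real partition function sums over all videos and alignments, whereas this sketch enumerates three hypothetical candidates with made-up features and weights:

```python
# A toy log-linear model: each candidate (y, h) is scored by w . Phi,
# exponentiated, and normalized. All feature names and values are invented.
import math

w = {"cooc:sheldon/cluster3": 1.2, "jump:1": 0.4, "gender:match": 1.0}

candidates = {   # hypothetical feature counts Phi(x, y, h) per candidate
    "A": {"cooc:sheldon/cluster3": 2, "jump:1": 3, "gender:match": 2},
    "B": {"cooc:sheldon/cluster3": 1, "jump:1": 4},
    "C": {"jump:1": 5, "gender:match": 1},
}

scores = {c: math.exp(sum(w.get(f, 0.0) * v for f, v in phi.items()))
          for c, phi in candidates.items()}
Z = sum(scores.values())
probs = {c: s / Z for c, s in scores.items()}  # p(y, h | x, n) per candidate
```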

[Figure 4: two plots, (a) The Big Bang Theory and (b) Friends, showing alignment accuracy (%) against the number of clusters ($K$ from 5 to 30) for HMM and LCRF, each combined with FaceNet, MirHueSat, and SIFT features.]

Fig. 4. The impact of the number of clusters ($K$) on the alignment accuracies. The mid-range values $K = 10$ or $15$ appear to perform best.

We add a co-occurrence feature for every pair of name $x$ and face cluster $y$, and automatically learn the weight for matching $x$ with $y$. We also incorporate jump-size and diagonal-alignment features [13]. Finally, we incorporate features based on the gender of the face clusters and character names, to encourage matching a face cluster with a name of the same gender. The gender label for a face cluster is decided by a majority vote over the detected genders of all the face tracks in that cluster. The gender of each script character is determined using a publicly accessible web application, Genderize.io (https://genderize.io/), which takes a first name as input and returns its gender label (male or female), or a null string when the query name is not in its database. We add 4 gender-based features: $(\text{Name}_{male}, \text{Face}_{male})$, $(\text{Name}_{male}, \text{Face}_{female})$, $(\text{Name}_{female}, \text{Face}_{male})$, and $(\text{Name}_{female}, \text{Face}_{female})$. We initialize the feature weights for correct gender matchings (male name to male face and female name to female face) to a small positive value (+1.0), and for incorrect gender matchings to a small negative value (−1.0). The co-occurrence and jump-size features are initialized to the log-probabilities $P(y \mid x)$ estimated by 10 iterations of HMM.
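A sketch of the feature function $\Phi$ restricted to the three feature groups described above (co-occurrence, jump size, gender); the feature keys and gender encodings are our assumptions:

```python
# A sketch of Phi(x, y, h): feature counts over one aligned movie/script pair,
# to be multiplied by the learned weights w when scoring Eq. (8).
from collections import Counter

def alignment_features(x_chunks, y_chunks, h, face_gender, name_gender):
    """h[n] is the script chunk aligned to movie chunk n; the gender dicts map
    face clusters / names to 'm' or 'f' (absent when gender is unknown)."""
    phi = Counter()
    for n, Y_n in enumerate(y_chunks):
        X_m = x_chunks[h[n]]
        if n > 0:
            phi[("jump", h[n] - h[n - 1])] += 1   # jump-size feature
        for y in Y_n:
            for x in X_m:
                phi[("cooc", x, y)] += 1          # name/face co-occurrence
                gx, gy = name_gender.get(x), face_gender.get(y)
                if gx and gy:
                    phi[("gender", gx, gy)] += 1  # one of the 4 gender features
    return phi
```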

V. EXPERIMENTAL RESULTS

Datasets: We experiment with two datasets: (1) TBBT and (2) Friends. The TBBT dataset consists of the first 3 episodes of the sixth season of the TV show The Big Bang Theory, and the Friends dataset consists of the first 3 episodes of the first season of the TV show Friends. Each TBBT episode contains approximately 30,000 frames, whereas each Friends episode contains approximately 40,000 frames. The face detection module detected 20,819 faces in the 3 TBBT episodes and 30,515 faces in the 3 Friends episodes. We apply a KLT point tracker and extract 544 and 819 face tracks respectively.

Experiments: We cluster the detected face tracks and perform alignment using the HMM and LCRF methods. We apply the constrained spectral clustering algorithm with five different distance measures: (1) FaceNet distance, (2) Hue/Sat distance, (3) Mirrored Hue/Sat distance, (4) Sparse Coding (SC) distance, and (5) SIFT distance. Furthermore, we experimented with the state-of-the-art HMRF face clustering method. The number of clusters $K$ is a tunable parameter, and we experiment with 6 different values, $K \in \{5, 10, 15, 20, 25, 30\}$.

Spectral clustering may get trapped in local optima, as it employs non-convex K-means clustering. For each K-means clustering, we perform 10 repeated runs with random initializations, and choose the best of those 10 solutions. Furthermore, we run the spectral clustering algorithm 10 times (each with 10 K-means repetitions), and report the average alignment accuracy and standard deviation.

Evaluation: We evaluate the accuracy of alignment as the percentage of face tracks that are aligned to the correct character names. Since we divide the video into 2-second-long chunks, one track may appear in many different chunks, and can therefore potentially be aligned to different character names in the script. We assign each track the character name that it was aligned to most often. Table I reports the average alignment accuracy and standard deviation for the best choice of $K$ on the two datasets. We compare our HMM and LCRF models with the state-of-the-art KL-divergence based alignment [10] and a simple diagonal alignment baseline. For the KL-divergence baseline, we tried different numbers of temporal bins in the range 10 to 100, in increments of 10, and chose 20 as it provided the best accuracy. The diagonal alignment assumes that each script line aligns to an equal number of movie chunks.
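A sketch of this track-labeling and scoring rule; the input data structures are our assumptions:

```python
# A sketch of the evaluation: each track gets the name it was aligned to most
# often across chunks, and accuracy is the fraction of correctly named tracks.
from collections import Counter

def track_names(track_to_chunk_names):
    """track_to_chunk_names: {track_id: [name aligned in each chunk, ...]}."""
    return {track: Counter(names).most_common(1)[0][0]
            for track, names in track_to_chunk_names.items()}

def alignment_accuracy(predicted, ground_truth):
    """Percentage of tracks assigned the correct character name."""
    correct = sum(predicted[t] == ground_truth[t] for t in ground_truth)
    return 100.0 * correct / len(ground_truth)
```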

Discussion: Our HMM and LCRF methods significantly outperform the baseline methods (Table I). LCRF achieves the best accuracy for all the clustering methods on both datasets, presumably due to the effectiveness of the gender features.

The best alignment accuracy for TBBT is 65.1% (using SIFT features), which is significantly higher than that for Friends (43.4% using FaceNet). This is partly because the faces of the main characters in TBBT are quite distinct from each other, which is not the case for Friends. Furthermore, the videos in TBBT have relatively better lighting and fewer minor characters. For similar reasons, SIFT and HMRF worked well for TBBT, whereas the more advanced FaceNet embedding did better for Friends. The Mirrored Hue/Sat distance performed slightly better than the Hue/Sat distance, as it is robust to pose variation.

VI. CONCLUSION AND FUTURE WORK

We apply unsupervised alignment algorithms for automatically aligning faces in complex movie scenes with their corresponding character names in scripts. Our algorithms significantly outperform the previous state-of-the-art. As a future direction, we would like to perform joint alignment and clustering, and to incorporate knowledge-based features from online movie databases. The proposed methods can also be applied to aligning movies with the storybooks they are adapted from [30], [31], when subtitles are not available.

ACKNOWLEDGMENT

Funded by Intel ISTC-PC, NSF-1319378, NYS CoE DS,and NSF IIS-1446996.

REFERENCES

[1] M. Everingham, J. Sivic, and A. Zisserman, "'Hello! My name is... Buffy' – automatic naming of characters in TV video," in BMVC, vol. 2, no. 4, 2006, p. 6.

[2] A. Gaidon, M. Marszalek, and C. Schmid, "Mining visual actions from movies," in BMVC, 2009, pp. 125–1.

[3] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, "Movie/script: Alignment and parsing of video and text transcription," in ECCV, 2008, pp. 158–171.

[4] M. Aparício, P. Figueiredo, F. Raposo, D. M. de Matos, and R. Ribeiro, "Summarization of films and documentaries based on subtitles and scripts," CoRR, vol. abs/1506.01273, 2015.

[5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in IEEE CVPR, 2008, pp. 1–8.

[6] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei, "Linking people in videos with 'their' names using coreference resolution," in ECCV. Springer, 2014, pp. 95–110.

[7] P. Sankar, C. V. Jawahar, and A. Zisserman, "Subtitle-free movie to script alignment," in BMVC, 2009, pp. 121.1–121.11.

[8] Y. Zhang, C. Xu, H. Lu, and Y.-M. Huang, "Character identification in feature-length films using global face-name matching," IEEE Transactions on Multimedia, vol. 11, no. 7, pp. 1276–1288, Nov. 2009.

[9] J. Sang, C. Liang, C. Xu, and J. Cheng, "Robust movie character identification and the sensitivity analysis," in IEEE ICME, July 2011, pp. 1–6.

[10] Y. Zhang, Z. Tang, C. Zhang, J. Liu, and H. Lu, "Automatic face annotation in TV series by video/script alignment," Neurocomputing, vol. 152, pp. 316–321, 2015.

[11] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, pp. 79–86, 1951.

[12] I. Naim, Y. C. Song, Q. Liu, H. Kautz, J. Luo, and D. Gildea, "Unsupervised alignment of natural language instructions with video segments," in AAAI, 2014.

[13] I. Naim, Y. C. Song, Q. Liu, L. Huang, H. Kautz, J. Luo, and D. Gildea, "Discriminative unsupervised alignment of natural language instructions with corresponding video segments," in NAACL, 2015.

[14] C. Tomasi and T. Kanade, "Detection and tracking of point features," IJCV, 1991.

[15] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in IEEE CVPR, June 2015.

[16] C. Liang, C. Xu, J. Cheng, and H. Lu, "TV-Parser: An automatic TV video parsing method," in IEEE CVPR, 2011, pp. 3377–3384.

[17] J. Malmaud et al., "What's cookin'? Interpreting cooking videos using text, speech and vision," in NAACL HLT, 2015, pp. 143–152.

[18] P. Bojanowski et al., "Weakly-supervised alignment of video with text," in IEEE ICCV, 2015.

[19] A. Fitzgibbon and A. Zisserman, "On affine invariant clustering and automatic cast listing in movies," in ECCV, 2002, pp. 304–320.

[20] A. W. Fitzgibbon and A. Zisserman, "Joint manifold distance: a new approach to appearance based clustering," in IEEE CVPR, vol. 1, 2003, pp. 1–26.

[21] B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji, "Constrained clustering and its application to face clustering in videos," in IEEE CVPR, June 2013, pp. 3507–3514.

[22] X. Cao, C. Zhang, C. Zhou, H. Fu, and H. Foroosh, "Constrained multi-view video face clustering," IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4381–4393, Nov. 2015.

[23] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in IEEE CVPR, 2014, pp. 1701–1708.

[24] B. Amos et al., "OpenFace 0.1.1: Face recognition with Google's FaceNet deep neural network," October 2015.

[25] P. Viola and M. J. Jones, "Robust real-time face detection," IJCV, vol. 57, no. 2, pp. 137–154, 2004.

[26] J. Shi and C. Tomasi, "Good features to track," in IEEE CVPR, 1994, pp. 593–600.

[27] N. Vretos, V. Solachidis, and I. Pitas, "A mutual information based face clustering algorithm for movie content analysis," Image and Vision Computing, vol. 29, no. 10, pp. 693–705, 2011.

[28] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in NIPS, 2006, pp. 801–808.

[29] B. Fröba and A. Ernst, "Face detection with the modified census transform," in IEEE FG, 2004, pp. 91–96.

[30] M. Tapaswi, M. Bäuml, and R. Stiefelhagen, "Book2Movie: Aligning video scenes with book chapters," in IEEE CVPR, 2015, pp. 1827–1835.

[31] Y. Zhu et al., "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in IEEE ICCV, 2015.