
Learning realistic human actions from movies

Ivan Laptev    Marcin Marszałek    Cordelia Schmid    Benjamin Rozenfeld

INRIA Rennes, IRISA; INRIA Grenoble, LEAR - LJK; Bar-Ilan University

    Abstract

The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.

    1. Introduction

In the last decade the field of visual recognition had an outstanding evolution from classifying instances of toy objects towards recognizing the classes of objects and scenes in natural images. Much of this progress has been sparked by the creation of realistic image datasets as well as by the new, robust methods for image description and classification. We take inspiration from this progress and aim to transfer previous experience to the domain of video recognition and the recognition of human actions in particular.

Existing datasets for human action recognition (e.g. [15], see figure 8) provide samples for only a few action classes recorded in controlled and simplified settings. This stands in sharp contrast with the demands of real applications focused on natural video with human actions subjected to individual variations of people in expression, posture, motion and clothing; perspective effects and camera motions; illumination variations; occlusions and variation in scene surroundings. In this paper we address limitations of current datasets and collect realistic video samples with human actions as illustrated in figure 1. In particular, we consider the difficulty of manual video annotation and present a method for automatic annotation of human actions in movies based on script alignment and text classification (see section 2).

Figure 1. Realistic samples for three classes of human actions: kissing; answering a phone; getting out of a car. All samples have been automatically retrieved from script-aligned movies.

Action recognition from video shares common problems with object recognition in static images. Both tasks have to deal with significant intra-class variations, background clutter and occlusions. In the context of object recognition in static images, these problems are surprisingly well handled by a bag-of-features representation [17] combined with state-of-the-art machine learning techniques like Support Vector Machines. It remains, however, an open question whether and how these results generalize to the recognition of realistic human actions, e.g., in feature films or personal videos.
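To make the bag-of-features idea concrete, here is a minimal sketch of the generic pipeline referenced above: local descriptors are quantized against a visual vocabulary and accumulated into a normalized histogram. The use of k-means and the vocabulary size are illustrative assumptions, not details taken from this paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_vocabulary(descriptors, k=4000):
    """Cluster local descriptors (one row per descriptor) into k visual words."""
    centroids, _ = kmeans2(descriptors.astype(np.float64), k, minit='points')
    return centroids

def bag_of_features(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and return an
    L1-normalized histogram of word counts (the bag-of-features vector)."""
    words, _ = vq(descriptors.astype(np.float64), vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```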

Building on the recent experience with image classification, we employ spatio-temporal features and generalize spatial pyramids to the spatio-temporal domain. This allows us to extend the spatio-temporal bag-of-features representation with weak geometry, and to apply kernel-based learning techniques (cf. section 3). We validate our approach on a standard benchmark [15] and show that it outperforms the state-of-the-art. We next turn to the problem of action classification in realistic videos and show promising results for eight very challenging action classes in movies. Finally, we present and evaluate a fully automatic setup with action learning and classification obtained for an automatically labeled training set.
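The exact kernel and channel weighting used by the authors are defined in their section 3, which is not part of this excerpt. As a hedged illustration, the sketch below shows one common multi-channel chi-square kernel over per-channel histograms (one channel per space-time grid); the normalization by the mean channel distance is an assumed choice, not taken from the paper.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two L1-normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(channels_a, channels_b, mean_dists):
    """K(a, b) = exp(-sum_c D_c(a, b) / A_c), where D_c is the chi-square
    distance for channel c and A_c is the mean distance of that channel
    over the training set (an assumed normalization)."""
    total = 0.0
    for (ha, hb), a_c in zip(zip(channels_a, channels_b), mean_dists):
        total += chi2_distance(ha, hb) / a_c
    return np.exp(-total)
```

A Gram matrix built this way can be passed to a standard SVM as a precomputed kernel (e.g. kernel='precomputed' in scikit-learn's SVC).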

    1.1. Related work

Our script-based annotation of human actions is similar in spirit to several recent papers using textual information for automatic image collection from the web [10, 14] and automatic naming of characters in images [1] and videos [4]. In contrast to this work, we use more sophisticated text classification tools to overcome action variability in text. Similar to ours, several recent methods explore bag-of-features representations for action recognition [3, 6, 13, 15, 19], but only address human actions in controlled and simplified settings. Recognition and localization of actions in movies has been recently addressed in [8] for a limited dataset, i.e., manual annotation of two action classes. Here we present a framework that scales to automatic annotation for tens or more visual action classes. Our approach to video classification borrows inspiration from image recognition methods [2, 9, 12, 20] and extends spatial pyramids [9] to space-time pyramids.

    2. Automatic annotation of human actions

This section describes an automatic procedure for collecting annotated video data for human actions from movies. Movies contain a rich variety and a large number of realistic human actions. Common action classes such as kissing, answering a phone and getting out of a car (see figure 1), however, often appear only a few times per movie. To obtain a sufficient number of action samples from movies for visual training, it is necessary to annotate tens or hundreds of hours of video, which is a hard task to perform manually.

To avoid the difficulty of manual annotation, we make use of movie scripts (or simply scripts). Scripts are publicly available for hundreds of popular movies1 and provide text descriptions of the movie content in terms of scenes, characters, transcribed dialogs and human actions. Scripts as a means for video annotation have been previously used

1 We obtained hundreds of movie scripts from www.dailyscript.com, www.movie-page.com and www.weeklyscript.com.

for the automatic naming of characters in videos by Everingham et al. [4]. Here we extend this idea and apply text-based script search to automatically collect video samples for human actions.

Automatic annotation of human actions from scripts, however, is associated with several problems. Firstly, scripts usually come without time information and have to be aligned with the video. Secondly, actions described in scripts do not always correspond with the actions in movies. Finally, action retrieval has to cope with the substantial variability of action expressions in text. In this section we address these problems in subsections 2.1 and 2.2 and use the proposed solution to automatically collect annotated video samples with human actions, see subsection 2.3. The resulting dataset is used to train and to evaluate a visual action classifier later in section 4.

    2.1. Alignment of actions in scripts and video

Movie scripts are typically available in plain text format and share a similar structure. We use line indentation as a simple feature to parse scripts into monologues, character names and scene descriptions (see figure 2, right). To align scripts with the video we follow [4] and use time information available in movie subtitles that we separately download from the Web. Similar to [4], we first align speech sections in scripts and subtitles using word matching and dynamic programming. We then transfer time information from subtitles to scripts and infer time intervals for scene descriptions as illustrated in figure 2. Video clips used for action training and classification in this paper are defined by time intervals of scene descriptions and, hence, may contain multiple actions and non-action episodes. To indicate a possible misalignment due to mismatches between scripts and subtitles, we associate each scene description with an alignment score a. The a-score is computed as the ratio of matched words in the nearby monologues: a = (#matched words)/(#all words).
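As a minimal sketch of the a-score described above (the tokenization and the exact matching policy are assumptions on my part; the paper only states a = (#matched words)/(#all words)):

```python
import re

def alignment_score(monologue_text, subtitle_text):
    """a = (#matched words) / (#all words): the fraction of words in the
    nearby script monologue that also occur in the matched subtitle section."""
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    words = tokenize(monologue_text)
    subtitle_words = set(tokenize(subtitle_text))
    if not words:
        return 0.0
    return sum(1 for w in words if w in subtitle_words) / len(words)
```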

Temporal misalignment may result from the discrepancy between subtitles and scripts. Perfect subtitle alignment (a = 1), however, does not yet guarantee correct action annotation in video due to the possible discrepancy between scripts and movies.

Subtitles:
1172  01:20:17,240 --> 01:20:20,437  Why weren't you honest with me? Why'd you keep your marriage a secret?
1173  01:20:20,640 --> 01:20:23,598  It wasn't my secret, Richard. Victor wanted it that way.
1174  01:20:23,800 --> 01:20:26,189  Not even our closest friends knew about our marriage.

Movie script:
RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
Rick sits down with Ilsa.
ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.

Estimated scene-description interval: 01:20:17 - 01:20:23

Figure 2. Example of matching speech sections (green) in subtitles and scripts. Time information (blue) from adjacent speech sections is used to estimate time intervals of scene descriptions (yellow).
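The subtitle half of figure 2 follows the standard SubRip (.srt) layout: a numeric index, a time range, then text lines. A minimal parser for that layout (assuming well-formed input such as the excerpt above) could look as follows; it yields the time intervals that are later transferred to scene descriptions.

```python
import re

# Matches time ranges such as "01:20:17,240 --> 01:20:20,437".
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_srt(text):
    """Yield (start_seconds, end_seconds, text) for each subtitle entry."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        for i, line in enumerate(lines):
            m = TIME_RE.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000.0
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000.0
                yield start, end, " ".join(lines[i + 1:])
                break
```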

[Figure 3 plot: precision of retrieved action samples versus number of samples, with curves for a = 1.0 and a >= 0.5; panel title "Evaluation of retrieved actions on visual ground truth". Example retrieved sample: "[1:13:41 - 1:13:45] A black car pulls up. Two army officers get out."]

Figure 3. Evaluation of script-based action annotation. Left: Precision of action annotation evaluated on visual ground truth. Right