
Unsupervised Learning using Sequential Verification for Action Recognition

Ishan Misra*   C. Lawrence Zitnick†   Martial Hebert*

*The Robotics Institute, Carnegie Mellon University
†Facebook AI Research

Abstract. In this paper, we consider the problem of learning a visual representation from the raw spatiotemporal signals in videos for use in action recognition. Our representation is learned without supervision from semantic labels. We formulate it as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful unsupervised representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. Our method can also be combined with supervised representations to provide an additional boost in accuracy for action recognition. Finally, to quantify its sensitivity to human pose, we show results for human pose estimation on the FLIC dataset that are competitive with approaches using significantly more supervised training data.

Keywords: Unsupervised learning; Action Recognition; Convolutional Neural Networks; Sequence Verification

1 Introduction

Sequential data provides an abundant source of information in the form of auditory and visual percepts. Learning from the observation of sequential data is a natural and implicit process for humans [1–3]. It informs both low-level cognitive tasks and high-level abilities like decision making and problem solving [4]. For instance, answering the question “Where would the moving ball go?” requires the development of basic cognitive abilities like prediction from visual sequential data, or video [5].

In this paper, we explore the power of spatiotemporal signals, i.e., videos, in the context of computer vision and action recognition. To study the information available in a video signal in isolation, we ask the question: How does an agent learn from the spatiotemporal structure present in video without using supervised semantic labels? Are the representations learned using the unsupervised spatiotemporal information present in videos meaningful? And finally, are these representations complementary to those learned from strongly supervised image data? In this paper, we explore such questions by using a sequential learning approach.

Sequential learning is used in a variety of areas, such as speech recognition, robotic path planning, and adaptive control. These approaches can be broadly categorized [6] into two classes: prediction and verification. In sequential prediction, the goal is to predict the signal given an input sequence. A popular application of this in Natural Language Processing (NLP) is ‘word2vec’ by Mikolov et al. [7, 8], which learns distributional representations [9]. Using the continuous bag-of-words (CBOW) task, the model learns to predict a missing word given a sequence of surrounding words. The representation that results from this task has been shown to be semantically meaningful [7]. Unfortunately, extending the same technique to predict video frames is challenging. Unlike words that can be represented using limited-sized vocabularies, the space of possible video frames is extremely large [10]; e.g., predicting pixels in a small 256 × 256 image leads to 256^(2×3×256) hypotheses! To avoid this complex task of predicting high-dimensional video frames, we use sequential verification.

In sequential verification, one predicts the ‘validity’ of the sequence, rather than individual items in the sequence. In this paper, we explore the task of determining whether a given sequence is ‘temporally valid’, i.e., whether a sequence of video frames is in the correct temporal order (Figure 1). We demonstrate that this binary classification problem is capable of learning useful visual representations from videos. Specifically, we explore their use in the well understood tasks of human action recognition and pose estimation. But why are these simple sequential verification tasks useful for learning? Determining the validity of a sequence requires reasoning about object transformations and relative locations through time. This in turn forces the representation to learn about object appearances and deformations.

We use a Convolutional Neural Network (CNN) [11] for our underlying feature representation. The CNN is applied to each frame in the sequence and trained “end-to-end” from random initialization. The sequence verification task encourages the CNN features to be both visually and temporally grounded. We demonstrate the effectiveness of our unsupervised method on two benchmark action recognition datasets, UCF101 [12] and HMDB51 [13], and the FLIC [14] pose estimation dataset. Using our simple unsupervised learning approach for pre-training, we show a significant boost in accuracy over learning CNNs from scratch with random initialization. In fact, our unsupervised approach even outperforms pre-training with some supervised training datasets. When combined with existing supervised image-based representations, our unsupervised representation learning method can further improve performance in action recognition. Since humans with changing poses are commonly the focus of action videos, our approach learns a representation sensitive to human pose. Remarkably, when applied to pose estimation, our representation is competitive with pre-training on significantly larger supervised training datasets [15].



Fig. 1: (a) A video imposes a natural temporal structure for visual features. In many cases, one can easily verify whether frames are in the correct temporal order or not. Such a simple sequential verification task captures important spatiotemporal signals in videos. We use this task for unsupervised pre-training of a Convolutional Neural Network (CNN). (b) Some examples of the positive and negative sequence tuples we automatically extract from videos and use to formulate a classification task for a CNN.

2 Related Work

Action recognition is a well established task in the domain of video recognition (see [16] for a survey). Techniques in this domain use appearance and motion information in order to reason about non-rigid human deformations in actions. Typical features like STIP [17], HOG3D [18], HOF [19], and MBH [20] capture spatiotemporal features in a temporal grid-like structure. Dense local trajectory descriptors [21–24] move beyond such sparse grid-like pooling and show great improvement in action recognition. The insights from these methods have helped advance learned representations [25–32]. These methods learn representations for action recognition directly from the video using end-to-end learning. Karpathy et al. [27] explore a variety of CNN architectures and fusion methods for learning from weakly labeled video. Simonyan et al. [33] show the benefit of combining both RGB appearance and traditional optical flow with CNNs for action recognition by training two separate networks for each type of input. We use their RGB network for action recognition in our experiments.

Our work uses unlabeled video sequences for learning representations. Since this source of supervision is ‘free’, our work can be viewed as a form of unsupervised learning. Unsupervised representation learning from single images is a popular area of research in computer vision. A significant body of unsupervised learning literature uses hand-crafted features and clustering based approaches to discover objects [34–36] or mid-level elements [37–41]. Deep learning methods like auto-encoders [42–44], Deep Boltzmann Machines [45], variational methods [46, 47], stacked auto-encoders [48, 49], and others [50] learn representations directly from images. These methods learn a representation by estimating latent parameters that help reconstruct the data, and may regularize the learning process with priors such as sparsity [42]. Techniques in [10, 51] scale unsupervised learning to large image datasets, showing its usefulness for tasks such as pedestrian detection [51] and object detection [10]. In terms of using ‘context’ for learning, our work is most similar to [10], which uses the spatial context in images. While these approaches are unsupervised, they do not use videos and cannot exploit the temporal structure in them.

Our work is most related to work on unsupervised learning from videos [52–55]. Traditional methods in this domain utilize spatiotemporal continuity as regularization for the learning process. Since visual appearance changes smoothly in videos, a common constraint is enforcing temporal smoothness of features [54–58]. Zhang et al. [59], in particular, show how such constraints are useful for action recognition. Moving beyond just temporal smoothness, [53] enforces additional ‘steadiness’ constraints on the features so that the change of features across frames is meaningful. Our work, in contrast, does not explicitly impose any regularization on the features. Other reconstruction-based learning approaches include that of Goroshin et al. [58], who use a generative model to predict video frames, and Srivastava et al. [60], who use LSTMs [61]. Unlike our method, these works [26, 54, 58, 60] explicitly predict individual frames, but do not explore large image sizes or datasets. [62, 63] also consider the task of predicting the future from videos, but treat it as their end task and do not use it for unsupervised pre-training.

Several recent papers [52, 62, 64] use egomotion constraints from video to further constrain the learning. Jayaraman et al. [52] show how they can learn equivariant transforms from such constraints. Similar to our work, they use full video frames for learning with little pre-processing. Another line of work [65, 66] uses video data to mine patches which belong to the same object to learn representations useful for distinguishing objects. Typically, these approaches require significant pre-processing to create this task. While our work also uses videos, we explore them in the spirit of sequence verification for action recognition, which learns from the raw video with very little pre-processing.

3 Our Approach

Our goal is to learn a feature representation using only the raw spatiotemporal signal naturally available in videos. We learn this representation using a sequential verification task and focus on videos with human actions. Specifically, as shown in Figure 1, we extract a tuple of frames from a video and ask whether the frames are in the correct temporal order. In this section, we begin by motivating our use of sequential tasks and how they use the temporal structure of videos. We then describe how positive and negative tuples are sampled from videos, and describe our model.

3.1 Task motivation

When using only raw videos as input, sequential verification tasks offer a promising approach to unsupervised learning. In addition to our approach described below, several alternative tasks are explored in Section 5.3. The goal of these tasks is to encourage the model to reason about the motion and appearance of the objects, and thus learn the temporal structure of videos.



Fig. 2: (a) We sample tuples of frames from high motion temporal windows in a video. We form positive and negative tuples based on whether the three input frames are in the correct temporal order. (b) Our triplet Siamese network architecture has three parallel network stacks with shared weights up to the fc7 layer. Each stack takes a frame as input and produces a representation at the fc7 layer. The concatenated fc7 representations are used to predict whether the input tuple is in the correct temporal order.

Example tasks may include reasoning about the ordering of frames, or determining the relative temporal proximity of frames. For tasks that ask for the verification of temporal order, how many frames are needed to determine a correct answer? If we want to determine the correct order from just two frames, the question may be ambiguous in cases where cyclical motion is present. For example, consider a short video sequence of a person picking up a coffee cup. Given two frames, the temporal order is ambiguous; the person may be picking the coffee cup up, or placing it down.

To reduce such ambiguity, we propose sampling a three-frame tuple and asking whether the tuple’s frames are correctly ordered. While, theoretically, three frames are not sufficient to resolve cyclical ambiguity [67], we found that combining this with smart sampling (Section 3.2) removes a significant portion of ambiguous cases. We now formalize this problem into a classification task. Consider the set of frames {f1, . . . , fn} from an unlabeled video V. We consider the tuple (fb, fc, fd) to be in the correct temporal order (class 1, positive tuple) if the frames obey either ordering b < c < d or d < c < b, to account for the directional ambiguity in video clips. Otherwise, if b < d < c or c < b < d, we say that the frames are not in the correct temporal order (class 0, negative tuple).
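
For illustration, a minimal Python sketch of this labeling rule (the function name is ours, not from any released code):

def tuple_label(b, c, d):
    """Label a frame-index triple for the sequence verification task.

    Returns 1 (positive) if the indices are in a temporally correct order,
    i.e. b < c < d or d < c < b, and 0 (negative) otherwise.
    """
    if b < c < d or d < c < b:
        return 1
    return 0

# Examples: tuple_label(3, 7, 12) -> 1, tuple_label(12, 7, 3) -> 1, tuple_label(3, 12, 7) -> 0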

3.2 Tuple sampling

A critical challenge when training a network on the three-tuple ordering task is how to sample positive and negative training instances. A naive method may sample the tuples uniformly from a video. However, in temporal windows with very little motion it is hard to distinguish between a positive and a negative tuple, resulting in many ambiguous training examples. Instead, we only sample tuples from temporal windows with high motion. As Figure 2 shows, we use coarse frame-level optical flow [68] as a proxy to measure the motion between frames. We treat the flow magnitude of all video frames as a 1-dimensional probability distribution and use it to bias our sampling towards high-motion windows. This ensures that the classification of the tuples is not ambiguous. Figure 1 (b) shows examples of such tuples.
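
A minimal sketch of this motion-biased window selection, assuming OpenCV's Farnebäck flow [68] as the coarse flow estimate; the window length and function names are illustrative, not taken from the released code:

import cv2
import numpy as np

def frame_flow_magnitudes(frames):
    """Average optical-flow magnitude per frame transition (Farneback flow)."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2).mean())
        prev = curr
    return np.array(mags)

def sample_window_start(mags, window=30, rng=np.random):
    """Sample a window start index, biased towards high-motion regions."""
    probs = mags / mags.sum()            # flow magnitudes as a 1-D distribution
    center = rng.choice(len(mags), p=probs)
    return max(0, min(center - window // 2, len(mags) - window))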

To create positive and negative tuples, we sample five frames (fa, fb, fc, fd, fe) from a temporal window such that a < b < c < d < e (see Figure 2 (a)). Positive instances are created using (fb, fc, fd), while negative instances are created using (fb, fa, fd) and (fb, fe, fd). Additional training examples are also created by inverting the order of all training instances, e.g., (fd, fc, fb) is positive. During training, it is critical to use the same beginning frame fb and ending frame fd while only changing the middle frame for both positive and negative examples. Since only the middle frame changes between training examples, the network is encouraged to focus on this signal to learn the subtle difference between positives and negatives, rather than irrelevant features.

To avoid sampling ambiguous negative frames fa and fe, we enforce that the appearance of the positive fc frame is not too similar to fa or fe. We found that with these simple conditions we were able to eliminate most ambiguous examples. We provide further analysis of the various design decisions for sampling data in Section 4.1.
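
A minimal sketch of the tuple construction described above, assuming the five sampled indices a < b < c < d < e are given; the appearance check uses a simple SSD threshold as a stand-in, since the exact similarity criterion is not specified here:

import numpy as np

def make_tuples(frames, idx, ssd_thresh=1e4):
    """Build positive/negative tuples from five frame indices a < b < c < d < e
    sampled inside a high-motion window."""
    a, b, c, d, e = idx

    def ssd(i, j):
        return np.sum((frames[i].astype(np.float32) - frames[j].astype(np.float32)) ** 2)

    # Reject the sample if the negatives' middle frames look too similar to f_c
    # (illustrative criterion; the paper only states "not too similar").
    if ssd(c, a) < ssd_thresh or ssd(c, e) < ssd_thresh:
        return [], []

    positives = [(b, c, d), (d, c, b)]                      # forward and inverted order
    negatives = [(b, a, d), (b, e, d), (d, a, b), (d, e, b)]
    return positives, negatives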

3.3 Model Parametrization and Learning

To learn a feature representation from the tuple ordering task, we use a simple triple-Siamese network. This network has three parallel stacks of layers with shared parameters (Figure 2). The network stacks follow the standard CaffeNet [69] (a slight modification of AlexNet [70]) architecture from the conv1 to the fc7 layer. Each stack takes as input one of the frames from the tuple and produces a representation at the fc7 layer. The three fc7 outputs are concatenated as input to a linear classification layer. The classification layer can reason about all three frames at once and predict whether they are in order or not (two-class classification). Since the layers from conv1 to fc7 are shared across the network stacks, the Siamese architecture has the same number of parameters as AlexNet barring the final fc8 layer. We update the parameters of the network by minimizing the regularized cross-entropy loss of the predictions on each tuple and train it in an end-to-end fashion. While this network takes three inputs at training time, during testing we can obtain the conv1 to fc7 representations of a single input frame by using just one stack, as the parameters across the three stacks are shared.
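
The following PyTorch sketch illustrates this triplet-Siamese layout, using torchvision's AlexNet as a stand-in for CaffeNet (the class name and layer grouping are ours, not the authors' implementation):

import torch
import torch.nn as nn
import torchvision

class TripleSiamese(nn.Module):
    """One AlexNet-style stack shared across the three input frames; the three
    fc7 outputs are concatenated and fed to a 2-way linear classifier (fc8)."""
    def __init__(self):
        super().__init__()
        base = torchvision.models.alexnet(weights=None)   # stand-in for CaffeNet
        self.features = base.features                     # conv1-conv5 + pooling
        self.avgpool = base.avgpool
        self.fc67 = nn.Sequential(*list(base.classifier.children())[:-1])  # up to fc7
        self.fc8 = nn.Linear(3 * 4096, 2)                 # in-order vs. not-in-order

    def embed(self, x):
        x = self.avgpool(self.features(x))
        return self.fc67(torch.flatten(x, 1))             # shared fc7 representation

    def forward(self, f1, f2, f3):
        z = torch.cat([self.embed(f1), self.embed(f2), self.embed(f3)], dim=1)
        return self.fc8(z)

# Training sketch: logits = model(f1, f2, f3); loss = nn.CrossEntropyLoss()(logits, labels)
# At test time, a single stack (embed) yields the conv1-fc7 features of one frame.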


4 Empirical ablation analysis

In this section, we present experiments to analyze the various design decisions made during the unsupervised training of our network. In Sections 5 and 6, we provide results on both action recognition and pose estimation.

Dataset: We show all our results using split 1 of the benchmark UCF101 [12] dataset. This dataset contains videos for 101 action categories, with ∼9.5k videos for training and ∼3.5k videos for testing. Each video has an associated action category label. The standard performance metric for action recognition on this dataset is classification accuracy.

Details for unsupervised pre-training: For unsupervised pre-training, we do not use the semantic action labels. We sample about 900k tuples from the UCF101 training videos. We randomly initialize our network and train for 80k iterations with a fixed learning rate of 10^-3 and a mini-batch size of 128. Batch normalization [71] is used to accelerate the training. We use standard per-image data augmentation techniques: cropping, RGB jittering and horizontal flipping (as described in [72], and later). A tuple is considered to be a positive example (class 1) if either b < c < d or b > c > d; otherwise it is considered to be negative (class 0). We use the triplet-Siamese network described in Section 3.3.

Details for action recognition: The spatial network from [33] is a well-established method of action recognition that uses only RGB appearance information. The parameters of the spatial network are initialized with our unsupervised pre-trained network. We use the provided action labels per video and follow the training and testing protocol suggested in [33, 72]. Briefly, for training we form mini-batches by sampling random frames from videos. At test time, 25 frames are uniformly sampled from each video. Each frame is used to generate 10 inputs after fixed cropping and flipping (5 crops × 2 flips), and the prediction for the video is an average of the predictions across these 25 × 10 inputs. The input frames are resized such that the shortest side is 256 pixels. We use the CaffeNet (similar to AlexNet) architecture for its speed and efficiency. We initialize the network parameters up to the fc7 layer using the parameters from the unsupervised pre-trained network, and initialize a new fc8 layer for the action recognition task. We finetune the network following [33] for 20k iterations with a batch size of 256 and a learning rate of 10^-2 decaying by a factor of 10 after 14k iterations, using SGD with momentum 0.9 and dropout 0.5. While [33] used the wider VGG-M-2048 [73] architecture, we found that their parameters transfer to CaffeNet because of the similarities in their architectures.
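
A minimal sketch of this test-time averaging, assuming a hypothetical ten_crop helper that returns the 5 fixed crops and their horizontal flips, and a predict_probs callable wrapping the finetuned network:

import numpy as np

def predict_video(frames, predict_probs, ten_crop, n_frames=25):
    """Average class probabilities over 25 uniformly sampled frames x 10 crops.

    predict_probs(crop) is assumed to return a vector of class probabilities
    for a single (cropped, possibly flipped) input."""
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)
    probs = []
    for i in idx:
        for crop in ten_crop(frames[i]):        # 5 fixed crops x 2 horizontal flips
            probs.append(predict_probs(crop))
    return np.mean(probs, axis=0).argmax()      # video-level prediction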

4.1 Sampling of data

In this section we study the impact of the sampling parameters described in Section 3.2 on the unsupervised pre-training task. We denote the maximum distance between frames of positive tuples by τmax = |b − d|. This parameter controls the ‘difficulty’ of positives: a very high value makes it difficult to see correspondence across the positive tuple, and a very low value gives almost identical frames and thus very easy positives. Similarly, we compute the minimum distance between the frames fa and fe used for negative tuples and the other frames by τmin = min(|a − b|, |d − e|). This parameter controls the difficulty of negatives, with a low value making them harder and a high value making them easier.

Table 1: We study the effect of our design choices, such as temporal sampling parameters and varying class ratios, for unsupervised pre-training. We train our network by varying these parameters and measure the tuple prediction accuracy on a held out set from UCF101. We also show action classification results after finetuning the models on the UCF101 action recognition task (split 1).

(a) Varying temporal sampling

τmax   τmin   Tuple Pred.   Action Recog.
 30     15       60.2           47.2
 60     15       72.1           50.9
 60     60       64.3           49.1

(b) Varying class ratios

Class Ratio (Neg / Pos)   Tuple Pred.   Action Recog.
0.50 / 0.50                  52.1           38.1
0.65 / 0.35                  68.5           45.5
0.75 / 0.25                  72.1           50.9
0.85 / 0.15                  67.7           48.6

We compute the training and testing accuracy of these networks on the tuple prediction task on held out videos. This held out set is a union of samples using all the temporal sampling parameters. We show results in Table 1 (a). We also use these networks for finetuning on the UCF101 action recognition task. Our results show that the tuple prediction accuracy and the performance on the action recognition task are correlated. A large temporal window for positive sampling improves over a smaller temporal window (rows 1 and 2), while a large window for negative sampling hurts performance (rows 2 and 3).

4.2 Class ratios in mini-batch

Another important factor when training the model is the class ratio in each mini-batch. As has been observed empirically [74, 75], a good class ratio per mini-batch ensures that the model does not overfit to one particular class and helps the learning process. For these experiments, we choose a single temporal window for sampling and vary only the ratio of positive and negative tuples per mini-batch. We compare the accuracy of these networks on the tuple prediction task on held out videos in Table 1 (b). Additionally, we report the accuracy of these networks after finetuning on the action recognition task. These results show that the class ratio used for unsupervised pre-training can significantly impact learning. It is important to have a larger percentage of negative examples.
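
For illustration, a minimal sketch of forming a mini-batch with a fixed negative/positive ratio (e.g., the 75/25 split that works best above); sampling with replacement and the function name are assumptions:

import numpy as np

def sample_minibatch(pos_tuples, neg_tuples, batch_size=128, pos_frac=0.25,
                     rng=np.random):
    """Draw a mini-batch with a fixed fraction of positive tuples."""
    n_pos = int(round(batch_size * pos_frac))
    pos_idx = rng.choice(len(pos_tuples), n_pos, replace=True)
    neg_idx = rng.choice(len(neg_tuples), batch_size - n_pos, replace=True)
    batch = [pos_tuples[i] for i in pos_idx] + [neg_tuples[i] for i in neg_idx]
    labels = [1] * n_pos + [0] * (batch_size - n_pos)
    return batch, labels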

4.3 What does the temporal ordering task capture?

Nearest Neighbor retrieval: We retrieve nearest neighbors using our unsupervised features on the UCF101 dataset and compare them in Figure 3 to retrievals by the pre-trained ImageNet features and a randomly initialized network. Additional examples are shown in the supplementary material. We pick an input query frame from a clip and retrieve neighbors from other clips in the UCF101 dataset.


Fig. 3: We compute nearest neighbors using fc7 features on the UCF101 dataset. We compare these results across three networks: pre-trained on ImageNet, pre-trained on our unsupervised task, and a randomly initialized network. We choose an input query frame from a clip and retrieve results from other clips in the dataset. Since the dataset contains multiple clips from the same video, we get near duplicate retrievals (first row). We remove these duplicates and display results in the second row. While ImageNet focuses on the high level semantics, our network captures the human pose.

Since the UCF101 dataset has clips from the same video, the first set of retrievals (after removing frames from the same input clip) are near duplicates, which are not very informative (notice the random network’s results). We remove these near-duplicates by computing the sum of squared distances (SSD) between the frames, and display the top results in the second row of each query. These results make two things clear: 1) the ImageNet pre-trained network focuses on scene semantics; 2) our unsupervised pre-trained network focuses on the pose of the person. This would seem to indicate that the information captured by our unsupervised pre-training is complementary to that of ImageNet. Such behavior is not surprising if we consider that our network was trained without semantic labels and must reason about spatiotemporal signals for the tuple verification task.
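
A minimal sketch of this retrieval procedure, assuming precomputed fc7 features; the SSD threshold used for near-duplicate removal is an illustrative value, not taken from the paper:

import numpy as np

def retrieve(query_feat, query_frame, feats, frames, k=5, dup_thresh=1e4):
    """Nearest-neighbour retrieval on fc7 features, skipping near-duplicate
    frames (small sum of squared differences to the query frame)."""
    dists = np.sum((feats - query_feat) ** 2, axis=1)       # squared L2 in feature space
    results = []
    for i in np.argsort(dists):
        ssd = np.sum((frames[i].astype(np.float32) - query_frame.astype(np.float32)) ** 2)
        if ssd < dup_thresh:
            continue                                         # near-duplicate, skip
        results.append(i)
        if len(results) == k:
            break
    return results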

Visualizing pool5 unit responses: We analyze the feature representation of the unsupervised network trained using the tuple prediction task on UCF101. Following the procedure of [76], we show the top regions for pool5 units along with their receptive fields in Figure 4. This gives us insight into the network’s internal feature representation and shows that many units show a preference for human body parts and pose. This is not surprising given that our network is trained on videos of human action recognition and must reason about human movements for the tuple ordering task.


Fig. 4: In each row we display the top image regions for a unit from the pool5 layer. We follow the method in [76] and display the receptive fields (marked in red boxes) for these units. As our network is trained on human action recognition videos, most of the units show a preference for parts of the human body and are sensitive to pose.

Fill in the blanks: Given a start and an end frame, we use our unsupervised network to find a middle frame from the input video. We compute such results on held out videos and show them in Figure 5. Our network is able to correctly predict frames that should temporally lie between a given start and end frame. For cyclical actions with large motion, e.g., a child on a swing (row 2), our network resolves the directional ambiguity (is the swing going up or down?). The last row shows failure cases which lack motion (applying makeup) or have only small moving objects (soccer ball).
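
One plausible implementation of this "fill in the blanks" query (the paper gives no details, so this is our sketch, reusing the triplet-Siamese sketch above): score each candidate middle frame by the network's positive-class probability and return the highest-scoring one.

import torch

def predict_middle_frame(model, f_start, f_end, candidates):
    """Pick the candidate frame the verification network scores as most likely
    to lie between f_start and f_end (highest positive-class probability)."""
    model.eval()
    best_i, best_p = None, -1.0
    with torch.no_grad():
        for i, f_mid in enumerate(candidates):
            logits = model(f_start.unsqueeze(0), f_mid.unsqueeze(0), f_end.unsqueeze(0))
            p = torch.softmax(logits, dim=1)[0, 1].item()
            if p > best_p:
                best_i, best_p = i, p
    return best_i, best_p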

5 Additional Experiments on Action Recognition

The previous experiments show that the unsupervised task learns a meaningful representation. In this section we compare our unsupervised method against existing baseline methods and present more quantitative results. We organize our experiments as follows: 1) control experiments that control for variations in design decisions among baselines and our method; 2) comparing our unsupervised method to learning from random initialization; 3) exploring other unsupervised baselines and comparing our method with them; 4) combining our unsupervised representation learning method with a supervised image representation.


Fig. 5: Given a start and an end frame on a held out video, we use our network to find a middle frame. The first two rows demonstrate correct predictions, with the network correctly predicting motion for cyclical cases, e.g., a child on a swing. The last row depicts wrong predictions.

Additional experiments are in the supplementary material. We now describe the common experimental setup.

Datasets and Evaluation: We use the UCF101 [12] dataset, which was also used for our ablation analysis in Section 4, and measure accuracy on the 101-way action classification task. Additionally, we use the HMDB51 [13] dataset for action recognition. This dataset contains 3 train/test splits, each with about 3.4k videos for training and 1.4k videos for testing. Each video belongs to one of 51 action categories, and performance is evaluated by measuring classification accuracy. We follow the same train/test protocols for both UCF101 and HMDB51 as described in Section 4. Note that the UCF101 dataset is about 2.5× larger than the HMDB51 dataset.

Implementation details for pre-training: We use tuples sampled using τmax = 60 and τmin = 15 as described in Section 4. The class ratio of positive examples per mini-batch is 25%. The other parameters for training/finetuning are kept unchanged from Section 4.

Action recognition details: As in Section 4, we use the CaffeNet architecture and the parameters from [33] for both training from scratch and finetuning. We described the finetuning parameters in Section 4. For training from random initialization (or ‘scratch’), we train for 80k iterations with an initial learning rate of 10^-2, decaying by a factor of 10 at 50k and 70k iterations. The other training parameters (momentum, batch size, etc.) are kept the same as in finetuning. We use the improved data augmentation scheme (different aspect ratios, fixed crops) from [72] for all our methods and baselines. Note that we train or finetune all the layers of the network for all methods, including ours.


5.1 Control experiments

In these experiments we control for certain variations, such as batch normalization and the number of training iterations, in our method and the baseline methods. These experiments help us tease apart gains over the baselines that are due to these variations from gains due to our unsupervised pre-training.

Control for more iterations: Our unsupervised training method is trained for a larger number of iterations (without action labels). This gives it an advantage over the baselines that are trained from scratch or finetuned from other datasets (like ImageNet [15, 77]). To control for this, we run the baseline methods, both scratch and finetuning, for a larger number of iterations (200k vs. 80k for scratch, 40k vs. 20k for finetuning) than in [33], and report the highest accuracy.

Control for batch normalization: We use batch normalization [71] for training our triple-Siamese network. Since the other baseline methods from [33] pre-date batch normalization, we first re-trained the baseline models using batch normalization. For lack of space, these numbers are reported in the supplementary material. As also reported by [10], we observed that using batch normalization consistently gave about 1% worse accuracy. Thus, we report baseline numbers without batch normalization.

5.2 Unsupervised pre-training or random initialization?

In these experiments we study the advantage of unsupervised pre-training for action recognition in comparison to learning without any pre-training. We use our tuple prediction task to train a network, starting from random initialization, on the train split of UCF101. The unsupervised pre-trained network is finetuned on both the UCF101 and HMDB51 datasets for action recognition and compared against learning from scratch (without pre-training). We report the performance in Table 2. Our unsupervised pre-training shows a dramatic improvement of +12.4% over training from scratch on UCF101 and a significant gain of +4.7% on HMDB51. This impressive gain demonstrates the informativeness of the unsupervised tuple verification task. On HMDB51, we additionally finetune a network which was trained from scratch on UCF101 and report its performance in Table 2, indicated by ‘UCF supervised’. We see that this network performs worse than our unsupervised pre-trained network. UCF101 and HMDB51 have only 23 action classes in common [33], and we hypothesize that the poor performance is due to the scratch UCF101 network being unable to generalize to actions from HMDB51. For reference, a model pre-trained on the supervised ImageNet dataset [15, 77] and finetuned on UCF101 gives 67.1% accuracy, and ImageNet finetuned on HMDB51 gives an accuracy of 28.5%.

Table 2: Mean classification accuracies over the 3 splits of the UCF101 and HMDB51 datasets. We compare different initializations and finetune them for action recognition.

Dataset   Initialization                Mean Accuracy
UCF101    Random                        38.6
UCF101    (Ours) Tuple verification     50.2
HMDB51    Random                        13.3
HMDB51    UCF supervised                15.2
HMDB51    (Ours) Tuple verification     18.1

5.3 Unsupervised Baselines

In this section, we enumerate a variety of alternative verification tasks that use only video frames and their temporal ordering. For each task, we use a frame sampling procedure similar to the one described in Section 4.1. We compare their performance after finetuning them on the task of action recognition. A more informative task should serve as a better task for pre-training.

Two Close: In this task two frames (fb, fd) are considered to be temporally close if |b − d| < τ for a fixed temporal window τ = 30.

Two Order: Two frames (fb, fd) are considered to be correct if b < d, and incorrect otherwise. We only sample frames such that |b − d| < 30.

Three Order: This is the original temporal ordering task we proposed in Section 3.1. We consider the 3-tuple (fb, fc, fd) to be correct only if the frames obey either ordering b < c < d or b > c > d.

We also compare against standard baselines for unsupervised learning from video.

DrLim [55]: As Equation 1 shows, this method enforces temporal smoothness over the learned features by minimizing the l2 distance d between representations of nearby frames fb, fd (positive class, c = 1), while requiring frames that are not close (negative class, c = 0) to be separated by a margin δ. We use the same samples as in the ‘Two Close’ baseline and set δ = 1.0 [54].

L(fb, fd) = 1(c = 1)d(fb, fd) + 1(c = 0) max(δ − d(fb, fd), 0) (1)

TempCoh [54]: Similar to the DrLim method, temporal coherence learns representations from video by using the l1 distance between pairs of frames rather than the l2 distance of DrLim.
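
For illustration, a PyTorch sketch of Equation 1; setting p = 2 gives the DrLim loss [55] and p = 1 the TempCoh variant [54] (function name and tensor shapes are assumptions):

import torch

def contrastive_loss(z1, z2, same, delta=1.0, p=2):
    """Eq. 1: pull temporally close pairs together, push far pairs at least
    `delta` apart. z1, z2: (N, D) embeddings; same: (N,) float tensor of 0/1."""
    dist = torch.norm(z1 - z2, p=p, dim=1)
    pos = same * dist
    neg = (1 - same) * torch.clamp(delta - dist, min=0.0)
    return (pos + neg).mean()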

Obj. Patch [65]: We use their publicly available model, which was pre-trained without supervision on videos of objects. As their patch-mining code is not available, we do not do unsupervised pre-training on UCF101 for their model.

All these methods (except [65]) are pre-trained on training split 1 of UCF101 without action labels, and then finetuned on test split 1 of UCF101 actions and HMDB51 actions. We compare them in Table 3. Scratch performance for test split 1 of UCF101 and HMDB51 is 39.1% and 14.8% respectively. The tuple verification task outperforms the other sequential ordering tasks and the standard baselines by a significant margin. We attribute the low number of [65] to the fact that they focus on object detection on a very different set of videos, and thus do not perform well on action recognition.


Table 3: We compare the unsupervised methods defined in Section 5.3 by finetuning on UCF101 and HMDB51 action recognition (split 1 for both). The method with * was not pre-trained on action data.

Unsup. Task    Two Close   Two Order   DrLim [55]   TempCoh [54]   Three Order (Ours)   Obj. Patch* [65]
Acc. UCF101       42.3        44.1        45.7          45.4              50.9                40.7
Acc. HMDB51       15.0        16.4        16.3          15.9              19.8                15.6

Table 4: Results of using our unsupervised pre-training to adapt existing image-based representations. We adapt the ImageNet representation using unsupervised data from training split 1 of UCF101, and show the mean accuracy (3 splits) after finetuning on HMDB51.

Test Dataset   Initialization                          Mean Accuracy
HMDB51         Random                                  13.3
HMDB51         UCF sup.                                15.2
HMDB51         Tuple verification                      18.1
HMDB51         ImageNet                                28.5
HMDB51         (Ours) ImageNet + Tuple verification    29.9
HMDB51         ImageNet + UCF sup.                     30.6

5.4 Combining unsupervised and supervised pre-training

We have thus far seen that unsupervised pre-training gives a significant performance boost over training from random initialization. We now see if our pre-training can help improve existing image representations. Specifically, we initialize our model using the weights from the ImageNet pre-trained model and use it for the tuple-prediction task on UCF101 by finetuning for 10k iterations. We hypothesize this may adapt the ImageNet representation to that of action recognition on videos. To test this hypothesis, we finetune this model on the HMDB51 [13] action recognition task. We compare this performance to finetuning on HMDB51 without the tuple-prediction task. Table 4 shows these results.

Our results show that combining our pre-training with ImageNet helps improve the accuracy of the model (rows 3, 4). Finally, we compare to using multiple sources of supervised data: initialized using the ImageNet weights, finetuned on UCF101 action recognition and then finetuned on HMDB51 (row 5). The accuracy using all sources of supervised data is only slightly better than the performance of our model (rows 4, 5). This demonstrates the effectiveness of our simple yet powerful unsupervised pre-training.

6 Pose Estimation Experiment

The qualitative results from Section 4.3 suggest that our network captures information about human pose. To evaluate this quantitatively, we conduct a small experiment on the task of pose estimation using keypoint prediction on the FLIC (full) [14] dataset. We use a simple AlexNet architecture to regress to the 7 keypoints on the torso: head and 2× (shoulders, elbows, wrists). We follow the training procedure in [78]^1. We compute the keypoint for the head as an average of the keypoints for the eyes and nose. We use a train/test split of 17k and 3k images respectively and finetune the models for 100k iterations, with a batch size of 32 and a learning rate of 5 × 10^-4, using AdaGrad [79] and the Euclidean loss (minimizing the l2 distance between ground truth and predicted keypoints). For training from scratch (Rand. Init.), we follow the parameters in the codebase, i.e., a learning rate of 5 × 10^-4 for 1.3M iterations (a higher learning rate resulted in a diverging loss). We use standard data augmentation techniques such as mirroring and cropping. We use local contrast normalization per input image, i.e., we subtract the mean (per color channel) value from the image and divide by the standard deviation of the mean-subtracted image. We evaluate the Probability of Correct Keypoints (PCK) measure [80] on the wrist and elbow keypoints, as used in [81] (Figure 6). PCK measures the correctness of predicted keypoints by varying the distance threshold at which they are considered correct. We report the individual PCK values for the keypoints in Table 5. Remarkably, our unsupervised method is within 2% of using the fully supervised ImageNet dataset and is better than the UCF supervised network (Section 5.2) by +7.6%. This supports the qualitative results from Section 4.3 that show our method can learn human pose information from unsupervised videos.
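
A minimal sketch of the PCK computation, assuming keypoints stored as (N, K, 2) arrays and a per-image normalization length (the normalization choice, e.g., torso size, is an assumption here, not specified above):

import numpy as np

def pck(pred, gt, norm, thresholds=np.linspace(0.0, 0.2, 21)):
    """PCK curve: fraction of keypoints whose normalized distance to the ground
    truth falls below each threshold.

    pred, gt: (N, K, 2) keypoint coordinates; norm: (N,) per-image normalization."""
    dist = np.linalg.norm(pred - gt, axis=2) / norm[:, None]   # (N, K)
    return np.array([(dist <= t).mean() for t in thresholds])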

Table 5: Average PCK values for the keypoints in FLIC.

Init.           wri    elb    sho    head   Mean   AUC
Random Init.    53.0   75.2   86.7   91.7   74.5   36.1
UCF Sup.        61.0   78.8   89.1   93.8   78.8   42.0
Tuple Verif.    69.6   85.5   92.8   97.4   84.7   49.6
ImageNet        69.6   86.7   93.6   97.9   85.8   51.3

Fig. 6: PCK values for the wrist and elbow keypoints.

[Plot: accuracy (%) vs. normalized distance to ground truth, comparing Rand Init., UCF Sup., Tuple Verif., and ImageNet.]

7 Discussion

In this paper, we studied unsupervised learning from the raw spatiotemporal signal in videos. Our proposed sequence verification task outperforms other existing unsupervised methods on action recognition, and is very close to supervised methods for pose estimation. As a next step, we would like to explore other tasks beyond action recognition, explore the use of different types of videos, and use other ‘free’ signals from videos such as optical flow. Another direction is to pursue sequence verification using a combination of CNNs and RNNs, and to extend our tuple verification task to longer sequences.

^1 Public re-implementation from https://github.com/mitmul/deeppose


References

[1] Cleeremans, A., McClelland, J.L.: Learning the structure of event sequences. Journal of Experimental Psychology: General 120(3) (1991) 235
[2] Reber, A.S.: Implicit learning and tacit knowledge. Journal of Experimental Psychology: General 118(3) (1989) 219
[3] Cleeremans, A.: Mechanisms of implicit learning: Connectionist models of sequence processing. MIT Press (1993)
[4] Sun, R., Merrill, E., Peterson, T.: From implicit skills to explicit knowledge: A bottom-up model of skill learning. Cognitive Science 25(2) (2001)
[5] Baker, R., Dexter, M., Hardwicke, T.E., Goldstone, A., Kourtzi, Z.: Learning to predict: exposure to temporal sequences facilitates prediction of future events. Vision Research 99 (2014) 124–133
[6] Sun, R., Giles, C.L.: Sequence learning: from recognition and prediction to sequential decision making. IEEE Intelligent Systems 16(4) (2001)
[7] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
[8] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS. (2013)
[9] Firth, J.R.: A synopsis of linguistic theory 1930-1955. (1957)
[10] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: ICCV. (2015)
[11] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (1989)
[12] Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
[13] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV. (2011)
[14] Sapp, B., Taskar, B.: MODEC: Multimodal decomposable models for human pose estimation. In: CVPR. (2013)
[15] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR. (2009)
[16] Poppe, R.: A survey on vision-based human action recognition. Image and Vision Computing 28(6) (2010)
[17] Laptev, I.: On space-time interest points. IJCV 64 (2005)
[18] Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC. (2008)
[19] Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR. (2008)
[20] Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: ECCV. (2006)
[21] Matikainen, P., Hebert, M., Sukthankar, R.: Trajectons: Action recognition through the motion analysis of tracked features. In: ICCV Workshop, IEEE (2009)
[22] Peng, X., Zou, C., Qiao, Y., Peng, Q.: Action recognition with stacked Fisher vectors. In: ECCV. Springer (2014)
[23] Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: CVPR. (2011)
[24] Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV. (2013) 3551–3558
[25] Xu, Z., Yang, Y., Hauptmann, A.G.: A discriminative CNN video representation for event detection. In: CVPR. (2015)
[26] Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C.: Convolutional learning of spatio-temporal features. In: ECCV. (2010)
[27] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR. (2014)
[28] Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. TPAMI 35(1) (2013)
[29] Fernando, B., Gavves, E., Oramas, J.M., Ghodrati, A., Tuytelaars, T.: Modeling video evolution for action recognition. In: CVPR. (2015)
[30] Gkioxari, G., Malik, J.: Finding action tubes. In: CVPR. (2015)
[31] Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: CVPR. (2015)
[32] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV. (2015)
[33] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS. (2014)
[34] Faktor, A., Irani, M.: "Clustering by composition" – unsupervised discovery of image categories. In: ECCV. (2012)
[35] Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their location in images. In: ICCV. (2005)
[36] Russell, B.C., Freeman, W.T., Efros, A.A., Sivic, J., Zisserman, A.: Using multiple segmentations to discover objects and their extent in image collections. In: CVPR. (2006)
[37] Singh, S., Gupta, A., Efros, A.: Unsupervised discovery of mid-level discriminative patches. In: ECCV. (2012)
[38] Juneja, M., Vedaldi, A., Jawahar, C., Zisserman, A.: Blocks that shout: Distinctive parts for scene classification. In: CVPR. (2013)
[39] Doersch, C., Gupta, A., Efros, A.A.: Mid-level visual element discovery as discriminative mode seeking. In: NIPS. (2013)
[40] Li, Q., Wu, J., Tu, Z.: Harvesting mid-level visual concepts from large-scale internet images. In: CVPR. (2013)
[41] Sun, J., Ponce, J.: Learning discriminative part detectors for image classification and cosegmentation. In: ICCV. (2013)
[42] Olshausen, B.A., et al.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583) (1996)
[43] Bengio, Y., Thibodeau-Laufer, E., Alain, G., Yosinski, J.: Deep generative stochastic networks trainable by backprop. arXiv preprint arXiv:1306.1091 (2013)
[44] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML. (2008)
[45] Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: ICAIS. (2009)
[46] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
[47] Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 (2014)
[48] Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: NIPS. (2006)
[49] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al.: Greedy layer-wise training of deep networks. In: NIPS. (2007)
[50] Le, Q.V.: Building high-level features using large scale unsupervised learning. In: ICASSP. (2013)
[51] Sermanet, P., Kavukcuoglu, K., Chintala, S., LeCun, Y.: Pedestrian detection with unsupervised multi-stage feature learning. In: CVPR. (2013)
[52] Jayaraman, D., Grauman, K.: Learning image representations equivariant to ego-motion. In: ICCV. (2015)
[53] Jayaraman, D., Grauman, K.: Slow and steady feature analysis: Higher order temporal coherence in video. arXiv preprint arXiv:1506.04714 (2015)
[54] Mobahi, H., Collobert, R., Weston, J.: Deep learning from temporal coherence in video. In: ICML. (2009)
[55] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR, IEEE (2006)
[56] Földiák, P.: Learning invariance from transformation sequences. Neural Computation 3(2) (1991)
[57] Wiskott, L., Sejnowski, T.J.: Slow feature analysis: Unsupervised learning of invariances. Neural Computation 14(4) (2002)
[58] Goroshin, R., Bruna, J., Tompson, J., Eigen, D., LeCun, Y.: Unsupervised learning of spatiotemporally coherent metrics. In: ICCV. (2015)
[59] Zhang, Z., Tao, D.: Slow feature analysis for human action recognition. TPAMI 34(3) (2012)
[60] Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681 (2015)
[61] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9 (1997)
[62] Zhou, Y., Berg, T.L.: Temporal perception and prediction in ego-centric video. In: ICCV. (2015)
[63] Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023 (2015)
[64] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV. (2015)
[65] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: ICCV. (2015)
[66] Liang, X., Liu, S., Wei, Y., Liu, L., Lin, L., Yan, S.: Towards computational baby learning: A weakly-supervised approach for object detection. In: ICCV. (2015)
[67] Shannon, C.E.: Communication in the presence of noise. Proceedings of the IRE 37 (1949)
[68] Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Image Analysis. Springer (2003)
[69] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: ACMM. (2014)
[70] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS. (2012)
[71] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
[72] Wang, L., Xiong, Y., Wang, Z., Qiao, Y.: Towards good practices for very deep two-stream ConvNets. arXiv preprint arXiv:1507.02159 (2015)
[73] Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional nets. In: BMVC. (2014)
[74] Girshick, R.: Fast R-CNN. In: ICCV. (2015)
[75] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS. (2015)
[76] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR. (2014)
[77] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV 115 (2015)
[78] Toshev, A., Szegedy, C.: DeepPose: Human pose estimation via deep neural networks. In: CVPR. (2014)
[79] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. JMLR 12 (2011)
[80] Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. TPAMI 35 (2013)
[81] Pfister, T., Charles, J., Zisserman, A.: Flowing ConvNets for human pose estimation in videos. In: ICCV. (2015)