
HalluciNet-ing Spatiotemporal Representations Using a 2D-CNN

Paritosh Parmar, Brendan Morris
University of Nevada, Las Vegas

[email protected], [email protected]

Abstract

Spatiotemporal representations learned using 3D convolutional neural networks (CNN) are currently used in state-of-the-art approaches for action related tasks. However, 3D-CNNs are notorious for being memory and compute resource intensive as compared with simpler 2D-CNN architectures. We propose to hallucinate spatiotemporal representations from a 3D-CNN teacher with a 2D-CNN student. By requiring the 2D-CNN to predict the future and intuit upcoming activity, it is encouraged to gain a deeper understanding of actions and how they evolve. The hallucination task is treated as an auxiliary task, which can be used with any other action related task in a multitask learning setting. Thorough experimental evaluation shows that the hallucination task indeed helps improve performance on action recognition, action quality assessment, and dynamic scene recognition tasks. From a practical standpoint, being able to hallucinate spatiotemporal representations without an actual 3D-CNN can enable deployment in resource-constrained scenarios, such as with limited computing power and/or lower bandwidth. Codebase is available here: https://github.com/ParitoshParmar/HalluciNet.

1. Introduction

Spatiotemporal representations are densely packed with information regarding both the appearance and salient motion patterns occurring in the video clips, as illustrated in Fig. 1. Due to this representational power, they are currently the best performing models on action related tasks, like action recognition [40, 21, 14, 6], action quality assessment [30, 29], skills assessment [5], and action detection [12]. This representation power comes at a cost of increased computational complexity [58, 48, 55, 13], which makes 3D-CNNs unsuitable for deployment in resource-constrained scenarios.

The power of 3D-CNNs comes from their ability to attend to the salient motion patterns of a particular action class. 2D-CNNs, in contrast, are generally used for learning and extracting spatial features pertaining to a single frame/image, and thus, by design, do not take into account any motion information, therefore lacking temporal representation power. Some works [46, 47, 32, 10] have addressed this by using optical flow, which will respond at all pixels that have moved/changed. This means optical flow can respond to cues both from the foreground motion of interest and from irrelevant activity happening in the background. This background response might not be desirable, since CNNs have been shown to find shortcuts to recognize actions not from the meaningful foreground, but from background cues [20, 15]. These kinds of shortcuts might still be beneficial for action recognition tasks, but not in a meaningful way: the 2D network is not actually learning to understand the action itself, but rather contextual cues and clues. Despite these shortcomings, 2D-CNNs are computationally lightweight, which makes them suitable for deployment on edge devices.

In short, 2D-CNNs have the advantage of being computationally less expensive, while 3D-CNNs extract spatiotemporal features that have more representation power. In our work, we propose a way to combine the best of both worlds – rich spatiotemporal representation, with low computational cost.

Our inspiration comes from the observation that given even a single image of a scene, humans can predict how the scene might evolve. We are able to do so because of our experience and interaction in the world, which provides a general understanding of how other people are expected to behave and how objects can move or be manipulated. We propose to hallucinate spatiotemporal representations, as computed by a 3D-CNN, using a 2D-CNN, utilizing only a single still frame (see Fig. 1). The idea is to force a 2D-CNN to predict the motion that will occur in the next frames, without ever having to actually see it.

Conceptually, our hallucination task can provide a richer, stronger supervisory signal that can help the 2D-CNN gain a deeper understanding of actions and how a given scene evolves with time. Experimentally, we found our approach beneficial in the following settings:

• actions with:
  – short-term temporal dynamics
  – long-term temporal dynamics
• dynamic scene recognition
• improved performance on downstream tasks when injected during pretraining


Figure 1: Multitask learning with HalluciNet. HalluciNet (2D-CNN) sees only the first frame of a 16-frame clip and is jointly optimized for the main task (action recognition, AQA, scene recognition, etc.) and to hallucinate spatiotemporal features computed by an actual 3D-CNN, which sees the entire clip and is only needed during training.


Practically, approximating spatiotemporal features, instead of actually computing them, is useful in scenarios with: 1) limited compute power (smart video camera systems, lower-end phones, or IoT devices); 2) limited/expensive bandwidth (Video Analytics Software as a Service (VA SaaS)), where our method can help reduce the transmission load by a factor of 15 (only 1 frame out of every 16 needs to be transmitted).

2. Related Work

Our work is related to predicting features, developing efficient/light-weight spatiotemporal network approaches, and distilling knowledge. Next, we briefly compare and contrast our approach to the most closely related works in the literature.

Capturing information in future frames: Many works have focused on capturing information in future frames [54, 23, 46, 47, 32, 9, 45, 24, 42, 43, 2, 44]. Generating future frames is a difficult and complicated task, and usually requires disentangling the background, foreground, low-level and high-level details and modeling them separately. Our approach of predicting features is much simpler. Moreover, our goal is not to predict a pixel-perfect future, but rather to make predictions at the semantic level.

Instead of explicitly generating future frames, works like [46, 47, 32, 10] focused on learning to predict optical flow (very short-term motion information). These approaches, by design, require the use of an encoder and a decoder. Our approach does not require a decoder, which reduces the computational load. Moreover, our approach learns to hallucinate features corresponding to 16 frames, as compared to the motion information in two frames. Experiments confirm the benefits of our method over optical flow prediction.

Bilen et al. [2] introduced a novel, compact representation of a video called a “dynamic image,” which can be thought of as a summary of a full video in a single image. However, computing a dynamic image requires access to all the corresponding frames, whereas HalluciNet requires processing just a single image.

Predicting features: Other works [49, 17, 42] propose predicting features. Our work is closest to [17], where the authors proposed hallucinating depth using RGB input, whereas we propose hallucinating spatiotemporal information. Reasoning about depth information is different from reasoning about spatiotemporal evolution.

Efficient Spatiotemporal Feature Computation: Numerous works have developed approaches to make video processing more efficient, either by reducing the required input evidence [56, 1, 3, 37], or explicitly through more efficient processing [38, 41, 34, 58, 50, 48, 26, 27].

While these works aim either to reduce the visual evidence or to develop a more efficient architecture design, our solution to hallucinate (without explicitly computing) spatiotemporal representations using a 2D-CNN from a single image aims to solve both problems, while also providing stronger supervision. In fact, our approach, which focuses on improving the backbone CNN, is complementary to some of these developments [57, 27].

3. Best of Both Worlds

Since humans are able to predict future activity and behavior through years of experience and a general understanding of “how the world works,” we would like to develop a network that can understand an action in a similar manner.


To this end, we propose a teacher-student network architecture that asks a 2D-CNN to use a single frame to hallucinate (predict) 3D features pertaining to 16 frames.

Let us consider the example of a gymnast performing her routine, as shown in Fig. 1. In order to complete the hallucination task, the 2D-CNN should:

• learn to identify that there is an actor in the scene and localize her;
• spatially segment the actors and objects;
• identify that the event is a balance beam gymnastic event and the actor is a gymnast;
• identify that the gymnast is about to attempt a cartwheel;
• predict how she will be moving while attempting the cartwheel;
• approximate the final position of the gymnast after 16 frames, etc.

The challenge is understanding all the rich semantic details of the action from only a single frame.

3.1. Hallucination Task

The hallucination task can be seen as distilling knowledge from a better teacher network (3D-CNN), f_t, to a lighter student network (2D-CNN), f_s. The teacher, f_t, is pretrained and kept frozen, while the parameters of the student, f_s, are learned. Mid-level representations can be computed as:

    φ_t = f_t(F_0, F_1, ..., F_{T−1})    (1)
    φ_s = f_s(F_0)    (2)

where F_T is the T-th video frame. The hallucination loss, L_hallu, encourages f_s to regress φ_s to φ_t by minimizing the Euclidean distance between φ_s and φ_t:

    L_hallu = ||σ(φ_s) − σ(φ_t)||²    (3)
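For concreteness, a minimal PyTorch sketch of Eq. (3) is given below. It is an illustration, not the authors' exact code: the 2048-D feature shapes follow the setup described later in Sec. 4, and treating σ(·) as the sigmoid squashing function is our assumption.

```python
import torch
import torch.nn.functional as F

def hallucination_loss(student_feat: torch.Tensor,
                       teacher_feat: torch.Tensor) -> torch.Tensor:
    """Eq. (3): squared error between squashed student and teacher features.

    student_feat: phi_s from the 2D-CNN student, shape (B, 2048).
    teacher_feat: phi_t from the frozen 3D-CNN teacher, shape (B, 2048).
    """
    teacher_feat = teacher_feat.detach()  # teacher is frozen; no gradient flows back
    # sigma(.) is taken to be the sigmoid squashing function (our assumption).
    return F.mse_loss(torch.sigmoid(student_feat),
                      torch.sigmoid(teacher_feat))
```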

Multitask learning (MTL): Reducing computational cost with the hallucination task is not the only goal. Since the primary objective is to better understand activities and improve performance, hallucination is meant to be an auxiliary task to support the main action related task (e.g., action recognition). The main task loss (e.g., classification loss), L_mt, is used in conjunction with the hallucination loss:

    L_MTL = L_mt + λ L_hallu    (4)

where λ is a loss-balancing factor. The realization of our approach is straightforward, as presented in Fig. 1.
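A sketch of one multitask training step under Eq. (4) might look as follows, reusing the hallucination_loss helper sketched above. The names student, teacher, classifier, and main_criterion are placeholders for the components shown in Fig. 1, and the (B, 3, 16, H, W) clip layout is an assumption.

```python
import torch

def mtl_step(student, teacher, classifier, main_criterion,
             clip, labels, lam=50.0):
    """One step of L_MTL = L_mt + lambda * L_hallu (Eq. 4).

    clip: (B, 3, 16, H, W) video clip; only its first frame is shown to the student.
    """
    first_frame = clip[:, :, 0]            # (B, 3, H, W)
    phi_s = student(first_frame)           # hallucinated spatiotemporal feature
    with torch.no_grad():                  # teacher is pretrained and kept frozen
        phi_t = teacher(clip)              # actual 3D-CNN feature for the 16 frames
    loss_mt = main_criterion(classifier(phi_s), labels)  # e.g. cross-entropy
    loss_hallu = hallucination_loss(phi_s, phi_t)        # Eq. (3), sketched above
    return loss_mt + lam * loss_hallu
```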

3.2. Stronger Supervision

In a typical action recognition task, a network is only provided with the action class label. This may be considered a weak supervisory signal, since it provides a single high-level semantic interpretation of a clip filled with complex changes. Denser labels, at lower semantic levels, are expected to provide stronger supervisory signals, which could improve action understanding.

In this vein, joint actor-action segmentation is an actively pursued research direction [18, 11, 53, 19, 51]. Joint actor-action segmentation datasets [52] provide detailed annotations, obtained through significant annotation effort. In contrast, our spatiotemporal hallucination task provides detailed supervision of a similar flavor (though not exactly the same) for free. Since 3D-CNN representations tend to focus on actors and objects, the 2D-CNN can develop a better general understanding of actions through actor/object manipulation. Additionally, the 2D representation will be less likely to take shortcuts – ignoring the actual actor and action being performed, and instead doing recognition based on the background [20, 15] – as it cannot hallucinate spatiotemporal features, which mainly pertain to actors/foreground, from the background.

3.3. Prediction Ambiguities

In general, prediction of future activity from a single frame can be ambiguous (e.g., opening vs. closing a door). However, a study has shown that humans are able to accurately predict the immediate future action from a still image 85% of the time [42]. So, while there may be ambiguous cases, there are many other instances where causal relationships exist and the hallucination task can be exploited. Additionally, low-level motion cues can be used to resolve ambiguity (Sec. 4.4).

4. Experiments

We hypothesize that incorporating the hallucination task is beneficial because it provides a deeper understanding of actions. We evaluate the effect of incorporating the hallucination task in the following settings:

• (Sec. 4.1) Actions with short-term temporal dynamics
• (Sec. 4.2) Actions with long-term temporal dynamics
• (Sec. 4.3) Non-action task
• (Sec. 4.4) Hallucinating from two frames, instead of a single frame
• (Sec. 4.5) Effect of injecting the hallucination task during pretraining

Choice of networks: In principle, any 2D- or 3D-CNNs could be used as student or teacher networks, respectively. Noting the SOTA performance of 3D-ResNeXt-101 [14] on action recognition, we choose to use it as our teacher network. We considered various student models. Unless otherwise mentioned, our student model is VGG11-bn pretrained on the ImageNet dataset [4]; the teacher network was trained on UCF-101 [36] and kept frozen. We name the 2D-CNN trained with the side-task hallucination loss HalluciNet, and the one trained without the hallucination loss the (vanilla) 2D-CNN; the HalluciNet_direct variant directly uses the hallucinated features for the main action recognition task.


(a)
Method                     UCF-static   HMDB-static
App stream [10]            63.60        35.10
App stream ensemble [10]   64.00        35.50
Motion stream [10]         24.10        13.90
Motion stream [47]         14.30        04.96
App + Motion [10]          65.50        37.10
App + Motion [47]          64.50        35.90
Ours 2D-CNN                64.97        34.23
Ours HalluciNet            69.81        40.32
Ours HalluciNet_direct     70.53        39.42

(b)
Method                     Accu
TRN (R18)                  69.57
TRN (HalluciNet(R18))      69.94
TSM (R18)                  71.40
TSM (HalluciNet(R18))      73.12

(c)
Model                      Accu
Resnet-50                  76.74
HalluciNet(R50)            79.83

Table 1: (a) Action recognition results and comparison. (b) HalluciNet helps recent developments like TRN and TSM. (c) Multiframe inference on a better base model. (b, c) are evaluated on UCF101.


Which layer to hallucinate? We chose to hallucinate the activations of the last bottleneck group of 3D-ResNeXt-101, which are 2048-dimensional. Representations of shallower layers will have higher dimensionality and will be less semantically mapped.
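One way to obtain this 2048-D target in PyTorch is with a forward hook on the teacher; the sketch below is our assumption about the mechanics (hook plus global average pooling), not the authors' exact code, and the layer handle passed in depends on the particular 3D-ResNeXt-101 implementation used.

```python
import torch

def extract_teacher_feature(teacher: torch.nn.Module,
                            layer: torch.nn.Module,
                            clip: torch.Tensor) -> torch.Tensor:
    """Capture the teacher's last-bottleneck activation as the 2048-D target.

    layer: the module whose output is hallucinated (here, the final bottleneck
           group of 3D-ResNeXt-101; the exact attribute name depends on the
           implementation being used).
    clip:  (B, 3, 16, H, W) input to the frozen 3D-CNN.
    """
    captured = {}

    def hook(module, inputs, output):
        # Global-average-pool the (B, 2048, t, h, w) activation down to (B, 2048).
        captured['phi_t'] = output.mean(dim=(2, 3, 4)).detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        teacher(clip)
    handle.remove()
    return captured['phi_t']
```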

Implementation details: We used PyTorch [31] to implement all of the networks. Network parameters were optimized using the Adam optimizer [22] with an initial learning rate of 0.0001. λ in Eq. 4 is set to 50, unless specified otherwise. Further experiment-specific details are presented with each experiment. The codebase will be made publicly available.
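A brief setup sketch matching the defaults above (the torchvision VGG11-bn is a stand-in for the student; a small head mapping its features to the 2048-D teacher space is assumed but omitted):

```python
import torch
import torchvision

# Student: VGG11-bn pretrained on ImageNet, as described above.
student = torchvision.models.vgg11_bn(pretrained=True)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)  # initial LR 0.0001
lambda_hallu = 50.0  # loss-balancing factor in Eq. (4), unless noted otherwise
```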

Performance baselines: Our performance baseline was a 2D-CNN with the same architecture, but trained without the hallucination loss (the vanilla 2D-CNN). In addition, we also compared performance against other popular approaches from the literature, specified in each experiment.

Figure 2: Qualitative results (successes and failures of the 2D-CNN vs. HalluciNet on samples with ground-truth classes such as WallPushups, SoccerJuggling, FieldHockeyPenalty, FloorGymnastics, Kayaking, and TaiChi). The hallucination task helps improve performance when the action sample is visually similar to other action classes and a motion cue is needed to distinguish them. However, HalluciNet sometimes makes incorrect predictions when the motion cue is similar to that of other actions and dominates over the visual cue.

Model                    UCF      HMDB     Time/Inf.   FLOPs/Param
Ours 2D-CNN              64.97    34.23    3.54 ms     58
Ours HalluciNet_direct   70.53    39.42    3.54 ms     58
Ours actual 3D-CNN       87.95    60.44    58.99 ms    143

Table 2: Cost vs. accuracy comparison (accuracy in % on UCF and HMDB). Times are measured on a Titan-X.

4.1. Actions with Short-Term Temporal Dynamics

In the first experiment, we tested the influence of the hallucination task for general action recognition. We compared the performance with two single-frame prediction techniques: dense optical flow prediction from a static image [47], and motion prediction from a static image [10].

Datasets: The UCF-101 [36] and HMDB-51 [25] action recognition datasets were considered. In order to be consistent with the literature, we adopted their experimental protocols. Center frames from the training and testing samples were used for reporting performance; these are named UCF-static and HMDB-static, as in the literature [10].

Metric: We report the top-1 frame/clip-level accuracy (in %).


We summarize the performance on the action recognition task in Table 1a. We found that incorporating the hallucination task helped on both datasets. Our HalluciNet outperformed prior approaches [47, 10] on both UCF-101 and HMDB-51. Moreover, our method has the advantage of being computationally lighter than [10], as it does not use a flow image generator network. Qualitative results are shown in Fig. 2. In the successes, ambiguities were resolved. The failure cases tended to confuse semantically similar classes with similar motions, such as FloorGymnastics/BalanceBeam or Kayaking/Rowing. To evaluate the quality of the hallucinated representations themselves, we directly used those representations for the main action recognition task (HalluciNet_direct). We saw that the hallucinated features had strong performance and improved on the 2D-CNN; in fact, they performed best on UCF-static.

Next, we used the hallucination task to improve the performance of recent developments, TRN [57] and TSM [27]. We used Resnet-18 (R18) as the backbone for both, and implemented single, center-segment, 4-frame versions of both. For TRN, we considered the multiscale version. For TSM, we considered the online version, which is intended for real-time processing. For both, we sampled 4 frames from the center 16 frames. We used λ = 200. Their vanilla versions served as our baselines. Performance on UCF101 is shown in Table 1b.

We also experimented with a larger, better base model, Resnet-50. In this experiment, we trained using all the frames, not only the center frame, and during testing averaged the results over 25 frames. We used λ = 350. Results on UCF101 are shown in Table 1c.

Finally, in Table 2, we compare our predicted spatiotemporal representation, HalluciNet_direct, with the actual 3D-CNN. HalluciNet improved upon the vanilla 2D-CNN, though it remains well below the actual 3D-CNN. However, this performance trade-off comes at only 6% of the computational cost of the full 3D-CNN.

4.2. Actions with Long-Term Temporal Dynamics

Although we proposed hallucinating the short-term future (16 frames), actions with longer temporal dynamics must frequently be considered. To evaluate the utility of short-term hallucination on actions with longer temporal dynamics, we considered the task of recognizing dives and assessing their quality. Short clips were aggregated over longer videos using an LSTM, as shown in Fig. 3.

4.2.1 Dive Recognition

Task description: In Olympic diving, athletes attempt many different types of dives.

Figure 3: Detailed action recognition and action quality assessment models. (The 3D-CNN teacher sees each full 16-frame clip, while HalluciNet sees only the first frame of each of the six clips; an LSTM aggregates the per-frame features for dive classification (P, AS, RT, #SS, #TW) and AQA score regression.)

In a general action recognition dataset, like UCF101, all of these dives are grouped under a single action class, diving. However, these dives differ from each other in subtle ways. Each dive has the following five components: a) Position (legs straight or bent); b) starting from an Armstand or not; c) Rotation direction (backwards, forwards, etc.); d) the number of times the diver Somersaulted in the air; and e) the number of times the diver Twisted in the air. Different combinations of these components produce a unique type of dive (dive number). The Dive Recognition task is to predict all five components of a dive using very few frames.

Why is this task more challenging? Unlike in general action recognition datasets, e.g., UCF-101 or Kinetics [21], the cues needed to identify the specific dive are distributed across the entire action sequence. In order to correctly predict the dive, the whole action sequence needs to be seen. To make the dive classification task more suitable for our HalluciNet framework, we asked the network to classify a dive correctly using only a few regularly spaced frames. In particular, we truncated each diving video to 96 frames and showed the student network every 16th frame, for a total of 6 frames. Note that we are not asking our student network to hallucinate the entire dive sequence; rather, the student network is required to hallucinate the short-term future in order to “fill the holes” in the visual input datastream.
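A sketch of this sampling scheme, under the assumption of a simple 0-indexed stride over the frame tensor:

```python
import torch

def sample_dive_frames(video: torch.Tensor) -> torch.Tensor:
    """Pick every 16th frame from a dive video truncated to 96 frames.

    video: (T, 3, H, W) with T >= 96. Returns the 6 frames at indices
    0, 16, ..., 80; the teacher still sees the full 16-frame clip that
    starts at each sampled index.
    """
    return video[:96:16]  # shape (6, 3, H, W)
```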

Dataset: The recently released diving dataset MTL-AQA [29], which has 1059 training and 353 test samples, is used for this task. The average sequence length is 3.84 seconds.

Model: We pretrained both our 3D-CNN teacher and 2D-CNN student on UCF-101. Then the student network was trained to classify dives. Since we gather evidence over six frames, we made use of an LSTM [16] for aggregation. The LSTM was single-layered with a 256-D hidden state. The LSTM's hidden state at the last time step was passed through separate linear classification layers, one for each of the properties of a dive. The full model is illustrated in Fig. 3. The student network was trained end-to-end for 20 epochs using an Adam solver with a constant learning rate of 0.0001. We also considered a HalluciNet based on Resnet-18 (λ = 400). We did not consider R50 because it is much larger relative to the dataset size.
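A minimal sketch of this aggregation model, assuming a generic 2D backbone that returns one feature vector per frame; the layer sizes follow the description above, while backbone, feat_dim, and the per-property class counts are placeholders.

```python
import torch
import torch.nn as nn

class DiveClassifier(nn.Module):
    """Per-frame 2D-CNN features -> single-layer LSTM (256-D) -> per-property heads."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes):
        # num_classes: one class count per dive property (P, AS, RT, SS, TW);
        # the actual counts depend on the MTL-AQA label definitions.
        super().__init__()
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, 256, num_layers=1, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(256, c) for c in num_classes])

    def forward(self, frames: torch.Tensor):
        # frames: (B, 6, 3, H, W) -- the six regularly spaced frames.
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)  # h_n: (1, B, 256)
        last = h_n[-1]                  # hidden state at the last time step
        return [head(last) for head in self.heads]
```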


(a)
Method                    CNN   #Fr.   P       AS      RT      SS      TW
C3D-AVG [29]              3D    96     96.32   99.72   97.45   96.88   93.20
MSCADC [29]               3D    16     78.47   97.45   84.70   76.20   82.72
Nibali et al. [28]        3D    16     74.79   98.30   78.75   77.34   79.89
Ours VGG11                2D    6      90.08   99.43   92.07   83.00   86.69
Ours HalluciNet(VGG11)    2D    6      89.52   99.43   96.32   86.12   88.10
Ours HalluciNet(R18)      2D    6      91.78   99.43   95.47   88.10   89.24

(b)
Method                    CNN   #Fr.   Corr.
Pose+DCT [33]             -     96     26.82
C3D-SVR [30]              3D    96     77.16
C3D-LSTM [30]             3D    96     84.89
C3D-AVG-STL [29]          3D    96     89.60
MSCADC-STL [29]           3D    16     84.72
Ours VGG11                2D    6      80.39
Ours HalluciNet(VGG11)    2D    6      82.70
Ours HalluciNet(R18)      2D    6      83.51

Table 3: (a) Performance comparison on the dive recognition task. #Fr. is the number of frames the corresponding method sees. P, AS, RT, SS, TW stand for position, armstand, rotation type, number of somersaults, and number of twists. (b) Performance on the AQA task.


The results are summarized in Table 3a, where we also compare them with other state-of-the-art 3D-CNN based approaches [28, 29]. Compared with the 2D baseline, HalluciNet performed better on 3 out of 5 tasks. The Position task (legs straight or bent) may be equally identifiable from a single image or a clip, but the number of twists (TW), the number of somersaults (SS), or the direction of rotation are more challenging to determine without seeing motion. In contrast, HalluciNet could predict the motion. Our HalluciNet even outperforms 3D-CNN based approaches that use more frames (MSCADC [29] and Nibali et al. [28]). C3D-AVG outperformed HalluciNet, but it is computationally expensive and uses 16× more frames.

4.2.2 Dive Quality Assessment

Action quality assessment (AQA) is another task which can highlight the utility of hallucinating spatiotemporal representations from still images using a 2D-CNN. In AQA, the task is to measure, or quantify, how well an action was performed. A good example of AQA is judging Olympic events like diving, gymnastics, figure skating, etc. Like Dive Recognition, in order to correctly assess the quality of a dive, the entire dive sequence needs to be seen/processed.

Dataset: MTL-AQA [29], the same as in Sec. 4.2.1.

Metric: Consistent with the literature, we report Spearman's rank correlation (in %).

We follow the same training procedure as in Sec. 4.2.1, except that for the AQA task we used an L2 loss, as it is a regression task. We trained for 20 epochs with Adam as the solver and annealed the learning rate by a factor of 10 every 5 epochs. We also considered a HalluciNet based on R18 (λ = 250).
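For the schedule described above, a sketch using PyTorch's step scheduler; the model handle and the run_epoch callable are placeholders for the AQA regressor and one pass over MTL-AQA, not the authors' actual training loop.

```python
import torch

def train_aqa(model: torch.nn.Module, run_epoch, num_epochs: int = 20):
    """AQA training schedule: 20 epochs, Adam, LR annealed by 10x every 5 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    criterion = torch.nn.MSELoss()  # L2 loss: AQA score prediction is a regression task
    for _ in range(num_epochs):
        run_epoch(model, optimizer, criterion)  # placeholder for one training pass
        scheduler.step()
```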

The AQA results are presented in Table 3b. Incorporating the hallucination task helped improve AQA performance. Our HalluciNet outperformed C3D-SVR and was quite close to C3D-LSTM and MSCADC, although it saw 90 and 10 fewer frames, respectively. Although it does not match C3D-AVG-STL, HalluciNet requires significantly less computation.

4.3. Dynamic Scene Recognition

Dataset: Feichtenhofer et al. introduced the YUP++ dataset [8] for the task of dynamic scene recognition. It has a total of 20 scene classes. The use of this dataset to evaluate the utility of inferred motion was suggested in [10]. In the work by Feichtenhofer et al., 10% of the samples were used for training, while the remaining 90% were used for testing. Gao et al. [10] formed their own split, called static-YUP++.

Protocol: For training and testing purposes, we considered the central frame of each sample.

The first experiment considered standard dynamic scene recognition using splits from the literature, and compared HalluciNet with a spatiotemporal energy based approach (BoSE), a slow feature analysis (SFA) approach, and a temporal CNN (T-CNN). Additionally, we also considered versions based on Resnet-50 with predictions averaged over 25 frames.


(a)
Method                                Accu
SFA [39]                              56.90
BoSE [7]                              77.00
T-CNN [35]                            50.60
(Ours, single-frame inference)
Ours VGG11                            77.50
Ours HalluciNet(VGG11)                78.15
(Ours, multiframe inference with better base models)
Ours Resnet-50                        83.43
Ours HalluciNet(Resnet-50)            84.44

(b)
Method                                Accu
Appearance [10]                       74.30
GT Motion [10]                        55.50
Inferred Motion [10]                  30.00
Appearance ensemble [10]              75.20
Appearance + Inferred Motion [10]     78.20
Appearance + GT Motion [10]           79.60
Ours 2D-CNN                           72.04
Ours HalluciNet                       81.53

Table 4: (a) Dynamic scene recognition on YUP++. (b) Dynamic scene recognition on static-YUP++.

As shown in Table 4a, HalluciNet showed a minor improvement over the baseline 2D-CNN and outperformed studies in the literature. T-CNN might be the closest for comparison because it uses a stack of 10 optical flow frames; however, our HalluciNet outperformed it by a large margin. Note that we did not train our 3D-CNN on the scene recognition dataset/task, and instead used a 3D-CNN trained on an action recognition dataset, yet we still observed improvements.

The second experiment compared our approach with [10], for which we used their split of static-YUP++ (Table 4b). In this case, our vanilla 2D-CNN did not outperform studies in the literature, but our HalluciNet did – even when groundtruth motion information was used by im2flow [10].

4.4. Using Multiple Frames to Hallucinate

As previously discussed, there are situations (e.g., door open/close) where a single image cannot be reliably used for hallucination. However, motion cues coming from multiple frames can be used to resolve such ambiguities.

We modified the single-frame HalluciNet architecture to accept multiple frames, as shown in Fig. 4. We processed frame F_j and frame F_{j+k} (k > 0) with our student 2D-CNN. In order to tease out low-level motion cues, we performed an ordered concatenation of the intermediate representations corresponding to frames F_j and F_{j+k}.

Figure 4: Multiframe architecture. Instead of using a single frame to hallucinate, representations of ordered frames (F_j and F_{j+k}, processed with shared weights) are concatenated (⊕), and the concatenated representation is then used for hallucinating. Everything else remains the same as in our single-frame model.

The concatenated student representation in the 2-frame case is:

    φ_s = concat(φ_s^j, φ_s^{j+k})    (5)

where φ_s^l is the student representation computed from frame F_l, as in Eq. (2). This basic approach can be extended to more frames, as well as to multi-scale cases. The hallucination loss remains the same as in the single-frame case (Eq. 3).

In order to see the effect of using multiple frames, we considered the following two cases:

1. Single-frame baseline (HalluciNet(1f)): We set k = 0, which is equivalent to our standard single-frame case;

2. Two-frame variant (HalluciNet(2f)): We set k = 3, to give the student network f_s access to pixel changes in order to tease out low-level motion cues.

We trained the networks for both cases using the exact same procedure and parameters as in the single-frame case, and observed the hallucination loss, L_hallu, on the test set. We experimented with both kinds of actions – those with short-term and those with long-term dynamics.

Results for short-term actions on UCF101 are presented in Table 5a. We saw a reduction in hallucination loss of a little more than 3%, which means that the hallucinated representations were closer to the true spatiotemporal representations. Similarly, there was a slight classification improvement, but with a 67% increase in computation time.

The long-term action results on MTL-AQA are presented in Table 5b. As with the short-term actions, there was an improvement when using two frames. The percentage reduction in L_hallu was larger than in the short-term case, and dive classification improved across all components (except AS, which was saturated).


(a)
Method           L_hallu (×e-03)   Accu     Time/Inf.
HalluciNet(1f)   3.3               68.60    3.54 ms
HalluciNet(2f)   3.2 (↓ 3.08%)     69.55    5.91 ms

(b)
Method           L_hallu (×e-03)   P        AS       RT       SS       TW
HalluciNet(1f)   4.2               89.24    99.72    94.62    86.69    87.54
HalluciNet(2f)   3.9 (↓ 8.05%)     92.35    99.72    96.32    89.90    90.08

Table 5: (a) Single-frame vs. two-frame on UCF-101. (b) Single-frame vs. two-frame on MTL-AQA dive classification. L_hallu: lower is better.

Model        Pretraining w/ Hallucination   P       AS      RT      SS      TW
2D-CNN       No                             90.08   99.43   92.07   83.00   86.69
2D-CNN       Yes                            92.35   99.72   94.33   86.97   88.95
HalluciNet   No                             89.52   99.43   96.32   86.12   88.10
HalluciNet   Yes                            92.92   99.72   95.18   88.39   91.22

Table 6: Utility of the hallucination task in pretraining (accuracies in %).


Discussion: Despite the lower mean hallucination error in the short-term case, the reduction rate was larger for the long-term actions. We believe this is due to the inherent difficulty of the classification task. In UCF-101, action classes are more semantically distinguishable (e.g., archery vs. applying makeup), which makes it easier to hallucinate and reason about the immediate future from a single image. In the MTL-AQA dive classification case, on the other hand, action evolution can be confusing or tricky to predict from a single image. An example is trying to determine the direction of rotation – it is difficult to tell whether it is forward or backward from a snapshot devoid of motion. Moreover, the differences between dives are more subtle. The tasks of counting somersaults and twists need accuracy up to half a rotation. As a result, short-term hallucination is more difficult – it is hard to determine whether a rotation is full or half. While the two-frame HalluciNet can extract some low-level motion cues to resolve ambiguity, the impact is tempered in UCF-101, which has less motion dependence. Consequently, there is comparatively more improvement on MTL-AQA, where motion (e.g., the speed of rotation to distinguish between a full and a half rotation) is more meaningful to the classification task.

4.5. Utility of Hallucination Task in Pretraining

To determine whether the hallucination task positively affects pretraining, we conducted an experiment on the downstream task of dive classification on the MTL-AQA dataset. In Experiment 4.2.1, the backbone network was pretrained on the UCF-101 action classification dataset; however, the hallucination task was not utilized during that pretraining. Table 6 summarizes the results of pretraining with and without hallucination for dive classification. The use of hallucination during pretraining provided a better initialization to both the vanilla 2D-CNN and HalluciNet, which led to improvements in almost every category besides Rotation (RT) for HalluciNet. Additionally, HalluciNet training had the best performance for each dive component, indicating its utility both for pretraining network initialization and for task-specific training.

5. Conclusion

Although 3D-CNNs extract richer spatiotemporal features than the spatial features from 2D-CNNs, this comes at a considerably higher computational cost. We proposed a simple solution to approximate (hallucinate) the spatiotemporal representations computed by a 3D-CNN using a computationally lightweight 2D-CNN and a single frame. Hallucinating spatiotemporal representations, instead of actually computing them, dramatically lowers the computational cost (only 6% of the 3D-CNN time in our experiments), which makes deployment on edge devices feasible. In addition, by using only a single frame, rather than 16, the communication bandwidth requirements are lowered. Besides these practical benefits, we found that the hallucination task, when used in a multitask learning setting, provides a strong supervisory signal, which helps in: 1) actions with short- and long-term dynamics; 2) dynamic scene recognition (a non-action task); and 3) improving pretraining for downstream tasks. We showed the benefit of the hallucination task across various base CNNs. Our hallucination task is a plug-and-play module, and we encourage future work to leverage it for action as well as non-action tasks.


References

[1] Shweta Bhardwaj, Mukundhan Srinivasan, and Mitesh M Khapra. Efficient video classification using fewer frames. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 354–363, 2019.
[2] Hakan Bilen, Basura Fernando, Efstratios Gavves, Andrea Vedaldi, and Stephen Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3034–3042, 2016.
[3] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari, and Cordelia Schmid. MARS: Motion-Augmented RGB Stream for Action Recognition. In CVPR, 2019.
[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[5] Hazel Doughty, Walterio Mayol-Cuevas, and Dima Damen. The pros and cons: Rank-aware temporal attention for skill determination in long videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7862–7871, 2019.
[6] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 6202–6211, 2019.
[7] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. Bags of spacetime energies for dynamic scene recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2681–2688, 2014.
[8] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes. Temporal residual networks for dynamic scene recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4728–4737, 2017.
[9] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72, 2016.
[10] Ruohan Gao, Bo Xiong, and Kristen Grauman. Im2flow: Motion hallucination from static images for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5937–5947, 2018.
[11] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5958–5966, 2018.
[12] Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Khrisna, Shyamal Buch, and Cuong Duc Dao. The activitynet large-scale activity recognition challenge 2018 summary. arXiv preprint arXiv:1808.03766, 2018.
[13] Ramyad Hadidi, Jiashen Cao, Yilun Xie, Bahar Asgari, Tushar Krishna, and Hyesoon Kim. Characterizing the deployment of deep neural networks on commercial edge devices. In IEEE International Symposium on Workload Characterization, 2019.
[14] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6546–6555, 2018.
[15] Yun He, Soma Shirakabe, Yutaka Satoh, and Hirokatsu Kataoka. Human action recognition without human. In European Conference on Computer Vision, pages 11–17. Springer, 2016.
[16] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[17] Judy Hoffman, Saurabh Gupta, and Trevor Darrell. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 826–834, 2016.
[18] Jingwei Ji, Shyamal Buch, Alvaro Soto, and Juan Carlos Niebles. End-to-end joint semantic segmentation of actors and actions in video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 702–717, 2018.
[19] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Joint learning of object and action detectors. In Proceedings of the IEEE International Conference on Computer Vision, pages 4163–4172, 2017.
[20] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
[21] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and Martial Hebert. Activity forecasting. In European Conference on Computer Vision, pages 201–214. Springer, 2012.
[24] Hema S Koppula and Ashutosh Saxena. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):14–29, 2015.
[25] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.
[26] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park, and Nojun Kwak. Motion feature network: Fixed motion filter for action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 387–403, 2018.
[27] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 7083–7093, 2019.
[28] Aiden Nibali, Zhen He, Stuart Morgan, and Daniel Greenwood. Extraction and classification of diving clips from continuous video footage. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 94–104. IEEE, 2017.
[29] Paritosh Parmar and Brendan Tran Morris. What and how well you performed? A multitask learning approach to action quality assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 304–313, 2019.
[30] Paritosh Parmar and Brendan Tran Morris. Learning to score Olympic events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 20–28, 2017.
[31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[32] Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Deja vu. In European Conference on Computer Vision, pages 172–187. Springer, 2014.
[33] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. Assessing the quality of actions. In European Conference on Computer Vision, pages 556–571. Springer, 2014.
[34] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5533–5541, 2017.
[35] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[36] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[37] Jonathan C Stroud, David A Ross, Chen Sun, Jia Deng, and Rahul Sukthankar. D3D: Distilled 3d networks for video action recognition. arXiv preprint arXiv:1812.08249, 2018.
[38] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4597–4605, 2015.
[39] Christian Theriault, Nicolas Thome, and Matthieu Cord. Dynamic scene classification: Learning motion descriptors with slow features analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2603–2610, 2013.
[40] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[41] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
[42] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
[43] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.
[44] Carl Vondrick and Antonio Torralba. Generating the future with adversarial transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1020–1028, 2017.
[45] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pages 835–851. Springer, 2016.
[46] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3302–3309, 2014.
[47] Jacob Walker, Abhinav Gupta, and Martial Hebert. Dense optical flow prediction from a static image. In Proceedings of the IEEE International Conference on Computer Vision, pages 2443–2451, 2015.
[48] Haonan Wang, Jun Lin, and Zhongfeng Wang. Design light-weight 3d convolutional networks for video recognition: temporal residual, fully separable block, and fast algorithm. arXiv preprint arXiv:1905.13388, 2019.
[49] Xiaolong Wang, Ali Farhadi, and Abhinav Gupta. Actions ~ transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2658–2667, 2016.
[50] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
[51] Chenliang Xu and Jason J Corso. Actor-action semantic segmentation with grouping process models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3083–3092, 2016.
[52] Chenliang Xu, Shao-Hang Hsieh, Caiming Xiong, and Jason J Corso. Can humans fly? Action understanding with multiple classes of actors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2264–2273, 2015.
[53] Yan Yan, Chenliang Xu, Dawen Cai, and Jason J Corso. Weakly supervised actor-action segmentation via robust multi-task ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1298–1307, 2017.
[54] Jenny Yuen and Antonio Torralba. A data-driven approach for event prediction. In European Conference on Computer Vision, pages 707–720. Springer, 2010.
[55] Haokui Zhang, Ying Li, Peng Wang, Yu Liu, and Chunhua Shen. RGB-D based action recognition with light-weight 3d convolutional networks. arXiv preprint arXiv:1811.09908, 2018.
[56] Zhaoyang Zhang, Zhanghui Kuang, Ping Luo, Litong Feng, and Wei Zhang. Temporal sequence distillation: Towards few-frame action recognition in videos. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 257–264. ACM, 2018.
[57] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
[58] Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and Wenjun Zeng. MiCT: Mixed 3d/2d convolutional tube for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 449–458, 2018.
