AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture
Suman Saha, Gurkirt Singh and Fabio Cuzzolin
Oxford Brookes University, Oxford, UK
AI and Vision Group
Motivation
▶ An optimal solution for action detection:

  $T^* \doteq \arg\max_{T \subset V} \mathrm{score}(T)$,  (1)

  where $T$ is a subset of the input video $V$ of duration $D$.
▶ Dominant approaches [2, 3] provide sub-optimal, partial frame-level solutions

  $R^*(t) \doteq \arg\max_{R \subset I(t)} \mathrm{score}(R)$,  (2)

  later composed into a solution $\hat{T} = [R^*(1), \ldots, R^*(D)]$ to problem (1).
▶ Seeking an optimal solution is useful for learning effective spatio-temporal feature representations!
State of the art

▶ Recent action detectors [1, 5, 2, 3] show promising results; however, their frame-level training approach does not allow the network to learn temporal associations!
Contributions
▶ We propose a deep neural network framework that trains at the video level, allowing the network to learn temporal associations;
▶ Unlike state-of-the-art methods, which follow multi-stage training, the proposed model is end-to-end trainable;
▶ To the best of our knowledge, this is the first work in action detection to use bilinear interpolation instead of RoI max-pooling (see the sketch below).
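To illustrate that pooling choice, here is a minimal NumPy sketch of RoI pooling by bilinear interpolation. The function names and the 7 × 7 output size are ours (the size matches the 512 × 7 × 7 features in the model overview below), not taken from the paper's code.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a feature map feat of shape (C, H, W)
    at a fractional location (y, x)."""
    _, H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0] +
            (1 - wy) * wx       * feat[:, y0, x1] +
            wy       * (1 - wx) * feat[:, y1, x0] +
            wy       * wx       * feat[:, y1, x1])

def roi_bilinear_pool(feat, box, out_size=7):
    """Pool an RoI box = (x1, y1, x2, y2), in feature-map coordinates,
    into a fixed (C, out_size, out_size) grid. Every output cell is a
    differentiable bilinear sample, unlike hard RoI max-pooling."""
    x1, y1, x2, y2 = box
    ys = np.linspace(y1, y2, out_size)   # fractional row coordinates
    xs = np.linspace(x1, x2, out_size)   # fractional column coordinates
    rows = [np.stack([bilinear_sample(feat, y, x) for x in xs], axis=-1)
            for y in ys]
    return np.stack(rows, axis=1)        # (C, out_size, out_size)

# Example: pool a 512-channel feature map into 512 x 7 x 7
feat = np.random.rand(512, 38, 50)
pooled = roi_bilinear_pool(feat, (3.2, 4.7, 20.5, 30.1))
assert pooled.shape == (512, 7, 7)
```

Because each sample is a smooth function of the box coordinates, gradients can flow back into the box regression, which is what makes end-to-end training of the micro-tube regressor possible.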
3D region proposal and micro-tubes
[Diagram: 3D proposal-1 and 3D proposal-2; action-micro-tubes over frame pairs {1,2}, {2,3}, {3,4}, {1,3}, {4,6}.]
Figure: (a) The 3D region proposals generated by our 3D-RPN network span pairs of successive video frames $f_t$ and $f_{t+\Delta}$ at temporal distance $\Delta$. (b) Ground-truth action-micro-tubes generated from different pairs of successive video frames.
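To make the frame pairing in the caption concrete, here is a minimal sketch (our own illustration; the function name and data layout are assumptions, not the authors' code) of building ground-truth micro-tubes from the per-frame boxes of one action track:

```python
def make_gt_microtubes(track_boxes, delta=1):
    """Pair the ground-truth boxes of one action track into micro-tubes.
    track_boxes: dict mapping frame index -> (x1, y1, x2, y2).
    Returns one ((t, t + delta), box_t, box_{t+delta}) tuple per frame
    pair {t, t + delta} present in the track."""
    tubes = []
    for t in sorted(track_boxes):
        if t + delta in track_boxes:
            tubes.append(((t, t + delta),
                          track_boxes[t], track_boxes[t + delta]))
    return tubes

# Example: the pairs {1,2}, {2,3}, {3,4} from the figure (delta = 1)
track = {1: (10, 10, 50, 90), 2: (12, 11, 52, 91),
         3: (14, 12, 54, 92), 4: (16, 13, 56, 93)}
print([pair for pair, _, _ in make_gt_microtubes(track)])
# [(1, 2), (2, 3), (3, 4)]
```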
Figure: (a) The temporal associations learned by our network (predicted micro-tubes); (b) our micro-tube linking algorithm requires $(T/2 - 1)$ connections; (c) the $T - 1$ connections required by [3]'s approach.
Overview of the approach
[Diagram: a pair of video frames → two VGG16 streams (512 x H' x W' each) → feature map fusion → 3D-RPN with proposal sampling & RPN loss computation (actionness classification loss and 3D proposal regression loss) → ROI feature pooling via bilinear interpolation and fusion (512 x 7 x 7) → fc6, fc7, cls and reg layers (cls & reg losses) → model output: C classification scores and C x 8 coordinates per 3D proposal → action-tube generation.]
Figure: Model overview. (a) Input: a pair of successive video frames, passed through VGG conv layers and a 3D-RPN; (b) the 3D-RPN generates region proposals, actionness scores and micro-tube regression offsets; (c) subsequently, ROI feature pooling produces fixed-dimension conv features for action classification and tube regression.
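The following PyTorch-style sketch mirrors the trunk of this diagram. It is our own illustration under stated assumptions: the two VGG16 streams are taken to share weights, element-wise sum stands in for the unspecified fusion operator, and a recent torchvision API is assumed.

```python
import torch
import torch.nn as nn
import torchvision

class AMTNetTrunkSketch(nn.Module):
    """Two-frame trunk of the pipeline figure: a shared VGG-16 conv
    stack applied to both frames, followed by feature-map fusion.
    The 3D-RPN, RoI pooling and fc/cls/reg heads would consume the
    fused map."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.trunk = vgg.features[:-1]   # conv layers only -> 512 channels

    def forward(self, frame_t, frame_t_plus_delta):
        f1 = self.trunk(frame_t)              # 512 x H' x W'
        f2 = self.trunk(frame_t_plus_delta)   # 512 x H' x W'
        return f1 + f2                        # assumed fusion: element-wise sum

fused = AMTNetTrunkSketch()(torch.randn(1, 3, 320, 400),
                            torch.randn(1, 3, 320, 400))
print(fused.shape)   # torch.Size([1, 512, 20, 25])
```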
3D region proposal network and micro-tube linking
[Diagram: fused 512 x H' x W' feature map → conv layer (256 channels, 3x3 kernel, 1x1 stride, 1x1 padding) → ReLU → two parallel 1x1 conv layers: an (8 x k)-channel micro-tube regression output and a (2 x k)-channel actionness classification output, each H' x W'.]
Figure: 3D-RPN architecture.
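Reading the diagram literally, the 3D-RPN heads can be sketched as follows (our own PyTorch illustration; k, the number of anchors per location, is assumed to be 9 as in standard Faster R-CNN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPN3DHead(nn.Module):
    """Sketch of the 3D-RPN heads: a shared 3x3 conv (256 channels)
    with ReLU, then two parallel 1x1 convs giving, per anchor,
    2 actionness scores and 8 regression offsets (4 per box of the
    micro-tube)."""
    def __init__(self, k=9, in_channels=512):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256,
                                kernel_size=3, stride=1, padding=1)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # actionness scores
        self.reg = nn.Conv2d(256, 8 * k, kernel_size=1)  # micro-tube offsets

    def forward(self, fused):              # fused: (N, 512, H', W')
        h = F.relu(self.shared(fused))
        return self.cls(h), self.reg(h)    # (N, 2k, H', W'), (N, 8k, H', W')

scores, offsets = RPN3DHead()(torch.randn(1, 512, 20, 25))
print(scores.shape, offsets.shape)  # (1, 18, 20, 25) (1, 72, 20, 25)
```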
A predicted micro-tube consists of a pair of bounding boxes (see Figure 2), so that $m = \{b^1, b^2\}$; action-specific paths $p_c = \{m_t,\ t \in I = \{2, 4, \ldots, T-2\}\}$ are obtained by maximising, via dynamic programming [1],

$E(p_c) = \sum_{t \in I} s_c(m_t) + \lambda_o \sum_{t \in I} \psi_o\!\left(b^2_{m_t},\, b^1_{m_{t+2}}\right)$,  (3)

where $s_c(m_t)$ is the softmax score of the predicted micro-tube $m_t$ at time step $t$, and the overlap potential $\psi_o(b^2_{m_t}, b^1_{m_{t+2}})$ is the IoU between the second detection box $b^2_{m_t}$ of micro-tube $m_t$ and the first detection box $b^1_{m_{t+2}}$ of micro-tube $m_{t+2}$.
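A minimal NumPy sketch of the dynamic-programming maximisation of Eq. (3), following the Viterbi-style formulation of [1]; candidate generation, the per-class loop, and the value of $\lambda_o$ are assumptions:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def link_microtubes(steps, lam=0.5):
    """Viterbi maximisation of Eq. (3) for one action class.
    steps[t] is a list of candidate micro-tubes (score, b1, b2); the
    transition from step t to t+1 scores lam * IoU(current b2, next b1).
    Returns the index of the chosen candidate at each step."""
    T = len(steps)
    accs = [None] * T                # best score-to-go per candidate
    best = [None] * T                # argmax successor pointers
    accs[-1] = np.array([c[0] for c in steps[-1]], dtype=float)
    for t in range(T - 2, -1, -1):   # backward pass
        cur = np.empty(len(steps[t]))
        ptr = np.empty(len(steps[t]), dtype=int)
        for i, (s, b1, b2) in enumerate(steps[t]):
            trans = [lam * iou(b2, nb1) + accs[t + 1][j]
                     for j, (_, nb1, _) in enumerate(steps[t + 1])]
            ptr[i] = int(np.argmax(trans))
            cur[i] = s + max(trans)
        accs[t], best[t] = cur, ptr
    path = [int(np.argmax(accs[0]))]  # forward read-out of the optimum
    for t in range(T - 1):
        path.append(int(best[t][path[-1]]))
    return path
```

Because each step already covers two frames, only $(T/2 - 1)$ transitions are scored, versus the $T - 1$ frame-to-frame connections required by [3].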
Quantitative results
Table: Spatio-temporal action detection performance (video-mAP) comparison with the state-of-the-art on J-HMDB-21.
IoU threshold δ                   0.1    0.2    0.3    0.4    0.5
Gkioxari and Malik [1]             –      –      –      –    53.30
Wang et al. [4]                    –      –      –      –    56.40
Weinzaepfel et al. [5]             –    63.1     –      –    60.70
Saha et al. [3] (Spatial Model)  52.99  52.94  52.57  52.22  51.34
Peng and Schmid [2]                –    74.3     –      –    73.1
Ours                             57.79  57.76  57.68  56.79  55.31
Table: Spatio-temporal action detection performance (video-mAP) comparison with the state-of-the-art on UCF-101.
IoU threshold δ           0.1    0.2    0.3    0.5    0.75  0.5:0.95
Yu et al. [6]            42.8   26.50  14.6    –      –      –
Weinzaepfel et al. [5]   51.7   46.8   37.8    –      –      –
Peng and Schmid [2]      77.31  72.86  65.70  30.87   1.01   7.11
Saha et al. [3] (*A)     65.45  56.55  48.52   –      –      –
Saha et al. [3] (full)   76.12  66.36  54.93   –      –      –
Ours−ML                  68.85  60.06  49.78   –      –      –
Ours−ML−(∗)              70.71  61.36  50.44  32.01   0.4    9.68
Ours−2PDP−(∗)            71.3   63.06  51.57  33.06   0.52  10.72

(∗) cross-validated alphas as in [3]; 2PDP: tube generation algorithm of [3]; ML: our micro-tube linking algorithm.
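For reference, a sketch of the spatio-temporal tube IoU underlying the video-mAP numbers above, using one common convention (temporal IoU of the frame spans times the mean per-frame spatial IoU over shared frames); the benchmarks' exact protocol may differ:

```python
def box_iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU of two tubes (dict: frame index -> box):
    temporal IoU of the frame spans times the mean per-frame spatial
    IoU over the shared frames. A detection counts as correct at
    threshold δ when tube_iou >= δ and the class matches."""
    shared = set(tube_a) & set(tube_b)
    if not shared:
        return 0.0
    t_iou = len(shared) / len(set(tube_a) | set(tube_b))
    s_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in shared) / len(shared)
    return t_iou * s_iou
```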
Take away
▶ The proposed model can learn temporal features for action detection from RGB images alone;
▶ It scales easily to long videos simply by increasing ∆, making it a potential solution for real-time detection;
▶ To boost detection performance, an optical flow stream can be incorporated as a parallel stream, and
▶ feature fusion of RGB and flow can be performed at training time.
Qualitative results
Figure: Qualitative detection results.
References
[1] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2015.
[2] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision, Amsterdam, Netherlands, Oct. 2016.
[3] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference, 2016.
[4] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2016.
[5] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2015.
[6] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.
https://sahasuman.bitbucket.io/ {suman.saha-2014, gurkirt.singh-2015, fabio.cuzzolin}@brookes.ac.uk