

AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture

Suman Saha, Gurkirt Singh and Fabio Cuzzolin

Oxford Brookes University, Oxford, UK

AI and Vision Group

Motivation

▶ An optimal solution for action detection:

      T* ≐ arg max_{T ⊂ V} score(T),   (1)

  where T is a subset of the input video V of duration D.

▶ Dominant approaches [2, 3] provide sub-optimal, partial frame-level solutions

      R*(t) ≐ arg max_{R ⊂ I(t)} score(R),   (2)

  which are later composed into a solution T̂ = [R*(1), ..., R*(D)] to problem (1) (see the sketch below).

▶ Seeking an optimal solution is useful for an effective spatio-temporal feature representation!
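As a concrete illustration of equation (2), a minimal Python sketch of the frame-level strategy (the variable `frame_proposals` and the function name are hypothetical): the best-scoring region is picked in each frame independently and the winners are concatenated into T̂, with no temporal association learned across frames.

```python
# A greedy, frame-level composition as in Eq. (2): pick the best-scoring
# region in each frame independently, then concatenate the winners into
# T_hat = [R*(1), ..., R*(D)]. `frame_proposals` is a hypothetical list
# with one entry per frame, each a list of (box, score) tuples.

def frame_level_tube(frame_proposals):
    tube = []
    for candidates in frame_proposals:                 # frames t = 1..D
        best_box, _ = max(candidates, key=lambda bs: bs[1])
        tube.append(best_box)                          # no temporal association
    return tube


# Example: two frames, two candidate boxes each.
tube = frame_level_tube([
    [((10, 10, 50, 80), 0.9), ((12, 14, 52, 84), 0.4)],
    [((11, 11, 51, 81), 0.3), ((40, 40, 90, 120), 0.8)],
])
# The per-frame winners may jump between actors; nothing ties them together.
```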

State-of-the-art

▶ Recent action detectors [1, 5, 2, 3] show promising results; however, their frame-level training approach does not allow the network to learn temporal associations!

Contributions

▶ We propose a deep neural network framework which is trained at video level and allows the network to learn temporal associations;

▶ Unlike state-of-the-art approaches, which follow multi-stage training, the proposed model is end-to-end trainable;

▶ To the best of our knowledge, this is the first work in action detection which uses bilinear interpolation instead of RoI max-pooling.

3D region proposal and micro-tubes

[Diagram: (a) 3D proposal-1, 3D proposal-2; (b) action-micro-tubes spanning frame pairs {1,2}, {2,3}, {3,4}, {1,3}, {4,6}.]

Figure: (a) The 3D region proposals generated by our 3D-RPN network span pairs of successive video frames f_t and f_{t+∆} at temporal distance ∆. (b) Ground-truth action-micro-tubes generated from different pairs of successive video frames.

[Diagram panels (a)-(d); legend: predicted micro-tube.]

Figure: (a) The temporal associations learned by our network; (b) our micro-tube linking algorithm requires (T/2 − 1) connections; (c) the T − 1 connections required by [3]'s approach.

Overview of the approach

[Diagram: a pair of video frames passes through two VGG-16 streams (512 × H' × W' feature maps each), followed by feature map fusion; a 3D-RPN with proposal sampling and RPN loss computation (actionness classification loss + 3D proposal regression loss); bilinear-interpolation RoI feature pooling and fusion (512 × 7 × 7); and fc6, fc7, cls and reg layers. Model output: C classification scores and C × 8 coordinates for each 3D proposal, followed by action-tube generation.]

Figure: Model overview: (a) Input: a pair of successive video frames, which is passed through VGG conv layers and a 3D-RPN; (b) the 3D-RPN generates region proposals, actionness scores and micro-tube regression offsets; (c) subsequently, RoI feature pooling is used to obtain fixed-dimension conv features for action classification and tube regression.
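The contribution of using bilinear interpolation instead of RoI max-pooling can be illustrated with a minimal NumPy sketch (function and argument names are illustrative, not the authors' implementation): a fixed 7 × 7 grid of sampling points inside a proposal box is filled by bilinearly weighting the four nearest feature-map cells, so the operation stays differentiable with respect to all of them.

```python
import numpy as np

def bilinear_roi_pool(feature_map, box, out_size=7):
    """Bilinearly sample an out_size x out_size grid of features inside
    `box` = (x1, y1, x2, y2) from `feature_map` (C x H x W).
    Illustrative sketch only, not the authors' implementation."""
    C, H, W = feature_map.shape
    bx1, by1, bx2, by2 = box
    xs = np.linspace(bx1, bx2, out_size)              # sampling-point columns
    ys = np.linspace(by1, by2, out_size)              # sampling-point rows
    pooled = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            x0 = int(np.clip(np.floor(x), 0, W - 1))
            y0 = int(np.clip(np.floor(y), 0, H - 1))
            x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
            dx, dy = x - x0, y - y0
            # Weighted sum of the four neighbouring cells: every cell keeps a
            # non-zero gradient, unlike the hard argmax of RoI max-pooling.
            pooled[:, i, j] = (
                feature_map[:, y0, x0] * (1 - dx) * (1 - dy)
                + feature_map[:, y0, x1] * dx * (1 - dy)
                + feature_map[:, y1, x0] * (1 - dx) * dy
                + feature_map[:, y1, x1] * dx * dy
            )
    return pooled


# Example: pool a 7 x 7 descriptor for one proposal from a 512-channel map.
features = np.random.randn(512, 38, 50).astype(np.float32)
descriptor = bilinear_roi_pool(features, box=(4.3, 2.1, 20.7, 30.5))   # 512 x 7 x 7
```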

3D region proposal network and micro-tube linking

[Diagram: the fused 512 × H' × W' feature map is passed through a conv layer (256 filters, 3×3 kernel, 1×1 stride, 1×1 padding) and a ReLU, followed by two sibling conv layers (1×1 kernel, 1×1 stride, 0×0 padding): one with 2 × k outputs for actionness scores and one with 8 × k outputs for micro-tube regression offsets.]

Figure: 3D-RPN architecture.
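A minimal PyTorch-style sketch of the head shown above (module and attribute names, and the default anchor count k, are assumptions for illustration): a 3×3, 256-channel conv with ReLU over the fused 512 × H' × W' feature map, followed by two sibling 1×1 convs producing 2 × k actionness scores and 8 × k micro-tube regression offsets (two boxes × 4 coordinates per anchor) at every location.

```python
import torch
import torch.nn as nn

class RPN3DHead(nn.Module):
    """Sketch of a 3D-RPN head: conv(256, 3x3) + ReLU, then two sibling
    1x1 convs for actionness scores and micro-tube offsets.
    Illustrative only; k = number of anchors per location (9 is a placeholder)."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)   # actionness: action / no action
        self.reg = nn.Conv2d(256, 8 * k, kernel_size=1)   # 2 boxes x 4 offsets per anchor

    def forward(self, fused_features):        # fused_features: N x 512 x H' x W'
        x = self.relu(self.conv(fused_features))
        return self.cls(x), self.reg(x)       # (N x 2k x H' x W'), (N x 8k x H' x W')
```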

A predicted micro-tube consists of a pair of bounding boxes (see Figure 2), so that m = {b¹, b²}; action-specific paths p_c = {m_t, t ∈ I = {2, 4, ..., T − 2}} are obtained by maximising, via dynamic programming [1],

      E(p_c) = Σ_{t ∈ I} s_c(m_t) + λ_o Σ_{t ∈ I} ψ_o(b²_{m_t}, b¹_{m_{t+2}}),   (3)

where s_c(m_t) is the softmax score of the predicted micro-tube m at time step t, and the overlap potential ψ_o(b²_{m_t}, b¹_{m_{t+2}}) is the IoU between the second detection box b²_{m_t} of micro-tube m_t and the first detection box b¹_{m_{t+2}} of micro-tube m_{t+2}.
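A minimal Viterbi-style sketch of maximising equation (3) for one action class c (the inputs `scores` and `boxes` are hypothetical lists of per-step micro-tube softmax scores and (b¹, b²) box pairs; `lam` plays the role of λ_o):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def link_micro_tubes(scores, boxes, lam=1.0):
    """Viterbi-style maximisation of Eq. (3) for one action class.
    scores[t][i]: softmax score s_c of micro-tube i at step t.
    boxes[t][i]:  (b1, b2) pair of boxes of that micro-tube.
    Returns the chosen micro-tube index at every step."""
    T = len(scores)
    dp = [np.asarray(scores[0], dtype=float)]    # best energy ending in each tube
    back = []
    for t in range(1, T):
        prev = dp[-1]
        cur, ptr = [], []
        for j, (b1_j, _) in enumerate(boxes[t]):
            # Transition: second box of the previous micro-tube vs first box
            # of the current one, weighted by lambda_o.
            trans = [prev[i] + lam * iou(b2_i, b1_j)
                     for i, (_, b2_i) in enumerate(boxes[t - 1])]
            best = int(np.argmax(trans))
            ptr.append(best)
            cur.append(trans[best] + scores[t][j])
        dp.append(np.asarray(cur))
        back.append(ptr)
    path = [int(np.argmax(dp[-1]))]              # backtrack the best path
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Because each micro-tube already spans a pair of frames, the path needs only T/2 − 1 such transitions, as noted in the figure above, instead of the T − 1 required when linking frame-level detections [3].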

Quantitative results

Table: Spatio-temporal action detection performance (video-mAP) comparison with the state-of-the-art on J-HMDB-21.

IoU threshold δ                    0.1     0.2     0.3     0.4     0.5
Gkioxari and Malik [1]              –       –       –       –     53.30
Wang et al. [4]                     –       –       –       –     56.40
Weinzaepfel et al. [5]              –      63.1     –       –     60.70
Saha et al. [3] (Spatial Model)   52.99   52.94   52.57   52.22   51.34
Peng and Schmid [2]                 –      74.3     –       –     73.1
Ours                              57.79   57.76   57.68   56.79   55.31


Table: Spatio-temporal action detection performance (video-mAP) comparison with the state-of-the-art on UCF-101.

IoU threshold δ            0.1     0.2     0.3     0.5    0.75   0.5:0.95
Yu et al. [6]             42.8    26.50   14.6     –       –       –
Weinzaepfel et al. [5]    51.7    46.8    37.8     –       –       –
Peng and Schmid [2]       77.31   72.86   65.70   30.87   1.01    7.11
Saha et al. [3] (*A)      65.45   56.55   48.52    –       –       –
Saha et al. [3] (full)    76.12   66.36   54.93    –       –       –
Ours − ML                 68.85   60.06   49.78    –       –       –
Ours − ML − (*)           70.71   61.36   50.44   32.01   0.40    9.68
Ours − 2PDP − (*)         71.3    63.06   51.57   33.06   0.52   10.72

(*) cross-validated alphas as in [3]; 2PDP: the tube generation algorithm of [3]; ML: our micro-tube linking algorithm.

Take away

▶ The proposed model can learn temporal features for action detection from RGB images alone;

▶ It scales easily to long videos simply by increasing ∆, and is thus a potential solution for real-time detection;

▶ To boost detection performance, an optical flow stream can be incorporated as another parallel stream, and

▶ feature fusion of RGB and flow can be done at training time.

Qualitative results

Figure: Qualitative detection results.

References

[1] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2015.

[2] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In ECCV 2016 - European Conference on Computer Vision, Amsterdam, Netherlands, Oct. 2016.

[3] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference, 2016.

[4] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2016.

[5] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2015.

[6] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.

https://sahasuman.bitbucket.io/ {suman.saha-2014, gurkirt.singh-2015, fabio.cuzzolin}@brookes.ac.uk