AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture
Suman Saha, Gurkirt Singh and Fabio Cuzzolin
Oxford Brookes University, Oxford, UK
AI and Vision Group
Motivation
▶ An optimal solution for action detection:

  $T^* \doteq \arg\max_{T \subset V} \mathrm{score}(T)$,  (1)

  where $T$ is a subset of the input video $V$ of duration $D$.
▶ Dominant approaches [2, 3] provide sub-optimal, partial frame-level solutions

  $R^*(t) \doteq \arg\max_{R \subset I(t)} \mathrm{score}(R)$,  (2)

  later composed into a solution $\hat{T} = [R^*(1), \ldots, R^*(D)]$ to problem (1).
▶ Seeking an optimal solution is useful for learning effective spatio-temporal feature representations!
State of the art

▶ Recent action detectors [1, 5, 2, 3] show promising results; however, their frame-level training approach does not allow the network to learn temporal associations!
Contributions
▶ We propose a deep neural network framework that trains at the video level, allowing the network to learn temporal associations;
▶ Unlike state-of-the-art methods, which follow multi-stage training, the proposed model is end-to-end trainable;
▶ To the best of our knowledge, this is the first work in action detection to use bilinear interpolation instead of RoI max-pooling (see the sketch below).
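To illustrate that pooling choice, here is a minimal NumPy sketch of RoI pooling by bilinear interpolation. The function names and the 7 × 7 output size are ours (the size matches the 512 × 7 × 7 features in the model overview below), not taken from the paper's code.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a feature map feat of shape (C, H, W)
    at a fractional location (y, x)."""
    _, H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0] +
            (1 - wy) * wx       * feat[:, y0, x1] +
            wy       * (1 - wx) * feat[:, y1, x0] +
            wy       * wx       * feat[:, y1, x1])

def roi_bilinear_pool(feat, box, out_size=7):
    """Pool an RoI box = (x1, y1, x2, y2), in feature-map coordinates,
    into a fixed (C, out_size, out_size) grid. Every output cell is a
    differentiable bilinear sample, unlike hard RoI max-pooling."""
    x1, y1, x2, y2 = box
    ys = np.linspace(y1, y2, out_size)   # fractional row coordinates
    xs = np.linspace(x1, x2, out_size)   # fractional column coordinates
    rows = [np.stack([bilinear_sample(feat, y, x) for x in xs], axis=-1)
            for y in ys]
    return np.stack(rows, axis=1)        # (C, out_size, out_size)

# Example: pool a 512-channel feature map into 512 x 7 x 7
feat = np.random.rand(512, 38, 50)
pooled = roi_bilinear_pool(feat, (3.2, 4.7, 20.5, 30.1))
assert pooled.shape == (512, 7, 7)
```

Because each sample is a smooth function of the box coordinates, gradients can flow back into the box regression, which is what makes end-to-end training of the micro-tube regressor possible.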
3D region proposal and micro-tubes
[Diagram: 3D proposal-1 and 3D proposal-2; action-micro-tubes over frame pairs {1,2}, {2,3}, {3,4}, {1,3}, {4,6}.]
Figure: (a) The 3D region proposals generated by our 3D-RPN network span pairs of successive video frames $f_t$ and $f_{t+\Delta}$ at temporal distance $\Delta$. (b) Ground-truth action-micro-tubes generated from different pairs of successive video frames.
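To make the frame pairing in the caption concrete, here is a minimal sketch (our own illustration; the function name and data layout are assumptions, not the authors' code) of building ground-truth micro-tubes from the per-frame boxes of one action track:

```python
def make_gt_microtubes(track_boxes, delta=1):
    """Pair the ground-truth boxes of one action track into micro-tubes.
    track_boxes: dict mapping frame index -> (x1, y1, x2, y2).
    Returns one ((t, t + delta), box_t, box_{t+delta}) tuple per frame
    pair {t, t + delta} present in the track."""
    tubes = []
    for t in sorted(track_boxes):
        if t + delta in track_boxes:
            tubes.append(((t, t + delta),
                          track_boxes[t], track_boxes[t + delta]))
    return tubes

# Example: the pairs {1,2}, {2,3}, {3,4} from the figure (delta = 1)
track = {1: (10, 10, 50, 90), 2: (12, 11, 52, 91),
         3: (14, 12, 54, 92), 4: (16, 13, 56, 93)}
print([pair for pair, _, _ in make_gt_microtubes(track)])
# [(1, 2), (2, 3), (3, 4)]
```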
Figure: (a) The temporal associations learned by our network (predicted micro-tubes); (b) our micro-tube linking algorithm requires $(T/2 - 1)$ connections; (c) the $T - 1$ connections required by [3]'s approach.
Overview of the approach
[Diagram: a pair of video frames → two VGG16 streams (512 x H' x W' each) → feature map fusion → 3D-RPN with proposal sampling & RPN loss computation (actionness classification loss and 3D proposal regression loss) → ROI feature pooling via bilinear interpolation and fusion (512 x 7 x 7) → fc6, fc7, cls and reg layers (cls & reg losses) → model output: C classification scores and C x 8 coordinates per 3D proposal → action-tube generation.]
Figure: Model overview. (a) Input: a pair of successive video frames, passed through VGG conv layers and a 3D-RPN; (b) the 3D-RPN generates region proposals, actionness scores and micro-tube regression offsets; (c) subsequently, ROI feature pooling produces fixed-dimension conv features for action classification and tube regression.
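The following PyTorch-style sketch mirrors the trunk of this diagram. It is our own illustration under stated assumptions: the two VGG16 streams are taken to share weights, element-wise sum stands in for the unspecified fusion operator, and a recent torchvision API is assumed.

```python
import torch
import torch.nn as nn
import torchvision

class AMTNetTrunkSketch(nn.Module):
    """Two-frame trunk of the pipeline figure: a shared VGG-16 conv
    stack applied to both frames, followed by feature-map fusion.
    The 3D-RPN, RoI pooling and fc/cls/reg heads would consume the
    fused map."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.trunk = vgg.features[:-1]   # conv layers only -> 512 channels

    def forward(self, frame_t, frame_t_plus_delta):
        f1 = self.trunk(frame_t)              # 512 x H' x W'
        f2 = self.trunk(frame_t_plus_delta)   # 512 x H' x W'
        return f1 + f2                        # assumed fusion: element-wise sum

fused = AMTNetTrunkSketch()(torch.randn(1, 3, 320, 400),
                            torch.randn(1, 3, 320, 400))
print(fused.shape)   # torch.Size([1, 512, 20, 25])
```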
3D region proposal network and micro-tube linking
[Diagram: fused 512 x H' x W' feature map → conv layer (256 channels, 3x3 kernel, 1x1 stride, 1x1 padding) → ReLU → two parallel 1x1 conv layers: an (8 x k)-channel micro-tube regression output and a (2 x k)-channel actionness classification output, each H' x W'.]
Figure: 3D-RPN architecture.
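Reading the diagram literally, the 3D-RPN heads can be sketched as follows (our own PyTorch illustration; k, the number of anchors per location, is assumed to be 9 as in standard Faster R-CNN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPN3DHead(nn.Module):
    """Sketch of the 3D-RPN heads: a shared 3x3 conv (256 channels)
    with ReLU, then two parallel 1x1 convs giving, per anchor,
    2 actionness scores and 8 regression offsets (4 per box of the
    micro-tube)."""
    def __init__(self, k=9, in_channels=512):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 256,
                                kernel_size=3, stride=1, padding=1)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # actionness scores
        self.reg = nn.Conv2d(256, 8 * k, kernel_size=1)  # micro-tube offsets

    def forward(self, fused):              # fused: (N, 512, H', W')
        h = F.relu(self.shared(fused))
        return self.cls(h), self.reg(h)    # (N, 2k, H', W'), (N, 8k, H', W')

scores, offsets = RPN3DHead()(torch.randn(1, 512, 20, 25))
print(scores.shape, offsets.shape)  # (1, 18, 20, 25) (1, 72, 20, 25)
```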
A predicted micro-tube consists of a pair of bounding boxes (see Figure 2), so that $m = \{b^1, b^2\}$; action-specific paths $p_c = \{m_t,\ t \in I = \{2, 4, \ldots, T-2\}\}$ are obtained by maximising, via dynamic programming [1],

$E(p_c) = \sum_{t \in I} s_c(m_t) + \lambda_o \sum_{t \in I} \psi_o\!\left(b^2_{m_t},\, b^1_{m_{t+2}}\right)$,  (3)

where $s_c(m_t)$ is the softmax score of the predicted micro-tube $m_t$ at time step $t$, and the overlap potential $\psi_o(b^2_{m_t}, b^1_{m_{t+2}})$ is the IoU between the second detection box $b^2_{m_t}$ of micro-tube $m_t$ and the first detection box $b^1_{m_{t+2}}$ of micro-tube $m_{t+2}$.
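A minimal NumPy sketch of the dynamic-programming maximisation of Eq. (3), following the Viterbi-style formulation of [1]; candidate generation, the per-class loop, and the value of $\lambda_o$ are assumptions:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def link_microtubes(steps, lam=0.5):
    """Viterbi maximisation of Eq. (3) for one action class.
    steps[t] is a list of candidate micro-tubes (score, b1, b2); the
    transition from step t to t+1 scores lam * IoU(current b2, next b1).
    Returns the index of the chosen candidate at each step."""
    T = len(steps)
    accs = [None] * T                # best score-to-go per candidate
    best = [None] * T                # argmax successor pointers
    accs[-1] = np.array([c[0] for c in steps[-1]], dtype=float)
    for t in range(T - 2, -1, -1):   # backward pass
        cur = np.empty(len(steps[t]))
        ptr = np.empty(len(steps[t]), dtype=int)
        for i, (s, b1, b2) in enumerate(steps[t]):
            trans = [lam * iou(b2, nb1) + accs[t + 1][j]
                     for j, (_, nb1, _) in enumerate(steps[t + 1])]
            ptr[i] = int(np.argmax(trans))
            cur[i] = s + max(trans)
        accs[t], best[t] = cur, ptr
    path = [int(np.argmax(accs[0]))]  # forward read-out of the optimum
    for t in range(T - 1):
        path.append(int(best[t][path[-1]]))
    return path
```

Because each step already covers two frames, only $(T/2 - 1)$ transitions are scored, versus the $T - 1$ frame-to-frame connections required by [3].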
Quantitative results
Table: Spatio-temporal action detection performance (video-mAP) comparison with the state-of-the-art on J-HMDB-21.
IoU threshold δ                   0.1    0.2    0.3    0.4    0.5
Gkioxari and Malik [1]             –      –      –      –    53.30
Wang et al. [4]                    –      –      –      –    56.40
Weinzaepfel et al. [5]             –    63.1     –      –    60.70
Saha et al. [3] (Spatial Model)  52.99  52.94  52.57  52.22  51.34
Peng and Schmid [2]                –    74.3     –      –    73.1
Ours                             57.79  57.76  57.68  56.79  55.31
Table: Spatio-temporal action detection performance (video-mAP) comparison with the state-of-the-art on UCF-101.
IoU threshold δ           0.1    0.2    0.3    0.5    0.75  0.5:0.95
Yu et al. [6]            42.8   26.50  14.6    –      –      –
Weinzaepfel et al. [5]   51.7   46.8   37.8    –      –      –
Peng and Schmid [2]      77.31  72.86  65.70  30.87   1.01   7.11
Saha et al. [3] (*A)     65.45  56.55  48.52   –      –      –
Saha et al. [3] (full)   76.12  66.36  54.93   –      –      –
Ours−ML                  68.85  60.06  49.78   –      –      –
Ours−ML−(∗)              70.71  61.36  50.44  32.01   0.4    9.68
Ours−2PDP−(∗)            71.3   63.06  51.57  33.06   0.52  10.72

(∗) cross-validated alphas as in [3]; 2PDP: tube generation algorithm of [3]; ML: our micro-tube linking algorithm.
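For reference, a sketch of the spatio-temporal tube IoU underlying the video-mAP numbers above, using one common convention (temporal IoU of the frame spans times the mean per-frame spatial IoU over shared frames); the benchmarks' exact protocol may differ:

```python
def box_iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tube_iou(tube_a, tube_b):
    """Spatio-temporal IoU of two tubes (dict: frame index -> box):
    temporal IoU of the frame spans times the mean per-frame spatial
    IoU over the shared frames. A detection counts as correct at
    threshold δ when tube_iou >= δ and the class matches."""
    shared = set(tube_a) & set(tube_b)
    if not shared:
        return 0.0
    t_iou = len(shared) / len(set(tube_a) | set(tube_b))
    s_iou = sum(box_iou(tube_a[f], tube_b[f]) for f in shared) / len(shared)
    return t_iou * s_iou
```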
Take away
▶ The proposed model can learn temporal features for action detection from RGB images alone;
▶ It scales easily to long videos simply by increasing ∆, making it a potential solution for real-time detection;
▶ To boost detection performance, an optical flow stream can be incorporated as a parallel stream, and
▶ feature fusion of RGB and flow can be performed at training time.
Qualitative results
Figure: Qualitative detection results.
References
[1] G. Gkioxari and J. Malik. Finding action tubes. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2015.
[2] X. Peng and C. Schmid. Multi-region two-stream R-CNN for action detection. In European Conference on Computer Vision, Amsterdam, Netherlands, Oct. 2016.
[3] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. In British Machine Vision Conference, 2016.
[4] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, 2016.
[5] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2015.
[6] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.
https://sahasuman.bitbucket.io/ {suman.saha-2014, gurkirt.singh-2015, fabio.cuzzolin}@brookes.ac.uk