![Page 1: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/1.jpg)
Action Recognition Overview
Vadim Andronov
Internet of Things Group
![Page 2: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/2.jpg)
Task definition
● Action Recognition
○ Predict the action at the current time (over some time interval)
○ Video classification - predict the action for the whole video
○ Simplest model - image classifier
● Action Detection
○ Consists of two tasks:
■ Action detection
■ Classification of detected actions
○ Simplest model - object detection
![Page 3: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/3.jpg)
Action Examples (Kinetics Dataset)
![Page 4: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/4.jpg)
Datasets
● Kinetics-400: 400 classes of actions, ~300k videos from YouTube
● UCF-101: 101 classes of human actions, 13k clips from YouTube
● HMDB-51: 51 actions, 7k clips from movies
● Sports 1M: 487 sport actions, ~1M clips
● Jester: 27 human gestures
![Page 5: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/5.jpg)
Why not just an image classifier?
What is the correct action for this image?
![Page 6: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/6.jpg)
Action Recognition Approaches
● Two-stream models
○ Two-stream convolutional networks for action recognition in videos, 2014
○ Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (TSN), 2016
● 3D models
○ Learning Spatiotemporal Features with 3D Convolutional Networks (C3D), 2014
○ Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D), 2017
○ Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet (R3D), 2017
○ A Closer Look at Spatiotemporal Convolutions for Action Recognition (R2+1D), 2017
○ Non-local Neural Networks, 2017
● Sequence modeling approaches
○ Action Recognition using Visual Attention, 2015
○ Lightweight Network Architecture for Real-Time Action Recognition (VTN), ours - 2018
![Page 7: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/7.jpg)
Two-stream approach
● One of the first successful deep learning (DL) approaches to the action recognition (AR) problem: Two-stream
convolutional networks for action recognition in videos, 2014
● Fusion of two AlexNet-based CNNs that work on different modalities: RGB
frames and optical flow (OF)
● Techniques to prevent overfitting:
○ Multi-task learning - two datasets
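The fusion of the two streams can be illustrated with a minimal late-fusion sketch. It assumes each stream already produced per-class logits and simply averages their softmax scores; the `flow_weight` parameter and the example logits are hypothetical and not taken from the slides.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(rgb_logits, flow_logits, flow_weight=0.5):
    """Late fusion: weighted average of the per-class softmax scores of both streams.

    flow_weight is a hypothetical knob for this sketch, not a value from the slides.
    """
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [(1.0 - flow_weight) * r + flow_weight * f
            for r, f in zip(rgb_probs, flow_probs)]


# Made-up logits for three classes from the RGB stream and the optical flow stream.
fused = fuse_two_streams([2.0, 1.0, 0.1], [0.5, 2.5, 0.2])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Averaging scores is only one possible fusion rule; a learned classifier on top of the stacked scores is another option.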
![Page 8: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/8.jpg)
Optical Flow
[Figure: sparse vs. dense optical flow]
![Page 9: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/9.jpg)
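Temporal Segment Networks: Towards Good
Practices for Deep Action Recognition (TSN), 2016
● Given a video $V$, it divides it evenly into $K$ segments $\{S_1, S_2, \dots, S_K\}$
● Then $\mathrm{TSN}(T_1, T_2, \dots, T_K) = H(G(F(T_1; W), F(T_2; W), \dots, F(T_K; W)))$
● $T_k$ is a random snippet of $S_k$; $F(T_k; W)$ is the CNN function (BN-Inception) with parameters $W$
● $G$ is some weighting function (even averaging); $H$ is Softmax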
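The segment sampling and the averaging step can be sketched roughly as follows. This is an illustration under simplifying assumptions: a snippet is reduced to a single frame index, and the function names are hypothetical rather than taken from the TSN code.

```python
import random


def sample_snippet_indices(num_frames, num_segments, rng=random):
    """Split [0, num_frames) into even segments and pick one random frame index per segment."""
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    indices = []
    for k in range(num_segments):
        start = k * num_frames // num_segments
        end = (k + 1) * num_frames // num_segments  # exclusive bound of segment k
        indices.append(rng.randrange(start, end))
    return indices


def average_consensus(snippet_scores):
    """Weighting function as even averaging: mean of the per-snippet class scores."""
    num_snippets = len(snippet_scores)
    return [sum(column) / num_snippets for column in zip(*snippet_scores)]


# A 90-frame video split into 3 segments; the seed only makes the sketch repeatable.
indices = sample_snippet_indices(num_frames=90, num_segments=3, rng=random.Random(0))
# Made-up class scores of the CNN for the three sampled snippets (two classes).
consensus = average_consensus([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])
```

Softmax would then be applied to `consensus` to obtain the video-level prediction.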
![Page 10: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/10.jpg)
3D CNN models
● Introduce higher-dimensional primitives operating on 5D tensors
(BxCxTxHxW):
○ 3D convolution
○ 3D pooling
● Consider spatial and temporal information at the same time
● Problems:
○ Higher computational complexity
○ Hard to train
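To make the 5D layout and the extra cost concrete, here is a small sketch that computes the output shape of a 3D convolution and compares weight counts with a 2D kernel. The clip size and channel counts are hypothetical examples; the size formula is the standard one for convolutions without dilation.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_output_shape(input_shape, out_channels, kernel,
                        stride=(1, 1, 1), padding=(0, 0, 0)):
    """Shape of a 3D convolution output for a (B, C, T, H, W) input."""
    batch, _, *dims = input_shape
    out_dims = [conv_out_size(s, k, st, p)
                for s, k, st, p in zip(dims, kernel, stride, padding)]
    return (batch, out_channels, *out_dims)


def conv_weight_count(in_channels, out_channels, kernel):
    """Number of weights (bias ignored) for a kernel of any dimensionality."""
    count = in_channels * out_channels
    for k in kernel:
        count *= k
    return count


clip = (8, 3, 16, 112, 112)  # hypothetical batch: 8 RGB clips, 16 frames, 112x112
out_shape = conv3d_output_shape(clip, out_channels=64, kernel=(3, 3, 3), padding=(1, 1, 1))
weights_2d = conv_weight_count(3, 64, (3, 3))
weights_3d = conv_weight_count(3, 64, (3, 3, 3))
```

With a 3x3x3 kernel the layer has three times the weights of its 3x3 counterpart, and it is applied at every temporal position as well.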
![Page 11: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/11.jpg)
3D Convolution
[Figure: 2D convolution vs. 3D convolution]
![Page 12: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/12.jpg)
Learning Spatiotemporal Features with 3D
Convolutional Networks (C3D), 2014
● AlexNet-like architecture with 3D primitives
● One of the first works to apply 3D nets to the AR problem
![Page 13: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/13.jpg)
Quo Vadis, Action Recognition? A New Model and
the Kinetics Dataset (I3D)
● “Inflated” Inception V1 architecture - 3D CNN (224x224)
● Showed that transfer learning from ImageNet works (2D weights inflated to 3D)
● Introduced Kinetics dataset
● Mixed approaches (two-stream and 3D)
● Saturated UCF-101
![Page 14: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/14.jpg)
Can Spatiotemporal 3D CNNs Retrace the History of
2D CNNs and ImageNet (R3D)
● Adapts residual architectures to 3D (112x112)
● Uses Kinetics as a main benchmark
![Page 15: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/15.jpg)
A Closer Look at Spatiotemporal Convolutions for
Action Recognition (R2+1D)
● Compares mixed architectures
● Decomposes the 3D convolutional kernel into a 2D spatial and a 1D temporal convolution
○ Fewer weights
○ Easier to train
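The effect of the decomposition on the weight count can be checked with a few lines of arithmetic. This is a sketch, not the paper's configuration: the number of intermediate channels `c_mid` is a free choice here, and how many weights are saved depends entirely on it.

```python
def full_3d_weights(c_in, c_out, t, d):
    """Weights of one full t x d x d 3D convolution (bias ignored)."""
    return t * d * d * c_in * c_out


def factorized_weights(c_in, c_out, t, d, c_mid):
    """Weights of a 1 x d x d spatial conv into c_mid channels, then a t x 1 x 1 temporal conv."""
    spatial = d * d * c_in * c_mid
    temporal = t * c_mid * c_out
    return spatial + temporal


# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, 64 intermediate channels.
full = full_3d_weights(64, 64, t=3, d=3)
split = factorized_weights(64, 64, t=3, d=3, c_mid=64)
```

The factorized variant also places an extra nonlinearity between the spatial and the temporal convolution, which is one plausible reason for the easier optimization.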
![Page 16: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/16.jpg)
Non-local Neural Networks
● Introduces a special non-local block
● Uses a self-attention mechanism to re-weight features
● Can be used in any CNN
● Increases computational complexity
![Page 17: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/17.jpg)
Sequence modeling methods
● Model sequential connections explicitly via a special NN architecture (e.g.
recurrent, 1D-CNN, etc.)
RNN Cell
![Page 18: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/18.jpg)
LSTM Cell
In practice, a more complicated cell is used to allow training on longer sequences
![Page 19: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/19.jpg)
Action recognition using visual attention, 2015
● Uses attention + RNN
● Re-weights features based on the
history:
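A rough sketch of such history-based re-weighting, assuming soft attention over spatial locations: the previous RNN state scores each location, the scores are normalized with softmax, and the location features are averaged with those weights. The function name, the `location_weights` matrix and the toy numbers are assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, hidden, location_weights):
    """Re-weight location features using the previous hidden state.

    features: one feature vector per spatial location.
    hidden: previous RNN hidden state (the "history").
    location_weights: one weight row per location, same length as hidden.
    """
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in location_weights]
    attention = softmax(scores)
    dim = len(features[0])
    context = [sum(a * feat[j] for a, feat in zip(attention, features))
               for j in range(dim)]
    return attention, context


# Two locations with 2-D features; the hidden state favors the first location.
attention, context = attend(
    features=[[1.0, 0.0], [0.0, 1.0]],
    hidden=[1.0, 0.0],
    location_weights=[[2.0, 0.0], [0.0, 2.0]],
)
```

The resulting `context` vector is what the RNN would consume at the next time step.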
![Page 20: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/20.jpg)
Sequential modeling evolution
RNNs (LSTM, GRU) are considered the default starting point for many sequential
modeling tasks, e.g. machine translation, speech recognition, image captioning,
because they are simple and do the job, but…
● Sequentiality limits parallelization
● Despite gating, they only consider short-range context
![Page 21: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/21.jpg)
Sequential modeling evolution: CNNs
WaveNet (Speech), ByteNet (Language model), TCN (Many tasks)
[Figures: WaveNet; ByteNet]
![Page 22: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/22.jpg)
Transformer (Tensor2Tensor, Attention Is All You Need)
[Figures: Transformer architecture; English-to-German results]
![Page 23: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/23.jpg)
Back to AR: Our approach (Video Transformer)
● Embed each frame using a spatial CNN
● Find temporal relations between embeddings in
stacked decoder blocks
● A decoder block consists of multi-head self-attention
and a 1D-convolutional block with some residual
connections
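The temporal part can be illustrated with a deliberately simplified sketch: single-head scaled dot-product self-attention over per-frame embeddings with a residual connection. It is not the VTN implementation. Learned projections, multiple heads and the 1D-convolutional block are omitted, and the frame embeddings are made-up stand-ins for CNN outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_self_attention(embeddings):
    """Single-head scaled dot-product self-attention over T frame embeddings.

    Simplification: queries, keys and values are the embeddings themselves
    (no learned projections). A residual connection adds the input back.
    """
    dim = len(embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in embeddings:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in embeddings]
        weights = softmax(scores)
        mixed = [sum(w * value[j] for w, value in zip(weights, embeddings))
                 for j in range(dim)]
        outputs.append([x + m for x, m in zip(query, mixed)])  # residual connection
    return outputs


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for per-frame CNN embeddings
attended = temporal_self_attention(frames)
```

Because every frame attends to every other frame in one step, all time positions can be processed in parallel, unlike in an RNN.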
![Page 24: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/24.jpg)
Results (HMDB, UCF)
![Page 25: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/25.jpg)
Results (Kinetics)
![Page 26: Action Recognition Overview · Sequence modeling approaches Action Recognition using Visual Attention, 2015 Lightweight Network Architecture for Real-Time Action Recognition (VTN),](https://reader035.vdocument.in/reader035/viewer/2022081518/6053cd64b3e67d046a341ab8/html5/thumbnails/26.jpg)
The end. Questions are welcome!