
Action Recognition Overview

Vadim Andronov

Internet of Things Group

Task definition

● Action Recognition

○ Predict the action at the current time (over some time interval)

○ Video classification - predict the action for the whole video

○ Simplest model - image classifier

● Action Detection

○ Consists of two tasks:

■ Localization of actions (where and when they occur)

■ Classification of detected actions

○ Simplest model - object detection

Action Examples (Kinetics Dataset)

Datasets

● Kinetics-400: 400 classes of actions, ~300k videos from YouTube

● UCF-101: 101 classes of human actions, 13k clips from YouTube

● HMDB-51: 51 actions, 7k clips from movies

● Sports-1M: 487 sport actions, ~1M clips

● Jester: 27 human gestures

Why not just an image classifier?

What is the correct action for this image?

Action Recognition Approaches

● Two-stream models

○ Two-stream convolutional networks for action recognition in videos, 2014

○ Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (TSN), 2016

● 3D models

○ Learning Spatiotemporal Features with 3D Convolutional Networks (C3D), 2014

○ Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D), 2017

○ Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet (R3D), 2017

○ A Closer Look at Spatiotemporal Convolutions for Action Recognition (R(2+1)D), 2017

○ Non-local Neural Networks, 2017

● Sequence modeling approaches

○ Action Recognition using Visual Attention, 2015

○ Lightweight Network Architecture for Real-Time Action Recognition (VTN), ours, 2018

Two-stream approach

● One of the first successful DL approaches to the action recognition (AR) problem: Two-stream convolutional networks for action recognition in videos, 2014

● Fusion of two AlexNet-based CNNs that work on different modalities: RGB frames and optical flow (OF)

● Techniques to prevent overfitting:

○ Multi-task learning on two datasets
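The fusion of the two streams can be illustrated with a minimal sketch. It assumes late fusion by averaging the class probabilities of the RGB and optical-flow networks, which is only one of several possible fusion schemes; the function names and the `rgb_weight` parameter are hypothetical and the logits are toy values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_two_streams(rgb_logits, flow_logits, rgb_weight=0.5):
    """Weighted average of per-class probabilities from the two streams.

    rgb_weight is a hypothetical knob of this sketch; 0.5 is plain averaging.
    """
    rgb_probs = softmax(np.asarray(rgb_logits, dtype=float))
    flow_probs = softmax(np.asarray(flow_logits, dtype=float))
    return rgb_weight * rgb_probs + (1.0 - rgb_weight) * flow_probs

# Toy scores for 3 classes: the RGB stream is unsure, the flow stream is confident.
rgb = [1.0, 1.1, 0.2]
flow = [0.1, 3.0, 0.3]
fused = fuse_two_streams(rgb, flow)
predicted_class = int(np.argmax(fused))
```

Averaging keeps the two networks fully independent at training time, which is the main appeal of late fusion; the price is that the streams cannot exchange information in intermediate layers.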

[Figure: optical flow, sparse vs. dense]

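Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (TSN), 2016

● Given a video $V$, TSN divides it evenly into $K$ segments $S_1, \dots, S_K$

● Then $\mathrm{TSN}(T_1, \dots, T_K) = \mathcal{H}(\mathcal{G}(\mathcal{F}(T_1; W), \dots, \mathcal{F}(T_K; W)))$

● $T_k$ is a random snippet of $S_k$, $\mathcal{F}(T_k; W)$ - CNN function (BN-Inception) with parameters $W$

● $\mathcal{G}$ - some weighting function (even averaging), $\mathcal{H}$ - Softmax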
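The TSN pipeline described on this slide (split the video evenly into segments, take one random snippet per segment, run a CNN on each snippet, average the scores, apply softmax) can be sketched as follows. This is a toy illustration, not the authors' code: `toy_cnn` is a hypothetical stand-in for the BN-Inception network, and a snippet is reduced to a single frame index.

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments, rng):
    """Split [0, num_frames) into equal segments and pick one random frame in each."""
    edges = [k * num_frames // num_segments for k in range(num_segments + 1)]
    return [int(rng.integers(lo, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def toy_cnn(frame_index, num_classes=4):
    """Hypothetical stand-in for the per-snippet CNN: returns fake class scores."""
    scores = np.zeros(num_classes)
    scores[frame_index % num_classes] = 1.0
    return scores

def tsn_predict(snippet_scores):
    """Consensus by plain averaging over snippets, followed by softmax."""
    consensus = np.mean(snippet_scores, axis=0)
    exp = np.exp(consensus - consensus.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
indices = sample_snippet_indices(num_frames=90, num_segments=3, rng=rng)
scores = np.stack([toy_cnn(i) for i in indices])  # shape (segments, classes)
probs = tsn_predict(scores)
```

Because only one snippet per segment is processed, the cost per video depends on the number of segments rather than on the video length.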

3D CNN models

● Introduce higher dimensional primitives operating with 5D tensors (BxCxTxHxW):

○ 3D convolution

○ 3D pooling

● Consider spatial and temporal information at the same time

● Problems:

○ Higher computational complexity

○ Hard to train
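To make the 5D tensor layout and the extra cost concrete, here is a small arithmetic sketch. It assumes the usual output-size formula for strided, zero-padded convolutions; the clip size, channel counts and helper names are hypothetical example values, and biases are ignored.

```python
import math

def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution (or pooling) window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def conv3d_output_shape(shape, out_channels, kernel, stride=(1, 1, 1), padding=(0, 0, 0)):
    """Map a (B, C, T, H, W) input shape to the output shape of a 3D convolution."""
    batch, _, *dims = shape
    out_dims = [conv_output_size(d, k, s, p)
                for d, k, s, p in zip(dims, kernel, stride, padding)]
    return (batch, out_channels, *out_dims)

def conv_macs(in_channels, out_channels, kernel, out_dims):
    """Multiply-accumulates per sample: one kernel application per output element."""
    return in_channels * out_channels * math.prod(kernel) * math.prod(out_dims)

clip = (8, 3, 16, 112, 112)  # B x C x T x H x W
out = conv3d_output_shape(clip, out_channels=64, kernel=(3, 3, 3), padding=(1, 1, 1))

macs_3d = conv_macs(3, 64, (3, 3, 3), out[2:])
macs_2d_per_frame = conv_macs(3, 64, (3, 3), out[3:])
# Cost of the 3D layer relative to running the 2D layer on every frame.
ratio = macs_3d / (macs_2d_per_frame * clip[2])
```

With a 3x3x3 kernel the layer costs as much as three 3x3 2D layers applied to every frame, which gives a feel for the "higher computational complexity" bullet.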

3D Convolution

[Figure: 2D convolution vs. 3D convolution]
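The difference between the two operations is that a 2D kernel slides over height and width only, while a 3D kernel also slides along the time axis. A naive single-channel sketch (no padding, stride 1, written with loops for readability rather than speed) might look like this; the function name and the toy clip are illustrative assumptions.

```python
import numpy as np

def conv3d_naive(clip, kernel):
    """Valid single-channel 3D cross-correlation.

    clip: array of shape (T, H, W); kernel: array of shape (kt, kh, kw).
    """
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):          # the extra loop a 2D convolution lacks
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                window = clip[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(window * kernel)
    return out

# Toy clip in which every pixel grows by 25 from one frame to the next.
clip = np.arange(4 * 5 * 5, dtype=float).reshape(4, 5, 5)

# A purely temporal kernel: next frame minus current frame.
temporal_diff = np.zeros((2, 1, 1))
temporal_diff[0, 0, 0] = -1.0
temporal_diff[1, 0, 0] = 1.0

motion = conv3d_naive(clip, temporal_diff)
```

The temporal-difference kernel responds to change between frames, something no kernel restricted to a single frame can measure.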

Learning Spatiotemporal Features with 3D Convolutional Networks (C3D), 2014

● AlexNet-like architecture with 3D primitives

● One of the first works to apply 3D networks to the AR problem

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D)

● “Inflated” Inception V1 architecture - 3D CNN (224x224)

● Showed that transfer learning from ImageNet works for 3D models

● Introduced Kinetics dataset

● Combines approaches (two-stream and 3D)

● Saturated UCF-101

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet (R3D)

● Adapts residual architectures to 3D (112x112)

● Uses Kinetics as a main benchmark

A Closer Look at Spatiotemporal Convolutions for Action Recognition (R(2+1)D)

● Compares mixed architectures

● Decomposes the 3D convolutional kernel

○ Fewer weights

○ Easier to train
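The weight saving from the decomposition can be checked with simple arithmetic. The sketch below assumes that a t x d x d kernel is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution; `mid_channels`, the width between the two, is a hypothetical free parameter of this example (the slides do not say how it is chosen), and biases are ignored.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights of a full t x d x d 3D convolution."""
    return c_in * c_out * t * d * d

def conv2plus1d_params(c_in, c_out, t, d, mid_channels):
    """Weights of a spatial (1 x d x d) plus a temporal (t x 1 x 1) convolution."""
    spatial = c_in * mid_channels * d * d
    temporal = mid_channels * c_out * t
    return spatial + temporal

full = conv3d_params(c_in=64, c_out=64, t=3, d=3)
factored = conv2plus1d_params(c_in=64, c_out=64, t=3, d=3, mid_channels=64)
saving = 1.0 - factored / full
```

Under these example values the factored layer has 49152 weights against 110592 for the full 3D kernel. The factored form also inserts an extra nonlinearity between the spatial and temporal steps if an activation is placed there.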

Non-local Neural Networks

● Introduces a special non-local block

● Uses a self-attention mechanism to re-weight features

● Can be used in any CNN

● Increases computational complexity
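A rough sketch of such a block, assuming the softmax (dot-product attention) form and a residual connection: every spatio-temporal position attends to every other position. The weight matrices are random placeholders and all names are illustrative; this is not the reference implementation.

```python
import numpy as np

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """x: (N, C) features, where N = T*H*W flattened positions."""
    theta = x @ w_theta                    # queries, (N, D)
    phi = x @ w_phi                        # keys, (N, D)
    g = x @ w_g                            # values, (N, D)
    scores = theta @ phi.T                 # pairwise similarities, (N, N)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    y = attn @ g                           # re-weighted features, (N, D)
    return x + y @ w_z, attn               # residual connection back to C channels

rng = np.random.default_rng(0)
T, H, W, C, D = 2, 3, 3, 8, 4
x = rng.standard_normal((T * H * W, C))
w_theta, w_phi, w_g = (rng.standard_normal((C, D)) for _ in range(3))
w_z = rng.standard_normal((D, C))

out, attn = non_local_block(x, w_theta, w_phi, w_g, w_z)
# With a zero output projection the block reduces to the identity.
identity_out, _ = non_local_block(x, w_theta, w_phi, w_g, np.zeros((D, C)))
```

The (N, N) attention matrix is where the added cost comes from: it grows quadratically with the number of positions T*H*W. Because the output has the same shape as the input, the block can be dropped between existing layers of a CNN.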

Sequence modeling methods

● Model sequential connections explicitly via a special NN architecture (e.g. recurrent, 1D-CNN, etc.)

RNN Cell

LSTM Cell

In practice, a more complicated cell (such as LSTM) is used to allow training on longer sequences
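For comparison, one step of a plain RNN cell and one step of an LSTM cell can be written as below. This is a minimal sketch using the standard textbook formulation; the stacked gate order (input, forget, output, candidate), the weight shapes and the random per-frame features are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x, h, W, U, b):
    """Plain RNN cell: the new state is a squashed mix of input and old state."""
    return np.tanh(W @ x + U @ h + b)

def lstm_step(x, h, c, W, U, b):
    """LSTM cell. W: (4H, D), U: (4H, H), b: (4H,), gates stacked as i, f, o, g."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_next = f * c + i * g          # additive cell update controlled by the gates
    h_next = o * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 6, 4, 5
frames = rng.standard_normal((steps, input_dim))  # stand-in for per-frame features

W_r = rng.standard_normal((hidden_dim, input_dim)) * 0.1
U_r = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
W_l = rng.standard_normal((4 * hidden_dim, input_dim)) * 0.1
U_l = rng.standard_normal((4 * hidden_dim, hidden_dim)) * 0.1

h_rnn = np.zeros(hidden_dim)
h_lstm, c_lstm = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in frames:
    h_rnn = rnn_step(x, h_rnn, W_r, U_r, np.zeros(hidden_dim))
    h_lstm, c_lstm = lstm_step(x, h_lstm, c_lstm, W_l, U_l, np.zeros(4 * hidden_dim))
```

The additive update of the cell state `c` is what lets gradients survive over more time steps than in the plain cell, where the state is fully rewritten at every step.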

Action recognition using visual attention, 2015

● Uses attention + RNN

● Re-weights features based on the history (the previous hidden state)
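A minimal sketch of this kind of soft attention, assuming that the weights over spatial locations are a softmax of a linear function of the previous RNN hidden state and that the attended feature is the weighted sum of the per-location features. The matrix name `W_att`, the 7x7 grid and the dimensions are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(features, h_prev, W_att):
    """features: (L, D) for L spatial locations of one frame; h_prev: (H,); W_att: (L, H)."""
    weights = softmax(W_att @ h_prev)      # one weight per location, sums to 1
    context = weights @ features           # weighted sum of location features, (D,)
    return context, weights

rng = np.random.default_rng(0)
num_locations, feat_dim, hidden_dim = 49, 16, 8   # e.g. a 7x7 feature map
features = rng.standard_normal((num_locations, feat_dim))
W_att = rng.standard_normal((num_locations, hidden_dim))

context, weights = attend(features, rng.standard_normal(hidden_dim), W_att)
# With an all-zero history every location gets the same weight.
uniform_context, uniform_weights = attend(features, np.zeros(hidden_dim), W_att)
```

The context vector is then fed to the RNN as the input for the current time step, so where the model looks depends on what it has already seen.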

Sequential modeling evolution

RNNs (LSTM, GRU) are considered a default starting point for many sequential modeling tasks, e.g. machine translation, speech recognition, image captioning... because they are simple and do the job, but…

● Sequentiality limits the parallelization

● Despite gating, they capture mostly short-range context in practice

Sequential modeling evolution: CNNs

WaveNet (Speech), ByteNet (Language model), TCN (Many tasks)

[Figures: WaveNet and ByteNet architectures]
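The key building block of WaveNet-style models is the dilated causal 1D convolution: each output depends only on current and past inputs, and stacking layers with growing dilation widens the receptive field quickly. A pure-Python sketch under these assumptions (the function names and the dilation schedule 1, 2, 4, 8 are illustrative):

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of past-and-present inputs that can influence one output."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = (1, 2, 4, 8)
impulse = [1.0] + [0.0] * 15

# Push a single impulse through the stack to see how far its influence reaches.
response = impulse
for d in dilations:
    response = causal_dilated_conv1d(response, [1.0, 1.0], d)

field = receptive_field(kernel_size=2, dilations=dilations)
```

Four layers with kernel size 2 already cover 16 time steps, and every time step within a layer can be computed in parallel, unlike in an RNN.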

Transformer (Tensor2Tensor, Attention Is All You Need)

[Figures: Transformer architecture; English-to-German results]

Back to AR: Our approach (Video Transformer)

● Embed each frame using spatial CNN

● Find temporal relations between embeddings in stacked decoder blocks

● Each decoder block consists of multi-head self-attention and a 1D-convolutional block with some residual connections
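The structure described in these bullets can be sketched roughly as follows. This is an illustration under several assumptions and not the authors' implementation: the frame embeddings are random stand-ins for the spatial CNN output, normalization and positional information are omitted, the exact placement of the residual connections is a guess, and averaging over time at the end is a hypothetical way to get a clip-level feature.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (T, D) frame embeddings; every frame attends to every other frame."""
    T, D = x.shape
    d_head = D // num_heads

    def split(m):  # (T, D) -> (heads, T, d_head)
        return m.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))   # (heads, T, T)
    merged = (attn @ v).transpose(1, 0, 2).reshape(T, D)
    return merged @ w_o

def conv1d_same(x, kernel):
    """Temporal convolution with zero padding. x: (T, D); kernel: (K, D, D), K odd."""
    K = kernel.shape[0]
    pad = K // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([sum(padded[t + k] @ kernel[k] for k in range(K))
                     for t in range(x.shape[0])])

def decoder_block(x, attn_weights, conv_kernel, num_heads):
    """Self-attention and a 1D convolution, each wrapped in a residual connection."""
    x = x + multi_head_self_attention(x, *attn_weights, num_heads=num_heads)
    x = x + np.maximum(conv1d_same(x, conv_kernel), 0.0)   # ReLU on the conv branch
    return x

rng = np.random.default_rng(0)
T, D, num_heads, num_blocks = 16, 32, 4, 2
embeddings = rng.standard_normal((T, D))   # stand-in for per-frame CNN embeddings

out = embeddings
for _ in range(num_blocks):
    attn_weights = [rng.standard_normal((D, D)) * 0.1 for _ in range(4)]
    conv_kernel = rng.standard_normal((3, D, D)) * 0.1
    out = decoder_block(out, attn_weights, conv_kernel, num_heads)

clip_feature = out.mean(axis=0)            # hypothetical clip-level pooling
```

Since the spatial CNN runs once per frame and only the (T, D) embeddings enter the temporal blocks, the temporal part stays cheap compared with 3D convolutions over full-resolution clips.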

Results (HMDB, UCF)

Results (Kinetics)

The end. Questions are welcome!
