learnable pooling with context gating for video classificationmiech/poster_youtube8m.pdf ·...

Learnable pooling with Context Gating for video classification

Antoine Miech1,2 Josef Sivic1,2,3Ivan Laptev1,2

Ranked 1 at kaggleModel Overview

1DI ENS, École Normale Supérieure, PSL Research University 3CIIRC, CTU in Prague2Inria

Goal

Challenges

Contributions

LOUPE Tensorflow toolbox:github.com/antoine77340/LOUPEarXiv: 1706.06905

Focus on aggregation

Results on the Youtube-8M dataset

Training speed and Generalization

Learnable pooling via Clustering

Context Gating

References

• Multi-label video taggingGroundtruth: Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Renault - 4-Wheel Drive

Groundtruth:Tree- Christmas Tree - Christmas Decoration - Christmas

Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Fiat AutomobilesVolkswagen Beetles

Top 6 scores:

Top 6 scores: Christmas - Christmas decoration - Origami - Paper - Tree Christmas Tree

• Large diversity of labels• Incomplete and noisy annotation• What is a good representation for video ?

• Learnable pooling for video representations• Context Gating: A non-linear learnable module for modeling interdependencies between activations

Input

GatedOutput

Fully-connected

Sigmoid

Multiplicationwith Input

Ski

Tree

Snow

SkiTree

Snow

Featu

re a

ctiv

ation

Featu

re a

ctiv

ation

• Models interdependencies between activations with a self-gating mechanism

• Adds quadratic interactions

• Inspired by Gated Linear Unit [3][1] R. Arandjelović, et al. NetVLAD: CNN architecture for weakly supervised place recognition. CVPR 2016[2] J. Sivic et al. Video Google: a text retrieval approach to object matching in videos. ICCV 2003 [3] Y. N. Dauphin, et al. Language modeling with gated convolutional networks. arXiv:1612.08083[4] H. Jégou, et al. Aggregating local descriptors into a compact image representation. CVPR 2010[5] F. Perronnin, et al. Fisher kernels on visual vocabularies for image categorization. CVPR 2007

Release

• How to aggregate multi-modal sequences of features into a good compact representation ?

Visual features

Audio features

Supervised model

Single andcompact pooledrepresentation

• Training time vs. performance for LSTM, GRU, NetVLAD and Gated NetVLAD.

• Performance as a function of the training set size.

• Soft-DBoWSoft Bag-of-Words [2] formulated in terms of differentiable operations.

• NetVLAD [1] / NetRVLADNetVLAD: Approximates a VLAD [4] encoding with differentiable operations.NetRVLAD: Residual-less NetVLAD.

• NetFVA variant of the Fisher Vector representation [5] with an approximation of Gaussian Mixture model by differentiable operations. This is done through a simple extension of NetVLAD [1].

0 5 10 15 20 25Number of models

83.0

83.5

84.0

84.5

85.0

85.5

GAP

• Comparison of different pooling based methods (with or without Context Gating) and the RNN pooling approaches.

• The effect of ensembling different methods.A prediction score average of 7 different models can achieve the first place of the kaggle Youtube-8M challenge out of 650 teams.Models are: Gated NetVLAD, Gated

NetRVLAD, Gated NetFV, Gated Soft DBoW,

DBoW, GRU and LSTM.

105 106 107

Number of samples

65

70

75

80

85

GAP

Gated NetVLADNetVLADLSTMAverage Pooling

X: input layerY: output layerW,b: parameters to learn

2 4 6 8 10 12 14Time (hour)

0.70

0.72

0.74

0.76

0.78

0.80

0.82

0.84

GAP

Gated NetVLADNetVLADGRULSTM

+7 Millions videos

4716labels

3.4 avg labels/video

450000video hoursVIDEO

FEATURES

Learnable pooling

(e.g NetVLAD,NetFV)

FClayer

MoE

AUDIO FEATURES

Context Gating

Context Gating

Learnable pooling

(e.g NetVLAD,NetFV)

FEATURE POOLING

INPUTFEATURES

CLASSIFICATION

Time (hours)

NetVLAD NetRVLAD NetFV Soft-DBoW GRU LSTMModel

81.5

82.0

82.5

83.0

83.5

GAP

With Context GatingBased on ClusteringRNN

learnable pooling with context gating for video classificationmiech/poster_youtube8m.pdf ·...

Documents