learnable pooling with context gating for video classificationmiech/poster_youtube8m.pdf ·...

1
Learnable pooling with Context Gating for video classification Antoine Miech 1,2 Josef Sivic 1,2,3 Ivan Laptev 1,2 Ranked 1 at kaggle Model Overview 1 DI ENS, École Normale Supérieure, PSL Research University 3 CIIRC, CTU in Prague 2 Inria Goal Challenges Contributions LOUPE Tensorflow toolbox: github.com/antoine77340/LOUPE arXiv: 1706.06905 Focus on aggregation Results on the Youtube-8M dataset Training speed and Generalization Learnable pooling via Clustering Context Gating References • Multi-label video tagging Groundtruth: Tree- Christmas Tree - Christmas Decoration - Christmas Top 6 scores: • Large diversity of labels • Incomplete and noisy annotation • What is a good representation for video ? • Learnable pooling for video representations • Context Gating: A non-linear learnable module for modeling interdependencies between activations Input Gated Output Fully-connected Sigmoid Multiplication with Input Ski Tree Snow Ski Tree Snow Feature activation Feature activation • Models interdependencies between activations with a self-gating mechanism • Adds quadratic interactions • Inspired by Gated Linear Unit [3] [1] R. Arandjelović, et al. NetVLAD: CNN architecture for weakly supervised place recognition. CVPR 2016 [2] J. Sivic et al. Video Google: a text retrieval approach to object matching in videos. ICCV 2003 [3] Y. N. Dauphin, et al. Language modeling with gated convolutional networks. arXiv:1612.08083 [4] H. Jégou, et al. Aggregating local descriptors into a compact image representation. CVPR 2010 [5] F. Perronnin, et al. Fisher kernels on visual vocabularies for image categorization. CVPR 2007 Release • How to aggregate multi-modal sequences of features into a good compact representation ? Visual features Audio features Supervised model Single and compact pooled representation • Training time vs. performance for LSTM, GRU, NetVLAD and Gated NetVLAD. • Performance as a function of the training set size. Soft-DBoW Soft Bag-of-Words [2] formulated in terms of differentiable operations. NetVLAD [1] / NetRVLAD NetVLAD: Approximates a VLAD [4] encoding with differentiable operations. NetRVLAD: Residual-less NetVLAD. NetFV A variant of the Fisher Vector representation [5] with an approximation of Gaussian Mixture model by differentiable operations. This is done through a simple extension of NetVLAD [1]. 0 5 10 15 20 25 Number of models 83.0 83.5 84.0 84.5 85.0 85.5 GAP • Comparison of different pooling based methods (with or without Context Gating) and the RNN pooling approaches. • The effect of ensembling different methods. A prediction score average of 7 different models can achieve the first place of the kaggle Youtube-8M challenge out of 650 teams. Models are: Gated NetVLAD, Gated NetRVLAD, Gated NetFV, Gated Soft DBoW, DBoW, GRU and LSTM. 10 5 10 6 10 7 Number of samples 65 70 75 80 85 GAP Gated NetVLAD NetVLAD LSTM Average Pooling X: input layer Y: output layer W,b: parameters to learn 2 4 6 8 10 12 14 0.70 0.72 0.74 0.76 0.78 0.80 0.82 0.84 GAP Gated NetVLAD NetVLAD GRU LSTM +7 Millions videos 4716 labels 3.4 avg labels/video 450000 video hours VIDEO FEATURES Learnable pooling (e.g NetVLAD, NetFV) FC layer MoE AUDIO FEATURES Context Gating Context Gating Learnable pooling (e.g NetVLAD, NetFV) FEATURE POOLING INPUT FEATURES CLASSIFICATION Time (hours) NetVLAD NetRVLAD NetFV Soft-DBoW GRU LSTM Model 81.5 82.0 82.5 83.0 83.5 GAP With Context Gating Based on Clustering RNN

Upload: others

Post on 10-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learnable pooling with Context Gating for video classificationmiech/poster_Youtube8M.pdf · 2017-08-07 · Context Gating Context Gating Learnable pooling (e.g NetVLAD, NetFV) FEATURE

Learnable pooling with Context Gating for video classification

Antoine Miech1,2 Josef Sivic1,2,3Ivan Laptev1,2

Ranked 1 at kaggleModel Overview

1DI ENS, École Normale Supérieure, PSL Research University 3CIIRC, CTU in Prague2Inria

Goal

Challenges

Contributions

LOUPE Tensorflow toolbox:github.com/antoine77340/LOUPEarXiv: 1706.06905

Focus on aggregation

Results on the Youtube-8M dataset

Training speed and Generalization

Learnable pooling via Clustering

Context Gating

References

• Multi-label video taggingGroundtruth: Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Renault - 4-Wheel Drive

Groundtruth:Tree- Christmas Tree - Christmas Decoration - Christmas

Car - Vehicle - Sport Utility Vehicle - Dacia Duster - Fiat AutomobilesVolkswagen Beetles

Top 6 scores:

Top 6 scores: Christmas - Christmas decoration - Origami - Paper - Tree Christmas Tree

• Large diversity of labels• Incomplete and noisy annotation• What is a good representation for video ?

• Learnable pooling for video representations• Context Gating: A non-linear learnable module for modeling interdependencies between activations

Input

GatedOutput

Fully-connected

Sigmoid

Multiplicationwith Input

Ski

Tree

Snow

SkiTree

Snow

Featu

re a

ctiv

ation

Featu

re a

ctiv

ation

• Models interdependencies between activations with a self-gating mechanism

• Adds quadratic interactions

• Inspired by Gated Linear Unit [3][1] R. Arandjelović, et al. NetVLAD: CNN architecture for weakly supervised place recognition. CVPR 2016[2] J. Sivic et al. Video Google: a text retrieval approach to object matching in videos. ICCV 2003 [3] Y. N. Dauphin, et al. Language modeling with gated convolutional networks. arXiv:1612.08083[4] H. Jégou, et al. Aggregating local descriptors into a compact image representation. CVPR 2010[5] F. Perronnin, et al. Fisher kernels on visual vocabularies for image categorization. CVPR 2007

Release

• How to aggregate multi-modal sequences of features into a good compact representation ?

Visual features

Audio features

Supervised model

Single andcompact pooledrepresentation

• Training time vs. performance for LSTM, GRU, NetVLAD and Gated NetVLAD.

• Performance as a function of the training set size.

• Soft-DBoWSoft Bag-of-Words [2] formulated in terms of differentiable operations.

• NetVLAD [1] / NetRVLADNetVLAD: Approximates a VLAD [4] encoding with differentiable operations.NetRVLAD: Residual-less NetVLAD.

• NetFVA variant of the Fisher Vector representation [5] with an approximation of Gaussian Mixture model by differentiable operations. This is done through a simple extension of NetVLAD [1].

0 5 10 15 20 25Number of models

83.0

83.5

84.0

84.5

85.0

85.5

GAP

• Comparison of different pooling based methods (with or without Context Gating) and the RNN pooling approaches.

• The effect of ensembling different methods.A prediction score average of 7 different models can achieve the first place of the kaggle Youtube-8M challenge out of 650 teams.Models are: Gated NetVLAD, Gated

NetRVLAD, Gated NetFV, Gated Soft DBoW,

DBoW, GRU and LSTM.

105 106 107

Number of samples

65

70

75

80

85

GAP

Gated NetVLADNetVLADLSTMAverage Pooling

X: input layerY: output layerW,b: parameters to learn

2 4 6 8 10 12 14Time (hour)

0.70

0.72

0.74

0.76

0.78

0.80

0.82

0.84

GAP

Gated NetVLADNetVLADGRULSTM

+7 Millions videos

4716labels

3.4 avg labels/video

450000video hoursVIDEO

FEATURES

Learnable pooling

(e.g NetVLAD,NetFV)

FClayer

MoE

AUDIO FEATURES

Context Gating

Context Gating

Learnable pooling

(e.g NetVLAD,NetFV)

FEATURE POOLING

INPUTFEATURES

CLASSIFICATION

Time (hours)

NetVLAD NetRVLAD NetFV Soft-DBoW GRU LSTMModel

81.5

82.0

82.5

83.0

83.5

GAP

With Context GatingBased on ClusteringRNN