Post on 24-Mar-2020


Xiaodong Yang, Pavlo Molchanov, Jan Kautz

Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification

INTELLIGENT VIDEO ANALYTICS

Surveillance event detection

Human-computer interaction

Multimedia search and indexing

Video Classification

INTELLIGENT VIDEO ANALYTICS: Related Work

Local feature extraction:

• Dense trajectories, H. Wang et al., ICCV 2013

Global feature representation:

• Bag-of-visual-words, J. Gemert et al., TPAMI 2009

• Fisher vector, F. Perronnin et al., ECCV 2010

Temporal modeling:

• Spatio-temporal pyramid, X. Yang et al., ECCV 2014

2D-CNN, A. Karpathy et al., CVPR 2014

C3D, D. Tran et al., ICCV 2015

Two-stream networks, K. Simonyan et al., NIPS 2014

LSTM, J. Ng et al., CVPR 2015

OUR CONTRIBUTIONS

Overview of multilayer and multimodal fusion for video classification

Local feature extraction:

• Multilayer representations from CNN

Global feature representation:

• Multimodal representations

• Fusion by boosting

Temporal modeling:

• Structure of FC-RNN

MULTILAYER REPRESENTATIONS

Dense image prediction

FCN by Long et al.; FlowNet by Fischer et al.

Features of conv layers

Poses, parts, articulations, objects, etc.

Visualization by Zeiler et al.

Convert feature maps to feature descriptors

Feature maps of dimension 28×28×5

28×28 feature descriptors of dimension 5
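
The conversion above amounts to a reshape. A minimal NumPy sketch, assuming a channels-last feature map with the 28×28×5 shape quoted on the slide; the array contents and variable names are illustrative only:

```python
import numpy as np

# Hypothetical conv feature map: height x width x channels (channels-last assumed).
feature_map = np.arange(28 * 28 * 5, dtype=np.float32).reshape(28, 28, 5)

# Each spatial location becomes one descriptor holding the responses of all channels.
descriptors = feature_map.reshape(-1, feature_map.shape[-1])

print(descriptors.shape)  # (784, 5): 28*28 descriptors of dimension 5
```

Row `i * 28 + j` of `descriptors` is the channel vector at spatial position `(i, j)`, so the spatial location of every descriptor remains recoverable.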

Learn spatial discriminative weights of conv layers

Spatial information of conv layers to enhance representations

Figure: video frames; feature maps of a conv layer over time; spatial weights of a conv layer (importance)

Aggregate feature descriptors by Fisher vector (FV)

Figure: feature maps of a conv layer over time; Gaussian mixture model
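
To make the aggregation step concrete, here is a minimal sketch of a standard Fisher vector computed from a set of descriptors and a diagonal-covariance Gaussian mixture model. It assumes the GMM parameters (`pi`, `mu`, `sigma`) were fitted beforehand; the function name, argument names, and the power/L2 normalization steps are choices made for this sketch and are not taken from the slides.

```python
import numpy as np

def fisher_vector(x, pi, mu, sigma):
    """Fisher vector of descriptors x under a diagonal GMM.

    x: (T, D) descriptors; pi: (K,) mixture weights;
    mu, sigma: (K, D) component means and standard deviations.
    Returns a vector of length 2 * K * D.
    """
    T, D = x.shape
    z = (x[:, None, :] - mu[None]) / sigma[None]           # (T, K, D) whitened offsets

    # Log of pi_k * N(x_t | mu_k, sigma_k), then a stable softmax for posteriors.
    log_p = (np.log(pi)[None]
             - 0.5 * np.sum(z ** 2, axis=2)
             - np.sum(np.log(sigma), axis=1)[None]
             - 0.5 * D * np.log(2.0 * np.pi))               # (T, K)
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)               # soft assignments

    # Gradients with respect to the means and standard deviations.
    g_mu = np.einsum('tk,tkd->kd', gamma, z) / (T * np.sqrt(pi)[:, None])
    g_sigma = np.einsum('tk,tkd->kd', gamma, z ** 2 - 1.0) / (T * np.sqrt(2.0 * pi)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                    # L2 normalization
```

With `K` mixture components and descriptors of dimension `D`, the result has `2 * K * D` entries regardless of how many descriptors are pooled, which is what lets feature maps from videos of different lengths share one fixed-size representation.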

Represent conv layers by improved Fisher vector (iFV)

Figure: feature maps of a conv layer over time; spatial weights of a conv layer (importance); Gaussian mixture model

Represent conv layers by improved Fisher vector (iFV)

Represent fc layers by temporal max pooling

Overview of multilayer representation
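
Temporal max pooling of fc-layer activations can be sketched in a few lines. This assumes the activations of all frames (or snippets) of one video are stacked into a `(T, D)` array; the function name and the toy values are illustrative only.

```python
import numpy as np

def temporal_max_pool(fc_activations):
    """fc_activations: (T, D) array, one fc-layer vector per frame or snippet.

    Returns a (D,) video-level vector holding the maximum of each dimension over time.
    """
    return fc_activations.max(axis=0)

# Toy example: 3 time steps, 3-dimensional fc activations.
acts = np.array([[0.1, 2.0, 0.0],
                 [0.7, 0.5, 0.3],
                 [0.2, 1.0, 0.9]])
pooled = temporal_max_pool(acts)
print(pooled.tolist())  # [0.7, 2.0, 0.9]
```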

FC-RNN STRUCTURE: Modeling Temporal Dynamics

Don’t be a hero: use pre-trained models

Many pre-trained models from ImageNet and Sports1M

Figure: extending a VGG/C3D model pre-trained on images/snippets to videos. Standard RNN: VGG/C3D, fc layer, RNN. FC-RNN: VGG/C3D, FC-RNN.

RNN

FC-RNN

Pre-trained CNN, fc layer:

Transfer to recurrent layers

Comparison of standard RNN and FC-RNN
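
The idea of transferring a pre-trained fc layer to a recurrent layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact formulation: it assumes the recurrent layer reuses the pre-trained input weights `w_io` and bias `b` unchanged, adds a new hidden-to-hidden matrix `w_hh`, starts from a zero hidden state, and uses ReLU as the activation. All names are hypothetical.

```python
import numpy as np

def fc_layer(x, w_io, b):
    """Pre-trained fully connected layer with ReLU (activation choice is an assumption)."""
    return np.maximum(0.0, x @ w_io + b)

def fc_rnn(xs, w_io, b, w_hh):
    """Recurrent layer built on the pre-trained fc weights.

    xs: (T, D_in) inputs over time; w_io: (D_in, D_out) and b: (D_out,) come from
    the pre-trained fc layer; w_hh: (D_out, D_out) is the only new parameter.
    Returns the (T, D_out) hidden states.
    """
    h = np.zeros(w_io.shape[1])                 # zero initial state (assumption)
    outputs = []
    for x in xs:
        # Same affine map as the fc layer, plus a recurrent term from the previous step.
        h = np.maximum(0.0, x @ w_io + h @ w_hh + b)
        outputs.append(h)
    return np.stack(outputs)
```

Under these assumptions, setting `w_hh` to zero makes `fc_rnn` reproduce the pre-trained fc layer at every time step, so only the recurrent matrix has to be learned from scratch.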

MULTIMODAL REPRESENTATIONS

Static and dynamic information

2D-CNN/3D-CNN with video frames/optical flow maps

A single frame

A single flow map

A buffer of frames

A buffer of flow maps

FUSION BY BOOSTING

Optimize a linear combination of predictions of multiple layers from multiple modalities

LPBoost:

boost-u: learn uniform weights for all classes

boost-c: learn class specific weights

4 layers and 4 modalities M = 16
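
The difference between the two weighting schemes can be sketched by applying already-learned weights to stacked prediction scores. The sketch below does not learn the weights (the linear program solved by LPBoost is not shown); it only illustrates the shapes involved, assuming scores from `M` layer/modality models are stacked into an `(M, N, C)` array for `N` videos and `C` classes. Function and variable names are hypothetical.

```python
import numpy as np

def fuse_uniform(scores, beta):
    """boost-u style fusion: one weight per model, shared by all classes.

    scores: (M, N, C) predictions; beta: (M,) weights. Returns (N, C).
    """
    return np.tensordot(beta, scores, axes=1)

def fuse_class_specific(scores, B):
    """boost-c style fusion: one weight per model and per class.

    scores: (M, N, C) predictions; B: (M, C) weights. Returns (N, C).
    """
    return np.einsum('mc,mnc->nc', B, scores)

# Toy setup matching the slide: 4 layers x 4 modalities gives M = 16 models.
M, N, C = 4 * 4, 3, 5
rng = np.random.default_rng(0)
scores = rng.random((M, N, C))
beta = np.full(M, 1.0 / M)                     # placeholder weights, not learned
fused = fuse_uniform(scores, beta)
predicted = fused.argmax(axis=1)               # predicted class per video
```

Class-specific weights reduce to the uniform case when every column of `B` equals `beta`, which makes boost-u a special case of boost-c with `C` times fewer parameters.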

EXPERIMENTS

Benchmark datasets

UCF101: 13,320 videos in 101 classes

HMDB51: 6,766 videos in 51 classes

Skiing

Kissing

EXPERIMENTS: FC-RNN

FC-RNN outperforms the standard RNN and LSTM by 3.0% and 2.9%, respectively

Figure: comparison of standard RNN and FC-RNN in training and testing of 3D-CNN-SF on UCF101 (error rate vs. epochs); up to 3% improvement

EXPERIMENTS: Feature Aggregation

Figure: comparison of FV and iFV to represent conv layers of different modalities (a single frame, a single flow map, a buffer of frames, a buffer of flow maps), with the spatial weights (importance) of a conv layer; up to 2.5% improvement

EXPERIMENTS: Multilayer Fusion

Classification accuracy of single layers over different modalities and multilayer fusion results; up to 8% improvement

EXPERIMENTS: Multimodal Fusion

Classification accuracy of different modalities and various combinations; up to 6% improvement

Comparison to the state-of-the-art results

EXPERIMENTS: LPBoost

Figure: pie charts over modalities and over layers (conv4, conv5, fc6, fc7). Extracted values: 17%, 31%, 23%, 29%; 0%, 38%, 12%, 50%.

EXPERIMENTS: Effect of Multimodal Fusion

Example: SKIING vs. SKIJET

2D-CNN-SF: skijet (incorrect)

Multimodal fusion: skiing (correct)

Example: BOXING PUNCHING BAG vs. BOXING SPEED BAG

2D-CNN-OF: boxing speed bag (incorrect)

Multimodal fusion: boxing punching bag (correct)

OUR CONTRIBUTIONS

Local feature extraction:

• Multilayer representations from CNN

Global feature representation:

• Multimodal representations

• Fusion by boosting

Temporal modeling:

• Structure of FC-RNN

Overview of multilayer and multimodal fusion for video classification
