Paper introduction: Sequence to Sequence - Video to Text (ICCV 2015)


Page 1:

Sequence to Sequence ‒ Video to Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

ICCV 2015

M2 Soichiro Murakami

10/14/16

Page 2:

Introduction

[Figure: a video clip is mapped to a natural-language description (Video → Text)]

Page 3:


Example description: “A monkey is pulling a dog’s tail and is chased by the dog.”

Page 4:

Main contribution
• Propose a novel model that learns to directly map a sequence of frames to a sequence of words.

A general seq2seq model can:
a. handle a variable number of frames
b. learn and use the temporal structure of the video
c. learn a language model to generate natural and grammatical sentences

[Fig. 1]

Page 5:

Related work 1/2
• Image captioning [8, 40]
  1. generate a fixed-length vector representation of an image
  2. decode this vector into a sequence of words
• FGM [36]
  1. identify the semantic content (subject, verb, object, scene)
  2. combine these with confidences from a language model, using a factor graph to infer the most likely tuple in the video
  3. generate a sentence based on a template
• Mean Pool [39]
  • LSTMs are used to generate video descriptions by pooling the representations of individual frames (a minimal sketch follows below).
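The Mean Pool baseline discards temporal order by averaging per-frame features into a single vector. A minimal numpy sketch of that pooling step (the array shapes are illustrative, not taken from the paper):

```python
import numpy as np

def mean_pool_video(frame_features: np.ndarray) -> np.ndarray:
    """Collapse per-frame CNN features (T x D) into a single video
    vector (D,) by averaging over time, as in the Mean Pool approach."""
    return frame_features.mean(axis=0)

# e.g., 28 sampled frames, each represented by a 4096-dim fc7 feature
video_vector = mean_pool_video(np.random.rand(28, 4096).astype(np.float32))
```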

Page 6:

Related work 2/2
• Temporal-Attention [43] (ICCV 2015)
  • employs a 3-D convnet that incorporates spatiotemporal motion features computed over dense trajectories (HoG, HoF, MBH).
  • uses an attention mechanism that learns to weight the frame features (sketched below).
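Temporal attention replaces uniform mean pooling with a learned per-frame weighting. A minimal sketch of the weighted-sum idea only (in [43] the scores come from the decoder state at each step; here they are simply given as input):

```python
import numpy as np

def attend(frame_features: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Softmax-normalize one relevance score per frame, then return the
    weighted sum of the frame features: (T x D) -> (D,)."""
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ frame_features
```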

Page 7:

Approach 1/2

• 3.1 LSTM for sequence modeling
• 3.2 Sequence to sequence video to text

The model directly estimates p(y_1, ..., y_m | x_1, ..., x_n): the probability of a sequence of words y_1, ..., y_m given a sequence of video frames x_1, ..., x_n.

[Fig. 2: two stacked LSTM layers; the first layer's hidden state is concatenated with the word embedding as input to the second layer; z_t denotes the output of the second LSTM layer. A code sketch follows.]
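A minimal PyTorch sketch of the stacked two-layer architecture described above (the layer sizes, names, and the use of PyTorch are assumptions; the original implementation used Caffe):

```python
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    """Two stacked LSTMs: during encoding, LSTM1 reads frame features while
    LSTM2 sees zero word inputs; during decoding, LSTM1 sees zero frames
    while LSTM2 reads [LSTM1 hidden state ; previous word embedding]."""
    def __init__(self, feat_dim=4096, hidden=1000, vocab_size=10000, emb_dim=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm1 = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden + emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)  # z_t -> word logits

    def forward(self, frames, captions):
        # frames: (B, T_in, feat_dim); captions: (B, T_out) token ids
        B, T_in, F = frames.shape
        T_out = captions.size(1)
        # pad the frame stream during decoding, the word stream during encoding
        frame_in = torch.cat([frames, frames.new_zeros(B, T_out, F)], dim=1)
        word_emb = self.embed(captions)
        word_in = torch.cat(
            [word_emb.new_zeros(B, T_in, word_emb.size(-1)), word_emb], dim=1)
        h1, _ = self.lstm1(frame_in)                        # first LSTM layer
        h2, _ = self.lstm2(torch.cat([h1, word_in], dim=-1))  # concatenate
        z = h2[:, T_in:]             # z_t: output of the second LSTM layer
        return self.classifier(z)    # (B, T_out, vocab_size) word logits
```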

Page 8:

Approach 2/2
• 3.3 Video and text representation
• RGB frames
  • apply a pre-trained CNN (AlexNet or the 16-layer VGG model) to the input images and feed the output of its top layer to the LSTM units (sketched below).
• Optical flow
  • first extract classical variational optical flow features [2].
  • then create flow images and apply a pre-trained CNN.
• Text
  • embed words into a lower, 500-dimensional space by applying a linear transformation to the input data.
• RGB and flow predictions are combined by a weighted average for the combined model.
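A sketch of the RGB feature path using torchvision (an assumption: the paper used Caffe-trained networks, and applying the 500-dim linear embedding to frame features, mirroring the word embedding above, is illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained 16-layer VGG (recent torchvision API); keep the classifier
# up to fc7, which yields a 4096-dim feature per frame.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
fc7 = nn.Sequential(vgg.features, nn.Flatten(), *vgg.classifier[:5])
fc7.eval()

frames = torch.rand(28, 3, 224, 224)   # 28 sampled, preprocessed RGB frames
with torch.no_grad():
    feats = fc7(frames)                # (28, 4096) fc7 activations

# Linear embedding into a lower 500-dimensional space before the LSTM
# (dimension taken from the slide's word-embedding description).
project = nn.Linear(4096, 500)
lstm_inputs = project(feats)           # (28, 500)
```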

Page 9:

Experimental Setup (1/3)
• Video description datasets
  • Microsoft Video Description Corpus (MSVD)
    • a collection of YouTube clips with single-sentence descriptions from annotators.
  • MPII Movie Description Dataset (MPII-MD)
    • Hollywood movies with movie scripts and audio description data.
  • Montreal Video Annotation Dataset (M-VAD)
    • Hollywood movies with audio description data for the visually impaired.
⇒ They used a single sentence as the target sentence for each video.

Page 10:

Experimental Setup (2/3)

[Table 1: corpus statistics]
[Example from MPII-MD]
(A Dataset for Movie Description. Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele. CVPR 2015)

Page 11:

Experimental Setup (3/3)
• Evaluation metrics
  • METEOR [7]
    • METEOR compares exact token matches, stemmed tokens, and paraphrase matches, as well as semantically similar matches using WordNet synonyms (usage sketch below).
• Experimental details of the models
  • unroll the LSTM to a fixed 80 time steps during training (padding sketch below):
    • for longer videos, truncate the number of frames.
    • for shorter videos, pad the remaining inputs with zeros.
  • mini-batch size: up to 8 for the AlexNet model, up to 3 for the flow model.
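The paper scores with the standalone METEOR toolkit; as a rough stand-in, NLTK ships a METEOR implementation (an assumption of convenience: scores will not match the official tool exactly, and recent NLTK versions expect pre-tokenized input):

```python
# pip install nltk; run nltk.download("wordnet") once beforehand
from nltk.translate.meteor_score import meteor_score

reference = "a monkey is pulling a dog 's tail".split()
hypothesis = "a monkey pulls the tail of a dog".split()
print(meteor_score([reference], hypothesis))  # higher is better, in [0, 1]
```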
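The fixed 80-step unrolling implies per-video truncation or zero-padding. A minimal sketch of that preprocessing (illustrative only; in the paper the 80 steps jointly cover the frame sequence and the word sequence):

```python
import numpy as np

def fit_to_steps(frames: np.ndarray, num_steps: int = 80) -> np.ndarray:
    """Truncate longer videos and zero-pad shorter ones so that every
    input occupies a fixed number of unrolled LSTM time steps."""
    t, d = frames.shape
    if t >= num_steps:
        return frames[:num_steps]           # longer video: truncate frames
    pad = np.zeros((num_steps - t, d), dtype=frames.dtype)
    return np.vstack([frames, pad])         # shorter video: pad with zeros
```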

Page 12:

Results and Discussion ‒ MSVD dataset ‒

• The S2VT AlexNet model on RGB video frames achieves 27.9% METEOR.
• The flow-only model performs noticeably worse.
• Polysemous words remain difficult: the same verb covers different activities, e.g., “playing a guitar” vs. “playing golf”.

Page 13:

Results and Discussion ‒ Movie description datasets ‒

• It was best to use dropout at the inputs and outputs of both LSTM layers (sketched below).
• SMT [28]
  • translates holistic video representations into a single sentence.
• Visual-Labels [27]
  • an LSTM-based approach that uses no temporal encoding but more diverse visual features, namely object detectors as well as activity and scene classifiers.
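A small PyTorch sketch of the dropout placement the slide reports as best, i.e. at the inputs and outputs of both LSTM layers (the rate p=0.5 and the wrapper itself are assumptions, not the authors' code):

```python
import torch.nn as nn

class DropoutLSTM(nn.Module):
    """Wrap an LSTM layer with dropout applied to its inputs and outputs."""
    def __init__(self, in_dim, hidden, p=0.5):
        super().__init__()
        self.in_drop = nn.Dropout(p)
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out_drop = nn.Dropout(p)

    def forward(self, x, state=None):
        out, state = self.lstm(self.in_drop(x), state)
        return self.out_drop(out), state
```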


Page 16:

Conclusion
• They construct descriptions using a sequence to sequence model, where frames are first read sequentially and then words are generated sequentially.
• Their model achieves state-of-the-art performance on the MSVD dataset.
• For further information: https://www.cs.utexas.edu/~vsub/s2vt.html