
Page 1

Attentive Moment Retrieval in Videos

Meng Liu¹, Xiang Wang², Liqiang Nie¹, Xiangnan He², Baoquan Chen¹, and Tat-Seng Chua²

¹ Shandong University, China; ² National University of Singapore, Singapore

Page 2

Pipeline

• Background

• Learning Model

• Experiment

• Conclusion


Page 4

Background

• Inter-video Retrieval: retrieving whole videos from a collection that match a given query.

Query: FIFA World Cup

Query: Peppa Pig

Page 5

Background

• Intra-video Retrieval: retrieving a segment from an untrimmed video, which contains complex scenes and involves a large number of objects, attributes, actions, and interactions.

Query: Messi's penalty / football shot

Page 6

Background

• Surveillance videos: finding missing children, pets, or suspects. Query: "A girl in orange first walks by the camera."

• Home videos: recalling a desired moment. Query: "Baby's face gets very close to the camera."

• Online videos: quickly jumping to a specific moment.

Page 7

Background

Reality: dragging the progress bar to locate the desired moment, which is tedious and time-consuming.

Research: densely segmenting the long video into moments at multiple scales and matching each moment against the query, which incurs expensive computational costs and an exponential search space.

Page 8

Problem Formulation - Temporal Moment Localization

Input: a video and a natural-language query, e.g., "A girl in orange walks by the camera."

Output: the temporal moment corresponding to the given query (the green box in the illustration), with time points [24s, 30s].

Page 9

Pipeline

• Background

• Learning Model

• Experiment

• Conclusion

Page 10

Learning Model - Pipeline

[Pipeline figure] The video is segmented into candidate moments, and each moment feature is refined by a memory attention network; the query (e.g., "Girl with blue shirt drives past on bike.") is encoded into a query feature. The two are combined by cross-modal fusion and fed to an MLP model, which outputs an alignment score (e.g., 0.9) and a localization offset (e.g., [1.2s, 2.5s]).

Page 11

Learning Model - Feature Extraction

• Video:

1. Segmentation: segment the video into moments with multi-scale sliding windows; each moment $m$ has a time location $[\tau_s, \tau_e]$.

2. Computing the location offsets: $[\delta_s, \delta_e] = [t_s, t_e] - [\tau_s, \tau_e]$, where $[t_s, t_e]$ is the ground-truth temporal interval of the given query.

3. Computing the spatio-temporal feature $\mathbf{v}_m$: a C3D feature for each moment.

• Query $\mathbf{q}$: Skip-thought feature.
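As a concrete illustration, here is a minimal Python sketch of the sliding-window segmentation and offset computation described above; the window lengths, stride ratio, and function names are illustrative assumptions, not the authors' released code.

```python
from typing import List, Tuple

def segment_video(duration: float,
                  window_lengths: Tuple[float, ...] = (6.0, 12.0),
                  stride_ratio: float = 0.5) -> List[Tuple[float, float]]:
    """Slide windows of several lengths over the video to produce
    candidate moments [tau_s, tau_e]. The concrete lengths and
    stride here are illustrative, not the paper's settings."""
    moments = []
    for length in window_lengths:
        stride = length * stride_ratio
        start = 0.0
        while start < duration:
            end = min(start + length, duration)
            moments.append((start, end))
            start += stride
    return moments

def location_offset(moment: Tuple[float, float],
                    gt: Tuple[float, float]) -> Tuple[float, float]:
    """Regression targets [delta_s, delta_e] = [t_s, t_e] - [tau_s, tau_e]."""
    (tau_s, tau_e), (t_s, t_e) = moment, gt
    return (t_s - tau_s, t_e - tau_e)

# Example: a 300s video and a query grounded at [24s, 30s].
candidates = segment_video(300.0)
print(location_offset(candidates[8], (24.0, 30.0)))  # (0.0, 0.0)
```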

Page 12

Learning Model - Memory Attention Network

• Queries often contain temporal-constraint words, such as "first", "second", and "closer to"; therefore, the temporal context is useful for localization.

• Not all context moments have the same influence on localization: the near context is more important than the far context.

Page 13

Learning Model - Memory Attention Network

Memory cell: $\mathbf{m}_{m+j} = \mathbf{W}_v \mathbf{v}_{m+j} + \mathbf{b}_v$, with $j \in [-n_c, n_c]$

Query-guided relevance: $e(\mathbf{v}_{m+j}, \mathbf{q}) = \sigma(\mathbf{W}_v \mathbf{v}_{m+j} + \mathbf{b}_v)^{\top} \sigma(\mathbf{W}_q \mathbf{q} + \mathbf{b}_q)$

Attention weights: $\alpha_j^m = \dfrac{e(\mathbf{v}_{m+j}, \mathbf{q})}{\sum_{k=-n_c}^{n_c} e(\mathbf{v}_{m+k}, \mathbf{q})}$

Attended context: $\mathbf{c}_m = \sum_{j \in [-n_c, n_c]} \alpha_j^m \, \mathbf{m}_{m+j}$
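A minimal NumPy sketch of the attention above, assuming $\sigma$ is the sigmoid and the weights are already learned; the shapes and variable names (W_v, b_v, W_q, b_q, n_c) follow the reconstructed equations rather than any released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_attention(V_ctx, q, W_v, b_v, W_q, b_q):
    """V_ctx: (2*n_c + 1, d) visual features of a moment's context
    window; q: (d,) query feature. Returns the attended context c_m."""
    # Memory cells: m_j = W_v v_j + b_v, one per context moment.
    M = V_ctx @ W_v.T + b_v                  # (2*n_c + 1, d)
    # Relevance e(v_j, q) = sigmoid(W_v v_j + b_v)^T sigmoid(W_q q + b_q).
    key = sigmoid(W_q @ q + b_q)             # (d,)
    e = sigmoid(M) @ key                     # (2*n_c + 1,)
    # Normalized weights alpha_j = e_j / sum_k e_k.
    alpha = e / e.sum()
    # Attended context: c_m = sum_j alpha_j m_j.
    return alpha @ M

# Toy shapes: n_c = 2 context moments on each side, feature size d = 8.
rng = np.random.default_rng(0)
d = 8
c_m = memory_attention(rng.normal(size=(5, d)), rng.normal(size=d),
                       rng.normal(size=(d, d)), np.zeros(d),
                       rng.normal(size=(d, d)), np.zeros(d))
print(c_m.shape)  # (8,)
```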

Page 14

Learning Model - Cross-Modal Fusion

[Fusion figure] The mean-pooled moment feature $\tilde{\mathbf{v}}_m$ and the query feature $\mathbf{q}$, each appended with a constant 1, are combined by an outer product $\otimes$, yielding inter-modal, intra-visual, and intra-query interaction blocks:

$\mathbf{f} = \begin{bmatrix} \tilde{\mathbf{v}}_m \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \mathbf{q} \\ 1 \end{bmatrix} = [\tilde{\mathbf{v}}_m,\ \tilde{\mathbf{v}}_m \otimes \mathbf{q},\ \mathbf{q},\ 1]$

The output of this fusion procedure captures the intra-modal and the inter-modal feature interactions to generate the moment-query representation.
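A sketch of the append-one outer-product fusion, assuming the moment feature is the mean-pooled attended feature; the flattening order of the output vector is an implementation choice, not specified by the slides.

```python
import numpy as np

def cross_modal_fusion(v_m: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Outer product of [v_m; 1] and [q; 1]. The resulting blocks are
    v_m x q (inter-modal), v_m and q (intra-modal), and a constant 1."""
    v_ext = np.append(v_m, 1.0)             # (d_v + 1,)
    q_ext = np.append(q, 1.0)               # (d_q + 1,)
    return np.outer(v_ext, q_ext).ravel()   # ((d_v + 1) * (d_q + 1),)

f = cross_modal_fusion(np.ones(3) * 2.0, np.ones(4) * 3.0)
print(f.shape)  # (20,) = (3 + 1) * (4 + 1)
```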

Page 15

Learning Model - Loss Function

The fused representation is fed into a two-layer MLP model whose output is a three-dimensional vector $[s_{mq}, \delta_s, \delta_e]$: an alignment score and the two location offsets.

$\mathcal{L} = \mathcal{L}_{align} + \lambda \mathcal{L}_{loc}$

$\mathcal{L}_{align} = \lambda_p \sum_{(m,q)\in\mathcal{P}} \log\big(1 + \exp(-s_{mq})\big) + \lambda_n \sum_{(m,q)\in\mathcal{N}} \log\big(1 + \exp(s_{mq})\big)$

$\mathcal{L}_{loc} = \sum_{(m,q)\in\mathcal{P}} \big[ R(\delta_s^{*} - \delta_s) + R(\delta_e^{*} - \delta_e) \big]$

where $\mathcal{P}$ and $\mathcal{N}$ are the sets of aligned and misaligned moment-query pairs, $\delta^{*}$ denotes the ground-truth offsets, and $R(\cdot)$ is a regression penalty such as the smooth $L_1$ loss.
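A runnable sketch of the combined loss under the reconstruction above, assuming $R(\cdot)$ is the smooth $L_1$ penalty; the weight names (lam, lam_p, lam_n) and the helper acrn_loss are illustrative, not from the paper's code.

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """Smooth L1 (Huber-style) penalty, one common choice for R(.)."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x**2, absx - 0.5)

def acrn_loss(s_pos, s_neg, delta_pred, delta_gt,
              lam=1.0, lam_p=1.0, lam_n=1.0):
    """s_pos / s_neg: alignment scores of aligned / misaligned pairs;
    delta_pred / delta_gt: (N, 2) predicted and ground-truth offsets
    [delta_s, delta_e] for the aligned pairs."""
    l_align = (lam_p * np.log1p(np.exp(-s_pos)).sum()
               + lam_n * np.log1p(np.exp(s_neg)).sum())
    l_loc = smooth_l1(delta_gt - delta_pred).sum()
    return l_align + lam * l_loc

print(acrn_loss(np.array([2.0]), np.array([-1.5]),
                np.zeros((1, 2)), np.array([[0.3, -0.2]])))
```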

Page 16

Pipeline

• Background

• Learning Model

• Experiment

• Conclusion

Page 17

Experiment - Dataset

• Datasets: TACoS and DiDeMo

• Evaluation metric: R(n, m) = "R@n, IoU = m", the fraction of queries for which at least one of the top-n retrieved moments has a temporal IoU of at least m with the ground truth.
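The metric is straightforward to compute; a small sketch follows, with the helper names temporal_iou and recall_at_n being illustrative.

```python
from typing import List, Tuple

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Temporal intersection-over-union of two [start, end] intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(ranked: List[List[Tuple[float, float]]],
                gt: List[Tuple[float, float]],
                n: int, m: float) -> float:
    """R(n, m): fraction of queries whose top-n moments contain at
    least one moment with IoU >= m against the ground truth."""
    hits = sum(any(temporal_iou(p, g) >= m for p in preds[:n])
               for preds, g in zip(ranked, gt))
    return hits / len(gt)

# One query whose second-ranked moment overlaps the ground truth well.
preds = [[(0.0, 6.0), (23.0, 31.0)]]
print(recall_at_n(preds, [(24.0, 30.0)], n=2, m=0.5))  # 1.0
```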

Page 18

Experiment - Performance Comparison

Page 19

Experiment - Model Variants

• ACRN-a (pink): mean-pools the context features as the moment feature

• ACRN-m (purple): attention model without the memory part

• ACRN-c (blue): concatenates the multi-modal features (replacing the outer-product fusion)

Page 20

Experiment - Qualitative Result

Page 21

Pipeline

• Background

• Learning Model

• Experiment

• Conclusion

Page 22

Conclusion

• We present a novel Attentive Cross-Modal Retrieval Network (ACRN), which jointly characterizes the attentive contextual visual feature and the cross-modal feature representation.

• We introduce a temporal memory attention network to memorize the contextual information for each moment, and treat the natural-language query as the input of an attention network to adaptively assign weights to the memory representations.

• We perform extensive experiments on two benchmark datasets to demonstrate the performance improvement.

Page 23

Thank you

Q&A