Multimodal Residual Learning for Visual QA


Page 1: Multimodal Residual Learning for Visual QA

Multimodal Residual Learning for Visual QA

NamHyuk Ahn

Page 2: Multimodal Residual Learning for Visual QA

Table of Contents

1. Visual QA

2. Stacked Attention Network (SAN)

3. Residual Learning

4. Multimodal Residual Network (MRN)

Page 3: Multimodal Residual Learning for Visual QA

Visual QA: Evaluation Metric

- Robust to inter-human variability (see the metric after the dataset statistics)

- Human accuracy is almost 90%

- 248,349 training questions (82,783 images)

- 121,512 validation questions (40,504 images)

- 244,302 testing questions (81,434 images)
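For reference, the accuracy metric from the VQA paper (Antol et al.) averages agreement with ten human annotators; an answer counts as fully correct when at least three annotators gave it:

\mathrm{Acc}(a) = \min\left( \frac{\#\{\text{humans that answered } a\}}{3},\ 1 \right)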

Page 4: Multimodal Residual Learning for Visual QA

Stacked Attention Network

Page 5: Multimodal Residual Learning for Visual QA

Motivation

- Answering a question requires multi-step reasoning

- e.g., an image containing the objects {bicycles, window, street, baskets, dogs}

- To answer the question well, the model must pinpoint the relevant region.

Q: what are sitting in the basket on a bicycle

Page 6: Multimodal Residual Learning for Visual QA

Stacked Attention Network (SAN)

- SAN allows multi-step reasoning for visual QA

- An extension of the attention mechanism, which has been successfully applied in captioning, translation, etc.

Q: what are sitting in the basket on a bicycle

Page 7: Multimodal Residual Learning for Visual QA

Stacked Attention Network

- Image Model: extract image features using a CNN

- Question Model: extract a semantic vector using a CNN or LSTM

- Stacked Attention: multi-step reasoning with attention layers

Page 8: Multimodal Residual Learning for Visual QA

Image / Question Model

- Image Model

• Get a feature map from the raw image pixels

• Rescale the image to 448x448 and take features from pool5 of VGGNet (14x14x512); a sketch follows below

• An additional layer maps the image feature to the question feature space

- Question Model

• Extract a semantic vector using a CNN or LSTM
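A minimal sketch of the image pipeline above, using torchvision's pretrained VGG-19 (the exact VGG variant and preprocessing constants are assumptions; "example.jpg" is a placeholder):

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Convolutional part of VGG-19, ending at pool5
vgg_pool5 = models.vgg19(pretrained=True).features.eval()

preprocess = T.Compose([
    T.Resize((448, 448)),                     # rescale to 448x448 as on the slide
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)   # 1x3x448x448
with torch.no_grad():
    fmap = vgg_pool5(img)                                  # 1x512x14x14 (pool5)
regions = fmap.flatten(2).transpose(1, 2)                  # 1x196x512: one vector per region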

Page 9: Multimodal Residual Learning for Visual QA

Stacked Attention Model

- A global image feature is suboptimal due to noise from irrelevant objects/regions.

- Instead, use the stacked attention model to pinpoint the relevant region.

- Given the image feature matrix and the question vector, compute a 14x14 attention distribution over regions.

- Take the weighted sum of the image vectors over the regions.

- Add the question vector to obtain a refined query vector (see the equations below).
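For concreteness, one attention layer can be written as follows (following Yang et al.; $v_I$ is the image feature matrix with one column per region, $v_Q$ the question vector):

h_A = \tanh\big(W_{I,A} v_I \oplus (W_{Q,A} v_Q + b_A)\big)
p_I = \mathrm{softmax}(W_P h_A + b_P)
\tilde{v}_I = \textstyle\sum_i p_i v_i, \qquad u = \tilde{v}_I + v_Q

Here $\oplus$ adds the vector to every column of the matrix; stacking repeats the layer with $u^k = \tilde{v}_I^k + u^{k-1}$ as the new query.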

Page 10: Multimodal Residual Learning for Visual QA

Result

Page 11: Multimodal Residual Learning for Visual QA
Page 12: Multimodal Residual Learning for Visual QA

Residual Learning

Page 13: Multimodal Residual Learning for Visual QA

Problem of Degradation

- More depth gives more accuracy, but deep networks can suffer from vanishing/exploding gradients

• BN, Xavier initialization, and Dropout can handle this (up to ~30 layers)

- With even more depth, the degradation problem occurs

• Not just overfitting: training error also increases

Page 14: Multimodal Residual Learning for Visual QA

Residual Network (ResNet)

Residual Block

- To avoid the degradation problem, add a shortcut connection.

- Element-wise addition of F(x) and the shortcut connection x, then pass through ReLU: y = ReLU(F(x) + x). A sketch follows below.

- Similar to LSTM

(Figure: a residual block with a shortcut connection, from http://torch.ch/blog/2016/02/04/resnets.html)
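A minimal PyTorch sketch of the basic residual block described above (identity-shortcut case; channel counts are assumed equal so no projection is needed):

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 conv layers with batch norm
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # element-wise addition with the shortcut, then ReLU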

Page 15: Multimodal Residual Learning for Visual QA

Multimodal Residual Network

Page 16: Multimodal Residual Learning for Visual QA

Introduction

- Extends deep residual learning to visual QA

- Achieved state-of-the-art results on the visual QA dataset (no longer, as of this talk :()

- Introduces a method to visualize the spatial attention effect of joint residual mappings

Page 17: Multimodal Residual Learning for Visual QA

Background

- SAN: question information contributes only weakly, which causes a bottleneck

- Baseline [Lu et al.]: with just element-wise multiplication, visual and question features embed very well

- MRN: shortcut mappings and a stacking architecture

- No attention weighted sum; instead, MRN uses the global element-wise multiplication that [Lu et al.] does (a sketch of one learning block follows)
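A minimal sketch of one MRN learning block following Kim et al.: the question embedding is carried by a learned linear shortcut (not identity; see Page 20), and the joint residual function multiplies the two modalities element-wise. The tanh nonlinearity follows the paper's formulation; hidden dimensions are illustrative, so treat this as a sketch, not the reference implementation.

import torch.nn as nn

class MRNBlock(nn.Module):
    # One learning block: H(q, v) = W_shortcut q + F(q, v),
    # where F(q, v) = tanh(W_q q) * tanh(W_2 tanh(W_1 v))
    def __init__(self, dim_q, dim_v, dim_h):
        super().__init__()
        self.q_embed = nn.Sequential(nn.Linear(dim_q, dim_h), nn.Tanh())
        self.v_embed = nn.Sequential(nn.Linear(dim_v, dim_h), nn.Tanh(),
                                     nn.Linear(dim_h, dim_h), nn.Tanh())
        # learned linear mapping on the shortcut; an identity shortcut degrades (Page 20)
        self.shortcut = nn.Linear(dim_q, dim_h)

    def forward(self, q, v):
        return self.shortcut(q) + self.q_embed(q) * self.v_embed(v)

Blocks are stacked by feeding the output back in as q while the same visual feature v enters every block; L = 3 blocks worked best (Page 21).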

Page 18: Multimodal Residual Learning for Visual QA
Page 19: Multimodal Residual Learning for Visual QA
Page 20: Multimodal Residual Learning for Visual QA

Quantitative Analysis

- (a) shows a large improvement over SAN; (b) is better.

- (c): adding an extra embedding on the question side causes overfitting.

- (d): an identity shortcut causes degradation (an extra linear mapping is needed).

- (e): performs reasonably, but the extra shortcut is not essential.

Page 21: Multimodal Residual Learning for Visual QA

Quantitative Analysis

# of Learning Blocks

- 58.85% (L=1), 59.44% (L=2), 60.53% (L=3), 60.42% (L=4)

Visual Features

- ResNet-152 is significantly better than VGGNet

- even though ResNet's feature dimension is smaller (2048 vs. 4096)

# of Answer Classes

- There is a trade-off among answer types, but 2k classes is best

Page 22: Multimodal Residual Learning for Visual QA

- Implicit attention via element-wise multiplication

- Yields a high-resolution attention map

Page 23: Multimodal Residual Learning for Visual QA
Page 24: Multimodal Residual Learning for Visual QA

Reference

- Yang, Zichao, et al. "Stacked Attention Networks for Image Question Answering." arXiv preprint arXiv:1511.02274 (2015).

- Kim, Jin-Hwa, et al. "Multimodal Residual Learning for Visual QA." arXiv preprint arXiv:1606.01455 (2016).

- Antol, Stanislaw, et al. "VQA: Visual Question Answering." Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.