Multimodal Residual Learning for Visual Question-Answering

Jin-Hwa Kim, BI Lab, Seoul National University


Page 1: Multimodal Residual Learning for Visual Question-Answering

Jin-Hwa Kim

BI Lab, Seoul National University

Multimodal Residual Learning for Visual Question-Answering

Page 2: Multimodal Residual Learning for Visual Question-Answering

Table of Contents

1. VQA: Visual Question Answering

2. Vision Part

3. Question Part

4. Multimodal Residual Learning

5. Results

6. Discussion

7. Recent Works

8. Q&A

Page 3: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Page 4: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

VQA is a new dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge.

VQA Challenge

Antol et al., ICCV 2015

Page 5: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Examples of answers given by humans from the image and question together, or from the question only

VQA Challenge

Antol et al., ICCV 2015

Page 6: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Numbers

# images: 204,721 (MS COCO)
# questions: 760K (3 per image)
# answers: 10M (10 per question + α)

             Images   Questions   Answers
Training     80K      240K        2.4M
Validation   40K      120K        1.2M
Test         80K      240K        (withheld)

Antol et al., ICCV 2015

Page 7: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Test Dataset

80K test images / Four splits of 20K images each

Test-dev (development)
Debugging and validation - 10/day submission to the evaluation server.

Test-standard (publications)
Used to score entries for the Public Leaderboard.

Test-challenge (competitions)
Used to rank challenge participants.

Test-reserve (check overfitting)
Used to estimate overfitting. Scores on this set are never released.

Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015

Page 8: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

Page 9: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet: A Thing among Convolutional Neural Networks

He et al., CVPR 2016

1st place on the ImageNet 2015 classification task
1st place on the ImageNet 2015 detection task
1st place on the ImageNet 2015 localization task
1st place on the COCO object detection task
1st place on the COCO object segmentation task

Page 10: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet-152: 152-Layered Convolutional Neural Networks

He et al., CVPR 2016

[Architecture diagram: input size 3x224x224 → 152 convolutional layers (final bottleneck block: Conv 1x1, 512 → Conv 3x3, 512 → Conv 1x1, 2048) → AveragePooling (7x7x2048 → 1x1x2048) → Linear → Softmax, with the output size noted at each stage.]

Page 11: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet-152 as a visual feature extractor

He et al., CVPR 2016

[Same architecture diagram: the 1x1x2048 average-pooled activation, just before the final Linear and Softmax, is taken as the visual feature.]

Page 12: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet-152

He et al., CVPR 2016

Pre-trained models are available!

For Torch, TensorFlow, Theano, Caffe, Lasagne, Neon, and MatConvNet

https://github.com/KaimingHe/deep-residual-networks
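As a rough illustration of the feature-extractor use on the previous slide, the sketch below loads a pre-trained ResNet-152 with PyTorch/torchvision, drops the final classifier, and keeps the 2048-d average-pooled activation as the image feature. The file name and preprocessing constants are illustrative assumptions, not necessarily the pipeline used in the talk.

# Minimal sketch: pre-trained ResNet-152 as a fixed 2048-d visual feature extractor.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()          # drop the 1000-way classifier head
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)   # 1 x 3 x 224 x 224
with torch.no_grad():
    v = resnet(img)                      # 1 x 2048 pooled visual feature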

Page 13: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Page 14: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Word-Embedding

What color are her eyes?

preprocessing → what | color | are | her | eyes | ?

indexing → 53 | 7 | 44 | 127 | 2 | 6177

lookup → w53 | w7 | w44 | w127 | w2 | w6177

[Lookup table: each index selects one row vector wiT of the word-embedding matrix.]

{wi} are learnable parameters updated by the back-propagation algorithm.
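A minimal sketch of this lookup-table embedding, assuming PyTorch; the vocabulary size and embedding dimension are placeholders, and the token indices are the ones from the slide.

import torch
import torch.nn as nn

vocab_size, embed_dim = 20000, 620             # illustrative sizes
lookup = nn.Embedding(vocab_size, embed_dim)   # rows w_i, trained by back-propagation

# "what color are her eyes ?" after preprocessing and indexing:
token_ids = torch.tensor([[53, 7, 44, 127, 2, 6177]])
word_vectors = lookup(token_ids)               # shape: 1 x 6 x 620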

Page 15: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Question-Embedding

Step 0: h1 = RNN(w53 (what), h0)
Step 1: h2 = RNN(w7 (color), h1)
...
Step 5: h6 = RNN(w6177 (?), h5) → use this last hidden state as the question embedding

Page 16: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Choice of RNN: Gated Recurrent Units (GRU)

Cho et al., EMNLP 2014; Chung et al., arXiv 2014

z = σ(xt Uz + st-1 Wz)

r = σ(xt Ur + st-1 Wr)

h = tanh(xt Uh + (st-1 ∘ r) Wh)

st = (1 - z) ∘ h + z ∘ st-1
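A literal NumPy transcription of the GRU update above, as a sketch; the parameter shapes and the toy usage are assumptions for illustration only.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, s_prev, Uz, Wz, Ur, Wr, Uh, Wh):
    z = sigmoid(x_t @ Uz + s_prev @ Wz)        # update gate
    r = sigmoid(x_t @ Ur + s_prev @ Wr)        # reset gate
    h = np.tanh(x_t @ Uh + (s_prev * r) @ Wh)  # candidate state
    return (1.0 - z) * h + z * s_prev          # new state s_t

# toy usage with random parameters
d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
params = [rng.standard_normal((d_in, d_hid)) if i % 2 == 0
          else rng.standard_normal((d_hid, d_hid)) for i in range(6)]
s_t = gru_step(rng.standard_normal(d_in), np.zeros(d_hid), *params)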

Page 17: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Skip-Thought Vectors

Pre-trained model for word-embedding and question-embedding

Kiros et al., NIPS 2015

I got back home.
I could see the cat on the steps.
This was strange.

Given the middle sentence, the model is trained to reconstruct the previous sentence and the next sentence.

Trained on the BookCorpus dataset (Zhu et al., arXiv 2015)

Page 18: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Skip-Thought Vectors

Pre-trained model for word-embedding and question-embedding

Its encoder as a Sent2Vec model

Kiros et al., NIPS 2015

[Diagram: a lookup table (w1T, w2T, w3T, w4T) feeding a pre-trained GRU (Gated Recurrent Units) encoder.]

Page 19: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Skip-Thought Vectors

Pre-trained model (Theano) and porting code (Torch) are available!

https://github.com/ryankiros/skip-thoughts

https://github.com/HyeonwooNoh/DPPnet/tree/master/003_skipthoughts_porting

Noh et al., CVPR 2016; Kiros et al., NIPS 2015

Page 20: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Page 21: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Idea 1: Deep Residual Learning

Extend the idea of deep residual learning to multimodal learning

He et al., CVPR 2016

[Figure 2 (He et al.): Residual learning, a building block. The input x passes through two weight layers with ReLU to give F(x), and an identity shortcut adds x back: output = F(x) + x.]
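For reference, a generic sketch of this building block (two weight layers with ReLU and an identity shortcut), written here with linear layers in PyTorch; it illustrates the idea and is not He et al.'s code.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(              # F(x): weight layer, ReLU, weight layer
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)      # identity shortcut: F(x) + x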

Page 22: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Idea 2: Hadamard Product for Joint Residual Mapping

One modality is directly involved in the gradient with respect to the other modality:

∂(σ(x) ∘ σ(y)) / ∂x = diag(σ′(x) ∘ σ(y))

[Baseline diagram: vQ and vI each pass through Tanh and are merged before a softmax classifier.]

https://github.com/VT-vision-lab/VQA_LSTM_CNN

Scaling problem? (Wu et al., NIPS 2016)
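A sketch of this kind of joint embedding: each modality goes through Linear + Tanh and the two are merged by a Hadamard product before the answer classifier. Dimensions and names below are assumptions for illustration.

import torch.nn as nn

class HadamardFusion(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, joint_dim=1200, num_answers=1000):
        super().__init__()
        self.q_proj = nn.Sequential(nn.Linear(q_dim, joint_dim), nn.Tanh())
        self.v_proj = nn.Sequential(nn.Linear(v_dim, joint_dim), nn.Tanh())
        self.classifier = nn.Linear(joint_dim, num_answers)   # softmax applied in the loss

    def forward(self, q, v):
        joint = self.q_proj(q) * self.v_proj(v)   # Hadamard product: each modality appears
        return self.classifier(joint)             # in the gradient w.r.t. the other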

Page 23: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Multimodal Residual Networks

Kim et al., NIPS 2016

[Overview diagram: the question ("What kind of animals are these ?") goes through a word embedding (word2vec, Mikolov et al., 2013; skip-thought vectors, Kiros et al., 2015) and an RNN to give Q; the image goes through a CNN (ResNet, He et al., 2015) to give V. The Multimodal Residual Networks combine them with Hadamard products and question shortcuts, and a softmax predicts the answer ("sheep").]

Page 24: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Multimodal Residual Networks

Kim et al., NIPS 2016

[Architecture diagram: three stacked learning blocks. In block l, the question stream Hl-1 (H0 = Q) passes through Linear-Tanh and the visual vector V through Linear-Tanh-Linear-Tanh; their Hadamard product ⊙ is added ⊕ to a Linear shortcut of the question stream, giving Hl. After H3, a Linear layer and a Softmax over the answer set A produce the prediction.]
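The diagram can be read as the following PyTorch-style sketch of MRN with L = 3 learning blocks; it follows the description above (question stream with linear shortcuts, Hadamard-product joint residuals), but the exact sizes are placeholders, not the released implementation.

import torch.nn as nn

class MRNBlock(nn.Module):
    def __init__(self, q_dim, v_dim, joint_dim):
        super().__init__()
        self.q_path = nn.Sequential(nn.Linear(q_dim, joint_dim), nn.Tanh())
        self.v_path = nn.Sequential(nn.Linear(v_dim, joint_dim), nn.Tanh(),
                                    nn.Linear(joint_dim, joint_dim), nn.Tanh())
        self.shortcut = nn.Linear(q_dim, joint_dim)           # linear question shortcut

    def forward(self, h, v):
        return self.shortcut(h) + self.q_path(h) * self.v_path(v)   # shortcut + Hadamard joint

class MRN(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, joint_dim=1200, num_answers=2000, L=3):
        super().__init__()
        dims = [q_dim] + [joint_dim] * L
        self.blocks = nn.ModuleList(
            MRNBlock(dims[l], v_dim, dims[l + 1]) for l in range(L))
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, q, v):
        h = q
        for block in self.blocks:
            h = block(h, v)
        return self.classifier(h)            # softmax over the answer set at the loss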

Page 25: Multimodal Residual Learning for Visual Question-Answering

5. Results

Page 26: Multimodal Residual Learning for Visual Question-Answering

5. Results

[Diagrams of the alternative models (a)-(e) from Table 1: each learning block merges the question stream (Q or Hl) and the visual vector V with a Hadamard product ⊙ and a shortcut addition ⊕; the variants differ in where the Linear/Tanh layers sit on the question and vision paths and in the shortcut (identity, a linear mapping, or none, with a special case when l = 1).]

Table 1: The results of alternative models (a)-(e) on the test-dev.

      Open-Ended
      All    Y/N    Num.   Other
(a)   60.17  81.83  38.32  46.61
(b)   60.53  82.53  38.34  46.78
(c)   60.19  81.91  37.87  46.70
(d)   59.69  81.67  37.23  46.00
(e)   60.20  81.98  38.25  46.57

Table 2: The effect of the visual features and # of target answers on the test-dev results. Vgg for VGG-19, and Res for ResNet-152 features described in Section 4.

          Open-Ended
          All    Y/N    Num.   Other
Vgg, 1k   60.53  82.53  38.34  46.78
Vgg, 2k   60.79  82.13  38.87  47.52
Vgg, 3k   60.68  82.40  38.69  47.10
Res, 1k   61.45  82.36  38.40  48.81
Res, 2k   61.68  82.28  38.82  49.25
Res, 3k   61.47  82.28  39.09  48.76

5 Results

The VQA Challenge, which released the VQA dataset, provides evaluation servers for the test-dev and test-standard test splits. For test-dev, the evaluation server permits unlimited submissions for validation, while test-standard permits limited submissions for the competition. We report accuracies in percentage.

Alternative Models The test-dev results for the Open-Ended task of the alternative models are shown in Table 1. (a) shows a significant improvement over SAN; however, (b) is marginally better than (a). Compared to (b), (c) deteriorates the performance: an extra embedding for a question vector may easily cause overfitting, leading to overall degradation. And identity shortcuts in (d) cause the degradation problem, too; the extra parameters of the linear mappings may effectively support the task. (e) shows a reasonable performance, but the extra shortcut is not essential. The empirical results seem to support this idea: the question-only model (50.39%) achieves a result competitive with the joint model (57.75%), while the image-only model gets a poor accuracy (28.13%) (see Table 2 in [2]). Eventually, we chose model (b) for its best performance and relative simplicity.

Number of Learning Blocks To confirm the effectiveness of the number of learning blocks selected via a pilot test (L = 3), we explore this on the chosen model (b) again. As the depth increases, the overall accuracies are 58.85 (L = 1), 59.44 (L = 2), 60.53 (L = 3) and 60.42 (L = 4).

Visual Features The ResNet-152 visual features are significantly better than the VGG-19 features for the Other type in Table 2, even though the dimension of the ResNet features (2,048) is half that of the VGG features (4,096). The ResNet visual features are also used in the previous work [9]; however, our model achieves a remarkably better performance with a large margin (see Table 3).

Number of Target Answers The number of target answers slightly affects the overall accuracy, with a trade-off among answer types, so the decision on the number of target answers is difficult to make. We simply chose Res, 2k in Table 2 based on the overall accuracy.

Comparisons with State-of-the-art Our chosen model significantly outperforms other state-of-the-art methods on both the Open-Ended and Multiple-Choice tasks in Table 3. However, the performance on the Number and Other types is still unsatisfactory compared to human performance, though the advances in recent works were mainly for Other-type answers. This fact motivates the study of a counting mechanism in future work. The model comparison is performed on the test-standard results.


A Appendix

A.1 VQA test-dev Results

Table 1: The effects of various options for VQA test-dev. Here, the model of Figure 3a is used, since these experiments are conducted preliminarily. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [6] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [3], and c stands for the usage of postprocessing using an image captioning model [5].

           Open-Ended                      Multiple-Choice
           All    Y/N    Num.   Other      All    Y/N    Num.   Other
baseline   58.97  81.11  37.63  44.90      63.53  81.13  38.91  54.06
s          59.38  80.65  38.30  45.98      63.71  80.68  39.73  54.65
s,b        59.74  81.75  38.13  45.84      64.15  81.77  39.54  54.67
s,b,c      59.91  81.75  38.13  46.19      64.18  81.77  39.51  54.72

Table 2: The results for VQA test-dev. The precision of some accuracies [11, 2, 10] is one digit less than the others, so they are zero-filled to match.

                 Open-Ended                      Multiple-Choice
                 All    Y/N    Num.   Other      All    Y/N    Num.   Other
Question [1]     48.09  75.66  36.70  27.14      53.68  75.71  37.05  38.64
Image [1]        28.13  64.01  00.42  03.77      30.53  69.87  00.45  03.76
Q+I [1]          52.64  75.55  33.67  37.37      58.97  75.59  34.35  50.33
LSTM Q [1]       48.76  78.20  35.68  26.59      54.75  78.22  36.82  38.78
LSTM Q+I [1]     53.74  78.94  35.24  36.42      57.17  78.95  35.80  43.41
Deep Q+I [7]     58.02  80.87  36.46  43.40      62.86  80.88  37.78  53.14
DPPnet [8]       57.22  80.71  37.24  41.69      62.48  80.79  38.94  52.16
D-NMN [2]        57.90  80.50  37.40  43.10      -      -      -      -
SAN [11]         58.70  79.30  36.60  46.10      -      -      -      -
ACK [9]          59.17  81.01  38.42  45.23      -      -      -      -
FDA [4]          59.24  81.14  36.16  45.77      64.01  81.50  39.00  54.72
DMN+ [10]        60.30  80.50  36.80  48.30      -      -      -      -
Vgg, 1k          60.53  82.53  38.34  46.78      64.79  82.55  39.93  55.23
Vgg, 2k          60.77  82.10  39.11  47.46      65.27  82.12  40.84  56.39
Vgg, 3k          60.68  82.40  38.69  47.10      65.09  82.42  40.13  55.93
Res, 1k          61.45  82.36  38.40  48.81      65.62  82.39  39.65  57.15
Res, 2k          61.68  82.28  38.82  49.25      66.15  82.30  40.45  58.16
Res, 3k          61.47  82.28  39.09  48.76      66.33  82.41  39.57  58.40


Page 27: Multimodal Residual Learning for Visual Question-Answering

5. Results

Table 3: The effects of shortcut connections of MRN for VQA test-dev. ResNet-152 features and 2k target answers are used. MN stands for Multimodal Networks without residual learning, which does not have any shortcut connections. Dim. stands for the common embedding vector's dimension. The number of parameters for word embedding (9.3M) and question embedding (21.8M) is subtracted from the total number of parameters in this table.

                           Open-Ended
      L  Dim.   #params    All    Y/N    Num.   Other
MN    1  4604   33.9M      60.33  82.50  36.04  46.89
MN    2  2350   33.9M      60.90  81.96  37.16  48.28
MN    3  1559   33.9M      59.87  80.55  37.53  47.25
MRN   1  3355   33.9M      60.09  81.78  37.09  46.78
MRN   2  1766   33.9M      61.05  81.81  38.43  48.43
MRN   3  1200   33.9M      61.68  82.28  38.82  49.25
MRN   4  851    33.9M      61.02  82.06  39.02  48.04

A.2 More Examples

Figure 1: More examples of Figure 4 in Section 5.2.


Page 28: Multimodal Residual Learning for Visual Question-Answering

5. Results

Table 3: The VQA test-standard results. The precision of some accuracies [30, 1] is one digit less than the others, so they are zero-filled to match.

                 Open-Ended                      Multiple-Choice
                 All    Y/N    Num.   Other      All    Y/N    Num.   Other
DPPnet [21]      57.36  80.28  36.92  42.24      62.69  80.35  38.79  52.79
D-NMN [1]        58.00  -      -      -          -      -      -      -
Deep Q+I [11]    58.16  80.56  36.53  43.73      63.09  80.59  37.70  53.64
SAN [30]         58.90  -      -      -          -      -      -      -
ACK [27]         59.44  81.07  37.12  45.83      -      -      -      -
FDA [9]          59.54  81.34  35.67  46.10      64.18  81.25  38.30  55.20
DMN+ [28]        60.36  80.43  36.82  48.33      -      -      -      -
MRN              61.84  82.39  38.23  49.41      66.33  82.41  39.57  58.40
Human [2]        83.30  95.77  83.39  72.67      -      -      -      -

5.1 Visualization

In Equation 3, the left term σ(Wq q) can be seen as a masking (attention) vector that selects a part of the visual information. We assume that the difference between the right term V = σ(W2 σ(W1 v)) and the masked vector F(q, v) indicates an attention effect caused by the masking vector. Then, the attention effect Latt = (1/2)(V − F)² is visualized on the image by calculating the gradient of Latt with respect to a given image I:

∂Latt/∂I = (∂V/∂I)(V − F)   (6)

This technique can be applied to each learning block in a similar way. Since we use preprocessed visual features, the pretrained CNN is augmented to reach the input image only for this visualization. Note that model (b) in Table 1 is used for this visualization, and the pretrained VGG-19 is used for preprocessing and augmentation. The model is trained using the training set of the VQA dataset and visualized using the validation set. Examples are shown in Figure 4.

Unlike other works [30, 28] using explicit attention parameters, MRN does not use any explicit attention mechanism. Still, we observe the interpretability of element-wise multiplication as information masking, which leads to a novel method to visualize the attention effect of this operation. Since MRN does not depend on a few attention parameters (e.g. 14 × 14), our visualization method shows a higher resolution than others [30, 28]. Based on this, we argue that MRN is an implicit attention model without an explicit attention mechanism.
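In practice, Equation (6) can be realized with automatic differentiation. The sketch below is a hedged approximation: cnn is assumed to be the augmented pretrained network reaching the raw image, block is one learning block with the q_path/v_path naming used in the earlier MRN sketch, and detaching F reproduces the gradient (∂V/∂I)(V − F).

import torch

def attention_map(cnn, block, image, q):
    image = image.clone().requires_grad_(True)
    v = cnn(image)                               # visual feature computed from the raw image
    V = block.v_path(v)                          # vision-only term  V = sigma(W2 sigma(W1 v))
    F = block.q_path(q) * V                      # masked term       F(q, v)
    L_att = 0.5 * ((V - F.detach()) ** 2).sum()  # attention effect; F is treated as constant
    L_att.backward()                             # dL/dI = dV/dI (V - F), as in Eq. (6)
    return image.grad.abs().sum(dim=1)           # per-pixel magnitude of the attention effect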

6 Conclusions

The idea of deep residual learning is applied to visual question-answering tasks. Based on the two observations of the previous works, various alternative models are suggested and validated to propose the three-block layered MRN. Our model achieves the state-of-the-art results on the VQA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the spatial attention from the collapsed visual features using back-propagation.

Acknowledgments

This work was supported by Naver Corp. and the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF).


Page 29: Multimodal Residual Learning for Visual Question-Answering

6. Discussions

Page 30: Multimodal Residual Learning for Visual Question-Answering

6. Discussions

Visualization

[The MRN architecture diagram again (three learning blocks with Hadamard products ⊙ and shortcut additions ⊕, followed by Linear and Softmax), with the pretrained model attached to the visual input so that a 1st, 2nd, and 3rd visualization can be computed, one per learning block.]

Page 31: Multimodal Residual Learning for Visual Question-Answering

6. Discussions

Examples

[Visualization examples (a)-(h):
(a) What kind of animals are these ? → sheep
(b) What animal is the picture ? → elephant
(c) What is this animal ? → zebra
(d) What game is this person playing ? → tennis
(e) How many cats are here ? → 2
(f) What color is the bird ? → yellow
(g) What sport is this ? → surfing
(h) Is the horse jumping ? → yes]


Page 33: Multimodal Residual Learning for Visual Question-Answering

Acknowledgments

This work was supported by Naver Corp. and partly by the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF, ADD-UD130070ID-BMRR).

Page 34: Multimodal Residual Learning for Visual Question-Answering

Recent Works

Page 35: Multimodal Residual Learning for Visual Question-Answering

Recent Works: Low-rank Bilinear Pooling

Page 36: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Kim et al., arXiv 2016

Bilinear model:

fi = Σj=1..N Σk=1..M wijk xj yk + bi = x^T Wi y + bi

Low-rank restriction: Wi ≈ Ui Vi^T, with Ui of size N x D and Vi^T of size D x M, so rank(Wi) <= D <= min(N, M)

fi = x^T Wi y + bi = x^T Ui Vi^T y + bi = 1^T (Ui^T x ∘ Vi^T y) + bi

f = P^T (U^T x ∘ V^T y) + b
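The factored form f = P^T (U^T x ∘ V^T y) + b maps directly onto three linear layers and a Hadamard product; a minimal sketch with illustrative dimensions:

import torch.nn as nn

class LowRankBilinear(nn.Module):
    def __init__(self, x_dim, y_dim, rank_d, out_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, rank_d, bias=False)   # U^T x
        self.V = nn.Linear(y_dim, rank_d, bias=False)   # V^T y
        self.P = nn.Linear(rank_d, out_dim)             # P^T ( . ) + b

    def forward(self, x, y):
        return self.P(self.U(x) * self.V(y))            # Hadamard product replaces the full W_i

# e.g. LowRankBilinear(x_dim=2048, y_dim=2400, rank_d=1200, out_dim=2000)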

Page 37: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Bilinear model / Low-rank restriction (same equations as the previous slide)

Kim et al., arXiv 2016

[Diagram: the factored form is computed as a Hadamard product ⊙ of the projected question vector vQ and image vector vI.]

Page 38: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Kim et al., arXiv 2016

[Diagram: a single linear projection. W^T applied to (x1, x2, …, xN) yields the components Σ wi xi, Σ wj xj, …, Σ wk xk.]

Page 39: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Kim et al., arXiv 2016

Hadamard Product (Element-wise Multiplication)

[Diagram: Wx^T applied to (x1, x2, …, xN) yields (Σ wi xi, Σ wj xj, …, Σ wk xk), and Wy^T applied to (y1, y2, …, yN) yields (Σ wl yl, Σ wm ym, …, Σ wn yn); their element-wise product gives (ΣΣ wi wl xi yl, ΣΣ wj wm xj ym, …, ΣΣ wk wn xk yn), i.e. bilinear interaction terms between x and y.]

Page 40: Multimodal Residual Learning for Visual Question-Answering

Recent Works: Multimodal Low-rank Bilinear Attention Networks (MLB)

Page 41: Multimodal Residual Learning for Visual Question-Answering

MLB Attention Networks

Kim et al., arXiv 2016

[Architecture diagram: the question Q is projected (Linear, Tanh) and replicated over the spatial positions of the visual feature map V (Conv, Tanh); their Hadamard product passes through a Conv and a spatial Softmax to give attention weights. The attended visual feature and Q are then fused again (Linear, Tanh on each side, then Linear), and a Softmax over the answer set A gives the prediction.]
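Reading the diagram bottom-up, the attention step can be sketched roughly as below in PyTorch style; the layer names, grid size, and glimpse count are assumptions for illustration, not the released MLB code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLBAttention(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, rank_d=1200, glimpses=2):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank_d)
        self.v_proj = nn.Conv2d(v_dim, rank_d, kernel_size=1)
        self.att = nn.Conv2d(rank_d, glimpses, kernel_size=1)

    def forward(self, q, v_map):                                  # v_map: B x 2048 x 14 x 14
        B, _, H, W = v_map.shape
        q_rep = torch.tanh(self.q_proj(q))[:, :, None, None]      # replicate over positions
        joint = torch.tanh(self.v_proj(v_map)) * q_rep            # low-rank bilinear fusion
        alpha = F.softmax(self.att(joint).view(B, -1, H * W), dim=2)          # B x G x S
        attended = torch.bmm(alpha, v_map.view(B, -1, H * W).transpose(1, 2)) # B x G x 2048
        return attended.flatten(1)        # concatenated glimpses, fused with q afterwards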

Page 42: Multimodal Residual Learning for Visual Question-Answering

MLB Attention Networks (MLB)

Kim et al., arXiv 2016


Table 2: The VQA test-standard results to compare with state-of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data augmentation.

                                                 Open-Ended                     MC
MODEL                                            ALL    Y/N    NUM    ETC       ALL
iBOWIMG (Zhou et al., 2015)                      55.89  76.76  34.98  42.62     61.97
DPPnet (Noh et al., 2015)                        57.36  80.28  36.92  42.24     62.69
Deeper LSTM+Normalized CNN (Lu et al., 2015)     58.16  80.56  36.53  43.73     63.09
SMem (Xu & Saenko, 2016)                         58.24  80.80  37.53  43.48     -
Ask Your Neuron (Malinowski et al., 2016)        58.43  78.24  36.27  46.32     -
SAN (Yang et al., 2015)                          58.85  79.11  36.41  46.42     -
D-NMN (Andreas et al., 2016)                     59.44  80.98  37.48  45.81     -
ACK (Wu et al., 2016b)                           59.44  81.07  37.12  45.83     -
FDA (Ilievski et al., 2016)                      59.54  81.34  35.67  46.10     64.18
HYBRID (Kafle & Kanan, 2016b)                    60.06  80.34  37.82  47.56     -
DMN+ (Xiong et al., 2016)                        60.36  80.43  36.82  48.33     -
MRN (Kim et al., 2016b)                          61.84  82.39  38.23  49.41     66.33
HieCoAtt (Lu et al., 2016)                       62.06  79.95  38.22  51.95     66.07
RAU (Noh & Han, 2016)                            63.2   81.7   38.2   52.8      67.3

MLB (ours)                                       65.07  84.02  37.90  54.77     68.89

The rate of the divided answers is approximately 16.40%, and only 0.23% of questions have more than two divided answers in the VQA dataset. We assume that this eases the difficulty of convergence without severe degradation of performance.

Shortcut Connection The performance contribution of shortcut connections for residual learning is explored. This experiment is conducted based on the observation of the competitive performance of the single-block layered model, since the usefulness of shortcut connections is linked to the network depth (He et al., 2015).

Data Augmentation Data augmentation with the Visual Genome (Krishna et al., 2016) question answer annotations is explored. Visual Genome originally provides 1.7 million visual question answer annotations. After aligning to VQA, the valid number of question-answering pairs for training is 837,298, covering 99,280 distinct images.

6 RESULTS

The six experiments are conducted sequentially to narrow down the architectural choices. Each experiment determines the experimental variables one by one. Refer to Table 1, which has six sectors divided by mid-rules.

6.1 SIX EXPERIMENT RESULTS

Number of Learning Blocks Though MRN (Kim et al., 2016b) has a three-block layered architecture, MARN shows the best performance with two-block layered models (63.92%). For the multiple-glimpse models in the next experiment, we choose the one-block layered model for its simplicity to extend and its competitive performance (63.79%).

Number of Glimpses Compared with the results of Fukui et al. (2016), four-glimpse MARN (64.61%) is better than the other comparative models. However, for a parsimonious choice, two-glimpse MARN (64.53%) is chosen for later experiments. We speculate that multiple glimpses are one of the key factors behind the competitive performance of MCB (Fukui et al., 2016), based on its large margin in accuracy compared to one-glimpse MARN (63.79%).


Page 43: Multimodal Residual Learning for Visual Question-Answering

MLB Attention Networks (MLB)

Kim et al., arXiv 2016


Non-Linearity The results confirm that activation functions are useful to improve performance. Surprisingly, there is no empirical difference between the two options, before-Hadamard product and after-Hadamard product. This result may build a bridge to studies on multiplicative integration with recurrent neural networks (Wu et al., 2016c).

Answer Sampling Sampled answers (64.80%) give better performance than mode answers (64.53%). This confirms that the distribution of answers from annotators can be used to improve the performance. However, the number of multiple answers is usually limited due to the cost of data collection.

Shortcut Connection Though MRN (Kim et al., 2016b) effectively uses shortcut connections to improve model performance, the one-block layered MARN shows better performance without the shortcut connection. In other words, residual learning is not used in our proposed model, MLB. It seems that there is a trade-off between introducing an attention mechanism and residual learning. We leave a careful study of this trade-off for future work.

Data Augmentation Data augmentation using the Visual Genome (Krishna et al., 2016) question answer annotations significantly improves the performance by 0.76% in accuracy on the VQA test-dev split. In particular, the accuracy of others (ETC)-type answers is notably improved by the data augmentation.

6.2 COMPARISON WITH STATE-OF-THE-ART

The comparison with other single models on VQA test-standard is shown in Table 2. The overall accuracy of our model is approximately 1.9% above the next best model (Noh & Han, 2016) on the Open-Ended task of VQA. The major improvements are from yes-or-no (Y/N) and others (ETC)-type answers. In Table 3, we also report the accuracy of our ensemble model to compare with other ensemble models on VQA test-standard, which won 1st to 5th places in VQA Challenge 2016². We beat the previous state-of-the-art with a margin of 0.42%.

Table 3: The VQA test-standard results for ensemble models to compare with state-of-the-art. For unpublished entries, their team names are used instead of their model names. Some of their figures are updated after the challenge.

                                Open-Ended                     MC
MODEL                           ALL    Y/N    NUM    ETC       ALL
RAU (Noh & Han, 2016)           64.12  83.33  38.02  53.37     67.34
MRN (Kim et al., 2016b)         63.18  83.16  39.14  51.33     67.54
DLAIT (not published)           64.83  83.23  40.80  54.32     68.30
Naver Labs (not published)      64.79  83.31  38.70  54.79     69.26
MCB (Fukui et al., 2016)        66.47  83.24  39.47  58.00     70.10

MLB (ours)                      66.89  84.61  39.07  57.79     70.29
Human (Antol et al., 2015)      83.30  95.77  83.39  72.67     91.54

7 RELATED WORKS

7.1 COMPACT BILINEAR POOLING

Compact bilinear pooling (Gao et al., 2015) approximates full bilinear pooling using a sampling-based computation, Tensor Sketch Projection (Charikar et al., 2002; Pham & Pagh, 2013):

Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s)                          (15)
               = FFT⁻¹(FFT(Ψ(x, h, s)) ∘ FFT(Ψ(y, h, s)))         (16)
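For comparison, a hedged NumPy sketch of this Count Sketch + FFT computation (Ψ is the sketch projection; the sketch dimension d and the random seed below are arbitrary choices, not values from the paper):

import numpy as np

def count_sketch(x, h, s, d):
    # Psi(x, h, s): out[h[i]] += s[i] * x[i]
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear(x, y, d=8000, seed=0):
    rng = np.random.default_rng(seed)
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx, sy = rng.choice([-1.0, 1.0], x.size), rng.choice([-1.0, 1.0], y.size)
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fy = np.fft.fft(count_sketch(y, hy, sy, d))
    return np.real(np.fft.ifft(fx * fy))    # circular convolution of the two sketches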

² http://visualqa.org/challenge.html


Page 44: Multimodal Residual Learning for Visual Question-Answering

Recent Works: DEMO

Page 45: Multimodal Residual Learning for Visual Question-Answering

DEMO

Q: 아니 이게 뭐야? (Wait, what is this?)

A: 냉장고 입니다. (It is a refrigerator.)

Page 46: Multimodal Residual Learning for Visual Question-Answering

DEMO

Page 47: Multimodal Residual Learning for Visual Question-Answering

Q&A

Page 48: Multimodal Residual Learning for Visual Question-Answering

Thank You