Multimodal Residual Learning for Visual Question-Answering

Jin-Hwa Kim, BI Lab, Seoul National University


Page 1: Multimodal Residual Learning for Visual Question-Answering

Jin-Hwa Kim

BI Lab, Seoul National University

Multimodal Residual Learning for Visual Question-Answering

Page 2: Multimodal Residual Learning for Visual Question-Answering

Table of Contents

1. VQA: Visual Question Answering

2. Vision Part

3. Question Part

4. Multimodal Residual Learning

5. Results

6. Discussion

7. Recent Works

8. Q&A

Page 3: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Page 4: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

VQA is a new dataset containing open-ended questions about images; answering them requires an understanding of vision, language, and commonsense knowledge.

VQA Challenge

Antol et al., ICCV 2015

Page 5: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Examples of answers given by humans from the image and question together, or from the question only

VQA Challenge

Antol et al., ICCV 2015

Page 6: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Numbers

# images: 204,721 (MS COCO)
# questions: 760K (3 per image)
# answers: 10M (10 per question + α)

             Images   Questions   Answers
Training     80K      240K        2.4M
Validation   40K      120K        1.2M
Test         80K      240K        (withheld)

Antol et al., ICCV 2015

Page 7: Multimodal Residual Learning for Visual Question-Answering

1. VQA: Visual Question Answering

Test Dataset

80K test images / Four splits of 20K images each

Test-dev (development)
Debugging and validation - 10/day submission to the evaluation server.

Test-standard (publications)
Used to score entries for the Public Leaderboard.

Test-challenge (competitions)
Used to rank challenge participants.

Test-reserve (check overfitting)
Used to estimate overfitting. Scores on this set are never released.

Slide adapted from: MSCOCO Detection/Segmentation Challenge, ICCV 2015

Page 8: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

Page 9: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet: A Thing among Convolutional Neural Networks

He et al., CVPR 2016

1st place on the ImageNet 2015 classification task
1st place on the ImageNet 2015 detection task
1st place on the ImageNet 2015 localization task
1st place on the COCO object detection task
1st place on the COCO object segmentation task

Page 10: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet-152: 152-Layered Convolutional Neural Networks

He et al., CVPR 2016

[Architecture diagram: input size 3x224x224 → 152 convolutional layers (final bottleneck block: Conv 1x1, 512 → Conv 3x3, 512 → Conv 1x1, 2048) → AveragePooling (7x7x2048 → 1x1x2048) → Linear → Softmax, with the output size noted at each stage.]

Page 11: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet-152 as a visual feature extractor

He et al., CVPR 2016

[Same architecture diagram: the 1x1x2048 average-pooled activation, just before the final Linear and Softmax, is taken as the visual feature.]

Page 12: Multimodal Residual Learning for Visual Question-Answering

2. Vision Part

ResNet-152

He et al., CVPR 2016

Pre-trained models are available!

For Torch, TensorFlow, Theano, Caffe, Lasagne, Neon, and MatConvNet

https://github.com/KaimingHe/deep-residual-networks
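As a rough illustration of the feature-extractor use on the previous slide, the sketch below loads a pre-trained ResNet-152 with PyTorch/torchvision, drops the final classifier, and keeps the 2048-d average-pooled activation as the image feature. The file name and preprocessing constants are illustrative assumptions, not necessarily the pipeline used in the talk.

# Minimal sketch: pre-trained ResNet-152 as a fixed 2048-d visual feature extractor.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()          # drop the 1000-way classifier head
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)   # 1 x 3 x 224 x 224
with torch.no_grad():
    v = resnet(img)                      # 1 x 2048 pooled visual feature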

Page 13: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Page 14: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Word-Embedding

What color are her eyes?

preprocessing → what | color | are | her | eyes | ?

indexing → 53 | 7 | 44 | 127 | 2 | 6177

lookup → w53 | w7 | w44 | w127 | w2 | w6177

[Lookup table: each index selects one row vector wiT of the word-embedding matrix.]

{wi} are learnable parameters updated by the back-propagation algorithm.
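A minimal sketch of this lookup-table embedding, assuming PyTorch; the vocabulary size and embedding dimension are placeholders, and the token indices are the ones from the slide.

import torch
import torch.nn as nn

vocab_size, embed_dim = 20000, 620             # illustrative sizes
lookup = nn.Embedding(vocab_size, embed_dim)   # rows w_i, trained by back-propagation

# "what color are her eyes ?" after preprocessing and indexing:
token_ids = torch.tensor([[53, 7, 44, 127, 2, 6177]])
word_vectors = lookup(token_ids)               # shape: 1 x 6 x 620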

Page 15: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Question-Embedding

Step 0: h1 = RNN(w53 (what), h0)
Step 1: h2 = RNN(w7 (color), h1)
...
Step 5: h6 = RNN(w6177 (?), h5) → use this last hidden state as the question embedding

Page 16: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Choice of RNN: Gated Recurrent Units (GRU)

Cho et al., EMNLP 2014; Chung et al., arXiv 2014

z = σ(xt Uz + st-1 Wz)

r = σ(xt Ur + st-1 Wr)

h = tanh(xt Uh + (st-1 ∘ r) Wh)

st = (1 - z) ∘ h + z ∘ st-1
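A literal NumPy transcription of the GRU update above, as a sketch; the parameter shapes and the toy usage are assumptions for illustration only.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, s_prev, Uz, Wz, Ur, Wr, Uh, Wh):
    z = sigmoid(x_t @ Uz + s_prev @ Wz)        # update gate
    r = sigmoid(x_t @ Ur + s_prev @ Wr)        # reset gate
    h = np.tanh(x_t @ Uh + (s_prev * r) @ Wh)  # candidate state
    return (1.0 - z) * h + z * s_prev          # new state s_t

# toy usage with random parameters
d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
params = [rng.standard_normal((d_in, d_hid)) if i % 2 == 0
          else rng.standard_normal((d_hid, d_hid)) for i in range(6)]
s_t = gru_step(rng.standard_normal(d_in), np.zeros(d_hid), *params)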

Page 17: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Skip-Thought Vectors

Pre-trained model for word-embedding and question-embedding

Kiros et al., NIPS 2015

I got back home.
I could see the cat on the steps.
This was strange.

Given the middle sentence, the model is trained to reconstruct the previous sentence and the next sentence.

Trained on the BookCorpus dataset (Zhu et al., arXiv 2015)

Page 18: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Skip-Thought Vectors

Pre-trained model for word-embedding and question-embedding

Its encoder as a Sent2Vec model

Kiros et al., NIPS 2015

[Diagram: a lookup table (w1T, w2T, w3T, w4T) feeding a pre-trained GRU (Gated Recurrent Units) encoder.]

Page 19: Multimodal Residual Learning for Visual Question-Answering

3. Question Part

Skip-Thought Vectors

Pre-trained model (Theano) and porting code (Torch) are available!

https://github.com/ryankiros/skip-thoughts

https://github.com/HyeonwooNoh/DPPnet/tree/master/003_skipthoughts_porting

Noh et al., CVPR 2016; Kiros et al., NIPS 2015

Page 20: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Page 21: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Idea 1: Deep Residual Learning

Extend the idea of deep residual learning to multimodal learning

He et al., CVPR 2016

[Figure 2 (He et al.): Residual learning, a building block. The input x passes through two weight layers with ReLU to give F(x), and an identity shortcut adds x back: output = F(x) + x.]
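For reference, a generic sketch of this building block (two weight layers with ReLU and an identity shortcut), written here with linear layers in PyTorch; it illustrates the idea and is not He et al.'s code.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(              # F(x): weight layer, ReLU, weight layer
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + x)      # identity shortcut: F(x) + x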

Page 22: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Idea 2: Hadamard Product for Joint Residual Mapping

One modality is directly involved in the gradient with respect to the other modality:

∂(σ(x) ∘ σ(y)) / ∂x = diag(σ′(x) ∘ σ(y))

[Baseline diagram: vQ and vI each pass through Tanh and are merged before a softmax classifier.]

https://github.com/VT-vision-lab/VQA_LSTM_CNN

Scaling problem? (Wu et al., NIPS 2016)
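A sketch of this kind of joint embedding: each modality goes through Linear + Tanh and the two are merged by a Hadamard product before the answer classifier. Dimensions and names below are assumptions for illustration.

import torch.nn as nn

class HadamardFusion(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, joint_dim=1200, num_answers=1000):
        super().__init__()
        self.q_proj = nn.Sequential(nn.Linear(q_dim, joint_dim), nn.Tanh())
        self.v_proj = nn.Sequential(nn.Linear(v_dim, joint_dim), nn.Tanh())
        self.classifier = nn.Linear(joint_dim, num_answers)   # softmax applied in the loss

    def forward(self, q, v):
        joint = self.q_proj(q) * self.v_proj(v)   # Hadamard product: each modality appears
        return self.classifier(joint)             # in the gradient w.r.t. the other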

Page 23: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Multimodal Residual Networks

Kim et al., NIPS 2016

[Overview diagram: the question ("What kind of animals are these ?") goes through a word embedding (word2vec, Mikolov et al., 2013; skip-thought vectors, Kiros et al., 2015) and an RNN to give Q; the image goes through a CNN (ResNet, He et al., 2015) to give V. The Multimodal Residual Networks combine them with Hadamard products and question shortcuts, and a softmax predicts the answer ("sheep").]

Page 24: Multimodal Residual Learning for Visual Question-Answering

4. Multimodal Residual Learning

Multimodal Residual Networks

Kim et al., NIPS 2016

[Architecture diagram: three stacked learning blocks. In block l, the question stream Hl-1 (H0 = Q) passes through Linear-Tanh and the visual vector V through Linear-Tanh-Linear-Tanh; their Hadamard product ⊙ is added ⊕ to a Linear shortcut of the question stream, giving Hl. After H3, a Linear layer and a Softmax over the answer set A produce the prediction.]
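The diagram can be read as the following PyTorch-style sketch of MRN with L = 3 learning blocks; it follows the description above (question stream with linear shortcuts, Hadamard-product joint residuals), but the exact sizes are placeholders, not the released implementation.

import torch.nn as nn

class MRNBlock(nn.Module):
    def __init__(self, q_dim, v_dim, joint_dim):
        super().__init__()
        self.q_path = nn.Sequential(nn.Linear(q_dim, joint_dim), nn.Tanh())
        self.v_path = nn.Sequential(nn.Linear(v_dim, joint_dim), nn.Tanh(),
                                    nn.Linear(joint_dim, joint_dim), nn.Tanh())
        self.shortcut = nn.Linear(q_dim, joint_dim)           # linear question shortcut

    def forward(self, h, v):
        return self.shortcut(h) + self.q_path(h) * self.v_path(v)   # shortcut + Hadamard joint

class MRN(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, joint_dim=1200, num_answers=2000, L=3):
        super().__init__()
        dims = [q_dim] + [joint_dim] * L
        self.blocks = nn.ModuleList(
            MRNBlock(dims[l], v_dim, dims[l + 1]) for l in range(L))
        self.classifier = nn.Linear(joint_dim, num_answers)

    def forward(self, q, v):
        h = q
        for block in self.blocks:
            h = block(h, v)
        return self.classifier(h)            # softmax over the answer set at the loss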

Page 25: Multimodal Residual Learning for Visual Question-Answering

5. Results

Page 26: Multimodal Residual Learning for Visual Question-Answering

5. Results

[Diagrams of the alternative models (a)-(e) from Table 1: each learning block merges the question stream (Q or Hl) and the visual vector V with a Hadamard product ⊙ and a shortcut addition ⊕; the variants differ in where the Linear/Tanh layers sit on the question and vision paths and in the shortcut (identity, a linear mapping, or none, with a special case when l = 1).]

Table 1: The results of alternative models (a)-(e) on the test-dev.

      Open-Ended
      All    Y/N    Num.   Other
(a)   60.17  81.83  38.32  46.61
(b)   60.53  82.53  38.34  46.78
(c)   60.19  81.91  37.87  46.70
(d)   59.69  81.67  37.23  46.00
(e)   60.20  81.98  38.25  46.57

Table 2: The effect of the visual features and # of target answers on the test-dev results. Vgg for VGG-19, and Res for ResNet-152 features described in Section 4.

          Open-Ended
          All    Y/N    Num.   Other
Vgg, 1k   60.53  82.53  38.34  46.78
Vgg, 2k   60.79  82.13  38.87  47.52
Vgg, 3k   60.68  82.40  38.69  47.10
Res, 1k   61.45  82.36  38.40  48.81
Res, 2k   61.68  82.28  38.82  49.25
Res, 3k   61.47  82.28  39.09  48.76

5 Results

The VQA Challenge, which released the VQA dataset, provides evaluation servers for the test-dev and test-standard test splits. For test-dev, the evaluation server permits unlimited submissions for validation, while test-standard permits limited submissions for the competition. We report accuracies in percentage.

Alternative Models The test-dev results for the Open-Ended task of the alternative models are shown in Table 1. (a) shows a significant improvement over SAN; however, (b) is marginally better than (a). Compared to (b), (c) deteriorates the performance: an extra embedding for a question vector may easily cause overfitting, leading to overall degradation. And identity shortcuts in (d) cause the degradation problem, too; the extra parameters of the linear mappings may effectively support the task. (e) shows a reasonable performance, but the extra shortcut is not essential. The empirical results seem to support this idea: the question-only model (50.39%) achieves a result competitive with the joint model (57.75%), while the image-only model gets a poor accuracy (28.13%) (see Table 2 in [2]). Eventually, we chose model (b) for its best performance and relative simplicity.

Number of Learning Blocks To confirm the effectiveness of the number of learning blocks selected via a pilot test (L = 3), we explore this on the chosen model (b) again. As the depth increases, the overall accuracies are 58.85 (L = 1), 59.44 (L = 2), 60.53 (L = 3) and 60.42 (L = 4).

Visual Features The ResNet-152 visual features are significantly better than the VGG-19 features for the Other type in Table 2, even though the dimension of the ResNet features (2,048) is half that of the VGG features (4,096). The ResNet visual features are also used in the previous work [9]; however, our model achieves a remarkably better performance with a large margin (see Table 3).

Number of Target Answers The number of target answers slightly affects the overall accuracy, with a trade-off among answer types, so the decision on the number of target answers is difficult to make. We simply chose Res, 2k in Table 2 based on the overall accuracy.

Comparisons with State-of-the-art Our chosen model significantly outperforms other state-of-the-art methods on both the Open-Ended and Multiple-Choice tasks in Table 3. However, the performance on the Number and Other types is still unsatisfactory compared to human performance, though the advances in recent works were mainly for Other-type answers. This fact motivates the study of a counting mechanism in future work. The model comparison is performed on the test-standard results.


A Appendix

A.1 VQA test-dev Results

Table 1: The effects of various options for VQA test-dev. Here, the model of Figure 3a is used, since these experiments are conducted preliminarily. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [6] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [3], and c stands for the usage of postprocessing using an image captioning model [5].

           Open-Ended                      Multiple-Choice
           All    Y/N    Num.   Other      All    Y/N    Num.   Other
baseline   58.97  81.11  37.63  44.90      63.53  81.13  38.91  54.06
s          59.38  80.65  38.30  45.98      63.71  80.68  39.73  54.65
s,b        59.74  81.75  38.13  45.84      64.15  81.77  39.54  54.67
s,b,c      59.91  81.75  38.13  46.19      64.18  81.77  39.51  54.72

Table 2: The results for VQA test-dev. The precision of some accuracies [11, 2, 10] is one digit less than the others, so they are zero-filled to match.

                 Open-Ended                      Multiple-Choice
                 All    Y/N    Num.   Other      All    Y/N    Num.   Other
Question [1]     48.09  75.66  36.70  27.14      53.68  75.71  37.05  38.64
Image [1]        28.13  64.01  00.42  03.77      30.53  69.87  00.45  03.76
Q+I [1]          52.64  75.55  33.67  37.37      58.97  75.59  34.35  50.33
LSTM Q [1]       48.76  78.20  35.68  26.59      54.75  78.22  36.82  38.78
LSTM Q+I [1]     53.74  78.94  35.24  36.42      57.17  78.95  35.80  43.41
Deep Q+I [7]     58.02  80.87  36.46  43.40      62.86  80.88  37.78  53.14
DPPnet [8]       57.22  80.71  37.24  41.69      62.48  80.79  38.94  52.16
D-NMN [2]        57.90  80.50  37.40  43.10      -      -      -      -
SAN [11]         58.70  79.30  36.60  46.10      -      -      -      -
ACK [9]          59.17  81.01  38.42  45.23      -      -      -      -
FDA [4]          59.24  81.14  36.16  45.77      64.01  81.50  39.00  54.72
DMN+ [10]        60.30  80.50  36.80  48.30      -      -      -      -
Vgg, 1k          60.53  82.53  38.34  46.78      64.79  82.55  39.93  55.23
Vgg, 2k          60.77  82.10  39.11  47.46      65.27  82.12  40.84  56.39
Vgg, 3k          60.68  82.40  38.69  47.10      65.09  82.42  40.13  55.93
Res, 1k          61.45  82.36  38.40  48.81      65.62  82.39  39.65  57.15
Res, 2k          61.68  82.28  38.82  49.25      66.15  82.30  40.45  58.16
Res, 3k          61.47  82.28  39.09  48.76      66.33  82.41  39.57  58.40


Page 27: Multimodal Residual Learning for Visual Question-Answering

5. Results

Table 3: The effects of shortcut connections of MRN for VQA test-dev. ResNet-152 features and 2k target answers are used. MN stands for Multimodal Networks without residual learning, which does not have any shortcut connections. Dim. stands for the common embedding vector's dimension. The number of parameters for word embedding (9.3M) and question embedding (21.8M) is subtracted from the total number of parameters in this table.

                           Open-Ended
      L  Dim.   #params    All    Y/N    Num.   Other
MN    1  4604   33.9M      60.33  82.50  36.04  46.89
MN    2  2350   33.9M      60.90  81.96  37.16  48.28
MN    3  1559   33.9M      59.87  80.55  37.53  47.25
MRN   1  3355   33.9M      60.09  81.78  37.09  46.78
MRN   2  1766   33.9M      61.05  81.81  38.43  48.43
MRN   3  1200   33.9M      61.68  82.28  38.82  49.25
MRN   4  851    33.9M      61.02  82.06  39.02  48.04

A.2 More Examples

Figure 1: More examples of Figure 4 in Section 5.2.


Page 28: Multimodal Residual Learning for Visual Question-Answering

5. Results

Table 3: The VQA test-standard results. The precision of some accuracies [30, 1] is one digit less than the others, so they are zero-filled to match.

                 Open-Ended                      Multiple-Choice
                 All    Y/N    Num.   Other      All    Y/N    Num.   Other
DPPnet [21]      57.36  80.28  36.92  42.24      62.69  80.35  38.79  52.79
D-NMN [1]        58.00  -      -      -          -      -      -      -
Deep Q+I [11]    58.16  80.56  36.53  43.73      63.09  80.59  37.70  53.64
SAN [30]         58.90  -      -      -          -      -      -      -
ACK [27]         59.44  81.07  37.12  45.83      -      -      -      -
FDA [9]          59.54  81.34  35.67  46.10      64.18  81.25  38.30  55.20
DMN+ [28]        60.36  80.43  36.82  48.33      -      -      -      -
MRN              61.84  82.39  38.23  49.41      66.33  82.41  39.57  58.40
Human [2]        83.30  95.77  83.39  72.67      -      -      -      -

5.1 Visualization

In Equation 3, the left term σ(Wq q) can be seen as a masking (attention) vector that selects a part of the visual information. We assume that the difference between the right term V = σ(W2 σ(W1 v)) and the masked vector F(q, v) indicates an attention effect caused by the masking vector. Then, the attention effect Latt = (1/2)(V − F)² is visualized on the image by calculating the gradient of Latt with respect to a given image I:

∂Latt/∂I = (∂V/∂I)(V − F)   (6)

This technique can be applied to each learning block in a similar way. Since we use preprocessed visual features, the pretrained CNN is augmented to reach the input image only for this visualization. Note that model (b) in Table 1 is used for this visualization, and the pretrained VGG-19 is used for preprocessing and augmentation. The model is trained using the training set of the VQA dataset and visualized using the validation set. Examples are shown in Figure 4.

Unlike other works [30, 28] using explicit attention parameters, MRN does not use any explicit attention mechanism. Still, we observe the interpretability of element-wise multiplication as information masking, which leads to a novel method to visualize the attention effect of this operation. Since MRN does not depend on a few attention parameters (e.g. 14 × 14), our visualization method shows a higher resolution than others [30, 28]. Based on this, we argue that MRN is an implicit attention model without an explicit attention mechanism.
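In practice, Equation (6) can be realized with automatic differentiation. The sketch below is a hedged approximation: cnn is assumed to be the augmented pretrained network reaching the raw image, block is one learning block with the q_path/v_path naming used in the earlier MRN sketch, and detaching F reproduces the gradient (∂V/∂I)(V − F).

import torch

def attention_map(cnn, block, image, q):
    image = image.clone().requires_grad_(True)
    v = cnn(image)                               # visual feature computed from the raw image
    V = block.v_path(v)                          # vision-only term  V = sigma(W2 sigma(W1 v))
    F = block.q_path(q) * V                      # masked term       F(q, v)
    L_att = 0.5 * ((V - F.detach()) ** 2).sum()  # attention effect; F is treated as constant
    L_att.backward()                             # dL/dI = dV/dI (V - F), as in Eq. (6)
    return image.grad.abs().sum(dim=1)           # per-pixel magnitude of the attention effect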

6 Conclusions

The idea of deep residual learning is applied to visual question-answering tasks. Based on the two observations of the previous works, various alternative models are suggested and validated to propose the three-block layered MRN. Our model achieves the state-of-the-art results on the VQA dataset for both Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the spatial attention from the collapsed visual features using back-propagation.

Acknowledgments

This work was supported by Naver Corp. and the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF).


Page 29: Multimodal Residual Learning for Visual Question-Answering

6. Discussions

Page 30: Multimodal Residual Learning for Visual Question-Answering

6. Discussions

Visualization

[The MRN architecture diagram again (three learning blocks with Hadamard products ⊙ and shortcut additions ⊕, followed by Linear and Softmax), with the pretrained model attached to the visual input so that a 1st, 2nd, and 3rd visualization can be computed, one per learning block.]

Page 31: Multimodal Residual Learning for Visual Question-Answering

6. Discussions

Examples

[Visualization examples (a)-(h):
(a) What kind of animals are these ? → sheep
(b) What animal is the picture ? → elephant
(c) What is this animal ? → zebra
(d) What game is this person playing ? → tennis
(e) How many cats are here ? → 2
(f) What color is the bird ? → yellow
(g) What sport is this ? → surfing
(h) Is the horse jumping ? → yes]


Page 33: Multimodal Residual Learning for Visual Question-Answering

Acknowledgments

This work was supported by Naver Corp. and partly by the Korea government (IITP-R0126-16-1072-SW.StarLab, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF, ADD-UD130070ID-BMRR).

Page 34: Multimodal Residual Learning for Visual Question-Answering

Recent Works

Page 35: Multimodal Residual Learning for Visual Question-Answering

Recent Works: Low-rank Bilinear Pooling

Page 36: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Kim et al., arXiv 2016

Bilinear model:

fi = Σj=1..N Σk=1..M wijk xj yk + bi = x^T Wi y + bi

Low-rank restriction: Wi ≈ Ui Vi^T, with Ui of size N x D and Vi^T of size D x M, so rank(Wi) <= D <= min(N, M)

fi = x^T Wi y + bi = x^T Ui Vi^T y + bi = 1^T (Ui^T x ∘ Vi^T y) + bi

f = P^T (U^T x ∘ V^T y) + b
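The factored form f = P^T (U^T x ∘ V^T y) + b maps directly onto three linear layers and a Hadamard product; a minimal sketch with illustrative dimensions:

import torch.nn as nn

class LowRankBilinear(nn.Module):
    def __init__(self, x_dim, y_dim, rank_d, out_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, rank_d, bias=False)   # U^T x
        self.V = nn.Linear(y_dim, rank_d, bias=False)   # V^T y
        self.P = nn.Linear(rank_d, out_dim)             # P^T ( . ) + b

    def forward(self, x, y):
        return self.P(self.U(x) * self.V(y))            # Hadamard product replaces the full W_i

# e.g. LowRankBilinear(x_dim=2048, y_dim=2400, rank_d=1200, out_dim=2000)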

Page 37: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Bilinear model / Low-rank restriction (same equations as the previous slide)

Kim et al., arXiv 2016

[Diagram: the factored form is computed as a Hadamard product ⊙ of the projected question vector vQ and image vector vI.]

Page 38: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Kim et al., arXiv 2016

[Diagram: a single linear projection. W^T applied to (x1, x2, …, xN) yields the components Σ wi xi, Σ wj xj, …, Σ wk xk.]

Page 39: Multimodal Residual Learning for Visual Question-Answering

Low-rank Bilinear Pooling

Kim et al., arXiv 2016

Hadamard Product (Element-wise Multiplication)

[Diagram: Wx^T applied to (x1, x2, …, xN) yields (Σ wi xi, Σ wj xj, …, Σ wk xk), and Wy^T applied to (y1, y2, …, yN) yields (Σ wl yl, Σ wm ym, …, Σ wn yn); their element-wise product gives (ΣΣ wi wl xi yl, ΣΣ wj wm xj ym, …, ΣΣ wk wn xk yn), i.e. bilinear interaction terms between x and y.]

Page 40: Multimodal Residual Learning for Visual Question-Answering

Recent Works: Multimodal Low-rank Bilinear Attention Networks (MLB)

Page 41: Multimodal Residual Learning for Visual Question-Answering

MLB Attention Networks

Kim et al., arXiv 2016

[Architecture diagram: the question Q is projected (Linear, Tanh) and replicated over the spatial positions of the visual feature map V (Conv, Tanh); their Hadamard product passes through a Conv and a spatial Softmax to give attention weights. The attended visual feature and Q are then fused again (Linear, Tanh on each side, then Linear), and a Softmax over the answer set A gives the prediction.]
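Reading the diagram bottom-up, the attention step can be sketched roughly as below in PyTorch style; the layer names, grid size, and glimpse count are assumptions for illustration, not the released MLB code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLBAttention(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, rank_d=1200, glimpses=2):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank_d)
        self.v_proj = nn.Conv2d(v_dim, rank_d, kernel_size=1)
        self.att = nn.Conv2d(rank_d, glimpses, kernel_size=1)

    def forward(self, q, v_map):                                  # v_map: B x 2048 x 14 x 14
        B, _, H, W = v_map.shape
        q_rep = torch.tanh(self.q_proj(q))[:, :, None, None]      # replicate over positions
        joint = torch.tanh(self.v_proj(v_map)) * q_rep            # low-rank bilinear fusion
        alpha = F.softmax(self.att(joint).view(B, -1, H * W), dim=2)          # B x G x S
        attended = torch.bmm(alpha, v_map.view(B, -1, H * W).transpose(1, 2)) # B x G x 2048
        return attended.flatten(1)        # concatenated glimpses, fused with q afterwards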

Page 42: Multimodal Residual Learning for Visual Question-Answering

MLB Attention Networks (MLB)

Kim et al., arXiv 2016


Table 2: The VQA test-standard results to compare with state-of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data augmentation.

                                                 Open-Ended                     MC
MODEL                                            ALL    Y/N    NUM    ETC       ALL
iBOWIMG (Zhou et al., 2015)                      55.89  76.76  34.98  42.62     61.97
DPPnet (Noh et al., 2015)                        57.36  80.28  36.92  42.24     62.69
Deeper LSTM+Normalized CNN (Lu et al., 2015)     58.16  80.56  36.53  43.73     63.09
SMem (Xu & Saenko, 2016)                         58.24  80.80  37.53  43.48     -
Ask Your Neuron (Malinowski et al., 2016)        58.43  78.24  36.27  46.32     -
SAN (Yang et al., 2015)                          58.85  79.11  36.41  46.42     -
D-NMN (Andreas et al., 2016)                     59.44  80.98  37.48  45.81     -
ACK (Wu et al., 2016b)                           59.44  81.07  37.12  45.83     -
FDA (Ilievski et al., 2016)                      59.54  81.34  35.67  46.10     64.18
HYBRID (Kafle & Kanan, 2016b)                    60.06  80.34  37.82  47.56     -
DMN+ (Xiong et al., 2016)                        60.36  80.43  36.82  48.33     -
MRN (Kim et al., 2016b)                          61.84  82.39  38.23  49.41     66.33
HieCoAtt (Lu et al., 2016)                       62.06  79.95  38.22  51.95     66.07
RAU (Noh & Han, 2016)                            63.2   81.7   38.2   52.8      67.3

MLB (ours)                                       65.07  84.02  37.90  54.77     68.89

The rate of the divided answers is approximately 16.40%, and only 0.23% of questions have more than two divided answers in the VQA dataset. We assume that this eases the difficulty of convergence without severe degradation of performance.

Shortcut Connection The performance contribution of shortcut connections for residual learning is explored. This experiment is conducted based on the observation of the competitive performance of the single-block layered model, since the usefulness of shortcut connections is linked to the network depth (He et al., 2015).

Data Augmentation Data augmentation with the Visual Genome (Krishna et al., 2016) question answer annotations is explored. Visual Genome originally provides 1.7 million visual question answer annotations. After aligning to VQA, the valid number of question-answering pairs for training is 837,298, covering 99,280 distinct images.

6 RESULTS

The six experiments are conducted sequentially to narrow down the architectural choices. Each experiment determines the experimental variables one by one. Refer to Table 1, which has six sectors divided by mid-rules.

6.1 SIX EXPERIMENT RESULTS

Number of Learning Blocks Though MRN (Kim et al., 2016b) has a three-block layered architecture, MARN shows the best performance with two-block layered models (63.92%). For the multiple-glimpse models in the next experiment, we choose the one-block layered model for its simplicity to extend and its competitive performance (63.79%).

Number of Glimpses Compared with the results of Fukui et al. (2016), four-glimpse MARN (64.61%) is better than the other comparative models. However, for a parsimonious choice, two-glimpse MARN (64.53%) is chosen for later experiments. We speculate that multiple glimpses are one of the key factors behind the competitive performance of MCB (Fukui et al., 2016), based on its large margin in accuracy compared to one-glimpse MARN (63.79%).


Page 43: Multimodal Residual Learning for Visual Question-Answering

MLB Attention Networks (MLB)

Kim et al., arXiv 2016


Non-Linearity The results confirm that activation functions are useful to improve performance. Surprisingly, there is no empirical difference between the two options, before-Hadamard product and after-Hadamard product. This result may build a bridge to studies on multiplicative integration with recurrent neural networks (Wu et al., 2016c).

Answer Sampling Sampled answers (64.80%) give better performance than mode answers (64.53%). This confirms that the distribution of answers from annotators can be used to improve the performance. However, the number of multiple answers is usually limited due to the cost of data collection.

Shortcut Connection Though MRN (Kim et al., 2016b) effectively uses shortcut connections to improve model performance, the one-block layered MARN shows better performance without the shortcut connection. In other words, residual learning is not used in our proposed model, MLB. It seems that there is a trade-off between introducing an attention mechanism and residual learning. We leave a careful study of this trade-off for future work.

Data Augmentation Data augmentation using the Visual Genome (Krishna et al., 2016) question answer annotations significantly improves the performance by 0.76% in accuracy on the VQA test-dev split. In particular, the accuracy of others (ETC)-type answers is notably improved by the data augmentation.

6.2 COMPARISON WITH STATE-OF-THE-ART

The comparison with other single models on VQA test-standard is shown in Table 2. The overall accuracy of our model is approximately 1.9% above the next best model (Noh & Han, 2016) on the Open-Ended task of VQA. The major improvements are from yes-or-no (Y/N) and others (ETC)-type answers. In Table 3, we also report the accuracy of our ensemble model to compare with other ensemble models on VQA test-standard, which won 1st to 5th places in VQA Challenge 2016². We beat the previous state-of-the-art with a margin of 0.42%.

Table 3: The VQA test-standard results for ensemble models to compare with state-of-the-art. For unpublished entries, their team names are used instead of their model names. Some of their figures are updated after the challenge.

                                Open-Ended                     MC
MODEL                           ALL    Y/N    NUM    ETC       ALL
RAU (Noh & Han, 2016)           64.12  83.33  38.02  53.37     67.34
MRN (Kim et al., 2016b)         63.18  83.16  39.14  51.33     67.54
DLAIT (not published)           64.83  83.23  40.80  54.32     68.30
Naver Labs (not published)      64.79  83.31  38.70  54.79     69.26
MCB (Fukui et al., 2016)        66.47  83.24  39.47  58.00     70.10

MLB (ours)                      66.89  84.61  39.07  57.79     70.29
Human (Antol et al., 2015)      83.30  95.77  83.39  72.67     91.54

7 RELATED WORKS

7.1 COMPACT BILINEAR POOLING

Compact bilinear pooling (Gao et al., 2015) approximates full bilinear pooling using a sampling-based computation, Tensor Sketch Projection (Charikar et al., 2002; Pham & Pagh, 2013):

Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s)                          (15)
               = FFT⁻¹(FFT(Ψ(x, h, s)) ∘ FFT(Ψ(y, h, s)))         (16)
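For comparison, a hedged NumPy sketch of this Count Sketch + FFT computation (Ψ is the sketch projection; the sketch dimension d and the random seed below are arbitrary choices, not values from the paper):

import numpy as np

def count_sketch(x, h, s, d):
    # Psi(x, h, s): out[h[i]] += s[i] * x[i]
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear(x, y, d=8000, seed=0):
    rng = np.random.default_rng(seed)
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx, sy = rng.choice([-1.0, 1.0], x.size), rng.choice([-1.0, 1.0], y.size)
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fy = np.fft.fft(count_sketch(y, hy, sy, d))
    return np.real(np.fft.ifft(fx * fy))    # circular convolution of the two sketches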

² http://visualqa.org/challenge.html


Page 44: Multimodal Residual Learning for Visual Question-Answering

Recent Works: DEMO

Page 45: Multimodal Residual Learning for Visual Question-Answering

DEMO

Q: 아니 이게 뭐야? (Wait, what is this?)

A: 냉장고 입니다. (It is a refrigerator.)

Page 46: Multimodal Residual Learning for Visual Question-Answering

DEMO

Page 47: Multimodal Residual Learning for Visual Question-Answering

Q&A

Page 48: Multimodal Residual Learning for Visual Question-Answering

Thank You