Attribution-NonCommercial-NoDerivs 2.0 Korea

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow the conditions below:

- When reusing or distributing the work, you must make clear the license terms applied to it.
- These conditions can be waived if you obtain permission from the copyright holder.
- Your rights under fair use are in no way affected by the above.

This is an easy-to-understand summary of the license (Legal Code).

Disclaimer

Attribution. You must attribute the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
Ph.D. Dissertation of Jin-Hwa Kim
Multimodal Deep Learning for Visually-Grounded Reasoning
시각 기반 추론을 위한 다중 양태의 깊은 학습
August 2018
College of Humanities
Seoul National University
Interdisciplinary Program in Cognitive Science
Jin-Hwa Kim
Multimodal Deep Learning for Visually-Grounded Reasoning
시각 기반 추론을 위한 다중 양태의 깊은 학습
Advisor: Professor Byoung-Tak Zhang

Submitting a Ph.D. Dissertation of Engineering

May 2018

The Graduate School
Seoul National University
Interdisciplinary Program in Cognitive Science

Jin-Hwa Kim

Confirming the Ph.D. Dissertation written by Jin-Hwa Kim

June 2018

Chair: Hong-Gee Kim
Vice Chair: Byoung-Tak Zhang
Member: Bohyung Han
Member: Gunhee Kim
Member: Jung-Woo Ha
Abstract
Advances in computer vision and natural language processing accelerate the studies of artificial general intelligence. Since vision and natural language are among the major and most interactive modalities of humans, understanding and reasoning grounded on both vision and language have become a key challenge for artificial general intelligence. Visual question answering (VQA) is an instance of the Visual Turing Test (VTT), which is aligned with this direction on top of the prestigious seminal work, the Turing test [Turing, 1950]. In the VQA dataset [Agrawal et al., 2017], question-answer pairs are collected over large image datasets for supervised learning. For instance, a machine answers questions about a given image, such as "Who is wearing glasses?", "Is the umbrella upside down?", or "How many children are in the bed?".
In this dissertation, taking the view that the visual question answering task generalizes to multimodal learning, we study advances in multimodal learning within deep learning, where hierarchical representations are learned with various forms of multiple layers in neural networks; this is called multimodal deep learning. First, multimodal deep learning is introduced with three categories: multimodal fusion, cross modality learning, and shared representation learning. After that, based on our previous works [Kim et al., 2016b, 2017a, 2018], three major studies are discussed: multimodal residual learning, multimodal low-rank bilinear pooling, and bilinear attention networks.
Multimodal residual learning finds the joint representation of vision-language multimodality based on the idea of residual learning, which imposes the constraint that a part of the neural network must learn the residual errors of a fitting function represented by the preceding part of the network. Multimodal low-rank bilinear pooling, in turn, gives a mathematical ground for the use of element-wise multiplication (a.k.a. the Hadamard product) as a joint function, since it can be interpreted as low-rank bilinear pooling under the condition that each modality is linearly transformed with appropriate model parameters. Bilinear attention networks unify the previous two works. Using the interpretation of low-rank bilinear pooling, they successfully generalize the unitary attention mechanism into bilinear attention via matrix chain multiplication. This is so efficient that the computational cost is the same as that of the counterpart, unitary attention networks. Moreover, residual learning of attention is proposed to exploit up to eight bilinear attention maps in reasoning processes, which prevents the over-fitting that usually comes with multi-layer attention networks.
As a result, Multimodal Residual Networks (MRN) achieved 4th place in the VQA Challenge 2016, and Multimodal Low-rank Bilinear Attention Networks (MLB) achieved a new state of the art with significantly fewer parameters in November 2016, at the time of publication. Moreover, Bilinear Attention Networks (BAN) took 2nd place (shared with another team) in the VQA Challenge 2018, achieving the best single model among the entries, and were presented in an invited talk at the CVPR 2018 workshop (Salt Lake City, USA) on June 18. Since both vision and natural language processing are still evolving areas of study, multimodal deep learning can take advantage of future progress in computer vision and natural language processing studies.
Keywords: Multimodal, attention, visual question answering, deep learning,
residual learning, low-rank approximation, bilinear
Student Number: 2015-30046
Contents
Abstract i
Chapter 1 Introduction 1
Chapter 2 Multimodal Deep Learning 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Multimodal Deep Learning . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Multimodal Fusion . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Cross Modality Learning . . . . . . . . . . . . . . . . . . . 11
2.3.3 Shared Representation Learning . . . . . . . . . . . . . . 13
2.4 Cognitive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 3 Multimodal Residual Learning 16
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Deep Residual Learning . . . . . . . . . . . . . . . . . . . 18
3.2.2 Stacked Attention Networks . . . . . . . . . . . . . . . . . 19
3.3 Multimodal Residual Networks . . . . . . . . . . . . . . . . . . . 20
3.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Multimodal Residual Networks . . . . . . . . . . . . . . . 21
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 Visual QA Dataset . . . . . . . . . . . . . . . . . . . . . . 22
3.4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.3 Exploring Alternative Models . . . . . . . . . . . . . . . . 26
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.1 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 29
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chapter 4 Multimodal Low-rank Bilinear Pooling 37
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Low-rank Bilinear Model . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Low-rank Bilinear Pooling . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 Full Model . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Nonlinear Activation . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Shortcut Connection . . . . . . . . . . . . . . . . . . . . . 42
4.4 Multimodal Low-rank Bilinear Attention Networks . . . . . . . . 43
4.4.1 Low-rank Bilinear Pooling in Attention Mechanism . . . . 43
4.4.2 Multimodal Low-rank Bilinear Attention Networks . . . . 43
4.4.3 Model Schema . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.2 Vision Embedding . . . . . . . . . . . . . . . . . . . . . . 48
4.5.3 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6.1 Six Experiment Results . . . . . . . . . . . . . . . . . . . 50
4.6.2 Comparison with State-of-the-Art . . . . . . . . . . . . . 52
4.6.3 Ensemble of Seven Models . . . . . . . . . . . . . . . . . . 52
4.7 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7.1 Multimodal Residual Networks . . . . . . . . . . . . . . . 53
4.7.2 Higher-Order Boltzmann Machines . . . . . . . . . . . . . 53
4.7.3 Multiplicative Integration with Recurrent Neural Networks 54
4.7.4 Compact Bilinear Pooling . . . . . . . . . . . . . . . . . . 55
4.8 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8.1 Understanding of Multimodal Compact Bilinear Pooling . 56
4.8.2 Replacement of Low-rank Bilinear Pooling . . . . . . . . . 58
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chapter 5 Bilinear Attention Networks 62
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Low-rank Bilinear Pooling . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Bilinear Attention Networks . . . . . . . . . . . . . . . . . . . . . 66
5.4 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5.3 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Variants of BAN . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6.1 Enhancing Glove Word Embedding . . . . . . . . . . . . . 72
5.6.2 Integrating Counting Module . . . . . . . . . . . . . . . . 73
5.6.3 Integrating Multimodal Factorized Bilinear (MFB) Pooling 75
5.6.4 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6.5 Hyperparameters and Regularization . . . . . . . . . . . . 76
5.7 VQA Results and Discussions . . . . . . . . . . . . . . . . . . . . 77
5.7.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . 77
5.7.2 Residual Learning of Attention . . . . . . . . . . . . . . . 78
5.7.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 80
5.8 Flickr30k Entities Results and Discussions . . . . . . . . . . . . . 80
5.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Chapter 6 Conclusions 89
Bibliography 91
초록 (Abstract in Korean) 106
List of Figures
Figure 2.1 Examples from VQA 2.0 [Agrawal et al., 2017; Goyal
et al., 2016], which depict the criterion that multimodal
information is necessary to solve the problem. For the
same question, answers may be different depending on
visual information, the provided image along with the
question. Reproduced with permission from Goyal et al.
[2016]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 3.1 Inference flow of Multimodal Residual Networks (MRN).
Using our visualization method, the attention effects are
shown as a sequence of three images. More examples are
shown in Figure 3.4. . . . . . . . . . . . . . . . . . . . . 18
Figure 3.2 A schematic diagram of Multimodal Residual Networks
with three-block layers. . . . . . . . . . . . . . . . . . . . 18
Figure 3.3 Alternative models are explored to justify our proposed model. The base model (a) has a shortcut for a question vector as SAN does [Yang et al., 2016], and the joint residual function takes the form of the Deep Q+I model's joint function [Lu et al., 2015]. (b) adds an extra embedding for the visual modality. (c) adds extra embeddings for both modalities. (d) uses identity mappings for shortcuts; in the first learning block, a linear mapping matches the dimension with the joint dimension. (e) uses two shortcuts for both modalities; for simplicity, the linear mapping of the visual shortcut only appears in the first learning block. Notice that (d) and (e) are compared to (b) after the model selection of (b) among (a)-(c) on test-dev results. Eventually, we choose (b) for its best performance and relative simplicity. . . . 23
Figure 3.4 Examples for visualization of the three-block layered
MRN. The original images are shown in the first of each
group. The next three images show the input gradients of
the attention effect for each learning block as described
in Section 3.5.2. The gradients of color channels for each
pixel are summed up after taking absolute values of these
gradients. Then, these summed absolute values which
are greater than the summation of the mean and the
standard deviation of these values are visualized as the
attention effect (bright color) on the images. The answers
(blue) are predicted by MRN. . . . . . . . . . . . . . . . 29
Figure 3.5 More examples of Figure 3.4 in Section 3.5.2. . . . . 31
Figure 3.6 Comparative examples on the same image. (a1) and (a2) depict a giraffe (left) and a man pointing at the giraffe. MRN consistently highlights the giraffe in (a1). However, the other question "Can you see trees?" makes MRN less attentive to the giraffe, while a tree at the right of the background is more focused in (a2). Similarly, the attention effect of (b2) is more widely dispersed over the background than that of (b1) in the middle of the sequences, possibly to recognize the site. However, the subtlety of the comparative study is insufficient to objectively assess the results. . . . 32
Figure 3.7 Failure examples. Each question is followed by the model prediction (blue) and the answer (red). As mentioned in Section 3.5, MRN shows a weakness at counting in (d) and (k). Sometimes, the model finds objects regardless of the given question. In (j), even though the word cat does not appear in the question, the cat in the image is surely attended. (i) shows the limitation of the attentional mechanism, which needs an inference using world knowledge. . . . 36
Figure 4.1 A schematic diagram of MLB. The Replicate module copies a question embedding vector to match with the S2 visual feature vectors. Conv modules indicate 1×1 convolution to transform a given channel space, which is computationally equivalent to a linear projection over channels. . . . 44
Figure 5.1 Overview of a two-layer BAN. Two multi-channel inputs,
ϕ-object detection features and ρ-length GRU hidden
vectors, are used to get bilinear attention maps and
joint representations to be used by a classifier. For the
definition of the BAN, see the text in Section 5.3. . . . . 64
Figure 5.2 (a) Learning curves. Bilinear attention (bi-att) is more robust to overfitting than unitary attention (uni-att) and co-attention (co-att). (b) Validation scores versus the number of parameters. The error bar indicates the standard deviation among three randomly initialized models, although it is too small to be noticed for over-15M parameters. (c) Ablation study for the first N glimpses (x-axis) used in evaluation. (d) The information entropy (y-axis) for each attention map in four-glimpse BAN. The entropy of multiple attention maps converges to certain levels. . . . 81
Figure 5.3 Visualization of the bilinear attention maps for two-glimpse BAN. The left and right groups indicate the first and second bilinear attention maps (right in each group, log-scaled) and the visualized image (left in each group). The six most salient boxes (numbered 1-6 in the images and on the x-axis of the grids) in the first attention map, determined by marginalization, are visualized on both images for comparison. The model gives the correct answer, brown. . . . 83
Figure 5.4 Visualization examples from the test split of Flickr30k Entities. Solid-lined boxes indicate predicted phrase localizations and dashed-lined boxes indicate the ground truth. If there are multiple ground-truth boxes, the closest box is shown for inspection. Each color of a phrase is matched with the corresponding color of the predicted and ground-truth boxes. Best viewed in color. . . . 84
List of Tables
Table 2.1 Three categories of multimodal learning settings. A and
B indicate arbitrary modalities. A+B denotes multimodal
learning, and A|B denotes A or B is used, respectively. . . 7
Table 3.1 The results of alternative models (a)-(e) on the test-dev. . 24
Table 3.2 The effect of the visual features and # of target answers
on the test-dev results. Vgg for VGG-19, and Res for
ResNet-152 features described in Section 3.4. . . . . . . . 25
Table 3.3 The VQA test-standard results. The precision of some accuracies [Andreas et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match. . . . 33
Table 3.4 The effects of various options for VQA test-dev. Here, the model of Figure 3a is used, since these experiments were conducted preliminarily. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [Kiros et al., 2015] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [Gal, 2015], and c stands for the usage of post-processing using an image captioning model [Karpathy and Fei-Fei, 2015]. . . . 34
Table 3.5 The effects of shortcut connections of MRN for VQA test-
dev. ResNet-152 features and 2k target answers are used.
MN stands for Multimodal Networks without residual
learning, which does not have any shortcut connections.
Dim. stands for common embedding vector’s dimension.
The number of parameters for word embedding (9.3M)
and question embedding (21.8M) is subtracted from the
total number of parameters in this table. . . . . . . . . . . 34
Table 3.6 The results for VQA test-dev. The precision of some accuracies [Andreas et al., 2016; Xiong et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match. . . . 35
Table 4.1 The accuracies of our experimental models for VQA test-
dev split and Open-Ended task. For the MCB models, A:
attention model, G: Glove vector model, and V: Visual
Genome augmentation model. . . . . . . . . . . . . . . . . 45
Table 4.2 Hyperparameters used in MLB (single model in Table 4.4). 49
Table 4.3 The effect of joint embedding size d. . . . . . . . . . . . . 50
Table 4.4 The VQA 1.0 test-standard results to compare with state-
of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data
augmentation. . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 4.5 The VQA 1.0 test-standard results for ensemble models
to compare with state-of-the-art. For unpublished entries,
their team names are used instead of their model names.
Some of their figures are updated after the challenge. . . . 61
Table 4.6 The individual models used in our ensemble model in
Table 4.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Table 5.1 Validation scores on VQA 2.0 dataset for the number of
glimpses of the BAN. The standard deviations are reported
after ± using three random initializations. . . . . 79
Table 5.2 Validation scores on VQA 2.0 dataset for attention and
integration mechanisms. The nParams indicates the num-
ber of parameters. Note that the hidden sizes of unitary attention and co-attention are 1,280, while that of the BAN is 1,024. . . . . 80
Table 5.3 Test split results for Flickr30k Entities. We report the average performance of our three randomly-initialized models (the standard deviation of R@1 is 0.17). The upper bound of performance asserted by the object detector is shown. † box size and color information are used as additional features. ‡ semantic segmentation, object detection, and pose estimation are used as additional features. Notice that the detectors of Hinami and Satoh [2017] and ours [Anderson et al., 2017] are based on Faster R-CNN [Ren et al., 2017], pre-trained using the Visual Genome dataset [Krishna et al., 2016]. . . . 85
Table 5.4 Recall@1 performance over types for Flickr30k Entities (%) 86
Table 5.5 Test-dev scores of single models on the VQA 2.0 dataset to compare with the state of the art. The first section of rows is trained on the training and validation splits. The rest of the rows are trained on the training and validation splits, plus Visual Genome for data augmentation. † This model can be found at https://github.com/yuzcccc/vqa-mfb and is not published in the paper. They use the object detection-based image features from Anderson et al. [2017], instead of 152-layer ResNet image features [He et al., 2016a]. . . . 87
Table 5.6 Test-standard scores of ensemble models on the VQA 2.0 dataset to compare with the state of the art. Excerpted from the VQA 2.0 leaderboard at the time of writing. # denotes the number of models in their ensemble methods. . . . 88
Chapter 1
Introduction
The prospect of artificial general intelligence is cautiously raised by the advances in computer vision and natural language processing. Understanding and reasoning grounded on both vision and language are the core challenge for artificial general intelligence. To promote the progress of research, quantitative evaluations that can be easily and consistently performed are required.
Visual Turing Test (VTT) [Geman et al., 2015; Malinowski and Fritz, 2014; Qi et al., 2015] is aligned with this direction on top of one of the most prestigious seminal works, the Turing test [Turing, 1950]. The Turing test was proposed to evaluate the intelligence of a machine. In this test, if a human cannot distinguish the machine from another human by assessing only conversations, isolating any other factor, the machine is deemed intelligent. This proposal is appraised as a milestone for studying intangible machine intelligence in an experimental setting. VTT takes a step further. In VTT, visual understanding of given utterances is crucial to respond appropriately. The machine must recognize objects, identify attributes, and infer relationships between objects. Moreover, the machine may need to utilize a conversational context or background knowledge.
Visual question answering [Antol et al., 2015; Gao et al., 2015; Goyal et al.,
2016; Krishna et al., 2017; Malinowski and Fritz, 2014; Ren et al., 2015; Yu
et al., 2015; Zhang et al., 2016b; Zhu et al., 2016] is an instance of VTT. With
large image datasets, question-answer pairs are collected for supervised learning.
For instance, the VQA [Antol et al., 2015; Goyal et al., 2016] dataset collected
1.1M questions and 11.1M answers for 205K COCO images [Lin et al., 2014].
In the task, a machine answers a given question, such as "Who is wearing glasses?", "Is the umbrella upside down?", or "How many children are in the bed?".
A recent line of work in vision and language communities has generalized
visual question answering [Antol et al., 2015; Gao et al., 2015; Goyal et al., 2016;
Krishna et al., 2017; Malinowski and Fritz, 2014; Ren et al., 2015; Yu et al.,
2015; Zhang et al., 2016b; Zhu et al., 2016] to visually-grounded dialog [Das
et al., 2017a,b; de Vries et al., 2016; Mostafazadeh et al., 2017; Strub et al.,
2017; Yu et al., 2015, 2016], where an agent must understand an image and
answer a sequence of questions. This generalization to the dialog system brings to the forefront issues of visual grounding, context-awareness in agents, consistency in responses, etc.
Focusing on VQA, one simple approach is to formulate this problem as a classification task given two inputs, image and text, where the output is one of the candidates among the top-k frequent answers. An image is converted to a hidden representation by a pre-trained convolutional neural network (CNN) like ResNet [He et al., 2016a], whereas a text is converted to a hidden representation by a pre-trained recurrent neural network (RNN) such as an LSTM [Hochreiter and Schmidhuber, 1997] or GRU [Cho et al., 2014], since the text can be regarded as a sequence of tokens (words pre-processed by a tokenizer), as shown in Figure 3.1.
In our previous works [Kim et al., 2016b, 2017a, 2018], joint representation
learning of vision-language multimodality is studied for the visual question
answering task. Since both vision and natural language processing are still open, progressing problems, the joint representation learning can take advantage of future progress in computer vision and natural language processing studies.
Residual learning emphasizes the importance of identity (or linear) shortcuts
to have non-linear mappings efficiently learn only residuals [He et al., 2016a,b].
In visual question answering, as one of multimodal learning tasks, this idea
may not be readily applied. Since the modalities may have correlations among
them, we need to carefully define joint residual functions as the non-linear
mappings. Moreover, the arrangement of the shortcuts is undetermined due
to its multimodality. Therefore, the characteristics of a given task must be
considered to determine the model structure.
As a successive work, the joint residual function of MRN is interpreted as a low-rank bilinear pooling method. A bilinear model can be approximated using a low-rank factorization of the weight tensor in the model, which exploits linear embeddings before and after element-wise multiplication, also known as the Hadamard product. Since this interpretation only requires common tensor operations available in modern deep learning frameworks, it is easy to implement and improves performance efficiently compared with other competitive methods.
To propose an efficient attention mechanism, the low-rank bilinear pooling is
doubly used in both attention networks and multimodal pooling networks.
In the previous works, multimodal learning for visual question answering is explored in the context of multi-step residual learning and bilinear approximation pooling in the attention mechanism. However, we observed that attention networks make multimodal residual learning inefficient. There are two hypotheses for this. First, the attention networks in every residual learning block increase the model complexity, interfering with optimization or introducing over-fitting. Second, the softmax functions in attention networks cause gradient diminishing, since their gradients are less than one and appear multiple times in the network, as the sigmoid function does.
We try to solve this issue using residual learning of attention in the context of bilinear attention networks. Bilinear attention gives separate attention to every interaction between words and visual concepts, since these interactions have different meanings. It is natural to map a token "dog" to dogs in an image, and "cat" to cats in an image. However, a naive approach may introduce a huge computational cost. So, we propose an efficient method, with the same time complexity, on top of low-rank bilinear pooling. Then, residual learning of attention gives a useful way of integrating multiple attention maps instead of naively concatenating the joint representations.
In contrast to the previous work, all attention maps are acquired in a preliminary step; then, using those attention maps, residual learning is performed. This can be interpreted as a look-ahead to predict the sequence of attentions. We believe that it prevents a short-sighted attention strategy, and it is empirically validated against the other methods.
Multimodal learning of vision and natural language fuels the progress of artificial general intelligence. In this dissertation, based on visual question answering, an instance of VTT, we explore multimodal residual learning, which provides an efficient way to combine multimodal inputs; low-rank bilinear pooling, which arises from using element-wise multiplication in deep neural networks and can be exploited in attention networks; and bilinear attention networks, which gracefully extend unitary attention networks using low-rank bilinear pooling inside bilinear attention. Here, residual learning of attention efficiently uses multiple bilinear attention maps. We believe this three-fold research helps to bring advanced machine intelligence to everyday life in natural language through visually-grounded reasoning.
Chapter 2
Multimodal Deep Learning
2.1 Introduction
Computer vision and natural language processing are highly connected to each other in the aspect of artificial intelligence. Since both vision and natural language are among the major modalities of humans, understanding and reasoning grounded on vision or language is mandatory to deal with various challenges in artificial intelligence problems. For this reason, multimodal learning is studied to solve a specific problem which involves at least two different modalities as the sources of information.

The gist of multimodal learning is to learn a joint representation from multiple sources of correlated information. For example, as Ngiam et al. [2011] mentioned, RGB images and depth maps are highly correlated at the pixel level, since the edges in depth maps often correspond to distinctive edges in the corresponding RGB images. Though vision and natural language are different modalities, both are correlated at a "mid-level", as natural language is
grounded in visual information (e.g. colors, shapes, locations, etc.). So, the major challenge in multimodal learning lies in how to learn joint representations at the "mid-level".

Table 2.1 Three categories of multimodal learning settings. A and B indicate arbitrary modalities. A+B denotes multimodal learning, and A|B denotes that A or B is used, respectively.

Settings                         Feature   Training   Testing
Deep Learning                    A|B       A|B        A|B
Multimodal Fusion                A+B       A+B        A+B
Cross Modality Learning          A+B       A|B        A|B
Shared Representation Learning   A+B       A|B        B|A
One of the successful works in multimodal learning is based on deep learning [LeCun et al., 2015], which learns joint representations using deep neural networks [Ngiam et al., 2011]. They suggest three settings of multimodal learning: multimodal fusion, cross modality learning, and shared representation learning. The multimodal fusion setting can access all modalities in both training and testing. This setting is typical in the literature (e.g. Agrawal et al. [2017]; Goyal et al. [2016]). When a large multimodal dataset is available, it can be used to learn the multimodal features in an unsupervised way. The insight of this approach is the richer representative power of multimodal datasets compared with unimodal datasets. Cross modality learning uses the pretrained multimodal features to train unimodal tasks (the other modality has a crossly supportive role in this setting), whereas shared representation learning uses the pretrained multimodal features to train cross modality tasks where different modalities are presented in the training and test phases (the multimodal features should have shared representations to infer the other modality in the test phase).
Please refer to Table 2.1, which summarizes these three settings.
In the following sections, we will discuss a linear model for multimodal
learning and deep learning approaches in three different settings in detail.
2.2 Linear Model
Combining multiple cues may reduce the noise which inevitably resides in sensory cues. However, the ambiguity of sensory cues to our nervous system imposes a burden of inferring both the causality of the cues and their contributions to the body and the environment. These contributions can come from the same cause or entirely different causes; in the same way, a cause can induce the same contribution (in some sense) or entirely different effects. In this dissertation, we focus on the case where these contributions come from the same cause, and the contributions are instantiated by visual and textual cues.
Linear models provide a simple approach to this. Let a be the ground-truth answer, v be a prediction by a visual cue, and q be a prediction by a textual cue, a question. p(a) denotes the probability that a specific answer a is the answer, p(v|a) = N(a, σ_v) denotes the probability that a specific prediction v by a visual cue is the answer a, and p(q|a) = N(a, σ_q) denotes the probability that a specific prediction q by a textual cue is the answer a. The σ_v and σ_q are the variances introduced by the certainty of the given cues. For this, we need a handful of assumptions: conditional independence among these multiple cues, Gaussian noise, and the same cause. Then, Bayes' rule yields:
p(a|v, q) = p(a) p(v, q|a) / Σ_a p(v, q, a)    (2.1)
          ∝ p(a) p(v, q|a)    (2.2)
          = p(a) p(v|a) p(q|a).    (2.3)
The maximum a posteriori (MAP) estimate of a can be acquired from the mean of the product of the Gaussian distributions, resulting in the weighted sum of v and q with weight α as follows:
a = αv + (1 − α)q    (2.4)
α = σ_q² / (σ_q² + σ_v²)    (2.5)
where the prior distribution p(a) is treated as a uniform distribution [Trommershauser et al., 2011].
However, this linear combination of cues violates conditional independence among the cues, i.e., the assumption that p(v, q|a) = p(v|a)p(q|a), since q is not completely independent from v: the question q arises from the visual cue v. One solution is to use the product rule of probability:

p(v, q|a) = p(v|a) p(q|v, a)    (2.6)
          = p(q|a) p(v|q, a).    (2.7)
Here, we use p(v|q, a) to model the dependency between the cues, instead of p(v|a), which allows us to select a portion of the information in the visual cue. Later, we will show that p(v|q, a) can be modeled using attention networks, which have a sub-module that combines multi-channel visual cues using weights determined by a textual cue.
Combining multiple cues involves modeling the joint probability distribution considering the innate dependency structure of the multiple cues. Since real-world problems tend to have dependency structures too complex to be modeled by a Gaussian distribution, there is an increasing number of attempts to learn joint representations using deep neural networks, an approach coined multimodal deep learning [Ngiam et al., 2011].
2.3 Multimodal Deep Learning
In this section, we introduce multimodal deep learning problems by describing their subcategories. According to the use of multimodal inputs in the learning phases (feature learning, supervised training, and testing), we can classify these subcategories. Here, we classify multimodal deep learning into three subcategories, multimodal fusion, cross modality learning, and shared representation learning, following the work of Ngiam et al. [2011].
2.3.1 Multimodal Fusion
The multimodal fusion setting uses multiple modalities in both training and testing. The joint representation of the multiple modalities must be learned and effectively generalized to achieve a significant improvement over a unimodal setting. For example, visual question answering (VQA) tasks [Agrawal et al., 2017; Goyal et al., 2016] aim at this criterion. In Figure 2.1, visual and textual information are both important to solve the problem. The question in the top-left section is 'Who is wearing glasses?' and, depending on the given image, the corresponding answer is obviously different. Moreover, the VQA dataset is collected so that each image has three different questions and the corresponding answers, respectively. The pair of image and question is given in the training and testing phases, and the desirable output is the answer, which can be more than one word, yes or no, or a number. Since the distribution of answer counts is long-tailed, one may formulate the task as a classification problem over some number of the most frequent answers, e.g., 2k, which covers 90.45% of all questions. Now, we can summarize the multimodal fusion approach to visual question answering as:
h = f(q,v) (2.8)
where h is the joint representation of both question and image, while q and v are embedding vectors for the question sentence and the pixel information of a given image, respectively. Then, a nonlinear classifier takes the joint representation h as an input to solve the classification problem:
p(a|q, v; Θ) = classifier(h)    (2.9)
â = argmax_{a∈Ω} p(a|q, v; Θ)    (2.10)
where a is a candidate answer, â is the estimated answer, Ω is the set of candidate answers, and Θ is the set of model parameters.
Besides, the image captioning task [Lin et al., 2014] may be confused as a multimodal fusion task, since the model learns how to map an image representation to a text representation to generate the caption corresponding to a given image. However, the generated caption only depends on a unimodal image representation. This is why the image captioning task is not a multimodal learning task in our perspective. We emphasize that multimodal learning involves learning a joint representation from multimodal information, not unimodal information.
2.3.2 Cross Modality Learning
Cross modality learning uses the pretrained multimodal features to train unimodal tasks (the other modality crossly supports the given task in this setting). Here, we assume that we can obtain a better representation for a unimodal input when feature learning uses multimodal data.
Figure 2.1 Examples from VQA 2.0 [Agrawal et al., 2017; Goyal et al., 2016], which depict the criterion that multimodal information is necessary to solve the problem. For the same question, answers may be different depending on visual information, the provided image along with the question. Reproduced with permission from Goyal et al. [2016].
Two instances of multimodal datasets for this task are AVLetters [Matthews et al., 2002] and CUAVE [Patterson et al., 2002]. The AVLetters dataset consists of audio-visual information (audio and video) for the isolated letters A-Z. Each alphabetic letter is pronounced three times by 10 participants, 5 male (two of them with mustaches) and 5 female, resulting in 780 utterances. CUAVE covers the digits 0-9; its 36 participants pronounce each digit 5 times. Using cross modality learning, there is potential to improve a model by using both audio and visual information in feature learning, e.g., for lip-reading functionality in a noisy environment.
2.3.3 Shared Representation Learning
Shared representation learning, on the other hand, uses the pretrained multimodal features to train cross modality tasks where different modalities are presented in the training and test phases (the multimodal features should have shared representations to infer the other modality in the test phase). In the aspect of using, during feature learning, information that is not used in training and testing, it is similar to the technique of pre-training features; however, it deviates from this in that the input modalities in training and testing are different. This is more similar to zero-shot learning with the aid of the shared representation.
2.4 Cognitive Models
It is helpful to know the similarities and differences between deep learning and the human brain in terms of information processing, since this may provide insight for extracting principles that can apply to their possible applications.

Studies in cognitive neuroscience have found cells related to multimodal perception. In the superior temporal sulcus (STS) in the temporal lobe of the brain of the Japanese monkey, out of 200 recorded cells, 51% were unimodal (59 visual, 33 auditory, and 10 somesthetic), 18% were bimodal (21 visual+auditory, 7 visual+somesthetic, and 8 auditory+somesthetic), and 2% were trimodal [Hikosaka et al., 1988]. Other work shows that the multisensory response is greater than the sum of unisensory responses, which is called multisensory integration [Holmes and Spence, 2005]. However, to discuss the connection from neuroscience to machine learning beyond the cellular level, we want to bring up an interesting neuronal structure in the barn owl's auditory system.
The barn owl is well studied as a cognitive model of how a sound source is localized by interaural time differences. The proposed neural model of coincident detectors [Carr and Konishi, 1990] utilizes the simultaneously received inputs from the left and right ears; the two inputs are slightly different from each other in time and intensity depending on the relative location of the source with respect to the subject. For instance, if the source is located at 45 degrees to the frontal face, the interaural time difference is approximately 1e−4 seconds, although the difference in intensity is hardly distinguishable from other environmental influences, even for barn owls. Furthermore, barn owls take an anatomical advantage of their ears: the left ear is placed a little higher than eye level facing down, while the right ear is placed lower, facing up. This asymmetric anatomy is not found in human ears; however, humans have a more complex outer-ear structure, called the pinna.
So, the activations of the neural connectivity related to the left and right ears are asymmetric depending on the location of the sound source, and this differentiates the distribution of activations in a joint circuit, the coincident detectors. Under the hood of the coincident detectors, two neuronal characteristics are required: dense connections with a wide diversity of lengths, and activation delays proportional to those lengths. Based on these characteristics, the distribution of joint activations and the resolution of interaural time differences are determined, since the coincident detectors act like AND operations over the various conditions. Notice that although the coincident detectors correspond to a unimodal auditory source, the asymmetric information induced by the interaural time difference gives meaningful clues to the coincident detectors as a sort of multimodal perception.
This observation is related to low-rank bilinear pooling in Chapter 4. Low-rank bilinear pooling effectively consists of two different linear mappings for the two inputs and element-wise multiplication of the mapped representations. Under the assumption of real-valued scalar inputs representing perceived time, and Bernoulli sampling parameterized by the probability output of a sigmoid activation function over the mapped representations, a computational cognitive model for the coincident detector can be proposed. We will see that this cognitive model is linked to the approximation of the bilinear model for multimodal cues in Chapter 4.
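A minimal sketch of such a computational model follows, assuming two modality-specific linear mappings whose sigmoid outputs are multiplied element-wise, so that a joint unit responds strongly only when its two inputs are active at the same time; the class and parameter names are ours, not from a particular implementation.

```python
import torch
import torch.nn as nn

class CoincidentPooling(nn.Module):
    """Low-rank bilinear pooling viewed as a bank of coincident detectors:
    two modality-specific linear mappings followed by element-wise
    (Hadamard) multiplication of their sigmoid activations."""

    def __init__(self, x_dim, y_dim, joint_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, joint_dim, bias=False)
        self.V = nn.Linear(y_dim, joint_dim, bias=False)

    def forward(self, x, y):
        # Each sigmoid output can be read as a firing probability; the
        # product is AND-like: near one only when both inputs are active.
        return torch.sigmoid(self.U(x)) * torch.sigmoid(self.V(y))
```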
2.5 Conclusions
We have discussed the linear model using Bayes' rule to minimize the uncertainty of multiple cues under the assumption of Gaussian noise, the three settings of multimodal deep learning, and cognitive models to compare with machine learning approaches. In the following chapters, we focus on the multimodal fusion task, one of the multimodal deep learning settings, which uses multimodal inputs in both the training and testing phases. This task reminds us that learning a joint representation of multimodal information is critical to solving a complex problem, i.e., visually-grounded reasoning using multimodal deep learning.
Chapter 3
Multimodal Residual Learning
3.1 Introduction
Visual question-answering tasks provide a testbed to cultivate synergistic proposals which handle multidisciplinary problems of vision, language, and integrated reasoning. So, visual question-answering tasks let studies in artificial intelligence go beyond narrow tasks. Furthermore, they may help to solve real-world problems which need the integrated reasoning of vision and language.
Deep residual learning [He et al., 2016a] not only advances the studies of object recognition problems, but also gives a general framework for deep neural networks. The existing non-linear layers of neural networks serve to fit another mapping, F(x), which is the residual of the identity mapping x. So, with the shortcut connection of the identity mapping x, the whole module of layers fits F(x) + x for the desired underlying mapping H(x). In other words, only the residual mapping F(x), defined by H(x) − x, is learned with the non-linear layers. In this way, very deep neural networks effectively learn representations in an efficient manner.
Many attentional models utilize residual learning to deal with various tasks, including textual reasoning [Rocktäschel et al., 2016; Sukhbaatar et al., 2015] and visual question-answering [Yang et al., 2016]. They use an attentional mechanism to handle two different information sources, a query and the context of the query (e.g. contextual sentences or an image). The query is added to the output of the attentional module, which makes the attentional module learn the residual of the query mapping, as in deep residual learning.
In this chapter, we propose Multimodal Residual Networks (MRN) to learn the multimodality of visual question-answering tasks, exploiting the excellence of deep residual learning [He et al., 2016a]. MRN inherently uses shortcuts and residual mappings for multimodality. We explore various models upon the choice of shortcuts for each modality, and joint residual mappings based on element-wise multiplication, which effectively learn multimodal representations without using explicit attention parameters. Figure 3.1 shows the inference flow of the proposed MRN.
Additionally, we propose a novel method to visualize the attention effects of each joint residual mapping. The visualization method uses the back-propagation algorithm [Rumelhart et al., 1986] for the difference between the visual input and the output of the joint residual mapping. The difference is back-propagated up to the input image; in other words, the derivative of the function which defines the difference with respect to the image is visualized. Since we use pretrained visual features, the pretrained CNN is augmented for the visualization. Based on this, we argue that MRN is an implicit attention model without explicit attention parameters.
Figure 3.1 Inference flow of Multimodal Residual Networks (MRN). Using our visualization method, the attention effects are shown as a sequence of three images. More examples are shown in Figure 3.4.

Figure 3.2 A schematic diagram of Multimodal Residual Networks with three-block layers.

Our contribution is three-fold: 1) extending deep residual learning for
visual question-answering tasks. This method utilizes multimodal inputs and allows a deeper network structure; 2) achieving the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks; and finally, 3) introducing a novel method to visualize the spatial attention effect of joint residual mappings from the collapsed visual feature using back-propagation.
3.2 Related Works
3.2.1 Deep Residual Learning
Deep residual learning [He et al., 2016a] allows neural networks to have a deeper structure of over 100 layers. Very deep neural networks are usually hard to optimize, even when well-known activation functions and regularization techniques are applied [Hinton et al., 2012; Ioffe and Szegedy, 2015; Nair and Hinton, 2010]. However, this residual learning method consistently shows state-of-the-art results across multiple visual tasks, including image classification, object detection, localization, and segmentation.
This idea assumes that a block of deep neural networks forming a non-linear mapping F(x) may paradoxically fail to fit an identity mapping. To resolve this, deep residual learning adds x to F(x) as a shortcut connection. With this idea, the non-linear mapping F(x) can focus on the residual of the shortcut mapping x. Therefore, a learning block is defined as:
y = F(x) + x (3.1)
where x and y are the input and output of the learning block, respectively.
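In code, a learning block amounts to wrapping an arbitrary shape-preserving non-linear mapping with an identity shortcut; a minimal sketch:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x (Eq. 3.1): with the identity shortcut, the wrapped
    non-linear mapping F only needs to learn the residual H(x) - x."""

    def __init__(self, residual_fn: nn.Module):
        super().__init__()
        self.F = residual_fn  # any non-linear mapping preserving the shape of x

    def forward(self, x):
        return self.F(x) + x  # identity shortcut connection
```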
3.2.2 Stacked Attention Networks
Stacked Attention Networks (SAN) [Yang et al., 2016] explicitly learn the weights of visual feature vectors to select a small portion of visual information for a given question vector. Furthermore, this model stacks the attention networks for multi-step reasoning, narrowing down the selection of visual information. For example, if the attention networks are asked to find a pink handbag in a scene, they try to find pink objects first, and then narrow down to the pink handbag.

For the attention networks, the weights are learned from a question vector and the corresponding visual feature vectors. These weights are used for the linear combination of multiple visual feature vectors indexing spatial information. Through this, SAN successfully selects a portion of visual information. Finally, the addition of the combined visual feature vector and the previous question vector is transferred as a new input question vector to the next learning block.
q_k = F(q_{k−1}, V) + q_{k−1}    (3.2)
Here, q_k is the question vector for the k-th learning block and V is the visual feature matrix, whose columns indicate specific spatial indexes. F(q, V) is the attention networks of SAN.
3.3 Multimodal Residual Networks
Deep residual learning emphasizes the importance of identity (or linear) shortcuts
to have the non-linear mappings efficiently learn only residuals [He et al., 2016a].
In multimodal learning, this idea may not be readily applied. Since the modalities
may have correlations, we need to carefully define joint residual functions as
the non-linear mappings. Moreover, the shortcuts are undetermined due to
its multimodality. Therefore, the characteristics of a given task ought to be
considered to determine the model structure.
3.3.1 Background
We consider residual learning in the attention networks of SAN. We observed that, in Equation 18 of Yang et al. [2016], the question vector is transferred directly through the successive layers of the attention networks. In the case of SAN, the shortcut mapping is for the question vector, and the non-linear mapping is the attention networks.

In the attention networks, Yang et al. [2016] assume that an appropriate choice of weights on visual feature vectors for a given question vector sufficiently captures the joint representation for answering. However, question information contributes to the joint representation only weakly, through the coefficients p, which may cause a bottleneck in learning the joint representation.
F(q, V) = Σ_i p_i V_i    (3.3)
The coefficients p are the output of a nonlinear function of a question vector q and a visual feature matrix V (see Equations 15-16 in Yang et al. [2016]). V_i is the visual feature vector at spatial index i on the 14 × 14 grid.
Lu et al. [2015] propose an element-wise multiplication of a question vector
and a visual feature vector after appropriate embeddings for a joint model. This
makes a strong baseline outperforming some of the recent works [Andreas et al.,
2016; Noh et al., 2016]. We first take this approach as a candidate for the joint
residual function, since it is simple yet successful for visual question-answering.
In this context, we take the global visual feature approach for the element-wise
multiplication, instead of the multiple (spatial) visual features approach for the
explicit attention mechanism of SAN. (We present a visualization technique
exploiting the element-wise multiplication in Section 3.5.2.)
Based on these observations, we follow the shortcut mapping and the stacking
architecture of SAN [Yang et al., 2016]; however, the element-wise multiplication
is used for the joint residual function F . These updates effectively learn the
joint representation of the given vision and language information, addressing the
bottleneck issue of the attention networks of SAN.
3.3.2 Multimodal Residual Networks
MRN consists of multiple learning blocks, which are stacked for deep residual learning. Denoting an optimal mapping by H(q, v), we approximate it using

H_1(q, v) = W_q′^(1) q + F^(1)(q, v).    (3.4)
The first (linear) approximation term is W_q′^(1) q and the first joint residual function is given by F^(1)(q, v). The linear mapping W_q′ is used for matching a feature dimension. We define the joint residual function as

F^(k)(q, v) = σ(W_q^(k) q) ⊙ σ(W_2^(k) σ(W_1^(k) v))    (3.5)
where σ is tanh, and ⊙ is element-wise multiplication. The question vector and
the visual feature vector directly contribute to the joint representation. We
justify this choice in Sections 3.4 and 3.5.
For a deeper residual learning, we replace q with H_1(q, v) in the next layer. In more general terms, Equations 3.4 and 3.5 can be rewritten as

H_L(q, v) = W_q′ q + Σ_{l=1}^{L} W_F^(l) F^(l)(H_{l−1}, v)    (3.6)

where L is the number of learning blocks, H_0 = q, W_q′ = Π_{l=1}^{L} W_q′^(l), and W_F^(l) = Π_{m=l+1}^{L} W_q′^(m). The cascading in Equation 3.6 can intuitively be represented as shown in Figure 3.2. Notice that the shortcuts for the visual part are identity mappings that transfer the input visual feature vector to each layer (dashed line). At the end of each block, we denote H_l as the output of the l-th learning block, and ⊕ is element-wise addition.
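A single MRN learning block of Equations 3.4-3.5 can be sketched as follows, with illustrative dimensions; stacking per Equation 3.6 feeds each block's output back as the next block's question input.

```python
import torch
import torch.nn as nn

class MRNBlock(nn.Module):
    """One MRN learning block: H(q, v) = W_q' q + F(q, v), where the joint
    residual is F(q, v) = tanh(W_q q) * tanh(W_2 tanh(W_1 v)) (Eqs. 3.4-3.5)."""

    def __init__(self, q_dim, v_dim, joint_dim):
        super().__init__()
        self.shortcut = nn.Linear(q_dim, joint_dim)    # W_q': matches dimensions
        self.q_map = nn.Linear(q_dim, joint_dim)       # W_q
        self.v_map1 = nn.Linear(v_dim, joint_dim)      # W_1
        self.v_map2 = nn.Linear(joint_dim, joint_dim)  # W_2

    def forward(self, q, v):
        residual = torch.tanh(self.q_map(q)) * torch.tanh(
            self.v_map2(torch.tanh(self.v_map1(v))))  # Eq. 3.5
        return self.shortcut(q) + residual             # Eq. 3.4

# Stacking (Eq. 3.6): the output replaces q in the next block, while the
# same visual feature vector v is fed to every block.
```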
3.4 Experiments
3.4.1 Visual QA Dataset
We choose the Visual QA (VQA 1.0) dataset [Antol et al., 2015] for the evaluation
of our models. Other datasets may not be ideal, since they have a limited number
of examples to train and test [Malinowski et al., 2015], or have synthesized
questions from the image captions [Lin et al., 2014; Ren et al., 2015].

Figure 3.3 Alternative models are explored to justify our proposed model. The base model (a) has a shortcut for a question vector as SAN does [Yang et al., 2016], and the joint residual function takes the form of the Deep Q+I model's joint function [Lu et al., 2015]. (b) adds an extra embedding for the visual modality. (c) adds extra embeddings for both modalities. (d) uses identity mappings for shortcuts; in the first learning block, a linear mapping matches the dimension with the joint dimension. (e) uses two shortcuts for both modalities; for simplicity, the linear mapping of the visual shortcut only appears in the first learning block. Notice that (d) and (e) are compared to (b) after the model selection of (b) among (a)-(c) on test-dev results. Eventually, we choose (b) for its best performance and relative simplicity.
The questions and answers of the VQA dataset are collected via Amazon
Mechanical Turk from human subjects who satisfy the experimental requirements.
The dataset includes 614,163 questions and 7,984,119 answers, since ten answers
are gathered for each question from unique human subjects. Therefore, Antol
et al. [2015] proposed a new accuracy metric as follows:
min(# of humans that provided that answer / 3, 1).    (3.7)
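In code, this metric is a one-liner; the example answers below are hypothetical.

```python
def vqa_accuracy(prediction, human_answers):
    """VQA accuracy of Eq. 3.7: full credit if at least three of the ten
    human-provided answers match the prediction."""
    return min(human_answers.count(prediction) / 3.0, 1.0)

# Hypothetical example: two of ten annotators agree -> 2/3 credit.
print(vqa_accuracy("sheep", ["sheep", "sheep", "goat"] + ["lamb"] * 7))
```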
The questions are answered in two ways: Open-Ended and Multiple-Choice.
Unlike Open-Ended, Multiple-Choice allows the additional information of eighteen candidate answers for each question. There are three types of answers: yes/no (Y/N), numbers (Num.), and others (Other). Table 3.3 shows that the Other type benefits the most from Multiple-Choice.

Table 3.1 The results of alternative models (a)-(e) on the test-dev.

       Open-Ended
       All     Y/N     Num.    Other
(a)    60.17   81.83   38.32   46.61
(b)    60.53   82.53   38.34   46.78
(c)    60.19   81.91   37.87   46.70
(d)    59.69   81.67   37.23   46.00
(e)    60.20   81.98   38.25   46.57
The images come from the MS-COCO dataset: 123,287 of them for training and validation, and 81,434 for test. The images are carefully collected to contain multiple objects and natural situations, which also makes them well suited for visual question-answering tasks.
3.4.2 Implementation
The Torch framework and the rnn package [Léonard et al., 2015] are used to build our models. For efficient computation over variable-length questions, TrimZero is used to trim out zero vectors [Kim et al., 2016a]. TrimZero eliminates zero computations at every time-step in mini-batch learning. Its efficiency is affected by the batch size, the RNN model size, and the number of zeros in the inputs. We found that TrimZero was suitable for VQA tasks; approximately 37.5% of training time is saved in our experiments using this technique.
Table 3.2 The effect of the visual features and # of target answers on the test-dev results. Vgg for VGG-19, and Res for ResNet-152 features described in Section 3.4.

          Open-Ended
          All     Y/N     Num.    Other
Vgg, 1k   60.53   82.53   38.34   46.78
Vgg, 2k   60.77   82.10   39.11   47.46
Vgg, 3k   60.68   82.40   38.69   47.10
Res, 1k   61.45   82.36   38.40   48.81
Res, 2k   61.68   82.28   38.82   49.25
Res, 3k   61.47   82.28   39.09   48.76
Preprocessing We follow the same preprocessing procedure as DeeperLSTM+NormalizedCNN [Lu et al., 2015] (Deep Q+I) by default. The number of answers is 1k, 2k, or 3k using the most frequent answers, which cover 86.52%, 90.45%, and 92.42% of questions, respectively. The questions are tokenized using the Python Natural Language Toolkit (nltk) [Bird et al., 2009]. Subsequently, the vocabulary sizes are 14,770, 15,031, and 15,169, respectively.
Pretrained Models A question vector q ∈ R^2400 is the last output vector of a GRU [Cho et al., 2014], initialized with the parameters of Skip-Thought Vectors [Kiros et al., 2015]. Based on the study of Noh et al. [2016], this method shows the effectiveness of question embedding in visual question-answering tasks. A visual feature vector v is an output of the first fully-connected layer of VGG-19 networks [Simonyan and Zisserman, 2015], whose dimension is 4,096. Alternatively, ResNet-152 [He et al., 2016a] is used, whose dimension is 2,048. The error is back-propagated to the input question for fine-tuning, but not to the visual part v, due to the heavy computational cost of training.
Postprocessing An image captioning model [Karpathy and Fei-Fei, 2015] is used to improve the accuracy of the Other type. Let the intermediate representation be v ∈ R^|Ω|, taken right before applying softmax, where |Ω| is the vocabulary size of answers and v_i corresponds to answer a_i. If a_i is neither a number nor yes or no, and appears at least once in the generated caption, then update v_i ← v_i + 1. Notice that the pretrained image captioning model is not part of training. This simple procedure improves the test-dev overall accuracy by around 0.1% (0.3% for the Other type). We attribute this improvement to "tie breaking" in the Other type. For the Multiple-Choice task, we mask the output of the softmax layer with the given candidate answers.
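A minimal sketch of this postprocessing follows (the function names caption_tiebreak and mask_choices are ours, for illustration only):

import numpy as np

def caption_tiebreak(v, answers, caption):
    # Bump the pre-softmax score of any non-numeric, non-yes/no answer
    # that appears in the generated caption: v_i <- v_i + 1.
    tokens = set(caption.lower().split())
    for i, a in enumerate(answers):
        if a in ("yes", "no") or a.isdigit():
            continue
        if a in tokens:
            v[i] += 1.0
    return v

def mask_choices(probs, answers, candidates):
    # Multiple-Choice: zero out probabilities outside the 18 candidates.
    mask = np.array([a in candidates for a in answers], dtype=probs.dtype)
    return probs * mask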
Hyperparameters By default, we follow Deep Q+I. The common embedding
size of the joint representation is 1,200. The learnable parameters are initialized
using a uniform distribution from −0.08 to 0.08 except for the pretrained
models. The batch size is 200, and the number of iterations is fixed to 250k. The
RMSProp [Tieleman and Hinton, 2012] is used for optimization, and dropouts
[Gal, 2015; Hinton et al., 2012] are used for regularization. The hyperparameters
are fixed using test-dev results. We compare our method to state-of-the-arts
using test-standard results.
3.4.3 Exploring Alternative Models
Figure 3.3 shows alternative models we explored, based on the observations in
Section 3.3. We carefully select alternative models (a)-(c) for the importance
of embeddings in multimodal learning [Ngiam et al., 2011; Srivastava and
Salakhutdinov, 2012], (d) for the effectiveness of identity mapping as reported by
[He et al., 2016a], and (e) for the confirmation of using question-only shortcuts
in the multiple blocks as in [Yang et al., 2016]. For comparison, all models have
three-block layers (selected after a pilot test), using VGG-19 features and 1k answers; then, the number of learning blocks is explored to confirm the pilot test.
The effect of the pretrained visual feature models and the number of answers
are also explored. All validation is performed on the test-dev split.
3.5 Results
3.5.1 Quantitative Analysis
The VQA Challenge, which released the VQA dataset, provides evaluation servers
for test-dev and test-standard test splits. For the test-dev, the evaluation server
permits unlimited submissions for validation, while the test-standard permits
limited submissions for the competition. We report accuracies in percentage.
Alternative Models The test-dev results of the alternative models for the Open-Ended task are shown in Table 3.1. (a) shows a significant improvement over SAN. However, (b) is only marginally better than (a). Compared to (b), (c) deteriorates the performance; an extra embedding for a question vector may easily cause overfitting, leading to the overall degradation. The identity shortcuts in (d) cause the degradation problem, too; extra parameters of the linear mappings may effectively support the task. (e) shows a reasonable performance; however, the extra shortcut is not essential. The empirical results seem to support this idea, since the question-only model (50.39%) achieves a result competitive with the joint model (57.75%), while the image-only model gets a poor accuracy (28.13%) (see Table 2 in [Antol et al., 2015]). Eventually, we chose model (b) for its best performance and relative simplicity.
The effects of various other options, Skip-Thought Vectors [Kiros et al., 2015] for parameter initialization, Bayesian Dropout [Gal, 2015] for regularization, the image captioning model [Karpathy and Fei-Fei, 2015] for postprocessing, and the usage of shortcut connections, are explored in Tables 3.4 and 3.5.
Number of Learning Blocks To confirm the effectiveness of the number of
learning blocks selected via a pilot test (L = 3), we explore this on the chosen
model (b), again. As the depth increases, the overall accuracies are 58.85%
(L = 1), 59.44% (L = 2), 60.53% (L = 3) and 60.42% (L = 4).
Visual Features The ResNet-152 visual features are significantly better than the VGG-19 features for the Other type in Table 3.2, even though the dimension of the ResNet features (2,048) is half that of the VGG features (4,096). The ResNet visual features are also used in the previous work [Ilievski et al., 2016]; however, our model achieves remarkably better performance with a large margin (see Table 3.3).
Number of Target Answers The number of target answers slightly affects the overall accuracies, with a trade-off among answer types, making the decision on the number of target answers difficult. We chose Res, 2k in Table 3.2 based on the overall accuracy (for the Multiple-Choice task, see Table 3.6).
Comparisons with State-of-the-arts Our chosen model significantly outperforms other state-of-the-art methods for both Open-Ended and Multiple-Choice tasks in Table 3.3. However, the performance of the Number and Other types is still not satisfactory compared to Human performance, though the advances in the recent works were mainly for Other-type answers. This fact motivates the study of a counting mechanism in future work. The model comparison is performed on the test-standard results.
[Figure 3.4 example panels (a)-(h), each showing a question and MRN's predicted answer: What kind of animals are these? sheep; What animal is the picture? elephant; What is this animal? zebra; What game is this person playing? tennis; How many cats are here? 2; What color is the bird? yellow; What sport is this? surfing; Is the horse jumping? yes]
Figure 3.4 Examples for visualization of the three-block layered MRN. The original image is shown first in each group. The next three images show the input gradients of the attention effect for each learning block, as described in Section 3.5.2. The gradients of the color channels for each pixel are summed up after taking their absolute values. Then, the summed absolute values greater than the mean plus one standard deviation of these values are visualized as the attention effect (bright color) on the images. The answers (blue) are predicted by MRN.
3.5.2 Qualitative Analysis
In Equation 3.5, the left term σ(W_q q) can be seen as a masking (attention) vector to select a part of the visual information. We assume that the difference between the right term V := σ(W_2 σ(W_1 v)) and the masked vector F(q, v) indicates an attention effect caused by the masking vector. Then, the attention
effect L_att = (1/2)‖V − F‖² is visualized on the image by calculating the gradient of L_att with respect to a given image I, while treating F as a constant:

∂L_att/∂I = (∂V/∂I)(V − F)  (3.8)
This technique can be applied to each learning block in a similar way.
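This procedure can be sketched in a few lines of PyTorch-style code (the helpers cnn and mask_fn are our placeholders for the augmented pretrained CNN and for the computation of V and F from Equation 3.5):

import torch

def attention_effect(image, cnn, mask_fn):
    # Backpropagate L_att = 0.5 * ||V - F||^2 to the input image
    # (Equation 3.8), treating F as a constant via detach().
    image = image.clone().requires_grad_(True)
    v = cnn(image)            # visual features from the augmented CNN
    V, F = mask_fn(v)         # V = sigma(W2 sigma(W1 v)), F = masked vector
    loss = 0.5 * (V - F.detach()).pow(2).sum()
    loss.backward()
    # per-pixel attention effect: sum of absolute gradients over channels
    return image.grad.abs().sum(dim=0)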
Since we use the preprocessed visual features, the pretrained CNN is aug-
mented only for this visualization. Note that model (b) in Table 3.1 is used for
this visualization, and the pretrained VGG-19 is used for preprocessing and
augmentation. The model is trained using the training set of the VQA dataset,
and visualized using the validation set. Examples are shown in Figure 3.4 (more
examples in Figure 3.5, 3.6, and 3.7).
Unlike other works [Xiong et al., 2016; Yang et al., 2016] that use explicit attention parameters, MRN does not use any explicit attention mechanism. However, we observe that element-wise multiplication can be interpreted as information masking, which yields a novel method for visualizing the attention effect of this operation. Since MRN does not depend on a few attention parameters (e.g., 14 × 14), our visualization method shows a higher resolution than the others [Xiong et al., 2016; Yang et al., 2016]. Based on this, we argue that MRN is an implicit attention model without an explicit attention mechanism.
3.6 Conclusions
The idea of deep residual learning is applied to visual question-answering tasks.
Based on the two observations of the previous works, various alternative models
are suggested and validated to propose the three-block layered MRN. Our model
achieves the state-of-the-art results on the VQA dataset for both Open-Ended
and Multiple-Choice tasks. Moreover, we have introduced a novel method to
visualize the spatial attention from the collapsed visual features using back-propagation.

[Figure 3.5 More examples of Figure 3.4 in Section 3.5.2. Panels (a)-(l), each showing a question and the predicted answer: (a) Does the man have good posture? no; (b) Did he fall down? yes; (c) Are there two cats in the picture? no; (d) What color are the bears? brown; (e) What are many of the people carrying? umbrellas; (f) What color is the dog? black; (g) Are these animals tall? yes; (h) What animal is that? sheep; (i) Are all the cows the same color? no; (j) What is the reflection of in the mirror? dog; (k) What are the giraffe in the foreground doing? eating; (l) What animal is standing in the water other than birds? bear]
We believe our visualization method brings an implicit attention mechanism to the research of attentional models. Using back-propagation of the attention effect, extensive research in object detection, segmentation, and tracking is worth further investigation.
[Figure 3.6 panels: (a1) What is the animal on the left? giraffe; (a2) Can you see trees? yes; (b1) What is the lady riding? motorcycle; (b2) Is she riding the motorcycle on the street? no]
Figure 3.6 Comparative examples on the same image. (a1) and (a2) depict a giraffe (left) and a man pointing at the giraffe. MRN consistently highlights the giraffe in (a1). However, the other question "Can you see trees?" makes MRN less attentive to the giraffe, while a tree in the right background is focused more in (a2). Similarly, the attention effect of (b2) is more widely dispersed over the background than in (b1) in the middle of the sequences, perhaps to recognize the site. However, the subtlety of this comparative study is insufficient to objectively assess the results.
Table 3.3 The VQA test-standard results. The precision of some accuracies [Andreas et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match the others.

                              Open-Ended                     Multiple-Choice
                              All    Y/N    Num.   Other     All    Y/N    Num.   Other
DPPnet [Noh et al., 2016]     57.36  80.28  36.92  42.24     62.69  80.35  38.79  52.79
D-NMN [Andreas et al., 2016]  58.00  -      -      -         -      -      -      -
Deep Q+I [Lu et al., 2015]    58.16  80.56  36.53  43.73     63.09  80.59  37.70  53.64
SAN [Yang et al., 2016]       58.90  -      -      -         -      -      -      -
ACK [Wu et al., 2016b]        59.44  81.07  37.12  45.83     -      -      -      -
FDA [Ilievski et al., 2016]   59.54  81.34  35.67  46.10     64.18  81.25  38.30  55.20
DMN+ [Xiong et al., 2016]     60.36  80.43  36.82  48.33     -      -      -      -
MRN (ours)                    61.84  82.39  38.23  49.41     66.33  82.41  39.57  58.40
Human [Antol et al., 2015]    83.30  95.77  83.39  72.67     -      -      -      -
Table 3.4 The effects of various options for VQA test-dev. Here, the model of Figure 3.3 (a) is used, since these experiments were conducted preliminarily. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [Kiros et al., 2015] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [Gal, 2015], and c stands for the usage of postprocessing using the image captioning model [Karpathy and Fei-Fei, 2015].
          Open-Ended                     Multiple-Choice
          All    Y/N    Num.   Other    All    Y/N    Num.   Other
baseline  58.97  81.11  37.63  44.90    63.53  81.13  38.91  54.06
s         59.38  80.65  38.30  45.98    63.71  80.68  39.73  54.65
s,b       59.74  81.75  38.13  45.84    64.15  81.77  39.54  54.67
s,b,c     59.91  81.75  38.13  46.19    64.18  81.77  39.51  54.72
Table 3.5 The effects of shortcut connections of MRN for VQA test-dev. ResNet-152 features and 2k target answers are used. MN stands for Multimodal Networkswithout residual learning, which does not have any shortcut connections. Dim.stands for common embedding vector’s dimension. The number of parametersfor word embedding (9.3M) and question embedding (21.8M) is subtracted fromthe total number of parameters in this table.
                           Open-Ended
      L   Dim.   #params   All    Y/N    Num.   Other
MN    1   4604   33.9M     60.33  82.50  36.04  46.89
MN    2   2350   33.9M     60.90  81.96  37.16  48.28
MN    3   1559   33.9M     59.87  80.55  37.53  47.25
MRN   1   3355   33.9M     60.09  81.78  37.09  46.78
MRN   2   1766   33.9M     61.05  81.81  38.43  48.43
MRN   3   1200   33.9M     61.68  82.28  38.82  49.25
MRN   4    851   33.9M     61.02  82.06  39.02  48.04
Table 3.6 The results for VQA test-dev. The precision of some accuracies [Andreas et al., 2016; Xiong et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match the others.

                               Open-Ended                     Multiple-Choice
                               All    Y/N    Num.   Other     All    Y/N    Num.   Other
Question [Antol et al., 2015]  48.09  75.66  36.70  27.14     53.68  75.71  37.05  38.64
Image [Antol et al., 2015]     28.13  64.01  00.42  03.77     30.53  69.87  00.45  03.76
Q+I [Antol et al., 2015]       52.64  75.55  33.67  37.37     58.97  75.59  34.35  50.33
LSTM Q [Antol et al., 2015]    48.76  78.20  35.68  26.59     54.75  78.22  36.82  38.78
LSTM Q+I [Antol et al., 2015]  53.74  78.94  35.24  36.42     57.17  78.95  35.80  43.41
Deep Q+I [Lu et al., 2015]     58.02  80.87  36.46  43.40     62.86  80.88  37.78  53.14
DPPnet [Noh et al., 2016]      57.22  80.71  37.24  41.69     62.48  80.79  38.94  52.16
D-NMN [Andreas et al., 2016]   57.90  80.50  37.40  43.10     -      -      -      -
SAN [Yang et al., 2016]        58.70  79.30  36.60  46.10     -      -      -      -
ACK [Wu et al., 2016b]         59.17  81.01  38.42  45.23     -      -      -      -
FDA [Ilievski et al., 2016]    59.24  81.14  36.16  45.77     64.01  81.50  39.00  54.72
DMN+ [Xiong et al., 2016]      60.30  80.50  36.80  48.30     -      -      -      -
Vgg, 1k                        60.53  82.53  38.34  46.78     64.79  82.55  39.93  55.23
Vgg, 2k                        60.77  82.10  39.11  47.46     65.27  82.12  40.84  56.39
Vgg, 3k                        60.68  82.40  38.69  47.10     65.09  82.42  40.13  55.93
Res, 1k                        61.45  82.36  38.40  48.81     65.62  82.39  39.65  57.15
Res, 2k                        61.68  82.28  38.82  49.25     66.15  82.30  40.45  58.16
Res, 3k                        61.47  82.28  39.09  48.76     66.33  82.41  39.57  58.40
[Figure 3.7 failure panels (a)-(l), each as question? prediction (blue) / ground-truth answer (red): (a) What animals are these? bears / ducks; (b) What are these animals? cows / goats; (c) What animals are visible? sheep / horses; (d) How many animals are depicted? 2 / 1; (e) What flavor donut is this? chocolate / strawberry; (f) What is the man doing? playing tennis / frisbee; (g) What color are the giraffe's eyelashes? brown / black; (h) What food is the bear trying to eat? banana / papaya; (i) What kind of animal is used to herd these animals? sheep / dog; (j) What species of tree are in the background? pine / palm; (k) Are there any birds in the photo? no / yes; (l) Why is the hydrant smiling? happy / someone drew on it]

Figure 3.7 Failure examples. Each question is followed by the model prediction (blue) and the answer (red). As mentioned in Section 3.5, MRN shows the weakness of counting in (d) and (k). Sometimes, the model finds objects regardless of the given question. In (j), even though the word cat does not appear in the question, the cat in the image is surely attended. (i) shows the limitation of the attentional mechanism, which needs inference using world knowledge.
Chapter 4
Multimodal Low-rank Bilinear Pooling
4.1 Introduction
Bilinear models [Tenenbaum and Freeman, 2000] provide richer representations
than linear models. To exploit this advantage, fully-connected layers in neural
networks can be replaced with bilinear pooling. The outer product of two
vectors (or Kronecker product for matrices) is involved in bilinear pooling; as a result, all pairwise interactions among the given features are considered.
Recently, a successful application of this technique is used for fine-grained visual
recognition [Lin et al., 2015].
However, bilinear pooling produces a high-dimensional feature of quadratic
expansion, which may constrain a model structure and computational resources.
For example, an outer product of two feature vectors, both of which have 1K-
dimensionality, produces a million-dimensional feature vector. Therefore, for
classification problems, the choice of the number of target classes is severely
constrained, because the number of parameters for a standard linear classifier is
determined by multiplication of the size of the high-dimensional feature vector
and the number of target classes.
Compact bilinear pooling [Gao et al., 2016] reduces the quadratic expansion of dimensionality by two orders of magnitude, retaining the performance of full bilinear pooling. This approximation uses a sampling-based computation, Tensor Sketch Projection [Charikar et al., 2002; Pham and Pagh, 2013], which utilizes a useful property, Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s): the projection of the outer product of two vectors is the convolution of the two projected vectors. Here, Ψ is the proposed projection function, and h and s are parameters randomly sampled by the algorithm.
Nevertheless, compact bilinear pooling has two shortcomings. The first comes from the sampling approach. Compact bilinear pooling relies on a favorable property, E[⟨Ψ(x, h, s), Ψ(y, h, s)⟩] = ⟨x, y⟩, which provides a basis for using projected features instead of the original features. Yet, calculating the exact expectation is computationally intractable, so the random parameters h and s are fixed during training and evaluation. This practical choice leads to the second shortcoming: the projected dimension of compact bilinear pooling should be large enough to minimize the bias from the fixed parameters. Practical choices are 10K and 16K for 512- and 4096-dimensional inputs, respectively [Fukui et al., 2016; Gao et al., 2016]. Though these compacted dimensions are two orders of magnitude smaller than those of full bilinear pooling, such high-dimensional features can still be a bottleneck for computationally complex models.
We propose low-rank bilinear pooling using Hadamard product (element-
wise multiplication), which is commonly used in various scientific computing
frameworks as one of tensor operations. The proposed method factors a three-
dimensional weight tensor for bilinear pooling into three two-dimensional weight
matrices, which enforces the rank of the weight tensor to be low. As a result, the two input feature vectors, linearly projected by the two weight matrices respectively, are combined by Hadamard product, then followed by a linear projection using the third weight matrix. For example, the projected vector z is represented by z = W_z^T (W_x^T x ∘ W_y^T y), where ∘ denotes Hadamard product.
We also explore adding non-linearity to low-rank bilinear pooling using non-linear activation functions, and shortcut connections inspired by deep residual learning [He et al., 2016a]. Then, we show that it becomes a simple baseline model [Antol et al., 2015] or a one-block layered Multimodal Residual Network [Kim et al., 2016b] as a low-rank bilinear model, an interpretation that has not been made before.
Our contributions are as follows: First, we propose low-rank bilinear pooling to approximate full bilinear pooling, as a substitute for compact bilinear pooling. Second, Multimodal Low-rank Bilinear Attention Networks (MLB), which have an efficient attention mechanism using low-rank bilinear pooling, are proposed for visual question-answering tasks. MLB achieves a new state-of-the-art performance and has a better parsimonious property. Finally, ablation studies exploring alternative choices, e.g., network depth, non-linear functions, and shortcut connections, are conducted.
4.2 Low-rank Bilinear Model
Bilinear models use a quadratic expansion of linear transformation considering
every pair of features.
f_i = Σ_{j=1}^{N} Σ_{k=1}^{M} w_{ijk} x_j y_k + b_i = x^T W_i y + b_i  (4.1)
where x and y are input vectors, Wi ∈ RN×M is a weight matrix for the output
fi, and bi is a bias for the output fi. Notice that the number of parameters is
L× (N ×M + 1) including a bias vector b, where L is the number of output
features.
Pirsiavash et al. [2009] suggest a low-rank bilinear method that reduces the rank of the weight matrix W_i to have fewer parameters for regularization. They rewrite the weight matrix as W_i = U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}, which restricts the rank of W_i to be at most d ≤ min(N, M).
Based on this idea, fi can be rewritten as follows:
f_i = x^T W_i y + b_i = x^T U_i V_i^T y + b_i = 1^T (U_i^T x ∘ V_i^T y) + b_i  (4.2)
where 1 ∈ R^d denotes a column vector of ones, and ∘ denotes Hadamard product.
Still, we need two third-order tensors, U and V, for a feature vector f , whose
elements are fi. To reduce the order of the weight tensors by one, we replace 1
with P ∈ Rd×c and bi with b ∈ Rc, then, redefine as U ∈ RN×d and V ∈ RM×d
to get a projected feature vector f ∈ Rc. Then, we get:
f = P^T (U^T x ∘ V^T y) + b  (4.3)
where d and c are hyperparameters to decide the dimension of joint embeddings
and the output dimension of low-rank bilinear models, respectively.
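As a minimal sketch of Equation 4.3 (a NumPy illustration of ours with toy dimensions; variable names follow the equation):

import numpy as np

def low_rank_bilinear_pooling(x, y, U, V, P, b):
    # f = P^T (U^T x * V^T y) + b (Equation 4.3);
    # * is the Hadamard (element-wise) product.
    return P.T @ ((U.T @ x) * (V.T @ y)) + b

# toy dimensions: N, M inputs, d joint embedding, c outputs
N, M, d, c = 2400, 2048, 1200, 2000
rng = np.random.default_rng(0)
x, y = rng.standard_normal(N), rng.standard_normal(M)
U, V = rng.standard_normal((N, d)), rng.standard_normal((M, d))
P, b = rng.standard_normal((d, c)), np.zeros(c)
f = low_rank_bilinear_pooling(x, y, U, V, P, b)
assert f.shape == (c,)

Note the parameter count d(N + M + c) in place of the N × M × c weights of full bilinear pooling.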
4.3 Low-rank Bilinear Pooling
A low-rank bilinear model in Equation 4.3 can be implemented using two linear
mappings without biases for embedding two input vectors, Hadamard product
to learn joint representations in a multiplicative way, and a linear mapping with
a bias to project the joint representations into an output vector for a given
output dimension. Then, we use this structure as a pooling method for deep
neural networks. Now, we discuss possible variations of low-rank bilinear pooling
based on this model inspired by studies of neural networks.
4.3.1 Full Model
In Equation 4.3, linear projections, U and V , can have their own bias vectors.
As a result, linear models for each input vectors, x and y, are integrated in an
additive form, called as full model for linear regression in statistics:
f = P^T ((U^T x + b_x) ∘ (V^T y + b_y)) + b
  = P^T (U^T x ∘ V^T y + U′^T x + V′^T y) + b′.  (4.4)

Here, U′^T = diag(b_y) · U^T, V′^T = diag(b_x) · V^T, and b′ = b + P^T (b_x ∘ b_y).
4.3.2 Nonlinear Activation
Applying non-linear activation functions may help to increase representative
capacity of model. The first candidate is to apply non-linear activation functions
right after linear mappings for input vectors.
f = P^T (σ(U^T x) ∘ σ(V^T y)) + b  (4.5)
where σ denotes an arbitrary non-linear activation function that maps real values into a finite interval, e.g., sigmoid or tanh. If the two inputs come from different modalities, their statistics may be quite different from each other, which may result in interference, since the gradient with respect to each input directly depends on the other input in the Hadamard product of the two inputs.
Additionally applying an activation function after the Hadamard product is not appropriate, since activation functions would then appear doubly in calculating gradients. However, applying the activation function only after the Hadamard product is an alternative choice (we explore this option in Section 4.5) as follows:

f = P^T σ(U^T x ∘ V^T y) + b.  (4.6)
Note that the use of an activation function in low-rank bilinear pooling can be found in an implementation of the simple baseline for the VQA 1.0 dataset [Antol et al., 2015], without an interpretation as low-rank bilinear pooling. Notably, however, Wu et al. [2016c] studied the learning behavior of multiplicative integration in RNNs with discussions and empirical evidence.
4.3.3 Shortcut Connection
When we apply two previous techniques, full model and non-linear activation,
linear models of two inputs are nested by the non-linear activation functions.
To avoid this unfortunate situation, we add shortcut connections as explored in
residual learning [He et al., 2016a].
f = P^T (σ(U^T x) ∘ σ(V^T y)) + h_x(x) + h_y(y) + b  (4.7)
where h_x and h_y are shortcut mappings. For linear projection, the shortcut mappings are linear mappings. Notice that this formulation is a generalized form of the one-block layered MRN [Kim et al., 2016b]. However, the shortcut connections are not used in our proposed model, as explained in Section 4.6.
4.4 Multimodal Low-rank Bilinear Attention Networks
In this section, we apply low-rank bilinear pooling to propose an efficient atten-
tion mechanism for visual question-answering tasks, based on the interpretation
of the previous section. We assume that the inputs are a question embedding vector q and a set of visual feature vectors F over an S × S lattice space.
4.4.1 Low-rank Bilinear Pooling in Attention Mechanism
Attention mechanism uses an attention probability distribution α over S × S
lattice space. Here, using low-rank bilinear pooling, α is defined as
α = softmax(P_α^T (σ(U_q^T q · 1^T) ∘ σ(V_F^T F^T)))  (4.8)

where α ∈ R^{G×S²}, P_α ∈ R^{d×G}, σ is a hyperbolic tangent function, U_q ∈ R^{N×d}, q ∈ R^N, 1 ∈ R^{S²}, V_F ∈ R^{M×d}, and F ∈ R^{S²×M}. If G > 1, multiple glimpses are
explicitly expressed as in Fukui et al. [2016], conceptually similar to Jaderberg
et al. [2015]. And, the softmax function applies to each row vector of α. The
bias terms are omitted for simplicity.
4.4.2 Multimodal Low-rank Bilinear Attention Networks
Attended visual feature v is a linear combination of F_i with coefficients α_{g,i}. Each attention probability distribution α_g is for a glimpse g. For G > 1, v is the concatenation of the resulting vectors v_g as

v = ∥_{g=1}^{G} Σ_{s=1}^{S²} α_{g,s} F_s  (4.9)

where ∥ denotes concatenation of vectors. The posterior probability distribution
is an output of a softmax function, whose input is the result of another low-rank
bilinear pooling of q and v as
p(a|q, F; Θ) = softmax(P_o^T (σ(W_q^T q) ∘ σ(V_v^T v)))  (4.10)

â = argmax_{a∈Ω} p(a|q, F; Θ)  (4.11)
where â denotes a predicted answer, Ω is a set of candidate answers, and Θ is an aggregation of the entire model parameters.
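Equations 4.8-4.11 can be sketched as follows (a NumPy illustration of ours with bias terms omitted; variable names follow the equations):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mlb_attention(q, F, Uq, VF, Pa):
    # Eqs. 4.8-4.9: q (N,), F (S^2, M), Uq (N, d), VF (M, d), Pa (d, G).
    # Returns attention alpha (G, S^2) and attended feature v (G*M,).
    joint = np.tanh(Uq.T @ q)[:, None] * np.tanh(VF.T @ F.T)   # (d, S^2)
    alpha = softmax(Pa.T @ joint, axis=1)                      # (G, S^2)
    v = np.concatenate([alpha[g] @ F for g in range(alpha.shape[0])])
    return alpha, v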
4.4.3 Model Schema
Figure 4.1 shows a schematic diagram of MLB, where ∘ denotes Hadamard product, and Σ denotes a linear combination of visual feature vectors using
coefficients, which is the output of softmax function. If G > 1, the softmax
function is applied to each row vectors of an output matrix (Equation 4.8), and
we concatenate the resulting vectors of the G linear combinations (Equation 4.9).
Figure 4.1 A schematic diagram of MLB. The Replicate module copies a question embedding vector to match the S² visual feature vectors. Conv modules indicate 1 × 1 convolution to transform a given channel space, which is computationally equivalent to linear projection for channels.
4.5 Experiments
Table 4.1 The accuracies of our experimental models for VQA test-dev split andOpen-Ended task. For the MCB models, A: attention model, G: Glove vectormodel, and V: Visual Genome augmentation model.
MODEL                         SIZE    ALL    Y/N    NUM    ETC
MRN-L3                        65.0M   61.68  82.28  38.82  49.25
MARN-L3                       65.5M   62.37  82.31  38.06  50.83
MARN-L2                       56.3M   63.92  82.88  37.98  53.59
* MARN-L1                     47.0M   63.79  82.73  37.92  53.46
MARN-L1-G1                    47.0M   63.79  82.73  37.92  53.46
* MARN-L1-G2                  57.7M   64.53  83.41  37.82  54.43
MARN-L1-G4                    78.9M   64.61  83.72  37.86  54.33
No Tanh                       57.7M   63.58  83.18  37.23  52.79
* Before-Product              57.7M   64.53  83.41  37.82  54.43
After-Product                 57.7M   64.53  83.53  37.06  54.50
Mode Answer                   57.7M   64.53  83.41  37.82  54.43
* Sampled Answer              57.7M   64.80  83.59  38.38  54.73
Shortcut                      57.7M   64.80  83.59  38.38  54.73
* No Shortcut                 51.9M   65.08  84.14  38.21  54.87
MLB                           51.9M   65.08  84.14  38.21  54.87
MLB+VG                        51.9M   65.84  83.87  37.87  56.76
MCB-A [Fukui et al., 2016]    69.2M   64.2   82.2   37.7   54.8
MCB-AG [Fukui et al., 2016]   70.5M   64.7   82.5   37.6   55.6
MCB-AGV [Fukui et al., 2016]  70.5M   65.4   82.3   37.2   57.4
In this section, we conduct six experiments to select the proposed model,
Multimodal Low-rank Bilinear Attention Networks (MLB). Each experiment
controls other factors except one factor to assess the effect on accuracies. Based
on MRN [Kim et al., 2016b], we start our assessments with an initial option
of G = 1 and the shortcut connections of MRN, called Multimodal Attention Residual Networks (MARN). Notice that we use one embedding for each visual feature for better performance, based on our preliminary experiment (not shown). We attribute this choice to the attention mechanism for visual features, which provides more capacity to learn visual features. We use the same hyperparameters as MRN [Kim et al., 2016b] unless explicitly mentioned otherwise.
The VQA 1.0 dataset [Antol et al., 2015] is used as a primary dataset, and, for
data augmentation, question-answering annotations of Visual Genome [Krishna
et al., 2016] are used. Validation is performed on the VQA test-dev split, and
model comparison is based on the results of the VQA test-standard split. For
the comprehensive reviews of VQA tasks, please refer to Wu et al. [2016a] and
Kafle and Kanan [2016a]. The source code for the experiments is available in
the GitHub repository: https://github.com/jnhwkim/MulLowBiVQA.
Number of Learning Blocks Kim et al. [2016b] argue that the three-block layered MRN shows the best performance among one- to four-block layered models, taking advantage of residual learning. However, we speculate that the introduction of an attention mechanism makes deep networks hard to optimize. Therefore, we explore the number of learning blocks of MARN, which has an attention mechanism using low-rank bilinear pooling.
Number of Glimpses Fukui et al. [2016] show that the attention mechanism
of two glimpses was an optimal choice. In a similar way, we assess one, two, and
four-glimpse models.
Non-Linearity We assess three options for applying non-linearity on low-rank bilinear pooling: vanilla (none), before the Hadamard product as in Equation 4.5, and after the Hadamard product as in Equation 4.6.
Answer Sampling The VQA 1.0 dataset [Antol et al., 2015] has ten answers from unique persons for each question, while the Visual Genome dataset [Krishna et al., 2016] has a single answer for each question. Since difficult or ambiguous questions may have divided answers, probabilistic sampling from the distribution of answers can be utilized to optimize for the multiple answers. An instance can be found in Fukui et al. [2016] (https://github.com/akirafukui/vqa-mcb/blob/5fea8/train/multi_att_2_glove/vqa_data_provider_layer.py#L130). We simplify the procedure as follows:
p(a_1) = |a_1| / Σ_i |a_i|  if |a_1| ≥ 3;  0 otherwise  (4.12)

p(a_0) = 1 − p(a_1)  (4.13)
where |a_i| denotes the number of occurrences of the unique answer a_i in the set of multiple answers, a_0 denotes the mode, which is the most frequent answer, and a_1 denotes the second most frequent answer. We define the divided answers as those having at least three occurrences of the second most frequent answer, for the evaluation metric of VQA [Antol et al., 2015]:

accuracy(a_k) = min(|a_k|/3, 1).  (4.14)
The rate of divided answers is approximately 16.40%, and only 0.23% of questions have more than two divided answers in the VQA dataset. We assume that this eases the difficulty of convergence without severe degradation of performance.
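A minimal sketch of this sampling procedure (the function name is ours, for illustration):

import random
from collections import Counter

def sample_answer(answers):
    # Answer sampling (Eqs. 4.12-4.13): train on the runner-up answer a1
    # with probability |a1| / sum_i |a_i| when |a1| >= 3, otherwise on
    # the mode answer a0.
    counts = Counter(answers).most_common()
    a0 = counts[0][0]
    if len(counts) > 1 and counts[1][1] >= 3:
        a1, n1 = counts[1]
        if random.random() < n1 / len(answers):
            return a1
    return a0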
Shortcut Connection The contribution of shortcut connections for residual learning is explored, based on the observation of the competitive performance of the single-block layered model, since the usefulness of shortcut connections is linked to the network depth [He et al., 2016a].
Data Augmentation Data augmentation using the Visual Genome [Krishna et al., 2016] question answer annotations is explored. Visual Genome originally provides 1.7 million visual question answer annotations. After aligning to VQA, the valid number of question-answering pairs for training is 837,298, for 99,280 distinct images.
4.5.1 Preprocessing
We follow the preprocessing procedure of Kim et al. [2016b]. Here, we note some of its details, and our changes.
The 90.45% of questions covering the 2K most frequent answers are used. The vocabulary size of questions is 15,031. A GRU [Cho et al., 2014] is used for question embedding. Based on earlier studies [Kim et al., 2016b; Noh et al., 2016], a word embedding matrix and the GRU are initialized with the Skip-Thought Vectors pretrained model [Kiros et al., 2015]. As a result, question vectors have 2,400 dimensions.
For efficient computation of variable-length questions, TrimZero [Kim et al., 2016a] is used for the GRU. Moreover, for regularization, Bayesian Dropout [Gal, 2015], as implemented in Léonard et al. [2015], is applied during training.
4.5.2 Vision Embedding
ResNet-152 networks [He et al., 2016a] are used for feature extraction. The dimensionality of an input image is 3 × 448 × 448. The outputs of the last convolution layer are used, which have 2,048 × 14 × 14 dimensions.
4.5.3 Hyperparameters
The hyperparameters used in MLB of Table 4.4 are described in Table 4.2.
The batch size is 100, and the number of iterations is fixed to 250K. For data
augmented models, a simplified early stopping is used, starting from 250K to
350K-iteration for every 25K iterations (250K, 275K, 300K, 325K, and 350K; at
most five points) to avoid exhaustive submissions to VQA test-dev evaluation
server. RMSProp [Tieleman and Hinton, 2012] is used for optimization.
Though the size of the joint embedding d is borrowed from Kim et al. [2016b], a grid search on d confirms this choice in our model, as shown in Table 4.3.
Table 4.2 Hyperparameters used in MLB (single model in Table 4.4).
SYMBOL  VALUE          DESCRIPTION
S       14             attention lattice size
N       2,400          question embedding size
M       2,048          channel size of extracted visual features
d       1,200          joint embedding size
G       2              number of glimpses
|Ω|     2,000          number of candidate answers
η       3e-4           learning rate
λ       0.99997592083  learning rate decay factor at every iteration
p       0.5            dropout rate
θ       ±10            gradient clipping threshold
4.6 Results
The six experiments are conducted sequentially. Each experiment determines
experimental variables one by one. Table 4.1, which has six sectors divided by mid-rules, shows the accuracies of our experimental model, Multimodal Attention Residual Networks (MARN), with respect to the number of learning blocks (L#), the number of glimpses (G#), the position of activation functions (tanh), answer sampling, shortcut connections, and data augmentation using the Visual Genome dataset, for the VQA test-dev split and Open-Ended task. Note that our proposed model, Multimodal Low-rank Bilinear Attention Networks (MLB), has no shortcut connections, compared with MARN. MODEL: model name, SIZE: number of parameters, ALL: overall accuracy in percentage, Y/N: yes/no, NUM: numbers, and ETC: others. Since Fukui et al. [2016] only report the accuracy of the ensemble model on the test-standard, the test-dev results of their single models are included in the last sector. Some figures have different precisions, which are rounded. ∗ indicates the selected model for each experiment.

Table 4.3 The effect of joint embedding size d.

              Open-Ended
d     SIZE    ALL    Y/N    NUM    ETC
800   45.0M   64.89  84.08  38.15  54.55
1000  48.4M   65.06  84.18  38.01  54.85
1200  51.9M   65.08  84.14  38.21  54.87
1400  55.4M   64.94  84.13  38.00  54.64
1600  58.8M   65.02  84.15  37.79  54.85
4.6.1 Six Experiment Results
Number of Learning Blocks Though MRN [Kim et al., 2016b] has a three-block layered architecture, MARN shows the best performance with two-block layered models (63.92%). For the multiple-glimpse models in the next experiment, we choose the one-block layered model for its simplicity to extend and its competitive performance (63.79%).
Number of Glimpses Compared with the results of Fukui et al. [2016], four-glimpse MARN (64.61%) is better than the other comparative models. However, for a parsimonious choice, two-glimpse MARN (64.53%) is chosen for later experiments. We speculate that multiple glimpses are one of the key factors for the competitive performance of MCB [Fukui et al., 2016], based on the large margin in accuracy compared with one-glimpse MARN (63.79%).
Non-Linearity The results confirm that activation functions are useful for improving performance. Surprisingly, there is no empirical difference between the two options, before-Hadamard product and after-Hadamard product. This result may build a bridge to studies on multiplicative integration with recurrent neural networks [Wu et al., 2016c].
Answer Sampling Sampled answers (64.80%) result in better performance than mode answers (64.53%). This confirms that the distribution of answers from annotators can be used to improve the performance. However, the number of multiple answers is usually limited due to the cost of data collection.
Shortcut Connection Though MRN [Kim et al., 2016b] effectively uses shortcut connections to improve model performance, the one-block layered MARN shows better performance without the shortcut connection. In other words, residual learning is not used in our proposed model, MLB. It seems that there is a trade-off between introducing an attention mechanism and residual learning. We leave a careful study of this trade-off for future work.
Data Augmentation Data augmentation using Visual Genome [Krishna et al., 2016] question answer annotations significantly improves the performance by 0.76% in accuracy for the VQA test-dev split. Especially, the accuracy of others (ETC)-type answers is notably improved by the data augmentation.
4.6.2 Comparison with State-of-the-Art
The comparison with other single models on VQA test-standard is shown in
Table 4.4. The overall accuracy of our model is approximately 1.9% above the
next best model [Noh and Han, 2016] on the Open-Ended task of VQA. The
major improvements are from yes-or-no (Y/N) and others (ETC)-type answers.
In Table 4.5, we also report the accuracy of our ensemble model to compare
with other ensemble models on VQA test-standard, which won 1st to 5th places in VQA Challenge 2016 (http://visualqa.org/challenge.html). We beat the previous state-of-the-art with a margin of 0.42%.
4.6.3 Ensemble of Seven Models
The test-dev results for individual models consisting of our ensemble model is
presented in Table 4.6.
4.7 Related Works
MRN [Kim et al., 2016b] proposes multimodal residual learning with the Hadamard product of low-rank bilinear pooling. However, its utilization of low-rank bilinear pooling is limited to the joint residual mapping function for multimodal residual learning. Higher-Order Boltzmann Machines [Memisevic and Hinton, 2007, 2010] use Hadamard product to capture the interactions of input, output, and hidden representations in an energy function. Wu et al. [2016c] propose recurrent neural networks using Hadamard product to integrate multiplicative interactions among hidden representations in the model.
Yet, compact bilinear pooling, or multimodal compact bilinear pooling [Fukui et al., 2016; Gao et al., 2016], is worth discussing and carefully comparing with our method.
4.7.1 Multimodal Residual Networks
MRN [Kim et al., 2016b] is an implicit attentional model using multimodal
residual learning with Hadamard product which does not have any explicit
attention mechanism.
F^(k)(q, v) = σ(W_q^(k) q) ∘ σ(W_2^(k) σ(W_1^(k) v))  (4.15)

H_L(q, v) = W_{q′} q + Σ_{l=1}^{L} W_{F^(l)} F^(l)(H_{l−1}, v)  (4.16)

where W_∗ are parameter matrices, L is the number of learning blocks, H_0 = q, W_{q′} = Π_{l=1}^{L} W_{q′}^(l), and W_{F^(l)} = Π_{m=l+1}^{L} W_{q′}^(m). Notice that these equations can be generalized by Equation 4.7.
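In recursive form, Equations 4.15-4.16 can be sketched as follows (a NumPy illustration of ours; the parameter-dictionary layout is an assumption):

import numpy as np

def mrn(q, v, params, L=3):
    # Recursive form of Eqs. 4.15-4.16:
    # H_l = W_q'^(l) H_{l-1} + F^(l)(H_{l-1}, v), with H_0 = q;
    # unrolling recovers the products of W_q'^(l) in Eq. 4.16.
    sigma = np.tanh
    H = q
    for l in range(L):
        p = params[l]
        F = sigma(p["Wq"] @ H) * sigma(p["W2"] @ sigma(p["W1"] @ v))  # Eq. 4.15
        H = p["Wq_prime"] @ H + F                                     # linear shortcut + residual
    return H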
However, an explicit attention mechanism allows the use of lower-level visual features than fully-connected layers and, more importantly, spatially selective learning. Recent state-of-the-art methods use a variant of an explicit attention mechanism in their models [Fukui et al., 2016; Lu et al., 2016; Noh and Han, 2016]. Note that the shortcut connections of MRN are not used in the proposed Multimodal Low-rank Bilinear (MLB) model, since they provide no performance gain when multiple layers are not stacked in MLB. We leave the study of residual learning for MLB, which may leverage the excellency of bilinear models as suggested in Wu et al. [2016a], for future work.
4.7.2 Higher-Order Boltzmann Machines
A similar model can be found in a study of Higher-Order Boltzmann Ma-
chines [Memisevic and Hinton, 2007, 2010]. They suggest a factoring method for
the three-way energy function to capture correlations among input, output, and
hidden representations.
−E(y, h; x) = Σ_f (Σ_i x_i w^x_{if})(Σ_j y_j w^y_{jf})(Σ_k h_k w^h_{kf}) + Σ_k w^h_k h_k + Σ_j w^y_j y_j
            = (x^T W^x ∘ y^T W^y ∘ h^T W^h) 1 + h^T w^h + y^T w^y  (4.17)
Setting aside the bias terms, the I × J × K parameter tensor of unfactored Higher-Order Boltzmann Machines is replaced with three matrices, W^x ∈ R^{I×F}, W^y ∈ R^{J×F}, and W^h ∈ R^{K×F}.
4.7.3 Multiplicative Integration with Recurrent Neural Net-works
Most of recurrent neural networks, including vanilla RNNs, Long Short Term
Memory networks [Hochreiter and Schmidhuber, 1997] and Gated Recurrent
Units [Cho, Kyunghyun and Van Merriënboer, Bart and Gulcehre, Caglar and
Bahdanau, Dzmitry and Bougares, Fethi and Schwenk, Holger and Bengio,
Yoshua, 2014], share a common expression as follows:
ϕ(Wx+Uh+ b) (4.18)
where ϕ is a non-linear function, W ∈ Rd×n, x ∈ Rn, U ∈ Rd×m, h ∈ Rm, and
b ∈ Rd is a bias vector. Note that, usually, x is an input state vector and h is
a hidden state vector in recurrent neural networks.
Wu et al. [2016c] propose a new design to replace the additive expression
with a multiplicative expression using Hadamard product as
ϕ(Wx ∘ Uh + b).  (4.19)
Moreover, a general formulation of this multiplicative integration can be
described as
ϕ(α ∘ Wx ∘ Uh + Wx ∘ β₁ + Uh ∘ β₂ + b)  (4.20)
which is reminiscent of the full model in Section 4.3.1.
4.7.4 Compact Bilinear Pooling
Compact bilinear pooling [Gao et al., 2016] approximates full bilinear pooling
using a sampling-based computation, Tensor Sketch Projection [Charikar et al.,
2002; Pham and Pagh, 2013]:
Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s)  (4.21)
               = FFT⁻¹(FFT(Ψ(x, h, s)) ∘ FFT(Ψ(y, h, s)))  (4.22)

where ⊗ denotes outer product, ∗ denotes convolution, Ψ(v, h, s)_i := Σ_{j: h_j = i} s_j · v_j, FFT denotes the Fast Fourier Transform, d denotes an output dimension, x, y, h, s ∈ R^n, x and y are inputs, and h and s are random variables. h_i is sampled from {1, ..., d}, and s_i is sampled from {−1, 1}; then, both random
variables are fixed for further usage. Even if the dimensions of x and y are
different from each other, it can be used for multimodal learning [Fukui et al.,
2016].
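A minimal NumPy sketch of this projection and the Tensor Sketch approximation follows (function names are ours; in practice h and s would be fixed after initialization):

import numpy as np

def count_sketch(v, h, s, d):
    # Count Sketch projection Psi(v, h, s): Psi_i = sum_{j: h_j = i} s_j * v_j.
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def compact_bilinear(x, y, d, rng=np.random.default_rng(0)):
    # Tensor Sketch approximation of the outer product x (x) y
    # (Eqs. 4.21-4.22): convolution of sketches computed via FFT.
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx = rng.choice([-1, 1], x.size)
    sy = rng.choice([-1, 1], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)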
Similarly to Equation 4.1, compact bilinear pooling can be described as follows:

f_i = x^T W_i y  (4.23)

where W_{ijk} = s_{ijk} w_{ijk}, s_{ijk} is sampled from {−1, 1}, w_{ijk} is sampled from {P_{i1}, P_{i2}, . . . , P_{id}}, and the compact bilinear pooling is followed by a fully connected layer P ∈ R^{|Ω|×d}. Then, this method can be formulated as a hashing trick [Chen et al., 2015; Weinberger et al., 2009] to share randomly chosen
bilinear weights using d parameters for an output value, in a way that a single parameter is shared by NM/d bilinear terms in expectation, with a variance of NM(d − 1)/d² (see Section 4.8.1).
In comparison with our method, their method approximates a three-dimensional
weight tensor in bilinear pooling with a two-dimensional matrix P, which is
larger than the concatenation of three two-dimensional matrices for low-rank
bilinear pooling. The ratio of the number of parameters for a single output to
the total number of parameters for |Ω| outputs is d/d|Ω| = 1/|Ω| [Fukui et al.,
2016], vs. d(N +M + 1)/d(N +M + |Ω|) = (N +M + 1)/(N +M + |Ω|) ≈ 2/3
(ours), since our method uses a three-way factorization. Hence, more parameters
are allocated to each bilinear approximation than compact bilinear pooling does,
effectively managing overall parameters guided by back-propagation algorithm.
MCB [Fukui et al., 2016], which uses compact bilinear pooling for multimodal
tasks, needs to set the dimension of output d to 16K, to reduce the bias induced
by the fixed random variables h and s. As a result, the majority of model
parameters (16K × 3K = 48M) are concentrated on the last fully connected
layer, which makes a fan-out structure. So, the total number of parameters
of MCB is highly sensitive to the number of classes, which is approximately
69.2M for MCB+att, and 70.5M for MCB+att+GloVe. Yet, the total number
of parameters of our proposed model (MLB) is 51.9M, which is more robust
to the number of classes having d = 1.2K, which has a similar role in model
architecture.
4.8 Discussions
4.8.1 Understanding of Multimodal Compact Bilinear Pooling
In this section, the algorithm of multimodal compact bilinear pooling (MCB) [Fukui et al., 2016; Gao et al., 2016] is described as a kind of hashing trick [Chen et al., 2015].
x ∈ R^{n_x} and y ∈ R^{n_y} are the given inputs, and Φ(x, y) ∈ R^d is the output. Random variables h^x ∈ N^{n_x} and h^y ∈ N^{n_y} are uniformly sampled from {1, . . . , d}, and s^x ∈ Z^{n_x} and s^y ∈ Z^{n_y} are uniformly sampled from {−1, 1}. Then, the Count Sketch projection function Ψ [Charikar et al., 2002] projects x and y to intermediate representations Ψ(x, h^x, s^x) ∈ R^d and Ψ(y, h^y, s^y) ∈ R^d, defined as:

Ψ(v, h, s)_i := Σ_{j: h_j = i} s_j · v_j  (4.24)
Notice that both h and s remain constant after initialization [Fukui et al., 2016]. The probability of h^x_j = i and h^y_j = i for a given j is 1/d². Hence, the expected number of bilinear terms in Ψ(x, h^x, s^x)_i Ψ(y, h^y, s^y)_i is (n_x n_y)/d². Since the output Φ(x, y) is the result of circular convolution of Ψ(x, h^x, s^x) and Ψ(y, h^y, s^y), the expected number of bilinear terms in Φ(x, y)_i is (n_x n_y)/d. Likewise, the probability that a bilinear term is allocated in Φ(x, y)_i is 1/d. The probability distribution of the number of bilinear terms in Φ(x, y)_i follows a multinomial distribution, whose mean is (n_x n_y)/d and whose variance is (n_x n_y)(d − 1)/d².
Linear projection after the multimodal compact bilinear pooling provides
weights on the bilinear terms, in a way that a shared weight is assigned to
Φ(x,y)i, which has (nxny)/d bilinear terms in expectation, though each bilinear
term can have a different sign induced by both sx and sy.
HashedNets [Chen et al., 2015] propose a method to compress neural networks
using a low-cost hashing function [Weinberger et al., 2009], which is the same function as Ψ(v, h, s). They randomly group a portion of connections in neural
networks to share a single weight. We speculate that multimodal compact bilinear pooling uses the hashing trick to reduce the number of full bilinear weights at the rate of d/(n_x n_y). However, this approximation is limited to two-way interaction, compared with the three-way factorization in our method.
4.8.2 Replacement of Low-rank Bilinear Pooling
For an explicit comparison with compact bilinear pooling, we substitute compact bilinear pooling for low-rank bilinear pooling while controlling everything else, which means that the rest of the model architecture is exactly the same. Following Fukui et al. [2016], we use MCB followed by Signed Square Root, L2-Normalization, Dropout (p=0.1), and a linear projection from 16,000 dimensions to the target dimension, along with Dropout (p=0.3) for the question embedding vector. Note that the overall architecture for multimodal learning is the same for both. Experimental details are referenced from the implementation of Fukui et al. [2016] (https://github.com/akirafukui/vqa-mcb).
For the test-dev split, our version of MCB gets 61.48% overall accuracy (yes/no: 82.48%, number: 37.06%, and other: 49.07%) vs. 65.08% (ours, MLB in Table 4.1). Additionally, if the non-linearity in getting attention distributions is increased as the original MCB does using ReLU, we get 62.11% overall accuracy (yes/no: 82.55%, number: 37.18%, and other: 50.30%), which is still below our performance (our version of the MCB definition can be found at https://github.com/jnhwkim/MulLowBiVQA/blob/master/netdef/MCB.lua).
We do not see this as decisive evidence of the better performance of MLB, but as a reference (the comparison of test-dev results may also be unfair), since an optimal architecture and hyperparameters may be required for each method.
4.9 Conclusions
We suggest a low-rank bilinear pooling method to replace compact bilinear pooling, which has a fan-out structure and needs complex computations. Low-rank bilinear pooling has a flexible structure using linear mapping and Hadamard product, and a better parsimonious property, compared with compact bilinear pooling. We achieve new state-of-the-art results on the VQA dataset using a similar architecture to Fukui et al. [2016], replacing compact bilinear pooling with low-rank bilinear pooling. We believe our method could be applicable to other bilinear learning tasks.
Table 4.4 The VQA 1.0 test-standard results to compare with state-of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data augmentation.

                                                Open-Ended                    MC
MODEL                                           ALL    Y/N    NUM    ETC    ALL
iBOWIMG [Zhou et al., 2015]                     55.89  76.76  34.98  42.62  61.97
DPPnet [Noh et al., 2016]                       57.36  80.28  36.92  42.24  62.69
DeeperLSTM+NormalizedCNN [Antol et al., 2015]   58.16  80.56  36.53  43.73  63.09
SMem [Xu and Saenko, 2016]                      58.24  80.80  37.53  43.48  -
Ask Your Neurons [Malinowski et al., 2016]      58.43  78.24  36.27  46.32  -
SAN [Yang et al., 2016]                         58.85  79.11  36.41  46.42  -
D-NMN [Andreas et al., 2016]                    59.44  80.98  37.48  45.81  -
ACK [Wu et al., 2016b]                          59.44  81.07  37.12  45.83  -
FDA [Ilievski et al., 2016]                     59.54  81.34  35.67  46.10  64.18
HYBRID [Kafle and Kanan, 2016b]                 60.06  80.34  37.82  47.56  -
DMN+ [Xiong et al., 2016]                       60.36  80.43  36.82  48.33  -
MRN [Kim et al., 2016b]                         61.84  82.39  38.23  49.41  66.33
HieCoAtt [Lu et al., 2016]                      62.06  79.95  38.22  51.95  66.07
RAU [Noh and Han, 2016]                         63.2   81.7   38.2   52.8   67.3
MLB (ours)                                      65.07  84.02  37.90  54.77  68.89
Table 4.5 The VQA 1.0 test-standard results for ensemble models to comparewith state-of-the-art. For unpublished entries, their team names are used insteadof their model names. Some of their figures are updated after the challenge.
                             Open-Ended                    MC
MODEL                        ALL    Y/N    NUM    ETC    ALL
RAU [Noh and Han, 2016]      64.12  83.33  38.02  53.37  67.34
MRN [Kim et al., 2016b]      63.18  83.16  39.14  51.33  67.54
DLAIT (not published)        64.83  83.23  40.80  54.32  68.30
Naver Labs (not published)   64.79  83.31  38.70  54.79  69.26
MCB [Fukui et al., 2016]     66.47  83.24  39.47  58.00  70.10
MLB (ours)                   66.89  84.61  39.07  57.79  70.29
Human [Antol et al., 2015]   83.30  95.77  83.39  72.67  91.54
Table 4.6 The individual models used in our ensemble model in Table 4.5.
                   Open-Ended
MODEL     GLIMPSE  ALL    Y/N    NUM    ETC
MLB       2        64.89  84.13  37.85  54.57
MLB       2        65.08  84.14  38.21  54.87
MLB       4        65.01  84.09  37.66  54.88
MLB-VG    2        65.76  83.64  37.57  56.86
MLB-VG    2        65.84  83.87  37.87  56.76
MLB-VG    3        66.05  83.88  38.13  57.13
MLB-VG    4        66.09  83.59  38.32  57.42
Ensemble  -        66.77  84.54  39.21  57.81
Chapter 5
Bilinear Attention Networks
5.1 Introduction
Machine learning for computer vision and natural language processing accelerates
the advancement of artificial intelligence. Since vision and natural language
are the major modalities of human interaction, understanding and reasoning of
vision and natural language information become a key challenge. For instance,
visual question answering involves a vision-language cross-grounding problem. A
machine is expected to answer given questions like "who is wearing glasses?", "is
the umbrella upside down?", or "how many children are in the bed?" exploiting
visually-grounded information.
For this reason, visual attention based models have succeeded in multimodal
learning tasks, identifying selective regions in a spatial map of an image defined
by the model. Also, textual attention can be considered along with visual
attention. The attention mechanism of co-attention networks [Lu et al., 2016;
Nam et al., 2016; Xu and Saenko, 2016; Yu et al., 2018] concurrently infers
visual and textual attention distributions for each modality. The co-attention
networks selectively attend to question words in addition to a part of image
regions. However, the co-attention neglects the interaction between words and
visual regions to avoid increasing computational complexity.
In this paper, we extend the idea of co-attention into bilinear attention
which considers every pair of multimodal channels, e.g., the pairs of question
words and image regions. If the given question involves multiple visual concepts
represented by multiple words, the inference using visual attention distributions
for each word can exploit relevant information better than that using single
compressed attention distribution.
From this background, we propose bilinear attention networks (BAN) to use
a bilinear attention distribution, on top of low-rank bilinear pooling [Kim et al.,
2017b]. Notice that the BAN exploits bilinear interactions between two groups of
input channels, while low-rank bilinear pooling extracts the joint representations
for each pair of channels. Furthermore, we propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps of the BAN, unlike the previous works [Fukui et al., 2016; Kim et al., 2017b] where multiple attention maps are used by concatenating the attended features. The proposed residual learning method for BAN exploits residual summations instead of concatenation, which allows learning up to an eight-glimpse BAN in a parameter-efficient and performance-effective way. For an overview of the two-glimpse BAN, please refer to Figure 5.1.
Our main contributions are:
• We propose the bilinear attention networks (BAN) to learn and use bilinear
attention distributions, on top of low-rank bilinear pooling technique.
• We propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps generated by our model. Unlike previous works, our method successfully utilizes up to 8 attention maps.

• Finally, we validate our proposed method on a large and highly-competitive dataset, VQA 2.0 [Goyal et al., 2016]. Our model achieves a new state-of-the-art while maintaining the simplicity of the model structure. Moreover, we evaluate the visual grounding of the bilinear attention maps on Flickr30k Entities [Plummer et al., 2017], outperforming previous methods, along with a 25.37% improvement of inference speed taking advantage of the processing of multi-channel inputs.

Figure 5.1 Overview of a two-layer BAN. Two multi-channel inputs, ϕ object detection features and ρ-length GRU hidden vectors, are used to get bilinear attention maps and joint representations to be used by a classifier. For the definition of the BAN, see the text in Section 5.3.
5.2 Low-rank Bilinear Pooling
We first review the low-rank bilinear pooling and its application to attention
networks [Kim et al., 2017b], which uses single-channel input (question vector)
to combine the other multi-channel input (image features) as single-channel
intermediate representation (attended feature).
Low-rank bilinear model. The previous works [Pirsiavash et al., 2009;
Wolf et al., 2007] proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix W_i to give regularity. For this, W_i is replaced with the multiplication of two smaller matrices, U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}. As a result, this replacement restricts the rank of W_i to be at most d ≤ min(N, M).
For the scalar output fi (bias terms are omitted without loss of generality):
f_i = x^T W_i y ≈ x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)  (5.1)
where 1 ∈ R^d is a vector of ones and ∘ denotes Hadamard product (element-wise multiplication).
Low-rank bilinear pooling. For a vector output f , a pooling matrix P is
introduced:
f = P^T (U^T x ∘ V^T y)  (5.2)
where P ∈ Rd×c, U ∈ RN×d, and V ∈ RM×d. It allows U and V to be two-
dimensional tensors by introducing P for a vector output f ∈ Rc, significantly
reducing the number of parameters.
Unitary attention networks. Attention provides an efficient mechanism
to reduce input channels by selectively utilizing given information. Assuming a multi-channel input Y consisting of ϕ = |{y_i}| column vectors, we want to get a single channel y from Y using the weights α_i:

y = Σ_i α_i y_i  (5.3)
where α represents an attention distribution to selectively combine ϕ input
channels. Using the low-rank bilinear pooling, the α is defined by the output of
softmax function as:

α := softmax(P^T ((U^T x · 1^T) ∘ (V^T Y)))  (5.4)
where α ∈ RG×ϕ, P ∈ Rd×G, U ∈ RN×d, x ∈ RN , 1 ∈ Rϕ, V ∈ RM×d, and
Y ∈ RM×ϕ. If G > 1, multiple glimpses (a.k.a. attention heads) are used [Fukui
et al., 2016; Jaderberg et al., 2015; Kim et al., 2017b], then y =fGg=1
∑i αg,iyi,
the concatenation of attended outputs. Finally, two single channel inputs x and
y can be used to get the joint representation using the other low-rank bilinear
pooling for a classifier.
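For reference, a minimal sketch of Equations 5.3 and 5.4 (assumed sizes; the loop over glimpses is written for clarity rather than speed):

    import torch
    import torch.nn.functional as F

    N, M, d, G, phi = 1024, 2048, 512, 2, 36          # assumed sizes
    U, V, P = torch.randn(N, d), torch.randn(M, d), torch.randn(d, G)

    x = torch.randn(N)         # single-channel question vector
    Y = torch.randn(M, phi)    # phi image-feature columns

    # logits of Eq. 5.4: P^T ((U^T x · 1^T) ∘ (V^T Y)), shape (G, phi)
    logits = P.t() @ ((U.t() @ x).unsqueeze(1) * (V.t() @ Y))
    alpha = F.softmax(logits, dim=1)                      # attention per glimpse
    y_hat = torch.cat([Y @ alpha[g] for g in range(G)])   # Eq. 5.3, concatenated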
5.3 Bilinear Attention Networks
We generalize the bilinear model for two multi-channel inputs, X ∈ R^{N×ρ} and Y ∈ R^{M×ϕ}, where ρ = |{x_i}| and ϕ = |{y_j}| are the numbers of the two input channels, respectively. To reduce both input channels simultaneously, we introduce a bilinear attention map A ∈ R^{ρ×ϕ} as follows:

    f'_k = (X^T U')_k^T A (Y^T V')_k        (5.5)

where U' ∈ R^{N×K}, V' ∈ R^{M×K}, (X^T U')_k ∈ R^ρ, (Y^T V')_k ∈ R^ϕ, and f'_k denotes the k-th element of the intermediate representation. The subscript k for the matrices indicates the index of a column. Notice that Equation 5.5 is a bilinear model for the two groups of input channels where A in the middle is a bilinear weight matrix. Interestingly, Equation 5.5 can be rewritten as:
    f'_k = ∑_{i=1}^{ρ} ∑_{j=1}^{ϕ} A_{i,j} (X_i^T U'_k)(V'^T_k Y_j) = ∑_{i=1}^{ρ} ∑_{j=1}^{ϕ} A_{i,j} X_i^T (U'_k V'^T_k) Y_j        (5.6)

where X_i and Y_j denote the i-th channel (column) of input X and the j-th channel (column) of input Y, respectively, U'_k and V'_k denote the k-th columns of the U' and V' matrices, respectively, and A_{i,j} denotes the element in the i-th row and the j-th column of A. Notice that, for each pair of channels, the rank-1 bilinear representation of two feature vectors is modeled in X_i^T (U'_k V'^T_k) Y_j of Equation 5.6 (eventually, at most rank-K bilinear pooling for f' ∈ R^K). Then, the bilinear joint representation is f = P^T f', where f ∈ R^C and P ∈ R^{K×C}. For convenience, we define the bilinear attention networks as a function of two multi-channel inputs parameterized by a bilinear attention map as follows:

    f = BAN(X, Y; A).        (5.7)
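Equation 5.5 computes one bilinear form per k; all K elements can be obtained at once with a single einsum, as in the following sketch (sizes and the random map A are assumptions for illustration):

    import torch

    N, M, K, rho, phi = 1024, 2048, 1024, 14, 36    # assumed sizes
    U_p, V_p = torch.randn(N, K), torch.randn(M, K)
    X, Y = torch.randn(N, rho), torch.randn(M, phi)
    A = torch.rand(rho, phi)                        # a given attention map

    Xp = X.t() @ U_p    # (rho, K); the k-th column over rows is (X^T U')_k
    Yp = Y.t() @ V_p    # (phi, K)
    # f'_k = (X^T U')_k^T A (Y^T V')_k for every k at once (Eq. 5.5)
    f_prime = torch.einsum('ik,ij,jk->k', Xp, A, Yp)    # shape (K,)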
Bilinear attention map. Now, we want to get the attention map similarly to Equation 5.4. Using Hadamard product and matrix-matrix multiplication, the attention map A is defined as:

    A := softmax(((1 · p^T) ∘ X^T U) V^T Y)        (5.8)

where 1 ∈ R^ρ, p ∈ R^{K'}, and remind that A ∈ R^{ρ×ϕ}. The softmax function is applied element-wise. Notice that each logit A_{i,j} of the softmax is the output of low-rank bilinear pooling as:

    A_{i,j} = p^T ((U^T X_i) ∘ (V^T Y_j)).        (5.9)
The multiple bilinear attention maps can be extended as follows:

    A_g := softmax(((1 · p_g^T) ∘ X^T U) V^T Y)        (5.10)

where the parameters of U and V are shared, but not p_g, where g denotes the index of glimpses.
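A sketch of Equation 5.10 follows. Here we normalize over all ρ×ϕ logits of each map, which is one consistent reading of the element-wise softmax above (an assumption of this sketch, not a statement about the released implementation):

    import torch
    import torch.nn.functional as F

    N, M, Kp, rho, phi, G = 1024, 2048, 3072, 14, 36, 8   # assumed sizes
    U, V = torch.randn(N, Kp), torch.randn(M, Kp)         # shared across glimpses
    p = torch.randn(G, Kp)                                # one p_g per glimpse
    X, Y = torch.randn(N, rho), torch.randn(M, phi)

    XU = X.t() @ U                        # (rho, Kp)
    VY = V.t() @ Y                        # (Kp, phi)
    # logits of Eq. 5.10: ((1 · p_g^T) ∘ X^T U) V^T Y for each glimpse g
    logits = torch.stack([(p[g] * XU) @ VY for g in range(G)])   # (G, rho, phi)
    A = F.softmax(logits.view(G, -1), dim=1).view(G, rho, phi)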
Residual learning of attention. Inspired by multimodal residual networks (MRN) from Kim et al. [2016b], we propose a variant of MRN to integrate the joint representations from the multiple bilinear attention maps. The (i+1)-th output is defined as:

    f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + f_i        (5.11)

where f_0 = X (if N = K) and 1 ∈ R^ρ. Here, the size of f_i is the same as the size of X, as successive attention maps are processed. To get the logits for a classifier, e.g., a two-layer MLP, we sum over the channel dimension of the last output f_G, where G is the number of glimpses.
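The residual stacking of Equation 5.11 can be sketched as the following loop (random parameters and given maps A_g stand in for learned ones; N = K is assumed so that f_0 = X):

    import torch

    def ban_layer(f, Y, A, U_p, V_p, P):
        # BAN(f, Y; A) of Eqs. 5.5-5.7, returning a C-dimensional vector
        joint = torch.einsum('ik,ij,jk->k', f.t() @ U_p, A, Y.t() @ V_p)
        return P.t() @ joint

    K, M, rho, phi, G = 1024, 2048, 14, 36, 4   # assumed sizes; C = K = N
    Y = torch.randn(M, phi)
    f = torch.randn(K, rho)                     # f_0 = X
    for g in range(G):
        U_p, V_p, P = torch.randn(K, K), torch.randn(M, K), torch.randn(K, K)
        A = torch.rand(rho, phi)                # A_g from Eq. 5.10, given here
        f = ban_layer(f, Y, A, U_p, V_p, P).unsqueeze(1) + f   # BAN · 1^T + f
    logits_input = f.sum(dim=1)                 # sum over channels for the MLP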
Time complexity. When we assume that the numbers of input channels are smaller than the feature sizes, M ≥ N ≥ K ≫ ϕ ≥ ρ, the time complexity of BAN is the same as the case of one multi-channel input, O(KMϕ), for a single-glimpse model, since BAN consists of matrix chain multiplication and exploits the property of low-rank factorization in low-rank bilinear pooling. In our experimental setting, the time spent per epoch by a one-glimpse BAN is approximately 284 seconds, versus 190 seconds for a unitary attention network. The increased time is largely due to the increased size of the softmax input for the attention distribution, from ϕ to ρ×ϕ. However, the validation score is significantly improved (+0.75%). The experimental results are shown in Table 5.1.
5.4 Related Works
Multimodal factorized bilinear pooling. Yu et al. [2018] extend low-rank bilinear pooling [Kim et al., 2017b] using rank > 1. They remove the projection matrix P; instead, d in Equation 5.2 is replaced with a much smaller k while U and V are three-dimensional tensors. For efficient computation, two matrices U_i ∈ R^{N×(k×d)} and V_i ∈ R^{M×(k×d)} are used, followed by sum pooling defined as f = SumPool(U^T x ∘ V^T y, k), where the function SumPool(x, k) denotes sum pooling over x using a one-dimensional window of size k and stride k (non-overlapping). However, this generalization was not effective for BAN, at least in our experimental setting. Please see BAN-1+MFB in Figure 5.2b, where the performance is not significantly improved over that of BAN-1. Furthermore, the peak GPU memory consumption is larger due to its model structure, which hinders the use of multiple-glimpse BAN.
Co-attention networks. Xu and Saenko [2016] proposed the spatial memory network model estimating the correlation among every image patch and token in a sentence. The estimated correlation C is defined as (UX)^T Y in our notation. Unlike our method, they get an attention distribution α = softmax(max_{i=1,...,ρ}(C_i)) ∈ R^ρ, where the logits of the softmax are the maximum values in each row vector of C. The attention distribution for the other input can be calculated similarly. There are variants of co-attention networks [Lu et al., 2016; Nam et al., 2016]; in particular, Lu et al. [2016] sequentially get two attention distributions conditioned on the other modality. Recently, Yu et al. [2018] reduced the co-attention method to two steps, self-attention for a question embedding and question-conditioned attention for a visual embedding. However, these co-attention approaches use separate attention distributions for each modality, neglecting the interaction between the modalities which we consider and model.
5.5 Experiments
5.5.1 Datasets
Visual Question Answering (VQA). We evaluate on the visual question answering (VQA 2.0) dataset [Agrawal et al., 2017; Goyal et al., 2016], which is improved from the previous version to emphasize visual understanding by reducing the answer bias in the dataset. This improvement pushes the model to have a more effective joint representation of question and image, which fits the motivation of our bilinear attention approach.

The 205k images of VQA 2.0 are from the MS COCO dataset [Lin et al., 2014]. The numbers of questions are 444k, 214k, and 448k for training, validation, and testing, respectively. As for the annotations, there are roughly five questions per image and exactly ten answers per question. The VQA evaluation metric considers inter-human variability, defined as:

    Accuracy(ans) = min(#humans that said ans / 3, 1)        (5.12)

The test set is split into test-dev, test-standard, test-challenge, and test-reserve. The annotations for the test sets are unavailable except through the remote evaluation server. In the off-challenge season, the evaluation is limited to test-dev and test-standard; the numbers of submissions per day are limited to ten and one, respectively, and the total number of submissions to test-standard is limited to five. For this reason, test-dev is used for debugging and validation, and test-standard is used to compare with the state of the art.
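A direct implementation of the metric in Equation 5.12 (the example answers are hypothetical):

    def vqa_accuracy(answer, human_answers):
        """Eq. 5.12: min(#humans that said ans / 3, 1)."""
        return min(sum(a == answer for a in human_answers) / 3.0, 1.0)

    # e.g., four of ten annotators said "brown"
    print(vqa_accuracy("brown", ["brown"] * 4 + ["tan"] * 6))   # 1.0
    print(vqa_accuracy("red", ["brown"] * 4 + ["tan"] * 6))     # 0.0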
Flickr30k Entities. For the evaluation of visual grounding by the bilinear attention maps, we use Flickr30k Entities [Plummer et al., 2017], consisting of 31,783 images [Young et al., 2014] and 244,035 annotations, in which multiple entities (phrases) in a sentence for an image are mapped to boxes on the image to indicate the correspondences between them. The task is to localize a corresponding box for each entity. In this way, visual grounding of textual information is quantitatively measured. Following the evaluation metric [Plummer et al., 2017], if a predicted box has an intersection over union (IoU) with one of the ground-truth boxes that is greater than or equal to 0.5, the prediction for a given entity is correct. This metric is called Recall@1. If K predictions are permitted to find at least one correct box, it is called Recall@K. We report Recall@1, 5, and 10 to compare with the state of the art (R@K in Table 5.3). The upper bound of performance depends on the performance of object detection, since the detector proposes the candidate boxes for prediction. We also report the upper bounds to compare the performances of various object detectors.
5.5.2 Preprocessing
To control factors other than the model structure, we follow the preprocessing procedure of Teney et al. [2017].
Question embedding. For VQA, we get a question embedding X^T ∈ R^{14×N} using GloVe word embeddings [Pennington et al., 2014] and the outputs of a Gated Recurrent Unit (GRU) [Cho et al., 2014] for every time step, up to the first 14 tokens, following the previous work of Teney et al. [2017]. Questions shorter than 14 words are end-padded with zero vectors. This process discards only 0.25% of questions and significantly increases learning efficiency in mini-batch learning. A word is represented by a 300-dimensional embedding vector, which is learned during training but initialized with the pre-trained GloVe word embeddings. The resulting 14 × 300 embeddings are fed into the GRU, whose hidden state size is N, one of the hyperparameters. We use all of the output states X^T ∈ R^{14×N} as a 14-channel question input. Notice that although the length of questions is variable, BAN successfully learns through the bilinear attention mechanism. For Flickr30k Entities, we use the full length of sentences, up to 82 tokens, to consider every entity in a sentence. We mark the token positions which are at the end of each annotated phrase. Then, we select a subset of the output channels of the GRU using these positions, which makes the number of channels equal to the number of entities in a sentence.
Image features. We use the image features extracted from bottom-up attention [Anderson et al., 2017]. These features are the output of Faster R-CNN [Ren et al., 2017], pre-trained using Visual Genome [Krishna et al., 2016]. We set a threshold for object detection to get ϕ = 10 to 100 objects per image. The features are represented as Y^T ∈ R^{ϕ×2,048}, which are fixed during training. To deal with variable-channel inputs, we mask the padding logits with minus infinity to get zero probability from the softmax while avoiding underflow.
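The masking can be done as in this short sketch (the ϕ values are assumptions for illustration):

    import torch
    import torch.nn.functional as F

    phi_max, phi = 100, 37                  # assumed: 37 detected, padded to 100
    logits = torch.randn(phi_max)           # attention logits incl. padding
    logits[phi:] = float('-inf')            # mask the padded channels
    probs = F.softmax(logits, dim=0)        # padded objects get zero probability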
5.5.3 Nonlinearity
We use ReLU [Nair and Hinton, 2010] to give nonlinearity to BAN as follows:
    f'_k = σ(X^T U')_k^T · A · σ(Y^T V')_k        (5.13)

where σ denotes ReLU(x) := max(x, 0). For the attention maps,

    A_g := softmax(((1 · p_g^T) ∘ σ(X^T U)) · σ(V^T Y)).        (5.14)

Note that X^T U and V^T Y correspond to the question embedding matrix Q and the detected object features V in Figure 5.1, respectively.
5.6 Variants of BAN
5.6.1 Enhancing Glove Word Embedding
We augment a computed 300-dimensional word embedding to each 300-dimensional GloVe word embedding. The computation is as follows: 1) we choose two arbitrary words w_i and w_j found in each question of the VQA and Visual Genome datasets or in each caption of the MS COCO dataset; 2) we increase the value of A_{i,j} by one, where A ∈ R^{V'×V'} is an association matrix initialized with zeros. Notice that i and j can be indices out of the vocabulary V, and the size of the vocabulary in this computation is denoted by V'; 3) to penalize highly frequent words, each row of A is divided by the number of sentences (questions or captions) which contain the corresponding word; 4) each row is normalized by the sum of all of its elements; 5) we calculate W' = A · W, where W ∈ R^{V'×E} is a GloVe word embedding matrix and E is the size of the word embedding, i.e., 300. Therefore, W' ∈ R^{V'×E} stands for the mixed word embeddings of semantically close words; 6) finally, we select the V rows from W' corresponding to the vocabulary in our model and concatenate these rows to the previous word embeddings, which makes 600-dimensional word embeddings in total. The input size of the GRU is increased to 600 to match these word embeddings. These word embeddings are fine-tuned.
As a result, this variant significantly improves the performance to 66.03 (±0.12), compared with the performance of 65.72 (±0.11) obtained by augmenting the same 300-dimensional GloVe word embeddings (so the number of parameters is controlled). In this experiment, we use a four-glimpse BAN and evaluate on the validation split. The standard deviation is calculated over three randomly initialized models and the means are reported. The result on the test-dev split can be found in Table 5.5 as BAN+Glove.
5.6.2 Integrating Counting Module
The counting module [Zhang et al., 2018] is proposed to improve the performance on counting-related tasks. This module is a neural network component that gets a dense representation from the spatial information of detected objects, i.e., the left-top and right-bottom positions of the ϕ proposed objects (rectangles) denoted by S ∈ R^{4×ϕ}. The interface of the counting module is defined as:

    c = Counter(s, α)        (5.15)

where c ∈ R^{ϕ+1}, and α ∈ R^ϕ is the logits of the corresponding objects for the sigmoid function inside the counting module. We found that the α defined by max_{j=1,...,ϕ}(A_{·,j}), i.e., the maximum values in each column vector of A in Equation 5.9, was better than that of summation. Since the counting module does not support variable-object inputs, we select the top-10 objects for the input, instead of ϕ objects, based on the values of α.

The BAN integrated with the counting module is defined as:

    f_{i+1} = (BAN_i(f_i, Y; A_i) + g_i(c_i)) · 1^T + f_i        (5.16)

where the function g_i(·) is the i-th linear embedding followed by a ReLU activation function, and c_i = Counter(s, max_{j=1,...,ϕ}(A^{(i)}_{·,j})), where A^{(i)} is the logit of A_i. Note that a dropout layer before this linear embedding severely hurts performance, so we did not use it.
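One step of Equation 5.16 might be sketched as below, treating the counting module of Zhang et al. [2018] as a given callable `counter` (its exact interface here is an assumption, as are `ban_layer` and `g_linear`):

    import torch

    def ban_counter_step(f, Y, A_logits, A, S, ban_layer, counter, g_linear):
        # alpha_j = max over each column of the attention logits A^{(i)}
        alpha = A_logits.max(dim=0).values     # shape (phi,)
        top = alpha.topk(10).indices           # fixed 10 objects for the counter
        c = counter(S[:, top], alpha[top])     # c in R^{10+1}
        joint = ban_layer(f, Y, A) + torch.relu(g_linear(c))
        return joint.unsqueeze(1) + f          # (BAN_i + g_i(c_i)) · 1^T + f_i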
As a result, this variant significantly improves the counting performance from 54.92 (±0.30) to 58.21 (±0.49), while the overall performance is improved from 65.81 (±0.09) to 66.01 (±0.14) in a controlled experiment using a vanilla four-glimpse BAN. The definition of the subset of counting questions comes from the previous work [Trott et al., 2018]. The result on the test-dev split can be found in Table 5.5 as BAN+Glove+Counter; note that this model also uses the previous word embedding variant.
5.6.3 Integrating Multimodal Factorized Bilinear (MFB) Pooling
Yu et al. [2018] extend low-rank bilinear pooling [Kim et al., 2017b] with rank k > 1 and two factorized three-dimensional matrices, which is called MFB. The implementation of MFB is effectively equivalent to low-rank bilinear pooling with rank d' = d × k followed by sum pooling with window size k and stride k, defined by SumPool(U^T x ∘ V^T y, k). Notice that the pooling matrix P in Equation 5.2 is not used. The variant of BAN inspired by MFB is defined as:

    z_{k'} = σ(X^T U)_{k'}^T · A · σ(Y^T V)_{k'}        (5.17)
    f' = SumPool(z, k)        (5.18)

where U ∈ R^{N×K'}, V ∈ R^{M×K'}, σ denotes the ReLU activation function, and k = 5 following Yu et al. [2018]. Notice that K' = K × k and k' is the index of the elements of z ∈ R^{K'} in our notation.
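SumPool with window and stride k reduces to a reshape and a sum, as in this sketch:

    import torch

    def sum_pool(z, k):
        """Non-overlapping sum pooling over z with window size k and stride k."""
        return z.view(-1, k).sum(dim=1)

    z = torch.randn(20)          # e.g., K' = K x k with K = 4, k = 5
    f_prime = sum_pool(z, 5)     # shape (4,)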
However, this generalization was not effective for BAN. In Figure 5.2b, the performance of BAN-1+MFB is not significantly different from that of BAN-1. Furthermore, the larger K' increases the peak consumption of GPU memory, which hinders the use of multiple glimpses for BAN.
5.6.4 Classifier
For VQA, we use a two-layer multi-layer perceptron as a classifier for the final joint representation f_G. The activation function is ReLU. The number of outputs is determined by the answers that occur at least nine times in unique questions in the dataset, which gives 3,129 outputs. To directly reflect the VQA evaluation metric, the label is encoded as a vector in R^{3,129} whose elements correspond to the scores. Then, binary cross entropy is used for the loss function, which means the network output logits are fed into a sigmoid function instead of a softmax function. This loss function is also used in Teney et al. [2017].
For Flickr30k Entities, we simply take the output of the bilinear attention map, and binary cross entropy is used for this output. A prediction is scored with the Recall@K metric described in Section 5.5.1, based on the IoU between the boxes extracted by the pre-trained Faster R-CNN and the ground-truth boxes; accordingly, the upper bound of performance depends on the object detector, and we report these upper bounds as well.
5.6.5 Hyperparameters and Regularization
Hyperparameters. The sizes of the image features and question embeddings are M = 2,048 and N = 1,024, respectively. The size of the joint representation, C, is the same as the rank K in low-rank bilinear pooling, C = K = 1,024, but K' = K × 3 is used in the bilinear attention maps to increase the representational capacity for residual learning of attention. Every linear mapping is regularized by Weight Normalization [Salimans and Kingma, 2016] and Dropout [Srivastava et al., 2014] (p = .2, except for the classifier with .5). The Adamax optimizer [Kingma and Ba, 2015], a variant of Adam based on the infinity norm, is used. The learning rate is min(i·10^{-3}, 4·10^{-3}), where i is the number of epochs starting from 1; after 10 epochs, the learning rate is decayed by 1/4 every 2 epochs up to 13 epochs (i.e., 10^{-3} for the 11th epoch and 2.5·10^{-4} for the 13th). We clip the 2-norm of the vectorized gradients to .25. The batch size is 512.
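The learning-rate schedule just described can be written as a small function (a sketch that reproduces the stated values):

    def learning_rate(epoch):
        """Epochs counted from 1: warm up to 4e-3, then decay by 1/4 every
        2 epochs after epoch 10 (1e-3 at epoch 11, 2.5e-4 at epoch 13)."""
        if epoch <= 10:
            return min(epoch * 1e-3, 4e-3)
        return 4e-3 * 0.25 ** ((epoch - 9) // 2)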
Regularization. For the test split of VQA, both the train and validation splits are used for training. We augment with a subset of the Visual Genome [Krishna et al., 2016] dataset following the procedure of the previous works [Teney et al., 2017]. We filter out the samples of the Visual Genome dataset if an image or an answer is not found in the target split. As a result, 492k (67.64%) of the questions in the Visual Genome dataset are used, which is around the size of the training split of VQA 2.0. Accordingly, we adjust the model capacity by increasing all of N, C, and K to 1,280, and G = 8 glimpses are used. For Flickr30k Entities, we use exactly the same test split as the previous methods [Plummer et al., 2017], without additional hyperparameter tuning from the VQA experiments.
5.7 VQA Results and Discussions
5.7.1 Quantitative Results
Comparison with the state of the art. The first row in Table 5.1 shows the 2017 VQA Challenge winner architecture [Anderson et al., 2017; Teney et al., 2017]. BAN significantly outperforms this baseline and successfully utilizes up to eight bilinear attention maps to improve its performance, taking advantage of residual learning of attention. As shown in Table 5.5, BAN outperforms the latest model [Yu et al., 2018], which uses the same bottom-up attention features [Anderson et al., 2017], by a substantial margin. BAN-Glove uses the concatenation of 300-dimensional GloVe word embeddings and the semantically-close mixture of these embeddings (see Section 5.6.1). Notice that similar approaches can be found in the competitive models [Fukui et al., 2016; Yu et al., 2018] in Table 5.5, with a different initialization strategy for the same 600-dimensional word embedding. BAN-Glove-Counter uses both the previous 600-dimensional word embeddings and the counting module [Zhang et al., 2018], which exploits the spatial information of detected object boxes from the feature extractor [Anderson et al., 2017]. The learned representation c ∈ R^{ϕ+1} for the counting mechanism is linearly projected and added to the joint representation after applying ReLU (see Equation 5.16 in Section 5.6.2). In Table 5.6, we compare with the entries in the leaderboards of both VQA Challenge 2017 and 2018, achieving 1st place at the time of submission (our entry is not shown in the leaderboard since challenge entries are not visible).
Comparison with other attention methods. Unitary attention has a similar architecture to Kim et al. [2017b], where a question embedding vector is used to calculate the attentional weights for the multiple image features of an image. Co-attention has the same mechanism as Yu et al. [2018], similar to Lu et al. [2016] and Xu and Saenko [2016], where multiple question embeddings are combined into a single embedding vector using a self-attention mechanism, and then unitary visual attention is applied. Table 5.2 confirms that bilinear attention is significantly better than any other attention method. Co-attention is slightly better than simple unitary attention. In Figure 5.2a, co-attention suffers from overfitting more severely (green) than the other methods, while bilinear attention (blue) is more regularized compared with the others. In Figure 5.2b, BAN is the most parameter-efficient among the various attention methods. Notice that the four-glimpse BAN utilizes its parameters more parsimoniously than the one-glimpse BAN does.
5.7.2 Residual Learning of Attention
Comparison with other approaches. In the second section of Table 5.2, the residual learning of attention significantly outperforms the other methods: sum, i.e., f_G = ∑_i BAN_i(X, Y; A_i), and concatenation (concat), i.e., f_G = ∥_i BAN_i(X, Y; A_i), whereas the difference between sum and concat is not significant. Notice that the number of parameters of concat is larger than the others, since the input size of the classifier is increased.

Table 5.1 Validation scores on the VQA 2.0 dataset for the number of glimpses of BAN. The standard deviations are reported after ±, using three random initializations.

    Model                            VQA Score
    Bottom-Up [Teney et al., 2017]   63.37 ±0.21
    BAN-1                            65.36 ±0.14
    BAN-2                            65.61 ±0.10
    BAN-4                            65.81 ±0.09
    BAN-8                            66.00 ±0.11
    BAN-12                           66.04 ±0.08
Ablation study. An interesting property of residual learning is robustness toward arbitrary ablations [Veit et al., 2016]. To see the relative contributions, we observe the validation scores when incremental ablation is performed. First, we train 1, 2, 4, 8, and 12-glimpse models using the training split. Then, we evaluate each model on the validation split using only the first N attention maps. Hence, the intermediate representation f_N is directly fed into the classifier instead of f_G. As shown in Figure 5.2c, the accuracy gain of the first glimpse is the highest, and the gain smoothly decreases as the number of used glimpses increases.
Entropy of attention. We analyze the information entropy of the attention distributions in a four-glimpse BAN. As shown in Figure 5.2d, the mean entropy of each attention map on the validation split converges to a different level. This result is repeatably observed for the other numbers of glimpses. Our speculation is that the multiple attention maps do not contribute equally, as in voting by committees, but rather perform residual learning via multi-step attention. We argue that this is a novel observation where residual learning [He et al., 2016a] is used for stacked attention networks.

Table 5.2 Validation scores on the VQA 2.0 dataset for attention and integration mechanisms. nParams indicates the number of parameters. Note that the hidden sizes of unitary attention and co-attention are 1,280, while BAN uses 1,024.

    Model                 nParams   VQA Score
    Unitary attention     31.9M     64.59 ±0.04
    Co-attention          32.5M     64.79 ±0.06
    Bilinear attention    32.2M     65.36 ±0.14
    BAN-4 (residual)      44.8M     65.81 ±0.09
    BAN-4 (sum)           44.8M     64.78 ±0.08
    BAN-4 (concat)        51.1M     64.71 ±0.21
5.7.3 Qualitative Analysis
The visualization for a two-glimpse BAN is shown in Figure 5.3. The question is "what color are the pants of the guy skateboarding". The content words what, pants, guy, and skateboarding in the question, and the skateboarder's pants in the image, are attended. Notice that box 2 (orange) captures the sitting man's pants at the bottom.
5.8 Flickr30k Entities Results and Discussions
To examine the capability of the bilinear attention map to capture vision-language interactions, we conduct experiments on Flickr30k Entities [Plummer et al., 2017]. Our experiments show that BAN outperforms the previous state of the art on the phrase localization task by a large margin of 4.48%, at a high inference speed.
Figure 5.2 (a) Learning curves. Bilinear attention (bi-att) is more robust to overfitting than unitary attention (uni-att) and co-attention (co-att). (b) Validation scores for the number of parameters. The error bar indicates the standard deviation among three randomly initialized models, although it is too small to be noticed for over-15M parameters. (c) Ablation study for the first N glimpses (x-axis) used in evaluation. (d) The information entropy (y-axis) for each attention map in a four-glimpse BAN. The entropy of the multiple attention maps converges to certain levels.
Performance. In Table 5.3, we compare with previous approaches. Our bilinear attention map, used to predict the boxes for the phrase entities in a sentence, achieves a new state of the art with 69.69% Recall@1. This result is remarkable considering that BAN does not use any additional features like box size, color, segmentation, or pose estimation [Plummer et al., 2017; Yeh et al., 2017]. Note that both Query-Adaptive RCNN [Hinami and Satoh, 2017] and our off-the-shelf object detector [Anderson et al., 2017] are based on Faster RCNN [Ren et al., 2017] and pre-trained on Visual Genome [Krishna et al., 2016]. Compared to Query-Adaptive RCNN, the parameters of our object detector are fixed and only used to extract 10-100 visual features and the corresponding box proposals.
Type. In Table 5.4, we report the results for each type of Flickr30k Entities. Notice that clothing and body parts are significantly improved, to 74.95% and 47.23%, respectively.
Speed. Faster inference is achieved by taking advantage of the multi-channel inputs in BAN. Unlike previous methods, BAN is able to infer multiple entities in a sentence, which can be prepared as a multi-channel input. Therefore, the number of forward passes needed for inference is significantly decreased. In our experiment, BAN takes 0.67 ms/entity, whereas the setting that treats a single entity as one example takes 0.84 ms/entity, a 25.37% improvement. We emphasize that this property is novel in our model, which considers every interaction among vision-language multi-channel inputs.
Visualization. Figure 5.4 shows three examples from the test split of Flickr30k Entities. The entities which have distinctive visual properties, for instance, a yellow tennis suit and white tennis shoes in Figure 5.4a, and a denim shirt in Figure 5.4b, are correctly localized. However, a relatively small object (e.g., a cigarette in Figure 5.4b) and an entity that requires semantic inference (e.g., a male conductor in Figure 5.4c) are incorrect.
5.9 Conclusions
In this chapter, BAN gracefully extends unitary attention networks by exploiting low-rank bilinear pooling inside bilinear attention. Although this network considers every pair of multimodal input channels, the computational cost remains in the same magnitude, thanks to matrix chain multiplication. The proposed residual learning of attention efficiently uses up to eight bilinear attention maps, keeping the size of the intermediate features constant. We believe BAN gives a new opportunity to learn richer joint representations when all multimodal inputs consist of multiple channels. With a simple ensemble of BANs, we were runners-up in the 2018 VQA Challenge, while BAN was the winner among the single-model entries.
Figure 5.3 Visualization of the bilinear attention maps for a two-glimpse BAN. The left and right groups indicate the first and second bilinear attention maps (right in each group, log-scaled) and the visualized image (left in each group). The most salient six boxes (numbered 1-6 in the images and on the x-axis of the grids) in the first attention map, determined by marginalization, are visualized on both images to compare. The model gives the correct answer, brown.
(a) A girl in a yellow tennis suit, green visor and white tennis shoes holding a tennis racket in a position where she is going to hit the tennis ball.
(b) A man in a denim shirt and pants is smoking a cigarette while playing a cello for money.
(c) A male conductor wearing all black leading an orchestra and choir on a brown stage playing and singing a musical number.
Figure 5.4 Visualization examples from the test split of Flickr30k Entities. Solid-lined boxes indicate predicted phrase localizations and dashed-lined boxes indicate the ground truth. If there are multiple ground-truth boxes, the closest box is shown for investigation. Each color of a phrase is matched with the corresponding color of the predicted and ground-truth boxes. Best viewed in color.
Table 5.3 Test split results for Flickr30k Entities. We report the average performance of our three randomly-initialized models (the standard deviation of R@1 is 0.17). The upper bound of performance asserted by the object detector is shown. † box size and color information are used as additional features. ‡ semantic segmentation, object detection, and pose estimation are used as additional features. Notice that the detectors of Hinami and Satoh [2017] and ours [Anderson et al., 2017] are based on Faster RCNN [Ren et al., 2017], pre-trained using the Visual Genome dataset [Krishna et al., 2016].

    Model                     Detector                                        R@1    R@5    R@10   Upper Bound
    Zhang et al. [2016a]      MCG [Arbeláez et al., 2014]                     28.5   52.7   61.3   -
    Hu et al. [2016]          EdgeBoxes [Zitnick and Dollár, 2014]            27.8   -      62.9   76.9
    Rohrbach et al. [2016]    Fast RCNN [Girshick, 2015]                      42.43  -      -      77.90
    Wang et al. [2016b]       Fast RCNN [Girshick, 2015]                      42.08  -      -      76.91
    Wang et al. [2016a]       Fast RCNN [Girshick, 2015]                      43.89  64.46  68.66  76.91
    Rohrbach et al. [2016]    Fast RCNN [Girshick, 2015]                      48.38  -      -      77.90
    Fukui et al. [2016]       Fast RCNN [Girshick, 2015]                      48.69  -      -      -
    Plummer et al. [2017]     Fast RCNN [Girshick, 2015]†                     50.89  71.09  75.73  85.12
    Yeh et al. [2017]         YOLOv2 [Redmon and Farhadi, 2017]‡              53.97  -      -      -
    Hinami and Satoh [2017]   Query-Adaptive RCNN [Hinami and Satoh, 2017]    65.21  -      -      -
    BAN (ours)                Bottom-Up [Anderson et al., 2017]               69.69  84.22  86.35  87.45
Table 5.4 Recall@1 performance over types for Flickr30k Entities (%).

    Model                     People  Clothing  Body Parts  Animals  Vehicles  Instruments  Scene  Other
    Rohrbach et al. [2016]    60.24   39.16     14.34       64.48    67.50     38.27        59.17  30.56
    Plummer et al. [2017]     64.73   46.88     17.21       65.83    68.75     37.65        51.39  31.77
    Yeh et al. [2017]         68.71   46.83     19.50       70.07    73.75     39.50        60.38  32.45
    Hinami and Satoh [2017]   78.17   61.99     35.25       74.41    76.16     56.69        68.07  47.42
    BAN (ours)                79.90   74.95     47.23       81.85    76.92     43.00        68.69  51.33
    # of Instances            5,656   2,306     523         518      400       162          1,619  3,374
Table 5.5 Test-dev scores of single models on the VQA 2.0 dataset to compare with the state of the art. The first section of rows is trained on the training and validation splits. The rest of the rows are trained on the training and validation splits, plus Visual Genome for data augmentation. † This model can be found at https://github.com/yuzcccc/vqa-mfb, which is not published in the paper. They use the object-detection-based image features from Anderson et al. [2017], instead of 152-layer ResNet image features [He et al., 2016a].

    Model                                                    Overall  Yes/no  Number  Other  Test-std
    Bottom-Up [Anderson et al., 2017; Teney et al., 2017]    65.32    81.82   44.21   56.05  65.67
    MFH [Yu et al., 2018]                                    66.12    -       -       -      -
    Counter [Zhang et al., 2018]                             68.09    83.14   51.62   58.97  68.41
    MFH+Bottom-Up [Yu et al., 2018]†                         68.76    84.27   49.56   59.89  -
    BAN (ours)                                               69.52    85.31   50.93   60.26  -
    BAN+Glove (ours)                                         69.66    85.46   50.66   60.50  -
    BAN+Glove+Counter (ours)                                 70.04    85.42   54.04   60.52  70.35
Table 5.6 Test-standard scores of ensemble models on the VQA 2.0 dataset to compare with the state of the art. Excerpt from the VQA 2.0 Leaderboard at the time of writing. # denotes the number of models in their ensemble methods.

    Team Name                                                        #    Overall  Yes/no  Number  Other
    vqateam_mcb_benchmark [Fukui et al., 2016; Goyal et al., 2016]   1    62.27    78.82   38.28   53.36
    vqa_hack3r                                                       -    62.89    79.88   38.95   53.58
    VQA Machine [Wang et al., 2016c]                                 -    62.97    79.82   40.91   53.35
    NWPU_VQA                                                         -    63.00    80.38   40.32   53.07
    yahia zakaria                                                    -    63.57    79.77   40.53   54.75
    ReasonNet_                                                       -    64.61    78.86   41.98   57.39
    JuneflowerIvaNlpr                                                -    65.70    81.09   41.56   57.83
    UPMC-LIP6 [Ben-younes et al., 2017]                              -    65.71    82.07   41.06   57.12
    Athena                                                           -    66.67    82.88   43.17   57.95
    Adelaide-Teney                                                   -    66.73    83.71   43.77   57.20
    LV_NUS [Ilievski and Feng, 2017]                                 -    66.77    81.89   46.29   58.30
    vqahhi_drau                                                      -    66.85    83.35   44.37   57.63
    CFM-UESTC                                                        -    67.02    83.69   45.17   57.52
    VLC Southampton [Zhang et al., 2018]                             1    68.41    83.56   51.39   59.11
    Tohoku CV                                                        -    68.91    85.54   49.00   58.99
    VQA-E                                                            -    69.44    85.74   48.18   60.12
    Adelaide-Teney ACRV MSR [Teney et al., 2017]                     30   70.34    86.60   48.64   61.15
    DeepSearch                                                       -    70.40    86.21   48.82   61.58
    HDU-USYD-UNCC [Yu et al., 2018]                                  8    70.92    86.65   51.13   61.75
    BAN+Glove+Counter (ours)                                         1    70.35    85.82   53.71   60.69
    BAN Ensemble (ours)                                              8    71.72    87.02   54.41   62.37
    BAN Ensemble (ours)                                              15   71.84    87.22   54.37   62.45
Chapter 6
Conclusions
Vision and language processing is studied for artificial general intelligence, mainly focusing on how to efficiently learn the joint representation of multimodality. Multimodal residual learning provides an efficient way to combine multimodal inputs with the idea of residual learning, which comes from a computer vision study. The low-rank bilinear pooling interpretation comes from an approximation of bilinear feature learning, where element-wise multiplication is used as multimodal fusion in deep neural networks. This interpretation enables us to suggest multimodal low-rank bilinear attention networks, which became a foundational work for following models like Yu et al. [2018]. Bilinear attention networks gracefully extend unitary attention networks, exploiting low-rank bilinear pooling inside bilinear attention, where residual learning of attention provides an efficient way to learn and utilize multiple bilinear attention maps. Although, in this work, the learned joint representation is fed into a classifier to generate an answer for visual question answering tasks, we believe the joint representation can be used for other tasks, such as visual dialog, image indexing and retrieval with the aid of multimodal information, enhanced speech recognition with visual information, and so on. Moreover, since computer vision and natural language processing are still evolving areas of study, multimodal deep learning can exploit their advanced discoveries in the future.
Bibliography
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C Lawrence
Zitnick, Devi Parikh, and Dhruv Batra. Vqa: Visual question answering.
International Journal of Computer Vision, 123(1):4–31, 2017.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image
Captioning and Visual Question Answering. arXiv preprint arXiv:1707.07998,
2017.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learn-
ing to Compose Neural Networks for Question Answering. arXiv preprint
arXiv:1601.01705, 2016.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In
IEEE International Conference on Computer Vision, 2015.
Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and
Jitendra Malik. Multiscale combinatorial grouping. In IEEE conference on
computer vision and pattern recognition, pages 328–335, 2014.
Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MU-
TAN: Multimodal Tucker Fusion for Visual Question Answering. In IEEE
International Conference on Computer Vision, pages 2612–2620, 2017.
Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O'Reilly Media, Inc., 2009.
C E Carr and M Konishi. A circuit for detection of interaural time differences in
the brain stem of the barn owl. The Journal of Neuroscience, 10(10):3227–46,
1990.
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items
in data streams. In International Colloquium on Automata, Languages, and
Programming, pages 693–703. Springer, 2002.
Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and
Yixin Chen. Compressing Neural Networks with the Hashing Trick. In 32nd
International Conference on Machine Learning, pages 2285–2294, 2015.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Ben-
gio. On the Properties of Neural Machine Translation: Encoder-Decoder
Approaches. arXiv preprint arXiv:1409.1259, 2014.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José
M. F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In IEEE
Conference on Computer Vision and Pattern Recognition, 2017a.
Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra.
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning.
arXiv preprint arXiv:1703.06585, 2017b.
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle,
and Aaron Courville. GuessWhat?! Visual object discovery through multi-
modal dialogue. arXiv preprint arXiv:1611.08481, 2016.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and
Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question
Answering and Visual Grounding. In Conference on Empirical Methods in
Natural Language Processing, 2016.
Yarin Gal. A Theoretically Grounded Application of Dropout in Recurrent
Neural Networks. arXiv preprint arXiv:1512.05287, 2015.
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image
Question Answering. In Advances in neural information processing systems
28, pages 2296–2304, 2015.
Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact Bilinear
Pooling. In IEEE Conference on Computer Vision and Pattern Recognition,
2016.
Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual
Turing test for computer vision systems. Proceedings of the National Academy
of Sciences, 112(12):3618–3623, 2015.
Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer
Vision, pages 1440–1448, 2015.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.
Making the V in VQA Matter: Elevating the Role of Image Understanding in
Visual Question Answering. arXiv preprint arXiv:1612.00837, 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual
Learning for Image Recognition. In IEEE Conference on Computer Vision
and Pattern Recognition, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings
in Deep Residual Networks. arXiv preprint arXiv:1603.05027, 2016b.
K Hikosaka, E Iwai, H Saito, and K Tanaka. Polysensory properties of neurons
in the anterior bank of the caudal superior temporal sulcus of the macaque
monkey. Journal of neurophysiology, 60(5):1615–1637, 1988.
Ryota Hinami and Shin’ichi Satoh. Query-Adaptive R-CNN for Open-Vocabulary
Object Detection and Retrieval. arXiv preprint arXiv:1711.09509, 2017.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Rus-
lan R Salakhutdinov. Improving neural networks by preventing co-adaptation
of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural
computation, 9(8):1735–1780, 1997.
N. P. Holmes and C. Spence. Multisensory Integration: Space, Time and Superadditivity. Current Biology, 15(18):R762–764, 2005.
Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and
Trevor Darrell. Natural language object retrieval. In IEEE Computer Vision
and Pattern Recognition, pages 4555–4564, 2016.
Ilija Ilievski and Jiashi Feng. A Simple Loss Function for Improving the
Convergence and Accuracy of Visual Question Answering Models. 2017.
Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. A Focused Dynamic Attention
Model for Visual Question Answering. arXiv preprint arXiv:1604.01485, 2016.
Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu.
Spatial Transformer Networks. In Advances in Neural Information Processing
Systems 28, pages 2008–2016, 2015.
Kushal Kafle and Christopher Kanan. Visual Question Answering: Datasets,
Algorithms, and Future Challenges. arXiv preprint arXiv:1610.01465, 2016a.
Kushal Kafle and Christopher Kanan. Answer-Type Prediction for Visual
Question Answering. IEEE Conference on Computer Vision and Pattern
Recognition, pages 4976–4984, 2016b.
Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generat-
ing Image Descriptions. In 28th IEEE Conference on Computer Vision and
Pattern Recognition, 2015.
Jin-Hwa Kim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. TrimZero:
A Torch Recurrent Module for Efficient Natural Language Processing. In
Proceedings of KIIS Spring Conference, volume 26, pages 165–166, 2016a.
ISBN 2093-4025.
Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim,
Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal Residual Learning for
Visual QA. In Advances In Neural Information Processing Systems 29, pages
361–369, 2016b.
Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha,
and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling.
In 5th International Conference on Learning Representations, 2017a.
Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha,
and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling.
In The 5th International Conference on Learning Representations, 2017b.
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear Attention Net-
works. arXiv preprint arXiv:1805.07932, 2018.
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimiza-
tion. In International Conference on Learning Representations, 2015.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio
Torralba, Raquel Urtasun, and Sanja Fidler. Skip-Thought Vectors. In
Advances in Neural Information Processing Systems 28, pages 3294–3302,
2015.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma,
Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language
and vision using crowdsourced dense image annotations. arXiv preprint
arXiv:1602.07332, 2016.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma,
Michael S. Bernstein, and Fei-Fei Li. Visual Genome: Connecting Language
and Vision Using Crowdsourced Dense Image Annotations. International
Journal of Computer Vision, 123(1):32–73, 2017.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521
(7553):436–444, 2015. ISSN 0028-0836.
Nicholas Léonard, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. rnn: Recurrent Library for Torch. arXiv preprint arXiv:1511.07889, 2015.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common
objects in context. In European Conference on Computer Vision (ECCV),
pages 740–755, 2014.
Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN Models
for Fine-grained Visual Recognition. In IEEE International Conference on
Computer Vision, pages 1449–1457, 2015.
Jiasen Lu, Xiao Lin, Dhruv Batra, and Devi Parikh. Deeper LSTM and nor-
malized CNN Visual Question Answering model. https://github.com/
VT-vision-lab/VQA_LSTM_CNN, 2015.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical
Question-Image Co-Attention for Visual Question Answering. arXiv preprint
arXiv:1606.00061, 2016.
Mateusz Malinowski and Mario Fritz. A Multi-World Approach to Question
Answering about Real-World Scenes based on Uncertain Input. In Advances
in Neural Information Processing Systems 27, pages 1682–1690, 2014.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask Your Neurons: A
Neural-based Approach to Answering Questions about Images. arXiv preprint
arXiv:1505.01121, 2015.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask Your Neurons:
A Deep Learning Approach to Visual Question Answering. arXiv preprint
arXiv:1605.02697, 2016.
Iain Matthews, Timothy F. Cootes, J. Andrew Bangham, Stephen Cox, and
Richard Harvey. Extraction of visual features for lipreading. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002. doi:
10.1109/34.982900.
Roland Memisevic and Geoffrey E Hinton. Unsupervised learning of image
transformations. In IEEE Conference on Computer Vision and Pattern
Recognition, 2007.
Roland Memisevic and Geoffrey E Hinton. Learning to represent spatial transfor-
mations with factored higher-order Boltzmann machines. Neural computation,
22(6):1473–1492, 2010.
Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao,
Georgios P. Spithourakis, and Lucy Vanderwende. Image-Grounded Conver-
sations: Multimodal Context for Natural Question and Response Generation.
arXiv preprint arXiv:1701.08251, 2017.
Vinod Nair and Geoffrey E Hinton. Rectified Linear Units Improve Restricted
Boltzmann Machines. Proceedings of the 27th International Conference on
Machine Learning, 2010.
Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual Attention Networks
for Multimodal Reasoning and Matching. In IEEE Conference on Computer
Vision and Pattern Recognition, 2016.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and
Andrew Y Ng. Multimodal Deep Learning. In 28th International Conference
on Machine Learning, pages 689–696, 2011.
Hyeonwoo Noh and Bohyung Han. Training Recurrent Answering Units with
Joint Loss Minimization for VQA. arXiv preprint arXiv:1606.03647, 2016.
Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image Question Answer-
ing using Convolutional Neural Network with Dynamic Parameter Prediction.
In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
E K Patterson, S Gurbuz, Z Tufekci, and J N Gowdy. CUAVE: A new audio-
visual database for multimodal human-computer interface research. In IEEE
International Conference on Acoustics, Speech, and Signal Processing, vol-
ume 2, pages 2017–2020, 2002. doi: 10.1109/ICASSP.2002.5745028.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global
Vectors for Word Representation. Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing, 2014.
Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit
feature maps. In 19th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 239–247. ACM, 2013.
Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Bilinear classifiers
for visual recognition. In Advances in Neural Information Processing Systems
22, pages 1482–1490, 2009.
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia
Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting Region-to-
Phrase Correspondences for Richer Image-to-Sentence Models. International
Journal of Computer Vision, 123:74–93, 2017.
Hang Qi, Tianfu Wu, Mun-Wai Lee, and Song-Chun Zhu. A Restricted Vi-
sual Turing Test for Deep Scene and Event Understanding. arXiv preprint
arXiv:1512.01715, 2015.
Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. In IEEE
Computer Vision and Pattern Recognition, 2017.
Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring Models and Data for
Image Question Answering. In Advances in Neural Information Processing
Systems 28, pages 2935–2943, 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN:
Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 39(6), 2017.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský,
and Phil Blunsom. Reasoning about Entailment with Neural Attention. In
International Conference on Learning Representations, pages 1–9, 2016.
Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt
Schiele. Grounding of textual phrases in images by reconstruction. In European
Conference on Computer Vision, pages 817–834, 2016.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning
representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
Tim Salimans and Diederik P. Kingma. Weight Normalization: A Simple
Reparameterization to Accelerate Training of Deep Neural Networks. arXiv
preprint arXiv:1602.07868, 2016.
Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks
for Large-Scale Image Recognition. In International Conference on Learning
Representations, 2015.
Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal Learning with
Deep Boltzmann Machines. In F Pereira, C J C Burges, L Bottou, and K Q
Weinberger, editors, Advances in Neural Information Processing Systems 25,
pages 2222–2230, 2012.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and
Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded
dialogue systems. arXiv preprint arXiv:1703.05423, 2017.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-
To-End Memory Networks. In Advances in Neural Information Processing
Systems 28, pages 2440–2448, 2015.
Joshua B Tenenbaum and William T Freeman. Separating style and content
with bilinear models. Neural computation, 12(6):1247–1283, 2000.
Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips
and Tricks for Visual Question Answering: Learnings from the 2017 Challenge.
arXiv preprint arXiv:1708.02711, 2017.
Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient
by a running average of its recent magnitude. COURSERA: Neural Networks
for Machine Learning, 4, 2012.
Julia Trommershauser, Konrad Kording, and Michael S Landy. Sensory cue
integration. Oxford University Press, 2011.
Alexander Trott, Caiming Xiong, and Richard Socher. Interpretable Counting
for Visual Question Answering. In International Conference on Learning
Representations, 2018.
Alan Turing. Computing Machinery and Intelligence. Mind, 59:433–460, 1950.
Andreas Veit, Michael J Wilber, and Serge Belongie. Residual Networks are
Exponential Ensembles of Relatively Shallow Networks. In Advances in Neural
Information Processing Systems 29, pages 550–558, 2016.
Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning Deep Structure-Preserving
Image-Text Embeddings. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 5005–5013, 2016a.
Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, and Jia
Deng. Structured Matching for Phrase Localization. In European Conference
on Computer Vision, volume 9908, pages 696–711, 2016b.
Sida I. Wang, Percy Liang, and Christopher D. Manning. Learning Language
Games through Interaction. In 54th Annual Meeting of the Association for
Computational Linguistics, pages 2368–2378, 2016c.
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh
Attenberg. Feature hashing for large scale multitask learning. In 26th Inter-
national Conference on Machine Learning, pages 1113–1120, 2009.
Lior Wolf, Hueihan Jhuang, and Tamir Hazan. Modeling appearances with low-
rank SVM. IEEE Conference on Computer Vision and Pattern Recognition,
2007.
Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton
van den Hengel. Visual Question Answering: A Survey of Methods and
Datasets. arXiv preprint arXiv:1607.05910, 2016a.
Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel.
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge
from External Sources. In IEEE Conference on Computer Vision and Pattern
Recognition, 2016b.
Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhut-
dinov. On Multiplicative Integration with Recurrent Neural Networks. arXiv
preprint arXiv:1606.06630, 2016c.
Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic Memory Net-
works for Visual and Textual Question Answering. In 33rd International
Conference on Machine Learning, 2016.
Huijuan Xu and Kate Saenko. Ask, Attend and Answer: Exploring Question-
Guided Spatial Attention for Visual Question Answering. In European Con-
ference on Computer Vision, 2016.
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked
Attention Networks for Image Question Answering. In IEEE Conference on
Computer Vision and Pattern Recognition, 2016.
Raymond A Yeh, Jinjun Xiong, Wen-Mei W Hwu, Minh N Do, and Alexander G
Schwing. Interpretable and Globally Optimal Prediction for Textual Ground-
ing using Image Concepts. In Advances in Neural Information Processing
Systems 30, 2017.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image de-
scriptions to visual denotations: New similarity metrics for semantic inference
over event descriptions. Transactions of the Association for Computational
Linguistics, 2:67–78, 2014.
Licheng Yu, Eunbyung Park, Alexander C Berg, and Tamara L. Berg. Visual Madlibs: Fill in the blank Description Generation and Question Answering. In IEEE International Conference on Computer Vision, pages 2461–2469, 2015.
Yanchao Yu, Arash Eshghi, and Oliver Lemon. Training an adaptive dialogue
policy for interactive learning of visually grounded word meanings. In 17th
Annual Meeting of the Special Interest Group on Discourse and Dialogue, page
339, 2016.
Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond
Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual
Question Answering. IEEE Transactions on Neural Networks and Learning
Systems, 2018.
Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff.
Top-Down Neural Attention by Excitation Backprop. In European Conference
on Computer Vision, volume 9908, pages 543–559, 2016a.
Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.
Yin and Yang: Balancing and Answering Binary Visual Questions. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 5014–5022,
2016b.
Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to Count
Objects in Natural Images for Visual Question Answering. In International
Conference on Learning Representations, 2018.
Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob
Fergus. Simple Baseline for Visual Question Answering. arXiv preprint
arXiv:1512.02167, 2015.
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded
Question Answering in Images. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 4995–5004, 2016.
C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals
from edges. In European Conference on Computer Vision, pages 391–405,
2014.
초록 (Abstract in Korean)

Advances in computer vision and natural language processing have accelerated research on artificial general intelligence. Since vision and natural language are the most interactive modalities that humans use, understanding and reasoning grounded in both vision and language is a key challenge for artificial general intelligence. Visual question answering (VQA) is an instance of the Visual Turing Test, which builds on the seminal Turing test [Turing, 1950]. The VQA dataset [Agrawal et al., 2017] collected question-answer pairs for supervised learning using a large-scale image dataset. For example, for questions such as "Who is wearing glasses?", "Is the umbrella upside down?", or "How many children are in the bed?", a machine learns from the collected answers and must then produce an answer given only the image and the question.

This dissertation generalizes the visual question answering task as a multimodal learning problem, and examines the progress of multimodal learning from the perspective of multimodal deep learning, which learns hierarchical representations using various forms of multi-layer neural networks. Multimodal deep learning is introduced under three categories: multimodal fusion, cross modality, and shared representation learning. Based on the previous studies Kim et al. [2016b, 2017a, 2018], three major works are discussed: multimodal residual learning, multimodal low-rank bilinear pooling, and bilinear attention networks.

Multimodal residual learning finds a joint representation of the vision-language multimodality based on residual learning, in which a part of the neural network is forced to learn the residual error of the objective function represented by the preceding part of the network. Multimodal low-rank bilinear pooling, on the other hand, explains the mathematical meaning of element-wise multiplication as a joint function under the condition that each modality is appropriately linearly projected. Bilinear attention networks unify the two previous studies. Based on the interpretation of low-rank bilinear pooling, the unitary attention mechanism is successfully generalized to bilinear attention using matrix chain multiplication, so that the computational cost remains as efficient as that of unitary attention networks. Furthermore, residual learning of attention is proposed, which allows up to eight bilinear attention maps to be exploited during inference and prevents the overfitting that arises in multi-layer attention networks.

As a result, multimodal residual networks (MRN) placed 4th in the VQA Challenge 2016, and at the time of publication in November 2016, the proposed multimodal low-rank bilinear attention networks (MLB) achieved a new state of the art with fewer parameters. Bilinear attention networks (BAN) were the runners-up (shared 2nd place) in the VQA Challenge 2018, but showed the best performance among single models. This result was presented as an invited oral talk at a CVPR 2018 workshop (Salt Lake City, USA) on June 18, 2018.

Since computer vision and natural language processing are still evolving fields, the proposed multimodal deep learning methods have the potential to improve further along with advances in computer vision and natural language processing.

Keywords: multimodality, attention, visual question answering, deep learning, residual learning, low-rank approximation, bilinear
Student Number: 2015-30046