Attribution-NonCommercial-NoDerivs 2.0 Korea

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow the conditions below:

- When reusing or distributing the work, you must make clear the license terms applied to it.
- These conditions can be waived if you obtain permission from the copyright holder.
- Your rights under fair use are in no way affected by the above.

This is an easy-to-understand summary of the license (Legal Code).

Disclaimer

Attribution. You must attribute the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
Ph.D. Dissertation of Jin-Hwa Kim
Multimodal Deep Learning for Visually-Grounded Reasoning
시각 기반 추론을 위한 다중 양태의 깊은 학습
August 2018
College of Humanities
Seoul National University
Interdisciplinary Program in Cognitive Science
Jin-Hwa Kim
Multimodal Deep Learning for Visually-Grounded Reasoning
시각 기반 추론을 위한 다중 양태의 깊은 학습
Advisor: Professor Byoung-Tak Zhang

Submitting a Ph.D. Dissertation of Engineering

May 2018

The Graduate School
Seoul National University
Interdisciplinary Program in Cognitive Science

Jin-Hwa Kim

Confirming the Ph.D. Dissertation written by Jin-Hwa Kim

June 2018

Chair: Hong-Gee Kim
Vice Chair: Byoung-Tak Zhang
Member: Bohyung Han
Member: Gunhee Kim
Member: Jung-Woo Ha
Abstract
Advances in computer vision and natural language processing accelerate the studies of artificial general intelligence. Since vision and natural language are among the major and most interactive modalities of humans, understanding and reasoning grounded on both vision and language have become a key challenge for artificial general intelligence. Visual question answering (VQA) is an instance of the Visual Turing Test (VTT), which is aligned with this direction on top of the prestigious seminal work, the Turing test [Turing, 1950]. In the VQA dataset [Agrawal et al., 2017], question-answer pairs are collected over large image datasets for supervised learning. For instance, a machine answers questions about a given image, such as "Who is wearing glasses?", "Is the umbrella upside down?", or "How many children are in the bed?".
In this dissertation, taking the view that the visual question answering task generalizes to multimodal learning, we study advances in multimodal learning within deep learning, where hierarchical representations are learned with various forms of multiple layers in neural networks; this is called multimodal deep learning. First, multimodal deep learning is introduced with three categories: multimodal fusion, cross modality learning, and shared representation learning. After that, based on our previous works [Kim et al., 2016b, 2017a, 2018], three major studies are discussed: multimodal residual learning, multimodal low-rank bilinear pooling, and bilinear attention networks.
Multimodal residual learning finds the joint representation of vision-language multimodality based on the idea of residual learning, which imposes the constraint that a part of the neural network must learn the residual errors of a fitting function represented by the preceding part of the network. Multimodal low-rank bilinear pooling, in turn, gives a mathematical ground for the use of element-wise multiplication (a.k.a. the Hadamard product) as a joint function, since it can be interpreted as low-rank bilinear pooling under the condition that each modality is linearly transformed with appropriate model parameters. Bilinear attention networks unify the previous two works. Using the interpretation of low-rank bilinear pooling, they successfully generalize the unitary attention mechanism into bilinear attention via matrix chain multiplication. This is so efficient that the computational cost is the same as that of the counterpart, unitary attention networks. Moreover, residual learning of attention is proposed to exploit up to eight bilinear attention maps in reasoning processes, which prevents the over-fitting that usually comes with multi-layer attention networks.
As a result, Multimodal Residual Networks (MRN) achieved 4th place in the VQA Challenge 2016, and Multimodal Low-rank Bilinear Attention Networks (MLB) achieved a new state of the art with significantly fewer parameters in November 2016, at the time of publication. Moreover, Bilinear Attention Networks (BAN) took 2nd place (shared with another team) in the VQA Challenge 2018, achieving the best single model among the entries, and were presented in an invited talk at the CVPR 2018 workshop (Salt Lake City, USA) on June 18. Since both vision and natural language processing are still evolving areas of study, multimodal deep learning can take advantage of future progress in computer vision and natural language processing studies.
Keywords: Multimodal, attention, visual question answering, deep learning,
residual learning, low-rank approximation, bilinear
Student Number: 2015-30046
Contents
Abstract i
Chapter 1 Introduction 1
Chapter 2 Multimodal Deep Learning 6
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Multimodal Deep Learning . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Multimodal Fusion . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Cross Modality Learning . . . . . . . . . . . . . . . . . . . 11
2.3.3 Shared Representation Learning . . . . . . . . . . . . . . 13
2.4 Cognitive Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 3 Multimodal Residual Learning 16
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Deep Residual Learning . . . . . . . . . . . . . . . . . . . 18
3.2.2 Stacked Attention Networks . . . . . . . . . . . . . . . . . 19
3.3 Multimodal Residual Networks . . . . . . . . . . . . . . . . . . . 20
3.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Multimodal Residual Networks . . . . . . . . . . . . . . . 21
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 Visual QA Dataset . . . . . . . . . . . . . . . . . . . . . . 22
3.4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.3 Exploring Alternative Models . . . . . . . . . . . . . . . . 26
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5.1 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . 27
3.5.2 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 29
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Chapter 4 Multimodal Low-rank Bilinear Pooling 37
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Low-rank Bilinear Model . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Low-rank Bilinear Pooling . . . . . . . . . . . . . . . . . . . . . . 40
4.3.1 Full Model . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Nonlinear Activation . . . . . . . . . . . . . . . . . . . . . 41
4.3.3 Shortcut Connection . . . . . . . . . . . . . . . . . . . . . 42
4.4 Multimodal Low-rank Bilinear Attention Networks . . . . . . . . 43
4.4.1 Low-rank Bilinear Pooling in Attention Mechanism . . . . 43
4.4.2 Multimodal Low-rank Bilinear Attention Networks . . . . 43
4.4.3 Model Schema . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.2 Vision Embedding . . . . . . . . . . . . . . . . . . . . . . 48
4.5.3 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6.1 Six Experiment Results . . . . . . . . . . . . . . . . . . . 50
4.6.2 Comparison with State-of-the-Art . . . . . . . . . . . . . 52
4.6.3 Ensemble of Seven Models . . . . . . . . . . . . . . . . . . 52
4.7 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7.1 Multimodal Residual Networks . . . . . . . . . . . . . . . 53
4.7.2 Higher-Order Boltzmann Machines . . . . . . . . . . . . . 53
4.7.3 Multiplicative Integration with Recurrent Neural Networks 54
4.7.4 Compact Bilinear Pooling . . . . . . . . . . . . . . . . . . 55
4.8 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8.1 Understanding of Multimodal Compact Bilinear Pooling . 56
4.8.2 Replacement of Low-rank Bilinear Pooling . . . . . . . . . 58
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chapter 5 Bilinear Attention Networks 62
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Low-rank Bilinear Pooling . . . . . . . . . . . . . . . . . . . . . . 64
5.3 Bilinear Attention Networks . . . . . . . . . . . . . . . . . . . . . 66
5.4 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5.3 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6 Variants of BAN . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.6.1 Enhancing Glove Word Embedding . . . . . . . . . . . . . 72
5.6.2 Integrating Counting Module . . . . . . . . . . . . . . . . 73
5.6.3 Integrating Multimodal Factorized Bilinear (MFB) Pooling 75
5.6.4 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6.5 Hyperparameters and Regularization . . . . . . . . . . . . 76
5.7 VQA Results and Discussions . . . . . . . . . . . . . . . . . . . . 77
5.7.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . 77
5.7.2 Residual Learning of Attention . . . . . . . . . . . . . . . 78
5.7.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . 80
5.8 Flickr30k Entities Results and Discussions . . . . . . . . . . . . . 80
5.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Chapter 6 Conclusions 89
Bibliography 91
초록 (Abstract in Korean) 106
List of Figures
Figure 2.1 Examples from VQA 2.0 [Agrawal et al., 2017; Goyal
et al., 2016], which depict the criterion that multimodal
information is necessary to solve the problem. For the
same question, answers may be different depending on
visual information, the provided image along with the
question. Reproduced with permission from Goyal et al.
[2016]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Figure 3.1 Inference flow of Multimodal Residual Networks (MRN).
Using our visualization method, the attention effects are
shown as a sequence of three images. More examples are
shown in Figure 3.4. . . . . . . . . . . . . . . . . . . . . 18
Figure 3.2 A schematic diagram of Multimodal Residual Networks
with three-block layers. . . . . . . . . . . . . . . . . . . . 18
Figure 3.3 Alternative models are explored to justify our proposed model. The base model (a) has a shortcut for a question vector as SAN does [Yang et al., 2016], and the joint residual function takes the form of the Deep Q+I model's joint function [Lu et al., 2015]. (b) adds an extra embedding for the visual modality. (c) adds extra embeddings for both modalities. (d) uses identity mappings for shortcuts; in the first learning block, a linear mapping matches the dimension with the joint dimension. (e) uses two shortcuts for both modalities; for simplicity, the linear mapping of the visual shortcut only appears in the first learning block. Notice that (d) and (e) are compared to (b) after the model selection of (b) among (a)-(c) on test-dev results. Eventually, we choose (b) for its best performance and relative simplicity. . . . 23
Figure 3.4 Examples for visualization of the three-block layered
MRN. The original images are shown in the first of each
group. The next three images show the input gradients of
the attention effect for each learning block as described
in Section 3.5.2. The gradients of color channels for each
pixel are summed up after taking absolute values of these
gradients. Then, these summed absolute values which
are greater than the summation of the mean and the
standard deviation of these values are visualized as the
attention effect (bright color) on the images. The answers
(blue) are predicted by MRN. . . . . . . . . . . . . . . . 29
Figure 3.5 More examples of Figure 3.4 in Section 3.5.2. . . . . 31
Figure 3.6 Comparative examples on the same image. (a1) and (a2) depict a giraffe (left) and a man pointing at the giraffe. MRN consistently highlights the giraffe in (a1). However, the other question "Can you see trees?" makes MRN less attentive to the giraffe, while a tree at the right of the background is more focused in (a2). Similarly, the attention effect of (b2) is more widely dispersed over the background than that of (b1) in the middle of the sequences, possibly to recognize the site. However, the subtlety of the comparative study is insufficient to objectively assess the results. . . . 32
Figure 3.7 Failure examples. Each question is followed by the model prediction (blue) and the answer (red). As mentioned in Section 3.5, MRN shows a weakness at counting in (d) and (k). Sometimes, the model finds objects regardless of the given question. In (j), even though the word cat does not appear in the question, the cat in the image is surely attended. (i) shows the limitation of the attentional mechanism, which needs an inference using world knowledge. . . . 36
Figure 4.1 A schematic diagram of MLB. The Replicate module copies a question embedding vector to match with the S2 visual feature vectors. Conv modules indicate 1×1 convolution to transform a given channel space, which is computationally equivalent to a linear projection over channels. . . . 44
Figure 5.1 Overview of a two-layer BAN. Two multi-channel inputs,
ϕ-object detection features and ρ-length GRU hidden
vectors, are used to get bilinear attention maps and
joint representations to be used by a classifier. For the
definition of the BAN, see the text in Section 5.3. . . . . 64
Figure 5.2 (a) Learning curves. Bilinear attention (bi-att) is more robust to overfitting than unitary attention (uni-att) and co-attention (co-att). (b) Validation scores versus the number of parameters. The error bar indicates the standard deviation among three randomly initialized models, although it is too small to be noticed for over-15M parameters. (c) Ablation study for the first N glimpses (x-axis) used in evaluation. (d) The information entropy (y-axis) for each attention map in four-glimpse BAN. The entropy of multiple attention maps converges to certain levels. . . . 81
Figure 5.3 Visualization of the bilinear attention maps for two-glimpse BAN. The left and right groups indicate the first and second bilinear attention maps (right in each group, log-scaled) and the visualized image (left in each group). The six most salient boxes (numbered 1-6 in the images and on the x-axis of the grids) in the first attention map, determined by marginalization, are visualized on both images for comparison. The model gives the correct answer, brown. . . . 83
Figure 5.4 Visualization examples from the test split of Flickr30k Entities. Solid-lined boxes indicate predicted phrase localizations and dashed-lined boxes indicate the ground truth. If there are multiple ground-truth boxes, the closest box is shown for inspection. Each color of a phrase is matched with the corresponding color of the predicted and ground-truth boxes. Best viewed in color. . . . 84
List of Tables
Table 2.1 Three categories of multimodal learning settings. A and
B indicate arbitrary modalities. A+B denotes multimodal
learning, and A|B denotes A or B is used, respectively. . . 7
Table 3.1 The results of alternative models (a)-(e) on the test-dev. . 24
Table 3.2 The effect of the visual features and # of target answers
on the test-dev results. Vgg for VGG-19, and Res for
ResNet-152 features described in Section 3.4. . . . . . . . 25
Table 3.3 The VQA test-standard results. The precision of some accuracies [Andreas et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match. . . . 33
Table 3.4 The effects of various options for VQA test-dev. Here, the model of Figure 3a is used, since these experiments were conducted preliminarily. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [Kiros et al., 2015] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [Gal, 2015], and c stands for the usage of post-processing using an image captioning model [Karpathy and Fei-Fei, 2015]. . . . 34
Table 3.5 The effects of shortcut connections of MRN for VQA test-
dev. ResNet-152 features and 2k target answers are used.
MN stands for Multimodal Networks without residual
learning, which does not have any shortcut connections.
Dim. stands for common embedding vector’s dimension.
The number of parameters for word embedding (9.3M)
and question embedding (21.8M) is subtracted from the
total number of parameters in this table. . . . . . . . . . . 34
Table 3.6 The results for VQA test-dev. The precision of some accuracies [Andreas et al., 2016; Xiong et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match. . . . 35
Table 4.1 The accuracies of our experimental models for VQA test-
dev split and Open-Ended task. For the MCB models, A:
attention model, G: Glove vector model, and V: Visual
Genome augmentation model. . . . . . . . . . . . . . . . . 45
Table 4.2 Hyperparameters used in MLB (single model in Table 4.4). 49
Table 4.3 The effect of joint embedding size d. . . . . . . . . . . . . 50
Table 4.4 The VQA 1.0 test-standard results to compare with state-
of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data
augmentation. . . . . . . . . . . . . . . . . . . . . . . . . . 60
Table 4.5 The VQA 1.0 test-standard results for ensemble models
to compare with state-of-the-art. For unpublished entries,
their team names are used instead of their model names.
Some of their figures are updated after the challenge. . . . 61
Table 4.6 The individual models used in our ensemble model in
Table 4.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Table 5.1 Validation scores on VQA 2.0 dataset for the number of
glimpses of the BAN. The standard deviations are reported
after ± using three random initializations. . . . . 79
Table 5.2 Validation scores on VQA 2.0 dataset for attention and
integration mechanisms. The nParams indicates the num-
ber of parameters. Note that the hidden sizes of unitary attention and co-attention are 1,280, while that of the BAN is 1,024. . . . . 80
Table 5.3 Test split results for Flickr30k Entities. We report the average performance of our three randomly-initialized models (the standard deviation of R@1 is 0.17). The upper bound of performance asserted by the object detector is shown. † box size and color information are used as additional features. ‡ semantic segmentation, object detection, and pose estimation are used as additional features. Notice that the detectors of Hinami and Satoh [2017] and ours [Anderson et al., 2017] are based on Faster R-CNN [Ren et al., 2017], pre-trained using the Visual Genome dataset [Krishna et al., 2016]. . . . 85
Table 5.4 Recall@1 performance over types for Flickr30k Entities (%) 86
Table 5.5 Test-dev scores of single models on the VQA 2.0 dataset to compare with the state of the art. The first section of rows is trained on the training and validation splits. The rest of the rows are trained on the training and validation splits, plus Visual Genome for data augmentation. † This model can be found at https://github.com/yuzcccc/vqa-mfb and is not published in the paper. They use the object detection-based image features from Anderson et al. [2017], instead of 152-layer ResNet image features [He et al., 2016a]. . . . 87
Table 5.6 Test-standard scores of ensemble models on the VQA 2.0 dataset to compare with the state of the art. Excerpted from the VQA 2.0 leaderboard at the time of writing. # denotes the number of models in their ensemble methods. . . . 88
Chapter 1
Introduction
The prospect of artificial general intelligence is cautiously raised by the advances in computer vision and natural language processing. Understanding and reasoning grounded on both vision and language are the core challenge for artificial general intelligence. To promote the progress of research, quantitative evaluations that can be easily and consistently performed are required.
Visual Turing Test (VTT) [Geman et al., 2015; Malinowski and Fritz, 2014; Qi et al., 2015] is aligned with this direction on top of one of the most prestigious seminal works, the Turing test [Turing, 1950]. The Turing test was proposed to evaluate the intelligence of a machine. In this test, if a human cannot distinguish the machine from another human by assessing only conversations, isolating any other factor, the machine is deemed intelligent. This proposal is appraised as a milestone for studying intangible machine intelligence in an experimental setting. VTT takes a step further. In VTT, visual understanding of given utterances is crucial to respond appropriately. The machine must recognize objects, identify attributes, and infer relationships between objects. Moreover, the machine may need to utilize a conversational context or background knowledge.
Visual question answering [Antol et al., 2015; Gao et al., 2015; Goyal et al.,
2016; Krishna et al., 2017; Malinowski and Fritz, 2014; Ren et al., 2015; Yu
et al., 2015; Zhang et al., 2016b; Zhu et al., 2016] is an instance of VTT. With
large image datasets, question-answer pairs are collected for supervised learning.
For instance, the VQA [Antol et al., 2015; Goyal et al., 2016] dataset collected
1.1M questions and 11.1M answers for 205K COCO images [Lin et al., 2014].
In the task, a machine answers a given question, such as "Who is wearing glasses?", "Is the umbrella upside down?", or "How many children are in the bed?".
A recent line of work in vision and language communities has generalized
visual question answering [Antol et al., 2015; Gao et al., 2015; Goyal et al., 2016;
Krishna et al., 2017; Malinowski and Fritz, 2014; Ren et al., 2015; Yu et al.,
2015; Zhang et al., 2016b; Zhu et al., 2016] to visually-grounded dialog [Das
et al., 2017a,b; de Vries et al., 2016; Mostafazadeh et al., 2017; Strub et al.,
2017; Yu et al., 2015, 2016], where an agent must understand an image and
answer a sequence of questions. This generalization to the dialog system brings to the forefront issues of visual grounding, context-awareness in agents, consistency in responses, etc.
Focusing on VQA, one simple approach is to formulate this problem as a classification task given two inputs, image and text, where the output is one of the candidates among the top-k frequent answers. An image is converted to a hidden representation by a pre-trained convolutional neural network (CNN) like ResNet [He et al., 2016a], whereas a text is converted to a hidden representation by a pre-trained recurrent neural network (RNN) such as an LSTM [Hochreiter and Schmidhuber, 1997] or GRU [Cho et al., 2014], since the text can be regarded as a sequence of tokens (words pre-processed by a tokenizer), as shown in Figure 3.1.
In our previous works [Kim et al., 2016b, 2017a, 2018], joint representation
learning of vision-language multimodality is studied for the visual question
answering task. Since both vision and natural language processing are still open, progressing problems, the joint representation learning can take advantage of future progress in computer vision and natural language processing studies.
Residual learning emphasizes the importance of identity (or linear) shortcuts
to have non-linear mappings efficiently learn only residuals [He et al., 2016a,b].
In visual question answering, as one of multimodal learning tasks, this idea
may not be readily applied. Since the modalities may have correlations among
them, we need to carefully define joint residual functions as the non-linear
mappings. Moreover, the arrangement of the shortcuts is undetermined due
to its multimodality. Therefore, the characteristics of a given task must be
considered to determine the model structure.
As a successive work, the joint residual function of MRN is interpreted as a low-rank bilinear pooling method. A bilinear model can be approximated using a low-rank factorization of the weight tensor in the model, which exploits linear embeddings before and after element-wise multiplication, also known as the Hadamard product. Since this interpretation only requires common tensor operations available in modern deep learning frameworks, it is easy to implement and improves performance efficiently compared with other competitive methods.
To propose an efficient attention mechanism, the low-rank bilinear pooling is
doubly used in both attention networks and multimodal pooling networks.
In the previous works, multimodal learning for visual question answering is explored in the context of multi-step residual learning and bilinear approximation pooling in the attention mechanism. However, we observed that attention networks make multimodal residual learning inefficient. There are two hypotheses for this. First, the attention networks in every residual learning block increase the model complexity, interfering with optimization or introducing over-fitting. Second, the softmax functions in attention networks cause gradient diminishing, since their gradients are less than one and appear multiple times in the network, as the sigmoid function does.
We try to solve this issue using residual learning of attention in the context of bilinear attention networks. Bilinear attention gives separate attention to every interaction between words and visual concepts, since these interactions have different meanings. It is natural to map a token "dog" to dogs in an image, and "cat" to cats in an image. However, a naive approach may introduce a huge computational cost. So, we propose an efficient method, with the same time complexity, on top of low-rank bilinear pooling. Then, residual learning of attention gives a useful way of integrating multiple attention maps instead of naively concatenating the joint representations.
In contrast to the previous work, all attention maps are acquired in a preliminary step; then, using those attention maps, residual learning is performed. This can be interpreted as a look-ahead to predict the sequence of attentions. We believe that it prevents a short-sighted attention strategy, and it is empirically validated against the other methods.
Multimodal learning of vision and natural language fuels the progress of artificial general intelligence. In this dissertation, based on visual question answering, an instance of VTT, we explore multimodal residual learning, which provides an efficient way to combine multimodal inputs; low-rank bilinear pooling, which arises from using element-wise multiplication in deep neural networks and can be exploited in attention networks; and bilinear attention networks, which gracefully extend unitary attention networks using low-rank bilinear pooling inside bilinear attention. Here, residual learning of attention efficiently uses multiple bilinear attention maps. We believe this three-fold research helps to bring advanced machine intelligence to everyday life in natural language through visually-grounded reasoning.
Chapter 2
Multimodal Deep Learning
2.1 Introduction
Computer vision and natural language processing are highly connected to each other in the aspect of artificial intelligence. Since both vision and natural language are among the major modalities of humans, understanding and reasoning grounded on vision or language is mandatory to deal with various challenges in artificial intelligence problems. For this reason, multimodal learning is studied to solve a specific problem which involves at least two different modalities as the sources of information.

The gist of multimodal learning is to learn a joint representation from multiple sources of correlated information. For example, as Ngiam et al. [2011] mentioned, RGB images and depth maps are highly correlated at the pixel level, since the edges in depth maps often correspond to distinctive edges in the corresponding RGB images. Though vision and natural language are different modalities, both are correlated at a "mid-level", as natural language is
grounded in visual information (e.g. colors, shapes, locations, etc.). So, the major challenge in multimodal learning lies in how to learn joint representations at the "mid-level".

Table 2.1 Three categories of multimodal learning settings. A and B indicate arbitrary modalities. A+B denotes multimodal learning, and A|B denotes that A or B is used, respectively.

Settings                         Feature   Training   Testing
Deep Learning                    A|B       A|B        A|B
Multimodal Fusion                A+B       A+B        A+B
Cross Modality Learning          A+B       A|B        A|B
Shared Representation Learning   A+B       A|B        B|A
One of the successful works in multimodal learning is based on deep learning [LeCun et al., 2015], which learns joint representations using deep neural networks [Ngiam et al., 2011]. They suggest three settings of multimodal learning: multimodal fusion, cross modality learning, and shared representation learning. The multimodal fusion setting can access all modalities in both training and testing. This setting is typical in the literature (e.g. Agrawal et al. [2017]; Goyal et al. [2016]). When a large multimodal dataset is available, it can be used to learn the multimodal features in an unsupervised way. The insight of this approach is the richer representative power of multimodal datasets compared with unimodal datasets. Cross modality learning uses the pretrained multimodal features to train unimodal tasks (the other modality has a crossly supportive role in this setting), whereas shared representation learning uses the pretrained multimodal features to train cross modality tasks where different modalities are presented in the training and test phases (the multimodal features should have shared representations to infer the other modality in the test phase).
Please refer to Table 2.1, which summarizes these three settings.
In the following sections, we will discuss a linear model for multimodal
learning and deep learning approaches in three different settings in detail.
2.2 Linear Model
Combining multiple cues may reduce the noise which inevitably resides in sensory cues. However, the ambiguity of sensory cues to our nervous system imposes a burden of inferring both the causality of the cues and their contributions to the body and the environment. These contributions can come from the same cause or entirely different causes; in the same way, a cause can induce the same contribution (in some sense) or entirely different effects. In this dissertation, we focus on the case where these contributions come from the same cause, and the contributions are instantiated by visual and textual cues.
Linear models provide a simple approach to this. Let a be the ground-truth answer, v be a prediction by a visual cue, and q be a prediction by a textual cue, a question. p(a) denotes the probability that a specific answer a is the answer, p(v|a) = N(a, σ_v) denotes the probability that a specific prediction v by a visual cue is the answer a, and p(q|a) = N(a, σ_q) denotes the probability that a specific prediction q by a textual cue is the answer a. The σ_v and σ_q are the variances introduced by the certainty of the given cues. For this, we need a handful of assumptions: conditional independence among these multiple cues, Gaussian noise, and the same cause. Then, Bayes' rule yields:
p(a|v, q) = p(a) p(v, q|a) / Σ_a p(v, q, a)    (2.1)
          ∝ p(a) p(v, q|a)    (2.2)
          = p(a) p(v|a) p(q|a).    (2.3)
The maximum a posteriori (MAP) estimate of a can be acquired from the mean of the product of the Gaussian distributions, resulting in the weighted sum of v and q with weight α as follows:
a = αv + (1 − α)q    (2.4)
α = σ_q² / (σ_q² + σ_v²)    (2.5)
where the prior distribution p(a) is treated as a uniform distribution [Trommershauser et al., 2011].
However, this linear combination of cues violates conditional independence among the cues, i.e., the assumption that p(v, q|a) = p(v|a)p(q|a), since q is not completely independent from v: the question q arises from the visual cue v. One solution is to use the product rule of probability:

p(v, q|a) = p(v|a) p(q|v, a)    (2.6)
          = p(q|a) p(v|q, a).    (2.7)
Here, we use p(v|q, a) to model the dependency between the cues, instead of p(v|a), which allows us to select a portion of the information in the visual cue. Later, we will show that p(v|q, a) can be modeled using attention networks, which have a sub-module that combines multi-channel visual cues using weights determined by a textual cue.
Combining multiple cues involves modeling the joint probability distribution considering the innate dependency structure of the multiple cues. Since real-world problems tend to have dependency structures too complex to be modeled by a Gaussian distribution, there is an increasing number of attempts to learn joint representations using deep neural networks, an approach coined multimodal deep learning [Ngiam et al., 2011].
2.3 Multimodal Deep Learning
In this section, we introduce multimodal deep learning problems by describing their subcategories. According to the use of multimodal inputs in the learning phases (feature learning, supervised training, and testing), we can classify these subcategories. Here, we classify multimodal deep learning into three subcategories, multimodal fusion, cross modality learning, and shared representation learning, following the work of Ngiam et al. [2011].
2.3.1 Multimodal Fusion
The multimodal fusion setting uses multiple modalities in both training and testing. The joint representation of the multiple modalities must be learned and effectively generalized to achieve a significant improvement over a unimodal setting. For example, visual question answering (VQA) tasks [Agrawal et al., 2017; Goyal et al., 2016] aim at this criterion. In Figure 2.1, visual and textual information are both important to solve the problem. The question in the top-left section is 'Who is wearing glasses?' and, depending on the given image, the corresponding answer is obviously different. Moreover, the VQA dataset is collected so that each image has three different questions and the corresponding answers, respectively. The pair of image and question is given in the training and testing phases, and the desirable output is the answer, which can be more than one word, yes or no, or a number. Since the distribution of answer counts is long-tailed, one may formulate the task as a classification problem over some number of the most frequent answers, e.g., 2k, which covers 90.45% of all questions. Now, we can summarize the multimodal fusion approach to visual question answering as:
h = f(q,v) (2.8)
where h is the joint representation of both question and image, while q and v are embedding vectors for the question sentence and the pixel information of a given image, respectively. Then, a nonlinear classifier takes the joint representation h as an input to solve the classification problem:
p(a|q, v; Θ) = classifier(h)    (2.9)
â = argmax_{a∈Ω} p(a|q, v; Θ)    (2.10)
where a is a candidate answer, â is the estimated answer, Ω is the set of candidate answers, and Θ is the set of model parameters.
Besides, the image captioning task [Lin et al., 2014] may be confused as a multimodal fusion task, since the model learns how to map an image representation to a text representation to generate the caption corresponding to a given image. However, the generated caption only depends on a unimodal image representation. This is why the image captioning task is not a multimodal learning task in our perspective. We emphasize that multimodal learning involves learning a joint representation from multimodal information, not unimodal information.
2.3.2 Cross Modality Learning
Cross modality learning uses the pretrained multimodal features to train unimodal tasks (the other modality crossly supports the given task in this setting). Here, we assume that we can obtain a better representation for a unimodal input when feature learning uses multimodal data.
Figure 2.1 Examples from VQA 2.0 [Agrawal et al., 2017; Goyal et al., 2016], which depict the criterion that multimodal information is necessary to solve the problem. For the same question, answers may be different depending on visual information, the provided image along with the question. Reproduced with permission from Goyal et al. [2016].
Two instances of multimodal datasets for this task are AVLetters [Matthews et al., 2002] and CUAVE [Patterson et al., 2002]. The AVLetters dataset consists of audio-visual information (audio and video) for the isolated letters A-Z. Each alphabetic letter is pronounced three times by 10 participants, 5 male (two of them with mustaches) and 5 female, resulting in 780 utterances. CUAVE covers the digits 0-9; its 36 participants pronounce each digit 5 times. Using cross modality learning, there is potential to improve a model by using both audio and visual information in feature learning, e.g., for lip-reading functionality in a noisy environment.
2.3.3 Shared Representation Learning
Shared representation learning, on the other hand, uses the pretrained multimodal features to train cross modality tasks where different modalities are presented in the training and test phases (the multimodal features should have shared representations to infer the other modality in the test phase). In the aspect of using, during feature learning, information that is not used in training and testing, it is similar to the technique of pre-training features; however, it deviates from this in that the input modalities in training and testing are different. This is more similar to zero-shot learning with the aid of the shared representation.
2.4 Cognitive Models
It is helpful to know the similarities and differences between deep learning and the human brain in terms of information processing, since this may provide insight for extracting principles that can apply to their possible applications.

Studies in cognitive neuroscience have found cells related to multimodal perception. In the superior temporal sulcus (STS) in the temporal lobe of the brain of the Japanese monkey, out of 200 recorded cells, 51% were unimodal (59 visual, 33 auditory, and 10 somesthetic), 18% were bimodal (21 visual+auditory, 7 visual+somesthetic, and 8 auditory+somesthetic), and 2% were trimodal [Hikosaka et al., 1988]. Other work shows that the multisensory response is greater than the sum of unisensory responses, which is called multisensory integration [Holmes and Spence, 2005]. However, to discuss the connection from neuroscience to machine learning beyond the cellular level, we want to bring up an interesting neuronal structure in the barn owl's auditory system.
The barn owl is well studied as a cognitive model of how a sound source is localized by interaural time differences. The proposed neural model of coincident detectors [Carr and Konishi, 1990] utilizes the simultaneously received inputs from the left and right ears; the two inputs are slightly different from each other in time and intensity depending on the relative location of the source with respect to the subject. For instance, if the source is located at 45 degrees to the frontal face, the interaural time difference is approximately 1e−4 seconds, although the difference in intensity is hardly distinguishable from other environmental influences, even for barn owls. Furthermore, barn owls take an anatomical advantage of their ears: the left ear is placed a little higher than eye level facing down, while the right ear is placed lower, facing up. This asymmetric anatomy is not found in human ears; however, humans have a more complex outer-ear structure, called the pinna.
So, the activations of the neural connectivity related to the left and right ears are asymmetric depending on the location of the sound source, and this differentiates the distribution of activations in a joint circuit, the coincident detectors. Under the hood of the coincident detectors, two neuronal characteristics are required: dense connections with a wide diversity of lengths, and activation delays proportional to those lengths. Based on these characteristics, the distribution of joint activations and the resolution of interaural time differences are determined, since the coincident detectors act like AND operations over the various conditions. Notice that although the coincident detectors correspond to a unimodal auditory source, the asymmetric information induced by the interaural time difference gives meaningful clues to the coincident detectors as a sort of multimodal perception.
This observation is related to low-rank bilinear pooling in Chapter 4. Low-rank bilinear pooling effectively consists of two different linear mappings for the two inputs and element-wise multiplication of the mapped representations. Under the assumption of real-valued scalar inputs representing perceived time, and Bernoulli sampling parameterized by the probability output of a sigmoid activation function over the mapped representations, a computational cognitive model for the coincident detector can be proposed. We will see that this cognitive model is linked to the approximation of the bilinear model for multimodal cues in Chapter 4.
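A minimal sketch of such a computational model follows, assuming two modality-specific linear mappings whose sigmoid outputs are multiplied element-wise, so that a joint unit responds strongly only when its two inputs are active at the same time; the class and parameter names are ours, not from a particular implementation.

```python
import torch
import torch.nn as nn

class CoincidentPooling(nn.Module):
    """Low-rank bilinear pooling viewed as a bank of coincident detectors:
    two modality-specific linear mappings followed by element-wise
    (Hadamard) multiplication of their sigmoid activations."""

    def __init__(self, x_dim, y_dim, joint_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, joint_dim, bias=False)
        self.V = nn.Linear(y_dim, joint_dim, bias=False)

    def forward(self, x, y):
        # Each sigmoid output can be read as a firing probability; the
        # product is AND-like: near one only when both inputs are active.
        return torch.sigmoid(self.U(x)) * torch.sigmoid(self.V(y))
```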
2.5 Conclusions
We have discussed the linear model using Bayes' rule to minimize the uncertainty of multiple cues under the assumption of Gaussian noise, the three settings of multimodal deep learning, and cognitive models to compare with machine learning approaches. In the following chapters, we focus on the multimodal fusion task, one of the multimodal deep learning settings, which uses multimodal inputs in both the training and testing phases. This task reminds us that learning a joint representation of multimodal information is critical to solving a complex problem, i.e., visually-grounded reasoning using multimodal deep learning.
Chapter 3
Multimodal Residual Learning
3.1 Introduction
Visual question-answering tasks provide a testbed to cultivate synergistic proposals which handle multidisciplinary problems of vision, language, and integrated reasoning. So, visual question-answering tasks let studies in artificial intelligence go beyond narrow tasks. Furthermore, they may help to solve real-world problems which need the integrated reasoning of vision and language.
Deep residual learning [He et al., 2016a] not only advances the studies of object recognition problems, but also gives a general framework for deep neural networks. The existing non-linear layers of neural networks serve to fit another mapping, F(x), which is the residual of the identity mapping x. So, with the shortcut connection of the identity mapping x, the whole module of layers fits F(x) + x for the desired underlying mapping H(x). In other words, only the residual mapping F(x), defined by H(x) − x, is learned with the non-linear layers. In this way, very deep neural networks effectively learn representations in an efficient manner.
Many attentional models utilize residual learning to deal with various tasks, including textual reasoning [Rocktäschel et al., 2016; Sukhbaatar et al., 2015] and visual question-answering [Yang et al., 2016]. They use an attentional mechanism to handle two different information sources, a query and the context of the query (e.g. contextual sentences or an image). The query is added to the output of the attentional module, which makes the attentional module learn the residual of the query mapping, as in deep residual learning.
In this chapter, we propose Multimodal Residual Networks (MRN) to learn the multimodality of visual question-answering tasks, exploiting the excellence of deep residual learning [He et al., 2016a]. MRN inherently uses shortcuts and residual mappings for multimodality. We explore various models upon the choice of shortcuts for each modality, and joint residual mappings based on element-wise multiplication, which effectively learn multimodal representations without using explicit attention parameters. Figure 3.1 shows the inference flow of the proposed MRN.
Additionally, we propose a novel method to visualize the attention effects of each joint residual mapping. The visualization method uses the back-propagation algorithm [Rumelhart et al., 1986] for the difference between the visual input and the output of the joint residual mapping. The difference is back-propagated up to the input image; in other words, the derivative of the function which defines the difference with respect to the image is visualized. Since we use pretrained visual features, the pretrained CNN is augmented for the visualization. Based on this, we argue that MRN is an implicit attention model without explicit attention parameters.
Figure 3.1 Inference flow of Multimodal Residual Networks (MRN). Using our visualization method, the attention effects are shown as a sequence of three images. More examples are shown in Figure 3.4.

Figure 3.2 A schematic diagram of Multimodal Residual Networks with three-block layers.

Our contribution is three-fold: 1) extending deep residual learning for
visual question-answering tasks. This method utilizes multimodal inputs and allows a deeper network structure; 2) achieving the state-of-the-art results on the Visual QA dataset for both Open-Ended and Multiple-Choice tasks; and finally, 3) introducing a novel method to visualize the spatial attention effect of joint residual mappings from the collapsed visual feature using back-propagation.
3.2 Related Works
3.2.1 Deep Residual Learning
Deep residual learning [He et al., 2016a] allows neural networks to have a deeper structure of over 100 layers. Very deep neural networks are usually hard to optimize, even when well-known activation functions and regularization techniques are applied [Hinton et al., 2012; Ioffe and Szegedy, 2015; Nair and Hinton, 2010]. However, this residual learning method consistently shows state-of-the-art results across multiple visual tasks, including image classification, object detection, localization, and segmentation.
This idea assumes that a block of deep neural networks forming a non-linear mapping F(x) may paradoxically fail to fit an identity mapping. To resolve this, deep residual learning adds x to F(x) as a shortcut connection. With this idea, the non-linear mapping F(x) can focus on the residual of the shortcut mapping x. Therefore, a learning block is defined as:
y = F(x) + x (3.1)
where x and y are the input and output of the learning block, respectively.
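In code, a learning block amounts to wrapping an arbitrary shape-preserving non-linear mapping with an identity shortcut; a minimal sketch:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x (Eq. 3.1): with the identity shortcut, the wrapped
    non-linear mapping F only needs to learn the residual H(x) - x."""

    def __init__(self, residual_fn: nn.Module):
        super().__init__()
        self.F = residual_fn  # any non-linear mapping preserving the shape of x

    def forward(self, x):
        return self.F(x) + x  # identity shortcut connection
```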
3.2.2 Stacked Attention Networks
Stacked Attention Networks (SAN) [Yang et al., 2016] explicitly learn the weights of visual feature vectors to select a small portion of visual information for a given question vector. Furthermore, this model stacks the attention networks for multi-step reasoning, narrowing down the selection of visual information. For example, if the attention networks are asked to find a pink handbag in a scene, they try to find pink objects first, and then narrow down to the pink handbag.

For the attention networks, the weights are learned from a question vector and the corresponding visual feature vectors. These weights are used for the linear combination of multiple visual feature vectors indexing spatial information. Through this, SAN successfully selects a portion of visual information. Finally, the addition of the combined visual feature vector and the previous question vector is transferred as a new input question vector to the next learning block.
q_k = F(q_{k−1}, V) + q_{k−1}    (3.2)
Here, q_k is the question vector for the k-th learning block and V is the visual feature matrix, whose columns indicate specific spatial indexes. F(q, V) is the attention networks of SAN.
3.3 Multimodal Residual Networks
Deep residual learning emphasizes the importance of identity (or linear) shortcuts
to have the non-linear mappings efficiently learn only residuals [He et al., 2016a].
In multimodal learning, this idea may not be readily applied. Since the modalities
may have correlations, we need to carefully define joint residual functions as
the non-linear mappings. Moreover, the shortcuts are undetermined due to
its multimodality. Therefore, the characteristics of a given task ought to be
considered to determine the model structure.
3.3.1 Background
We consider residual learning in the attention networks of SAN. We observed that, in Equation 18 of Yang et al. [2016], the question vector is transferred directly through the successive layers of the attention networks. In the case of SAN, the shortcut mapping is for the question vector, and the non-linear mapping is the attention networks.

In the attention networks, Yang et al. [2016] assume that an appropriate choice of weights on visual feature vectors for a given question vector sufficiently captures the joint representation for answering. However, question information contributes to the joint representation only weakly, through the coefficients p, which may cause a bottleneck in learning the joint representation.
F(q, V) = Σ_i p_i V_i    (3.3)
The coefficients p are the output of a nonlinear function of a question vector q and a visual feature matrix V (see Equations 15-16 in Yang et al. [2016]). V_i is the visual feature vector at spatial index i on the 14 × 14 grid.
Lu et al. [2015] propose an element-wise multiplication of a question vector
and a visual feature vector after appropriate embeddings for a joint model. This
makes a strong baseline outperforming some of the recent works [Andreas et al.,
2016; Noh et al., 2016]. We first take this approach as a candidate for the joint
residual function, since it is simple yet successful for visual question-answering.
In this context, we take the global visual feature approach for the element-wise
multiplication, instead of the multiple (spatial) visual features approach for the
explicit attention mechanism of SAN. (We present a visualization technique
exploiting the element-wise multiplication in Section 3.5.2.)
Based on these observations, we follow the shortcut mapping and the stacking
architecture of SAN [Yang et al., 2016]; however, the element-wise multiplication
is used for the joint residual function F . These updates effectively learn the
joint representation of the given vision and language information, addressing the
bottleneck issue of the attention networks of SAN.
3.3.2 Multimodal Residual Networks
MRN consists of multiple learning blocks, which are stacked for deep residual learning. Denoting an optimal mapping by H(q, v), we approximate it using

H_1(q, v) = W_q′^(1) q + F^(1)(q, v).    (3.4)
The first (linear) approximation term is W_q′^(1) q and the first joint residual function is given by F^(1)(q, v). The linear mapping W_q′ is used for matching a feature dimension. We define the joint residual function as

F^(k)(q, v) = σ(W_q^(k) q) ⊙ σ(W_2^(k) σ(W_1^(k) v))    (3.5)
where σ is tanh, and ⊙ is element-wise multiplication. The question vector and
the visual feature vector directly contribute to the joint representation. We
justify this choice in Sections 3.4 and 3.5.
For a deeper residual learning, we replace q with H_1(q, v) in the next layer. In more general terms, Equations 3.4 and 3.5 can be rewritten as

H_L(q, v) = W_q′ q + Σ_{l=1}^{L} W_F^(l) F^(l)(H_{l−1}, v)    (3.6)

where L is the number of learning blocks, H_0 = q, W_q′ = Π_{l=1}^{L} W_q′^(l), and W_F^(l) = Π_{m=l+1}^{L} W_q′^(m). The cascading in Equation 3.6 can intuitively be represented as shown in Figure 3.2. Notice that the shortcuts for the visual part are identity mappings that transfer the input visual feature vector to each layer (dashed line). At the end of each block, we denote H_l as the output of the l-th learning block, and ⊕ is element-wise addition.
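A single MRN learning block of Equations 3.4-3.5 can be sketched as follows, with illustrative dimensions; stacking per Equation 3.6 feeds each block's output back as the next block's question input.

```python
import torch
import torch.nn as nn

class MRNBlock(nn.Module):
    """One MRN learning block: H(q, v) = W_q' q + F(q, v), where the joint
    residual is F(q, v) = tanh(W_q q) * tanh(W_2 tanh(W_1 v)) (Eqs. 3.4-3.5)."""

    def __init__(self, q_dim, v_dim, joint_dim):
        super().__init__()
        self.shortcut = nn.Linear(q_dim, joint_dim)    # W_q': matches dimensions
        self.q_map = nn.Linear(q_dim, joint_dim)       # W_q
        self.v_map1 = nn.Linear(v_dim, joint_dim)      # W_1
        self.v_map2 = nn.Linear(joint_dim, joint_dim)  # W_2

    def forward(self, q, v):
        residual = torch.tanh(self.q_map(q)) * torch.tanh(
            self.v_map2(torch.tanh(self.v_map1(v))))  # Eq. 3.5
        return self.shortcut(q) + residual             # Eq. 3.4

# Stacking (Eq. 3.6): the output replaces q in the next block, while the
# same visual feature vector v is fed to every block.
```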
3.4 Experiments
3.4.1 Visual QA Dataset
We choose the Visual QA (VQA 1.0) dataset [Antol et al., 2015] for the evaluation
of our models. Other datasets may not be ideal, since they have a limited number
of examples to train and test [Malinowski et al., 2015], or have synthesized
questions from the image captions [Lin et al., 2014; Ren et al., 2015].

Figure 3.3 Alternative models are explored to justify our proposed model. The base model (a) has a shortcut for a question vector as SAN does [Yang et al., 2016], and the joint residual function takes the form of the Deep Q+I model's joint function [Lu et al., 2015]. (b) adds an extra embedding for the visual modality. (c) adds extra embeddings for both modalities. (d) uses identity mappings for shortcuts; in the first learning block, a linear mapping matches the dimension with the joint dimension. (e) uses two shortcuts for both modalities; for simplicity, the linear mapping of the visual shortcut only appears in the first learning block. Notice that (d) and (e) are compared to (b) after the model selection of (b) among (a)-(c) on test-dev results. Eventually, we choose (b) for its best performance and relative simplicity.
The questions and answers of the VQA dataset are collected via Amazon
Mechanical Turk from human subjects who satisfy the experimental requirements.
The dataset includes 614,163 questions and 7,984,119 answers, since ten answers
are gathered for each question from unique human subjects. Therefore, Antol
et al. [2015] proposed a new accuracy metric as follows:
min(# of humans that provided that answer / 3, 1).    (3.7)
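In code, this metric is a one-liner; the example answers below are hypothetical.

```python
def vqa_accuracy(prediction, human_answers):
    """VQA accuracy of Eq. 3.7: full credit if at least three of the ten
    human-provided answers match the prediction."""
    return min(human_answers.count(prediction) / 3.0, 1.0)

# Hypothetical example: two of ten annotators agree -> 2/3 credit.
print(vqa_accuracy("sheep", ["sheep", "sheep", "goat"] + ["lamb"] * 7))
```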
The questions are answered in two ways: Open-Ended and Multiple-Choice.
Unlike Open-Ended, Multiple-Choice allows the additional information of eighteen candidate answers for each question. There are three types of answers: yes/no (Y/N), numbers (Num.), and others (Other). Table 3.3 shows that the Other type benefits the most from Multiple-Choice.

Table 3.1 The results of alternative models (a)-(e) on the test-dev.

       Open-Ended
       All     Y/N     Num.    Other
(a)    60.17   81.83   38.32   46.61
(b)    60.53   82.53   38.34   46.78
(c)    60.19   81.91   37.87   46.70
(d)    59.69   81.67   37.23   46.00
(e)    60.20   81.98   38.25   46.57
The images come from the MS-COCO dataset: 123,287 of them for training and validation, and 81,434 for test. The images are carefully collected to contain multiple objects and natural situations, which also makes them well suited for visual question-answering tasks.
3.4.2 Implementation
The Torch framework and the rnn package [Léonard et al., 2015] are used to build our models. For efficient computation over variable-length questions, TrimZero is used to trim out zero vectors [Kim et al., 2016a]. TrimZero eliminates zero computations at every time-step in mini-batch learning. Its efficiency is affected by the batch size, the RNN model size, and the number of zeros in the inputs. We found that TrimZero was suitable for VQA tasks; approximately 37.5% of training time is saved in our experiments using this technique.
Table 3.2 The effect of the visual features and # of target answers on the test-dev results. Vgg for VGG-19, and Res for ResNet-152 features described in Section 3.4.

          Open-Ended
          All     Y/N     Num.    Other
Vgg, 1k   60.53   82.53   38.34   46.78
Vgg, 2k   60.77   82.10   39.11   47.46
Vgg, 3k   60.68   82.40   38.69   47.10
Res, 1k   61.45   82.36   38.40   48.81
Res, 2k   61.68   82.28   38.82   49.25
Res, 3k   61.47   82.28   39.09   48.76
Preprocessing We follow the same preprocessing procedure as DeeperLSTM+NormalizedCNN [Lu et al., 2015] (Deep Q+I) by default. The number of answers is 1k, 2k, or 3k using the most frequent answers, which cover 86.52%, 90.45%, and 92.42% of questions, respectively. The questions are tokenized using the Python Natural Language Toolkit (nltk) [Bird et al., 2009]. Subsequently, the vocabulary sizes are 14,770, 15,031, and 15,169, respectively.
Pretrained Models A question vector q ∈ R^2400 is the last output vector of a GRU [Cho et al., 2014], initialized with the parameters of Skip-Thought Vectors [Kiros et al., 2015]. Based on the study of Noh et al. [2016], this method shows the effectiveness of question embedding in visual question-answering tasks. A visual feature vector v is an output of the first fully-connected layer of VGG-19 networks [Simonyan and Zisserman, 2015], whose dimension is 4,096. Alternatively, ResNet-152 [He et al., 2016a] is used, whose dimension is 2,048. The error is back-propagated to the input question for fine-tuning, but not to the visual part v, due to the heavy computational cost of training.
Postprocessing An image captioning model [Karpathy and Fei-Fei, 2015] is used to improve the accuracy of the Other type. Let the intermediate representation be v ∈ R^|Ω|, taken right before applying softmax, where |Ω| is the vocabulary size of answers and v_i corresponds to answer a_i. If a_i is neither a number nor yes or no, and appears at least once in the generated caption, then update v_i ← v_i + 1. Notice that the pretrained image captioning model is not part of training. This simple procedure improves the test-dev overall accuracy by around 0.1% (0.3% for the Other type). We attribute this improvement to "tie breaking" in the Other type. For the Multiple-Choice task, we mask the output of the softmax layer with the given candidate answers.
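A minimal sketch of this postprocessing follows (the function names caption_tiebreak and mask_choices are ours, for illustration only):

import numpy as np

def caption_tiebreak(v, answers, caption):
    # Bump the pre-softmax score of any non-numeric, non-yes/no answer
    # that appears in the generated caption: v_i <- v_i + 1.
    tokens = set(caption.lower().split())
    for i, a in enumerate(answers):
        if a in ("yes", "no") or a.isdigit():
            continue
        if a in tokens:
            v[i] += 1.0
    return v

def mask_choices(probs, answers, candidates):
    # Multiple-Choice: zero out probabilities outside the 18 candidates.
    mask = np.array([a in candidates for a in answers], dtype=probs.dtype)
    return probs * mask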
Hyperparameters By default, we follow Deep Q+I. The common embedding
size of the joint representation is 1,200. The learnable parameters are initialized
using a uniform distribution from −0.08 to 0.08 except for the pretrained
models. The batch size is 200, and the number of iterations is fixed to 250k. The
RMSProp [Tieleman and Hinton, 2012] is used for optimization, and dropouts
[Gal, 2015; Hinton et al., 2012] are used for regularization. The hyperparameters
are fixed using test-dev results. We compare our method to state-of-the-arts
using test-standard results.
3.4.3 Exploring Alternative Models
Figure 3.3 shows alternative models we explored, based on the observations in
Section 3.3. We carefully select alternative models (a)-(c) for the importance
of embeddings in multimodal learning [Ngiam et al., 2011; Srivastava and
Salakhutdinov, 2012], (d) for the effectiveness of identity mapping as reported by
[He et al., 2016a], and (e) for the confirmation of using question-only shortcuts
in the multiple blocks as in [Yang et al., 2016]. For comparison, all models have
three-block layers (selected after a pilot test), using VGG-19 features and 1k answers; then, the number of learning blocks is explored to confirm the pilot test.
The effect of the pretrained visual feature models and the number of answers
are also explored. All validation is performed on the test-dev split.
3.5 Results
3.5.1 Quantitative Analysis
The VQA Challenge, which released the VQA dataset, provides evaluation servers
for test-dev and test-standard test splits. For the test-dev, the evaluation server
permits unlimited submissions for validation, while the test-standard permits
limited submissions for the competition. We report accuracies in percentage.
Alternative Models The test-dev results of the alternative models for the Open-Ended task are shown in Table 3.1. (a) shows a significant improvement over SAN. However, (b) is only marginally better than (a). Compared to (b), (c) deteriorates the performance; an extra embedding for a question vector may easily cause overfitting, leading to the overall degradation. The identity shortcuts in (d) cause the degradation problem, too; extra parameters of the linear mappings may effectively support the task. (e) shows a reasonable performance; however, the extra shortcut is not essential. The empirical results seem to support this idea, since the question-only model (50.39%) achieves a result competitive with the joint model (57.75%), while the image-only model gets a poor accuracy (28.13%) (see Table 2 in [Antol et al., 2015]). Eventually, we chose model (b) for its best performance and relative simplicity.
The effects of various other options, Skip-Thought Vectors [Kiros et al., 2015] for parameter initialization, Bayesian Dropout [Gal, 2015] for regularization, the image captioning model [Karpathy and Fei-Fei, 2015] for postprocessing, and the usage of shortcut connections, are explored in Tables 3.4 and 3.5.
Number of Learning Blocks To confirm the effectiveness of the number of
learning blocks selected via a pilot test (L = 3), we explore this on the chosen
model (b), again. As the depth increases, the overall accuracies are 58.85%
(L = 1), 59.44% (L = 2), 60.53% (L = 3) and 60.42% (L = 4).
Visual Features The ResNet-152 visual features are significantly better than the VGG-19 features for the Other type in Table 3.2, even though the dimension of the ResNet features (2,048) is half that of the VGG features (4,096). The ResNet visual features are also used in the previous work [Ilievski et al., 2016]; however, our model achieves remarkably better performance with a large margin (see Table 3.3).
Number of Target Answers The number of target answers slightly affects the overall accuracies, with a trade-off among answer types, making the decision on the number of target answers difficult. We chose Res, 2k in Table 3.2 based on the overall accuracy (for the Multiple-Choice task, see Table 3.6).
Comparisons with State-of-the-arts Our chosen model significantly outperforms other state-of-the-art methods for both Open-Ended and Multiple-Choice tasks in Table 3.3. However, the performance of the Number and Other types is still not satisfactory compared to Human performance, though the advances in the recent works were mainly for Other-type answers. This fact motivates the study of a counting mechanism in future work. The model comparison is performed on the test-standard results.
[Figure 3.4 example panels (a)-(h), each showing a question and MRN's predicted answer: What kind of animals are these? sheep; What animal is the picture? elephant; What is this animal? zebra; What game is this person playing? tennis; How many cats are here? 2; What color is the bird? yellow; What sport is this? surfing; Is the horse jumping? yes]
Figure 3.4 Examples for visualization of the three-block layered MRN. The original image is shown first in each group. The next three images show the input gradients of the attention effect for each learning block, as described in Section 3.5.2. The gradients of the color channels for each pixel are summed up after taking their absolute values. Then, the summed absolute values greater than the mean plus one standard deviation of these values are visualized as the attention effect (bright color) on the images. The answers (blue) are predicted by MRN.
3.5.2 Qualitative Analysis
In Equation 3.5, the left term σ(W_q q) can be seen as a masking (attention) vector to select a part of the visual information. We assume that the difference between the right term V := σ(W_2 σ(W_1 v)) and the masked vector F(q, v) indicates an attention effect caused by the masking vector. Then, the attention
effect L_att = (1/2)‖V − F‖² is visualized on the image by calculating the gradient of L_att with respect to a given image I, while treating F as a constant:

∂L_att/∂I = (∂V/∂I)(V − F)  (3.8)
This technique can be applied to each learning block in a similar way.
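This procedure can be sketched in a few lines of PyTorch-style code (the helpers cnn and mask_fn are our placeholders for the augmented pretrained CNN and for the computation of V and F from Equation 3.5):

import torch

def attention_effect(image, cnn, mask_fn):
    # Backpropagate L_att = 0.5 * ||V - F||^2 to the input image
    # (Equation 3.8), treating F as a constant via detach().
    image = image.clone().requires_grad_(True)
    v = cnn(image)            # visual features from the augmented CNN
    V, F = mask_fn(v)         # V = sigma(W2 sigma(W1 v)), F = masked vector
    loss = 0.5 * (V - F.detach()).pow(2).sum()
    loss.backward()
    # per-pixel attention effect: sum of absolute gradients over channels
    return image.grad.abs().sum(dim=0)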
Since we use the preprocessed visual features, the pretrained CNN is aug-
mented only for this visualization. Note that model (b) in Table 3.1 is used for
this visualization, and the pretrained VGG-19 is used for preprocessing and
augmentation. The model is trained using the training set of the VQA dataset,
and visualized using the validation set. Examples are shown in Figure 3.4 (more
examples in Figure 3.5, 3.6, and 3.7).
Unlike other works [Xiong et al., 2016; Yang et al., 2016] that use explicit attention parameters, MRN does not use any explicit attention mechanism. However, we observe that element-wise multiplication can be interpreted as information masking, which yields a novel method for visualizing the attention effect of this operation. Since MRN does not depend on a few attention parameters (e.g., 14 × 14), our visualization method shows a higher resolution than the others [Xiong et al., 2016; Yang et al., 2016]. Based on this, we argue that MRN is an implicit attention model without an explicit attention mechanism.
3.6 Conclusions
The idea of deep residual learning is applied to visual question-answering tasks.
Based on the two observations of the previous works, various alternative models
are suggested and validated to propose the three-block layered MRN. Our model
achieves the state-of-the-art results on the VQA dataset for both Open-Ended
and Multiple-Choice tasks. Moreover, we have introduced a novel method to
visualize the spatial attention from the collapsed visual features using back-propagation.

[Figure 3.5 More examples of Figure 3.4 in Section 3.5.2. Panels (a)-(l), each showing a question and the predicted answer: (a) Does the man have good posture? no; (b) Did he fall down? yes; (c) Are there two cats in the picture? no; (d) What color are the bears? brown; (e) What are many of the people carrying? umbrellas; (f) What color is the dog? black; (g) Are these animals tall? yes; (h) What animal is that? sheep; (i) Are all the cows the same color? no; (j) What is the reflection of in the mirror? dog; (k) What are the giraffe in the foreground doing? eating; (l) What animal is standing in the water other than birds? bear]
We believe our visualization method brings an implicit attention mechanism to the research of attentional models. Using back-propagation of the attention effect, extensive research in object detection, segmentation, and tracking is worth further investigation.
[Figure 3.6 panels: (a1) What is the animal on the left? giraffe; (a2) Can you see trees? yes; (b1) What is the lady riding? motorcycle; (b2) Is she riding the motorcycle on the street? no]
Figure 3.6 Comparative examples on the same image. (a1) and (a2) depict a giraffe (left) and a man pointing at the giraffe. MRN consistently highlights the giraffe in (a1). However, the other question "Can you see trees?" makes MRN less attentive to the giraffe, while a tree in the right background is focused more in (a2). Similarly, the attention effect of (b2) is more widely dispersed over the background than in (b1) in the middle of the sequences, perhaps to recognize the site. However, the subtlety of this comparative study is insufficient to objectively assess the results.
Table 3.3 The VQA test-standard results. The precision of some accuracies [Andreas et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match the others.

                              Open-Ended                     Multiple-Choice
                              All    Y/N    Num.   Other     All    Y/N    Num.   Other
DPPnet [Noh et al., 2016]     57.36  80.28  36.92  42.24     62.69  80.35  38.79  52.79
D-NMN [Andreas et al., 2016]  58.00  -      -      -         -      -      -      -
Deep Q+I [Lu et al., 2015]    58.16  80.56  36.53  43.73     63.09  80.59  37.70  53.64
SAN [Yang et al., 2016]       58.90  -      -      -         -      -      -      -
ACK [Wu et al., 2016b]        59.44  81.07  37.12  45.83     -      -      -      -
FDA [Ilievski et al., 2016]   59.54  81.34  35.67  46.10     64.18  81.25  38.30  55.20
DMN+ [Xiong et al., 2016]     60.36  80.43  36.82  48.33     -      -      -      -
MRN (ours)                    61.84  82.39  38.23  49.41     66.33  82.41  39.57  58.40
Human [Antol et al., 2015]    83.30  95.77  83.39  72.67     -      -      -      -
Table 3.4 The effects of various options for VQA test-dev. Here, the model of Figure 3.3 (a) is used, since these experiments were conducted preliminarily. VGG-19 features and 1k target answers are used. s stands for the usage of Skip-Thought Vectors [Kiros et al., 2015] to initialize the question embedding model of GRU, b stands for the usage of Bayesian Dropout [Gal, 2015], and c stands for the usage of postprocessing using the image captioning model [Karpathy and Fei-Fei, 2015].
          Open-Ended                     Multiple-Choice
          All    Y/N    Num.   Other    All    Y/N    Num.   Other
baseline  58.97  81.11  37.63  44.90    63.53  81.13  38.91  54.06
s         59.38  80.65  38.30  45.98    63.71  80.68  39.73  54.65
s,b       59.74  81.75  38.13  45.84    64.15  81.77  39.54  54.67
s,b,c     59.91  81.75  38.13  46.19    64.18  81.77  39.51  54.72
Table 3.5 The effects of shortcut connections of MRN for VQA test-dev. ResNet-152 features and 2k target answers are used. MN stands for Multimodal Networkswithout residual learning, which does not have any shortcut connections. Dim.stands for common embedding vector’s dimension. The number of parametersfor word embedding (9.3M) and question embedding (21.8M) is subtracted fromthe total number of parameters in this table.
                           Open-Ended
      L   Dim.   #params   All    Y/N    Num.   Other
MN    1   4604   33.9M     60.33  82.50  36.04  46.89
MN    2   2350   33.9M     60.90  81.96  37.16  48.28
MN    3   1559   33.9M     59.87  80.55  37.53  47.25
MRN   1   3355   33.9M     60.09  81.78  37.09  46.78
MRN   2   1766   33.9M     61.05  81.81  38.43  48.43
MRN   3   1200   33.9M     61.68  82.28  38.82  49.25
MRN   4    851   33.9M     61.02  82.06  39.02  48.04
Table 3.6 The results for VQA test-dev. The precision of some accuracies [Andreas et al., 2016; Xiong et al., 2016; Yang et al., 2016] is one digit less than the others, so they are zero-filled to match the others.

                               Open-Ended                     Multiple-Choice
                               All    Y/N    Num.   Other     All    Y/N    Num.   Other
Question [Antol et al., 2015]  48.09  75.66  36.70  27.14     53.68  75.71  37.05  38.64
Image [Antol et al., 2015]     28.13  64.01  00.42  03.77     30.53  69.87  00.45  03.76
Q+I [Antol et al., 2015]       52.64  75.55  33.67  37.37     58.97  75.59  34.35  50.33
LSTM Q [Antol et al., 2015]    48.76  78.20  35.68  26.59     54.75  78.22  36.82  38.78
LSTM Q+I [Antol et al., 2015]  53.74  78.94  35.24  36.42     57.17  78.95  35.80  43.41
Deep Q+I [Lu et al., 2015]     58.02  80.87  36.46  43.40     62.86  80.88  37.78  53.14
DPPnet [Noh et al., 2016]      57.22  80.71  37.24  41.69     62.48  80.79  38.94  52.16
D-NMN [Andreas et al., 2016]   57.90  80.50  37.40  43.10     -      -      -      -
SAN [Yang et al., 2016]        58.70  79.30  36.60  46.10     -      -      -      -
ACK [Wu et al., 2016b]         59.17  81.01  38.42  45.23     -      -      -      -
FDA [Ilievski et al., 2016]    59.24  81.14  36.16  45.77     64.01  81.50  39.00  54.72
DMN+ [Xiong et al., 2016]      60.30  80.50  36.80  48.30     -      -      -      -
Vgg, 1k                        60.53  82.53  38.34  46.78     64.79  82.55  39.93  55.23
Vgg, 2k                        60.77  82.10  39.11  47.46     65.27  82.12  40.84  56.39
Vgg, 3k                        60.68  82.40  38.69  47.10     65.09  82.42  40.13  55.93
Res, 1k                        61.45  82.36  38.40  48.81     65.62  82.39  39.65  57.15
Res, 2k                        61.68  82.28  38.82  49.25     66.15  82.30  40.45  58.16
Res, 3k                        61.47  82.28  39.09  48.76     66.33  82.41  39.57  58.40
[Figure 3.7 failure panels (a)-(l), each as question? prediction (blue) / ground-truth answer (red): (a) What animals are these? bears / ducks; (b) What are these animals? cows / goats; (c) What animals are visible? sheep / horses; (d) How many animals are depicted? 2 / 1; (e) What flavor donut is this? chocolate / strawberry; (f) What is the man doing? playing tennis / frisbee; (g) What color are the giraffe's eyelashes? brown / black; (h) What food is the bear trying to eat? banana / papaya; (i) What kind of animal is used to herd these animals? sheep / dog; (j) What species of tree are in the background? pine / palm; (k) Are there any birds in the photo? no / yes; (l) Why is the hydrant smiling? happy / someone drew on it]

Figure 3.7 Failure examples. Each question is followed by the model prediction (blue) and the answer (red). As mentioned in Section 3.5, MRN shows the weakness of counting in (d) and (k). Sometimes, the model finds objects regardless of the given question. In (j), even though the word cat does not appear in the question, the cat in the image is surely attended. (i) shows the limitation of the attentional mechanism, which needs inference using world knowledge.
Chapter 4
Multimodal Low-rank Bilinear Pooling
4.1 Introduction
Bilinear models [Tenenbaum and Freeman, 2000] provide richer representations
than linear models. To exploit this advantage, fully-connected layers in neural
networks can be replaced with bilinear pooling. The outer product of two
vectors (or Kronecker product for matrices) is involved in bilinear pooling; as a result, all pairwise interactions among the given features are considered.
Recently, a successful application of this technique is used for fine-grained visual
recognition [Lin et al., 2015].
However, bilinear pooling produces a high-dimensional feature of quadratic
expansion, which may constrain a model structure and computational resources.
For example, an outer product of two feature vectors, both of which have 1K-
dimensionality, produces a million-dimensional feature vector. Therefore, for
classification problems, the choice of the number of target classes is severely
constrained, because the number of parameters for a standard linear classifier is
determined by multiplication of the size of the high-dimensional feature vector
and the number of target classes.
Compact bilinear pooling [Gao et al., 2016] reduces the quadratic expansion of dimensionality by two orders of magnitude, retaining the performance of full bilinear pooling. This approximation uses a sampling-based computation, Tensor Sketch Projection [Charikar et al., 2002; Pham and Pagh, 2013], which utilizes a useful property, Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s): the projection of the outer product of two vectors is the convolution of the two projected vectors. Here, Ψ is the proposed projection function, and h and s are parameters randomly sampled by the algorithm.
Nevertheless, compact bilinear pooling has two shortcomings. The first comes from the sampling approach. Compact bilinear pooling relies on a favorable property, E[⟨Ψ(x, h, s), Ψ(y, h, s)⟩] = ⟨x, y⟩, which provides a basis for using projected features instead of the original features. Yet, calculating the exact expectation is computationally intractable, so the random parameters h and s are fixed during training and evaluation. This practical choice leads to the second shortcoming: the projected dimension of compact bilinear pooling should be large enough to minimize the bias from the fixed parameters. Practical choices are 10K and 16K for 512- and 4096-dimensional inputs, respectively [Fukui et al., 2016; Gao et al., 2016]. Though these compacted dimensions are two orders of magnitude smaller than those of full bilinear pooling, such high-dimensional features can still be a bottleneck for computationally complex models.
We propose low-rank bilinear pooling using Hadamard product (element-
wise multiplication), which is commonly used in various scientific computing
frameworks as one of tensor operations. The proposed method factors a three-
dimensional weight tensor for bilinear pooling into three two-dimensional weight
matrices, which enforces the rank of the weight tensor to be low. As a result, the two input feature vectors, linearly projected by the two weight matrices respectively, are combined by Hadamard product, then followed by a linear projection using the third weight matrix. For example, the projected vector z is represented by z = W_z^T (W_x^T x ∘ W_y^T y), where ∘ denotes Hadamard product.
We also explore adding non-linearity to low-rank bilinear pooling using non-linear activation functions, and shortcut connections inspired by deep residual learning [He et al., 2016a]. Then, we show that it becomes a simple baseline model [Antol et al., 2015] or a one-block layered Multimodal Residual Network [Kim et al., 2016b] as a low-rank bilinear model, an interpretation that has not been made before.
Our contributions are as follows: First, we propose low-rank bilinear pooling to approximate full bilinear pooling, as a substitute for compact bilinear pooling. Second, Multimodal Low-rank Bilinear Attention Networks (MLB), which have an efficient attention mechanism using low-rank bilinear pooling, are proposed for visual question-answering tasks. MLB achieves a new state-of-the-art performance and has a better parsimonious property. Finally, ablation studies exploring alternative choices, e.g., network depth, non-linear functions, and shortcut connections, are conducted.
4.2 Low-rank Bilinear Model
Bilinear models use a quadratic expansion of linear transformation considering
every pair of features.
f_i = Σ_{j=1}^{N} Σ_{k=1}^{M} w_{ijk} x_j y_k + b_i = x^T W_i y + b_i  (4.1)
where x and y are input vectors, Wi ∈ RN×M is a weight matrix for the output
fi, and bi is a bias for the output fi. Notice that the number of parameters is
L× (N ×M + 1) including a bias vector b, where L is the number of output
features.
Pirsiavash et al. [2009] suggest a low-rank bilinear method that reduces the rank of the weight matrix W_i to have fewer parameters for regularization. They rewrite the weight matrix as W_i = U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}, which restricts the rank of W_i to be at most d ≤ min(N, M).
Based on this idea, fi can be rewritten as follows:
f_i = x^T W_i y + b_i = x^T U_i V_i^T y + b_i = 1^T (U_i^T x ∘ V_i^T y) + b_i  (4.2)
where 1 ∈ R^d denotes a column vector of ones, and ∘ denotes Hadamard product.
Still, we need two third-order tensors, U and V, for a feature vector f , whose
elements are fi. To reduce the order of the weight tensors by one, we replace 1
with P ∈ Rd×c and bi with b ∈ Rc, then, redefine as U ∈ RN×d and V ∈ RM×d
to get a projected feature vector f ∈ Rc. Then, we get:
f = P^T (U^T x ∘ V^T y) + b  (4.3)
where d and c are hyperparameters to decide the dimension of joint embeddings
and the output dimension of low-rank bilinear models, respectively.
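As a minimal sketch of Equation 4.3 (a NumPy illustration of ours with toy dimensions; variable names follow the equation):

import numpy as np

def low_rank_bilinear_pooling(x, y, U, V, P, b):
    # f = P^T (U^T x * V^T y) + b (Equation 4.3);
    # * is the Hadamard (element-wise) product.
    return P.T @ ((U.T @ x) * (V.T @ y)) + b

# toy dimensions: N, M inputs, d joint embedding, c outputs
N, M, d, c = 2400, 2048, 1200, 2000
rng = np.random.default_rng(0)
x, y = rng.standard_normal(N), rng.standard_normal(M)
U, V = rng.standard_normal((N, d)), rng.standard_normal((M, d))
P, b = rng.standard_normal((d, c)), np.zeros(c)
f = low_rank_bilinear_pooling(x, y, U, V, P, b)
assert f.shape == (c,)

Note the parameter count d(N + M + c) in place of the N × M × c weights of full bilinear pooling.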
4.3 Low-rank Bilinear Pooling
A low-rank bilinear model in Equation 4.3 can be implemented using two linear
mappings without biases for embedding two input vectors, Hadamard product
to learn joint representations in a multiplicative way, and a linear mapping with
a bias to project the joint representations into an output vector for a given
output dimension. Then, we use this structure as a pooling method for deep
neural networks. Now, we discuss possible variations of low-rank bilinear pooling
based on this model inspired by studies of neural networks.
4.3.1 Full Model
In Equation 4.3, linear projections, U and V , can have their own bias vectors.
As a result, linear models for each input vectors, x and y, are integrated in an
additive form, called as full model for linear regression in statistics:
f = P^T ((U^T x + b_x) ∘ (V^T y + b_y)) + b
  = P^T (U^T x ∘ V^T y + U′^T x + V′^T y) + b′.  (4.4)

Here, U′^T = diag(b_y) · U^T, V′^T = diag(b_x) · V^T, and b′ = b + P^T (b_x ∘ b_y).
4.3.2 Nonlinear Activation
Applying non-linear activation functions may help to increase representative
capacity of model. The first candidate is to apply non-linear activation functions
right after linear mappings for input vectors.
f = P^T (σ(U^T x) ∘ σ(V^T y)) + b  (4.5)
where σ denotes an arbitrary non-linear activation function that maps real values into a finite interval, e.g., sigmoid or tanh. If the two inputs come from different modalities, their statistics may be quite different from each other, which may result in interference, since the gradient with respect to each input directly depends on the other input in the Hadamard product of the two inputs.
Additionally applying an activation function after the Hadamard product is not appropriate, since activation functions would then appear doubly in calculating gradients. However, applying the activation function only after the Hadamard product is an alternative choice (we explore this option in Section 4.5) as follows:

f = P^T σ(U^T x ∘ V^T y) + b.  (4.6)
Note that the use of an activation function in low-rank bilinear pooling can be found in an implementation of the simple baseline for the VQA 1.0 dataset [Antol et al., 2015], without an interpretation as low-rank bilinear pooling. Notably, however, Wu et al. [2016c] studied the learning behavior of multiplicative integration in RNNs with discussions and empirical evidence.
4.3.3 Shortcut Connection
When we apply two previous techniques, full model and non-linear activation,
linear models of two inputs are nested by the non-linear activation functions.
To avoid this unfortunate situation, we add shortcut connections as explored in
residual learning [He et al., 2016a].
f = P^T (σ(U^T x) ∘ σ(V^T y)) + h_x(x) + h_y(y) + b  (4.7)
where h_x and h_y are shortcut mappings. For linear projection, the shortcut mappings are linear mappings. Notice that this formulation is a generalized form of the one-block layered MRN [Kim et al., 2016b]. However, the shortcut connections are not used in our proposed model, as explained in Section 4.6.
4.4 Multimodal Low-rank Bilinear Attention Networks
In this section, we apply low-rank bilinear pooling to propose an efficient atten-
tion mechanism for visual question-answering tasks, based on the interpretation
of the previous section. We assume that the inputs are a question embedding vector q and a set of visual feature vectors F over an S × S lattice space.
4.4.1 Low-rank Bilinear Pooling in Attention Mechanism
Attention mechanism uses an attention probability distribution α over S × S
lattice space. Here, using low-rank bilinear pooling, α is defined as
α = softmax(P_α^T (σ(U_q^T q · 1^T) ∘ σ(V_F^T F^T)))  (4.8)

where α ∈ R^{G×S²}, P_α ∈ R^{d×G}, σ is a hyperbolic tangent function, U_q ∈ R^{N×d}, q ∈ R^N, 1 ∈ R^{S²}, V_F ∈ R^{M×d}, and F ∈ R^{S²×M}. If G > 1, multiple glimpses are
explicitly expressed as in Fukui et al. [2016], conceptually similar to Jaderberg
et al. [2015]. And, the softmax function applies to each row vector of α. The
bias terms are omitted for simplicity.
4.4.2 Multimodal Low-rank Bilinear Attention Networks
Attended visual feature v is a linear combination of F_i with coefficients α_{g,i}. Each attention probability distribution α_g is for a glimpse g. For G > 1, v is the concatenation of the resulting vectors v_g as

v = ∥_{g=1}^{G} Σ_{s=1}^{S²} α_{g,s} F_s  (4.9)

where ∥ denotes concatenation of vectors. The posterior probability distribution
is an output of a softmax function, whose input is the result of another low-rank
bilinear pooling of q and v as
p(a|q, F; Θ) = softmax(P_o^T (σ(W_q^T q) ∘ σ(V_v^T v)))  (4.10)

â = argmax_{a∈Ω} p(a|q, F; Θ)  (4.11)
where â denotes a predicted answer, Ω is a set of candidate answers, and Θ is an aggregation of the entire model parameters.
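Equations 4.8-4.11 can be sketched as follows (a NumPy illustration of ours with bias terms omitted; variable names follow the equations):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mlb_attention(q, F, Uq, VF, Pa):
    # Eqs. 4.8-4.9: q (N,), F (S^2, M), Uq (N, d), VF (M, d), Pa (d, G).
    # Returns attention alpha (G, S^2) and attended feature v (G*M,).
    joint = np.tanh(Uq.T @ q)[:, None] * np.tanh(VF.T @ F.T)   # (d, S^2)
    alpha = softmax(Pa.T @ joint, axis=1)                      # (G, S^2)
    v = np.concatenate([alpha[g] @ F for g in range(alpha.shape[0])])
    return alpha, v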
4.4.3 Model Schema
Figure 4.1 shows a schematic diagram of MLB, where ∘ denotes Hadamard product, and Σ denotes a linear combination of visual feature vectors using
coefficients, which is the output of softmax function. If G > 1, the softmax
function is applied to each row vectors of an output matrix (Equation 4.8), and
we concatenate the resulting vectors of the G linear combinations (Equation 4.9).
Figure 4.1 A schematic diagram of MLB. The Replicate module copies a question embedding vector to match the S² visual feature vectors. Conv modules indicate 1 × 1 convolution to transform a given channel space, which is computationally equivalent to linear projection for channels.
4.5 Experiments
Table 4.1 The accuracies of our experimental models for VQA test-dev split andOpen-Ended task. For the MCB models, A: attention model, G: Glove vectormodel, and V: Visual Genome augmentation model.
MODEL                         SIZE    ALL    Y/N    NUM    ETC
MRN-L3                        65.0M   61.68  82.28  38.82  49.25
MARN-L3                       65.5M   62.37  82.31  38.06  50.83
MARN-L2                       56.3M   63.92  82.88  37.98  53.59
* MARN-L1                     47.0M   63.79  82.73  37.92  53.46
MARN-L1-G1                    47.0M   63.79  82.73  37.92  53.46
* MARN-L1-G2                  57.7M   64.53  83.41  37.82  54.43
MARN-L1-G4                    78.9M   64.61  83.72  37.86  54.33
No Tanh                       57.7M   63.58  83.18  37.23  52.79
* Before-Product              57.7M   64.53  83.41  37.82  54.43
After-Product                 57.7M   64.53  83.53  37.06  54.50
Mode Answer                   57.7M   64.53  83.41  37.82  54.43
* Sampled Answer              57.7M   64.80  83.59  38.38  54.73
Shortcut                      57.7M   64.80  83.59  38.38  54.73
* No Shortcut                 51.9M   65.08  84.14  38.21  54.87
MLB                           51.9M   65.08  84.14  38.21  54.87
MLB+VG                        51.9M   65.84  83.87  37.87  56.76
MCB-A [Fukui et al., 2016]    69.2M   64.2   82.2   37.7   54.8
MCB-AG [Fukui et al., 2016]   70.5M   64.7   82.5   37.6   55.6
MCB-AGV [Fukui et al., 2016]  70.5M   65.4   82.3   37.2   57.4
In this section, we conduct six experiments to select the proposed model,
Multimodal Low-rank Bilinear Attention Networks (MLB). Each experiment
controls other factors except one factor to assess the effect on accuracies. Based
on MRN [Kim et al., 2016b], we start our assessments with an initial option
of G = 1 and the shortcut connections of MRN, called Multimodal Attention Residual Networks (MARN). Notice that we use one embedding for each visual feature for better performance, based on our preliminary experiment (not shown). We attribute this choice to the attention mechanism for visual features, which provides more capacity to learn visual features. We use the same hyperparameters as MRN [Kim et al., 2016b] unless explicitly mentioned otherwise.
The VQA 1.0 dataset [Antol et al., 2015] is used as a primary dataset, and, for
data augmentation, question-answering annotations of Visual Genome [Krishna
et al., 2016] are used. Validation is performed on the VQA test-dev split, and
model comparison is based on the results of the VQA test-standard split. For
the comprehensive reviews of VQA tasks, please refer to Wu et al. [2016a] and
Kafle and Kanan [2016a]. The source code for the experiments is available in
the GitHub repository: https://github.com/jnhwkim/MulLowBiVQA.
Number of Learning Blocks Kim et al. [2016b] argue that the three-block layered MRN shows the best performance among one- to four-block layered models, taking advantage of residual learning. However, we speculate that the introduction of an attention mechanism makes deep networks hard to optimize. Therefore, we explore the number of learning blocks of MARN, which has an attention mechanism using low-rank bilinear pooling.
Number of Glimpses Fukui et al. [2016] show that the attention mechanism
of two glimpses was an optimal choice. In a similar way, we assess one, two, and
four-glimpse models.
Non-Linearity We assess three options for applying non-linearity on low-rank bilinear pooling: vanilla (none), before the Hadamard product as in Equation 4.5, and after the Hadamard product as in Equation 4.6.
Answer Sampling The VQA 1.0 dataset [Antol et al., 2015] has ten answers from unique persons for each question, while the Visual Genome dataset [Krishna et al., 2016] has a single answer for each question. Since difficult or ambiguous questions may have divided answers, probabilistic sampling from the distribution of answers can be utilized to optimize for the multiple answers. An instance can be found in Fukui et al. [2016] (https://github.com/akirafukui/vqa-mcb/blob/5fea8/train/multi_att_2_glove/vqa_data_provider_layer.py#L130). We simplify the procedure as follows:
p(a_1) = |a_1| / Σ_i |a_i|  if |a_1| ≥ 3;  0 otherwise  (4.12)

p(a_0) = 1 − p(a_1)  (4.13)
where |a_i| denotes the number of occurrences of the unique answer a_i in the set of multiple answers, a_0 denotes the mode, which is the most frequent answer, and a_1 denotes the second most frequent answer. We define the divided answers as those having at least three occurrences of the second most frequent answer, for the evaluation metric of VQA [Antol et al., 2015]:

accuracy(a_k) = min(|a_k|/3, 1).  (4.14)
The rate of divided answers is approximately 16.40%, and only 0.23% of questions have more than two divided answers in the VQA dataset. We assume that this eases the difficulty of convergence without severe degradation of performance.
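A minimal sketch of this sampling procedure (the function name is ours, for illustration):

import random
from collections import Counter

def sample_answer(answers):
    # Answer sampling (Eqs. 4.12-4.13): train on the runner-up answer a1
    # with probability |a1| / sum_i |a_i| when |a1| >= 3, otherwise on
    # the mode answer a0.
    counts = Counter(answers).most_common()
    a0 = counts[0][0]
    if len(counts) > 1 and counts[1][1] >= 3:
        a1, n1 = counts[1]
        if random.random() < n1 / len(answers):
            return a1
    return a0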
Shortcut Connection The contribution of shortcut connections for residual learning is explored, based on the observation of the competitive performance of the single-block layered model, since the usefulness of shortcut connections is linked to the network depth [He et al., 2016a].
Data Augmentation Data augmentation using the Visual Genome [Krishna et al., 2016] question answer annotations is explored. Visual Genome originally provides 1.7 million visual question answer annotations. After aligning to VQA, the valid number of question-answering pairs for training is 837,298, for 99,280 distinct images.
4.5.1 Preprocessing
We follow the preprocessing procedure of Kim et al. [2016b]. Here, we note some of its details, and our changes.
The 90.45% of questions covering the 2K most frequent answers are used. The vocabulary size of questions is 15,031. A GRU [Cho et al., 2014] is used for question embedding. Based on earlier studies [Kim et al., 2016b; Noh et al., 2016], a word embedding matrix and the GRU are initialized with the Skip-Thought Vectors pretrained model [Kiros et al., 2015]. As a result, question vectors have 2,400 dimensions.
For efficient computation of variable-length questions, TrimZero [Kim et al., 2016a] is used for the GRU. Moreover, for regularization, Bayesian Dropout [Gal, 2015], as implemented in Léonard et al. [2015], is applied during training.
4.5.2 Vision Embedding
ResNet-152 networks [He et al., 2016a] are used for feature extraction. The dimensionality of an input image is 3 × 448 × 448. The outputs of the last convolution layer are used, which have 2,048 × 14 × 14 dimensions.
4.5.3 Hyperparameters
The hyperparameters used in MLB of Table 4.4 are described in Table 4.2.
The batch size is 100, and the number of iterations is fixed to 250K. For data
augmented models, a simplified early stopping is used, starting from 250K to
350K-iteration for every 25K iterations (250K, 275K, 300K, 325K, and 350K; at
most five points) to avoid exhaustive submissions to VQA test-dev evaluation
server. RMSProp [Tieleman and Hinton, 2012] is used for optimization.
Though the size of the joint embedding d is borrowed from Kim et al. [2016b], a grid search on d confirms this choice in our model, as shown in Table 4.3.
Table 4.2 Hyperparameters used in MLB (single model in Table 4.4).
SYMBOL  VALUE          DESCRIPTION
S       14             attention lattice size
N       2,400          question embedding size
M       2,048          channel size of extracted visual features
d       1,200          joint embedding size
G       2              number of glimpses
|Ω|     2,000          number of candidate answers
η       3e-4           learning rate
λ       0.99997592083  learning rate decay factor at every iteration
p       0.5            dropout rate
θ       ±10            gradient clipping threshold
4.6 Results
The six experiments are conducted sequentially. Each experiment determines
experimental variables one by one. Table 4.1, which has six sectors divided by mid-rules, shows the accuracies of our experimental model, Multimodal Attention Residual Networks (MARN), with respect to the number of learning blocks (L#), the number of glimpses (G#), the position of activation functions (tanh), answer sampling, shortcut connections, and data augmentation using the Visual Genome dataset, for the VQA test-dev split and Open-Ended task. Note that our proposed model, Multimodal Low-rank Bilinear Attention Networks (MLB), has no shortcut connections, compared with MARN. MODEL: model name, SIZE: number of parameters, ALL: overall accuracy in percentage, Y/N: yes/no, NUM: numbers, and ETC: others. Since Fukui et al. [2016] only report the accuracy of the ensemble model on the test-standard, the test-dev results of their single models are included in the last sector. Some figures have different precisions, which are rounded. ∗ indicates the selected model for each experiment.

Table 4.3 The effect of joint embedding size d.

              Open-Ended
d     SIZE    ALL    Y/N    NUM    ETC
800   45.0M   64.89  84.08  38.15  54.55
1000  48.4M   65.06  84.18  38.01  54.85
1200  51.9M   65.08  84.14  38.21  54.87
1400  55.4M   64.94  84.13  38.00  54.64
1600  58.8M   65.02  84.15  37.79  54.85
4.6.1 Six Experiment Results
Number of Learning Blocks Though MRN [Kim et al., 2016b] has a three-block layered architecture, MARN shows the best performance with two-block layered models (63.92%). For the multiple-glimpse models in the next experiment, we choose the one-block layered model for its simplicity to extend and its competitive performance (63.79%).
Number of Glimpses Compared with the results of Fukui et al. [2016], four-glimpse MARN (64.61%) is better than the other comparative models. However, for a parsimonious choice, two-glimpse MARN (64.53%) is chosen for later experiments. We speculate that multiple glimpses are one of the key factors for the competitive performance of MCB [Fukui et al., 2016], based on the large margin in accuracy compared with one-glimpse MARN (63.79%).
Non-Linearity The results confirm that activation functions are useful for improving performance. Surprisingly, there is no empirical difference between the two options, before-Hadamard product and after-Hadamard product. This result may build a bridge to studies on multiplicative integration with recurrent neural networks [Wu et al., 2016c].
Answer Sampling Sampled answers (64.80%) result in better performance than mode answers (64.53%). This confirms that the distribution of answers from annotators can be used to improve the performance. However, the number of multiple answers is usually limited due to the cost of data collection.
Shortcut Connection Though MRN [Kim et al., 2016b] effectively uses shortcut connections to improve model performance, the one-block layered MARN shows better performance without the shortcut connection. In other words, residual learning is not used in our proposed model, MLB. It seems that there is a trade-off between introducing an attention mechanism and residual learning. We leave a careful study of this trade-off for future work.
Data Augmentation Data augmentation using Visual Genome [Krishna et al., 2016] question answer annotations significantly improves the performance by 0.76% in accuracy for the VQA test-dev split. Especially, the accuracy of others (ETC)-type answers is notably improved by the data augmentation.
4.6.2 Comparison with State-of-the-Art
The comparison with other single models on VQA test-standard is shown in
Table 4.4. The overall accuracy of our model is approximately 1.9% above the
next best model [Noh and Han, 2016] on the Open-Ended task of VQA. The
major improvements are from yes-or-no (Y/N) and others (ETC)-type answers.
In Table 4.5, we also report the accuracy of our ensemble model to compare
with other ensemble models on VQA test-standard, which won 1st to 5th places in VQA Challenge 2016 (http://visualqa.org/challenge.html). We beat the previous state-of-the-art with a margin of 0.42%.
4.6.3 Ensemble of Seven Models
The test-dev results for individual models consisting of our ensemble model is
presented in Table 4.6.
4.7 Related Works
MRN [Kim et al., 2016b] proposes multimodal residual learning with the Hadamard product of low-rank bilinear pooling. However, its utilization of low-rank bilinear pooling is limited to the joint residual mapping function for multimodal residual learning. Higher-Order Boltzmann Machines [Memisevic and Hinton, 2007, 2010] use Hadamard product to capture the interactions of input, output, and hidden representations in an energy function. Wu et al. [2016c] propose recurrent neural networks using Hadamard product to integrate multiplicative interactions among hidden representations in the model.
Yet, compact bilinear pooling, or multimodal compact bilinear pooling [Fukui et al., 2016; Gao et al., 2016], is worth discussing and carefully comparing with our method.
4.7.1 Multimodal Residual Networks
MRN [Kim et al., 2016b] is an implicit attentional model using multimodal
residual learning with Hadamard product which does not have any explicit
attention mechanism.
F^(k)(q, v) = σ(W_q^(k) q) ∘ σ(W_2^(k) σ(W_1^(k) v))  (4.15)

H_L(q, v) = W_{q′} q + Σ_{l=1}^{L} W_{F^(l)} F^(l)(H_{l−1}, v)  (4.16)

where W_∗ are parameter matrices, L is the number of learning blocks, H_0 = q, W_{q′} = Π_{l=1}^{L} W_{q′}^(l), and W_{F^(l)} = Π_{m=l+1}^{L} W_{q′}^(m). Notice that these equations can be generalized by Equation 4.7.
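In recursive form, Equations 4.15-4.16 can be sketched as follows (a NumPy illustration of ours; the parameter-dictionary layout is an assumption):

import numpy as np

def mrn(q, v, params, L=3):
    # Recursive form of Eqs. 4.15-4.16:
    # H_l = W_q'^(l) H_{l-1} + F^(l)(H_{l-1}, v), with H_0 = q;
    # unrolling recovers the products of W_q'^(l) in Eq. 4.16.
    sigma = np.tanh
    H = q
    for l in range(L):
        p = params[l]
        F = sigma(p["Wq"] @ H) * sigma(p["W2"] @ sigma(p["W1"] @ v))  # Eq. 4.15
        H = p["Wq_prime"] @ H + F                                     # linear shortcut + residual
    return H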
However, an explicit attention mechanism allows the use of lower-level visual features than fully-connected layers and, more importantly, spatially selective learning. Recent state-of-the-art methods use a variant of an explicit attention mechanism in their models [Fukui et al., 2016; Lu et al., 2016; Noh and Han, 2016]. Note that the shortcut connections of MRN are not used in the proposed Multimodal Low-rank Bilinear (MLB) model, since they provide no performance gain when multiple layers are not stacked in MLB. We leave the study of residual learning for MLB, which may leverage the excellency of bilinear models as suggested in Wu et al. [2016a], for future work.
4.7.2 Higher-Order Boltzmann Machines
A similar model can be found in a study of Higher-Order Boltzmann Ma-
chines [Memisevic and Hinton, 2007, 2010]. They suggest a factoring method for
the three-way energy function to capture correlations among input, output, and
hidden representations.
−E(y, h; x) = Σ_f (Σ_i x_i w^x_{if})(Σ_j y_j w^y_{jf})(Σ_k h_k w^h_{kf}) + Σ_k w^h_k h_k + Σ_j w^y_j y_j
            = (x^T W^x ∘ y^T W^y ∘ h^T W^h) 1 + h^T w^h + y^T w^y  (4.17)
Setting aside the bias terms, the I × J × K parameter tensor of unfactored Higher-Order Boltzmann Machines is replaced with three matrices, W^x ∈ R^{I×F}, W^y ∈ R^{J×F}, and W^h ∈ R^{K×F}.
4.7.3 Multiplicative Integration with Recurrent Neural Net-works
Most of recurrent neural networks, including vanilla RNNs, Long Short Term
Memory networks [Hochreiter and Schmidhuber, 1997] and Gated Recurrent
Units [Cho, Kyunghyun and Van Merriënboer, Bart and Gulcehre, Caglar and
Bahdanau, Dzmitry and Bougares, Fethi and Schwenk, Holger and Bengio,
Yoshua, 2014], share a common expression as follows:
ϕ(Wx+Uh+ b) (4.18)
where ϕ is a non-linear function, W ∈ Rd×n, x ∈ Rn, U ∈ Rd×m, h ∈ Rm, and
b ∈ Rd is a bias vector. Note that, usually, x is an input state vector and h is
a hidden state vector in recurrent neural networks.
Wu et al. [2016c] propose a new design to replace the additive expression
with a multiplicative expression using Hadamard product as
ϕ(Wx ∘ Uh + b).  (4.19)
Moreover, a general formulation of this multiplicative integration can be
described as
ϕ(α ∘ Wx ∘ Uh + Wx ∘ β₁ + Uh ∘ β₂ + b)  (4.20)
which is reminiscent of the full model in Section 4.3.1.
4.7.4 Compact Bilinear Pooling
Compact bilinear pooling [Gao et al., 2016] approximates full bilinear pooling
using a sampling-based computation, Tensor Sketch Projection [Charikar et al.,
2002; Pham and Pagh, 2013]:
Ψ(x ⊗ y, h, s) = Ψ(x, h, s) ∗ Ψ(y, h, s)  (4.21)
               = FFT⁻¹(FFT(Ψ(x, h, s)) ∘ FFT(Ψ(y, h, s)))  (4.22)

where ⊗ denotes outer product, ∗ denotes convolution, Ψ(v, h, s)_i := Σ_{j: h_j = i} s_j · v_j, FFT denotes the Fast Fourier Transform, d denotes an output dimension, x, y, h, s ∈ R^n, x and y are inputs, and h and s are random variables. h_i is sampled from {1, ..., d}, and s_i is sampled from {−1, 1}; then, both random
variables are fixed for further usage. Even if the dimensions of x and y are
different from each other, it can be used for multimodal learning [Fukui et al.,
2016].
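A minimal NumPy sketch of this projection and the Tensor Sketch approximation follows (function names are ours; in practice h and s would be fixed after initialization):

import numpy as np

def count_sketch(v, h, s, d):
    # Count Sketch projection Psi(v, h, s): Psi_i = sum_{j: h_j = i} s_j * v_j.
    out = np.zeros(d)
    np.add.at(out, h, s * v)
    return out

def compact_bilinear(x, y, d, rng=np.random.default_rng(0)):
    # Tensor Sketch approximation of the outer product x (x) y
    # (Eqs. 4.21-4.22): convolution of sketches computed via FFT.
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx = rng.choice([-1, 1], x.size)
    sy = rng.choice([-1, 1], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)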
Similarly to Equation 4.1, compact bilinear pooling can be described as follows:

f_i = x^T W_i y  (4.23)

where W_{ijk} = s_{ijk} w_{ijk}, s_{ijk} is sampled from {−1, 1}, w_{ijk} is sampled from {P_{i1}, P_{i2}, . . . , P_{id}}, and the compact bilinear pooling is followed by a fully connected layer P ∈ R^{|Ω|×d}. Then, this method can be formulated as a hashing trick [Chen et al., 2015; Weinberger et al., 2009] to share randomly chosen
bilinear weights using d parameters for an output value, in a way that a single parameter is shared by NM/d bilinear terms in expectation, with a variance of NM(d − 1)/d² (see Section 4.8.1).
In comparison with our method, their method approximates a three-dimensional
weight tensor in bilinear pooling with a two-dimensional matrix P, which is
larger than the concatenation of three two-dimensional matrices for low-rank
bilinear pooling. The ratio of the number of parameters for a single output to
the total number of parameters for |Ω| outputs is d/d|Ω| = 1/|Ω| [Fukui et al.,
2016], vs. d(N +M + 1)/d(N +M + |Ω|) = (N +M + 1)/(N +M + |Ω|) ≈ 2/3
(ours), since our method uses a three-way factorization. Hence, more parameters
are allocated to each bilinear approximation than compact bilinear pooling does,
effectively managing overall parameters guided by back-propagation algorithm.
MCB [Fukui et al., 2016], which uses compact bilinear pooling for multimodal
tasks, needs to set the dimension of output d to 16K, to reduce the bias induced
by the fixed random variables h and s. As a result, the majority of model
parameters (16K × 3K = 48M) are concentrated on the last fully connected
layer, which makes a fan-out structure. So, the total number of parameters
of MCB is highly sensitive to the number of classes, which is approximately
69.2M for MCB+att, and 70.5M for MCB+att+GloVe. Yet, the total number
of parameters of our proposed model (MLB) is 51.9M, which is more robust
to the number of classes having d = 1.2K, which has a similar role in model
architecture.
4.8 Discussions
4.8.1 Understanding of Multimodal Compact Bilinear Pooling
In this section, the algorithm of multimodal compact bilinear pooling (MCB) [Fukui et al., 2016; Gao et al., 2016] is described as a kind of hashing trick [Chen et al., 2015].
x ∈ R^{n_x} and y ∈ R^{n_y} are the given inputs, and Φ(x, y) ∈ R^d is the output. Random variables h^x ∈ N^{n_x} and h^y ∈ N^{n_y} are uniformly sampled from {1, . . . , d}, and s^x ∈ Z^{n_x} and s^y ∈ Z^{n_y} are uniformly sampled from {−1, 1}. Then, the Count Sketch projection function Ψ [Charikar et al., 2002] projects x and y to intermediate representations Ψ(x, h^x, s^x) ∈ R^d and Ψ(y, h^y, s^y) ∈ R^d, defined as:

Ψ(v, h, s)_i := Σ_{j: h_j = i} s_j · v_j  (4.24)
Notice that both h and s remain constant after initialization [Fukui et al., 2016]. The probability of h^x_j = i and h^y_j = i for a given j is 1/d². Hence, the expected number of bilinear terms in Ψ(x, h^x, s^x)_i Ψ(y, h^y, s^y)_i is (n_x n_y)/d². Since the output Φ(x, y) is the result of circular convolution of Ψ(x, h^x, s^x) and Ψ(y, h^y, s^y), the expected number of bilinear terms in Φ(x, y)_i is (n_x n_y)/d. Likewise, the probability that a bilinear term is allocated in Φ(x, y)_i is 1/d. The probability distribution of the number of bilinear terms in Φ(x, y)_i follows a multinomial distribution, whose mean is (n_x n_y)/d and whose variance is (n_x n_y)(d − 1)/d².
Linear projection after the multimodal compact bilinear pooling provides
weights on the bilinear terms, in a way that a shared weight is assigned to
Φ(x,y)i, which has (nxny)/d bilinear terms in expectation, though each bilinear
term can have a different sign induced by both sx and sy.
HashedNets [Chen et al., 2015] propose a method to compress neural networks
using a low-cost hashing function [Weinberger et al., 2009], which is the same function as Ψ(v, h, s). They randomly group a portion of connections in neural
networks to share a single weight. We speculate that multimodal compact bilinear pooling uses the hashing trick to reduce the number of full bilinear weights at the rate of d/(n_x n_y). However, this approximation is limited to two-way interaction, compared with the three-way factorization in our method.
4.8.2 Replacement of Low-rank Bilinear Pooling
For an explicit comparison with compact bilinear pooling, we substitute compact bilinear pooling for low-rank bilinear pooling while controlling everything else, which means that the rest of the model architecture is exactly the same. Following Fukui et al. [2016], we use MCB followed by Signed Square Root, L2-Normalization, Dropout (p=0.1), and a linear projection from 16,000 dimensions to the target dimension, along with Dropout (p=0.3) for the question embedding vector. Note that the overall architecture for multimodal learning is the same for both. Experimental details are referenced from the implementation of Fukui et al. [2016] (https://github.com/akirafukui/vqa-mcb).
For the test-dev split, our version of MCB gets 61.48% overall accuracy (yes/no: 82.48%, number: 37.06%, and other: 49.07%) vs. 65.08% (ours, MLB in Table 4.1). Additionally, if the non-linearity in getting attention distributions is increased as the original MCB does using ReLU, we get 62.11% overall accuracy (yes/no: 82.55%, number: 37.18%, and other: 50.30%), which is still below our performance (our version of the MCB definition can be found at https://github.com/jnhwkim/MulLowBiVQA/blob/master/netdef/MCB.lua).
We do not see this as decisive evidence of the better performance of MLB, but as a reference (the comparison of test-dev results may also be unfair), since an optimal architecture and hyperparameters may be required for each method.
4.9 Conclusions
We suggest a low-rank bilinear pooling method to replace compact bilinear pooling, which has a fan-out structure and needs complex computations. Low-rank bilinear pooling has a flexible structure using linear mapping and Hadamard product, and a better parsimonious property, compared with compact bilinear pooling. We achieve new state-of-the-art results on the VQA dataset using a similar architecture to Fukui et al. [2016], replacing compact bilinear pooling with low-rank bilinear pooling. We believe our method could be applicable to other bilinear learning tasks.
Table 4.4 The VQA 1.0 test-standard results to compare with state-of-the-art. Notice that these results are trained on the provided VQA train and validation splits, without any data augmentation.

                                                Open-Ended                    MC
MODEL                                           ALL    Y/N    NUM    ETC    ALL
iBOWIMG [Zhou et al., 2015]                     55.89  76.76  34.98  42.62  61.97
DPPnet [Noh et al., 2016]                       57.36  80.28  36.92  42.24  62.69
DeeperLSTM+NormalizedCNN [Antol et al., 2015]   58.16  80.56  36.53  43.73  63.09
SMem [Xu and Saenko, 2016]                      58.24  80.80  37.53  43.48  -
Ask Your Neurons [Malinowski et al., 2016]      58.43  78.24  36.27  46.32  -
SAN [Yang et al., 2016]                         58.85  79.11  36.41  46.42  -
D-NMN [Andreas et al., 2016]                    59.44  80.98  37.48  45.81  -
ACK [Wu et al., 2016b]                          59.44  81.07  37.12  45.83  -
FDA [Ilievski et al., 2016]                     59.54  81.34  35.67  46.10  64.18
HYBRID [Kafle and Kanan, 2016b]                 60.06  80.34  37.82  47.56  -
DMN+ [Xiong et al., 2016]                       60.36  80.43  36.82  48.33  -
MRN [Kim et al., 2016b]                         61.84  82.39  38.23  49.41  66.33
HieCoAtt [Lu et al., 2016]                      62.06  79.95  38.22  51.95  66.07
RAU [Noh and Han, 2016]                         63.2   81.7   38.2   52.8   67.3
MLB (ours)                                      65.07  84.02  37.90  54.77  68.89
Table 4.5 The VQA 1.0 test-standard results for ensemble models to comparewith state-of-the-art. For unpublished entries, their team names are used insteadof their model names. Some of their figures are updated after the challenge.
                             Open-Ended                    MC
MODEL                        ALL    Y/N    NUM    ETC    ALL
RAU [Noh and Han, 2016]      64.12  83.33  38.02  53.37  67.34
MRN [Kim et al., 2016b]      63.18  83.16  39.14  51.33  67.54
DLAIT (not published)        64.83  83.23  40.80  54.32  68.30
Naver Labs (not published)   64.79  83.31  38.70  54.79  69.26
MCB [Fukui et al., 2016]     66.47  83.24  39.47  58.00  70.10
MLB (ours)                   66.89  84.61  39.07  57.79  70.29
Human [Antol et al., 2015]   83.30  95.77  83.39  72.67  91.54
Table 4.6 The individual models used in our ensemble model in Table 4.5.
                   Open-Ended
MODEL     GLIMPSE  ALL    Y/N    NUM    ETC
MLB       2        64.89  84.13  37.85  54.57
MLB       2        65.08  84.14  38.21  54.87
MLB       4        65.01  84.09  37.66  54.88
MLB-VG    2        65.76  83.64  37.57  56.86
MLB-VG    2        65.84  83.87  37.87  56.76
MLB-VG    3        66.05  83.88  38.13  57.13
MLB-VG    4        66.09  83.59  38.32  57.42
Ensemble  -        66.77  84.54  39.21  57.81
Chapter 5
Bilinear Attention Networks
5.1 Introduction
Machine learning for computer vision and natural language processing accelerates
the advancement of artificial intelligence. Since vision and natural language
are the major modalities of human interaction, understanding and reasoning of
vision and natural language information become a key challenge. For instance,
visual question answering involves a vision-language cross-grounding problem. A
machine is expected to answer given questions like "who is wearing glasses?", "is
the umbrella upside down?", or "how many children are in the bed?" exploiting
visually-grounded information.
For this reason, visual attention based models have succeeded in multimodal
learning tasks, identifying selective regions in a spatial map of an image defined
by the model. Also, textual attention can be considered along with visual
attention. The attention mechanism of co-attention networks [Lu et al., 2016;
Nam et al., 2016; Xu and Saenko, 2016; Yu et al., 2018] concurrently infers
visual and textual attention distributions for each modality. The co-attention
networks selectively attend to question words in addition to a part of image
regions. However, the co-attention neglects the interaction between words and
visual regions to avoid increasing computational complexity.
In this paper, we extend the idea of co-attention into bilinear attention
which considers every pair of multimodal channels, e.g., the pairs of question
words and image regions. If the given question involves multiple visual concepts
represented by multiple words, the inference using visual attention distributions
for each word can exploit relevant information better than that using single
compressed attention distribution.
From this background, we propose bilinear attention networks (BAN) to use
a bilinear attention distribution, on top of low-rank bilinear pooling [Kim et al.,
2017b]. Notice that the BAN exploits bilinear interactions between two groups of
input channels, while low-rank bilinear pooling extracts the joint representations
for each pair of channels. Furthermore, we propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps of the BAN, unlike the previous works [Fukui et al., 2016; Kim et al., 2017b] where multiple attention maps are used by concatenating the attended features. The proposed residual learning method for BAN exploits residual summations instead of concatenation, which allows learning up to an eight-glimpse BAN in a parameter-efficient and performance-effective way. For an overview of the two-glimpse BAN, please refer to Figure 5.1.
Our main contributions are:
• We propose the bilinear attention networks (BAN) to learn and use bilinear
attention distributions, on top of low-rank bilinear pooling technique.
• We propose a variant of multimodal residual networks (MRN) to efficiently utilize the multiple bilinear attention maps generated by our model. Unlike previous works, our method successfully utilizes up to 8 attention maps.

• Finally, we validate our proposed method on a large and highly-competitive dataset, VQA 2.0 [Goyal et al., 2016]. Our model achieves a new state-of-the-art while maintaining the simplicity of the model structure. Moreover, we evaluate the visual grounding of the bilinear attention maps on Flickr30k Entities [Plummer et al., 2017], outperforming previous methods, along with a 25.37% improvement of inference speed taking advantage of the processing of multi-channel inputs.

Figure 5.1 Overview of a two-layer BAN. Two multi-channel inputs, ϕ object detection features and ρ-length GRU hidden vectors, are used to get bilinear attention maps and joint representations to be used by a classifier. For the definition of the BAN, see the text in Section 5.3.
5.2 Low-rank Bilinear Pooling
We first review the low-rank bilinear pooling and its application to attention
networks [Kim et al., 2017b], which uses single-channel input (question vector)
to combine the other multi-channel input (image features) as single-channel
intermediate representation (attended feature).
Low-rank bilinear model. The previous works [Pirsiavash et al., 2009;
Wolf et al., 2007] proposed a low-rank bilinear model to reduce the rank of the bilinear weight matrix W_i to give regularity. For this, W_i is replaced with the multiplication of two smaller matrices, U_i V_i^T, where U_i ∈ R^{N×d} and V_i ∈ R^{M×d}. As a result, this replacement restricts the rank of W_i to be at most d ≤ min(N, M).
For the scalar output fi (bias terms are omitted without loss of generality):
f_i = x^T W_i y ≈ x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)  (5.1)
where 1 ∈ R^d is a vector of ones and ∘ denotes Hadamard product (element-wise multiplication).
Low-rank bilinear pooling. For a vector output f , a pooling matrix P is
introduced:
f = P^T (U^T x ∘ V^T y)  (5.2)
where P ∈ Rd×c, U ∈ RN×d, and V ∈ RM×d. It allows U and V to be two-
dimensional tensors by introducing P for a vector output f ∈ Rc, significantly
reducing the number of parameters.
Unitary attention networks. Attention provides an efficient mechanism
to reduce input channels by selectively utilizing given information. Assuming a multi-channel input Y consisting of ϕ = |{y_i}| column vectors, we want to get a single channel y from Y using the weights α_i:

y = Σ_i α_i y_i  (5.3)
where α represents an attention distribution to selectively combine ϕ input
channels. Using the low-rank bilinear pooling, the α is defined by the output of
softmax function as:

α := softmax(P^T ((U^T x · 1^T) ∘ (V^T Y)))  (5.4)
where α ∈ RG×ϕ, P ∈ Rd×G, U ∈ RN×d, x ∈ RN , 1 ∈ Rϕ, V ∈ RM×d, and
Y ∈ RM×ϕ. If G > 1, multiple glimpses (a.k.a. attention heads) are used [Fukui
et al., 2016; Jaderberg et al., 2015; Kim et al., 2017b], then y =fGg=1
∑i αg,iyi,
the concatenation of attended outputs. Finally, two single channel inputs x and
y can be used to get the joint representation using the other low-rank bilinear
pooling for a classifier.
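For reference, a minimal sketch of Equations 5.3 and 5.4 (assumed sizes; the loop over glimpses is written for clarity rather than speed):

    import torch
    import torch.nn.functional as F

    N, M, d, G, phi = 1024, 2048, 512, 2, 36          # assumed sizes
    U, V, P = torch.randn(N, d), torch.randn(M, d), torch.randn(d, G)

    x = torch.randn(N)         # single-channel question vector
    Y = torch.randn(M, phi)    # phi image-feature columns

    # logits of Eq. 5.4: P^T ((U^T x · 1^T) ∘ (V^T Y)), shape (G, phi)
    logits = P.t() @ ((U.t() @ x).unsqueeze(1) * (V.t() @ Y))
    alpha = F.softmax(logits, dim=1)                      # attention per glimpse
    y_hat = torch.cat([Y @ alpha[g] for g in range(G)])   # Eq. 5.3, concatenated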
5.3 Bilinear Attention Networks
We generalize the bilinear model for two multi-channel inputs, X ∈ R^{N×ρ} and Y ∈ R^{M×ϕ}, where ρ = |{x_i}| and ϕ = |{y_j}| are the numbers of the two input channels, respectively. To reduce both input channels simultaneously, we introduce a bilinear attention map A ∈ R^{ρ×ϕ} as follows:

    f'_k = (X^T U')_k^T A (Y^T V')_k        (5.5)

where U' ∈ R^{N×K}, V' ∈ R^{M×K}, (X^T U')_k ∈ R^ρ, (Y^T V')_k ∈ R^ϕ, and f'_k denotes the k-th element of the intermediate representation. The subscript k for the matrices indicates the index of a column. Notice that Equation 5.5 is a bilinear model for the two groups of input channels where A in the middle is a bilinear weight matrix. Interestingly, Equation 5.5 can be rewritten as:
    f'_k = ∑_{i=1}^{ρ} ∑_{j=1}^{ϕ} A_{i,j} (X_i^T U'_k)(V'^T_k Y_j) = ∑_{i=1}^{ρ} ∑_{j=1}^{ϕ} A_{i,j} X_i^T (U'_k V'^T_k) Y_j        (5.6)

where X_i and Y_j denote the i-th channel (column) of input X and the j-th channel (column) of input Y, respectively, U'_k and V'_k denote the k-th columns of the U' and V' matrices, respectively, and A_{i,j} denotes the element in the i-th row and the j-th column of A. Notice that, for each pair of channels, the rank-1 bilinear representation of two feature vectors is modeled in X_i^T (U'_k V'^T_k) Y_j of Equation 5.6 (eventually, at most rank-K bilinear pooling for f' ∈ R^K). Then, the bilinear joint representation is f = P^T f', where f ∈ R^C and P ∈ R^{K×C}. For convenience, we define the bilinear attention networks as a function of two multi-channel inputs parameterized by a bilinear attention map as follows:

    f = BAN(X, Y; A).        (5.7)
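Equation 5.5 computes one bilinear form per k; all K elements can be obtained at once with a single einsum, as in the following sketch (sizes and the random map A are assumptions for illustration):

    import torch

    N, M, K, rho, phi = 1024, 2048, 1024, 14, 36    # assumed sizes
    U_p, V_p = torch.randn(N, K), torch.randn(M, K)
    X, Y = torch.randn(N, rho), torch.randn(M, phi)
    A = torch.rand(rho, phi)                        # a given attention map

    Xp = X.t() @ U_p    # (rho, K); the k-th column over rows is (X^T U')_k
    Yp = Y.t() @ V_p    # (phi, K)
    # f'_k = (X^T U')_k^T A (Y^T V')_k for every k at once (Eq. 5.5)
    f_prime = torch.einsum('ik,ij,jk->k', Xp, A, Yp)    # shape (K,)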
Bilinear attention map. Now, we want to get the attention map similarly to Equation 5.4. Using Hadamard product and matrix-matrix multiplication, the attention map A is defined as:

    A := softmax(((1 · p^T) ∘ X^T U) V^T Y)        (5.8)

where 1 ∈ R^ρ, p ∈ R^{K'}, and remind that A ∈ R^{ρ×ϕ}. The softmax function is applied element-wise. Notice that each logit A_{i,j} of the softmax is the output of low-rank bilinear pooling as:

    A_{i,j} = p^T ((U^T X_i) ∘ (V^T Y_j)).        (5.9)
The multiple bilinear attention maps can be extended as follows:

    A_g := softmax(((1 · p_g^T) ∘ X^T U) V^T Y)        (5.10)

where the parameters of U and V are shared, but not p_g, where g denotes the index of glimpses.
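A sketch of Equation 5.10 follows. Here we normalize over all ρ×ϕ logits of each map, which is one consistent reading of the element-wise softmax above (an assumption of this sketch, not a statement about the released implementation):

    import torch
    import torch.nn.functional as F

    N, M, Kp, rho, phi, G = 1024, 2048, 3072, 14, 36, 8   # assumed sizes
    U, V = torch.randn(N, Kp), torch.randn(M, Kp)         # shared across glimpses
    p = torch.randn(G, Kp)                                # one p_g per glimpse
    X, Y = torch.randn(N, rho), torch.randn(M, phi)

    XU = X.t() @ U                        # (rho, Kp)
    VY = V.t() @ Y                        # (Kp, phi)
    # logits of Eq. 5.10: ((1 · p_g^T) ∘ X^T U) V^T Y for each glimpse g
    logits = torch.stack([(p[g] * XU) @ VY for g in range(G)])   # (G, rho, phi)
    A = F.softmax(logits.view(G, -1), dim=1).view(G, rho, phi)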
Residual learning of attention. Inspired by multimodal residual networks (MRN) from Kim et al. [2016b], we propose a variant of MRN to integrate the joint representations from the multiple bilinear attention maps. The (i+1)-th output is defined as:

    f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + f_i        (5.11)

where f_0 = X (if N = K) and 1 ∈ R^ρ. Here, the size of f_i is the same as the size of X, as successive attention maps are processed. To get the logits for a classifier, e.g., a two-layer MLP, we sum over the channel dimension of the last output f_G, where G is the number of glimpses.
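The residual stacking of Equation 5.11 can be sketched as the following loop (random parameters and given maps A_g stand in for learned ones; N = K is assumed so that f_0 = X):

    import torch

    def ban_layer(f, Y, A, U_p, V_p, P):
        # BAN(f, Y; A) of Eqs. 5.5-5.7, returning a C-dimensional vector
        joint = torch.einsum('ik,ij,jk->k', f.t() @ U_p, A, Y.t() @ V_p)
        return P.t() @ joint

    K, M, rho, phi, G = 1024, 2048, 14, 36, 4   # assumed sizes; C = K = N
    Y = torch.randn(M, phi)
    f = torch.randn(K, rho)                     # f_0 = X
    for g in range(G):
        U_p, V_p, P = torch.randn(K, K), torch.randn(M, K), torch.randn(K, K)
        A = torch.rand(rho, phi)                # A_g from Eq. 5.10, given here
        f = ban_layer(f, Y, A, U_p, V_p, P).unsqueeze(1) + f   # BAN · 1^T + f
    logits_input = f.sum(dim=1)                 # sum over channels for the MLP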
Time complexity. When we assume that the numbers of input channels are smaller than the feature sizes, M ≥ N ≥ K ≫ ϕ ≥ ρ, the time complexity of BAN is the same as the case of one multi-channel input, O(KMϕ), for a single-glimpse model, since BAN consists of matrix chain multiplication and exploits the property of low-rank factorization in low-rank bilinear pooling. In our experimental setting, the time spent per epoch by a one-glimpse BAN is approximately 284 seconds, versus 190 seconds for a unitary attention network. The increased time is largely due to the increased size of the softmax input for the attention distribution, from ϕ to ρ×ϕ. However, the validation score is significantly improved (+0.75%). The experimental results are shown in Table 5.1.
5.4 Related Works
Multimodal factorized bilinear pooling. Yu et al. [2018] extend low-rank bilinear pooling [Kim et al., 2017b] using rank > 1. They remove the projection matrix P; instead, d in Equation 5.2 is replaced with a much smaller k while U and V are three-dimensional tensors. For efficient computation, two matrices U_i ∈ R^{N×(k×d)} and V_i ∈ R^{M×(k×d)} are used, followed by sum pooling defined as f = SumPool(U^T x ∘ V^T y, k), where the function SumPool(x, k) denotes sum pooling over x using a one-dimensional window of size k and stride k (non-overlapping). However, this generalization was not effective for BAN, at least in our experimental setting. Please see BAN-1+MFB in Figure 5.2b, where the performance is not significantly improved over that of BAN-1. Furthermore, the peak GPU memory consumption is larger due to its model structure, which hinders the use of multiple-glimpse BAN.
Co-attention networks. Xu and Saenko [2016] proposed the spatial memory network model estimating the correlation among every image patch and token in a sentence. The estimated correlation C is defined as (UX)^T Y in our notation. Unlike our method, they get an attention distribution α = softmax(max_{i=1,...,ρ}(C_i)) ∈ R^ρ, where the logits of the softmax are the maximum values in each row vector of C. The attention distribution for the other input can be calculated similarly. There are variants of co-attention networks [Lu et al., 2016; Nam et al., 2016]; in particular, Lu et al. [2016] sequentially get two attention distributions conditioned on the other modality. Recently, Yu et al. [2018] reduced the co-attention method to two steps, self-attention for a question embedding and question-conditioned attention for a visual embedding. However, these co-attention approaches use separate attention distributions for each modality, neglecting the interaction between the modalities which we consider and model.
5.5 Experiments
5.5.1 Datasets
Visual Question Answering (VQA). We evaluate on the visual question answering (VQA 2.0) dataset [Agrawal et al., 2017; Goyal et al., 2016], which is improved from the previous version to emphasize visual understanding by reducing the answer bias in the dataset. This improvement pushes the model to have a more effective joint representation of question and image, which fits the motivation of our bilinear attention approach.

The 205k images of VQA 2.0 are from the MS COCO dataset [Lin et al., 2014]. The numbers of questions are 444k, 214k, and 448k for training, validation, and testing, respectively. As for the annotations, there are roughly five questions per image and exactly ten answers per question. The VQA evaluation metric considers inter-human variability, defined as:

    Accuracy(ans) = min(#humans that said ans / 3, 1)        (5.12)

The test set is split into test-dev, test-standard, test-challenge, and test-reserve. The annotations for the test sets are unavailable except through the remote evaluation server. In the off-challenge season, the evaluation is limited to test-dev and test-standard; the numbers of submissions per day are limited to ten and one, respectively, and the total number of submissions to test-standard is limited to five. For this reason, test-dev is used for debugging and validation, and test-standard is used to compare with the state of the art.
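A direct implementation of the metric in Equation 5.12 (the example answers are hypothetical):

    def vqa_accuracy(answer, human_answers):
        """Eq. 5.12: min(#humans that said ans / 3, 1)."""
        return min(sum(a == answer for a in human_answers) / 3.0, 1.0)

    # e.g., four of ten annotators said "brown"
    print(vqa_accuracy("brown", ["brown"] * 4 + ["tan"] * 6))   # 1.0
    print(vqa_accuracy("red", ["brown"] * 4 + ["tan"] * 6))     # 0.0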
Flickr30k Entities. For the evaluation of visual grounding by the bilinear attention maps, we use Flickr30k Entities [Plummer et al., 2017], consisting of 31,783 images [Young et al., 2014] and 244,035 annotations, in which multiple entities (phrases) in a sentence for an image are mapped to boxes on the image to indicate the correspondences between them. The task is to localize a corresponding box for each entity. In this way, visual grounding of textual information is quantitatively measured. Following the evaluation metric [Plummer et al., 2017], if a predicted box has an intersection over union (IoU) with one of the ground-truth boxes that is greater than or equal to 0.5, the prediction for a given entity is correct. This metric is called Recall@1. If K predictions are permitted to find at least one correct box, it is called Recall@K. We report Recall@1, 5, and 10 to compare with the state of the art (R@K in Table 5.3). The upper bound of performance depends on the performance of object detection, since the detector proposes the candidate boxes for prediction. We also report the upper bounds to compare the performances of various object detectors.
5.5.2 Preprocessing
To control factors other than the model structure, we follow the preprocessing procedure of Teney et al. [2017].
Question embedding. For VQA, we get a question embedding X^T ∈ R^{14×N} using GloVe word embeddings [Pennington et al., 2014] and the outputs of a Gated Recurrent Unit (GRU) [Cho et al., 2014] for every time step, up to the first 14 tokens, following the previous work of Teney et al. [2017]. Questions shorter than 14 words are end-padded with zero vectors. This process discards only 0.25% of questions and significantly increases learning efficiency in mini-batch learning. A word is represented by a 300-dimensional embedding vector, which is learned during training but initialized with the pre-trained GloVe word embeddings. The resulting 14 × 300 embeddings are fed into the GRU, whose hidden state size is N, one of the hyperparameters. We use all of the output states X^T ∈ R^{14×N} as a 14-channel question input. Notice that although the length of questions is variable, BAN successfully learns through the bilinear attention mechanism. For Flickr30k Entities, we use the full length of sentences, up to 82 tokens, to consider every entity in a sentence. We mark the token positions which are at the end of each annotated phrase. Then, we select a subset of the output channels of the GRU using these positions, which makes the number of channels equal to the number of entities in a sentence.
Image features. We use the image features extracted from bottom-up attention [Anderson et al., 2017]. These features are the output of Faster R-CNN [Ren et al., 2017], pre-trained using Visual Genome [Krishna et al., 2016]. We set a threshold for object detection to get ϕ = 10 to 100 objects per image. The features are represented as Y^T ∈ R^{ϕ×2,048}, which are fixed during training. To deal with variable-channel inputs, we mask the padding logits with minus infinity to get zero probability from the softmax while avoiding underflow.
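The masking can be done as in this short sketch (the ϕ values are assumptions for illustration):

    import torch
    import torch.nn.functional as F

    phi_max, phi = 100, 37                  # assumed: 37 detected, padded to 100
    logits = torch.randn(phi_max)           # attention logits incl. padding
    logits[phi:] = float('-inf')            # mask the padded channels
    probs = F.softmax(logits, dim=0)        # padded objects get zero probability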
5.5.3 Nonlinearity
We use ReLU [Nair and Hinton, 2010] to give nonlinearity to BAN as follows:
    f'_k = σ(X^T U')_k^T · A · σ(Y^T V')_k        (5.13)

where σ denotes ReLU(x) := max(x, 0). For the attention maps,

    A_g := softmax(((1 · p_g^T) ∘ σ(X^T U)) · σ(V^T Y)).        (5.14)

Note that X^T U and V^T Y correspond to the question embedding matrix Q and the detected object features V in Figure 5.1, respectively.
5.6 Variants of BAN
5.6.1 Enhancing Glove Word Embedding
We augment a computed 300-dimensional word embedding to each 300-dimensional GloVe word embedding. The computation is as follows: 1) we choose two arbitrary words w_i and w_j found in each question of the VQA and Visual Genome datasets or in each caption of the MS COCO dataset; 2) we increase the value of A_{i,j} by one, where A ∈ R^{V'×V'} is an association matrix initialized with zeros. Notice that i and j can be indices out of the vocabulary V, and the size of the vocabulary in this computation is denoted by V'; 3) to penalize highly frequent words, each row of A is divided by the number of sentences (questions or captions) which contain the corresponding word; 4) each row is normalized by the sum of all of its elements; 5) we calculate W' = A · W, where W ∈ R^{V'×E} is a GloVe word embedding matrix and E is the size of the word embedding, i.e., 300. Therefore, W' ∈ R^{V'×E} stands for the mixed word embeddings of semantically close words; 6) finally, we select the V rows from W' corresponding to the vocabulary in our model and concatenate these rows to the previous word embeddings, which makes 600-dimensional word embeddings in total. The input size of the GRU is increased to 600 to match these word embeddings. These word embeddings are fine-tuned.
As a result, this variant significantly improves the performance to 66.03 (±0.12), compared with the performance of 65.72 (±0.11) obtained by augmenting the same 300-dimensional GloVe word embeddings (so the number of parameters is controlled). In this experiment, we use a four-glimpse BAN and evaluate on the validation split. The standard deviation is calculated over three randomly initialized models and the means are reported. The result on the test-dev split can be found in Table 5.5 as BAN+Glove.
5.6.2 Integrating Counting Module
The counting module [Zhang et al., 2018] is proposed to improve the performance on counting-related tasks. This module is a neural network component that gets a dense representation from the spatial information of detected objects, i.e., the left-top and right-bottom positions of the ϕ proposed objects (rectangles) denoted by S ∈ R^{4×ϕ}. The interface of the counting module is defined as:

    c = Counter(s, α)        (5.15)

where c ∈ R^{ϕ+1}, and α ∈ R^ϕ is the logits of the corresponding objects for the sigmoid function inside the counting module. We found that the α defined by max_{j=1,...,ϕ}(A_{·,j}), i.e., the maximum values in each column vector of A in Equation 5.9, was better than that of summation. Since the counting module does not support variable-object inputs, we select the top-10 objects for the input, instead of ϕ objects, based on the values of α.

The BAN integrated with the counting module is defined as:

    f_{i+1} = (BAN_i(f_i, Y; A_i) + g_i(c_i)) · 1^T + f_i        (5.16)

where the function g_i(·) is the i-th linear embedding followed by a ReLU activation function, and c_i = Counter(s, max_{j=1,...,ϕ}(A^{(i)}_{·,j})), where A^{(i)} is the logit of A_i. Note that a dropout layer before this linear embedding severely hurts performance, so we did not use it.
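One step of Equation 5.16 might be sketched as below, treating the counting module of Zhang et al. [2018] as a given callable `counter` (its exact interface here is an assumption, as are `ban_layer` and `g_linear`):

    import torch

    def ban_counter_step(f, Y, A_logits, A, S, ban_layer, counter, g_linear):
        # alpha_j = max over each column of the attention logits A^{(i)}
        alpha = A_logits.max(dim=0).values     # shape (phi,)
        top = alpha.topk(10).indices           # fixed 10 objects for the counter
        c = counter(S[:, top], alpha[top])     # c in R^{10+1}
        joint = ban_layer(f, Y, A) + torch.relu(g_linear(c))
        return joint.unsqueeze(1) + f          # (BAN_i + g_i(c_i)) · 1^T + f_i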
As a result, this variant significantly improves the counting performance from 54.92 (±0.30) to 58.21 (±0.49), while the overall performance is improved from 65.81 (±0.09) to 66.01 (±0.14) in a controlled experiment using a vanilla four-glimpse BAN. The definition of the subset of counting questions comes from the previous work [Trott et al., 2018]. The result on the test-dev split can be found in Table 5.5 as BAN+Glove+Counter; note that this model also uses the previous word embedding variant.
5.6.3 Integrating Multimodal Factorized Bilinear (MFB) Pooling
Yu et al. [2018] extend low-rank bilinear pooling [Kim et al., 2017b] with rank k > 1 and two factorized three-dimensional matrices, which is called MFB. The implementation of MFB is effectively equivalent to low-rank bilinear pooling with rank d' = d × k followed by sum pooling with window size k and stride k, defined by SumPool(U^T x ∘ V^T y, k). Notice that the pooling matrix P in Equation 5.2 is not used. The variant of BAN inspired by MFB is defined as:

    z_{k'} = σ(X^T U)_{k'}^T · A · σ(Y^T V)_{k'}        (5.17)
    f' = SumPool(z, k)        (5.18)

where U ∈ R^{N×K'}, V ∈ R^{M×K'}, σ denotes the ReLU activation function, and k = 5 following Yu et al. [2018]. Notice that K' = K × k and k' is the index of the elements of z ∈ R^{K'} in our notation.
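SumPool with window and stride k reduces to a reshape and a sum, as in this sketch:

    import torch

    def sum_pool(z, k):
        """Non-overlapping sum pooling over z with window size k and stride k."""
        return z.view(-1, k).sum(dim=1)

    z = torch.randn(20)          # e.g., K' = K x k with K = 4, k = 5
    f_prime = sum_pool(z, 5)     # shape (4,)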
However, this generalization was not effective for BAN. In Figure 5.2b, the performance of BAN-1+MFB is not significantly different from that of BAN-1. Furthermore, the larger K' increases the peak consumption of GPU memory, which hinders the use of multiple glimpses for BAN.
5.6.4 Classifier
For VQA, we use a two-layer multi-layer perceptron as a classifier for the final joint representation f_G. The activation function is ReLU. The number of outputs is determined by the answers that occur at least nine times in unique questions in the dataset, which gives 3,129 outputs. To directly reflect the VQA evaluation metric, the label is encoded as a vector in R^{3,129} whose elements correspond to the scores. Then, binary cross entropy is used for the loss function, which means the network output logits are fed into a sigmoid function instead of a softmax function. This loss function is also used in Teney et al. [2017].
For Flickr30k Entities, we simply take the output of the bilinear attention map, and binary cross entropy is used for this output. A prediction is scored with the Recall@K metric described in Section 5.5.1, based on the IoU between the boxes extracted by the pre-trained Faster R-CNN and the ground-truth boxes; accordingly, the upper bound of performance depends on the object detector, and we report these upper bounds as well.
5.6.5 Hyperparameters and Regularization
Hyperparameters. The sizes of the image features and question embeddings are M = 2,048 and N = 1,024, respectively. The size of the joint representation, C, is the same as the rank K in low-rank bilinear pooling, C = K = 1,024, but K' = K × 3 is used in the bilinear attention maps to increase the representational capacity for residual learning of attention. Every linear mapping is regularized by Weight Normalization [Salimans and Kingma, 2016] and Dropout [Srivastava et al., 2014] (p = .2, except for the classifier with .5). The Adamax optimizer [Kingma and Ba, 2015], a variant of Adam based on the infinity norm, is used. The learning rate is min(i·10^{-3}, 4·10^{-3}), where i is the number of epochs starting from 1; after 10 epochs, the learning rate is decayed by 1/4 every 2 epochs up to 13 epochs (i.e., 10^{-3} for the 11th epoch and 2.5·10^{-4} for the 13th). We clip the 2-norm of the vectorized gradients to .25. The batch size is 512.
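The learning-rate schedule just described can be written as a small function (a sketch that reproduces the stated values):

    def learning_rate(epoch):
        """Epochs counted from 1: warm up to 4e-3, then decay by 1/4 every
        2 epochs after epoch 10 (1e-3 at epoch 11, 2.5e-4 at epoch 13)."""
        if epoch <= 10:
            return min(epoch * 1e-3, 4e-3)
        return 4e-3 * 0.25 ** ((epoch - 9) // 2)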
Regularization. For the test split of VQA, both the train and validation splits are used for training. We augment with a subset of the Visual Genome [Krishna et al., 2016] dataset following the procedure of the previous works [Teney et al., 2017]. We filter out the samples of the Visual Genome dataset if an image or an answer is not found in the target split. As a result, 492k (67.64%) of the questions in the Visual Genome dataset are used, which is around the size of the training split of VQA 2.0. Accordingly, we adjust the model capacity by increasing all of N, C, and K to 1,280, and G = 8 glimpses are used. For Flickr30k Entities, we use exactly the same test split as the previous methods [Plummer et al., 2017], without additional hyperparameter tuning from the VQA experiments.
5.7 VQA Results and Discussions
5.7.1 Quantitative Results
Comparison with the state of the art. The first row in Table 5.1 shows the 2017 VQA Challenge winner architecture [Anderson et al., 2017; Teney et al., 2017]. BAN significantly outperforms this baseline and successfully utilizes up to eight bilinear attention maps to improve its performance, taking advantage of residual learning of attention. As shown in Table 5.5, BAN outperforms the latest model [Yu et al., 2018], which uses the same bottom-up attention features [Anderson et al., 2017], by a substantial margin. BAN-Glove uses the concatenation of 300-dimensional GloVe word embeddings and the semantically-close mixture of these embeddings (see Section 5.6.1). Notice that similar approaches can be found in the competitive models [Fukui et al., 2016; Yu et al., 2018] in Table 5.5, with a different initialization strategy for the same 600-dimensional word embedding. BAN-Glove-Counter uses both the previous 600-dimensional word embeddings and the counting module [Zhang et al., 2018], which exploits the spatial information of detected object boxes from the feature extractor [Anderson et al., 2017]. The learned representation c ∈ R^{ϕ+1} for the counting mechanism is linearly projected and added to the joint representation after applying ReLU (see Equation 5.16 in Section 5.6.2). In Table 5.6, we compare with the entries in the leaderboards of both VQA Challenge 2017 and 2018, achieving 1st place at the time of submission (our entry is not shown in the leaderboard since challenge entries are not visible).
Comparison with other attention methods. Unitary attention has a similar architecture to Kim et al. [2017b], where a question embedding vector is used to calculate the attentional weights for the multiple image features of an image. Co-attention has the same mechanism as Yu et al. [2018], similar to Lu et al. [2016] and Xu and Saenko [2016], where multiple question embeddings are combined into a single embedding vector using a self-attention mechanism, and then unitary visual attention is applied. Table 5.2 confirms that bilinear attention is significantly better than any other attention method. Co-attention is slightly better than simple unitary attention. In Figure 5.2a, co-attention suffers from overfitting more severely (green) than the other methods, while bilinear attention (blue) is more regularized compared with the others. In Figure 5.2b, BAN is the most parameter-efficient among the various attention methods. Notice that the four-glimpse BAN utilizes its parameters more parsimoniously than the one-glimpse BAN does.
5.7.2 Residual Learning of Attention
Comparison with other approaches. In the second section of Table 5.2, the residual learning of attention significantly outperforms the other methods: sum, i.e., f_G = ∑_i BAN_i(X, Y; A_i), and concatenation (concat), i.e., f_G = ∥_i BAN_i(X, Y; A_i), whereas the difference between sum and concat is not significant. Notice that the number of parameters of concat is larger than the others, since the input size of the classifier is increased.

Table 5.1 Validation scores on the VQA 2.0 dataset for the number of glimpses of BAN. The standard deviations are reported after ±, using three random initializations.

    Model                            VQA Score
    Bottom-Up [Teney et al., 2017]   63.37 ±0.21
    BAN-1                            65.36 ±0.14
    BAN-2                            65.61 ±0.10
    BAN-4                            65.81 ±0.09
    BAN-8                            66.00 ±0.11
    BAN-12                           66.04 ±0.08
Ablation study. An interesting property of residual learning is robustness toward arbitrary ablations [Veit et al., 2016]. To see the relative contributions, we observe the validation scores when incremental ablation is performed. First, we train 1, 2, 4, 8, and 12-glimpse models using the training split. Then, we evaluate each model on the validation split using only the first N attention maps. Hence, the intermediate representation f_N is directly fed into the classifier instead of f_G. As shown in Figure 5.2c, the accuracy gain of the first glimpse is the highest, and the gain smoothly decreases as the number of used glimpses increases.
Entropy of attention. We analyze the information entropy of the attention distributions in a four-glimpse BAN. As shown in Figure 5.2d, the mean entropy of each attention map on the validation split converges to a different level. This result is repeatably observed for the other numbers of glimpses. Our speculation is that the multiple attention maps do not contribute equally, as in voting by committees, but rather perform residual learning via multi-step attention. We argue that this is a novel observation where residual learning [He et al., 2016a] is used for stacked attention networks.

Table 5.2 Validation scores on the VQA 2.0 dataset for attention and integration mechanisms. nParams indicates the number of parameters. Note that the hidden sizes of unitary attention and co-attention are 1,280, while BAN uses 1,024.

    Model                 nParams   VQA Score
    Unitary attention     31.9M     64.59 ±0.04
    Co-attention          32.5M     64.79 ±0.06
    Bilinear attention    32.2M     65.36 ±0.14
    BAN-4 (residual)      44.8M     65.81 ±0.09
    BAN-4 (sum)           44.8M     64.78 ±0.08
    BAN-4 (concat)        51.1M     64.71 ±0.21
5.7.3 Qualitative Analysis
The visualization for a two-glimpse BAN is shown in Figure 5.3. The question is "what color are the pants of the guy skateboarding". The content words what, pants, guy, and skateboarding in the question, and the skateboarder's pants in the image, are attended. Notice that box 2 (orange) captures the sitting man's pants at the bottom.
5.8 Flickr30k Entities Results and Discussions
To examine the capability of the bilinear attention map to capture vision-language interactions, we conduct experiments on Flickr30k Entities [Plummer et al., 2017]. Our experiments show that BAN outperforms the previous state of the art on the phrase localization task by a large margin of 4.48%, at a high inference speed.
Figure 5.2 (a) Learning curves. Bilinear attention (bi-att) is more robust to overfitting than unitary attention (uni-att) and co-attention (co-att). (b) Validation scores for the number of parameters. The error bar indicates the standard deviation among three randomly initialized models, although it is too small to be noticed for over-15M parameters. (c) Ablation study for the first N glimpses (x-axis) used in evaluation. (d) The information entropy (y-axis) for each attention map in a four-glimpse BAN. The entropy of the multiple attention maps converges to certain levels.
Performance. In Table 5.3, we compare with previous approaches. Our bilinear attention map, used to predict the boxes for the phrase entities in a sentence, achieves a new state of the art with 69.69% Recall@1. This result is remarkable considering that BAN does not use any additional features like box size, color, segmentation, or pose estimation [Plummer et al., 2017; Yeh et al., 2017]. Note that both Query-Adaptive RCNN [Hinami and Satoh, 2017] and our off-the-shelf object detector [Anderson et al., 2017] are based on Faster RCNN [Ren et al., 2017] and pre-trained on Visual Genome [Krishna et al., 2016]. Compared to Query-Adaptive RCNN, the parameters of our object detector are fixed and only used to extract 10-100 visual features and the corresponding box proposals.
Type. In Table 5.4, we report the results for each type of Flickr30k Entities. Notice that clothing and body parts are significantly improved, to 74.95% and 47.23%, respectively.
Speed. Faster inference is achieved by taking advantage of the multi-channel inputs in BAN. Unlike previous methods, BAN is able to infer multiple entities in a sentence, which can be prepared as a multi-channel input. Therefore, the number of forward passes needed for inference is significantly decreased. In our experiment, BAN takes 0.67 ms/entity, whereas the setting that treats a single entity as one example takes 0.84 ms/entity, a 25.37% improvement. We emphasize that this property is novel in our model, which considers every interaction among vision-language multi-channel inputs.
Visualization. Figure 5.4 shows three examples from the test split of Flickr30k Entities. The entities which have distinctive visual properties, for instance, a yellow tennis suit and white tennis shoes in Figure 5.4a, and a denim shirt in Figure 5.4b, are correctly localized. However, a relatively small object (e.g., a cigarette in Figure 5.4b) and an entity that requires semantic inference (e.g., a male conductor in Figure 5.4c) are incorrect.
5.9 Conclusions
In this chapter, BAN gracefully extends unitary attention networks by exploiting low-rank bilinear pooling inside bilinear attention. Although this network considers every pair of multimodal input channels, the computational cost remains in the same magnitude, thanks to matrix chain multiplication. The proposed residual learning of attention efficiently uses up to eight bilinear attention maps, keeping the size of the intermediate features constant. We believe BAN gives a new opportunity to learn richer joint representations when all multimodal inputs consist of multiple channels. With a simple ensemble of BANs, we were runners-up in the 2018 VQA Challenge, while BAN was the winner among the single-model entries.
Figure 5.3 Visualization of the bilinear attention maps for a two-glimpse BAN. The left and right groups indicate the first and second bilinear attention maps (right in each group, log-scaled) and the visualized image (left in each group). The most salient six boxes (numbered 1-6 in the images and on the x-axis of the grids) in the first attention map, determined by marginalization, are visualized on both images to compare. The model gives the correct answer, brown.
(a) A girl in a yellow tennis suit, green visor and white tennis shoes holding a tennis racket in a position where she is going to hit the tennis ball.
(b) A man in a denim shirt and pants is smoking a cigarette while playing a cello for money.
(c) A male conductor wearing all black leading an orchestra and choir on a brown stage playing and singing a musical number.
Figure 5.4 Visualization examples from the test split of Flickr30k Entities. Solid-lined boxes indicate predicted phrase localizations and dashed-lined boxes indicate the ground truth. If there are multiple ground-truth boxes, the closest box is shown for investigation. Each color of a phrase is matched with the corresponding color of the predicted and ground-truth boxes. Best viewed in color.
Table 5.3 Test split results for Flickr30k Entities. We report the average performance of our three randomly-initialized models (the standard deviation of R@1 is 0.17). The upper bound of performance asserted by the object detector is shown. † box size and color information are used as additional features. ‡ semantic segmentation, object detection, and pose estimation are used as additional features. Notice that the detectors of Hinami and Satoh [2017] and ours [Anderson et al., 2017] are based on Faster RCNN [Ren et al., 2017], pre-trained using the Visual Genome dataset [Krishna et al., 2016].

    Model                     Detector                                        R@1    R@5    R@10   Upper Bound
    Zhang et al. [2016a]      MCG [Arbeláez et al., 2014]                     28.5   52.7   61.3   -
    Hu et al. [2016]          EdgeBoxes [Zitnick and Dollár, 2014]            27.8   -      62.9   76.9
    Rohrbach et al. [2016]    Fast RCNN [Girshick, 2015]                      42.43  -      -      77.90
    Wang et al. [2016b]       Fast RCNN [Girshick, 2015]                      42.08  -      -      76.91
    Wang et al. [2016a]       Fast RCNN [Girshick, 2015]                      43.89  64.46  68.66  76.91
    Rohrbach et al. [2016]    Fast RCNN [Girshick, 2015]                      48.38  -      -      77.90
    Fukui et al. [2016]       Fast RCNN [Girshick, 2015]                      48.69  -      -      -
    Plummer et al. [2017]     Fast RCNN [Girshick, 2015]†                     50.89  71.09  75.73  85.12
    Yeh et al. [2017]         YOLOv2 [Redmon and Farhadi, 2017]‡              53.97  -      -      -
    Hinami and Satoh [2017]   Query-Adaptive RCNN [Hinami and Satoh, 2017]    65.21  -      -      -
    BAN (ours)                Bottom-Up [Anderson et al., 2017]               69.69  84.22  86.35  87.45
Table 5.4 Recall@1 performance over types for Flickr30k Entities (%).

    Model                     People  Clothing  Body Parts  Animals  Vehicles  Instruments  Scene  Other
    Rohrbach et al. [2016]    60.24   39.16     14.34       64.48    67.50     38.27        59.17  30.56
    Plummer et al. [2017]     64.73   46.88     17.21       65.83    68.75     37.65        51.39  31.77
    Yeh et al. [2017]         68.71   46.83     19.50       70.07    73.75     39.50        60.38  32.45
    Hinami and Satoh [2017]   78.17   61.99     35.25       74.41    76.16     56.69        68.07  47.42
    BAN (ours)                79.90   74.95     47.23       81.85    76.92     43.00        68.69  51.33
    # of Instances            5,656   2,306     523         518      400       162          1,619  3,374
Table 5.5 Test-dev scores of single models on the VQA 2.0 dataset to compare with the state of the art. The first section of rows is trained on the training and validation splits. The rest of the rows are trained on the training and validation splits, plus Visual Genome for data augmentation. † This model can be found at https://github.com/yuzcccc/vqa-mfb, which is not published in the paper. They use the object-detection-based image features from Anderson et al. [2017], instead of 152-layer ResNet image features [He et al., 2016a].

    Model                                                    Overall  Yes/no  Number  Other  Test-std
    Bottom-Up [Anderson et al., 2017; Teney et al., 2017]    65.32    81.82   44.21   56.05  65.67
    MFH [Yu et al., 2018]                                    66.12    -       -       -      -
    Counter [Zhang et al., 2018]                             68.09    83.14   51.62   58.97  68.41
    MFH+Bottom-Up [Yu et al., 2018]†                         68.76    84.27   49.56   59.89  -
    BAN (ours)                                               69.52    85.31   50.93   60.26  -
    BAN+Glove (ours)                                         69.66    85.46   50.66   60.50  -
    BAN+Glove+Counter (ours)                                 70.04    85.42   54.04   60.52  70.35
Table 5.6 Test-standard scores of ensemble models on the VQA 2.0 dataset to compare with the state of the art. Excerpt from the VQA 2.0 Leaderboard at the time of writing. # denotes the number of models in their ensemble methods.

    Team Name                                                        #    Overall  Yes/no  Number  Other
    vqateam_mcb_benchmark [Fukui et al., 2016; Goyal et al., 2016]   1    62.27    78.82   38.28   53.36
    vqa_hack3r                                                       -    62.89    79.88   38.95   53.58
    VQA Machine [Wang et al., 2016c]                                 -    62.97    79.82   40.91   53.35
    NWPU_VQA                                                         -    63.00    80.38   40.32   53.07
    yahia zakaria                                                    -    63.57    79.77   40.53   54.75
    ReasonNet_                                                       -    64.61    78.86   41.98   57.39
    JuneflowerIvaNlpr                                                -    65.70    81.09   41.56   57.83
    UPMC-LIP6 [Ben-younes et al., 2017]                              -    65.71    82.07   41.06   57.12
    Athena                                                           -    66.67    82.88   43.17   57.95
    Adelaide-Teney                                                   -    66.73    83.71   43.77   57.20
    LV_NUS [Ilievski and Feng, 2017]                                 -    66.77    81.89   46.29   58.30
    vqahhi_drau                                                      -    66.85    83.35   44.37   57.63
    CFM-UESTC                                                        -    67.02    83.69   45.17   57.52
    VLC Southampton [Zhang et al., 2018]                             1    68.41    83.56   51.39   59.11
    Tohoku CV                                                        -    68.91    85.54   49.00   58.99
    VQA-E                                                            -    69.44    85.74   48.18   60.12
    Adelaide-Teney ACRV MSR [Teney et al., 2017]                     30   70.34    86.60   48.64   61.15
    DeepSearch                                                       -    70.40    86.21   48.82   61.58
    HDU-USYD-UNCC [Yu et al., 2018]                                  8    70.92    86.65   51.13   61.75
    BAN+Glove+Counter (ours)                                         1    70.35    85.82   53.71   60.69
    BAN Ensemble (ours)                                              8    71.72    87.02   54.41   62.37
    BAN Ensemble (ours)                                              15   71.84    87.22   54.37   62.45
Chapter 6
Conclusions
Vision and language processing is studied for artificial general intelligence, mainly focusing on how to efficiently learn the joint representation of multimodality. Multimodal residual learning provides an efficient way to combine multimodal inputs with the idea of residual learning, which comes from a computer vision study. The low-rank bilinear pooling interpretation comes from an approximation of bilinear feature learning, where element-wise multiplication is used as multimodal fusion in deep neural networks. This interpretation enables us to suggest multimodal low-rank bilinear attention networks, which became a foundational work for following models like Yu et al. [2018]. Bilinear attention networks gracefully extend unitary attention networks, exploiting low-rank bilinear pooling inside bilinear attention, where residual learning of attention provides an efficient way to learn and utilize multiple bilinear attention maps. Although, in this work, the learned joint representation is fed into a classifier to generate an answer for visual question answering tasks, we believe the joint representation can be used for other tasks, such as visual dialog, image indexing and retrieval with the aid of multimodal information, enhanced speech recognition with visual information, and so on. Moreover, since computer vision and natural language processing are still evolving areas of study, multimodal deep learning can exploit their advanced discoveries in the future.
Bibliography
Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C Lawrence
Zitnick, Devi Parikh, and Dhruv Batra. Vqa: Visual question answering.
International Journal of Computer Vision, 123(1):4–31, 2017.
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. Bottom-Up and Top-Down Attention for Image
Captioning and Visual Question Answering. arXiv preprint arXiv:1707.07998,
2017.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learn-
ing to Compose Neural Networks for Question Answering. arXiv preprint
arXiv:1601.01705, 2016.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In
IEEE International Conference on Computer Vision, 2015.
Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and
Jitendra Malik. Multiscale combinatorial grouping. In IEEE conference on
computer vision and pattern recognition, pages 328–335, 2014.
Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MU-
TAN: Multimodal Tucker Fusion for Visual Question Answering. In IEEE
International Conference on Computer Vision, pages 2612–2620, 2017.
Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python. O'Reilly Media, Inc., 2009.
C E Carr and M Konishi. A circuit for detection of interaural time differences in
the brain stem of the barn owl. The Journal of Neuroscience, 10(10):3227–46,
1990.
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items
in data streams. In International Colloquium on Automata, Languages, and
Programming, pages 693–703. Springer, 2002.
Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and
Yixin Chen. Compressing Neural Networks with the Hashing Trick. In 32nd
International Conference on Machine Learning, pages 2285–2294, 2015.
Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Ben-
gio. On the Properties of Neural Machine Translation: Encoder-Decoder
Approaches. arXiv preprint arXiv:1409.1259, 2014.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José
M. F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In IEEE
Conference on Computer Vision and Pattern Recognition, 2017a.
Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra.
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning.
arXiv preprint arXiv:1703.06585, 2017b.
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle,
and Aaron Courville. GuessWhat?! Visual object discovery through multi-
modal dialogue. arXiv preprint arXiv:1611.08481, 2016.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and
Marcus Rohrbach. Multimodal Compact Bilinear Pooling for Visual Question
Answering and Visual Grounding. In Conference on Empirical Methods in
Natural Language Processing, 2016.
Yarin Gal. A Theoretically Grounded Application of Dropout in Recurrent
Neural Networks. arXiv preprint arXiv:1512.05287, 2015.
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image
Question Answering. In Advances in neural information processing systems
28, pages 2296–2304, 2015.
Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact Bilinear
Pooling. In IEEE Conference on Computer Vision and Pattern Recognition,
2016.
Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual
Turing test for computer vision systems. Proceedings of the National Academy
of Sciences, 112(12):3618–3623, 2015.
Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer
Vision, pages 1440–1448, 2015.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.
Making the V in VQA Matter: Elevating the Role of Image Understanding in
Visual Question Answering. arXiv preprint arXiv:1612.00837, 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual
Learning for Image Recognition. In IEEE Conference on Computer Vision
and Pattern Recognition, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings
in Deep Residual Networks. arXiv preprint arXiv:1603.05027, 2016b.
K Hikosaka, E Iwai, H Saito, and K Tanaka. Polysensory properties of neurons
in the anterior bank of the caudal superior temporal sulcus of the macaque
monkey. Journal of neurophysiology, 60(5):1615–1637, 1988.
Ryota Hinami and Shin’ichi Satoh. Query-Adaptive R-CNN for Open-Vocabulary
Object Detection and Retrieval. arXiv preprint arXiv:1711.09509, 2017.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Rus-
lan R Salakhutdinov. Improving neural networks by preventing co-adaptation
of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural
computation, 9(8):1735–1780, 1997.
N. P. Holmes and C. Spence. Multisensory Integration: Space, Time and Superadditivity. Current Biology, 15(18):R762–764, 2005.
Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and
Trevor Darrell. Natural language object retrieval. In IEEE Computer Vision
and Pattern Recognition, pages 4555–4564, 2016.
Ilija Ilievski and Jiashi Feng. A Simple Loss Function for Improving the
Convergence and Accuracy of Visual Question Answering Models. 2017.
Ilija Ilievski, Shuicheng Yan, and Jiashi Feng. A Focused Dynamic Attention
Model for Visual Question Answering. arXiv preprint arXiv:1604.01485, 2016.
Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu.
Spatial Transformer Networks. In Advances in Neural Information Processing
Systems 28, pages 2008–2016, 2015.
Kushal Kafle and Christopher Kanan. Visual Question Answering: Datasets,
Algorithms, and Future Challenges. arXiv preprint arXiv:1610.01465, 2016a.
Kushal Kafle and Christopher Kanan. Answer-Type Prediction for Visual
Question Answering. IEEE Conference on Computer Vision and Pattern
Recognition, pages 4976–4984, 2016b.
Andrej Karpathy and Li Fei-Fei. Deep Visual-Semantic Alignments for Generat-
ing Image Descriptions. In 28th IEEE Conference on Computer Vision and
Pattern Recognition, 2015.
Jin-Hwa Kim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. TrimZero:
A Torch Recurrent Module for Efficient Natural Language Processing. In
Proceedings of KIIS Spring Conference, volume 26, pages 165–166, 2016a.
ISBN 2093-4025.
Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim,
Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal Residual Learning for
Visual QA. In Advances In Neural Information Processing Systems 29, pages
361–369, 2016b.
Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha,
and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling.
In 5th International Conference on Learning Representations, 2017a.
Jin-Hwa Kim, Kyoung Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha,
and Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling.
In The 5th International Conference on Learning Representations, 2017b.
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear Attention Net-
works. arXiv preprint arXiv:1805.07932, 2018.
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimiza-
tion. In International Conference on Learning Representations, 2015.
Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio
Torralba, Raquel Urtasun, and Sanja Fidler. Skip-Thought Vectors. In
Advances in Neural Information Processing Systems 28, pages 3294–3302,
2015.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma,
Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language
and vision using crowdsourced dense image annotations. arXiv preprint
arXiv:1602.07332, 2016.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma,
Michael S. Bernstein, and Fei-Fei Li. Visual Genome: Connecting Language
and Vision Using Crowdsourced Dense Image Annotations. International
Journal of Computer Vision, 123(1):32–73, 2017.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521
(7553):436–444, 2015. ISSN 0028-0836.
Nicholas Léonard, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. rnn: Recurrent Library for Torch. arXiv preprint arXiv:1511.07889, 2015.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common
objects in context. In European Conference on Computer Vision (ECCV),
pages 740–755, 2014.
Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN Models
for Fine-grained Visual Recognition. In IEEE International Conference on
Computer Vision, pages 1449–1457, 2015.
Jiasen Lu, Xiao Lin, Dhruv Batra, and Devi Parikh. Deeper LSTM and nor-
malized CNN Visual Question Answering model. https://github.com/
VT-vision-lab/VQA_LSTM_CNN, 2015.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical
Question-Image Co-Attention for Visual Question Answering. arXiv preprint
arXiv:1606.00061, 2016.
Mateusz Malinowski and Mario Fritz. A Multi-World Approach to Question
Answering about Real-World Scenes based on Uncertain Input. In Advances
in Neural Information Processing Systems 27, pages 1682–1690, 2014.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask Your Neurons: A
Neural-based Approach to Answering Questions about Images. arXiv preprint
arXiv:1505.01121, 2015.
Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask Your Neurons:
A Deep Learning Approach to Visual Question Answering. arXiv preprint
arXiv:1605.02697, 2016.
Iain Matthews, Timothy F. Cootes, J. Andrew Bangham, Stephen Cox, and
Richard Harvey. Extraction of visual features for lipreading. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 24(2):198–213, 2002. doi:
10.1109/34.982900.
Roland Memisevic and Geoffrey E Hinton. Unsupervised learning of image
transformations. In IEEE Conference on Computer Vision and Pattern
Recognition, 2007.
Roland Memisevic and Geoffrey E Hinton. Learning to represent spatial transfor-
mations with factored higher-order Boltzmann machines. Neural computation,
22(6):1473–1492, 2010.
Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao,
Georgios P. Spithourakis, and Lucy Vanderwende. Image-Grounded Conver-
sations: Multimodal Context for Natural Question and Response Generation.
arXiv preprint arXiv:1701.08251, 2017.
Vinod Nair and Geoffrey E Hinton. Rectified Linear Units Improve Restricted
Boltzmann Machines. Proceedings of the 27th International Conference on
Machine Learning, 2010.
Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual Attention Networks
for Multimodal Reasoning and Matching. In IEEE Conference on Computer
Vision and Pattern Recognition, 2016.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and
Andrew Y Ng. Multimodal Deep Learning. In 28th International Conference
on Machine Learning, pages 689–696, 2011.
Hyeonwoo Noh and Bohyung Han. Training Recurrent Answering Units with
Joint Loss Minimization for VQA. arXiv preprint arXiv:1606.03647, 2016.
Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image Question Answer-
ing using Convolutional Neural Network with Dynamic Parameter Prediction.
In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
E K Patterson, S Gurbuz, Z Tufekci, and J N Gowdy. CUAVE: A new audio-
visual database for multimodal human-computer interface research. In IEEE
International Conference on Acoustics, Speech, and Signal Processing, vol-
ume 2, pages 2017–2020, 2002. doi: 10.1109/ICASSP.2002.5745028.
Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global
Vectors for Word Representation. Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing, 2014.
Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit
feature maps. In 19th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 239–247. ACM, 2013.
Hamed Pirsiavash, Deva Ramanan, and Charless C. Fowlkes. Bilinear classifiers
for visual recognition. In Advances in Neural Information Processing Systems
22, pages 1482–1490, 2009.
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia
Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting Region-to-
Phrase Correspondences for Richer Image-to-Sentence Models. International
Journal of Computer Vision, 123:74–93, 2017.
Hang Qi, Tianfu Wu, Mun-Wai Lee, and Song-Chun Zhu. A Restricted Vi-
sual Turing Test for Deep Scene and Event Understanding. arXiv preprint
arXiv:1512.01715, 2015.
Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger. In IEEE
Computer Vision and Pattern Recognition, 2017.
Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring Models and Data for
Image Question Answering. In Advances in Neural Information Processing
Systems 28, pages 2935–2943, 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN:
Towards Real-Time Object Detection with Region Proposal Networks. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 39(6), 2017.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský,
and Phil Blunsom. Reasoning about Entailment with Neural Attention. In
International Conference on Learning Representations, pages 1–9, 2016.
Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt
Schiele. Grounding of textual phrases in images by reconstruction. In European
Conference on Computer Vision, pages 817–834, 2016.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning
representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
Tim Salimans and Diederik P. Kingma. Weight Normalization: A Simple
Reparameterization to Accelerate Training of Deep Neural Networks. arXiv
preprint arXiv:1602.07868, 2016.
Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks
for Large-Scale Image Recognition. In International Conference on Learning
Representations, 2015.
Nitish Srivastava and Ruslan R Salakhutdinov. Multimodal Learning with
Deep Boltzmann Machines. In F Pereira, C J C Burges, L Bottou, and K Q
Weinberger, editors, Advances in Neural Information Processing Systems 25,
pages 2222–2230, 2012.
Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and
Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded
dialogue systems. arXiv preprint arXiv:1703.05423, 2017.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-
To-End Memory Networks. In Advances in Neural Information Processing
Systems 28, pages 2440–2448, 2015.
Joshua B Tenenbaum and William T Freeman. Separating style and content
with bilinear models. Neural computation, 12(6):1247–1283, 2000.
Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips
and Tricks for Visual Question Answering: Learnings from the 2017 Challenge.
arXiv preprint arXiv:1708.02711, 2017.
Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient
by a running average of its recent magnitude. COURSERA: Neural Networks
for Machine Learning, 4, 2012.
Julia Trommershauser, Konrad Kording, and Michael S Landy. Sensory cue
integration. Oxford University Press, 2011.
Alexander Trott, Caiming Xiong, and Richard Socher. Interpretable Counting
for Visual Question Answering. In International Conference on Learning
Representations, 2018.
Alan Turing. Computing Machinery and Intelligence. Mind, 59:433–460, 1950.
Andreas Veit, Michael J Wilber, and Serge Belongie. Residual Networks are
Exponential Ensembles of Relatively Shallow Networks. In Advances in Neural
Information Processing Systems 29, pages 550–558, 2016.
Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning Deep Structure-Preserving
Image-Text Embeddings. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 5005–5013, 2016a.
Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, and Jia
Deng. Structured Matching for Phrase Localization. In European Conference
on Computer Vision, volume 9908, pages 696–711, 2016b.
Sida I. Wang, Percy Liang, and Christopher D. Manning. Learning Language
Games through Interaction. In 54th Annual Meeting of the Association for
Computational Linguistics, pages 2368–2378, 2016c.
Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh
Attenberg. Feature hashing for large scale multitask learning. In 26th Inter-
national Conference on Machine Learning, pages 1113–1120, 2009.
Lior Wolf, Hueihan Jhuang, and Tamir Hazan. Modeling appearances with low-
rank SVM. IEEE Conference on Computer Vision and Pattern Recognition,
2007.
Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton
van den Hengel. Visual Question Answering: A Survey of Methods and
Datasets. arXiv preprint arXiv:1607.05910, 2016a.
Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel.
Ask Me Anything: Free-form Visual Question Answering Based on Knowledge
from External Sources. In IEEE Conference on Computer Vision and Pattern
Recognition, 2016b.
Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhut-
dinov. On Multiplicative Integration with Recurrent Neural Networks. arXiv
preprint arXiv:1606.06630, 2016c.
Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic Memory Net-
works for Visual and Textual Question Answering. In 33rd International
Conference on Machine Learning, 2016.
Huijuan Xu and Kate Saenko. Ask, Attend and Answer: Exploring Question-
Guided Spatial Attention for Visual Question Answering. In European Con-
ference on Computer Vision, 2016.
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked
Attention Networks for Image Question Answering. In IEEE Conference on
Computer Vision and Pattern Recognition, 2016.
Raymond A Yeh, Jinjun Xiong, Wen-Mei W Hwu, Minh N Do, and Alexander G
Schwing. Interpretable and Globally Optimal Prediction for Textual Ground-
ing using Image Concepts. In Advances in Neural Information Processing
Systems 30, 2017.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image de-
scriptions to visual denotations: New similarity metrics for semantic inference
over event descriptions. Transactions of the Association for Computational
Linguistics, 2:67–78, 2014.
Licheng Yu, Eunbyung Park, Alexander C Berg, and Tamara L. Berg. Visual Madlibs: Fill in the blank Description Generation and Question Answering. In IEEE International Conference on Computer Vision, pages 2461–2469, 2015.
Yanchao Yu, Arash Eshghi, and Oliver Lemon. Training an adaptive dialogue
policy for interactive learning of visually grounded word meanings. In 17th
Annual Meeting of the Special Interest Group on Discourse and Dialogue, page
339, 2016.
Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond
Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual
Question Answering. IEEE Transactions on Neural Networks and Learning
Systems, 2018.
Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff.
Top-Down Neural Attention by Excitation Backprop. In European Conference
on Computer Vision, volume 9908, pages 543–559, 2016a.
Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.
Yin and Yang: Balancing and Answering Binary Visual Questions. In IEEE
Conference on Computer Vision and Pattern Recognition, pages 5014–5022,
2016b.
Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to Count
Objects in Natural Images for Visual Question Answering. In International
Conference on Learning Representations, 2018.
Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob
Fergus. Simple Baseline for Visual Question Answering. arXiv preprint
arXiv:1512.02167, 2015.
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7W: Grounded
Question Answering in Images. In IEEE Conference on Computer Vision and
Pattern Recognition, pages 4995–5004, 2016.
C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals
from edges. In European Conference on Computer Vision, pages 391–405,
2014.
초록 (Abstract in Korean)

Advances in computer vision and natural language processing have accelerated research on artificial general intelligence. Since vision and natural language are the most interactive modalities that humans use, understanding and reasoning grounded in both vision and language is a key challenge for artificial general intelligence. Visual question answering (VQA) is an instance of the Visual Turing Test, which builds on the seminal Turing test [Turing, 1950]. The VQA dataset [Agrawal et al., 2017] collected question-answer pairs for supervised learning using a large-scale image dataset. For example, for questions such as "Who is wearing glasses?", "Is the umbrella upside down?", or "How many children are in the bed?", a machine learns from the collected answers and must then produce an answer given only the image and the question.

This dissertation generalizes the visual question answering task as a multimodal learning problem, and examines the progress of multimodal learning from the perspective of multimodal deep learning, which learns hierarchical representations using various forms of multi-layer neural networks. Multimodal deep learning is introduced under three categories: multimodal fusion, cross modality, and shared representation learning. Based on the previous studies Kim et al. [2016b, 2017a, 2018], three major works are discussed: multimodal residual learning, multimodal low-rank bilinear pooling, and bilinear attention networks.

Multimodal residual learning finds a joint representation of the vision-language multimodality based on residual learning, in which a part of the neural network is forced to learn the residual error of the objective function represented by the preceding part of the network. Multimodal low-rank bilinear pooling, on the other hand, explains the mathematical meaning of element-wise multiplication as a joint function under the condition that each modality is appropriately linearly projected. Bilinear attention networks unify the two previous studies. Based on the interpretation of low-rank bilinear pooling, the unitary attention mechanism is successfully generalized to bilinear attention using matrix chain multiplication, so that the computational cost remains as efficient as that of unitary attention networks. Furthermore, residual learning of attention is proposed, which allows up to eight bilinear attention maps to be exploited during inference and prevents the overfitting that arises in multi-layer attention networks.

As a result, multimodal residual networks (MRN) placed 4th in the VQA Challenge 2016, and at the time of publication in November 2016, the proposed multimodal low-rank bilinear attention networks (MLB) achieved a new state of the art with fewer parameters. Bilinear attention networks (BAN) were the runners-up (shared 2nd place) in the VQA Challenge 2018, but showed the best performance among single models. This result was presented as an invited oral talk at a CVPR 2018 workshop (Salt Lake City, USA) on June 18, 2018.

Since computer vision and natural language processing are still evolving fields, the proposed multimodal deep learning methods have the potential to improve further along with advances in computer vision and natural language processing.

Keywords: multimodality, attention, visual question answering, deep learning, residual learning, low-rank approximation, bilinear
Student Number: 2015-30046