Attribution-NonCommercial-NoDerivs 2.0 Korea

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that you follow the conditions below:
- When reusing or distributing this work, you must clearly indicate the license terms that apply to it.
- Any of these conditions can be waived if you obtain permission from the copyright holder.
Your rights under copyright law are not affected by the above. This is an easy-to-understand summary of the license (Legal Code).

Disclaimer

Attribution: you must credit the original author.
NonCommercial: you may not use this work for commercial purposes.
NoDerivs: you may not alter, transform, or build upon this work.
Master's Thesis in Engineering

Deep Representation Learning for
Visually-Grounded Dialog
(시각 기반 대화를 위한 심층 표상 학습)

February 2020

The Graduate School
Seoul National University
Interdisciplinary Program in Cognitive Science

Gi-Cheon Kang
Deep Representation Learning for
Visually-Grounded Dialog
(시각 기반 대화를 위한 심층 표상 학습)

Advisor: Byoung-Tak Zhang

Submitted as a Master's thesis in Engineering

January 2020

The Graduate School
Seoul National University
Interdisciplinary Program in Cognitive Science

Gi-Cheon Kang

Confirming the Master's thesis written by Gi-Cheon Kang

December 2019

Chair        정민화 (Seal)
Vice Chair   Byoung-Tak Zhang (Seal)
Member       이준환 (Seal)
Abstract
Deep Representation Learning for
Visually-Grounded Dialog
Gi-Cheon Kang
Interdisciplinary Program in Cognitive Science
The Graduate School
Seoul National University
Thanks to the recent advances in computer vision and natural language processing, there has been an extensive amount of effort towards developing artificial intelligence (AI) systems that jointly understand vision and natural language information. To bridge the gap between human-level understanding and the performance of current AI systems, selectively utilizing visually-grounded information and capturing subtle nuances from human conversation have become key challenges. Visual dialog (Das et al., 2017) is a machine learning task that
requires an AI agent to answer a series of questions grounded in an image.
The visual dialog dataset consists of a large-scale set of images and multi-round question-answer pairs (i.e., a dialog) per image. For instance, the agent is expected to answer a series of semantically inter-dependent questions, such as “How many people are in the image?” and “Are they indoors or outside?”.
The present study introduces a deep neural network-based learning algorithm for the visual dialog task. Specifically, we investigate the visual reference resolution problem based on our previous work (Kang et al., 2019).
The problem of visual reference resolution is to resolve expressions that are ambiguous on their own (e.g., it, they, any other) and to ground the references in a given image. This problem is crucial in that it involves the two aforementioned challenges: (1) finding visual groundings of linguistic expressions, and (2) capturing contextual information from the previous dialog. Previous studies dealt with visual reference resolution in the visual dialog task by proposing an attention memory (Seo et al., 2017) and neural module networks (Kottur et al., 2018). These approaches store all visual attentions of previous dialogs, assuming that the previous visual attentions are key information for visual reference resolution. However, research on the human memory system shows that visual sensory memory, due to its rapid decay property, can hardly store all previous visual attentions (Sergent et al., 2011, Sperling, 1960). Based on this biologically inspired motivation, we propose Dual Attention Networks (DAN), which does not rely on the visual attention maps of the previous dialogs. DAN consists of two kinds of attention networks, REFER and FIND. The REFER network learns latent relationships between a given question and a previous dialog. The FIND network
takes image representations and the output of REFER network as input, and
performs visual grounding. By using the two attention mechanisms, we expect
our dialog agent to mimic the behavior of a human in the scenario, where one
receives an ambiguous question and then has to find an answer in the presented
image by recalling previous questions and answers from one’s memory.
As a result, DAN placed 3rd in the Visual Dialog Challenge 2019 as an ensemble model, and also achieved new state-of-the-art performance as of November 2019, the time of publication.
Keywords: Visual dialog, multi-modal, attention, visual reference resolution
Student Number: 2018-23580
Contents

Abstract
Contents
List of Tables
List of Figures

Chapter 1 Introduction

Chapter 2 Related Works
2.1 Visual Dialog
2.2 Visual Reference Resolution
2.3 Attention Mechanisms

Chapter 3 Dual Attention Networks
3.1 Input Representation
3.1.1 Image Features
3.1.2 Language Features
3.2 REFER Network
3.3 FIND Network
3.4 Answer Decoder

Chapter 4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Quantitative Results
4.4.1 Comparison with State-of-the-Art
4.4.2 Comparison with Baseline
4.5 Qualitative Results
4.6 Ablation Studies
4.6.1 Single network
4.6.2 Image Features in FIND network
4.6.3 Residual Connection in REFER Network
4.6.4 Stack of REFER Networks & Attention Heads

Chapter 5 Conclusion

Bibliography

Abstract (in Korean)
List of Tables

Table 4.1 Hyperparameters for Dual Attention Networks

Table 4.2 Retrieval performance on VisDial v1.0 and v0.9 datasets, measured by normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), recall@k (R@k), and mean rank. The higher the better for NDCG, MRR, and R@k, while the lower the better for mean rank. DAN outperforms all other models across NDCG, MRR, and R@1 on both datasets. NDCG is not supported in v0.9 dataset.

Table 4.3 VisDial v1.0 validation performance on the semantically complete (SC) and incomplete (SI) questions. We observe that SI questions obtain more benefits from the dialog history than SC questions.

Table 4.4 Ablation studies on VisDial v1.0 validation split. Res and RPN denote the residual connection and the region proposal networks, respectively.
List of Figures

Figure 1.1 Examples from visual dialog (Das et al., 2017). Visual Dialog requires a dialog agent to answer a series of questions grounded in an image. Specifically, an image, a dialog history, and a follow-up question about the image are given.

Figure 2.1 Previous methods for visual reference resolution in the visual dialog task (Top: neural module network model, Bottom: attention memory model). Figures are from (Kottur et al., 2018) and (Seo et al., 2017), respectively.

Figure 2.2 The Transformer model. This model replaces the sequence-aligned RNN or convolution with the self-attention mechanism. Figure is from (Vaswani et al., 2017).

Figure 3.1 An overview of Dual Attention Networks (DAN). We propose two kinds of attention networks, REFER and FIND. REFER learns latent relationships between a given question and a dialog history to retrieve the relevant previous dialogs. FIND performs visual grounding, taking image features and reference-aware representations (i.e., the output of REFER). ⊗, ⊕, and ⊙ denote matrix multiplication, concatenation, and element-wise multiplication, respectively. The multi-layer perceptron is omitted in this figure for simplicity.

Figure 3.2 Illustration of the single-layer REFER network. REFER network focuses on the latent relationship between the follow-up question and a dialog history to resolve ambiguous references in the question. We employ two submodules: multi-head attention and feed-forward networks. Multi-head attention computes the h number of soft attentions over all elements of dialog history by using scaled dot-product attention. Then, it returns the h number of heads which are weighted by the attentions. Followed by the two-layer feed-forward networks, REFER network finally returns the reference-aware representations $e_t^{ref}$. ⊕ and the dotted line denote the concatenation operation and the linear projection operation by the learnable matrices, respectively.

Figure 4.1 Qualitative results on the VisDial v1.0 dataset. We visualize the attention over dialog history from REFER network and the visual attention from FIND network. The object detection features with the top five attention weights are marked with colored boxes. A red colored box indicates the most salient visual feature. Also, the attention from REFER network is represented as shading; darker shading indicates a larger attention weight for each element of the dialog history. Our proposed model not only responds with the correct answer, but also selectively pays attention to the previous dialogs and salient image regions.

Figure 4.2 Ablation study on different numbers of attention heads and REFER stacks. REFER (n) indicates that DAN uses a stack of n identical REFER networks.
Chapter 1
Introduction
The machine learning research in computer vision and natural language pro-
cessing accelerates the development of Artificial Intelligence (AI). As the vision
and language are core modalities for human being, the intelligent system that
can see everyday scenes and fluently communicate with people is one of the
ambitious goals of AI. As demands for this line of research increase, vision
and language communities have proposed challenging tasks that require the
machine to jointly understand vision and natural language information. Ac-
cordingly, there has been an extensive amount of effort towards developing the
intelligent system in various aspects. Typically, visual question answering (An-
derson et al., 2018, Antol et al., 2015, Fukui et al., 2016, Goyal et al., 2017, Kim
et al., 2018) and image captioning (Johnson et al., 2016, Lu et al., 2018, Xu
et al., 2015) tasks have been widely explored. However, the agent performing
these tasks still has a long way to go before being deployed in real-world applications (e.g., aiding visually impaired users, interacting with humanoid robots), in that it does not consider continuous interaction over time. Specifically, the interaction in image captioning is that the agent simply talks to the human about visual content, with no input from the human. While the VQA system takes a question as input, it does not consider patterns that vary over time. To this end, (Das et al., 2017) proposed a generalized VQA task, which is called visual dialog (VisDial).

Figure 1.1: Examples from visual dialog (Das et al., 2017). Visual Dialog requires a dialog agent to answer a series of questions grounded in an image. Specifically, an image, a dialog history, and a follow-up question about the image are given.

Different from single-round VQA, a visual dialog
agent needs to answer a series of inter-dependent questions such as “How many
people are in the image?”, “Are they indoors or outside?”. We believe that there
are two big challenges in this task: (1) extracting contextual information from
a dialog history, and (2) utilizing visually-grounded information. To address
the aforementioned challenges, researchers have recently dealt with a problem
called visual reference resolution in visual dialog. The problem of visual refer-
ence resolution is to link the ambiguous reference (e.g., it, they, any other) to
an entity in the visual source.
In this study, we address the visual reference resolution in a visual dialog
task. We first hypothesize that humans address the visual reference resolution
through a two-step process: (1) semantically resolve the ambiguous reference
by recalling the dialog history from one’s memory and (2) attempt to find
a spatial region of a given image for the resolved reference. For example, as
shown in Figure 1.1, the question “What color is it?” is ambiguous on its own
because it is hard to find out what “it” refers to. So we believe that humans
try to recall the previous dialogs and notice that “it” refers to the “dog”. And
then, we believe that they will finally try to find the dog in the image and
answer the question. For these processes, we propose Dual Attention Networks
(DAN) which consists of two kinds of attention networks, REFER and FIND.
REFER network learns semantic relationships between a given question and a
dialog history to extract the relevant information. Inspired by the multi-head
attention mechanism (Vaswani et al., 2017), REFER network calculates the
multi-head attention over all previous dialogs in a sentence-level fashion to get
the reference-aware representations. FIND network takes image features and
the reference-aware representations as inputs, and maps the reference-aware
representations to spatial regions of the image. Through these processes, we expect our proposed model to disambiguate the question by using the REFER network and to ground the resolved reference properly in the given image.
The main contributions of this study are as follows. First, we propose Dual
Attention Networks (DAN) for visual reference resolution in visual dialog based
on REFER and FIND networks. Second, we validate our proposed model on the
large-scale datasets VisDial v1.0 and v0.9. Our model achieves new state-of-the-art results compared to other methods. We also make a comparison between DAN and our baseline model to demonstrate the performance improvements on semantically incomplete questions that need to be clarified. Third, we perform a qualitative analysis of our model, showing that DAN reasonably attends to the dialog history and salient image regions. Finally, we conduct ablation studies on four criteria to demonstrate the effectiveness of our proposed components.
Chapter 2
Related Works
2.1 Visual Dialog
Visual dialog (VisDial) dataset was recently proposed by (Das et al., 2017),
providing a testbed for research on the interplay between computer vision and
multi-turn dialog systems. Visual dialog requires a dialog agent to hold a mean-
ingful dialog with humans in natural, conversational language about an image. The dialog agent is given an image, a dialog history, and a follow-up question that needs to be answered. Accordingly, the dialog agent performing this task is required not only to find visual groundings of linguistic expressions but also to capture semantic nuances from human conversation. To tackle these challenges, attention mechanism-based approaches were primarily proposed, including memory networks (Das et al., 2017), history-conditioned
image attentive encoder (Lu et al., 2017), sequential co-attention (Wu et al.,
2018), and synergistic co-attention networks (Guo et al., 2019).
2.2 Visual Reference Resolution
Recently, researchers have tackled a problem called visual reference resolution
(Kottur et al., 2018, Seo et al., 2017) in VisDial. To resolve visual references,
(Seo et al., 2017) proposed an attention memory which stores a sequence of
previous visual attention maps in memory slots. They retrieved the previous
visual attention maps by applying a soft attention over all the memory slots and
combined it with a current visual attention. Furthermore, (Kottur et al., 2018)
attempted to resolve visual references at a word-level, relying on an off-the-
shelf parser. Similar to the attention memory (Seo et al., 2017), they proposed
a reference pool which stores visual attention maps of recognized entities and
retrieved the weighted sum of the visual attention maps by applying a soft at-
tention. To resolve the visual references, the above approaches attempted to retrieve the visual attention of the previous dialogs and applied it to the current visual attention. These approaches have limitations in that they store all previous visual attentions, while research on the human memory system shows that visual sensory memory, due to its rapid decay property, can hardly store all previous visual attentions (Sergent et al., 2011, Sperling, 1960). Based on this biologically
inspired motivation, our proposed model calculates the current visual attention
by using linguistic cues (i.e., dialog history).
2.3 Attention Mechanisms
Attention mechanisms are a universally used technique in the machine learning field,
including neural machine translation (Bahdanau et al., 2014, Luong et al.,
2015), automatic speech recognition (Chorowski et al., 2015, Zeyer et al., 2018),
and vision-language learning (Anderson et al., 2018, Kim et al., 2018, Lu et al.,
2016). They mimic a human's selective attention by assigning non-zero weights to target data or representations. The chunk of data that receives a higher attention weight is regarded as relatively more important. Recently, the self-attention mechanism (Vaswani et al., 2017) has been widely studied because it has shown superior performance on several natural language processing tasks without adopting powerful RNN and CNN structures. By utilizing the excellence of
the self-attention mechanism, (Yu et al., 2019) showed a state-of-the-art per-
formance on the visual question answering task.
Figure 2.1: Previous methods for visual reference resolution in visual dialog
task (Top: neural module network model, Bottom: attention memory model).
Figures are from (Kottur et al., 2018) and (Seo et al., 2017), respectively.
Figure 2.2: The Transformer model. This model replaces the sequence aligned
RNN or convolution with the self-attention mechanism. Figure is from (Vaswani
et al., 2017).
Chapter 3
Dual Attention Networks
In this section, we formally describe the visual dialog task and our proposed
algorithm, Dual Attention Networks (DAN). The visual dialog task (Das et al.,
2017) is defined as follows. A dialog agent is given input such as an image I,
a follow-up question at round $t$ denoted $Q_t$, and a dialog history (including the image caption) $H = (\underbrace{C}_{H_0}, \underbrace{(Q_1, A^{gt}_1)}_{H_1}, \cdots, \underbrace{(Q_{t-1}, A^{gt}_{t-1})}_{H_{t-1}})$ up to round $t-1$. By using these inputs, the agent is asked to rank a list of 100 candidate answers $A_t = \{A^1_t, \cdots, A^{100}_t\}$, where $A^{gt}_t$ denotes the ground-truth answer (i.e., the human response) at round $t$. Given this problem setup, DAN for the visual dialog task can be framed as an
encoder-decoder architecture: (1) an encoder that jointly embeds the input (I,
Qt, H) and (2) a decoder that converts the embedded representation into the
ranked list At . From this point of view, DAN consists of three components which
are REFER, FIND, and the answer decoder. As shown in Figure 3.1, REFER
network learns to attend to relevant previous dialogs to resolve the ambiguous
references in a given question Qt. FIND network learns to attend to the spatial
image features that the output of REFER network describes. Answer decoder
ranks the list of candidate answers At given the output of FIND network.
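For concreteness, the sketch below shows one way to bundle the inputs of a single dialog round before encoding; the structure and field names are our own illustration and are not part of the dataset specification or the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

import torch


@dataclass
class VisDialRound:
    """One dialog round at time t (illustrative field names, not an official API)."""
    image_feats: torch.Tensor        # v: (K, V) object detection features of image I
    caption: str                     # C, treated as H_0
    history: List[Tuple[str, str]]   # [(Q_1, A_1^gt), ..., (Q_{t-1}, A_{t-1}^gt)]
    question: str                    # follow-up question Q_t
    answer_options: List[str]        # 100 candidate answers A_t
    gt_index: int                    # index of A_t^gt among the options (training only)
```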
We first introduce the language features, as well as the image features in
Sec. 3.1. Then we describe the detailed architectures of the REFER and FIND
networks in Sec. 3.2 and Sec. 3.3, respectively. Finally, we present the answer
decoder in Sec. 3.4.
3.1 Input Representation
3.1.1 Image Features
Inspired by bottom-up attention (Anderson et al., 2018), we use the Faster R-
CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017) to extract object-level image features. We denote the output features as $v \in \mathbb{R}^{K \times V}$, where $K$ and $V$ are the total number of object detection features per image and the dimension of each feature, respectively. We adaptively extract the number of object features $K$, ranging from 10 to 100, to reflect the complexity of each image. $K$ is fixed during training.
3.1.2 Language Features
We first embed each of the words in the follow-up question $Q_t$ to $\{w_{t,1}, \cdots, w_{t,T}\}$ by using pre-trained GloVe (Pennington et al., 2014) embeddings, where $T$ denotes the number of tokens in $Q_t$. We then use a two-layer LSTM, generating a sequence of hidden states $\{u_{t,1}, \cdots, u_{t,T}\}$. Note that we use the last hidden state of the LSTM, $u_{t,T}$, as the question feature, denoted as $q_t \in \mathbb{R}^L$.

$$u_{t,i} = \mathrm{LSTM}(w_{t,i}, u_{t,i-1}) \quad (3.1)$$
$$q_t = u_{t,T} \quad (3.2)$$

Also, each element of the dialog history $\{H_i\}_{i=0}^{t-1}$ and the candidate answers $\{A_t^i\}_{i=1}^{100}$ are embedded in the same way as the follow-up question, yielding $\{h_i\}_{i=0}^{t-1} \in \mathbb{R}^{t \times L}$ and $\{o_t^i\}_{i=1}^{100} \in \mathbb{R}^{100 \times L}$, respectively. $Q_t$, $H$, and $A_t$ are embedded with the same word embedding vectors and three different LSTMs.
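As a rough sketch of this text encoder, the snippet below embeds a token-index sequence and runs a two-layer LSTM, keeping the last hidden state as the sentence feature. The module and argument names are ours; loading the actual GloVe vectors and the separate LSTMs for $H$ and $A_t$ are omitted.

```python
import torch
import torch.nn as nn


class SentenceEncoder(nn.Module):
    """GloVe embedding + two-layer LSTM; returns the last hidden state (a sketch)."""

    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        # In practice the embedding weights are initialized with pre-trained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) token indices of Q_t (or of H_i / A_t^i with their own LSTMs)
        w = self.embedding(tokens)        # (batch, T, emb_dim)
        u, _ = self.lstm(w)               # (batch, T, hidden_dim)
        return u[:, -1]                   # q_t = u_{t,T}, shape (batch, L)


# e.g., q_t = SentenceEncoder(vocab_size=11000)(question_token_ids)  # vocab size is illustrative
```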
Figure 3.1: An overview of Dual Attention Networks (DAN). We propose two kinds of attention networks, REFER and FIND. REFER learns latent relationships between a given question and a dialog history to retrieve the relevant previous dialogs. FIND performs visual grounding, taking image features and reference-aware representations (i.e., the output of REFER). ⊗, ⊕, and ⊙ denote matrix multiplication, concatenation, and element-wise multiplication, respectively. The multi-layer perceptron is omitted in this figure for simplicity.
3.2 REFER Network
Given the question and dialog history features, REFER network aims to attend to the most relevant elements of the dialog history with respect to the given question. Specifically, we first compute scaled dot-product attention (Vaswani et al., 2017) in a multi-head setting, which is called multi-head attention. Let $q_t$ and $M_t = \{h_i\}_{i=0}^{t-1}$ be the question and dialog history feature vectors, respectively. $q_t$ and $M_t$ are projected to $d_{ref}$ dimensions by different, learnable projection matrices. We then take the dot product of these two projected matrices, divide by $\sqrt{d_{ref}}$, and apply a softmax to obtain the attention weights over all elements of the dialog history.

$$\mathrm{head}_n = \mathrm{Attention}(q_t W^q_n, M_t W^m_n) \quad (3.3)$$
$$\text{where } \mathrm{Attention}(a, b) = \mathrm{softmax}\left(\frac{a b^\top}{\sqrt{d_{ref}}}\right) b \quad (3.4)$$

where $W^q_n \in \mathbb{R}^{L \times d_{ref}}$ and $W^m_n \in \mathbb{R}^{L \times d_{ref}}$. Note that the dot-product attention is computed $h$ times with different projection matrices, yielding $\{\mathrm{head}_n\}_{n=1}^{h}$. Accordingly, we obtain the multi-head representation $x_t$ by concatenating all $\{\mathrm{head}_n\}_{n=1}^{h}$, followed by a linear projection. We then update $x_t$ by applying a residual connection (He et al., 2016), followed by layer normalization (Ba et al., 2016).

$$x_t = (\mathrm{head}_1 \oplus \cdots \oplus \mathrm{head}_h) W^o \quad (3.5)$$
$$x_t = \mathrm{LayerNorm}(x_t + q_t) \quad (3.6)$$

where $\oplus$ denotes the concatenation operation, and $W^o \in \mathbb{R}^{h d_{ref} \times L}$ is the projection matrix. Next, we apply $x_t$ to a two-layer feed-forward network with a ReLU activation in between, where $W^f_1 \in \mathbb{R}^{L \times 2L}$ and $W^f_2 \in \mathbb{R}^{2L \times L}$. The residual connection and layer normalization are also applied in this step.

$$c_t = \mathrm{ReLU}(x_t W^f_1 + b^f_1) W^f_2 + b^f_2 \quad (3.7)$$
$$c_t = \mathrm{LayerNorm}(c_t + x_t) \quad (3.8)$$
$$e_t^{ref} = c_t \oplus q_t \quad (3.9)$$

Finally, REFER network returns the reference-aware representation by concatenating the contextual representation $c_t$ and the original question representation $q_t$, denoted as $e_t^{ref} \in \mathbb{R}^{2L}$. In this work, we use $d_{ref} = 256$ and $h = 4$.
Figure 3.2 illustrates the pipeline of the REFER network.
Figure 3.2: Illustration of the single-layer REFER network. REFER network focuses on the latent relationship between the follow-up question and a dialog history to resolve ambiguous references in the question. We employ two submodules: multi-head attention and feed-forward networks. Multi-head attention computes the h number of soft attentions over all elements of dialog history by using scaled dot-product attention. Then, it returns the h number of heads which are weighted by the attentions. Followed by the two-layer feed-forward networks, REFER network finally returns the reference-aware representations $e_t^{ref}$. ⊕ and the dotted line denote the concatenation operation and the linear projection operation by the learnable matrices, respectively.
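To make the computation in Eqs. (3.3)-(3.9) concrete, a minimal PyTorch sketch of the single-layer REFER network is given below. All module and variable names are ours, and details such as padding masks and dropout are omitted; this is an illustration of the described mechanism, not the released implementation.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferNetwork(nn.Module):
    """Single-layer REFER network: multi-head attention over the dialog history
    followed by a position-wise feed-forward network (a sketch of Eqs. 3.3-3.9)."""

    def __init__(self, L: int = 512, d_ref: int = 256, h: int = 4):
        super().__init__()
        self.h, self.d_ref = h, d_ref
        self.W_q = nn.Linear(L, h * d_ref, bias=False)  # question projections W^q_n (Eq. 3.3)
        self.W_m = nn.Linear(L, h * d_ref, bias=False)  # history projections W^m_n  (Eq. 3.3)
        self.W_o = nn.Linear(h * d_ref, L, bias=False)  # output projection W^o      (Eq. 3.5)
        self.norm1, self.norm2 = nn.LayerNorm(L), nn.LayerNorm(L)
        self.ffn = nn.Sequential(nn.Linear(L, 2 * L), nn.ReLU(), nn.Linear(2 * L, L))  # Eq. 3.7

    def forward(self, q_t: torch.Tensor, M_t: torch.Tensor) -> torch.Tensor:
        # q_t: (batch, L) question feature; M_t: (batch, t, L) dialog history features
        b, t, _ = M_t.shape
        q = self.W_q(q_t).view(b, self.h, 1, self.d_ref)                  # (b, h, 1, d_ref)
        m = self.W_m(M_t).view(b, t, self.h, self.d_ref).transpose(1, 2)  # (b, h, t, d_ref)
        # Scaled dot-product attention over the dialog history (Eq. 3.4)
        attn = F.softmax(q @ m.transpose(-1, -2) / math.sqrt(self.d_ref), dim=-1)
        heads = (attn @ m).squeeze(2).reshape(b, self.h * self.d_ref)     # concatenated heads
        x_t = self.norm1(self.W_o(heads) + q_t)                           # Eqs. 3.5-3.6
        c_t = self.norm2(self.ffn(x_t) + x_t)                             # Eqs. 3.7-3.8
        return torch.cat([c_t, q_t], dim=-1)                              # e_t^ref (Eq. 3.9)
```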
3.3 FIND Network
Instead of relying on the visual attention maps of the previous dialogs as in
(Kottur et al., 2018, Seo et al., 2017), we expect the FIND network to attend
to the most relevant regions of the image with respect to the reference-aware
representations (i.e., the output of REFER network). In order to implement
the visual grounding for the reference-aware representations, we take inspira-
tion from the bottom-up attention mechanism (Anderson et al., 2018). Let $v \in \mathbb{R}^{K \times V}$ and $e_t^{ref} \in \mathbb{R}^{2L}$ be the image feature vectors and the reference-aware representation, respectively. We first project these two vectors to $d_{find}$ dimensions and compute a soft attention over all the object detection features as follows:

$$r_t = f_v(v) \odot f_{ref}(e_t^{ref}) \quad (3.10)$$
$$\alpha_t = \mathrm{softmax}(r_t W^r + b^r) \quad (3.11)$$

where $f_v(\cdot)$ and $f_{ref}(\cdot)$ denote two-layer multi-layer perceptrons which convert to $d_{find}$ dimensions, and $W^r \in \mathbb{R}^{d_{find} \times 1}$ is the projection matrix for the softmax activation. $\odot$ denotes the Hadamard product (i.e., element-wise multiplication). From these equations, we obtain the visual attention weights $\alpha_t \in \mathbb{R}^{K \times 1}$. Next, we apply the visual attention weights to $v$ and compute the vision-language joint representation as follows:

$$v_t = \sum_{j=1}^{K} \alpha_{t,j} v_j \quad (3.12)$$
$$z_t = f'_v(v_t) \odot f'_{ref}(e_t^{ref}) \quad (3.13)$$
$$e_t^{find} = z_t W^z + b^z \quad (3.14)$$

where $f'_v(\cdot)$ and $f'_{ref}(\cdot)$ also denote two-layer multi-layer perceptrons which convert to $d_{find}$ dimensions, and $W^z \in \mathbb{R}^{d_{find} \times L}$ is the projection matrix. Note that $e_t^{find} \in \mathbb{R}^L$ is the output representation of the encoder as well as of the FIND network, and is given to the answer decoder to score the list of candidate answers. In this work, we use $d_{find} = 1024$.
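Analogously, a minimal sketch of the FIND computation in Eqs. (3.10)-(3.14) might look as follows; again, the names are ours and the two-layer MLPs are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def two_layer_mlp(d_in: int, d_out: int) -> nn.Module:
    """Two-layer multi-layer perceptron standing in for f_v, f_ref, f'_v, f'_ref (a sketch)."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))


class FindNetwork(nn.Module):
    """FIND network: grounds the reference-aware representation in the image
    features via soft attention (a sketch of Eqs. 3.10-3.14)."""

    def __init__(self, V: int = 2048, L: int = 512, d_find: int = 1024):
        super().__init__()
        self.f_v, self.f_ref = two_layer_mlp(V, d_find), two_layer_mlp(2 * L, d_find)
        self.fp_v, self.fp_ref = two_layer_mlp(V, d_find), two_layer_mlp(2 * L, d_find)
        self.W_r = nn.Linear(d_find, 1)   # attention logits   (Eq. 3.11)
        self.W_z = nn.Linear(d_find, L)   # output projection  (Eq. 3.14)

    def forward(self, v: torch.Tensor, e_ref: torch.Tensor) -> torch.Tensor:
        # v: (batch, K, V) object features; e_ref: (batch, 2L) output of REFER
        r_t = self.f_v(v) * self.f_ref(e_ref).unsqueeze(1)   # element-wise product (Eq. 3.10)
        alpha = F.softmax(self.W_r(r_t), dim=1)              # visual attention     (Eq. 3.11)
        v_t = (alpha * v).sum(dim=1)                         # attended feature     (Eq. 3.12)
        z_t = self.fp_v(v_t) * self.fp_ref(e_ref)            # joint representation (Eq. 3.13)
        return self.W_z(z_t)                                 # e_t^find             (Eq. 3.14)
```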
3.4 Answer Decoder
Answer decoder computes the score of each candidate answer via a dot product with the embedded representation $e_t^{find}$, followed by a softmax activation to get a categorical distribution over the candidates. Let $O_t = \{o_t^i\}_{i=1}^{100} \in \mathbb{R}^{100 \times L}$ be the feature vectors of the 100 candidate answers. The distribution $p_t$ is formulated as follows:

$$p_t = \mathrm{softmax}(e_t^{find} O_t^\top) \quad (3.15)$$

In the training phase, DAN is optimized by minimizing the cross-entropy loss between the one-hot encoded label vector $y_t$ and the probability distribution $p_t$:

$$\mathcal{L}(\theta) = -\sum_{k} y_{t,k} \log p_{t,k} \quad (3.16)$$

where $p_{t,k}$ denotes the probability of the $k$-th candidate answer at round $t$. In the test phase, the list of candidate answers is ranked by the distribution $p_t$ and evaluated with the retrieval metrics.
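A sketch of the decoder and training objective in Eqs. (3.15)-(3.16), assuming the candidate features have already been encoded, is given below.

```python
import torch
import torch.nn.functional as F


def score_candidates(e_find: torch.Tensor, options: torch.Tensor) -> torch.Tensor:
    """Dot-product scores over the 100 candidate answers (Eq. 3.15).

    e_find: (batch, L) encoder output; options: (batch, 100, L) candidate features.
    Returns logits of shape (batch, 100); a softmax over them gives p_t.
    """
    return torch.bmm(options, e_find.unsqueeze(-1)).squeeze(-1)


# Training: cross-entropy between p_t and the one-hot ground-truth label (Eq. 3.16).
# logits = score_candidates(e_find, option_feats)
# loss = F.cross_entropy(logits, gt_index)   # gt_index: (batch,) index of A_t^gt
```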
Chapter 4
Experiments
In this section, we describe the details of our experiments on the VisDial v1.0
and v0.9 datasets. We first introduce the VisDial datasets, evaluation metrics,
and implementation details in Sec. 4.1, Sec. 4.2, and Sec. 4.3, respectively. Then
we report the quantitative results by comparing our proposed model with the
state-of-the-art approaches and baseline model in Sec. 4.4. Then we provide
qualitative results in Sec. 4.5. Finally, we conduct ablation studies on four criteria to report the relative contributions of each component in Sec. 4.6.
4.1 Datasets
We evaluate our proposed model on the VisDial v0.9 and v1.0 datasets. The VisDial v0.9 dataset (Das et al., 2017) was collected via two subjects chatting about MS-COCO (Lin et al., 2014) images. Each dialog is made up of an image, a caption from the MS-COCO dataset, and 10 QA pairs. As a result, the VisDial v0.9 dataset contains 83k dialogs and 40k dialogs as train and validation splits, respectively. Recently, the VisDial v1.0 dataset (Das et al., 2017) has been released with an additional 10k COCO-like images from Flickr. Dialogs for the additional images have been collected similarly to v0.9. Overall, the VisDial v1.0 dataset contains 123k
(all dialogs from v0.9), 2k, and 8k dialogs as train, validation, and test splits,
respectively.
4.2 Evaluation Metrics
We evaluate individual responses at each question in a retrieval setting as sug-
gested by (Das et al., 2017). Specifically, the dialog agent is given a list of 100
candidate answers of each question and asked to rank the list. There are three
kinds of evaluation metrics for retrieval performance: (1) mean rank of human
response, (2) recall@k (i.e., existence of the human response in the top-k ranked responses), and (3) mean reciprocal rank (MRR). Mean rank, recall@k, and MRR
are highly correlated with the rank of human response. In addition, (Das et al.,
2017) proposed to use the robust evaluation metric, normalized discounted cu-
mulative gain (NDCG). NDCG takes into account all relevant answers from the
ranked list, where the relevance scores are densely annotated for VisDial v1.0
test split. NDCG penalizes the lower rank of the candidate answers with high
relevance scores.
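For concreteness, a minimal sketch of how these retrieval metrics can be computed, from the 1-based ranks of the human responses and, for NDCG, from dense relevance annotations, is shown below; it follows the metric definitions rather than the official evaluation code.

```python
import numpy as np


def retrieval_metrics(gt_ranks: np.ndarray) -> dict:
    """Mean rank, MRR, and R@k from the 1-based ranks of the human responses."""
    return {
        "mean_rank": gt_ranks.mean(),
        "mrr": (1.0 / gt_ranks).mean(),
        **{f"r@{k}": (gt_ranks <= k).mean() for k in (1, 5, 10)},
    }


def ndcg(scores: np.ndarray, relevance: np.ndarray, k: int) -> float:
    """NDCG for one ranked list of candidates with dense relevance scores
    (k is typically the number of relevant answers for that question)."""
    order = np.argsort(-scores)                        # ranking induced by the model
    discounts = 1.0 / np.log2(np.arange(2, k + 2))     # discount for positions 1..k
    dcg = (relevance[order][:k] * discounts).sum()
    ideal = (np.sort(relevance)[::-1][:k] * discounts).sum()
    return float(dcg / ideal) if ideal > 0 else 0.0
```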
4.3 Implementation Details
We use PyTorch (http://pytorch.org) to implement our proposed model. Hyperparameters are summarized in Table 4.1.

Table 4.1: Hyperparameters for Dual Attention Networks

Parameter                              Value
mini-batch size                        64
dimension of image features            2048
dimension of hidden states in LSTM     512
dimension of word embedding            300
number of attention heads              4
optimizer & initial learning rate      Adam, 0.001
number of epochs                       12
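Expressed as a configuration dictionary, these settings might look like the following; the key names are ours and do not correspond to the released code.

```python
# Hypothetical configuration mirroring Table 4.1 (plus d_ref and d_find from Sections 3.2-3.3).
DAN_CONFIG = {
    "batch_size": 64,
    "img_feat_dim": 2048,       # V, bottom-up attention features
    "lstm_hidden_dim": 512,     # L
    "word_emb_dim": 300,        # GloVe
    "num_attention_heads": 4,   # h
    "d_ref": 256,
    "d_find": 1024,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "num_epochs": 12,
}
```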
4.4 Quantitative Results
4.4.1 Comparison with State-of-the-Art
We compare our proposed model with the state-of-the-art approaches on Vis-
Dial v1.0 and v0.9 datasets, which can be categorized into three groups: (1)
Fusion-based approaches (LF and HRE (Das et al., 2017)), (2) Attention-based
approaches (MN (Das et al., 2017), HCIAE (Lu et al., 2017) and CoAtt (Wu
et al., 2018)), and (3) Approaches that deal with visual reference resolution in
VisDial (AMEM (Seo et al., 2017), and CorefNMN (Kottur et al., 2018)). Our
proposed model belongs to the third category. As shown in Table 4.2, DAN
significantly outperforms all other approaches on NDCG, MRR, and R@1, in-
cluding the previous state-of-the-art method, Synergistic (Guo et al., 2019).
Specifically, DAN improves approximately 0.27% on NDCG and 1.73% on R@1
on the VisDial v1.0 dataset. The results indicate that DAN ranks higher than all other methods on both the single ground-truth answer (R@1) and all relevant answers on average (NDCG).
4.4.2 Comparison with Baseline
We first define the questions that contain one or more pronouns (i.e., it, its,
they, their, them, these, those, this, that, he, his, him, she, her) as the semanti-
cally incomplete (SI) questions. Likewise, we define the questions that do not contain pronouns as semantically complete (SC) questions. Then, we check
the contribution of the reference-aware representations for the SC and SI ques-
tions, respectively.
Table 4.2: Retrieval performance on VisDial v1.0 and v0.9 datasets, measured by normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), recall@k (R@k), and mean rank. The higher the better for NDCG, MRR, and R@k, while the lower the better for mean rank. DAN outperforms all other models across NDCG, MRR, and R@1 on both datasets. NDCG is not supported in v0.9 dataset.

                                      VisDial v1.0 (test-std)                    VisDial v0.9 (val)
Model                                 NDCG   MRR    R@1    R@5    R@10   Mean    MRR    R@1    R@5    R@10   Mean
LF (Das et al., 2017)                 45.31  55.42  40.95  72.45  82.83  5.95    58.07  43.82  74.68  84.07  5.78
HRE (Das et al., 2017)                45.46  54.16  39.93  70.45  81.50  6.41    58.46  44.67  74.50  84.22  5.72
MN (Das et al., 2017)                 47.50  55.49  40.98  72.30  83.30  5.92    59.65  45.55  76.22  85.37  5.46
HCIAE (Lu et al., 2017)               -      -      -      -      -      -       62.22  48.48  78.75  87.59  4.81
AMEM (Seo et al., 2017)               -      -      -      -      -      -       62.27  48.53  78.66  87.43  4.86
CoAtt (Wu et al., 2018)               -      -      -      -      -      -       63.98  50.29  80.71  88.81  4.47
CorefNMN (Kottur et al., 2018)        54.70  61.50  47.55  78.10  88.80  4.40    64.10  50.92  80.18  88.81  4.45
Synergistic (Guo et al., 2019)        57.32  62.20  47.90  80.43  89.95  4.17    -      -      -      -      -
DAN (ours)                            57.59  63.20  49.63  79.75  89.35  4.30    66.38  53.33  82.42  90.38  4.04
Table 4.3: VisDial v1.0 validation performance on the semantically complete
(SC) and incomplete (SI) questions. We observe that SI questions obtain more
benefits from the dialog history than SC questions.
Model MRR R@1 R@5 R@10 Mean
SC
No REFER 61.85 47.80 79.10 88.43 4.49
DAN 64.81 51.22 81.63 90.19 4.03
Improvements 2.96 3.42 2.53 1.76 0.46
SI
No REFER 58.44 44.38 75.36 85.48 5.36
DAN 61.77 48.13 78.43 87.81 4.70
Improvements 3.33 3.75 3.07 2.33 0.66
Specifically, we make a comparison between DAN, which utilizes reference-aware representations (i.e., $e_t^{ref}$), and No REFER, which exploits question representations (i.e., $q_t$) only. From Table 4.3, we draw three
observations: (1) DAN shows significantly better results than the No REFER
model for SC questions. It validates that the context from dialog history enriches
the question information, even when the question is semantically complete. (2)
SI questions obtain more benefits from the dialog history than SC questions.
It indicates that DAN is more robust to the SI questions than SC questions.
(3) A dialog agent faces greater difficulty in answering SI questions compared
to SC questions. No REFER is equivalent to the FIND + RPN model in the
ablation study section.
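A simple way to realize this split is to scan each question for the listed pronouns; the sketch below is our own reading of that rule, not the exact preprocessing script.

```python
import re

# Pronoun list used to flag semantically incomplete (SI) questions (Sec. 4.4.2).
PRONOUNS = {"it", "its", "they", "their", "them", "these", "those",
            "this", "that", "he", "his", "him", "she", "her"}


def is_semantically_incomplete(question: str) -> bool:
    """True if the question contains at least one pronoun from the list above."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(token in PRONOUNS for token in tokens)


# e.g., is_semantically_incomplete("What color is it?")                  -> True  (SI)
#       is_semantically_incomplete("How many people are in the image?") -> False (SC)
```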
4.5 Qualitative Results
In this section, we visualize the inference mechanism of our proposed model.
Figure 4.1 shows the qualitative results of DAN.

Figure 4.1: Qualitative results on the VisDial v1.0 dataset. We visualize the attention over dialog history from REFER network and the visual attention from FIND network. The object detection features with the top five attention weights are marked with colored boxes. A red colored box indicates the most salient visual feature. Also, the attention from REFER network is represented as shading; darker shading indicates a larger attention weight for each element of the dialog history. Our proposed model not only responds with the correct answer, but also selectively pays attention to the previous dialogs and salient image regions.

Given a question that needs to be clarified, DAN correctly answers the question by selectively attending to each element of the dialog history and salient image regions. In the case of the visual attention, we mark the object detection features with the top five attention weights for each image. On the other hand, the attention weights from REFER network
are represented as shading; darker shading indicates the larger attention weight
for each element of the dialog history. These attention weights are calculated
by averaging over all the attention heads.
4.6 Ablation Studies
In this section, we perform an ablation study on the VisDial v1.0 validation split with the following four model variants: (1) a model using only a single attention network, (2) a model that uses different image features (pre-trained VGG-16), (3) a model that does not use the residual connection in REFER network, and (4) models that stack the REFER network up to four layers, each with a different number of attention heads.
4.6.1 Single network
The first four rows in Table 4.4 show the performance of a single network. FIND
denotes the use of FIND network only, and REFER denotes the use of single-
layer REFER network only. Specifically, REFER uses the output of REFER
network as the encoder outputs. On the other hand, FIND does not take the
reference-aware representations (i.e., $e_t^{ref}$) but the question feature (i.e., $q_t$).
The single models show relatively poor performance compared with the dual
network model. We believe that the results validate two hypotheses: (1) VisDial
task requires contextual information from dialog history as well as the visually-
grounded information. (2) REFER and FIND networks have complementary
modeling abilities.
4.6.2 Image Features in FIND network
To report the impact of image features, we replace the bottom-up attention
features (Anderson et al., 2018) with ImageNet pre-trained VGG-16 (Simonyan
and Zisserman, 2014) features. In detail, we use the output of the VGG-16
pool5 layer as image features. In Table 4.4, RPN denotes the use of the region
proposal networks (Ren et al., 2015) which are equivalent to the use of bottom-
up attention features. Similar to VQA task, we observe that DAN with bottom-
up attention features achieves better performance than with VGG-16 features.
In other words, the use of object-level features boosts the MRR performance of
DAN.
4.6.3 Residual Connection in REFER Network
We also conduct an ablation study to investigate the effectiveness of the residual
connection in REFER network. As shown in Table 4.4, the use of the residual
connection (i.e., Res) boosts the MRR score of DAN. In other words, DAN
utilizes the excellence of deep residual learning as in (He et al., 2016, Kim
et al., 2016, Rocktaschel et al., 2015, Vaswani et al., 2017, Yang et al., 2016).
4.6.4 Stack of REFER Networks & Attention Heads
We stack the REFER networks up to four layers, each with a different number of attention heads, $h \in \{1, 2, 4, 8, 16, 32, 64\}$. In other words, we conduct the
ablation experiments with twenty-eight models to set the hyperparameters of
our model. Figure 4.2 shows the results of the ablation experiments. For n ≥ 2,
REFER (n) indicates that DAN uses a stack of n identical REFER networks.
Specifically, for each pair of successive networks, the output of the previous
REFER network is fed into the next REFER network as a query (i.e., qt).
Due to the small number of elements in each dialog history, the overall perfor-
mance pattern shows a tendency to decrease as the number of attention heads
increases. It turns out that the two-layer REFER network with four attention
heads (i.e., REFER (2) and h = 4) performs the best among all models in
ablation study, recording 64.17% on MRR.
Table 4.4: Ablation studies on VisDial v1.0 validation split. Res and RPN denote
the residual connection and the region proposal networks, respectively.
Model MRR Score
FIND 57.85
FIND + RPN 60.80
REFER 57.18
REFER + Res 58.69
REFER + FIND 60.98
REFER + Res + FIND 61.86
REFER + FIND + RPN 63.47
REFER + Res + FIND + RPN 63.88
Figure 4.2: Ablation study on a different number of attention heads and REFER
stacks. REFER (n) indicates that DAN uses a stack of n identical REFER
networks.
Chapter 5
Conclusion
We introduce Dual Attention Networks (DAN) for visual reference resolution
in the visual dialog task. DAN explicitly divides the visual reference resolution
problem into a two-step process by employing REFER and FIND networks. In
place of relying on the previous visual attention maps as in previous works,
DAN first linguistically resolves ambiguous references in a given question by
using REFER network. Then, it grounds the resolved references in the image
by using FIND network. We empirically validate our proposed model on Vis-
Dial v1.0 and v0.9 datasets. First, we compare DAN with other state-of-the-art
methods on the VisDial v1.0 test-standard split, showing that our model signifi-
cantly outperforms them on core metrics (i.e., NDCG, MRR, and R@1). Then,
we compare DAN with our baseline model on semantically complete and in-
complete question types. Our model turns out to be robust to semantically incomplete questions, boosting the overall performance by 3.33% (MRR). Finally, we
conduct an ablation study to validate the effectiveness of our proposed model
and visualize the qualitative results. DAN gracefully integrates the REFER and FIND networks, while being simpler and more grounded. Furthermore, in terms of applicability, we believe that the algorithms of our proposed model can be applied to areas dealing with multimodal content aligned over time, such as video story understanding agents and audio-visual speech recognition.
Future Study The present study solely focused on the latent relationship
between the two pairs: (1) the given question and the dialog history, and (2) the
contextualized question and the given image. First, our proposed model does not consider the order of the dialog in (1), treating the dialog history as position-invariant information: swapping the order of the dialog history's elements does not affect the performance. However, the position of words or utterances is regarded as important information in natural language processing (Shaw et al., 2018, Vaswani et al., 2017). In this respect, we believe that applying a positional embedding would be a promising future research direction. Finally, we assume that all questions in the visual dialog dataset demand context information. However, there could be some questions that can be answered without context, such as “How many people are in the image?”. To this end, deciding whether a given question needs context or not would be a fundamental step in this
area.
Bibliography
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image
captioning and visual question answering. In CVPR, 2018.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Ba-
tra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In
Proceedings of the IEEE international conference on computer vision, pages
2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and
Yoshua Bengio. Attention-based models for speech recognition. In Advances
in neural information processing systems, pages 577–585, 2015.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav,
Jose MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
volume 2, 2017.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell,
and Marcus Rohrbach. Multimodal compact bilinear pooling for visual ques-
tion answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.
Making the v in vqa matter: Elevating the role of image understanding in vi-
sual question answering. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6904–6913, 2017.
Dalu Guo, Chang Xu, and Dacheng Tao. Image-question-answer synergistic
network for visual dialog. arXiv preprint arXiv:1902.09774, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolu-
tional localization networks for dense captioning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4565–4574,
2016.
Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. Dual attention networks
for visual reference resolution in visual dialog. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing, pages 2024–
2033, 2019.
Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim,
Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for
visual qa. In Advances in neural information processing systems, pages 361–
369, 2016.
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention net-
works. In Advances in Neural Information Processing Systems, pages 1564–
1574, 2018.
Satwik Kottur, Jose MF Moura, Devi Parikh, Dhruv Batra, and Marcus
Rohrbach. Visual coreference resolution in visual dialog using neural module
networks. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 153–169, 2018.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma,
et al. Visual genome: Connecting language and vision using crowdsourced
dense image annotations. International Journal of Computer Vision, 123(1):
32–73, 2017.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common
objects in context. In European conference on computer vision, pages 740–
755. Springer, 2014.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-
image co-attention for visual question answering. In Advances In Neural
Information Processing Systems, pages 289–297, 2016.
Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. Best
of both worlds: Transferring knowledge from discriminative learning to a
generative visual dialog model. In Advances in Neural Information Processing
Systems, pages 314–324, 2017.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7219–7228, 2018.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective ap-
proaches to attention-based neural machine translation. arXiv preprint
arXiv:1508.04025, 2015.
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global
vectors for word representation. In Proceedings of the 2014 conference on em-
pirical methods in natural language processing (EMNLP), pages 1532–1543,
2014.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In Advances in
neural information processing systems, pages 91–99, 2015.
Tim Rocktaschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky,
and Phil Blunsom. Reasoning about entailment with neural attention. arXiv
preprint arXiv:1509.06664, 2015.
Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. Visual
reference resolution using attention memory for visual dialog. In Advances
in neural information processing systems, pages 3719–3729, 2017.
Claire Sergent, Christian C Ruff, Antoine Barbot, Jon Driver, and Geraint
Rees. Top–down modulation of human early visual cortex after stimulus offset
supports successful postcued report. Journal of Cognitive Neuroscience, 23
(8):1921–1934, 2011.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative
position representations. arXiv preprint arXiv:1803.02155, 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
George Sperling. The information available in brief visual presentations. Psy-
chological monographs: General and applied, 74(11):1, 1960.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you
need. In Advances in Neural Information Processing Systems, pages 5998–
6008, 2017.
Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton van den Hengel.
Are you talking to me? reasoned visual dialog generation through adversarial
learning. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6106–6115, 2018.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural
image caption generation with visual attention. In International conference
on machine learning, pages 2048–2057, 2015.
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked
attention networks for image question answering. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 21–29, 2016.
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-
attention networks for visual question answering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 6281–6290,
2019.
Albert Zeyer, Kazuki Irie, Ralf Schluter, and Hermann Ney. Improved train-
ing of end-to-end attention models for speech recognition. arXiv preprint
arXiv:1805.03294, 2018.
Abstract (in Korean)

Thanks to recent advances in computer vision and natural language processing, research on artificial intelligence systems that jointly understand visual and natural language information is being actively conducted. To reduce the gap between human-level understanding of vision and language and the performance of current AI systems, utilizing visually-grounded information and understanding the subtle nuances of human conversation have become important problems. Visual dialog (Das et al., 2017) is a machine learning task that requires an AI agent to answer a series of questions related to an image. The visual dialog dataset contains a large number of images and multiple rounds of question-answer pairs per image. For example, the agent must answer semantically inter-dependent questions such as “How many people are in the picture?” and “Are they indoors or outdoors?”.

This study introduces a deep neural network-based learning algorithm for the visual dialog task. Specifically, based on our previous work (Kang et al., 2019), we address the visual reference resolution problem within visual dialog. Visual reference resolution refers to the problem of clarifying the meaning of linguistic expressions that are ambiguous on their own and mapping them to local regions of the image. Visual reference resolution thus involves the two important challenges mentioned above. Previous studies addressed visual reference resolution using techniques such as attention memory (Seo et al., 2017) and neural module networks (Kottur et al., 2018). What these techniques have in common is that they store all the visual attention information from previous dialogs. However, according to research on the human memory system, visual sensory memory decays rapidly, so such modeling lacks cognitive-scientific and biological grounding. Based on this motivation, we propose Dual Attention Networks, which do not rely on the visual attention of previous dialogs. The Dual Attention Networks consist of a REFER network and a FIND network. The REFER network learns the semantic relationship between a given question and the previous dialog. The FIND network takes the output of the REFER network and the image representation as input and performs visual grounding. Using these two attention mechanisms, we expect the dialog agent, when given a semantically ambiguous question, to clarify the meaning of the question and find the answer in the given image.

As a result, the Dual Attention Networks placed 3rd overall in the Visual Dialog Challenge 2019, and as a single model achieved state-of-the-art performance as of November 2019, the time of publication.

Keywords: visual dialog, multi-modal, attention, deep learning, visual reference resolution
Student Number: 2018-23580