Attribution-NonCommercial-NoDerivs 2.0 Korea

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that you follow the conditions below:
- When reusing or distributing this work, you must clearly indicate the license terms that apply to it.
- Any of these conditions can be waived if you obtain permission from the copyright holder.
Your rights under copyright law are not affected by the above. This is an easy-to-understand summary of the license (Legal Code).

Disclaimer

Attribution: you must credit the original author.
NonCommercial: you may not use this work for commercial purposes.
NoDerivs: you may not alter, transform, or build upon this work.
Master's Thesis in Engineering

Deep Representation Learning for
Visually-Grounded Dialog
(시각 기반 대화를 위한 심층 표상 학습)

February 2020

The Graduate School
Seoul National University
Interdisciplinary Program in Cognitive Science

Gi-Cheon Kang
Deep Representation Learning for
Visually-Grounded Dialog
(시각 기반 대화를 위한 심층 표상 학습)

Advisor: Byoung-Tak Zhang

Submitted as a Master's thesis in Engineering

January 2020

The Graduate School
Seoul National University
Interdisciplinary Program in Cognitive Science

Gi-Cheon Kang

Confirming the Master's thesis written by Gi-Cheon Kang

December 2019

Chair        정민화 (Seal)
Vice Chair   Byoung-Tak Zhang (Seal)
Member       이준환 (Seal)
Abstract
Deep Representation Learning for
Visually-Grounded Dialog
Gi-Cheon Kang
Interdisciplinary Program in Cognitive Science
The Graduate School
Seoul National University
Thanks to the recent advances in computer vision and natural language processing, there has been an extensive amount of effort towards developing artificial intelligence (AI) systems that jointly understand vision and natural language information. To bridge the gap between human-level understanding and the performance of current AI systems, selectively utilizing visually-grounded information and capturing subtle nuances from human conversation have become key challenges. Visual dialog (Das et al., 2017) is a machine learning task that
requires an AI agent to answer a series of questions grounded in an image.
The visual dialog dataset consists of a large-scale set of images and multi-round question-answer pairs (i.e., a dialog) per image. For instance, the agent is expected to answer a series of semantically inter-dependent questions, such as “How many people are in the image?” and “Are they indoors or outside?”.
The present study introduces a deep neural network-based learning algorithm for the visual dialog task. Specifically, we investigate the visual reference resolution problem based on our previous work (Kang et al., 2019).
The problem of visual reference resolution is to resolve expressions that are ambiguous on their own (e.g., it, they, any other) and to ground the references in a given image. This problem is crucial in that it involves the two aforementioned challenges: (1) finding visual groundings of linguistic expressions, and (2) capturing contextual information from the previous dialog. Previous studies dealt with visual reference resolution in the visual dialog task by proposing an attention memory (Seo et al., 2017) and neural module networks (Kottur et al., 2018). These approaches store all visual attentions of previous dialogs, assuming that the previous visual attentions are key information for visual reference resolution. However, research on the human memory system shows that visual sensory memory, due to its rapid decay property, can hardly store all previous visual attentions (Sergent et al., 2011, Sperling, 1960). Based on this biologically inspired motivation, we propose Dual Attention Networks (DAN), which does not rely on the visual attention maps of the previous dialogs. DAN consists of two kinds of attention networks, REFER and FIND. The REFER network learns latent relationships between a given question and a previous dialog. The FIND network
takes image representations and the output of REFER network as input, and
performs visual grounding. By using the two attention mechanisms, we expect
our dialog agent to mimic the behavior of a human in the scenario, where one
receives an ambiguous question and then has to find an answer in the presented
image by recalling previous questions and answers from one’s memory.
As a result, DAN placed 3rd in the Visual Dialog Challenge 2019 as an ensemble model, and also achieved new state-of-the-art performance as of November 2019, the time of publication.
Keywords: Visual dialog, multi-modal, attention, visual reference resolution
Student Number: 2018-23580
Contents

Abstract
Contents
List of Tables
List of Figures

Chapter 1 Introduction

Chapter 2 Related Works
2.1 Visual Dialog
2.2 Visual Reference Resolution
2.3 Attention Mechanisms

Chapter 3 Dual Attention Networks
3.1 Input Representation
3.1.1 Image Features
3.1.2 Language Features
3.2 REFER Network
3.3 FIND Network
3.4 Answer Decoder

Chapter 4 Experiments
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Quantitative Results
4.4.1 Comparison with State-of-the-Art
4.4.2 Comparison with Baseline
4.5 Qualitative Results
4.6 Ablation Studies
4.6.1 Single network
4.6.2 Image Features in FIND network
4.6.3 Residual Connection in REFER Network
4.6.4 Stack of REFER Networks & Attention Heads

Chapter 5 Conclusion

Bibliography

Abstract (in Korean)
List of Tables

Table 4.1 Hyperparameters for Dual Attention Networks

Table 4.2 Retrieval performance on VisDial v1.0 and v0.9 datasets, measured by normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), recall@k (R@k), and mean rank. The higher the better for NDCG, MRR, and R@k, while the lower the better for mean rank. DAN outperforms all other models across NDCG, MRR, and R@1 on both datasets. NDCG is not supported in v0.9 dataset.

Table 4.3 VisDial v1.0 validation performance on the semantically complete (SC) and incomplete (SI) questions. We observe that SI questions obtain more benefits from the dialog history than SC questions.

Table 4.4 Ablation studies on VisDial v1.0 validation split. Res and RPN denote the residual connection and the region proposal networks, respectively.
List of Figures

Figure 1.1 Examples from visual dialog (Das et al., 2017). Visual Dialog requires a dialog agent to answer a series of questions grounded in an image. Specifically, an image, a dialog history, and a follow-up question about the image are given.

Figure 2.1 Previous methods for visual reference resolution in the visual dialog task (Top: neural module network model, Bottom: attention memory model). Figures are from (Kottur et al., 2018) and (Seo et al., 2017), respectively.

Figure 2.2 The Transformer model. This model replaces the sequence-aligned RNN or convolution with the self-attention mechanism. Figure is from (Vaswani et al., 2017).

Figure 3.1 An overview of Dual Attention Networks (DAN). We propose two kinds of attention networks, REFER and FIND. REFER learns latent relationships between a given question and a dialog history to retrieve the relevant previous dialogs. FIND performs visual grounding, taking image features and reference-aware representations (i.e., the output of REFER). ⊗, ⊕, and ⊙ denote matrix multiplication, concatenation, and element-wise multiplication, respectively. The multi-layer perceptron is omitted in this figure for simplicity.

Figure 3.2 Illustration of the single-layer REFER network. REFER network focuses on the latent relationship between the follow-up question and a dialog history to resolve ambiguous references in the question. We employ two submodules: multi-head attention and feed-forward networks. Multi-head attention computes the h number of soft attentions over all elements of dialog history by using scaled dot-product attention. Then, it returns the h number of heads which are weighted by the attentions. Followed by the two-layer feed-forward networks, REFER network finally returns the reference-aware representations $e_t^{ref}$. ⊕ and the dotted line denote the concatenation operation and the linear projection operation by the learnable matrices, respectively.

Figure 4.1 Qualitative results on the VisDial v1.0 dataset. We visualize the attention over dialog history from REFER network and the visual attention from FIND network. The object detection features with the top five attention weights are marked with colored boxes. A red colored box indicates the most salient visual feature. Also, the attention from REFER network is represented as shading; darker shading indicates a larger attention weight for each element of the dialog history. Our proposed model not only responds with the correct answer, but also selectively pays attention to the previous dialogs and salient image regions.

Figure 4.2 Ablation study on different numbers of attention heads and REFER stacks. REFER (n) indicates that DAN uses a stack of n identical REFER networks.
Chapter 1
Introduction
The machine learning research in computer vision and natural language pro-
cessing accelerates the development of Artificial Intelligence (AI). As the vision
and language are core modalities for human being, the intelligent system that
can see everyday scenes and fluently communicate with people is one of the
ambitious goals of AI. As demands for this line of research increase, vision
and language communities have proposed challenging tasks that require the
machine to jointly understand vision and natural language information. Ac-
cordingly, there has been an extensive amount of effort towards developing the
intelligent system in various aspects. Typically, visual question answering (An-
derson et al., 2018, Antol et al., 2015, Fukui et al., 2016, Goyal et al., 2017, Kim
et al., 2018) and image captioning (Johnson et al., 2016, Lu et al., 2018, Xu
et al., 2015) tasks have been widely explored. However, the agent performing
these tasks still has a long way to go before being deployed in real-world applications (e.g., aiding visually impaired users, interacting with humanoid robots), in that it does not consider continuous interaction over time. Specifically, the interaction in image captioning is that the agent simply talks to the human about visual content, with no input from the human. While the VQA system takes a question as input, it does not consider patterns that vary over time. To this end, (Das et al., 2017) proposed a generalized VQA task, which is called visual dialog (VisDial).

Figure 1.1: Examples from visual dialog (Das et al., 2017). Visual Dialog requires a dialog agent to answer a series of questions grounded in an image. Specifically, an image, a dialog history, and a follow-up question about the image are given.

Different from single-round VQA, a visual dialog
agent needs to answer a series of inter-dependent questions such as “How many
people are in the image?”, “Are they indoors or outside?”. We believe that there
are two big challenges in this task: (1) extracting contextual information from
a dialog history, and (2) utilizing visually-grounded information. To address
the aforementioned challenges, researchers have recently dealt with a problem
called visual reference resolution in visual dialog. The problem of visual refer-
ence resolution is to link the ambiguous reference (e.g., it, they, any other) to
an entity in the visual source.
In this study, we address the visual reference resolution in a visual dialog
task. We first hypothesize that humans address the visual reference resolution
through a two-step process: (1) semantically resolve the ambiguous reference
by recalling the dialog history from one’s memory and (2) attempt to find
a spatial region of a given image for the resolved reference. For example, as
shown in Figure 1.1, the question “What color is it?” is ambiguous on its own
because it is hard to find out what “it” refers to. So we believe that humans
try to recall the previous dialogs and notice that “it” refers to the “dog”. And
then, we believe that they will finally try to find the dog in the image and
answer the question. For these processes, we propose Dual Attention Networks
(DAN) which consists of two kinds of attention networks, REFER and FIND.
REFER network learns semantic relationships between a given question and a
dialog history to extract the relevant information. Inspired by the multi-head
attention mechanism (Vaswani et al., 2017), REFER network calculates the
multi-head attention over all previous dialogs in a sentence-level fashion to get
the reference-aware representations. FIND network takes image features and
the reference-aware representations as inputs, and maps the reference-aware
representations to spatial regions of the image. Through these processes, we expect our proposed model to disambiguate the question by using the REFER network and to ground the resolved reference properly in the given image.
The main contributions of this study are as follows. First, we propose Dual
Attention Networks (DAN) for visual reference resolution in visual dialog based
on REFER and FIND networks. Second, we validate our proposed model on the
large-scale datasets VisDial v1.0 and v0.9. Our model achieves new state-of-the-art results compared to other methods. We also make a comparison between DAN and our baseline model to demonstrate the performance improvements on semantically incomplete questions that need to be clarified. Third, we perform a qualitative analysis of our model, showing that DAN reasonably attends to the dialog history and salient image regions. Finally, we conduct ablation studies on four criteria to demonstrate the effectiveness of our proposed components.
Chapter 2
Related Works
2.1 Visual Dialog
Visual dialog (VisDial) dataset was recently proposed by (Das et al., 2017),
providing a testbed for research on the interplay between computer vision and
multi-turn dialog systems. Visual dialog requires a dialog agent to hold a mean-
ingful dialog with humans in natural, conversational language about an image. The dialog agent is given an image, a dialog history, and a follow-up question that needs to be answered. Accordingly, the dialog agent performing this task is required not only to find visual groundings of linguistic expressions but also to capture semantic nuances from human conversation. To tackle these challenges, attention mechanism-based approaches were primarily proposed, including memory networks (Das et al., 2017), history-conditioned
image attentive encoder (Lu et al., 2017), sequential co-attention (Wu et al.,
2018), and synergistic co-attention networks (Guo et al., 2019).
2.2 Visual Reference Resolution
Recently, researchers have tackled a problem called visual reference resolution
(Kottur et al., 2018, Seo et al., 2017) in VisDial. To resolve visual references,
(Seo et al., 2017) proposed an attention memory which stores a sequence of
previous visual attention maps in memory slots. They retrieved the previous
visual attention maps by applying a soft attention over all the memory slots and
combined it with a current visual attention. Furthermore, (Kottur et al., 2018)
attempted to resolve visual references at a word-level, relying on an off-the-
shelf parser. Similar to the attention memory (Seo et al., 2017), they proposed
a reference pool which stores visual attention maps of recognized entities and
retrieved the weighted sum of the visual attention maps by applying a soft at-
tention. To resolve the visual references, the above approaches attempted to retrieve the visual attention of the previous dialogs and applied it to the current visual attention. These approaches have limitations in that they store all previous visual attentions, while research on the human memory system shows that visual sensory memory, due to its rapid decay property, can hardly store all previous visual attentions (Sergent et al., 2011, Sperling, 1960). Based on this biologically
inspired motivation, our proposed model calculates the current visual attention
by using linguistic cues (i.e., dialog history).
2.3 Attention Mechanisms
Attention mechanisms are a universally used technique in the machine learning field,
including neural machine translation (Bahdanau et al., 2014, Luong et al.,
2015), automatic speech recognition (Chorowski et al., 2015, Zeyer et al., 2018),
and vision-language learning (Anderson et al., 2018, Kim et al., 2018, Lu et al.,
2016). They mimic a human's selective attention by assigning non-zero weights to target data or representations. The chunk of data that receives a higher attention weight is regarded as relatively more important. Recently, the self-attention mechanism (Vaswani et al., 2017) has been widely studied because it has shown superior performance on several natural language processing tasks without adopting powerful RNN and CNN structures. By utilizing the excellence of
the self-attention mechanism, (Yu et al., 2019) showed a state-of-the-art per-
formance on the visual question answering task.
Figure 2.1: Previous methods for visual reference resolution in visual dialog
task (Top: neural module network model, Bottom: attention memory model).
Figures are from (Kottur et al., 2018) and (Seo et al., 2017), respectively.
Figure 2.2: The Transformer model. This model replaces the sequence aligned
RNN or convolution with the self-attention mechanism. Figure is from (Vaswani
et al., 2017).
Chapter 3
Dual Attention Networks
In this section, we formally describe the visual dialog task and our proposed
algorithm, Dual Attention Networks (DAN). The visual dialog task (Das et al.,
2017) is defined as follows. A dialog agent is given input such as an image I,
a follow-up question at round $t$ denoted $Q_t$, and a dialog history (including the image caption) $H = (\underbrace{C}_{H_0}, \underbrace{(Q_1, A^{gt}_1)}_{H_1}, \cdots, \underbrace{(Q_{t-1}, A^{gt}_{t-1})}_{H_{t-1}})$ up to round $t-1$. By using these inputs, the agent is asked to rank a list of 100 candidate answers $A_t = \{A^1_t, \cdots, A^{100}_t\}$, where $A^{gt}_t$ denotes the ground-truth answer (i.e., the human response) at round $t$. Given this problem setup, DAN for the visual dialog task can be framed as an
encoder-decoder architecture: (1) an encoder that jointly embeds the input (I,
Qt, H) and (2) a decoder that converts the embedded representation into the
ranked list At . From this point of view, DAN consists of three components which
are REFER, FIND, and the answer decoder. As shown in Figure 3.1, REFER
network learns to attend to relevant previous dialogs to resolve the ambiguous
references in a given question Qt. FIND network learns to attend to the spatial
image features that the output of REFER network describes. Answer decoder
ranks the list of candidate answers At given the output of FIND network.
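For concreteness, the sketch below shows one way to bundle the inputs of a single dialog round before encoding; the structure and field names are our own illustration and are not part of the dataset specification or the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

import torch


@dataclass
class VisDialRound:
    """One dialog round at time t (illustrative field names, not an official API)."""
    image_feats: torch.Tensor        # v: (K, V) object detection features of image I
    caption: str                     # C, treated as H_0
    history: List[Tuple[str, str]]   # [(Q_1, A_1^gt), ..., (Q_{t-1}, A_{t-1}^gt)]
    question: str                    # follow-up question Q_t
    answer_options: List[str]        # 100 candidate answers A_t
    gt_index: int                    # index of A_t^gt among the options (training only)
```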
We first introduce the language features, as well as the image features in
Sec. 3.1. Then we describe the detailed architectures of the REFER and FIND
networks in Sec. 3.2 and Sec. 3.3, respectively. Finally, we present the answer
decoder in Sec. 3.4.
3.1 Input Representation
3.1.1 Image Features
Inspired by bottom-up attention (Anderson et al., 2018), we use the Faster R-
CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017) to extract object-level image features. We denote the output features as $v \in \mathbb{R}^{K \times V}$, where $K$ and $V$ are the total number of object detection features per image and the dimension of each feature, respectively. We adaptively extract the number of object features $K$, ranging from 10 to 100, to reflect the complexity of each image. $K$ is fixed during training.
3.1.2 Language Features
We first embed each of the words in the follow-up question $Q_t$ to $\{w_{t,1}, \cdots, w_{t,T}\}$ by using pre-trained GloVe (Pennington et al., 2014) embeddings, where $T$ denotes the number of tokens in $Q_t$. We then use a two-layer LSTM, generating a sequence of hidden states $\{u_{t,1}, \cdots, u_{t,T}\}$. Note that we use the last hidden state of the LSTM, $u_{t,T}$, as the question feature, denoted as $q_t \in \mathbb{R}^L$.

$$u_{t,i} = \mathrm{LSTM}(w_{t,i}, u_{t,i-1}) \quad (3.1)$$
$$q_t = u_{t,T} \quad (3.2)$$

Also, each element of the dialog history $\{H_i\}_{i=0}^{t-1}$ and the candidate answers $\{A_t^i\}_{i=1}^{100}$ are embedded in the same way as the follow-up question, yielding $\{h_i\}_{i=0}^{t-1} \in \mathbb{R}^{t \times L}$ and $\{o_t^i\}_{i=1}^{100} \in \mathbb{R}^{100 \times L}$, respectively. $Q_t$, $H$, and $A_t$ are embedded with the same word embedding vectors and three different LSTMs.
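As a rough sketch of this text encoder, the snippet below embeds a token-index sequence and runs a two-layer LSTM, keeping the last hidden state as the sentence feature. The module and argument names are ours; loading the actual GloVe vectors and the separate LSTMs for $H$ and $A_t$ are omitted.

```python
import torch
import torch.nn as nn


class SentenceEncoder(nn.Module):
    """GloVe embedding + two-layer LSTM; returns the last hidden state (a sketch)."""

    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        # In practice the embedding weights are initialized with pre-trained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, T) token indices of Q_t (or of H_i / A_t^i with their own LSTMs)
        w = self.embedding(tokens)        # (batch, T, emb_dim)
        u, _ = self.lstm(w)               # (batch, T, hidden_dim)
        return u[:, -1]                   # q_t = u_{t,T}, shape (batch, L)


# e.g., q_t = SentenceEncoder(vocab_size=11000)(question_token_ids)  # vocab size is illustrative
```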
Figure 3.1: An overview of Dual Attention Networks (DAN). We propose two kinds of attention networks, REFER and FIND. REFER learns latent relationships between a given question and a dialog history to retrieve the relevant previous dialogs. FIND performs visual grounding, taking image features and reference-aware representations (i.e., the output of REFER). ⊗, ⊕, and ⊙ denote matrix multiplication, concatenation, and element-wise multiplication, respectively. The multi-layer perceptron is omitted in this figure for simplicity.
3.2 REFER Network
Given the question and dialog history features, REFER network aims to attend to the most relevant elements of the dialog history with respect to the given question. Specifically, we first compute scaled dot-product attention (Vaswani et al., 2017) in a multi-head setting, which is called multi-head attention. Let $q_t$ and $M_t = \{h_i\}_{i=0}^{t-1}$ be the question and dialog history feature vectors, respectively. $q_t$ and $M_t$ are projected to $d_{ref}$ dimensions by different, learnable projection matrices. We then take the dot product of these two projected matrices, divide by $\sqrt{d_{ref}}$, and apply a softmax to obtain the attention weights over all elements of the dialog history.

$$\mathrm{head}_n = \mathrm{Attention}(q_t W^q_n, M_t W^m_n) \quad (3.3)$$
$$\text{where } \mathrm{Attention}(a, b) = \mathrm{softmax}\left(\frac{a b^\top}{\sqrt{d_{ref}}}\right) b \quad (3.4)$$

where $W^q_n \in \mathbb{R}^{L \times d_{ref}}$ and $W^m_n \in \mathbb{R}^{L \times d_{ref}}$. Note that the dot-product attention is computed $h$ times with different projection matrices, yielding $\{\mathrm{head}_n\}_{n=1}^{h}$. Accordingly, we obtain the multi-head representation $x_t$ by concatenating all $\{\mathrm{head}_n\}_{n=1}^{h}$, followed by a linear projection. We then update $x_t$ by applying a residual connection (He et al., 2016), followed by layer normalization (Ba et al., 2016).

$$x_t = (\mathrm{head}_1 \oplus \cdots \oplus \mathrm{head}_h) W^o \quad (3.5)$$
$$x_t = \mathrm{LayerNorm}(x_t + q_t) \quad (3.6)$$

where $\oplus$ denotes the concatenation operation, and $W^o \in \mathbb{R}^{h d_{ref} \times L}$ is the projection matrix. Next, we apply $x_t$ to a two-layer feed-forward network with a ReLU activation in between, where $W^f_1 \in \mathbb{R}^{L \times 2L}$ and $W^f_2 \in \mathbb{R}^{2L \times L}$. The residual connection and layer normalization are also applied in this step.

$$c_t = \mathrm{ReLU}(x_t W^f_1 + b^f_1) W^f_2 + b^f_2 \quad (3.7)$$
$$c_t = \mathrm{LayerNorm}(c_t + x_t) \quad (3.8)$$
$$e_t^{ref} = c_t \oplus q_t \quad (3.9)$$

Finally, REFER network returns the reference-aware representation by concatenating the contextual representation $c_t$ and the original question representation $q_t$, denoted as $e_t^{ref} \in \mathbb{R}^{2L}$. In this work, we use $d_{ref} = 256$ and $h = 4$.
Figure 3.2 illustrates the pipeline of the REFER network.
Figure 3.2: Illustration of the single-layer REFER network. REFER network focuses on the latent relationship between the follow-up question and a dialog history to resolve ambiguous references in the question. We employ two submodules: multi-head attention and feed-forward networks. Multi-head attention computes the h number of soft attentions over all elements of dialog history by using scaled dot-product attention. Then, it returns the h number of heads which are weighted by the attentions. Followed by the two-layer feed-forward networks, REFER network finally returns the reference-aware representations $e_t^{ref}$. ⊕ and the dotted line denote the concatenation operation and the linear projection operation by the learnable matrices, respectively.
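To make the computation in Eqs. (3.3)-(3.9) concrete, a minimal PyTorch sketch of the single-layer REFER network is given below. All module and variable names are ours, and details such as padding masks and dropout are omitted; this is an illustration of the described mechanism, not the released implementation.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferNetwork(nn.Module):
    """Single-layer REFER network: multi-head attention over the dialog history
    followed by a position-wise feed-forward network (a sketch of Eqs. 3.3-3.9)."""

    def __init__(self, L: int = 512, d_ref: int = 256, h: int = 4):
        super().__init__()
        self.h, self.d_ref = h, d_ref
        self.W_q = nn.Linear(L, h * d_ref, bias=False)  # question projections W^q_n (Eq. 3.3)
        self.W_m = nn.Linear(L, h * d_ref, bias=False)  # history projections W^m_n  (Eq. 3.3)
        self.W_o = nn.Linear(h * d_ref, L, bias=False)  # output projection W^o      (Eq. 3.5)
        self.norm1, self.norm2 = nn.LayerNorm(L), nn.LayerNorm(L)
        self.ffn = nn.Sequential(nn.Linear(L, 2 * L), nn.ReLU(), nn.Linear(2 * L, L))  # Eq. 3.7

    def forward(self, q_t: torch.Tensor, M_t: torch.Tensor) -> torch.Tensor:
        # q_t: (batch, L) question feature; M_t: (batch, t, L) dialog history features
        b, t, _ = M_t.shape
        q = self.W_q(q_t).view(b, self.h, 1, self.d_ref)                  # (b, h, 1, d_ref)
        m = self.W_m(M_t).view(b, t, self.h, self.d_ref).transpose(1, 2)  # (b, h, t, d_ref)
        # Scaled dot-product attention over the dialog history (Eq. 3.4)
        attn = F.softmax(q @ m.transpose(-1, -2) / math.sqrt(self.d_ref), dim=-1)
        heads = (attn @ m).squeeze(2).reshape(b, self.h * self.d_ref)     # concatenated heads
        x_t = self.norm1(self.W_o(heads) + q_t)                           # Eqs. 3.5-3.6
        c_t = self.norm2(self.ffn(x_t) + x_t)                             # Eqs. 3.7-3.8
        return torch.cat([c_t, q_t], dim=-1)                              # e_t^ref (Eq. 3.9)
```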
3.3 FIND Network
Instead of relying on the visual attention maps of the previous dialogs as in
(Kottur et al., 2018, Seo et al., 2017), we expect the FIND network to attend
to the most relevant regions of the image with respect to the reference-aware
representations (i.e., the output of REFER network). In order to implement
the visual grounding for the reference-aware representations, we take inspira-
tion from the bottom-up attention mechanism (Anderson et al., 2018). Let $v \in \mathbb{R}^{K \times V}$ and $e_t^{ref} \in \mathbb{R}^{2L}$ be the image feature vectors and the reference-aware representation, respectively. We first project these two vectors to $d_{find}$ dimensions and compute a soft attention over all the object detection features as follows:

$$r_t = f_v(v) \odot f_{ref}(e_t^{ref}) \quad (3.10)$$
$$\alpha_t = \mathrm{softmax}(r_t W^r + b^r) \quad (3.11)$$

where $f_v(\cdot)$ and $f_{ref}(\cdot)$ denote two-layer multi-layer perceptrons which convert to $d_{find}$ dimensions, and $W^r \in \mathbb{R}^{d_{find} \times 1}$ is the projection matrix for the softmax activation. $\odot$ denotes the Hadamard product (i.e., element-wise multiplication). From these equations, we obtain the visual attention weights $\alpha_t \in \mathbb{R}^{K \times 1}$. Next, we apply the visual attention weights to $v$ and compute the vision-language joint representation as follows:

$$v_t = \sum_{j=1}^{K} \alpha_{t,j} v_j \quad (3.12)$$
$$z_t = f'_v(v_t) \odot f'_{ref}(e_t^{ref}) \quad (3.13)$$
$$e_t^{find} = z_t W^z + b^z \quad (3.14)$$

where $f'_v(\cdot)$ and $f'_{ref}(\cdot)$ also denote two-layer multi-layer perceptrons which convert to $d_{find}$ dimensions, and $W^z \in \mathbb{R}^{d_{find} \times L}$ is the projection matrix. Note that $e_t^{find} \in \mathbb{R}^L$ is the output representation of the encoder as well as of the FIND network, and is given to the answer decoder to score the list of candidate answers. In this work, we use $d_{find} = 1024$.
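Analogously, a minimal sketch of the FIND computation in Eqs. (3.10)-(3.14) might look as follows; again, the names are ours and the two-layer MLPs are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def two_layer_mlp(d_in: int, d_out: int) -> nn.Module:
    """Two-layer multi-layer perceptron standing in for f_v, f_ref, f'_v, f'_ref (a sketch)."""
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))


class FindNetwork(nn.Module):
    """FIND network: grounds the reference-aware representation in the image
    features via soft attention (a sketch of Eqs. 3.10-3.14)."""

    def __init__(self, V: int = 2048, L: int = 512, d_find: int = 1024):
        super().__init__()
        self.f_v, self.f_ref = two_layer_mlp(V, d_find), two_layer_mlp(2 * L, d_find)
        self.fp_v, self.fp_ref = two_layer_mlp(V, d_find), two_layer_mlp(2 * L, d_find)
        self.W_r = nn.Linear(d_find, 1)   # attention logits   (Eq. 3.11)
        self.W_z = nn.Linear(d_find, L)   # output projection  (Eq. 3.14)

    def forward(self, v: torch.Tensor, e_ref: torch.Tensor) -> torch.Tensor:
        # v: (batch, K, V) object features; e_ref: (batch, 2L) output of REFER
        r_t = self.f_v(v) * self.f_ref(e_ref).unsqueeze(1)   # element-wise product (Eq. 3.10)
        alpha = F.softmax(self.W_r(r_t), dim=1)              # visual attention     (Eq. 3.11)
        v_t = (alpha * v).sum(dim=1)                         # attended feature     (Eq. 3.12)
        z_t = self.fp_v(v_t) * self.fp_ref(e_ref)            # joint representation (Eq. 3.13)
        return self.W_z(z_t)                                 # e_t^find             (Eq. 3.14)
```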
3.4 Answer Decoder
Answer decoder computes the score of each candidate answer via a dot product with the embedded representation $e_t^{find}$, followed by a softmax activation to get a categorical distribution over the candidates. Let $O_t = \{o_t^i\}_{i=1}^{100} \in \mathbb{R}^{100 \times L}$ be the feature vectors of the 100 candidate answers. The distribution $p_t$ is formulated as follows:

$$p_t = \mathrm{softmax}(e_t^{find} O_t^\top) \quad (3.15)$$

In the training phase, DAN is optimized by minimizing the cross-entropy loss between the one-hot encoded label vector $y_t$ and the probability distribution $p_t$:

$$\mathcal{L}(\theta) = -\sum_{k} y_{t,k} \log p_{t,k} \quad (3.16)$$

where $p_{t,k}$ denotes the probability of the $k$-th candidate answer at round $t$. In the test phase, the list of candidate answers is ranked by the distribution $p_t$ and evaluated with the retrieval metrics.
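A sketch of the decoder and training objective in Eqs. (3.15)-(3.16), assuming the candidate features have already been encoded, is given below.

```python
import torch
import torch.nn.functional as F


def score_candidates(e_find: torch.Tensor, options: torch.Tensor) -> torch.Tensor:
    """Dot-product scores over the 100 candidate answers (Eq. 3.15).

    e_find: (batch, L) encoder output; options: (batch, 100, L) candidate features.
    Returns logits of shape (batch, 100); a softmax over them gives p_t.
    """
    return torch.bmm(options, e_find.unsqueeze(-1)).squeeze(-1)


# Training: cross-entropy between p_t and the one-hot ground-truth label (Eq. 3.16).
# logits = score_candidates(e_find, option_feats)
# loss = F.cross_entropy(logits, gt_index)   # gt_index: (batch,) index of A_t^gt
```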
Chapter 4
Experiments
In this section, we describe the details of our experiments on the VisDial v1.0
and v0.9 datasets. We first introduce the VisDial datasets, evaluation metrics,
and implementation details in Sec. 4.1, Sec. 4.2, and Sec. 4.3, respectively. Then
we report the quantitative results by comparing our proposed model with the
state-of-the-art approaches and baseline model in Sec. 4.4. Then we provide
qualitative results in Sec. 4.5. Finally, we conduct ablation studies on four criteria to report the relative contributions of each component in Sec. 4.6.
4.1 Datasets
We evaluate our proposed model on the VisDial v0.9 and v1.0 datasets. The VisDial v0.9 dataset (Das et al., 2017) was collected via two subjects chatting about MS-COCO (Lin et al., 2014) images. Each dialog is made up of an image, a caption from the MS-COCO dataset, and 10 QA pairs. As a result, the VisDial v0.9 dataset contains 83k dialogs and 40k dialogs as train and validation splits, respectively. Recently, the VisDial v1.0 dataset (Das et al., 2017) has been released with an additional 10k COCO-like images from Flickr. Dialogs for the additional images have been collected similarly to v0.9. Overall, the VisDial v1.0 dataset contains 123k
(all dialogs from v0.9), 2k, and 8k dialogs as train, validation, and test splits,
respectively.
4.2 Evaluation Metrics
We evaluate individual responses at each question in a retrieval setting as sug-
gested by (Das et al., 2017). Specifically, the dialog agent is given a list of 100
candidate answers of each question and asked to rank the list. There are three
kinds of evaluation metrics for retrieval performance: (1) mean rank of human
response, (2) recall@k (i.e., existence of the human response in the top-k ranked responses), and (3) mean reciprocal rank (MRR). Mean rank, recall@k, and MRR
are highly correlated with the rank of human response. In addition, (Das et al.,
2017) proposed to use the robust evaluation metric, normalized discounted cu-
mulative gain (NDCG). NDCG takes into account all relevant answers from the
ranked list, where the relevance scores are densely annotated for VisDial v1.0
test split. NDCG penalizes the lower rank of the candidate answers with high
relevance scores.
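For concreteness, a minimal sketch of how these retrieval metrics can be computed, from the 1-based ranks of the human responses and, for NDCG, from dense relevance annotations, is shown below; it follows the metric definitions rather than the official evaluation code.

```python
import numpy as np


def retrieval_metrics(gt_ranks: np.ndarray) -> dict:
    """Mean rank, MRR, and R@k from the 1-based ranks of the human responses."""
    return {
        "mean_rank": gt_ranks.mean(),
        "mrr": (1.0 / gt_ranks).mean(),
        **{f"r@{k}": (gt_ranks <= k).mean() for k in (1, 5, 10)},
    }


def ndcg(scores: np.ndarray, relevance: np.ndarray, k: int) -> float:
    """NDCG for one ranked list of candidates with dense relevance scores
    (k is typically the number of relevant answers for that question)."""
    order = np.argsort(-scores)                        # ranking induced by the model
    discounts = 1.0 / np.log2(np.arange(2, k + 2))     # discount for positions 1..k
    dcg = (relevance[order][:k] * discounts).sum()
    ideal = (np.sort(relevance)[::-1][:k] * discounts).sum()
    return float(dcg / ideal) if ideal > 0 else 0.0
```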
4.3 Implementation Details
We use PyTorch (http://pytorch.org) to implement our proposed model. Hyperparameters are summarized in Table 4.1.

Table 4.1: Hyperparameters for Dual Attention Networks

Parameter                              Value
mini-batch size                        64
dimension of image features            2048
dimension of hidden states in LSTM     512
dimension of word embedding            300
number of attention heads              4
optimizer & initial learning rate      Adam, 0.001
number of epochs                       12
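Expressed as a configuration dictionary, these settings might look like the following; the key names are ours and do not correspond to the released code.

```python
# Hypothetical configuration mirroring Table 4.1 (plus d_ref and d_find from Sections 3.2-3.3).
DAN_CONFIG = {
    "batch_size": 64,
    "img_feat_dim": 2048,       # V, bottom-up attention features
    "lstm_hidden_dim": 512,     # L
    "word_emb_dim": 300,        # GloVe
    "num_attention_heads": 4,   # h
    "d_ref": 256,
    "d_find": 1024,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "num_epochs": 12,
}
```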
4.4 Quantitative Results
4.4.1 Comparison with State-of-the-Art
We compare our proposed model with the state-of-the-art approaches on Vis-
Dial v1.0 and v0.9 datasets, which can be categorized into three groups: (1)
Fusion-based approaches (LF and HRE (Das et al., 2017)), (2) Attention-based
approaches (MN (Das et al., 2017), HCIAE (Lu et al., 2017) and CoAtt (Wu
et al., 2018)), and (3) Approaches that deal with visual reference resolution in
VisDial (AMEM (Seo et al., 2017), and CorefNMN (Kottur et al., 2018)). Our
proposed model belongs to the third category. As shown in Table 4.2, DAN
significantly outperforms all other approaches on NDCG, MRR, and R@1, in-
cluding the previous state-of-the-art method, Synergistic (Guo et al., 2019).
Specifically, DAN improves approximately 0.27% on NDCG and 1.73% on R@1
on the VisDial v1.0 dataset. The results indicate that DAN ranks higher than all other methods on both the single ground-truth answer (R@1) and all relevant answers on average (NDCG).
4.4.2 Comparison with Baseline
We first define the questions that contain one or more pronouns (i.e., it, its,
they, their, them, these, those, this, that, he, his, him, she, her) as the semanti-
cally incomplete (SI) questions. Likewise, we define the questions that do not contain pronouns as semantically complete (SC) questions. Then, we check
the contribution of the reference-aware representations for the SC and SI ques-
tions, respectively.
Table 4.2: Retrieval performance on VisDial v1.0 and v0.9 datasets, measured by normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR), recall@k (R@k), and mean rank. The higher the better for NDCG, MRR, and R@k, while the lower the better for mean rank. DAN outperforms all other models across NDCG, MRR, and R@1 on both datasets. NDCG is not supported in v0.9 dataset.

                                      VisDial v1.0 (test-std)                    VisDial v0.9 (val)
Model                                 NDCG   MRR    R@1    R@5    R@10   Mean    MRR    R@1    R@5    R@10   Mean
LF (Das et al., 2017)                 45.31  55.42  40.95  72.45  82.83  5.95    58.07  43.82  74.68  84.07  5.78
HRE (Das et al., 2017)                45.46  54.16  39.93  70.45  81.50  6.41    58.46  44.67  74.50  84.22  5.72
MN (Das et al., 2017)                 47.50  55.49  40.98  72.30  83.30  5.92    59.65  45.55  76.22  85.37  5.46
HCIAE (Lu et al., 2017)               -      -      -      -      -      -       62.22  48.48  78.75  87.59  4.81
AMEM (Seo et al., 2017)               -      -      -      -      -      -       62.27  48.53  78.66  87.43  4.86
CoAtt (Wu et al., 2018)               -      -      -      -      -      -       63.98  50.29  80.71  88.81  4.47
CorefNMN (Kottur et al., 2018)        54.70  61.50  47.55  78.10  88.80  4.40    64.10  50.92  80.18  88.81  4.45
Synergistic (Guo et al., 2019)        57.32  62.20  47.90  80.43  89.95  4.17    -      -      -      -      -
DAN (ours)                            57.59  63.20  49.63  79.75  89.35  4.30    66.38  53.33  82.42  90.38  4.04
Table 4.3: VisDial v1.0 validation performance on the semantically complete
(SC) and incomplete (SI) questions. We observe that SI questions obtain more
benefits from the dialog history than SC questions.
Model MRR R@1 R@5 R@10 Mean
SC
No REFER 61.85 47.80 79.10 88.43 4.49
DAN 64.81 51.22 81.63 90.19 4.03
Improvements 2.96 3.42 2.53 1.76 0.46
SI
No REFER 58.44 44.38 75.36 85.48 5.36
DAN 61.77 48.13 78.43 87.81 4.70
Improvements 3.33 3.75 3.07 2.33 0.66
Specifically, we make a comparison between DAN, which utilizes reference-aware representations (i.e., $e_t^{ref}$), and No REFER, which exploits question representations (i.e., $q_t$) only. From Table 4.3, we draw three
observations: (1) DAN shows significantly better results than the No REFER
model for SC questions. It validates that the context from dialog history enriches
the question information, even when the question is semantically complete. (2)
SI questions obtain more benefits from the dialog history than SC questions.
It indicates that DAN is more robust to the SI questions than SC questions.
(3) A dialog agent faces greater difficulty in answering SI questions compared
to SC questions. No REFER is equivalent to the FIND + RPN model in the
ablation study section.
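A simple way to realize this split is to scan each question for the listed pronouns; the sketch below is our own reading of that rule, not the exact preprocessing script.

```python
import re

# Pronoun list used to flag semantically incomplete (SI) questions (Sec. 4.4.2).
PRONOUNS = {"it", "its", "they", "their", "them", "these", "those",
            "this", "that", "he", "his", "him", "she", "her"}


def is_semantically_incomplete(question: str) -> bool:
    """True if the question contains at least one pronoun from the list above."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(token in PRONOUNS for token in tokens)


# e.g., is_semantically_incomplete("What color is it?")                  -> True  (SI)
#       is_semantically_incomplete("How many people are in the image?") -> False (SC)
```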
4.5 Qualitative Results
In this section, we visualize the inference mechanism of our proposed model.
Figure 4.1 shows the qualitative results of DAN.

Figure 4.1: Qualitative results on the VisDial v1.0 dataset. We visualize the attention over dialog history from REFER network and the visual attention from FIND network. The object detection features with the top five attention weights are marked with colored boxes. A red colored box indicates the most salient visual feature. Also, the attention from REFER network is represented as shading; darker shading indicates a larger attention weight for each element of the dialog history. Our proposed model not only responds with the correct answer, but also selectively pays attention to the previous dialogs and salient image regions.

Given a question that needs to be clarified, DAN correctly answers the question by selectively attending to each element of the dialog history and salient image regions. In the case of the visual attention, we mark the object detection features with the top five attention weights for each image. On the other hand, the attention weights from REFER network
are represented as shading; darker shading indicates the larger attention weight
for each element of the dialog history. These attention weights are calculated
by averaging over all the attention heads.
4.6 Ablation Studies
In this section, we perform an ablation study on the VisDial v1.0 validation split with the following four model variants: (1) a model using only a single attention network, (2) a model that uses different image features (pre-trained VGG-16), (3) a model that does not use the residual connection in REFER network, and (4) models that stack the REFER network up to four layers, each with a different number of attention heads.
4.6.1 Single network
The first four rows in Table 4.4 show the performance of a single network. FIND
denotes the use of FIND network only, and REFER denotes the use of single-
layer REFER network only. Specifically, REFER uses the output of REFER
network as the encoder outputs. On the other hand, FIND does not take the
reference-aware representations (i.e., $e_t^{ref}$) but the question feature (i.e., $q_t$).
The single models show relatively poor performance compared with the dual
network model. We believe that the results validate two hypotheses: (1) VisDial
task requires contextual information from dialog history as well as the visually-
grounded information. (2) REFER and FIND networks have complementary
modeling abilities.
4.6.2 Image Features in FIND network
To report the impact of image features, we replace the bottom-up attention
features (Anderson et al., 2018) with ImageNet pre-trained VGG-16 (Simonyan
and Zisserman, 2014) features. In detail, we use the output of the VGG-16
pool5 layer as image features. In Table 4.4, RPN denotes the use of the region
proposal networks (Ren et al., 2015) which are equivalent to the use of bottom-
up attention features. Similar to VQA task, we observe that DAN with bottom-
up attention features achieves better performance than with VGG-16 features.
In other words, the use of object-level features boosts the MRR performance of
DAN.
4.6.3 Residual Connection in REFER Network
We also conduct an ablation study to investigate the effectiveness of the residual
connection in REFER network. As shown in Table 4.4, the use of the residual
connection (i.e., Res) boosts the MRR score of DAN. In other words, DAN
utilizes the excellence of deep residual learning as in (He et al., 2016, Kim
et al., 2016, Rocktaschel et al., 2015, Vaswani et al., 2017, Yang et al., 2016).
4.6.4 Stack of REFER Networks & Attention Heads
We stack the REFER networks up to four layers, each with a different number of attention heads, $h \in \{1, 2, 4, 8, 16, 32, 64\}$. In other words, we conduct the
ablation experiments with twenty-eight models to set the hyperparameters of
our model. Figure 4.2 shows the results of the ablation experiments. For n ≥ 2,
REFER (n) indicates that DAN uses a stack of n identical REFER networks.
Specifically, for each pair of successive networks, the output of the previous
REFER network is fed into the next REFER network as a query (i.e., qt).
Due to the small number of elements in each dialog history, the overall perfor-
mance pattern shows a tendency to decrease as the number of attention heads
increases. It turns out that the two-layer REFER network with four attention
heads (i.e., REFER (2) and h = 4) performs the best among all models in
ablation study, recording 64.17% on MRR.
Table 4.4: Ablation studies on VisDial v1.0 validation split. Res and RPN denote
the residual connection and the region proposal networks, respectively.
Model MRR Score
FIND 57.85
FIND + RPN 60.80
REFER 57.18
REFER + Res 58.69
REFER + FIND 60.98
REFER + Res + FIND 61.86
REFER + FIND + RPN 63.47
REFER + Res + FIND + RPN 63.88
Figure 4.2: Ablation study on a different number of attention heads and REFER
stacks. REFER (n) indicates that DAN uses a stack of n identical REFER
networks.
Chapter 5
Conclusion
We introduce Dual Attention Networks (DAN) for visual reference resolution
in the visual dialog task. DAN explicitly divides the visual reference resolution
problem into a two-step process by employing REFER and FIND networks. In
place of relying on the previous visual attention maps as in previous works,
DAN first linguistically resolves ambiguous references in a given question by
using REFER network. Then, it grounds the resolved references in the image
by using FIND network. We empirically validate our proposed model on Vis-
Dial v1.0 and v0.9 datasets. First, we compare DAN with other state-of-the-art
methods on the VisDial v1.0 test-standard split, showing that our model signifi-
cantly outperforms them on core metrics (i.e., NDCG, MRR, and R@1). Then,
we compare DAN with our baseline model on semantically complete and in-
complete question types. Our model turns out to be robust to semantically incomplete questions, boosting the overall performance by 3.33% (MRR). Finally, we
conduct an ablation study to validate the effectiveness of our proposed model
and visualize the qualitative results. DAN gracefully integrates the REFER and FIND networks, while being simpler and more grounded. Furthermore, in terms of applicability, we believe that the algorithms of our proposed model can be applied to areas dealing with multimodal content aligned over time, such as video story understanding agents and audio-visual speech recognition.
Future Study The present study solely focused on the latent relationship
between the two pairs: (1) the given question and the dialog history, and (2) the
contextualized question and the given image. First, our proposed model does not consider the order of the dialog in (1), treating the dialog history as position-invariant information: swapping the order of the dialog history's elements does not affect the performance. However, the position of words or utterances is regarded as important information in natural language processing (Shaw et al., 2018, Vaswani et al., 2017). In this respect, we believe that applying a positional embedding would be a promising future research direction. Finally, we assume that all questions in the visual dialog dataset demand context information. However, there could be some questions that can be answered without context, such as “How many people are in the image?”. To this end, deciding whether a given question needs context or not would be a fundamental step in this
area.
Bibliography
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image
captioning and visual question answering. In CVPR, 2018.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Ba-
tra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In
Proceedings of the IEEE international conference on computer vision, pages
2425–2433, 2015.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine
translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and
Yoshua Bengio. Attention-based models for speech recognition. In Advances
in neural information processing systems, pages 577–585, 2015.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav,
Jose MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition,
volume 2, 2017.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell,
and Marcus Rohrbach. Multimodal compact bilinear pooling for visual ques-
tion answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh.
Making the v in vqa matter: Elevating the role of image understanding in vi-
sual question answering. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6904–6913, 2017.
Dalu Guo, Chang Xu, and Dacheng Tao. Image-question-answer synergistic
network for visual dialog. arXiv preprint arXiv:1902.09774, 2019.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolu-
tional localization networks for dense captioning. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4565–4574,
2016.
Gi-Cheon Kang, Jaeseo Lim, and Byoung-Tak Zhang. Dual attention networks
for visual reference resolution in visual dialog. In Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing, pages 2024–
2033, 2019.
Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim,
Jung-Woo Ha, and Byoung-Tak Zhang. Multimodal residual learning for
visual qa. In Advances in neural information processing systems, pages 361–
369, 2016.
Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention net-
works. In Advances in Neural Information Processing Systems, pages 1564–
1574, 2018.
Satwik Kottur, Jose MF Moura, Devi Parikh, Dhruv Batra, and Marcus
Rohrbach. Visual coreference resolution in visual dialog using neural module
networks. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 153–169, 2018.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma,
et al. Visual genome: Connecting language and vision using crowdsourced
dense image annotations. International Journal of Computer Vision, 123(1):
32–73, 2017.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common
objects in context. In European conference on computer vision, pages 740–
755. Springer, 2014.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-
image co-attention for visual question answering. In Advances In Neural
Information Processing Systems, pages 289–297, 2016.
Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, and Dhruv Batra. Best
of both worlds: Transferring knowledge from discriminative learning to a
generative visual dialog model. In Advances in Neural Information Processing
Systems, pages 314–324, 2017.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7219–7228, 2018.
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective ap-
proaches to attention-based neural machine translation. arXiv preprint
arXiv:1508.04025, 2015.
Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global
vectors for word representation. In Proceedings of the 2014 conference on em-
pirical methods in natural language processing (EMNLP), pages 1532–1543,
2014.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards
real-time object detection with region proposal networks. In Advances in
neural information processing systems, pages 91–99, 2015.
Tim Rocktaschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky,
and Phil Blunsom. Reasoning about entailment with neural attention. arXiv
preprint arXiv:1509.06664, 2015.
Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. Visual
reference resolution using attention memory for visual dialog. In Advances
in neural information processing systems, pages 3719–3729, 2017.
Claire Sergent, Christian C Ruff, Antoine Barbot, Jon Driver, and Geraint
Rees. Top–down modulation of human early visual cortex after stimulus offset
supports successful postcued report. Journal of Cognitive Neuroscience, 23
(8):1921–1934, 2011.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative
position representations. arXiv preprint arXiv:1803.02155, 2018.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
George Sperling. The information available in brief visual presentations. Psy-
chological monographs: General and applied, 74(11):1, 1960.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you
need. In Advances in Neural Information Processing Systems, pages 5998–
6008, 2017.
Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, and Anton van den Hengel.
Are you talking to me? reasoned visual dialog generation through adversarial
learning. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 6106–6115, 2018.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural
image caption generation with visual attention. In International conference
on machine learning, pages 2048–2057, 2015.
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked
attention networks for image question answering. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 21–29, 2016.
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-
attention networks for visual question answering. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 6281–6290,
2019.
Albert Zeyer, Kazuki Irie, Ralf Schluter, and Hermann Ney. Improved train-
ing of end-to-end attention models for speech recognition. arXiv preprint
arXiv:1805.03294, 2018.
Abstract (in Korean)

Thanks to recent advances in computer vision and natural language processing, research on artificial intelligence systems that jointly understand visual and natural language information is being actively conducted. To reduce the gap between human-level understanding of vision and language and the performance of current AI systems, utilizing visually-grounded information and understanding the subtle nuances of human conversation have become important problems. Visual dialog (Das et al., 2017) is a machine learning task that requires an AI agent to answer a series of questions related to an image. The visual dialog dataset contains a large number of images and multiple rounds of question-answer pairs per image. For example, the agent must answer semantically inter-dependent questions such as “How many people are in the picture?” and “Are they indoors or outdoors?”.

This study introduces a deep neural network-based learning algorithm for the visual dialog task. Specifically, based on our previous work (Kang et al., 2019), we address the visual reference resolution problem within visual dialog. Visual reference resolution refers to the problem of clarifying the meaning of linguistic expressions that are ambiguous on their own and mapping them to local regions of the image. Visual reference resolution thus involves the two important challenges mentioned above. Previous studies addressed visual reference resolution using techniques such as attention memory (Seo et al., 2017) and neural module networks (Kottur et al., 2018). What these techniques have in common is that they store all the visual attention information from previous dialogs. However, according to research on the human memory system, visual sensory memory decays rapidly, so such modeling lacks cognitive-scientific and biological grounding. Based on this motivation, we propose Dual Attention Networks, which do not rely on the visual attention of previous dialogs. The Dual Attention Networks consist of a REFER network and a FIND network. The REFER network learns the semantic relationship between a given question and the previous dialog. The FIND network takes the output of the REFER network and the image representation as input and performs visual grounding. Using these two attention mechanisms, we expect the dialog agent, when given a semantically ambiguous question, to clarify the meaning of the question and find the answer in the given image.

As a result, the Dual Attention Networks placed 3rd overall in the Visual Dialog Challenge 2019, and as a single model achieved state-of-the-art performance as of November 2019, the time of publication.

Keywords: visual dialog, multi-modal, attention, deep learning, visual reference resolution
Student Number: 2018-23580