Text Generation: From the
Perspective of Interactive Inference
Jiajun Zhang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
www.nlpr.ia.ac.cn/cip/jjzhang.htm
Outline
Background
Bidirectional Interactive Inference
Interactive Inference for Two Tasks
Summary and Future Challenges
BERT: Bidirectional Understanding
[Figure: bidirectional representation learning over the input sequence x_1 x_2 ⋯ x_n, followed by linear classification]
• 12 Layers, 110M Params
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In NAACL-HLT 2019 (Best Paper).
BERT for Classification
[Figure: [CLS] x_1 x_2 ⋯ x_n → representation learning → C, h_1, ⋯, h_n; the [CLS] vector C feeds a linear classifier over the class labels]
softmax(c_i) = exp(c_i) / Σ_k exp(c_k)
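The softmax above can be sketched in a few lines of NumPy (a minimal illustration, not BERT's actual implementation; the max-subtraction is only for numerical stability):

```python
import numpy as np

def softmax(c):
    # exp(c_i) / sum_k exp(c_k); subtracting the max avoids overflow
    e = np.exp(c - np.max(c))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical class scores from the linear layer
probs = softmax(logits)             # a distribution over class labels
```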
BERT for Sequence Labeling
[Figure: [CLS] x_1 x_2 ⋯ x_n → representation learning → C, h_1, ⋯, h_n; each position h_i feeds a linear classifier over the B/I/O labels]
softmax(c_i) = exp(c_i) / Σ_k exp(c_k)
Reasons behind BERT Success
• Corpus: 2.5B-word Wikipedia and 800M-word BooksCorpus
• Architecture: the same model for pre-training and fine-tuning
• Model: deep bidirectional Transformer encoder
• Optimization: masked LM and next-sentence prediction
[Figure: [CLS] x_1^1 ⋯ [SEP] x_1^2 ⋯ x_n^2 → bidirectional Transformer encoder → C, h_1^1, ⋯, h_n^2; pre-training followed by fine-tuning on a specific task]
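The masked-LM objective can be illustrated with a small sketch (a simplification: real BERT replaces 80% of selected tokens with [MASK], 10% with random tokens, and leaves 10% unchanged; the positions here are chosen by hand):

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    # Hide the tokens at the given positions; the model is trained to
    # recover them from bidirectional context.
    targets = {i: tokens[i] for i in positions}
    masked = [mask_token if i in targets else t for i, t in enumerate(tokens)]
    return masked, targets

masked, targets = mask_tokens("the man went to the store".split(), positions=[1, 5])
```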
BERT vs. GPT (Generative Pre-trained Transformer)
[Figure: x_1 x_2 x_3 ⋯ x_n → unidirectional Transformer decoder → h_1 h_2 h_3 ⋯ h_n; pre-training followed by fine-tuning on a specific task]
• Architecture: the same model for pre-training and fine-tuning
• Model: deep unidirectional Transformer decoder
• Optimization: traditional language model, Σ_j log p(x_j | x_<j; θ_Transformer)
• Corpus: 800M-word BooksCorpus
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.
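The language-model objective Σ_j log p(x_j | x_<j) can be made concrete with a toy next-token table standing in for the Transformer's conditional distribution (the probabilities below are made up for illustration):

```python
import math

# toy conditional distribution p(x_j | x_{j-1}); a stand-in for the
# Transformer's p(x_j | x_<j; theta)
bigram = {
    ("<s>", "there"): 0.5,
    ("there", "are"): 0.8,
    ("are", "five"): 0.4,
}

def log_likelihood(tokens):
    # sum_j log p(x_j | x_<j)
    return sum(math.log(bigram[(a, b)]) for a, b in zip(tokens, tokens[1:]))

ll = log_likelihood(["<s>", "there", "are", "five"])
```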
BERT Ablation Study
• Left-to-right LM
• Fine-tuning with BiLSTM
• The more layers, the better
Bidirectional encoder is the key!
From Understanding to Generation
[Figure: the classification and sequence labeling problems (BERT-style encoder over [CLS] x_1 x_2 ⋯ x_n) contrasted with the sequence generation problem (an encoder over x_1 x_2 ⋯ x_n plus a decoder producing y_1 y_2 y_3 ⋯ y_m)]
P(y|x) = ∏_{i=1}^{m} p(y_i | y_1⋯y_{i-1}, x_1⋯x_n)
Text Generation
• Machine translation (机器翻译)
• Human-machine dialogue (人机对话)
• Automatic summarization (自动摘要)
• Headline generation (标题生成)
Beam Search for
Unidirectional Inference
Transformer: Best Unidirectional Text Generation Framework
[Figure: Encoder and Decoder stacks with encoder self-attention, decoder self-attention, and encoder-decoder attention]
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS 2017.
Transformer
[Figure: step-by-step decoding of the source 有 五个 人 。 ("There are five persons."): the bidirectional encoder reads the whole source; the unidirectional decoder starts from <s> and emits "there", "are", "five", "persons" one token at a time, feeding each output back as the next input]
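The decoding loop illustrated above can be sketched as greedy autoregressive inference. The lookup table below is a hypothetical deterministic stand-in for the Transformer decoder's argmax over the vocabulary:

```python
def greedy_decode(step_fn, bos="<s>", eos="</s>", max_len=10):
    # Standard unidirectional inference: feed the growing prefix back in,
    # take the most likely next token, stop at the end symbol.
    out = [bos]
    for _ in range(max_len):
        nxt = step_fn(out)
        out.append(nxt)
        if nxt == eos:
            break
    return out

# hypothetical "model": prefix -> next token
TABLE = {("<s>",): "there",
         ("<s>", "there"): "are",
         ("<s>", "there", "are"): "five",
         ("<s>", "there", "are", "five"): "persons",
         ("<s>", "there", "are", "five", "persons"): "</s>"}

hyp = greedy_decode(lambda prefix: TABLE[tuple(prefix)])
```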
Attention for Encoder
• Attention (Example)
[Figure: the layer-l states for 有 五个 人 are projected with W^Q, W^K, W^V into queries, keys, and values; every position attends to all positions to produce the layer-(l+1) states]
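The attention step on this slide is scaled dot-product attention, softmax(QK^T/√d_k)V. A minimal NumPy sketch (the learned W^Q/W^K/W^V projections are omitted, so Q = K = V here):

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V -- every position attends to all positions
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 tokens with d=4 states, as on the slide
out, w = attention(X, X, X)   # self-attention over the three positions
```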
Attention for Decoder
• Attention (Example)
[Figure: at layer l the decoder states for <s> and "there" attend only to already-generated positions when predicting "are"; the Q/K/V projections work as in the encoder]
Cannot utilize future information.
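The "cannot utilize future information" constraint is enforced with a causal (lower-triangular) mask before the softmax, as in this small sketch:

```python
import numpy as np

def causal_mask(n):
    # position i may attend only to positions <= i
    return np.tril(np.ones((n, n), dtype=bool))

def masked_weights(scores, mask):
    # masked positions get -inf, so their softmax weight is exactly zero
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# uniform scores over 3 decoder positions: each row spreads its weight
# only over the past, never the future
w = masked_weights(np.zeros((3, 3)), causal_mask(3))
```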
Problems for Unidirectional Inference
[Figure: source x_0 x_1 ⋯ x_m with two decoders: left-to-right decoding produces y_0^l2r y_1^l2r ⋯ y_i^l2r ⋯; right-to-left decoding produces y_{n'-1}^r2l y_{n'-2}^r2l ⋯ y_{i'}^r2l ⋯]
• Left-to-right decoding cannot access the right contexts.
• Right-to-left decoding cannot access the left contexts.
Problems: Unbalanced Outputs
Source: 捷克总统哈维卸任新总统仍未确定
Reference: czech president havel steps down while new president still not chosen
L2R: czech president leaves office
R2L: the outgoing president of the czech republic is still uncertain
Source: 他们正在研制一种超大型的叫做炸弹之母。
Reference: they are developing a kind of super-huge bomb called the mother of bombs .
L2R: they are developing a super , big , mother , called the bomb .
R2L: they are working on a much larger mother called the mother of a bomb .
Problems: Unbalanced Outputs
• Statistical Analysis
Model   The first 4 tokens   The last 4 tokens
L2R     40.21%               35.10%
R2L     35.67%               39.47%
Table: Translation accuracy of the first 4 tokens and the last 4 tokens on the NIST Chinese-English translation tasks.
How to effectively utilize bidirectional decoding?
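The statistic in the table above can be approximated by a position-aligned match rate over the first and last k tokens (a simplification of the actual evaluation, shown on the L2R example sentence from the previous slide):

```python
def edge_accuracy(hyp, ref, k=4):
    # token match rate on the first k and the last k positions,
    # aligned by position (a rough proxy for the slide's statistic)
    first = sum(h == r for h, r in zip(hyp[:k], ref[:k])) / k
    last = sum(h == r for h, r in zip(hyp[-k:], ref[-k:])) / k
    return first, last

hyp = "they are developing a super big mother called the bomb .".split()
ref = "they are developing a kind of super-huge bomb called the mother of bombs .".split()
first, last = edge_accuracy(hyp, ref)  # L2R: strong prefix, weak suffix
```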
Outline
Background
Bidirectional Interactive Inference
Interactive Inference for Two Tasks
Summary and Future Challenges
Solution 1: Bidirectional Agreement from the Perspective of Loss Function
• Agreement
[Liu et al., 2016] Agreement on Target-bidirectional Neural Machine Translation. NAACL.
[Zhang et al., 2019] Regularizing Neural Machine Translation by Target-Bidirectional Agreement. AAAI.
Drawbacks: two separate L2R and R2L models; no interaction between the two inference directions.
Solution 2: Neural System Combination from the Perspective of Ensemble
• NSC-NMT
[Figure: three encoders, over the source text, the L2R translation, and the R2L translation, feed a combination decoder through attention]
[Zhou et al., 2017] Neural System Combination for Machine Translation. ACL.
Drawbacks: not an end-to-end model; it cannot directly optimize the encoder and decoder of a single NMT system.
Solution 3: Asynchronous Bidirectional Decoding from the Perspective of Model Integration
• ABD-NMT
[Zhang et al., 2018] Asynchronous Bidirectional Decoding for Neural Machine Translation. AAAI.
Drawbacks:
(1) It still requires two NMT models or decoders.
(2) Only the forward decoder can use information from the backward decoder.
Question: How can bidirectional decoding be used more effectively and efficiently?
Solution 4
Synchronous Bidirectional Neural Machine Translation
Long Zhou, Jiajun Zhang and Chengqing Zong.
Transactions of the ACL, 2019.
Synchronous Bidirectional Neural Machine Translation
[Figure: over the source x_0 x_1 ⋯ x_m, at each time step T_0, T_1, T_2, ⋯ the L2R decoder emits y_0^l2r, y_1^l2r, y_2^l2r, ⋯, y_{n-1}^l2r while the R2L decoder simultaneously emits y_{n'-1}^r2l, y_{n'-2}^r2l, ⋯, y_0^r2l, each direction attending to the other's outputs so far]
P(y|x) = ∏_{i=0}^{n-1} p(y_i | y_0⋯y_{i-1}, x, ȳ_0⋯ȳ_{i-1})   if L2R
P(y|x) = ∏_{i=0}^{n'-1} p(ȳ_i | ȳ_0⋯ȳ_{i-1}, x, y_0⋯y_{i-1})  if R2L
L2R (R2L) inference not only uses its previously generated outputs, but also uses the future contexts predicted by R2L (L2R) decoding.
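The synchronous decoding schedule can be sketched as a loop in which both directions emit one token per step, each step seeing its own history and the other direction's history. The step functions below are hypothetical deterministic stand-ins for the two decoders (a real model would score the vocabulary given both prefixes):

```python
def sb_decode(l2r_step, r2l_step, steps):
    # One token per direction per time step; each step is given its own
    # prefix AND the opposite direction's prefix, mirroring
    # p(y_i | y_<i, x, ybar_<i) and p(ybar_i | ybar_<i, x, y_<i).
    l2r, r2l = [], []
    for _ in range(steps):
        y_f = l2r_step(l2r, r2l)
        y_b = r2l_step(r2l, l2r)
        l2r.append(y_f)
        r2l.append(y_b)
    return l2r, r2l

# hypothetical per-step outputs (the stand-ins ignore the cross prefix)
fwd = ["there", "are", "five"]
bwd = [".", "persons", "five"]
l2r, r2l = sb_decode(lambda own, other: fwd[len(own)],
                     lambda own, other: bwd[len(own)], steps=3)
```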
Synchronous Bidirectional Attention
[Figure: over encoder states h_0 h_1 h_2 ⋯ h_{m-1}, the decoder keeps an L2R history y_0 y_1 ⋯ y_{i-1} with states z_0 z_1 ⋯ z_{i-1} and an R2L history ȳ_0 ȳ_1 ⋯ ȳ_{i-1} with states z̄_0 z̄_1 ⋯ z̄_{i-1}; when predicting y_i, the query attends to both histories, producing a forward result H⃗_i and a backward result H⃖_i]
z_i = fuse(H⃗_i, H⃖_i)
Synchronous Bidirectional Attention
• Synchronous Bidirectional Dot-Product Attention (SBDPA)
[Figure: two parallel branches, each Matmul → Scale → Mask → Softmax → Matmul, one over (Q, K, V) from the L2R history and one over (Q, K̄, V̄) from the R2L history; their outputs H_i^f and H_i^b are merged by a Fusion layer into H_i]
Fusion options:
• Linear interpolation
• Nonlinear interpolation (tanh, ReLU)
• Gate mechanism
We refer to the whole procedure as SBDPA(·).
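The fusion options can be illustrated with small NumPy sketches. These are hedged, illustrative forms (the exact formulas are in the TACL 2019 paper); the gate weight `w` below is a hypothetical parameter:

```python
import numpy as np

def fuse_linear(h_f, h_b, lam=0.5):
    # linear interpolation of the forward and backward attention results
    return h_f + lam * h_b

def fuse_gate(h_f, h_b, w):
    # gate mechanism: g = sigmoid(w . [h_f; h_b]); H = g*h_f + (1-g)*h_b
    g = 1.0 / (1.0 + np.exp(-np.dot(w, np.concatenate([h_f, h_b]))))
    return g * h_f + (1.0 - g) * h_b

h_f, h_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lin = fuse_linear(h_f, h_b)            # [1.0, 0.5]
gated = fuse_gate(h_f, h_b, np.zeros(4))  # g = 0.5 -> even mix
```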
Synchronous Multi-Head Attention
• Synchronous Bidirectional Multi-Head Attention
[Figure: h parallel synchronous bidirectional dot-product attention heads over linearly projected (Q, K, V) and (K̄, V̄), concatenated and passed through a final linear layer]
Note that all parameters are the same as in the standard multi-head attention model.
Synchronous Bidirectional Attention
• Integrating Bidirectional Attention into NMT
[Figure: Transformer architecture in which the decoder's multi-head intra-attention is replaced by the proposed synchronous bidirectional attention; inputs and outputs (L2R & R2L, shifted right) with positional encodings, N encoder and N decoder layers with Add&Norm and FeedForward sublayers, then Linear + Softmax over the output (L2R & R2L) probabilities]
Note that all bidirectional information flow in the decoder runs in parallel.
Synchronous Bidirectional Beam Search Algorithm
[Figure: one beam, half of whose hypotheses start with <l2r> and half with <r2l> (padded with <pad>); at each step T=0, 1, 2, ⋯ both halves are expanded together through SBAtt, so the L2R and R2L hypotheses interact during search]
Training
• Training Objective Function
32
Training
32
Inference
Training
32
⋯𝑥0 𝑥1 𝑥𝑚𝑥𝑖 ⋯
𝒚𝟎𝒍𝟐𝒓 𝑦1
𝑙2𝑟 𝒚𝒏−𝟏𝒍𝟐𝒓𝑦2
𝑙2𝑟 ⋯
𝒚𝒏′−𝟏𝒓𝟐𝒍 𝑦𝑛′−2
𝑟2𝑙 𝒚𝟎𝒓𝟐𝒍𝑦𝑛′−3
𝑟2𝑙 ⋯
𝑇0 𝑇1 𝑇2
Inference
Training
32
⋯𝑥0 𝑥1 𝑥𝑚𝑥𝑖 ⋯
𝒚𝟎𝒍𝟐𝒓 𝑦1
𝑙2𝑟 𝒚𝒏−𝟏𝒍𝟐𝒓𝑦2
𝑙2𝑟 ⋯
𝒚𝒏′−𝟏𝒓𝟐𝒍 𝑦𝑛′−2
𝑟2𝑙 𝒚𝟎𝒓𝟐𝒍𝑦𝑛′−3
𝑟2𝑙 ⋯
𝑇0 𝑇1 𝑇2
Inference
Training
1 2 1
1 2 1
src: x , x , ..., x , x
tgt: y , y , ..., y , y
m m
n n
Training
32
⋯𝑥0 𝑥1 𝑥𝑚𝑥𝑖 ⋯
𝒚𝟎𝒍𝟐𝒓 𝑦1
𝑙2𝑟 𝒚𝒏−𝟏𝒍𝟐𝒓𝑦2
𝑙2𝑟 ⋯
𝒚𝒏′−𝟏𝒓𝟐𝒍 𝑦𝑛′−2
𝑟2𝑙 𝒚𝟎𝒓𝟐𝒍𝑦𝑛′−3
𝑟2𝑙 ⋯
𝑇0 𝑇1 𝑇2
⋯𝑥0 𝑥1 𝑥𝑚𝑥𝑖 ⋯
𝒚𝟎𝒍𝟐𝒓 𝑦1
𝑙2𝑟 𝒚𝒏−𝟏𝒍𝟐𝒓𝑦2
𝑙2𝑟 ⋯
𝒚𝒏−𝟏𝒓𝟐𝒍 𝑦𝑛−2
𝑟2𝑙 𝒚𝟎𝒓𝟐𝒍𝑦𝑛−3
𝑟2𝑙 ⋯
𝑇0 𝑇1 𝑇2
Inference
Training
1 2 1
1 2 1
src: x , x , ..., x , x
tgt: y , y , ..., y , y
m m
n n
Training
32
⋯𝑥0 𝑥1 𝑥𝑚𝑥𝑖 ⋯
𝒚𝟎𝒍𝟐𝒓 𝑦1
𝑙2𝑟 𝒚𝒏−𝟏𝒍𝟐𝒓𝑦2
𝑙2𝑟 ⋯
𝒚𝒏′−𝟏𝒓𝟐𝒍 𝑦𝑛′−2
𝑟2𝑙 𝒚𝟎𝒓𝟐𝒍𝑦𝑛′−3
𝑟2𝑙 ⋯
𝑇0 𝑇1 𝑇2
⋯𝑥0 𝑥1 𝑥𝑚𝑥𝑖 ⋯
𝒚𝟎𝒍𝟐𝒓 𝑦1
𝑙2𝑟 𝒚𝒏−𝟏𝒍𝟐𝒓𝑦2
𝑙2𝑟 ⋯
𝒚𝒏−𝟏𝒓𝟐𝒍 𝑦𝑛−2
𝑟2𝑙 𝒚𝟎𝒓𝟐𝒍𝑦𝑛−3
𝑟2𝑙 ⋯
𝑇0 𝑇1 𝑇2
Inference
Training
1 2 1
1 2 1
src: x , x , ..., x , x
tgt: y , y , ..., y , y
m m
n n
Question: Mismatch between Training
and Inference
Training Strategy 1
J(θ) = Σ_{t=1}^{T} [ log p(y^(t) | x^(t)) + log p(ȳ^(t) | x^(t)) ]
Two-pass method:
First pass: train separate L2R and R2L models, then use them to decode the source side of the bitext, producing {(x^(t), ȳ*^(t))}_{t=1}^{T} and {(x^(t), y*^(t))}_{t=1}^{T}.
Second pass: use ȳ*_{<i}^(t) instead of ȳ_{<i}^(t) to compute p(y_i^(t) | y_{<i}^(t), x^(t), ȳ*_{<i}^(t)), and similarly for p(ȳ_i^(t) | ȳ_{<i}^(t), x^(t), y*_{<i}^(t)).
Problem: Too Time-Consuming
Without interaction, the model reduces to:
P(y|x) = ∏_{i=0}^{n-1} p(y_i | y_0⋯y_{i-1}, x)    if L2R
P(y|x) = ∏_{i=0}^{n'-1} p(ȳ_i | ȳ_0⋯ȳ_{i-1}, x)  if R2L
Training Strategy 2
Fine-tuning method:
Bidirectional inference without interaction: train the SBNMT model with no interaction. The learned SBNMT performs L2R and R2L decoding on the source side of the bitext, producing {(x^(t), y*^(t))}_{t=1}^{T} and {(x^(t), ȳ*^(t))}_{t=1}^{T}.
Fine-tuning with interaction: use ȳ*_{<i}^(t) instead of ȳ_{<i}^(t) to compute p(y_i^(t) | y_{<i}^(t), x^(t), ȳ*_{<i}^(t)).
Experiments: Machine Translation
• Setup
Dataset:
(1) NIST Chinese-English translation (2M sentence pairs, 30K tokens, MT03-06 as test sets)
(2) WMT14 English-German translation (4.5M sentence pairs, 37K shared tokens, newstest2014 as test set)
Training details:
(1) Transformer_big setting
(2) Chinese-English: 1 GPU, single model, case-insensitive BLEU
(3) English-German: 3 GPUs, model averaging, case-sensitive BLEU
Experiments: Machine Translation
• Baselines
Moses: an open-source phrase-based SMT system.
RNMT: RNN-based NMT with the default setting.
Transformer: predicts the target sentence from left to right.
Transformer (R2L): predicts the target sentence from right to left.
Rerank-NMT: (1) first runs beam search to obtain two k-best lists; (2) then re-scores them and picks the best candidate.
ABD-NMT: (1) uses the backward decoder to generate reverse sequence states; (2) performs beam search on the forward decoder to find the best translation.
Experiments: Machine Translation
• Results on Chinese-English Translation
Table: Evaluation of translation quality for the Chinese-English translation tasks with case-insensitive BLEU scores.
Experiments: Machine Translation
• Results on English-German Translation
Table: Results of English-German translation using case-sensitive BLEU.
Strong baselines; our model improves over them by +1.49 BLEU.
Experiments: Image Caption
• Setup
Dataset:
(1) Flickr30k (Young et al., 2014)
(2) 29,000 image-caption pairs for training
(3) 1,014 for validation and 2,000 for test
Baselines:
(1) VGGNet encoder + LSTM decoder (Xu et al., 2015)
(2) Transformer
Experiments: Image Caption
• Results on English Image Caption (BLEU score)
Method             Validation   Test
Xu et al. (2015)   ~            19.90
Transformer        22.11        21.25
Ours               23.27        22.41
MT Analysis: Unbalanced Outputs
Figure: Translation accuracy of the first 4 tokens and last 4 tokens for L2R, R2L, Rerank-NMT, ABD-NMT and our proposed model.
Our model achieves the best translation accuracy on both the first 4 words and the last 4 words.
MT Analysis: BLEU along Length
• Analysis: Effect of Long Sentences
Figure: Performance of translations on the test set with respect to the lengths of the source sentences.
MT Analysis: Case Study
L2R produces good prefixes, whereas R2L generates better suffixes.
Our approach can make full use of bidirectional decoding and produce balanced outputs in these cases.
MT Analysis: Parameters and Speeds
Table: Statistics of parameters, training and testing speeds. Train denotes the number of global training steps processed per second; Test indicates the number of translated sentences per second.
• No additional parameters except for λ
• Slightly slower than the baseline Transformer
Beyond Synchronous Bidirectional Decoding: Improving Efficiency
Sequence Generation: from Both Sides to the Middle
Long Zhou, Jiajun Zhang, Chengqing Zong and Heng Yu.
In Proceedings of IJCAI 2019.
Sequence Generation from Both Sides to the Middle
• SBSG: Synchronous Bidirectional Sequence Generation
Speed up decoding: generates two tokens at a time
Improve quality: relies on both history and future context
[Figure: left-to-right decoding emits y_1, y_2, ⋯, y_{n/2} while right-to-left decoding simultaneously emits y_n, y_{n-1}, ⋯, y_{n/2+1}; the two directions meet in the middle at t = n/2]
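The both-sides-to-the-middle schedule can be sketched as follows; the per-direction token lists are hypothetical stand-ins for what the two decoding directions would produce, and the point is that an n-token output needs only n/2 steps:

```python
def sbsg_decode(left, right):
    # Emit one token from each end per step and meet in the middle:
    # positions 1..n/2 come from the L2R stream, n..n/2+1 from R2L.
    out_l, out_r = [], []
    while left or right:
        if left:
            out_l.append(left.pop(0))   # next L2R token
        if right:
            out_r.append(right.pop(0))  # next R2L token
    return out_l + out_r[::-1]          # reverse the R2L half to read left-to-right

# hypothetical per-direction outputs for a 6-token target
seq = sbsg_decode(["y1", "y2", "y3"], ["y6", "y5", "y4"])
```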
Sequence Generation from Both Sides to the Middle
• Training and Inference
[Figure: the L2R stream starts from <l2r> and generates y_1, y_2, ⋯, y_{n/2}, <eos>; the R2L stream starts from <r2l> and generates y_n, y_{n-1}, ⋯, y_{n/2+1}, <eos>; <null> pads the shorter stream]
Following previous work, we also use knowledge distillation techniques to train our model.
Training objective:
The smoothing model:
Experiments on Machine Translation
Inference speed:
Translation quality:
Experiments on Text Summarization
• Application to Text Summarization
Example:
Setup:
(1) English Gigaword dataset (3.8M training pairs, 189K dev set, DUC2004 as our test set)
(2) shared vocabulary of about 90K word types
(3) Transformer_base setting
(4) ROUGE-1, ROUGE-2, ROUGE-L
Experiments on Text Summarization
The proposed model significantly outperforms the conventional Transformer model in terms of both decoding speed and generation quality.
Outline
Background
Bidirectional Interactive Inference
Interactive Inference for Two Tasks
Summary and Future Challenges
Interactive Inference for Two Tasks
Synchronously Generating Two Languages with Interactive Decoding
Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu and Chengqing Zong.
In Proceedings of EMNLP 2019.
From Generating Two Directions to Generating Two Languages
[Figure: bidirectional decoding of one target (left-to-right y_0^l2r y_1^l2r ⋯ vs. right-to-left y_{n'-1}^r2l y_{n'-2}^r2l ⋯) generalizes to decoding two languages: English-to-Chinese outputs y_0 y_1 ⋯ y_i ⋯ and English-to-Japanese outputs z_0 z_1 ⋯ z_j ⋯ generated from the same source x_0 x_1 ⋯ x_m]
Conventional Multilingual Translation
• Separate encoder or decoder networks
• Shared encoder or decoder network
• Shared with partial parameters
Synchronously Generating Two Languages with Interactive Decoding
Synchronous Self-Attention Model:
[Figure: one shared encoder feeds two decoders that interact with each other during decoding]
Synchronous Bi-language Attention
[Figure: over encoder states h_0 h_1 h_2 ⋯ h_{m-1}, one decoder produces y_0 y_1 ⋯ y_{i-1} y_i with states s_0 s_1 ⋯ s_{i-1} s_i and the other produces z_0 z_1 ⋯ z_{i-1} z_i with states t_0 t_1 ⋯ t_{i-1} t_i; each decoder attends to its own history and to the other language's history]
H_i = fuse(H_i^self, H_i^other)
Some Experiments
1. Large Scale: WMT14 subset
        En-De   En-Fr
Train   2.43M   2.43M
Test    3003    3003
2. Small Scale: IWSLT
        En-Ja   En-Zh   En-De   En-Fr
Train   223K    231K    206K    233K
Test    3003    3003    1305    1306
Training Data Construction
[Figure: one source x_0 x_1 ⋯ x_m with English-to-Chinese outputs y_0 y_1 ⋯ y_i ⋯ and English-to-Japanese outputs z_0 z_1 ⋯ z_j ⋯]
Training instance format requirement: a trilingual translation example (x, y, z) in which (x, y) and (x, z) are parallel sentences.
Training Data Construction
Step 1 (train): (x1, y1) → Model M1; (x2, y2) → Model M2
Step 2 (decode): M2 translates x1 → (x1, y1*); M1 translates x2 → (x2, y2*)
Step 3 (combination): (x1, y1, y1*) ∪ (x2, y2*, y2)
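The three steps above can be sketched as pseudo-trilingual data construction. The toy lambdas stand in for the two trained bilingual models M1 and M2, and the example sentence pairs are hypothetical:

```python
def build_trilingual(bitext_zh, bitext_ja, m_en2zh, m_en2ja):
    # Step 1 assumes the two bilingual models are already trained.
    # Step 2: decode the "missing" language for each corpus.
    # Step 3: take the union of the two pseudo-trilingual corpora.
    tri = []
    for x, y in bitext_zh:              # (x1, y1): gold Chinese side
        tri.append((x, y, m_en2ja(x)))  # y1*: pseudo Japanese
    for x, z in bitext_ja:              # (x2, y2): gold Japanese side
        tri.append((x, m_en2zh(x), z))  # y2*: pseudo Chinese
    return tri

tri = build_trilingual([("hello", "你好")], [("thanks", "ありがとう")],
                       m_en2zh=lambda s: "zh(" + s + ")",
                       m_en2ja=lambda s: "ja(" + s + ")")
```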
Main Results
• Indiv: systems learned with bilingual training data
• Multi: shared encoder-decoder networks
• Sync-Trans significantly outperforms Indiv and Multi on English-Chinese/Japanese and English-German/French
Main Results
• Sync-Trans significantly outperforms Indiv and Multi on the large-scale WMT dataset
From Bidirection to Two Tasks
63
[Figure: bidirectional decoding — source x0 … xm decoded left-to-right and right-to-left]
[Figure: two-language decoding — the same source x0 … xm decoded English-to-Chinese (y0 … yi) and English-to-Japanese (z0 … zj)]
Interactive Inference for Two Other Tasks
64
[Figure: speech recognition (y0 … yi) and speech-to-text translation (z0 … zj) from one speech input]
[Figure: one image captioned in English (x0 … xi) and in Chinese (y0 … yj)]
65
• Speech Features
Original signal → (Mel filtering) → Filter bank → (discrete cosine transform) → MFCC
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
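The feature pipeline above (framing, power spectrum, Mel filter bank, log, DCT) can be sketched in NumPy/SciPy. The window, hop, and filter counts below are illustrative assumptions, not values from the talk.

```python
import numpy as np
from scipy.fftpack import dct

# Minimal sketch of: original signal -> framing -> power spectrum
# -> Mel filter bank -> log -> DCT -> MFCC. Parameter values are
# illustrative assumptions.
def mel_filterbank(n_filters, n_fft, sr):
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)   # Hz -> Mel
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)   # Mel -> Hz
    hz = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=40, n_ceps=13):
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft       # power spectrum
    fbank = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    return dct(fbank, type=2, axis=1, norm='ortho')[:, :n_ceps]   # keep low cepstra

sig = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise at 16 kHz
feats = mfcc(sig)
print(feats.shape)  # (98, 13): 98 frames, 13 cepstral coefficients
```

Keeping only the log filter-bank output (`fbank`) yields the "Filter bank" features from the slide; applying the DCT yields MFCC.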
66
• Overall Architecture
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
[Figure: one speech encoder feeding two interacting decoders — Speech Recognition and Speech-to-Text Translation]
67
• Interactive Attention
H1 = Attention(Q1, K1, V1)
H2 = Attention(Q1, K2, V2)
Hfinal = Fusion(H1, H2)
• Fusion
Linear Interpolation: Hfinal = λ1 · H1 + λ2 · H2
Nonlinear Interpolation: Hfinal = λ1 · H1 + λ2 · tanh(H2)
Gate Interpolation: Hfinal = r ⊙ H1 + z ⊙ H2, where r, z = σ(W[H1; H2])
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
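The interactive-attention step with gate interpolation can be sketched in NumPy. The single-head, unbatched form and the random shapes are simplifying assumptions, not the paper's full multi-head implementation.

```python
import numpy as np

# Sketch of interactive attention: queries Q1 from one decoder attend both
# to its own states (K1, V1) and to the other decoder's states (K2, V2);
# the two results are merged with the gate interpolation r*H1 + z*H2.
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def interactive_attention(Q1, K1, V1, K2, V2, W):
    H1 = attention(Q1, K1, V1)             # self branch
    H2 = attention(Q1, K2, V2)             # cross branch into the other decoder
    gates = 1 / (1 + np.exp(-np.concatenate([H1, H2], axis=-1) @ W.T))
    r, z = np.split(gates, 2, axis=-1)     # r, z = sigma(W [H1; H2])
    return r * H1 + z * H2                 # gate interpolation

rng = np.random.default_rng(0)
n, d = 5, 8                                # 5 decoder steps, model width 8
Q1, K1, V1, K2, V2 = (rng.standard_normal((n, d)) for _ in range(5))
W = rng.standard_normal((2 * d, 2 * d))
H = interactive_attention(Q1, K1, V1, K2, V2, W)
print(H.shape)  # (5, 8)
```

Swapping the last line of `interactive_attention` for `l1 * H1 + l2 * H2` or `l1 * H1 + l2 * np.tanh(H2)` gives the linear and nonlinear interpolation variants from the slide.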
68
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
• Experimental Setup
Dataset: TED En-Fr, En-Zh
Training details:
(1) Transformer_big setting
(2) English-Chinese: 2 GPUs, character-level BLEU
(3) English-French: 2 GPUs, tokenized BLEU
69
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
• Data Size
Corpus                   Split  Segments  Hours   Frames (src/seg)  Words (src/seg)  Words (tgt/seg)
Fisher/Callhome (En-Es)  train  138,819   138:00  762               20.7             20.3
                         dev      3,961     2:00  673               17.9             17.9
                         test     3,641     3:44  657               18.3             18.3
TED (En-Zh/En-Fr)        train  305,971   527:00  662               17.9             20.3
                         dev      1,148     2:23  659               18.2             17.9
                         test     1,223     2:37  624               18.3             18.1
70
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
• Baselines
Pipeline: Transformer ASR + Transformer MT
Pre-trained E2E: pre-train on ASR, fine-tune on ST
Multi-task: ASR + ST with a shared encoder
Two-stage: (1) the first decoder generates the transcription sequence; (2) its output is fed into the second decoder
71
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
• Evaluation Metrics
• ASR Metric: WER = (S + D + I) / N × 100%
REF: 各位 来宾 *  各位 合作 伙伴 媒体界 的 朋友们 下午 好
ASR: 各位 来宾 个 各位 合作 伙伴 媒体界 *  朋友们 刚  好
(a Chinese ASR alignment example: 个 is an insertion, 的 is deleted, 下午 → 刚 is a substitution)
• MT Metric: BLEU
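The WER formula can be computed with a standard word-level edit-distance dynamic program; a minimal sketch, not tied to any ASR toolkit:

```python
# Minimal WER via word-level edit distance: S, D and I are counted jointly
# as the minimum number of substitutions, deletions and insertions needed
# to turn the hypothesis into the reference.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)

score = wer("a b c d", "a x c")  # one substitution (b->x) + one deletion (d)
print(score)  # 50.0
```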
72
Interactive Inference for Speech Recognition
and Speech-to-Text Translation
Model        En-De            En-Fr            En-Zh            En-Ja
             WER(↓)  BLEU(↑)  WER(↓)  BLEU(↑)  WER(↓)  BLEU(↑)  WER(↓)  BLEU(↑)
MT           /       22.19    /       30.68    /       25.01    /       22.93
Pipeline     14.29   19.50    14.20   26.62    14.20   21.52    14.21   20.87
E2E          14.29   16.07    14.20   27.63    14.20   19.15    14.21   16.59
Multi-task   14.20   19.08    13.04   28.71    13.43   20.60    14.01   18.73
Two-stage    14.27   20.08    13.34   30.08    13.55   20.29    13.85   19.32
Interactive  14.16   21.11    12.58   29.79    13.38   21.68    13.52   20.06
• Overall Results
73
Interactive Inference for Image Caption in
Two Languages
[Figure: one image decoded into an English caption (y0 … yi) and a German caption (z0 … zj)]
Dataset:
(1) Multi30k (Elliott et al., 2016): English and German captions
(2) 29,000 image-caption pairs for training
(3) 1,014 for validation and 2,000 for test
Baselines:
(1) VGGNet encoder + LSTM decoder (Xu et al., 2015)
(2) Transformer
74
• Results on English and German Image Captions
BLEU score
Experiments: Image Caption in Two Languages
Method            English  German
Xu et al. (2015)  19.90    –
Jaffe (2017)      –        11.84
Transformer       21.25    13.55
Ours              22.54    15.49
Unified Text Generation from Text,
Speech and Image
75
[Figure: (a) Text, Speech or Image encoding — e.g., the sentence "I had no idea what was coming", a speech signal, or an image, fed to a shared Encoder;
(b) Conventional text generation — one decoder emits y1 y2 … ym token by token at steps T = 1, 2, 3, …;
(c) Synchronous interactive text generation — two decoders emit y1 … ym and z1 … zn in parallel, interacting at every step]
And Beyond …
76
• Why not interactive inference for multi-task learning?
77
Outline
Background
Bidirectional Interactive Inference
Interactive Inference for Two Tasks
Summary and Future Challenges
78
• The synchronous bidirectional inference model takes full advantage of both the history and the future information provided by bidirectional decoding states, achieving promising results.
• The bidirectional inference model can be further extended to infer from both sides toward the middle, improving efficiency.
• The bidirectional inference model can be generalized to generate two languages synchronously and interactively.
• Our main code is available at https://github.com/ZNLP/sb-nmt. Feel free to have a try!
Summary
Message: Synchronous Interactive Inference
May Reshape Multi-task Generation
79
• How to generalize the interactive inference idea to multi-task problems in which three or more tasks are involved?
• How to perform efficient training without generating pseudo parallel instances?
• How to effectively combine bidirectional inference with multi-task interactive inference?
Future Challenges
80
1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of
deep bidirectional transformers for language understanding. In NAACL-HLT 2019 (Best Paper).
2. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving
language understanding by Generative Pre-Training. Technical report, OpenAI.
3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS-2017.
4. Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Agreement on target-
bidirectional neural machine translation. NAACL-2016.
5. Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Tong Xu. Regularizing neural
machine translation by target-bidirectional agreement. AAAI-2019.
6. Long Zhou, Wenpeng Hu, Jiajun Zhang, and Chengqing Zong. Neural system combination for
machine translation. ACL-2017.
7. Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous
bidirectional decoding for neural machine translation. AAAI-2018.
8. Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous Bidirectional Neural Machine
Translation. TACL-2019.
Reference
81
9. Long Zhou, Jiajun Zhang, Chengqing Zong and Heng Yu. Sequence Generation: From Both
Sides to the Middle. IJCAI-2019.
10. Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu and Chengqing Zong. Synchronously
Generating Two Languages with Interactive Decoding. EMNLP-2019.
11. Jiajun Zhang, Long Zhou, Yang Zhao and Chengqing Zong. Synchronous Bidirectional
Inference for Neural Sequence Generation. arXiv preprint arXiv:1902.08955.
12. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov,
Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with
visual attention. ICML-2015.
13. Desmond Elliott, Stella Frank, Khalil Sima’an, Lucia Specia. Multi30K: Multilingual English-
German Image Descriptions. Proceedings of the 5th Workshop on Vision and Language. 2016.
14. Alan Jaffe. Generating Image Descriptions using Multilingual Data. Proceedings of the Second
Conference on Machine Translation 2017.
15. Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to
visual denotations: New similarity metrics for semantic inference over event descriptions. TACL-
2014.
Reference
谢谢!Thanks!