
Text Generation: From the Perspective of Interactive Inference

张家俊 (Jiajun Zhang)

National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences

jjzhang@nlpr.ia.ac.cn

www.nlpr.ia.ac.cn/cip/jjzhang.htm


Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges

BERT: Bidirectional Understanding

[Figure: the input sequence x1 x2 x3 … xn is fed to a 12-layer, 110M-parameter bidirectional Transformer encoder for representation learning, followed by a linear classification layer]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019 (Best Paper).

BERT for Classification

[Figure: the input [CLS] x1 x2 … xn is mapped to representations C, h1, h2, …, hn; for a classification problem, the [CLS] representation C is fed to a linear classifier over the class labels]

softmax(c_i) = exp(c_i) / Σ_k exp(c_k)
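As a concrete illustration of the linear classification layer, here is a minimal NumPy sketch; the pooled [CLS] vector c, the weight matrix W and the bias b below are illustrative toy values, not BERT parameters.

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability before exponentiating.
        z = logits - logits.max()
        e = np.exp(z)
        return e / e.sum()

    # Toy [CLS] representation C (hidden size 4) and a 3-class linear layer.
    c = np.array([0.2, -0.1, 0.4, 0.3])          # pooled [CLS] vector
    W = np.random.RandomState(0).randn(3, 4)     # class weights (hypothetical)
    b = np.zeros(3)

    probs = softmax(W @ c + b)                   # p(class | sentence)
    print(probs, probs.argmax())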

BERT for Sequence Labeling

[Figure: the input [CLS] x1 x2 … xn is mapped to representations C, h1, …, hn; for a sequence labeling problem, each token representation h_i is fed to a linear classifier over the B/I/O labels]

softmax(c_i) = exp(c_i) / Σ_k exp(c_k)

Reasons behind BERT Success

• Corpus: 2.5B-word Wikipedia and 800M-word Books
• Architecture: Same Model for Pre-training and Fine-tuning
• Model: Deep Bidirectional Transformer Encoder
• Optimization: Masked LM and Next Sentence Prediction

[Figure: a sentence pair [CLS] x^1_1 … x^1_n [SEP] x^2_1 … x^2_n is fed to the bidirectional Transformer encoder, producing C, h^1_1, …, h^2_n; the same encoder is first pre-trained and then fine-tuned on a specific task]

BERT vs. GPT (Generative Pre-trained Transformer)

[Figure: the input sequence x1 x2 x3 … xn is processed left-to-right by a unidirectional Transformer decoder, producing h1 h2 h3 … hn; the same model is pre-trained and then fine-tuned on a specific task]

• Architecture: Same Model for Pre-training and Fine-tuning
• Model: Deep Unidirectional Transformer Decoder
• Optimization: Traditional Language Model, maximizing Σ_j log p(x_j | x_<j; θ_Transformer)
• Corpus: 800M-word Books

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by Generative Pre-Training. Technical report, OpenAI.
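To make the language-model objective Σ_j log p(x_j | x_<j; θ) concrete, a toy NumPy sketch follows; the probability table stands in for the Transformer decoder and the token ids are made up.

    import numpy as np

    # Hypothetical next-token distributions over a 5-word vocabulary,
    # one row per position j (i.e., p(. | x_<j) as produced by a decoder).
    probs = np.array([
        [0.10, 0.60, 0.10, 0.10, 0.10],
        [0.20, 0.20, 0.50, 0.05, 0.05],
        [0.05, 0.05, 0.10, 0.70, 0.10],
    ])
    x = [1, 2, 3]   # the tokens that actually occur at each position

    log_likelihood = sum(np.log(probs[j, x[j]]) for j in range(len(x)))
    print(log_likelihood)   # the quantity maximized during pre-training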

BERT Ablation Study

[Figure: ablation results comparing a left-to-right LM, a left-to-right LM fine-tuned with a BiLSTM, and different encoder depths — the more layers, the better]

Bidirectional Encoder is the Key!

From Understanding to Generation

[Figure: the same pre-trained representation supports three problem types — a classification problem (predict a class label from the [CLS] representation C), a sequence labeling problem (predict B/I/O labels for every token), and a sequence generation problem (map the input x1 x2 … xn to an output sequence y1 y2 y3 … ym)]

P(y|x) = ∏_{i=1}^{m} p(y_i | y_1 ⋯ y_{i-1}, x_1 ⋯ x_n)
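The factorization above means a generation model scores the target one token at a time. A small illustrative sketch, where next_token_probs is a hypothetical stand-in for a real encoder-decoder model:

    import numpy as np

    def next_token_probs(x_tokens, y_prefix, vocab_size=4):
        # Hypothetical stand-in for p(. | y_<i, x): any real NMT model goes here.
        rng = np.random.RandomState(len(y_prefix) + len(x_tokens))
        p = rng.rand(vocab_size)
        return p / p.sum()

    def sequence_log_prob(x_tokens, y_tokens):
        # log P(y|x) = sum_i log p(y_i | y_<i, x)
        total = 0.0
        for i, y_i in enumerate(y_tokens):
            p = next_token_probs(x_tokens, y_tokens[:i])
            total += np.log(p[y_i])
        return total

    print(sequence_log_prob(x_tokens=[3, 1, 2], y_tokens=[0, 2, 1, 3]))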

Text Generation

Typical text generation tasks: machine translation, human-machine dialogue, automatic summarization, and headline generation.

Beam Search for Unidirectional Inference

Transformer: Best Unidirectional Text Generation Framework

[Figure: encoder-decoder architecture with encoder self-attention, decoder self-attention, and encoder-decoder attention]

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS-2017.

Transformer

[Figure: step-by-step decoding example — the bidirectional encoder reads the source sentence 有 五个 人 。, and the unidirectional decoder generates the translation token by token: <s> → there → are → five → persons]
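In practice, left-to-right inference with such a model is driven by beam search, which keeps the k best partial translations at every step. A compact sketch follows; next_token_log_probs is a hypothetical stand-in for the Transformer decoder step, and the EOS id, beam size and vocabulary size are arbitrary toy choices.

    import numpy as np

    EOS = 0

    def next_token_log_probs(src, prefix, vocab_size=6):
        # Hypothetical stand-in for log p(. | prefix, src) from a trained decoder.
        rng = np.random.RandomState(hash((tuple(src), tuple(prefix))) % (2**32))
        p = rng.rand(vocab_size)
        return np.log(p / p.sum())

    def beam_search(src, beam_size=3, max_len=10):
        beams = [([], 0.0)]                        # (prefix, cumulative log-prob)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if prefix and prefix[-1] == EOS:   # finished hypotheses are kept as-is
                    candidates.append((prefix, score))
                    continue
                logp = next_token_log_probs(src, prefix)
                for tok in range(len(logp)):
                    candidates.append((prefix + [tok], score + logp[tok]))
            # keep the beam_size best partial translations
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
            if all(p and p[-1] == EOS for p, _ in beams):
                break
        return beams[0]

    print(beam_search([4, 5, 2, 1]))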

Attention for Encoder

• Attention (Example)

[Figure: a worked numerical example of self-attention over the source tokens 有 / 五个 / 人 — the layer-l vectors are projected into queries Q, keys K and values V with a weight matrix W ∈ R^{4×12}, attention weights are computed between all token pairs, and the weighted values give the layer-(l+1) vectors]
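The computation sketched in the figure is standard scaled dot-product self-attention. A minimal NumPy version with random toy matrices (not the numbers from the slide):

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (n, d) layer-l states for the n source tokens, e.g. 有 / 五个 / 人.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])     # (n, n) compatibility scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V                          # layer-(l+1) states

    rng = np.random.RandomState(0)
    X = rng.randn(3, 4)                             # three tokens, hidden size 4
    Wq, Wk, Wv = (rng.randn(4, 4) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)      # (3, 4)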

Attention for Decoder

• Attention (Example)

[Figure: the same worked example for masked self-attention in the decoder over the generated prefix <s> there — when predicting the next token ("are"?), the query at the current position may only attend to the already-generated positions]

Cannot Utilize Future Information
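The restriction to already-generated tokens is implemented with a causal mask that sets the scores of future positions to a large negative value before the softmax. A small sketch with toy values (identity projections are used for brevity):

    import numpy as np

    def masked_self_attention(X):
        n, d = X.shape
        scores = X @ X.T / np.sqrt(d)               # identity projections, for brevity
        mask = np.triu(np.ones((n, n)), k=1)        # 1 above the diagonal = future positions
        scores = np.where(mask == 1, -1e9, scores)  # block attention to the future
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ X

    X = np.random.RandomState(1).randn(2, 4)        # states for the prefix "<s> there"
    out = masked_self_attention(X)
    print(np.round(out, 3))                         # row 0 attends only to itself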

Problems for Unidirectional Inference

• Left-to-Right decoding: when generating y_i^{l2r} from the source x0 x1 … xm, the model cannot access the right (future) contexts.

• Right-to-Left decoding: when generating y_{i'}^{r2l}, the model cannot access the left contexts.

Source:    捷克总统哈维卸任新总统仍未确定
Reference: czech president havel steps down while new president still not chosen
L2R:       czech president leaves office
R2L:       the outgoing president of the czech republic is still uncertain

Source:    他们正在研制一种超大型的叫做炸弹之母。
Reference: they are developing a kind of superhuge bomb called the mother of bombs .
L2R:       they are developing a super , big , mother , called the bomb .
R2L:       they are working on a much larger mother called the mother of a bomb .

Problems: Unbalanced Outputs

• Statistical Analysis

Model | The first 4 tokens | The last 4 tokens
L2R   | 40.21%             | 35.10%
R2L   | 35.67%             | 39.47%

Table: Translation accuracy of the first 4 tokens and the last 4 tokens in NIST Chinese-English translation tasks.

How to effectively utilize bidirectional decoding?

Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges

Solution 1: Bidirectional Agreement from the Perspective of the Loss Function

• Agreement: add a loss term that encourages the L2R and R2L models to agree on their outputs.

Drawbacks: two separate L2R and R2L models; there is no interaction between the two decoding directions during inference.

[Liu et al., 2016] Agreement on Target-bidirectional Neural Machine Translation. NAACL.
[Zhang et al., 2019] Regularizing Neural Machine Translation by Target-Bidirectional Agreement. AAAI.

Solution 2: Neural System Combination from the Perspective of Ensemble

• NSC-NMT

[Figure: neural system combination architecture — separate encoders read the source text and the L2R and R2L translation hypotheses (Encoder_Source, Encoder_L2R, Encoder_R2L), and an attentive decoder combines them to produce the final output]

Drawbacks: it is not an end-to-end model and cannot directly optimize the encoder and decoder of a single NMT system.

[Zhou et al., 2017] Neural System Combination for Machine Translation. ACL.

Solution 3: Asynchronous Bidirectional Decoding from the Perspective of Model Integration

• ABD-NMT

Drawbacks:
(1) This approach still requires two NMT models or decoders.
(2) Only the forward decoder can utilize information from the backward decoder.

Question: How can bidirectional decoding be utilized more effectively and efficiently?

[Zhang et al., 2018] Asynchronous Bidirectional Decoding for Neural Machine Translation. AAAI.

Solution 4

Synchronous Bidirectional Neural Machine Translation
Long Zhou, Jiajun Zhang and Chengqing Zong. Transactions of the ACL (TACL), 2019.

Synchronous Bidirectional Neural Machine Translation

[Figure: given the source x0 x1 … xi … xm, the L2R decoder generates y0^{l2r}, y1^{l2r}, y2^{l2r}, … and the R2L decoder simultaneously generates y_{n'-1}^{r2l}, y_{n'-2}^{r2l}, y_{n'-3}^{r2l}, …; at each decoding step T0, T1, T2, … the two directions exchange the tokens they have just produced]

L2R (R2L) inference not only uses its previously generated outputs, but also uses the future contexts predicted by R2L (L2R) decoding.

P(y|x) = ∏_{i=0}^{n-1} p(y_i^{l2r} | y_0^{l2r} ⋯ y_{i-1}^{l2r}, x, y_0^{r2l} ⋯ y_{i-1}^{r2l})    if L2R

P(y|x) = ∏_{i=0}^{n'-1} p(y_i^{r2l} | y_0^{r2l} ⋯ y_{i-1}^{r2l}, x, y_0^{l2r} ⋯ y_{i-1}^{l2r})    if R2L

Synchronous Bidirectional Attention

[Figure: the encoder produces h0 h1 h2 … h_{m-1} for the source x0 x1 … x_{m-1}; at position i, the forward decoder state attends to the L2R history y0 y1 … y_{i-1}, giving the forward context H→_i, and the backward decoder state attends to the R2L history, giving the backward context H←_i]

z_i = fuse(H→_i, H←_i)

Synchronous Bidirectional Attention

• Synchronous Bidirectional Dot-Product Attention

[Figure: two scaled dot-product attention branches (Matmul → Scale → Mask → Softmax → Matmul), one over the forward (history) states and one over the backward (future) states; their outputs H^f_i and H^b_i are combined by a Fusion module into H_i]

Fusion strategies:
• Linear Interpolation
• Nonlinear Interpolation (e.g., with tanh or ReLU)
• Gate Mechanism

We refer to the whole procedure as SBDPA(·).
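The three fusion strategies can be written in a few lines. The sketch below assumes H_f and H_b are the forward and backward attention outputs for one position; the lambda value and the gate weight matrix W are illustrative, not trained parameters, and the exact gating form in the paper may differ.

    import numpy as np

    rng = np.random.RandomState(0)
    d = 4
    H_f, H_b = rng.randn(d), rng.randn(d)          # forward / backward attention outputs

    # Linear interpolation
    lam = 0.5
    H_linear = H_f + lam * H_b

    # Nonlinear interpolation (squashing the backward branch with tanh)
    H_nonlinear = H_f + lam * np.tanh(H_b)

    # Gate mechanism: the gate decides, per dimension, how much of each branch to keep
    W = rng.randn(d, 2 * d)
    gate = 1.0 / (1.0 + np.exp(-W @ np.concatenate([H_f, H_b])))   # sigmoid
    H_gate = gate * H_f + (1.0 - gate) * H_b

    print(H_linear.shape, H_nonlinear.shape, H_gate.shape)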

Synchronous Multi-Head Attention

• Synchronous Bidirectional Multi-Head Attention

[Figure: the forward and backward queries, keys, and values are linearly projected into h heads, each head applies the synchronous bidirectional dot-product attention, and the h head outputs are concatenated and passed through a final linear layer]

Note that the parameters are the same as in the standard multi-head attention model.

Synchronous Bidirectional Attention

• Integrating Bidirectional Attention into NMT

[Figure: the Transformer architecture with the decoder's masked self-attention replaced by the proposed synchronous bidirectional attention. The encoder stacks N blocks of multi-head intra-attention and feed-forward sublayers (each with Add&Norm); the decoder stacks N blocks of synchronous bidirectional attention, multi-head inter-attention, and feed-forward sublayers. The decoder takes the L2R and R2L outputs (shifted right) plus positional encodings as input, and a shared linear + softmax layer produces the L2R and R2L output probabilities]

Note that all bidirectional information flow in the decoder runs in parallel.

Synchronous Bidirectional Beam Search Algorithm

[Figure: half of the beam starts from <l2r> and the other half from <r2l> (padded with <pad> where needed); at each step T=0, 1, 2, … both halves are expanded in parallel with the synchronous bidirectional attention (SBAtt), so the L2R and R2L hypotheses interact while the beam grows]

Training

• Training Objective Function

Training vs. Inference

[Figure: each training pair consists of src: x_1, x_2, ..., x_{m-1}, x_m and tgt: y_1, y_2, ..., y_{n-1}, y_n. During training, each decoding direction is conditioned on the gold prefix of the other direction (the target and its reverse), whereas during inference each direction is conditioned on the tokens the other direction has actually generated]

Question: Mismatch between Training and Inference

Training Strategy 1

J(θ) = Σ_{t=1}^{T} [ log p(y^{(t)}_{l2r} | x^{(t)}) + log p(y^{(t)}_{r2l} | x^{(t)}) ]

Two-pass method:

First pass: train separate L2R and R2L models with the standard factorization
P(y|x) = ∏_{i=0}^{n-1} p(y_i | y_0 ⋯ y_{i-1}, x)   (for L2R, and analogously over n' steps for R2L),
and use them to decode the source side of the bitext, producing {(x^{(t)}, y*^{(t)}_{l2r})}_{t=1}^{T} and {(x^{(t)}, y*^{(t)}_{r2l})}_{t=1}^{T}.

Second pass: use the decoded prefix y*^{(t)}_{<i} of the opposite direction instead of the gold prefix to compute p(y_i^{(t)} | y_{<i}^{(t)}, x^{(t)}, y*^{(t)}_{<i}), and similarly for the R2L direction.

Problem: Too Time Consuming

Training Strategy 2

Fine-tuning method:

Bidirectional inference without interaction: first train the SBNMT model with no interaction between the two directions. The learned SBNMT model then performs L2R and R2L decoding on the source side of the bitext, producing {(x^{(t)}, y*^{(t)}_{l2r})}_{t=1}^{T} and {(x^{(t)}, y*^{(t)}_{r2l})}_{t=1}^{T}.

Fine-tuning with interaction: use the decoded prefix y*^{(t)}_{<i} of the opposite direction instead of the gold prefix to compute p(y_i^{(t)} | y_{<i}^{(t)}, x^{(t)}, y*^{(t)}_{<i}).

Experiments: Machine Translation

• Setup

Dataset:
(1) NIST Chinese-English translation (2M sentence pairs, 30K-token vocabulary, MT03-06 as test sets)
(2) WMT14 English-German translation (4.5M sentence pairs, 37K shared tokens, newstest2014 as test set)

Training details:
(1) Transformer_big setting
(2) Chinese-English: 1 GPU, single model, case-insensitive BLEU
(3) English-German: 3 GPUs, model averaging, case-sensitive BLEU

Experiments: Machine Translation

• Baselines

Moses: an open-source phrase-based SMT system.
RNMT: RNN-based NMT with the default setting.
Transformer: predicts the target sentence from left to right.
Transformer (R2L): predicts the target sentence from right to left.
Rerank-NMT: (1) first run beam search to obtain two k-best lists; (2) then re-score them and pick the best candidate.
ABD-NMT: (1) use the backward decoder to generate reverse sequence states; (2) perform beam search on the forward decoder to find the best translation.

Experiments: Machine Translation

• Results on Chinese-English Translation

[Table: evaluation of translation quality for the Chinese-English translation tasks with case-insensitive BLEU scores]

Experiments: Machine Translation

• Results on English-German Translation

[Table: results of English-German translation using case-sensitive BLEU against strong baselines; the proposed model improves by +1.49 BLEU]

Experiments: Image Caption

• Setup

Dataset:
(1) Flickr30k (Young et al., 2014)
(2) 29,000 image-caption pairs for training
(3) 1,014 for validation and 2,000 for test

Baselines:
(1) VGGNet encoder + LSTM decoder (Xu et al., 2015)
(2) Transformer

• Results on English Image Caption (BLEU score)

Method           | Validation | Test
Xu et al. (2015) | ~          | 19.90
Transformer      | 22.11      | 21.25
Ours             | 23.27      | 22.41

MT Analysis: Unbalanced Outputs

[Figure: translation accuracy of the first 4 tokens and the last 4 tokens for L2R, R2L, Rerank-NMT, ABD-NMT and our proposed model]

Our model obtains the best translation accuracy for both the first 4 words and the last 4 words.

MT Analysis: BLEU along Length

• Effect of Long Sentences

[Figure: performance of translations on the test set with respect to the lengths of the source sentences]

MT Analysis: Case Study

L2R decoding produces good prefixes, whereas R2L decoding generates better suffixes.

Our approach can make full use of bidirectional decoding and produces balanced outputs in these cases.

MT Analysis: Parameters and Speeds

[Table: statistics of parameters, training and testing speeds. Train denotes the number of global training steps processed per second; Test indicates the number of translated sentences per second]

No additional parameters except for lambda; decoding is only slightly slower than the baseline Transformer.

Beyond Synchronous Bidirectional Decoding: Improving Efficiency

Sequence Generation: From Both Sides to the Middle
Long Zhou, Jiajun Zhang, Chengqing Zong and Heng Yu. In Proceedings of IJCAI 2019.

Sequence Generation from Both Sides to the Middle

• SBSG: Synchronous Bidirectional Sequence Generation

Speed up decoding: generates two tokens at a time.
Improve quality: relies on both history and future context.

[Figure: the left-to-right stream generates y_1, y_2, …, y_{n/2} while the right-to-left stream simultaneously generates y_n, y_{n-1}, …, y_{n/2+1}; decoding ends when the two streams meet in the middle at t = n/2]
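Once both streams stop, the final output is simply the L2R half followed by the reversed R2L half. A tiny illustrative sketch (the token strings are made up):

    # L2R stream output, generated left to right: y1, y2, ..., y_{n/2}
    l2r_half = ["there", "are", "five"]
    # R2L stream output, generated right to left: y_n, y_{n-1}, ..., y_{n/2+1}
    r2l_half = ["</s>", ".", "persons"]

    # The final sequence is the L2R half followed by the reversed R2L half.
    sentence = l2r_half + list(reversed(r2l_half))
    print(" ".join(sentence))   # there are five persons . </s>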

Sequence Generation from Both Sides to the Middle

• Training and Inference

[Figure: the L2R stream is trained to generate y_1 y_2 … y_{n/2} followed by <eos> starting from <l2r>, and the R2L stream to generate y_n y_{n-1} … y_{n/2+1} followed by <eos> starting from <r2l>, with <null> padding where the two halves differ in length]

Following previous work, we also use knowledge distillation techniques to train our model.

Experiments on Machine Translation

[Table: inference speed and translation quality of SBSG compared with the baselines]

Experiments on Text Summarization

• Application to Text Summarization

Setup:
(1) English Gigaword dataset (3.8M training pairs, 189K dev set, DUC2004 as the test set)
(2) shared vocabulary of about 90K word types
(3) Transformer_base setting
(4) ROUGE-1, ROUGE-2, ROUGE-L

The proposed model significantly outperforms the conventional Transformer model in terms of both decoding speed and generation quality.

Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges

Interactive Inference for Two Tasks

Synchronously Generating Two Languages with Interactive Decoding
Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu and Chengqing Zong. In Proceedings of EMNLP 2019.

From Generating Two Directions to Generating Two Languages

[Figure: in bidirectional decoding, one source x0 x1 … xm is decoded left-to-right (y0^{l2r} y1^{l2r} … y_i^{l2r} …) and right-to-left (y_{n'-1}^{r2l} y_{n'-2}^{r2l} … y_{i'}^{r2l} …); in two-language decoding, the same source is decoded into Chinese (y0 y1 … y_i …) and into Japanese (z0 z1 … z_j …) at the same time, i.e. English-to-Chinese and English-to-Japanese decoding]

Conventional Multilingual Translation

• Separate encoder or decoder networks
• Shared encoder or decoder networks
• Sharing with partial parameters

Synchronously Generating Two Languages with Interactive Decoding

Synchronous Self-Attention Model:

[Figure: one shared encoder and two decoders, one per target language, which interact with each other during decoding]

Synchronous Bi-language Attention

[Figure: the encoder produces h0 h1 h2 … h_{m-1} for the source x0 x1 … x_{m-1}; the decoder states s0 s1 … s_i for the first target language y and t0 t1 … t_i for the second target language z attend both to their own history and to the other language's states]

H_i = fuse(H_i^self, H_i^other)

Some Experiments

1. Large Scale (WMT14 subset)

      | En-De | En-Fr
Train | 2.43M | 2.43M
Test  | 3003  | 3003

2. Small Scale (IWSLT)

      | En-Ja | En-Zh | En-De | En-Fr
Train | 223K  | 231K  | 206K  | 233K
Test  | 3003  | 3003  | 1305  | 1306

Training Data Construction

Training instance format requirement: a trilingual translation example (x, y, z) in which (x, y) and (x, z) are parallel sentence pairs.

Step 1 (train): train model M1 on bitext (x1, y1) and model M2 on bitext (x2, y2).
Step 2 (decode): decode x2 with M1 to obtain (x2, y2*), and decode x1 with M2 to obtain (x1, y1*).
Step 3 (combination): merge (x1, y1, y1*) ∪ (x2, y2*, y2) to form the pseudo-trilingual training data.

Main Results

• Indiv: systems learned with bilingual training
• Multi: shared encoder-decoder networks
• Sync-Trans significantly outperforms Indiv and Multi on English-Chinese/Japanese and English-German/French, as well as on the large-scale WMT dataset

From Bidirectional Decoding to Two Tasks

[Figure: the same interactive-inference idea applies both to left-to-right vs. right-to-left decoding of a single output and to decoding two outputs at once, e.g. English-to-Chinese and English-to-Japanese translation]

Interactive Inference for Other Two Tasks

[Figure: two further task pairs — (1) speech recognition and speech-to-text translation generated jointly from the same speech input, and (2) English and Chinese captions generated jointly from the same image]

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Speech Features

[Figure: the original signal is passed through Mel filtering to obtain filter-bank features, and a discrete cosine transform then yields the MFCC features]

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Overall Architecture

[Figure: a single speech encoder with two interacting decoders, one for speech recognition (transcription) and one for speech-to-text translation]

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Interactive Attention

H1 = Attention(Q1, K1, V1)
H2 = Attention(Q1, K2, V2)
Hfinal = Fusion(H1, H2)

• Fusion

Linear Interpolation:    Hfinal = λ1 · H1 + λ2 · H2
Nonlinear Interpolation: Hfinal = λ1 · H1 + λ2 · tanh(H2)
Gate Interpolation:      Hfinal = r ⊙ H1 + z ⊙ H2,  where r, z = σ(W[H1; H2])
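Putting the pieces together, the sketch below shows one interactive-attention step for one decoder: the query Q1 comes from that decoder's own states, the keys and values come either from itself (K1, V1) or from the other decoder (K2, V2), and the two results are merged with the gate interpolation above. All tensors are random toy values and the exact parameterization is illustrative.

    import numpy as np

    def attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)
        return w @ V

    def interactive_attention(Q1, K1, V1, K2, V2, W):
        H1 = attention(Q1, K1, V1)                  # attend to this decoder's own states
        H2 = attention(Q1, K2, V2)                  # attend to the other decoder's states
        r_z = 1.0 / (1.0 + np.exp(-(np.concatenate([H1, H2], axis=-1) @ W)))  # sigmoid gates
        r, z = np.split(r_z, 2, axis=-1)
        return r * H1 + z * H2                      # Hfinal = r ⊙ H1 + z ⊙ H2

    rng = np.random.RandomState(0)
    d, i, j = 4, 3, 5                               # hidden size, own length, other length
    Q1, K1, V1 = rng.randn(i, d), rng.randn(i, d), rng.randn(i, d)
    K2, V2 = rng.randn(j, d), rng.randn(j, d)
    W = rng.randn(2 * d, 2 * d)
    print(interactive_attention(Q1, K1, V1, K2, V2, W).shape)   # (3, 4)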

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Experimental Setup

Dataset: TED En-Fr and En-Zh

Training details:
(1) Transformer_big setting
(2) English-Chinese: 2 GPUs, character-level BLEU
(3) English-French: 2 GPUs, tokenized BLEU

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Data Size

Corpus                  | Split | Segments | Hours  | Frames/segment | Source words/segment | Target words/segment
Fisher/Callhome (En-Es) | train | 138,819  | 138:00 | 762            | 20.7                 | 20.3
Fisher/Callhome (En-Es) | dev   | 3,961    | 2:00   | 673            | 17.9                 | 17.9
Fisher/Callhome (En-Es) | test  | 3,641    | 3:44   | 657            | 18.3                 | 18.3
TED (En-Zh/En-Fr)       | train | 305,971  | 527:00 | 662            | 17.9                 | 20.3
TED (En-Zh/En-Fr)       | dev   | 1,148    | 2:23   | 659            | 18.2                 | 17.9
TED (En-Zh/En-Fr)       | test  | 1,223    | 2:37   | 624            | 18.3                 | 18.1

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Baselines

Pipeline: Transformer ASR + Transformer MT
Pre-trained E2E: pre-train on ASR, fine-tune on ST
Multi-task: ASR + ST with a shared encoder
Two-stage: (1) use the first decoder to generate the transcription sequence; (2) feed the output of the first decoder into the second decoder

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Evaluation Metrics

ASR metric: WER = 100 · (S + D + I) / N %

REF: 各位 来宾 * 各位 合作 伙伴 媒体界 的 朋友们 下午 好
ASR: 各位 来宾 个 各位 合作 伙伴 媒体界 * 朋友们 刚 好
(one insertion I: 个; one deletion D: 的; one substitution S: 下午 → 刚)

MT metric: BLEU
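WER is the word-level edit distance between the reference and the hypothesis (substitutions, deletions, insertions), normalized by the reference length. A short dynamic-programming sketch, using the example above:

    def wer(ref, hyp):
        # Edit distance between word sequences: substitutions, deletions, insertions.
        n, m = len(ref), len(hyp)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                      # i deletions
        for j in range(m + 1):
            d[0][j] = j                      # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return 100.0 * d[n][m] / n

    ref = "各位 来宾 各位 合作 伙伴 媒体界 的 朋友们 下午 好".split()
    hyp = "各位 来宾 个 各位 合作 伙伴 媒体界 朋友们 刚 好".split()
    print(round(wer(ref, hyp), 1))           # 3 errors / 10 reference words = 30.0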

Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Overall Results

Model       | En-De WER(↓) BLEU(↑) | En-Fr WER(↓) BLEU(↑) | En-Zh WER(↓) BLEU(↑) | En-Ja WER(↓) BLEU(↑)
MT          |   /     22.19        |   /     30.68        |   /     25.01        |   /     22.93
Pipeline    | 14.29   19.50        | 14.20   26.62        | 14.20   21.52        | 14.21   20.87
E2E         | 14.29   16.07        | 14.20   27.63        | 14.20   19.15        | 14.21   16.59
Multi-task  | 14.20   19.08        | 13.04   28.71        | 13.43   20.60        | 14.01   18.73
Two-stage   | 14.27   20.08        | 13.34   30.08        | 13.55   20.29        | 13.85   19.32
Interactive | 14.16   21.11        | 12.58   29.79        | 13.38   21.68        | 13.52   20.06

Interactive Inference for Image Caption in Two Languages

[Figure: English and German captions are generated jointly from the same image]

Dataset:
(1) Multi30k (Elliott et al., 2016): English and German captions
(2) 29,000 image-caption pairs for training
(3) 1,014 for validation and 2,000 for test

Baselines:
(1) VGGNet encoder + LSTM decoder (Xu et al., 2015)
(2) Transformer

Experiments: Image Caption in Two Languages

• Results on English and German Image Captions (BLEU score)

Method           | English | German
Xu et al. (2015) | 19.90   | ~
Jaffe (2017)     | ~       | 11.84
Transformer      | 21.25   | 13.55
Ours             | 22.54   | 15.49

Unified Text Generation from Text, Speech and Image

[Figure: (a) a text, speech or image input (e.g. the sentence "I had no idea what was coming", an audio clip, or a picture) is encoded; (b) conventional text generation produces a single sequence y1 y2 y3 y4 … ym token by token (T=1, 2, 3, 4, …); (c) synchronous interactive text generation produces two sequences y1 … ym and z1 … zn in parallel, interacting at every time step]

And Beyond …

• Why not interactive inference for multi-task learning?

Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges

Summary

• The synchronous bidirectional inference model can take full advantage of both the history and the future information provided by the bidirectional decoding states, achieving promising results.

• The bidirectional inference model can be further extended to inference from both sides to the middle to improve efficiency.

• The bidirectional inference model can be generalized to generate two languages synchronously and interactively.

• Our main code is available at https://github.com/ZNLP/sb-nmt. Feel free to have a try!

Message: Synchronous Interactive Inference May Reshape Multi-task Generation

Future Challenges

• How to generalize the interactive inference idea to multi-task problems in which three or more tasks are concerned?

• How to perform efficient training without generating pseudo-parallel instances?

• How to effectively combine bidirectional inference with multi-task interactive inference?

Reference

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019 (Best Paper).
2. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by Generative Pre-Training. Technical report, OpenAI, 2018.
3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS-2017.
4. Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Agreement on target-bidirectional neural machine translation. NAACL-2016.
5. Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Tong Xu. Regularizing neural machine translation by target-bidirectional agreement. AAAI-2019.
6. Long Zhou, Wenpeng Hu, Jiajun Zhang, and Chengqing Zong. Neural system combination for machine translation. ACL-2017.
7. Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous bidirectional decoding for neural machine translation. AAAI-2018.
8. Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous Bidirectional Neural Machine Translation. TACL-2019.
9. Long Zhou, Jiajun Zhang, Chengqing Zong, and Heng Yu. Sequence Generation: From Both Sides to the Middle. IJCAI-2019.
10. Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu, and Chengqing Zong. Synchronously Generating Two Languages with Interactive Decoding. EMNLP-2019.
11. Jiajun Zhang, Long Zhou, Yang Zhao, and Chengqing Zong. Synchronous Bidirectional Inference for Neural Sequence Generation. arXiv preprint arXiv:1902.08955.
12. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML-2015.
13. Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. Multi30K: Multilingual English-German Image Descriptions. Proceedings of the 5th Workshop on Vision and Language, 2016.
14. Alan Jaffe. Generating Image Descriptions using Multilingual Data. Proceedings of the Second Conference on Machine Translation, 2017.
15. Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL-2014.

谢谢! Thanks!