Text Generation: From the Perspective of Interactive Inference
张家俊 (Jiajun Zhang)
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
[email protected]
www.nlpr.ia.ac.cn/cip/jjzhang.htm

Slides: cips-cl.org/static/CCL2019/downloads/tutorialsPPT/04.pdf

Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges

BERT: Bidirectional Understanding

[Figure: input sequence x_1, x_2, x_3, …, x_n → deep bidirectional representation learning → linear classification]

BERT-base: 12 layers, 110M parameters.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019 (Best Paper).

BERT for Classification

[Figure: [CLS], x_1, x_2, …, x_n → C, h_1, h_2, …, h_n → linear classification]

For a classification problem, the [CLS] representation C is mapped to class scores c and normalized with a softmax over the class labels:

softmax(c_i) = exp(c_i) / Σ_k exp(c_k)
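A minimal sketch of this classification head in PyTorch (the framework, the 768-dimensional hidden size, and the two-label setup are illustrative assumptions, not part of the tutorial):

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Linear classifier over the [CLS] representation C."""
    def __init__(self, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_vec: torch.Tensor) -> torch.Tensor:
        # cls_vec: (batch, hidden) -- the [CLS] representation C
        logits = self.linear(cls_vec)           # class scores c
        return torch.softmax(logits, dim=-1)    # exp(c_i) / sum_k exp(c_k)

head = ClsHead()
probs = head(torch.randn(4, 768))               # (4, 2) class probabilities
```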

BERT for Sequence Labeling

[Figure: [CLS], x_1, x_2, …, x_n → C, h_1, h_2, …, h_n → per-token linear classification]

For a sequence labeling problem, each token representation h_i is classified into tags such as B, I, O using the same softmax:

softmax(c_i) = exp(c_i) / Σ_k exp(c_k)

Reasons behind BERT Success

• Corpus: 2.5B-word Wikipedia and 800M-word BooksCorpus
• Architecture: the same model for pre-training and fine-tuning
• Model: deep bidirectional Transformer encoder
• Optimization: masked LM and next-sentence prediction

[Figure: pre-training followed by fine-tuning on a specific task; a sentence pair [CLS] x^1_1 … [SEP] … x^2_n is encoded into C, h^1_1, …, h^2_n by the bidirectional Transformer encoder]

BERT vs. GPT (Generative Pre-trained Transformer)

[Figure: x_1, x_2, x_3, …, x_n → h_1, h_2, h_3, …, h_n; pre-training followed by fine-tuning on a specific task]

• Architecture: the same model for pre-training and fine-tuning
• Model: deep unidirectional Transformer decoder
• Optimization: traditional language modeling, Σ_j log p(x_j | x_<j; θ_Transformer)
• Corpus: 800M-word BooksCorpus
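A minimal sketch of this objective in PyTorch, assuming a hypothetical `model` that maps token ids to next-token logits of shape (batch, seq_len, vocab):

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """Negative of sum_j log p(x_j | x_<j): next-token cross-entropy."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict x_j from x_<j
    logits = model(inputs)                            # (batch, seq_len-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```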


Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

BERT Ablation Study

[Figure: ablation — replacing the bidirectional encoder with a left-to-right LM hurts, even when a BiLSTM is added during fine-tuning; the more layers, the better]


Bidirectional Encoder is the Key!

From Understanding to Generation

[Figure: the same representation-learning view covers three problem types — classification ([CLS], x_1 … x_n → C → class label), sequence labeling (h_1 … h_n → B/I/O tags), and sequence generation (x_1 … x_n → y_1 … y_m)]


P(y|x) = ∏_{i=1}^{m} p(y_i | y_1 ⋯ y_{i-1}, x_1 ⋯ x_n)
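A sketch of scoring a candidate output under this chain-rule factorization; `model(src, prefix)` is a hypothetical encoder-decoder call returning (1, vocab) log-probabilities for the next target token:

```python
import torch

@torch.no_grad()
def sequence_logprob(model, src: torch.Tensor, tgt: torch.Tensor) -> float:
    """log P(y|x) = sum_i log p(y_i | y_1 ... y_{i-1}, x_1 ... x_n)."""
    total = 0.0
    for i in range(1, tgt.size(1)):
        next_logp = model(src, tgt[:, :i])   # hypothetical API
        total += next_logp[0, tgt[0, i]].item()
    return total
```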

Text Generation

Typical text generation tasks include machine translation (机器翻译), human-machine dialogue (人机对话), automatic summarization (自动摘要), and headline generation (标题生成).

Beam Search for Unidirectional Inference

Transformer: Best Unidirectional Text Generation Framework

[Figure: encoder-decoder architecture with encoder self-attention, decoder self-attention, and encoder-decoder (cross) attention]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS 2017.

Transformer Decoding Example

Source: 有 五个 人 。 ("there are five persons")

The bidirectional encoder reads the entire source sentence at once; the unidirectional decoder then generates the translation token by token, feeding each output back in as the next input:

<s> → there → are → five → persons → …
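In code, this step-by-step generation is a simple greedy loop; a minimal sketch assuming a hypothetical `model(src, prefix)` that returns (1, vocab) next-token logits:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src: torch.Tensor, bos: int, eos: int, max_len: int = 50):
    """Left-to-right greedy decoding, mirroring the example above."""
    ys = torch.tensor([[bos]])                 # start from <s>
    for _ in range(max_len):
        logits = model(src, ys)                # hypothetical next-token logits
        next_tok = logits.argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)  # feed the new token back in
        if next_tok.item() == eos:
            break
    return ys[0, 1:]                           # e.g. there, are, five, persons
```

Beam search keeps the k best prefixes at each step instead of only the single argmax.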

Attention for Encoder

• Attention (example)

[Figure: worked example on 有 / 五个 / 人 — the layer-l token vectors are projected by learned matrices W^Q, W^K, W^V ∈ R^{4×12} into queries, keys, and values; every position attends to every position to produce the layer-(l+1) vectors]
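A minimal sketch of the computation the example walks through — scaled dot-product self-attention with learned Q/K/V projections (all dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def attention(x, Wq, Wk, Wv, mask=None):
    """Scaled dot-product self-attention for one layer.

    x: (seq_len, d_in) layer-l vectors; Wq/Wk/Wv: (d_in, d_k) projections.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / K.size(-1) ** 0.5       # (seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V       # weighted sum of the values

# Encoder: every token attends to every token (no mask).
x = torch.randn(3, 4)                          # 有 / 五个 / 人
Wq, Wk, Wv = (torch.randn(4, 12) for _ in range(3))
out = attention(x, Wq, Wk, Wv)                 # (3, 12) layer-(l+1) vectors
```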

Attention for Decoder

• Attention (example)

[Figure: the same computation on the target prefix <s>, there — when predicting "are", the query attends only to <s> and there]

Decoder self-attention cannot utilize future information: each position may attend only to itself and earlier positions.
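This constraint is enforced with a causal mask; a short sketch reusing the `attention` function from the encoder example:

```python
import torch

# Causal mask: True marks future positions (j > i) that must be hidden.
seq_len = 2                                    # <s>, there
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

y = torch.randn(seq_len, 4)                    # decoder layer-l vectors
Wq, Wk, Wv = (torch.randn(4, 12) for _ in range(3))
out = attention(y, Wq, Wk, Wv, mask=causal)    # row i sees only columns <= i
```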

Problems for Unidirectional Inference

Source: x_0, x_1, …, x_j, …, x_m

Left-to-right decoding generates y_0^{l2r}, y_1^{l2r}, …, y_i^{l2r}, … but cannot access the right contexts.
Right-to-left decoding generates y_{n'-1}^{r2l}, y_{n'-2}^{r2l}, …, y_i^{r2l}, … but cannot access the left contexts.

Problems: Unbalanced Outputs

Source:    捷克总统哈维卸任新总统仍未确定
Reference: czech president havel steps down while new president still not chosen
L2R:       czech president leaves office
R2L:       the outgoing president of the czech republic is still uncertain

Source:    他们正在研制一种超大型的叫做炸弹之母。
Reference: they are developing a kind of superhuge bomb called the mother of bombs .
L2R:       they are developing a super , big , mother , called the bomb .
R2L:       they are working on a much larger mother called the mother of a bomb .

• Statistical Analysis

Model | First 4 tokens | Last 4 tokens
L2R   | 40.21%         | 35.10%
R2L   | 35.67%         | 39.47%

Table: Translation accuracy of the first 4 tokens and the last 4 tokens in the NIST Chinese-English translation tasks.


How to effectively utilize bidirectional decoding?

Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges

Solution 1: Bidirectional Agreement from the Perspective of the Loss Function

• Agreement: jointly train the L2R and R2L models with a loss term that encourages their outputs to agree.

[Liu et al., 2016] Agreement on Target-bidirectional Neural Machine Translation. NAACL.
[Zhang et al., 2019] Regularizing Neural Machine Translation by Target-Bidirectional Agreement. AAAI.

Drawbacks: two separate L2R and R2L models; no interaction between the two directions during inference.

Solution 2: Neural System Combination from the Perspective of Ensemble

• NSC-NMT

[Zhou et al., 2017] Neural System Combination for Machine Translation. ACL.

[Figure: combination architecture — three encoders (Encoder_Source Text, Encoder_L2R Text, Encoder_R2L Text) feed a single attention-based decoder]


Drawbacks: not an end-to-end model; it cannot directly optimize the encoder and decoder of a single NMT system.

Solution 3: Asynchronous Bidirectional Decoding from the Perspective of Model Integration

• ABD-NMT: a backward decoder first generates reverse sequence states, and the forward decoder then attends to them during beam search.

[Zhang et al., 2018] Asynchronous Bidirectional Decoding for Neural Machine Translation. AAAI.

Drawbacks:
(1) It still requires two NMT models or decoders.
(2) Only the forward decoder can utilize information from the backward decoder.

Question: how to utilize bidirectional decoding more effectively and efficiently?

Solution 4: Synchronous Bidirectional Neural Machine Translation

Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous Bidirectional Neural Machine Translation. Transactions of the ACL, 2019.

Synchronous Bidirectional Neural Machine Translation

[Figure: given the source x_0, x_1, …, x_i, …, x_m, the L2R and R2L decoders run in lockstep — at step T_0 they emit y_0^{l2r} and y_{n'-1}^{r2l}, at step T_1 they emit y_1^{l2r} and y_{n'-2}^{r2l}, and so on, each direction conditioning on the other's partial output]

P(y|x) = ∏_{i=0}^{n-1}  p(y_i | y_0 ⋯ y_{i-1}, x, ỹ_0 ⋯ ỹ_{i-1})   if L2R
P(y|x) = ∏_{i=0}^{n'-1} p(y_i | y_0 ⋯ y_{i-1}, x, ỹ_0 ⋯ ỹ_{i-1})   if R2L

where ỹ_0 ⋯ ỹ_{i-1} denotes the partial output generated so far by the opposite direction.

L2R (R2L) inference not only uses its own previously generated outputs, but also uses the future contexts predicted by R2L (L2R) decoding.

Synchronous Bidirectional Attention

[Figure: the L2R states z_0, z_1, …, z_i and the R2L states attend to the shared encoder states h_0, h_1, …, h_{m-1}; at step i, each direction computes a history representation H_i over its own previous states and a representation H̃_i over the other direction's previous states]

z_i = fuse(H_i, H̃_i)

• Synchronous Bidirectional Dot-Product Attention (SBDPA)

[Figure: two parallel scaled dot-product attention branches (Matmul → Scale → Mask → Softmax → Matmul) — a forward branch over the direction's own history yielding H_i^f and a backward branch over the reverse direction's history yielding H_i^b — followed by a Fusion module producing H_i]

Fusion options: linear interpolation, nonlinear interpolation (tanh, ReLU), or a gate mechanism.

We refer to the whole procedure as SBDPA(·).
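A minimal sketch of the two-branch attention with the linear-interpolation fusion (one of the three options above); this is an illustrative reconstruction, not the authors' code, and the interpolation weight `lam` is a hypothetical hyperparameter:

```python
import torch
import torch.nn.functional as F

def dot_attn(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def sbdpa(q, k_fwd, v_fwd, k_bwd, v_bwd, lam: float = 0.5, mask=None):
    """Synchronous bidirectional dot-product attention.

    The forward branch attends over this direction's own history and the
    backward branch over the reverse direction's history; the two results
    are fused, here by linear interpolation.
    """
    h_f = dot_attn(q, k_fwd, v_fwd, mask)   # forward branch  -> H_i^f
    h_b = dot_attn(q, k_bwd, v_bwd, mask)   # backward branch -> H_i^b
    return (1 - lam) * h_f + lam * h_b      # fuse(H_i^f, H_i^b)
```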

• Synchronous Bidirectional Multi-Head Attention

[Figure: h parallel heads — linear projections of (Q, K, V) and of the reverse-direction (K̃, Ṽ) feed the synchronous bidirectional dot-product attention model; the head outputs are concatenated and passed through a final linear layer]

Note that all parameters are the same as in the standard multi-head attention model.
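A sketch of the multi-head wrapper, reusing `sbdpa` from above; the dimensions are illustrative, and sharing the K/V projections between the forward and backward branches is an assumption suggested by the note that the parameters match standard multi-head attention:

```python
import torch
import torch.nn as nn

class SyncBidirMultiHead(nn.Module):
    """h heads of SBDPA, concatenated and linearly projected."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)   # assumed shared across branches
        self.wv = nn.Linear(d_model, d_model)
        self.wo = nn.Linear(d_model, d_model)

    def _split(self, x):
        b, n, _ = x.shape                       # -> (batch, h, n, d_k)
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, own_hist, rev_hist, lam=0.5, mask=None):
        Q = self._split(self.wq(q))
        Kf, Vf = self._split(self.wk(own_hist)), self._split(self.wv(own_hist))
        Kb, Vb = self._split(self.wk(rev_hist)), self._split(self.wv(rev_hist))
        out = sbdpa(Q, Kf, Vf, Kb, Vb, lam=lam, mask=mask)  # per-head SBDPA
        b, _, n, _ = out.shape                  # concat heads + final linear
        return self.wo(out.transpose(1, 2).reshape(b, n, self.h * self.d_k))
```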

• Integrating Bidirectional Attention into NMT

[Figure: the standard Transformer layout — N encoder layers (multi-head intra-attention, feed-forward, Add&Norm) and N decoder layers (multi-head inter-attention, feed-forward, Add&Norm) — with the decoder's self-attention replaced by the proposed synchronous bidirectional attention; the decoder consumes the shifted L2R & R2L outputs and produces L2R & R2L output probabilities]

Note that all bidirectional information flow in the decoder runs in parallel.

Synchronous Bidirectional Beam Search Algorithm

[Figure: at T=0 the beam is seeded with the start tags <l2r> and <r2l> (plus <pad>); at every step T=1, 2, … the L2R and R2L half-beams are each extended through SBAtt, so the two directions proceed in lockstep and interact through synchronous bidirectional attention]
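A deliberately simplified sketch of the synchronous decoding loop (beam size 1, greedy, shown for shape only); `model(src, prefix, other=...)` is a hypothetical interface returning next-token logits for one direction given the source and both partial outputs:

```python
import torch

@torch.no_grad()
def sync_bidir_greedy(model, src, l2r_bos: int, r2l_bos: int, eos: int,
                      max_len: int = 50):
    """Both directions decode in lockstep, each step conditioning on the
    other direction's partial output."""
    l2r, r2l = [l2r_bos], [r2l_bos]            # <l2r> and <r2l> start tags
    for _ in range(max_len):
        nxt_f = model(src, l2r, other=r2l).argmax().item()
        nxt_b = model(src, r2l, other=l2r).argmax().item()
        l2r.append(nxt_f)
        r2l.append(nxt_b)
        if nxt_f == eos and nxt_b == eos:
            break
    return l2r[1:], r2l[1:][::-1]              # read the R2L half left-to-right
```

In the full algorithm each half maintains a beam of hypotheses rather than a single one.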

Training

• Training Objective Function

Training vs. Inference

Training data: src: x_1, x_2, …, x_{m-1}, x_m; tgt: y_1, y_2, …, y_{n-1}, y_n.

[Figure: in training, the R2L stream is fed the reversed gold target (y_{n-1}^{r2l}, y_{n-2}^{r2l}, …), so both streams condition on gold contexts of the same length n; in inference, each stream conditions on the other stream's predicted partial output, whose eventual length n' is unknown in advance]

Question: mismatch between training and inference.

Training Strategy 1: Two-Pass Method

J(θ) = Σ_{t=1}^{T} [ log p(y^{l2r,(t)} | x^{(t)}) + log p(y^{r2l,(t)} | x^{(t)}) ]

First pass: train separate L2R and R2L models and use them to decode the source side of the bitext, yielding {(x^{(t)}, y^{l2r*,(t)})}_{t=1}^{T} and {(x^{(t)}, y^{r2l*,(t)})}_{t=1}^{T}.

Second pass: use the decoded prefix y^{r2l*,(t)}_{<i} instead of the gold y^{r2l,(t)}_{<i} when computing p(y^{l2r,(t)}_i | y^{l2r,(t)}_{<i}, x^{(t)}, y^{r2l*,(t)}_{<i}), and similarly for the R2L direction.


Problem: too time-consuming.

Training Strategy 2: Fine-Tuning Method

Step 1 — bidirectional inference without interaction: train an SBNMT model with no interaction between the two directions, i.e. optimizing

P(y|x) = ∏_{i=0}^{n-1}  p(y_i | y_0 ⋯ y_{i-1}, x)   if L2R
P(y|x) = ∏_{i=0}^{n'-1} p(y_i | y_0 ⋯ y_{i-1}, x)   if R2L

The learned SBNMT then performs L2R and R2L decoding on the source side of the bitext, yielding {(x^{(t)}, y^{l2r*,(t)})}_{t=1}^{T} and {(x^{(t)}, y^{r2l*,(t)})}_{t=1}^{T}.

Step 2 — fine-tuning with interaction: use the decoded prefix y^{*,(t)}_{<i} of the opposite direction instead of the gold prefix to compute p(y^{(t)}_i | y^{(t)}_{<i}, x^{(t)}, y^{*,(t)}_{<i}).

Experiments: Machine Translation

• Setup

Datasets:
(1) NIST Chinese-English translation (2M sentence pairs, 30K-token vocabulary, MT03-06 as test sets)
(2) WMT14 English-German translation (4.5M sentence pairs, 37K shared tokens, newstest2014 as test set)

Training details:
(1) Transformer_big setting
(2) Chinese-English: 1 GPU, single model, case-insensitive BLEU
(3) English-German: 3 GPUs, model averaging, case-sensitive BLEU

• Baselines

Moses: an open-source phrase-based SMT system.
RNMT: RNN-based NMT with the default settings.
Transformer: predicts the target sentence from left to right.
Transformer (R2L): predicts the target sentence from right to left.
Rerank-NMT: (1) run beam search in both directions to obtain two k-best lists; (2) re-score them and pick the best candidate.
ABD-NMT: (1) use a backward decoder to generate reverse sequence states; (2) perform beam search on the forward decoder to find the best translation.

• Results on Chinese-English Translation

[Table: translation quality on the Chinese-English tasks, case-insensitive BLEU]

• Results on English-German Translation

[Table: English-German results, case-sensitive BLEU — against strong baselines, our model gains +1.49 BLEU]


• Setup

Dataset:
(1) Flickr30k (Young et al., 2014)
(2) 29,000 image-caption pairs for training
(3) 1,014 for validation and 2,000 for test

Baselines:
(1) VGGNet encoder + LSTM decoder (Xu et al., 2015)
(2) Transformer

Experiments: Image Caption


• Results on English Image Caption (BLEU score)

Method             Validation   Test
Xu et al. (2015)   ~            19.90
Transformer        22.11        21.25
Ours               23.27        22.41

Experiments: Image Caption


Figure: Translation accuracy of the first 4 tokens and the last 4 tokens for L2R, R2L, Rerank-NMT, ABD-NMT and our proposed model.

Our model achieves the best translation accuracy on both the first 4 words and the last 4 words.

MT Analysis: Unbalanced Outputs


MT Analysis: BLEU along Length

• Analysis: Effect of Long Sentences

Figure: Performance of translations on the test set with respect to the lengths of the source sentences.



L2R produces good prefixes, whereas R2L generates better suffixes.

Our approach makes full use of bidirectional decoding and produces balanced outputs in these cases.

MT Analysis: Case Study


Table: Statistics of parameters, training and testing speeds. Train denotes the number of global training steps processed per second; Test indicates the number of translated sentences per second.

No additional parameters except for lambda.

Slightly slower than the baseline Transformer.

MT Analysis: Parameters and Speeds


Beyond Synchronous Bidirectional Decoding: Improving Efficiency

Sequence Generation: From Both Sides to the Middle. Long Zhou, Jiajun Zhang, Chengqing Zong and Heng Yu. In Proceedings of IJCAI 2019.


Sequence Generation from Both Sides to the Middle

• SBSG: Synchronous Bidirectional Sequence Generation
  - Speeds up decoding: generates two tokens at a time
  - Improves quality: relies on both history and future context

Figure: left-to-right decoding produces y_1, y_2, ..., y_{n/2} while right-to-left decoding simultaneously produces y_n, y_{n-1}, ..., y_{n/2+1}; the two directions start at t = 0 and meet in the middle at t = n/2.
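A minimal sketch of both-sides-to-middle greedy decoding, assuming a hypothetical `model.step` that jointly proposes the next left token and the next right token given both partial hypotheses (names and the stopping rule are illustrative, not the paper's API):

```python
def sbsg_greedy_decode(model, src, max_len=128,
                       l2r_bos="<l2r>", r2l_bos="<r2l>", eos="<eos>"):
    """Grow the output from both ends toward the middle, two tokens per step."""
    left, right = [l2r_bos], [r2l_bos]   # l2r and r2l partial hypotheses
    for _ in range(max_len // 2):
        # one joint step: each direction also attends to the other's output
        next_l, next_r = model.step(src, left, right)
        left.append(next_l)
        right.append(next_r)
        if next_l == eos or next_r == eos:
            break
    # final sequence = left half followed by the reversed right half
    return left[1:] + right[1:][::-1]
```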


Sequence Generation from Both Sides to the Middle

• Training and Inference

Figure: the l2r stream is trained to emit <l2r> y_1 y_2 ... y_{n/2} <eos> and the r2l stream to emit <r2l> y_n y_{n-1} ... y_{n/2+1} <eos>, with <null> padding when the two halves differ in length.

Following previous work, we also use knowledge distillation techniques to train our model.

Training objective:

The smoothing model:
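The two equations on this slide were rendered as images and did not survive extraction. As a hedged reconstruction (assumed notation, not the paper's exact formulation), a joint cross-entropy objective over the two streams would plausibly take the form

$$\mathcal{L}(\theta)=\sum_{t=1}^{n/2}\Big[\log p\big(y_t \mid y_{<t},\,\overline{y}_{<t},\,x;\theta\big)+\log p\big(y_{n+1-t} \mid \overline{y}_{<t},\,y_{<t},\,x;\theta\big)\Big],$$

where $y_{<t}$ is the left-to-right prefix and $\overline{y}_{<t}=(y_n,\dots,y_{n-t+2})$ is the right-to-left prefix; the "smoothing model" plausibly refers to softening the one-hot targets, though its exact form is not recoverable from the slide.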


Experiments on Machine Translation

Inference speed:

Translation quality:


Experiments on Text Summarization

• Application to Text Summarization

Setup:
(1) English Gigaword dataset (3.8M training pairs, 189K dev set, DUC2004 as the test set)
(2) shared vocabulary of about 90K word types
(3) Transformer_base setting
(4) ROUGE-1, ROUGE-2, ROUGE-L
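For concreteness, these three metrics can be computed with Google's rouge-score package; a minimal sketch (the example strings are illustrative, not from the dataset):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score("the cat sat on the mat",        # reference
                      "a cat was sitting on the mat")  # system summary
print(scores["rougeL"].fmeasure)  # F1 of the longest-common-subsequence match
```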


Experiments on Text Summarization


The proposed model significantly outperforms the conventional Transformer model in terms of both decoding speed and generation quality.


Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges


Interactive Inference for Two Tasks

Synchronously Generating Two Languages with Interactive Decoding. Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu and Chengqing Zong. In Proceedings of EMNLP 2019.


From Generating Two Directions to Generating Two Languages

Figure (top): from the source x_0 x_1 ... x_m, left-to-right decoding emits y_0^{l2r}, y_1^{l2r}, ..., y_i^{l2r} while right-to-left decoding emits y_{n'-1}^{r2l}, y_{n'-2}^{r2l}, ..., y_{i'}^{r2l}.

Figure (bottom): from the same English source, English-to-Chinese decoding emits y_0, y_1, ..., y_i while English-to-Japanese decoding emits z_0, z_1, ..., z_j.


Conventional Multilingual Translation

• Separate encoder or decoder networks
• Shared encoder or decoder networks
• Shared with partial parameters


Synchronously Generating Two Languages with Interactive Decoding


Synchronously Generating Two Languages with Interactive Decoding

• Synchronous Self-Attention Model

Figure: a shared encoder feeds two decoders that interact with each other during decoding.


Synchronous Bi-language Attention

Figure: at step i, the decoder states s_0 ... s_i (for y_0 ... y_i) and t_0 ... t_i (for z_0 ... z_i) attend both to the encoder states h_0 ... h_{m-1} and to each other.

$H_i = \mathrm{fuse}(H_i^{self}, H_i^{other})$
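A minimal PyTorch sketch of this interaction, assuming standard multi-head attention and a simple linear-interpolation fuse (the class name and the fixed weight `lam` are illustrative; the paper's fuse function may differ):

```python
import torch
import torch.nn as nn

class SyncBiLanguageAttention(nn.Module):
    """Each decoder attends to its own history (self) and to the other
    decoder's history (other), then fuses the two results."""
    def __init__(self, d_model=512, n_heads=8, lam=0.5):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lam = lam  # assumed linear-interpolation fusion weight

    def forward(self, s, t):
        # s: states of this decoder (B, i, d); t: states of the other decoder
        h_self, _ = self.self_attn(s, s, s)    # H_i^self
        h_other, _ = self.cross_attn(s, t, t)  # H_i^other (queries from s)
        return self.lam * h_self + (1.0 - self.lam) * h_other  # fuse(...)
```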


Some Experiments

1. Large Scale: WMT14 subset

        En-De   En-Fr
Train   2.43M   2.43M
Test    3003    3003

2. Small Scale: IWSLT

        En-Ja   En-Zh   En-De   En-Fr
Train   223K    231K    206K    233K
Test    3003    3003    1305    1306


Training Data Construction

Figure: from the English source x_0 ... x_m, the Chinese output y_0 ... y_i and the Japanese output z_0 ... z_j are decoded synchronously.

Training instance format requirement: a trilingual translation example (x, y, z) in which (x, y) and (x, z) are parallel sentence pairs.


Training Data Construction

Step 1 (train): use (x1, y1) to train model M1 and (x2, y2) to train model M2.
Step 2 (decode): feed x2 to M1 to obtain (x2, y2*) and x1 to M2 to obtain (x1, y1*).
Step 3 (combination): (x1, y1, y1*) ∪ (x2, y2*, y2)
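The three steps above can be sketched in a few lines; `m1` and `m2` stand for the decode functions of M1 and M2 after Step 1 (hypothetical names, not the paper's code):

```python
def build_trilingual_corpus(bitext1, bitext2, m1, m2):
    """Step 2 + Step 3: back-fill each bitext's missing third side with a
    pseudo target (marked * on the slide) and take the union of both halves."""
    corpus = []
    for x1, y1 in bitext1:             # M2 decodes x1 -> y1*, giving (x1, y1, y1*)
        corpus.append((x1, y1, m2(x1)))
    for x2, y2 in bitext2:             # M1 decodes x2 -> y2*, giving (x2, y2*, y2)
        corpus.append((x2, m1(x2), y2))
    return corpus
```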


Main Results

• Indiv: system learned with bilingual training
• Multi: shared encoder-decoder networks
• Sync-Trans significantly outperforms Indiv and Multi on English-Chinese/Japanese and English-German/French.


Main Results

• Sync-Trans significantly outperforms Indiv and Multi on the large-scale WMT dataset.


From Bidirection to Two Tasks

Figure: the same synchronous interactive scheme covers two decoding directions (left-to-right emitting y_0^{l2r} ... y_i^{l2r}, right-to-left emitting y_{n'-1}^{r2l} ... y_{i'}^{r2l}) and two target languages (English-to-Chinese emitting y_0 ... y_i, English-to-Japanese emitting z_0 ... z_j).


Interactive Inference for Other Two Tasks

Figure (top): the speech recognition output y_0 ... y_i and the speech-to-text translation output z_0 ... z_j, decoded interactively.

Figure (bottom): an English caption x_0 ... x_i and a Chinese caption y_0 ... y_j, generated interactively.



Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Speech Features

Figure: original signal → Mel filtering → filter bank features → discrete cosine transform → MFCC.
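A minimal sketch of this feature pipeline using librosa; the frame sizes and filter counts below are illustrative defaults, not the tutorial's settings:

```python
import librosa

# original signal
y, sr = librosa.load("utterance.wav", sr=16000)

# Mel filtering -> log filter-bank features
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=80)
fbank = librosa.power_to_db(mel)                 # shape: (80, frames)

# discrete cosine transform on the log-Mel energies -> MFCC
mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=13)  # shape: (13, frames)
```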



• Overall Architecture

Interactive Inference for Speech Recognition and Speech-to-Text Translation

Figure: a shared speech encoder feeds two interacting decoders, one for speech recognition and one for speech-to-text translation.


• Interactive Attention

$H_1 = \mathrm{Attention}(Q_1, K_1, V_1)$
$H_2 = \mathrm{Attention}(Q_1, K_2, V_2)$
$H_{final} = \mathrm{Fusion}(H_1, H_2)$

• Fusion
  Linear interpolation:    $H_{final} = \lambda_1 H_1 + \lambda_2 H_2$
  Nonlinear interpolation: $H_{final} = \lambda_1 H_1 + \lambda_2 \tanh(H_2)$
  Gate interpolation:      $H_{final} = r \odot H_1 + z \odot H_2$, where $r, z = \sigma(W[H_1; H_2])$

Interactive Inference for Speech Recognition and Speech-to-Text Translation
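The three fusion variants can be sketched directly in PyTorch; the single gate projection `W` of shape (2d, 2d), which produces r and z jointly, is an assumption about the parameterization:

```python
import torch

def fuse(h1, h2, mode="gate", lam1=0.5, lam2=0.5, W=None):
    """Sketch of the three fusion variants above; h1, h2 have shape (..., d)."""
    if mode == "linear":                       # λ1·H1 + λ2·H2
        return lam1 * h1 + lam2 * h2
    if mode == "nonlinear":                    # λ1·H1 + λ2·tanh(H2)
        return lam1 * h1 + lam2 * torch.tanh(h2)
    # gate: r, z = σ(W[H1; H2]);  H_final = r ⊙ H1 + z ⊙ H2
    gates = torch.sigmoid(torch.cat([h1, h2], dim=-1) @ W)  # (..., 2d)
    r, z = gates.chunk(2, dim=-1)
    return r * h1 + z * h2
```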


Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Experimental Setup

Dataset: TED En-Fr, En-Zh

Training details:
(1) Transformer_big setting
(2) English-Chinese: 2 GPUs, character-level BLEU
(3) English-French: 2 GPUs, tokenized BLEU


Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Data Size

Corpus                    Split   Segments   Hours    Frames/seg.   Source words/seg.   Target words/seg.
Fisher/Callhome (En-Es)   train   138,819    138:00   762           20.7                20.3
                          dev     3,961      2:00     673           17.9                17.9
                          test    3,641      3:44     657           18.3                18.3
TED (En-Zh/En-Fr)         train   305,971    527:00   662           17.9                20.3
                          dev     1,148      2:23     659           18.2                17.9
                          test    1,223      2:37     624           18.3                18.1


Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Baselines

Pipeline: Transformer ASR + Transformer MT
Pre-trained E2E: pre-train on ASR, fine-tune on ST
Multi-task: ASR + ST with a shared encoder
Two-stage: (1) use the first decoder to generate the transcription sequence; (2) feed the output of the first decoder to the second decoder


Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Evaluation Metrics

ASR metric: $\mathrm{WER} = 100 \cdot \frac{S + D + I}{N}\,\%$, where S, D and I are the numbers of substitutions, deletions and insertions against a reference of N words.

Example (I = insertion, D = deletion, S = substitution):
REF: 各位 来宾 *  各位 合作 伙伴 媒体界 的 朋友们 下午 好
ASR: 各位 来宾 个 各位 合作 伙伴 媒体界 *  朋友们 刚  好
              I                        D          S

MT metric: BLEU
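A self-contained sketch of the WER computation via Levenshtein distance over word sequences (any tokenization works; here tokens are whitespace-separated):

```python
def wer(ref_tokens, hyp_tokens):
    """Word error rate: 100 * (S + D + I) / N via edit distance."""
    n, m = len(ref_tokens), len(hyp_tokens)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (ref_tokens[i-1] != hyp_tokens[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[n][m] / max(n, 1)

# e.g. one insertion against a 3-word reference -> 33.33
print(wer("各位 来宾 各位".split(), "各位 来宾 个 各位".split()))
```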


Interactive Inference for Speech Recognition and Speech-to-Text Translation

• Overall Results

Model         En-De              En-Fr              En-Zh              En-Ja
              WER(↓)   BLEU(↑)   WER(↓)   BLEU(↑)   WER(↓)   BLEU(↑)   WER(↓)   BLEU(↑)
MT            /        22.19     /        30.68     /        25.01     /        22.93
Pipeline      14.29    19.50     14.20    26.62     14.20    21.52     14.21    20.87
E2E           14.29    16.07     14.20    27.63     14.20    19.15     14.21    16.59
Multi-task    14.20    19.08     13.04    28.71     13.43    20.60     14.01    18.73
Two-stage     14.27    20.08     13.34    30.08     13.55    20.29     13.85    19.32
Interactive   14.16    21.11     12.58    29.79     13.38    21.68     13.52    20.06


Interactive Inference for Image Caption in Two Languages

Figure: a caption y_0 ... y_i in one language and a caption z_0 ... z_j in the other, generated interactively from the same image.

Dataset:
(1) Multi30k (Elliott et al., 2016): English and German captions
(2) 29,000 image-caption pairs for training
(3) 1,014 for validation and 2,000 for test

Baselines:
(1) VGGNet encoder + LSTM decoder (Xu et al., 2015)
(2) Transformer


• Results on English and German Image Captions (BLEU score)

Experiments: Image Caption in Two Languages

Method             English   German
Xu et al. (2015)   19.90     ~
Jaffe (2017)       ~         11.84
Transformer        21.25     13.55
Ours               22.54     15.49


Unified Text Generation from Text, Speech and Image

Figure: (a) text, speech or image encoding (example input: "I had no idea what was coming"); (b) conventional text generation: y_1 y_2 y_3 y_4 ... y_m, one token per step (T = 1, 2, 3, 4, ...); (c) synchronous interactive text generation: y_1 ... y_m and z_1 ... z_n produced in parallel, interacting at every step.


And Beyond …

• Why not interactive inference for multi-task learning?



Outline

Background

Bidirectional Interactive Inference

Interactive Inference for Two Tasks

Summary and Future Challenges


Summary

• The synchronous bidirectional inference model takes full advantage of both the history and the future information provided by bidirectional decoding states, achieving promising results.

• The bidirectional inference model can be further extended to infer from both sides to the middle to improve efficiency.

• The bidirectional inference model can be generalized to generate two languages synchronously and interactively.

• Our main code is available at https://github.com/ZNLP/sb-nmt. Feel free to have a try!

Message: Synchronous Interactive Inference May Reshape Multi-task Generation


Future Challenges

• How to generalize the interactive inference idea to multi-task problems in which three or more tasks are involved?

• How to perform efficient training without generating pseudo-parallel instances?

• How to effectively combine bidirectional inference with multi-task interactive inference?


References

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019 (Best Paper).
2. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. Technical report, OpenAI, 2018.
3. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. NIPS 2017.
4. Lemao Liu, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Agreement on Target-Bidirectional Neural Machine Translation. NAACL 2016.
5. Zhirui Zhang, Shuangzhi Wu, Shujie Liu, Mu Li, Ming Zhou, and Tong Xu. Regularizing Neural Machine Translation by Target-Bidirectional Agreement. AAAI 2019.
6. Long Zhou, Wenpeng Hu, Jiajun Zhang, and Chengqing Zong. Neural System Combination for Machine Translation. ACL 2017.
7. Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. Asynchronous Bidirectional Decoding for Neural Machine Translation. AAAI 2018.
8. Long Zhou, Jiajun Zhang, and Chengqing Zong. Synchronous Bidirectional Neural Machine Translation. TACL 2019.


9. Long Zhou, Jiajun Zhang, Chengqing Zong, and Heng Yu. Sequence Generation: From Both Sides to the Middle. IJCAI 2019.
10. Yining Wang, Jiajun Zhang, Long Zhou, Yuchen Liu, and Chengqing Zong. Synchronously Generating Two Languages with Interactive Decoding. EMNLP 2019.
11. Jiajun Zhang, Long Zhou, Yang Zhao, and Chengqing Zong. Synchronous Bidirectional Inference for Neural Sequence Generation. arXiv preprint arXiv:1902.08955.
12. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
13. Desmond Elliott, Stella Frank, Khalil Sima'an, and Lucia Specia. Multi30K: Multilingual English-German Image Descriptions. Proceedings of the 5th Workshop on Vision and Language, 2016.
14. Alan Jaffe. Generating Image Descriptions Using Multilingual Data. Proceedings of the Second Conference on Machine Translation, 2017.
15. Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL 2014.


Thanks!