

Page 1

Language Generation with Continuous Outputs

Yulia Tsvetkov

Carnegie Mellon University

August 9, 2018

1

Page 2

Encoder–Decoder Architectures

[Figure: an encoder–decoder model translating the Russian «Я увидела кошку» ("I saw a cat"); the decoder outputs "I saw a <unk> </s>", replacing the rare word "cat" with <unk>.]

2

Page 3

(Conditional) Language Generation

Task + Data + Language

NLG: Machine Translation, Summarization, Dialogue, Caption Generation, Speech Recognition, ...

3

Page 4

(Conditional) Language Generation – 2D

[Figure: a 2D matrix of NLP technologies/applications (lemmatization, POS tagging, parsing, NER, coreference, SRL, ..., summarization, QA, dialogue, MT, ASR) against the world's ~6K languages, from English, through the UN languages (Chinese, Arabic, French, Spanish, Portuguese, Russian), some European languages, and medium-resourced languages such as Czech, Hindi, and Hebrew (dozens), down to resource-poor languages (thousands).]

4

Page 5

(Conditional) Language Generation – 3D

[Figure: the 2D tasks × languages matrix extended with a third axis, data domains: Bible, parliamentary proceedings, newswire, Wikipedia, novels, Twitter, TED talks, telephone conversations, ...]

5

Page 6

NLP ≠ Task + Data

"The common misconception is that language has to do with words and what they mean. It doesn't. It has to do with people and what they mean."

— Herbert H. Clark & Michael F. Schober, 1992; see also Dan Jurafsky's keynotes at CVPR'17 and EMNLP'17

6

Page 7

(Conditional) Language Generation – ∞D

[Figure: the tasks × languages × domains cube again, now suggesting infinitely many further dimensions of variation.]

7

Page 8

Outline

[Figure: the tasks × languages × domains cube, with Part 1, Part 2, and Part 3 of the talk mapped onto its regions.]

8

Page 9

(Conditional) Language Generation – ∞D

[Figure: the tasks × languages × domains cube, repeated.]

9

Page 10

Language Generation with Continuous Outputs

[Figure: a sequence-to-sequence model reading "a b c </s>" and generating "a b c </s>", each predicted token fed back as the decoder's next input.]

10

Page 11

[Figure: the encoder–decoder again: «Я увидела кошку» ("I saw a cat") is translated as "I saw a <unk> </s>".]

11

Page 12

Rare Words Are Common in Language

[Figure. Image: SergioJimenez, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45516736]

12

Page 13

Softmax

[Figure: hidden state h is multiplied by output matrix W; a softmax produces a probability for each word w_i.]

A multinomial distribution over discrete and mutually exclusive alternatives:
- High computational and memory complexity
- Vocabulary size is limited to a small fraction of words, plus <unk>
- Words are represented as one-hot vectors
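To make the complexity concrete, here is a minimal numpy sketch of a softmax output layer; H = 1024 and V = 50K are illustrative assumptions (chosen to match the parameter counts quoted later in the deck), not values stated on this slide:

```python
import numpy as np

# Illustrative sizes: hidden dimension H and vocabulary size V.
H, V = 1024, 50_000

rng = np.random.default_rng(0)
h = rng.standard_normal(H)               # decoder hidden state at one time step
W = rng.standard_normal((V, H)) * 0.01   # output projection: one row per word

# Softmax over the whole vocabulary: every step touches all V rows of W.
logits = W @ h                           # O(V * H) multiply-adds per time step
probs = np.exp(logits - logits.max())
probs /= probs.sum()                     # multinomial distribution over V words

assert probs.shape == (V,)
assert W.size == 51_200_000              # the output layer alone holds V * H parameters
```

Every decoding step pays the full O(V·H) matrix–vector product, which is why the output layer dominates both time and memory for large vocabularies.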

13

Page 14

Alternatives to Softmax

Sampling-based approximations
- Importance Sampling: evaluate the denominator over a subset
- Noise Contrastive Estimation: convert to a proxy binary classification problem
- ...

Structure-based approximations
- Differentiated Softmax: divide the vocabulary into multiple classes; first predict a class, then predict a word within that class
- Hierarchical Softmax: a binary tree with words as leaves
- ...

Subword units
- Byte Pair Encoding (BPE) (Sennrich et al., 2016)
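The BPE idea can be sketched as a toy merge-learning loop (a simplified illustration with a made-up word-frequency dictionary, not the reference subword-nmt implementation):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations from a word-frequency dict (toy sketch)."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the best merge everywhere it occurs.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

merges, vocab = bpe_merges({"low": 5, "lower": 2, "lowest": 2}, 3)
# The first merges pick up the frequent units "lo", then "low".
```

Frequent words end up as single units while rare words decompose into smaller pieces, which is how BPE keeps the output vocabulary small without an <unk> token.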

14

Page 15

Alternatives to Softmax

                   Sampling-  Structure-  Subword
                   based      based       units
Training time      +          +           +
Test time          –          –           +
Accuracy           ~          ~           +
Memory             –          ~           +
Very large vocab   ~          ~           +

(+ favorable, ~ mixed, – unfavorable)

15

Page 16

Our Proposal: No Softmax

[Figure: the encoder–decoder from before («Я увидела кошку» → "I saw a <unk> </s>"), with the softmax output layer removed.]

16

Page 17

Our Proposal

Word embedding: O(D), D ∈ [300–500]
Softmax: O(V), V ∈ [16K–50K]

Represent each word by its pre-trained embedding instead of a one-hot vector.
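Assuming a hidden size of 1024 (an inference from the parameter counts quoted later in the deck, not a value stated here), the O(V)-versus-O(D) difference in the output layer works out as:

```python
# Output-layer parameter counts for the two designs.
# H = 1024 is an assumption consistent with the deck's later totals.
H = 1024
V = 50_000   # softmax vocabulary size
D = 300      # embedding dimension

softmax_params = V * H   # matches the deck's "51.2M (1.0x)"
embed_params = D * H     # matches the deck's "307.2K (0.006x)"

assert softmax_params == 51_200_000
assert embed_params == 307_200
assert round(embed_params / softmax_params, 3) == 0.006
```

The continuous output layer is smaller by a factor of V/D (here ~167x), independent of vocabulary size.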

17

Page 18

Seq2Seq with Continuous Outputs

[Figure: the encoder–decoder translating «Я увидела кошку» ("I saw a cat"); the decoder now emits an embedding that decodes to "cat" instead of <unk>.]

At each time step t, generate the word's embedding instead of a probability distribution over the vocabulary.
Training: see the next slides.
Decoding: kNN.
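kNN decoding can be sketched as a cosine nearest-neighbour lookup in the embedding table; the tiny vocabulary and random embeddings here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["I", "saw", "a", "cat", "</s>"]
E = rng.standard_normal((len(vocab), 300))      # stand-in embedding table
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize the rows

def knn_decode(e_hat, E, vocab):
    """Map a predicted output vector to the nearest word by cosine similarity."""
    scores = E @ (e_hat / np.linalg.norm(e_hat))
    return vocab[int(np.argmax(scores))]

# A prediction close to the "cat" row decodes to "cat".
e_hat = E[vocab.index("cat")] + 0.01 * rng.standard_normal(300)
assert knn_decode(e_hat, E, vocab) == "cat"
```

With real pre-trained embeddings the same 1-NN search runs over the full vocabulary, typically accelerated with an approximate nearest-neighbour index.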

18

Page 19

Training Seq2Seq with Continuous Outputs: Empirical Losses

Euclidean loss:
$\mathcal{L}_{L2} = \|e - e(w)\|^2$

Cosine loss:
$\mathcal{L}_{cosine} = 1 - \dfrac{e^{T} e(w)}{\|e\| \, \|e(w)\|}$

Max-margin loss:
$\mathcal{L}_{mm} = \sum_{w' \in V,\, w' \neq w} \max\{0,\; \gamma + \cos(e, e(w')) - \cos(e, e(w))\}$
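The three empirical losses can be written down directly; `good` and `bad` are made-up 2-D vectors for illustration:

```python
import numpy as np

def l2_loss(e, e_w):
    """Squared Euclidean distance between prediction e and target e(w)."""
    return float(np.sum((e - e_w) ** 2))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_loss(e, e_w):
    return 1.0 - cos(e, e_w)

def max_margin_loss(e, e_w, negatives, gamma=0.5):
    """Sum of margin violations against the other words w' (here: `negatives`)."""
    return sum(max(0.0, gamma + cos(e, e_neg) - cos(e, e_w))
               for e_neg in negatives)

e_w = np.array([1.0, 0.0])    # target word embedding
good = np.array([0.9, 0.1])   # prediction near the target
bad = np.array([-1.0, 0.2])   # prediction far from the target

assert l2_loss(good, e_w) < l2_loss(bad, e_w)
assert cosine_loss(good, e_w) < cosine_loss(bad, e_w)
```

All three are differentiable in e, so they slot into the decoder in place of cross-entropy; in practice the max-margin sum runs over sampled negatives rather than the whole vocabulary.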

19

Page 20

Training Seq2Seq with Continuous Outputs: Probabilistic Loss

von Mises–Fisher (vMF) distribution:
$p(e(w); \mu, \kappa) = C_m(\kappa)\, e^{\kappa \mu^{T} e(w)}$

We use $\kappa = \|e\|$:
$p(e(w); e) = C_m(\|e\|)\, e^{e^{T} e(w)}$

vMF loss:
$\mathcal{L}_{NLLvMF} = -\log C_m(\|e\|) - e^{T} e(w)$

With regularization:
$\mathcal{L}_{NLLvMF\text{-}reg1} = -\log C_m(\|e\|) - e^{T} e(w) + \lambda_1 \|e\|$
$\mathcal{L}_{NLLvMF\text{-}reg2} = -\log C_m(\|e\|) - \lambda_2\, e^{T} e(w)$
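A direct sketch of the unregularized vMF negative log-likelihood, assuming scipy is available for the modified Bessel function inside C_m (the actual training code may approximate this term differently):

```python
import numpy as np
from scipy.special import ive   # exponentially scaled modified Bessel I_v

def log_cm(kappa, m):
    """log C_m(kappa) = (m/2-1) log kappa - (m/2) log 2pi - log I_{m/2-1}(kappa)."""
    nu = m / 2.0 - 1.0
    # log I_nu(kappa) = log(ive(nu, kappa)) + kappa  (ive is overflow-safe)
    log_bessel = np.log(ive(nu, kappa)) + kappa
    return nu * np.log(kappa) - (m / 2.0) * np.log(2 * np.pi) - log_bessel

def nllvmf(e, e_w):
    """-log C_m(||e||) - e^T e(w), with e(w) a unit-norm target embedding."""
    kappa = np.linalg.norm(e)
    return -log_cm(kappa, len(e)) - float(e @ e_w)

rng = np.random.default_rng(0)
e_w = rng.standard_normal(300)
e_w /= np.linalg.norm(e_w)

# With the same ||e||, a prediction aligned with e(w) gets a lower loss.
assert nllvmf(5.0 * e_w, e_w) < nllvmf(-5.0 * e_w, e_w)
```

The norm of the predicted vector plays the role of the concentration κ, so the model can express confidence through ‖e‖ as well as direction.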

20

Page 21

Training Seq2Seq with Continuous Outputs: Research Questions

[Figure: the encoder–decoder with a continuous output embedding decoding to "cat".]

- Objective function: empirical and probabilistic losses
- Embeddings: word2vec, fastText, syntactic, morphological, ELMo, etc.
- Attention: words vs. BPE in the input
- Decoding: scheduled sampling; kNN approximations; beam search; interpolation with LMs
- OOVs: scheduled sampling; tied embeddings

21


Page 23

Experimental Setup

IWSLT fr–en: train 220K sentence pairs; dev 2.3K; test 2.2K (tst2015 + tst2016)

Baselines follow "Stronger Baselines for Trustable Results in NMT" (Denkowski & Neubig '17)
Evaluation: BLEU
50K word vocabulary; 16K BPE vocabulary
300-dimensional embeddings
More setups in the paper: IWSLT de–en, IWSLT en–fr, WMT de–en

23

Page 24

Translation Quality

Source → Target           Loss               BLEU (fr–en)

word → word               cross-entropy      30.98
word → BPE                cross-entropy      29.06
BPE → BPE                 cross-entropy      31.44

BPE → word2vec            L2                 16.78
BPE → word2vec            cosine             26.92

word → word2vec           L2                 27.16
word → word2vec           cosine             29.14
word → word2vec           max-margin         29.56

word → fasttext           max-margin         30.98
word → fasttext + tied    max-margin         32.12
word → fasttext           NLLvMF reg1+reg2   30.38
word → fasttext + tied    NLLvMF reg1+reg2   31.63

24


Page 30

Training Time & Memory

Training time (1 GeForce GTX TITAN X GPU):

             Softmax    BPE        NLLvMF
             baseline   baseline   best model
fr–en        4h         4.5h       1.9h
de–en        3h         3.5h       1.5h
en–fr        1.8h       2.8h       1.3h
WMT de–en    4.3d       4.5d       1.6d

# Parameters in the output layer:

Softmax   51.2M    (1.0x)
BPE       16.384M  (0.32x)
NLLvMF    307.2K   (0.006x)

25

Page 31

Training Time & Memory

26

Page 32: Language Generation with Continuous Outputs · TranslationQuality SourceType/ TargetType Loss BLEU fr–en word! word cross-entropy 30.98 word! BPE cross-entropy 29.06 BPE! BPE cross-entropy

Encoder–Decoder with Continuous Outputs: Convergence Time

[Figure: two convergence plots of BLEU (y-axis, 13–38) against training epoch (x-axis, 1–15).]

27

Page 33

Encoder–Decoder with Continuous Outputs: Output Example

GOLD: An education is critical, but tackling this problem is going to require each and every one of us to step up and be better role models for the women and girls in our own lives.

BPE2BPE: education is critical, but it's going to require that each of us will come in and if you do a better example for women and girls in our lives.

WORD2FASTTEXT: education is critical, but fixed this problem is going to require that all of us engage and be a better example for women and girls in our lives.

28

Page 34

Seq2Seq with Continuous Outputs

                   Sampling-  Structure-  Subword   Semfit
                   based      based       units
Training time      +          +           +         +
Test time          –          –           +         +
Accuracy           ~          ~           +         +
Memory             –          ~           +         +
Very large vocab   ~          ~           +         +

(+ favorable, ~ mixed, – unfavorable)

29

Page 35

Encoder–Decoder with Continuous Outputs: Future Research Questions

[Figure: the encoder–decoder with a continuous output embedding decoding to "cat".]

- Decoding
- Translation into morphologically rich languages
- Low-resource NMT
- More generation tasks, e.g. style transfer with GANs

30

Page 36

31