
Discriminative Learning of Extraction Sets for Machine Translation

John DeNero and Dan Klein, UC Berkeley

Identifying Phrasal Translations

In the past two years , a number of US citizens …

过去 两 年 中 , 一 批 美国 公民 …

past two year in , one lots US citizen

Phrase alignment models: Choose a segmentation and a one-to-one phrase alignment

[Figure: 过去 segmented as one phrase ("past") vs. word by word ("go over").]

Underlying assumption: There is a correct phrasal segmentation

Unique Segmentations?

In the past two years , a number of US citizens …

过去 两 年 中 , 一 批 美国 公民 …

past two year in , one lots US citizen

Problem 1: Overlapping phrases can be useful (and complementary)

Problem 2: Phrases and their sub-phrases can both be useful

Hypothesis: This is why models of phrase alignment don’t work well

Identifying Phrasal Translations

This talk: Modeling sets of overlapping, multi-scale phrase pairs

In the past two years , a number of US citizens …

过去 两 年 中 , 一 批 美国 公民 …

past two year in , one lots US citizen

Input: sentence pairs

Output: extracted phrases

… But the Standard Pipeline has Overlap!

M O T I V A T I O N

[Figure: the standard pipeline, Sentence Pair → Word Alignment → Extracted Phrases, illustrated on "In the past two years" / 过去两年中.]

Related Work


Translation models: Sinuhe system (Kääriäinen, 2009)

Combining Aligners: Yonggang Deng & Bowen Zhou (2009)

Fixed alignments; learned phrase pair weights

Fixed directional alignments; learned symmetrization

Extraction models: Moore and Quirk, 2007

Fixed alignments; learned phrase pair weights

Our Task: Predict Extraction Sets


Sentence Pair → Extracted Phrases

Conditional model of extraction sets given sentence pairs

[Figure: two alignment grids for "In the past two years" / 过去两年中, showing extracted phrases plus "word alignments".]

Alignments Imply Extraction Sets

M O D E L

[Figure: alignment grid for "In the past two years" / 过去两年中.]

Word-level alignment links → word-to-span alignments → extraction set of bispans

Nulls and Possibles

[Figure: 报道 ("report") in "according to news report / it is reported", annotated twice: once marking nulls (unaligned words), once marking possibles (ambiguous links).]

Incorporating Possible Alignments

[Figure: alignment grid for "In the past two years" / 过去两年中, now with sure and possible links.]

Sure and possible word links → word-to-span alignments → extraction set of bispans

Linear Model for Extraction Sets

[Figure: alignment grid for "In the past two years" / 过去两年中.]

Features on sure links; features on all bispans (in symbols below)
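In symbols, a sketch of the objective implied by the slide, in our notation (π is an extraction set, sure(π) its sure links, σ(π) its bispans, w the weight vector, φ the feature maps):

```latex
\mathrm{score}(\pi) \;=\; \sum_{a \,\in\, \mathrm{sure}(\pi)} w^{\top} \phi_{\mathrm{link}}(a)
\;+\; \sum_{b \,\in\, \sigma(\pi)} w^{\top} \phi_{\mathrm{bispan}}(b)
```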

Features on Bispans and Sure Links

F E A T U R E S

[Example: a bispan "over the Earth" whose Chinese side ends in 地球 ("Earth"), glossed word by word "go over / Earth".]

Some features on sure links

HMM posteriors

Presence in dictionary

Numbers & punctuation

Features on bispans (sketched in code after this list)

HMM phrase table features: e.g., phrase relative frequencies

Lexical indicator features for phrases with common words

Monolingual phrase features: e.g., “the _____”

Shape features: e.g., Chinese character counts
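As an illustration of how such features might be computed, here is a hypothetical sketch; the helper names (hmm_table, COMMON_ENGLISH) and the feature keys are ours for illustration, not the paper's:

```python
# Illustrative only: all names below are invented for this sketch.
COMMON_ENGLISH = {'the', 'of', 'in', 'a', 'to'}   # assumed closed word list

def bispan_features(e_words, f_words, bispan, hmm_table):
    """Map one bispan to a sparse feature dict, mirroring the classes above."""
    i1, i2, j1, j2 = bispan
    e, f = e_words[i1:i2], f_words[j1:j2]
    feats = {}
    # HMM phrase table features, e.g. phrase relative frequency
    feats['hmm_rel_freq'] = hmm_table.get((tuple(e), tuple(f)), 0.0)
    # Lexical indicator features for phrases built from common words
    if all(w in COMMON_ENGLISH for w in e):
        feats['lex=' + ' '.join(e)] = 1.0
    # Monolingual phrase features, e.g. "the ____"
    if len(e) > 1 and e[0] == 'the':
        feats['shape=the_*'] = 1.0
    # Shape features, e.g. Chinese character counts
    feats['zh_chars=%d' % sum(len(w) for w in f)] = 1.0
    return feats
```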

Getting Gold Extraction Sets

T R A I N I N G

Hand-aligned: sure and possible word links

Deterministic: find the min and max alignment index for each word (word-to-span alignments)

Deterministic: a bispan is included iff every word within the bispan aligns within the bispan (extraction set of bispans); both steps are sketched below
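A minimal Python sketch of this deterministic map (not the authors' code): words are 0-indexed, spans are half-open, the max_len cap is a hypothetical default, and we assume the standard requirement that a bispan contain at least one link.

```python
def word_to_span(links):
    """Step 1: for each word, the min and max index it aligns to."""
    e_span, f_span = {}, {}
    for i, j in links:                 # i: English index, j: foreign index
        lo, hi = e_span.get(i, (j, j))
        e_span[i] = (min(lo, j), max(hi, j))
        lo, hi = f_span.get(j, (i, i))
        f_span[j] = (min(lo, i), max(hi, i))
    return e_span, f_span

def extraction_set(links, e_len, f_len, max_len=7):
    """Step 2: a bispan is included iff every word within it aligns
    within it (and, as in standard extraction, it contains a link)."""
    e_span, f_span = word_to_span(links)
    bispans = []
    for i1 in range(e_len):
        for i2 in range(i1 + 1, min(i1 + max_len, e_len) + 1):
            for j1 in range(f_len):
                for j2 in range(j1 + 1, min(j1 + max_len, f_len) + 1):
                    if not any(i1 <= i < i2 and j1 <= j < j2 for i, j in links):
                        continue
                    ok_e = all(j1 <= e_span[i][0] and e_span[i][1] < j2
                               for i in range(i1, i2) if i in e_span)
                    ok_f = all(i1 <= f_span[j][0] and f_span[j][1] < i2
                               for j in range(j1, j2) if j in f_span)
                    if ok_e and ok_f:
                        bispans.append((i1, i2, j1, j2))
    return bispans

# e.g. "In the past two years" / 过去 两 年 中, with sure links
# In-中, past-过去, two-两, years-年 (note the overlapping bispans):
print(extraction_set({(0, 3), (2, 0), (3, 1), (4, 2)}, 5, 4))
```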

Discriminative Training with MIRA


Loss function: F-score of bispan errors (precision & recall)

Training Criterion: Minimal change to w such that the gold is preferred to the guess by a loss-scaled margin

Gold (annotated) vs. Guess (argmax w·φ); the update is sketched below
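A minimal sketch of the bispan F-score loss and the 1-best MIRA update, assuming dense NumPy feature vectors φ; the clip constant C and the choice of β are illustrative, not the paper's settings.

```python
import numpy as np

def f_loss(gold_bispans, guess_bispans, beta=1.0):
    """Loss = 1 - F_beta over bispan precision and recall."""
    tp = len(gold_bispans & guess_bispans)
    if tp == 0:
        return 1.0
    p = tp / len(guess_bispans)
    r = tp / len(gold_bispans)
    return 1.0 - (1 + beta**2) * p * r / (beta**2 * p + r)

def mira_update(w, phi_gold, phi_guess, loss, C=0.01):
    """Smallest change to w such that the gold outscores the guess
    by a loss-scaled margin (step size clipped at C)."""
    delta = phi_gold - phi_guess
    violation = loss - w.dot(delta)
    if violation <= 0 or not delta.any():
        return w                       # margin already satisfied
    tau = min(C, violation / delta.dot(delta))
    return w + tau * delta
```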

Inference: An ITG Parser

I N F E R E N C E

ITG captures some bispans

Coarse-to-Fine Approximation


Coarse Pass: Features that are local to terminal productions

Fine Pass: Agenda search using coarse pass as a heuristic

We use an agenda-based parser. It’s fast!
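A generic best-first sketch of this idea (our illustration, not the authors' parser): fine-pass items are popped in order of their exact score so far plus an optimistic outside estimate from the coarse pass, A*-style. The `expand` step and `coarse_estimate` heuristic are assumed to be supplied by the caller.

```python
import heapq, itertools

def agenda_search(start_items, expand, is_goal, coarse_estimate):
    """Best-first agenda: pop the item maximizing score + heuristic.
    If coarse_estimate never underestimates the remaining score,
    the first goal popped is optimal (as in A*)."""
    counter = itertools.count()        # tie-breaker so items never compare
    agenda = [(-(s + coarse_estimate(x)), next(counter), s, x)
              for x, s in start_items]
    heapq.heapify(agenda)
    done = {}
    while agenda:
        _, _, score, item = heapq.heappop(agenda)
        if item in done:
            continue                   # already finished with a better score
        done[item] = score
        if is_goal(item):
            return item, score
        for nxt, step in expand(item, done):
            if nxt not in done:
                s = score + step
                heapq.heappush(agenda,
                               (-(s + coarse_estimate(nxt)), next(counter), s, nxt))
    return None, float('-inf')
```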

Experimental Setup

R E S U L T S

Chinese-to-English newswire

Parallel corpus: 11.3 million words; sentences of length ≤ 40

MT systems: Tuned and tested on NIST ‘04 and ‘05

Supervised data: 150 training & 191 test sentences (NIST ‘02)

Unsupervised Model: Jointly trained HMM (Berkeley Aligner)

Baselines and Limited Systems

HMM: state-of-the-art unsupervised baseline; joint training & competitive posterior decoding; source of many features for the supervised models

ITG: supervised ITG aligner with block terminals; state-of-the-art supervised baseline; re-implementation of Haghighi et al., 2009

Coarse: supervised block ITG + possible alignments; the coarse pass of the full extraction set model

Word Alignment Performance

[Bar chart: word alignment Precision, Recall, and 1−AER for HMM, ITG, Coarse, and Full.]

Extracted Bispan Performance

[Bar chart: extracted bispan Precision, Recall, F1, and F5 for HMM, ITG, Coarse, and Full.]

Translation Performance (BLEU)

Moses: HMM 33.2, ITG 33.6, Coarse 34.2, Full 34.4

Joshua: HMM 34.5, ITG 34.7, Coarse 35.7, Full 35.9

Supervised conditions also included HMM alignments

Conclusions

Extraction set model directly learns what phrases to extract

The system performs well as an aligner or a rule extractor

Are segmentations always bad?

Idea: get overlap and multi-scale into the learning!

Thank you!

nlp.cs.berkeley.edu
