
Insertion Position Selection Model for Flexible Non-Terminals in Dependency Tree-to-Tree Machine Translation

Toshiaki Nakazawa, Japan Science and Technology Agency (JST)
John Richardson, Sadao Kurohashi, Kyoto University

4/11/2016 @ EMNLP 2016

Where to insert?

[Figure: inserting the floating word "yesterday" into "I found Pikachu by chance"; the candidate insertion positions carry probabilities 0.70, 0.25, 0.02, 0.01, 0.01, 0.01.]

Where to insert?

[Figure: inserting the floating phrase "in the park" into "I found Pikachu by chance yesterday"; the candidate insertion positions carry probabilities 0.2, 0.1, 0.6, 0.01, 0.01, 0.01, 0.1. Photo: Pikachu @Texas State Capitol.]

Dependency Tree-to-Tree Translation

[Figure: a worked example. The input dependency tree for 私は昨日公園でピカチュウを見つけた ("I found Pikachu in the park yesterday") is translated by composing tree-to-tree translation rules (私は … を見つけた → I found …, ピカチュウ → Pikachu, 偶然 → by chance, 公園 → the park, 昨日 → yesterday) into an English dependency tree, with the non-terminal [X7] marking a substitution site.]

Dependency Tree-to-Tree Translation

[Figure: the same example with flexible non-terminals [Richardson+, 2016]. Rules carry non-terminals [X] whose insertion position is not fixed in advance, so floating subtrees such as "yesterday" and "in the park" can attach at several different positions in the output tree for "I found Pikachu by chance yesterday in the park".]

Translation Quality and Decoding Speed w/ and w/o Flexible Non-terminals

• Using ASPEC (Asian Scientific Paper Excerpt Corpus) JE and JC
• Time is relative decoding time

          Ja->En        En->Ja        Ja->Zh        Zh->Ja
          BLEU  Time    BLEU  Time    BLEU  Time    BLEU  Time
w/o Flex  20.28 1.00    28.77 1.00    24.85 1.00    30.51 1.00
w/ Flex   21.61 6.28    30.57 3.30    28.79 5.16    34.32 5.28

Appropriate Insertion Position Selection

• Roughly half of all translation rules were augmented with flexible non-terminals [Richardson+, 2016]
• Flexible non-terminals make the search space much bigger -> slower decoding speed, increased search error
• Idea: reduce the number of possible insertion positions in translation rules with a neural network model


INSERTION POSITION SELECTION MODEL


Insertion Position Selection Model

• For each insertion position, predict a score
• Given:
  – input (source side): the floating word (I) and its parent word (Ps), with their distance (Ds)
  – target: the previous (Sp) and next (Sn) sibling words of the insertion position and its parent (Pt), with the distance (Dt)

These features are summarized in the sketch below.
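As a minimal sketch, the per-candidate feature tuple might look like this in Python; the dataclass container is an assumption for illustration, while the field names follow the slides:

    # A minimal sketch of the features for one candidate insertion position.
    # The dataclass itself is an assumption; the field names follow the slides.
    from dataclasses import dataclass

    @dataclass
    class InsertionCandidate:
        I: str    # floating word to be inserted
        Ps: str   # parent of I in the source tree
        Ds: int   # distance between I and Ps on the source side
        Sp: str   # previous sibling at the candidate position (target side)
        Sn: str   # next sibling at the candidate position (target side);
                  # a position with no next sibling uses the marker [POST-BOTTOM]
        Pt: str   # parent of the candidate position (target side)
        Dt: int   # signed distance from Pt on the target side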

Information for Selection Model

[Figure: the features in the running example. On the source side, the floating word (I) and its parent (Ps) are read off the input tree, with distance Ds = 4; on the target side of the rule, the candidate position's previous sibling (Sp), next sibling (Sn), and parent (Pt) are read off, with distance Dt = -2. Non-terminals are reverted to the original word in the parallel corpus, e.g. [X7] -> [yesterday] under [found].]

Information for Selection Model

[Figure: a second candidate position in the same example, with Ds = 4 and Dt = -3; this position has no next sibling, so Sn is the special marker [POST-BOTTOM].]

Neural Network Model

[Figure: for each candidate insertion position, the word inputs I (word to be inserted), Ps (parent of I), Sp1 (previous sibling), Sn1 (next sibling), and Pt (parent of the insertion position) are embedded as 220-dimensional vectors, and the distances Ds (from Ps) and Dt (from Pt) as 100-dimensional vectors. A fully-connected feed-forward network maps each candidate to a score; a softmax over the N candidate positions is compared with the gold one-hot labels, and the loss is softmax cross-entropy.]
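A minimal PyTorch sketch of this scorer, assuming the stated embedding sizes (220 for words, 100 for distances); the shared embedding table, hidden layer size, and activation are assumptions, since the slides do not specify them:

    import torch
    import torch.nn as nn

    class InsertionPositionSelector(nn.Module):
        def __init__(self, vocab_size, n_dist_buckets, hidden_dim=256):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, 220)      # I, Ps, Sp1, Sn1, Pt
            self.dist_emb = nn.Embedding(n_dist_buckets, 100)  # Ds, Dt (bucketed)
            self.ffn = nn.Sequential(          # fully-connected feed-forward network
                nn.Linear(5 * 220 + 2 * 100, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, 1),      # one score per candidate position
            )

        def forward(self, word_ids, dist_ids):
            # word_ids: (N, 5) ids of I, Ps, Sp1, Sn1, Pt for each of N candidates
            # dist_ids: (N, 2) bucketed ids of Ds and Dt
            w = self.word_emb(word_ids).flatten(1)   # (N, 5*220)
            d = self.dist_emb(dist_ids).flatten(1)   # (N, 2*100)
            return self.ffn(torch.cat([w, d], dim=1)).squeeze(-1)  # (N,) scores

    # Training: softmax over the N candidates, cross-entropy against the gold slot.
    # scores = model(word_ids, dist_ids)
    # loss = nn.functional.cross_entropy(scores.unsqueeze(0), gold)  # gold: (1,) index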

Training Data Creation

• Training data for the NN model can be created automatically from the word-aligned parallel corpus
  – consider each aligned target word as the floating word and remove it from the target tree; its original position is the gold insertion position (a sketch follows the figure)

[Figure: in the example, one aligned word is removed from the target tree for "I found Pikachu by chance"; its original position is labeled 1 and all other candidate positions [X] are labeled 0.]
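A rough sketch of this procedure; the tree helpers (aligned_target_nodes, remove, candidate_positions, restore, features) are all hypothetical names introduced for illustration, not from the paper:

    def make_training_examples(source_tree, target_tree, alignment):
        """Yield (features, label) pairs for the selection model."""
        examples = []
        for node, src_word in aligned_target_nodes(target_tree, alignment):
            # Treat the aligned word as the floating word and remove it.
            gold_slot = target_tree.remove(node)
            for slot in candidate_positions(target_tree):
                # Label 1 only at the word's original position, 0 elsewhere.
                examples.append((features(src_word, node, slot),
                                 int(slot == gold_slot)))
            target_tree.restore(node, gold_slot)  # undo for the next floating word
        return examples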

EXPERIMENTS


Insertion Position Selection Experiment

• Parallel corpus: ASPEC-JE/JC (2M/680K sentences)
• Data size: see the table below
• Comparison: L2-regularized logistic regression (using Multi-core LIBLINEAR); a rough scikit-learn analogue is sketched after the table

                 Ja->En   En->Ja   Ja->Zh   Zh->Ja
Training             15.7M (JE)       5.7M (JC)
Development           160K (JE)        58K (JC)
Test                  160K (JE)        58K (JC)
Ave. # IP         3.39     3.15     3.72     3.41
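For reference only, a minimal scikit-learn analogue of that baseline; the paper used Multi-core LIBLINEAR directly, and the feature encoding and data below are stand-in assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Stand-in data: rows are one-hot word/distance features for candidate
    # positions; labels are 1 at the gold position and 0 elsewhere.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 50))
    y = rng.integers(0, 2, size=1000)

    clf = LogisticRegression(penalty="l2", solver="liblinear", C=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))  # per-position accuracy, cf. the "Logit" row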

Experimental Results

                     Ja->En   En->Ja   Ja->Zh   Zh->Ja
Training                 15.7M (JE)       5.7M (JC)
Development               160K (JE)        58K (JC)
Test                      160K (JE)        58K (JC)
Ave. # IP             3.39     3.15     3.72     3.41
Mean loss            0.089    0.058    0.105    0.056
Top 1 Accuracy (%)   97.08    97.72    96.51    97.99
Top 2 Accuracy (%)   98.94    99.52    98.97    99.56
Logit Accuracy (%)   55.00    89.03    68.04    83.16

Translation Experiment

• Parallel corpus: ASPEC-JE/JC (2M/680K sentences)
• Decoder: KyotoEBMT [Richardson+, 2014]
• 5 settings:
  – Phrase-based and hierarchical phrase-based SMT
  – w/o Flex: not using flexible non-terminals
  – w/ Flex: baseline with flexible non-terminals
  – Prop: using insertion position selection (top 1 only; see the pruning sketch below)
• Metrics: BLEU and relative decoding time
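In the Prop setting, each flexible non-terminal keeps only its top-scoring insertion position before decoding. A minimal sketch of that pruning step, assuming the model above; rule.candidate_feature_ids and rule.positions are hypothetical names for illustration:

    import torch

    def prune_insertion_positions(rule, model, k=1):
        """Keep only the k best-scoring insertion positions (Prop uses k=1)."""
        word_ids, dist_ids = rule.candidate_feature_ids()  # assumed helper
        with torch.no_grad():
            scores = model(word_ids, dist_ids)
        keep = torch.topk(scores, k).indices.tolist()
        rule.positions = [rule.positions[i] for i in keep]
        return rule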

Translation Experimental Results

          Ja->En        En->Ja        Ja->Zh        Zh->Ja
          BLEU  Time    BLEU  Time    BLEU  Time    BLEU  Time
PBSMT     18.45 -       27.48 -       27.96 -       34.65 -
HPBSMT    18.72 -       30.19 -       27.71 -       35.43 -
w/o Flex  20.28 1.00    28.77 1.00    24.85 1.00    30.51 1.00
w/ Flex   21.61 6.28    30.57 3.30    28.79 5.16    34.32 5.28
Prop      22.07 2.25    30.50 1.27    29.83 2.21    34.71 1.89


Conclusion

• Proposed an insertion position selection model to reduce the number of insertion positions for flexible non-terminals in translation rules
• Both automatic evaluation scores and decoding speed are improved

Future Work• Use grand-children’s info– Recursive NN [Liu et al., 2015] or Convolutional

NN [Mou et al., 2015]

• Shift to NMT!!– Actually, we’ve already shifted and participated

WAT2016 shared tasks• However, NMT is still far from perfect

J->E Adequacy in WAT2016

[Figure: stacked percentage bars of human adequacy scores (1-5) for three J->E systems at WAT2016.

                   Kyoto-U (NMT)   NAIST/CMU (NMT)   NAIST (2015 best, F2T)
Average adequacy   3.83            3.76              3.71
BLEU               26.22           26.39             25.41
]


Thank You!

AD: I'm co-organizing The 3rd Workshop on Asian Translation (WAT2016), held in conjunction with COLING 2016. Invited talk by Google about GNMT! Please come to the workshop!

http://lotus.kuee.kyoto-u.ac.jp/WAT/