Modality-Preserving Phrase-Based Statistical Machine Translation
Masamichi Ideue, Masao Utiyama, Eiichiro Sumita and Kazuhide Yamamoto (Nagaoka University of Technology and NICT)

Page 1: Modality-Preserving Phrase-based Statistical Machine Translation

Modality-Preserving Phrase-Based Statistical Machine Translation

Masamichi Ideue, Masao Utiyama, Eiichiro Sumita and Kazuhide Yamamoto

(Nagaoka University of Technology and NICT)

Page 2: Modality-Preserving Phrase-based Statistical Machine Translation

Purpose of our study: Japanese-to-English translation that preserves negation and question modality with phrase-based SMT.

Input: 私はりんごが好きではありません。

Correct translation: I don't like apples.

MT translation: I like apples.

MT users would not be able to detect such a modality error.

Page 3: Modality-Preserving Phrase-based Statistical Machine Translation

Related Studies

• Class-Dependent Modeling for Dialog Translation [Finch et al., 2009]
• Discriminative Reranking for SMT using Various Global Features [Goh et al., 2010]

Neither of these studies discussed which expressions influence modalities. Our study focuses on the characteristic modality words of negations and questions.

Page 4: Modality-Preserving Phrase-based Statistical Machine Translation

Proposed Method: Add feature functions that consider characteristic words of negation and question.

Page 5: Modality-Preserving Phrase-based Statistical Machine Translation

Added feature functions: the number of phrase pairs in which both the Japanese phrase and the English phrase include characteristic words of question (negation).

Input f: 財布 は どこ に あり ます か ?

Hypothesis e: Where is the purse ?

Feature value: 2
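As a sketch of how this feature might be computed (the function name and the word lists are illustrative assumptions, not taken from the paper):

```python
# Illustrative sketch (not the authors' code): count the phrase pairs
# in a derivation whose Japanese AND English sides both contain a
# characteristic word of the given modality. Word lists here are
# assumed abbreviations of the extracted lists.
QUESTION_JA = {"か", "?", "どこ"}
QUESTION_EN = {"?", "where", "what", "how"}

def modality_feature(phrase_pairs, ja_words, en_words):
    """phrase_pairs: list of (japanese_tokens, english_tokens)."""
    return sum(
        1
        for ja, en in phrase_pairs
        if any(t in ja_words for t in ja)
        and any(t.lower() in en_words for t in en)
    )

# The example from the slide: 財布 は どこ に あり ます か ? -> Where is the purse ?
pairs = [
    (["財布", "は"], ["the", "purse"]),
    (["どこ", "に", "あり", "ます"], ["Where", "is"]),
    (["か", "?"], ["?"]),
]
print(modality_feature(pairs, QUESTION_JA, QUESTION_EN))  # 2
```

Two of the three phrase pairs contain question words on both sides, so the feature fires with value 2, matching the slide's example.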

Page 6: Modality-Preserving Phrase-based Statistical Machine Translation

Characteristic Words Extraction

• Manual extraction
• Automatic extraction using the LLR (log-likelihood ratio) score

Characteristic words are extracted from a parallel corpus in the travel domain.

Page 7: Modality-Preserving Phrase-based Statistical Machine Translation

Manual Extraction (English)

Negation: not, 't, don, Don, haven, isn, No, won, wasn, doesn, didn, cannot, hadn

Question: ?, Why, Will, What, Could, Is, How, Does, Can, Do, Are, Which, When, Where, Have, Did, Was, May

Page 8: Modality-Preserving Phrase-based Statistical Machine Translation

Manual Extraction (Japanese)

Negation: ない (nai), ません (masen)

Question: ?, か。(ka.)

• Few characteristic words clearly express these modalities.
• Whether a word expresses a modality tends to depend on the domain.

Page 9: Modality-Preserving Phrase-based Statistical Machine Translation

Automatic Extraction

• Automatic extraction is based on the LLR.
• The LLR is convenient for extracting characteristic words in the travel domain (Chujo et al., 2006).

Ranking by LLR score (question): 1. ?, 2. Will, 3. Could, 4. How, 5. Can, ...

The top N words of the LLR ranking are extracted as the characteristic words.

Page 10: Modality-Preserving Phrase-based Statistical Machine Translation

Calculation of LLR (in the case of negation)

If a word tends to occur only in negations, its LLR score becomes high.

      | Negation | Affirmation |
w = 1 |    a     |     b       | a + b
w = 0 |    c     |     d       | c + d
      |  a + c   |   b + d     |   n

(a, b, c, d: occurrence frequencies under each condition)
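One standard way to compute such a score from the contingency table above is Dunning-style G²; a minimal sketch, assuming this formulation (the slide does not show the paper's exact formula):

```python
import math

def llr(a, b, c, d):
    """Dunning-style G^2 log-likelihood ratio for the 2x2 table above:
    rows w=1 / w=0, columns negation / affirmation. An assumed
    formulation, not necessarily the paper's exact formula."""
    n = a + b + c + d
    score = 0.0
    # each cell's observed count with its row total and column total
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        if obs > 0:
            score += obs * math.log(obs * n / (row * col))
    return 2.0 * score

# A word spread evenly over negations and affirmations scores ~0;
# a word occurring only in negations scores clearly higher.
print(llr(10, 10, 90, 90))
print(llr(10, 0, 90, 100))
```

The characteristic words would then be the top N words ranked by this score.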

Page 11: Modality-Preserving Phrase-based Statistical Machine Translation

Sentence Type Classification

To build the contingency table, we classified the sentences in the parallel corpus using the manually extracted English characteristic words.

English | Japanese | Type
He is not an artist. | 彼は芸術家ではない。 | negation
I like apples. | 私はりんごが好きです。 | affirmation
Are you a doctor? | あなたは医者ですか。 | question
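A minimal sketch of this classification step (the abbreviated word lists and the question-before-negation priority are assumptions, not stated on the slide):

```python
# Illustrative sketch: label each English sentence by the manually
# extracted characteristic words it contains. Word lists are
# abbreviated; checking question before negation is an assumption.
NEGATION_WORDS = {"not", "'t", "no", "cannot"}
QUESTION_WORDS = {"?", "why", "what", "how", "where"}

def sentence_type(sentence):
    # pad "?" so it survives whitespace tokenization
    tokens = [t.lower() for t in sentence.replace("?", " ? ").split()]
    if any(t in QUESTION_WORDS for t in tokens):
        return "question"
    if any(t in NEGATION_WORDS for t in tokens):
        return "negation"
    return "affirmation"

print(sentence_type("He is not an artist."))  # negation
print(sentence_type("I like apples."))        # affirmation
print(sentence_type("Are you a doctor?"))     # question
```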

Page 12: Modality-Preserving Phrase-based Statistical Machine Translation

Extracted Words by LLR (English)

Negation: do, there, this, long, isn, your, how, can, any, know, I, it, afraid, what

Question: any, have, don, it, did, much, time, yet, but, worry, anything, so, understand, enough

Page 13: Modality-Preserving Phrase-based Statistical Machine Translation

Extracted Words by LLR (Japanese)

Question: か, どこ, 何, どう, いくら, は, いただけ, どの, 何時, あり, でしょ, もらえ, いかが, どんな

Negation: ませ, ない, ん, は, なかっ, あまり, まだ, あり, でき, じゃ, いいえ, そんなに, そんな, たく

Page 14: Modality-Preserving Phrase-based Statistical Machine Translation

Experiments

SMT toolkit: Moses
Tuning: Minimum Error Rate Training
Parallel corpus: Basic Travel Expression Corpus (BTEC; 70,000 pairs)
Test set: 1,500 sentences (500 each for negation, question, and affirmation)
Development set: 1,500 sentences (composed in the same way as the test set)

Page 15: Modality-Preserving Phrase-based Statistical Machine Translation

Experiments

• Based on a preliminary evaluation with BLEU, N was set to 30 (LLR30).
• The baseline method uses no additional features.

Page 16: Modality-Preserving Phrase-based Statistical Machine Translation

Manual Evaluation

• To verify the effect on translation quality of adding the proposed features.
• To verify the accuracy of each modality.

We randomly extracted 90 pairs per modality to test the methods (270 pairs in total).

Page 17: Modality-Preserving Phrase-based Statistical Machine Translation

Translation Quality (number of sentences)

Method | Good (S, A, B) | S | A | B | C | D
Baseline (no additional features) | 151 | 60 | 57 | 34 | 26 | 93
Manual extraction | 153 | 55 | 54 | 44 | 29 | 88
LLR30 | 154 | 60 | 56 | 38 | 28 | 88

All the methods have about the same translation quality if S, A, and B are counted as good translations.

Page 18: Modality-Preserving Phrase-based Statistical Machine Translation

Accuracy of Each Modality (percentage of outputs that preserved the modality of the input)

Method | Aff | Neg | Que
Baseline | 86.67 | 39.22 | 90.48
Manual extraction | 87.41 | 64.71 | 90.48
LLR30 | 87.41 | 62.75 | 95.24

• The proposed methods showed a marked improvement for the negation modality.
• LLR30 was more accurate than the baseline for all modalities.

Page 19: Modality-Preserving Phrase-based Statistical Machine Translation

Translation Example

Input (question): サーカスと動物園、どっちに行こうか。

Baseline: Let's go to the circus and, the zoo? (X)

Proposed method (manual extraction): Which one shall we go to the circus and zoo? (O)

Page 20: Modality-Preserving Phrase-based Statistical Machine Translation

Translation Example

Input (question): キャンセルしてもかまいませんか。

Baseline: May I cancel? (O)

Proposed method (manual extraction): I don't mind if you cancel it? (X)

masen alone marks negation, but masen ka marks a question: we have to treat word combinations.

Page 21: Modality-Preserving Phrase-based Statistical Machine Translation

Conclusion

• We proposed additional features that consider characteristic words for modality-preserving phrase-based SMT.
• The proposed methods produced more translations that preserved the modality of the input sentence than the baseline, without decreasing translation quality.
• Automatic extraction performed the same as or better than manual extraction.

Page 22: Modality-Preserving Phrase-based Statistical Machine Translation

LLR

Page 23: Modality-Preserving Phrase-based Statistical Machine Translation

LLR

      | Negation | Affirmation |
w = 1 |    a     |     b       | a + b
w = 0 |    c     |     d       | c + d
      |  a + c   |   b + d     |   n

Page 24: Modality-Preserving Phrase-based Statistical Machine Translation

Translation Example

Input (affirmation): やさしく打ってくださいね。

Proposed method (manual extraction): Please go easy. (O)

Proposed method (English side only): Please go easy, isn't it? (X)

Page 25: Modality-Preserving Phrase-based Statistical Machine Translation

Calculation of LLR

• Pr(D|H_indep) is the probability under the null hypothesis that the occurrences of a word w in the negative and affirmative sentences are independent of one another.

• Pr(D|H_dep) is the corresponding probability under the hypothesis that the occurrences are dependent.

(In case of negation)

If a word tends to occur in negation only, the LLR score becomes high.
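Combining the two hypotheses, the score is presumably the log ratio of the two likelihoods (a standard formulation; the slide does not show the exact formula):

```latex
\mathrm{LLR} \;=\; \log \frac{\Pr(D \mid H_{\mathrm{dep}})}{\Pr(D \mid H_{\mathrm{indep}})}
```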

Page 26: Modality-Preserving Phrase-based Statistical Machine Translation

Calculation of LLR

      | Negation | Affirmation |
w = 1 |    a     |     b       | a + b
w = 0 |    c     |     d       | c + d
      |  a + c   |   b + d     |   n

(a, b, c, d: occurrence frequencies under each condition)

Page 27: Modality-Preserving Phrase-based Statistical Machine Translation

Related Studies

• Class-Dependent Modeling for Dialog Translation [Finch et al., 2009]: two models are trained, one for question sentences and one for other sentences.
• Discriminative Reranking for SMT using Various Global Features [Goh et al., 2010]: probabilities of sentence types such as negations and questions are used.

Neither of these studies discussed which expressions influence modalities. Our study focuses on the characteristic modality words of negations and questions.