Kyoshiro SUGIYAMA, AHC-Lab., NAIST
An Investigation of Machine Translation Evaluation Metrics
in Cross-lingual Question Answering
Kyoshiro Sugiyama, Masahiro Mizukami, Graham Neubig, Koichiro Yoshino, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
NAIST, Japan
Kyoshiro SUGIYAMA , AHC-Lab. , NAIST
Question answering (QA)
One technique for information retrieval
Input: question → Output: answer
[Diagram: the question "Where is the capital of Japan?" is used to retrieve the answer "Tokyo" from an information source.]
QA using knowledge bases
Convert the question sentence into a query
Low ambiguity
Knowledge bases are linguistically restricted → cross-lingual QA is necessary
[Diagram: the question "Where is the capital of Japan?" is converted into the query Type.Location⊓Country.Japan.CapitalCity; the QA system sends the query to the knowledge base, which responds Location.City.Tokyo, i.e. "Tokyo."]
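The query notation above (e.g. Type.Location⊓Country.Japan.CapitalCity, where ⊓ denotes intersection) can be read as set operations over knowledge-base triples. A toy sketch of this reading, with a made-up miniature knowledge base (entity and predicate names are illustrative, not Freebase's actual schema):

```python
# Toy knowledge base: predicate -> set of (subject, object) pairs.
# All entries are made up for illustration.
KB = {
    "Type": {("Tokyo", "Location"), ("Japan", "Location"), ("Sony", "Company")},
    "Country.CapitalCity": {("Japan", "Tokyo"), ("France", "Paris")},
}

def capital_of(country):
    # Entities x such that (country, x) is in the CapitalCity relation.
    return {o for s, o in KB["Country.CapitalCity"] if s == country}

# Type.Location ⊓ Country.Japan.CapitalCity:
# intersection of "entities of type Location" and "capitals of Japan".
locations = {e for e, t in KB["Type"] if t == "Location"}
answer = locations & capital_of("Japan")
print(answer)  # {'Tokyo'}
```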
Cross-lingual QA (CLQA)
The question sentence and the information source are in different languages.
[Diagram: the Japanese question 日本の首都はどこ? ("Where is the capital of Japan?") must be mapped to the query Type.Location⊓Country.Japan.CapitalCity; the knowledge base responds Location.City.Tokyo, i.e. 東京 ("Tokyo").]
Creating such a mapping directly is costly and not reusable in other languages.
CLQA using machine translation
Machine translation (MT) can be used to perform CLQA
Easy, low-cost, and usable in many languages
QA accuracy depends on MT quality
[Diagram: 日本の首都はどこ? is machine-translated into "Where is the capital of Japan?", which an existing English QA system answers with "Tokyo"; the answer is machine-translated back into 東京.]
Purpose of our work
To clarify how translation affects QA accuracy:
Which MT metrics are suitable for the CLQA task?
→ Creation of QA data sets using various translation systems
→ Evaluation of translation quality and QA accuracy
What kinds of translation results influence QA accuracy?
→ Case study (manual analysis of the QA results)
QA system
SEMPRE framework [Berant et al., 2013]
Three steps of query generation:
Alignment: convert entities in the question sentence into "logical forms"
Bridging: generate predicates compatible with neighboring predicates
Scoring: evaluate candidates using a scoring function
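The scoring step ranks candidate logical forms with a learned, feature-based scoring function. A minimal sketch of such linear scoring (the feature names, weights, and candidates here are hypothetical, not SEMPRE's actual model):

```python
# Hypothetical feature vectors for two candidate logical forms
# produced by the alignment and bridging steps.
candidates = {
    "Type.Location ⊓ Country.Japan.CapitalCity": {"alignment": 2.0, "bridging": 1.0},
    "Type.Person ⊓ Country.Japan.CapitalCity":   {"alignment": 0.5, "bridging": 1.0},
}
weights = {"alignment": 1.5, "bridging": 0.8}  # learned from data in practice

def score(features):
    # Linear score: dot product of features and weights.
    return sum(weights[f] * v for f, v in features.items())

scores = {c: score(f) for c, f in candidates.items()}
best = max(scores, key=scores.get)  # highest-scoring candidate is chosen
print(best)
```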
Data set creation
Free917 → Training (512 pairs), Dev. (129 pairs), Test (276 pairs) — the OR set
OR set → manual translation into Japanese → JA set
JA set → translation into English → HT, GT, YT, Mo, Tra sets
Translation method
Manual Translation (“HT” set): Professional humans
Commercial MT systems: Google Translate ("GT" set), Yahoo! Translate ("YT" set)
Moses ("Mo" set): phrase-based MT system
Travatar ("Tra" set): tree-to-string MT system
Experiments
Evaluation of the translation quality of the created data sets (reference: the questions in the OR set)
Evaluation of QA accuracy using the created data sets (with the same QA model)
Investigation of the correlation between the two
Metrics for evaluation of translation quality
BLEU+1: Evaluates local n-grams
1-WER: Evaluates whole word order strictly
RIBES: Evaluates rank correlation of word order
NIST: Evaluates local word order and correctness of infrequent words
Acceptability: Human evaluation
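Of these metrics, 1-WER is the simplest to compute: word error rate is the word-level Levenshtein (edit) distance between hypothesis and reference, normalized by reference length. A minimal sketch:

```python
def wer(hyp, ref):
    """Word error rate: word-level edit distance / reference length."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edit distance between h[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(r)

ref = "where is the capital of japan"
hyp = "where is capital of japan"
print(1 - wer(hyp, ref))  # one deletion over six reference words: 1 - 1/6
```

Because the whole alignment is penalized, a single reordered phrase shifts many words at once, which is why 1-WER judges word order more strictly than n-gram metrics.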
Sentence-level analysis
47% of the OR-set questions are not answered correctly; these questions might be difficult to answer even with a correct translation.
Dividing the questions into two groups:
Correct group (141×5 = 705 questions): translated from the 141 questions answered correctly in the OR set
Incorrect group (123×5 = 615 questions): translated from the remaining 123 questions in the OR set
Sentence-level correlation
Metrics        R² (correct group)   R² (incorrect group)
BLEU+1         0.900                0.007
1-WER          0.690                0.092
RIBES          0.418                0.311
NIST           0.942                0.210
Acceptability  0.890                0.547

In the incorrect group, there is very little correlation.
NIST has the highest correlation → importance of content words.
If the reference cannot be answered correctly, the sentences are not suitable, even as negative samples.
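The R² values above measure how well each metric's sentence-level score linearly predicts QA accuracy; for a single predictor, R² is the squared Pearson correlation. A self-contained sketch (the sample numbers below are illustrative, not the paper's data):

```python
def r_squared(x, y):
    """Coefficient of determination of a simple linear fit of y on x
    (equal to the squared Pearson correlation for one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return (cov * cov) / (vx * vy)

# Illustrative only: metric scores vs. QA accuracy for five hypothetical bins.
metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
qa_accuracy = [0.10, 0.22, 0.24, 0.35, 0.44]
print(round(r_squared(metric_scores, qa_accuracy), 3))
```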
Sample 3
All of the translated questions were answered correctly even though they are grammatically incorrect.
Conclusion
The NIST score has the highest correlation with QA accuracy; NIST is sensitive to changes in content words.
If the reference cannot be answered correctly, there is very little correlation between translation quality and QA accuracy; answerable references should be used.
Three factors cause changes in QA results: content words, question types, and syntax.