Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics
Yoh Okuno

Outline
  - Introduction
  - System
  - Experiments
  - Conclusion

Outline
  - Introduction
    - Statistical Machine Transliteration
    - Baseline and Our Systems
  - System
  - Experiments
  - Conclusion

Machine Transliteration as Monotonic SMT [Finch+ 2008]
  - The most common approach to machine transliteration follows the method of SMT (Statistical Machine Translation)
  - It consists of 3 steps:
    1. Align the training data monotonically (character-based)
    2. Train a discriminative model on the aligned data
    3. Decode an input string into an n-best list

Example of Statistical Transliteration
  - Given training data of transliteration pairs
  - Training Data: OKUNO, NOMURA, MURAI (target-side Japanese strings not preserved in this transcript)

Example of Statistical Transliteration
  1. Align the training data using co-occurrence
  - Aligned Data: OKU:NO, NO:MURA, MURA:I

Example of Statistical Transliteration
  2. Train a statistical model from the aligned data
  - Learned Model (Rules): OKU, NO, MURA, I

Example of Statistical Transliteration
  3. Decode new input and return output
  - Test Input: OKUMURA, OKUI, MURANO

The Baseline System using m2m-aligner [Jiampojamarn+ 2007, 2008]
  Training Data
    -> Align: m2m-aligner
    -> Train: DirecTL+
    -> Decode: DirecTL+
    -> Output: N-best List

Our System: mpaligner with Heuristics
  Training Data
    -> Pre-processing: Japanese-Specific Heuristics
         1. JnJk: De-romanization
         2. EnJa: Syllable-based Alignment
    -> Align: mpaligner, an improved alignment tool [Kubo+ 2011]
         1. Better accuracy than m2m-aligner
         2. No hand-tuned parameters
    -> Train: DirecTL+
    -> Decode: DirecTL+
    -> Output: N-best List

Outline
  - Introduction
  - System
    - Comparing Aligners
    - Japanese-Specific Heuristics
  - Experiments
  - Conclusion

m2m-aligner: Many-to-Many Alignments [Jiampojamarn+ 2007]
  - Alignment tool based on the EM algorithm and MLE
  - Advantages:
    1. Can align multiple characters
    2. Performs well on short alignments
  - Disadvantages:
    1. Poor performance on long alignments due to overfitting
    2. Requires hand-tuning of length-limit parameters
  - http://code.google.com/p/m2m-aligner/

mpaligner: Minimum Pattern Aligner [Kubo+ 2011]
  - Idea: penalize long alignments during the E-step
  - Simple scaling (the scaling formula is not preserved in this transcript), defined over:
    - x: source string, y: target string
    - |x|: length of x, |y|: length of y
    - P(x, y): probability of the string pair (x, y)
  - Good performance without hand-tuning parameters
  - http://sourceforge.jp/projects/mpaligner/

Motivation: Invalid Alignment Problem
  - Character-based alignment can be phonetically invalid
    - It may divide atomic units into meaningless pieces
  - We call the smallest unit of alignment a syllable
    - Syllable-based alignment should be used for this task
  - Problem: there is no training data for syllable-based alignment
  - In this study, we propose Japanese-specific heuristics for this problem, based on knowledge of Japanese

Examples of Invalid and Valid Alignment
  - In Japanese, consonants should be combined with vowels
  - (target-side Japanese strings not preserved in this transcript)

  JnJk Task               EnJa Task
  Type     Source         Type     Source
  Valid    SUZU:KI        Valid    Ar:thur
  Invalid  SUZ:UKI        Invalid  A:r:th:ur
  Valid    HIRO:MI        Valid    Cha:p:li:n
  Invalid  HIR:OMI        Invalid  C:h:a:p:li:n
  Valid    OKU:NO         Valid    Ju:s:mi:ne
  Invalid  OK:UNO         Invalid  J:u:s:mi:ne

Language-Specific Heuristics as Preprocessing
  - Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing
    - Combine atomic strings into syllables
    - Treat a syllable as one character in alignment
  - The definition of a syllable should be chosen carefully
    - It may cause bad side effects
    - Some contexts are incorporated as n-gram features

JnJk Task: De-romanization Heuristic
  - De-romanization: convert Roman characters back to Kana
  - A consonant and a vowel are coupled into one Kana
  - A common romanization table is used (Hepburn)

  Roman  A   I   U   E   O
  Roman  KA  KI  KU  KE  KO
  (the Kana rows of this table are not preserved in this transcript)

  - http://www.social-ime.com/conv-table.html
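
The de-romanization heuristic above can be illustrated with a minimal sketch. It assumes a greedy longest-match lookup, hiragana output, and a tiny hand-picked subset of a Hepburn table; the name `deromanize` and the table contents are hypothetical and are not taken from the author's system, which uses a full conversion table.

```python
# Tiny illustrative subset of a Hepburn romanization table.
# Assumption: hiragana output; a real table is far larger.
ROMAJI_TO_KANA = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
    "SA": "さ", "SHI": "し", "SU": "す", "SE": "せ", "SO": "そ",
    "ZA": "ざ", "JI": "じ", "ZU": "ず", "ZE": "ぜ", "ZO": "ぞ",
    "NA": "な", "NI": "に", "NU": "ぬ", "NE": "ね", "NO": "の",
    "HA": "は", "HI": "ひ", "FU": "ふ", "HE": "へ", "HO": "ほ",
    "MA": "ま", "MI": "み", "MU": "む", "ME": "め", "MO": "も",
    "RA": "ら", "RI": "り", "RU": "る", "RE": "れ", "RO": "ろ",
    "N": "ん",
}
MAX_KEY_LEN = max(len(key) for key in ROMAJI_TO_KANA)


def deromanize(name):
    """Convert a romanized name into a list of kana units.

    Uses greedy longest-match lookup, so "NO" is preferred over "N" + "O".
    Characters missing from the table are passed through unchanged.
    """
    text = name.upper()
    units = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in ROMAJI_TO_KANA:
                units.append(ROMAJI_TO_KANA[chunk])
                i += size
                break
        else:
            # No table entry starts here: keep the raw character.
            units.append(text[i])
            i += 1
    return units


print(deromanize("OKUNO"))   # ['お', 'く', 'の']
print(deromanize("SUZUKI"))  # ['す', 'ず', 'き']
```

Each kana unit can then be treated as a single source-side symbol by the aligner, so a split such as SUZ:UKI can no longer occur.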
EnJa Task: Syllable-based Alignment
  - In the EnJa task, the target side should be aligned in units of syllables, not characters
  - Combine sub-characters with the preceding character
  - There are 3 types of sub-characters:
    1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
    2. Silent character (Soku-on): e.g. ッ
    3. Hyphen (Cho-on; long vowel): e.g. ー

Outline
  - Introduction
  - System
  - Experiments
    - Official Scores for 8 Language Pairs
    - Further Investigation for JnJk and EnJa
  - Conclusion

Experimental Settings
  - Conducted 2 types of experiments:
    - Official evaluation on the test set for 8 language pairs
    - Comparison of the proposed and baseline systems for the JnJk and EnJa tasks on the development set
  - Basically followed the default settings of the tools
    - m2m-aligner: length limits were selected carefully
    - Iteration number: optimized on the development set
    - Features: N-gram (N=2) and context (size=7) features

Official Scores for 8 Language Pairs
  - Applied the heuristics to the JnJk and EnJa tasks
  - Performed well (top rank on EnPe and EnHe)

  Task   ACC    F-Score  MRR    MAP    Rank
  JnJk   0.512  0.693    0.582  0.401  2
  EnJa   0.362  0.803    0.469  0.359  2
  EnCh   0.301  0.655    0.376  0.292  5
  ChEn   0.013  0.259    0.017  0.013  4
  EnKo   0.334  0.688    0.411  0.334  3
  EnBa   0.404  0.882    0.515  0.403  2
  EnPe   0.658  0.941    0.761  0.640  1
  EnHe   0.191  0.808    0.254  0.190  1

Results in JnJk and EnJa Tasks
  - The proposed system outperformed the baselines

  Result in JnJk Task
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.113  0.389    0.182  0.114
  mpaligner    0.121  0.391    0.197  0.122
  Proposed     0.199  0.494    0.300  0.200

  Result in EnJa Task
  Method       ACC    F-Score  MRR    MAP
  m2m-aligner  0.280  0.737    0.359  0.280
  mpaligner    0.326  0.761    0.431  0.326
  Proposed     0.358  0.774    0.469  0.358

Output Examples (10-best list)
  - JnJk Task inputs: Harui, Kyotaro
  - EnJa Task inputs: Bloy, Grothendieck
  - (the 10-best Japanese output candidates are not preserved in this transcript)
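
The EnJa pre-processing described earlier (attaching sub-characters to the preceding character) can be sketched as follows. This is an illustration under assumptions: the target side is katakana, the sub-character set is the one listed in `SUB_CHARACTERS`, and the function name `to_syllables` is hypothetical rather than part of the author's system.

```python
# Sub-characters that attach to the preceding kana.
# Assumption: katakana target side; the exact set used in the system is not given.
SMALL_KANA = set("ァィゥェォャュョヮ")  # lower-case characters, including Yo-on
SOKUON = "ッ"                          # silent character (Soku-on)
CHOON = "ー"                           # long-vowel mark (Cho-on)
SUB_CHARACTERS = SMALL_KANA | {SOKUON, CHOON}


def to_syllables(katakana):
    """Group a katakana string into syllable units for alignment.

    Every sub-character is appended to the unit before it, so the aligner
    can never separate it from the kana it modifies.
    """
    units = []
    for ch in katakana:
        if ch in SUB_CHARACTERS and units:
            units[-1] += ch   # merge into the previous unit
        else:
            units.append(ch)  # start a new unit
    return units


print(to_syllables("チャップリン"))  # ['チャッ', 'プ', 'リ', 'ン']
print(to_syllables("アーサー"))      # ['アー', 'サー']
```

With this grouping, the two examples yield four and two target units respectively, which matches the number of segments in the valid alignments Cha:p:li:n and Ar:thur.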

Error Analysis
  - Sparseness problem:
    - Side effect of syllable-based alignment in the EnJa task
    - Too many target-side characters in the JnJk task
  - Word origin [Hagiwara+ 2011]:
    - English names come from various languages
    - First and family names can be modeled differently
    - Gender: first names are quite different
  - Training data inconsistency or ambiguity
    - e.g. JAPAN (not a transliteration)

Outline
  - Introduction
  - System
  - Experiments
  - Conclusion
    - Future Work

Conclusion
  - Applied mpaligner to the machine transliteration task for the first time
    - Performed better than m2m-aligner
    - The maximum likelihood estimation approach is not suitable
  - Proposed Japanese-specific heuristics for the JnJk and EnJa tasks
    - De-romanization for the JnJk task
    - Syllable-based alignment for the EnJa task

Future Work
  - Combine these heuristics with other language-independent approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
  - Develop language-dependent heuristics for languages other than Japanese
  - Can we find such heuristics automatically?

Reference (1)
  - Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
  - Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
  - Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
  - Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
  - Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.

Reference (2)
  - Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
  - Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
  - Andrew Finch and Eiichiro Sumita. 2010. A Bayesian Model of Bilingual Segmentation for Transliteration.

WTIM: Workshop on Text Input Methods
  - 1st workshop held with IJCNLP 2011 (Thailand)
    - 12 people presented, from Google, Microsoft, Yahoo
    - https://sites.google.com/site/wtim2011/
  - 2nd workshop planned with COLING 2012 (India)
    - Venue: December 2012 in Mumbai, India
    - Are you interested as a presenter or an attendee?

Any Questions?