Applying mpaligner to Machine Transliteration with Japanese-Specific Heuristics
Yoh Okuno
Outline
• Introduction
• System
• Experiments
• Conclusion
Outline
• Introduction
– Statistical Machine Transliteration
– Baseline and Our Systems
• System
• Experiments
• Conclusion
Machine Transliteration as Monotonic SMT
• The most common approach to machine
transliteration follows the manner of SMT
(Statistical Machine Translation)
• It consists of 3 steps:
1. Align the training data monotonically (character-based)
2. Train a discriminative model on the aligned data
3. Decode an input string into an n-best list
[Finch+ 2008]
Example of Statistical Transliteration
• Given training data of transliteration pairs

Training Data:
OKUNO 奥野  NOMURA 野村  MURAI 村井
Example of Statistical Transliteration
1. Align the training data using co-occurrence

Aligned Data:
OKU:NO 奥:野  NO:MURA 野:村  MURA:I 村:井
Example of Statistical Transliteration
2. Train a statistical model from the aligned data

Learned Model (Rules):
OKU → 奥  NO → 野  MURA → 村  I → 井
Example of Statistical Transliteration
3. Decode new input and return output

Test Input: OKUMURA  OKUI  MURANO
Output:     奥村  奥井  村野
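The align → train → decode example above can be sketched as a toy Python pipeline. This is only an illustration of the example slides: the real system uses mpaligner/m2m-aligner for alignment and DirecTL+ as a discriminative model, whereas here the "model" is a plain rule table and the decoder is a greedy longest-match segmenter.

```python
# Toy version of the align -> train -> decode example.
# Step 1 (alignment) is taken as given; a real system learns it with EM.
aligned = [("OKU:NO", "奥:野"), ("NO:MURA", "野:村"), ("MURA:I", "村:井")]

# Step 2, "train": collect substring -> character rules from the aligned data.
rules = {}
for src, tgt in aligned:
    for s, t in zip(src.split(":"), tgt.split(":")):
        rules[s] = t

# Step 3, "decode": greedy longest-match segmentation with the learned rules
# (a stand-in for a real n-best discriminative decoder such as DirecTL+).
def decode(name):
    out, i = "", 0
    while i < len(name):
        for j in range(len(name), i, -1):  # try the longest source chunk first
            if name[i:j] in rules:
                out += rules[name[i:j]]
                i = j
                break
        else:
            raise ValueError("no rule covers " + name[i:])
    return out

print(decode("OKUMURA"))  # 奥村
print(decode("MURANO"))   # 村野
```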
The Baseline System using m2m-aligner

Training Data → Align: m2m-aligner → Train: DirecTL+ → Decode: DirecTL+ → Output: N-best List

[Jiampojamarn+ 2007, 2008]
Our System: mpaligner with Heuristics

Training Data → Pre-processing → Align: mpaligner → Train: DirecTL+ → Decode: DirecTL+ → Output: N-best List

Japanese-specific heuristics (pre-processing):
1. JnJk: De-romanization
2. EnJa: Syllable-based Alignment

mpaligner, an improved alignment tool [Kubo+ 2011]:
1. Better accuracy than m2m-aligner
2. No hand-tuned parameters
Outline
• Introduction
• System
– Comparing Aligners
– Japanese-Specific Heuristics
• Experiments
• Conclusion
m2m-aligner: Many-to-Many Alignments
• Alignment tool based on the EM algorithm and MLE
• Advantages:
1. Can align multiple characters at once
2. Performs well on short alignments
• Disadvantages:
1. Poor performance on long alignments due to overfitting
2. Requires hand-tuning of the length-limit parameters
[Jiampojamarn+ 2007]
http://code.google.com/p/m2m-aligner/
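To make the length-limit parameters concrete, the following sketch (my simplification, not m2m-aligner's actual EM code) enumerates the monotonic many-to-many segmentations of a pair that given length limits admit; EM then weights these candidates, and loosening the limits rapidly grows the space that plain MLE can overfit:

```python
# Enumerate all monotonic many-to-many segmentations of a string pair,
# subject to maximum chunk lengths max_x (source) and max_y (target).
# These limits are the hand-tuned parameters m2m-aligner requires.
def segmentations(src, tgt, max_x=2, max_y=2):
    if not src and not tgt:
        return [[]]  # both sides fully consumed: one empty segmentation
    out = []
    for i in range(1, min(len(src), max_x) + 1):
        for j in range(1, min(len(tgt), max_y) + 1):
            for rest in segmentations(src[i:], tgt[j:], max_x, max_y):
                out.append([(src[:i], tgt[:j])] + rest)
    return out

# With max_x=3 only one segmentation of SUZUKI / 鈴木 is possible;
# raising the limit to 4 already admits three.
print(segmentations("SUZUKI", "鈴木", max_x=3, max_y=1))
# [[('SUZ', '鈴'), ('UKI', '木')]]
```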
mpaligner: Minimum Pattern Aligner
• Idea: penalize long alignments during the E-step
• Simple scaling as below:
  P(x, y) → P(x, y)^(|x| + |y|)
• x: source string, y: target string
• |x|: length of x, |y|: length of y
• P(x, y): probability of the string pair (x, y)
• Good performance without hand-tuned parameters
[Kubo+ 2011]
http://sourceforge.jp/projects/mpaligner/
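A minimal sketch of this scaling, assuming the P(x, y)^(|x|+|y|) form reported for mpaligner in [Kubo+ 2011]: because probabilities are below 1, raising them to the power of the pair length penalizes long pairs automatically, with no length-limit parameter to tune.

```python
# Length scaling assumed from [Kubo+ 2011]: P(x, y) -> P(x, y) ** (|x| + |y|).
def scaled_prob(p, x, y):
    return p ** (len(x) + len(y))

# Same raw probability, but the long pair scores orders of magnitude lower:
print(scaled_prob(0.5, "O", "奥"))       # 0.5 ** 2 = 0.25
print(scaled_prob(0.5, "OKUNO", "奥野"))  # 0.5 ** 7 = 0.0078125
```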
Motivation: Invalid Alignment Problem
• Character-based alignment can be phonetically invalid
– It may divide atomic units into meaningless pieces
– We call the smallest valid unit of alignment a syllable
• Syllable-based alignment should be used for this task
– Problem: no training data for syllable-based alignment
• In this study, we propose Japanese-specific heuristics
for this problem, based on knowledge of Japanese
Examples of Invalid and Valid Alignment
• In Japanese, consonants should be combined with vowels

• JnJk Task
Type    | Source  | Target
Valid   | SUZU:KI | 鈴:木
Invalid | SUZ:UKI | 鈴:木
Valid   | HIRO:MI | 裕:実
Invalid | HIR:OMI | 裕:実
Valid   | OKU:NO  | 奥:野
Invalid | OK:UNO  | 奥:野

• EnJa Task
Type    | Source      | Target
Valid   | Ar:thur     | アー:サー
Invalid | A:r:th:ur   | ア:ー:サ:ー
Valid   | Cha:p:li:n  | チャッ:プ:リ:ン
Invalid | C:h:a:p:li:n | チ:ャ:ッ:プ:リ:ン
Valid   | Ju:s:mi:ne  | ジャ:ス:ミ:ン
Invalid | J:u:s:mi:ne | ジ:ャ:ス:ミ:ン
Language-Specific Heuristics as Preprocessing
• Developed Japanese-specific heuristics for the JnJk and EnJa tasks as preprocessing
– Combine atomic strings into syllables
– Treat each syllable as one character in alignment
• The definition of syllable should be chosen carefully
– It may cause bad side effects
– Some context is already captured by n-gram features
JnJk Task: De-romanization Heuristic
• De-romanization: convert Roman characters to Kana
• A consonant and a vowel are coupled into one Kana character
• A common romanization table (Hepburn) is used:

Roman: A  I  U  E  O    KA KI KU KE KO
Kana:  あ い う え お    か き く け こ

http://www.social-ime.com/conv-table.html
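A minimal de-romanization sketch; the table here is a tiny illustrative subset of the full Hepburn table linked above, and greedy longest-match is one simple way to realize the conversion:

```python
# Tiny illustrative subset of a Hepburn-style romaji -> hiragana table.
ROMAN_TO_KANA = {
    "A": "あ", "I": "い", "U": "う", "E": "え", "O": "お",
    "KA": "か", "KI": "き", "KU": "く", "KE": "け", "KO": "こ",
    "NO": "の", "MU": "む", "RA": "ら", "SU": "す", "ZU": "ず",
}

def deromanize(s):
    out, i = "", 0
    while i < len(s):
        # Try the longest romaji chunk first (Hepburn units are <= 3 letters).
        for j in (3, 2, 1):
            if s[i:i + j] in ROMAN_TO_KANA:
                out += ROMAN_TO_KANA[s[i:i + j]]
                i += j
                break
        else:
            out += s[i]  # leave letters not in the table untouched
            i += 1
    return out

print(deromanize("SUZUKI"))  # すずき
print(deromanize("OKUNO"))   # おくの
```

After this step, each consonant+vowel pair is a single Kana, so the aligner can no longer split it into phonetically invalid pieces like SUZ:UKI.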
EnJa Task: Syllable-based Alignment
• In the EnJa task, the target side should be aligned in
units of syllables, not characters
• Combine sub-characters with the previous character
• There are 3 types of sub-characters:
1. Lower-case characters (Yo-on): e.g. ャ, ュ, ョ
2. Silent character (Soku-on): ッ
3. Hyphen (Cho-on, long vowel): ー
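This merging rule can be sketched in a few lines. The SUB_CHARS set here also includes the small vowel kana ァィゥェォ, which I assume the same yo-on rule covers:

```python
# Sub-characters that attach to the preceding kana: small (yo-on) kana,
# the gemination marker ッ (soku-on), and the long-vowel mark ー (cho-on).
SUB_CHARS = set("ャュョァィゥェォッー")

def to_syllables(katakana):
    syllables = []
    for ch in katakana:
        if syllables and ch in SUB_CHARS:
            syllables[-1] += ch  # merge the sub-character into the previous unit
        else:
            syllables.append(ch)
    return syllables

print(to_syllables("アーサー"))      # ['アー', 'サー']
print(to_syllables("チャップリン"))  # ['チャッ', 'プ', 'リ', 'ン']
```

The outputs match the valid alignments on the earlier slide (Ar:thur アー:サー, Cha:p:li:n チャッ:プ:リ:ン); treating each merged unit as one "character" keeps the aligner from producing splits like チ:ャ:ッ.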
Outline
• Introduction
• System
• Experiments
– Official Scores for 8 Language Pairs
– Further Investigation for JnJk and EnJa
• Conclusion
Experimental Settings
• Conducted 2 types of experiments
– Official evaluation on the test set for 8 language pairs
– Comparison of the proposed and baseline systems for the JnJk
and EnJa tasks on the development set
• Basically followed the default settings of the tools
– m2m-aligner: length limits were selected carefully
– Iteration number: optimized on the development set
– Features: n-gram (N=2) and context (size=7) features
Official Scores for 8 Language Pairs
• Applied the heuristics to the JnJk and EnJa tasks
• Performed well (top rank on EnPe and EnHe)

Task | ACC   | F-Score | MRR   | MAP   | Rank
JnJk | 0.512 | 0.693   | 0.582 | 0.401 | 2
EnJa | 0.362 | 0.803   | 0.469 | 0.359 | 2
EnCh | 0.301 | 0.655   | 0.376 | 0.292 | 5
ChEn | 0.013 | 0.259   | 0.017 | 0.013 | 4
EnKo | 0.334 | 0.688   | 0.411 | 0.334 | 3
EnBa | 0.404 | 0.882   | 0.515 | 0.403 | 2
EnPe | 0.658 | 0.941   | 0.761 | 0.640 | 1
EnHe | 0.191 | 0.808   | 0.254 | 0.190 | 1
Results in JnJk and EnJa Tasks
• The proposed system outperforms both baselines

Result in JnJk Task
Method      | ACC   | F-Score | MRR   | MAP
m2m-aligner | 0.113 | 0.389   | 0.182 | 0.114
mpaligner   | 0.121 | 0.391   | 0.197 | 0.122
Proposed    | 0.199 | 0.494   | 0.300 | 0.200

Result in EnJa Task
Method      | ACC   | F-Score | MRR   | MAP
m2m-aligner | 0.280 | 0.737   | 0.359 | 0.280
mpaligner   | 0.326 | 0.761   | 0.431 | 0.326
Proposed    | 0.358 | 0.774   | 0.469 | 0.358
Output Examples (10-best list)

JnJk Task (input: Harui Kyotaro)
1 春井 京太郎
2 晴井 恭太郎
3 治井 匡太郎
4 榛井 強太郎
5 敏井 共太郎
6 明井 享太郎
7 陽井 亨太郎
8 遙井 杏太郎
9 遥井 鋸太郎
10 温井 教太郎

EnJa Task (inputs: Bloy, Grothendieck)
1 ブロイ グローテンディック
2 ブロア グロートンディック
3 ブローイ グローテンディーク
4 ブロワ グローテンディック
5 ブロッイ グローゾンディック
6 ブロヤ グローテンジーク
7 ブロヨ グローザーンディック
8 ブウォイ グローザンディック
9 ブロティ グローシンディック
10 ブロレィ グローゼンディック
Error Analysis
• Sparseness problem:
– Side effect of syllable-based alignment in the EnJa task
– Too many target-side characters in the JnJk task
• Word origin [Hagiwara+ 2011]:
– English names come from various source languages
– First and family names can be modeled differently
– Gender: first names differ considerably by gender
• Training data inconsistency or ambiguity:
– e.g. JAPAN → 日本国 (not a transliteration)
Outline
• Introduction
• System
• Experiments
• Conclusion
– Future Work
Conclusion
• Applied mpaligner to the machine transliteration task for
the first time
– It performed better than m2m-aligner
– The plain maximum likelihood estimation approach is not suitable for this task
• Proposed Japanese-specific heuristics for the JnJk and EnJa tasks
– De-romanization for the JnJk task
– Syllable-based alignment for the EnJa task
Future Work
• Combine these heuristics with language-independent
approaches such as [Finch+ 2011] or [Hagiwara+ 2011]
• Develop language-dependent heuristics for languages besides Japanese
• Can we find such heuristics automatically?
Reference (1)
• Andrew Finch and Eiichiro Sumita. 2008. Phrase-based machine transliteration.
• Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion.
• Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2008. Joint processing and discriminative training for letter-to-phoneme conversion.
• Keigo Kubo, Hiromichi Kawanami, Hiroshi Saruwatari, and Kiyohiro Shikano. 2011. Unconstrained many-to-many alignment for automatic pronunciation annotation.
• Min Zhang, A Kumaran, and Haizhou Li. 2012. Whitepaper of NEWS 2012 shared task on machine transliteration.
Reference (2)
• Masato Hagiwara and Satoshi Sekine. 2011. Latent class transliteration based on source language origin.
• Andrew Finch, Paul Dixon, and Eiichiro Sumita. 2011. Integrating models derived from non-parametric Bayesian co-segmentation into a statistical machine transliteration system.
• Andrew Finch and Eiichiro Sumita. 2010. A Bayesian model of bilingual segmentation for transliteration.
WTIM: Workshop on Text Input Methods
• 1st workshop held with IJCNLP 2011 (Thailand)
– 12 people presented, from Google, Microsoft, Yahoo
– https://sites.google.com/site/wtim2011/
• 2nd workshop planned with COLING 2012 (India)
– Venue: Mumbai, India, December 2012
– Are you interested as a presenter or an attendee?
Any Questions?