2nd Asian Regional Training on Local Language Computing
20th–24th June, 2005
Prince D'Angkor Hotel, Siem Reap, CAMBODIA
Pan Localization Cambodia (PLC)
History of Khmer Script
Evolution of Khmer Script
Present Khmer Script
- Script & Sub-script: 69
- Dependent Vowel: 23
- Independent Vowel: 13
- Various Signs: 21
- Digits: 10
- Divination Lore: 10 (៰ ៱ ៲ ៳ ៴ ៵ ៶ ៷ ៸ ៹)
- Lunar Date: 32
- Money Sign: 1 (៛)
- TOTAL of 179
Khmer Collation Development
ROS, Pich Hemy
Pan Localization Cambodia (PLC)
6/22/2005 2nd Asian Regional Training on Local Language Computing Page 6
Agenda
- Problems in Khmer Script Ordering Rule
- Khmer Collation Development
- Khmer Normalization Development
- Discussion and Further Development
Resource Problems
- There is no specific or official resource for the ordering rule in the Khmer language.
- Only the CHOUN NAT dictionary is recognized by the government as the official one.
Problems in CHOUN NAT dictionary
- The words in the dictionary are ordered phonetically.
- Many complex problems are found:
  - Some characters are pronounced in several different ways depending on usage, so one character can be found in different buckets in the dictionary. Ex. The consonant BA can be pronounced as BA and PA.
  - Some characters are found among another sequence of consonants because of their similar sound.
- No linguistic resource is found to solve those problems.
- In any circumstance, the dictionary's ordering must be adopted 100 percent.
Solutions
- CHOUN NAT words: ordered by the CHOUN NAT word list
- Non-CHOUN NAT words: ordered by the most common rule from CHOUN NAT
Prior tasks
- Study the dictionary to extract the most common ordering rule.
- Generate a collation element table.
- Discuss the changes with the Royal Academy of Cambodia and obtain approval.
- Design and implementation.
Khmer Character categories
Khmer characters can be categorized in 9 groups:
- Consonant: ˝
- Independent vowel: Χ
- Dependent vowel: ˘Љ
- Subscript: ˘˛
- Various sign: ˘х
- Digit: ơ
- Khmer symbol for divination lore: ៱
- Khmer lunar date
Khmer Character ordering
In general:
- Consonant
- Dependent vowel
- Subscript
- Digit
- Numeric symbol for divination lore
- Lunar date symbol
Independent vowels: their position varies among the consonants ũ Ų Β
Various signs: ignorable if any of the above characters are present
Level of Comparison
We defined two levels of comparison:
- L1, base characters such as consonants, subscripts, vowels, etc.
  Č < Č₤
- L2, accents such as various signs
  Č < Čэ < Č₤
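The two levels can be illustrated with a small sort-key sketch. It is not the project's implementation: the weights are made up, lowercase Latin letters stand in for base characters, and an apostrophe stands in for a various sign.

```python
# Hypothetical weights: "k" and "s" stand in for base characters (level 1),
# and "'" stands in for a various sign (level 2).
BASE_WEIGHT = {"k": 1, "s": 2}
SIGN_WEIGHT = {"'": 1}


def sort_key(text):
    """Return a (level-1, level-2) key; level 2 only breaks level-1 ties."""
    level1 = tuple(BASE_WEIGHT[ch] for ch in text if ch in BASE_WEIGHT)
    level2 = tuple(SIGN_WEIGHT[ch] for ch in text if ch in SIGN_WEIGHT)
    return (level1, level2)


words = ["ks", "k'", "k"]
print(sorted(words, key=sort_key))  # ['k', "k'", 'ks']
```

Because the level-1 tuple is compared first, a string with an extra base character always sorts after its prefix, and a sign only matters when the base characters are identical.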
Khmer Collation Algorithm
- Produce the normalized form of each input string.
- If both strings exist in the dictionary, order them according to their positions in the dictionary.
- If not:
  - Produce an array of collation elements for each string.
  - Compare the elements.
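The steps above can be sketched as a pairwise comparator. This is a minimal sketch under stated assumptions: `make_comparator`, `normalize`, `dictionary_rank`, and `collation_elements` are hypothetical names, and the stand-ins used in the demo (whitespace stripping, code-point weights, a two-word dictionary) are placeholders rather than the real Khmer data.

```python
from functools import cmp_to_key


def make_comparator(normalize, dictionary_rank, collation_elements):
    """Build a comparison function that follows the three steps above."""
    def compare(a, b):
        a, b = normalize(a), normalize(b)
        if a in dictionary_rank and b in dictionary_rank:
            # Both words are in the dictionary: use their positions.
            left, right = dictionary_rank[a], dictionary_rank[b]
        else:
            # Otherwise compare the arrays of collation elements.
            left, right = collation_elements(a), collation_elements(b)
        return (left > right) - (left < right)
    return compare


# Placeholder inputs (assumptions for the demo only).
compare = make_comparator(
    normalize=str.strip,
    dictionary_rank={"b": 0, "a": 1},            # dictionary lists "b" first
    collation_elements=lambda s: [ord(ch) for ch in s],
)
print(sorted(["a ", "b"], key=cmp_to_key(compare)))  # ['b', 'a ']
```

Mixing the two orderings gives a consistent sort only if the element-based order agrees with the dictionary order for dictionary words; the slides do not say how that case is handled.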
Dictionary word list
- All the words and their positions in the dictionary are stored in a text file.
- At run time, the data in the file is loaded into a hash table, because the dictionary does not contain many words (around 40,000).
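A minimal sketch of the loading step follows. The file format is not described in the slides, so this assumes one word per line with line order giving the dictionary position; `load_word_positions` is a hypothetical name.

```python
import io


def load_word_positions(lines):
    """Map each word to its position, assuming one word per line."""
    positions = {}
    for line in lines:
        word = line.strip()
        if word:
            # Keep the first position if a word is listed twice.
            positions.setdefault(word, len(positions))
    return positions


# In-memory stand-in for the text file; a real call would pass
# open(path, encoding="utf-8") instead.
sample = io.StringIO("b\na\n\nb\n")
print(load_word_positions(sample))  # {'b': 0, 'a': 1}
```

A Python `dict` is a hash table, so a lookup for each of roughly 40,000 words is a constant-time operation on average.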
Problems
- Unicode has more than one way to encode things. For example:
  1. ω: ω or Į + ˘ċ
  2. ū˝: ˝ + ˘˛ + Ū˘ or ˝ + Ū˘ + ˘˛
- Users require them to be treated the same.
- Normalization is needed.
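The second kind of problem can be reproduced with Python's standard `unicodedata` module. The choice of characters is an assumption made for illustration: SA as the base consonant, subscript TA as a south subscript, and subscript RO as a west subscript, each subscript encoded as COENG (U+17D2) followed by the consonant.

```python
import unicodedata

SA, TA, RO, COENG = "\u179F", "\u178F", "\u179A", "\u17D2"

# The same two subscripts entered in two different orders.
south_then_west = SA + COENG + TA + COENG + RO
west_then_south = SA + COENG + RO + COENG + TA

# The code point sequences differ, so a plain comparison fails.
print(south_then_west == west_then_south)  # False

# Standard Unicode normalization does not reorder these sequences either.
nfc_a = unicodedata.normalize("NFC", south_then_west)
nfc_b = unicodedata.normalize("NFC", west_then_south)
print(nfc_a == nfc_b)  # False
```

Since the standard normalization forms leave both strings unchanged, a Khmer-specific reordering step is needed before such strings can be treated as equal.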
Algorithm
- Decompose characters according to the canonical mappings. That is, put the string into a normalization form.
- Render those characters into a string that respects Khmer spelling order.
Characters Rendering
Example: the syllable ū₤ʼnБ may be encoded with its two subscripts in either order:
1. ₤ + ˘ʼn + Ū˘ + ˘Б
2. ₤ + Ū˘ + ˘ʼn + ˘Б
Both sequences are rendered to the same string.
How to render?
- As a rule, Khmer words are written according to their spelling order.
- Word detection is still a major concern in Khmer because Khmer writing does not separate words with spaces.
- A syllable is easier to detect than a word because it is well-defined.
What is syllable?
In writing, a syllable is a succession of characters that forms an inseparable unit. In Khmer, it is composed, in order, of:
- One base consonant
- Zero, one, or two subscripts
- Zero or one consonant shifter
- Zero or one vowel
- Zero, one, or two various signs
Subscripts are categorized into 3 groups, and their priority in the Khmer spelling order is: south subscript, west subscript, east subscript.
Vowels are categorized into four main groups: west vowel, east vowel, south vowel and north vowel.
For example: ⅜ŵ ŚБ is split into ⅜ | ŵ | Ś Б
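Reordering one syllable into this spelling order can be sketched as a stable sort. The sketch assumes each unit has already been classified with the abbreviations from the transition lookup table (C, CS, WV, NV, SV, EV, WS, ES, SS, VS). The rank table is an assumption: the four vowel groups share one rank because the slides give no order among them, and `reorder_syllable` is a hypothetical name.

```python
# Rank of each class in the spelling order described above (assumed values).
SPELLING_RANK = {
    "C": 0,                                  # base consonant
    "SS": 1, "WS": 2, "ES": 3,               # south, west, east subscripts
    "CS": 4,                                 # consonant shifter
    "WV": 5, "EV": 5, "SV": 5, "NV": 5,      # vowels
    "VS": 6,                                 # various signs
}


def reorder_syllable(units):
    """Sort (class, text) pairs into spelling order; ties keep input order."""
    return sorted(units, key=lambda unit: SPELLING_RANK[unit[0]])


typed_a = [("C", "sa"), ("WS", "ro"), ("SS", "ta"), ("NV", "ii")]
typed_b = [("C", "sa"), ("SS", "ta"), ("WS", "ro"), ("NV", "ii")]
print(reorder_syllable(typed_a) == reorder_syllable(typed_b))  # True
```

Python's `sorted` is stable, so units of the same rank, such as two various signs, stay in the order in which they were typed.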
How to detect a syllable?
The main idea for detecting a Khmer syllable is to follow the state transition for each character of the input string until a terminator is reached. There are four conditions for reaching the end of a syllable:
- There is no possible transition state for the character.
- The current character is a non-Khmer character or belongs to a group of characters that may act as a terminator.
- The syllable buffer is full, or the block in the buffer is already filled.
- The end of the input string is reached.
Therefore, it is necessary to analyze all possible input sequences of Khmer script and then build a lookup table of possible transitions.
Transition Lookup Table
- Initial state = Consonant
- 0 means the transition is NOT possible
- 1 means the transition is possible
- Each row represents the current state
- Each column represents the next state

    C   CS  WV  NV  SV  EV  WS  ES  SS  VS
C   0   1   1   1   1   1   1   1   1   1
CS  0   0   1   1   1   1   1   1   1   1
WV  0   0   0   0   0   0   1   1   1   1
NV  0   0   0   0   0   0   1   1   1   1
SV  0   0   0   0   0   0   1   0   0   1
EV  0   0   0   0   0   0   1   0   0   1
WS  0   1   1   1   1   1   0   1   1   1
ES  0   1   1   1   1   1   0   1   0   1
SS  0   1   1   1   1   1   1   1   1   1
VS  0   0   0   0   0   1   1   1   1   1

Shortcut  Meaning            Example
C         Consonant          ˝
CS        Consonant Shifter  ˘л , ˘п
WV        West Vowel         Ю˘
NV        North Vowel        ˘Б
SV        South Vowel        ˘Н
EV        East Vowel         ˘Љ
WS        West Subscript     Ū˘
ES        East Subscript     ˘ŏ
SS        South Subscript    ˘˛
VS        Various Sign       ˘х
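A table-driven syllable splitter along these lines might look as follows. It is a sketch, not the project's code: it works on class labels rather than real characters, the row values are one reading of the lookup table, any label outside the table (such as "X") stands for a non-Khmer terminator, and the buffer-full condition is left out. `split_syllables` is a hypothetical name.

```python
# Column order used for every row below.
CLASSES = ["C", "CS", "WV", "NV", "SV", "EV", "WS", "ES", "SS", "VS"]

# One row per current state; each digit says whether the class in the
# same position of CLASSES may follow (1) or not (0). Assumed reading.
ROWS = {
    "C":  "0111111111",
    "CS": "0011111111",
    "WV": "0000001111",
    "NV": "0000001111",
    "SV": "0000001001",
    "EV": "0000001001",
    "WS": "0111110111",
    "ES": "0111110101",
    "SS": "0111111111",
    "VS": "0000011111",
}

ALLOWED = {
    current: {nxt for nxt, bit in zip(CLASSES, bits) if bit == "1"}
    for current, bits in ROWS.items()
}


def split_syllables(labels):
    """Split a sequence of class labels into syllables."""
    syllables, current = [], []
    for label in labels:
        if label not in ALLOWED:
            # Non-Khmer character: close the syllable, emit it separately.
            if current:
                syllables.append(current)
            syllables.append([label])
            current = []
        elif current and label in ALLOWED[current[-1]]:
            current.append(label)          # transition is possible
        else:
            if current:
                syllables.append(current)  # no transition: syllable ends
            current = [label]
    if current:
        syllables.append(current)          # end of the input string
    return syllables


print(split_syllables(["C", "SS", "NV", "C", "EV", "C", "C", "WV"]))
# [['C', 'SS', 'NV'], ['C', 'EV'], ['C'], ['C', 'WV']]
```

Storing the table as a set of allowed next states per current state keeps each transition check to a single membership test.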
Collation with normalized output?
- Problems
  - Before the collation process, the string must be normalized.
  - The output string might be different from the original one.
- Solution
  - Let the user choose whether or not they want the normalized output.
Collation of Khmer date, time, book title
- Collation of Khmer dates, times, and book titles is not yet covered by our current research.
- There is no definitive resource that determines the Khmer date and time formats or abbreviations.
Subscript of TA (˘ʼn) and DA (˘Ś)
- The two subscripts look the same, but their codes are different.
- Usually, people do not pay much attention to which of the two subscripts they write.
- The problem arises when the user wishes to sort data: the order must follow the code.
- No solution has yet been proposed for this problem in Khmer.
- Therefore, users must be careful when using the two subscripts.
Questions?
Thank you!