Introduction to W3C I18n Best Practices
Presented by Gopal Venkatesan <[email protected]>
नमस्कार
வணக்கம்
ನಮಸ್ಕಾರ
నమస్కారం
ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ
നമസ്കാരം
ନମସ୍କାର
নমস্কার
السلام علیکم
નમસ્કાર
Training Outline
• Internationalisation Vocabulary
• Typical Problems
– Outline the common problems found across the web
• Java and Internationalisation
– The level of Internationalisation support available in Java
• Resource Bundles
– Formatting messages the correct way
• PHP and Internationalisation
– The level of Internationalisation support available in PHP
VOCABULARY
Unicode
• International standard for representing written language in computers
• Latest version 5.2 adds 6648 new characters including support for Vedic Sanskrit
• Maintained in sync with ISO 10646
• Three main encodings: UTF-8, UTF-16 and UTF-32
• Address space of 21 bits
Unicode (contd.)
• UTF-8 is a multi-byte encoding that uses 8-bit code units
• An encoded character can take one, two, three or four bytes
• UTF-8 is backward compatible with US-ASCII
• Default encoding for PHP6?
Unicode (contd.)
• UTF-16 uses 16-bit code units
• A single code unit cannot address the complete set, so surrogate pairs are used for the remaining characters
• Default encoding for strings in Java and JavaScript
Unicode (contd.)
• UTF-32 uses 32-bit code units
• Every Unicode character is addressed within a single code unit
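To make the size differences between the three encodings concrete, here is a minimal Java sketch that counts the encoded bytes of a few sample characters. The class and method names (EncodingSizes, byteLength) are made up for illustration, and the big-endian variants are chosen only so that no byte order mark is added to the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Number of bytes needed to encode the text in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Big-endian variants are used so that no byte order mark is emitted.
        Charset utf32 = Charset.forName("UTF-32BE");
        // ASCII letter, Latin-1 letter, Devanagari letter, emoji outside the BMP.
        String[] samples = {"A", "\u00E9", "\u0905", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE),
                    byteLength(s, utf32));
        }
    }
}
```

UTF-8 grows from one to four bytes across the samples, UTF-16 needs two code units (a surrogate pair) only for the last one, and UTF-32 always uses four bytes.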
Internationalisation
• Design and development of a product, application or document content that enables easy localisation for target audiences that vary in culture, region, or language
• Abbreviated as I18n as there are eighteen characters between “I” and “n”
Localisation
• Adaptation of a product, application or document content to meet the language, cultural and other requirements of a specific target market (a “locale”)
• Translation is one aspect of localisation
• Abbreviated as L10n as there are ten characters between “L” and “n”
TYPICAL PROBLEMS
Typical Problem
The Solution
• Determine the user environment
– Format dates, times, currencies as per the locale
• Understand the Internationalisation support available with your implementation language
• Use the ICU/Internationalisation libraries rather than rolling your own functions
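As a rough illustration of formatting per locale with a library instead of hand-written code, the sketch below uses the JDK's java.text.NumberFormat. The class and method names (LocaleFormatDemo, formatNumber) are hypothetical, and the exact output for some locales depends on the locale data shipped with the JDK in use.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleFormatDemo {
    // Formats a number using the grouping and decimal conventions of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        Locale[] locales = {
            Locale.US,
            Locale.GERMANY,
            Locale.forLanguageTag("hi-IN")  // Hindi as used in India
        };
        for (Locale locale : locales) {
            System.out.println(locale.toLanguageTag() + ": " + formatNumber(value, locale));
        }
    }
}
```

The same value is grouped and punctuated differently for each locale without any locale-specific logic in the application code.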
COMMON ENCODING PROBLEMS
Tofu characters – Black hollow boxes
• Shown as a black hollow box, typically one per character
• Indicates font problem i.e., the system doesn’t have the right fonts to display the glyph(s)
• Tofu isn’t always a software problem – not a bug but really annoying
Question Marks – Incorrect conversion
• “???” usually displayed when converting text from one encoding to another
• Means there is no equivalent character in the target encoding for the corresponding source
• Not always a bug, though it also occurs when an incorrect encoding is specified
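The question-mark effect can be reproduced in a few lines of Java. This is only a sketch with made-up names (QuestionMarkDemo, roundTrip); it relies on String.getBytes(Charset) substituting the charset's replacement byte, which is '?' for ISO-8859-1 and US-ASCII, when a character has no equivalent in the target encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    // Encodes the text into the target charset and decodes it again.
    // Characters missing from the target charset come back as '?'.
    static String roundTrip(String text, Charset target) {
        byte[] encoded = text.getBytes(target);
        return new String(encoded, target);
    }

    public static void main(String[] args) {
        String hindi = "\u0928\u092E";   // two Devanagari letters
        String french = "caf\u00E9";     // "café"
        System.out.println(roundTrip(hindi, StandardCharsets.ISO_8859_1));  // no Devanagari in Latin-1
        System.out.println(roundTrip(french, StandardCharsets.ISO_8859_1)); // é exists in Latin-1
        System.out.println(roundTrip(french, StandardCharsets.US_ASCII));   // é does not exist in ASCII
    }
}
```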
Mojibake – 文字化け
• Pronounced “Moh-jee-baa-kay”, a Japanese word meaning “garbled characters”
• Occurs when text in one encoding is “interpreted” as some other encoding
• Most often caused by interpreting Latin-1 as UTF-8
– UTF-8 is compatible only with US-ASCII
– Latin-1 characters outside the ASCII range are not valid UTF-8 as they stand and cause Mojibake
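A small Java sketch of how Mojibake arises, in both directions. The names (MojibakeDemo, misread) are hypothetical; the only assumption is that bytes written in one charset are decoded with another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes the text with its actual charset, then decodes the bytes
    // with a wrongly assumed charset, which is how Mojibake is produced.
    static String misread(String text, Charset actual, Charset assumed) {
        return new String(text.getBytes(actual), assumed);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";  // "é"
        // UTF-8 bytes C3 A9 read as Latin-1 become two characters.
        System.out.println(misread(eAcute, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // Latin-1 byte E9 is not a valid UTF-8 sequence on its own,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(misread(eAcute, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));
    }
}
```

Plain ASCII text survives either mix-up unchanged, which is why the problem often goes unnoticed until the first accented or non-Latin character appears.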
JAVA™ AND UNICODE
Unicode support in Java™
• Java™ has always supported Unicode
• Java™ strings are UTF-16
– A “char” in Java™ is a UTF-16 code unit, not a code point
• By default the input and output streams use the OS native charset
– On Windows™ this is typically Windows-1252
– On most Unices and Unix-like OS this is UTF-8
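The difference between a “char” (code unit) and a code point shows up as soon as a string contains a character outside the Basic Multilingual Plane. A minimal sketch, with hypothetical names (CodeUnitDemo, codeUnits, codePoints):

```java
final class CodeUnitDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";  // U+1F600, stored as a surrogate pair
        System.out.println("code units:  " + codeUnits(emoji));
        System.out.println("code points: " + codePoints(emoji));
        System.out.println("first char is a high surrogate: "
                + Character.isHighSurrogate(emoji.charAt(0)));
    }
}
```

Code that walks a string with charAt() and assumes one char per character will split such a pair in the middle.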
A “Hello, world” example
“Hello, world” on GNU/Linux
Garbage In, Garbage Out!
“Hello, world” Corrected!
Oops!
“Hello, world” Corrected!
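The code shown on the “Hello, world” slides is not reproduced here. As a stand-in, the following sketch shows one common way to avoid depending on the OS native charset: wrap the output stream in a PrintStream with an explicit encoding. The names (Utf8Hello, render) are hypothetical, and this is not necessarily the fix used in the talk.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class Utf8Hello {
    // Writes the text through a PrintStream that always encodes as UTF-8,
    // regardless of the platform default, and returns the bytes produced.
    static byte[] render(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same idea applied to standard output.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u0928\u092E\u0938\u094D\u0924\u0947, world");
    }
}
```

Whether the text then displays correctly still depends on the terminal being set to UTF-8 and having a suitable font.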
EXTERNALISING STRINGS
Resource Bundles
The Need
• Allows a single code base to display strings in multiple languages
• No need to refactor code to support new languages
Beginning
Beginning (Sum.properties)
• SUM_OF = Sum of
• AND = and
• IS = is
That was broken!
• It’s generally a bad idea to concatenate strings
– Does not work for all languages since the grammar is different!
• Always use string substitution using positional parameters
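Positional parameters can be sketched with java.text.MessageFormat. The class and method names (SumMessage, sumMessage) and the second pattern are illustrative only; Locale.ROOT is used here just to keep the number formatting predictable.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class SumMessage {
    // Fills {0}, {1} and {2} in the pattern with a, b and their sum.
    // The pattern decides the word order, not the code.
    static String sumMessage(String pattern, int a, int b) {
        MessageFormat format = new MessageFormat(pattern, Locale.ROOT);
        return format.format(new Object[] {a, b, a + b});
    }

    public static void main(String[] args) {
        System.out.println(sumMessage("Sum of {0} and {1} is {2}", 2, 3));
        // A translation is free to reorder the parameters.
        System.out.println(sumMessage("{2} = {0} + {1}", 2, 3));
    }
}
```

Because each translated pattern carries its own word order, the calling code stays the same for every language.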
Correct Way
Correct Way (contd.)
• SumI18n.properties
– SUM = Sum of {0} and {1} is {2}
• SumI18n_hi.properties
– SUM = {0} अतिरिक्त {1} {2} का बराबर है
• SumI18n_ta.properties
– SUM = {0} மற்றும் {1} கூட்டினால் {2}
Oops!
• Java 1.5 property files are read as ISO-8859-1 (Latin-1)
• Use the “native2ascii” tool to convert Unicode files to escape sequences (\uXXXX)
• native2ascii -encoding UTF-8 SumI18n_hi.properties
• native2ascii -encoding UTF-8 SumI18n_ta.properties
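The effect of the Latin-1 assumption, and of the \uXXXX escapes that native2ascii produces, can be seen with java.util.Properties, whose load(InputStream) reads bytes as ISO-8859-1. This is a sketch with hypothetical names (PropertiesEscapes, loadFromBytes); the file contents are inlined as strings so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesEscapes {
    // Stores the content as UTF-8 bytes, then loads it the way a property
    // file is read from a byte stream (decoded as ISO-8859-1).
    static String loadFromBytes(String fileContent, String key) {
        Properties props = new Properties();
        byte[] utf8 = fileContent.getBytes(StandardCharsets.UTF_8);
        try {
            props.load(new ByteArrayInputStream(utf8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Raw UTF-8 text: each Devanagari letter is misread as three Latin-1 characters.
        String raw = loadFromBytes("SUM = \u0928", "SUM");
        // Escaped text, as native2ascii would write it: decoded correctly.
        String escaped = loadFromBytes("SUM = \\u0928", "SUM");
        System.out.println("raw length:     " + raw.length());
        System.out.println("escaped length: " + escaped.length());
    }
}
```

Since Java 6, Properties.load(Reader) and PropertyResourceBundle(Reader) offer an alternative: the caller supplies a Reader with an explicit charset instead of converting the files.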
It’s working!
INTERNATIONALISATION IN PHP
Challenges
• PHP 5 (and earlier) does not understand characters and encodings
• The multi-byte extension (mbstring) in PHP works only for a few encodings (primarily CJK)
• PHP has very limited functions for formatting date, time, currencies, etc.
• PHP doesn’t provide linguistic sorting!
The Good News – Intl extension
• Open source – http://pecl.php.net/intl
• Designed for PHP 5.x, part of PHP 5.3
– Configure using “--enable-intl”
• Leverages ICU and CLDR
• Available as OO and procedural APIs
– Collator::sort() vs. collator_sort()
• Yahoo! is a key contributor
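To show what linguistic sorting means in practice, the sketch below compares plain code-point ordering with locale-aware collation. It is written in Java with java.text.Collator, purely to stay consistent with the earlier examples; it is a stand-in for the idea behind Collator::sort(), not the PHP Intl API itself, and the names (SortDemo, sortLinguistically, sortBinary) are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class SortDemo {
    // Sorts a copy of the list using the collation rules of the locale.
    static List<String> sortLinguistically(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    // Sorts a copy of the list by raw UTF-16 code unit values.
    static List<String> sortBinary(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "\u00C4pfel", "Apfel");
        System.out.println("binary:     " + sortBinary(words));
        System.out.println("linguistic: " + sortLinguistically(words, Locale.GERMAN));
    }
}
```

Binary ordering pushes “Äpfel” after “Zebra” because of its code point value, while the German collator keeps it next to “Apfel”.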
The PHP Intl Library
Collator
Intl
NumberFormatter
Locale
Normalizer
MessageFormatter
IntlDateFormatter
Grapheme
ResourceBundle
IDN
Corrected substring implementation
Formatting Numbers
Resource Bundles
• Externalize strings in your application
• Similar to how desktop applications are built
– One binary and additional language packs
• Similar to Windows™ resource files and Unix® message files
– Structure is different, see ICU resource bundles
• Key/value pairs
– Key is used by the application at run time to display the value
Additional Things
• Change the “default_charset” in php.ini to “utf-8”
• While “mbstring” works well enough for Indic languages, use the more precise “grapheme_*” functions from the Intl library
• “echo” is encoding agnostic
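The reason grapheme-aware functions are more precise is that one user-perceived character can consist of several code points. The sketch below illustrates the idea in Java with java.text.BreakIterator, again only as a stand-in for the PHP “grapheme_*” functions; the names (GraphemeCount, graphemeCount) are hypothetical, and boundary rules for complex scripts can differ between JDK versions.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class GraphemeCount {
    // Counts user-perceived characters by walking character boundaries.
    static int graphemeCount(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String eWithAccent = "e\u0301";  // "e" followed by a combining acute accent
        System.out.println("code units: " + eWithAccent.length());
        System.out.println("graphemes:  " + graphemeCount(eWithAccent));
    }
}
```

A substring or length function that counts code points would cut between the base letter and its combining mark; a grapheme-based one would not.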
Why Intl is better than mbstring?
Resources
• http://www.w3.org/International/
• http://unicode.org/
• http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
• http://pecl.php.net/intl
• http://php.net/manual/en/refs.international.php