Globalization Gotchas
Mark Davis
Unicode Basics
Unicode encodes characters, not glyphs:
U+0067 → g g g g g g g g g g g g g
Unicode does not encode characters by language: French, German, and English j have the same code point even though all have different pronunciations
Chinese 大 (da) has the same code point as Japanese 大 (dai)
UTF-8, UTF-16, and UTF-32 are all Unicode
The word “character” means different things to different people; make clear which one you mean:
glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters), …
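The different meanings of “character” can be made concrete by counting one string several ways. This is a minimal sketch; the helper name character_counts is an assumption for illustration, and grapheme clusters are left out because the Python standard library has no segmenter for them.

```python
def character_counts(text: str) -> dict:
    """Count the same string in several of the senses listed above."""
    return {
        "code_points": len(text),  # a Python str is a sequence of code points
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf32_code_units": len(text.encode("utf-32-le")) // 4,
    }


# "e" + U+0301 COMBINING ACUTE ACCENT, then the supplementary character U+1D160.
sample = "e\u0301\U0001D160"
counts = character_counts(sample)
print(counts)
```

A user would probably see two characters here, yet none of the four counts is 2.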
Unicode in APIs
U+0000 to U+10FFFF: be prepared to handle (or at least not corrupt) any incoming code points
A back-level system may get unassigned code points from later versions
Watch for UCS-2 implementations: they use UTF-16 text but don't support characters above U+FFFF, and they may accidentally cause isolated surrogates
Some APIs/protocols count lengths in code points, others in bytes (or other code units)
Make sure you don't mix them up
Don't limit API parameters to a single character (and definitely not to a single code unit)
What users think of as a single character (e.g. x, ch) may be a sequence in Unicode
Use the latest version of Unicode: it supports new characters, corrections, and more stability guarantees
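How an isolated surrogate can arise is easy to sketch. The function truncate_like_ucs2 below is hypothetical: it mimics a UCS-2-style system that cuts text after a fixed number of 16-bit units without checking for surrogate pairs.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def truncate_like_ucs2(text: str, max_units: int) -> str:
    """Hypothetical UCS-2-style truncation: every 16-bit unit is a 'character'."""
    return "".join(chr(unit) for unit in utf16_units(text)[:max_units])


def has_isolated_surrogate(text: str) -> bool:
    # In a Python str, any code point in U+D800..U+DFFF is an unpaired surrogate.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


supplementary = "a\U0001D160"                    # 1 + 2 UTF-16 code units
broken = truncate_like_ucs2(supplementary, 2)    # cuts the pair in half

try:
    broken.encode("utf-8")
    encodes_cleanly = True
except UnicodeEncodeError:
    encodes_cleanly = False
```

The truncated string can no longer be encoded as well-formed UTF-8, which is the kind of corruption the slide warns about.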
Choice of Characters
Character and block names may be misleading, e.g. U+034F COMBINING GRAPHEME JOINER doesn't join graphemes: http://www.unicode.org/faq
Use U+2060 (word joiner) instead of U+FEFF (zero-width no-break space) for everything but the BOM function
Never use unassigned code points; those will be used in future versions of Unicode
Use only private-use (PUA) code points or noncharacters instead (and only if necessary)
If you do, minimize the opportunity for collision by picking an unusual range
Character Conversion
Always use shortest-form UTF-8: it's the Law
And if that isn’t enough, consider security attacks
If a protocol allows a choice of charsets, always tag correctly
Not all text is correctly tagged; character detection may be necessary. But remember: it's always a guess
Converting a database of mixed, untagged data is extremely painful
Bad assumptions:
Length [bytes] = N × length [code points]
1 character [charset X] = 1 character [Unicode]
The ordering may also be different
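The shortest-form rule can be checked with a strict decoder. The sketch below uses Python's built-in UTF-8 codec, which rejects non-shortest forms; the names OVERLONG_SLASH and decode_strict are assumptions for illustration.

```python
# 0xC0 0xAF is a non-shortest-form ("overlong") encoding of "/" (U+002F).
# A lenient decoder that accepted it could let "/" slip past a path filter.
OVERLONG_SLASH = b"\xc0\xaf"


def decode_strict(data: bytes):
    """Return the decoded text, or None if the bytes are not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


shortest = decode_strict(b"/")             # well-formed
overlong = decode_strict(OVERLONG_SLASH)   # rejected

# Byte length is not a fixed multiple of code point length:
lengths = {ch: len(ch.encode("utf-8")) for ch in ["a", "\u00e9", "\u20ac", "\U0001D160"]}
```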
Character Conversion II
IANA / MIME charset names are ill-defined: vendors often convert the same charset in different ways
Shift-JIS 0x5C → U+005C (\) or U+00A5 (¥)?
Don’t simply omit unconvertible data; to reduce security problems, at least substitute:
U+FFFD (when converting to Unicode), or
0x1A (when converting to bytes)
http://www.w3.org/TR/japanese-xml
http://icu.sourceforge.net/charts/charset
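Substitution instead of omission might look like the following sketch. Python's built-in "replace" handler already yields U+FFFD when decoding, but it uses "?" when encoding, so a custom error handler is registered to emit 0x1A instead. The handler name "substitute_0x1a" is an assumption chosen for this example.

```python
import codecs


def _substitute_0x1a(error):
    """Error handler: replace each unencodable character with 0x1A (SUB)."""
    if isinstance(error, UnicodeEncodeError):
        return ("\x1a" * (error.end - error.start), error.end)
    raise error


# "substitute_0x1a" is a name made up for this sketch.
codecs.register_error("substitute_0x1a", _substitute_0x1a)

# Converting to bytes: the unconvertible "ï" becomes 0x1A rather than vanishing.
to_bytes = "na\u00efve".encode("ascii", errors="substitute_0x1a")

# Converting to Unicode: the invalid byte 0xFF becomes U+FFFD.
to_unicode = b"ab\xffcd".decode("utf-8", errors="replace")
```

Keeping a visible placeholder means two different inputs are less likely to collapse into the same output, which is the security concern behind the advice.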
Properties
Use properties such as Alphabetic, not hard-coded lists
isAlphabetic(x), or regex \p{Alphabetic} or [:Alphabetic:]
Not (“A” ≤ x ≤ “Z” OR “a” ≤ x ≤ “z”)
Some properties aren't what you think. Use:
White_Space, not General_Category=Zs
Alphabetic, not General_Category=L
Lowercase, not General_Category=Ll
Script=Greek, not Block=Greek
Characters may change property values between versions of Unicode
http://unicode.org/standard/stability_policy.html
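A short sketch of why hard-coded ranges and General_Category shortcuts fall short. It uses only Python's standard library, which exposes General_Category but not the Alphabetic property, so str.isalpha() (based on the L* categories) is itself only an approximation, as the last check shows. The helper is_ascii_letter is hypothetical.

```python
import unicodedata


def is_ascii_letter(ch: str) -> bool:
    """The hard-coded range test that the slide advises against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


letters = ["z", "\u00e9", "\u044f", "\u5927"]        # z, é, я, 大
hard_coded = [is_ascii_letter(ch) for ch in letters]
by_category = [ch.isalpha() for ch in letters]

# White_Space is wider than General_Category=Zs: TAB is whitespace but is Cc.
tab_category = unicodedata.category("\t")
tab_is_space = "\t".isspace()

# Alphabetic is wider than General_Category=L: U+2160 ROMAN NUMERAL ONE is
# Alphabetic in the Unicode Character Database, but its category is Nl.
roman_one_category = unicodedata.category("\u2160")
roman_one_isalpha = "\u2160".isalpha()
```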
Identifiers & Tokens
When designing syntax, use as a base:
Pattern_Syntax for operators, relations
Pattern_White_Space for gaps
XID_Start and XID_Continue for identifiers
All backwards compatible across versions
Profiles may expand or narrow from the base
Watch out for security attacks: “paypal.com” with a Cyrillic “а”
See Unicode Security at this conference
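Python's own identifier syntax is built on XID_Start and XID_Continue, so str.isidentifier() gives a quick way to try the idea. The spoofing check below is only a crude sketch: scripts_by_name is a hypothetical helper that guesses the script from the first word of each character name, which is not the real Script property.

```python
import unicodedata


def scripts_by_name(text: str) -> set:
    """Crude script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


identifier_checks = {
    "na\u00efve": "na\u00efve".isidentifier(),   # letters beyond ASCII are allowed
    "2x": "2x".isidentifier(),                   # a digit is not XID_Start
    "a-b": "a-b".isidentifier(),                 # "-" is syntax, not XID_Continue
}

genuine = "paypal"
spoofed = "p\u0430ypal"    # second letter is U+0430 CYRILLIC SMALL LETTER A

genuine_scripts = scripts_by_name(genuine)
spoofed_scripts = scripts_by_name(spoofed)
looks_mixed = len(spoofed_scripts) > 1
```

The two strings usually render identically, but only one of them is all Latin.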
Comparison (Collation)
Searching, Sorting, Matching
There are two binary orders:
code point order = UTF-8 order = UTF-32 order
≠ UTF-16 order
Don’t present users with binary order
No users expect A < Z < a < z < Ç < ä
Apply normalization to get a unique form, so that Å (U+00C5) = Å (U+0041 U+030A)
Security Issues
Protocols must precisely define the comparison operations
E.g. LDAP doesn't, so lookup may fail (or falsely succeed)
Aside from wrong results, this is an opening for security attacks
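The two binary orders can be seen with one BMP character near the top of its range and one supplementary character. This is a small sketch using Python's codecs; the variable names are illustrative.

```python
import unicodedata

pair = ["\U00010000", "\uff5e"]    # U+10000 and U+FF5E

by_code_point = sorted(pair)                                  # Python compares code points
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_utf32 = sorted(pair, key=lambda s: s.encode("utf-32-be"))
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))  # surrogates sort below U+FF5E

# Normalization gives a unique form for canonically equivalent strings.
precomposed = "\u00c5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"       # A + COMBINING RING ABOVE
angstrom = "\u212b"          # ANGSTROM SIGN
nfc_forms = {unicodedata.normalize("NFC", s) for s in (precomposed, decomposed, angstrom)}
```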
Language-Sensitive Comparison
Use UCA order as a base to meet user expectations: a < A < ä < Ç = C + U+0327 < z < Z
Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language:
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < h < ch < i (Slovak)
Follow UCA for substring match offsets: some gotchas here
Don't mix up stable and deterministic sorting: they are very different
http://unicode.org/reports/tr10 http://unicode.org/cldr
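The difference between stable and deterministic sorting can be sketched without a real collator. Here str.casefold stands in for a collation key that treats “China” and “china” as equal; that stand-in is an assumption for illustration and is not UCA.

```python
def stable_sort(items):
    """Stable: items with equal keys keep their input order."""
    return sorted(items, key=str.casefold)


def deterministic_sort(items):
    """Deterministic: ties are broken by code point, so input order is irrelevant."""
    return sorted(items, key=lambda s: (s.casefold(), s))


first_input = ["China", "china", "danish"]
second_input = ["china", "China", "danish"]

stable_results = (stable_sort(first_input), stable_sort(second_input))
deterministic_results = (deterministic_sort(first_input), deterministic_sort(second_input))
```

The stable sort returns two different results for the two inputs; the deterministic sort returns the same result for both.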
Normalization (NFC, …)
Standardized normalized forms, defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context-independent
Don't assume NFC(x + y) = NFC(x) + NFC(y): normalized strings are not closed under concatenation
People assume that NFC always composes, but some characters decompose in NFC
Trivia: in Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ ϔ ẛ
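Each of these normalization points can be checked with Python's unicodedata module. The sketch below picks its own sample characters (U+0958 for a character that decomposes in NFC, U+03D3 for the trivia); the helper name nfc is an assumption.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Concatenation: "a" and U+0301 are each in NFC, but together they compose.
x, y = "a", "\u0301"
joined_then_normalized = nfc(x + y)
normalized_then_joined = nfc(x) + nfc(y)

# NFC does not always compose: U+0958 DEVANAGARI LETTER QA is excluded from
# composition, so NFC leaves it decomposed.
qa_in_nfc = nfc("\u0958")

# Trivia: U+03D3 differs in all four normalization forms.
upsilon_hook_forms = {
    form: unicodedata.normalize(form, "\u03d3")
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
```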
Maximum Expansion (U4.1)
Operation   UTF     Factor  Sample
NFC         8       3X      𝅘𝅥𝅮 U+1D160
NFC         16, 32  3X      שּׁ U+FB2C
NFD         8       3X      ΐ U+0390
NFD         16, 32  4X      ᾂ U+1F82
NFKC, NFKD  8       11X     ﷺ U+FDFA
NFKC, NFKD  16, 32  18X     ﷺ U+FDFA
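The factors in the table can be reproduced by normalizing each sample character and comparing encoded lengths. This is a sketch; the function name expansion_factor is an assumption.

```python
import unicodedata


def expansion_factor(ch: str, form: str, encoding: str) -> float:
    """Encoded length after normalization divided by encoded length before."""
    before = len(ch.encode(encoding))
    after = len(unicodedata.normalize(form, ch).encode(encoding))
    return after / before


factors = {
    ("NFC", "utf-8", "U+1D160"): expansion_factor("\U0001D160", "NFC", "utf-8"),
    ("NFC", "utf-16", "U+FB2C"): expansion_factor("\ufb2c", "NFC", "utf-16-le"),
    ("NFD", "utf-8", "U+0390"): expansion_factor("\u0390", "NFD", "utf-8"),
    ("NFD", "utf-32", "U+1F82"): expansion_factor("\u1f82", "NFD", "utf-32-le"),
    ("NFKC", "utf-8", "U+FDFA"): expansion_factor("\ufdfa", "NFKC", "utf-8"),
    ("NFKD", "utf-16", "U+FDFA"): expansion_factor("\ufdfa", "NFKD", "utf-16-le"),
}
```

A buffer sized for the input can therefore overflow after normalization unless it allows for these worst cases.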
Case Conversion
Not a simple 1:1 mapping
Title case: ǳ ↔ Ǳ ↔ ǲ
Expansion: heiß → HEISS → heiss
Context-dependent: ΌΣΟΣ → όσος
Language-dependent: istanbul ↔ İSTANBUL
Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees
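The first three cases can be tried with Python's built-in string methods, which apply the language-independent full case mappings. A minimal sketch, with code points written as escapes so the exact characters are unambiguous:

```python
# Title case: the digraph U+01F3 has distinct uppercase and titlecase forms.
dz_small = "\u01f3"
dz_upper = dz_small.upper()        # U+01F1
dz_title = dz_small.title()        # U+01F2

# Expansion: one character becomes two, and the round trip is lossy.
heiss_upper = "hei\u00df".upper()
heiss_round_trip = heiss_upper.lower()

# Context dependence: capital sigma lowercases differently at the end of a word.
osos_upper = "\u038c\u03a3\u039f\u03a3"     # ΌΣΟΣ
osos_lower = osos_upper.lower()
```

Language-dependent casing such as the Turkish dotted and dotless i is not covered by these methods, which is what makes them safe for language-independent structures.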
Casing Maximum Expansion
Operation           UTF        Factor  Sample
Lower               8          1.5X    Ⱥ U+023A
Lower               16, 32     1X      A U+0041
Upper, Title, Fold  8, 16, 32  3X      ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 5.0
Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category
These were constrained to be in a partition: each character has exactly one General_Category value, so Ll and Lu cannot cover every lowercase or uppercase character
Use the separate binary properties Lowercase and Uppercase instead
Lowercase, Uppercase: Form vs Function
Lowercase, the binary property:
The character is lowercase in form, but not necessarily in function
Functionally Lowercase:
isCased(x) & isLowercase(x)
See Section 3.13 of TUS
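Lowercase: Form vs Function
LC = Lowercase binary property, F LC = functionally lowercase, Ll = General_Category Lowercase_Letter
LC  F LC  Ll  Count  Examples (U4.1)
Y   N     N   114    ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y   N     Y   705    ª U+00AA FEMININE ORDINAL INDICATOR
Y   Y     N   43     ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y   Y     Y   903    a U+0061 LATIN SMALL LETTER A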
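The “functionally lowercase” test, isCased(x) & isLowercase(x), can be sketched with Python's case mappings. The definitions below follow the default case detection rules as I understand them from Section 3.13 of the Unicode Standard (a string is cased if at least one case conversion of its NFD form changes it); treat that reading, and all function names, as assumptions of this sketch.

```python
import unicodedata


def _nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


def is_lowercase(text: str) -> bool:
    y = _nfd(text)
    return y.lower() == y


def is_uppercase(text: str) -> bool:
    y = _nfd(text)
    return y.upper() == y


def is_titlecase(text: str) -> bool:
    y = _nfd(text)
    return y.title() == y


def is_cased(text: str) -> bool:
    # Cased: at least one of the three case conversions changes the string.
    return not (is_lowercase(text) and is_uppercase(text) and is_titlecase(text))


def functionally_lowercase(text: str) -> bool:
    return is_cased(text) and is_lowercase(text)


# The four example characters from the table above.
results = {
    "U+02E0": functionally_lowercase("\u02e0"),
    "U+00AA": functionally_lowercase("\u00aa"),
    "U+2170": functionally_lowercase("\u2170"),
    "U+0061": functionally_lowercase("a"),
}
```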
Segmentation
What a user thinks of as a character is often a sequence
Words are not just sequences of letters
Lines don’t just break at spaces
All may be language-dependent
http://www.unicode.org/reports/tr14 http://www.unicode.org/reports/tr29
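To see how a user-perceived character can span several code points, here is a deliberately crude approximation that attaches combining marks to the preceding base character. It is not the UAX #29 algorithm: it ignores Hangul jamo, joiner sequences, CR LF, and language tailoring, and the function name is hypothetical.

```python
import unicodedata


def approximate_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # mark: extend the current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters


# "e" + acute accent, then "x" + dot below + dot above: 5 code points.
text = "e\u0301x\u0323\u0307"
clusters = approximate_clusters(text)
```

For production use, a library that implements UAX #29 (ICU, for example) is the safer choice.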
Transliteration
Transliteration: Ελληνικά ↔ Ellēniká
≠ Translation: Ελληνικά ↔ Greek
Transliteration may vary by language:
Путин ↔ Putin, Poutine
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow
Watch for terminology: “lossy” vs “lossless”
Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
In ISO terms, “transliteration” = lossless transliteration
“transcription” = lossy transliteration
http://unicode.org/draft/reports/tr35/tr35.html
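Why the lossy round trip fails can be shown with two toy mapping tables. They cover only the letters of this one word and are invented for illustration; they are not CLDR or ISO transliteration rules.

```python
# Toy tables, just enough for one word. Both eta and iota map to "i",
# and the accent on the final alpha is dropped: that is where information is lost.
TO_LATIN = {
    "\u0395": "E", "\u03bb": "l", "\u03b7": "i", "\u03bd": "n",
    "\u03b9": "i", "\u03ba": "k", "\u03ac": "a",
}
TO_GREEK = {
    "E": "\u0395", "l": "\u03bb", "i": "\u03b9",
    "n": "\u03bd", "k": "\u03ba", "a": "\u03b1",
}


def transliterate(text: str, table: dict) -> str:
    return "".join(table.get(ch, ch) for ch in text)


original = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"   # Ελληνικά
latin = transliterate(original, TO_LATIN)
round_trip = transliterate(latin, TO_GREEK)
```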
Rendering is Contextual
Glyphs may change shape
Multiple characters → 1 glyph
One character → multiple glyphs
Processing character-by-character gives the wrong results
Rendering II
Good rendering systems will handle customary type-in order for text, plus canonical order
Excellent ones will do any canonically-equivalent order, but those are rare
There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished
Security Issues
Never render a missing glyph as “…”
Don't simply overlay diacritics: it can cause security problems
http://www.unicode.org/notes/tn2
http://unicode.org/reports/tr14
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done
Use globalization APIs: formatting and parsing of dates, times, numbers, and currencies, comparison of text, and calendar systems are all locale-dependent
Where OS facilities are not adequate, or cross-platform solutions are needed, use ICU (C, C++, Java)
Don't put any translatable strings into your code; separate them into resource files
Provide context to translators: is “Mark” a noun, a verb, or a name…?
Don’t use the same string in different contexts unless the meaning is identical (including references)
Note: User-Interface language (menus, dialogs, help system) ≠ Data language (body text, spreadsheet cells)
Programs need to handle more languages as data than in the localized UI
Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default)
Don't simply concatenate strings to make messages
Order of components differs by language: use Java MessageFormat, or structure the UI as separate fields
Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet
Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs)
Beware of discrepancies in “fallback” behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc.
http://unicode.org/cldr
http://ibm.com/software/globalization/icu
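The concatenation problem can be sketched with named placeholders, the same idea that MessageFormat provides in Java. The MESSAGES table and the render helper are hypothetical, the strings would normally live in resource files, and plural handling is ignored to keep the sketch short.

```python
# One complete pattern per language; the translator controls word order.
MESSAGES = {
    "en": "{user} deleted {count} files",
    "de": "{count} Dateien wurden von {user} gelöscht",
}


def render(locale_id: str, **params) -> str:
    return MESSAGES[locale_id].format(**params)


english = render("en", user="Mark", count=3)
german = render("de", user="Mark", count=3)

# The fragile alternative: fragments glued in a fixed, English-only order.
concatenated = "Mark" + " deleted " + str(3) + " files"
```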
Neutral Formats
Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as close to the user as possible
Type             Example              Rec. Standard
Language/Locale  en-US (en_US)        RFC 3066 bis, CLDR
Territory        AU                   RFC 3066 bis
Currency         EUR                  ISO 4217
Timezone         Australia/Melbourne  TZDB
Calendar         islamic-civil        CLDR Calendar ID
Custom Date      yyyy-MMM-dd          CLDR Pattern Format
Binary Time      8C80E9E3967A4B0      Windows File Time
Identification
Locale IDs are extensions of language IDs; use CLDR: http://unicode.org/cldr
Don't assume that everyone in a country always uses that country’s currency. Always use an explicit currency ID (ISO 4217):
<RUR, 1.23457×10³> ↔ 1 234,57р. in Russian
but Rub 1,234.57 in English
Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names: http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g. from browser settings), make sure the user can override that and pick an explicit value
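Keeping the amount and the currency ID together in a neutral form, and formatting only at display time, might be sketched as follows. The Money class and format_amount helper are hypothetical, and the separators are passed in by hand; a real application would take them from CLDR data (for example through ICU).

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    """Neutral form: explicit ISO 4217 code plus an exact decimal amount."""
    currency: str
    amount: Decimal


def format_amount(amount: Decimal, group_sep: str, decimal_sep: str) -> str:
    """Render with two decimals, using the given (assumed) separators."""
    text = f"{amount:,.2f}"                      # e.g. "1,234.57"
    # Swap separators via a placeholder so the two replacements cannot collide.
    return text.replace(",", "\0").replace(".", decimal_sep).replace("\0", group_sep)


stored = Money(currency="RUR", amount=Decimal("1234.57"))

russian_style = format_amount(stored.amount, group_sep=" ", decimal_sep=",")
english_style = format_amount(stored.amount, group_sep=",", decimal_sep=".")
```

The stored value never changes; only the presentation differs per user.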
Unicode Guide
Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other Resources
Unicode Site: http://unicode.org
An Overview of ICU: http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software: http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization: http://www.w3.org/International
Microsoft Global Software Development: http://www.microsoft.com/globaldev/default.asp
Q&A
Backup Slides
User Input
If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean
If you are using type-ahead to get to a position in a list (e.g. typing “Jo” gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields
If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers
Dotted and Dotless I
Normal Uppercase      Lowercase              Turkic Uppercase
I (0049)            ← i (0069)             → İ (0130)
I (0049)            ← ı (0131)             → I (0049)
I + ˙ (0049 0307)   ← i + ˙ (0069 0307)    → İ + ˙ (0130 0307)
İ (0130) → i + ˙ (0069 0307) in normal lowercasing
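Python's str.upper() and str.lower() apply only the language-independent mappings, so the Turkic column has to be tailored by hand. The two helpers below are a simplified, hypothetical sketch: they handle the four single letters but not the I + U+0307 sequence.

```python
def turkic_upper(text: str) -> str:
    """Uppercase with Turkic tailoring: i -> İ (U+0130); ı -> I by default."""
    return text.replace("i", "\u0130").upper()


def turkic_lower(text: str) -> str:
    """Lowercase with Turkic tailoring: İ -> i and I -> ı (U+0131)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_mappings = {
    "upper(i)": "i".upper(),
    "upper(U+0131)": "\u0131".upper(),
    "lower(U+0130)": "\u0130".lower(),    # i + COMBINING DOT ABOVE
}

turkic_city_upper = turkic_upper("istanbul")
turkic_city_lower = turkic_lower("ISTANBUL")
```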
Java
In MessageFormat, watch for words like “can't”, since the ASCII apostrophe has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t
In Date and Calendar, the months are numbered from 0 (February is month number 1). However, weeks and days are numbered from 1
Java serialized text isn't UTF-8, though it's close: U+0000 and supplementary code points are encoded differently
Java globalization support is pretty outdated; use ICU to supplement it
Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details
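The two differences in Java's serialized text can be sketched in a few lines. This models the “modified UTF-8” byte sequences as I understand them (U+0000 as 0xC0 0x80, supplementary code points as two separately encoded surrogates); the function name is hypothetical, and the length prefix that Java writes before the bytes is left out.

```python
def java_modified_utf8(text: str) -> bytes:
    """Sketch of Java's modified UTF-8 byte sequences (no length prefix)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"              # U+0000 as a two-byte form
        elif cp > 0xFFFF:
            cp -= 0x10000                   # split into a UTF-16 surrogate pair
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", errors="surrogatepass")
        else:
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)


nul_bytes = java_modified_utf8("\x00")
supplementary_bytes = java_modified_utf8("\U00010000")
bmp_bytes = java_modified_utf8("\u00e9")
```

For everything in the BMP except U+0000 the result matches standard UTF-8, which is why the difference is easy to miss in testing.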
JavaScript
Always encode characters above U+007F with escapes (\uXXXX)
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented
The JDK tool native2ascii can be used to convert the files to use escapes
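What such a conversion does can be sketched as below. The function escape_non_ascii is hypothetical and only illustrates the idea; it writes one \uXXXX escape per UTF-16 code unit, so supplementary characters become a surrogate pair of escapes.

```python
def escape_non_ascii(source: str) -> str:
    """Replace every character above U+007F with \\uXXXX escapes."""
    out = []
    for ch in source:
        cp = ord(ch)
        if cp <= 0x7F:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            cp -= 0x10000                  # encode as a UTF-16 surrogate pair
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04x}\\u{low:04x}")
    return "".join(out)


escaped_bmp = escape_non_ascii('var s = "caf\u00e9";')
escaped_supplementary = escape_non_ascii("\U0001D160")
```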
- Globalization Gotchas
-
Unicode in APIsU+0000 to U+10FFFF Be prepared to handle (at least not corrupt) any incoming code points
A back-level system may get unassigned code points from later versions
Watch for UCS-2 implementations They use UTF-16 text but dont support characters above U+FFFF they also may accidentally cause isolated surrogates
Some APIsprotocols will count lengths in code points and others in bytes (or other code units)
Make sure you dont mix them up
Dont limit API parameters to a single character (and definitely not to a single code unit)
What users think of as a single character (eg x ch) may be a sequence in Unicode
Use the latest version of Unicode supports new characters corrections more stability guarantees
Choice of CharactersCharacter and block names may be misleading eg
U+034F COMBINING GRAPHEME JOINER doesnt join graphemes httpwwwunicodeorgfaq
Use U+2060 (word joiner) instead of U+FEFF (zero-width nobreak space) for everything but the BOM function
Never use unassigned code points those will be used in future versions of Unicode
Only use private use (PUA) or non-characters (and only if necessary)
If you do minimize the opportunity for collision by picking an unusual range
Character Conversion
Always use shortest form UTF-8Its the Law
And if that isnrsquot enough consider security attacks
If a protocol allows a choice of charsets always tag correctly
Not all text is correctly tagged character detection may be necessary But remember its always a guess
Converting a database of mixed untagged data is extremely painful
Bad assumptionsLength [bytes] = N length [code points]
1 character [charset X] = 1 character [Unicode]The ordering may also be different
Character Conversion II
IANA MIME charset names are ill-defined vendors often convert same charset different ways
Shift-JIS 0x5C rarr U+005C () or U+00A5 (yen)
Donrsquot simply omit unconvertable data to reduce security problems at least substitute
U+FFFD (when converting to Unicode) or
0x1A (when converting to bytes)
httpwwww3orgTRjapanese-xml
httpicusourceforgenetchartscharset
Properties Use properties such as Alphabetic not hard-coded lists
isAlphabetic(x) regex pAlphabetic or [Alphabetic]Not (ldquoArdquo le x le ldquoZrdquo OR ldquoardquo le x le ldquozrdquo)
Some properties arent what you think useWhite_Space not General_Category=Zs
Alphabetic not General_Category=L
Lowercase not General_Category=Ll
Script=Greek not Block=Greek
Characters may change property values between versions of Unicode
httpunicodeorgstandardstability_policyhtml
Identifiers amp Tokens
When designing syntax use as a basePattern_Syntax for operators relations
Pattern_Whitespace for gaps
XID_Start and XID_Continue for identifiers
All backwards compatible across versions
Profiles may expand or narrow from the base
Watch out for security attacksldquopaypalcomrdquo with a Cyrillic ldquoardquo
See Unicode Security at this conference
Comparison (Collation)Searching Sorting
MatchingThere are two binary orders
code point order = UTF-8 order = UTF-32 order
ne UTF16 order
Donrsquot present users with binary order
No users expect A lt Z lt a lt z lt Ccedil lt auml
Apply normalization to get a unique form so Aring = Aring
Security Issues Protocols must precisely define the comparison operations
Eg LDAP doesnt so lookup may fail (or falsely succeed)
Aside from wrong results opening for security attacks
Language-Sensitive Comparison
Use UCA Order as a base to meet user-expectationsa lt A lt auml lt Ccedil = C_ lt z lt Z
Real language-sensitive order requires tailoring on top of UCA ordering depends on context and language
china lt China lt chinas lt danish
ae lt aelig lt af
z lt aelig (Danish)
c lt d lt h lt ch lt i (Slovak)
Follow UCA for substring match offsets ndash some gotchas here
Dont mix up stable and deterministic sorting they are very different
httpunicodeorgreportstr10 httpunicodeorgcldr
Normalization (NFChellip)
Standardized normalized forms defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context independent
Dont assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes but some characters decompose in NFC
Trivia In Unicode 41 there are exactly 3 characters that are different in all 4 normalization forms ϓ ϔ ẛ
Maximum Expansion (U41)
Operation UTF Factor Sample
NFC8 3X 119136 U+1D160
16 32 3X ש U+FB2C
NFD8 3X ΐ U+0390
16 32 4X ᾂ U+1F82
NFKC NFKD8 11X ملسو هيلع هللا ىلص U+FDFA
16 32 18X
Case Conversion
Not a simple 11 mapping
Title case dz harr DZ harr Dz
Expansion heiszlig rarr HEISS rarr heiss
Context-dependent ΌΣΟΣ rarr όσος
Language-dependent istanbul harr İSTANBUL
Warning never use language-dependent casing for language-independent structures like file-system B-Trees
Casing Maximum Expansion
Operation UTFFacto
rSample
Lower
8 15X Ⱥ U+023A
16 32 1X A U+0041
Upper Title Fold
8 16 32
3X ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 50
Dont use the Lowercase_Letter (Ll) or Uppercase_Letter (Lt) of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Lowercase UppercaseForm vs Function
Lowercase the binary property
The character is lowercase in formbut not necessarily in function
Functionally Lowercase
isCased(x) amp isLowercase(x)
See Section 313 of TUS
Lowercase Form vs Function
LC F LC Ll
Count
Examples(U41
)
Y
NN 114 ˠ U+02E
0MODIFIER LETTER SMALL GAMMA
Y 705 ordf U+00AA
FEMININE ORDINAL INDICATOR
YN 43 ⅰ U+217
0SMALL ROMAN NUMERAL ONE
Y 903 a U+0061
LATIN SMALL LETTER A
Segmentation
What a user thinks of as a characters is often a sequence
Words are not just sequences of letters
Lines donrsquot just break at spaces
All may be language-dependent
httpwwwunicodeorgreportstr14 httpwwwunicodeorgreportstr29
TransliterationTransliteration Ελληνικά harr Ellēnikaacutene Translation Ελληνικά harr Greek
Transliteration may vary by language
Путин harr Putin Poutine
Горбачёв harr Gorbachev Gorbacev Gorbatchev Gorbačeumlv Gorbachov Gorbatsov Gorbatschow
Watch for terminology ldquolossyrdquo vs ldquolosslessrdquoLossy transliteration Ελληνικά rarr Ellinika rarr Ελλινικα
In ISO terms ldquotransliterationrdquo = lossless transliteration
ldquotranscriptionrdquo = lossy transliteration
httpunicodeorgdraftreportstr35tr35html
Rendering is Contextual
Glyphs may change shape
Multiple characters rarr 1 glyph
One character rarr multiple glyphs
Processing character-by-character gives the wrong results
Rendering IIGood rendering systems will handle customary type-in order for text plus canonical order
Excellent ones will do any canonically-equivalent order but those are rare
There may be differences in the customary glyphs for different languages specify the font or the language where they have to be distinguished
Security IssuesNever render a missing glyph as ldquo
Dont simply overlay diacritics it can cause security problems
httpwwwunicodeorgnotestn2
httpunicodeorgreportstr14
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there’s more work to be done
Use globalization APIs: formatting and parsing of dates, times, numbers, and currencies, comparison of text, and calendar systems are all locale-dependent
Where OS facilities are not adequate, or cross-platform solutions are needed, use ICU (C, C++, Java)
Don’t put any translatable strings into your code: separate them into resource files
Provide context to translators: is “Mark” a noun, a verb, or a name…
Don’t use the same string in different contexts unless the meaning is identical (including references)
Note: User-Interface language (menus, dialogs, help system) ≠ Data language (body text, spreadsheet cells)
Programs need to handle more languages as data than they offer in the localized UI
Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default)
Don’t simply concatenate strings to make messages
Order of components differs by language: use Java MessageFormat, or structure the UI as separate fields
Don’t assume icons and symbols mean the same around the world
Don’t assume everyone can read the Latin alphabet
Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs)
Beware of discrepancies in “fallback” behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP
http://unicode.org/cldr/
http://ibm.com/software/globalization/icu
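The concatenation point is easiest to see with whole-sentence templates and named placeholders, so that each translation can reorder the parts. The sketch below uses Python's `str.format` as a stand-in for a message-formatting API; the `TEMPLATES` table and `render` helper are hypothetical, the strings would normally live in resource files, and plural handling is left out.

```python
# Hypothetical message catalog: one complete sentence per language,
# with named placeholders instead of concatenated fragments.
TEMPLATES = {
    "en": "{user} deleted {count} files",
    "de": "{count} Dateien wurden von {user} gelöscht",
}


def render(language: str, **fields: object) -> str:
    """Fill the whole-sentence template for the given language."""
    return TEMPLATES[language].format(**fields)


print(render("en", user="Ana", count=3))
print(render("de", user="Ana", count=3))
```

With concatenation (`user + " deleted " + count + " files"`) the German word order could not be expressed without changing code.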
Neutral Formats
Store and transmit neutral-format data wherever possible
Convert that data to the user’s preferred formats as close to the user as possible
Type             Example              Rec. Standard
Language/Locale  en-US (en_US)        RFC 3066 bis, CLDR
Territory        AU                   RFC 3066 bis
Currency         EUR                  ISO 4217
Timezone         Australia/Melbourne  TZDB
Calendar         islamic-civil        CLDR Calendar ID
Custom Date      yyyy-mmm-dd          CLDR Pattern Format
Binary Time      8C80E9E3967A4B0      Windows File Time
Identification
Locale IDs are extensions of language IDs: use CLDR http://unicode.org/cldr/
Don’t assume that everyone in a country always uses that country’s currency. Always use an explicit currency ID (ISO 4217)
<RUR, 1.23457×10³> ↔ 1 234,57р. in Russian
but Rub 1,234.57 in English
Don’t assume the timezone ID is implied by the user’s locale. For the best timezone information, use the TZ database; use CLDR for timezone names http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g. from browser settings), make sure the user can override that and pick an explicit value
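A small sketch of the "neutral format, late conversion" idea for the currency case: the stored value carries an explicit ISO 4217 code and an exact decimal amount, and only the display step is language-specific. Everything here is illustrative: `Money`, `DISPLAY_NAMES`, and `format_for_display` are hypothetical names, and the two hand-written formats stand in for real locale data such as CLDR.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    currency: str    # explicit ISO 4217 code, never implied by the locale
    amount: Decimal  # neutral, locale-independent value


# Toy display data for two UI languages (an assumption, not real locale data).
DISPLAY_NAMES = {("RUR", "ru"): "р.", ("RUR", "en"): "Rub"}


def format_for_display(value: Money, ui_language: str) -> str:
    """Convert the neutral value to a user-facing string as late as possible."""
    grouped = format(value.amount, ",.2f")  # e.g. '1,234.57'
    name = DISPLAY_NAMES.get((value.currency, ui_language), value.currency)
    if ui_language == "ru":
        # Space as grouping separator, comma as decimal separator, symbol last.
        return grouped.replace(",", " ").replace(".", ",") + name
    return f"{name} {grouped}"


stored = Money("RUR", Decimal("1234.57"))
print(format_for_display(stored, "ru"))
print(format_for_display(stored, "en"))
```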
Unicode Guide
Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other Resources
Unicode Site
http://unicode.org
An Overview of ICU
http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software
http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization
http://www.w3.org/International/
Microsoft Global Software Development
http://www.microsoft.com/globaldev/default.asp
Q&A
Backup Slides
User Input
If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean
If you are using type-ahead to get to a position in a list (e.g. typing “Jo” gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields
If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers
Dotted and Dotless I
[Diagram: uppercase and lowercase mappings, normal vs. Turkic, among i (0069), I (0049), ı (0131), İ (0130) and the sequences i + ˙ (0069 0307), I + ˙ (0049 0307), İ + ˙ (0130 0307)]
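Python's built-in case methods are language-independent, so the Turkic behaviour has to be layered on top. The helpers below are a hypothetical sketch (`turkic_upper` and `turkic_lower` are made-up names) that handles only the four single characters; it does not treat sequences containing U+0307 COMBINING DOT ABOVE specially, which a full implementation such as ICU does.

```python
def turkic_upper(text: str) -> str:
    """Uppercase with Turkish/Azerbaijani rules: i -> İ, ı -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkic_lower(text: str) -> str:
    """Lowercase with Turkish/Azerbaijani rules: I -> ı, İ -> i."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


default_upper = "istanbul".upper()      # language-independent mapping
turkic = turkic_upper("istanbul")       # dotted capital İ at the start
default_lower_of_dotted = "\u0130".lower()  # i followed by U+0307

print(default_upper, turkic, [hex(ord(c)) for c in default_lower_of_dotted])
```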
Java
In MessageFormat, watch for words like “can't”, since the ASCII apostrophe (') has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t
In Date and Calendar, the months are numbered from 0 (February is month number 1). However, weeks and days are numbered from 1
Java serialized text isn’t UTF-8, though it’s close: U+0000 and supplementary code points are encoded differently
Java globalization support is pretty outdated: use ICU to supplement it
Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details
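The difference between Java's serialized text and UTF-8 can be reproduced in a few lines. This sketch follows the "modified UTF-8" rules as I understand them (U+0000 becomes the two bytes C0 80, and each supplementary code point is written as two 3-byte encoded surrogates); `java_modified_utf8` is a hypothetical helper and it omits the length prefix that Java's serialization adds.

```python
def java_modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 does (payload only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            out += b"\xc0\x80"  # two-byte form, so no 0x00 byte appears
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each half separately.
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


note = "\U0001D160"
print(note.encode("utf-8").hex())      # standard UTF-8: 4 bytes
print(java_modified_utf8(note).hex())  # modified UTF-8: 6 bytes
```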
JavaScript
Always encode characters above U+007F with escapes (\uxxxx)
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented
The JDK tool native2ascii can be used to convert the files to use escapes
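For illustration, the escaping step can be sketched in Python as well. `to_js_escapes` is a hypothetical helper, not a replacement for native2ascii: it rewrites every character above U+007F as one `\uXXXX` escape per UTF-16 code unit, so supplementary characters become a surrogate pair of escapes.

```python
def to_js_escapes(text: str) -> str:
    """Replace characters above U+007F with \\uXXXX escapes (UTF-16 code units)."""
    out = []
    for ch in text:
        if ord(ch) <= 0x7F:
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")  # 2 bytes per code unit, no BOM
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(to_js_escapes("caf\u00e9"))
print(to_js_escapes("\U0001D160"))
```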
Character Conversion
Always use shortest-form UTF-8
It’s the Law
And if that isn’t enough, consider security attacks
If a protocol allows a choice of charsets, always tag correctly
Not all text is correctly tagged: character detection may be necessary. But remember, it’s always a guess
Converting a database of mixed, untagged data is extremely painful
Bad assumptions
Length [bytes] = N * length [code points]
1 character [charset X] = 1 character [Unicode]
The ordering may also be different
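Two of these points are easy to check interactively. The sketch below counts the same string in code points, UTF-8 bytes, and UTF-16 code units, and shows that a conforming decoder (here Python's) rejects a non-shortest-form sequence; `is_valid_utf8` is a hypothetical helper name.

```python
text = "na\u00efve \U0001D160"

code_points = len(text)
utf8_bytes = len(text.encode("utf-8"))
utf16_units = len(text.encode("utf-16-le")) // 2
print(code_points, utf8_bytes, utf16_units)  # three different lengths for one string


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# C0 AF is an overlong (non-shortest) encoding of "/", a classic attack vector.
print(is_valid_utf8(b"\xc0\xaf"), is_valid_utf8(b"/"))
```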
Character Conversion II
IANA/MIME charset names are ill-defined: vendors often convert the same charset in different ways
Shift-JIS 0x5C → U+005C (\) or U+00A5 (¥)
Don’t simply omit unconvertible data; to reduce security problems, at least substitute
U+FFFD (when converting to Unicode) or
0x1A (when converting to bytes)
http://www.w3.org/TR/japanese-xml/
http://icu.sourceforge.net/charts/charset/
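A sketch of the substitution advice using Python's codec error handlers. Decoding with `errors="replace"` inserts U+FFFD; for encoding, Python's built-in `replace` handler emits `?`, so the example registers a custom handler that emits 0x1A instead. The handler name `sub_1a` is made up for this example.

```python
import codecs


def _substitute_1a(err: UnicodeError):
    """Replace each unencodable character with the SUB control (0x1A)."""
    if isinstance(err, UnicodeEncodeError):
        return ("\x1a" * (err.end - err.start), err.end)
    raise err


codecs.register_error("sub_1a", _substitute_1a)

raw = b"ab\xffcd"                                # 0xFF is never valid in UTF-8
omitted = raw.decode("utf-8", errors="ignore")   # silently drops data: risky
replaced = raw.decode("utf-8", errors="replace") # marks the damage with U+FFFD
to_bytes = "a\u20acb".encode("ascii", errors="sub_1a")

print(repr(omitted), repr(replaced), to_bytes)
```

Dropping the bad byte turns `ab?cd` into `abcd`, which is how filtered strings can be reassembled by an attacker; the substitute keeps the gap visible.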
Properties
Use properties such as Alphabetic, not hard-coded lists
isAlphabetic(x), regex \p{Alphabetic} or [:Alphabetic:]
Not (“A” ≤ x ≤ “Z” OR “a” ≤ x ≤ “z”)
Some properties aren’t what you think; use
White_Space, not General_Category=Zs
Alphabetic, not General_Category=L
Lowercase, not General_Category=Ll
Script=Greek, not Block=Greek
Characters may change property values between versions of Unicode
http://unicode.org/standard/stability_policy.html
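The gap between hard-coded ranges, general categories, and the broader properties shows up even with Python's standard library. Note the assumptions: `str.isalpha()` is based on the letter categories rather than the Alphabetic property, and `str.isspace()` only approximates White_Space, so this is an illustration of the mismatch rather than a property API. `ascii_is_letter` is a hypothetical helper.

```python
import unicodedata


def ascii_is_letter(ch: str) -> bool:
    """The hard-coded range check the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


e_acute = "\u00e9"
print(ascii_is_letter(e_acute), e_acute.isalpha())  # range check misses a letter

tab = "\t"
print(unicodedata.category(tab), tab.isspace())     # whitespace, but not category Zs

# U+0345 COMBINING GREEK YPOGEGRAMMENI has the Alphabetic property,
# yet its General_Category is Mn, so a category-L test rejects it.
ypogegrammeni = "\u0345"
print(unicodedata.category(ypogegrammeni), ypogegrammeni.isalpha())
```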
Identifiers & Tokens
When designing syntax, use as a base
Pattern_Syntax for operators, relations
Pattern_White_Space for gaps
XID_Start and XID_Continue for identifiers
All backwards compatible across versions
Profiles may expand or narrow from the base
Watch out for security attacks
“paypal.com” with a Cyrillic “a”
See Unicode Security at this conference
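The spoofing example can be reproduced directly. The sketch below uses a crude stand-in for the Script property (the first word of each character's Unicode name), since Python's standard library does not expose scripts; `scripts_used` is a hypothetical helper and real checks should follow the Unicode security recommendations.

```python
import unicodedata


def scripts_used(text: str) -> set[str]:
    """Rough script detection: first word of each letter's character name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(genuine == spoofed)     # the strings differ, although they may look identical
print(scripts_used(genuine))
print(scripts_used(spoofed))  # mixed scripts are a warning sign
print(spoofed.isidentifier()) # still a syntactically valid identifier
```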
Comparison (Collation)
Searching, Sorting, Matching
There are two binary orders
code point order = UTF-8 order = UTF-32 order
≠ UTF-16 order
Don’t present users with binary order
No users expect A < Z < a < z < Ç < ä
Apply normalization to get a unique form, so Å = Å
Security Issues
Protocols must precisely define the comparison operations
E.g. LDAP doesn’t, so lookup may fail (or falsely succeed)
Aside from wrong results, this is an opening for security attacks
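Both claims can be checked with two strings. The sketch sorts a BMP character near the top of the range and a supplementary character by code point, by UTF-8 bytes, and by UTF-16 code units, then shows that normalization makes the two spellings of Å compare equal. Variable names are illustrative only.

```python
import unicodedata

high_bmp = "\uFFFD"            # encoded as the single UTF-16 unit FFFD
supplementary = "\U00010000"   # encoded as the surrogate pair D800 DC00
pair = [supplementary, high_bmp]

by_code_point = sorted(pair)
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))
print(by_code_point == by_utf8, by_code_point == by_utf16)

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # A + COMBINING RING ABOVE
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
print(precomposed == decomposed, same_after_nfc)
```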
Language-Sensitive Comparison
Use UCA order as a base to meet user expectations
a < A < ä < Ç = Ç < z < Z
Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < h < ch < i (Slovak)
Follow UCA for substring match offsets – some gotchas here
Don’t mix up stable and deterministic sorting: they are very different
http://unicode.org/reports/tr10/ http://unicode.org/cldr/
Normalization (NFC…)
Standardized normalized forms defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context-dependent: the result for a piece of text can depend on what is next to it
Don’t assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes, but some characters decompose in NFC
Trivia: In Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ ϔ ẛ
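These three statements can be verified with Python's `unicodedata` module. The sample characters for the first two checks (a + combining acute, and U+0958 DEVANAGARI LETTER QA) are my own choices, not taken from the slides; the last check uses the three characters from the trivia line.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# 1. Concatenation: normalizing the parts separately leaves them uncomposed.
x, y = "a", "\u0301"
joined_then_normalized = nfc(x + y)      # single precomposed character
normalized_then_joined = nfc(x) + nfc(y) # still two code points
print(len(joined_then_normalized), len(normalized_then_joined))

# 2. NFC can decompose: U+0958 is excluded from composition.
qa_nfc = nfc("\u0958")
print([f"U+{ord(c):04X}" for c in qa_nfc])

# 3. The three characters that differ in all four normalization forms.
forms = ("NFC", "NFD", "NFKC", "NFKD")
distinct_counts = [
    len({unicodedata.normalize(f, ch) for f in forms})
    for ch in "\u03d3\u03d4\u1e9b"
]
print(distinct_counts)
```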
Maximum Expansion (U4.1)
Operation   UTF     Factor  Sample
NFC         8       3X      𝅘𝅥𝅮 U+1D160
            16, 32  3X      שּׁ U+FB2C
NFD         8       3X      ΐ U+0390
            16, 32  4X      ᾂ U+1F82
NFKC, NFKD  8       11X     ﷺ U+FDFA
            16, 32  18X
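The factors in the table can be recomputed for the sample characters. `expansion` is a hypothetical helper that compares encoded lengths before and after normalization; it checks only the listed samples and does not prove that they are the maxima.

```python
import unicodedata


def expansion(ch: str, form: str, encoding: str) -> float:
    """Ratio of encoded length after normalization to encoded length before."""
    before = len(ch.encode(encoding))
    after = len(unicodedata.normalize(form, ch).encode(encoding))
    return after / before


# The "-le" codec variants are used so that no byte order mark is counted.
rows = [
    ("NFC", "utf-8", "\U0001D160"),
    ("NFC", "utf-16-le", "\uFB2C"),
    ("NFD", "utf-8", "\u0390"),
    ("NFD", "utf-16-le", "\u1F82"),
    ("NFKC", "utf-8", "\uFDFA"),
    ("NFKC", "utf-32-le", "\uFDFA"),
]
for form, encoding, ch in rows:
    print(form, encoding, f"U+{ord(ch):04X}", expansion(ch, form, encoding))
```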
Case Conversion
Not a simple 1:1 mapping
Title case: ǳ ↔ Ǳ ↔ ǲ
Expansion: heiß → HEISS → heiss
Context-dependent: ΌΣΟΣ → όσος
Language-dependent: istanbul ↔ İSTANBUL
Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees
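Most of these cases can be observed with Python's built-in, language-independent case methods; the variable names below are illustrative. The language-dependent case is the exception: Python has no locale-aware casing in its standard library, so `istanbul` uppercases with a dotless capital I.

```python
dz_upper = "\u01f3".upper()   # ǳ -> Ǳ
dz_title = "\u01f3".title()   # ǳ -> ǲ (a distinct titlecase character)

expanded = "hei\u00df".upper()   # one character becomes two
round_trip = expanded.lower()    # the original ß is not recovered

# Final sigma: the last Σ lowercases to ς, the medial one to σ.
greek_lower = "\u038c\u03a3\u039f\u03a3".lower()

default_istanbul = "istanbul".upper()  # language-independent result

print(dz_upper, dz_title, expanded, round_trip, greek_lower, default_istanbul)
```

Because the default mappings ignore language, they are the ones to use for file systems, indexes, and other language-independent structures.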
Casing Maximum Expansion
Operation           UTF        Factor  Sample
Lower               8          1.5X    Ⱥ U+023A
                    16, 32     1X      A U+0041
Upper, Title, Fold  8, 16, 32  3X      ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 5.0
Don’t use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Character Conversion II
IANA MIME charset names are ill-defined vendors often convert same charset different ways
Shift-JIS 0x5C rarr U+005C () or U+00A5 (yen)
Donrsquot simply omit unconvertable data to reduce security problems at least substitute
U+FFFD (when converting to Unicode) or
0x1A (when converting to bytes)
httpwwww3orgTRjapanese-xml
httpicusourceforgenetchartscharset
Properties Use properties such as Alphabetic not hard-coded lists
isAlphabetic(x) regex pAlphabetic or [Alphabetic]Not (ldquoArdquo le x le ldquoZrdquo OR ldquoardquo le x le ldquozrdquo)
Some properties arent what you think useWhite_Space not General_Category=Zs
Alphabetic not General_Category=L
Lowercase not General_Category=Ll
Script=Greek not Block=Greek
Characters may change property values between versions of Unicode
httpunicodeorgstandardstability_policyhtml
Identifiers amp Tokens
When designing syntax use as a basePattern_Syntax for operators relations
Pattern_Whitespace for gaps
XID_Start and XID_Continue for identifiers
All backwards compatible across versions
Profiles may expand or narrow from the base
Watch out for security attacksldquopaypalcomrdquo with a Cyrillic ldquoardquo
See Unicode Security at this conference
Comparison (Collation)Searching Sorting
MatchingThere are two binary orders
code point order = UTF-8 order = UTF-32 order
ne UTF16 order
Donrsquot present users with binary order
No users expect A lt Z lt a lt z lt Ccedil lt auml
Apply normalization to get a unique form so Aring = Aring
Security Issues Protocols must precisely define the comparison operations
Eg LDAP doesnt so lookup may fail (or falsely succeed)
Aside from wrong results opening for security attacks
Language-Sensitive Comparison
Use UCA Order as a base to meet user-expectationsa lt A lt auml lt Ccedil = C_ lt z lt Z
Real language-sensitive order requires tailoring on top of UCA ordering depends on context and language
china lt China lt chinas lt danish
ae lt aelig lt af
z lt aelig (Danish)
c lt d lt h lt ch lt i (Slovak)
Follow UCA for substring match offsets ndash some gotchas here
Dont mix up stable and deterministic sorting they are very different
httpunicodeorgreportstr10 httpunicodeorgcldr
Normalization (NFChellip)
Standardized normalized forms defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context independent
Dont assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes but some characters decompose in NFC
Trivia In Unicode 41 there are exactly 3 characters that are different in all 4 normalization forms ϓ ϔ ẛ
Maximum Expansion (U41)
Operation UTF Factor Sample
NFC8 3X 119136 U+1D160
16 32 3X ש U+FB2C
NFD8 3X ΐ U+0390
16 32 4X ᾂ U+1F82
NFKC NFKD8 11X ملسو هيلع هللا ىلص U+FDFA
16 32 18X
Case Conversion
Not a simple 11 mapping
Title case dz harr DZ harr Dz
Expansion heiszlig rarr HEISS rarr heiss
Context-dependent ΌΣΟΣ rarr όσος
Language-dependent istanbul harr İSTANBUL
Warning never use language-dependent casing for language-independent structures like file-system B-Trees
Casing Maximum Expansion
Operation UTFFacto
rSample
Lower
8 15X Ⱥ U+023A
16 32 1X A U+0041
Upper Title Fold
8 16 32
3X ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 50
Dont use the Lowercase_Letter (Ll) or Uppercase_Letter (Lt) of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Lowercase UppercaseForm vs Function
Lowercase the binary property
The character is lowercase in formbut not necessarily in function
Functionally Lowercase
isCased(x) amp isLowercase(x)
See Section 313 of TUS
Lowercase Form vs Function
LC F LC Ll
Count
Examples(U41
)
Y
NN 114 ˠ U+02E
0MODIFIER LETTER SMALL GAMMA
Y 705 ordf U+00AA
FEMININE ORDINAL INDICATOR
YN 43 ⅰ U+217
0SMALL ROMAN NUMERAL ONE
Y 903 a U+0061
LATIN SMALL LETTER A
Segmentation
What a user thinks of as a characters is often a sequence
Words are not just sequences of letters
Lines donrsquot just break at spaces
All may be language-dependent
httpwwwunicodeorgreportstr14 httpwwwunicodeorgreportstr29
TransliterationTransliteration Ελληνικά harr Ellēnikaacutene Translation Ελληνικά harr Greek
Transliteration may vary by language
Путин harr Putin Poutine
Горбачёв harr Gorbachev Gorbacev Gorbatchev Gorbačeumlv Gorbachov Gorbatsov Gorbatschow
Watch for terminology ldquolossyrdquo vs ldquolosslessrdquoLossy transliteration Ελληνικά rarr Ellinika rarr Ελλινικα
In ISO terms ldquotransliterationrdquo = lossless transliteration
ldquotranscriptionrdquo = lossy transliteration
httpunicodeorgdraftreportstr35tr35html
Rendering is Contextual
Glyphs may change shape
Multiple characters rarr 1 glyph
One character rarr multiple glyphs
Processing character-by-character gives the wrong results
Rendering IIGood rendering systems will handle customary type-in order for text plus canonical order
Excellent ones will do any canonically-equivalent order but those are rare
There may be differences in the customary glyphs for different languages specify the font or the language where they have to be distinguished
Security IssuesNever render a missing glyph as ldquo
Dont simply overlay diacritics it can cause security problems
httpwwwunicodeorgnotestn2
httpunicodeorgreportstr14
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done
Use globalization APIs: formatting and parsing of dates, times, numbers, and currencies, comparison of text, and calendar systems are all locale-dependent
Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C, C++, Java)
Don't put any translatable strings into your code: separate them into resource files
Provide context to translators: is “Mark” a noun, a verb, or a name…?
Don't use the same string in different contexts unless the meaning is identical (including references)
Note: User-Interface language (menus, dialogs, help system) ≠ Data language (body text, spreadsheet cells)
Programs need to handle more languages as data than in the localized UI
Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default)
Don't simply concatenate strings to make messages
Order of components differs by language: use Java MessageFormat, or structure the UI as separate fields
Don't assume icons and symbols mean the same around the world
Don't assume everyone can read the Latin alphabet
Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs)
Beware of discrepancies in “fallback” behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP
http://unicode.org/cldr/
http://ibm.com/software/globalization/icu
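The advice against concatenation can be sketched with java.text.MessageFormat, where the translator controls the order of the components. Both patterns below are hypothetical resource strings written for illustration. In a real application they would be loaded from a resource file, one per language.

```java
import java.text.MessageFormat;

class MessageOrder {
    // Hypothetical resource strings: same arguments, different component order.
    static final String PATTERN_A = "File {0} is in folder {1}.";
    static final String PATTERN_B = "Folder {1} contains file {0}.";

    static String render(String pattern, String file, String folder) {
        // Arguments are bound by index, so the pattern decides the order.
        return MessageFormat.format(pattern, file, folder);
    }

    public static void main(String[] args) {
        System.out.println(render(PATTERN_A, "a.txt", "Docs")); // File a.txt is in folder Docs.
        System.out.println(render(PATTERN_B, "a.txt", "Docs")); // Folder Docs contains file a.txt.
    }
}
```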
Neutral Formats
Store and transmit neutral-format data wherever possible
Convert that data to the user's preferred formats as close to the user as possible
Type              Example               Rec. Standard
Language/Locale   en-US (en_US)         RFC 3066 bis, CLDR
Territory         AU                    RFC 3066 bis
Currency          EUR                   ISO 4217
Timezone          Australia/Melbourne   TZDB
Calendar          islamic-civil         CLDR Calendar ID
Custom Date       yyyy-mmm-dd           CLDR Pattern Format
Binary Time       8C80E9E3967A4B0       Windows File Time
Identification
Locale IDs are extensions of language IDs: use CLDR http://unicode.org/cldr/
Don't assume that everyone in a country always uses that country's currency
Always use an explicit currency ID (ISO 4217)
<RUR, 1.23457×10³> ↔ 1 234,57р in Russian
but Rub 1,234.57 in English
Don't assume the timezone ID is implied by the user's locale
For the best timezone information, use the TZ database; use CLDR for timezone names http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g. from browser settings), make sure the user can override that and pick an explicit value
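One way to picture the neutral-format advice: keep an explicit currency ID, an explicit timezone ID, and an instant in storage, and apply the user's locale only when formatting. This is a sketch under assumptions: the stored values are invented for illustration, and the exact formatted strings depend on the JDK's locale data, so none are shown.

```java
import java.math.BigDecimal;
import java.text.NumberFormat;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Currency;
import java.util.Locale;

class NeutralData {
    // Hypothetical stored record: neutral IDs and values, no locale-specific text.
    static final String CURRENCY_ID = "EUR";              // ISO 4217
    static final BigDecimal AMOUNT = new BigDecimal("1234.57");
    static final String ZONE_ID = "Australia/Melbourne";  // TZ database ID
    static final Instant WHEN = Instant.parse("2024-01-15T00:00:00Z");

    // Formats the amount for the user's locale, with the currency set explicitly.
    static String formatAmount(Locale userLocale) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(userLocale);
        nf.setCurrency(Currency.getInstance(CURRENCY_ID)); // not implied by the locale
        return nf.format(AMOUNT);
    }

    // Converts the stored instant to wall-clock time in the explicit zone.
    static ZonedDateTime localTime() {
        return WHEN.atZone(ZoneId.of(ZONE_ID));
    }

    public static void main(String[] args) {
        System.out.println(formatAmount(Locale.US));
        System.out.println(formatAmount(Locale.GERMANY));
        System.out.println(localTime());
    }
}
```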
Unicode Guide
Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other Resources
Unicode Site
http://unicode.org
An Overview of ICU
http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software
http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization
http://www.w3.org/International/
Microsoft Global Software Development
http://www.microsoft.com/globaldev/default.asp
Q&A
Backup Slides
User Input
If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean
If you are using type-ahead to get to a position in a list (e.g. typing “Jo” gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields
If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers
Dotted and Dotless I
[Table: uppercase and lowercase mappings of I, i, İ, and ı under normal and Turkic casing. Code points shown: I U+0049, i U+0069, İ U+0130, ı U+0131, and the sequences I + ˙ (U+0049 U+0307), i + ˙ (U+0069 U+0307), İ + ˙ (U+0130 U+0307)]
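The dotted and dotless I mappings can be observed with the JDK's locale-sensitive casing. This is a sketch assuming the standard String.toUpperCase(Locale) and String.toLowerCase(Locale) behavior. The class name is illustrative.

```java
import java.util.Locale;

class TurkicCasing {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        // Default (language-independent) casing
        System.out.println("i".toUpperCase(Locale.ROOT));   // I (U+0049)
        System.out.println("I".toLowerCase(Locale.ROOT));   // i (U+0069)
        // U+0130 lowercases to two code points by default: U+0069 U+0307
        System.out.println("\u0130".toLowerCase(Locale.ROOT).length()); // 2

        // Turkic casing keeps the dotted and dotless letters apart
        System.out.println("i".toUpperCase(TURKISH));       // U+0130 (dotted capital I)
        System.out.println("I".toLowerCase(TURKISH));       // U+0131 (dotless small i)
        System.out.println("\u0130".toLowerCase(TURKISH));  // i (U+0069)
    }
}
```

For language-independent structures such as keys or identifiers, passing Locale.ROOT explicitly avoids picking up Turkic behavior from the default locale.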
Java
In MessageFormat, watch for words like “can't”, since the ASCII apostrophe (') has syntactic meaning
Use a real apostrophe (U+2019) where possible: can’t
In Date and Calendar, the months are numbered from 0 (February is month number 1); however, weeks and days are numbered from 1
Java serialized text isn't UTF-8, though it's close: U+0000 and supplementary code points are encoded differently
Java globalization support is pretty outdated: use ICU to supplement it
Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details
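The first two gotchas are easy to reproduce. The sketch below assumes java.text.MessageFormat and java.util.GregorianCalendar. The message text and the helper name fmt are made up for illustration.

```java
import java.text.MessageFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;

class JavaGotchas {
    static String fmt(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // A lone ASCII apostrophe starts a quoted section, so {0} is not substituted.
        System.out.println(fmt("You can't delete {0}", "a.txt"));      // You cant delete {0}
        // U+2019 has no syntactic meaning in the pattern.
        System.out.println(fmt("You can\u2019t delete {0}", "a.txt")); // You can’t delete a.txt
        // Doubling the ASCII apostrophe also works.
        System.out.println(fmt("You can''t delete {0}", "a.txt"));     // You can't delete a.txt

        // Months count from 0, days of the month from 1.
        Calendar c = new GregorianCalendar(2024, Calendar.FEBRUARY, 1);
        System.out.println(c.get(Calendar.MONTH));        // 1
        System.out.println(c.get(Calendar.DAY_OF_MONTH)); // 1
    }
}
```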
JavaScript
Always encode characters above U+007F with escapes (\uxxxx)
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented
The JDK tool native2ascii can be used to convert the files to use escapes
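The escaping step itself is small. The following is a rough sketch of the idea, not a reimplementation of native2ascii. The class and method names are hypothetical. It escapes every UTF-16 code unit above U+007F, so supplementary characters come out as two escapes (a surrogate pair), which matches how JavaScript string escapes address them.

```java
class AsciiEscaper {
    // Replaces each UTF-16 code unit above U+007F with a backslash-u escape of four hex digits.
    static String escape(String source) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("var s = 'caf\u00e9';"));
    }
}
```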
Properties
Use properties such as Alphabetic, not hard-coded lists
isAlphabetic(x), regex \p{Alphabetic} or [:Alphabetic:]
Not (“A” ≤ x ≤ “Z” OR “a” ≤ x ≤ “z”)
Some properties aren't what you think; use:
White_Space, not General_Category=Zs
Alphabetic, not General_Category=L
Lowercase, not General_Category=Ll
Script=Greek, not Block=Greek
Characters may change property values between versions of Unicode
http://unicode.org/standard/stability_policy.html
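In Java, the property-based check might look like the sketch below, which assumes Character.isAlphabetic and the \p{IsAlphabetic} regex class from the JDK. The naiveIsLetter helper is a deliberately wrong hard-coded range, included only for contrast.

```java
import java.util.regex.Pattern;

class AlphabeticCheck {
    // The hard-coded ASCII range the slide warns against.
    static boolean naiveIsLetter(int cp) {
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
    }

    static final Pattern ALPHABETIC = Pattern.compile("\\p{IsAlphabetic}+");

    public static void main(String[] args) {
        int eAcute = 0x00E9;        // LATIN SMALL LETTER E WITH ACUTE
        int smallRomanOne = 0x2170; // SMALL ROMAN NUMERAL ONE (General_Category Nl)

        System.out.println(naiveIsLetter(eAcute));                // false
        System.out.println(Character.isAlphabetic(eAcute));       // true

        // Alphabetic is wider than General_Category=L.
        System.out.println(Character.isLetter(smallRomanOne));     // false
        System.out.println(Character.isAlphabetic(smallRomanOne)); // true

        // The Greek word from the Transliteration slide, written with escapes.
        String greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac";
        System.out.println(ALPHABETIC.matcher(greek).matches());   // true
    }
}
```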
Identifiers & Tokens
When designing syntax, use as a base:
Pattern_Syntax for operators, relations
Pattern_White_Space for gaps
XID_Start and XID_Continue for identifiers
All backwards compatible across versions
Profiles may expand or narrow from the base
Watch out for security attacks: “paypal.com” with a Cyrillic “а”
See Unicode Security at this conference
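A very rough way to notice the spoofed “paypal.com” is to list the scripts a string uses. The sketch below assumes Character.UnicodeScript from the JDK and is only a naive mixed-script check, not the full confusable detection described in the Unicode security reports. The class name is illustrative.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

class ScriptCheck {
    // Collects the scripts used in a string, ignoring Common and Inherited characters.
    static Set<UnicodeScript> scriptsOf(String s) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        s.codePoints().forEach(cp -> {
            UnicodeScript sc = UnicodeScript.of(cp);
            if (sc != UnicodeScript.COMMON && sc != UnicodeScript.INHERITED) {
                scripts.add(sc);
            }
        });
        return scripts;
    }

    public static void main(String[] args) {
        String genuine = "paypal.com";
        String spoofed = "p\u0430ypal.com"; // second letter is CYRILLIC SMALL LETTER A
        System.out.println(scriptsOf(genuine).size()); // 1
        System.out.println(scriptsOf(spoofed).size()); // 2
    }
}
```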
Comparison (Collation)
Searching, Sorting, Matching
There are two binary orders:
code point order = UTF-8 order = UTF-32 order
≠ UTF-16 order
Don't present users with binary order
No users expect A < Z < a < z < Ç < ä
Apply normalization to get a unique form, so that Å = Å
Security Issues: Protocols must precisely define the comparison operations
E.g. LDAP doesn't, so lookup may fail (or falsely succeed)
Aside from wrong results, this is an opening for security attacks
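The difference between the two binary orders shows up as soon as a supplementary character meets a BMP character above the surrogate range. In this sketch, String.compareTo gives UTF-16 code unit order, and compareCodePoints is a hypothetical helper written here to give code point order.

```java
class BinaryOrders {
    // Compares two strings by code point value rather than by UTF-16 code unit.
    static int compareCodePoints(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                                // U+FF5E
        String supp = new String(Character.toChars(0x10000)); // U+10000 = D800 DC00 in UTF-16

        System.out.println(bmp.compareTo(supp) > 0);          // true: FF5E > D800 as code units
        System.out.println(compareCodePoints(bmp, supp) < 0); // true: U+FF5E < U+10000
    }
}
```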
Language-Sensitive Comparison
Use UCA order as a base to meet user expectations
a < A < ä < Ç = Ç < z < Z
Real language-sensitive order requires tailoring on top of UCA: ordering depends on context and language
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < h < ch < i (Slovak)
Follow UCA for substring match offsets – some gotchas here
Don't mix up stable and deterministic sorting: they are very different
http://unicode.org/reports/tr10/ http://unicode.org/cldr/
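A minimal contrast between binary and language-sensitive sorting, assuming java.text.Collator. The JDK collator uses its own rule tables rather than the current UCA and CLDR data, so this only illustrates the principle. ICU's collator is the closer match to what the slide recommends. Helper names are illustrative.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationDemo {
    // Sorts by UTF-16 code unit value (what String.compareTo does).
    static String[] sortBinary(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts with the collator for the given locale.
    static String[] sortCollated(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortBinary("apple", "Zebra")));                   // [Zebra, apple]
        System.out.println(Arrays.toString(sortCollated(Locale.ENGLISH, "apple", "Zebra"))); // [apple, Zebra]
    }
}
```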
Normalization (NFC…)
Standardized normalized forms defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context independent
Don't assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes, but some characters decompose in NFC
Trivia: In Unicode 4.1, there are exactly 3 characters that are different in all 4 normalization forms: ϓ ϔ ẛ
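Both warnings can be checked with java.text.Normalizer. The sketch uses A + U+030A for the concatenation case. For a character that decomposes under NFC it uses U+0958 (DEVANAGARI LETTER QA), which is an example chosen here rather than one from the slides.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Form.NFC);
    }

    public static void main(String[] args) {
        String x = "A";
        String y = "\u030A"; // COMBINING RING ABOVE

        // Normalizing the whole string composes to U+00C5.
        System.out.println(nfc(x + y).equals("\u00C5"));         // true
        // Normalizing the pieces separately leaves them uncomposed.
        System.out.println((nfc(x) + nfc(y)).equals(nfc(x + y))); // false

        // NFC does not always compose: U+0958 becomes U+0915 U+093C.
        System.out.println(nfc("\u0958").length());               // 2
    }
}
```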
Maximum Expansion (U4.1)
Operation    UTF      Factor   Sample
NFC          8        3X       U+1D160
NFC          16, 32   3X       שּׁ U+FB2C
NFD          8        3X       ΐ U+0390
NFD          16, 32   4X       ᾂ U+1F82
NFKC, NFKD   8        11X      ﷺ U+FDFA
NFKC, NFKD   16, 32   18X      ﷺ U+FDFA
Case Conversion
Not a simple 1:1 mapping
Title case: dz ↔ DZ ↔ Dz
Expansion: heiß → HEISS → heiss
Context-dependent: ΌΣΟΣ → όσος
Language-dependent: istanbul ↔ İSTANBUL
Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees
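The expansion and context-dependent cases above can be reproduced with the JDK's full case mappings. This is a sketch using String.toUpperCase and String.toLowerCase with Locale.ROOT. The roundTrip helper is illustrative.

```java
import java.util.Locale;

class CaseDemo {
    // Uppercases and then lowercases, to show that the round trip is lossy.
    static String roundTrip(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String heiss = "hei\u00DF"; // ends in LATIN SMALL LETTER SHARP S
        System.out.println(heiss.toUpperCase(Locale.ROOT)); // HEISS (one letter longer)
        System.out.println(roundTrip(heiss));               // heiss

        // Final sigma: the last capital sigma lowercases to U+03C2, the inner one to U+03C3.
        String upper = "\u038C\u03A3\u039F\u03A3";
        System.out.println(upper.toLowerCase(Locale.ROOT).equals("\u03CC\u03C3\u03BF\u03C2")); // true
    }
}
```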
Casing Maximum Expansion
Operation            UTF         Factor   Sample
Lower                8           1.5X     Ⱥ U+023A
Lower                16, 32      1X       A U+0041
Upper, Title, Fold   8, 16, 32   3X       ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 5.0
Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Lowercase, Uppercase: Form vs Function
Lowercase: the binary property
The character is lowercase in form, but not necessarily in function
Functionally Lowercase:
isCased(x) & isLowercase(x)
See Section 3.13 of TUS
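The form versus function distinction can be approximated in Java. Character.isLowerCase is used as the JDK's view of the Lowercase property. The hasUppercaseMapping helper is a simplification assumed here in place of the full definitions in Section 3.13, since it only looks at simple one-to-one mappings.

```java
class LowercaseCheck {
    // Lowercase in form: the Lowercase property as reported by the JDK.
    static boolean lowercaseInForm(int cp) {
        return Character.isLowerCase(cp);
    }

    // Simplified stand-in for "functionally lowercase": simple uppercasing changes the character.
    static boolean hasUppercaseMapping(int cp) {
        return Character.toUpperCase(cp) != cp;
    }

    public static void main(String[] args) {
        int modifierGamma = 0x02E0; // MODIFIER LETTER SMALL GAMMA
        int smallRomanOne = 0x2170; // SMALL ROMAN NUMERAL ONE

        // Lowercase in form, but nothing to map to.
        System.out.println(lowercaseInForm(modifierGamma));     // true
        System.out.println(hasUppercaseMapping(modifierGamma)); // false

        // Lowercase in form and in function, yet not General_Category Ll.
        System.out.println(lowercaseInForm(smallRomanOne));     // true
        System.out.println(hasUppercaseMapping(smallRomanOne)); // true
        System.out.println(Character.getType(smallRomanOne) == Character.LOWERCASE_LETTER); // false
    }
}
```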
Comparison (Collation)Searching Sorting
MatchingThere are two binary orders
code point order = UTF-8 order = UTF-32 order
ne UTF16 order
Donrsquot present users with binary order
No users expect A lt Z lt a lt z lt Ccedil lt auml
Apply normalization to get a unique form so Aring = Aring
Security Issues Protocols must precisely define the comparison operations
Eg LDAP doesnt so lookup may fail (or falsely succeed)
Aside from wrong results opening for security attacks
Language-Sensitive Comparison
Use UCA Order as a base to meet user-expectationsa lt A lt auml lt Ccedil = C_ lt z lt Z
Real language-sensitive order requires tailoring on top of UCA ordering depends on context and language
china lt China lt chinas lt danish
ae lt aelig lt af
z lt aelig (Danish)
c lt d lt h lt ch lt i (Slovak)
Follow UCA for substring match offsets ndash some gotchas here
Dont mix up stable and deterministic sorting they are very different
httpunicodeorgreportstr10 httpunicodeorgcldr
Normalization (NFChellip)
Standardized normalized forms defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context independent
Dont assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes but some characters decompose in NFC
Trivia In Unicode 41 there are exactly 3 characters that are different in all 4 normalization forms ϓ ϔ ẛ
Maximum Expansion (U41)
Operation UTF Factor Sample
NFC8 3X 119136 U+1D160
16 32 3X ש U+FB2C
NFD8 3X ΐ U+0390
16 32 4X ᾂ U+1F82
NFKC NFKD8 11X ملسو هيلع هللا ىلص U+FDFA
16 32 18X
Case Conversion
Not a simple 11 mapping
Title case dz harr DZ harr Dz
Expansion heiszlig rarr HEISS rarr heiss
Context-dependent ΌΣΟΣ rarr όσος
Language-dependent istanbul harr İSTANBUL
Warning never use language-dependent casing for language-independent structures like file-system B-Trees
Casing Maximum Expansion
Operation UTFFacto
rSample
Lower
8 15X Ⱥ U+023A
16 32 1X A U+0041
Upper Title Fold
8 16 32
3X ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 50
Dont use the Lowercase_Letter (Ll) or Uppercase_Letter (Lt) of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Lowercase UppercaseForm vs Function
Lowercase the binary property
The character is lowercase in formbut not necessarily in function
Functionally Lowercase
isCased(x) amp isLowercase(x)
See Section 313 of TUS
Lowercase Form vs Function
LC F LC Ll
Count
Examples(U41
)
Y
NN 114 ˠ U+02E
0MODIFIER LETTER SMALL GAMMA
Y 705 ordf U+00AA
FEMININE ORDINAL INDICATOR
YN 43 ⅰ U+217
0SMALL ROMAN NUMERAL ONE
Y 903 a U+0061
LATIN SMALL LETTER A
Segmentation
What a user thinks of as a characters is often a sequence
Words are not just sequences of letters
Lines donrsquot just break at spaces
All may be language-dependent
httpwwwunicodeorgreportstr14 httpwwwunicodeorgreportstr29
TransliterationTransliteration Ελληνικά harr Ellēnikaacutene Translation Ελληνικά harr Greek
Transliteration may vary by language
Путин harr Putin Poutine
Горбачёв harr Gorbachev Gorbacev Gorbatchev Gorbačeumlv Gorbachov Gorbatsov Gorbatschow
Watch for terminology ldquolossyrdquo vs ldquolosslessrdquoLossy transliteration Ελληνικά rarr Ellinika rarr Ελλινικα
In ISO terms ldquotransliterationrdquo = lossless transliteration
ldquotranscriptionrdquo = lossy transliteration
httpunicodeorgdraftreportstr35tr35html
Rendering is Contextual
Glyphs may change shape
Multiple characters rarr 1 glyph
One character rarr multiple glyphs
Processing character-by-character gives the wrong results
Rendering IIGood rendering systems will handle customary type-in order for text plus canonical order
Excellent ones will do any canonically-equivalent order but those are rare
There may be differences in the customary glyphs for different languages specify the font or the language where they have to be distinguished
Security IssuesNever render a missing glyph as ldquo
Dont simply overlay diacritics it can cause security problems
httpwwwunicodeorgnotestn2
httpunicodeorgreportstr14
GlobalizationUnicode ne Globalization (aka Internationalization Localizability)
Unicode provides the basis for software globalization but theres more work to be done
Use globalization APIs Formatting and parsing of dates times numbers currencies comparison of text calendar systems are locale-dependent
Where OS facilities are not adequate or cross-platform solutions are needed use ICU (C C++ Java)
Dont put any translatable strings into your code separate into resource files
Provide context to translators is Mark a noun a verb or a namehellip
Donrsquot use the same string in different contexts unless the meaning is identical (including references)
Note User-Interface language (menus dialog help-system) neData language (body text spreadsheet cells)
Programs need to handle as data more languages than in localized UI
Common Globalization Mistakes
Never compile Windows apps as ldquoANSIrdquo (the default)
Dont simply concatenate strings to make messagesOrder of components differs by language use Java MessageFormat or structure UI as separate fields
Dont assume icons and symbols mean the same around the world Dont assume everyone can read the Latin alphabet
Allocate space flexibly ldquoOKrdquo in English rarr ldquoAceptarrdquo in Spanish
English is a relatively compact language others may require more characters (eg in database fields) and more screen real estate (in UIs)
Beware of discrepancies in ldquofallbackrdquo behaviorJava ResourceBundle (J2SE) Java Standard Tag Library (JSTL) Java Server Face (JSF) Apache HTTP
httpunicodeorgcldr
httpibmcomsoftwareglobalizationicu
Neutral FormatsStore and transmit neutral-format data wherever possible Convert that data to the users preferred formats as close to the user as possible
Type Example Rec Standard
LanguageLocale en-US (en_US) RFC 3066 bis CLDR
Territory AU RFC 3066 bis
Currency EUR ISO 4217
Timezone AustraliaMelbourne TZDB
Calendar islamic-civil CLDR Calendar ID
Custom Date yyyy-mmm-dd CLDR Pattern Format
Binary Time 8C80E9E3967A4B0 Windows File Time
IdentificationLocale IDs are extensions of language IDs use CLDR httpunicodeorgcldr
Dont assume that everyone in country always uses that countryrsquos currency Always use an explicit currency ID (ISO 4217)
ltRUR 123457times10sup3gt harr 1 23457р in Russian
but Rub 123457 in English
Dont assume the timezone ID is implied by the users locale For the best timezone information use the TZ database use CLDR for timezone names httpwwwtwinsuncomtztz-linkhtm
If you heuristically compute territory IDs timezone IDs currency IDs etc (eg from browser settings) make sure the user can override that and pick an explicit value
Unicode Guide
Authoritative but lightweight
Introduction overview and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other ResourcesUnicode Site
httpunicodeorg
An Overview of ICUhttpicusourceforgenetdocspapersicu_overview_latestppt
Globalizing Softwarehttpicusourceforgenetdocspapersglobalizing_softwareppt
W3C Internationalizationhttpwwww3orgInternational
Microsoft Global Software Developmenthttpwwwmicrosoftcomglobaldevdefaultasp
QampA
Backup Slides
User Input
If you develop your own text editor use the OS APIs to handle IMEs (Input Method Engines) for Chinese Japanese Korean
If you are using type-ahead to get to a position in a list (eg typing Jo gets to the first element starting with those characters) allow arbitrary input This is often easiest with visible fields
If your password field can contain characters that require an IME a screen pop-up box may reveal the password to onlookers
Dotted and Dotless I
Uppercase Normal Lowercase Turkic Uppercase
larr
I + ˙
i 0049 0307
I 0069
İ
0049 0130
larr
ı
I
0131 0049
İrarr
0130 i + ˙ İ + ˙
I + ˙
0069 0307 0130 0307
0049 0307
JavaIn MessageFormat watch for words like cant since ASCII has syntactic meaning Use a real apostrophe (U+2019) where possible canrsquot
In Date and Calendar the months are numbered from 0 (February is month number 1) However weeks and days are numbered from 1
Java serialized text isnt UTF-8 though its close U+0000 and supplementary code points are encoded differently
Java globalization support is pretty outdated use ICU to supplement it
Java ResourceBundle (J2SE) Java Standard Tag Library (JSTL) Java Server Face (JSF) Apache HTTP server etc all provide some locale determination mechanism and facility but they all differ in details
JavaScript
Always encode characters above U+007F with escapes (\uXXXX)
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented
The JDK tool native2ascii can be used to convert the files to use escapes
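What such a conversion does can be sketched as follows. This is only an illustration of the escaping rule, not a replacement for native2ascii; the function name `escape_non_ascii` is made up here. Each character above U+007F is written as one `\uXXXX` escape per UTF-16 code unit, so supplementary characters become two escapes.

```python
def escape_non_ascii(source: str) -> str:
    """Replace every character above U+007F with \\uXXXX escapes."""
    parts = []
    for ch in source:
        if ord(ch) <= 0x7F:
            parts.append(ch)
            continue
        # Work in UTF-16 code units, since \uXXXX can only hold 16 bits.
        data = ch.encode("utf-16-be", "surrogatepass")
        for i in range(0, len(data), 2):
            parts.append("\\u%04x" % int.from_bytes(data[i:i + 2], "big"))
    return "".join(parts)


print(escape_non_ascii('var s = "caf\u00e9";'))
print(escape_non_ascii("\U0001D160"))
```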
Language-Sensitive Comparison
Use UCA order as a base to meet user expectations: a < A < ä < Ç = C + U+0327 < z < Z
Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < h < ch < i (Slovak)
Follow UCA for substring match offsets - some gotchas here
Don't mix up stable and deterministic sorting; they are very different
http://unicode.org/reports/tr10/ http://unicode.org/cldr/
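The stable/deterministic distinction is easy to blur, so here is a small sketch. It uses `str.casefold` as a stand-in for a real collation key (an assumption for illustration only; it is not UCA). A stable sort keeps records with equal keys in their input order, so the result depends on that order. A deterministic comparison breaks ties by code point, so strings that differ never compare equal and the result is the same for any input order.

```python
def toy_key(s: str) -> str:
    # Stand-in for a collation key under which "b" and "B" compare equal.
    return s.casefold()


def stable_sort(words):
    # Python's sorted() is stable: equal keys keep their input order.
    return sorted(words, key=toy_key)


def deterministic_sort(words):
    # Ties are broken by code point order, independent of input order.
    return sorted(words, key=lambda s: (toy_key(s), s))


print(stable_sort(["b", "B", "a"]))         # ['a', 'b', 'B']
print(stable_sort(["B", "b", "a"]))         # ['a', 'B', 'b']
print(deterministic_sort(["b", "B", "a"]))  # ['a', 'B', 'b']
print(deterministic_sort(["B", "b", "a"]))  # ['a', 'B', 'b']
```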
Normalization (NFC, …)
Standardized normalized forms defined by Unicode
The ordering of accents in a normalization form may not be the typical type-in order
Fonts should handle both orders
Normalization is context independent
Don't assume NFC(x + y) = NFC(x) + NFC(y)
People assume that NFC always composes, but some characters decompose in NFC
Trivia: In Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ ϔ ẛ
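These three claims can be checked with Python's standard `unicodedata` module. The sketch below uses the interpreter's bundled Unicode version rather than 4.1, which is fine here because normalization of already-assigned characters is stable. The helper names are invented for the example.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def distinct_forms(ch: str) -> set:
    """Return the set of results of the four normalization forms for ch."""
    return {unicodedata.normalize(form, ch)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


# 1. NFC is not closed under concatenation: e + combining acute composes.
x, y = "e", "\u0301"
print(nfc(x + y) == nfc(x) + nfc(y))   # False

# 2. NFC can decompose: U+FB2C is excluded from composition.
print(len(nfc("\ufb2c")))              # 3 code points

# 3. The trivia characters differ in all four forms.
print([len(distinct_forms(ch)) for ch in "\u03d3\u03d4\u1e9b"])  # [4, 4, 4]
```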
Maximum Expansion (U4.1)
Operation | UTF | Factor | Sample
NFC | 8 | 3X | 𝅘𝅥𝅮 U+1D160
NFC | 16, 32 | 3X | שּׁ U+FB2C
NFD | 8 | 3X | ΐ U+0390
NFD | 16, 32 | 4X | ᾂ U+1F82
NFKC, NFKD | 8 | 11X | ﷺ U+FDFA
NFKC, NFKD | 16, 32 | 18X | ﷺ U+FDFA
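The factors in the table can be reproduced for the sample characters with a short script. This is a sketch for checking individual characters, not a search for the maximum; `units` and `expansion` are names chosen for this example.

```python
import unicodedata


def units(s: str, utf: int) -> int:
    """Count code units of s in UTF-8, UTF-16, or UTF-32."""
    if utf == 8:
        return len(s.encode("utf-8"))
    if utf == 16:
        return len(s.encode("utf-16-le")) // 2
    return len(s)  # UTF-32: one code unit per code point


def expansion(ch: str, form: str, utf: int) -> float:
    """Growth factor of ch under the given normalization form and encoding."""
    return units(unicodedata.normalize(form, ch), utf) / units(ch, utf)


print(expansion("\U0001D160", "NFC", 8))   # 3.0
print(expansion("\u1f82", "NFD", 16))      # 4.0
print(expansion("\ufdfa", "NFKC", 8))      # 11.0
print(expansion("\ufdfa", "NFKD", 16))     # 18.0
```

Factors like these are what a buffer-sizing calculation needs when normalizing in place.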
Case Conversion
Not a simple 1:1 mapping
Title case: ǆ ↔ Ǆ ↔ ǅ (dz ↔ DZ ↔ Dz)
Expansion: heiß → HEISS → heiss
Context-dependent: ΌΣΟΣ → όσος
Language-dependent: istanbul ↔ İSTANBUL
Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees
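The language-independent cases above can be observed directly with Python's built-in string methods, which apply the full Unicode case mappings. The language-dependent case is not covered, since Python has no locale parameter for casing. The helper `case_forms` is just a convenience invented for this sketch.

```python
def case_forms(text: str) -> dict:
    """Return the upper, lower, and title forms of text."""
    return {"upper": text.upper(), "lower": text.lower(), "title": text.title()}


# Title case: the digraph U+01C6 has three distinct case forms.
print(case_forms("\u01c6"))

# Expansion: the round trip through uppercase does not restore the original.
print("hei\u00df".upper().lower())   # heiss

# Context-dependent: the final sigma gets its own lowercase form.
print("\u038c\u03a3\u039f\u03a3".lower())
```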
Casing Maximum Expansion
Operation | UTF | Factor | Sample
Lower | 8 | 1.5X | Ⱥ U+023A
Lower | 16, 32 | 1X | A U+0041
Upper, Title, Fold | 8, 16, 32 | 3X | ΐ U+0390
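As a quick check of the sample rows, the growth factor for a casing operation can be computed from encoded lengths. The function `casing_expansion` is a hypothetical helper for this sketch, and it relies on the Unicode data bundled with the Python interpreter rather than Unicode 4.1.

```python
def casing_expansion(ch: str, operation, encoding: str) -> float:
    """Ratio of encoded sizes after and before applying a casing operation."""
    return len(operation(ch).encode(encoding)) / len(ch.encode(encoding))


# Lowercasing U+023A moves from a 2-byte to a 3-byte UTF-8 sequence.
print(casing_expansion("\u023a", str.lower, "utf-8"))       # 1.5

# Uppercasing or folding U+0390 yields three code points in every encoding.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, casing_expansion("\u0390", str.upper, encoding))
```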
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 5.0
Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Lowercase, Uppercase: Form vs. Function
Lowercase, the binary property:
The character is lowercase in form, but not necessarily in function
Functionally Lowercase:
isCased(x) & isLowercase(x)
See Section 3.13 of TUS
Lowercase: Form vs. Function
LC | F LC | Ll | Count | Examples (U4.1)
Y | N | N | 114 | ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y | N | Y | 705 | ª U+00AA FEMININE ORDINAL INDICATOR
Y | Y | N | 43 | ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y | Y | Y | 903 | a U+0061 LATIN SMALL LETTER A
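The three columns can be told apart in code. The sketch below follows the string-function definitions in Section 3.13 of the Unicode Standard for the functional test and handles single characters only; the function name is invented here. It also assumes that Python's `str.islower()` reflects the binary Lowercase property, which is how current CPython builds its tables. Because it uses the interpreter's current Unicode data, general categories can differ from the 4.1 values in the table.

```python
import unicodedata


def is_functionally_lowercase(ch: str) -> bool:
    """isCased(ch) & isLowercase(ch), using the case-mapping definitions."""
    d = unicodedata.normalize("NFD", ch)
    is_lower = d.lower() == d
    is_upper = d.upper() == d
    is_title = d.title() == d
    # A character is cased if at least one case mapping changes it.
    is_cased = not (is_lower and is_upper and is_title)
    return is_cased and is_lower


for ch in ("\u02e0", "\u2170", "a"):
    print("U+%04X" % ord(ch),
          "Lowercase property:", ch.islower(),
          "functionally lowercase:", is_functionally_lowercase(ch),
          "category:", unicodedata.category(ch))
```

U+02E0 is lowercase in form only, while U+2170 is lowercase in function even though its general category is not Ll.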
Segmentation
What a user thinks of as a character is often a sequence
Words are not just sequences of letters
Lines don't just break at spaces
All may be language-dependent
http://www.unicode.org/reports/tr14/ http://www.unicode.org/reports/tr29/
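To show why code point counts and user-perceived characters differ, here is a deliberately simplified sketch that attaches combining marks to the preceding base character. It is an approximation for illustration only and not an implementation of the grapheme cluster rules in UAX #29 (it ignores Hangul syllables, emoji sequences, and more); `approx_clusters` is a made-up name.

```python
import unicodedata


def approx_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # combining mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


word = "e\u0301a"                  # e + combining acute, then a
print(len(word))                   # 3 code points
print(len(approx_clusters(word)))  # 2 user-perceived characters
```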
Transliteration
Transliteration: Ελληνικά ↔ Ellēniká
≠ Translation: Ελληνικά ↔ Greek
Transliteration may vary by language
Путин ↔ Putin, Poutine
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow
Watch for terminology: “lossy” vs. “lossless”
Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
In ISO terms, “transliteration” = lossless transliteration
“transcription” = lossy transliteration
http://unicode.org/draft/reports/tr35/tr35.html
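The lossy round trip can be reproduced with a toy mapping. The two tables below are hypothetical and cover only the letters of this one word; they are not taken from any transliteration standard. The information is lost because η and ι both map to i, and because the accent is dropped.

```python
import unicodedata

# Hypothetical toy tables, covering only the letters needed for the example.
GREEK_TO_LATIN = {
    "\u03b5": "e",  # epsilon
    "\u03bb": "l",  # lambda
    "\u03b7": "i",  # eta   (collides with iota)
    "\u03b9": "i",  # iota
    "\u03bd": "n",  # nu
    "\u03ba": "k",  # kappa
    "\u03b1": "a",  # alpha
}
LATIN_TO_GREEK = {"e": "\u03b5", "l": "\u03bb", "i": "\u03b9",
                  "n": "\u03bd", "k": "\u03ba", "a": "\u03b1"}


def strip_marks(text: str) -> str:
    """Drop combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))


def map_letters(text: str, table: dict) -> str:
    """Map letter by letter, keeping the case of each source letter."""
    out = []
    for ch in text:
        mapped = table.get(ch.lower(), ch)
        out.append(mapped.upper() if ch.isupper() else mapped)
    return "".join(out)


def to_latin(text: str) -> str:
    return map_letters(strip_marks(text), GREEK_TO_LATIN)


def to_greek(text: str) -> str:
    return map_letters(text, LATIN_TO_GREEK)


original = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"
print(to_latin(original))                        # Ellinika
print(to_greek(to_latin(original)) == original)  # False: information was lost
```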
Rendering is Contextual
Glyphs may change shape
Multiple characters → 1 glyph
One character → multiple glyphs
Processing character-by-character gives the wrong results
Rendering II
Good rendering systems will handle customary type-in order for text, plus canonical order
Excellent ones will do any canonically-equivalent order, but those are rare
There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished
Security Issues
Never render a missing glyph as “
Don't simply overlay diacritics; it can cause security problems
http://www.unicode.org/notes/tn2/
http://unicode.org/reports/tr14/
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done
Use globalization APIs. Formatting and parsing of dates, times, numbers, and currencies, comparison of text, and calendar systems are locale-dependent.
Where OS facilities are not adequate, or cross-platform solutions are needed, use ICU (C, C++, Java)
Don't put any translatable strings into your code; separate into resource files
Provide context to translators: is "Mark" a noun, a verb, or a name…?
Don't use the same string in different contexts unless the meaning is identical (including references)
Note: User-Interface language (menus, dialog, help-system) ≠ Data language (body text, spreadsheet cells)
Programs need to handle as data more languages than in localized UI
Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default)
Don't simply concatenate strings to make messages
Order of components differs by language: use Java MessageFormat, or structure UI as separate fields
Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet.
Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
English is a relatively compact language; others may require more characters (e.g., in database fields) and more screen real estate (in UIs)
Beware of discrepancies in “fallback” behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP
http://unicode.org/cldr/
http://ibm.com/software/globalization/icu
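The concatenation mistake is easiest to see side by side with a pattern-based message. In this sketch the `MESSAGES` table stands in for per-language resource files, and the keys, patterns, and German wording are invented for the example. Named placeholders let each translation put the components in its own order, which concatenation in code cannot do. Plural handling is deliberately left out.

```python
# Stand-in for resource files: one complete message pattern per language.
MESSAGES = {
    "en": "{count} files in folder {folder}",
    "de": "Im Ordner {folder} befinden sich {count} Dateien",
}


def format_message(language: str, **arguments) -> str:
    """Fill the named placeholders of the pattern for the given language."""
    return MESSAGES[language].format(**arguments)


def concatenated(count: int, folder: str) -> str:
    # The mistake: word order is fixed in code and cannot be translated.
    return str(count) + " files in folder " + folder


print(format_message("en", count=3, folder="Docs"))
print(format_message("de", count=3, folder="Docs"))
```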
Neutral Formats
Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as close to the user as possible.
Type | Example | Rec. Standard
Language/Locale | en-US (en_US) | RFC 3066 bis, CLDR
Territory | AU | RFC 3066 bis
Currency | EUR | ISO 4217
Timezone | Australia/Melbourne | TZDB
Calendar | islamic-civil | CLDR Calendar ID
Custom Date | yyyy-mmm-dd | CLDR Pattern Format
Binary Time | 8C80E9E3967A4B0 | Windows File Time
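As one example of turning a neutral binary value into something displayable, a Windows file time (a count of 100-nanosecond intervals since 1601-01-01 UTC) can be converted to a UTC timestamp as sketched below. The function name is an assumption of this sketch, and the result would still need locale-aware formatting and a timezone conversion close to the user.

```python
from datetime import datetime, timedelta, timezone

# Windows file times count from this instant.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_utc(filetime: int) -> datetime:
    """Convert a Windows file time to a timezone-aware UTC datetime."""
    # File times have 100 ns resolution; datetime resolves microseconds,
    # so the last decimal digit is truncated.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


# 116444736000000000 is the file time of the Unix epoch.
print(filetime_to_utc(116444736000000000).isoformat())
```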
Identification
Locale IDs are extensions of language IDs; use CLDR: http://unicode.org/cldr/
Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217).
<RUR, 1.23457×10³> ↔ 1 234,57р. in Russian
but Rub 1,234.57 in English
Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names: http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value
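The override rule might look like this in code. The sketch guesses a language tag from an HTTP Accept-Language header but always lets an explicit user choice win. The function names and the fallback default are assumptions of this example, and the header parsing is simplified (it does no matching against the set of supported locales).

```python
def parse_accept_language(header: str) -> list:
    """Return the language tags of an Accept-Language header, best first."""
    preferences = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        preferences.append((quality, tag.strip()))
    # sorted() is stable, so tags with equal quality keep header order.
    ranked = sorted(preferences, key=lambda item: -item[0])
    return [tag for quality, tag in ranked if quality > 0]


def resolve_locale(explicit_choice, accept_language, default="en"):
    """An explicit user setting beats any heuristic guess."""
    if explicit_choice:
        return explicit_choice
    guesses = parse_accept_language(accept_language or "")
    return guesses[0] if guesses else default


print(resolve_locale(None, "fr-CH, fr;q=0.9, en;q=0.8"))     # fr-CH
print(resolve_locale("de-AT", "fr-CH, fr;q=0.9, en;q=0.8"))  # de-AT
```

The same pattern applies to territory, timezone, and currency IDs: compute a guess, store the explicit choice separately, and prefer the explicit choice whenever it exists.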
Casing Maximum Expansion
Operation UTFFacto
rSample
Lower
8 15X Ⱥ U+023A
16 32 1X A U+0041
Upper Title Fold
8 16 32
3X ΐ U+0390
Case Conversion II
Case folding was not stable
Different results from toCaseFold(S) between two versions
Stability now guaranteed in Unicode 50
Dont use the Lowercase_Letter (Ll) or Uppercase_Letter (Lt) of General_Category
These were constrained to be in a partition
Use the separate binary properties Lowercase and Uppercase instead
Lowercase UppercaseForm vs Function
Lowercase the binary property
The character is lowercase in formbut not necessarily in function
Functionally Lowercase
isCased(x) amp isLowercase(x)
See Section 313 of TUS
Lowercase: Form vs. Function
LC | F LC | Ll | Count | Examples (U4.1)
Y | N | N | 114 | ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y | N | Y | 705 | ª U+00AA FEMININE ORDINAL INDICATOR
Y | Y | N | 43 | ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y | Y | Y | 903 | a U+0061 LATIN SMALL LETTER A
Segmentation
What a user thinks of as a character is often a sequence.
Words are not just sequences of letters.
Lines don't just break at spaces.
All may be language-dependent.
http://www.unicode.org/reports/tr14/ http://www.unicode.org/reports/tr29/
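In Java, java.text.BreakIterator is the standard entry point for this kind of boundary analysis. The following sketch splits text into user-perceived characters; the class and method names are illustrative, and the exact boundaries for complex scripts depend on the JDK version (ICU offers more complete rules).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative class name; not part of the original slides.
final class Segments {
    /** Cuts text at every boundary reported by the given iterator. */
    static List<String> split(BreakIterator it, String text) {
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    /** User-perceived characters (grapheme clusters) of text. */
    static List<String> graphemes(String text, Locale locale) {
        return split(BreakIterator.getCharacterInstance(locale), text);
    }

    public static void main(String[] args) {
        String text = "e\u0301x";  // e + COMBINING ACUTE ACCENT, then x
        System.out.println(text.length());                          // 3 code units
        System.out.println(graphemes(text, Locale.ROOT).size());    // 2 user-perceived characters
    }
}
```

BreakIterator.getWordInstance and BreakIterator.getLineInstance can be passed to the same split helper for word and line boundaries.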
Transliteration
Transliteration: Ελληνικά ↔ Ellēniká
≠ Translation: Ελληνικά ↔ Greek
Transliteration may vary by language:
Путин ↔ Putin, Poutine
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow
Watch for terminology: "lossy" vs. "lossless"
Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
In ISO terms, "transliteration" = lossless transliteration;
"transcription" = lossy transliteration
http://unicode.org/draft/reports/tr35/tr35.html
Rendering is Contextual
Glyphs may change shape
Multiple characters → 1 glyph
One character → multiple glyphs
Processing character-by-character gives the wrong results
Rendering II
Good rendering systems will handle customary type-in order for text, plus canonical order.
Excellent ones will do any canonically-equivalent order, but those are rare.
There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished.
Security Issues
Never render a missing glyph as "?".
Don't simply overlay diacritics; it can cause security problems.
http://www.unicode.org/notes/tn2/
http://unicode.org/reports/tr14/
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done.
Use globalization APIs: formatting and parsing of dates, times, numbers, and currencies; comparison of text; and calendar systems are all locale-dependent.
Where OS facilities are not adequate, or cross-platform solutions are needed, use ICU (C, C++, Java).
Don't put any translatable strings into your code; separate them into resource files.
Provide context to translators: is "Mark" a noun, a verb, or a name…?
Don't use the same string in different contexts unless the meaning is identical (including references).
Note: User-interface language (menus, dialogs, help system) ≠ data language (body text, spreadsheet cells).
Programs need to handle more languages as data than in the localized UI.
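As a small illustration of why formatting belongs in a globalization API, the same number renders differently per locale. This sketch uses the JDK's java.text.NumberFormat (ICU's NumberFormat is used in much the same way); the class and method names are illustrative.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Illustrative class name; not part of the original slides.
final class LocaleFormatting {
    /** Formats value using the number conventions of the given locale. */
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        System.out.println(formatNumber(value, Locale.US));       // 1,234,567.891
        System.out.println(formatNumber(value, Locale.GERMANY));  // 1.234.567,891
    }
}
```

Hand-written formatting code that hard-codes "," and "." would be wrong for one of these two users.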
Common Globalization Mistakes
Never compile Windows apps as "ANSI" (the default).
Don't simply concatenate strings to make messages.
Order of components differs by language; use Java MessageFormat, or structure the UI as separate fields.
Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet.
Allocate space flexibly: "OK" in English → "Aceptar" in Spanish.
English is a relatively compact language; others may require more characters (e.g., in database fields) and more screen real estate (in UIs).
Beware of discrepancies in "fallback" behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP
http://unicode.org/cldr
http://ibm.com/software/globalization/icu
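The advice about concatenation can be sketched with MessageFormat. The two patterns below are hypothetical: the second is an English stand-in for a translation whose word order differs, and in a real application each pattern would come from a per-locale resource file. The class and method names are illustrative.

```java
import java.text.MessageFormat;

// Illustrative class name and patterns; not part of the original slides.
final class MessageOrder {
    // Hypothetical pattern with the user first.
    static final String USER_FIRST = "{0} shared {1}.";
    // Hypothetical stand-in for a translation that needs the file first.
    static final String FILE_FIRST = "{1} was shared by {0}.";

    /** Fills the pattern; the translator, not the code, decides the order. */
    static String describe(String pattern, String user, String file) {
        return MessageFormat.format(pattern, user, file);
    }

    public static void main(String[] args) {
        System.out.println(describe(USER_FIRST, "Ana", "report.txt"));  // Ana shared report.txt.
        System.out.println(describe(FILE_FIRST, "Ana", "report.txt"));  // report.txt was shared by Ana.
    }
}
```

With concatenation (user + " shared " + file), the second ordering could not be produced without changing code.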
Neutral Formats
Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as close to the user as possible.
Type | Example | Rec. Standard
Language/Locale | en-US (en_US) | RFC 3066 bis, CLDR
Territory | AU | RFC 3066 bis
Currency | EUR | ISO 4217
Timezone | Australia/Melbourne | TZDB
Calendar | islamic-civil | CLDR Calendar ID
Custom Date | yyyy-MMM-dd | CLDR Pattern Format
Binary Time | 8C80E9E3967A4B0 | Windows File Time
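One way to apply this to timestamps is to store a zone-independent instant and attach the user's TZDB zone only at display time. The sketch below assumes an ISO 8601 UTC string as the neutral form, which is a choice made for this example rather than something the slide prescribes; the class and method names are illustrative.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Illustrative class name; not part of the original slides.
final class NeutralTime {
    /** Converts a stored UTC instant to wall-clock time in the user's TZDB zone. */
    static ZonedDateTime toUserTime(String isoInstant, String tzdbId) {
        return Instant.parse(isoInstant).atZone(ZoneId.of(tzdbId));
    }

    public static void main(String[] args) {
        String stored = "2007-07-01T20:00:00Z";  // neutral form kept in storage
        System.out.println(toUserTime(stored, "Australia/Melbourne").toLocalDateTime());  // 2007-07-02T06:00
        System.out.println(toUserTime(stored, "America/Los_Angeles").toLocalDateTime());  // 2007-07-01T13:00
    }
}
```

The stored value never changes; only the conversion step depends on the user.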
Identification
Locale IDs are extensions of language IDs; use CLDR: http://unicode.org/cldr
Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217).
<RUR, 1.23457×10³> ↔ 1 234,57р in Russian
but Rub 1,234.57 in English
Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names. http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value.
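Keeping the currency separate from the display locale might look like the following sketch, which uses the JDK's java.util.Currency with an explicit ISO 4217 code. The class and method names are illustrative, and the symbol shown for the currency depends on the JDK's locale data, so no exact output is claimed.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

// Illustrative class name; not part of the original slides.
final class Money {
    /** Currency formatter for displayLocale that uses the explicitly given ISO 4217 code. */
    static NumberFormat formatterFor(Locale displayLocale, String iso4217Code) {
        NumberFormat format = NumberFormat.getCurrencyInstance(displayLocale);
        // Override the currency implied by the locale with the stored one.
        format.setCurrency(Currency.getInstance(iso4217Code));
        return format;
    }

    public static void main(String[] args) {
        // Same amount and currency, formatted by the conventions of two display locales.
        System.out.println(formatterFor(Locale.US, "EUR").format(1234.57));
        System.out.println(formatterFor(Locale.GERMANY, "EUR").format(1234.57));
    }
}
```

The amount and the currency code travel together as data; the locale only controls how they are shown.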
Unicode Guide
Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other Resources
Unicode Site
http://unicode.org
An Overview of ICU
http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software
http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization
http://www.w3.org/International/
Microsoft Global Software Development
http://www.microsoft.com/globaldev/default.asp
Q&A
Backup Slides
Segmentation
What a user thinks of as a characters is often a sequence
Words are not just sequences of letters
Lines donrsquot just break at spaces
All may be language-dependent
httpwwwunicodeorgreportstr14 httpwwwunicodeorgreportstr29
TransliterationTransliteration Ελληνικά harr Ellēnikaacutene Translation Ελληνικά harr Greek
Transliteration may vary by language
Путин harr Putin Poutine
Горбачёв harr Gorbachev Gorbacev Gorbatchev Gorbačeumlv Gorbachov Gorbatsov Gorbatschow
Watch for terminology ldquolossyrdquo vs ldquolosslessrdquoLossy transliteration Ελληνικά rarr Ellinika rarr Ελλινικα
In ISO terms ldquotransliterationrdquo = lossless transliteration
ldquotranscriptionrdquo = lossy transliteration
httpunicodeorgdraftreportstr35tr35html
Rendering is Contextual
Glyphs may change shape
Multiple characters rarr 1 glyph
One character rarr multiple glyphs
Processing character-by-character gives the wrong results
Rendering IIGood rendering systems will handle customary type-in order for text plus canonical order
Excellent ones will do any canonically-equivalent order but those are rare
There may be differences in the customary glyphs for different languages specify the font or the language where they have to be distinguished
Security IssuesNever render a missing glyph as ldquo
Dont simply overlay diacritics it can cause security problems
httpwwwunicodeorgnotestn2
httpunicodeorgreportstr14
GlobalizationUnicode ne Globalization (aka Internationalization Localizability)
Unicode provides the basis for software globalization but theres more work to be done
Use globalization APIs Formatting and parsing of dates times numbers currencies comparison of text calendar systems are locale-dependent
Where OS facilities are not adequate or cross-platform solutions are needed use ICU (C C++ Java)
Dont put any translatable strings into your code separate into resource files
Provide context to translators is Mark a noun a verb or a namehellip
Donrsquot use the same string in different contexts unless the meaning is identical (including references)
Note User-Interface language (menus dialog help-system) neData language (body text spreadsheet cells)
Programs need to handle as data more languages than in localized UI
Common Globalization Mistakes
Never compile Windows apps as ldquoANSIrdquo (the default)
Dont simply concatenate strings to make messagesOrder of components differs by language use Java MessageFormat or structure UI as separate fields
Dont assume icons and symbols mean the same around the world Dont assume everyone can read the Latin alphabet
Allocate space flexibly ldquoOKrdquo in English rarr ldquoAceptarrdquo in Spanish
English is a relatively compact language others may require more characters (eg in database fields) and more screen real estate (in UIs)
Beware of discrepancies in ldquofallbackrdquo behaviorJava ResourceBundle (J2SE) Java Standard Tag Library (JSTL) Java Server Face (JSF) Apache HTTP
httpunicodeorgcldr
httpibmcomsoftwareglobalizationicu
Neutral FormatsStore and transmit neutral-format data wherever possible Convert that data to the users preferred formats as close to the user as possible
Type Example Rec Standard
LanguageLocale en-US (en_US) RFC 3066 bis CLDR
Territory AU RFC 3066 bis
Currency EUR ISO 4217
Timezone AustraliaMelbourne TZDB
Calendar islamic-civil CLDR Calendar ID
Custom Date yyyy-mmm-dd CLDR Pattern Format
Binary Time 8C80E9E3967A4B0 Windows File Time
IdentificationLocale IDs are extensions of language IDs use CLDR httpunicodeorgcldr
Dont assume that everyone in country always uses that countryrsquos currency Always use an explicit currency ID (ISO 4217)
ltRUR 123457times10sup3gt harr 1 23457р in Russian
but Rub 123457 in English
Dont assume the timezone ID is implied by the users locale For the best timezone information use the TZ database use CLDR for timezone names httpwwwtwinsuncomtztz-linkhtm
If you heuristically compute territory IDs timezone IDs currency IDs etc (eg from browser settings) make sure the user can override that and pick an explicit value
Unicode Guide
Authoritative but lightweight
Introduction overview and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other ResourcesUnicode Site
httpunicodeorg
An Overview of ICUhttpicusourceforgenetdocspapersicu_overview_latestppt
Globalizing Softwarehttpicusourceforgenetdocspapersglobalizing_softwareppt
W3C Internationalizationhttpwwww3orgInternational
Microsoft Global Software Developmenthttpwwwmicrosoftcomglobaldevdefaultasp
QampA
Backup Slides
User Input
If you develop your own text editor use the OS APIs to handle IMEs (Input Method Engines) for Chinese Japanese Korean
If you are using type-ahead to get to a position in a list (eg typing Jo gets to the first element starting with those characters) allow arbitrary input This is often easiest with visible fields
If your password field can contain characters that require an IME a screen pop-up box may reveal the password to onlookers
Dotted and Dotless I
Uppercase Normal Lowercase Turkic Uppercase
larr
I + ˙
i 0049 0307
I 0069
İ
0049 0130
larr
ı
I
0131 0049
İrarr
0130 i + ˙ İ + ˙
I + ˙
0069 0307 0130 0307
0049 0307
JavaIn MessageFormat watch for words like cant since ASCII has syntactic meaning Use a real apostrophe (U+2019) where possible canrsquot
In Date and Calendar the months are numbered from 0 (February is month number 1) However weeks and days are numbered from 1
Java serialized text isnt UTF-8 though its close U+0000 and supplementary code points are encoded differently
Java globalization support is pretty outdated use ICU to supplement it
Java ResourceBundle (J2SE) Java Standard Tag Library (JSTL) Java Server Face (JSF) Apache HTTP server etc all provide some locale determination mechanism and facility but they all differ in details
JavaScript
Always encode characters above U+007F with escapes (\uXXXX)
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented
The JDK tool native2ascii can be used to convert the files to use escapes
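The escaping step itself is small. The following is a rough sketch of the idea, not a reimplementation of native2ascii: the class AsciiEscaper and its method are hypothetical, and the output uses uppercase hex digits by choice. It works on UTF-16 code units, so a supplementary character comes out as two escapes (a surrogate pair), which is the form the \uXXXX syntax expects.

```java
import java.util.Locale;

final class AsciiEscaper {
    // Replaces every UTF-16 code unit above U+007F with a backslash-u escape
    // and copies ASCII through unchanged.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char unit = source.charAt(i);
            if (unit <= 0x7F) {
                out.append(unit);
            } else {
                out.append(String.format(Locale.ROOT, "\\u%04X", (int) unit));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input is the word "cafe" with an acute accent on the last letter.
        System.out.println(escapeNonAscii("var s = 'caf\u00E9';"));
    }
}
```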
Transliteration
Transliteration: Ελληνικά ↔ Ellēniká
≠ Translation: Ελληνικά ↔ Greek
Transliteration may vary by language
Путин ↔ Putin, Poutine
Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow
Watch for terminology: “lossy” vs “lossless”
Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
In ISO terms, “transliteration” = lossless transliteration
“transcription” = lossy transliteration
http://unicode.org/draft/reports/tr35/tr35.html
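The lossy round trip in the example above can be reproduced with a toy mapping. This is only a sketch under stated assumptions: the two tables below are invented for this one word and are not the rules of ICU, CLDR, or any ISO standard. They show why the round trip fails: eta and iota both map to "i", and the accent on alpha is dropped, so the reverse mapping cannot recover the original letters.

```java
import java.util.HashMap;
import java.util.Map;

final class LossyTransliterationDemo {
    // Toy tables covering only the letters of the example word.
    private static final Map<Character, Character> TO_LATIN = new HashMap<>();
    private static final Map<Character, Character> TO_GREEK = new HashMap<>();

    static {
        TO_LATIN.put('\u0395', 'E'); // capital epsilon
        TO_LATIN.put('\u03BB', 'l'); // lambda
        TO_LATIN.put('\u03B7', 'i'); // eta   (collides with iota)
        TO_LATIN.put('\u03B9', 'i'); // iota
        TO_LATIN.put('\u03BD', 'n'); // nu
        TO_LATIN.put('\u03BA', 'k'); // kappa
        TO_LATIN.put('\u03AC', 'a'); // alpha with tonos (accent is lost)
        TO_LATIN.put('\u03B1', 'a'); // alpha

        TO_GREEK.put('E', '\u0395');
        TO_GREEK.put('l', '\u03BB');
        TO_GREEK.put('i', '\u03B9');
        TO_GREEK.put('n', '\u03BD');
        TO_GREEK.put('k', '\u03BA');
        TO_GREEK.put('a', '\u03B1');
    }

    // Maps each character through the table; unknown characters pass through.
    private static String map(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            out.append(table.getOrDefault(c, c).charValue());
        }
        return out.toString();
    }

    static String toLatin(String greek) { return map(greek, TO_LATIN); }
    static String toGreek(String latin) { return map(latin, TO_GREEK); }

    public static void main(String[] args) {
        String original = "\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC";
        String latin = toLatin(original);
        String back = toGreek(latin);
        System.out.println(latin);
        System.out.println(back.equals(original)); // false: the mapping is lossy
    }
}
```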
Rendering is Contextual
Glyphs may change shape
Multiple characters → 1 glyph
One character → multiple glyphs
Processing character-by-character gives the wrong results
Rendering II
Good rendering systems will handle customary type-in order for text, plus canonical order
Excellent ones will do any canonically-equivalent order, but those are rare
There may be differences in the customary glyphs for different languages: specify the font or the language where they have to be distinguished
Security Issues
Never render a missing glyph as “
Don't simply overlay diacritics: it can cause security problems
http://www.unicode.org/notes/tn2
http://unicode.org/reports/tr14
Globalization
Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done
Use globalization APIs: formatting and parsing of dates, times, numbers, and currencies, comparison of text, and calendar systems are all locale-dependent
Where OS facilities are not adequate, or cross-platform solutions are needed, use ICU (C, C++, Java)
Don't put any translatable strings into your code: separate them into resource files
Provide context to translators: is "Mark" a noun, a verb, or a name…
Don't use the same string in different contexts unless the meaning is identical (including references)
Note: User-Interface language (menus, dialog, help-system) ≠ Data language (body text, spreadsheet cells)
Programs need to handle more languages as data than appear in the localized UI
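As a small illustration of why formatting and comparison belong in globalization APIs, the sketch below uses the JDK's NumberFormat and Collator. The class name LocaleFormattingDemo and the sample values are made up for the example; ICU offers equivalent classes with more complete data.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleFormattingDemo {
    // Formats a number with the grouping and decimal separators of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Compares two strings with the collation rules of the locale.
    static int compare(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234.5, Locale.US));      // 1,234.5
        System.out.println(formatNumber(1234.5, Locale.GERMANY)); // 1.234,5
        // Code-unit order puts "B" before "a"; linguistic order does not.
        System.out.println("a".compareTo("B") > 0);
        System.out.println(compare("a", "B", Locale.US) < 0);
    }
}
```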
Common Globalization Mistakes
Never compile Windows apps as “ANSI” (the default)
Don't simply concatenate strings to make messages
Order of components differs by language: use Java MessageFormat, or structure the UI as separate fields
Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet
Allocate space flexibly: “OK” in English → “Aceptar” in Spanish
English is a relatively compact language: others may require more characters (e.g., in database fields) and more screen real estate (in UIs)
Beware of discrepancies in “fallback” behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP
http://unicode.org/cldr
http://ibm.com/software/globalization/icu
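The point about concatenation can be sketched with MessageFormat placeholders. The two patterns below are hypothetical stand-ins for strings that would normally come from per-locale resource files; both are written in English here only so that the reordering is easy to see. With concatenation, the order "name, then file" would be fixed in code; with a pattern, the translator controls it.

```java
import java.text.MessageFormat;
import java.util.Locale;

final class MessageOrderDemo {
    // Hypothetical patterns; a translator may place the arguments in any order.
    static final String SUBJECT_FIRST = "{0} shared {1} with you";
    static final String OBJECT_FIRST = "{1} was shared with you by {0}";

    // Formats the pattern for the given locale with the supplied arguments.
    static String render(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        System.out.println(render(SUBJECT_FIRST, Locale.ENGLISH, "Ana", "report.pdf"));
        System.out.println(render(OBJECT_FIRST, Locale.ENGLISH, "Ana", "report.pdf"));
    }
}
```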
Neutral Formats
Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as close to the user as possible
Type               Example                Rec. Standard
Language/Locale    en-US (en_US)          RFC 3066 bis, CLDR
Territory          AU                     RFC 3066 bis
Currency           EUR                    ISO 4217
Timezone           Australia/Melbourne    TZDB
Calendar           islamic-civil          CLDR Calendar ID
Custom Date        yyyy-mmm-dd            CLDR Pattern Format
Binary Time        8C80E9E3967A4B0        Windows File Time
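Several of the identifiers in the table map directly onto JDK types. The sketch below is one possible way to carry neutral values and convert them only at the edge: the class NeutralFormatsDemo is hypothetical, and the ISO 8601 instant string is an assumption added for the example (the table itself lists a binary time instead).

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Currency;
import java.util.Locale;

final class NeutralFormatsDemo {
    // Parses a language tag such as "en-US" into a Locale.
    static Locale locale(String languageTag) {
        return Locale.forLanguageTag(languageTag);
    }

    // Looks up a currency by its ISO 4217 code.
    static Currency currency(String iso4217Code) {
        return Currency.getInstance(iso4217Code);
    }

    // Converts a neutral UTC instant to wall-clock time in a TZDB zone.
    static ZonedDateTime localize(String isoInstant, String tzdbZoneId) {
        return Instant.parse(isoInstant).atZone(ZoneId.of(tzdbZoneId));
    }

    public static void main(String[] args) {
        System.out.println(locale("en-US"));                    // en_US
        System.out.println(currency("EUR").getCurrencyCode());  // EUR
        // Stored neutrally in UTC, shown in the user's zone only at display time.
        System.out.println(localize("2007-01-15T00:00:00Z", "Australia/Melbourne"));
    }
}
```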
Identification
Locale IDs are extensions of language IDs: use CLDR (http://unicode.org/cldr)
Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217)
<RUR, 1.23457×10³> ↔ 1 234,57р. in Russian
but Rub 1,234.57 in English
Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names (http://www.twinsun.com/tz/tz-link.htm)
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value
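A minimal sketch of the last two points, assuming a hypothetical stored user preference: the explicit choice always wins over the value guessed from the locale, and the currency is passed to the formatter explicitly instead of being implied by the display locale. The class ExplicitCurrencyDemo and its method names are invented for illustration, and double is used only to keep the example short.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

final class ExplicitCurrencyDemo {
    // Prefers the user's explicit ISO 4217 choice; falls back to a guess
    // derived from the locale's country when no choice was made (null).
    // Note: Currency.getInstance(Locale) needs a locale that has a country.
    static Currency resolve(String userChoiceOrNull, Locale guessedLocale) {
        if (userChoiceOrNull != null) {
            return Currency.getInstance(userChoiceOrNull);
        }
        return Currency.getInstance(guessedLocale);
    }

    // Formats with the display locale's conventions but an explicit currency.
    static String format(double amount, Currency currency, Locale displayLocale) {
        NumberFormat formatter = NumberFormat.getCurrencyInstance(displayLocale);
        formatter.setCurrency(currency);
        return formatter.format(amount);
    }

    public static void main(String[] args) {
        Currency guessed = resolve(null, Locale.US);
        Currency chosen = resolve("EUR", Locale.US);
        System.out.println(guessed.getCurrencyCode());
        System.out.println(chosen.getCurrencyCode());
        // Same amount and display locale, different explicit currency.
        System.out.println(format(1234.57, chosen, Locale.US));
    }
}
```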
Unicode Guide
Authoritative but lightweight
Introduction overview and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
Other ResourcesUnicode Site
httpunicodeorg
An Overview of ICUhttpicusourceforgenetdocspapersicu_overview_latestppt
Globalizing Softwarehttpicusourceforgenetdocspapersglobalizing_softwareppt
W3C Internationalizationhttpwwww3orgInternational
Microsoft Global Software Developmenthttpwwwmicrosoftcomglobaldevdefaultasp
QampA
Backup Slides
User Input
If you develop your own text editor use the OS APIs to handle IMEs (Input Method Engines) for Chinese Japanese Korean
If you are using type-ahead to get to a position in a list (eg typing Jo gets to the first element starting with those characters) allow arbitrary input This is often easiest with visible fields
If your password field can contain characters that require an IME a screen pop-up box may reveal the password to onlookers
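Dotted and Dotless I
Uppercase ← Normal → Lowercase, with Turkic uppercase where it differs
← i (U+0069): uppercase I (U+0049); Turkic uppercase İ (U+0130), canonically equivalent to I + ˙ (U+0049 U+0307)
← ı (U+0131): uppercase I (U+0049)
İ (U+0130) →: lowercase i + ˙ (U+0069 U+0307), which uppercases to I + ˙ (U+0049 U+0307), or to İ + ˙ (U+0130 U+0307) with Turkic uppercase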
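The mappings above can be reproduced with the locale-sensitive case methods on java.lang.String. This is a small sketch under assumptions: the class name TurkicCasing and the hex helper are invented for illustration, and Locale.ROOT stands in for the "normal" (non-Turkic) behavior.

```java
import java.util.Locale;
import java.util.stream.Collectors;

public class TurkicCasing {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Hypothetical helper: renders a string as space-separated 4-digit hex
    // code points, matching the notation used on the slide.
    static String hex(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format(Locale.ROOT, "%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Dotted lowercase i: plain I normally, dotted capital in Turkish.
        System.out.println(hex("i".toUpperCase(Locale.ROOT))); // 0049
        System.out.println(hex("i".toUpperCase(TURKISH)));     // 0130

        // Dotless lowercase i uppercases to plain I.
        System.out.println(hex("\u0131".toUpperCase(Locale.ROOT))); // 0049

        // Plain capital I: dotted i normally, dotless i in Turkish.
        System.out.println(hex("I".toLowerCase(Locale.ROOT))); // 0069
        System.out.println(hex("I".toLowerCase(TURKISH)));     // 0131

        // Dotted capital I: keeps its dot as a combining mark outside Turkish.
        System.out.println(hex("\u0130".toLowerCase(Locale.ROOT))); // 0069 0307
        System.out.println(hex("\u0130".toLowerCase(TURKISH)));     // 0069

        // Round trip: uppercasing i + combining dot above.
        System.out.println(hex("i\u0307".toUpperCase(Locale.ROOT))); // 0049 0307
        System.out.println(hex("i\u0307".toUpperCase(TURKISH)));     // 0130 0307
    }
}
```

The practical consequence is that case conversion on identifiers, keys, or protocol tokens should name an explicit locale such as Locale.ROOT rather than rely on the user's default.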