Globalization Gotchas

Mark Davis

Unicode Basics

Unicode encodes characters, not glyphs:

U+0067 → g g g g g g g g g g g g g

Unicode does not encode characters by language. French, German, and English "j" have the same code point, even though all have different pronunciations.

Chinese 大 (da) has the same code point as Japanese 大 (dai).

UTF-8, UTF-16, and UTF-32 are all Unicode.

The word "character" means different things to different people; make clear which one you mean:

glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters), …
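The different meanings of "character" can be made concrete by counting the same text in several units. This is a minimal sketch; the sample strings are illustrative choices, not taken from the slides.

```python
# One user-perceived character, counted in three different units.
text = "e\u0301"  # "é" written as e + U+0301 COMBINING ACUTE ACCENT

code_points = len(text)                           # a Python str is a sequence of code points
utf8_bytes = len(text.encode("utf-8"))            # bytes in UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit code units in UTF-16

# A supplementary character (above U+FFFF): one code point, two UTF-16 code units.
emoji = "\U0001F600"
emoji_counts = (
    len(emoji),
    len(emoji.encode("utf-8")),
    len(emoji.encode("utf-16-le")) // 2,
)

print(code_points, utf8_bytes, utf16_units, emoji_counts)
```

None of these counts equals the number of user-perceived characters (one in each case), which is why an API should say which unit it uses.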

Unicode in APIs

Code points run from U+0000 to U+10FFFF. Be prepared to handle (or at least not corrupt) any incoming code points.

A back-level system may get code points that are unassigned in its version of Unicode but assigned in later versions.

Watch for UCS-2 implementations. They use UTF-16 text but don't support characters above U+FFFF; they also may accidentally cause isolated surrogates.

Some APIs and protocols count lengths in code points, and others in bytes (or other code units).

Make sure you don't mix them up.

Don't limit API parameters to a single character (and definitely not to a single code unit).

What users think of as a single character (e.g. x ch) may be a sequence in Unicode.

Use the latest version of Unicode: it supports new characters, corrections, and more stability guarantees.
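The isolated-surrogate problem mentioned above can be sketched by looking at raw UTF-16 code units. The helper names `utf16_units` and `has_isolated_surrogate` are hypothetical, and the truncation step only simulates the kind of bug a UCS-2-style implementation might introduce.

```python
import struct

def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def has_isolated_surrogate(units: list) -> bool:
    """True if a high surrogate lacks a following low one, or a low one stands alone."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high surrogate: must be followed by a low one
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:  # low surrogate with no preceding high one
            return True
        i += 1
    return False

units = utf16_units("a\U0001F600")  # [0x61, 0xD83D, 0xDE00]
truncated = units[:2]               # cutting by code-unit count splits the pair

print(has_isolated_surrogate(units), has_isolated_surrogate(truncated))
```

Truncating or indexing by code unit is the usual way such isolated surrogates arise; cutting on code point boundaries avoids it.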

Choice of Characters

Character and block names may be misleading, e.g.

U+034F COMBINING GRAPHEME JOINER doesn't join graphemes. http://www.unicode.org/faq

Use U+2060 (word joiner) instead of U+FEFF (zero-width no-break space) for everything but the BOM function.

Never use unassigned code points; those will be used in future versions of Unicode.

Use only private use (PUA) code points or noncharacters (and only if necessary).

If you do, minimize the opportunity for collision by picking an unusual range.

Character Conversion

Always use shortest-form UTF-8. It's the Law.

And if that isn't enough, consider security attacks.

If a protocol allows a choice of charsets, always tag correctly.

Not all text is correctly tagged; character detection may be necessary. But remember: it's always a guess.

Converting a database of mixed, untagged data is extremely painful.

Bad assumptions:

Length [bytes] = N × length [code points]

1 character [charset X] = 1 character [Unicode]. The ordering may also be different.
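As a small illustration of the shortest-form rule, Python's built-in strict UTF-8 decoder rejects an overlong encoding. The helper name `decode_strict_utf8` is hypothetical; the byte values are a standard example of a non-shortest form.

```python
def decode_strict_utf8(data: bytes):
    """Decode UTF-8 strictly; return None instead of raising on ill-formed input."""
    try:
        return data.decode("utf-8")  # the default error mode is "strict"
    except UnicodeDecodeError:
        return None

slash = b"/"                   # U+002F in shortest form (one byte)
overlong_slash = b"\xc0\xaf"   # the same code point in a non-shortest two-byte form

print(decode_strict_utf8(slash), decode_strict_utf8(overlong_slash))

# The "length in bytes = N x length in code points" assumption also fails:
word = "na\u00efve"            # "naïve"
lengths = (len(word), len(word.encode("latin-1")), len(word.encode("utf-8")))
```

A decoder that accepted the overlong form would let "/" slip past a filter that only looks for the byte 0x2F, which is the kind of attack the slide alludes to.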

Character Conversion II

IANA/MIME charset names are ill-defined: vendors often convert the same charset in different ways.

Shift-JIS 0x5C → U+005C (\) or U+00A5 (¥)

Don't simply omit unconvertible data. To reduce security problems, at least substitute

U+FFFD (when converting to Unicode) or

0x1A (when converting to bytes)

http://www.w3.org/TR/japanese-xml

http://icu.sourceforge.net/charts/charset
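The substitution advice can be sketched with Python's codec error handlers. Decoding with `errors="replace"` inserts U+FFFD. When encoding, Python's built-in `"replace"` handler substitutes "?" rather than 0x1A, so the helper `encode_with_sub` below is a hypothetical stand-in that only suits stateless charsets such as ASCII or ISO-8859-1.

```python
raw = b"caf\xe9"  # "café" in ISO-8859-1, wrongly treated as UTF-8

dropped = raw.decode("utf-8", errors="ignore")       # silently omits the bad byte
substituted = raw.decode("utf-8", errors="replace")  # marks it with U+FFFD

def encode_with_sub(text: str, encoding: str, sub: bytes = b"\x1a") -> bytes:
    """Encode character by character, writing `sub` for unconvertible characters.

    Assumption: `encoding` is stateless, so per-character encoding is safe.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            out += sub
    return bytes(out)

print(repr(dropped), repr(substituted), encode_with_sub("a\u20acb", "ascii"))
```

With "ignore", the damaged text looks like a valid shorter string; with a substitution character, the loss stays visible to later processing.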

Properties

Use properties such as Alphabetic, not hard-coded lists:

isAlphabetic(x), or regex \p{Alphabetic} or [:Alphabetic:]. Not ("A" ≤ x ≤ "Z" OR "a" ≤ x ≤ "z").

Some properties aren't what you think. Use:

White_Space, not General_Category=Zs

Alphabetic, not General_Category=L

Lowercase, not General_Category=Ll

Script=Greek, not Block=Greek

Characters may change property values between versions of Unicode.

http://unicode.org/standard/stability_policy.html
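A short sketch of why hard-coded ranges and General_Category shortcuts fall short. Python's standard library does not expose the Alphabetic property directly: `str.isalpha()` is documented as a General_Category test (Lm, Lt, Lu, Ll, Lo), so here it plays the role of the "General_Category=L" shortcut the slide warns about. The helper `is_ascii_letter` is hypothetical.

```python
import unicodedata

def is_ascii_letter(ch: str) -> bool:
    """The hard-coded list the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

# A hard-coded range misses letters outside ASCII.
e_acute = "\u00e9"
range_vs_property = (is_ascii_letter(e_acute), e_acute.isalpha())

# General_Category=L misses U+2170 SMALL ROMAN NUMERAL ONE, whose category is Nl
# even though the Unicode Character Database lists it as Alphabetic.
roman_one = "\u2170"
roman_category = unicodedata.category(roman_one)
roman_isalpha = roman_one.isalpha()

# White_Space versus General_Category=Zs: TAB is white space but has category Cc.
tab_category = unicodedata.category("\t")
tab_isspace = "\t".isspace()

print(range_vs_property, roman_category, roman_isalpha, tab_category, tab_isspace)
```

For the real Alphabetic, White_Space, or Script properties, a library that exposes the Unicode Character Database properties directly (such as ICU, which the slides recommend) is the safer choice.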

Identifiers & Tokens

When designing syntax, use as a base:

Pattern_Syntax for operators, relations

Pattern_White_Space for gaps

XID_Start and XID_Continue for identifiers

All are backwards compatible across versions.

Profiles may expand or narrow from the base.

Watch out for security attacks: "paypal.com" with a Cyrillic "а" (U+0430).

See Unicode Security at this conference.
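The spoofing example can be illustrated with a crude mixed-script check. The function `scripts_by_name` is a hypothetical heuristic: it reads the first word of each letter's Unicode character name instead of the real Script property, so it is only a sketch of the idea. Python's `str.isidentifier()` is included because Python's identifier syntax is itself based on XID_Start and XID_Continue.

```python
import unicodedata

def scripts_by_name(label: str) -> set:
    """Rough script guess per letter, taken from the first word of its Unicode name.

    Assumption: a heuristic for illustration only, not the Script property.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A

print(genuine == spoofed, scripts_by_name(genuine), scripts_by_name(spoofed))

# Identifier syntax built on XID_Start / XID_Continue accepts non-ASCII letters.
identifier_checks = ("gr\u00f6\u00dfe".isidentifier(), "2x".isidentifier(), "a-b".isidentifier())
```

The two labels look alike on screen but compare unequal, and only the spoofed one mixes scripts.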

Comparison (Collation)

Searching, Sorting, Matching

There are two binary orders:

code point order = UTF-8 order = UTF-32 order

≠ UTF-16 order

Don't present users with binary order.

No users expect A < Z < a < z < Ç < ä.

Apply normalization to get a unique form, so Å = Å.

Security Issues

Protocols must precisely define the comparison operations.

E.g. LDAP doesn't, so a lookup may fail (or falsely succeed).

Aside from wrong results, this is an opening for security attacks.
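The two binary orders can be seen by sorting one supplementary character against one BMP character above the surrogate range. The sample code points are illustrative choices; the last lines show normalization making two spellings of Å compare equal.

```python
import unicodedata

strings = ["\U00010000", "\uff5e"]  # U+10000 and U+FF5E

by_code_point = sorted(strings)  # Python compares str by code point
by_utf8 = sorted(strings, key=lambda s: s.encode("utf-8"))
by_utf32 = sorted(strings, key=lambda s: s.encode("utf-32-be"))
by_utf16 = sorted(strings, key=lambda s: s.encode("utf-16-be"))

# U+10000 is D800 DC00 in UTF-16, which sorts below FF5E,
# so the UTF-16 binary order is the odd one out.
print(by_code_point == by_utf8 == by_utf32, by_code_point == by_utf16)

# Normalization gives a unique form: A + U+030A becomes U+00C5.
composed = "\u00c5"
decomposed = "A\u030a"
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
```

Two systems that each use "binary order" can therefore disagree if one stores UTF-16 and the other UTF-8.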

Language-Sensitive Comparison

Use UCA order as a base to meet user expectations: a < A < ä < Ç = C + U+0327 < z < Z

Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language:

china < China < chinas < danish

ae < æ < af

z < æ (Danish)

c < d < h < ch < i (Slovak)

Follow UCA for substring match offsets; there are some gotchas here.

Don't mix up stable and deterministic sorting: they are very different.

http://unicode.org/reports/tr10 http://unicode.org/cldr
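The difference between stable and deterministic sorting can be sketched with a toy comparison key. Assumption: `str.casefold()` stands in for a real collation key here, and it is not UCA. The records and key functions are hypothetical.

```python
records = [("b", 1), ("A", 2), ("a", 3), ("B", 4)]

def collation_key(record):
    """Toy key under which "a" and "A" compare as equal (stand-in for a collation key)."""
    return record[0].casefold()

def deterministic_key(record):
    """Same key, with ties broken by the code point order of the original string."""
    return (record[0].casefold(), record[0])

# Stable: records with equal keys keep their input order, so the result
# depends on how the input happened to be arranged.
stable = sorted(records, key=collation_key)
stable_from_reversed = sorted(reversed(records), key=collation_key)

# Deterministic: unequal strings never compare as equal, so the order of the
# strings is the same for every input arrangement.
deterministic = sorted(records, key=deterministic_key)
deterministic_from_reversed = sorted(reversed(records), key=deterministic_key)

print(stable, stable_from_reversed, deterministic, sep="\n")
```

Stability is a property of the sorting algorithm, while determinism is a property of the comparison, which is why one does not imply the other.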

Normalization (NFC, …)

Standardized normalization forms are defined by Unicode.

The ordering of accents in a normalization form may not be the typical type-in order.

Fonts should handle both orders.

Normalization is context dependent.

Don't assume NFC(x + y) = NFC(x) + NFC(y).

People assume that NFC always composes, but some characters decompose in NFC.

Trivia: in Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ ϔ ẛ
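These normalization gotchas can be checked with Python's `unicodedata` module. Assumption: the module reflects the Unicode version bundled with the running Python, not Unicode 4.1. The characters used below come from the slides.

```python
import unicodedata

def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)

# Normalization does not distribute over concatenation.
x, y = "a", "\u0301"                      # a, COMBINING ACUTE ACCENT
concat_then_normalize = nfc(x + y)        # composes to U+00E1
normalize_then_concat = nfc(x) + nfc(y)   # stays as two code points

# NFC does not always compose: U+FB2C is excluded from composition,
# so NFC leaves it fully decomposed.
fb2c_in_nfc = nfc("\ufb2c")

# The trivia characters: count the distinct results over the four forms.
forms = ("NFC", "NFD", "NFKC", "NFKD")
distinct_forms = {
    ch: len({unicodedata.normalize(form, ch) for form in forms})
    for ch in "\u03d3\u03d4\u1e9b"
}

print(concat_then_normalize == normalize_then_concat, len(fb2c_in_nfc), distinct_forms)
```

In practice this means a buffer assembled from individually normalized pieces has to be normalized again, at least around the joins.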

Maximum Expansion (U4.1)

Operation    UTF     Factor  Sample
NFC          8       3X      𝅘𝅥𝅮 U+1D160
NFC          16, 32  3X      שּׁ U+FB2C
NFD          8       3X      ΐ U+0390
NFD          16, 32  4X      ᾂ U+1F82
NFKC, NFKD   8       11X     ﷺ U+FDFA
NFKC, NFKD   16, 32  18X     ﷺ U+FDFA
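The factors in the table can be recomputed for individual sample characters. The function `expansion` is a hypothetical helper, and the results come from the Unicode data bundled with the running Python, not Unicode 4.1. It checks the listed samples only and does not search for the maximum.

```python
import unicodedata

def expansion(ch: str, form: str, encoding=None) -> float:
    """Growth factor of `ch` under normalization `form`.

    Measured in bytes of `encoding`, or in code points when encoding is None.
    """
    normalized = unicodedata.normalize(form, ch)
    if encoding is None:
        return len(normalized) / len(ch)
    return len(normalized.encode(encoding)) / len(ch.encode(encoding))

factors = {
    "NFC, UTF-8, U+1D160": expansion("\U0001d160", "NFC", "utf-8"),
    "NFC, UTF-16, U+FB2C": expansion("\ufb2c", "NFC", "utf-16-le"),
    "NFD, UTF-8, U+0390": expansion("\u0390", "NFD", "utf-8"),
    "NFD, code points, U+1F82": expansion("\u1f82", "NFD"),
    "NFKC, UTF-8, U+FDFA": expansion("\ufdfa", "NFKC", "utf-8"),
    "NFKC, code points, U+FDFA": expansion("\ufdfa", "NFKC"),
}

for label, factor in factors.items():
    print(label, factor)
```

Figures like these are what make worst-case buffer sizing possible when normalizing in place.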

Case Conversion

Not a simple 1:1 mapping:

Title case: ǳ ↔ Ǳ ↔ ǲ

Expansion: heiß → HEISS → heiss

Context-dependent: ΌΣΟΣ → όσος

Language-dependent: istanbul ↔ İSTANBUL

Warning: never use language-dependent casing for language-independent structures like file-system B-Trees.
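Most of these cases can be reproduced with Python's built-in string methods, which apply the full, language-independent Unicode case mappings. The Turkish behavior is the exception: Python's `str` methods have no locale parameter, so the sketch below only shows the default results.

```python
# Expansion: one code point becomes two, and the round trip does not restore it.
upper_eszett = "hei\u00df".upper()      # "heiß" -> "HEISS"
round_trip = upper_eszett.lower()       # "heiss"

# Context-dependent: a word-final capital sigma lowercases to final sigma (U+03C2).
final_sigma = "\u038c\u03a3\u039f\u03a3".lower()   # "ΌΣΟΣ"

# Title case is a third case, distinct from upper and lower.
dz_upper = "\u01f3".upper()   # ǳ -> Ǳ (U+01F1)
dz_title = "\u01f3".title()   # ǳ -> ǲ (U+01F2)

# Language-independent casing: "i" uppercases to "I", never to U+0130,
# and U+0130 lowercases to i + U+0307 COMBINING DOT ABOVE.
default_upper = "istanbul".upper()
dotted_capital_lower = "\u0130".lower()

print(upper_eszett, round_trip, final_sigma, dz_upper, dz_title, default_upper)
```

Because the default mappings never depend on the user's language, they are the ones suited to language-independent structures such as the B-Trees mentioned in the warning.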

Casing: Maximum Expansion

Operation           UTF        Factor  Sample
Lower               8          1.5X    Ⱥ U+023A
Lower               16, 32     1X      A U+0041
Upper, Title, Fold  8, 16, 32  3X      ΐ U+0390

Case Conversion II

Case folding was not stable:

toCaseFold(S) could give different results between two versions.

Stability is now guaranteed in Unicode 5.0.

Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category.

These were constrained to be in a partition.

Use the separate binary properties Lowercase and Uppercase instead.

Lowercase, Uppercase: Form vs Function

Lowercase, the binary property:

The character is lowercase in form, but not necessarily in function.

Functionally lowercase:

isCased(x) & isLowercase(x)

See Section 3.13 of TUS.
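The form-versus-function distinction can be sketched using Python's case mappings. This is a simplification: the definitions in the Unicode Standard are stated over normalized strings, while the hypothetical helpers below take a single code point and rely on the case mappings bundled with the running Python.

```python
def is_cased(ch: str) -> bool:
    """True if any of the lower, upper, or title mappings changes the character."""
    return ch.lower() != ch or ch.upper() != ch or ch.title() != ch

def is_lowercase(ch: str) -> bool:
    """True if lowercasing leaves the character unchanged."""
    return ch.lower() == ch

def is_functionally_lowercase(ch: str) -> bool:
    return is_cased(ch) and is_lowercase(ch)

samples = {
    "a": is_functionally_lowercase("a"),            # LATIN SMALL LETTER A
    "A": is_functionally_lowercase("A"),            # uppercase, so not lowercase
    "\u00aa": is_functionally_lowercase("\u00aa"),  # FEMININE ORDINAL INDICATOR: no case mappings
    "\u2170": is_functionally_lowercase("\u2170"),  # SMALL ROMAN NUMERAL ONE: has an uppercase
}

print(samples)
```

U+00AA looks lowercase but takes no part in case conversion, while U+2170 is not a letter by General_Category yet behaves as lowercase.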

Lowercase: Form vs Function

LC  F LC  Ll  Count  Example (U4.1)
Y   N     N   114    ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y   N     Y   705    ª U+00AA FEMININE ORDINAL INDICATOR
Y   Y     N   43     ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y   Y     Y   903    a U+0061 LATIN SMALL LETTER A

Segmentation

What a user thinks of as a character is often a sequence.

Words are not just sequences of letters.

Lines don't just break at spaces.

All may be language-dependent.

http://www.unicode.org/reports/tr14 http://www.unicode.org/reports/tr29

Transliteration

Transliteration (Ελληνικά ↔ Ellēniká) ≠ Translation (Ελληνικά ↔ Greek)

Transliteration may vary by language:

Путин ↔ Putin, Poutine

Горбачёв ↔ Gorbachev, Gorbacev, Gorbatchev, Gorbačëv, Gorbachov, Gorbatsov, Gorbatschow

Watch for terminology: "lossy" vs "lossless". Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα

In ISO terms, "transliteration" = lossless transliteration

"transcription" = lossy transliteration

http://unicode.org/draft/reports/tr35/tr35.html

Rendering is Contextual

Glyphs may change shape

Multiple characters → 1 glyph

One character → multiple glyphs

Processing character-by-character gives the wrong results

Rendering II

Good rendering systems will handle the customary type-in order for text, plus canonical order.

Excellent ones will handle any canonically equivalent order, but those are rare.

There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished.

Security Issues

Never render a missing glyph as “

Don't simply overlay diacritics; it can cause security problems.

http://www.unicode.org/notes/tn2

http://unicode.org/reports/tr14

Globalization

Unicode ≠ Globalization (aka Internationalization, Localizability)

Unicode provides the basis for software globalization, but there's more work to be done.

Use globalization APIs. Formatting and parsing of dates, times, numbers, and currencies, comparison of text, and calendar systems are all locale-dependent.

Where OS facilities are not adequate, or cross-platform solutions are needed, use ICU (C, C++, Java).

Don't put any translatable strings into your code; separate them into resource files.

Provide context to translators: is "Mark" a noun, a verb, or a name…

Don't use the same string in different contexts unless the meaning is identical (including references).

Note: User-interface language (menus, dialogs, help system) ≠ Data language (body text, spreadsheet cells)

Programs need to handle more languages as data than in the localized UI.

Common Globalization Mistakes

Never compile Windows apps as "ANSI" (the default).

Don't simply concatenate strings to make messages. The order of components differs by language: use Java MessageFormat, or structure the UI as separate fields.

Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet.

Allocate space flexibly: "OK" in English → "Aceptar" in Spanish.

English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs).

Beware of discrepancies in "fallback" behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc.

http://unicode.org/cldr

http://ibm.com/software/globalization/icu
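The point about concatenation can be sketched with named placeholders, which let each translation reorder the components. This uses Python's `str.format` as a stand-in for the Java MessageFormat named in the slide. The template table and the function `page_label` are hypothetical, and the numbers themselves are not localized here.

```python
# Hypothetical resource table: one complete template per locale,
# never fragments glued together in code.
TEMPLATES = {
    "en": "Page {page} of {total}",
    "ja": "{total}\u30da\u30fc\u30b8\u4e2d{page}\u30da\u30fc\u30b8",  # total comes first
}

def page_label(locale: str, page: int, total: int) -> str:
    """Fill the locale's template; the placeholder order is up to the translator."""
    return TEMPLATES[locale].format(page=page, total=total)

print(page_label("en", 2, 10))
print(page_label("ja", 2, 10))
```

Code that built the English message as "Page " + page + " of " + total would force the English word order on every translation.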

Neutral Formats

Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as close to the user as possible.

Type             Example              Rec. Standard
Language/Locale  en-US (en_US)        RFC 3066 bis, CLDR
Territory        AU                   RFC 3066 bis
Currency         EUR                  ISO 4217
Timezone         Australia/Melbourne  TZDB
Calendar         islamic-civil        CLDR Calendar ID
Custom Date      yyyy-MMM-dd          CLDR Pattern Format
Binary Time      8C80E9E3967A4B0      Windows File Time
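As one illustration of a neutral binary time, a Windows file time counts 100-nanosecond ticks since 1601-01-01 UTC. The helpers `filetime_to_utc` and `to_neutral_text` are hypothetical names for a minimal sketch that converts such a value to an aware UTC datetime and serializes it as ISO 8601, leaving any locale-specific formatting to the layer nearest the user.

```python
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_utc(ticks: int) -> datetime:
    """Convert 100-nanosecond ticks since 1601-01-01 UTC to an aware UTC datetime.

    Sub-microsecond precision is dropped, since datetime stores microseconds.
    """
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

def to_neutral_text(moment: datetime) -> str:
    """Serialize an aware datetime in UTC as ISO 8601 for storage or transmission."""
    return moment.astimezone(timezone.utc).isoformat()

# 116444736000000000 ticks is the well-known offset of the Unix epoch.
unix_epoch = filetime_to_utc(116444736000000000)
print(to_neutral_text(unix_epoch))
```

Keeping the stored value in UTC and attaching the user's time zone ID separately avoids baking one locale's conventions into the data.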

Identification

Locale IDs are extensions of language IDs; use CLDR: http://unicode.org/cldr

Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217):

<RUR, 1.23457×10³> ↔ 1 234,57р in Russian

but Rub 1,234.57 in English

Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names. http://www.twinsun.com/tz/tz-link.htm

If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g. from browser settings), make sure the user can override that and pick an explicit value.

Unicode Guide

Authoritative, but lightweight

Introduction, overview, and quick reference

Main principles of the Unicode Standard

Best practices in Software Globalization

Other Resources

Unicode Site: http://unicode.org

An Overview of ICU: http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt

Globalizing Software: http://icu.sourceforge.net/docs/papers/globalizing_software.ppt

W3C Internationalization: http://www.w3.org/International

Microsoft Global Software Development: http://www.microsoft.com/globaldev/default.asp

Q&A

Backup Slides

User Input

If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Engines) for Chinese, Japanese, Korean, …

If you are using type-ahead to get to a position in a list (e.g. typing "Jo" gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields.

If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers.

Dotted and Dotless I

Lowercase          Normal Uppercase   Turkic Uppercase
i (0069)           I (0049)           İ (0130)
ı (0131)           I (0049)           I (0049)
i + ˙ (0069 0307)  I + ˙ (0049 0307)  İ + ˙ (0130 0307)

İ (0130) → i + ˙ (0069 0307) in normal lowercase

Java

In MessageFormat, watch for words like "can't", since the ASCII apostrophe has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t

In Date and Calendar, the months are numbered from 0 (February is month number 1). However, weeks and days are numbered from 1.

Java serialized text isn't UTF-8, though it's close: U+0000 and supplementary code points are encoded differently.

Java globalization support is pretty outdated; use ICU to supplement it.

Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details.
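The difference between Java's serialized text format (often called modified UTF-8) and standard UTF-8 can be sketched in Python. The function `java_modified_utf8` is a hypothetical helper covering only the byte encoding of the characters; it does not add the length prefix that Java's serialization methods write.

```python
def java_modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 encodes characters.

    Differences from standard UTF-8: U+0000 becomes C0 80, and each
    supplementary code point becomes two 3-byte encoded surrogates.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point == 0:
            out += b"\xc0\x80"
        elif code_point > 0xFFFF:
            offset = code_point - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets the UTF-8 codec encode a surrogate code point.
                out += chr(surrogate).encode("utf-8", errors="surrogatepass")
        else:
            out += ch.encode("utf-8", errors="surrogatepass")
    return bytes(out)

sample = "\U00010000"
print(java_modified_utf8(sample).hex(), sample.encode("utf-8").hex())
```

A strict UTF-8 decoder rejects both C0 80 and encoded surrogates, so such data should be converted before it leaves the Java environment.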

JavaScript

Always encode characters above U+007F with escapes (\uXXXX).

There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented.

The JDK tool native2ascii can be used to convert the files to use escapes.
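The escaping step can be sketched as a small filter that rewrites every code point above U+007F as \uXXXX escapes, one per UTF-16 code unit. The function `escape_non_ascii` is a hypothetical helper in the spirit of native2ascii, not a reimplementation of that tool.

```python
def escape_non_ascii(source: str) -> str:
    """Replace each code point above U+007F with \\uXXXX escapes.

    Supplementary characters become two escapes (a surrogate pair),
    since a \\uXXXX escape names a single UTF-16 code unit.
    """
    out = []
    for ch in source:
        if ord(ch) <= 0x7F:
            out.append(ch)
            continue
        data = ch.encode("utf-16-be")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "big")
            out.append("\\u%04x" % unit)
    return "".join(out)

print(escape_non_ascii('var s = "\u00e9";'))
print(escape_non_ascii("\U0001F600"))
```

The escaped source is pure ASCII, so it survives whatever charset the server or browser assumes for the script file.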

