
Software Internationalisation: Single-Byte Scripts

Guy Lacoursière, Software Globalisation Consultant

Agenda

Deliverables

Definitions

Scripts:

Latin scripts

Greek

Hebrew

Cumulative testing

Sorting (optional)

References

Deliverables: English Internationalized Products

We currently support Latin 1 and Asian character sets:

  ISO-8859-1: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Indonesian, Italian, Norwegian, Portuguese, Spanish, Swedish

  Multibyte character sets: Japanese, traditional Chinese, simplified Chinese, Korean

Newly supported character sets:

  ISO-8859-2: Albanian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian

  ISO-8859-7/8/9: Greek, Hebrew, Turkish

Complex languages are not supported: Thai, Indic languages, Arabic

Goal: Unicode






Definitions

Script: a system of characters composed of:

  Letters, syllables or ideographs (with one or more possible directions)

  Punctuation symbols

  Numbers ( 0 1 2 3 4 5 6 7 8 9 ¼ ½ ¾ )

  Other symbols ( ® $ # % & ± ° _ @ )

A language may be written in several scripts, and a script may serve several languages.

Character set (or code page, or coded character set): an ordered group of characters assigned to code points.

Encoding: a system defining the storage mechanism for a given character set.
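The distinction between a character set and an encoding can be illustrated with a minimal Python sketch. The choice of the character "é" and of the ISO 8859-1 and UTF-8 codecs is an assumption made for illustration; it does not come from the slides.

```python
# One character has one code point, but may be stored differently
# depending on the encoding.
ch = "\u00e9"                           # é, chosen only as an example
code_point = ord(ch)                    # its position in the coded character set
latin1_bytes = ch.encode("iso8859_1")   # single-byte storage: the byte equals the code point
utf8_bytes = ch.encode("utf_8")         # multi-byte storage of the same code point

print(hex(code_point), latin1_bytes, utf8_bytes)
```

In a single-byte character set the stored byte and the code point coincide, which is why the two notions are often used interchangeably.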

Single-Byte Character Sets

Each character is expressed as a single 8-bit value.

The character set does not exceed 256 code points.

The encoding is simply the order of the code points: each character is stored as the byte equal to its code point.

A given code point may have a different value (character) depending on the character set.

The first 128 code points (the ASCII range) are always the same.
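As a rough illustration of the last two points, the Python sketch below decodes the same byte under several ISO 8859 character sets and checks that the lower half is shared. The byte 0xF0 and the list of codecs are arbitrary choices for the example.

```python
# The same byte maps to a different character in each character set.
BYTE = b"\xf0"
CHARSETS = ["iso8859_1", "iso8859_2", "iso8859_7", "iso8859_8", "iso8859_9"]

decoded = {name: BYTE.decode(name) for name in CHARSETS}
for name, ch in decoded.items():
    print(f"{name}: 0xF0 -> {ch} (U+{ord(ch):04X})")

# The first 128 code points decode identically everywhere.
lower_half = bytes(range(128))
same_lower_half = all(
    lower_half.decode(name) == lower_half.decode("ascii") for name in CHARSETS
)
print("Lower half identical:", same_lower_half)
```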

Latin Scripts: Latin 1 Character Set (ISO 8859-1)

Languages covered: Afrikaans, Albanian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, Swahili, Swedish

Notes:

  Uppercase and lowercase letters have separate code points even though they are two forms of the same letter.

  Some letters have no uppercase.

  The base characters are the same for all Latin character sets.

Base characters:

  a b c d e f g h i j k l m n o p q r s t u v w x y z
  0 1 2 3 4 5 6 7 8 9
  ! " ' ( ) , . : ; ? [ ] ^ { | } ~
  # $ % & ÷ × + - * / = \ < > _

Extended characters:

  àÀ áÁ âÂ ãÃ äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË ðÐ íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ þÞ

Latin Scripts: ISO 8859-1 vs. Windows 1252

Microsoft Windows' Latin 1 character set (code page 1252) is different from ISO 8859-1.

It defines 27 extra characters in the range 0x80 to 0x9F, among others:

  The euro symbol (€)
  The English curly quotes ( “ ” )
  The ellipsis (…)
  The German opening quotes ( „ )
  The bullet ( • )
  The en dash (–)
  The em dash (—)
  The French uppercase and lowercase oe ligatures (œ Œ)
  The English trademark symbol (™)

These may not display correctly on non-Latin 1 systems.
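The difference can be seen by decoding the same bytes under both character sets. This is only a sketch: the sample bytes are picked from the list above, and `fits_latin1` is a hypothetical helper name.

```python
import unicodedata

# Bytes for the euro sign, ellipsis, curly quotes and trademark sign in code page 1252.
sample = bytes([0x80, 0x85, 0x93, 0x94, 0x99])

as_cp1252 = sample.decode("cp1252")      # printable characters
as_latin1 = sample.decode("iso8859_1")   # C1 control characters, not printable

print(as_cp1252)
print([unicodedata.category(ch) for ch in as_latin1])


def fits_latin1(text):
    """Return True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("iso8859_1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("\u20ac"), fits_latin1("\u00e9"))
```

A check like `fits_latin1` is one way to detect text that relies on the Windows-only characters before sending it to a strict ISO 8859-1 system.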

[Code charts: Latin 1 (ISO 8859-1) and Windows code page 1252]


Latin Scripts: Latin 2 Character Set (ISO 8859-2)

Languages covered: Czech, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, Sorbian

Notes:

  Some characters are duplicates from the Latin 1 character set.

  The caron diacritic has two forms: “ ˇ ” and “ ’ ”.

  The T with cedilla has a glyph variant (T with comma) for Romanian.

  Latin 2 characters common to Latin 1 use identical code points.

Extended characters:

  ąĄ áÁ âÂ ăĂ äÄ ćĆ çÇ čČ ďĎ éÉ ęĘ ëË ěĚ đĐ íÍ îÎ łŁ ľĽ ĺĹ ńŃ ňŇ óÓ ôÔ őŐ öÖ ŕŔ řŘ śŚ šŠ şŞ § ß ťŤ ţŢ ůŮ úÚ űŰ üÜ ýÝ źŹ žŽ żŻ

ISO 8859-1 vs. ISO 8859-2

[Code charts: Latin 1 (ISO 8859-1) and Latin 2 (ISO 8859-2)]

All common characters have the same code points.

Characters that are different belong to separate language families (mostly West European vs. East European).

This overlap allows a certain level of flexibility between languages.
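The claim that common characters share code points can be checked with a short Python sketch. The variable names are illustrative; only the upper half (0xA0 to 0xFF) is compared because the lower half is identical by construction.

```python
# Map each upper-half character to its byte value in both character sets.
latin1 = {bytes([b]).decode("iso8859_1"): b for b in range(0xA0, 0x100)}
latin2 = {bytes([b]).decode("iso8859_2"): b for b in range(0xA0, 0x100)}

common = sorted(set(latin1) & set(latin2))
moved = [ch for ch in common if latin1[ch] != latin2[ch]]

print(f"{len(common)} upper-half characters are shared")
print("Shared characters at a different code point:", moved)
```

An empty `moved` list means that text limited to the shared characters reads the same under either character set.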


Latin Scripts: Latin 3 Character Set (ISO 8859-3)

Languages covered: Esperanto, Maltese

Notes:

  Covered Turkish before the introduction of Latin 5 in 1988.

  Not supported.

Extended characters:

  àÀ áÁ âÂ äÄ ċĊ ĉĈ çÇ èÈ éÉ êÊ ëË ğĞ ħĦ ĥĤ ıI iİ ìÌ íÍ îÎ ïÏ ĵĴ ñÑ òÒ óÓ ôÔ öÖ şŞ ŝŜ § ß ùÙ úÚ ûÛ üÜ ŭŬ żŻ £ ¤


Latin Scripts: Latin 4 Character Set (ISO 8859-4)

Languages covered: Estonian, Latvian, Lithuanian, Greenlandic, Lappish

Notes:

  Not supported.

Extended characters:

  ąĄ āĀ áÁ âÂ ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ đĐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷


Latin Scripts: Latin 5 Character Set (ISO 8859-9)

Languages covered: Turkish

Notes:

  Very similar to Latin 1.

  The letters ð, ý and þ from Latin 1 are replaced with Turkish letters.

Issue: case conversion. Under Turkish rules *.ini uppercases to *.İNİ and *.ını uppercases to *.INI, so *.ini ≠ *.INI and *.ını ≠ *.İNİ.

Extended characters:

  çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ÿ

Replaced characters:

  ðÐ ---> ğĞ
  ýÝ ---> ıİ
  þÞ ---> şŞ
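The Turkish case-conversion issue can be sketched in Python, whose built-in `str.upper()` and `str.lower()` ignore the locale. The helpers `turkish_upper` and `turkish_lower` are hypothetical names written for this example; a real product would more likely rely on a locale-aware library.

```python
def turkish_upper(text):
    """Uppercase using the Turkish dotted/dotless i pairs (i/İ and ı/I)."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text):
    """Lowercase using the Turkish dotted/dotless i pairs (İ/i and I/ı)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_upper = "ini".upper()           # locale-independent result
turkish_result = turkish_upper("ini")   # Turkish result, with dotted capital İ

print(default_upper, turkish_result)
print(turkish_result.encode("iso8859_9"))  # the dotted İ exists in Latin 5
```

Code that compares file extensions or keywords after case conversion should therefore use one fixed, locale-independent rule.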


Latin Scripts: Latin 6 Character Set (ISO 8859-10)

Languages covered: Nordic area: Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), Icelandic

Notes:

  Similar characters to Latin 4, but with extra letters for the Nordic languages.

  Latin 6 characters common to Latin 4 use different code points.

  Not supported at all.

Extended characters:

  ąĄ āĀ áÁ âÂ ãÃ äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ ðÐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

Latin Scripts: Latin 7 & 8 Character Sets (ISO 8859-13 & 14)

Languages covered:

  Latin 7: Baltic languages

  Latin 8: Celtic languages

Notes:

  Similar characters to Latin 4 and 6, but with extra letters for the Nordic languages.

  Latin 7 characters common to Latin 4 and 6 use different code points.

  Not supported.


Latin Scripts: Latin 9 Character Set (ISO 8859-15)

Languages covered: Same as Latin 1.

Notes:

  Eight Latin 1 code points hold different characters in Latin 9; all other characters keep their Latin 1 code points.

  Less used characters are replaced:

    ¨ ---> š
    ¦ ---> Š
    ¸ ---> ž
    ´ ---> Ž
    ½ ---> œ
    ¼ ---> Œ
    ¾ ---> Ÿ
    ¤ ---> €

Extended characters:

  çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ œŒ šŠ ß ùÙ úÚ ûÛ üÜ ýÝ ÿŸ žŽ þÞ
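The replacements between Latin 1 and Latin 9 can be listed with a short Python sketch that decodes every byte under both character sets. The variable names are illustrative.

```python
# Collect the byte values whose character differs between Latin 1 and Latin 9.
differences = {}
for byte in range(256):
    raw = bytes([byte])
    in_latin1 = raw.decode("iso8859_1")
    in_latin9 = raw.decode("iso8859_15")
    if in_latin1 != in_latin9:
        differences[byte] = (in_latin1, in_latin9)

for byte, (old, new) in differences.items():
    print(f"0x{byte:02X}: {old} ---> {new}")
```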

ISO 8859-15 vs. Windows 1252

[Code charts: Latin 9 (ISO 8859-15) and Windows 1252]

Latin Scripts in... Non-Latin Character Sets!

Languages: Traditional Chinese, Simplified Chinese, Japanese (romaji), Vietnamese

Notes:

  Chinese, Japanese and Korean use Latin letters for transliteration (sometimes with tone accents) and numbers.

  Vietnamese uses Latin characters with diacritics.

  Latin characters are also used in the transliteration of Greek, Hebrew, Russian, etc.

Some Vietnamese extended characters:

  đĐ ăĂ âÂ êÊ ôÔ

…with tones

Languages Covered by Latin Character Sets

Language      Character set (Latin-n)
Czech         2
Danish        1 4 5 6 7 8 9
Dutch         1 5 9
English       1 2 3 4 5 6 7 8 9
Finnish       1 2 3 4 5 6 7 8 9
French        1 3 5 8 9
German        1 2 3 4 5 6 7 8 9
Hungarian     2
Italian       1 3 5 8 9
Norwegian     1 2 3 4 5 6 7 8 9
Polish        2 7
Portuguese    1 3 5 8 9
Romanian      2
Spanish       1 8 9
Swedish       1 4 5 6 7 8 9
Turkish       3 5
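A few rows of this kind of table can be spot-checked with Python's codecs, as in the sketch below. The sample alphabets are assumptions made for the example (they list only each language's accented letters), and `covering_sets` is a hypothetical helper name. The mapping from Latin-n to ISO 8859 part numbers follows the standard.

```python
# Latin-n name -> Python codec for the corresponding ISO 8859 part.
LATIN_N_CODECS = {
    1: "iso8859_1", 2: "iso8859_2", 3: "iso8859_3",
    4: "iso8859_4", 5: "iso8859_9", 6: "iso8859_10",
    7: "iso8859_13", 8: "iso8859_14", 9: "iso8859_15",
}


def covering_sets(letters):
    """Return the Latin-n numbers whose character set contains all the letters."""
    covered = []
    for n, codec in LATIN_N_CODECS.items():
        try:
            letters.encode(codec)
        except UnicodeEncodeError:
            continue  # at least one letter is missing from this character set
        covered.append(n)
    return covered


# Accented letters only; an assumed sample, not an official alphabet definition.
SAMPLES = {
    "Turkish": "çÇğĞıİöÖşŞüÜ",
    "Polish": "ąĄćĆęĘłŁńŃóÓśŚźŹżŻ",
    "Hungarian": "áÁéÉíÍóÓöÖőŐúÚüÜűŰ",
}

for language, letters in SAMPLES.items():
    print(language, covering_sets(letters))
```

The result depends entirely on which letters are considered necessary for a language, so a stricter or looser sample may give a different list.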


Greek Script: Greek Character Set (ISO 8859-7)

One script, one character set, one language.

Contains modern monotonic upper & lowercase Greek letters, punctuation and a few accented Greek letters.

The rest is almost identical to Latin 1!

Latin 1 characters missing from the Greek character set:

  Latin punctuation: ¡ ¿
  Currency symbols: ¢ ¤ ¥
  Other symbols: ® ª º × ÷ µ ¶
  Diacritics: ¸
  Numbers: ¹ ¼ ¾

Extended characters:

  αβγδεζηικλμν…ΑΒΓΔΖΗΘΙΚΛΝΞ…

The rest...

  ² ³ ½ £ ¦ § © ¬ ¯ ° ± « » · ¨

Hebrew Script: Hebrew Character Set (ISO 8859-8)

One script, one character set: Hebrew, Yiddish

Directionality of text:

  Hebrew letters are written from right to left (RTL).

  Numbers (Arabic) are written from left to right (LTR).

  Latin characters are written from left to right (LTR).

  Order of the text depends on the predominant language.

  Order of mirrored characters depends on neighboring characters.
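The directionality rules above correspond to per-character properties that Unicode records. As a minimal sketch, Python's standard `unicodedata` module exposes them; the sample characters are chosen only for illustration.

```python
import unicodedata

# Sample characters: a Hebrew letter (alef), a Latin letter, a digit,
# a colon and an opening parenthesis.
samples = {
    "Hebrew letter": "\u05d0",
    "Latin letter": "a",
    "digit": "1",
    "colon": ":",
    "opening parenthesis": "(",
}

# Bidirectional class: "R" = right to left, "L" = left to right,
# "EN" = European number; other values mark neutral or weak characters.
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

# Mirrored characters are drawn flipped when they appear in RTL text.
mirrored = {label: bool(unicodedata.mirrored(ch)) for label, ch in samples.items()}

for label in samples:
    print(f"{label}: class={bidi_classes[label]}, mirrored={mirrored[label]}")
```

Characters without a strong direction, such as the colon, take their direction from the surrounding text, which is why their position can change between LTR and RTL documents.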

Differences from Latin 1:

  Latin punctuation: ¡ ¿ are missing

  Currency symbol: ₪ (new sheqel) is absent

  Other symbols: ª º are missing; × ÷ have different code points

Extended characters:

  תשרקעסליטחזוהדגבא

Final & nominal forms (final form first):

  ך - כ
  ן - נ
  ם - מ
  ף - פ
  ץ - צ

Hebrew User Interface

There are two types of Hebrew support:

  Hebrew-enabled product (supporting Hebrew characters)

  Hebrew product (translated into Hebrew)

Both types must support RTL display.

Text alignment may differ for characters, strings and document.

Normally, the logical order (or storage order or file order) is the same as the reading order.

The display order is bi-directional and does not follow the logical order.

Hebrew User Interface: Logical vs. Visual

Input string: "Hebrew text : טסקט ילגנא "

[Character positions 1 to 20 of the input string]

How should it be displayed?

In an LTR document: Hebrew text : אנגלי טקסט

In an RTL document: אנגלי Hebrew text : טקסט

You get different displays depending on the main direction (script) of the document or the string.

Notice the direction of the colon.

Hebrew User Interface: Issues

Display of improper characters.

Display in improper order.

Display in correct order; cursor in logical position.

Mix of Hebrew and Latin text.

Alignment inside an input field.

Copy and paste.

Carriage returns inside a Hebrew or mixed string.

Cumulative Testing

Premises:

  Testing in French or German includes English issues.

  Testing of Greek includes non-Latin 1 character and font issues.

Special cases:

  Cursory testing of character and font issues per character set.

  Sorting and comparison per language.

  Hebrew: Bi-directionality

  Turkish: INI files and anything related to case conversion

Total 50% Increase for ALL Languages

French or German: 100%

Greek: 15%

Hebrew: 15%

Turkish: 5%

Czech or Polish: 5%

Cursory testing: 10%

English: 0% (English coverage: 100%)
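The 50% figure appears to be the sum of the increments added on top of the French or German baseline. The sketch below shows that arithmetic; reading the figures as one baseline plus additive increments is an interpretation of the slide, not something it states explicitly.

```python
# Testing effort in percent of one full test pass (figures from the slide).
baseline = 100  # French or German, assumed to be the reference pass

increments = {
    "Greek": 15,
    "Hebrew": 15,
    "Turkish": 5,
    "Czech or Polish": 5,
    "Cursory testing": 10,
    "English": 0,  # already covered by the French or German pass
}

increase = sum(increments.values())
total = baseline + increase

print(f"Increase: {increase}%  Total effort: {total}%")
```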

Sorting (1)

Swedish            Danish             Trad. Spanish            Modern Spanish           French
(a, à, â, ã)       (a, à, â, ã)       (a, à, â, ä, ã, å)       (a, à, â, ä, ã, å)       (a, à, â, ä, ã, å)
(æ, ae)                               (æ, ae)                  (æ, ae)                  (æ, ae)
b                  b                  b                        b                        b
(c, ç)             (c, ç)             (c, ç)                   (c, ç)                   (c, ç)
cg < ch < ci       cg < ch < ci       ch (cz < ch < da)        cg < ch < ci             cg < ch < ci
d                  d                  d                        d                        d
(e, è, é, ê, ë)    (e, è, é, ê, ë)    (e, è, é, ê, ë)          (e, è, é, ê, ë)          (e, è, é, ê, ë)
f…h                f…h                f…h                      f…h                      f…h
(i, ì, í, î, ï)    (i, ì, í, î, ï)    (i, ì, í, î, ï)          (i, ì, í, î, ï)          (i, ì, í, î, ï)
j…l                j…l                j…l                      j…l                      j…l
lk < ll < lm       lk < ll < lm       ll (lz < ll < ma)        lk < ll < lm             lk < ll < lm
m                  m                  m                        m                        m
(n, ñ)             (n, ñ)             n                        n                        (n, ñ)
                                      ñ                        ñ
(o, ó, ò, ô, õ)    (o, ó, ò, ô, õ)    (o, ó, ò, ô, ö, õ, ø)    (o, ó, ò, ô, ö, õ, ø)    (o, ó, ò, ô, ö, õ, ø)
p…r                p…r                p…r                      p…r                      p…r
(ß, ss)            (ß, ss)            (ß, ss)                  (ß, ss)                  (ß, ss)
t                  t                  t                        t                        t
(u, ù, ú, û, ü)    (u, ù, ú, û, ü)    (u, ù, ú, û, ü)          (u, ù, ú, û, ü)          (u, ù, ú, û, ü)
v…x                v…x                v…x                      v…x                      v…x
(y, ý, ÿ)          (y, ý, ÿ)          (y, ý, ÿ)                (y, ý, ÿ)                (y, ý, ÿ)
z                  z                  z                        z                        z
å                  (æ, ae, ä)
ä                  (ø, ö)
(ö, ø)             (å, aa)
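The traditional Spanish rule, where "ch" sorts as one letter between c and d and "ll" as one letter between l and m, can be sketched as a custom sort key in Python. `traditional_spanish_key` is a hypothetical helper; it ignores accents and is only meant to show the idea, not to replace a real collation library.

```python
# Traditional Spanish alphabet with the digraphs "ch" and "ll" as letters.
TRADITIONAL_ALPHABET = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: index for index, letter in enumerate(TRADITIONAL_ALPHABET)}


def traditional_spanish_key(word):
    """Turn a word into a list of letter ranks, treating ch and ll as units."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Unknown characters sort after every letter in this sketch.
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key


words = ["dama", "chico", "czar", "cima"]
modern_order = sorted(words)
traditional_order = sorted(words, key=traditional_spanish_key)

print(modern_order)
print(traditional_order)
```

With the traditional key, "chico" moves after "czar", matching the "cz < ch < da" entry in the table.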

Sorting (2)

Sort order:

  The system generates a sort key based on locale-specific rules.

  A sort key consists of several weighted components that represent a character’s script, diacritics, case, etc.

Simple example of French sorting:

Rules:

  1) Alphanumeric base
  2) Diacritics
  3) Case
  4) Non-alphanumeric data

Sorting:

  élémentaire
  élève
  Élève
  élevé
  Élevé
  élever
  e-lever
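The first three rules can be sketched as a multi-level sort key in Python. This is a simplified illustration, not a full locale collation: `french_sort_key` is a hypothetical helper, rule 4 is left out (non-alphanumeric characters are simply skipped), and accents are compared from the end of the word, as French ordering conventionally does.

```python
import unicodedata


def french_sort_key(word):
    """Three-level key: base letters, then diacritics, then case."""
    base, marks, case = [], [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] += ch          # attach the accent to the preceding letter
        elif ch.isalnum():
            base.append(ch.lower())
            marks.append("")
            case.append(ch.isupper())    # False (lowercase) sorts before True
        # Hyphens, spaces and other symbols are ignored in this sketch.
    # Accents are compared starting from the end of the word.
    return ("".join(base), tuple(reversed(marks)), tuple(case))


words = ["élever", "Élevé", "élevé", "Élève", "élève", "élémentaire"]
ordered = sorted(words, key=french_sort_key)
print(ordered)
```

Each component of the key is only consulted when all the previous ones are equal, which is what gives the rules their priority order.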

References

The ISO 8859 Alphabet Soup by Roman Czyborra. An absolute classic... http://czyborra.com/charsets/iso8859.html

Character table: http://www.microsoft.com/globaldev/reference/sbcs/1250.htm

Some Internet Explorer limitations: http://sizif.mf.uni-lj.si/linux/cee/app/ie30.html#http

More of the same: http://sizif.mf.uni-lj.si/linux/cee/charset.html

On fonts (a bit specialized): http://studweb.euv-frankfurt-o.de/twardoch/f/en/index.html

ISO 8859-2 vs. Windows Central European code page (1250): http://titus.uni-frankfurt.de/unicode/iso8859/iso8859b.htm#start
