Software Internationalisation — Single-Byte Scripts Guy Lacoursière Software Globalisation Consultant


Post on 11-Jan-2016


Page 1: Software Internationalisation — Single-Byte Scripts

Guy Lacoursière, Software Globalisation Consultant

Page 2: Agenda

Deliverables

Definitions

Scripts: Latin scripts, Greek, Hebrew

Cumulative testing

Sorting (optional)

References

Page 3: Deliverables: English Internationalized Products

We currently support Latin 1 and Asian character sets:

ISO-8859-1: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Indonesian, Italian, Norwegian, Portuguese, Spanish, Swedish

Multibyte character sets: Japanese, traditional Chinese, simplified Chinese, Korean

Newly supported character sets:

ISO-8859-2: Albanian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian, Slovak, Slovenian

ISO-8859-7/8/9: Greek, Hebrew, Turkish

Complex languages are not supported: Thai, Indic languages, Arabic

Goal: Unicode


Page 4: Definitions

Script: a system of characters composed of:

Letters, syllables or ideographs (with one or more possible directions)

Punctuation symbols

Numbers ( 0 1 2 3 4 5 6 7 8 9 ¼ ½ ¾ )

Other symbols ( ® $ # % & ± ° _ @ )

A language may use several scripts, and a script may serve several languages.

Character set (or code page, or coded character set): an ordered group of characters assigned to code points.

Encoding: a system defining the storage mechanism for a given character set.
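
To make the last two definitions concrete, here is a minimal sketch in Python (the language and its built-in codec names are chosen for illustration only; the slides do not prescribe them). It takes one character, shows its code point, and then shows the bytes that two different encodings store for it.

```python
# One character, one code point, several possible storage forms.
ch = "é"

code_point = ord(ch)                    # position in the coded character set
latin1_bytes = ch.encode("iso8859-1")   # single-byte encoding
utf8_bytes = ch.encode("utf-8")         # multibyte encoding of the same code point

print(f"U+{code_point:04X}", latin1_bytes, utf8_bytes)
```

The code point stays the same; only the stored bytes change with the encoding.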

Page 5: Single-Byte Character Sets

Expressed in 8-bit sequences.

The character set does not exceed 256 code points.

The encoding is simply the code point order: each code point is stored as the byte with the same value.

A given code point may have a different value (character) depending on the character set.

The first 128 code points (the ASCII range) are the same in all the character sets covered here.
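
The last two points can be checked with a short sketch. It assumes Python and its standard codec names for the ISO 8859 family; the helper name `decode_everywhere` is made up for this illustration.

```python
# The same byte value read through several ISO 8859 character sets.
CHARSETS = ["iso8859-1", "iso8859-2", "iso8859-7", "iso8859-8", "iso8859-9"]

def decode_everywhere(byte_value):
    """Return {charset: character} for one byte value."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in CHARSETS}

upper_half = decode_everywhere(0xE6)   # above 127: depends on the character set
lower_half = decode_everywhere(0x41)   # below 128: identical everywhere

for name in CHARSETS:
    print(name, upper_half[name], lower_half[name])
```

Byte 0xE6 comes out as a Latin, Greek or Hebrew letter depending on the character set, while byte 0x41 is always "A".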

Page 6: Latin Scripts: Latin 1 Character Set (ISO 8859-1)

Languages covered

Afrikaans, Albanian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, Swahili, Swedish

Notes

Uppercase and lowercase letters have two code points even though they are two forms of the same letter.

Some letters have no uppercase (ß and ÿ).

The base characters are the same for all Latin character sets.

Base characters

a b c d e f g h i j k l m n o p q r s t u v w x y z

0 1 2 3 4 5 6 7 8 9

! " ' ( ) , . : ; ? [ ] ^ { | } ~

# $ % & ÷ × + - * / = \ < > _

Extended characters

àÀ áÁ â ãà äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË ðÐ íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ýÝ ÿ þÞ
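
The note that some letters have no uppercase can be illustrated with a small sketch. It assumes Python's Unicode case mapping as the reference, and `uppercase_in_charset` is a hypothetical helper written for this example: it returns the uppercase form only if that form is a single character that exists in the given character set.

```python
def uppercase_in_charset(ch, charset="iso8859-1"):
    """Return the one-character uppercase of ch if the charset contains it, else None."""
    upper = ch.upper()
    if len(upper) != 1 or upper == ch:
        return None          # no single-character uppercase form (e.g. ß -> SS)
    try:
        upper.encode(charset)
    except UnicodeEncodeError:
        return None          # the uppercase form exists, but not in this charset
    return upper

for letter in "éßÿ":
    print(letter, uppercase_in_charset(letter))
```

In Latin 1, ÿ has no uppercase partner because Ÿ is not part of the set; the same call with `"iso8859-15"` finds it.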

Page 7: Latin Scripts: ISO 8859-1 vs. Windows 1252

Microsoft Windows' Latin 1 character set (code page 1252) is different from ISO 8859-1.

It contains 27 extra characters, among others:

The euro symbol ( € )

The English curly quotes ( “ ” )

The ellipsis ( … )

The German opening quotes ( „ )

The bullet ( • )

The n-dash ( – )

The m-dash ( — )

The French uppercase and lowercase oe ligatures ( œ Œ )

The English trademark symbol ( ™ )

These may not display correctly in non-Latin 1 systems.
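
A quick way to list the extra characters is to decode the byte range 0x80 to 0x9F with the code page 1252 codec. This is a sketch that assumes Python's `cp1252` and `iso8859-1` codecs; the function names are invented for the example.

```python
def cp1252_extras():
    """Printable characters that code page 1252 places in the range 0x80-0x9F."""
    extras = {}
    for code in range(0x80, 0xA0):
        try:
            extras[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            pass   # a few positions in this range are left unassigned
    return extras

def fits_latin1(text):
    """True if every character of text exists in ISO 8859-1."""
    try:
        text.encode("iso8859-1")
        return True
    except UnicodeEncodeError:
        return False

extras = cp1252_extras()
print(len(extras), "".join(extras.values()))
print(fits_latin1("café"), fits_latin1("100 €"))
```

A check like `fits_latin1` is one way to catch text that will not survive a move to a strict ISO 8859-1 system.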

Page 8: Latin Scripts: ISO 8859-1 vs. Windows 1252

[Code charts: Latin 1 (ISO 8859-1) and Windows code page 1252]

Page 9: Latin Scripts: Latin 2 Character Set (ISO 8859-2)

Languages covered

Czech, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, Sorbian

Notes

Some characters are duplicates from the Latin 1 character set.

The caron diacritic has two forms: “ ˇ ” and “ ’ ”.

The T with cedilla has a glyph variant (T with comma) for Romanian.

Latin 2 characters common to Latin 1 use identical code points.

Extended characters

ąĄ áÁ â ăĂ äÄ ćĆ çÇ čČ ďĎ éÉ ęĘ ëË ěĚ đĐ íÍ îÎ łŁ ľĽ ĺĹ ńŃ ňŇ óÓ ôÔ őŐ öÖ ŕŔ řŘ śŚ šŠ şŞ § ß ťŤ ţŢ ůŮ úÚ űŰ üÜ ýÝ źŹ žŽ żŻ

Page 10: ISO 8859-1 vs. ISO 8859-2

[Code charts: Latin 1 (ISO 8859-1) and Latin 2 (ISO 8859-2)]

Page 11: ISO 8859-1 vs. ISO 8859-2

All common characters have the same code points.

Characters that are different belong to separate language families (mostly West European vs. East European).

This allows a certain level of flexibility between languages: text that uses only the shared characters (German, for example) is stored identically in both character sets.
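
The overlap between the two character sets can be computed directly. The sketch below assumes Python's `iso8859-1` and `iso8859-2` codecs; `shared_characters` is a helper name made up for the example.

```python
def shared_characters(first="iso8859-1", second="iso8859-2"):
    """Characters that sit at the same code point in both character sets."""
    shared = []
    for code in range(0xA0, 0x100):   # upper half only; the lower half is ASCII in both
        raw = bytes([code])
        if raw.decode(first) == raw.decode(second):
            shared.append(raw.decode(first))
    return shared

shared = shared_characters()
print("".join(shared))

# A word that uses only shared characters has the same bytes in both sets.
word = "Größe"
same_bytes = word.encode("iso8859-1") == word.encode("iso8859-2")
print(same_bytes)
```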

Page 12: Latin Scripts: Latin 3 Character Set (ISO 8859-3)

Languages covered

Esperanto, Maltese

Notes

Covered Turkish before the introduction of Latin 5 in 1988.

Not supported.

Extended characters

àÀ áÁ â äÄ ċĊ ĉĈ çÇ èÈ éÉ êÊ ëË ğĞ ħĦ ĥĤ ıI iİ ìÌ íÍ îÎ ïÏ ĵĴ ñÑ òÒ óÓ ôÔ öÖ şŞ ŝŜ § ß ùÙ úÚ ûÛ üÜ ŭŬ żŻ £ ¤

Page 13: Latin Scripts: Latin 4 Character Set (ISO 8859-4)

Languages covered

Estonian, Latvian, Lithuanian, Greenlandic, Lappish

Notes

Not supported.

Extended characters

ąĄ āĀ áÁ â ãà äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ đĐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

Page 14: Latin Scripts: Latin 5 Character Set (ISO 8859-9)

Languages covered

Turkish

Notes

Very similar to Latin 1.

The letters ð, ý and þ from Latin 1 are replaced with Turkish letters.

Latin 5 characters common to Latin 1 use identical code points.

Issue: in Turkish, *.ini = *.İNİ and *.ını = *.INI, so *.ini ≠ *.INI and *.ını ≠ *.İNİ.

Extended characters

àÀ áÁ â ãà äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ ß ùÙ úÚ ûÛ üÜ ÿ

Replaced characters

ðÐ ---> ğĞ

ýÝ ---> ıİ

þÞ ---> şŞ
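
The case-conversion issue on this slide can be reproduced with a small sketch. Python's built-in `str.upper()` is not locale-aware, so the helpers `turkish_upper` and `turkish_lower` below (hypothetical names) imitate the Turkish rules by hand; a real product would rely on its platform's locale-aware routines. The file-extension checks are likewise only an illustration of the failure mode, not code from any actual product.

```python
def turkish_upper(text):
    """Uppercase with Turkish rules: i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()

def turkish_lower(text):
    """Lowercase with Turkish rules: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()

def is_ini_file_naive(filename, upper=str.upper):
    """Fragile check: uppercases with whatever rules the caller's locale supplies."""
    return upper(filename).endswith(".INI")

def is_ini_file_safe(filename):
    """Sturdier check: accept only the two ASCII spellings of i, whatever the locale."""
    return filename[-4:] in {".ini", ".INI", ".Ini", ".iNi", ".inI", ".INi", ".iNI", ".InI"}

print(turkish_upper("setup.ini"), turkish_lower("SETUP.INI"))
print(is_ini_file_naive("setup.ini"), is_ini_file_naive("setup.ini", upper=turkish_upper))
```

With Turkish rules the naive check no longer recognises `setup.ini`, which is why INI files and case conversion deserve dedicated Turkish testing.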

Page 15: Latin Scripts: Latin 6 Character Set (ISO 8859-10)

Languages covered

Nordic area: Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), Icelandic

Notes

Similar characters to Latin 4, but with extra letters for the Nordic languages.

Latin 6 characters common to Latin 4 use different code points.

Definitely not supported.

Extended characters

ąĄ āĀ áÁ â ãà äÄ åÅ æÆ čČ ēĒ éÉ ęĘ ëË ėĖ ðÐ ģĢ ĸ ķĶ ĩĨ íÍ îÎ īĪ įĮ ļĻ ņŅ ŋŊ ōŌ ôÔ õÕ öÖ øØ ŗŖ šŠ ß ŧŦ ųŲ úÚ ûÛ üÜ ũŨ ūŪ ¤ ÷

Page 16: Latin Scripts: Latin 7 & 8 Character Sets (ISO 8859-13 & 14)

Languages covered

Latin 7: Baltic languages

Latin 8: Celtic languages

Notes

Similar characters to Latin 4 and 6, but with extra letters for the Nordic languages.

Latin 7 characters common to Latin 4 and 6 use different code points.

Latin 8 characters common to Latin 1 use identical code points.

Not supported.

Page 17: Latin Scripts: Latin 9 Character Set (ISO 8859-15)

Languages covered

Same as Latin 1.

Notes

Some Latin 9 characters common to Latin 1 use different code points.

Less used characters are replaced:

¨ ---> š

¦ ---> Š

¸ ---> ž

´ ---> Ž

½ ---> œ

¼ ---> Œ

¾ ---> Ÿ

¤ ---> €

Extended characters

àÀ áÁ â ãà äÄ åÅ æÆ çÇ èÈ éÉ êÊ ëË íÍ îÎ ïÏ ðÐ ñÑ òÒ óÓ ôÔ õÕ öÖ øØ œŒ šŠ ß ùÙ úÚ ûÛ üÜ ýÝ ÿŸ žŽ þÞ
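
The list of replaced characters can be derived by comparing the two character sets byte by byte. This is a sketch assuming Python's `iso8859-1` and `iso8859-15` codecs; `replaced_characters` is a helper name invented for the example.

```python
def replaced_characters(old="iso8859-1", new="iso8859-15"):
    """Map each character of `old` to the character that took over its code point in `new`."""
    changes = {}
    for code in range(0xA0, 0x100):
        raw = bytes([code])
        before, after = raw.decode(old), raw.decode(new)
        if before != after:
            changes[before] = after
    return changes

changes = replaced_characters()
for before, after in changes.items():
    print(f"{before} ---> {after}")
```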

Page 18: ISO 8859-15 vs. Windows 1252

[Code charts: Latin 9 (ISO 8859-15) and Windows 1252]

Page 19: Latin Scripts in... Non-Latin Character Sets!

Languages

Traditional Chinese

Simplified Chinese

Japanese (romaji or romanji)

Vietnamese

Notes

Chinese, Japanese and Korean use Latin letters for transliteration (sometimes with tone accents) and numbers.

Vietnamese uses Latin characters with diacritics.

Latin characters are also used in the transliteration of Greek, Hebrew, Russian, etc.

Some Vietnamese extended characters

đĐ ăĂ âÂ êÊ ôÔ

…with tones

Page 20: Languages Covered by Latin Character Sets

Language      Character set (Latin-n)
Czech         2
Danish        1 4 5 6 7 8 9
Dutch         1 5 9
English       1 2 3 4 5 6 7 8 9
Finnish       1 2 3 4 5 6 7 8 9
French        1 3 5 8 9
German        1 2 3 4 5 6 7 8 9
Hungarian     2
Italian       1 3 5 8 9
Norwegian     1 2 3 4 5 6 7 8 9
Polish        2 7
Portuguese    1 3 5 8 9
Romanian      2
Spanish       1 8 9
Swedish       1 4 5 6 7 8 9
Turkish       3 5


Page 21: Greek Script: Greek Character Set

One script, one character set, one language.

Contains modern monotonic uppercase and lowercase Greek letters, punctuation and a few accented Greek letters.

The rest is almost identical to Latin 1!

Latin 1 characters missing from the Greek character set:

Latin punctuation: ¡ ¿

Currency symbols: ¢ ¤ ¥

Other symbols: ® ª º × ÷ µ ¶

Diacritics: ¸

Numbers: ¹ ¼ ¾

Extended characters

αβγδεζηικλμν… ΑΒΓΔΖΗΘΙΚΛΝΞ…

The rest...

² ³ ½ £ ¦ § © ¬ ¯ ° ± « » · ¨
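
The comparison with Latin 1 can be automated. The sketch below assumes Python's `iso8859-1` and `iso8859-7` codecs (Python's Greek table follows a later revision of the standard, so a symbol or two may differ from the slide); `missing_from` is a helper name made up here. It also reports the accented Latin letters, which the Greek set naturally lacks.

```python
def missing_from(charset, reference="iso8859-1"):
    """Characters of the reference set's upper half that `charset` cannot encode."""
    missing = []
    for code in range(0xA1, 0x100):
        ch = bytes([code]).decode(reference)
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

missing_in_greek = missing_from("iso8859-7")
print("".join(missing_in_greek))
```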

Page 22: Hebrew Script: Hebrew Character Set

One script, one character set, two languages: Hebrew and Yiddish.

Directionality of text:

Hebrew letters are written from right to left (RTL).

Numbers (Arabic) are written from left to right (LTR).

Latin characters are written from left to right (LTR).

Order of the text depends on the predominant language.

Order of mirrored characters depends on neighboring characters.

Differences from Latin 1:

Latin punctuation: ¡ ¿ are missing

Currency symbol: ₪ (new sheqel) is absent

Other symbols: ª º are missing; × ÷ have different code points

Extended characters

תשרקעסליטחזוהדגבא

Final & nominal forms:

ך - כ

ן - נ

ם - מ

ף - פ

ץ - צ

Final form

Page 23: Hebrew User Interface

There are two types of Hebrew support:

Hebrew-enabled product (supporting Hebrew characters)

Hebrew product (translated into Hebrew)

Both types must support RTL display.

Text alignment may differ for characters, strings and document.

Normally, the logical order (or storage order or file order) is the same as the reading order.

The display order is bi-directional and does not follow the logical order.

Page 24: Hebrew User Interface: Logical vs. Visual

Input string: "Hebrew text : טסקט ילגנא "

How should it be displayed?

In an LTR document: Hebrew text : אנגלי טקסט

In an RTL document: אנגלי Hebrew text : טקסט

You get different displays depending on the main direction (script) of the document or the string.

Notice the direction of the colon.
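
The difference between logical and display order can be sketched in a few lines. What follows is a deliberately crude approximation for a string shown in an LTR document, written for illustration only: it is not the Unicode Bidirectional Algorithm, it ignores mirrored characters and numbers inside Hebrew runs, and `visual_order_ltr` is a made-up name. It relies on Python's `unicodedata.bidirectional` to tell RTL letters from the rest, and Hebrew letters are written as escapes so the source itself is not reordered by the editor.

```python
import unicodedata

def is_rtl(ch):
    """True for right-to-left letters (Hebrew is class R, Arabic is class AL)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")

def visual_order_ltr(logical):
    """Left-to-right display order of a logically stored string (rough sketch)."""
    chars = list(logical)
    i = 0
    while i < len(chars):
        if not is_rtl(chars[i]):
            i += 1
            continue
        # Extend the run over later RTL letters; spaces and punctuation are
        # included only when another RTL letter follows them.
        end = i
        j = i + 1
        while j < len(chars) and unicodedata.bidirectional(chars[j]) not in ("L", "EN", "AN"):
            if is_rtl(chars[j]):
                end = j
            j += 1
        chars[i:end + 1] = chars[i:end + 1][::-1]   # an RTL run is displayed reversed
        i = end + 1
    return "".join(chars)

logical = "abc \u05d0\u05d1 \u05d2\u05d3 xyz"   # stored in reading order
visual = visual_order_ltr(logical)
print([f"{ord(c):04X}" for c in visual])
```

The code points are printed rather than the string itself because a bidi-aware terminal would reorder the output a second time.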

Page 25: Hebrew User Interface: Issues

Display of improper characters.

Display in improper order.

Display in correct order; cursor in logical position.

Mix of Hebrew and Latin text.

Alignment inside an input field.

Copy and paste.

Carriage returns inside a Hebrew or mixed string.

Page 26: Cumulative Testing

Premises:

Testing in French or German includes English issues.

Testing of Greek includes non-Latin 1 character and font issues.

Special cases:

Cursory testing of character and font issues per character set.

Sorting and comparison per language.

Hebrew: bi-directionality

Turkish: INI files and anything related to case conversion

Page 27: Total 50% Increase for ALL Languages

French or German: 100%

Greek: 15%

Hebrew: 15%

Turkish: 5%

Czech or Polish: 5%

Cursory testing: 10%

English: 0%

English coverage: 100%

Page 28: Sorting (1)

Swedish           Danish            Trad. Spanish           Modern Spanish          French
(a, à, â, ã)      (a, à, â, ã)      (a, à, â, ä, ã, å)      (a, à, â, ä, ã, å)      (a, à, â, ä, ã, å)
(æ, ae)                             (æ, ae)                 (æ, ae)                 (æ, ae)
b                 b                 b                       b                       b
(c, ç)            (c, ç)            (c, ç)                  (c, ç)                  (c, ç)
cg < ch < ci      cg < ch < ci      ch (cz < ch < da)       cg < ch < ci            cg < ch < ci
d                 d                 d                       d                       d
(e, è, é, ê, ë)   (e, è, é, ê, ë)   (e, è, é, ê, ë)         (e, è, é, ê, ë)         (e, è, é, ê, ë)
f…h               f…h               f…h                     f…h                     f…h
(i, ì, í, î, ï)   (i, ì, í, î, ï)   (i, ì, í, î, ï)         (i, ì, í, î, ï)         (i, ì, í, î, ï)
j…l               j…l               j…l                     j…l                     j…l
lk < ll < lm      lk < ll < lm      ll (lz < ll < ma)       lk < ll < lm            lk < ll < lm
m                 m                 m                       m                       m
(n, ñ)            (n, ñ)            n                       n                       (n, ñ)
                                    ñ                       ñ
(o, ó, ò, ô, õ)   (o, ó, ò, ô, õ)   (o, ó, ò, ô, ö, õ, ø)   (o, ó, ò, ô, ö, õ, ø)   (o, ó, ò, ô, ö, õ, ø)
p…r               p…r               p…r                     p…r                     p…r
(ß, ss)           (ß, ss)           (ß, ss)                 (ß, ss)                 (ß, ss)
t                 t                 t                       t                       t
(u, ù, ú, û, ü)   (u, ù, ú, û, ü)   (u, ù, ú, û, ü)         (u, ù, ú, û, ü)         (u, ù, ú, û, ü)
v…x               v…x               v…x                     v…x                     v…x
(y, ý, ÿ)         (y, ý, ÿ)         (y, ý, ÿ)               (y, ý, ÿ)               (y, ý, ÿ)
z                 z                 z                       z                       z
å                 (æ, ae, ä)
ä                 (ø, ö)
(ö, ø)            (å, aa)

Page 29: Sorting (2)

Sort order

The system generates a sort key based on locale-specific rules.

A sort key consists of several weighted components that represent a character’s script, diacritics, case, etc.

Simple example of French sorting:

Rules:

1) Alphanumeric base

2) Diacritics

3) Case

4) Non-alphanumeric data

Sorting:

élémentaire

élève

Élève

élevé

Élevé

élever

e-lever
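
The idea of a weighted sort key can be sketched as follows. This is an illustration under stated assumptions, not the algorithm any particular system uses: it covers only the first three rules (the fourth, non-alphanumeric data, is left out), it compares diacritics starting from the end of the word, which is the French convention the sample order suggests, and `french_sort_key` is a name made up for the example.

```python
import unicodedata

def french_sort_key(word):
    """Three-level key: base letters, then diacritics (from the end), then case."""
    base, marks, case = [], [], []
    for ch in unicodedata.normalize("NFC", word):
        if not ch.isalnum():
            continue                      # rule 4 (non-alphanumeric data) is ignored here
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0].lower())   # level 1: letter without accent or case
        marks.append(decomposed[1:])         # level 2: the accent, if any
        case.append(ch.isupper())            # level 3: lowercase sorts before uppercase
    return ("".join(base), marks[::-1], case)

words = ["Élevé", "élever", "élève", "élémentaire", "élevé", "Élève"]
ordered = sorted(words, key=french_sort_key)
print(ordered)
```

Each level only matters when all earlier levels tie, which is what "weighted components" means in practice.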

Page 30: References

The ISO 8859 Alphabet Soup by Roman Czyborra. An absolute classic... http://czyborra.com/charsets/iso8859.html

Character table: http://www.microsoft.com/globaldev/reference/sbcs/1250.htm

Some Internet Explorer limitations: http://sizif.mf.uni-lj.si/linux/cee/app/ie30.html#http

More of the same: http://sizif.mf.uni-lj.si/linux/cee/charset.html

On fonts (a bit specialized): http://studweb.euv-frankfurt-o.de/twardoch/f/en/index.html

ISO 8859-2 vs. Windows Central European code page (1250): http://titus.uni-frankfurt.de/unicode/iso8859/iso8859b.htm#start