lecture 101 unicode support status in various platforms (microsoft windows) windows 9x / me –do...

Lecture 10 1

Unicode support status in various platforms (Microsoft Windows)

• Windows 9x / ME– Do not support Unicode internally– Limited Unicode APIs are supported.– Unicode applications compiled with Microsoft

Layer for Unicode can be run on Win9x– Use code page to support different encodings

• Windows NT / 2000 / XP– Support Unicode– Use of wide char (fixed 2 bytes)– Use UCS-2

Lecture 10 2

Unicode support status in various platforms (Linux & Mac OS)

• Linux– Newer Kernel supports Unicode– Requires glibc 2.2.2 and XFree86 4.0.3 or newer– Use UTF-8 in most case, e.g. filesystem– Set locale to <lang>_<place>.<encoding>, e.g.

zh_TW.utf8– Enable UTF-8 support in console by executing unicode_start

• Mac OS– Mac OS 9.1, Mac OS X support Unicode– 16-bit for Unicode character

Lecture 10 3

What is a code page• There are a lot of different encodings, e.g. EUC-TW, Big5,

Latin-1 etc.• A code page (code page identifier) is a number to identify

a codeset.– e.g. 950 – Traditional Chinese (Big5)– e.g. 1252 – Windows Latin-1– Other code page identifiers can be found in:

http://msdn.microsoft.com/library/en-us/intl/unicode_81rn.asp

• In Windows NT/2000/XP, code page conversion table provides information to convert between different encodings.

Lecture 10 4

Java• Java is in Unicode internally. The supported encoding sets are

provided by Java library packages rt.jar and i18n.jar• The supported encoding sets for java.io.*, java.lang.* and

java.nio.* API can be found in:http://java.sun.com/j2se/1.4/docs/guide/intl/encoding.doc.html

• User Input/Output will be automatically convert between Unicode and System code page

• Specify the encoding of the source files when compiling.– javac –encoding <encoding> <source files>

• Convert to other supported encoding:e.g. byte[] utf8Bytes = str.getBytes(“UTF-8”);

• Convert from other supported encoding:e.g. String str = new String(utf8Bytes, “UTF-8”);

Lecture 10 5

Code Conversion

• Generally codeset conversion cannot provide one-to-one mapping(unless the two character sets are exactly the same)

• Unicode is a superset of every existing national standard => guaranteed round-trip conversion

• Round-trip conversion: Suppose a file file1 in codeset A is converted to a file file2 in codeset B and then converted back to codeset A with a file name file3.

If file3=file1, we say that codeset B guarantees round-trip conversion for A.

Lecture 10 6

Java Code conversion

Conversion from multibyte to Unicode

Byte[ ] my_data = { 0xA4, 0x40}

String my_unicode_data = new String(my_data,”big5”)

Where “big5” is the name of the multibyte code name. Unicode needs this to do code conversion to:

Conversion from Unicode to multibyte

String my_unicode_data =“\u4E00” (一 )

Byte[] my_b5_data=my_unicode_data.getBytes(“Big5”)My_b5_data will have the value of 0xA440

Byte[] my_gb_data= my_unicode_data.getBytes(“GBK”)My_gb_data will have the value of D2BB

Lecture 10 7

Text stream import

File I = new File(“input”);

FileInputStream tmpin = new FileInputStream(I);

BufferedReader in = new BufferedReader( new InputStreamReader(tmpin, “Big5”));

Once the BufferedReader in is established, data can be read using the readLine() method.

inputStr = in.readLine();

Lecture 10 8

Text Stream Export

File o = new File(“output.big5”);

FileOutputStream tmpout = new FileOutputStream(o);

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(tmpout, “Big5”));

• ….• Out.println(“\u6CB3\u8C5A”); (“河豚” )• Out.close();• 0xAA65 0xB362

Lecture 10 9

Multilingual applications• Software teaching Chinese for English people• Software teaching English for Chinese• Conceptually separate two types of data in a multilingual

application:– Data related to display of menu/instructions,– Data related to the processing in the program– Multilingual application vs. I18n applications– I18N: data related to display and processing are the

same and it is for the same language/convention– Multilingual applications: Data related to display is for

one language(and can be internationalized). Data related to the processing can be multilingual and not necessarily related to the display language.

– Unicode is the most convenient encoding for multilingual applications, but not absolutely necessary

Lecture 10 10

The Ideographic Composition Scheme Used in ISO 10646

• Introduction to Ideograph Description Characters(IDCs)

• The ideographic composition scheme

• Application using IDCs

Lecture 10 11

What are Ideograph Description Characters

• 12 structure symbols used to describe the formation of characters using some smaller ideograph functional units such as character

components⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻

2FF0

2FF1

2FF2

2FF3

2FF4

2FF5

2FF6

2FF7

2FF8

2FF9

2FFA

2FFB

left-to-right

above-to-below

left-middle-righ

t

above-middle-below

overall-around

Down-to-encompass

up-to-emcomp

ass

right-to-encom

pass

right-down-encompass

left-down-encompass

right-up-encom

pass

embedment

峰

明

繁

杏

街

弼

密

奚

囯

困

同

岡

凶

函

匡

匠

病

原

句

司

這

毯

爽

巫

Lecture 10 12

Characteristics of Ideographs

• Ideograph characters are often formed by smaller ideographic elements such as Radicals, ideographs proper, and other ideographic components which we generally call ideograph components

• Natural in the formation of characters• Examples: 2 components

=>

Chinese uses components has long been using components to describe characters, especially characters with the same pronunciation

Lecture 10 13

Problems with ideograph Character Encoding

• Each character is treated as a different symbol, and thus given a codepoint

• Codepoint assignment in a block does try to follow radical order, but codepoint assignment does not consider the substructures(components). Thus such information is not revealed.

• When new character is created, codepoint allocation is needed in new blocks, thus radical order cannot be globally maintained.

• Also there is a potentially endless standardization process– Encoding of rarely used ideograph characters is a

waste of resource both in terms of code space and also standardization effort

Lecture 10 14

Introduction of IDCs• Work started in 1995 by ISO/IEC SC2/WG2/IRG in 1995• Objective of the Original proposal: use coded ideographs and

“structure symbols” to describe not yet coded ideographs.• Original proposal has 15 “Ideograph Structure Symbols” base

on study on Han characters, three of them didn’t make it to ISO 10646/Unicode:

– Ideograph_Proper(日 ): Every coded character is considered ideograph proper, thus not needed

– Left_Up_Encompass: no un-encoded example

– Mirror_Symmetry(非 ): left being mirrored to the right, but can be describe by Left_to_Right

• Renames the 12 symbols as Ideograph Description Characters

Lecture 10 15

Ideographic Composition Scheme

• IDS describes a character using its components and indicating the relative positions of the components.

• IDCs are considered operators to the components. • IDSs can be expressed by a context free grammar through the

Backus Naur Form(BNF). The grammar G has four components:• Let G = {, N, P, S}, where

: the set of terminal symbols — coded radicals, coded ideographs, and the 12 IDCs.

• N:the set of 5 non-terminal symbols

N={IDS, IDS1, Binary_Symbol, Ternary_Symbol, Ideograph_Component}

• S = {IDS}, which is the start symbol of the grammar• P: a set of rewrite rules

Lecture 10 16

The following is the set of rewriting rules P:• IDS::=<Binary_Symbol><IDS1><IDS1>|<Ternary_Symbol>

<IDS1><IDS1><IDS1>• <IDS1> ::= <IDS> | <Ideograph_Component>• <Ideograph_Component>::= coded_ideograph | coded_radical |

coded_component

• <Binary-Symbol> ::=⿰ | ⿱ | ⿴ | ⿵ | ⿶ | ⿷ | ⿸ | ⿹| ⿺ | ⿻

• <Ternary_Symbol> ::= ⿲ | ⿳

• Note that even though the IDCs are terminal symbols, they are not part of the ideograph components.

Lecture 10 17

Examples

Lecture 10 18

•IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (IDC-OLD , ⿻ ):

– The IDS introduced by IDC-OLD describes the abstract form of the ideograph with D1 and D2 overlaying each other.

–⿻从工 is an example of an IDS which represents the abstract from of

巫•IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND

FROM UPPER RIGHT （ IDC-SUR, ⿹ ):–The IDS introduced by IDC-SUR describes the abstract form of the ideograph with D1 on the right top corner of D2, and D2 is encompassed by D1.

–⿹ is an example of an IDS which represents the abstract from of

Lecture 10 19

• IDS allows a character to be described by different sequences

• One IDS should describe only one character, yet one character can be described by different IDSs.

Lecture 10 20

• IDS describes ideographic character composition at the abstract level. It indicates the relative positions of the components, but does not indicate the proportions.

• Not intended for rendering.• Nesting is natural in ideographs and they are reflected in

in the IDS scheme

Lecture 10 21

Components

• Ideographic Components(IRG definition):

units which can be used to represent ideographs. These components consist of ideographs proper coded in ISO 10646 (BMP) and some basic elements used to form ideographs.

• Radicals(IRG definition): those ideographic components listed in index pages of KX1, DKW, DJW, HYD.

• ISO extensions:– Radicals– Components

Lecture 10 22

• 28 from GBK and more from IRGISO IRG component sample

Lecture 10 23

Extending the Objectives of IDCs

• Using coded characters to describe not yet code ideographs both for representation and exchange

• Limit standardization to only modern characters, and not some rarely used characters

• Learning of character composition(education)• Revealing substructures of ideograph characters• Description of ideograph variants

Lecture 10 24

Examples

• Given characters => IDS?

• 忂䑑䔄䔴蠿• Given a IDS => what are the characters

• 莫言• 艹旲言• Is the following a legal IDS?

• 莫言艹旲

Lecture 10 25

Conclusion

• IDCs are introduced in Unicode 3.0• The use is going beyond the original objective• Applications based on the IDCs were already

developed such as in the the Hong Kong Glyph Specification.

• IDCs should also useful in ideograph variant specifications

• Additional search site:

http://glyph.iso10646hk.net/ccs/ccs.jsp?lang=zh_TW

lecture 101 unicode support status in various platforms (microsoft windows) windows 9x / me –do...

Documents