notes on a standard: unicode

Notes on a Standard: UNICODE

Elena-Oana [email protected]

UAIC, Iasi



Plan

● Introduction

● Design Goals

● Code Points and Characters

● Encoding Forms: UTF-32, UTF-16, UTF-8

● Conclusion


Introduction

● UNIversal character enCODing system

● Unicode = universal character encoding scheme for written characters and text

● Advantages

➢ Consistent way of encoding multilingual text

➢ Data stability instead of proliferating character sets

➢ Encode ALL characters used for the written languages (> 1 million characters can be encoded)

➢ Creates a foundation for global software


Design Principles


Characters, not Glyphs

● The Unicode Standard draws a distinction between characters and glyphs.

● Characters are the abstract representations of the smallest components of written language that have semantic value.


Logical Order

● The order in which Unicode text is stored in the memory representation is called logical order

● Unicode Standard includes characters to explicitly specify changes in direction when necessary
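A minimal sketch of logical order, assuming Python's standard `unicodedata` module as the lookup tool (the sample string and variable names are illustrative, not from the slides). The Hebrew letters are stored in the order they are typed, even though a renderer displays them right to left. RIGHT-TO-LEFT MARK and RIGHT-TO-LEFT OVERRIDE are two of the explicit directional characters.

```python
import unicodedata

# Three Latin letters followed by three Hebrew letters, in typing (logical) order.
text = "abc\u05d0\u05d1\u05d2"

# Bidirectional class of each stored character: "L" = left-to-right, "R" = right-to-left.
classes = [unicodedata.bidirectional(ch) for ch in text]
print(classes)

# Index 3 is ALEF, the first Hebrew letter typed, regardless of where it is displayed.
print(unicodedata.name(text[3]))

# Two characters that explicitly influence direction.
rlm = "\u200f"
rlo = "\u202e"
print(unicodedata.name(rlm), "/", unicodedata.name(rlo))
```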


Code Points and Characters

● Abstract characters are encoded internally as numbers

● Codespace: 0 to 10FFFF₁₆ => 1,114,112 code points available

● Abstract character -> code point

● Example:

U+0061 LATIN SMALL LETTER A
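The mapping from abstract character to code point can be checked with a short sketch. It assumes Python, whose built-in `ord`/`chr` and standard `unicodedata` module expose code points and character names; the variable names are illustrative.

```python
import unicodedata

# Map an abstract character to its code point and back.
ch = "a"
cp = ord(ch)                  # integer code point
label = f"U+{cp:04X}"         # conventional U+ notation
print(label, unicodedata.name(ch))

# The reverse direction: code point to character.
same_ch = chr(0x0061)

# Size of the codespace 0..10FFFF (hexadecimal).
codespace_size = 0x10FFFF + 1
print(codespace_size)
```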


Encoding Forms

● Encoding forms specify how each code point is to be expressed as a sequence of one or more code units (8-bit, 16-bit, or 32-bit units)

● Encoding forms for Unicode characters: UTF-8, UTF-16, UTF-32

● Each form can be efficiently transformed into either of the other two without any loss of data
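As a rough illustration, the sketch below encodes one sample string in all three forms and converts between them. It assumes Python's built-in codecs; the big-endian variants (`utf-16-be`, `utf-32-be`) are chosen here only so that no byte order mark is added, and the sample string is hypothetical.

```python
# Sample text: U+0061, U+20AC (euro sign), U+10000.
text = "a\u20ac\U00010000"

utf8 = text.encode("utf-8")        # 8-bit code units
utf16 = text.encode("utf-16-be")   # 16-bit code units
utf32 = text.encode("utf-32-be")   # 32-bit code units

# Number of code units each form needs for the same three code points.
units = (len(utf8), len(utf16) // 2, len(utf32) // 4)
print(units)

# Converting from one form to another goes through the code points
# and loses nothing.
utf16_from_utf8 = utf8.decode("utf-8").encode("utf-16-be")
utf8_from_utf32 = utf32.decode("utf-32-be").encode("utf-8")
```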


UTF-32

● The simplest Unicode encoding form

● Each Unicode code point is represented directly by a single 32-bit code unit (fixed-width)

● Restricted to representation of code points in the range 0..10FFFF₁₆

● Example:

U+10000 is represented as <00010000>

● Preferred encoding form for processing characters on most Unix platforms
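The U+10000 example can be reproduced with a small sketch. The helper name `utf32_code_unit` is hypothetical; the only assumption is that the code unit is the code point itself, written as a 32-bit value (big-endian here for readability).

```python
def utf32_code_unit(cp: int) -> bytes:
    """Return the single 32-bit code unit for a code point, big-endian."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    return cp.to_bytes(4, "big")

unit = utf32_code_unit(0x10000)
print(unit.hex())

# Cross-check against Python's built-in UTF-32 codec.
builtin = chr(0x10000).encode("utf-32-be")
```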


UTF-16

● Code unit values differ from the code point value for code points above U+FFFF => conversion required

● Variable-width encoding:

➢ U+0000..U+FFFF are represented as a single 16-bit code unit

➢ U+10000..U+10FFFF are represented as pairs of 16-bit code units (surrogate pairs)

● Optimized for the BMP (Basic Multilingual Plane) = majority of common-use characters for all modern scripts of the world
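The conversion for code points above U+FFFF can be sketched as follows. The function name `utf16_code_units` is hypothetical; the arithmetic is the standard surrogate-pair calculation (subtract 10000₁₆, then split the remaining 20 bits into two 10-bit halves).

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the 16-bit code unit(s) for one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if cp < 0x10000:
        return [cp]                  # BMP: code unit equals code point
    v = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (v >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]

pair = utf16_code_units(0x10000)
print([f"{u:04X}" for u in pair])

# Cross-check against Python's built-in UTF-16 codec.
as_bytes = b"".join(u.to_bytes(2, "big") for u in pair)
builtin = chr(0x10000).encode("utf-16-be")
```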


UTF-8

● UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters

➢ U+0000 to U+007F → 1 byte

➢ U+0080 to U+07FF → 2 bytes

➢ U+0800 to U+FFFF → 3 bytes

➢ U+10000 to U+10FFFF → 4 bytes

● Backwards compatible with ASCII

● Default encoding for XML (XHTML) documents

● Example:

U+10000 is represented as <F0 90 80 80>
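The byte sequence above can be derived by hand. The sketch below is one possible way to write the bit packing, with a hypothetical helper name `utf8_bytes`. It is checked against Python's built-in codec rather than offered as a reference implementation, and it does not reject surrogate code points.

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if cp < 0x80:                    # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                 # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),  # 4 bytes: 11110xxx + three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

encoded = utf8_bytes(0x10000)
print(encoded.hex(" ").upper())
```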


Conclusion

● The Unicode Standard is a superset of all characters in widespread use today.

● It contains characters from major international and national standards (e.g. the SGML standard) as well as prominent industry character sets (e.g. industry codes from Apple, Adobe, Fujitsu, etc.).

● Responds to changing industry demands by encoding important new characters (e.g. the € sign)


Questions?

● Thank You!
