Notes on a Standard: Unicode
TRANSCRIPT
2
Plan
● Introduction
● Design Goals
● Code Points and Characters
● Encoding Forms: UTF-32, UTF-16, UTF-8
● Conclusion
3
Introduction
● UNIversal character enCODing system
● Unicode = universal character encoding scheme for written characters and text
● Advantages
➢ Consistent way of encoding multilingual text
➢ Data stability instead of proliferating character sets
➢ Aims to encode all characters used for the world's written languages (more than 1 million code points are available)
➢ Creates a foundation for global software
4
Design Principles
5
Characters, not Glyphs
● The Unicode Standard draws a distinction between characters and glyphs.
● Characters are the abstract representations of the smallest components of written language that have semantic value.
● Glyphs are the shapes that characters take when they are rendered or displayed.
6
Logical Order
● The order in which Unicode text is stored in the memory representation is called logical order
● Unicode Standard includes characters to explicitly specify changes in direction when necessary
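A minimal sketch of logical order, using Python's standard `unicodedata` module. The Hebrew sample word and the variable names are assumptions chosen for illustration. The point is that the first letter typed is the first one stored, whatever direction the text is displayed in, and that the explicit direction controls are ordinary characters with their own code points.

```python
import unicodedata

# Hebrew word stored in logical order: SHIN, LAMED, VAV, FINAL MEM.
# It is displayed right to left, but memory order follows reading order.
word = "\u05E9\u05DC\u05D5\u05DD"

first = word[0]
print(unicodedata.name(first))           # HEBREW LETTER SHIN
print(unicodedata.bidirectional(first))  # R (right-to-left letter)
print(unicodedata.bidirectional("a"))    # L (left-to-right letter)

# One of the explicit direction characters mentioned on the slide.
rlo = "\u202E"
print(unicodedata.name(rlo))             # RIGHT-TO-LEFT OVERRIDE
print(unicodedata.bidirectional(rlo))    # RLO
```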
7
Code Points and Characters
● Abstract characters are encoded internally as numbers
● Codespace: 0 to 10FFFF (hexadecimal) => 1,114,112 code points available
● Abstract character -> code point
● Example: U+0061 LATIN SMALL LETTER A
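The mapping from abstract character to code point can be checked in a few lines of Python. This is only an illustrative sketch; the helper name `describe` is an assumption, while `ord`, `chr`, and `unicodedata.name` are standard library calls.

```python
import unicodedata

# The codespace runs from 0 to 0x10FFFF inclusive.
CODESPACE_MAX = 0x10FFFF
codespace_size = CODESPACE_MAX + 1

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(codespace_size)  # 1114112
print(describe("a"))   # U+0061 LATIN SMALL LETTER A
print(chr(0x0061))     # a
```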
8
Encoding Forms
● Encoding forms specify how each code point is to be expressed as a sequence of one or more code units (8-bit, 16-bit, or 32-bit units)
● Encoding forms for Unicode characters: UTF-8, UTF-16, UTF-32
● Each form can be efficiently transformed into either of the other two without any loss of data
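As a rough illustration of the lossless conversion between the three forms, the sketch below round-trips one sample string through Python's built-in codecs and counts code units in each form. The sample string and the helper `code_unit_count` are assumptions; the big-endian codec names are used so that no byte order mark is added.

```python
# Sample text: U+0061 (a), U+20AC (euro sign), U+10000 (first supplementary code point).
text = "a\u20ac\U00010000"

def code_unit_count(s: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed for s in the given encoding form."""
    return len(s.encode(encoding)) // unit_bytes

counts = {
    "UTF-8": code_unit_count(text, "utf-8", 1),       # 1 + 3 + 4 octets
    "UTF-16": code_unit_count(text, "utf-16-be", 2),  # 1 + 1 + 2 units
    "UTF-32": code_unit_count(text, "utf-32-be", 4),  # 1 + 1 + 1 units
}
print(counts)  # {'UTF-8': 8, 'UTF-16': 4, 'UTF-32': 3}

# Convert UTF-8 -> UTF-16 -> UTF-32 and back to a string: nothing is lost.
as_utf8 = text.encode("utf-8")
as_utf16 = as_utf8.decode("utf-8").encode("utf-16-be")
as_utf32 = as_utf16.decode("utf-16-be").encode("utf-32-be")
round_tripped = as_utf32.decode("utf-32-be")
print(round_tripped == text)  # True
```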
9
UTF-32
● The simplest Unicode encoding form
● Each Unicode code point is represented directly by a single 32-bit code unit (fixed-width)
● Restricted to representation of code points in the range 0..10FFFF (hexadecimal)
● Example: U+10000 is represented as <00010000>
● Preferred encoding form for processing characters on most Unix platforms
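The U+10000 example can be reproduced with Python's built-in `utf-32-be` codec (big-endian, no byte order mark). This is a small sketch; the variable names are assumptions.

```python
cp = 0x10000

# In UTF-32 the code unit is simply the code point as a 32-bit number.
encoded = chr(cp).encode("utf-32-be")
print(encoded.hex())  # 00010000

# Reading the four bytes back as one big-endian integer gives the code point.
decoded_cp = int.from_bytes(encoded, "big")
print(decoded_cp == cp)  # True

# Values above 0x10FFFF are outside the codespace and are rejected.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)  # True
```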
10
UTF-16
● Code unit values do not always equal the code point value (supplementary characters) => conversion required
● Variable-width encoding:
➢ U+0000..U+FFFF are represented as a single 16-bit code unit
➢ U+10000..U+10FFFF are represented as pairs of 16-bit code units (surrogate pairs)
● Optimized for the BMP (Basic Multilingual Plane) = majority of common-use characters for all modern scripts of the world
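The surrogate pair conversion can be sketched by hand as follows. The function name `to_utf16` is an assumption for illustration; the arithmetic (subtract 0x10000, split the remaining 20 bits into two 10-bit halves, add 0xD800 and 0xDC00) is the standard UTF-16 rule, and the result is compared against Python's built-in codec.

```python
def to_utf16(cp: int) -> list[int]:
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    in_codespace = 0 <= cp <= 0x10FFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    if not in_codespace or is_surrogate:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0xFFFF:
        return [cp]                  # BMP: the code unit equals the code point
    v = cp - 0x10000                 # 20-bit offset into the supplementary range
    high = 0xD800 + (v >> 10)        # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> low (trail) surrogate
    return [high, low]

print([f"{u:04X}" for u in to_utf16(0x0061)])   # ['0061']
print([f"{u:04X}" for u in to_utf16(0x10000)])  # ['D800', 'DC00']
print([f"{u:04X}" for u in to_utf16(0x1F600)])  # ['D83D', 'DE00']

# Cross-check against the built-in codec.
builtin = chr(0x1F600).encode("utf-16-be")
print(builtin.hex())  # d83dde00
```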
11
UTF-8
● UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single-octet encoding used only for the 128 US-ASCII characters
➢ U+0000 to U+007F → 1 byte
➢ above → 2, 3, or 4 bytes
● Backwards compatible with ASCII
● Default encoding for XML (XHTML) documents
● Example: U+10000 is represented as <F0 90 80 80>
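A hand-written sketch of the 1-to-4-octet rule is shown below. The function name `to_utf8` is an assumption for illustration; the bit layout follows the standard UTF-8 definition, and the output is checked against Python's built-in codec.

```python
def to_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 1 octet: 0xxxxxxx (US-ASCII range)
        return bytes([cp])
    if cp <= 0x7FF:       # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(to_utf8(0x0061).hex(" "))   # 61
print(to_utf8(0x20AC).hex(" "))   # e2 82 ac
print(to_utf8(0x10000).hex(" "))  # f0 90 80 80

# Cross-check against the built-in codec.
print(to_utf8(0x10000) == chr(0x10000).encode("utf-8"))  # True
```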
12
Conclusion
● The Unicode Standard is a superset of all characters in widespread use today.
● It contains characters from major international and national standards (e.g. the SGML standard) as well as prominent industry character sets (e.g. industry codes from Apple, Adobe, Fujitsu, etc.).
● Responds to changing industry demands by encoding important new characters (e.g. the € sign)
13
Questions?
● Thank You!