unicode

31
Ãfibercool

Upload: esug

Post on 12-May-2015

995 views

Category:

Technology


5 download

DESCRIPTION

Unicode. Philippe Marschall. ESUG 2008, Amsterdam

TRANSCRIPT

Page 1: Unicode

Ãfibercool

Page 2: Unicode

Outline

• character sets (what)

• encodings (how)

• a bit intertwined

• generalizations

• Squeak specific

• european examples

Page 3: Unicode

Character Setssets of characters

Page 4: Unicode

ASCII

• North America, United States

• 128 characters

Page 5: Unicode

ISO-8859-1(latin1)

• Western Europe

• Does not include € (ISO-8859-15 / latin-9)

• ASCII + 128 additional charters

Page 6: Unicode

Unicodeone character set to rule them all

Page 7: Unicode

Unicode

Latin-1

ASCII

Page 8: Unicode

Code point“atom” of text

Page 9: Unicode

Part of Unicode

♕ ☠ ☃ ☙

Page 10: Unicode

NOT Part of Unicode

Page 11: Unicode

Integer(abstract)

SmallInteger(-10737418241073741823)

LargeInteger(∞ - SmallInteger)

ranges, no endianness!

Page 12: Unicode

String(abstract)

ByteString(ISO-8859-1)

WideString(Unicode - ISO-8859-1)

character set, not encoding!

Page 13: Unicode

#leadingChar

• mixes abstraction layers

• language

• presentation

Page 14: Unicode

Algorithms

• know Unicode

• know all the rules

• know all the code points

• know all the locales

Page 15: Unicode

Transformations

Fußball

FUSSBALL

locale dependent!

Page 16: Unicode

Collation(ordering)

• ABC...RSTUVWXYZ

• ÄB...NOÖ...SßTUÜV...YZ

• ABC...RSTUVWXYZÅÄÖ

locale dependent!

Page 17: Unicode

Normalization(what does #= really mean?)

• ¨ + a = ä

• there are different ones to chose from

Page 18: Unicode

PHP 6 will do all of this

Page 19: Unicode

Encodingsmappings from one space to an other

(isomorphisms)

Page 20: Unicode

1:1

•ASCII

•ISO-8859-1

Page 21: Unicode

ASCII

•7 bit

•8 bit

Page 22: Unicode

ISO-8859-1

16rFC

Page 23: Unicode

UTF-32

16rFC 16r00 16r00 16r00 (LE)

16r00 16r00 16r00 16rFC (BE)

Page 24: Unicode

UTF-16

16rFC 16r00 (LE)

16r00 16rFC (BE)

Page 25: Unicode

UTF-8

16rC3 16rBC

Page 26: Unicode

WAKom1:1 direct mapping from bytes to Charaters

Page 27: Unicode

WAKomEncoded*use it with utf-8!

Page 28: Unicode

Content-Type: text/html;charset=utf-8

<meta content="text/html;charset=utf-8"

http-equiv="Content-Type"/>

Page 29: Unicode

2.8: WASession >> #charSet

2.9: /seaside/config

Page 31: Unicode

übercool