![Page 1: Lis508 lecture 1: bits, bytes and characters Thomas Krichel 2002-09-23](https://reader035.vdocument.in/reader035/viewer/2022062404/5514d9b0550346935c8b52a2/html5/thumbnails/1.jpg)
lis508 lecture 1: bits, bytes and characters
Thomas Krichel
2002-09-23
Structure
• Bits
• Bytes
• Character sets
– Coded character set
– Character encoding
Literature
• Norton “new inside the PC” chapter 4
• http://www.danbbs.dk/~erikoest/bb_terms.htm
• http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
• http://www.cl.cam.ac.uk/~mgk25/unicode.html
Information
• Information is best understood as “what it takes to answer a question”.
• The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information.
• The term was first used by John Tukey in 1946.
• It is a contraction of “binary digit”.
Usage of bits
• Computers are sometimes classified by
– The number of bits they can process at one time, i.e. the register size. Larger registers let a computer handle more data per operation, which generally makes it faster.
– The number of bits they use to represent addresses, i.e. the address size. A larger address size allows the computer to address more memory and so run larger programs.
• Graphics are also often described by the number of bits used to represent each dot.
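As a rough illustration of why the bit count per dot matters: n bits can distinguish 2 to the power n values. The helper name `colours_for_depth` and the sample depths below are assumptions chosen for the example, not taken from the slides.

```python
def colours_for_depth(bits_per_dot: int) -> int:
    """Return how many distinct values one dot can take at this bit depth."""
    if bits_per_dot < 0:
        raise ValueError("bit depth cannot be negative")
    # Each extra bit doubles the number of distinguishable values.
    return 2 ** bits_per_dot


if __name__ == "__main__":
    for depth in (1, 8, 24):  # sample depths, chosen for illustration
        print(depth, "bits per dot:", colours_for_depth(depth), "possible values")
```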
Many bits
• Early chips processed 8 bits at a time. It became customary to refer to a group of 8 bits as a byte.
• Larger units are
– Kilobyte is 2 power 10 bytes
– Megabyte is 2 power 20 bytes
– Gigabyte is 2 power 30 bytes
– Terabyte is 2 power 40 bytes
• The prefixes come from ancient Greek words for "thousand", "large", "giant", and "monster", respectively. Kilo- dates back to the French Revolution; the other prefixes were standardized later.
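The unit sizes on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; the table `UNIT_EXPONENTS` and the function `bytes_in` are names invented for the example.

```python
# Exponents of 2 for the units listed on this slide.
UNIT_EXPONENTS = {
    "kilobyte": 10,
    "megabyte": 20,
    "gigabyte": 30,
    "terabyte": 40,
}


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one of the listed units."""
    return 2 ** UNIT_EXPONENTS[unit]


if __name__ == "__main__":
    for name in UNIT_EXPONENTS:
        print(f"1 {name} = {bytes_in(name)} bytes")
```

Each step up the list multiplies the size by 1024, not by 1000.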
More than a monster
• In 1975, the General Conference on Weights and Measures (CGPM), based at Sèvres near Paris, agreed to add the prefixes peta- (P) and exa- (E).
• Petabyte is 2 power 50 bytes
• Exabyte is 2 power 60 bytes
• Nowadays they are followed by zettabyte (2 power 70) and yottabyte (2 power 80)
Hex numbers
• A byte is often represented by two hex digits.
• Each hex digit can encode 16 values.
• The digits are written 0 to 9, then A B C D E F. F is 15.
• Here, hex numbers are prefixed with 0x.
• Use the Microsoft calculator in scientific mode to convert.
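As an alternative to a calculator, the conversion can be sketched in Python. The function names `byte_to_hex` and `hex_to_byte` are made up for this example; `int(text, 16)` and the `02X` format specifier are standard Python.

```python
def byte_to_hex(value: int) -> str:
    """Format a byte value (0 to 255) as two hex digits with a 0x prefix."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    # 02X pads to two digits and uses upper-case A to F.
    return f"0x{value:02X}"


def hex_to_byte(text: str) -> int:
    """Parse a hex string, with or without the 0x prefix, into an integer."""
    return int(text, 16)


if __name__ == "__main__":
    print(byte_to_hex(255))
    print(hex_to_byte("0x41"))
```

Two hex digits suffice for a byte because 16 times 16 is 256, the number of values a byte can hold.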
decimal/binary numbers
• 0 0
• 1 1
• 2 10
• 3 11
• 4 100
• 5 101
• 6 110
• 7 111
• 8 1000
• 9 1001
• 10 1010
• 11 1011
• 12 1100
• 13 1101
• 14 1110
• 15 1111
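The table above can be reproduced by repeated division by 2, collecting the remainders. This is one common way to do the conversion, sketched here with a hypothetical helper `to_binary`; it is not taken from the slides.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to its binary digits."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next lowest bit
        n //= 2
    # Remainders come out lowest bit first, so reverse them.
    return "".join(reversed(digits))


if __name__ == "__main__":
    for number in range(16):
        print(number, to_binary(number))
```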
Characters
• Much of the information processed by computers is in the form of characters.
• A character only makes sense to a human user with a minimum of cultural background.
• A character is not a glyph.
– ligatures
Representing characters
• Computers don't understand text; they only understand numbers. For computers to be able to process text, there must be a correspondence between numbers and text characters. Such a correspondence is called a coded character set.
• Important examples are
– ASCII
– ISO 8859-1
– cp1252
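The correspondence between characters and numbers can be observed directly in Python with the built-in functions `ord` and `chr`. Note the assumption: Python works with Unicode code points, which agree with ASCII for the values 0 to 127. The function names and the sample string are invented for illustration.

```python
def text_to_codes(text: str) -> list:
    """Map each character to its number in the coded character set."""
    return [ord(character) for character in text]


def codes_to_text(codes: list) -> str:
    """Map numbers back to characters."""
    return "".join(chr(code) for code in codes)


if __name__ == "__main__":
    codes = text_to_codes("Hi!")  # sample text, chosen for illustration
    print(codes)
    print(codes_to_text(codes))
```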
ASCII
• American Standard Code for Information Interchange
• 7-bit character set. There is no such thing as 8-bit ASCII
• 95 printable symbols
• 33 control characters (0-31, 127)
• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list.
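The counts on this slide follow from the 7-bit range. A minimal sketch, with list names chosen for the example:

```python
# All 7-bit codes run from 0 to 127.
ALL_CODES = range(2 ** 7)

# Control characters: 0 to 31, plus 127 (DEL).
CONTROL_CODES = [code for code in ALL_CODES if code < 32 or code == 127]

# Printable symbols: 32 (space) to 126 (tilde).
PRINTABLE_CODES = [code for code in ALL_CODES if 32 <= code <= 126]

if __name__ == "__main__":
    print(len(CONTROL_CODES), "control characters")
    print(len(PRINTABLE_CODES), "printable symbols")
    print("".join(chr(code) for code in PRINTABLE_CODES))
```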
ASCII control codes
• ACK (6, ^F) used to acknowledge receipt of message, NAK (21, ^U) used to signal non-receipt
• CR (13, ^M) is the carriage return
• LF (10, ^J) is the linefeed
• FF (12, ^L) is the form feed (new page)
• BS (8, ^H) is the backspace
• DEL (ALT-127) is delete
• ESC (^[) escape
Different programs use them in different ways, a big pain in the a…
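The caret notation used above (^M, ^J and so on) is related to the code numbers by a simple rule: flip bit 6 (value 64) of the character after the caret. The sketch below assumes that rule; the function name `caret_to_code` is made up for the example.

```python
def caret_to_code(symbol: str) -> int:
    """Return the control code written as ^symbol in caret notation."""
    if len(symbol) != 1:
        raise ValueError("expected the single character after the caret")
    # Flipping the bit worth 64 turns 'M' (77) into 13, '[' (91) into 27.
    return ord(symbol.upper()) ^ 0x40


if __name__ == "__main__":
    for name, symbol in [("ACK", "F"), ("BS", "H"), ("LF", "J"),
                         ("FF", "L"), ("CR", "M"), ("NAK", "U"), ("ESC", "[")]:
        print(name, "^" + symbol, caret_to_code(symbol))
```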
ISO-8859-1
• PCs work with bytes, so manufacturers were free to fill the other 128 positions (128 to 255).
• ISO-8859-1, aka ISO Latin-1, extends ASCII with characters that are used by the western European languages.
• It is the default character set of HTML.
• Positions 128 to 159 are not used for graphic characters.
• cp1252 fills most of these with graphic characters.
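The difference between the two sets shows up when the same bytes are decoded both ways. This sketch uses Python's built-in `latin-1` and `cp1252` codecs; note that Python's `latin-1` codec maps positions 128 to 159 to control characters rather than rejecting them. The sample bytes are chosen for illustration.

```python
# 0xE9 lies above 159, where both sets agree; 0x80 lies in 128 to 159.
raw = bytes([0xE9, 0x80])

as_latin1 = raw.decode("latin-1")
as_cp1252 = raw.decode("cp1252")

if __name__ == "__main__":
    # The first character is the same in both decodings.
    print(as_latin1[0] == as_cp1252[0])
    # The second is a control character in latin-1, a graphic one in cp1252.
    print(repr(as_latin1[1]), repr(as_cp1252[1]))
```

A file written as cp1252 but read as ISO-8859-1 therefore loses characters such as the euro sign and curly quotes.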
Three concepts for characters
• Abstract Character Repertoire: the set of characters to be encoded, e.g., some alphabet or symbol set
• Coded Character Set : a mapping from an abstract character repertoire to a set of non-negative integers
• Character Encoding Scheme: a mapping from a coded character set to a serialized sequence of bytes
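The three concepts can be separated in a short sketch. The repertoire below is a tiny sample invented for the example, and UTF-8 and UTF-16 (big-endian) stand in as two encoding schemes; the slides do not name them at this point.

```python
# 1. Abstract character repertoire: just a set of characters.
repertoire = ["e", "\u00e9", "\u20ac"]  # e, e with acute accent, euro sign

# 2. Coded character set: each character gets a non-negative integer.
code_points = {character: ord(character) for character in repertoire}

# 3. Character encoding scheme: each integer becomes a sequence of bytes.
#    The same code point gives different bytes under different schemes.
encoded = {
    character: {
        "utf-8": character.encode("utf-8"),
        "utf-16-be": character.encode("utf-16-be"),
    }
    for character in repertoire
}

if __name__ == "__main__":
    for character in repertoire:
        print(hex(code_points[character]), encoded[character])
```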
ISO 10646-1
• Defines the Universal Character Set (UCS)
• UCS contains the characters required to represent practically all known languages, even the likes of Gurmukhi, Oriya, Telugu, Bopomofo, Runic.
• There are proposals for more, like Hieroglyphs and Tengwar.
• Note that there are about 6800 known languages.
UCS organization
• ISO 10646 formally defines a 31-bit character set. Each character is represented in 32 bits, i.e. 4 bytes, or 8 hex digits.
• The canonical form of ISO 10646 uses a four-dimensional coding space consisting of 128 groups. Each group consists of 256 planes, with each plane containing 256 rows, each having 256 cells.
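The four-dimensional coding space corresponds to the four bytes of the 32-bit form: one byte each for group, plane, row and cell. A minimal sketch, where the type `UcsPosition` and the function `ucs_position` are names invented for the example:

```python
from typing import NamedTuple


class UcsPosition(NamedTuple):
    group: int
    plane: int
    row: int
    cell: int


def ucs_position(code_point: int) -> UcsPosition:
    """Split a 31-bit code point into its group, plane, row and cell bytes."""
    if not 0 <= code_point < 2 ** 31:
        raise ValueError("code point does not fit in 31 bits")
    return UcsPosition(
        group=(code_point >> 24) & 0xFF,  # most significant byte
        plane=(code_point >> 16) & 0xFF,
        row=(code_point >> 8) & 0xFF,
        cell=code_point & 0xFF,           # least significant byte
    )


if __name__ == "__main__":
    print(ucs_position(0x000020AC))  # the euro sign, as a sample
```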
UCS organization
• The first plane (Plane 0x00) of Group 0x00 is called the Basic Multilingual Plane (BMP). It has been fixed since first publication.
• The subsequent 223 planes (0x01 to 0xDF) of Group 0x00, as well as planes 0x00 to 0xFF in Groups 0x01 to 0x5F are reserved for further standardization.
• The last 32 planes (0xE0 to 0xFF) of Group 0x00, as well as all code positions of 32 groups (0x60 to 0x7F) are reserved for private use.
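The allocation described on this slide can be written as a small classifier. This sketch follows the ranges exactly as given above; later revisions of the standard restrict the code space further, so treat it as an illustration of the slide, not of the current standard. The function name `ucs_region` and the returned labels are invented for the example.

```python
def ucs_region(code_point: int) -> str:
    """Classify a 31-bit code point by the group and plane ranges above."""
    if not 0 <= code_point < 2 ** 31:
        raise ValueError("code point does not fit in 31 bits")
    group = (code_point >> 24) & 0xFF
    plane = (code_point >> 16) & 0xFF
    if group == 0x00:
        if plane == 0x00:
            return "BMP"
        if plane <= 0xDF:
            return "reserved for further standardization"
        return "private use"  # planes 0xE0 to 0xFF of group 0x00
    if group <= 0x5F:
        return "reserved for further standardization"
    return "private use"  # groups 0x60 to 0x7F


if __name__ == "__main__":
    for sample in (0x00000041, 0x00010000, 0x00E00000, 0x60000000):
        print(f"{sample:08X}", ucs_region(sample))
```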
Relationship with legacy sets
• Let U+ followed by four hex digits denote characters in the BMP.
• The UCS characters U+0000 to U+007F are identical to those in ASCII.
• The range U+0000 to U+00FF is identical to ISO 8859-1 (Latin-1).
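Both identities can be checked mechanically: decoding each single byte with the legacy codec should give the character whose code point equals the byte value. The sketch relies on Python's built-in `ascii` and `latin-1` codecs; the variable names are chosen for the example.

```python
# ASCII covers byte values 0x00 to 0x7F.
ascii_matches = all(
    bytes([value]).decode("ascii") == chr(value) for value in range(0x80)
)

# ISO 8859-1 covers byte values 0x00 to 0xFF.
latin1_matches = all(
    bytes([value]).decode("latin-1") == chr(value) for value in range(0x100)
)

if __name__ == "__main__":
    print("ASCII matches U+0000 to U+007F:", ascii_matches)
    print("Latin-1 matches U+0000 to U+00FF:", latin1_matches)
```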
Types of characters in UCS
• Letters
– Base characters
– Ideographic characters
– Combining characters
• Digits
• Extenders
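Python's standard `unicodedata` module gives a rough view of these distinctions. Its general categories are an assumption used here for illustration and do not map one-to-one onto the classes on this slide; extenders in particular have no category of their own. The helper name `describe` and the sample characters are invented for the example.

```python
import unicodedata


def describe(character: str) -> tuple:
    """Return the Unicode general category and combining class."""
    return unicodedata.category(character), unicodedata.combining(character)


if __name__ == "__main__":
    samples = {
        "base character": "a",
        "ideographic character": "\u4e2d",
        "combining character": "\u0301",  # combining acute accent
        "digit": "7",
    }
    for label, character in samples.items():
        print(label, describe(character))

    # A base character plus a combining character can compose into one.
    composed = unicodedata.normalize("NFC", "e\u0301")
    print(len(composed), hex(ord(composed)))
```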
http://openlib.org/home/krichel
Thank you for your attention!