lis508 lecture 1: bits, bytes and characters
Thomas Krichel
2003-09-30
Structure
• Numbers
– Bits
– Bytes
• Character sets
– Coded character set
– Character encoding
Literature, no need to read…
• Norton “new inside the PC” chapter 4
• http://www.danbbs.dk/~erikoest/bb_terms.htm
• http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
• http://www.cl.cam.ac.uk/~mgk25/unicode.html
Information
• Information is best understood as “what it takes to answer a question”.
• The simplest question has a “yes” or “no” answer. Therefore a bit is the natural measure of information.
• Term coined by John Tukey around 1946-47.
• Contraction of “binary digit”.
Usage of bits
• Computers are sometimes classified by the number of bits they can process at one time. "32 bit processor"
• Graphics are also often described by the number of bits used to represent each dot.
bits and bytes
• a bit can take the values 0 or 1, thus it can describe 2 possibilities
• two bits can take the values 00, 01, 10, 11, thus they can describe 2×2 = 4 possibilities
• n bits can encode 2 to the power n possibilities.
• The first chips used to process 8 bits at a time. It became customary to refer to a group of 8 bits as a byte. A byte can encode 2 to the power 8 = 256 possibilities.
• We can use binary numbers just as decimal numbers.
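The counting rule above can be checked with a short Python sketch. The function name `possibilities` is made up for this illustration; it simply lists every pattern of 0s and 1s of a given length.

```python
from itertools import product


def possibilities(n_bits):
    """Return every pattern that n_bits bits can take, as strings of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n_bits)]


print(possibilities(1))       # ['0', '1']
print(possibilities(2))       # ['00', '01', '10', '11']
print(len(possibilities(8)))  # 256, i.e. 2 to the power 8
```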
application of bytes
• IP (Internet Protocol) numbers are used as the addresses of computers on the Internet.
• In IP version 4 (the one that is most commonly used), each IP number has 4 bytes.
• It is represented as x.x.x.x where x is a number between 0 and 255 (why?)
• how many computers can there be on the Internet at any one time?
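A minimal sketch of the arithmetic behind these questions, assuming nothing beyond plain Python. The helper name `dotted_to_int` and the sample address 192.0.2.1 are chosen for illustration only. Note that 2 to the power 32 is only an upper bound: some address ranges are reserved, so the usable number is lower.

```python
def dotted_to_int(address):
    """Turn a dotted IPv4 string such as '192.0.2.1' into one 32-bit integer."""
    parts = [int(p) for p in address.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("expected four numbers between 0 and 255")
    value = 0
    for p in parts:
        value = value * 256 + p  # move the earlier bytes up by one byte, add the next
    return value


print(2 ** 8 - 1)                  # 255, the largest value one byte can hold
print(2 ** 32)                     # 4294967296 distinct 4-byte numbers
print(dotted_to_int("192.0.2.1"))  # 3221225985
```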
decimal/binary numbers
decimal  binary
0        0
1        1
2        10
3        11
4        100
5        101
6        110
7        111
8        1000
9        1001
10       1010
11       1011
12       1100
13       1101
14       1110
15       1111
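The table above can be reproduced by repeatedly dividing by 2 and collecting the remainders. The sketch below is one way to do it; `to_binary` is a name invented here, and Python's built-in `format(n, "b")` and `int(text, 2)` do the same job.

```python
def to_binary(n):
    """Convert a non-negative integer to its binary digits by repeated division by 2."""
    digits = ""
    while True:
        digits = str(n % 2) + digits  # the remainder is the next digit, right to left
        n //= 2
        if n == 0:
            return digits


for n in range(16):
    print(n, to_binary(n))

print(int("1101", 2))  # 13, converting back from binary to decimal
```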
Many bytes
• Larger units are
– A kilobyte is 2 to the power 10 bytes (= 1024 bytes)
– A megabyte is 2 to the power 20 bytes
– A gigabyte is 2 to the power 30 bytes
– A terabyte is 2 to the power 40 bytes
• The prefixes come from ancient Greek words for "thousand", "large", "giant", and "monster", respectively. Kilo dates back to the French revolution; the other prefixes were adopted later.
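As a quick check of the sizes, here is a small sketch using the powers of 2 given on this slide. The dictionary name `UNITS` is an assumption for the example.

```python
# Unit sizes as defined on the slide: each step up multiplies by 2 to the power 10.
UNITS = {
    "kilobyte": 2 ** 10,
    "megabyte": 2 ** 20,
    "gigabyte": 2 ** 30,
    "terabyte": 2 ** 40,
}

for name, size in UNITS.items():
    print(f"one {name} is {size} bytes")
```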
Hex numbers
• A byte is often represented by two hex digits.
• Each hex digit can encode 16 values
• Written 0 to 9, then A B C D E F. F is 15.
• Conventionally prefixed with 0x
• Use Microsoft calculator with scientific notation to convert.
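The conversion can also be done in a few lines of Python rather than a calculator. A minimal sketch; the value 0xA9 is just a sample byte.

```python
byte = 0xA9  # the 0x prefix marks a hex number

print(byte)                 # 169 in decimal
print(format(byte, "02X"))  # A9, the byte written as two hex digits
print(format(byte, "08b"))  # 10101001, the same byte as eight bits

# Each hex digit stands for four bits: A is 1010 and 9 is 1001.
high, low = divmod(byte, 16)
print(high, low)            # 10 9
```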
application of hex numbers
• Media Access Control (MAC) addresses identify the hardware that connects a computer to a network. They are 6-byte numbers, each byte written as 2 hex digits, e.g. 00:60:08:F5:20:A9
• Character numbers that you see when you are inserting a special symbol in Microsoft software, e.g. PowerPoint.
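To make both uses concrete, the sketch below splits the MAC address from the slide into its six bytes and prints the hex number of one character. The function name `parse_mac` is hypothetical, and the character numbers shown are the Unicode numbers that Python reports.

```python
def parse_mac(text):
    """Split a colon-separated MAC address into its six byte values."""
    octets = bytes(int(pair, 16) for pair in text.split(":"))
    if len(octets) != 6:
        raise ValueError("a MAC address has exactly 6 bytes")
    return octets


mac = parse_mac("00:60:08:F5:20:A9")
print(list(mac))  # [0, 96, 8, 245, 32, 169]

# A character number in hex, of the kind shown in "insert symbol" dialogs.
print(format(ord("€"), "04X"))  # 20AC
```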
Characters
• Much of the information processed by computers is in the form of characters.
• A character only makes sense to a human reader with a minimum of cultural background.
• A character is not a glyph.
– ligatures
Information in a computer file
• A file is a piece of data stored on a computer.
• Any file contains a sequence of 0s and 1s, like 1010100101010011110101010101…
• For a computer to make sense of a file, it has to know what type of file it is.
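A small sketch of what "a sequence of 0s and 1s" means in practice. It uses two bytes held in memory rather than a real file, so that it runs anywhere; `as_bits` is a name made up for the example.

```python
def as_bits(data):
    """Show a sequence of bytes as one long string of 0s and 1s."""
    return "".join(format(byte, "08b") for byte in data)


content = b"Hi"          # two bytes, as they might be read from a small text file
print(list(content))     # [72, 105]
print(as_bits(content))  # 0100100001101001
```

The same 16 bits could equally be two characters, one number, or part of a picture, which is why the file type has to be known.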
executable files
• Executable files are files that make the computer do something. For example, the file starts a program, say PowerPoint. An executable for one computer may not run on another.
• Non-executable files hold data that is used by an executable file. We will call them data files. Example: a PowerPoint slides file.
text files
• Many data files contain textual data.
• Textual data is a sequence of characters.
• A character is an elementary symbol that has some meaning
– alphabet letter
– hieroglyph
• Example: email file
• Text files can be read by many computer programs.
non-text files
• Examples of non-text files are
– graphics files
– movie files
– sound files
• Non-text files are not very important in library settings
– there is no way to organize information retrieval for non-text files. They have to be retrieved using a textual surrogate.
– traditional library materials are textual
• We will talk about this later.
Representing characters
• Computers don't understand text, they only understand numbers. For computers to be able to treat text, there must be a correspondence between numbers and text characters. Such a correspondence is called a character set.
• Examples of characters are
– a
– c
– ë
– €
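The correspondence between characters and numbers can be seen directly in Python, whose `ord` and `chr` functions map between the two. This is a sketch under one assumption: the numbers Python reports are Unicode numbers, which is just one possible character set.

```python
# Character to number, for the four example characters on the slide.
for ch in "acë€":
    print(ch, ord(ch), hex(ord(ch)))

# Number back to character.
print(chr(97))      # a
print(chr(0x20AC))  # €
```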
Legacy character sets
• In early days, computers were a lot less powerful than they are today.
• Could only deal with the characters that are most commonly used.
• Such sets are
– ASCII
– ISO-8859-1
– cp1252
ASCII
• American Standard Code for Information Interchange
• 7-bit character set. There is no such thing as 8-bit ASCII
• 95 printable symbols
• 33 control characters (0-31, 127)
• http://www.ccmr.cornell.edu/helpful_data/ascii2.html has a list up to 127
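The counts on this slide can be verified with a short sketch. The variable names are chosen for illustration; the ranges are the ones stated above.

```python
# 7 bits give 2 to the power 7 = 128 positions, numbered 0 to 127.
control = [code for code in range(128) if code < 32 or code == 127]
printable = [code for code in range(128) if 32 <= code <= 126]

print(len(control))    # 33
print(len(printable))  # 95, counting the space character
print("".join(chr(code) for code in printable))

# A character outside the 7-bit range cannot be stored as ASCII.
try:
    "ë".encode("ascii")
except UnicodeEncodeError:
    print("ë is not an ASCII character")
```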
some ASCII control characters
• CR (13, ^M) is the carriage return
• LF (10, ^J) is the linefeed
• FF (12, ^L) is the form feed (new page)
• BS (8, ^H) is the backspace
• DEL (127, ALT-127) is delete
• ESC (27, ^[) escape
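The ^ notation on this slide follows a simple rule for the codes 0 to 31: the letter after the caret is the character 64 positions further up the table. A minimal sketch, with `caret` as a made-up helper name:

```python
def caret(code):
    """Caret notation for a control code between 0 and 31, e.g. 13 -> '^M'."""
    if not 0 <= code <= 31:
        raise ValueError("only codes 0 to 31 are handled here")
    return "^" + chr(code + 64)


for name, code in [("CR", 13), ("LF", 10), ("FF", 12), ("BS", 8), ("ESC", 27)]:
    print(name, code, caret(code))

# Python's string escapes name some of the same characters.
print(ord("\r"), ord("\n"), ord("\f"), ord("\b"))  # 13 10 12 8
```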
ISO-8859-1
• ISO-8859-1, aka ISO-latin-1 extends ASCII with characters that are commonly used by the western European languages.
• It is the default character set of html.
• Positions 128 to 159 are not used for printable characters.
• Cp1252 fills most of these with graphic characters. It is a Microsoft character set.
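The difference between the two sets shows up when the same bytes are read under each of them. The sketch below uses Python's built-in `latin-1` and `cp1252` codecs; the two sample bytes are chosen for illustration.

```python
raw = bytes([0x80, 0xE9])  # position 128, then position 233

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print(as_cp1252)        # €é : cp1252 puts the euro sign at position 128
print(repr(as_latin1))  # '\x80é' : in Latin-1 position 128 is not a printable character
```

Position 233 gives é in both sets; only the range 128 to 159 differs.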
This is not enough
• There are around 6800 different languages.
• Some of these languages use character sets that are not finite, i.e. folks can make up new characters out of existing ones!
• Setting up a character set for all languages is almost impossible.
ISO 10646-1
• Defines the Universal Character Set (UCS)
• UCS contains the characters needed to write most known languages, even the likes of Oriya, Telugu, Bopomofo, Runic.
• ISO 10646 formally defines a 31-bit character set. Characters are represented as 32 bits, i.e. 4 bytes, or 8 hex digits.
• Not finished.
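A sketch of the 4-byte representation, assuming Python's `utf-32-be` codec as a stand-in for the fixed-width UCS form described above (the codec name is Python's, not the standard's).

```python
for ch in "a€":
    encoded = ch.encode("utf-32-be")    # 4 bytes per character, most significant byte first
    print(ch, len(encoded), encoded.hex())

# a 4 00000061
# € 4 000020ac
```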
Unicode
• ISO is an international federation of national standards bodies. Slow and bureaucratic.
• Industry has come together to work on Unicode, originally designed as a 2-byte character set.
• With some minor exceptions, the Unicode characters are the same as the first 65536 characters in UCS.
• Much better documented standard.
Unicode and legacy sets
• The first 128 characters are identical to those in ASCII
• The next 128 characters are identical to ISO 8859-1 (Latin-1).
• Unicode is well documented and the Unicode book can be downloaded from the Internet. A must-have for the serious digital librarian.
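The overlap with the legacy sets can be checked mechanically. In this sketch the helper `matches_unicode` is a hypothetical name; it decodes every byte value of a legacy set with Python's codec and compares the result with the Unicode character of the same number.

```python
def matches_unicode(codec, count):
    """True if the first `count` positions of `codec` equal Unicode characters 0..count-1."""
    decoded = bytes(range(count)).decode(codec)
    return decoded == "".join(chr(number) for number in range(count))


print(matches_unicode("ascii", 128))    # True
print(matches_unicode("latin-1", 256))  # True
print(matches_unicode("cp1252", 128))   # True: cp1252 also agrees in the ASCII range
```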
Politics…
• Does it make sense to use Unicode rather than, say, ISO-latin-1?
• Many commercial pieces of software have data files that contain character data interspersed with non-character data. Is that good?
http://openlib.org/home/krichel
Thank you for your attention!