
CS-2852 Data Structures, Lecture 13B

Andrew J. Wozniewicz

Image copyright © 2010 andyjphoto.com

CS-2852 Data Structures, Andrew J. Wozniewicz

Agenda
• Encodings
• Morse Code
• Huffman Trees


Character Encodings
• UNICODE (6.0): Different Encodings
– UTF-8
• 8 bits (1 byte) per character for ASCII chars
• Up to 4 bytes per character for other chars
– UTF-16
• 2 bytes per character
• Some characters encoded with 4 bytes
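These byte counts can be checked directly in Python (a minimal illustration; the sample characters are my own choices, not from the slides):

```python
# Byte lengths of sample characters under the two Unicode encodings above.
# "utf-16-be" is used so no byte-order mark inflates the count.
for ch in ["A", "é", "€", "𝄞"]:
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-be"))
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

An ASCII letter costs 1 byte in UTF-8 but 2 in UTF-16; a character outside the Basic Multilingual Plane (like the musical symbol) costs 4 bytes in both.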


Character Encodings
• Standard ASCII
– 7 bits per character
– 128 distinct values
• Extended ASCII
– 8 bits per character
– 256 distinct values
• EBCDIC (“manifestation of purest evil” – E. Raymond)
– 8 bits per character
– IBM-mainframe specific


Fixed-Length Encoding
• Extended ASCII: 8 bits per character (UTF-8 matches this for ASCII characters only; it is otherwise variable-length)
• Always the same number of bits per symbol
• ⌈log₂ n⌉ bits per symbol to distinguish among n symbols
• A more efficient encoding is possible if not all characters are equally likely to appear
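The per-symbol bit count is a one-line computation (a quick sketch; the function name is mine):

```python
import math

def fixed_length_bits(n):
    """Minimum bits per symbol for a fixed-length code over n distinct symbols."""
    return math.ceil(math.log2(n))

print(fixed_length_bits(128))  # 7 bits: standard ASCII
print(fixed_length_bits(256))  # 8 bits: extended ASCII
```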


Variable-Length Encoding


Morse Code
• The length of each character in Morse is approximately inversely proportional to its frequency of occurrence in English.
• The most common letter in English, "E," has the shortest code: a single dot.
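A few table entries make the variable-length idea concrete (a minimal sketch with a deliberately partial Morse table):

```python
# A handful of International Morse Code entries (partial table, for illustration).
MORSE = {"E": ".", "T": "-", "A": ".-", "N": "-.", "I": "..", "S": "...", "O": "---"}

def to_morse(text):
    """Encode text one letter at a time; common letters get shorter codes."""
    return " ".join(MORSE[c] for c in text.upper())

print(to_morse("EAT"))  # ". .- -"
```

Note that "E" and "T", the two most frequent English letters, have one-signal codes, while rarer letters run to three or four signals.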


Prefix Codes
• Design the code so that the complete code for any symbol is not the beginning (prefix) of the code for any other symbol.
• If the relative frequencies of symbols in a message are known, an efficient prefix encoding can be found.
• Huffman encoding is a prefix code.
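Because no code word is a prefix of another, a left-to-right scan decodes a bit string unambiguously. A minimal sketch, using a hypothetical three-symbol prefix code:

```python
def decode_prefix(bits, code):
    """Decode a bit string with a prefix code: since no complete code word
    is the prefix of another, a greedy left-to-right scan is unambiguous."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # buf is a complete code word: emit it
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits did not form a code word")
    return "".join(out)

code = {"A": "0", "B": "10", "C": "11"}   # hypothetical prefix code
print(decode_prefix("01011", code))  # "ABC"
```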


Huffman Encoding
• Lossless data compression algorithm
• "Minimum-redundancy" code by David Huffman (1952)
• Variable-length code table
• The technique works by creating a binary tree of nodes.
• The symbols with the lowest frequencies appear farthest from the root.
• The tree can itself be efficiently encoded and attached to the message to enable decoding.


Example of a Huffman Tree

http://mitpress.mit.edu/sicp/full-text/sicp/book/node41.html

left=0, right=1


Huffman Algorithm
• Begin with the set of leaf nodes, containing the symbols and their frequencies.
• Find the two nodes with the lowest weights and merge them into a new node that has these two nodes as its left and right branches.
– The weight of the new node is the sum of the two weights.
• Remove the two merged nodes from the set and replace them with the new node.
• Repeat until a single node, the root, remains.


Example of Huffman Tree-Building

Initial leaves: {(A 8) (B 3) (C 1) (D 1) (E 1) (F 1) (G 1) (H 1)}
Merge: {(A 8) (B 3) ({C D} 2) (E 1) (F 1) (G 1) (H 1)}
Merge: {(A 8) (B 3) ({C D} 2) ({E F} 2) (G 1) (H 1)}
Merge: {(A 8) (B 3) ({C D} 2) ({E F} 2) ({G H} 2)}
Merge: {(A 8) (B 3) ({C D} 2) ({E F G H} 4)}
Merge: {(A 8) ({B C D} 5) ({E F G H} 4)}
Merge: {(A 8) ({B C D E F G H} 9)}
Final merge: {({A B C D E F G H} 17)}
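The merge sequence above can be reproduced with a small sketch that uses Python's heapq as the priority queue of pending nodes (a common implementation choice; the details are mine, not from the slides):

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Build a Huffman tree bottom-up and return each symbol's code length.

    Repeatedly pops the two lowest-weight nodes and merges them, as on
    the slide. Each node is tracked as a map {symbol: depth so far}."""
    tick = count()  # tie-breaker so the heap never compares the dicts
    heap = [(w, next(tick), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # two lowest weights
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for part in (d1, d2) for s, d in part.items()}
        heapq.heappush(heap, (w1 + w2, next(tick), merged))
    return heap[0][2]

lengths = huffman_code_lengths(
    {"A": 8, "B": 3, "C": 1, "D": 1, "E": 1, "F": 1, "G": 1, "H": 1})
print(sorted(lengths.items()))
```

For these frequencies, A gets a 1-bit code, B a 3-bit code, and the six rare symbols 4 bits each: 41 bits total for the 17-symbol message, versus 51 bits for a fixed 3-bit-per-symbol code.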


Summary
• Encodings
• Morse Code
• Huffman Trees

Questions?

Image copyright © 2010 andyjphoto.com