Huffman Edited

Upload: francis-mikel-limbag

Post on 06-Apr-2018


TRANSCRIPT

  • 8/3/2019 Huffman Edited


    Representation of Strings

    ENCODING

    How much space do we need?

    Assume we represent every character.

How many bits to represent each character? Depends on the size of the alphabet.


    Bits to encode a character

Two-character alphabet {A, B}, one bit per character:

0 = A, 1 = B

Four-character alphabet {A, B, C, D}, two bits per character:

00 = A, 01 = B, 10 = C, 11 = D

Six-character alphabet {A, B, C, D, E, F}, three bits per character:

000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused
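In general, a fixed-length code needs the smallest n with 2^n at least the alphabet size. A minimal sketch (the function name is my own):

```python
import math

def bits_per_character(alphabet_size: int) -> int:
    # Smallest n with 2^n >= alphabet_size: every character
    # gets a distinct fixed-length n-bit code.
    return math.ceil(math.log2(alphabet_size))

# Matches the alphabets above:
print(bits_per_character(2))  # 1 bit for {A, B}
print(bits_per_character(4))  # 2 bits for {A, B, C, D}
print(bits_per_character(6))  # 3 bits for {A, B, C, D, E, F}
```

Note the six-character alphabet wastes two of the eight three-bit patterns, which is why 110 and 111 are unused above.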


    Generally

The bit sequence representing a character is called the encoding of the character.

There are 2^n different bit sequences of length n.

If we use the same number of bits for each character, then the length of the encoding of a word equals the number of bits per character times the number of characters in the word.


Can we do better? If the alphabet is very small, we might use run-length encoding.

Taking a step back: why do we need compression?

The rate of creation of image and video data is huge and needs to be reduced:

image data from a digital camera today: 1k by 1.5k is common = 1.5 Mbytes

need 2k by 3k to equal a 35mm slide = 6 Mbytes

video at even a low resolution of 512 by 512 and 3 bytes per pixel,

30 frames/second


Compression basics: video data rate

23.6 Mbytes/second

2 hours of video = 169 gigabytes

mpeg-1 compresses 23.6 Mbytes/second down to 187 kbytes per second,

169 gigabytes down to 1.3 gigabytes

compression is essential for both storage and transmission of data

compression is very widely used: jpeg, gif for single images

mpeg 1, 2, 3, 4 for video sequences

zip for computer data

mp3 for sound


    Basics of compression

character = basic data unit in the input stream -- represents a byte, bit, etc.

strings = sequences of characters

encoding = compression, decoding = decompression

codeword = data element used to represent input characters or character strings

codetable = list of codewords


Codewords: the encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce;

the decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce.

[Diagram: Input Data Stream -> Encoder -> Data Storage or Transmission -> Decoder -> Output Data Stream]


    Basic definitions

compression ratio = size of original data / size of compressed data

basically, the higher the compression ratio, the better

lossless compression: output data is exactly the same as input data;

essential for encoding computer-processed data

lossy compression: output data is not the same as input data;

acceptable for data that is only viewed or heard
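The compression-ratio definition above is a one-liner; a sketch, checked against the mpeg-1 figures from the earlier slide:

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    # Compression ratio = size of original data / size of compressed data.
    return original_size / compressed_size

# mpeg-1 example from the earlier slide, both sizes in kbytes per second:
ratio = compression_ratio(23_600, 187)
print(round(ratio))  # roughly 126:1
```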


Lossless versus lossy: the human visual system is less sensitive to high-frequency losses and to losses in color,

so lossy compression is acceptable for visual data; the degree of loss is usually a parameter of the compression algorithm

tradeoff - loss versus compression:

higher compression => more loss

lower compression => less loss

Symmetric versus asymmetric:

symmetric: encoding time == decoding time;

essential for real-time applications (i.e. video or audio on demand)

asymmetric: encoding time >> decoding time;

ok for write-once, read-many situations


Entropy encoding: compression that does not take into account what is being compressed;

normally also lossless.

Most common types of entropy encoding:

run length encoding

Huffman encoding

modified Huffman (fax)

Lempel-Ziv

Source encoding: takes into account the type of data (i.e. visual);

normally lossy, but can also be lossless.

Most common types in use:

    JPEG, GIF = single images

    MPEG = sequence of images (video)

    MP3 = sound sequence


Run length encoding: one of the simplest and earliest types of compression.

Takes account of repeating data (called runs); runs are represented by a count along with the original data,

e.g. AAAABB => 4A2B

Do you run length encode a single character? No - use a special prefix character to represent the start of a run;

a run is then represented as the prefix character, a count, and the repeated character;

the prefix character itself is always encoded as a run of length 1.

We want a prefix character that is not too common.

Run length encoding is lossless and has fixed-length codewords.
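A minimal run-length codec along the lines described above. The prefix character `#`, the run-length threshold of 4, and the assumption of letters-only input are all my own choices for this sketch (decimal counts would be ambiguous if a run character could itself be a digit):

```python
PREFIX = "#"  # assumed escape character; pick one that is rare in the data

def rle_encode(text: str) -> str:
    # Runs of 4+ identical characters (and every occurrence of the
    # prefix character) become PREFIX + count + char; the rest is literal.
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        if run >= 4 or text[i] == PREFIX:
            out.append(f"{PREFIX}{run}{text[i]}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

def rle_decode(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == PREFIX:
            i += 1
            digits = ""
            while data[i].isdigit():  # read the count
                digits += data[i]
                i += 1
            out.append(data[i] * int(digits))  # then the run character
            i += 1
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

s = "AAAABBWWWWWWWWWW"
print(rle_encode(s))  # #4ABB#10W
assert rle_decode(rle_encode(s)) == s
```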


    Run length encoding

works best for images with a solid background;

a good example of such an image is a cartoon

does not work as well for natural images

does not work well for English text; however, it is almost always a part of a larger compression system


What if the string we encode doesn't use all the letters in the alphabet? We could use a smaller code,

but then we also need to store / transmit the mapping from encodings to characters,

and that mapping is typically close to the size of the alphabet.


Huffman Encoding: assumes encoding on a per-character basis.

Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings;

this requires assigning longer codes to rarely used characters.

Problem: when decoding, we need to know how many bits to read off for each character.

Solution: choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.
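The prefix property is exactly what makes decoding unambiguous: the decoder can emit a character the moment its buffered bits match a codeword. A sketch, using the code table from the tree example on the later slides:

```python
# Prefix-free code table from the A/T/R/N/E tree example in these slides.
CODE = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
DECODE = {v: k for k, v in CODE.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:            # no codeword is a prefix of another,
            out.append(DECODE[buf])  # so the first match is the right one
            buf = ""
    return "".join(out)

print(decode("1000001"))  # EAT: 1 -> E, 000 -> A, 001 -> T
```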


Assume we know the frequency of each character in the input stream;

then encode each character as a variable-length bit string, with the length inversely proportional to the character frequency.

Variable-length codewords have been used before; an early example is Morse code.

Huffman produced an algorithm for assigning codewords optimally:

input = probabilities of occurrence of each input character (frequencies of occurrence)

output is a binary tree:

each leaf node is an input character, each branch is a zero or one bit;

the codeword for a leaf is the concatenation of bits for the path from the root to the leaf;

a codeword is a variable-length bit string.

This gives a very good compression ratio (optimal).


    Huffman encoding

Basic algorithm:

Mark all characters as free tree nodes.

While there is more than one free node:

Take the two nodes with the lowest frequency of occurrence.

Create a new tree node with these nodes as children, and with frequency equal to the sum of their frequencies.

Remove the two children from the free node list.

Add the new parent to the free node list.

The last remaining free node is the root of the binary tree used for encoding/decoding.
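The loop above maps directly onto a min-heap. A sketch (tie-breaking by insertion order is my own choice; the algorithm only requires taking *some* two lowest-frequency nodes):

```python
import heapq
from itertools import count

def build_huffman_codes(freqs: dict) -> dict:
    # Repeatedly merge the two lowest-frequency free nodes until
    # one root remains, then read codes off the tree.
    tiebreak = count()  # keeps heap entries comparable on equal frequency
    # each heap entry: (frequency, tiebreak, tree); a tree is a char or a pair
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"  # single-character edge case
        else:
            walk(node[0], prefix + "0")  # left branch = 0
            walk(node[1], prefix + "1")  # right branch = 1
    walk(heap[0][2], "")
    return codes

# Frequencies from the worked example on the following slides:
print(build_huffman_codes({"A": 3, "T": 4, "R": 4, "E": 5}))
```

With these frequencies the merges are A+T = 7, R+E = 9, then 7+9 = 16, giving the two-bit codes shown at the end of the worked example.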


A Huffman Encoding Tree

[Figure: a Huffman encoding tree with internal node weights and leaves A, T, R, N, E; each left branch is labeled 0, each right branch 1.]


[Figure: the same Huffman encoding tree, read off to produce the codewords below.]

    A 000

    T 001

    R 010

    N 011

    E 1


Weighted path length

A 000

T 001

R 010

N 011

E 1

Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)

= (3 * 3) + (3 * 3) + (3 * 3) + (3 * 3) + (1 * 9)

= 9 + 9 + 9 + 9 + 9 = 45

(with frequencies f(A) = f(T) = f(R) = f(N) = 3 and f(E) = 9)

Claim (proof in text): no other encoding can result in a shorter weighted path length.
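The weighted path length is just the total number of bits needed to encode the whole input. A sketch, with frequencies chosen to be consistent with the slide's total of 45:

```python
def weighted_path_length(codes: dict, freqs: dict) -> int:
    # Sum over all characters of code length times frequency --
    # i.e. the total bits to encode the whole input.
    return sum(len(codes[ch]) * f for ch, f in freqs.items())

# Codes from the slide above; frequencies implied by its total of 45.
codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
freqs = {"A": 3, "T": 3, "R": 3, "N": 3, "E": 9}
print(weighted_path_length(codes, freqs))  # 45
```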


Building the Huffman Tree

[Figure: the initial free nodes - leaves A (3), T (4), R (4), E (5).]


Building the Huffman Tree

[Figure: A (3) and T (4), the two lowest-frequency nodes, are merged under a new parent of weight 7; R (4) and E (5) remain free.]




Building the Huffman Tree

[Figure: R (4) and E (5) are merged under a new parent of weight 9, alongside the earlier node of weight 7.]




Building the Huffman Tree

[Figure: the nodes of weight 7 and 9 are merged under the root, of weight 16.]


Building the Huffman Tree

[Figure: the finished tree with every left branch labeled 0 and every right branch 1, giving codewords A = 00, T = 01, R = 10, E = 11.]


Huffman example

a series of colors in an 8 by 8 screen;

colors are red, green, cyan, blue, magenta, yellow, and black

    sequence is

    rkkkkkkk gggmcbrr

    kkkrrkkk bbbmybbr

    kkrrrrgg gggggggr

    kkbcccrr grrrrgrr


    Another Huffman example

Color        Frequency

Black (K)    19

Red (R)      17

Green (G)    16

Blue (B)      5

Cyan (C)      4

Magenta (M)   2

Yellow (Y)    1


    Another Huffman Example

    Red = 00 Blue = 111 Magenta = 11010

    Black = 01 Cyan = 1100 Yellow = 11011

    Green = 10
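As a check on this code table, we can compare its total cost against a fixed 3-bit-per-pixel code, using the frequency table from the previous slide:

```python
# Cost of the variable-length code above versus a fixed 3-bit code,
# using the frequency table from the previous slide (64 pixels total).
freqs = {"K": 19, "R": 17, "G": 16, "B": 5, "C": 4, "M": 2, "Y": 1}
codes = {"R": "00", "K": "01", "G": "10", "B": "111",
         "C": "1100", "M": "11010", "Y": "11011"}

huffman_bits = sum(len(codes[ch]) * f for ch, f in freqs.items())
fixed_bits = 3 * sum(freqs.values())  # 3 bits suffice for 7 colors
print(huffman_bits, fixed_bits)  # 150 192 - about 1.3x better than fixed
```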



    Fixed versus variable length codewords

run length codewords are fixed length;

Huffman codewords are variable length, with length inversely proportional to frequency.

All variable-length compression schemes must have the prefix property:

one code cannot be the prefix of another.

The binary tree structure guarantees that this is the case (a leaf node is a leaf node!).


Huffman encoding advantages:

maximum compression ratio, assuming correct probabilities of occurrence;

easy to implement and fast.

Disadvantages: needs two passes for both encoder and decoder:

one to create the frequency distribution,

one to encode/decode the data.

Can avoid this by sending the tree (takes time) or by having unchanging frequencies.

Modified Huffman encoding: if we know the frequencies of occurrence, then Huffman works very well;

consider the case of a fax: mostly long white spaces with short bursts of black.

Do the following:

run length encode each string of bits on a line;

Huffman encode these run length codewords,

using a predefined frequency distribution -

a combination of run length, then Huffman.