Huffman Edited

Upload: francis-mikel-limbag

Post on 06-Apr-2018


TRANSCRIPT

  • 8/3/2019 Huffman Edited


    Representation of Strings

    ENCODING

    How much space do we need?

    Assume we represent every character.

How many bits to represent each character? Depends on the size of the alphabet.


    Bits to encode a character

Two-character alphabet {A, B}, one bit per character:

0 = A, 1 = B

Four-character alphabet {A, B, C, D}, two bits per character:

00 = A, 01 = B, 10 = C, 11 = D

Six-character alphabet {A, B, C, D, E, F}, three bits per character:

000 = A, 001 = B, 010 = C, 011 = D, 100 = E, 101 = F, 110 = unused, 111 = unused
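In general, a fixed-length code needs the smallest n with 2^n at least the alphabet size. A minimal sketch (the function name is my own):

```python
import math

def bits_per_character(alphabet_size: int) -> int:
    # Smallest n with 2^n >= alphabet_size: every character
    # gets a distinct fixed-length n-bit code.
    return math.ceil(math.log2(alphabet_size))

# Matches the alphabets above:
print(bits_per_character(2))  # 1 bit for {A, B}
print(bits_per_character(4))  # 2 bits for {A, B, C, D}
print(bits_per_character(6))  # 3 bits for {A, B, C, D, E, F}
```

Note the six-character alphabet wastes two of the eight three-bit patterns, which is why 110 and 111 are unused above.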


    Generally

The bit sequence representing a character is called the encoding of the character.

There are 2^n different bit sequences of length n.

If we use the same number of bits for each character, then the length of the encoding of a word equals the number of bits per character times the number of characters in the word.


Can we do better? If the alphabet is very small, we might use run-length encoding.

Taking a step back: why do we need compression?

The rate of creation of image and video data is huge and needs to be reduced:

image data from a digital camera today: 1k by 1.5k is common = 1.5 Mbytes

need 2k by 3k to equal a 35mm slide = 6 Mbytes

video at even a low resolution of 512 by 512 and 3 bytes per pixel,

30 frames/second


Compression basics: video data rate

23.6 Mbytes/second

2 hours of video = 169 gigabytes

mpeg-1 compresses 23.6 Mbytes/second down to 187 kbytes per second,

169 gigabytes down to 1.3 gigabytes

compression is essential for both storage and transmission of data

compression is very widely used: jpeg, gif for single images

mpeg 1, 2, 3, 4 for video sequences

zip for computer data

mp3 for sound


    Basics of compression

character = basic data unit in the input stream -- represents a byte, bit, etc.

strings = sequences of characters

encoding = compression, decoding = decompression

codeword = data element used to represent input characters or character strings

codetable = list of codewords


Codewords: the encoder/compressor takes characters/strings as input and uses the codetable to decide which codewords to produce;

the decoder/decompressor takes codewords as input and uses the same codetable to decide which characters/strings to produce.

[Diagram: Input Data Stream -> Encoder -> Data Storage or Transmission -> Decoder -> Output Data Stream]


    Basic definitions

compression ratio = size of original data / size of compressed data

basically, the higher the compression ratio, the better

lossless compression: output data is exactly the same as input data;

essential for encoding computer-processed data

lossy compression: output data is not the same as input data;

acceptable for data that is only viewed or heard
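The compression-ratio definition above is a one-liner; a sketch, checked against the mpeg-1 figures from the earlier slide:

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    # Compression ratio = size of original data / size of compressed data.
    return original_size / compressed_size

# mpeg-1 example from the earlier slide, both sizes in kbytes per second:
ratio = compression_ratio(23_600, 187)
print(round(ratio))  # roughly 126:1
```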


Lossless versus lossy: the human visual system is less sensitive to high-frequency losses and to losses in color,

so lossy compression is acceptable for visual data; the degree of loss is usually a parameter of the compression algorithm

tradeoff - loss versus compression:

higher compression => more loss

lower compression => less loss

Symmetric versus asymmetric:

symmetric: encoding time == decoding time;

essential for real-time applications (i.e. video or audio on demand)

asymmetric: encoding time >> decoding time;

ok for write-once, read-many situations


Entropy encoding: compression that does not take into account what is being compressed;

normally also lossless.

Most common types of entropy encoding:

run length encoding

Huffman encoding

modified Huffman (fax)

Lempel-Ziv

Source encoding: takes into account the type of data (i.e. visual);

normally lossy, but can also be lossless.

Most common types in use:

    JPEG, GIF = single images

    MPEG = sequence of images (video)

    MP3 = sound sequence


Run length encoding: one of the simplest and earliest types of compression.

Takes account of repeating data (called runs); runs are represented by a count along with the original data,

e.g. AAAABB => 4A2B

Do you run length encode a single character? No - use a special prefix character to represent the start of a run;

a run is then represented as the prefix character, a count, and the repeated character;

the prefix character itself is always encoded as a run of length 1.

We want a prefix character that is not too common.

Run length encoding is lossless and has fixed-length codewords.
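A minimal run-length codec along the lines described above. The prefix character `#`, the run-length threshold of 4, and the assumption of letters-only input are all my own choices for this sketch (decimal counts would be ambiguous if a run character could itself be a digit):

```python
PREFIX = "#"  # assumed escape character; pick one that is rare in the data

def rle_encode(text: str) -> str:
    # Runs of 4+ identical characters (and every occurrence of the
    # prefix character) become PREFIX + count + char; the rest is literal.
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        if run >= 4 or text[i] == PREFIX:
            out.append(f"{PREFIX}{run}{text[i]}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

def rle_decode(data: str) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == PREFIX:
            i += 1
            digits = ""
            while data[i].isdigit():  # read the count
                digits += data[i]
                i += 1
            out.append(data[i] * int(digits))  # then the run character
            i += 1
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

s = "AAAABBWWWWWWWWWW"
print(rle_encode(s))  # #4ABB#10W
assert rle_decode(rle_encode(s)) == s
```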


    Run length encoding

works best for images with a solid background;

a good example of such an image is a cartoon

does not work as well for natural images

does not work well for English text; however, it is almost always a part of a larger compression system


What if the string we encode doesn't use all the letters in the alphabet? We could use a smaller code,

but then we also need to store / transmit the mapping from encodings to characters,

and that mapping is typically close to the size of the alphabet.


Huffman Encoding: assumes encoding on a per-character basis.

Observation: assigning shorter codes to frequently used characters can result in overall shorter encodings of strings;

this requires assigning longer codes to rarely used characters.

Problem: when decoding, we need to know how many bits to read off for each character.

Solution: choose an encoding that ensures that no character encoding is the prefix of any other character encoding. An encoding tree has this property.
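The prefix property is exactly what makes decoding unambiguous: the decoder can emit a character the moment its buffered bits match a codeword. A sketch, using the code table from the tree example on the later slides:

```python
# Prefix-free code table from the A/T/R/N/E tree example in these slides.
CODE = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
DECODE = {v: k for k, v in CODE.items()}

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:            # no codeword is a prefix of another,
            out.append(DECODE[buf])  # so the first match is the right one
            buf = ""
    return "".join(out)

print(decode("1000001"))  # EAT: 1 -> E, 000 -> A, 001 -> T
```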


Assume we know the frequency of each character in the input stream;

then encode each character as a variable-length bit string, with the length inversely proportional to the character frequency.

Variable-length codewords have been used before; an early example is Morse code.

Huffman produced an algorithm for assigning codewords optimally:

input = probabilities of occurrence of each input character (frequencies of occurrence)

output is a binary tree:

each leaf node is an input character, each branch is a zero or one bit;

the codeword for a leaf is the concatenation of bits for the path from the root to the leaf;

a codeword is a variable-length bit string.

This gives a very good compression ratio (optimal).


    Huffman encoding

Basic algorithm:

Mark all characters as free tree nodes.

While there is more than one free node:

Take the two nodes with the lowest frequency of occurrence.

Create a new tree node with these nodes as children, and with frequency equal to the sum of their frequencies.

Remove the two children from the free node list.

Add the new parent to the free node list.

The last remaining free node is the root of the binary tree used for encoding/decoding.
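The loop above maps directly onto a min-heap. A sketch (tie-breaking by insertion order is my own choice; the algorithm only requires taking *some* two lowest-frequency nodes):

```python
import heapq
from itertools import count

def build_huffman_codes(freqs: dict) -> dict:
    # Repeatedly merge the two lowest-frequency free nodes until
    # one root remains, then read codes off the tree.
    tiebreak = count()  # keeps heap entries comparable on equal frequency
    # each heap entry: (frequency, tiebreak, tree); a tree is a char or a pair
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"  # single-character edge case
        else:
            walk(node[0], prefix + "0")  # left branch = 0
            walk(node[1], prefix + "1")  # right branch = 1
    walk(heap[0][2], "")
    return codes

# Frequencies from the worked example on the following slides:
print(build_huffman_codes({"A": 3, "T": 4, "R": 4, "E": 5}))
```

With these frequencies the merges are A+T = 7, R+E = 9, then 7+9 = 16, giving the two-bit codes shown at the end of the worked example.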


A Huffman Encoding Tree

[Figure: a Huffman encoding tree with internal node weights and leaves A, T, R, N, E; each left branch is labeled 0, each right branch 1.]


[Figure: the same Huffman encoding tree, read off to produce the codewords below.]

    A 000

    T 001

    R 010

    N 011

    E 1


Weighted path length

A 000

T 001

R 010

N 011

E 1

Weighted path = Len(code(A)) * f(A) + Len(code(T)) * f(T) + Len(code(R)) * f(R) + Len(code(N)) * f(N) + Len(code(E)) * f(E)

= (3 * 3) + (3 * 3) + (3 * 3) + (3 * 3) + (1 * 9)

= 9 + 9 + 9 + 9 + 9 = 45

(with frequencies f(A) = f(T) = f(R) = f(N) = 3 and f(E) = 9)

Claim (proof in text): no other encoding can result in a shorter weighted path length.
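The weighted path length is just the total number of bits needed to encode the whole input. A sketch, with frequencies chosen to be consistent with the slide's total of 45:

```python
def weighted_path_length(codes: dict, freqs: dict) -> int:
    # Sum over all characters of code length times frequency --
    # i.e. the total bits to encode the whole input.
    return sum(len(codes[ch]) * f for ch, f in freqs.items())

# Codes from the slide above; frequencies implied by its total of 45.
codes = {"A": "000", "T": "001", "R": "010", "N": "011", "E": "1"}
freqs = {"A": 3, "T": 3, "R": 3, "N": 3, "E": 9}
print(weighted_path_length(codes, freqs))  # 45
```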


Building the Huffman Tree

[Figure: the initial free nodes - leaves A (3), T (4), R (4), E (5).]


Building the Huffman Tree

[Figure: A (3) and T (4), the two lowest-frequency nodes, are merged under a new parent of weight 7; R (4) and E (5) remain free.]




Building the Huffman Tree

[Figure: R (4) and E (5) are merged under a new parent of weight 9, alongside the earlier node of weight 7.]




Building the Huffman Tree

[Figure: the nodes of weight 7 and 9 are merged under the root, of weight 16.]


Building the Huffman Tree

[Figure: the finished tree with every left branch labeled 0 and every right branch 1, giving codewords A = 00, T = 01, R = 10, E = 11.]


Huffman example

a series of colors in an 8 by 8 screen;

colors are red, green, cyan, blue, magenta, yellow, and black

    sequence is

    rkkkkkkk gggmcbrr

    kkkrrkkk bbbmybbr

    kkrrrrgg gggggggr

    kkbcccrr grrrrgrr


    Another Huffman example

Color        Frequency

Black (K)    19

Red (R)      17

Green (G)    16

Blue (B)      5

Cyan (C)      4

Magenta (M)   2

Yellow (Y)    1


    Another Huffman Example

    Red = 00 Blue = 111 Magenta = 11010

    Black = 01 Cyan = 1100 Yellow = 11011

    Green = 10
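As a check on this code table, we can compare its total cost against a fixed 3-bit-per-pixel code, using the frequency table from the previous slide:

```python
# Cost of the variable-length code above versus a fixed 3-bit code,
# using the frequency table from the previous slide (64 pixels total).
freqs = {"K": 19, "R": 17, "G": 16, "B": 5, "C": 4, "M": 2, "Y": 1}
codes = {"R": "00", "K": "01", "G": "10", "B": "111",
         "C": "1100", "M": "11010", "Y": "11011"}

huffman_bits = sum(len(codes[ch]) * f for ch, f in freqs.items())
fixed_bits = 3 * sum(freqs.values())  # 3 bits suffice for 7 colors
print(huffman_bits, fixed_bits)  # 150 192 - about 1.3x better than fixed
```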



    Fixed versus variable length codewords

run length codewords are fixed length;

Huffman codewords are variable length, with length inversely proportional to frequency.

All variable-length compression schemes must have the prefix property:

one code cannot be the prefix of another.

The binary tree structure guarantees that this is the case (a leaf node is a leaf node!).


Huffman encoding advantages:

maximum compression ratio, assuming correct probabilities of occurrence;

easy to implement and fast.

Disadvantages: needs two passes for both encoder and decoder:

one to create the frequency distribution,

one to encode/decode the data.

Can avoid this by sending the tree (takes time) or by having unchanging frequencies.

Modified Huffman encoding: if we know the frequencies of occurrence, then Huffman works very well;

consider the case of a fax: mostly long white spaces with short bursts of black.

Do the following:

run length encode each string of bits on a line;

Huffman encode these run length codewords,

using a predefined frequency distribution -

a combination of run length, then Huffman.