text compression

Post on 18-Jan-2016


CompSci 100E 31.1

Text Compression

Input: string S
Output: string S′

S′ is shorter than S
S can be reconstructed from S′

Text Compression

… He said to his friend, "If the British march
By land or sea from the town to-night,
Hang a lantern aloft in the belfry arch
Of the North Church tower as a signal light,--
One if by land, and two if by sea;
And I on the opposite shore will be,
Ready to ride and spread the alarm
Through every Middlesex village and farm,
For the country folk to be up and to arm." …

Henry Wadsworth Longfellow (1807-1882)

By land 1

By sea 2

Text Compression: Examples

Symbol  ASCII     Fixed length  Var. length
a       01100001  000           000
b       01100010  001           11
c       01100011  010           01
d       01100100  011           001
e       01100101  100           10

“abcde” in the different formats
ASCII: 01100001011000100110001101100100…
Fixed: 000001010011100
Var:   000110100110

[Figure: code trees for the fixed-length and variable-length codes; leaves a, b, c, d, e, edges labeled 0 and 1.]

Encodings
ASCII: 8 bits/character
Unicode: 16 bits/character (in a 16-bit encoding such as UTF-16)
For ASCII characters, the 16-bit form adds 8 more 0’s to the left of the 8-bit ASCII code
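The three formats for “abcde” can be reproduced with a few lines of Python. This is a minimal sketch: the two code tables are copied from the slide, while the names FIXED, VARIABLE, encode and encode_ascii are made up for illustration.

```python
# Code tables copied from the slide; the names are illustrative only.
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}
VARIABLE = {"a": "000", "b": "11", "c": "01", "d": "001", "e": "10"}


def encode(text, table):
    """Concatenate the codeword of every character in text."""
    return "".join(table[ch] for ch in text)


def encode_ascii(text):
    """Encode each character as its 8-bit ASCII code."""
    return "".join(format(ord(ch), "08b") for ch in text)


if __name__ == "__main__":
    for name, bits in [
        ("ASCII", encode_ascii("abcde")),
        ("Fixed", encode("abcde", FIXED)),
        ("Var", encode("abcde", VARIABLE)),
    ]:
        print(f"{name:5} {len(bits):2} bits  {bits}")
```

The variable-length table only pays off when the short codewords go to the frequent symbols, which is the point the Huffman slides develop next.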

Huffman coding: go go gophers

Encoding uses tree: 0 left/1 right
How many bits? 37!!
Savings? Worth it?

Char  ASCII  ASCII (7 bits)  Count  3 bits  Huffman
g     103    1100111         3      000     00
o     111    1101111         3      001     01
p     112    1110000         1      010     1100
h     104    1101000         1      011     1101
e     101    1100101         1      100     1110
r     114    1110010         1      101     1111
s     115    1110011         1      110     100
sp.   32     0100000         2      111     101

[Figures: the Huffman tree for “go go gophers” built in stages. Leaves: g 3, o 3, space (*) 2, s 1, p 1, h 1, e 1, r 1. Internal node weights: 2 (p, h), 2 (e, r), 3 (s, *), 4, 6 (g, o), 7, root 13.]
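To check the “37 bits” figure and the savings, the Huffman column of the table can be applied to the phrase directly. The sketch below assumes only that table; HUFFMAN and total_bits are illustrative names.

```python
# Huffman codewords copied from the slide's table (space is "sp." there).
HUFFMAN = {
    "g": "00", "o": "01", "p": "1100", "h": "1101",
    "e": "1110", "r": "1111", "s": "100", " ": "101",
}


def total_bits(text, table):
    """Number of bits needed to encode text with the given code table."""
    return sum(len(table[ch]) for ch in text)


if __name__ == "__main__":
    text = "go go gophers"
    print("7-bit ASCII:", 7 * len(text))
    print("3-bit fixed:", 3 * len(text))
    print("Huffman:    ", total_bits(text, HUFFMAN))
```

For a phrase this short the gain over the 3-bit fixed code is small, and the tree itself still has to be stored alongside the data, which is what “Worth it?” is hinting at.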

Huffman Coding

D. A. Huffman, in the early 1950s
Before compressing data, analyze the input stream
Represent data using variable-length codes
Variable-length codes through prefix codes
  Each letter is assigned a codeword
  The codeword for a given letter is produced by traversing the Huffman tree
  Property: no codeword produced is the prefix of another
  Letters appearing frequently have short codewords, while those that appear rarely have longer ones
Huffman coding is an optimal per-character coding method
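The prefix property is easy to test mechanically. The helper below is a small sketch (is_prefix_free is a made-up name). It relies on the fact that, after sorting, any codeword that is a prefix of another is also a prefix of its immediate successor.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # In sorted order, all words starting with w sit directly after w,
    # so checking neighbours is enough.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


if __name__ == "__main__":
    print(is_prefix_free(["000", "11", "01", "001", "10"]))  # variable-length code for a..e
    print(is_prefix_free(["0", "01", "11"]))                 # "0" is a prefix of "01"
```

A code that fails this test cannot be decoded bit by bit without lookahead, because the decoder cannot tell whether a codeword has ended.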

Building a Huffman tree

Begin with a forest of single-node trees (leaves)
  Each node/tree/leaf is weighted with character count
  Node stores two values: character and count
  There are n nodes in forest, n is size of alphabet?
Repeat until there is only one node left: root of tree
  Remove two minimally weighted trees from forest
  Create new tree with minimal trees as children
    o New tree root's weight: sum of children (character ignored)
Does this process terminate?
How do we get minimal trees?
  Remove minimal trees, hmmm……
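One way to “remove minimal trees” is a priority queue. The following is a minimal sketch of the loop described above, using Python's heapq as the forest. The node layout (weight, character, left, right) and the function names are assumptions for illustration, not the course's own classes, and the input is assumed to be non-empty.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(text):
    """Return the root node as a tuple (weight, char_or_None, left, right)."""
    tiebreak = count()  # unique sequence number so nodes are never compared
    forest = [
        (weight, next(tiebreak), (weight, ch, None, None))
        for ch, weight in Counter(text).items()
    ]
    heapq.heapify(forest)
    while len(forest) > 1:
        w1, _, t1 = heapq.heappop(forest)  # two minimally weighted trees
        w2, _, t2 = heapq.heappop(forest)
        parent = (w1 + w2, None, t1, t2)   # weight is the sum of the children
        heapq.heappush(forest, (w1 + w2, next(tiebreak), parent))
    return forest[0][2]


def codes_from_tree(node, prefix=""):
    """Map each character to its root-to-leaf path (0 = left, 1 = right)."""
    _, ch, left, right = node
    if left is None:                      # leaf
        return {ch: prefix or "0"}        # a one-symbol alphabet still needs a bit
    table = codes_from_tree(left, prefix + "0")
    table.update(codes_from_tree(right, prefix + "1"))
    return table


if __name__ == "__main__":
    text = "go go gophers"
    table = codes_from_tree(build_huffman_tree(text))
    print(table)
    print(sum(len(table[ch]) for ch in text), "bits")
```

Each pass removes two trees and adds one, so the forest shrinks by one per iteration and the loop terminates after n - 1 merges. Ties may be broken differently from the slides, so individual codewords can differ while the total number of bits stays the same.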

Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

Character counts (leaves of the initial forest):
space 11, I 6, E 5, N 5, C 1, F 1, P 1, U 2, R 2, L 2, D 2, G 2, T 3, O 3, B 3, A 3, M 4, S 4

[Figures: the forest after each merge of the two minimally weighted trees. Weights of the new roots, in order: 2, 3, 4, 4, 5, 6, 6, 8, 8, 10, 11, 12, 16, 21, 23, 37, 60. The final tree has root weight 60.]
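The sequence of merges for this string can be replayed with a heap that holds only the weights. This is a sketch for checking the walkthrough; merge_weights is an illustrative name and the trees themselves are not built.

```python
import heapq
from collections import Counter

TEXT = "A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS"


def merge_weights(text):
    """Return the weight of each new root, in the order the merges happen."""
    heap = list(Counter(text).values())
    heapq.heapify(heap)
    created = []
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        created.append(merged)
        heapq.heappush(heap, merged)
    return created


if __name__ == "__main__":
    print(Counter(TEXT).most_common())
    print(merge_weights(TEXT))
```

The last weight produced equals the length of the string, since the root counts every character once.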

Encoding

1. Count occurrence of all occurring characters: O(N)
2. Build priority queue: O(A)
3. Build Huffman tree: O(A log A)
4. Create table of codes from tree: O(A log A)
5. Write Huffman tree and coded data to file: O(N)

(N: number of characters in the input; A: size of the alphabet)

Properties of Huffman coding

Want to minimize weighted path length L(T) of tree T

$$L(T) = \sum_{i \in \mathrm{Leaf}(T)} d_i \, w_i$$

  w_i is the weight or count of each codeword i
  d_i is the depth of the leaf corresponding to codeword i
How do we calculate character (codeword) frequencies?
Huffman coding creates pretty full bushy trees?
  When would it produce a “bad” tree?
How do we produce coded compressed data from input efficiently?
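The weighted path length is the number of bits in the encoded text, so it can be computed from leaf depths alone. The sketch below tracks only depths during the merges. The function names are hypothetical, and d_i is taken to be the depth of leaf i, which is an interpretation of the slide.

```python
import heapq
from collections import Counter
from itertools import count


def leaf_depths(text):
    """Depth of each character's leaf in a Huffman tree for text."""
    tiebreak = count()
    heap = [(w, next(tiebreak), {ch: 0}) for ch, w in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Every leaf under the new root moves one level deeper.
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def weighted_path_length(text):
    """L(T) = sum over leaves of depth * count."""
    counts = Counter(text)
    depths = leaf_depths(text)
    return sum(depths[ch] * counts[ch] for ch in counts)


if __name__ == "__main__":
    print(weighted_path_length("go go gophers"))
    # Counts 1, 2, 4, 8 give a chain-like ("bad") tree of depth A - 1.
    print(leaf_depths("abbccccdddddddd"))
```

Counts that roughly double from one character to the next are one case where the tree degenerates into a chain rather than a bushy shape.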

Writing code out to file

How do we go from characters to encodings?
  Build Huffman tree
  Root-to-leaf path generates encoding
Need way of writing bits out to file
  Platform dependent?
  Complicated to write bits and read in same ordering
See BitInputStream and BitOutputStream classes
  Depend on each other, bit ordering preserved
How do we know bits come from compressed file?
  Store a magic number
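The BitInputStream and BitOutputStream classes are not reproduced in this transcript. As a stand-in, the sketch below packs a string of 0s and 1s into bytes behind a small header. The header layout (a 4-byte magic number followed by a 4-byte bit count), the MAGIC value, and the function names are all assumptions made for this example.

```python
import struct

MAGIC = 0x48554646  # hypothetical magic number (the ASCII bytes "HUFF")


def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes with a header."""
    header = struct.pack(">II", MAGIC, len(bits))   # big-endian: magic, bit count
    padded = bits + "0" * (-len(bits) % 8)          # pad the last byte with zeros
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return header + body


def unpack_bits(blob):
    """Inverse of pack_bits; rejects data that lacks the magic number."""
    magic, nbits = struct.unpack(">II", blob[:8])
    if magic != MAGIC:
        raise ValueError("not a file written by pack_bits")
    bits = "".join(format(byte, "08b") for byte in blob[8:])
    return bits[:nbits]                             # drop the padding bits


if __name__ == "__main__":
    blob = pack_bits("000110100110")
    print(blob.hex(), unpack_bits(blob))
```

Storing the bit count avoids decoding the zero padding in the final byte as extra characters. Writer and reader must agree on the bit order within a byte, which is the “same ordering” concern raised above.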

Decoding a message

[Figure sequence: the Huffman tree from the previous slides, with the bit string consumed one bit per step and the decoded letters appearing as leaves are reached.]

Bit string: 01100000100001001101
Decoded output, step by step: G, GO, GOO, GOOD


Decoding

1. Read in tree data O( )

2. Decode bit string with tree O( )
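Decoding with the tree can be sketched as a walk that restarts at the root after every leaf. The tree for the long example string cannot be recovered from this transcript, so the sketch uses the “go go gophers” code table instead. The nested-dict tree and the names build_tree and decode are illustrative choices.

```python
# Huffman codewords from the "go go gophers" table.
HUFFMAN = {
    "g": "00", "o": "01", "p": "1100", "h": "1101",
    "e": "1110", "r": "1111", "s": "100", " ": "101",
}


def build_tree(table):
    """Rebuild the code tree as nested dicts keyed by '0' and '1'."""
    root = {}
    for ch, code in table.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = ch               # the leaf holds the character
    return root


def decode(bits, root):
    """Follow one edge per bit; emit a character at each leaf, then restart."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):         # reached a leaf
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


if __name__ == "__main__":
    bits = "".join(HUFFMAN[ch] for ch in "go go gophers")
    print(bits)
    print(decode(bits, build_tree(HUFFMAN)))
```

Every bit is examined once, so the decoding loop is linear in the length of the bit string.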


Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”

“10101101001000101001110011100000”


Other methods

Adaptive Huffman coding
Lempel-Ziv algorithms
  Build the coding table on the fly while reading document
  Coding table changes dynamically
  Protocol between encoder and decoder so that everyone is always using the right coding scheme
  Works well in practice (compress, gzip, etc.)
More complicated methods
  Burrows-Wheeler (bzip2)
  PPM statistical methods
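Two of these families are available in Python's standard library, which makes a quick comparison possible. zlib implements DEFLATE (a Lempel-Ziv variant combined with Huffman coding, the same method gzip uses), and bz2 is based on the Burrows-Wheeler transform. The sample text and variable names below are illustrative only.

```python
import bz2
import zlib

# Repetitive input, so both methods have something to exploit.
text = ("One if by land, and two if by sea; " * 40).encode("ascii")

deflated = zlib.compress(text)   # Lempel-Ziv (LZ77) plus Huffman coding
bzipped = bz2.compress(text)     # Burrows-Wheeler based

if __name__ == "__main__":
    print("original:", len(text), "bytes")
    print("zlib:    ", len(deflated), "bytes")
    print("bz2:     ", len(bzipped), "bytes")
    # Both are lossless: the input is recovered exactly.
    print(zlib.decompress(deflated) == text, bz2.decompress(bzipped) == text)
```

Unlike per-character Huffman coding, both methods exploit repeated substrings, so the gain depends heavily on how repetitive the input is.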
