text compression
Post on 18-Jan-2016
CompSci 100E 31.1
Text Compression
Input: String S
Output: String S′
S′ is shorter than S
S can be reconstructed from S′
Text Compression
… He said to his friend, "If the British march
By land or sea from the town to-night,
Hang a lantern aloft in the belfry arch
Of the North Church tower as a signal light,--
One if by land, and two if by sea;
And I on the opposite shore will be,
Ready to ride and spread the alarm
Through every Middlesex village and farm,
For the country folk to be up and to arm." …
Henry Wadsworth Longfellow (1807-1882)
By land: 1
By sea: 2
Text Compression: Examples
Symbol  ASCII     Fixed length  Var. length
a       01100001  000           000
b       01100010  001           11
c       01100011  010           01
d       01100100  011           001
e       01100101  100           10
“abcde” in the different formats
ASCII: 01100001011000100110001101100100…
Fixed: 000001010011100
Var: 000110100110
[Figure: code trees for the fixed-length and the variable-length code, leaves a, b, c, d, e, edges labeled 0 and 1]
Encodings
ASCII: 8 bits/character
Unicode: 16 bits/character
Unicode adds 8 more 0’s to left of ASCII
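The three encodings of “abcde” can be reproduced with a short sketch. The code tables are copied from the table above; the function names encode and ascii_bits are illustrative choices, not part of the course material.

```python
# Code tables copied from the slide's table.
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}
VARIABLE = {"a": "000", "b": "11", "c": "01", "d": "001", "e": "10"}


def encode(text, table):
    """Concatenate the codeword of each character in text."""
    return "".join(table[ch] for ch in text)


def ascii_bits(text):
    """8-bit ASCII representation of text as a string of 0s and 1s."""
    return "".join(format(ord(ch), "08b") for ch in text)


fixed = encode("abcde", FIXED)
variable = encode("abcde", VARIABLE)
# Bit counts for ASCII, fixed-length and variable-length encodings.
print(len(ascii_bits("abcde")), len(fixed), len(variable))
```

For this input the counts are 40, 15 and 12 bits respectively.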
Huffman coding: go go gophers
Encoding uses tree: 0 left/1 right
How many bits? 37!! Savings? Worth it?
Char  ASCII  ASCII (7 bits)  Count  3 bits  Huffman
g     103    1100111         3      000     00
o     111    1101111         3      001     01
p     112    1110000         1      010     1100
h     104    1101000         1      011     1101
e     101    1100101         1      100     1110
r     114    1110010         1      101     1111
s     115    1110011         1      110     100
sp.   32     0100000         2      111     101
[Figure: three stages of building the Huffman tree for “go go gophers”, ending with a root of weight 13 whose subtrees have weight 6 (g, o) and weight 7 (s, space, p, h, e, r)]
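The figure of 37 bits can be checked with a few lines. This is only a sketch: the HUFFMAN table is copied from the slide, and the variable names are made up for the example.

```python
from collections import Counter

# Huffman codewords copied from the slide's table (" " is the space character).
HUFFMAN = {"g": "00", "o": "01", "s": "100", " ": "101",
           "p": "1100", "h": "1101", "e": "1110", "r": "1111"}

text = "go go gophers"
counts = Counter(text)

# Each character contributes (occurrences x codeword length) bits.
huffman_bits = sum(counts[ch] * len(code) for ch, code in HUFFMAN.items())
fixed_bits = 3 * len(text)    # 8 distinct characters fit in 3 bits each
ascii7_bits = 7 * len(text)   # 7-bit ASCII

print(huffman_bits, fixed_bits, ascii7_bits)
```

The totals are 37, 39 and 91 bits. The saving over the 3-bit code is small for such a short message, which is the point of the "Worth it?" question.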
Huffman Coding
D.A. Huffman, early 1950s
Before compressing data, analyze the input stream
Represent data using variable-length codes
The variable-length codes are prefix codes
  Each letter is assigned a codeword
  The codeword for a given letter is produced by traversing the Huffman tree
  Property: no codeword produced is the prefix of another
  Letters appearing frequently have short codewords, while those that appear rarely have longer ones
Huffman coding is an optimal per-character coding method
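The prefix property can be tested mechanically. The sketch below is one possible check, with a hypothetical function name; it relies on the fact that, after sorting, a codeword that is a prefix of another is immediately followed by a codeword that starts with it.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword.

    Duplicate codewords are treated as a violation as well.
    """
    ordered = sorted(codewords)
    # Only neighbours in sorted order need to be compared.
    return all(not b.startswith(a) for a, b in zip(ordered, ordered[1:]))


variable_code = ["000", "11", "01", "001", "10"]   # from the "abcde" table
broken_code = ["0", "01", "11"]                     # "0" is a prefix of "01"
print(is_prefix_free(variable_code), is_prefix_free(broken_code))
```

A code that fails this check cannot be decoded bit by bit without lookahead, because the decoder cannot tell whether a codeword has ended.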
Building a Huffman tree
Begin with a forest of single-node trees (leaves)
  Each node/tree/leaf is weighted with character count
  Node stores two values: character and count
  There are n nodes in forest, n is size of alphabet?
Repeat until there is only one node left: root of tree
  Remove two minimally weighted trees from forest
  Create new tree with minimal trees as children
    New tree root's weight: sum of children (character ignored)
Does this process terminate? How do we get minimal trees?
  Remove minimal trees, hmmm……
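The steps above can be sketched with a priority queue, which answers the question of how to get the minimal trees. This is a minimal illustration, not the course's implementation: the Node class and build_huffman_tree are names invented here, and Python's heapq stands in for the priority queue.

```python
import heapq
from collections import Counter
from itertools import count


class Node:
    """A tree node storing a weight and, for leaves, a character."""

    def __init__(self, weight, char=None, left=None, right=None):
        self.weight = weight
        self.char = char      # None for internal nodes (character ignored)
        self.left = left
        self.right = right


def build_huffman_tree(text):
    """Build a Huffman tree for text and return its root (None if text is empty)."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    forest = [(w, next(tiebreak), Node(w, ch)) for ch, w in Counter(text).items()]
    heapq.heapify(forest)
    while len(forest) > 1:
        # Remove the two minimally weighted trees ...
        w1, _, a = heapq.heappop(forest)
        w2, _, b = heapq.heappop(forest)
        # ... and put back one tree whose weight is the sum of its children.
        heapq.heappush(forest, (w1 + w2, next(tiebreak), Node(w1 + w2, left=a, right=b)))
    return forest[0][2] if forest else None


root = build_huffman_tree("go go gophers")
print(root.weight, root.left.weight, root.right.weight)
```

Each pass through the loop removes two trees and adds one, so a forest of n trees is reduced to a single tree after n - 1 merges, which is why the process terminates.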
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
Character counts (the initial forest of 18 single-node trees):
space 11, I 6, E 5, N 5, M 4, S 4, T 3, O 3, B 3, A 3, U 2, R 2, L 2, D 2, G 2, C 1, F 1, P 1
[Figure sequence: the forest after each merge, one merge per slide, 17 merges in total]
Weights of the two trees removed and of the new root at each step:
1+1=2, 1+2=3, 2+2=4, 2+2=4, 2+3=5, 3+3=6, 3+3=6, 4+4=8, 4+4=8, 5+5=10, 5+6=11, 6+6=12, 8+8=16, 10+11=21, 11+12=23, 16+21=37, 23+37=60
The last tree, with root weight 60, is the Huffman tree; 60 is also the number of characters in the string.
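The sequence of root weights in the walk-through can be reproduced by tracking weights only. The helper name merge_weights is hypothetical; the sketch ignores which characters are merged, since ties may be broken differently than on the slides without changing the weights.

```python
import heapq
from collections import Counter


def merge_weights(text):
    """Return the weight of each new root, in the order the merges happen."""
    heap = list(Counter(text).values())
    heapq.heapify(heap)
    merged = []
    while len(heap) > 1:
        total = heapq.heappop(heap) + heapq.heappop(heap)
        merged.append(total)
        heapq.heappush(heap, total)
    return merged


steps = merge_weights("A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS")
print(steps)
print(sum(steps))  # total number of bits in the Huffman-encoded string
```

The sum of the new root weights equals the length of the encoded message, because every merge adds one bit to the codeword of each character below the new root. For this string the total is 236 bits.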
Encoding
(N: number of characters in the input, A: size of the alphabet)
1. Count occurrence of all occurring characters: O(N)
2. Build priority queue: O(A)
3. Build Huffman tree: O(A log A)
4. Create table of codes from tree: O(A log A)
5. Write Huffman tree and coded data to file: O(N)
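Step 4, creating the table of codes, is a tree traversal. The sketch below assumes a simplified tree representation chosen for brevity (a leaf is a one-character string, an internal node is a (left, right) pair); GOPHERS_TREE is the tree implied by the codewords on the "go go gophers" slide.

```python
# A tree is either a single character (leaf) or a (left, right) pair.
GOPHERS_TREE = (("g", "o"), (("s", " "), (("p", "h"), ("e", "r"))))


def code_table(tree, prefix=""):
    """Map each leaf character to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}  # a one-leaf tree still needs one bit
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


def encode(text, table):
    """Step 5 without the file I/O: concatenate the codewords."""
    return "".join(table[ch] for ch in text)


table = code_table(GOPHERS_TREE)
bits = encode("go go gophers", table)
print(table)
print(len(bits))
```

Building the table once and then looking up each character keeps the encoding pass linear in the length of the input.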
Properties of Huffman coding
Want to minimize weighted path length L(T) of tree T
  L(T) = \sum_{i \in \mathrm{Leaf}(T)} d_i w_i
  w_i is the weight or count of each codeword i
  d_i is the depth of the leaf corresponding to codeword i
How do we calculate character (codeword) frequencies?
Huffman coding creates pretty full bushy trees?
  When would it produce a “bad” tree?
How do we produce coded compressed data from input efficiently?
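The weighted path length, and the shape of the tree for different weight distributions, can be explored with a small sketch. The function names are hypothetical, and the two weight lists are examples picked for illustration: equal weights give a bushy tree, while weights that roughly double each time give a long chain.

```python
import heapq


def huffman_depths(weights):
    """Return the leaf depth d_i that Huffman's algorithm assigns to each weight w_i."""
    depths = [0] * len(weights)
    # Each heap entry: (weight, tiebreak, indices of the leaves in that tree).
    heap = [(w, i, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    tiebreak = len(weights)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        for i in leaves1 + leaves2:
            depths[i] += 1  # every leaf under the new root moves one level down
        heapq.heappush(heap, (w1 + w2, tiebreak, leaves1 + leaves2))
        tiebreak += 1
    return depths


def weighted_path_length(weights, depths):
    """L(T): the sum over all leaves of depth times weight."""
    return sum(d * w for d, w in zip(depths, weights))


bushy = huffman_depths([1, 1, 1, 1])
skewed = huffman_depths([1, 1, 2, 4, 8])
print(bushy, skewed)
print(weighted_path_length([1, 1, 2, 4, 8], skewed))
```

With the skewed weights the deepest leaf sits at depth 4 in a tree of only 5 leaves, which is the kind of "bad" (unbalanced) tree the slide asks about.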
Writing code out to file
How do we go from characters to encodings?
  Build Huffman tree
  Root-to-leaf path generates encoding
Need way of writing bits out to file
  Platform dependent?
  Complicated to write bits and read in same ordering
See BitInputStream and BitOutputStream classes
  Depend on each other, bit ordering preserved
How do we know bits come from compressed file?
  Store a magic number
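To make the bit-ordering and magic-number points concrete, here is a small sketch of packing bits into bytes. It is not the course's BitInputStream or BitOutputStream; the MAGIC value, the one-byte padding count, and the function names are all assumptions made for this example, and the Huffman tree itself is not stored.

```python
MAGIC = b"HUF1"  # hypothetical magic number chosen for this sketch


def pack_bits(bits):
    """Pack a string of '0'/'1' characters into bytes, most significant bit first."""
    padding = (-len(bits)) % 8
    padded = bits + "0" * padding
    body = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    # Header: magic number, then one byte recording how many padding bits were added.
    return MAGIC + bytes([padding]) + body


def unpack_bits(data):
    """Inverse of pack_bits; raises ValueError if the magic number is missing."""
    if not data.startswith(MAGIC):
        raise ValueError("data was not written by pack_bits")
    padding = data[len(MAGIC)]
    bits = "".join(format(byte, "08b") for byte in data[len(MAGIC) + 1:])
    return bits[:len(bits) - padding]


packed = pack_bits("01100000100001001101")
print(len(packed), unpack_bits(packed))
```

Because writer and reader agree on the same bit order and the padding count is recorded, the original bit string comes back unchanged, including any trailing zeros that belong to the message.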
Decoding a message
[Figure sequence: the Huffman tree for “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”, traversed one bit per slide]
Bit string: 01100000100001001101
Start at the root; each 0 moves left and each 1 moves right. On reaching a leaf, output its character and return to the root.
Decoded message: GOOD
Decoding
1. Read in tree data O( )
2. Decode bit string with tree O( )
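Step 2 can be sketched as a walk over the tree. The example assumes a simplified tree representation (a leaf is a one-character string, an internal node is a (left, right) pair) and uses the "go go gophers" tree implied by the codewords shown earlier, because the exact shape of the larger tree is not recoverable from the slides. The tree is assumed to have at least two leaves.

```python
# A tree is either a single character (leaf) or a (left, right) pair.
GOPHERS_TREE = (("g", "o"), (("s", " "), (("p", "h"), ("e", "r"))))


def decode(bits, tree):
    """Walk the tree from the root: 0 goes left, 1 goes right, a leaf emits a character."""
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree            # restart at the root for the next codeword
    if node is not tree:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


print(decode("0001101", GOPHERS_TREE))
print(decode("1100110111101111100", GOPHERS_TREE))
```

The loop does a constant amount of work per bit, and no separator between codewords is needed because of the prefix property.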
Huffman Tree 2
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
E.g. “ A SIMPLE”
“10101101001000101001110011100000”
Other methods
Adaptive Huffman coding
Lempel-Ziv algorithms
  Build the coding table on the fly while reading document
  Coding table changes dynamically
  Protocol between encoder and decoder so that everyone is always using the right coding scheme
  Works well in practice (compress, gzip, etc.)
More complicated methods
  Burrows-Wheeler (bzip2)
  PPM statistical methods
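The idea of building the coding table on the fly can be illustrated with LZ78, one member of the Lempel-Ziv family. This is a teaching sketch only: compress uses the related LZW variant and gzip uses DEFLATE, an LZ77-based scheme, so neither tool works exactly like this. The function names and the (index, character) output format are choices made for the example.

```python
def lz78_encode(text):
    """Return a list of (phrase_index, next_char) pairs; index 0 is the empty phrase."""
    table = {"": 0}
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in table:
            phrase += ch  # keep extending the longest known phrase
        else:
            output.append((table[phrase], ch))
            table[phrase + ch] = len(table)  # the table grows as the text is read
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it with an empty next_char.
        output.append((table[phrase], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the text; the decoder reconstructs the same table from the pairs alone."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)


pairs = lz78_encode("go go gophers")
print(pairs)
print(lz78_decode(pairs))
```

The table is never transmitted: encoder and decoder add the same phrase at the same moment, which is the protocol the slide refers to.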