Download - Data Compression
CompSci 100e 9.1
Data Compression Compression is a high-profile application
.zip, .mp3, .aac, .jpg, .gif, .gz, .mpg, … What property of MP3 was a significant factor in
what made P2P file sharing work? Why do we care?
Secondary storage capacity doubles every year Disk space fills up quickly on every computer
system More data to compress than ever before
Data compression facilitated by priority queue All-time best assignment in a Compsci 100(e)
course?• Subject to debate, of course
CompSci 100e 9.2
More on Compression What’s the difference between compression
techniques? .mp3 files and .zip files? .gif and .jpg? Lossless and lossy
Is it possible to compress (lossless) every file? Why?
Lossy methods Good for pictures, video, and audio (JPEG,
MPEG, etc.) Lossless methods
Run-length encoding, Huffman, LZW, …
CompSci 100e 9.3
Text Compression
Input: String S Output: String S
Shorter S can be reconstructed from S
CompSci 100e 9.4
Text Compression: Examples
Symbol
ASCII Fixedlength
Var.length
a 01100001
000 000
b 01100010
001 11
c 01100011
010 01
d 01100100
011 001
e 01100101
100 10
“abcde” in the different formatsASCII: 01100001011000100110001101100100…Fixed: 000001010011100
Var: 000110100110
0
00
0
0
00
1
1 1
1
a b c d e
a d
bc e0
0
0
0 1
1
1
1
EncodingsASCII: 8 bits/characterUnicode: 16 bits/character
CompSci 100e 9.5
Huffman coding: go go gophers
Encoding uses tree: 0 left/1 right How many bits? 37!! Savings? Worth it?
ASCII 3 bits Huffmang 103 1100111 000 00o 111 1101111 001 01p 112 1110000 010 1100h 104 1101000 011 1101e 101 1100101 100 1110r 114 1110010 101 1111s 115 1110011 110 101sp. 32 1000000 111 101
3
s1
*2
2
p1
h1
2
e1
r1
4g
3
o
3
6
3
2
p1
h1
2
e1
r1
4
s1
*2
7
g3
o3
6
13
CompSci 100e 9.6
Huffman Coding D.A Huffman in early 1950’s Before compressing data, analyze the input stream Represent data using variable length codes Variable length codes though Prefix codes
Each letter is assigned a codeword Codeword is for a given letter is produced by
traversing the Huffman tree Property: No codeword produced is the prefix of
another Letters appearing frequently have short
codewords, while those that appear rarely have longer ones
Huffman coding is optimal per-character coding method
CompSci 100e 9.7
Building a Huffman tree Begin with a forest of single-node trees (leaves)
Each node/tree/leaf is weighted with character count
Node stores two values: character and count There are n nodes in forest, n is size of alphabet?
Repeat until there is only one node left: root of tree Remove two minimally weighted trees from forest Create new tree with minimal trees as children,
• New tree root's weight: sum of children (character ignored)
Does this process terminate? How do we get minimal trees? Remove minimal trees, hummm……
CompSci 100e 9.8
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
C
1
F
1
P
2
U
2
R
2
L
2
D
2
G
3
T
3
O
3
B
3
A
4
M
4
S
CompSci 100e 9.9
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
C
1
F
1
P
2
U
2
R
2
L
2
D
2
G
3
T
3
O
3
B
3
A
4
M
4
S
2
2
CompSci 100e 9.10
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
T
3
O
3
B
3
A
4
M
4
S
2
2
3
3
CompSci 100e 9.11
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
T
3
O
3
B
3
A
4
M
4
S
2
2
3
3
4
4
CompSci 100e 9.12
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
T
3
O
3
B
3
A
4
M
4
S
2
2
3
3
4
4
4
4
CompSci 100e 9.13
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S
2
3
3
4
4
4
4
5
5
CompSci 100e 9.14
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S
2 3
4
4
4
4
5
5
6
6
CompSci 100e 9.15
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S
2 3
4
4
4
4
5
5
6
6
6
6
CompSci 100e 9.16
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
4
4
4
4
5
5
6
6
6
8
8
6
CompSci 100e 9.17
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
E
5
N
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
44
5
5
6
6
6
8
8
6
8
8
CompSci 100e 9.18
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
44
5
5
6
6
6
8
8
6
8
8
10
10
CompSci 100e 9.19
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11 6
I
5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
6
6
8
8
6
8
8
10
10
11
11
CompSci 100e 9.20
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
8
6
8
8
10
10
11
11
12
12
CompSci 100e 9.21
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
10
11
11
12
1216
CompSci 100e 9.22
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
121621
CompSci 100e 9.23
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
23
162123
CompSci 100e 9.24
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
2337
CompSci 100e 9.25
Building a tree
“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
60
CompSci 100e 9.26
Encoding1. Count occurrence of all occurring character
O( )2. Build priority queue O(
)
3. Build Huffman tree O( )
4. Create Table of codes from tree O( )
5. Write Huffman tree and coded data to file O( )
CompSci 100e 9.27
Properties of Huffman coding Want to minimize weighted path length L(T)of
tree T
wi is the weight or count of each codeword i
di is the leaf corresponding to codeword i How do we calculate character (codeword)
frequencies? Huffman coding creates pretty full bushy trees?
When would it produce a “bad” tree? How do we produce coded compressed data
from input efficiently?
∑∈
=)(
)(TLeafi
iiwdTL
CompSci 100e 9.28
Writing code out to file How do we go from characters to encodings?
Build Huffman tree Root-to-leaf path generates encoding
Need way of writing bits out to file Platform dependent? Complicated to write bits and read in same
ordering
See BitInputStream and BitOutputStream classes Depend on each other, bit ordering preserved
How do we know bits come from compressed file? Store a magic number
CompSci 100e 9.29
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
01100000100001001101
CompSci 100e 9.30
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
1100000100001001101
CompSci 100e 9.31
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
100000100001001101
CompSci 100e 9.32
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
00000100001001101
CompSci 100e 9.33
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
0000100001001101
G
CompSci 100e 9.34
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
000100001001101
G
CompSci 100e 9.35
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
00100001001101
G
CompSci 100e 9.36
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
0100001001101
G
CompSci 100e 9.37
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
100001001101
G
CompSci 100e 9.38
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
00001001101
GO
CompSci 100e 9.39
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
0001001101
GO
CompSci 100e 9.40
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
001001101
GO
CompSci 100e 9.41
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
01001101
GO
CompSci 100e 9.42
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
1001101
GO
CompSci 100e 9.43
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
001101
GOO
CompSci 100e 9.44
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
01101
GOO
CompSci 100e 9.45
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
1101
GOO
CompSci 100e 9.46
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
101
GOO
CompSci 100e 9.47
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
01
GOO
CompSci 100e 9.48
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
1
GOOD
CompSci 100e 9.49
Decoding a message
11
6
I5
N
5
E
1
F
1
C
1
P
2
U
2
R
2
L
2
D
2
G
3
O
3
T
3
B
3
A
4
M
4
S2 3
445
68
6
8
16
10
21
11
12
2337
60
01100000100001001101
GOOD
CompSci 100e 9.50
Decoding
1. Read in tree data O( )
2. Decode bit string with tree O( )
CompSci 100e 9.51
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.52
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.53
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.54
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.55
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.56
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.57
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.58
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.59
Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A
MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE”
“10101101001000101001110011100000”
CompSci 100e 9.60
Building a Huffman tree Begin with a forest of single-node trees/tries (leaves)
Each node/tree/leaf is weighted with character count
Node stores two values: character and count
Repeat until there is only one node left: root of tree Remove two minimally weighted trees from forest Create new tree/internal node with minimal trees
as children,• Weight is sum of children’s weight (no char)
How does process terminate? Finding minimum? Remove minimal trees, hummm……
CompSci 100e 9.61
How do we create Huffman Tree/Trie? Insert weighted values into priority queue
What are initial weights? Why?
Remove minimal nodes, weight by sums, re-insert Total number of nodes?
PriorityQueue<TreeNode> pq = new PriorityQueue<TreeNode>(); for(int k=0; k < freq.length; k++){ pq.add(new TreeNode(k,freq[k],null,null)); } while (pq.size() > 1){ TreeNode left = pq.remove(); TreeNode right = pq.remove(); pq.add(new TreeNode(0,left.weight+right.weight, left,right)); } TreeNode root = pq.remove();
CompSci 100e 9.62
Creating compressed file Once we have new encodings, read every character
Write encoding, not the character, to compressed file
Why does this save bits? What other information needed in compressed
file?
How do we uncompress? How do we know foo.hf represents compressed
file? Is suffix sufficient? Alternatives?
Why is Huffman coding a two-pass method? Alternatives?
CompSci 100e 9.63
Uncompression with Huffman We need the trie to uncompress
000100100010011001101111 As we read a bit, what do we do?
Go left on 0, go right on 1 When do we stop? What to do?
How do we get the trie? How did we get it originally? Store 256
int/counts• How do we read counts?
How do we store a trie? 20 Questions relevance?• Reading a trie? Leaf indicator? Node values?
CompSci 100e 9.64
Other Huffman Issues What do we need to decode?
How did we encode? How will we decode? What information needed for decoding?
Reading and writing bits: chunks and stopping Can you write 3 bits? Why not? Why? PSEUDO_EOF BitInputStream and BitOutputStream: API
What should happen when the file won’t compress? Silently compress bigger? Warn user?
Alternatives?
CompSci 100e 9.65
Huffman Complexities How do we measure? Size of input file, size of
alphabet Which is typically bigger?
Accumulating character counts: ______ How can we do this in O(1) time, though not
really Building the heap/priority queue from counts ____
Initializing heap guaranteed Building Huffman tree ____
Why? Create table of encodings from tree ____
Why? Write tree and compressed file _____
CompSci 100e 9.66
Good Compsci 100(e) Assignment? Array of character/chunk counts, or is this a map?
Map character/chunk to count, why array? Priority Queue for generating tree/trie
Do we need a heap implementation? Why? Tree traversals for code generation, uncompression
One recursive, one not, why and which? Deal with bits and chunks rather than ints and chars
The good, the bad, the ugly Create a working compression program
How would we deploy it? Make it better? Benchmark for analysis
What’s a corpus?
CompSci 100e 9.67
Other methods Adaptive Huffman coding Lempel-Ziv algorithms
Build the coding table on the fly while reading document
Coding table changes dynamically Protocol between encoder and decoder so
that everyone is always using the right coding scheme
Works well in practice (compress, gzip, etc.) More complicated methods
Burrows-Wheeler (bunzip2) PPM statistical methods
CompSci 100e 9.68
Data CompressionYear Scheme Bit/
Char
1967
ASCII 7.00
1950
Huffman 4.70
1977
Lempel-Ziv (LZ77) 3.94
1984
Lempel-Ziv-Welch (LZW) – Unix compress
3.32
1987
(LZH) used by zip and unzip 3.30
1987
Move-to-front 3.24
1987
gzip 2.71
1995
Burrows-Wheeler 2.29
1997
BOA (statistical data compression)
1.99
Why is data compression important?
How well can you compress files losslessly? Is there a limit? How to
compare? How do you
measure how much information?