TABLE COMPRESSION AND RELATED PROBLEMS Raffaele Giancarlo Dipartimento di Matematica Università di Palermo


Page 1

TABLE COMPRESSION AND RELATED PROBLEMS

Raffaele Giancarlo

Dipartimento di Matematica

Università di Palermo

Page 2

Improving Table Compression with Combinatorial Optimization, J. ACM 03

A. L. Buchsbaum, G. S. Fowler and R. Giancarlo

Boosting Textual Compression in Optimal Linear Time, J. ACM 05

P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino

Permutation, Partitions and Combinatorial Compression Boosting, TM 256 Unipa 04

P. Ferragina, R. Giancarlo, G. Manzini and M. Sciortino

Page 3

Table Compression

Feed Table in Row Major Order to gzip

a a b b a
a a b b a
a a b b a
a a b b a

[Figure: the table is read row by row into a single stream (…bbbaa), which is passed to one gzip.]

Page 4

Table Compression

On-Line (no Training): Partition Table and Compress separately

a a b b a
a a b b a
a a b b a
a a b b a

[Figure: the table is split into three groups of columns, and each group is passed to its own gzip.]
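The two strategies shown so far (one row-major stream versus a column partition compressed piece by piece) can be sketched as follows. This is only an illustration under stated assumptions: Python's `zlib` stands in for gzip (both use DEFLATE), the helper names `compress_row_major` and `compress_partitioned` are hypothetical, and the cut points are chosen by hand rather than by the optimal on-line partitioning the slides refer to.

```python
import zlib

def compress_row_major(table):
    """Compress the whole table as one row-major stream."""
    stream = "".join(table).encode()
    return zlib.compress(stream)

def compress_partitioned(table, cuts):
    """Split the columns into contiguous groups at `cuts` and compress
    each group on its own, reading every group in row-major order."""
    bounds = [0, *cuts, len(table[0])]
    pieces = []
    for lo, hi in zip(bounds, bounds[1:]):
        stream = "".join(row[lo:hi] for row in table).encode()
        pieces.append(zlib.compress(stream))
    return pieces

# The slide's toy table: four identical rows "a a b b a".
table = ["aabba"] * 4
whole = compress_row_major(table)
parts = compress_partitioned(table, cuts=[2, 4])  # hand-picked cuts (assumption)
print(len(whole), [len(p) for p in parts])
```

On a table this small the fixed per-stream overhead of `zlib` dominates, so the sizes printed here say nothing about the gains quoted in the talk; the sketch only shows the mechanics.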

Page 5

Table Compression

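Off-Line (Training): Permute Columns, Partition, Compress

Original table:

a a b b a
a a b b a
a a b b a
a a b b a

After permuting the columns:

a a a b b
a a a b b
a a a b b
a a a b b

[Figure: the permuted table is split into two groups of columns, and each group is passed to its own gzip.]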

Page 6

Table Compression

On-Line
Optimal solution
Same speed as gzip
40-60% gain in compression over gzip and bzip2

Off-Line
Good heuristics (Traveling Salesman Problem)
Tolerably slower than gzip
Additional 10-20% gain in compression

Applications
Data warehousing
Database of multiple alignments: PFAM

Page 7

Table Compression

Column Permutations via TSP

Build a complete directed weighted graph G

Column T[i] is vertex i

Weight of (i,j): min(C(T[i])+C(T[j]), C(T[i]T[j]))

Find a good tour and therefore a good permutation of the table columns

Permute, Partition, Compress
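A minimal sketch of the graph construction and of one way to extract a tour from it. Several things here are assumptions, not the authors' method: `zlib` plays the role of the compressor C, C(T[i]T[j]) is taken to be the compressed size of the two columns laid out as a two-column table in row-major order, and the tour is found with a plain nearest-neighbour heuristic (the talk only says "good heuristics"). The names `csize`, `pair_text`, `weight_matrix` and `greedy_tour` are hypothetical.

```python
import zlib

def csize(text):
    """Compressed size in bytes of `text` under zlib (standing in for C)."""
    return len(zlib.compress(text.encode()))

def pair_text(col_i, col_j):
    """Two columns serialized as a two-column table, row by row (assumption)."""
    return "".join(a + b for a, b in zip(col_i, col_j))

def weight_matrix(cols):
    """w[i][j] = min(C(T[i]) + C(T[j]), C(T[i]T[j])) for i != j, else 0."""
    n = len(cols)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                apart = csize(cols[i]) + csize(cols[j])
                together = csize(pair_text(cols[i], cols[j]))
                w[i][j] = min(apart, together)
    return w

def greedy_tour(w, start=0):
    """Nearest-neighbour heuristic: always move to the cheapest unvisited vertex."""
    tour = [start]
    unvisited = set(range(len(w))) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: (w[last][j], j))  # ties broken by index
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

rows = ["aabba"] * 4
cols = ["".join(c) for c in zip(*rows)]  # column strings, top to bottom
w = weight_matrix(cols)
tour = greedy_tour(w)
print(tour)  # a column order; the table is then permuted, partitioned, compressed
```

Nearest neighbour is used only because it is short to write; any TSP heuristic can be dropped in place of `greedy_tour`.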

Page 8

The PPC Paradigm

Base Compressor C, e.g., gzip, Huffman, Arithmetic Codes

Objects to be compressed: x1, x2, …, xn

Find a suitable permutation of the objects

Permute the objects and partition them

Compress each piece of the partition separately via C

Boosting the performance of Base Compressor C
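The paradigm can be written down as one small generic function. This is a sketch, not the authors' code: the name `ppc`, the representation of objects as `bytes`, and the use of `zlib.compress` as the default base compressor C are all assumptions, and the permutation and cut points are supplied by the caller rather than computed.

```python
import zlib

def ppc(objects, perm, cuts, compress=zlib.compress):
    """Permute, Partition, Compress.

    objects  -- list of bytes objects x1..xn
    perm     -- order in which to lay the objects out (indices into objects)
    cuts     -- positions in the permuted sequence where a new piece starts
    compress -- the base compressor C, used as a black box
    """
    ordered = [objects[i] for i in perm]
    bounds = [0, *cuts, len(ordered)]
    return [compress(b"".join(ordered[lo:hi]))
            for lo, hi in zip(bounds, bounds[1:])]

# Columns of the toy table from the earlier slides, one bytes object each.
columns = [b"aaaa", b"aaaa", b"bbbb", b"bbbb", b"aaaa"]
pieces = ppc(columns, perm=[0, 1, 4, 2, 3], cuts=[3])
print([len(p) for p in pieces])
```

Decompression needs `perm` and the piece boundaries as side information; how that information is stored is left out of the sketch.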

Page 9

Back to Table Compression

Binh Dao Vo and Kiem-Phong Vo, DCC04

Using Column Dependency to Compress Tables

[Figure: example table of numeric fields. A column holding the values 973, 908, 973, … is lex sorted into 908 908 973 973, and a second column then reads 2 2 3 3.]

Lex sort

PPC

Page 10

Back to Table Compression

Column Dependency for Table Compression

Elegant algorithms to infer dependency and rearrange data

Theory: NP-hard

Heuristics: 5-50% improvement in compression over TSP reordering

Page 11

A Transition

Exercise: Specialize TSP Reordering to strings

String x1 x2 … xn

lcp(i,j) = length of the longest common prefix of xi+1 … xn and xj+1 … xn

Symbols i and j have relation weight n - lcp(i,j)

s = mississippi#
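The relation weight can be computed directly from its definition. The sketch below is a naive quadratic-time illustration with hypothetical function names. With the slide's 1-based symbols x1 … xn stored in a 0-based Python string `s`, the suffix xi+1 … xn is exactly `s[i:]`, so the slide's indices i and j can be passed unchanged.

```python
def lcp(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:],
    i.e. of x_{i+1}..x_n and x_{j+1}..x_n in the slide's 1-based notation."""
    k = 0
    while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k

def relation_weight(s, i, j):
    """Weight n - lcp(i, j): small when symbols i and j share a long context."""
    return len(s) - lcp(s, i, j)

s = "mississippi#"
# Symbols x1 = 'm' and x4 = 's' are followed by "ississippi#" and "issippi#",
# which agree on their first four characters ("issi").
print(lcp(s, 1, 4), relation_weight(s, 1, 4))  # prints: 4 8
```

Symbols whose following contexts agree for longer get a smaller weight, so a cheap tour places them next to each other.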

Page 12

A Transition

Exercise (continued)

Define undirected graph G, where node i is labeled with xi and (i,j) has weight given by the relation

An optimal tour is given by the lex sort of all cyclic shifts of s

All contexts are packed together optimally

PPC

Page 13

The Burrows and Wheeler Transform (1994)

All cyclic shifts of s = mississippi#:

mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi

Sort the rows; the last column is bwt(s):

#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i

bwt(s) = ipssm#pissii
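The construction on this slide (sort all cyclic shifts, read off the last column) translates almost literally into code. The sketch below is the textbook naive version, far slower than the suffix-based constructions used in practice. It assumes, as in the slide's example, that `#` occurs exactly once, at the end of s, and sorts before every other symbol (true for ASCII letters). The function names are hypothetical.

```python
def bwt(s):
    """Last column of the lexicographically sorted cyclic shifts of s."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def inverse_bwt(last, end="#"):
    """Rebuild s from bwt(s) by repeatedly prepending the last column and
    re-sorting; after len(last) rounds the table holds all sorted shifts."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the shift that ends with the unique end marker.
    return next(row for row in table if row.endswith(end))

s = "mississippi#"
print(bwt(s))               # prints: ipssm#pissii
print(inverse_bwt(bwt(s)))  # prints: mississippi#
```

The inverse shows that the permutation loses no information: the transformed string alone is enough to recover s.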

Page 14

Boosting Textual Compression in optimal time

Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee

[Figure: s is fed to A, producing c; s is fed to the Booster, producing c’.]

The better A is, the better Aboost is

The more compressible s is, the better Aboost is

Qualitatively, we show that

c’ is shorter than c, if s is compressible

Time(Aboost) = Time(A), i.e. no slowdown

A is used as a black box

Page 15

Boosting

Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee

[Figure: s is fed to A, producing c; s is fed to the Booster, producing c’.]

The better A is, the better Aboost is

The more compressible s is, the better Aboost is

“Poor” means H0 bounds for A

Technically, we prove that

if $|c| \le \lambda |s| H_0(s) + \mu |s|$, then $|c'| \le \lambda |s| H_k(s) + \log_2 |s| + \mu |s| + g_k$
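For reference, the entropies H0 and Hk that appear in the bound are usually defined as below. The notation is an assumption, since the slides do not spell it out: $n_a$ is the number of occurrences of symbol $a$ in $s$, $\Sigma$ is the alphabet, and $w_s$ is the string formed by the symbols adjacent to the occurrences of context $w$ in $s$ (following or preceding $w$, depending on the convention in use).

```latex
H_0(s) = -\sum_{a \in \Sigma} \frac{n_a}{|s|} \log_2 \frac{n_a}{|s|}
\qquad
H_k(s) = \frac{1}{|s|} \sum_{w \in \Sigma^k} |w_s| \, H_0(w_s)
```

Read this way, $|s| H_k(s)$ is the cost of coding each context class $w_s$ separately with an ideal order-0 coder, which is why a partition of the permuted string is the natural object to look for.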

Page 16

Boosting

Three Key Components: Burrows-Wheeler Transform, Suffix Tree and a Greedy processing of them

Our technique takes a 0th order compressor A and turns it into a compressor Aboost with a better performance guarantee

[Figure: s is fed to A, producing c; s is fed to the Booster, producing c’.]

The better A is, the better Aboost is

The more compressible s is, the better Aboost is

We achieve the best known compression ratio

Page 17

Boosting

Outline

BWT

Find optimal partition of permuted string

Greedy processing of suffix tree

Compress each piece of partition separately via base compressor A
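The outline can be illustrated with a deliberately simplified stand-in. Instead of the optimal partition found by greedy processing of the suffix tree, the sketch below fixes a context length k and groups each symbol by the k symbols that follow it cyclically. Each group is, as a multiset, one contiguous block of bwt(s), and an ideal order-0 coder would spend |piece| · H0(piece) bits on it. The fixed k, the cyclic contexts and all function names are assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict

def h0_bits(text):
    """|text| * H0(text): bits an ideal order-0 coder needs for text."""
    n = len(text)
    return sum(-c * math.log2(c / n) for c in Counter(text).values())

def context_partition(s, k):
    """Group every symbol of s by the k symbols that follow it (cyclically)."""
    n = len(s)
    groups = defaultdict(list)
    for i in range(n):
        ctx = "".join(s[(i + 1 + d) % n] for d in range(k))
        groups[ctx].append(s[i])
    return {ctx: "".join(g) for ctx, g in groups.items()}

def hk_bits(s, k):
    """Total order-0 cost when every context group is coded separately."""
    return sum(h0_bits(piece) for piece in context_partition(s, k).values())

s = "mississippi#"
for k in range(3):
    print(k, round(hk_bits(s, k), 2))
```

Longer contexts never increase this idealized cost, but a real base compressor pays a per-piece overhead, which is one reason the choice of partition matters.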

Page 18

Related Work

Foschini, Grossi, Gupta and Vitter, DCC04

Fast Compression with a Static Model in High Order Entropy

It can be seen as a Compression Booster of Run Length Encoding

Ingredients:

BWT

Wavelet Trees [GGV03]

Efficient encoding of the integers [E75]

Page 19

Related Work

Liefke and Suciu

Compression for XML Files

Group Together XML Strings based on similarities

Greatly improves the performance of gzip

Page 20

Related Work

Johnson et al. 2005

Compression of Boolean Matrices

Permute Columns so that Number of Runs is Minimized

NP-hard; actually MAX SNP-hard

TSP + Hamming Distance
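The link between runs and Hamming distance can be checked on a small example. Assuming runs are counted within each row (an assumption; the slide does not say how runs are counted), a row has one run plus one more for every pair of adjacent columns whose bits differ. The total number of runs is therefore the number of rows plus the sum of Hamming distances between consecutive columns, which is the length of a TSP path over the columns. The function names are hypothetical.

```python
def total_runs(matrix):
    """Number of maximal runs of equal bits, counted row by row."""
    return sum(1 + sum(a != b for a, b in zip(row, row[1:])) for row in matrix)

def hamming(matrix, i, j):
    """Hamming distance between columns i and j."""
    return sum(row[i] != row[j] for row in matrix)

def permute_columns(matrix, perm):
    """Reorder the columns of every row according to perm."""
    return [[row[p] for p in perm] for row in matrix]

def path_cost(matrix, perm):
    """Rows plus Hamming distances along the column order perm."""
    return len(matrix) + sum(hamming(matrix, a, b) for a, b in zip(perm, perm[1:]))

m = [[0, 1, 0, 1],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
perm = [0, 2, 1, 3]
print(total_runs(m), total_runs(permute_columns(m, perm)))  # prints: 12 6
print(path_cost(m, perm))                                   # prints: 6
```

Minimizing the number of runs is thus the same as finding a cheapest Hamiltonian path under Hamming distance.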

Page 21

Related Work

Shortest Common Superstring [G97]

Oldest Instance of Permute, Partition and Compress

Page 22

Conclusions

Permute Data Before Compression

It is efficient and fun…

In particular, if the chosen permutation is not invertible