TRANSCRIPT
Compressing Tabular Data via Pairwise Dependencies
Amir Ingber, Yahoo! Research
TCE Conference, June 22, 2017
Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)
Huge datasets: everywhere
- Internet
- Science
- Media
- …
At Yahoo:
- More than 100k servers in ~36 clusters
- More than 800 PB of storage
- Lots of data, always want to store more
Compressing big data: does it matter?
It’s expensive: storing 1 PB costs around $300k/year (e.g. on AWS).
It’s big: e.g. storing an event log of ~1B events/day for 6 months, kept for analytics / machine learning.
Lossless compression: dictionary methods
Typical compression: gzip (DEFLATE)
- Based on LZ77 + Huffman
- Popular, fast
- Recent variants: zstd (Facebook, 2015), Brotli (Google, 2015), …
- Good at detecting temporal dependencies, e.g. in text
Main idea: find repetitions in a sliding window
Example:
  the brown fox jumped over the brownish jumping bear
is parsed as
  the brown fox jumped over (26,9)ish (25,5)ing bear
where (26,9) and (25,5) are (distance, length) back-references to the earlier "the brown" and " jump".
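To make the sliding-window idea concrete, here is a minimal greedy LZ77-style parser in Python. It is an illustrative sketch (window size and minimum match length chosen arbitrarily), not the actual DEFLATE implementation.

```python
def lz77_sketch(data: str, window: int = 1024, min_match: int = 4):
    """Greedy LZ77-style parse: emit literals and (distance, length) back-references."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window for the longest match starting at position i
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and j + length < i            # match only against the already-seen prefix
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))    # back-reference into the window
            i += best_len
        else:
            out.append(data[i])                  # literal character
            i += 1
    return out

print(lz77_sketch("the brown fox jumped over the brownish jumping bear"))
# ...literals..., (26, 9), 'i', 's', 'h', (25, 5), ...literals...
```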
Tabular data
Typical dataset: a table
- Each row has several fields, with complex dependencies
- Temporal dependencies? Cross-field dependencies?
- Example:

UserID  | Age | Location | Device   | Time | DocID
4324234 | 25  | 90210    | iPhone 7 | 9pm  | 33221
1223231 | 49  | 94087    | iPad Pro | 10am | 66543
…       | …   | …        | …        | …    | …
Entropy coding 101
▪ Given: a stream of i.i.d. symbols of a R.V. X
▪ Encode each symbol as a (prefix-free) bit string of variable length
  › More frequent symbols → shorter codewords
  › Theorem: avg. code length ≥ H(X) = ∑_x p(x) log(1/p(x))
▪ Huffman code: optimal. Rate ≤ H(X) + 1 bit
▪ Better: arithmetic coding
  › Approaches the entropy
  › Requires: the distribution p(x)
▪ Black box
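As a concrete illustration of these bounds, here is a minimal sketch (not a production coder) that builds Huffman codeword lengths from empirical symbol counts and compares the average length with H(X); the sample string is arbitrary.

```python
import heapq
import math
from collections import Counter

def huffman_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code over the given counts."""
    # Heap entries: (total weight, tie-breaker, {symbol: depth in this subtree})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                # degenerate single-symbol alphabet
        return {sym: 1 for sym in freqs}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)               # merge the two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

data = "abracadabra"
freqs = Counter(data)
lengths = huffman_lengths(freqs)
n = len(data)
avg_len = sum(freqs[s] * lengths[s] for s in freqs) / n
entropy = sum((c / n) * math.log2(n / c) for c in freqs.values())
# H(X) <= average Huffman length < H(X) + 1
print(f"H(X) = {entropy:.3f} bits, Huffman average length = {avg_len:.3f} bits")
```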
Assumptions:
▪ The records [X_1, X_2, ..., X_n] in the table are i.i.d.
▪ The only dependence is between fields
▪ Example: independence — P_{X_1 X_2 ... X_n}(x_1, x_2, ..., x_n) = ∏_i P_{X_i}(x_i)
▪ Expected compression rate: ∑_{i=1}^n H(X_i) per record
▪ Fine print: for each RV, need to save
  › The distribution and/or codebook (Huffman / arithmetic coder)
  › A dictionary (to translate back to the original values)
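Under the independence assumption, the per-record rate is just the sum of per-field empirical entropies. The sketch below estimates it for a toy table; the field names and values are hypothetical.

```python
import math
from collections import Counter

def empirical_entropy(column):
    """H(X) in bits, estimated from the empirical distribution of one field."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Toy table: one dict per record (hypothetical fields)
records = [
    {"age": 25, "device": "iPhone", "zip": "90210"},
    {"age": 49, "device": "iPad",   "zip": "94087"},
    {"age": 25, "device": "iPhone", "zip": "90210"},
]

fields = records[0].keys()
rate = sum(empirical_entropy([r[f] for r in records]) for f in fields)
print(f"Expected rate under the independence assumption: {rate:.2f} bits/record")
```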
Fancier models: Bayesian networks
▪ Bayes net
  › DAG with n nodes
  › Nodes are the RVs, edges model (conditional) independence
  › Example (4 nodes): P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2, x_3)
  [Figure: example DAG over X_1, X_2, X_3, X_4]
▪ Dense graph → more general
  › Compression rate: H(X_1) + H(X_2 | X_1) + H(X_3 | X_1) + H(X_4 | X_2, X_3)
▪ Usage for compression: compress according to the graph edges
  › Metadata: larger codebooks / distributions (conditional!)
▪ Not a new idea [e.g. Davies & Moore, KDD'99]
How to choose a Bayes net for compression?
▪ Another assumption:
  › Each node can only have a single parent
▪ DAG ⇒ tree
  [Figure: example tree over X_1, X_2, X_3, X_4]
▪ Simpler compression
  › Each field is conditioned on only a single RV
  › Example (tree): P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2)
  › Compression rate: H(X_root) + ∑_{(i,j)∈Edges} H(X_i | X_j) = ∑_{i=1}^n H(X_i) − ∑_{(i,j)∈Edges} I(X_i; X_j)
▪ Best tree?
Searching for the best tree
▪ Rate: H(X_root) + ∑_{(i,j)∈Edges} H(X_i | X_j) = ∑_{i=1}^n H(X_i) − ∑_{(i,j)∈Edges} I(X_i; X_j)
▪ Algorithm:
  › Calculate I(X_i; X_j) for all 1 ≤ i, j ≤ n — O(n^2) pairs
  › Set edge weights w_ij = −I(X_i; X_j)
  › Find a minimum spanning tree!
  › Efficient algorithms exist [Fredman & Tarjan, 1987]
  › Also minimizes the KL divergence w.r.t. the true distribution
▪ Known as a Chow-Liu tree [Chow & Liu, 1968]
  › Extensions exist [e.g. Williamson, 2000]
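A compact sketch of the construction under the stated assumptions: estimate pairwise mutual information from the columns, weight each pair by −I(X_i; X_j), and keep the minimum-weight spanning edges. Plain Kruskal with union-find stands in for the faster MST algorithms cited above; field names and values are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits, estimated from two aligned columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_edges(columns):
    """columns: {field name: list of values}. Returns spanning-tree edges (Kruskal, weights -I)."""
    names = list(columns)
    edges = sorted((-mutual_information(columns[a], columns[b]), a, b)
                   for a, b in combinations(names, 2))
    parent = {f: f for f in names}

    def find(f):                                  # union-find with path compression
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    tree = []
    for w, a, b in edges:                         # most negative weight = largest MI first
        ra, rb = find(a), find(b)
        if ra != rb:                              # keep the edge if it doesn't close a cycle
            parent[ra] = rb
            tree.append((a, b, -w))
    return tree

columns = {"age":    [25, 49, 25, 49],
           "device": ["iPhone", "iPad", "iPhone", "iPad"],
           "zip":    ["90210", "94087", "90210", "94087"]}
print(chow_liu_edges(columns))   # e.g. [('age', 'device', 1.0), ('age', 'zip', 1.0)]
```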
Example: MST with mutual-information weights
UserID  | Age | Location | Device   | Time | DocID
4324234 | 25  | 90210    | iPhone 7 | 9pm  | 33221
1223231 | 49  | 94087    | iPad Pro | 10am | 66543
…       | …   | …        | …        | …    | …
[Figure: resulting Chow-Liu tree over the fields UserID, Age, Location, Device, Time, DocID]
Chow-Liu compression in real life
▪ Compressing X_i given X_j: for each possible value x_j, store P_{X_i|X_j}(· | x_j)
▪ The dataset is not infinite — metadata takes space!
▪ Example:
  › 1B records, two variables with alphabet sizes 10k and 100k
  › → a conditional distribution with 1B values (10k × 100k), comparable to the dataset itself
  › → then maybe choosing these two is not the best idea…
[Figure: compressed output = entropy code + metadata]
Revised Chow-Liu tree
▪ Take the model size into account. Actual rate (per record):
  H(X_root) + ∑_{(i,j)∈Edges} H(X_i | X_j) + (1/#rows) ∑_{(i,j)∈Edges} Size(P_{X_i|X_j})
  = ∑_{i=1}^n H(X_i) − ∑_{(i,j)∈Edges} I(X_i; X_j) + (1/#rows) ∑_{(i,j)∈Edges} Size(P_{X_i|X_j})
▪ → Revised weights for the Chow-Liu tree: w_ij = −I(X_i; X_j) + (1/#rows) · Size(P_{X_i|X_j})
▪ Negative gain? → might opt to drop dependencies → forest
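The spanning step with the revised weights might look like the sketch below; `pairwise_mi` and `model_sizes` are assumed to be precomputed inputs (illustrative names), and any edge whose adjusted weight is non-negative is dropped, so the output may be a forest.

```python
from itertools import combinations

def revised_chow_liu_forest(fields, pairwise_mi, model_sizes, num_rows):
    """Kruskal's algorithm with adjusted weights w_ij = -I(X_i;X_j) + Size(P_{Xi|Xj}) / num_rows.

    pairwise_mi and model_sizes map frozenset({field_i, field_j}) to the estimated
    mutual information (bits/record) and the metadata size (bits), respectively.
    """
    edges = []
    for a, b in combinations(fields, 2):
        key = frozenset((a, b))
        w = -pairwise_mi[key] + model_sizes[key] / num_rows
        if w < 0:      # keep a dependency only if the MI gain outweighs the metadata cost
            edges.append((w, a, b))
    edges.sort()

    parent = {f: f for f in fields}

    def find(f):                                  # union-find with path compression
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    forest = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            forest.append((a, b, w))
    return forest
```

Fields that end up without a parent are simply coded from their marginal distributions.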
Example: MST with mutual-information weights vs. revised weights
[Figure: the Chow-Liu tree over the example fields (UserID, Age, Location, Device, Time, DocID) built with w_ij = −I(X_i; X_j), compared against the forest built with w_ij = −I(X_i; X_j) + (1/#rows) · Size(P_{X_i|X_j}); each choice yields a different split between entropy-coded data and metadata]
Storing the metadata
How to store the distribution P(X|Y)?
- Naïve: save the entire matrix
- Lossless compression: gzip / utilize sparsity
- Lossy compression!
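A tiny sketch of the "utilize sparsity" option: store only the observed (x, y) counts instead of a dense |X| × |Y| matrix; the example columns are made up.

```python
from collections import Counter, defaultdict

def sparse_conditional(x_column, y_column):
    """Represent P(X|Y) as {y: {x: count}}, keeping only observed (x, y) pairs."""
    table = defaultdict(Counter)
    for x, y in zip(x_column, y_column):
        table[y][x] += 1
    return {y: dict(counts) for y, counts in table.items()}

device = ["iPhone", "iPad", "iPhone", "iPhone"]
zipcode = ["90210", "94087", "90210", "94087"]
print(sparse_conditional(device, zipcode))
# {'90210': {'iPhone': 2}, '94087': {'iPad': 1, 'iPhone': 1}}
```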
Improvements: lossy model compression
▪ Compressing X given Y (the data itself is still compressed losslessly)
▪ True distribution: P_XY
▪ A lossy representation results in a distorted distribution Q_XY
▪ Code rate: H(X|Y) + D(P_{X|Y} || Q_{X|Y} | P_Y) + (1/#rows) · Size(Q_{X|Y})
▪ Want to minimize both model storage size and divergence!
  › Related to MDL
  › Can be used to modify the edge weights
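A small numeric sketch of this rate formula: given an empirical joint P_XY and a distorted conditional model Q_{X|Y}, compute H(X|Y) plus the conditional KL penalty and the amortized model size. The toy distributions and size numbers are made up.

```python
import numpy as np

def code_rate_bits(Pxy, Qx_given_y, model_size_bits, num_rows):
    """H(X|Y) + D(P_{X|Y} || Q_{X|Y} | P_Y) + model_size / num_rows, in bits per record."""
    Py = Pxy.sum(axis=0)                          # P_Y(y); columns are indexed by y
    Px_given_y = Pxy / Py                         # P_{X|Y}(x | y)
    h_x_given_y = -np.sum(Pxy * np.log2(Px_given_y))
    kl = np.sum(Pxy * np.log2(Px_given_y / Qx_given_y))   # conditional KL, averaged over P_Y
    return h_x_given_y + kl + model_size_bits / num_rows

Pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                      # toy joint distribution, rows = x, cols = y
Qx_given_y = np.array([[0.7, 0.3],
                       [0.3, 0.7]])               # distorted (lossy) model of P_{X|Y}
print(code_rate_bits(Pxy, Qx_given_y, model_size_bits=64, num_rows=1_000_000))
```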
Proposed approach:
▪ Add a virtual variable Z with a small alphabet, such that X–Z–Y is a Markov chain:
  P_XY(x, y) ≅ ∑_z Q_{X|Z}(x | z) Q_{Y|Z}(y | z) Q_Z(z)
▪ Storage size decreases from |X| · |Y| to (|X| + |Y|) · |Z|
▪ |Z| controls the tradeoff between the two objectives
▪ Finding {Q_{X|Z}(x | z), Q_{Y|Z}(y | z), Q_Z(z)}:
  › Iterate over the three terms, minimizing the KL divergence each time, and repeat until convergence
    • Not optimal! The optimization is hard
    • Similar in spirit to non-negative matrix factorization [Lee & Seung, NIPS 2001]
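As a rough illustration of the "similar in spirit to Lee & Seung" remark, the sketch below factorizes a joint matrix P_XY with KL-divergence NMF multiplicative updates and renormalizes the factors into Q_Z, Q_{X|Z}, Q_{Y|Z}. The toy matrix, |Z| = 2, and the iteration count are made-up values; this is an assumption-laden stand-in, not necessarily the exact optimization used in the talk.

```python
import numpy as np

def kl_nmf(P, z_dim, iters=200, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates minimizing the generalized KL divergence
    D(P || W @ H), with W of shape (|X|, |Z|) and H of shape (|Z|, |Y|)."""
    rng = np.random.default_rng(seed)
    nx, ny = P.shape
    W = rng.random((nx, z_dim))
    H = rng.random((z_dim, ny))
    for _ in range(iters):
        R = P / (W @ H + eps)                            # ratio matrix P / (W H)
        W *= (R @ H.T) / (H.sum(axis=1) + eps)           # multiplicative update for W
        R = P / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # multiplicative update for H
    return W, H

# Toy joint distribution P_XY (rows: x values, columns: y values) -- hypothetical numbers
P = np.array([[0.30, 0.10, 0.02],
              [0.10, 0.25, 0.03],
              [0.02, 0.03, 0.15]])
W, H = kl_nmf(P, z_dim=2)

# Renormalize the rank-|Z| factorization into Q_Z, Q_{X|Z}, Q_{Y|Z}
Qz = W.sum(axis=0) * H.sum(axis=1)                       # probability mass routed through each z
Qx_given_z = W / (W.sum(axis=0) + 1e-12)                 # each column sums to 1
Qy_given_z = H / (H.sum(axis=1, keepdims=True) + 1e-12)  # each row sums to 1

approx = (Qx_given_z * Qz) @ Qy_given_z     # ≅ P_XY, storing only (|X| + |Y|) * |Z| values
print(np.abs(approx - P).max())
```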
Example: Criteo dataset
▪ A Kaggle competition for click prediction by Criteo
▪ Dataset: 45M records
[Figure: pairwise mutual information between fields and the resulting Chow-Liu tree]
Example: Criteo dataset
▪ Variables 3 and 8 have large alphabets: 5,500 and 14k values (vs. 16M records) → can't afford to store the conditional distribution
▪ Results of NNMF: [figure omitted]
Experiments
▪ Datasets: machine learning, US census, etc.
  › #features: 10–68
  › #lines: 60K – 45M
▪ Current version:
  › MST with adjusted weights
  › Sparse encoding of metadata + lossless compression
Speed vs. compression efficiency
Summary
▪ Dataset compression via probabilistic assumptions
  › Bayes nets, Chow-Liu trees
  › Metadata encoding + weight modification
▪ Lossless compression via lossy model compression
  › Add a new RV with a Markov restriction
  › Balance metadata size vs. model inaccuracy
▪ Take-home message:
  › Choose the right metric
  › Revisit old ideas