Compressing Tabular Data via Pairwise Dependencies Amir Ingber, Yahoo! Research TCE Conference, June 22, 2017 Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)


Page 1: Compressing Tabular Data via Pairwise Dependenciestce.webee.eedev.technion.ac.il/wp-content/uploads/... · Compressing Tabular Data via Pairwise Dependencies Amir Ingber, Yahoo! Research

Compressing Tabular Data via Pairwise Dependencies

Amir Ingber, Yahoo! Research

TCE Conference, June 22, 2017

Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)


Huge datasets: everywhere

- Internet
- Science
- Media
- …

At Yahoo:
- More than 100k servers in ~36 clusters
- More than 800PB of storage
- Lots of data, always want to store more


Compressing big data: does it matter?

It’s expensive:
- Cost of storing 1 PB: around $300k/year (e.g. on AWS)

It’s big:
- Example: storing an event log of ~1B events/day x 6 months
- Stored for analytics / machine learning


Lossless compression: dictionary methods

Typical compression: gzip (DEFLATE)
- Based on LZ77 + Huffman
- Popular, fast
- Recent variants: zstd (FB, 2015), Brotli (Google, 2015), …
- Good at: detecting temporal dependencies, e.g. text

Main idea: find repetitions in a sliding window

the brown fox jumped over the brownish jumping bear

the brown fox jumped over (26,9)ish(25,5)ing bear

Each pair is (distance back, match length): (26,9) copies "the brown" from 26 characters earlier, and (25,5) copies " jump" from 25 characters earlier.
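The sliding-window idea can be sketched in a few lines. This is a toy greedy matcher, not DEFLATE itself; the function names, the 32-character window, and the minimum match length of 4 are assumptions chosen for illustration.

```python
def lz77_tokens(text, window=32, min_len=4):
    """Toy greedy LZ77: emit literal characters or (distance, length) pairs."""
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Try every start position inside the sliding window.
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # back-reference
            i += best_len
        else:
            tokens.append(text[i])  # literal
            i += 1
    return tokens


def lz77_decode(tokens):
    """Invert lz77_tokens by replaying literals and back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)


text = "the brown fox jumped over the brownish jumping bear"
tokens = lz77_tokens(text)
references = [t for t in tokens if isinstance(t, tuple)]
```

On the slide's sentence, `references` holds the two back-references for "the brown" and " jump", and `lz77_decode(tokens)` restores the input.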


Tabular data

Typical dataset: a table
- Each row has several fields, complex dependencies
- Example:

UserID    Age   Location   Device     Time   DocID
4324234   25    90210      iPhone 7   9pm    33221
1223231   49    94087      iPad pro   10am   66543
…         …     …          …          …      …

- Temporal dependencies?
- Cross-field dependencies?


Entropy coding 101

▪ Given: a stream of i.i.d. symbols of a R.V. $X$
▪ Encode each symbol as a (prefix-free) bit string of variable length
  › More frequent symbols → shorter codeword
  › Theorem: avg. code length $\geq H(X) = \sum_x p(x) \log \frac{1}{p(x)}$
▪ Huffman code: optimal. Rate $\leq H(X) + 1$ bit
▪ Better: Arithmetic coding
  › Approaches entropy
  › Requires: distribution $p(x)$
▪ Black box
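The two bounds can be checked numerically on a small stream. The sketch below builds Huffman codeword lengths with a heap and compares the average length with the empirical entropy; the toy stream and the helper names are assumptions, not material from the talk.

```python
import heapq
import math
from collections import Counter


def entropy_bits(counts):
    """Empirical entropy in bits per symbol."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def huffman_lengths(counts):
    """Return {symbol: codeword length} for a Huffman code over counts."""
    if len(counts) == 1:
        return {s: 1 for s in counts}
    # Heap items: (weight, tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # merged symbols move one level deeper
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


symbols = "aaaaaaaabbbbccdd"  # hypothetical toy stream
counts = Counter(symbols)
H = entropy_bits(counts)
lengths = huffman_lengths(counts)
avg_len = sum(counts[s] * lengths[s] for s in counts) / len(symbols)
```

All probabilities in this stream are powers of 1/2, so the Huffman code meets the entropy exactly; in general the average length lies between $H(X)$ and $H(X) + 1$.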


Assumptions:
▪ Records in the table, $[X_1, X_2, \ldots, X_n]$, are i.i.d.
▪ Only dependence between fields

▪ Example: independence
  $P_{X_1 X_2 \ldots X_n}(x_1, x_2, \ldots, x_n) = \prod_i P_{X_i}(x_i)$

▪ Expected compression rate: $\sum_{i=1}^{n} H(X_i)$ per record

▪ Fine print: for each RV, need to save
  › The distribution and/or codebook (Huffman / Arith. coder)
  › A dictionary (to translate back to the original values)
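As a rough illustration of what the independence assumption costs, the sketch below compares the sum of per-column entropies with the joint entropy of whole rows. The table is made up for the example and is not data from the talk.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Empirical entropy in bits of a sequence of hashable values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


# Hypothetical toy table: (age bucket, device); the two fields are correlated.
rows = (
    [("20s", "iPhone")] * 3 + [("20s", "iPad")]
    + [("40s", "iPad")] * 3 + [("40s", "iPhone")]
)
columns = list(zip(*rows))

rate_independent = sum(entropy_bits(col) for col in columns)  # sum_i H(X_i)
rate_joint = entropy_bits(rows)                               # H(X_1, ..., X_n)
```

Here `rate_independent` is 2 bits per record, while `rate_joint` is lower because the fields depend on each other; that gap is what a dependency-aware model can recover.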


Fancier models: Bayesian networks

▪ Bayes net
  › DAG with n nodes
  › Nodes are the RVs
  › Edges model (conditional) independence

[Figure: DAG over the nodes X1, X2, X3, X4]

$P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2, x_3)$

▪ Dense graph → more general
  › Compression rate: $H(X_1) + H(X_2 | X_1) + H(X_3 | X_1) + H(X_4 | X_2, X_3)$

▪ Usage for compression: compress according to the graph edges
  › Metadata: larger codebooks / distributions (conditional!)

▪ Not a new idea [e.g. Davies & Moore, KDD’99]


How to choose a Bayes Net for compression?

▪ Another assumption:
  › Each node can only have a single parent

▪ DAG → Tree
▪ Simpler compression
  › Conditioned only on a single RV
  › Compression rate:
    $H(X_{root}) + \sum_{\mathrm{Edges}(i,j)} H(X_i | X_j) = \sum_{i=1}^{n} H(X_i) - \sum_{\mathrm{Edges}(i,j)} I(X_i; X_j)$

▪ Best tree?

[Figure: tree over the nodes X1, X2, X3, X4]

$P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2)$
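The rate identity for trees follows from $H(X_i | X_j) = H(X_i) - I(X_i; X_j)$ applied to every edge. The short derivation below assumes that an edge $(i, j)$ denotes child $i$ with parent $j$, so each non-root node appears exactly once as a child.

```latex
\begin{align*}
H(X_{root}) + \sum_{\mathrm{Edges}(i,j)} H(X_i \mid X_j)
  &= H(X_{root}) + \sum_{\mathrm{Edges}(i,j)} \bigl[ H(X_i) - I(X_i; X_j) \bigr] \\
  &= \sum_{i=1}^{n} H(X_i) - \sum_{\mathrm{Edges}(i,j)} I(X_i; X_j)
\end{align*}
```

Since $\sum_i H(X_i)$ does not depend on the tree, minimizing the rate amounts to maximizing the total mutual information along the edges.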


Searching for the best tree

▪ Rate:
  $H(X_{root}) + \sum_{\mathrm{Edges}(i,j)} H(X_i | X_j) = \sum_{i=1}^{n} H(X_i) - \sum_{\mathrm{Edges}(i,j)} I(X_i; X_j)$

▪ Algorithm:
  › Calculate $I(X_i; X_j), \ \forall \ 1 \leq i, j \leq n$ ($O(n^2)$ pairs)
  › Set $w_{ij} = -I(X_i; X_j)$
  › Find minimum spanning tree!
  › Efficient algorithms exist – [Fredman & Tarjan, 1987]
  › Also: minimizes the KL divergence w.r.t. the true distribution.

▪ Known as a Chow-Liu Tree [Chow & Liu, 1968]
  › Extensions exist [e.g. Williamson, 2000]
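The steps above can be sketched with empirical mutual information and a plain Kruskal pass. Kruskal with union-find is swapped in for simplicity and is not the Fredman & Tarjan algorithm cited on the slide; the toy columns and function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(values):
    """Empirical entropy in bits."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def chow_liu_edges(columns):
    """Minimum spanning tree for weights w_ij = -I(X_i; X_j) (Kruskal)."""
    n = len(columns)
    weighted = sorted(
        (-mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    edges = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two components, so keep it
            parent[ri] = rj
            edges.append((i, j))
    return edges


# Hypothetical columns: 1 copies 0, 3 copies 2, and the two pairs are unrelated.
a = [0, 0, 1, 1, 0, 0, 1, 1]
c = [0, 1, 0, 1, 0, 1, 0, 1]
edges = chow_liu_edges([a, list(a), c, list(c)])
```

The two fully dependent pairs are picked first; the remaining edge carries zero mutual information and only connects the tree.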


Example: MST with Mutual Information Weights

UserID    Age   Location   Device     Time   DocID
4324234   25    90210      iPhone 7   9pm    33221
1223231   49    94087      iPad pro   10am   66543
…         …     …          …          …      …

[Figure: spanning tree over the fields UserID, Age, Device, Location, DocID, Time]


Chow Liu compression in real life

▪ Compressing $X_i$ given $X_j$:
▪ For each possible $x_j$, store $P_{X_i|X_j}(\cdot \mid x_j)$

▪ Dataset not infinite – metadata takes space!

▪ Example:
  › 1B records, two variables with alphabet sizes 10k and 100k
    → Conditional distribution has 10k × 100k = 1B values (comparable to the dataset itself)
    → Then maybe choosing these two is not the best idea…

[Figure: compressed output split into entropy code and metadata]


Revised Chow-Liu tree

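Take into account model size

Actual rate:
$H(X_{root}) + \sum_{\mathrm{Edges}(i,j)} H(X_i | X_j) + \frac{1}{\#\text{rows}} \sum_{\mathrm{Edges}(i,j)} \mathrm{Size}(P_{X_i|X_j}) = \sum_{i=1}^{n} H(X_i) - \sum_{\mathrm{Edges}(i,j)} I(X_i; X_j) + \frac{1}{\#\text{rows}} \sum_{\mathrm{Edges}(i,j)} \mathrm{Size}(P_{X_i|X_j})$

→ Revised weights for the Chow-Liu tree:
$w_{ij} = -I(X_i; X_j) + \frac{1}{\#\text{rows}} \mathrm{Size}(P_{X_i|X_j})$

Negative gain? → might opt to drop dependencies → forest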
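A minimal sketch of the revised weights, assuming the model size is the number of distinct value pairs times a fixed cost per stored entry. The 32-bit entry cost, the symmetric size estimate, and all names are assumptions for illustration, not the encoding used in the talk. Edges whose weight is not negative are dropped, so the result can be a forest.

```python
import math
from collections import Counter
from itertools import combinations

BITS_PER_ENTRY = 32  # assumption: cost of storing one table entry


def entropy(values):
    """Empirical entropy in bits."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def revised_weight(xs, ys, bits_per_entry=BITS_PER_ENTRY):
    """w_ij = -I(X_i; X_j) + Size(P) / #rows, with Size(P) a rough estimate."""
    mi = entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))
    model_bits = bits_per_entry * len(set(zip(xs, ys)))
    return -mi + model_bits / len(xs)


def revised_forest(columns):
    """Kruskal over revised weights, keeping only edges with negative weight."""
    n = len(columns)
    candidates = sorted(
        (revised_weight(columns[i], columns[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    edges = []
    for w, i, j in candidates:
        if w >= 0:
            break  # modelling this dependency costs more than it saves
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges


small = [[0, 1] * 4, [0, 1] * 4]      # 8 rows: metadata dominates
large = [[0, 1] * 500, [0, 1] * 500]  # 1000 rows: dependency pays off
forest_small = revised_forest(small)
forest_large = revised_forest(large)
```

With 8 rows the edge is dropped and the two columns are coded independently; with 1000 rows the same dependency is kept.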


Example: MST with Mutual Information Weights

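[Figure: two trees over the fields UserID, Age, Device, Location, DocID, Time, one built with weights $w_{ij} = -I(X_i; X_j)$ and one with $w_{ij} = -I(X_i; X_j) + \frac{1}{\#\text{rows}} \mathrm{Size}(P_{X_i|X_j})$; each is shown with its split of the output into entropy code and metadata]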


Storing the metadata

How to store the distribution P(X|Y)?
- Naïve: save entire matrix
- Lossless compression: gzip / utilize sparsity
- Lossy compression!


Improvements: Lossy model compression

▪ Compressing $X$ given $Y$: (compression is still lossless)
▪ True distribution: $P_{XY}$
▪ Lossy representation results in distorted distribution $Q_{XY}$
▪ Code rate: $H(X|Y) + D(P_{X|Y} \| Q_{X|Y} \mid P_Y) + \frac{1}{\#\text{rows}} \mathrm{Size}(Q_{X|Y})$

▪ Want to minimize both model storage size and divergence!
  › Related to MDL
  › Can be used to modify edge weights
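The first two terms of the code rate are the cost of coding with the wrong conditional model: conditional entropy plus conditional KL divergence equals the cross-entropy under $Q_{X|Y}$. The sketch below checks this on a made-up 2x2 example; both distributions are hypothetical, and the model-size term is left out.

```python
import math

# Hypothetical joint distribution P(x, y), keyed by (x, y).
P = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
# Hypothetical coarser model Q(x | y), keyed by (x, y).
Q_cond = {(0, 0): 0.75, (1, 0): 0.25, (0, 1): 0.25, (1, 1): 0.75}

# Marginal P_Y(y).
P_y = {}
for (x, y), p in P.items():
    P_y[y] = P_y.get(y, 0.0) + p

# H(X|Y) under the true distribution.
cond_entropy = -sum(p * math.log2(p / P_y[y]) for (x, y), p in P.items())

# D(P_{X|Y} || Q_{X|Y} | P_Y): penalty for coding with Q instead of P.
divergence = sum(
    p * math.log2((p / P_y[y]) / Q_cond[(x, y)]) for (x, y), p in P.items()
)

# Expected code length per symbol when coding X given Y with Q.
cross_entropy = -sum(p * math.log2(Q_cond[(x, y)]) for (x, y), p in P.items())
```

Here `cond_entropy + divergence` matches `cross_entropy`, so a coarser model trades a small, positive divergence penalty for fewer bits of metadata.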


Proposed approach:

▪ Add a virtual variable $Z$ with a small alphabet size, s.t. X–Z–Y
  $P_{XY}(x, y) \cong \sum_z Q_{Y|Z}(y | z) \, Q_{X|Z}(x | z) \, Q_Z(z)$

▪ Storage size decreased from $|X| \cdot |Y|$ to $(|X| + |Y|) \, |Z|$
▪ $|Z|$: controls tradeoff between two objectives
▪ Finding $\{ Q_{Y|Z}(y | z), Q_{X|Z}(x | z), Q_Z(z) \}$
  › Iterate through the three terms, minimize KL divergence, repeat until convergence
    • Not optimal! Optimization is hard
    • Similar in spirit to [Lee & Seung, NIPS 2001]
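One common way to fit such a factorization is an EM-style update for a latent-class model, sketched below in plain Python. This is an illustrative stand-in and not necessarily the procedure used in the talk; the block-structured toy distribution, the iteration count, and all names are assumptions.

```python
import math
import random


def fit_latent(P, k, iters=200, seed=0):
    """EM-style fit of P[x][y] ~ sum_z qz[z] * qx[z][x] * qy[z][y]."""
    rng = random.Random(seed)
    nx, ny = len(P), len(P[0])

    def rand_dist(n):
        w = [rng.random() + 0.1 for _ in range(n)]
        total = sum(w)
        return [v / total for v in w]

    qz = rand_dist(k)
    qx = [rand_dist(nx) for _ in range(k)]
    qy = [rand_dist(ny) for _ in range(k)]
    for _ in range(iters):
        nz = [0.0] * k
        nxz = [[0.0] * nx for _ in range(k)]
        nyz = [[0.0] * ny for _ in range(k)]
        for x in range(nx):
            for y in range(ny):
                if P[x][y] == 0.0:
                    continue
                # Posterior of z given (x, y), weighted by P(x, y).
                post = [qz[z] * qx[z][x] * qy[z][y] for z in range(k)]
                total = sum(post)
                for z in range(k):
                    r = P[x][y] * post[z] / total
                    nz[z] += r
                    nxz[z][x] += r
                    nyz[z][y] += r
        for z in range(k):
            qz[z] = nz[z]
            if nz[z] > 0.0:  # leave an empty component's tables unchanged
                qx[z] = [v / nz[z] for v in nxz[z]]
                qy[z] = [v / nz[z] for v in nyz[z]]
    return qz, qx, qy


def kl_bits(P, qz, qx, qy):
    """D(P || Q) in bits, where Q is the latent-variable approximation."""
    total = 0.0
    for x, row in enumerate(P):
        for y, p in enumerate(row):
            if p > 0.0:
                q = sum(qz[z] * qx[z][x] * qy[z][y] for z in range(len(qz)))
                total += p * math.log2(p / q)
    return total


# Hypothetical 6x6 joint distribution with two 3x3 blocks.
n = 6
P = [[1 / 18 if (x < 3) == (y < 3) else 0.0 for y in range(n)] for x in range(n)]

kl1 = kl_bits(P, *fit_latent(P, 1))  # |Z| = 1: independence model
kl2 = kl_bits(P, *fit_latent(P, 2))  # |Z| = 2: one component per block
full_size = n * n                    # |X| * |Y| entries
factored_size = (n + n) * 2          # (|X| + |Y|) * |Z| entries
```

With $|Z| = 1$ the model reduces to independence and pays the full mutual information as a penalty; with $|Z| = 2$ the block structure is captured with fewer stored entries than the full table.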


Example: Criteo dataset

▪ A Kaggle competition for click prediction by Criteo
▪ Dataset: 45M records

[Figures: pairwise mutual information; Chow-Liu tree]


Example: Criteo dataset

▪ Variables 3 and 8 have large alphabets, 5,500 and 14k (vs 16M records) → can’t store conditional distribution

Results of NNMF:


Experiments

▪ Datasets: machine learning, US census, etc.
  › #features: 10-68
  › #lines: 60K – 45M

▪ Current version:
  › MST with adjusted weights
  › Sparse encoding of metadata + lossless comp.


Speed vs. compression efficiency


Summary

▪ Dataset compression via probabilistic assumptions
  › Bayes nets, Chow-Liu Trees
  › Metadata encoding + weight modification

▪ Lossless compression via lossy model compression
  › Add a new RV with a Markov restriction
  › Balance metadata size vs. model inaccuracy

▪ Take home message:
  › Choose right metric
  › Revisit old ideas