
Compressing Tabular Data via Pairwise Dependencies

Amir Ingber, Yahoo! Research

TCE Conference, June 22, 2017

Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)

Huge datasets: everywhere

-  Internet
-  Science
-  Media
-  …

At Yahoo:
-  More than 100k servers in ~36 clusters
-  More than 800 PB of storage
-  Lots of data, always want to store more

Compressing big data: does it matter?

It's expensive:
-  Cost of storing 1 PB: around $300k/year (e.g. on AWS)

It's big:
-  Example: storing an event log, ~1B events/day × 6 months
-  Stored for analytics / machine learning

Lossless compression: dictionary methods

Typical compression: gzip (DEFLATE)
-  Based on LZ77 + Huffman
-  Popular, fast
-  Recent variants: zstd (FB, 2015), Brotli (Google, 2015), …
-  Good at: detecting temporal dependencies, e.g. text

Main idea: find repetitions in a sliding window. Example:

  the brown fox jumped over the brownish jumping bear
  the brown fox jumped over (26,9)ish(25,5)ing bear

Back-references are (distance, length) pairs: (26,9) points 26 characters back to "the brown", and (25,5) points 25 characters back to " jump".
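
To make the sliding-window idea concrete, here is a minimal, unoptimized sketch (not gzip's actual matcher) that greedily emits (distance, length) back-references; on the sentence above it reproduces the (26, 9) and (25, 5) references.

```python
def lz77_matches(text, window=32, min_len=4):
    """Toy greedy LZ77: emit literals and (distance, length) back-references."""
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Overlapping matches are allowed, as in real LZ77.
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:          # only use matches long enough to pay off
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(text[i])
            i += 1
    return out

print(lz77_matches("the brown fox jumped over the brownish jumping bear"))
# ... literals ..., (26, 9), 'i', 's', 'h', (25, 5), 'i', 'n', 'g', ...
```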

Tabular data

Typical dataset: a table
-  Each row has several fields, complex dependencies
-  Temporal dependencies? Cross-field dependencies?
-  Example:

UserID  | Age | Location | Device   | Time | DocID
4324234 | 25  | 90210    | iPhone 7 | 9pm  | 33221
1223231 | 49  | 94087    | iPad pro | 10am | 66543
…       | …   | …        | …        | …    | …

Entropy coding 101

▪ Given: a stream of i.i.d. symbols of a R.V. X
▪ Encode each symbol as a (prefix-free) bit string of variable length
  ›  More frequent symbols → shorter codewords
  ›  Theorem: avg. code length ≥ H(X) = \sum_x p(x) \log(1/p(x))
▪ Huffman code: optimal. Rate ≤ H(X) + 1 bit
▪ Better: arithmetic coding
  ›  Approaches entropy
  ›  Requires: the distribution p(x)
▪ Black box
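
As a quick sanity check of the H(X) ≤ L ≤ H(X) + 1 statement, here is a tiny computation for a made-up dyadic source (the symbols, probabilities, and code below are invented for illustration, not from the talk):

```python
import math

# Toy dyadic source and a Huffman code for it (prefix-free).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

H = sum(px * math.log2(1.0 / px) for px in p.values())   # entropy H(X)
L = sum(px * len(code[x]) for x, px in p.items())         # average codeword length

print(f"H(X) = {H:.3f} bits, avg. codeword length = {L:.3f} bits")
# Here both are 1.75 bits; in general H(X) <= L < H(X) + 1 for a Huffman code.
```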

Assumptions:

▪ Records in the table are i.i.d.
▪ Each record is a tuple of RVs [X_1, X_2, ..., X_n]; the only dependence is between fields
▪ Example: full independence: P_{X_1 X_2 ... X_n}(x_1, x_2, ..., x_n) = \prod_i P_{X_i}(x_i)
▪ Expected compression rate: \sum_{i=1}^n H(X_i) per record
▪ Fine print: for each RV, need to save
  ›  The distribution and/or codebook (Huffman / arithmetic coder)
  ›  A dictionary (to translate back to the original values)
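
A minimal sketch of the independent-fields rate \sum_i H(X_i), computed from empirical per-column entropies of a toy table (the table values are invented for illustration):

```python
import math
from collections import Counter

# Toy table; each tuple is one record with fields (Device, Age, Location).
rows = [
    ("iPhone", 25, "90210"),
    ("iPhone", 25, "90210"),
    ("iPad",   49, "94087"),
    ("iPhone", 25, "94087"),
]

def empirical_entropy(values):
    """Empirical entropy of one column in bits, treating records as i.i.d."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

columns = list(zip(*rows))
rate = sum(empirical_entropy(col) for col in columns)   # \sum_i H(X_i)
print(f"independent-fields rate: {rate:.3f} bits per record")
```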

Fancier models: Bayesian networks

▪ Bayes net
  ›  DAG with n nodes
  ›  Nodes are the RVs, edges encode the (conditional) independence structure
▪ Example: four nodes X_1, X_2, X_3, X_4 with
  P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2, x_3)
▪ Dense graph → more general
  ›  Compression rate: H(X_1) + H(X_2 | X_1) + H(X_3 | X_1) + H(X_4 | X_2, X_3)
▪ Usage for compression: compress according to the graph edges
  ›  Metadata: larger codebooks / distributions (conditional!)
▪ Not a new idea [e.g. Davies & Moore, KDD'99]
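
The sum-of-conditional-entropies rate follows from coding each field with an (ideal) entropy coder for its conditional distribution given its parents in the DAG; assuming exact arithmetic coding and ignoring metadata, the expected cost per record is

  E[-\log P(X_1, ..., X_n)]
    = \sum_{i=1}^n E[-\log P(X_i | Pa(X_i))]
    = \sum_{i=1}^n H(X_i | Pa(X_i)),

which for the four-node example above is exactly H(X_1) + H(X_2 | X_1) + H(X_3 | X_1) + H(X_4 | X_2, X_3).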

How to choose a Bayes net for compression?

▪ Another assumption: each node can have only a single parent
▪ DAG ⇒ tree
▪ Simpler compression
  ›  Each field is conditioned on only a single RV
  ›  Compression rate:
     H(X_root) + \sum_{(i,j) \in edges} H(X_i | X_j) = \sum_{i=1}^n H(X_i) - \sum_{(i,j) \in edges} I(X_i; X_j)
▪ Best tree?
▪ Example: a tree over X_1, X_2, X_3, X_4 with
  P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2)
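
The equality between the two forms of the rate follows from I(X_i; X_j) = H(X_i) - H(X_i | X_j), since every non-root node appears exactly once as the child end of an edge:

  H(X_root) + \sum_{(i,j) \in edges} H(X_i | X_j)
    = H(X_root) + \sum_{(i,j) \in edges} [ H(X_i) - I(X_i; X_j) ]
    = \sum_{i=1}^n H(X_i) - \sum_{(i,j) \in edges} I(X_i; X_j).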

Searching for the best tree

▪ Rate: \sum_{i=1}^n H(X_i) - \sum_{(i,j) \in edges} I(X_i; X_j)
▪ Algorithm:
  ›  Calculate I(X_i; X_j) for all 1 ≤ i, j ≤ n  (O(n^2) pairs)
  ›  Set the edge weights w_{ij} = -I(X_i; X_j)
  ›  Find the minimum spanning tree!
  ›  Efficient algorithms exist [Fredman & Tarjan, 1987]
  ›  The resulting tree also minimizes the KL divergence w.r.t. the true distribution
▪ Known as a Chow-Liu tree [Chow & Liu, 1968]
  ›  Extensions exist [e.g. Williamson, 2000]
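
A minimal sketch of the Chow-Liu construction on columnar data: compute empirical pairwise mutual information, then build a minimum spanning tree with weights w_{ij} = -I(X_i; X_j). Prim's algorithm is used here purely for brevity (Fredman & Tarjan's algorithm is asymptotically faster); the function and variable names are illustrative, not the talk's implementation.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(counts, n):
    """Empirical entropy in bits from a Counter of value frequencies."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(col_i, col_j):
    """Empirical mutual information I(X_i; X_j) in bits between two columns."""
    n = len(col_i)
    hi = entropy(Counter(col_i), n)
    hj = entropy(Counter(col_j), n)
    hij = entropy(Counter(zip(col_i, col_j)), n)
    return hi + hj - hij

def chow_liu_edges(columns):
    """Minimum spanning tree of the complete graph with weights -I(X_i; X_j).

    `columns` is a list of equal-length lists, one per field.
    """
    n = len(columns)
    w = {(i, j): -mutual_information(columns[i], columns[j])
         for i, j in combinations(range(n), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: w[tuple(sorted(e))])
        edges.append((i, j))   # field j is coded conditioned on its parent i
        in_tree.add(j)
    return edges
```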

Example: MST with Mutual Information Weights

[Figure: the example table from above and the resulting spanning tree over the fields UserID, Age, Location, Device, Time, DocID.]

Chow-Liu compression in real life

▪ Compressing X_i given X_j: for each possible value x_j, store P_{X_i|X_j}(· | x_j)
▪ The dataset is not infinite – metadata takes space!
▪ Example:
  ›  1B records, two variables with alphabet sizes 10k and 100k
  ›  → conditional distribution with up to 10k × 100k = 1B values (comparable to the dataset itself)
  ›  → then maybe choosing these two is not the best idea…

[Figure: the compressed output consists of the entropy-coded data plus the metadata.]

Revised Chow-Liu tree

Take the model size into account. Actual rate:

  \sum_{i=1}^n H(X_i) - \sum_{(i,j) \in edges} I(X_i; X_j) + (1/#rows) \sum_{(i,j) \in edges} Size(P_{X_i|X_j})

→ Revised weights for the Chow-Liu tree:

  w_{ij} = -I(X_i; X_j) + (1/#rows) · Size(P_{X_i|X_j})

Negative gain? → might opt to drop dependencies → forest
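
A small sketch of the weight adjustment and of pruning to a forest. The helper names and the metadata-size estimate are illustrative assumptions; the actual size depends on how the conditional distribution is encoded.

```python
def revised_weight(mi_bits, cond_dist_size_bits, n_rows):
    """w_ij = -I(X_i; X_j) + Size(P_{X_i|X_j}) / #rows  (all in bits per row).

    cond_dist_size_bits is an estimate of the stored size of the conditional
    distribution, e.g. (number of nonzero entries) * (bits per stored entry).
    """
    return -mi_bits + cond_dist_size_bits / n_rows

def prune_to_forest(edges, weights):
    """Drop edges whose revised weight is non-negative: that dependency does not
    pay for its own metadata, so the child field falls back to its marginal."""
    return [e for e in edges if weights[e] < 0.0]
```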

Example: MST with Mutual Information Weights

[Figure: the spanning tree over UserID, Age, Location, Device, Time, DocID under the original weights w_{ij} = -I(X_i; X_j) vs. the revised weights w_{ij} = -I(X_i; X_j) + (1/#rows) · Size(P_{X_i|X_j}), together with the resulting split between entropy-coded data and metadata in each case.]

Storing the metadata

How to store the distribution P(X|Y)?
-  Naïve: save the entire matrix
-  Lossless compression: gzip / utilize sparsity
-  Lossy compression!
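
A minimal sketch of the "utilize sparsity + lossless compression" option: store P(X|Y) as per-y counters keeping only the nonzero entries, then gzip the serialized result. The serialization format here is an arbitrary choice for illustration.

```python
import gzip
import json
from collections import Counter, defaultdict

def conditional_counts(xs, ys):
    """Sparse view of P(X|Y): for each observed y, a Counter over the x values
    that co-occur with it (zero entries of the matrix are simply absent)."""
    table = defaultdict(Counter)
    for x, y in zip(xs, ys):
        table[y][x] += 1
    return table

def store_gzipped(table, path):
    """Serialize the sparse table as JSON and gzip it."""
    payload = {str(y): {str(x): c for x, c in cnt.items()}
               for y, cnt in table.items()}
    with gzip.open(path, "wt") as f:
        json.dump(payload, f)
```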

Improvements: lossy model compression

▪ Compressing X given Y (the data compression itself is still lossless)
▪ True distribution: P_{XY}
▪ A lossy representation results in a distorted distribution Q_{XY}
▪ Code rate: H(X|Y) + D(P_{X|Y} || Q_{X|Y} | P_Y) + (1/#rows) · Size(Q_{X|Y})
▪ Want to minimize both the model storage size and the divergence!
  ›  Related to MDL
  ›  Can be used to modify the edge weights

Proposed approach:

▪ Add a virtual variable Z with a small alphabet, such that X–Z–Y forms a Markov chain:
  P_{XY}(x, y) ≅ \sum_z Q_{X|Z}(x|z) Q_{Y|Z}(y|z) Q_Z(z)
▪ Storage size decreases from |X|·|Y| to (|X|+|Y|)·|Z|
▪ |Z| controls the tradeoff between the two objectives
▪ Finding {Q_{X|Z}(x|z), Q_{Y|Z}(y|z), Q_Z(z)}:
  ›  Iterate through the three terms, minimizing the KL divergence, and repeat until convergence
     •  Not optimal! The optimization is hard
     •  Similar in spirit to [Lee & Seung, NIPS 2001]
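
A minimal sketch of the alternating minimization, written as EM-style updates for the three factors. This is the standard aspect-model / KL-NMF style update, offered as one plausible instantiation of the procedure described above, not the authors' exact code.

```python
import numpy as np

def factorize_joint(P, k, iters=200, seed=0, eps=1e-12):
    """Approximate P(x,y) by sum_z Q_Z(z) Q_{X|Z}(x|z) Q_{Y|Z}(y|z).

    P: |X| x |Y| empirical joint distribution (nonnegative, sums to 1).
    k: alphabet size of the virtual variable Z (controls the tradeoff).
    Returns (qz, qx_z, qy_z) with shapes (k,), (|X|, k), (|Y|, k).
    """
    rng = np.random.default_rng(seed)
    nx, ny = P.shape
    qz = np.full(k, 1.0 / k)
    qx_z = rng.random((nx, k)); qx_z /= qx_z.sum(axis=0, keepdims=True)
    qy_z = rng.random((ny, k)); qy_z /= qy_z.sum(axis=0, keepdims=True)

    for _ in range(iters):
        # E-step: responsibilities r(z | x, y), shape (nx, ny, k)
        joint = qx_z[:, None, :] * qy_z[None, :, :] * qz[None, None, :]
        r = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: re-estimate the three factors from P-weighted responsibilities
        w = P[:, :, None] * r
        qz = w.sum(axis=(0, 1)) + eps
        qx_z = w.sum(axis=1) / qz
        qy_z = w.sum(axis=0) / qz
        qz /= qz.sum()
    return qz, qx_z, qy_z
```

Each iteration does not increase the KL divergence between P and the factored approximation; |Z| (the parameter k) trades model accuracy against the (|X|+|Y|)·|Z| storage cost.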

Example: Criteo dataset

▪ A Kaggle competition for click prediction by Criteo
▪ Dataset: 45M records
▪ [Figures: pairwise mutual information between the fields, and the resulting Chow-Liu tree.]

Example: Criteo dataset

▪ Variables 3 and 8 have large alphabets: 5,500 and 14k (vs. 16M records)
  → can't store the conditional distribution
▪ Results of NNMF: [figure]

Experiments

▪ Datasets: machine learning, US census, etc.
  ›  #features: 10-68
  ›  #lines: 60K – 45M
▪ Current version:
  ›  MST with adjusted weights
  ›  Sparse encoding of metadata + lossless compression

Speed vs. compression efficiency

Summary

▪ Dataset compression via probabilistic assumptions
  ›  Bayes nets, Chow-Liu trees
  ›  Metadata encoding + weight modification

▪ Lossless compression via lossy model compression
  ›  Add a new RV with a Markov restriction
  ›  Balance metadata size vs. model inaccuracy

▪ Take-home message:
  ›  Choose the right metric
  ›  Revisit old ideas
