TRANSCRIPT
Compressing Tabular Data via Pairwise Dependencies
Amir Ingber, Yahoo! Research
TCE Conference, June 22, 2017
Joint work with Dmitri Pavlichin, Tsachy Weissman (Stanford)
Huge datasets: everywhere
- Internet
- Science
- Media
- …
At Yahoo:
- More than 100k servers in ~36 clusters
- More than 800 PB of storage
- Lots of data, always want to store more
Compressing big data: does it matter?
It’s expensive: storing 1 PB costs around $300k/year (e.g. on AWS).
It’s big: e.g. storing an event log of ~1B events/day for 6 months, kept for analytics / machine learning.
Lossless compression: dictionary methods
Typical compression: gzip (DEFLATE)
- Based on LZ77 + Huffman
- Popular, fast
- Recent variants: zstd (Facebook, 2015), Brotli (Google, 2015), …
- Good at detecting temporal dependencies, e.g. in text
Main idea: find repetitions in a sliding window
Example:
  the brown fox jumped over the brownish jumping bear
is parsed as
  the brown fox jumped over (26,9)ish (25,5)ing bear
where (26,9) and (25,5) are (distance, length) back-references to the earlier "the brown" and " jump".
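To make the sliding-window idea concrete, here is a minimal greedy LZ77-style parser in Python. It is an illustrative sketch (window size and minimum match length chosen arbitrarily), not the actual DEFLATE implementation.

```python
def lz77_sketch(data: str, window: int = 1024, min_match: int = 4):
    """Greedy LZ77-style parse: emit literals and (distance, length) back-references."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the sliding window for the longest match starting at position i
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and j + length < i            # match only against the already-seen prefix
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))    # back-reference into the window
            i += best_len
        else:
            out.append(data[i])                  # literal character
            i += 1
    return out

print(lz77_sketch("the brown fox jumped over the brownish jumping bear"))
# ...literals..., (26, 9), 'i', 's', 'h', (25, 5), ...literals...
```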
Tabular data
Typical dataset: a table
- Each row has several fields, with complex dependencies
- Temporal dependencies? Cross-field dependencies?
- Example:

UserID  | Age | Location | Device   | Time | DocID
4324234 | 25  | 90210    | iPhone 7 | 9pm  | 33221
1223231 | 49  | 94087    | iPad Pro | 10am | 66543
…       | …   | …        | …        | …    | …
Entropy coding 101
▪ Given: a stream of i.i.d. symbols of a R.V. X
▪ Encode each symbol as a (prefix-free) bit string of variable length
  › More frequent symbols → shorter codewords
  › Theorem: avg. code length ≥ H(X) = ∑_x p(x) log(1/p(x))
▪ Huffman code: optimal. Rate ≤ H(X) + 1 bit
▪ Better: arithmetic coding
  › Approaches the entropy
  › Requires: the distribution p(x)
▪ Black box
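As a concrete illustration of these bounds, here is a minimal sketch (not a production coder) that builds Huffman codeword lengths from empirical symbol counts and compares the average length with H(X); the sample string is arbitrary.

```python
import heapq
import math
from collections import Counter

def huffman_lengths(freqs):
    """Return {symbol: codeword length} for a Huffman code over the given counts."""
    # Heap entries: (total weight, tie-breaker, {symbol: depth in this subtree})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                # degenerate single-symbol alphabet
        return {sym: 1 for sym in freqs}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)               # merge the two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

data = "abracadabra"
freqs = Counter(data)
lengths = huffman_lengths(freqs)
n = len(data)
avg_len = sum(freqs[s] * lengths[s] for s in freqs) / n
entropy = sum((c / n) * math.log2(n / c) for c in freqs.values())
# H(X) <= average Huffman length < H(X) + 1
print(f"H(X) = {entropy:.3f} bits, Huffman average length = {avg_len:.3f} bits")
```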
Assumptions:
▪ The records [X_1, X_2, ..., X_n] in the table are i.i.d.
▪ The only dependence is between fields
▪ Example: independence — P_{X_1 X_2 ... X_n}(x_1, x_2, ..., x_n) = ∏_i P_{X_i}(x_i)
▪ Expected compression rate: ∑_{i=1}^n H(X_i) per record
▪ Fine print: for each RV, need to save
  › The distribution and/or codebook (Huffman / arithmetic coder)
  › A dictionary (to translate back to the original values)
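Under the independence assumption, the per-record rate is just the sum of per-field empirical entropies. The sketch below estimates it for a toy table; the field names and values are hypothetical.

```python
import math
from collections import Counter

def empirical_entropy(column):
    """H(X) in bits, estimated from the empirical distribution of one field."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Toy table: one dict per record (hypothetical fields)
records = [
    {"age": 25, "device": "iPhone", "zip": "90210"},
    {"age": 49, "device": "iPad",   "zip": "94087"},
    {"age": 25, "device": "iPhone", "zip": "90210"},
]

fields = records[0].keys()
rate = sum(empirical_entropy([r[f] for r in records]) for f in fields)
print(f"Expected rate under the independence assumption: {rate:.2f} bits/record")
```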
Fancier models: Bayesian networks
▪ Bayes net
  › DAG with n nodes
  › Nodes are the RVs, edges model (conditional) independence
  › Example (4 nodes): P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2, x_3)
  [Figure: example DAG over X_1, X_2, X_3, X_4]
▪ Dense graph → more general
  › Compression rate: H(X_1) + H(X_2 | X_1) + H(X_3 | X_1) + H(X_4 | X_2, X_3)
▪ Usage for compression: compress according to the graph edges
  › Metadata: larger codebooks / distributions (conditional!)
▪ Not a new idea [e.g. Davies & Moore, KDD'99]
How to choose a Bayes net for compression?
▪ Another assumption:
  › Each node can only have a single parent
▪ DAG ⇒ tree
  [Figure: example tree over X_1, X_2, X_3, X_4]
▪ Simpler compression
  › Each field is conditioned on only a single RV
  › Example (tree): P(x_1, x_2, x_3, x_4) = P(x_1) P(x_2 | x_1) P(x_3 | x_1) P(x_4 | x_2)
  › Compression rate: H(X_root) + ∑_{(i,j)∈Edges} H(X_i | X_j) = ∑_{i=1}^n H(X_i) − ∑_{(i,j)∈Edges} I(X_i; X_j)
▪ Best tree?
Searching for the best tree
▪ Rate: H(X_root) + ∑_{(i,j)∈Edges} H(X_i | X_j) = ∑_{i=1}^n H(X_i) − ∑_{(i,j)∈Edges} I(X_i; X_j)
▪ Algorithm:
  › Calculate I(X_i; X_j) for all 1 ≤ i, j ≤ n — O(n^2) pairs
  › Set edge weights w_ij = −I(X_i; X_j)
  › Find a minimum spanning tree!
  › Efficient algorithms exist [Fredman & Tarjan, 1987]
  › Also minimizes the KL divergence w.r.t. the true distribution
▪ Known as a Chow-Liu tree [Chow & Liu, 1968]
  › Extensions exist [e.g. Williamson, 2000]
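A compact sketch of the construction under the stated assumptions: estimate pairwise mutual information from the columns, weight each pair by −I(X_i; X_j), and keep the minimum-weight spanning edges. Plain Kruskal with union-find stands in for the faster MST algorithms cited above; field names and values are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits, estimated from two aligned columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_edges(columns):
    """columns: {field name: list of values}. Returns spanning-tree edges (Kruskal, weights -I)."""
    names = list(columns)
    edges = sorted((-mutual_information(columns[a], columns[b]), a, b)
                   for a, b in combinations(names, 2))
    parent = {f: f for f in names}

    def find(f):                                  # union-find with path compression
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    tree = []
    for w, a, b in edges:                         # most negative weight = largest MI first
        ra, rb = find(a), find(b)
        if ra != rb:                              # keep the edge if it doesn't close a cycle
            parent[ra] = rb
            tree.append((a, b, -w))
    return tree

columns = {"age":    [25, 49, 25, 49],
           "device": ["iPhone", "iPad", "iPhone", "iPad"],
           "zip":    ["90210", "94087", "90210", "94087"]}
print(chow_liu_edges(columns))   # e.g. [('age', 'device', 1.0), ('age', 'zip', 1.0)]
```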
Example: MST with mutual-information weights
UserID  | Age | Location | Device   | Time | DocID
4324234 | 25  | 90210    | iPhone 7 | 9pm  | 33221
1223231 | 49  | 94087    | iPad Pro | 10am | 66543
…       | …   | …        | …        | …    | …
[Figure: resulting Chow-Liu tree over the fields UserID, Age, Location, Device, Time, DocID]
Chow-Liu compression in real life
▪ Compressing X_i given X_j: for each possible value x_j, store P_{X_i|X_j}(· | x_j)
▪ The dataset is not infinite — metadata takes space!
▪ Example:
  › 1B records, two variables with alphabet sizes 10k and 100k
  › → a conditional distribution with 1B values (10k × 100k), comparable to the dataset itself
  › → then maybe choosing these two is not the best idea…
[Figure: compressed output = entropy code + metadata]
Revised Chow-Liu tree
▪ Take the model size into account. Actual rate (per record):
  H(X_root) + ∑_{(i,j)∈Edges} H(X_i | X_j) + (1/#rows) ∑_{(i,j)∈Edges} Size(P_{X_i|X_j})
  = ∑_{i=1}^n H(X_i) − ∑_{(i,j)∈Edges} I(X_i; X_j) + (1/#rows) ∑_{(i,j)∈Edges} Size(P_{X_i|X_j})
▪ → Revised weights for the Chow-Liu tree: w_ij = −I(X_i; X_j) + (1/#rows) · Size(P_{X_i|X_j})
▪ Negative gain? → might opt to drop dependencies → forest
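The spanning step with the revised weights might look like the sketch below; `pairwise_mi` and `model_sizes` are assumed to be precomputed inputs (illustrative names), and any edge whose adjusted weight is non-negative is dropped, so the output may be a forest.

```python
from itertools import combinations

def revised_chow_liu_forest(fields, pairwise_mi, model_sizes, num_rows):
    """Kruskal's algorithm with adjusted weights w_ij = -I(X_i;X_j) + Size(P_{Xi|Xj}) / num_rows.

    pairwise_mi and model_sizes map frozenset({field_i, field_j}) to the estimated
    mutual information (bits/record) and the metadata size (bits), respectively.
    """
    edges = []
    for a, b in combinations(fields, 2):
        key = frozenset((a, b))
        w = -pairwise_mi[key] + model_sizes[key] / num_rows
        if w < 0:      # keep a dependency only if the MI gain outweighs the metadata cost
            edges.append((w, a, b))
    edges.sort()

    parent = {f: f for f in fields}

    def find(f):                                  # union-find with path compression
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    forest = []
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            forest.append((a, b, w))
    return forest
```

Fields that end up without a parent are simply coded from their marginal distributions.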
Example: MST with mutual-information weights vs. revised weights
[Figure: the Chow-Liu tree over the example fields (UserID, Age, Location, Device, Time, DocID) built with w_ij = −I(X_i; X_j), compared against the forest built with w_ij = −I(X_i; X_j) + (1/#rows) · Size(P_{X_i|X_j}); each choice yields a different split between entropy-coded data and metadata]
Storing the metadata
How to store the distribution P(X|Y)?
- Naïve: save the entire matrix
- Lossless compression: gzip / utilize sparsity
- Lossy compression!
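A tiny sketch of the "utilize sparsity" option: store only the observed (x, y) counts instead of a dense |X| × |Y| matrix; the example columns are made up.

```python
from collections import Counter, defaultdict

def sparse_conditional(x_column, y_column):
    """Represent P(X|Y) as {y: {x: count}}, keeping only observed (x, y) pairs."""
    table = defaultdict(Counter)
    for x, y in zip(x_column, y_column):
        table[y][x] += 1
    return {y: dict(counts) for y, counts in table.items()}

device = ["iPhone", "iPad", "iPhone", "iPhone"]
zipcode = ["90210", "94087", "90210", "94087"]
print(sparse_conditional(device, zipcode))
# {'90210': {'iPhone': 2}, '94087': {'iPad': 1, 'iPhone': 1}}
```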
Improvements: lossy model compression
▪ Compressing X given Y (the data itself is still compressed losslessly)
▪ True distribution: P_XY
▪ A lossy representation results in a distorted distribution Q_XY
▪ Code rate: H(X|Y) + D(P_{X|Y} || Q_{X|Y} | P_Y) + (1/#rows) · Size(Q_{X|Y})
▪ Want to minimize both model storage size and divergence!
  › Related to MDL
  › Can be used to modify the edge weights
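A small numeric sketch of this rate formula: given an empirical joint P_XY and a distorted conditional model Q_{X|Y}, compute H(X|Y) plus the conditional KL penalty and the amortized model size. The toy distributions and size numbers are made up.

```python
import numpy as np

def code_rate_bits(Pxy, Qx_given_y, model_size_bits, num_rows):
    """H(X|Y) + D(P_{X|Y} || Q_{X|Y} | P_Y) + model_size / num_rows, in bits per record."""
    Py = Pxy.sum(axis=0)                          # P_Y(y); columns are indexed by y
    Px_given_y = Pxy / Py                         # P_{X|Y}(x | y)
    h_x_given_y = -np.sum(Pxy * np.log2(Px_given_y))
    kl = np.sum(Pxy * np.log2(Px_given_y / Qx_given_y))   # conditional KL, averaged over P_Y
    return h_x_given_y + kl + model_size_bits / num_rows

Pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                      # toy joint distribution, rows = x, cols = y
Qx_given_y = np.array([[0.7, 0.3],
                       [0.3, 0.7]])               # distorted (lossy) model of P_{X|Y}
print(code_rate_bits(Pxy, Qx_given_y, model_size_bits=64, num_rows=1_000_000))
```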
Proposed approach:
▪ Add a virtual variable Z with a small alphabet, such that X–Z–Y is a Markov chain:
  P_XY(x, y) ≅ ∑_z Q_{X|Z}(x | z) Q_{Y|Z}(y | z) Q_Z(z)
▪ Storage size decreases from |X| · |Y| to (|X| + |Y|) · |Z|
▪ |Z| controls the tradeoff between the two objectives
▪ Finding {Q_{X|Z}(x | z), Q_{Y|Z}(y | z), Q_Z(z)}:
  › Iterate over the three terms, minimizing the KL divergence each time, and repeat until convergence
    • Not optimal! The optimization is hard
    • Similar in spirit to non-negative matrix factorization [Lee & Seung, NIPS 2001]
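As a rough illustration of the "similar in spirit to Lee & Seung" remark, the sketch below factorizes a joint matrix P_XY with KL-divergence NMF multiplicative updates and renormalizes the factors into Q_Z, Q_{X|Z}, Q_{Y|Z}. The toy matrix, |Z| = 2, and the iteration count are made-up values; this is an assumption-laden stand-in, not necessarily the exact optimization used in the talk.

```python
import numpy as np

def kl_nmf(P, z_dim, iters=200, eps=1e-12, seed=0):
    """Lee-Seung multiplicative updates minimizing the generalized KL divergence
    D(P || W @ H), with W of shape (|X|, |Z|) and H of shape (|Z|, |Y|)."""
    rng = np.random.default_rng(seed)
    nx, ny = P.shape
    W = rng.random((nx, z_dim))
    H = rng.random((z_dim, ny))
    for _ in range(iters):
        R = P / (W @ H + eps)                            # ratio matrix P / (W H)
        W *= (R @ H.T) / (H.sum(axis=1) + eps)           # multiplicative update for W
        R = P / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)  # multiplicative update for H
    return W, H

# Toy joint distribution P_XY (rows: x values, columns: y values) -- hypothetical numbers
P = np.array([[0.30, 0.10, 0.02],
              [0.10, 0.25, 0.03],
              [0.02, 0.03, 0.15]])
W, H = kl_nmf(P, z_dim=2)

# Renormalize the rank-|Z| factorization into Q_Z, Q_{X|Z}, Q_{Y|Z}
Qz = W.sum(axis=0) * H.sum(axis=1)                       # probability mass routed through each z
Qx_given_z = W / (W.sum(axis=0) + 1e-12)                 # each column sums to 1
Qy_given_z = H / (H.sum(axis=1, keepdims=True) + 1e-12)  # each row sums to 1

approx = (Qx_given_z * Qz) @ Qy_given_z     # ≅ P_XY, storing only (|X| + |Y|) * |Z| values
print(np.abs(approx - P).max())
```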
Example: Criteo dataset
▪ A Kaggle competition for click prediction by Criteo
▪ Dataset: 45M records
[Figure: pairwise mutual information between fields and the resulting Chow-Liu tree]
Example: Criteo dataset
▪ Variables 3 and 8 have large alphabets: 5,500 and 14k values (vs. 16M records) → can't afford to store the conditional distribution
▪ Results of NNMF: [figure omitted]
Experiments
▪ Datasets: machine learning, US census, etc.
  › #features: 10–68
  › #lines: 60K – 45M
▪ Current version:
  › MST with adjusted weights
  › Sparse encoding of metadata + lossless compression
Speed vs. compression efficiency
Summary
▪ Dataset compression via probabilistic assumptions
  › Bayes nets, Chow-Liu trees
  › Metadata encoding + weight modification
▪ Lossless compression via lossy model compression
  › Add a new RV with a Markov restriction
  › Balance metadata size vs. model inaccuracy
▪ Take-home message:
  › Choose the right metric
  › Revisit old ideas