
Page 1:

Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Authors: Dan Feldman, Mikhail Volkov, and Daniela Rus

Presentation: Brittany Norman

Page 2:

1. Introduction

Page 3:

What is a coreset?
● Used for dimensionality reduction
● Semantic compression of data sets into smaller sets
● Provably approximates the original data for a given problem
● Using merge-reduce, the smaller sets can then be used for ML:
  ○ in real time
  ○ in parallel (cloud, network)
  ○ on big streaming data
● In this paper, coresets are applied to large-scale sparse matrices and to low-rank approximation (reduced SVD)
● There are many other applications, including...

Page 4:

Other Applications
● Robotics: Sensor streams
● Clustering:
  ○ K-means/medians
  ○ Projective clustering
  ○ Trajectory clustering
  ○ K-segmentation of streams
● Principal Component Analysis (PCA)
● And many others

Page 5:

Previous Dimensionality Reduction
● Goal: project a set of d-dimensional vectors onto a subspace of dimension k ≤ d - 1
● Existing dimensionality reduction algorithms:
  ○ Principal Component Analysis (PCA)
  ○ Linear Regression (k = d - 1)
  ○ Low-rank Approximation (k-SVD)
  ○ Latent Dirichlet Allocation (LDA)
  ○ Non-negative Matrix Factorization (NNMF)
● Can then apply ML like k-means on the low-dimensional data
● But what about sparse data?

Page 6:

Sparse Data
● The authors claim this paper presents the first practical algorithm with provable guarantees that computes dimensionality reduction for sparse large-scale data
● Much large-scale, high-dimensional data is sparse:
  ○ Text streams
  ○ Image streams
  ○ Adjacency matrices: social networks

Page 7:

Coreset Definition
● Informal definition:
  ○ Given an original matrix A, a coreset C is defined as a weighted subset of the rows of A such that: the sum of squared distances from any k-dimensional subspace to the rows of A is approximately the same as the sum of squared weighted distances to the rows in C
● Formal definition:
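The formal definition on this slide was an equation image that did not survive transcription; a standard way to state the (ε, k)-coreset guarantee consistent with the informal description above is the following sketch (notation assumed, not copied from the paper):

```latex
% A \in R^{n \times d} with rows a_1,...,a_n; weights w \in R^n, w >= 0;
% C = the rows of A with w_i > 0. C (with weights w) is an (\varepsilon, k)-coreset
% for A if, for every k-dimensional subspace S of R^d,
\Bigl|\, \sum_{i=1}^{n} \operatorname{dist}^2(a_i, S)
      - \sum_{i=1}^{n} w_i \operatorname{dist}^2(a_i, S) \Bigr|
\;\le\; \varepsilon \sum_{i=1}^{n} \operatorname{dist}^2(a_i, S)
```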

Page 8:

Cardinality and Sparsity
● If we let all the weights equal one, w = (1,...,1), we get a trivial (0,k)-coreset, identical to the original A, with an error parameter of zero
● However, most of the weights in an efficient coreset will be zero
● The rows corresponding to the zero weights are discarded
● Hence, the cardinality of the coreset is the sparsity of w: |C| = ‖w‖₀ = |{i : wᵢ ≠ 0}|
● The smaller C is, the more efficient the computation
● If A is sparse, C is also sparse
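A minimal Python sketch of the cardinality/sparsity point above, assuming a weight vector w has already been produced by some coreset construction (the weights below are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

def extract_coreset(A, w):
    """Keep only the rows of A with nonzero weight.

    A: (n, d) sparse matrix, w: length-n nonnegative weight vector.
    Returns (C, w_C): the coreset rows and their weights.
    The coreset cardinality equals the sparsity of w, i.e. np.count_nonzero(w).
    """
    idx = np.flatnonzero(w)                 # rows with w_i > 0
    return A[idx], w[idx]

# Toy usage: a sparse 1000 x 50000 matrix and a sparse (hypothetical) weight vector.
A = csr_matrix((1000, 50000))
w = np.zeros(1000)
w[[3, 17, 256]] = [2.0, 0.5, 1.3]
C, w_C = extract_coreset(A, w)
print(C.shape, w_C)                         # (3, 50000) -- |C| = ||w||_0 = 3
```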

Page 9:

Does C exist?
● Can we have a coreset C that:
  ○ Has size independent of the input dimensions (n x d)?
  ○ Is a subset of the original input rows?
● Yes, the authors give a constructive proof:
● Notice that the size of C depends only on k and ε
● So the size is independent of the original dimensions (n x d)

Page 10:

1.1 Why Coresets?

Page 11:

Why coresets?
● Fast approximations:
  ○ Can use the coreset to efficiently compute the low-rank approximation (reduced SVD)
● Streaming and parallel computation (see the merge-and-reduce sketch after this list):
  ○ Use merge-and-reduce
  ○ Cloud, network, or GPU
  ○ One pass over the vectors
  ○ O(|C| log n) memory
● Applications to sparse data:
  ○ If s is the sparsity (defined by the authors as the maximum number of non-zero entries in a row of A)
  ○ Memory is O(|C|·s) words (real numbers), independent of the dimensions of A
● Interpretation:
  ○ Since the coreset is a few weighted rows, it tells us which records in the data are important
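A minimal sketch of the merge-and-reduce streaming pattern referenced above; coreset() is a hypothetical placeholder (here just a uniform subsample), and weight handling and error compounding are omitted for brevity:

```python
import numpy as np

def coreset(P, eps):
    # Placeholder for the paper's coreset construction (hypothetical):
    # we simply subsample rows so that the pattern below is runnable.
    m = min(len(P), max(1, int(np.ceil(1.0 / eps))))
    idx = np.random.choice(len(P), size=m, replace=False)
    return P[idx]

def streaming_coreset(stream, eps):
    """Merge-and-reduce: keep one small coreset per 'level' of a binary tree,
    so only O(log n) coresets are held in memory for n streamed rows."""
    levels = {}                                   # level -> coreset at that level
    for block in stream:                          # each block: (block_size, d) array
        c = coreset(block, eps)
        level = 0
        while level in levels:                    # merge two equal-level coresets,
            c = coreset(np.vstack([levels.pop(level), c]), eps)   # then reduce
            level += 1
        levels[level] = c
    return np.vstack(list(levels.values()))       # union of the remaining coresets

# Toy usage: stream 100 blocks of 1000 random 20-dimensional rows.
stream = (np.random.rand(1000, 20) for _ in range(100))
C = streaming_coreset(stream, eps=0.05)
print(C.shape)
```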

Page 12:

1.3 Related Work

Page 13:

Related work
● Prior to this paper, it was not clear that such a coreset exists
● Previous coresets were dependent on d, so useless for fat or square matrices
● Previous dimensionality reduction also couldn't deal with sparse matrices
● Existing software and implementations:
  ○ Tried to run modern reduced-SVD implementations
  ○ Gensim (uses random projections)
  ○ Matlab (uses the LAPACK library and the Power/Lanczos method)
  ○ All crashed for an input of a few thousand documents and k < 100
  ○ For k = 3, a Hadoop implementation didn't crash, but took several days (also used the Power Method)

Page 14:

Related work, part II
● Coresets:
  ○ After 10 years, a proof that an (ε, k)-coreset of size |C| = O(dk³/ε²) exists
  ○ Although efficient for tall matrices, this is still dependent on the original dimension d
● Sketches:
  ○ A sketch is a set of vectors such that the sum of squared distances from the original vectors to every k-dimensional subspace S can be approximated by the sum of squared distances from S to the sketch
  ○ However, unlike coresets, if the input vectors are sparse, the sketch vectors are not sparse
  ○ The first sketch for sparse matrices assumes the entire matrix fits in memory

Page 15:

Related work, part III
● Lower bounds:
  ○ A lower bound of Ω(k/ε) was proved for the cardinality of a sketch
  ○ A lower bound of Ω(k/ε²) was proved for a coreset when k = d - 1
● Lanczos Algorithm:
  ○ The Lanczos method and its variant, the Power Method:
  ○ Multiply the large matrix by a vector a few times to get the largest eigenvector
  ○ Then the computation is done recursively after projecting the matrix onto the hyperplane orthogonal to that eigenvector
  ○ Problem: the Lanczos method on large sparse data is only efficient for k = 1
  ○ The largest eigenvector of a sparse matrix is not always sparse
  ○ So, after projection, the resulting matrix is dense for all k > 1 computations (see the sketch after this list)
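A small NumPy sketch of the power-method-plus-deflation issue described above (a standard technique, not the paper's algorithm): after projecting out the first eigenvector, the matrix is generally dense even when the input was sparse.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def power_method(M, iters=100):
    """Top eigenvector of a symmetric matrix via repeated multiplication (the k = 1 case)."""
    v = np.random.rand(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

# A sparse symmetric matrix: only ~1% of its entries are nonzero.
A = sparse_random(300, 300, density=0.01, random_state=0).toarray()
M = (A + A.T) / 2
v1 = power_method(M)

# Deflation for k > 1: project M onto the hyperplane orthogonal to v1.
P = np.eye(300) - np.outer(v1, v1)   # v1 is dense, so the projector P is dense
M2 = P @ M @ P                       # the deflated matrix is now essentially fully dense
print(np.mean(M != 0), np.mean(M2 != 0))   # fraction of nonzeros before vs. after
```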

Page 16:

2. Novel Coreset Constructions

Page 17:

Novel coreset constructions
● All existing coreset constructions compute a sensitivity (i.e., leverage score, importance) for each point, and then compute a sample of the matrix A (see the sketch after this list)
● In this section, the authors suggest a gradient-descent type of deterministic construction of coresets:
  ○ By reducing coreset problems to ℓ2 Item Frequency Approximation
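For contrast, a rough sketch of the sensitivity-sampling pattern the slide refers to (not the authors' deterministic construction); the per-row score below is a commonly used bound for the 1-mean case and only stands in for problem-specific sensitivities:

```python
import numpy as np

def sensitivity_sample(A, m, rng=np.random.default_rng(0)):
    """Generic sensitivity sampling: score each row, sample rows in proportion
    to the scores, and reweight by inverse probability."""
    n = A.shape[0]
    mu = A.mean(axis=0)
    sq = np.sum((A - mu) ** 2, axis=1)
    s = 1.0 / n + sq / sq.sum()            # per-row sensitivity bound (1-mean case)
    p = s / s.sum()                        # sampling distribution
    idx = rng.choice(n, size=m, p=p)       # sample m rows with replacement
    w = 1.0 / (m * p[idx])                 # unbiased importance weights
    return A[idx], w

C, w = sensitivity_sample(np.random.rand(10000, 50), m=200)
print(C.shape, w[:5])
```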

Page 18:

Item Frequency Approximation (IFA)
● Say there is a universe of d items I = {e1, ..., ed}
● Say there is a stream of item appearances a1, ..., an ∈ I
● The frequency fi of an item ei is the fraction of the stream in which ei appears (i.e., its number of occurrences divided by n)
● It is easy to use O(d) space to get the item frequencies: keep a counter for each item
● The goal is, instead, to use O(1/ε) space
● The goal is to produce approximate frequencies f̂i such that |fi - f̂i| ≤ ε for every item (see the sketch after this list)
● In this paper, the authors generalize the IFA problem, since it was proven that traditional IFA can't be used to construct coresets
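One classical algorithm that meets this O(1/ε)-space guarantee is Misra-Gries (the "Frequent" algorithm); the paper does not prescribe this particular method, so the sketch below is purely illustrative:

```python
def misra_gries(stream, eps):
    """Approximate item frequencies with about 1/eps counters.

    Guarantee: 0 <= f_i - fhat_i <= eps, where f_i is the true frequency
    (occurrences / n) and fhat_i = counters.get(i, 0) / n.
    """
    k = max(1, int(1 / eps))                  # number of counters kept
    counters, n = {}, 0
    for item in stream:
        n += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:                                 # decrement every counter; drop zeros
            counters = {i: c - 1 for i, c in counters.items() if c > 1}
    return {i: c / n for i, c in counters.items()}, n

freqs, n = misra_gries(["a", "b", "a", "c", "a", "a", "b"], eps=0.5)
print(freqs)
```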

Page 19:

Generalizing the IFA problem
● Traditionally, IFA uses the ∞-norm, which is the magnitude of the maximum entry. In this context, we would have (a reconstruction of the guarantee is given after this slide):
● Here, the sum of the ai's divided by n is just the vector of frequencies fi:
  ○ The number of occurrences divided by n (the number of items in the stream)
● The approximate frequency is represented as the sum of wi's times ai's:
  ○ Because we want to approximate each entry by a weighted subset,
  ○ Where w is a non-negative sparse vector of weights
● Problem: it was proven that this approach can't be used to construct coresets…
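The guarantee on this slide was shown as an image; treating each stream item as a standard basis vector eᵢ ∈ ℝᵈ, the ℓ∞ version described above can plausibly be written as (a reconstruction, not quoted from the paper):

```latex
\left\| \frac{1}{n}\sum_{j=1}^{n} a_j \;-\; \sum_{j=1}^{n} w_j a_j \right\|_{\infty} \le \varepsilon,
\qquad a_j \in \{e_1,\dots,e_d\},\quad w \ge 0 \text{ sparse}
```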

Page 20:

ℓ2-Item Frequency Approximation (ℓ2-IFA)
● Solution: Generalize, allowing for different kinds of norms, like the ℓ2-norm
● This paper defines a new version of IFA called ℓ2-IFA, which uses the ℓ2-norm (i.e., the Euclidean norm; for matrices, the Frobenius norm):
● Here, the equation is almost the same as on the previous slide, except that we are using a different kind of norm (a reconstruction is given after this slide)
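Again, the equation was an image; under the same reading as the previous slide, the ℓ2-IFA guarantee would be (a reconstruction):

```latex
\left\| \frac{1}{n}\sum_{j=1}^{n} a_j \;-\; \sum_{j=1}^{n} w_j a_j \right\|_{2} \le \varepsilon
```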

Page 21:

2.1 Warm-up: reduction to 1-mean ε-coresets

Page 22:

Reduction to 1-mean ε-coresets
● The authors' overall goal is to show that the construction of coresets can be reduced to the ℓ2-IFA problem
● In this section, they show a simpler version of the reduction for k = 1 (see the sketch after this list):
  ○ This is the base step for their proof
  ○ Since k = 1, the k-subspaces (the S's) are just vectors
  ○ These vectors are called 'c' for centers, since the distance between the ai's and each center should be about the same as the distance between the weighted ai's and that center
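A small numeric sketch of what the 1-mean coreset property asserts, using a uniform sample with weights n/m as a stand-in for the actual construction (so the approximation here is only heuristic):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 10))

# Stand-in "coreset": a uniform sample with weights n/m (not the paper's construction).
m = 500
idx = rng.choice(len(A), size=m, replace=False)
C, w = A[idx], np.full(m, len(A) / m)

# The 1-mean coreset property: for any center c, the weighted cost on C
# should approximate the cost on all of A.
for _ in range(3):
    c = rng.standard_normal(10)
    cost_A = np.sum(np.sum((A - c) ** 2, axis=1))
    cost_C = np.sum(w * np.sum((C - c) ** 2, axis=1))
    print(abs(cost_A - cost_C) / cost_A)      # relative error, typically a few percent
```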

Page 23:

3. Reduction from Coreset to ℓ2-Item Frequency Approximation

Page 24:

Reduction from Coreset to ℓ2-IFA
● In this section, the authors show that the coreset construction can be reduced to the problem of ℓ2 Item Frequency Approximation
● They show this for all subspaces of dimension k ≥ 1

Page 25:

Review: Singular Value Decomposition (SVD)

Page 26:

Review of SVD

● U: Left singular vectors
● Σ: Singular values
  ○ Diagonal matrix, entries sorted in decreasing order
● V: Right singular vectors
● r: rank of the matrix A
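The factorization reviewed on this slide (shown as an image) is the standard thin SVD:

```latex
A = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{n \times r},\;
\Sigma = \operatorname{diag}(\sigma_1 \ge \cdots \ge \sigma_r > 0),\;
V \in \mathbb{R}^{d \times r}
```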

Page 27:

Dimensionality Reduction with SVD
● Set the smallest singular values in Σ to zero
  ○ This will zero out some columns of U and rows of Vᵀ
  ○ Gives a low-rank approximation of the original matrix A

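A minimal NumPy sketch of this rank-k truncation (standard usage, not specific to the paper):

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k largest singular values; the rest are set to zero.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(100, 40)
A_k = truncated_svd(A, k=5)
print(np.linalg.norm(A - A_k, "fro"))       # approximation error
```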

Page 28:

SVD Pros and Cons
Pros:
● Optimal low-rank approximation
  ○ Lowest Frobenius-norm error among all rank-k approximations
Cons:
● Not very interpretable:
  ○ Each singular vector is a linear combination of input columns or rows
● Lack of sparsity:
  ○ Singular vectors are dense

Page 29:

Note:

● The authors of this paper use D to stand for the diagonal matrix of singular values

● They do this instead of using the Σ symbol

Page 30:

3.1 Implementation

Page 31:

Notation
A: The original input matrix
A1, ..., An: The input points (i.e., input vectors, the rows of A)
n: Number of rows in A
d: Number of columns in A
k: The approximation rank, i.e., the desired reduced dimensionality
ε: Error parameter, the nominal approximation error
w: The output, a vector of weights
UDVᵀ: The SVD of A
Xi,j: The entry in the ith row and jth column
j:k: A range of indices from j to k (where j < k)
i,: : The ith row of a matrix
:,j : The jth column of a matrix
xi: The ith entry of a vector x
‖X‖F: The Frobenius norm, ‖X‖F = (Σi,j Xi,j²)^(1/2)
I: The identity matrix

Page 32:

Page 33:

4. Conclusion

Page 34:

Conclusion“We present a new approach for dimensionality reduction using coresets. The key feature of our algorithm is that it computes coresets that are small in size and subsets of the original data. Using synthetic data as ground truth we show that our algorithm provides a good approximation.”

Page 35:

My thoughts
● Comments:
  ○ Pro: Interpretability: the result is a weighted subset of the original rows
  ○ Con: Depends on SVD as an initial step
  ○ The authors claim that distributing across M machines will reduce the construction time of C by a factor of M; this is not exactly true
● Questions:
  ○ The authors mention "merge-reduce" instead of "map-reduce"; a search for this term was uninformative (the only result was a single Microsoft paper on physical design refinement). Do they just mean map-reduce? Or map-reduce-merge? Or something like combiners?
  ○ The authors don't say much about how to extend this to a distributed/parallel setting; they mention that it is "embarrassingly parallel," but don't elaborate