
Wasserstein Measure Coresets

Sebastian Claici 1 Aude Genevay 1 Justin Solomon 1

Abstract

The proliferation of large data sets and Bayesian inference techniques motivates demand for better data sparsification. Coresets provide a principled way of summarizing a large dataset via a smaller one that is guaranteed to match the performance of the full data set on specific problems. Classical coresets, however, neglect the underlying data distribution, which is often continuous. We address this oversight by introducing Wasserstein measure coresets, an extension of coresets which by definition takes into account generalization. Our formulation of the problem, which essentially consists in minimizing the Wasserstein distance, is solvable via stochastic gradient descent. This yields an algorithm which simply requires sample access to the data distribution and is able to handle large data streams in an online manner. We validate our construction for inference and clustering.

1. Introduction

How do we deal with too much data? Despite the common wisdom that more data is better, algorithms whose complexity scales with the size of the dataset are still routinely used in many areas of machine learning. While large datasets capture high frequency differences between data points, many algorithms only need a handful of representative samples that summarize the dataset.

Formalizing a notion of representative requires care, however, since a representative sample for a clustering algorithm may differ from that for a classification algorithm. The notion of a data coreset was introduced to specify precisely a notion of data summarization that is task dependent. Originally proposed for computational geometry, coresets have found their way into the learning literature for tasks including clustering (Bachem et al., 2018b), classification (Tsang et al., 2005), neural network compression (Baykal et al., 2018), and Bayesian inference (Huggins et al., 2016; Campbell & Broderick, 2019).

1 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA. Correspondence to: Sebastian Claici <[email protected]>.

Figure 1: Coresets with 50 points for a Gaussian (top left) and the pushforward of a Gaussian through f : (x, y) ↦ (x, x^2 + y). Top right is the image of the Gaussian coreset through f, bottom left is computed directly on the pushforward. A random sample is plotted bottom right.

Coreset construction is typically posed as a discrete optimization problem: Given a fixed dataset and learning algorithm, how can we construct a smaller dataset on which that algorithm achieves similar performance? This approach, however, ignores a key theme in machine learning. A dataset is an empirical sample from an underlying data distribution, and learning problems typically seek to minimize an expected loss against the distribution, not the dataset. The effectiveness of a coreset should thus be measured against the distribution, and not the sample. In other words, the coreset should be designed to guarantee good generalization.

To address this oversight, we introduce measure coresets, which approximate the dataset by either a parametric continuous measure or a finitely supported one with a smaller number of points. Our formulation extends coreset language to smooth data distributions and recovers the original formulation when the distribution is supported on finitely many points. We specifically focus on Wasserstein measure coresets, which hinge on a natural connection between coreset language and optimal transport theory.

Contributions. We generalize the definition of a coreset to take into account the underlying data distribution, producing a measure coreset, with strong generalization guarantees for a variety of learning problems. Our formulation reveals an elegant connection to optimal transport, allowing us to leverage relevant theoretical results to obtain generalization error bounds for our coresets as well as stability under Lipschitz transformations. From a computational perspective, we provide stochastic algorithms for extracting measure coresets, yielding methods that are well-adapted to cases involving incoming streams of data. This allows us to construct coresets in an online manner, without having to store the whole dataset in memory. Moreover, in contrast to existing methods, which are specific to a given learning problem, our formulation is robust enough that a given coreset can be used for different tasks.

arXiv:1805.07412v2 [stat.ML] 2 Mar 2020

1.1. Related work

We join the probabilistic language of optimal transport with the discrete setting of data compression via coresets.

Coresets. Initially introduced in computational geometry (Agarwal et al., 2005), coresets have found their way to machine learning research via importance sampling (Langberg & Schulman, 2010). Coreset applications are varied, and generic frameworks exist for their construction (Feldman & Langberg, 2011). Among the relevant recent applications are k-means and k-median clustering (Har-Peled & Mazumdar, 2004; Arthur & Vassilvitskii, 2007; Feldman et al., 2013; Bachem et al., 2018b), Bayesian inference (Campbell & Broderick, 2018; Huggins et al., 2016), support vector machine training (Tsang et al., 2005), and neural network compression (Baykal et al., 2018).

While coresets are discrete, a sensitivity-based approach to importance sampling coresets was introduced in a continuous setting for approximating expectations under measures that are absolutely continuous w.r.t. the Lebesgue measure (Langberg & Schulman, 2010). For more information, see (Bachem et al., 2018b; Munteanu & Schwiegelshohn, 2018).

Another line of work closer to ours uses the theory of Reproducing Kernel Hilbert Spaces (RKHS) to design coresets, in particular kernel herding (Chen et al., 2010; Lacoste-Julien et al., 2015) and Stein points (Chen et al., 2018). These methods also take into account the underlying distribution of the data, but both require knowledge of that distribution (e.g., the density up to a normalizing constant) while our approach simply assumes sample access.

Optimal transport (OT). The connection between optimal transport and quantization can be traced back to Pollard (1982), who studied asymptotic properties of k-means in the language of OT. More recently, Cuturi & Doucet (2014) proposed a more efficient version of transport-based quantization using entropy-regularized transport. Entropy-regularized transport (Cuturi, 2013a) is a computationally efficient formulation of OT, which led to a wide range of machine learning applications; see recent surveys (Solomon, 2018; Peyré & Cuturi, 2018) for details. Recent results characterize its statistical behavior (Genevay et al., 2019) and its ability to handle noisy datasets (Rigollet & Weed, 2018), which we can leverage to design robust coresets.

Our coreset construction algorithms are inspired by semi-discrete methods that compute transport from a continuous measure to a discrete one using power diagrams (Aurenhammer, 1987). Efficient algorithms that use computational geometry tools to perform gradient iterations to solve the Kantorovich dual problem have been introduced for 2D (Mérigot, 2011) and 3D (Lévy, 2015). Closer to our method are the algorithms by De Goes et al. (2012) and Claici et al. (2018), which solve a non-convex problem for the support of a discrete uniform measure that minimizes transport cost to an input image (De Goes et al., 2012) or the barycenter of the input distributions (Claici et al., 2018). Stochastic approaches for semi-discrete transport, both standard and regularized, were tackled by Genevay et al. (2016).

Notation. In what follows, we consider a compact metric space 𝒳 ⊆ R^d endowed with the Euclidean norm on R^d, denoted by ‖·‖. For a random variable X and a probability distribution µ on 𝒳, we write X ∼ µ to denote that X has distribution µ. The notation E_µ(X) is the expectation of the random variable X when X ∼ µ. We denote by f♯µ the pushforward of a measure µ by f. We recall that by definition,

$$\int_{\mathcal{X}} x \,\mathrm{d}(f_\sharp \mu) = \int_{\mathcal{X}} f(x) \,\mathrm{d}\mu.$$

2. Coresets: From Discrete to Continuous

2.1. Discrete coresets

A coreset is a small summary of a data set. Small usually refers to the number of points in the coreset, which one hopes is much smaller than the data set size, but one can also think of this in terms of the number of bits required to store the coreset. The summary is often a weighted subset of the data, but can also refer to points that are not in the initial dataset but rather represent the original points well.

To make these notions more precise, we must define a coreset in terms of both the dataset and the cost function that the coreset is meant to perform well against. We can understand the definition as a learning problem, where our goal is to approximate the performance of a learning algorithm on a dataset X by its performance on the coreset C.

Let F be the hypothesis set for a learning problem. Every function f ∈ F maps from 𝒳 to R. Let µ_X be a weighting function on the points in X (this is typically uniform), and define the cost of f on (X, µ_X) as

$$\mathrm{cost}(X, \mu_X, f) = \sum_{x \in X} \mu_X(x) f(x). \tag{1}$$

A coreset is then defined by a set C and a weight function µ_C in such a way that cost(C, µ_C, f) is close to cost(X, µ_X, f). This leads to the following classical definition of a coreset (Bachem et al., 2017):

Definition 1 (Strong/weak ε-coreset). The pair (C, µ_C) is a strong ε-coreset for the function family F if C ⊆ X and

$$|\mathrm{cost}(X, \mu_X, f) - \mathrm{cost}(C, \mu_C, f)| \le \varepsilon \cdot \mathrm{cost}(X, \mu_X, f)$$

for all f ∈ F. If we require that the inequality only holds at f* = arg min_{f∈F} cost(X, µ_X, f), then we call (C, µ_C) a weak ε-coreset.

A coreset always exists for a dataset (X, µ_X) and family F, as the original dataset (X, µ_X) satisfies Definition 1.

What distinguishes coresets from other notions of data sparsification is their dependence on the learning problem. For instance, there exist coresets for clustering (Bachem et al., 2018a;b), Bayesian inference (Campbell & Broderick, 2019), and classification (Baykal et al., 2017).

Example (k-means). The cost of a particular choice Q of k centers is given by ∑_{x∈X} min_{q∈Q} ‖x − q‖^2. To translate this into the language of Definition 1, we take f_Q(x) = min_{q∈Q} ‖x − q‖^2 and µ_X(x) = 1 for all x ∈ X. The function family F is thus parameterized by the set of all possible choices of the center set Q, and we wish to construct a coreset that performs well against all such choices (in the case of a strong coreset) or against the optimal k-means assignment (in the case of a weak coreset).
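The k-means example can be sketched in a few lines of Python. This is a minimal illustration, not part of the original construction: the function names `kmeans_cost` and `is_eps_coreset` are hypothetical, and the check only tests the inequality of Definition 1 on a finite list of candidate center sets, which is necessary but not sufficient for a strong coreset.

```python
def kmeans_cost(points, weights, centers):
    """cost(X, mu_X, f_Q) = sum_x mu_X(x) * min_q ||x - q||^2, as in (1)."""
    total = 0.0
    for x, w in zip(points, weights):
        # f_Q(x): squared distance from x to its nearest center in Q.
        total += w * min(sum((a - b) ** 2 for a, b in zip(x, q)) for q in centers)
    return total


def is_eps_coreset(X, wX, C, wC, center_sets, eps):
    """Test the inequality of Definition 1 on each candidate center set Q.

    Assumption: `center_sets` is a finite sample of the family F, so a True
    result does not prove the strong coreset property for all Q.
    """
    for Q in center_sets:
        full = kmeans_cost(X, wX, Q)
        small = kmeans_cost(C, wC, Q)
        if abs(full - small) > eps * full:
            return False
    return True
```

For instance, a dataset made of duplicated points is summarized exactly by the distinct points with doubled weights, so the check passes even with ε = 0.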

2.2. Measure coresets

So far we have used discrete language to describe coresets, but this belies the intent of coresets for learning problems. Typical learning problems are posed as minimizations in a hypothesis class of an expectation over a data distribution µ. The standard coreset definition is incompatible with this setting as it relies on the existence of a finite data set. To circumvent this issue, we define a measure coreset as a measure ν that produces similar results under F as µ:

Definition 2 (Measure Coreset). We call ν a strong ε-measure coreset for µ if for all f ∈ F

$$|\mathbb{E}_\mu[f(X)] - \mathbb{E}_\nu[f(X)]| \le \varepsilon. \tag{2}$$

In analogy to the discrete case, a weak ε-measure coreset is one for which the inequality holds at f* = arg min_{f∈F} E_µ[f(X)]. As in the case of discrete coresets, such a ν always exists, as ν = µ satisfies the inequality.

Beyond the change to measure theoretic language, our definition differs from the typical coreset one in two ways. (1) The coreset ν can be an absolutely continuous measure, which means the size of the coreset can no longer be measured simply in the number of points. (2) We use absolute error instead of relative error; this connects our notion of coreset with generalization error in learning problems in that we can see the coreset as observed data and the full measure as out of sample data. Absolute instead of relative error is uncommon in coreset language, but not unheard of; see (Reddi et al., 2015; Bachem et al., 2018a) for examples.

Under which constraints on ν, µ and F can we construct a measure coreset? We will show a connection to optimal transport and a resulting construction algorithm that aims at minimizing a Wasserstein distance between the coreset ν and the target measure µ. Using optimal transport duality, we can qualify which learning problems admit measure coresets and the guarantees we can hope to achieve.

3. Sufficient Conditions for Coreset Approximation

The link between our measure coreset formulation and the theory of optimal transport uses the notion of integral probability metrics (Müller, 1997):

Definition 3 (Integral Probability Metric). Consider a class F of functions from 𝒳 to R. The integral probability metric d_F between two measures µ and ν is defined by

$$d_F(\mu, \nu) = \sup_{f \in F} |\mathbb{E}_\mu[f(X)] - \mathbb{E}_\nu[f(X)]|. \tag{3}$$

Under mild assumptions on the set of functions F, d_F defines a distance on the space of probability measures. We mention the following examples:

• 1-Wasserstein distance: F = {f | ‖∇f‖ ≤ 1}, the space of 1-Lipschitz functions.

• Dual-Sobolev distance: F = {f | ‖f‖_{H^1(µ)} ≤ 1}, where H^1 is the Sobolev space {f ∈ L^2 | ∂_{x_i} f ∈ L^2}.

• Maximum Mean Discrepancy (MMD) (Gretton et al., 2008): F = {f | ‖f‖_H ≤ 1}, where H is a universal Reproducing Kernel Hilbert Space (RKHS).

The examples above allow us to derive a coreset condition for each of these function classes based on the Wasserstein distance or the MMD, explored in detail below.
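To make Definition 3 concrete, the following sketch evaluates the supremum in (3) over a finite list of test functions for two empirical measures. The helper name `ipm_finite` is hypothetical. Because the supremum is restricted to a finite subfamily, the result is only a lower bound on d_F for any class F that contains the listed functions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def ipm_finite(functions, xs, ys):
    """max_f |E_mu f(X) - E_nu f(X)| for uniform empirical measures on xs and ys."""
    return max(
        abs(mean([f(x) for x in xs]) - mean([f(y) for y in ys]))
        for f in functions
    )


# Two 1-Lipschitz test functions on the real line.
tests = [lambda x: x, lambda x: abs(x)]
# mu is uniform on {-1, 1}; nu is a point mass at 0.
lower_bound = ipm_finite(tests, [-1.0, 1.0], [0.0, 0.0])
```

Both test functions are 1-Lipschitz, so `lower_bound` is a lower bound on the 1-Wasserstein distance between the two measures in this toy case.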

Wasserstein distances. The p-Wasserstein distance between distributions µ and ν is given by the solution of a minimization problem:

$$W_p^p(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|^p \,\mathrm{d}\pi(x, y), \tag{4}$$

where Π(µ, ν) = {π ∈ P(𝒳 × 𝒳) | π(dx × 𝒳) = µ(dx), π(𝒳 × dy) = ν(dy)} is the set of couplings with marginals µ and ν.
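As a small worked instance of (4), consider the special case of two uniform empirical measures on the real line with the same number of points. There, an optimal coupling matches the sorted samples in order, so W_p has a closed form. The sketch below relies on that one-dimensional fact and does not apply in higher dimensions; the function name `wasserstein_1d` is an assumption made for illustration.

```python
def wasserstein_1d(xs, ys, p=1):
    """W_p between uniform empirical measures on xs and ys (1D, equal sizes).

    In one dimension, for p >= 1, the monotone (sorted) matching is optimal.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes the same number of points")
    total = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)


w1 = wasserstein_1d([0.0, 0.0], [0.0, 2.0], p=1)  # mean displacement
w2 = wasserstein_1d([0.0, 0.0], [0.0, 2.0], p=2)  # root mean squared displacement
```

In this toy example `w1` does not exceed `w2`, in line with the general ordering W_1 ≤ W_2 of Wasserstein distances.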

When p = 1, W_1(µ, ν) can be rewritten via duality as a maximization problem over the set of 1-Lipschitz functions (Santambrogio, 2015, §3.1). In particular, for F = Lip_1(𝒳),

$$d_F(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1} \int_{\mathcal{X}} f \,\mathrm{d}(\mu - \nu) = W_1(\mu, \nu).$$

When p = 2, W_2(µ, ν) upper-bounds the dual Sobolev norm of (µ − ν) if µ and ν have densities w.r.t. the Lebesgue measure that are bounded above by some constant M. In particular, for any C^1 function f, define a semi-norm by

$$\|f\|_{H^1(\mu)} = \left( \int_{\mathcal{X}} |\nabla f(x)|^2 \,\mathrm{d}\mu(x) \right)^{1/2}.$$

This semi-norm allows us to define a dual Sobolev norm on measures as

$$\|\nu\|_{H^{-1}(\mu)} = \sup_{\|f\|_{H^1(\mu)} \le 1} \int_{\mathcal{X}} f(x) \,\mathrm{d}\nu(x).$$

Using (Peyre, 2018, Equation (17)), we obtain that for F = {f | ‖f‖_{H^1(µ)} ≤ 1}:

$$d_F(\mu, \nu) = \|\mu - \nu\|_{H^{-1}(\mu)} \le \sqrt{M}\, W_2(\mu, \nu),$$

where M is the uniform bound on the densities of µ and ν.

Maximum mean discrepancy. When F is the unit ball of an RKHS, equation (3) defines a distance function known as the maximum mean discrepancy (Gretton et al., 2008). If κ(·, ·) is the reproducing kernel of the RKHS, we can rewrite the square of (3) in terms of expectations of kernel evaluations:

$$\mathrm{MMD}^2(\mu, \nu) = \mathbb{E}_{\mu \otimes \mu}[\kappa(X, X')] + \mathbb{E}_{\nu \otimes \nu}[\kappa(Y, Y')] - 2\,\mathbb{E}_{\mu \otimes \nu}[\kappa(X, Y)]. \tag{5}$$
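The kernel expression on the right-hand side of (5), which is the squared RKHS distance between the mean embeddings of µ and ν, can be estimated by replacing each expectation with an average over samples. The sketch below is a plug-in (biased) estimate for scalar samples. The Gaussian kernel, its bandwidth and the function names are assumptions chosen for illustration, not choices made in the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """kappa(x, y) = exp(-(x - y)^2 / (2 * bandwidth^2)) for scalars x and y."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Plug-in estimate of the three kernel expectations in (5)."""
    kxx = sum(kernel(a, b) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(kernel(a, b) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy
```

For two point masses at 0 and 1 the estimate reduces to 2 − 2κ(0, 1), and it vanishes when both sample sets coincide.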

While our focus is on coresets under the Wasserstein distance, we mention that coresets that minimize the MMD have been constructed for kernel density estimation (Phillips & Tai, 2018). Generic construction algorithms for sampling to minimize MMD to a known fixed measure, known as kernel herding, have been given by Chen et al. (2010) and Lacoste-Julien et al. (2015).

Coreset condition. Using the properties of IPMs above, we summarize conditions for ν to be an ε-coreset for µ based on conditions on F.

Proposition 1. The measure ν is an ε-coreset for µ with function family F if:

(i) W_1(µ, ν) ≤ ε for F ⊆ Lip_1;

(ii) W_2(µ, ν) ≤ ε/√M for F ⊆ H^1(µ), when µ and ν have densities with respect to the Lebesgue measure that are bounded above by M; or

(iii) MMD(µ, ν) ≤ ε for F ⊆ H.

We can extend the first two conditions to Lip_K and ‖f‖_{H^1(µ)} ≤ K by rescaling f by its Lipschitz or Sobolev constant, which changes the bounds by a multiplicative factor K. In the remainder of this paper, we will focus on coresets based on Wasserstein distances and will call them measure coresets for simplicity. When more precision is required, we will denote by W_1 (resp. W_2, MMD) measure coreset a coreset with function family Lip_1 (resp. H^1(µ), H).

4. Practical Wasserstein Coreset Constructions

While §3 gives a metric for measuring how close a distribution ν is to satisfying the coreset condition for a distribution µ, the question of how to compute such a ν remains.

In our definition, ν was unconstrained, but for it to be a useful coreset for a measure, we should be able to describe it using fewer bits than needed to describe the full measure µ. From a practical point of view, we should also be able to compute expectations under the coreset ν and at least approximate expectations under µ.

We make a few simplifications. We assume that we can sample from µ efficiently and that µ is supported on a compact set 𝒳 ⊂ R^d. This is true of any finite dataset. The simplest notion of a measure coreset is a uniform distribution over a finite point set x_1, …, x_n. This leads to the following optimization problem, which will be our focus in this section:

$$\min_{(x_1, \dots, x_n)} W_p\left( \frac{1}{n} \sum_{i=1}^n \delta_{x_i}, \mu \right). \tag{P}$$

It is also possible to formulate the problem using a continuous parametric density as a coreset. Given a family of parametric densities (p_θ)_{θ∈Θ} (e.g., Gaussian), we want to find the parametric distribution p_{θ*} that best approximates a measure µ. This can be written simply as

$$\min_{\theta \in \Theta} W_p(p_\theta, \mu). \tag{6}$$

We experimented with this option using Gaussian mixtures, but the minimization is highly non-convex, and gradient descent algorithms do not converge except in restricted settings (e.g., mixtures with equal weights). We find the simpler problem (P) sufficient for the applications we consider and leave computation of more general coresets to future research.


4.1. Properties of empirical coresets

We address the problem of estimating the number of points n in a coreset, given ε, for an arbitrary continuous measure µ. Namely, we ask how many samples n we need such that W_p(µ, ν) ≤ ε when ν = (1/n) ∑_{i=1}^n δ_{x_i}.

Statistical bounds. There exist several theorems for finite sample rates of W_p, which each focus on specific hypotheses to marginally improve rates. We give a general statement:

Theorem 1 (Metric convergence, Kloeckner 2012; Brancolini et al. 2009; Weed & Bach 2017). Suppose µ is a compactly supported measure in R^d and ν_n is a uniform measure supported on n points drawn from µ. Then W_p(ν_n, µ) ∼ Θ(n^{−1/d}). Moreover, if µ has Hausdorff dimension s < d, then W_p(ν_n, µ) ∼ Θ(n^{−1/s}).

Thus, both W_1 and W_2 have finite sample rate O(n^{−1/d}). If we assume that µ is supported on a lower dimensional manifold of dimension s, we get the improved rate O(n^{−1/s}).

Corollary 1. If ν = (1/n) ∑_{i=1}^n δ_{x_i*} with n = Θ(ε^{−s}) is a globally optimal solution for (P), then ν is an ε-measure coreset.

While we cannot guarantee this bound in practice since finding a global optimum is NP-hard (Claici et al., 2018), empirically we observe that it holds and in fact is an overestimate of coreset size. Note that the theoretically required coreset size is independent of additional variables in the underlying problem, e.g., the number of means in k-means.

This bound improves over the best known deterministic coreset size for k-means and k-median of O(kε^{−d} log n) (Har-Peled & Mazumdar, 2004), but we must be careful as our coreset bounds are given in absolute error. For k-means and k-medians, we are typically in the regime where the full data set has large cost (1), but if that does not hold, the coresets are no longer comparable.

Better randomized construction algorithms exist for both k-means/k-median and SVM with sizes that do not have such a strong dependence on dimension. Empirically, our coresets are competitive, and often better than specialized construction algorithms, especially in the small data regime (see Figures 2, 3 and 4).

One useful property of W_p coresets is that given an ε-coreset for a reference measure µ, we immediately have an Lε-coreset for the pushforward measure f♯µ, where L is the Lipschitz constant of f.

Proposition 2 (Coreset of pushforward measure). Consider an L-Lipschitz function f. If {x_i*}_{i=1}^n is an ε-measure coreset under W_p for µ, then {f(x_i*)}_{i=1}^n is an Lε-measure coreset under W_p for f♯µ.

Proof. Since f is L-Lipschitz, ‖f(x) − f(y)‖^p ≤ L^p ‖x − y‖^p for all x, y ∈ 𝒳. Thus, for all π ∈ Π((1/n) ∑_{i=1}^n δ_{x_i*}, µ),

$$\int_{\mathcal{X}} \sum_{i=1}^n \|f(x_i^*) - f(x)\|^p \,\mathrm{d}\pi(x_i^*, x) \le L^p \int_{\mathcal{X}} \sum_{i=1}^n \|x_i^* - x\|^p \,\mathrm{d}\pi(x_i^*, x).$$

Minimizing over π on the right-hand side and using the definition of a pushforward measure on the left gives

$$W_p^p\left( \frac{1}{n} \sum_{i=1}^n \delta_{f(x_i^*)}, f_\sharp \mu \right) \le L^p\, W_p^p\left( \frac{1}{n} \sum_{i=1}^n \delta_{x_i^*}, \mu \right).$$

Since {x_i*}_{i=1}^n is a W_p ε-measure coreset for µ, we have W_p((1/n) ∑_{i=1}^n δ_{x_i*}, µ) ≤ ε, yielding the desired bound.

Pushforward measures are ubiquitous in (deep) generative models, which have gained popularity for image generation through GANs (Goodfellow et al., 2014) and VAEs (Kingma & Welling, 2014). Specifically, new data is generated by pushing uniform or Gaussian noise through a neural network f (Genevay et al., 2018). The above proposition suggests that if the pushforward function f is Lipschitz, constructing a coreset for the source noise and pushing it through f is sufficient to find a ‘good enough’ coreset for the generative model without additional computations. This robustness property is illustrated by Figure 1, where the banana-shaped distribution is the pushforward of a normalized Gaussian N through f : (x, y) ↦ (x, x^2 + y). Even though the coreset obtained as the image of the coreset of the Gaussian through f performs slightly worse than the coreset computed directly on f♯N, it represents the distribution in a more faithful way than a random sample.
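The pushforward construction of Proposition 2 amounts to applying f to every coreset point. The sketch below does this for the map f(x, y) = (x, x^2 + y) from Figure 1 and, as a rough diagnostic, computes the largest ratio ‖f(a) − f(b)‖ / ‖a − b‖ over pairs of coreset points. The helper names are hypothetical, and the ratio is only a lower bound on the Lipschitz constant L, which for this map is finite only on a bounded domain.

```python
import math
from itertools import combinations


def banana(point):
    """The map f(x, y) = (x, x^2 + y) used in Figure 1."""
    x, y = point
    return (x, x * x + y)


def push_forward(points, f):
    """Image of a coreset under f: the support of the pushed-forward coreset."""
    return [f(p) for p in points]


def lipschitz_lower_bound(points, f):
    """Largest observed ratio ||f(a) - f(b)|| / ||a - b|| over distinct pairs."""
    return max(
        math.dist(f(a), f(b)) / math.dist(a, b)
        for a, b in combinations(points, 2)
        if a != b
    )


coreset = [(0.0, 0.0), (1.0, 0.0), (-2.0, 0.0)]
image = push_forward(coreset, banana)
```

Here `image` would play the role of the top-right panel of Figure 1, with `coreset` standing in for a coreset of the source Gaussian.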

We also have the following relationship between being a W_2 coreset and being a W_1 coreset:

Remark 1. Let {x_i*}_{i=1}^n minimize W_2((1/n) ∑_{i=1}^n δ_{x_i*}, µ). Using the inequality between W_p metrics,

$$W_1\left( \frac{1}{n} \sum_{i=1}^n \delta_{x_i^*}, \mu \right) \le W_2\left( \frac{1}{n} \sum_{i=1}^n \delta_{x_i^*}, \mu \right).$$

Thus, if we choose n large enough such that W_2((1/n) ∑_{i=1}^n δ_{x_i*}, µ) ≤ ε, then (1/n) ∑_{i=1}^n δ_{x_i*} is also a W_1 ε-measure coreset for µ.


4.2. Entropy-regularized Wasserstein distances

The entropy-regularized Wasserstein distance is a popular approximation of the Wasserstein distance, as it is computable with faster algorithms (Cuturi, 2013b). The entropically regularized p-Wasserstein distance is

$$W_{p,\eta}^p(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} \|x - y\|^p \,\mathrm{d}\pi(x, y) + \eta\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu). \tag{7}$$

As the KL term is nonnegative, W_{p,η}^p upper-bounds W_p^p for all p, and thus any coreset under W_{1,η} and W_{2,η} is also a coreset under W_1 and W_2. Due to the entropic term, however, we have W_{p,η}(µ, µ) = O(η) (Genevay et al., 2018), so even with a large number of samples n in the coreset, it is not always possible to get an ε-coreset for W_p using W_{p,η}. In practice, we observe that this regularizer yields mode collapse of the coreset, with the number of modes decreasing as η increases.

To alleviate this issue, Genevay et al. (2018) introduce Sinkhorn divergences, defined via

$$\mathrm{SD}_{p,\eta}(\mu, \nu) = W_{p,\eta}(\mu, \nu) - \frac{1}{2}\left( W_{p,\eta}(\mu, \mu) + W_{p,\eta}(\nu, \nu) \right).$$

The additional terms ensure that SD_{p,η}(µ, µ) = 0. Interestingly, when η goes to infinity, Sinkhorn divergences converge to MMD defined in (5) with kernel κ(x, y) = −‖x − y‖^p for 0 < p < 2. While solving (P) using SD_{p,η} can be faster than with W_p, especially for larger coreset sizes, we do not have theoretical guarantees for the minimizer.
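The regularized problem (7) and the Sinkhorn divergence can be sketched for small discrete measures with plain Sinkhorn iterations. The code below is a minimal sketch under several assumptions that are not taken from the paper: scalar support points, uniform weights, a fixed iteration count, and no log-domain stabilization (so very small η may underflow). It returns the value of the regularized objective in (7) and forms the divergence from those values; the function names are hypothetical.

```python
import math


def entropic_ot(xs, ys, eta, p=2, iters=500):
    """Approximate min_pi <C, pi> + eta * KL(pi || a x b), uniform weights a, b."""
    n, m = len(xs), len(ys)
    a, b = 1.0 / n, 1.0 / m
    cost = [[abs(x - y) ** p for y in ys] for x in xs]
    gibbs = [[math.exp(-c / eta) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns so that the coupling
        # pi_ij = a * b * u_i * gibbs_ij * v_j has the right marginals.
        u = [1.0 / sum(gibbs[i][j] * b * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(gibbs[i][j] * a * u[i] for i in range(n)) for j in range(m)]
    value = 0.0
    for i in range(n):
        for j in range(m):
            ratio = u[i] * gibbs[i][j] * v[j]  # pi_ij / (a * b)
            if ratio > 0.0:
                value += a * b * ratio * (cost[i][j] + eta * math.log(ratio))
    return value


def sinkhorn_divergence(xs, ys, eta, p=2):
    """Debiased cost: OT(x, y) - (OT(x, x) + OT(y, y)) / 2."""
    return entropic_ot(xs, ys, eta, p) - 0.5 * (
        entropic_ot(xs, xs, eta, p) + entropic_ot(ys, ys, eta, p)
    )
```

With these helpers, `entropic_ot(xs, xs, eta)` is strictly positive for a two-point measure, which illustrates the bias discussed above, while `sinkhorn_divergence(xs, xs, eta)` vanishes by construction.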

4.3. Algorithms

Recall that the goal of our measure coreset algorithms is to find a set of points {x_1, …, x_n} that minimizes some Wasserstein distance to a given distribution. Here, we detail how this goal is achieved by leveraging the dual of the Wasserstein problem. In particular, we give algorithms that compute coresets under W_1 and W_2, with updates specific to each setting.

Minimizing W_1 and W_2. In the semi-discrete case, when ν = (1/n) ∑_{i=1}^n δ_{x_i}, computing the Wasserstein distance can be cast as maximizing an expectation:

$$W_p^p(\nu, \mu) = \max_{v \in \mathbb{R}^n} \mathbb{E}_\mu\left[ \min_i \left( \|X - x_i\|^p - v_i \right) + \frac{1}{n} \sum_{i=1}^n v_i \right], \tag{8}$$

which can be optimized via stochastic gradient methods (Genevay et al., 2016; Claici et al., 2018). The gradients w.r.t. x_i can be written in terms of power diagrams:

$$\nabla_{x_i} W_1\left( \frac{1}{n} \sum_{i=1}^n \delta_{x_i}, \mu \right) = \int_{V_i(v^*)} \frac{x_i - x}{\|x_i - x\|} \,\mathrm{d}\mu(x) \tag{9}$$

$$\nabla_{x_i} W_2^2\left( \frac{1}{n} \sum_{i=1}^n \delta_{x_i}, \mu \right) = x_i - \int_{V_i(v^*)} x \,\mathrm{d}\mu(x) \tag{10}$$

where v* is the solution of (8) and V_i(v) = {x : ‖x − x_i‖^p − v_i ≤ ‖x − x_j‖^p − v_j, ∀ j ≠ i} is the generalized Voronoi region of point x_i, with p = 1 for W_1 and p = 2 for W_2.

Thus, a gradient step in the point positions x_i requires first solving (8) to get the optimal v, and then computing the gradients according to (9), (10). For W_2^2, the gradient step can be replaced by a fixed point iteration (Claici et al., 2018).

Algorithm 1 Compute an online W_1 coreset via SGD

Input: Measure µ, n > 0, minibatch size m, γ > 0
Output: Points x_1, …, x_n

1: Initialize (x_1, …, x_n) ∼ µ
2: for k = 1, … do
3:   Sample (y_1, …, y_m) ∼ µ
4:   Update estimate of v* using the samples y_j.
5:   Define generalized Voronoi regions V_i(v*).
6:   Step: x_i ← x_i − (γ/√k) ∑_{y_j ∈ V_i(v*)} (1/|V_i(v*)|) (x_i − y_j)/‖x_i − y_j‖.
7: end for

Algorithm 2 Compute an online W_2 coreset via SGD

Input: Measure µ, n > 0, minibatch size m, γ > 0
Output: Points x_1, …, x_n

1: Initialize (x_1, …, x_n) ∼ µ
2: for k = 1, … do
3:   Sample (y_1, …, y_m) ∼ µ
4:   Update estimate of v* using the samples y_j.
5:   Define generalized Voronoi regions V_i(v*).
6:   Update: x_i ← ∑_{y_j ∈ V_i(v*)} (1/|V_i(v*)|) y_j.
7: end for
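The W_2 procedure (Algorithm 2) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' C++ implementation. The paper only says to update the estimate of v* from the samples; the inner loop below is one assumed way to do it, using stochastic ascent on the dual (8), whose partial derivative with respect to v_i is 1/n minus the mass of the cell V_i(v). The names `w2_coreset`, `assign` and `uniform_sampler`, and all default parameter values, are hypothetical.

```python
import math
import random


def assign(y, points, v):
    """Index i minimizing ||y - x_i||^2 - v_i, i.e. the cell V_i(v) containing y."""
    best, best_val = 0, float("inf")
    for i, x in enumerate(points):
        val = sum((a - b) ** 2 for a, b in zip(y, x)) - v[i]
        if val < best_val:
            best, best_val = i, val
    return best


def w2_coreset(sample, n, batch_size=200, iters=50, dual_steps=50, dual_lr=0.5):
    """Sketch of Algorithm 2; `sample(m)` returns m points (lists of floats) from mu."""
    points = [list(y) for y in sample(n)]
    v = [0.0] * n
    for _ in range(iters):
        batch = sample(batch_size)
        # Assumed dual update: ascent on (8) with gradient 1/n - mu(V_i(v)),
        # where the cell mass is estimated by the fraction of the batch in V_i.
        for t in range(1, dual_steps + 1):
            counts = [0] * n
            for y in batch:
                counts[assign(y, points, v)] += 1
            for i in range(n):
                v[i] += dual_lr / math.sqrt(t) * (1.0 / n - counts[i] / len(batch))
        # Fixed-point update (line 6): move each x_i to the mean of its cell.
        cells = [[] for _ in range(n)]
        for y in batch:
            cells[assign(y, points, v)].append(y)
        for i, cell in enumerate(cells):
            if cell:  # x_i is left unchanged when no sample falls in its cell
                dim = len(cell[0])
                points[i] = [sum(y[d] for y in cell) / len(cell) for d in range(dim)]
    return points


def uniform_sampler(rng, dim):
    """Hypothetical stand-in for sample access to mu: uniform on [0, 1]^dim."""
    return lambda m: [[rng.random() for _ in range(dim)] for _ in range(m)]
```

A call such as `w2_coreset(uniform_sampler(random.Random(0), 2), n=10)` would return ten points in the unit square. For the W_1 variant, the fixed-point update would be replaced by the signed-direction step of Algorithm 1 with step size γ/√k.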

Minimizing W_{p,η} and SD_{p,η}. Due to the mode collapse inherent to large regularization η mentioned in §4.2, Sinkhorn divergences empirically are better candidates to construct coresets. Following (Genevay et al., 2018), we compute ∇_x SD_{p,η} using automatic differentiation of the objective. The resulting algorithm is identical to Algorithm 1, where the ∇_x W_1 gradient in line 6 is replaced by ∇_x SD_{p,η}.

4.4. Convergence

We mention some observations on the convergence of our approach. The minimization over the x variables is not convex due to inherent symmetries in the solution space, and W_p(·, ·) is not sufficiently smooth in the x variables to give precise convergence guarantees.

In Algorithms 1 and 2, we specify the number of points in the coreset. This differs from discrete coreset algorithms, which take ε as an input and return a coreset with enough points to satisfy the coreset inequality. Because our input is a measure that is absolutely continuous with respect to the Lebesgue measure, we do not have the luxury of this approach. An illustrative example is to consider ε = 0. In this case, a discrete coreset algorithm would simply return the original dataset. For a continuous µ, however, there is no finite distribution that has 0 error relative to µ.

4.5. Implementation details

Construction time depends strongly on the characteristics of the measure we are approximating. Most of the time is spent evaluating the expectations in (9), (10). Since we run the gradient ascent until ‖∇_w F‖_2 ≤ ε and perform T fixed point iterations, the construction requires O(T/ε) calls to an oracle that computes densities of the power cells V_i(v).

The algorithms for W_1 and W_2 were implemented in C++ using the Eigen matrix library (Guennebaud et al., 2010) and run on an Intel i7-6700K processor with 4 cores and 32GB of system memory. Computing expectations under samples from µ can be trivially parallelized. The total coreset construction time ranges from a few seconds for small coresets on small datasets, to 5 minutes on large datasets where large coresets are required. The Sinkhorn divergence coresets were implemented in TensorFlow and run on the same architecture without GPU support. Since our code for W_p is in C++, we do not observe significant computational speedup when using Sinkhorn divergences in our experiments. As the resulting coresets are merely an approximation of W_p coresets, we do not display them in the experimental results.

All algorithms were run 20 times; we display the mean and standard deviations in our plots. Regarding the parameters in Algorithms 1 and 2, we use a step size γ = 1 and 100 iterations.

5. Comparison with Classical Coresets

We compare with classical coreset constructions on a few problems. Each of the three tasks we consider has a specialized coreset construction algorithm that does not extend to other problems. Our coresets, on the other hand, do not have this limitation, but broad applicability may come at the price of performance. Even so, our coresets perform better than uniform on the three tasks we have chosen (k-means clustering, SVM classification, posterior inference), and greatly outperform state-of-the-art algorithms for the first two.

[Figure 2 plot: "K-Means Clustering Coresets". x-axis: coreset size (10^1 to 10^3); y-axis: relative cost; series: Uniform, W_2, W_1, Bachem et al., 2018 (Lightweight).]

Figure 2: Coreset construction on the Pendigit dataset (Keller et al., 2012) for the k-means algorithm. We compute the k-means cost on the full data using means learned on the coreset. The y axis measures relative error to computing the cost using the means learned on the full data. Comparison is with (Bachem et al., 2018a). We expect (and verify) that W_2 coresets perform better than W_1 coresets on this problem.

[Figure 3 plot: "SVM Coresets". x-axis: coreset size (10^1 to 10^3); y-axis: relative error; series: Uniform, W_2, W_1, Baykal et al., 2017.]

Figure 3: Coreset construction on the UCI credit card dataset (Yeh & Lien, 2009) for SVM classification. We compute relative accuracy with respect to training a classifier on all the data. Comparison is with (Baykal et al., 2017). Soft margin SVMs minimize a Lipschitz cost, and we expect both W_1 and W_2 coresets to perform well.

5.1. k-means clustering

The k-means objective for a fixed set of cluster centers Q is given by J(Q) = ∑_{x∈X} min_{q∈Q} ‖x − q‖^2.

When Q is a subset of a compact set, this cost has bounded Sobolev norm but is not Lipschitz. We expect W_1 coresets to perform worse than W_2 coresets on this problem. To measure performance, we compute coresets on the Pendigits dataset (Keller et al., 2012) and compute the relative cost 1 − J(Q_c)/J(Q*) of the centers Q_c learned on the coreset against the centers Q* learned on the full data. We compare with the importance sampling method of Bachem et al. (2018a). The number of clusters we expect in the data is 10, one for each digit.

[Figure 4 plot: "Bayesian Coresets". x-axis: coreset size (10^1 to 10^2); y-axis: KL divergence to full data posterior (10^1 to 10^5); series: Uniform, W_2, W_1, Campbell et al., 2018.]

Figure 4: Coreset construction on a synthetic dataset (described in 5.3). The goal is to approximate the posterior distribution for a logistic regression model, and we report the KL divergence to the true posterior learned on the full data. Comparison is with (Campbell & Broderick, 2019). The log likelihood of the model is Lipschitz, and we expect similar performance from W_1 and W_2 coresets.

In this experiment, (Bachem et al., 2018a) does not exhibit a clear advantage over uniform sampling. This suggests that their method is better suited to larger datasets. On the other hand, when using W_2 coresets, our method is on par with the minimal error for a coreset of 10 points. This is not surprising, as minimizing (P) with W_2 and n = k support points is equivalent to minimizing the k-means objective with balanced cluster assignments (Pollard, 1982; Cañas & Rosasco, 2012). This example demonstrates that our stochastic gradient descent approach is an efficient means of solving balanced k-means problems over large datasets, since we only access small-sized batches of the data at each iteration and never process the whole dataset at once.

5.2. Support vector machine classification

The soft margin SVM cost of a point xi with label yi isgiven by yi(wᵀxi + b)− 1 + ξi, where ξi is a slack variableassociated to xi. This cost is Lipschitz with a constantdepending on the diameter of the set of allowable w’s.

Because SVMs solve classification problems and our coresets approximate a dataset, our experimental setup here is slightly different than for k-means. Instead of constructing a coreset on the $(x_i, y_i)$ pairs in the training data, we construct individual coresets for all data associated to a single label and merge them afterward. Hence, the coreset contains equal numbers of positive and negative samples. We hypothesize that this property and the tendency of coresets to remove large outliers explain why in Figure 3 our coresets can yield better classifiers than training on the full data for large coreset size.
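The per-label construction described above can be sketched as follows. The coreset routine itself is abstracted behind a `build` callable; the default `uniform_subsample` is only a hypothetical stand-in so that the example runs, and is not the W1 or W2 construction. All names here are illustrative assumptions.

```python
import numpy as np

def uniform_subsample(X, n, rng):
    """Hypothetical stand-in for a coreset routine: n uniform draws from X."""
    idx = rng.choice(len(X), size=n, replace=False)
    return X[idx]

def per_class_coreset(X, y, n_per_class, build=uniform_subsample, seed=0):
    """Build one coreset per label and merge them, giving every label n_per_class points."""
    rng = np.random.default_rng(seed)
    parts, labels = [], []
    for label in np.unique(y):
        pts = build(X[y == label], n_per_class, rng)
        parts.append(pts)
        labels.append(np.full(len(pts), label))
    return np.vstack(parts), np.concatenate(labels)
```

Even when the training labels are imbalanced, the merged coreset is balanced by construction, which is the property the hypothesis above relies on.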

5.3. Bayesian inference

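We construct a synthetic dataset for logistic regression by drawing $x_i \sim \mathcal{N}(0, I)$ and labeling the $x_i$ by

$$\theta \sim \mathcal{N}(0, I), \qquad y_i \mid x_i, \theta \sim \mathrm{Bern}\left(\frac{1}{1 + e^{-x_i^\top \theta}}\right). \tag{11}$$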
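A minimal sketch of the generative process in (11), assuming labels are encoded in {0, 1} and that a single $\theta$ is drawn for the whole dataset. The defaults n = 20000 and d = 5 mirror the setting reported for Figure 4; the function name and the seed handling are illustrative.

```python
import numpy as np

def make_logistic_data(n=20000, d=5, seed=0):
    """Draw (X, y, theta) following Eq. (11); the {0, 1} label encoding is an assumption."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(d)            # theta ~ N(0, I)
    X = rng.standard_normal((n, d))           # x_i ~ N(0, I)
    p = 1.0 / (1.0 + np.exp(-X @ theta))      # Bernoulli parameter for each x_i
    y = (rng.random(n) < p).astype(int)       # y_i | x_i, theta ~ Bern(p_i)
    return X, y, theta
```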

The goal is to construct a (weighted) coreset that approximates the log likelihood of the full data $\sum_i \log p(y_i \mid \theta)$.
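To make the approximated quantity concrete, the sketch below evaluates a weighted logistic log likelihood. Passing no weights gives the full-data sum; passing coreset points with their weights gives the coreset approximation. The {0, 1} label encoding and the function name are assumptions, and how the weights are normalized is left to the caller.

```python
import numpy as np

def log_likelihood(theta, X, y, weights=None):
    """Weighted sum of log p(y_i | x_i, theta) for labels y_i in {0, 1}."""
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    z = X @ theta
    # log sigmoid(z) = -log(1 + exp(-z)); logaddexp keeps this stable for large |z|.
    log_p1 = -np.logaddexp(0.0, -z)
    log_p0 = -np.logaddexp(0.0, z)
    return float(np.sum(w * (y * log_p1 + (1 - y) * log_p0)))
```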

This cost is Lipschitz in this particular case. To agree with (Campbell & Broderick, 2019), instead of computing the relative log likelihood of our coreset against that of the full data, we use the coreset to infer the parameters of the posterior distribution and report KL divergence against the posterior learned on the entire dataset. Figure 4 shows results on a dataset of 20000 points drawn from a 5-dimensional Gaussian distribution. While we do not match the performance of (Campbell & Broderick, 2019), our coreset performs significantly better than a uniform sample.

6. Discussion

Learning problems are frequently posed as finding the best hypothesis that minimizes expected loss under a data distribution. However, classic coreset theory ignores that the samples from the dataset are drawn from some distribution. We have introduced a notion of measure coreset whose goal is to minimize generalization error of the coreset against the data distribution. Our definition is the natural one, and we can draw connections between this generalized notion of a coreset and optimal transport theory that lead to online construction algorithms.

As our paper is exploratory, there are many avenues for future research. For one, our definitions rely on identities and inequalities that relate large function families to W1 and W2. If we cannot assume much about µ, then these relations cannot be refined. The theory in our paper, however, does not sufficiently explain the effectiveness of our coreset constructions on the learning problems in §5.

Our algorithm's performance suggests several questions. There is a gap between the statistical knowledge we have about the sample complexity of W1 and W2 and the behavior of Algorithms 1 and 2 in the few-samples regime. Additionally, being able to get a coreset condition similar to Proposition 1 for Sinkhorn divergences would allow us to leverage their improved sample complexity compared to Wasserstein distances, yielding tighter theoretical bounds for the number of points required to be an ε-measure coreset.

References

Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R. Geometric approximation via coresets. Combinatorial and computational geometry, 52:1–30, 2005.

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007.

Aurenhammer, F. Power diagrams: properties, algorithms and applications. SIAM Journal on Computing, 16(1):78–96, 1987.

Bachem, O., Lucic, M., and Krause, A. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.

Bachem, O., Lucic, M., and Krause, A. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, pp. 1119–1127, 2018a. doi: 10.1145/3219819.3219973.

Bachem, O., Lucic, M., and Lattanzi, S. One-shot coresets: The case of k-clustering. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pp. 784–792, 2018b.

Baykal, C., Liebenwein, L., and Schwarting, W. Training support vector machines using coresets. CoRR, abs/1708.03835, 2017.

Baykal, C., Liebenwein, L., Gilitschenski, I., Feldman, D., and Rus, D. Data-dependent coresets for compressing neural networks with applications to generalization bounds. CoRR, abs/1804.05345, 2018.

Brancolini, A., Buttazzo, G., Santambrogio, F., and Stepanov, E. Long-term planning versus short-term planning in the asymptotical location problem. ESAIM: Control, Optimisation and Calculus of Variations, 15(3):509–524, 2009.

Campbell, T. and Broderick, T. Bayesian coreset construction via greedy iterative geodesic ascent. CoRR, abs/1802.01737, 2018.

Campbell, T. and Broderick, T. Automated scalable Bayesian inference via Hilbert coresets. Journal of Machine Learning Research, 20(15):1–38, 2019.

Cañas, G. D. and Rosasco, L. Learning probability measures with respect to optimal transport metrics. In Advances in Neural Information Processing Systems, pp. 2501–2509, 2012.

Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. Stein points. In International Conference on Machine Learning, pp. 844–853, 2018.

Chen, Y., Welling, M., and Smola, A. J. Super-samples from kernel herding. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, pp. 109–116, 2010.

Claici, S., Chien, E., and Solomon, J. Stochastic Wasserstein barycenters. Proceedings of the 35th International Conference on Machine Learning, ICML 2018, abs/1802.05757, 2018.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, pp. 2292–2300, 2013a.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013b.

Cuturi, M. and Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pp. 685–693, 2014.

De Goes, F., Breeden, K., Ostromoukhov, V., and Desbrun, M. Blue noise through optimal transport. ACM Transactions on Graphics (TOG), 31(6):171, 2012.

Feldman, D. and Langberg, M. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, 6-8 June 2011, pp. 569–578, 2011. doi: 10.1145/1993636.1993712.

Feldman, D., Schmidt, M., and Sohler, C. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, pp. 1434–1453. SIAM, 2013.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3440–3448. Curran Associates, Inc., 2016.

Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608–1617, 2018.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. Sample complexity of Sinkhorn divergences. pp. 1574–1583, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample problem. CoRR, abs/0805.2368, 2008.

Guennebaud, G., Jacob, B., et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

Har-Peled, S. and Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC 2004, pp. 291–300. ACM, 2004.

Huggins, J. H., Campbell, T., and Broderick, T. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 4080–4088, 2016.

Keller, F., Muller, E., and Bohm, K. HiCS: High contrast subspaces for density-based outlier ranking. In 2012 IEEE 28th International Conference on Data Engineering, pp. 1037–1048. IEEE, 2012.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

Kloeckner, B. Approximation by finitely supported measures. ESAIM Control Optim. Calc. Var., 18(2):343–359, 2012. ISSN 1292-8119.

Lacoste-Julien, S., Lindsten, F., and Bach, F. R. Sequential kernel herding: Frank–Wolfe optimization for particle filtering. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.

Langberg, M. and Schulman, L. J. Universal epsilon-approximators for integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010, pp. 598–607, 2010. doi: 10.1137/1.9781611973075.50.

Lévy, B. A Numerical Algorithm for L2 Semi-Discrete Optimal Transport in 3D. ESAIM Math. Model. Numer. Anal., 49(6):1693–1715, November 2015. ISSN 0764-583X, 1290-3841. doi: 10.1051/m2an/2015055.

Lyon, R. J., Stappers, B., Cooper, S., Brooke, J., and Knowles, J. Fifty years of pulsar candidate selection: from simple filters to a new principled real-time classification approach. Monthly Notices of the Royal Astronomical Society, 459(1):1104–1123, 2016.

Mérigot, Q. A multiscale approach to optimal transport. In Computer Graphics Forum, volume 30, pp. 1583–1592. Wiley Online Library, 2011.

Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.

Munteanu, A. and Schwiegelshohn, C. Coresets—Methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intelligenz (KI), 32(1):37–53, 2018.

Peyré, G. and Cuturi, M. Computational Optimal Transport. Submitted, 2018.

Peyre, R. Comparison between W2 distance and H1 norm, and localization of Wasserstein distance. ESAIM: Control, Optimisation and Calculus of Variations, 24(4):1489–1501, 2018.

Phillips, J. M. and Tai, W. M. Near-optimal coresets of kernel density estimates. In 34th International Symposium on Computational Geometry, SoCG 2018, June 11-14, 2018, Budapest, Hungary, pp. 66:1–66:13, 2018. doi: 10.4230/LIPIcs.SoCG.2018.66.

Pollard, D. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–205, 1982.

Reddi, S. J., Póczos, B., and Smola, A. J. Communication efficient coresets for empirical loss minimization. 2015.

Rigollet, P. and Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11-12):1228–1235, 2018.

Santambrogio, F. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20827-5 978-3-319-20828-2. doi: 10.1007/978-3-319-20828-2.

Solomon, J. Optimal Transport on Discrete Domains. AMS Short Course on Discrete Differential Geometry, 2018.

Tsang, I. W., Kwok, J. T., and Cheung, P. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.

Uzilov, A. V., Keegan, J. M., and Mathews, D. H. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(1):173, 2006.

Weed, J. and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. CoRR, abs/1707.00087, 2017.

Yeh, I. and Lien, C. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl., 36(2):2473–2480, 2009. doi: 10.1016/j.eswa.2007.12.020.


[Panels: Wasserstein; Kernel Herding.]

Figure 5: Comparison with kernel herding on a mixture of Gaussians. The first twenty points obtained from herding are plotted against a twenty point coreset under the W2 distance.

A. Additional Results

We present additional experimental results on the HTRU dataset (Lyon et al., 2016), and the RNA coding dataset (Uzilov et al., 2006). We test our SVM coreset and k-means coresets against uniform samples and state of the art coreset constructions.

Results for k-means are shown in Figure 6. Results for SVMs are shown in Figure 7.

B. Comparison with Kernel Herding

We have mentioned constructing coresets under the maximum mean discrepancy. Coresets under the MMD can be constructed using kernel herding, as shown in (Chen et al., 2010; Lacoste-Julien et al., 2015). We give a qualitative comparison between W2 coresets and samples obtained from herding on the mixture of Gaussians example from (Chen et al., 2010) in Figure 5.
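For readers unfamiliar with herding, the greedy rule can be sketched as below. This is a simplified variant, not the setup used for Figure 5: it assumes a Gaussian kernel, estimates the kernel mean from a finite pool of samples, and restricts the search for each new point to that same pool. The function names and the `bandwidth` parameter are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_herding(pool, n_points, bandwidth=1.0):
    """Greedy herding over a finite candidate pool drawn from the target distribution."""
    K = gaussian_kernel(pool, pool, bandwidth)
    mean_embedding = K.mean(axis=1)       # estimate of E_{x' ~ p} k(x, x') at each candidate
    running_sum = np.zeros(len(pool))     # sum of k(x, x_s) over the points chosen so far
    chosen = []
    for t in range(n_points):
        # Herding score: attraction to the target minus repulsion from chosen points.
        score = mean_embedding - running_sum / (t + 1)
        i = int(np.argmax(score))
        chosen.append(i)
        running_sum += K[:, i]
    return pool[chosen]
```

The first term pulls new points toward high-density regions, while the second pushes them away from points already selected.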


[Plots: K-Means Clustering Coresets (HTRU Dataset) and K-Means Clustering Coresets (Cod-RNA Dataset). x-axis: coreset size; y-axis: relative cost. Legend: Uniform, W1, W2.]

Figure 6: Additional results for k-means clustering.

[Plots: SVM Coresets (HTRU Dataset) and SVM Coresets (Cod-RNA Dataset). x-axis: coreset size; y-axis: relative error. Legend: Uniform, W1, W2, Baykal et al., 2017.]

Figure 7: Additional results for SVM classification.
