
ARTICLE TEMPLATE

Stochastic Approximation versus Sample Average Approximation for population Wasserstein barycenters

Darina Dvinskikh^{a,b}

^a Weierstrass Institute for Applied Analysis and Stochastics, Berlin, Germany; ^b Institute for Information Transmission Problems, Moscow, Russia.

ARTICLE HISTORY
Compiled September 17, 2020

ABSTRACT
In the machine learning and optimization community there are two main approaches to the convex risk minimization problem, namely the Stochastic Approximation (SA) and the Sample Average Approximation (SAA). In terms of oracle complexity (the required number of stochastic gradient evaluations), both approaches are considered equivalent on average (up to a logarithmic factor). The total complexity depends on the specific problem; however, starting from the work [45] it was generally accepted that SA is better than SAA. Nevertheless, in the case of large-scale problems SA may run out of memory, since storing all the data on one machine and organizing online access to it can be impossible without communication with other machines. SAA, in contradistinction to SA, allows parallel/distributed calculations. In this paper, we shed new light on the comparison of SA and SAA for the particular problem of calculating the population (regularized) Wasserstein barycenter of discrete measures. The conclusion is valid even for the non-parallel (non-decentralized) setup.

KEYWORDS
empirical risk minimization, stochastic approximation, sample average approximation, Wasserstein barycenter, Fréchet mean, stochastic gradient descent, mirror descent.

1. Introduction

In this work, we consider the problem of calculating the population mean (barycenter) of probability measures with discrete support (e.g., images). We define the notion of the population barycenter by using the Fréchet mean, which extends the Euclidean barycenter to non-linear spaces with non-Euclidean metrics. The Fréchet mean of a distribution $P$ on a metric space $(\mathcal{M}, W_2)$ is the solution of the following optimization problem
\[
p^* = \arg\min_{p\in \mathcal{M}} \int W_2^2(p, q)\,dP(q) = \arg\min_{p\in \mathcal{M}} \mathbb{E}_q W_2^2(p, q), \quad q \sim P, \qquad (1)
\]
where $W_2$ is the 2-Wasserstein distance defined by the optimal transport (OT) problem. A nice survey of OT and Wasserstein barycenters is presented in the books [50, 58].
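As a quick sanity check (our illustration, not from the paper), in the Euclidean case $(\mathbb{R}^d, \|\cdot\|_2)$ the minimizer in (1) is the ordinary mean, which is exactly what the Wasserstein barycenter generalizes to the space of measures. The following numpy snippet verifies this numerically:

import numpy as np

rng = np.random.default_rng(0)
qs = rng.normal(size=(1000, 3))                       # samples q ~ P in R^3
objective = lambda p: np.mean(np.sum((qs - p) ** 2, axis=1))

p_star = qs.mean(axis=0)                              # closed-form Frechet mean
for _ in range(5):                                    # any perturbation increases the objective
    d = 0.1 * rng.normal(size=3)
    assert objective(p_star) <= objective(p_star + d)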

CONTACT Darina Dvinskikh. Email: [email protected]


In this paper, we refer to $p^*$ from (1) as the population Wasserstein barycenter. The optimization problem (1) is a risk minimization problem (the objective function is given in the form of an expectation) for which there are two classical approaches based on Monte Carlo sampling techniques: the Stochastic Approximation (SA) and the Sample Average Approximation (SAA). The SAA approach approximates the true problem (1) by the sample average (empirical barycenter)
\[
\hat p^m = \arg\min_{p\in \mathcal{M}} \sum_{k=1}^m W_2^2(p, q^k), \qquad (2)
\]
where $q^1, q^2, \ldots, q^m$ are realizations of the random variable $q$ distributed according to $P$. The number of realizations $m$ is adjusted to the desired precision of the approximation of problem (1) by problem (2). We notice that neither SA nor SAA is by itself an algorithm, as the corresponding problems (1) and (2) require the use of appropriate numerical algorithms. The main difference of problem (1) from standard risk minimization problems is the high computational complexity of calculating the objective under the expectation itself (the Wasserstein distance): solving the corresponding OT problem between two measures requires $O(n^3)$ arithmetic operations, where $n$ is the size of the support of the measures [1, 15, 22, 50, 57]. Entropic regularization of OT [12] improves the statistical properties of the Wasserstein distance [7, 39] and reduces the computational complexity to $n^2\min\{O(1/\varepsilon),\, O(\sqrt{n})\}$.^1 This regularization shows good results in generative models [26], multi-label learning [21], dictionary learning [53], image processing [13, 52], and neural imaging [28].

^1 The estimate $n^2\min\{O(1/\varepsilon), O(\sqrt{n})\}$ is the best theoretically known estimate for solving the OT problem [9, 33, 43, 51]. The best known practical estimates are $\sqrt{n}$ times worse (see [30] and references therein).

The aim of this paper is to compare the two approaches, SA and SAA, for two settings of the Wasserstein barycenter problem: when the barycenter is defined as the minimizer of the expectation of OT, and as the minimizer of the expectation of entropy-regularized OT. Motivated by the numerous applications of the Wasserstein distance and Wasserstein barycenters to discrete objects, such as images, videos and texts, we restrict ourselves to discrete probability measures. Indeed, a continuous measure can be approximated by its empirical counterpart, and the convergence of these measures with respect to (w.r.t.) the entropy-regularized OT cost was studied in [7, 44].

1.1. Contribution and Related Work

SA and SAA approaches. This paper is inspired by the work [45], where it is stated that the SA approach outperforms the SAA approach for a certain class of convex stochastic problems. Our aim is to show that for the population Wasserstein barycenter problem this superiority is inverted. We provide a detailed comparison by stating the complexity bounds of implementations of the SA and the SAA approaches for the population Wasserstein barycenter problem and for the population Wasserstein barycenter problem defined w.r.t. regularized OT. As a byproduct, we also construct a confidence interval in the 2-norm for the barycenter $p^*_\mu$ defined w.r.t. $\mu$-regularized OT.

Consistency and rates of convergence. The consistency of the empirical barycenter as an estimator of the population barycenter w.r.t. the Wasserstein distance, as the number of measures tends to infinity, was studied in many papers, e.g., [10, 42], under some conditions on the process generating the measures. Moreover, the authors of [10] provide the rate of this convergence, but under a restrictive assumption on the process (it must be from an admissible family of deformations, i.e., it is a gradient of a convex function). Without any assumptions on the generating process, the rate of convergence was obtained in [8], however, only for measures with one-dimensional support. For some specific types of metrics and measures, the rates of convergence were also provided in [11, 27, 40].

Penalization of the barycenter problem. The population Wasserstein barycenter can be defined in two ways: as the minimizer of the expectation of the OT distance or of the entropy-regularized (also called smoothed) OT distance. The first problem can be reduced to the second one if one wants to decrease the computational complexity of solving the OT problem and to get a more stable optimization problem. An alternative regularization is introducing a strongly convex penalty function into the population Wasserstein problem itself. The advantages of convex penalization, namely the existence, uniqueness and stability of the penalized barycenter and the convergence of the penalized barycenter to the population barycenter, are studied in [6]. For a general convex (but not strongly convex) optimization problem, empirical minimization may fail in the offline approach despite the guaranteed success of an online approach if no regularization is introduced [54]. The limitations of the SAA approach in the non-strongly convex case are also discussed in [29, 55]. Our contribution includes introducing a new regularization for the population Wasserstein barycenter problem that improves the complexity bounds compared to the standard penalty (squared norm penalty) [54]. This regularization relies on the Bregman divergence from [3].

1.2. Paper organization

The structure of the paper is the following. Section 2 recalls the OT problem, the entropy-regularized OT problem and its properties. Section 3 presents the comparison of the SA and SAA approaches for the problem of the population Wasserstein barycenter defined w.r.t. regularized OT. In Section 4 we drop the entropic regularization of OT and compare SA and SAA for the population Wasserstein barycenter problem. Section 5 presents a new regularization for the population Wasserstein problem. Finally, in Section 6, we present numerical experiments to support our theoretical results.

2. Entropy-regularized OT

2.1. General setup and definitions.

Here, we briefly recall some key definitions used throughout the paper. For any finite-dimensional real vector space $X$, we denote its dual space by $X^*$. Let $\|\cdot\|$ be some norm on $X$; then the dual norm $\|\cdot\|_*$ is the norm on $X^*$ defined as follows:
\[
\|\lambda\|_* = \max\{\langle \lambda, x\rangle : \|x\| \le 1\}.
\]

Definition 2.1. A function $f: X \to \mathbb{R}$ is $M$-Lipschitz continuous w.r.t. a norm $\|\cdot\|$ if it satisfies
\[
|f(x) - f(y)| \le M\|x - y\|, \quad \forall x, y \in X.
\]

Definition 2.2. A function $f: X \to \mathbb{R}$ is $\mu$-strongly convex w.r.t. a norm $\|\cdot\|$ if it is continuously differentiable and satisfies
\[
f(x) - f(y) - \langle \nabla f(y), x - y\rangle \ge \frac{\mu}{2}\|x - y\|^2, \quad \forall x, y \in X.
\]

Definition 2.3. The Fenchel–Legendre conjugate of a function $f: X \to \mathbb{R}$ is
\[
f^*(y) \triangleq \sup_{x\in X}\{\langle x, y\rangle - f(x)\}.
\]

We also use the notation $O(\cdot)$ when we want to indicate the complexity, hiding constants and logarithmic factors.

2.2. Entropy-regularized OT, Dual Formulation and Properties

Let $S_n(1) = \{a \in \mathbb{R}^n_+ \mid \sum_{l=1}^n a_l = 1\}$ be the probability simplex and let $\delta_x$ be the Dirac measure at a point $x$; then a measure $p$ with finite support of size $n$ can be presented in the form $p = \sum_{i=1}^n p_i\delta_{x_i}$, where $p \in S_n(1)$ is the histogram. For two histograms $p, q \in S_n(1)$ we define optimal transport (OT) as the following optimization problem
\[
W(p, q) = \min_{\pi\in\Pi(p,q)} \langle C, \pi\rangle,
\]
where $\pi$ is a transport plan with marginals $p$ and $q$ from the transportation polytope $\Pi(p, q) = \{\pi \in \mathbb{R}^{n\times n}_+ : \pi\mathbf{1} = p,\ \pi^T\mathbf{1} = q\}$, and $C$ is the (ground) cost matrix ($C_{ij}$ is the cost of moving a unit of mass from support point $x_i$ of measure $p$ to support point $x_j$ of measure $q$). When $C_{ij} = d(x_i, x_j)^2$, where $d(x_i, x_j)$ is the distance between the support points $x_i, x_j$, then $W(p, q)^{1/2}$ is known as the 2-Wasserstein distance on $S_n(1)$.^2 In what follows we rewrite the population Wasserstein barycenter (1) as
\[
p^* = \arg\min_{p\in S_n(1)} \int W(p, q)\,dP(q) = \arg\min_{p\in S_n(1)} \mathbb{E}_q W(p, q)
\]
and its empirical counterpart (2) as
\[
\hat p^m = \arg\min_{p\in S_n(1)} \sum_{k=1}^m W(p, q^k)
\]
in our introduced notation. We define entropy-regularized OT as the following optimization problem, penalized by the negative entropy with $\mu \ge 0$:
\[
W_\mu(p, q) = \min_{\pi\in\Pi(p,q)} \{\langle C, \pi\rangle + \mu\langle \pi, \ln\pi\rangle\}.
\]
One of the advantages of entropic regularization of OT is the existence of a closed-form representation for its dual (Fenchel–Legendre) function, which leads to the following results.

^2 We omit the sub-index 2 for simplicity.

Proposition 2.4. Given two histograms $p, q \in S_n(1)$, the dual formulation of entropy-regularized OT is
\[
W_\mu(p, q) = \max_{\lambda\in\mathbb{R}^n} \Big\{\langle \lambda, p\rangle - \mu\sum_{j=1}^n q_j\ln\Big(\frac{1}{q_j}\sum_{i=1}^n e^{\frac{\lambda_i - C_{ij}}{\mu}}\Big)\Big\}. \qquad (3)
\]
Moreover, the gradient of $W_\mu(p, q)$ w.r.t. $p$ is the solution $\lambda^*$ of the optimization problem (3) such that $\langle \lambda^*, \mathbf{1}\rangle = 0$ [24, 50]:
\[
\nabla_p W_\mu(p, q) = \lambda^*. \qquad (4)
\]
The (Fenchel–Legendre) dual function for $W_\mu(p, q)$ has the following closed-form representation:
\[
D_q(\lambda) = \mu\sum_{j=1}^n q_j\ln\Big(\frac{1}{q_j}\sum_{i=1}^n e^{\frac{\lambda_i - C_{ij}}{\mu}}\Big), \quad \forall \lambda \in \mathbb{R}^n. \qquad (5)
\]

Proof. We add the constraints $\pi\mathbf{1} = p$ and $\pi^T\mathbf{1} = q$ to the objective in regularized OT with the corresponding Lagrangian dual variables $\lambda$ and $\nu$, and solve the problem w.r.t. $\nu$ analytically:
\begin{align*}
W_\mu(p, q) &= \min_{\pi\in\Pi(p,q)} \sum_{i,j=1}^n (C_{ij}\pi_{ij} + \mu\pi_{ij}\ln\pi_{ij}) \\
&= \max_{\lambda,\nu\in\mathbb{R}^n} \Big\{\langle\lambda, p\rangle + \langle\nu, q\rangle - \mu\sum_{i,j=1}^n \exp\Big(\frac{\lambda_i + \nu_j - C_{ij}}{\mu} - 1\Big)\Big\} \\
&= \max_{\lambda\in\mathbb{R}^n} \Big\{\langle\lambda, p\rangle - \mu\sum_{j=1}^n q_j\ln\Big(\frac{1}{q_j}\sum_{i=1}^n \exp\Big(\frac{\lambda_i - C_{ij}}{\mu}\Big)\Big)\Big\}.
\end{align*}
The next two statements of the proposition follow directly from this representation.
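For illustration, here is a minimal Python sketch of a Sinkhorn-type oracle for $W_\mu(p, q)$ and its gradient (4), obtained by alternating exact maximization of the dual from the proof above in $\lambda$ and $\nu$; this is our own simplified sketch, not the implementation used in the paper, and the tolerance `tol` on the marginal violation is a crude stand-in for the $\delta$-accuracy condition (8) introduced in Section 3:

import numpy as np
from scipy.special import logsumexp

def sinkhorn_grad(p, q, C, mu, tol=1e-9, max_iter=10_000):
    """Approximate W_mu(p, q) and its gradient lambda* = grad_p W_mu(p, q), cf. (3)-(4)."""
    lam = np.zeros_like(p)
    nu = np.zeros_like(q)
    for _ in range(max_iter):
        # exact maximization of the dual in lam (rows), then in nu (columns)
        lam = mu * (np.log(p) - logsumexp((nu[None, :] - C) / mu - 1.0, axis=1))
        nu = mu * (np.log(q) - logsumexp((lam[:, None] - C) / mu - 1.0, axis=0))
        pi = np.exp((lam[:, None] + nu[None, :] - C) / mu - 1.0)  # current plan
        if np.abs(pi.sum(axis=1) - p).sum() <= tol:               # row-marginal violation
            break
    w_mu = np.sum(C * pi) + mu * np.sum(pi * np.log(pi))          # primal value
    return w_mu, lam - lam.mean()   # normalization <lambda*, 1> = 0 from (4)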

The proposition below describes the properties of entropy-regularized OT.

Proposition 2.5 (Properties of $W_\mu(p, q)$). Given two histograms $p$ and $q$ from the interior of $S_n(1)$, entropy-regularized OT $W_\mu(p, q)$ is

• $\mu$-strongly convex in $p$ w.r.t. the 2-norm:
\[
W_\mu(p, q) \ge W_\mu(p', q) + \langle \nabla W_\mu(p', q), p - p'\rangle + \frac{\mu}{2}\|p - p'\|_2^2
\]
for any $p, p'$ from the interior of $S_n(1)$;

• $M_\infty$-Lipschitz continuous in $p$ w.r.t. the 1-norm:
\[
|W_\mu(p, q) - W_\mu(p', q)| \le M_\infty\|p - p'\|_1
\]
for any $p, p'$ from the interior of $S_n(1)$. Hence, $W_\mu(p, q)$ is also $M$-Lipschitz in $p$ w.r.t. the 2-norm. Hereby, $M \le \sqrt{n}M_\infty$ and $M_\infty = O(\|C\|_\infty)$.

Proof. The gradient of the function $D_q(\lambda)$ in (5) is $\frac{1}{\mu}$-Lipschitz continuous in the 2-norm [17, Lemma 1], [50]. From this and the dual formulation of regularized OT (3) we conclude that $W_\mu(p, q)$ is $\mu$-strongly convex in $p$ w.r.t. the 2-norm [37, Theorem 6], [46]. We also used here that the dual norm for the 2-norm is again the 2-norm.

The second statement follows from the fact that the $\infty$-norm of the solution $\lambda^*$ of (3) is upper bounded ([9, Lemma 10] for the $\infty$-norm and [30, Lemma 7] for the 2-norm). From this and (4) we get that the gradient of $W_\mu(p, q)$ in $p$ is upper bounded, which means Lipschitz continuity of $W_\mu(p, q)$. From [9], assuming that the measures are separated from zero, we roughly take $M_\infty = O(\|C\|_\infty)$. This separation can be achieved by simple preprocessing of the measures; moreover, most transport algorithms require this preprocessing.

In what follows we use Proposition 2.5 for any $p, q \in S_n(1)$, keeping in mind that $p, q$ are from the interior of $S_n(1)$, as we can easily ensure this condition by adding some noise and normalizing the measures. We also notice that if the measures are from the interior of $S_n(1)$, then their barycenter is also from the interior of $S_n(1)$.

3. Population Wasserstein barycenter w.r.t. regularized OT

In this section, we present the comparison of the SA and SAA approaches for the population Wasserstein barycenter defined w.r.t. regularized OT:
\[
p^*_\mu = \arg\min_{p\in S_n(1)} \mathbb{E}_q W_\mu(p, q). \qquad (6)
\]
Throughout this section we use the following shorthand for the objective:
\[
W_\mu(p) \triangleq \mathbb{E}_q W_\mu(p, q).
\]

3.1. Stochastic Approximation (SA)

We present an implementation of the SA approach for problem (6). To do so, we assume that we can sample measures $q^1, q^2, q^3, \ldots$ from the distribution $P$ ($q \sim P$). We define the stochastic subgradient w.r.t. $p$ by $\nabla_p W_\mu(p, q^k)$ ($k = 1, 2, 3, \ldots$). The classical SA algorithm with a stochastic oracle is the following:
\[
p^{k+1} = \Pi_{S_n(1)}\big(p^k - \eta_k\nabla^\delta_p W_\mu(p^k, q^k)\big), \qquad (7)
\]
where $\Pi_{S_n(1)}(p)$ is the projection onto $S_n(1)$ and $\nabla^\delta_p W_\mu(p^k, q^k)$ is a $\delta$-approximation of the true gradient $\nabla_p W_\mu(p^k, q^k)$:
\[
\|\nabla^\delta_p W_\mu(p, q) - \nabla_p W_\mu(p, q)\|_2 \le \delta, \quad \forall q \in S_n(1). \qquad (8)
\]
Using (4) we can compute the approximate gradient $\nabla^\delta_p W_\mu(p, q)$ by the Sinkhorn algorithm [18, 50]. Based on (7) we provide an online algorithm (Alg. 1) that inputs an online sequence of measures $q^1, q^2, q^3, \ldots$ (realizations of $q$) and at each stochastic gradient descent iteration calls the Sinkhorn algorithm to compute the approximation of the gradient of $W_\mu(p^k, q^k)$ with precision $\delta$. We take the step size $\eta_k$ according to [32].

Algorithm 1 Online Stochastic Gradient Descent (SGD)

Input: starting point $p^1 \in S_n(1)$, realization $q^1$, precision of gradient calculation $\delta$, parameter $\mu$
1: for $k = 1, 2, 3, \ldots$ do
2:   $\eta_k = \frac{1}{\mu k}$
3:   $p^{k+1} = \Pi_{S_n(1)}\big(p^k - \eta_k\nabla^\delta_p W_\mu(p^k, q^k)\big)$, where $\nabla^\delta_p W_\mu(p^k, q^k)$ is calculated by the Sinkhorn algorithm ($\delta$ is defined by (8)) and $\Pi_{S_n(1)}(p) = \arg\min_{v\in S_n(1)}\|p - v\|_2$ is the projection onto $S_n(1)$, calculated by the algorithm from [16]
4:   Sample $q^{k+1}$
Output: $p^1, p^2, p^3, \ldots$
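Below is a minimal Python sketch of Algorithm 1 (our illustration only): `sinkhorn_grad` is the oracle sketched in Section 2.2, `sample_measure` is an assumed stand-in for the online oracle $q \sim P$, and the projection routine follows the sort-based algorithm of [16]:

import numpy as np

def project_simplex(v):
    """Euclidean projection onto S_n(1), following the sort-based method of [16]."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def online_sgd(sample_measure, C, mu, n_steps, delta_tol=1e-9):
    n = C.shape[0]
    p = np.full(n, 1.0 / n)                       # starting point p^1
    iterates = []
    for k in range(1, n_steps + 1):
        q = sample_measure()                      # online oracle: q^k ~ P
        _, grad = sinkhorn_grad(p, q, C, mu, tol=delta_tol)   # inexact gradient (8)
        p = project_simplex(p - grad / (mu * k))              # step eta_k = 1/(mu k)
        iterates.append(p)
    return np.mean(iterates, axis=0)              # online-to-batch average, see below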

One of the benefits of the online approach is that there is no need to fix the number of measures, which allows regulating the precision of the barycenter estimate. Moreover, the problem of storing a large number of measures on a computing node does not arise if we have access to an online oracle, e.g., some measuring device.

To approximate the population barycenter $p^*_\mu$ by the outputs of Algorithm 1 we use online-to-batch conversion [54] and define $\hat p^N$ as the average of the online outputs $p^1, \ldots, p^N$ of Algorithm 1: $\hat p^N = \frac{1}{N}\sum_{k=1}^N p^k$. The convergence properties of $\hat p^N$ to the population barycenter $p^*_\mu$ are presented in the following theorem.

Theorem 3.1. Let $\hat p^N$ be the average of $N$ online outputs of Algorithm 1. Then, with probability $\ge 1 - \alpha$ we have
\[
\mathbb{E}_q\big[W_\mu(\hat p^N, q) - W_\mu(p^*_\mu, q)\big] = O\Big(\frac{M^2\ln(N/\alpha)}{\mu N} + \delta D_2\Big) = O\Big(\frac{M^2\ln(N/\alpha)}{\mu N} + \delta\Big),
\]
where $D_2 = \max_{p', p''\in S_n(1)}\|p' - p''\|_2 = \sqrt{2}$. Let Algorithm 1 run with $\delta = O(\varepsilon)$ and $N = O\big(\frac{M^2}{\mu\varepsilon}\big)$. Then, with probability $\ge 1 - \alpha$ the following holds:
\[
\mathbb{E}_q\big[W_\mu(\hat p^N, q) - W_\mu(p^*_\mu, q)\big] \le \varepsilon \quad \text{and} \quad \|\hat p^N - p^*_\mu\|_2 \le \sqrt{2\varepsilon/\mu}.
\]
The total complexity of Algorithm 1 is
\[
O\left(\frac{M^2}{\mu\varepsilon}\,n^2\min\left\{\exp\Big(\frac{\|C\|_\infty}{\mu}\Big)\Big(\frac{\|C\|_\infty}{\mu} + \ln\Big(\frac{\|C\|_\infty}{\gamma\varepsilon^2}\Big)\Big),\ \frac{\sqrt{n}}{\gamma\mu\varepsilon^2}\right\}\right),
\]
where $\gamma \triangleq \sigma_{\min}\big(\nabla^2 D_q(\lambda^*)\big) > 0$ and $\sigma_{\min}(A)$ is the smallest positive eigenvalue of a positive semi-definite matrix $A$.^3

^3 Due to the results of [20] we may expect $\gamma$ to be $n^{-\beta}$ with $\beta \ge 0$.

Proof. From the $\mu$-strong convexity of $W_\mu(p, q^k)$ w.r.t. $p$, it follows that
\[
W_\mu(p^*, q^k) \ge W_\mu(p^k, q^k) + \langle \nabla_p W_\mu(p^k, q^k), p^* - p^k\rangle + \frac{\mu}{2}\|p^* - p^k\|_2^2.
\]

Adding and subtracting the term $\langle \nabla^\delta_p W_\mu(p^k, q^k), p^* - p^k\rangle$ and using the Cauchy–Schwarz inequality together with (8), we get
\begin{align*}
W_\mu(p^*, q^k) &\ge W_\mu(p^k, q^k) + \langle \nabla^\delta_p W_\mu(p^k, q^k), p^* - p^k\rangle + \frac{\mu}{2}\|p^* - p^k\|_2^2 \\
&\quad + \langle \nabla_p W_\mu(p^k, q^k) - \nabla^\delta_p W_\mu(p^k, q^k), p^* - p^k\rangle \\
&\ge W_\mu(p^k, q^k) + \langle \nabla^\delta_p W_\mu(p^k, q^k), p^* - p^k\rangle + \frac{\mu}{2}\|p^* - p^k\|_2^2 - \delta\|p^* - p^k\|_2. \qquad (9)
\end{align*}
From the update rule for $p^{k+1}$ and the non-expansiveness of the projection we have
\begin{align*}
\|p^{k+1} - p^*\|_2^2 &= \|\Pi_{S_n(1)}(p^k - \eta_k\nabla^\delta_p W_\mu(p^k, q^k)) - p^*\|_2^2 \le \|p^k - \eta_k\nabla^\delta_p W_\mu(p^k, q^k) - p^*\|_2^2 \\
&= \|p^k - p^*\|_2^2 + \eta_k^2\|\nabla^\delta_p W_\mu(p^k, q^k)\|_2^2 - 2\eta_k\langle \nabla^\delta_p W_\mu(p^k, q^k), p^k - p^*\rangle.
\end{align*}
From this it follows that
\[
\langle \nabla^\delta_p W_\mu(p^k, q^k), p^k - p^*\rangle \le \frac{1}{2\eta_k}\big(\|p^k - p^*\|_2^2 - \|p^{k+1} - p^*\|_2^2\big) + \frac{\eta_k}{2}\|\nabla^\delta_p W_\mu(p^k, q^k)\|_2^2.
\]
Together with (9) we get
\[
W_\mu(p^k, q^k) - W_\mu(p^*, q^k) \le \frac{1}{2\eta_k}\big(\|p^k - p^*\|_2^2 - \|p^{k+1} - p^*\|_2^2\big) - \frac{\mu}{2}\|p^* - p^k\|_2^2 + \delta\|p^* - p^k\|_2 + \frac{\eta_k}{2}\|\nabla^\delta_p W_\mu(p^k, q^k)\|_2^2.
\]
Averaging this over $k = 1, \ldots, N$, using $\eta_k = \frac{1}{\mu k}$ (so that $\frac{1}{\eta_k} - \frac{1}{\eta_{k-1}} - \mu = 0$ and the first terms telescope) and $\|\nabla W_\mu(p, q)\|_2 \le M$, we get
\begin{align*}
\frac{1}{N}\sum_{k=1}^N\big(W_\mu(p^k, q^k) - W_\mu(p^*, q^k)\big) &\le \frac{\delta}{N}\sum_{k=1}^N\|p^* - p^k\|_2 + \frac{M^2}{2N}\sum_{k=1}^N\frac{1}{\mu k} \\
&\le \delta D_2 + \frac{M^2}{2\mu N}(1 + \ln N) = O\Big(\frac{M^2\ln N}{\mu N} + \delta D_2\Big), \qquad (10)
\end{align*}
where $D_2 = \max_{p', p''\in S_n(1)}\|p' - p''\|_2 = \sqrt{2}$. Here the last bound holds due to the sum of the harmonic series.

Next we estimate the codomain (image) of $W_\mu(p, q)$:
\begin{align*}
\max_{p,q\in S_n(1)} W_\mu(p, q) &= \max_{p,q\in S_n(1)}\ \min_{\pi\in\mathbb{R}^{n\times n}_+,\ \pi\mathbf{1}=p,\ \pi^T\mathbf{1}=q}\ \sum_{i,j=1}^n (C_{ij}\pi_{ij} + \mu\pi_{ij}\ln\pi_{ij}) \\
&\le \max_{\pi\in\mathbb{R}^{n\times n}_+,\ \sum_{i,j=1}^n\pi_{ij}=1}\ \sum_{i,j=1}^n (C_{ij}\pi_{ij} + \mu\pi_{ij}\ln\pi_{ij}) \le \|C\|_\infty.
\end{align*}
Therefore, $W_\mu(p, q): S_n(1)\times S_n(1) \to [-2\mu\ln n, \|C\|_\infty]$.

Then, using this, we refer to [38, Theorem 2] with the regret estimated by (10) and get with probability $\ge 1 - \alpha$ the first statement of the theorem:
\[
W_\mu(\hat p^N) - W_\mu(p^*_\mu) = O\Big(\frac{M^2\ln(N/\alpha)}{\mu N} + \delta D_2\Big) = O\Big(\frac{M^2\ln(N/\alpha)}{\mu N} + \delta\Big).
\]
Equating the right-hand side (r.h.s.) of this equality to $\varepsilon$, we get the expressions for $N$ and $\delta$. The statement about the confidence region for the barycenter follows directly from the strong convexity of $W_\mu(p, q)$ and $W_\mu(p)$.

The proof of the algorithm complexity follows from the complexity of the Sinkhorn algorithm. To state the complexity of the Sinkhorn algorithm, we first define $\tilde\delta$ as the accuracy in function value of the inexact solution $\lambda$ of the max-problem in (3). Using this, the number of iterations of the Sinkhorn algorithm is [41, 56]
\[
O\left(\exp\Big(\frac{\|C\|_\infty}{\mu}\Big)\Big(\frac{\|C\|_\infty}{\mu} + \ln\Big(\frac{\|C\|_\infty}{\tilde\delta}\Big)\Big)\right).
\]
The number of iterations for the Accelerated Sinkhorn algorithm is better [30]:
\[
O\left(\frac{\sqrt{n}}{\mu\tilde\delta}\right).
\]
Multiplying both of these estimates by the number of iterations $N$ (measures) and by the complexity of each iteration of the Sinkhorn algorithm, which is $O(n^2)$, and taking the minimum, we get the last statement of the theorem, where we used the transition from the accuracy $\tilde\delta$ in function value to the accuracy $\delta$ in argument: from $\gamma \triangleq \sigma_{\min}(\nabla^2 D_q(\lambda^*)) > 0$ we can conclude that $\tilde\delta$ is proportional to $\gamma\delta^2$.

3.2. Sample Average Approximation (SAA)

Now we suppose that we sample the measures $q^1, \ldots, q^m$ in advance (with properly chosen $m$). This offline setting can be relevant when we are interested in parallelization or decentralization. The SAA approach approximates the true problem (6) by the following empirical problem:
\[
\hat p^m_\mu = \arg\min_{p\in S_n(1)} \sum_{k=1}^m W_\mu(p, q^k). \qquad (11)
\]
We refer to $p_{\varepsilon'}$ as an approximation of the empirical barycenter $\hat p^m_\mu$ from (11) if it satisfies the following inequality for some precision $\varepsilon'$:
\[
\frac{1}{m}\sum_{k=1}^m W_\mu(p_{\varepsilon'}, q^k) - \frac{1}{m}\sum_{k=1}^m W_\mu(\hat p^m_\mu, q^k) \le \varepsilon'. \qquad (12)
\]
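For concreteness, here is a small Python sketch (our illustration, not the paper's code) of the empirical objective in (11) and of the $\varepsilon'$-criterion (12), reusing the `sinkhorn_grad` oracle sketched in Section 2.2:

import numpy as np

def empirical_objective(p, qs, C, mu):
    """(1/m) sum_k W_mu(p, q^k), the objective of the empirical problem (11)."""
    return np.mean([sinkhorn_grad(p, q, C, mu)[0] for q in qs])

def satisfies_eps_prime(p_candidate, p_hat, qs, C, mu, eps_prime):
    """Check the eps'-approximation criterion (12) for a candidate barycenter."""
    gap = empirical_objective(p_candidate, qs, C, mu) - empirical_objective(p_hat, qs, C, mu)
    return gap <= eps_prime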

The convergence properties of $p_{\varepsilon'}$ to the population barycenter $p^*_\mu$, and the proper number of measures $m$ needed to approximate problem (6) by problem (11), are presented in the following theorem.

Theorem 3.2. Let $p_{\varepsilon'}$ satisfy (12) with precision $\varepsilon'$, where $m$ is the number of measures in the empirical average (11). Then, with probability $\ge 1 - \alpha$ we have
\[
\mathbb{E}_q\big[W_\mu(p_{\varepsilon'}, q) - W_\mu(p^*_\mu, q)\big] \le \sqrt{\frac{2M^2}{\mu}\varepsilon'} + \frac{4M^2}{\alpha\mu m}.
\]
Let $\varepsilon' = O\big(\frac{\mu\varepsilon^2}{M^2}\big)$ and $m = O\big(\frac{M^2}{\alpha\mu\varepsilon}\big)$. Then, with probability $\ge 1 - \alpha$ the following holds:
\[
\mathbb{E}_q\big[W_\mu(p_{\varepsilon'}, q) - W_\mu(p^*_\mu, q)\big] \le \varepsilon \quad \text{and} \quad \|p_{\varepsilon'} - p^*_\mu\|_2 \le \sqrt{2\varepsilon/\mu}.
\]
The total complexity of the offline algorithm from [41] computing $p_{\varepsilon'}$ satisfying (12) is
\[
O\left(\frac{n^2M^3}{\sqrt{\alpha\mu^3\varepsilon^3}}\right). \qquad (13)
\]
If a parallel or distributed architecture is available, then the total complexity per node is
\[
O\left(\frac{\kappa}{\sqrt{m}}\,\frac{n^2M^2}{\mu\varepsilon}\right),
\]
where $\kappa$ is the parameter of the architecture:

• $\kappa = 1$ in a fully parallel architecture with $m$ nodes;
• $\kappa = \sqrt{m}$ in a parallel architecture with $\sqrt{m}$ nodes;
• $\kappa = m$ if we have only one node (machine);
• $\kappa = d$ in a centralized architecture with $m$ nodes ($d$ is the communication network diameter);
• $\kappa = \sqrt{\chi}$ in a decentralized architecture with $m$ nodes ($\sqrt{\chi}$ is the condition number of the network).
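As a quick illustration of how the theorem's parameter choices are used in practice (with all the $O(\cdot)$-constants set to 1, which is our own simplification), one can pick $m$ and $\varepsilon'$ from the target accuracy as follows:

import math

def saa_parameters(eps, alpha, mu, M):
    """Parameter choices of Theorem 3.2 with all O(.)-constants set to 1."""
    m = math.ceil(M**2 / (alpha * mu * eps))   # number of measures, m = O(M^2/(alpha mu eps))
    eps_prime = mu * eps**2 / M**2             # auxiliary precision, eps' = O(mu eps^2 / M^2)
    return m, eps_prime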

Proof. Consider for any $p \in S_n(1)$ the following decomposition:
\[
W_\mu(p) - W_\mu(p^*_\mu) \le W_\mu(\hat p^m_\mu) - W_\mu(p^*_\mu) + W_\mu(p) - W_\mu(\hat p^m_\mu). \qquad (14)
\]
From [54, Theorem 6], with probability $\ge 1 - \alpha$ the following holds for the empirical minimizer $\hat p^m_\mu$:
\[
W_\mu(\hat p^m_\mu) - W_\mu(p^*_\mu) \le \frac{4M^2}{\alpha\mu m}.
\]
Then from this and (14) we get
\[
W_\mu(p) - W_\mu(p^*_\mu) \le \frac{4M^2}{\alpha\mu m} + W_\mu(p) - W_\mu(\hat p^m_\mu).
\]
From the Lipschitz continuity of $W_\mu(p)$ we have
\[
W_\mu(p) - W_\mu(\hat p^m_\mu) \le M\|p - \hat p^m_\mu\|_2. \qquad (15)
\]
From the strong convexity of $W_\mu(p, q)$ we get
\[
\|p - \hat p^m_\mu\|_2 \le \sqrt{\frac{2}{\mu}\Big(\frac{1}{m}\sum_{k=1}^m W_\mu(p, q^k) - \frac{1}{m}\sum_{k=1}^m W_\mu(\hat p^m_\mu, q^k)\Big)}. \qquad (16)
\]
Using (15) and (16) and taking $p = p_{\varepsilon'}$, we get the first statement of the theorem:
\[
W_\mu(p_{\varepsilon'}) - W_\mu(p^*_\mu) \le \sqrt{\frac{2M^2}{\mu}\Big(\frac{1}{m}\sum_{k=1}^m W_\mu(p_{\varepsilon'}, q^k) - \frac{1}{m}\sum_{k=1}^m W_\mu(\hat p^m_\mu, q^k)\Big)} + \frac{4M^2}{\alpha\mu m} \le \sqrt{\frac{2M^2}{\mu}\varepsilon'} + \frac{4M^2}{\alpha\mu m}. \qquad (17)
\]
Then from the strong convexity we have
\[
\|p_{\varepsilon'} - p^*_\mu\|_2 \le \sqrt{\frac{2}{\mu}\Big(\sqrt{\frac{2M^2}{\mu}\varepsilon'} + \frac{4M^2}{\alpha\mu m}\Big)}. \qquad (18)
\]
Equating (17) to $\varepsilon$, we get the expressions for the number of measures $m$ and the auxiliary precision $\varepsilon'$. Substituting both of these expressions into (18), we get the confidence region for $p^*_\mu$.

To calculate the total complexity we refer to Algorithm 6 in the paper [41], which calculates $p_{\varepsilon'}$. For the reader's convenience we repeat the scheme of the proof. This algorithm belongs to the class of fast gradient methods for Lipschitz-smooth functions and, consequently, has complexity $O\Big(\sqrt{\frac{LR^2}{\varepsilon'}}\Big)$ [48]. Here $L$ is the Lipschitz constant of the gradient of the dual function $D_q(\lambda)$ from (3) ($L = 1/\mu$ [17, Lemma 1]) and $R$ is the radius for the dual solution ([9, Lemma 10], [41, Lemma 8] and [30, Lemma 7]). Incorporating all of this, we get the following number of iterations:
\[
N = O\left(\kappa\sqrt{\frac{M^2}{m\mu\varepsilon'}}\right), \qquad (19)
\]
where $\kappa$ is the parameter of the architecture. Multiplying this by the complexity of calculating the gradient of the dual function (which is $n^2$) and using the definition of $\varepsilon'$, we get the following complexity per node:
\[
O(n^2N) = O\left(n^2\kappa\sqrt{\frac{M^2}{m\mu\varepsilon'}}\right) = O\left(\frac{\kappa}{\sqrt{m}}\,\frac{n^2M^2}{\mu\varepsilon}\right).
\]
Using the expression for the number of measures and taking $\kappa$ for the one-machine architecture, we get the total algorithm complexity, which finishes the proof.

From the recent results of [19] we may expect that the dependence on $\alpha$ in Theorem 3.2 is in fact much better (logarithmic) if $\mu$ in these formulas is small (proportional to $\varepsilon$). Otherwise, it remains a hypothesis.

3.3. Comparison of SA and SAA for the population Wasserstein barycenter problem defined with respect to regularized OT

Next we compare the SA and the SAA approaches for problem (6). For the reader's convenience we skip the details about high-probability bounds. The first reason is that we can fix $\alpha$, say $\alpha = 0.05$, and consider it a fixed parameter in all the bounds. The second reason is the intuition, which goes back to [54], that all the bounds of this paper in fact depend on $\alpha$ logarithmically, and up to the $O(\cdot)$ notation we can ignore this dependence.

Table 1 presents the total complexity of the numerical algorithms from this section implementing the SA and SAA approaches. We estimate $M_\infty = O(\|C\|_\infty)$ and $M \le \sqrt{n}M_\infty$ in the complexity bounds by using Proposition 2.5.

Table 1. Total complexity of the SA and SAA implementations for the problem $\min_{p\in S_n(1)} \mathbb{E}_q W_\mu(p, q)$.

Algorithm 1 (SA): $O\left(n^3\frac{\|C\|_\infty^2}{\mu\varepsilon}\min\left\{\exp\Big(\frac{\|C\|_\infty}{\mu}\Big)\Big(\frac{\|C\|_\infty}{\mu} + \ln\Big(\frac{\|C\|_\infty}{\gamma\varepsilon^2}\Big)\Big),\ \frac{\sqrt{n}}{\gamma\mu\varepsilon^2}\right\}\right)$

Algorithm from [41] (SAA): $O\left(\frac{n^{3.5}\|C\|_\infty^3}{\sqrt{\alpha\mu^3\varepsilon^3}}\right)$

We conclude that when $\mu$ is not too large, the complexity of SA is given by the second term under the minimum, which is $O(\sqrt{\varepsilon})$-times bigger than the SAA complexity. Hereby, the SAA approach outperforms the SA approach under this condition on the parameter $\mu$.

4. Population Wasserstein barycenter

In the previous section, we aimed at computing the barycenter w.r.t. regularized OT. Now we drop the regularization of OT and compare the SA and the SAA approaches for the problem of the population Wasserstein barycenter defined w.r.t. OT:
\[
p^* = \arg\min_{p\in S_n(1)} \mathbb{E}_q W(p, q). \qquad (20)
\]
Throughout this section we use the following shorthand for the objective:
\[
W(p) \triangleq \mathbb{E}_q W(p, q).
\]
We notice that Proposition 2.5 is not completely valid for $W(p, q)$: $W(p, q)$ is not strongly convex in $p$ w.r.t. the 2-norm, but it is still Lipschitz continuous. We assume that the Lipschitz constants of $W(p, q)$ in the 1-norm and the 2-norm are essentially the same as for $W_\mu(p, q)$: $M_\infty$ and $M$ respectively.

4.1. Stochastic Approximation (SA)

Now we show how a proper choice of $\mu$ enables the application of the results of Section 3 to problem (20). Let $p^*$ be the solution of (20); then for any $p \in S_n(1)$ the following holds [24, 41, 50]:
\begin{align*}
\mathbb{E}_q W(p, q) - \mathbb{E}_q W(p^*, q) &\le \mathbb{E}_q W_\mu(p, q) - \mathbb{E}_q W_\mu(p^*, q) + 2\mu\ln n \\
&\le \mathbb{E}_q\big(W_\mu(p, q) - W_\mu(p^*_\mu, q)\big) + 2\mu\ln n. \qquad (21)
\end{align*}
Let us choose $\mu = \frac{\varepsilon}{4\ln n}$, which ensures
\[
\mathbb{E}_q W(p, q) - \mathbb{E}_q W(p^*, q) \le \mathbb{E}_q\big(W_\mu(p, q) - W_\mu(p^*_\mu, q)\big) + \varepsilon/2, \quad \forall p \in S_n(1).
\]
This means that solving problem (6) with precision $\varepsilon/2$, we get the solution of problem (20) with precision $\varepsilon$. The next theorem is a modification of Theorem 3.1 for problem (20).

Theorem 4.1. Let $\mu = \varepsilon/(2R^2)$ with $R^2 = 2\ln n$ and let $\hat p^N$ be the average of $N$ online outputs of Algorithm 1. Then, with probability $\ge 1 - \alpha$ we have
\[
\mathbb{E}_q\big[W(\hat p^N, q) - W(p^*, q)\big] = O\Big(\frac{M^2\ln(N/\alpha)\ln n}{\varepsilon N} + \delta\Big).
\]
Let Algorithm 1 run with $\delta = O(\varepsilon)$ and $N = O\big(\frac{M^2}{\varepsilon^2}\big)$. Then, with probability $\ge 1 - \alpha$ the following holds:
\[
\mathbb{E}_q\big[W(\hat p^N, q) - W(p^*, q)\big] \le \varepsilon.
\]
The total complexity of Algorithm 1 is
\[
O\left(\Big(\frac{Mn}{\varepsilon}\Big)^2\min\left\{\exp\Big(\frac{\|C\|_\infty\ln n}{\varepsilon}\Big)\Big(\frac{\|C\|_\infty\ln n}{\varepsilon} + \ln\Big(\frac{\|C\|_\infty\ln n}{\gamma\varepsilon^2}\Big)\Big),\ \frac{\sqrt{n}}{\gamma\varepsilon^3}\right\}\right).
\]

Next we provide another algorithm implementing the SA approach, which solves problem (20) directly, without regularization of OT.

4.1.1. Stochastic Mirror Descent

Let $d(p)$ be a distance generating function and let $D_d(t, p)$ be the Bregman divergence associated with $d(p)$:
\[
D_d(t, p) = d(t) - d(p) - \langle \nabla d(p), t - p\rangle.
\]
We consider stochastic mirror descent (MD) with the simplex setup (see, e.g., [31, 45, 49] for MD with exact oracle and, e.g., [23, 34] for MD with inexact oracle):^4
\[
p^{k+1} = \mathrm{Prox}_{p^k}\big(\eta\nabla^\delta_p W(p^k, q^k)\big), \qquad (22)
\]
where $\mathrm{Prox}_p(g)$ is the prox-mapping
\[
\mathrm{Prox}_p(g) = \arg\min_{t\in S_n(1)}\big(\langle g, t\rangle + D_d(t, p)\big) \qquad (23)
\]
and $\nabla^\delta_p W(p, q)$ is the gradient of $W(p, q)$ w.r.t. $p$ calculated with precision $\delta$:
\[
\|\nabla^\delta_p W(p, q) - \nabla_p W(p, q)\|_2 \le \delta, \quad \forall q \in S_n(1). \qquad (24)
\]
We take the negative entropy $d(p) = \sum_{j=1}^n p_j\ln p_j$ as the distance generating function; the corresponding Bregman divergence is then the Kullback–Leibler divergence. This setting ensures that the prox-mapping (23) and the iterative formula of MD (22) can be rewritten in the closed form described in Algorithm 2. We take the starting point $p^1 = \arg\min_{p\in S_n(1)} d(p) = (1/n, \ldots, 1/n)$ and $R^2 = \max_{p\in S_n(1)} d(p) - \min_{p\in S_n(1)} d(p) = \ln n$. We take the step size $\eta = \frac{\sqrt{2}R}{M_\infty\sqrt{N}}$ according to [45].

^4 By using the dual averaging scheme [47] we can rewrite Alg. 2 in an online regime [31, 49] without including $N$ in the step-size policy. Note that mirror descent and the dual averaging scheme are very close to each other [35].

Algorithm 2 Stochastic Mirror Descent

Input: starting point $p^1 = (1/n, \ldots, 1/n)^T$, number $N$ of measures $q^1, \ldots, q^N$, accuracy of gradient calculation $\delta$
1: $\eta = \frac{\sqrt{2\ln n}}{M_\infty\sqrt{N}}$
2: for $k = 1, \ldots, N$ do
3:   Calculate component-wise
\[
p^{k+1}_i = \frac{p^k_i\exp\big(-\eta\nabla^\delta_{p_i} W(p^k, q^k)\big)}{\sum_{j=1}^n p^k_j\exp\big(-\eta\nabla^\delta_{p_j} W(p^k, q^k)\big)},
\]
where the indices $i, j$ denote the $i$-th (or $j$-th) component of a vector and $\nabla^\delta_p W(p^k, q^k)$ is calculated with $\delta$-precision (24) (e.g., by the Simplex Method or an Interior Point Method)
Output: $\hat p^N = \frac{1}{N}\sum_{k=1}^N p^k$
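The following is a minimal Python sketch of Algorithm 2 (our illustration, not the authors' code). The exact gradient of $W(p, q)$ in $p$ is the dual variable (Kantorovich potential) of the marginal constraint $\pi\mathbf{1} = p$, here extracted from scipy's LP solver; we assume scipy >= 1.7 so that the dual marginals of the equality constraints are available, and we normalize the potential to have zero sum as in (4):

import numpy as np
from scipy.optimize import linprog

def ot_grad(p, q, C):
    """Exact W(p, q) and its gradient in p via the dual of the transport LP."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0       # row sums: sum_j pi_ij = p_i
        A_eq[n + i, i::n] = 1.0                # column sums: sum_i pi_ij = q_j
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    lam = res.eqlin.marginals[:n]              # sensitivity of optimal value w.r.t. p
    return res.fun, lam - lam.mean()           # normalization <lambda, 1> = 0

def mirror_descent(qs, C, M_inf):
    N, n = len(qs), C.shape[0]
    eta = np.sqrt(2 * np.log(n)) / (M_inf * np.sqrt(N))
    p = np.full(n, 1.0 / n)
    iterates = []
    for q in qs:
        _, g = ot_grad(p, q, C)
        w = p * np.exp(-eta * g)               # multiplicative (entropic) update
        p = w / w.sum()
        iterates.append(p)
    return np.mean(iterates, axis=0)           # output \hat p^N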

The next theorem estimates the complexity of Algorithm 2.

Theorem 4.2. Let $\hat p^N$ be the output of Algorithm 2 after processing $N$ measures. Then, with probability $\ge 1 - \alpha$ we have
\[
\mathbb{E}_q\big[W(\hat p^N, q) - W(p^*, q)\big] \le \frac{M_\infty\big(3R + 2D_1\sqrt{\ln(\alpha^{-1})}\big)}{\sqrt{2N}} + \delta D_1 = O\Big(\frac{M_\infty\sqrt{\ln(n/\alpha)}}{\sqrt{N}} + 2\delta\Big),
\]
where $R^2 = \mathrm{KL}(p^*, p^1) \le \ln n$ and $D_1 = \max_{p', p''\in S_n(1)}\|p' - p''\|_1 = 2$. Let Algorithm 2 run with $\delta = 0$ and $N = O(M_\infty^2/\varepsilon^2)$.^5 Then, with probability $\ge 1 - \alpha$ the following holds:
\[
\mathbb{E}_q\big[W(\hat p^N, q) - W(p^*, q)\big] \le \varepsilon.
\]
The total complexity of Algorithm 2 is
\[
O(n^3N) = O\left(n^3\Big(\frac{M_\infty R}{\varepsilon}\Big)^2\right) = O\left(n^3\Big(\frac{M_\infty}{\varepsilon}\Big)^2\right).
\]

^5 Notice that the Simplex Method gives the exact solution ($\delta = 0$).

Proof. For stochastic mirror descent with $d(p) = \sum_{j=1}^n p_j\ln p_j$ the following holds for any $p \in S_n(1)$:
\[
\eta\langle \nabla^\delta_p W(p^k, q^k), p^k - p\rangle \le \mathrm{KL}(p, p^k) - \mathrm{KL}(p, p^{k+1}) + \frac{\eta^2}{2}\|\nabla^\delta_p W(p^k, q^k)\|_\infty^2 \le \mathrm{KL}(p, p^k) - \mathrm{KL}(p, p^{k+1}) + \eta^2M_\infty^2.
\]
By adding and subtracting the terms $\langle \nabla_p W(p^k), p - p^k\rangle$ and $\langle \nabla^\delta_p W(p^k, q^k), p - p^k\rangle$, and using the Cauchy–Schwarz inequality, we get
\begin{align*}
\eta\langle \nabla_p W(p^k), p^k - p\rangle &\le \eta\langle \nabla_p W(p^k, q^k) - \nabla^\delta_p W(p^k, q^k), p^k - p\rangle + \eta\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^k - p\rangle \\
&\quad + \mathrm{KL}(p, p^k) - \mathrm{KL}(p, p^{k+1}) + \eta^2M_\infty^2 \\
&\le \eta\delta\max_{k=1,\ldots,N}\|p^k - p\|_1 + \eta\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^k - p\rangle \\
&\quad + \mathrm{KL}(p, p^k) - \mathrm{KL}(p, p^{k+1}) + \eta^2M_\infty^2.
\end{align*}
Summing this for $k = 1, \ldots, N$ and taking $p = p^*$, we get
\begin{align*}
\sum_{k=1}^N \eta\langle \nabla_p W(p^k), p^k - p^*\rangle &\le \mathrm{KL}(p^*, p^1) + \eta^2M_\infty^2N + \eta\delta N\max_{k=1,\ldots,N}\|p^k - p^*\|_1 \\
&\quad + \sum_{k=1}^N \eta\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^k - p^*\rangle \\
&\le R^2 + \eta^2M_\infty^2N + \eta\delta N D_1 + \sum_{k=1}^N \eta\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^k - p^*\rangle,
\end{align*}
where we used $\mathrm{KL}(p^*, p^1) \le R^2$ and $\max_{k=1,\ldots,N}\|p^k - p^*\|_1 \le D_1$. Then, using the convexity of $W(p)$ and the definition of the output $\hat p^N$, we have
\[
W(\hat p^N) - W(p^*) \le \frac{R^2}{\eta N} + \eta M_\infty^2 + \delta D_1 + \frac{1}{N}\sum_{k=1}^N\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^k - p^*\rangle.
\]
Next we use the Azuma–Hoeffding inequality [36] and get, for all $\beta \ge 0$,
\[
\mathbb{P}\left(\sum_{k=1}^N\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^k - p^*\rangle \le \beta\right) \ge 1 - \exp\Big(-\frac{2\beta^2}{N(2M_\infty D_1)^2}\Big) = 1 - \alpha,
\]
since $\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^* - p^k\rangle$ is a martingale-difference sequence and
\[
\big|\langle \nabla_p W(p^k) - \nabla_p W(p^k, q^k), p^* - p^k\rangle\big| \le \|\nabla_p W(p^k) - \nabla_p W(p^k, q^k)\|_\infty\|p^* - p^k\|_1 \le 2M_\infty\max_{k=1,\ldots,N}\|p^k - p^*\|_1 \le 2M_\infty D_1.
\]
Hence, with probability $\ge 1 - \alpha$ the following holds:
\[
W(\hat p^N) - W(p^*) \le \frac{R^2}{\eta N} + \eta M_\infty^2 + \delta D_1 + \frac{\beta}{N}. \qquad (25)
\]
Expressing $\beta$ through $\alpha$ and substituting $\eta = \frac{R}{M_\infty}\sqrt{\frac{2}{N}}$, which minimizes the r.h.s. of (25) in $\eta$, we get
\[
W(\hat p^N) - W(p^*) \le \frac{M_\infty R}{\sqrt{2N}} + \frac{M_\infty R\sqrt{2}}{\sqrt{N}} + \delta D_1 + \frac{M_\infty D_1\sqrt{2\ln\frac{1}{\alpha}}}{\sqrt{N}} \le \frac{M_\infty\big(3R + 2D_1\sqrt{\ln\frac{1}{\alpha}}\big)}{\sqrt{2N}} + \delta D_1.
\]
Using $R \le \sqrt{\ln n}$ and $D_1 \le 2$, we have
\[
W(\hat p^N) - W(p^*) \le \frac{M_\infty\big(3\sqrt{\ln n} + 4\sqrt{\ln\frac{1}{\alpha}}\big)}{\sqrt{2N}} + 2\delta. \qquad (26)
\]
Squaring the sum in (26), using the Cauchy–Schwarz inequality and then extracting the root, we get the first statement of the theorem:
\[
W(\hat p^N) - W(p^*) \le \frac{M_\infty\sqrt{6\ln n + 8\ln\frac{1}{\alpha}}}{\sqrt{2N}} + 2\delta = O\Big(\frac{M_\infty\sqrt{\ln(n/\alpha)}}{\sqrt{N}} + 2\delta\Big).
\]
The second statement of the theorem follows directly from this and the condition $W(\hat p^N) - W(p^*) \le \varepsilon$. To get the complexity bound, we notice that the complexity of the 'exact' calculation of $\nabla_p W(p^k, q^k)$ is $O(n^3)$ (see [1, 14, 15, 22] and references therein); multiplying this by $N$, we get the last statement of the theorem.

We notice that the complexity bound for Algorithm 2 is $O(\sqrt{n})$-times better than the bound for Algorithm 1 with the Euclidean setup.

4.2. Sample Average Approximation (SAA)

Similarly to the SA approach for problem (20), we use the regularization parameter $\mu = \frac{\varepsilon}{4\ln n}$ in Theorem 3.2 and formulate the following theorem.

Theorem 4.3. Let $\mu = \varepsilon/(2R^2)$ with $R^2 = 2\ln n$ and let $p_{\varepsilon'}$ satisfy (12) with precision $\varepsilon'$, where $m$ is the number of measures in the empirical average (11). Then, with probability $\ge 1 - \alpha$ we have
\[
\mathbb{E}_q\big[W_\mu(p_{\varepsilon'}, q) - W_\mu(p^*_\mu, q)\big] \le \sqrt{\frac{4M^2\ln n}{\varepsilon}\varepsilon'} + \frac{8M^2\ln n}{\alpha\varepsilon m}.
\]
Let $\varepsilon' = O\big(\frac{\varepsilon^3}{M^2}\big)$ and $m = O\big(\frac{M^2}{\alpha\varepsilon^2}\big)$. Then, with probability $\ge 1 - \alpha$ the following holds:
\[
\mathbb{E}_q\big[W_\mu(p_{\varepsilon'}, q) - W_\mu(p^*_\mu, q)\big] \le \varepsilon.
\]
The total complexity of the offline algorithm from [41] computing $p_{\varepsilon'}$ that satisfies (12) on one machine (without parallelization/decentralization) is
\[
O\left(\frac{1}{\sqrt{\alpha}}\frac{n^2M^3R^2}{\varepsilon^3}\right) = O\left(\frac{1}{\sqrt{\alpha}}\frac{n^2M^3}{\varepsilon^3}\right).
\]

5. Penalized Wasserstein barycenters

In this section, we consider an alternative regularization of problem (20), obtained by adding a strongly convex penalty:
\[
\min_{p\in S_n(1)}\{\mathbb{E}_q W(p, q) + \lambda r(p)\}, \qquad (27)
\]
where $r(p): S_n(1)\to\mathbb{R}_+$ is a strongly convex penalty function and $\lambda > 0$. The SAA approach suggests approximating this problem (27) by its empirical counterpart
\[
\hat p^m_\lambda = \arg\min_{p\in S_n(1)}\Big\{\frac{1}{m}\sum_{k=1}^m W(p, q^k) + \lambda r(p)\Big\}.
\]
The standard choice of $r(p)$ is the squared norm penalty $\frac{1}{2}\|p - p^1\|_2^2$ ($p^1$ is some vector from $S_n(1)$) [54]. In this case, the objective in (27) under the expectation is $\lambda$-strongly convex. This allows us to apply Theorem 3.2, replacing $\mu$ by $\lambda$ and the Lipschitz constant $M$ by $M + \lambda R$, where $R = \max_{p\in S_n(1)}\|p - p^1\|_2 \le \sqrt{2}$.^6 Considering $m$ to be big enough, we choose $\lambda = \sqrt{\frac{8M^2}{\alpha R^2m}}$ [54] and obtain the following with probability $\ge 1 - \alpha$:
\[
W(p_{\varepsilon'}) - W(p^*) = O\left(\sqrt{MR\sqrt{m}\,\varepsilon'} + \sqrt{\frac{M^2R^2}{\alpha m}}\right), \qquad (28)
\]
where $p_{\varepsilon'}$ satisfies the following inequality:
\[
\frac{1}{m}\sum_{k=1}^m W(p_{\varepsilon'}, q^k) + \lambda r(p_{\varepsilon'}) - \Big(\frac{1}{m}\sum_{k=1}^m W(\hat p^m_\lambda, q^k) + \lambda r(\hat p^m_\lambda)\Big) \le \varepsilon'. \qquad (29)
\]

^6 Note that in [54], simply $M$ was used instead of $(M + \lambda R)$. For the moment we do not know how to justify this replacement; that is why we write $(M + \lambda R)$. Fortunately, when $m$ is big enough ($\lambda \sim 1/\sqrt{m}$ is small enough), it does not matter.

Next, we present our new regularizer $r(p)$ in (27), which improves upon the results with the standard penalization $\frac{1}{2}\|p - p^1\|_2^2$.

5.1. New regularization for Wasserstein barycenter

Consider the Bregman divergence $B_d(p, p^1)$,
\[
B_d(p, p^1) = d(p) - d(p^1) - \langle \nabla d(p^1), p - p^1\rangle,
\]
with the distance generating function [3]
\[
d(p) = \frac{1}{2(a-1)}\|p\|_a^2, \qquad a = 1 + \frac{1}{2\ln n}.
\]
Then we choose $r(p) = B_d(p, p^1)$, which leads to the following problem:
\[
\min_{p\in S_n(1)}\{\mathbb{E}_q W(p, q) + \lambda B_d(p, p^1)\}. \qquad (30)
\]
$B_d(p, p^1)$ is 1-strongly convex in the 1-norm and $O(1)$-Lipschitz continuous in the 1-norm on $S_n(1)$. One of the advantages of this penalization compared to the negative-entropy penalization proposed in [2, 6] is that we get an upper bound on the Lipschitz constant, while the strong convexity in the 1-norm on $S_n(1)$ remains the same.
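A minimal Python sketch of this regularizer (our direct transcription of the definitions above, not code from the paper):

import numpy as np

def d(p, n):
    """Distance generating function d(p) = ||p||_a^2 / (2(a-1)), a = 1 + 1/(2 ln n) [3]."""
    a = 1.0 + 1.0 / (2.0 * np.log(n))
    return np.sum(p ** a) ** (2.0 / a) / (2.0 * (a - 1.0))

def grad_d(p, n):
    """Gradient of d on the simplex: grad_i = p_i^(a-1) ||p||_a^(2-a) / (a-1)."""
    a = 1.0 + 1.0 / (2.0 * np.log(n))
    norm_a = np.sum(p ** a) ** (1.0 / a)
    return norm_a ** (2.0 - a) * p ** (a - 1.0) / (a - 1.0)

def bregman(p, p1, n):
    """B_d(p, p^1) = d(p) - d(p^1) - <grad d(p^1), p - p^1>."""
    return d(p, n) - d(p1, n) - grad_d(p1, n) @ (p - p1)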

We consider the empirical counterpart of problem (30) and redefine $\hat p^m_\lambda$ as follows:^7
\[
\hat p^m_\lambda = \arg\min_{p\in S_n(1)}\Big\{\frac{1}{m}\sum_{k=1}^m W(p, q^k) + \lambda B_d(p, p^1)\Big\} \qquad (31)
\]
and assume that $p_{\varepsilon'}$ is such that
\[
\frac{1}{m}\sum_{k=1}^m W(p_{\varepsilon'}, q^k) + \lambda B_d(p_{\varepsilon'}, p^1) - \Big(\frac{1}{m}\sum_{k=1}^m W(\hat p^m_\lambda, q^k) + \lambda B_d(\hat p^m_\lambda, p^1)\Big) \le \varepsilon'. \qquad (32)
\]
We summarize the result in the next theorem.

^7 Note that to solve (31) we may use the same dual distributed tricks as in [40] if we put the composite term in a separate node. But first, we should regularize $W(p, q)$ with $\mu = \frac{\varepsilon}{4\ln n}$. The complexity in terms of $O(\cdot)$ will be the same as in Theorem 3.2. The dual function for $B_d(p, p^1)$ can be calculated with complexity $O(n)$ [25].

Theorem 5.1. Let $R \simeq \max_{p\in S_n(1)} B_d(p, p^1) = O(\ln n)$, $\lambda = \sqrt{\frac{8M_\infty^2}{\alpha R^2m}}$ (where $m$ is big enough), and let $p_{\varepsilon'}$ satisfy (32) with accuracy $\varepsilon'$.^8 Then, with probability $\ge 1 - \alpha$ we have
\[
\mathbb{E}_q\big[W(p_{\varepsilon'}, q) - W(p^*, q)\big] = O\left(\sqrt{M_\infty R\sqrt{m}\,\varepsilon'} + \sqrt{\frac{M_\infty^2R^2}{\alpha m}}\right).
\]
Let
\[
\varepsilon' = O\left(\frac{\varepsilon^2}{M_\infty R\sqrt{m}}\right) = O\left(\frac{\varepsilon^3\sqrt{\alpha}}{M_\infty^2}\right) \quad \text{and} \quad m = O\left(\frac{M_\infty^2R^2}{\alpha\varepsilon^2}\right) = O\left(\frac{M_\infty^2}{\alpha\varepsilon^2}\right).
\]
Then, with probability $\ge 1 - \alpha$ the following holds:
\[
\mathbb{E}_q\big[W(p_{\varepsilon'}, q) - W(p^*, q)\big] \le \varepsilon.
\]
The total complexity of the properly corrected algorithm from [41] computing $p_{\varepsilon'}$ on one machine (without parallelization/decentralization) that satisfies (12) is^9
\[
O\left(n^2\sqrt{\frac{mM^2}{\mu\varepsilon'}}\right) = O\left(\frac{n^{2.5}}{\alpha^{0.75}}\Big(\frac{M_\infty}{\varepsilon}\Big)^3\right).
\]

^8 See [4, Lemma 6.1].
^9 See (19). Recall that $\mu = \frac{\varepsilon}{2R^2}$, $R^2 = 2\ln n$. Note also that after substituting $m$, $\lambda$ becomes $\lambda = O(\varepsilon/R^2) = O(\mu)$.

Proof. Let us define $f(p, q) \triangleq W(p, q) + \lambda B_d(p, p^1)$ and $F(p) \triangleq \mathbb{E}_q[f(p, q)]$. Note that $f(p, q)$ is $\lambda$-strongly convex in $p$ w.r.t. the 1-norm, since $B_d(p, p^1)$ is 1-strongly convex w.r.t. the 1-norm. Moreover, $f(p, q)$ is $M_f$-Lipschitz continuous in $p \in S_n(1)$ with $M_f \triangleq M_\infty + \lambda R$ by definition. Therefore, for $f(p, q)$ we can apply [54, Theorem 6] (formulated in the 2-norm but also valid for the 1-norm), which states that with probability $\ge 1 - \alpha$
\[
F(\hat p^m_\lambda) - F(p^*) \le \frac{4M_f^2}{\alpha\lambda m} = \frac{4(M_\infty + \lambda R)^2}{\alpha\lambda m}. \qquad (33)
\]
Denoting the empirical average by $\hat F(p) \triangleq \frac{1}{m}\sum_{k=1}^m W(p, q^k) + \lambda B_d(p, p^1)$, we get the following consequence of (33):
\[
F(p) - F(p^*) \le \sqrt{\frac{2(M_\infty + \lambda R)^2}{\lambda}\big(\hat F(p) - \hat F(\hat p^m_\lambda)\big)} + \frac{4(M_\infty + \lambda R)^2}{\alpha\lambda m}.
\]
Therefore, from this we get
\begin{align*}
W(p_{\varepsilon'}) - W(p^*) &\le \sqrt{\frac{2(M_\infty + \lambda R)^2\varepsilon'}{\lambda}} + \frac{4(M_\infty + \lambda R)^2}{\alpha\lambda m} - \lambda B_d(p_{\varepsilon'}, p^1) + \lambda B_d(p^*, p^1) \\
&\le \sqrt{\frac{2(M_\infty + \lambda R)^2\varepsilon'}{\lambda}} + \frac{4(M_\infty + \lambda R)^2}{\alpha\lambda m} + \lambda R.
\end{align*}
Choosing $\lambda \simeq \sqrt{\frac{8M_\infty^2}{\alpha R^2m}}$, we get
\[
W(p_{\varepsilon'}) - W(p^*) = O\left(\sqrt{M_\infty R\sqrt{m}\,\varepsilon'} + \sqrt{\frac{M_\infty^2R^2}{\alpha m}}\right). \qquad (34)
\]
The other statements follow from this and the condition $W(p_{\varepsilon'}) - W(p^*) \le \varepsilon$.

Since $M/M_\infty \simeq \sqrt{n}$, we may conclude that (34) is $O(\sqrt{n})$-times better than (28).

From the recent results of [19] we may expect that the dependence on $\alpha$ in Theorems 4.3 and 5.1 is in fact much better (logarithmic). For the moment we do not possess an accurate proof of this, but we suspect that the ideas from [19] allow proving it.

5.2. Comparison of SA and SAA for the population Wasserstein barycenter problem

Now we compare the SA and the SAA approaches for problem (20). Table 2 presents the total complexity of the numerical algorithms from Sections 4 and 5 implementing the SA and SAA approaches. We estimate $M_\infty = O(\|C\|_\infty)$ and $M \le \sqrt{n}M_\infty$ in the complexity bounds by using Proposition 2.5.

Table 2. Total complexity of the SA and SAA implementations for the problem $\min_{p\in S_n(1)} \mathbb{E}_q W(p, q)$.

Algorithm 1 (SA) with $\mu = \frac{\varepsilon}{4\ln n}$: $O\left(n^3\Big(\frac{\|C\|_\infty}{\varepsilon}\Big)^2\min\left\{\exp\Big(\frac{\|C\|_\infty\ln n}{\varepsilon}\Big)\Big(\frac{\|C\|_\infty\ln n}{\varepsilon} + \ln\Big(\frac{\|C\|_\infty\ln n}{\gamma\varepsilon^2}\Big)\Big),\ \frac{\sqrt{n}}{\gamma\varepsilon^3}\right\}\right)$

Algorithm from [41] (SAA) with $\mu = \frac{\varepsilon}{4\ln n}$: $O\left(n^{3.5}\Big(\frac{\|C\|_\infty}{\varepsilon}\Big)^3\right)$

Algorithm 2 (SA): $O\left(n^3\Big(\frac{\|C\|_\infty}{\varepsilon}\Big)^2\right)$

Regularized ERM (SAA): $O\left(n^{2.5}\Big(\frac{\|C\|_\infty}{\varepsilon}\Big)^3\right)$

We do not draw any conclusion about the comparison of Algorithm 2 and Regularized ERM (empirical risk minimization from Section 5 with our new regularization), since it depends on the comparison of $\sqrt{n}$ and $1/\varepsilon$. However, both of these methods definitely outperform (according to the complexity results) Algorithm 1 and the algorithm from [41], i.e., the approaches based on entropic regularization of OT with proper $\mu$.


Figure 1. Quality of the estimate for the population Wasserstein barycenter w.r.t. regularized OT.

6. Numerical Experiments

Now we present experiments performed on the MNIST data set to support the results of Section 3. Each image from the MNIST data set is a hand-written digit with a value from 0 to 9, of size 28 × 28 pixels. We are interested in the convergence of the estimated barycenter of the digit 3 to its true counterpart. Since the true population barycenter $p^*_\mu$ of all hand-written digits 3 is unknown, we approximate it by the barycenter of all digits 3 in MNIST (7141 images), and then we study the convergence of the estimated barycenter to $p^*_\mu$ as the number of measures grows. We estimate $p^*_\mu$ by Iterative Bregman Projections, since they showed relatively good results. Figure 1 compares Algorithm 1 (the SA approach) and Iterative Bregman Projections (the SAA approach) [5] in two metrics: the convergence in the 2-norm, $\|p - p^*_\mu\|_2$ (Theorems 3.1 and 3.2), and the convergence in the optimal transport distance $W_\mu(p, p^*_\mu)$. The entropy regularization parameter is set to $\mu = 0.01$. Despite the fact that our results guarantee the convergence only in the 2-norm, the OT distance is a natural metric for comparing two measures (the 'true' barycenter and its estimate).

We do not provide any experiments for Section 4, since we cannot exactly calculate the decrease in the function value $W(p) - W(p^*)$ studied there. Recall that $W(p) = \mathbb{E}_q W(p, q)$. Thus, we limit ourselves to supporting the results on convergence in argument: $\|p - p^*_\mu\|_2$ (Section 3).
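For reproducibility, here is a hedged sketch of such an experiment in Python (our own outline, not the authors' code): `load_mnist_threes` is an assumed data-loading helper returning 28 × 28 digit images, which we normalize into histograms from the interior of $S_n(1)$ with $n = 784$ and feed to the `online_sgd` routine sketched in Section 3.1:

import numpy as np

def squared_distance_cost(side=28):
    """Ground cost C_ij = squared Euclidean distance between pixel locations."""
    xs, ys = np.meshgrid(np.arange(side), np.arange(side))
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    diff = grid[:, None, :] - grid[None, :, :]
    return np.sum(diff ** 2, axis=2)

def to_histogram(img, noise=1e-6):
    """Flatten an image into the interior of the simplex (cf. Section 2.2)."""
    p = img.ravel().astype(float) + noise
    return p / p.sum()

# images = load_mnist_threes()          # assumed data-loading helper
# C = squared_distance_cost(28)
# sampler = lambda: to_histogram(images[np.random.randint(len(images))])
# p_hat = online_sgd(sampler, C, mu=0.01, n_steps=1000)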

Acknowledgements

The work was funded by RFBR, project number 19-31-51001, by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) number 075-00337-20-03, project No. 0714-2020-0005, and by the Russian Science Foundation, project no. 18-71-10108. We are grateful to Alexander Gasnikov and Vladimir Spokoiny, who initiated this research. We also thank Pavel Dvurechensky and Eduard Gorbunov for fruitful discussions, and Ohad Shamir for a useful reference.


References

[1] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin, Network flows: Theory, Algorithms, and Applications (1993).
[2] M. Ballu, Q. Berthet, and F. Bach, Stochastic optimization for regularized Wasserstein estimators, arXiv preprint arXiv:2002.08695 (2020).
[3] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization (Lecture Notes), personal web-page of A. Nemirovski, 2015. Available at http://www2.isye.gatech.edu/~nemirovs/Lect_ModConvOpt.pdf.
[4] A. Ben-Tal and A. Nemirovski, The conjugate barrier mirror descent method for non-smooth convex optimization, Minerva Optimization Center, Technion Institute of Technology (1999).
[5] J.D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré, Iterative Bregman projections for regularized transportation problems, SIAM Journal on Scientific Computing 37 (2015), pp. A1111–A1138.
[6] J. Bigot, E. Cazelles, and N. Papadakis, Penalized barycenters in the Wasserstein space (2017).
[7] J. Bigot, E. Cazelles, and N. Papadakis, Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications (2019).
[8] J. Bigot, R. Gouet, T. Klein, A. Lopez, et al., Upper and lower risk bounds for estimating the Wasserstein barycenter of random measures on the real line, Electronic Journal of Statistics 12 (2018), pp. 2253–2289.
[9] J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford, Towards optimal running times for optimal transport, arXiv preprint arXiv:1810.07717 (2018).
[10] E. Boissard, T. Le Gouic, J.M. Loubes, et al., Distributions template estimate with Wasserstein metrics, Bernoulli 21 (2015), pp. 740–759.
[11] S. Chewi, T. Maunu, P. Rigollet, and A.J. Stromme, Gradient descent algorithms for Bures–Wasserstein barycenters, arXiv preprint arXiv:2001.01700 (2020).
[12] M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, in Advances in Neural Information Processing Systems 26, C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds., Curran Associates, Inc., 2013, pp. 2292–2300. Available at http://papers.nips.cc/paper/4927-sinkhorn-distances-lightspeed-computation-of-optimal-transport.pdf.
[13] M. Cuturi and G. Peyré, A smoothed dual approach for variational Wasserstein problems, SIAM Journal on Imaging Sciences 9 (2016), pp. 320–343.
[14] D. Dadush and S. Huiberts, A friendly smoothed analysis of the simplex method, in Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, ACM, 2018, pp. 390–403.
[15] Y. Dong, Y. Gao, R. Peng, I. Razenshteyn, and S. Sawlani, A study of performance of optimal transport, arXiv preprint arXiv:2005.01182 (2020).
[16] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, Efficient projections onto the l1-ball for learning in high dimensions, in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 272–279.
[17] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich, Decentralize and randomize: Faster algorithm for Wasserstein barycenters, in Advances in Neural Information Processing Systems, 2018, pp. 10760–10770.
[18] P. Dvurechensky, A. Gasnikov, and A. Kroshnin, Computational Optimal Transport: Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn's Algorithm, in Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause, eds., Vol. 80, 2018, pp. 1367–1376. arXiv:1802.04367.
[19] V. Feldman and J. Vondrak, High probability generalization bounds for uniformly stable algorithms with nearly optimal rate, arXiv preprint arXiv:1902.10710 (2019).
[20] J. Franklin and J. Lorenz, On the scaling of multidimensional matrices, Linear Algebra and its Applications 114 (1989), pp. 717–735. Available at http://www.sciencedirect.com/science/article/pii/0024379589904904, Special Issue Dedicated to Alan J. Hoffman.

[21] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T.A. Poggio, Learning with a Wasserstein loss, in Advances in Neural Information Processing Systems, 2015, pp. 2053–2061.
[22] H.N. Gabow and R.E. Tarjan, Faster scaling algorithms for general graph matching problems, Journal of the ACM (JACM) 38 (1991), pp. 815–853.
[23] A.V. Gasnikov, A.A. Lagunovskaya, I.N. Usmanova, and F.A. Fedorenko, Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex, Automation and Remote Control 77 (2016), pp. 2018–2034. Available at http://dx.doi.org/10.1134/S0005117916110114, arXiv:1412.3890.
[24] A. Gasnikov, P. Dvurechensky, D. Kamzolov, Y. Nesterov, V. Spokoiny, P. Stetsyuk, A. Suvorikova, and A. Chernov, Universal method with inexact oracle and its applications for searching equilibriums in multistage transport problems, arXiv preprint arXiv:1506.00292 (2015).
[25] A. Gasnikov, D. Kamzolov, and M. Mendel, Universal composite prox-method for strictly convex optimization problems, arXiv preprint arXiv:1603.07701 (2016).
[26] A. Genevay, G. Peyré, and M. Cuturi, Learning generative models with Sinkhorn divergences, arXiv preprint arXiv:1706.00292 (2017).
[27] T.L. Gouic, Q. Paris, P. Rigollet, and A.J. Stromme, Fast convergence of empirical barycenters in Alexandrov spaces and the Wasserstein space, arXiv preprint arXiv:1908.00828 (2019).
[28] A. Gramfort, G. Peyré, and M. Cuturi, Fast optimal transport averaging of neuroimaging data, in International Conference on Information Processing in Medical Imaging, Springer, 2015, pp. 261–272.
[29] V. Guigues, A. Juditsky, and A. Nemirovski, Non-asymptotic confidence bounds for the optimal value of a stochastic program, Optimization Methods and Software 32 (2017), pp. 1033–1058. Available at https://doi.org/10.1080/10556788.2017.1350177.
[30] S. Guminov, P. Dvurechensky, N. Tupitsa, and A. Gasnikov, Accelerated alternating minimization, arXiv preprint arXiv:1906.03622 (2019).
[31] E. Hazan, et al., Introduction to online convex optimization, Foundations and Trends in Optimization 2 (2019), pp. 157–325.
[32] E. Hazan, A. Kalai, S. Kale, and A. Agarwal, Logarithmic regret algorithms for online convex optimization, in International Conference on Computational Learning Theory, Springer, 2006, pp. 499–513.
[33] A. Jambulapati, A. Sidford, and K. Tian, A direct O(1/ε) iteration parallel algorithm for optimal transport, arXiv preprint arXiv:1906.00618 (2019).
[34] A. Juditsky and A. Nemirovski, First order methods for nonsmooth convex large-scale optimization, I: General purpose methods, in Optimization for Machine Learning, chap. 5 (2012).
[35] A. Juditsky, J. Kwon, and E. Moulines, Unifying mirror descent and dual averaging, arXiv preprint arXiv:1910.13742 (2019).
[36] A. Juditsky, P. Rigollet, A.B. Tsybakov, et al., Learning by mirror averaging, The Annals of Statistics 36 (2008), pp. 2183–2206.
[37] S. Kakade, S. Shalev-Shwartz, and A. Tewari, On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization, Unpublished Manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf (2009).
[38] S.M. Kakade and A. Tewari, On the generalization ability of online strongly convex programming algorithms, in Advances in Neural Information Processing Systems, 2009, pp. 801–808.
[39] M. Klatt, C. Tameling, and A. Munk, Empirical regularized optimal transport: Statistical theory and applications, arXiv preprint arXiv:1810.09880 (2018).
[40] A. Kroshnin, V. Spokoiny, and A. Suvorikova, Statistical inference for Bures–Wasserstein barycenters, arXiv preprint arXiv:1901.00226 (2019).
[41] A. Kroshnin, N. Tupitsa, D. Dvinskikh, P. Dvurechensky, A. Gasnikov, and C. Uribe, On the Complexity of Approximating Wasserstein Barycenters, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., Vol. 97, 2019, pp. 3530–3540. arXiv:1901.08686.

[42] T. Le Gouic and J.M. Loubes, Existence and consistency of Wasserstein barycenters, Probability Theory and Related Fields 168 (2017), pp. 901–917.
[43] Y.T. Lee and A. Sidford, Path Finding Methods for Linear Programming: Solving Linear Programs in O(√rank) Iterations and Faster Algorithms for Maximum Flow, in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, Oct. 2014, pp. 424–433.
[44] G. Mena and J. Weed, Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem, arXiv preprint arXiv:1905.11882 (2019).
[45] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization 19 (2009), pp. 1574–1609. Available at https://doi.org/10.1137/070704277.
[46] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming 103 (2005), pp. 127–152. Available at http://dx.doi.org/10.1007/s10107-004-0552-5.
[47] Y. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical Programming 120 (2009), pp. 221–259. Available at https://doi.org/10.1007/s10107-007-0149-x. First appeared in 2005 as CORE discussion paper 2005/67.
[48] Y. Nesterov, Lectures on Convex Optimization, Vol. 137, Springer, 2018.
[49] F. Orabona, A modern introduction to online learning, arXiv preprint arXiv:1912.13213 (2019).
[50] G. Peyré, M. Cuturi, et al., Computational optimal transport, Foundations and Trends in Machine Learning 11 (2019), pp. 355–607.
[51] K. Quanrud, Approximating optimal transport with linear programs, arXiv preprint arXiv:1810.05957 (2018).
[52] J. Rabin and N. Papadakis, Convex color image segmentation with optimal transport distances, in International Conference on Scale Space and Variational Methods in Computer Vision, Springer, 2015, pp. 256–269.
[53] A. Rolet, M. Cuturi, and G. Peyré, Fast dictionary learning with a smoothed Wasserstein loss, in Artificial Intelligence and Statistics, 2016, pp. 630–638.
[54] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, Stochastic Convex Optimization, in COLT, 2009.
[55] A. Shapiro and A. Nemirovski, On complexity of stochastic programming problems, in Continuous Optimization, Springer, 2005, pp. 111–146.
[56] F.S. Stonyakin, D. Dvinskikh, P. Dvurechensky, A. Kroshnin, O. Kuznetsova, A. Agafonov, A. Gasnikov, A. Tyurin, C.A. Uribe, D. Pasechnyuk, and S. Artamonov, Gradient Methods for Problems with Inexact Model of the Objective, in Mathematical Optimization Theory and Operations Research, M. Khachay, Y. Kochetov, and P. Pardalos, eds., Cham, Springer International Publishing, 2019, pp. 97–114. arXiv:1902.09001.
[57] R.E. Tarjan, Dynamic trees as search trees via Euler tours, applied to the network simplex algorithm, Mathematical Programming 78 (1997), pp. 169–177.
[58] C. Villani, Optimal Transport: Old and New, Vol. 338, Springer Science & Business Media, 2008.