
Computational Optimization and Applications
https://doi.org/10.1007/s10589-020-00179-x

    Algorithms for stochastic optimization with function or expectation constraints

    Guanghui Lan1  · Zhiqiang Zhou1

    Received: 8 August 2019 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract This paper considers the problem of minimizing an expectation function over a closed convex set, coupled with a function or expectation constraint on either decision variables or problem parameters. We first present a new stochastic approximation (SA) type algorithm, namely the cooperative SA (CSA), to handle problems with the constraint on decision variables. We show that this algorithm exhibits the optimal O(1/ε²) rate of convergence, in terms of both optimality gap and constraint violation, when the objective and constraint functions are generally convex, where ε denotes the optimality gap and infeasibility. Moreover, we show that this rate of convergence can be improved to O(1/ε) if the objective and constraint functions are strongly convex. We then present a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) algorithm, to deal with the situation when the constraint is defined over problem parameters, and show that it exhibits a similar optimal rate of convergence to CSA. It is worth noting that CSA and CSPA are primal methods which require neither iterations in the dual space nor estimation of the size of the dual variables. To the best of our knowledge, this is the first time that such optimal SA methods for solving function or expectation constrained stochastic optimization are presented in the literature.

    Keywords Convex programming · Stochastic optimization · Complexity · Subgradient method

Part of the results were first presented at the Annual INFORMS meeting in October 2015, https://informs.emeetingsonline.com/emeetings/formbuilder/clustersessiondtl.asp?csnno=24236&mmnno=272&ppnno=91687, and summarized in a previous version entitled “Algorithms for stochastic optimization with expectation constraints” in 2016. Guanghui Lan has been supported by NSF CMMI 1637474.

    * Guanghui Lan [email protected]

    Zhiqiang Zhou [email protected]

    1 H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA



Mathematics Subject Classification 90C25 · 90C06 · 90C22 · 49M37

    1 Introduction

In this paper, we study two related stochastic programming (SP) problems with function or expectation constraints. The first one is a classical SP problem with a function constraint over the decision variables, formally defined as

where X ⊆ ℝⁿ is a convex compact set, ξ is a random vector supported on P ⊆ ℝᵖ, and F(x, ξ): X × P → ℝ and g(x): X → ℝ are closed convex functions w.r.t. x for a.e. ξ ∈ P. Moreover, we assume that ξ is independent of x. Under these assumptions, (1.1) is a convex optimization problem.

    In particular, the constraint function g(x) in problem (1.1) can be given in the form of expectation as

where G(x, ξ): X × P → ℝ is closed and convex w.r.t. x for a.e. ξ ∈ P. Such problems have many applications in operations research, finance and data analysis. One motivating example is SP with a conditional value at risk (CVaR) constraint. In an important work [30], Rockafellar and Uryasev show that a class of asset allocation problems can be modeled as (1.3), where ξ denotes the random return with mean μ = E[ξ]. Expectation constraints also play an important role in providing tight convex approximations to chance constrained problems (e.g., Nemirovski and Shapiro [23]). Some other important applications of (1.1) can be found in semi-supervised learning (see, e.g., [6]). For example, one can use the objective function to define the fidelity of the model for the labelled data, while using the constraint to enforce some other properties of the model for the unlabelled data (e.g., proximity for data with similar features).

While problem (1.1) covers a wide class of problems with constraints over the decision variables, in practice we often encounter situations where the constraint is defined over the problem parameters. Under these circumstances our goal is to find a pair of parameters x* and decision variables y*(x*) such that

(1.1) min f(x) := E[F(x, ξ)]
s.t. g(x) ≤ 0,
x ∈ X,

(1.2) g(x) := E_ξ[G(x, ξ)],

(1.3) min_{x,τ} −μᵀx
s.t. τ + (1/λ) E{[−ξᵀx − τ]₊} ≤ 0, Σ_{i=1}^n x_i = 1, x ≥ 0,
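The CVaR constraint in (1.3) is straightforward to estimate by Monte Carlo. The sketch below is illustrative only and not from the paper: the risk level `lam`, the normal return distributions and all numeric values are assumptions made for the example. It evaluates τ + (1/λ)E{[−ξᵀx − τ]₊} by sample averaging:

```python
import random

def cvar_constraint(x, tau, lam, samples):
    """Monte Carlo estimate of the left-hand side of the CVaR
    constraint in (1.3): tau + (1/lam) * E[ (-xi^T x - tau)_+ ]."""
    acc = 0.0
    for xi in samples:
        shortfall = -sum(c * w for c, w in zip(xi, x)) - tau
        acc += max(shortfall, 0.0)
    return tau + acc / (lam * len(samples))

random.seed(0)
# two assets with independent normal returns (illustrative assumption)
samples = [(random.gauss(0.05, 0.10), random.gauss(0.03, 0.05))
           for _ in range(20000)]
x = (0.5, 0.5)  # portfolio weights, summing to one as required in (1.3)
val = cvar_constraint(x, tau=0.1, lam=0.05, samples=samples)
```

A point x is acceptable for (1.3) when this estimate is nonpositive; the same sample can be reused to compare candidate portfolios.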

(1.4) y*(x*) ∈ Argmin_{y∈Y} {φ(x*, y) := E[Φ(x*, y, ζ)]},


Here Φ(x, y, ζ) is convex w.r.t. y for a.e. ζ ∈ Q ⊆ ℝ^q but possibly nonconvex w.r.t. (x, y) jointly, and g(·) is convex w.r.t. x. Moreover, we assume that ξ is independent of x and y, while ζ is not necessarily independent of x*. Note that (1.4)–(1.5) define a pair of optimization and feasibility problems coupled in the following ways: (a) the solution to (1.5) defines an admissible parameter of (1.4); (b) ζ can be a random variable with probability distribution parameterized by x*.

Problem (1.4)–(1.5) also has many applications, especially in data analysis. One such example is to learn a classifier w under a certain metric Ā using the support vector machine model in (1.6)–(1.7), where l(w; (u, v)) = max{0, 1 − v⟨w, u⟩} denotes the hinge loss function, u, u_i, u_j ∈ ℝⁿ, v, v_i, v_j ∈ {+1, −1}, and b_ij ∈ ℝ are random variables satisfying certain probability distributions, and λ, C > 0 are given parameters. In this problem, (1.6) is used to learn the classifier w by using the metric Ā satisfying certain requirements in (1.7), including the low rank (or nuclear norm) assumption. Problem (1.4)–(1.5) can also be used in some data-driven applications, where one can use (1.5) to specify the parameters of the probabilistic models associated with the random variable ζ, as well as in some other applications of multi-objective stochastic optimization.

In spite of its wide applicability, the study of efficient solution methods for expectation constrained optimization is still limited. For the sake of simplicity, suppose for now that ξ is given as a deterministic vector and hence that the objective functions f and φ in (1.1) and (1.4) are easily computable. One popular method to solve stochastic optimization problems is the sample average approximation (SAA) approach [17, 34, 37]. To apply SAA to (1.1) and (1.5), we first generate a random sample ξ_i, i = 1,…,N, for some N ≥ 1 and then approximate g by g̃(x) = (1/N) Σ_{i=1}^N G(x, ξ_i). The main issues associated with SAA for solving (1.1) include: (i) the deterministic SAA problem might not be feasible; (ii) the resulting deterministic SAA problem is often difficult to solve, especially when N is large, requiring a pass through the whole sample {ξ_1,…,ξ_N} at each iteration; and (iii) it is not applicable to the online setting where one needs to update the decision variable whenever a new sample ξ_i, i = 1,…,N, is collected.
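The SAA constraint estimate and the feasibility issue (i) can be illustrated with a minimal sketch; the constraint function and distribution below are made-up assumptions, not from the paper:

```python
import random

def saa_constraint(G, x, xis):
    """SAA approximation g_tilde(x) = (1/N) * sum_i G(x, xi_i) of g(x) = E[G(x, xi)]."""
    return sum(G(x, xi) for xi in xis) / len(xis)

# illustrative constraint: G(x, xi) = xi * x - 1 with E[xi] = 1, so g(x) = x - 1
random.seed(1)
xis = [random.gauss(1.0, 0.2) for _ in range(10000)]
g_tilde = saa_constraint(lambda x, xi: xi * x - 1.0, 1.0, xis)
# g_tilde fluctuates around g(1.0) = 0, so a point on the true constraint
# boundary can easily be infeasible for the sampled problem (issue (i) above)
```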

A different approach to solving stochastic optimization problems is stochastic approximation (SA), which was initially proposed in a seminal paper by Robbins and Monro [29] in 1951 for solving strongly convex SP problems. This algorithm mimics the gradient descent method by using the stochastic gradient F′(x, ξ_i) rather than the original gradient f′(x) for minimizing f(x) in (1.1) over a simple convex set X (see also [4, 10, 11, 25, 31, 35]). An important improvement of this algorithm was developed

(1.5) x* ∈ {x ∈ X | g(x) := E[G(x, ξ)] ≤ 0}.

(1.6) min_w E[ l(w; (Ā^{1/2}u, v)) ] + (λ/2)‖w‖²,

(1.7) Ā ∈ {A ⪰ 0 | E[ |Tr(A(u_i − u_j)(u_i − u_j)ᵀ) − b_ij| ] ≤ 0, Tr(A) ≤ C},


    In this paper, we intend to develop efficient solution methods for solving expecta-tion constrained problems by properly addressing the aforementioned issues asso-ciated with existing SA methods. Our contribution mainly exists in the following several aspects. Firstly, inspired by Polayk’s subgradient method for constrained optimization [28] and Nesterov’s note [24], we present a new SA algorithm, namely the cooperative SA (CSA) method for solving the SP problem with expectation constraint in (1.1) with constraint (1.2). At the kth iteration, CSA performs a pro-jected subgradient step along either F�(xk, �k) or G�(xk, �k) over the set X, depend-ing on whether an unbiased estimator Ĝk of g(xk) satisfies Ĝk ≤ 𝜂k or not. Observe that the aforementioned estimator Ĝk can be easily computed in many cases by using the structure of the problem, e.g., the linear dependence �Tx in (1.3) (see Section  4.1 in [20] and Section  2.1 for more details). We introduce an index set B ∶= {1 ≤ k ≤ N ∶ Ĝk ≤ 𝜂k} in order to compute the output solution as a weighted average of the iterates in B . By carefully bounding |B| , we show that the number of iterations performed by the CSA algorithm to find an �-solution of (1.1), i.e., a point x̄ ∈ X s.t. �[f (x̄) − f ∗] ≤ 𝜖 and �[g(x̄)] ≤ 𝜖 , can be bounded by O(1∕�2) . Moreover, when both f and g are strongly convex, by using a different set of algorithmic param-eters we show that the complexity of the CSA method can be significantly improved to O(1∕�) . It it is worth mentioning that this result is new even for solving determin-istic strongly convex problems with function constraints. We also established the large-deviation properties for the CSA method under certain light-tail assumptions.

Secondly, we develop a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) method, for solving the SP problem with expectation constraints on problem parameters in (1.4)–(1.5). In CSPA, we update the parameter x by running mirror descent SA iterations whenever a certain easily verifiable condition is violated. Otherwise, we update the decision variable y while keeping x intact. We show that the number of iterations performed by the CSPA algorithm to find an ε-solution of (1.4)–(1.5), i.e., a pair of solutions (x̄, ȳ) s.t. E[g(x̄)] ≤ ε and E[φ(x̄, ȳ) − φ(x̄, y*(x̄))] ≤ ε, can be bounded by O(1/ε²). Moreover, this bound can be significantly improved to O(1/ε) if G and Φ are strongly convex w.r.t. x and y, respectively.

To the best of our knowledge, all the aforementioned algorithmic developments are new in the stochastic optimization literature. It is also worth mentioning a few alternative or related methods for solving (1.1) and (1.4)–(1.5). First, without efficient methods to directly solve (1.1), current practice resorts to reformulating it


as min_{x∈X} λf(x) + (1 − λ)g(x) for some λ ∈ (0, 1). However, one then has to face the difficulty of properly specifying λ, since an optimal selection would depend on the unknown dual multiplier. As a consequence, we cannot assess the quality of the solutions obtained by solving this reformulated problem. Second, one alternative approach to solving (1.1) is the penalty-based or primal-dual approach. However, these methods would require either the estimation of the optimal dual variables or iterations performed in the dual space (see [7, 19, 22]). Moreover, the rate of convergence of these methods for function constrained problems has not been well understood beyond conic constraints, even in the deterministic setting. Third, in [16] (see also the references therein), Jiang and Shanbhag developed a coupled SA method for a stochastic optimization problem with parameters given by another optimization problem, which is hence not applicable to problem (1.4)–(1.5). Moreover, each iteration of their method requires two stochastic subgradient projection steps and hence is more expensive than that of CSPA.

The remaining part of this paper is organized as follows. In Sect. 2, we present the CSA algorithm and establish its convergence properties under general convexity and strong convexity assumptions. In Sect. 3, we develop a variant of the CSA algorithm, namely CSPA, for solving SP problems with the expectation constraint over problem parameters, and discuss its convergence properties. We then present some numerical results for these new SA methods in Sect. 4. Finally, some concluding remarks are given in Sect. 5.

    2 Function or expectation constraints over decision variables

In this section we present the cooperative SA (CSA) algorithm for solving convex stochastic optimization problems with constraints over the decision variables. More specifically, we first briefly review the distance generating function and prox-mapping in Sect. 2.1. We then describe the CSA algorithm in Sect. 2.2 and discuss its convergence properties, in terms of expectation and large deviations, for solving general convex problems in Sect. 2.3. Next, we show how to apply the CSA algorithm to problem (1.1) with an expectation constraint and discuss its large deviation properties in Sect. 2.4. Finally, we show how to improve the convergence of this algorithm under strong convexity assumptions on problem (1.1) in Sect. 2.5.

    2.1 Preliminary: prox‑mapping

Recall that a function ω_X: X → ℝ is a distance generating function with parameter α if ω_X is continuously differentiable and strongly convex with parameter α with respect to ‖·‖. Without loss of generality, we assume throughout this paper that α = 1, because we can always rescale ω_X(x) to ω̄_X(x) = ω_X(x)/α. Therefore, we have

⟨x − z, ∇ω_X(x) − ∇ω_X(z)⟩ ≥ ‖x − z‖², ∀x, z ∈ X.


    The prox-function associated with � is given by

V_X(·,·) is also called the Bregman distance, which was initially studied by Bregman [5] and later by many others (see [1, 2, 36]). In this paper we assume that the prox-function V_X(x, z) is chosen such that, for a given x ∈ X, the prox-mapping P_{x,X}: ℝⁿ → ℝⁿ defined in (2.1) is easily computed. It can be seen from the strong convexity of ω_X that (2.2) holds.

    Whenever the set X is bounded, the distance generating function �X also gives rise to the diameter of X that will be used frequently in our convergence analysis:

    The following lemma follows from the optimality condition of (2.1) and the defini-tion of the prox-function (see the proof in [22]).

    Lemma 1 For every u, x ∈ X , and y ∈ ℝn , we have

where ‖·‖_* denotes the conjugate norm of ‖·‖, i.e., ‖y‖_* = max{⟨x, y⟩ : ‖x‖ ≤ 1}.

    2.2 The CSA method

    In this section, we present a generic algorithmic framework for solving the con-strained optimization problem in (1.1). We assume the expectation function f(x) and constraint g(x), in addition to being well-defined and finite-valued for every x ∈ X , are continuous and convex on X.

The CSA method can be viewed as a stochastic counterpart of Polyak's subgradient method, which was originally designed for solving deterministic nonsmooth convex optimization problems (see [28] and a more recent generalization in [3]). At each iterate x_k, k ≥ 0, depending on whether g(x_k) ≤ η_k for some tolerance η_k > 0, it moves either along the subgradient direction f′(x_k) or g′(x_k), with an appropriately chosen stepsize γ_k which also depends on ‖f′(x_k)‖_* and ‖g′(x_k)‖_*. However, Polyak's subgradient method cannot be applied to solve (1.1) because we do not have access to exact information about f′, g′ and g. The CSA method differs from Polyak's subgradient method in the following three aspects. Firstly, the search direction h_k is defined in a

V_X(z, x) = ω_X(x) − ω_X(z) − ⟨∇ω_X(z), x − z⟩.

(2.1) P_{x,X}(·) := argmin_{z∈X} {⟨·, z⟩ + V_X(x, z)}

(2.2) V_X(x, z) ≥ (1/2)‖x − z‖², ∀x, z ∈ X.

(2.3) D_X ≡ D_{X,ω_X} := [max_{x,z∈X} V_X(x, z)]^{1/2}.

V_X(P_{x,X}(y), u) ≤ V_X(x, u) + yᵀ(u − x) + (1/2)‖y‖²_*,
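As a concrete instance of the prox-mapping (2.1), consider X the probability simplex with the entropy distance generating function ω_X(x) = Σ_i x_i log x_i, for which the mapping has a closed form. This standard example is not spelled out in the text above, so the sketch below is illustrative only:

```python
import math

def entropy_prox(x, y):
    """Prox-mapping (2.1) on the probability simplex with the entropy
    distance generating function w(x) = sum_i x_i log x_i, for which
    argmin_z {<y, z> + V_X(x, z)} has the closed form
    z_i = x_i exp(-y_i) / sum_j x_j exp(-y_j)."""
    w = [xi * math.exp(-yi) for xi, yi in zip(x, y)]
    s = sum(w)
    return [wi / s for wi in w]

x = [0.25, 0.25, 0.50]   # current point in the simplex
y = [1.0, 0.0, 2.0]      # stepsize times a (sub)gradient
z = entropy_prox(x, y)   # mass shifts toward the small components of y
```

This multiplicative update costs O(n) per iteration, which is what makes mirror-descent-type methods attractive on the simplex.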


stochastic manner: we first check whether the solution x_k computed at iteration k violates the condition Ĝ_k ≤ η_k for some η_k ≥ 0. If so, we set h_k = G′(x_k, ξ_k) for a random realization ξ_k of ξ (note that for the deterministic constraint in (1.1), h_k = g′(x_k)) in order to control the violation of the expectation constraint. Otherwise, we set h_k = F′(x_k, ξ_k). Secondly, for some 1 ≤ s ≤ N, we partition the indices I = {s,…,N} into two subsets: B = {s ≤ k ≤ N | Ĝ_k ≤ η_k} and 𝒩 = I \ B, and define the output x̄_{N,s} as an ergodic mean of x_k over B. This differs from Polyak's subgradient method, which defines the output solution as the best x_k, k ∈ B, with the smallest objective value. Thirdly, while the original Polyak subgradient method was developed only for general nonsmooth problems, we show that the CSA method also exhibits an optimal rate of convergence for solving strongly convex problems by properly choosing {γ_k} and {η_k}.
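The three modifications above can be condensed into a schematic loop. The sketch below is illustrative only and is not the paper's Algorithm 1: it assumes a Euclidean setup, a deterministic constraint, constant γ and η, and a made-up one-dimensional toy instance.

```python
import random

def csa(proj, F_grad, g, g_grad, x0, N, gamma, eta, sample):
    """Schematic CSA loop for a deterministic constraint g(x) <= 0:
    step along F'(x_k, xi_k) when g(x_k) <= eta (k in B), otherwise
    along g'(x_k); return the weighted average of the iterates in B."""
    x, num, den = x0, 0.0, 0.0
    for _ in range(N):
        if g(x) <= eta:                 # k in B: work on the objective
            h = F_grad(x, sample())
            num += gamma * x
            den += gamma
        else:                           # k not in B: reduce infeasibility
            h = g_grad(x)
        x = proj(x - gamma * h)         # Euclidean prox-mapping step
    return num / den                    # ergodic mean over B

random.seed(2)
# toy instance: min E[(x - xi)^2], xi ~ N(1, 0.1), s.t. x - 0.5 <= 0, X = [-2, 2];
# the constrained optimum is x* = 0.5
x_bar = csa(proj=lambda x: max(-2.0, min(2.0, x)),
            F_grad=lambda x, xi: 2.0 * (x - xi),
            g=lambda x: x - 0.5,
            g_grad=lambda x: 1.0,
            x0=-2.0, N=4000, gamma=0.01, eta=0.05,
            sample=lambda: random.gauss(1.0, 0.1))
```

The averaged output lands near the constrained optimum x* = 0.5, with constraint violation controlled by the tolerance η.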

Notice that every iteration of CSA requires an unbiased estimator of g(x_k). If there is no uncertainty associated with the constraint in (1.1), we can evaluate g(x_k) exactly. If g is given in the form of expectation, one natural way is to generate a J-sized i.i.d. random sample of ξ and then evaluate the constraint function value by Ĝ_k = (1/J) Σ_{j=1}^J G(x_k, ξ_j). However, this basic scheme can be much improved by using some structural information for constraint evaluation. For instance, one ubiquitous structure in machine learning and portfolio optimization applications is the linear combination ξᵀx. For a given x ∈ X, we can define a new random variable ξ̂ = ξᵀx and generate samples of ξ̂ instead of ξ. ξ̂ is only of dimension one, and it is computationally much cheaper to simulate. Given the distribution of ξ, below we provide a few examples where the distribution of ξ̂ can be explicitly computed or approximated. If x ∈ ℝ^d and the ξ_i are independent normal N(μ_i, σ_i), then ξ̂ follows N(Σ_{i=1}^d μ_i x_i, [Σ_{i=1}^d x_i² σ_i²]^{1/2}). If the ξ_i follow independent exp(λ_i), then the probability density function of ξ̂ is

f_ξ̂(y) = (Π_{i=1}^d λ̂_i) Σ_{j=1}^d e^{−λ̂_j y} / Π_{k=1, k≠j}^d (λ̂_k − λ̂_j),

where λ̂_i = λ_i/x_i. If the ξ_i follow independent Uniform(a, b), then the cumulative distribution function of ξ̂ is
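The density above is the hypoexponential density when the rates λ̂_i are distinct, and a quick numerical check is that it integrates to one. The sketch below is an illustrative sanity check; the rate values are arbitrary assumptions:

```python
import math

def hypoexp_pdf(y, rates):
    """Density of a sum of independent exponentials with distinct
    rates (the formula for f_{xi_hat} displayed above)."""
    if y < 0:
        return 0.0
    total = 0.0
    for j, lj in enumerate(rates):
        denom = math.prod(lk - lj for k, lk in enumerate(rates) if k != j)
        total += math.exp(-lj * y) / denom
    return math.prod(rates) * total

# the density should integrate to one (midpoint rule on [0, 40])
rates = [1.0, 2.0, 3.5]
h = 0.001
mass = h * sum(hypoexp_pdf((i + 0.5) * h, rates) for i in range(40000))
```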


If the ξ_i are dependent normal random variables with mean μ and covariance C (with Cholesky decomposition C = LLᵀ), we can estimate Σ_{i=1}^d ξ_i x_i by Σ_{i=1}^d μ_i x_i + r̄ [Σ_{i=1}^d (Lᵀx)_i²]^{1/2}, where r̄ follows N(0, 1). In fact, when the dimension d is large enough, by the central limit theorem we can use a normal distribution to approximate the new random variable ξ̂. These examples show that simulating ξ̂ can be much faster than simulating the original random variables for constraint evaluation.
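The independent-normal case above is easy to validate numerically: simulating the one-dimensional ξ̂ directly reproduces the distribution of ξᵀx. The parameters below are made up for illustration:

```python
import math
import random
import statistics

random.seed(3)
mu = [0.1, 0.2, 0.3]       # component means (illustrative)
sigma = [0.2, 0.1, 0.3]    # component standard deviations (illustrative)
x = [0.5, 0.3, 0.2]        # fixed decision vector

# direct route: draw the full d-dimensional xi and form xi^T x each time
direct = [sum(w * random.gauss(m, s) for w, m, s in zip(x, mu, sigma))
          for _ in range(50000)]

# one-dimensional shortcut: xi_hat ~ N(sum_i mu_i x_i, [sum_i x_i^2 sigma_i^2]^{1/2})
m_hat = sum(m * w for m, w in zip(mu, x))
s_hat = math.sqrt(sum(w ** 2 * s ** 2 for w, s in zip(x, sigma)))
shortcut = [random.gauss(m_hat, s_hat) for _ in range(50000)]
```

Both sample sets share the same mean and spread, while the shortcut needs a single Gaussian draw per sample instead of d.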

    2.3 Convergence of CSA for SP with function constraints

In this subsection, we consider the case when the constraint function g is deterministic (i.e., Ĝ_k = g(x_k)). Our goal is to establish the rate of convergence of CSA in terms of both the distance to the optimal value and the violation of the constraints. It should also be noted that Algorithm 1 is conceptual only, as we have not yet specified a few algorithmic parameters (e.g., {γ_k} and {η_k}). We will come back to this issue after establishing some general properties of this method. Throughout this subsection, we make the following assumptions.

Assumption 1 For any x ∈ X and a.e. ξ ∈ P,

where F′(x, ξ) ∈ ∂_x F(x, ξ) and g′(x) ∈ ∂_x g(x).

    The following result establishes a simple but important recursion about the CSA method for problem (1.1).

    Proposition 2 For any 1 ≤ s ≤ N , we have

    for all x ∈ X.

    Proof For any s ≤ k ≤ N , using Lemma 1, we have

F_ξ̂(y) = [1/(d! Π_{i=1}^d x_i)] { [ (y − a Σ_{i=1}^d x_i)/(b − a) ]₊^d + Σ_{v=1}^d (−1)^v Σ_{j_1=1}^d Σ_{j_2=j_1+1}^d ⋯ Σ_{j_v=j_{v−1}+1}^d [ (y − a Σ_{i=1}^d x_i)/(b − a) − (x_{j_1} + x_{j_2} + ⋯ + x_{j_v}) ]₊^d }.

E[‖F′(x, ξ)‖²_*] ≤ M_F² and ‖g′(x)‖²_* ≤ M_G²,

(2.7) Σ_{k∈𝒩} γ_k(η_k − g(x)) + Σ_{k∈B} γ_k⟨F′(x_k, ξ_k), x_k − x⟩ ≤ V(x_s, x) + (1/2) Σ_{k∈B} γ_k²‖F′(x_k, ξ_k)‖²_* + (1/2) Σ_{k∈𝒩} γ_k²‖g′(x_k)‖²_*,


Observe that if k ∈ B, we have h_k = F′(x_k, ξ_k) and

Moreover, if k ∈ 𝒩, we have h_k = g′(x_k) and

Summing up the inequalities in (2.8) from k = s to N and using the previous two observations, we obtain (2.9). Rearranging the terms in the above inequality, we obtain (2.7). ◻

    Using Proposition 2, we present below a sufficient condition under which the output solution x̄N,s is well-defined.

    Lemma 3 Let x∗ be an optimal solution of (1.1). If

then B ≠ ∅, i.e., x̄_{N,s} is well-defined. Moreover, at least one of the following two statements holds:

(a) |B| ≥ (N − s + 1)/2;
(b) Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ 0.

Proof Taking expectation w.r.t. ξ_k on both sides of (2.7) and fixing x = x*, we obtain (2.11). Suppose for contradiction that B = ∅. We then conclude from the above relation and the fact that g(x*) ≤ 0 that

(2.8) V(x_{k+1}, x) ≤ V(x_k, x) + γ_k⟨h_k, x − x_k⟩ + (1/2)γ_k²‖h_k‖²_*.

⟨h_k, x_k − x⟩ = ⟨F′(x_k, ξ_k), x_k − x⟩.

⟨h_k, x_k − x⟩ = ⟨g′(x_k), x_k − x⟩ ≥ g(x_k) − g(x) ≥ η_k − g(x).

(2.9) V(x_{N+1}, x) ≤ V(x_s, x) − Σ_{k=s}^N γ_k⟨h_k, x_k − x⟩ + (1/2) Σ_{k=s}^N γ_k²‖h_k‖²_*
≤ V(x_s, x) − [Σ_{k∈𝒩} γ_k⟨g′(x_k), x_k − x⟩ + Σ_{k∈B} γ_k⟨F′(x_k, ξ_k), x_k − x⟩] + (1/2) Σ_{k=s}^N γ_k²‖h_k‖²_*
≤ V(x_s, x) − [Σ_{k∈𝒩} γ_k(η_k − g(x)) + Σ_{k∈B} γ_k⟨F′(x_k, ξ_k), x_k − x⟩] + (1/2) Σ_{k∈B} γ_k²‖F′(x_k, ξ_k)‖²_* + (1/2) Σ_{k∈𝒩} γ_k²‖g′(x_k)‖²_*.

(2.10) [(N − s + 1)/2] min_{k∈𝒩} γ_k η_k > D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.11) Σ_{k∈𝒩} γ_k[η_k − g(x*)] + Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G² ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G².


which contradicts (2.10). Hence, we must have B ≠ ∅. Now if Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ 0, part (b) holds. Otherwise, if Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≥ 0, we have

which, in view of g(x*) ≤ 0, implies (2.12). Suppose that |B| < (N − s + 1)/2, i.e., |𝒩| ≥ (N − s + 1)/2. Then,

which contradicts (2.12). Hence, part (a) holds. ◻

    Now we are ready to establish the main convergence properties of the CSA method.

Theorem 4 Suppose that {γ_k} and {η_k} in the CSA algorithm are chosen such that (2.10) holds. Then for any 1 ≤ s ≤ N, we have (2.13) and (2.14), where M := max{M_F, M_G}.

Proof We first show (2.13). By Lemma 3, if part (b) holds, then dividing both sides by Σ_{k∈B} γ_k and taking expectation, we have (2.15). If |B| ≥ (N − s + 1)/2, we have Σ_{k∈B} γ_k ≥ |B| min_{k∈B} γ_k ≥ [(N − s + 1)/2] min_{k∈B} γ_k. It then follows from the definition of x̄_{N,s} in (2.6), the convexity of f(·) and (2.11) that

which implies (2.16). Using this bound and the fact γ_k η_k ≥ 0 in (2.16), we have

|𝒩| min_{k∈𝒩} γ_k η_k ≤ Σ_{k∈𝒩} γ_k[η_k − g(x*)] ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

Σ_{k∈𝒩} γ_k[η_k − g(x*)] ≤ V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.12) Σ_{k∈𝒩} γ_k η_k ≤ V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G².

Σ_{k∈𝒩} γ_k η_k ≥ [(N − s + 1)/2] min_{k∈𝒩} γ_k η_k > V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.13) E[f(x̄_{N,s}) − f(x*)] ≤ [2D_X² + M² Σ_{s≤k≤N} γ_k²] / [(N − s + 1) min_{s≤k≤N} γ_k],

(2.14) g(x̄_{N,s}) ≤ (Σ_{k∈B} γ_k)^{−1} (Σ_{k∈B} γ_k η_k),

(2.15) E[f(x̄_{N,s}) − f(x*)] ≤ 0.

Σ_{k∈𝒩} γ_k η_k + Σ_{k∈B} γ_k E[f(x̄_{N,s}) − f(x*)] ≤ Σ_{k∈𝒩} γ_k η_k + Σ_{k∈B} E[γ_k(f(x_k) − f(x*))] ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.16) |𝒩| min_{k∈𝒩} γ_k η_k + (Σ_{k∈B} γ_k) E[f(x̄_{N,s}) − f(x*)] ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G².


Combining inequalities (2.15) and (2.17), we obtain (2.13). Now we show that (2.14) holds. For any k ∈ B, we have g(x_k) ≤ η_k. Then, in view of the definition of x̄_{N,s} in (2.6) and the convexity of g(·), we obtain (2.18). ◻

Below we provide a few specific selections of {γ_k}, {η_k} and s that lead to the optimal rate of convergence for the CSA method. In particular, we present constant and variable stepsize policies in Corollaries 5 and 6, respectively.

Corollary 5 If s = 1, γ_k = D_X / [√N (M_F + M_G)] and η_k = 4(M_F + M_G) D_X / √N, k = 1,…,N, then

    Proof First, observe that condition (2.10) holds by using the facts that

    It then follows from Lemma 3 and Theorem 4 that

Corollary 6 If s = N/2, γ_k = D_X / [√k (M_F + M_G)] and η_k = 4 D_X (M_F + M_G) / √k, k = 1, 2,…,N, then

(2.17) E[f(x̄_{N,s}) − f(x*)] ≤ [2D_X² + Σ_{k∈B} γ_k² M_F² + Σ_{k∈𝒩} γ_k² M_G²] / [(N − s + 1) min_{k∈I} γ_k] ≤ [2D_X² + M² Σ_{s≤k≤N} γ_k²] / [(N − s + 1) min_{k∈B} γ_k].

(2.18) g(x̄_{N,s}) ≤ Σ_{k∈B} γ_k g(x_k) / Σ_{k∈B} γ_k ≤ Σ_{k∈B} γ_k η_k / Σ_{k∈B} γ_k.

E[f(x̄_{N,s}) − f(x*)] ≤ 4 D_X (M_F + M_G) / √N,

g(x̄_{N,s}) ≤ 4 D_X (M_F + M_G) / √N.

[(N − s + 1)/2] min_{k∈𝒩} γ_k η_k = (N/2)(4D_X²/N) = 2D_X²,

D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²
≤ D_X² + (1/2) Σ_{k∈B} D_X² M_F² / [N(M_F + M_G)²] + (1/2) Σ_{k∈𝒩} D_X² M_G² / [N(M_F + M_G)²]
≤ D_X² + (1/2) Σ_{k=1}^N D_X²/N = (3/2) D_X² < 2D_X².

E[f(x̄_{N,s}) − f(x*)] ≤ [2 D_X (M_F + M_G) + Σ_{k∈B} D_X M_F² / (N(M_F + M_G)) + Σ_{k∈𝒩} D_X M_G² / (N(M_F + M_G))] / √N ≤ 4 D_X (M_F + M_G) / √N,

g(x̄_{N,s}) ≤ max_{s≤k≤N} η_k = 4 D_X (M_F + M_G) / √N.
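Condition (2.10) under the constant stepsize policy of Corollary 5 can also be checked numerically: the left-hand side equals 2D_X² and dominates the right-hand side. The constants below are illustrative assumptions:

```python
import math

# Corollary 5 parameters with illustrative constants D_X, M_F, M_G
D, M_F, M_G, N = 1.0, 2.0, 3.0, 10000
gamma = D / (math.sqrt(N) * (M_F + M_G))   # gamma_k, constant in k
eta = 4 * (M_F + M_G) * D / math.sqrt(N)   # eta_k, constant in k

# left-hand side of (2.10) with s = 1: (N/2) * gamma * eta = 2 D^2
lhs = (N / 2) * gamma * eta
# right-hand side, charging every index the worse bound max(M_F, M_G)^2
rhs = D ** 2 + 0.5 * N * gamma ** 2 * max(M_F, M_G) ** 2
```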


Proof The proof is similar to that of Corollary 5 and hence the details are skipped. ◻

In view of Corollaries 5 and 6, the CSA algorithm achieves an O(1/√N) rate of convergence for solving problem (1.1). This convergence rate appears unimprovable, as it matches the optimal rate of convergence for deterministic convex optimization problems with function constraints [24]. However, to the best of our knowledge, no such complexity bounds had been obtained before for stochastic optimization problems with function constraints.

In Corollaries 5 and 6, we established the expected convergence properties over many runs of the CSA algorithm. In the remaining part of this subsection, we are interested in the large deviation properties of a single run of this method.

First note that by Corollary 6 and Markov's inequality, we have

It then follows that, in order to find a solution x̄_{N,s} ∈ X such that

the number of iterations performed by the CSA method can be bounded by (2.19). We will show that this result can be significantly improved if Assumption 1 is augmented by the following “light-tail” assumption, which is satisfied by a wide class of distributions (e.g., Gaussian and t-distributions).
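The O(1/(ε²Λ²)) bound in (2.19) follows by choosing λ₁ = 1/Λ in the Markov bound above and solving for N. A tiny illustrative calculation, where the assumed constant `C` abbreviates 4D_X(1 + log(2)/2)(M_F + M_G) and all numeric values are made up:

```python
import math

def iterations_needed(eps, Lam, C):
    """Smallest N with (1/Lam) * C / sqrt(N) <= eps, so that the Markov
    bound above yields Prob(f(x_bar) - f* <= eps) > 1 - Lam; here C
    stands for the constant 4 D_X (1 + log(2)/2)(M_F + M_G)."""
    return math.ceil((C / (eps * Lam)) ** 2)

N1 = iterations_needed(0.10, 0.1, 1.0)   # scales like 1 / (eps^2 Lam^2)
N2 = iterations_needed(0.05, 0.1, 1.0)   # halving eps quadruples N
```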

Assumption 2 For any x ∈ X,

We first present the following Bernstein-type inequality, which will be used to establish the large-deviation properties of the CSA method (see, e.g., [22]). Note that in the sequel we denote ξ_[k] := {ξ_1,…,ξ_k}.

Lemma 7 Let ξ_1, ξ_2,… be a sequence of i.i.d. random variables, and let ζ_t = ζ_t(ξ_[t]) be deterministic Borel functions of ξ_[t] such that E[ζ_t] = 0 a.s. and E[exp{ζ_t²/σ_t²}] ≤ exp{1} a.s., where σ_t > 0 are deterministic. Then

    �[f (x̄N,s) − f (x∗)] ≤

    4DX (1+1

    2log 2)(MF+MG)√

    N,

    g(x̄N,s) ≤4√2DX (MF+MG)√

    N.

    Prob

    �f (x̄N,s) − f (x

    ∗) > 𝜆14DX (1+

    1

    2log 2)(MF+MG)√

    N

    �<

    1

    𝜆1

    ,∀𝜆1 ≥ 0.

    Prob(f (x̄N,s) − f (x

    ∗) ≤ 𝜖)> 1 − Λ,

    (2.19)O{

    1

    �2Λ2

    }.

    �[exp{‖F�(x, �)‖2∗∕M2

    F}] ≤ exp{1}.

\[
\forall \lambda \ge 0: \quad \mathrm{Prob}\Big\{ \textstyle\sum_{t=1}^{N} \zeta_t > \lambda \sqrt{\sum_{t=1}^{N} \sigma_t^2} \Big\} \le \exp\{-\lambda^2/3\}.
\]

Now we are ready to establish the large-deviation properties of the CSA algorithm.

Theorem 8 Under Assumption 2, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\{ f(\bar{x}_{N,s}) - f(x^*) \ge K_0 + \lambda K_1 \} \le \exp\{-\lambda\} + \exp\Big\{-\frac{\lambda^2}{3}\Big\}, \tag{2.20}
\]
where
\[
K_0 = \frac{2D_X^2 + M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2 + M_G^2\sum_{k\in\mathcal{N}}\gamma_k^2}{\sum_{k\in\mathcal{B}}\gamma_k}
\quad\text{and}\quad
K_1 = \frac{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2 + M_G^2\sum_{k\in\mathcal{N}}\gamma_k^2 + 2M_F D_X\sqrt{\sum_{k\in\mathcal{B}}\gamma_k^2}}{\sum_{k\in\mathcal{B}}\gamma_k}.
\]

Proof Let $F'(x_k,\xi_k) = f'(x_k) + \Delta_k$. It follows from inequality (2.7) (with $x = x^*$) and the fact $g(x^*) \le 0$ that
\[
\sum_{k\in\mathcal{N}}\gamma_k\eta_k + \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)\big(f(\bar{x}_{N,s}) - f(x^*)\big) \le D_X^2 + \sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \sum_{k\in\mathcal{N}}\gamma_k^2\|g'(x_k)\|_*^2 - \sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle. \tag{2.21}
\]
Now we provide probabilistic bounds for $\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2$ and $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. First, setting $\theta_k = \gamma_k^2/\sum_{k\in\mathcal{B}}\gamma_k^2$ and using the fact that $\mathbb{E}[\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\}] \le \exp\{1\}$ together with Jensen's inequality, we have
\[
\exp\Big\{\sum_{k\in\mathcal{B}}\theta_k\big(\|F'(x_k,\xi_k)\|_*^2/M_F^2\big)\Big\} \le \sum_{k\in\mathcal{B}}\theta_k\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\},
\]
and hence that
\[
\mathbb{E}\Big[\exp\Big\{\frac{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2}{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\}\Big] \le \exp\{1\}.
\]
It then follows from Markov's inequality that, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 > (1+\lambda)M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2\Big)
= \mathrm{Prob}\Big(\exp\Big\{\frac{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2}{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\} > \exp(1+\lambda)\Big) \le \frac{\exp\{1\}}{\exp\{1+\lambda\}} \le \exp\{-\lambda\}. \tag{2.22}
\]
Then, let us consider $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. Setting $\beta_k = \gamma_k\langle\Delta_k, x_k - x^*\rangle$ and noting that $\mathbb{E}[\|\Delta_k\|_*^2] \le (2M_F)^2$, we have
\[
\mathbb{E}\big[\exp\{\beta_k^2/(2M_F\gamma_k D_X)^2\}\big] \le \exp\{1\},
\]
which, in view of Lemma 7, implies that

\[
\mathrm{Prob}\Big\{\sum_{k\in\mathcal{B}}\beta_k > 2\lambda M_F D_X\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\} \le \exp\{-\lambda^2/3\}. \tag{2.23}
\]
Combining (2.22) and (2.23), and rearranging the terms, we get (2.20). ◻

Applying the stepsize strategy in Corollary 5 to Theorem 8, it follows that the number of iterations performed by the CSA method can be bounded by
\[
\mathcal{O}\Big\{\frac{1}{\epsilon^2}\Big(\log\frac{1}{\Lambda}\Big)^2\Big\}.
\]
This significantly improves the bound in (2.19).

2.4 Convergence of CSA for SP with expectation constraints

In this subsection, we focus on the SP problem (1.1)–(1.2) with the expectation constraint. We assume that the expectation functions f(x) and g(x), in addition to being well-defined and finite-valued for every $x \in X$, are continuous and convex on X. Throughout this section, we assume that Assumption 2 holds. Moreover, with a slight abuse of notation, we make the following assumption.

Assumption 3 For any $x \in X$,
\[
\mathbb{E}\big[\exp\{\|G'(x,\xi)\|_*^2/M_G^2\}\big] \le \exp\{1\}, \tag{2.24}
\]
\[
\mathbb{E}\big[\exp\{(G(x,\xi) - g(x))^2/\sigma^2\}\big] \le \exp\{1\}. \tag{2.25}
\]

We will use (2.24) and (2.25) to bound the errors associated with the stochastic subgradient and the function value of the constraint g, respectively. As discussed in Sect. 2.2, there may exist different ways to simulate the random variable $\xi$ for constraint evaluation, e.g., by generating a J-sized i.i.d. random sample of $\xi$ or a linear transformation of it. However, regardless of the way the random variable $\xi$ is simulated, the light-tail assumption (2.25) holds for the constraint value $G(x,\xi)$. Our goal in this subsection is to show how the sample size (or iteration count) N used to compute stochastic subgradients, as well as the sample size J used to evaluate the constraint value, affect the quality of the solutions generated by CSA.

The following result establishes a simple but important recursion for the CSA method applied to stochastic optimization with expectation constraints.

Proposition 9 For any $1 \le s \le N$, we have
\[
\sum_{k\in\mathcal{N}}\gamma_k\big(G(x_k,\xi_k) - G(x,\xi_k)\big) + \sum_{k\in\mathcal{B}}\gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle
\le V(x_s,x) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2, \quad \forall x \in X. \tag{2.26}
\]

Proof For any $s \le k \le N$, using Lemma 1, we have

\[
V(x_{k+1},x) \le V(x_k,x) + \gamma_k\langle h_k, x - x_k\rangle + \tfrac{1}{2}\gamma_k^2\|h_k\|_*^2. \tag{2.27}
\]
Observe that if $k \in \mathcal{B}$, we have $h_k = F'(x_k,\xi_k)$ and
\[
\langle h_k, x_k - x\rangle = \langle F'(x_k,\xi_k), x_k - x\rangle.
\]
Moreover, if $k \in \mathcal{N}$, we have $h_k = G'(x_k,\xi_k)$ and
\[
\langle h_k, x_k - x\rangle = \langle G'(x_k,\xi_k), x_k - x\rangle \ge G(x_k,\xi_k) - G(x,\xi_k).
\]
Summing up the inequalities in (2.27) from $k = s$ to $N$ and using the previous two observations, we obtain
\[
\begin{aligned}
V(x_{k+1},x) &\le V(x_s,x) - \sum_{k=s}^{N}\gamma_k\langle h_k, x_k - x\rangle + \tfrac{1}{2}\sum_{k=s}^{N}\gamma_k^2\|h_k\|_*^2 \\
&\le V(x_s,x) - \Big[\sum_{k\in\mathcal{N}}\gamma_k\langle G'(x_k,\xi_k), x_k - x\rangle + \sum_{k\in\mathcal{B}}\gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle\Big] + \tfrac{1}{2}\sum_{k=s}^{N}\gamma_k^2\|h_k\|_*^2 \\
&\le V(x_s,x) - \Big[\sum_{k\in\mathcal{N}}\gamma_k\big(G(x_k,\xi_k) - G(x,\xi_k)\big) + \sum_{k\in\mathcal{B}}\gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle\Big] \\
&\quad + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2.
\end{aligned} \tag{2.28}
\]
Rearranging the terms in the above inequality, we obtain (2.26). ◻

Using Proposition 9, we present below a sufficient condition under which the output solution $\bar{x}_{N,s}$ is well-defined.

Lemma 10 Let $x^*$ be an optimal solution of (1.1)–(1.2). Under Assumption 3, for any given $\lambda > 0$, if
\[
\frac{N-s+1}{2}\,\min_{k\in\mathcal{N}}\gamma_k\eta_k > V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2 + \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k, \tag{2.29}
\]
where J is the number of random samples used to estimate $g(x_k)$ in each iteration, then $\mathcal{B} \ne \emptyset$, i.e., $\bar{x}_{N,s}$ is well-defined. Moreover, at least one of the following two statements holds:

(a) $\mathrm{Prob}\{|\mathcal{B}| \ge (N-s+1)/2\} \ge 1 - |\mathcal{N}|\exp\{-\lambda^2/3\}$;
(b) $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le 0$.

Proof Taking expectation w.r.t. $\xi_k$ on both sides of (2.26), fixing $x = x^*$ and noting that Assumption 3 implies $\mathbb{E}[\|G'(x,\xi)\|_*^2] \le M_G^2$, we have
\[
\sum_{k\in\mathcal{N}}\gamma_k\big[g(x_k) - g(x^*)\big] + \sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2. \tag{2.30}
\]

If $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le 0$, part (b) holds. If $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \ge 0$, we have
\[
\sum_{k\in\mathcal{N}}\gamma_k\big[g(x_k) - g(x^*)\big] \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2,
\]
which, in view of $g(x^*) \le 0$, implies that
\[
\sum_{k\in\mathcal{N}}\gamma_k g(x_k) \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2. \tag{2.31}
\]
It follows from (2.4), Assumption 3 and Lemma 7 that, for $k \in \mathcal{N}$, we have $\hat{G}_k > \eta_k$ and $\mathrm{Prob}\{\hat{G}_k \ge g(x_k) + \lambda\sigma/\sqrt{J}\} \le \exp\{-\lambda^2/3\}$, which implies $\mathrm{Prob}\{g(x_k) \le \eta_k - \lambda\sigma/\sqrt{J}\} \le \exp\{-\lambda^2/3\}$. Therefore,
\[
\begin{aligned}
\mathrm{Prob}\Big\{\sum_{k\in\mathcal{N}}\gamma_k g(x_k) \le \sum_{k\in\mathcal{N}}\gamma_k\eta_k - \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k\Big\}
&\le \mathrm{Prob}\Big\{\exists k\in\mathcal{N}:\; g(x_k) \le \eta_k - \frac{\lambda\sigma}{\sqrt{J}}\Big\} \\
&\le 1 - \Big(1 - \exp\Big\{-\frac{\lambda^2}{3}\Big\}\Big)^{|\mathcal{N}|} \le |\mathcal{N}|\exp\Big\{-\frac{\lambda^2}{3}\Big\}.
\end{aligned} \tag{2.32}
\]
Combining (2.31) and (2.32), we have
\[
\mathrm{Prob}\Big\{\sum_{k\in\mathcal{N}}\gamma_k\eta_k \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2 + \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k\Big\} \ge 1 - |\mathcal{N}|\exp\Big\{-\frac{\lambda^2}{3}\Big\}.
\]
Suppose that $|\mathcal{B}| < (N-s+1)/2$, i.e., $|\mathcal{N}| \ge (N-s+1)/2$. Then the condition in (2.29) implies that
\[
\sum_{k\in\mathcal{N}}\gamma_k\eta_k \ge \frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k > V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2 + \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k.
\]
It then follows from the previous two observations that $\mathrm{Prob}\{|\mathcal{B}| \ge (N-s+1)/2\} \ge 1 - |\mathcal{N}|\exp\{-\lambda^2/3\}$. ◻

Now we are ready to establish the large-deviation properties of the CSA algorithm.

Theorem 11 Suppose that Assumptions 2 and 3 hold.

(a) For any given partition $\mathcal{B}$ and $\mathcal{N}$ of $I = \{s,\dots,N\}$, we have, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\{f(\bar{x}_{N,s}) - f(x^*) \ge K_0 + \lambda K_1\} \le 2\exp\{-\lambda\} + (|\mathcal{N}|+2)\exp\Big\{-\frac{\lambda^2}{3}\Big\}, \tag{2.33}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_{N,s}) \ge \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\Big(\sum_{k\in\mathcal{B}}\gamma_k\eta_k\Big) + \frac{\lambda\sigma}{\sqrt{J}}\Big\} \le |\mathcal{B}|\exp\{-\lambda^2/3\}, \tag{2.34}
\]

where
\[
K_0 = \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\Big(D_X^2 + \frac{M_F^2}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 + \frac{M_G^2}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\Big)
\]
and
\[
K_1 = \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\Big(\frac{M_F^2}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 + \frac{M_G^2}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 + 2\sigma\sqrt{\textstyle\sum_{k\in\mathcal{N}}\gamma_k^2} + 2M_F D_X\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2} + \frac{\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k\Big).
\]

(b) For any $\Lambda \in (0,1)$, if we choose $\lambda$ such that $N\exp\{-\lambda^2/3\} \le \Lambda$ and set
\[
s = 1, \qquad \gamma_k = \frac{D_X}{\sqrt{N}\,M}, \qquad \eta_k = \frac{4MD_X}{\sqrt{N}} + \frac{2\lambda\sigma}{\sqrt{J}}, \tag{2.35}
\]
\[
N = \max\Big\{\frac{2C}{\epsilon^2}\Big(\log\frac{4}{\Lambda}\Big)^2,\; \frac{6C}{\epsilon^2}\log\frac{18D_X^2M^2}{\epsilon^2\Lambda},\; \frac{64M^2D_X^2}{\vartheta^2}\Big\},
\]
\[
J = \max\Big\{\frac{8\sigma^2}{\epsilon^2}\Big(\log\frac{4}{\Lambda}\Big)^2,\; \frac{24\sigma^2}{\epsilon^2}\log\frac{18D_X^2M^2}{\epsilon^2\Lambda},\; \frac{36\sigma^2}{\vartheta^2}\log\frac{1}{\Lambda^3},\; \frac{36\sigma^2}{\vartheta^2}\log\frac{18D_X^2M^2}{\epsilon^2\Lambda}\Big\},
\]
where $M = \max\{M_F, M_G\}$ and $C = \max\{9D_X^2M^2, 4\sigma^2\}$, then we have
\[
\mathrm{Prob}\{g(\bar{x}_{N,s}) \le \vartheta\} \ge 1 - \Lambda \quad\text{and}\quad \mathrm{Prob}\{f(\bar{x}_{N,s}) - f(x^*) \le \epsilon\} \ge (1-\Lambda)^2. \tag{2.36}
\]

Proof Let us first show part (a). Observe that the constraint evaluation, and hence the partition into $\mathcal{B}$ and $\mathcal{N}$, is independent of the trajectory. Let $G(x,\xi_k) = g(x) + \delta_k$ and $F'(x_k,\xi_k) = f'(x_k) + \Delta_k$. It follows from inequality (2.26) (with $x = x^*$) and the fact $g(x^*) \le 0$ that
\[
\begin{aligned}
\sum_{k\in\mathcal{N}}\gamma_k g(x_k) + \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)\big(f(\bar{x}_{N,s}) - f(x^*)\big) &\le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2 \\
&\quad + 2\sum_{k\in\mathcal{N}}\gamma_k\delta_k - \sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle.
\end{aligned} \tag{2.37}
\]
Now we provide probabilistic bounds for $\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2$, $\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2$, $\sum_{k\in\mathcal{N}}\gamma_k\delta_k$ and $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. First, setting $\theta_k = \gamma_k^2/\sum_{k\in\mathcal{B}}\gamma_k^2$, using the fact that $\mathbb{E}[\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\}] \le \exp\{1\}$ and Jensen's inequality, we have $\exp\{\sum_{k\in\mathcal{B}}\theta_k(\|F'(x_k,\xi_k)\|_*^2/M_F^2)\} \le \sum_{k\in\mathcal{B}}\theta_k\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\}$, and hence that $\mathbb{E}[\exp\{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2/(M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2)\}] \le \exp\{1\}$. It then follows from Markov's inequality that, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 > (1+\lambda)M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2\Big)
= \mathrm{Prob}\Big(\exp\Big\{\frac{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2}{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\} > \exp(1+\lambda)\Big) \le \frac{\exp\{1\}}{\exp\{1+\lambda\}} \le \exp\{-\lambda\}. \tag{2.38}
\]
Similarly, we have
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2 > (1+\lambda)M_G^2\sum_{k\in\mathcal{N}}\gamma_k^2\Big) \le \exp\{-\lambda\}. \tag{2.39}
\]
Second, for $\sum_{k\in\mathcal{N}}\gamma_k\delta_k$, setting $\zeta_k = \gamma_k\delta_k$ and noting that $\mathbb{E}[\delta_k] = 0$ and $\mathbb{E}[\exp\{\delta_k^2/\sigma^2\}] \le \exp\{1\}$, we obtain $\mathbb{E}[\gamma_k\delta_k] = 0$ and $\mathbb{E}[\exp\{\gamma_k^2\delta_k^2/(\gamma_k^2\sigma^2)\}] \le \exp\{1\}$. By Lemma 7, we have
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{N}}\gamma_k\delta_k > \lambda\sigma\sqrt{\textstyle\sum_{k\in\mathcal{N}}\gamma_k^2}\Big) \le \exp\{-\lambda^2/3\}. \tag{2.40}
\]

Lastly, let us consider $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. Setting $\beta_k = \gamma_k\langle\Delta_k, x_k - x^*\rangle$ and noting that $\mathbb{E}[\|\Delta_k\|_*^2] \le (2M_F)^2$, we have $\mathbb{E}[\exp\{\beta_k^2/(2M_F\gamma_k D_X)^2\}] \le \exp\{1\}$, which, in view of Lemma 7, implies that
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\beta_k > 2\lambda M_F D_X\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2}\Big) \le \exp\{-\lambda^2/3\}. \tag{2.41}
\]
Combining (2.37), (2.38), (2.39), (2.40), (2.41) and (2.32), and rearranging the terms, we get (2.33). Let us now show that (2.34) holds. Clearly, by the convexity of $g(\cdot)$ and the definition of $\bar{x}_{N,s}$, we have
\[
g(\bar{x}_{N,s}) = g\Big(\sum_{k\in\mathcal{B}}\iota_k x_k\Big) \le \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\sum_{k\in\mathcal{B}}\gamma_k g(x_k).
\]
Using this observation and an argument similar to the proof of (2.32), we obtain (2.34).

Next, let us show part (b). First, observe that condition (2.29) holds under the selection of $s$, $\{\gamma_k\}$ and $\{\eta_k\}$. From Lemma 10, at least one of the following two statements holds:

(a) $\mathrm{Prob}\{|\mathcal{B}| \ge (N-s+1)/2\} \ge 1 - |\mathcal{N}|\exp\{-\lambda^2/3\} \ge 1 - \Lambda$;
(b) $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le 0$, which, in view of the convexity of f, implies $f(\bar{x}_{N,s}) - f(x^*) \le 0$.

Also, from (2.34) and (2.35), we have
\[
\mathrm{Prob}\Big\{g(\bar{x}_{N,s}) \ge \frac{4MD_X}{\sqrt{N}} + \frac{3\lambda\sigma}{\sqrt{J}}\Big\} \le |\mathcal{B}|\exp\{-\lambda^2/3\},
\quad\text{and hence}\quad
\mathrm{Prob}\{g(\bar{x}_{N,s}) \ge \vartheta\} \le \Lambda.
\]
Moreover, conditionally on $|\mathcal{B}| \ge N/2$, it follows from part (a) and (2.35) that
\[
\mathrm{Prob}\Big\{f(\bar{x}_{N,s}) - f(x^*) \ge \frac{3D_X M}{\sqrt{N}} + \lambda\Big(\frac{3\sqrt{2}\,MD_X}{\sqrt{N}} + \frac{2\sqrt{2}\,\sigma}{\sqrt{N}} + \frac{\sqrt{2}\,\sigma}{\sqrt{J}}\Big)\Big\} \le 2\exp\{-\lambda\} + (|\mathcal{N}|+2)\exp\Big\{-\frac{\lambda^2}{3}\Big\}.
\]
By our selection of N and J, we then obtain (2.36). ◻

In view of Theorem 11, the number of iterations N of the CSA algorithm can be bounded by $\mathcal{O}\big(\max\big\{\frac{1}{\epsilon^2}\big(\log\frac{1}{\Lambda}\big)^2, \frac{1}{\vartheta^2}\big\}\big)$, and the sample size J for estimating the constraint in every iteration of the CSA algorithm can be bounded by $\mathcal{O}\big(\max\big\{\frac{1}{\epsilon^2}\big(\log\frac{1}{\Lambda}\big)^2, \frac{1}{\vartheta^2}\log\frac{1}{\Lambda^3}\big\}\big)$ for solving problem (1.1)–(1.2).

2.5 Strongly convex objective and strongly convex constraints

In this subsection, we are interested in establishing the convergence of the CSA algorithm applied to strongly convex problems. More specifically, we assume that the

objective function F and constraint function g in problem (1.1), where g is given in the form of a function constraint, are both strongly convex w.r.t. x, i.e., there exist $\mu_F > 0$ and $\mu_G > 0$ such that
\[
F(x_1,\xi) \ge F(x_2,\xi) + \langle F'(x_2,\xi), x_1 - x_2\rangle + \frac{\mu_F}{2}\|x_1 - x_2\|^2, \quad \forall x_1, x_2 \in X,
\]
\[
g(x_1) \ge g(x_2) + \langle g'(x_2), x_1 - x_2\rangle + \frac{\mu_G}{2}\|x_1 - x_2\|^2, \quad \forall x_1, x_2 \in X.
\]
For the sake of simplicity, we focus on the case when the constraint function g can be evaluated exactly (i.e., $\hat{G}_k = g(x_k)$). However, expectation constraints can be dealt with using techniques similar to those discussed in Sect. 2.4.

In order to estimate the convergence rate of the CSA algorithm for solving strongly convex problems, we need to assume that the prox-functions $V_X(\cdot,\cdot)$ and $V_Y(\cdot,\cdot)$ satisfy a quadratic growth condition:
\[
V_X(z,x) \le \frac{Q}{2}\|z-x\|^2, \;\forall z,x \in X \quad\text{and}\quad V_Y(z,y) \le \frac{Q}{2}\|z-y\|^2, \;\forall z,y \in Y. \tag{2.42}
\]
Moreover, letting $\gamma_k$ be the stepsizes used in the CSA method, and denoting
\[
a_k = \begin{cases}\frac{\mu_F\gamma_k}{Q}, & k \in \mathcal{B},\\[2pt] \frac{\mu_G\gamma_k}{Q}, & k \in \mathcal{N},\end{cases}
\qquad
A_k = \begin{cases}1, & k = 1,\\ (1-a_k)A_{k-1}, & k \ge 2,\end{cases}
\qquad\text{and}\qquad \rho_k = \frac{\gamma_k}{A_k},
\]
we define
\[
\bar{x}_{N,s} = \frac{\sum_{k\in\mathcal{B}}\rho_k x_k}{\sum_{k\in\mathcal{B}}\rho_k} \tag{2.43}
\]
as the output of Algorithm 1.

The following simple result will be used in the convergence analysis of the CSA method.

Lemma 12 If $a_k \in (0,1]$, $k = 1, 2, \dots$, $A_k > 0$ for all $k \ge 1$, and $\{\Delta_k\}$ satisfies
\[
\Delta_{k+1} \le (1-a_k)\Delta_k + B_k, \quad \forall k \ge 1,
\]
then we have
\[
\frac{\Delta_{k+1}}{A_k} \le (1-a_1)\Delta_1 + \sum_{i=1}^{k}\frac{B_i}{A_i}.
\]

Below we provide an important recursion for CSA applied to strongly convex problems. This result differs from Proposition 2 for the general convex case in that we use the weights $\rho_k$ rather than $\gamma_k$.

Proposition 13 For any $1 \le s \le N$, we have
\[
\sum_{k\in\mathcal{N}}\rho_k\big(\eta_k - g(x)\big) + \sum_{k\in\mathcal{B}}\rho_k\big[F(x_k,\xi_k) - F(x,\xi_k)\big] \le (1-a_s)D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k\|g'(x_k)\|_*^2. \tag{2.44}
\]

Proof Consider iteration k for $s \le k \le N$. If $k \in \mathcal{B}$, by Lemma 1 and the strong convexity of $F(\cdot,\xi)$, we have
\[
\begin{aligned}
V(x_{k+1},x) &\le V(x_k,x) - \gamma_k\langle h_k, x_k - x\rangle + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 \\
&= V(x_k,x) - \gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 \\
&\le V(x_k,x) - \gamma_k\Big[F(x_k,\xi_k) - F(x,\xi_k) + \frac{\mu_F}{2}\|x_k - x\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 \\
&\le \Big(1 - \frac{\mu_F\gamma_k}{Q}\Big)V(x_k,x) - \gamma_k\big[F(x_k,\xi_k) - F(x,\xi_k)\big] + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2.
\end{aligned}
\]
Similarly, for $k \in \mathcal{N}$, using Lemma 1 and the strong convexity of g, we have
\[
\begin{aligned}
V(x_{k+1},x) &\le V(x_k,x) - \gamma_k\langle g'(x_k), x_k - x\rangle + \tfrac{1}{2}\gamma_k^2\|g'(x_k)\|_*^2 \\
&\le V(x_k,x) - \gamma_k\Big[g(x_k) - g(x) + \frac{\mu_G}{2}\|x_k - x\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|g'(x_k)\|_*^2 \\
&\le \Big(1 - \frac{\mu_G\gamma_k}{Q}\Big)V(x_k,x) - \gamma_k\big(\eta_k - g(x)\big) + \tfrac{1}{2}\gamma_k^2\|g'(x_k)\|_*^2.
\end{aligned}
\]
Summing up these inequalities for $s \le k \le N$ and using Lemma 12, we have
\[
\begin{aligned}
\frac{V(x_{N+1},x)}{A_N} &\le (1-a_s)V(x_s,x) - \Big[\sum_{k\in\mathcal{N}}\frac{\gamma_k}{A_k}\big(\eta_k - g(x)\big) + \sum_{k\in\mathcal{B}}\frac{\gamma_k}{A_k}\big[F(x_k,\xi_k) - F(x,\xi_k)\big]\Big] \\
&\quad + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\frac{\gamma_k^2}{A_k}\|g'(x_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\frac{\gamma_k^2}{A_k}\|F'(x_k,\xi_k)\|_*^2.
\end{aligned}
\]
Using the facts that $V(x_{N+1},x)/A_N \ge 0$ and the definition of $\rho_k$, and rearranging the terms in the above inequality, we obtain (2.44). ◻

Lemma 14 below provides a sufficient condition which guarantees that $\bar{x}_{N,s}$ is well-defined.

Lemma 14 Let $x^*$ be the optimal solution of (1.1). If
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\rho_k\eta_k > (1-a_s)D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2, \tag{2.45}
\]
then $\mathcal{B} \ne \emptyset$ and hence $\bar{x}_{N,s}$ is well-defined. Moreover, at least one of the following two statements holds:

(a) $|\mathcal{B}| \ge (N-s+1)/2$;
(b) $\sum_{k\in\mathcal{B}}\rho_k\big[f(x_k) - f(x^*)\big] \le 0$.

Proof The proof of this result is similar to that of Lemma 2 and hence the details are skipped. ◻

With the help of Proposition 13, we are ready to establish the main convergence properties of the CSA method for solving strongly convex problems.
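As a quick numerical sanity check (our own, not from the paper), the snippet below verifies the closed forms used later in the proof of Corollary 16 — with the specific choice $a_k = 2/(k+1)$, the product $A_k = \prod_{i=2}^{k}(1-a_i)$ telescopes to $2/(k(k+1))$ — and confirms the conclusion of Lemma 12 on a randomly generated sequence that satisfies the recursion with equality.

```python
import random

# Closed form for the weights of Corollary 16: with a_k = 2/(k+1),
# A_1 = 1 and A_k = (1 - a_k) A_{k-1} telescope to 2/(k(k+1)).
K = 50
a = [0.0] + [2.0 / (k + 1) for k in range(1, K + 1)]   # a[k], 1-indexed
A = [0.0] * (K + 1)
A[1] = 1.0
for k in range(2, K + 1):
    A[k] = (1.0 - a[k]) * A[k - 1]
for k in range(2, K + 1):
    assert abs(A[k] - 2.0 / (k * (k + 1))) < 1e-12

# Lemma 12: if Delta_{k+1} <= (1 - a_k) Delta_k + B_k for all k >= 1, then
# Delta_{k+1} / A_k <= (1 - a_1) Delta_1 + sum_{i <= k} B_i / A_i.
random.seed(0)
B = [0.0] + [random.random() for _ in range(K)]        # B[k] >= 0, 1-indexed
Delta = [0.0] * (K + 2)
Delta[1] = 5.0
for k in range(1, K + 1):
    Delta[k + 1] = (1.0 - a[k]) * Delta[k] + B[k]      # recursion with equality
rhs = (1.0 - a[1]) * Delta[1] + sum(B[i] / A[i] for i in range(1, K + 1))
assert Delta[K + 1] / A[K] <= rhs + 1e-6 * abs(rhs) + 1e-9
print("Lemma 12 bound holds:", Delta[K + 1] / A[K], "<=", rhs)
```

With equality in the recursion the two sides coincide (up to floating-point error), which is exactly the tight case of Lemma 12.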

Theorem 15 Suppose that $\{\gamma_k\}$ and $\{\eta_k\}$ in the CSA algorithm are chosen such that (2.45) holds. Then for any $1 \le s \le N$, we have
\[
\mathbb{E}[f(\bar{x}_{N,s}) - f(x^*)] \le \Big((N-s+1)\min_{s\le k\le N}\rho_k\Big)^{-1}\Big(2(1-a_s)D_X^2 + \sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2 + \sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2\Big), \tag{2.46}
\]
\[
g(\bar{x}_{N,s}) \le \Big(\sum_{k\in\mathcal{B}}\rho_k\Big)^{-1}\Big(\sum_{k\in\mathcal{B}}\rho_k\eta_k\Big). \tag{2.47}
\]

Proof Taking expectation w.r.t. $\xi_i$, $1 \le i \le k$, on both sides of (2.44) (with $x = x^*$) and using Assumption 1, we have
\[
\sum_{k\in\mathcal{N}}\rho_k\big(\eta_k - g(x^*)\big) + \sum_{k\in\mathcal{B}}\rho_k\,\mathbb{E}[f(x_k) - f(x^*)] \le (1-a_s)D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2.
\]
(2.46) then immediately follows from the above inequality, (2.43), the convexity of f and the fact that $g(x^*) \le 0$. Moreover, (2.47) follows similarly to (2.18). ◻

Below we provide a policy for choosing $s$, $\gamma_k$ and $\eta_k$ in order to achieve the optimal rate of convergence for solving strongly convex problems.

Corollary 16 Let
\[
s = \frac{N}{2}, \qquad
\gamma_k = \begin{cases}\frac{2Q}{\mu_F(k+1)}, & k \in \mathcal{B},\\[2pt] \frac{2Q}{\mu_G(k+1)}, & k \in \mathcal{N},\end{cases}
\qquad
\eta_k = \frac{2\mu_G Q}{k}\Big(\frac{2D_X^2}{k} + \max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}\Big).
\]
Then we have
\[
\mathbb{E}[f(\bar{x}_{N,s}) - f(x^*)] \le \frac{8\mu_F D_X^2}{N^2 Q} + \frac{4\mu_F Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\},
\]
\[
g(\bar{x}_{N,s}) \le \frac{16\mu_G Q D_X^2}{N^2} + \frac{4\mu_G Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}.
\]

Proof Based on our selection of $s$, $\gamma_k$, $\eta_k$ and the definition of $a_k$, $A_k$ and $\rho_k$, we have
\[
a_k = \frac{2}{k+1}, \qquad A_k = \prod_{i=2}^{k}(1-a_i) = \frac{2}{k(k+1)}, \qquad
\rho_k = \begin{cases}\frac{kQ}{\mu_F}, & k \in \mathcal{B},\\[2pt] \frac{kQ}{\mu_G}, & k \in \mathcal{N}.\end{cases}
\]
For any $s \le k \le N$, by the definition of $s$, $\gamma_k$ and $\eta_k$, we have

\[
\begin{aligned}
(1-a_s)V(x_s,x) + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2
&\le D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\frac{\gamma_k^2}{A_k}M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\frac{\gamma_k^2}{A_k}M_G^2 \\
&\le D_X^2 + \frac{Q^2}{2}\Big(|\mathcal{B}|\frac{M_F^2}{\mu_F^2} + |\mathcal{N}|\frac{M_G^2}{\mu_G^2}\Big)
\le D_X^2 + \frac{Q^2 N}{2}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\},
\end{aligned}
\]
and
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\rho_k\eta_k = \frac{N}{4}\min_{k\in\mathcal{N}}\frac{kQ}{\mu_G}\cdot\frac{2\mu_G Q}{k}\Big(\frac{2D_X^2}{k} + \max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}\Big) \ge D_X^2 + \frac{Q^2 N}{2}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}.
\]
Combining the above two inequalities, we can easily see that condition (2.45) holds. It then follows from Theorem 15 that
\[
\mathbb{E}[f(\bar{x}_{N,s}) - f(x^*)] \le \Big((N-s+1)\min_{s\le k\le N}\rho_k\Big)^{-1}\Big(2(1-a_s)D_X^2 + \sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2 + \sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2\Big) \le \frac{8\mu_F D_X^2}{N^2 Q} + \frac{4\mu_F Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\},
\]
and
\[
g(\bar{x}_{N,s}) \le \Big(\sum_{k\in\mathcal{B}}\rho_k\Big)^{-1}\Big(\sum_{k\in\mathcal{B}}\rho_k\eta_k\Big) \le \frac{16\mu_G Q D_X^2}{N^2} + \frac{4\mu_G Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}. \qquad ◻
\]

In view of Corollary 16, the CSA algorithm achieves the optimal rate of convergence for strongly convex optimization with strongly convex constraints. To the best of our knowledge, this is the first time such a complexity result has been obtained in the literature, and the result is new even in the deterministic setting.

3 Expectation constraints over problem parameters

In this section, we are interested in solving a class of parameterized stochastic optimization problems whose parameters are defined by expectation constraints, as described in (1.4)–(1.5), under the assumption that a pair of solutions satisfying (1.4)–(1.5) exists.

Our goal in this section is to present a variant of the CSA algorithm to approximately solve problem (1.4)–(1.5) and to establish its convergence properties. More specifically, we discuss this variant of the CSA algorithm when applied to the parameterized stochastic optimization problem in (1.4)–(1.5), and then consider a modified problem obtained by imposing certain strong convexity assumptions on the function $\Phi(x,y,\zeta)$ w.r.t. y and on $G(x,\xi)$ w.r.t. x, in Sects. 3.1 and 3.2, respectively. In Sect. 3.3, we discuss some large-deviation properties of this variant of the CSA method for the problem defined by (1.4)–(1.5).

3.1 Stochastic optimization with parameter feasibility constraints

Given tolerance $\eta > 0$ and target accuracy $\epsilon > 0$, we present in this subsection a variant of the CSA algorithm, namely the cooperative stochastic parameter approximation (CSPA) method, to find a pair of approximate solutions $(\bar{x}, \bar{y}) \in X \times Y$ such that $\mathbb{E}[g(\bar{x})] \le \eta$ and $\mathbb{E}[\phi(\bar{x},\bar{y}) - \phi(\bar{x},y)] \le \epsilon$ for all $y \in Y$. Before describing the CSPA method, we need to slightly modify Assumption 1.

Assumption 4 For any $x \in X$ and $y \in Y$,
\[
\mathbb{E}[\|\Phi'(x,y,\zeta)\|_*^2] \le M_\Phi^2 \quad\text{and}\quad \mathbb{E}[\|G'(x,\xi)\|_*^2] \le M_G^2,
\]
where $\Phi'(x,y,\zeta) \in \partial_y\Phi(x,y,\zeta)$ and $G'(x,\xi) \in \partial_x G(x,\xi)$.

We will also discuss the convergence properties under the following light-tail assumptions.

Assumption 5 For any $x \in X$ and $y \in Y$,
\[
\mathbb{E}\big[\exp\{\|\Phi'(x,y,\zeta)\|_*^2/M_\Phi^2\}\big] \le \exp\{1\},
\]
\[
\mathbb{E}\big[\exp\{(\Phi(x,y,\zeta) - \phi(x,y))^2/\sigma^2\}\big] \le \exp\{1\},
\]
\[
\mathbb{E}\big[\exp\{(G(x,\xi) - g(x))^2/\sigma^2\}\big] \le \exp\{1\}.
\]

We assume that the distance generating functions $\omega_X: X \mapsto \mathbb{R}$ and $\omega_Y: Y \mapsto \mathbb{R}$ are strongly convex with modulus 1 w.r.t. the given norms in $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, and that their associated prox-mappings $P_{x,X}$ and $P_{y,Y}$ (see (2.1)) are easily computable.

We make the following modifications to the CSA method of Sect. 2.1 in order to apply it to problem (1.4)–(1.5). Firstly, we still check the solution $(x_k, y_k)$ to see whether $x_k$ violates the condition $\sum_{i=1}^{k}\gamma_i G(x_i,\xi_i)/\sum_{i=1}^{k}\gamma_i \le \eta_k$. If so, we set the search direction to $G'(x_k,\xi_k)$ in order to update $x_k$, while keeping $y_k$ intact. Otherwise, we only update $y_k$ along the direction $\Phi'(\bar{x}_k, y_k, \zeta_k)$. Secondly, we define the output as a randomly selected $(\bar{x}_k, y_k)$ according to a certain probability distribution, instead of the ergodic mean of $\{(\bar{x}_k, y_k)\}$, where $\bar{x}_k$ denotes the average of $\{x_k\}$ (see (3.1)). Since we are solving a coupled optimization and feasibility problem, each iteration of our algorithm updates either $y_k$ or $x_k$, and requires the computation of either $\Phi'$ or $G'$, depending on whether $\sum_{i=1}^{k}\gamma_i G(x_i,\xi_i)/\sum_{i=1}^{k}\gamma_i \le \eta_k$. This differs from the SA method of Jiang and Shanbhag [16], which requires two projection steps and the computation of two subgradients at each iteration, to solve a different parameterized stochastic optimization problem.
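The control flow just described can be sketched as follows. This is our own illustrative pseudocode-style sketch: the scalar problem data, the stepsize, the tolerance sequence, and the noise distributions are placeholders, not taken from the paper, and the final random selection of the output pair is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy data (placeholders, not from the paper):
#   G(x, xi)  = x - 0.5 + xi, with xi ~ N(0, 0.1^2)  ->  g(x) = x - 0.5
#   Phi(x, y, zeta) = (y - x)^2 + zeta * y           ->  minimized over y near x
# X = [0, 2] and Y = [-2, 2], both scalar, for readability.

def G(x, xi):
    return x - 0.5 + xi

def Gprime(x):
    return 1.0                                   # subgradient of G w.r.t. x

def Phiprime(x, y, zeta):
    return 2.0 * (y - x) + zeta                  # stochastic subgradient in y

N = 2000
gamma = 1.0 / np.sqrt(N)                         # stepsize gamma_k (placeholder)
x, y = 2.0, -2.0
run_num = run_den = 0.0                          # running weighted constraint average
B = []                                           # pairs recorded at iterations k in B

for k in range(1, N + 1):
    eta_k = 4.0 / np.sqrt(k)                     # tolerance eta_k (placeholder)
    xi = 0.1 * rng.standard_normal()
    run_num += gamma * G(x, xi)
    run_den += gamma
    if run_num / run_den <= eta_k:               # k in B: x looks feasible on average,
        zeta = 0.1 * rng.standard_normal()       # so improve y for the current x
        y = np.clip(y - gamma * Phiprime(x, y, zeta), -2.0, 2.0)
        B.append((x, y))
    else:                                        # k in N: push x toward feasibility
        x = np.clip(x - gamma * Gprime(x), 0.0, 2.0)

# CSPA would output a randomly selected pair from B; here we just inspect the state.
print(len(B), x, y)
```

Note the asymmetry that distinguishes CSPA from a generic primal-dual scheme: each iteration touches only one of the two blocks of variables and computes only one subgradient.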

With a slight abuse of notation, we still use $\mathcal{B}$ to denote the set $\{s \le k \le N : \sum_{i=1}^{\tau(k)}\gamma_i G(x_i,\xi_i)/\sum_{i=1}^{\tau(k)}\gamma_i \le \eta_k\}$, $I = \{s,\dots,N\}$, and $\mathcal{N} = I \setminus \mathcal{B}$. The following result mimics Proposition 2.

Proposition 17 For any $1 \le s \le N$, we have
\[
\sum_{k\in\mathcal{B}}\gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y_k - y\rangle \le D_Y^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2, \quad \forall y \in Y, \tag{3.4}
\]
\[
\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\big[G(x_i,\xi_i) - G(x,\xi_i)\big] \le D_X^2 + \tfrac{1}{2}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2, \quad \forall x \in X, \tag{3.5}
\]
where $D_X \equiv D_{X,\omega_X}$ and $D_Y \equiv D_{Y,\omega_Y}$ are defined as in (2.3).

Proof By Lemma 1, if $k \in \mathcal{B}$,
\[
V(y_{k+1},y) \le V(y_k,y) + \gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y - y_k\rangle + \tfrac{1}{2}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2.
\]
Also note that $V(y_{k+1},y) = V(y_k,y)$ for $k \in \mathcal{N}$. Summing up these relations for $k \in \mathcal{B} \cup \mathcal{N}$ and using the fact that $V(y_s,y) \le D_Y^2$, we have
\[
\begin{aligned}
V(y_{N+1},y) &\le V(y_s,y) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 - \sum_{k\in\mathcal{B}}\gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y_k - y\rangle \\
&\le D_Y^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 - \sum_{k\in\mathcal{B}}\gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y_k - y\rangle.
\end{aligned} \tag{3.6}
\]
Similarly, for $\tau(s) \le i \le \tau(N)$, we have
\[
V(x_{i+1},x) \le V(x_i,x) + \gamma_i\langle G'(x_i,\xi_i), x - x_i\rangle + \tfrac{1}{2}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2.
\]
Summing up these relations for $\tau(s) \le i \le \tau(N)$ and using the fact that $V(x_{\tau(s)},x) \le D_X^2$, we obtain

\[
V(x_{\tau(N)+1},x) \le D_X^2 + \tfrac{1}{2}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2 - \sum_{i=\tau(s)}^{\tau(N)}\gamma_i\big(G(x_i,\xi_i) - G(x,\xi_i)\big). \tag{3.7}
\]
Using the facts that $V(y_{N+1},y) \ge 0$ and $V(x_{\tau(N)+1},x) \ge 0$, and rearranging the terms in (3.6) and (3.7), we obtain (3.4) and (3.5), respectively. ◻

The following result provides a sufficient condition under which $(\bar{x}_R, y_R)$ is well-defined.

Lemma 18 The following statements hold.

(a) Under Assumption 4, if for any $\lambda > 0$ we have
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k > D_X^2 + \lambda\,\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2, \tag{3.8}
\]
then $\mathrm{Prob}\{|\mathcal{B}| \ge \frac{N-s+1}{2}\} \ge 1 - 1/\lambda$.
(b) Under Assumption 5, if for any $\lambda > 0$ we have
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k > D_X^2 + (1+\lambda)\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2 + \lambda\sigma\sqrt{\textstyle\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}, \tag{3.9}
\]
then $\mathrm{Prob}\{|\mathcal{B}| \ge \frac{N-s+1}{2}\} \ge 1 - 2\exp\{-\frac{\lambda^2}{3}\}$.

Proof Let us first show part (a). Setting $\delta_k = G(x^*,\xi_k) - g(x^*)$, it follows from (3.5) (with $x = x^*$) that
\[
\sum_{i=\tau(s)}^{\tau(N)}\gamma_i G(x_i,\xi_i) - \sum_{i=\tau(s)}^{\tau(N)}\gamma_i g(x^*) \le D_X^2 + \tfrac{1}{2}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2 + \sum_{i=\tau(s)}^{\tau(N)}\gamma_i\delta_i.
\]
For contradiction, suppose that $|\mathcal{B}| < \frac{N-s+1}{2}$, i.e., $\tau(N) - \tau(s) = |\mathcal{N}| \ge \frac{N-s+1}{2}$. The above relation, in view of $g(x^*) \le 0$ and the fact that $\sum_{i=\tau(s)}^{\tau(N)}\gamma_i G(x_i,\xi_i) \ge \eta_{\tau(N)}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i$, implies that
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k \le \eta_{\tau(N)}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k \le D_X^2 + \tfrac{1}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2 + \sum_{k=\tau(s)}^{\tau(N)}\gamma_k\delta_k.
\]
Under Assumption 4, for any $\lambda > 0$, taking expectation on both sides and using Markov's inequality, we have
\[
\mathrm{Prob}\Big\{\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k \le D_X^2 + \lambda\,\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2\Big\} \ge 1 - 1/\lambda.
\]
Hence, part (a) holds. Part (b) can be shown similarly, and the details are skipped. ◻

Theorem 19 summarizes the main convergence properties of Algorithm 2 applied to problem (1.4)–(1.5).

Theorem 19 The following statements hold for the CSPA algorithm.

(a) Under Assumption 4, we have, for any $\lambda > 0$,
\[
\mathbb{E}[\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R))] \le \frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}, \tag{3.10}
\]
\[
\mathrm{Prob}\Big\{\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R)) \ge \lambda\,\frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}\Big\} \le \frac{1}{\lambda}, \tag{3.11}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k}\Big\} \le \frac{1}{\lambda^2}. \tag{3.12}
\]

(b) Under Assumption 5, we have, for any $\lambda > 0$,
\[
\mathbb{E}[\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R))] \le \frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}, \tag{3.13}
\]
\[
\mathrm{Prob}\{\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R)) \ge K_0 + \lambda K_1\} \le \exp\{-\lambda\} + \exp\{-\lambda^2/3\}, \tag{3.14}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k}\Big\} \le \exp\{-\lambda^2/3\}, \tag{3.15}
\]
where
\[
K_0 = \frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}
\quad\text{and}\quad
K_1 = \frac{M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2 + 4M_\Phi D_Y\sqrt{\sum_{k\in\mathcal{B}}\gamma_k^2}}{2\sum_{k\in\mathcal{B}}\gamma_k},
\]
and the expectation is taken w.r.t. R and $\zeta_1,\dots,\zeta_N$.

Proof Let us prove part (a) first. Setting $\Delta_k = \Phi(\bar{x}_k, y_k, \zeta_k) - \phi(\bar{x}_k, y_k)$, it follows from (3.4) (with $y = y^*$) that
\[
\sum_{k\in\mathcal{B}}\gamma_k\big[\phi(\bar{x}_k, y_k) - \phi(\bar{x}_k, y^*(\bar{x}_k))\big] \le D_Y^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 + \sum_{k\in\mathcal{B}}\gamma_k\Delta_k(y - y_k). \tag{3.16}
\]
Since, conditionally on $\zeta_{[k-1]}$, the expectation of $\Delta_k$ equals zero, taking expectation on both sides of (3.16) and dividing both sides by $\sum_{k\in\mathcal{B}}\gamma_k$, we obtain (3.10). Hence, using Markov's inequality, we have (3.11). Denote $\delta_k = G(x_k,\xi_k) - g(x_k)$. It then follows from the convexity of $g(\cdot)$ and the definition of the set $\mathcal{B}$ that
\[
g(\bar{x}_k) \le \frac{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i g(x_i)}{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i} \le \eta_k - \frac{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\delta_i}{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i}. \tag{3.17}
\]
Using the facts that $\mathbb{E}[\delta_k \mid \xi_{[k-1]}] = 0$ and $\mathbb{E}[|\delta_k|^2] \le \sigma^2$, we have
\[
\mathbb{E}\Bigg[\bigg|\frac{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\delta_i}{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i}\bigg|^2\Bigg] \le \frac{\sigma^2\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2}{\big(\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\big)^2}.
\]
From Markov's inequality, we have (3.12). Hence part (a) holds.

Under Assumption 5, (3.13) still holds. Using the fact that $\mathbb{E}[\exp\{\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2/M_\Phi^2\}] \le \exp\{1\}$ and Jensen's inequality, we have $\mathbb{E}[\exp\{\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2/(M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2)\}] \le \exp\{1\}$. It then follows from Markov's inequality that, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 > (1+\lambda)M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2\Big) \le \frac{\exp\{1\}}{\exp\{1+\lambda\}} \le \exp\{-\lambda\}. \tag{3.18}
\]
Also,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k\Delta_k(y - y_k) > 2\lambda M_\Phi D_Y\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2}\Big) \le \exp\{-\lambda^2/3\}. \tag{3.19}
\]
Combining (3.16), (3.18) and (3.19), we have (3.14). Similarly, we have
\[
\mathrm{Prob}\Big\{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k\delta_k \ge \lambda\sigma\sqrt{\textstyle\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}\Big\} \le \exp\{-\lambda^2/3\}. \tag{3.20}
\]
Combining (3.17) and (3.20), we have (3.15). ◻

Below we provide a special selection of $s$, $\{\gamma_k\}$ and $\{\eta_k\}$.

Corollary 20 Let $s = \frac{N}{2} + 1$, $\gamma_k = \frac{D_X}{M_G\sqrt{k}}$ and $\eta_k = \frac{4M_G D_X}{\sqrt{k}}$ for $k = 1,\dots,N$. Then we have
\[
\mathbb{E}[\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R))] \le \frac{8M_\Phi D_Y}{\sqrt{N}}\max\Big\{\nu, \frac{1}{\nu}\Big\},
\]
where $\nu := (M_G D_Y)/(M_\Phi D_X)$. Moreover, the following statements hold.

(a) Under Assumption 4,
\[
\mathrm{Prob}\Big\{\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R)) \le \lambda\,\frac{8M_\Phi D_Y}{\sqrt{N}}\max\Big\{\nu, \frac{1}{\nu}\Big\}\Big\} \ge \Big(1 - \frac{1}{\lambda}\Big)^2, \tag{3.21}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_R) \le \lambda\,\frac{\sqrt{2}\,D_X M_G}{\sqrt{N}}\Big\} \ge \Big(1 - \frac{1}{\lambda}\Big)^2. \tag{3.22}
\]
(b) Under Assumption 5,

  • G. Lan, Z. Zhou

    1 3

    where K0 =8MΦDY√

    Nmax{�,

    1

    } and K1 =1√N

    �4M2

    ΦDX

    MG+ 10MΦDY

    �.

    Proof Similarly to Corollary 5, we can show that (3.8) holds. It then follows from Lemma 18 and Theorem 19(a) that

    Similarly, part (b) follows from Theorem 19(b). ◻

    By Corollary  (20), the CSPA method applied to (1.4)–(1.5) can achieve an O(1∕

    √N) rate of convergence.

    3.2 CSPA with strong convexity assumptions

    In this subsection, we modify problem (1.4)–(1.5) by imposing certain strong convexity assumptions to Φ and G with respect to y and x, respectively, i.e., ∃𝜇Φ,𝜇G > 0 , s.t.

    We also assume that the pair of solutions (x∗, y∗) exists for problem (1.4)–(1.5). Our main goal in this subsection is to estimate the convergence properties of the CSPA algorithm under these new assumptions.

    We need to modify the probability distribution (3.3) used in the CSPA algorithm as follows. Given the stepsize �k , modulus �G and �Φ , and growth parameter Q (see (2.42)), let us define

    and denote

    Prob�𝜙(x̄R, yR) − 𝜙(x̄R, y

    ∗(x̄R)) ≤ K0 + 𝜆K1�

    ≥ (1 − 2exp{−𝜆2∕3})(1 − exp{−𝜆} − exp{−𝜆2∕3}),

    Prob

    �g(x̄R) ≤

    √2DX

    MG

    √N+ 𝜆

    5𝜎√N

    �≥ (1 − 2exp{−𝜆2∕3})(1 − exp{−𝜆2∕3}),

    ∑k∈B �k =

    ∑k∈B

    DX

    MG

    √k≥

    DX

    MG

    N

    4

    1√N=

    DX

    √N

    4MG.

$\mathbb{E}[\phi(x_R,y_R)-\phi(x_R,y^*)] \le \frac{2M_G}{D_X\sqrt{N}}\Big(2D_Y^2+\sum_{k\in B}\frac{D_X^2 M_\Phi^2}{M_G^2 k}\Big) \le \frac{2M_G}{D_X\sqrt{N}}\Big(2D_Y^2+\sum_{k=N/2}^{N}\frac{D_X^2 M_\Phi^2}{M_G^2 k}\Big) \le \frac{2M_G}{D_X\sqrt{N}}\Big[2D_Y^2+\log 2\,\frac{D_X^2 M_\Phi^2}{M_G^2}\Big] \le \frac{8M_\Phi D_Y}{\sqrt{N}}\max\Big\{\nu,\frac{1}{\nu}\Big\}.$

(3.23) $\Phi(x,y_1,\zeta) \ge \Phi(x,y_2,\zeta) + \langle \Phi'(x,y_2,\zeta), y_1-y_2\rangle + \frac{\mu_\Phi}{2}\|y_1-y_2\|^2, \quad \forall y_1,y_2\in Y.$

(3.24) $G(x_1,\xi) \ge G(x_2,\xi) + \langle G'(x_2,\xi), x_1-x_2\rangle + \frac{\mu_G}{2}\|x_1-x_2\|^2, \quad \forall x_1,x_2\in X.$

(3.25) $a_k := \dfrac{\mu_\Phi\gamma_k}{Q} \quad\text{and}\quad A_k := \begin{cases} 1, & k=1;\\ \prod_{i\le k,\, i\in B}(1-a_i), & k>1, \end{cases}$


    Also the probability distribution of R is modified to

    The following result shows some simple but important properties for the modified CSPA method applied to problem (1.4)–(1.5).

    Proposition 21 For any s ≤ k ≤ m , we have

    Proof Using Lemma 1 and the strong convexity of Φ w.r.t. y, for k ∈ B , we have

    Also note that VY (yk+1, y) = VY (yk, y) for all k ∈ N . Summing up these relations for all k ∈ B ∪N and using Lemma 12, we obtain

Similarly, for $\tau(s) \le k \le \tau(N)$, we have

(3.26) $b_k := \dfrac{\mu_G\gamma_k}{Q} \quad\text{and}\quad B_k := \begin{cases} 1, & k=1;\\ \prod_{i=1}^{k}(1-b_i), & k>1. \end{cases}$

(3.27) $\mathrm{Prob}\{R=k\} = \dfrac{\gamma_k/A_k}{\sum_{i\in B}\gamma_i/A_i}, \quad k\in B.$
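As an illustration, the products $A_k$ in (3.25) and the weights $\gamma_k/A_k$ of the distribution (3.27) are cheap to maintain incrementally. The following is a minimal Python sketch; the function names and the representation of $B$ as a sorted list of indices are our own illustrative choices, not part of the paper:

```python
import random

def output_distribution(B, gamma, mu_Phi, Q):
    """Compute A_k as in (3.25) and the unnormalized weights gamma_k / A_k
    used by the distribution (3.27).  B is the sorted list of 'productive'
    iteration indices; gamma maps k -> stepsize gamma_k (illustrative names)."""
    A, prod = {}, 1.0
    for k in B:
        prod *= 1.0 - mu_Phi * gamma[k] / Q   # accumulate prod_{i<=k, i in B}(1 - a_i)
        A[k] = 1.0 if k == 1 else prod        # A_1 := 1 by definition
    w = {k: gamma[k] / A[k] for k in B}
    return A, w

def sample_R(B, w, rng=random):
    """Draw R with Prob{R = k} = w_k / sum_{i in B} w_i, as in (3.27)."""
    total = sum(w.values())
    return rng.choices(B, weights=[w[k] / total for k in B], k=1)[0]
```

Note that, compared with (3.3), the weights here are inflated by $1/A_k$, which biases the output index toward later (more accurate) iterates under strong convexity.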

(3.28) $\sum_{k\in B}\frac{\gamma_k}{A_k}[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)] \le \Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)V_Y(y_s,y) + \frac{1}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\|\Phi'(x_k,y_k,\zeta_k)\|_*^2, \quad \forall y\in Y,$

(3.29) $\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k}{B_k}\big[\eta_k - G(x,\xi_k)\big] \le \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)V_X(x_s,x) + \frac{1}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k}\|G'(x_k,\xi_k)\|_*^2, \quad \forall x\in X.$

$\begin{aligned} V_Y(y_{k+1},y) &\le V_Y(y_k,y) - \gamma_k\langle \Phi'(x_k,y_k,\zeta_k), y_k-y\rangle + \tfrac{1}{2}\gamma_k^2\|\Phi'(x_k,y_k,\zeta_k)\|_*^2\\ &\le V_Y(y_k,y) - \gamma_k\Big[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)+\tfrac{\mu_\Phi}{2}\|y_k-y\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|\Phi'(x_k,y_k,\zeta_k)\|_*^2\\ &\le \Big(1-\tfrac{\mu_\Phi\gamma_k}{Q}\Big)V_Y(y_k,y) - \gamma_k[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)] + \tfrac{1}{2}\gamma_k^2\|\Phi'(x_k,y_k,\zeta_k)\|_*^2. \end{aligned}$

(3.30) $\frac{V_Y(y_{N+1},y)}{A_N} \le \Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)V_Y(y_s,y) - \sum_{k\in B}\frac{\gamma_k}{A_k}[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)] + \frac{1}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\|\Phi'(x_k,y_k,\zeta_k)\|_*^2.$

$\begin{aligned} V_X(x_{k+1},x) &\le V_X(x_k,x) - \gamma_k\langle G'(x_k,\xi_k), x_k-x\rangle + \tfrac{1}{2}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2\\ &\le V_X(x_k,x) - \gamma_k\Big[G(x_k,\xi_k)-G(x,\xi_k)+\tfrac{\mu_G}{2}\|x_k-x\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2\\ &\le \Big(1-\tfrac{\mu_G\gamma_k}{Q}\Big)V_X(x_k,x) - \gamma_k[G(x_k,\xi_k)-G(x,\xi_k)] + \tfrac{1}{2}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2, \end{aligned}$


Summing up these relations for $\tau(s) \le k \le \tau(N)$ and using Lemma 12, we have

Using the facts that $V_Y(y_{N+1}, y)/A_N \ge 0$ and $V_X(x_{N+1}, x)/B_N \ge 0$, and rearranging the terms in (3.30) and (3.31), we obtain (3.28) and (3.29), respectively. ◻

    Lemma 22 below provides a sufficient condition which guarantees that the output solution (x̄R, yR) is well-defined.

    Lemma 22 The following statements hold.

(a) Under Assumption 4, if for any $\lambda > 0$ condition (3.32) holds, then $\mathrm{Prob}\big\{|B| \ge \frac{N-s+1}{2}\big\} \ge 1 - 1/\lambda$.

(b) Under Assumption 5, if for any $\lambda > 0$ condition (3.33) holds, then $\mathrm{Prob}\big\{|B| \ge \frac{N-s+1}{2}\big\} \ge 1 - 2\exp\{-\lambda^2/3\}$.

    Proof The proof is similar to the one of Lemma  18 and hence the details are skipped. ◻

Now let us establish the rate of convergence of the modified CSPA method for problem (1.4)–(1.5).

Theorem 23 Suppose that $\{\gamma_k\}$ and $\{\eta_k\}$ are chosen according to Lemma 22. Then

    Moreover, under Assumption 4, we have for any 𝜆 > 0,

(3.31) $\frac{V_X(x_{N+1},x)}{B_N} \le \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)V_X(x_s,x) - \sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k}{B_k}\big[\eta_k - G(x,\xi_k)\big] + \frac{1}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k}\|G'(x_k,\xi_k)\|_*^2.$

(3.32) $\frac{N-s+1}{2}\,\min_{k\in N}\frac{\gamma_k\eta_k}{B_k} > \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)D_X^2 + \lambda\,\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k},$

(3.33) $\frac{N-s+1}{2}\,\min_{k\in N}\frac{\gamma_k\eta_k}{B_k} > \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)D_X^2 + (1+\lambda)\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k} + \lambda\sigma\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k^2}},$

(3.34) $\mathbb{E}[\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R))] \le \Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[\Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)D_Y^2 + \frac{M_\Phi^2}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\Big].$

(3.35) $\mathrm{Prob}\Big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \ge \lambda\Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[\Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)D_Y^2 + \frac{M_\Phi^2}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\Big]\Big\} \le \frac{1}{\lambda},$


    In addition, under Assumption 5, we have for any 𝜆 > 0,

where $K_0 = \Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[\Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)D_Y^2 + \frac{M_\Phi^2}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\Big]$ and $K_1 = \Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[M_\Phi^2\sum_{k\in B}\frac{\gamma_k^2}{A_k} + 4M_\Phi D_Y\sqrt{\sum_{k\in B}\frac{\gamma_k^2}{A_k^2}}\Big]$.

    Proof The proof is similar to the proof of Theorem 19, and hence the details are skipped. ◻

Now we provide a specific selection of $\{\gamma_k\}$ and $\{\eta_k\}$ that satisfies the condition of Lemma 22. While the selection of $\eta_k$ depends only on the iteration index $k$, i.e.,

the selection of $\gamma_k$ depends on the particular position of the iteration index $k$ in the set $B$ or $N$. More specifically, let $\tau_B(k)$ and $\tau(k)$ be the positions of index $k$ in set $B$ and set $N$, respectively (for example, if $B = \{1, 3, 5, 9, 10\}$, $N = \{2, 4, 6, 7, 8\}$ and $k = 9$, then $\tau_B(k) = 4$). We define $\gamma_k$ as

Such a selection of $\gamma_k$ can be conveniently implemented by using two separate counters, updated in each iteration, to keep track of $\tau_B(k)$ and $\tau(k)$.
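The two-counter implementation described above can be sketched as follows. All names are illustrative, and we assume the caller decides the membership of iteration $k$ in $B$ or $N$ (e.g., by comparing the constraint estimate with $\eta_k$) before requesting $\gamma_k$:

```python
def make_stepsizes(Q, mu_Phi, mu_G, M_G):
    """Two-counter implementation of the schedules (3.39)-(3.40).
    eta_k depends only on k; gamma_k depends on the position of k
    in B or in N, tracked by two separate counters."""
    counters = {"B": 0, "N": 0}          # running values of tau_B(k) and tau(k)

    def eta(k):
        # eta_k = 8*Q*M_G^2 / (k*mu_G), cf. (3.39)
        return 8.0 * Q * M_G ** 2 / (k * mu_G)

    def gamma(in_B):
        # advance the counter of the set that iteration k fell into,
        # then use the position there, cf. (3.40)
        key = "B" if in_B else "N"
        counters[key] += 1
        pos = counters[key]              # tau_B(k) if k in B, else tau(k)
        mu = mu_Phi if in_B else mu_G
        return 2.0 * Q / (mu * (pos + 1))

    return gamma, eta
```

Each call to `gamma` advances exactly one of the two counters, so the positions $\tau_B(k)$ and $\tau(k)$ never need to be recomputed from scratch.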

Corollary 24 Let $s = 1$, and let $\eta_k$ and $\gamma_k$ be given in (3.39) and (3.40), respectively. Then we have

    Moreover, under Assumption 4, we have for any 𝜆 > 0,

(3.36) $\mathrm{Prob}\Bigg\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2/B_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k/B_k}\Bigg\} \le \frac{1}{\lambda^2}.$

(3.37) $\mathrm{Prob}\big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \ge K_0+\lambda K_1\big\} \le \exp\{-\lambda\}+\exp\{-\lambda^2/3\},$

(3.38) $\mathrm{Prob}\Bigg\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2/B_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k/B_k}\Bigg\} \le \exp\{-\lambda^2/3\},$

(3.39) $\eta_k = \dfrac{8QM_G^2}{k\mu_G},$

(3.40) $\gamma_k = \begin{cases} \dfrac{2Q}{\mu_\Phi(\tau_B(k)+1)}, & k\in B;\\[4pt] \dfrac{2Q}{\mu_G(\tau(k)+1)}, & k\in N. \end{cases}$

$\mathbb{E}[\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R))] \le \dfrac{8QM_\Phi^2}{(N+2)\mu_\Phi}.$


    In addition, under Assumption 5, we have for any 𝜆 > 0,

where $K_0 = 8QM_\Phi^2/[(N+2)\mu_\Phi]$ and $K_1 = 8QM_\Phi^2/[(N+2)\mu_\Phi] + 64M_\Phi D_Y/\sqrt{N}$.

    Proof The proof is similar to the proof of Corollary  20 and hence the details are skipped. ◻

Note that Corollary 24(a) implies an $O(1/N)$ rate of convergence, while Corollary 24(b) shows an $O(1/\sqrt{N})$ rate of convergence with a much improved dependence on $\lambda$. One possible approach to improving the result in part (b) is to shrink the feasible set $Y$ from time to time in order to obtain an $O(1/N)$ rate of convergence (see [13]).

    4 Numerical experiment

In this section, we present some numerical results of our computational experiments for solving two problems: an asset allocation problem with a conditional value-at-risk (CVaR) constraint and a parameterized classification problem. More specifically, we report the numerical results obtained from the CSA and CSPA methods applied to these two problems in Sects. 4.1 and 4.2, respectively.

    4.1 Asset allocation problem

Our goal in this subsection is to examine the performance of the CSA method applied to the CVaR-constrained problem in (1.3).

There is one difficulty in applying the CSA algorithm to this model: the feasible region $X$ is unbounded. Lan, Nemirovski and Shapiro (see [20], Section 4.2) show that $\tau$ can be restricted to

$\Big[\mu + \sqrt{\tfrac{\beta}{1-\beta}}\,\sigma,\ \bar{\mu} + \sqrt{\tfrac{1-\beta}{\beta}}\,\sigma\Big], \quad\text{where } \mu := \min_{y\in Y}\{-\xi^T y\} \ \text{and}\ \bar{\mu} := \max_{y\in Y}\{-\xi^T y\}.$
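For illustration, if the feasible set $Y$ were the standard simplex, the extremes of $-\xi^T y$ would be attained at vertices, so the restricted range for $\tau$ could be computed directly from the mean return vector. The sketch below makes that simplex assumption (ours, purely for illustration) and treats $\sigma$ as a given dispersion parameter:

```python
import math

def tau_interval(xi_bar, sigma, beta):
    """Range to which the CVaR auxiliary variable tau can be restricted,
    following the bound of [20, Section 4.2].  Assumes, for illustration
    only, that Y is the standard simplex, so min/max of -xi^T y over Y
    are attained at vertices; sigma is a given dispersion parameter."""
    mu_low = -max(xi_bar)    # min_{y in Y} {-xi^T y} on the simplex
    mu_high = -min(xi_bar)   # max_{y in Y} {-xi^T y} on the simplex
    lo = mu_low + math.sqrt(beta / (1.0 - beta)) * sigma
    hi = mu_high + math.sqrt((1.0 - beta) / beta) * sigma
    return lo, hi
```

In practice this restriction makes the feasible region of $(x, \tau)$ compact, which is what the CSA convergence analysis requires.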

In this experiment, we consider four instances. The first three instances are randomly generated according to the factor model in Goldfarb and Iyengar (see Section 7 of [15]) with different numbers of stocks ($d = 500$, 1000 and 2000), while the last instance consists of the 95 stocks from S&P100 (excluding SBC, ATI, GS, LU and VIA-B) obtained from [37]; the mean $\bar{\xi}$ and covariance $\Sigma$ are estimated from the historical monthly data from 1996 to 2002. The reliability level is $\beta = 0.05$, the number of samples

$\mathrm{Prob}\Big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \le \lambda\,\frac{8QM_\Phi^2}{(N+2)\mu_\Phi}\Big\} \ge \Big(1-\frac{1}{\lambda}\Big)^2,$

$\mathrm{Prob}\Big\{g(\bar{x}_R) \le \lambda\,\frac{16QM_G^2}{N\mu_G}\Big\} \ge \Big(1-\frac{1}{\lambda}\Big)^2.$

$\mathrm{Prob}\big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \le K_0+\lambda K_1\big\} \ge (1-2\exp\{-\lambda^2/3\})(1-\exp\{-\lambda\}-\exp\{-\lambda^2/3\}),$

$\mathrm{Prob}\Big\{g(\bar{x}_R) \le \frac{16QM_G^2}{N\mu_G} + \lambda\,\frac{2\sigma}{\sqrt{N}}\Big\} \ge (1-2\exp\{-\lambda^2/3\})(1-\exp\{-\lambda^2/3\}),$


to estimate $g(x)$ is $J = 100$, and the number of samples used to evaluate the solution is $n = 50{,}000$. It is worth noting that, by utilizing the linear structure of $\xi^T x$ (where $x \in \mathbb{R}^d$) in the constraint function, in the $k$th iteration we generate $J$ i.i.d. samples of $\tilde{\xi} := \xi^T x_k$ (of dimension 1) to estimate the constraint function, instead of $J$ i.i.d. samples of $\xi$ (of dimension $d$). For the SAA algorithm, the deterministic SAA problem corresponding to (1.3) is defined by

We implemented the SAA approach by using Polyak's subgradient method for solving convex programming problems with function constraints (see [28]). The main reasons why we did not use the linear programming (LP) method for (4.1) are: (1) problem (4.1) might be infeasible for some instances; and (2) we tried the LP method with the CVX toolbox for an instance with 500 stocks, and its CPU time is thousands of times larger than that of the CSA method. In our experiment, we adjust the stepsize strategy by multiplying $\gamma_k$ and $\eta_k$ by scaling parameters $c_g$ and $c_e$, respectively. These parameters are chosen as a result of pilot runs of our algorithm (see [20] for more details). We have found that the "best parameters" in Table 1 slightly outperform the other parameter settings we have considered (Tables 2, 3, 4 and 5).

    The following conclusions can be made from the numerical results. First, as far as the quality of solutions is concerned, the CSA method is at least as good as SAA method and it may outperform SAA for some instances especially as N increases.

(4.1) $\min_{x,\tau}\ -\bar{\xi}^T x \quad \text{s.t.}\quad \tau + \frac{1}{\beta N}\sum_{i=1}^{N}\big[-\xi_i^T x - \tau\big]_+ \le 0, \quad \sum_{i=1}^{d} x_i = 1,\ x \ge 0,$
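For reference, the empirical CVaR constraint in (4.1) can be evaluated directly from the return samples. A minimal sketch (function name and data layout are our own illustrative choices):

```python
def cvar_constraint(x, tau, xi, beta):
    """Left-hand side of the empirical CVaR constraint in (4.1):
        tau + 1/(beta*N) * sum_{i=1}^{N} [ -xi_i^T x - tau ]_+ .
    xi is a list of N return-sample vectors; names are illustrative."""
    losses = [-sum(a * b for a, b in zip(xi_i, x)) for xi_i in xi]
    penalty = sum(max(loss - tau, 0.0) for loss in losses)
    return tau + penalty / (beta * len(xi))
```

A feasible pair $(x, \tau)$ is one for which this value is nonpositive.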

Table 1 The stepsize factors

Number of stocks   best cg   best ce
500                0.5       0.005
1000               0.5       0.05
2000               0.5       0.05

Table 2 Random sample with 500 assets

            N = 500     N = 1000    N = 2000    N = 5000
CSA  Obj.   −4.883      −4.870      −4.953      −4.984
     Cons.  5.330       4.096       5.167       2.859
     CPU    1.671e-01   3.383e-01   6.271e-01   1.470e+00
SAA  Obj.   −4.978      −4.981      −4.977      −4.977
     Cons.  4.372       3.071       2.330       2.249
     CPU    2.031e+00   9.926e+00   4.132e+01   2.591e+02

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method


Second, the CSA method significantly reduces the processing time compared with the SAA method for all the instances.

Table 3 Random sample with 1000 assets

            N = 500     N = 1000    N = 2000    N = 5000
CSA  Obj.   −4.532      −4.704      −4.838      −4.949
     Cons.  27.660      24.901      23.825      20.785
     CPU    4.193e-01   8.578e-01   1.659e+00   4.001e+00
SAA  Obj.   −4.965      −4.981      −4.981      −4.977
     Cons.  60.421      47.745      33.940      20.357
     CPU    1.513e+01   5.954e+01   2.774e+02   1.524e+03

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method

Table 4 Random sample with 2000 assets

            N = 500     N = 1000    N = 2000    N = 5000
CSA  Obj.   −4.299      −4.077      −4.355      −4.859
     Cons.  144.92      112.54      89.74       82.65
     CPU    1.374e+00   2.810e+00   5.538e+00   2.716e+01
SAA  Obj.   −4.752      −4.699      −4.721      −4.727
     Cons.  279.43      218.96      147.93      94.46
     CPU    1.968e+01   6.571e+01   2.940e+02   3.697e+03

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method

Table 5 Comparing the CSA and SAA for the CVaR model

            N = 500     N = 1000    N = 2000    N = 5000    N = 10000
CSA  Obj.   −3.531      −3.537      −3.542      −3.548      −3.560
     Cons.  3.382e+00   2.188e-01   1.106e-01   2.724e-01   −7.102e-01
     CPU    8.315e-02   1.422e-01   2.778e-01   7.251e-01   1.415e+00
SAA  Obj.   −3.530      −3.541      −3.541      −3.544      −3.559
     Cons.  3.385e+00   7.163e-01   6.989e-01   6.988e-01   7.061e-01
     CPU    3.155e+00   1.221e+01   4.834e+01   3.799e+02   1.462e+03

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method


    4.2 Classification and metric learning problem

In this subsection, our goal is to examine the efficiency of the CSPA algorithm applied to a classification problem with the metric as parameter. In this experiment, we use the expectation of the hinge loss function, described in [33], as the objective function, and formulate the constraint with the loss function of the metric learning problem in [8]; see the formal definition in (1.6)–(1.7). For each $i, j$, we are given samples $u_i, u_j \in \mathbb{R}^d$ and a measure $b_{ij} \ge 0$ of the similarity between the samples $u_i$ and $u_j$ ($b_{ij} = 0$ means $u_i$ and $u_j$ are the same). The goal is to learn a metric $A$ such that $\langle (u_i - u_j), A(u_i - u_j)\rangle \approx b_{ij}$, and to do classification among all the samples $u$ projected by the learned metric $A$.

For solving this class of problems in machine learning, one widely accepted approach is to learn the metric in the first step and then solve the classification problem with the obtained optimal metric. However, this approach is not applicable to the online setting, since once the dataset is updated with new samples, it has to go through all the samples to update the metric and the classifier. On the other hand, the CSPA algorithm optimizes the metric $A$ and the classifier simultaneously, and only needs to take one new sample in each iteration.

In this experiment, our goal is to test the solution quality of the CSPA algorithm with respect to the number of iterations. More specifically, we consider 2 instances of this problem with different dimensions ($d = 100$ and 200, respectively). Since we are dealing with the online setting, our sample size for training the metric and the classifier increases with the number of iterations. The sizes of the sample used to estimate the parameters and the one used to evaluate the quality of the solution (the testing sample) are set to 100 and 10,000, respectively. Within each trial, we test the objective and constraint values of the output solution over the training sample and the testing sample, respectively. Since $R$ is randomly picked from the whole series $\{\bar{x}_k, y_k\}$, we generate 5 candidates for $R$, instead of one, in order to increase the probability of getting a better solution. Intuitively, the later solutions in the series should be better than the earlier ones; hence, we also put the last pair of solutions $(\bar{x}_N, y_N)$ into the candidate list. In each trial, we compare these 6 candidate solutions: first, we choose the three pairs with the smallest constraint function values; then, we choose the one with the smallest objective function value from these three selected solutions as our output solution.
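The selection rule among the 6 candidate solutions can be sketched as follows (names are illustrative; `obj` and `cons` stand for the evaluated objective and constraint values over the testing sample):

```python
def pick_solution(candidates, obj, cons):
    """Selection rule used in the experiment: among the candidate pairs
    (five sampled R's plus the last iterate), keep the three with the
    smallest constraint values, then return the one with the smallest
    objective value.  obj and cons evaluate a single candidate."""
    best_by_cons = sorted(candidates, key=cons)[:3]
    return min(best_by_cons, key=obj)
```

This two-stage filter favors near-feasible candidates first, and only then trades off the objective among them.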

Tables 6 and 7 show that the CSPA method decreases the objective value and the constraint value as the sample size (number of iterations $N$) increases. These experiments demonstrate that we can improve both the metric and the classifier simultaneously by using the CSPA method as more and more data are collected.

    5 Conclusions

In this paper, we present a new stochastic approximation type method, the CSA method, for solving stochastic convex optimization problems with function or expectation constraints. Moreover, we show that a variant of the CSA method, the CSPA method, is applicable to a class of p