
Computational Optimization and Applications
https://doi.org/10.1007/s10589-020-00179-x

    Algorithms for stochastic optimization with function or expectation constraints

    Guanghui Lan1  · Zhiqiang Zhou1

    Received: 8 August 2019 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract This paper considers the problem of minimizing an expectation function over a closed convex set, coupled with a function or expectation constraint on either decision variables or problem parameters. We first present a new stochastic approximation (SA) type algorithm, namely the cooperative SA (CSA), to handle problems with the constraint on decision variables. We show that this algorithm exhibits the optimal O(1/ε²) rate of convergence, in terms of both optimality gap and constraint violation, when the objective and constraint functions are generally convex, where ε denotes the optimality gap and infeasibility. Moreover, we show that this rate of convergence can be improved to O(1/ε) if the objective and constraint functions are strongly convex. We then present a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) algorithm, to deal with the situation when the constraint is defined over problem parameters, and show that it exhibits a similar optimal rate of convergence to CSA. It is worth noting that CSA and CSPA are primal methods which require neither iterations in the dual space nor estimation of the size of the dual variables. To the best of our knowledge, this is the first time that such optimal SA methods for solving function or expectation constrained stochastic optimization are presented in the literature.

    Keywords Convex programming · Stochastic optimization · Complexity · Subgradient method

Part of the results were first presented at the Annual INFORMS meeting in October 2015, https://informs.emeetingsonline.com/emeetings/formbuilder/clustersessiondtl.asp?csnno=24236&mmnno=272&ppnno=91687, and summarized in a previous version entitled “Algorithms for stochastic optimization with expectation constraints” in 2016. Guanghui Lan has been supported by NSF CMMI 1637474.

    * Guanghui Lan [email protected]

    Zhiqiang Zhou [email protected]

    1 H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA



Mathematics Subject Classification 90C25 · 90C06 · 90C22 · 49M37

    1 Introduction

In this paper, we study two related stochastic programming (SP) problems with function or expectation constraints. The first one is a classical SP problem with a function constraint over the decision variables, formally defined as

where X ⊆ ℝⁿ is a convex compact set, ξ is a random vector supported on P ⊆ ℝᵖ, and F(x, ξ): X × P → ℝ and g(x): X → ℝ are closed convex functions w.r.t. x for a.e. ξ ∈ P. Moreover, we assume that ξ is independent of x. Under these assumptions, (1.1) is a convex optimization problem.

    In particular, the constraint function g(x) in problem (1.1) can be given in the form of expectation as

where G(x, ξ): X × P → ℝ is closed and convex w.r.t. x for a.e. ξ ∈ P. Such problems have many applications in operations research, finance and data analysis. One motivating example is SP with a conditional value at risk (CVaR) constraint. In an important work [30], Rockafellar and Uryasev show that a class of asset allocation problems can be modeled as (1.3), where ξ denotes the random return with mean μ = E[ξ]. Expectation constraints also play an important role in providing tight convex approximations to chance constrained problems (e.g., Nemirovski and Shapiro [23]). Some other important applications of (1.1) can be found in semi-supervised learning (see, e.g., [6]). For example, one can use the objective function to define the fidelity of the model for the labelled data, while using the constraint to enforce some other properties of the model for the unlabelled data (e.g., proximity for data with similar features).

While problem (1.1) covers a wide class of problems with constraints over the decision variables, in practice we often encounter situations where the constraint is defined over the problem parameters. Under these circumstances our goal is to find a pair of parameters x* and decision variables y*(x*) such that

(1.1) min f(x) := E[F(x, ξ)]
s.t. g(x) ≤ 0,
x ∈ X,

(1.2) g(x) := E_ξ[G(x, ξ)],

(1.3) min_{x,τ} −μᵀx
s.t. τ + (1/λ) E{[−ξᵀx − τ]₊} ≤ 0, Σ_{i=1}^n x_i = 1, x ≥ 0,
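The CVaR constraint in (1.3) is straightforward to estimate by Monte Carlo. The sketch below is illustrative only and not from the paper: the risk level `lam`, the normal return distributions and all numeric values are assumptions made for the example. It evaluates τ + (1/λ)E{[−ξᵀx − τ]₊} by sample averaging:

```python
import random

def cvar_constraint(x, tau, lam, samples):
    """Monte Carlo estimate of the left-hand side of the CVaR
    constraint in (1.3): tau + (1/lam) * E[ (-xi^T x - tau)_+ ]."""
    acc = 0.0
    for xi in samples:
        shortfall = -sum(c * w for c, w in zip(xi, x)) - tau
        acc += max(shortfall, 0.0)
    return tau + acc / (lam * len(samples))

random.seed(0)
# two assets with independent normal returns (illustrative assumption)
samples = [(random.gauss(0.05, 0.10), random.gauss(0.03, 0.05))
           for _ in range(20000)]
x = (0.5, 0.5)  # portfolio weights, summing to one as required in (1.3)
val = cvar_constraint(x, tau=0.1, lam=0.05, samples=samples)
```

A point x is acceptable for (1.3) when this estimate is nonpositive; the same sample can be reused to compare candidate portfolios.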

(1.4) y*(x*) ∈ Argmin_{y∈Y} {φ(x*, y) := E[Φ(x*, y, ζ)]},


Here Φ(x, y, ζ) is convex w.r.t. y for a.e. ζ ∈ Q ⊆ ℝ^q but possibly nonconvex w.r.t. (x, y) jointly, and g(·) is convex w.r.t. x. Moreover, we assume that ξ is independent of x and y, while ζ is not necessarily independent of x*. Note that (1.4)–(1.5) define a pair of optimization and feasibility problems coupled in the following ways: (a) the solution to (1.5) defines an admissible parameter of (1.4); (b) ζ can be a random variable with probability distribution parameterized by x*.

Problem (1.4)–(1.5) also has many applications, especially in data analysis. One such example is to learn a classifier w under a certain metric Ā using the support vector machine model in (1.6)–(1.7), where l(w; (u, v)) = max{0, 1 − v⟨w, u⟩} denotes the hinge loss function, u, u_i, u_j ∈ ℝⁿ, v, v_i, v_j ∈ {+1, −1}, and b_ij ∈ ℝ are random variables satisfying certain probability distributions, and λ, C > 0 are given parameters. In this problem, (1.6) is used to learn the classifier w by using the metric Ā satisfying certain requirements in (1.7), including the low rank (or nuclear norm) assumption. Problem (1.4)–(1.5) can also be used in some data-driven applications, where one can use (1.5) to specify the parameters of the probabilistic models associated with the random variable ζ, as well as in some other applications of multi-objective stochastic optimization.

In spite of its wide applicability, the study of efficient solution methods for expectation constrained optimization is still limited. For the sake of simplicity, suppose for now that ξ is given as a deterministic vector and hence that the objective functions f and φ in (1.1) and (1.4) are easily computable. One popular method to solve stochastic optimization problems is the sample average approximation (SAA) approach [17, 34, 37]. To apply SAA to (1.1) and (1.5), we first generate a random sample ξ_i, i = 1,…,N, for some N ≥ 1 and then approximate g by g̃(x) = (1/N) Σ_{i=1}^N G(x, ξ_i). The main issues associated with SAA for solving (1.1) include: (i) the deterministic SAA problem might not be feasible; (ii) the resulting deterministic SAA problem is often difficult to solve, especially when N is large, requiring a pass through the whole sample {ξ_1,…,ξ_N} at each iteration; and (iii) it is not applicable to the online setting where one needs to update the decision variable whenever a new sample ξ_i, i = 1,…,N, is collected.
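The SAA constraint estimate and the feasibility issue (i) can be illustrated with a minimal sketch; the constraint function and distribution below are made-up assumptions, not from the paper:

```python
import random

def saa_constraint(G, x, xis):
    """SAA approximation g_tilde(x) = (1/N) * sum_i G(x, xi_i) of g(x) = E[G(x, xi)]."""
    return sum(G(x, xi) for xi in xis) / len(xis)

# illustrative constraint: G(x, xi) = xi * x - 1 with E[xi] = 1, so g(x) = x - 1
random.seed(1)
xis = [random.gauss(1.0, 0.2) for _ in range(10000)]
g_tilde = saa_constraint(lambda x, xi: xi * x - 1.0, 1.0, xis)
# g_tilde fluctuates around g(1.0) = 0, so a point on the true constraint
# boundary can easily be infeasible for the sampled problem (issue (i) above)
```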

A different approach to solving stochastic optimization problems is stochastic approximation (SA), which was initially proposed in a seminal paper by Robbins and Monro [29] in 1951 for solving strongly convex SP problems. This algorithm mimics the gradient descent method by using the stochastic gradient F′(x, ξ_i) rather than the original gradient f′(x) for minimizing f(x) in (1.1) over a simple convex set X (see also [4, 10, 11, 25, 31, 35]). An important improvement of this algorithm was developed

(1.5) x* ∈ {x ∈ X | g(x) := E[G(x, ξ)] ≤ 0}.

(1.6) min_w E[ l(w; (Ā^{1/2}u, v)) ] + (λ/2)‖w‖²,

(1.7) Ā ∈ {A ⪰ 0 | E[ |Tr(A(u_i − u_j)(u_i − u_j)ᵀ) − b_ij| ] ≤ 0, Tr(A) ≤ C},


    In this paper, we intend to develop efficient solution methods for solving expecta-tion constrained problems by properly addressing the aforementioned issues asso-ciated with existing SA methods. Our contribution mainly exists in the following several aspects. Firstly, inspired by Polayk’s subgradient method for constrained optimization [28] and Nesterov’s note [24], we present a new SA algorithm, namely the cooperative SA (CSA) method for solving the SP problem with expectation constraint in (1.1) with constraint (1.2). At the kth iteration, CSA performs a pro-jected subgradient step along either F�(xk, �k) or G�(xk, �k) over the set X, depend-ing on whether an unbiased estimator Ĝk of g(xk) satisfies Ĝk ≤ 𝜂k or not. Observe that the aforementioned estimator Ĝk can be easily computed in many cases by using the structure of the problem, e.g., the linear dependence �Tx in (1.3) (see Section  4.1 in [20] and Section  2.1 for more details). We introduce an index set B ∶= {1 ≤ k ≤ N ∶ Ĝk ≤ 𝜂k} in order to compute the output solution as a weighted average of the iterates in B . By carefully bounding |B| , we show that the number of iterations performed by the CSA algorithm to find an �-solution of (1.1), i.e., a point x̄ ∈ X s.t. �[f (x̄) − f ∗] ≤ 𝜖 and �[g(x̄)] ≤ 𝜖 , can be bounded by O(1∕�2) . Moreover, when both f and g are strongly convex, by using a different set of algorithmic param-eters we show that the complexity of the CSA method can be significantly improved to O(1∕�) . It it is worth mentioning that this result is new even for solving determin-istic strongly convex problems with function constraints. We also established the large-deviation properties for the CSA method under certain light-tail assumptions.

Secondly, we develop a variant of CSA, namely the cooperative stochastic parameter approximation (CSPA) method, for solving the SP problem with expectation constraints on problem parameters in (1.4)–(1.5). In CSPA, we update the parameter x by running mirror descent SA iterations whenever a certain easily verifiable condition is violated. Otherwise, we update the decision variable y while keeping x intact. We show that the number of iterations performed by the CSPA algorithm to find an ε-solution of (1.4)–(1.5), i.e., a pair of solutions (x̄, ȳ) s.t. E[g(x̄)] ≤ ε and E[φ(x̄, ȳ) − φ(x̄, y*(x̄))] ≤ ε, can be bounded by O(1/ε²). Moreover, this bound can be significantly improved to O(1/ε) if G and Φ are strongly convex w.r.t. x and y, respectively.

To the best of our knowledge, all the aforementioned algorithmic developments are new in the stochastic optimization literature. It is also worth mentioning a few alternative or related methods for solving (1.1) and (1.4)–(1.5). First, without efficient methods to directly solve (1.1), current practice resorts to reformulating it


as min_{x∈X} λf(x) + (1 − λ)g(x) for some λ ∈ (0, 1). However, one then has to face the difficulty of properly specifying λ, since an optimal selection would depend on the unknown dual multiplier. As a consequence, we cannot assess the quality of the solutions obtained by solving this reformulated problem. Second, one alternative approach to solving (1.1) is the penalty-based or primal-dual approach. However, these methods would require either the estimation of the optimal dual variables or iterations performed in the dual space (see [7, 19, 22]). Moreover, the rate of convergence of these methods for function constrained problems has not been well understood beyond conic constraints, even in the deterministic setting. Third, in [16] (see also the references therein), Jiang and Shanbhag developed a coupled SA method for a stochastic optimization problem with parameters given by another optimization problem, which is hence not applicable to problem (1.4)–(1.5). Moreover, each iteration of their method requires two stochastic subgradient projection steps and hence is more expensive than that of CSPA.

The remaining part of this paper is organized as follows. In Sect. 2, we present the CSA algorithm and establish its convergence properties under general convexity and strong convexity assumptions. In Sect. 3, we develop a variant of the CSA algorithm, namely CSPA, for solving SP problems with the expectation constraint over problem parameters, and discuss its convergence properties. We then present some numerical results for these new SA methods in Sect. 4. Finally, some concluding remarks are given in Sect. 5.

    2 Function or expectation constraints over decision variables

In this section we present the cooperative SA (CSA) algorithm for solving convex stochastic optimization problems with constraints over the decision variables. More specifically, we first briefly review the distance generating function and prox-mapping in Sect. 2.1. We then describe the CSA algorithm in Sect. 2.2 and discuss its convergence properties, in terms of expectation and large deviations, for solving general convex problems in Sect. 2.3. Next, we show how to apply the CSA algorithm to problem (1.1) with an expectation constraint and discuss its large deviation properties in Sect. 2.4. Finally, we show how to improve the convergence of this algorithm under strong convexity assumptions on problem (1.1) in Sect. 2.5.

    2.1 Preliminary: prox‑mapping

Recall that a function ω_X: X → ℝ is a distance generating function with parameter α if ω_X is continuously differentiable and strongly convex with parameter α with respect to ‖·‖. Without loss of generality, we assume throughout this paper that α = 1, because we can always rescale ω_X(x) to ω̄_X(x) = ω_X(x)/α. Therefore, we have

⟨x − z, ∇ω_X(x) − ∇ω_X(z)⟩ ≥ ‖x − z‖², ∀x, z ∈ X.


    The prox-function associated with � is given by

V_X(·,·) is also called the Bregman distance, which was initially studied by Bregman [5] and later by many others (see [1, 2, 36]). In this paper we assume that the prox-function V_X(x, z) is chosen such that, for a given x ∈ X, the prox-mapping P_{x,X}: ℝⁿ → ℝⁿ defined in (2.1) is easily computed. It can be seen from the strong convexity of ω_X that (2.2) holds.

    Whenever the set X is bounded, the distance generating function �X also gives rise to the diameter of X that will be used frequently in our convergence analysis:

    The following lemma follows from the optimality condition of (2.1) and the defini-tion of the prox-function (see the proof in [22]).

    Lemma 1 For every u, x ∈ X , and y ∈ ℝn , we have

where ‖·‖_* denotes the conjugate norm of ‖·‖, i.e., ‖y‖_* = max{⟨x, y⟩ : ‖x‖ ≤ 1}.

    2.2 The CSA method

    In this section, we present a generic algorithmic framework for solving the con-strained optimization problem in (1.1). We assume the expectation function f(x) and constraint g(x), in addition to being well-defined and finite-valued for every x ∈ X , are continuous and convex on X.

The CSA method can be viewed as a stochastic counterpart of Polyak's subgradient method, which was originally designed for solving deterministic nonsmooth convex optimization problems (see [28] and a more recent generalization in [3]). At each iterate x_k, k ≥ 0, depending on whether g(x_k) ≤ η_k for some tolerance η_k > 0, it moves either along the subgradient direction f′(x_k) or g′(x_k), with an appropriately chosen stepsize γ_k which also depends on ‖f′(x_k)‖_* and ‖g′(x_k)‖_*. However, Polyak's subgradient method cannot be applied to solve (1.1) because we do not have access to exact information about f′, g′ and g. The CSA method differs from Polyak's subgradient method in the following three aspects. Firstly, the search direction h_k is defined in a

V_X(z, x) = ω_X(x) − ω_X(z) − ⟨∇ω_X(z), x − z⟩.

(2.1) P_{x,X}(·) := argmin_{z∈X} {⟨·, z⟩ + V_X(x, z)}

(2.2) V_X(x, z) ≥ (1/2)‖x − z‖², ∀x, z ∈ X.

(2.3) D_X ≡ D_{X,ω_X} := [max_{x,z∈X} V_X(x, z)]^{1/2}.

V_X(P_{x,X}(y), u) ≤ V_X(x, u) + yᵀ(u − x) + (1/2)‖y‖²_*,
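As a concrete instance of the prox-mapping (2.1), consider X the probability simplex with the entropy distance generating function ω_X(x) = Σ_i x_i log x_i, for which the mapping has a closed form. This standard example is not spelled out in the text above, so the sketch below is illustrative only:

```python
import math

def entropy_prox(x, y):
    """Prox-mapping (2.1) on the probability simplex with the entropy
    distance generating function w(x) = sum_i x_i log x_i, for which
    argmin_z {<y, z> + V_X(x, z)} has the closed form
    z_i = x_i exp(-y_i) / sum_j x_j exp(-y_j)."""
    w = [xi * math.exp(-yi) for xi, yi in zip(x, y)]
    s = sum(w)
    return [wi / s for wi in w]

x = [0.25, 0.25, 0.50]   # current point in the simplex
y = [1.0, 0.0, 2.0]      # stepsize times a (sub)gradient
z = entropy_prox(x, y)   # mass shifts toward the small components of y
```

This multiplicative update costs O(n) per iteration, which is what makes mirror-descent-type methods attractive on the simplex.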


stochastic manner: we first check whether the solution x_k computed at iteration k violates the condition Ĝ_k ≤ η_k for some η_k ≥ 0. If so, we set h_k = G′(x_k, ξ_k) for a random realization ξ_k of ξ (note that for the deterministic constraint in (1.1), h_k = g′(x_k)) in order to control the violation of the expectation constraint. Otherwise, we set h_k = F′(x_k, ξ_k). Secondly, for some 1 ≤ s ≤ N, we partition the indices I = {s,…,N} into two subsets: B = {s ≤ k ≤ N | Ĝ_k ≤ η_k} and 𝒩 = I \ B, and define the output x̄_{N,s} as an ergodic mean of x_k over B. This differs from Polyak's subgradient method, which defines the output solution as the best x_k, k ∈ B, with the smallest objective value. Thirdly, while the original Polyak subgradient method was developed only for general nonsmooth problems, we show that the CSA method also exhibits an optimal rate of convergence for solving strongly convex problems by properly choosing {γ_k} and {η_k}.
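The three modifications above can be condensed into a schematic loop. The sketch below is illustrative only and is not the paper's Algorithm 1: it assumes a Euclidean setup, a deterministic constraint, constant γ and η, and a made-up one-dimensional toy instance.

```python
import random

def csa(proj, F_grad, g, g_grad, x0, N, gamma, eta, sample):
    """Schematic CSA loop for a deterministic constraint g(x) <= 0:
    step along F'(x_k, xi_k) when g(x_k) <= eta (k in B), otherwise
    along g'(x_k); return the weighted average of the iterates in B."""
    x, num, den = x0, 0.0, 0.0
    for _ in range(N):
        if g(x) <= eta:                 # k in B: work on the objective
            h = F_grad(x, sample())
            num += gamma * x
            den += gamma
        else:                           # k not in B: reduce infeasibility
            h = g_grad(x)
        x = proj(x - gamma * h)         # Euclidean prox-mapping step
    return num / den                    # ergodic mean over B

random.seed(2)
# toy instance: min E[(x - xi)^2], xi ~ N(1, 0.1), s.t. x - 0.5 <= 0, X = [-2, 2];
# the constrained optimum is x* = 0.5
x_bar = csa(proj=lambda x: max(-2.0, min(2.0, x)),
            F_grad=lambda x, xi: 2.0 * (x - xi),
            g=lambda x: x - 0.5,
            g_grad=lambda x: 1.0,
            x0=-2.0, N=4000, gamma=0.01, eta=0.05,
            sample=lambda: random.gauss(1.0, 0.1))
```

The averaged output lands near the constrained optimum x* = 0.5, with constraint violation controlled by the tolerance η.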

Notice that every iteration of CSA requires an unbiased estimator of g(x_k). If there is no uncertainty associated with the constraint in (1.1), we can evaluate g(x_k) exactly. If g is given in the form of expectation, one natural way is to generate a J-sized i.i.d. random sample of ξ and then evaluate the constraint function value by Ĝ_k = (1/J) Σ_{j=1}^J G(x_k, ξ_j). However, this basic scheme can be much improved by using some structural information for constraint evaluation. For instance, one ubiquitous structure in machine learning and portfolio optimization applications is the linear combination ξᵀx. For a given x ∈ X, we can define a new random variable ξ̂ = ξᵀx and generate samples of ξ̂ instead of ξ. ξ̂ is only of dimension one, and it is computationally much cheaper to simulate. Given the distribution of ξ, below we provide a few examples where the distribution of ξ̂ can be explicitly computed or approximated. If x ∈ ℝ^d and the ξ_i are independent normal N(μ_i, σ_i), then ξ̂ follows N(Σ_{i=1}^d μ_i x_i, [Σ_{i=1}^d x_i² σ_i²]^{1/2}). If the ξ_i follow independent exp(λ_i), then the probability density function of ξ̂ is

f_ξ̂(y) = (Π_{i=1}^d λ̂_i) Σ_{j=1}^d e^{−λ̂_j y} / Π_{k=1, k≠j}^d (λ̂_k − λ̂_j),

where λ̂_i = λ_i/x_i. If the ξ_i follow independent Uniform(a, b), then the cumulative distribution function of ξ̂ is
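The density above is the hypoexponential density when the rates λ̂_i are distinct, and a quick numerical check is that it integrates to one. The sketch below is an illustrative sanity check; the rate values are arbitrary assumptions:

```python
import math

def hypoexp_pdf(y, rates):
    """Density of a sum of independent exponentials with distinct
    rates (the formula for f_{xi_hat} displayed above)."""
    if y < 0:
        return 0.0
    total = 0.0
    for j, lj in enumerate(rates):
        denom = math.prod(lk - lj for k, lk in enumerate(rates) if k != j)
        total += math.exp(-lj * y) / denom
    return math.prod(rates) * total

# the density should integrate to one (midpoint rule on [0, 40])
rates = [1.0, 2.0, 3.5]
h = 0.001
mass = h * sum(hypoexp_pdf((i + 0.5) * h, rates) for i in range(40000))
```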


If the ξ_i are dependent normal random variables with mean μ and covariance C (with Cholesky decomposition C = LLᵀ), we can estimate Σ_{i=1}^d ξ_i x_i by Σ_{i=1}^d μ_i x_i + r̄ [Σ_{i=1}^d (Lᵀx)_i²]^{1/2}, where r̄ follows N(0, 1). In fact, when the dimension d is large enough, by the central limit theorem we can use a normal distribution to approximate the new random variable ξ̂. These examples show that simulating ξ̂ can be much faster than simulating the original random variables for constraint evaluation.
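The independent-normal case above is easy to validate numerically: simulating the one-dimensional ξ̂ directly reproduces the distribution of ξᵀx. The parameters below are made up for illustration:

```python
import math
import random
import statistics

random.seed(3)
mu = [0.1, 0.2, 0.3]       # component means (illustrative)
sigma = [0.2, 0.1, 0.3]    # component standard deviations (illustrative)
x = [0.5, 0.3, 0.2]        # fixed decision vector

# direct route: draw the full d-dimensional xi and form xi^T x each time
direct = [sum(w * random.gauss(m, s) for w, m, s in zip(x, mu, sigma))
          for _ in range(50000)]

# one-dimensional shortcut: xi_hat ~ N(sum_i mu_i x_i, [sum_i x_i^2 sigma_i^2]^{1/2})
m_hat = sum(m * w for m, w in zip(mu, x))
s_hat = math.sqrt(sum(w ** 2 * s ** 2 for w, s in zip(x, sigma)))
shortcut = [random.gauss(m_hat, s_hat) for _ in range(50000)]
```

Both sample sets share the same mean and spread, while the shortcut needs a single Gaussian draw per sample instead of d.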

    2.3 Convergence of CSA for SP with function constraints

In this subsection, we consider the case when the constraint function g is deterministic (i.e., Ĝ_k = g(x_k)). Our goal is to establish the rate of convergence of CSA in terms of both the distance to the optimal value and the violation of the constraints. It should also be noted that Algorithm 1 is conceptual only, as we have not yet specified a few algorithmic parameters (e.g., {γ_k} and {η_k}). We will come back to this issue after establishing some general properties of this method. Throughout this subsection, we make the following assumptions.

Assumption 1 For any x ∈ X and a.e. ξ ∈ P,

where F′(x, ξ) ∈ ∂_x F(x, ξ) and g′(x) ∈ ∂_x g(x).

    The following result establishes a simple but important recursion about the CSA method for problem (1.1).

    Proposition 2 For any 1 ≤ s ≤ N , we have

    for all x ∈ X.

    Proof For any s ≤ k ≤ N , using Lemma 1, we have

F_ξ̂(y) = [1/(d! Π_{i=1}^d x_i)] { [ (y − a Σ_{i=1}^d x_i)/(b − a) ]₊^d + Σ_{v=1}^d (−1)^v Σ_{j_1=1}^d Σ_{j_2=j_1+1}^d ⋯ Σ_{j_v=j_{v−1}+1}^d [ (y − a Σ_{i=1}^d x_i)/(b − a) − (x_{j_1} + x_{j_2} + ⋯ + x_{j_v}) ]₊^d }.

E[‖F′(x, ξ)‖²_*] ≤ M_F² and ‖g′(x)‖²_* ≤ M_G²,

(2.7) Σ_{k∈𝒩} γ_k(η_k − g(x)) + Σ_{k∈B} γ_k⟨F′(x_k, ξ_k), x_k − x⟩ ≤ V(x_s, x) + (1/2) Σ_{k∈B} γ_k²‖F′(x_k, ξ_k)‖²_* + (1/2) Σ_{k∈𝒩} γ_k²‖g′(x_k)‖²_*,


Observe that if k ∈ B, we have h_k = F′(x_k, ξ_k) and

Moreover, if k ∈ 𝒩, we have h_k = g′(x_k) and

Summing up the inequalities in (2.8) from k = s to N and using the previous two observations, we obtain (2.9). Rearranging the terms in the above inequality, we obtain (2.7). ◻

    Using Proposition 2, we present below a sufficient condition under which the output solution x̄N,s is well-defined.

    Lemma 3 Let x∗ be an optimal solution of (1.1). If

then B ≠ ∅, i.e., x̄_{N,s} is well-defined. Moreover, at least one of the following two statements holds:

(a) |B| ≥ (N − s + 1)/2;
(b) Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ 0.

Proof Taking expectation w.r.t. ξ_k on both sides of (2.7) and fixing x = x*, we obtain (2.11). Suppose for contradiction that B = ∅. We then conclude from the above relation and the fact that g(x*) ≤ 0 that

(2.8) V(x_{k+1}, x) ≤ V(x_k, x) + γ_k⟨h_k, x − x_k⟩ + (1/2)γ_k²‖h_k‖²_*.

⟨h_k, x_k − x⟩ = ⟨F′(x_k, ξ_k), x_k − x⟩.

⟨h_k, x_k − x⟩ = ⟨g′(x_k), x_k − x⟩ ≥ g(x_k) − g(x) ≥ η_k − g(x).

(2.9) V(x_{N+1}, x) ≤ V(x_s, x) − Σ_{k=s}^N γ_k⟨h_k, x_k − x⟩ + (1/2) Σ_{k=s}^N γ_k²‖h_k‖²_*
≤ V(x_s, x) − [Σ_{k∈𝒩} γ_k⟨g′(x_k), x_k − x⟩ + Σ_{k∈B} γ_k⟨F′(x_k, ξ_k), x_k − x⟩] + (1/2) Σ_{k=s}^N γ_k²‖h_k‖²_*
≤ V(x_s, x) − [Σ_{k∈𝒩} γ_k(η_k − g(x)) + Σ_{k∈B} γ_k⟨F′(x_k, ξ_k), x_k − x⟩] + (1/2) Σ_{k∈B} γ_k²‖F′(x_k, ξ_k)‖²_* + (1/2) Σ_{k∈𝒩} γ_k²‖g′(x_k)‖²_*.

(2.10) [(N − s + 1)/2] min_{k∈𝒩} γ_k η_k > D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.11) Σ_{k∈𝒩} γ_k[η_k − g(x*)] + Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G² ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G².


which contradicts (2.10). Hence, we must have B ≠ ∅. Now if Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≤ 0, part (b) holds. Otherwise, if Σ_{k∈B} γ_k⟨f′(x_k), x_k − x*⟩ ≥ 0, we have

which, in view of g(x*) ≤ 0, implies (2.12). Suppose that |B| < (N − s + 1)/2, i.e., |𝒩| ≥ (N − s + 1)/2. Then,

which contradicts (2.12). Hence, part (a) holds. ◻

    Now we are ready to establish the main convergence properties of the CSA method.

Theorem 4 Suppose that {γ_k} and {η_k} in the CSA algorithm are chosen such that (2.10) holds. Then for any 1 ≤ s ≤ N, we have (2.13) and (2.14), where M := max{M_F, M_G}.

Proof We first show (2.13). By Lemma 3, if part (b) holds, then dividing both sides by Σ_{k∈B} γ_k and taking expectation, we have (2.15). If |B| ≥ (N − s + 1)/2, we have Σ_{k∈B} γ_k ≥ |B| min_{k∈B} γ_k ≥ [(N − s + 1)/2] min_{k∈B} γ_k. It then follows from the definition of x̄_{N,s} in (2.6), the convexity of f(·) and (2.11) that

which implies (2.16). Using this bound and the fact γ_k η_k ≥ 0 in (2.16), we have

|𝒩| min_{k∈𝒩} γ_k η_k ≤ Σ_{k∈𝒩} γ_k[η_k − g(x*)] ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

Σ_{k∈𝒩} γ_k[η_k − g(x*)] ≤ V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.12) Σ_{k∈𝒩} γ_k η_k ≤ V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G².

Σ_{k∈𝒩} γ_k η_k ≥ [(N − s + 1)/2] min_{k∈𝒩} γ_k η_k > V(x_s, x*) + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.13) E[f(x̄_{N,s}) − f(x*)] ≤ [2D_X² + M² Σ_{s≤k≤N} γ_k²] / [(N − s + 1) min_{s≤k≤N} γ_k],

(2.14) g(x̄_{N,s}) ≤ (Σ_{k∈B} γ_k)^{−1} (Σ_{k∈B} γ_k η_k),

(2.15) E[f(x̄_{N,s}) − f(x*)] ≤ 0.

Σ_{k∈𝒩} γ_k η_k + Σ_{k∈B} γ_k E[f(x̄_{N,s}) − f(x*)] ≤ Σ_{k∈𝒩} γ_k η_k + Σ_{k∈B} E[γ_k(f(x_k) − f(x*))] ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²,

(2.16) |𝒩| min_{k∈𝒩} γ_k η_k + (Σ_{k∈B} γ_k) E[f(x̄_{N,s}) − f(x*)] ≤ D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G².


Combining inequalities (2.15) and (2.17), we obtain (2.13). Now we show that (2.14) holds. For any k ∈ B, we have g(x_k) ≤ η_k. Then, in view of the definition of x̄_{N,s} in (2.6) and the convexity of g(·), we obtain (2.18). ◻

Below we provide a few specific selections of {γ_k}, {η_k} and s that lead to the optimal rate of convergence for the CSA method. In particular, we present constant and variable stepsize policies in Corollaries 5 and 6, respectively.

Corollary 5 If s = 1, γ_k = D_X / [√N (M_F + M_G)] and η_k = 4(M_F + M_G) D_X / √N, k = 1,…,N, then

    Proof First, observe that condition (2.10) holds by using the facts that

    It then follows from Lemma 3 and Theorem 4 that

Corollary 6 If s = N/2, γ_k = D_X / [√k (M_F + M_G)] and η_k = 4 D_X (M_F + M_G) / √k, k = 1, 2,…,N, then

(2.17) E[f(x̄_{N,s}) − f(x*)] ≤ [2D_X² + Σ_{k∈B} γ_k² M_F² + Σ_{k∈𝒩} γ_k² M_G²] / [(N − s + 1) min_{k∈I} γ_k] ≤ [2D_X² + M² Σ_{s≤k≤N} γ_k²] / [(N − s + 1) min_{k∈B} γ_k].

(2.18) g(x̄_{N,s}) ≤ Σ_{k∈B} γ_k g(x_k) / Σ_{k∈B} γ_k ≤ Σ_{k∈B} γ_k η_k / Σ_{k∈B} γ_k.

E[f(x̄_{N,s}) − f(x*)] ≤ 4 D_X (M_F + M_G) / √N,

g(x̄_{N,s}) ≤ 4 D_X (M_F + M_G) / √N.

[(N − s + 1)/2] min_{k∈𝒩} γ_k η_k = (N/2)(4D_X²/N) = 2D_X²,

D_X² + (1/2) Σ_{k∈B} γ_k² M_F² + (1/2) Σ_{k∈𝒩} γ_k² M_G²
≤ D_X² + (1/2) Σ_{k∈B} D_X² M_F² / [N(M_F + M_G)²] + (1/2) Σ_{k∈𝒩} D_X² M_G² / [N(M_F + M_G)²]
≤ D_X² + (1/2) Σ_{k=1}^N D_X²/N = (3/2) D_X² < 2D_X².

E[f(x̄_{N,s}) − f(x*)] ≤ [2 D_X (M_F + M_G) + Σ_{k∈B} D_X M_F² / (N(M_F + M_G)) + Σ_{k∈𝒩} D_X M_G² / (N(M_F + M_G))] / √N ≤ 4 D_X (M_F + M_G) / √N,

g(x̄_{N,s}) ≤ max_{s≤k≤N} η_k = 4 D_X (M_F + M_G) / √N.
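Condition (2.10) under the constant stepsize policy of Corollary 5 can also be checked numerically: the left-hand side equals 2D_X² and dominates the right-hand side. The constants below are illustrative assumptions:

```python
import math

# Corollary 5 parameters with illustrative constants D_X, M_F, M_G
D, M_F, M_G, N = 1.0, 2.0, 3.0, 10000
gamma = D / (math.sqrt(N) * (M_F + M_G))   # gamma_k, constant in k
eta = 4 * (M_F + M_G) * D / math.sqrt(N)   # eta_k, constant in k

# left-hand side of (2.10) with s = 1: (N/2) * gamma * eta = 2 D^2
lhs = (N / 2) * gamma * eta
# right-hand side, charging every index the worse bound max(M_F, M_G)^2
rhs = D ** 2 + 0.5 * N * gamma ** 2 * max(M_F, M_G) ** 2
```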


Proof The proof is similar to that of Corollary 5 and hence the details are skipped. ◻

In view of Corollaries 5 and 6, the CSA algorithm achieves an O(1/√N) rate of convergence for solving problem (1.1). This convergence rate appears unimprovable, as it matches the optimal rate of convergence for deterministic convex optimization problems with function constraints [24]. However, to the best of our knowledge, no such complexity bounds had been obtained before for stochastic optimization problems with function constraints.

In Corollaries 5 and 6, we established the expected convergence properties over many runs of the CSA algorithm. In the remaining part of this subsection, we are interested in the large deviation properties of a single run of this method.

First note that by Corollary 6 and Markov's inequality, we have

It then follows that, in order to find a solution x̄_{N,s} ∈ X such that

the number of iterations performed by the CSA method can be bounded by (2.19). We will show that this result can be significantly improved if Assumption 1 is augmented by the following “light-tail” assumption, which is satisfied by a wide class of distributions (e.g., Gaussian and t-distributions).
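The O(1/(ε²Λ²)) bound in (2.19) follows by choosing λ₁ = 1/Λ in the Markov bound above and solving for N. A tiny illustrative calculation, where the assumed constant `C` abbreviates 4D_X(1 + log(2)/2)(M_F + M_G) and all numeric values are made up:

```python
import math

def iterations_needed(eps, Lam, C):
    """Smallest N with (1/Lam) * C / sqrt(N) <= eps, so that the Markov
    bound above yields Prob(f(x_bar) - f* <= eps) > 1 - Lam; here C
    stands for the constant 4 D_X (1 + log(2)/2)(M_F + M_G)."""
    return math.ceil((C / (eps * Lam)) ** 2)

N1 = iterations_needed(0.10, 0.1, 1.0)   # scales like 1 / (eps^2 Lam^2)
N2 = iterations_needed(0.05, 0.1, 1.0)   # halving eps quadruples N
```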

Assumption 2 For any x ∈ X,

We first present the following Bernstein-type inequality, which will be used to establish the large-deviation properties of the CSA method (see, e.g., [22]). Note that in the sequel we denote ξ_[k] := {ξ_1,…,ξ_k}.

Lemma 7 Let ξ_1, ξ_2,… be a sequence of i.i.d. random variables, and let ζ_t = ζ_t(ξ_[t]) be deterministic Borel functions of ξ_[t] such that E[ζ_t] = 0 a.s. and E[exp{ζ_t²/σ_t²}] ≤ exp{1} a.s., where σ_t > 0 are deterministic. Then

    �[f (x̄N,s) − f (x∗)] ≤

    4DX (1+1

    2log 2)(MF+MG)√

    N,

    g(x̄N,s) ≤4√2DX (MF+MG)√

    N.

    Prob

    �f (x̄N,s) − f (x

    ∗) > 𝜆14DX (1+

    1

    2log 2)(MF+MG)√

    N

    �<

    1

    𝜆1

    ,∀𝜆1 ≥ 0.

    Prob(f (x̄N,s) − f (x

    ∗) ≤ 𝜖)> 1 − Λ,

    (2.19)O{

    1

    �2Λ2

    }.

    �[exp{‖F�(x, �)‖2∗∕M2

    F}] ≤ exp{1}.

\[
\forall \lambda \ge 0: \quad \mathrm{Prob}\Big\{ \textstyle\sum_{t=1}^{N} \zeta_t > \lambda \sqrt{\sum_{t=1}^{N} \sigma_t^2} \Big\} \le \exp\{-\lambda^2/3\}.
\]

Now we are ready to establish the large-deviation properties of the CSA algorithm.

Theorem 8 Under Assumption 2, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\{ f(\bar{x}_{N,s}) - f(x^*) \ge K_0 + \lambda K_1 \} \le \exp\{-\lambda\} + \exp\Big\{-\frac{\lambda^2}{3}\Big\}, \tag{2.20}
\]
where
\[
K_0 = \frac{2D_X^2 + M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2 + M_G^2\sum_{k\in\mathcal{N}}\gamma_k^2}{\sum_{k\in\mathcal{B}}\gamma_k}
\quad\text{and}\quad
K_1 = \frac{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2 + M_G^2\sum_{k\in\mathcal{N}}\gamma_k^2 + 2M_F D_X\sqrt{\sum_{k\in\mathcal{B}}\gamma_k^2}}{\sum_{k\in\mathcal{B}}\gamma_k}.
\]

Proof Let $F'(x_k,\xi_k) = f'(x_k) + \Delta_k$. It follows from inequality (2.7) (with $x = x^*$) and the fact $g(x^*) \le 0$ that
\[
\sum_{k\in\mathcal{N}}\gamma_k\eta_k + \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)\big(f(\bar{x}_{N,s}) - f(x^*)\big) \le D_X^2 + \sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \sum_{k\in\mathcal{N}}\gamma_k^2\|g'(x_k)\|_*^2 - \sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle. \tag{2.21}
\]
Now we provide probabilistic bounds for $\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2$ and $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. First, setting $\theta_k = \gamma_k^2/\sum_{k\in\mathcal{B}}\gamma_k^2$ and using the fact that $\mathbb{E}[\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\}] \le \exp\{1\}$ together with Jensen's inequality, we have
\[
\exp\Big\{\sum_{k\in\mathcal{B}}\theta_k\big(\|F'(x_k,\xi_k)\|_*^2/M_F^2\big)\Big\} \le \sum_{k\in\mathcal{B}}\theta_k\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\},
\]
and hence that
\[
\mathbb{E}\Big[\exp\Big\{\frac{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2}{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\}\Big] \le \exp\{1\}.
\]
It then follows from Markov's inequality that, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 > (1+\lambda)M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2\Big)
= \mathrm{Prob}\Big(\exp\Big\{\frac{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2}{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\} > \exp(1+\lambda)\Big) \le \frac{\exp\{1\}}{\exp\{1+\lambda\}} \le \exp\{-\lambda\}. \tag{2.22}
\]
Then, let us consider $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. Setting $\beta_k = \gamma_k\langle\Delta_k, x_k - x^*\rangle$ and noting that $\mathbb{E}[\|\Delta_k\|_*^2] \le (2M_F)^2$, we have
\[
\mathbb{E}\big[\exp\{\beta_k^2/(2M_F\gamma_k D_X)^2\}\big] \le \exp\{1\},
\]
which, in view of Lemma 7, implies that

\[
\mathrm{Prob}\Big\{\sum_{k\in\mathcal{B}}\beta_k > 2\lambda M_F D_X\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\} \le \exp\{-\lambda^2/3\}. \tag{2.23}
\]
Combining (2.22) and (2.23), and rearranging the terms, we get (2.20). ◻

Applying the stepsize strategy in Corollary 5 to Theorem 8, it follows that the number of iterations performed by the CSA method can be bounded by
\[
\mathcal{O}\Big\{\frac{1}{\epsilon^2}\Big(\log\frac{1}{\Lambda}\Big)^2\Big\}.
\]
This significantly improves the bound in (2.19).

2.4 Convergence of CSA for SP with expectation constraints

In this subsection, we focus on the SP problem (1.1)–(1.2) with the expectation constraint. We assume that the expectation functions f(x) and g(x), in addition to being well-defined and finite-valued for every $x \in X$, are continuous and convex on X. Throughout this section, we assume that Assumption 2 holds. Moreover, with a slight abuse of notation, we make the following assumption.

Assumption 3 For any $x \in X$,
\[
\mathbb{E}\big[\exp\{\|G'(x,\xi)\|_*^2/M_G^2\}\big] \le \exp\{1\}, \tag{2.24}
\]
\[
\mathbb{E}\big[\exp\{(G(x,\xi) - g(x))^2/\sigma^2\}\big] \le \exp\{1\}. \tag{2.25}
\]

We will use (2.24) and (2.25) to bound the errors associated with the stochastic subgradient and the function value of the constraint g, respectively. As discussed in Sect. 2.2, there may exist different ways to simulate the random variable $\xi$ for constraint evaluation, e.g., by generating a J-sized i.i.d. random sample of $\xi$ or a linear transformation of it. However, regardless of the way the random variable $\xi$ is simulated, the light-tail assumption (2.25) holds for the constraint value $G(x,\xi)$. Our goal in this subsection is to show how the sample size (or iteration count) N used to compute stochastic subgradients, as well as the sample size J used to evaluate the constraint value, affect the quality of the solutions generated by CSA.

The following result establishes a simple but important recursion for the CSA method applied to stochastic optimization with expectation constraints.

Proposition 9 For any $1 \le s \le N$, we have
\[
\sum_{k\in\mathcal{N}}\gamma_k\big(G(x_k,\xi_k) - G(x,\xi_k)\big) + \sum_{k\in\mathcal{B}}\gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle
\le V(x_s,x) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2, \quad \forall x \in X. \tag{2.26}
\]

Proof For any $s \le k \le N$, using Lemma 1, we have

\[
V(x_{k+1},x) \le V(x_k,x) + \gamma_k\langle h_k, x - x_k\rangle + \tfrac{1}{2}\gamma_k^2\|h_k\|_*^2. \tag{2.27}
\]
Observe that if $k \in \mathcal{B}$, we have $h_k = F'(x_k,\xi_k)$ and
\[
\langle h_k, x_k - x\rangle = \langle F'(x_k,\xi_k), x_k - x\rangle.
\]
Moreover, if $k \in \mathcal{N}$, we have $h_k = G'(x_k,\xi_k)$ and
\[
\langle h_k, x_k - x\rangle = \langle G'(x_k,\xi_k), x_k - x\rangle \ge G(x_k,\xi_k) - G(x,\xi_k).
\]
Summing up the inequalities in (2.27) from $k = s$ to $N$ and using the previous two observations, we obtain
\[
\begin{aligned}
V(x_{k+1},x) &\le V(x_s,x) - \sum_{k=s}^{N}\gamma_k\langle h_k, x_k - x\rangle + \tfrac{1}{2}\sum_{k=s}^{N}\gamma_k^2\|h_k\|_*^2 \\
&\le V(x_s,x) - \Big[\sum_{k\in\mathcal{N}}\gamma_k\langle G'(x_k,\xi_k), x_k - x\rangle + \sum_{k\in\mathcal{B}}\gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle\Big] + \tfrac{1}{2}\sum_{k=s}^{N}\gamma_k^2\|h_k\|_*^2 \\
&\le V(x_s,x) - \Big[\sum_{k\in\mathcal{N}}\gamma_k\big(G(x_k,\xi_k) - G(x,\xi_k)\big) + \sum_{k\in\mathcal{B}}\gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle\Big] \\
&\quad + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2.
\end{aligned} \tag{2.28}
\]
Rearranging the terms in the above inequality, we obtain (2.26). ◻

Using Proposition 9, we present below a sufficient condition under which the output solution $\bar{x}_{N,s}$ is well-defined.

Lemma 10 Let $x^*$ be an optimal solution of (1.1)–(1.2). Under Assumption 3, for any given $\lambda > 0$, if
\[
\frac{N-s+1}{2}\,\min_{k\in\mathcal{N}}\gamma_k\eta_k > V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2 + \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k, \tag{2.29}
\]
where J is the number of random samples used to estimate $g(x_k)$ in each iteration, then $\mathcal{B} \ne \emptyset$, i.e., $\bar{x}_{N,s}$ is well-defined. Moreover, at least one of the following two statements holds:

(a) $\mathrm{Prob}\{|\mathcal{B}| \ge (N-s+1)/2\} \ge 1 - |\mathcal{N}|\exp\{-\lambda^2/3\}$;
(b) $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le 0$.

Proof Taking expectation w.r.t. $\xi_k$ on both sides of (2.26), fixing $x = x^*$ and noting that Assumption 3 implies $\mathbb{E}[\|G'(x,\xi)\|_*^2] \le M_G^2$, we have
\[
\sum_{k\in\mathcal{N}}\gamma_k\big[g(x_k) - g(x^*)\big] + \sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2. \tag{2.30}
\]

If $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le 0$, part (b) holds. If $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \ge 0$, we have
\[
\sum_{k\in\mathcal{N}}\gamma_k\big[g(x_k) - g(x^*)\big] \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2,
\]
which, in view of $g(x^*) \le 0$, implies that
\[
\sum_{k\in\mathcal{N}}\gamma_k g(x_k) \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2. \tag{2.31}
\]
It follows from (2.4), Assumption 3 and Lemma 7 that, for $k \in \mathcal{N}$, we have $\hat{G}_k > \eta_k$ and $\mathrm{Prob}\{\hat{G}_k \ge g(x_k) + \lambda\sigma/\sqrt{J}\} \le \exp\{-\lambda^2/3\}$, which implies $\mathrm{Prob}\{g(x_k) \le \eta_k - \lambda\sigma/\sqrt{J}\} \le \exp\{-\lambda^2/3\}$. Therefore,
\[
\begin{aligned}
\mathrm{Prob}\Big\{\sum_{k\in\mathcal{N}}\gamma_k g(x_k) \le \sum_{k\in\mathcal{N}}\gamma_k\eta_k - \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k\Big\}
&\le \mathrm{Prob}\Big\{\exists k\in\mathcal{N}:\; g(x_k) \le \eta_k - \frac{\lambda\sigma}{\sqrt{J}}\Big\} \\
&\le 1 - \Big(1 - \exp\Big\{-\frac{\lambda^2}{3}\Big\}\Big)^{|\mathcal{N}|} \le |\mathcal{N}|\exp\Big\{-\frac{\lambda^2}{3}\Big\}.
\end{aligned} \tag{2.32}
\]
Combining (2.31) and (2.32), we have
\[
\mathrm{Prob}\Big\{\sum_{k\in\mathcal{N}}\gamma_k\eta_k \le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2 + \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k\Big\} \ge 1 - |\mathcal{N}|\exp\Big\{-\frac{\lambda^2}{3}\Big\}.
\]
Suppose that $|\mathcal{B}| < (N-s+1)/2$, i.e., $|\mathcal{N}| \ge (N-s+1)/2$. Then the condition in (2.29) implies that
\[
\sum_{k\in\mathcal{N}}\gamma_k\eta_k \ge \frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k > V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 M_G^2 + \frac{\lambda\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k.
\]
It then follows from the previous two observations that $\mathrm{Prob}\{|\mathcal{B}| \ge (N-s+1)/2\} \ge 1 - |\mathcal{N}|\exp\{-\lambda^2/3\}$. ◻

Now we are ready to establish the large-deviation properties of the CSA algorithm.

Theorem 11 Suppose that Assumptions 2 and 3 hold.

(a) For any given partition $\mathcal{B}$ and $\mathcal{N}$ of $I = \{s,\dots,N\}$, we have, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\{f(\bar{x}_{N,s}) - f(x^*) \ge K_0 + \lambda K_1\} \le 2\exp\{-\lambda\} + (|\mathcal{N}|+2)\exp\Big\{-\frac{\lambda^2}{3}\Big\}, \tag{2.33}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_{N,s}) \ge \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\Big(\sum_{k\in\mathcal{B}}\gamma_k\eta_k\Big) + \frac{\lambda\sigma}{\sqrt{J}}\Big\} \le |\mathcal{B}|\exp\{-\lambda^2/3\}, \tag{2.34}
\]

where
\[
K_0 = \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\Big(D_X^2 + \frac{M_F^2}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 + \frac{M_G^2}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\Big)
\]
and
\[
K_1 = \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\Big(\frac{M_F^2}{2}\sum_{k\in\mathcal{B}}\gamma_k^2 + \frac{M_G^2}{2}\sum_{k\in\mathcal{N}}\gamma_k^2 + 2\sigma\sqrt{\textstyle\sum_{k\in\mathcal{N}}\gamma_k^2} + 2M_F D_X\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2} + \frac{\sigma}{\sqrt{J}}\sum_{k\in\mathcal{N}}\gamma_k\Big).
\]

(b) For any $\Lambda \in (0,1)$, if we choose $\lambda$ such that $N\exp\{-\lambda^2/3\} \le \Lambda$ and set
\[
s = 1, \qquad \gamma_k = \frac{D_X}{\sqrt{N}\,M}, \qquad \eta_k = \frac{4MD_X}{\sqrt{N}} + \frac{2\lambda\sigma}{\sqrt{J}}, \tag{2.35}
\]
\[
N = \max\Big\{\frac{2C}{\epsilon^2}\Big(\log\frac{4}{\Lambda}\Big)^2,\; \frac{6C}{\epsilon^2}\log\frac{18D_X^2M^2}{\epsilon^2\Lambda},\; \frac{64M^2D_X^2}{\vartheta^2}\Big\},
\]
\[
J = \max\Big\{\frac{8\sigma^2}{\epsilon^2}\Big(\log\frac{4}{\Lambda}\Big)^2,\; \frac{24\sigma^2}{\epsilon^2}\log\frac{18D_X^2M^2}{\epsilon^2\Lambda},\; \frac{36\sigma^2}{\vartheta^2}\log\frac{1}{\Lambda^3},\; \frac{36\sigma^2}{\vartheta^2}\log\frac{18D_X^2M^2}{\epsilon^2\Lambda}\Big\},
\]
where $M = \max\{M_F, M_G\}$ and $C = \max\{9D_X^2M^2, 4\sigma^2\}$, then we have
\[
\mathrm{Prob}\{g(\bar{x}_{N,s}) \le \vartheta\} \ge 1 - \Lambda \quad\text{and}\quad \mathrm{Prob}\{f(\bar{x}_{N,s}) - f(x^*) \le \epsilon\} \ge (1-\Lambda)^2. \tag{2.36}
\]

Proof Let us first show part (a). Observe that the constraint evaluation, and hence the partition into $\mathcal{B}$ and $\mathcal{N}$, is independent of the trajectory. Let $G(x,\xi_k) = g(x) + \delta_k$ and $F'(x_k,\xi_k) = f'(x_k) + \Delta_k$. It follows from inequality (2.26) (with $x = x^*$) and the fact $g(x^*) \le 0$ that
\[
\begin{aligned}
\sum_{k\in\mathcal{N}}\gamma_k g(x_k) + \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)\big(f(\bar{x}_{N,s}) - f(x^*)\big) &\le V(x_s,x^*) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2 \\
&\quad + 2\sum_{k\in\mathcal{N}}\gamma_k\delta_k - \sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle.
\end{aligned} \tag{2.37}
\]
Now we provide probabilistic bounds for $\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2$, $\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2$, $\sum_{k\in\mathcal{N}}\gamma_k\delta_k$ and $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. First, setting $\theta_k = \gamma_k^2/\sum_{k\in\mathcal{B}}\gamma_k^2$, using the fact that $\mathbb{E}[\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\}] \le \exp\{1\}$ and Jensen's inequality, we have $\exp\{\sum_{k\in\mathcal{B}}\theta_k(\|F'(x_k,\xi_k)\|_*^2/M_F^2)\} \le \sum_{k\in\mathcal{B}}\theta_k\exp\{\|F'(x_k,\xi_k)\|_*^2/M_F^2\}$, and hence that $\mathbb{E}[\exp\{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2/(M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2)\}] \le \exp\{1\}$. It then follows from Markov's inequality that, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 > (1+\lambda)M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2\Big)
= \mathrm{Prob}\Big(\exp\Big\{\frac{\sum_{k\in\mathcal{B}}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2}{M_F^2\sum_{k\in\mathcal{B}}\gamma_k^2}\Big\} > \exp(1+\lambda)\Big) \le \frac{\exp\{1\}}{\exp\{1+\lambda\}} \le \exp\{-\lambda\}. \tag{2.38}
\]
Similarly, we have
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{N}}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2 > (1+\lambda)M_G^2\sum_{k\in\mathcal{N}}\gamma_k^2\Big) \le \exp\{-\lambda\}. \tag{2.39}
\]
Second, for $\sum_{k\in\mathcal{N}}\gamma_k\delta_k$, setting $\zeta_k = \gamma_k\delta_k$ and noting that $\mathbb{E}[\delta_k] = 0$ and $\mathbb{E}[\exp\{\delta_k^2/\sigma^2\}] \le \exp\{1\}$, we obtain $\mathbb{E}[\gamma_k\delta_k] = 0$ and $\mathbb{E}[\exp\{\gamma_k^2\delta_k^2/(\gamma_k^2\sigma^2)\}] \le \exp\{1\}$. By Lemma 7, we have
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{N}}\gamma_k\delta_k > \lambda\sigma\sqrt{\textstyle\sum_{k\in\mathcal{N}}\gamma_k^2}\Big) \le \exp\{-\lambda^2/3\}. \tag{2.40}
\]

Lastly, let us consider $\sum_{k\in\mathcal{B}}\gamma_k\langle\Delta_k, x_k - x^*\rangle$. Setting $\beta_k = \gamma_k\langle\Delta_k, x_k - x^*\rangle$ and noting that $\mathbb{E}[\|\Delta_k\|_*^2] \le (2M_F)^2$, we have $\mathbb{E}[\exp\{\beta_k^2/(2M_F\gamma_k D_X)^2\}] \le \exp\{1\}$, which, in view of Lemma 7, implies that
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\beta_k > 2\lambda M_F D_X\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2}\Big) \le \exp\{-\lambda^2/3\}. \tag{2.41}
\]
Combining (2.37), (2.38), (2.39), (2.40), (2.41) and (2.32), and rearranging the terms, we get (2.33). Let us now show that (2.34) holds. Clearly, by the convexity of $g(\cdot)$ and the definition of $\bar{x}_{N,s}$, we have
\[
g(\bar{x}_{N,s}) = g\Big(\sum_{k\in\mathcal{B}}\iota_k x_k\Big) \le \Big(\sum_{k\in\mathcal{B}}\gamma_k\Big)^{-1}\sum_{k\in\mathcal{B}}\gamma_k g(x_k).
\]
Using this observation and an argument similar to the proof of (2.32), we obtain (2.34).

Next, let us show part (b). First, observe that condition (2.29) holds under the selection of $s$, $\{\gamma_k\}$ and $\{\eta_k\}$. From Lemma 10, at least one of the following two statements holds:

(a) $\mathrm{Prob}\{|\mathcal{B}| \ge (N-s+1)/2\} \ge 1 - |\mathcal{N}|\exp\{-\lambda^2/3\} \ge 1 - \Lambda$;
(b) $\sum_{k\in\mathcal{B}}\gamma_k\langle f'(x_k), x_k - x^*\rangle \le 0$, which, in view of the convexity of f, implies $f(\bar{x}_{N,s}) - f(x^*) \le 0$.

Also, from (2.34) and (2.35), we have
\[
\mathrm{Prob}\Big\{g(\bar{x}_{N,s}) \ge \frac{4MD_X}{\sqrt{N}} + \frac{3\lambda\sigma}{\sqrt{J}}\Big\} \le |\mathcal{B}|\exp\{-\lambda^2/3\},
\quad\text{and hence}\quad
\mathrm{Prob}\{g(\bar{x}_{N,s}) \ge \vartheta\} \le \Lambda.
\]
Moreover, conditionally on $|\mathcal{B}| \ge N/2$, it follows from part (a) and (2.35) that
\[
\mathrm{Prob}\Big\{f(\bar{x}_{N,s}) - f(x^*) \ge \frac{3D_X M}{\sqrt{N}} + \lambda\Big(\frac{3\sqrt{2}\,MD_X}{\sqrt{N}} + \frac{2\sqrt{2}\,\sigma}{\sqrt{N}} + \frac{\sqrt{2}\,\sigma}{\sqrt{J}}\Big)\Big\} \le 2\exp\{-\lambda\} + (|\mathcal{N}|+2)\exp\Big\{-\frac{\lambda^2}{3}\Big\}.
\]
By our selection of N and J, we then obtain (2.36). ◻

In view of Theorem 11, the number of iterations N of the CSA algorithm can be bounded by $\mathcal{O}\big(\max\big\{\frac{1}{\epsilon^2}\big(\log\frac{1}{\Lambda}\big)^2, \frac{1}{\vartheta^2}\big\}\big)$, and the sample size J for estimating the constraint in every iteration of the CSA algorithm can be bounded by $\mathcal{O}\big(\max\big\{\frac{1}{\epsilon^2}\big(\log\frac{1}{\Lambda}\big)^2, \frac{1}{\vartheta^2}\log\frac{1}{\Lambda^3}\big\}\big)$ for solving problem (1.1)–(1.2).

2.5 Strongly convex objective and strongly convex constraints

In this subsection, we are interested in establishing the convergence of the CSA algorithm applied to strongly convex problems. More specifically, we assume that the

objective function F and constraint function g in problem (1.1), where g is given in the form of a function constraint, are both strongly convex w.r.t. x, i.e., there exist $\mu_F > 0$ and $\mu_G > 0$ such that
\[
F(x_1,\xi) \ge F(x_2,\xi) + \langle F'(x_2,\xi), x_1 - x_2\rangle + \frac{\mu_F}{2}\|x_1 - x_2\|^2, \quad \forall x_1, x_2 \in X,
\]
\[
g(x_1) \ge g(x_2) + \langle g'(x_2), x_1 - x_2\rangle + \frac{\mu_G}{2}\|x_1 - x_2\|^2, \quad \forall x_1, x_2 \in X.
\]
For the sake of simplicity, we focus on the case when the constraint function g can be evaluated exactly (i.e., $\hat{G}_k = g(x_k)$). However, expectation constraints can be dealt with using techniques similar to those discussed in Sect. 2.4.

In order to estimate the convergence rate of the CSA algorithm for solving strongly convex problems, we need to assume that the prox-functions $V_X(\cdot,\cdot)$ and $V_Y(\cdot,\cdot)$ satisfy a quadratic growth condition:
\[
V_X(z,x) \le \frac{Q}{2}\|z-x\|^2, \;\forall z,x \in X \quad\text{and}\quad V_Y(z,y) \le \frac{Q}{2}\|z-y\|^2, \;\forall z,y \in Y. \tag{2.42}
\]
Moreover, letting $\gamma_k$ be the stepsizes used in the CSA method, and denoting
\[
a_k = \begin{cases}\frac{\mu_F\gamma_k}{Q}, & k \in \mathcal{B},\\[2pt] \frac{\mu_G\gamma_k}{Q}, & k \in \mathcal{N},\end{cases}
\qquad
A_k = \begin{cases}1, & k = 1,\\ (1-a_k)A_{k-1}, & k \ge 2,\end{cases}
\qquad\text{and}\qquad \rho_k = \frac{\gamma_k}{A_k},
\]
we define
\[
\bar{x}_{N,s} = \frac{\sum_{k\in\mathcal{B}}\rho_k x_k}{\sum_{k\in\mathcal{B}}\rho_k} \tag{2.43}
\]
as the output of Algorithm 1.

The following simple result will be used in the convergence analysis of the CSA method.

Lemma 12 If $a_k \in (0,1]$, $k = 1, 2, \dots$, $A_k > 0$ for all $k \ge 1$, and $\{\Delta_k\}$ satisfies
\[
\Delta_{k+1} \le (1-a_k)\Delta_k + B_k, \quad \forall k \ge 1,
\]
then we have
\[
\frac{\Delta_{k+1}}{A_k} \le (1-a_1)\Delta_1 + \sum_{i=1}^{k}\frac{B_i}{A_i}.
\]

Below we provide an important recursion for CSA applied to strongly convex problems. This result differs from Proposition 2 for the general convex case in that we use the weights $\rho_k$ rather than $\gamma_k$.

Proposition 13 For any $1 \le s \le N$, we have
\[
\sum_{k\in\mathcal{N}}\rho_k\big(\eta_k - g(x)\big) + \sum_{k\in\mathcal{B}}\rho_k\big[F(x_k,\xi_k) - F(x,\xi_k)\big] \le (1-a_s)D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k\|F'(x_k,\xi_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k\|g'(x_k)\|_*^2. \tag{2.44}
\]

Proof Consider iteration k for $s \le k \le N$. If $k \in \mathcal{B}$, by Lemma 1 and the strong convexity of $F(\cdot,\xi)$, we have
\[
\begin{aligned}
V(x_{k+1},x) &\le V(x_k,x) - \gamma_k\langle h_k, x_k - x\rangle + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 \\
&= V(x_k,x) - \gamma_k\langle F'(x_k,\xi_k), x_k - x\rangle + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 \\
&\le V(x_k,x) - \gamma_k\Big[F(x_k,\xi_k) - F(x,\xi_k) + \frac{\mu_F}{2}\|x_k - x\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2 \\
&\le \Big(1 - \frac{\mu_F\gamma_k}{Q}\Big)V(x_k,x) - \gamma_k\big[F(x_k,\xi_k) - F(x,\xi_k)\big] + \tfrac{1}{2}\gamma_k^2\|F'(x_k,\xi_k)\|_*^2.
\end{aligned}
\]
Similarly, for $k \in \mathcal{N}$, using Lemma 1 and the strong convexity of g, we have
\[
\begin{aligned}
V(x_{k+1},x) &\le V(x_k,x) - \gamma_k\langle g'(x_k), x_k - x\rangle + \tfrac{1}{2}\gamma_k^2\|g'(x_k)\|_*^2 \\
&\le V(x_k,x) - \gamma_k\Big[g(x_k) - g(x) + \frac{\mu_G}{2}\|x_k - x\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|g'(x_k)\|_*^2 \\
&\le \Big(1 - \frac{\mu_G\gamma_k}{Q}\Big)V(x_k,x) - \gamma_k\big(\eta_k - g(x)\big) + \tfrac{1}{2}\gamma_k^2\|g'(x_k)\|_*^2.
\end{aligned}
\]
Summing up these inequalities for $s \le k \le N$ and using Lemma 12, we have
\[
\begin{aligned}
\frac{V(x_{N+1},x)}{A_N} &\le (1-a_s)V(x_s,x) - \Big[\sum_{k\in\mathcal{N}}\frac{\gamma_k}{A_k}\big(\eta_k - g(x)\big) + \sum_{k\in\mathcal{B}}\frac{\gamma_k}{A_k}\big[F(x_k,\xi_k) - F(x,\xi_k)\big]\Big] \\
&\quad + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\frac{\gamma_k^2}{A_k}\|g'(x_k)\|_*^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\frac{\gamma_k^2}{A_k}\|F'(x_k,\xi_k)\|_*^2.
\end{aligned}
\]
Using the facts that $V(x_{N+1},x)/A_N \ge 0$ and the definition of $\rho_k$, and rearranging the terms in the above inequality, we obtain (2.44). ◻

Lemma 14 below provides a sufficient condition which guarantees that $\bar{x}_{N,s}$ is well-defined.

Lemma 14 Let $x^*$ be the optimal solution of (1.1). If
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\rho_k\eta_k > (1-a_s)D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2, \tag{2.45}
\]
then $\mathcal{B} \ne \emptyset$ and hence $\bar{x}_{N,s}$ is well-defined. Moreover, at least one of the following two statements holds:

(a) $|\mathcal{B}| \ge (N-s+1)/2$;
(b) $\sum_{k\in\mathcal{B}}\rho_k\big[f(x_k) - f(x^*)\big] \le 0$.

Proof The proof of this result is similar to that of Lemma 2 and hence the details are skipped. ◻

With the help of Proposition 13, we are ready to establish the main convergence properties of the CSA method for solving strongly convex problems.
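As a quick numerical sanity check (our own, not from the paper), the snippet below verifies the closed forms used later in the proof of Corollary 16 — with the specific choice $a_k = 2/(k+1)$, the product $A_k = \prod_{i=2}^{k}(1-a_i)$ telescopes to $2/(k(k+1))$ — and confirms the conclusion of Lemma 12 on a randomly generated sequence that satisfies the recursion with equality.

```python
import random

# Closed form for the weights of Corollary 16: with a_k = 2/(k+1),
# A_1 = 1 and A_k = (1 - a_k) A_{k-1} telescope to 2/(k(k+1)).
K = 50
a = [0.0] + [2.0 / (k + 1) for k in range(1, K + 1)]   # a[k], 1-indexed
A = [0.0] * (K + 1)
A[1] = 1.0
for k in range(2, K + 1):
    A[k] = (1.0 - a[k]) * A[k - 1]
for k in range(2, K + 1):
    assert abs(A[k] - 2.0 / (k * (k + 1))) < 1e-12

# Lemma 12: if Delta_{k+1} <= (1 - a_k) Delta_k + B_k for all k >= 1, then
# Delta_{k+1} / A_k <= (1 - a_1) Delta_1 + sum_{i <= k} B_i / A_i.
random.seed(0)
B = [0.0] + [random.random() for _ in range(K)]        # B[k] >= 0, 1-indexed
Delta = [0.0] * (K + 2)
Delta[1] = 5.0
for k in range(1, K + 1):
    Delta[k + 1] = (1.0 - a[k]) * Delta[k] + B[k]      # recursion with equality
rhs = (1.0 - a[1]) * Delta[1] + sum(B[i] / A[i] for i in range(1, K + 1))
assert Delta[K + 1] / A[K] <= rhs + 1e-6 * abs(rhs) + 1e-9
print("Lemma 12 bound holds:", Delta[K + 1] / A[K], "<=", rhs)
```

With equality in the recursion the two sides coincide (up to floating-point error), which is exactly the tight case of Lemma 12.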

Theorem 15 Suppose that $\{\gamma_k\}$ and $\{\eta_k\}$ in the CSA algorithm are chosen such that (2.45) holds. Then for any $1 \le s \le N$, we have
\[
\mathbb{E}[f(\bar{x}_{N,s}) - f(x^*)] \le \Big((N-s+1)\min_{s\le k\le N}\rho_k\Big)^{-1}\Big(2(1-a_s)D_X^2 + \sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2 + \sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2\Big), \tag{2.46}
\]
\[
g(\bar{x}_{N,s}) \le \Big(\sum_{k\in\mathcal{B}}\rho_k\Big)^{-1}\Big(\sum_{k\in\mathcal{B}}\rho_k\eta_k\Big). \tag{2.47}
\]

Proof Taking expectation w.r.t. $\xi_i$, $1 \le i \le k$, on both sides of (2.44) (with $x = x^*$) and using Assumption 1, we have
\[
\sum_{k\in\mathcal{N}}\rho_k\big(\eta_k - g(x^*)\big) + \sum_{k\in\mathcal{B}}\rho_k\,\mathbb{E}[f(x_k) - f(x^*)] \le (1-a_s)D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2.
\]
(2.46) then immediately follows from the above inequality, (2.43), the convexity of f and the fact that $g(x^*) \le 0$. Moreover, (2.47) follows similarly to (2.18). ◻

Below we provide a policy for choosing $s$, $\gamma_k$ and $\eta_k$ in order to achieve the optimal rate of convergence for solving strongly convex problems.

Corollary 16 Let
\[
s = \frac{N}{2}, \qquad
\gamma_k = \begin{cases}\frac{2Q}{\mu_F(k+1)}, & k \in \mathcal{B},\\[2pt] \frac{2Q}{\mu_G(k+1)}, & k \in \mathcal{N},\end{cases}
\qquad
\eta_k = \frac{2\mu_G Q}{k}\Big(\frac{2D_X^2}{k} + \max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}\Big).
\]
Then we have
\[
\mathbb{E}[f(\bar{x}_{N,s}) - f(x^*)] \le \frac{8\mu_F D_X^2}{N^2 Q} + \frac{4\mu_F Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\},
\]
\[
g(\bar{x}_{N,s}) \le \frac{16\mu_G Q D_X^2}{N^2} + \frac{4\mu_G Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}.
\]

Proof Based on our selection of $s$, $\gamma_k$, $\eta_k$ and the definition of $a_k$, $A_k$ and $\rho_k$, we have
\[
a_k = \frac{2}{k+1}, \qquad A_k = \prod_{i=2}^{k}(1-a_i) = \frac{2}{k(k+1)}, \qquad
\rho_k = \begin{cases}\frac{kQ}{\mu_F}, & k \in \mathcal{B},\\[2pt] \frac{kQ}{\mu_G}, & k \in \mathcal{N}.\end{cases}
\]
For any $s \le k \le N$, by the definition of $s$, $\gamma_k$ and $\eta_k$, we have

\[
\begin{aligned}
(1-a_s)V(x_s,x) + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2
&\le D_X^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\frac{\gamma_k^2}{A_k}M_F^2 + \tfrac{1}{2}\sum_{k\in\mathcal{N}}\frac{\gamma_k^2}{A_k}M_G^2 \\
&\le D_X^2 + \frac{Q^2}{2}\Big(|\mathcal{B}|\frac{M_F^2}{\mu_F^2} + |\mathcal{N}|\frac{M_G^2}{\mu_G^2}\Big)
\le D_X^2 + \frac{Q^2 N}{2}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\},
\end{aligned}
\]
and
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\rho_k\eta_k = \frac{N}{4}\min_{k\in\mathcal{N}}\frac{kQ}{\mu_G}\cdot\frac{2\mu_G Q}{k}\Big(\frac{2D_X^2}{k} + \max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}\Big) \ge D_X^2 + \frac{Q^2 N}{2}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}.
\]
Combining the above two inequalities, we can easily see that condition (2.45) holds. It then follows from Theorem 15 that
\[
\mathbb{E}[f(\bar{x}_{N,s}) - f(x^*)] \le \Big((N-s+1)\min_{s\le k\le N}\rho_k\Big)^{-1}\Big(2(1-a_s)D_X^2 + \sum_{k\in\mathcal{B}}\rho_k\gamma_k M_F^2 + \sum_{k\in\mathcal{N}}\rho_k\gamma_k M_G^2\Big) \le \frac{8\mu_F D_X^2}{N^2 Q} + \frac{4\mu_F Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\},
\]
and
\[
g(\bar{x}_{N,s}) \le \Big(\sum_{k\in\mathcal{B}}\rho_k\Big)^{-1}\Big(\sum_{k\in\mathcal{B}}\rho_k\eta_k\Big) \le \frac{16\mu_G Q D_X^2}{N^2} + \frac{4\mu_G Q}{N}\max\Big\{\frac{M_F^2}{\mu_F^2}, \frac{M_G^2}{\mu_G^2}\Big\}. \qquad ◻
\]

In view of Corollary 16, the CSA algorithm achieves the optimal rate of convergence for strongly convex optimization with strongly convex constraints. To the best of our knowledge, this is the first time such a complexity result has been obtained in the literature, and the result is new even in the deterministic setting.

3 Expectation constraints over problem parameters

In this section, we are interested in solving a class of parameterized stochastic optimization problems whose parameters are defined by expectation constraints, as described in (1.4)–(1.5), under the assumption that a pair of solutions satisfying (1.4)–(1.5) exists.

Our goal in this section is to present a variant of the CSA algorithm to approximately solve problem (1.4)–(1.5) and to establish its convergence properties. More specifically, we discuss this variant of the CSA algorithm when applied to the parameterized stochastic optimization problem in (1.4)–(1.5), and then consider a modified problem obtained by imposing certain strong convexity assumptions on the function $\Phi(x,y,\zeta)$ w.r.t. y and on $G(x,\xi)$ w.r.t. x, in Sects. 3.1 and 3.2, respectively. In Sect. 3.3, we discuss some large-deviation properties of this variant of the CSA method for the problem defined by (1.4)–(1.5).

3.1 Stochastic optimization with parameter feasibility constraints

Given tolerance $\eta > 0$ and target accuracy $\epsilon > 0$, we present in this subsection a variant of the CSA algorithm, namely the cooperative stochastic parameter approximation (CSPA) method, to find a pair of approximate solutions $(\bar{x}, \bar{y}) \in X \times Y$ such that $\mathbb{E}[g(\bar{x})] \le \eta$ and $\mathbb{E}[\phi(\bar{x},\bar{y}) - \phi(\bar{x},y)] \le \epsilon$ for all $y \in Y$. Before describing the CSPA method, we need to slightly modify Assumption 1.

Assumption 4 For any $x \in X$ and $y \in Y$,
\[
\mathbb{E}[\|\Phi'(x,y,\zeta)\|_*^2] \le M_\Phi^2 \quad\text{and}\quad \mathbb{E}[\|G'(x,\xi)\|_*^2] \le M_G^2,
\]
where $\Phi'(x,y,\zeta) \in \partial_y\Phi(x,y,\zeta)$ and $G'(x,\xi) \in \partial_x G(x,\xi)$.

We will also discuss the convergence properties under the following light-tail assumptions.

Assumption 5 For any $x \in X$ and $y \in Y$,
\[
\mathbb{E}\big[\exp\{\|\Phi'(x,y,\zeta)\|_*^2/M_\Phi^2\}\big] \le \exp\{1\},
\]
\[
\mathbb{E}\big[\exp\{(\Phi(x,y,\zeta) - \phi(x,y))^2/\sigma^2\}\big] \le \exp\{1\},
\]
\[
\mathbb{E}\big[\exp\{(G(x,\xi) - g(x))^2/\sigma^2\}\big] \le \exp\{1\}.
\]

We assume that the distance generating functions $\omega_X: X \mapsto \mathbb{R}$ and $\omega_Y: Y \mapsto \mathbb{R}$ are strongly convex with modulus 1 w.r.t. the given norms in $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, and that their associated prox-mappings $P_{x,X}$ and $P_{y,Y}$ (see (2.1)) are easily computable.

We make the following modifications to the CSA method of Sect. 2.1 in order to apply it to problem (1.4)–(1.5). Firstly, we still check the solution $(x_k, y_k)$ to see whether $x_k$ violates the condition $\sum_{i=1}^{k}\gamma_i G(x_i,\xi_i)/\sum_{i=1}^{k}\gamma_i \le \eta_k$. If so, we set the search direction to $G'(x_k,\xi_k)$ in order to update $x_k$, while keeping $y_k$ intact. Otherwise, we only update $y_k$ along the direction $\Phi'(\bar{x}_k, y_k, \zeta_k)$. Secondly, we define the output as a randomly selected $(\bar{x}_k, y_k)$ according to a certain probability distribution, instead of the ergodic mean of $\{(\bar{x}_k, y_k)\}$, where $\bar{x}_k$ denotes the average of $\{x_k\}$ (see (3.1)). Since we are solving a coupled optimization and feasibility problem, each iteration of our algorithm updates either $y_k$ or $x_k$, and requires the computation of either $\Phi'$ or $G'$, depending on whether $\sum_{i=1}^{k}\gamma_i G(x_i,\xi_i)/\sum_{i=1}^{k}\gamma_i \le \eta_k$. This differs from the SA method of Jiang and Shanbhag [16], which requires two projection steps and the computation of two subgradients at each iteration, to solve a different parameterized stochastic optimization problem.
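The control flow just described can be sketched as follows. This is our own illustrative pseudocode-style sketch: the scalar problem data, the stepsize, the tolerance sequence, and the noise distributions are placeholders, not taken from the paper, and the final random selection of the output pair is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative toy data (placeholders, not from the paper):
#   G(x, xi)  = x - 0.5 + xi, with xi ~ N(0, 0.1^2)  ->  g(x) = x - 0.5
#   Phi(x, y, zeta) = (y - x)^2 + zeta * y           ->  minimized over y near x
# X = [0, 2] and Y = [-2, 2], both scalar, for readability.

def G(x, xi):
    return x - 0.5 + xi

def Gprime(x):
    return 1.0                                   # subgradient of G w.r.t. x

def Phiprime(x, y, zeta):
    return 2.0 * (y - x) + zeta                  # stochastic subgradient in y

N = 2000
gamma = 1.0 / np.sqrt(N)                         # stepsize gamma_k (placeholder)
x, y = 2.0, -2.0
run_num = run_den = 0.0                          # running weighted constraint average
B = []                                           # pairs recorded at iterations k in B

for k in range(1, N + 1):
    eta_k = 4.0 / np.sqrt(k)                     # tolerance eta_k (placeholder)
    xi = 0.1 * rng.standard_normal()
    run_num += gamma * G(x, xi)
    run_den += gamma
    if run_num / run_den <= eta_k:               # k in B: x looks feasible on average,
        zeta = 0.1 * rng.standard_normal()       # so improve y for the current x
        y = np.clip(y - gamma * Phiprime(x, y, zeta), -2.0, 2.0)
        B.append((x, y))
    else:                                        # k in N: push x toward feasibility
        x = np.clip(x - gamma * Gprime(x), 0.0, 2.0)

# CSPA would output a randomly selected pair from B; here we just inspect the state.
print(len(B), x, y)
```

Note the asymmetry that distinguishes CSPA from a generic primal-dual scheme: each iteration touches only one of the two blocks of variables and computes only one subgradient.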

With a slight abuse of notation, we still use $\mathcal{B}$ to denote the set $\{s \le k \le N : \sum_{i=1}^{\tau(k)}\gamma_i G(x_i,\xi_i)/\sum_{i=1}^{\tau(k)}\gamma_i \le \eta_k\}$, $I = \{s,\dots,N\}$, and $\mathcal{N} = I \setminus \mathcal{B}$. The following result mimics Proposition 2.

Proposition 17 For any $1 \le s \le N$, we have
\[
\sum_{k\in\mathcal{B}}\gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y_k - y\rangle \le D_Y^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2, \quad \forall y \in Y, \tag{3.4}
\]
\[
\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\big[G(x_i,\xi_i) - G(x,\xi_i)\big] \le D_X^2 + \tfrac{1}{2}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2, \quad \forall x \in X, \tag{3.5}
\]
where $D_X \equiv D_{X,\omega_X}$ and $D_Y \equiv D_{Y,\omega_Y}$ are defined as in (2.3).

Proof By Lemma 1, if $k \in \mathcal{B}$,
\[
V(y_{k+1},y) \le V(y_k,y) + \gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y - y_k\rangle + \tfrac{1}{2}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2.
\]
Also note that $V(y_{k+1},y) = V(y_k,y)$ for $k \in \mathcal{N}$. Summing up these relations for $k \in \mathcal{B} \cup \mathcal{N}$ and using the fact that $V(y_s,y) \le D_Y^2$, we have
\[
\begin{aligned}
V(y_{N+1},y) &\le V(y_s,y) + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 - \sum_{k\in\mathcal{B}}\gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y_k - y\rangle \\
&\le D_Y^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 - \sum_{k\in\mathcal{B}}\gamma_k\langle\Phi'(\bar{x}_k, y_k, \zeta_k), y_k - y\rangle.
\end{aligned} \tag{3.6}
\]
Similarly, for $\tau(s) \le i \le \tau(N)$, we have
\[
V(x_{i+1},x) \le V(x_i,x) + \gamma_i\langle G'(x_i,\xi_i), x - x_i\rangle + \tfrac{1}{2}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2.
\]
Summing up these relations for $\tau(s) \le i \le \tau(N)$ and using the fact that $V(x_{\tau(s)},x) \le D_X^2$, we obtain

\[
V(x_{\tau(N)+1},x) \le D_X^2 + \tfrac{1}{2}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2 - \sum_{i=\tau(s)}^{\tau(N)}\gamma_i\big(G(x_i,\xi_i) - G(x,\xi_i)\big). \tag{3.7}
\]
Using the facts that $V(y_{N+1},y) \ge 0$ and $V(x_{\tau(N)+1},x) \ge 0$, and rearranging the terms in (3.6) and (3.7), we obtain (3.4) and (3.5), respectively. ◻

The following result provides a sufficient condition under which $(\bar{x}_R, y_R)$ is well-defined.

Lemma 18 The following statements hold.

(a) Under Assumption 4, if for any $\lambda > 0$ we have
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k > D_X^2 + \lambda\,\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2, \tag{3.8}
\]
then $\mathrm{Prob}\{|\mathcal{B}| \ge \frac{N-s+1}{2}\} \ge 1 - 1/\lambda$.
(b) Under Assumption 5, if for any $\lambda > 0$ we have
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k > D_X^2 + (1+\lambda)\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2 + \lambda\sigma\sqrt{\textstyle\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}, \tag{3.9}
\]
then $\mathrm{Prob}\{|\mathcal{B}| \ge \frac{N-s+1}{2}\} \ge 1 - 2\exp\{-\frac{\lambda^2}{3}\}$.

Proof Let us first show part (a). Setting $\delta_k = G(x^*,\xi_k) - g(x^*)$, it follows from (3.5) (with $x = x^*$) that
\[
\sum_{i=\tau(s)}^{\tau(N)}\gamma_i G(x_i,\xi_i) - \sum_{i=\tau(s)}^{\tau(N)}\gamma_i g(x^*) \le D_X^2 + \tfrac{1}{2}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2\|G'(x_i,\xi_i)\|_*^2 + \sum_{i=\tau(s)}^{\tau(N)}\gamma_i\delta_i.
\]
For contradiction, suppose that $|\mathcal{B}| < \frac{N-s+1}{2}$, i.e., $\tau(N) - \tau(s) = |\mathcal{N}| \ge \frac{N-s+1}{2}$. The above relation, in view of $g(x^*) \le 0$ and the fact that $\sum_{i=\tau(s)}^{\tau(N)}\gamma_i G(x_i,\xi_i) \ge \eta_{\tau(N)}\sum_{i=\tau(s)}^{\tau(N)}\gamma_i$, implies that
\[
\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k \le \eta_{\tau(N)}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k \le D_X^2 + \tfrac{1}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2 + \sum_{k=\tau(s)}^{\tau(N)}\gamma_k\delta_k.
\]
Under Assumption 4, for any $\lambda > 0$, taking expectation on both sides and using Markov's inequality, we have
\[
\mathrm{Prob}\Big\{\frac{N-s+1}{2}\min_{k\in\mathcal{N}}\gamma_k\eta_k \le D_X^2 + \lambda\,\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2\Big\} \ge 1 - 1/\lambda.
\]
Hence, part (a) holds. Part (b) can be shown similarly, and the details are skipped. ◻

Theorem 19 summarizes the main convergence properties of Algorithm 2 applied to problem (1.4)–(1.5).

Theorem 19 The following statements hold for the CSPA algorithm.

(a) Under Assumption 4, we have, for any $\lambda > 0$,
\[
\mathbb{E}[\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R))] \le \frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}, \tag{3.10}
\]
\[
\mathrm{Prob}\Big\{\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R)) \ge \lambda\,\frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}\Big\} \le \frac{1}{\lambda}, \tag{3.11}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k}\Big\} \le \frac{1}{\lambda^2}. \tag{3.12}
\]

(b) Under Assumption 5, we have, for any $\lambda > 0$,
\[
\mathbb{E}[\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R))] \le \frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}, \tag{3.13}
\]
\[
\mathrm{Prob}\{\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R)) \ge K_0 + \lambda K_1\} \le \exp\{-\lambda\} + \exp\{-\lambda^2/3\}, \tag{3.14}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k}\Big\} \le \exp\{-\lambda^2/3\}, \tag{3.15}
\]
where
\[
K_0 = \frac{2D_Y^2 + M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2}{2\sum_{k\in\mathcal{B}}\gamma_k}
\quad\text{and}\quad
K_1 = \frac{M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2 + 4M_\Phi D_Y\sqrt{\sum_{k\in\mathcal{B}}\gamma_k^2}}{2\sum_{k\in\mathcal{B}}\gamma_k},
\]
and the expectation is taken w.r.t. R and $\zeta_1,\dots,\zeta_N$.

Proof Let us prove part (a) first. Setting $\Delta_k = \Phi(\bar{x}_k, y_k, \zeta_k) - \phi(\bar{x}_k, y_k)$, it follows from (3.4) (with $y = y^*$) that
\[
\sum_{k\in\mathcal{B}}\gamma_k\big[\phi(\bar{x}_k, y_k) - \phi(\bar{x}_k, y^*(\bar{x}_k))\big] \le D_Y^2 + \tfrac{1}{2}\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 + \sum_{k\in\mathcal{B}}\gamma_k\Delta_k(y - y_k). \tag{3.16}
\]
Since, conditionally on $\zeta_{[k-1]}$, the expectation of $\Delta_k$ equals zero, taking expectation on both sides of (3.16) and dividing both sides by $\sum_{k\in\mathcal{B}}\gamma_k$, we obtain (3.10). Hence, using Markov's inequality, we have (3.11). Denote $\delta_k = G(x_k,\xi_k) - g(x_k)$. It then follows from the convexity of $g(\cdot)$ and the definition of the set $\mathcal{B}$ that
\[
g(\bar{x}_k) \le \frac{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i g(x_i)}{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i} \le \eta_k - \frac{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\delta_i}{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i}. \tag{3.17}
\]
Using the facts that $\mathbb{E}[\delta_k \mid \xi_{[k-1]}] = 0$ and $\mathbb{E}[|\delta_k|^2] \le \sigma^2$, we have
\[
\mathbb{E}\Bigg[\bigg|\frac{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\delta_i}{\sum_{i=\tau(s)}^{\tau(N)}\gamma_i}\bigg|^2\Bigg] \le \frac{\sigma^2\sum_{i=\tau(s)}^{\tau(N)}\gamma_i^2}{\big(\sum_{i=\tau(s)}^{\tau(N)}\gamma_i\big)^2}.
\]
From Markov's inequality, we have (3.12). Hence part (a) holds.

Under Assumption 5, (3.13) still holds. Using the fact that $\mathbb{E}[\exp\{\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2/M_\Phi^2\}] \le \exp\{1\}$ and Jensen's inequality, we have $\mathbb{E}[\exp\{\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2/(M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2)\}] \le \exp\{1\}$. It then follows from Markov's inequality that, for any $\lambda \ge 0$,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k^2\|\Phi'(\bar{x}_k, y_k, \zeta_k)\|_*^2 > (1+\lambda)M_\Phi^2\sum_{k\in\mathcal{B}}\gamma_k^2\Big) \le \frac{\exp\{1\}}{\exp\{1+\lambda\}} \le \exp\{-\lambda\}. \tag{3.18}
\]
Also,
\[
\mathrm{Prob}\Big(\sum_{k\in\mathcal{B}}\gamma_k\Delta_k(y - y_k) > 2\lambda M_\Phi D_Y\sqrt{\textstyle\sum_{k\in\mathcal{B}}\gamma_k^2}\Big) \le \exp\{-\lambda^2/3\}. \tag{3.19}
\]
Combining (3.16), (3.18) and (3.19), we have (3.14). Similarly, we have
\[
\mathrm{Prob}\Big\{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k\delta_k \ge \lambda\sigma\sqrt{\textstyle\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2}\Big\} \le \exp\{-\lambda^2/3\}. \tag{3.20}
\]
Combining (3.17) and (3.20), we have (3.15). ◻

Below we provide a special selection of $s$, $\{\gamma_k\}$ and $\{\eta_k\}$.

Corollary 20 Let $s = \frac{N}{2} + 1$, $\gamma_k = \frac{D_X}{M_G\sqrt{k}}$ and $\eta_k = \frac{4M_G D_X}{\sqrt{k}}$ for $k = 1,\dots,N$. Then we have
\[
\mathbb{E}[\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R))] \le \frac{8M_\Phi D_Y}{\sqrt{N}}\max\Big\{\nu, \frac{1}{\nu}\Big\},
\]
where $\nu := (M_G D_Y)/(M_\Phi D_X)$. Moreover, the following statements hold.

(a) Under Assumption 4,
\[
\mathrm{Prob}\Big\{\phi(\bar{x}_R, y_R) - \phi(\bar{x}_R, y^*(\bar{x}_R)) \le \lambda\,\frac{8M_\Phi D_Y}{\sqrt{N}}\max\Big\{\nu, \frac{1}{\nu}\Big\}\Big\} \ge \Big(1 - \frac{1}{\lambda}\Big)^2, \tag{3.21}
\]
\[
\mathrm{Prob}\Big\{g(\bar{x}_R) \le \lambda\,\frac{\sqrt{2}\,D_X M_G}{\sqrt{N}}\Big\} \ge \Big(1 - \frac{1}{\lambda}\Big)^2. \tag{3.22}
\]
(b) Under Assumption 5,

  • G. Lan, Z. Zhou

    1 3

    where K0 =8MΦDY√

    Nmax{�,

    1

    } and K1 =1√N

    �4M2

    ΦDX

    MG+ 10MΦDY

    �.

    Proof Similarly to Corollary 5, we can show that (3.8) holds. It then follows from Lemma 18 and Theorem 19(a) that

    Similarly, part (b) follows from Theorem 19(b). ◻

    By Corollary  (20), the CSPA method applied to (1.4)–(1.5) can achieve an O(1∕

    √N) rate of convergence.

    3.2 CSPA with strong convexity assumptions

    In this subsection, we modify problem (1.4)–(1.5) by imposing certain strong convexity assumptions to Φ and G with respect to y and x, respectively, i.e., ∃𝜇Φ,𝜇G > 0 , s.t.

    We also assume that the pair of solutions (x∗, y∗) exists for problem (1.4)–(1.5). Our main goal in this subsection is to estimate the convergence properties of the CSPA algorithm under these new assumptions.

    We need to modify the probability distribution (3.3) used in the CSPA algorithm as follows. Given the stepsize �k , modulus �G and �Φ , and growth parameter Q (see (2.42)), let us define

    and denote

    Prob�𝜙(x̄R, yR) − 𝜙(x̄R, y

    ∗(x̄R)) ≤ K0 + 𝜆K1�

    ≥ (1 − 2exp{−𝜆2∕3})(1 − exp{−𝜆} − exp{−𝜆2∕3}),

    Prob

    �g(x̄R) ≤

    √2DX

    MG

    √N+ 𝜆

    5𝜎√N

    �≥ (1 − 2exp{−𝜆2∕3})(1 − exp{−𝜆2∕3}),

    ∑k∈B �k =

    ∑k∈B

    DX

    MG

    √k≥

    DX

    MG

    N

    4

    1√N=

    DX

    √N

    4MG.

$\mathbb{E}[\phi(x_R,y_R)-\phi(x_R,y^*)] \le \frac{2M_G}{D_X\sqrt{N}}\Big(2D_Y^2+\sum_{k\in B}\frac{D_X^2 M_\Phi^2}{M_G^2 k}\Big) \le \frac{2M_G}{D_X\sqrt{N}}\Big(2D_Y^2+\sum_{k=N/2}^{N}\frac{D_X^2 M_\Phi^2}{M_G^2 k}\Big) \le \frac{2M_G}{D_X\sqrt{N}}\Big[2D_Y^2+\log 2\,\frac{D_X^2 M_\Phi^2}{M_G^2}\Big] \le \frac{8M_\Phi D_Y}{\sqrt{N}}\max\Big\{\nu,\frac{1}{\nu}\Big\}.$

(3.23) $\Phi(x,y_1,\zeta) \ge \Phi(x,y_2,\zeta) + \langle \Phi'(x,y_2,\zeta), y_1-y_2\rangle + \frac{\mu_\Phi}{2}\|y_1-y_2\|^2, \quad \forall y_1,y_2\in Y.$

(3.24) $G(x_1,\xi) \ge G(x_2,\xi) + \langle G'(x_2,\xi), x_1-x_2\rangle + \frac{\mu_G}{2}\|x_1-x_2\|^2, \quad \forall x_1,x_2\in X.$

(3.25) $a_k := \dfrac{\mu_\Phi\gamma_k}{Q} \quad\text{and}\quad A_k := \begin{cases} 1, & k=1;\\ \prod_{i\le k,\, i\in B}(1-a_i), & k>1, \end{cases}$


    Also the probability distribution of R is modified to

    The following result shows some simple but important properties for the modified CSPA method applied to problem (1.4)–(1.5).

    Proposition 21 For any s ≤ k ≤ m , we have

    Proof Using Lemma 1 and the strong convexity of Φ w.r.t. y, for k ∈ B , we have

    Also note that VY (yk+1, y) = VY (yk, y) for all k ∈ N . Summing up these relations for all k ∈ B ∪N and using Lemma 12, we obtain

Similarly, for $\tau(s) \le k \le \tau(N)$, we have

(3.26) $b_k := \dfrac{\mu_G\gamma_k}{Q} \quad\text{and}\quad B_k := \begin{cases} 1, & k=1;\\ \prod_{i=1}^{k}(1-b_i), & k>1. \end{cases}$

(3.27) $\mathrm{Prob}\{R=k\} = \dfrac{\gamma_k/A_k}{\sum_{i\in B}\gamma_i/A_i}, \quad k\in B.$
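As an illustration, the products $A_k$ in (3.25) and the weights $\gamma_k/A_k$ of the distribution (3.27) are cheap to maintain incrementally. The following is a minimal Python sketch; the function names and the representation of $B$ as a sorted list of indices are our own illustrative choices, not part of the paper:

```python
import random

def output_distribution(B, gamma, mu_Phi, Q):
    """Compute A_k as in (3.25) and the unnormalized weights gamma_k / A_k
    used by the distribution (3.27).  B is the sorted list of 'productive'
    iteration indices; gamma maps k -> stepsize gamma_k (illustrative names)."""
    A, prod = {}, 1.0
    for k in B:
        prod *= 1.0 - mu_Phi * gamma[k] / Q   # accumulate prod_{i<=k, i in B}(1 - a_i)
        A[k] = 1.0 if k == 1 else prod        # A_1 := 1 by definition
    w = {k: gamma[k] / A[k] for k in B}
    return A, w

def sample_R(B, w, rng=random):
    """Draw R with Prob{R = k} = w_k / sum_{i in B} w_i, as in (3.27)."""
    total = sum(w.values())
    return rng.choices(B, weights=[w[k] / total for k in B], k=1)[0]
```

Note that, compared with (3.3), the weights here are inflated by $1/A_k$, which biases the output index toward later (more accurate) iterates under strong convexity.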

(3.28) $\sum_{k\in B}\frac{\gamma_k}{A_k}[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)] \le \Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)V_Y(y_s,y) + \frac{1}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\|\Phi'(x_k,y_k,\zeta_k)\|_*^2, \quad \forall y\in Y,$

(3.29) $\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k}{B_k}\big[\eta_k - G(x,\xi_k)\big] \le \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)V_X(x_s,x) + \frac{1}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k}\|G'(x_k,\xi_k)\|_*^2, \quad \forall x\in X.$

$\begin{aligned} V_Y(y_{k+1},y) &\le V_Y(y_k,y) - \gamma_k\langle \Phi'(x_k,y_k,\zeta_k), y_k-y\rangle + \tfrac{1}{2}\gamma_k^2\|\Phi'(x_k,y_k,\zeta_k)\|_*^2\\ &\le V_Y(y_k,y) - \gamma_k\Big[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)+\tfrac{\mu_\Phi}{2}\|y_k-y\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|\Phi'(x_k,y_k,\zeta_k)\|_*^2\\ &\le \Big(1-\tfrac{\mu_\Phi\gamma_k}{Q}\Big)V_Y(y_k,y) - \gamma_k[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)] + \tfrac{1}{2}\gamma_k^2\|\Phi'(x_k,y_k,\zeta_k)\|_*^2. \end{aligned}$

(3.30) $\frac{V_Y(y_{N+1},y)}{A_N} \le \Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)V_Y(y_s,y) - \sum_{k\in B}\frac{\gamma_k}{A_k}[\Phi(x_k,y_k,\zeta_k)-\Phi(x_k,y,\zeta_k)] + \frac{1}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\|\Phi'(x_k,y_k,\zeta_k)\|_*^2.$

$\begin{aligned} V_X(x_{k+1},x) &\le V_X(x_k,x) - \gamma_k\langle G'(x_k,\xi_k), x_k-x\rangle + \tfrac{1}{2}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2\\ &\le V_X(x_k,x) - \gamma_k\Big[G(x_k,\xi_k)-G(x,\xi_k)+\tfrac{\mu_G}{2}\|x_k-x\|^2\Big] + \tfrac{1}{2}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2\\ &\le \Big(1-\tfrac{\mu_G\gamma_k}{Q}\Big)V_X(x_k,x) - \gamma_k[G(x_k,\xi_k)-G(x,\xi_k)] + \tfrac{1}{2}\gamma_k^2\|G'(x_k,\xi_k)\|_*^2, \end{aligned}$


Summing up these relations for $\tau(s) \le k \le \tau(N)$ and using Lemma 12, we have

Using the facts that $V_Y(y_{N+1}, y)/A_N \ge 0$ and $V_X(x_{N+1}, x)/B_N \ge 0$, and rearranging the terms in (3.30) and (3.31), we obtain (3.28) and (3.29), respectively. ◻

    Lemma 22 below provides a sufficient condition which guarantees that the output solution (x̄R, yR) is well-defined.

    Lemma 22 The following statements hold.

(a) Under Assumption 4, if for any $\lambda > 0$ condition (3.32) holds, then $\mathrm{Prob}\big\{|B| \ge \frac{N-s+1}{2}\big\} \ge 1 - 1/\lambda$.

(b) Under Assumption 5, if for any $\lambda > 0$ condition (3.33) holds, then $\mathrm{Prob}\big\{|B| \ge \frac{N-s+1}{2}\big\} \ge 1 - 2\exp\{-\lambda^2/3\}$.

    Proof The proof is similar to the one of Lemma  18 and hence the details are skipped. ◻

Now let us establish the rate of convergence of the modified CSPA method for problem (1.4)–(1.5).

Theorem 23 Suppose that $\{\gamma_k\}$ and $\{\eta_k\}$ are chosen according to Lemma 22. Then

    Moreover, under Assumption 4, we have for any 𝜆 > 0,

(3.31) $\frac{V_X(x_{N+1},x)}{B_N} \le \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)V_X(x_s,x) - \sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k}{B_k}\big[\eta_k - G(x,\xi_k)\big] + \frac{1}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k}\|G'(x_k,\xi_k)\|_*^2.$

(3.32) $\frac{N-s+1}{2}\,\min_{k\in N}\frac{\gamma_k\eta_k}{B_k} > \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)D_X^2 + \lambda\,\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k},$

(3.33) $\frac{N-s+1}{2}\,\min_{k\in N}\frac{\gamma_k\eta_k}{B_k} > \Big(1-\frac{\mu_G\gamma_s}{Q}\Big)D_X^2 + (1+\lambda)\frac{M_G^2}{2}\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k} + \lambda\sigma\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\frac{\gamma_k^2}{B_k^2}},$

(3.34) $\mathbb{E}[\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R))] \le \Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[\Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)D_Y^2 + \frac{M_\Phi^2}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\Big].$

(3.35) $\mathrm{Prob}\Big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \ge \lambda\Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[\Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)D_Y^2 + \frac{M_\Phi^2}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\Big]\Big\} \le \frac{1}{\lambda},$


    In addition, under Assumption 5, we have for any 𝜆 > 0,

where $K_0 = \Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[\Big(1-\frac{\mu_\Phi\gamma_s}{Q}\Big)D_Y^2 + \frac{M_\Phi^2}{2}\sum_{k\in B}\frac{\gamma_k^2}{A_k}\Big]$ and $K_1 = \Big(\sum_{k\in B}\frac{\gamma_k}{A_k}\Big)^{-1}\Big[M_\Phi^2\sum_{k\in B}\frac{\gamma_k^2}{A_k} + 4M_\Phi D_Y\sqrt{\sum_{k\in B}\frac{\gamma_k^2}{A_k^2}}\Big]$.

    Proof The proof is similar to the proof of Theorem 19, and hence the details are skipped. ◻

Now we provide a specific selection of $\{\gamma_k\}$ and $\{\eta_k\}$ that satisfies the condition of Lemma 22. While the selection of $\eta_k$ depends only on the iteration index $k$, i.e.,

the selection of $\gamma_k$ depends on the particular position of the iteration index $k$ in the set $B$ or $N$. More specifically, let $\tau_B(k)$ and $\tau(k)$ be the positions of index $k$ in set $B$ and set $N$, respectively (for example, if $B = \{1, 3, 5, 9, 10\}$, $N = \{2, 4, 6, 7, 8\}$ and $k = 9$, then $\tau_B(k) = 4$). We define $\gamma_k$ as

Such a selection of $\gamma_k$ can be conveniently implemented by using two separate counters, updated in each iteration, to keep track of $\tau_B(k)$ and $\tau(k)$.
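The two-counter implementation described above can be sketched as follows. All names are illustrative, and we assume the caller decides the membership of iteration $k$ in $B$ or $N$ (e.g., by comparing the constraint estimate with $\eta_k$) before requesting $\gamma_k$:

```python
def make_stepsizes(Q, mu_Phi, mu_G, M_G):
    """Two-counter implementation of the schedules (3.39)-(3.40).
    eta_k depends only on k; gamma_k depends on the position of k
    in B or in N, tracked by two separate counters."""
    counters = {"B": 0, "N": 0}          # running values of tau_B(k) and tau(k)

    def eta(k):
        # eta_k = 8*Q*M_G^2 / (k*mu_G), cf. (3.39)
        return 8.0 * Q * M_G ** 2 / (k * mu_G)

    def gamma(in_B):
        # advance the counter of the set that iteration k fell into,
        # then use the position there, cf. (3.40)
        key = "B" if in_B else "N"
        counters[key] += 1
        pos = counters[key]              # tau_B(k) if k in B, else tau(k)
        mu = mu_Phi if in_B else mu_G
        return 2.0 * Q / (mu * (pos + 1))

    return gamma, eta
```

Each call to `gamma` advances exactly one of the two counters, so the positions $\tau_B(k)$ and $\tau(k)$ never need to be recomputed from scratch.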

Corollary 24 Let $s = 1$, and let $\eta_k$ and $\gamma_k$ be given in (3.39) and (3.40), respectively. Then we have

    Moreover, under Assumption 4, we have for any 𝜆 > 0,

(3.36) $\mathrm{Prob}\Bigg\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2/B_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k/B_k}\Bigg\} \le \frac{1}{\lambda^2}.$

(3.37) $\mathrm{Prob}\big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \ge K_0+\lambda K_1\big\} \le \exp\{-\lambda\}+\exp\{-\lambda^2/3\},$

(3.38) $\mathrm{Prob}\Bigg\{g(\bar{x}_R) \ge \eta_R + \lambda\sigma\,\frac{\sqrt{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k^2/B_k^2}}{\sum_{k=\tau(s)}^{\tau(N)}\gamma_k/B_k}\Bigg\} \le \exp\{-\lambda^2/3\},$

(3.39) $\eta_k = \dfrac{8QM_G^2}{k\mu_G},$

(3.40) $\gamma_k = \begin{cases} \dfrac{2Q}{\mu_\Phi(\tau_B(k)+1)}, & k\in B;\\[4pt] \dfrac{2Q}{\mu_G(\tau(k)+1)}, & k\in N. \end{cases}$

$\mathbb{E}[\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R))] \le \dfrac{8QM_\Phi^2}{(N+2)\mu_\Phi}.$


    In addition, under Assumption 5, we have for any 𝜆 > 0,

where $K_0 = 8QM_\Phi^2/[(N+2)\mu_\Phi]$ and $K_1 = 8QM_\Phi^2/[(N+2)\mu_\Phi] + 64M_\Phi D_Y/\sqrt{N}$.

    Proof The proof is similar to the proof of Corollary  20 and hence the details are skipped. ◻

Note that Corollary 24(a) implies an $O(1/N)$ rate of convergence, while Corollary 24(b) shows an $O(1/\sqrt{N})$ rate of convergence with a much improved dependence on $\lambda$. One possible approach to improving the result in part (b) is to shrink the feasible set $Y$ from time to time in order to obtain an $O(1/N)$ rate of convergence (see [13]).

    4 Numerical experiment

In this section, we present some numerical results of our computational experiments for solving two problems: an asset allocation problem with a conditional value-at-risk (CVaR) constraint and a parameterized classification problem. More specifically, we report the numerical results obtained from the CSA and CSPA methods applied to these two problems in Sects. 4.1 and 4.2, respectively.

    4.1 Asset allocation problem

Our goal in this subsection is to examine the performance of the CSA method applied to the CVaR-constrained problem in (1.3).

There is one difficulty in applying the CSA algorithm to this model: the feasible region $X$ is unbounded. Lan, Nemirovski and Shapiro (see [20], Section 4.2) show that $\tau$ can be restricted to

$\Big[\mu + \sqrt{\tfrac{\beta}{1-\beta}}\,\sigma,\ \bar{\mu} + \sqrt{\tfrac{1-\beta}{\beta}}\,\sigma\Big], \quad\text{where } \mu := \min_{y\in Y}\{-\xi^T y\} \ \text{and}\ \bar{\mu} := \max_{y\in Y}\{-\xi^T y\}.$
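For illustration, if the feasible set $Y$ were the standard simplex, the extremes of $-\xi^T y$ would be attained at vertices, so the restricted range for $\tau$ could be computed directly from the mean return vector. The sketch below makes that simplex assumption (ours, purely for illustration) and treats $\sigma$ as a given dispersion parameter:

```python
import math

def tau_interval(xi_bar, sigma, beta):
    """Range to which the CVaR auxiliary variable tau can be restricted,
    following the bound of [20, Section 4.2].  Assumes, for illustration
    only, that Y is the standard simplex, so min/max of -xi^T y over Y
    are attained at vertices; sigma is a given dispersion parameter."""
    mu_low = -max(xi_bar)    # min_{y in Y} {-xi^T y} on the simplex
    mu_high = -min(xi_bar)   # max_{y in Y} {-xi^T y} on the simplex
    lo = mu_low + math.sqrt(beta / (1.0 - beta)) * sigma
    hi = mu_high + math.sqrt((1.0 - beta) / beta) * sigma
    return lo, hi
```

In practice this restriction makes the feasible region of $(x, \tau)$ compact, which is what the CSA convergence analysis requires.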

In this experiment, we consider four instances. The first three instances are randomly generated according to the factor model in Goldfarb and Iyengar (see Section 7 of [15]) with different numbers of stocks ($d = 500$, 1000 and 2000), while the last instance consists of the 95 stocks from S&P100 (excluding SBC, ATI, GS, LU and VIA-B) obtained from [37]; the mean $\bar{\xi}$ and covariance $\Sigma$ are estimated from the historical monthly data from 1996 to 2002. The reliability level is $\beta = 0.05$, the number of samples

$\mathrm{Prob}\Big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \le \lambda\,\frac{8QM_\Phi^2}{(N+2)\mu_\Phi}\Big\} \ge \Big(1-\frac{1}{\lambda}\Big)^2,$

$\mathrm{Prob}\Big\{g(\bar{x}_R) \le \lambda\,\frac{16QM_G^2}{N\mu_G}\Big\} \ge \Big(1-\frac{1}{\lambda}\Big)^2.$

$\mathrm{Prob}\big\{\phi(\bar{x}_R,y_R)-\phi(\bar{x}_R,y^*(\bar{x}_R)) \le K_0+\lambda K_1\big\} \ge (1-2\exp\{-\lambda^2/3\})(1-\exp\{-\lambda\}-\exp\{-\lambda^2/3\}),$

$\mathrm{Prob}\Big\{g(\bar{x}_R) \le \frac{16QM_G^2}{N\mu_G} + \lambda\,\frac{2\sigma}{\sqrt{N}}\Big\} \ge (1-2\exp\{-\lambda^2/3\})(1-\exp\{-\lambda^2/3\}),$


to estimate $g(x)$ is $J = 100$, and the number of samples used to evaluate the solution is $n = 50{,}000$. It is worth noting that, by utilizing the linear structure of $\xi^T x$ (where $x \in \mathbb{R}^d$) in the constraint function, in the $k$th iteration we generate $J$ i.i.d. samples of $\tilde{\xi} := \xi^T x_k$ (of dimension 1) to estimate the constraint function, instead of $J$ i.i.d. samples of $\xi$ (of dimension $d$). For the SAA algorithm, the deterministic SAA problem corresponding to (1.3) is defined by

We implemented the SAA approach by using Polyak's subgradient method for solving convex programming problems with function constraints (see [28]). The main reasons why we did not use the linear programming (LP) method for (4.1) are: (1) problem (4.1) might be infeasible for some instances; and (2) we tried the LP method with the CVX toolbox for an instance with 500 stocks, and its CPU time is thousands of times larger than that of the CSA method. In our experiment, we adjust the stepsize strategy by multiplying $\gamma_k$ and $\eta_k$ by scaling parameters $c_g$ and $c_e$, respectively. These parameters are chosen as a result of pilot runs of our algorithm (see [20] for more details). We have found that the "best parameters" in Table 1 slightly outperform the other parameter settings we have considered (Tables 2, 3, 4 and 5).

    The following conclusions can be made from the numerical results. First, as far as the quality of solutions is concerned, the CSA method is at least as good as SAA method and it may outperform SAA for some instances especially as N increases.

(4.1) $\min_{x,\tau}\ -\bar{\xi}^T x \quad \text{s.t.}\quad \tau + \frac{1}{\beta N}\sum_{i=1}^{N}\big[-\xi_i^T x - \tau\big]_+ \le 0, \quad \sum_{i=1}^{d} x_i = 1,\ x \ge 0,$
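For reference, the empirical CVaR constraint in (4.1) can be evaluated directly from the return samples. A minimal sketch (function name and data layout are our own illustrative choices):

```python
def cvar_constraint(x, tau, xi, beta):
    """Left-hand side of the empirical CVaR constraint in (4.1):
        tau + 1/(beta*N) * sum_{i=1}^{N} [ -xi_i^T x - tau ]_+ .
    xi is a list of N return-sample vectors; names are illustrative."""
    losses = [-sum(a * b for a, b in zip(xi_i, x)) for xi_i in xi]
    penalty = sum(max(loss - tau, 0.0) for loss in losses)
    return tau + penalty / (beta * len(xi))
```

A feasible pair $(x, \tau)$ is one for which this value is nonpositive.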

Table 1 The stepsize factors

Number of stocks   best cg   best ce
500                0.5       0.005
1000               0.5       0.05
2000               0.5       0.05

Table 2 Random sample with 500 assets

            N = 500     N = 1000    N = 2000    N = 5000
CSA  Obj.   −4.883      −4.870      −4.953      −4.984
     Cons.  5.330       4.096       5.167       2.859
     CPU    1.671e-01   3.383e-01   6.271e-01   1.470e+00
SAA  Obj.   −4.978      −4.981      −4.977      −4.977
     Cons.  4.372       3.071       2.330       2.249
     CPU    2.031e+00   9.926e+00   4.132e+01   2.591e+02

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method


Second, the CSA method significantly reduces the processing time compared with the SAA method for all the instances.

Table 3 Random sample with 1000 assets

            N = 500     N = 1000    N = 2000    N = 5000
CSA  Obj.   −4.532      −4.704      −4.838      −4.949
     Cons.  27.660      24.901      23.825      20.785
     CPU    4.193e-01   8.578e-01   1.659e+00   4.001e+00
SAA  Obj.   −4.965      −4.981      −4.981      −4.977
     Cons.  60.421      47.745      33.940      20.357
     CPU    1.513e+01   5.954e+01   2.774e+02   1.524e+03

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method

Table 4 Random sample with 2000 assets

            N = 500     N = 1000    N = 2000    N = 5000
CSA  Obj.   −4.299      −4.077      −4.355      −4.859
     Cons.  144.92      112.54      89.74       82.65
     CPU    1.374e+00   2.810e+00   5.538e+00   2.716e+01
SAA  Obj.   −4.752      −4.699      −4.721      −4.727
     Cons.  279.43      218.96      147.93      94.46
     CPU    1.968e+01   6.571e+01   2.940e+02   3.697e+03

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method

Table 5 Comparing the CSA and SAA for the CVaR model

            N = 500     N = 1000    N = 2000    N = 5000    N = 10000
CSA  Obj.   −3.531      −3.537      −3.542      −3.548      −3.560
     Cons.  3.382e+00   2.188e-01   1.106e-01   2.724e-01   −7.102e-01
     CPU    8.315e-02   1.422e-01   2.778e-01   7.251e-01   1.415e+00
SAA  Obj.   −3.530      −3.541      −3.541      −3.544      −3.559
     Cons.  3.385e+00   7.163e-01   6.989e-01   6.988e-01   7.061e-01
     CPU    3.155e+00   1.221e+01   4.834e+01   3.799e+02   1.462e+03

N: the sample size (the number of steps in SA, and the size of the sample used for the SAA approximation); Obj.: the objective function value of our solution, i.e., the loss of the portfolio; Cons.: the constraint function value of our solution; CPU: the processing time in seconds for each method


    4.2 Classification and metric learning problem

In this subsection, our goal is to examine the efficiency of the CSPA algorithm applied to a classification problem with the metric as parameter. In this experiment, we use the expectation of the hinge loss function, described in [33], as the objective function, and formulate the constraint with the loss function of the metric learning problem in [8]; see the formal definition in (1.6)–(1.7). For each $i, j$, we are given samples $u_i, u_j \in \mathbb{R}^d$ and a measure $b_{ij} \ge 0$ of the similarity between the samples $u_i$ and $u_j$ ($b_{ij} = 0$ means $u_i$ and $u_j$ are the same). The goal is to learn a metric $A$ such that $\langle (u_i - u_j), A(u_i - u_j)\rangle \approx b_{ij}$, and to do classification among all the samples $u$ projected by the learned metric $A$.

For solving this class of problems in machine learning, one widely accepted approach is to learn the metric in the first step and then solve the classification problem with the obtained optimal metric. However, this approach is not applicable to the online setting, since once the dataset is updated with new samples, it has to go through all the samples to update the metric and the classifier. On the other hand, the CSPA algorithm optimizes the metric $A$ and the classifier simultaneously, and only needs to take one new sample in each iteration.

In this experiment, our goal is to test the solution quality of the CSPA algorithm with respect to the number of iterations. More specifically, we consider 2 instances of this problem with different dimensions ($d = 100$ and 200, respectively). Since we are dealing with the online setting, our sample size for training the metric and the classifier increases with the number of iterations. The sizes of the sample used to estimate the parameters and the one used to evaluate the quality of the solution (the testing sample) are set to 100 and 10,000, respectively. Within each trial, we test the objective and constraint values of the output solution over the training sample and the testing sample, respectively. Since $R$ is randomly picked from the whole series $\{\bar{x}_k, y_k\}$, we generate 5 candidates for $R$, instead of one, in order to increase the probability of getting a better solution. Intuitively, the later solutions in the series should be better than the earlier ones; hence, we also put the last pair of solutions $(\bar{x}_N, y_N)$ into the candidate list. In each trial, we compare these 6 candidate solutions: first, we choose the three pairs with the smallest constraint function values; then, we choose the one with the smallest objective function value from these three selected solutions as our output solution.
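The selection rule among the 6 candidate solutions can be sketched as follows (names are illustrative; `obj` and `cons` stand for the evaluated objective and constraint values over the testing sample):

```python
def pick_solution(candidates, obj, cons):
    """Selection rule used in the experiment: among the candidate pairs
    (five sampled R's plus the last iterate), keep the three with the
    smallest constraint values, then return the one with the smallest
    objective value.  obj and cons evaluate a single candidate."""
    best_by_cons = sorted(candidates, key=cons)[:3]
    return min(best_by_cons, key=obj)
```

This two-stage filter favors near-feasible candidates first, and only then trades off the objective among them.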

Tables 6 and 7 show that the CSPA method decreases the objective value and the constraint value as the sample size (number of iterations $N$) increases. These experiments demonstrate that we can improve both the metric and the classifier simultaneously by using the CSPA method as more and more data are collected.

    5 Conclusions

In this paper, we present a new stochastic approximation type method, the CSA method, for solving stochastic convex optimization problems with function or expectation constraints. Moreover, we show that a variant of the CSA method, the CSPA method, is applicable to a class of p