arXiv:2002.04766v1 [cs.LG] 12 Feb 2020

Distribution-Agnostic Model-Agnostic Meta-Learning

Liam Collins∗, Aryan Mokhtari∗, Sanjay Shakkottai∗

February 13, 2020

Abstract

The Model-Agnostic Meta-Learning (MAML) algorithm [Finn et al., 2017] has been celebrated for its efficiency and generality, as it has demonstrated success in quickly learning the parameters of an arbitrary learning model. However, MAML implicitly assumes that the tasks come from a particular distribution, and optimizes the expected (or sample average) loss over tasks drawn from this distribution. Here, we amend this limitation of MAML by reformulating the objective function as a min-max problem, where the maximization is over the set of possible distributions over tasks. Our proposed algorithm is the first distribution-agnostic and model-agnostic meta-learning method, and we show that it converges to an $\epsilon$-accurate point at the rate of $O(1/\epsilon^2)$ in the convex setting and to an $(\epsilon, \delta)$-stationary point at the rate of $O(\max\{1/\epsilon^5, 1/\delta^5\})$ in nonconvex settings. We also provide numerical experiments that demonstrate the worst-case superiority of our algorithm in comparison to MAML.

1 Introduction

Broadly speaking, the extent to which a model learns from data is dependent upon the model's properties, including its architecture, initial setting, and update rule. Meta-learning has gained popularity as a paradigm to learn optimal model properties and thereby improve model performance with more experience, i.e., to learn how to learn [Thrun and Pratt, 2012, Bengio et al., 1990, Hochreiter et al., 2001]. One especially popular meta-learning problem proposed by Finn et al. [2017] is to learn an initialization for a gradient-based update, such that upon seeing a new task, the model performs well on that task after performing one step of gradient descent from the global initialization. This problem is also known as gradient-based Model-Agnostic Meta-Learning (MAML), and is formally defined as

$$\min_{w \in W} \; \mathbb{E}_{i \sim P}\left[F_i(w)\right] := \mathbb{E}_{i \sim P}\left[f_i(w - \alpha \nabla f_i(w))\right], \tag{1}$$

where $W \subset \mathbb{R}^d$ is the feasible set, $P$ is the probability distribution over tasks indexed by $i$, and $f_i$ is the stochastic loss function associated with the $i$-th task. In particular, each $f_i : W \to \mathbb{R}$ is such that $f_i(w) := \mathbb{E}_{\theta \sim Q_i}[\hat{f}_i(w, \theta)]$, where $Q_i$ is the distribution over the data from the sample space $\Omega_i$ associated with the $i$-th task, $\theta$ is a sample drawn from $Q_i$, and $\hat{f}_i(w, \theta)$ is the approximation of $f_i(w)$ at the data point $\theta$. Note that since the probability distribution $P$ is often unknown and only $m$ realizations from $P$ are accessible, we settle for solving the sample average approximation problem, i.e.,

$$\min_{w \in W} \; \frac{1}{m} \sum_{i=1}^{m} F_i(w) := \frac{1}{m} \sum_{i=1}^{m} f_i(w - \alpha \nabla f_i(w)). \tag{2}$$
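As a minimal sketch of the sample-average objective (2), the following evaluates $F_i(w) = f_i(w - \alpha \nabla f_i(w))$ for a set of least-squares tasks. The quadratic task losses, the random task data, and the names `maml_task_loss` and `maml_objective` are assumptions made purely for illustration; they are not part of the original formulation.

```python
import numpy as np

def maml_task_loss(w, A, b, alpha):
    """F_i(w) = f_i(w - alpha * grad f_i(w)) for the assumed loss f_i(w) = 0.5 * ||A w - b||^2."""
    grad = A.T @ (A @ w - b)          # exact gradient of f_i at w
    w_adapted = w - alpha * grad      # one inner gradient step
    r = A @ w_adapted - b
    return 0.5 * float(r @ r)

def maml_objective(w, tasks, alpha):
    """Sample-average objective (2): the mean of F_i(w) over the m tasks."""
    return float(np.mean([maml_task_loss(w, A, b, alpha) for A, b in tasks]))

# Hypothetical data: m = 4 random least-squares tasks in dimension 3.
rng = np.random.default_rng(0)
tasks = [(rng.normal(size=(5, 3)), rng.normal(size=5)) for _ in range(4)]
w0 = np.zeros(3)

# Average loss before adaptation versus after one inner gradient step.
plain = float(np.mean([0.5 * float((A @ w0 - b) @ (A @ w0 - b)) for A, b in tasks]))
adapted = maml_objective(w0, tasks, alpha=0.01)
```

With a sufficiently small inner step size, the adapted average loss should not exceed the unadapted one for these smooth quadratic tasks, which is the behavior the MAML initialization is trained to exploit.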

However, the formulation (2) yields meta-learning algorithms that are flawed in three key areas:

∗Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA.

Email: [email protected], [email protected], [email protected].


1. Worst case performance. Since the objective function of (2) is the average loss over all the drawn tasks, models trained to optimize the objective prioritize performance on tasks that appear the most often from $P$ while devaluing performance on infrequent tasks, leading to potentially arbitrarily poor performance in the worst case. Yet in many situations, arbitrarily poor worst-case performance is unacceptable.

2. Fairness. Following the same line of reasoning, algorithms that solve (2) are biased towards tasks that appear more frequently from $P$ than others. This may result in high variance in model performance across tasks, which is unfair to those tasks at the low end of the spectrum.

3. Reliance on Task Similarity. State-of-the-art algorithms to solve (2) are forms of stochastic gradient descent that compute their next iterates as functions of the stochastic gradients of only a relatively small number of tasks sampled from $P$ on each iteration. Therefore, if the gradients of commonly occurring tasks are far from each other, these methods may not converge. See for example the proof of convergence of the MAML algorithm in Fallah et al. [2019], which relies on the assumption of bounded variance of the gradients of the tasks drawn from $P$ (Assumption 4.5).

All three of these deficiencies caused by the formulation (2) originate from the underlying assumption that the tasks come from a particular distribution $P$, signaling the need for a meta-learning problem formulation that does not make this assumption, just as the MAML formulation has removed the assumption that the objective is to learn a particular type of model. This raises the question:

Can we develop meta-learning methods that are not only model-agnostic, but also distribution-agnostic?

Contributions. In this paper, we answer the above question in the affirmative by developing a novel min-max meta-learning formulation (Section 3) and introducing a method (Section 4) which efficiently solves it in three distinct cases (Section 5). In particular, a list of our detailed contributions follows.

• We propose a min-max variant of the MAML formulation in (2) which corrects the three deficiencies of the original MAML problem discussed above by making no assumptions about the tasks being generated from some distribution, yielding algorithms that perform well in the worst case and fairly treat tasks that may be arbitrarily dissimilar from each other.

• In the convex setting, we show that our proposed methods converge to a point that is $\epsilon$-close to an optimal solution of the proposed min-max MAML problem after $O(1/\epsilon^2)$ stochastic function evaluations, matching the optimal rate studied in Nemirovski et al. [2009].

• We also consider two nonconvex settings: one where the problem is unconstrained, i.e., $W = \mathbb{R}^d$, and another where the problem is constrained over a convex and compact set. In the unconstrained case, we show that our method requires at most $O(\max\{1/\epsilon^{2/\beta}, 1/\delta^{1/\min\{2\beta, 1-2\beta\}}\})$ iterations to find an $(\epsilon, \delta)$-stationary point for any $\beta \in (0, 1/2)$, where the norm of the gradient with respect to $w$ of our objective is at most $\epsilon$. In the constrained case, we show that the complexity to achieve the same bound on the norm of the projected gradient is of $O(\max\{1/\epsilon^{(2+2\beta)/\beta}, 1/\delta^{(1+\beta)/\min\{\beta, 1-\beta\}}\})$ for any $\beta \in (0, 1)$. For both cases, an appropriate setting of $\beta$ yields an $O(\max\{1/\epsilon^5, 1/\delta^5\})$ convergence rate.

• Additionally, we provide experimental results that demonstrate the improved worst-case performance across tasks compared to state-of-the-art methods.

2 Related Work

Recently many works have studied meta-learning techniques in various contexts, including few-shot learning [Vinyals et al., 2016, Ravi and Larochelle, 2016, Snell et al., 2017], reinforcement learning [Duan et al., 2016, Wang et al., 2016] and online learning [Finn et al., 2019, Khodak et al., 2019b,a]. Of interest in this paper are gradient-based meta-learning methods [Andrychowicz et al., 2016, Finn et al., 2017, Zhou et al., 2019], which have an inner-outer loop structure that executes a learning algorithm in the inner loop and uses the results to update the algorithm parameters in the outer loop with some function of the gradient of the meta-learning objective.

Our work contributes a worst-case optimal version of the seminal gradient-based meta-learning technique, MAML [Finn et al., 2017]. MAML has achieved success in regression, few-shot image classification, and reinforcement learning [Finn et al., 2017] and has inspired numerous related algorithms studied experimentally [Al-Shedivat et al., 2017, Li et al., 2017, Nichol and Schulman, 2018, Antoniou et al., 2018]. The convergence properties of MAML and two of its first-order variants were analyzed in Fallah et al. [2019], and other works have established regret bounds in terms of expected loss over tasks for the online version of MAML [Finn et al., 2019, Zhuang et al., 2019, Khodak et al., 2019b]. However, to the best of our knowledge, no other work has proposed a version of MAML that optimizes the worst-case performance over tasks.

Robustness in meta-learning has been studied recently by Zügner and Günnemann [2019] and Yin et al. [2018]. However, the notion of robustness in both of these papers is with respect to perturbations in the samples, not with respect to the distribution of tasks; both works attempt to minimize some expected loss over a particular distribution of tasks. Meanwhile, meta-learning is closely related to hyper-parameter optimization (HPO) and bilevel programming (BLP), and both meta-learning and HPO have jointly been formulated in terms of BLP [Franceschi et al., 2018]. Still, we are unaware of any analogous worst-task performance analysis in either the HPO or BLP literature.

There also exist many works outside of meta-learning that have considered min-max optimization problems of the stochastic finite-sum form we will discuss. In the context of distributionally robust optimization, Shalev-Shwartz and Wexler [2016] and Sinha et al. [2017] argued that minimizing the maximal loss of a model over a set of possible distributions can provide better generalization performance than minimizing the average loss. When each function being summed is convex, Nemirovski et al. [2009] showed that the stochastic mirror descent-ascent algorithm achieves the asymptotically optimal $O(\epsilon^{-2})$ rate of convergence to an $\epsilon$-accurate solution. The literature has treated the case when the outer minimization problem is nonconvex far less thoroughly. Rafique et al. [2018] proposed a stochastic inexact proximal point method that attains $O(\epsilon^{-6})$ convergence in terms of the outer minimization problem when that problem is nonsmooth and weakly convex, while Qian et al. [2019] showed $O(\epsilon^{-4})$ convergence for a smooth problem. Also, Chen et al. [2017] and Jin et al. [2019] analyzed first-order methods that improve on this convergence rate but rely on an oracle to solve the inner maximization. In contrast, neither the smoothness nor oracle assumptions apply to our setting.

3 Problem Formulation

In this section, we reformulate the MAML objective in (1) to yield a problem that is both model-agnostic and distribution-agnostic and that amends the deficiencies of MAML discussed in the introduction. In particular, our problem formulation tries to find the initial point that performs best after one step of gradient descent for the worst-case task. We later show that to solve this problem using iterative methods we never need to assume any conditions about the similarity of the tasks; in general they can be chosen arbitrarily.

Consider the following min-max problem

$$\min_{w \in W} \; \max_{i \in [m]} \; f_i(w - \alpha \nabla f_i(w)), \tag{3}$$

which finds the initial point that minimizes the objective function after one step of gradient descent in the worst case over all possible loss functions (i.e., over the losses corresponding to all possible tasks). We can think of (3) as a distribution-agnostic formulation of MAML.

The min-max problem in (3) is equivalent to the problem of finding the $w^*$ that minimizes the worst-case performance over all possible distributions of tasks, since the worst-case distributions will occur at the extreme points of the probability simplex. We write this relaxed problem as

$$\min_{w \in W} \; \max_{p \in \Delta} \; \sum_{i=1}^{m} p_i f_i(w - \alpha \nabla f_i(w)), \tag{4}$$

where $p_i$ is the probability associated with task $i$, the vector $p \in \mathbb{R}^m$ is the concatenation of the probabilities $p_1, \dots, p_m$, and $\Delta$ is the standard simplex in $\mathbb{R}^m$, i.e., $\Delta = \{p \in \mathbb{R}^m_+ \mid \sum_{i=1}^{m} p_i = 1\}$.
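The equivalence between (3) and (4) rests on the objective being linear in $p$, so its maximum over the simplex is attained at a vertex. The short numerical check below is only a sketch: the loss values are made up and stand in for $f_i(w - \alpha \nabla f_i(w))$ at some fixed $w$.

```python
import numpy as np

# Hypothetical values of the adapted task losses at a fixed w (m = 3 tasks).
losses = np.array([0.3, 1.7, 0.9])

# The objective in (4) is linear in p, so a maximizer is the vertex that puts
# all probability mass on the task with the largest loss.
best_vertex = np.eye(len(losses))[np.argmax(losses)]
value_at_vertex = float(best_vertex @ losses)

# Any other distribution over tasks yields a value no larger than the worst-case loss.
rng = np.random.default_rng(0)
random_p = rng.dirichlet(np.ones(len(losses)), size=1000)
max_over_samples = float((random_p @ losses).max())
```

Here `value_at_vertex` coincides with the largest entry of `losses`, and no sampled distribution in `random_p` exceeds it.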


By optimizing worst-case performance, the formulation in (4) also encourages a fair solution $w^*$. Instead of being allowed to disregard performance on some tasks, any algorithm that solves (4) must try to perform reasonably well on all of them. Indeed, as Duchi et al. [2016] observed, the min-max formulation implicitly regularizes the variance of the losses $f_i(w - \alpha \nabla f_i(w))$, meaning that the training procedure tries to produce a model that performs similarly on all tasks. Such a model aligns with the objective of "good intent fairness" proposed in Mohri et al. [2019].

In addition to replacing the expectation over a particular distribution with a maximization over all possible distributions of tasks, we also reorder the expectations implicit in the MAML objective (1). Note that $\nabla f_i(w)$ is an expectation over many data points and is therefore intractable to compute exactly. Instead, at test time we must approximate it using a finite number of samples, perhaps as few as 1 or 5 for few-shot learning. Thus, it makes sense to optimize the expectation, over the drawn samples, of the function value after taking one step of stochastic gradient descent, as opposed to optimizing the function value after the true gradient step. In other words, we move the expectation over the finite-sample gradient approximation outside of the function evaluation.

Combining both our modifications, the problem that we aim to solve is as follows:

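$$\min_{w \in W} \; \max_{p \in \Delta} \; \phi(w, p) := \sum_{i=1}^{m} p_i \, \mathbb{E}_{\theta \sim Q_i}\left[f_i(w - \alpha \nabla \hat{f}_i(w, \theta))\right] \tag{5}$$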

Note that this objective corresponds to the case that in the test phase we only run one step of stochastic gradient descent, and the expectation is taken over $\theta$. Indeed, one can also consider the case where we run multiple steps of stochastic gradient descent, but to simplify the expressions for the proposed method and convergence analysis we only focus on the single-step case, without loss of generality.

4 Algorithm

Our proposed algorithm to solve (5) is a version of stochastic projected gradient descent-ascent, and is inspired by the Euclidean version of the Saddle Point Mirror SA algorithm proposed by Nemirovski et al. [2009]. Before stating our method, let us first mention that the gradients of the function $\phi(w, p)$ defined in (5) with respect to $w$ and $p$, which we denote by $g_w(w, p)$ and $g_p(w, p)$, respectively, are given by

$$g_w(w, p) = \sum_{i=1}^{m} p_i \, \mathbb{E}_{\theta \sim Q_i}\left[\left(I - \alpha \nabla^2 \hat{f}_i(w, \theta)\right) \nabla f_i(w - \alpha \nabla \hat{f}_i(w, \theta))\right] \tag{6}$$

$$g_p(w, p) = \left[\mathbb{E}_{\theta \sim Q_i}\left[f_i(w - \alpha \nabla \hat{f}_i(w, \theta))\right]\right]_{1 \le i \le m} \tag{7}$$
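As a brief sketch of where (6) and (7) come from, write $\hat{f}_i(w, \theta)$ for the sample loss, assume it is twice differentiable in $w$, and assume that differentiation and expectation can be interchanged. The helper map $u_\theta(w) := w - \alpha \nabla \hat{f}_i(w, \theta)$ is introduced only for this derivation. For a fixed task $i$ and sample $\theta$, the chain rule gives:

```latex
\begin{aligned}
\frac{\partial u_\theta}{\partial w}(w)
  &= I - \alpha \nabla^2 \hat{f}_i(w, \theta), \\
\nabla_w \big[ f_i(u_\theta(w)) \big]
  &= \big(I - \alpha \nabla^2 \hat{f}_i(w, \theta)\big)^{\top} \nabla f_i(u_\theta(w))
   = \big(I - \alpha \nabla^2 \hat{f}_i(w, \theta)\big)\,
     \nabla f_i\big(w - \alpha \nabla \hat{f}_i(w, \theta)\big), \\
\frac{\partial \phi}{\partial p_i}(w, p)
  &= \mathbb{E}_{\theta \sim Q_i}\big[ f_i\big(w - \alpha \nabla \hat{f}_i(w, \theta)\big) \big].
\end{aligned}
```

The transpose disappears because the Hessian is symmetric. Taking the expectation over $\theta$ and summing over tasks with weights $p_i$ recovers (6), and stacking the partial derivatives with respect to $p_1, \dots, p_m$ gives (7).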

The notation $[a_i]_{1 \le i \le m}$ corresponds to a vector of size $m$ with elements $a_1, \dots, a_m$. Note that $g_w(w, p)$ is the weighted average of the gradients of each of the task losses weighted by $p$, whereas each component of $g_p(w, p)$ is the loss for a distinct task. Each of the above expressions involves two expectations: one over the samples $\theta$, as is written, and another in the definitions of $\nabla f_i$ and $f_i$, as these functions are the expected values of their sample evaluations over the data corresponding to the $i$-th task. Since the number of data points corresponding to a task may be large or even infinite, exactly computing the expectations may be intractable. Likewise, we cannot hope to repeatedly estimate the gradients corresponding to every task because the number of tasks $m$ may be large. Thus, we must estimate $g_w$ and $g_p$ by sampling both tasks and data.

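To do so, we first sample a subset $\mathcal{C}$ of $C$ tasks uniformly and independently from the set of all tasks. For each sampled task $T_i \in \mathcal{C}$, we independently sample a batch $\mathcal{D}_i$ of $D$ pairs $\{(\theta_{ij}^{\mathrm{in}}, \theta_{ij}^{\mathrm{out}})\}_{j=1,\dots,D}$ from the data distribution $Q_i \times Q_i$ corresponding to $T_i$. To estimate $g_w$, for each pair $(\theta_{ij}^{\mathrm{in}}, \theta_{ij}^{\mathrm{out}}) \in \mathcal{D}_i$, we compute a stochastic estimate of the gradient of the objective $\phi(w, p)$ with respect to $w$ at the $i$-th task, using $\theta_{ij}^{\mathrm{in}}$ to estimate the Hessian and the inner gradient, and $\theta_{ij}^{\mathrm{out}}$ to estimate the outer gradient. We finally average these values across the $C$ sampled tasks and $D$ sampled data pairs and multiply by $m$ to obtain our unbiased stochastic gradient $\hat{g}_w(w, p)$, i.e.,

$$\hat{g}_w(w, p) = \frac{m}{C} \sum_{i : T_i \in \mathcal{C}} \frac{1}{D} \sum_{j=1}^{D} \left[ p_i \left(I - \alpha \nabla^2 \hat{f}_i(w, \theta_{ij}^{\mathrm{in}})\right) \nabla \hat{f}_i\left(w - \alpha \nabla \hat{f}_i(w, \theta_{ij}^{\mathrm{in}}), \theta_{ij}^{\mathrm{out}}\right) \right] \tag{8}$$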

Algorithm 1 Distribution-Agnostic MAML (DA-MAML)
Input: $m$ functions $f_i$, inner step size $\alpha$, step sizes $\{\eta_w^t\}_t$, $\{\eta_p^t\}_t$, convex set $W$, number of iterations $T$
Initialize $p^1 = [1/m]_{1 \le i \le m}$ and $w^1 \in W$ arbitrarily.
for $t = 1$ to $T - 1$ do
    Sample a batch $\mathcal{C}$ of $C$ tasks independently from the uniform distribution over tasks.
    for $T_i \in \mathcal{C}$ do
        Sample a batch $\mathcal{D}_i$ of $D$ pairs of samples $\{(\theta_{ij}^{\mathrm{in}}, \theta_{ij}^{\mathrm{out}})\}_{j=1,\dots,D}$ independently from $Q_i \times Q_i$.
    end for
    Compute $\hat{g}_w^t$ and $\hat{g}_p^t$ using (8) and (9), respectively.
    Update $w^{t+1}$ and $p^{t+1}$ as in (10) and (11), respectively.
end for
Output: See Cases T1 and T2.

The averaging is performed outside of the function evaluations, and the same inner sample $\theta_{ij}^{\mathrm{in}}$ is used to estimate both the Hessian and the inner gradient, to ensure that $\hat{g}_w(w, p)$ is an unbiased estimate of $g_w(w, p)$.

To estimate $g_p$ we follow a similar procedure: for each $(\theta_{ij}^{\mathrm{in}}, \theta_{ij}^{\mathrm{out}}) \in \mathcal{D}_i$, we compute a stochastic estimate of the gradient of the objective $\phi(w, p)$ with respect to $p$ at the $i$-th task, using $\theta_{ij}^{\mathrm{in}}$ to estimate the gradient of $f_i$ and $\theta_{ij}^{\mathrm{out}}$ to estimate the function $f_i$. We then average these values across the data sampled for each task, and set the $i$-th element of $\hat{g}_p(w, p)$ to be the average corresponding to the $i$-th task, as written in Equation (9). Note that the computation of each $\nabla \hat{f}_i(w, \theta_{ij}^{\mathrm{in}})$ can be reused in computing $\hat{g}_p(w, p)$ after evaluating it to compute $\hat{g}_w(w, p)$, thus the total number of stochastic function, gradient and Hessian evaluations required to estimate the gradients per iteration is $4CD$.

$$\hat{g}_p(w, p) = \left[ \frac{m}{CD} \sum_{T_i \in \mathcal{C}} \sum_{j=1}^{D} \hat{f}_i\left(w - \alpha \nabla \hat{f}_i(w, \theta_{ij}^{\mathrm{in}}), \theta_{ij}^{\mathrm{out}}\right) \right]_{1 \le i \le m} \tag{9}$$

Now that we have a procedure to compute the stochastic gradients, we are ready to discuss our algorithm to solve (5). The algorithm initializes $p^1 = [1/m]_{1 \le i \le m}$ and $w^1 \in W$, then iteratively executes alternating projected stochastic gradient descent-ascent steps. From now on, we denote the stochastic gradients in (8) and (9) on the $t$-th iteration as

$$\hat{g}_w^t := \hat{g}_w(w^t, p^t), \quad \text{and} \quad \hat{g}_p^t := \hat{g}_p(w^t, p^t).$$

From iterations $t = 1$ to $T - 1$, the algorithm updates $w^{t+1}$ and $p^{t+1}$ on the $t$-th iteration as

$$w^{t+1} = \Pi_W\left(w^t - \eta_w^t \hat{g}_w(w^t, p^t)\right) \tag{10}$$

$$p^{t+1} = \Pi_\Delta\left(p^t + \eta_p^t \hat{g}_p(w^t, p^t)\right) \tag{11}$$

where $\eta_w^t, \eta_p^t \in \mathbb{R}$ are step sizes, $\Pi_W(u) = \arg\min_{w \in W} \|u - w\|_2$, and $\Pi_\Delta(q) = \arg\min_{p \in \Delta} \|p - q\|_2$. These projections are convex programs and can be solved efficiently using standard convex minimization techniques. After $T$ iterations, the algorithm terminates in one of two ways, depending on the convexity of the minimization problem:

• Case T1 (Convex Case): If each $F_i$ is convex, the algorithm outputs

$$w_T^C := \frac{1}{T} \sum_{t=1}^{T} w^t \quad \text{and} \quad p_T^C := \frac{1}{T} \sum_{t=1}^{T} p^t \tag{12}$$

• Case T2 (Nonconvex Case): If any function $F_i$ is nonconvex, the algorithm samples $\tau$ uniformly at random from $\{1, \dots, T\}$ and outputs $w_T^\tau := w^\tau$ and $p_T^\tau := p^\tau$.
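The projections $\Pi_\Delta$ and $\Pi_W$ used in the updates (10) and (11) can be computed in closed form in simple cases. The sketch below uses the well-known sort-based routine for the simplex and, as an assumed example of a feasible set, a Euclidean ball for $W$. The function names are illustrative, and the choice of a ball is not prescribed by the text.

```python
import numpy as np

def project_simplex(q):
    """Euclidean projection of q onto {p : p >= 0, sum(p) = 1} (sort-based method)."""
    q = np.asarray(q, dtype=float)
    u = np.sort(q)[::-1]                       # entries in decreasing order
    css = np.cumsum(u) - 1.0                   # cumulative sums shifted by the target mass
    k = np.arange(1, q.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]   # last index whose entry stays positive
    tau = css[rho] / (rho + 1)                 # uniform shift applied to every entry
    return np.maximum(q - tau, 0.0)

def project_ball(u, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}, one simple choice of W."""
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

p = project_simplex([0.5, 0.9, -0.2])
w = project_ball(np.array([3.0, 4.0]), radius=1.0)
```

For the inputs above, the negative coordinate is clipped to zero and the remaining mass is shifted so that the entries of `p` sum to one, while `w` is rescaled onto the unit sphere.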

The full procedure is outlined in Algorithm 1.
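To make the loop concrete, here is a minimal NumPy sketch of the descent-ascent iteration with the Case T1 averaged output. It is an illustration under stated assumptions rather than the authors' implementation: the tasks are hypothetical noisy linear regressions (`targets` holds one true weight vector per task), $W$ is taken to be a Euclidean ball, and the step sizes are constants chosen by hand.

```python
import numpy as np

def project_simplex(q):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(q)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, q.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(q - css[rho] / (rho + 1), 0.0)

def project_ball(u, radius):
    """Euclidean projection onto W = {w : ||w||_2 <= radius} (an assumed choice of W)."""
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

# Sample loss for a data point theta = (x, y): 0.5 * (x.w - y)^2, with its derivatives.
def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

def grad(w, x, y):
    return (x @ w - y) * x

def hess(w, x, y):
    return np.outer(x, x)

def da_maml(targets, T=300, C=2, D=4, alpha=0.05, eta_w=0.01, eta_p=0.01,
            radius=10.0, noise=0.1, seed=0):
    """Run T iterations and return the averaged iterates of Case T1."""
    rng = np.random.default_rng(seed)
    m, d = targets.shape

    def draw(i):
        # One sample theta ~ Q_i: Gaussian input, noisy linear response for task i.
        x = rng.normal(size=d)
        return x, x @ targets[i] + noise * rng.normal()

    w, p = np.zeros(d), np.full(m, 1.0 / m)
    w_sum, p_sum = np.zeros(d), np.zeros(m)
    for _ in range(T):
        w_sum += w
        p_sum += p
        g_w, g_p = np.zeros(d), np.zeros(m)
        for i in rng.integers(m, size=C):          # C tasks, uniform and independent
            for _ in range(D):                     # D (in, out) sample pairs per task
                (x_in, y_in), (x_out, y_out) = draw(i), draw(i)
                w_ij = w - alpha * grad(w, x_in, y_in)            # inner gradient step
                lam = np.eye(d) - alpha * hess(w, x_in, y_in)     # I - alpha * Hessian
                scale = m / (C * D)
                g_w += scale * p[i] * (lam @ grad(w_ij, x_out, y_out))   # estimator (8)
                g_p[i] += scale * loss(w_ij, x_out, y_out)               # estimator (9)
        w = project_ball(w - eta_w * g_w, radius)  # descent step (10)
        p = project_simplex(p + eta_p * g_p)       # ascent step (11)
    return w_sum / T, p_sum / T

# Hypothetical problem: three tasks in two dimensions, one far from the others.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
w_avg, p_avg = da_maml(targets)
```

The returned `p_avg` is an average of points on the simplex and therefore remains a valid distribution over tasks, and `w_avg` stays inside the chosen ball. For the nonconvex Case T2 one would instead return a uniformly sampled iterate.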

Implementation. In order to implement Algorithm 1, we execute a routine with the inner-outer loop structure of the MAML algorithm proposed in Finn et al. [2017] and other gradient-based meta-learning methods. In the inner loop, the algorithm executes individual stochastic gradient descent using each sample in $\mathcal{D}_i$ for each of the set of sampled tasks, computing the stochastic gradient update

$$w_{ij}^t \leftarrow w^t - \alpha \nabla \hat{f}_i(w^t, \theta_{ij}^{\mathrm{in}}) \tag{13}$$

for the $j$-th sample for the $i$-th task. In the outer loop, i.e., the meta-learning phase, the algorithm updates $w^{t+1}$ by executing projected stochastic gradient descent on the loss of the model corresponding to the sampled tasks after taking the local stochastic gradient descent steps, where the losses with respect to the tasks are weighted by $p^t$. Updating $p^{t+1}$ during the meta-learning phase also amounts to taking the stochastic gradient of the local losses, this time with respect to $p^t$, and executing projected stochastic gradient ascent. In order to facilitate the meta-learning phase, the algorithm computes the second-order term $\Lambda_{ij} \leftarrow I - \alpha \nabla^2 \hat{f}_i(w^t, \theta_{ij}^{\mathrm{in}})$ for each sample $\theta_{ij}^{\mathrm{in}}$ and each task $T_i$ passed over in the inner loop. Note that the total number of stochastic function, gradient and Hessian evaluations required per iteration is still $4CD$. The full procedure is outlined in Algorithm 2.

Algorithm 2 DA-MAML: Implementation
Input: $m$ functions $f_i$, number of iterations $T$, inner step size $\alpha$, method step sizes $\{\eta_w^t\}_t$, $\{\eta_p^t\}_t$, convex set $W$.
Initialize $p^1 = [1/m]_{1 \le i \le m}$ and $w^1 \in W$ arbitrarily.
for $t = 1$ to $T - 1$ do
    Sample a batch $\mathcal{C}$ of $C$ tasks independently from the uniform distribution over tasks.
    for $T_i \in \mathcal{C}$ do
        Sample a batch $\mathcal{D}_i$ of $D$ pairs of samples $\{(\theta_{ij}^{\mathrm{in}}, \theta_{ij}^{\mathrm{out}})\}_{j=1,\dots,D}$ independently from $Q_i \times Q_i$.
        for $j = 1$ to $D$ do
            Set $w_{ij}^t \leftarrow w^t - \alpha \nabla \hat{f}_i(w^t, \theta_{ij}^{\mathrm{in}})$
        end for
    end for
    Denoting $\Lambda_{ij} := I - \alpha \nabla^2 \hat{f}_i(w^t, \theta_{ij}^{\mathrm{in}})$, update
        $w^{t+1} \leftarrow \Pi_W\left(w^t - \frac{\eta_w^t m}{CD} \sum_{i : T_i \in \mathcal{C}} \sum_{j=1}^{D} p_i^t \Lambda_{ij} \nabla \hat{f}_i(w_{ij}^t, \theta_{ij}^{\mathrm{out}})\right)$
        $p^{t+1} \leftarrow \Pi_\Delta\left(\left[p_i^t + \frac{\eta_p^t m}{CD} \sum_{j=1}^{D} \hat{f}_i(w_{ij}^t, \theta_{ij}^{\mathrm{out}})\right]_{1 \le i \le m}\right)$
end for
Output: See Cases T1 and T2.

4.1 Comparison to standard MAML

MAML attempts to solve (1) by executing one step of SGD for each of the tasks sampled in the inner loop, then updating the initial weights $w$ using SGD on the net loss, or test error, for all of the inner loop SGD steps [Finn et al., 2017]. Algorithm 1 shares the structure of MAML, but instead of computing the meta-stochastic gradient as the average of the stochastic gradients corresponding to the sampled tasks, Algorithm 1 weighs more heavily the stochastic gradients of tasks with higher loss.

Algorithm 1 also differs from MAML in that the data used to approximate the Hessian is the same as the data used to compute the inner loop parameter update. Furthermore, the averaging of the stochastic gradients in Algorithm 1 is over single-sample, full meta-gradient computations, whereas in MAML, each stochastic component of the meta-gradient is averaged before combining. We undertake these changes so that the stochastic gradients are unbiased estimates of the gradients of the updated objective (5), whose motivation was given in Section 3. Like MAML, Algorithm 1 can be modified to only entail computing first-order derivatives in the interest of computational efficiency. The analogous min-max algorithms to FO-MAML [Finn et al., 2017], Reptile [Nichol and Schulman, 2018], and Hessian-Free MAML [Fallah et al., 2019] can be developed to approximate optimal solutions to (5) without requiring Hessian computation.


5 Theoretical Results

For our convergence results we will need unbiased stochastic gradients with bounded second moments. These properties are standard assumptions in the stochastic optimization literature; see, e.g., Nemirovski et al. [2009], Rafique et al. [2018], Qian et al. [2019]. However, it is not obvious that they are satisfied for the stochastic gradients of our meta-learning objective $\phi(w, p)$ given similar assumptions about the functions $f_i$, due to the nested stochastic gradients involved in $\phi(w, p)$, so we establish them in the following lemmas. We provide all proofs in the supplementary material.

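Assumption 1 (Unbiased samples). Recall that $Q_i$ is the distribution over the data associated with the $i$-th task, $\theta$ is a sample drawn from $Q_i$, and $\hat{f}_i(w, \theta)$ is the approximation of $f_i(w)$ at the data point $\theta$. Then, $\forall w \in W, i \in [m]$:

$$\mathbb{E}_{\theta \sim Q_i}\left[\hat{f}_i(w, \theta) - f_i(w)\right] = 0, \tag{14}$$

$$\mathbb{E}_{\theta \sim Q_i}\left[\nabla \hat{f}_i(w, \theta) - \nabla f_i(w)\right] = 0, \tag{15}$$

$$\mathbb{E}_{\theta \sim Q_i}\left[\nabla^2 \hat{f}_i(w, \theta) - \nabla^2 f_i(w)\right] = 0. \tag{16}$$

The conditions in Assumption 1 are required to ensure the estimates of functions, gradients, and Hessians are unbiased.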

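Lemma 1. Suppose Assumption 1 holds. Then the stochastic gradients $\hat{g}_w^t$ and $\hat{g}_p^t$ defined in (8) and (9), respectively, are unbiased estimates of the gradients $g_w^t := g_w(w^t, p^t)$ and $g_p^t := g_p(w^t, p^t)$ given in (6) and (7), i.e., for all $t \ge 1$

$$\mathbb{E}[\hat{g}_w^t - g_w^t] = \mathbb{E}[\hat{g}_p^t - g_p^t] = 0. \tag{17}$$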

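Assumption 2. $f_i$ and $\hat{f}_i$, $\forall i \in [m]$, have these properties:

1. $f_i$ has bounded function values: $\exists B \in \mathbb{R}$ s.t. $\forall w \in W$ we have $|f_i(w)| \le B$.

2. $f_i$ is $L$-Lipschitz continuous: $\exists L \in \mathbb{R}$ s.t. $\forall u, v \in W$, we have $|f_i(u) - f_i(v)| \le L \|u - v\|_2$.

3. $\hat{f}_i$ is smooth: Each $\hat{f}_i(\cdot, \theta)$ is differentiable, and $\exists M \in \mathbb{R}$ such that $\|\nabla \hat{f}_i(u, \theta) - \nabla \hat{f}_i(v, \theta)\|_2 \le M \|u - v\|_2$, $\forall u, v \in W, \theta \in \Omega_i$. Hence, each $f_i$ is also $M$-smooth.

4. Bounded variance: $\exists \sigma, \sigma_R, \sigma_H \in \mathbb{R}$ s.t. $\forall w \in W$:

$$\mathbb{E}_{\theta \sim Q_i}\left[|\hat{f}_i(w, \theta) - f_i(w)|^2\right] \le \sigma^2 \tag{18}$$

$$\mathbb{E}_{\theta \sim Q_i}\left[\|\nabla \hat{f}_i(w, \theta) - \nabla f_i(w)\|^2\right] \le \sigma_R^2 \tag{19}$$

$$\mathbb{E}_{\theta \sim Q_i}\left[\|\nabla^2 \hat{f}_i(w, \theta) - \nabla^2 f_i(w)\|^2\right] \le \sigma_H^2 \tag{20}$$

5. Hessian second-order Lipschitz continuity in expectation: $\exists H \in \mathbb{R}$ s.t. $\mathbb{E}_{\theta \sim Q_i}\left[\|\nabla^2 \hat{f}_i(u, \theta) - \nabla^2 \hat{f}_i(v, \theta)\|_F^2\right] \le H^2 \|u - v\|_2^2$, $\forall u, v \in W, i \in [m]$.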

These assumptions on the task loss functions allow us to derive properties of the meta-learning task loss functions, which we write as follows, for all $w \in W$ and $i \in [m]$:

$$F_i(w) := \mathbb{E}_{\theta \sim Q_i}\left[f_i(w - \alpha \nabla \hat{f}_i(w, \theta))\right] \tag{21}$$

Lemma 2. Suppose Assumptions 1 and 2 hold. Then for all $i \in [m]$, $F_i(w)$ defined in (21) is $\tilde{L}$-Lipschitz, where

$$\tilde{L} := L(1 + \alpha M + \alpha \sigma_H) \tag{22}$$

Next, we show that the variance of gradient estimations and the expected squared norm of stochastic gradients are uniformly bounded. We will use these results later to characterize the convergence properties of our proposed method.

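Lemma 3. Suppose Assumptions 1-2 hold and the stochastic gradients $\hat{g}_w^t$ and $\hat{g}_p^t$ are computed as in (8) and (9). Then, defining $\delta_w^t := \hat{g}_w^t - g_w^t$ and $\delta_p^t := \hat{g}_p^t - g_p^t$ for all $t \ge 1$,

$$\mathbb{E}\left[\|\delta_w^t\|^2\right] \le \sigma_w^2, \qquad \mathbb{E}\left[\|\delta_p^t\|^2\right] \le \sigma_p^2 \tag{23}$$

where $\sigma_w^2 := \frac{m-1}{C} \tilde{L}^2 + \frac{m^2}{CD} (1 + \alpha M)^2 \left(L^2 + \sigma_R^2\right)$ and $\sigma_p^2 := \frac{m^2 \sigma^2}{CD}$. Moreover, $\mathbb{E}[\|\hat{g}_w^t\|_2^2] \le G_w^2$ and $\mathbb{E}[\|\hat{g}_p^t\|_2^2] \le G_p^2$, where $G_w^2 := \tilde{L}^2 + \sigma_w^2$ and $G_p^2 := m B^2 + \sigma_p^2$.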


5.1 Convex setting

We first consider the case when each of the functions $F_i(w)$ is convex, which implies $\phi(w, p)$ is convex in $w$. In the following lemma we show that $f_i(w)$ being strongly convex implies $F_i(w)$ is also strongly convex.

Lemma 4. Suppose that $\alpha < 1/M$ and that, in addition to satisfying Assumption 2, each $f_i$ is also $\mu$-strongly convex. Then each $F_i$ is $\left(\mu(1 - \alpha M)^2 - \alpha L H\right)$-strongly convex.

Note that the convexity of $f_i$ does not imply the convexity of $F_i$ (consider for example the counterexample in one dimension $f_i(w) = \frac{1}{w}$ for $w \in \mathbb{R}_+$). We also assume that the set $W$ is contained in a ball of radius $R_W$. The optimal rate of convergence for solving convex-concave stochastic min-max problems is $O(1/\epsilon^2)$, where convergence is measured in terms of the expected number of stochastic gradient computations required to achieve a duality gap of $\epsilon$ [Nemirovski et al., 2009]. If $\phi^*$ is the min-max optimal value of $\phi$, then the duality gap of the pair $(w, p)$ is defined as

$$\max_{p' \in \Delta} \phi(w, p') - \min_{w' \in W} \phi(w', p) \tag{24}$$
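Returning to the one-dimensional counterexample $f_i(w) = 1/w$ mentioned above, a quick midpoint check illustrates that $F_i$ need not inherit convexity. The sketch below uses the exact gradient for the inner step and an inner step size of $\alpha = 1$. Both are assumptions made for the illustration, as are the chosen test points.

```python
def f(w):
    return 1.0 / w                      # convex on w > 0

def F(w, alpha=1.0):
    grad = -1.0 / w**2                  # f'(w)
    return f(w - alpha * grad)          # equals w**2 / (w**3 + alpha)

a, b = 0.5, 1.5
mid = 0.5 * (a + b)

# For a convex function the average of the endpoint values is at least the midpoint value.
f_gap = 0.5 * (f(a) + f(b)) - f(mid)    # positive: consistent with convexity of f
F_gap = 0.5 * (F(a) + F(b)) - F(mid)    # negative: F violates midpoint convexity
```

Since `F_gap` is negative, the one-step-adapted loss is not convex on $\mathbb{R}_+$ even though the original loss is, which is why convexity of each $F_i$ is assumed directly in this section.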

It is known that if the stochastic gradients of $\phi(w, p)$ are unbiased and have bounded second moments, then alternating stochastic projected gradient descent-ascent achieves the optimal $O(1/\epsilon^2)$ convergence rate when $\phi(w, p)$ is a finite sum of convex functions of $w$ weighted linearly by the components of $p$, with $p$ in the simplex and $W$ contained in a ball of finite radius [Nemirovski et al., 2009]. We have established all the conditions required for the existing analysis to apply to our convex setting, so we can apply the standard analysis to show the optimal convergence rate of $O(1/\epsilon^2)$ of Algorithm 1. We adapt the result from Mohri et al. [2019], which in turn is a simplified version of Algorithm 1 and Theorem 1 from Juditsky et al. [2011].

Theorem 1. (Adapted from Mohri et al. [2019]) Consider problem (5) when each $F_i$ is convex and Assumptions 1 and 2 hold. Suppose also that there exists a ball of radius $R_W$ that contains $W$. Let $w_T^C$ and $p_T^C$ be the output of Algorithm 1 run for $T$ iterations with termination case T1. Then if we choose the step sizes as $\eta_w = (2 R_W)/(G_w \sqrt{T})$ and $\eta_p = 2/(G_p \sqrt{T})$, the following bound holds:

$$\mathbb{E}\left[\max_{p \in \Delta} \phi(w_T^C, p) - \min_{w \in W} \phi(w, p_T^C)\right] \le \frac{3 R_W G_w + 3 G_p}{\sqrt{T}}$$

Thus, Algorithm 1 requires $O(1/\epsilon^2)$ iterations to reach a solution with an expected duality gap of at most $\epsilon$. Since Algorithm 1 computes a constant number of stochastic function, gradient and Hessian evaluations per iteration, it follows that Algorithm 1 reaches an $\epsilon$-accurate solution in the optimal $O(1/\epsilon^2)$ stochastic oracle calls.
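As a small worked illustration of how the bound in Theorem 1 turns into an iteration count, the helper below solves $(3 R_W G_w + 3 G_p)/\sqrt{T} \le \epsilon$ for $T$ and returns the matching step sizes. The function name and the numerical constants are hypothetical; in practice $G_w$ and $G_p$ would come from Lemma 3.

```python
import math

def theorem1_schedule(R_W, G_w, G_p, eps):
    """Smallest T with (3 R_W G_w + 3 G_p) / sqrt(T) <= eps, plus the step sizes of Theorem 1."""
    T = math.ceil(((3 * R_W * G_w + 3 * G_p) / eps) ** 2)
    eta_w = 2 * R_W / (G_w * math.sqrt(T))
    eta_p = 2 / (G_p * math.sqrt(T))
    return T, eta_w, eta_p

# Hypothetical constants, chosen only to show the 1/eps^2 scaling.
T_coarse, _, _ = theorem1_schedule(R_W=1.0, G_w=2.0, G_p=1.0, eps=0.1)
T_fine, eta_w, eta_p = theorem1_schedule(R_W=1.0, G_w=2.0, G_p=1.0, eps=0.01)
```

Tightening the target gap by a factor of ten multiplies the required number of iterations by roughly one hundred, which is the $O(1/\epsilon^2)$ behavior stated above.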

5.2 Nonconvex setting

Next, we study the case in which the functions $f_i$ in (5) may not be convex, and as a result the function $\phi(w, p)$ may be nonconvex in $w$. In this case, the inner maximization remains concave, with a linear objective and convex constraints. Thus, we must evaluate the pair $(w_T, p_T)$ returned by our algorithm using distinct methods with respect to $w$ and $p$: with respect to $p$, we still care about the closeness of $\phi(w_T, p_T)$ to the global maximum of $\phi(w_T, \cdot)$, as solving concave programs is generally tractable. Meanwhile, we care about the closeness of $w$ to a stationary point of $\phi(\cdot, p)$, as $\phi$ is nonconvex in $w$ and finding a global minimum of a nonconvex function is generally intractable. To capture this, we say that $(w', p')$ is an $(\epsilon, \delta)$-stationary point of $\phi$ if

$$\|\nabla_w \phi(w', p')\|_2 \le \epsilon \quad \text{and} \quad \phi(w', p') \ge \max_{p \in \Delta} \phi(w', p) - \delta,$$

where $\epsilon, \delta > 0$, assuming that $W = \mathbb{R}^d$. In the case that $W$ does not contain all of $\mathbb{R}^d$, we consider the projected gradient at the point in question, which we will discuss later. In either case we will leverage smoothness. Unfortunately, the function of $w$ that we aim to minimize, $\max_{p \in \Delta} \phi(w, p)$, is nonsmooth because of the maximization. However, we show that each $F_i$ is smooth under the previous assumptions on $f_i$.

Lemma 5. Suppose Assumptions 1 and 2 hold. Then for all $i \in [m]$, $F_i$ is $\tilde{M}$-smooth, where

$$\tilde{M} := M(1 + \alpha M)^2 + \alpha L H \tag{25}$$

We are now ready to present our results for the nonconvex setting, including both the case when $W$ is equal to $\mathbb{R}^d$ and when $W$ is a compact, convex set, starting with the former.


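5.2.1 Unconstrained case ($W = \mathbb{R}^d$)

Proposition 1. Suppose Assumptions 1-2 hold and $W = \mathbb{R}^d$. Let $\eta_w^t$ and $\eta_p^t$ be constant over all $t$, denoted by $\eta_w$ and $\eta_p$, respectively, where $\eta_w < 2/\tilde{M}$. Let $(w_T^\tau, p_T^\tau)$ be the solution returned by Algorithm 1 after $T$ iterations. Then,

$$\mathbb{E}\left[\|\nabla_w \phi(w_T^\tau, p_T^\tau)\|_2^2\right] \le \frac{2(\phi(w^1, p^1) + B)}{T(2\eta_w - \eta_w^2 \tilde{M})} + \frac{4 \eta_p \sqrt{m} B G_p}{2\eta_w - \eta_w^2 \tilde{M}} + \frac{\eta_w \tilde{M} \sigma_w^2}{2 - \eta_w \tilde{M}},$$

$$\mathbb{E}\left[\phi(w_T^\tau, p_T^\tau)\right] \ge \max_{p \in \Delta} \mathbb{E}\left[\phi(w_T^\tau, p)\right] - \frac{1}{\eta_p T} - \frac{\eta_p G_p^2}{2}.$$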

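Theorem 2. In the setting of Proposition 1, if the step sizes are set as $\eta_w = T^{-\beta}$ and $\eta_p = 2^{-1/2} G_p^{-1} T^{-2\beta}$ for $\beta \in (0, \frac{1}{2})$, and $T^\beta > \tilde{M}/2$, then

$$\mathbb{E}\left[\|\nabla_w \phi(w_T^\tau, p_T^\tau)\|_2^2\right] \le \frac{\phi(w^1, p^1) + B + \sqrt{2m} B + 2 \tilde{M} \sigma_w^2}{T^\beta - \tilde{M}/2},$$

$$\mathbb{E}\left[\phi(w_T^\tau, p_T^\tau)\right] \ge \max_{p \in \Delta} \mathbb{E}\left[\phi(w_T^\tau, p)\right] - \frac{\sqrt{2} G_p}{T^{\min\{2\beta, 1-2\beta\}}}.$$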

Theorem 2 shows that the output $(w_T^\tau, p_T^\tau)$ of Algorithm 1 converges in expectation to an $(\epsilon, \delta)$-stationary point of $\phi$ in $O(\max\{1/\epsilon^{2/\beta}, 1/\delta^{1/\min\{2\beta, 1-2\beta\}}\})$ stochastic gradient evaluations, recalling that there are a constant number of stochastic gradient evaluations per iteration. Note that $\beta$ can be tuned to favor convergence with respect to $w$ or $p$. For instance, choosing $\beta = \frac{1}{4}$ yields a convergence rate of $O(\max\{1/\epsilon^8, 1/\delta^2\})$. If we would like to treat the convergence orders of magnitude with respect to $w$ and $p$ equally, the optimal setting is $\beta = \frac{2}{5}$, leading to a convergence rate of $O(\max\{1/\epsilon^5, 1/\delta^5\})$.
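The effect of $\beta$ on the two exponents can be tabulated directly. The helper below is an illustrative sketch with a hypothetical function name, using exact rational arithmetic to evaluate the exponents $2/\beta$ and $1/\min\{2\beta, 1-2\beta\}$ from the rate above.

```python
from fractions import Fraction

def unconstrained_exponents(beta):
    """Exponents (a, b) in the O(max{1/eps^a, 1/delta^b}) rate of Theorem 2."""
    beta = Fraction(beta)
    if not 0 < beta < Fraction(1, 2):
        raise ValueError("beta must lie in (0, 1/2)")
    return 2 / beta, 1 / min(2 * beta, 1 - 2 * beta)

quarter = unconstrained_exponents(Fraction(1, 4))     # exponents for beta = 1/4
balanced = unconstrained_exponents(Fraction(2, 5))    # exponents for beta = 2/5
```

For $\beta = 1/4$ this returns the pair $(8, 2)$, and for $\beta = 2/5$ both exponents equal $5$. Since $2/\beta$ decreases in $\beta$ while $1/(1-2\beta)$ increases, the two curves cross exactly at $\beta = 2/5$.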

5.2.2 Compact, Convex Set $W$

In this section, we consider the case that $W$ is a compact convex set, thus the projection $\Pi_W$ is required to enforce that the iterates $w^t$ remain in $W$. Here it is helpful to rewrite $\Pi_W$ as a prox operation, which requires removing the constraint $w \in W$ in our objective function and accounting for it with the function $h$, where $h(w) = 0$ if $w \in W$ and $h(w) = +\infty$ otherwise. Our modified objective becomes:

$$\min_{w \in \mathbb{R}^d} \; \max_{p \in \Delta} \; \Phi(w, p) := \phi(w, p) + h(w) \tag{26}$$

Under the above definition of $h$, our projected gradient descent update can be written as the prox operation:

$$w^{t+1} := u_w(w^t, p^t) := \arg\min_{u \in \mathbb{R}^d} \left\{ \langle \hat{g}_w(w^t, p^t), u \rangle + \frac{1}{2\eta_w^t} \|u - w^t\|_2^2 + h(u) \right\}$$
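To see why this prox step coincides with the projected update (10), one can complete the square. In the short derivation below, $g$ abbreviates the stochastic gradient at $(w^t, p^t)$ and $\eta$ abbreviates the step size for $w$ (shorthand introduced only here):

```latex
\begin{aligned}
\langle g, u \rangle + \frac{1}{2\eta}\,\|u - w^t\|_2^2
  &= \frac{1}{2\eta}\,\big\|u - (w^t - \eta g)\big\|_2^2
     + \langle g, w^t \rangle - \frac{\eta}{2}\,\|g\|_2^2, \\
\arg\min_{u \in \mathbb{R}^d}
  \Big\{ \langle g, u \rangle + \frac{1}{2\eta}\,\|u - w^t\|_2^2 + h(u) \Big\}
  &= \arg\min_{u \in W} \big\|u - (w^t - \eta g)\big\|_2^2
   = \Pi_W\big(w^t - \eta g\big).
\end{aligned}
```

The last two terms on the first line do not depend on $u$, and the indicator $h$ restricts the minimization to $W$, so the minimizer is exactly the Euclidean projection of the gradient step onto $W$.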

In this case our evaluation of the output $w_T^\tau$ changes. We are no longer concerned with the norm of the gradient with respect to $w$ at the output, since the negative gradient may belong to the normal cone of $W$ at $w_T^\tau$. Instead, we study the second moment of the projected gradient $\tilde{g}_T^\tau$, defined as

$$\tilde{g}_T^\tau := P_W(w_T^\tau, p_T^\tau, \hat{g}_w^\tau, \eta_w^\tau) := \frac{1}{\eta_w^\tau}\left(w_T^\tau - u_w(w_T^\tau, p_T^\tau)\right),$$

since this quantity signals how much we can improve our solution by moving locally within the feasible set. Thus, we use the same notion of an $(\epsilon, \delta)$-stationary point to evaluate convergence, except that here $\epsilon$ upper bounds the norm of the projected gradient instead of the norm of the gradient. Next, we use the analysis in Theorem 2 in Ghadimi et al. [2016] to obtain a bound on the second moment of the projected gradient corresponding to the output of Algorithm 1.

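Proposition 2. Suppose Assumptions 1-2 hold and $W$ is a convex, compact set. Let the step sizes $\eta_w^t$ and $\eta_p^t$ be constant over all $t$, denoted by $\eta_w$ and $\eta_p$, respectively, where $\eta_w < 2/\tilde{M}$. Let $(w_T^\tau, p_T^\tau)$ be the solution returned by Algorithm 1 after $T$ iterations, and let $\tilde{g}_T^\tau$ be the corresponding projected gradient. Then we have

$$\mathbb{E}\left[\|\tilde{g}_T^\tau\|_2^2\right] \le \frac{2(\phi(w^1, p^1) + B)}{T(2\eta_w - \eta_w^2 \tilde{M})} + \frac{4 \eta_p \sqrt{m} B G_p}{2\eta_w - \eta_w^2 \tilde{M}} + \frac{\sigma_w^2}{2 - \eta_w \tilde{M}},$$

$$\mathbb{E}\left[\phi(w_T^\tau, p_T^\tau)\right] \ge \max_{p \in \Delta} \mathbb{E}\left[\phi(w_T^\tau, p)\right] - \frac{1}{\eta_p T} - \frac{\eta_p G_p^2}{2}.$$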

The only significant difference between this bound and the bound derived in Proposition 1 is that the term with $\sigma_w^2$ is not multiplied by the step size $\eta_w$, and thus behaves asymptotically as a constant. Therefore, in order to show that the right-hand side of the above bound converges, we must treat $\sigma_w^2$ as a function of the number of stochastic gradients computed during each iteration. Recall that $\sigma_w^2$ is an upper bound on the variance of $\hat{g}_w$, and note from Lemma 3 that we can write it as $\sigma_w^2 = \tilde{\sigma}_w^2 / C$, where $\tilde{\sigma}_w^2$ does not depend on C, C is the number of sampled tasks used for each stochastic gradient computation, and each sampled task involves a constant number of single-sample function, gradient, and Hessian evaluations if D is fixed. In the following theorem, we choose C as a function of T to show convergence.

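Theorem 3. Suppose that the setting of Proposition 2 holds. If the step sizes are chosen as $\eta_w^t = 1/(2M)$ and $\eta_p^t = \sqrt{2}\,T^{-\beta}/G_p$ for all $t = 1, \dots, T$, and the task batch size is chosen as $C = T^{\beta}$ for $\beta \in (0, 1)$, then

$$\mathbb{E}\left[\|\tilde{g}^{\tau_T}\|_2^2\right] \le \frac{8M(\Phi(w^1, p^1) + B)}{3T} + \frac{8MB\sqrt{m} + 4\tilde{\sigma}_w^2}{3T^{\beta}},$$

$$\mathbb{E}\left[\phi(w^{\tau_T}, p^{\tau_T})\right] \ge \max_{p \in \Delta} \mathbb{E}\left[\phi(w^{\tau_T}, p)\right] - \frac{G_p\sqrt{2}}{T^{\min\{\beta, 1-\beta\}}},$$

where $\tilde{\sigma}_w^2 = C\sigma_w^2$ is the batch-size-independent variance constant discussed above.

Hence, if N represents the number of single-sample function, gradient, and Hessian evaluations ($N = CT = T^{1+\beta}$, thus $T = N^{1/(1+\beta)}$), Algorithm 1 converges to an (ε, δ)-stationary point after $O(\max\{1/\epsilon^{(2+2\beta)/\beta}, 1/\delta^{(1+\beta)/\min\{\beta, 1-\beta\}}\})$ single-sample function, gradient, and Hessian evaluations. We can tune β to favor convergence with respect to w or p. By setting β = 2/3 we treat convergence with respect to both equally, leading to a complexity of $O(\max\{1/\epsilon^5, 1/\delta^5\})$.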
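The arithmetic behind the choice β = 2/3 can be checked directly. The short sketch below evaluates the two exponents in the complexity bound with exact rational arithmetic; the function name `complexity_exponents` is introduced here only for illustration.

```python
from fractions import Fraction

def complexity_exponents(beta):
    """Exponents a, b in O(max{1/eps^a, 1/delta^b}) for task batch size C = T^beta."""
    eps_exponent = (2 + 2 * beta) / beta
    delta_exponent = (1 + beta) / min(beta, 1 - beta)
    return eps_exponent, delta_exponent

for beta in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)):
    eps_exponent, delta_exponent = complexity_exponents(beta)
    print(f"beta = {beta}: 1/eps^{eps_exponent}, 1/delta^{delta_exponent}")
```

For β ≥ 1/2 the exponents coincide exactly when 2(1 − β) = β, that is, at β = 2/3, where both equal 5. Smaller β favors convergence in p, and larger β favors convergence in w.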

6 Experimental Results

In this section we compare the empirical performance of the DA-MAML algorithm with the MAML algorithm on image classification problems. We are particularly interested in the extent to which each algorithm can be considered fair. To define fairness we adopt the definition provided in Mohri et al. [2019], which considers an algorithm fair if it tries to minimize the maximum loss over a set of objectives. This means that the algorithm does not allow poor performance on some objectives in order to boost performance on others, such as those which may contribute heavily to minimizing the average loss, even if the training data itself is biased. Our experiments test whether DA-MAML and/or MAML yield a fair solution under exactly this type of scenario.

We employ the MNIST [LeCun and Cortes, 2010] and Fashion MNIST (FMNIST) [Xiao et al., 2017] datasets in order to make this evaluation. Both of these datasets contain 70,000 28-by-28 grayscale images split evenly into 10 classes. In MNIST these classes are the handwritten digits, and in FMNIST they are more complex fashion products. Following Finn et al. [2017], we consider tasks as N-way image classification problems, and we evaluate meta-learning performance by observing how well the model classifies N images from N distinct classes after learning from K labelled training examples from each class, where K is small. An algorithm's success on this K-shot, N-way classification problem is an appropriate measure of its meta-learning capability because K being small exposes how quickly the algorithm learns from training data. Furthermore, in all the distributions of tasks that we consider, tasks composed of the same classes but with different permutations of the labels are drawn with equal probability, so all models are expected to classify with accuracy 1/N on randomly drawn tasks without access to any training examples.

Our first experiment involves meta-learning eight two-way classification tasks with a two-layer, tanh-activated neural network. Two of the tasks are to classify pullovers and coats from FMNIST (where in each task, each class has the opposite label), and the remaining six MNIST tasks are similarly to classify images belonging to three distinct pairs of classes. To simulate a biased training environment, we enforce that MAML will see fewer training samples of FMNIST tasks than MNIST tasks by setting the ambient distribution of the tasks such that the probability of drawing an FMNIST task is only 0.1, and fixing the number of task samples per iteration and the total number of iterations at 3 and 3000, respectively. One would predict that in this situation a combination of factors likely biases MAML against trying to perform well on the FMNIST tasks: the fact that the MAML objective is to optimize the expected loss over tasks drawn from the ambient distribution and the FMNIST tasks have a low probability of being drawn from this distribution, the greater computational resources that solving the FMNIST tasks presumably requires, the possibility that the optimal solution for the FMNIST tasks is far from the optimal solution for the MNIST tasks, and the small number of FMNIST training examples MAML can learn from. In contrast, DA-MAML optimizes over worst-case instead of expected performance, and samples tasks uniformly, so we expect that DA-MAML still tries to perform well on the FMNIST tasks.

Figure 1: The average and worst case 5-shot classification accuracies for MAML and DA-MAML across 8 binary classification tasks on MNIST and FMNIST.

Our results support this hypothesis, showing that DA-MAML yields a fairer solution. We report the testing accuracy vs. number of iterations for both MAML and DA-MAML in Figure 1. For testing, we take the current model and perform 500 episodes of 5-shot, 2-way classification for each task and consider the average classification accuracy to be the test accuracy corresponding to the task. DA-MAML clearly outperforms MAML in terms of the worst-performing tasks (which are the FMNIST tasks in this case).
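The two evaluation metrics can be sketched as follows. The per-task accuracies below are hypothetical placeholder values chosen only to exercise the code; they are not results from this paper. The sketch also assumes that the total FMNIST probability of 0.1 is split evenly between the two FMNIST tasks and the remaining 0.9 evenly among the six MNIST tasks, which the text does not specify.

```python
import numpy as np

def summarize(per_task_accuracy, ambient_probs):
    """Average and worst-case accuracy over tasks (hypothetical helper)."""
    acc = np.asarray(per_task_accuracy, dtype=float)
    probs = np.asarray(ambient_probs, dtype=float)
    return {
        "uniform_average": float(acc.mean()),    # every task weighted equally
        "ambient_average": float(probs @ acc),   # weighted by the biased task distribution
        "worst_case": float(acc.min()),          # the quantity a fair method targets
    }

# Assumed ambient distribution: 2 FMNIST tasks (0.05 each), 6 MNIST tasks (0.15 each).
ambient = np.array([0.05, 0.05] + [0.15] * 6)

# Placeholder accuracies, NOT measured values: FMNIST tasks first, then MNIST tasks.
placeholder_acc = [0.6, 0.6] + [0.9] * 6

summary = summarize(placeholder_acc, ambient)
print(summary)
```

Under a biased ambient distribution the ambient average can look strong while the worst-case accuracy stays low, which is the gap the fairness criterion is meant to expose.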

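A Proof of Lemma 1

Proof. Recall that $\hat{g}_w(w, p)$ is computed as follows:

$$\hat{g}_w(w, p) = \frac{m}{C} \sum_{i: \mathcal{T}_i \in \mathcal{C}} \frac{1}{D} \sum_{j=1}^{D} \left[ p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij}), \theta^{out}_{ij}\right) \right]$$

Denote by $c_i$ the number of times the task $\mathcal{T}_i$ appears in the batch $\mathcal{C}$, and by $(\theta^{in}_{ij,k}, \theta^{out}_{ij,k})$ the j-th data pair sampled from the k-th instance of $\mathcal{T}_i$. Then we can write $\hat{g}_w(w, p)$ as

$$\hat{g}_w(w, p) = \sum_{i=1}^{m} \frac{m}{C} \sum_{k=1}^{c_i} \frac{1}{D} \sum_{j=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \tag{27}$$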

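Taking the expectation over the random variables $c_i$ and $(\theta^{in}_{ij,k}, \theta^{out}_{ij,k})$ and using the linearity of expectation,

$$\begin{aligned}
\mathbb{E}[\hat{g}_w(w, p)] &= \mathbb{E}\left[\sum_{i=1}^{m} \frac{m}{C} \sum_{k=1}^{c_i} \frac{1}{D} \sum_{j=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right)\right] \\
&= \sum_{i=1}^{m} \frac{m}{C}\, \mathbb{E}\left[\sum_{k=1}^{c_i} \frac{1}{D} \sum_{j=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right)\right]
\end{aligned}$$

Observe that each $c_i$ is a binomial random variable with mean C/m. Thus, using the Law of Iterated Expectations while conditioning on $c_i$ yields

$$\begin{aligned}
\mathbb{E}[\hat{g}_w(w, p)] &= \sum_{i=1}^{m} \frac{m}{C}\, \mathbb{E}_{c_i}\left[ \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \sum_{k=1}^{c_i} \frac{1}{D} \sum_{j=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \,\Big|\, c_i \right] \right] \\
&= \sum_{i=1}^{m} \frac{m}{C}\, \mathbb{E}_{c_i}\left[ \sum_{k=1}^{c_i} \frac{1}{D} \sum_{j=1}^{D} p_i\, \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right] \right] \\
&= \sum_{i=1}^{m} \frac{m}{C}\, \mathbb{E}_{c_i}\left[ \sum_{k=1}^{c_i} \frac{1}{D} \sum_{j=1}^{D} p_i\, \mathbb{E}_{\theta^{in}_{ij,k}}\left[ \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k})\right) \right] \right] \\
&= \sum_{i=1}^{m} \frac{m}{C}\, \mathbb{E}_{c_i}\left[ c_i\, p_i \nabla_w F_i(w) \right] \\
&= \sum_{i=1}^{m} p_i \nabla_w F_i(w) \\
&= g_w(w, p)
\end{aligned}$$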

To see that $\hat{g}_p(w, p)$ is unbiased, note that if we again use $c_i$ to denote the number of times the task $\mathcal{T}_i$ is chosen for each computation of $\hat{g}_p(w, p)$, we can write $\hat{g}_p(w, p)$ as:

$$\hat{g}_p(w, p) = \left[ \frac{m}{CD} \sum_{k=1}^{c_i} \sum_{j=1}^{D} f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right]_{1 \le i \le m} \tag{28}$$

Observe that the mean of $c_i$ is C/m. Thus, similarly applying the Law of Iterated Expectations, the linearity of expectation, the independence of the samples, and the unbiasedness of the single-sample function evaluations yields that the expected value of the i-th element of $\hat{g}_p(w, p)$ is $F_i(w)$, for all i ∈ [m], as expected.
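The role of the scaling factor m/C can be verified exactly for small cases. The sketch below assumes, as in the proof, that $c_i$ is Binomial(C, 1/m), and computes the mean and variance of the scaled count $(m/C)\,c_i$ from the binomial probability mass function; the helper name is hypothetical. The mean equals 1, which is what makes the estimators unbiased, and the variance equals (m − 1)/C, the factor that governs the variance of the stochastic gradient contributed by task sampling.

```python
from fractions import Fraction
from math import comb

def scaled_count_moments(C, m):
    """Exact mean and variance of (m / C) * c_i for c_i ~ Binomial(C, 1/m)."""
    q = Fraction(1, m)
    pmf = [comb(C, k) * q**k * (1 - q)**(C - k) for k in range(C + 1)]
    scale = Fraction(m, C)
    mean = sum(scale * k * prob for k, prob in enumerate(pmf))
    second_moment = sum((scale * k) ** 2 * prob for k, prob in enumerate(pmf))
    return mean, second_moment - mean**2

# Example sizes: C = 3 sampled tasks per iteration, m = 8 tasks in total.
mean, variance = scaled_count_moments(C=3, m=8)
print(mean, variance)
```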

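B Proof of Lemma 2

Proof. To begin, note that

$$\nabla F_i(w) = \mathbb{E}_{\theta \sim Q_i}\left[ \left(I - \alpha \nabla^2 f_i(w, \theta)\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right]$$

Thus, using Jensen's Inequality followed by the Cauchy-Schwarz Inequality twice (recalling that the Cauchy-Schwarz Inequality implies $\mathbb{E}[XY] \le \sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}$ for any two random variables X and Y), we have

$$\begin{aligned}
\|\nabla F_i(w)\| &= \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left(I - \alpha \nabla^2 f_i(w, \theta)\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right] \right\| \\
&\le \mathbb{E}_{\theta \sim Q_i} \left\| \left(I - \alpha \nabla^2 f_i(w, \theta)\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right\| \\
&\le \mathbb{E}_{\theta \sim Q_i}\left[ \left\| I - \alpha \nabla^2 f_i(w, \theta) \right\| \left\| \nabla f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right\| \right] \\
&\le \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| I - \alpha \nabla^2 f_i(w, \theta) \|^2 \right] \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla f_i(w - \alpha \nabla f_i(w, \theta)) \|^2 \right] } \\
&\le L \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| I - \alpha \nabla^2 f_i(w, \theta) \|^2 \right] }
\end{aligned} \tag{29}$$

where (29) follows from the L-Lipschitzness of $f_i$. Considering the term inside the square root, we have

$$\begin{aligned}
\mathbb{E}_{\theta \sim Q_i}\left[ \| I - \alpha \nabla^2 f_i(w, \theta) \|^2 \right] &= \mathbb{E}_{\theta \sim Q_i}\left[ \left\| \left(\alpha \nabla^2 f_i(w) - \alpha \nabla^2 f_i(w, \theta)\right) + I - \alpha \nabla^2 f_i(w) \right\|^2 \right] \\
&\le \mathbb{E}_{\theta \sim Q_i}\left[ \left( \| \alpha \nabla^2 f_i(w) - \alpha \nabla^2 f_i(w, \theta) \| + \| I - \alpha \nabla^2 f_i(w) \| \right)^2 \right] && (30) \\
&\le \mathbb{E}_{\theta \sim Q_i}\left[ \left( \| \alpha \nabla^2 f_i(w) - \alpha \nabla^2 f_i(w, \theta) \| + (1 + \alpha M) \right)^2 \right] && (31) \\
&= (1 + \alpha M)^2 + 2(1 + \alpha M)\alpha\, \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla^2 f_i(w) - \nabla^2 f_i(w, \theta) \| \right] + \alpha^2\, \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla^2 f_i(w) - \nabla^2 f_i(w, \theta) \|^2 \right] \\
&\le (1 + \alpha M)^2 + 2(1 + \alpha M)\alpha \sigma_H + \alpha^2 \sigma_H^2 && (32)
\end{aligned}$$

where (30) follows from the triangle inequality, (31) follows from the M-smoothness of $f_i$, and (32) follows from the definition of $\sigma_H^2$, noting that

$$\mathbb{E}_{\theta \sim Q_i}\left[ \| \alpha \nabla^2 f_i(w) - \alpha \nabla^2 f_i(w, \theta) \| \right] \le \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| \alpha \nabla^2 f_i(w) - \alpha \nabla^2 f_i(w, \theta) \|^2 \right] } \le \alpha \sigma_H \tag{33}$$

Since the right-hand side of (32) equals $(1 + \alpha M + \alpha \sigma_H)^2$, combining (29) and (32) gives us that $F_i$ is $\tilde{L} := L(1 + \alpha M + \alpha \sigma_H)$-Lipschitz for all i ∈ [m].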

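C Proof of Lemma 3

Proof. We denote the true gradient of φ(w, p) with respect to w corresponding to the i-th task as

$$\rho_i := p_i \nabla_w F_i(w) = p_i\, \mathbb{E}_{\theta}\left[ \left(I - \alpha \nabla^2 f_i(w, \theta)\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right] \tag{34}$$

Following (27), we denote the stochastic gradient corresponding to the i-th task as

$$\hat{\rho}_i := \frac{m}{C} \sum_{j=1}^{c_i} \frac{1}{D} \sum_{k=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \tag{35}$$

where $c_i$ is the number of times the i-th task is chosen during the computation of $\hat{g}_w(w, p)$. From Lemma 1 and the linearity of expectation, we have that $\mathbb{E}[\hat{\rho}_i] = \rho_i$, so using the independence of the sampled tasks and data, we have

$$\mathbb{E}[\|\delta_w\|_2^2] = \mathrm{Var}(\hat{g}_w(w, p)) = \mathbb{E}\left[ \left\| \sum_{i=1}^{m} (\hat{\rho}_i - \rho_i) \right\|_2^2 \right] = \sum_{i=1}^{m} \mathbb{E}\left[ \| \hat{\rho}_i - \rho_i \|_2^2 \right] = \sum_{i=1}^{m} \mathrm{Var}(\hat{\rho}_i) \tag{36}$$

Thus it only remains to compute $\mathrm{Var}(\hat{\rho}_i)$. Note that each $c_i$ is a binomial random variable with parameters (C, 1/m). By the Law of Total Variance,

$$\mathrm{Var}(\hat{\rho}_i) = \mathrm{Var}(\mathbb{E}[\hat{\rho}_i | c_i]) + \mathbb{E}[\mathrm{Var}(\hat{\rho}_i | c_i)] \tag{37}$$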

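Conditioned on $c_i$, $\hat{\rho}_i$ has mean $\frac{m}{C} c_i \rho_i$, and $\mathrm{Var}(c_i) = C \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right)$, thus

$$\mathrm{Var}(\mathbb{E}[\hat{\rho}_i | c_i]) = \mathrm{Var}\left(\frac{m}{C} c_i \rho_i\right) = \frac{m - 1}{C} \|\rho_i\|_2^2 \le \frac{m - 1}{C} \tilde{L}^2 p_i^2 \tag{38}$$

by the $\tilde{L}$-Lipschitzness of $F_i$ (where $\tilde{L}$ denotes the Lipschitz constant of $F_i$ from Lemma 2), and

$$\begin{aligned}
\mathrm{Var}(\hat{\rho}_i | c_i) &= \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| \frac{m}{C} c_i \rho_i - \frac{m}{C} \sum_{j=1}^{c_i} \frac{1}{D} \sum_{k=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right\|_2^2 \,\Big|\, c_i \right] \\
&= \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| \frac{m}{C} \sum_{j=1}^{c_i} \left( \rho_i - \frac{1}{D} \sum_{k=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right) \right\|_2^2 \,\Big|\, c_i \right] \\
&= \frac{m^2}{C^2} \sum_{j=1}^{c_i} \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| \rho_i - \frac{1}{D} \sum_{k=1}^{D} p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right\|_2^2 \right] && (39) \\
&= \frac{m^2}{C^2 D^2} \sum_{j=1}^{c_i} \sum_{k=1}^{D} \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| \rho_i - p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right\|_2^2 \right] && (40)
\end{aligned}$$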

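where (39) follows from the independence of the sampled data across task instances, and (40) follows from the independence of each of the samples within each task instance. Thus we are left with a sum of single-sample stochastic gradient variances. For any i, j, k, we can bound this variance by first bounding the second moment of the single-sample stochastic gradient:

$$\begin{aligned}
&\mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| p_i \left(I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k})\right) \nabla f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) \right\|_2^2 \right] \\
&\quad \le p_i^2\, \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \| I - \alpha \nabla^2 f_i(w, \theta^{in}_{ij,k}) \|_2^2\, \| \nabla f_i(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}) \|_2^2 \right] && (41) \\
&\quad \le p_i^2 (1 + \alpha M)^2\, \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \| \nabla f_i(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}) \|_2^2 \right] && (42) \\
&\quad = p_i^2 (1 + \alpha M)^2 \Big( \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \| \nabla f_i(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k})) \|_2^2 \right] && (43) \\
&\qquad\quad + \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \| \nabla f_i(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}) - \nabla f_i(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k})) \|_2^2 \right] \Big) && (44) \\
&\quad \le p_i^2 (1 + \alpha M)^2 \left( L^2 + \sigma_R^2 \right) && (45)
\end{aligned}$$

where (41) follows from the Cauchy-Schwarz Inequality, (42) follows from the M-smoothness of $f_i(\cdot, \theta)$ for all θ, (44) follows from the fact that $\mathrm{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ for any random variable X, and (45) follows from the L-Lipschitz continuity of $f_i$ and the bounded single-sample variance of $\nabla f_i$. Since the variance is at most the second moment, plugging this result back into (40) gives

$$\mathrm{Var}(\hat{\rho}_i | c_i) \le \frac{m^2 c_i}{C^2 D}\, p_i^2 (1 + \alpha M)^2 \left( L^2 + \sigma_R^2 \right)$$

thus taking the expectation over $c_i$ yields

$$\mathbb{E}[\mathrm{Var}(\hat{\rho}_i | c_i)] \le \frac{m}{CD}\, p_i^2 (1 + \alpha M)^2 \left( L^2 + \sigma_R^2 \right)$$

Using (37), we have

$$\mathrm{Var}(\hat{\rho}_i) \le \left( \frac{m - 1}{C} \tilde{L}^2 + \frac{m^2}{CD} (1 + \alpha M)^2 \left( L^2 + \sigma_R^2 \right) \right) p_i^2$$

where $\tilde{L}$ is the Lipschitz constant of $F_i$ from Lemma 2. Summing over i, by (36) we have

$$\begin{aligned}
\mathbb{E}[\|\delta_w\|_2^2] = \mathrm{Var}(\hat{g}_w(w, p)) &\le \left( \frac{m - 1}{C} \tilde{L}^2 + \frac{m^2}{CD} (1 + \alpha M)^2 \left( L^2 + \sigma_R^2 \right) \right) \sum_{i=1}^{m} p_i^2 \\
&\le \frac{m - 1}{C} \tilde{L}^2 + \frac{m^2}{CD} (1 + \alpha M)^2 \left( L^2 + \sigma_R^2 \right)
\end{aligned} \tag{46}$$

where (46) holds because $\sum_{i=1}^{m} p_i^2 \le 1$ for any p in the probability simplex.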

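Next we bound $\mathbb{E}[\|\delta_p\|_2^2]$. Note that

$$\begin{aligned}
\mathbb{E}[\|\delta_p\|_2^2] &= \sum_{i=1}^{m} \mathbb{E}\left[ \left\| \frac{m}{C} \sum_{j=1}^{c_i} \frac{1}{D} \sum_{k=1}^{D} f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) - F_i(w) \right\|_2^2 \right] \\
&= \frac{m^2}{C^2} \sum_{i=1}^{m} \mathbb{E}_{c_i}\left[ \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| \sum_{j=1}^{c_i} \frac{1}{D} \sum_{k=1}^{D} f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) - F_i(w) \right\|_2^2 \,\Big|\, c_i \right] \right] && (47) \\
&= \frac{m^2}{C^2} \sum_{i=1}^{m} \mathbb{E}_{c_i}\left[ \sum_{j=1}^{c_i} \frac{1}{D^2} \sum_{k=1}^{D} \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) - F_i(w) \right\|_2^2 \,\Big|\, c_i \right] \right] && (48)
\end{aligned}$$

where (47) follows from the Law of Iterated Expectations and (48) follows from the independence of the samples. We can bound the inner expectation as follows:

$$\begin{aligned}
&\mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) - F_i(w) \right\|_2^2 \,\Big|\, c_i \right] \\
&\quad = \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}}\left[ \left\| f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) - \mathbb{E}_{\theta}\left[ f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right] \right\|_2^2 \right] \\
&\quad \le \mathbb{E}_{\theta^{in}_{ij,k}, \theta^{out}_{ij,k}, \theta}\left[ \left\| f_i\left(w - \alpha \nabla f_i(w, \theta^{in}_{ij,k}), \theta^{out}_{ij,k}\right) - f_i\left(w - \alpha \nabla f_i(w, \theta)\right) \right\|_2^2 \right] && (49) \\
&\quad \le \sigma^2 && (50)
\end{aligned}$$

where (49) follows from Jensen's Inequality and (50) follows from the assumption of bounded variance of single-sample function evaluations. Combining (50) with (48) yields

$$\mathbb{E}[\|\delta_p\|_2^2] \le \frac{m^2}{C^2} \sum_{i=1}^{m} \mathbb{E}_{c_i}\left[ \sum_{j=1}^{c_i} \frac{1}{D^2} \sum_{k=1}^{D} \sigma^2 \right] = \frac{m^2}{C^2} \sum_{i=1}^{m} \mathbb{E}_{c_i}\left[ \frac{c_i \sigma^2}{D} \right] = \frac{m^2 \sigma^2}{CD} \tag{51}$$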

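Now that we have bounds on $\mathbb{E}[\|\delta_w\|_2^2]$ and $\mathbb{E}[\|\delta_p\|_2^2]$, it is straightforward to bound the second moments of the stochastic gradients with respect to w and p. Denoting the upper bound on $\mathbb{E}[\|\delta_w\|_2^2]$ in (46) as $\sigma_w^2$, we have that since $\hat{g}_w$ is unbiased,

$$\begin{aligned}
\mathbb{E}[\|\hat{g}_w\|_2^2] &= \|g_w\|_2^2 + \mathbb{E}[\|\hat{g}_w - g_w\|_2^2] \\
&\le \left\| \sum_{i=1}^{m} p_i \nabla_w F_i(w) \right\|_2^2 + \sigma_w^2 \\
&= \sum_{i=1}^{m} p_i \sum_{j=1}^{m} p_j (\nabla_w F_i(w))^T \nabla_w F_j(w) + \sigma_w^2 \\
&\le \sum_{i=1}^{m} p_i \|p\|_1 \max_{j \in [m]} (\nabla_w F_i(w))^T \nabla_w F_j(w) + \sigma_w^2 && (52) \\
&\le \|p\|_1^2 \max_{i \in [m]} \max_{j \in [m]} (\nabla_w F_i(w))^T \nabla_w F_j(w) + \sigma_w^2 && (53) \\
&\le \tilde{L}^2 + \sigma_w^2 && (54)
\end{aligned}$$

where (52) and (53) follow from Hölder's Inequality, and (54) follows from the facts that $\|p\|_1 = 1$ since p ∈ ∆ and each $F_i$ is $\tilde{L}$-Lipschitz, with $\tilde{L}$ the Lipschitz constant from Lemma 2. To show the bound on $G_p$, we denote the upper bound on $\mathbb{E}[\|\delta_p\|_2^2]$ in (51) as $\sigma_p^2$, and similarly use the unbiasedness of $\hat{g}_p$ to obtain

$$\begin{aligned}
\mathbb{E}[\|\hat{g}_p\|_2^2] &= \|g_p\|_2^2 + \mathbb{E}[\|\hat{g}_p - g_p\|_2^2] \\
&\le \sum_{i=1}^{m} F_i(w)^2 + \sigma_p^2 \\
&\le m B^2 + \sigma_p^2
\end{aligned} \tag{55}$$

where (55) follows from the boundedness of each $f_i$.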

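D Proof of Lemma 4

Proof. We show the strong convexity of $F_i$ when α < 1/M and $f_i$ is µ-strongly convex in addition to satisfying Assumptions 1 and 2. We have

$$\begin{aligned}
\|\nabla F_i(u) - \nabla F_i(v)\| &= \Big\| \mathbb{E}_{\theta \sim Q_i}\Big[ \left(I - \alpha \nabla^2 f_i(u, \theta)\right)\left( \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right) \\
&\qquad\quad + \left( (I - \alpha \nabla^2 f_i(u, \theta)) - (I - \alpha \nabla^2 f_i(v, \theta)) \right) \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \Big] \Big\| && (56) \\
&\ge \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left(I - \alpha \nabla^2 f_i(u, \theta)\right)\left( \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right) \right] \right\| \\
&\qquad - \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left( (I - \alpha \nabla^2 f_i(u, \theta)) - (I - \alpha \nabla^2 f_i(v, \theta)) \right) \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right] \right\| && (57)
\end{aligned}$$

To lower bound the first term, we use the M-smoothness of $f_i$, which implies that the minimum eigenvalue of $I - \alpha \nabla^2 f_i(u)$ is at least 1 − αM for all u ∈ W. Thus,

$$\begin{aligned}
&\left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left(I - \alpha \nabla^2 f_i(u, \theta)\right)\left( \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right) \right] \right\| && (58) \\
&\quad \ge (1 - \alpha M) \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right] \right\| && (59)
\end{aligned}$$

By the µ-strong convexity of $f_i$ and the triangle inequality, we have

$$\begin{aligned}
\left\| \mathbb{E}_{\theta \sim Q_i}\left[ \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right] \right\| &\ge \mu \left\| \mathbb{E}_{\theta \sim Q_i}\left[ u - \alpha \nabla f_i(u, \theta) - (v - \alpha \nabla f_i(v, \theta)) \right] \right\| \\
&\ge \mu \left( \|u - v\| - \alpha \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \nabla f_i(v, \theta) - \nabla f_i(u, \theta) \right] \right\| \right) \\
&= \mu \left( \|u - v\| - \alpha \|\nabla f_i(v) - \nabla f_i(u)\| \right) \\
&\ge \mu \left( \|u - v\| - \alpha M \|u - v\| \right)
\end{aligned} \tag{60}$$

Next we upper bound the second term in (57). We have

$$\begin{aligned}
&\left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left( \alpha \nabla^2 f_i(v, \theta) - \alpha \nabla^2 f_i(u, \theta) \right) \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right] \right\| \\
&\quad \le \mathbb{E}_{\theta \sim Q_i}\left[ \left\| \left( \alpha \nabla^2 f_i(v, \theta) - \alpha \nabla^2 f_i(u, \theta) \right) \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right\| \right] && (61) \\
&\quad \le \mathbb{E}_{\theta \sim Q_i}\left[ \| \alpha \nabla^2 f_i(v, \theta) - \alpha \nabla^2 f_i(u, \theta) \| \, \| \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \| \right] && (62) \\
&\quad \le \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| \alpha \nabla^2 f_i(v, \theta) - \alpha \nabla^2 f_i(u, \theta) \|^2 \right] \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \|^2 \right] } && (63) \\
&\quad \le \alpha L \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla^2 f_i(v, \theta) - \nabla^2 f_i(u, \theta) \|^2 \right] } && (64) \\
&\quad \le \alpha L H \|u - v\| && (65)
\end{aligned}$$

where (61) follows from Jensen's Inequality, (62) and (63) follow from the Cauchy-Schwarz Inequality, (64) follows from the L-Lipschitzness of $f_i$, and (65) follows from Assumption 2. Combining (57), (59), (60), and (65) yields that $F_i$ is $\tilde{\mu} := (\mu(1 - \alpha M)^2 - \alpha L H)$-strongly convex under the given conditions.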

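E Proof of Theorem 1

Proof. We adapt the arguments from Mohri et al. [2019] to our nested gradients case. First observe that since each $F_i(w)$ is convex, φ(w, p) is convex in w and linear, thus concave, in p. Therefore we can write:

$$\begin{aligned}
\max_{p \in \Delta} \phi(w_T^C, p) - \min_{w \in W} \phi(w, p_T^C) &= \max_{p \in \Delta, w \in W} \left\{ \phi(w_T^C, p) - \phi(w, p_T^C) \right\} \\
&\le \frac{1}{T} \max_{p \in \Delta, w \in W} \sum_{t=1}^{T} \left\{ \phi(w^t, p) - \phi(w, p^t) \right\}
\end{aligned} \tag{66}$$

where (66) follows from the convexity of φ in w and the concavity of φ in p. Again using the convexity of φ in w along with the linearity of φ in p, we have that for any t ≥ 1,

$$\begin{aligned}
\phi(w^t, p) - \phi(w, p^t) &= \phi(w^t, p) - \phi(w^t, p^t) + \phi(w^t, p^t) - \phi(w, p^t) \\
&\le \langle p - p^t, \nabla_p \phi(w^t, p^t) \rangle + \langle w^t - w, \nabla_w \phi(w^t, p^t) \rangle \\
&= \langle p - p^t, \hat{g}_p^t \rangle + \langle w^t - w, \hat{g}_w^t \rangle \\
&\quad + \langle p - p^t, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle + \langle w^t - w, \nabla_w \phi(w^t, p^t) - \hat{g}_w^t \rangle
\end{aligned}$$

Thus by rearranging terms and the subadditivity of max,

$$\begin{aligned}
\max_{p \in \Delta, w \in W} \sum_{t=1}^{T} \left\{ \phi(w^t, p) - \phi(w, p^t) \right\} &\le \max_{p \in \Delta, w \in W} \sum_{t=1}^{T} \left\{ \langle p - p^t, \hat{g}_p^t \rangle + \langle w^t - w, \hat{g}_w^t \rangle \right\} \\
&\quad + \max_{p \in \Delta, w \in W} \sum_{t=1}^{T} \left\{ \langle p, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle + \langle w, \hat{g}_w^t - \nabla_w \phi(w^t, p^t) \rangle \right\} \\
&\quad - \left( \sum_{t=1}^{T} \langle p^t, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle - \langle w^t, \nabla_w \phi(w^t, p^t) - \hat{g}_w^t \rangle \right)
\end{aligned} \tag{67}$$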
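We bound the expectation of each of the above terms separately, starting with the first one. Note that since $2ab = a^2 + b^2 - (a - b)^2$, for any w ∈ W, using a constant step size $\eta_w$ we have

$$\begin{aligned}
\sum_{t=1}^{T} \langle w^t - w, \hat{g}_w^t \rangle &= \frac{1}{2} \sum_{t=1}^{T} \left( \frac{1}{\eta_w} \|w^t - w\|_2^2 + \eta_w \|\hat{g}_w^t\|_2^2 - \frac{1}{\eta_w} \|w^t - \eta_w \hat{g}_w^t - w\|_2^2 \right) \\
&\le \frac{1}{2\eta_w} \sum_{t=1}^{T} \left( \|w^t - w\|_2^2 + (\eta_w)^2 \|\hat{g}_w^t\|_2^2 - \|w^{t+1} - w\|_2^2 \right) && (68) \\
&= \frac{1}{2\eta_w} \left( \|w^1 - w\|_2^2 - \|w^{T+1} - w\|_2^2 \right) + \frac{\eta_w}{2} \sum_{t=1}^{T} \|\hat{g}_w^t\|_2^2 && (69) \\
&\le \frac{1}{2\eta_w} \|w^1 - w\|_2^2 + \frac{\eta_w}{2} \sum_{t=1}^{T} \|\hat{g}_w^t\|_2^2 \\
&\le \frac{2R_W^2}{\eta_w} + \frac{\eta_w}{2} \sum_{t=1}^{T} \|\hat{g}_w^t\|_2^2 && (70)
\end{aligned}$$

where (68) follows from the projection property and (69) is the result of the telescoping sum. Since (70) holds for all w ∈ W and its right-hand side does not depend on w, we can take the maximum over w ∈ W on the left-hand side, and the expectation of both sides with respect to the stochastic gradients, to obtain

$$\mathbb{E}\left[ \max_{w \in W} \sum_{t=1}^{T} \langle w^t - w, \hat{g}_w^t \rangle \right] \le \frac{2R_W^2}{\eta_w} + \frac{\eta_w T G_w^2}{2} \tag{71}$$

where we have used Lemma 3 for the bounds on the stochastic gradients. Using analogous arguments and noting that the radius of ∆ is 1, we can show that

$$\mathbb{E}\left[ \max_{p \in \Delta} \sum_{t=1}^{T} \langle p - p^t, \hat{g}_p^t \rangle \right] \le \frac{2}{\eta_p} + \frac{\eta_p T G_p^2}{2} \tag{72}$$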

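Next, for the second term in (67), we can use the Cauchy-Schwarz Inequality and again the fact that $\max_{p \in \Delta} \|p\|_2 = 1$ to write

$$\begin{aligned}
\max_{p \in \Delta} \sum_{t=1}^{T} \langle p, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle &= \max_{p \in \Delta} \left\langle p, \sum_{t=1}^{T} \left( \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \right) \right\rangle \\
&\le \left\| \sum_{t=1}^{T} \left( \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \right) \right\|_2
\end{aligned} \tag{73}$$

Recall from Definition 1 that $\mathbb{E}[\|\nabla_p \phi(w^t, p^t) - \hat{g}_p^t\|_2^2] \le \sigma_p^2$ for all t ≥ 1. Note that because the batch selections are independent, the $\nabla_p \phi(w^t, p^t) - \hat{g}_p^t$ terms are uncorrelated random variables with mean 0 and variance at most $\sigma_p^2$, which means that $\sum_{t=1}^{T} (\nabla_p \phi(w^t, p^t) - \hat{g}_p^t)$ is a random variable with mean 0 and variance at most $T\sigma_p^2$. Since

$$\mathbb{E}\left[ \left\| \sum_{t=1}^{T} \left( \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \right) \right\|_2 \right]^2 \le \mathrm{Var}\left( \sum_{t=1}^{T} \left( \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \right) \right) \le T\sigma_p^2 \tag{74}$$

we have that $\mathbb{E}[\|\sum_{t=1}^{T} (\nabla_p \phi(w^t, p^t) - \hat{g}_p^t)\|_2] \le \sqrt{T}\sigma_p$. Using this relation after taking the expectation of both sides of (73) yields

$$\mathbb{E}\left[ \max_{p \in \Delta} \sum_{t=1}^{T} \langle p, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle \right] \le \sqrt{T}\sigma_p \tag{75}$$

Using similar arguments, this time using $R_W$ to bound $\max_{w \in W} \|w\|_2$ after the Cauchy-Schwarz step analogous to (73), we have

$$\mathbb{E}\left[ \max_{w \in W} \sum_{t=1}^{T} \langle w, \hat{g}_w^t - \nabla_w \phi(w^t, p^t) \rangle \right] \le R_W \sqrt{T} \sigma_w \tag{76}$$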
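For the third and final term in (67), note that by the Law of Iterated Expectations and the unbiasedness of the stochastic gradients, we have that for any t ≥ 1,

$$\begin{aligned}
&\mathbb{E}\left[ \langle p^t, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle - \langle w^t, \nabla_w \phi(w^t, p^t) - \hat{g}_w^t \rangle \right] \\
&\quad = \mathbb{E}\left[ \mathbb{E}\left[ \langle p^t, \nabla_p \phi(w^t, p^t) - \hat{g}_p^t \rangle - \langle w^t, \nabla_w \phi(w^t, p^t) - \hat{g}_w^t \rangle \,\big|\, w^t, p^t \right] \right] = 0
\end{aligned}$$

Recalling (66) and (67), by combining the bounds on each of the terms and dividing by T, we obtain

$$\mathbb{E}\left[ \max_{p \in \Delta} \phi(w_T^C, p) - \min_{w \in W} \phi(w, p_T^C) \right] \le \frac{2R_W^2}{\eta_w T} + \frac{\eta_w G_w^2}{2} + \frac{2}{\eta_p T} + \frac{\eta_p G_p^2}{2} + \frac{R_W \sigma_w}{\sqrt{T}} + \frac{\sigma_p}{\sqrt{T}} \tag{77}$$

We minimize the above bound by setting the step sizes as

$$\eta_w = \frac{2R_W}{G_w \sqrt{T}}, \qquad \eta_p = \frac{2}{G_p \sqrt{T}} \tag{78}$$

to complete the proof, noting that $\sigma_w \le G_w$ and $\sigma_p \le G_p$.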
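The update structure analyzed in this proof can be sketched on a toy problem. The sketch below is an illustration only, not Algorithm 1 itself: it uses exact gradients instead of stochastic ones, replaces the MAML task losses with hypothetical quadratic losses $F_i(w) = \frac{1}{2}\|w - c_i\|^2$, takes W to be a Euclidean ball of radius `R_W`, updates w and p simultaneously, and returns averaged iterates. The constants `G_w` and `G_p` are passed in as assumed gradient bounds, and the step sizes follow (78).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def project_ball(w, radius):
    """Euclidean projection onto the ball of the given radius (assumed form of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gda(centers, T, R_W, G_w, G_p):
    """Projected gradient descent in w and ascent in p on phi(w, p) = sum_i p_i F_i(w)."""
    m, d = centers.shape
    eta_w = 2.0 * R_W / (G_w * np.sqrt(T))   # step sizes from (78)
    eta_p = 2.0 / (G_p * np.sqrt(T))
    w, p = np.zeros(d), np.full(m, 1.0 / m)
    w_sum, p_sum = np.zeros(d), np.zeros(m)
    for _ in range(T):
        w_sum += w
        p_sum += p
        losses = 0.5 * np.sum((w - centers) ** 2, axis=1)    # gradient of phi in p
        grad_w = (p[:, None] * (w - centers)).sum(axis=0)    # gradient of phi in w
        w, p = project_ball(w - eta_w * grad_w, R_W), project_simplex(p + eta_p * losses)
    return w_sum / T, p_sum / T

centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w_avg, p_avg = projected_gda(centers, T=200, R_W=2.0, G_w=3.0, G_p=5.0)
print(w_avg, p_avg)
```

Averaging keeps both outputs feasible because W and ∆ are convex, which is the property the bound (66) relies on.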

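F Proof of Lemma 5

Proof. We show the smoothness of each $F_i$ by upper bounding the norm of the difference of its gradients. Using (56) and the triangle inequality,

$$\begin{aligned}
\|\nabla F_i(u) - \nabla F_i(v)\| &\le \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left(I - \alpha \nabla^2 f_i(u, \theta)\right)\left( \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right) \right] \right\| \\
&\quad + \left\| \mathbb{E}_{\theta \sim Q_i}\left[ \left( (I - \alpha \nabla^2 f_i(u, \theta)) - (I - \alpha \nabla^2 f_i(v, \theta)) \right) \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \right] \right\|
\end{aligned} \tag{79}$$

We consider the two terms on the right-hand side of (79) separately. Denoting the first term as Ξ, we use Jensen's Inequality then the Cauchy-Schwarz Inequality twice, as in (29) and (63), to obtain

$$\begin{aligned}
\Xi &\le \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| I - \alpha \nabla^2 f_i(u, \theta) \|^2 \right] \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \|^2 \right] } && (80) \\
&\le (1 + \alpha M) \sqrt{ \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \|^2 \right] } && (81)
\end{aligned}$$

where to obtain (81) we have used the M-smoothness of $f_i$. Considering the term remaining inside the square root, we have

$$\begin{aligned}
&\mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla f_i(u - \alpha \nabla f_i(u, \theta)) - \nabla f_i(v - \alpha \nabla f_i(v, \theta)) \|^2 \right] \\
&\quad \le M^2\, \mathbb{E}_{\theta \sim Q_i}\left[ \| u - \alpha \nabla f_i(u, \theta) - (v - \alpha \nabla f_i(v, \theta)) \|^2 \right] && (82) \\
&\quad = M^2\, \mathbb{E}_{\theta \sim Q_i}\left[ \|u - v\|^2 + 2\alpha (u - v)^T \left( \nabla f_i(v, \theta) - \nabla f_i(u, \theta) \right) + \alpha^2 \| \nabla f_i(u, \theta) - \nabla f_i(v, \theta) \|^2 \right] \\
&\quad = M^2 \left( \|u - v\|^2 + 2\alpha (u - v)^T\, \mathbb{E}_{\theta \sim Q_i}\left[ \nabla f_i(v, \theta) - \nabla f_i(u, \theta) \right] + \alpha^2\, \mathbb{E}_{\theta \sim Q_i}\left[ \| \nabla f_i(u, \theta) - \nabla f_i(v, \theta) \|^2 \right] \right) \\
&\quad \le M^2 \left( \|u - v\|^2 + 2\alpha M \|u - v\|^2 + \alpha^2 M^2 \|u - v\|^2 \right) && (83) \\
&\quad = M^2 (1 + \alpha M)^2 \|u - v\|^2 && (84)
\end{aligned}$$

where (83) follows from the M-smoothness of $f_i$ and the Cauchy-Schwarz Inequality. Thus we have

$$\Xi \le M (1 + \alpha M)^2 \|u - v\| \tag{85}$$

Note that we have already upper bounded the second term in (79) in the previous lemma (see Equation (65)). Thus we have that the smoothness parameter of $F_i$ is

$$M_i := M (1 + \alpha M)^2 + \alpha L H \tag{86}$$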
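The constants in Lemmas 4 and 5 can be sanity-checked on a deterministic quadratic, which is an illustrative special case rather than the general setting of the lemmas. Assume $f(w) = \frac{1}{2} w^T A w$ with the eigenvalues of A in [µ, M] and no sampling noise. The Hessian of f is constant, so its Lipschitz constant is H = 0 and the αLH terms vanish. Then $F(w) = f(w - \alpha \nabla f(w))$ has Hessian $(I - \alpha A) A (I - \alpha A)$, whose eigenvalues are $\lambda(1 - \alpha\lambda)^2$ for each eigenvalue λ of A. The helper name below is hypothetical.

```python
import numpy as np

def meta_hessian_eigs(eigs, alpha):
    """Eigenvalues of the Hessian of F(w) = f(w - alpha * grad f(w)) for quadratic f."""
    eigs = np.asarray(eigs, dtype=float)
    return eigs * (1.0 - alpha * eigs) ** 2

mu, M, alpha = 0.5, 2.0, 0.25          # example values with alpha < 1 / M
eigs_f = np.linspace(mu, M, 5)         # assumed spectrum of A
eigs_F = meta_hessian_eigs(eigs_f, alpha)

strong_convexity_bound = mu * (1.0 - alpha * M) ** 2   # Lemma 4 constant with H = 0
smoothness_bound = M * (1.0 + alpha * M) ** 2          # Lemma 5 constant with H = 0
print(eigs_F.min(), strong_convexity_bound, eigs_F.max(), smoothness_bound)
```

In this example every eigenvalue of the Hessian of F lies between the two bounds, with room to spare on the smoothness side.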

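G Proof of Proposition 1

Proof. Note that

$$\begin{aligned}
\mathbb{E}[\|\nabla_w \phi(w^{\tau_T}, p^{\tau_T})\|_2^2] &= \mathbb{E}\left[ \mathbb{E}_{\tau}\left[ \|\nabla_w \phi(w^{\tau_T}, p^{\tau_T})\|_2^2 \right] \right] && (87) \\
&= \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \|\nabla_w \phi(w^t, p^t)\|_2^2 \right] && (88) \\
&= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \|g_w^t\|^2 \right] && (89)
\end{aligned}$$

where the un-subscripted expectation on the right-hand sides of (87) and (88) is over the stochastic gradients which determine the sequence $\{(w^t, p^t)\}_t$. Thus to show the bound on $\mathbb{E}[\|\nabla_w \phi(w^{\tau_T}, p^{\tau_T})\|_2^2]$ in Proposition 1, we bound the right-hand side of (89). To do so we borrow ideas from the proof of Theorem 1 in Qian et al. [2019]. First recall that by Lemma 5, $F_i$ is M-smooth for each i ∈ {1, ..., m}. Then for any u, v ∈ W,

$$F_i(u) \le F_i(v) + \nabla F_i(v)^T (u - v) + \frac{M}{2} \|u - v\|^2 \tag{90}$$

Conditioning on the history up to iteration t, denoted by $\mathcal{F}^t$, the above inequality implies

$$\mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] \le \mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^t) + \left( \nabla_w \sum_{i=1}^{m} p_i^t F_i(w^t) \right)^T (w^{t+1} - w^t) + \frac{M}{2} \|w^{t+1} - w^t\|^2 \,\Big|\, \mathcal{F}^t \right] \tag{91}$$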

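where the expectation is conditioned on the prior stochastic gradient computations up to iteration t. Note that $\nabla_w \sum_{i=1}^{m} p_i^t F_i(w^t) = g_w^t$ and $w^{t+1} - w^t = -\eta_w \hat{g}_w^t$. Thus, we have

$$\begin{aligned}
\mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] &\le \mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^t) - \eta_w (g_w^t)^T \hat{g}_w^t + \frac{M}{2} \eta_w^2 \|\hat{g}_w^t\|^2 \,\Big|\, \mathcal{F}^t \right] && (92) \\
&= \sum_{i=1}^{m} p_i^t F_i(w^t) - \eta_w \|g_w^t\|^2 + \frac{M}{2} \eta_w^2 \left( \|g_w^t\|^2 + \mathbb{E}\left[ \|\hat{g}_w^t - g_w^t\|^2 \,\big|\, \mathcal{F}^t \right] \right) && (93)
\end{aligned}$$

where (93) follows because $\hat{g}_w^t$ is an unbiased estimate of $g_w^t$. Using Lemma 3, we have

$$\mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] \le \sum_{i=1}^{m} p_i^t F_i(w^t) - \left( \eta_w - \frac{\eta_w^2 M}{2} \right) \|g_w^t\|^2 + \frac{\eta_w^2 M \sigma_w^2}{2} \tag{94}$$

Rearranging the terms, we obtain

$$\begin{aligned}
\left( \eta_w - \frac{\eta_w^2 M}{2} \right) \|g_w^t\|^2 &\le \mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^t) - \sum_{i=1}^{m} p_i^t F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] + \frac{\eta_w^2 M \sigma_w^2}{2} && (95) \\
&= \mathbb{E}\left[ \sum_{i=1}^{m} p_i^t F_i(w^t) - \sum_{i=1}^{m} p_i^{t+1} F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] && (96) \\
&\quad + \mathbb{E}\left[ \sum_{i=1}^{m} p_i^{t+1} F_i(w^{t+1}) - \sum_{i=1}^{m} p_i^t F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] + \frac{\eta_w^2 M \sigma_w^2}{2} && (97)
\end{aligned}$$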
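We bound the second expectation in the above equation:

$$\begin{aligned}
\mathbb{E}\left[ \sum_{i=1}^{m} p_i^{t+1} F_i(w^{t+1}) - \sum_{i=1}^{m} p_i^t F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] &= \mathbb{E}\left[ \sum_{i=1}^{m} (p_i^{t+1} - p_i^t) F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] \\
&\le \mathbb{E}\left[ \|p^{t+1} - p^t\|_2 \left( \sum_{i=1}^{m} F_i(w^{t+1})^2 \right)^{1/2} \,\Big|\, \mathcal{F}^t \right] && (98) \\
&\le \sqrt{m} B\, \mathbb{E}\left[ \|p^{t+1} - p^t\|_2 \,\big|\, \mathcal{F}^t \right] && (99) \\
&\le 2\sqrt{m} B\, \mathbb{E}\left[ \|\eta_p \hat{g}_p^t\|_2 \,\big|\, \mathcal{F}^t \right] && (100) \\
&\le 2\eta_p \sqrt{m} B G_p && (101)
\end{aligned}$$

where (98) follows from the Cauchy-Schwarz Inequality, (99) follows by the bound on $f_i$ for all i, (100) follows from the update rule for p combined with the projection property (since $p^t \in \Delta$, $\|p^t - (p^t + \eta_p \hat{g}_p^t)\| \ge \|p^t - \Pi_\Delta(p^t + \eta_p \hat{g}_p^t)\|$), and (101) follows from the definition of $G_p$. Using this result, summing (97) from t = 1 to T, taking the expectation of both sides over all the stochastic gradients, and using the Law of Iterated Expectations to remove the conditioning on $\mathcal{F}^t$, we obtain

$$\begin{aligned}
\left( \eta_w - \frac{\eta_w^2 M}{2} \right) \sum_{t=1}^{T} \mathbb{E}\left[ \|g_w^t\|^2 \right] &\le \mathbb{E}\left[ \sum_{i=1}^{m} p_i^1 F_i(w^1) \right] - \mathbb{E}\left[ \sum_{i=1}^{m} p_i^{T+1} F_i(w^{T+1}) \right] + 2T\eta_p \sqrt{m} B G_p + \frac{T M \eta_w^2 \sigma_w^2}{2} \\
&\le \phi(w^1, p^1) + B + 2T\eta_p \sqrt{m} B G_p + \frac{T \eta_w^2 M \sigma_w^2}{2}
\end{aligned}$$

Next, dividing both sides by $T\left( \eta_w - \frac{\eta_w^2 M}{2} \right)$ we have

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \|g_w^t\|^2 \right] \le \frac{\phi(w^1, p^1) + B}{T\left( \eta_w - \frac{\eta_w^2 M}{2} \right)} + \frac{2\eta_p \sqrt{m} B G_p}{\eta_w - \frac{\eta_w^2 M}{2}} + \frac{\eta_w M \sigma_w^2}{2 - \eta_w M} \tag{102}$$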

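which by (89) is the desired bound on $\mathbb{E}[\|g_w^{\tau_T}\|^2]$.

Next we show the bound on the optimality of $p^{\tau_T}$. As before, we start by evaluating the expectation over τ:

$$\begin{aligned}
\mathbb{E}\left[ \phi(w^{\tau_T}, p^{\tau_T}) \right] &= \mathbb{E}\left[ \mathbb{E}_{\tau}\left[ \phi(w^{\tau_T}, p^{\tau_T}) \right] \right] && (103) \\
&= \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} \phi(w^t, p^t) \right] && (104) \\
&= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \phi(w^t, p^t) \right] && (105)
\end{aligned}$$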

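Next, since φ(w, p) is linear in p, we have that for any p ∈ ∆ and any t ∈ {1, ..., T},

$$\begin{aligned}
\mathbb{E}\left[ \phi(w^t, p) - \phi(w^t, p^t) \,\big|\, \mathcal{F}^t \right] &= \mathbb{E}\left[ (p - p^t)^T g_p^t \,\big|\, \mathcal{F}^t \right] \\
&= \mathbb{E}\left[ (p - p^t)^T \hat{g}_p^t \,\big|\, \mathcal{F}^t \right] + \mathbb{E}\left[ (p - p^t)^T (g_p^t - \hat{g}_p^t) \,\big|\, \mathcal{F}^t \right] && (106) \\
&= \mathbb{E}\left[ (p - p^t)^T \hat{g}_p^t \,\big|\, \mathcal{F}^t \right] && (107)
\end{aligned}$$

where (107) follows because $\hat{g}_p^t$ is an unbiased estimate of $g_p^t$. Using (107) and the identity $2ab = a^2 + b^2 - (a - b)^2$ with $a = p - p^t$ and $b = \eta_p \hat{g}_p^t$ yields

$$\begin{aligned}
\mathbb{E}\left[ \phi(w^t, p) - \phi(w^t, p^t) \,\big|\, \mathcal{F}^t \right] &= \mathbb{E}\left[ \frac{1}{2\eta_p} \left( \|p - p^t\|_2^2 + (\eta_p)^2 \|\hat{g}_p^t\|_2^2 - \|p - (p^t + \eta_p \hat{g}_p^t)\|_2^2 \right) \,\Big|\, \mathcal{F}^t \right] && (108) \\
&\le \mathbb{E}\left[ \frac{1}{2\eta_p} \left( \|p - p^t\|_2^2 + (\eta_p)^2 \|\hat{g}_p^t\|_2^2 - \|p - p^{t+1}\|_2^2 \right) \,\Big|\, \mathcal{F}^t \right] && (109) \\
&\le \mathbb{E}\left[ \frac{1}{2\eta_p} \left( \|p - p^t\|_2^2 + (\eta_p)^2 G_p^2 - \|p - p^{t+1}\|_2^2 \right) \,\Big|\, \mathcal{F}^t \right] && (110)
\end{aligned}$$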

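where (109) follows from the projection property and (110) follows from the definition of $G_p$. Summing from t = 1 to T, taking the expectation of both sides over all the stochastic gradients, and using the Law of Iterated Expectations to remove the conditioning on $\mathcal{F}^t$, we obtain

$$\begin{aligned}
\sum_{t=1}^{T} \mathbb{E}\left[ \phi(w^t, p) - \phi(w^t, p^t) \right] &\le \sum_{t=1}^{T} \left( \frac{1}{2\eta_p} \mathbb{E}\left[ \|p - p^t\|_2^2 \right] - \frac{1}{2\eta_p} \mathbb{E}\left[ \|p - p^{t+1}\|_2^2 \right] + \frac{\eta_p}{2} G_p^2 \right) && (112) \\
&\le \frac{1}{2\eta_p} \mathbb{E}\left[ \|p - p^1\|_2^2 \right] + \frac{\eta_p}{2} T G_p^2 && (113) \\
&\le \frac{1}{\eta_p} + \frac{\eta_p T G_p^2}{2} && (114)
\end{aligned}$$

where (113) follows from the telescoping sum and (114) follows from the fact that $p, p^1 \in \Delta$ and ∆ is contained in an $\ell_2$ ball of radius 1. Dividing both sides of (114) by T and rearranging terms gives

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \phi(w^t, p^t) \right] \ge \mathbb{E}\left[ \phi(w^{\tau_T}, p) \right] - \left( \frac{1}{\eta_p T} + \frac{\eta_p G_p^2}{2} \right)$$

Finally, since the above inequality holds for all p ∈ ∆, we maximize the right-hand side over p ∈ ∆, yielding

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \phi(w^t, p^t) \right] \ge \max_{p \in \Delta} \mathbb{E}\left[ \phi(w^{\tau_T}, p) \right] - \left( \frac{1}{\eta_p T} + \frac{\eta_p G_p^2}{2} \right)$$

From (105), the left-hand side above is equal to $\mathbb{E}[\phi(w^{\tau_T}, p^{\tau_T})]$, thus completing the proof.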

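H Proof of Proposition 2

Proof. Our proof makes analogous initial arguments to the proof of Theorem 2 in Ghadimi et al. [2016], and cites two results on the properties of the prox operation from the same paper. Let $\tilde{g}^t := P_W(w^t, p^t, \hat{g}_w^t, \eta_w^t)$ for all t ≥ 1. By the M-smoothness of $F_i$ for each i, we have equation (90), and thus for any t ∈ {1, ..., T},

$$\begin{aligned}
\sum_{i=1}^{m} p_i^t F_i(w^{t+1}) &\le \sum_{i=1}^{m} p_i^t F_i(w^t) + \left( \nabla_w \sum_{i=1}^{m} p_i^t F_i(w^t) \right)^T (w^{t+1} - w^t) + \frac{M}{2} \|w^{t+1} - w^t\|_2^2 \\
&= \sum_{i=1}^{m} p_i^t F_i(w^t) - \eta_w^t \left( \nabla_w \sum_{i=1}^{m} p_i^t F_i(w^t) \right)^T \tilde{g}^t + \frac{M}{2} (\eta_w^t)^2 \|\tilde{g}^t\|_2^2 \\
&= \sum_{i=1}^{m} p_i^t F_i(w^t) - \eta_w^t (\hat{g}_w^t)^T \tilde{g}^t + \frac{M}{2} (\eta_w^t)^2 \|\tilde{g}^t\|_2^2 + \eta_w^t (\delta_w^t)^T \tilde{g}^t
\end{aligned}$$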

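where in the last identity we have used the definitions of $\tilde{g}^t$ and $\delta_w^t$. Next, using Lemma 1 in Ghadimi et al. [2016] with $x = w^t$, $\gamma = \eta_w^t$, and $g = \hat{g}_w^t$, we obtain

$$\begin{aligned}
\sum_{i=1}^{m} p_i^t F_i(w^{t+1}) &\le \sum_{i=1}^{m} p_i^t F_i(w^t) - \left[ \eta_w^t \|\tilde{g}^t\|_2^2 + h(w^{t+1}) - h(w^t) \right] + \frac{M}{2} (\eta_w^t)^2 \|\tilde{g}^t\|_2^2 + \eta_w^t (\delta_w^t)^T \tilde{g}^t \\
&= \sum_{i=1}^{m} p_i^t F_i(w^t) - \left[ \eta_w^t \|\tilde{g}^t\|_2^2 + h(w^{t+1}) - h(w^t) \right] + \frac{M}{2} (\eta_w^t)^2 \|\tilde{g}^t\|_2^2 + \eta_w^t (\delta_w^t)^T \bar{g}^t + \eta_w^t (\delta_w^t)^T (\tilde{g}^t - \bar{g}^t)
\end{aligned}$$

where $\bar{g}^t := P_W(w^t, p^t, g_w^t, \eta_w^t)$ is the projection of the true gradient with respect to w. Thus after rearranging terms,

$$\begin{aligned}
\Phi(w^{t+1}, p^t) &\le \Phi(w^t, p^t) - \left( \eta_w^t - \frac{M}{2} (\eta_w^t)^2 \right) \|\tilde{g}^t\|_2^2 + \eta_w^t \langle \delta_w^t, \bar{g}^t \rangle + \eta_w^t \|\delta_w^t\| \|\tilde{g}^t - \bar{g}^t\| \\
&\le \Phi(w^t, p^t) - \left( \eta_w^t - \frac{M}{2} (\eta_w^t)^2 \right) \|\tilde{g}^t\|_2^2 + \eta_w^t \langle \delta_w^t, \bar{g}^t \rangle + \eta_w^t \|\delta_w^t\|^2
\end{aligned}$$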

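where the last inequality follows from Proposition 1 in Ghadimi et al. [2016] with $x = w^t$, $\gamma = \eta_w^t$, $g_1 = \hat{g}_w^t$, and $g_2 = g_w^t$. Here $\tilde{g}^t$ and $\bar{g}^t$ denote the projected stochastic gradient and the projected true gradient, respectively. Rearranging terms, we have

$$\begin{aligned}
\left( \eta_w^t - \frac{M}{2} (\eta_w^t)^2 \right) \|\tilde{g}^t\|_2^2 &\le \Phi(w^t, p^t) - \Phi(w^{t+1}, p^t) + \eta_w^t \langle \delta_w^t, \bar{g}^t \rangle + \eta_w^t \|\delta_w^t\|^2 \\
&= \left( \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \right) + \left( \Phi(w^{t+1}, p^{t+1}) - \Phi(w^{t+1}, p^t) \right) + \eta_w^t \langle \delta_w^t, \bar{g}^t \rangle + \eta_w^t \|\delta_w^t\|^2 \\
&= \left( \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \right) + \left( \phi(w^{t+1}, p^{t+1}) - \phi(w^{t+1}, p^t) \right) + \eta_w^t \langle \delta_w^t, \bar{g}^t \rangle + \eta_w^t \|\delta_w^t\|^2
\end{aligned}$$

Taking the expectation of each side with respect to the stochastic gradients, conditioned on the history up to time t, we have

$$\begin{aligned}
\left( \eta_w^t - \frac{M}{2} (\eta_w^t)^2 \right) \mathbb{E}\left[ \|\tilde{g}^t\|_2^2 \,\big|\, \mathcal{F}^t \right] &\le \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + \mathbb{E}\left[ \phi(w^{t+1}, p^{t+1}) - \phi(w^{t+1}, p^t) \,\big|\, \mathcal{F}^t \right] \\
&\quad + \eta_w^t \mathbb{E}\left[ \langle \delta_w^t, \bar{g}^t \rangle \,\big|\, \mathcal{F}^t \right] + \eta_w^t \mathbb{E}\left[ \|\delta_w^t\|^2 \,\big|\, \mathcal{F}^t \right] \\
&= \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + \mathbb{E}\left[ \sum_{i=1}^{m} (p_i^{t+1} - p_i^t) F_i(w^{t+1}) \,\Big|\, \mathcal{F}^t \right] \\
&\quad + \eta_w^t \mathbb{E}\left[ \langle \delta_w^t, \bar{g}^t \rangle \,\big|\, \mathcal{F}^t \right] + \eta_w^t \mathbb{E}\left[ \|\delta_w^t\|^2 \,\big|\, \mathcal{F}^t \right]
\end{aligned} \tag{115}$$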

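Note that we can use the Hölder Inequality to bound the second expectation in (115). Writing $\tilde{g}^t$ for the projected stochastic gradient and $\bar{g}^t$ for the projected true gradient, in doing so we obtain

$$\begin{aligned}
\left( \eta_w^t - \frac{M}{2} (\eta_w^t)^2 \right) \mathbb{E}\left[ \|\tilde{g}^t\|_2^2 \,\big|\, \mathcal{F}^t \right] &\le \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + \mathbb{E}\left[ \|p^{t+1} - p^t\|_2 \left( \sum_{i=1}^{m} F_i(w^{t+1})^2 \right)^{1/2} \,\Big|\, \mathcal{F}^t \right] \\
&\quad + \eta_w^t \mathbb{E}\left[ \langle \delta_w^t, \bar{g}^t \rangle \,\big|\, \mathcal{F}^t \right] + \eta_w^t \mathbb{E}\left[ \|\delta_w^t\|^2 \,\big|\, \mathcal{F}^t \right] \\
&\le \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + 2\sqrt{m} B\, \mathbb{E}\left[ \|\eta_p^t \hat{g}_p^t\|_2 \,\big|\, \mathcal{F}^t \right] \\
&\quad + \eta_w^t \mathbb{E}\left[ \langle \delta_w^t, \bar{g}^t \rangle \,\big|\, \mathcal{F}^t \right] + \eta_w^t \mathbb{E}\left[ \|\delta_w^t\|^2 \,\big|\, \mathcal{F}^t \right] && (116) \\
&\le \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + 2\sqrt{m} B \eta_p^t G_p \\
&\quad + \eta_w^t \mathbb{E}\left[ \langle \delta_w^t, \bar{g}^t \rangle \,\big|\, \mathcal{F}^t \right] + \eta_w^t \mathbb{E}\left[ \|\delta_w^t\|^2 \,\big|\, \mathcal{F}^t \right] && (117) \\
&\le \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + 2\sqrt{m} B \eta_p^t G_p + \eta_w^t \mathbb{E}\left[ \|\delta_w^t\|^2 \,\big|\, \mathcal{F}^t \right] && (118) \\
&\le \mathbb{E}\left[ \Phi(w^t, p^t) - \Phi(w^{t+1}, p^{t+1}) \,\big|\, \mathcal{F}^t \right] + 2\sqrt{m} B \eta_p^t G_p + \eta_w^t \sigma_w^2 && (119)
\end{aligned}$$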

where (116) follows from the definition of B and the update rule for p combined with the projection property, (117) follows from the definition of $G_p$, (118) follows from the facts that the projected true gradient is a deterministic function of the stochastic samples that determine the stochastic gradients up to time t and $\hat{g}_w^t$ is an unbiased estimate of $g_w^t$, and (119) follows from the bound on $\mathbb{E}[\|\delta_w\|^2]$ given in Lemma 3. Summing over t = 1, ..., T, setting the step sizes to be constants, taking the expectation with respect to all of the stochastic gradients, and using the Law of Iterated Expectations, we find

$$\begin{aligned}
\left( \eta_w - \frac{M}{2} (\eta_w)^2 \right) \sum_{t=1}^{T} \mathbb{E}\left[ \|\tilde{g}^t\|_2^2 \right] &\le \Phi(w^1, p^1) - \mathbb{E}\left[ \Phi(w^{T+1}, p^{T+1}) \right] + 2T\eta_p B \sqrt{m} G_p + T \eta_w \sigma_w^2 \\
&\le \Phi(w^1, p^1) + B + 2T\eta_p B \sqrt{m} G_p + T \eta_w \sigma_w^2
\end{aligned}$$

where $\tilde{g}^t$ is the projected stochastic gradient at iteration t. Next we divide both sides by $T\left( \eta_w - \frac{M}{2} (\eta_w)^2 \right)$, and use that $\Phi(w^1, p^1) = \phi(w^1, p^1)$ since $w^1 \in W$, to yield

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \|\tilde{g}^t\|_2^2 \right] \le \frac{2(\phi(w^1, p^1) + B)}{T(2\eta_w - \eta_w^2 M)} + \frac{4\eta_p \sqrt{m} B G_p}{2\eta_w - \eta_w^2 M} + \frac{2\sigma_w^2}{2 - \eta_w M}$$

By the same argument as in (87)-(89), the left-hand side of the above inequality is equal to $\mathbb{E}[\|\tilde{g}^{\tau_T}\|^2]$, thus we have completed the proof of the convergence result in w.

For the convergence with respect to p, note that the update rule for $p^{t+1}$ is identical to the update rule analyzed in Proposition 1, and the output procedure is the same for both algorithms. Furthermore, since the convergence analysis of p does not depend on the update rule for w, the analysis with respect to p in the proof of Proposition 1 still applies here, thus we have the same bound.

References

Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.

Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.

Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche . . . , 1990.

Robert S Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization fornon-convex objectives. In Advances in Neural Information Processing Systems, pages 4705–4714,2017.

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fastreinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalizedempirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-basedmodel-agnostic meta-learning algorithms. arXiv preprint arXiv:1908.10400, 2019.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation ofdeep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume70, pages 1126–1135. JMLR. org, 2017.

Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXivpreprint arXiv:1902.08438, 2019.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimilano Pontil. Bilevel pro-gramming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910,2018.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methodsfor nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305,2016.

Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pages 87–94. Springer, 2001.

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.

Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.

Mikhail Khodak, Maria-Florina Balcan, and Ameet Talwalkar. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019a.

Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pages 5915–5926, 2019b.

Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. arXiv preprint arXiv:1902.00146, 2019.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.

Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2:2, 2018.

Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, and Hao Li. Robust optimization over multiple domains. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4739–4746, 2019.

Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.

Shai Shalev-Shwartz and Yonatan Wexler. Minimizing the maximal loss: How and why. In ICML, pages 793–801, 2016.

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in neural information processing systems, pages 4077–4087, 2017.

Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3630–3638. 2016.

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Chengxiang Yin, Jian Tang, Zhiyuan Xu, and Yanzhi Wang. Adversarial meta-learning. arXiv preprint arXiv:1806.03316, 2018.

Pan Zhou, Xiaotong Yuan, Huan Xu, Shuicheng Yan, and Jiashi Feng. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, pages 1532–1542, 2019.

Zhenxun Zhuang, Yunlong Wang, Kezi Yu, and Songtao Lu. Online meta-learning on non-convex setting. arXiv preprint arXiv:1910.10196, 2019.

Daniel Zügner and Stephan Günnemann. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412, 2019.
