kabir aladin chandrasekher

111
Sharp global convergence guarantees for iterative nonconvex optimization: A Gaussian process perspective Kabir Aladin Chandrasekher , Ashwin Pananjady ? , Christos Thrampoulidis Department of Electrical Engineering, Stanford University ? Schools of Industrial & Systems Engineering and Electrical & Computer Engineering, Georgia Tech Department of Electrical & Computer Engineering, University of British Columbia September 22, 2021 Abstract We consider a general class of regression models with normally distributed covariates, and the associated nonconvex problem of fitting these models from data. We develop a general recipe for analyzing the convergence of iterative algorithms for this task from a random initialization. In particular, provided each iteration can be written as the solution to a convex optimization problem satisfying some natural conditions, we leverage Gaussian comparison theorems to derive a deterministic sequence that provides sharp upper and lower bounds on the error of the algorithm with sample-splitting. Crucially, this deterministic sequence accurately captures both the convergence rate of the algorithm and the eventual error floor in the finite-sample regime, and is distinct from the commonly used “population” sequence that results from taking the infinite-sample limit. We apply our general framework to derive several concrete consequences for parameter estimation in popular statistical models including phase retrieval and mixtures of regressions. Provided the sample size scales near- linearly in the dimension, we show sharp global convergence rates for both higher-order algorithms based on alternating updates and first-order algorithms based on subgradient descent. These corollaries, in turn, yield multiple consequences, including: (a) Proof that higher-order algorithms can converge significantly faster than their first-order counterparts (and sometimes super-linearly), even if the two share the same population update; (b) Intricacies in super-linear convergence behavior for higher-order algorithms, which can be nonstandard (e.g., with exponent 3/2) and sensitive to the noise level in the problem. We complement these results with extensive numerical experiments, which show excellent agreement with our theoretical predictions. Contents 1 Introduction 2 1.1 Motivation: Accurate deterministic predictions of convergence behavior .......... 3 1.2 Contributions and roadmap ................................... 5 1.3 Related work ........................................... 8 1.4 General notation ......................................... 10 2 Background and illustrative examples 10 2.1 Observation model ........................................ 10 2.2 Iterative algorithms ....................................... 11 3 Recipe and main result: The Gordon state evolution update 15 3.1 High-level sketch of the steps .................................. 15 3.2 Implementing the recipe: A heuristic derivation in a special case .............. 17 3.3 The general result ........................................ 19 1 arXiv:2109.09859v1 [math.OC] 20 Sep 2021

Upload: others

Post on 01-Dec-2021

20 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Kabir Aladin Chandrasekher

Sharp global convergence guarantees for iterativenonconvex optimization: A Gaussian process perspective

Kabir Aladin Chandrasekher†, Ashwin Pananjady?, Christos Thrampoulidis♦

†Department of Electrical Engineering, Stanford University?Schools of Industrial & Systems Engineering and Electrical & Computer Engineering,

Georgia Tech♦Department of Electrical & Computer Engineering, University of British Columbia

September 22, 2021

Abstract

We consider a general class of regression models with normally distributed covariates, andthe associated nonconvex problem of fitting these models from data. We develop a generalrecipe for analyzing the convergence of iterative algorithms for this task from a randominitialization. In particular, provided each iteration can be written as the solution to a convexoptimization problem satisfying some natural conditions, we leverage Gaussian comparisontheorems to derive a deterministic sequence that provides sharp upper and lower boundson the error of the algorithm with sample-splitting. Crucially, this deterministic sequenceaccurately captures both the convergence rate of the algorithm and the eventual error floorin the finite-sample regime, and is distinct from the commonly used “population” sequencethat results from taking the infinite-sample limit. We apply our general framework toderive several concrete consequences for parameter estimation in popular statistical modelsincluding phase retrieval and mixtures of regressions. Provided the sample size scales near-linearly in the dimension, we show sharp global convergence rates for both higher-orderalgorithms based on alternating updates and first-order algorithms based on subgradientdescent. These corollaries, in turn, yield multiple consequences, including:

(a) Proof that higher-order algorithms can converge significantly faster than their first-ordercounterparts (and sometimes super-linearly), even if the two share the same population update;(b) Intricacies in super-linear convergence behavior for higher-order algorithms, which canbe nonstandard (e.g., with exponent 3/2) and sensitive to the noise level in the problem.

We complement these results with extensive numerical experiments, which show excellentagreement with our theoretical predictions.

Contents

1 Introduction 2

1.1 Motivation: Accurate deterministic predictions of convergence behavior . . . . . . . . . . 3

1.2 Contributions and roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 General notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Background and illustrative examples 10

2.1 Observation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 Iterative algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Recipe and main result: The Gordon state evolution update 15

3.1 High-level sketch of the steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Implementing the recipe: A heuristic derivation in a special case . . . . . . . . . . . . . . 17

3.3 The general result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1

arX

iv:2

109.

0985

9v1

[m

ath.

OC

] 2

0 Se

p 20

21

Page 2: Kabir Aladin Chandrasekher

4 Consequences for some concrete statistical models 224.1 Phase retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.2 Mixture of regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.3 A glimpse of the convergence proof mechanism . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Numerical illustrations 335.1 Phase retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345.2 Mixture of linear regressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6 Discussion 36

7 Proof of general results, part (a): Gordon update and deviation bounds 377.1 General result from steps 1–3 of recipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387.2 Proof of Theorem 1(a) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417.3 Proof of Theorem 2(a) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

8 Proof of general results, part (b): Tighter bounds on parallel component 468.1 Proof of Lemma 5: Concentration of centered term . . . . . . . . . . . . . . . . . . . . . . 488.2 Proof of Lemma 6: Controlling the bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

9 Proofs of results for specific models 589.1 Preliminary lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589.2 Proof of Theorem 3: Alternating minimization for phase retrieval . . . . . . . . . . . . . . 609.3 Proof of Theorem 4: Subgradient descent for phase retrieval . . . . . . . . . . . . . . . . . 639.4 Proof of Theorem 5: Alternating minimization for mixtures of regressions . . . . . . . . . 659.5 Proof of Theorem 6: Subgradient descent for mixtures of regressions . . . . . . . . . . . . 67

A Heuristic derivations deferred from Section 3.2 76

B Auxiliary proofs for general results, part (a) 77B.1 Proofs of technical lemmas in steps 1–3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77B.2 Establishing growth conditions for higher-order methods . . . . . . . . . . . . . . . . . . . 83B.3 Establishing growth conditions for first-order methods . . . . . . . . . . . . . . . . . . . . 92

C Auxiliary proofs for general results, part (b) 93

D Auxiliary technical results for specific models 94D.1 Proof of Corollary 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94D.2 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95D.3 Proof of Corollary 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96D.4 Proof of Corollary 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97D.5 Proof of Fact 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98D.6 Proof of Lemma 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99D.7 Proof of Lemma 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101D.8 Proof of Lemma 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104D.9 Proof of Lemma 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

E Some elementary lemmas 107

1 Introduction

In many modern statistical estimation problems involving nonlinear observations, latent vari-ables, or missing data, the log-likelihood—when viewed as a function of the parameters ofinterest—is nonconcave. Accordingly, even though the maximum likelihood estimator enjoysfavorable statistical properties in many of these problems, the more practically relevant questionis one at the intersection of statistics and optimization: Can we, in modern problems where thedimension is typically comparable to the sample size, optimize the likelihood in a computation-ally efficient manner to produce statistically useful estimates? When viewed in isolation, many

2

Page 3: Kabir Aladin Chandrasekher

of these nonconvex model-fitting problems can be shown to be NP-hard. However, the statisti-cally relevant setting—in which data are drawn i.i.d. from a suitably “nice” distribution—givesrise to random ensembles of optimization problems that are often amenable to iterative algo-rithms. Following a decade of intense activity, a unifying picture of nonconvex optimization instatistical models has begun to emerge.

Given the rapid development of the field, a few distinct methods now exist for the analysisof these nonconvex procedures. Let us briefly discuss two salient and natural approaches. Thefirst approach is to directly work with iterates of the algorithm. In particular, one can view eachiteration as a random operator mapping the parameter space to itself, and study its properties.A typical example of this approach involves tracking some error metric between the iterates andthe “ground truth” parameter, and showing that applying the operator reduces the error at acertain rate, possibly up to an additive correction to accommodate noise in the observations (see,e.g., some of the early papers Jain et al., 2013; Loh and Wainwright, 2012). The second approachlooks instead at the landscape of the loss function that the iterative algorithm is designed tominimize, showing that once the sample size exceeds a threshold, this landscape has favorableproperties that make it amenable to iterative optimization. Several such properties have beenestablished in particular problems, including, but not limited to, properties of the loss in aneighborhood of its optimum (e.g., Loh and Wainwright, 2015; Candes et al., 2015) and theabsence of local minima with high probability (e.g., Sun et al., 2018; Ge et al., 2016).

In both recipes alluded to above, we are interested in characterizing properties of randomobjects: the sample-based operator in the first case and the sample-dependent loss function inthe second. A general-purpose tool to carry out both recipes is to first understand deterministic,population analogs of these (random) objects in the infinite-sample limit. For instance, takingthe sample size to infinity in these problems yields a population operator in the first case anda population likelihood in the second, and allows one to analyze algorithms deterministically inthis limit. Post this point, tools from empirical process theory (e.g., Balakrishnan et al., 2017;Mei et al., 2018) or more refined leave-one-out techniques (Ma et al., 2020; Chen et al., 2019)can be used to argue that sample-based versions (of the operator/loss) behave similarly. Inparticular, the first program—showing convergence of the population operator and relating thisto its sample-based analog—has established convergence of several algorithms in a variety ofsettings (e.g., Balakrishnan et al., 2017; Daskalakis et al., 2017; Tian, 2017; Chen et al., 2019;Dwivedi et al., 2020; Wu and Zhou, 2019; Xu et al., 2018; Ho et al., 2020). The overall style of theanalysis is appealing for several reasons: (a) It applies (in principle) to any iterative algorithmrun on any model-fitting problem and, (b) In contrast to the direct sample-based approach, itdoes not require the analysis of a complex recursion involving highly nonlinear functions of therandom data. In addition, decomposing the analysis into a deterministic optimization-theoreticcomponent applied to the population operator and a stochastic component that captures theeventual statistical neighborhood of convergence provides a natural two-step approach. Butdoes the population operator always provide a reliable prediction of convergence behavior inmodern, high dimensional settings?

1.1 Motivation: Accurate deterministic predictions of convergence behavior

Toward answering the question posed above, we run a simulation on what is arguably thesimplest nonlinear model resulting in a nonconvex fitting problem: phase retrieval with a realsignal. This is a regression model in which a scalar response y is related to a d-dimensionalcovariate x via E[y|x] = |〈x, θ∗〉|, and the task is to estimate θ∗ from i.i.d. observations (xi, yi).Two popular algorithms to optimize the nonconcave log-likelihood in this problem—describedin detail in Section 2.2 to follow—are given by:

(i) Alternating minimization, an algorithm that dates back to Gerchberg (1972) and Fienup(1982). Each (tuning-free) iteration is based on fixing the latent “signs” according to the

3

Page 4: Kabir Aladin Chandrasekher

current parameter and solving a least squares problem.

(ii) Subgradient descent with stepsize η. This is a simple first-order method on the negativelog-likelihood that also goes by the name of reshaped Wirtinger flow (Zhang et al., 2017).

Before running our simulation, we emphasize two aspects of it that form recurring themesthroughout the paper. First, we use a sample-splitting device: each iteration of the algorithmis executed using n fresh observations of the model, drawn independently of past iterations.This device has been used extensively in the analysis of iterative algorithms as a simplifyingassumption (e.g., Jain et al., 2013; Hardt and Wootters, 2014; Netrapalli et al., 2015; Kwonet al., 2019), and forms a natural starting point for our investigations. Second, over and abovetracking the `2 error of parameter estimation, we track a more expressive statistic over iterations.In particular, we associate each parameter θ ∈ Rd with a two-dimensional state

α(θ) = ‖Pθ∗θ‖2 and β(θ) = ‖P⊥θ∗θ‖2, (1)

where Pθ∗ denotes the projection matrix onto the one-dimensional subspace spanned by θ∗

and P⊥θ∗ denotes the projection matrix onto the orthogonal complement of this subspace. Inwords, these two scalars measure the component of θ parallel to θ∗ and perpendicular to θ∗,respectively. Iterates θt of the algorithm then give rise to a two-dimensional state evolution(αt, βt), where αt = α(θt) and βt = β(θt). As several papers in this space have pointedout (Chen et al., 2019; Wu and Zhou, 2019; Tan and Vershynin, 2019a), tracking the stateevolution instead of the evolving d-dimensional parameter provides a useful summary statisticof the algorithm’s behavior, and natural losses such as the `2 or angular loss of parameterestimation can be expressed solely in terms of the state evolution.

0 5 10 15

10−8

10−6

10−4

10−2

100

Iteration

Empirical: AM

Empirical: GD

Population

(a) The `2 error of the population update vs. `2error of the empirics.

β

α

0 5 10 15

10−8

10−6

10−4

10−2

100

Iteration

Empirical: AM

Empirical: GD

Population

(b) State evolution: α and β components (1) forthe population update and empirics.

Figure 1. Plots of behavior over iterations for both alternating minimization (AM) and sub-gradient descent (GD) with step-size 1/2, alongside the population update, which is identical for

both algorithms (see Section 4). Both algorithms are initialized at θ0 = 0.2 ·θ∗+√

1− 0.22P⊥θ∗γ,with γ uniformly distributed on the unit sphere. Additionally, the observations are noisy withnoise standard deviation 10−8. Shaded envelopes around the empirics denote 95% confidencebands over 100 independent trials.

We run our simulation in dimension d = 600, and use n = 12, 000 observations per it-eration of the algorithm. We set θ∗ to be the first standard basis vector in Rd, and sup-pose that the covariates xi ∈ Rd follow a standard Gaussian distribution. We generate the

4

Page 5: Kabir Aladin Chandrasekher

i-th response via yi = |〈xi, θ∗〉| + εi, where εi ∼ N (0, σ2). In our simulation, we considerσ = 10−8. Consider the two iterative algorithms above initialized at a randomly chosen pointθ0 = 0.2 · θ∗ +

√1− 0.22P⊥θ∗γ—with γ uniformly distributed on the unit sphere—and choosing

the stepsize η to ensure that the population operators of both algorithms coincide1. Figure 1plots both the (random) `2-error of parameter estimation and the state evolution for both al-gorithms along with the analogous quantities for the (deterministic) population update. As wemake clear shortly, the population update in the latter case takes the form of a state evolutionupdate, in that the state of the next population iterate can be computed as a deterministicfunction of the state of the current iterate. Two conclusions can be drawn immediately fromFigure 1. First, the population update is overly optimistic when predicting the convergencerates of both algorithms. Second, algorithms with the same population update can exhibit verydifferent convergence behaviors. Looking more closely at Figure 1(b), we see that while thestate evolution predictions are comparable to empirical behavior towards the left of the plot(immediately after initialization), the β prediction is no longer accurate towards the right, ina local neighborhood of θ∗. In Figure 7 in Section 5, we exhibit even more drastic situationsin which the population update predicts convergence to θ∗ whereas the empirical iterates staybounded away from it.

As our simple experiment demonstrates, the population operator is not, at least in general,a very reliable predictor of convergence behavior. The underlying reason is simply that theproblem is high dimensional: it is too simplistic to assume that the algorithm’s finite-samplebehavior will resemble the case when the sample size goes to infinity. This observation naturallyleads to the principal question that we attempt to answer in this paper:

Is there a more faithful deterministic prediction for the empirical behavior of iterativealgorithms in the high dimensional setting?

To be more specific, we would like such a deterministic update to satisfy several desiderata.First and foremost, we should be able to accurately predict the error of parameter estimationafter running one step of the update from any point, allowing us to distinguish cases in whichthe algorithm gets closer to the ground truth parameter (thereby suggesting convergence) fromotherwise. Second, the update should give us sharp predictions of convergence behavior thatdifferentiate, for instance, between linear and superlinear convergence. Such a sharp predictionfor the iteration complexity can be used in conjunction with a characterization of the per-stepcomputational cost of the algorithm to guide the choice of the fastest procedure to implementfor any given task. Third, and related to the previous point, we would like deterministicrecursions that provide both upper and lower bounds on the error of the algorithm, at least ina local neighborhood of the solution. This would allow us to rigorously compare and delineatealgorithms in terms of their convergence behavior, instead of simply comparing upper boundswith upper bounds.

1.2 Contributions and roadmap

We consider a general class of regression models with Gaussian covariates (to be introducedprecisely in Section 2), and analyze the convergence behavior of iterative algorithms run withsample-splitting. Our contributions are summarized below.

1. Gordon state evolution update: Our main contribution is to use the machinery ofGaussian comparison inequalities—in particular, the convex Gaussian minmax theorem,or CGMT for short (Thrampoulidis et al., 2015b)—to derive a deterministic Gordon stateevolution update (or Gordon update for short) that satisfies the desiderata laid out above.

1See Section 4 for the concrete setting, and explicit evaluations of the population update.

5

Page 6: Kabir Aladin Chandrasekher

0 5 10 15

10−8

10−6

10−4

10−2

100

Iteration

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(a) The `2 error of the Gordon update vs. `2 errorof the empirics.

β

α

0 5 10 15

10−8

10−6

10−4

10−2

100

Iteration

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(b) State evolution: α and β components (1) forthe Gordon update and the empirics.

Figure 2. Plots of behavior over iterations for both alternating minimization and subgradientdescent with step-size 1/2, in a simulation identical to that of Figure 1. Overlaid is the predictionfrom their respective Gordon state evolution updates. As is evident from the occlusion of thetriangle markers, the Gordon prediction exactly tracks behavior in both cases.

This update applies provided each iteration of the algorithm can be written as the solutionto a convex optimization problem satisfying some mild assumptions2.

The Gordon state evolution update is distinct from the population state evolution updatein that it involves an additive correction term, which is nonzero whenever the samplesize is finite. In particular, suppose that κ = n/d denotes the oversampling ratio used toimplement one step of the algorithm; this is defined precisely in Section 2.2. Then theperpendicular component of the Gordon update takes the form

βt+1 = βpopt+1 +O(

1

κ

)·D(αt, βt),

where βpopt+1 is the analogous component of the population state evolution update run fromthe point (αt, βt), and D : R2 → [0,∞) is some function that takes only nonnegativevalues. Thus, taking the sample size (and hence κ) to infinity, we recover the populationstate evolution update directly from the Gordon update. However, as we will see shortly,the finite-sample behavior of the update is often dominated by the term O

(1κ

)·D(α, β),

and in these scenarios the population update is a poor predictor of convergence behavior.We showcase the general recipe involved in deriving the Gordon update in Section 3.

Our recipe provides not only a deterministic update but also a finite-sample concentrationbound, showing that the empirical state evolution concentrates sharply around the pointpredicted by the Gordon update. This is illustrated for the phase retrieval simulation inFigure 2. As is clear from this figure, the Gordon update provides a sharp predictionof convergence behavior, allowing us to distinguish different types of convergence andalso providing near-exact predictions of the eventual error floor. As we explore furtherin Figure 7 (see Section 5), the Gordon update accurately captures behavior even insituations where the empirical iterates do not converge.

2In particular, although all of the scenarios we consider in this paper involve optimizing a nonconcave log-likelihood function, our recipe can also be applied to provide sharp convergence guarantees for the iterativeminimization of convex loss functions.

6

Page 7: Kabir Aladin Chandrasekher

2. Results for concrete models: While the machinery that we develop is general, weshowcase its utility by deriving global convergence guarantees (i.e., from a random initial-ization) for both higher-order and first-order algorithms in two statistical models: phaseretrieval and mixtures of regressions. Some salient takeaways are collected in Table 1.

Algorithm Model Metric Local convergence rate

Alternating minimization Phase retrieval `2 Superlinear, exponent 3/2

Subgradient descent Phase retrieval `2 Linear

Alternating minimization Mixture of regressions Angular Linear

Subgradient AM Mixture of regressions Angular Linear

Table 1. Summary of results for specific models and algorithms. In all cases, we provide globalconvergence guarantees showing that with high probability, convergence to the local neighborhoodof the ground-truth parameter takes places after a number of iterations that is logarithmic inthe dimension of the problem. Convergence rates within this neighborhood, as predicted by theGordon update, are listed above. These exactly match empirical behavior in all cases.

To summarize, for the phase retrieval model, our primary contribution is to make quan-titative the behavior observed in Figures 1 and 2. While the population update predictsquadratic convergence (i.e., superlinear convergence with exponent 2), we show that bothalternating minimization and subgradient descent behave differently from this prediction.The former algorithm does converge superlinearly but with a nonstandard exponent 3/2,while the latter converges linearly at best. For the mixture of regressions model, wepropose a first-order method termed subgradient AM, which is inspired by the closely re-lated gradient EM update (Dempster et al., 1977; Neal and Hinton, 1998). We study italongside alternating minimization, and show that while both algorithms exhibit linearconvergence in the angular metric, they are inconsistent in the `2 metric for any nonzeronoise level. We exhibit regimes in which the first-order method is competitive (in terms ofits iteration complexity) with its higher-order counterpart, suggesting that the first-ordermethod should be preferred in these regimes given its lower per-iteration cost.

3. Techniques of independent interest: Over the course of proving our results, we de-velop some techniques that may be of broader interest, three of which we highlight below.

• In proving finite-sample concentration bounds around the deterministic Gordon up-dates, we handle a family of loss functions that is strictly more general than those usedfor proving analogous results in linear models (Oymak et al., 2013; Miolane and Mon-tanari, 2021). Our techniques are based on arguing about carefully chosen growthproperties of these loss functions, and may prove useful in other non-asymptoticinstantiations of the CGMT machinery.

• Characterizing algorithmic behavior near a random initialization requires a sharperbound on the deviation of the parallel component than what is provided by thegeneral technique alluded to above. We develop a refined bound—applicable tohigher-order updates that involve a matrix inversion in each iteration—by using aleave-one-out device. This characterization allows us to replace a polylogarithmicfactor in the sample complexity bound with a doubly-iterated logarithm, and thetechnique may prove more broadly useful in analyzing other higher-order updatesfrom a random initialization.

• Finally, our local convergence analysis for particular algorithms relies on a first-order expansion of the Gordon update. In particular, we show that the Gordonupdate is contractive in a local neighborhood of the ground truth θ∗, and combinethis structural characterization with our refined concentration bounds on the samplestate evolution to show deterministic upper and lower bounds, i.e., a high-probability

7

Page 8: Kabir Aladin Chandrasekher

envelope around, the error of the empirical trajectory of the algorithm. Such atechnique may prove more broadly useful in producing sharp characterizations ofconvergence behavior in other classes of iterative algorithms.

The rest of the paper is organized as follows. In Section 2 to follow, we present the formalproblem setup and background on the models and algorithms that we use to illustrate theGordon state evolution machinery. This section also introduces the “subgradient AM” updatefor mixtures of regressions. In Section 3, we provide the recipe itself, starting with a highlevel overview of the steps and a heuristic derivation in a special case before stating our mainresults in Theorems 1 and 2. Section 4 collects consequences for two models, phase retrieval andmixtures of linear regressions, and on each we employ two algorithms, one based on alternatingprojections and another on subgradient descent. Theorems 3-6 establish global convergenceresults for all of these cases. In Section 5, we present numerical experiments to corroborateour theoretical findings. We discuss future directions in Section 6, and then turn to our proofs.Theorems 1 and 2 both have two parts; we present the proof of parts (a) in Sections 7 andthe proof of parts (b) in Section 8. Proofs of Theorems 3 through 6 are presented in a unifiedfashion in Section 9. Our appendices collect proofs of auxiliary technical lemmas.

1.3 Related work

The literature on nonconvex optimization in statistical settings is vast, and we cannot hope tocover all of it here. We refer the reader to a few recent monographs (Jain and Kar, 2017; Chenand Chi, 2018; Chi et al., 2019; Zhang et al., 2020) for surveys, and the webpage (Sun, 2021)for an ever-expanding list of relevant references. We focus in this subsection on describing afew papers that are most closely related to our contributions, categorized for convenience underthree broad headings.

Predictions in random optimization problems: As alluded to before, the populationupdate has proven useful in analyzing many algorithms in a variety of settings including Gaus-sian mixture models (Balakrishnan et al., 2017; Daskalakis et al., 2017; Xu et al., 2016), mix-tures of regressions (Balakrishnan et al., 2017; Kwon et al., 2019; Klusowski et al., 2019),phase retrieval (Chen et al., 2019), mixtures of experts (Makkuva et al., 2019), and neuralnetworks (Tian, 2017). In addition to providing local convergence guarantees, it has enabledresearchers to study the more challenging setting with random initialization (Chen et al., 2019;Dwivedi et al., 2020; Wu and Zhou, 2019), and also revealed several surprising phenomenarelated to overparameterization and stability (Xu et al., 2018; Ho et al., 2020). The Gordonupdate that we derive is a much sharper deterministic predictor of convergence behavior than itspopulation counterpart, and we hope that other surprising phenomena—over and above thosethat we present in the current paper—can be uncovered by making use of it.

In addition to papers that characterize the random loss landscape by utilizing properties ofthe population loss (e.g., Mei et al., 2018; Davis et al., 2020; Hand and Voroninski, 2019), wemention another line of inquiry—rooted in the literature on statistical physics—that leads todeterministic predictions. This framework is especially appealing when a prior on the underlyingparameter is assumed, and employs the approximate message passing (AMP) algorithm (Donohoet al., 2009, 2011; Bayati and Montanari, 2011; Montanari, 2013). AMP is carefully designedto satisfy certain (approximate) independence properties across iterates and leads to a simplestate evolution without sample-splitting; see the recent tutorial by Feng et al. (2021) for anintroduction. The analysis framework has recently been used to explore the (sub-)optimalityof first-order methods in terms of their eventual parameter estimation error (Celentano et al.,2020b), to predict computational barriers in a variety of problems including phase retrieval inhigh dimensions (Maillard et al., 2020), and to demonstrate that logistic regression is biasedin high-dimensions, thereby suggesting an asymptotic correction (Sur and Candes, 2019). In

8

Page 9: Kabir Aladin Chandrasekher

contrast to our motivation, predictions in this family are not designed with the dual goal ofcharacterizing the (optimization-theoretic) rate of convergence of various algorithms as well asthe statistical error of the eventual solution. Instead, they focus on producing a single algorithmthat eventually attains statistical optimality, which is typically a member of the AMP family.

Finally, we note that Oymak et al. (2017) focused on showing sharp time-data tradeoffs inlinear inverse problems. In particular, they considered random design linear regression wherethe underlying parameter was constrained to an arbitrary (possibly nonconvex) set, and showedthat employing projected gradient descent on the square loss with a particular choice of stepsizeenjoys a linear rate of convergence to an order-optimal neighborhood of the true parameter.They also showed that a linear rate is the best achievable when the constraint set is convex. Infollow-up work and for the same optimization algorithms, Oymak and Soltanolkotabi (2016),obtained similar results for single-index model estimation. Specifically, their measurement modelallows a nonlinear link function, but their algorithm assumes a linear one (thus, is agnostic tothe nonlinearity) following the paradigm of Brillinger (2012); Plan and Vershynin (2016). Whilethese results are compelling, they are restricted to the analysis of a single algorithm, do notprovide sharp iterate-by-iterate predictions, and their primary focus is on exploiting structurein the underlying parameter. For comparison and on the one hand, we do not explicitly modelstructure in the parameter of interest, and also require that each iteration of the algorithmsolves a convex program. On the other hand, we allow for arbitrary nonlinear models, and ourmachinery allows us to derive sharp tradeoffs applying to a broad class of iterative algorithmsthat go beyond first-order methods for linear regression.

Convergence guarantees for iterative algorithms beyond first-order updates: Asmade clear shortly, the Gordon state evolution recipe is particularly powerful when dealingwith iterative algorithms that go beyond first-order updates, and consequently involve highlynon-linear functions of the random data. There are several “direct” analyses of such higher-orderupdates in the literature on matrix factorization, mixture models, neural networks, and indexmodels, including for alternating projections (Jain et al., 2013; Gunasekar et al., 2013; Hardtand Wootters, 2014; Yi et al., 2014; Agarwal et al., 2016; Sun and Luo, 2016; Waldspurger,2018; Jagatap and Hegde, 2017; Zhang, 2020; Ghosh et al., 2019; Pananjady and Foster, 2021),composite optimization (Duchi and Ruan, 2019; Charisopoulos et al., 2021), and Gauss–Newtonmethods (Gao and Xu, 2017). For the expectation maximization (EM) algorithm and its Newton(i.e., second-order) analog, the population update has been widely used to prove parameterestimation guarantees (Balakrishnan et al., 2017; Xu et al., 2016; Ho et al., 2020), althoughconvergence in function value can be shown via other means (Xu and Jordan, 1996; Kunstneret al., 2021). All of the analyses mentioned here are only able to provide upper bounds on theparameter estimation error over iterations, and we expect that employing our recipe in thesesettings would yield either matching lower bounds or sharper convergence rates.

Gordon’s Gaussian comparison theorem in statistical models: Gordon proved his cel-ebrated minmax theorem for doubly-indexed Gaussian processes in the 1980s (Gordon, 1985,1988), which was popularized in statistical signal processing by Rudelson and Vershynin (2006);Stojnic (2009). Following a line of work (Stojnic, 2013a,c,b; Amelunxen et al., 2014; Oymaket al., 2013), a sharp version of Gordon’s result in the presence of convexity—providing bothupper and lower bounds on the minmax value—was formalized in Thrampoulidis et al. (2015b);see Thrampoulidis (2016) for broader historical context. Since then, the convex Gaussianminmax theorem (or CGMT for short) has been used to provide sharp performance guar-antees for several convex programs with Gaussian data, including regularized M-estimators(Thrampoulidis et al., 2018a,b), one-bit compressed sensing (Thrampoulidis et al., 2015a), thePhase-Max program for phase-retrieval (Dhifallah et al., 2018; Salehi et al., 2018), regularizedlogistic regression (Salehi et al., 2019; Taheri et al., 2020b, 2021; Aubin et al., 2020; Dhifal-

9

Page 10: Kabir Aladin Chandrasekher

lah and Lu, 2020), adversarial training for linear regression and classification (Javanmard andSoltanolkotabi, 2020; Javanmard et al., 2020; Taheri et al., 2020a), max-margin linear classi-fiers (Montanari et al., 2019; Deng et al., 2021; Kammoun and Alouini, 2021), distributionalcharacterization of minimum norm linear interpolators (Chang et al., 2021), and minimum `1norm interpolation and boosting (Liang and Sur, 2020). While this line of work typically usesthe Gordon machinery to provide a one-step—and asymptotic—guarantee, the results of ourpaper are obtained by using the CGMT in each step of the iterative algorithm, which requires anon-asymptotic characterization. Having said that, we note that non-asymptotic bounds havebeen obtained using the CGMT in the context of the LASSO (Oymak et al., 2013; Miolaneand Montanari, 2021; Celentano et al., 2020a) and SLOPE (Wang et al., 2019), but existingguarantees of this form appear to have been restricted to the study of sparse linear regression.

1.4 General notation

We use boldface small letters to denote vectors and boldface capital letters to denote matrices.We let sgn(v) denote the sign of a scalar v, with the convention that sgn(0) = 1. We use sgn(v)to denote the sign function applied entrywise to a vector v. Let I · denote the indicatorfunction. For p ≥ 1, let Bp(v; t) = x : ‖x − v‖p ≤ t denote the closed `p ball of radius taround a point v, with the shorthand Bp(t) = Bp(0; t); the dimension will usually be clear fromcontext. Analogously, let Bp(S; t) = x : ‖x − v‖p ≤ t for some v ∈ S denote the t-fatteningof a set S in `p-norm. For an operator A : S → S, let At := A⊗ · · · ⊗ A︸ ︷︷ ︸

t times

denote the operator

obtained by t repeated applications of A.For two sequences of non-negative reals fnn≥1 and gnn≥1, we use fn . gn to indicate

that there is a universal positive constant C such that fn ≤ Cgn for all n ≥ 1. The relationfn & gn indicates that gn . fn, and we say that fn gn if both fn . gn and fn & gn holdsimultaneously. We also use standard order notation fn = O(gn) to indicate that fn . gnand fn = O(gn) to indicate that fn . gn logc n, for a universal constant c > 0. We say thatfn = Ω(gn) (resp. fn = Ω(gn)) if gn = O(fn) (resp. gn = O(fn)). The notation fn = o(gn)is used when limn→∞ fn/gn = 0, and fn = ω(gn) when gn = o(fn). Throughout, we usec, C to denote universal positive constants, and their values may change from line to line. Alllogarithms are to the natural base unless otherwise stated.

We denote by N (µ,Σ) a normal distribution with mean µ and covariance matrix Σ. LetUnif(S) denote the uniform distribution on a set S, where the distinction between a discrete

and continuous distribution can be made from context. We say that X(d)= Y for two random

variables X and Y that are equal in distribution. For q ≥ 1 and a random variable X takingvalues in Rd, we write ‖X‖q = (E[|X|q])1/q for its Lq norm. Finally, for a real valued randomvariable X and a strictly increasing convex function ψ : R≥0 → R≥0 satisfying ψ(0) = 0, wewrite ‖X‖ψ = inft > 0 | E[ψ(t−1|X|)] ≤ 1 for its ψ-Orlicz norm. We make particular use ofthe ψq-Orlicz norm for ψq(u) = exp(|u|q)− 1. We say that X is sub-Gaussian if ‖X‖ψ2 is finiteand that X is sub-exponential if ‖X‖ψ1 is finite.

2 Background and illustrative examples

In this section, we set up our formal observation model, and a general form for the iterativealgorithms that we will study.

2.1 Observation model

Suppose that we observe i.i.d. covariate response pairs (xi, yi) generated according to the model

yi = f(〈xi,θ∗〉; qi) + εi, i = 1, 2, . . . . (2)

10

Page 11: Kabir Aladin Chandrasekher

The covariates xi are assumed to be d-dimensional and drawn i.i.d. from the standard normaldistribution N (0, Id), and the function f is some known link function. The random variableqi ∼ Q represents a possible latent variable, i.e., some source of auxiliary randomness that isunobserved, and εi represents additive noise drawn from the distribution N (0, σ2); both of theseare drawn i.i.d. Our goal is to use observations of pairs (xi, yi)i≥1 to estimate the unknownd-dimensional parameter θ∗. For the rest of this paper, we will make the assumption that‖θ∗‖2 = 1 in order to simplify statements of our theoretical results.3 Before proceeding, letus give two canonical examples of the observation model (2) that will form the focus of thispaper, illustrating why maximum likelihood estimation in these models can be computationallychallenging.

Example: Phase retrieval with a real-valued signal. Here, there is no auxiliary latentvariable, and the function f depends solely on its first argument. In the nonsmooth version, itis given by f(t; q) = |t|, so that our model for the i-th observation takes the form

yi = |〈xi,θ∗〉|+ εi. (3)

Our goal is to estimate the real-valued signal θ∗ from these covariate-response pairs. Note thatthe negative log-likelihood of our observations (X,y) is the shifted least squares objective

− log p(θ;X,y) =1

n

n∑i=1

(yi − |〈xi,θ〉|)2 + c0, (4)

where c0 is a scalar independent of θ. This is a nonconvex function of θ ∈ Rd. ♣

Example: Symmetric mixture of linear regressions. Here, the latent variables qi arechosen i.i.d. from a Rademacher distribution Unif(±1) and the function f is specified byf(t; q) = q · t. This leads to the observation model

yi = qi · 〈xi,θ∗〉+ εi (5)

for the i-th observation. The negative log-likelihood of our observations (X,y) is given by

− log p(θ;X,y) = − 1

n

n∑i=1

log

(exp

−(yi − 〈xi,θ〉)2

2σ2

+ exp

−(yi + 〈xi,θ〉)2

2σ2

)+ c0, (6)

where c0 is a scalar independent of θ. Clearly, this is a nonconvex function of θ. ♣

An important feature of estimation under the general observation model (3) that is exempli-fied by the specific cases above is that the negative log-likelihood, when viewed as a function ofthe parameter of interest, is nonconvex. Nevertheless, it is common to run iterative algorithms—beginning either from a random initialization or a carefully designed spectral initialization—toattempt to optimize the negative log-likelihood. Our focus will be on studying two such canon-ical families of iterative algorithms from a random initialization, which we introduce next andunder a general framework.

2.2 Iterative algorithms

We study iterative algorithms designed to recover θ∗ in the observation model (2) when runwith sample-splitting. In particular, suppose that at each iteration, we form a fresh batch4 of

3This assumption can be removed by straightforward means; in particular, the algorithms that we study willnot make explicit use of the fact that θ∗ is unit-norm.

4Owing to sample-splitting, the pair (X,y) can also be thought of as depending on the iteration number t,but we suppress this dependence and opt for more manageable notation.

11

Page 12: Kabir Aladin Chandrasekher

n observations by collecting the covariates in a matrix X ∈ Rn×d and the responses in a vectory ∈ Rn. By design, the pair (X,y) is statistically independent of the iterations of the algorithmthus far. At iteration t, we update our current estimate of the parameter θt to θt+1 by solvingan optimization problem of the form

θt+1 ∈ argminθ∈Rd

L(θ;θt,X,y), (7)

for some loss function L that depends implicitly on the current point θt and is formed usingthe data (X,y). In general terms, what makes the iterative algorithm tractable is that theoptimization problem (7) corresponding to each iteration is solvable efficiently. More often thannot, this is enabled by the function L being convex in θ, a property that we will exploit fruitfullyin the examples that we study.

It is important to note that owing to our sample splitting heuristic, the total sample sizewhen the iterative algorithm is run for T iterations is given by n · T . In the sequel, it is usefulto track the per-iteration oversampling ratio, given by

κ =n

d.

We will be interested in the near-linear regime of sample size in which κ scales at most poly-logarithmically5 in the dimension d.

Let us conclude by introducing some equivalent operator-theoretic notation that simplifiessome of our exposition. It is common to view a step of the algorithm through the lens of anempirical operator Tn : Rd → Rd, with

Tn(θ) = argminθ′∈Rd

L(θ′;θ,X,y) for each θ ∈ Rd. (8)

In other words, equation (7) denotes the evaluation of the operator at θt, i.e., with θt+1 = Tn(θt).Note that the operator Tn is random by virtue of randomness in the data, and that since we areinterested in the algorithm run with sample-splitting, one may view the random operator Tnas being generated i.i.d. at each iteration. Adopting this perspective, the parameter estimateobtained at iteration k when starting from an initial point θ0 is given by applying the randomoperator Tn repeatedly, so that θk = T kn (θ0).

We now discuss two specific classes of algorithms from this general perspective.

2.2.1 Higher-order update methods

The first class of methods that we consider are those that do not have an interpretation asfirst-order methods. In particular, they typically involve running least squares in each iteration.As we will see in the examples to follow, each of these algorithms can be written in the form (7)with

L(θ;θt,X,y) :=

√√√√ 1

n

n∑i=1

(ω(〈xi,θt〉, yi)− 〈xi,θ〉)2, (9a)

with ω : R2 → R denoting a weight function that is model and algorithm dependent and thesquare root is taken for convenience. The minimizer of the loss (9a) is given by

θt+1 =

( n∑i=1

xix>i

)−1( n∑i=1

ω(〈xi,θt〉, yi) · xi)

; (9b)

5In the specific examples that we study, the number of iterations T required to obtain order-optimal parameterestimates will turn out to be at most logarithmic in the dimension, so that the total sample size nT also scalesnear-linearly in dimension.

12

Page 13: Kabir Aladin Chandrasekher

in other words, we apply the empirical operator Tn : θ 7→(∑n

i=1 xix>i

)−1(∑ni=1 ω(〈xi,θ〉, yi) · xi

).

Let us provide a few examples of such methods in the specific cases (3) and (5) for concreteness.

Example: Alternating projections for phase retrieval. To motivate the first example,consider the phase retrieval model and write the corresponding negative log-likelihood (4) inthe equivalent form − log g(θ;X,y) = 1

n

∑ni=1(sgn(〈xi, θ〉) · yi − 〈xi, θ〉)2 + c0. This suggests a

heuristic that fixes the signs using the current iterate θt, and obtains θt+1 by minimizing theloss

L(θ;θt,X,y) =

√√√√ 1

n

n∑i=1

(sgn(〈xi,θt〉)yi − 〈xi,θ〉)2. (10a)

Concretely, this results in the update

θt+1 =

( n∑i=1

xix>i

)−1( n∑i=1

sgn(〈xi, θt〉) · yixi). (10b)

Clearly, this loss function/update pair takes the general form (9) with the specific choiceω(x, y) = sgn(x) · y. ♣

Example: Alternating projections for mixtures of two regressions. This algorithmstems from the observation that while the negative log-likelihood of θ—given by equation (6)—may be difficult to optimize, the likelihood of the pair (θ, q) is often easier to reason about. Inparticular, writing

− log h(θ, q;X,y) =1

n

n∑i=1

(yi − qi · 〈xi,θ〉)2 + c0 ,

for a scalar c0 that is independent of the pair (θ, q), notice that the function − log h is nowindividually convex in each of θ and q. This suggests an alternating update algorithm: Supposethat the current parameter is θt; then for each i, the minimizer of − log h over qi ∈ −1, 1 isgiven by argminq∈−1,1 |yi − q · 〈xi, θt〉| = sgn(yi〈xi,θt〉). This in turn yields the one-step lossfunction

L(θ;θt,X,y) =

√√√√ 1

n

n∑i=1

(sgn(yi〈xi,θt〉) · yi − 〈xi,θ〉)2 (11a)

and the corresponding update

θt+1 =

( n∑i=1

xix>i

)−1( n∑i=1

sgn(yi〈xi, θt〉) · yixi), (11b)

which takes the general form (9) with ω(x, y) = sgn(yx) · y. ♣

We note in passing that alternating projections for mixtures of linear regressions coincideswith the expectation maximization (EM) algorithm (Dempster et al., 1977) when σ = 0, andthat the machinery that we develop also applies to the EM algorithm. Let us now turn to asecond class of (simpler) iterative algorithms.

13

Page 14: Kabir Aladin Chandrasekher

2.2.2 First order methods

The second class of methods that we analyze are first-order versions of counterparts presentedabove. As we will see shortly, each of these methods can also be written in the form (7) with

L(θ;θt,X,y) :=1

2‖θ‖22 − 〈θ,θt〉+

n

n∑i=1

ω(〈xi,θt〉, yi

)〈xi,θ〉, (12a)

where η > 0 denotes a stepsize and ω is some weight function. It is important to note that thefunction ω will be distinct for the higher-order update and its first-order analog.

Minimizing the loss function (12a) over θ, the update in this case can be written as

θt+1 = θt − η ·2

n

n∑i=1

ω(〈xi,θt〉, yi

)· xi, (12b)

which resembles a gradient update and induces the operatorTn : θ 7→ θt − η · 2

n

∑ni=1 ω

(〈xi,θt〉, yi

)· xi. Examples are collected below.

Example: Subgradient descent for nonsmooth phase retrieval. Our first example isgiven by the subgradient descent algorithm on the objective (4). In particular, straightforwardcalculation yields that one iteration of this algorithm run with stepsize η takes the form

θt+1 = θt − η ·

2

n

n∑i=1

sgn(〈xi,θt〉) · (|〈xi,θt〉| − yi) · xi

, (13a)

which in turn is the minimizer of the loss function

L(θ;θt,X,y) =1

2‖θ‖22 − 〈θ,θt〉+

n

n∑i=1

sgn(〈xi,θt〉) · (|〈xi,θt〉| − yi) · 〈xi,θ〉. (13b)

This takes the general form (12) with ω(x, y) = sgn(x) · (|x| − y) = x− sgn(x) · y. ♣

Example: Subgradient AM for mixtures of regressions. This update is obtained byrunning subgradient descent on the loss function in equation (11a). In particular, running thisalgorithm with stepsize η yields

θt+1 = θt − η ·

2

n

n∑i=1

(〈xi, θt〉 − sgn(yi〈xi,θt〉) · yi) · xi

, (14a)

which is clearly the minimizer of the loss

L(θ;θt,X,y) =1

2‖θ‖22 − 〈θ,θt〉+

n

n∑i=1

(〈xi, θt〉 − sgn(yi〈xi,θt〉) · yi) · 〈xi, θ〉. (14b)

These expressions take the general form (12) with ω(x, y) = x− sgn(xy) · y.We note that this algorithm is analogous to a gradient EM update (Neal and Hinton, 1998)

in that it is obtained via a first order method applied to the one-step loss function derivedwith the objective of performing alternating minimization. However, to our knowledge, thisalgorithm has not been considered before in the literature on mixtures of linear regressions. ♣

Remark 1. As noted before, the weight functions of the higher-order and first order updates cor-responding to a particular model do not coincide. However, note that in the examples presentedabove, we have

ωFO(x, y) = x− ωHO(x, y),

where ωHO and ωFO denote the higher-order and first-order weight function, respectively.

Having introduced illustrative examples, we are now well-placed to introduce our generalrecipe for establishing convergence guarantees on iterative algorithms.

14

Page 15: Kabir Aladin Chandrasekher

3 Recipe and main result: The Gordon state evolution update

We are now ready to describe the Gordon state evolution update in detail. We begin witha high-level overview, in Section 3.1, of the steps involved in the recipe, and then provide aheuristic but illustrative derivation for a specific algorithm in Section 3.2. Having conveyed thehigh-level intuition about how one might derive these updates in concrete problems, we thenproceed to a rigorous result, in Section 3.3, showing that the empirical iteration concentratesaround the Gordon state evolution update.

3.1 High-level sketch of the steps

We begin with the ansatz—which will be intuitively justified in the heuristic derivation ofSection 3.2 and proved rigorously when establishing the main results to follow—that it sufficesto track the two dimensional state evolution (α(θ), β(θ)) defined in equation (1). In particular,when one step of the algorithm is run from the parameter θt to obtain θt+1, we are interestedin a deterministic prediction (αt+1, βt+1) for the random pair (α(θt+1), β(θt+1)) that is (a) afunction only of the pair (α(θt), β(θt)), and (b) accurate up to a small error. We use severalsteps to derive such a deterministic state evolution update. Let us begin by introducing theconvex Gaussian minmax theorem, or CGMT, which forms the bedrock of our recipe.

Proposition 1 (CGMT (Thrampoulidis et al., 2015b)). Let G denote an n×d standard Gaus-sian random matrix, and let γd ∈ Rd and γn ∈ Rn denote standard Gaussian random vectorsdrawn independently of each other and of G. Let L ∈ Rd×d and M ∈ Rn×n denote two fixedmatrices. Also, let U ⊆ Rd and V ⊆ Rn denote compact sets, and let Q : U × V → R denote acontinuous function. Define

P (G) := minu∈U

maxv∈V〈Mv,GLu〉+Q(u,v) and (15a)

A(γn,γd) := minu∈U

maxv∈V

‖Mv‖2 · 〈γd,Lu〉+ ‖Lu‖2 · 〈γn,Mv〉+Q(u,v). (15b)

Then

(a) For all t ∈ R, we have

P P (G) ≤ t ≤ 2P A(γn,γd) ≤ t .

(b) If, in addition, the sets U ,V are convex and the function Q is convex-concave, then forall t ∈ R, we have

P P (G) ≥ t ≤ 2P A(γn,γd) ≥ t .

Strictly speaking, Proposition 1 is a generalization of the result appearing in Thrampoulidiset al. (2015b), which is stated without the matrix pair (L,M). However, its proof follows identi-cally, and we choose to state the more general result since it is most useful for our development.Following the terminology from Thrampoulidis et al. (2015b), we refer to equation (15a) asthe primary optimization problem or PO, and to equation (15b) as the auxiliary optimizationproblem or AO. Having stated the CGMT, let us now provide a rough outline of the stepsinvolved in deriving the Gordon state evolution update. These are then concretely instantiatedin heuristic derivations carried out in Section 3.2. In this section, we will deliberately avoidtechnical details; Section 7 to follow makes all the steps rigorous in the general case, along theway to proving our main results in Theorems 1 and 2.

15

Page 16: Kabir Aladin Chandrasekher

Step 1: Write one iteration of algorithm as solution to convex optimization problem.As alluded to in Section 2.2, each iteration of most algorithms—even on nonconvex log-likelihoodfunctions—can be written as the solution to a convex optimization problem (7). To recall thismore explicitly, suppose that running one step of the algorithm from the parameter θt resultsin the update θt+1 = argminθ∈Rd L(θ;θt,X,y), where L is convex in θ for each fixed triple(θt,X,y). This was indeed the case in all the illustrative examples in Section 2, but is truemore broadly with many iterative algorithms.

Step 2: Write equivalent auxiliary optimization problem. In this step, our goal is towrite the minimization of the loss function L—which is a function of the Gaussian design matrixX—as a simpler minimization involving fewer Gaussian random variables. In particular, wewould like to show that

minθ∈Rd

L(θ;θt,X,y) ≈ minθ∈Rd

L(θ;θt,γd,γn), (16)

where γd and γn denote (either n or d-dimensional) standard Gaussian random vectors and the≈ symbol denotes some form of approximate equality in distribution motivated by Proposition 1.The latter optimization problem is typically easier to solve and admits a representation in termsof a small number of decision variables (Thrampoulidis et al., 2018a).

The key workhorse in this step is the CGMT, and the program typically consists of twosubsteps:

(i) Frame optimization problem in the form (15a): First, we show that there exists a stan-dard Gaussian random matrix G and a pair of fixed matrices (L,M) such that the convexoptimization problem (7) can be written in the form (15a), i.e.,

minθ∈RdL(θ;θt,X,y)

(d)= min

u∈Umaxv∈V〈Mv,GLu〉+Q(u,v)︸ ︷︷ ︸

P (G)

,

where the pair of decision variables (u,v) is determined by the parameters (θ,θt). Here,it is important to note that the function Q may depend on randomness independent of G.

(ii) Invoke the CGMT and formulate the auxiliary optimization problem: Next, we use theCGMT to simplify the problem. Proposition 1 shows that P (G) is very well approximated(in distribution) by A(γn,γd), and moreover, the optimization problem (15b) involves twoGaussian random vectors γd and γn and in many cases is easier to solve. Applying thisleads to an equivalence of the form (16), as desired.

Step 3: Scalarize to obtain deterministic Gordon state evolution update: As men-tioned before, writing the optimization problem in terms of the objective L was motivated bythe fact that this objective could be scalarized in terms of a low dimensional function. In thisstep, our goal is to establish the approximate equivalence

minθ∈Rd

L(θ;θt,γd,γn) ≈ minξ

L(ξ; ξt), (17)

where L(·; ξt) : R3 → R is a deterministic function solely of a three-dimensional state, andmoreover, depends on the previous iterate only through its own three-dimensional state ξt. Theminimizers of the optimization problem on the RHS—along with some algebraic simplification—will then yield the deterministic, two-dimensional Gordon state evolution update (αt+1, βt+1) asalluded to in the ansatz. As before, this step is typically accomplished via two further substeps:

16

Page 17: Kabir Aladin Chandrasekher

(i) Argue equivalence to a random low-dimensional function Ln: This is often easy to do justvia a change of variables, expressing the d-dimensional parameters θ and θt in terms oftheir respective states ξ and ξt. It is important to note however that the objective functionLn : R3 → R that results from this transformation is still random.

(ii) Use the LLN to obtain population loss L = limn→∞ Ln, and solve: The key techniqueenabled by the scalarization above is that since we are now in low (i.e., 3) dimensions,passing to the population loss still provides an accurate prediction of behavior even whenn is moderately large. Solving for the minimizers of L can be done readily; typically, thesolutions to this low-dimensional optimization problem will coincide with the solutions toa nonlinear system of equations (in three variables)6.

Step 4: Argue that the empirical state evolution is tracked by the Gordon update.The final step is to use growth properties of the objective functions L and Ln around theirminima to show that if their optimum values coincide, then so must their optimizers. This isthe most technical step of the recipe, and a large portion of the proof is dedicated to establishingthese properties.

The following subsection clarifies these abstract steps by carrying out a concrete derivationon an example. We emphasize that the derivations are heuristic and aim to illustrate the recipe.We defer precise statements and their proofs to Section 3.3.

3.2 Implementing the recipe: A heuristic derivation in a special case

To illustrate the steps sketched above, we present a heuristic derivation of the Gordon updatein the case of alternating minimization for noiseless phase retrieval (10).

Step 1: One-step update as a convex optimization problem. Letting denote theHadamard product between two vectors of the same dimension, notice that the update whenrun from θt is given by

θt+1 = argminθ∈Rd

1√n‖Xθ − sgn(Xθt) y‖2,

which is clearly the minimizer of the convex loss L(θ;θt,X,y) = 1√n‖Xθ − sgn(Xθt) y‖2.

Step 2: Equivalent auxiliary optimization problem. Let us detail the two substepsindividually:

(i) Frame optimization problem in the form (15a): First, observe that via the dual normcharacterization of the `2 norm, we have

L(θ;θt,X,y) =1√n‖Xθ − sgn(Xθt) y‖2 = max

‖v‖2≤1

1√n〈v,Xθ〉 − 1√

n〈v, sgn(Xθt) y〉.

The first term in the RHS above is bilinear in the Gaussian random matrix X, but thesecond term also depends on X and so does not immediately take the form required inequation (15a). To remedy this issue, consider the fixed subspace St = span(θ∗,θt) andwrite

L(θ;θt,X,y) = max‖v‖2≤1

1√n〈v,XP⊥Stθ〉+

1√n〈v,XPStθ〉 −

1√n〈v, sgn(Xθt) y〉,

6In the examples, we consider in this paper, the solutions turn out to be computable in closed form.

17

Page 18: Kabir Aladin Chandrasekher

where PSt and P⊥St denote projection matrices onto the subspaces St and S⊥t , respectively.By construction, the first term on the RHS is independent of the rest, and so we mayreplace the matrix X in this term with an independent copy G to obtain

L(θ;θt,X,y)(d)= max‖v‖2≤1

1√n〈v,GP⊥Stθ〉+Q(v,θ),

where Q(v,θ) = 1√n〈v,XPStθ〉 − 1√

n〈v, sgn(Xθt) y〉 is now independent of G. This

leads to the definition

P (G) = minθ∈Rd

max‖v‖2≤1

1√n〈v,GP⊥Stθ〉+Q(v,θ),

which takes the form (15a).

(ii) Invoke the CGMT and approximate the minimum of loss function. Given that this is a

heuristic derivation, we ignore for the moment that the set Rd is not compact and use theCGMT to write P (G) ≈ A(γn,γd), where

A(γn,γd) = minθ∈Rd

max‖v‖2≤1

1√n‖v‖2〈γd,P⊥Stθ〉+

1√n‖P⊥Stθ‖2〈γn,v〉+Q(v,u),

and the approximation T1 ≈ T2 signifies that the CDFs of the two random variables T1 andT2 match up to a factor 2 (see Proposition 1). Note that heuristically speaking, we haveshown through the previous steps that minθ∈Rd L(θ;θt,X,y) ≈ minθ∈Rd L(θ;θt,γd,γn),where L(θ;θt,γd,γn) has the variational representation

max‖v‖2≤1

1√n‖v‖2 ·〈γd,P⊥Stθ〉+

1√n‖P⊥Stθ‖2 ·〈v,γn〉+

1√n〈v,XPStθ〉−

1√n〈v, sgn(Xθt)y〉.

Step 3: Scalarize and obtain Gordon update. We now scalarize the problem by intro-ducing the change of variables

α = 〈θ,θ∗〉, µ =〈θ,P⊥θ∗θt〉‖P⊥θ∗θt‖2

, and ν = ‖P⊥Stθ‖2, (18)

where as before α denotes the projection of the decision variable θ onto the ground-truth θ∗

(since by assumption ‖θ∗‖2 = 1), but the perpendicular component β (cf. (1)) has been splitinto two further components based on the current iterate θt. The scalar µ is the projection of θonto the component of the current iterate θt orthogonal to the ground-truth (i.e., onto the unitvector P⊥θ∗θt/‖P⊥θ∗θt‖2), and the scalar ν is the magnitude of the portion of θ orthogonal to thesubspace spanned by the ground-truth and the current iterate. Analogously, let αt = 〈θt,θ∗〉and βt = ‖P⊥θ∗θt‖2 and define the independent, Gaussian random vectors z1 = Xθ∗ and

z2 =XP⊥

θ∗θt

‖P⊥θ∗θt‖2

. With this notation, we have

Xθt = αtz1 + βtz2, y = |z1|, and XPStθ = αz1 + µz2.

Use these to define, for two scalar Gaussian variates (Z1, Z2), the random variableΩt = sgn(αtZ1 + βtZ2)|Z1| as well as the random vector ωt = sgn(αtz1 + βtz2)|z1|. A sequenceof steps, detailed in Appendix A, implements both substeps referenced above to show that

minθ∈Rd

L(θ;θt,X,y) ≈ minα∈R,µ∈R,ν≥0

(−ν‖P⊥Stγd‖2√

n+

1√n‖ωt − αz1 − µz2 − νγn‖2

)+

(19a)

≈ minα∈R,µ∈R,ν≥0

(− ν√

κ+

√E(

Ωt − νH − αZ1 − µZ2

)2)+, (19b)

18

Page 19: Kabir Aladin Chandrasekher

where we have used the shorthand x+ = maxx, 0. Letting ξ = (α, µ, ν) and ξt = (αt, µt, νt),notice that the RHS of Eq. (19a) is given by minimizing a random loss Ln(ξ; ξt) over all ξ ∈R2 × [0,∞) and the RHS of Eq. (19b) is given by minimizing a deterministic loss L(ξ; ξt) overthe same domain.

Since this computation only involves optimizing over a few variables, it can be shown viastraightforward calculation—detailed for convenience in Appendix A—that the minimizers ofthe RHS in equation (19b) are given by

α = EZ1Ωt, µ = EZ2Ωt, and ν =

√EΩ2

t − (EZ1Ωt)2 − (EZ2Ωt)2

κ− 1. (20)

Letting φt = tan−1(βt/αt), some calculation shows that

EZ1Ωt = 1− 1

π(2φt − sin(2φt)), EZ2Ωt =

2

πsin2(φt), and EΩ2

t = 1.

Finally, recalling the change of variables (18) and noting that ‖P⊥θ∗θ‖2 =√ν2 + µ2, we have

the Gordon state evolution update

αt+1 = 1− 1

π(2φt − sin(2φt)), and

βt+1 =

√4

π2sin4(φt) +

1− (1− 1π (2φt − sin(2φt)))2 − 4

π2 sin4(φt)

κ− 1.

(21)

Step 4: Random state evolution is tracked by Gordon update: The final step is toshow that both |〈θt+1, θ

∗〉 − αt+1| and |‖P⊥θ∗θt+1‖2 − βt+1| are small, so that the deterministicGordon update faithfully tracks the random pair αt+1 = 〈θt+1, θ

∗〉 and βt+1 = ‖P⊥θ∗θt+1‖2.This is achieved by showing (a) a growth condition (typically strong convexity) around theminimum of the scalarized auxiliary loss Ln and (b) that the empirical minimizers

ξn := (αn, µn, νn) = argminα∈R,µ∈R,ν≥0

Ln(α, µ, ν)

are close to the deterministic state ξ = (α, µ, ν). With these two ingredients in hand, we showthat for any vector θ for which α(θ), µ(θ), or ν(θ) is far from α, µ, or ν, respectively, the valueL(θ) is far from the deterministic value (19b). Additionally, we show that the minimum of Lover the entire domain Rd is close to the deterministic value (19b). Thus, it must be the casethat if θ is the minimizer of the loss L, then the quantities α(θ), µ(θ), and ν(θ) are close to therespective deterministic quantities α, µ, and ν.

Remark 2. Two key observations to make at this juncture are that (a) the Gordon state evo-lution update can be run from any point θ ∈ Rd, not just θt, and (b) the update equations (21)define a map (αt, βt) 7→ (αt+1, βt+1). As postulated in the ansatz at the beginning of Section 3.1,the Gordon state evolution update takes the form of a state evolution operator, mapping R2 toitself. This will also be true in our other specific examples, and so we use this terminology inthe sequel alongside the notation Sgor = (αgor, βgor) to denote this operator.

With the intuition gained from this heuristic derivation, we are now in a position to stateour general result obtained via this recipe.

3.3 The general result

We now formally derive and prove concentration of the one-step Gordon updates for higher-order and first-order methods run on a generic class of problems. As observed in Remark 2, the

19

Page 20: Kabir Aladin Chandrasekher

Gordon state evolution update is well-defined when run from any current iterate θ. Accordingly,fix an arbitrary d-dimensional parameter θ and consider the one-step update (8), restated belowfor convenience

Tn(θ) ∈ arg minθ′L(θ′;θ,X,y), (22)

where the loss function takes either of the forms in equations (9) or (12). For convenience, usethe shorthand

(α, β) = (α(θ), β(θ)) and (α+, β+) = (α(Tn(θ)), β(Tn(θ))). (23)

The main result of this section shows that for algorithms whose one-step updates take theform (9) or (12), the pair (α+, β+) concentrates around the deterministic Gordon state evolutionupdate run from (α, β), i.e., the pair Sgor(α, β).

This result holds under some mild assumptions on the weight function used to define thesealgorithms. In particular, recall that the losses in equations (9) and (12) are parameterized by aweight function ω : R× R→ R. Also recall the model (2), and let Q denote a random variabledrawn from the latent variable distribution Q. Let (Z1, Z2, Z3) denote a triple of i.i.d. standardGaussian vectors, and let

Ω = ω(αZ1 + βZ2 , f(Z1;Q) + σZ3

). (24)

The first assumption requires that this random variable is light-tailed. The second assumptionis technical, and requires a lower bound on a particular functional of Ω.

Assumption 1. The random variable Ω (24) is sub-Gaussian with Orlicz norm bounded as‖Ω‖ψ2 ≤ K1, for some parameter K1 > 0.

Assumption 2. For a parameter K2 > 0, we have

E[Ω2]− (E[Z1Ω])2 − (E[Z2Ω])2 ≥ K2.

We show in Section 4 to follow that several models and algorithms satisfy Assumptions 1and 2. Before stating our main result, it is helpful to first define the deterministic Gordonupdates themselves.

Definition 1 (Gordon state evolution update: Higher-order methods). Let (Z1, Z2, Z3) denotea triple of independent standard Gaussian random variables and use these to define the randomvariable Ω as in equation (24). If the loss function L is as in equation (9), then define

αgor = E[Z1Ω] and βgor =

√(E[Z2Ω])2 +

1

κ− 1

(E[Ω2]− (E[Z1Ω])2 − (E[Z2Ω])2

). (25)

Next, we state the update for first-order methods, assuming that7 η ≤ 1/2.

Definition 2 (Gordon state evolution update: First-order methods). Suppose η ≤ 1/2. Let(Z1, Z2, Z3) denote a triple of independent standard Gaussian random variables and use theseto define the random variable Ω as in equation (24). If L is as in equation (12), then define

αgor = α− 2η · E [Z1Ω] and βgor =

√(β − 2η · E [Z2Ω])2 +

4η2

κ· E [Ω2]. (26)

7For larger stepsizes, similar update equations still apply, but some delicacy is required to handle the signscorrectly.

20

Page 21: Kabir Aladin Chandrasekher

In the sequel, we will evaluate the expressions in equations (25) and (26) for concrete modelsand algorithms. However, at this level of generality, a salient similarity between higher-orderand first-order updates is already apparent, since it can be shown that the population stateevolution update can be obtained by taking κ→∞ in its Gordon counterpart.

Remark 3 (Population updates coincide for stepsize η = 1/2). Set η = 1/2 and send κ→∞,so that the Gordon update now coincides with its population counterpart. Then using Remark 1to relate the weight functions for higher and first-order updates, we obtain that the two Gordonupdates in equations (25) and (26) coincide. On the other hand, for finite κ, these updates arealways distinct.

We return to explore this phenomenon in Section 4 to follow, deriving convergence guaranteesfor first-order updates when η = 1/2 by using the Gordon state evolution update in place of thepopulation update. But first, we state our main results characterizing the concentration of therandom pair (α+, β+) around (αgor, βgor). We state two very similar theorems for conveniencesince they apply under a slightly different set of assumptions. The first theorem applies tohigher-order updates under both Assumptions 1 and 2, and the second theorem applies tofirst-order updates but requires only Assumption 1 to hold.

Theorem 1 (Higher-order deterministic prediction). Consider the general model (2) forthe data, and procedures that obey the general one-step update (9). Recall the shorthand(α, β, α+, β+) from equation (23). Suppose that Assumptions 1 and 2 hold on the associ-ated weight function ω with parameters K1 and K2, respectively. Consider the pair of scalars(αgor, βgor) from Definition 1. There exists a universal positive constant C1 as well as a pairof positive constants (CK , C

′K) depending only on the pair (K1,K2) such that the following is

true. If κ ≥ C1, then

(a) Provided we further have n ≥ C ′K · log(1/δ), the perpendicular component satisfies

P

|β+ − βgor| ≥ CK

(log(1/δ)

n

)1/4≤ δ, and (27a)

(b) The parallel component satisfies

P

∣∣α+ − αgor∣∣ ≥ CK ( log7(1/δ)

n

)1/2≤ δ. (27b)

The main theorem for first-order methods is extremely similar, except that we make theassumption8 α∨β ≤ 3/2 and obtain sharper logarithmic factors. We also state the theorem forstepsize η ≤ 1/2 for convenience.

Theorem 2 (First-order deterministic prediction). Consider the general model (2) for the data,and procedures that obey the general one-step update (12) for some η ≤ 1/2. Recall the shorthand(α, β, α+, β+) from equation (23) and assume that α ∨ β ≤ 3/2. Suppose that Assumption 1holds on the associated weight function ω with parameter K1. Consider the pair of scalars(αgor, βgor) from Definition 2. There exists a universal positive constant C1 as well as a pair ofpositive constants (CK , C

′K), depending only on K1 such that the following is true. If κ ≥ C1,

then

8This assumption is not required for higher-order methods because the sub-Gaussianity of the ω functionsuffices to ensure that the pair (αgor, βgor) remains bounded (see Definition 1). The same is not true for first-ordermethods; as is evident from Definition 2, we also require the pair (α, β) to be bounded.

21

Page 22: Kabir Aladin Chandrasekher

(a) Provided we further have n ≥ C ′K · log(1/δ), the perpendicular component satisfies

P

|β+ − βgor| ≥ CK

(log(1/δ)

n

)1/4≤ δ, and (28a)

(b) The parallel component satisfies

P

∣∣α+ − αgor∣∣ ≥ CK ( log(1/δ)

n

)1/2≤ δ. (28b)

We formally derive the one-step Gordon updates (αgor, βgor) in a unified fashion for boththese theorems in Section 7, with rigorous justifications of the steps outlined in Sections 3.1and 3.2. In particular, this program is carried out under a weaker set of assumptions on theone-step loss function, which includes equations (9) and (12) as special cases (see Assumption 3in Section 7). Section 7 also provides a proof that both α+ and β+ concentrate at the rateO(n−1/4) around their Gordon counterparts, thereby proving part (a) of both theorems. InSection 8, we refine the concentration rate for the parallel component α and establish part (b)of both theorems.

It should be emphasized that Theorems 1 and 2 are both non-asymptotic results, in contrastto results typically derived using the CGMT machinery. A non-asymptotic characterization isessential in our case because we intend to apply these results iteratively, once per step of thealgorithm. As alluded to in the heuristic derivation, our proof of the O(n−1/4) rate of concen-tration of the pair (α+, β+) around the deterministic update follows a generic proof techniquereasoning about the growth properties of the scalarized loss function around its minimum, andgeneralizes results from the linear case (Miolane and Montanari, 2021). This technique mayprove to be of independent interest in other applications of the CGMT.

While a deviation result of O(n−1/4) can be obtained via this general technique, this rate doesnot suffice for the parallel update α+ near a random initialization. In particular, for a randominitialization we have α d−1/2, and it can be shown that the deterministic prediction arisingfrom one step of the algorithm also satisfies αgor d−1/2. Thus, showing that α+ is withinO(n−1/4) of αgor is only a nontrivial statement—guaranteeing say a nonzero parallel componentat the next step—when n = Ω(d2), or equivalently, when κ = Ω(d). On the other hand, wewould like to prove global convergence in the regime κ = O(1), and so dedicate significanteffort to improving this concentration result to O(n−1/2), thereby allowing us to obtain part(b) of the theorems. This proof, presented in Section 8, requires significant subtlety especiallyfor higher-order algorithms since the update (9b) involves a matrix inversion. We employ aleave-one-out trick to show a sharpened version of a result by Zhang (2020), and believe thatthis technique will prove more broadly useful in analyzing other higher-order updates from arandom initialization. Our refined characterization for the parallel component α+ raises thequestion of whether deviation of β+ can also be improved to O(n−1/2). While we conjecturethat this is indeed the case, we leave this question open for future investigation, turning nowto deriving corollaries of the main theorems in two specific models.

4 Consequences for some concrete statistical models

In this section, we state consequences of our main results for two specific models and algorithms,although it is important to note that the Gordon recipe itself—as sketched in the previoussection—is much more broadly applicable. In particular, we will consider phase retrieval anda symmetric mixture of linear regressions, as well as the algorithms covered in Section 2. Itis important to note that in both these models, the global sign of the parameter θ∗ is not

22

Page 23: Kabir Aladin Chandrasekher

identifiable from observations, and so parameter estimates should be assessed in terms of their“distance” to the set −θ∗,θ∗.

As mentioned before, we track the two-dimensional state (α(θ), β(θ)) of each parameterθ ∈ Rd, with α(θ) = 〈θ, θ∗〉 and β(θ) = ‖P⊥θ∗θ‖2. The sign ambiguity will be resolved by theinitialization, so we assume throughout that α(θ) ≥ 0 for parameters θ that we consider. Forany two-dimensional state evolution element ζ = (α, β), define two metrics

d`2(ζ) :=√

(1− α)2 + β2 and d∠(ζ) := tan−1(β/α). (29)

When α = α(θ) and β = β(θ), the quantity d`2(α, β) measures the `2 distance between θ andthe set −θ∗,θ∗, i.e., we have d`2(α, β) = min‖θ − θ∗‖2, ‖θ + θ∗‖2. Similarly, the angularmetric satisfies d∠(α, β) = min∠ (θ,θ∗) ,∠ (θ,−θ∗).

As alluded to in the previous sections (see Remark 2), a state evolution operator is a functionmapping R2 to itself. We begin with a few useful definitions for such operators. First, for anystate evolution operator S, recall that St denotes the operator formed by t iterated applicationsof S. Next, we define an S-faithful state evolution operator.

Definition 3 (S-faithful operator). For a set S ⊆ R2, a state evolution operator S : R2 → R2

is said to be S-faithful if S(ζ) ∈ S for all ζ ∈ S.

Next, we present two formal definitions of convergence rates, measuring linear (geometric)and faster-than-linear convergence.

Definition 4 (Linear convergence of state evolution). For parameters 0 < c ≤ C < 1, a stateevolution operator S : R2 → R2 is said to exhibit (c, C, t0)-linear convergence in the metric dwithin the set S to level ε if S is S-faithful, and for all ζ ∈ S, we have

c · d(St(ζ)) +ε

2≤ d(St+1(ζ)) ≤ C · d(St(ζ)) + ε for all t ≥ t0. (30)

Definition 5 (Super-linear convergence). Set parameters 0 < c ≤ C and λ > 1, and supposethat S ⊆ ζ : d(ζ) ≤ C1−λ. A state evolution operator S : R2 → R2 is said to exhibit(c, C, λ, t0)-super-linear convergence in the metric d within the set S to level ε if S is S-faithful,and for all ζ ∈ S, we have

c · [d(St(ζ))]λ +ε

2≤ d(St+1(ζ)) ≤ C · [d(St(ζ))]λ + ε for all t ≥ t0. (31)

A few comments on our definitions are worth making. First, note that both definitionsrequire both upper and lower bounds on the per-step behavior of the algorithm, where thebounds apply after a “transient” period of t0 iterations. This is a key feature of our framework, inthat we are able to exactly characterize the convergence behavior as opposed to solely providingupper bounds. Both upper and lower bounds are characterized both by a rate of decrease ofthe error (linear in the case of equation (30) and super-linear in the case of equation (31)) andthe eventual statistical neighborhood ε. Second, our choice of defining the lower bounds inequations (30) and (31) with ε/2 is arbitrary; any absolute constant other than 2 preserves thequalitative convergence behavior.

As is common in the analysis of nonconvex optimization problems, our convergence guaranteewill be established in two stages. In the first stage, we will show that the algorithm converges(typically slowly) to a “good region” around the optimal solution; once in the good region,the algorithm converges much faster. For both of the models that we consider, the followingdefinition of the good region suffices. It is important to note that the numerical constants inthis definition have not been optimized to be sharp.

Definition 6 (Good region). Define the region

G = (α, β) | 0.55 ≤ α ≤ 1.05, and α/β ≥ 5.

With slight abuse of terminology, we say that θ ∈ G if (α(θ), β(θ)) ∈ G.

23

Page 24: Kabir Aladin Chandrasekher

We are now in a position to present our guarantees for two specific models: phase retrievaland a symmetric mixture of linear regressions.

4.1 Phase retrieval

Our first example is the phase retrieval model (3). We characterize the convergence behaviorof both the alternating minimization algorithm and the subgradient descent method for thismodel.

4.1.1 Alternating minimization

Recall from equation (10b) that the empirical update run from the point θ is given by

Tn(θ) =

(1

n

n∑i=1

xix>i

)−1 (1

n

n∑i=1

sgn(〈xi, θ〉) · yi · xi

). (32)

The following corollary follows from Theorem 1; in it, we state both the explicit Gordon stateevolution and the concentration of the empirical iterates assuming that the update is run fromsome arbitrary “current” point θ. Its proof can be found in Appendix D.1.

Corollary 1. Let α = α(θ) and β = β(θ) with ζ = (α, β) and φ = tan−1(βα

). Let (αgor, βgor) =

Sgor(ζ) denote the Gordon state evolution from Definition 1.(a) We have

αgor = 1− 1

π(2φ− sin(2φ)) and (33a)

βgor =

√4

π2sin4(φ) +

1

κ− 1

(1− (1− 1

π(2φ− sin(2φ)))2 − 4

π2sin4(φ) + σ2

). (33b)

(b) Suppose σ > 0. There is a constant Cσ > 0 depending only on σ such that the fol-lowing holds. With Tn as defined in equation (32), the empirical state evolution (α+, β+) =(α(Tn(θ)), β(Tn(θ))) satisfies

P

|α+ − αgor| ≤ Cσ

(log7(1/δ)

n

)1/2≤ δ and P

|β+ − βgor| ≤ Cσ

(log(1/δ)

n

)1/4≤ δ.

From equation (33), it is possible to recover the following population update by lettingκ→∞, which is given by

αpop = 1− 1

π(2φ− sin(2φ)) and βpop =

2

πsin2(φ). (34)

It is easy to show that the population state evolution predicts super-linear convergence withexponent 2 (i.e., quadratic convergence) in the good region. The following fact is proved inAppendix D.5.

Fact 1. The population state evolution Spop = (αpop, βpop) is ( 120 , 1, λ, t0)-super-linearly con-

vergent in the `2 metric9 d`2 within the region G to level ε = 0, where λ = 2 and t0 = 1.

However, the following theorem shows that the empirics are instead tracked faithfully bythe Gordon state evolution, which converges more slowly than the population state evolution.

9In fact, the population state evolution (34) enjoys global quadratic convergence in the angular metric d∠; seeRemark 5 in the appendix.

24

Page 25: Kabir Aladin Chandrasekher

Theorem 3. Consider the alternating minimization update Tn from equation (32) and theassociated Gordon state evolution update Sgor from equation (33). There is a universal positiveconstant C such that the following is true. If κ ≥ C(1 + σ2), then:(a) The Gordon state evolution update

Sgor is (cκ, Cκ, λ, t0)-super-linearly convergent in the `2 metric d`2 within G to level εn,d = σ√κ

,

where 0 ≤ cκ ≤ Cκ ≤ 1 are constants depending solely on κ, and we have

λ = 3/2 and t0 = 1.

(b) If σ > 0, then there exist Cσ, C′σ > 0 depending only on σ such that for all n ≥ C ′σ and for

any θ such that ζ = (α(θ), β(θ)) ∈ G, we have

max1≤t≤T

|d`2(Stgor(ζ))− ‖T tn(θ)− θ∗‖2| ≤ Cσ(

log n

n

)1/4

with probability exceeding 1− 2Tn−10.(c) Suppose θ0 denotes a point such that α(θ0)

β(θ0) ≥ 150√d

and further suppose that

κ ≥ C ′′σ · log7(

1+log dδ

)for C ′′σ depending solely on σ. Then for some t′ ≤ C log d, we have

T t′n (θ0) ∈ G

with probability exceeding 1− δ.

Note that if θ0 is chosen at random from the d-dimensional unit ball B2(1) with d ≥ 130,

then we have α(θ0)β(θ0) ≥

150√d

with probability at least 0.95 (see Lemma 24(a) in the appendix).

Theorem 3 then shows that after τ = O(log d+ log log(κ/σ2)) iterations, the empirics satisfy

‖T τn (θ)− θ∗‖2 = O

√d

n

)+ O

(n−1/4

)(35)

with high probability. Concretely, after taking O(log d) steps to converge to the good region G,the AM update converges very fast to within statistical error of the optimal parameter.

Some remarks on specific aspects of Theorem 3 are in order. First, note that this theo-rem predicts super-linear convergence with nonstandard exponent 3/2 whenever κ is boundedabove. Comparing with Fact 1, we see that the population update is overly optimistic, and thiscorroborates what we saw in Figures 1 and 2 in the introduction. Nonstandard super-linearconvergence was recently observed in the noiseless case of this problem (Ghosh and Ramchan-dran, 2020), but a larger exponent was conjectured. Theorem 3 shows that the exponent 3/2is indeed sharp, since we obtain both upper and lower bounds on the error of the algorithm.Furthermore, the convergence rate is super-linear with exponent 3/2 for every value of the noiselevel. As we will see shortly, this is not the case for the closely related model of a symmetricmixture of regressions, in which the convergence rate of this algorithm is linear for any constantnoise level.

Second, note that part (b) of the theorem shows that the (random) empirical state evolutionis within `2 distance n−1/4 of its (deterministic) Gordon counterpart once the iterates enter thegood region. Consequently, the final result (35) on the empirical error has two terms. Note thatthis error is dominated by the σ/

√κ term in modern high dimensional problems.

Third, our convergence result is global, and holds from a random initialization. In particular,part (c) of the theorem guarantees that within O(log d) iterations, the iterations enter the goodregion G, at which point parts (a) and (b) of the theorem become active. Convergence froma random initialization is also established by showing that the empirical state evolution tracks

25

Page 26: Kabir Aladin Chandrasekher

its Gordon counterpart closely. But rather than showing two deterministic envelopes aroundthe empirical trajectory, we leverage closeness of the updates iterate-by-iterate. It is worthnoting that this is the only step that requires the condition n d log7(log d); all other stepsonly require sample complexity that is linear in dimension.

Finally, we note that our assumption that σ2/κ be bounded above by a universal constantshould not be viewed as restrictive. If this condition does not hold, then one can show usingour analysis that running just one step of the algorithm from a random initialization alreadysatisfies ‖θ1−θ∗‖2 = O(1) = O(σ2/κ), thereby providing an estimate with order-optimal error.

4.1.2 Subgradient descent

To contrast with the super-linear convergence shown in the previous section, we now considersubgradient descent with step-size 1/2. As alluded to in Remark 3 and shown explicitly be-low, this update shares the same population update as AM, considered before. As derived inequation (13a), the general subgradient method for PR is given by the update

Tn(θ) = θ − 2η

n·n∑i=1

(|〈xi, θ〉| − yi) · sgn(〈xi, θ〉) · xi, (36)

where η > 0 denotes the step-size. The Gordon state evolution update is given by the followingcorollary of Theorem 2, proved in Appendix D.2.

Corollary 2. Let α = α(θ) and β = β(θ) with φ = tan−1(βα

). Let (αgor, βgor) = Sgor(α, β)

denote the Gordon state evolution update for the subgradient descent operator (36), given byDefinition 2. Let η ≤ 1/2.(a) We have

αgor = (1− 2η)α+ 2η

(1− 1

π(2φ− sin(2φ))

), and (37a)

βgor =(

(1− 2η)β + 2η · 2

πsin2 φ

2

+4η2

κ

α2 + β2 − 2α

(1− 1

π(2φ− sin(2φ))

)− 2β · 2

πsin2 φ+ 1 + σ2

)1/2.

(37b)

(b) Suppose σ > 0 and α ∨ β ≤ 3/2. Then there is a positive constant Cσ depending onlyon σ such that with Tn as defined in equation (36), the empirical state evolution (α+, β+) =(α(Tn(θ)), β(Tn(θ))) satisfies

P

|α+ − αgor| ≤ Cσ

(log(1/δ)

n

)1/2≤ δ and P

|β+ − βgor| ≤ Cσ

(log(1/δ)

n

)1/4≤ δ.

Sending κ → ∞ in equation (37) recovers the infinite-sample population state evolutionupdate

αpop = (1− 2η)α+ 2η

(1− 1

π(2φ− sin(2φ))

)and

βpop = (1− 2η)β + 2η · 2

πsin2 φ. (38)

As previously noted, our interest10 will be in analyzing the special case η = 1/2 so as tocompare and contrast with the AM update. In this case, the population updates (34) and (38)

10Our techniques are also applicable to analyzing the algorithm with general stepsize η, but we do not do soin this paper since a variety of other analysis methods tailored to first order updates (e.g., Zhang et al., 2017;Chen et al., 2019; Tan and Vershynin, 2019b) also work in this case.

26

Page 27: Kabir Aladin Chandrasekher

coincide, and so Fact 1 suggests that subgradient descent ought to converge quadratically fast.This would be quite surprising for a first-order method, and already suggests that the popu-lation update may be even more optimistic than before. However, the Gordon state evolutionupdates (33) and (37) are distinct even when η = 1/2, and as we saw before, these providemuch more faithful predictions of convergence behavior.

Theorem 4. Consider the subgradient descent update Tn (36) and the associated Gordon stateevolution update Sgor from equation (37), with stepsize η = 1/2. There is a universal positiveconstant C such that the following is true. If κ ≥ C(1 + σ2), then:

(a) The Gordon state evolution update

Sgor is (cκ, Cκ, 0)-linearly convergent in the `2 metric d`2 on G to level εn,d = σ√κ

.

Here 0 ≤ cκ ≤ Cκ < 1 are constants depending solely on κ.

(b) Suppose σ > 0. Then there are positive constants Cσ, C′σ depending only on σ such that for

all n ≥ C ′σ and for any θ such that ζ = (α(θ), β(θ)) ∈ G, we have

max1≤t≤T

|d`2(Stgor(ζ))− ‖T tn(θ)− θ∗‖2| ≤ Cσ(

log n

n

)1/4

with probability exceeding 1− 2Tn−10.

(c) Suppose θ0 denotes a point such that α(θ0)β(θ0) ≥

150√d

and α(θ0) ∨ β(θ0) ≤ 3/2, and further

suppose that κ ≥ C ′′σ · log(

1+log dδ

)for C ′′σ depending solely on σ. Then for some t′ ≤ C log d,

we have

T t′n (θ0) ∈ G

with probability exceeding 1− δ.

To be concrete once again, suppose d ≥ 130. Then using n ≥ d observations (xi, yi)ni=1

from the model (3) and setting θ0 =√

1n

∑ni=1 y

2i · u with the vector u chosen uniformly at

random from the unit ball, we obtain the required initialization condition with probabilitygreater than 0.95 (see Lemma 24(b) in the appendix). The theorem then guarantees that forsome τ = O(log d+ log(κ/σ2)), the empirics satisfy

‖T τn (θ0)− θ∗‖ = O

√d

n

)+ O

(n−1/4

)(39)

with high probability. Given our extensive discussion of Theorem 3 and that most of thesecomments also apply here, we make just one remark in passing that focuses on the difference.Note that as expected, Theorem 4 shows that subgradient descent only converges linearly in thegood region. This corroborates what we saw in Figures 1 and 2, and shows once again—andmore dramatically than before—that the (quadratically convergent) population update can besignificantly optimistic in predicting convergence behavior.

4.2 Mixture of regressions

While the symmetric mixture of linear regressions model is statistically equivalent (for parameterestimation) to the phase retrieval model without additive noise (i.e., σ = 0), we show in thissection that the models and their associated algorithms have distinct behavior for any nonzeronoise level.

27

Page 28: Kabir Aladin Chandrasekher

4.2.1 Alternating minimization

Recall from equation (11b) that the empirical update applied at θ is given by

Tn(θ) =

(1

n

n∑i=1

xix>i

)−1 (1

n

n∑i=1

sgn(yi · 〈xi, θ〉) · yi · xi

). (40)

The Gordon updates are given by the following corollary of Theorem 1, proved in Appendix D.3.Before stating it, we define the convenient shorthand

Aσ(ρ) :=2

πtan−1

(√ρ2 + σ2 + σ2ρ2

)and Bσ(ρ) :=

2

π

√ρ2 + σ2 + σ2ρ2

1 + ρ2. (41)

Corollary 3. Let α = α(θ) and β = β(θ) with ζ = (α, β) and ρ = βα . Let (αgor, βgor) = Sgor(ζ)

denote the Gordon state evolution update in this case, given by Definition 1.(a) Using the shorthand (41), we have

αgor = 1−Aσ(ρ) +Bσ(ρ), and (42a)

βgor =

√ρ2Bσ(ρ)2 +

1

κ− 1(1 + σ2 − (1−Aσ(ρ) +Bσ(ρ))2 − ρ2Bσ(ρ)2). (42b)

(b) Suppose σ > 0. Then there is a positive constant Cσ depending only on σ such that withTn as defined in equation (40), the empirical state evolution (α+, β+) = (α(Tn(θ)), β(Tn(θ)))satisfies

P

|α+ − αgor| ≤ Cσ

(log7(1/δ)

n

)1/2≤ δ and P

|β+ − βgor| ≤ Cσ

(log(1/δ)

n

)1/4≤ δ.

By taking κ→∞ in equation (42), we recover the population update for this case, given by

αpop = 1−Aσ(ρ) +Bσ(ρ) and βpop = ρBσ(ρ). (43)

The update (43) has no dependence on κ and thus cannot recover the noise floor of the problem.On the other hand, and similarly to before, the following theorem shows that the empirics aretracked instead by the Gordon update (42).

Theorem 5. Consider the alternating minimization update Tn given in equation (40) and theassociated Gordon state evolution update Sgor (42). There are universal positive constants (c, C)such that the following is true. If κ ≥ C and 0 < σ ≤ c, then:(a) The Gordon state evolution update

Sgor is (cκ,σ, Cκ,σ, 0)-linearly convergent in the angular metric d∠ on G to level εn,d = σ√κ

,

where 0 ≤ cκ,σ ≤ Cκ,σ ≤ 1 are constants depending solely on the pair (κ, σ).(b) If n ≥ C ′σ, then for any θ such that ζ = (α(θ), β(θ)) ∈ G, we have

max1≤t≤T

|d∠(Stgor(ζ))− ∠(θ,θ∗)| ≤ Cσ(

log n

n

)1/4

with probability exceeding 1− 2Tn−10. Here Cσ and C ′σ are positive constants depending solelyon σ.(c) Suppose θ0 denotes a point such that α(θ0)

β(θ0) ≥ 150√d

and further suppose that

κ ≥ C ′′σ · log7(

1+log dδ

)for C ′′σ depending solely on σ. Then for some t′ ≤ C log d, we have

T t′n (θ0) ∈ G

with probability exceeding 1− δ.

28

Page 29: Kabir Aladin Chandrasekher

Owing to the discussion following Theorem 4 (see Lemma 24 in the appendix), we deducethat with a random initialization θ0 and after τ = O(log d + log log(κ/σ2)) iterations, theempirics satisfy

∠ (T τn (θ0),θ∗) = O

√d

n

)+ O

(n−1/4

)(44)

with high probability.Let us make a few remarks to compare and contrast Theorem 5 with our previous results.

First, note that the convergence result proved here is in the angular metric d∠ and not in the(stronger) `2 metric d`2 . This is a crucial difference between the phase retrieval and mixture ofregressions models. Indeed, the parameter estimate for AM in mixtures of regressions can beshown to be inconsistent in the `2 distance; to see this, note that when θ = θ∗, we have αgor =1 + Θ(σ3). Combining this estimate with part (b) of Corollary 3, we see that d`2(α+, β+) =Θ(σ3) + o(1), and so for any constant noise level, the algorithm is not consistent. Inconsistencyof parameter estimation is a known phenomenon for alternating minimization algorithms inmixture models with noise (for instance, a similar conclusion follows from the results of Lu andZhou (2016) on the label recovery error of Lloyd’s algorithm in a Gaussian mixture).

Second, note that when σ = 0, the mixture of regressions and phase retrieval models coin-cide. However, when there is noise, the convergence behavior predicted by Theorem 5 changesdrastically to a linear rate, while in phase retrieval, super-linear convergence is preserved evenwhen the noise level is nonzero (cf. Theorem 3). The Gordon update—and the ensuing sharp-ness of our upper and lower bounds of the error of the algorithm—enable us to make thisdistinction.

Finally, note that our assumption on the noise level in this case is that σ (as opposed toσ/√κ) be bounded above by a universal constant, resulting in a more stringent condition than

what we required in phase retrieval. While we make this assumption for convenience in ourproof, we conjecture that it can be weakened to accommodate the optimal condition σ/

√κ ≤ c.

4.2.2 Subgradient AM

For completeness, we also present corollaries for the subgradient version of the AM update. Asmentioned before, we are not aware of this algorithm having been considered in the literature,but it is natural for us to study it since when the stepsize η = 1/2, it shares the same populationupdate as AM (see Remark 3). Given that AM converges linearly for a mixture of regressions,it is natural to ask if the first-order method—which has a much lower per-iteration cost—enjoysa similar convergence rate. As derived in equation (14a), the update with stepsize η is given by

Tn(θ) = θ − 2η

n·n∑i=1

(sgn(yi〈xi, θ〉) · 〈xi, θ〉 − yi) · sgn(yi〈xi, θ〉) · xi. (45)

The Gordon state evolution update in this case is given by the following corollary of Theorem 2,proved in Appendix D.4.

Corollary 4. Let α = α(θ) and β = β(θ) with ζ = (α, β) and ρ = βα . Let (αgor, βgor) = Sgor(ζ)

denote the Gordon state evolution corresponding to the update (45), given by Definition 2. Letη ≤ 1/2.(a) Using the shorthand (41), we have

αgor = (1− 2η)α+ 2η · (1−Aσ(ρ) +Bσ(ρ)) , and (46a)

βgor =(

(1− 2η)β + 2η · ρBσ(ρ)2

+4η2

κ

α2 + β2 − 2α(1−Aσ(ρ) +Bσ(ρ))− 2βρBσ(ρ) + 1 + σ2

)1/2. (46b)

29

Page 30: Kabir Aladin Chandrasekher

(b) Suppose σ > 0. Then there is a positive constant Cσ depending solely on σ such that withTn as defined in equation (36), the empirical state evolution (α+, β+) = (α(Tn(θ)), β(Tn(θ)))satisfies

P

|α+ − αgor| ≤ Cσ

(log(1/δ)

n

)1/2≤ δ and P

|β+ − βgor| ≤ Cσ

(log(1/δ)

n

)1/4≤ δ.

Sending κ→∞ recovers the population update

αpop = (1− 2η)α+ 2η · (1−Aσ(ρ) +Bσ(ρ)) and βpop = (1− 2η)β + 2η · ρBσ(ρ). (47)

Once again, our interest will be in analyzing the special case η = 1/2, in which case thepopulation updates (47) and (43) of both the first-order and higher-order algorithm coincide.The following theorem establishes a sharp characterization of the convergence behavior of thesubgradient method.

0 2 4 6 8 10 12

10−2

10−1

100

Iteration

∠(θ∗ ,θt)

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(a) Subgradient descent and alternating minimiza-tion for MLR with σ = 0.05 and κ = 20.

0 2 4 6 8 10 12

10−1

100

Iteration

∠(θ∗ ,θt)

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(b) Subgradient descent and alternating minimiza-tion for MLR with σ = 0.25 and κ = 100.

Figure 3. The AM and subgradient AM algorithms for two settings of κ and σ initialized withθ0 = θ0 = 0.5 · θ∗ +

√1− 0.52P⊥θ∗γ, for γ uniformly distributed on the unit sphere, plotted

with the respective Gordon predictions. The shaded region denotes the values taken between theminimum and maximum of the empirics.

Theorem 6. Let the stepsize η = 1/2 and consider the subgradient update Tn (40) and theassociated Gordon state evolution update Sgor (46). There are universal positive constants (c, C)such that the following is true. If κ ≥ C and σ ≤ c, then:(a) The Gordon state evolution update

Sgor is (cκ,σ, Cκ,σ, 1)-linearly convergent in the angular metric d∠ on G to level εn,d = σ√κ

,

where 0 ≤ cκ,σ ≤ Cκ,σ ≤ 1 are constants depending solely on the pair (κ, σ).(b) If n ≥ C ′σ, then for any θ such that ζ = (α(θ), β(θ)) ∈ G, we have

max1≤t≤T

|d∠(Stgor(ζ))− ∠(θ,θ∗)| ≤ Cσ(

log n

n

)1/4

with probability exceeding 1− 2Tn−10. Here C ′σ and Cσ are positive constants depending solelyon σ.

30

Page 31: Kabir Aladin Chandrasekher

(c) Suppose θ0 denotes a point such that α(θ0)β(θ0) ≥

150√d

and α(θ0) ∨ β(θ0) ≤ 3/2, and further

suppose that κ ≥ C ′′σ · log(

1+log dδ

)for C ′′σ depending solely on σ. Then for some t′ ≤ C log d,

we have

T t′n (θ0) ∈ G

with probability exceeding 1− δ.

As in the case of subgradient descent for phase retrieval (see also Lemma 24(b) in the

appendix), we see that if θ0 =√

1n

∑ni=1 y

2i · u for a random vector u chosen from the unit

sphere, then after τ = O(log d+ log log(κ/σ2)) iterations, the empirics satisfy

∠ (T τn (θ0),θ∗) = O

√d

n

)+ O

(n−1/4

)(48)

with high probability.The fact that both subgradient descent and alternating minimization (cf. Theorem 5) con-

verge linearly in the good region suggest that the first order method, which has smaller per-iteration cost, may be a good choice for a mixture of linear regressions. A closer look at theproof suggests that the corresponding coefficients of contraction Cκ,σ may be comparable foreven moderately large κ. Indeed, this is illustrated in Figure 3, where we see two settings ofthe pair (κ, σ) in which both algorithms exhibit nearly identical behavior. This observationprovides further evidence that the subgradient method is a compelling choice in such scenarios.

4.3 A glimpse of the convergence proof mechanism

To conclude this section, we provide a high level overview of our convergence proof technique,aspects of which may be of independent interest. A schematic of the proof mechanism ispresented in Figure 4. The blue curve in the panel (Top) represents the empirical state evolution(αt, βt), and our proof technique relies on tracking the transitions of this curve across threephases. Points (α, β) in Phase I are such that the ratio β/α is greater than some threshold.Phase II is characterized by β/α being between two distinct thresholds. Phase III correspondsto being in the good region G, in which the ratio β/α is smaller than some small threshold andthe parallel component α is larger than a threshold (see Definition 6). In each phase, depictedin detail in the (Left), (Right), and (Bottom) plots of Figure 4, we track particular Gordon stateevolution updates using red dots. The shaded light blue regions schematically depict confidencesets that show how each empirical iterate is “trapped” around its Gordon counterpart with highprobability. In Phases I and II, we track Gordon state evolution updates when run from the“worst possible” empirical iterate in the previous confidence set, depicted in the figure usinglight blue triangles. In Phase III, on the other hand, we track the full Gordon trajectory, i.e., thedeterministic sequence of points that results from iteratively running the Gordon update fromthe initial dark blue triangle. The behavior of the Gordon update itself is model-dependentand governed by specific structural properties of the corresponding state evolution maps. Weestablish these properties in Section 9.1, and use them to establish part (a) of all our theoremsin this section. For now, let us sketch the key ideas underlying our treatment of the empiricaliterates in each phase.

Phase I: Immediately after initialization, the parallel component α is very small, of the orderd−1/2. To show that the empirical iterates proceed favorably through Phase I, we use the factthat the Gordon state evolution αgor ≥ (1 + c)α whenever β/α is large, thereby increasingthe parallel component exponentially within this phase. The O(n−1/2) concentration of theempirical αt update around its Gordon prediction traps each empirical iterate αt within a small

31

Page 32: Kabir Aladin Chandrasekher

<latexit sha1_base64="a7YqkBbA1xhFAP7RlPgy5wTlbxw=">AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0mkoseiF48V7Ae0oUy2m3btZhN2N0IJ/Q9ePCji1f/jzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqKGvSWMSqE6BmgkvWNNwI1kkUwygQrB2Mb2d++4kpzWP5YCYJ8yMcSh5yisZKrR6KZIT9csWtunOQVeLlpAI5Gv3yV28Q0zRi0lCBWnc9NzF+hspwKti01Es1S5COcci6lkqMmPaz+bVTcmaVAQljZUsaMld/T2QYaT2JAtsZoRnpZW8m/ud1UxNe+xmXSWqYpItFYSqIicnsdTLgilEjJpYgVdzeSugIFVJjAyrZELzll1dJ66Lq1aqX97VK/SaPowgncArn4MEV1OEOGtAECo/wDK/w5sTOi/PufCxaC04+cwx/4Hz+AI4ZjyA=</latexit>

<latexit sha1_base64="GmQ8gpHDbR/zaLrYtPjo00gqMnw=">AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0WPRi8cKpi20oWy2m3bpZhN2J0IJ/Q1ePCji1R/kzX/jts1BWx8MPN6bYWZemEph0HW/ndLa+sbmVnm7srO7t39QPTxqmSTTjPsskYnuhNRwKRT3UaDknVRzGoeSt8Px3cxvP3FtRKIecZLyIKZDJSLBKFrJ74Ucab9ac+vuHGSVeAWpQYFmv/rVGyQsi7lCJqkxXc9NMcipRsEkn1Z6meEpZWM65F1LFY25CfL5sVNyZpUBiRJtSyGZq78nchobM4lD2xlTHJllbyb+53UzjG6CXKg0Q67YYlGUSYIJmX1OBkJzhnJiCWVa2FsJG1FNGdp8KjYEb/nlVdK6qHuX9auHy1rjtoijDCdwCufgwTU04B6a4AMDAc/wCm+Ocl6cd+dj0Vpyiplj+APn8wfGbo6s</latexit>

G

<latexit sha1_base64="4UPtFwgQXv37ejA9kdLDg05a6co=">AAAB8XicbVDLSgMxFL1TX7W+qi7dBIvgqsyIosuiC11WsA9sh5JJb9vQTGZIMkIZ+hduXCji1r9x59+YaWehrQcCh3PuJeeeIBZcG9f9dgorq2vrG8XN0tb2zu5eef+gqaNEMWywSESqHVCNgktsGG4EtmOFNAwEtoLxTea3nlBpHskHM4nRD+lQ8gFn1FjpsRtSMwqC9HbaK1fcqjsDWSZeTiqQo94rf3X7EUtClIYJqnXHc2Pjp1QZzgROS91EY0zZmA6xY6mkIWo/nSWekhOr9MkgUvZJQ2bq742UhlpPwsBOZgn1opeJ/3mdxAyu/JTLODEo2fyjQSKIiUh2PulzhcyIiSWUKW6zEjaiijJjSyrZErzFk5dJ86zqnVcv7s8rteu8jiIcwTGcggeXUIM7qEMDGEh4hld4c7Tz4rw7H/PRgpPvHMIfOJ8/r0CQ7w==</latexit>

Phase I Phase II

Phase III

(Left) (Right)

(Bottom)

(Top)

Figure 4. A schematic showing convergence of the algorithm in terms of its state space represen-tation (α, β), in three distinct phases. (Top) The triangles in dark blue (and the correspondingcurved line) denote the empirical iterates as they proceed through three phases. The panels(Left), (Right) and (Bottom) are zoomed-in versions of Phases I, II, and III, respectively, wherethe dark blue triangle at the start of the phase depicts the point of the trajectory within thatphase. Each red circle in these subfigures denotes an iterate of the deterministic Gordon stateevolution update when run from the point that it is connected to. The shaded blue regions in allthree phases represent high-probability confidence sets for the empirical iterates. In panel (Left),we leverage the fact that the β-component of each iterate is trapped around that of its Gordoncounterpart. In panel (Right), each such region is an angular “wedge” around the correspondingGordon iterate, and in panel (Bottom), the entire region (across iterations) is determined by asmall envelope around the full Gordon trajectory. The light blue triangles in Phases I and IIdenote “worst-case” instances of the empirics within the corresponding confidence set. See theaccompanying text for a more detailed discussion.

interval—as depicted in Figure 4(Left)—and allows us to argue when n & d that αt also increasesexponentially with t in Phase I. At the same time, the βt iterates also remain bounded, so thatβt/αt decreases below a threshold and enters Phase II. Phase I takes at most O(log d) iterationswith high probability.

Phase II: Next, we show that the ratio βgor/αgor of the Gordon state evolution decreasesexponentially, and we translate this convergence to the empirical ratio βt/αt by using therelations (27) and (28). This traps each empirical iterate within a small angular neighborhoodof its Gordon counterpart, and is depicted in Figure 4(Right). Together with the aforementionedconvergence of the Gordon ratio βgor/αgor, this ensures that we enter the good region G. Weshow that with high probability, the iterates stay within Phase II for at most O(1) iterations.Along with the previously established convergence in Phase I, this establishes part (c) of all ourmodel-specific theorems, showing that our iterates enter the good region, i.e., Phase III, afterat most O(log d) steps after random initialization.

Phase III: In this final phase, we show a property that, to the best of our knowledge, is absentfrom local convergence guarantees in prior work. This is collected in part (b) of our individualtheorems, and shows that a small envelope around the Gordon state evolution trajectory, as

32

Page 33: Kabir Aladin Chandrasekher

depicted in Figure 4(Bottom), fully traps the random iterates with high probability. The keyproperty that we use to show this is in fact what guides our choice of the good region: Thederivatives of the αgor and βgor maps when evaluated for any element in this region are bothbounded above by 1 − c for some universal constant c > 0, so that small deviations of theempirics from these maps are not amplified over the course of successive iterations.

5 Numerical illustrations

We provide several numerical simulations to illustrate the sharpness of our results. For eachof the two models and two algorithms we consider, we demonstrate both global convergence aswell as local convergence. In particular, for each of the two models, we perform two families ofexperiments. The first explores convergence from a random initialization for both the higher-order and first-order method. These experiments are performed in dimension d = 800 with thenumber of samples n = 80, 000 (that is, κ = 100) and noise standard deviation σ = 10−6. First,a true parameter vector θ∗ is drawn uniformly at random from the unit sphere. Subsequently,an initialization θ0 is drawn (independently of θ∗) uniformly at random on the unit sphere.Then, from this vector, we simulate 12 independent trials of the algorithm for 12 iterations. Inthe second family of experiments, we explore local convergence—from an initialization whichhas constant correlation with the ground truth θ∗—for three different settings of noise standarddeviation σ and oversampling ratio κ. Each experiment is performed in dimension d = 500 withvarious numbers of samples n. Each simulation is done by first drawing the ground-truth vectorθ∗ uniformly at random on the unit sphere and subsequently generating an initialization

θ0 = 0.8 · θ∗ +√

1− 0.82P⊥θ∗γ,

where γ is uniformly distributed on the unit sphere and is independent of all other randomness.Next, we run 100 independent trials of both algorithms for 12 iterations.

0 2 4 6 8 10 12

10−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

‖θt−θ∗ ‖

2

Empirical: AM

Gordon: AM

(a) Alternating minimization.

0 2 4 6 8 10 12

10−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

‖θt−θ∗ ‖

2

Empirical: GD

Gordon: GD

(b) Subgradient descent.

Figure 5. Global convergence in the phase retrieval model. The hollow triangular marks (barelyvisible) denote the average over 12 independent trials and the shaded regions denote the rangeof values taken by the empirics.

33

Page 34: Kabir Aladin Chandrasekher

5.1 Phase retrieval

We first consider phase retrieval. Figure 5 illustrates the global convergence of both alternatingminimization (in Figure 5a) and subgradient descent (in Figure 5b). Figure 5a plots (i.) filledin circular marks denoting the Gordon state evolution started at the state (α(θ0), β(θ0)); (ii.)hollow triangular marks denoting the average of the empirical performance of AM over the 12independent trials; and (iii.) a shaded region denoting the region between the minimum andmaximum values taken in the empirics. The same three items are plotted with gradient descentin place of alternating minimization in Figure 5b.

0 2 4 6 8 10 12

10−10

10−8

10−6

10−4

10−2

100

Iteration

‖θt−θ∗ ‖

2

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(a) σ = 10−10, κ = 20

0 2 4 6 8 10 12

10−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

‖θt−θ∗ ‖

2

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(b) σ = 10−6, κ = 100

Figure 6. Local convergence for the phase retrieval model. Each subplot shows: (in purple) theempirics of alternating minimization, (in red) the Gordon updates for alternating minimization,(in blue) the empirics for subgradient descent, and (in orange) the Gordon updates for subgradientdescent. Hollow triangular markers denote the average of the empirics and the shaded regionsdenote the range of values taken by the empirics over 100 independent trials.

Recall that part (c) of Theorems 3 and 4 states that each algorithm—when started from arandom initialization—first consists of a transient phase which takes O(log d) iterations to reacha “good” region. This transient phase is witnessed by the first 5 iterations of each algorithm,which make very little progress in the `2 distance. Subsequently, parts (a) and (b) of eachtheorem state that in the “good” region, the Gordon state evolution converges at a specifiedrate and the empirics are trapped in a small envelope around this state evolution. Iterations 5−9illustrate the super-linear convergence of alternating minimization (Figure 5a) and iterations5 − 12 illustrate the linear convergence of subgradient descent (Figure 5b). We remark thatwhereas the theorems show the empirics to be trapped in a small envelope surrounding theGordon state evolution in the “good” region, the simulations suggest that this may hold evenfrom random initialization—that is, even the transient phase may consist of empirics trappedin an envelope around the Gordon state evolution.

Figure 6 zooms in and demonstrates the local convergence for two different settings of noisestandard deviation σ and oversampling ratio κ. For each of the parameter values, we maketwo observations. First, both subfigures make clear the deterministic qualities of the Gordonupdates—the distinction between convergence rates as well as the attainment of the error floor—whereby demonstrating part (a) of Theorem 3 and 4. Second, both simulations demonstratepart (b) of the same two theorems: the empirics are trapped in a small envelope surroundingthe Gordon state evolution.

34

Page 35: Kabir Aladin Chandrasekher

0 20 40 60 80 100 120 14010−7

10−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

‖θt−θ∗ ‖

2

Empirical: GD

Population

Gordon: GD

Figure 7. Subgradient descent for stepsize η = 0.95. Markers are placed once every 5 iterations.

We provide one final experiment in noiseless phase retrieval to illustrate the effect of stepsizein subgradient descent. Here, we take dimension d = 250, the oversampling ratio κ = 10 andstart from an initial correlation α0 = 0.6. As opposed to setting the stepsize η = 1/2, in thisexperiment, we try using a much larger stepsize; namely, we take η = 0.95. We then run 140iterations of subgradient descent and perform 10 independent trials. As is evident from Figure 7,this is a situation in which the population update predicts convergence, yet the empirics fail toconverge. On the other hand, the Gordon updates continue to sharply characterize the empiricalperformance and are able to predict the lack of convergence to the ground truth parameter.

5.2 Mixture of linear regressions

0 2 4 6 8 10 12

10−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

∠(θt,θ∗ )

Empirical: AM

Gordon: AM

(a) Alternating minimization

0 2 4 6 8 10 12

10−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

∠(θt,θ∗ )

Empirical: GD

Gordon: GD

(b) Subgradient AM

Figure 8. Global convergence in the mixture of linear regression model. Hollow triangular marksdenote the average over 12 independent trials and the shaded regions denote the range of valuestaken by the empirics.

35

Page 36: Kabir Aladin Chandrasekher

The two sets of simulations performed in this subsection (Figures 8 and 9) follow the samedichotomy as the two simulations performed in the previous subsection. An important distinc-tion is that the error metric used is the angular metric rather than the `2 distance used inthe preceding subsection. Figure 8 plots the trajectory of both AM and subgradient AM whenstarted from a random initialization. As before, the simulations suggest that the empirics aretrapped around the Gordon state evolution trajectory even from random initialization.

Next, we turn to the local convergence as illustrated in Figure 9 under two distinct parameterregimes, with the other details of the setup being identical to local convergence in phase retrieval.We pause only to call out the linearly convergent behavior evident in Figure 9a as well as thesimilarity in performance of the two algorithms in the same simulation. This is an importantfeature of the mixtures of linear regression model with constant noise.

0 2 4 6 8 10 1210−6

10−5

10−4

10−3

10−2

10−1

100

Iteration

∠(θ∗ ,θt)

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(a) σ = 10−6, κ = 20

0 2 4 6 8 10 1210−1.5

10−1

10−0.5

Iteration

∠(θ∗ ,θt)

Empirical: AM

Empirical: GD

Gordon: AM

Gordon: GD

(b) σ = 10−2, κ = 6

Figure 9. Local convergence for the mixture of linear regression model. Each subplot shows:(in purple) the empirics of alternating minimization, (in red) the Gordon updates for alternatingminimization, (in blue) the empirics for subgradient AM, and (in orange) the Gordon updates forsubgradient AM. Hollow triangular markers denote the average of the empirics and the shadedregions denote the range of values taken by the empirics over 100 independent trials.

6 Discussion

We presented a recipe for deriving accurate deterministic predictions for the behavior of iterativealgorithms in nonconvex Gaussian regression models, which applies provided each iteration canbe written as a convex optimization problem satisfying mild decomposability conditions. Ratherthan decouple the deterministic component of these analyses from its random counterpart bypassing to the infinite-sample population limit—which is the most prevalent program in theliterature—we used duality and Gaussian comparison theorems to obtain our deterministicGordon state evolution update. We presented several consequences for both higher-order andfirst-order algorithms applied to the problems of phase retrieval and mixtures of regressions.These results are in themselves novel, but the key takeaway is our sharp characterization ofconvergence behavior, which we hope will enable a rigorous comparison between algorithms inother related problems. We conclude by listing a few open questions.

We begin with two technical open questions. We showed that our deterministic predictionsof the perpendicular component β were within O(n−1/4) of their empirical counterparts. Wewere able to sharpen this rate to O(n−1/2) for the parallel α component, and conjecture that a

36

Page 37: Kabir Aladin Chandrasekher

similar improvement can be carried out for the β component. As a second technical question,we highlight the condition σ . 1 present in our results for mixture of regression models. Weconjecture that this condition can be weakened to σ .

√κ while preserving the same qualitative

behavior of the theorem (i.e. linear angular convergence), but establishing this rigorously is aninteresting open problem.

The next set of open questions is broader. Note that our analysis—which relied on Gaussian-ity of the data independent of the current iterate—required fresh observations at each iteration,and to that end, we used a sample-splitting device to partition the data into disjoint batches.While this is a reasonable method to obtain a practical algorithm—indeed, all the algorithms weanalyzed converge very fast, so that at most a logarithmic number of batches suffices—it is morecommon to run these algorithms without sample splitting. The leave-one-out technique (Maet al., 2020; Chen et al., 2019) has emerged as a powerful analysis framework for the case with-out sample-splitting, and it is an interesting open question to what extent this can be combinedwith our Gordon recipe. Even more broadly, there is the question of building an analogous the-ory under weaker distributional assumptions on the data; indeed, some iterative (higher-order)algorithms considered in the literature are known to converge under weaker assumptions (e.g.,Duchi and Ruan, 2019; Ghosh et al., 2020). While universality theorems (broadly construed)have been proved in related settings (Bayati et al., 2015; Oymak and Tropp, 2018; Panahi andHassibi, 2017; El Karoui, 2018; Abbasi et al., 2019; Paquette et al., 2020), do similar insightsapply here? Can we produce an accurate deterministic prediction if the data is no longer i.i.d.,akin to the population update in such settings (Yang et al., 2017)? These are interesting andimportant questions for future work.

Finally, there is the question of broadening the scope of problems to which our analysisapplies, and we provide two examples along these lines. First, one could consider “weak”signal-to-noise regimes in the models that we considered. These regimes have been the subjectof recent work (Dwivedi et al., 2020; Wu and Zhou, 2019; Ho et al., 2020), and it is known thatthe optimal statistical rates of convergence are different from those in the strong signal-to-noiseregimes that we consider in this paper. What are sharp rates of convergence of optimizationalgorithms in these settings? Second, and more importantly, phase retrieval and mixtures ofregressions are just two models to which our framework applies. There are several other modelsand algorithms that can be analyzed with the Gordon state evolution machinery to sharplycharacterize (possibly nonstandard) convergence behavior.

7 Proof of general results, part (a): Gordon update and devia-tion bounds

In this section, we prove part (a) of both Theorem 1 and Theorem 2. The structure of the prooffollows the recipe sketched in Section 3. We proceed by carrying out steps 1–3 of the recipe fora broader class of algorithms (captured by one-step updates satisfying Assumption 3 to follow),and derive a general Proposition 2. With this proposition in hand, we then carry out step4 of the recipe separately for higher-order methods to prove Theorem 1(a) and for first-ordermethods to prove Theorem 2(a).

Throughout this section, we let θ] denote the “current” iterate of the algorithm, with(α], β]) = (α(θ]), β(θ])). This frees up the tuple (θ, α, β) to denote decision variables thatwill be used throughout the proof. In addition, as in the heuristic derivation in Section 3.2, itis useful in the proof to track a three dimensional state evolution (α, µ, ν), where

α = 〈θ,θ∗〉, µ =〈θ,P⊥θ∗θ]〉‖P⊥θ∗θ]‖2

, and ν = ‖P⊥S#θ‖2, (49)

where S# = span(θ∗,θ]) and P⊥S#is the projection matrix onto the orthogonal complement of

37

Page 38: Kabir Aladin Chandrasekher

this subspace. Finally, define the independent random variables

z1 = Xθ∗ and z2 =XP⊥θ∗θ

]

‖XP⊥θ∗θ]‖2, (50)

noting that both z1, z2 ∼ N (0, In) since ‖θ∗‖2 = 1. We are now ready to rigorously implementeach step of the recipe.

7.1 General result from steps 1–3 of recipe

Our general result is derived by implementing steps 1–3 in a setting involving a general decom-posability assumption on the one-step loss function L (22).

7.1.1 Implementing step 1: One-step convex optimization

Our general assumption takes the following form:

Assumption 3 (Decomposability and convexity of loss). Consider a fixed vector θ] ∈ Rd anda Gaussian random matrix X ∈ Rn×d and assume y is generated, given X and θ∗, accordingto the generative model (2). Then one step of the iterative algorithm run from θ] can be writtenin the form (7), where the loss

L(θ) := L(θ;θ],X,y)

satisfies the following properties:

(a) There is a pair of functions g : Rn × Rn → R and h : Rd × Rd → R, and a (random)function

F : Rn × Rd → RF (u,θ) 7→ g(u,Xθ];y) + h(θ,θ]), (51)

such that

L(θ) = F (Xθ,θ) for all θ ∈ Rd.

(b) The functions g and h are convex in their first arguments. Moreover, the function g andthus F are CL/

√n-Lipschitz in their first argument.

(c) The function h depends on θ only through its lower dimensional projections. That is,there exists another function

hscal : R3 × Rd → R(α, µ, ν),θ] 7→ hscal((α, µ, ν),θ]),

such that

h(θ,θ]) = hscal

(〈θ,θ∗〉,

〈θ,P⊥θ∗θ]〉‖P⊥θ∗θ]‖2

, ‖P⊥span(θ∗,θ])θ‖2,θ]

)(d) L(θ) is coercive. That is, L(θ)→∞ whenever ‖θ‖2 →∞.

Note that specifying

h(θ,θ]) = 0 and g(u,Xθ];y) =1√n‖ω(Xθ],y)− u‖2

38

Page 39: Kabir Aladin Chandrasekher

recovers the higher-order loss functions (9), whereas specifying

h(θ,θ]) = 1/2 · ‖θ‖22 − 〈θ,θ]〉 and g(u,Xθ];y) = 2η/n · 〈u, ω(Xθ],y)〉

recovers the first-order loss functions (12). The remaining properties (b)-(d) of the assumptioncan be straightforwardly verified for these two choices. Thus, Assumption 3 captures both thespecial cases corresponding to Theorems 1 and 2. Having written one step of the iterativealgorithm of interest as a minimization of a convex loss, we are now ready to proceed to step 2of the recipe.

7.1.2 Implementing step 2: The auxiliary optimization problem

Next, we state a formal definition of the auxiliary loss function Ln.

Definition 7 (Auxiliary loss). Let γn ∼ N (0, In) and γd ∼ N (0, Id) denote independent ran-dom vectors drawn independently of the pair (X,y), let θ] ∈ Rd, and define the subspaceS] = span(θ∗,θ]). Further, let L denote a loss function which satisfies Assumption 3 for func-tions g and h. Then, given a positive scalar r define the auxiliary loss function

Ln(θ,u; r) := maxv∈B2(r)

1√n‖v‖2〈P⊥S]θ,γd〉+ h(θ,θ]) + g(u,Xθ];y) +

1√n〈v,XPS]θ + ‖P⊥S]θ‖2γn − u〉.

The following lemma shows that our original optimization problem over the loss function Lis essentially equivalent to an auxiliary optimization problem involving the loss Ln.

Lemma 1. Let θ] ∈ Rd and suppose that the loss function L(θ) = L(θ;θ],X,y) satisfiesAssumption 3 (and recall the Lipschitz constant CL therein) and associate with it the auxiliaryloss Ln. Let D ⊆ B2(R) denote a closed subset for some positive constant R. Then there existsa positive constant C1 ≥ 6R, depending only on R, such that for any scalar r ≥ CL and scalart ∈ R,

P

minθ∈DL(θ) ≤ t

≤ 2P

min

θ∈D,u∈B2(C1√n)Ln(θ,u; r) ≤ t

+ 2e−2n.

If, in addition, D is convex, then

P

minθ∈DL(θ) ≥ t

≤ 2P

min

θ∈D,u∈B2(C1√n)Ln(θ,u; r) ≥ t

+ 2e−2n.

Given that the minimization over L can be approximately written as a minimization overan auxiliary loss, we are now ready to proceed to step 3.

7.1.3 Implementing step 3: Scalarization

Next, we define the scalarized auxiliary loss. Recall the definition of the convex conjugate of afunction g : Rd → R, given by g∗(x) = supx′∈Rd 〈x′, x〉 − g(x′).

Definition 8 (Scalarized auxiliary loss). Let γn ∼ N (0, In), γd ∼ N (0, Id), z1 ∼ N (0, In),and z2 ∼ N (0, In) denote mutually independent random vectors, with the pair (z1, z2) chosenaccording to equation (50). Let θ] ∈ Rd and define the subspace S] = span(θ∗,θ]). Further, letL denote a loss function which satisfies Assumption 3 for functions g and h. Then associatewith it the scalarized auxiliary loss

Ln(α, µ, ν;θ]) := maxv∈Rn

hscal(α, µ, ν,θ])− g∗(v,Xθ];y)− ν‖P⊥S]γd‖2‖v‖2 + 〈νγn +αz1 +µz2,v〉,

where g∗ denotes the convex conjugate of the function g and hscal is as in part (c) of Assump-tion 3.

39

Page 40: Kabir Aladin Chandrasekher

Our next lemma implements step 3, scalarizing the auxiliary loss Ln. Before stating thelemma, we require the definition of a scalarized set and an amenable set.

Definition 9 (Scalarized set). Let θ∗ ∈ Rd denote the ground truth, θ] ∈ Rd, and S] thesubspace S] = span(θ∗,θ]). For any subset D ⊆ Rd, define the scalarized set

P(D) :=

(α, µ, ν) ∈ R3 : θ ∈ D and α = 〈θ,θ∗〉, µ =

〈θ,P⊥θ∗θ]〉‖P⊥θ∗θ]‖2

, ν = ‖P⊥S]θ‖2.

Definition 10 (Amenable set). Let θ∗ ∈ Rd denote the ground truth, θ] ∈ Rd, and S] thesubspace S] = span(θ∗,θ]). A subset D ⊆ Rd is amenable with respect to the subspace S] if theset D ∩ S]⊥ is rotationally invariant, i.e. for all unit vectors ‖v‖2 = 1 such that v ∈ S]⊥, thereexists θ ∈ D such that P⊥S]θ/‖P

⊥S]θ‖ = v.

With these definitions in hand, we have the following lemma, whose proof we provide inSubsection B.1.2.

Lemma 2. Let θ] ∈ Rd and S] = span(θ∗,θ]). Suppose that the loss function L satisfiesAssumption 3 and associate with it the auxiliary loss Ln as well as the scalarized auxiliary lossLn. Further, let R be a positive constant and suppose that the subset D ⊆ B2(R) is amenablewith respect to the subspace S]. Then, with C1 as in Lemma 1, for all r ≥ CL, we have thesandwich relation

minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) ≤ min(α,µ,ν)∈P(D)

Ln(α, µ, ν;θ]) ≤ minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) +3C2

LC1

r,

with probability at least 1− 8e−n/2.

7.1.4 Putting together steps 1–3

We are now in a position to put the pieces together and prove a formal equivalence betweenthe original minimization problem over the loss L and a low-dimensional minimization problemover the loss Ln.

Proposition 2. Let θ] ∈ Rd and S] = span(θ∗,θ]). Suppose that the loss function L satisfiesAssumption 3 and associate with it the scalarized auxiliary loss Ln. Let R be a positive constantand suppose that the subset D ⊆ B2(R) is amenable with respect to the subspace S]. Then,there exists a positive constant C1, depending only on R, such that for each triple of scalarsr ≥ CL, L ∈ R, and ε′ > 0, we have

P

argminθ∈B2(R)

L(θ) ∈ D≤ 2P

min

(α,µ,ν)∈P(D)Ln(α, µ, ν) ≤ L−

3C2LC1

r+ 2ε′

+ 2P

min

(α,µ,ν)∈P(B2(R))Ln(α, µ, ν) > L + ε′

+ 20e−n/2.

Proof. Applying the law of total probability, we obtain for any L and any ε′ > 0, the chain ofinequalities

P

argminθ∈B2(R)

L(θ) ∈ D≤ P

min

θ∈D∩B2(R)L(θ) ≤ min

θ∈B2(R)L(θ) + ε′

≤ P

minθ∈DL(θ) ≤ L + 2ε′

+ P

min

θ∈B2(R)L(θ) > L + ε′

, (52)

40

Page 41: Kabir Aladin Chandrasekher

where we note that in the second inequality we have used the fact that D ⊆ B2(R). Now, weapply Lemma 1 to obtain the pair of inequalities

P

minθ∈DL(θ) ≤ L + 2ε′

≤ 2P

minθ∈D,

u∈B2(C1√n)

Ln(θ,u; r) ≤ L + 2ε′

+ 2e−2n, (53a)

P

minθ∈B2(R)

L(θ) > L + ε′≤ 2P

min

θ∈B2(R),u∈B2(C1

√n)

Ln(θ,u; r) > L + ε′

+ 2e−2n. (53b)

Next, we note that D is an amenable set with respect to S] (as in Definition 10). Thus, weapply Lemma 2 to further obtain the pair of inequalities

P

minθ∈D,

u∈B2(C1√n)

Ln(θ,u; r) ≤ L + 2ε′

≤ P

min

(α,µ,ν)∈P(D)Ln(α, µ, ν) ≤ L−

3C2LC1

r+ 2ε′

+ 8e−n/2,

(54a)

P

min

θ∈B2(R),u∈B2(C1

√n)

Ln(θ,u; r) > L + ε′

≤ P

min

(α,µ,ν)∈P(B2(R))Ln(α, µ, ν) > L + ε′

+ 8e−n/2.

(54b)

Combining the inequalities (52)–(54) yields the desired conclusion.

Having established steps 1–3 of the recipe under the general Assumption 3 on the one-steploss function, we now carry out step 4 of the recipe individually for each theorem. For clarity,we include a schematic diagram of the various ingredients in Figure 10. Recall from the recipedescribed earlier that the Gordon state evolution update is obtained as the minimizer of thedeterministic loss L, which is in turn obtained from Ln in the limit n→∞. We prove each ofthe two theorems below without making this equivalence explicit, but the connection is evidentfrom the proofs of Lemmas 3 and 4 in the appendix.

7.2 Proof of Theorem 1(a)

First, we define the expanded (i.e., three dimensional) Gordon state evolution update for higher-order methods.

Definition 11 (Expanded Gordon state evolution update: Higher-order methods). Recall themodel (2), and let Q denote a random variable drawn from the latent variable distributionQ. Suppose that the loss function L takes the form (9). Let (Z1, Z2, Z3) denote a triplet ofindependent standard Gaussian random variables and let α] ∈ R and β] ∈ R≥0 denote arbitraryscalars. Let

Ω = ω(α]Z1 + β]Z2 , f(Z1;Q) + σZ3

).

Then define the scalars

αgor = E[Z1Ω], µgor = E[Z2Ω], and νgor =

√E[Ω2]− (E[Z1Ω])2 − (E[Z2Ω])2

κ− 1.

With this definition in hand, note that in order to prove Theorem 1(a) it suffices to showthat for a parameter cK depending only on the pair (K1,K2) in Assumptions 1 and 2, we have

P

max

(∣∣∣∣〈Tn(θ]), P⊥θ∗θ]〉

‖P⊥θ∗θ]‖2− µgor

∣∣∣∣ , ∣∣∣‖P⊥S#Tn(θ])‖2 − νgor

∣∣∣) > ε

≤ C exp

(−cKnε4

). (55)

41

Page 42: Kabir Aladin Chandrasekher

R3

Ln(ξ)

L(ξ)

ξnξgor

L

L + Cε2

L− Cε2

B∞(ξgor; ε) Tn(θ])

L(θ)

Figure 10. Schematic diagram of the proof of Theorems 1 and 2, drawn for θ ∈ R3. Forshorthand, we consider the state ξ = (α, µ, ν). The quantity Tn(θ]) denotes the minimizer ofthe loss L and the state ξn = (αn, µn, νn) denotes the minimizer of the scalarized auxiliary lossLn. With L = limn→∞ Ln denoting the deterministic scalarized auxiliary loss in the limit, thestate ξgor = (αgor, µgor, νgor) denotes the minimizer of L. Proposition 2 utilizes the CGMT toconnect the minima of the original loss L over any amenable set (which includes the entire setR3 as well as the set B∞(ξgor; ε)c) to minima over the simpler empirical loss Ln. The growthconditions and concentration properties given by Lemmas 3 and 4 imply that L(ξgor) and Ln(ξn)lie in the purple shaded region as well as the inclusion ξn ∈ B∞(ξgor; ε). Used in conjunctionwith Proposition 2, this shows that the minimizer of the original loss L also satisfies the inclusionTn(θ]) ∈ B∞(ξgor; ε).

Indeed, setting ε = ε0 :=(

log(C/δ)cK ·n

)1/4, we have

P

max

(∣∣∣∣〈Tn(θ]), P⊥θ∗θ]〉

‖P⊥θ∗θ]‖2− µgor

∣∣∣∣ , ∣∣∣‖P⊥S#Tn(θ])‖2 − νgor

∣∣∣) > ε0

≤ δ.

Note that (β+)2 =〈Tn(θ]),P⊥

θ∗θ]〉2

‖P⊥θ∗θ

]‖22+ ‖P⊥S#

Tn(θ])‖22 and (βgor)2 = (µgor)2 + (νgor)2. Applying

the triangle inequality and adjusting constant factors, we have P(|β+ − βgor| ≥ cε0) ≤ δ, asdesired. Consequently, we dedicate our effort toward establishing inequality (55) by provinggrowth properties for Ln.

7.2.1 Implementing step 4: Growth properties of Ln

The following lemma guarantees growth conditions on the function Ln when the one-step lossfunction takes the form (9). Its proof is deferred to Appendix B.2.

Lemma 3. Suppose that the loss function L can be written in the form (9) and let the scalarized

42

Page 43: Kabir Aladin Chandrasekher

auxiliary loss Ln be as in Definition 8. Define the constant

L =

√(1− 1

κ

)(E[Ω2]− (E[Z1Ω])2 − (E[Z2Ω])2

),

and let the tuple (αgor, µgor, νgor) be as in Definition 11. Suppose that Assumptions 1 and 2 holdwith parameters K1 and K2, respectively. Then, there exist positive constants CK1 , C

′K1, CK2

and cK1,K2 each depending on a subset of K1,K2 and universal positive constants c, c′, C, C ′

such that for all κ ≥ C ′ and all ε ∈ (0, c′), the following hold:

(a) The minimizer(αn, µn, νn) = argmin

(α,µ,ν)∈P(B2(CK1))Ln(α, µ, ν)

is unique and satisfies both

max|αn − αgor|, |µn − µgor|, |νn − νgor|

≤ C ′K1

ε,

and ∣∣∣∣Ln(αn, µn, νn)− L

∣∣∣∣ ≤ C ′K1ε,

with probability at least 1− C exp−cK1,K2 · n − Ce−cnε2.

(b) The scalarized auxiliary loss Ln is CK2-strongly convex on the domain B2(CK1) with prob-ability at least 1− C exp−cK1,K2 · n.

With each of the individual steps of the recipe completed, we can now put everythingtogether to prove equation (55).

7.2.2 Combining the pieces

Let the tuple (αgor, µgor, νgor) be as in Definition 11. For each nonnegative scalar ε, define thedeviation set

Dε := θ ∈ Rd : max(|〈θ,θ∗〉 − αgor|,

∣∣∣〈θ,P⊥θ∗θ]〉‖P⊥θ∗θ]‖2

− µgor∣∣∣, |‖P⊥S]θ‖2 − νgor|) ≥ ε, (56)

and note that the deviation set Dε is amenable with respect to the subspace S] = span(θ∗,θ]).Note that it suffices to bound PTn(θ]) ∈ Dε. To this end, note that

PTn(θ]) ∈ Dε = P

argminθ∈Rd

L(θ) ∈ Dε

(i)= P

argminθ∈B2(CK1

)L(θ) ∈ Dε

+ 2e−cn, (57)

where step (i) follows upon applying Lemma 22 in the appendix.First, note that the loss L satisfies Assumption 3 as we can take the functions

h(θ,θ]) = 0 and g(u,Xθ];y) =1√n‖ω(Xθ],y)− u‖2.

Each of the properties (a)-(d) is evident as the norm ‖ · ‖2 is convex, Lipschitz continuous, andcoercive. Moreover, the Lipschitz constant CL = 1.

Next, recall the constant L as defined in Lemma 3 and subsequently invoke Proposition 2 toobtain, for constants r ≥ 1 and ε′ > 0 to be specified later, the inequality

P

argminθ∈B2(CK1

)L(θ) ∈ Dε

≤ T1 + T2 + 20e−n/2, (58)

43

Page 44: Kabir Aladin Chandrasekher

where

T1 = 2P

min(α,µ,ν)∈P(D∩B2(CK1

))Ln(α, µ, ν) ≤ L−

3C2LC1

r+ 2ε′

,

T2 = 2P

min(α,µ,ν)∈P(B2(CK1

))Ln(α, µ, ν) > L + ε′

.

Now, let 0 < ε1 ≤ ε be a constant to be specified later and define the two events

E1 =‖(αn, µn, νn)− (αgor, µgor, νgor)‖2 ≤ ε1,

∣∣Ln(αn, µn, νn)− L∣∣ ≤ ε1,

E2 =

min

‖(α,µ,ν)−(αn,µn,νn)‖2≥ε−ε1Ln(α, µ, ν) ≥ Ln(αn, µn, νn) + cK2(ε− ε1)2

.

Assume for the moment that ε ≤ c′ (a fact that will be true for the eventual setting of ε) andsubsequently invoke Lemma 3(a) to obtain

PE1 ≥ 1− C exp−cK1,K2 · n − Ce−cK1nε21 ,

for some constant cK1 depending only on K1. Also apply Lemma 3(b) to obtain the inequality

PE2 ≥ 1− C exp−cK1,K2 · n.

For the remainder of the proof, we work on the event E1 ∩ E2. We have

min(α,µ,ν)∈P(Dε∩B2(CK1

))Ln(α, µ, ν) ≥ min

‖(α,µ,ν)−(αgor,µgor,νgor)‖2≥εLn(α, µ, ν)

(i)

≥ min‖(α,µ,ν)−(αn,µn,νn)‖2≥ε−ε1

Ln(α, µ, ν)

≥ Ln(αn, µn, νn) + cK2(ε− ε1)2 ≥ L− ε1 + cK2(ε− ε1)2, (59)

where step (i) follows since on E1, if ‖(α, µ, ν)− (αgor, µgor, νgor)‖2 ≥ ε then‖(α, µ, ν)− (αn, µn, νn)‖2 ≥ ε− ε1. Now, set ε1 =

cK24 ε2, which is a valid choice provided

ε ≤ 4/cK2 . For each such value of ε, continuing from the inequality (59), we obtain

min(α,µ,ν)∈P(Dε∩B2(CK1

))Ln(α, µ, ν) ≥ L + c′K2

ε2, (60)

where c′K2≤ cK2/2 is a small enough constant depending only on K2. Continuing, set

r = 12 · C1 · C2L/(c

′K2· ε2) and ε′ = c′K2

· ε2/4. Thus, in view of the inequality (60), we obtainthe bounds

maxT1, T2 ≤ PEc1 ∪ Ec2 ≤ C exp−cK1,K2 · n+ Ce−cK1,K2nε4 ,

where cK1,K2 is once again a constant depending solely on K1 and K2. Substituting the boundin the display above into the inequality (58), we obtain

P

argminθ∈B2(CK1

)L(θ) ∈ Dε

≤ C exp−cK1,K2 · n+ Ce−cK1,K2

nε4 + 20e−n/2.

Combining the display above with the inequality (57), we obtain

PTn(θ]) ∈ Dε ≤ C exp−cK1,K2 · n+ Ce−cK1,K2nε4 + 22e−n/2. (61)

To complete the proof, set ε = O(c−1K1,K2

·(

log(1/δ)n

)1/4)

, which we can ensure is a valid choice

owing to the condition n ≥ CK1,K2 log(1/δ). Finally, we use this lower bound on n to also boundthe remaining two terms of the RHS in equation (61) by O(δ), thereby obtaining the claimedresult.

44

Page 45: Kabir Aladin Chandrasekher

7.3 Proof of Theorem 2(a)

We begin by defining the first-order analog of the expanded Gordon state evolution update.

Definition 12 (Expanded Gordon state evolution update: First-order methods). Recall themodel (2), and let Q denote a random variable drawn from the latent variable distribution Q.Suppose the loss function L takes the form (12). Let (Z1, Z2, Z3) denote a triplet of independentstandard Gaussian random variables and let α] ∈ R and β] ∈ R≥0 denote arbitrary scalars. Let

Ω = ω(α]Z1 + β]Z2 , f(Z1;Q) + σZ3

).

Then define the scalars

αgor = α] − 2η · E[Z1Ω], µgor = β] − 2η · E[Z2Ω], and νgor =2η√κ·√E[Ω2].

As before, it suffices to show that for a parameter cK depending only on K1 from Assump-tion 1, we have

P

max

(∣∣∣∣〈Tn(θ]), P⊥θ∗θ]〉

‖P⊥θ∗θ]‖2− µgor

∣∣∣∣ , ∣∣∣‖P⊥S#Tn(θ])‖2 − νgor

∣∣∣) > ε

≤ C exp

(−cKnε4

). (62)

7.3.1 Implementing step 4: Growth properties of Ln

The following lemma guarantees growth conditions on the function Ln when the one-step lossfunction takes the form (12). We provide its proof in Subsection B.3.

Lemma 4. Suppose that the loss function L takes the form (12) and let the scalarized auxiliaryloss Ln be as in Definition 8. Define the constant

L = −1

2((αgor)2 + (µgor)2 + (νgor)2),

and let the tuple (αgor, µgor, νgor) be as in Definition 12. Suppose that Assumption 1 holds withparameter K1. Then, there exists cK1 , CK1 > 0 depending only on K1 and universal positiveconstants c, C, c′, C1, C2 such that for all κ ≥ C1 and all ε ∈ (0, c′), the following hold:

(a) The minimizer

(αn, µn, νn) = argmin(α,µ,ν)∈P(B2(C2))

Ln(α, µ, ν)

is unique and satisfies both

max|αn − αgor|, |µn − µgor|, |νn − νgor|

≤ CK1ε,

and ∣∣∣∣Ln(αn, µn, νn)− L

∣∣∣∣ ≤ CK1ε,

with probability at least 1− Ce−cnε2.

(b) The scalarized auxiliary loss Ln is 1-strongly convex on the domain B2(C2) with probabilityat least 1− Ce−cn.

Note that the strong convexity constant here is absolute (equal to 1) instead of dependenton Assumption 2 like in higher-order methods. We can now put everything together exactlylike before.

45

Page 46: Kabir Aladin Chandrasekher

7.3.2 Combining the pieces

The calculations here are very similar to before, so we only sketch the differences. First, considerthe first order loss as in equation (12)

L(θ;θ],X,y) =1

2‖θ‖22 − 〈θ,θ]〉 −

n

n∑i=1

ω(〈xi,θ]〉, yi) · 〈xi,θ〉,

which corresponds to setting

h(θ,θ]) =1

2‖θ‖22 − 〈θ,θ]〉, hscal(α, µ, ν) =

1

2(α2 + µ2 + ν2)− (αα] + µβ]),

and

g(u,Xθ];y) = −2η

n〈u, ω(Xθ],y)〉.

Note that

g∗(v,Xθ];y) =

0 if v = 2η

n · ω(Xθ],y),

+∞ otherwise.

Consequently, we obtain

Ln(α, µ, ν) =α2 + µ2 + ν2

2− (αα] + µβ])−

n· 〈ω(Xθ],y), νγn + αz1 + µz2〉

− 2ην

n· ‖P⊥S]γd‖2 · ‖ω(Xθ],y)‖2. (63)

Note that g is a linear function of u, and that an application of Hoeffding’s inequality yields‖ω(Xθ],y)‖2 ≤ CK1 with probability at least 1− e−cn. Thus, as is evident from this inequalityand the displays above, L satisfies Assumption 3, with Lipschitz constant CL = CK1 .

Now, with the tuple (αgor, µgor, νgor) as in Definition 12, define the events

A1 =

max|αn − αgor|, |µn − µgor|, |νn − νgor|

≤ CK1ε1,

∣∣Ln(αn, µn, νn)− L∣∣ ≤ ε1 and

A2 =

min‖(α,µ,ν)−(αn,µn,νn)‖2≥ε−ε1

Ln(α, µ, ν) ≥ Ln(αn, µn, νn) + (ε− ε1)2.

Carrying out the proof exactly as for higher-order methods but now with ε1 = cK1ε2 leads to

the desired result.

8 Proof of general results, part (b): Tighter bounds on parallelcomponent

The main result of this section is the following proposition, which—in words—shows that theone-dimensional projection of the empirical operator Tn onto the ground truth θ∗ concentratesaround αgor for both higher-order and first order methods at rate O(n−1/2).

Proposition 3. Consider the one-step empirical updates Tn(θ) taking either of the forms (9)or (12) and the associated state evolution update αgor. There are universal positive constants Cand C ′ such that if κ ≥ C the following hold.

(a) Suppose that the empirical update Tn(θ) takes the first-order form (12) and consider theassociated state evolution update αgor as in Definition 2. Then,

∣∣〈θ∗, Tn(θ)〉 − αgor∣∣ ≤ CK1 ·

(log (1/δ)

n

)1/2

,

with probability at least 1− δ.

46

Page 47: Kabir Aladin Chandrasekher

(b) Suppose that the empirical update Tn(θ) takes the second-order form (9) and consider theassociated state evolution update αgor as in Definition 1. Then,

∣∣〈θ∗, Tn(θ)〉 − αgor∣∣ ≤ CK1 ·

(log7 (1/δ)

n

)1/2

,

with probability at least 1− δ.

Clearly, the proposition directly implies part (b) of both Theorems 1 and 2. Before providingthe proof, we pause to make a few comments. First, we note that in the previous section, weshowed concentration of 〈θ∗, Tn(θ)〉 with fluctuations of order n−1/4, but additionally providedcontrol over the random variable ‖P⊥θ∗θ‖2, which has d degrees of freedom. On the other hand,by focusing directly on the quantity 〈θ∗, Tn(θ)〉, which has one degree of freedom, we are able totighten the fluctuations to the order n−1/2. Second, we comment briefly on the proof, especiallyfor higher-order methods. Our proof improves upon the strategy utilized in Zhang (2020) by (i)proving a concentration inequality with exponential tails and (ii) extending the methodologybeyond alternating minimization for phase retrieval to more general updates of the form (9b).This extension relies on a delicate combination of the leave-one-out technique of Zhang (2020)with the moment inequalities in Boucheron et al. (2005). We turn now to the proof of the mainproposition, proving the result for first-order and higher-order methods separately.

Proof of Proposition 3(a): first-order methods. First, recall the generic first-order update (12)and specify the operator

Tn(θ) = θ − η · 2

n

n∑i=1

ω(〈xi,θ〉, yi) · xi.

Now, evaluating the quantity E〈θ∗, Tn(θ)〉 and recalling Definition 2, we obtain the charac-terization

E〈θ∗, Tn(θ)〉 = 〈θ∗,θ〉 − η · 2

n

n∑i=1

ω(〈xi,θ〉, yi) · 〈θ∗,xi〉 = α(θ)− 2η · EZ1Ω = αgor, (64)

where we have drawn Ω according to equation (24) using the random variablesZ1 = 〈x1,θ

∗〉, Z2 = 〈x1,P⊥θ∗θ〉, and Z3 ∼ N (0, σ2). Now, note that 〈θ∗,xi〉 ∼ N (0, 1) and that

by Assumption 1, ‖ω(〈xi,θ〉, yi)‖ψ2 ≤ K1. Thus, we apply Vershynin (2018, Lemma 2.7.7) toobtain the bound

‖ω(〈xi,θ〉, yi) · 〈θ∗,xi〉‖ψ1 ≤ K1.

Subsequently applying Bernstein’s inequality in conjunction with the characterization of theexpectation (64), we obtain the inequality

P∣∣〈θ∗, Tn(θ)〉 − αgor

∣∣ ≥ t ≤ 2 exp−cmin

(nt2K2

1

,nt

K1

).

The conclusion follows immediately from the above inequality.

Proof of Proposition 3(b): higher-order methods. Recall the weight functions ω : R2 → R as inthe equations (9) and (9b) and consider its separable extension to vector valued functions:

ω : Rn × Rn → Rn

(x,y) 7→ (ω(x1, y1), ω(x2, y2), . . . ω(xn, yn)).

We can then re-write the updates (9b) using matrix notation as

Tn(θ) = (X>X)−1X>ω(Xθ,y), (65)

47

Page 48: Kabir Aladin Chandrasekher

where specifying ω(x, y) = sgn(x) · y recovers the alternating minimization update for phaseretrieval and specifying ω(x, y) = sgn(yx) · y recovers the alternating minimization updatefor mixtures of linear regressions. We also specify, for convenience, the following populationoperator

T (θ) =1

nEX>ω(Xθ,y)

. (66)

Recalling the definition of αgor as in Definition 1, we note that

〈θ∗, T (θ)〉 =1

nE〈ω(Xθ,y),Xθ∗〉 = EZ1Ω = αgor,

where we have let Z1 denote the first component of the vector Xθ∗ and Z2 denote the first com-ponent of the vector XP⊥θ∗θ/‖P⊥θ∗θ‖2. We now state two lemmas whose proofs are postponedto Subsections 8.1 and 8.2, respectively.

Lemma 5. Under the setting of Proposition 3, there exist universal, positive constants c andC such that for all t > 0,

Pr∣∣〈θ∗, Tn(θ)〉 − 〈θ∗, ETn(θ)〉

∣∣ ≥ t ≤ C exp−c( t√nK1

)2/7+ e−cn. (67)

Lemma 6. Under the setting of Proposition 3, there exists a universal, positive constant Csuch that,

|〈θ∗, ETn(θ)〉 − 〈θ∗, T (θ)〉| ≤ CK1√n.

To prove the proposition from these two lemmas, apply the triangle inequality to obtain thefollowing decomposition∣∣〈θ∗, Tn(θ)〉 − 〈θ∗, T (θ)〉

∣∣ ≤ ∣∣〈θ∗, Tn(θ)〉 − 〈θ∗, ETn(θ)〉∣∣+∣∣〈θ∗, ETn(θ)〉 − 〈θ∗, T (θ)〉

∣∣. (68)

The conclusion follows immediately upon applying Lemmas 5 and 6 to upper bound the twoterms above.

8.1 Proof of Lemma 5: Concentration of centered term

Using the shorthand

zi = xiω(〈xi,θ〉, yi) and Σ =

n∑i=1

xix>i , (69)

and recalling the empirical updates Tn(θ) (65), we denote the random variable of interest

Z := 〈θ∗, Tn(θ)〉 = 〈θ∗, Σ−1n∑i=1

zi〉, (70)

where in the last equality we have simply simply written Tn(θ) using the shorthand (69).Noting that Z is a non-linear function of the n independent samples xi, yini=1, we introducesome notation in order to isolate the contribution of the i-th sample. For all 1 ≤ j ≤ n, letx′j , y′j denote an independent copy of the pair xj , yj, and define (cf. Eq. (69))

z′j = x′jω(〈x′j ,θ〉, y′j), Σj =∑i 6=jxix

>i . (71)

48

Page 49: Kabir Aladin Chandrasekher

We then define

Z ′j = (θ∗)>(x′j(x′j)> + Σj)

−1(z′j +

∑i 6=jzi

).

Applying the Sherman-Morrison formula, we obtain the pair of identities

Z = (θ∗)>(Σ−1j −

Σ−1j xjx

>j Σ−1

j

1 + x>j Σ−1j xj

)(zj +

∑i 6=jzi

), (72a)

Z′j = (θ∗)>

(Σ−1j −

Σ−1j x

′jx′>j Σ−1

j

1 + x′>j Σ−1j x

′j

)(z′j +

∑i 6=jzi

). (72b)

Note that in Eq. (72a), the index j is arbitrary and the same equation can be written for all1 ≤ j ≤ n.

With this notation defined, we now state two lemmas. Their proofs are deferred to Subsec-tions 8.1.1 and 8.1.2, respectively. For the first lemma, recall that we use ‖X‖q = (E|X|q)1/q todenote the Lq norm of a random variable X.

Lemma 7. Consider the random variable Z (70). There exists a universal, positive constantC such that for all integers q satisfying 1 ≤ q ≤ n−d−1

16 , it holds that

‖Z − EZ‖q ≤ CK1q7/2

√n.

Lemma 8. There exists a constant C > 0 such that for all q ≥ 1,(E∥∥∥∑

i 6=jzi

∥∥∥2q

2

)1/2q≤ CK1q

2n.

We now use the moment bound from Lemma 7 to obtain a tail bound. Note that byassumption, n ≥ 2d, so that (n − d − 1)/16 ≥ n/64. Additionally, since Z = 〈θ∗, Tn(θ)〉 bydefinition (recall Eq. (70)), we apply Lemma 22 to obtain the inequality P|Z| ≥ CK1 ≤ e−cn.Thus, taking ζ = 7/2 and invoking Lemma 23, we obtain the desired result.

It remains to prove the technical lemmas.

8.1.1 Proof of Lemma 7

For all 1 ≤ j ≤ n, let Xj = xj , yj. Following Boucheron et al. (2005), define

V + = E n∑j=1

(Z − Z ′j

)2+

∣∣∣xi, yini=1

, (73)

and

V − = E n∑j=1

(Z − Z ′j

)2−

∣∣∣xi, yini=1

. (74)

With this notation in hand, we have that for all q ≥ 2,

‖Z − EZ‖q(i)

≤ ‖(Z − EZ)+‖q + ‖(Z − EZ)−‖q(ii)

≤ C√q√‖V +‖q/2 + C

√q√‖V −‖q/2, (75)

49

Page 50: Kabir Aladin Chandrasekher

where step (i) follows from the triangle inequality and step (ii) follows from Boucheron et al.(2005, Theorem 2). Additionally, applying the triangle inequality in combination with thesimple numeric inequalities a2

+ ≤ a2, a2− ≤ a2 yields the pair of inequalities

‖V +‖q ≤n∑j=1

∥∥E[(Z − Z ′j)2∣∣xi, yini=1

]∥∥q,

and

‖V −‖q ≤n∑j=1

∥∥E(Z − Z ′j)2∣∣xi, yini=1

∥∥q.

Combining the inequality (75) with the above two displays yields the inequality

‖Z − EZ‖q ≤ C√q

( n∑j=1

∥∥E[(Z − Z ′j)2∣∣xi, yini=1

]∥∥q

)1/2

. (76)

Continuing, we see that

∥∥E[(Z − Z ′j)2∣∣xi, yini=1

]∥∥q

(i)

≤(EE[(Z − Z ′j)2q

∣∣xi, yini=1

])1/q

(ii)=(EE[(Z − Z ′j)2q

∣∣xi, yii 6=j])1/q(77)

where step (i) follows by noting that the map x 7→ xa is convex on R≥0 for a ≥ 1 and applyingJensen’s inequality; and step (ii) follows by applying the tower property of conditional expec-tation to remove the conditioning on the sample xj , yj . Recalling the representations (72a)and (72b), we let

Z − Z ′j =(T1 − T ′1

)+(T2 − T ′2

)+(T3 − T ′3

), (78)

where

T1 = (θ∗)>Σ−1j zj , (79a)

T2 = (θ∗)>(Σ−1

j xjx>j Σ−1

j

1 + x>j Σ−1j xj

)∑i 6=jzi, (79b)

T3 = (θ∗)>(Σ−1

j xjx>j Σ−1

j

1 + x>j Σ−1j xj

)zj , (79c)

and T ′1, T′2, T

′3 are equivalently defined, with x′j , z

′j in place of xj , zj . Now, applying the numeric

inequality (∑k

i=1 ai)2q ≤ k2q−1

∑ki=1 a

2qi to the term (Z − Z ′j)2q, using the decomposition (78)

and noting that for a = 1, 2, 3, the terms Ta and T ′a are identically distributed conditionedon xi, yii 6=j , we obtain the inequality

E[(Z − Z ′j)2q

∣∣xi, yii 6=j] ≤ 62q E[T 2q

1 + T 2q2 + T 2q

3 |xi, yii 6=j]. (80)

Recalling the shorthand (69), plugging into the definition of T3 (79c), and noting that sinceΣ−1j is positive semidefinite, (x>j Σ−1

j xj)/(1 + x>j Σ−1j xj) ≤ 1, we have

T 2q3 =

((θ∗)>

Σ−1j xjx

>j Σ−1

j xjω(〈xj ,θ〉, yj)1 + x>j Σ−1

j xj

)2q

≤ T 2q1 .

50

Page 51: Kabir Aladin Chandrasekher

Consequently, we obtain the bound

ET 2q

3 |xi, yii 6=j≤ E

T 2q

1 |xi, yii 6=j. (81)

We proceed to bound the conditional moments ET 2q

1 |xi, yii 6=j

and ET 2q

2 |xi, yii 6=j

inturn. Recall the sub-exponential norm ψ1 (see, for instance Vershynin (2018, Definition 2.7.5))and note that for any vector u ∈ Rd:

‖z>j u‖ψ1 = ‖ω(〈xj ,θ〉, yj)x>j u‖ψ1

(i)

≤ K1‖u‖2, (82)

where step (i) follows since ω(〈xj ,θ〉, yj) is sub-Gaussian, x>j u is ‖u‖2-sub-Gaussian, and theproduct of sub-Gaussian random variables is sub-exponential (see for instance Vershynin (2018,Lemma 2.7.7)). Thus,

E[T 2q

1 |xi, yii 6=j] (i)

≤ (CK1q)2q‖Σ−1

j θ∗‖2q2 ≤ (CK1q)

2q‖Σ−1j ‖

2qop, (83)

where step (i) follows by recalling the definition of T1 (79a), applying the inequality (82) foru = Σ−1

j θ∗, and using the Lp norm characterization of sub-exponential random variables (see

for instance Vershynin (2018, Proposition 2.7.1, part b.)). We now consider T2. We have

ET 2q

2 |xi, yii 6=j (i)

≤(E(x>j Σ−1

j θ∗)4q|xi, yii 6=j)1/2(

E(x>j Σ−1

j

∑i 6=jzi)4q|xi, yii 6=j)1/2

(ii)

≤ (Cq)2q‖Σ−1j θ

∗‖2q2∥∥∥Σ−1

j

∑i 6=jzi

∥∥∥2q

2

≤ (Cq)2q‖Σ−1j ‖

4qop

∥∥∥∑i 6=jzi

∥∥∥2q

2, (84)

where step (i) follows from substituting the definition of T2 (79b), using the fact thatx>j Σ−1

j xj ≥ 0 and applying the Cauchy-Schwarz inequality; and step (ii) follows fromthe Lp norm characterization of sub-exponential random variables. Now, plugging thebounds (81), (83), and (84) into the inequality (80), we obtain

E[(Z − Z ′j)2q

∣∣xi, yini=1

]≤ (Cq)2q‖Σ−1

j ‖4qop

∥∥∥∑i 6=jzi

∥∥∥2q

2+ (CK1q)

2q‖Σ−1j ‖

2qop.

Consequently, plugging the inequality above into the RHS of the inequality (77) and subse-quently using the Cauchy-Schwarz inequality yields∥∥E(Z − Z ′j)2

∣∣xi, yini=1

∥∥q≤ Cq2

((E‖Σ−1

j ‖8qop

E∥∥∥∑

i 6=jzi

∥∥∥4q

2

)1/2q+K1

(E‖Σ−1

j ‖2qop

)1/q).

(85)

Now note that Lemma 21 from the appendix yields

E‖Σ−1‖pop

≤(C

n

)p, for 1 ≤ p < n− d− 1

2.

Applying Lemmas 8 and 21 to the RHS of the inequality (85), we obtain∥∥E(Z − Z ′j)2∣∣xi, yini=1

∥∥q≤ CK1q

4

n2, for 1 ≤ q < n− d− 1

16.

Substituting the above bound into the RHS of the inequality (76), we have

‖Z − EZ‖q ≤CK1q

7/2

√n

, for 1 ≤ q < n− d− 1

16, (86)

which completes the proof.

51

Page 52: Kabir Aladin Chandrasekher

8.1.2 Proof of Lemma 8

We begin by centering. We have

E∥∥∥∑

i 6=jzi

∥∥∥2q

2

= E

∥∥∥∑i 6=j

(zi − E zi

)+∑i 6=j

E zi∥∥∥2q

2

(i)

≤ 22q−1(E∥∥∥∑

i 6=jzi − E zi

∥∥∥2q

2

+∥∥∥∑i 6=j

E zi∥∥∥2q

2

),

(87)

where step (i) used the numeric inequality (a + b)q ≤ 22q−1(a2q + b2q). We tackle each of thetwo terms on the RHS of the above display in turn.

Bounding∥∥∑

i 6=j E zi∥∥2q

2. First, we apply the triangle inequality to obtain∥∥∥∑

i 6=jE zi

∥∥∥2≤ n‖E zi‖2.

Towards bounding the RHS of the inequality in the above display, we consider the subspaceS = span(θ,θ∗) to obtain

E zi = Exiω(〈xi,θ〉, yi) = EPSxiω(〈xi,θ〉, yi)

+ E

P⊥S xiω(〈xi,θ〉, yi)

(i)= E

PSxiω(〈xi,θ〉, yi)

+ E

P⊥S xi

Eω(〈xi,θ〉, yi)

(ii)= E

PSxiω(〈xi,θ〉, yi)

− E

PSxi

Eω(〈xi,θ〉, yi) ,

where step (i) follows since xi ∼ N (0, I) so that by definition of the subspace S, the randomvector P⊥S xi and the random variable ω(〈xi,θ〉, yi) are independent and step (ii) follows sinceP⊥S xi = xi − PSxi and Exi = 0. Continuing from the display above, we apply the triangleinequality, Jensen’s inequality, and the Cauchy-Schwarz inequality in succession to obtain thebound

‖E zi‖2 ≤(E‖PSxi‖22Eω(〈xi,θ〉, yi)2

)1/2+ E‖PSxi‖2E|ω(〈xi,θ〉, yi)|. (88)

Next, by Gram-Schmidt,

PSxi =〈xi,θ∗〉‖θ∗‖22

θ∗ +〈xi,P⊥θ∗θ〉‖P⊥θ∗θ‖22

P⊥θ∗θ.

Hence,

E‖PSxi‖22 = E〈xi,θ∗〉2‖θ∗‖22

+ E

〈xi,P⊥θ∗θ〉2‖P⊥θ∗θ‖22

(i)= 2, (89)

where step (i) follows since for any vector v, 〈xi,v〉 ∼ N (0, ‖v‖22). Note additionally that byJensen’s inequality E ‖PSxi‖2 ≤

√E ‖PSxi‖22. Thus, substituting the bound (89) into the RHS

of the inequality (88), we obtain

‖E zi‖2 ≤√

2(Eω(〈xi,θ〉, yi)2

)1/2+√

2E|ω(〈xi,θ〉, yi)|(i)

≤ CK1,

where step (i) follows by noting that by Assumption 1, ω(〈xi,θ〉, yi) is a sub-Gaussian ran-dom variable, and further bounding its moments using the Lp characterization of sub-Gaussianrandom variables Vershynin (2018, Proposition 2.5.2 (ii)). Taking stock, we have shown that∥∥∑

i 6=jE zi

∥∥2q

2≤ (CK1n)2q. (90)

52

Page 53: Kabir Aladin Chandrasekher

Bounding E∥∥∑

i 6=j zi − E zi∥∥2q

2

. To reduce notation, let

zi := zi − E zi.

Then, applying Ledoux and Talagrand (2013, Theorem 6.20), we obtain(E∥∥∑

i 6=jzi∥∥2q

2

)1/2q ≤ C 2q

log 2q

(E∥∥∑

i 6=jzi∥∥

2

+(E

maxi 6=j‖zi‖2q2

)1/2q). (91)

To bound the first term, note that we have

E∥∥∑

i 6=jzi∥∥

2

(i)

≤(E∥∥∑

i 6=jzi∥∥2

2

)1/2 (ii)=(∑i 6=j

E ‖zi‖22)1/2

=( ∑i 6=j,k∈[d]

EX2ikω(〈xi,θ〉, yi)2

)1/2.

Here, we have used Xik to denote the ik-th entry of the matrix X. Step (i) follows by Jensen’sinequality and step (ii) follows since the random vectors (zi)1≤i≤n are zero-mean and indepen-dent. Now, we apply the Cauchy-Schwarz inequality to obtain

EX2ikω(〈xi,θ〉, yi)2 ≤ EX4

ik1/2Eω(〈xi,θ〉, yi)41/2(i)

≤ CK21 ,

where step (i) follows by noting that the random variables Xik and ω(〈xi,θ〉, yi) are sub-Gaussian and subsequently using the Lp characterization of sub-Gaussian random variables.Combining the bounds in the above two displays, we obtain the inequality

E∥∥∑

i 6=jzi∥∥

2

≤ CK1

√nd. (92)

Turning to the next term, let Zik denote the ik-th entry of the matrix Z = [z1 | z2 | · · · | zn]>,and note

E

maxi 6=j‖zi‖2q2

(i)

≤∑i 6=j

E‖zi‖2q2 (ii)

≤ ndq−1 E d∑k=1

Z2qik

(iii)

≤ (CK1qn)2q. (93)

Step (i) follows since ‖zi‖2 are non-negative random variables, step (ii) follows by applyingJensen’s inequality to the term (1/d

∑di=1 Z

2ik)

q, and step (iii) follows from the Lp characteri-zation of sub-exponential random variables. Finally, using the facts that d ≤ n and q ≥ 1, wesubstitute inequalities (92) and (93) into the inequality (91) to obtain the inequality(

E∥∥∑

i 6=jzi∥∥2q

2

)1/2q ≤ CK1

log 2qq2n.

Consequently, we have

E∥∥∑

i 6=jzi − E zi

∥∥2q

2

≤ (CK1q

2n)2q. (94)

Putting it all together. Substituting the inequalities (90) and (94) into the inequality (87)yields

E∥∥∥∑

i 6=jzi

∥∥∥2q

2

≤ (CK1q

2n)2q,

and the lemma is proved upon raising each side of the inequality in the display above to thepower 1/(2q).

53

Page 54: Kabir Aladin Chandrasekher

8.2 Proof of Lemma 6: Controlling the bias

We recall the leave-one-out notation Σj and zj (71) as well as the empirical update Tn(θ) (65)and apply the Sherman-Morrison formula to obtain

Tn(θ) = (X>X)−1X>ω(Xθ,y) =n∑i=1

(Σi + xix

>i

)−1xiω(〈xi,θ〉, yi)

=n∑i=1

Σ−1i xi

1 + x>i Σ−1i xi

ω(〈xi,θ〉, yi).

Before proceeding, we state the following lemma, whose proof we provide in Subsection 8.2.1.

Lemma 9. Under the setting of Proposition 3, there exist universal, positive constants c0, c,and C such that for all c0/

√n < t ≤ 1,

Pr∣∣∣x>i Σ−1

i xi −d

n− d− 2

∣∣∣ ≥ t ≤ C exp−c(t√n)2/3.

Taking this lemma as given, we proceed to prove Lemma 6. Introduce the shorthand

Ui = x>i Σ−1i xi −

d

n− d− 2,

so that

〈θ∗, Tn(θ)〉 =

n∑i=1

(θ∗)>Σ−1i xi

1 + dn−d−2 + Ui

· ω(〈xi,θ〉, yi) = T1 + T2, (95)

where we let

T1 =n∑i=1

(θ∗)>Σ−1i xi

1 + dn−d−2

· ω(〈xi,θ〉, yi),

T2 =n∑i=1

(θ∗)>Σ−1i xiUi(

1 + dn−d−2 + Ui

)(1 + d

n−d−2

) · ω(〈xi,θ〉, yi).

Note that

ET1(i)=n− d− 2

n− 2· 〈θ∗,

n∑i=1

EΣ−1i Exiω(〈xi,θ〉, yi)〉

(ii)=

n(n− d− 2)

(n− 2)(n− d− 2)· E〈θ∗, xi〉 · ω(〈xi,θ〉, yi),

where step (i) follows since Σi and xiω(〈xi,θ〉, yi) are independent and step (ii) follows sinceΣ−1i follows the inverse Wishart distribution with n−1 degrees of freedom and scale matrix Id.

Consequently, using the fact that 2d ≤ n in conjunction with sub-Gaussianity of ω(〈xi,θ〉, yi)by Assumption 1, we deduce∣∣ET1 − E〈θ∗, xi〉 · ω(〈xi,θ〉, yi)

∣∣ ≤ CK1

n. (96)

We turn now to bounding the term T2. Note that the denominator is each summand of T2

is lower bounded by 1, since Σ−1 is a PSD matrix. This, in conjunction with the triangleinequality, yields

|ET2| ≤n∑i=1

E|Uiω(〈xi,θ〉, yi)(θ∗)>Σ−1

i xi| (i)

≤ n√E〈θ∗,Σ−1

i xi〉2√

EU2i ω(〈xi,θ〉, yi)2,

54

Page 55: Kabir Aladin Chandrasekher

where step (i) follows from the Cauchy-Schwarz inequality. Next, we have

E〈θ∗, Σ−1i xi〉

2 = EE〈θ∗, Σ−1

i xi〉2 | xjj 6=i

(i)= E‖Σ−1

i θ∗‖22

(ii)

≤ C

n2,

where step (i) follows since xi and Σ−1i are independent, whence x>i Σ−1

i θ∗ ∼ N (0, ‖Σ−1

i θ∗‖22)

and step (ii) makes use of Lemma 21 from the appendix. Once more applying the Cauchy-Schwarz inequality, we obtain√

EU2i ω(〈xi,θ〉, yi)2 ≤ (EU4

i )1/4(Eω(〈xi,θ〉, yi)4)1/4 ≤ CK1(EU4i )1/4,

where in the last inequality, we have noted that by Assumption 1, ω(〈xi,θ〉, yi) is a sub-Gaussianrandom variable and subsequently applied the Lp characterization of sub-Gaussian randomvariables. Now, the integration by parts formula for non-negative random variables implies

EU4i = E|Ui|4 =

∫ ∞0

4t3 Pr|Ui| ≥ tdt(i)

≤∫ c0/

√n

04t3dt+

∫ ∞c0/√n

4Ct3e−c(t√n)2/3dt ≤ C

n2,

where step (i) follows by applying Lemma 9. Putting the pieces together, we obtain

|ET2| ≤CK1√n. (97)

Finally, substituting the bound on T1 (96) and the upper bound on T2 (97) into the decompo-sition (95), we obtain

|E〈θ∗, Tn(θ)〉 − 〈θ∗, T (θ)〉| ≤ CK1√n,

as desired.

8.2.1 Proof of Lemma 9

We begin with the decomposition∣∣∣x>i Σ−1i xi −

d

n− d− 2

∣∣∣ ≤ ∣∣∣x>i Σ−1i xi − Tr(Σ−1

i )∣∣∣+∣∣∣Tr(Σ−1

i )− d

n− d− 2

∣∣∣,so that

Pr∣∣∣x>i Σ−1

i xi −d

n− d− 2

∣∣∣ ≥ t ≤ Pr∣∣∣x>i Σ−1

i xi − Tr(Σ−1i )∣∣∣ ≥ t

2

+ Pr

∣∣∣Tr(Σ−1i )− d

n− d− 2

∣∣∣ ≥ t

2

.

(98)

The result is a consequence of the following claim:

Pr∣∣∣x>i Σ−1

i xi − Tr(Σ−1i )∣∣∣ ≥ t

2

≤ 2 exp

−cmin

((nt)2

d, nt)

+ 2e−cn, (99a)

Pr∣∣∣Tr(Σ−1

i )− d

n− d− 2

∣∣∣ ≥ t

2

≤ C exp−c(t

√n)2/3+ e−cn. (99b)

Indeed, Lemma 9 follows immediately from substituting claim 99 into inequality (98) and usingthe condition n ≥ Cd to simplify.

It remains to prove claim (99).

55

Page 56: Kabir Aladin Chandrasekher

Proof of the inequality (99a). To begin, define the event

Ai := λmin := λmin(Σi) ≥ c2d,

and apply Vershynin (2018, Theorem 4.6.1) to obtain

PrAci ≤ 2e−cn.

We then see that

Pr∣∣∣x>i Σ−1

i xi − Tr(Σ−1i )∣∣∣ ≥ t ≤ Pr

∣∣∣x>i Σ−1i xi − Tr(Σ−1

i )∣∣∣ ≥ t,Ai+ 2e−cn. (100)

Then, for any fixed Σi, we note that Exix>i Σ−1i xi = Tr(Σ−1

i ) and subsequently apply theHanson-Wright inequality Vershynin (2018, Theorem 6.2.1), to obtain

Pr∣∣∣x>i Σ−1

i xi − Tr(Σ−1i )∣∣∣ ≥ t ≤ 2 exp

−cmin

( t2

‖Σ−1i ‖2F

,t

‖Σ−1i ‖op

)(i)

≤ 2 exp−cmin

( t2

d‖Σ−1i ‖2op

,t

‖Σ−1i ‖op

)= 2 exp

−cmin

(λ2mint

2

d, λ2

mint), (101)

where step (i) follows from the chain of inequalities (which hold for any matrix A)

‖A‖2F = Tr(A>A) ≤ d‖A‖2op.

Subsequently, substituting the inequality (101) into the inequality (100), we obtain

Pr∣∣∣x>i Σ−1

i xi − Tr(Σ−1i )∣∣∣ ≥ t ≤ 2 exp

−cmin

((nt)2

d, nt)

+ 2e−cn.

Proof of the inequality (99b). The proof of this statement follows a similar strategy to thatof Lemma 5. In lighten notation, we prove the inequality for the full matrix Σ rather than theleave-one-out matrix Σi. Since Σ−1 follows an inverse Wishart distribution with n degrees offreedom (instead of n− 1 for the matrix Σ−1

i ), it suffices to show that

Pr∣∣∣Tr(Σ−1)− d

n− d− 1

∣∣∣ ≥ t

2

≤ C exp−c(t

√n)2/3+ e−cn. (102)

Define the random variable Z := Tr(Σ−1), and notice that the Sherman-Morrison formulaimplies

Z = Tr(Σ−1j )− 1

1 + x>j Σ−1j xj

Tr(Σ−1j xjx

>j Σ−1

j ).

Then, for each j ∈ [n], define the random variable

Z ′j := Tr(Σ−1j )− 1

1 + x′>j Σ−1j x

′j

Tr(Σ−1j x

′jx′>j Σ−1

j ),

and note the following inequality, whose proof is analogous to the proof of the inequality (76)used in Lemma 5:

‖Z − EZ‖q ≤ C√q

( n∑j=1

∥∥E(Z − Z ′j)2∣∣xini=1

∥∥q

)1/2

, for all integers q ≥ 1. (103)

56

Page 57: Kabir Aladin Chandrasekher

Subsequently applying Jensen’s inequality to the function x 7→ xq, which is convex on R+ forq ≥ 1 followed by the tower property of conditional expectation then yields∥∥E(Z − Z ′j)2

∣∣xini=1

∥∥q≤(E

(Z − Z ′j)2q)1/q

.

Continuing, we have

E

(Z − Z ′j)2q

= E(

Tr(Σ−1j xjx

>j Σ−1

j )

1 + x>j Σ−1j xj

−Tr(Σ−1

j x′jx′>j Σ−1

j )

1 + x′>j Σ−1j x

′j

)2q(i)

≤ 22q · E(

Tr(Σ−1j xjx

>j Σ−1

j )

1 + x>j Σ−1j xj

)2q(ii)

≤ 22q · E(

Tr(Σ−1j xjx

>j Σ−1

j ))2q

.

Here step (i) follows by using the numeric inequality (a+ b)2q ≤ 22q−1(a2q + b2q) as well as thefact that xj ,x

′j are identically distributed. On the other hand, step (ii) follows since Σ−1

j ispositive semidefinite. We then obtain the following chain of inequalities:

E(

Tr(Σ−1j xjx

>j Σ−1

j ))2q

= E(

Tr(Σ−2j xjx

>j ))2q (i)

≤ E‖Σ−2

j ‖2qop

E(

Tr(xjx>j ))2q

, (104)

where in step (i), we used the bound Tr(A1A2) ≤ ‖A1‖opTr(A2) (which holds as long as A1 ispositive semidefinite) as well as the fact that Σj and xj are independent. Now, recall that Xjk

refers to the jk-th entry of the data matrix X and note that

E(

Tr(xjx>j ))2q

= E( d∑

k=1

Xjk

)2q (i)

≤ d2qEX4qjk

(ii)

≤ (Cdq)2q, (105)

where step (i) follows by applying Jensen’s inequality to the term(

(1/d) ·∑d

k=1Xjk

)2qand

step (ii) follows since Xjk ∼ N (0, 1). Additionally, we note that Lemma 21 implies

E‖Σ−2

j ‖2qop

= E

‖Σ−1

j ‖4qop

≤(Cn

)4q, for 1 ≤ q < n− d− 1

16. (106)

Thus, substituting the inequalities (105) and (106) into the RHS of the inequality (104) yields

E(

Tr(Σ−1j xjx

>j Σ−1

j ))2q ≤ (Cq

n

)2q, for 1 ≤ q < n− d− 1

16.

Next, substituting the above display into the inequality (103), we obtain

‖Z − EZ‖q ≤Cq3/2

√n, for 1 ≤ q < n− d− 1

16.

Now, invoking Lemma 23, we obtain the tail bound

Pr∣∣Z − EZ

∣∣ ≥ t ≤ Ce−c(t√n)2/3 + e−cn. (107)

Finally, recalling that Σ−1 follows the inverse Wishart distribution with n degrees of freedom andscale matrix Id, we see that EZ = Tr(Σ−1)− d

n−d−1 , and this proves claim (99) as desired.

57

Page 58: Kabir Aladin Chandrasekher

9 Proofs of results for specific models

Recall the shorthand ρ = β/α and φ = tan−1(ρ). Also recall our definition of the good region:

G = ζ = (α, β) : 0.55 ≤ α ≤ 1.05, and ρ ≤ 1/5.

Note that by definition, this ensures that β ≤ 0.21 for all (α, β) ∈ G. The shorthand Aσ(ρ) =

2π tan−1

(√ρ2 + σ2 + σ2ρ2

)and Bσ(ρ) = 2

π

√ρ2+σ2+σ2ρ2

1+ρ2was defined in equation (41). Using

these, define the functions

F (α, β) = 1− 1

π(2φ− sin(2φ))

G(α, β) =

√4

π2sin4(φ) +

1

κ− 1

(1− (1− 1

π(2φ− sin(2φ)))2 − 4

π2sin4(φ) + σ2

),

g(α, β) =

√4

π2sin4 φ+

1

κ

α2 + β2 − 2α

(1− 1

π(2φ− sin(2φ))

)− 2β · 2

πsin2 φ+ 1 + σ2

,

Fσ(α, β) = 1−Aσ(ρ) +Bσ(ρ),

Gσ(α, β) =

√ρ2Bσ(ρ)2 +

1

κ− 1(1 + σ2 − (1−Aσ(ρ) +Bσ(ρ))2 − ρ2Bσ(ρ)2), and

gσ(α, β) =

√ρ2Bσ(ρ)2 +

1

κα2 + β2 − 2α(1−Aσ(ρ) +Bσ(ρ))− 2βρBσ(ρ) + 1 + σ2.

The pair (F,G) denotes the (αgor, βgor) map for the alternating minimization update for phaseretrieval, (F, g) the map for subgradient descent in phase retrieval with stepsize η = 1/2,(Fσ, Gσ) the map for alternating minimization for mixtures of regressions, and (Fσ, gσ) the mapfor subgradient method in mixtures of regressions with stepsize η = 1/2. With this notationdefined, we collect some preliminary lemmas.

9.1 Preliminary lemmas

The first two lemmas collect properties of the maps defined above, and are proved in Sections D.6and D.7 of the appendix, respectively. A key consequence of these lemmas is that we obtainbounds on the the derivatives of the αgor and βgor maps when evaluated for any element in thisregion.

Lemma 10. Suppose (α, β, σ, κ) are all nonnegative scalars. There is a universal positiveconstant C such that the maps above satisfy the following relations.

(a) For all (α, β) pairs, we have(

1− 4φ3

)∨ 0 ≤ F0(α, β) ≤ 1. Additionally, if ρ ≤ 1/5, then

1− F0 ≥ 25 · φ

3.

(b) For all (α, β) pairs satisfying ρ ≥ 2 and β ≤ 1, we have F0(α, β) ≥ 1.06α.

(c) For all (α, β) pairs, Fσ(α, β) is non-decreasing in σ and Fσ(α, β) ≤ 1 + 2σ3

3π for all σ ≥ 0.

(d) If σ ≤ 0.5 and κ ≥ C, then for all (α, β) pairs, we have

1√κ− 1

(1− [Fσ(α, β)]2)1/2 ≤ Gσ(α, β) ≤ 0.8.

(e) If κ ≥ C, then for all (α, β) ∈ G, we have [G0(α, β)]2 ≤ φ3

10 .

58

Page 59: Kabir Aladin Chandrasekher

(f) For all σ ≥ 0, we have

κ− 1

κ· [Gσ(α, β)]2 ≤ [gσ(α, β)]2 ≤ [Gσ(α, β)]2 +

2

κ

((1− α)2 + β2

)+

2

κ[ρBσ(ρ)]2 +

2

κ[1− Fσ(α, β)]2.

(g) If σ ≤ 0.5 and κ ≥ C, then for all ρ ≤ 2, we have

(1 +

2σ3

)−1

·

√[ρBσ(ρ)]2 · κ− 2

κ− 1+

σ2

2(κ− 1)≤ Gσ(α, β)

Fσ(α, β)≤ 4

5· ρ+

2σ√κ− 1

.

Lemma 11. Suppose (α, β, σ, κ) are all nonnegative scalars. There is a universal positiveconstant C such that the gradients of the maps above satisfy the following relations.

(a) If σ ≤ 1/2, we have ‖∇Fσ(α, β)‖1 ≤ 0.5 for all ρ ≤ 1/4 and α ≥ 1/2.

(b) If σ ≤ 1/2 and κ ≥ C, we have ‖∇Gσ(α, β)‖1 ≤ 0.98 for all ρ ≤ 1/4 and α ≥ 1/2.

(c) For all (α, β) pairs, we have ‖∇G(α, β)‖1 ≤ ‖∇G0(α, β)‖1 and ‖∇g(α, β)‖1 ≤‖∇g0(α, β)‖1.

(d) If σ ≤ 1/2 and ρ ≤ 1/4, we have ‖∇gσ(α, β)‖1 ≤ ‖∇Gσ(α, β)‖1 + 1√κ

(3 + ‖∇Fσ(α, β)‖1).

Next, we present two technical lemmas that allow us to argue part (b) and part (c) in ourtheorems, respectively. These lemmas are proved in Sections D.8 and D.9 of the appendix,respectively. In the first lemma, we show that provided the gradients of the Gordon stateevolution updates are bounded above by 1− τ in `1, small deviations of the empirics from thesemaps are not amplified over the course of successive iterations.

Lemma 12. Suppose S = (F ,G) denotes a state evolution operator that is G-faithful. Letζt = (αt, βtTt=0 denote a sequence of state evolution elements satisfying

‖ζt+1 − S(αt, βt)‖∞ ≤ ∆ for each 0 ≤ t ≤ T − 1.

Also suppose that ‖∇F (α, β)‖1 ∨ ‖∇G(α, β)‖1 ≤ 1 − τ for some τ > 0 and all (α, β) ∈B∞(G; ∆/τ). Then provided (α0, β0) ∈ G, we have

max0≤t≤T

‖ζt − St(α0, β0)‖∞ ≤

τ.

To state the last lemma, let us state some generic conditions on a state evolution operatorS = (F ,G). A subset of these will be used in the lemma.

C1. F (α, β) ≥ (50√d)−1 for all (α, β) such that φ ≤ π/2− (50

√d)−1.

C2. F (α, β) ≥ 1.06 · α for all ρ ≥ 2, and F (α, β) ≥ 0.56 if ρ ≤ 2.

C3. G(α,β)

F (α,β)≤ 7

8 · ρ for all 15 ≤ ρ ≤ 2 and 1/2 ≤ α ≤ 3/2.

C4. F (α, β) ≤ 1.04 for all (α, β) and G(α,β)

F (α,β)≤ 1/6 for all ρ ≤ 1/5.

C5a. G(α, β) ≤ 0.99 for all (α, β).

C5b. G(α, β) ≤ 0.99 if α ∨ β ≤ 3/2.

59

Page 60: Kabir Aladin Chandrasekher

As will be shown in the proof of Lemma 13, conditions C1 and C2 are useful to ensure that theiterates are boosted from a random initialization to a region in which ρ ≤ 2. Post that point,we use condition C3 to show that the ratio is boosted further to ρ ≥ 5. Finally, conditions C4and—depending on context—one of C5a/b are used to show that one more step of the operatorpushes the iterates into the good region.

We also state two possible assumptions on the initialization (α0, β0), where the secondassumption is strictly stronger than the first. These will be used in conjunction with conditionsC1 and C2 to handle the first few iterates of the algorithm from a random initialization.

Ia. α0/β0 ≥ (50√d)−1.

Ib. α0/β0 ≥ (50√d)−1 and α0 ∨ β0 ≤ 3/2.

Having stated the various assumptions, we are now in a position to state Lemma 13.

Lemma 13. There is a universal constant c > 0 such that the following is true. Let

t0 := log1.05(50√d) + log55/54(10) + 2.

Suppose S = (F ,G) denotes a state evolution operator and that there there exists a sequence ofelements ζt = (αt, βt)t≥0 satisfying

max0≤t≤t0

|αt+1 − F (αt, βt)| ≤c√d, and (108a)

max0≤t≤t0

|βt+1 −G(αt, βt)| ≤ c. (108b)

(a) If S satisfies conditions C1-C4 and C5a and the initialization (α0, β0) satisfies condition Ia,then

ζt ∈ G for some t ≤ t0.

(b) If S satisfies conditions C1-C4 and C5b and the initialization (α0, β0) satisfies condition Ib,then

ζt ∈ G for some t ≤ t0.

With all of these lemmas stated, we are now in a position to prove the various theorems.Before proceeding to this, we make one remark about the proofs of part (a) of these theorems,in particular the transient period.

Remark 4. Some of our bounds—especially the lower bounds on convergence rates—rely on anexplicit relation between the quantities |1−α| and β that comes from the Gordon state evolution.This is the reason why these bounds require a transient period of 1 iteration: Once the Gordonupdate is run for just one iteration, the requisite relationship can be ensured.

9.2 Proof of Theorem 3: Alternating minimization for phase retrieval

We prove each step of the theorem in turn. It is useful to note that

F (α, β) = F0(α, β) and G(α, β)2 = G0(α, β)2 +σ2

κ− 1. (109)

9.2.1 Part (a): Convergence of Gordon state evolution in good region

In order to establish this part of the theorem, it suffices to show that the Gordon state evolutionis G-faithful, and to prove the upper and lower bounds on its one-step convergence behavior.

60

Page 61: Kabir Aladin Chandrasekher

Verifying that Sgor is G-faithful: We must show that if the pair (α, β) satisfies 0.55 ≤ α ≤1.05 and ρ ≤ 1/5, then 0.55 ≤ F (α, β) ≤ 1.05 and G(α, β)/F (α, β) ≤ 1/5. We show each ofthese bounds separately.

Bounding F (α, β): Applying Lemma 10(a), we have 0.95 ≤ F0(α, β) ≤ 1, where the lower bound

follows since φ ≤ tan−1(1/5). Using equation (109) finishes the claim.

Bounding G(α,β)F (α,β) : From equation (109) and Lemma 10(g), we have

G(α, β)

F (α, β)≤ G0(α, β) + σ/

√κ− 1

F0(α/β)≤ 4

5· ρ+

σ

F0(α, β) ·√κ− 1

. (110)

Now from Lemma 10(a), we have F0 ≥ 0.95 if ρ ≤ 1/5. Furthermore, σ2/κ ≤ c for a smallenough constant c. Putting together the pieces completes the proof.

Establishing upper bound on one-step distance: Equation (109) and Lemma 10(e) di-rectly yield that if (α, β) ∈ G and κ ≥ C and σ2/κ ≤ c, then

[G(α, β)]2 ≤ φ3

10+

σ2

κ− 1≤ 6β3

7+

σ2

κ− 1, (111)

where the last inequality follows since α ≥ 0.55. On the other hand, equation (109) andLemma 10(a) together yield the bound

(1− F (α, β))2 ≤ 16

9π2φ6 ≤ β3/100, (112)

where the final inequality follows since φ ≤ 1/5 and α ≥ 0.55. Putting together the pieces, wehave

[d(Sgor(ζ))]2 = [G(α, β)]2 + (1− F (α, β))2 ≤ β3 +σ2

κ− 1≤β2 + (1− α)2

3/2+

σ2

κ− 1,

and the desired upper bound follows from the elementary inequality√a+ b ≤

√a+√b.

Establishing lower bound on two-step distance: Given that we are interested in a tran-sient period of t0 = 1 (see Remark 4), let us now compute two steps of the Gordon update,letting F+ = F 2(α, β) and G+ = G2(α, β). Analogously, we let F = F (α, β) and G = G(α, β),and use φ+ = tan−1(G/F ) to denote the angle after one step of the Gordon update. Recall thatρ = tanφ = β/α. Combining equation (109) and Lemma 10(d), we have

G2+ ≥

1

κ− 1· (1− F 2

+) +σ2

κ− 1≥ 2

5(κ− 1)

(G√

1 +G2

)3

+σ2

κ− 1≥ 1

4(κ− 1)πG3 +

σ2

κ− 1.

Here, the penultimate inequality uses Lemma 10(a) and the facts that φ+ ≥ sinφ+ = G√F 2+G2

and F ≤ 1. The last inequality makes use of G ≤ 1. Turning now to the F+ component, we useρ ≤ 1/5 and Lemma 10(a) to obtain

π2(1− F+)2 ≥(

G√1 +G2

)6

≥ G6

8. (113)

61

Page 62: Kabir Aladin Chandrasekher

Putting together the pieces yields

G2+ + (1− F+)2 ≥ cκ ·G3 +

σ2

κ− 1+G6

8π2

≥ cκ ·G3 +σ2

κ− 1+

1

8π2

(1

κ− 1· (1− F 2)

)3

+1

8π2

(σ2

κ− 1

)3

≥ cκ ·G3 +1

8π2(κ− 1)3· (1− F )3 +

σ2

κ− 1

≥(c′κ ·

G2 + (1− F )2

3/4+

σ

2√κ− 1

)2

where the second inequality uses Lemma 10(d) and equation (109), and the last step followsbecause (A+B)κ ≤ 2κ(Aκ+Bκ) for any positive scalars (A,B) and κ ≥ 1. Taking square rootscompletes the proof.

9.2.2 Part (b): Empirical error is sharply tracked by Gordon state evolution

As mentioned before, the proof of this result relies on Lemma 12, and so we dedicate our efforttowards verifying the assumptions required to apply it. We set ∆0 = 1/2000 and τ0 = 1/50 forconvenience in computation, so that ∆0/τ0 = 0.025.

Verifying gradient conditions: The first step is to verify that the gradients of the Gordonstate evolution maps are bounded as desired. It is easy to verify that for all ζ = (α, β) ∈B∞(G; ∆0/τ0), we have α ≥ 1/2 and β/α ≤ 1/4. Consequently, parts (a) and (b) of Lemma 11yield that

‖∇F (α, β)‖1 ∨ ‖∇G(α, β)‖1 ≤ 0.98 = 1− τ0

for all (α, β) ∈ B∞(G; ∆0/τ0).

Defining the iterates: Put θt = T tn(θ) for each t ≥ 1 with the convention that θ0 = θ, andlet (αt, βt) = (α(θt), β(θt)) for each t ≥ 0. By Corollary 1(b), we have that with probabilityexceeding 1− n−10,

|αt+1 − F (α, β)| ∨ |βt+1 −G(α, β)| ≤ Cσ(

log7 n

n

)1/4

=: ∆n ≤ ∆0,

where the final inequality follows for n ≥ C ′σ. Taking a union bound over t = 0, . . . , T − 1, wesee that

max0≤t≤T−1

‖ζt+1 − Sgor(αt, βt)‖∞ ≤ ∆n.

with probability greater than 1− Tn−10.

Putting together the pieces: Applying Lemma 12 along with the conditions verified above,we have that provided (α0, β0) ∈ G,

max1≤t≤T

|αt − F t(α0, β0)| ∨ |βt −Gt(α0, β0)| ≤ 50∆n

with probability exceeding 1 − Tn−10. Note that ‖T tn(θ0) − θ∗‖2 = (1 − αt)2 + β2

t , and[d(Stgor(ζ0))]2 = [1− F t(α0, β0)]2 + [Gt(α0, β0)]2. Consequently, for each 1 ≤ t ≤ T , we have∣∣∣‖T tn(θ)− θ∗‖ − d(Stgor(ζ)

∣∣∣ ≤ √2|αt − F t(α0, β0)| ∨ |βt −Gt(α0, β0)|

. ∆n,

as desired.

62

Page 63: Kabir Aladin Chandrasekher

9.2.3 Part (c): Iterates converge to good region from random initialization

The proof of this result relies on Lemma 13(a), and we will apply it for the Gordon map (F,G)playing the role of (F ,G).

Verifying condition C1: Letting ζ = π/2 − φ, note that F (α, β) = 2ζ+sin(2ζ)π . This is clearly

an increasing function of ζ in the range 0 ≤ ζ ≤ π/2, and greater than (50√d)−1 when ζ =

(50√d)−1.

Verifying condition C2: From Lemma 10(b) and (d), we have F (α, β) ≥ 1.06α. We also haveF (α, β) ≥ 0.56 for ρ = 2, and F is a non-decreasing function of ρ.Verifying condition C3: The bound (110) yields

G(α, β)

F (α, β)≤ 4

5ρ+

σ

F (α, β) ·√κ− 1

.

Using F (α, β) ≥ 0.56 for all ρ ≤ 2 in conjunction with the fact that κ ≥ C, σ2/κ ≤ c, andρ ≥ 1/5, we obtain

4

5ρ+

σ

F (α, β) ·√κ− 1

≤ 4

5ρ+

3

40ρ.

Verifying condition C4: We have F (α, β) ≤ 1 for all (α, β). The inequality G(α,β)F (α,β) ≤ 1/6 for

ρ ≤ 1/5 follows from the bound (110) and the inequalities F (α, β) ≥ 0.95 and σ2/κ ≤ c.Verifying condition C5a: Clearly, we have G(α, β) ≤

√0.8 + σ2

κ−1 ≤ 0.95, where the last inequal-

ity holds for σ2/κ ≤ c.

Putting together the pieces: Note that Corollary 1(b) in conjunction with the union boundyields that the empirical updates satisfy

max1≤t≤t0

|αt − F t(α0, β0)| ≤ Cσ(

log7(t0/δ)

n

)1/2

and max1≤t≤t0

|βt −Gt(α0, β0)| ≤ C(

log(t0/δ)

n

)1/4

with probability exceeding 1 − δ. Furthermore, by assumption, we have α0/β0 ≥ (50√d)−1.

Thus, applying Lemma 13(a) yields that if n ≥ C ′σ for a sufficiently large constant C ′σ, we haveT tn(θ0) ∈ G on this event, completing the proof. .

9.3 Proof of Theorem 4: Subgradient descent for phase retrieval

Recall that the Gordon update in this case is given by the pair (F, g). It is also useful to notethat

g(α, β)2 = g0(α, β)2 +σ2

κ. (114)

9.3.1 Part (a): Convergence of Gordon state evolution in good region

As before, it suffices to show that the Gordon state evolution is G-faithful, and to prove theupper and lower bounds on its one-step convergence behavior.

Verifying that Sgor is G-faithful: The bounds on F were shown already in the proof ofTheorem 3. It remains to handle the ratio g/F .

Bounding g(α,β)F (α,β) : We begin by bounding g(α, β) alone. Using Lemma 10(f) and equation (114)

together yields

g(α, β) ≤√G0(α, β)2 +

1

κ(2[ρB0(ρ)]2 + 2[1− F0(α, β)]2 + 2(1− α)2 + 2β2 + σ2). (115)

63

Page 64: Kabir Aladin Chandrasekher

Now note that if ρ ≤ 1/5, then ρB0(ρ) = 2π sin2 φ ≤ 1

13π and 0.95 ≤ F0(α, β) ≤ 1, so that

[ρB0(ρ)]2 + [1− F0(α, β)]2 ≤ 1/300. (116)

Since (α, β) ∈ G, we also have

(1− α)2 + β2 ≤ 3/10. (117)

Putting together equations (116) and (117) with Lemma 10(g) and the inequality√a+ b ≤√

a+√b for two positive scalars (a, b), yields

g(α, β)

F (α, β)≤ 4

5· ρ+

√1

3(κ−1) + σ√κ−1

F0(α, β), (118)

But from Lemma 10(a), we have F0 ≥ 0.95 if ρ ≤ 1/5, and furthermore, κ ≥ C and σ2/κ ≤ c.Putting together the pieces completes the proof.

Establishing upper bound on one-step distance: Equation (114) and Lemma 10 parts(e) and (f) yield that if (α, β) ∈ G and κ ≥ C and σ2/κ ≤ c, then

[g(α, β)]2 ≤ φ3

10+

1

κ

(2[ρB0(ρ)]2 + 2[1− F0(α, β)]2 + 2(1− α)2 + 2β2 + σ2

). (119)

When φ ≤ 1/5 and α ≥ 0.55, Lemma 10(a) also yields

(1− F0(α, β))2 ≤ 16

9π2φ6. (120)

Finally, note that ρB0(ρ) = 2π sin2 φ ≤ 2φ2

π . Putting together the pieces and noting that φ ≤ 1/5and α ≥ 0.55, we have

[d`2(Sgor(ζ))]2 = [g(α, β)]2 + (1− F0(α, β))2 ≤ φ2

50+

1

10κφ2 +

2

κ[d`2(ζ)]2 + σ2/κ

≤ [d`2(ζ)]2

10+

3

κ[d`2(ζ)]2 + σ2/κ,

and the desired upper bound follows from choosing κ ≥ C.

Establishing lower bound on one-step distance: We begin with the following convenientcharacterization of the g map:

[g(α, β)]2 =κ− 2

κ− 1· [ρB0(ρ)]2 +

1

κ

(α− F (α, β))2 + (β − [ρB0(ρ)]2)2 + (1− [F (α, β)]2) + σ2

,

(121)

Applying Young’s inequality, we have

[g(α, β)]2 ≥ κ− 2

κ− 1· [ρB0(ρ)]2

+1

κ

(1

2(1− α)2 − (1− F (α, β))2 +

1

2β2 − [ρB0(ρ)]2 + (1− F (α, β))(1 + F (α, β)) + σ2

)≥ κ− 3

κ− 1· [ρB0(ρ)]2 +

1

κ

(1

2(1− α)2 +

1

2β2 + 2F (α, β) · (1− F (α, β)) + σ2

)≥ 1

2(κ− 1)[d`2(ζ)]2 +

σ2

κ,

where the final inequality follows since κ ≥ C. The proof follows by noting thatd`2(Sgor(ζ)) ≥ g(α, β).

64

Page 65: Kabir Aladin Chandrasekher

9.3.2 Part (b): Empirical error is sharply tracked by Gordon state evolution

This proof is almost identical to that of Theorem 3(b), so we only sketch the major difference:verifying the gradient conditions. We set ∆0 = 1/4000 and τ0 = 1/100 for convenience incomputation, so that ∆0/τ0 = 0.025.

Verifying gradient conditions: It is easy to verify that for all ζ = (α, β) ∈ B∞(G; ∆0/τ0),we have α ≥ 1/2 and β/α ≤ 1/4. Consequently, parts (ii-iv) of Lemma 11 yield that‖∇g(α, β)‖1 ≤ 0.99, where the final step follows for κ ≥ C. Combining this with Lemma 11(a),we have

‖∇F (α, β)‖1 ∨ ‖∇g(α, β)‖1 ≤ 0.99 = 1− τ0

for all (α, β) ∈ B∞(G; ∆0/τ0).

The rest of the proof proceeds identically.

9.3.3 Part (c): Iterates converge to good region from random initialization

This proof is almost identical to that of Theorem 3(c), so we only sketch the differences. Inthis case, we will apply Lemma 13(b). Conditions C1 and C2 are verified exactly as before. Itremains to verify conditions C3, C4, and C5b.

Verifying condition C3: If ρ ≤ 2, then ρB0(ρ) = 2π sin2 φ ≤ 0.85 and 0.56 ≤ 1− F0(α, β) ≤ 1, so

that [ρB0(ρ)]2 +[1−F0(α, β)]2 ≤ 1.1. Since ρ ≤ 2 and α ≤ 3/2, we also have (1−α)2 +β2 ≤ 10.Together with the bound (115) and Lemma 10(g), this yields

g(α, β)

F (α, β)≤ 4

5ρ+

2σ +√

22.2

F0(α, β) ·√κ− 1

.

Since F (α, β) ≥ 0.56 for all ρ ≤ 2, κ ≥ C, σ2/κ ≤ c, and ρ ≥ 1/5, we obtain

2σ +√

22.2

F0(α, β) ·√κ− 1

≤ 3

40ρ.

Verifying condition C4: We have F (α, β) ≤ 1 for all (α, β), as before. The inequality g(α,β)F (α,β) ≤

1/6 for ρ ≤ 1/5 follows from the bound (118) and the inequalities F (α, β) ≥ 0.95, κ ≥ C, andσ2/κ ≤ c. .

Verifying condition C5b: Note also that |ρB0(ρ)| ∨ |1− F0(α, β)| ≤ 1, and that (1− α)2 + β2 ≤5/2 if α ∨ β ≤ 3/2. Combining with equation (115) and Lemma 10(d), we have g(α, β) ≤√

0.8 + 9+σ2

κ ≤ 0.95, where the last inequality holds for κ ≥ C and σ2/κ ≤ c.

9.4 Proof of Theorem 5: Alternating minimization for mixtures of regres-sions

Recall that the Gordon update in this case is given by the pair (Fσ, Gσ).

9.4.1 Part (a): Convergence of Gordon state evolution in good region

As before, it suffices to show that the Gordon state evolution is G-faithful, and to prove theupper and lower bounds on its one-step convergence behavior.

65

Page 66: Kabir Aladin Chandrasekher

Verifying that Sgor is G-faithful: We verify the two inequalities separately.

Bounding Fσ(α, β): From Lemma 10(c), we have F0(α, β) ≤ Fσ(α, β) ≤ 1 + 2σ3

3π ≤ 1.05, wherethe final inequality holds since σ ≤ c. The lower bound on F0 established in the previous proofcompletes the claim.

Bounding Gσ(α,β)Fσ(α,β) : From Lemma 10(g), we directly have

Gσ(α, β)

Fσ(α, β)≤ 4

5· ρ+

2σ√κ− 1

, (122)

Noting that ρ ≤ 1/5 and σ2/κ ≤ c completes the proof.

Establishing upper bound on one-step distance: From Lemma 10(g), we have

Gσ(α, β)

Fσ(α, β)≤ 4

5· βα

+2σ√κ− 1

.

Now applying Lemma 26(b) yields

tan−1

(Gσ(α, β)

Fσ(α, β)

)≤ 51

50· 4

5· tan−1

α

)+

2σ√κ− 1

;

to complete the proof, note that d∠(ζ) = tan−1(β/α) for an element ζ = (α, β) of the state-evolution.

Establishing lower bound on one-step distance: Using the lower bound in Lemma 10(g)in conjunction with the assumptions κ ≥ C and σ ≤ c, we obtain

Gσ(α, β)

Fσ(α, β)≥ ρBσ(ρ)

2+

σ

1.8√κ− 1

≥ cσρ+σ

1.8√κ− 1

.

The second inequality follows by noting that Bσ(ρ) ≥ 2cσ for all ρ for some absolute constantc ∈ (0, 1]. Note also that since σ is small, we have 1− cσ ≥ 0.95, and also that σ2/(κ− 1) ≤ c′for a small enough constant c′. Putting these together with Lemma 26(a) yields

tan−1

(Gσ(α, β)

Fσ(α, β)

)≥ cσ tan−1 ρ+

σ

2√κ− 1

,

as desired.

9.4.2 Part (b): Empirical error is sharply tracked by Gordon state evolution

This proof is almost identical to that of Theorem 3(b), so we only sketch the major difference:verifying the gradient conditions. We set ∆0 = 1/2000 and τ0 = 1/50 for convenience incomputation, so that ∆0/τ0 = 0.025.

Verifying gradient conditions: It is easy to verify that for all ζ = (α, β) ∈ B∞(G; ∆0/τ0),we have α ≥ 1/2 and β/α ≤ 1/4. Consequently, parts (a) and (b) of Lemma 11 yield that

‖∇Fσ(α, β)‖1 ∨ ‖∇Gσ(α, β)‖1 ≤ 0.98 = 1− τ0

for all (α, β) ∈ B∞(G; ∆0/τ0).

The rest of the proof proceeds identically.

66

Page 67: Kabir Aladin Chandrasekher

9.4.3 Part (c): Iterates converge to good region from random initialization

This proof is almost identical to that of Theorem 3(c), so we only sketch the differences. Condi-tions C1 and C2 are follow directly from the fact that Fσ(α, β) ≥ F0(α, β). It remains to verifyconditions C3, C4, and C5a.

Verifying condition C3: Lemma 10(g) yields

Gσ(α, β)

Fσ(α, β)≤ 4

5ρ+

2σ√κ− 1

.

Since 1/5 ≤ ρ ≤ 2 and σ2/κ ≤ c, we obtain

2σ√κ− 1

≤ 3

40ρ.

Verifying condition C4: As argued before, when σ ≤ c, we have Fσ(α, β) ≤ 1.05 for all (α, β).

The inequality Gσ(α,β)Fσ(α,β) ≤ 1/6 for ρ ≤ 1/5 follows from Lemma 10(g) and the inequality σ ≤ c.

Verifying condition C5a. We have Gσ(α, β) ≤ 0.8 by Lemma 10(d) .

9.5 Proof of Theorem 6: Subgradient descent for mixtures of regressions

Recall that the Gordon update in this case is given by the pair (Fσ, gσ). It is also useful to notethat

[gσ(α, β)]2 = Gσ(α, β)2 +1

κ

((α− Fσ(α, β))2 + (β − ρBσ(ρ))2

). (123)

9.5.1 Part (a): Convergence of Gordon state evolution in good region

As before, it suffices to show that the Gordon state evolution is G-faithful, and to prove theupper and lower bounds on its one-step convergence behavior. Unlike before, we show the lowerbound on one-step convergence first, since some steps here are used in the proof of the upperbound.

Verifying that Sgor is G-faithful: The bounds on Fσ were already shown in the previousproof for AM.

Bounding gσ(α,β)Fσ(α,β) : Putting together parts (a) and (c) of Lemma 10, we have |1 − Fσ(α, β)| .

φ3 ∨ σ3 . ρ3 + σ3. We also have ρBσ(ρ) . (ρ∧ 1) · (σ ∨ (ρ∧ 1)) . σ and Fσ(α, β) ≥ 0.95 for allρ ≤ 1/5. Combining this with Lemma 10 parts (f) and (g) yields

gσ(α, β)

Fσ(α, β)≤ 4

5· ρ+

2σ√κ

+2√κ

((1− α)2 + β2

)1/2+

C√κ

(σ + ρ3 + σ3

)≤ 5

6· ρ+

Cσ√κ

+2√κ· d`2(α, β). (124)

Noting that ρ ≤ 1/5 and σ2/κ ≤ c and setting κ ≥ C completes the proof.

Establishing lower bound on one-step distance: Note that

[gσ(α, β)]2 =κ− 2

κ− 1· [ρBσ(ρ)]2 +

1

κ

(α− Fσ(α, β))2 + (β − [ρBσ(ρ)]2)2 + (1− [Fσ(α, β)]2) + σ2

,

(125)

67

Page 68: Kabir Aladin Chandrasekher

Applying Young’s inequality, we have

[gσ(α, β)]2 ≥ κ− 2

κ− 1· [ρBσ(ρ)]2

+1

κ

(1

2(1− α)2 − (1− Fσ(α, β))2 +

1

2β2 − [ρBσ(ρ)]2 + (1− Fσ(α, β))(1 + Fσ(α, β)) + σ2

)=κ− 3

κ− 1· [ρBσ(ρ)]2 +

1

κ

(1

2(1− α)2 +

1

2β2 + 2Fσ(α, β) · (1− Fσ(α, β)) + σ2

)≥ 1

2(κ− 1)[d`2(ζ)]2 +

σ2

1.5κ,

where the final inequality follows since 1− Fσ & −σ3, κ ≥ C and σ ≤ c. Dividing both sides ofthe above inequality by [Fσ(α, β)]2 and noting that [Fσ(α, β)]2 ≥ 0.95 for all ρ ≤ 1/5, we have

[gσ(α, β)]2

[Fσ(α, β)]2≥ 2

5(κ− 1)ρ2 +

σ2

1.6κ(126)

where we have also used Lemma 25(a) to conclude that d`2(ζ) ≥ 0.9ρ for all ρ ≤ 1/5. Now

using the inequality√a+ b ≥

√a

1+c +√

bc1+c (valid for any three non-negative scalars (a, b, c))

we have

[gσ(α, β)]

[Fσ(α, β)]≥ cκρ+

σ2

1.8κ,

where 1 − cκ ≥ 0.9. Using the fact that σ2

κ ≤ c and applying Lemma 26(a) completes theproof.

Establishing upper bound on two-step distance: We require an explicit relationship be-tween the parallel and perpendicular components in this proof (see Remark 4), so we use a tran-sient period t0 = 1. For notational convenience, let (Fσ, gσ) denote the pair (Fσ(α, β), gσ(α, β)),and let (F+, g+) = (Fσ(Fσ, gσ), gσ(Fσ, gσ)) denote the element of the state evolution obtainedafter two steps of the Gordon update. Let ρ+ = gσ/Fσ. Equation (124) yields

g+

F+≤ 5

6· ρ+ +

Cσ√κ

+2√κ· d`2(Fσ, gσ) ≤ 5

6· ρ+ +

Cσ√κ

+1√κ· (ρ3 + 2gσ) (127)

=5

6· ρ+ +

Cσ√κ

+ ρ2 · ρ√κ− 1

+2ρ+ · Fσ√

κ

≤ 5

6· ρ+ +

Cσ√κ

+ρ+

25+

2.2ρ+√κ

≤ 7

8· ρ+ +

Cσ√κ.

Here, the second inequality follows since d`2(Fσ, gσ) ≤ |1−Fσ|+ gσ ≤ 12(ρ3 +σ3) + gσ, the third

inequality is a consequence of the relation (126) and the fact that ρ ≤ 1/5 and Fσ ≤ 1.1 (whichin turn follows from Lemma 10(c) and σ ≤ c). Applying Lemma 26(b) completes the proof.

9.5.2 Part (b): Empirical error is sharply tracked by Gordon state evolution

This proof is almost identical to that of Theorem 3(b), so we only sketch the major difference:verifying the gradient conditions. We set ∆0 = 1/4000 and τ0 = 1/100 for convenience incomputation, so that ∆0/τ0 = 0.025.

68

Page 69: Kabir Aladin Chandrasekher

Verifying gradient conditions: It is easy to verify that for all ζ = (α, β) ∈ B∞(G; ∆0/τ0),we have α ≥ 1/2 and β/α ≤ 1/4. Consequently, parts (a) and (b) of Lemma 11 yield that

‖∇Fσ(α, β)‖1 ∨ ‖∇gσ(α, β)‖1 ≤ 0.99 = 1− τ0

for all (α, β) ∈ B∞(G; ∆0/τ0). The rest of the proof proceeds identically.

9.5.3 Part (c): Iterates converge to good region from random initialization

This proof is almost identical to that of Theorem 5(c), except that we use Lemma 13(b).Consequently, we only verify conditions C3, C4, C5b.Verifying condition C3: For all 1/5 ≤ ρ ≤ 2 and 1/2 ≤ α ≤ 3/2, equation (124) andLemma 25(b) in the appendix together yield

gσ(α, β)

Fσ(α, β)≤ 5

6· βα

+Cσ√κ

+C√κ· βα.

Since 1/5 ≤ ρ ≤ 2 and σ2/κ ≤ c, we obtain Cσ√κ≤ 3

40ρ and choosing large enough κ completes

the proof.Verifying condition C4: As argued before, when σ ≤ c, we have Fσ(α, β) ≤ 1.05 for all (α, β).

The inequality gσ(α,β)Fσ(α,β) ≤ 1/6 for ρ ≤ 1/5 follows from the bound (124) and the inequalities

Fσ(α, β) ≥ 0.95, κ ≥ C, and σ2/κ ≤ c.Verifying condition C5b: Note also that the quantities ρBσ(ρ), |1−Fσ(α, β)|, and (1−α)2 +β2

are all bounded by an absolute constant C if σ ≤ c and α∨β ≤ 3/2. Combining with Lemma 10

parts (d) and (f), we have gσ(α, β) ≤√

0.8 + C+σ2

κ ≤ 0.95, where the last inequality holds for

large enough κ and small enough σ. .

69

Page 70: Kabir Aladin Chandrasekher

Acknowledgments

Part of this work was performed when the authors were participants in the program on Probability,Geometry, and Computation in High Dimensions hosted at the Simons Institute for the Theory ofComputing. KAC was supported in part by a National Science Foundation Graduate Research Fellowshipand the Sony Stanford Graduate Fellowship. AP was supported in part by a research fellowship from theSimons Institute and National Science Foundation grant CCF-2107455. CT was supported in part bythe National Science Foundation grant CCF-2009030, by an NSERC Discovery Grant, and by a researchgrant from KAUST.

References

E. Abbasi, F. Salehi, and B. Hassibi. Universality in learning from linear measurements. Advances inNeural Information Processing Systems, 32:12372–12382, 2019. (Cited on page 37.)

A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli. Learning sparsely used overcomplete dictionariesvia alternating minimization. SIAM Journal on Optimization, 26(4):2775–2799, 2016. (Cited onpage 9.)

D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transitions in convexprograms with random data. Information and Inference: A Journal of the IMA, 3(3):224–294, 2014.(Cited on page 9.)

B. Aubin, Y. Lu, F. Krzakala, and L. Zdeborova. Generalization error in high-dimensional perceptrons:Approaching bayes error with convex optimization. In Conference on Neural Information ProcessingSystems (NeurIPS), 2020. (Cited on page 9.)

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: Frompopulation to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017. (Cited on pages 3,8, and 9.)

M. Bayati and A. Montanari. The Lasso risk for Gaussian matrices. IEEE Transactions on InformationTheory, 58(4):1997–2017, 2011. (Cited on page 8.)

M. Bayati, M. Lelarge, and A. Montanari. Universality in polytope phase transitions and message passingalgorithms. The Annals of Applied Probability, 25(2):753–822, 2015. (Cited on page 37.)

S. Boucheron, O. Bousquet, G. Lugosi, and P. Massart. Moment inequalities for functions of independentrandom variables. Annals of Probability, 33(2):514–560, 2005. (Cited on pages 47, 49, and 50.)

D. R. Brillinger. A generalized linear model with “Gaussian” regressor variables. In Selected Works ofDavid Brillinger, pages 589–606. Springer, 2012. (Cited on page 9.)

E. J. Candes, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms.IEEE Transactions on Information Theory, 61(4):1985–2007, 2015. (Cited on page 3.)

M. Celentano, A. Montanari, and Y. Wei. The Lasso with general Gaussian designs with applications tohypothesis testing. arXiv preprint arXiv:2007.13716, 2020a. (Cited on page 10.)

M. Celentano, A. Montanari, and Y. Wu. The estimation error of general first order methods. InConference on Learning Theory, pages 1078–1141. PMLR, 2020b. (Cited on page 8.)

X. Chang, Y. Li, S. Oymak, and C. Thrampoulidis. Provable benefits of overparameterization in modelcompression: From double descent to pruning neural networks. In Proceedings of the AAAI Conferenceon Artificial Intelligence, volume 35, pages 6974–6983, 2021. (Cited on page 10.)

V. Charisopoulos, Y. Chen, D. Davis, M. Dıaz, L. Ding, and D. Drusvyatskiy. Low-rank matrix recoverywith composite optimization: Good conditioning and rapid convergence. Foundations of Computa-tional Mathematics, pages 1–89, 2021. (Cited on page 9.)

Y. Chen and Y. Chi. Harnessing structures in big data via guaranteed low-rank matrix estimation:Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal ProcessingMagazine, 35(4):14–31, 2018. (Cited on page 8.)

70

Page 71: Kabir Aladin Chandrasekher

Y. Chen, Y. Chi, J. Fan, and C. Ma. Gradient descent with random initialization: Fast global convergencefor nonconvex phase retrieval. Mathematical Programming, 176(1):5–37, 2019. (Cited on pages 3, 4,8, 26, and 37.)

Z. Chen and J. J. Dongarra. Condition numbers of Gaussian random matrices. SIAM Journal on MatrixAnalysis and Applications, 27(3):603–620, 2005. (Cited on page 94.)

Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization: Anoverview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019. (Cited on page 8.)

C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians.In Conference on Learning Theory, pages 704–710. PMLR, 2017. (Cited on pages 3 and 8.)

D. Davis, D. Drusvyatskiy, and C. Paquette. The nonsmooth landscape of phase retrieval. IMA Journalof Numerical Analysis, 40(4):2652–2695, 2020. (Cited on page 8.)

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via theEM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.(Cited on pages 7 and 13.)

Z. Deng, A. Kammoun, and C. Thrampoulidis. A model of double descent for high-dimensional binarylinear classification. Information and Inference: A Journal of the IMA, 2021. (Cited on page 10.)

O. Dhifallah and Y. M. Lu. A precise performance analysis of learning with random features. arXivpreprint arXiv:2008.11904, 2020. (Cited on page 9.)

O. Dhifallah, C. Thrampoulidis, and Y. M. Lu. Phase retrieval via polytope optimization: Geometry,phase transitions, and new algorithms. arXiv preprint arXiv:1805.09555, 2018. (Cited on page 9.)

D. L. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Pro-ceedings of the National Academy of Sciences, 106(45):18914–18919, 2009. (Cited on page 8.)

D. L. Donoho, A. Maleki, and A. Montanari. The noise-sensitivity phase transition in compressed sensing.IEEE Transactions on Information Theory, 57(10):6920–6941, 2011. (Cited on page 8.)

J. C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: Composite optimization forrobust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471–529, 2019. (Citedon pages 9 and 37.)

R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, M. I. Jordan, and B. Yu. Singularity, misspecificationand the convergence rate of EM. The Annals of Statistics, 48(6):3161–3182, 2020. (Cited on pages 3,8, and 37.)

N. El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, 170(1):95–175, 2018. (Cited on page 37.)

O. Y. Feng, R. Venkataramanan, C. Rush, and R. J. Samworth. A unifying tutorial on approximatemessage passing. arXiv preprint arXiv:2105.02180, 2021. (Cited on page 8.)

J. R. Fienup. Phase retrieval algorithms: A comparison. Applied optics, 21(15):2758–2769, 1982. (Citedon page 3.)

B. Gao and Z. Xu. Phaseless recovery using the Gauss–Newton method. IEEE Transactions on SignalProcessing, 65(22):5885–5896, 2017. (Cited on page 9.)

R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. Advances in NeuralInformation Processing Systems, pages 2981–2989, 2016. (Cited on page 3.)

R. W. Gerchberg. A practical algorithm for the determination of phase from image and diffraction planepictures. Optik, 35:237–246, 1972. (Cited on page 3.)

71

Page 72: Kabir Aladin Chandrasekher

A. Ghosh and K. Ramchandran. Alternating minimization converges super-linearly for mixed linearregression. In International Conference on Artificial Intelligence and Statistics, pages 1093–1103.PMLR, 2020. (Cited on page 25.)

A. Ghosh, A. Pananjady, A. Guntuboyina, and K. Ramchandran. Max-affine regression: Provable,tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019. (Cited onpage 9.)

A. Ghosh, A. Pananjady, A. Guntuboyina, and K. Ramchandran. Max-affine regression with universalparameter estimation for small-ball designs. In 2020 IEEE International Symposium on InformationTheory (ISIT), pages 2706–2710. IEEE, 2020. (Cited on page 37.)

Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics,50(4):265–289, 1985. (Cited on page 9.)

Y. Gordon. On Milman’s inequality and random subspaces which escape through a mesh in Rn. InGeometric aspects of functional analysis, pages 84–106. Springer, 1988. (Cited on page 9.)

S. Gunasekar, A. Acharya, N. Gaur, and J. Ghosh. Noisy matrix completion using alternating minimiza-tion. In Joint European conference on machine learning and knowledge discovery in databases, pages194–209. Springer, 2013. (Cited on page 9.)

P. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk.IEEE Transactions on Information Theory, 66(1):401–418, 2019. (Cited on page 8.)

M. Hardt and M. Wootters. Fast matrix completion without the condition number. In Conference onlearning theory, pages 638–678. PMLR, 2014. (Cited on pages 4 and 9.)

N. Ho, K. Khamaru, R. Dwivedi, M. J. Wainwright, M. I. Jordan, and B. Yu. Instability, computationalefficiency and statistical accuracy. arXiv preprint arXiv:2005.11411, 2020. (Cited on pages 3, 8, 9,and 37.)

G. Jagatap and C. Hegde. Fast, sample-efficient algorithms for structured phase retrieval. In Advancesin Neural Information Processing Systems, pages 4924–4934, 2017. (Cited on page 9.)

P. Jain and P. Kar. Non-convex optimization for machine learning. Foundations and Trends® in MachineLearning, 10(3-4):142–363, 2017. (Cited on page 8.)

P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. InProceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 665–674, 2013.(Cited on pages 3, 4, and 9.)

A. Javanmard and M. Soltanolkotabi. Precise statistical analysis of classification accuracies for adversarialtraining. arXiv preprint arXiv:2010.11213, 2020. (Cited on page 10.)

A. Javanmard, M. Soltanolkotabi, and H. Hassani. Precise tradeoffs in adversarial training for linearregression. In Conference on Learning Theory, pages 2034–2078. PMLR, 2020. (Cited on page 10.)

A. Kammoun and M.-S. Alouini. On the precise error analysis of support vector machines. IEEE OpenJournal of Signal Processing, 2:99–118, 2021. (Cited on pages 10 and 80.)

J. M. Klusowski, D. Yang, and W. Brinda. Estimating the coefficients of a mixture of two linearregressions by expectation maximization. IEEE Transactions on Information Theory, 65(6):3515–3524, 2019. (Cited on page 8.)

F. Kunstner, R. Kumar, and M. Schmidt. Homeomorphic-invariance of EM: Non-asymptotic convergencein KL divergence for exponential families via mirror descent. In International Conference on ArtificialIntelligence and Statistics, pages 3295–3303. PMLR, 2021. (Cited on page 9.)

J. Kwon, W. Qian, C. Caramanis, Y. Chen, and D. Davis. Global convergence of the EM algorithm formixtures of two component linear regression. In Conference on Learning Theory, pages 2055–2110.PMLR, 2019. (Cited on pages 4 and 8.)

72

Page 73: Kabir Aladin Chandrasekher

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and processes. SpringerScience & Business Media, 2013. (Cited on page 53.)

T. Liang and P. Sur. A precise high-dimensional asymptotic theory for boosting and minimum-l1-norminterpolated classifiers. arXiv preprint arXiv:2002.01586, 2020. (Cited on page 10.)

P.-L. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provableguarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012. (Cited on page 3.)

P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorith-mic theory for local optima. The Journal of Machine Learning Research, 16(1):559–616, 2015. (Citedon page 3.)

Y. Lu and H. H. Zhou. Statistical and computational guarantees of Lloyd’s algorithm and its variants.arXiv preprint arXiv:1612.02099, 2016. (Cited on page 29.)

C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation:Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution.Foundations of Computational Mathematics, 20(3):451–632, 2020. (Cited on pages 3 and 37.)

A. Maillard, B. Loureiro, F. Krzakala, and L. Zdeborova. Phase retrieval in high dimensions: Statis-tical and computational phase transitions. In Advances in Neural Information Processing Systems,volume 33, pages 11071–11082, 2020. (Cited on page 8.)

A. Makkuva, P. Viswanath, S. Kannan, and S. Oh. Breaking the gridlock in mixture-of-experts: Con-sistent and efficient algorithms. In International Conference on Machine Learning, pages 4304–4313.PMLR, 2019. (Cited on page 8.)

S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for nonconvex losses. The Annals ofStatistics, 46(6A):2747–2774, 2018. (Cited on pages 3 and 8.)

L. Miolane and A. Montanari. The distribution of the Lasso: Uniform control over sparse balls andadaptive parameter tuning. Annals of Statistics, 2021. (Cited on pages 7, 10, and 22.)

A. Montanari. Statistical estimation: From denoising to sparse regression and hidden cliques. StatisticalPhysics, Optimization, Inference, and Message-Passing Algorithms: Lecture Notes of the Les HouchesSchool of Physics: Special Issue, 2013. (Cited on page 8.)

A. Montanari, F. Ruan, Y. Sohn, and J. Yan. The generalization error of max-margin linear classifiers:High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.(Cited on page 10.)

R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and othervariants. In Learning in graphical models, pages 355–368. Springer, 1998. (Cited on pages 7 and 14.)

P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. IEEE Transac-tions on Signal Processing, 63(18):4814–4826, 2015. (Cited on page 4.)

S. Oymak and M. Soltanolkotabi. Fast and reliable parameter estimation from nonlinear observations.arXiv preprint arXiv:1610.07108, 2016. (Cited on page 9.)

S. Oymak and J. A. Tropp. Universality laws for randomized dimension reduction, with applications.Information and Inference: A Journal of the IMA, 7(3):337–446, 2018. (Cited on page 37.)

S. Oymak, C. Thrampoulidis, and B. Hassibi. The squared-error of generalized Lasso: A precise analysis.In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton),pages 1002–1009. IEEE, 2013. (Cited on pages 7, 9, and 10.)

S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time–data tradeoffs for linear inverse problems. IEEETransactions on Information Theory, 64(6):4129–4158, 2017. (Cited on page 9.)

A. Panahi and B. Hassibi. A universal analysis of large-scale regularized least squares solutions. InProceedings of the 31st International Conference on Neural Information Processing Systems, pages3384–3393, 2017. (Cited on page 37.)

73

Page 74: Kabir Aladin Chandrasekher

A. Pananjady and D. P. Foster. Single-index models in the high signal regime. IEEE Transactions onInformation Theory, 67(6):4092–4124, 2021. (Cited on page 9.)

C. Paquette, B. van Merrienboer, E. Paquette, and F. Pedregosa. Halting time is predictable for largemodels: A universality property and average-case analysis. arXiv preprint arXiv:2006.04299, 2020.(Cited on page 37.)

Y. Plan and R. Vershynin. The generalized Lasso with non-linear observations. IEEE Transactions oninformation theory, 62(3):1528–1537, 2016. (Cited on page 9.)

M. Rudelson and R. Vershynin. Sparse reconstruction by convex relaxation: Fourier and Gaussianmeasurements. In 2006 40th Annual Conference on Information Sciences and Systems, pages 207–212. IEEE, 2006. (Cited on page 9.)

F. Salehi, E. Abbasi, and B. Hassibi. A precise analysis of Phasemax in phase retrieval. In 2018IEEE International Symposium on Information Theory (ISIT), pages 976–980. IEEE, 2018. (Citedon page 9.)

F. Salehi, E. Abbasi, and B. Hassibi. The impact of regularization on high-dimensional logistic regression.arXiv preprint arXiv:1906.03761, 2019. (Cited on page 9.)

M. Stojnic. Various thresholds for `1-optimization in compressed sensing. arXiv preprintarXiv:0907.3666, 2009. (Cited on page 9.)

M. Stojnic. A framework to characterize performance of Lasso algorithms. arXiv preprintarXiv:1303.7291, 2013a. (Cited on page 9.)

M. Stojnic. Regularly random duality. arXiv preprint arXiv:1303.7295, 2013b. (Cited on page 9.)

M. Stojnic. Upper-bounding `1-optimization weak thresholds. arXiv preprint arXiv:1303.7289, 2013c.(Cited on page 9.)

J. Sun. Provable nonconvex methods/algorithms, 2021. URL https://sunju.org/research/

nonconvex/. (Cited on page 8.)

J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. Foundations of ComputationalMathematics, 18(5):1131–1198, 2018. (Cited on page 3.)

R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactionson Information Theory, 62(11):6535–6579, 2016. (Cited on page 9.)

P. Sur and E. J. Candes. A modern maximum-likelihood theory for high-dimensional logistic regression.Proceedings of the National Academy of Sciences, 116(29):14516–14525, 2019. (Cited on page 8.)

H. Taheri, R. Pedarsani, and C. Thrampoulidis. Asymptotic behavior of adversarial training in binaryclassification. arXiv preprint arXiv:2010.13275, 2020a. (Cited on page 10.)

H. Taheri, R. Pedarsani, and C. Thrampoulidis. Sharp asymptotics and optimal performance for inferencein binary models. In International Conference on Artificial Intelligence and Statistics, pages 3739–3749. PMLR, 2020b. (Cited on page 9.)

H. Taheri, R. Pedarsani, and C. Thrampoulidis. Fundamental limits of ridge-regularized empirical riskminimization in high dimensions. In International Conference on Artificial Intelligence and Statistics,pages 2773–2781. PMLR, 2021. (Cited on page 9.)

Y. S. Tan and R. Vershynin. Online stochastic gradient descent with arbitrary initialization solvesnon-smooth, non-convex phase retrieval. arXiv preprint arXiv:1910.12837, 2019a. (Cited on page 4.)

Y. S. Tan and R. Vershynin. Phase retrieval via randomized Kaczmarz: Theoretical guarantees. Infor-mation and Inference: A Journal of the IMA, 8(1):97–123, 2019b. (Cited on page 26.)

C. Thrampoulidis. Recovering structured signals in high dimensions via non-smooth convex optimization:Precise performance analysis. PhD thesis, California Institute of Technology, 2016. (Cited on page 9.)

74

Page 75: Kabir Aladin Chandrasekher

C. Thrampoulidis, E. Abbasi, and B. Hassibi. Lasso with non-linear measurements is equivalent to onewith linear measurements. In Proceedings of the 28th International Conference on Neural InformationProcessing Systems-Volume 2, pages 3420–3428, 2015a. (Cited on page 9.)

C. Thrampoulidis, S. Oymak, and B. Hassibi. Regularized linear regression: A precise analysis of theestimation error. In Conference on Learning Theory, pages 1683–1709. PMLR, 2015b. (Cited on pages5, 9, and 15.)

C. Thrampoulidis, E. Abbasi, and B. Hassibi. Precise error analysis of regularized m-estimators in highdimensions. IEEE Transactions on Information Theory, 64(8):5592–5628, 2018a. (Cited on pages 9and 16.)

C. Thrampoulidis, W. Xu, and B. Hassibi. Symbol error rate performance of box-relaxation decoders inmassive MIMO. IEEE Transactions on Signal Processing, 66(13):3377–3392, 2018b. (Cited on page 9.)

Y. Tian. An analytical formula of population gradient for two-layered ReLu network and its applicationsin convergence and critical point analysis. In International Conference on Machine Learning, pages3404–3413. PMLR, 2017. (Cited on pages 3 and 8.)

R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47.Cambridge university press, 2018. (Cited on pages 47, 51, 52, 56, 82, 83, and 108.)

M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. CambridgeUniversity Press, 2019. (Cited on pages 78, 88, 91, and 92.)

I. Waldspurger. Phase retrieval with random Gaussian sensing vectors by alternating projections. IEEETransactions on Information Theory, 64(5):3301–3312, 2018. (Cited on page 9.)

S. Wang, H. Weng, and A. Maleki. Does SLOPE outperform bridge regression? arXiv preprintarXiv:1909.09345, 2019. (Cited on page 10.)

Y. Wu and H. H. Zhou. Randomly initialized EM algorithm for two-component Gaussian mixtureachieves near optimality in O(

√n) iterations. arXiv preprint arXiv:1908.10935, 2019. (Cited on pages

3, 4, 8, and 37.)

J. Xu, D. J. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of twoGaussians. Advances in Neural Information Processing Systems, 29, 2016. (Cited on pages 8 and 9.)

J. Xu, D. J. Hsu, and A. Maleki. Benefits of over-parameterization with EM. In Advances in NeuralInformation Processing Systems, volume 31, 2018. (Cited on pages 3 and 8.)

L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neuralcomputation, 8(1):129–151, 1996. (Cited on page 9.)

F. Yang, S. Balakrishnan, and M. J. Wainwright. Statistical and computational guarantees for theBaum–Welch algorithm. The Journal of Machine Learning Research, 18(1):4528–4580, 2017. (Citedon page 37.)

X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. In Inter-national Conference on Machine Learning, pages 613–621. PMLR, 2014. (Cited on page 9.)

H. Zhang, Y. Zhou, Y. Liang, and Y. Chi. A nonconvex approach for phase retrieval: Reshaped Wirtingerflow and incremental algorithms. Journal of Machine Learning Research, 18, 2017. (Cited on pages 4and 26.)

T. Zhang. Phase retrieval using alternating minimization in a batch setting. Applied and ComputationalHarmonic Analysis, 49(1):279–295, 2020. (Cited on pages 9, 22, and 47.)

Y. Zhang, Q. Qu, and J. Wright. From symmetry to geometry: Tractable nonconvex problems. arXivpreprint arXiv:2007.06753, 2020. (Cited on page 8.)

75

Page 76: Kabir Aladin Chandrasekher

Appendix

A Heuristic derivations deferred from Section 3.2

In this section, we collect two calculations that were deferred from Section 3.2.

Calculations to obtain equation (19b)

We begin by applying the Cauchy–Schwarz inequality to maximize over v, obtaining

minθ∈Rd

L(θ;θt,γd,γn) = minθ∈Rd

max‖v‖2≤1

‖v‖2√n

(〈γd,P⊥Stθ〉+

∥∥sgn(Xθt) y − ‖P⊥Stθ‖2γn −XPStθ∥∥

2

).

Note that the objective is linear in the magnitude ‖v‖2; thus, maximizing over it is straightfor-ward giving

minθ∈Rd

L(θ;θt,γd,γn) = minθ∈Rd

(〈γd,P⊥Stθ〉+

∥∥sgn(Xθt) y − ‖P⊥Stθ‖2γn −XPStθ∥∥

2

))+.

Now, using the fact that minθ∈Rd (f(θ))+ = (minθ∈Rd f(θ))+ for any function f(·) and noting

that the function of interest in the display above is linear in the direction P⊥Stθ/‖P⊥Stθ‖2, we

may minimize over the latter to obtain

minθ∈Rd

L(θ;θt,γd,γn) =

(minθ∈Rd

−‖P⊥Stθ‖2‖P⊥Stγd‖2√

n+

1√n

∥∥sgn(Xθt) y − ‖P⊥Stθ‖2γn −XPStθ∥∥

2

)+

.

When written in this form, the scalarization is apparent; recall our scalars

α = 〈θ,θ∗〉, µ =〈θ,P⊥θ∗θt〉‖P⊥θ∗θt‖2

, and ν = ‖P⊥Stθ‖2, (128)

and the analogous quantities for the current iterate αt = 〈θt,θ∗〉 and βt = ‖P⊥θ∗θt‖2. Also recall

the independent, n-dimensional Gaussian random vectors z1 = Xθ∗ and z2 =XP⊥

θ∗θt

‖P⊥θ∗θt‖2

, using

which we obtain

Xθt = αtz1 + βtz2, y = |z1|, and XP⊥Stθ = αz1 + µz2.

Thus,

minθ∈Rd

L(θ;θt,γd,γn) = minα∈R,µ∈R,ν≥0

(−ν‖P⊥Stγd‖2√

n+

1√n

∥∥sgn(αtz1 + βtz2) |z1|︸ ︷︷ ︸ωt

−νγn − αz1 − µz2

∥∥2

)+

= minα∈R,µ∈R,ν≥0

(−ν‖P⊥Stγd‖2√

n+

1√n

∥∥ωt − νγn − αz1 − µz2

∥∥2

)+

(i)≈ min

α∈R,µ∈R,ν≥0

(− ν√

κ+

√E(

Ωt − νH − αZ1 − µZ2

)2)+, (129)

where step (i) follows by concentration of the norms of sub-Gaussian random variables.Summarizing, we have

A? =(

minα∈R,µ∈R,ν≥0

− ν√κ

+

√E(

Ωt − νH − αZ1 − µZ2

)2)+, (130)

so thatA(γn,γd) ≈ A?.

76

Page 77: Kabir Aladin Chandrasekher

Calculations to obtain equation (20)

Note that the random variable H is zero-mean and independent of Z1 and Z2, whence we obtain

− ν√κ

+√

E(Ωt − νH − αZ1 − µZ2)2 = − ν√κ

+√E(Ωt − αZ1 − µZ2)2+ ν2.

It is evident from the RHS of the display above that the minimizers α and µ are given by

α = EZ1Ωt, and µ = EZ2Ωt.

Substituting these back into A?, we obtain

A? =(

minν≥0− ν√

κ+√ν2 +

(EΩ2

t − (EZ1Ωt)2 − (EZ2Ωt)2))

+.

Minimizing the above in ν, we obtain

ν =

√EΩ2

t − (EZ1Ωt)2 − (EZ2Ωt)2

κ− 1,

and this establishes the claimed scalarization.

B Auxiliary proofs for general results, part (a)

In this appendix, we prove the technical lemmas stated in Section 7.

B.1 Proofs of technical lemmas in steps 1–3

In this subsection, we prove each of our technical lemmas used in the proof of Proposition 2.

B.1.1 Proof of Lemma 1

We require two additional lemmas that are proved at the end of this subsection. The first lemmashows that the optimization can be done over a compact set.

Lemma 14. Suppose that the loss L satisfies Assumption 3 and let κ > 1. Let D ⊆ B2(R)be a closed subset for some constant R > 0. Then, there exists a positive constant C1 depend-ing only on R such that for any scalar r ≥ CL (where CL denotes the Lipschitz constant inAssumption 3), we have

minθ∈DL(θ;θ],X,y) = min

θ∈Du∈B2(C1

√n)

maxv∈B2(r)

1√n〈v,Xθ − u〉+ F (u,θ),

with probability at least 1− 2e−2n.

The following lemma uses the bilinear characterization of Lemma 14 and subsequently in-vokes the CGMT to connect to the auxiliary loss Ln.

Lemma 15. Let Assumption 3 hold and recall the definition of the auxiliary loss Ln in Definition7. For any compact set D and any positive scalars C1 and r, it holds that

P

minθ∈D

u∈B2(C1√n)

maxv∈B2(r)

1√n〈v,Xθ − u〉+ F (u,θ) ≥ t

≤ 2P

minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) ≥ t

If, in addition, D is convex, then

P

minθ∈D

u∈B2(C1√n)

maxv∈B2(r)

1√n〈v,Xθ − u〉+ F (u,θ) ≤ t

≤ 2P

minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) ≤ t

Lemma 1 follows immediately upon combining Lemmas 14 and 15.

77

Page 78: Kabir Aladin Chandrasekher

Proof of Lemma 14 First, recall that Assumption 3 implies the existence of a function Fsuch that L(θ) = F (Xθ,θ). Consequently, we obtain

minθ∈DL(θ) = min

θ∈D,u∈RnF (u,θ) s.t. u = Xθ.

Applying Wainwright (2019, Theorem 6.1) in conjunction with the assumptions κ > 1 andD ⊆ B2(R) yields that the event

A = supθ∈D

‖Xθ‖2 ≤ 4R√n

occurs with probability at least 1− 2e−2n. We carry out the rest of the proof on this event. ForC1 > 4R, we obtain

minθ∈DL(θ) = min

θ∈Du∈B2(C1

√n)

F (u,θ) s.t. u = Xθ

= minθ∈D

u∈B2(C1√n)

maxv∈Rn

F (u,θ)− 〈v,u−Xθ〉. (131)

It remains only to prove that v can further be constrained to a large enough ball. To this end,recall that by Assumption 3(b), the function F (u,θ) is CL/

√n-Lipschitz in its first argument.

We thus obtain the inequality

minθ∈D

u∈B2(C1√n)

maxv∈B2(r)

F (u,θ)− 1√n〈v,u−Xθ〉

(i)

≥ minθ∈D

u∈B2(C1√n)

F (Xθ,θ) + maxv∈B2(r)

(‖v‖2√n− CL√

n

)‖u−Xθ‖2

(ii)

≥ minθ∈D

u∈B2(C1√n)

F (Xθ,θ) = minθ∈DL(θ). (132)

Step (i) follows by utilizing the Lipschitz continuity of F in its first argument in conjunctionwith the Cauchy–Schwarz inequality and step (ii) follows from the assumption r ≥ CL. On theother hand, it also holds that

minθ∈D

u∈B2(C1√n)

maxv∈B2(r)

F (u,θ)− 1√n〈v,u−Xθ〉 ≤ min

θ∈Du∈B2(C1

√n)

maxv∈Rn

F (u,θ)− 1√n〈v,u−Xθ〉

= minθ∈DL(θ), (133)

where the equality holds due to equation (131). The desired result follows immediately bycombining equations (132) and (133).

Proof of Lemma 15 First, recall that by Assumption 3, F (u,θ) can be decomposed as

F (u,θ) = g(u,Xθ];y) + h(θ,θ]).

Next, recall the subspace S] = span(θ∗,θ]) and consider the orthogonal decompositionθ = PS]θ + P⊥S]θ. Combining these two pieces, we obtain the representation

1√n〈v,Xθ − u〉+ F (u,θ) = h(θ,θ]) + g(u,Xθ];y)− 1√

n〈v,u〉+

1√n〈v,XPS]θ〉+

1√n〈v,XP⊥S]θ〉.

Note that the Gaussian random variable 〈v,XP⊥S]θ〉 is independent of all other randomnessin the expression. Thus, in that term, we replace the random matrix X with an independentcopy G ∈ Rn×d. Turning to the variational problem of interest, we have

minθ∈D

u∈B2(C1√n)

maxv∈B2(r)

h(θ,θ]) + g(u,Xθ];y)− 1√n〈v,u〉+

1√n〈v,XPS]θ〉+

1√n〈v,XP⊥S]θ〉

78

Page 79: Kabir Aladin Chandrasekher

(d)= min

θ∈Du∈B2(C1

√n)

maxv∈B2(r)

h(θ,θ]) + g(u,Xθ];y)− 1√n〈v,u〉+

1√n〈v,XPS]θ〉+

1√n〈v,GP⊥S]θ〉.

At this juncture, we invoke the CGMT (Proposition 1) with

P (G) = min(θ,u)∈D×B2(C1

√n)

maxv∈B2(r)

h(θ,θ])+g(u,Xθ];y)− 1√n〈v,u〉+ 1√

n〈v,XPS]θ〉+

1√n〈v,GP⊥S]θ〉

andA(γn,γd) = min

(θ,u)∈D×B2(C1√n)

Ln(θ,u; r).

B.1.2 Proof of Lemma 2

We begin by defining a few additional optimization problems. Under the setting of Lemma 2,recall the map P as defined in Definition 9, and the random vectors z1 and z2 (50). Define theminimum of the variational problem

φvar,C1,r = min(α,µ,ν)∈P(D),u∈B2(C1

√n)

hscal(α, µ, ν,θ]) + g(u,Xθ];y)

+ max0≤ρ≤r

ρ√n·(‖ν · γn + α · z1 + µ · z2 − u‖2 − ν · ‖P⊥S]γd‖2

), (134)

where we explicitly track the dependence on the pair (C1, r) as their scaling will be importantin the proof. Next, define the minimum of the constrained problem

φcon,C1 = min(α,µ,ν)∈P(D),u∈B2(C1

√n)

hscal(α, µ, ν,θ]) + g(u,Xθ];y)

s.t. ‖ν · γn + α · z1 + µ · z2 − u‖2 ≤ ν · ‖P⊥S]γd‖2. (135)

Finally, define the minimum of the following closely related constrained problem, which—as wewill show in Lemma 17 below—is equal with high probability to the minimum of the scalarizedauxiliary loss Ln:

φscal,C1 = min(α,µ,ν)∈P(D)

maxv∈Rn

hscal(α, µ, ν,θ])− g∗(v,Xθ];y) + min

u∈B2(C1√n)〈u,v〉

s.t. ‖ν · γn + α · z1 + µ · z2 − u‖2 ≤ ν · ‖P⊥S]γd‖2. (136)

We now state two lemmas that establish the relation between the above optimization prob-lems and will prove useful in the proof.

Lemma 16. Under the setting of Lemma 2, we have

Pφvar,C1,r ≤ φcon,C1 ≤ φvar,C1,r +

3C2LC1

r

≥ 1− 6e−n/2.

In interpreting Lemma 16, note that if the inner maximization over ρ ≥ 0 in the definitionof φvar,C1,r (134) were unbounded, then it would be equivalent to the constrained minimumφcon,C1 (135). Lemma 16 uses Lipschitz continuity of the objective function, which holds thanksto Assumption 3, and demonstrates the impact of finite values of the scalar r on the gap betweenthe values of the two optimization problems.

Lemma 17. Under the setting of Lemma 2, we have the inequality

Pφscal,C1 = min

(α,µ,ν)∈P(D)Ln(α, µ, ν;θ])

≥ 1− 8e−n/2.

79

Page 80: Kabir Aladin Chandrasekher

Taking these lemmas as given, the proof of Lemma 2 consists of three major steps, which weperform in sequence: we (i) scalarize the maximization over v; (ii) scalarize the minimizationover θ; and (iii) optimize over u. We proceed now to the execution of these steps.

Scalarize the maximization over v. Recall the auxiliary loss Ln from Definition 7 and note thatthe Cauchy–Schwarz inequality implies that

minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) = minθ∈D

u∈B2(C1√n)

max0≤ρ≤r

F (u,θ)+ρ√n·(〈γd,P⊥S]θ〉+

∥∥∥‖P⊥S]θ‖2·γn+XPS]θ−u∥∥∥

2

).

Scalarize the minimization in θ. We now show how to perform the minimization over the direc-tion of the projection vector P⊥S]θ. We do this in two steps.

First, we argue that we can decouple the minimization over its projection and its norm.For this, note by assumption that D ⊆ B2(R) is amenable (recall Definition 10), whence forany feasible value of the norm ‖P⊥S]θ‖2, the set of feasible directions P⊥S]θ/‖P

⊥S]θ‖2 remains the

same. Thus, decoupling is indeed allowed.Second, we argue that we can minimize over the direction of the projection vector P⊥S]θ

despite the inner maximization over the variable ρ. To this end, we note that because the opti-mal direction is the same irrespective of the choice of ρ, we may invoke Kammoun and Alouini(2021, Lemma 8). In particular, recall that the function F (u,θ) only depends on the vectorP⊥S]θ through its norm. Thus, the objective in the preceding display depends on the direction

of the vector P⊥S]θ only through the linear term ρ〈γd,P⊥S]θ〉, which, for any ρ ≥ 0, is minimized

by setting P⊥S]θ/‖P⊥S]θ‖2 = −P⊥S]γd/‖P⊥S]γd‖2. We thus apply Kammoun and Alouini (2021, Lemma

8) to obtain the characterization

minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) = minθ∈D

u∈B2(C1√n)

max0≤ρ≤r

F (u,θ)

+ρ√n·(∥∥∥‖P⊥S]θ‖2 · γn +XPS]θ − u

∥∥∥2−‖P⊥S]θ‖2 · ‖γd‖2

). (137)

Recalling the independent random variables z1 and z2 (50), write

y = f(z1; q) + ε and Xθ] = α]z1 + β]z2.

For each θ ∈ Rd, we have the orthogonal decomposition

θ = PS]θ + P⊥S]θ = α(θ) · θ∗ + µ(θ) ·P⊥θ∗θ

]

‖P⊥θ∗θ]‖2+ ν(θ) ·

P⊥S]θ

‖P⊥S]θ‖2,

where the second equality follows from the Gram–Schmidt orthogonalization. Combining thiswith the characterization (137), we obtain

minθ∈D

u∈B2(C1√n)

Ln(θ,u; r) = min(α,µ,ν)∈P(D),u∈B2(C1

√n)

hscal(α, µ, ν,θ]) + g(u,Xθ];y)

+ max0≤ρ≤r

ρ√n·(‖ν · γn + α · z1 + µ · z2 − u‖2 − ν · ‖P⊥S]γd‖2

)= φvar,C1,r, (138)

where the last line follows by definition (134).

Optimize over u. To be able to optimize over u, we recall the constant φcon,C1 (135) and invokeLemma 16, which yields the sandwich relation

φvar,C1,r ≤ φcon,C1 ≤ φvar,C1,r +3C2

LC1

r, (139)

80

Page 81: Kabir Aladin Chandrasekher

with probability at least 1−6e−n/2. Next, we linearize the constrained objective of φcon,C1 (135)in the optimization variable u. To this end, recall from Assumption 3 that g is convex in itsfirst argument and let g∗ denote its convex conjugate so that

φcon,C1 = min(α,µ,ν)∈P(D),u∈B2(C1

√n)

maxv∈Rn

hscal(α, µ, ν,θ])− g∗(v,Xθ];y) + 〈v,u〉

s.t. ‖ν · γn + α · z1 + µ · z2 − u‖2 ≤ ν · ‖P⊥S]γd‖2.

Next, we perform three steps in sequence: (i) write the equivalent Lagrangian to the problemabove; (ii) note that the minimization over u is of a convex function over a compact constraintand the maximization over v is of a concave function and invoke Sion’s minimax theorem toswap the minimization over u with the maximization over v; and (iii) re-write as a constrainedoptimization problem to obtain

φcon,C1 = min(α,µ,ν)∈P(D)

maxv∈Rn

hscal(α, µ, ν,θ])− g∗(v,Xθ];y) + min

u∈B2(C1√n)〈u,v〉

s.t. ‖ν · γn + α · z1 + µ · z2 − u‖2 ≤ ν · ‖P⊥S]γd‖2= φscal,C1 .

The last line follows by definition of φcon,C1 (136). We complete the proof by invoking Lemma 17.

Proof of Lemma 16: Note that φcon,C1 admits the variational representation

φcon,C1 = min(α,µ,ν)∈P(D),u∈B2(C1

√n)

hscal(α, µ, ν,θ]) + g(u,Xθ];y)

+ maxρ≥0

ρ√n·(‖ν · γn + α · z1 + µ · z2 − u‖2 − ν · ‖P⊥S]γd‖2

),

whence we obtain the inequality φcon,C1 ≥ φvar,C1,r. The rest of the section is devoted to theproof of the reverse inequality φcon,C1 ≤ φvar,C1,r + 3C2

LC1/r. To this end, fix an arbitrary triple(α, µ, ν) ∈ P(D). Given this triple, define the constraint function

ψ(u) = ‖ν · γn + α · z1 + µ · z2 − u‖2 − ν · ‖P⊥S]γd‖2,

as well as the loss functions

Lvarn (u) = max

0≤ρ≤rhscal(α, µ, ν,θ

]) + g(u,Xθ];y) +ρ√n· ψ(u),

andLconn (u) = max

ρ≥0hscal(α, µ, ν,θ

]) + g(u,Xθ];y) +ρ√n· ψ(u).

Let uvar denote an arbitrary minimizer of the loss function Lvarn and suppose that uvar does

not satisfy the constraint ψ(uvar) ≤ 0—if any such minimizer does satisfy this constraint, thenφvar,C1,r ≥ φcon,C1 and there is nothing to prove.

The remainder of the proof is thus dedicated to showing the inequalityφvar,C1,r + 3C2

LC1/r ≥ φcon,C1 under the proviso that the minimizer uvar satisfies the in-equality ψ(uvar) ≤ 0. To this end, consider the sublevel sets (constrained to a ball)

Sr′ := u : ψ(u) ≤ r′ ∩ B2(C1) = B2

(ν · γn + α · z1 + µ · z2; ν · ‖P⊥S]γd‖2 + r′

).

Note that the set S0 contains all of the feasible points we are interested in. Next, define theevent

A0 = ‖γn‖2 ≤ 2√n, ‖z1‖2 ≤ 2

√n, ‖z2‖2 ≤ 2

√n , (140)

81

Page 82: Kabir Aladin Chandrasekher

which, by Vershynin (2018, Theorem 3.1.1), occurs with probability at least 1−6e−n/2. Workingon this event, we note that since C1 ≥ 6R by assumption, the set S0 is non-empty and thussince S0 is a non-empty, closed and compact set, projections onto it are well-defined.

Now, note that S0 and S′r are concentric balls, whence if u ∈ Sr′ and u denotes its projectiononto the set S0, the distance between the two points satisfies the inequality

‖u− u‖2 ≤ r′. (141)

We note the following inequality, which we take for granted now and prove at the end of thesection,

ψ(uvar) ≤ 3CLC1√n

r. (142)

Consequently, we note the inclusion uvar ∈ S3CLC1√n/r. Letting uvar denote the projection of

uvar onto the set S0, we obtain the chain of inequalities

Lvarn (uvar)

(i)

≥ Lconn (uvar)− CL‖uvar − uvar‖2√

n

(ii)

≥ Lconn (ucon)− CL‖uvar − uvar‖2√

n

≥ Lconn (ucon)−

3C2LC1

r.

Above, step (i) follows since from Assumption 3, the function g is CL/√n-Lipschitz in its first

argument, step (ii) follows since ucon minimizes Lconn , and the final inequality follows from the

inequality bounding distances (141), taking r′ = 3CLC1√n/r. Taking stock, since the above

inequality holds for all values (α, µ, ν) ∈ P(D), we have shown that on the event A0

φcon,C1 −3C2

LC1

r≤ φvar,C1,r ≤ φcon,C1 ,

Rearranging this relation completes the proof. It remains to prove the claim (142).

Proof of the inequality (142). Assume for the sake of contradiction that the inequality does nothold and let uvar denote the projection of uvar onto the set S0. Then, since from Assumption 3the function g is CL/

√n-Lipschitz in its first argument, we obtain the chain of inequalities

Lvarn (uvar) ≥ Lcon

n (uvar)−CL‖uvar − uvar‖2√

n+rψ(uvar)√

n≥ Lcon

n (uvar)−2CLC1+3CLC1 > Lconn (uvar),

where the penultimate inequality follows since u ∈ B2(C1√n). But the above display contradicts

the fact that uvar minimizes Lvarn , whence we obtain the desired result.

Proof of Lemma 17: Recall the value φscal,C1 (136). Now, we consider a fixed triplet(α, µ, ν) ∈ P(D) and perform the minimization over u. To this end, let

Lu = minu∈B2(C1

√n)〈u,v〉 s.t. ‖ν · γn + α · z1 + µ · z2 − u‖2 ≤ ν · ‖P⊥S]γd‖2.

Then, introducing a Lagrange multiplier and invoking Sion’s minimax theorem to interchangeminimization and maximization, we obtain

Lu = maxλ≥0

minu∈B2(C1

√n)〈u,v〉+

λ

2‖ν · γn + α · z1 + µ · z2 − u‖22 −

λν2

2‖P⊥S]γd‖

22

≥ maxλ≥0

minu∈Rn

〈u,v〉+λ

2‖ν · γn + α · z1 + µ · z2 − u‖22 −

λν2

2‖P⊥S]γd‖

22

82

Page 83: Kabir Aladin Chandrasekher

= maxλ≥0

〈ν · γn + α · z1 + µ · z2,v〉 −1

2λ‖v‖22 −

λν2

2‖P⊥S]γd‖

22

= 〈ν · γn + α · z1 + µ · z2,v〉 − ν‖P⊥S]γd‖2‖v‖2.

Conversely, let

u = ν · γn + α · z1 + µ · z2 − ν‖P⊥S]γd‖2 ·v

‖v‖2.

We claim that this choice is feasible with high probability for a large enough constant C1 (whichmay depend on R), which means we obtain

Lu ≤ 〈u,v〉 = 〈ν · γn + α · z1 + µ · z2,v〉 − ν‖P⊥S]γd‖2‖v‖2.

To see that this is true, condition on the following event (recall the event A0 (140))

A′0 = A0 ∩ ‖P⊥S]γd‖2 ≤ 2√n ,

which after applying Vershynin (2018, Theorem 3.1.1) holds with PA′0 ≥ 1− 8e−n/2. On thisevent, apply triangle inequality to obtain

u/√n ≤ 4(ν + |α|+ |µ|) ≤ 8R,

where in the last inequality, we recalled from the definition of the scalarized set P(D) thatα2 + µ2 + ν2 = ‖θ‖22 and used the assumption that D ⊂ B2(R). Evidently, on the event A′0,

Lu = 〈ν · γn + α · z1 + µ · z2,v〉 − ν‖P⊥S]γd‖2‖v‖2.

Thus, we obtain

φscal,C1 = min(α,µ,ν)∈P(D)

maxv∈Rn

hscal(α, µ, ν,θ])− g∗(u,Xθ];y) + 〈ν · γn + α · z1 + µ · z2,v〉 − ν‖P⊥S]γd‖2‖v‖2

= min(α,µ,ν)∈P(D)

Ln(α, µ, ν;θ]),

which concludes the proof.

B.2 Establishing growth conditions for higher-order methods

In this subsection, we prove Lemma 3, specialized to second order methods where the lossfunction (9) takes the form

L(θ;θ],X,y) =1√n‖ω(Xθ],y)−Xθ‖2,

which corresponds to setting

F (u,θ) =1√n‖ω(Xθ],y)−u‖2, h(θ,θ]) = 0, and g(u,Xθ];y) =

1√n‖ω(Xθ],y)−u‖2.

We now state several lemmas, which we will invoke in sequence. The first specializes the functionLn when the loss corresponds to a higher-order method.

Lemma 18. Let the loss L correspond to a higher-order method as in equation (9), and let Lndenote the corresponding scalarized loss given by Definition 8. Also recall the pair of randomvectors (z1, z2) from equation (50). There exists a universal positive constant c such that withprobability at least 1− 6e−cn, it holds simultaneously for all scalars α, µ ∈ R and ν ≥ 0 that

Ln(α, µ, ν;θ]) =1√n‖ν · γn + α · z1 + µ · z2 − ω(Xθ],y)‖2 − ν

‖P⊥S]γd‖2√n

.

83

Page 84: Kabir Aladin Chandrasekher

This lemma is proved in Subsection B.2.3. Before stating the next lemma, we introduce theshorthand

A := [z1, z2,γn] ∈ Rn×3, and vn := [0, 0, ‖P⊥S]γd‖2/√n]> ∈ R3,

andω] = ω(Xθ],y),

so that letting ξ = [α, µ, ν] and invoking Lemma 18, we obtain

Ln(ξ;θ]) =1√n‖Aξ − ω]‖2 − 〈vn, ξ〉.

Now, introduce the variable τ and write the above display in the form

Ln(ξ;θ]) = infτ>0

τ

2+

1

2τn‖Aξ − ω]‖22 − 〈vn, ξ〉︸ ︷︷ ︸

Ln(ξ,τ)

.

Next, recall the parameters (K1,K2) from Assumptions 1 and 2 and fix ε > 0. For a universalconstant C > 0 and a constant CK1 > 0 depending only on K1, define the events

A1 =

(1− Cε) · I3 1

nA>A (1 + Cε) · I3

, (143a)

A2 = 1

n2|〈γn,ω]〉|2 ∨

∣∣∣ 1

n2〈z1,ω]〉2 − (EZ1Ω)2

∣∣∣ ∨ ∣∣∣ 1

n2〈z2,ω]〉2 − (EZ2Ω)2

∣∣∣ ≤ CK1 · ε,

(143b)

and

A3 =∣∣∣ 1n‖ω]‖22 − EΩ2

∣∣∣ ≤ CK1ε. (143c)

Finally, note that on the event A1, minξ∈R3 ‖Aξ − ω]‖2 = ‖(I −A(A>A)−1A>

)· ω]‖2.

Lemma 19. Let Assumptions 1 and 2 hold. There exists a positive constant CK1,K2 dependingonly on K1 and K2 and universal positive constants c, C, and C ′ such that for all C ′ > ε > 0,the following hold.

(a) Recall events A1,A2, and A3 as in equations (143a)–(143c). On the event A1 ∩A2 ∩A3,we have∣∣∣ 1√

n‖(I −A(A>A)−1A>

)· ω]‖2 −

√(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2

∣∣∣ ≤ CK1,K2 · ε.

(b) We have PA1 ∩ A2 ∩ A3 ≥ 1− Ce−cnε2.

This lemma is proved in Subsection B.2.4.

Lemma 20. Let κ > 1. Let Assumptions 1 and 2 hold. There exist positive constant cK1,K2 de-pending only on K1,K2, a positive constant CK1 depending only on K1 and a universal, positiveconstant C such that the following statements hold with probability at least 1− Ce−cK1,K2

n.

(a) For any R ≥ CK1, we have the equivalence

minξ∈P(B2(R))

infτ>0

Ln(ξ, τ) = infτ>0

minξ∈P(B2(R))

Ln(ξ, τ). (144a)

84

Page 85: Kabir Aladin Chandrasekher

(b) The optimization problem on the RHS of equation (144a) admits unique minimizers

ξn(τ) = (A>A)−1A>ω(Xθ],y) + τ · n · (A>A)−1vn, (144b)

and

τn =

1√n‖(In −A(A>A)−1A>) · ω(Xθ],y)‖2√

1− n · v>n (A>A)−1vn. (144c)

We prove this lemma in Subsection B.2.5.

B.2.1 Proof of Lemma 3(a)

The first part of the statement—uniqueness of the minimizer—is an immediate consequenceof Lemma 20; moreover, the unique minimizer is given by ξn(τn). It remains to prove theconcentration properties. As a preliminary step, we show that the dual variable τn concentratesaround a deterministic quantity

τgor =

√( κ

κ− 1

)(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2).

To this end, consider the events A1,A2,A3 from equation (143), and note that Lemma 19(b)implies that A1 ∩ A2 ∩ A3 occurs with probability greater than 1− Ce−cnε2 . We carry out theproof on this event.

Bounding τn (144c). We proceed in three steps. First, we bound the numerator; second, webound the denominator; and finally we combine the two bounds.

Bounding the numerator. This step is immediate, since Lemma 19(a) yields∣∣∣ 1√n‖(I −A(A>A)−1A>

)· ω]‖2 −

√(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2

∣∣∣ ≤ CK1,K2 · ε.

Bounding the denominator. On the event A1, we note the sandwich relation

DL ≤√

1− n · v>n (A>A)−1vn ≤ DU ,

where we have let

DL =√

1− (1 + Cε)‖vn‖22, and DU =√

1− (1− Cε)‖vn‖22.

Now, consider the event

A4 =∣∣‖vn‖22 − κ−1

∣∣ ≤ Cε,noting that an application of Bernstein’s inequality implies PA4 ≥ 1− 2e−cnε

2. Now, on the

event A4, we further obtain the bounds (recalling also the assumption κ > C),

DL ≥√

1− κ−1 − C ′ε, and DU ≤√

1− κ−1 + C ′ε.

Putting the pieces together to control τn. Now, we combine the two-sided bounds on both thenumerator and denominator to obtain the inequality

P∣∣τn − τgor∣∣ ≤ CK1,K2 · ε

≥ 1− Ce−cnε2 . (145)

85

Page 86: Kabir Aladin Chandrasekher

Bounding the minimizer ξn(τn) (144b). First, let e1 denote the first standard basis vectorin R3 and consider the quantity

|〈e1, ξn(τn)〉 − αgor| = |〈e1, (A>A)−1A>ω] + τ · n · (A>A)−1vn〉 − αgor|.

First, note that on the event A1, we can write n · (A>A)−1 = I3 +B, where ‖B‖op ≤ Cε. Wethus obtain the decomposition

|〈e1, (A>A)−1A>ω] + τn · n · (A>A)−1vn〉 − αgor| ≤ T1 + T2,

where

T1 =∣∣∣ 1n〈e1,A

>ω]〉+ τn〈e1,vn〉 − αgor∣∣∣

T2 =∣∣∣ 1n〈e1,BA

>ω]〉+ τn · 〈e1,Bvn〉∣∣∣.

Now, consider the event

A5 = 1

n|〈γn,ω]〉| ∨

∣∣∣ 1n〈z1,ω]〉 − EZ1Ω

∣∣∣ ∨ ∣∣∣ 1n〈z2,ω]〉 − EZ2Ω

∣∣∣ ≤ C ′K1ε,

and note that on A1 ∩ A2 ∩ A3, we have

|τn − τgor| ≤ CK1,K2 · ε (146)

for a pair (C ′K1, CK1,K2) depending only on K1 and only on K1,K2 respectively. In the following

lines, we note that CK1,K2 may change from line to line, but always depends only on K1,K2.

Note that by applying Bernstein’s inequality, we obtain PA5 ≥ 1−6e−cnε2. Onward, we work

on the event⋂5k=1Ak. First, applying the triangle inequality, we obtain the upper bound

T1 ≤ |EZ1Ω − αgor|+ CK1 · ε.

Next, we have

T2 ≤1

n‖B‖op‖A>ω]‖2 +

1

n|τn| · ‖B‖op‖vn‖2

≤ CK1 · ε. (147)

To establish inequality(147), we employ the following steps. First, we bound ‖B‖op on eventA1 and bound ‖A>ω]‖2 on event A2. Next, applying the Cauchy–Schwarz inequality yieldsmaxEZ1Ω,EZ2Ω ≤ CK1 . Third, we bound ‖vn‖2 on event A4 and note that κ > C.Finally, we |τn| by invoking the bound (146) and note that τgor ≤ CK1 since κ > C andEΩ2 ≤ CK1 .

Summarizing, we have shown that

P|〈e1, ξn(τn)〉 − αgor| ≤ CK1 · ε ≥ 1− Ce−cnε2 . (148a)

Proceeding in a parallel manner, we obtain the two bounds

P|〈e2, ξn(τn)〉 − µgor| ≤ CK1 · ε ≥ 1− Ce−cnε2 , (148b)

P|〈e1, ξn(τn)〉 − νgor| ≤ CK1 · ε ≥ 1− Ce−cnε2 . (148c)

86

Page 87: Kabir Aladin Chandrasekher

Bounding the minimum Ln(ξn(τn),θ]). Note that

Ln(ξn(τn);θ]) =1√n‖Aξn(τn)− ω]‖2 − 〈vn, ξn(τn)〉.

Now, let ξgor = [αgor, µgor, νgor]T ∈ R3 and consider the event

A6 = ‖ξn(τn)− ξgor‖∞ ≤ CK1 · ε, (149)

noting that the inequalities (148) imply PA6 ≥ 1− Ce−cnε2 . For the remainder of the proof,we will work on the event A =

⋂6k=1Ak. Adding and subtracting the quantity ξgor yields

Ln(ξn(τn);θ]) =1√n‖Aξgor +A(ξn(τn)− ξgor)− ω]‖2 − 〈vn, ξgor〉 − 〈vn, ξn(τn)− ξgor〉.

Thus, on the event A, we obtain∣∣∣Ln(ξn(τn);θ])− 1√n‖Aξgor − ω]‖2 + 〈vn, ξgor〉

∣∣∣ ≤ CK1 · ε. (150)

Next, we claim that∣∣∣∣ 1√n‖Aξgor − ω]‖2 − 〈vn, ξgor〉 −

√(1− 1

κ

)(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2

)∣∣∣∣ ≤ CK1 · ε.

(151)

The proof of the lemma follows upon combining inequalities (150) and (151), so the only re-maining piece is to establish inequality (151).

Proof of claim (151): On event A1 ∩ A2 ∩ A3, we have∣∣∣∣ 1n‖Aξgor‖22 − ((αgor)2 + (µgor)2 + (νgor)2)∣∣∣∣ ≤ CK1 · ε ,

∣∣∣∣ 2√nωT] Aξ

gor + 2(

(αgor)2 + (µgor)2)∣∣∣∣ ≤ CK1 · ε ,

and ∣∣∣∣ 1n‖ω]‖22 − EΩ2∣∣∣∣ ≤ CK1 · ε ,

respectively. Combining the above displays and noting that EΩ2 − (αgor)2 − (µgor)2 = (κ −1) (νgor)2 yields ∣∣∣∣ 1√

n‖‖Aξgor − ω]‖2 − νgor

√κ

∣∣∣∣ ≤ CK1 · ε.

But, under event A4, ∣∣∣∣〈vn, ξgor〉 − νgor√κ

∣∣∣∣ ≤ C · εHence, the inequality (151) follows by combining the above two displays and recalling thedefinition of νgor.

87

Page 88: Kabir Aladin Chandrasekher

B.2.2 Proof of Lemma 3(b)

We begin by defining the events

A′1 =

minξ∈R3

1√n‖Aξ − ω]‖2 ≥ cK1,K1

, (152)

A′2 =1

2I3

1

nA>A 2I3

, and A′3 =

1√n‖ω]‖2 ≤ CK1

.

Note that on the event A′2 ∩ A′3, we obtain the inequality

maxξ∈B2(C′)

1√n‖Aξ − ω]‖2 ≤ CK1 . (153)

Moreover, recall that minξ∈R3 ‖Aξ − ω]‖2 = ‖(I −A(A>A)−1A>

)· ω]‖2. Thus, apply-

ing Lemma 19 for small enough ε in conjunction with Assumption 2 yields the inequal-

ity PA′1 ≥ 1 − Ce−c′K1,K2

·n. Finally, applying Wainwright (2019, Theorem 6.1) yields

PA′2 ≥ 1 − 2e−cn; and applying Bernstein’s inequality (as each component of ω] is K-sub-Gaussian by Assumption 1) implies PA′3 ≥ 2e−cn. For the rest of the proof, we work on theevent A′1 ∩ A′2 ∩ A′3.

Note that on the event A′1, the function Ln is twice continuously differentiable. Thus, ourstrategy is to bound the minimum eigenvalue of the Hessian ∇2Ln. For ξ in a bounded domainB2(C), we compute

∇2Ln(ξ;θ]) =1

‖Aξ − ω]‖2· 1√

n·(A>A−

A>(Aξ − ω])(Aξ − ω])>A‖Aξ − ω]‖22

)(i)

1

CK1n·(A>A−

A>(Aξ − ω](Aξ − ω])>A‖Aξ − ω]‖22

),

where step (i) follows from inequality (153). Subsequently, we utilize the variational character-ization of eigenvalues to obtain

λmin(∇2Ln(ξ;θ])) ≥ 1

CK1nmin‖v‖2=1

〈v,A>Av〉 −〈v,A>(Aξ − ω])(Aξ − ω])>Av〉

‖Aξ − ω]‖22

=1

CK1nmin‖v‖2=1

‖Av‖22 −〈Av,Aξ − ω]〉2

‖Aξ − ω]‖22. (154)

Next, consider the orthogonal decomposition

ω] = Aξ0 + ω⊥,

whereAξ0 denotes an element in the column space of the random matrixA and ω⊥ is orthogonalto the column space of A. Thus,

min‖v‖2=1

‖Av‖22 −〈Av,Aξ − ω]〉2

‖Aξ − ω]‖22= min‖v‖2=1

‖Av‖22 −〈Av,A(ξ − ξ0)− ω⊥〉2

‖A(ξ − ξ0)‖22 + ‖ω⊥‖22

= min‖v‖2=1

‖Av‖22 −〈Av,A(ξ − ξ0)〉2

‖A(ξ − ξ0)‖22 + ‖ω⊥‖22(i)= min‖v‖2=1

‖Av‖22 −‖Av‖22 · ‖A(ξ − ξ0)‖22‖A(ξ − ξ0)‖22 + ‖ω⊥‖22

, (155)

where step (i) follows by applying the Cauchy–Schwarz inequality. Now,

min‖v‖2=1

‖Av‖22 −‖Av‖22 · ‖A(ξ − ξ0)‖22‖A(ξ − ξ0)‖22 + ‖ω⊥‖22

(i)

≥ n

2· ‖ω⊥‖22‖A(ξ − ξ0)‖22 + ‖ω⊥‖22

(ii)

≥ cK · n (156)

88

Page 89: Kabir Aladin Chandrasekher

where step (i) follows by re-arranging and subsequently utilizing event A′2 to lower bound theterm ‖Av‖2. Step (ii) follows by upper bounding the denominator using events A′2 and A′3 andlower bounding the numerator by using the event A′1. Finally, combining the lower bound (154),the equation (155), and the lower bound (156), we obtain

λmin(∇2Ln(ξ;θ])) ≥ cK .

The final step holds on the event A′1 ∩ A′2 ∩ A′3, which occurs with probability at least 1 −Ce−c′K1,K2

·n.

B.2.3 Proof of Lemma 18

Recall the loss function L (9):

L(θ;θ],X,y) =1√n‖ω(Xθ],y)−Xθ‖2,

which corresponds to setting

F (u,θ) =1√n‖ω(Xθ],y)−u‖2, h(θ,θ]) = 0, and g(u,Xθ];y) =

1√n‖ω(Xθ],y)−u‖2.

Additionally, note that the convex conjugate is given by

g∗(v,Xθ];y) =

〈v, ω(Xθ],y)〉 if ‖v‖2 ≤ 1√

n,

+∞ otherwise.

Substituting into the definition of the scalarized auxiliary loss Ln (see Definition 8), we obtain

Ln(α, µ, ν;θ]) =

(1√n‖ν · γn + α · z1 + µ · z2 − ω]‖2 − ν

‖P⊥S]γd‖2√n

)+

.

Now, consider the shorthandw = α · z1 + µ · z2 − ω],

and define the three events

E1 = 1

n|〈ν · γn,w〉| ≤

5√n‖w‖2

, E2 =

‖γn‖22 ≥ 4n/5, and E3 = ‖γd‖22 ≤ 36d/25.

Next, apply Bernstein’s inequality to bound PEc1, Hoeffding’s inequality to bound PEc2 andPEc3, and the union bound to obtain the inequality

PE1 ∩ E2 ∩ E3 ≥ 1− 6e−cn.

Working on the intersection of these three events, we obtain

1√n‖ν · γn +w‖2 =

√4ν2

5n‖γn‖22 +

ν2

5n‖γn‖22 +

1

n〈ν · γn,w〉+

1

n‖w‖22

√16ν2

25+

4ν2

25− 4ν

5√n‖w‖2 +

1

n‖w‖22 =

√16ν2

25+(2ν

5− ‖w‖2√

n

)2≥ 4ν

5.

Thus, we obtain

1√n‖ν · γn +w‖2 − ν

‖P⊥S]γd‖2√n

≥ 4ν

5− 6ν

5√κ≥ 0,

89

Page 90: Kabir Aladin Chandrasekher

where the last inequality holds for κ ≥ C with C a large enough constant. Summarizing, wesee that

1√n‖ν · γn + α · z1 + µ · z2 − ω]‖2 − ν

‖P⊥S]γd‖2√n

≥ 0,

whence

Ln(α, µ, ν;θ]) =1√n‖ν · γn + α · z1 + µ · z2 − ω]‖2 − ν

‖P⊥S]γd‖2√n

as desired.

B.2.4 Proof of Lemma 19

Recall the three events (143). We prove each part of the lemma in turn.

Proof of part (a): First, expand the norm to obtain

1√n‖(In −A(A>A)−1A>)ω(Xθ],y)‖2 =

√1

n‖ω]‖22 −

1

n(A>ω])>(A>A)−1(A>ω]).

Now, on the event A1, we obtain the sandwich relation

AL ≤√

1

n‖ω]‖22 −

1

n(A>ω])>(A>A)−1(A>ω]) ≤ AU ,

where we have let

AL =

√1

n‖ω]‖22 −

1 + Cε/2

n2‖Aω]‖22 =

√1

n‖ω]‖22 −

1 + Cε/2

n2(〈γn,ω]〉2 + 〈z1,ω]〉2 + 〈z2,ω]〉2)

AU =

√1

n‖ω]‖22 −

1− Cε/2n2

‖Aω]‖22 =

√1

n‖ω]‖22 −

1− Cε/2n2

(〈γn,ω]〉2 + 〈z1,ω]〉2 + 〈z2,ω]〉2),

where the last equality in both lines follows by recalling thatA>ω] = [〈z1,ω]〉, 〈z2,ω]〉, 〈γn,ω]〉].On the event A2 ∩ A3, we obtain

AL(i)

≥√

(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2 − CK1ε(ii)

≥√

(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2−CK1,K2 ·ε.

Specifically, step (i) holds since by Assumption 1, max EZ1Ω,EZ1Ω ≤√

EΩ2 ≤ CK1.Step (ii) follows since by Assumption 2, the first term on the RHS is at least K2. Proceedingsimilarly, we obtain the upper bound

AU ≤√

(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2 + CK1,K2ε.

Putting the pieces together, we see that on the event A1 ∩ A2 ∩ A3,∣∣∣ 1√n‖(In −A(A>A)−1A>) · ω(Xθ],y)‖2 −

√(EΩ2 − (EZ1Ω)2 − (EZ2Ω)2

∣∣∣ ≤ CK1,K2ε,

as claimed.

90

Page 91: Kabir Aladin Chandrasekher

Proof of part (b): From Wainwright (2019, Theorem 6.1), we directly have PA1 ≥ 1 −Ce−cnε

2. Next, recall for convenience the other two events

A2 = 1

n2|〈γn,ω]〉|2 ∨

∣∣∣ 1

n2〈z1,ω]〉2 − (EZ1Ω)2

∣∣∣ ∨ ∣∣∣ 1

n2〈z2,ω]〉2 − (EZ2Ω)2

∣∣∣ ≤ CK1 · ε,

and

A3 =∣∣∣ 1n‖ω]‖22 − EΩ2

∣∣∣ ≤ CK1 · ε.

Next, recall that by Assumption 1, we have the bound ‖ω]‖ψ2 ≤ K1. Thus, we apply Bernstein’sinequality to obtain

P∣∣∣ 1n〈z1,ω]〉 − EZ1Ω

∣∣∣ ≥ CK1 · ε≤ 2e−cnε

2.

Consequently, with probability at least 1− 2e−cnε2,∣∣∣ 1

n2〈z1,ω]〉2 − (EZ1Ω)2

∣∣∣ =∣∣∣ 1n〈z1,ω]〉 − EZ1Ω

∣∣∣ · ∣∣∣ 1n〈z1,ω]〉+ EZ1Ω

∣∣∣ (i)

≤ CK1 · ε,

where step (i) additionally used the fact that EZ1Ω ≤ CK1 . Bounding the other termssimilarly and applying a union bound, we obtain PA2 ≥ 1 − 6e−cnε

2. Finally, we apply

Bernstein’s inequality once more to obtain the inequality PA3 ≥ 1− 2e−cnε2.

B.2.5 Proof of Lemma 20

Set CK1 to be a sufficiently large positive constant depending only on K1. For any R ≥ CK1 ,that may only depend on K1, we note the following characterization:

minξ∈P(B2(R))

Ln(ξ;θ]) = minξ∈P(B2(R))

infτ>0

τ

2+

1

2τn‖Aξ − ω(Xθ],y)‖22 − 〈vn, ξ〉

= infτ>0

minξ∈P(B2(R))

τ

2+

1

2τn‖Aξ − ω(Xθ],y)‖22 − 〈vn, ξ〉, (157)

where the second equality follows on event A′1 (152), which holds with probability1−C exp(−cK1,K2n) and implies that the infimum over τ is achieved. This proves the part (a)of the lemma.

Next, we prove part (b). To do this, consider the unconstrained minimization in (157) overξ ∈ R3. Note that this admits the unique minimizer

ξn(τ) = (A>A)−1A>ω(Xθ],y) + τ · n · (A>A)−1vn.

Substituting this value into the RHS of the optimization problem (157) yields

minξ∈R3)

Ln(ξ;θ]) = infτ>0

τ

2+

1

2τn‖Aξn(τ)− ω(Xθ],y)‖22 − 〈vn, ξn(τ)〉

= infτ>0

1

2τn‖(In −A(A>A)−1A>)ω(Xθ],y)‖22 +

τ

2(1− n · v>n (A>A)−1vn)

− v>n (A>A)−1Aω(Xθ],y). (158)

Consider the events

A′′1 = ‖P⊥S]γd‖22 ≤ 36d/25 and A′′2 =

(1− c′)I3

1

nA>A (1 + c′)I3

.

91

Page 92: Kabir Aladin Chandrasekher

Apply Bernstein’s inequality to obtain PA′′1 ≥ 1−2e−cn and apply Wainwright (2019, Theorem6.1) to obtain PA′′2 ≥ 1− 2e−cn. On the event A′′1 ∩ A′′2, note that

1− n · v>n (A>A)−1vn ≥ 1− (1 + c′/2)‖vn‖22 ≥ 1− (1 + c′/2) · 6

5κ> c′′ > 0,

where in the last inequality we used κ > C. Thus, on the event A′1 ∩A′′1 ∩A′′2, the optimizationproblem (158) admits the unique minimizer

τn =

1√n‖(In −A(A>A)−1A>) · ω(Xθ],y)‖2√

1− n · v>n (A>A)−1vn.

We have thus far shown that ξn(τn) is the unique minimizer of

minξ∈R3

Ln(ξ;θ]). (159)

But, recalling the event (149), we note that ‖ξn(τn) − ξgor‖2 ≤ C ′K1ε with probability at least

1−Ce−cnε2 . On this event, using triangle inequality in conjunction with the fact that ‖ξgor‖2 ≤C ′′K1

by Assumption 1 and κ > C, yields the inequality ‖ξn(τn)‖2 ≤ CK1 for a positive constantCK1 depending only on K1. Therefore, the minimizer of equation (159) remains ξn(τn) even ifwe constrain ‖ξ‖2 ≤ R, provided R ≥ CK1 . To finish the proof, recall by Definition 9 that sinceξ = ξ(θ), we have ‖ξ‖2 = ‖θ‖2.

B.3 Establishing growth conditions for first-order methods

In this subsection, we prove Lemma 4. Note that the events in this section are unrelated toevents defined in Section B.2. Also note that universal constants, as well as those dependingon K1 may change from line to line. Now, recall from equation (63) that Ln can be written as

Ln(α, µ, ν) =α2 + µ2 + ν2

2− (αα] + µβ]) +

n· 〈ω(Xθ],y), νγn + αz1 + µz2〉

− 2ην

n· ‖P⊥S]γd‖2 · ‖ω(Xθ],y)‖2.

Note that the 1-strong convexity claimed in part (b) of the lemma is evident from the expressionabove, so we focus our attention on proving part (a) in the next subsection.

B.3.1 Proof of Lemma 4(a)

Evidently, Ln (63) is strongly convex and continuously differentiable. The optimizers are givenby the first order conditions

αn = α] − 2η

n· 〈z1, ω(Xθ],y)〉 (160a)

µn = β] − 2η

n· 〈z2, ω(Xθ],y)〉 (160b)

νn =2η

n· 〈γn, ω(Xθ],y)〉+

n· ‖P⊥S]γd‖2 · ‖ω(Xθ],y)‖2. (160c)

Now, consider the events

A1 =∣∣∣ 1n〈γn, ω(Xθ],y)〉

∣∣∣∨∣∣∣ 1n〈z1, ω(Xθ],y)〉−EZ1Ω

∣∣∣∨∣∣∣ 1n〈z2, ω(Xθ],y)〉−EZ2Ω

∣∣∣ ≤ CK1 ·ε,

and

A2 =∣∣∣ 1√

n‖P⊥S]γd‖2 − κ

−1/2∣∣∣ ∨ ∣∣∣ 1√

n‖ω(Xθ],y)‖2 −

√EΩ2

∣∣∣ ≤ CK1 · ε,

92

Page 93: Kabir Aladin Chandrasekher

and apply Bernstein’s inequality to obtain the bound PA1 ∩ A2 ≥ 1 − 10e−cnminε2,ε. Con-sequently, on the event A1 ∩ A2, we obtain

P‖[αn, µn, νn]− [αgor, µgor, νgor]‖∞ ≤ CK1 · ε ≥ 1− 20e−cnminε2,ε

Additionally, substitute the empirical minimizers (160) into the loss Ln (63) to obtain

Ln(αn, µn, νn) = −1

2· (α2

n + µ2n + ν2

n).

Next, recall from Lemma 4 the constant

L = −1

2· ((αgor)2 + (µgor)2 + (νgor)2).

We also claim that αgor∨µgor∨νgor ≤ CK1 ; this can be verified from Definition 12, and applyingthe Cauchy–Schwarz inequality in conjunction with the assumed bounds α], β] ≤ 3/2.

Combining the pieces, note that on the event A1 ∩ A2 we obtain

|Ln(αn, µn, νn)− L| ≤ CK1 · ε2

(|αn + αgor|+ |µn + µgor|+ |νn + νgor|

)≤ C ′K1

· ε,

as desired.

C Auxiliary proofs for general results, part (b)

We state and prove two technical lemmas; the first is used throughout Section 8 and the secondprovides some basic properties about the initial point θ0.

Lemma 21. Let n and d be positive integers such that n ≥ 2d. Additionally, let x1, . . . ,xniid∼

N (0, Id) and let Σ =∑n

i=1 xix>i . Then, there exists a universal positive constant C such that

for all integers p where 1 ≤ p < n−d−12 , we have

E‖Σ−1‖pop

≤(C

n

)p.

Proof. We begin by noting that ‖Σ−1‖pop = λmin(Σ)−p. Our strategy is to truncate λmin(Σ) atthe level An, for a constant A > 0 to be chosen later, and decompose

Eλmin(Σ)−p

= E

λmin(Σ)−p1λmin(Σ) ≤ An

+ E

λmin(Σ)−p1λmin(Σ) ≥ An

.

(161)

For ease of notation, we will denote the two terms in the above decomposition by

T1 = Eλmin(Σ)−p1λmin(Σ) ≤ An

T2 = E

λmin(Σ)−p1λmin(Σ) ≥ An

,

and handle each term in turn.

Bounding the term T1. We write explicitly:

Eλmin(Σ)−p1λmin(Σ) ≤ An

=

∫ An

0t−pfλmin

(t)dt,

∫ An

0t−pfλmin

(t)dt(i)

≤2n−d−1

2 Γ(n+12 )

Γ(d2)Γ(n− d+ 1)

∫ An

0t−pt

12

(n−d−1)dt

93

Page 94: Kabir Aladin Chandrasekher

(ii)=

2n−d−1

2 Γ(n+12 )(

12(n− d+ 1)− p

)Γ(d2)Γ(n− d+ 1)

(An)12

(n−d−1)−p+1

(iii)

≤ 8(A)12

(n−d+1)−p

Γ(n− d+ 1)nn−d−p

(iv)

≤ 8

2e

(1

4e2n

)p.

Step (i) follows by applying Chen and Dongarra (2005, Lemma 3.3) to upper bound the densityfλmin

(t). Step (ii) follows by noting that (n − d − 1)/2 − p > 0 and evaluating the integral

exactly. Step (iii) from the inequality Γ(d2)(n2

) 12

(n−d+1)> Γ(n+1

2 ) (see the proof of Chen andDongarra (2005, Lemma 4.1)) and the fact that n− d+ 1 ≥ n/2. Step (iv) follows by utilizingStirling’s inequality for the Gamma function and setting A = (4e2)−1. Summarizing, we haveshown

Eλmin(Σ)−p1λmin(Σ) ≤ An

≤ 8

2e

(1

4e2n

)p, (162)

where A = (4e2)−1.

Bounding the term T2. Note that the function t 7→ t−p is decreasing for p > 0. Conse-quently,

Eλmin(Σ)−p1λmin(Σ) ≥ An

≤(

1

An

)pPr λmin(Σ) ≥ An ≤

(4e2

n

)p. (163)

The result follows immediately upon combining the decomposition (161) along with theupper bound on term T1 (162) the upper bound on term T2 (163).

D Auxiliary technical results for specific models

We begin by proving the four corollaries for one-step updates from Theorems 1 and 2, and thenproceed to proofs of Fact 1 and the technical lemmas stated in Section 9.1.

D.1 Proof of Corollary 1

We evaluate the Gordon updates explicitly and verify Assumptions 1 and 2. The corollary thenfollows by invoking Theorem 1. Note that in this case, we have ω(x, y) = sgn(x) · y, so that

Ω = sgn(αZ1 + βZ2) · (|Z1|+ σZ3) .

D.1.1 Evaluating Gordon state evolution update

Let us begin by evaluating the three expectations that appear in the claimed Gordon updatein Definition 1. Clearly, we have

E[Ω2] = E (|Z1|+ σZ3) = 1 + σ2. (164)

Since Z3 is independent of the pair (Z1, Z2), we also have

E[Z1Ω] = E[sgn(αZ1 + βZ2) · sgn(Z1) · Z21 ] and E[Z2Ω] = E[sgn(αZ1 + βZ2) · |Z1| · Z2].

94

Page 95: Kabir Aladin Chandrasekher

We evaluate these expectations by first transforming into polar coordinates. Let Z1 = R cos Φand Z2 = R sin Φ, where R is a χ-random variable with 2 degrees of freedom and the randomvariable Ψ ∼ Unif([0, 2π]) is drawn independently. The first expectation can then be written as

E[Z1Ω] = E[sgn(α cos(Ψ) + β sin(Ψ)) sgn(cos(Ψ))R2 cos2(Ψ)

]= E

[sgn(cos(Ψ) · cos(Ψ− φ))R2 cos2(Ψ)

],

where we have used the fact that tanφ = β/α. Evaluating the final expectation explicitly, weobtain

E[Z1Ω] = E[R2] ·(

1

∫ 2π

0sgn(cos(ψ) · cos(ψ − φ)) cos2(ψ)dψ

)(i)=

1

π

(π − 4

∫ π/2+φ

π/2cos2 ψdψ

)= 1− 1

π(2φ− sin(2φ)), (165)

where step (i) follows from noting that sgn(cos(ψ) · cos(ψ− φ)) = 1 for all ψ ∈ [0, π/2]∪ [π/2 +φ, 3π/2] ∪ [3π/2 + φ, 2π]. Proceeding similarly for the second expectation, we have

E[Z2Ω] = E[sgn(cos(Ψ) · cos(Ψ− φ))R2 cos(Ψ) sin(Ψ)

]=

1

π

∫ 2π

0sgn(cos(ψ) · cos(ψ − φ)) cos(ψ) sin(ψ)dψ =

2

πsin2(φ). (166)

Putting together equations (164), (165), and (166) with Definition 1, some straightforwardcalculation yields the Gordon state evolution update (33).

D.1.2 Verifying assumptions

To verify Assumption 1, note that Ω2 ≤ 2Z21 + 2σ2Z2

3 , so that E[exp(Ω2/(2 + 2σ2)] ≤ 1. Thus,we have ‖Ω‖ψ2 ≤ 2(1 + σ2). To verify Assumption 2, note that from the calculations above,

E[Ω2]− (E[Z1Ω])2 − (E[Z2Ω])2 = 1 + σ2 −(

1− 1

π(2φ− sin(2φ))

)2

− 4

π2sin4 φ ≥ σ2,

where the final inequality can be verified for each 0 ≤ φ ≤ π/2.

D.2 Proof of Corollary 2

In this case, we evaluate the expectations and verify Assumption 1, and the result follows byinvoking Theorem 2. We have ω = x− sgn(x) · y, so that

Ω = αZ1 + βZ2 − sgn(αZ1 + βZ2) · (|Z1|+ σZ3) .

D.2.1 Evaluating Gordon state evolution update

Note the definition of Ω in conjunction with equations (165) and (166), and Remark 1. Per-forming some straightforward algebra using Definition 2(b) yields the Gordon updates (37).

D.2.2 Verifying Assumption 1

We have the upper bound

Ω2 ≤ 2(1 + α2)Z21 + 2β2Z2 + 2σ2Z2

3 ,

so that E[exp(Ω2/(2α2 + 2β2 + 2σ2 + 2)] ≤ 1. Thus, we have

‖Ω‖ψ2 ≤ 2(α2 + β2 + σ2 + 1) ≤ 2(10 + σ2),

where the final inequality is a consequence of the assumption α ∨ β ≤ 3/2.

95

Page 96: Kabir Aladin Chandrasekher

D.3 Proof of Corollary 3

We evaluate the Gordon state evolution update explicitly, and verify Assumptions 1 and 2. Thecorollary then follows by invoking Theorem 1. In this case, we have ω(x, y) = sgn(yx) · y, sothat

Ω = sgn(αZ1 + βZ2) · sgn(Q · Z1 + σZ3) · (Q · Z1 + σZ3) .

Here Q is a Rademacher random variable.

D.3.1 Evaluating Gordon state evolution update

An immediate calculation yields E[Ω2] = 1 + σ2. We now claim that the following equalitiescharacterize the remaining two expectations:

E [Z1Ω] = 1− 2

πtan−1

(√ρ2 + σ2 + ρ2σ2

)+

2

π

√ρ2 + σ2 + ρ2σ2

1 + ρ2, and (167)

E [Z2Ω] =2

π

ρ√ρ2 + σ2 + σ2ρ2

1 + ρ2, (168)

where we recall the notation ρ = β/α. Taking this claim as given, combining it with Definition 1,and performing some algebra yields the Gordon state evolution update (42). It remains toestablish the two equalities. We prove claim (167) below; the proof of claim (168) is similar andomitted for brevity.

Proof of equation (167): Note that

E [Z1Ω] = E [Z1 · sgn(αZ1 + βZ2) · sgn(QZ1 + σZ3) · (QZ1 + σZ3)]

(i)= E [Z1 · sgn(αZ1 + βZ2) · |Z1 + σZ3|] .

In step (i), we multiplied the expression by Q2 = 1 and used the fact that Z3(d)= QZ3. To

compute this expectation tractably, we use a change of variables. Let Z ′ = αZ1+βZ2√α2+β2

and write

Z1 =α√

α2 + β2Z ′ +

β√α2 + β2

Z,

where Z is a standard Gaussian independent of the tuple (Z ′, Z3). Finally, define the followingstandard Gaussian variate that is independent of Z ′:

Z ′′ =

(β2

α2 + β2+ σ2

)−1/2(

β√α2 + β2

· Z + σZ3

),

and use σ =√

β2

α2+β2 + σ2 to denote the normalization constant. Substituting above, we obtain

E [Z1Ω] =α√

α2 + β2· E

[|Z ′| ·

∣∣∣∣∣ α√α2 + β2

Z ′ + σZ ′′

∣∣∣∣∣]

+β√

α2 + β2· E

[Z · sgn(Z ′) ·

∣∣∣∣∣ α√α2 + β2

Z ′ + σZ ′′

∣∣∣∣∣]

=α√

α2 + β2· E

[|Z ′| ·

∣∣∣∣∣ α√α2 + β2

Z ′ + σZ ′′

∣∣∣∣∣]

︸ ︷︷ ︸T1

96

Page 97: Kabir Aladin Chandrasekher

+β2

σ · (α2 + β2)· E

[Z ′′ · sgn(Z ′) ·

∣∣∣∣∣ α√α2 + β2

Z ′ + σZ ′′

∣∣∣∣∣]

︸ ︷︷ ︸T2

We may now use polar coordinates to compute the two expectations; write Z ′ = R cos Ψ andZ ′′ = R sin Ψ. Let λ = tan−1(σ

√1 + ρ2) for convenience, so that

α√α2 + β2

Z ′ + σZ ′′ =√

1 + σ2 ·R cos(Ψ− λ).

Evaluating term T1: We have

T1 =√

1 + σ2 · E[R2] ·(

1

∫ 2π

0| cos(ψ) · cos(ψ − λ)|dψ

)(i)=

√1 + σ2

π· ((π − 2λ) cosλ+ 2 sinλ) ,

where step (i) follows from evaluating the integral explicitly, noting that cos(ψ) · cos(ψ − λ) isnonnegative except in the range (π/2, π/2 + λ) ∪ (3π/2, 3π/2 + λ).Evaluating term T2: We have

T2 =√

1 + σ2 · E[R2] ·(

1

∫ 2π

0sin(ψ) · sgn(cos(ψ)) · | cos(ψ − λ)|dψ

)(i)=√

1 + σ2 ·(π − 2λ

π

)· sinλ,

where once again, step (i) follows from evaluating the integral explicitly, noting that cos(ψ) ·cos(ψ − λ) is nonnegative except in the range (π/2, π/2 + λ) ∪ (3π/2, 3π/2 + λ).Putting together the pieces: Since λ = tan−1(σ

√1 + ρ2), we have

sinλ =

√ρ2 + σ2 + ρ2σ2

√1 + σ2 ·

√1 + ρ2

and cosλ =1

√1 + σ2 ·

√1 + ρ2

.

Consequently,

E[Z1Ω] =

(π − 2λ

π

)· 1

1 + ρ2+

2

π·√ρ2 + σ2 + ρ2σ2

1 + ρ2+

(π − 2λ

π

)· ρ2

1 + ρ2

= 1− 2λ

π+

2

π·√ρ2 + σ2 + ρ2σ2

1 + ρ2,

as claimed.

D.3.2 Verifying assumptions

To verify Assumption 1, note that Ω2 ≤ 2Z21 + 2σ2Z2

3 , so that E[exp(Ω2/(2 + 2σ2)] ≤ 1.Consequently ‖Ω‖ψ2 ≤ 2(1 + σ2). To verify Assumption 2, note from the calculations abovethat

E[Ω2]− (E[Z1Ω])2 − (E[Z2Ω])2 = 1− (1−Aσ(ρ) +Bσ(ρ))2 − [ρBσ(ρ)]2 + σ2 := H(ρ),

where Aσ(ρ) and Bσ(ρ) were defined in equation (41). Note that H ′(ρ) =4ρ(ρ2(ρ2+2)+σ2)(π−2 tan−1(

√ρ2(ρ2+2)+σ2)

π2(ρ2+1)2√ρ2(ρ2+2)+σ2

≥ 0, so that H(ρ) ≥ H(0) = σ2.

D.4 Proof of Corollary 4

We evaluate the various expectations and verify Assumption 1. The corollary then follows byinvoking Theorem 2. In this case, we have ω(x, y) = x− sgn(yx) · y, so that

Ω = αZ1 + βZ2 − sgn(αZ1 + βZ2) · sgn(Q · Z1 + σZ3) · (Q · Z1 + σZ3) .

Here Q is a Rademacher random variable.

97

Page 98: Kabir Aladin Chandrasekher

D.4.1 Evaluating Gordon state evolution update

Note the definition of Ω in conjunction with equations (167) and (168), and Remark 1. Per-forming some straightforward algebra using Definition 2(b) yields the Gordon updates (46).

D.4.2 Verifying Assumption 1

As in the previous subgradient update, we have the upper bound Ω2 ≤ 2(1 + α2)Z21 + 2β2Z2 +

2σ2Z23 , so that E[exp(Ω2/(2α2 + 2β2 + 2σ)] ≤ 1. Thus, we have ‖Ω‖ψ2 ≤ 2(α2 + β2 + σ2).

D.5 Proof of Fact 1

Let us begin by restating the population update for alternating minimization as applied tophase retrieval:

αpop = 1− 1

π(2φ− sin(2φ)) and βpop =

2

πsin2(φ).

We refer to these as the F and G maps respectively and let Spop denote the population stateevolution operator. In order to prove the desired fact, it suffices to verify that Spop is G-faithful,and to prove upper and lower bounds on its one-step convergence.

Verifying G-faithfulness: This follows directly from the G-faithfulness of the Sgor update,since F = F and G ≤ G.

Upper bound on one-step convergence: First, note that G(α, β) ≤ 4π2φ

4. On the otherhand, equation (112) yields

(1− F (α, β))2 ≤ 16

9π2φ6.

Putting together the pieces, we have

[d(Spop(ζ))]2 = [G(α, β)]2 + (1− F (α, β))2 ≤ β4 ≤β2 + (1− α)2

2,

where the first inequality is a result of noting that φ ≤ β/α and α ≥ 0.55.

Lower bound on two-step convergence: Moving now to the lower bound, let us computetwo steps of the Gordon update, letting F+ = F 2(α, β) and G+ = G2(α, β). Analogously, welet F = F (α, β) and G = G(α, β), and use φ+ = tan−1(G/F ) to denote the angle after one stepof the Gordon update. Recall that tanφ = β/α. We have

G2+ =

4

π2· sin4 φ+ =

4G4

π2(F 2 +G2)2≥ 4π2

(π2 + 4)2·G4 >

G4

5,

where the penultimate inequality uses the fact that F ≤ 1 and G ≤ 2/π, guaranteed by one

step of the Gordon update. For the F component, equation (113) yields π2(1 − F+)2 ≥ G6

8 .Furthermore, we have (1− F )4 . β12 . G6, so that putting together the pieces yields

G2+ + (1− F+)2 ≥ G4

5+ c(1− F )4 ≥ c′ ·

G2 + (1− F )2

2,

where the last step follows because (A+B)κ ≤ 2κ(Aκ +Bκ) for any positive scalars (A,B) andκ ≥ 1. Taking square roots completes the proof.

Remark 5. The claimed quadratic convergence holds in a region much larger than the goodregion GPR. Indeed, it is straightforward to show that tan−1(G/F ) ≤ 2

π ·(tan−1(β/α)

)2, showing

that the angle converges quadratically fast, globally for any φ < π/2.

98

Page 99: Kabir Aladin Chandrasekher

D.6 Proof of Lemma 10

Since the lemma consists of several parts, we prove each in turn.

D.6.1 Proof of part (a)

Since the maps F0 and F coincide, we have F0(α, β) = 1− 2φ−sin(2φ)π , which is a non-increasing

function of φ. Evaluating it at φ = 0, π/2, we obtain 0 ≤ F0(α, φ) ≤ 1. Next, note that

sinx ≥ x− x3

3! for x ≥ 0 to obtain

F0(α, β) = 1− 2φ− sin(2φ)

π≥ 1− 4

3πφ3.

Finally, using the fact that sinx ≤ x− x3

3! + x5

5! for x ≥ 0, we have that for ρ ≤ 1/5,

1− F0(α, β) =2φ− sin(2φ)

π≥ 2

5φ3.

Here the final inequality uses the fact that 2φ ≤ 2ρ ≤ 2/5.

D.6.2 Proof of part (b)

Introduce the change of variables ζ = π/2− φ = tan−1(α/β). We have

F0(α, β) = F (α, β) =2ζ + sin 2ζ

π

(i)

≥ 4ζ

π− 4ζ3

(ii)

≥ 4

π

β

)− 8

β

)3

where in step (i), we have used the fact that sinx ≥ x − x3

3! and in step (ii) we have used the

Taylor expansion of the tan−1 function to conclude that x− x3

3 ≤ tan−1 x ≤ x. Now using thefacts that α/β ≤ 1/2 and β ≤ 1, respectively, we have

F (α, β) ≥ 10

β

)> 1.06 · α

as desired.

D.6.3 Proof of part (c)

Consider the map f : (ρ, σ) 7→ 1 − Aσ(ρ) + Bσ(ρ), and note that ∂f∂σ = 1

π ·σ

(2σ2+2)√ρ2+σ2+ρ2σ2

,

which is non-negative for each ρ, σ ≥ 0. Thus Fσ is non-decreasing in σ for each ρ, i.e., for each(α, β) pair.

Also note that ∂f∂ρ = − 1

π ·(σ2+2)+σ2/ρ2

2(ρ2+1)2√ρ2+σ2+ρ2σ2

which is non-positive for each ρ, σ ≥ 0. Thus,

Fσ(α, β) ≤ 1−Aσ(0) +Bσ(0) = 1 +2

π(σ − tan−1(σ)) ≤ 1 +

2σ3

3π,

where the final inequality uses tan−1 x ≥ x− x3/3.

D.6.4 Proof of part (d)

Note that we can simplify Gσ(α, β) =√

[ρBσ(ρ)]2 · κ−2κ−1 + 1

κ−1 (1 + σ2 − [Fσ(α, β)]2), from

which the lower bound follows immediately. To prove the upper bound, note that

Gσ(α, β) ≤√

[ρBσ(ρ)]2 · κ− 2

κ− 1+

σ2

κ− 1+

(1 + Fσ(α, β))

κ− 1(169)

99

Page 100: Kabir Aladin Chandrasekher

√[ρBσ(ρ)]2 +

σ2

κ− 1+

1

κ− 1·(

1 +σ3

), (170)

where the second step follows since Fσ(α, β) ≤ 1 + 2σ3

3π , as proved in the previous part. Nownote that a straightforward calculation yields that for all ρ ≥ 0, we have

ρBσ(ρ) ≤ 2

π

√1 + σ2.

Noting that σ ≤ 1/2, and choosing κ ≥ C for a large constant C, we have

Gσ(α, β) ≤√

5

π2+ 0.2 ≤ 0.8.

D.6.5 Proof of part (e)

We have

[G0(α, β)]2 =κ− 2

κ− 1· 4

π2· sin4 φ+

1

κ− 1· (1− [F0(α, β)]2)

≤ 4

π2φ4 +

1

κ− 1(1− F (α, β))(1 + F (α, β))

≤ 4

π2φ4 +

8φ3

3π(κ− 1)

≤ φ3

10. (171)

Here the penultimate inequality makes use of parts (a) and (c) of the lemma to conclude that

1 − F (α, β) ≤ 4φ3

3π and 1 + F (α, β) ≤ 2, respectively. The last line follows since κ ≥ C andφ ≤ 1/5.

D.6.6 Proof of part (f)

By definition of the maps, performing some algebra yields

[gσ(α, β)]2 =κ− 2

κ− 1· [ρBσ(ρ)]2 +

1

κ

(α− Fσ(α, β))2 + (β − ρBσ(ρ))2 + 1− [Fσ(α, β)]2 + σ2

≤ [Gσ(α, β)]2 +

1

κ

(α− Fσ(α, β))2 + (β − ρBσ(ρ))2

, (172)

Using the numeric inequality (a+ b)2 ≤ 2a2 + 2b2 twice, we obtain

(α− Fσ(α, β))2 + (β − ρBσ(ρ))2 ≤ 2(1− α)2 + 2β2 + 2(1− Fσ(α, β))2 + 2[ρBσ(ρ)]2,

and combining the pieces completes the proof of the upper bound.

To prove the lower bound, note that

κ− 2

κ− 1·[ρBσ(ρ)]2+

1

κ

(α− Fσ(α, β))2 + (β − ρBσ(ρ))2 + 1− [Fσ(α, β)]2 + σ2

≥ κ− 1

κ·[Gσ(α, β)]2.

100

Page 101: Kabir Aladin Chandrasekher

D.6.7 Proof of part (g)

Combining the fact that Gσ(α, β) =√ρ2Bσ(ρ)2 · κ−2

κ−1 + 1κ−1 (1 + σ2 − [Fσ(α, β)]2) with the

inequalities Fσ(α, β) ≤ 1+ 2σ3

3π and σ ≤ 1/2, the lower bound on GσFσ

follows from straightforwardcalculation.

To prove the upper bound, begin by noting that since 1− Fσ(α, β) ≤ 4φ3

3π , we have

Gσ(α, β) ≤

√ρ2Bσ(ρ)2 +

σ2

κ− 1+

8φ3

3π(κ− 1)·(

1 +σ3

).

We now use the fact that φ ≤ ρ, and that φ ≤ 1.12 for all ρ ≤ 2. In conjunction with theassumption σ ≤ 0.5, we obtain

Gσ(α, β) ≤ ρBσ(ρ) +1.05ρ√κ− 1

+1.1σ√κ− 1

where we have also used the inequality√a+ b+ c ≤

√a +√b +√c, valid for any three non-

negative scalars (a, b, c). Using the definition Fσ(α, β) = 1−Aσ(ρ) +Bσ(ρ), we have

GσFσ≤ ρ

(1− 1−Aσ(ρ)

1 +Bσ(ρ)−Aσ(ρ)+

1.05

Fσ ·√κ− 1

)+

1.1σ

Fσ ·√κ− 1

≤ ρ(

1− 1−Aσ(ρ)

1 +Bσ(ρ)−Aσ(ρ)+

2.1√κ− 1

)+

2σ√κ− 1

,

where in the final inequality, we have used the fact that Aσ(ρ) ≥ 0 for all ρ and Fσ ≥ F0 ≥ 0.56for all ρ ≤ 2. Now note that if σ ≤ 0.5 and ρ ≤ 2, then a straightforward computation yieldsthat Aσ(ρ) ≤ 0.74 and Bσ(ρ) ≤ 0.4. Since 1 +Bσ(ρ)−Aσ(ρ) ≥ 0, we have 1−Aσ(ρ)

1+Bσ(ρ)−Aσ(ρ) ≥ 1/4.Putting together the pieces yields

GσFσ≤ ρ

(0.75 +

2.1√κ− 1

)+

2σ√κ− 1

≤ 4

5· βα

+2σ√κ− 1

,

where the final inequality uses the fact that κ ≥ C.

D.7 Proof of Lemma 11

Once again, we prove each part separately.

D.7.1 Proof of part (a)

By definition, we have Fσ(α, β) = Φ(ρ(α, β)) for a univariate. σ-dependent function Φ. Bychain rule, ∇Fσ(α, β) = Φ′(ρ) · ∇ρ(α, β). In addition, ∇ρ(α, β) = α−1 · (1,−ρ), so that‖∇ρ(α, β)‖1 = α−1(1 + ρ). Differentiating the univariate function Φ, we obtain |Φ′(ρ)| =2ρπ ·

ρ2(2+σ2)+σ2

(1+ρ2)2·√ρ2(1+σ2)+σ2

. Thus, we have

‖∇Fσ(α, β)‖1 =2ρ

πα· ρ2(2 + σ2) + σ2√

ρ2(1 + σ2) + σ2· (1 + ρ)

(1 + ρ2)2=: α−1 · f(ρ).

We now claim that f(ρ) is non-decreasing in the interval [0, 0.25]. This claim directly yields‖∇Fσ(α, β)‖1 ≤ α−1 · f(0.25) ≤ 1/2 for all α ≥ 0.5, ρ ≤ 0.25, and σ ≤ 1/2.

To prove that f is non-decreasing, note that a straightforward calculation yields

f ′(ρ) =2(σ2 + (2 + σ2)ρ2)

π(1 + ρ2)2√σ2 + (1 + σ2)ρ2

·

[1 + 2ρ− ρ2(1 + ρ)√

σ2 + (1 + σ2)ρ2− 4ρ2(1 + ρ)

1 + ρ2

]

101

Page 102: Kabir Aladin Chandrasekher

+8(2 + σ2)ρ2(1 + ρ)

π(1 + ρ2)2√σ2 + (1 + σ2)ρ2

. (173)

Since

ρ2(1 + ρ)√σ2 + (1 + σ2)ρ2

≤ ρ(1 + ρ),

we have

1 + 2ρ− ρ2(1 + ρ)√σ2 + (1 + σ2)ρ2

− 4ρ2(1 + ρ)

1 + ρ2≥ 1 + ρ− 4ρ2 − 3ρ3 − ρ4

1 + ρ2≥ 0. (174)

Here, the final inequality follows from the following argument: note that f : ρ 7→ 1 + ρ −4ρ2 − 3ρ3 − ρ4 is concave and thus on the interval [0, 0.25], it takes its minimizer at one of theendpoints. Furthermore, minf(0), f(0.25) > 0.

D.7.2 Proof of part (b)

Writing Gσ(α, β) = Γ(ρ), note that

Γ(ρ) =

√ρ2Bσ(ρ)2 · κ− 2

κ− 1+

1

κ− 1(1 + σ2 − [Φ(ρ)]2)

By Lemma 10(a), we have Φ(ρ) ≤ 1 + 2σ3

3π , and so we obtain

Γ(ρ) ≥

√√√√ρ2Bσ(ρ)2κ− 2

κ− 1+

1

κ− 1

(1 + σ2 −

(1 +

2σ3

)2)

√ρ2Bσ(ρ)2

κ− 2

κ− 1+

σ2

2(κ− 1)≥ ρBσ(ρ)

√κ− 2

κ− 1, (175)

where the first inequality holds since σ ≤ 0.5. Now, define the function

S(ρ) = [ρBσ(ρ)]2 +1

κ− 1

(1 + σ2 − (1−Aσ(ρ) +Bσ(ρ))2 − ρ2Bσ(ρ)2

),

and note that Γ(ρ) =√S(ρ) and Γ′(ρ) = S′(ρ)

2√S(ρ)

. We then have

‖∇Gσ(α, β)‖1 =α−1

2· |S

′(ρ)|√S(ρ)

(1 + ρ)(i)

≤ |S′(ρ)|√S(ρ)

(1 + ρ) =: T1 + T2, (176)

where step (i) follows since α ≥ 1/2, and we have let

T1 =2(κ− 2)

(κ− 1)√S(ρ)

ρ(1 + ρ)Bσ(ρ)(Bσ(ρ) + ρB′σ(ρ)

)(177a)

T2 =2(1 + ρ)

(κ− 1)√S(ρ)

(1−Aσ(ρ) +Bσ(ρ))(B′σ(ρ)−A′σ(ρ)). (177b)

We will bound each of these terms in turn.

102

Page 103: Kabir Aladin Chandrasekher

Bounding the term T1 (177a). An explicit computation yields

T1 =1√S(ρ)

8(κ− 2)

π2(κ− 1)ρ(1 + ρ)

2ρ2 + σ2 + σ2ρ2

(1 + ρ2)3

(i)

≤ 4

π

(1 + ρ)(2ρ2 + σ2(1 + ρ2))

(1 + ρ2)2√ρ2 + σ2(1 + ρ2)

, (178)

where step (i) follows by using the inequality (175). Now, define the function

a(ρ) :=(1 + ρ)(2ρ2 + σ2(1 + ρ2))

(1 + ρ2)2√ρ2 + σ2(1 + ρ2)

,

so that the inequality (178) is equivalent to the inequality

T1 ≤4

πa(ρ).

Then, note that

a′(ρ) =σ4(1− 2ρ6 − 3ρ5) + ρ4(3− 4ρ2 − 6σ2ρ2 − 6ρ− 9σ2ρ) + ρ4(1− 3σ4)

(1 + ρ2)3(ρ2 + σ2(1 + ρ2))3/2

+ρ3(2− 6σ2 − 6σ4) + ρ2(6σ2) + ρ(3σ2 − 3σ4)

(1 + ρ2)3(ρ2 + σ2(1 + ρ2))3/2≥ 0,

where the last inequality can be verified for all 0 ≤ ρ ≤ 1/4 and 0 ≤ σ ≤ 1/2. Thus, for all0 ≤ σ ≤ 0.5, a(ρ) is an increasing function of ρ on the interval [0, 0.25]. Note also that thefunction σ 7→ (aσ2 + 2d)/(c

√aσ2 + d) is increasing for σ ≥ 0. Combining these pieces implies

that

T1 ≤4

πa(ρ)

(i)

≤ 0.962, (179)

where step (i) evaluated a with ρ = 0.25 and σ = 0.5.

Bounding the term T2 (177b). First, note that 1 − Aσ(ρ) + Bσ(ρ) = Φ(ρ) and B′σ(ρ) −A′σ(ρ) = Φ′(ρ), so we have

T2 =2Φ(ρ)|Φ′(ρ)|(1 + ρ)√

S(ρ)(κ− 1)

(i)

≤ 2

(1 + 2σ3

)κ− 1

· |Φ′(ρ)|(1 + ρ)√

S(ρ)

(ii)

≤ 4

(1 + 2σ3

)κ− 1

2(2ρ2 + σ2(1 + ρ2))(1 + ρ)

π(1 + ρ2)(ρ2 + σ2(1 + ρ2))

≤20(

1 + 2σ3

)π(κ− 1)

, (180)

where step (i) follows from Lemma 10(a), step (ii) follows by computing Φ′(ρ) explicitly and

using the fact that the inequality (175) implies√S(ρ) ≥ ρBσ(ρ)

√κ−2κ−1 ≥

ρBσ(ρ)2 . Under the

assumption κ ≥ C, we thus see that

T2 ≤ 0.018. (181)

Combining the upper bound on T1 (179), the upper bound on T2 (181), and the decomposi-tion (176) yields

‖∇Gσ(α, β)‖1 ≤ 0.98

for all α ≥ 1/2 and ρ ≤ 1/4, as desired.

103

Page 104: Kabir Aladin Chandrasekher

D.7.3 Proof of part (c)

Let us establish the lemma for the map G; an identical argument also holds for the map g.Recall that for all (α, β) we have [G(α, β)]2 = [G0(α, β)]2 + σ2

κ−1 . Taking a gradient of both sidesof the equation with respect to (α, β) yields

2 · [G(α, β)] · [∇G(α, β)] = 2 · [G0(α, β)] · [∇G0(α, β)],

so that∇G(α, β) = G0(α,β)G(α,β) ·∇G0(α, β). Taking `1 norms on both sides and noting thatG(α, β) ≥

G0(α, β), we have ‖∇G(α, β)‖1 ≤ ‖∇G0(α, β)‖1, as desired.

D.7.4 Proof of part (d)

From equation (172) we see that for each (α, β) pair, the following relation holds:

[gσ(α, β)]2 = [Gσ(α, β)]2 +1

κ

(α− Fσ(α, β))2 + (β − ρBσ(ρ))2

.

Taking gradients and then `1 norms on both sides, we have

2[gσ(α, β)] · ‖∇gσ(α, β)‖1≤ 2[Gσ(α, β)] · ‖∇Gσ(α, β)‖1

+2

κ·(|α− Fσ(α, β)| · (1 + ‖∇Fσ(α, β)‖1) + |β − ρBσ(ρ)| · (1 + ‖∇α,βρBσ(ρ)‖1)

)Noting that gσ(α, β) ≥ Gσ(α, β) ∨ |α−Fσ(α,β)|√

κ∨ |β−ρBσ(ρ)|√

κ, we have

‖∇gσ(α, β)‖1 ≤ ‖∇Gσ(α, β)‖1 +1√κ

(1 + ‖∇Fσ(α, β)‖1 + 1 + ‖∇ρBσ(ρ)‖1)

≤ ‖∇Gσ(α, β)‖1 +1√κ

(3 + ‖∇Fσ(α, β)‖1) ,

where we have used the shorthand ∇ρBσ(ρ) ≡ ∇α,β[ρBσ(ρ)], and the final inequality holds forρ ≤ 1/4 and σ ≤ 1/2, since

‖∇ρBσ(ρ)‖1 = (1 + ρ) ·

(√ρ2 + σ2 + σ2ρ2

1 + ρ2+

ρ2(1 + σ2)

(1 + ρ2)√ρ2 + σ2 + σ2ρ2

− 2ρ2√ρ2 + σ2 + σ2ρ2

(1 + ρ2)2

)

=(1 + ρ) · 2ρ2

(1 + ρ2)2√ρ2 + σ2 + σ2ρ2

+(1 + ρ) · σ2

(1 + ρ2)√ρ2 + σ2 + σ2ρ2

(i)

≤ 2ρ+ σ ≤ 1,

where step (i) follows by using the simple bounds (1 + ρ) ≤ (1 + ρ2)2 and√ρ2 + σ2 + σ2ρ2 ≥ ρ

to upper bound the first term and√ρ2 + σ2 + σ2ρ2 ≥ σ

√1 + ρ2 to bound the second term.

The final inequality follows by using the fact that ρ ≤ 1/4 and σ ≤ 1/2.

D.8 Proof of Lemma 12

Let (αt, βt) := St(α0, β0). Owing to the G-faithfulness of S = (F ,G), we have (αt, βt) ∈ G forall t ≥ 0. Furthermore, by assumption, the sequence ζt = (αt, βt)t≥0 satisfies, for all t ≥ 0,

‖ζt+1 − S(αt, βt)‖∞ ≤ ∆.

We now claim that for each t ≥ 0, we have

|αt+1 − αt+1| ∨ |βt+1 − βt+1| ≤ ∆ + (1− τ) ·(|αt − αt| ∨ |βt − βt|

). (182)

104

Page 105: Kabir Aladin Chandrasekher

Note that this claim immediately yields the desired result, since applying it iteratively fort = 0, 1, . . . , k − 1 and using the fact that α0 = α0 and β0 = β0 yields

|αk − αk| ∨ |βk − βk| ≤k−1∑i=0

(1− τ)i ·∆ ≤ ∆

τ.

It remains to prove claim (182), and we do so by induction.

Base case: The case t = 0 is clearly true, since α0 = α0 and β0 = β0.

Induction step: Suppose that the claim is true for all t ≤ k − 1, so that

|αk − αk| ∨ |βk − βk| ≤k−1∑i=0

(1− τ)i ·∆ ≤ ∆

τ.

We must show that it holds for t = k. By triangle inequality, we have

|αk+1 − αk+1| ≤ |F (αk, βk)− F (αk, βk)|+ |F (αk, βk)− αk+1|.

Let αu := uαk + (1 − u)αk and define βu analogously. Let ζk = (αk, βk) and ζk = (αk, βk).Since F is a continuously differentiable function of its arguments, a first order Taylor expansionat ζk yields

|F (αk, βk)− F (αk, βk)| ≤ sup0≤u≤1

|〈∇F (αu, βu), ζk − ζk〉|

≤ sup0≤u≤1

‖∇F (αu, βu)‖1 · ‖ζk − ζk‖∞

≤ (1− τ) ·(|αk − αk| ∨ |βk − βk|

),

where the final inequality follows by using the fact that by the inductive hypothesis, |αk −αk| ∨ |βk − βk| ≤ ∆/τ , whence (αu, βu) ∈ B∆/τ (G) for each 0 ≤ u ≤ 1 in conjunction with

the assumption ‖∇F (α, β)‖1 ≤ 1 − τ for all (α, β) ∈ B∆/τ (G). At the same time, we haveby assumption that F (αk, βk) − αk+1 ≤ ∆. An identical argument holds for βk+1, and so thiscompletes the inductive step.

D.9 Proof of Lemma 13

Recall that α0/β0 ≥ 150√d

by assumption, and that

t0 = log1.05(50√d) + log55/54(10) + 2.

Also define the scalar t := log1.05(50√d) + 1, and given the iterates αt, βtt0t=0, define

T = inf

t∣∣∣ αtβt≥ 1/2

and T0 = inf

t∣∣∣ αtβt≥ 5

,

with the convention that each quantity is set to∞ if the condition is not met. Conditions (108a)and (108b) with c = 1/100 yield

max0≤t≤t0

|αt+1 − F (αt, βt)| ≤1

100√d

and max0≤t≤t0

|βt+1 −G(αt, βt)| ≤1

100.

(183)

We begin by noting that αt ≤ 3/2 for all 1 ≤ t ≤ t0. This is straightforward to establish:note that it follows directly from condition C1 and the bound (183) for t = 1, and from thatpoint onward, by induction over t, applying condition C4 and the bound (183). The crux of thelemma is the following claim.

105

Page 106: Kabir Aladin Chandrasekher

Claim 1. Under both settings (a) and (b) of the lemma, we have T(i)

≤ t and T0

(ii)

≤ t0 − 1.

Taking the claim as given for the moment, we note that it suffices to show that(αT0+1, βT0+1) ∈ G. Since T0 + 1 ≤ t0, we have

|αT0+1 − F (αT0 , βT0)| ∨ |βT0+1 −G(αT0 , βT0)| ≤ 1

100, (184)

and by definition of T0, we have αT0/βT0 ≥ 5. Condition C4 yields the bounds

0.56 ≤ F (αT0 , βT0) ≤ 1.04, andG(αT0 , βT0)

F (αT0 , βT0)≤ 1/6.

Combining the first bound with the perturbation bound (184) yields 0.55 ≤ αT0+1 ≤ 1.05.Moreover, we have

βT0+1

αT0+1≤ G(αT0 , βT0) + 0.01

F (αT0 , βT0)− 0.01≤ G(αT0 , βT0)

F (αT0 , βT0)+

0.01

0.55≤ 1/5.

Putting together the above two displays yields that (αT0+1, βT0+1) ∈ G, as desired. It remainsto prove Claim 1.

Proof of Claim 1(i): Note that by condition C1 and the initialization condition (guaranteedby both Ia and Ib) that α0/β0 ≥ (50

√d)−1, we have F (α0, β0) ≥ (50

√d)−1, so that

α1 ≥ (50√d)−1 − (100

√d)−1 = (100

√d)−1.

We prove momentarily that under both settings (a) and (b) of the lemma and for all 1 ≤ t ≤ T ,

αt ≥ (1.05)t−1 · α1 and βt ≤ 1. (185)

Now suppose for the sake of contradiction that T > t, where we recall that t := log1.05(50√d)+1.

Putting together the above two displays, we have

αt ≥50√d

100√d

= 1/2 and βt ≤ 1,

so that αt/βt ≥ 1/2. But this contradicts the fact that T > t and proves the theorem.We prove claim (185) under two distinct settings (a) and (b) of the lemma. Both proceed

via induction.Proof of claim (185), setting (a): To prove the base case t = 1, note that the α component

follows trivially. To handle β1, note that β1 ≤ G(α0, β0) + 0.01 ≤ 1, where the final inequalityfollows from condition C5a.

For the induction hypothesis, suppose that αt ≥ (1.05)t · α1 ≥ 1100√d, and βt ≤ 1. Since

t ≤ T , we have αt/βt ≤ 1/2 by definition. Then condition C2 and equation (108a) togetheryield

αt+1 ≥ F (αt, βt)−1

100√d≥ 1.06 · αt

βt− 1

100√d≥ 1.05αt ≥ (1.05)t · α1.

At the same time, condition C5a and equation (108a) together yield

βt+1 ≤ G(αt, βt) +1

100≤ 1.

106

Page 107: Kabir Aladin Chandrasekher

This completes the induction step.Proof of claim (185), setting (b): Once again, for the base case t = 1, note that the α component

follows trivially. To handle β1, note that β1 ≤ G(α0, β0) + 0.01 ≤ 1, where the final inequalityfollows from condition C5b and the initialization condition Ib.

For the induction hypothesis, suppose that αt ≥ (1.05)t · α1 ≥ 1100√d, and βt ≤ 1. Since

t ≤ T , we have αt/βt ≤ 1/2 by definition. Then condition C2 and equation (108a) togetheryield

αt+1 ≥ F (αt, βt)−1

100√d≥ 1.06 · αt

βt− 1

100√d≥ 1.05αt ≥ (1.05)t · α1.

We also have F (αt, βt) ≤ 1.04 from condition C4, so that αt+1 ≤ 1.05. Consequently, we mayapply condition C5b and equation (108a) together, to yield

βt+1 ≤ G(αt, βt) +1

100≤ 1.

This completes the induction step.

Proof of Claim 1(ii): Note that αt/βt ≥ 1/2 by part (a) of the claim. We will showmomentarily that for each t+ 1 ≤ t < T0, we have

αtβt≥(

55

54

)t−t−1

·αtβt

and 1/2 ≤ αt ≤ 3/2. (186)

Now suppose for the sake of contradiction that T0 > t0−1. Setting t = t0−1 in equation (186),we obtain αt

βt≥ 10 · 1

2 ≥ 5. But this contradicts the fact that T0 > t0 − 1.Proof of claim (186): To prove the base case t = t + 1, note that condition C2 ensures that

F (αt, βt) ≥ 0.56, so that

βt+1

αt+1≤G(αt, βt) + 0.01

F (αt, βt)− 0.01≤ 1/2,

where the last inequality uses condition C3. Furthermore, condition C4 yields F (αt, βt) ≤ 1.04,so that 1/2 ≤ αt+1 ≤ 3/2.

For the induction hypothesis, suppose that αtβt≥(

5554

)t−t · αtβt ≥ 1/2 and 1/2 ≤ αt ≤ 3/2.

Since t < T0, we also have αt/βt ≤ 5, and so condition C3 yields G(αt,βt)

F (αt,βt)≤ 7

8 ·βtαt

. Therefore,

βt+1

αt+1≤ G(αt, βt) + 0.01

F (αt, βt)− 0.01≤ 56

55· 7

8· βtαt

+0.01

0.55≤(

49

55+

1

11

)· βtαt,

where in the last step, we have used the fact that βtαt≥ 1/5. At the same time, conditions C2

and C4 yield 0.56 ≤ F (αt, βt) ≤ 1.04, so that 1/2 ≤ αt+1 ≤ 3/2. This completes the inductivestep.

E Some elementary lemmas

In this section, we collect a few elementary lemmas that are used multiple times in the proof.

Lemma 22. Let Tn(θ) denote the empirical operator corresponding to the higher order up-dates (9) and suppose that the weight function ω satisfies Assumption 1 with parameter K1.There exist universal, positive constants c and C such that

P‖Tn(θ)‖2 ≤ C√K1 ≤ 2e−cn.

107

Page 108: Kabir Aladin Chandrasekher

Proof. We write explicitly

‖Tn(θ)‖2 = ‖(X>X)−1X>ω(Xθ,y)‖2 ≤ ‖(X>X)−1‖op‖X‖op‖ω(Xθ,y)‖2,

where the final inequality is due to the sub-multiplicativity of the operator norm. Now, weapply Vershynin (2018, Theorem 4.6.1) to obtain the probabilistic inequality

Pc√n ≤ ‖X‖op ≤ C

√n ≥ 1− 2e−c

′n,

which in turn implies that with probability at least 1− 2e−c′n,

‖(X>X)−1‖op‖X‖op ≤C√n.

Now, since ω satisfies Assumption 1, the random vector ω(Xθ,y) contains independent sub-Gaussian coordinates whence we apply Vershynin (2018, Theorem 3.1.1) to obtain the proba-bilistic inequality

P‖ω(Xθ,y)‖2 ≤ C√K1n ≥ 1− 2e−cn.

Putting the pieces together, we obtain the result.

Lemma 23. Let n be a positive integer and suppose that, for all positive integers q ≤ n/64, therandom variable Zn satisfies the inequality

‖Zn − EZn‖q ≤Aqζ√n, (187)

for constants A > 0 and ζ > 1. Further, suppose that there are positive constants c↓ and C↓such that P|Zn| ≥ C↓ ≤ e−c↓n. Then, there exist universal positive constants c′ and C ′ suchthat for all t ≥ 0,

Pr|Zn − EZn| ≥ t ≤ C ′ exp−c′( t√nA

)1/ζ+ e−c↓n.

Proof. We employ a truncation argument. For some integer M > 0 to be specified later, weintroduce the notation

Z = Zn − EZn, Z↓ = Z1|Z| ≤M, and Z↑ = Z1|Z| > M,

noting thatZ = Z↓ + Z↑.

Consequently, an application of the union bound yields the inequality

Pr|Z| ≥ t ≤ Pr|Z↓| ≥ t

2

+ Pr

|Z↑| ≥ t

2

. (188)

We control each of these terms in turn, beginning with the lower truncation. First, for anyλ > 0, applying Markov’s inequality yields

Pr|Z↓| ≥ t

2

≤ exp

−(λt/2)1/ζ

Ee(λ|Z↓|)1/ζ. (189)

Note that the expectation on the RHS of the display above exists since Z↓ is bounded. Thus,the Taylor expansion of x 7→ ex gives

Ee(λ(Z↓)1/ζ

= 1 +

∞∑`=1

λ`/ζE|Z↓|`/ζ

`!= 1 +

n/64∑`=1

λ`/ζE|Z↓|`/ζ

`!+

∞∑`=n/64+1

λ`/ζE|Z↓|`/ζ

`!

108

Page 109: Kabir Aladin Chandrasekher

(i)

≤ 1 +

n/64∑`=1

λ`/ζ(E|Z↓|`

)1/ζ`!

+∞∑

`=n/64+1

λ`/ζM `/ζ

`!

(ii)

≤ 1 +

n/64∑`=1

(Aλ/√n)`/ζ``

`!+

∞∑`=n/64+1

λ`/ζM `/ζ

`!, (190)

where step (i) follows by applying Jensen’s inequality to each summand of the first sum sincethe map x 7→ x1/ζ is concave on R≥0 and by noting that |Z↓| ≤ M pointwise. Step (ii) followsby noting that ‖Z↓‖q ≤ ‖Z‖q and using the assumption (187). Now when ` ≥ n/64 andM ≤ CAn(2ζ−1)/2,

M `/ζ ≤( A√

n

)`/ζ``.

Substituting the above inequality into the bound on the MGF (190), we obtain

Ee(λ(Z↓)1/ζ

≤ 1 +

∞∑`=1

(Aλ/√n)`/ζ``

`!

(i)

≤ 1 +

∞∑`=1

(CAλ√n

)`/ζ= 1 +

(CAλ/√n)1/ζ

1− (CAλ/√n)1/ζ

(ii)

≤ exp

2(CAλ√

n

)1/ζ, (191)

where step (i) follows by the Stirling inequality `! ≥ ``/e` and step (ii) follows for allλ ≤ c/A · (1/2)ζ

√n by the elementary inequality 1 + x ≤ ex. Summarizing, we see that for

M ≤ CAn(2ζ−1)/2 and λ ≤ c/A ·√n, plugging the inequality (191) into the inequality (189)

yields

Pr|Z↓| ≥ t

2

≤ exp

−(λt/2)1/ζ + 2

(CAλ√n

)1/ζ.

Taking λ as large as possible, we obtain

Pr|Z↓| ≥ t

2

≤ C exp−(c/A · t

√n)1/ζ, (192)

where we emphasize that the constants c and C changed from line to line. We turn now tobounding the upper truncation. We have

Pr|Z↑| ≥ t

2

= Pr

|Z↑| ≥ t

2, |Z| > M

+ Pr

|Z↑| ≥ t

2, |Z| ≤M

(i)

≤ Pr|Z| > M(ii)

≤ e−c↓n,

(193)

where step (i) follows since by assumption t > 0, so the second term has zero-probability and step(ii) follows by assumption as long as M > C↓. We conclude by letting M take any value betweenC↓ and M ≤ CAn(2ζ−1)/2 and then substituting the tail bound on the lower truncation (192)and the tail bound on the upper truncation (193) into the decomposition (188).

Lemma 24. Let u denote a random vector sampled uniformly from the unit sphere Sd−1.Suppose (xi, yi)

ni=1 are drawn i.i.d. from either the model (3) or (5), with θ∗ 6= 0 denoting an

arbitrary vector (not necessarily unit norm). For any θ ∈ Rd, let α(θ) = 〈θ,θ∗〉‖θ∗‖22

and β(θ) =

‖P⊥θ∗θ‖2. Then the following statements are true.(a) If θ0 = λu for an arbitrary positive scalar λ, then

P|α(θ0)|β(θ0)

≤ δ√d− 1

≤ δ + exp

(−d− 1

32

)(b) If θ0 =

√1n

∑ni=1 y

2i · u, then

Pα2(θ0) + β2(θ0) ≥ (‖θ∗‖2 + σ)2 + C log(1/δ)

≤ δ

for an absolute constant C > 0.

109

Page 110: Kabir Aladin Chandrasekher

Proof. Let ‖θ∗‖2 = λ∗, noting that we may assume due to rotation invariance of u thatθ∗ = λ∗ · e1. Furthermore, we may write u = z/‖z‖2 for a random vector z = (z1, . . . , zd) ∼N (0, I). Thus, we have |α(θ0)|

β(θ0) = |z1|/‖z\1‖2, where z\1 = (z2, . . . , zd). Part (a) then followsfrom the tail bounds

P|z1| ≤ t1 ≤√

2

π· t1 and P‖z\1‖2 ≥

√d− 1 + t2 ≤ e−t

22/2 for each t1, t2 ≥ 0.

In particular, setting t1 =√π/2 · δ and t2 =

√d−14 and applying a union bound proves part (a).

Next, note that α2(θ0) + β2(θ0) = ‖θ0‖22 = 1n

∑ni=1 y

2i . Furthermore, in both models of

interest, we have |yi|(d)

≤ yi := λ∗|z1,i| + σ|z2,i|, where z1,i and z2,i are independent Gaussians

and X(d)

≤ Y denotes that the random variable X is stochastically dominated by Y . Furthermore,we have E[y2

i ] ≤ (λ∗ + σ)2, and also

P

1

n

n∑i=1

y2i − E[y2

i ] ≥ t

≤ e−ct,

since y2i is a subexponential random variable. Choosing t = c−1 · log(1/δ) completes the proof

of part (b).

Lemma 25. (a) For all state evolution elements ζ = (α, β) with α ≥ 0, we have

d`2(ζ) ≥ β√α2 + β2

= sin(d∠(ζ)).

(b) For all state evolution elements ζ = (α, β) with 1/2 ≤ α ≤ 3/2 and ρ ≥ 1/5, we have

d`2(ζ) ≤ 8ρ ≤ 8 tan(d∠(ζ)).

Proof. To prove part (a), note that θ∗ corresponds to the state evolution element (α∗, β∗) =(1, 0). All points ζ = (α, β) such that α ≥ 0 and d∠(ζ) = ∠((α∗, β∗), (α, β)) = φ form a line inR2. The point on this line with smallest d`2 is the projection of (1, 0) onto this line. The lengthof this projection is, by definition, equal to sinφ.

To prove part (b), note that

d`2(ζ) =√

(1− α)2 + β2 = α ·

√(α−1 − 1)2 +

α

)2

≤ 3

√1 +

α

)2

≤ 3

2·√

26 · βα,

where the final inequality follows since β/α ≥ 1/5.

Lemma 26. Let a, b, x denote non-negative scalars.(a) For all a < 1, we have

tan−1(ax+ b) ≥ a tan−1 x+ b− b3

3(1− a)2.

(b) If a ≤ 1 and x ≤ 1/5, we have

tan−1(ax+ b) ≤ 51

50· a tan−1 x+ b.

110

Page 111: Kabir Aladin Chandrasekher

Proof. To prove part (a), note from the concavity of the tan−1 function on the positive realsthat

tan−1(ax+ b) ≥ a tan−1 x+ (1− a) tan−1

(b

1− a

).

Using the inequality tan−1 x ≥ x − x3

3 completes the proof. To prove part (b), first note fromthe subadditivity of the tan−1 function on the positive reals that

tan−1(ax+ b) ≤ tan−1(ax) + tan−1(b) ≤ tan−1(ax) + b.

Next, let h(a, x) = tan−1(ax)a tan−1(x)

, and note that for each fixed a ∈ [0, 1], h is non-decreasing in

x ∈ [0,∞). Similarly, for each fixed x ≥ 0, h is non-increasing in a ∈ [0, 1]. The proof iscompleted by noting that lima↓0 h(a, 1/5) = [5 tan−1(1/5)]−1 < 1.02.

111