IAS/Park City Mathematics Series
Volume 00, Pages 000–000
S 1079-5634(XX)0000-0
Optimization Algorithms for Data Analysis

Stephen J. Wright
Abstract. We describe the fundamentals of algorithms for minimizing a smooth nonlinear function, and extensions of these methods to the sum of a smooth function and a convex nonsmooth function. Such objective functions are ubiquitous in data analysis applications, as we illustrate using several examples. We discuss methods that make use of gradient (first-order) information about the smooth part of the function, and also Newton methods that make use of Hessian (second-order) information. Convergence and complexity theory is outlined for each approach.
1. Introduction
In this article, we consider algorithms for solving smooth optimization problems, possibly with simple constraints or structured nonsmooth regularizers. One such canonical formulation is

(1.0.1)    min_{x ∈ R^n} f(x),

where f : R^n → R has at least Lipschitz continuous gradients. Additional assumptions about f, such as convexity and Lipschitz continuity of the Hessian, are introduced as needed. Another formulation we consider is

(1.0.2)    min_{x ∈ R^n} f(x) + λψ(x),

where f is as in (1.0.1), ψ : R^n → R is a function that is usually convex and usually nonsmooth, and λ > 0 is a regularization parameter.¹ We refer to (1.0.2) as a regularized minimization problem because the presence of the term involving ψ induces certain structural properties on the solution that make it more desirable or plausible in the context of the application. We describe iterative algorithms that generate a sequence {x^k}_{k=0,1,2,...} of points that, in the case of convex objective functions, converges to the set of solutions. (Some algorithms also generate other "auxiliary" sequences of iterates.)
2010 Mathematics Subject Classification. Primary 14Dxx; Secondary 14Dxx.
Key words and phrases. Park City Mathematics Institute.
¹ A set S is said to be convex if for any pair of points z′, z′′ ∈ S, we have αz′ + (1 − α)z′′ ∈ S for all α ∈ [0, 1]. A function φ : R^n → R is said to be convex if φ(αz′ + (1 − α)z′′) ≤ αφ(z′) + (1 − α)φ(z′′) for all z′, z′′ in the (convex) domain of φ and all α ∈ [0, 1].
We start in Section 2 by describing some canonical problems in data analysis and their formulation as optimization problems. After some preliminaries in Section 3, we describe in Section 4 algorithms that take steps based on the gradients ∇f(x^k). Extensions of these methods to the case (1.0.2) of regularized objectives are described in Section 5. Section 6 describes accelerated gradient methods, which achieve better worst-case complexity than basic gradient methods, while still only using first-derivative information. We discuss Newton's method in Section 7, outlining variants that can guarantee convergence to points that approximately satisfy second-order conditions for a local minimizer of a smooth nonconvex function.
Although we allow nonsmoothness in the regularization term in (1.0.2), we do not cover subgradient methods or mirror descent explicitly in this chapter. We also do not discuss stochastic gradient methods, a class of methods that is central to modern machine learning. All these topics are discussed in the contribution of J. Duchi to the current volume; see [20]. Other omissions include the following.
• Coordinate descent methods; see [43] for a review.
• Augmented Lagrangian methods, including the alternating direction method of multipliers (ADMM) [21]. The review [5] remains a good reference for the latter topic, especially as it applies to problems from data analysis.
• Semidefinite programming and conic optimization.
• Methods tailored specifically to linear or quadratic programming, such as the simplex method or interior-point methods (see [42] for a discussion of the latter).
• Quasi-Newton methods, which modify Newton's method by approximating the Hessian or its inverse, thus attaining attractive theoretical and practical performance without using any second-derivative information. For a discussion of these methods, see [34, Chapter 6]. One important method of this class, which is useful in data analysis and many other large-scale problems, is the limited-memory method L-BFGS [28]; see also [34, Section 7.2].

Our approach throughout is to give a concise description of some of the most important algorithmic tools for nonlinear smooth and regularized optimization, along with the basic convergence theory for each. In most cases, the theory is elementary enough to include here in its entirety. In the few remaining cases, we provide citations to works in which complete proofs can be found.
2. Optimization Formulations of Data Analysis Problems
In this section, we describe briefly some representative problems in data analysis and machine learning, emphasizing their formulation as optimization problems. Our list is by no means exhaustive. In many cases, there are a number of different ways to formulate a given application as an optimization problem. We do not try to describe all of them. But our list here gives a flavor of the interface between data analysis and optimization.
2.1. Setup Practical data sets are often extremely messy. Data may be mislabeled, noisy, incomplete, or otherwise corrupted. Much of the hard work in data analysis is done by professionals, familiar with the underlying applications, who "clean" the data and prepare it for analysis, while being careful not to change the essential properties that they wish to discern from the analysis. Dasu and Johnson [18] claim that "80% of data analysis is spent on the process of cleaning and preparing the data." We do not discuss this aspect of the process, focusing instead on the part of the data analysis pipeline in which the problem is formulated and solved.
The data set in a typical analysis problem consists of m objects:

(2.1.1)    D := {(a_j, y_j), j = 1, 2, . . . , m},

where a_j is a vector (or matrix) of features and y_j is an observation or label. (Each pair (a_j, y_j) has the same size and shape for all j = 1, 2, . . . , m.) The analysis task then consists of discovering a function φ such that φ(a_j) ≈ y_j for most j = 1, 2, . . . , m. The process of discovering the mapping φ is often called "learning" or "training."

The function φ is often defined in terms of a vector or matrix of parameters, which we denote by x or X. (Other notation also appears below.) With these parametrizations, the problem of identifying φ becomes a data-fitting problem: "Find the parameters x defining φ such that φ(a_j) ≈ y_j, j = 1, 2, . . . , m, in some optimal sense." Once we come up with a definition of the term "optimal," we have an optimization problem. Many such optimization formulations have objective functions of the "summation" type

(2.1.2)    L_D(x) := Σ_{j=1}^{m} ℓ(a_j, y_j; x),

where the jth term ℓ(a_j, y_j; x) is a measure of the mismatch between φ(a_j) and y_j, and x is the vector of parameters that determines φ.

One use of φ is to make predictions about future data items. Given another previously unseen item of data a of the same type as a_j, j = 1, 2, . . . , m, we predict that the label y associated with a would be φ(a). The mapping may also expose other structure and properties in the data set. For example, it may reveal that only a small fraction of the features in a_j are needed to reliably predict the label y_j. (This is known as feature selection.) The function φ or its parameter x may also reveal important structure in the data. For example, X could reveal a low-dimensional subspace that contains most of the a_j, or X could reveal a matrix with particular structure (low-rank, sparse) such that observations of X prompted by the feature vectors a_j yield results close to y_j.

Examples of labels y_j include the following.
• A real number, leading to a regression problem.
• A label, say y_j ∈ {1, 2, . . . , M}, indicating that a_j belongs to one of M classes. This is a classification problem. We have M = 2 for binary classification and M > 2 for multiclass classification.
• Null. Some problems only have feature vectors a_j and no labels. In this case, the data analysis task may consist of grouping the a_j into clusters (where the vectors within each cluster are deemed to be functionally similar), or identifying a low-dimensional subspace (or a collection of low-dimensional subspaces) that approximately contains the a_j. Such problems require the labels y_j to be learnt alongside the function φ. For example, in a clustering problem, y_j could represent the cluster to which a_j is assigned.
Even after cleaning and preparation, the setup above may contain many complications that need to be dealt with in formulating the problem in rigorous mathematical terms. The quantities (a_j, y_j) may contain noise, or may be otherwise corrupted. We would like the mapping φ to be robust to such errors. There may be missing data: parts of the vectors a_j may be missing, or we may not know all the labels y_j. The data may be arriving in streaming fashion rather than being available all at once. In this case, we would learn φ in an online fashion.

One particular consideration is that we wish to avoid overfitting the model to the data set D in (2.1.1). The particular data set D available to us can often be thought of as a finite sample drawn from some underlying larger (often infinite) collection of data, and we wish the function φ to perform well on the unobserved data points as well as the observed subset D. In other words, we want φ to be not too sensitive to the particular sample D that is used to define empirical objective functions such as (2.1.2). The optimization formulation can be modified in various ways to achieve this goal, by the inclusion of constraints or penalty terms that limit some measure of "complexity" of the function (such techniques are called generalization or regularization). Another approach is to terminate the optimization algorithm early, the rationale being that overfitting occurs mainly in the later stages of the optimization process.
2.2. Least Squares Probably the oldest and best-known data analysis problem is linear least squares. Here, the data points (a_j, y_j) lie in R^n × R, and we solve

(2.2.1)    min_x (1/2m) Σ_{j=1}^{m} (a_j^T x − y_j)^2 = (1/2m)‖Ax − y‖_2^2,

where A is the matrix whose rows are a_j^T, j = 1, 2, . . . , m, and y = (y_1, y_2, . . . , y_m)^T. In the terminology above, the function φ is defined by φ(a) := a^T x. (We could also introduce a nonzero intercept by adding an extra parameter β ∈ R and defining φ(a) := a^T x + β.) This formulation can be motivated statistically, as a maximum-likelihood estimate of x when the observations y_j are exact but for i.i.d.
Gaussian noise. Various modifications of (2.2.1) impose desirable structure on x and hence on φ. For example, Tikhonov regularization with a squared ℓ_2-norm, which is

min_x (1/2m)‖Ax − y‖_2^2 + λ‖x‖_2^2,    for some parameter λ > 0,

yields a solution x with less sensitivity to perturbations in the data (a_j, y_j). The LASSO formulation

(2.2.2)    min_x (1/2m)‖Ax − y‖_2^2 + λ‖x‖_1

tends to yield solutions x that are sparse, that is, containing relatively few nonzero components [40]. This formulation performs feature selection: The locations of the nonzero components in x reveal those components of a_j that are instrumental in determining the observation y_j. Besides its statistical appeal — predictors that depend on few features are potentially simpler and more comprehensible than those depending on many features — feature selection has practical appeal in making predictions about future data. Rather than gathering all components of a new data vector a, we need to find only the "selected" features, since only these are needed to make a prediction.

The LASSO formulation (2.2.2) is an important prototype for many problems in data analysis, in that it involves a regularization term λ‖x‖_1 that is nonsmooth and convex, but with relatively simple structure that can potentially be exploited by algorithms.
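One way an algorithm can exploit that simple structure is through the proximal operator of λ‖x‖_1, which is componentwise soft-thresholding. As a concrete illustration, here is a minimal proximal-gradient (ISTA) sketch in numpy; the function names and the fixed step size 1/L are our illustrative choices, not prescriptions from the text.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau*||.||_1: shrink each component toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, y, lam, iters=1000):
    """Minimize (1/2m)||Ax - y||_2^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2 / m   # Lipschitz constant of the smooth gradient
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / m    # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

Larger λ drives more components of the returned x exactly to zero, which is the feature-selection effect described above.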
2.3. Matrix Completion Matrix completion is in one sense a natural extension of least squares to problems in which the data a_j are naturally represented as matrices rather than vectors. Changing notation slightly, we suppose that each A_j is an n × p matrix, and we seek another n × p matrix X that solves

(2.3.1)    min_X (1/2m) Σ_{j=1}^{m} (〈A_j, X〉 − y_j)^2,

where 〈A, B〉 := trace(A^T B). Here we can think of the A_j as "probing" the unknown matrix X. Commonly considered types of observations are random linear combinations (where the elements of A_j are selected i.i.d. from some distribution) or single-element observations (in which each A_j has 1 in a single location and zeros elsewhere). A regularized version of (2.3.1), leading to solutions X that are low-rank, is

(2.3.2)    min_X (1/2m) Σ_{j=1}^{m} (〈A_j, X〉 − y_j)^2 + λ‖X‖_*,

where ‖X‖_* is the nuclear norm, which is the sum of singular values of X [37]. The nuclear norm plays a role analogous to the ℓ_1 norm in (2.2.2). Although the nuclear norm is a somewhat complex nonsmooth function, it is at least convex, so that the formulation (2.3.2) is also convex. This formulation can be shown to yield a statistically valid solution when the true X is low-rank and the observation matrices A_j satisfy a "restricted isometry" property, commonly satisfied by random matrices, but not by matrices with just one nonzero element. The formulation is also valid in a different context, in which the true X is incoherent (roughly speaking, it does not have a few elements that are much larger than the others), and the observations A_j are of single elements [9].

In another form of regularization, the matrix X is represented explicitly as a product of two "thin" matrices L and R, where L ∈ R^{n×r} and R ∈ R^{p×r}, with r ≪ min(n, p). We set X = LR^T in (2.3.1) and solve

(2.3.3)    min_{L,R} (1/2m) Σ_{j=1}^{m} (〈A_j, LR^T〉 − y_j)^2.

In this formulation, the rank r is "hard-wired" into the definition of X, so there is no need to include a regularizing term. This formulation is also typically much more compact than (2.3.2); the total number of elements in (L, R) is (n + p)r, which is much less than np. A disadvantage is that it is nonconvex. An active line of current research, pioneered in [8] and also drawing on statistical sources, shows that the nonconvexity is benign in many situations, and that under certain assumptions on the data (A_j, y_j), j = 1, 2, . . . , m, and careful choice of algorithmic strategy, good solutions can be obtained from the formulation (2.3.3). A clue to this good behavior is that although this formulation is nonconvex, it is in some sense an approximation to a tractable problem: If we have a complete observation of X, then a rank-r approximation can be found by performing a singular value decomposition of X, and defining L and R in terms of the r leading left and right singular vectors.
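That closing remark can be made concrete: given a fully observed X, rank-r factors come directly from a truncated SVD. A numpy sketch (splitting the singular values symmetrically between L and R is one arbitrary choice; the function name is ours):

```python
import numpy as np

def rank_r_factors(X, r):
    """Rank-r factors L, R with X ~ L @ R.T, from the leading r singular triples."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    L = U[:, :r] * np.sqrt(s[:r])    # n x r: scaled leading left singular vectors
    R = Vt[:r, :].T * np.sqrt(s[:r]) # p x r: scaled leading right singular vectors
    return L, R
```

By the Eckart–Young theorem, L @ R.T is the best rank-r approximation of X in the Frobenius norm.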
2.4. Nonnegative Matrix Factorization Some applications in computer vision, chemometrics, and document clustering require us to find factors L and R like those in (2.3.3) in which all elements are nonnegative. If the full matrix Y ∈ R^{n×p} is observed, this problem has the form

min_{L,R} ‖LR^T − Y‖_F^2,    subject to L ≥ 0, R ≥ 0.
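The text leaves the algorithm unspecified; one classical heuristic for this problem is the multiplicative-update scheme of Lee and Seung, which preserves nonnegativity by construction. A numpy sketch (the function name, random initialization, and eps safeguard against division by zero are our illustrative choices):

```python
import numpy as np

def nmf(Y, r, iters=2000, eps=1e-9):
    """Approximate nonnegative Y by L @ R.T with L, R >= 0 (multiplicative updates)."""
    n, p = Y.shape
    rng = np.random.default_rng(0)
    L = rng.random((n, r)) + eps
    R = rng.random((p, r)) + eps
    for _ in range(iters):
        # Elementwise updates: ratios of nonnegative quantities keep L, R >= 0.
        L *= (Y @ R) / (L @ (R.T @ R) + eps)
        R *= (Y.T @ L) / (R @ (L.T @ L) + eps)
    return L, R
```

Each update never increases ‖LR^T − Y‖_F^2, though convergence to a global minimizer is not guaranteed, since the problem is nonconvex.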
2.5. Sparse Inverse Covariance Estimation In this problem, the labels y_j are null, and the vectors a_j ∈ R^n are viewed as independent observations of a random vector A ∈ R^n, which has zero mean. The sample covariance matrix constructed from these observations is

S = (1/(m − 1)) Σ_{j=1}^{m} a_j a_j^T.

The element S_il is an estimate of the covariance between the ith and lth elements of the random vector A. Our interest is in calculating an estimate X of the inverse covariance matrix that is sparse. The structure of X yields important information about A. In particular, if X_il = 0, we can conclude that the ith and lth components of A are conditionally independent. (That is, they are independent given knowledge of the values of the other n − 2 components of A.) Stated another way, the nonzero locations in X indicate the arcs in the dependency graph whose nodes correspond to the n components of A.

One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix X is the following:

(2.5.1)    min_{X ∈ SR^{n×n}, X ≻ 0}  〈S, X〉 − log det(X) + λ‖X‖_1,

where SR^{n×n} is the set of n × n symmetric matrices, X ≻ 0 indicates that X is positive definite, and ‖X‖_1 := Σ_{i,l=1}^{n} |X_il| (see [16, 23]).
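A sketch of the ingredients of (2.5.1) in numpy, assuming the observations a_j are stored as the rows of an array and have zero mean as stated; the function names are ours:

```python
import numpy as np

def sample_covariance(A_obs):
    """Sample covariance S = (1/(m-1)) * sum_j a_j a_j^T for zero-mean rows a_j."""
    m = A_obs.shape[0]
    return A_obs.T @ A_obs / (m - 1)

def graphical_lasso_objective(S, X, lam):
    """Objective of (2.5.1): <S, X> - log det(X) + lam * ||X||_1, for X positive definite."""
    sign, logdet = np.linalg.slogdet(X)  # slogdet avoids overflow in det for large n
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()
```

Minimizing this convex objective over positive definite X (e.g., by the proximal methods of Section 5) yields the sparse inverse covariance estimate.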
2.6. Sparse Principal Components The setup for this problem is similar to the previous section, in that we have a sample covariance matrix S that is estimated from a number of observations of some underlying random vector. The principal components of this matrix are the eigenvectors corresponding to the largest eigenvalues. It is often of interest to find sparse principal components, approximations to the leading eigenvectors that also contain few nonzeros. An explicit optimization formulation of this problem is

(2.6.1)    max_{v ∈ R^n} v^T S v   s.t. ‖v‖_2 = 1, ‖v‖_0 ≤ k,

where ‖·‖_0 indicates the cardinality of v (that is, the number of nonzeros in v) and k is a user-defined parameter indicating a bound on the cardinality of v. The problem (2.6.1) can be formulated as a quadratic program with binary variables, but such intractable formulations are usually avoided. The following relaxation (due to [17]) replaces vv^T by a positive semidefinite proxy M ∈ SR^{n×n}:

(2.6.2)    max_{M ∈ SR^{n×n}} 〈S, M〉   s.t. M ⪰ 0, 〈I, M〉 = 1, ‖M‖_1 ≤ R,

for some parameter R > 0 that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact, a semidefinite programming problem.

This formulation technique can be generalized to find the leading r > 1 sparse principal components. Ideally, we would obtain these from a matrix V ∈ R^{n×r} whose columns are mutually orthogonal and have at most k nonzeros each. We can write a convex relaxation of this problem as

(2.6.3)    max_{M ∈ SR^{n×n}} 〈S, M〉   s.t. 0 ⪯ M ⪯ I, 〈I, M〉 = 1, ‖M‖_1 ≤ R,

which is again a semidefinite program. A more compact (but nonconvex) formulation is

max_{F ∈ R^{n×r}} 〈S, FF^T〉   s.t. ‖F‖_2 ≤ 1, ‖F‖_{2,1} ≤ R,

where ‖F‖_{2,1} := Σ_{i=1}^{n} ‖F_{i·}‖_2 [14]. The latter regularization term is often called a "group-sparse" or "group-LASSO" regularizer. (An early use of this type of regularizer was described in [41].)
2.7. Sparse Plus Low-Rank Matrix Decomposition Another useful paradigm is to decompose a partly or fully observed n × p matrix Y into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully-observed problem is

min_{M,S} ‖M‖_* + λ‖S‖_1   s.t. Y = M + S,

where ‖S‖_1 := Σ_{i=1}^{n} Σ_{j=1}^{p} |S_ij| [10, 13]. Compact, nonconvex formulations that allow noise in the observations include the following:

min_{L,R,S} (1/2)‖LR^T + S − Y‖_F^2    (fully observed),
min_{L,R,S} (1/2)‖P_Φ(LR^T + S − Y)‖_F^2    (partially observed),

where Φ represents the locations of the observed entries of Y and P_Φ is projection onto this set [14, 44].

One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations. Another application is to foreground-background separation in video processing. Here, each column of Y represents the pixels in one frame of video, whereas each row of Y shows the evolution of one pixel over time.
2.8. Subspace Identification In this application, the a_j ∈ R^n, j = 1, 2, . . . , m, are vectors that lie (approximately) in a low-dimensional subspace. The aim is to identify this subspace, expressed as the column subspace of a matrix X ∈ R^{n×r}. If the a_j are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the n × m matrix A = [a_j]_{j=1}^{m}, and take X to be the leading r left singular vectors. In interesting variants of this problem, however, the vectors a_j may be arriving in streaming fashion and may be only partly observed, for example in indices Φ_j ⊂ {1, 2, . . . , n}. We would thus need to identify a matrix X and vectors s_j ∈ R^r such that

P_{Φ_j}(a_j − X s_j) ≈ 0,   j = 1, 2, . . . , m.

The algorithm for identifying X, described in [1], is a manifold-projection scheme that takes steps in incremental fashion for each a_j in turn. Its validity relies on incoherence of the matrix X with respect to the principal axes, that is, the matrix X should not have a few elements that are much larger than the others. A local convergence analysis of this method is given in [2].
2.9. Support Vector Machines Classification via support vector machines (SVM) is a classical paradigm in machine learning. This problem takes as input data (a_j, y_j) with a_j ∈ R^n and y_j ∈ {−1, 1}, and seeks a vector x ∈ R^n and a scalar β ∈ R such that

(2.9.1a)    a_j^T x − β ≥ 1   when y_j = +1;
(2.9.1b)    a_j^T x − β ≤ −1  when y_j = −1.
Any pair (x, β) that satisfies these conditions defines a separating hyperplane in R^n, one that separates the "positive" cases {a_j | y_j = +1} from the "negative" cases {a_j | y_j = −1}. (In the language of Section 2.1, we could define the function φ as φ(a_j) = sign(a_j^T x − β).) Among all separating hyperplanes, the one that minimizes ‖x‖_2 is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point a_j of either class is greatest.

We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form (2.1.2):

(2.9.2)    H(x, β) = (1/m) Σ_{j=1}^{m} max(1 − y_j(a_j^T x − β), 0).

Note that the jth term in this summation is zero if the conditions (2.9.1) are satisfied, and positive otherwise. Even if no pair (x, β) exists for which H(x, β) = 0, a value (x, β) that minimizes (2.1.2) will be the one that comes as close as possible to satisfying (2.9.1), in some sense. A term λ‖x‖_2^2 (for some parameter λ > 0) is often added to (2.9.2), yielding the following regularized version:

(2.9.3)    H(x, β) = (1/m) Σ_{j=1}^{m} max(1 − y_j(a_j^T x − β), 0) + (1/2)λ‖x‖_2^2.
If λ is sufficiently small, and if separating hyperplanes exist, the pair (x, β) that minimizes (2.9.3) is the maximum-margin separating hyperplane. The maximum-margin property is consistent with the goals of generalizability and robustness. For example, if the observed data (a_j, y_j) is drawn from an underlying "cloud" of positive and negative cases, the maximum-margin solution usually does a reasonable job of separating other empirical data samples drawn from the same clouds, whereas a hyperplane that passes close by several of the observed data points may not do as well (see Figure 2.9.4).
Figure 2.9.4. Linear support vector machine classification, with the one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
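The regularized hinge-loss objective (2.9.3) is cheap to evaluate. A numpy sketch, in which the rows of A are the vectors a_j^T and the function name is ours:

```python
import numpy as np

def svm_objective(x, beta, A, y, lam):
    """H(x, beta) from (2.9.3): mean hinge loss plus (lam/2)*||x||_2^2."""
    margins = y * (A @ x - beta)                       # y_j (a_j^T x - beta)
    hinge = np.maximum(1.0 - margins, 0.0)             # zero iff (2.9.1) holds for j
    return hinge.mean() + 0.5 * lam * np.dot(x, x)
```

A point (x, β) satisfying (2.9.1) for every j gives a zero hinge term, matching the remark after (2.9.2).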
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. One solution is to transform all of the raw data vectors a_j by a mapping ψ, then perform the support-vector-machine classification on the vectors ψ(a_j), j = 1, 2, . . . , m. The conditions (2.9.1) would thus be replaced by

(2.9.5a)    ψ(a_j)^T x − β ≥ 1   when y_j = +1;
(2.9.5b)    ψ(a_j)^T x − β ≤ −1  when y_j = −1,

leading to the following analog of (2.9.3):

(2.9.6)    H(x, β) = (1/m) Σ_{j=1}^{m} max(1 − y_j(ψ(a_j)^T x − β), 0) + (1/2)λ‖x‖_2^2.

When transformed back to R^n, the surface {a | ψ(a)^T x − β = 0} is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from (2.9.3).

By introducing artificial variables, the problem (2.9.6) (and (2.9.3)) can be formulated as a convex quadratic program, that is, a problem with a convex quadratic objective and linear constraints. By taking the dual of this problem, we obtain another convex quadratic program, in m variables:

(2.9.7)    min_{α ∈ R^m} (1/2)α^T Q α − 1^T α   subject to 0 ≤ α ≤ (1/λ)1, y^T α = 0,

where

Q_kl = y_k y_l ψ(a_k)^T ψ(a_l),   y = (y_1, y_2, . . . , y_m)^T,   1 = (1, 1, . . . , 1)^T.

Interestingly, problem (2.9.7) can be formulated and solved without explicit knowledge or definition of the mapping ψ. We need only a technique to define the elements of Q. This can be done with the use of a kernel function K : R^n × R^n → R, where K(a_k, a_l) replaces ψ(a_k)^T ψ(a_l) [4, 15]. This is the so-called "kernel trick." (The kernel function K can also be used to construct a classification function φ from the solution of (2.9.7).) A particularly popular choice of kernel is the Gaussian kernel:

K(a_k, a_l) := exp(−‖a_k − a_l‖^2/(2σ)),

where σ is a positive parameter.
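The kernel trick amounts to filling in the matrix Q of (2.9.7) entry by entry from K alone, with ψ never materialized. A numpy sketch using the Gaussian kernel as defined above (the O(m²) double loop is written for clarity, not speed; function names are ours):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2*sigma)), as defined in the text.
    d = a - b
    return np.exp(-np.dot(d, d) / (2.0 * sigma))

def dual_Q(A, y, sigma=1.0):
    """Q_kl = y_k y_l K(a_k, a_l) for the dual QP (2.9.7); rows of A are the a_j^T."""
    m = A.shape[0]
    K = np.array([[gaussian_kernel(A[k], A[l], sigma) for l in range(m)]
                  for k in range(m)])
    return (y[:, None] * y[None, :]) * K
```

Since K(a, a) = 1 and y_k^2 = 1, the diagonal of Q is identically 1 for this kernel.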
2.10. Logistic Regression Logistic regression can be viewed as a variant of binary support-vector machine classification, in which rather than the classification function φ giving an unqualified prediction of the class in which a new data vector a lies, it returns an estimate of the odds of a belonging to one class or the other. We seek an "odds function" p parametrized by a vector x ∈ R^n as follows:

(2.10.1)    p(a; x) := (1 + exp(a^T x))^{−1},

and aim to choose the parameter x so that

(2.10.2a)    p(a_j; x) ≈ 1  when y_j = +1;
(2.10.2b)    p(a_j; x) ≈ 0  when y_j = −1.

(Note the similarity to (2.9.1).) The optimal value of x can be found by maximizing a log-likelihood function:

(2.10.3)    L(x) := (1/m) [ Σ_{j: y_j = −1} log(1 − p(a_j; x)) + Σ_{j: y_j = 1} log p(a_j; x) ].

We can perform feature selection using this model by introducing a regularizer λ‖x‖_1, as follows:

(2.10.4)    max_x (1/m) [ Σ_{j: y_j = −1} log(1 − p(a_j; x)) + Σ_{j: y_j = 1} log p(a_j; x) ] − λ‖x‖_1,

where λ > 0 is a regularization parameter. As we see later, this term has the effect of producing a solution in which few components of x are nonzero, making it possible to evaluate p(a; x) by knowing only those components of a that correspond to the nonzeros in x.
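A direct transcription of (2.10.1) and (2.10.3) in numpy, assuming the labels y_j ∈ {−1, +1} are stored in an array and the rows of A are the a_j^T; function names are ours:

```python
import numpy as np

def p_odds(A, x):
    # Odds function (2.10.1), applied row-wise: p(a_j; x) = (1 + exp(a_j^T x))^(-1).
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(x, A, y):
    """L(x) from (2.10.3): log p on the y_j = +1 terms, log(1 - p) on y_j = -1."""
    p = p_odds(A, x)
    m = len(y)
    return (np.sum(np.log(p[y == 1])) + np.sum(np.log(1.0 - p[y == -1]))) / m
```

At x = 0 every p(a_j; x) equals 1/2, so L(0) = log(1/2) regardless of the data; maximizing L(x) (or the regularized (2.10.4)) pushes the two sums toward zero from below.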
An important extension of this technique is to multiclass (or multinomial) logistic regression, in which the data vectors a_j belong to more than two classes. Such applications are common in modern data analysis. For example, in a speech recognition system, the M classes could each represent a phoneme of speech, one of the potentially thousands of distinct elementary sounds that can be uttered by humans in a few tens of milliseconds. A multinomial logistic regression problem requires a distinct odds function p_k for each class k ∈ {1, 2, . . . , M}. These functions are parametrized by vectors x_[k] ∈ R^n, k = 1, 2, . . . , M, defined as follows:

(2.10.5)    p_k(a; X) := exp(a^T x_[k]) / Σ_{l=1}^{M} exp(a^T x_[l]),   k = 1, 2, . . . , M,

where we define X := {x_[k] | k = 1, 2, . . . , M}. Note that for all a, we have p_k(a) ∈ (0, 1) for all k = 1, 2, . . . , M, and that Σ_{k=1}^{M} p_k(a) = 1. The functions (2.10.5) are (somewhat colloquially) referred to as performing a "softmax" on the quantities {a^T x_[l] | l = 1, 2, . . . , M}.

In the setting of multiclass logistic regression, the labels y_j are vectors in R^M, whose elements are defined as follows:

(2.10.6)    y_jk = 1 when a_j belongs to class k, and y_jk = 0 otherwise.

Similarly to (2.10.2), we seek to define the vectors x_[k] so that

(2.10.7a)    p_k(a_j; X) ≈ 1  when y_jk = 1;
(2.10.7b)    p_k(a_j; X) ≈ 0  when y_jk = 0.

The problem of finding values of x_[k] that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

(2.10.8)    L(X) := (1/m) Σ_{j=1}^{m} [ Σ_{ℓ=1}^{M} y_jℓ (x_[ℓ]^T a_j) − log( Σ_{ℓ=1}^{M} exp(x_[ℓ]^T a_j) ) ].

"Group-sparse" regularization terms can be included in this formulation to select a set of features in the vectors a_j, common to each class, that distinguish effectively between the classes.
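The softmax (2.10.5) is a short computation in numpy. In this sketch the columns of the array X play the role of the vectors x_[k]; subtracting the maximum is a standard overflow guard that leaves the ratio unchanged, not part of the definition:

```python
import numpy as np

def softmax_odds(a, X):
    """Class odds (2.10.5): p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l]).

    X is n x M; column k is the parameter vector x_[k]."""
    z = a @ X          # z_k = a^T x_[k], k = 1..M
    z = z - z.max()    # overflow guard: shifts cancel in the ratio
    e = np.exp(z)
    return e / e.sum()
```

As noted in the text, each returned component lies in (0, 1) and the components sum to 1.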
2.11. Deep Learning Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector a into one of M possible classes, where M > 2 is large in some key applications. The difference is that the data vector a undergoes a series of structured transformations before being passed through a "softmax" operator, a multiclass logistic regression classifier of the type described above.
Figure 2.11.1. A deep neural network, showing connections between adjacent layers.
The simple neural network shown in Figure 2.11.1 illustrates the basic ideas. In this figure, the data vector a_j enters at the bottom of the network, each node in the bottom layer corresponding to one component of a_j. The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next. A typical form of this transformation, which converts the vector a_j^{l−1} at layer l − 1 to the vector a_j^l at layer l, is

a_j^l = σ(W_l a_j^{l−1} + g_l),   l = 1, 2, . . . , D,

where W_l is a matrix of dimension |a_j^l| × |a_j^{l−1}|, g_l is a vector of length |a_j^l|, σ is a componentwise nonlinear transformation, and D is the number of layers situated strictly between the bottom and top layers (referred to as "hidden layers"). Each arc in Figure 2.11.1 represents one of the elements of a transformation matrix W_l. We define a_j^0 to be the "raw" input vector a_j, and let a_j^D be the vector formed by the nodes at the second-top layer in Figure 2.11.1. Typical forms of the function
σ include the following, acting identically on each component t ∈ R of its input vector:
• Logistic function: t → 1/(1 + e^{−t});
• Hinge loss: t → max(t, 0);
• Bernoulli: a random function that outputs 1 with probability 1/(1 + e^{−t}) and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class. As mentioned, the "softmax" operator is typically used to convert the transformed input vector in the second-top layer (layer D) to a set of odds at the top layer. Associated with each input vector a_j are labels y_jk, defined as in (2.10.6) to indicate which of the M classes a_j belongs to.
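The layer-to-layer recursion above is a short loop in numpy. A sketch using the hinge (ReLU) activation from the list above; representing w as a list of (W_l, g_l) pairs is our illustrative choice:

```python
import numpy as np

def forward(a, weights, sigma=lambda t: np.maximum(t, 0.0)):
    """Forward pass a^l = sigma(W_l a^{l-1} + g_l), l = 1, ..., D.

    `weights` is the list [(W_1, g_1), ..., (W_D, g_D)]; the returned vector
    a^D is what the top-layer softmax classifier acts on."""
    for W, g in weights:
        a = sigma(W @ a + g)   # one hidden layer: affine map, then componentwise sigma
    return a
```

With an empty `weights` list this returns the raw input, matching the remark below that multiclass logistic regression is the special case D = 0.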
The parameters in this neural network are the matrix-vector pairs (Wl, gl), l = 1, 2, . . . , D that transform the input vector a_j into its form a_j^D at the second-top layer, together with the parameters X of the softmax operation that takes place at the final (top) stage, where X is defined exactly as in the discussion of the previous section on multiclass logistic regression. We aim to choose all these parameters so that the network does a good job of classifying the training data. Using the notation w for the layer-to-layer transformations, that is,
w := (W1, g1, W2, g2, . . . , WD, gD),
we can write the loss function for deep learning as follows:
(2.11.2)  L(w, X) := −(1/m) Σ_{j=1}^{m} [ Σ_{ℓ=1}^{M} y_{jℓ} (x_{[ℓ]}^T a_j^D(w)) − log( Σ_{ℓ=1}^{M} exp(x_{[ℓ]}^T a_j^D(w)) ) ].
Here we write a_j^D(w) to make explicit the dependence of a_j^D on the transformations w = (W1, g1, W2, g2, . . . , WD, gD) as well as on the input vector a_j. (We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, . . . , m.)
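A forward pass through such a network, and the evaluation of the sample-average loss, can be sketched as follows. All shapes and names here are illustrative assumptions; we use the hinge activation, write the loss in the equivalent form (1/m) Σ_j [ log Σ_ℓ exp(·) − Σ_ℓ y_{jℓ}(·) ], and evaluate the softmax naively, without the max-subtraction usually added for numerical stability:

```python
import numpy as np

def forward(w, a):
    # w is a list of (W_l, g_l) pairs; a is the raw input vector a_j = a_j^0.
    # Each hidden layer applies an affine map followed by the hinge activation.
    for W, g in w:
        a = np.maximum(W @ a + g, 0.0)   # sigma(t) = max(t, 0), componentwise
    return a                             # a_j^D, the second-top layer

def loss(w, X, A, Y):
    # Sample-average loss of the form (2.11.2). A holds the input vectors a_j
    # as columns, Y holds the 0/1 labels y_{jl} (classes as rows), and X holds
    # the softmax parameters x_[l] as columns.
    m = A.shape[1]
    total = 0.0
    for j in range(m):
        z = X.T @ forward(w, A[:, j])    # scores x_[l]^T a_j^D(w), l = 1..M
        total += np.log(np.sum(np.exp(z))) - Y[:, j] @ z
    return total / m
```

With w an empty list this reduces to the multiclass logistic regression case D = 0 mentioned above, since forward then returns a_j unchanged.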
Neural networks in use for particular applications (in image recognition and speech recognition, for example, where they have been very successful) include many variants on the basic design above. These include restricted connectivity between layers (that is, enforcing structure on the matrices Wl, l = 1, 2, . . . , D), layer arrangements that are more complex than the linear layout illustrated in Figure 2.11.1 (with outputs coming from different levels, connections across nonadjacent layers, different componentwise transformations σ at different layers, and so on). Deep neural networks for practical applications are highly engineered objects.
The loss function (2.11.2) shares with many other applications the “summation” form (2.1.2), but it has several features that set it apart from the other applications discussed above. First, and possibly most important, it is nonconvex in the parameters w. There is reason to believe that the “landscape” of L is complex, with the global minimizer being exceedingly difficult to find. Second, the total number of parameters in (w, X) is usually very large. The most popular algorithms for minimizing (2.11.2) are of stochastic gradient type, which like most optimization methods come with no guarantee for finding the minimizer of a nonconvex function. Effective training of deep learning classifiers typically requires a great deal of data and computation power. Huge clusters of powerful computers, often using multicore processors, GPUs, and even specially architected processing units, are devoted to this task. Efficiency also requires many heuristics in the formulation and the algorithm (for example, in the choice of regularization functions and in the steplengths for stochastic gradient).
3. Preliminaries
We discuss here some foundations for the analysis of subsequent sections. These include useful facts about smooth and nonsmooth convex functions, Taylor’s theorem and some of its consequences, optimality conditions, and prox-operators.
In the discussion of this section, our basic assumption is that f is a mapping from R^n to R ∪ {+∞}, continuous on its effective domain D := {x | f(x) < ∞}. Further assumptions on f are introduced as needed.
3.1. Solutions Consider the problem of minimizing f in (1.0.1). We have the following terminology:
• x∗ is a local minimizer of f if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N.
• x∗ is a global minimizer of f if f(x) ≥ f(x∗) for all x ∈ R^n.
• x∗ is a strict local minimizer if it is a local minimizer on some neighborhood N and in addition f(x) > f(x∗) for all x ∈ N with x ≠ x∗.
• x∗ is an isolated local minimizer if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N and in addition, N contains no local minimizers other than x∗.
3.2. Convexity and Subgradients A convex set Ω ⊂ R^n has the property that
(3.2.1)  x, y ∈ Ω ⇒ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1].
We usually deal with closed convex sets in this article. For a convex set Ω ⊂ R^n we define the indicator function IΩ(x) as follows:
IΩ(x) = 0 if x ∈ Ω, and IΩ(x) = +∞ otherwise.
Indicator functions are useful devices for deriving optimality conditions for constrained problems, and even for developing algorithms. The constrained optimization problem
(3.2.2)  min_{x∈Ω} f(x)
can be restated equivalently as follows:
(3.2.3)  min_x f(x) + IΩ(x).
We noted already that a convex function φ : R^n → R ∪ {+∞} has the following defining property:
(3.2.4)  φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y), for all x, y ∈ R^n and all α ∈ [0, 1].
The concepts of “minimizer” are simpler in the case of convex objective functions than in the general case. In particular, the distinction between “local” and “global” minimizers disappears. For f convex in (1.0.1), we have the following.
(a) Any local minimizer of (1.0.1) is also a global minimizer.
(b) The set of global minimizers of (1.0.1) is a convex set.
If there exists a value γ > 0 such that
(3.2.5)  φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y) − (1/2)γα(1 − α)‖x − y‖₂²
for all x and y in the domain of φ, we say that φ is strongly convex with modulus of convexity γ.
We summarize some definitions and results about subgradients of convex functions here. For a more extensive discussion, see [20].
Definition 3.2.6. A vector v ∈ R^n is a subgradient of f at a point x if
f(x + d) ≥ f(x) + v^T d for all d ∈ R^n.
The subdifferential, denoted ∂f(x), is the set of all subgradients of f at x.
Subdifferentials satisfy a monotonicity property, as we show now.
Lemma 3.2.7. If a ∈ ∂f(x) and b ∈ ∂f(y), we have (a − b)^T (x − y) ≥ 0.
Proof. From convexity of f and the definitions of a and b, we have f(y) ≥ f(x) + a^T (y − x) and f(x) ≥ f(y) + b^T (x − y). The result follows by adding these two inequalities. □
We can easily characterize a minimizer in terms of the subdifferential.
Theorem 3.2.8. The point x∗ is a minimizer of a convex function f if and only if 0 ∈ ∂f(x∗).
Proof. If 0 ∈ ∂f(x∗), we have by substituting x = x∗ and v = 0 into Definition 3.2.6 that f(x∗ + d) ≥ f(x∗) for all d ∈ R^n, which implies that x∗ is a minimizer of f. The converse follows trivially by showing that v = 0 satisfies Definition 3.2.6 when x∗ is a minimizer. □
The subdifferential generalizes the concept of derivative of a smooth function.
Theorem 3.2.9. If f is convex and differentiable at x, then ∂f(x) = {∇f(x)}.
A converse of this result is also true. Specifically, if the subdifferential of a convex function f at x contains a single subgradient, then f is differentiable with gradient equal to this subgradient (see [38, Theorem 25.1]).
3.3. Taylor’s Theorem Taylor’s theorem is a foundational result for optimization of smooth nonlinear functions. It shows how smooth functions can be approximated locally by low-order (linear or quadratic) functions.
Theorem 3.3.1. Given a continuously differentiable function f : R^n → R, and given x, p ∈ R^n, we have that
(3.3.2)  f(x + p) = f(x) + ∫₀¹ ∇f(x + ξp)^T p dξ,
(3.3.3)  f(x + p) = f(x) + ∇f(x + ξp)^T p, for some ξ ∈ (0, 1).
If f is twice continuously differentiable, we have
(3.3.4)  ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + ξp) p dξ,
(3.3.5)  f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + ξp) p, for some ξ ∈ (0, 1).
We can derive an important consequence of this theorem when f is Lipschitz continuously differentiable with constant L, that is,
(3.3.6)  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y ∈ R^n.
From (3.3.2), we have
f(y) − f(x) − ∇f(x)^T (y − x) = ∫₀¹ [∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) dξ.
By using (3.3.6), we have
[∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) ≤ ‖∇f(x + ξ(y − x)) − ∇f(x)‖ ‖y − x‖ ≤ Lξ‖y − x‖².
By substituting this bound into the previous integral, we obtain
(3.3.7)  f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2)‖y − x‖².
For the remainder of this section, we assume that f is continuously differentiable and also convex. The definition of convexity (3.2.4) and the fact that ∂f(x) = {∇f(x)} imply that
(3.3.8)  f(y) ≥ f(x) + ∇f(x)^T (y − x), for all x, y ∈ R^n.
We defined “strong convexity with modulus γ” in (3.2.5). When f is differentiable, we have the following equivalent definition, obtained by rearranging (3.2.5) and letting α ↓ 0:
(3.3.9)  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (γ/2)‖y − x‖².
By combining this expression with (3.3.7), we have the following result.
Lemma 3.3.10. Given convex f satisfying (3.2.5), with ∇f uniformly Lipschitz continuous with constant L, we have for any x, y that
(3.3.11)  (γ/2)‖y − x‖² ≤ f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2)‖y − x‖².
For later convenience, we define a condition number κ as follows:
(3.3.12)  κ := L/γ.
When f is twice continuously differentiable, we can characterize the constants γ and L in terms of the eigenvalues of the Hessian ∇²f(x). Specifically, we can show that (3.3.11) is equivalent to
(3.3.13)  γI ⪯ ∇²f(x) ⪯ LI, for all x.
When f is strictly convex and quadratic, κ defined in (3.3.12) is the condition number of the (constant) Hessian, in the usual sense of linear algebra.
Strongly convex functions have unique minimizers, as we now show.
Theorem 3.3.14. Let f be differentiable and strongly convex with modulus γ > 0. Then the minimizer x∗ of f exists and is unique.
Proof. We show first that for any point x0, the level set {x | f(x) ≤ f(x0)} is closed and bounded, and hence compact. Suppose for contradiction that there is a sequence {x^ℓ} such that ‖x^ℓ‖ → ∞ and
(3.3.15)  f(x^ℓ) ≤ f(x0).
By strong convexity of f, we have for some γ > 0 that
f(x^ℓ) ≥ f(x0) + ∇f(x0)^T (x^ℓ − x0) + (γ/2)‖x^ℓ − x0‖².
By rearranging slightly, and using (3.3.15), we obtain
(γ/2)‖x^ℓ − x0‖² ≤ −∇f(x0)^T (x^ℓ − x0) ≤ ‖∇f(x0)‖ ‖x^ℓ − x0‖.
By dividing both sides by (γ/2)‖x^ℓ − x0‖, we obtain ‖x^ℓ − x0‖ ≤ (2/γ)‖∇f(x0)‖ for all ℓ, which contradicts unboundedness of {x^ℓ}. Thus, the level set is bounded. Since it is also closed (by continuity of f), it is compact.
Since f is continuous, it attains its minimum on the compact level set, which is also the solution of min_x f(x), and we denote it by x∗. Suppose for contradiction that the minimizer is not unique, so that we have two points x∗1 and x∗2 that minimize f. Obviously, these points must attain equal objective values, so that f(x∗1) = f(x∗2) = f∗ for some f∗. By taking (3.2.5) with φ = f, x = x∗1, y = x∗2, and α = 1/2, we obtain
f((x∗1 + x∗2)/2) ≤ ½(f(x∗1) + f(x∗2)) − (1/8)γ‖x∗1 − x∗2‖² < f∗,
so the point (x∗1 + x∗2)/2 has a smaller function value than both x∗1 and x∗2, contradicting our assumption that x∗1 and x∗2 are both minimizers. Hence, the minimizer x∗ is unique. □
3.4. Optimality Conditions for Smooth Functions We consider the case in which f is a smooth (twice continuously differentiable) function, not necessarily convex. Before designing algorithms to find a minimizer of f, we need to identify properties of f and its derivatives at a point x̄ that tell us whether or not x̄ is a minimizer, of one of the types described in Subsection 3.1. We call such properties optimality conditions.
A first-order necessary condition for optimality is that ∇f(x̄) = 0. More precisely, if x̄ is a local minimizer, then ∇f(x̄) = 0. We can prove this by using Taylor’s theorem. Supposing for contradiction that ∇f(x̄) ≠ 0, we can show by setting x = x̄ and p = −α∇f(x̄) for α > 0 in (3.3.3) that f(x̄ − α∇f(x̄)) < f(x̄) for all α > 0 sufficiently small. Thus any neighborhood of x̄ contains points x with f(x) < f(x̄), so x̄ cannot be a local minimizer.
If f is convex, in addition to being smooth, the condition ∇f(x̄) = 0 is sufficient for x̄ to be a global solution. This claim follows immediately from Theorems 3.2.8 and 3.2.9.
A second-order necessary condition for x̄ to be a local solution is that ∇f(x̄) = 0 and ∇²f(x̄) is positive semidefinite. The proof is by an argument similar to that of the first-order necessary condition, but using the second-order Taylor series expansion (3.3.5) instead of (3.3.3). A second-order sufficient condition is that ∇f(x̄) = 0 and ∇²f(x̄) is positive definite. This condition guarantees that x̄ is a strict local minimizer, that is, there is a neighborhood of x̄ such that x̄ has a strictly smaller function value than all other points in this neighborhood. Again, the proof makes use of (3.3.5).
We call x̄ a stationary point for smooth f if it satisfies the first-order necessary condition ∇f(x̄) = 0. Stationary points are not necessarily local minimizers. In fact, local maximizers satisfy the same condition. More interestingly, stationary points can be saddle points. These are points for which there exist directions u and v such that f(x̄ + αu) < f(x̄) and f(x̄ + αv) > f(x̄) for all positive α sufficiently small. When the Hessian ∇²f(x̄) has both strictly positive and strictly negative eigenvalues, it follows from (3.3.5) that x̄ is a saddle point. When ∇²f(x̄) is positive semidefinite or negative semidefinite, second derivatives alone are insufficient to classify x̄; higher-order derivative information is needed.
3.5. Proximal Operators and the Moreau Envelope Here we present some tools for analyzing the convergence of algorithms for the regularized problem (1.0.2), where the objective is the sum of a smooth function and a convex (usually nonsmooth) function.
For a closed proper convex function h and a positive scalar λ, we define the Moreau envelope as
(3.5.1)  M_{λ,h}(x) := inf_u { h(u) + (1/(2λ))‖u − x‖² } = (1/λ) inf_u { λh(u) + ½‖u − x‖² }.
The prox-operator of the function λh is the value of u that achieves the infimum in (3.5.1), that is,
(3.5.2)  prox_{λh}(x) := arg min_u { λh(u) + ½‖u − x‖² }.
From optimality properties for this problem, we have
(3.5.3)  0 ∈ λ∂h(prox_{λh}(x)) + (prox_{λh}(x) − x).
The Moreau envelope can be seen as a smoothing or regularization of the function h. It has a finite value for all x, even when h takes on infinite values for some x ∈ R^n. In fact, it is differentiable everywhere, with gradient
∇M_{λ,h}(x) = (1/λ)(x − prox_{λh}(x)).
Moreover, x∗ is a minimizer of h if and only if it is a minimizer of M_{λ,h}.
The proximal operator satisfies a nonexpansiveness property. From the optimality conditions (3.5.3) at two points x and y, we have
x − prox_{λh}(x) ∈ λ∂h(prox_{λh}(x)),  y − prox_{λh}(y) ∈ λ∂h(prox_{λh}(y)).
By applying monotonicity (Lemma 3.2.7), we have
(1/λ)((x − prox_{λh}(x)) − (y − prox_{λh}(y)))^T (prox_{λh}(x) − prox_{λh}(y)) ≥ 0,
which by rearrangement and application of the Cauchy-Schwarz inequality yields
‖prox_{λh}(x) − prox_{λh}(y)‖² ≤ (x − y)^T (prox_{λh}(x) − prox_{λh}(y)) ≤ ‖x − y‖ ‖prox_{λh}(x) − prox_{λh}(y)‖,
from which we obtain ‖prox_{λh}(x) − prox_{λh}(y)‖ ≤ ‖x − y‖, as claimed.
We list the prox-operator for several instances of h that are common in data analysis applications. These definitions are useful in implementing the prox-gradient algorithms of Section 5.
• h(x) = 0 for all x, for which we have prox_{λh}(x) = x. (This observation is useful in proving that the prox-gradient method reduces to the familiar steepest descent method when the objective contains no regularization term.)
• h(x) = IΩ(x), the indicator function for a closed convex set Ω. In this case, we have for any λ > 0 that
prox_{λIΩ}(x) = arg min_u { λIΩ(u) + ½‖u − x‖² } = arg min_{u∈Ω} ½‖u − x‖²,
which is simply the projection of x onto the set Ω.
• h(x) = ‖x‖₁. By substituting into definition (3.5.2) we see that the minimization separates into its n separate components, and that the ith component of prox_{λ‖·‖₁}(x) is
[prox_{λ‖·‖₁}(x)]ᵢ = arg min_{uᵢ} { λ|uᵢ| + ½(uᵢ − xᵢ)² }.
We can thus verify that
(3.5.4)  [prox_{λ‖·‖₁}(x)]ᵢ = xᵢ − λ if xᵢ > λ;  0 if xᵢ ∈ [−λ, λ];  xᵢ + λ if xᵢ < −λ,
an operation that is known as soft-thresholding.
• h(x) = ‖x‖₀, where ‖x‖₀ denotes the cardinality of the vector x, its number of nonzero components. Although this h is not a convex function (as we can see by considering convex combinations of the vectors (0, 1)^T and (1, 0)^T in R²), its prox-operator is well defined, and is known as hard thresholding:
[prox_{λ‖·‖₀}(x)]ᵢ = xᵢ if |xᵢ| ≥ √(2λ);  0 if |xᵢ| < √(2λ).
As in (3.5.4), the definition (3.5.2) separates into n individual components.
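The prox-operators listed above take only a few lines each to implement; a vectorized sketch (function names are ours):

```python
import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding (3.5.4): shrink each component toward 0 by lam.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    # Hard thresholding: zero out components with |x_i| < sqrt(2*lam).
    return np.where(np.abs(x) >= np.sqrt(2.0 * lam), x, 0.0)

def prox_indicator_box(x, lo, hi):
    # Prox of the indicator of the box [lo, hi]^n: the projection onto the box.
    return np.clip(x, lo, hi)
```

The nonexpansiveness property shown above can be observed directly on prox_l1: the distance between the outputs never exceeds the distance between the inputs.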
3.6. Convergence Rates An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error. For smooth f, we may be interested in how rapidly the sequence of gradient norms ‖∇f(x^k)‖ converges to zero. For nonsmooth convex f, a measure of interest may be convergence to zero of dist(0, ∂f(x^k)) (the sequence of distances from 0 to the subdifferential ∂f(x^k)). Other error measures for which we may be able to prove convergence rates include ‖x^k − x∗‖ (where x∗ is a solution) and f(x^k) − f∗ (where f∗ is the optimal value of the objective function f). For generality, we denote by {φ_k} the sequence of nonnegative scalars, converging to zero, whose rate we wish to obtain.
We say that linear convergence holds if there is some σ ∈ (0, 1) such that
(3.6.1)  φ_{k+1}/φ_k ≤ 1 − σ, for all k sufficiently large.
(This property is sometimes also known as geometric or exponential convergence, but the term linear is standard in the optimization literature.) It follows from (3.6.1) that there is some positive constant C such that
(3.6.2)  φ_k ≤ C(1 − σ)^k, k = 1, 2, . . . .
While (3.6.1) implies (3.6.2), the converse does not hold. The sequence
φ_k = 2^{−k} if k is even;  0 if k is odd,
satisfies (3.6.2) with C = 1 and σ = 0.5, but does not satisfy (3.6.1). To distinguish between these two slightly different definitions, (3.6.1) is sometimes called Q-linear while (3.6.2) is called R-linear.
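The distinction can be checked numerically; the sketch below (ours) confirms that the alternating sequence satisfies the R-linear bound (3.6.2) with C = 1 and σ = 0.5, while the Q-linear ratio condition (3.6.1) cannot hold, since φ_k = 0 is followed by φ_{k+1} > 0 for every odd k:

```python
def phi(k):
    # The alternating sequence from the text: 2^(-k) for even k, 0 for odd k.
    return 2.0 ** (-k) if k % 2 == 0 else 0.0

def satisfies_r_linear(K, C=1.0, sigma=0.5):
    # R-linear bound (3.6.2): phi_k <= C (1 - sigma)^k for k = 0, ..., K-1.
    return all(phi(k) <= C * (1.0 - sigma) ** k for k in range(K))

def violates_q_linear(K):
    # Q-linear (3.6.1) fails: a zero term followed by a positive term makes
    # any ratio bound phi_{k+1}/phi_k <= 1 - sigma impossible.
    return any(phi(k) == 0.0 and phi(k + 1) > 0.0 for k in range(K))
```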
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence are encountered in optimization algorithms for data analysis, including the following:
(3.6.3a)  φ_k ≤ C/√k, k = 1, 2, . . . ,
(3.6.3b)  φ_k ≤ C/k, k = 1, 2, . . . ,
(3.6.3c)  φ_k ≤ C/k², k = 1, 2, . . . ,
where in each case, C is some positive constant.
Superlinear convergence occurs when the constant σ ∈ (0, 1) in (3.6.1) can be chosen arbitrarily close to 1. Specifically, we say that the sequence {φ_k} converges Q-superlinearly to 0 if
(3.6.4)  lim_{k→∞} φ_{k+1}/φ_k = 0.
Q-quadratic convergence occurs when
(3.6.5)  φ_{k+1}/φ_k² ≤ C, k = 1, 2, . . . ,
for some sufficiently large C. We say that the convergence is R-superlinear if there is a Q-superlinearly convergent sequence {ψ_k} that dominates {φ_k} (that is, 0 ≤ φ_k ≤ ψ_k for all k). R-quadratic convergence is defined similarly. Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.
When a convergence rate applies globally, from any reasonable starting point, it can be used to derive a complexity bound for the algorithm, which takes the form of a bound on the number of iterations K required to reduce φ_k below some specified tolerance ε. For a sequence satisfying the R-linear convergence condition (3.6.2), a sufficient condition for φ_K ≤ ε is C(1 − σ)^K ≤ ε. By using the estimate log(1 − σ) ≤ −σ for all σ ∈ (0, 1), we have that
C(1 − σ)^K ≤ ε ⇔ K log(1 − σ) ≤ log(ε/C) ⇐ K ≥ log(C/ε)/σ.
It follows that for linearly convergent algorithms, the number of iterations required to converge to a tolerance ε depends logarithmically on 1/ε and inversely on the rate constant σ. For an algorithm that satisfies the sublinear rate (3.6.3a), a sufficient condition for φ_K ≤ ε is C/√K ≤ ε, which is equivalent to K ≥ (C/ε)², so the complexity is O(1/ε²). Similar analyses for (3.6.3b) reveal complexity of O(1/ε), while for (3.6.3c), we have complexity O(1/√ε).
For quadratically convergent methods, the complexity is doubly logarithmic in ε (that is, O(log log(1/ε))). Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
4. Gradient Methods
We consider here iterative methods for solving the unconstrained smooth problem (1.0.1) that make use of the gradient ∇f. (Duchi’s contribution [20] describes subgradient methods for nonsmooth convex functions.) We consider mostly methods that generate an iteration sequence {x^k} via the formula
(4.0.1)  x^{k+1} = x^k + α_k d^k,
where d^k is the search direction and α_k is a steplength.
We consider the steepest descent method, which searches along the negative gradient direction d^k = −∇f(x^k), proving convergence results for nonconvex functions, convex functions, and strongly convex functions. In Subsection 4.5, we consider methods that use more general descent directions d^k, proving convergence of methods that make careful choices of the line search parameter α_k at each iteration. In Subsection 4.6, we consider the conditional gradient method for minimization of a smooth function f over a compact set.
4.1. Steepest Descent The simplest stepsize protocol is the short-step variant of steepest descent. We assume here that f is convex and smooth, and that its gradients satisfy the Lipschitz condition (3.3.6) with Lipschitz constant L. We choose the search direction d^k = −∇f(x^k) in (4.0.1), and set the steplength α_k to be the constant 1/L, to obtain the iteration
(4.1.1)  x^{k+1} = x^k − (1/L)∇f(x^k), k = 0, 1, 2, . . . .
To estimate the amount of decrease in f obtained at each iterate of this method, we use Taylor’s theorem. From (3.3.7), we have
(4.1.2)  f(x + αd) ≤ f(x) + α∇f(x)^T d + α²(L/2)‖d‖².
For x = x^k and d = −∇f(x^k), the value of α that minimizes the expression on the right-hand side is α = 1/L. By substituting these values, we obtain
(4.1.3)  f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L))‖∇f(x^k)‖².
This expression is one of the foundational inequalities in the analysis of optimization methods. Depending on the assumptions made about f, we can derive a variety of different convergence rates from this basic inequality.
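The short-step iteration (4.1.1) and the decrease guarantee (4.1.3) are easy to check on a small quadratic; a sketch (the test problem is our own choice):

```python
import numpy as np

def steepest_descent(grad, L, x0, iters):
    # Short-step steepest descent (4.1.1): constant steplength 1/L.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# Example problem (ours): f(x) = 0.5 x^T A x, grad f(x) = A x,
# with L = 10 the largest eigenvalue of A (Lipschitz constant of grad f).
A = np.diag([1.0, 10.0])
L = 10.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
```

One step from x = (1, 1) already satisfies the per-step decrease (4.1.3), and the iterates converge to the minimizer x∗ = 0.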
4.2. General Case We consider first general smooth f, not necessarily convex. From (4.1.3) alone, we can already prove convergence of the steepest descent method with no additional assumptions. Rearranging (4.1.3) and summing over the first T − 1 iterates, we have
(4.2.1)  Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ≤ 2L Σ_{k=0}^{T−1} [f(x^k) − f(x^{k+1})] = 2L[f(x^0) − f(x^T)].
(Note the telescoping sum.) If f is bounded below, say by f̄, the right-hand side is bounded above by a constant. We have
(4.2.2)  min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √(2L[f(x^0) − f(x^T)]/T) ≤ √(2L[f(x^0) − f̄]/T).
Thus, within the first T − 1 steps of steepest descent, at least one of the iterates has gradient norm less than √(2L[f(x^0) − f̄]/T), which represents sublinear convergence of type (3.6.3a). It follows too from (4.2.1) that for f bounded below, any accumulation point of the sequence {x^k} is stationary.
4.3. Convex Case When f is also convex, we have the following stronger result for the steepest descent method.
Theorem 4.3.1. Suppose that f is convex and smooth, where ∇f has Lipschitz constant L, and that (1.0.1) has a solution x∗. Then the steepest descent method with stepsize α_k ≡ 1/L generates a sequence {x^k}_{k=0}^∞ that satisfies
(4.3.2)  f(x^T) − f∗ ≤ (L/(2T))‖x^0 − x∗‖².
Proof. By convexity of f, we have f(x∗) ≥ f(x^k) + ∇f(x^k)^T (x∗ − x^k), so by substituting into (4.1.3), we obtain for k = 0, 1, 2, . . . that
f(x^{k+1}) ≤ f(x∗) + ∇f(x^k)^T (x^k − x∗) − (1/(2L))‖∇f(x^k)‖²
  = f(x∗) + (L/2)( ‖x^k − x∗‖² − ‖x^k − x∗ − (1/L)∇f(x^k)‖² )
  = f(x∗) + (L/2)( ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² ).
By summing over k = 0, 1, 2, . . . , T − 1, and noting the telescoping sum, we have
Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/2) Σ_{k=0}^{T−1} ( ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² ) = (L/2)( ‖x^0 − x∗‖² − ‖x^T − x∗‖² ) ≤ (L/2)‖x^0 − x∗‖².
Since {f(x^k)} is a nonincreasing sequence, we have
f(x^T) − f(x∗) ≤ (1/T) Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/(2T))‖x^0 − x∗‖²,
as required. □
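The O(1/T) bound (4.3.2) can be verified numerically on a convex quadratic (our own choice of test problem, with x∗ = 0 and f∗ = 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x, with x* = 0, f* = 0, and L = 9 the largest eigenvalue.
A = np.diag([1.0, 4.0, 9.0])
L = 9.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0, 1.0])
dist2 = float(x @ x)            # ||x^0 - x*||^2
bound_holds = True
for T in range(1, 51):
    x = x - (A @ x) / L         # steepest-descent step (4.1.1)
    if f(x) > L / (2 * T) * dist2 + 1e-12:
        bound_holds = False
```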
4.4. Strongly Convex Case Recall the definition (3.3.9) of strong convexity, which shows that f can be bounded below by a quadratic with Hessian γI. When a strongly convex f has L-Lipschitz gradients, it is also bounded above by a similar quadratic (see (3.3.7)) differing only in the quadratic term, which becomes LI. This “sandwich” effect yields a linear convergence rate for the gradient method.
We saw in Theorem 3.3.14 that a strongly convex f has a unique minimizer, which we denote by x∗. Minimizing both sides of the inequality (3.3.9) with respect to y, we find that the minimum on the left-hand side is attained at y = x∗, while on the right-hand side it is attained at y = x − ∇f(x)/γ. By plugging these optimal values into (3.3.9), we obtain
min_y f(y) ≥ min_y { f(x) + ∇f(x)^T (y − x) + (γ/2)‖y − x‖² }
⇒ f(x∗) ≥ f(x) − ∇f(x)^T ((1/γ)∇f(x)) + (γ/2)‖(1/γ)∇f(x)‖²
⇒ f(x∗) ≥ f(x) − (1/(2γ))‖∇f(x)‖².
By rearrangement, we obtain
(4.4.1)  ‖∇f(x)‖² ≥ 2γ[f(x) − f(x∗)].
By substituting (4.4.1) into our basic inequality (4.1.3), we obtain
f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L))‖∇f(x^k)‖² ≤ f(x^k) − (γ/L)(f(x^k) − f∗).
Subtracting f∗ from both sides of this inequality gives us the recursion
(4.4.2)  f(x^{k+1}) − f∗ ≤ (1 − γ/L)(f(x^k) − f∗).
Thus the function values converge linearly to the optimum. After T steps, we have
(4.4.3)  f(x^T) − f∗ ≤ (1 − γ/L)^T (f(x^0) − f∗),
which is convergence of type (3.6.2) with constant σ = γ/L.
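A numerical check of the linear rate (4.4.3), on a strongly convex quadratic of our own choosing (γ = 1, L = 10, f∗ = 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x with A = diag(1, 10): gamma = 1, L = 10, x* = 0, f* = 0.
A = np.diag([1.0, 10.0])
gamma, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, -1.0])
f0 = f(x)
rate_holds = True
for T in range(1, 101):
    x = x - (A @ x) / L         # short-step steepest descent (4.1.1)
    if f(x) > (1.0 - gamma / L) ** T * f0 + 1e-12:
        rate_holds = False
```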
4.5. General Case: Line-Search Methods For the case of f in (1.0.1) smooth but possibly nonconvex, we consider algorithms that take steps of the form (4.0.1), where d^k is a descent direction, that is, it makes a negative inner product with the gradient ∇f(x^k), so that ∇f(x^k)^T d^k < 0. This condition ensures that f(x^k + αd^k) < f(x^k) for sufficiently small positive values of the steplength α: we can get improvement in f by taking small steps along d^k. (This claim follows from (3.3.3).) Line-search methods are built around this fundamental observation. By introducing additional conditions on d^k and α_k, that can be verified in practice with reasonable effort, we can establish a bound on decrease similar to (4.1.3) on each iteration, and thus a conclusion similar to (4.2.2).
We assume that d^k satisfies the following for some η > 0:
(4.5.1)  ∇f(x^k)^T d^k ≤ −η‖∇f(x^k)‖ ‖d^k‖.
For the steplength α_k, we assume the following weak Wolfe conditions hold, for some constants c₁ and c₂ with 0 < c₁ < c₂ < 1:
(4.5.2a)  f(x^k + α_k d^k) ≤ f(x^k) + c₁ α_k ∇f(x^k)^T d^k,
(4.5.2b)  ∇f(x^k + α_k d^k)^T d^k ≥ c₂ ∇f(x^k)^T d^k.
Condition (4.5.2a) is called “sufficient decrease”; it ensures descent at each step of at least a small fraction c₁ of the amount promised by the first-order Taylor-series expansion (3.3.3). Condition (4.5.2b) ensures that the directional derivative of f along the search direction d^k is significantly less negative at the chosen steplength α_k than at α = 0. This condition ensures that the step is “not too short.” It can be shown that it is always possible to find α_k that satisfies both conditions (4.5.2) simultaneously. Line-search procedures, which are specialized optimization procedures for minimizing functions of one variable, have been devised for finding such values efficiently; see [34, Chapter 3] for details.
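A simple bracketing scheme suffices to find a steplength satisfying both conditions (4.5.2); the sketch below (ours, far less sophisticated than the interpolating procedures of [34]) doubles α while the step is too short and bisects once a step that is too long has been found:

```python
import numpy as np

def weak_wolfe_search(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=100):
    # Bracketing/bisection search for alpha satisfying (4.5.2a) and (4.5.2b).
    # Assumes d is a descent direction: grad(x) @ d < 0.
    g0d = grad(x) @ d
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g0d:
            hi = alpha                       # (4.5.2a) fails: step too long
        elif grad(x + alpha * d) @ d < c2 * g0d:
            lo = alpha                       # (4.5.2b) fails: step too short
        else:
            return alpha
        alpha = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return alpha                             # fallback after max_iter trials

# Toy example (ours): f(x) = 0.5 ||x||^2 with the steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
```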
As in the remainder of this section, we assume that the Lipschitz property (3.3.6) holds for ∇f. From this property and (4.5.2b), we have
−(1 − c₂)∇f(x^k)^T d^k ≤ [∇f(x^k + α_k d^k) − ∇f(x^k)]^T d^k ≤ Lα_k‖d^k‖².
By comparing the first and last terms in these inequalities, we obtain the following lower bound on α_k:
α_k ≥ −((1 − c₂)/L) ∇f(x^k)^T d^k / ‖d^k‖².
By substituting this bound into (4.5.2a), and using (4.5.1) and the step definition (4.0.1), we obtain
f(x^{k+1}) = f(x^k + α_k d^k) ≤ f(x^k) + c₁ α_k ∇f(x^k)^T d^k
  ≤ f(x^k) − (c₁(1 − c₂)/L) (∇f(x^k)^T d^k)² / ‖d^k‖²
(4.5.3)  ≤ f(x^k) − (c₁(1 − c₂)/L) η² ‖∇f(x^k)‖²,
which by rearrangement yields
(4.5.4)  ‖∇f(x^k)‖² ≤ (L/(c₁(1 − c₂)η²)) (f(x^k) − f(x^{k+1})).
By following the argument of Subsection 4.2, we have that if f is bounded below by f̄, we obtain for any T ≥ 1 that
(4.5.5)  min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √(L/(η²c₁(1 − c₂))) √((f(x^0) − f̄)/T).
It follows by taking limits on both sides of (4.5.4) that for f bounded below, we have
(4.5.6)  lim_{k→∞} ‖∇f(x^k)‖ = 0,
and thus that all accumulation points x̄ of the sequence {x^k} generated by the algorithm (4.0.1) have ∇f(x̄) = 0. In the case of f convex, this condition guarantees that x̄ is a solution of (1.0.1). When f is nonconvex, x̄ may be a local minimum, but it may also be a saddle point or a local maximum.
The paper [27] uses the stable manifold theorem to show that line-search gradient methods are highly unlikely to converge to stationary points x̄ at which some eigenvalues of the Hessian ∇²f(x̄) are negative. Although it is easy to construct examples for which such bad behavior occurs, it requires special choices of starting point x^0. Possibly the most obvious example is f(x₁, x₂) = x₁² − x₂², starting from x^0 = (1, 0)^T, where d^k = −∇f(x^k) at each k. For this example, all iterates have x₂^k = 0 and, under appropriate conditions, converge to the saddle point x̄ = 0. Any starting point with x₂^0 ≠ 0 cannot converge to 0; in fact, it is easy to see that x₂^k diverges away from 0.
4.6. Conditional Gradient Method The conditional gradient approach, often known as “Frank-Wolfe” after the authors who devised it [22], is a method for convex nonlinear optimization over compact convex sets. This is the problem
(4.6.1)  min_{x∈Ω} f(x)
(see earlier discussion around (3.2.2)), where Ω is a compact convex set and f is a convex function whose gradient is Lipschitz continuous in a neighborhood of Ω, with Lipschitz constant L. We assume that Ω has diameter D, that is, ‖x − y‖ ≤ D for all x, y ∈ Ω.
The conditional gradient method replaces the objective in (4.6.1) at each iteration by a linear Taylor-series approximation around the current iterate x^k, and minimizes this linear objective over the original constraint set Ω. It then takes a step from x^k towards the minimizer of this linearized subproblem. The full method is as follows:
(4.6.2a)  v^k := arg min_{v∈Ω} v^T ∇f(x^k);
(4.6.2b)  x^{k+1} := x^k + α_k(v^k − x^k),  α_k := 2/(k + 2).
The method has a sublinear convergence rate, as we show below, and indeed requires many iterations in practice to obtain an accurate solution. Despite this feature, it makes sense in many interesting applications, because the subproblems (4.6.2a) can be solved very cheaply in some settings, and because highly accurate solutions are not required in some applications.
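One setting where (4.6.2a) is very cheap is minimization over the unit simplex: the linear subproblem is minimized at a vertex, namely the coordinate with the smallest gradient component. A sketch (the quadratic test problem is our own choice):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, iters):
    # Conditional gradient (4.6.2) over the unit simplex.
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0          # vertex solving (4.6.2a) on the simplex
        alpha = 2.0 / (k + 2)          # steplength from (4.6.2b)
        x = x + alpha * (v - x)        # convex combination: stays in simplex
    return x

# Example (ours): f(x) = 0.5 ||x - c||^2 over the simplex, with c interior,
# so grad f(x) = x - c, L = 1, x* = c, f* = 0, and D^2 = 2.
c = np.array([0.2, 0.3, 0.5])
grad = lambda x: x - c
```

The iterates remain feasible by construction, and the final objective gap respects the bound (4.6.4).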
We have the following result for sublinear convergence of the conditional gradient method.
Theorem 4.6.3. Under the conditions above, where L is the Lipschitz constant for ∇f on an open neighborhood of Ω and D is the diameter of Ω, the conditional gradient method (4.6.2) applied to (4.6.1) satisfies
(4.6.4)  f(x^k) − f(x∗) ≤ 2LD²/(k + 2), k = 1, 2, . . . ,
where x∗ is any solution of (4.6.1).
Proof. Setting x = x^k and y = x^{k+1} = x^k + α_k(v^k − x^k) in (3.3.7), we have
f(x^{k+1}) ≤ f(x^k) + α_k ∇f(x^k)^T (v^k − x^k) + ½α_k²L‖v^k − x^k‖²
(4.6.5)   ≤ f(x^k) + α_k ∇f(x^k)^T (v^k − x^k) + ½α_k²LD²,
where the second inequality comes from the definition of D. For the first-order term, since v^k solves (4.6.2a) and x∗ is feasible for (4.6.2a), we have
∇f(x^k)^T (v^k − x^k) ≤ ∇f(x^k)^T (x∗ − x^k) ≤ f(x∗) − f(x^k),
where the second inequality follows from convexity of f, as in (3.3.8). By substituting in (4.6.5) and subtracting f(x∗) from both sides, we obtain
(4.6.6)  f(x^{k+1}) − f(x∗) ≤ (1 − α_k)[f(x^k) − f(x∗)] + ½α_k²LD².
We now apply an inductive argument. For k = 0, we have α₀ = 1 and
f(x^1) − f(x∗) ≤ ½LD² < (2/3)LD²,
so that (4.6.4) holds in this case.
so that (4.6.4) holds in this case. Supposing that (4.6.4) holds for some value of k,we aim to show that it holds for k+ 1 too. We have
f(xk+1) − f(x∗) 6
(1 −
2k+ 2
)[f(xk) − f(x∗)] +
12
4(k+ 2)2 LD
2 from (4.6.6), (4.6.2b)
6 LD2[
2k(k+ 2)2 +
2(k+ 2)2
]from (4.6.4)
= 2LD2 (k+ 1)(k+ 2)2
= 2LD2 k+ 1k+ 2
1k+ 2
6 2LD2 k+ 2k+ 3
1k+ 2
=2LD2
k+ 3,
as required. 816
5. Prox-Gradient Methods
We now describe an elementary but powerful approach for solving the regularized optimization problem
(5.0.1)  min_{x∈R^n} φ(x) := f(x) + λψ(x),
where f is a smooth convex function, ψ is a convex regularization function (known simply as the “regularizer”), and λ > 0 is a regularization parameter. The technique we describe here is a natural extension of the steepest-descent approach, in that it reduces to the steepest-descent method analyzed in Theorem 4.3.1 applied to f when the regularization term is not present (λ = 0). It is useful when the regularizer ψ has a simple structure that is easy to account for explicitly, as is true for many regularizers that arise in data analysis, such as the ℓ₁ function (ψ(x) = ‖x‖₁) and the indicator function for a simple set Ω (ψ(x) = IΩ(x)), such as a box Ω = [l₁, u₁] ⊗ [l₂, u₂] ⊗ . . . ⊗ [lₙ, uₙ].²
Each step of the algorithm is defined as follows:
(5.0.2)  x^{k+1} := prox_{α_k λψ}(x^k − α_k ∇f(x^k)),
² For the analysis of this section I am indebted to class notes of L. Vandenberghe, from 2013-14.
for some steplength α_k > 0, where the prox operator is defined in (3.5.2). By substituting into this definition, we can verify that x^{k+1} is the solution of an approximation to the objective φ of (5.0.1), namely:

(5.0.3)  x^{k+1} := arg min_z ∇f(x^k)^T(z − x^k) + (1/(2α_k))‖z − x^k‖² + λψ(z).

One way to verify this equivalence is to note that the objective in (5.0.3) can be written as

(1/α_k) { (1/2)‖z − (x^k − α_k∇f(x^k))‖² + α_kλψ(z) },

modulo a term (α_k/2)‖∇f(x^k)‖² that does not involve z. The subproblem objective in (5.0.3) consists of a linear term ∇f(x^k)^T(z − x^k) (the first-order term in a Taylor-series expansion), a proximality term (1/(2α_k))‖z − x^k‖² that becomes more strict as α_k ↓ 0, and the regularization term λψ(z) in unaltered form. When λ = 0, we have x^{k+1} = x^k − α_k∇f(x^k), so the iteration (5.0.2) (or (5.0.3)) reduces to the usual steepest-descent approach discussed in Section 4. It is useful to continue thinking of α_k as playing the role of a line-search parameter, though here the line search is expressed implicitly through a proximal term.
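For the ℓ1 regularizer mentioned above, the prox operator has a well-known closed form, componentwise soft-thresholding. The following sketch (my own illustration, not from the text) runs the iteration (5.0.2) on ℓ1-regularized least squares with the constant step α_k = 1/L; the problem data here are arbitrary.

```python
import numpy as np

# Prox-gradient sketch for  min_x 0.5 ||Ax - b||^2 + lam * ||x||_1,
# with constant step alpha = 1/L, where L = ||A^T A||_2 is the Lipschitz
# constant of the gradient of the smooth part.

def soft_threshold(y, t):
    """prox_{t ||.||_1}(y), applied componentwise."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_gradient_l1(A, b, lam, iters=500):
    L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - alpha * grad, alpha * lam)   # step (5.0.2)
    return x
```

The soft-threshold shrinks each component toward zero by αλ and sets small components exactly to zero, which is how the ℓ1 term induces sparsity in the iterates.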
We will demonstrate convergence of the method (5.0.2) at a sublinear rate, for functions f whose gradients satisfy a Lipschitz continuity property with Lipschitz constant L (see (3.3.6)), and for the constant steplength choice α_k = 1/L. The proof makes use of a "gradient map" defined by

(5.0.4)  G_α(x) := (1/α)(x − prox_{αλψ}(x − α∇f(x))).

By comparing with (5.0.2), we see that this map defines the step taken at iteration k:

(5.0.5)  x^{k+1} = x^k − α_k G_{α_k}(x^k)   ⇔   G_{α_k}(x^k) = (1/α_k)(x^k − x^{k+1}).
The following technical lemma reveals some useful properties of G_α(x).

Lemma 5.0.6. Suppose that in problem (5.0.1), ψ is a closed convex function and that f is convex with Lipschitz continuous gradient on R^n, with Lipschitz constant L. Then for the definition (5.0.4) with α > 0, the following claims are true.
(a) G_α(x) ∈ ∇f(x) + λ∂ψ(x − αG_α(x)).
(b) For any z, and any α ∈ (0, 1/L], we have that

φ(x − αG_α(x)) ≤ φ(z) + G_α(x)^T(x − z) − (α/2)‖G_α(x)‖².
Proof. For part (a), we use the optimality property (3.5.3) of the prox operator, and make the following substitutions: x − α∇f(x) for "x", αλ for "λ", and ψ for "h", to obtain

0 ∈ αλ∂ψ(prox_{αλψ}(x − α∇f(x))) + (prox_{αλψ}(x − α∇f(x)) − (x − α∇f(x))).

We use definition (5.0.4) to make the substitution prox_{αλψ}(x − α∇f(x)) = x − αG_α(x), to obtain

0 ∈ αλ∂ψ(x − αG_α(x)) − α(G_α(x) − ∇f(x)),

and the result follows when we divide by α.

For (b), we start with the following consequence of Lipschitz continuity of ∇f, from Lemma 3.3.10:

f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖².

By setting y = x − αG_α(x), for any α ∈ (0, 1/L], we have

f(x − αG_α(x)) ≤ f(x) − αG_α(x)^T∇f(x) + (Lα²/2)‖G_α(x)‖²
             ≤ f(x) − αG_α(x)^T∇f(x) + (α/2)‖G_α(x)‖².   (5.0.7)

(The second inequality uses α ∈ (0, 1/L].) We also have by convexity of f and ψ that for any z and any v ∈ ∂ψ(x − αG_α(x)) the following are true:

(5.0.8)  f(z) ≥ f(x) + ∇f(x)^T(z − x),   ψ(z) ≥ ψ(x − αG_α(x)) + v^T(z − (x − αG_α(x))).

We have from part (a) that v = (G_α(x) − ∇f(x))/λ ∈ ∂ψ(x − αG_α(x)), so by making this choice of v in (5.0.8) and also using (5.0.7), we have for any α ∈ (0, 1/L] that

φ(x − αG_α(x)) = f(x − αG_α(x)) + λψ(x − αG_α(x))
  ≤ f(x) − αG_α(x)^T∇f(x) + (α/2)‖G_α(x)‖² + λψ(x − αG_α(x))   (from (5.0.7))
  ≤ f(z) + ∇f(x)^T(x − z) − αG_α(x)^T∇f(x) + (α/2)‖G_α(x)‖²
      + λψ(z) + (G_α(x) − ∇f(x))^T(x − αG_α(x) − z)            (from (5.0.8))
  = f(z) + λψ(z) + G_α(x)^T(x − z) − (α/2)‖G_α(x)‖²,

where the last equality follows from cancellation of several terms in the previous line. Thus (b) is proved. □
Theorem 5.0.9. Suppose that in problem (5.0.1), ψ is a closed convex function and that f is convex with Lipschitz continuous gradient on R^n, with Lipschitz constant L. Suppose that (5.0.1) attains a minimizer x^* (not necessarily unique) with optimal objective value φ^*. Then if α_k = 1/L for all k in (5.0.2), we have

φ(x^k) − φ^* ≤ L‖x^0 − x^*‖²/(2k),   k = 1, 2, . . . .

Proof. Since α_k = 1/L satisfies the conditions of Lemma 5.0.6, we can use part (b) of this result to show that the sequence φ(x^k) is decreasing and that the distance to the optimum x^* also decreases at each iteration. Setting x = z = x^k and α = α_k in Lemma 5.0.6, and recalling (5.0.5), we have

φ(x^{k+1}) = φ(x^k − α_k G_{α_k}(x^k)) ≤ φ(x^k) − (α_k/2)‖G_{α_k}(x^k)‖²,

justifying the first claim. For the second claim, we have by setting x = x^k, α = α_k, and z = x^* in Lemma 5.0.6 that

0 ≤ φ(x^{k+1}) − φ^* = φ(x^k − α_k G_{α_k}(x^k)) − φ^*
  ≤ G_{α_k}(x^k)^T(x^k − x^*) − (α_k/2)‖G_{α_k}(x^k)‖²
  = (1/(2α_k)) (‖x^k − x^*‖² − ‖x^k − x^* − α_k G_{α_k}(x^k)‖²)
  = (1/(2α_k)) (‖x^k − x^*‖² − ‖x^{k+1} − x^*‖²),   (5.0.10)

from which ‖x^{k+1} − x^*‖ ≤ ‖x^k − x^*‖ follows.

By setting α_k = 1/L in (5.0.10), and summing over k = 0, 1, 2, . . . , K − 1, we obtain from a telescoping sum on the right-hand side that

Σ_{k=0}^{K−1} (φ(x^{k+1}) − φ^*) ≤ (L/2)(‖x^0 − x^*‖² − ‖x^K − x^*‖²) ≤ (L/2)‖x^0 − x^*‖².

By monotonicity of φ(x^k), we have

K(φ(x^K) − φ^*) ≤ Σ_{k=0}^{K−1} (φ(x^{k+1}) − φ^*).

The result follows immediately by combining these last two expressions. □
6. Accelerating Gradient Methods

We showed in Section 4 that the basic steepest-descent method for solving (1.0.1) in the case of smooth f converges sublinearly at a 1/k rate when f is convex, and linearly at a rate of (1 − γ/L) when f is strongly convex (satisfying (3.3.13) for positive γ and L). We show now that quantitatively faster rates can be attained, while still requiring just one gradient evaluation per iteration, by using the idea of momentum. This means that at each iteration, we tend to continue to move along the length and direction of the previous step, making a small adjustment to take account of the latest gradient ∇f. Although it may not be obvious at first, there is some intuition behind this approach. The step that we took at the previous iteration was itself mostly based on the step prior to that, together with gradient information at the previous point. By thinking recursively, we can see that the last step is a compact encoding of all the gradient information that we have encountered at all iterates so far. If this information is aggregated properly, it can produce a richer overall picture of the behavior of the function than does the latest gradient alone, and thus has the potential to yield faster convergence. Sure enough, several intricate methods that use the momentum idea have been proposed, and have been widely successful. These methods are often called accelerated gradient methods. A major contributor in this area is Yurii Nesterov, dating to his seminal contribution in 1983 [31] and explicated further in his book [32] and many other publications. Another key contribution is [3], which derived an accelerated method for the regularized case (1.0.2).
6.1. Heavy-Ball Method  Possibly the most elementary method of momentum type is the heavy-ball method of Polyak [35]; see also [36]. Each iteration of this method has the form

(6.1.1)  x^{k+1} = x^k − α_k∇f(x^k) + β_k(x^k − x^{k−1}).

That is, a momentum term β_k(x^k − x^{k−1}) is added to the usual steepest-descent update. An intriguing analysis of this approach is possible for strongly convex quadratic functions:

(6.1.2)  min_{x ∈ R^n} f(x) := (1/2)x^TAx − b^Tx,

where the (constant) Hessian A has eigenvalues in the range [γ, L], with 0 < γ ≤ L (see [36]). For the following constant choices of steplength parameters:

α_k = α := 4/(√L + √γ)²,   β_k = β := (√L − √γ)/(√L + √γ),

it can be shown that ‖x^k − x^*‖ ≤ Cβ^k, for some (possibly large) constant C. We can use (3.3.7) to translate this into a bound on the function error, as follows:

f(x^k) − f(x^*) ≤ (L/2)‖x^k − x^*‖² ≤ (LC²/2)β^{2k},

allowing a direct comparison with the rate (4.4.3) for the steepest-descent method. If we suppose that L ≫ γ, we have

β ≈ 1 − 2√(γ/L),

so that we achieve approximate convergence f(x^k) − f(x^*) ≤ ε (for small positive ε) in O(√(L/γ) log(1/ε)) iterations, compared with O((L/γ) log(1/ε)) for steepest descent: a significant improvement.

The heavy-ball method is fundamental, but several points should be noted. First, the analysis for convex quadratic f does not generalize to general strongly convex nonlinear functions. Second, it requires knowledge of γ and L. Third, it is not a descent method; we usually have f(x^{k+1}) > f(x^k) for many k. These properties are not specific to heavy-ball: at least one of them is shared by other methods that use momentum.
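The speedup described above is easy to observe numerically. The following sketch (my own illustration, not from the text) runs the heavy-ball iteration (6.1.1) with Polyak's constant parameters on an ill-conditioned diagonal quadratic of the form (6.1.2); the problem dimensions and eigenvalue range are arbitrary choices.

```python
import numpy as np

# Heavy-ball sketch on the quadratic f(x) = 0.5 x^T A x - b^T x, using the
# constant parameters alpha = 4/(sqrt(L)+sqrt(gamma))^2 and
# beta = (sqrt(L)-sqrt(gamma))/(sqrt(L)+sqrt(gamma)) from the text.

def heavy_ball(A, b, iters=200):
    eigs = np.linalg.eigvalsh(A)
    gam, L = eigs[0], eigs[-1]
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gam)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gam)) / (np.sqrt(L) + np.sqrt(gam))
    x_prev = x = np.zeros(len(b))
    for _ in range(iters):
        grad = A @ x - b                                   # gradient of f
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x   # (6.1.1)
    return x

A = np.diag(np.linspace(0.01, 1.0, 50))   # eigenvalues in [0.01, 1]: kappa = 100
b = np.ones(50)
x_star = np.linalg.solve(A, b)
x_hb = heavy_ball(A, b)
```

With condition number L/γ = 100, the momentum term makes the error contract per iteration roughly like β ≈ 1 − 2√(γ/L) rather than 1 − γ/L, which is the source of the square-root improvement noted above.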
6.2. Conjugate Gradient  The conjugate gradient method for solving linear systems Ax = b (or, equivalently, minimizing the convex quadratic (6.1.2)), where A is symmetric positive definite, is one of the most important algorithms in computational science. Though invented earlier than the other algorithms discussed in this section (see [25]) and motivated in a different way, conjugate gradient clearly makes use of momentum. Its steps have the form

(6.2.1)  x^{k+1} = x^k + α_k p^k,   where p^k = −∇f(x^k) + ξ_k p^{k−1},

for some choices of α_k and ξ_k, which is identical to (6.1.1) when we define β_k appropriately. For strongly convex quadratic problems (6.1.2), conjugate gradient has excellent properties. It does not require prior knowledge of the range [γ, L] of the eigenvalue spectrum of A, choosing the steplengths α_k and ξ_k in an adaptive fashion. (In fact, α_k is chosen to be the exact minimizer along the search direction p^k.) The main arithmetic operation per iteration is one matrix-vector multiplication involving A, the same cost as a gradient evaluation for f in (6.1.2). Most importantly, there is a rich convergence theory that characterizes convergence in terms of the properties of the full spectrum of A (not just its extreme elements), showing in particular that good approximate solutions can be obtained quickly if the eigenvalues are clustered. Convergence to an exact solution of (6.1.2) in at most n iterations is guaranteed (provided, naturally, that the arithmetic is carried out exactly).
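The standard linear conjugate gradient iteration can be written compactly in the momentum form (6.2.1). The following sketch (my own illustration, not from the text) shows one common variant, with α_k the exact minimizer along p^k and ξ_k the usual ratio of successive squared residual norms.

```python
import numpy as np

# Linear conjugate gradient for Ax = b, A symmetric positive definite,
# written in the momentum form (6.2.1): p_k = -grad f(x_k) + xi_k p_{k-1}.

def conjugate_gradient(A, b, tol=1e-10):
    x = np.zeros(len(b))
    r = A @ x - b                   # r = grad f(x) for f in (6.1.2)
    p = -r
    for _ in range(len(b)):         # at most n iterations in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)          # exact line search along p
        x = x + alpha * p
        r_new = r + alpha * Ap              # updated gradient/residual
        xi = (r_new @ r_new) / (r @ r)      # momentum coefficient xi_k
        p = -r_new + xi * p                 # momentum form (6.2.1)
        r = r_new
    return x
```

Note that each iteration costs one product with A, matching the per-iteration cost of a gradient evaluation, as observed in the text.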
There has been much work over the years on extending the conjugate gradient method to general smooth functions f. Few of the theoretical properties for the quadratic case carry over to the nonlinear setting, though several results are known; see [34, Chapter 5], for example. Such "nonlinear" conjugate gradient methods vary in the accuracy with which they perform the line search for α_k in (6.2.1) and, more fundamentally, in the choice of ξ_k. The latter is done in a way that ensures that each search direction p^k is a descent direction. In some methods, ξ_k is set to zero on some iterations, which causes the method to take a steepest-descent step, effectively "restarting" the conjugate gradient method at the latest iterate. Despite these qualifications, nonlinear conjugate gradient is quite commonly used in practice, because of its minimal storage requirements and the fact that it requires only one gradient evaluation per iteration. Its popularity has been eclipsed in recent years by the limited-memory quasi-Newton method L-BFGS [28], [34, Section 7.2], which requires more storage (though still O(n)) and is similarly economical and easy to implement.
6.3. Nesterov’s Accelerated Gradient: Weakly Convex Case We now describe978
Nesterov’s method for (1.0.1) and prove its convergence — sublinear at a 1/k2979
rate — for the case of f convex with Lipschitz continuous gradients satisfying980
(3.3.6). Each iteration of this method has the form981
(6.3.1) xk+1 = xk −αk∇f(xk +βk(x
k − xk−1))+βk(x
k − xk−1),982
for choices of the parameters αk and βk to be defined. Note immediately thesimilarity to the heavy-ball formula (6.1.1). The only difference is that the extrap-olation step xk → xk + βk(x
k − xk−1) is taken before evaluation of the gradient∇f in (6.3.1), whereas in (6.1.1) the gradient is simply evaluated at xk. It is con-venient for purposes of analysis (and implementation) to introduce an auxiliarysequence yk, fix αk ≡ 1/L, and rewrite the update (6.3.1) as follows:
xk+1 = yk −1L∇f(yk),(6.3.2a)
Stephen J. Wright 33
yk+1 = xk+1 +βk+1(xk+1 − xk), k = 0, 1, 2, . . . ,(6.3.2b)
where we initialize at an arbitrary y0 and set x0 = y0. We define βk with reference983
to another scalar sequence λk in the following manner:984
(6.3.3) λ0 = 0, λk+1 =12
(1 +
√1 + 4λ2
k
), βk =
λk − 1λk+1
.985
Since λk > 1 for k = 1, 2, . . . , we have βk+1 > 0 for k = 0, 1, 2, . . . . It also follows986
from the definition of λk+1 that987
(6.3.4) λ2k+1 − λk+1 = λ2
k.988
Suppose that the solution of (1.0.1) is attained at a point x∗. Nesterov’s scheme989
achieves the following bound:990
(6.3.5) f(xT ) − f∗ 62L‖x0 − x∗‖2
(T + 1)2 , T = 1, 2, . . . .991
We prove this result by an argument from [3], as reformulated in [6, Section 3.7]. This analysis is famously technical, and intuition is hard to come by. Some recent progress has been made in deriving algorithms similar to (6.3.2) that have a plausible geometric or algebraic motivation; see [7, 19].

From convexity of f and (3.3.7), we have for any x and y that

f(y − ∇f(y)/L) − f(x)
  ≤ f(y − ∇f(y)/L) − f(y) + ∇f(y)^T(y − x)
  ≤ ∇f(y)^T(y − ∇f(y)/L − y) + (L/2)‖y − ∇f(y)/L − y‖² + ∇f(y)^T(y − x)
  = −(1/(2L))‖∇f(y)‖² + ∇f(y)^T(y − x).   (6.3.6)

Setting y = y^k and x = x^k in this bound, we obtain

f(x^{k+1}) − f(x^k) = f(y^k − ∇f(y^k)/L) − f(x^k)
  ≤ −(1/(2L))‖∇f(y^k)‖² + ∇f(y^k)^T(y^k − x^k)
  = −(L/2)‖x^{k+1} − y^k‖² − L(x^{k+1} − y^k)^T(y^k − x^k).   (6.3.7)

We now set y = y^k and x = x^* in (6.3.6), and use (6.3.2a) to obtain

(6.3.8)  f(x^{k+1}) − f(x^*) ≤ −(L/2)‖x^{k+1} − y^k‖² − L(x^{k+1} − y^k)^T(y^k − x^*).

Introducing the notation δ_k := f(x^k) − f(x^*), we multiply (6.3.7) by λ_{k+1} − 1 and add it to (6.3.8) to obtain

(λ_{k+1} − 1)(δ_{k+1} − δ_k) + δ_{k+1}
  ≤ −(L/2)λ_{k+1}‖x^{k+1} − y^k‖² − L(x^{k+1} − y^k)^T(λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*).

We multiply this bound by λ_{k+1}, and use (6.3.4) to obtain

λ_{k+1}²δ_{k+1} − λ_k²δ_k
  ≤ −(L/2)[‖λ_{k+1}(x^{k+1} − y^k)‖² + 2λ_{k+1}(x^{k+1} − y^k)^T(λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*)]
  = −(L/2)[‖λ_{k+1}x^{k+1} − (λ_{k+1} − 1)x^k − x^*‖² − ‖λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*‖²],   (6.3.9)

where in the final equality we used the identity ‖a‖² + 2a^Tb = ‖a + b‖² − ‖b‖². By multiplying (6.3.2b) by λ_{k+2}, and using λ_{k+2}β_{k+1} = λ_{k+1} − 1 from (6.3.3), we have

λ_{k+2}y^{k+1} = λ_{k+2}x^{k+1} + λ_{k+2}β_{k+1}(x^{k+1} − x^k) = λ_{k+2}x^{k+1} + (λ_{k+1} − 1)(x^{k+1} − x^k).

By rearranging this equality, we have

λ_{k+1}x^{k+1} − (λ_{k+1} − 1)x^k = λ_{k+2}y^{k+1} − (λ_{k+2} − 1)x^{k+1}.

By substituting into the first term on the right-hand side of (6.3.9), and using the definition

(6.3.10)  u^k := λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*,

we obtain

λ_{k+1}²δ_{k+1} − λ_k²δ_k ≤ −(L/2)(‖u^{k+1}‖² − ‖u^k‖²).

By summing both sides of this inequality over k = 0, 1, . . . , T − 1, and using λ_0 = 0, we obtain

λ_T²δ_T ≤ (L/2)(‖u^0‖² − ‖u^T‖²) ≤ (L/2)‖x^0 − x^*‖²,

so that

(6.3.11)  δ_T = f(x^T) − f(x^*) ≤ L‖x^0 − x^*‖²/(2λ_T²).

A simple induction confirms that λ_k ≥ (k + 1)/2 for k = 1, 2, . . . , and the result (6.3.5) follows by substituting this bound into (6.3.11).
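The scheme (6.3.2)-(6.3.3) is short to implement. The following sketch (my own illustration, not from the text) runs it on a convex least-squares objective; the problem data and iteration count are arbitrary choices, and the final test checks the gap against the guarantee (6.3.5).

```python
import numpy as np

# Nesterov's accelerated gradient (6.3.2) with the lambda_k sequence (6.3.3).
# grad: callable returning grad f, L: Lipschitz constant of grad f.

def nesterov(grad, L, x0, iters=2000):
    x = x0.copy()
    y = x0.copy()
    lam = 0.0                                   # lambda_0 = 0
    for _ in range(iters):
        x_new = y - grad(y) / L                 # step (6.3.2a)
        lam_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))    # lambda_{k+1}
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam_new ** 2))  # lambda_{k+2}
        beta = (lam_new - 1.0) / lam_next       # beta_{k+1} from (6.3.3)
        y = x_new + beta * (x_new - x)          # extrapolation (6.3.2b)
        x, lam = x_new, lam_new
    return x
```

Note that β_1 = 0 (since λ_1 = 1), so the first step is a plain gradient step; the momentum weight then grows toward 1 as k increases.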
6.4. Nesterov’s Accelerated Gradient: Strongly Convex Case We turn now to1016
Nesterov’s approach for smooth strongly convex functions, which satisfy (3.2.5)1017
with γ > 0. Again, we follow the proof in [6, Section 3.7], which is based on1018
the analysis in [32]. The method uses the same update formula (6.3.2) as in the1019
weakly convex case, and the same initialization, but with a different choice of1020
βk+1, namely:1021
(6.4.1) βk+1 ≡√L−√γ√
L+√γ
=
√κ− 1√κ+ 1
,1022
and initial values y0 = y0. The condition measure κ is defined in (3.3.12). We1023
prove the following convergence result.1024
Theorem 6.4.2. Suppose that f has Lipschitz continuous gradient ∇f with constant L, and that it is strongly convex with modulus of convexity γ and unique minimizer x^*. Then the method (6.3.2), (6.4.1) with starting point x^0 = y^0 satisfies

(6.4.3)  f(x^T) − f(x^*) ≤ ((L + γ)/2)‖x^0 − x^*‖² (1 − 1/√κ)^T.
Proof. The proof makes use of a family of strongly convex functions Φ_k(z) defined inductively as follows:

(6.4.4a)  Φ_0(z) = f(y^0) + (γ/2)‖z − y^0‖²,
(6.4.4b)  Φ_{k+1}(z) = (1 − 1/√κ)Φ_k(z) + (1/√κ)(f(y^k) + ∇f(y^k)^T(z − y^k) + (γ/2)‖z − y^k‖²).

Each Φ_k(·) is a quadratic, and an inductive argument shows that ∇²Φ_k(z) = γI for all k and all z. Thus, each Φ_k has the form

(6.4.5)  Φ_k(z) = Φ_k^* + (γ/2)‖z − v^k‖²,   k = 0, 1, 2, . . . ,

where v^k is the minimizer of Φ_k(·) and Φ_k^* is its optimal value. (From (6.4.4a), we have v^0 = y^0.) We note too that Φ_k becomes a tighter overapproximation to f as k → ∞. To show this, we use (3.3.9) to replace the final term in parentheses in (6.4.4b) by f(z), then subtract f(z) from both sides of (6.4.4b) to obtain

(6.4.6)  Φ_{k+1}(z) − f(z) ≤ (1 − 1/√κ)(Φ_k(z) − f(z)).

In the remainder of the proof, we show that the following bound holds:

(6.4.7)  f(x^k) ≤ min_z Φ_k(z) = Φ_k^*,   k = 0, 1, 2, . . . .

From the upper bound in Lemma 3.3.10, with x = x^*, we have that f(z) − f(x^*) ≤ (L/2)‖z − x^*‖². By combining this bound with (6.4.6) and (6.4.7), we have

f(x^k) − f(x^*) ≤ Φ_k^* − f(x^*)
  ≤ Φ_k(x^*) − f(x^*)
  ≤ (1 − 1/√κ)^k (Φ_0(x^*) − f(x^*))
  = (1 − 1/√κ)^k [(Φ_0(x^*) − f(x^0)) + (f(x^0) − f(x^*))]
  ≤ (1 − 1/√κ)^k ((γ + L)/2)‖x^0 − x^*‖².   (6.4.8)

The proof is completed by establishing (6.4.7), by induction on k. Since x^0 = y^0, it holds by definition at k = 0. By using the step formula (6.3.2a), the convexity property (3.3.8) (with x = y^k), and the inductive hypothesis, we have

f(x^{k+1}) ≤ f(y^k) − (1/(2L))‖∇f(y^k)‖²
  = (1 − 1/√κ)f(x^k) + (1 − 1/√κ)(f(y^k) − f(x^k)) + f(y^k)/√κ − (1/(2L))‖∇f(y^k)‖²
  ≤ (1 − 1/√κ)Φ_k^* + (1 − 1/√κ)∇f(y^k)^T(y^k − x^k) + f(y^k)/√κ − (1/(2L))‖∇f(y^k)‖².   (6.4.9)

Thus the claim is established (and the theorem is proved) if we can show that the right-hand side in (6.4.9) is bounded above by Φ_{k+1}^*.
Recalling the observation (6.4.5), we have by taking derivatives of both sides of (6.4.4b) with respect to z that

(6.4.10)  ∇Φ_{k+1}(z) = γ(1 − 1/√κ)(z − v^k) + ∇f(y^k)/√κ + γ(z − y^k)/√κ.

Since v^{k+1} is the minimizer of Φ_{k+1}, we can set ∇Φ_{k+1}(v^{k+1}) = 0 in (6.4.10) to obtain

(6.4.11)  v^{k+1} = (1 − 1/√κ)v^k + y^k/√κ − ∇f(y^k)/(γ√κ).

Subtracting y^k from both sides of this expression, and taking ‖·‖² of both sides, we obtain

(6.4.12)  ‖v^{k+1} − y^k‖² = (1 − 1/√κ)²‖y^k − v^k‖² + ‖∇f(y^k)‖²/(γ²κ) − 2((1 − 1/√κ)/(γ√κ))∇f(y^k)^T(v^k − y^k).

Meanwhile, by evaluating Φ_{k+1} at z = y^k, using both (6.4.5) and (6.4.4b), we obtain

Φ_{k+1}^* + (γ/2)‖y^k − v^{k+1}‖² = (1 − 1/√κ)Φ_k(y^k) + f(y^k)/√κ
  = (1 − 1/√κ)Φ_k^* + (γ/2)(1 − 1/√κ)‖y^k − v^k‖² + f(y^k)/√κ.   (6.4.13)

By substituting (6.4.12) into (6.4.13), we obtain

Φ_{k+1}^* = (1 − 1/√κ)Φ_k^* + f(y^k)/√κ + (γ(1 − 1/√κ)/(2√κ))‖y^k − v^k‖²
    − (1/(2L))‖∇f(y^k)‖² + (1 − 1/√κ)∇f(y^k)^T(v^k − y^k)/√κ
  ≥ (1 − 1/√κ)Φ_k^* + f(y^k)/√κ
    − (1/(2L))‖∇f(y^k)‖² + (1 − 1/√κ)∇f(y^k)^T(v^k − y^k)/√κ,   (6.4.14)

where we simply dropped a nonnegative term from the right-hand side to obtain the inequality. The final step is to show that

(6.4.15)  v^k − y^k = √κ(y^k − x^k),

which we do by induction. Note that v^0 = x^0 = y^0, so the claim holds for k = 0. We have

v^{k+1} − y^{k+1} = (1 − 1/√κ)v^k + y^k/√κ − ∇f(y^k)/(γ√κ) − y^{k+1}
  = √κ y^k − (√κ − 1)x^k − √κ∇f(y^k)/L − y^{k+1}
  = √κ x^{k+1} − (√κ − 1)x^k − y^{k+1}
  = √κ(y^{k+1} − x^{k+1}),   (6.4.16)

where the first equality is from (6.4.11), the second equality is from the inductive hypothesis, the third equality is from the iteration formula (6.3.2a), and the final equality is from the iteration formula (6.3.2b) with the definition of β_{k+1} from (6.4.1). We have thus proved (6.4.15), and by substituting this equality into (6.4.14), we obtain that Φ_{k+1}^* is an upper bound on the right-hand side of (6.4.9). This establishes (6.4.7) and thus completes the proof of the theorem. □
6.5. Lower Bounds on Rates  The term "optimal" in Nesterov's optimal method is used because the convergence rate achieved by the method is the best possible (possibly up to a constant), among algorithms that make use of gradient information at the iterates x^k. This claim can be proved by means of a carefully designed function, for which no method that makes use of all gradients observed up to and including iteration k (namely, ∇f(x^i), i = 0, 1, 2, . . . , k) can produce a sequence x^k that achieves a rate better than (6.3.5). The function proposed in [30] is a convex quadratic f(x) = (1/2)x^TAx − e_1^Tx, where A is the n × n tridiagonal matrix

    [  2  -1   0   0  ...   0 ]
    [ -1   2  -1   0  ...   0 ]
A = [  0  -1   2  -1  ...   0 ] ,   e_1 = (1, 0, 0, . . . , 0)^T.
    [         . . .           ]
    [  0  ...     -1   2  -1  ]
    [  0  ...      0  -1   2  ]
The solution x^* satisfies Ax^* = e_1; its components are x_i^* = 1 − i/(n + 1), for i = 1, 2, . . . , n. If we use x^0 = 0 as the starting point, and construct the iterate x^{k+1} as

x^{k+1} = x^k + Σ_{j=0}^{k} ξ_j∇f(x^j),

for some coefficients ξ_j, j = 0, 1, . . . , k, an elementary inductive argument shows that each iterate x^k can have nonzero entries only in its first k components. It follows that for any such algorithm, we have

(6.5.1)  ‖x^k − x^*‖² ≥ Σ_{j=k+1}^{n} (x_j^*)² = Σ_{j=k+1}^{n} (1 − j/(n + 1))².

A little arithmetic shows that

(6.5.2)  ‖x^k − x^*‖² ≥ (1/8)‖x^0 − x^*‖²,   k = 1, 2, . . . , n/2 − 1.

It can be shown further that

(6.5.3)  f(x^k) − f^* ≥ (3L/(32(k + 1)²))‖x^0 − x^*‖²,   k = 1, 2, . . . , n/2 − 1,

where L = ‖A‖₂. This lower bound on f(x^k) − f^* is within a constant factor of the upper bound (6.3.5).
The restriction k ≤ n/2 in the argument above is not fully satisfying. A more compelling example would show that the lower bound (6.5.3) holds for all k.
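The worst-case construction above is easy to examine numerically. The following sketch (my own illustration, not from the text) builds the tridiagonal matrix, checks the closed form for x^*, and confirms that gradient-combination iterates started from x^0 = 0 stay supported on their first k components.

```python
import numpy as np

# The lower-bound quadratic: A = tridiag(-1, 2, -1), f(x) = 0.5 x^T A x - e1^T x.

n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0

# Components of the solution of A x = e1 are x*_i = 1 - i/(n+1).
x_star = np.linalg.solve(A, e1)
formula = 1.0 - np.arange(1, n + 1) / (n + 1)

# Any iteration of the form x^{k+1} = x^k + sum_j xi_j grad f(x^j) (here,
# plain gradient steps with xi = -0.5) leaves components beyond the k-th at 0.
x = np.zeros(n)
for k in range(1, 4):
    x = x - 0.5 * (A @ x - e1)      # one gradient step
    assert np.count_nonzero(x[k:]) == 0
```

This support property is exactly what drives the bound (6.5.1): after k iterations the tail components of x^* are untouched, no matter how the coefficients ξ_j are chosen.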
7. Newton Methods

So far, we have dealt with methods that use first-order (gradient or subgradient) information about the objective function. We have shown that such algorithms can yield sequences of iterates that converge at linear or sublinear rates. We turn our attention in this chapter to methods that exploit second-derivative (Hessian) information. The canonical method here is Newton's method, named after Isaac Newton, who proposed a version of the method for polynomial equations in around 1670.

For many functions, including many that arise in data analysis, second-order information is not difficult to compute, in the sense that the functions that we deal with are simple (usually compositions of elementary functions). In comparison with first-order methods, there is a tradeoff. Second-order methods typically have locally superlinear convergence rates: once the iterates reach a neighborhood of a solution at which second-order sufficient conditions are satisfied, convergence is rapid. Moreover, their global convergence properties are attractive. With appropriate enhancements, they can provably avoid convergence to saddle points. But the costs of calculating and handling the second-order information and of computing the step are higher. Whether this tradeoff makes them appealing depends on the specifics of the application and on whether the second-derivative computations are able to take advantage of structure in the objective function.
We start by sketching the basic Newton's method for the unconstrained smooth optimization problem min f(x), and prove local convergence to a minimizer x^* that satisfies second-order sufficient conditions. Subsection 7.2 discusses performance of Newton's method on convex functions, where the use of Newton search directions in the line search framework (4.0.1) can yield global convergence. Modifications of Newton's method for nonconvex functions are discussed in Subsection 7.3. Subsection 7.4 discusses algorithms for smooth nonconvex functions that use gradient and Hessian information but guarantee convergence to points that approximately satisfy second-order necessary conditions. Some variants of these methods are related closely to the trust-region methods discussed in Subsection 7.3, but the motivation and mechanics are somewhat different.
7.1. Basic Newton’s Method Consider the problem1112
(7.1.1) min f(x),1113
where f : Rn → R is a Lipschitz twice continuously differentiable function, where1114
the Hessian has Lipschitz constant M, that is,1115
(7.1.2) ‖∇2f(x ′) −∇2f(x ′′)‖ 6M‖x ′ − x ′′‖.1116
Newton’s method generates a sequence of iterates xkk=0,1,2,....1117
Stephen J. Wright 39
A second-order Taylor series approximation to f around the current iterate xk1118
is1119
(7.1.3) f(xk + p) ≈ f(xk) +∇f(xk)Tp+ 12pT∇2f(xk)p.1120
When ∇2f(xk) is positive definite, the minimizer pk of the right-hand side is1121
unique; it is1122
(7.1.4) pk = −∇2f(xk)−1∇f(xk).1123
This is the Newton step. In its most basic form, then, Newton’s method is defined1124
by the following iteration:1125
(7.1.5) xk+1 = xk −∇2f(xk)−1∇f(xk).1126
We have the following local convergence result in the neighborhood of a point x^* satisfying second-order sufficient conditions.

Theorem 7.1.6. Consider the problem (7.1.1) with f twice continuously differentiable and its Hessian Lipschitz continuous with constant M, as in (7.1.2). Suppose that the second-order sufficient conditions are satisfied for the problem (7.1.1) at the point x^*, that is, ∇f(x^*) = 0 and ∇²f(x^*) ⪰ γI for some γ > 0. Then if ‖x^0 − x^*‖ ≤ γ/(2M), the sequence defined by (7.1.5) converges to x^* at a quadratic rate, with

(7.1.7)  ‖x^{k+1} − x^*‖ ≤ (M/γ)‖x^k − x^*‖²,   k = 0, 1, 2, . . . .
Proof. From (7.1.4) and (7.1.5), and using ∇f(x^*) = 0, we have

x^{k+1} − x^* = x^k − x^* − ∇²f(x^k)^{−1}∇f(x^k)
  = ∇²f(x^k)^{−1}[∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))],

so that

(7.1.8)  ‖x^{k+1} − x^*‖ ≤ ‖∇²f(x^k)^{−1}‖ ‖∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))‖.

By using Taylor's theorem (see (3.3.4) with x = x^k and p = x^* − x^k), we have

∇f(x^k) − ∇f(x^*) = ∫₀¹ ∇²f(x^k + t(x^* − x^k))(x^k − x^*) dt.

By using this result along with the Lipschitz condition (7.1.2), we have

‖∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))‖
  = ‖ ∫₀¹ [∇²f(x^k) − ∇²f(x^k + t(x^* − x^k))](x^k − x^*) dt ‖
  ≤ ∫₀¹ ‖∇²f(x^k) − ∇²f(x^k + t(x^* − x^k))‖ ‖x^k − x^*‖ dt
  ≤ ( ∫₀¹ Mt dt ) ‖x^k − x^*‖² = (M/2)‖x^k − x^*‖².   (7.1.9)

From the Wielandt-Hoffman inequality [26] and (7.1.2), we have that

|λ_min(∇²f(x^k)) − λ_min(∇²f(x^*))| ≤ ‖∇²f(x^k) − ∇²f(x^*)‖ ≤ M‖x^k − x^*‖,

where λ_min(·) denotes the smallest eigenvalue of a symmetric matrix. Thus for

(7.1.10)  ‖x^k − x^*‖ ≤ γ/(2M),

we have

λ_min(∇²f(x^k)) ≥ λ_min(∇²f(x^*)) − M‖x^k − x^*‖ ≥ γ − M(γ/(2M)) ≥ γ/2,

so that ‖∇²f(x^k)^{−1}‖ ≤ 2/γ. By substituting this result together with (7.1.9) into (7.1.8), we obtain

‖x^{k+1} − x^*‖ ≤ (2/γ)(M/2)‖x^k − x^*‖² = (M/γ)‖x^k − x^*‖²,

verifying the locally quadratic convergence rate. By applying (7.1.10) again, we have

‖x^{k+1} − x^*‖ ≤ ((M/γ)‖x^k − x^*‖)‖x^k − x^*‖ ≤ (1/2)‖x^k − x^*‖,

so, by arguing inductively, we see that the sequence converges to x^* provided that x^0 satisfies (7.1.10), as claimed. □
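The quadratic rate (7.1.7) is striking in practice: the number of correct digits roughly doubles at each step. The following sketch (my own illustration, not from the text) runs the basic iteration (7.1.5) on a separable strongly convex function whose gradient and Hessian have simple closed forms.

```python
import numpy as np

# Basic Newton iteration (7.1.5) on f(x) = sum_i (x_i^4/4 + x_i^2/2),
# whose unique minimizer is x* = 0. Per coordinate the Newton map is
# x -> 2 x^3 / (3 x^2 + 1), so errors shrink roughly cubically here,
# consistent with (at least) the quadratic rate of Theorem 7.1.6.

def newton(grad, hess, x0, iters=6):
    x = x0.copy()
    errs = []
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))   # step (7.1.5)
        errs.append(np.linalg.norm(x))              # distance to x* = 0
    return x, errs

grad = lambda x: x ** 3 + x                # gradient of f
hess = lambda x: np.diag(3 * x ** 2 + 1)   # Hessian of f (diagonal)
x, errs = newton(grad, hess, np.full(3, 0.5))
```

Starting from distance 0.5 per coordinate, a handful of iterations suffices to drive the error to machine-precision scale.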
Of course, we do not need to explicitly identify a starting point x^0 in the stated region of convergence. Any sequence that approaches x^* will eventually enter this region, and thereafter the quadratic convergence guarantees apply.

We have established that Newton's method converges rapidly once the iterates enter the neighborhood of a point x^* satisfying second-order sufficient optimality conditions. But what happens when we start far from such a point?
7.2. Newton’s Method for Convex Functions When f is strongly convex with1159
modulus γ and satisfies Lipschitz continuity of the gradient (3.3.6), the Hessian1160
∇2f(xk) is positive definite for all k, with all eigenvalues in the interval [γ,L].1161
Thus, the Newton direction (7.1.4) is well defined, and is a descent direction1162
satisfying the condition (4.5.1) with η = γ/L. To verify this claim, note first1163
‖pk‖ 6 ‖∇2f(xk)−1‖‖∇f(xk)‖ 6 1γ‖∇f(xk)‖.1164
Then
(pk)T∇f(xk) = −∇f(xk)T∇2f(xk)−1∇f(xk)
6 −1L‖∇f(xk)‖2
6 −γ
L‖∇f(xk)‖‖pk‖.
We can use the Newton direction in the line-search framework of Subsection 4.51165
to obtain a method for which xk → x∗, where x∗ is the (unique) global minimizer1166
of f. (This claim follows from the property (4.5.6) together with the fact that x∗ is1167
the only point for which ∇f(x∗) = 0.)1168
These global convergence properties are enhanced by the quadratic local con-1169
vergence property of Theorem 7.1.6 if we modify the line-search framework by1170
accepting the step length αk = 1 in (4.0.1) whenever it satisfies the conditions1171
(4.5.2). (It can be shown, by again using arguments based on Taylor’s theorem1172
Stephen J. Wright 41
(Theorem 3.3.1), that these conditions will be satisfied by αk = 1 for all xk suffi-1173
ciently close to the minimizer x∗.)1174
For f convex but not strongly convex, we are not guaranteed that the Hessian1175
∇2f(xk) is nonsingular, so the direction (7.1.4) may not be well defined. By adding1176
any positive number λk > 0 to the diagonal, however, we can ensure that the1177
modified Newton direction defined by1178
(7.2.1) pk = −[∇2f(xk) + λkI]−1∇f(xk),1179
is well defined and is a descent direction for f. For any η ∈ (0, 1) in (4.5.1),1180
we have by choosing λk large enough that λk/(L + λk) > η that the condition1181
(4.5.1) is satisfied too, so again we can use the resulting direction pk in the line-1182
search framework of Subsection 4.5. If the minimizer x∗ is unique and satisfies1183
a second-order sufficient condition, in particular ∇2f(x∗) positive definite, then1184
for k sufficiently large we have ∇2f(xk) is positive definite too, so for sufficiently1185
small η the unmodified Newton direction (with λk = 0 in (7.2.1)) will satisfy (4.5.1).1186
By using λk = 0 where possible, and accepting αk = 1 as the step length when-1187
ever it satisfies (4.5.2), we can obtain local quadratic convergence to x∗ under the1188
second-order sufficient condition.1189
7.3. Newton Methods for Nonconvex Functions For smooth nonconvex f, ∇2f(xk)
may be indefinite. The Newton direction (7.1.4) may not exist (when ∇2f(xk) is
singular) or may not be a descent direction (when ∇2f(xk) has negative eigenvalues).
However, we can still define a modified Newton direction as in (7.2.1),
which will be a descent direction for λk sufficiently large. For a given η in (4.5.1),
a sufficient condition for pk from (7.2.1) to satisfy (4.5.1) is that
(λk + λmin(∇2f(xk))) / (λk + L) ≥ η,
where λmin(∇2f(xk)) is the minimum eigenvalue of the Hessian, which may be
negative. Once again, if the iterates xk enter the neighborhood of a local solution
x∗ at which ∇2f(x∗) is positive definite, some enhancements of the strategy for
choosing λk and αk can recover the local quadratic convergence of Theorem 7.1.6.
Formula (7.2.1) is not the only way to modify the Newton direction to ensure
descent in a line-search framework. Other approaches are outlined in [34, Chapter 3].
One such technique is to modify the Cholesky factorization of ∇2f(xk)
(which is a prelude to solving the system (7.1.4) or (7.2.1) for pk) by adding positive
elements to the diagonal only as needed to allow the factorization to proceed
(that is, to avoid taking the square root of a negative number). Another technique
is to compute an eigenvalue decomposition ∇2f(xk) = QkΛkQkT (where Qk is
orthogonal and Λk is diagonal), then define Λ̃k to be a modified version of Λk in
which all the diagonals are positive. Then, following (7.1.4), pk can be defined as
pk := −QkΛ̃k−1QkT∇f(xk).
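A minimal sketch of this eigenvalue-modification strategy; the rule of replacing each eigenvalue λi by max(|λi|, δ) for a small threshold δ is one common choice (the text requires only that the modified diagonal be positive).

```python
import numpy as np

def eig_modified_newton_direction(g, H, delta=1e-6):
    """Modified Newton direction via the eigenvalue decomposition
    H = Q Lam Q^T: replace each eigenvalue lam_i by max(|lam_i|, delta),
    then form p = -Q Lam_mod^{-1} Q^T g."""
    w, Q = np.linalg.eigh(H)
    w_mod = np.maximum(np.abs(w), delta)
    return -Q @ ((Q.T @ g) / w_mod)
```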
A trust-region version of Newton’s method ensures convergence to a point satisfying
second-order necessary conditions. (This is in contrast to the line-search
framework discussed above, which in general can guarantee only that accumulation
points are stationary, when applied to nonconvex f.) The trust-region approach
was developed in the late 1970s and early 1980s, and has become popular
again recently because of this appealing global convergence behavior. A
trust-region Newton method also recovers quadratic convergence to solutions x∗
satisfying second-order sufficient conditions, without any special modifications.
We outline the trust-region approach below. Further details can be found in
[34, Chapter 4].
The subproblem to be solved at each iteration of a trust-region Newton algorithm is
(7.3.1) min_d f(xk) + ∇f(xk)Td + (1/2)dT∇2f(xk)d subject to ‖d‖2 ≤ ∆k.
The objective is a second-order Taylor-series approximation, while ∆k is the radius
of the trust region, that is, the region within which we trust the second-order model
to capture the true behavior of f. Somewhat surprisingly, the problem (7.3.1) is
not too difficult to solve, even when the Hessian ∇2f(xk) is indefinite. In fact, the
solution dk of (7.3.1) satisfies the linear system
(7.3.2) [∇2f(xk) + λI]dk = −∇f(xk), for some λ ≥ 0,
where λ is chosen such that ∇2f(xk) + λI is positive semidefinite and λ > 0 only
if ‖dk‖ = ∆k (see [29]). Thus the process of solving (7.3.1) reduces to a search for
the appropriate value of the scalar λ, for which specialized methods have been
devised.
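The search for λ can be sketched as follows. This version uses plain bisection, exploiting the fact that ‖dk(λ)‖ is decreasing in λ; it ignores the so-called hard case, and practical codes use the safeguarded Newton iteration of [29] instead.

```python
import numpy as np

def trust_region_subproblem(g, H, Delta):
    """Nearly exact solution of min_d g.T d + 0.5 d.T H d, ||d|| <= Delta,
    via bisection on lambda in (7.3.2).  A sketch only: the 'hard case'
    is ignored."""
    n = len(g)
    I = np.eye(n)
    # smallest lambda making H + lambda*I positive (semi)definite
    lam_lo = max(0.0, -np.linalg.eigvalsh(H)[0]) + 1e-12
    d = np.linalg.solve(H + lam_lo * I, -g)
    if np.linalg.norm(d) <= Delta:
        return d                       # trust-region bound inactive
    lam_hi = lam_lo + 1.0              # bracket: ||d(lam)|| decreases in lam
    while np.linalg.norm(np.linalg.solve(H + lam_hi * I, -g)) > Delta:
        lam_hi *= 2.0
    for _ in range(200):               # bisect until ||d(lam)|| is close to Delta
        lam = 0.5 * (lam_lo + lam_hi)
        if np.linalg.norm(np.linalg.solve(H + lam * I, -g)) > Delta:
            lam_lo = lam
        else:
            lam_hi = lam
    return np.linalg.solve(H + lam_hi * I, -g)
```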
For large-scale problems, it may be too expensive to solve (7.3.1) near-exactly,
since the process may require several factorizations of an n × n matrix (namely,
the coefficient matrix in (7.3.2), for different values of λ). A popular approach
for finding approximate solutions of (7.3.1), which can be used when ∇2f(xk)
is positive definite, is the dogleg method. In this method, the curved path traced
out by solutions of (7.3.2) for values of λ in the interval [0,∞) is approximated
by a simpler path consisting of two line segments. The first segment joins 0 to
the point dkC that minimizes the objective in (7.3.1) along the direction −∇f(xk),
while the second segment joins dkC to the pure Newton step defined in (7.1.4).
The approximate solution is taken to be the point at which this “dogleg” path
crosses the boundary of the trust region ‖d‖ ≤ ∆k. If the dogleg path lies entirely
inside the trust region, we take dk to be the pure Newton step.
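A sketch of the dogleg step just described, assuming ∇2f(xk) is positive definite:

```python
import numpy as np

def dogleg_step(g, H, Delta):
    """Dogleg approximation to the trust-region subproblem (7.3.1):
    follow -g to the Cauchy point, then head toward the Newton step,
    stopping where the path meets the boundary."""
    pN = np.linalg.solve(H, -g)                  # pure Newton step
    if np.linalg.norm(pN) <= Delta:
        return pN                                # whole path inside the region
    pC = -((g @ g) / (g @ H @ g)) * g            # Cauchy point: model min along -g
    if np.linalg.norm(pC) >= Delta:
        return -(Delta / np.linalg.norm(g)) * g  # boundary hit on first segment
    v = pN - pC                                  # second segment: ||pC + t v|| = Delta
    a, b, c = v @ v, 2.0 * (pC @ v), pC @ pC - Delta**2
    t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return pC + t * v
```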
Having discussed the trust-region subproblem (7.3.1), let us outline how it can
be used as the basis for a complete algorithm. A crucial role is played by the ratio
between the actual decrease in f, namely f(xk) − f(xk + dk), and the decrease in f
predicted by the quadratic objective in (7.3.1). Ideally, this ratio would be close
to 1. If it is greater than a small tolerance (say, 10−4), we accept the step
and proceed to the next iteration. Otherwise, we conclude that the trust-region
radius ∆k is too large, so we do not take the step; instead, we shrink the trust region
and re-solve (7.3.1) to obtain a new step. Additionally, when the actual-to-predicted
ratio is close to 1, we conclude that a larger trust region may hasten progress, so
we increase ∆k for the next iteration, provided that the bound ‖dk‖ ≤ ∆k really is
active at the solution of (7.3.1).
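Putting these pieces together, a skeleton of the full algorithm might look as follows. The thresholds (10−4 for acceptance, 0.75 for expansion, the factor 1/4 for shrinking) are typical values rather than prescriptions, and for brevity the subproblem is solved crudely by truncating the Newton step to the boundary, rather than via (7.3.2) or the dogleg method.

```python
import numpy as np

def trust_region_newton(f, grad, hess, x0, Delta=1.0, tol=1e-8, max_iter=200):
    """Skeleton trust-region Newton loop built around the
    actual-to-predicted decrease ratio (accept / shrink / expand)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(H, -g)
        if np.linalg.norm(d) > Delta:
            d *= Delta / np.linalg.norm(d)
        pred = -(g @ d + 0.5 * d @ H @ d)   # decrease predicted by the model
        actual = f(x) - f(x + d)            # actual decrease in f
        rho = actual / pred if pred > 0 else -1.0
        if rho > 1e-4:
            x = x + d                       # accept the step
        else:
            Delta *= 0.25                   # reject; shrink the trust region
        if rho > 0.75 and np.isclose(np.linalg.norm(d), Delta):
            Delta *= 2.0                    # bound was active: expand
    return x
```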
Unlike a basic line-search method, the trust-region Newton method can “escape”
from a saddle point. Suppose we have ∇f(xk) = 0 and ∇2f(xk) indefinite
with some strictly negative eigenvalues. Then the solution dk of (7.3.1) will be
nonzero, and the algorithm will step away from the saddle point, in the direction
of most negative curvature for ∇2f(xk).
Another appealing feature of the trust-region Newton approach is that when
the sequence it generates approaches a point x∗ satisfying second-order sufficient
conditions, the trust-region bound becomes inactive, and the method takes pure
Newton steps (7.1.4) for all sufficiently large k. Thus the local quadratic convergence
that characterizes Newton’s method is observed.
The basic difference between line-search and trust-region methods can be summarized
as follows. Line-search methods first choose a direction pk, then decide
how far to move along that direction. Trust-region methods do the opposite: They
choose the distance ∆k first, then find the direction that makes the best progress
for this step length.
7.4. Convergence to Second-Order Necessary Points Trust-region Newton methods
have the significant advantage of guaranteeing that any accumulation points
will satisfy second-order necessary conditions. A related approach based on cubic
regularization has similar properties, and comes with some additional complexity
guarantees. Cubic regularization requires the Hessian to be Lipschitz continuous,
as in (7.1.2). It follows that, for any M at least as large as the Lipschitz constant
in (7.1.2), the following cubic function yields a global upper bound for f:
(7.4.1) TM(z; x) := f(x) + ∇f(x)T(z − x) + (1/2)(z − x)T∇2f(x)(z − x) + (M/6)‖z − x‖^3.
Specifically, we have for any x that
f(z) ≤ TM(z; x), for all z.
The basic cubic regularization algorithm proceeds as follows, from a starting
point x0:
(7.4.2) xk+1 = arg min_z TM(z; xk), k = 0, 1, 2, . . . .
The complexity properties of this approach were analyzed in [33]. Variants on the
scheme were studied in [24] and [11, 12]. Implementation of this scheme involves
solution of the subproblem (7.4.2), which is not straightforward.
Rather than present the theory for (7.4.2), we describe an elementary algorithm
that makes use of the expansion (7.4.1) as well as the steepest-descent theory of
Subsection 4.1. Our algorithm aims to identify a point that approximately satisfies
second-order necessary conditions, that is,
(7.4.3) ‖∇f(x)‖ ≤ εg, λmin(∇2f(x)) ≥ −εH,
where εg and εH are two small constants. In addition to Lipschitz continuity of
the Hessian (7.1.2), we assume Lipschitz continuity of the gradient with constant
L (see (3.3.6)), and also that the objective f is lower-bounded by some number f̄.
Our algorithm takes steps of two types: a steepest-descent step, as in Subsection 4.1,
or a step in a negative curvature direction for ∇2f. Iteration k proceeds as follows:
(i) If ‖∇f(xk)‖ > εg, take the steepest descent step (4.1.1).
(ii) Otherwise, if λmin(∇2f(xk)) < −εH, choose pk to be the eigenvector corresponding
to the most negative eigenvalue of ∇2f(xk). Choose the size
and sign of pk such that ‖pk‖ = 1 and (pk)T∇f(xk) ≤ 0, and set
(7.4.4) xk+1 = xk + αkpk, where αk = 2εH/M.
(iii) If neither of these conditions holds, then xk satisfies the necessary conditions
(7.4.3), so is an approximate second-order-necessary point.
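The steps above can be sketched as follows, assuming the constants L and M are known (in practice they must be estimated), and using a full eigendecomposition of the Hessian for clarity:

```python
import numpy as np

def second_order_point(grad, hess, x0, eps_g, eps_H, L, M, max_iter=10000):
    """Two-phase scheme: steepest-descent steps with step length 1/L while
    the gradient is large; negative-curvature steps of length 2*eps_H/M
    otherwise; stop when (7.4.3) holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps_g:
            x = x - g / L                     # step (i): steepest descent
            continue
        w, Q = np.linalg.eigh(hess(x))
        if w[0] < -eps_H:
            p = Q[:, 0]                       # most negative curvature direction
            if g @ p > 0:
                p = -p                        # enforce p.T grad <= 0
            x = x + (2.0 * eps_H / M) * p     # step (ii), step length (7.4.4)
        else:
            break                             # (7.4.3) holds
    return x
```

Started at the saddle point of a simple double-well objective, the negative-curvature steps move the iterate off the saddle, after which steepest descent carries it to a minimizer.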
For the steepest-descent step (i), we have from (4.1.3) that
(7.4.5) f(xk+1) ≤ f(xk) − (1/(2L))‖∇f(xk)‖^2 ≤ f(xk) − εg^2/(2L).
For a step of type (ii), we have from (7.4.1) that
f(xk+1) ≤ f(xk) + αk∇f(xk)Tpk + (1/2)αk^2(pk)T∇2f(xk)pk + (1/6)Mαk^3‖pk‖^3
≤ f(xk) − (1/2)(2εH/M)^2 εH + (1/6)M(2εH/M)^3
(7.4.6) = f(xk) − (2/3)εH^3/M^2.
By aggregating (7.4.5) and (7.4.6), we have that at each xk for which the condition
(7.4.3) does not hold, we attain a decrease in the objective of at least
min(εg^2/(2L), (2/3)εH^3/M^2).
Using the lower bound f̄ on the objective f, we see that the number of iterations
K required must satisfy the condition
K min(εg^2/(2L), (2/3)εH^3/M^2) ≤ f(x0) − f̄,
from which we conclude that
K ≤ max(2Lεg^−2, (3/2)M^2εH^−3)(f(x0) − f̄).
Note that the maximum number of iterates required to identify a point for which
just the approximate stationarity condition ‖∇f(xk)‖ ≤ εg holds is at most
2Lεg^−2(f(x0) − f̄). (We can just omit the second-order part of the algorithm.) Note too that it is
easy to devise approximate versions of this algorithm with similar complexity. For
example, the negative curvature direction pk in step (ii) above can be replaced
by an approximation to the direction of most negative curvature, obtained by the
Lanczos iteration with random initialization.
In algorithms that make more complete use of the cubic model (7.4.1), the term
εg^−2 in the complexity expression becomes εg^−3/2, and the constants are different.
The subproblems in (7.4.1) are more complicated to solve than those in the simple
scheme above, however.
8. Conclusions
We have outlined various algorithmic tools from optimization that are useful
for solving problems in data analysis and machine learning, and presented their
basic theoretical properties. The intersection of optimization and machine learning
is a fruitful and very popular area of current research. All the major machine
learning conferences have a large contingent of optimization papers, and there is
a great deal of interest in developing algorithmic tools to meet new challenges
and in understanding their properties. The edited volume [39] contains a snapshot
of the state of the art circa 2010, but this is a fast-moving field and there have
been many developments since then.
Acknowledgments
I thank Ching-pei Lee for a close reading and many helpful suggestions on the
manuscript.
References
[1] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incomplete information, 48th Annual Allerton Conference on Communication, Control, and Computing, 2010, pp. 704–711. http://arxiv.org/abs/1006.4046.
[2] L. Balzano and S. J. Wright, Local convergence of an algorithm for subspace identification from partial data, Foundations of Computational Mathematics 14 (2014), 1–36. DOI: 10.1007/s10208-014-9227-7.
[3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (2009), no. 1, 183–202.
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–152.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning 3 (2011), no. 1, 1–122.
[6] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine Learning 8 (2015), no. 3–4, 231–357.
[7] S. Bubeck, Y. T. Lee, and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent, Technical Report arXiv:1506.08187, Microsoft Research, 2015.
[8] S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Mathematical Programming, Series B 95 (2003), 329–357.
[9] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics 9 (2009), 717–772.
[10] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, Journal of the ACM 58 (2011), no. 3, 11.
[11] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results, Mathematical Programming, Series A 127 (2011), 245–295.
[12] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity, Mathematical Programming, Series A 130 (2011), no. 2, 295–319.
[13] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for matrix decomposition, SIAM Journal on Optimization 21 (2011), no. 2, 572–596.
[14] Y. Chen and M. J. Wainwright, Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees, Technical Report arXiv:1509.03025, University of California-Berkeley, 2015.
[15] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297.
[16] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection, SIAM Journal on Matrix Analysis and Applications 30 (2008), 56–66.
[17] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review 49 (2007), no. 3, 434–448.
[18] T. Dasu and T. Johnson, Exploratory data mining and data cleaning, John Wiley & Sons, 2003.
[19] D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first-order method based on optimal quadratic averaging, Technical Report arXiv:1604.06543, University of Washington, 2016.
[20] J. Duchi, Introductory lectures on stochastic optimization, Stanford University, 2016. To appear in the IAS/Park City Mathematics Series volume on "Mathematics of Data".
[21] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming 55 (1992), 293–318.
[22] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly 3 (1956), 95–110.
[23] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (2008), no. 3, 432–441.
[24] A. Griewank, The modification of Newton’s method for unconstrained optimization by bounding cubic terms, Technical Report NA/12, DAMTP, Cambridge University, 1981.
[25] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards 49 (1952), 409–436.
[26] A. J. Hoffman and H. W. Wielandt, The variation of the spectrum of a normal matrix, Duke Mathematical Journal 20 (1953), no. 1, 37–39.
[27] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent converges to minimizers, Technical Report arXiv:1602.04915, University of California-Berkeley, 2016.
[28] D. C. Liu and J. Nocedal, On the limited-memory BFGS method for large scale optimization, Mathematical Programming 45 (1989), 503–528.
[29] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM Journal on Scientific and Statistical Computing 4 (1983), 553–572.
[30] A. S. Nemirovski and D. B. Yudin, Problem complexity and method efficiency in optimization, John Wiley, 1983.
[31] Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2), Doklady AN SSSR 269 (1983), 543–547.
[32] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science and Business Media, New York, 2004.
[33] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming, Series A 108 (2006), 177–205.
[34] J. Nocedal and S. J. Wright, Numerical Optimization, Second Edition, Springer, New York, 2006.
[35] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics 4 (1964), no. 5, 1–17.
[36] B. T. Polyak, Introduction to optimization, Optimization Software, 1987.
[37] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review 52 (2010), no. 3, 471–501.
[38] R. T. Rockafellar, Convex analysis, Princeton University Press, Princeton, N.J., 1970.
[39] S. Sra, S. Nowozin, and S. J. Wright (eds.), Optimization for machine learning, NIPS Workshop Series, MIT Press, 2011.
[40] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society B 58 (1996), 267–288.
[41] B. Turlach, W. N. Venables, and S. J. Wright, Simultaneous variable selection, Technometrics 47 (2005), no. 3, 349–363.
[42] S. J. Wright, Primal-dual interior-point methods, SIAM, Philadelphia, PA, 1997.
[43] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, Series B 151 (2015), 3–34.
[44] X. Yi, D. Park, Y. Chen, and C. Caramanis, Fast algorithms for robust PCA via gradient descent, Technical Report arXiv:1605.07784, University of Texas-Austin, 2016.
Computer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706
E-mail address: [email protected]