IAS/Park City Mathematics Series
Volume 00, Pages 000–000
S 1079-5634(XX)0000-0
Optimization Algorithms for Data Analysis

Stephen J. Wright
Abstract. We describe the fundamentals of algorithms for minimizing a smooth nonlinear function, and extensions of these methods to the sum of a smooth function and a convex nonsmooth function. Such objective functions are ubiquitous in data analysis applications, as we illustrate using several examples. We discuss methods that make use of gradient (first-order) information about the smooth part of the function, and also Newton methods that make use of Hessian (second-order) information. Convergence and complexity theory is outlined for each approach.
1. Introduction
In this article, we consider algorithms for solving smooth optimization problems, possibly with simple constraints or structured nonsmooth regularizers. One such canonical formulation is

(1.0.1)    min_{x ∈ R^n} f(x),

where f : R^n → R has at least Lipschitz continuous gradients. Additional assumptions about f, such as convexity and Lipschitz continuity of the Hessian, are introduced as needed. Another formulation we consider is

(1.0.2)    min_{x ∈ R^n} f(x) + λψ(x),

where f is as in (1.0.1), ψ : R^n → R is a function that is usually convex and usually nonsmooth, and λ > 0 is a regularization parameter.¹ We refer to (1.0.2) as a regularized minimization problem because the presence of the term involving ψ induces certain structural properties on the solution that make it more desirable or plausible in the context of the application. We describe iterative algorithms that generate a sequence {x^k}_{k=0,1,2,...} of points that, in the case of convex objective functions, converges to the set of solutions. (Some algorithms also generate other "auxiliary" sequences of iterates.)
2010 Mathematics Subject Classification. Primary 14Dxx; Secondary 14Dxx.
Key words and phrases. Park City Mathematics Institute.
¹ A set S is said to be convex if for any pair of points z′, z′′ ∈ S, we have αz′ + (1 − α)z′′ ∈ S for all α ∈ [0, 1]. A function φ : R^n → R is said to be convex if φ(αz′ + (1 − α)z′′) ≤ αφ(z′) + (1 − α)φ(z′′) for all z′, z′′ in the (convex) domain of φ and all α ∈ [0, 1].
We start in Section 2 by describing some canonical problems in data analysis and their formulation as optimization problems. After some preliminaries in Section 3, we describe in Section 4 algorithms that take steps based on the gradients ∇f(x^k). Extensions of these methods to the case (1.0.2) of regularized objectives are described in Section 5. Section 6 describes accelerated gradient methods, which achieve better worst-case complexity than basic gradient methods, while still only using first-derivative information. We discuss Newton's method in Section 7, outlining variants that can guarantee convergence to points that approximately satisfy second-order conditions for a local minimizer of a smooth nonconvex function.
Although we allow nonsmoothness in the regularization term in (1.0.2), we do not cover subgradient methods or mirror descent explicitly in this chapter. We also do not discuss stochastic gradient methods, a class of methods that is central to modern machine learning. All these topics are discussed in the contribution of J. Duchi to the current volume; see [20]. Other omissions include the following.
• Coordinate descent methods; see [43] for a review.
• Augmented Lagrangian methods, including the alternating direction method of multipliers (ADMM) [21]. The review [5] remains a good reference for the latter topic, especially as it applies to problems from data analysis.
• Semidefinite programming and conic optimization.
• Methods tailored specifically to linear or quadratic programming, such as the simplex method or interior-point methods (see [42] for a discussion of the latter).
• Quasi-Newton methods, which modify Newton's method by approximating the Hessian or its inverse, thus attaining attractive theoretical and practical performance without using any second-derivative information. For a discussion of these methods, see [34, Chapter 6]. One important method of this class, which is useful in data analysis and many other large-scale problems, is the limited-memory method L-BFGS [28]; see also [34, Section 7.2].

Our approach throughout is to give a concise description of some of the most important algorithmic tools for nonlinear smooth and regularized optimization, along with the basic convergence theory for each. In most cases, the theory is elementary enough to include here in its entirety. In the few remaining cases, we provide citations to works in which complete proofs can be found.
2. Optimization Formulations of Data Analysis Problems
In this section, we describe briefly some representative problems in data analysis and machine learning, emphasizing their formulation as optimization problems. Our list is by no means exhaustive. In many cases, there are a number of different ways to formulate a given application as an optimization problem. We do not try to describe all of them. But our list here gives a flavor of the interface between data analysis and optimization.
2.1. Setup Practical data sets are often extremely messy. Data may be mislabeled, noisy, incomplete, or otherwise corrupted. Much of the hard work in data analysis is done by professionals, familiar with the underlying applications, who "clean" the data and prepare it for analysis, while being careful not to change the essential properties that they wish to discern from the analysis. Dasu and Johnson [18] claim that "80% of data analysis is spent on the process of cleaning and preparing the data." We do not discuss this aspect of the process, focusing instead on the part of the data analysis pipeline in which the problem is formulated and solved.
The data set in a typical analysis problem consists of m objects:

(2.1.1)    D := {(a_j, y_j), j = 1, 2, . . . , m},

where a_j is a vector (or matrix) of features and y_j is an observation or label. (Each pair (a_j, y_j) has the same size and shape for all j = 1, 2, . . . , m.) The analysis task then consists of discovering a function φ such that φ(a_j) ≈ y_j for most j = 1, 2, . . . , m. The process of discovering the mapping φ is often called "learning" or "training."

The function φ is often defined in terms of a vector or matrix of parameters, which we denote by x or X. (Other notation also appears below.) With these parametrizations, the problem of identifying φ becomes a data-fitting problem: "Find the parameters x defining φ such that φ(a_j) ≈ y_j, j = 1, 2, . . . , m, in some optimal sense." Once we come up with a definition of the term "optimal," we have an optimization problem. Many such optimization formulations have objective functions of the "summation" type

(2.1.2)    L_D(x) := Σ_{j=1}^{m} ℓ(a_j, y_j; x),

where the jth term ℓ(a_j, y_j; x) is a measure of the mismatch between φ(a_j) and y_j, and x is the vector of parameters that determines φ.

One use of φ is to make predictions about future data items. Given another previously unseen item of data a of the same type as a_j, j = 1, 2, . . . , m, we predict that the label y associated with a would be φ(a). The mapping may also expose other structure and properties in the data set. For example, it may reveal that only a small fraction of the features in a_j are needed to reliably predict the label y_j. (This is known as feature selection.) The function φ or its parameter x may also reveal important structure in the data. For example, X could reveal a low-dimensional subspace that contains most of the a_j, or X could reveal a matrix with particular structure (low-rank, sparse) such that observations of X prompted by the feature vectors a_j yield results close to y_j.

Examples of labels y_j include the following.
• A real number, leading to a regression problem.
• A label, say y_j ∈ {1, 2, . . . , M}, indicating that a_j belongs to one of M classes. This is a classification problem. We have M = 2 for binary classification and M > 2 for multiclass classification.
• Null. Some problems only have feature vectors a_j and no labels. In this case, the data analysis task may consist of grouping the a_j into clusters (where the vectors within each cluster are deemed to be functionally similar), or identifying a low-dimensional subspace (or a collection of low-dimensional subspaces) that approximately contains the a_j. Such problems require the labels y_j to be learnt alongside the function φ. For example, in a clustering problem, y_j could represent the cluster to which a_j is assigned.
Even after cleaning and preparation, the setup above may contain many complications that need to be dealt with in formulating the problem in rigorous mathematical terms. The quantities (a_j, y_j) may contain noise, or may be otherwise corrupted. We would like the mapping φ to be robust to such errors. There may be missing data: parts of the vectors a_j may be missing, or we may not know all the labels y_j. The data may be arriving in streaming fashion rather than being available all at once. In this case, we would learn φ in an online fashion.

One particular consideration is that we wish to avoid overfitting the model to the data set D in (2.1.1). The particular data set D available to us can often be thought of as a finite sample drawn from some underlying larger (often infinite) collection of data, and we wish the function φ to perform well on the unobserved data points as well as the observed subset D. In other words, we want φ to be not too sensitive to the particular sample D that is used to define empirical objective functions such as (2.1.2). The optimization formulation can be modified in various ways to achieve this goal, by the inclusion of constraints or penalty terms that limit some measure of "complexity" of the function (such techniques are called generalization or regularization). Another approach is to terminate the optimization algorithm early, the rationale being that overfitting occurs mainly in the later stages of the optimization process.
2.2. Least Squares Probably the oldest and best-known data analysis problem is linear least squares. Here, the data points (a_j, y_j) lie in R^n × R, and we solve

(2.2.1)    min_x (1/2m) Σ_{j=1}^{m} (a_j^T x − y_j)^2 = (1/2m)‖Ax − y‖_2^2,

where A is the matrix whose rows are a_j^T, j = 1, 2, . . . , m, and y = (y_1, y_2, . . . , y_m)^T. In the terminology above, the function φ is defined by φ(a) := a^T x. (We could also introduce a nonzero intercept by adding an extra parameter β ∈ R and defining φ(a) := a^T x + β.) This formulation can be motivated statistically, as a maximum-likelihood estimate of x when the observations y_j are exact but for i.i.d.
Gaussian noise. Various modifications of (2.2.1) impose desirable structure on x and hence on φ. For example, Tikhonov regularization with a squared ℓ_2-norm, which is

min_x (1/2m)‖Ax − y‖_2^2 + λ‖x‖_2^2,    for some parameter λ > 0,

yields a solution x with less sensitivity to perturbations in the data (a_j, y_j). The LASSO formulation

(2.2.2)    min_x (1/2m)‖Ax − y‖_2^2 + λ‖x‖_1

tends to yield solutions x that are sparse, that is, containing relatively few nonzero components [40]. This formulation performs feature selection: The locations of the nonzero components in x reveal those components of a_j that are instrumental in determining the observation y_j. Besides its statistical appeal — predictors that depend on few features are potentially simpler and more comprehensible than those depending on many features — feature selection has practical appeal in making predictions about future data. Rather than gathering all components of a new data vector a, we need to find only the "selected" features, since only these are needed to make a prediction.

The LASSO formulation (2.2.2) is an important prototype for many problems in data analysis, in that it involves a regularization term λ‖x‖_1 that is nonsmooth and convex, but with relatively simple structure that can potentially be exploited by algorithms.
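One way an algorithm can exploit that simple structure is through the proximal operator of λ‖x‖_1, which is componentwise soft-thresholding. As a concrete illustration, here is a minimal proximal-gradient (ISTA) sketch in numpy; the function names and the fixed step size 1/L are our illustrative choices, not prescriptions from the text.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau*||.||_1: shrink each component toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, y, lam, iters=1000):
    """Minimize (1/2m)||Ax - y||_2^2 + lam*||x||_1 by proximal gradient (ISTA)."""
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2 / m   # Lipschitz constant of the smooth gradient
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / m    # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

Larger λ drives more components of the returned x exactly to zero, which is the feature-selection effect described above.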
2.3. Matrix Completion Matrix completion is in one sense a natural extension of least squares to problems in which the data a_j are naturally represented as matrices rather than vectors. Changing notation slightly, we suppose that each A_j is an n × p matrix, and we seek another n × p matrix X that solves

(2.3.1)    min_X (1/2m) Σ_{j=1}^{m} (〈A_j, X〉 − y_j)^2,

where 〈A, B〉 := trace(A^T B). Here we can think of the A_j as "probing" the unknown matrix X. Commonly considered types of observations are random linear combinations (where the elements of A_j are selected i.i.d. from some distribution) or single-element observations (in which each A_j has 1 in a single location and zeros elsewhere). A regularized version of (2.3.1), leading to solutions X that are low-rank, is

(2.3.2)    min_X (1/2m) Σ_{j=1}^{m} (〈A_j, X〉 − y_j)^2 + λ‖X‖_*,

where ‖X‖_* is the nuclear norm, which is the sum of singular values of X [37]. The nuclear norm plays a role analogous to the ℓ_1 norm in (2.2.2). Although the nuclear norm is a somewhat complex nonsmooth function, it is at least convex, so that the formulation (2.3.2) is also convex. This formulation can be shown to yield a statistically valid solution when the true X is low-rank and the observation matrices A_j satisfy a "restricted isometry" property, commonly satisfied by random matrices, but not by matrices with just one nonzero element. The formulation is also valid in a different context, in which the true X is incoherent (roughly speaking, it does not have a few elements that are much larger than the others), and the observations A_j are of single elements [9].

In another form of regularization, the matrix X is represented explicitly as a product of two "thin" matrices L and R, where L ∈ R^{n×r} and R ∈ R^{p×r}, with r ≪ min(n, p). We set X = LR^T in (2.3.1) and solve

(2.3.3)    min_{L,R} (1/2m) Σ_{j=1}^{m} (〈A_j, LR^T〉 − y_j)^2.

In this formulation, the rank r is "hard-wired" into the definition of X, so there is no need to include a regularizing term. This formulation is also typically much more compact than (2.3.2); the total number of elements in (L, R) is (n + p)r, which is much less than np. A disadvantage is that it is nonconvex. An active line of current research, pioneered in [8] and also drawing on statistical sources, shows that the nonconvexity is benign in many situations, and that under certain assumptions on the data (A_j, y_j), j = 1, 2, . . . , m, and careful choice of algorithmic strategy, good solutions can be obtained from the formulation (2.3.3). A clue to this good behavior is that although this formulation is nonconvex, it is in some sense an approximation to a tractable problem: If we have a complete observation of X, then a rank-r approximation can be found by performing a singular value decomposition of X, and defining L and R in terms of the r leading left and right singular vectors.
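That closing remark can be made concrete: given a fully observed X, rank-r factors come directly from a truncated SVD. A numpy sketch (splitting the singular values symmetrically between L and R is one arbitrary choice; the function name is ours):

```python
import numpy as np

def rank_r_factors(X, r):
    """Rank-r factors L, R with X ~ L @ R.T, from the leading r singular triples."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    L = U[:, :r] * np.sqrt(s[:r])    # n x r: scaled leading left singular vectors
    R = Vt[:r, :].T * np.sqrt(s[:r]) # p x r: scaled leading right singular vectors
    return L, R
```

By the Eckart–Young theorem, L @ R.T is the best rank-r approximation of X in the Frobenius norm.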
2.4. Nonnegative Matrix Factorization Some applications in computer vision, chemometrics, and document clustering require us to find factors L and R like those in (2.3.3) in which all elements are nonnegative. If the full matrix Y ∈ R^{n×p} is observed, this problem has the form

min_{L,R} ‖LR^T − Y‖_F^2,    subject to L ≥ 0, R ≥ 0.
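The text leaves the algorithm unspecified; one classical heuristic for this problem is the multiplicative-update scheme of Lee and Seung, which preserves nonnegativity by construction. A numpy sketch (the function name, random initialization, and eps safeguard against division by zero are our illustrative choices):

```python
import numpy as np

def nmf(Y, r, iters=2000, eps=1e-9):
    """Approximate nonnegative Y by L @ R.T with L, R >= 0 (multiplicative updates)."""
    n, p = Y.shape
    rng = np.random.default_rng(0)
    L = rng.random((n, r)) + eps
    R = rng.random((p, r)) + eps
    for _ in range(iters):
        # Elementwise updates: ratios of nonnegative quantities keep L, R >= 0.
        L *= (Y @ R) / (L @ (R.T @ R) + eps)
        R *= (Y.T @ L) / (R @ (L.T @ L) + eps)
    return L, R
```

Each update never increases ‖LR^T − Y‖_F^2, though convergence to a global minimizer is not guaranteed, since the problem is nonconvex.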
2.5. Sparse Inverse Covariance Estimation In this problem, the labels y_j are null, and the vectors a_j ∈ R^n are viewed as independent observations of a random vector A ∈ R^n, which has zero mean. The sample covariance matrix constructed from these observations is

S = (1/(m − 1)) Σ_{j=1}^{m} a_j a_j^T.

The element S_il is an estimate of the covariance between the ith and lth elements of the random vector A. Our interest is in calculating an estimate X of the inverse covariance matrix that is sparse. The structure of X yields important information about A. In particular, if X_il = 0, we can conclude that the ith and lth components of A are conditionally independent. (That is, they are independent given knowledge of the values of the other n − 2 components of A.) Stated another way, the nonzero locations in X indicate the arcs in the dependency graph whose nodes correspond to the n components of A.

One optimization formulation that has been proposed for estimating the sparse inverse covariance matrix X is the following:

(2.5.1)    min_{X ∈ SR^{n×n}, X ≻ 0}  〈S, X〉 − log det(X) + λ‖X‖_1,

where SR^{n×n} is the set of n × n symmetric matrices, X ≻ 0 indicates that X is positive definite, and ‖X‖_1 := Σ_{i,l=1}^{n} |X_il| (see [16, 23]).
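A sketch of the ingredients of (2.5.1) in numpy, assuming the observations a_j are stored as the rows of an array and have zero mean as stated; the function names are ours:

```python
import numpy as np

def sample_covariance(A_obs):
    """Sample covariance S = (1/(m-1)) * sum_j a_j a_j^T for zero-mean rows a_j."""
    m = A_obs.shape[0]
    return A_obs.T @ A_obs / (m - 1)

def graphical_lasso_objective(S, X, lam):
    """Objective of (2.5.1): <S, X> - log det(X) + lam * ||X||_1, for X positive definite."""
    sign, logdet = np.linalg.slogdet(X)  # slogdet avoids overflow in det for large n
    assert sign > 0, "X must be positive definite"
    return np.trace(S @ X) - logdet + lam * np.abs(X).sum()
```

Minimizing this convex objective over positive definite X (e.g., by the proximal methods of Section 5) yields the sparse inverse covariance estimate.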
2.6. Sparse Principal Components The setup for this problem is similar to the previous section, in that we have a sample covariance matrix S that is estimated from a number of observations of some underlying random vector. The principal components of this matrix are the eigenvectors corresponding to the largest eigenvalues. It is often of interest to find sparse principal components, approximations to the leading eigenvectors that also contain few nonzeros. An explicit optimization formulation of this problem is

(2.6.1)    max_{v ∈ R^n} v^T S v   s.t. ‖v‖_2 = 1, ‖v‖_0 ≤ k,

where ‖·‖_0 indicates the cardinality of v (that is, the number of nonzeros in v) and k is a user-defined parameter indicating a bound on the cardinality of v. The problem (2.6.1) can be formulated as a quadratic program with binary variables, but such intractable formulations are usually avoided. The following relaxation (due to [17]) replaces vv^T by a positive semidefinite proxy M ∈ SR^{n×n}:

(2.6.2)    max_{M ∈ SR^{n×n}} 〈S, M〉   s.t. M ⪰ 0, 〈I, M〉 = 1, ‖M‖_1 ≤ R,

for some parameter R > 0 that can be adjusted to attain the desired sparsity. This formulation is a convex optimization problem, in fact, a semidefinite programming problem.

This formulation technique can be generalized to find the leading r > 1 sparse principal components. Ideally, we would obtain these from a matrix V ∈ R^{n×r} whose columns are mutually orthogonal and have at most k nonzeros each. We can write a convex relaxation of this problem as

(2.6.3)    max_{M ∈ SR^{n×n}} 〈S, M〉   s.t. 0 ⪯ M ⪯ I, 〈I, M〉 = 1, ‖M‖_1 ≤ R,

which is again a semidefinite program. A more compact (but nonconvex) formulation is

max_{F ∈ R^{n×r}} 〈S, FF^T〉   s.t. ‖F‖_2 ≤ 1, ‖F‖_{2,1} ≤ R,

where ‖F‖_{2,1} := Σ_{i=1}^{n} ‖F_{i·}‖_2 [14]. The latter regularization term is often called a "group-sparse" or "group-LASSO" regularizer. (An early use of this type of regularizer was described in [41].)
2.7. Sparse Plus Low-Rank Matrix Decomposition Another useful paradigm is to decompose a partly or fully observed n × p matrix Y into the sum of a sparse matrix and a low-rank matrix. A convex formulation of the fully-observed problem is

min_{M,S} ‖M‖_* + λ‖S‖_1   s.t. Y = M + S,

where ‖S‖_1 := Σ_{i=1}^{n} Σ_{j=1}^{p} |S_ij| [10, 13]. Compact, nonconvex formulations that allow noise in the observations include the following:

min_{L,R,S} (1/2)‖LR^T + S − Y‖_F^2    (fully observed),
min_{L,R,S} (1/2)‖P_Φ(LR^T + S − Y)‖_F^2    (partially observed),

where Φ represents the locations of the observed entries of Y and P_Φ is projection onto this set [14, 44].

One application of these formulations is to robust PCA, where the low-rank part represents principal components and the sparse part represents "outlier" observations. Another application is to foreground-background separation in video processing. Here, each column of Y represents the pixels in one frame of video, whereas each row of Y shows the evolution of one pixel over time.
2.8. Subspace Identification In this application, the a_j ∈ R^n, j = 1, 2, . . . , m, are vectors that lie (approximately) in a low-dimensional subspace. The aim is to identify this subspace, expressed as the column subspace of a matrix X ∈ R^{n×r}. If the a_j are fully observed, an obvious way to solve this problem is to perform a singular value decomposition of the n × m matrix A = [a_j]_{j=1}^{m}, and take X to be the leading r left singular vectors. In interesting variants of this problem, however, the vectors a_j may be arriving in streaming fashion and may be only partly observed, for example in indices Φ_j ⊂ {1, 2, . . . , n}. We would thus need to identify a matrix X and vectors s_j ∈ R^r such that

P_{Φ_j}(a_j − X s_j) ≈ 0,   j = 1, 2, . . . , m.

The algorithm for identifying X, described in [1], is a manifold-projection scheme that takes steps in incremental fashion for each a_j in turn. Its validity relies on incoherence of the matrix X with respect to the principal axes, that is, the matrix X should not have a few elements that are much larger than the others. A local convergence analysis of this method is given in [2].
2.9. Support Vector Machines Classification via support vector machines (SVM) is a classical paradigm in machine learning. This problem takes as input data (a_j, y_j) with a_j ∈ R^n and y_j ∈ {−1, 1}, and seeks a vector x ∈ R^n and a scalar β ∈ R such that

(2.9.1a)    a_j^T x − β ≥ 1   when y_j = +1;
(2.9.1b)    a_j^T x − β ≤ −1  when y_j = −1.
Any pair (x, β) that satisfies these conditions defines a separating hyperplane in R^n, one that separates the "positive" cases {a_j | y_j = +1} from the "negative" cases {a_j | y_j = −1}. (In the language of Section 2.1, we could define the function φ as φ(a_j) = sign(a_j^T x − β).) Among all separating hyperplanes, the one that minimizes ‖x‖_2 is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point a_j of either class is greatest.

We can formulate the problem of finding a separating hyperplane as an optimization problem by defining an objective with the summation form (2.1.2):

(2.9.2)    H(x, β) = (1/m) Σ_{j=1}^{m} max(1 − y_j(a_j^T x − β), 0).

Note that the jth term in this summation is zero if the conditions (2.9.1) are satisfied, and positive otherwise. Even if no pair (x, β) exists for which H(x, β) = 0, a value (x, β) that minimizes (2.1.2) will be the one that comes as close as possible to satisfying (2.9.1), in some sense. A term λ‖x‖_2^2 (for some parameter λ > 0) is often added to (2.9.2), yielding the following regularized version:

(2.9.3)    H(x, β) = (1/m) Σ_{j=1}^{m} max(1 − y_j(a_j^T x − β), 0) + (1/2)λ‖x‖_2^2.
If λ is sufficiently small, and if separating hyperplanes exist, the pair (x, β) that minimizes (2.9.3) is the maximum-margin separating hyperplane. The maximum-margin property is consistent with the goals of generalizability and robustness. For example, if the observed data (a_j, y_j) is drawn from an underlying "cloud" of positive and negative cases, the maximum-margin solution usually does a reasonable job of separating other empirical data samples drawn from the same clouds, whereas a hyperplane that passes close by several of the observed data points may not do as well (see Figure 2.9.4).
Figure 2.9.4. Linear support vector machine classification, with the one class represented by circles and the other by squares. One possible choice of separating hyperplane is shown at left. If the observed data is an empirical sample drawn from a cloud of underlying data points, this plane does not do well in separating the two clouds (middle). The maximum-margin separating hyperplane does better (right).
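The regularized hinge-loss objective (2.9.3) is cheap to evaluate. A numpy sketch, in which the rows of A are the vectors a_j^T and the function name is ours:

```python
import numpy as np

def svm_objective(x, beta, A, y, lam):
    """H(x, beta) from (2.9.3): mean hinge loss plus (lam/2)*||x||_2^2."""
    margins = y * (A @ x - beta)                       # y_j (a_j^T x - beta)
    hinge = np.maximum(1.0 - margins, 0.0)             # zero iff (2.9.1) holds for j
    return hinge.mean() + 0.5 * lam * np.dot(x, x)
```

A point (x, β) satisfying (2.9.1) for every j gives a zero hinge term, matching the remark after (2.9.2).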
Often it is not possible to find a hyperplane that separates the positive and negative cases well enough to be useful as a classifier. One solution is to transform all of the raw data vectors a_j by a mapping ψ, then perform the support-vector-machine classification on the vectors ψ(a_j), j = 1, 2, . . . , m. The conditions (2.9.1) would thus be replaced by

(2.9.5a)    ψ(a_j)^T x − β ≥ 1   when y_j = +1;
(2.9.5b)    ψ(a_j)^T x − β ≤ −1  when y_j = −1,

leading to the following analog of (2.9.3):

(2.9.6)    H(x, β) = (1/m) Σ_{j=1}^{m} max(1 − y_j(ψ(a_j)^T x − β), 0) + (1/2)λ‖x‖_2^2.

When transformed back to R^n, the surface {a | ψ(a)^T x − β = 0} is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from (2.9.3).

By introducing artificial variables, the problem (2.9.6) (and (2.9.3)) can be formulated as a convex quadratic program, that is, a problem with a convex quadratic objective and linear constraints. By taking the dual of this problem, we obtain another convex quadratic program, in m variables:

(2.9.7)    min_{α ∈ R^m} (1/2)α^T Q α − 1^T α   subject to 0 ≤ α ≤ (1/λ)1, y^T α = 0,

where

Q_kl = y_k y_l ψ(a_k)^T ψ(a_l),   y = (y_1, y_2, . . . , y_m)^T,   1 = (1, 1, . . . , 1)^T.

Interestingly, problem (2.9.7) can be formulated and solved without explicit knowledge or definition of the mapping ψ. We need only a technique to define the elements of Q. This can be done with the use of a kernel function K : R^n × R^n → R, where K(a_k, a_l) replaces ψ(a_k)^T ψ(a_l) [4, 15]. This is the so-called "kernel trick." (The kernel function K can also be used to construct a classification function φ from the solution of (2.9.7).) A particularly popular choice of kernel is the Gaussian kernel:

K(a_k, a_l) := exp(−‖a_k − a_l‖^2/(2σ)),

where σ is a positive parameter.
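The kernel trick amounts to filling in the matrix Q of (2.9.7) entry by entry from K alone, with ψ never materialized. A numpy sketch using the Gaussian kernel as defined above (the O(m²) double loop is written for clarity, not speed; function names are ours):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2*sigma)), as defined in the text.
    d = a - b
    return np.exp(-np.dot(d, d) / (2.0 * sigma))

def dual_Q(A, y, sigma=1.0):
    """Q_kl = y_k y_l K(a_k, a_l) for the dual QP (2.9.7); rows of A are the a_j^T."""
    m = A.shape[0]
    K = np.array([[gaussian_kernel(A[k], A[l], sigma) for l in range(m)]
                  for k in range(m)])
    return (y[:, None] * y[None, :]) * K
```

Since K(a, a) = 1 and y_k^2 = 1, the diagonal of Q is identically 1 for this kernel.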
2.10. Logistic Regression Logistic regression can be viewed as a variant of binary support-vector machine classification, in which rather than the classification function φ giving an unqualified prediction of the class in which a new data vector a lies, it returns an estimate of the odds of a belonging to one class or the other. We seek an "odds function" p parametrized by a vector x ∈ R^n as follows:

(2.10.1)    p(a; x) := (1 + exp(a^T x))^{−1},

and aim to choose the parameter x so that

(2.10.2a)    p(a_j; x) ≈ 1  when y_j = +1;
(2.10.2b)    p(a_j; x) ≈ 0  when y_j = −1.

(Note the similarity to (2.9.1).) The optimal value of x can be found by maximizing a log-likelihood function:

(2.10.3)    L(x) := (1/m) [ Σ_{j: y_j = −1} log(1 − p(a_j; x)) + Σ_{j: y_j = 1} log p(a_j; x) ].

We can perform feature selection using this model by introducing a regularizer λ‖x‖_1, as follows:

(2.10.4)    max_x (1/m) [ Σ_{j: y_j = −1} log(1 − p(a_j; x)) + Σ_{j: y_j = 1} log p(a_j; x) ] − λ‖x‖_1,

where λ > 0 is a regularization parameter. As we see later, this term has the effect of producing a solution in which few components of x are nonzero, making it possible to evaluate p(a; x) by knowing only those components of a that correspond to the nonzeros in x.
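A direct transcription of (2.10.1) and (2.10.3) in numpy, assuming the labels y_j ∈ {−1, +1} are stored in an array and the rows of A are the a_j^T; function names are ours:

```python
import numpy as np

def p_odds(A, x):
    # Odds function (2.10.1), applied row-wise: p(a_j; x) = (1 + exp(a_j^T x))^(-1).
    return 1.0 / (1.0 + np.exp(A @ x))

def log_likelihood(x, A, y):
    """L(x) from (2.10.3): log p on the y_j = +1 terms, log(1 - p) on y_j = -1."""
    p = p_odds(A, x)
    m = len(y)
    return (np.sum(np.log(p[y == 1])) + np.sum(np.log(1.0 - p[y == -1]))) / m
```

At x = 0 every p(a_j; x) equals 1/2, so L(0) = log(1/2) regardless of the data; maximizing L(x) (or the regularized (2.10.4)) pushes the two sums toward zero from below.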
An important extension of this technique is to multiclass (or multinomial) logistic regression, in which the data vectors a_j belong to more than two classes. Such applications are common in modern data analysis. For example, in a speech recognition system, the M classes could each represent a phoneme of speech, one of the potentially thousands of distinct elementary sounds that can be uttered by humans in a few tens of milliseconds. A multinomial logistic regression problem requires a distinct odds function p_k for each class k ∈ {1, 2, . . . , M}. These functions are parametrized by vectors x_[k] ∈ R^n, k = 1, 2, . . . , M, defined as follows:

(2.10.5)    p_k(a; X) := exp(a^T x_[k]) / Σ_{l=1}^{M} exp(a^T x_[l]),   k = 1, 2, . . . , M,

where we define X := {x_[k] | k = 1, 2, . . . , M}. Note that for all a, we have p_k(a) ∈ (0, 1) for all k = 1, 2, . . . , M, and that Σ_{k=1}^{M} p_k(a) = 1. The functions (2.10.5) are (somewhat colloquially) referred to as performing a "softmax" on the quantities {a^T x_[l] | l = 1, 2, . . . , M}.

In the setting of multiclass logistic regression, the labels y_j are vectors in R^M, whose elements are defined as follows:

(2.10.6)    y_jk = 1 when a_j belongs to class k, and y_jk = 0 otherwise.

Similarly to (2.10.2), we seek to define the vectors x_[k] so that

(2.10.7a)    p_k(a_j; X) ≈ 1  when y_jk = 1;
(2.10.7b)    p_k(a_j; X) ≈ 0  when y_jk = 0.

The problem of finding values of x_[k] that satisfy these conditions can again be formulated as one of maximizing a log-likelihood:

(2.10.8)    L(X) := (1/m) Σ_{j=1}^{m} [ Σ_{ℓ=1}^{M} y_jℓ (x_[ℓ]^T a_j) − log( Σ_{ℓ=1}^{M} exp(x_[ℓ]^T a_j) ) ].

"Group-sparse" regularization terms can be included in this formulation to select a set of features in the vectors a_j, common to each class, that distinguish effectively between the classes.
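The softmax (2.10.5) is a short computation in numpy. In this sketch the columns of the array X play the role of the vectors x_[k]; subtracting the maximum is a standard overflow guard that leaves the ratio unchanged, not part of the definition:

```python
import numpy as np

def softmax_odds(a, X):
    """Class odds (2.10.5): p_k(a; X) = exp(a^T x_[k]) / sum_l exp(a^T x_[l]).

    X is n x M; column k is the parameter vector x_[k]."""
    z = a @ X          # z_k = a^T x_[k], k = 1..M
    z = z - z.max()    # overflow guard: shifts cancel in the ratio
    e = np.exp(z)
    return e / e.sum()
```

As noted in the text, each returned component lies in (0, 1) and the components sum to 1.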
2.11. Deep Learning Deep neural networks are often designed to perform the same function as multiclass logistic regression, that is, to classify a data vector a into one of M possible classes, where M > 2 is large in some key applications. The difference is that the data vector a undergoes a series of structured transformations before being passed through a "softmax" operator, a multiclass logistic regression classifier of the type described above.
Figure 2.11.1. A deep neural network, showing connections between adjacent layers.
The simple neural network shown in Figure 2.11.1 illustrates the basic ideas. In this figure, the data vector a_j enters at the bottom of the network, each node in the bottom layer corresponding to one component of a_j. The vector then moves upward through the network, undergoing a structured nonlinear transformation as it moves from one layer to the next. A typical form of this transformation, which converts the vector a_j^{l−1} at layer l − 1 to the vector a_j^l at layer l, is

a_j^l = σ(W_l a_j^{l−1} + g_l),   l = 1, 2, . . . , D,

where W_l is a matrix of dimension |a_j^l| × |a_j^{l−1}|, g_l is a vector of length |a_j^l|, σ is a componentwise nonlinear transformation, and D is the number of layers situated strictly between the bottom and top layers (referred to as "hidden layers"). Each arc in Figure 2.11.1 represents one of the elements of a transformation matrix W_l. We define a_j^0 to be the "raw" input vector a_j, and let a_j^D be the vector formed by the nodes at the second-top layer in Figure 2.11.1. Typical forms of the function
σ include the following, acting identically on each component t ∈ R of its input vector:
• Logistic function: t → 1/(1 + e^{−t});
• Hinge loss: t → max(t, 0);
• Bernoulli: a random function that outputs 1 with probability 1/(1 + e^{−t}) and 0 otherwise.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds to the odds of the input vector belonging to that class. As mentioned, the "softmax" operator is typically used to convert the transformed input vector in the second-top layer (layer D) to a set of odds at the top layer. Associated with each input vector a_j are labels y_jk, defined as in (2.10.6) to indicate which of the M classes a_j belongs to.
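The layer-to-layer recursion above is a short loop in numpy. A sketch using the hinge (ReLU) activation from the list above; representing w as a list of (W_l, g_l) pairs is our illustrative choice:

```python
import numpy as np

def forward(a, weights, sigma=lambda t: np.maximum(t, 0.0)):
    """Forward pass a^l = sigma(W_l a^{l-1} + g_l), l = 1, ..., D.

    `weights` is the list [(W_1, g_1), ..., (W_D, g_D)]; the returned vector
    a^D is what the top-layer softmax classifier acts on."""
    for W, g in weights:
        a = sigma(W @ a + g)   # one hidden layer: affine map, then componentwise sigma
    return a
```

With an empty `weights` list this returns the raw input, matching the remark below that multiclass logistic regression is the special case D = 0.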
The parameters in this neural network are the matrix-vector pairs (Wl, gl), l = 1, 2, . . . , D that transform the input vector a_j into its form a_j^D at the second-top layer, together with the parameters X of the softmax operation that takes place at the final (top) stage, where X is defined exactly as in the discussion of the previous section on multiclass logistic regression. We aim to choose all these parameters so that the network does a good job of classifying the training data. Using the notation w for the layer-to-layer transformations, that is,
w := (W1, g1, W2, g2, . . . , WD, gD),
we can write the loss function for deep learning as follows:
(2.11.2)  L(w, X) := −(1/m) Σ_{j=1}^{m} [ Σ_{ℓ=1}^{M} y_{jℓ} (x_{[ℓ]}^T a_j^D(w)) − log( Σ_{ℓ=1}^{M} exp(x_{[ℓ]}^T a_j^D(w)) ) ].
Here we write a_j^D(w) to make explicit the dependence of a_j^D on the transformations w = (W1, g1, W2, g2, . . . , WD, gD) as well as on the input vector a_j. (We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = a_j, j = 1, 2, . . . , m.)
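A forward pass through such a network, and the evaluation of the sample-average loss, can be sketched as follows. All shapes and names here are illustrative assumptions; we use the hinge activation, write the loss in the equivalent form (1/m) Σ_j [ log Σ_ℓ exp(·) − Σ_ℓ y_{jℓ}(·) ], and evaluate the softmax naively, without the max-subtraction usually added for numerical stability:

```python
import numpy as np

def forward(w, a):
    # w is a list of (W_l, g_l) pairs; a is the raw input vector a_j = a_j^0.
    # Each hidden layer applies an affine map followed by the hinge activation.
    for W, g in w:
        a = np.maximum(W @ a + g, 0.0)   # sigma(t) = max(t, 0), componentwise
    return a                             # a_j^D, the second-top layer

def loss(w, X, A, Y):
    # Sample-average loss of the form (2.11.2). A holds the input vectors a_j
    # as columns, Y holds the 0/1 labels y_{jl} (classes as rows), and X holds
    # the softmax parameters x_[l] as columns.
    m = A.shape[1]
    total = 0.0
    for j in range(m):
        z = X.T @ forward(w, A[:, j])    # scores x_[l]^T a_j^D(w), l = 1..M
        total += np.log(np.sum(np.exp(z))) - Y[:, j] @ z
    return total / m
```

With w an empty list this reduces to the multiclass logistic regression case D = 0 mentioned above, since forward then returns a_j unchanged.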
Neural networks in use for particular applications (in image recognition and speech recognition, for example, where they have been very successful) include many variants on the basic design above. These include restricted connectivity between layers (that is, enforcing structure on the matrices Wl, l = 1, 2, . . . , D), layer arrangements that are more complex than the linear layout illustrated in Figure 2.11.1 (with outputs coming from different levels, connections across nonadjacent layers, different componentwise transformations σ at different layers, and so on). Deep neural networks for practical applications are highly engineered objects.
The loss function (2.11.2) shares with many other applications the “summation” form (2.1.2), but it has several features that set it apart from the other applications discussed above. First, and possibly most important, it is nonconvex in the parameters w. There is reason to believe that the “landscape” of L is complex, with the global minimizer being exceedingly difficult to find. Second, the total number of parameters in (w, X) is usually very large. The most popular algorithms for minimizing (2.11.2) are of stochastic gradient type, which like most optimization methods come with no guarantee for finding the minimizer of a nonconvex function. Effective training of deep learning classifiers typically requires a great deal of data and computation power. Huge clusters of powerful computers, often using multicore processors, GPUs, and even specially architected processing units, are devoted to this task. Efficiency also requires many heuristics in the formulation and the algorithm (for example, in the choice of regularization functions and in the steplengths for stochastic gradient).
3. Preliminaries
We discuss here some foundations for the analysis of subsequent sections. These include useful facts about smooth and nonsmooth convex functions, Taylor’s theorem and some of its consequences, optimality conditions, and prox-operators.
In the discussion of this section, our basic assumption is that f is a mapping from R^n to R ∪ {+∞}, continuous on its effective domain D := {x | f(x) < ∞}. Further assumptions on f are introduced as needed.
3.1. Solutions Consider the problem of minimizing f in (1.0.1). We have the following terminology:
• x∗ is a local minimizer of f if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N.
• x∗ is a global minimizer of f if f(x) ≥ f(x∗) for all x ∈ R^n.
• x∗ is a strict local minimizer if it is a local minimizer on some neighborhood N and in addition f(x) > f(x∗) for all x ∈ N with x ≠ x∗.
• x∗ is an isolated local minimizer if there is a neighborhood N of x∗ such that f(x) ≥ f(x∗) for all x ∈ N and in addition, N contains no local minimizers other than x∗.
3.2. Convexity and Subgradients A convex set Ω ⊂ R^n has the property that
(3.2.1)  x, y ∈ Ω ⇒ (1 − α)x + αy ∈ Ω for all α ∈ [0, 1].
We usually deal with closed convex sets in this article. For a convex set Ω ⊂ R^n we define the indicator function IΩ(x) as follows:
IΩ(x) = 0 if x ∈ Ω, and IΩ(x) = +∞ otherwise.
Indicator functions are useful devices for deriving optimality conditions for constrained problems, and even for developing algorithms. The constrained optimization problem
(3.2.2)  min_{x∈Ω} f(x)
can be restated equivalently as follows:
(3.2.3)  min_x f(x) + IΩ(x).
We noted already that a convex function φ : R^n → R ∪ {+∞} has the following defining property:
(3.2.4)  φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y), for all x, y ∈ R^n and all α ∈ [0, 1].
The concepts of “minimizer” are simpler in the case of convex objective functions than in the general case. In particular, the distinction between “local” and “global” minimizers disappears. For f convex in (1.0.1), we have the following.
(a) Any local minimizer of (1.0.1) is also a global minimizer.
(b) The set of global minimizers of (1.0.1) is a convex set.
If there exists a value γ > 0 such that
(3.2.5)  φ((1 − α)x + αy) ≤ (1 − α)φ(x) + αφ(y) − (1/2)γα(1 − α)‖x − y‖₂²
for all x and y in the domain of φ, we say that φ is strongly convex with modulus of convexity γ.
We summarize some definitions and results about subgradients of convex functions here. For a more extensive discussion, see [20].
Definition 3.2.6. A vector v ∈ R^n is a subgradient of f at a point x if
f(x + d) ≥ f(x) + v^T d for all d ∈ R^n.
The subdifferential, denoted ∂f(x), is the set of all subgradients of f at x.
Subdifferentials satisfy a monotonicity property, as we show now.
Lemma 3.2.7. If a ∈ ∂f(x) and b ∈ ∂f(y), we have (a − b)^T (x − y) ≥ 0.
Proof. From convexity of f and the definitions of a and b, we have f(y) ≥ f(x) + a^T (y − x) and f(x) ≥ f(y) + b^T (x − y). The result follows by adding these two inequalities. □
We can easily characterize a minimizer in terms of the subdifferential.
Theorem 3.2.8. The point x∗ is a minimizer of a convex function f if and only if 0 ∈ ∂f(x∗).
Proof. If 0 ∈ ∂f(x∗), we have by substituting x = x∗ and v = 0 into Definition 3.2.6 that f(x∗ + d) ≥ f(x∗) for all d ∈ R^n, which implies that x∗ is a minimizer of f. The converse follows trivially by showing that v = 0 satisfies Definition 3.2.6 when x∗ is a minimizer. □
The subdifferential generalizes the concept of derivative of a smooth function.
Theorem 3.2.9. If f is convex and differentiable at x, then ∂f(x) = {∇f(x)}.
A converse of this result is also true. Specifically, if the subdifferential of a convex function f at x contains a single subgradient, then f is differentiable with gradient equal to this subgradient (see [38, Theorem 25.1]).
3.3. Taylor’s Theorem Taylor’s theorem is a foundational result for optimization of smooth nonlinear functions. It shows how smooth functions can be approximated locally by low-order (linear or quadratic) functions.
Theorem 3.3.1. Given a continuously differentiable function f : R^n → R, and given x, p ∈ R^n, we have that
(3.3.2)  f(x + p) = f(x) + ∫₀¹ ∇f(x + ξp)^T p dξ,
(3.3.3)  f(x + p) = f(x) + ∇f(x + ξp)^T p, for some ξ ∈ (0, 1).
If f is twice continuously differentiable, we have
(3.3.4)  ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + ξp) p dξ,
(3.3.5)  f(x + p) = f(x) + ∇f(x)^T p + ½ p^T ∇²f(x + ξp) p, for some ξ ∈ (0, 1).
We can derive an important consequence of this theorem when f is Lipschitz continuously differentiable with constant L, that is,
(3.3.6)  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y ∈ R^n.
From (3.3.2), we have
f(y) − f(x) − ∇f(x)^T (y − x) = ∫₀¹ [∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) dξ.
By using (3.3.6), we have
[∇f(x + ξ(y − x)) − ∇f(x)]^T (y − x) ≤ ‖∇f(x + ξ(y − x)) − ∇f(x)‖ ‖y − x‖ ≤ Lξ‖y − x‖².
By substituting this bound into the previous integral, we obtain
(3.3.7)  f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2)‖y − x‖².
For the remainder of this section, we assume that f is continuously differentiable and also convex. The definition of convexity (3.2.4) and the fact that ∂f(x) = {∇f(x)} imply that
(3.3.8)  f(y) ≥ f(x) + ∇f(x)^T (y − x), for all x, y ∈ R^n.
We defined “strong convexity with modulus γ” in (3.2.5). When f is differentiable, we have the following equivalent definition, obtained by rearranging (3.2.5) and letting α ↓ 0:
(3.3.9)  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (γ/2)‖y − x‖².
By combining this expression with (3.3.7), we have the following result.
Lemma 3.3.10. Given convex f satisfying (3.2.5), with ∇f uniformly Lipschitz continuous with constant L, we have for any x, y that
(3.3.11)  (γ/2)‖y − x‖² ≤ f(y) − f(x) − ∇f(x)^T (y − x) ≤ (L/2)‖y − x‖².
For later convenience, we define a condition number κ as follows:
(3.3.12)  κ := L/γ.
When f is twice continuously differentiable, we can characterize the constants γ and L in terms of the eigenvalues of the Hessian ∇²f(x). Specifically, we can show that (3.3.11) is equivalent to
(3.3.13)  γI ⪯ ∇²f(x) ⪯ LI, for all x.
When f is strictly convex and quadratic, κ defined in (3.3.12) is the condition number of the (constant) Hessian, in the usual sense of linear algebra.
Strongly convex functions have unique minimizers, as we now show.
Theorem 3.3.14. Let f be differentiable and strongly convex with modulus γ > 0. Then the minimizer x∗ of f exists and is unique.
Proof. We show first that for any point x0, the level set {x | f(x) ≤ f(x0)} is closed and bounded, and hence compact. Suppose for contradiction that there is a sequence {x^ℓ} such that ‖x^ℓ‖ → ∞ and
(3.3.15)  f(x^ℓ) ≤ f(x0).
By strong convexity of f, we have for some γ > 0 that
f(x^ℓ) ≥ f(x0) + ∇f(x0)^T (x^ℓ − x0) + (γ/2)‖x^ℓ − x0‖².
By rearranging slightly, and using (3.3.15), we obtain
(γ/2)‖x^ℓ − x0‖² ≤ −∇f(x0)^T (x^ℓ − x0) ≤ ‖∇f(x0)‖ ‖x^ℓ − x0‖.
By dividing both sides by (γ/2)‖x^ℓ − x0‖, we obtain ‖x^ℓ − x0‖ ≤ (2/γ)‖∇f(x0)‖ for all ℓ, which contradicts unboundedness of {x^ℓ}. Thus, the level set is bounded. Since it is also closed (by continuity of f), it is compact.
Since f is continuous, it attains its minimum on the compact level set, which is also the solution of min_x f(x), and we denote it by x∗. Suppose for contradiction that the minimizer is not unique, so that we have two points x∗1 and x∗2 that minimize f. Obviously, these points must attain equal objective values, so that f(x∗1) = f(x∗2) = f∗ for some f∗. By taking (3.2.5) with φ = f, x = x∗1, y = x∗2, and α = 1/2, we obtain
f((x∗1 + x∗2)/2) ≤ ½(f(x∗1) + f(x∗2)) − (1/8)γ‖x∗1 − x∗2‖² < f∗,
so the point (x∗1 + x∗2)/2 has a smaller function value than both x∗1 and x∗2, contradicting our assumption that x∗1 and x∗2 are both minimizers. Hence, the minimizer x∗ is unique. □
3.4. Optimality Conditions for Smooth Functions We consider the case in which f is a smooth (twice continuously differentiable) function, not necessarily convex. Before designing algorithms to find a minimizer of f, we need to identify properties of f and its derivatives at a point x̄ that tell us whether or not x̄ is a minimizer, of one of the types described in Subsection 3.1. We call such properties optimality conditions.
A first-order necessary condition for optimality is that ∇f(x̄) = 0. More precisely, if x̄ is a local minimizer, then ∇f(x̄) = 0. We can prove this by using Taylor’s theorem. Supposing for contradiction that ∇f(x̄) ≠ 0, we can show by setting x = x̄ and p = −α∇f(x̄) for α > 0 in (3.3.3) that f(x̄ − α∇f(x̄)) < f(x̄) for all α > 0 sufficiently small. Thus any neighborhood of x̄ contains points x with f(x) < f(x̄), so x̄ cannot be a local minimizer.
If f is convex, in addition to being smooth, the condition ∇f(x̄) = 0 is sufficient for x̄ to be a global solution. This claim follows immediately from Theorems 3.2.8 and 3.2.9.
A second-order necessary condition for x̄ to be a local solution is that ∇f(x̄) = 0 and ∇²f(x̄) is positive semidefinite. The proof is by an argument similar to that of the first-order necessary condition, but using the second-order Taylor series expansion (3.3.5) instead of (3.3.3). A second-order sufficient condition is that ∇f(x̄) = 0 and ∇²f(x̄) is positive definite. This condition guarantees that x̄ is a strict local minimizer, that is, there is a neighborhood of x̄ such that x̄ has a strictly smaller function value than all other points in this neighborhood. Again, the proof makes use of (3.3.5).
We call x̄ a stationary point for smooth f if it satisfies the first-order necessary condition ∇f(x̄) = 0. Stationary points are not necessarily local minimizers. In fact, local maximizers satisfy the same condition. More interestingly, stationary points can be saddle points. These are points for which there exist directions u and v such that f(x̄ + αu) < f(x̄) and f(x̄ + αv) > f(x̄) for all positive α sufficiently small. When the Hessian ∇²f(x̄) has both strictly positive and strictly negative eigenvalues, it follows from (3.3.5) that x̄ is a saddle point. When ∇²f(x̄) is positive semidefinite or negative semidefinite, second derivatives alone are insufficient to classify x̄; higher-order derivative information is needed.
3.5. Proximal Operators and the Moreau Envelope Here we present some tools for analyzing the convergence of algorithms for the regularized problem (1.0.2), where the objective is the sum of a smooth function and a convex (usually nonsmooth) function.
For a closed proper convex function h and a positive scalar λ, we define the Moreau envelope as
(3.5.1)  M_{λ,h}(x) := inf_u { h(u) + (1/(2λ))‖u − x‖² } = (1/λ) inf_u { λh(u) + ½‖u − x‖² }.
The prox-operator of the function λh is the value of u that achieves the infimum in (3.5.1), that is,
(3.5.2)  prox_{λh}(x) := arg min_u { λh(u) + ½‖u − x‖² }.
From optimality properties for this problem, we have
(3.5.3)  0 ∈ λ∂h(prox_{λh}(x)) + (prox_{λh}(x) − x).
The Moreau envelope can be seen as a smoothing or regularization of the function h. It has a finite value for all x, even when h takes on infinite values for some x ∈ R^n. In fact, it is differentiable everywhere, with gradient
∇M_{λ,h}(x) = (1/λ)(x − prox_{λh}(x)).
Moreover, x∗ is a minimizer of h if and only if it is a minimizer of M_{λ,h}.
The proximal operator satisfies a nonexpansiveness property. From the optimality conditions (3.5.3) at two points x and y, we have
x − prox_{λh}(x) ∈ λ∂h(prox_{λh}(x)),  y − prox_{λh}(y) ∈ λ∂h(prox_{λh}(y)).
By applying monotonicity (Lemma 3.2.7), we have
(1/λ)((x − prox_{λh}(x)) − (y − prox_{λh}(y)))^T (prox_{λh}(x) − prox_{λh}(y)) ≥ 0,
which by rearrangement and application of the Cauchy-Schwarz inequality yields
‖prox_{λh}(x) − prox_{λh}(y)‖² ≤ (x − y)^T (prox_{λh}(x) − prox_{λh}(y)) ≤ ‖x − y‖ ‖prox_{λh}(x) − prox_{λh}(y)‖,
from which we obtain ‖prox_{λh}(x) − prox_{λh}(y)‖ ≤ ‖x − y‖, as claimed.
We list the prox-operator for several instances of h that are common in data analysis applications. These definitions are useful in implementing the prox-gradient algorithms of Section 5.
• h(x) = 0 for all x, for which we have prox_{λh}(x) = x. (This observation is useful in proving that the prox-gradient method reduces to the familiar steepest descent method when the objective contains no regularization term.)
• h(x) = IΩ(x), the indicator function for a closed convex set Ω. In this case, we have for any λ > 0 that
prox_{λIΩ}(x) = arg min_u { λIΩ(u) + ½‖u − x‖² } = arg min_{u∈Ω} ½‖u − x‖²,
which is simply the projection of x onto the set Ω.
• h(x) = ‖x‖₁. By substituting into definition (3.5.2) we see that the minimization separates into its n separate components, and that the ith component of prox_{λ‖·‖₁}(x) is
[prox_{λ‖·‖₁}(x)]ᵢ = arg min_{uᵢ} { λ|uᵢ| + ½(uᵢ − xᵢ)² }.
We can thus verify that
(3.5.4)  [prox_{λ‖·‖₁}(x)]ᵢ = xᵢ − λ if xᵢ > λ;  0 if xᵢ ∈ [−λ, λ];  xᵢ + λ if xᵢ < −λ,
an operation that is known as soft-thresholding.
• h(x) = ‖x‖₀, where ‖x‖₀ denotes the cardinality of the vector x, its number of nonzero components. Although this h is not a convex function (as we can see by considering convex combinations of the vectors (0, 1)^T and (1, 0)^T in R²), its prox-operator is well defined, and is known as hard thresholding:
[prox_{λ‖·‖₀}(x)]ᵢ = xᵢ if |xᵢ| ≥ √(2λ);  0 if |xᵢ| < √(2λ).
As in (3.5.4), the definition (3.5.2) separates into n individual components.
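The prox-operators listed above take only a few lines each to implement; a vectorized sketch (function names are ours):

```python
import numpy as np

def prox_l1(x, lam):
    # Soft-thresholding (3.5.4): shrink each component toward 0 by lam.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    # Hard thresholding: zero out components with |x_i| < sqrt(2*lam).
    return np.where(np.abs(x) >= np.sqrt(2.0 * lam), x, 0.0)

def prox_indicator_box(x, lo, hi):
    # Prox of the indicator of the box [lo, hi]^n: the projection onto the box.
    return np.clip(x, lo, hi)
```

The nonexpansiveness property shown above can be observed directly on prox_l1: the distance between the outputs never exceeds the distance between the inputs.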
3.6. Convergence Rates An important measure for evaluating algorithms is the rate of convergence to zero of some measure of error. For smooth f, we may be interested in how rapidly the sequence of gradient norms ‖∇f(x^k)‖ converges to zero. For nonsmooth convex f, a measure of interest may be convergence to zero of dist(0, ∂f(x^k)) (the sequence of distances from 0 to the subdifferential ∂f(x^k)). Other error measures for which we may be able to prove convergence rates include ‖x^k − x∗‖ (where x∗ is a solution) and f(x^k) − f∗ (where f∗ is the optimal value of the objective function f). For generality, we denote by {φ_k} the sequence of nonnegative scalars, converging to zero, whose rate we wish to obtain.
We say that linear convergence holds if there is some σ ∈ (0, 1) such that
(3.6.1)  φ_{k+1}/φ_k ≤ 1 − σ, for all k sufficiently large.
(This property is sometimes also known as geometric or exponential convergence, but the term linear is standard in the optimization literature.) It follows from (3.6.1) that there is some positive constant C such that
(3.6.2)  φ_k ≤ C(1 − σ)^k, k = 1, 2, . . . .
While (3.6.1) implies (3.6.2), the converse does not hold. The sequence
φ_k = 2^{−k} if k is even;  0 if k is odd,
satisfies (3.6.2) with C = 1 and σ = 0.5, but does not satisfy (3.6.1). To distinguish between these two slightly different definitions, (3.6.1) is sometimes called Q-linear while (3.6.2) is called R-linear.
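The distinction can be checked numerically; the sketch below (ours) confirms that the alternating sequence satisfies the R-linear bound (3.6.2) with C = 1 and σ = 0.5, while the Q-linear ratio condition (3.6.1) cannot hold, since φ_k = 0 is followed by φ_{k+1} > 0 for every odd k:

```python
def phi(k):
    # The alternating sequence from the text: 2^(-k) for even k, 0 for odd k.
    return 2.0 ** (-k) if k % 2 == 0 else 0.0

def satisfies_r_linear(K, C=1.0, sigma=0.5):
    # R-linear bound (3.6.2): phi_k <= C (1 - sigma)^k for k = 0, ..., K-1.
    return all(phi(k) <= C * (1.0 - sigma) ** k for k in range(K))

def violates_q_linear(K):
    # Q-linear (3.6.1) fails: a zero term followed by a positive term makes
    # any ratio bound phi_{k+1}/phi_k <= 1 - sigma impossible.
    return any(phi(k) == 0.0 and phi(k + 1) > 0.0 for k in range(K))
```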
Sublinear convergence is, as its name suggests, slower than linear. Several varieties of sublinear convergence are encountered in optimization algorithms for data analysis, including the following:
(3.6.3a)  φ_k ≤ C/√k, k = 1, 2, . . . ,
(3.6.3b)  φ_k ≤ C/k, k = 1, 2, . . . ,
(3.6.3c)  φ_k ≤ C/k², k = 1, 2, . . . ,
where in each case, C is some positive constant.
Superlinear convergence occurs when the constant σ ∈ (0, 1) in (3.6.1) can be chosen arbitrarily close to 1. Specifically, we say that the sequence {φ_k} converges Q-superlinearly to 0 if
(3.6.4)  lim_{k→∞} φ_{k+1}/φ_k = 0.
Q-quadratic convergence occurs when
(3.6.5)  φ_{k+1}/φ_k² ≤ C, k = 1, 2, . . . ,
for some sufficiently large C. We say that the convergence is R-superlinear if there is a Q-superlinearly convergent sequence {ψ_k} that dominates {φ_k} (that is, 0 ≤ φ_k ≤ ψ_k for all k). R-quadratic convergence is defined similarly. Quadratic and superlinear rates are associated with higher-order methods, such as Newton and quasi-Newton methods.
When a convergence rate applies globally, from any reasonable starting point, it can be used to derive a complexity bound for the algorithm, which takes the form of a bound on the number of iterations K required to reduce φ_k below some specified tolerance ε. For a sequence satisfying the R-linear convergence condition (3.6.2), a sufficient condition for φ_K ≤ ε is C(1 − σ)^K ≤ ε. By using the estimate log(1 − σ) ≤ −σ for all σ ∈ (0, 1), we have that
C(1 − σ)^K ≤ ε ⇔ K log(1 − σ) ≤ log(ε/C) ⇐ K ≥ log(C/ε)/σ.
It follows that for linearly convergent algorithms, the number of iterations required to converge to a tolerance ε depends logarithmically on 1/ε and inversely on the rate constant σ. For an algorithm that satisfies the sublinear rate (3.6.3a), a sufficient condition for φ_K ≤ ε is C/√K ≤ ε, which is equivalent to K ≥ (C/ε)², so the complexity is O(1/ε²). Similar analyses for (3.6.3b) reveal complexity of O(1/ε), while for (3.6.3c), we have complexity O(1/√ε).
For quadratically convergent methods, the complexity is doubly logarithmic in ε (that is, O(log log(1/ε))). Once the algorithm enters a neighborhood of quadratic convergence, just a few additional iterations are required for convergence to a solution of high accuracy.
4. Gradient Methods
We consider here iterative methods for solving the unconstrained smooth problem (1.0.1) that make use of the gradient ∇f. (Duchi’s contribution [20] describes subgradient methods for nonsmooth convex functions.) We consider mostly methods that generate an iteration sequence {x^k} via the formula
(4.0.1)  x^{k+1} = x^k + α_k d^k,
where d^k is the search direction and α_k is a steplength.
We consider the steepest descent method, which searches along the negative gradient direction d^k = −∇f(x^k), proving convergence results for nonconvex functions, convex functions, and strongly convex functions. In Subsection 4.5, we consider methods that use more general descent directions d^k, proving convergence of methods that make careful choices of the line search parameter α_k at each iteration. In Subsection 4.6, we consider the conditional gradient method for minimization of a smooth function f over a compact set.
4.1. Steepest Descent The simplest stepsize protocol is the short-step variant of steepest descent. We assume here that f is convex and smooth, and that its gradients satisfy the Lipschitz condition (3.3.6) with Lipschitz constant L. We choose the search direction d^k = −∇f(x^k) in (4.0.1), and set the steplength α_k to be the constant 1/L, to obtain the iteration
(4.1.1)  x^{k+1} = x^k − (1/L)∇f(x^k), k = 0, 1, 2, . . . .
To estimate the amount of decrease in f obtained at each iterate of this method, we use Taylor’s theorem. From (3.3.7), we have
(4.1.2)  f(x + αd) ≤ f(x) + α∇f(x)^T d + α²(L/2)‖d‖².
For x = x^k and d = −∇f(x^k), the value of α that minimizes the expression on the right-hand side is α = 1/L. By substituting these values, we obtain
(4.1.3)  f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L))‖∇f(x^k)‖².
This expression is one of the foundational inequalities in the analysis of optimization methods. Depending on the assumptions made about f, we can derive a variety of different convergence rates from this basic inequality.
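The short-step iteration (4.1.1) and the decrease guarantee (4.1.3) are easy to check on a small quadratic; a sketch (the test problem is our own choice):

```python
import numpy as np

def steepest_descent(grad, L, x0, iters):
    # Short-step steepest descent (4.1.1): constant steplength 1/L.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - grad(x) / L
    return x

# Example problem (ours): f(x) = 0.5 x^T A x, grad f(x) = A x,
# with L = 10 the largest eigenvalue of A (Lipschitz constant of grad f).
A = np.diag([1.0, 10.0])
L = 10.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
```

One step from x = (1, 1) already satisfies the per-step decrease (4.1.3), and the iterates converge to the minimizer x∗ = 0.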
4.2. General Case We consider first general smooth f, not necessarily convex. From (4.1.3) alone, we can already prove convergence of the steepest descent method with no additional assumptions. Rearranging (4.1.3) and summing over the first T − 1 iterates, we have
(4.2.1)  Σ_{k=0}^{T−1} ‖∇f(x^k)‖² ≤ 2L Σ_{k=0}^{T−1} [f(x^k) − f(x^{k+1})] = 2L[f(x^0) − f(x^T)].
(Note the telescoping sum.) If f is bounded below, say by f̄, the right-hand side is bounded above by a constant. We have
(4.2.2)  min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √(2L[f(x^0) − f(x^T)]/T) ≤ √(2L[f(x^0) − f̄]/T).
Thus, within the first T − 1 steps of steepest descent, at least one of the iterates has gradient norm less than √(2L[f(x^0) − f̄]/T), which represents sublinear convergence of type (3.6.3a). It follows too from (4.2.1) that for f bounded below, any accumulation point of the sequence {x^k} is stationary.
4.3. Convex Case When f is also convex, we have the following stronger result for the steepest descent method.
Theorem 4.3.1. Suppose that f is convex and smooth, where ∇f has Lipschitz constant L, and that (1.0.1) has a solution x∗. Then the steepest descent method with stepsize α_k ≡ 1/L generates a sequence {x^k}_{k=0}^∞ that satisfies
(4.3.2)  f(x^T) − f∗ ≤ (L/(2T))‖x^0 − x∗‖².
Proof. By convexity of f, we have f(x∗) ≥ f(x^k) + ∇f(x^k)^T (x∗ − x^k), so by substituting into (4.1.3), we obtain for k = 0, 1, 2, . . . that
f(x^{k+1}) ≤ f(x∗) + ∇f(x^k)^T (x^k − x∗) − (1/(2L))‖∇f(x^k)‖²
  = f(x∗) + (L/2)( ‖x^k − x∗‖² − ‖x^k − x∗ − (1/L)∇f(x^k)‖² )
  = f(x∗) + (L/2)( ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² ).
By summing over k = 0, 1, 2, . . . , T − 1, and noting the telescoping sum, we have
Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/2) Σ_{k=0}^{T−1} ( ‖x^k − x∗‖² − ‖x^{k+1} − x∗‖² ) = (L/2)( ‖x^0 − x∗‖² − ‖x^T − x∗‖² ) ≤ (L/2)‖x^0 − x∗‖².
Since {f(x^k)} is a nonincreasing sequence, we have
f(x^T) − f(x∗) ≤ (1/T) Σ_{k=0}^{T−1} (f(x^{k+1}) − f∗) ≤ (L/(2T))‖x^0 − x∗‖²,
as required. □
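The O(1/T) bound (4.3.2) can be verified numerically on a convex quadratic (our own choice of test problem, with x∗ = 0 and f∗ = 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x, with x* = 0, f* = 0, and L = 9 the largest eigenvalue.
A = np.diag([1.0, 4.0, 9.0])
L = 9.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0, 1.0])
dist2 = float(x @ x)            # ||x^0 - x*||^2
bound_holds = True
for T in range(1, 51):
    x = x - (A @ x) / L         # steepest-descent step (4.1.1)
    if f(x) > L / (2 * T) * dist2 + 1e-12:
        bound_holds = False
```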
4.4. Strongly Convex Case Recall the definition (3.3.9) of strong convexity, which shows that f can be bounded below by a quadratic with Hessian γI. When a strongly convex f has L-Lipschitz gradients, it is also bounded above by a similar quadratic (see (3.3.7)) differing only in the quadratic term, which becomes LI. This “sandwich” effect yields a linear convergence rate for the gradient method.
We saw in Theorem 3.3.14 that a strongly convex f has a unique minimizer, which we denote by x∗. Minimizing both sides of the inequality (3.3.9) with respect to y, we find that the minimum on the left-hand side is attained at y = x∗, while on the right-hand side it is attained at y = x − ∇f(x)/γ. By plugging these optimal values into (3.3.9), we obtain
min_y f(y) ≥ min_y { f(x) + ∇f(x)^T (y − x) + (γ/2)‖y − x‖² }
⇒ f(x∗) ≥ f(x) − ∇f(x)^T ((1/γ)∇f(x)) + (γ/2)‖(1/γ)∇f(x)‖²
⇒ f(x∗) ≥ f(x) − (1/(2γ))‖∇f(x)‖².
By rearrangement, we obtain
(4.4.1)  ‖∇f(x)‖² ≥ 2γ[f(x) − f(x∗)].
By substituting (4.4.1) into our basic inequality (4.1.3), we obtain
f(x^{k+1}) = f(x^k − (1/L)∇f(x^k)) ≤ f(x^k) − (1/(2L))‖∇f(x^k)‖² ≤ f(x^k) − (γ/L)(f(x^k) − f∗).
Subtracting f∗ from both sides of this inequality gives us the recursion
(4.4.2)  f(x^{k+1}) − f∗ ≤ (1 − γ/L)(f(x^k) − f∗).
Thus the function values converge linearly to the optimum. After T steps, we have
(4.4.3)  f(x^T) − f∗ ≤ (1 − γ/L)^T (f(x^0) − f∗),
which is convergence of type (3.6.2) with constant σ = γ/L.
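A numerical check of the linear rate (4.4.3), on a strongly convex quadratic of our own choosing (γ = 1, L = 10, f∗ = 0):

```python
import numpy as np

# f(x) = 0.5 x^T A x with A = diag(1, 10): gamma = 1, L = 10, x* = 0, f* = 0.
A = np.diag([1.0, 10.0])
gamma, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, -1.0])
f0 = f(x)
rate_holds = True
for T in range(1, 101):
    x = x - (A @ x) / L         # short-step steepest descent (4.1.1)
    if f(x) > (1.0 - gamma / L) ** T * f0 + 1e-12:
        rate_holds = False
```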
4.5. General Case: Line-Search Methods For the case of f in (1.0.1) smooth but possibly nonconvex, we consider algorithms that take steps of the form (4.0.1), where d^k is a descent direction, that is, it makes a negative inner product with the gradient ∇f(x^k), so that ∇f(x^k)^T d^k < 0. This condition ensures that f(x^k + αd^k) < f(x^k) for sufficiently small positive values of the steplength α: we can get improvement in f by taking small steps along d^k. (This claim follows from (3.3.3).) Line-search methods are built around this fundamental observation. By introducing additional conditions on d^k and α_k, that can be verified in practice with reasonable effort, we can establish a bound on decrease similar to (4.1.3) on each iteration, and thus a conclusion similar to (4.2.2).
We assume that d^k satisfies the following for some η > 0:
(4.5.1)  ∇f(x^k)^T d^k ≤ −η‖∇f(x^k)‖ ‖d^k‖.
For the steplength α_k, we assume the following weak Wolfe conditions hold, for some constants c₁ and c₂ with 0 < c₁ < c₂ < 1:
(4.5.2a)  f(x^k + α_k d^k) ≤ f(x^k) + c₁ α_k ∇f(x^k)^T d^k,
(4.5.2b)  ∇f(x^k + α_k d^k)^T d^k ≥ c₂ ∇f(x^k)^T d^k.
Condition (4.5.2a) is called “sufficient decrease”; it ensures descent at each step of at least a small fraction c₁ of the amount promised by the first-order Taylor-series expansion (3.3.3). Condition (4.5.2b) ensures that the directional derivative of f along the search direction d^k is significantly less negative at the chosen steplength α_k than at α = 0. This condition ensures that the step is “not too short.” It can be shown that it is always possible to find α_k that satisfies both conditions (4.5.2) simultaneously. Line-search procedures, which are specialized optimization procedures for minimizing functions of one variable, have been devised for finding such values efficiently; see [34, Chapter 3] for details.
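A simple bracketing scheme suffices to find a steplength satisfying both conditions (4.5.2); the sketch below (ours, far less sophisticated than the interpolating procedures of [34]) doubles α while the step is too short and bisects once a step that is too long has been found:

```python
import numpy as np

def weak_wolfe_search(f, grad, x, d, c1=1e-4, c2=0.9, max_iter=100):
    # Bracketing/bisection search for alpha satisfying (4.5.2a) and (4.5.2b).
    # Assumes d is a descent direction: grad(x) @ d < 0.
    g0d = grad(x) @ d
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) > f(x) + c1 * alpha * g0d:
            hi = alpha                       # (4.5.2a) fails: step too long
        elif grad(x + alpha * d) @ d < c2 * g0d:
            lo = alpha                       # (4.5.2b) fails: step too short
        else:
            return alpha
        alpha = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return alpha                             # fallback after max_iter trials

# Toy example (ours): f(x) = 0.5 ||x||^2 with the steepest-descent direction.
f = lambda x: 0.5 * x @ x
grad = lambda x: x
```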
As in the remainder of this section, we assume that the Lipschitz property (3.3.6) holds for ∇f. From this property and (4.5.2b), we have
−(1 − c₂)∇f(x^k)^T d^k ≤ [∇f(x^k + α_k d^k) − ∇f(x^k)]^T d^k ≤ Lα_k‖d^k‖².
By comparing the first and last terms in these inequalities, we obtain the following lower bound on α_k:
α_k ≥ −((1 − c₂)/L) ∇f(x^k)^T d^k / ‖d^k‖².
By substituting this bound into (4.5.2a), and using (4.5.1) and the step definition (4.0.1), we obtain
f(x^{k+1}) = f(x^k + α_k d^k) ≤ f(x^k) + c₁ α_k ∇f(x^k)^T d^k
  ≤ f(x^k) − (c₁(1 − c₂)/L) (∇f(x^k)^T d^k)² / ‖d^k‖²
(4.5.3)  ≤ f(x^k) − (c₁(1 − c₂)/L) η² ‖∇f(x^k)‖²,
which by rearrangement yields
(4.5.4)  ‖∇f(x^k)‖² ≤ (L/(c₁(1 − c₂)η²)) (f(x^k) − f(x^{k+1})).
By following the argument of Subsection 4.2, we have that if f is bounded below by f̄, we obtain for any T ≥ 1 that
(4.5.5)  min_{0≤k≤T−1} ‖∇f(x^k)‖ ≤ √(L/(η²c₁(1 − c₂))) √((f(x^0) − f̄)/T).
It follows by taking limits on both sides of (4.5.4) that for f bounded below, we have
(4.5.6)  lim_{k→∞} ‖∇f(x^k)‖ = 0,
and thus that all accumulation points x̄ of the sequence {x^k} generated by the algorithm (4.0.1) have ∇f(x̄) = 0. In the case of f convex, this condition guarantees that x̄ is a solution of (1.0.1). When f is nonconvex, x̄ may be a local minimum, but it may also be a saddle point or a local maximum.
The paper [27] uses the stable manifold theorem to show that line-search gradient methods are highly unlikely to converge to stationary points x̄ at which some eigenvalues of the Hessian ∇²f(x̄) are negative. Although it is easy to construct examples for which such bad behavior occurs, it requires special choices of starting point x^0. Possibly the most obvious example is f(x₁, x₂) = x₁² − x₂², starting from x^0 = (1, 0)^T, where d^k = −∇f(x^k) at each k. For this example, all iterates have x₂^k = 0 and, under appropriate conditions, converge to the saddle point x̄ = 0. Any starting point with x₂^0 ≠ 0 cannot converge to 0; in fact, it is easy to see that x₂^k diverges away from 0.
4.6. Conditional Gradient Method The conditional gradient approach, often known as “Frank-Wolfe” after the authors who devised it [22], is a method for convex nonlinear optimization over compact convex sets. This is the problem
(4.6.1)  min_{x∈Ω} f(x)
(see earlier discussion around (3.2.2)), where Ω is a compact convex set and f is a convex function whose gradient is Lipschitz continuous in a neighborhood of Ω, with Lipschitz constant L. We assume that Ω has diameter D, that is, ‖x − y‖ ≤ D for all x, y ∈ Ω.
The conditional gradient method replaces the objective in (4.6.1) at each iteration by a linear Taylor-series approximation around the current iterate x^k, and minimizes this linear objective over the original constraint set Ω. It then takes a step from x^k towards the minimizer of this linearized subproblem. The full method is as follows:
(4.6.2a)  v^k := arg min_{v∈Ω} v^T ∇f(x^k);
(4.6.2b)  x^{k+1} := x^k + α_k(v^k − x^k),  α_k := 2/(k + 2).
The method has a sublinear convergence rate, as we show below, and indeed requires many iterations in practice to obtain an accurate solution. Despite this feature, it makes sense in many interesting applications, because the subproblems (4.6.2a) can be solved very cheaply in some settings, and because highly accurate solutions are not required in some applications.
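One setting where (4.6.2a) is very cheap is minimization over the unit simplex: the linear subproblem is minimized at a vertex, namely the coordinate with the smallest gradient component. A sketch (the quadratic test problem is our own choice):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, iters):
    # Conditional gradient (4.6.2) over the unit simplex.
    x = np.asarray(x0, dtype=float)
    for k in range(iters):
        g = grad(x)
        v = np.zeros_like(x)
        v[np.argmin(g)] = 1.0          # vertex solving (4.6.2a) on the simplex
        alpha = 2.0 / (k + 2)          # steplength from (4.6.2b)
        x = x + alpha * (v - x)        # convex combination: stays in simplex
    return x

# Example (ours): f(x) = 0.5 ||x - c||^2 over the simplex, with c interior,
# so grad f(x) = x - c, L = 1, x* = c, f* = 0, and D^2 = 2.
c = np.array([0.2, 0.3, 0.5])
grad = lambda x: x - c
```

The iterates remain feasible by construction, and the final objective gap respects the bound (4.6.4).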
We have the following result for sublinear convergence of the conditional gradient method.
Theorem 4.6.3. Under the conditions above, where L is the Lipschitz constant for ∇f on an open neighborhood of Ω and D is the diameter of Ω, the conditional gradient method (4.6.2) applied to (4.6.1) satisfies
(4.6.4)  f(x^k) − f(x∗) ≤ 2LD²/(k + 2), k = 1, 2, . . . ,
where x∗ is any solution of (4.6.1).
Proof. Setting x = x^k and y = x^{k+1} = x^k + α_k(v^k − x^k) in (3.3.7), we have
f(x^{k+1}) ≤ f(x^k) + α_k ∇f(x^k)^T (v^k − x^k) + ½α_k²L‖v^k − x^k‖²
(4.6.5)   ≤ f(x^k) + α_k ∇f(x^k)^T (v^k − x^k) + ½α_k²LD²,
where the second inequality comes from the definition of D. For the first-order term, since v^k solves (4.6.2a) and x∗ is feasible for (4.6.2a), we have
∇f(x^k)^T (v^k − x^k) ≤ ∇f(x^k)^T (x∗ − x^k) ≤ f(x∗) − f(x^k),
where the second inequality follows from convexity of f, as in (3.3.8). By substituting in (4.6.5) and subtracting f(x∗) from both sides, we obtain
(4.6.6)  f(x^{k+1}) − f(x∗) ≤ (1 − α_k)[f(x^k) − f(x∗)] + ½α_k²LD².
We now apply an inductive argument. For k = 0, we have α₀ = 1 and
f(x^1) − f(x∗) ≤ ½LD² < (2/3)LD²,
so that (4.6.4) holds in this case.
so that (4.6.4) holds in this case. Supposing that (4.6.4) holds for some value of k,we aim to show that it holds for k+ 1 too. We have
f(xk+1) − f(x∗) 6
(1 −
2k+ 2
)[f(xk) − f(x∗)] +
12
4(k+ 2)2 LD
2 from (4.6.6), (4.6.2b)
6 LD2[
2k(k+ 2)2 +
2(k+ 2)2
]from (4.6.4)
= 2LD2 (k+ 1)(k+ 2)2
= 2LD2 k+ 1k+ 2
1k+ 2
6 2LD2 k+ 2k+ 3
1k+ 2
=2LD2
k+ 3,
as required. 816
5. Prox-Gradient Methods
We now describe an elementary but powerful approach for solving the regularized optimization problem
(5.0.1)  min_{x∈R^n} φ(x) := f(x) + λψ(x),
where f is a smooth convex function, ψ is a convex regularization function (known simply as the “regularizer”), and λ > 0 is a regularization parameter. The technique we describe here is a natural extension of the steepest-descent approach, in that it reduces to the steepest-descent method analyzed in Theorem 4.3.1 applied to f when the regularization term is not present (λ = 0). It is useful when the regularizer ψ has a simple structure that is easy to account for explicitly, as is true for many regularizers that arise in data analysis, such as the ℓ₁ function (ψ(x) = ‖x‖₁) and the indicator function for a simple set Ω (ψ(x) = IΩ(x)), such as a box Ω = [l₁, u₁] ⊗ [l₂, u₂] ⊗ . . . ⊗ [lₙ, uₙ].²
Each step of the algorithm is defined as follows:
(5.0.2)  x^{k+1} := prox_{α_k λψ}(x^k − α_k ∇f(x^k)),
² For the analysis of this section I am indebted to class notes of L. Vandenberghe, from 2013-14.
for some steplength α_k > 0, where the prox operator is defined in (3.5.2). By substituting into this definition, we can verify that x^{k+1} is the solution of an approximation to the objective φ of (5.0.1), namely:

(5.0.3)  x^{k+1} := arg min_z ∇f(x^k)^T(z − x^k) + (1/(2α_k))‖z − x^k‖² + λψ(z).

One way to verify this equivalence is to note that the objective in (5.0.3) can be written as

(1/α_k) { (1/2)‖z − (x^k − α_k∇f(x^k))‖² + α_kλψ(z) },

modulo a term (α_k/2)‖∇f(x^k)‖² that does not involve z. The subproblem objective in (5.0.3) consists of a linear term ∇f(x^k)^T(z − x^k) (the first-order term in a Taylor-series expansion), a proximality term (1/(2α_k))‖z − x^k‖² that becomes more strict as α_k ↓ 0, and the regularization term λψ(z) in unaltered form. When λ = 0, we have x^{k+1} = x^k − α_k∇f(x^k), so the iteration (5.0.2) (or (5.0.3)) reduces to the usual steepest-descent approach discussed in Section 4. It is useful to continue thinking of α_k as playing the role of a line-search parameter, though here the line search is expressed implicitly through a proximal term.
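For the ℓ1 regularizer mentioned above, the prox operator has a well-known closed form, componentwise soft-thresholding. The following sketch (my own illustration, not from the text) runs the iteration (5.0.2) on ℓ1-regularized least squares with the constant step α_k = 1/L; the problem data here are arbitrary.

```python
import numpy as np

# Prox-gradient sketch for  min_x 0.5 ||Ax - b||^2 + lam * ||x||_1,
# with constant step alpha = 1/L, where L = ||A^T A||_2 is the Lipschitz
# constant of the gradient of the smooth part.

def soft_threshold(y, t):
    """prox_{t ||.||_1}(y), applied componentwise."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_gradient_l1(A, b, lam, iters=500):
    L = np.linalg.norm(A.T @ A, 2)          # Lipschitz constant of grad f
    alpha = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - alpha * grad, alpha * lam)   # step (5.0.2)
    return x
```

The soft-threshold shrinks each component toward zero by αλ and sets small components exactly to zero, which is how the ℓ1 term induces sparsity in the iterates.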
We will demonstrate convergence of the method (5.0.2) at a sublinear rate, for functions f whose gradients satisfy a Lipschitz continuity property with Lipschitz constant L (see (3.3.6)), and for the constant steplength choice α_k = 1/L. The proof makes use of a "gradient map" defined by

(5.0.4)  G_α(x) := (1/α)(x − prox_{αλψ}(x − α∇f(x))).

By comparing with (5.0.2), we see that this map defines the step taken at iteration k:

(5.0.5)  x^{k+1} = x^k − α_k G_{α_k}(x^k)   ⇔   G_{α_k}(x^k) = (1/α_k)(x^k − x^{k+1}).
The following technical lemma reveals some useful properties of G_α(x).

Lemma 5.0.6. Suppose that in problem (5.0.1), ψ is a closed convex function and that f is convex with Lipschitz continuous gradient on R^n, with Lipschitz constant L. Then for the definition (5.0.4) with α > 0, the following claims are true.
(a) G_α(x) ∈ ∇f(x) + λ∂ψ(x − αG_α(x)).
(b) For any z, and any α ∈ (0, 1/L], we have that

φ(x − αG_α(x)) ≤ φ(z) + G_α(x)^T(x − z) − (α/2)‖G_α(x)‖².
Proof. For part (a), we use the optimality property (3.5.3) of the prox operator, and make the following substitutions: x − α∇f(x) for "x", αλ for "λ", and ψ for "h", to obtain

0 ∈ αλ∂ψ(prox_{αλψ}(x − α∇f(x))) + (prox_{αλψ}(x − α∇f(x)) − (x − α∇f(x))).

We use definition (5.0.4) to make the substitution prox_{αλψ}(x − α∇f(x)) = x − αG_α(x), to obtain

0 ∈ αλ∂ψ(x − αG_α(x)) − α(G_α(x) − ∇f(x)),

and the result follows when we divide by α.

For (b), we start with the following consequence of Lipschitz continuity of ∇f, from Lemma 3.3.10:

f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖².

By setting y = x − αG_α(x), for any α ∈ (0, 1/L], we have

f(x − αG_α(x)) ≤ f(x) − αG_α(x)^T∇f(x) + (Lα²/2)‖G_α(x)‖²
             ≤ f(x) − αG_α(x)^T∇f(x) + (α/2)‖G_α(x)‖².   (5.0.7)

(The second inequality uses α ∈ (0, 1/L].) We also have by convexity of f and ψ that for any z and any v ∈ ∂ψ(x − αG_α(x)) the following are true:

(5.0.8)  f(z) ≥ f(x) + ∇f(x)^T(z − x),   ψ(z) ≥ ψ(x − αG_α(x)) + v^T(z − (x − αG_α(x))).

We have from part (a) that v = (G_α(x) − ∇f(x))/λ ∈ ∂ψ(x − αG_α(x)), so by making this choice of v in (5.0.8) and also using (5.0.7), we have for any α ∈ (0, 1/L] that

φ(x − αG_α(x)) = f(x − αG_α(x)) + λψ(x − αG_α(x))
  ≤ f(x) − αG_α(x)^T∇f(x) + (α/2)‖G_α(x)‖² + λψ(x − αG_α(x))   (from (5.0.7))
  ≤ f(z) + ∇f(x)^T(x − z) − αG_α(x)^T∇f(x) + (α/2)‖G_α(x)‖²
      + λψ(z) + (G_α(x) − ∇f(x))^T(x − αG_α(x) − z)            (from (5.0.8))
  = f(z) + λψ(z) + G_α(x)^T(x − z) − (α/2)‖G_α(x)‖²,

where the last equality follows from cancellation of several terms in the previous line. Thus (b) is proved. □
Theorem 5.0.9. Suppose that in problem (5.0.1), ψ is a closed convex function and that f is convex with Lipschitz continuous gradient on R^n, with Lipschitz constant L. Suppose that (5.0.1) attains a minimizer x^* (not necessarily unique) with optimal objective value φ^*. Then if α_k = 1/L for all k in (5.0.2), we have

φ(x^k) − φ^* ≤ L‖x^0 − x^*‖²/(2k),   k = 1, 2, . . . .

Proof. Since α_k = 1/L satisfies the conditions of Lemma 5.0.6, we can use part (b) of this result to show that the sequence φ(x^k) is decreasing and that the distance to the optimum x^* also decreases at each iteration. Setting x = z = x^k and α = α_k in Lemma 5.0.6, and recalling (5.0.5), we have

φ(x^{k+1}) = φ(x^k − α_k G_{α_k}(x^k)) ≤ φ(x^k) − (α_k/2)‖G_{α_k}(x^k)‖²,

justifying the first claim. For the second claim, we have by setting x = x^k, α = α_k, and z = x^* in Lemma 5.0.6 that

0 ≤ φ(x^{k+1}) − φ^* = φ(x^k − α_k G_{α_k}(x^k)) − φ^*
  ≤ G_{α_k}(x^k)^T(x^k − x^*) − (α_k/2)‖G_{α_k}(x^k)‖²
  = (1/(2α_k)) (‖x^k − x^*‖² − ‖x^k − x^* − α_k G_{α_k}(x^k)‖²)
  = (1/(2α_k)) (‖x^k − x^*‖² − ‖x^{k+1} − x^*‖²),   (5.0.10)

from which ‖x^{k+1} − x^*‖ ≤ ‖x^k − x^*‖ follows.

By setting α_k = 1/L in (5.0.10), and summing over k = 0, 1, 2, . . . , K − 1, we obtain from a telescoping sum on the right-hand side that

Σ_{k=0}^{K−1} (φ(x^{k+1}) − φ^*) ≤ (L/2)(‖x^0 − x^*‖² − ‖x^K − x^*‖²) ≤ (L/2)‖x^0 − x^*‖².

By monotonicity of φ(x^k), we have

K(φ(x^K) − φ^*) ≤ Σ_{k=0}^{K−1} (φ(x^{k+1}) − φ^*).

The result follows immediately by combining these last two expressions. □
6. Accelerating Gradient Methods

We showed in Section 4 that the basic steepest-descent method for solving (1.0.1) in the case of smooth f converges sublinearly at a 1/k rate when f is convex, and linearly at a rate of (1 − γ/L) when f is strongly convex (satisfying (3.3.13) for positive γ and L). We show now that quantitatively faster rates can be attained, while still requiring just one gradient evaluation per iteration, by using the idea of momentum. This means that at each iteration, we tend to continue to move along the length and direction of the previous step, making a small adjustment to take account of the latest gradient ∇f. Although it may not be obvious at first, there is some intuition behind this approach. The step that we took at the previous iteration was itself mostly based on the step prior to that, together with gradient information at the previous point. By thinking recursively, we can see that the last step is a compact encoding of all the gradient information that we have encountered at all iterates so far. If this information is aggregated properly, it can produce a richer overall picture of the behavior of the function than does the latest gradient alone, and thus has the potential to yield faster convergence. Sure enough, several intricate methods that use the momentum idea have been proposed, and have been widely successful. These methods are often called accelerated gradient methods. A major contributor in this area is Yurii Nesterov, dating to his seminal contribution in 1983 [31] and explicated further in his book [32] and many other publications. Another key contribution is [3], which derived an accelerated method for the regularized case (1.0.2).
6.1. Heavy-Ball Method  Possibly the most elementary method of momentum type is the heavy-ball method of Polyak [35]; see also [36]. Each iteration of this method has the form

(6.1.1)  x^{k+1} = x^k − α_k∇f(x^k) + β_k(x^k − x^{k−1}).

That is, a momentum term β_k(x^k − x^{k−1}) is added to the usual steepest-descent update. An intriguing analysis of this approach is possible for strongly convex quadratic functions:

(6.1.2)  min_{x ∈ R^n} f(x) := (1/2)x^TAx − b^Tx,

where the (constant) Hessian A has eigenvalues in the range [γ, L], with 0 < γ ≤ L (see [36]). For the following constant choices of steplength parameters:

α_k = α := 4/(√L + √γ)²,   β_k = β := (√L − √γ)/(√L + √γ),

it can be shown that ‖x^k − x^*‖ ≤ Cβ^k, for some (possibly large) constant C. We can use (3.3.7) to translate this into a bound on the function error, as follows:

f(x^k) − f(x^*) ≤ (L/2)‖x^k − x^*‖² ≤ (LC²/2)β^{2k},

allowing a direct comparison with the rate (4.4.3) for the steepest-descent method. If we suppose that L ≫ γ, we have

β ≈ 1 − 2√(γ/L),

so that we achieve approximate convergence f(x^k) − f(x^*) ≤ ε (for small positive ε) in O(√(L/γ) log(1/ε)) iterations, compared with O((L/γ) log(1/ε)) for steepest descent: a significant improvement.

The heavy-ball method is fundamental, but several points should be noted. First, the analysis for convex quadratic f does not generalize to general strongly convex nonlinear functions. Second, it requires knowledge of γ and L. Third, it is not a descent method; we usually have f(x^{k+1}) > f(x^k) for many k. These properties are not specific to heavy-ball: at least one of them is shared by other methods that use momentum.
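The speedup described above is easy to observe numerically. The following sketch (my own illustration, not from the text) runs the heavy-ball iteration (6.1.1) with Polyak's constant parameters on an ill-conditioned diagonal quadratic of the form (6.1.2); the problem dimensions and eigenvalue range are arbitrary choices.

```python
import numpy as np

# Heavy-ball sketch on the quadratic f(x) = 0.5 x^T A x - b^T x, using the
# constant parameters alpha = 4/(sqrt(L)+sqrt(gamma))^2 and
# beta = (sqrt(L)-sqrt(gamma))/(sqrt(L)+sqrt(gamma)) from the text.

def heavy_ball(A, b, iters=200):
    eigs = np.linalg.eigvalsh(A)
    gam, L = eigs[0], eigs[-1]
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(gam)) ** 2
    beta = (np.sqrt(L) - np.sqrt(gam)) / (np.sqrt(L) + np.sqrt(gam))
    x_prev = x = np.zeros(len(b))
    for _ in range(iters):
        grad = A @ x - b                                   # gradient of f
        x, x_prev = x - alpha * grad + beta * (x - x_prev), x   # (6.1.1)
    return x

A = np.diag(np.linspace(0.01, 1.0, 50))   # eigenvalues in [0.01, 1]: kappa = 100
b = np.ones(50)
x_star = np.linalg.solve(A, b)
x_hb = heavy_ball(A, b)
```

With condition number L/γ = 100, the momentum term makes the error contract per iteration roughly like β ≈ 1 − 2√(γ/L) rather than 1 − γ/L, which is the source of the square-root improvement noted above.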
6.2. Conjugate Gradient  The conjugate gradient method for solving linear systems Ax = b (or, equivalently, minimizing the convex quadratic (6.1.2)), where A is symmetric positive definite, is one of the most important algorithms in computational science. Though invented earlier than the other algorithms discussed in this section (see [25]) and motivated in a different way, conjugate gradient clearly makes use of momentum. Its steps have the form

(6.2.1)  x^{k+1} = x^k + α_k p^k,   where p^k = −∇f(x^k) + ξ_k p^{k−1},

for some choices of α_k and ξ_k, which is identical to (6.1.1) when we define β_k appropriately. For strongly convex quadratic problems (6.1.2), conjugate gradient has excellent properties. It does not require prior knowledge of the range [γ, L] of the eigenvalue spectrum of A, choosing the steplengths α_k and ξ_k in an adaptive fashion. (In fact, α_k is chosen to be the exact minimizer along the search direction p^k.) The main arithmetic operation per iteration is one matrix-vector multiplication involving A, the same cost as a gradient evaluation for f in (6.1.2). Most importantly, there is a rich convergence theory that characterizes convergence in terms of the properties of the full spectrum of A (not just its extreme elements), showing in particular that good approximate solutions can be obtained quickly if the eigenvalues are clustered. Convergence to an exact solution of (6.1.2) in at most n iterations is guaranteed (provided, naturally, that the arithmetic is carried out exactly).
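The standard linear conjugate gradient iteration can be written compactly in the momentum form (6.2.1). The following sketch (my own illustration, not from the text) shows one common variant, with α_k the exact minimizer along p^k and ξ_k the usual ratio of successive squared residual norms.

```python
import numpy as np

# Linear conjugate gradient for Ax = b, A symmetric positive definite,
# written in the momentum form (6.2.1): p_k = -grad f(x_k) + xi_k p_{k-1}.

def conjugate_gradient(A, b, tol=1e-10):
    x = np.zeros(len(b))
    r = A @ x - b                   # r = grad f(x) for f in (6.1.2)
    p = -r
    for _ in range(len(b)):         # at most n iterations in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)          # exact line search along p
        x = x + alpha * p
        r_new = r + alpha * Ap              # updated gradient/residual
        xi = (r_new @ r_new) / (r @ r)      # momentum coefficient xi_k
        p = -r_new + xi * p                 # momentum form (6.2.1)
        r = r_new
    return x
```

Note that each iteration costs one product with A, matching the per-iteration cost of a gradient evaluation, as observed in the text.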
There has been much work over the years on extending the conjugate gradient method to general smooth functions f. Few of the theoretical properties for the quadratic case carry over to the nonlinear setting, though several results are known; see [34, Chapter 5], for example. Such "nonlinear" conjugate gradient methods vary in the accuracy with which they perform the line search for α_k in (6.2.1) and, more fundamentally, in the choice of ξ_k. The latter is done in a way that ensures that each search direction p^k is a descent direction. In some methods, ξ_k is set to zero on some iterations, which causes the method to take a steepest-descent step, effectively "restarting" the conjugate gradient method at the latest iterate. Despite these qualifications, nonlinear conjugate gradient is quite commonly used in practice, because of its minimal storage requirements and the fact that it requires only one gradient evaluation per iteration. Its popularity has been eclipsed in recent years by the limited-memory quasi-Newton method L-BFGS [28], [34, Section 7.2], which requires more storage (though still O(n)) and is similarly economical and easy to implement.
6.3. Nesterov’s Accelerated Gradient: Weakly Convex Case We now describe978
Nesterov’s method for (1.0.1) and prove its convergence — sublinear at a 1/k2979
rate — for the case of f convex with Lipschitz continuous gradients satisfying980
(3.3.6). Each iteration of this method has the form981
(6.3.1) xk+1 = xk −αk∇f(xk +βk(x
k − xk−1))+βk(x
k − xk−1),982
for choices of the parameters αk and βk to be defined. Note immediately thesimilarity to the heavy-ball formula (6.1.1). The only difference is that the extrap-olation step xk → xk + βk(x
k − xk−1) is taken before evaluation of the gradient∇f in (6.3.1), whereas in (6.1.1) the gradient is simply evaluated at xk. It is con-venient for purposes of analysis (and implementation) to introduce an auxiliarysequence yk, fix αk ≡ 1/L, and rewrite the update (6.3.1) as follows:
xk+1 = yk −1L∇f(yk),(6.3.2a)
Stephen J. Wright 33
yk+1 = xk+1 +βk+1(xk+1 − xk), k = 0, 1, 2, . . . ,(6.3.2b)
where we initialize at an arbitrary y0 and set x0 = y0. We define βk with reference983
to another scalar sequence λk in the following manner:984
(6.3.3) λ0 = 0, λk+1 =12
(1 +
√1 + 4λ2
k
), βk =
λk − 1λk+1
.985
Since λk > 1 for k = 1, 2, . . . , we have βk+1 > 0 for k = 0, 1, 2, . . . . It also follows986
from the definition of λk+1 that987
(6.3.4) λ2k+1 − λk+1 = λ2
k.988
Suppose that the solution of (1.0.1) is attained at a point x∗. Nesterov’s scheme989
achieves the following bound:990
(6.3.5) f(xT ) − f∗ 62L‖x0 − x∗‖2
(T + 1)2 , T = 1, 2, . . . .991
We prove this result by an argument from [3], as reformulated in [6, Section 3.7]. This analysis is famously technical, and intuition is hard to come by. Some recent progress has been made in deriving algorithms similar to (6.3.2) that have a plausible geometric or algebraic motivation; see [7, 19].

From convexity of f and (3.3.7), we have for any x and y that

f(y − ∇f(y)/L) − f(x)
  ≤ f(y − ∇f(y)/L) − f(y) + ∇f(y)^T(y − x)
  ≤ ∇f(y)^T(y − ∇f(y)/L − y) + (L/2)‖y − ∇f(y)/L − y‖² + ∇f(y)^T(y − x)
  = −(1/(2L))‖∇f(y)‖² + ∇f(y)^T(y − x).   (6.3.6)

Setting y = y^k and x = x^k in this bound, we obtain

f(x^{k+1}) − f(x^k) = f(y^k − ∇f(y^k)/L) − f(x^k)
  ≤ −(1/(2L))‖∇f(y^k)‖² + ∇f(y^k)^T(y^k − x^k)
  = −(L/2)‖x^{k+1} − y^k‖² − L(x^{k+1} − y^k)^T(y^k − x^k).   (6.3.7)

We now set y = y^k and x = x^* in (6.3.6), and use (6.3.2a) to obtain

(6.3.8)  f(x^{k+1}) − f(x^*) ≤ −(L/2)‖x^{k+1} − y^k‖² − L(x^{k+1} − y^k)^T(y^k − x^*).

Introducing the notation δ_k := f(x^k) − f(x^*), we multiply (6.3.7) by λ_{k+1} − 1 and add it to (6.3.8) to obtain

(λ_{k+1} − 1)(δ_{k+1} − δ_k) + δ_{k+1}
  ≤ −(L/2)λ_{k+1}‖x^{k+1} − y^k‖² − L(x^{k+1} − y^k)^T(λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*).

We multiply this bound by λ_{k+1}, and use (6.3.4) to obtain

λ_{k+1}²δ_{k+1} − λ_k²δ_k
  ≤ −(L/2)[‖λ_{k+1}(x^{k+1} − y^k)‖² + 2λ_{k+1}(x^{k+1} − y^k)^T(λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*)]
  = −(L/2)[‖λ_{k+1}x^{k+1} − (λ_{k+1} − 1)x^k − x^*‖² − ‖λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*‖²],   (6.3.9)

where in the final equality we used the identity ‖a‖² + 2a^Tb = ‖a + b‖² − ‖b‖². By multiplying (6.3.2b) by λ_{k+2}, and using λ_{k+2}β_{k+1} = λ_{k+1} − 1 from (6.3.3), we have

λ_{k+2}y^{k+1} = λ_{k+2}x^{k+1} + λ_{k+2}β_{k+1}(x^{k+1} − x^k) = λ_{k+2}x^{k+1} + (λ_{k+1} − 1)(x^{k+1} − x^k).

By rearranging this equality, we have

λ_{k+1}x^{k+1} − (λ_{k+1} − 1)x^k = λ_{k+2}y^{k+1} − (λ_{k+2} − 1)x^{k+1}.

By substituting into the first term on the right-hand side of (6.3.9), and using the definition

(6.3.10)  u^k := λ_{k+1}y^k − (λ_{k+1} − 1)x^k − x^*,

we obtain

λ_{k+1}²δ_{k+1} − λ_k²δ_k ≤ −(L/2)(‖u^{k+1}‖² − ‖u^k‖²).

By summing both sides of this inequality over k = 0, 1, . . . , T − 1, and using λ_0 = 0, we obtain

λ_T²δ_T ≤ (L/2)(‖u^0‖² − ‖u^T‖²) ≤ (L/2)‖x^0 − x^*‖²,

so that

(6.3.11)  δ_T = f(x^T) − f(x^*) ≤ L‖x^0 − x^*‖²/(2λ_T²).

A simple induction confirms that λ_k ≥ (k + 1)/2 for k = 1, 2, . . . , and the result (6.3.5) follows by substituting this bound into (6.3.11).
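The scheme (6.3.2)-(6.3.3) is short to implement. The following sketch (my own illustration, not from the text) runs it on a convex least-squares objective; the problem data and iteration count are arbitrary choices, and the final test checks the gap against the guarantee (6.3.5).

```python
import numpy as np

# Nesterov's accelerated gradient (6.3.2) with the lambda_k sequence (6.3.3).
# grad: callable returning grad f, L: Lipschitz constant of grad f.

def nesterov(grad, L, x0, iters=2000):
    x = x0.copy()
    y = x0.copy()
    lam = 0.0                                   # lambda_0 = 0
    for _ in range(iters):
        x_new = y - grad(y) / L                 # step (6.3.2a)
        lam_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2))    # lambda_{k+1}
        lam_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * lam_new ** 2))  # lambda_{k+2}
        beta = (lam_new - 1.0) / lam_next       # beta_{k+1} from (6.3.3)
        y = x_new + beta * (x_new - x)          # extrapolation (6.3.2b)
        x, lam = x_new, lam_new
    return x
```

Note that β_1 = 0 (since λ_1 = 1), so the first step is a plain gradient step; the momentum weight then grows toward 1 as k increases.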
6.4. Nesterov’s Accelerated Gradient: Strongly Convex Case We turn now to1016
Nesterov’s approach for smooth strongly convex functions, which satisfy (3.2.5)1017
with γ > 0. Again, we follow the proof in [6, Section 3.7], which is based on1018
the analysis in [32]. The method uses the same update formula (6.3.2) as in the1019
weakly convex case, and the same initialization, but with a different choice of1020
βk+1, namely:1021
(6.4.1) βk+1 ≡√L−√γ√
L+√γ
=
√κ− 1√κ+ 1
,1022
and initial values y0 = y0. The condition measure κ is defined in (3.3.12). We1023
prove the following convergence result.1024
Theorem 6.4.2. Suppose that f has Lipschitz continuous gradient ∇f with constant L, and that it is strongly convex with modulus of convexity γ and unique minimizer x^*. Then the method (6.3.2), (6.4.1) with starting point x^0 = y^0 satisfies

(6.4.3)  f(x^T) − f(x^*) ≤ ((L + γ)/2)‖x^0 − x^*‖² (1 − 1/√κ)^T.
Proof. The proof makes use of a family of strongly convex functions Φ_k(z) defined inductively as follows:

(6.4.4a)  Φ_0(z) = f(y^0) + (γ/2)‖z − y^0‖²,
(6.4.4b)  Φ_{k+1}(z) = (1 − 1/√κ)Φ_k(z) + (1/√κ)(f(y^k) + ∇f(y^k)^T(z − y^k) + (γ/2)‖z − y^k‖²).

Each Φ_k(·) is a quadratic, and an inductive argument shows that ∇²Φ_k(z) = γI for all k and all z. Thus, each Φ_k has the form

(6.4.5)  Φ_k(z) = Φ_k^* + (γ/2)‖z − v^k‖²,   k = 0, 1, 2, . . . ,

where v^k is the minimizer of Φ_k(·) and Φ_k^* is its optimal value. (From (6.4.4a), we have v^0 = y^0.) We note too that Φ_k becomes a tighter overapproximation to f as k → ∞. To show this, we use (3.3.9) to replace the final term in parentheses in (6.4.4b) by f(z), then subtract f(z) from both sides of (6.4.4b) to obtain

(6.4.6)  Φ_{k+1}(z) − f(z) ≤ (1 − 1/√κ)(Φ_k(z) − f(z)).

In the remainder of the proof, we show that the following bound holds:

(6.4.7)  f(x^k) ≤ min_z Φ_k(z) = Φ_k^*,   k = 0, 1, 2, . . . .

From the upper bound in Lemma 3.3.10, with x = x^*, we have that f(z) − f(x^*) ≤ (L/2)‖z − x^*‖². By combining this bound with (6.4.6) and (6.4.7), we have

f(x^k) − f(x^*) ≤ Φ_k^* − f(x^*)
  ≤ Φ_k(x^*) − f(x^*)
  ≤ (1 − 1/√κ)^k (Φ_0(x^*) − f(x^*))
  = (1 − 1/√κ)^k [(Φ_0(x^*) − f(x^0)) + (f(x^0) − f(x^*))]
  ≤ (1 − 1/√κ)^k ((γ + L)/2)‖x^0 − x^*‖².   (6.4.8)

The proof is completed by establishing (6.4.7), by induction on k. Since x^0 = y^0, it holds by definition at k = 0. By using the step formula (6.3.2a), the convexity property (3.3.8) (with x = y^k), and the inductive hypothesis, we have

f(x^{k+1}) ≤ f(y^k) − (1/(2L))‖∇f(y^k)‖²
  = (1 − 1/√κ)f(x^k) + (1 − 1/√κ)(f(y^k) − f(x^k)) + f(y^k)/√κ − (1/(2L))‖∇f(y^k)‖²
  ≤ (1 − 1/√κ)Φ_k^* + (1 − 1/√κ)∇f(y^k)^T(y^k − x^k) + f(y^k)/√κ − (1/(2L))‖∇f(y^k)‖².   (6.4.9)

Thus the claim is established (and the theorem is proved) if we can show that the right-hand side in (6.4.9) is bounded above by Φ_{k+1}^*.
Recalling the observation (6.4.5), we have by taking derivatives of both sides of (6.4.4b) with respect to z that

(6.4.10)  ∇Φ_{k+1}(z) = γ(1 − 1/√κ)(z − v^k) + ∇f(y^k)/√κ + γ(z − y^k)/√κ.

Since v^{k+1} is the minimizer of Φ_{k+1}, we can set ∇Φ_{k+1}(v^{k+1}) = 0 in (6.4.10) to obtain

(6.4.11)  v^{k+1} = (1 − 1/√κ)v^k + y^k/√κ − ∇f(y^k)/(γ√κ).

Subtracting y^k from both sides of this expression, and taking ‖·‖² of both sides, we obtain

(6.4.12)  ‖v^{k+1} − y^k‖² = (1 − 1/√κ)²‖y^k − v^k‖² + ‖∇f(y^k)‖²/(γ²κ) − 2((1 − 1/√κ)/(γ√κ))∇f(y^k)^T(v^k − y^k).

Meanwhile, by evaluating Φ_{k+1} at z = y^k, using both (6.4.5) and (6.4.4b), we obtain

Φ_{k+1}^* + (γ/2)‖y^k − v^{k+1}‖² = (1 − 1/√κ)Φ_k(y^k) + f(y^k)/√κ
  = (1 − 1/√κ)Φ_k^* + (γ/2)(1 − 1/√κ)‖y^k − v^k‖² + f(y^k)/√κ.   (6.4.13)

By substituting (6.4.12) into (6.4.13), we obtain

Φ_{k+1}^* = (1 − 1/√κ)Φ_k^* + f(y^k)/√κ + (γ(1 − 1/√κ)/(2√κ))‖y^k − v^k‖²
    − (1/(2L))‖∇f(y^k)‖² + (1 − 1/√κ)∇f(y^k)^T(v^k − y^k)/√κ
  ≥ (1 − 1/√κ)Φ_k^* + f(y^k)/√κ
    − (1/(2L))‖∇f(y^k)‖² + (1 − 1/√κ)∇f(y^k)^T(v^k − y^k)/√κ,   (6.4.14)

where we simply dropped a nonnegative term from the right-hand side to obtain the inequality. The final step is to show that

(6.4.15)  v^k − y^k = √κ(y^k − x^k),

which we do by induction. Note that v^0 = x^0 = y^0, so the claim holds for k = 0. We have

v^{k+1} − y^{k+1} = (1 − 1/√κ)v^k + y^k/√κ − ∇f(y^k)/(γ√κ) − y^{k+1}
  = √κ y^k − (√κ − 1)x^k − √κ∇f(y^k)/L − y^{k+1}
  = √κ x^{k+1} − (√κ − 1)x^k − y^{k+1}
  = √κ(y^{k+1} − x^{k+1}),   (6.4.16)

where the first equality is from (6.4.11), the second equality is from the inductive hypothesis, the third equality is from the iteration formula (6.3.2a), and the final equality is from the iteration formula (6.3.2b) with the definition of β_{k+1} from (6.4.1). We have thus proved (6.4.15), and by substituting this equality into (6.4.14), we obtain that Φ_{k+1}^* is an upper bound on the right-hand side of (6.4.9). This establishes (6.4.7) and thus completes the proof of the theorem. □
6.5. Lower Bounds on Rates  The term "optimal" in Nesterov's optimal method is used because the convergence rate achieved by the method is the best possible (possibly up to a constant), among algorithms that make use of gradient information at the iterates x^k. This claim can be proved by means of a carefully designed function, for which no method that makes use of all gradients observed up to and including iteration k (namely, ∇f(x^i), i = 0, 1, 2, . . . , k) can produce a sequence x^k that achieves a rate better than (6.3.5). The function proposed in [30] is a convex quadratic f(x) = (1/2)x^TAx − e_1^Tx, where A is the n × n tridiagonal matrix

    [  2  -1   0   0  ...   0 ]
    [ -1   2  -1   0  ...   0 ]
A = [  0  -1   2  -1  ...   0 ] ,   e_1 = (1, 0, 0, . . . , 0)^T.
    [         . . .           ]
    [  0  ...     -1   2  -1  ]
    [  0  ...      0  -1   2  ]
The solution x^* satisfies Ax^* = e_1; its components are x_i^* = 1 − i/(n + 1), for i = 1, 2, . . . , n. If we use x^0 = 0 as the starting point, and construct the iterate x^{k+1} as

x^{k+1} = x^k + Σ_{j=0}^{k} ξ_j∇f(x^j),

for some coefficients ξ_j, j = 0, 1, . . . , k, an elementary inductive argument shows that each iterate x^k can have nonzero entries only in its first k components. It follows that for any such algorithm, we have

(6.5.1)  ‖x^k − x^*‖² ≥ Σ_{j=k+1}^{n} (x_j^*)² = Σ_{j=k+1}^{n} (1 − j/(n + 1))².

A little arithmetic shows that

(6.5.2)  ‖x^k − x^*‖² ≥ (1/8)‖x^0 − x^*‖²,   k = 1, 2, . . . , n/2 − 1.

It can be shown further that

(6.5.3)  f(x^k) − f^* ≥ (3L/(32(k + 1)²))‖x^0 − x^*‖²,   k = 1, 2, . . . , n/2 − 1,

where L = ‖A‖₂. This lower bound on f(x^k) − f^* is within a constant factor of the upper bound (6.3.5).
The restriction k ≤ n/2 in the argument above is not fully satisfying. A more compelling example would show that the lower bound (6.5.3) holds for all k.
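The worst-case construction above is easy to examine numerically. The following sketch (my own illustration, not from the text) builds the tridiagonal matrix, checks the closed form for x^*, and confirms that gradient-combination iterates started from x^0 = 0 stay supported on their first k components.

```python
import numpy as np

# The lower-bound quadratic: A = tridiag(-1, 2, -1), f(x) = 0.5 x^T A x - e1^T x.

n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0

# Components of the solution of A x = e1 are x*_i = 1 - i/(n+1).
x_star = np.linalg.solve(A, e1)
formula = 1.0 - np.arange(1, n + 1) / (n + 1)

# Any iteration of the form x^{k+1} = x^k + sum_j xi_j grad f(x^j) (here,
# plain gradient steps with xi = -0.5) leaves components beyond the k-th at 0.
x = np.zeros(n)
for k in range(1, 4):
    x = x - 0.5 * (A @ x - e1)      # one gradient step
    assert np.count_nonzero(x[k:]) == 0
```

This support property is exactly what drives the bound (6.5.1): after k iterations the tail components of x^* are untouched, no matter how the coefficients ξ_j are chosen.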
7. Newton Methods

So far, we have dealt with methods that use first-order (gradient or subgradient) information about the objective function. We have shown that such algorithms can yield sequences of iterates that converge at linear or sublinear rates. We turn our attention in this chapter to methods that exploit second-derivative (Hessian) information. The canonical method here is Newton's method, named after Isaac Newton, who proposed a version of the method for polynomial equations in around 1670.

For many functions, including many that arise in data analysis, second-order information is not difficult to compute, in the sense that the functions that we deal with are simple (usually compositions of elementary functions). In comparison with first-order methods, there is a tradeoff. Second-order methods typically have locally superlinear convergence rates: once the iterates reach a neighborhood of a solution at which second-order sufficient conditions are satisfied, convergence is rapid. Moreover, their global convergence properties are attractive. With appropriate enhancements, they can provably avoid convergence to saddle points. But the costs of calculating and handling the second-order information and of computing the step are higher. Whether this tradeoff makes them appealing depends on the specifics of the application and on whether the second-derivative computations are able to take advantage of structure in the objective function.
We start by sketching the basic Newton's method for the unconstrained smooth optimization problem min f(x), and prove local convergence to a minimizer x^* that satisfies second-order sufficient conditions. Subsection 7.2 discusses performance of Newton's method on convex functions, where the use of Newton search directions in the line search framework (4.0.1) can yield global convergence. Modifications of Newton's method for nonconvex functions are discussed in Subsection 7.3. Subsection 7.4 discusses algorithms for smooth nonconvex functions that use gradient and Hessian information but guarantee convergence to points that approximately satisfy second-order necessary conditions. Some variants of these methods are related closely to the trust-region methods discussed in Subsection 7.3, but the motivation and mechanics are somewhat different.
7.1. Basic Newton’s Method Consider the problem1112
(7.1.1) min f(x),1113
where f : Rn → R is a Lipschitz twice continuously differentiable function, where1114
the Hessian has Lipschitz constant M, that is,1115
(7.1.2) ‖∇2f(x ′) −∇2f(x ′′)‖ 6M‖x ′ − x ′′‖.1116
Newton’s method generates a sequence of iterates xkk=0,1,2,....1117
Stephen J. Wright 39
A second-order Taylor series approximation to f around the current iterate xk1118
is1119
(7.1.3) f(xk + p) ≈ f(xk) +∇f(xk)Tp+ 12pT∇2f(xk)p.1120
When ∇2f(xk) is positive definite, the minimizer pk of the right-hand side is1121
unique; it is1122
(7.1.4) pk = −∇2f(xk)−1∇f(xk).1123
This is the Newton step. In its most basic form, then, Newton’s method is defined1124
by the following iteration:1125
(7.1.5) xk+1 = xk −∇2f(xk)−1∇f(xk).1126
We have the following local convergence result in the neighborhood of a point x^* satisfying second-order sufficient conditions.

Theorem 7.1.6. Consider the problem (7.1.1) with f twice continuously differentiable and its Hessian Lipschitz continuous with constant M, as in (7.1.2). Suppose that the second-order sufficient conditions are satisfied for the problem (7.1.1) at the point x^*, that is, ∇f(x^*) = 0 and ∇²f(x^*) ⪰ γI for some γ > 0. Then if ‖x^0 − x^*‖ ≤ γ/(2M), the sequence defined by (7.1.5) converges to x^* at a quadratic rate, with

(7.1.7)  ‖x^{k+1} − x^*‖ ≤ (M/γ)‖x^k − x^*‖²,   k = 0, 1, 2, . . . .
Proof. From (7.1.4) and (7.1.5), and using ∇f(x^*) = 0, we have

x^{k+1} − x^* = x^k − x^* − ∇²f(x^k)^{−1}∇f(x^k)
  = ∇²f(x^k)^{−1}[∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))],

so that

(7.1.8)  ‖x^{k+1} − x^*‖ ≤ ‖∇²f(x^k)^{−1}‖ ‖∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))‖.

By using Taylor's theorem (see (3.3.4) with x = x^k and p = x^* − x^k), we have

∇f(x^k) − ∇f(x^*) = ∫₀¹ ∇²f(x^k + t(x^* − x^k))(x^k − x^*) dt.

By using this result along with the Lipschitz condition (7.1.2), we have

‖∇²f(x^k)(x^k − x^*) − (∇f(x^k) − ∇f(x^*))‖
  = ‖ ∫₀¹ [∇²f(x^k) − ∇²f(x^k + t(x^* − x^k))](x^k − x^*) dt ‖
  ≤ ∫₀¹ ‖∇²f(x^k) − ∇²f(x^k + t(x^* − x^k))‖ ‖x^k − x^*‖ dt
  ≤ ( ∫₀¹ Mt dt ) ‖x^k − x^*‖² = (M/2)‖x^k − x^*‖².   (7.1.9)

From the Wielandt-Hoffman inequality [26] and (7.1.2), we have that

|λ_min(∇²f(x^k)) − λ_min(∇²f(x^*))| ≤ ‖∇²f(x^k) − ∇²f(x^*)‖ ≤ M‖x^k − x^*‖,

where λ_min(·) denotes the smallest eigenvalue of a symmetric matrix. Thus for

(7.1.10)  ‖x^k − x^*‖ ≤ γ/(2M),

we have

λ_min(∇²f(x^k)) ≥ λ_min(∇²f(x^*)) − M‖x^k − x^*‖ ≥ γ − M(γ/(2M)) ≥ γ/2,

so that ‖∇²f(x^k)^{−1}‖ ≤ 2/γ. By substituting this result together with (7.1.9) into (7.1.8), we obtain

‖x^{k+1} − x^*‖ ≤ (2/γ)(M/2)‖x^k − x^*‖² = (M/γ)‖x^k − x^*‖²,

verifying the locally quadratic convergence rate. By applying (7.1.10) again, we have

‖x^{k+1} − x^*‖ ≤ ((M/γ)‖x^k − x^*‖)‖x^k − x^*‖ ≤ (1/2)‖x^k − x^*‖,

so, by arguing inductively, we see that the sequence converges to x^* provided that x^0 satisfies (7.1.10), as claimed. □
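The quadratic rate (7.1.7) is striking in practice: the number of correct digits roughly doubles at each step. The following sketch (my own illustration, not from the text) runs the basic iteration (7.1.5) on a separable strongly convex function whose gradient and Hessian have simple closed forms.

```python
import numpy as np

# Basic Newton iteration (7.1.5) on f(x) = sum_i (x_i^4/4 + x_i^2/2),
# whose unique minimizer is x* = 0. Per coordinate the Newton map is
# x -> 2 x^3 / (3 x^2 + 1), so errors shrink roughly cubically here,
# consistent with (at least) the quadratic rate of Theorem 7.1.6.

def newton(grad, hess, x0, iters=6):
    x = x0.copy()
    errs = []
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))   # step (7.1.5)
        errs.append(np.linalg.norm(x))              # distance to x* = 0
    return x, errs

grad = lambda x: x ** 3 + x                # gradient of f
hess = lambda x: np.diag(3 * x ** 2 + 1)   # Hessian of f (diagonal)
x, errs = newton(grad, hess, np.full(3, 0.5))
```

Starting from distance 0.5 per coordinate, a handful of iterations suffices to drive the error to machine-precision scale.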
Of course, we do not need to explicitly identify a starting point x^0 in the stated region of convergence. Any sequence that approaches x^* will eventually enter this region, and thereafter the quadratic convergence guarantees apply.

We have established that Newton's method converges rapidly once the iterates enter the neighborhood of a point x^* satisfying second-order sufficient optimality conditions. But what happens when we start far from such a point?
7.2. Newton’s Method for Convex Functions When f is strongly convex with1159
modulus γ and satisfies Lipschitz continuity of the gradient (3.3.6), the Hessian1160
∇2f(xk) is positive definite for all k, with all eigenvalues in the interval [γ,L].1161
Thus, the Newton direction (7.1.4) is well defined, and is a descent direction1162
satisfying the condition (4.5.1) with η = γ/L. To verify this claim, note first1163
‖pk‖ 6 ‖∇2f(xk)−1‖‖∇f(xk)‖ 6 1γ‖∇f(xk)‖.1164
Then
(pk)T∇f(xk) = −∇f(xk)T∇2f(xk)−1∇f(xk)
6 −1L‖∇f(xk)‖2
6 −γ
L‖∇f(xk)‖‖pk‖.
We can use the Newton direction in the line-search framework of Subsection 4.51165
to obtain a method for which xk → x∗, where x∗ is the (unique) global minimizer1166
of f. (This claim follows from the property (4.5.6) together with the fact that x∗ is1167
the only point for which ∇f(x∗) = 0.)1168
These global convergence properties are enhanced by the quadratic local con-1169
vergence property of Theorem 7.1.6 if we modify the line-search framework by1170
accepting the step length αk = 1 in (4.0.1) whenever it satisfies the conditions1171
(4.5.2). (It can be shown, by again using arguments based on Taylor’s theorem1172
Stephen J. Wright 41
(Theorem 3.3.1), that these conditions will be satisfied by αk = 1 for all xk suffi-1173
ciently close to the minimizer x∗.)1174
For f convex but not strongly convex, we are not guaranteed that the Hessian1175
∇2f(xk) is nonsingular, so the direction (7.1.4) may not be well defined. By adding1176
any positive number λk > 0 to the diagonal, however, we can ensure that the1177
modified Newton direction defined by1178
(7.2.1) pk = −[∇2f(xk) + λkI]−1∇f(xk),1179
is well defined and is a descent direction for f. For any η ∈ (0, 1) in (4.5.1),1180
we have by choosing λk large enough that λk/(L + λk) > η that the condition1181
(4.5.1) is satisfied too, so again we can use the resulting direction pk in the line-1182
search framework of Subsection 4.5. If the minimizer x∗ is unique and satisfies1183
a second-order sufficient condition, in particular ∇2f(x∗) positive definite, then1184
for k sufficiently large we have ∇2f(xk) is positive definite too, so for sufficiently1185
small η the unmodified Newton direction (with λk = 0 in (7.2.1)) will satisfy (4.5.1).1186
By using λk = 0 where possible, and accepting αk = 1 as the step length when-1187
ever it satisfies (4.5.2), we can obtain local quadratic convergence to x∗ under the1188
second-order sufficient condition.1189
7.3. Newton Methods for Nonconvex Functions For smooth nonconvex f, ∇2f(xk)
may be indefinite. The Newton direction (7.1.4) may not exist (when ∇2f(xk) is
singular) or may not be a descent direction (when ∇2f(xk) has negative eigenvalues).
However, we can still define a modified Newton direction as in (7.2.1),
which will be a descent direction for λk sufficiently large. For a given η in (4.5.1),
a sufficient condition for pk from (7.2.1) to satisfy (4.5.1) is that
(λk + λmin(∇2f(xk))) / (λk + L) ≥ η,
where λmin(∇2f(xk)) is the minimum eigenvalue of the Hessian, which may be
negative. Once again, if the iterates xk enter the neighborhood of a local solution
x∗ at which ∇2f(x∗) is positive definite, some enhancements of the strategy for
choosing λk and αk can recover the local quadratic convergence of Theorem 7.1.6.
Formula (7.2.1) is not the only way to modify the Newton direction to ensure
descent in a line-search framework. Other approaches are outlined in [34, Chapter 3].
One such technique is to modify the Cholesky factorization of ∇2f(xk)
(which is a prelude to solving the system (7.1.4) or (7.2.1) for pk) by adding positive
elements to the diagonal only as needed to allow the factorization to proceed
(that is, to avoid taking the square root of a negative number). Another technique
is to compute an eigenvalue decomposition ∇2f(xk) = QkΛkQkT (where Qk is
orthogonal and Λk is diagonal), then define Λ̃k to be a modified version of Λk in
which all the diagonals are positive. Then, following (7.1.4), pk can be defined as
pk := −QkΛ̃k−1QkT∇f(xk).
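A minimal sketch of this eigenvalue-modification strategy; the rule of replacing each eigenvalue λi by max(|λi|, δ) for a small threshold δ is one common choice (the text requires only that the modified diagonal be positive).

```python
import numpy as np

def eig_modified_newton_direction(g, H, delta=1e-6):
    """Modified Newton direction via the eigenvalue decomposition
    H = Q Lam Q^T: replace each eigenvalue lam_i by max(|lam_i|, delta),
    then form p = -Q Lam_mod^{-1} Q^T g."""
    w, Q = np.linalg.eigh(H)
    w_mod = np.maximum(np.abs(w), delta)
    return -Q @ ((Q.T @ g) / w_mod)
```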
A trust-region version of Newton’s method ensures convergence to a point satisfying
second-order necessary conditions. (This is in contrast to the line-search
framework discussed above, which in general can guarantee only that accumulation
points are stationary, when applied to nonconvex f.) The trust-region approach
was developed in the late 1970s and early 1980s, and has become popular
again recently because of this appealing global convergence behavior. A
trust-region Newton method also recovers quadratic convergence to solutions x∗
satisfying second-order sufficient conditions, without any special modifications.
We outline the trust-region approach below. Further details can be found in
[34, Chapter 4].
The subproblem to be solved at each iteration of a trust-region Newton algorithm is
(7.3.1) min_d f(xk) + ∇f(xk)Td + (1/2)dT∇2f(xk)d subject to ‖d‖2 ≤ ∆k.
The objective is a second-order Taylor-series approximation, while ∆k is the radius
of the trust region, that is, the region within which we trust the second-order model
to capture the true behavior of f. Somewhat surprisingly, the problem (7.3.1) is
not too difficult to solve, even when the Hessian ∇2f(xk) is indefinite. In fact, the
solution dk of (7.3.1) satisfies the linear system
(7.3.2) [∇2f(xk) + λI]dk = −∇f(xk), for some λ ≥ 0,
where λ is chosen such that ∇2f(xk) + λI is positive semidefinite and λ > 0 only
if ‖dk‖ = ∆k (see [29]). Thus the process of solving (7.3.1) reduces to a search for
the appropriate value of the scalar λ, for which specialized methods have been
devised.
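The search for λ can be sketched as follows. This version uses plain bisection, exploiting the fact that ‖dk(λ)‖ is decreasing in λ; it ignores the so-called hard case, and practical codes use the safeguarded Newton iteration of [29] instead.

```python
import numpy as np

def trust_region_subproblem(g, H, Delta):
    """Nearly exact solution of min_d g.T d + 0.5 d.T H d, ||d|| <= Delta,
    via bisection on lambda in (7.3.2).  A sketch only: the 'hard case'
    is ignored."""
    n = len(g)
    I = np.eye(n)
    # smallest lambda making H + lambda*I positive (semi)definite
    lam_lo = max(0.0, -np.linalg.eigvalsh(H)[0]) + 1e-12
    d = np.linalg.solve(H + lam_lo * I, -g)
    if np.linalg.norm(d) <= Delta:
        return d                       # trust-region bound inactive
    lam_hi = lam_lo + 1.0              # bracket: ||d(lam)|| decreases in lam
    while np.linalg.norm(np.linalg.solve(H + lam_hi * I, -g)) > Delta:
        lam_hi *= 2.0
    for _ in range(200):               # bisect until ||d(lam)|| is close to Delta
        lam = 0.5 * (lam_lo + lam_hi)
        if np.linalg.norm(np.linalg.solve(H + lam * I, -g)) > Delta:
            lam_lo = lam
        else:
            lam_hi = lam
    return np.linalg.solve(H + lam_hi * I, -g)
```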
For large-scale problems, it may be too expensive to solve (7.3.1) near-exactly,
since the process may require several factorizations of an n × n matrix (namely,
the coefficient matrix in (7.3.2), for different values of λ). A popular approach
for finding approximate solutions of (7.3.1), which can be used when ∇2f(xk)
is positive definite, is the dogleg method. In this method, the curved path traced
out by solutions of (7.3.2) for values of λ in the interval [0,∞) is approximated
by a simpler path consisting of two line segments. The first segment joins 0 to
the point dkC that minimizes the objective in (7.3.1) along the direction −∇f(xk),
while the second segment joins dkC to the pure Newton step defined in (7.1.4).
The approximate solution is taken to be the point at which this “dogleg” path
crosses the boundary of the trust region ‖d‖ ≤ ∆k. If the dogleg path lies entirely
inside the trust region, we take dk to be the pure Newton step.
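A sketch of the dogleg step just described, assuming ∇2f(xk) is positive definite:

```python
import numpy as np

def dogleg_step(g, H, Delta):
    """Dogleg approximation to the trust-region subproblem (7.3.1):
    follow -g to the Cauchy point, then head toward the Newton step,
    stopping where the path meets the boundary."""
    pN = np.linalg.solve(H, -g)                  # pure Newton step
    if np.linalg.norm(pN) <= Delta:
        return pN                                # whole path inside the region
    pC = -((g @ g) / (g @ H @ g)) * g            # Cauchy point: model min along -g
    if np.linalg.norm(pC) >= Delta:
        return -(Delta / np.linalg.norm(g)) * g  # boundary hit on first segment
    v = pN - pC                                  # second segment: ||pC + t v|| = Delta
    a, b, c = v @ v, 2.0 * (pC @ v), pC @ pC - Delta**2
    t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return pC + t * v
```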
Having discussed the trust-region subproblem (7.3.1), let us outline how it can
be used as the basis for a complete algorithm. A crucial role is played by the ratio
between the actual decrease in f, namely f(xk) − f(xk + dk), and the decrease in f
predicted by the quadratic objective in (7.3.1). Ideally, this ratio would be close
to 1. If it is greater than a small tolerance (say, 10−4), we accept the step
and proceed to the next iteration. Otherwise, we conclude that the trust-region
radius ∆k is too large, so we do not take the step; instead, we shrink the trust region
and re-solve (7.3.1) to obtain a new step. Additionally, when the actual-to-predicted
ratio is close to 1, we conclude that a larger trust region may hasten progress, so
we increase ∆k for the next iteration, provided that the bound ‖dk‖ ≤ ∆k really is
active at the solution of (7.3.1).
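Putting these pieces together, a skeleton of the full algorithm might look as follows. The thresholds (10−4 for acceptance, 0.75 for expansion, the factor 1/4 for shrinking) are typical values rather than prescriptions, and for brevity the subproblem is solved crudely by truncating the Newton step to the boundary, rather than via (7.3.2) or the dogleg method.

```python
import numpy as np

def trust_region_newton(f, grad, hess, x0, Delta=1.0, tol=1e-8, max_iter=200):
    """Skeleton trust-region Newton loop built around the
    actual-to-predicted decrease ratio (accept / shrink / expand)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(H, -g)
        if np.linalg.norm(d) > Delta:
            d *= Delta / np.linalg.norm(d)
        pred = -(g @ d + 0.5 * d @ H @ d)   # decrease predicted by the model
        actual = f(x) - f(x + d)            # actual decrease in f
        rho = actual / pred if pred > 0 else -1.0
        if rho > 1e-4:
            x = x + d                       # accept the step
        else:
            Delta *= 0.25                   # reject; shrink the trust region
        if rho > 0.75 and np.isclose(np.linalg.norm(d), Delta):
            Delta *= 2.0                    # bound was active: expand
    return x
```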
Unlike a basic line-search method, the trust-region Newton method can “escape”
from a saddle point. Suppose we have ∇f(xk) = 0 and ∇2f(xk) indefinite
with some strictly negative eigenvalues. Then the solution dk of (7.3.1) will be
nonzero, and the algorithm will step away from the saddle point, in the direction
of most negative curvature for ∇2f(xk).
Another appealing feature of the trust-region Newton approach is that when
the sequence it generates approaches a point x∗ satisfying second-order sufficient
conditions, the trust-region bound becomes inactive, and the method takes pure
Newton steps (7.1.4) for all sufficiently large k. Thus the local quadratic convergence
that characterizes Newton’s method is observed.
The basic difference between line-search and trust-region methods can be summarized
as follows. Line-search methods first choose a direction pk, then decide
how far to move along that direction. Trust-region methods do the opposite: They
choose the distance ∆k first, then find the direction that makes the best progress
for this step length.
7.4. Convergence to Second-Order Necessary Points Trust-region Newton methods
have the significant advantage of guaranteeing that any accumulation points
will satisfy second-order necessary conditions. A related approach based on cubic
regularization has similar properties, and comes with some additional complexity
guarantees. Cubic regularization requires the Hessian to be Lipschitz continuous,
as in (7.1.2). It follows that, for any M at least as large as the Lipschitz constant
in (7.1.2), the following cubic function yields a global upper bound for f:
(7.4.1) TM(z; x) := f(x) + ∇f(x)T(z − x) + (1/2)(z − x)T∇2f(x)(z − x) + (M/6)‖z − x‖^3.
Specifically, we have for any x that
f(z) ≤ TM(z; x), for all z.
The basic cubic regularization algorithm proceeds as follows, from a starting
point x0:
(7.4.2) xk+1 = arg min_z TM(z; xk), k = 0, 1, 2, . . . .
The complexity properties of this approach were analyzed in [33]. Variants on the
scheme were studied in [24] and [11, 12]. Implementation of this scheme involves
solution of the subproblem (7.4.2), which is not straightforward.
Rather than present the theory for (7.4.2), we describe an elementary algorithm
that makes use of the expansion (7.4.1) as well as the steepest-descent theory of
Subsection 4.1. Our algorithm aims to identify a point that approximately satisfies
second-order necessary conditions, that is,
(7.4.3) ‖∇f(x)‖ ≤ εg, λmin(∇2f(x)) ≥ −εH,
where εg and εH are two small constants. In addition to Lipschitz continuity of
the Hessian (7.1.2), we assume Lipschitz continuity of the gradient with constant
L (see (3.3.6)), and also that the objective f is lower-bounded by some number f̄.
Our algorithm takes steps of two types: a steepest-descent step, as in Subsection 4.1,
or a step in a negative curvature direction for ∇2f. Iteration k proceeds as follows:
(i) If ‖∇f(xk)‖ > εg, take the steepest descent step (4.1.1).
(ii) Otherwise, if λmin(∇2f(xk)) < −εH, choose pk to be the eigenvector corresponding
to the most negative eigenvalue of ∇2f(xk). Choose the size
and sign of pk such that ‖pk‖ = 1 and (pk)T∇f(xk) ≤ 0, and set
(7.4.4) xk+1 = xk + αkpk, where αk = 2εH/M.
(iii) If neither of these conditions holds, then xk satisfies the necessary conditions
(7.4.3), so is an approximate second-order-necessary point.
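The steps above can be sketched as follows, assuming the constants L and M are known (in practice they must be estimated), and using a full eigendecomposition of the Hessian for clarity:

```python
import numpy as np

def second_order_point(grad, hess, x0, eps_g, eps_H, L, M, max_iter=10000):
    """Two-phase scheme: steepest-descent steps with step length 1/L while
    the gradient is large; negative-curvature steps of length 2*eps_H/M
    otherwise; stop when (7.4.3) holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps_g:
            x = x - g / L                     # step (i): steepest descent
            continue
        w, Q = np.linalg.eigh(hess(x))
        if w[0] < -eps_H:
            p = Q[:, 0]                       # most negative curvature direction
            if g @ p > 0:
                p = -p                        # enforce p.T grad <= 0
            x = x + (2.0 * eps_H / M) * p     # step (ii), step length (7.4.4)
        else:
            break                             # (7.4.3) holds
    return x
```

Started at the saddle point of a simple double-well objective, the negative-curvature steps move the iterate off the saddle, after which steepest descent carries it to a minimizer.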
For the steepest-descent step (i), we have from (4.1.3) that
(7.4.5) f(xk+1) ≤ f(xk) − (1/(2L))‖∇f(xk)‖^2 ≤ f(xk) − εg^2/(2L).
For a step of type (ii), we have from (7.4.1) that
f(xk+1) ≤ f(xk) + αk∇f(xk)Tpk + (1/2)αk^2(pk)T∇2f(xk)pk + (1/6)Mαk^3‖pk‖^3
≤ f(xk) − (1/2)(2εH/M)^2 εH + (1/6)M(2εH/M)^3
(7.4.6) = f(xk) − (2/3)εH^3/M^2.
By aggregating (7.4.5) and (7.4.6), we have that at each xk for which the condition
(7.4.3) does not hold, we attain a decrease in the objective of at least
min(εg^2/(2L), (2/3)εH^3/M^2).
Using the lower bound f̄ on the objective f, we see that the number of iterations
K required must satisfy the condition
K min(εg^2/(2L), (2/3)εH^3/M^2) ≤ f(x0) − f̄,
from which we conclude that
K ≤ max(2Lεg^−2, (3/2)M^2εH^−3)(f(x0) − f̄).
Note that the maximum number of iterates required to identify a point for which
just the approximate stationarity condition ‖∇f(xk)‖ ≤ εg holds is at most
2Lεg^−2(f(x0) − f̄). (We can just omit the second-order part of the algorithm.) Note too that it is
easy to devise approximate versions of this algorithm with similar complexity. For
example, the negative curvature direction pk in step (ii) above can be replaced
by an approximation to the direction of most negative curvature, obtained by the
Lanczos iteration with random initialization.
In algorithms that make more complete use of the cubic model (7.4.1), the term
εg^−2 in the complexity expression becomes εg^−3/2, and the constants are different.
The subproblems in (7.4.1) are more complicated to solve than those in the simple
scheme above, however.
8. Conclusions
We have outlined various algorithmic tools from optimization that are useful
for solving problems in data analysis and machine learning, and presented their
basic theoretical properties. The intersection of optimization and machine learning
is a fruitful and very popular area of current research. All the major machine
learning conferences have a large contingent of optimization papers, and there is
a great deal of interest in developing algorithmic tools to meet new challenges
and in understanding their properties. The edited volume [39] contains a snapshot
of the state of the art circa 2010, but this is a fast-moving field and there have
been many developments since then.
Acknowledgments
I thank Ching-pei Lee for a close reading and many helpful suggestions on the
manuscript.
References
[1] L. Balzano, R. Nowak, and B. Recht, Online identification and tracking of subspaces from highly incomplete information, 48th Annual Allerton Conference on Communication, Control, and Computing, 2010, pp. 704–711. http://arxiv.org/abs/1006.4046.
[2] L. Balzano and S. J. Wright, Local convergence of an algorithm for subspace identification from partial data, Foundations of Computational Mathematics 14 (2014), 1–36. DOI: 10.1007/s10208-014-9227-7.
[3] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2 (2009), no. 1, 183–202.
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992, pp. 144–152.
[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning 3 (2011), no. 1, 1–122.
[6] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine Learning 8 (2015), no. 3–4, 231–357.
[7] S. Bubeck, Y. T. Lee, and M. Singh, A geometric alternative to Nesterov’s accelerated gradient descent, Technical Report arXiv:1506.08187, Microsoft Research, 2015.
[8] S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Mathematical Programming, Series B 95 (2003), 329–357.
[9] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics 9 (2009), 717–772.
[10] E. J. Candès, X. Li, Y. Ma, and J. Wright, Robust principal component analysis?, Journal of the ACM 58 (2011), no. 3, 11.
[11] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results, Mathematical Programming, Series A 127 (2011), 245–295.
[12] C. Cartis, N. I. M. Gould, and Ph. L. Toint, Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity, Mathematical Programming, Series A 130 (2011), no. 2, 295–319.
[13] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, Rank-sparsity incoherence for matrix decomposition, SIAM Journal on Optimization 21 (2011), no. 2, 572–596.
[14] Y. Chen and M. J. Wainwright, Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees, Technical Report arXiv:1509.03025, University of California-Berkeley, 2015.
[15] C. Cortes and V. N. Vapnik, Support-vector networks, Machine Learning 20 (1995), 273–297.
[16] A. d’Aspremont, O. Banerjee, and L. El Ghaoui, First-order methods for sparse covariance selection, SIAM Journal on Matrix Analysis and Applications 30 (2008), 56–66.
[17] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. Lanckriet, A direct formulation for sparse PCA using semidefinite programming, SIAM Review 49 (2007), no. 3, 434–448.
[18] T. Dasu and T. Johnson, Exploratory data mining and data cleaning, John Wiley & Sons, 2003.
[19] D. Drusvyatskiy, M. Fazel, and S. Roy, An optimal first-order method based on optimal quadratic averaging, Technical Report arXiv:1604.06543, University of Washington, 2016.
[20] J. Duchi, Introductory lectures on stochastic optimization, Stanford University, 2016. To appear in the IAS/Park City Mathematics Series volume on "Mathematics of Data".
[21] J. Eckstein and D. P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming 55 (1992), 293–318.
[22] M. Frank and P. Wolfe, An algorithm for quadratic programming, Naval Research Logistics Quarterly 3 (1956), 95–110.
[23] J. Friedman, T. Hastie, and R. Tibshirani, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (2008), no. 3, 432–441.
[24] A. Griewank, The modification of Newton’s method for unconstrained optimization by bounding cubic terms, Technical Report NA/12, DAMTP, Cambridge University, 1981.
[25] M. R. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal of Research of the National Bureau of Standards 49 (1952), 409–436.
[26] A. J. Hoffman and H. W. Wielandt, The variation of the spectrum of a normal matrix, Duke Mathematical Journal 20 (1953), no. 1, 37–39.
[27] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent converges to minimizers, Technical Report arXiv:1602.04915, University of California-Berkeley, 2016.
[28] D. C. Liu and J. Nocedal, On the limited-memory BFGS method for large scale optimization, Mathematical Programming 45 (1989), 503–528.
[29] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM Journal on Scientific and Statistical Computing 4 (1983), 553–572.
[30] A. S. Nemirovski and D. B. Yudin, Problem complexity and method efficiency in optimization, John Wiley, 1983.
[31] Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2), Doklady AN SSSR 269 (1983), 543–547.
[32] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science and Business Media, New York, 2004.
[33] Y. Nesterov and B. T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming, Series A 108 (2006), 177–205.
[34] J. Nocedal and S. J. Wright, Numerical Optimization, Second Edition, Springer, New York, 2006.
[35] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics 4 (1964), no. 5, 1–17.
[36] B. T. Polyak, Introduction to optimization, Optimization Software, 1987.
[37] B. Recht, M. Fazel, and P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review 52 (2010), no. 3, 471–501.
[38] R. T. Rockafellar, Convex analysis, Princeton University Press, Princeton, N.J., 1970.
[39] S. Sra, S. Nowozin, and S. J. Wright (eds.), Optimization for machine learning, NIPS Workshop Series, MIT Press, 2011.
[40] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society B 58 (1996), 267–288.
[41] B. Turlach, W. N. Venables, and S. J. Wright, Simultaneous variable selection, Technometrics 47 (2005), no. 3, 349–363.
[42] S. J. Wright, Primal-dual interior-point methods, SIAM, Philadelphia, PA, 1997.
[43] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, Series B 151 (2015), 3–34.
[44] X. Yi, D. Park, Y. Chen, and C. Caramanis, Fast algorithms for robust PCA via gradient descent, Technical Report arXiv:1605.07784, University of Texas-Austin, 2016.
Computer Sciences Department, University of Wisconsin-Madison, Madison, WI 53706
E-mail address: [email protected]