MATH 301: Advanced Topics in Convex Optimization (Candès) — Accelerated first-order methods
TRANSCRIPT
Agenda — Fast proximal gradient methods
1. Accelerated first-order methods
2. Auxiliary sequences
3. Convergence analysis
4. Numerical examples
5. Optimality of Nesterov's scheme
Last time
- Proximal gradient method → convergence rate 1/k
- Subgradient methods → convergence rate 1/√k

Can we do better for non-smooth problems

min f(x) = g(x) + h(x)

with the same computational effort as the proximal gradient method but with faster convergence?

Answer: yes we can, with an equally simple scheme
x_{k+1} = arg min_x Q_{1/t}(x, y_k)

Note that we use y_k instead of x_k, where the new point y_k is cleverly chosen.
Original idea: Nesterov (1983), for minimization of a smooth objective.
Here: nonsmooth problems.
Accelerated first-order methods

Choose x_0 and set y_0 = x_0. Repeat for k = 1, 2, . . .

x_k = prox_{t_k h}(y_{k−1} − t_k ∇g(y_{k−1}))
y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1})

- same computational complexity as the proximal gradient method
- with h = 0, this is the accelerated gradient descent of Nesterov ('83)
- can be used with various step-size rules: fixed, backtracking line search (BLS), . . .

Interpretation: the extrapolation x_k + ((k − 1)/(k + 2))(x_k − x_{k−1}) adds a momentum term, which prevents zigzagging.
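The two-line scheme above can be sketched directly in code. A minimal sketch, assuming user-supplied callables `grad_g` for ∇g and `prox_h(z, t)` for prox_{t h}, with a fixed step size t (names are illustrative, not a reference implementation):

```python
import numpy as np

def accelerated_prox_grad(grad_g, prox_h, x0, t, n_iter=100):
    """Accelerated proximal gradient with (k-1)/(k+2) momentum."""
    x_prev = np.asarray(x0, dtype=float).copy()
    y = x_prev.copy()
    for k in range(1, n_iter + 1):
        # proximal gradient step taken at the extrapolated point y
        x = prox_h(y - t * grad_g(y), t)
        # momentum: extrapolate past x using the previous iterate
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
        x_prev = x
    return x_prev
```

With h = 0 (`prox_h = lambda z, t: z`) this reduces to Nesterov's accelerated gradient descent.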
Other formulations: Beck and Teboulle (2009)

Fix step size t = 1/L(g). Choose x_0, set y_0 = x_0, θ_0 = 1.
Loop: for k = 1, 2, . . .
(a) x_k = prox_{t h}(y_{k−1} − t ∇g(y_{k−1}))
(b) 1/θ_k = (1 + √(1 + 4/θ²_{k−1})) / 2
(c) y_k = x_k + θ_k (1/θ_{k−1} − 1)(x_k − x_{k−1})
With BLS (knowledge of the Lipschitz constant is not necessary):
Choose x_0, set y_0 = x_0, θ_0 = 1.
Loop: for k = 1, 2, . . ., backtrack until (this gives t_k)

f(y_{k−1} − t_k G_{t_k}(y_{k−1})) ≤ Q_{1/t_k}(y_{k−1} − t_k G_{t_k}(y_{k−1}), y_{k−1})

where prox_{t_k h}(y_{k−1} − t_k ∇g(y_{k−1})) = y_{k−1} − t_k G_{t_k}(y_{k−1}); then perform steps (a)–(c) as above.
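One common way to implement the backtracking condition checks the quadratic majorization of g at the candidate point, which is equivalent to f(x⁺) ≤ Q_{1/t}(x⁺, y) since h(x⁺) appears on both sides. A minimal sketch (the shrink factor `beta` and the calling convention `prox_h(z, t)` are assumptions):

```python
import numpy as np

def backtracking_prox_step(g, grad_g, prox_h, y, t, beta=0.5):
    """One proximal gradient step with backtracking line search (BLS).

    Shrinks t by beta until the candidate x+ = prox_{t h}(y - t grad_g(y))
    satisfies g(x+) <= g(y) + <grad_g(y), x+ - y> + ||x+ - y||^2 / (2t).
    """
    gy, grad_y = g(y), grad_g(y)
    while True:
        x_plus = prox_h(y - t * grad_y, t)
        d = x_plus - y
        if g(x_plus) <= gy + grad_y @ d + (d @ d) / (2.0 * t):
            return x_plus, t
        t *= beta
```

For g(x) = (1/2)‖x‖² (so L = 1) and h = 0, starting from a deliberately large t, the loop shrinks t until the majorization holds.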
Convergence analysis

Theorem: f(x_k) − f* ≤ 2‖x_0 − x*‖² / ((k + 1)² t)
- t = 1/L for the fixed step size
- t = β/L for BLS
Other 1/k² first-order methods
- Nesterov (2007): two auxiliary sequences {y_k}, {z_k}; two prox operations at each iteration; different (≠) convergence analysis
- Lu, Lan and Monteiro
- Tseng
- Auslander and Teboulle
Unified analysis framework: Tseng (2008)
Proof (Beck and Teboulle's version)

Set v_k := (1/θ_{k−1}) x_k − (1/θ_{k−1} − 1) x_{k−1}. Then y_k = θ_k v_k + (1 − θ_k) x_k, and
(i) v_{k+1} = v_k + (1/θ_k)(x_{k+1} − y_k)
(ii) (1 − θ_k)/θ_k² = 1/θ²_{k−1}
Proof of (ii): with u = 4/θ²_{k−1} + 1,

(1 − θ_k)/θ_k² = [(1 + √u)²/4] · [(√u − 1)/(√u + 1)] = (u − 1)/4 = (4/θ²_{k−1} + 1 − 1)/4 = 1/θ²_{k−1}
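Identity (ii) is easy to sanity-check numerically for the θ-update above; the short script below also checks the growth bound 1/θ_k ≥ (k + 2)/2, which is what the final 1/θ²_{k−1} ≥ (k + 1)²/4 step uses:

```python
import math

# Verify identity (ii): (1 - theta_k)/theta_k^2 = 1/theta_{k-1}^2 for the
# update 1/theta_k = (1 + sqrt(1 + 4/theta_{k-1}^2))/2, starting from theta_0 = 1,
# together with the growth bound 1/theta_k >= (k + 2)/2.
theta = 1.0
for k in range(1, 20):
    theta_prev = theta
    theta = 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / theta_prev ** 2))
    assert abs((1.0 - theta) / theta ** 2 - 1.0 / theta_prev ** 2) < 1e-9
    assert 1.0 / theta >= (k + 2) / 2.0 - 1e-12
```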
Increment in one iteration (Beck and Teboulle; Vandenberghe). Notation: x = x_{i−1}, x⁺ = x_i, y = y_{i−1}, v = v_{i−1}, v⁺ = v_i, θ = θ_{i−1}.

Pillars of the analysis:
(1) f(x⁺) ≤ f(x) + G_t(y)ᵀ(y − x) − (t/2)‖G_t(y)‖²
(2) f(x⁺) ≤ f* + G_t(y)ᵀ(y − x*) − (t/2)‖G_t(y)‖²

Take the convex combination with weights (1 − θ) and θ:

f(x⁺) ≤ (1 − θ)f(x) + θf* + ⟨G_t(y), y − (1 − θ)x − θx*⟩ − (t/2)‖G_t(y)‖²
      = (1 − θ)f(x) + θf* + θ⟨G_t(y), v − x*⟩ − (t/2)‖G_t(y)‖²

because y = θv + (1 − θ)x.
Rearranging (completing the square),

f(x⁺) − f* ≤ (1 − θ)[f(x) − f*] + (θ²/2t)[‖v − x*‖² − ‖v − x* − (t/θ)G_t(y)‖²]

Now

v − (t/θ)G_t(y) = v + (1/θ)[(y − t G_t(y)) − y] = v + (1/θ)(x⁺ − y) = v⁺

Therefore

f(x⁺) − f* ≤ (1 − θ)[f(x) − f*] + (θ²/2t)[‖v − x*‖² − ‖v⁺ − x*‖²]

Conclusion
Dividing by θ² = θ²_{i−1},

(1/θ²_{i−1})[f(x_i) − f*] + (1/2t)‖v_i − x*‖² ≤ ((1 − θ_{i−1})/θ²_{i−1})[f(x_{i−1}) − f*] + (1/2t)‖v_{i−1} − x*‖²

By (ii) we have (1 − θ_{i−1})/θ²_{i−1} = 1/θ²_{i−2}, so the inequality telescopes:

(1/θ²_{k−1})[f(x_k) − f*] + (1/2t)‖v_k − x*‖² ≤ ((1 − θ_0)/θ_0²)[f(x_0) − f*] + (1/2t)‖v_0 − x*‖²
Since θ_0 = 1 and v_0 = x_0,

(1/θ²_{k−1})(f(x_k) − f*) ≤ (1/2t)‖x_0 − x*‖²

Since 1/θ²_{k−1} ≥ (k + 1)²/4,

f(x_k) − f* ≤ 2‖x_0 − x*‖² / ((k + 1)² t)
The argument is similar for BLS; see Beck and Teboulle (2009).
Case study: LASSO

min f(x) = (1/2)‖Ax − b‖₂² + λ‖x‖₁

Choose x_0, set y_0 = x_0 and θ_0 = 1, and repeat

x_k = S_{t_k λ}(y_{k−1} − t_k A*(A y_{k−1} − b))
θ_k = 2[1 + √(1 + 4/θ²_{k−1})]⁻¹
y_k = x_k + θ_k(θ⁻¹_{k−1} − 1)(x_k − x_{k−1})

until convergence (S_t is soft-thresholding at level t).

Dominant computational cost per iteration:
- one application of A
- one application of A*
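The three LASSO steps above, as a short NumPy sketch with the fixed step t = 1/L, where L = λ_max(AᵀA) = σ_max(A)² (function names and the iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-thresholding S_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """FISTA for min (1/2)||Ax - b||_2^2 + lam*||x||_1, fixed step t = 1/L."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2   # L = sigma_max(A)^2
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    theta_prev = 1.0
    for _ in range(n_iter):
        # one application of A (A @ y) and one of A* (A.T @ ...) per iteration
        x = soft_threshold(y - t * A.T @ (A @ y - b), t * lam)
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev ** 2))
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, theta_prev = x, theta
    return x_prev
```

Note that the per-iteration cost is exactly one product with A and one with A*, matching the remark above.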
Example from Beck and Teboulle (FISTA)

[Figure 5 from the FISTA paper: comparison of the function value errors F(x_k) − F(x*) of ISTA, MTWIST, and FISTA; the errors span roughly 10² down to 10⁻⁸ over 10,000 iterations.]
Example from Vandenberghe (EE 236C, UCLA)
Convergence analysis
- almost identical to the proof in lecture 1
- on page 1-15 we replace ∇f(y) with G_t(y)
- the 2nd inequality on p. 1-15 follows from p. 4-13 with 0 < t ≤ 1/L:

f(x⁺) ≤ f(x) + G_t(y)ᵀ(y − x) − (t/2)‖G_t(y)‖₂²

and

f(x⁺) ≤ f* + G_t(y)ᵀ(y − x*) − (t/2)‖G_t(y)‖₂²

- making a convex combination gives

f(x⁺) ≤ (1 − θ)f(x) + θf* + G_t(y)ᵀ(y − (1 − θ)x − θx*) − (t/2)‖G_t(y)‖₂²

- conclusion: the number of iterations to reach f(x^(k)) − f* ≤ ε is O(1/√ε)

(Gradient methods for nonsmooth problems, 4-17)
1-norm regularized least-squares

minimize (1/2)‖Ax − b‖₂² + ‖x‖₁

[Convergence plot: relative error (f(x^(k)) − f*)/f* versus iteration k.]

randomly generated A ∈ R^{2000×1000}; step t_k = 1/L with L = λ_max(AᵀA)
Nuclear norm regularization

min g(X) + λ‖X‖_*

General gradient update:

X = arg min { (1/2t)‖X − (X_0 − t∇g(X_0))‖_F² + λ‖X‖_* } = S_{tλ}(X_0 − t∇g(X_0))

where S_λ is the singular value soft-thresholding operator:

X = Σ_{j=1}^r σ_j u_j v_j*  ⇒  S_t(X) := Σ_{j=1}^r max(σ_j − t, 0) u_j v_j*
Example

min (1/2)‖A(X) − b‖² + λ‖X‖_*

Choose X_0, set Y_0 = X_0, θ_0 = 1 and repeat

X_k = S_{t_k λ}[Y_{k−1} − t_k A*(A(Y_{k−1}) − b)]
θ_k = . . .
Y_k = . . .

Important remark: we only need to compute the top part of the SVD of Y_{k−1} − t_k A*(A(Y_{k−1}) − b), i.e., the singular values (and vectors) with singular values exceeding t_k λ.
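A dense-SVD sketch of the singular value soft-thresholding operator S_τ. In line with the remark above, a large-scale implementation would compute only the singular values exceeding τ (e.g. with a Lanczos-type partial SVD); the dense version here is for clarity:

```python
import numpy as np

def svt(X, tau):
    """Singular value soft-thresholding: prox of tau * nuclear norm at X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # shrink each singular value sigma_j toward zero by tau, dropping any below tau
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

For a rank-1 matrix X = 3 u v* with unit vectors u, v, svt(X, 1) returns 2 u v*.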
Example from Vandenberghe (EE 236C, UCLA)
minimize Σ_{(i,j) obs.} (X_ij − M_ij)² + λ‖X‖_*

X is 500 × 500 with 5,000 observed entries; fixed step size t = 1/L

[Convergence plot (fixed step size t = 1/L): relative error (f(x^(k)) − f*)/f* versus iteration k.]
Result

[Plot of the normalized singular values of the computed solution by index.]

The optimal X has rank 38; the relative error in the specified entries is 9%.
Optimality of Nesterov's Method

min f(x), with f convex and ∇f Lipschitz

No method which updates x_k in span{x_0, ∇f(x_0), . . . , ∇f(x_{k−1})} can converge faster than 1/k²: 1/k² is the optimal rate for first-order methods.

Why? Consider

f(x) = (1/2) x*Ax − e_1*x

where A ∈ R^{n×n} is the tridiagonal matrix with 2 on the diagonal and −1 on the first sub- and super-diagonals, and e_1 = (1, 0, . . . , 0)*. Then A ⪰ 0, ‖A‖ ≤ 4, and the solution obeys Ax = e_1:

x*_i = 1 − i/(n + 1),   f* = −n/(2(n + 1))
Note that

‖x*‖² = (1/(n + 1)²)(1² + · · · + n²) ≤ (n + 1)/3

since (k + 1)³ − k³ ≥ 3k² and Σ_{0≤k≤n} [(k + 1)³ − k³] = (n + 1)³, so Σ_{k=1}^n k² ≤ (n + 1)³/3.
Start the first-order algorithm at x_0 = 0. Then

span(∇f(x_0)) = span(e_1) ⟹ x_1 ∈ span(e_1)
span(∇f(x_1)) ⊂ span(e_1, e_2), span(∇f(x_2)) ⊂ span(e_1, e_2, e_3), . . .

so x_k is supported on the first k coordinates, and

f(x_k) ≥ inf_{x_{k+1} = · · · = x_n = 0} f(x) = −k/(2(k + 1))

For n = 2k + 1,

f(x_k) − f* ≥ n/(2(n + 1)) − k/(2(k + 1)) = 1/(4(k + 1))
So, using ‖x*‖² ≤ (n + 1)/3 and n + 1 = 2(k + 1),

f(x_k) − f* ≥ (1/(4(k + 1))) · (‖x*‖²/‖x*‖²) ≥ 3‖x*‖²/(4(k + 1)(n + 1)) = 3‖x*‖²/(8(k + 1)²)
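The closed-form minimizer and the stated properties of the worst-case quadratic can be verified numerically (a quick check; the dimension n is chosen arbitrarily):

```python
import numpy as np

n = 101  # dimension chosen arbitrarily for the check

# A = tridiag(-1, 2, -1): 2 on the diagonal, -1 on the first off-diagonals
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0

# Closed-form minimizer x*_i = 1 - i/(n+1) solves A x = e1,
# with optimal value f* = -n / (2(n+1))
i = np.arange(1, n + 1)
x_star = 1.0 - i / (n + 1)
assert np.allclose(A @ x_star, e1)

f_star = 0.5 * x_star @ A @ x_star - e1 @ x_star
assert abs(f_star - (-n / (2 * (n + 1)))) < 1e-12

# A is positive semidefinite with ||A|| <= 4, and ||x*||^2 <= (n+1)/3
eigs = np.linalg.eigvalsh(A)
assert eigs.min() > -1e-10 and eigs.max() <= 4.0
assert x_star @ x_star <= (n + 1) / 3
```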
References
1. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report, CORE, Université Catholique de Louvain, 2007.
2. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
3. M. Teboulle. First Order Algorithms for Convex Minimization. Optimization Tutorials, IPAM, UCLA, 2010.
4. L. Vandenberghe. EE 236C (Spring 2011), UCLA.