Agenda
Fast proximal gradient methods
1 Accelerated first-order methods
2 Auxiliary sequences
3 Convergence analysis
4 Numerical examples
5 Optimality of Nesterov’s scheme
Last time
Proximal gradient method → convergence rate $1/k$

Subgradient methods → convergence rate $1/\sqrt{k}$
Can we do better for the nonsmooth problem
$$\min\ f(x) = g(x) + h(x)$$
with the same computational effort as the proximal gradient method, but with faster convergence?

Answer: yes we can, with an equally simple scheme
$$x_{k+1} = \arg\min_x\ Q_{1/t}(x, y_k)$$

Note that we use $y_k$ instead of $x_k$; the new point $y_k$ is cleverly chosen
Original idea: Nesterov (1983), for minimization of a smooth objective
Here: nonsmooth problem
Accelerated first-order methods
Choose $x_0$ and set $y_0 = x_0$. Repeat for $k = 1, 2, \ldots$:
$$x_k = \operatorname{prox}_{t_k h}\bigl(y_{k-1} - t_k\,\nabla g(y_{k-1})\bigr)$$
$$y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1})$$
same computational complexity as proximal gradient
with h = 0, this is the accelerated gradient descent of Nesterov (’83)
can be used with various stepsize rules: fixed, backtracking line search (BLS), . . .
interpretation: in $y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1})$, the second term is a momentum term, which prevents zigzagging (see the sketch below)
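A minimal Python sketch of this iteration (`grad_g`, `prox_h`, the fixed step size `t`, and the iteration count are assumptions supplied by the caller):

```python
def accelerated_prox_grad(grad_g, prox_h, x0, t, n_iter=500):
    """Sketch of the accelerated proximal gradient iteration.

    grad_g : gradient of the smooth part g (numpy vector in/out)
    prox_h : (v, t) -> prox_{t h}(v)
    t      : fixed step size, e.g. 1/L for L-Lipschitz grad_g
    """
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, n_iter + 1):
        # proximal gradient step taken from the auxiliary point y
        x = prox_h(y - t * grad_g(y), t)
        # momentum: extrapolate with weight (k-1)/(k+2)
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_prev = x
    return x
```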
Other formulations: Beck and Teboulle (2009)

Fix step size $t = 1/L(g)$. Choose $x_0$, set $y_0 = x_0$, $\theta_0 = 1$.

Loop: for $k = 1, 2, \ldots$

(a) $x_k = \operatorname{prox}_{th}\bigl(y_{k-1} - t\,\nabla g(y_{k-1})\bigr)$

(b) $\dfrac{1}{\theta_k} = \dfrac{1 + \sqrt{1 + 4/\theta_{k-1}^2}}{2}$

(c) $y_k = x_k + \theta_k\Bigl(\dfrac{1}{\theta_{k-1}} - 1\Bigr)(x_k - x_{k-1})$
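The recursion in step (b) is closely related to the $\frac{k-1}{k+2}$ scheme above: a quick numeric check (a standalone snippet, nothing assumed beyond the recursion itself) shows $\theta_k$ tracks $2/(k+2)$ from below:

```python
theta = 1.0  # theta_0
for k in range(1, 6):
    theta = 2.0 / (1.0 + (1.0 + 4.0 / theta**2) ** 0.5)  # step (b)
    print(k, round(theta, 4), round(2 / (k + 2), 4))
# theta_1 = 0.618 vs 2/3 = 0.667; theta_k <= 2/(k+2) for all k
```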
With BLS (knowledge of the Lipschitz constant is not necessary):

Choose $x_0$, set $y_0 = x_0$, $\theta_0 = 1$.

Loop: for $k = 1, 2, \ldots$, backtrack until (this gives $t_k$)
$$f\bigl(y_{k-1} - t_k G_{t_k}(y_{k-1})\bigr) \le Q_{1/t_k}\bigl(y_{k-1} - t_k G_{t_k}(y_{k-1}),\; y_{k-1}\bigr)$$
where $y_{k-1} - t_k G_{t_k}(y_{k-1}) \triangleq \operatorname{prox}_{t_k h}\bigl(y_{k-1} - t_k\,\nabla g(y_{k-1})\bigr)$. Then perform steps (a), (b), (c) as before (a Python sketch of the backtracking loop follows).
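A minimal sketch of the backtracking loop, assuming numpy vectors, a halving factor $\beta = 1/2$, and user-supplied `g`, `grad_g`, `prox_h`; since $h(x^+)$ appears on both sides of the condition, it suffices to test the smooth part $g$ against its quadratic model:

```python
def backtrack_step(g, grad_g, prox_h, y, t0=1.0, beta=0.5):
    """Backtracking line search giving t_k and the prox-gradient point.

    g, grad_g : smooth part and its gradient (numpy vectors in/out)
    prox_h    : (v, t) -> prox_{t h}(v)
    """
    t = t0
    gy, grad_y = g(y), grad_g(y)
    while True:
        x_plus = prox_h(y - t * grad_y, t)      # candidate x_k
        d = x_plus - y
        # f(x+) <= Q_{1/t}(x+, y) reduces to this test on g alone
        if g(x_plus) <= gy + grad_y @ d + d @ d / (2 * t):
            return x_plus, t                    # accept: this is t_k
        t *= beta                               # shrink and retry
```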
Convergence analysis
Theorem
$$f(x_k) - f^\star \le \frac{2\,\|x_0 - x^\star\|^2}{(k+1)^2\, t}$$
t = 1/L for fixed step size
t = β/L for BLS
Other $1/k^2$ first-order methods:
Nesterov 2007
two auxiliary sequences $\{y_k\}, \{z_k\}$; two prox operations at each iteration; a different convergence analysis
Lu, Lan and Monteiro
Tseng
Auslender and Teboulle
Unified analysis framework: Tseng (2008)
Proof (Beck and Teboulle’s version)
Define
$$v_k \triangleq \frac{1}{\theta_{k-1}}\, x_k - \Bigl[\frac{1}{\theta_{k-1}} - 1\Bigr] x_{k-1}$$
Then $y_k = \theta_k v_k + (1-\theta_k)\, x_k$, and

(i) $v_{k+1} = v_k + \frac{1}{\theta_k}\,[x_{k+1} - y_k]$

(ii) $\frac{1-\theta_k}{\theta_k^2} = \frac{1}{\theta_{k-1}^2}$
Proof of (ii): set $u = 4/\theta_{k-1}^2 + 1$, so that $\frac{1}{\theta_k} = \frac{1+\sqrt{u}}{2}$ by step (b). Then
$$\frac{1-\theta_k}{\theta_k^2} = \frac{(1+\sqrt{u})^2}{4}\cdot\frac{\sqrt{u}-1}{\sqrt{u}+1} = \frac{u-1}{4} = \frac{4/\theta_{k-1}^2 + 1 - 1}{4} = \frac{1}{\theta_{k-1}^2}$$
Increment in one iteration: Beck and Teboulle, Vandenberghe
Notation: $x = x_{i-1}$, $x^+ = x_i$, $y = y_{i-1}$, $v = v_{i-1}$, $v^+ = v_i$, $\theta = \theta_{i-1}$
Pillars of analysis:
(1) $f(x^+) \le f(x) + G_t(y)^T(y - x) - \frac{t}{2}\|G_t(y)\|^2$

(2) $f(x^+) \le f^\star + G_t(y)^T(y - x^\star) - \frac{t}{2}\|G_t(y)\|^2$

Taking the convex combination with weights $1-\theta$ and $\theta$:
$$\begin{aligned}
f(x^+) &\le (1-\theta)\,f(x) + \theta f^\star + \langle G_t(y),\, y - (1-\theta)x - \theta x^\star\rangle - \tfrac{t}{2}\|G_t(y)\|^2 \\
&= (1-\theta)\,f(x) + \theta f^\star + \theta\,\langle G_t(y),\, v - x^\star\rangle - \tfrac{t}{2}\|G_t(y)\|^2
\end{aligned}$$
Because $y = \theta v + (1-\theta)x$,
$$f(x^+) - f^\star \le (1-\theta)\,[f(x) - f^\star] + \frac{\theta^2}{2t}\Bigl[\|v - x^\star\|^2 - \bigl\|v - x^\star - \tfrac{t}{\theta}\, G_t(y)\bigr\|^2\Bigr]$$
and, by (i),
$$v - \frac{t}{\theta}\, G_t(y) = v + \frac{1}{\theta}\bigl[(y - t\, G_t(y)) - y\bigr] = v + \frac{1}{\theta}\,[x^+ - y] = v^+$$
Therefore
$$f(x^+) - f^\star \le (1-\theta)\,[f(x) - f^\star] + \frac{\theta^2}{2t}\Bigl[\|v - x^\star\|^2 - \|v^+ - x^\star\|^2\Bigr]$$
Conclusion (divide by $\theta^2 = \theta_{i-1}^2$):
$$\frac{1}{\theta_{i-1}^2}\,[f(x_i) - f^\star] + \frac{1}{2t}\|v_i - x^\star\|^2 \le \frac{1-\theta_{i-1}}{\theta_{i-1}^2}\,[f(x_{i-1}) - f^\star] + \frac{1}{2t}\|v_{i-1} - x^\star\|^2$$
By (ii), $\frac{1-\theta_{i-1}}{\theta_{i-1}^2} = \frac{1}{\theta_{i-2}^2}$, so the bound telescopes down to $i = 1$:
$$\frac{1}{\theta_{k-1}^2}\,[f(x_k) - f^\star] + \frac{1}{2t}\|v_k - x^\star\|^2 \le \frac{1-\theta_0}{\theta_0^2}\,[f(x_0) - f^\star] + \frac{1}{2t}\|v_0 - x^\star\|^2$$
Since $\theta_0 = 1$ and $v_0 = x_0$,
$$\frac{1}{\theta_{k-1}^2}\,\bigl(f(x_k) - f^\star\bigr) \le \frac{1}{2t}\|x_0 - x^\star\|^2$$
Since $\frac{1}{\theta_{k-1}^2} \ge \frac{(k+1)^2}{4}$,
$$f(x_k) - f^\star \le \frac{2}{(k+1)^2\, t}\,\|x_0 - x^\star\|^2$$
The argument with BLS is similar; see Beck and Teboulle (2009)
Case study: LASSO
$$\min\ f(x) = \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$$
Choose $x_0$, set $y_0 = x_0$ and $\theta_0 = 1$, and repeat
$$x_k = S_{t_k\lambda}\bigl(y_{k-1} - t_k A^*(Ay_{k-1} - b)\bigr)$$
$$\theta_k = 2\Bigl[1 + \sqrt{1 + 4/\theta_{k-1}^2}\Bigr]^{-1}$$
$$y_k = x_k + \theta_k(\theta_{k-1}^{-1} - 1)(x_k - x_{k-1})$$
until convergence ($S_t$ is soft-thresholding at level $t$)
Dominant computational cost per iteration
one application of A
one application of A∗
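Putting these pieces together, here is a self-contained Python sketch of the LASSO iteration above; the random data, the choice $\lambda = 0.1$, and the fixed iteration count are illustrative assumptions, with step $t = 1/\lambda_{\max}(A^TA)$:

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding S_tau, the prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=500):
    """FISTA for min (1/2)||Ax - b||^2 + lam * ||x||_1 (fixed-step sketch)."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L = lambda_max(A^T A)
    x_prev = np.zeros(A.shape[1])
    y, theta_prev = x_prev.copy(), 1.0
    for _ in range(n_iter):
        # one application of A and one of A* per iteration, as noted above
        x = soft_threshold(y - t * A.T @ (A @ y - b), t * lam)
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev**2))
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, theta_prev = x, theta
    return x

# illustrative usage on random data with a sparse ground truth
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 500))
b = A @ (rng.standard_normal(500) * (rng.random(500) < 0.05))
x_hat = fista_lasso(A, b, lam=0.1)
```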
Example from Beck and Teboulle (FISTA)
[Figure 5 of Beck and Teboulle: comparison of function value errors $F(x_k) - F(x^\star)$ of ISTA, MTWIST, and FISTA over 10,000 iterations, semilog scale from $10^{-8}$ to $10^2$.]
Example from Vandenberghe (EE 236C, UCLA)
Convergence analysis

almost identical to the proof in lecture 1

on page 1-15 we replace $\nabla f(y)$ with $G_t(y)$

the 2nd inequality on p. 1-15 follows from p. 4-13 with $0 < t \le 1/L$:
$$f(x^+) \le f(x) + G_t(y)^T(y - x) - \frac{t}{2}\|G_t(y)\|_2^2$$
and
$$f(x^+) \le f^\star + G_t(y)^T(y - x^\star) - \frac{t}{2}\|G_t(y)\|_2^2$$

making a convex combination gives
$$f(x^+) \le (1-\theta)\,f(x) + \theta f^\star + G_t(y)^T\bigl(y - (1-\theta)x - \theta x^\star\bigr) - \frac{t}{2}\|G_t(y)\|_2^2$$

conclusion: #iterations to reach $f(x^{(k)}) - f^\star \le \varepsilon$ is $O(1/\sqrt{\varepsilon})$
1-norm regularized least-squares
$$\text{minimize}\quad \frac{1}{2}\|Ax - b\|_2^2 + \|x\|_1$$
[Figure: $(f(x^{(k)}) - f^\star)/f^\star$ versus iteration $k$.]

randomly generated $A \in \mathbb{R}^{2000\times 1000}$; step $t_k = 1/L$ with $L = \lambda_{\max}(A^TA)$
Nuclear norm regularization
$$\min\ g(X) + \lambda\|X\|_*$$
General gradient update:
$$X = \arg\min_X\Bigl\{\frac{1}{2t}\bigl\|X - (X_0 - t\,\nabla g(X_0))\bigr\|_F^2 + \lambda\|X\|_*\Bigr\} = S_{t\lambda}\bigl(X_0 - t\,\nabla g(X_0)\bigr)$$
$S_t$ is the singular value soft-thresholding operator:
$$X = \sum_{j=1}^r \sigma_j\, u_j v_j^* \quad\Longrightarrow\quad S_t(X) := \sum_{j=1}^r \max(\sigma_j - t,\, 0)\, u_j v_j^*$$
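A direct sketch of $S_t$ via a full SVD (numpy.linalg.svd is used for clarity rather than efficiency; the remark in the next example shows why a partial SVD suffices in practice):

```python
import numpy as np

def svt(X, tau):
    """Singular value soft-thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - tau, 0.0)   # shrink each singular value by tau
    return (U * s) @ Vt            # same as U @ diag(s) @ Vt
```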
Example
$$\min\ \frac{1}{2}\|\mathcal{A}(X) - b\|^2 + \lambda\|X\|_*$$
Choose $X_0$, set $Y_0 = X_0$, $\theta_0 = 1$ and repeat
$$X_k = S_{t_k\lambda}\bigl[Y_{k-1} - t_k\,\mathcal{A}^*(\mathcal{A}(Y_{k-1}) - b)\bigr], \qquad \theta_k = \ldots, \qquad Y_k = \ldots$$
Important remark: we only need to compute the (top) part of the SVD of $Y_{k-1} - t_k\,\mathcal{A}^*(\mathcal{A}(Y_{k-1}) - b)$ with singular values exceeding $t_k\lambda$; see the sketch below.
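One common way to exploit this remark (a sketch of one practical strategy, not the method prescribed in these slides) is to request a growing number of top singular triplets from an iterative solver such as scipy.sparse.linalg.svds until the smallest computed singular value falls below $t_k\lambda$:

```python
import numpy as np
from scipy.sparse.linalg import svds

def svt_partial(Z, tau, r0=10, r_step=10):
    """SVT touching only singular values above tau (illustrative)."""
    cap = min(Z.shape) - 1              # svds requires k < min(Z.shape)
    r = min(r0, cap)
    while True:
        U, s, Vt = svds(Z, k=r)         # top-r singular triplets of Z
        if s.min() < tau or r == cap:
            break                       # all values >= tau are in hand
        r = min(r + r_step, cap)        # enlarge the partial SVD, retry
    keep = s > tau
    return (U[:, keep] * (s[keep] - tau)) @ Vt[keep, :]
```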
Example from Vandenberghe (EE 236C, UCLA)
$$\text{minimize}\quad \sum_{(i,j)\ \mathrm{obs.}} (X_{ij} - M_{ij})^2 + \lambda\|X\|_*$$
$X$ is $500 \times 500$; 5,000 observed entries; fixed step size $t = 1/L$
[Figure: convergence with fixed step size $t = 1/L$, plotting $(f(X^{(k)}) - f^\star)/f^\star$ versus iteration $k$.]

[Figure: normalized singular values of the result versus index.]

optimal $X$ has rank 38; relative error in specified entries is 9%
Optimality of Nesterov’s Method
min f(x)
f convex
∇f Lipschitz
No method that updates $x_k$ in $\operatorname{span}\{x_0, \nabla f(x_0), \ldots, \nabla f(x_{k-1})\}$ can converge faster than $1/k^2$

$\Longrightarrow$ $1/k^2$ is the optimal rate for first-order methods
Why?
$$f(x) = \frac{1}{2}\,x^* A x - e_1^* x, \qquad
A = \begin{bmatrix}
2 & -1 & 0 & \cdots & \cdots & 0\\
-1 & 2 & -1 & 0 & \cdots & 0\\
 & \ddots & \ddots & \ddots & & \\
0 & \cdots & 0 & -1 & 2 & -1\\
0 & \cdots & 0 & 0 & -1 & 2
\end{bmatrix}, \qquad
e_1 = \begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix}$$

$A \succeq 0$, $\|A\| \le 4$, and the solution obeys $Ax = e_1$:
$$x_i^\star = 1 - \frac{i}{n+1}, \qquad f^\star = -\frac{n}{2(n+1)}$$

Note that
$$\|x^\star\|^2 = \frac{1}{(n+1)^2}\,(n^2 + \ldots + 1) \le \frac{n+1}{3}$$
since
$$\sum_{0\le k\le n}\bigl[(k+1)^3 - k^3\bigr] \ge \sum_{0\le k\le n} 3k^2 \iff \sum_{k=1}^n k^2 \le \frac{(n+1)^3}{3}$$
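A small numeric sanity check of these formulas (illustrative; $n = 5$ is arbitrary):

```python
import numpy as np

n = 5
# tridiagonal worst-case matrix: 2 on the diagonal, -1 off the diagonal
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n); e1[0] = 1.0

x_star = np.linalg.solve(A, e1)                        # solution of A x = e1
i = np.arange(1, n + 1)
print(np.allclose(x_star, 1 - i / (n + 1)))            # True
f_star = 0.5 * x_star @ A @ x_star - e1 @ x_star
print(np.isclose(f_star, -n / (2 * (n + 1))))          # True
print(np.linalg.norm(x_star) ** 2 <= (n + 1) / 3)      # True
```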
Start a first-order algorithm at $x_0 = 0$:
$$\operatorname{span}(\nabla f(x_0)) = \operatorname{span}(e_1) \implies x_1 \in \operatorname{span}(e_1)$$
$$\operatorname{span}(\nabla f(x_1)) \subset \operatorname{span}(e_1, e_2), \quad \operatorname{span}(\nabla f(x_2)) \subset \operatorname{span}(e_1, e_2, e_3), \quad \ldots$$
Hence $x_k$ is supported on its first $k$ coordinates, and minimizing $f$ over that subspace is the same problem in dimension $k$:
$$f(x_k) \ge \inf_{x_{k+1} = \ldots = x_n = 0} f(x) = -\frac{k}{2(k+1)}$$
For $n = 2k+1$ (i.e., $k \approx n/2$),
$$f(x_k) - f^\star \ge \frac{n}{2(n+1)} - \frac{k}{2(k+1)} = \frac{1}{4(k+1)}$$
So, using $\|x^\star\|^2 \le (n+1)/3$ and $n+1 = 2(k+1)$,
$$f(x_k) - f^\star \ge \frac{1}{4(k+1)}\,\frac{\|x^\star\|^2}{\|x^\star\|^2} \ge \frac{3\,\|x^\star\|^2}{4(k+1)(n+1)} = \frac{3\,\|x^\star\|^2}{8(k+1)^2}$$
References
1 Y. Nesterov, Gradient methods for minimizing composite objective function, Technical Report, CORE, Université Catholique de Louvain (2007)

2 A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sciences (2009)

3 M. Teboulle, First Order Algorithms for Convex Minimization, Optimization Tutorials (2010), IPAM, UCLA

4 L. Vandenberghe, EE 236C (Spring 2011), UCLA