Math 301: Advanced topics in convex optimization (candes/teaching/math...)
Accelerated first-order methods


Page 1

Agenda

Fast proximal gradient methods

1 Accelerated first-order methods

2 Auxiliary sequences

3 Convergence analysis

4 Numerical examples

5 Optimality of Nesterov’s scheme

Page 2

Last time

Proximal gradient method → convergence rate 1/k

Subgradient methods → convergence rate 1/√k

Can we do better for non-smooth problems

$$\min\; f(x) = g(x) + h(x)$$

with the same computational effort as the proximal gradient method but with faster convergence?

Answer: Yes, we can, with an equally simple scheme

$$x_{k+1} = \arg\min_x \; Q_{1/t}(x, y_k)$$

Note that we use $y_k$ instead of $x_k$, where the new point $y_k$ is cleverly chosen

Original idea: Nesterov (1983), for minimization of a smooth objective

Here: nonsmooth problem

Page 3

Accelerated first-order methods

Choose $x_0$ and set $y_0 = x_0$. Repeat for $k = 1, 2, \ldots$

$$x_k = \mathrm{prox}_{t_k h}\bigl(y_{k-1} - t_k \nabla g(y_{k-1})\bigr), \qquad y_k = x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1})$$

same computational complexity as proximal gradient

with h = 0, this is the accelerated gradient descent of Nesterov (’83)

can be used with various step size rules

fixed, backtracking line search (BLS), . . .

interpretation: in

$$x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1})$$

the extrapolation term $\frac{k-1}{k+2}(x_k - x_{k-1})$ is a momentum term; it prevents zigzagging
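To make the iteration concrete, here is a minimal Python/NumPy sketch of the loop above; the gradient oracle grad_g, the proximal operator prox_h, the fixed step size t, and the iteration count are placeholders supplied by the caller, not part of the slides:

    import numpy as np

    def accelerated_prox_grad(grad_g, prox_h, x0, t, n_iters=500):
        # grad_g(x): gradient of the smooth part g
        # prox_h(v, t): prox_{t h}(v)
        # t: fixed step size, e.g. 1/L with L the Lipschitz constant of grad g
        x_prev = np.asarray(x0, dtype=float).copy()
        y = x_prev.copy()
        for k in range(1, n_iters + 1):
            # proximal gradient step taken from the auxiliary point y_{k-1}
            x = prox_h(y - t * grad_g(y), t)
            # momentum: extrapolate along the last displacement with weight (k-1)/(k+2)
            y = x + ((k - 1.0) / (k + 2.0)) * (x - x_prev)
            x_prev = x
        return x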

Page 4

Other formulations: Beck and Teboulle 2009

Fix step size $t = 1/L(g)$

Choose x0 , set y0 = x0, θ0 = 1

Loop: for k = 1, 2, . . .

(a) $x_k = \mathrm{prox}_{t_k h}\bigl(y_{k-1} - t_k \nabla g(y_{k-1})\bigr)$

(b) $\dfrac{1}{\theta_k} = \dfrac{1 + \sqrt{1 + 4/\theta_{k-1}^2}}{2}$

(c) $y_k = x_k + \theta_k\Bigl(\dfrac{1}{\theta_{k-1}} - 1\Bigr)(x_k - x_{k-1})$

Page 5

With BLS (knowledge of Lipschitz constant not necessary)

Choose x0 , set y0 = x0, θ0 = 1

Loop: for k = 1, 2, . . ., backtrack until (this gives tk)

$$f\bigl(y_{k-1} - t_k G_{t_k}(y_{k-1})\bigr) \le Q_{1/t_k}\bigl(y_{k-1} - t_k G_{t_k}(y_{k-1}),\; y_{k-1}\bigr)$$

where $\mathrm{prox}_{t_k h}\bigl(y_{k-1} - t_k \nabla g(y_{k-1})\bigr) =: y_{k-1} - t_k G_{t_k}(y_{k-1})$ defines the gradient map $G_{t_k}$. Then

(a), (b), (c): same updates as on the previous page
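As an illustration only, here is one way the backtracking step might be coded for $f = g + h$; since $h$ appears on both sides of $f(\cdot) \le Q_{1/t_k}(\cdot, y_{k-1})$, the test reduces to the quadratic upper bound on $g$. The shrink factor beta and all names are assumptions, not part of the slides:

    import numpy as np

    def backtracking_prox_step(g, grad_g, prox_h, y, t0=1.0, beta=0.5):
        # shrink t until g(x_plus) <= g(y) + <grad g(y), x_plus - y> + ||x_plus - y||^2/(2t),
        # i.e. the smooth part of the condition f(x_plus) <= Q_{1/t}(x_plus, y)
        t = t0
        gy = grad_g(y)
        while True:
            x_plus = prox_h(y - t * gy, t)        # candidate point y - t*G_t(y)
            d = x_plus - y
            if g(x_plus) <= g(y) + gy @ d + (d @ d) / (2 * t):
                return x_plus, t                  # accepted step size t_k
            t *= beta                             # otherwise shrink and retry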

Page 6

Convergence analysis

Theorem

$$f(x_k) - f^\star \le \frac{2\,\|x_0 - x^\star\|^2}{(k+1)^2\, t}$$

t = 1/L for fixed step size

t = β/L for BLS

Other $1/k^2$ first-order methods

Nesterov 2007

Two auxiliary sequences $\{y_k\}$, $\{z_k\}$; two prox operations at each iteration; different convergence analysis

Lu, Lan and Monteiro

Tseng

Auslender and Teboulle

Unified analysis framework: Tseng (2008)

Page 7

Proof (Beck and Teboulle’s version)

Define

$$v_k := \frac{1}{\theta_{k-1}}\,x_k - \Bigl(\frac{1}{\theta_{k-1}} - 1\Bigr)x_{k-1}, \qquad y_k = \theta_k v_k + (1 - \theta_k)\,x_k$$

and

(i) $v_{k+1} = v_k + \dfrac{1}{\theta_k}\,(x_{k+1} - y_k)$

(ii) $\dfrac{1 - \theta_k}{\theta_k^2} = \dfrac{1}{\theta_{k-1}^2}$

Proof of (ii): with $u = 4/\theta_{k-1}^2 + 1$,

$$\frac{1 - \theta_k}{\theta_k^2} = \frac{(1 + \sqrt{u})^2}{4}\cdot\frac{\sqrt{u} - 1}{\sqrt{u} + 1} = \frac{u - 1}{4} = \frac{4/\theta_{k-1}^2 + 1 - 1}{4} = \frac{1}{\theta_{k-1}^2}$$

Page 8

Increment in one iteration: Beck and Teboulle, Vandenberghe

Notation: $x = x_{i-1}$, $x^+ = x_i$, $y = y_{i-1}$, $v = v_{i-1}$, $v^+ = v_i$, $\theta = \theta_{i-1}$

Pillars of analysis:

(1) $f(x^+) \le f(x) + G_t(y)^T(y - x) - \dfrac{t}{2}\,\|G_t(y)\|^2$

(2) $f(x^+) \le f^\star + G_t(y)^T(y - x^\star) - \dfrac{t}{2}\,\|G_t(y)\|^2$

Take a convex combination (weights $1-\theta$ and $\theta$):

$$f(x^+) \le (1-\theta)\,f(x) + \theta f^\star + \langle G_t(y),\; y - (1-\theta)x - \theta x^\star\rangle - \frac{t}{2}\,\|G_t(y)\|^2$$

$$\qquad = (1-\theta)\,f(x) + \theta f^\star + \theta\,\langle G_t(y),\; v - x^\star\rangle - \frac{t}{2}\,\|G_t(y)\|^2$$

Page 9

Because y = θv + (1− θ)x

$$f(x^+) - f^\star \le (1-\theta)\,[f(x) - f^\star] + \frac{\theta^2}{2t}\Bigl[\|v - x^\star\|^2 - \bigl\|v - x^\star - \tfrac{t}{\theta}\,G_t(y)\bigr\|^2\Bigr]$$

$$v - \frac{t}{\theta}\,G_t(y) = v + \frac{1}{\theta}\bigl[(y - t\,G_t(y)) - y\bigr] = v + \frac{1}{\theta}\,(x^+ - y) = v^+$$

Therefore

$$f(x^+) - f^\star \le (1-\theta)\,[f(x) - f^\star] + \frac{\theta^2}{2t}\bigl[\|v - x^\star\|^2 - \|v^+ - x^\star\|^2\bigr]$$

Conclusion: dividing by $\theta^2 = \theta_{i-1}^2$,

$$\frac{1}{\theta_{i-1}^2}\,[f(x_i) - f^\star] + \frac{1}{2t}\,\|v_i - x^\star\|^2 \le \frac{1 - \theta_{i-1}}{\theta_{i-1}^2}\,[f(x_{i-1}) - f^\star] + \frac{1}{2t}\,\|v_{i-1} - x^\star\|^2$$

Page 10

By (ii) we have $\dfrac{1 - \theta_{i-1}}{\theta_{i-1}^2} = \dfrac{1}{\theta_{i-2}^2}$, so the inequalities telescope and

$$\frac{1}{\theta_{k-1}^2}\,[f(x_k) - f^\star] + \frac{1}{2t}\,\|v_k - x^\star\|^2 \le \frac{1 - \theta_0}{\theta_0^2}\,[f(x_0) - f^\star] + \frac{1}{2t}\,\|v_0 - x^\star\|^2$$

Since θ0 = 1 and v0 = x0

$$\frac{1}{\theta_{k-1}^2}\,\bigl(f(x_k) - f^\star\bigr) \le \frac{1}{2t}\,\|x_0 - x^\star\|^2$$

Since $\frac{1}{\theta_k} \ge \frac{1}{\theta_{k-1}} + \frac{1}{2}$ by (b), induction from $\theta_0 = 1$ gives $\frac{1}{\theta_{k-1}^2} \ge \frac{1}{4}(k+1)^2$, hence

$$f(x_k) - f^\star \le \frac{2}{(k+1)^2\, t}\,\|x_0 - x^\star\|^2$$

Similar with BLS, see Beck and Teboulle (2009)

Page 11

Case study: LASSO

$$\min\; f(x) = \frac{1}{2}\,\|Ax - b\|_2^2 + \lambda\,\|x\|_1$$

Choose $x_0$, set $y_0 = x_0$ and $\theta_0 = 1$ and repeat

$$x_k = S_{t_k\lambda}\bigl(y_{k-1} - t_k A^*(A y_{k-1} - b)\bigr)$$

$$\theta_k = 2\Bigl[1 + \sqrt{1 + 4/\theta_{k-1}^2}\Bigr]^{-1}$$

$$y_k = x_k + \theta_k\bigl(\theta_{k-1}^{-1} - 1\bigr)(x_k - x_{k-1})$$

until convergence (St is soft-thresholding at level t)

Dominant computational cost per iteration

one application of A

one application of A∗
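A minimal NumPy sketch of this LASSO iteration, assuming a dense matrix A, a fixed step size $t = 1/L$ with $L = \lambda_{\max}(A^T A) = \|A\|^2$, and a fixed number of iterations; the names and defaults are illustrative:

    import numpy as np

    def soft_threshold(v, tau):
        # entrywise soft-thresholding S_tau(v) = sign(v) * max(|v| - tau, 0)
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    def fista_lasso(A, b, lam, n_iters=500):
        # min (1/2)||Ax - b||_2^2 + lam*||x||_1, fixed step t = 1/||A||^2
        t = 1.0 / np.linalg.norm(A, 2) ** 2
        x_prev = np.zeros(A.shape[1])
        y = x_prev.copy()
        theta_prev = 1.0
        for _ in range(n_iters):
            x = soft_threshold(y - t * A.T @ (A @ y - b), t * lam)
            theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev ** 2))
            y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
            x_prev, theta_prev = x, theta
        return x

As noted above, each pass costs essentially one multiplication by A and one by A*.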

Page 12

Example from Beck and Teboulle (FISTA)

[Figure 5 from Beck and Teboulle (2009): comparison of function value errors $F(x_k) - F(x^\star)$ of ISTA, MTWIST, and FISTA over 10,000 iterations.]


Page 13

Example from Vandenberghe (EE 236C, UCLA)

Convergence analysis

almost identical to the proof in lecture 1

• on page 1-15 we replace $\nabla f(y)$ with $G_t(y)$

• the 2nd inequality on p. 1-15 follows from p. 4-13 with $0 < t \le 1/L$:

$$f(x^+) \le f(x) + G_t(y)^T(y - x) - \frac{t}{2}\,\|G_t(y)\|_2^2$$

and

$$f(x^+) \le f^\star + G_t(y)^T(y - x^\star) - \frac{t}{2}\,\|G_t(y)\|_2^2$$

making a convex combination gives

$$f(x^+) \le (1-\theta)\,f(x) + \theta f^\star + G_t(y)^T\bigl(y - (1-\theta)x - \theta x^\star\bigr) - \frac{t}{2}\,\|G_t(y)\|_2^2$$

conclusion: the number of iterations needed to reach $f(x^{(k)}) - f^\star \le \varepsilon$ is $O(1/\sqrt{\varepsilon})$


1-norm regularized least-squares

$$\text{minimize}\quad \frac{1}{2}\,\|Ax - b\|_2^2 + \|x\|_1$$

[Figure: relative error $(f(x^{(k)}) - f^\star)/f^\star$ versus iteration $k$.]

randomly generated $A \in \mathbf{R}^{2000 \times 1000}$; step $t_k = 1/L$ with $L = \lambda_{\max}(A^T A)$


Page 14

Nuclear norm regularization

$$\min\; g(X) + \lambda\,\|X\|_*$$

General gradient update

$$X = \arg\min_X\Bigl\{\frac{1}{2t}\,\|X - (X_0 - t\,\nabla g(X_0))\|_F^2 + \lambda\,\|X\|_*\Bigr\} = S_{t\lambda}(X_0 - t\,\nabla g(X_0))$$

$S_t$ is the singular value soft-thresholding operator (at level $t$)

$$X = \sum_{j=1}^{r} \sigma_j\, u_j v_j^* \;\Longrightarrow\; S_t(X) := \sum_{j=1}^{r} \max(\sigma_j - t,\, 0)\, u_j v_j^*$$
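A small NumPy sketch of this operator, assuming a dense matrix and a full (reduced) SVD; a truncated SVD works equally well, as discussed on the next slides:

    import numpy as np

    def singular_value_soft_threshold(X, tau):
        # shrink each singular value by tau and rebuild the matrix
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt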

Page 15

Example

$$\min\; \frac{1}{2}\,\|\mathcal{A}(X) - b\|^2 + \lambda\,\|X\|_*$$

Choose X0, set Y0 = X0, θ0 = 1 and repeat

$$X_k = S_{t_k\lambda}\bigl[Y_{k-1} - t_k \mathcal{A}^*(\mathcal{A}(Y_{k-1}) - b)\bigr], \qquad \theta_k = \ldots, \qquad Y_k = \ldots$$

Important remark: only need to compute the (top) part of the SVD of

$$Y_{k-1} - t_k \mathcal{A}^*(\mathcal{A}(Y_{k-1}) - b)$$

with singular values exceeding $t_k\lambda$
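One way to exploit this remark in practice, sketched with SciPy's truncated SVD routine svds; the initial rank guess and the doubling strategy are illustrative assumptions, not part of the slides:

    import numpy as np
    from scipy.sparse.linalg import svds

    def svt_truncated(M, tau, k_guess=10):
        # compute only the singular triplets with singular value > tau,
        # enlarging the requested rank until the smallest value found is <= tau
        k_max = min(M.shape) - 1
        k = min(k_guess, k_max)
        while True:
            U, s, Vt = svds(M, k=k)
            if s.min() <= tau or k == k_max:
                break
            k = min(2 * k, k_max)
        shrunk = np.maximum(s - tau, 0.0)
        return (U * shrunk) @ Vt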

Page 16

Example from Vandenberghe (EE 236C, UCLA)

$$\text{minimize}\quad \sum_{(i,j)\ \mathrm{obs.}} (X_{ij} - M_{ij})^2 + \lambda\,\|X\|_*$$

$X$ is $500 \times 500$; 5,000 observed entries; fix step size $t = 1/L$

convergence (fixed step size t = 1/L)

[Figure: relative error $(f(X^{(k)}) - f^\star)/f^\star$ versus iteration $k$.]

[Figure: normalized singular values of the resulting $X$ versus index.]

optimal X has rank 38; relative error in specified entries is 9%


Page 17

Optimality of Nesterov’s Method

min f(x)

f convex

∇f Lipschitz

No method which updates $x_k$ in $\mathrm{span}\{x_0, \nabla f(x_0), \ldots, \nabla f(x_{k-1})\}$ can converge faster than $1/k^2$

$1/k^2$ is the optimal rate for first-order methods

Page 18

Why?

$$f(x) = \frac{1}{2}\,x^* A x - e_1^* x, \qquad A = \begin{pmatrix} 2 & -1 & 0 & \cdots & \cdots & 0\\ -1 & 2 & -1 & 0 & \cdots & 0\\ & \ddots & \ddots & \ddots & & \\ 0 & \cdots & 0 & -1 & 2 & -1\\ 0 & \cdots & 0 & 0 & -1 & 2 \end{pmatrix}, \qquad e_1 = \begin{pmatrix}1\\0\\\vdots\\0\end{pmatrix}$$

$A \succeq 0$, $\|A\| \le 4$, and the solution obeys $Ax = e_1$

$$x^\star_i = 1 - \frac{i}{n+1}, \qquad f^\star = -\frac{n}{2(n+1)}$$

Note that

$$\|x^\star\|^2 = \frac{1}{(n+1)^2}\,\bigl(n^2 + (n-1)^2 + \cdots + 1\bigr) \le \frac{n+1}{3}$$

since

$$\sum_{0 \le k \le n} \bigl[(k+1)^3 - k^3\bigr] \ge \sum_{0 \le k \le n} 3k^2 \iff \sum_{k=1}^{n} k^2 \le \frac{(n+1)^3}{3}$$
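As a quick numerical sanity check (not on the original slides), a few lines of NumPy confirm the closed-form solution, the optimal value, and the norm bound for a small n:

    import numpy as np

    n = 10
    # tridiagonal worst-case matrix: 2 on the diagonal, -1 on the off-diagonals
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    e1 = np.zeros(n); e1[0] = 1.0

    x_star = np.linalg.solve(A, e1)
    assert np.allclose(x_star, 1 - np.arange(1, n + 1) / (n + 1))   # x*_i = 1 - i/(n+1)
    assert np.linalg.norm(A, 2) <= 4.0                               # ||A|| <= 4
    f_star = 0.5 * x_star @ A @ x_star - e1 @ x_star
    assert np.isclose(f_star, -n / (2 * (n + 1)))                    # f* = -n/(2(n+1))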

Page 19

Start first-order algorithm at x0 = 0

$\mathrm{span}(\nabla f(x_0)) = \mathrm{span}(e_1) \Longrightarrow x_1 \in \mathrm{span}(e_1)$

$\mathrm{span}(\nabla f(x_1)) \subset \mathrm{span}(e_1, e_2)$, $\mathrm{span}(\nabla f(x_2)) \subset \mathrm{span}(e_1, e_2, e_3)$, . . .

$$f(x_k) \ge \inf_{x:\; x_{k+1} = \cdots = x_n = 0} f(x) = -\frac{k}{2(k+1)}$$

For $k \approx n/2$, say $n = 2k + 1$:

$$f(x_k) - f^\star \ge \frac{n}{2(n+1)} - \frac{k}{2(k+1)} = \frac{1}{4(k+1)}$$

So

$$f(x_k) - f^\star \ge \frac{1}{4(k+1)}\cdot\frac{\|x^\star\|^2}{\|x^\star\|^2} \ge \frac{3\,\|x^\star\|^2}{4(k+1)(n+1)} \ge \frac{3\,\|x^\star\|^2}{8(k+1)^2}$$

Page 20

References

1 Y. Nesterov, Gradient methods for minimizing composite objective function, Technical Report, CORE, Université Catholique de Louvain, 2007

2 A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sciences, 2009

3 M. Teboulle, First Order Algorithms for Convex Minimization, Optimization Tutorials (2010), IPAM, UCLA

4 L. Vandenberghe, EE236C (Spring 2011), UCLA