MATH 301: Advanced Topics in Convex Optimization (Candès) — Accelerated first-order methods
TRANSCRIPT
Agenda — Fast proximal gradient methods
1. Accelerated first-order methods
2. Auxiliary sequences
3. Convergence analysis
4. Numerical examples
5. Optimality of Nesterov's scheme
Last time
- Proximal gradient method → convergence rate 1/k
- Subgradient methods → convergence rate 1/√k

Can we do better for non-smooth problems

min f(x) = g(x) + h(x)

with the same computational effort as the proximal gradient method but with faster convergence?

Answer: yes we can, with an equally simple scheme
x_{k+1} = arg min_x Q_{1/t}(x, y_k)

Note that we use y_k instead of x_k, where the new point y_k is cleverly chosen.
Original idea: Nesterov (1983), for minimization of a smooth objective.
Here: nonsmooth problems.
Accelerated first-order methods

Choose x_0 and set y_0 = x_0. Repeat for k = 1, 2, . . .

x_k = prox_{t_k h}(y_{k−1} − t_k ∇g(y_{k−1}))
y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1})

- same computational complexity as the proximal gradient method
- with h = 0, this is the accelerated gradient descent of Nesterov ('83)
- can be used with various step-size rules: fixed, backtracking line search (BLS), . . .

Interpretation: the extrapolation x_k + ((k − 1)/(k + 2))(x_k − x_{k−1}) adds a momentum term, which prevents zigzagging.
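The two-line scheme above can be sketched directly in code. A minimal sketch, assuming user-supplied callables `grad_g` for ∇g and `prox_h(z, t)` for prox_{t h}, with a fixed step size t (names are illustrative, not a reference implementation):

```python
import numpy as np

def accelerated_prox_grad(grad_g, prox_h, x0, t, n_iter=100):
    """Accelerated proximal gradient with (k-1)/(k+2) momentum."""
    x_prev = np.asarray(x0, dtype=float).copy()
    y = x_prev.copy()
    for k in range(1, n_iter + 1):
        # proximal gradient step taken at the extrapolated point y
        x = prox_h(y - t * grad_g(y), t)
        # momentum: extrapolate past x using the previous iterate
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
        x_prev = x
    return x_prev
```

With h = 0 (`prox_h = lambda z, t: z`) this reduces to Nesterov's accelerated gradient descent.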
Other formulations: Beck and Teboulle (2009)

Fix step size t = 1/L(g). Choose x_0, set y_0 = x_0, θ_0 = 1.
Loop: for k = 1, 2, . . .
(a) x_k = prox_{t h}(y_{k−1} − t ∇g(y_{k−1}))
(b) 1/θ_k = (1 + √(1 + 4/θ²_{k−1})) / 2
(c) y_k = x_k + θ_k (1/θ_{k−1} − 1)(x_k − x_{k−1})
With BLS (knowledge of the Lipschitz constant is not necessary):
Choose x_0, set y_0 = x_0, θ_0 = 1.
Loop: for k = 1, 2, . . ., backtrack until (this gives t_k)

f(y_{k−1} − t_k G_{t_k}(y_{k−1})) ≤ Q_{1/t_k}(y_{k−1} − t_k G_{t_k}(y_{k−1}), y_{k−1})

where prox_{t_k h}(y_{k−1} − t_k ∇g(y_{k−1})) = y_{k−1} − t_k G_{t_k}(y_{k−1}); then perform steps (a)–(c) as above.
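One common way to implement the backtracking condition checks the quadratic majorization of g at the candidate point, which is equivalent to f(x⁺) ≤ Q_{1/t}(x⁺, y) since h(x⁺) appears on both sides. A minimal sketch (the shrink factor `beta` and the calling convention `prox_h(z, t)` are assumptions):

```python
import numpy as np

def backtracking_prox_step(g, grad_g, prox_h, y, t, beta=0.5):
    """One proximal gradient step with backtracking line search (BLS).

    Shrinks t by beta until the candidate x+ = prox_{t h}(y - t grad_g(y))
    satisfies g(x+) <= g(y) + <grad_g(y), x+ - y> + ||x+ - y||^2 / (2t).
    """
    gy, grad_y = g(y), grad_g(y)
    while True:
        x_plus = prox_h(y - t * grad_y, t)
        d = x_plus - y
        if g(x_plus) <= gy + grad_y @ d + (d @ d) / (2.0 * t):
            return x_plus, t
        t *= beta
```

For g(x) = (1/2)‖x‖² (so L = 1) and h = 0, starting from a deliberately large t, the loop shrinks t until the majorization holds.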
Convergence analysis

Theorem: f(x_k) − f* ≤ 2‖x_0 − x*‖² / ((k + 1)² t)
- t = 1/L for the fixed step size
- t = β/L for BLS
Other 1/k² first-order methods
- Nesterov (2007): two auxiliary sequences {y_k}, {z_k}; two prox operations at each iteration; different (≠) convergence analysis
- Lu, Lan and Monteiro
- Tseng
- Auslander and Teboulle
Unified analysis framework: Tseng (2008)
Proof (Beck and Teboulle's version)

Set v_k := (1/θ_{k−1}) x_k − (1/θ_{k−1} − 1) x_{k−1}. Then y_k = θ_k v_k + (1 − θ_k) x_k, and
(i) v_{k+1} = v_k + (1/θ_k)(x_{k+1} − y_k)
(ii) (1 − θ_k)/θ_k² = 1/θ²_{k−1}
Proof of (ii): with u = 4/θ²_{k−1} + 1,

(1 − θ_k)/θ_k² = [(1 + √u)²/4] · [(√u − 1)/(√u + 1)] = (u − 1)/4 = (4/θ²_{k−1} + 1 − 1)/4 = 1/θ²_{k−1}
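Identity (ii) is easy to sanity-check numerically for the θ-update above; the short script below also checks the growth bound 1/θ_k ≥ (k + 2)/2, which is what the final 1/θ²_{k−1} ≥ (k + 1)²/4 step uses:

```python
import math

# Verify identity (ii): (1 - theta_k)/theta_k^2 = 1/theta_{k-1}^2 for the
# update 1/theta_k = (1 + sqrt(1 + 4/theta_{k-1}^2))/2, starting from theta_0 = 1,
# together with the growth bound 1/theta_k >= (k + 2)/2.
theta = 1.0
for k in range(1, 20):
    theta_prev = theta
    theta = 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / theta_prev ** 2))
    assert abs((1.0 - theta) / theta ** 2 - 1.0 / theta_prev ** 2) < 1e-9
    assert 1.0 / theta >= (k + 2) / 2.0 - 1e-12
```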
Increment in one iteration (Beck and Teboulle; Vandenberghe). Notation: x = x_{i−1}, x⁺ = x_i, y = y_{i−1}, v = v_{i−1}, v⁺ = v_i, θ = θ_{i−1}.

Pillars of the analysis:
(1) f(x⁺) ≤ f(x) + G_t(y)ᵀ(y − x) − (t/2)‖G_t(y)‖²
(2) f(x⁺) ≤ f* + G_t(y)ᵀ(y − x*) − (t/2)‖G_t(y)‖²

Take the convex combination with weights (1 − θ) and θ:

f(x⁺) ≤ (1 − θ)f(x) + θf* + ⟨G_t(y), y − (1 − θ)x − θx*⟩ − (t/2)‖G_t(y)‖²
      = (1 − θ)f(x) + θf* + θ⟨G_t(y), v − x*⟩ − (t/2)‖G_t(y)‖²

because y = θv + (1 − θ)x.
Rearranging (completing the square),

f(x⁺) − f* ≤ (1 − θ)[f(x) − f*] + (θ²/2t)[‖v − x*‖² − ‖v − x* − (t/θ)G_t(y)‖²]

Now

v − (t/θ)G_t(y) = v + (1/θ)[(y − t G_t(y)) − y] = v + (1/θ)(x⁺ − y) = v⁺

Therefore

f(x⁺) − f* ≤ (1 − θ)[f(x) − f*] + (θ²/2t)[‖v − x*‖² − ‖v⁺ − x*‖²]

Conclusion
Dividing by θ² = θ²_{i−1},

(1/θ²_{i−1})[f(x_i) − f*] + (1/2t)‖v_i − x*‖² ≤ ((1 − θ_{i−1})/θ²_{i−1})[f(x_{i−1}) − f*] + (1/2t)‖v_{i−1} − x*‖²

By (ii) we have (1 − θ_{i−1})/θ²_{i−1} = 1/θ²_{i−2}, so the inequality telescopes:

(1/θ²_{k−1})[f(x_k) − f*] + (1/2t)‖v_k − x*‖² ≤ ((1 − θ_0)/θ_0²)[f(x_0) − f*] + (1/2t)‖v_0 − x*‖²
Since θ_0 = 1 and v_0 = x_0,

(1/θ²_{k−1})(f(x_k) − f*) ≤ (1/2t)‖x_0 − x*‖²

Since 1/θ²_{k−1} ≥ (k + 1)²/4,

f(x_k) − f* ≤ 2‖x_0 − x*‖² / ((k + 1)² t)
The argument is similar for BLS; see Beck and Teboulle (2009).
Case study: LASSO

min f(x) = (1/2)‖Ax − b‖₂² + λ‖x‖₁

Choose x_0, set y_0 = x_0 and θ_0 = 1, and repeat

x_k = S_{t_k λ}(y_{k−1} − t_k A*(A y_{k−1} − b))
θ_k = 2[1 + √(1 + 4/θ²_{k−1})]⁻¹
y_k = x_k + θ_k(θ⁻¹_{k−1} − 1)(x_k − x_{k−1})

until convergence (S_t is soft-thresholding at level t).

Dominant computational cost per iteration:
- one application of A
- one application of A*
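The three LASSO steps above, as a short NumPy sketch with the fixed step t = 1/L, where L = λ_max(AᵀA) = σ_max(A)² (function names and the iteration count are illustrative):

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-thresholding S_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """FISTA for min (1/2)||Ax - b||_2^2 + lam*||x||_1, fixed step t = 1/L."""
    t = 1.0 / np.linalg.norm(A, 2) ** 2   # L = sigma_max(A)^2
    x_prev = np.zeros(A.shape[1])
    y = x_prev.copy()
    theta_prev = 1.0
    for _ in range(n_iter):
        # one application of A (A @ y) and one of A* (A.T @ ...) per iteration
        x = soft_threshold(y - t * A.T @ (A @ y - b), t * lam)
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta_prev ** 2))
        y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)
        x_prev, theta_prev = x, theta
    return x_prev
```

Note that the per-iteration cost is exactly one product with A and one with A*, matching the remark above.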
Example from Beck and Teboulle (FISTA)

[Figure 5 from the FISTA paper: comparison of the function value errors F(x_k) − F(x*) of ISTA, MTWIST, and FISTA; the errors span roughly 10² down to 10⁻⁸ over 10,000 iterations.]
Example from Vandenberghe (EE 236C, UCLA)
Convergence analysis
- almost identical to the proof in lecture 1
- on page 1-15 we replace ∇f(y) with G_t(y)
- the 2nd inequality on p. 1-15 follows from p. 4-13 with 0 < t ≤ 1/L:

f(x⁺) ≤ f(x) + G_t(y)ᵀ(y − x) − (t/2)‖G_t(y)‖₂²

and

f(x⁺) ≤ f* + G_t(y)ᵀ(y − x*) − (t/2)‖G_t(y)‖₂²

- making a convex combination gives

f(x⁺) ≤ (1 − θ)f(x) + θf* + G_t(y)ᵀ(y − (1 − θ)x − θx*) − (t/2)‖G_t(y)‖₂²

- conclusion: the number of iterations to reach f(x^(k)) − f* ≤ ε is O(1/√ε)

(Gradient methods for nonsmooth problems, 4-17)
1-norm regularized least-squares

minimize (1/2)‖Ax − b‖₂² + ‖x‖₁

[Convergence plot: relative error (f(x^(k)) − f*)/f* versus iteration k.]

randomly generated A ∈ R^{2000×1000}; step t_k = 1/L with L = λ_max(AᵀA)
Nuclear norm regularization

min g(X) + λ‖X‖_*

General gradient update:

X = arg min { (1/2t)‖X − (X_0 − t∇g(X_0))‖_F² + λ‖X‖_* } = S_{tλ}(X_0 − t∇g(X_0))

where S_λ is the singular value soft-thresholding operator:

X = Σ_{j=1}^r σ_j u_j v_j*  ⇒  S_t(X) := Σ_{j=1}^r max(σ_j − t, 0) u_j v_j*
Example

min (1/2)‖A(X) − b‖² + λ‖X‖_*

Choose X_0, set Y_0 = X_0, θ_0 = 1 and repeat

X_k = S_{t_k λ}[Y_{k−1} − t_k A*(A(Y_{k−1}) − b)]
θ_k = . . .
Y_k = . . .

Important remark: we only need to compute the top part of the SVD of Y_{k−1} − t_k A*(A(Y_{k−1}) − b), i.e., the singular values (and vectors) with singular values exceeding t_k λ.
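A dense-SVD sketch of the singular value soft-thresholding operator S_τ. In line with the remark above, a large-scale implementation would compute only the singular values exceeding τ (e.g. with a Lanczos-type partial SVD); the dense version here is for clarity:

```python
import numpy as np

def svt(X, tau):
    """Singular value soft-thresholding: prox of tau * nuclear norm at X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # shrink each singular value sigma_j toward zero by tau, dropping any below tau
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

For a rank-1 matrix X = 3 u v* with unit vectors u, v, svt(X, 1) returns 2 u v*.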
Example from Vandenberghe (EE 236C, UCLA)
minimize Σ_{(i,j) obs.} (X_ij − M_ij)² + λ‖X‖_*

X is 500 × 500 with 5,000 observed entries; fixed step size t = 1/L

[Convergence plot (fixed step size t = 1/L): relative error (f(x^(k)) − f*)/f* versus iteration k.]
Result

[Plot of the normalized singular values of the computed solution by index.]

The optimal X has rank 38; the relative error in the specified entries is 9%.
Optimality of Nesterov's Method

min f(x), with f convex and ∇f Lipschitz

No method which updates x_k in span{x_0, ∇f(x_0), . . . , ∇f(x_{k−1})} can converge faster than 1/k²: 1/k² is the optimal rate for first-order methods.

Why? Consider

f(x) = (1/2) x*Ax − e_1*x

where A ∈ R^{n×n} is the tridiagonal matrix with 2 on the diagonal and −1 on the first sub- and super-diagonals, and e_1 = (1, 0, . . . , 0)*. Then A ⪰ 0, ‖A‖ ≤ 4, and the solution obeys Ax = e_1:

x*_i = 1 − i/(n + 1),   f* = −n/(2(n + 1))
Note that

‖x*‖² = (1/(n + 1)²)(1² + · · · + n²) ≤ (n + 1)/3

since (k + 1)³ − k³ ≥ 3k² and Σ_{0≤k≤n} [(k + 1)³ − k³] = (n + 1)³, so Σ_{k=1}^n k² ≤ (n + 1)³/3.
Start the first-order algorithm at x_0 = 0. Then

span(∇f(x_0)) = span(e_1) ⟹ x_1 ∈ span(e_1)
span(∇f(x_1)) ⊂ span(e_1, e_2), span(∇f(x_2)) ⊂ span(e_1, e_2, e_3), . . .

so x_k is supported on the first k coordinates, and

f(x_k) ≥ inf_{x_{k+1} = · · · = x_n = 0} f(x) = −k/(2(k + 1))

For n = 2k + 1,

f(x_k) − f* ≥ n/(2(n + 1)) − k/(2(k + 1)) = 1/(4(k + 1))
So, using ‖x*‖² ≤ (n + 1)/3 and n + 1 = 2(k + 1),

f(x_k) − f* ≥ (1/(4(k + 1))) · (‖x*‖²/‖x*‖²) ≥ 3‖x*‖²/(4(k + 1)(n + 1)) = 3‖x*‖²/(8(k + 1)²)
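The closed-form minimizer and the stated properties of the worst-case quadratic can be verified numerically (a quick check; the dimension n is chosen arbitrarily):

```python
import numpy as np

n = 101  # dimension chosen arbitrarily for the check

# A = tridiag(-1, 2, -1): 2 on the diagonal, -1 on the first off-diagonals
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
e1 = np.zeros(n)
e1[0] = 1.0

# Closed-form minimizer x*_i = 1 - i/(n+1) solves A x = e1,
# with optimal value f* = -n / (2(n+1))
i = np.arange(1, n + 1)
x_star = 1.0 - i / (n + 1)
assert np.allclose(A @ x_star, e1)

f_star = 0.5 * x_star @ A @ x_star - e1 @ x_star
assert abs(f_star - (-n / (2 * (n + 1)))) < 1e-12

# A is positive semidefinite with ||A|| <= 4, and ||x*||^2 <= (n+1)/3
eigs = np.linalg.eigvalsh(A)
assert eigs.min() > -1e-10 and eigs.max() <= 4.0
assert x_star @ x_star <= (n + 1) / 3
```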
References
1. Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report, CORE, Université Catholique de Louvain, 2007.
2. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
3. M. Teboulle. First Order Algorithms for Convex Minimization. Optimization Tutorials, IPAM, UCLA, 2010.
4. L. Vandenberghe. EE 236C (Spring 2011), UCLA.