
Accelerating Nesterov’s Method for Strongly Convex Functions

Hao Chen Xiangrui Meng

MATH301, 2011

Hao Chen, Xiangrui Meng Accelerating Nesterov’s Method for Strongly Convex Functions



Outline

1 The Gap

2 Quadratic Functions

3 General Strongly Convex Functions




Our talk begins with a tiny gap

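For any $x_0 \in \mathbb{R}^\infty$ and any constants $\mu > 0$, $L > \mu$, there exists a function $f \in \mathcal{S}^{\infty,1}_{\mu,L}$ such that for any first-order method, we have

\[
f(x_k) - f^* \ge \frac{\mu}{2} \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2k} \|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}.
\]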

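Nesterov’s method generates a sequence $\{x_k\}_{k=0}^{\infty}$ such that

\[
f(x_k) - f^* \le L \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa}} \right)^{k} \|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}.
\]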




At a closer look, the gap is not tiny

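Assume that $\kappa$ is large. Given a small tolerance $\varepsilon > 0$, to make $f(x_k) - f^* < \varepsilon$, the “ideal” first-order method needs

\[
K^* = \frac{\log \varepsilon - \log \frac{\mu}{2}}{2 \log \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}} \approx \log \frac{1}{\varepsilon} \cdot \frac{\sqrt{\kappa}}{4}
\]

iterations. Nesterov’s method needs

\[
K = \frac{\log \varepsilon - \log L}{\log \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa}}} \approx \log \frac{1}{\varepsilon} \cdot \sqrt{\kappa}
\]

iterations, which is 4 times as large as the ideal number.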
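The factor of 4 can be checked numerically. The sketch below is only an illustration: the function names are made up here, and `iteration_counts` assumes the normalization $\|x_0 - x^*\| = 1$ that the formulas for $K^*$ and $K$ use implicitly.

```python
import math

def log_rate_ratio(kappa):
    """Ratio of the per-iteration log-contraction of the lower bound
    to that of Nesterov's upper bound (tends to 4 as kappa grows)."""
    s = math.sqrt(kappa)
    ideal = 2.0 * math.log((s - 1.0) / (s + 1.0))
    nesterov = math.log((s - 1.0) / s)
    return ideal / nesterov

def iteration_counts(eps, mu, L):
    """Exact K* and K from the two bounds, assuming ||x0 - x*|| = 1."""
    s = math.sqrt(L / mu)
    k_ideal = (math.log(eps) - math.log(mu / 2.0)) / (2.0 * math.log((s - 1.0) / (s + 1.0)))
    k_nesterov = (math.log(eps) - math.log(L)) / math.log((s - 1.0) / s)
    return k_ideal, k_nesterov

for kappa in (1e2, 1e4, 1e6):
    print(kappa, log_rate_ratio(kappa))
print(iteration_counts(1e-10, 1.0, 1e4))
```

For moderate $\varepsilon$ the exact counts differ by more than a factor of 4, because the constants $\log L$ and $\log(\mu/2)$ are not negligible; the factor 4 describes the asymptotic rates.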




Can we reduce the gap?

Can we reduce the gap

for quadratic functions?

\[
\text{minimize} \quad f(x) = \frac{1}{2} x^T A x - b^T x, \qquad \mu I_n \preceq A \preceq L I_n.
\]

In this case, we do have an “ideal” method, the conjugate gradient method, which has the optimal convergence rate.

for general strongly convex functions?

\[
\text{minimize} \quad f(x), \qquad f \in \mathcal{S}_{\mu,L}.
\]










Nesterov’s constant step scheme, III

0. Choose $y_0 = x_0 \in \mathbb{R}^n$.

1. k-th iteration ($k \ge 0$).

\[
x_{k+1} = y_k - h f'(y_k), \qquad y_{k+1} = x_{k+1} + \beta (x_{k+1} - x_k),
\]

where $h = \frac{1}{L}$ and $\beta = \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}}$.

Q: Is Nesterov’s choice of h and β optimal?
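A minimal sketch of the constant step scheme, written with plain Python lists. The function name `nesterov_constant_step` and the small diagonal quadratic used to exercise it are illustrative assumptions, not part of the talk.

```python
import math

def nesterov_constant_step(grad, x0, mu, L, iters):
    """Constant step scheme with h = 1/L and beta = (1 - sqrt(mu h)) / (1 + sqrt(mu h))."""
    h = 1.0 / L
    s = math.sqrt(mu * h)
    beta = (1.0 - s) / (1.0 + s)
    x = list(x0)
    y = list(x0)                      # step 0: y0 = x0
    for _ in range(iters):
        g = grad(y)
        x_new = [yi - h * gi for yi, gi in zip(y, g)]              # gradient step from y
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]    # extrapolation
        x = x_new
    return x

# Toy problem (assumed for illustration): f(x) = 0.5 * sum(lam_i x_i^2) - sum(b_i x_i)
lam = [1.0, 10.0, 100.0]
b = [1.0, 1.0, 1.0]

def grad(y):
    return [l * yi - bi for l, yi, bi in zip(lam, y, b)]

x = nesterov_constant_step(grad, [0.0, 0.0, 0.0], mu=1.0, L=100.0, iters=300)
x_star = [bi / l for bi, l in zip(b, lam)]
err = max(abs(xi - si) for xi, si in zip(x, x_star))
print(err)
```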




On quadratic functions

When minimizing a quadratic function

\[
f(x) = \frac{1}{2} x^T A x - b^T x,
\]

Nesterov’s updates become

0. Choose $y_0 = x_0 = 0$.

1. k-th iteration ($k \ge 0$).

\[
x_{k+1} = y_k - h (A y_k - b), \qquad y_{k+1} = x_{k+1} + \beta (x_{k+1} - x_k).
\]




Eigendecomposition

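Let $A = V \Lambda V^T$ be $A$’s eigendecomposition. Define $\tilde{x}_k = V^T x_k$, $\tilde{y}_k = V^T y_k$ for all $k$, and $\tilde{b} = V^T b$. Then Nesterov’s updates can be written as

0. Choose $\tilde{y}_0 = \tilde{x}_0 = 0$.

1. k-th iteration ($k \ge 0$).

\[
\tilde{x}_{k+1} = \tilde{y}_k - h (\Lambda \tilde{y}_k - \tilde{b}), \qquad \tilde{y}_{k+1} = \tilde{x}_{k+1} + \beta (\tilde{x}_{k+1} - \tilde{x}_k).
\]

$\Lambda$ is diagonal, hence the updates are actually element-wise:

\[
\tilde{x}_{k+1,i} = \tilde{y}_{k,i} - h (\lambda_i \tilde{y}_{k,i} - \tilde{b}_i), \qquad i = 1, \ldots, n,
\]

\[
\tilde{y}_{k+1,i} = \tilde{x}_{k+1,i} + \beta (\tilde{x}_{k+1,i} - \tilde{x}_{k,i}), \qquad i = 1, \ldots, n.
\]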




Recurrence relation

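We can eliminate the sequence $\{\tilde{y}_k\}$ from the update scheme.

\[
\begin{aligned}
\tilde{x}_{k+1,i} &= \tilde{y}_{k,i} - h (\lambda_i \tilde{y}_{k,i} - \tilde{b}_i) \\
&= \tilde{x}_{k,i} + \beta (\tilde{x}_{k,i} - \tilde{x}_{k-1,i}) - h \left( \lambda_i \left( \tilde{x}_{k,i} + \beta (\tilde{x}_{k,i} - \tilde{x}_{k-1,i}) \right) - \tilde{b}_i \right) \\
&= (1 + \beta)(1 - \lambda_i h) \tilde{x}_{k,i} - \beta (1 - \lambda_i h) \tilde{x}_{k-1,i} + h \tilde{b}_i.
\end{aligned}
\]

Let $e_k = V^T (x_k - x^*) = V^T (x_k - V \Lambda^{-1} V^T b) = \tilde{x}_k - \Lambda^{-1} \tilde{b}$ for all $k$. We have the following recurrence relation on the error:

\[
e_{k+1,i} = (1 + \beta)(1 - \lambda_i h) e_{k,i} - \beta (1 - \lambda_i h) e_{k-1,i}.
\]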
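The elimination can be sanity-checked on a single eigen-direction. In the sketch below (function names and the sample values of $\lambda$, $b$, $h$, $\beta$ are chosen only for illustration), the two-sequence update and the three-term recurrence are run side by side; starting from $y_0 = x_0$ corresponds to taking $x_{-1} = x_0$ in the recurrence.

```python
def two_sequence(lam, b, h, beta, iters):
    """Scalar version of the x/y updates along one eigen-direction."""
    x = y = 0.0
    xs = [x]
    for _ in range(iters):
        x_new = y - h * (lam * y - b)
        y = x_new + beta * (x_new - x)
        x = x_new
        xs.append(x)
    return xs

def three_term(lam, b, h, beta, iters):
    """Same iterates from the recurrence with y eliminated (x_{-1} = x_0)."""
    c = 1.0 - lam * h
    x_prev, x = 0.0, 0.0
    xs = [x]
    for _ in range(iters):
        x_next = (1.0 + beta) * c * x - beta * c * x_prev + h * b
        x_prev, x = x, x_next
        xs.append(x)
    return xs

a = two_sequence(10.0, 1.0, 0.01, 0.8, 200)
r = three_term(10.0, 1.0, 0.01, 0.8, 200)
max_diff = max(abs(u - v) for u, v in zip(a, r))
print(max_diff, a[-1])
```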




Characteristic equation

The characteristic equation for the recurrence relation is given by

\[
\xi_i^2 = (1 + \beta)(1 - \lambda_i h) \xi_i - \beta (1 - \lambda_i h).
\]

Denote the two roots by $\xi_{i,1}$ and $\xi_{i,2}$, and assume they are distinct for simplicity. The general solution is given by

\[
e_{k,i} = C_{i,1} \xi_{i,1}^k + C_{i,2} \xi_{i,2}^k.
\]

Let $C_i = |C_{i,1}| + |C_{i,2}|$ and $\theta_i = \max\{|\xi_{i,1}|, |\xi_{i,2}|\}$. We have

\[
|e_{k,i}| \le C_i \theta_i^k.
\]

Hence,

\[
\|x_k - x^*\|^2 = \|\tilde{x}_k - \tilde{x}^*\|^2 = \sum_i |e_{k,i}|^2 \le \sum_i C_i^2 \theta_i^{2k} \le C \theta^{2k},
\]

where $C = \sum_i C_i^2$ and $\theta = \max_i \theta_i$.




Finding the optimal convergence rate

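Our problem becomes

\[
\begin{aligned}
\text{minimize} \quad & \theta \\
\text{subject to} \quad & \theta \ge |\xi_1(\lambda)|, \; |\xi_2(\lambda)|, \quad \forall \lambda \in [\mu, L],
\end{aligned}
\]

where $\xi_1(\lambda)$ and $\xi_2(\lambda)$ are the roots of

\[
\xi^2 = (1 + \beta)(1 - \lambda h) \xi - \beta (1 - \lambda h),
\]

and $h$, $\beta$ and $\theta$ are the variables.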




Special cases

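If $\beta = 0$, we are doing gradient descent. The optimal rate is given by $\theta^* = \frac{L - \mu}{L + \mu}$, attained at $h^* = \frac{2}{L + \mu}$.

If $h = \frac{1}{L}$, the optimal rate is given by

\[
\theta^* = 1 - \sqrt{\mu h} = 1 - \sqrt{\frac{\mu}{L}}, \quad \text{attained at} \quad \beta^* = \frac{1 - \sqrt{\mu h}}{1 + \sqrt{\mu h}} = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}},
\]

which confirms Nesterov’s choice.

Q: Why do we choose $h = \frac{1}{L}$? It maximizes the guaranteed decrease in function value per gradient step for a function whose gradient is Lipschitz with constant $L$.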
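Both special cases can be checked numerically by evaluating the largest root modulus of the characteristic equation over a grid of $\lambda \in [\mu, L]$. This is a sketch under stated assumptions: the helper names, the grid resolution and the values $\mu = 1$, $L = 100$ are illustrative, and a grid search only approximates the maximum over the interval.

```python
import cmath
import math

def root_moduli(lam, h, beta):
    """Moduli of the two roots of xi^2 - (1+beta)(1-lam h) xi + beta (1-lam h) = 0."""
    c = 1.0 - lam * h
    p = (1.0 + beta) * c              # sum of the roots
    q = beta * c                      # product of the roots
    d = cmath.sqrt(p * p - 4.0 * q)   # complex square root handles both cases
    return abs((p + d) / 2.0), abs((p - d) / 2.0)

def worst_case_rate(h, beta, mu, L, grid=2001):
    """Largest root modulus over an evenly spaced grid of lambda in [mu, L]."""
    theta = 0.0
    for j in range(grid):
        lam = mu + (L - mu) * j / (grid - 1)
        theta = max(theta, *root_moduli(lam, h, beta))
    return theta

mu, L = 1.0, 100.0
s = math.sqrt(mu / L)
rate_gd = worst_case_rate(2.0 / (L + mu), 0.0, mu, L)
rate_nesterov = worst_case_rate(1.0 / L, (1.0 - s) / (1.0 + s), mu, L)
print(rate_gd, (L - mu) / (L + mu))
print(rate_nesterov, 1.0 - s)
```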




The optimal convergence rate

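By considering all the combinations of $h$ and $\beta$, we reach the following optimal solution:

\[
h^* = \frac{4}{3L + \mu} \quad \left( \text{the harmonic mean of } \frac{1}{L} \text{ and } \frac{2}{L + \mu} \right),
\]

\[
\beta^* = \frac{1 - \sqrt{\mu h^*}}{1 + \sqrt{\mu h^*}},
\]

\[
\theta^* = 1 - \sqrt{\mu h^*} = 1 - \frac{2}{\sqrt{3\kappa + 1}}.
\]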
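As a consistency check (not a proof of optimality), one can verify that with $h^*$ and $\beta^*$ the largest root modulus equals $1 - \sqrt{\mu h^*}$ at both ends of the spectrum. The shorthand $s = \sqrt{\mu h^*}$ is introduced here only for this calculation.

```latex
% Shorthand: s = \sqrt{\mu h^*} = 2 / \sqrt{3\kappa + 1}, so that 1 + \beta^* = 2 / (1 + s).
\begin{aligned}
\lambda = \mu:\quad & 1 - \mu h^* = 1 - s^2, \qquad
  \xi^2 - 2(1 - s)\,\xi + (1 - s)^2 = 0
  \;\Rightarrow\; \xi_{1,2} = 1 - s, \\
\lambda = L:\quad & 1 - L h^* = -\frac{L - \mu}{3L + \mu} = -\frac{1 - s^2}{3}, \qquad
  \xi^2 + \tfrac{2}{3}(1 - s)\,\xi - \tfrac{1}{3}(1 - s)^2 = 0 \\
  & \Rightarrow\; \xi_1 = \tfrac{1}{3}(1 - s), \quad \xi_2 = -(1 - s).
\end{aligned}
```

In both cases $\max\{|\xi_1|, |\xi_2|\} = 1 - s$, so the two extreme eigenvalues are balanced at the stated rate.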




Comparing the convergence rates

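Nesterov’s method ($h = \frac{1}{L}$):

\[
\|x_k - x^*\| \le C \cdot \left( 1 - \frac{1}{\sqrt{\kappa}} \right)^k \|x_0 - x^*\|.
\]

Note that this is better than the convergence rate we have on general strongly convex functions.

Nesterov’s method ($h = \frac{4}{3L + \mu}$):

\[
\|x_k - x^*\| \le C \cdot \left( 1 - \frac{2}{\sqrt{3\kappa + 1}} \right)^k \|x_0 - x^*\|.
\]

Conjugate gradient:

\[
\|x_k - x^*\|_A \le 2 \cdot \left( 1 - \frac{2}{\sqrt{\kappa} + 1} \right)^k \|x_0 - x^*\|_A.
\]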
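To get a feel for the three contraction factors, they can be tabulated for a given $\kappa$. The sketch below is illustrative only: the function names are made up, the constants $C$ and 2 in front of the bounds are ignored, and the conjugate gradient bound is in the $A$-norm while the other two are in the Euclidean norm, so the iteration counts are indicative rather than directly comparable.

```python
import math

def contraction_factors(kappa):
    """Per-iteration factors from the three bounds, as functions of kappa."""
    root = math.sqrt(kappa)
    return {
        "nesterov_h_1_over_L": 1.0 - 1.0 / root,
        "nesterov_h_star": 1.0 - 2.0 / math.sqrt(3.0 * kappa + 1.0),
        "conjugate_gradient": 1.0 - 2.0 / (root + 1.0),
    }

def iterations_for(factor, reduction):
    """Smallest k with factor**k <= reduction (leading constants ignored)."""
    return math.ceil(math.log(reduction) / math.log(factor))

factors = contraction_factors(100.0)
for name, factor in factors.items():
    print(name, round(factor, 4), iterations_for(factor, 1e-6))
```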




What’s happening on the eigenspace

Figure: Error along eigendirections ($|e_{k,i}|$)




The model problem

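\[
\text{minimize} \quad f(x) = \frac{1}{2} x^T A x - b^T x,
\]

where

\[
A = \begin{pmatrix}
2 & -1 & & \\
-1 & \ddots & \ddots & \\
& \ddots & \ddots & -1 \\
& & -1 & 2
\end{pmatrix} + \delta I_n \in \mathbb{R}^{n \times n}, \qquad b = \mathrm{randn}(n, 1) \in \mathbb{R}^n.
\]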

We chose $n = 10^6$ and $\delta = 0.05$.
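For this model problem $\mu$ and $L$ are available in closed form, because the eigenvalues of the tridiagonal matrix with 2 on the diagonal and $-1$ off the diagonal are $2 - 2\cos\frac{i\pi}{n+1}$, with eigenvectors whose entries are $\sin\frac{ij\pi}{n+1}$. The sketch below applies $A$ without forming it and checks this on a small instance; the function names and the small size `n = 50` are illustrative choices.

```python
import math

def apply_A(x, delta):
    """Matrix-free product with A = tridiag(-1, 2, -1) + delta * I."""
    n = len(x)
    out = []
    for j in range(n):
        v = (2.0 + delta) * x[j]
        if j > 0:
            v -= x[j - 1]
        if j < n - 1:
            v -= x[j + 1]
        out.append(v)
    return out

def eigenvalue(i, n, delta):
    """i-th eigenvalue of A, i = 1, ..., n (increasing in i)."""
    return 2.0 - 2.0 * math.cos(i * math.pi / (n + 1)) + delta

def eigenvector(i, n):
    return [math.sin(i * (j + 1) * math.pi / (n + 1)) for j in range(n)]

n, delta = 50, 0.05
mu, L = eigenvalue(1, n, delta), eigenvalue(n, n, delta)

residual = 0.0
for i in (1, n):
    v = eigenvector(i, n)
    Av = apply_A(v, delta)
    lam = eigenvalue(i, n, delta)
    residual = max(residual, max(abs(a - lam * vi) for a, vi in zip(Av, v)))
print(mu, L, L / mu, residual)
```

As $n$ grows, $\mu$ approaches $\delta$ and $L$ approaches $4 + \delta$.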




Figure: $\|x_k - x^*\|$

Figure: $f(x_k) - f^*$










Back to Nesterov’s proof

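A pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$, $\lambda_k \ge 0$, is called an estimate sequence of the function $f(x)$ if $\lambda_k \to 0$ and for any $x \in \mathbb{R}^n$ and all $k \ge 0$ we have

\[
\phi_k(x) \le (1 - \lambda_k) f(x) + \lambda_k \phi_0(x).
\]

If for a sequence $\{x_k\}$ we have

\[
f(x_k) \le \phi_k^* \equiv \min_{x \in \mathbb{R}^n} \phi_k(x),
\]

then $f(x_k) - f^* \le \lambda_k [\phi_0(x^*) - f^*] \to 0$.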
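The conclusion follows from a short chain of inequalities, obtained by evaluating the estimate sequence at the minimizer $x^*$:

```latex
\begin{aligned}
f(x_k) \;\le\; \phi_k^* \;=\; \min_{x \in \mathbb{R}^n} \phi_k(x)
  \;\le\; \phi_k(x^*)
  \;\le\; (1 - \lambda_k) f(x^*) + \lambda_k \phi_0(x^*)
  \;=\; f^* + \lambda_k \left[ \phi_0(x^*) - f^* \right].
\end{aligned}
```

Subtracting $f^*$ gives the bound, so the rate at which $\lambda_k \to 0$ is the rate of convergence of $f(x_k)$.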




A useful estimate sequence provided by Nesterov

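\[
\lambda_{k+1} = (1 - \alpha_k) \lambda_k,
\]

\[
\phi_{k+1}(x) = (1 - \alpha_k) \phi_k(x) + \alpha_k \left[ f(y_k) + \langle f'(y_k), x - y_k \rangle + \frac{\mu}{2} \|x - y_k\|^2 \right],
\]

where

$\{y_k\}$ is an arbitrary sequence in $\mathbb{R}^n$.

$\alpha_k \in (0, 1)$, $\sum_{k=0}^{\infty} \alpha_k = \infty$.

$\lambda_0 = 1$.

$\phi_0$ is an arbitrary function on $\mathbb{R}^n$.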




A specific choice of φ0(x)

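\[
\phi_0(x) \equiv \phi_0^* + \frac{\gamma_0}{2} \|x - v_0\|^2
\]

and set $x_0 = v_0$, $\phi_0^* = f(x_0)$. The previous estimate sequence becomes

\[
\phi_k(x) \equiv \phi_k^* + \frac{\gamma_k}{2} \|x - v_k\|^2
\]

with

\[
\begin{aligned}
\gamma_{k+1} &= (1 - \alpha_k) \gamma_k + \alpha_k \mu, \\
v_{k+1} &= \left[ (1 - \alpha_k) \gamma_k v_k + \alpha_k \mu y_k - \alpha_k f'(y_k) \right] / \gamma_{k+1}, \\
\phi_{k+1}^* &= (1 - \alpha_k) \phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2 \gamma_{k+1}} \|f'(y_k)\|^2 \\
&\quad + \frac{\alpha_k (1 - \alpha_k) \gamma_k}{\gamma_{k+1}} \left( \frac{\mu}{2} \|y_k - v_k\|^2 + \langle f'(y_k), v_k - y_k \rangle \right).
\end{aligned}
\]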
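The formulas for $\gamma_{k+1}$ and $v_{k+1}$ can be recovered by differentiating the recursion for $\phi_{k+1}$ with the quadratic form of $\phi_k$ substituted in. A brief sketch of that step:

```latex
\begin{aligned}
\nabla^2 \phi_{k+1}(x) &= (1 - \alpha_k) \gamma_k I + \alpha_k \mu I = \gamma_{k+1} I, \\
\nabla \phi_{k+1}(x) &= (1 - \alpha_k) \gamma_k (x - v_k) + \alpha_k f'(y_k) + \alpha_k \mu (x - y_k) = 0 \\
&\Longrightarrow\quad
x = v_{k+1} = \frac{(1 - \alpha_k) \gamma_k v_k + \alpha_k \mu y_k - \alpha_k f'(y_k)}{\gamma_{k+1}}.
\end{aligned}
```

So $\phi_{k+1}$ is again a quadratic with curvature $\gamma_{k+1}$ centered at $v_{k+1}$, and $\phi_{k+1}^*$ is its value at that point.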



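Let the update be

\[
x_{k+1} = y_k - h_k f'(y_k)
\]

and use the inequalities

\[
\phi_k^* \ge f(x_k) \ge f(y_k) + \langle f'(y_k), x_k - y_k \rangle + \frac{\mu}{2} \|x_k - y_k\|^2,
\]

\[
f(x_{k+1}) \le f(y_k) - \frac{h_k (2 - L h_k)}{2} \|f'(y_k)\|^2.
\]

We have

\[
\begin{aligned}
\phi_{k+1}^* - f(x_{k+1}) \ge{} & \left( -\frac{\alpha_k^2}{2 \gamma_{k+1}} + \frac{h_k (2 - L h_k)}{2} \right) \|f'(y_k)\|^2 \\
& + (1 - \alpha_k) \left\langle f'(y_k), \frac{\alpha_k \gamma_k}{\gamma_{k+1}} (v_k - y_k) + (x_k - y_k) \right\rangle \\
& + \frac{\mu (1 - \alpha_k)}{2} \left( \frac{\alpha_k \gamma_k}{\gamma_{k+1}} \|v_k - y_k\|^2 + \|x_k - y_k\|^2 \right).
\end{aligned}
\]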





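Nesterov’s choice:

$y_k = \frac{\alpha_k \gamma_k v_k + \gamma_{k+1} x_k}{\gamma_k + \alpha_k \mu}$

$h_k = \frac{1}{L}$.

$\gamma_0 \ge \mu$. Since $\gamma_{k+1} = (1 - \alpha_k) \gamma_k + \alpha_k \mu$, we have $\gamma_k \ge \mu$.

$\alpha_k$ can be as large as $\sqrt{\frac{\mu}{L}}$ at each step, which leads to the convergence rate $1 - \sqrt{\frac{\mu}{L}} = 1 - \frac{1}{\sqrt{\kappa}}$.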
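A short sketch of why $\alpha_k = \sqrt{\mu / L}$ is admissible under this choice: $y_k$ is picked so that the inner-product term vanishes, the last term is nonnegative, and only the coefficient of $\|f'(y_k)\|^2$ needs to be controlled.

```latex
\begin{aligned}
& \frac{\alpha_k \gamma_k}{\gamma_{k+1}} (v_k - y_k) + (x_k - y_k) = 0
  \quad\Longleftrightarrow\quad
  y_k = \frac{\alpha_k \gamma_k v_k + \gamma_{k+1} x_k}{\alpha_k \gamma_k + \gamma_{k+1}}
      = \frac{\alpha_k \gamma_k v_k + \gamma_{k+1} x_k}{\gamma_k + \alpha_k \mu}, \\
& h_k = \frac{1}{L}:\qquad
  -\frac{\alpha_k^2}{2 \gamma_{k+1}} + \frac{1}{2L} \ge 0
  \quad\Longleftrightarrow\quad
  L \alpha_k^2 \le \gamma_{k+1}, \\
& \gamma_{k+1} \ge \mu
  \quad\Longrightarrow\quad
  \alpha_k = \sqrt{\frac{\mu}{L}} \text{ is admissible, and }
  \lambda_k = \prod_{i=0}^{k-1} (1 - \alpha_i) = \left( 1 - \sqrt{\frac{\mu}{L}} \right)^k.
\end{aligned}
```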




A simplified version

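$\gamma_k \equiv \mu$, $h_k \equiv \frac{1}{L}$

$y_k = \frac{\alpha_k v_k + x_k}{\alpha_k + 1}$

$v_k - y_k = \frac{v_k - x_k}{\alpha_k + 1}$

$x_k - y_k = \frac{\alpha_k (x_k - v_k)}{\alpha_k + 1}$

\[
\phi_{k+1}^* - f(x_{k+1}) \ge \left( -\frac{\alpha_k^2}{2\mu} + \frac{1}{2L} \right) \|f'(y_k)\|^2 + \frac{\mu \alpha_k (1 - \alpha_k)}{2 (1 + \alpha_k)} \|x_k - v_k\|^2.
\]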




$\|x_k - v_k\|^2 / \|f'(y_k)\|^2$

Figure: $f(x) = \frac{1}{2} \|Ax - b\|^2 + \lambda \cdot \mathrm{smooth}(\|x\|_1, \tau) + \frac{1}{2} \mu \|x\|^2$




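\[
\frac{\mu \alpha_k (1 - \alpha_k)}{2 (1 + \alpha_k)} \cdot \frac{\|x_k - v_k\|^2}{\left\| f'\!\left( \frac{\alpha_k v_k + x_k}{\alpha_k + 1} \right) \right\|^2} - \frac{\alpha_k^2}{2\mu} \ge -\frac{1}{2L}.
\]

Since the decay rate is $\prod_k (1 - \alpha_k)$, we want to find a large $\alpha_k$ such that the inequality holds.

Evaluating $f'\!\left( \frac{\alpha_k v_k + x_k}{\alpha_k + 1} \right)$ is time consuming, so we hope our first guess of $\alpha_k$ is good.

Note that $\|f'(y_k)\|$ tends to decrease, so our procedure is to find an $\alpha_k \ge \sqrt{\frac{\mu}{L}}$ such that

\[
\frac{\mu \alpha_k (1 - \alpha_k)}{2 (1 + \alpha_k)} \cdot \frac{\|x_k - v_k\|^2}{\|f'(y_{k-1})\|^2} - \frac{\alpha_k^2}{2\mu}
\]

is large; such an $\alpha_k$ usually makes the inequality hold.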
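One possible way to turn this into a concrete first guess is sketched below. It is a hypothetical illustration, not the procedure used in the talk: it takes the ratio $r = \|x_k - v_k\|^2 / \|f'(y_{k-1})\|^2$ as given (the previous gradient norm standing in for the unknown current one) and uses bisection, which is a choice made here, to find the largest $\alpha \in [\sqrt{\mu/L}, 1)$ for which the inequality holds with that proxy. The names `slack` and `largest_alpha` are made up.

```python
import math

def slack(alpha, ratio, mu, L):
    """Left side minus right side of the inequality, with the proxy ratio plugged in."""
    return (mu * alpha * (1.0 - alpha) / (2.0 * (1.0 + alpha))) * ratio \
        - alpha ** 2 / (2.0 * mu) + 1.0 / (2.0 * L)

def largest_alpha(ratio, mu, L, tol=1e-12):
    """Bisection for the largest alpha in [sqrt(mu/L), 1) with nonnegative slack.

    In exact arithmetic the slack is >= 0 at alpha = sqrt(mu/L) and < 0 at
    alpha = 1 (for L > mu), and it changes sign only once on that interval.
    """
    lo = math.sqrt(mu / L)
    hi = 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slack(mid, ratio, mu, L) >= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

mu, L = 1.0, 100.0
print(largest_alpha(0.0, mu, L))    # no help from the ||x_k - v_k|| term
print(largest_alpha(50.0, mu, L))   # a large ratio allows a larger alpha
```

In an actual method the candidate $\alpha_k$ would still have to be validated against the true gradient at the resulting $y_k$, since the proxy ratio is only a guess.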




$\|f'(y_k)\|$

Figure: $f(x) = \frac{1}{2} \|Ax - b\|^2 + \lambda \cdot \mathrm{smooth}(\|x\|_1, \tau) + \frac{1}{2} \mu \|x\|^2$




Test 1: smooth-BPDN

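The first test is a smooth version of Basis Pursuit De-Noising:

\[
\text{minimize} \quad f(x) = \frac{1}{2} \|Ax - b\|^2 + \lambda \cdot \mathrm{smooth}(\|x\|_1, \tau) + \frac{\mu}{2} \|x\|^2,
\]

where we set

\[
A = \frac{1}{\sqrt{n}} \cdot \mathrm{randn}(m, n), \qquad m = 1000, \quad n = 3000,
\]

$\lambda = 0.2$, $\tau = 0.001$, and $\mu = 0.01$.

$x^*$ is a random sparse vector with 125 non-zeros and $b = Ax^* + \varepsilon$. We use the following estimate for $L$:

\[
L = \left( 1 + \sqrt{\frac{m}{n}} \right)^2 + \frac{\lambda}{\tau} + \mu \approx 202.50.
\]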
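The estimate is a sum of one term per piece of the objective and is easy to reproduce. In the sketch below the function name is made up, and the reading of the terms (the first for the least-squares part, $\lambda / \tau$ for the smoothed $\ell_1$ part, $\mu$ for the ridge part) is an interpretation, since the slides do not define `smooth`.

```python
import math

def lipschitz_estimate(m, n, lam, tau, mu):
    """L = (1 + sqrt(m/n))^2 + lam/tau + mu, as in the estimate above."""
    return (1.0 + math.sqrt(m / n)) ** 2 + lam / tau + mu

mu = 0.01
L = lipschitz_estimate(m=1000, n=3000, lam=0.2, tau=0.001, mu=mu)
kappa = L / mu
print(L, kappa, math.sqrt(mu / L))
```

The condition number is dominated by the $\lambda / \tau$ term, so the baseline step $\alpha_k = \sqrt{\mu / L}$ is small for this problem.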




Figure: $\|x_k - x^*\|$




Test 2: anisotropic bowl

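The second test is

\[
\begin{aligned}
\text{minimize} \quad & f(x) = \sum_{i=1}^{n} i \cdot x_i^4 + \frac{1}{2} \|x\|^2, \\
\text{subject to} \quad & \|x\| \le \tau.
\end{aligned}
\]

We choose $n = 500$ and $\tau = 4$. $x_0$ is randomly chosen from the boundary $\{x \mid \|x\| = \tau\}$. For this problem, we have

\[
L = 12 n \tau^2 + 1 = 96001 \quad \text{and} \quad \mu = 1.
\]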
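These constants can be read off the Hessian, which is diagonal. A brief sketch of the bound on the feasible set $\|x\| \le \tau$:

```latex
\begin{aligned}
\nabla^2 f(x) &= \operatorname{diag}\!\left( 12\, i\, x_i^2 \right)_{i=1}^{n} + I_n, \\
\lambda_{\max}\!\left( \nabla^2 f(x) \right) &\le 12\, n\, \tau^2 + 1 = 12 \cdot 500 \cdot 16 + 1 = 96001
  \qquad (i \le n,\; x_i^2 \le \|x\|^2 \le \tau^2), \\
\lambda_{\min}\!\left( \nabla^2 f(x) \right) &\ge 1, \quad \text{with equality at } x = 0.
\end{aligned}
```

The upper bound is attained only when all the mass sits on the last coordinate, so near the solution the local curvature is far smaller than $L$.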




Test 3: back to quadratic functions

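Let’s check the performance of the adaptive algorithm on quadratic functions.

\[
\text{minimize} \quad f(x) = \frac{1}{2} x^T A x - b^T x.
\]

We choose $A \sim \frac{1}{m} W_n(I_n, m)$, where $n = 4500$ and $m = 5000$. We use the following estimates for $L$ and $\mu$:

\[
L = \left( 1 + \sqrt{\frac{n}{m}} \right)^2, \qquad \mu = \left( 1 - \sqrt{\frac{n}{m}} \right)^2.
\]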
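Plugging in the stated sizes gives a sense of the conditioning of this test (values rounded):

```latex
\begin{aligned}
\sqrt{n/m} &= \sqrt{0.9} \approx 0.9487, \\
L &\approx (1.9487)^2 \approx 3.797, \qquad
\mu \approx (0.0513)^2 \approx 2.63 \times 10^{-3}, \\
\kappa &= \frac{L}{\mu} = \left( \frac{1 + \sqrt{n/m}}{1 - \sqrt{n/m}} \right)^2 \approx 1.44 \times 10^{3},
\qquad \sqrt{\kappa} \approx 38.
\end{aligned}
```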




Comparing with TFOCS(AT)

Figure: $\|x_k - x^*\|$



Final thoughts

The convergence rate of Nesterov’s method depends on the problem type. For quadratic problems, the speed is doubled.

There is room to improve Nesterov’s optimal gradient method on strongly convex functions.

Whether we can improve Nesterov’s method universally (with theoretical proof) is still an open question.

