TRANSCRIPT
The GapQuadratic Functions
General Strongly Convex Functions
Accelerating Nesterov’s Method for Strongly Convex Functions
Hao Chen, Xiangrui Meng
MATH301, 2011
Outline
1 The Gap
2 Quadratic Functions
3 General Strongly Convex Functions
Our talk begins with a tiny gap
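For any $x_0 \in \mathbb{R}^\infty$ and any constants $\mu > 0$, $L > \mu$, there exists a function $f \in \mathcal{S}^{\infty,1}_{\mu,L}$ such that for any first-order method, we have
$$f(x_k) - f^* \ge \frac{\mu}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k}\|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}.$$
Nesterov’s method generates a sequence $\{x_k\}_{k=0}^{\infty}$ such that
$$f(x_k) - f^* \le L\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}\right)^{k}\|x_0 - x^*\|^2, \qquad \kappa = \frac{L}{\mu}.$$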
At a closer look, the gap is not tiny
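Assume that $\kappa$ is large. Given a small tolerance $\varepsilon > 0$, to make $f(x_k) - f^* < \varepsilon$, the “ideal” first-order method needs
$$K^* = \frac{\log\varepsilon - \log\frac{\mu}{2}}{2\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}} \approx \log\frac{1}{\varepsilon}\cdot\frac{\sqrt{\kappa}}{4}$$
iterations. Nesterov’s method needs
$$K = \frac{\log\varepsilon - \log L}{\log\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}} \approx \log\frac{1}{\varepsilon}\cdot\sqrt{\kappa}$$
iterations, which is 4 times as large as the ideal number.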
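The two iteration counts can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and it assumes $\|x_0 - x^*\| = 1$ so that the bounds reduce to the formulas for $K^*$ and $K$ above.

```python
import math

def ideal_iterations(eps, mu, L):
    """K* from the lower bound, with ||x0 - x*|| normalized to 1 (assumption)."""
    s = math.sqrt(L / mu)
    return (math.log(eps) - math.log(mu / 2)) / (2 * math.log((s - 1) / (s + 1)))

def nesterov_iterations(eps, mu, L):
    """K from Nesterov's upper bound, with the same normalization."""
    s = math.sqrt(L / mu)
    return (math.log(eps) - math.log(L)) / math.log((s - 1) / s)

def rate_ratio(kappa):
    """Ratio of the per-iteration log-contraction factors (ideal over Nesterov)."""
    s = math.sqrt(kappa)
    return 2 * math.log((s - 1) / (s + 1)) / math.log((s - 1) / s)

if __name__ == "__main__":
    mu, L, eps = 1.0, 1.0e6, 1e-12
    print(ideal_iterations(eps, mu, L), nesterov_iterations(eps, mu, L))
    print(rate_ratio(L / mu))  # approaches 4 as kappa grows
```

The factor 4 is the ratio of the contraction exponents. For a moderate tolerance the exact counts differ by somewhat more, because the constants $\log\frac{\mu}{2}$ and $\log L$ are not negligible next to $\log\varepsilon$.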
Can we reduce the gap?
Can we reduce the gap
for quadratic functions?
$$\text{minimize } f(x) = \tfrac{1}{2}x^T A x - b^T x, \qquad \mu I_n \preceq A \preceq L I_n.$$
In this case, we do have an “ideal” method, the conjugate gradient method, which has the optimal convergence rate.
for general strongly convex functions?
$$\text{minimize } f(x), \qquad f \in \mathcal{S}_{\mu,L}.$$
Nesterov’s constant step scheme, III
0. Choose $y_0 = x_0 \in \mathbb{R}^n$.
1. $k$-th iteration ($k \ge 0$).
$$x_{k+1} = y_k - h f'(y_k),$$
$$y_{k+1} = x_{k+1} + \beta(x_{k+1} - x_k),$$
where $h = \frac{1}{L}$ and $\beta = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}}$.
Q: Is Nesterov’s choice of h and β optimal?
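A minimal NumPy sketch of the constant step scheme follows. The function name and signature are hypothetical; `h` and `beta` default to Nesterov's values but can be overridden, which makes it easy to try other choices.

```python
import numpy as np

def nesterov_constant_step(grad, x0, L, mu, num_iters, h=None, beta=None):
    """Constant step scheme: x_{k+1} = y_k - h f'(y_k), y_{k+1} = x_{k+1} + beta (x_{k+1} - x_k)."""
    h = 1.0 / L if h is None else h
    if beta is None:
        s = np.sqrt(mu * h)
        beta = (1.0 - s) / (1.0 + s)
    x = np.array(x0, dtype=float)
    y = x.copy()                      # y_0 = x_0
    for _ in range(num_iters):
        x_next = y - h * grad(y)      # gradient step from y_k
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

if __name__ == "__main__":
    A = np.diag([1.0, 10.0, 100.0])   # toy quadratic with mu = 1, L = 100
    b = np.ones(3)
    x = nesterov_constant_step(lambda y: A @ y - b, np.zeros(3), L=100.0, mu=1.0, num_iters=300)
    print(np.linalg.norm(x - b / np.diag(A)))
```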
On quadratic functions
When minimizing a quadratic function
$$f(x) = \tfrac{1}{2}x^T A x - b^T x,$$
Nesterov’s updates become
0. Choose $y_0 = x_0 = 0$.
1. $k$-th iteration ($k \ge 0$).
$$x_{k+1} = y_k - h(Ay_k - b),$$
$$y_{k+1} = x_{k+1} + \beta(x_{k+1} - x_k).$$
Eigendecomposition
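Let $A = V\Lambda V^T$ be $A$’s eigendecomposition. Define $\tilde{x}_k = V^T x_k$, $\tilde{y}_k = V^T y_k$ for all $k$, and $\tilde{b} = V^T b$. Then Nesterov’s updates can be written as
0. Choose $\tilde{y}_0 = \tilde{x}_0 = 0$.
1. $k$-th iteration ($k \ge 0$).
$$\tilde{x}_{k+1} = \tilde{y}_k - h(\Lambda\tilde{y}_k - \tilde{b}),$$
$$\tilde{y}_{k+1} = \tilde{x}_{k+1} + \beta(\tilde{x}_{k+1} - \tilde{x}_k).$$
$\Lambda$ is diagonal, hence the updates are actually element-wise:
$$\tilde{x}_{k+1,i} = \tilde{y}_{k,i} - h(\lambda_i\tilde{y}_{k,i} - \tilde{b}_i), \quad i = 1,\dots,n,$$
$$\tilde{y}_{k+1,i} = \tilde{x}_{k+1,i} + \beta(\tilde{x}_{k+1,i} - \tilde{x}_{k,i}), \quad i = 1,\dots,n.$$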
Recurrence relation
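We can eliminate the sequence $\{\tilde{y}_k\}$ from the update scheme. For $k \ge 1$,
$$\begin{aligned}
\tilde{x}_{k+1,i} &= \tilde{y}_{k,i} - h(\lambda_i\tilde{y}_{k,i} - \tilde{b}_i) \\
&= \big(\tilde{x}_{k,i} + \beta(\tilde{x}_{k,i} - \tilde{x}_{k-1,i})\big) - h\big(\lambda_i(\tilde{x}_{k,i} + \beta(\tilde{x}_{k,i} - \tilde{x}_{k-1,i})) - \tilde{b}_i\big) \\
&= (1+\beta)(1-\lambda_i h)\tilde{x}_{k,i} - \beta(1-\lambda_i h)\tilde{x}_{k-1,i} + h\tilde{b}_i,
\end{aligned}$$
where $\tilde{x}_k = V^T x_k$ and $\tilde{b} = V^T b$.
Let $e_k = V^T(x_k - x^*) = V^T(x_k - V\Lambda^{-1}V^T b) = \tilde{x}_k - \Lambda^{-1}\tilde{b}$ for all $k$. We have the following recurrence relation on the error:
$$e_{k+1,i} = (1+\beta)(1-\lambda_i h)e_{k,i} - \beta(1-\lambda_i h)e_{k-1,i}.$$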
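The error recurrence can be checked numerically along a single eigendirection. This is a small sanity check rather than part of the method: the function name is hypothetical, and `lam` and `b` stand for one eigenvalue $\lambda_i$ and the matching component of $V^T b$.

```python
def recurrence_residual(lam, b, h, beta, num_iters=20):
    """Run the scalar updates from y_0 = x_0 = 0 and return the largest
    deviation from e_{k+1} = (1+beta)(1-lam*h) e_k - beta (1-lam*h) e_{k-1}."""
    x_star = b / lam
    x = y = 0.0
    errors = [x - x_star]             # errors[k] holds e_k
    for _ in range(num_iters):
        x_next = y - h * (lam * y - b)
        y = x_next + beta * (x_next - x)
        x = x_next
        errors.append(x - x_star)
    c = 1.0 - lam * h
    worst = 0.0
    for k in range(1, num_iters):     # the three-term recurrence needs k >= 1
        predicted = (1 + beta) * c * errors[k] - beta * c * errors[k - 1]
        worst = max(worst, abs(errors[k + 1] - predicted))
    return worst

if __name__ == "__main__":
    print(recurrence_residual(lam=2.0, b=1.0, h=0.1, beta=0.5))
```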
Characteristic equation
The characteristic equation for the recurrence relation is given by
$$\xi_i^2 = (1+\beta)(1-\lambda_i h)\xi_i - \beta(1-\lambda_i h).$$
Denote the two roots by $\xi_{i,1}$ and $\xi_{i,2}$, and assume they are distinct for simplicity. The general solution is given by
$$e_{k,i} = C_{i,1}\xi_{i,1}^k + C_{i,2}\xi_{i,2}^k.$$
Let $C_i = |C_{i,1}| + |C_{i,2}|$ and $\theta_i = \max\{|\xi_{i,1}|, |\xi_{i,2}|\}$. We have
$$|e_{k,i}| \le C_i\theta_i^k.$$
Hence,
$$\|x_k - x^*\|^2 = \|\tilde{x}_k - \tilde{x}^*\|^2 = \sum_i |e_{k,i}|^2 \le \sum_i C_i^2\theta_i^{2k} \le C\theta^{2k},$$
where $C = \sum_i C_i^2$ and $\theta = \max_i\theta_i$.
Finding the optimal convergence rate
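Our problem becomes
$$\begin{aligned}
\text{minimize} \quad & \theta \\
\text{subject to} \quad & \theta \ge |\xi_1(\lambda)|,\ \theta \ge |\xi_2(\lambda)|, \quad \forall\lambda \in [\mu, L],
\end{aligned}$$
where $\xi_1(\lambda)$ and $\xi_2(\lambda)$ are the roots of
$$\xi^2 = (1+\beta)(1-\lambda h)\xi - \beta(1-\lambda h),$$
and $h$, $\beta$ and $\theta$ are the variables.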
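For a fixed pair $(h, \beta)$, the smallest feasible $\theta$ is the largest root modulus over $\lambda \in [\mu, L]$. The sketch below approximates it on a grid of $\lambda$ values; the function name and the grid size are illustrative assumptions.

```python
import numpy as np

def worst_case_rate(h, beta, mu, L, num_points=2001):
    """Largest modulus of the roots of xi^2 = (1+beta)(1-lam*h) xi - beta (1-lam*h)
    over a uniform grid of lam in [mu, L]."""
    lams = np.linspace(mu, L, num_points)
    c = 1.0 - lams * h
    p = (1.0 + beta) * c
    # Complex square root, so real and complex-conjugate roots are handled alike.
    disc = np.sqrt((p * p - 4.0 * beta * c).astype(complex))
    r1 = (p + disc) / 2.0
    r2 = (p - disc) / 2.0
    return float(np.max(np.maximum(np.abs(r1), np.abs(r2))))

if __name__ == "__main__":
    mu, L = 1.0, 100.0
    s = np.sqrt(mu / L)
    print(worst_case_rate(2.0 / (L + mu), 0.0, mu, L))            # gradient descent
    print(worst_case_rate(1.0 / L, (1 - s) / (1 + s), mu, L))     # Nesterov's choice
```

Because the grid includes both endpoints $\mu$ and $L$, the value is exact whenever the maximum is attained there, up to rounding near a double root.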
Special cases
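If $\beta = 0$, we are doing gradient descent. The optimal rate is given by $\theta^* = \frac{L-\mu}{L+\mu}$, attained at $h^* = \frac{2}{L+\mu}$.
If $h = \frac{1}{L}$, the optimal rate is given by
$$\theta^* = 1 - \sqrt{\mu h} = 1 - \sqrt{\frac{\mu}{L}}, \quad \text{attained at } \beta^* = \frac{1-\sqrt{\mu h}}{1+\sqrt{\mu h}} = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}},$$
which confirms Nesterov’s choice.
Q: Why do we choose $h = \frac{1}{L}$? It maximizes the guaranteed decrease in function value for a function whose gradient has Lipschitz constant $L$.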
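The answer to the last question can be made explicit with the standard descent inequality for a function with $L$-Lipschitz gradient, which is also the inequality used for $f(x_{k+1})$ later in the talk:

```latex
f\big(y - h f'(y)\big) \;\le\; f(y) - h\Big(1 - \frac{L h}{2}\Big)\|f'(y)\|^2 .
% The guaranteed decrease h(1 - Lh/2) is a concave quadratic in h,
% maximized at h = 1/L, where it equals 1/(2L):
f\big(y - \tfrac{1}{L} f'(y)\big) \;\le\; f(y) - \frac{1}{2L}\|f'(y)\|^2 .
```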
The optimal convergence rate
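By considering all the combinations of $h$ and $\beta$, we reach the following optimal solution:
$$h^* = \frac{4}{3L+\mu} \quad \Big(\text{the harmonic mean of } \frac{1}{L} \text{ and } \frac{2}{L+\mu}\Big), \qquad \beta^* = \frac{1-\sqrt{\mu h^*}}{1+\sqrt{\mu h^*}},$$
$$\theta^* = 1 - \sqrt{\mu h^*} = 1 - \frac{2}{\sqrt{3\kappa+1}}.$$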
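As a consistency check (not a proof of optimality), one can verify that with these values the characteristic equation has largest root modulus exactly $\theta^*$ at both ends of the spectrum. Write $s = \sqrt{\mu h^*} = 2/\sqrt{3\kappa+1}$, so that $\beta^* = (1-s)/(1+s)$:

```latex
% At lambda = mu:  1 - mu h^* = 1 - s^2, hence
%   (1+beta^*)(1 - mu h^*) = 2(1-s),   beta^*(1 - mu h^*) = (1-s)^2,
\xi^2 - 2(1-s)\,\xi + (1-s)^2 = \big(\xi - (1-s)\big)^2 = 0 .
% At lambda = L:  1 - L h^* = -\frac{\kappa-1}{3\kappa+1} = -\frac{1-s^2}{3}, hence
\xi^2 + \tfrac{2}{3}(1-s)\,\xi - \tfrac{1}{3}(1-s)^2
   = \big(\xi + (1-s)\big)\big(\xi - \tfrac{1}{3}(1-s)\big) = 0 .
% In both cases  max |xi| = 1 - s = theta^* .
```

The step size $h^*$ therefore balances the slowest mode at $\lambda = \mu$ against the slowest mode at $\lambda = L$.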
Comparing the convergence rates
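Nesterov’s method ($h = \frac{1}{L}$):
$$\|x_k - x^*\| \le C\cdot\Big(1 - \frac{1}{\sqrt{\kappa}}\Big)^k\|x_0 - x^*\|.$$
Note that this is better than the convergence rate we have on general strongly convex functions.
Nesterov’s method ($h = \frac{4}{3L+\mu}$):
$$\|x_k - x^*\| \le C\cdot\Big(1 - \frac{2}{\sqrt{3\kappa+1}}\Big)^k\|x_0 - x^*\|.$$
Conjugate gradient:
$$\|x_k - x^*\|_A \le 2\cdot\Big(1 - \frac{2}{\sqrt{\kappa}+1}\Big)^k\|x_0 - x^*\|_A.$$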
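To get a feel for the three rates, the sketch below counts how many iterations each contraction factor needs to shrink the error by a given factor. It ignores the leading constants and the fact that the conjugate gradient bound is stated in the $A$-norm, so it is only a rough comparison; the function names are hypothetical.

```python
import math

def contraction_factors(kappa):
    """Per-iteration contraction factors from the three bounds."""
    return {
        "nesterov_h_1_over_L": 1 - 1 / math.sqrt(kappa),
        "nesterov_h_opt": 1 - 2 / math.sqrt(3 * kappa + 1),
        "conjugate_gradient": 1 - 2 / (math.sqrt(kappa) + 1),
    }

def iterations_to_reduce(theta, factor):
    """Smallest k with theta**k <= factor (leading constants ignored)."""
    return math.ceil(math.log(factor) / math.log(theta))

if __name__ == "__main__":
    for name, theta in contraction_factors(81.0).items():
        print(name, round(theta, 4), iterations_to_reduce(theta, 1e-6))
```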
What’s happening on the eigenspace
Figure: Error along eigendirections (|ek,i |)
The model problem
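$$\text{minimize } f(x) = \tfrac{1}{2}x^T A x - b^T x,$$
where
$$A = \begin{pmatrix} 2 & -1 & & \\ -1 & \ddots & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} + \delta I_n \in \mathbb{R}^{n\times n}, \qquad b = \texttt{randn}(n, 1) \in \mathbb{R}^n.$$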
We chose n = 106 and δ = 0.05.
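A scaled-down sketch of this experiment follows. Several things here are assumptions rather than the authors' setup: the bounds $\mu = \delta$ and $L = 4 + \delta$ (the tridiagonal part has eigenvalues in $(0, 4)$), the small default `n`, the iteration count, and all function names. The matrix is applied without being formed, so the same code would also run for large `n`.

```python
import numpy as np

def model_matvec(x, delta):
    """Apply A = tridiag(-1, 2, -1) + delta*I without forming the matrix."""
    y = (2.0 + delta) * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

def step_and_momentum(mu, L, optimal=False):
    """Return (h, beta) for h = 1/L, or for h = 4/(3L + mu) when optimal=True."""
    h = 4.0 / (3.0 * L + mu) if optimal else 1.0 / L
    s = np.sqrt(mu * h)
    return h, (1.0 - s) / (1.0 + s)

def nesterov_quadratic(b, delta, h, beta, num_iters):
    """Nesterov's updates on the model problem, starting from y_0 = x_0 = 0."""
    x = np.zeros_like(b)
    y = x.copy()
    for _ in range(num_iters):
        x_next = y - h * (model_matvec(y, delta) - b)
        y = x_next + beta * (x_next - x)
        x = x_next
    return x

if __name__ == "__main__":
    n, delta = 200, 0.05
    b = np.random.default_rng(0).standard_normal(n)
    A = (2.0 + delta) * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # dense reference only
    x_star = np.linalg.solve(A, b)
    for optimal in (False, True):
        h, beta = step_and_momentum(delta, 4.0 + delta, optimal)
        x = nesterov_quadratic(b, delta, h, beta, 200)
        print(optimal, np.linalg.norm(x - x_star))
```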
Figure: ‖xk − x∗‖
Figure: f (xk)− f ∗
Back to Nesterov’s proof
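A pair of sequences $\{\phi_k(x)\}$ and $\{\lambda_k\}$, $\lambda_k \ge 0$, is called an estimate sequence of the function $f(x)$ if $\lambda_k \to 0$ and for any $x \in \mathbb{R}^n$ and all $k \ge 0$ we have
$$\phi_k(x) \le (1-\lambda_k)f(x) + \lambda_k\phi_0(x).$$
If for a sequence $\{x_k\}$ we have
$$f(x_k) \le \phi_k^* \equiv \min_{x\in\mathbb{R}^n}\phi_k(x),$$
then $f(x_k) - f^* \le \lambda_k[\phi_0(x^*) - f^*] \to 0$.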
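The last implication follows from a short chain of inequalities, evaluating the estimate sequence at the minimizer $x^*$:

```latex
f(x_k) \;\le\; \phi_k^* \;\le\; \phi_k(x^*)
       \;\le\; (1-\lambda_k)\,f(x^*) + \lambda_k\,\phi_0(x^*)
       \;=\; f^* + \lambda_k\big[\phi_0(x^*) - f^*\big].
```

The rate at which $\lambda_k \to 0$ is therefore the convergence rate of $f(x_k) - f^*$.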
A useful estimate sequence provided by Nesterov
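$$\lambda_{k+1} = (1-\alpha_k)\lambda_k,$$
$$\phi_{k+1}(x) = (1-\alpha_k)\phi_k(x) + \alpha_k\Big[f(y_k) + \langle f'(y_k), x - y_k\rangle + \frac{\mu}{2}\|x - y_k\|^2\Big],$$
where
$\{y_k\}$ is an arbitrary sequence in $\mathbb{R}^n$.
$\alpha_k \in (0, 1)$, $\sum_{k=0}^{\infty}\alpha_k = \infty$.
$\lambda_0 = 1$.
$\phi_0$ is an arbitrary function on $\mathbb{R}^n$.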
A specific choice of φ0(x)
$$\phi_0(x) \equiv \phi_0^* + \frac{\gamma_0}{2}\|x - v_0\|^2,$$
and set $x_0 = v_0$, $\phi_0^* = f(x_0)$. The previous estimate sequence becomes
$$\phi_k(x) \equiv \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2$$
with
$$\begin{aligned}
\gamma_{k+1} &= (1-\alpha_k)\gamma_k + \alpha_k\mu, \\
v_{k+1} &= \big[(1-\alpha_k)\gamma_k v_k + \alpha_k\mu y_k - \alpha_k f'(y_k)\big]/\gamma_{k+1}, \\
\phi_{k+1}^* &= (1-\alpha_k)\phi_k^* + \alpha_k f(y_k) - \frac{\alpha_k^2}{2\gamma_{k+1}}\|f'(y_k)\|^2 \\
&\quad + \frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\Big(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle f'(y_k), v_k - y_k\rangle\Big).
\end{aligned}$$
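The closed-form recursions for $\gamma_k$, $v_k$ and $\phi_k^*$ can be checked against the defining recursion for $\phi_{k+1}(x)$. The sketch below performs one step and exposes both sides of the identity at a test point; the function names are hypothetical, and `f_y`, `g_y` stand for $f(y_k)$ and $f'(y_k)$, which can be arbitrary numbers here because the identity is purely algebraic.

```python
import numpy as np

def estimate_sequence_step(gamma, v, phi_star, alpha, mu, y, f_y, g_y):
    """One update of (gamma_k, v_k, phi_k^*) using the closed-form recursions."""
    gamma_next = (1 - alpha) * gamma + alpha * mu
    v_next = ((1 - alpha) * gamma * v + alpha * mu * y - alpha * g_y) / gamma_next
    phi_next = ((1 - alpha) * phi_star + alpha * f_y
                - alpha ** 2 / (2 * gamma_next) * (g_y @ g_y)
                + alpha * (1 - alpha) * gamma / gamma_next
                * (mu / 2 * ((y - v) @ (y - v)) + g_y @ (v - y)))
    return gamma_next, v_next, phi_next

def canonical_value(phi_star, gamma, v, x):
    """Evaluate phi(x) = phi^* + (gamma/2) ||x - v||^2."""
    return phi_star + gamma / 2 * ((x - v) @ (x - v))

def recursive_value(phi_star, gamma, v, alpha, mu, y, f_y, g_y, x):
    """Evaluate phi_{k+1}(x) directly from its defining recursion."""
    lower_model = f_y + g_y @ (x - y) + mu / 2 * ((x - y) @ (x - y))
    return (1 - alpha) * canonical_value(phi_star, gamma, v, x) + alpha * lower_model

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v, y, g, x = (rng.standard_normal(5) for _ in range(4))
    g1, v1, p1 = estimate_sequence_step(2.0, v, 1.7, 0.3, 0.5, y, -0.4, g)
    print(canonical_value(p1, g1, v1, x), recursive_value(1.7, 2.0, v, 0.3, 0.5, y, -0.4, g, x))
```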
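Let the update be
$$x_{k+1} = y_k - h_k f'(y_k)$$
and use the inequalities
$$\phi_k^* \ge f(x_k) \ge f(y_k) + \langle f'(y_k), x_k - y_k\rangle + \frac{\mu}{2}\|x_k - y_k\|^2,$$
$$f(x_{k+1}) \le f(y_k) - \frac{h_k(2 - Lh_k)}{2}\|f'(y_k)\|^2.$$
We have
$$\begin{aligned}
\phi_{k+1}^* - f(x_{k+1}) \ge{}& \Big(-\frac{\alpha_k^2}{2\gamma_{k+1}} + \frac{h_k(2 - Lh_k)}{2}\Big)\|f'(y_k)\|^2 \\
&+ (1-\alpha_k)\Big\langle f'(y_k), \frac{\alpha_k\gamma_k}{\gamma_{k+1}}(v_k - y_k) + (x_k - y_k)\Big\rangle \\
&+ \frac{\mu(1-\alpha_k)}{2}\Big(\frac{\alpha_k\gamma_k}{\gamma_{k+1}}\|v_k - y_k\|^2 + \|x_k - y_k\|^2\Big).
\end{aligned}$$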
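Nesterov’s choice:
$y_k = \frac{\alpha_k\gamma_k v_k + \gamma_{k+1}x_k}{\gamma_k + \alpha_k\mu}$.
$h_k = \frac{1}{L}$.
$\gamma_0 \ge \mu$. Since $\gamma_{k+1} = (1-\alpha_k)\gamma_k + \alpha_k\mu$, we have $\gamma_k \ge \mu$.
$\alpha_k$ can be as large as $\sqrt{\mu/L}$ at each step, which leads to the convergence rate $1 - \sqrt{\mu/L} = 1 - \frac{1}{\sqrt{\kappa}}$.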
A simplified version
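$\gamma_k \equiv \mu$, $h_k \equiv \frac{1}{L}$
$y_k = \frac{\alpha_k v_k + x_k}{\alpha_k + 1}$
$v_k - y_k = \frac{v_k - x_k}{\alpha_k + 1}$
$x_k - y_k = \frac{\alpha_k(x_k - v_k)}{\alpha_k + 1}$
$$\phi_{k+1}^* - f(x_{k+1}) \ge \Big(-\frac{\alpha_k^2}{2\mu} + \frac{1}{2L}\Big)\|f'(y_k)\|^2 + \frac{\mu\alpha_k(1-\alpha_k)}{2(1+\alpha_k)}\|x_k - v_k\|^2$$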
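Two short observations explain the shape of this bound. With $\gamma_k \equiv \mu$ the inner-product term of the general inequality vanishes, and dropping the nonnegative last term recovers the threshold $\sqrt{\mu/L}$:

```latex
% Inner product term: with gamma_k = gamma_{k+1} = mu the weight alpha_k gamma_k / gamma_{k+1} is alpha_k, and
\alpha_k (v_k - y_k) + (x_k - y_k)
   = \frac{\alpha_k (v_k - x_k)}{\alpha_k + 1} + \frac{\alpha_k (x_k - v_k)}{\alpha_k + 1} = 0 .
% Sufficient condition for phi_{k+1}^* >= f(x_{k+1}) when the last term is dropped:
-\frac{\alpha_k^2}{2\mu} + \frac{1}{2L} \ge 0
   \quad\Longleftrightarrow\quad \alpha_k \le \sqrt{\mu / L} .
```

Any extra room for a larger $\alpha_k$ has to come from the term in $\|x_k - v_k\|^2$.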
$\|x_k - v_k\|^2 / \|f'(y_k)\|^2$
Figure: $f(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\cdot\mathrm{smooth}(\|x\|_1, \tau) + \frac{1}{2}\mu\|x\|^2$
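$$\frac{\mu\alpha_k(1-\alpha_k)}{2(1+\alpha_k)}\cdot\frac{\|x_k - v_k\|^2}{\big\|f'\big(\frac{\alpha_k v_k + x_k}{\alpha_k + 1}\big)\big\|^2} - \frac{\alpha_k^2}{2\mu} \ge -\frac{1}{2L}$$
Since the decay rate is $\prod_k(1-\alpha_k)$, we want to find a large $\alpha_k$ such that the inequality holds.
Evaluating $f'\big(\frac{\alpha_k v_k + x_k}{\alpha_k + 1}\big)$ is time consuming, so we hope our first guess of $\alpha_k$ is good.
Note that $\|f'(y_k)\|$ tends to decrease, so our procedure is to find an $\alpha_k \ge \sqrt{\mu/L}$ such that
$$\frac{\mu\alpha_k(1-\alpha_k)}{2(1+\alpha_k)}\cdot\frac{\|x_k - v_k\|^2}{\|f'(y_{k-1})\|^2} - \frac{\alpha_k^2}{2\mu}$$
is large; such an $\alpha_k$ usually makes the inequality hold.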
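The slides describe the selection of $\alpha_k$ only loosely, so the following is one possible reading, not the authors' implementation. It replaces the unknown $\|f'(y_k)\|^2$ by the previous value $\|f'(y_{k-1})\|^2$ and uses bisection to find the largest $\alpha \in [\sqrt{\mu/L}, 1)$ for which the surrogate inequality holds. The names `slack`, `largest_feasible_alpha` and the argument `ratio` (standing for $\|x_k - v_k\|^2 / \|f'(y_{k-1})\|^2$) are hypothetical.

```python
import math

def slack(alpha, ratio, mu, L):
    """Left side minus right side of the surrogate inequality;
    ratio stands for ||x_k - v_k||^2 / ||f'(y_{k-1})||^2."""
    return (mu * alpha * (1 - alpha) / (2 * (1 + alpha)) * ratio
            - alpha ** 2 / (2 * mu) + 1 / (2 * L))

def largest_feasible_alpha(ratio, mu, L, tol=1e-12):
    """Bisection on [sqrt(mu/L), 1]. The slack is concave in alpha, nonnegative
    at the left end and negative at alpha = 1 (for mu < L), so it changes
    sign exactly once on this interval."""
    lo, hi = math.sqrt(mu / L), 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slack(mid, ratio, mu, L) >= 0:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    print(largest_feasible_alpha(0.0, 1.0, 100.0))   # no extra room: sqrt(mu/L)
    print(largest_feasible_alpha(10.0, 1.0, 100.0))  # larger ratio allows a larger alpha
```

In an actual run the inequality would still have to be checked with the true gradient at the resulting $y_k$. Falling back to $\alpha_k = \sqrt{\mu/L}$ when that check fails is an assumption of this sketch; that value is always admissible by Nesterov's analysis.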
$\|f'(y_k)\|$
Figure: $f(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\cdot\mathrm{smooth}(\|x\|_1, \tau) + \frac{1}{2}\mu\|x\|^2$
Test 1: smooth-BPDN
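The first test is a smooth version of Basis Pursuit De-Noising:
$$\text{minimize } f(x) = \frac{1}{2}\|Ax - b\|^2 + \lambda\cdot\mathrm{smooth}(\|x\|_1, \tau) + \frac{\mu}{2}\|x\|^2,$$
where we set
$$A = \frac{1}{\sqrt{n}}\cdot\texttt{randn}(m, n), \quad m = 1000, \quad n = 3000,$$
$\lambda = 0.2$, $\tau = 0.001$, and $\mu = 0.01$.
$x^*$ is a random sparse vector with 125 non-zeros and $b = Ax^* + \varepsilon$. We use the following estimate for $L$:
$$L = \Big(1 + \sqrt{\frac{m}{n}}\Big)^2 + \frac{\lambda}{\tau} + \mu \approx 202.50.$$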
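The estimate for $L$ is easy to reproduce. In the sketch below the function name is hypothetical, and the reading of the three terms is an interpretation: the first is the asymptotic largest eigenvalue of $A^T A$ for this Gaussian $A$, the second presumably bounds the curvature of the smoothed $\ell_1$ term (the slides do not define `smooth`), and the third comes from the quadratic regularizer.

```python
import math

def lipschitz_estimate(m, n, lam, tau, mu):
    """L = (1 + sqrt(m/n))^2 + lam/tau + mu, as on the slide."""
    return (1 + math.sqrt(m / n)) ** 2 + lam / tau + mu

if __name__ == "__main__":
    L = lipschitz_estimate(1000, 3000, 0.2, 0.001, 0.01)
    print(L, L / 0.01)  # estimate of L and the resulting condition number L / mu
```

With $\mu = 0.01$ the condition number $L/\mu$ is of order $2\times 10^4$, dominated by the $\lambda/\tau$ term.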
Figure: ‖xk − x∗‖
Figure: f (xk)− f ∗
Test 2: anisotropic bowl
The second test is
$$\text{minimize } f(x) = \sum_{i=1}^{n} i\cdot x_i^4 + \frac{1}{2}\|x\|^2, \quad \text{subject to } \|x\| \le \tau.$$
We choose $n = 500$ and $\tau = 4$. $x_0$ is randomly chosen from the boundary $\{x \mid \|x\| = \tau\}$. For this problem, we have
$$L = 12n\tau^2 + 1 = 96001 \quad \text{and} \quad \mu = 1.$$
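The constants $L$ and $\mu$ come from the diagonal Hessian of $f$, whose entries are $12\,i\,x_i^2 + 1$. A sketch of the objective and its derivatives follows; the function names are hypothetical, and the projection helper is one simple way to handle the constraint, not necessarily what the authors used.

```python
import numpy as np

def bowl_value(x):
    """f(x) = sum_i i * x_i^4 + 0.5 * ||x||^2, with i = 1..n."""
    i = np.arange(1, x.size + 1)
    return float(np.sum(i * x ** 4) + 0.5 * np.dot(x, x))

def bowl_gradient(x):
    i = np.arange(1, x.size + 1)
    return 4.0 * i * x ** 3 + x

def bowl_hessian_diagonal(x):
    """The Hessian is diagonal with entries 12 * i * x_i^2 + 1."""
    i = np.arange(1, x.size + 1)
    return 12.0 * i * x ** 2 + 1.0

def project_ball(x, tau):
    """Euclidean projection onto {x : ||x|| <= tau}."""
    norm = np.linalg.norm(x)
    return x if norm <= tau else x * (tau / norm)

if __name__ == "__main__":
    n, tau = 500, 4.0
    worst = np.zeros(n)
    worst[-1] = tau                              # all mass on the last coordinate
    print(bowl_hessian_diagonal(worst).max())    # attains 12 * n * tau^2 + 1
```

The largest curvature on the ball is attained at $x = \tau e_n$, and the smallest, equal to 1, at any point with a zero coordinate.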
Figure: ‖xk − x∗‖
Figure: f (xk)− f ∗
Test 3: back to quadratic functions
Let’s check the performance of the adaptive algorithm on quadratic functions.
$$\text{minimize } f(x) = \frac{1}{2}x^T A x - b^T x.$$
We choose $A \sim \frac{1}{m}W_n(I_n, m)$, where $n = 4500$ and $m = 5000$. We use the following estimates for $L$ and $\mu$:
$$L = \Big(1 + \sqrt{\frac{n}{m}}\Big)^2, \qquad \mu = \Big(1 - \sqrt{\frac{n}{m}}\Big)^2.$$
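A scaled Wishart matrix can be sampled as $X^T X / m$ with $X$ an $m \times n$ standard Gaussian matrix, and the two estimates are the asymptotic edges of its spectrum. The sketch below uses a smaller matrix with the same ratio $n/m$ for a quick comparison; the function names, sizes and seed are illustrative assumptions.

```python
import numpy as np

def sample_scaled_wishart(n, m, rng):
    """Draw A = X^T X / m with X an m-by-n standard Gaussian matrix."""
    X = rng.standard_normal((m, n))
    return X.T @ X / m

def spectrum_edge_estimates(n, m):
    """Return (mu, L) = ((1 - sqrt(n/m))^2, (1 + sqrt(n/m))^2)."""
    r = np.sqrt(n / m)
    return (1 - r) ** 2, (1 + r) ** 2

if __name__ == "__main__":
    print(spectrum_edge_estimates(4500, 5000))           # the slide's sizes
    A = sample_scaled_wishart(180, 200, np.random.default_rng(0))
    eigs = np.linalg.eigvalsh(A)
    print(eigs[0], eigs[-1], spectrum_edge_estimates(180, 200))
```

At finite size the extreme eigenvalues fluctuate around these edges, and the relative deviation is much larger for the smallest eigenvalue, so the estimate of $\mu$ is the more delicate of the two.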
Figure: ‖xk − x∗‖
Figure: f (xk)− f ∗
Comparing with TFOCS(AT)
Figure: ‖xk − x∗‖
Figure: f (xk)− f ∗
Final thoughts
The convergence rate of Nesterov’s method depends on the problem type. For quadratic problems, the speed is doubled.
There is room to improve Nesterov’s optimal gradient method on strongly convex functions.
Whether we can improve Nesterov’s method universally (with theoretical proof) is still an open question.