

Escaping Saddle Points in Constrained Optimization
Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie

Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology (MIT)

Introduction

▶ Recent revival of interest in nonconvex optimization ⇒ Practical success and advances in computational tools

▶ Consider the following general optimization program

    min_{x∈C} f(x)

▶ C ⊆ ℝ^d is a convex, compact, closed set ⇒ This problem is hard
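To make the template concrete, here is a small toy instance added for illustration (not taken from the poster): a nonconvex quadratic over a Euclidean ball. The names A, R, f, grad_f, and hess_f are hypothetical and are reused by the sketches later in this document.

```python
import numpy as np

# Toy instance for illustration only: f(x) = 0.5 * x^T A x with an indefinite A,
# minimized over the ball C = {x : ||x|| <= R}.  The origin is a saddle point.
A = np.diag([1.0, -1.0])   # one positive- and one negative-curvature direction
R = 1.0                    # radius of the feasible ball

def f(x):
    return 0.5 * x @ (A @ x)

def grad_f(x):
    return A @ x

def hess_f(x):
    return A
```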

Convex Optimization: Optimality Condition

▶ Before jumping to nonconvex optimization ⇒ Let’s recap the convex case!


▶ In the convex setting (f is convex)
  ⇒ First-order optimality condition implies global optimality
  ⇒ Finding an approximate first-order stationary point is sufficient

    Unconstrained: Find x∗ s.t. ‖∇f(x∗)‖ ≤ ε
    Constrained:   Find x∗ s.t. ∇f(x∗)ᵀ(x − x∗) ≥ −ε for all x ∈ C
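As an illustration of how the constrained condition can be checked numerically, the sketch below evaluates it through the Frank-Wolfe gap, under the simplifying assumption that C is the Euclidean ball of radius R (so the linear minimization step has a closed form); lmo_ball and is_eps_fosp are hypothetical helper names.

```python
import numpy as np

def lmo_ball(g, R):
    """Linear minimization oracle for C = {v : ||v|| <= R}:
    argmin over C of <g, v> is -R * g / ||g|| (any feasible point if g = 0)."""
    n = np.linalg.norm(g)
    return -R * g / n if n > 0 else np.zeros_like(g)

def is_eps_fosp(x, grad, R, eps):
    """grad^T (v - x) >= -eps holds for all v in C iff it holds at the
    minimizer of the linear function v -> grad^T v over C."""
    v = lmo_ball(grad, R)
    return grad @ (v - x) >= -eps
```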

Nonconvex Optimization


▶ 1st-order optimality is not enough ⇒ Saddle points exist!

▶ Check higher-order derivatives ⇒ To escape from saddle points
  ⇒ Search for a second-order stationary point (SOSP)

▶ Does convergence to an SOSP lead to global optimality? No!
▶ But, if all saddles are escapable (strict saddles)
  ⇒ SOSP ⇒ local minimum!
▶ In several cases, all saddle points are escapable and all local minima are global
  ⇒ Eigenvector problem [Absil et al., ’10]
  ⇒ Phase retrieval [Sun et al., ’16]
  ⇒ Dictionary learning [Sun et al., ’17]

Unconstrained Optimization

▶ Consider the unconstrained nonconvex setting (C = ℝ^d)
▶ x∗ is an approximate (ε, γ)-second-order stationary point if

    ‖∇f(x∗)‖ ≤ ε      (first-order optimality condition)
    ∇²f(x∗) ⪰ −γI     (second-order optimality condition)
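A direct numerical reading of this definition, assuming the exact gradient and a symmetric Hessian matrix are available, might look as follows (a sketch, not the paper's code):

```python
import numpy as np

def is_unconstrained_sosp(grad, hess, eps, gamma):
    """(eps, gamma)-SOSP test for C = R^d: gradient norm at most eps and the
    Hessian's smallest eigenvalue at least -gamma (i.e. hess >= -gamma * I)."""
    lam_min = np.linalg.eigvalsh(hess).min()   # hess is assumed symmetric
    return np.linalg.norm(grad) <= eps and lam_min >= -gamma
```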

▶ Various attempts to design algorithms converging to an SOSP

▶ Perturbing iterates by injecting noise
  ⇒ [Ge et al., ’15], [Jin et al., ’17a,b], [Daneshmand et al., ’18]

▶ Using the eigenvector corresponding to the smallest eigenvalue of the Hessian
  ⇒ [Carmon et al., ’16], [Allen-Zhu, ’17], [Xu & Yang, ’17], [Royer & Wright, ’17], [Agarwal et al., ’17], [Reddi et al., ’18]

▶ Overall cost to find an (ε, γ)-SOSP ⇒ Polynomial in ε⁻¹ and γ⁻¹

▶ However, these methods are not applicable to the convex constrained setting!

▶ In the constrained case, can we find an SOSP in poly-time?

Constrained optimization: Second-order stationary point

▶ How should we define an SOSP for the constrained setting?
▶ x∗ ∈ C is an approximate (ε, γ)-second-order stationary point if

    ∇f(x∗)ᵀ(x − x∗) ≥ −ε   for all x ∈ C
    (x − x∗)ᵀ∇²f(x∗)(x − x∗) ≥ −γ   for all x ∈ C s.t. ∇f(x∗)ᵀ(x − x∗) = 0

▶ The second condition needs to hold only on the subspace where the function can be increasing

▶ Setting ε = γ = 0 gives the necessary conditions for a local min

▶ We propose a framework that finds an (ε, γ)-SOSP in poly-time
  ⇒ If optimizing a quadratic loss over C up to a constant factor is tractable

Proposed algorithm to find an (ε, γ)-SOSP


▶ Follow a first-order update to reach an ε-FOSP
  ⇒ The function value decreases at a rate of O(ε⁻²)
▶ Escape from saddle points by solving a QP that depends on the objective function's curvature information
  ⇒ The function value decreases at a rate of O(γ⁻³)
▶ Once we escape from a saddle point, we won't revisit it again
  ⇒ The function value decreases after escaping from a saddle
  ⇒ It is guaranteed that the function value never increases

Stage I: First-order update (Finding a critical point)

▶ Goal: Find x_t s.t. ∇f(x_t)ᵀ(x − x_t) ≥ −ε for all x ∈ C

▶ Follow Frank-Wolfe until reaching an ε-FOSP

    x_{t+1} = (1 − η)x_t + η v_t,   where   v_t = argmin_{v∈C} ∇f(x_t)ᵀv

▶ Follow Projected Gradient Descent until reaching an ε-FOSP

    x_{t+1} = π_C{x_t − η∇f(x_t)},

  where π_C(·) is the Euclidean projection onto the convex set C

▶ Each such step decreases the function value by at least O(ε²), so an ε-FOSP is reached after at most O(ε⁻²) iterations
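A minimal sketch of the Frank-Wolfe variant of Stage I, again assuming C is the Euclidean ball of radius R; frank_wolfe_stage, grad_fn, and the fixed step size eta are illustrative names, with eta standing in for the choice η = O(ε).

```python
import numpy as np

def frank_wolfe_stage(x, grad_fn, R, eps, eta, max_iter=10_000):
    """Run Frank-Wolfe updates x <- (1 - eta) x + eta v until the constrained
    first-order condition grad^T (v - x) >= -eps holds at the LMO point v."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        g = grad_fn(x)
        n = np.linalg.norm(g)
        v = -R * g / n if n > 0 else x      # argmin over the ball {||v|| <= R}
        if g @ (v - x) >= -eps:             # x is an eps-FOSP: stop
            return x
        x = (1 - eta) * x + eta * v         # convex combination stays feasible
    return x
```

On the toy quadratic defined earlier, this could be called as frank_wolfe_stage(np.array([0.3, 0.05]), grad_f, R, eps=1e-3, eta=1e-3).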

Stage II: Second-order update (Escaping from saddle points)

▶ Find u_t, a ρ-approximate solution of the quadratic program

    Minimize    q(u) := (u − x_t)ᵀ∇²f(x_t)(u − x_t)
    subject to  u ∈ C,  ∇f(x_t)ᵀ(u − x_t) = 0

▶ q(u∗) ≤ q(u_t) ≤ ρ q(u∗) for some ρ ∈ (0, 1]
  ⇒ If q(u_t) < −ργ:  update x_{t+1} = (1 − σ)x_t + σ u_t
  ⇒ If q(u_t) ≥ −ργ:  then q(u∗) ≥ −γ, so x_t is an (ε, γ)-SOSP

▶ Some classes of convex constraints satisfy this property
  ⇒ Quadratic constraints under some conditions

Theoretical Results

Theorem. If we set the stepsizes to η = O(ε) and σ = O(ργ), the proposed algorithm finds an (ε, γ)-SOSP after at most O(max{ε⁻², ρ⁻³γ⁻³}) iterations.

▶ When can we solve the quadratic subproblem approximately?

Proposition. If C is defined by a quadratic constraint, then the algorithm finds an (ε, γ)-SOSP after O(max{τε⁻², d³γ⁻³}) arithmetic operations.

Proposition. If the convex set C is defined by a set of m quadratic constraints (m > 1), and the objective function's Hessian satisfies max_{x∈C} xᵀ∇²f(x)x ≤ O(γ), then the algorithm finds an (ε, γ)-SOSP after at most O(max{τε⁻², d³m⁷γ⁻³}) arithmetic operations.

Proposed Algorithm

▶ for t = 1, 2, . . .
  ▶ Compute v_t = argmin_{v∈C} ∇f(x_t)ᵀv
  ▶ if ∇f(x_t)ᵀ(v_t − x_t) < −ε
    ▶ x_{t+1} = (1 − η)x_t + η v_t
  ▶ else
    ▶ Find u_t: a ρ-approximate solution of the QP
    ▶ if q(u_t) < −ργ
      ▶ x_{t+1} = (1 − σ)x_t + σ u_t
    ▶ else
      ▶ return x_t and stop
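Putting the two stages together, the loop above might be sketched as follows. The linear minimization oracle lmo and the ρ-approximate QP solver qp_oracle are passed in as assumed black boxes (the poster only requires that such a solver exists, e.g., for quadratic constraints), so this is a template rather than the authors' implementation.

```python
import numpy as np

def sosp_search(x0, grad_fn, lmo, qp_oracle, eps, gamma, rho, eta, sigma,
                max_iter=100_000):
    """Template of the loop above.  `lmo(g)` should return argmin_{v in C} <g, v>;
    `qp_oracle(x)` should return (u, q(u)) with u a rho-approximate minimizer of
    q(u) = (u - x)^T Hess f(x) (u - x) over {u in C : grad f(x)^T (u - x) = 0}.
    Both oracles are assumed, not implemented here."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_fn(x)
        v = lmo(g)
        if g @ (v - x) < -eps:              # not yet an eps-FOSP
            x = (1 - eta) * x + eta * v     # Stage I: Frank-Wolfe step
        else:
            u, q_u = qp_oracle(x)           # Stage II: probe curvature
            if q_u < -rho * gamma:          # escapable saddle direction found
                x = (1 - sigma) * x + sigma * u
            else:                           # then q(u*) >= -gamma
                return x                    # x is an (eps, gamma)-SOSP
    return x
```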

Stochastic Setting

▶ What about the stochastic setting?

    min_{x∈C} f(x) = min_{x∈C} E_Θ[F(x, Θ)]

▶ Θ is a random variable with probability distribution P
▶ Replace ∇f(x_t) and ∇²f(x_t) by their stochastic approximations g_t and H_t

    g_t = (1/b_g) Σ_{i=1}^{b_g} ∇F(x_t, θ_i),      H_t = (1/b_H) Σ_{i=1}^{b_H} ∇²F(x_t, θ_i)

▶ Change some conditions to accommodate the approximation error
  ⇒ The constraint ∇f(x_t)ᵀ(x − x_t) = 0 is relaxed to g_tᵀ(x − x_t) ≤ r
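A small sketch of these mini-batch estimators; sample_grad, sample_hess, and sample_theta are assumed user-supplied oracles for ∇F(x, θ), ∇²F(x, θ), and draws θ ~ P, respectively.

```python
import numpy as np

def minibatch_grad_hess(x, sample_grad, sample_hess, sample_theta, b_g, b_H):
    """Form g_t and H_t by averaging b_g stochastic gradients and b_H stochastic
    Hessians drawn i.i.d. at the current iterate x."""
    g_t = np.mean([sample_grad(x, sample_theta()) for _ in range(b_g)], axis=0)
    H_t = np.mean([sample_hess(x, sample_theta()) for _ in range(b_H)], axis=0)
    return g_t, H_t
```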

Proposed Method for the Stochastic Setting

▶ for t = 1, 2, . . .
  ▶ Compute v_t = argmin_{v∈C} g_tᵀv
  ▶ if g_tᵀ(v_t − x_t) ≤ −ε/2
    ▶ x_{t+1} = (1 − η)x_t + η v_t
  ▶ else
    ▶ Find u_t: a ρ-approximate solution of

        min   q(u) := (u − x_t)ᵀH_t(u − x_t)
        s.t.  u ∈ C,  g_tᵀ(u − x_t) ≤ r

    ▶ if q(u_t) < −ργ/2
      ▶ x_{t+1} = (1 − σ)x_t + σ u_t
    ▶ else
      ▶ return x_t and stop
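For completeness, a hedged sketch of this stochastic loop: it mirrors the deterministic template, with g_t and H_t produced by the mini-batch estimator above, the halved thresholds ε/2 and ργ/2, and an assumed oracle stochastic_qp_oracle(x, g, H, r) for the QP with the relaxed constraint gᵀ(u − x) ≤ r.

```python
import numpy as np

def stochastic_sosp_search(x0, estimator, lmo, stochastic_qp_oracle,
                           eps, gamma, rho, eta, sigma, r, max_iter=100_000):
    """`estimator(x)` returns a mini-batch pair (g_t, H_t); `lmo(g)` returns
    argmin_{v in C} <g, v>; `stochastic_qp_oracle(x, g, H, r)` returns (u, q(u))
    for the QP with relaxed constraint g^T (u - x) <= r.  All are assumed oracles."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g_t, H_t = estimator(x)
        v = lmo(g_t)
        if g_t @ (v - x) <= -eps / 2:            # stochastic FOSP test fails
            x = (1 - eta) * x + eta * v          # first-order step
        else:
            u, q_u = stochastic_qp_oracle(x, g_t, H_t, r)
            if q_u < -rho * gamma / 2:           # escape step
                x = (1 - sigma) * x + sigma * u
            else:
                return x                         # (eps, gamma)-SOSP w.h.p.
    return x
```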

Theoretical Results for the Stochastic Setting

Theorem. If we set the stepsizes to η = O(ε) and σ = O(ργ), the batch sizes to b_g = O(max{ρ⁻⁴γ⁻⁴, ε⁻²}) and b_H = O(ρ⁻²γ⁻²), and choose r = O(ρ²γ²), then
  ⇒ The outcome of Algorithm 2 is an (ε, γ)-SOSP w.h.p.
  ⇒ The total number of iterations is at most O(max{ε⁻², ρ⁻³γ⁻³}) w.h.p.

Corollary. The algorithm finds an (ε, γ)-SOSP w.h.p. after computing
  ⇒ O(max{ε⁻²ρ⁻⁴γ⁻⁴, ε⁻⁴, ρ⁻⁷γ⁻⁷}) stochastic gradients
  ⇒ O(max{ε⁻²ρ⁻³γ⁻³, ρ⁻⁵γ⁻⁵}) stochastic Hessians

Conclusion

▶ Method for finding an SOSP in constrained settings
  ⇒ Using first-order information to reach an FOSP
  ⇒ Solving a QP up to a constant factor ρ < 1 to escape from saddles
▶ First finite-time complexity analysis for constrained problems
  ⇒ O(max{ε⁻², ρ⁻³γ⁻³}) iterations; O(max{τε⁻², d³m⁷γ⁻³}) arithmetic operations for quadratic constraints
  ⇒ Extended our results to the stochastic setting

The 32nd Annual Conference on Neural Information Processing Systems (NIPS) 2018, December 5th, 2018