
  • Escaping from saddle points on Riemannian manifolds

    Yue Sun†, Nicolas Flammarion‡, Maryam Fazel†

    † Department of Electrical and Computer Engineering, University of Washington, Seattle

    ‡ School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland

    November 8, 2019

  • Manifold constrained optimization

    We consider the problem

    $$\underset{x}{\text{minimize}} \; f(x), \quad \text{subject to } x \in \mathcal{M}$$

    As in Euclidean space, we generally cannot find the global optimum in polynomial time, so we aim for an approximate local minimum.

    [Figures: a saddle point in Euclidean space; contours of the function value on a sphere.]

  • Example of manifolds

    1. Sphere: $\{x \in \mathbb{R}^d : \sum_{i=1}^d x_i^2 = r^2\}$.
    2. Stiefel manifold: $\{X \in \mathbb{R}^{m \times n} : X^T X = I\}$.
    3. Grassmannian manifold: $\mathrm{Grass}(p, n)$ is the set of $p$-dimensional subspaces of $\mathbb{R}^n$.
    4. Burer-Monteiro relaxation: $\{X \in \mathbb{R}^{m \times n} : \mathrm{diag}(X^T X) = \mathbf{1}\}$.


  • Curve

    A curve is a continuous map $\gamma : t \to \mathcal{M}$; usually $t \in [0, 1]$, where $\gamma(0)$ and $\gamma(1)$ are the start and end points of the curve.


  • Tangent vector and tangent space

    We use

    $$\dot\gamma(t) = \lim_{\tau \to 0} \frac{\gamma(t + \tau) - \gamma(t)}{\tau}$$

    as the velocity of the curve; $\dot\gamma(t)$ is a tangent vector at $\gamma(t) \in \mathcal{M}$.

    A point $x \in \mathcal{M}$ can be the start point of many curves, and the tangent space $T_x\mathcal{M}$ is the set of tangent vectors at $x$.

    The tangent space is a metric space.


  • Gradient of a function

    Let $f : \mathcal{M} \to \mathbb{R}$ be a function defined on $\mathcal{M}$, and let $\gamma$ be a curve on $\mathcal{M}$. The directional derivative of $f$ in the direction $\dot\gamma(0)$ is¹

    $$\dot\gamma(0) f = \frac{d\, f(\gamma(t))}{dt} \bigg|_{t=0} = \lim_{\tau \to 0} \frac{f(\gamma(0 + \tau)) - f(\gamma(0))}{\tau}.$$

    Then we can define $\mathrm{grad}\, f(x) \in T_x\mathcal{M}$, which satisfies

    $$\langle \mathrm{grad}\, f, y \rangle = y f$$

    for all $y \in T_x\mathcal{M}$.

    ¹Usually $\dot\gamma$ denotes the differential operator and $\gamma'$ denotes the tangent vector; they are closely related. To avoid confusion, we always use $\dot\gamma$.
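    To make this concrete (an addition, not from the slides): for a manifold embedded in $\mathbb{R}^d$ with the inherited metric, such as the unit sphere, the Riemannian gradient is the projection of the Euclidean gradient onto the tangent space. A minimal numpy sketch, where the quadratic test function is an arbitrary choice:

```python
import numpy as np

def sphere_grad(x, egrad):
    """Riemannian gradient on the unit sphere: project the Euclidean
    gradient onto the tangent space T_x M = {v : <x, v> = 0}."""
    g = egrad(x)
    return g - np.dot(x, g) * x  # (I - x x^T) g

# Illustrative test function f(x) = x^T A x, with Euclidean gradient 2 A x.
A = np.diag([3.0, 1.0, -2.0])
x = np.ones(3) / np.sqrt(3.0)
g = sphere_grad(x, lambda x: 2 * A @ x)
assert abs(np.dot(x, g)) < 1e-12  # the gradient lies in the tangent space at x
```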


  • Vector field

    The gradient of a function is a special case of a vector field on a manifold.

    A vector field is a function mapping a point in $\mathcal{M}$ to a tangent vector at that point.


  • Connection

    Denote the set of smooth vector fields on $\mathcal{M}$ by $\mathfrak{X}(\mathcal{M})$. A connection defines a directional derivative

    $$\nabla : \mathfrak{X}(\mathcal{M}) \times \mathfrak{X}(\mathcal{M}) \to \mathfrak{X}(\mathcal{M})$$

    satisfying

    $$\nabla_{f x + g y} u = f \nabla_x u + g \nabla_y u, \qquad \nabla_x (a u + b v) = a \nabla_x u + b \nabla_x v, \qquad \nabla_x (f u) = (x f) u + f (\nabla_x u).$$

    Note that, in a local frame $\{e_i\}$,

    $$\nabla_{e_i} u = \sum_j \big( \partial_i u^j e_j + u^j \nabla_{e_i} e_j \big).$$

    A special connection is the Riemannian (Levi-Civita) connection.


  • Riemannian Hessian

    A directional Hessian is defined as

    $$H(x)[u] = \nabla_u \mathrm{grad}\, f(x)$$

    for $u \in T_x\mathcal{M}$.²

    Similarly to the gradient, we can define the Hessian from its directional version:

    $$\langle H(x) u, v \rangle = \langle \nabla_u \mathrm{grad}\, f(x), v \rangle, \quad \forall u, v \in T_x\mathcal{M}.$$

    It is a symmetric operator.

    ²In Riemannian geometry, one writes $u_x$ to indicate $u \in T_x\mathcal{M}$, and the directional Hessian is $\nabla_{u_x} \mathrm{grad}\, f$.
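    As an illustration beyond the slides: for the unit sphere with the inherited metric, the Riemannian Hessian-vector product has a standard closed form, namely the tangential projection of the Euclidean Hessian-vector product plus a correction term. A minimal sketch; the quadratic test function is an arbitrary choice:

```python
import numpy as np

def sphere_hess_vec(x, u, egrad, ehess_vec):
    """Riemannian Hessian-vector product on the unit sphere:
    H(x)[u] = P_x(ehess(x)[u]) - <x, egrad(x)> u, where P_x = I - x x^T."""
    v = ehess_vec(x, u)
    v = v - np.dot(x, v) * x            # project onto the tangent space
    return v - np.dot(x, egrad(x)) * u  # correction for the sphere's curvature

# Illustrative: f(x) = x^T A x, so egrad(x) = 2 A x and ehess(x)[u] = 2 A u.
A = np.diag([3.0, 1.0, -2.0])
x = np.array([0.0, 1.0, 0.0])  # an eigenvector of A, hence a stationary point
u = np.array([1.0, 0.0, 0.0])  # a tangent direction at x
print(sphere_hess_vec(x, u, lambda x: 2 * A @ x, lambda x, u: 2 * A @ u))
# -> [4. 0. 0.], i.e. eigenvalue 2 * (3 - 1) along the direction u
```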


  • Geodesic

    A geodesic is a special class of curve on the manifold, satisfying the zero-acceleration condition

    $$\nabla_{\dot\gamma(t)} \dot\gamma = 0.$$


  • Exponential map

    For any $x \in \mathcal{M}$, $y \in T_x\mathcal{M}$, and the geodesic $\gamma$ defined by $y$,

    $$\gamma(0) = x, \quad \dot\gamma(0) = y,$$

    we call the mapping $\mathrm{Exp}_x : T_x\mathcal{M} \to \mathcal{M}$ with $\mathrm{Exp}_x(y) = \gamma(1)$ the exponential map.

    There is a neighborhood of radius $I$ (the injectivity radius) in $T_x\mathcal{M}$ such that, for all $y \in T_x\mathcal{M}$ with $\|y\| \le I$, the exponential map is a bijection/diffeomorphism.
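    On the unit sphere, the exponential map has a closed form: follow the great circle in the direction of $y$. A minimal numpy sketch (an addition, not from the slides; the small-norm cutoff is just a numerical guard):

```python
import numpy as np

def sphere_exp(x, y):
    """Exponential map on the unit sphere: Exp_x(y) travels along the
    great circle leaving x with initial velocity y, for unit time."""
    n = np.linalg.norm(y)
    if n < 1e-12:  # Exp_x(0) = x
        return x
    return np.cos(n) * x + np.sin(n) * (y / n)
```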


  • Parallel transport

    The parallel transport $\Gamma$ transports a tangent vector $w$ along a curve $\gamma$, satisfying the zero-acceleration condition

    $$\nabla_{\dot\gamma(t)} w_t = 0, \qquad w_t = \Gamma_{\gamma(0)}^{\gamma(t)} w.$$
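    On the unit sphere, parallel transport along a geodesic also has a closed form: the component of $w$ along the direction of travel rotates in the plane of the great circle, while the orthogonal component is unchanged. A minimal sketch (an addition; it assumes $u$ is a unit tangent vector at $x$):

```python
import numpy as np

def sphere_transport(x, u, t, w):
    """Parallel transport on the unit sphere: carry the tangent vector w
    from x along the geodesic gamma(s) = cos(s) x + sin(s) u, up to time t."""
    a = np.dot(u, w)  # component of w along the direction of travel
    return w + a * ((np.cos(t) - 1.0) * u - np.sin(t) * x)
```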


  • Curvature tensor

    The curvature tensor describes how curved the manifold is. It relates to the second-order structure of the manifold.

    A definition via the connection is

    $$R(x, y) w = \nabla_x \nabla_y w - \nabla_y \nabla_x w - \nabla_{[x, y]} w,$$

    where $x, y, w$ are in the tangent space of the same point³.

    ³$[x, y]$ is the Lie bracket, defined by $[x, y] f = x y f - y x f$.

  • Curvature tensor

    Equivalently, the curvature tensor measures the failure of parallel transport around a small parallelogram with sides $tx$ and $\tau y$ to return a vector to itself:

    $$R(x, y) w = \lim_{t, \tau \to 0} \frac{\Gamma_{0, \tau y}^{0, 0}\, \Gamma_{tx, \tau y}^{0, \tau y}\, \Gamma_{tx, 0}^{tx, \tau y}\, \Gamma_{0, 0}^{tx, 0}\, w - w}{t \tau}.$$


  • Smooth function on Riemannian manifold

    We consider the manifold-constrained optimization problem

    $$\underset{x}{\text{minimize}} \; f(x), \quad \text{subject to } x \in \mathcal{M},$$

    assuming the function and manifold satisfy:

    1. There is a finite constant $\beta$ such that $\|\mathrm{grad}\, f(y) - \Gamma_x^y \mathrm{grad}\, f(x)\| \le \beta\, d(x, y)$ for all $x, y \in \mathcal{M}$.

    2. There is a finite constant $\rho$ such that $\|H(y) - \Gamma_x^y H(x) \Gamma_y^x\|_2 \le \rho\, d(x, y)$ for all $x, y \in \mathcal{M}$.

    3. There is a finite constant $K$ such that $|R(x)[u, v]| \le K$ for all $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M}$.

    $f$ may not be convex.


  • Taylor expansion of a smooth function

    For $x, y \in E$, a Euclidean space,

    $$f(y) - f(x) = \Big\langle y - x, \; \nabla f(x) + \tfrac{1}{2} \nabla^2 f(x)(y - x) + \int_0^1 \big( \nabla^2 f(x + \tau (y - x)) - \nabla^2 f(x) \big)(y - x)\, d\tau \Big\rangle.$$

    For $x, y \in \mathcal{M}$, let $\gamma$ denote the geodesic with $\gamma(0) = x$ and $\gamma(1) = y$; then

    $$f(y) - f(x) = \Big\langle \dot\gamma(0), \; \mathrm{grad}\, f(x) + \tfrac{1}{2} \nabla_{\dot\gamma(0)} \mathrm{grad}\, f + \Delta \Big\rangle,$$

    where the remainder is

    $$\Delta = \int_0^1 \big( \Gamma_{\gamma(\tau)}^{x} \nabla_{\dot\gamma(\tau)} \mathrm{grad}\, f - \nabla_{\dot\gamma(0)} \mathrm{grad}\, f \big)\, d\tau.$$


  • Riemannian gradient descent

    On a smooth manifold, there exists a step size $\eta$ such that, if

    $$x_{t+1} = \mathrm{Exp}_{x_t}(-\eta\, \mathrm{grad}\, f(x_t)),$$

    then

    $$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2} \|\mathrm{grad}\, f(x_t)\|^2.$$

    Hence the iterates converge to a first-order stationary point.
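    A minimal sketch of this iteration on the unit sphere (an addition for illustration; the test function, step size, and iteration count are arbitrary choices):

```python
import numpy as np

def sphere_exp(x, y):
    """Exponential map on the unit sphere (as in the earlier sketch)."""
    n = np.linalg.norm(y)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * (y / n)

def rgd_sphere(egrad, x0, eta=0.1, iters=200):
    """Riemannian gradient descent: project the Euclidean gradient onto
    the tangent space, then take a geodesic step via the exponential map."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        g = egrad(x)
        g = g - np.dot(x, g) * x     # Riemannian gradient
        x = sphere_exp(x, -eta * g)  # geodesic step
    return x

# Illustrative: minimizing f(x) = x^T A x on the sphere recovers an
# eigenvector for the smallest eigenvalue of A.
A = np.diag([3.0, 1.0, -2.0])
print(rgd_sphere(lambda x: 2 * A @ x, x0=np.array([1.0, 1.0, 1.0])))
```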


  • Proposed algorithm for escaping saddle

    We hope to escape saddle points and converge to an approximate local minimum; a minimal sketch follows the list.

    1. At iterate $x$, check the norm of the gradient.

    2. If it is large: take the step $x^+ = \mathrm{Exp}_x(-\eta\, \mathrm{grad}\, f(x))$ to decrease the function value.

    3. If it is small: we are near either a saddle point or a local minimum. Perturb the iterate by adding appropriate noise and run a few iterations.

       3.1 If $f$ decreases, the iterates escape the saddle point (and the algorithm continues).
       3.2 If $f$ does not decrease, we are at an approximate local minimum (the algorithm terminates).
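    The sketch below (an addition, not the authors' implementation) instantiates this scheme on the unit sphere; every threshold, the perturbation radius, and the escape-phase length are illustrative placeholders rather than the tuned constants from the paper.

```python
import numpy as np

def perturbed_rgd_sphere(f, egrad, x0, eta=0.1, g_thresh=1e-3, radius=1e-2,
                         escape_iters=50, f_drop=1e-6, max_iters=10000, seed=0):
    """Sketch of perturbed Riemannian GD on the unit sphere.
    All constants are illustrative placeholders, not the paper's values."""
    rng = np.random.default_rng(seed)

    def exp_map(x, y):
        n = np.linalg.norm(y)
        return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * (y / n)

    def rgrad(x):
        g = egrad(x)
        return g - np.dot(x, g) * x      # project onto the tangent space

    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iters):
        g = rgrad(x)
        if np.linalg.norm(g) > g_thresh:   # step 2: gradient large, RGD step
            x = exp_map(x, -eta * g)
            continue
        xi = rng.standard_normal(x.shape)  # step 3: gradient small, perturb
        xi -= np.dot(x, xi) * x            # noise lives in the tangent space
        y = exp_map(x, radius * xi / np.linalg.norm(xi))
        for _ in range(escape_iters):      # run a few RGD steps
            y = exp_map(y, -eta * rgrad(y))
        if f(x) - f(y) > f_drop:           # step 3.1: f decreased, escaped
            x = y
        else:                              # step 3.2: approximate local min
            return x
    return x
```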


  • Difficulty of second order analysis

    1. We use linearization in the first-order analysis; for the second-order analysis, the manifold has second-order structure of its own.

    2. Consider the power method in Euclidean space: we need to prove that the component of $x$ in the direction of the top eigenvector grows exponentially.

    3. When the iteration is over the manifold variable, we have to compare gradients that live in different tangent spaces.

    4. Some recent work requires strong assumptions, such as a flat manifold or a product manifold.

    5. Other recent work assumes smoothness parameters of the composition of the function with manifold operations, which are hard to check.


  • Useful lemmas

    Let $x \in \mathcal{M}$ and $y, a \in T_x\mathcal{M}$, and denote $z = \mathrm{Exp}_x(a)$. Then

    $$d\big( \mathrm{Exp}_x(y + a), \; \mathrm{Exp}_z(\Gamma_x^z y) \big) \le c(K) \min\{\|a\|, \|y\|\} (\|a\| + \|y\|)^2.$$


  • Useful lemmas

    Holonomy.

    $$\| \Gamma_z^x \Gamma_y^z \Gamma_x^y w - w \| \le c(K)\, d(x, y)\, d(y, z)\, \|w\|.$$

    Similar to the definition of the curvature tensor: if a vector is parallel transported around a closed curve, the change is bounded by the area enclosed by the curve.


  • Useful lemmas

    Euclidean: $f(x) = x^T H x \Rightarrow x^+ = (I - \eta H) x$.

    Exponential growth in a vector space.

    If the function $f$ is $\beta$-gradient Lipschitz and $\rho$-Hessian Lipschitz, the curvature constant is bounded by $K$, $x$ is an $(\epsilon, -\sqrt{\hat\rho \epsilon})$ saddle point, and we define $u^+ = \mathrm{Exp}_u(-\eta\, \mathrm{grad}\, f(u))$ and $w^+ = \mathrm{Exp}_w(-\eta\, \mathrm{grad}\, f(w))$, then in a small enough neighborhood⁴,

    $$\big\| \mathrm{Exp}_x^{-1}(w^+) - \mathrm{Exp}_x^{-1}(u^+) - (I - \eta H(x)) \big( \mathrm{Exp}_x^{-1}(w) - \mathrm{Exp}_x^{-1}(u) \big) \big\| \le C(K, \rho, \beta)\, d(u, w) \big( d(u, w) + d(u, x) + d(w, x) \big)$$

    for some explicit constant $C(K, \rho, \beta)$.

    ⁴Quantified in the paper.


  • Theorem

    Theorem (Jin et al., Euclidean space)

    Perturbed GD converges to an $(\epsilon, -\sqrt{\rho \epsilon})$-stationary point of $f$ in

    $$O\left( \frac{\beta (f(x_0) - f(x^*))}{\epsilon^2} \log^4 \left( \frac{\beta d (f(x_0) - f(x^*))}{\epsilon^2 \delta} \right) \right)$$

    iterations.

    We replace the Hessian Lipschitz constant $\rho$ by $\hat\rho$, a function of $\rho$ and $K$, which we quantify in the paper.

    Theorem (manifold)

    Perturbed RGD converges to an $(\epsilon, -\sqrt{\hat\rho(\rho, K)\, \epsilon})$-stationary point of $f$ in

    $$O\left( \frac{\beta (f(x_0) - f(x^*))}{\epsilon^2} \log^4 \left( \frac{\beta d (f(x_0) - f(x^*))}{\epsilon^2 \delta} \right) \right)$$

    iterations.


  • Experiment

    Burer-Monteiro factorization.

    Let $A \in \mathbb{S}^{d \times d}$. The problem

    $$\max_{X \in \mathbb{S}^{d \times d}} \mathrm{trace}(A X), \quad \text{s.t. } \mathrm{diag}(X) = \mathbf{1}, \; X \succeq 0, \; \mathrm{rank}(X) \le r,$$

    can be factorized as

    $$\max_{Y \in \mathbb{R}^{d \times p}} \mathrm{trace}(A Y Y^T), \quad \text{s.t. } \mathrm{diag}(Y Y^T) = \mathbf{1},$$

    when $r(r + 1)/2 \le d$ and $p(p + 1)/2 \ge d$.
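    A minimal numpy sketch of Riemannian gradient ascent for the factorized problem (an addition, not the authors' code): each row of $Y$ is constrained to the unit sphere, and for simplicity the sketch retracts by row normalization instead of the exact exponential map; the step size and iteration count are arbitrary.

```python
import numpy as np

def burer_monteiro(A, p, eta=0.01, iters=200, seed=0):
    """Riemannian gradient ascent for max trace(A Y Y^T) subject to
    diag(Y Y^T) = 1 (each row of Y lies on the unit sphere).
    Uses row normalization as a retraction; constants are illustrative."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    Y = rng.standard_normal((d, p))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)      # start on the manifold
    for _ in range(iters):
        G = 2 * A @ Y                                  # Euclidean gradient
        G -= np.sum(G * Y, axis=1, keepdims=True) * Y  # project rows onto tangent spaces
        Y += eta * G                                   # ascent step
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # retract back to the manifold
    return Y

# Illustrative run on a random symmetric matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 20)); A = (A + A.T) / 2
Y = burer_monteiro(A, p=7)
print(np.trace(A @ Y @ Y.T))  # objective value after optimization
```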

    [Figure: iteration versus function value (x-axis: iterations, 0-45; y-axis: function value, 0-6).]
