Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

TRANSCRIPT

Page 1: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Riemannian stochastic variance reduced gradient on Grassmann manifold

Hiroyuki Kasai†, Hiroyuki Sato§, and Bamdev Mishra††

†The University of Electro-Communications, Japan

§Tokyo University of Science, Japan

††Amazon Development Centre India, India

August 10, 2016


Page 2: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Summary (our contributions)

- Address the stochastic gradient descent (SGD) algorithm for the empirical risk minimization problem

  min_{w ∈ R^d} Σ_{i=1}^n f_i(w).

- In particular, structured problems on manifolds, i.e., w ∈ M.

- Propose Riemannian SVRG (R-SVRG).
  - Extend SVRG from the Euclidean space to Riemannian manifolds.

- Give two analyses:
  - a global convergence analysis, and
  - a local convergence rate analysis.

- Show the effectiveness of R-SVRG through numerical comparisons.


Page 3: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic gradient descent (SGD) (1)

Update in SGD

  w_k = w_{k-1} − α_k ∇f_{i_k}(w_{k-1}),

where w_{k-1} is the current point, i_k is a randomly drawn sample, and ∇f_{i_k}(w_{k-1}) is the gradient for the single sample i_k (the stochastic gradient).

- The stochastic gradient is an unbiased estimate of the full gradient:

  E[∇f_i(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇f(w).
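To make the update concrete, here is a minimal SGD loop for hypothetical least-squares losses f_i(w) = (1/2)(a_i^T w − b_i)^2 with a decaying step-size; the data, losses, and schedule are illustrative only and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))                  # one feature vector a_i per row
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for k in range(1, 10 * n + 1):
    i = rng.integers(n)                          # draw a random sample i_k
    grad_i = (A[i] @ w - b[i]) * A[i]            # stochastic gradient of f_i(w) = 0.5 * (a_i^T w - b_i)^2
    w -= (0.1 / k) * grad_i                      # decaying step-size alpha_k = 0.1 / k
```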


Page 4: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic gradient descent (SGD) (2)

Features compared with full gradient descent (FGD)

- Pros: high scalability to large-scale data.
  - The per-iteration complexity is independent of n.
  - FGD has per-iteration complexity linear in n.
- Cons: slow convergence.
  - Decaying step-sizes are required for convergence, to avoid
    - large fluctuations around a solution caused by a large step-size, and
    - overly slow convergence caused by a too-small step-size.
- Consequently, SGD attains only a sub-linear convergence rate: E[f(w_k)] − f(w*) ∈ O(k^{-1}).
  - FGD attains f(w_k) − f(w*) ∈ O(c^k) for some 0 < c < 1.


Page 5: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Speeding up SGD: variance reduction techniques

- Accelerate the convergence rate of SGD

  - [Mairal, 2015, Roux et al., 2012, Shalev-Shwartz and Zhang, 2012, Shalev-Shwartz and Zhang, 2013, Defazio et al., 2014, Zhang and Xiao, 2014].

- Stochastic variance reduced gradient (SVRG) [Johnson and Zhang, 2013]
  - Linear convergence rate for strongly convex functions.
- Various variants:
  - [Garber and Hazan, 2015] analyze the convergence rate of SVRG when f is a convex function that is a sum of non-convex (but smooth) terms.
  - [Shalev-Shwartz, 2015] proposes similar results.
  - [Allen-Zhu and Yan, 2015] further study the same case with better convergence rates.
  - [Shamir, 2015] specifically studies the convergence properties of the variance-reduced PCA algorithm.
  - Very recently, [Allen-Zhu and Hazan, 2016] proposed a variance reduction method for faster non-convex optimization.


Page 6: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic variance reduced gradient (SVRG) (1) [Johnson and Zhang, 2013]

- Motivations:
  - Reduce the variance of the stochastic gradients.
  - Avoid storing all gradients, unlike SAG.
  - But allow additional gradient computations.
- Basic idea: a hybrid of SGD and FGD.
  - Periodically compute and store a full gradient.
  - At every iteration, adjust the stochastic gradient v by the latest full gradient to reduce its variance.
- Result: a linear convergence rate

  E[f(w^s)] − E[f(w*)] ≤ α^s (E[f(w^0)] − E[f(w*)]).


Page 7: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic variance reduced gradient (SVRG) (2)

Simplified algorithm of SVRG

1: Initialize w_0^0 ∈ R^d.
2: for s = 1, 2, ... (outer loop) do
3:   Store the snapshot w̄ = w^{s-1}_{m_{s-1}} (the last iterate of the previous outer loop).
4:   Compute and store the full gradient ∇f(w̄).
5:   for t = 1, 2, ..., m_s (inner loop) do
6:     Compute the modified stochastic gradient

         v^s_t = ∇f_{i^s_t}(w^s_{t-1}) − ∇f_{i^s_t}(w̄) + ∇f(w̄),

       where ∇f_{i^s_t}(w^s_{t-1}) and ∇f_{i^s_t}(w̄) are single-sample gradients at w^s_{t-1} and w̄, and ∇f(w̄) is the full gradient.
7:     Update w^s_t = w^s_{t-1} − α v^s_t.
8:   end for
9: end for
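As a sketch of how this loop looks in code, the following is a minimal Euclidean SVRG implementation for the same kind of hypothetical least-squares losses f_i(w) = (1/2)(a_i^T w − b_i)^2; the fixed step-size, inner-loop length, and data are illustrative and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(w, i):                                # gradient of f_i(w) = 0.5 * (a_i^T w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):                                # gradient of f(w) = (1/n) * sum_i f_i(w)
    return A.T @ (A @ w - b) / n

alpha, m = 0.01, 2 * n                           # fixed step-size and inner-loop length (illustrative)
w = np.zeros(d)
for s in range(20):                              # outer loop
    w_bar, g_bar = w.copy(), full_grad(w)        # store the snapshot point and its full gradient
    for t in range(m):                           # inner loop
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_bar, i) + g_bar   # modified (variance-reduced) stochastic gradient
        w -= alpha * v
```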


Page 8: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic variance reduced gradient (SVRG) (3) [Johnson and Zhang, 2013]


Page 9: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Structured problems

Examples

- PCA problem: compute the projection matrix U that minimizes

  min_{U ∈ St(r,d)} (1/n) Σ_{i=1}^n ‖x_i − U U^T x_i‖_2^2,

  - U belongs to the Stiefel manifold St(r, d):
    - the set of d × r matrices with orthonormal columns, i.e., U^T U = I.
- The cost function remains unchanged under the orthogonal group action U ↦ UO for O ∈ O(r).
- Consequently, U is best regarded as an element of the Grassmann manifold Gr(r, d):
  - the set of r-dimensional linear subspaces of R^d, each represented by a d × r matrix with orthonormal columns (U^T U = I).

- Other examples (not exhaustive): matrix completion, subspace tracking, spectral clustering, CCA, bi-factor regression, ...
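As a concrete sketch of this cost, the NumPy function below evaluates it together with one standard choice of its Riemannian gradient on Gr(r, d), obtained by projecting the Euclidean gradient onto the horizontal space with (I − UU^T); the function name and the projection step are illustrative details not spelled out on the slide.

```python
import numpy as np

def pca_cost_and_rgrad(U, X):
    """PCA cost (1/n) sum_i ||x_i - U U^T x_i||^2 and its Riemannian gradient on Gr(r, d).

    U: (d, r) matrix with orthonormal columns; X: (n, d) data matrix, one sample per row.
    """
    n = X.shape[0]
    XU = X @ U
    cost = np.sum((X - XU @ U.T) ** 2) / n
    egrad = -2.0 * (X.T @ XU) / n                # Euclidean gradient of the cost w.r.t. U
    rgrad = egrad - U @ (U.T @ egrad)            # project onto the horizontal space: (I - U U^T) egrad
    return cost, rgrad
```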

Page 10: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Optimization on Riemannian manifolds [Absil et al., 2008]

- If the constraints define a manifold, the constrained problem can be viewed as an unconstrained problem on that manifold:

  min_{w ∈ R^n} f(w),  s.t.  c_i(w) = 0,  c_j(w) ≤ 0

  becomes

  min_{w ∈ M} f(w),   M: Riemannian manifold.


Page 11: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Riemannian SGD (R-SGD) (1) [Bonnabel, 2013]

- Extension of Euclidean SGD to Riemannian manifolds.

Update in R-SGD

  w_k = Exp_{w_{k-1}}(−α_k grad f_{i_k}(w_{k-1})),

i.e., move along a geodesic (via the exponential mapping) in the direction of the Riemannian stochastic gradient.

1. Compute the Riemannian stochastic gradient grad f_{i_k}(w_{k-1}) for the sample i_k at w_{k-1}.
2. Move along the geodesic from w_{k-1} in the direction −α_k grad f_{i_k}(w_{k-1}).

- A geodesic is the generalization of a straight line in Euclidean space.
- The exponential mapping Exp_w(·) specifies the geodesic.
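For concreteness, one R-SGD step for the PCA cost could look like the sketch below; it reuses the hypothetical helpers pca_cost_and_rgrad (earlier PCA sketch) and grass_exp (Grassmann-tools sketch a few slides below), neither of which is a name from the authors' code.

```python
def rsgd_step(U, x_i, alpha_k):
    """One R-SGD step on Gr(r, d) for the PCA cost (illustrative sketch)."""
    _, g = pca_cost_and_rgrad(U, x_i.reshape(1, -1))   # Riemannian stochastic gradient grad f_i(U)
    return grass_exp(U, -alpha_k * g)                  # geodesic step: Exp_U(-alpha_k * grad f_i(U))
```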


Page 12: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Riemannian SGD (R-SGD) (2) [Bonnabel, 2013]


Page 13: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proposal: Riemannian SVRG (R-SVRG) [Kasai et al., 2016]

- Propose a novel extension of SVRG from the Euclidean space to a Riemannian manifold search space.
  - The extension is not trivial.
- Focus on the Grassmann manifold Gr(r, d).
  - The method can be generalized to other compact Riemannian manifolds.

- Notation (SVRG vs. R-SVRG):
  - Model parameter: w^s_{t-1} ∈ R^d  vs.  U^s_{t-1} ∈ Gr(r, d)
  - Edge point of the outer loop: w̄ ∈ R^d  vs.  Ū ∈ Gr(r, d)
  - Stochastic gradient: ∇f_{i^s_t}(w^s_{t-1}) ∈ R^d  vs.  grad f_{i^s_t}(U^s_{t-1}) ∈ T_{U^s_{t-1}} Gr(r, d)
  - Modified stochastic gradient: v^s_t ∈ R^d  vs.  ξ^s_t ∈ T_{U^s_{t-1}} Gr(r, d)


Page 14: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proposal: Riemannian SVRG (R-SVRG)

Algorithm

- A straightforward modification of the stochastic gradient would simply mimic the SVRG case v^s_t = ∇f_{i^s_t}(w^s_{t-1}) − ∇f_{i^s_t}(w̄) + ∇f(w̄):

  ξ^s_t = grad f_{i^s_t}(U^s_{t-1}) − grad f_{i^s_t}(Ū) + grad f(Ū).

  - This is meaningless, because manifolds are not vector spaces: the gradients at Ū and at U^s_{t-1} live in different tangent spaces.
- Proposed modification:
  - Transport the vectors at Ū into the current tangent space at U^s_{t-1} by parallel translation along the geodesic γ connecting them, and then add them:

  ξ^s_t = grad f_{i^s_t}(U^s_{t-1}) + P^γ_{U^s_{t-1} ← Ū}(−grad f_{i^s_t}(Ū) + grad f(Ū)),

  where P^γ_{U^s_{t-1} ← Ū} is the parallel-translation operator along the geodesic γ.
  - The logarithm mapping gives the tangent vector that defines the geodesic γ.
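A minimal sketch of one outer loop of R-SVRG for the PCA cost is given below. It assumes the hypothetical helpers pca_cost_and_rgrad (earlier PCA sketch) and grass_exp / grass_ptransp / grass_log (sketched after the Grassmann-tools slide below). Re-representing the current point at the geodesic endpoint before adding tangent vectors is an implementation choice made here so that all vectors live in the same tangent space; the authors' Matlab code is the reference implementation.

```python
def rsvrg_epoch(U, X, alpha, m, rng):
    """One outer loop of R-SVRG on Gr(r, d) for the PCA cost (illustrative sketch)."""
    n = X.shape[0]
    U_bar = U.copy()                                      # snapshot ("edge") point of the outer loop
    _, g_bar = pca_cost_and_rgrad(U_bar, X)               # full Riemannian gradient grad f(U_bar)
    for _ in range(m):                                    # inner loop
        i = rng.integers(n)
        eta = grass_log(U_bar, U)                         # tangent vector of the geodesic U_bar -> span(U)
        U = grass_exp(U_bar, eta)                         # representative of span(U) at the geodesic endpoint
        _, gi_U = pca_cost_and_rgrad(U, X[i:i + 1])       # grad f_i at the current point
        _, gi_bar = pca_cost_and_rgrad(U_bar, X[i:i + 1]) # grad f_i at the snapshot
        corr = grass_ptransp(U_bar, eta, g_bar - gi_bar)  # parallel-translate the correction into T_U
        xi = gi_U + corr                                  # variance-reduced Riemannian gradient
        U = grass_exp(U, -alpha * xi)                     # exponential-map update along the geodesic
    return U
```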


Page 15: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proposal: Riemannian SVRG (R-SVRG)

Conceptual illustration


Page 16: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Tools on the Grassmann manifold

- Exponential mapping in the direction of ξ ∈ T_{U(0)} Gr(r, d):

  U(t) = [U(0)V  W] [cos tΣ ; sin tΣ] V^T = U(0) V cos(tΣ) V^T + W sin(tΣ) V^T,

  - where ξ = WΣV^T is the rank-r singular value decomposition of ξ, and
  - cos(·) and sin(·) act only on the diagonal entries.
- Parallel translation of ζ ∈ T_{U(0)} Gr(r, d) along the geodesic γ(t) with direction ξ:

  ζ(t) = ( [U(0)V  W] [−sin tΣ ; cos tΣ] W^T + (I − WW^T) ) ζ
       = ( −U(0) V sin(tΣ) W^T + W cos(tΣ) W^T + I − WW^T ) ζ.

- Logarithm mapping of U(t) at U(0):

  ξ = Log_{U(0)}(U(t)) = W arctan(Σ) V^T,

  - where WΣV^T is the rank-r singular value decomposition of (U(t) − U(0)U(0)^T U(t))(U(0)^T U(t))^{-1}.
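A minimal NumPy sketch of these three operations is given below; the function names and the thin-SVD-based details are illustrative (the authors' Matlab code, linked at the end of the deck, is the reference implementation).

```python
import numpy as np

def grass_exp(U, xi, t=1.0):
    """Exponential map on Gr(r, d): follow the geodesic from span(U) with direction xi for time t."""
    W, s, Vt = np.linalg.svd(xi, full_matrices=False)     # thin SVD: xi = W diag(s) Vt
    return U @ Vt.T @ np.diag(np.cos(t * s)) @ Vt + W @ np.diag(np.sin(t * s)) @ Vt

def grass_ptransp(U, xi, zeta, t=1.0):
    """Parallel translation of zeta in T_U Gr(r, d) along the geodesic gamma(t) with direction xi."""
    W, s, Vt = np.linalg.svd(xi, full_matrices=False)
    M = (-U @ Vt.T @ np.diag(np.sin(t * s)) + W @ np.diag(np.cos(t * s))) @ W.T \
        + np.eye(U.shape[0]) - W @ W.T
    return M @ zeta

def grass_log(U0, U1):
    """Logarithm map: tangent vector at U0 pointing to span(U1) along the connecting geodesic."""
    M = U0.T @ U1
    A = np.linalg.solve(M.T, (U1 - U0 @ M).T).T           # (U1 - U0 U0^T U1)(U0^T U1)^{-1}
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    return W @ np.diag(np.arctan(s)) @ Vt
```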


Page 17: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Main results: convergence analyses

- Global convergence analysis with decaying step-sizes.
  - Guarantees that the iterates converge globally to a critical point from any initialization.
- Local convergence rate analysis under a fixed step-size.
  - Considers the rate in a neighborhood of a local minimum.
  - Assumes that Lipschitz smoothness and a lower bound on the Hessian hold only in this neighborhood.
  - Yields a local linear convergence rate:

  E[(dist(U^s, U^*))^2] ≤ (4(1 + 8mα^2β^2) / (αm(σ − 14ηβ^2))) E[(dist(U^{s-1}, U^*))^2].


Page 18: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proof sketch for local convergence rate

1. Assuming σ is the smallest eigenvalue of the Hessian of f in the neighborhood U, obtain

   f(z) ≥ f(w) + ⟨Exp_w^{-1}(z), grad f(w)⟩_w + (σ/2) ‖Exp_w^{-1}(z)‖_w^2,   w, z ∈ U.   (1)

2. Bound the variance of ξ^s_t using β-Lipschitz continuity:

   E_{i^s_t}[‖ξ^s_t‖^2] ≤ β^2 (14 (dist(w^s_{t-1}, w^*))^2 + 8 (dist(w^{s-1}, w^*))^2).   (2)

3. Bound the expected decrease of the distance to the solution in the inner iteration, using the lemma for a geodesic triangle in an Alexandrov space:

   E_{i^s_t}[(dist(U^s_t, U^*))^2 − (dist(U^s_{t-1}, U^*))^2]
     ≤ E_{i^s_t}[(dist(U^s_{t-1}, U^s_t))^2 + 2η ⟨grad f(U^s_{t-1}), Exp^{-1}_{U^s_{t-1}}(U^*)⟩_{U^s_{t-1}}].   (3)

4. Plugging (1) and (2) into (3) and summing over the inner loop finally yields the decrease of the distance to the solution over one outer iteration.


Page 19: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Numerical comparisons

Experimental conditions

- Compare R-SVRG with
  1. R-SGD
  2. R-SD (steepest descent) with backtracking line search
- Step-size strategies:
  1. fixed step-size
  2. decaying step-sizes
  3. hybrid step-sizes
    - Use decaying step-sizes for the first s_TH (= 5) epochs, then switch to a fixed step-size.
- PCA problem:
  - n = 10000, d = 20, and r = 5.
- Evaluation metrics:
  - Optimality gap: distance to the minimum loss, obtained with Matlab's pca (a NumPy equivalent is sketched below).
  - Norm of the gradient.
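The optimality-gap baseline can be computed without Matlab: for the uncentered PCA cost on the earlier slide, the minimum loss equals the sum of the d − r smallest eigenvalues of the sample second-moment matrix. The sketch below uses synthetic data and illustrative names, not the experiment's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 10000, 20, 5                           # problem sizes from the slide
X = rng.standard_normal((n, d)) * np.linspace(1.0, 3.0, d)   # synthetic data (illustrative)

C = X.T @ X / n                                  # sample second-moment matrix
f_min = np.linalg.eigvalsh(C)[:d - r].sum()      # minimum of (1/n) sum_i ||x_i - U U^T x_i||^2

def optimality_gap(U):
    """Gap between the PCA loss at U (d x r, orthonormal columns) and the optimum."""
    cost = np.sum((X - (X @ U) @ U.T) ** 2) / n
    return cost - f_min
```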


Page 20: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Numerical comparisons

Results for the PCA problem

[Figure: (a) optimality gap (train loss − optimum) and (b) norm of gradient, both plotted against #grad/N, comparing R-SD, R-SGD (decay, η = 0.009, λ = 0.1), R-SVRG (fixed η = 0.001; decay η = 0.001, λ = 0.001; hybrid η = 0.004, λ = 0.01), and R-SVRG+ (fixed η = 0.001; decay η = 0.002, λ = 0.01; hybrid η = 0.002, λ = 0.01).]


Page 21: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Conclusions and more information

- Conclusions
  - Propose Riemannian SVRG (R-SVRG).
  - R-SVRG achieves a local linear convergence rate.
  - Numerical comparisons show the effectiveness of the algorithm.
- More information
  - Full paper: H. Kasai, H. Sato, and B. Mishra, "Riemannian stochastic variance reduced gradient on Grassmann manifold," arXiv:1605.07367, May 2016 [Kasai et al., 2016].
  - Matlab code: https://bamdevmishra.com/codes/rsvrg/

Thank you for your attention.

Page 22: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

References I

- Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press.
- Allen-Zhu, Z. and Hazan, E. (2016). Variance reduction for faster non-convex optimization. Technical report, arXiv preprint arXiv:1603.05643.
- Allen-Zhu, Z. and Yan, Y. (2015). Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. Technical report, arXiv preprint arXiv:1506.01972.
- Bonnabel, S. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229.
- Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS.
- Garber, D. and Hazan, E. (2015). Fast and simple PCA via convex optimization. Technical report, arXiv preprint arXiv:1509.05647.


Page 23: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

References II

- Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323.
- Kasai, H., Sato, H., and Mishra, B. (2016). Riemannian stochastic variance reduced gradient on Grassmann manifold. arXiv preprint arXiv:1605.07367.
- Mairal, J. (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim., 25(2):829–855.
- Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671.
- Shalev-Shwartz, S. (2015). SDCA without duality. Technical report, arXiv preprint arXiv:1502.06177.
- Shalev-Shwartz, S. and Zhang, T. (2012). Proximal stochastic dual coordinate ascent. Technical report, arXiv preprint arXiv:1211.2717.


Page 24: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

References III

- Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599.
- Shamir, O. (2015). Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. Technical report, arXiv preprint arXiv:1507.08788.
- Zhang, Y. and Xiao, L. (2014). Stochastic primal-dual coordinate method for regularized empirical risk minimization. SIAM J. Optim., 24(4):2057–2075.
