Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

TRANSCRIPT

Page 1: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Riemannian stochastic variance reduced gradient on Grassmann manifold

Hiroyuki Kasai†, Hiroyuki Sato§, and Bamdev Mishra††

†The University of Electro-Communications, Japan

§Tokyo University of Science, Japan

††Amazon Development Centre India, India

August 10, 2016


Page 2: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Summary (our contributions)

- Address the stochastic gradient descent (SGD) algorithm for the empirical risk minimization problem

  min_{w ∈ R^d} Σ_{i=1}^n f_i(w).

- In particular, structured problems on manifolds, i.e., w ∈ M.

- Propose Riemannian SVRG (R-SVRG).
  - Extend SVRG from the Euclidean space to Riemannian manifolds.

- Give two analyses:
  - a global convergence analysis, and
  - a local convergence rate analysis.

- Show the effectiveness of R-SVRG through numerical comparisons.


Page 3: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic gradient descent (SGD) (1)

Update in SGD

  w_k = w_{k-1} − α_k ∇f_{i_k}(w_{k-1}),

where w_{k-1} is the current point, i_k is a randomly drawn sample, and ∇f_{i_k}(w_{k-1}) is the gradient for the single sample i_k (the stochastic gradient).

- The stochastic gradient is an unbiased estimate of the full gradient:

  E[∇f_i(w)] = (1/n) Σ_{i=1}^n ∇f_i(w) = ∇f(w).
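To make the update concrete, here is a minimal SGD loop for hypothetical least-squares losses f_i(w) = (1/2)(a_i^T w − b_i)^2 with a decaying step-size; the data, losses, and schedule are illustrative only and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))                  # one feature vector a_i per row
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for k in range(1, 10 * n + 1):
    i = rng.integers(n)                          # draw a random sample i_k
    grad_i = (A[i] @ w - b[i]) * A[i]            # stochastic gradient of f_i(w) = 0.5 * (a_i^T w - b_i)^2
    w -= (0.1 / k) * grad_i                      # decaying step-size alpha_k = 0.1 / k
```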


Page 4: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic gradient descent (SGD) (2)

Features compared with full gradient descent (FGD)

- Pros: high scalability to large-scale data.
  - The per-iteration complexity is independent of n.
  - FGD has per-iteration complexity linear in n.
- Cons: slow convergence.
  - Decaying step-sizes are required for convergence, to avoid
    - large fluctuations around a solution caused by a large step-size, and
    - overly slow convergence caused by a too-small step-size.
- Consequently, SGD attains only a sub-linear convergence rate: E[f(w_k)] − f(w*) ∈ O(k^{-1}).
  - FGD attains f(w_k) − f(w*) ∈ O(c^k) for some 0 < c < 1.


Page 5: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Speeding up SGD: variance reduction techniques

- Accelerate the convergence rate of SGD

  - [Mairal, 2015, Roux et al., 2012, Shalev-Shwartz and Zhang, 2012, Shalev-Shwartz and Zhang, 2013, Defazio et al., 2014, Zhang and Xiao, 2014].

- Stochastic variance reduced gradient (SVRG) [Johnson and Zhang, 2013]
  - Linear convergence rate for strongly convex functions.
- Various variants:
  - [Garber and Hazan, 2015] analyze the convergence rate of SVRG when f is a convex function that is a sum of non-convex (but smooth) terms.
  - [Shalev-Shwartz, 2015] proposes similar results.
  - [Allen-Zhu and Yan, 2015] further study the same case with better convergence rates.
  - [Shamir, 2015] specifically studies the convergence properties of the variance-reduced PCA algorithm.
  - Very recently, [Allen-Zhu and Hazan, 2016] proposed a variance reduction method for faster non-convex optimization.


Page 6: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic variance reduced gradient (SVRG) (1) [Johnson and Zhang, 2013]

- Motivations:
  - Reduce the variance of the stochastic gradients.
  - Avoid storing all gradients, unlike SAG.
  - But allow additional gradient computations.
- Basic idea: a hybrid of SGD and FGD.
  - Periodically compute and store a full gradient.
  - At every iteration, adjust the stochastic gradient v by the latest full gradient to reduce its variance.
- Result: a linear convergence rate

  E[f(w^s)] − E[f(w*)] ≤ α^s (E[f(w^0)] − E[f(w*)]).


Page 7: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic variance reduced gradient (SVRG) (2)

Simplified algorithm of SVRG

1: Initialize w_0^0 ∈ R^d.
2: for s = 1, 2, ... (outer loop) do
3:   Store the snapshot w̄ = w^{s-1}_{m_{s-1}} (the last iterate of the previous outer loop).
4:   Compute and store the full gradient ∇f(w̄).
5:   for t = 1, 2, ..., m_s (inner loop) do
6:     Compute the modified stochastic gradient

         v^s_t = ∇f_{i^s_t}(w^s_{t-1}) − ∇f_{i^s_t}(w̄) + ∇f(w̄),

       where ∇f_{i^s_t}(w^s_{t-1}) and ∇f_{i^s_t}(w̄) are single-sample gradients at w^s_{t-1} and w̄, and ∇f(w̄) is the full gradient.
7:     Update w^s_t = w^s_{t-1} − α v^s_t.
8:   end for
9: end for
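As a sketch of how this loop looks in code, the following is a minimal Euclidean SVRG implementation for the same kind of hypothetical least-squares losses f_i(w) = (1/2)(a_i^T w − b_i)^2; the fixed step-size, inner-loop length, and data are illustrative and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(w, i):                                # gradient of f_i(w) = 0.5 * (a_i^T w - b_i)^2
    return (A[i] @ w - b[i]) * A[i]

def full_grad(w):                                # gradient of f(w) = (1/n) * sum_i f_i(w)
    return A.T @ (A @ w - b) / n

alpha, m = 0.01, 2 * n                           # fixed step-size and inner-loop length (illustrative)
w = np.zeros(d)
for s in range(20):                              # outer loop
    w_bar, g_bar = w.copy(), full_grad(w)        # store the snapshot point and its full gradient
    for t in range(m):                           # inner loop
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_bar, i) + g_bar   # modified (variance-reduced) stochastic gradient
        w -= alpha * v
```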


Page 8: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Stochastic variance reduced gradient (SVRG) (3) [Johnson and Zhang, 2013]


Page 9: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Structured problems

Examples

- PCA problem: compute the projection matrix U that minimizes

  min_{U ∈ St(r,d)} (1/n) Σ_{i=1}^n ‖x_i − U U^T x_i‖_2^2,

  - U belongs to the Stiefel manifold St(r, d):
    - the set of d × r matrices with orthonormal columns, i.e., U^T U = I.
- The cost function remains unchanged under the orthogonal group action U ↦ UO for O ∈ O(r).
- Consequently, U is best regarded as an element of the Grassmann manifold Gr(r, d):
  - the set of r-dimensional linear subspaces of R^d, each represented by a d × r matrix with orthonormal columns (U^T U = I).

- Other examples (not exhaustive): matrix completion, subspace tracking, spectral clustering, CCA, bi-factor regression, ...
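As a concrete sketch of this cost, the NumPy function below evaluates it together with one standard choice of its Riemannian gradient on Gr(r, d), obtained by projecting the Euclidean gradient onto the horizontal space with (I − UU^T); the function name and the projection step are illustrative details not spelled out on the slide.

```python
import numpy as np

def pca_cost_and_rgrad(U, X):
    """PCA cost (1/n) sum_i ||x_i - U U^T x_i||^2 and its Riemannian gradient on Gr(r, d).

    U: (d, r) matrix with orthonormal columns; X: (n, d) data matrix, one sample per row.
    """
    n = X.shape[0]
    XU = X @ U
    cost = np.sum((X - XU @ U.T) ** 2) / n
    egrad = -2.0 * (X.T @ XU) / n                # Euclidean gradient of the cost w.r.t. U
    rgrad = egrad - U @ (U.T @ egrad)            # project onto the horizontal space: (I - U U^T) egrad
    return cost, rgrad
```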

Page 10: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Optimization on Riemannian manifolds [Absil et al., 2008]

- If the constraints define a manifold, the constrained problem can be viewed as an unconstrained problem on that manifold:

  min_{w ∈ R^n} f(w),  s.t.  c_i(w) = 0,  c_j(w) ≤ 0

  becomes

  min_{w ∈ M} f(w),   M: Riemannian manifold.


Page 11: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Riemannian SGD (R-SGD) (1) [Bonnabel, 2013]

- Extension of Euclidean SGD to Riemannian manifolds.

Update in R-SGD

  w_k = Exp_{w_{k-1}}(−α_k grad f_{i_k}(w_{k-1})),

i.e., move along a geodesic (via the exponential mapping) in the direction of the Riemannian stochastic gradient.

1. Compute the Riemannian stochastic gradient grad f_{i_k}(w_{k-1}) for the sample i_k at w_{k-1}.
2. Move along the geodesic from w_{k-1} in the direction −α_k grad f_{i_k}(w_{k-1}).

- A geodesic is the generalization of a straight line in Euclidean space.
- The exponential mapping Exp_w(·) specifies the geodesic.
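For concreteness, one R-SGD step for the PCA cost could look like the sketch below; it reuses the hypothetical helpers pca_cost_and_rgrad (earlier PCA sketch) and grass_exp (Grassmann-tools sketch a few slides below), neither of which is a name from the authors' code.

```python
def rsgd_step(U, x_i, alpha_k):
    """One R-SGD step on Gr(r, d) for the PCA cost (illustrative sketch)."""
    _, g = pca_cost_and_rgrad(U, x_i.reshape(1, -1))   # Riemannian stochastic gradient grad f_i(U)
    return grass_exp(U, -alpha_k * g)                  # geodesic step: Exp_U(-alpha_k * grad f_i(U))
```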


Page 12: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Riemannian SGD (R-SGD) (2) [Bonnabel, 2013]


Page 13: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proposal: Riemannian SVRG (R-SVRG) [Kasai et al., 2016]

- Propose a novel extension of SVRG from the Euclidean space to a Riemannian manifold search space.
  - The extension is not trivial.
- Focus on the Grassmann manifold Gr(r, d).
  - The method can be generalized to other compact Riemannian manifolds.

- Notation (SVRG vs. R-SVRG):
  - Model parameter: w^s_{t-1} ∈ R^d  vs.  U^s_{t-1} ∈ Gr(r, d)
  - Edge point of the outer loop: w̄ ∈ R^d  vs.  Ū ∈ Gr(r, d)
  - Stochastic gradient: ∇f_{i^s_t}(w^s_{t-1}) ∈ R^d  vs.  grad f_{i^s_t}(U^s_{t-1}) ∈ T_{U^s_{t-1}} Gr(r, d)
  - Modified stochastic gradient: v^s_t ∈ R^d  vs.  ξ^s_t ∈ T_{U^s_{t-1}} Gr(r, d)


Page 14: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proposal: Riemannian SVRG (R-SVRG)

Algorithm

- A straightforward modification of the stochastic gradient would simply mimic the SVRG case v^s_t = ∇f_{i^s_t}(w^s_{t-1}) − ∇f_{i^s_t}(w̄) + ∇f(w̄):

  ξ^s_t = grad f_{i^s_t}(U^s_{t-1}) − grad f_{i^s_t}(Ū) + grad f(Ū).

  - This is meaningless, because manifolds are not vector spaces: the gradients at Ū and at U^s_{t-1} live in different tangent spaces.
- Proposed modification:
  - Transport the vectors at Ū into the current tangent space at U^s_{t-1} by parallel translation along the geodesic γ connecting them, and then add them:

  ξ^s_t = grad f_{i^s_t}(U^s_{t-1}) + P^γ_{U^s_{t-1} ← Ū}(−grad f_{i^s_t}(Ū) + grad f(Ū)),

  where P^γ_{U^s_{t-1} ← Ū} is the parallel-translation operator along the geodesic γ.
  - The logarithm mapping gives the tangent vector that defines the geodesic γ.
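A minimal sketch of one outer loop of R-SVRG for the PCA cost is given below. It assumes the hypothetical helpers pca_cost_and_rgrad (earlier PCA sketch) and grass_exp / grass_ptransp / grass_log (sketched after the Grassmann-tools slide below). Re-representing the current point at the geodesic endpoint before adding tangent vectors is an implementation choice made here so that all vectors live in the same tangent space; the authors' Matlab code is the reference implementation.

```python
def rsvrg_epoch(U, X, alpha, m, rng):
    """One outer loop of R-SVRG on Gr(r, d) for the PCA cost (illustrative sketch)."""
    n = X.shape[0]
    U_bar = U.copy()                                      # snapshot ("edge") point of the outer loop
    _, g_bar = pca_cost_and_rgrad(U_bar, X)               # full Riemannian gradient grad f(U_bar)
    for _ in range(m):                                    # inner loop
        i = rng.integers(n)
        eta = grass_log(U_bar, U)                         # tangent vector of the geodesic U_bar -> span(U)
        U = grass_exp(U_bar, eta)                         # representative of span(U) at the geodesic endpoint
        _, gi_U = pca_cost_and_rgrad(U, X[i:i + 1])       # grad f_i at the current point
        _, gi_bar = pca_cost_and_rgrad(U_bar, X[i:i + 1]) # grad f_i at the snapshot
        corr = grass_ptransp(U_bar, eta, g_bar - gi_bar)  # parallel-translate the correction into T_U
        xi = gi_U + corr                                  # variance-reduced Riemannian gradient
        U = grass_exp(U, -alpha * xi)                     # exponential-map update along the geodesic
    return U
```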


Page 15: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proposal: Riemannian SVRG (R-SVRG)

Conceptual illustration


Page 16: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Tools on the Grassmann manifold

- Exponential mapping in the direction of ξ ∈ T_{U(0)} Gr(r, d):

  U(t) = [U(0)V  W] [cos tΣ ; sin tΣ] V^T = U(0) V cos(tΣ) V^T + W sin(tΣ) V^T,

  - where ξ = WΣV^T is the rank-r singular value decomposition of ξ, and
  - cos(·) and sin(·) act only on the diagonal entries.
- Parallel translation of ζ ∈ T_{U(0)} Gr(r, d) along the geodesic γ(t) with direction ξ:

  ζ(t) = ( [U(0)V  W] [−sin tΣ ; cos tΣ] W^T + (I − WW^T) ) ζ
       = ( −U(0) V sin(tΣ) W^T + W cos(tΣ) W^T + I − WW^T ) ζ.

- Logarithm mapping of U(t) at U(0):

  ξ = Log_{U(0)}(U(t)) = W arctan(Σ) V^T,

  - where WΣV^T is the rank-r singular value decomposition of (U(t) − U(0)U(0)^T U(t))(U(0)^T U(t))^{-1}.
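A minimal NumPy sketch of these three operations is given below; the function names and the thin-SVD-based details are illustrative (the authors' Matlab code, linked at the end of the deck, is the reference implementation).

```python
import numpy as np

def grass_exp(U, xi, t=1.0):
    """Exponential map on Gr(r, d): follow the geodesic from span(U) with direction xi for time t."""
    W, s, Vt = np.linalg.svd(xi, full_matrices=False)     # thin SVD: xi = W diag(s) Vt
    return U @ Vt.T @ np.diag(np.cos(t * s)) @ Vt + W @ np.diag(np.sin(t * s)) @ Vt

def grass_ptransp(U, xi, zeta, t=1.0):
    """Parallel translation of zeta in T_U Gr(r, d) along the geodesic gamma(t) with direction xi."""
    W, s, Vt = np.linalg.svd(xi, full_matrices=False)
    M = (-U @ Vt.T @ np.diag(np.sin(t * s)) + W @ np.diag(np.cos(t * s))) @ W.T \
        + np.eye(U.shape[0]) - W @ W.T
    return M @ zeta

def grass_log(U0, U1):
    """Logarithm map: tangent vector at U0 pointing to span(U1) along the connecting geodesic."""
    M = U0.T @ U1
    A = np.linalg.solve(M.T, (U1 - U0 @ M).T).T           # (U1 - U0 U0^T U1)(U0^T U1)^{-1}
    W, s, Vt = np.linalg.svd(A, full_matrices=False)
    return W @ np.diag(np.arctan(s)) @ Vt
```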


Page 17: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Main results: convergence analyses

- Global convergence analysis with decaying step-sizes.
  - Guarantees that the iterates converge globally to a critical point from any initialization.
- Local convergence rate analysis under a fixed step-size.
  - Considers the rate in a neighborhood of a local minimum.
  - Assumes that Lipschitz smoothness and a lower bound on the Hessian hold only in this neighborhood.
  - Yields a local linear convergence rate:

  E[(dist(U^s, U^*))^2] ≤ (4(1 + 8mα^2β^2) / (αm(σ − 14ηβ^2))) E[(dist(U^{s-1}, U^*))^2].


Page 18: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Proof sketch for local convergence rate

1. Assuming σ is the smallest eigenvalue of the Hessian of f in the neighborhood U, obtain

   f(z) ≥ f(w) + ⟨Exp_w^{-1}(z), grad f(w)⟩_w + (σ/2) ‖Exp_w^{-1}(z)‖_w^2,   w, z ∈ U.   (1)

2. Bound the variance of ξ^s_t using β-Lipschitz continuity:

   E_{i^s_t}[‖ξ^s_t‖^2] ≤ β^2 (14 (dist(w^s_{t-1}, w^*))^2 + 8 (dist(w^{s-1}, w^*))^2).   (2)

3. Bound the expected decrease of the distance to the solution in the inner iteration, using the lemma for a geodesic triangle in an Alexandrov space:

   E_{i^s_t}[(dist(U^s_t, U^*))^2 − (dist(U^s_{t-1}, U^*))^2]
     ≤ E_{i^s_t}[(dist(U^s_{t-1}, U^s_t))^2 + 2η ⟨grad f(U^s_{t-1}), Exp^{-1}_{U^s_{t-1}}(U^*)⟩_{U^s_{t-1}}].   (3)

4. Plugging (1) and (2) into (3) and summing over the inner loop finally yields the decrease of the distance to the solution over one outer iteration.


Page 19: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Numerical comparisons

Experimental conditions

- Compare R-SVRG with
  1. R-SGD
  2. R-SD (steepest descent) with backtracking line search
- Step-size strategies:
  1. fixed step-size
  2. decaying step-sizes
  3. hybrid step-sizes
    - Use decaying step-sizes for the first s_TH (= 5) epochs, then switch to a fixed step-size.
- PCA problem:
  - n = 10000, d = 20, and r = 5.
- Evaluation metrics:
  - Optimality gap: distance to the minimum loss, obtained with Matlab's pca (a NumPy equivalent is sketched below).
  - Norm of the gradient.
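The optimality-gap baseline can be computed without Matlab: for the uncentered PCA cost on the earlier slide, the minimum loss equals the sum of the d − r smallest eigenvalues of the sample second-moment matrix. The sketch below uses synthetic data and illustrative names, not the experiment's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 10000, 20, 5                           # problem sizes from the slide
X = rng.standard_normal((n, d)) * np.linspace(1.0, 3.0, d)   # synthetic data (illustrative)

C = X.T @ X / n                                  # sample second-moment matrix
f_min = np.linalg.eigvalsh(C)[:d - r].sum()      # minimum of (1/n) sum_i ||x_i - U U^T x_i||^2

def optimality_gap(U):
    """Gap between the PCA loss at U (d x r, orthonormal columns) and the optimum."""
    cost = np.sum((X - (X @ U) @ U.T) ** 2) / n
    return cost - f_min
```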


Page 20: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Numerical comparisons

Results for the PCA problem

[Figure: (a) optimality gap (train loss − optimum) and (b) norm of gradient, both plotted against #grad/N, comparing R-SD, R-SGD (decay, η = 0.009, λ = 0.1), R-SVRG (fixed η = 0.001; decay η = 0.001, λ = 0.001; hybrid η = 0.004, λ = 0.01), and R-SVRG+ (fixed η = 0.001; decay η = 0.002, λ = 0.01; hybrid η = 0.002, λ = 0.01).]


Page 21: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

Conclusions and more information

- Conclusions
  - Propose Riemannian SVRG (R-SVRG).
  - R-SVRG achieves a local linear convergence rate.
  - Numerical comparisons show the effectiveness of the algorithm.
- More information
  - Full paper: H. Kasai, H. Sato, and B. Mishra, "Riemannian stochastic variance reduced gradient on Grassmann manifold," arXiv:1605.07367, May 2016 [Kasai et al., 2016].
  - Matlab code: https://bamdevmishra.com/codes/rsvrg/

Thank you for your attention.

Page 22: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

References I

- Absil, P.-A., Mahony, R., and Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press.
- Allen-Zhu, Z. and Hazan, E. (2016). Variance reduction for faster non-convex optimization. Technical report, arXiv preprint arXiv:1603.05643.
- Allen-Zhu, Z. and Yan, Y. (2015). Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. Technical report, arXiv preprint arXiv:1506.01972.
- Bonnabel, S. (2013). Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229.
- Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS.
- Garber, D. and Hazan, E. (2015). Fast and simple PCA via convex optimization. Technical report, arXiv preprint arXiv:1509.05647.


Page 23: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

References II

- Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323.
- Kasai, H., Sato, H., and Mishra, B. (2016). Riemannian stochastic variance reduced gradient on Grassmann manifold. arXiv preprint arXiv:1605.07367.
- Mairal, J. (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim., 25(2):829–855.
- Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2663–2671.
- Shalev-Shwartz, S. (2015). SDCA without duality. Technical report, arXiv preprint arXiv:1502.06177.
- Shalev-Shwartz, S. and Zhang, T. (2012). Proximal stochastic dual coordinate ascent. Technical report, arXiv preprint arXiv:1211.2717.


Page 24: Riemannian stochastic variance reduced gradient on Grassmann manifold (ICCOPT2016)

References III

- Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14:567–599.
- Shamir, O. (2015). Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. Technical report, arXiv preprint arXiv:1507.08788.
- Zhang, Y. and Xiao, L. (2014). Stochastic primal-dual coordinate method for regularized empirical risk minimization. SIAM J. Optim., 24(4):2057–2075.
