

VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning

Fanhua Shang, Member, IEEE, Kaiwen Zhou, Hongying Liu, Member, IEEE, James Cheng, Ivor W. Tsang, Lijun Zhang, Member, IEEE, Dacheng Tao, Fellow, IEEE, and Licheng Jiao, Fellow, IEEE

Abstract—In this paper, we propose a simple variant of the original SVRG, called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively. These settings allow us to use much larger learning rates, and also make our convergence analysis more challenging. We also design two different update rules for smooth and non-smooth objective functions, respectively, which means that VR-SGD can tackle non-smooth and/or non-strongly convex problems directly without any reduction techniques. Moreover, we analyze the convergence properties of VR-SGD for strongly convex problems, which show that VR-SGD attains linear convergence. Different from most algorithms that have no convergence guarantees for non-strongly convex problems, we also provide convergence guarantees for VR-SGD in this case, and empirically verify that VR-SGD with varying learning rates achieves similar performance to its momentum-accelerated variant, which has the optimal convergence rate $O(1/T^2)$. Finally, we apply VR-SGD to solve various machine learning problems, such as convex and non-convex empirical risk minimization, and leading eigenvalue computation. Experimental results show that VR-SGD converges significantly faster than SVRG and Prox-SVRG, and usually outperforms state-of-the-art accelerated methods, e.g., Katyusha.

Index Terms—Stochastic optimization, stochastic gradient descent (SGD), variance reduction, empirical risk minimization, strongly convex and non-strongly convex, smooth and non-smooth


1 INTRODUCTION

In this paper, we focus on the following composite optimization problem:

$$\min_{x\in\mathbb{R}^d} F(x) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(x), \qquad (1)$$

where $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, the $f_i(x):\mathbb{R}^d\to\mathbb{R}$, $i=1,\ldots,n$ are smooth functions, and $g(x)$ is a relatively simple (but possibly non-differentiable) convex function (referred to as a regularizer). The formulation (1) arises in many places in machine learning, signal processing, data science, statistics and operations research, such as regularized empirical risk minimization (ERM). For instance, one popular choice of the component function $f_i(\cdot)$ in binary classification problems is the logistic loss, i.e., $f_i(x) = \log(1 + \exp(-b_i a_i^T x))$, where $\{(a_1,b_1),\ldots,(a_n,b_n)\}$ is a collection of training examples, and $b_i \in \{\pm 1\}$. Some popular choices for the regularizer include the $\ell_2$-norm regularizer (i.e., $g(x) = (\lambda/2)\|x\|^2$), the $\ell_1$-norm regularizer (i.e., $g(x) = \lambda\|x\|_1$), and the elastic-net regularizer (i.e., $g(x) = (\lambda_1/2)\|x\|^2 + \lambda_2\|x\|_1$). Some other applications include deep neural networks [1], [2], [3], [4], [5], group Lasso [6], sparse learning and coding [7], [8], [9], [10], non-negative matrix factorization [11], phase retrieval [12], matrix completion [13], [14], conditional random fields [15], generalized eigen-decomposition and canonical correlation analysis [16], and eigenvector computation [17], [18] such as principal component analysis (PCA) and singular value decomposition (SVD).
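As a concrete illustration, the composite objective (1) with the logistic loss and the elastic-net regularizer can be evaluated as follows. This is a minimal NumPy sketch; the array names `A`, `b` and the helper functions are purely illustrative, not part of the paper:

```python
import numpy as np

def logistic_loss_i(x, a_i, b_i):
    """Component loss f_i(x) = log(1 + exp(-b_i * a_i^T x))."""
    return np.log1p(np.exp(-b_i * a_i.dot(x)))

def elastic_net(x, lam1, lam2):
    """Regularizer g(x) = (lam1/2) * ||x||^2 + lam2 * ||x||_1."""
    return 0.5 * lam1 * x.dot(x) + lam2 * np.abs(x).sum()

def F(x, A, b, lam1, lam2):
    """Composite objective (1): average of component losses plus regularizer."""
    losses = [logistic_loss_i(x, A[i], b[i]) for i in range(len(b))]
    return np.mean(losses) + elastic_net(x, lam1, lam2)
```

Setting `lam2 = 0` recovers the $\ell_2$-regularized case, and `lam1 = 0` the $\ell_1$-regularized (Lasso-type) case.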

1.1 Stochastic Gradient Descent

We are especially interested in developing efficient algorithms to solve Problem (1) involving the sum of a large number of component functions. The standard and effective method for solving (1) is the (proximal) gradient descent (GD) method, including Nesterov's accelerated gradient descent (AGD) [19], [20] and accelerated proximal gradient (APG) [21], [22]. For the smooth problem (1), GD takes the following update rule: starting with $x_0$, and for any $k \geq 0$,

• F. Shang, H. Liu, and L. Jiao are with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, Shaanxi 710126, China. E-mail: {fhshang, hyliu}@xidian.edu.cn, [email protected].

• K. Zhou and J. Cheng are with the Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, Hong Kong. E-mail: {kwzhou, jcheng}@cse.cuhk.edu.hk.

• I.W. Tsang is with the Centre for Artificial Intelligence, University of Technology Sydney, Ultimo, NSW 2007, Australia. E-mail: [email protected].

• L. Zhang is with the National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: [email protected].

• D. Tao is with the UBTECH Sydney Artificial Intelligence Centre and the School of Information Technologies, Faculty of Engineering and Information Technologies, University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia. E-mail: [email protected].

Manuscript received 22 Mar. 2018; revised 1 Oct. 2018; accepted 14 Oct. 2018. Date of publication 30 Oct. 2018; date of current version 5 Dec. 2019. (Corresponding author: Hongying Liu.) Recommended for acceptance by Jeff M. Phillips. Digital Object Identifier no. 10.1109/TKDE.2018.2878765

1041-4347 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

188 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 1, JANUARY 2020


$$x_{k+1} = x_k - \eta_k\left[\frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x_k) + \nabla g(x_k)\right], \qquad (2)$$

where $\eta_k > 0$ is commonly referred to as the learning rate in machine learning or step-size in optimization. When $g(\cdot)$ is non-smooth (e.g., the $\ell_1$-norm regularizer), we typically introduce the following proximal operator to replace (2):

$$x_{k+1} = \mathrm{Prox}_{\eta_k}^{g}(y_k) := \arg\min_{x\in\mathbb{R}^d}\left\{\frac{1}{2\eta_k}\|x - y_k\|^2 + g(x)\right\}, \qquad (3)$$

where $y_k = x_k - (\eta_k/n)\sum_{i=1}^{n}\nabla f_i(x_k)$. GD has been proven to achieve linear convergence for strongly convex problems, and both AGD and APG attain the optimal convergence rate $O(1/T^2)$ for non-strongly convex problems, where $T$ denotes the number of iterations. However, the per-iteration cost of all the batch (or deterministic) methods is $O(nd)$, which is expensive for very large $n$.
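For the $\ell_1$-norm regularizer $g(x) = \lambda\|x\|_1$, the proximal operator in (3) has a well-known closed form, the soft-thresholding operator. The following is a minimal sketch (the function names are illustrative):

```python
import numpy as np

def prox_l1(y, eta, lam):
    """Proximal operator (3) for g(x) = lam * ||x||_1 with step size eta.
    The closed-form minimizer is soft-thresholding applied elementwise."""
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)

def prox_gd_step(x, grad_f, eta, lam):
    """One proximal GD step: form y_k = x_k - eta * grad f(x_k), then apply (3)."""
    return prox_l1(x - eta * grad_f, eta, lam)
```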

Instead of evaluating the full gradient of $f(\cdot)$ at each iteration, an efficient alternative is the stochastic (or incremental) gradient descent (SGD) method [23]. SGD only evaluates the gradient of a single component function at each iteration, and thus has a much lower per-iteration cost, $O(d)$. SGD has been successfully applied to many large-scale learning problems [24], [25], [26], especially the training of deep learning models [2], [3], [27], and its update rule is

$$x_{k+1} = x_k - \eta_k\left[\nabla f_{i_k}(x_k) + \nabla g(x_k)\right], \qquad (4)$$

where $\eta_k \propto 1/\sqrt{k}$, and the index $i_k$ can be chosen uniformly at random from $\{1,2,\ldots,n\}$. Although the stochastic gradient estimator $\nabla f_{i_k}(x_k)$ is an unbiased estimate of $\nabla f(x_k)$, i.e., $\mathbb{E}[\nabla f_{i_k}(x_k)] = \nabla f(x_k)$, the variance of $\nabla f_{i_k}(x_k)$ may be large due to the randomness of sampling [1]. Thus, stochastic gradient estimators are also called "noisy gradients", and the step size must be gradually reduced, which leads to slow convergence. In particular, even under the strongly convex (SC) condition, standard SGD attains only a slower sub-linear convergence rate $O(1/T)$ [28].
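A minimal sketch of the SGD update (4) with the diminishing step size $\eta_k \propto 1/\sqrt{k}$; `grad_fi` and `grad_g` are assumed callables supplied by the caller, not functions from the paper:

```python
import numpy as np

def sgd(grad_fi, grad_g, x0, n, eta0, iters, seed=0):
    """SGD update (4): x_{k+1} = x_k - eta_k * [grad f_{i_k}(x_k) + grad g(x_k)],
    with the diminishing step size eta_k = eta0 / sqrt(k + 1)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(iters):
        i = rng.integers(n)              # index i_k chosen uniformly at random
        eta = eta0 / np.sqrt(k + 1)      # eta_k proportional to 1/sqrt(k)
        x -= eta * (grad_fi(x, i) + grad_g(x))
    return x
```

The $1/\sqrt{k}$ decay is exactly why plain SGD is slow: the step size must shrink to tame the gradient noise, which is the behavior variance reduction removes.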

1.2 Accelerated SGD

Recently, many SGD methods with variance reduction have been proposed, such as stochastic average gradient (SAG) [29], stochastic variance reduced gradient (SVRG) [1], stochastic dual coordinate ascent (SDCA) [30], SAGA [31], stochastic primal-dual coordinate (SPDC) [32], and their proximal variants, such as Prox-SAG [33], Prox-SVRG [34] and Prox-SDCA [35]. These accelerated SGD methods can use a constant learning rate $\eta$ instead of the diminishing step sizes of SGD, and fall into the following three categories: primal methods such as SVRG and SAGA, dual methods such as SDCA, and primal-dual methods such as SPDC. In essence, many of the primal methods use the full gradient at the snapshot $\tilde{x}$ or the average gradient to progressively reduce the variance of the stochastic gradient estimators, as do the dual and primal-dual methods, which has led to a revolution in the area of first-order optimization [36]. Thus, they are also known as hybrid gradient descent methods [37] or semi-stochastic gradient descent methods [38]. In particular, under the strongly convex condition, most of the accelerated SGD methods enjoy a linear convergence rate (also known as a geometric or exponential rate) and an oracle complexity of $O((n + L/\mu)\log(1/\epsilon))$ to obtain an $\epsilon$-suboptimal solution, where each $f_i(\cdot)$ is $L$-smooth, and $F(\cdot)$ is $\mu$-strongly convex. This complexity bound shows that they converge faster than accelerated deterministic methods, whose oracle complexity is $O(n\sqrt{L/\mu}\log(1/\epsilon))$ [39], [40].

SVRG [1] and its proximal variant, Prox-SVRG [34], are particularly attractive because of their low storage requirements compared with other methods such as SAG, SAGA and SDCA, which require storage of all the gradients of component functions or dual variables. At the beginning of the $s$-th epoch in SVRG, the full gradient $\nabla f(\tilde{x}^{s-1})$ is computed at the snapshot $\tilde{x}^{s-1}$, which is updated periodically.

Definition 1. The stochastic variance reduced gradient estimator is independently introduced in [1], [37] as follows:

$$\widetilde{\nabla} f_{i_k^s}(x_k^s) = \nabla f_{i_k^s}(x_k^s) - \nabla f_{i_k^s}(\tilde{x}^{s-1}) + \nabla f(\tilde{x}^{s-1}), \qquad (5)$$

where $s$ is the epoch that iteration $k$ belongs to.

It is not hard to verify that the variance of the SVRG estimator $\widetilde{\nabla} f_{i_k^s}(x_k^s)$ (i.e., $\mathbb{E}\|\widetilde{\nabla} f_{i_k^s}(x_k^s) - \nabla f(x_k^s)\|^2$) can be much smaller than that of the SGD estimator $\nabla f_{i_k}(x_k^s)$ (i.e., $\mathbb{E}\|\nabla f_{i_k}(x_k^s) - \nabla f(x_k^s)\|^2$). Theoretically, for non-strongly convex (Non-SC) problems, the variance reduced methods converge more slowly than accelerated batch methods such as FISTA [22], i.e., $O(1/T)$ versus $O(1/T^2)$.
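The variance gap between the two estimators is easy to check numerically. The sketch below compares them on illustrative least-squares data, with the current iterate taken close to the snapshot; all data and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_fi(x, i):
    """Gradient of the least-squares component f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return (A[i].dot(x) - b[i]) * A[i]

def full_grad(x):
    return A.T.dot(A.dot(x) - b) / n

x_snap = rng.normal(size=d)                 # snapshot x_tilde^{s-1}
x = x_snap + 0.01 * rng.normal(size=d)      # current iterate near the snapshot
mu = full_grad(x_snap)                      # full gradient at the snapshot

# SGD estimator (4) and SVRG estimator (5), enumerated over every index i
sgd_est = np.array([grad_fi(x, i) for i in range(n)])
svrg_est = np.array([grad_fi(x, i) - grad_fi(x_snap, i) + mu for i in range(n)])

g = full_grad(x)
var_sgd = np.mean(np.sum((sgd_est - g) ** 2, axis=1))
var_svrg = np.mean(np.sum((svrg_est - g) ** 2, axis=1))
```

Both estimators average exactly to $\nabla f(x)$, but near the snapshot the SVRG estimator's variance is orders of magnitude smaller, which is what permits a constant learning rate.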

More recently, many acceleration techniques have been proposed to further speed up the stochastic variance reduced methods mentioned above. These techniques mainly include Nesterov's acceleration techniques in [25], [39], [40], [41], [42], reducing the number of gradient calculations in early iterations [36], [43], [44], the projection-free property of the conditional gradient method (also known as the Frank-Wolfe algorithm [45]) as in [46], the stochastic sufficient decrease technique [47], and the momentum acceleration tricks in [36], [48], [49]. [40] proposed the accelerating Catalyst framework and achieved an oracle complexity of $O((n + \sqrt{nL/\mu})\log(L/\mu)\log(1/\epsilon))$ for strongly convex problems. [48] and [50] proved that accelerated methods can attain an oracle complexity of $O(n\log(1/\epsilon) + \sqrt{nL/\epsilon})$ for non-strongly convex problems; this overall complexity matches the theoretical upper bound provided in [51]. Katyusha [48], point-SAGA [52] and MiG [50] achieve the best-known oracle complexity of $O((n + \sqrt{nL/\mu})\log(1/\epsilon))$ for strongly convex problems, which is identical to the upper complexity bound in [51]. Hence, Katyusha and MiG are the best-known stochastic optimization methods for both SC and Non-SC problems, as pointed out in [51]. However, selecting the best values for the parameters in the accelerated methods (e.g., the momentum parameter) is still an open problem. In particular, most accelerated stochastic variance reduction methods, including Katyusha, require at least one auxiliary variable and one momentum parameter, which leads to complicated algorithm design and high per-iteration complexity, especially for very high-dimensional and sparse data.

1.3 Our Contributions

From the above discussions, we can see that most of the accelerated stochastic variance reduction methods such as [36], [39], [44], [46], [47], [48], [53], [54] and applications such as [7], [9], [10], [14], [17], [18], [55], [56], [57] are based on the SVRG method [1]. Thus, any key improvement on SVRG is very important for the research of stochastic optimization. In this paper, we propose a simple variant of the original SVRG [1],



called variance reduced stochastic gradient descent (VR-SGD). The snapshot point and starting point of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. This is different from the settings of SVRG and Prox-SVRG [34], where the two points of the former are set to the last iterate, and those of the latter are set to the average of the previous epoch. This difference makes the convergence analysis of VR-SGD significantly more challenging than that of SVRG and Prox-SVRG. Our empirical results show that the performance of VR-SGD is significantly better than its counterparts, SVRG and Prox-SVRG. Impressively, VR-SGD with varying learning rates achieves better or at least comparable performance compared with accelerated methods, such as Catalyst [40] and Katyusha [48]. The main contributions of this paper are summarized below.

• The snapshot and starting points of VR-SGD are set to two different vectors, i.e., $\tilde{x}^s = \frac{1}{m}\sum_{k=1}^{m} x_k^s$ (Option I) or $\tilde{x}^s = \frac{1}{m-1}\sum_{k=1}^{m-1} x_k^s$ (Option II), and $x_0^{s+1} = x_m^s$. In particular, we find that these settings allow us to take much larger learning rates than SVRG, e.g., $1/L$ versus $1/(10L)$, and thus significantly speed up convergence in practice. Moreover, VR-SGD has an advantage over SVRG in terms of robustness of learning rate selection.

• Unlike proximal stochastic gradient methods, e.g., Prox-SVRG and Katyusha, which have a unified update rule for the two cases of smooth and non-smooth objectives (see Section 2.2 for details), VR-SGD employs two different update rules for the two cases, as in (12) and (13) below. Empirical results show that gradient update rules as in (12) are better choices for smooth optimization problems than proximal update formulas as in (10).

• We provide convergence guarantees for VR-SGD for solving smooth/non-smooth and non-strongly convex (or general convex) functions. In comparison, SVRG and Prox-SVRG do not have convergence guarantees for these cases, as shown in Table 1.

• Moreover, we also present a momentum accelerated variant of VR-SGD, discuss their equivalent relationship, and empirically verify that they achieve similar performance to the variant that attains the optimal convergence rate $O(1/T^2)$.

• Finally, we theoretically analyze the convergence properties of VR-SGD with Option I or Option II for smooth/non-smooth and strongly convex functions, which show that VR-SGD attains linear convergence.

2 PRELIMINARY AND RELATED WORK

Throughout this paper, we use $\|\cdot\|$ to denote the $\ell_2$-norm (also known as the standard Euclidean norm), and $\|\cdot\|_1$ is the $\ell_1$-norm, i.e., $\|x\|_1 = \sum_{i=1}^{d}|x_i|$. $\nabla f(\cdot)$ denotes the full gradient of $f(\cdot)$ if it is differentiable, or $\partial f(\cdot)$ the subgradient if $f(\cdot)$ is only Lipschitz continuous. For each epoch $s\in[S]$ and inner iteration $k\in\{0,1,\ldots,m-1\}$, $i_k^s\in[n]$ is the randomly chosen index. We mostly focus on the case of Problem (1) when each $f_i(\cdot)$ is $L$-smooth,¹ and $F(\cdot)$ is $\mu$-strongly convex. These two common assumptions are defined as follows.

2.1 Basic Assumptions

Assumption 1 (Smoothness). Each $f_i(\cdot)$ is $L$-smooth; that is, there exists a constant $L > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq L\|x - y\|. \qquad (6)$$

Assumption 2 (Strong Convexity). $F(x)$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$F(y) \geq F(x) + \langle \nabla F(x),\, y - x\rangle + \frac{\mu}{2}\|x - y\|^2. \qquad (7)$$

Note that when $g(\cdot)$ is non-smooth, the inequality in (7) needs to be revised by simply replacing the gradient $\nabla F(x)$ with an arbitrary subgradient of $F(\cdot)$ at $x$. In contrast, for a non-strongly convex or general convex function, the inequality in (7) can always be satisfied with $\mu = 0$.

2.2 Related Work

To speed up standard and proximal SGD, many stochastic variance reduced methods [29], [30], [31], [37] have been proposed for some special cases of Problem (1). In the case when each $f_i(x)$ is $L$-smooth, $f(x)$ is $\mu$-strongly convex, and $g(x)\equiv 0$, Roux et al. [29] proposed the stochastic average gradient (SAG) method, which attains linear convergence. However, SAG, as well as other incremental aggregated gradient methods such as SAGA [31], needs to store all gradients, so that $O(nd)$ memory is required in general [43]. Similarly, SDCA [30] requires storage of all dual variables [1], which uses $O(n)$ memory. In contrast, SVRG, proposed by Johnson and Zhang [1], as well as Prox-SVRG [34], has a similar convergence rate to SAG and SDCA, but without the memory requirement for all gradients and dual variables. In particular, the SVRG estimator in (5) may be the most popular choice of stochastic gradient estimator. The update rule of SVRG for the case of Problem (1) when $g(\cdot)\equiv 0$ is

$$x_{k+1}^s = x_k^s - \eta\, \widetilde{\nabla} f_{i_k^s}(x_k^s). \qquad (8)$$

When the smooth regularizer $g(\cdot)\not\equiv 0$, the update rule in (8) becomes $x_{k+1}^s = x_k^s - \eta[\widetilde{\nabla} f_{i_k^s}(x_k^s) + \nabla g(x_k^s)]$. Although the original SVRG in [1] only has convergence guarantees for the special case of Problem (1) when each $f_i(x)$ is $L$-smooth, $f(x)$ is $\mu$-strongly convex, and $g(x)\equiv 0$, we can extend SVRG to the proximal setting by introducing the proximal operator in (3), as shown in Line 7 of Algorithm 1.

Based on the SVRG estimator in (5), some accelerated algorithms [39], [40], [48] have been proposed. The proximal update rules of Katyusha [48] are formulated as follows:

TABLE 1
Comparison of Convergence Rates of VR-SGD and Its Counterparts

                        SVRG [1]      Prox-SVRG [34]   VR-SGD
SC, smooth              linear rate   unknown          linear rate
SC, non-smooth          unknown       linear rate      linear rate
Non-SC, smooth          unknown       unknown          O(1/T)
Non-SC, non-smooth      unknown       unknown          O(1/T)

1. In fact, we can extend the theoretical results for the case when the gradients of all component functions have the same Lipschitz constant $L$ to the more general case, when some component functions $f_i(\cdot)$ have different degrees of smoothness.



$$x_{k+1}^s = w_1 y_k^s + w_2 \tilde{x}^{s-1} + (1 - w_1 - w_2)\, z_k^s, \qquad (9a)$$

$$y_{k+1}^s = \arg\min_{y\in\mathbb{R}^d}\left\{\frac{1}{2\eta}\|y - y_k^s\|^2 + y^T \widetilde{\nabla} f_{i_k^s}(x_{k+1}^s) + g(y)\right\}, \qquad (9b)$$

$$z_{k+1}^s = \arg\min_{z\in\mathbb{R}^d}\left\{\frac{3L}{2}\|z - x_{k+1}^s\|^2 + z^T \widetilde{\nabla} f_{i_k^s}(x_{k+1}^s) + g(z)\right\}, \qquad (9c)$$

where $w_1, w_2 \in [0,1]$ are two parameters. To eliminate the need for parameter tuning, $\eta$ is set to $1/(3w_1 L)$, and $w_2$ is fixed to $0.5$ in [48]. In addition, [16], [17], [18] applied efficient stochastic solvers to compute leading eigenvectors of a symmetric matrix or generalized eigenvectors of two symmetric matrices. The first such method is VR-PCA, proposed by Shamir [17], and the convergence properties of VR-PCA for such a non-convex problem are also provided. Garber et al. [18] analyzed the convergence rate of SVRG when $f(\cdot)$ is a convex function that is a sum of non-convex component functions. Moreover, [4], [5] and [58] proved that SVRG and SAGA with minor modifications can converge asymptotically to a stationary point of non-convex functions. Some sparse approximation, parallel and distributed variants [50], [59], [60], [61], [62] of accelerated SGD methods have also been proposed.
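For intuition, when $g \equiv 0$ the argmins in (9b) and (9c) reduce to closed forms, so one Katyusha inner iteration can be sketched as below. This is only an illustrative reading of (9a)-(9c) for that special case, not the full algorithm; `vr_grad` stands for the estimator (5) and is an assumed callable:

```python
import numpy as np

def katyusha_step(y, z, x_snap, vr_grad, L, w1, w2):
    """One inner iteration of (9a)-(9c) in the special case g = 0.
    (9a) couples the iterates; (9b)/(9c) then have closed-form minimizers."""
    eta = 1.0 / (3.0 * w1 * L)                            # step size used in [48]
    x_next = w1 * y + w2 * x_snap + (1.0 - w1 - w2) * z   # (9a)
    g = vr_grad(x_next)                                   # estimator (5) at x_{k+1}
    y_next = y - eta * g                                  # (9b) with g = 0
    z_next = x_next - g / (3.0 * L)                       # (9c) with g = 0
    return x_next, y_next, z_next
```

The three coupled sequences `x`, `y`, `z` and the momentum weights `w1`, `w2` are exactly the extra per-iteration state and parameters that the paper argues make such accelerated methods complicated to implement.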

Algorithm 1. SVRG (Option I) and Prox-SVRG (Option II)

Input: The number of epochs $S$, the number of iterations $m$ per epoch, and the learning rate $\eta$.
Initialize: $\tilde{x}^0$.
1:  for $s = 1, 2, \ldots, S$ do
2:    $\tilde{\mu}^s = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{x}^{s-1})$, $x_0^s = \tilde{x}^{s-1}$;
3:    for $k = 0, 1, \ldots, m-1$ do
4:      Pick $i_k^s$ uniformly at random from $[n]$;
5:      $\widetilde{\nabla} f_{i_k^s}(x_k^s) = \nabla f_{i_k^s}(x_k^s) - \nabla f_{i_k^s}(\tilde{x}^{s-1}) + \tilde{\mu}^s$;
6:      Option I: $x_{k+1}^s = x_k^s - \eta\left[\widetilde{\nabla} f_{i_k^s}(x_k^s) + \nabla g(x_k^s)\right]$, or $x_{k+1}^s = \mathrm{Prox}_{\eta}^{g}\left(x_k^s - \eta\,\widetilde{\nabla} f_{i_k^s}(x_k^s)\right)$;
7:      Option II: $x_{k+1}^s = \arg\min_{y\in\mathbb{R}^d}\left\{g(y) + y^T \widetilde{\nabla} f_{i_k^s}(x_k^s) + \frac{1}{2\eta}\|y - x_k^s\|^2\right\}$;
8:    end for
9:    Option I: $\tilde{x}^s = x_m^s$;  // Last iterate for snapshot $\tilde{x}$
10:   Option II: $\tilde{x}^s = \frac{1}{m}\sum_{k=1}^{m} x_k^s$;  // Iterate averaging for $\tilde{x}$
11: end for
Output: $\tilde{x}^S$
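Under the stated assumptions (smooth $g$, Option I gradient update and Option I snapshot), Algorithm 1 can be sketched in NumPy as follows; `grad_fi` and `grad_g` are assumed callables for an illustrative problem, not part of the paper:

```python
import numpy as np

def svrg(grad_fi, grad_g, x0, n, m, eta, epochs, seed=0):
    """SVRG (Algorithm 1, Option I): the snapshot is the last iterate of the
    epoch, and the starting point of each epoch equals the snapshot."""
    rng = np.random.default_rng(seed)
    x_snap = x0.copy()
    for s in range(epochs):
        mu = sum(grad_fi(x_snap, i) for i in range(n)) / n  # full gradient at snapshot
        x = x_snap.copy()                                   # x_0^s = x_tilde^{s-1}
        for k in range(m):
            i = rng.integers(n)                             # pick i_k^s uniformly
            v = grad_fi(x, i) - grad_fi(x_snap, i) + mu     # estimator (5)
            x = x - eta * (v + grad_g(x))                   # Option I gradient update
        x_snap = x                                          # Option I: last iterate
    return x_snap
```

Replacing the gradient update by a proximal step and the snapshot by the epoch average would give Prox-SVRG (Option II).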

An important class of stochastic methods is the proximal stochastic gradient (Prox-SG) method, such as Prox-SVRG [34], SAGA [31], and Katyusha [48]. Different from standard variance reduction SGD methods such as SVRG, the Prox-SG method has a unified update rule for both the smooth and non-smooth cases of $g(\cdot)$. For instance, the update rule of Prox-SVRG [34] is formulated as follows:

$$x_{k+1}^s = \arg\min_{y\in\mathbb{R}^d}\left\{g(y) + y^T \widetilde{\nabla} f_{i_k^s}(x_k^s) + \frac{1}{2\eta}\|y - x_k^s\|^2\right\}. \qquad (10)$$

For the sake of completeness, the details of Prox-SVRG [34] are shown in Algorithm 1 with Option II. When $g(\cdot)$ is the widely used $\ell_2$-norm regularizer, i.e., $g(\cdot) = (\lambda_1/2)\|\cdot\|^2$, the proximal update formula in (10) becomes

$$x_{k+1}^s = \frac{1}{1 + \lambda_1\eta}\left[x_k^s - \eta\,\widetilde{\nabla} f_{i_k^s}(x_k^s)\right]. \qquad (11)$$

3 VARIANCE REDUCED SGD

In this section, we propose an efficient variance reduced stochastic gradient descent (VR-SGD) algorithm, as shown in Algorithm 2. Different from the choices of the snapshot and starting points in SVRG [1] and Prox-SVRG [34], the two vectors of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. Moreover, unlike existing proximal stochastic methods, we design two different update rules for smooth and non-smooth objective functions, respectively.

Algorithm 2. VR-SGD for Solving Smooth Problems

Input: The number of epochs $S$, and the number of iterations $m$ per epoch.
Initialize: $x_0^1 = \tilde{x}^0$, and $\{\eta_s\}$.
1:  for $s = 1, 2, \ldots, S$ do
2:    $\tilde{\mu}^s = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{x}^{s-1})$;  // Compute the full gradient
3:    for $k = 0, 1, \ldots, m-1$ do
4:      Pick $i_k^s$ uniformly at random from $[n]$;
5:      $\widetilde{\nabla} f_{i_k^s}(x_k^s) = \nabla f_{i_k^s}(x_k^s) - \nabla f_{i_k^s}(\tilde{x}^{s-1}) + \tilde{\mu}^s$;
6:      $x_{k+1}^s = x_k^s - \eta_s\left[\widetilde{\nabla} f_{i_k^s}(x_k^s) + \nabla g(x_k^s)\right]$;
7:    end for
8:    Option I: $\tilde{x}^s = \frac{1}{m}\sum_{k=1}^{m} x_k^s$;  // Iterate averaging for $\tilde{x}$
9:    Option II: $\tilde{x}^s = \frac{1}{m-1}\sum_{k=1}^{m-1} x_k^s$;  // Iterate averaging for $\tilde{x}$
10:   $x_0^{s+1} = x_m^s$;  // Initialize $x_0^{s+1}$ for the next epoch
11: end for
Output: $\hat{x}^S = \tilde{x}^S$ if $F(\tilde{x}^S) \leq F\left(\frac{1}{S}\sum_{s=1}^{S}\tilde{x}^s\right)$, and $\hat{x}^S = \frac{1}{S}\sum_{s=1}^{S}\tilde{x}^s$ otherwise.
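Algorithm 2 with Option I can be sketched analogously. Again `grad_fi` and `grad_g` are assumed callables for an illustrative smooth problem, and this simplified sketch returns the final snapshot rather than the better of the two output candidates:

```python
import numpy as np

def vr_sgd(grad_fi, grad_g, x0, n, m, etas, seed=0):
    """VR-SGD (Algorithm 2, Option I): the snapshot is the epoch AVERAGE,
    while the starting point is the LAST iterate of the previous epoch."""
    rng = np.random.default_rng(seed)
    x_snap = x0.copy()
    x = x0.copy()
    for eta in etas:                                     # one learning rate per epoch
        mu = sum(grad_fi(x_snap, i) for i in range(n)) / n  # full gradient at snapshot
        iterates = []
        for k in range(m):
            i = rng.integers(n)                          # pick i_k^s uniformly
            v = grad_fi(x, i) - grad_fi(x_snap, i) + mu  # estimator (5)
            x = x - eta * (v + grad_g(x))                # update (12), smooth g
            iterates.append(x)
        x_snap = np.mean(iterates, axis=0)               # Option I: average snapshot
        # x simply carries over: x_0^{s+1} = x_m^s (last iterate starts next epoch)
    return x_snap
```

Compared with the SVRG sketch above, the only changes are the snapshot (average instead of last iterate) and the starting point (last iterate instead of snapshot), which is precisely the simplicity the paper emphasizes.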

3.1 Snapshot and Starting Points

Like SVRG, VR-SGD is also divided into $S$ epochs, and each epoch consists of $m$ stochastic gradient steps, where $m$ is usually chosen to be $\Theta(n)$, as suggested in [1], [34], [48]. Within each epoch, we need to compute the full gradient $\nabla f(\tilde{x}^s)$ at the snapshot $\tilde{x}^s$ and use it to define the variance reduced stochastic gradient estimator $\widetilde{\nabla} f_{i_k^s}(x_k^s)$ in (5). Unlike SVRG, whose snapshot is set to the last iterate of the previous epoch, the snapshot $\tilde{x}^s$ of VR-SGD is set to the average of the previous epoch, e.g., $\tilde{x}^s = \frac{1}{m}\sum_{k=1}^{m} x_k^s$ in Option I of Algorithm 2, which leads to better robustness to gradient noise,² as also suggested in [36], [47], [66]. In fact, the choice of Option II in Algorithm 2, i.e., $\tilde{x}^s = \frac{1}{m-1}\sum_{k=1}^{m-1} x_k^s$, also works well in practice, as shown in Fig. 2 in the Supplementary Material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2018.2878765. Therefore, we provide convergence guarantees for our algorithm with either Option I or Option II in the next section. In particular, we find that one of the effects of the choice in Option I or Option II of Algorithm 2 is to allow taking much larger learning rates or step sizes than SVRG in practice, e.g., $1/L$ for VR-SGD versus $1/(10L)$ for SVRG (see Fig. 1). Actually, a larger learning rate

2. It should be emphasized that the noise introduced by random sampling is inevitable, and generally slows down the convergence speed in this sense. However, SGD and its variants are probably the most widely used optimization algorithms for deep learning [63]. In particular, [64] has shown that by adding gradient noise at each step, noisy gradient descent can escape saddle points efficiently and converge to a local minimum of a non-convex minimization problem, e.g., the application of deep neural networks in [65].



enjoyed by VR-SGD means that the variance of its stochastic gradient estimator goes to zero asymptotically faster.

Unlike Prox-SVRG [34], whose starting point is initialized to the average of the previous epoch, the starting point of VR-SGD is set to the last iterate of the previous epoch. That is, in VR-SGD, the last iterate of the previous epoch becomes the new starting point, while the two points of Prox-SVRG are completely different, thereby leading to relatively slow convergence in general. Both the starting and snapshot points of SVRG [1] are set to the last iterate of the previous epoch,³ while the two points of Prox-SVRG [34] are set to the average of the previous epoch (also suggested in [1]). By setting the starting and snapshot points in VR-SGD to the two different vectors mentioned above, the convergence analysis of VR-SGD becomes significantly more challenging than that of SVRG and Prox-SVRG, as shown in Section 4.

3.2 The VR-SGD AlgorithmIn this part, we propose an efficient VR-SGD algorithm tosolve Problem (1), as outlined in Algorithm 2 for the caseof smooth objective functions. It is well known that theoriginal SVRG [1] only works for the case of smooth mini-mization problems. However, in many machine learningapplications, e.g., elastic net regularized logistic regression,the strongly convex objective function F ðxÞ is non-smooth.To solve this class of problems, the proximal variant ofSVRG, Prox-SVRG [34], was subsequently proposed. Unlikethe original SVRG, VR-SGD can not only solve smooth objec-tive functions, but also directly tackle non-smooth ones. Thatis, when the regularizer gðxÞ is smooth (e.g., the ‘2-normregularizer), the key update rule of VR-SGD is

x_{k+1}^s = x_k^s − η_s [∇̃f_{i_k^s}(x_k^s) + ∇g(x_k^s)].   (12)

When g(x) is non-smooth (e.g., the ℓ1-norm regularizer), the key update rule of VR-SGD in Algorithm 2 becomes

x_{k+1}^s = Prox_{η_s g}( x_k^s − η_s ∇̃f_{i_k^s}(x_k^s) ).   (13)

Unlike proximal stochastic methods such as Prox-SVRG [34], all of which have a unified update rule as in (10) for both the smooth and non-smooth cases of g(·), VR-SGD has two different update rules for the two cases, as in (12) and (13). Fig. 1 demonstrates that VR-SGD has a significant advantage over SVRG in terms of robustness of learning rate selection. That is, VR-SGD yields good performance within the range of learning rates from 0.2/L to 1.2/L, whereas the performance of SVRG is very sensitive to the selection of learning rates. Thus, VR-SGD can be conveniently applied to various real-world machine learning problems. In fact, VR-SGD can use much larger learning rates than SVRG for ridge regression problems in practice, e.g., 8/(5L) for VR-SGD versus 1/(5L) for SVRG, as shown in Fig. 1b.
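To make the two update rules concrete, the following Python sketch implements one epoch of Algorithm 2's inner loop. This is a minimal illustration under our own naming conventions (`vr_sgd_epoch`, `grad_fi`, `prox_g`, etc. are not from the paper's code), covering only the plain single-sample case: rule (12) when a gradient of a smooth g is supplied, and rule (13) when a proximal operator is supplied.

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding), usable as prox_g in (13)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def vr_sgd_epoch(x0, x_snap, grad_fi, full_grad_snap, n, m, eta,
                 grad_g=None, prox_g=None, rng=None):
    """One epoch of VR-SGD's inner loop (a sketch of Algorithm 2).

    x0             : starting point = last iterate of the previous epoch
    x_snap         : snapshot point = average iterate of the previous epoch
    grad_fi(i, x)  : gradient of the i-th component at x
    full_grad_snap : full gradient of f at the snapshot
    grad_g         : gradient of a smooth regularizer g  -> update rule (12)
    prox_g(y, eta) : proximal operator of eta*g          -> update rule (13)
    Returns (average iterate, last iterate): next snapshot and starting point.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    iterates = []
    for _ in range(m):
        i = rng.integers(n)
        # variance reduced (SVRG) gradient estimator
        v = grad_fi(i, x) - grad_fi(i, x_snap) + full_grad_snap
        if prox_g is not None:               # non-smooth g: rule (13)
            x = prox_g(x - eta * v, eta)
        else:                                # smooth g: rule (12)
            x = x - eta * (v + (grad_g(x) if grad_g is not None else 0.0))
        iterates.append(x.copy())
    return np.mean(iterates, axis=0), x
```

For Lasso one would pass `prox_g=prox_l1`; for ridge regression, `grad_g=lambda x: lam * x`. The return value mirrors the paper's setting: the average iterate becomes the next snapshot and the last iterate the next starting point.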

3.3 VR-SGD for Non-Strongly Convex Objectives

Although many stochastic variance reduced methods have been proposed, most of them, including SVRG and Prox-SVRG, only have convergence guarantees for Problem (1) when F(x) is strongly convex. However, F(x) may be non-strongly convex in many machine learning applications, such as Lasso and ℓ1-norm regularized logistic regression. As suggested in [48], [68], this class of problems can be transformed into strongly convex ones by adding a proximal term (τ/2)‖x − x_0^s‖², and the resulting problems can be efficiently solved by Algorithm 2. However, the reduction technique may degrade the performance of the involved algorithms both in theory and in practice [44]. Thus, we use VR-SGD to directly solve non-strongly convex problems.

The learning rate η_s of Algorithm 2 can be fixed to a constant. Inspired by existing accelerated stochastic algorithms [36], [48], the learning rate in Algorithm 2 can also be gradually increased in early iterations for both strongly convex and non-strongly convex problems, which leads to faster convergence (see Fig. 3 in the Supplementary Material, available online). Different from SGD and Katyusha [48], where the learning rate of the former needs to be gradually decayed and that of the latter needs to be gradually increased, the update rule of η_s in Algorithm 2 is defined as follows: η_0 is an initial learning rate, and for any s ≥ 1,

η_s = η_0 / max{α, 2/(s+1)},   (14)

where 0 < α ≤ 1 is a given constant, e.g., α = 0.2.
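The schedule (14) is simple enough to write down directly; the following helper (our own naming) computes it and documents its two regimes:

```python
def vr_sgd_learning_rate(eta0, s, alpha=0.2):
    """Epoch learning rate from (14): eta_s = eta_0 / max(alpha, 2/(s+1)).

    For early epochs, 2/(s+1) dominates, so eta_s increases with s;
    once 2/(s+1) <= alpha the rate settles at the constant eta0/alpha."""
    assert 0 < alpha <= 1 and s >= 1
    return eta0 / max(alpha, 2.0 / (s + 1))
```

With α = 0.2 the rate grows from η_0 at s = 1 up to 5η_0, then stays there, which matches the "gradually increased in early iterations" behavior described above.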

3.4 Extensions of VR-SGD

It has been shown in [38], [39] that mini-batching can effectively decrease the variance of stochastic gradient estimates. Therefore, we first extend the proposed VR-SGD method to the mini-batch setting, together with its convergence results below. Here, we denote by b the mini-batch size and by I_k^s ⊆ [n] the random index set selected for each outer iteration s ∈ [S] and inner iteration k ∈ {0, 1, ..., m−1}.

Definition 2. The stochastic variance reduced gradient estimator in the mini-batch setting is defined as

Fig. 1. Comparison of SVRG [1] and VR-SGD with different learning rates for solving ℓ2-norm regularized logistic regression and ridge regression on Covtype. Note that the blue lines stand for the results of SVRG, while the red lines correspond to the results of VR-SGD (best viewed in color).

3. Note that the theoretical convergence of the original SVRG [1] relies on its Option II, i.e., both x̃^s and x_0^{s+1} are set to x_k^s, where k is randomly chosen from {1, 2, ..., m}. However, the empirical results in [1] suggest that Option I is a better choice than Option II, and the convergence guarantee of SVRG with Option I for strongly convex objective functions is provided in [67].

192 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 1, JANUARY 2020


∇̃f_{I_k^s}(x_k^s) = (1/b) Σ_{i ∈ I_k^s} [∇f_i(x_k^s) − ∇f_i(x̃^{s−1})] + ∇f(x̃^{s−1}),   (15)

where I_k^s ⊆ [n] is a mini-batch of size b.
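The estimator (15) averages component gradient corrections over the mini-batch and re-centers them with the full gradient at the snapshot. A direct translation into Python (function names are our own):

```python
import numpy as np

def minibatch_vr_gradient(x, x_snap, grad_fi, full_grad_snap, batch):
    """Mini-batch variance reduced estimator (15):
    (1/b) * sum_{i in I} [grad f_i(x) - grad f_i(x_snap)] + grad f(x_snap)."""
    diffs = [grad_fi(i, x) - grad_fi(i, x_snap) for i in batch]
    return np.mean(diffs, axis=0) + full_grad_snap
```

Two sanity properties follow immediately from (15): taking the batch to be all of [n] recovers the exact full gradient at x, and the estimator is unbiased when indices are drawn uniformly.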

If some component functions are non-smooth, we can use the proximal operator oracle [68], or Nesterov's smoothing [69] and homotopy smoothing [70] techniques to smooth them, and thereby obtain their smoothed approximations. In addition, we can directly extend our VR-SGD method to the non-smooth setting as in [36] (e.g., Algorithm 3 in [36]) without using any smoothing techniques.

Algorithm 3. The Momentum Accelerated Algorithm

Input: S and m.
Initialize: x_0^1 = v_0^1 = x̃^0, {ω_s}, α > 0, and η_0.
1:  for s = 1, 2, ..., S do
2:    μ̃^s = (1/n) Σ_{i=1}^n ∇f_i(x̃^{s−1}),  η_s = η_0 / max{α, 2/(s+1)};
3:    Option I: v_0^s = x_0^s, or Option II: x_0^s = ω_s v_0^s + (1 − ω_s) x̃^{s−1};
4:    for k = 0, 1, ..., m−1 do
5:      Pick i_k^s uniformly at random from [n];
6:      ∇̃f_{i_k^s}(x_k^s) = ∇f_{i_k^s}(x_k^s) − ∇f_{i_k^s}(x̃^{s−1}) + μ̃^s;
7:      v_{k+1}^s = v_k^s − η_s [∇̃f_{i_k^s}(x_k^s) + ∇g(x_k^s)];
8:      x_{k+1}^s = x̃^{s−1} + ω_s (v_{k+1}^s − x̃^{s−1});
9:    end for
10:   x̃^s = (1/m) Σ_{k=1}^m x_k^s;
11:   Option I: x_0^{s+1} = x_m^s, or Option II: v_0^{s+1} = v_m^s;
12: end for
Output: x̂^S = x̃^S if F(x̃^S) ≤ F((1/S) Σ_{s=1}^S x̃^s), and x̂^S = (1/S) Σ_{s=1}^S x̃^s otherwise.
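The inner loop of Algorithm 3 can be sketched compactly in Python for the smooth case (g folded into the components, so line 7's ∇g term vanishes). This is an illustrative sketch with our own function and variable names, not the authors' implementation:

```python
import numpy as np

def momentum_epoch(x0, v0, x_snap, grad_fi, full_grad_snap, n, m, eta, omega, rng):
    """One epoch of the momentum accelerated variant (sketch of Algorithm 3,
    smooth case). x interpolates between the latest v and the snapshot."""
    x, v = x0.copy(), v0.copy()
    iterates = []
    for _ in range(m):
        i = rng.integers(n)
        g = grad_fi(i, x) - grad_fi(i, x_snap) + full_grad_snap  # line 6
        v = v - eta * g                                          # line 7
        x = x_snap + omega * (v - x_snap)                        # line 8
        iterates.append(x.copy())
    # new snapshot (line 10), plus x_m and v_m for the next epoch (line 11)
    return np.mean(iterates, axis=0), x, v
```

Note that with ω_s = 1 (the first epoch under the schedule ω_s = max{α, 2/(s+1)}), line 8 gives x = v, and the epoch reduces to a plain VR-SGD epoch, which is the equivalence discussed in Section 4.3.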

Considering that each component function f_i(x) may have different degrees of smoothness, picking the random index i_k^s from a non-uniform distribution is a much better choice than the commonly used uniform random sampling [71], [72], as is without-replacement sampling versus with-replacement sampling [73]. This can be done using the same techniques as in [34], [48], i.e., the sampling probabilities for all f_i(x) are proportional to their Lipschitz constants: p_i = L_i / Σ_{j=1}^n L_j. VR-SGD can also be combined with other acceleration techniques used for SVRG. For instance, the epoch length of VR-SGD can be automatically determined by the techniques in [44], [74], instead of being fixed. We can reduce the number of gradient calculations in early iterations as in [43], [44], which leads to faster convergence in general (see Section 5.5 for details). Moreover, we can introduce the Nesterov's acceleration techniques in [25], [39], [40], [41], [42] and momentum acceleration tricks in [36], [48], [75] to further improve the performance of VR-SGD.
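The Lipschitz-proportional sampling probabilities above are a one-liner; the following helper (our own naming) computes them:

```python
import numpy as np

def lipschitz_sampling_probs(lipschitz_constants):
    """Non-uniform sampling probabilities p_i = L_i / sum_j L_j,
    proportional to the component Lipschitz constants as in [34], [48]."""
    L = np.asarray(lipschitz_constants, dtype=float)
    return L / L.sum()
```

Indices can then be drawn with `np.random.default_rng().choice(n, p=p)`. Note that in the cited methods, keeping the gradient estimator unbiased under non-uniform sampling requires reweighting the sampled component gradient by 1/(n p_i); we omit that bookkeeping here.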

4 ALGORITHM ANALYSIS

In this section, we provide the convergence guarantees of VR-SGD for solving both smooth and non-smooth general convex problems, and extend the results to the mini-batch setting. We also study the convergence properties of VR-SGD for solving both smooth and non-smooth strongly convex objective functions. Moreover, we discuss the equivalence between VR-SGD and its momentum accelerated variant, as well as some of its extensions.

4.1 Convergence Properties: Non-Strongly Convex

In this part, we analyze the convergence properties of VR-SGD for solving more general non-strongly convex problems. Considering that the proposed algorithm (i.e., Algorithm 2) has two different update rules for the smooth and non-smooth cases, we give the convergence guarantees of VR-SGD for the two cases as follows.

4.1.1 Smooth Objective Functions

We first provide the convergence guarantee of our algorithm for solving Problem (1) when F(x) is smooth. To simplify the analysis, we denote F(x) by f(x), that is, f_i(x) := f_i(x) + g(x) for all i = 1, 2, ..., n, and then g(x) ≡ 0.

Lemma 1 (Variance bound). Let x* be the optimal solution of Problem (1). Suppose Assumption 1 holds. Then the following inequality holds:

E[‖∇̃f_{i_k^s}(x_k^s) − ∇f(x_k^s)‖²] ≤ 4L [f(x_k^s) − f(x*) + f(x̃^{s−1}) − f(x*)].

The proofs of this lemma and of the lemmas and theorems below are all included in the Supplementary Material, available online. Lemma 1 provides the upper bound on the expected variance of the variance reduced gradient estimator in (5), i.e., the SVRG estimator. For Algorithm 2 with Option II and a fixed learning rate η, we have the following result.

Theorem 1 (Smooth objectives). Suppose Assumption 1 holds. Then the following inequality holds:

E[f(x̂^S)] − f(x*) ≤ [2(m+1) / ((γ − 4m + 2)S)] [f(x̃^0) − f(x*)] + [β(β−1)L / (2(γ − 4m + 2)S)] ‖x̃^0 − x*‖²,

where γ = (β−1)(m−1) and β = 1/(Lη).

From Theorem 1 and its proof, one can see that our convergence analysis is very different from that of existing stochastic methods, such as SVRG [1], Prox-SVRG [34], and SVRG++ [44]. Similarly, the convergence of Algorithm 2 with Option I and a fixed learning rate can be guaranteed, as stated in Theorem 6 in the Supplementary Material, available online. All the results show that VR-SGD attains a convergence rate of O(1/T) for non-strongly convex functions.

4.1.2 Non-Smooth Objective Functions

We also provide the convergence guarantee of Algorithm 2 with Option I and (13) for solving Problem (1) when F(x) is non-smooth and non-strongly convex, as shown below.

Theorem 2 (Non-smooth objectives). Suppose Assumption 1 holds. Then the following inequality holds:

E[F(x̂^S)] − F(x*) ≤ [2(m+1) / ((β−5)mS)] [F(x̃^0) − F(x*)] + [β(β−1)L / (2(β−5)mS)] ‖x̃^0 − x*‖².

Similarly, the convergence of Algorithm 2 with Option II and a fixed learning rate can be guaranteed, as stated in Corollary 4 in the Supplementary Material, available online.

4.1.3 Mini-Batch Settings

The upper bound on the variance of ∇̃f_{i_k^s}(x_k^s) is extended to the mini-batch setting as follows.



Corollary 1 (Variance bound of mini-batch). If each f_i(·) is convex and L-smooth, then the following inequality holds:

E[‖∇̃f_{I_k^s}(x_k^s) − ∇f(x_k^s)‖²] ≤ 4Lδ(b) [F(x_k^s) − F(x*) + F(x̃^{s−1}) − F(x*)],

where δ(b) = (n−b)/[(n−1)b].

This corollary is essentially identical to Theorem 4 in [38], and hence its proof is omitted. It is not hard to verify that 0 ≤ δ(b) ≤ 1. Based on this variance upper bound, we further analyze the convergence properties of VR-SGD in the mini-batch setting, as shown below.

Theorem 3 (Mini-batch). If each f_i(·) is convex and L-smooth, then the following inequality holds:

E[F(x̂^S)] − F(x*) ≤ [2δ(b)(m+1) / (ζmS)] E[F(x̃^0) − F(x*)] + [β(β−1)L / (2ζmS)] E[‖x* − x̃^0‖²],

where ζ = β − 1 − 4δ(b).

From Theorem 3, one can see that when b = n (i.e., the batch setting), δ(n) = 0, and the first term on the right-hand side of the above inequality vanishes. That is, VR-SGD degenerates to a batch method. When b = 1, we have δ(1) = 1, and thus Theorem 3 degenerates to Theorem 2.
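The two limiting cases of Theorem 3 are easy to check numerically via the variance factor δ(b) from Corollary 1 (the helper name is our own):

```python
def delta(b, n):
    """Mini-batch variance factor from Corollary 1: delta(b) = (n-b) / ((n-1) * b).

    delta(1) = 1 recovers the single-sample bound (Theorem 2's regime), while
    delta(n) = 0 makes the variance bound vanish (the batch setting)."""
    assert 1 <= b <= n
    return (n - b) / ((n - 1) * b)
```

δ(b) decreases monotonically in b, quantifying how larger mini-batches shrink the variance of the estimator (15).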

4.2 Convergence Properties: Strongly Convex

We also analyze the convergence properties of VR-SGD for solving strongly convex problems. We first give the following convergence result for Algorithm 2 with Option II.

Theorem 4 (Strongly convex). Suppose Assumptions 1, 2 and 3 in the Supplementary Material, available online, hold, and m is sufficiently large so that

ρ := [2Lη(m + c)] / [(m−1)(1 − 3Lη)] + [c(1 − Lη)] / [μη(m−1)(1 − 3Lη)] < 1,

where c is a constant. Then Algorithm 2 with Option II has the following geometric convergence in expectation:

E[F(x̂^S) − F(x*)] ≤ ρ^S [F(x̃^0) − F(x*)].

We also provide linear convergence guarantees for Algorithm 2 with Option I for solving non-smooth and strongly convex functions, as stated in Theorem 7 in the Supplementary Material, available online. Similarly, the linear convergence of Algorithm 2 can be guaranteed for the smooth strongly convex case. All the theoretical results show that VR-SGD attains a linear convergence rate and an oracle complexity of at most O((n + L/μ) log(1/ε)) for both smooth and non-smooth strongly convex functions. In contrast, the convergence of SVRG [1] is only guaranteed for smooth and strongly convex problems.

Although the learning rate in Theorem 4 needs to be less than 1/(3L), we can use much larger learning rates in practice, e.g., η = 1/L. However, it can be easily verified that the learning rate of SVRG should be less than 1/(4L) in theory, and adopting a larger learning rate for SVRG is not always helpful in practice, which means that VR-SGD can use much larger learning rates than SVRG both in theory and in practice. In other words, although they have the same theoretical convergence rate, VR-SGD converges significantly faster than SVRG in practice, as shown by our experiments. Note that, similar to the convergence analysis in [4], [5], [58], the convergence of VR-SGD for some non-convex problems can also be guaranteed.

4.3 Equivalence to Its Momentum Accelerated Variant

Inspired by the success of the momentum technique in our previous work [6], [50], [75], we present a momentum accelerated variant of Algorithm 2, as shown in Algorithm 3. Unlike existing momentum techniques, e.g., [19], [22], [25], [39], [40], [48], we use the convex combination of the snapshot x̃^{s−1} and the latest iterate v_k^s for acceleration, i.e., x̃^{s−1} + ω_s(v_{k+1}^s − x̃^{s−1}) = ω_s v_{k+1}^s + (1 − ω_s) x̃^{s−1}. It is not hard to verify that Algorithm 2 with Option I is equivalent to its variant (i.e., Algorithm 3 with Option I) when ω_s = max{α, 2/(s+1)} and α is sufficiently small (see the Supplementary Material, available online, for the equivalence analysis). We emphasize that the only difference between Options I and II in Algorithm 3 is the initialization of x_0^s and v_0^s.

Theorem 5. Suppose Assumption 1 holds. Then the following inequality holds:

E[F(x̂^S) − F(x*)] ≤ [4(1 − ω_1) / (ω_1²(S+1)²)] [F(x̃^0) − F(x*)] + [2 / (mη_0(S+1)²)] ‖x* − x̃^0‖².

Choosing m = Θ(n), Algorithm 3 with Option II achieves an ε-suboptimal solution (i.e., E[F(x̂^S)] − F(x*) ≤ ε) using at most O(n√([F(x̃^0) − F(x*)]/ε) + √(nL/ε) ‖x̃^0 − x*‖) iterations.

This theorem shows that the oracle complexity of Algorithm 3 with Option II is consistent with that of Katyusha [48], and is better than that of accelerated deterministic methods (e.g., AGD [20]), i.e., O(n√(L/ε)), which is also verified by the experimental results in Fig. 2. Our algorithm thus achieves the optimal convergence rate O(1/T²) for non-strongly convex functions, as in [48], [49]. Fig. 2 shows that Katyusha and Algorithm 3 with Option II have similar performance to Algorithms 2 and 3 with Option I (η_0 = 3/(5L)) in terms of the number of effective passes. Clearly, Algorithm 3 and Katyusha have a higher per-iteration complexity than Algorithm 2. Thus, we only report the results of VR-SGD (i.e., Algorithm 2) in Section 5.

4.4 Complexity Analysis

From Algorithm 2, we can see that the per-iteration cost of VR-SGD is dominated by the computation of ∇f_{i_k^s}(x_k^s), ∇f_{i_k^s}(x̃^{s−1}), and ∇g(x_k^s) or the proximal update in (13). Thus, the per-iteration complexity is O(d), which is as low as that of SVRG [1]

Fig. 2. Comparison of AGD [20], Katyusha [48], Algorithm 3 with Options I and II, and VR-SGD for solving logistic regression with λ = 0.



and Prox-SVRG [34]. In fact, for some ERM problems, we can save the intermediate gradients ∇f_i(x̃^{s−1}) computed while forming μ̃^s, which generally requires O(n) additional storage. As a result, each epoch requires only (n + m) component gradient evaluations. In addition, for extremely sparse data, we can introduce the lazy update tricks in [38], [76], [77] into our algorithm, and perform the update steps in (12) and (13) only on the non-zero dimensions of each sample, rather than all dimensions. In other words, the per-iteration complexity of VR-SGD can be improved from O(d) to O(d′), where d′ ≤ d is the sparsity of the feature vectors. Moreover, VR-SGD has a much lower per-iteration complexity than existing accelerated stochastic variance reduction methods such as Katyusha [48], which have more update steps for additional variables, as shown in (9a), (9b), and (9c).

5 EXPERIMENTAL RESULTS

In this section, we evaluate the performance of VR-SGD for solving a number of convex and non-convex ERM problems (such as logistic regression, Lasso, and ridge regression), and compare its performance with several state-of-the-art stochastic variance reduced methods (including SVRG [1], Prox-SVRG [34], and SAGA [31]) and accelerated methods, such as Catalyst [40] and Katyusha [48]. Moreover, we apply VR-SGD to solve other machine learning problems, such as ERM with non-convex loss and leading eigenvalue computation.

5.1 Experimental Setup

We used several publicly available data sets in the experiments: Adult (also called a9a), Covtype, Epsilon, MNIST, and RCV1, all of which can be downloaded from the LIBSVM Data website.4 It should be noted that each sample of these data sets was normalized to unit length as in [34], [36], which leads to the same upper bound on the Lipschitz constants L_i, i.e., L = L_i for all i = 1, ..., n. As suggested in [1], [34], [48], the epoch length is set to m = 2n for the stochastic variance reduced methods SVRG [1], Prox-SVRG [34], Catalyst [40], and Katyusha [48], as well as for VR-SGD. Then the only parameter we have to tune by hand is the learning rate, η. More specifically, we select learning rates from {10^j, 2.5×10^j, 5×10^j, 7.5×10^j, 10^{j+1}}, where j ∈ {−2, −1, 0}. Since Katyusha has a much higher per-iteration complexity than SVRG and VR-SGD, we compare their performance in terms of both the number of effective passes and running time (seconds), where computing a single full gradient or evaluating n component gradients is counted as one effective pass over the data. For fair comparison, we implemented SVRG, Prox-SVRG, SAGA, Catalyst, Katyusha, and VR-SGD in C++ with a Matlab interface, as well as their sparse versions with lazy update tricks, and performed all the experiments on a PC with an Intel i5-4570 CPU and 16GB RAM. The source code of all the methods is available at https://github.com/jnhujnhu/VR-SGD.

5.2 Deterministic Methods versus Stochastic Methods

In this section, we compare the performance of stochastic methods (including SGD, SVRG, Katyusha, and VR-SGD) with that of deterministic methods such as AGD [19], [20] and APG [22] for solving strongly and non-strongly convex problems. Note that the momentum parameter ω of AGD is ω = (√L − √μ)/(√L + √μ) as in [78], while that of APG is defined as follows: ω_k = (α_k − 1)/α_{k+1} for all k ≥ 1 [22], where α_{k+1} = (1 + √(1 + 4α_k²))/2 and α_1 = 1.

Fig. 3 shows the objective gap (i.e., F(x^s) − F(x*)) of these deterministic and stochastic methods for solving ℓ2-norm and ℓ1-norm regularized logistic regression problems (see the Supplementary Material for more results, available online). It can be seen that the accelerated deterministic methods and SGD have similar convergence speed, and APG usually performs slightly better than SGD for non-strongly convex problems. The variance reduction methods (e.g., SVRG, Katyusha, and VR-SGD) significantly outperform the accelerated deterministic methods and SGD for both strongly and non-strongly convex cases, suggesting the importance of variance reduction techniques. Although accelerated deterministic methods have a faster theoretical rate than SVRG for general convex problems, as discussed in Section 1.2, APG converges much more slowly in practice. VR-SGD consistently outperforms the other methods (including Katyusha) in all the settings, which verifies the effectiveness of VR-SGD.

5.3 Different Choices for Snapshot and Starting Points

In the practical implementation of SVRG [1], both the snapshot x̃^s and starting point x_0^{s+1} in each epoch are set to the last iterate x_m^s of the previous epoch (i.e., Option I in Algorithm 1), while the two vectors in [34] are set to the average point of the previous epoch, (1/m) Σ_{k=1}^m x_k^s (i.e., Option II in Algorithm 1). In contrast, x̃^s and x_0^{s+1} in our algorithm are set to (1/m) Σ_{k=1}^m x_k^s and x_m^s, respectively (denoted by Option III, i.e., Option I in Algorithm 2; see footnote 5).

We compare the performance of the algorithms with the three settings (i.e., Options I, II and III listed in Table 2) for solving ridge regression and Lasso problems, as shown in Fig. 4 (see the Supplementary Material for more results,

Fig. 3. Comparison of deterministic and stochastic methods on Adult.

4. https://www.csie.ntu.edu.tw/~cjlin/libsvm/
5. As Options I and II in Algorithm 2 achieve very similar performance, we only report the results of our algorithm with Option I.



available online). Except for the three different settings for snapshot and starting points, we use the update rules in (12) and (13) for the ridge regression and Lasso problems, respectively. We can see that the algorithm with Option III (i.e., Algorithm 2 with Option I) consistently converges much faster than the algorithms with Options I and II for both strongly convex and non-strongly convex cases. This indicates that the setting of Option III suggested in this paper is a better choice than Options I and II for stochastic optimization.

5.4 Common SG Updates versus Prox-SG Updates

In this section, we compare the original Katyusha algorithm in [48] with a slightly modified Katyusha algorithm (denoted by Katyusha-I). In Katyusha-I, the following two update rules replace the original proximal stochastic gradient update rules in (9b) and (9c):

y_{k+1}^s = y_k^s − η [∇̃f_{i_k}(x_{k+1}^s) + ∇g(x_{k+1}^s)],
z_{k+1}^s = x_{k+1}^s − [∇̃f_{i_k}(x_{k+1}^s) + ∇g(x_{k+1}^s)] / (3L).   (16)

Similarly, we also implement the proximal versions6 of the original SVRG (called SVRG-I) and VR-SGD (denoted by VR-SGD-I), and denote these proximal variants by SVRG-II and VR-SGD-II, respectively. In addition, the original Katyusha is denoted by Katyusha-II.

Fig. 5 shows the performance of Katyusha-I and Katyusha-II for solving ridge regression on the two popular data sets Adult and Covtype. We also report the results of SVRG, VR-SGD, and their proximal variants. It is clear that Katyusha-I usually performs better than Katyusha-II (i.e., the original Katyusha [48]), and converges significantly faster when the regularization parameter is 10^{-4} or 10^{-6}. This seems to be the main reason why Katyusha has inferior performance when the regularization parameter is relatively large, as shown in Section 5.6.1. In contrast, VR-SGD and its proximal variant have similar performance, and the former slightly outperforms the latter in most cases (similar results are also observed for SVRG versus its proximal variant). This suggests that stochastic gradient update rules as in (12) and (16) are better choices than proximal update rules as in (10), (9b) and (9c) for smooth objective functions. We also believe that this new insight can help in designing accelerated stochastic optimization methods.

Both Katyusha-I and Katyusha-II usually outperform SVRG and its proximal variant, especially when the regularization parameter is relatively small, e.g., λ = 10^{-6}. Moreover, it can be seen that both VR-SGD and its proximal variant achieve much better performance than the other methods in most cases, and are comparable to Katyusha-I and Katyusha-II in the remaining cases. This further verifies that VR-SGD is suitable for various large-scale machine learning problems.

5.5 Growing Epoch Size Strategy in Early Iterations

In this section, we present a general strategy for growing the epoch size in early iterations: if m_s < 2n, then m_{s+1} = ⌊ρ m_s⌋ with a factor ρ > 1; otherwise, m_{s+1} = m_s. Different from the doubling-epoch technique used in SVRG++ [44] (i.e., m_{s+1} = 2m_s), we gradually increase the epoch size only in the early iterations. Similar to the convergence analysis in Section 4, VR-SGD with the growing epoch size strategy (called VR-SGD++) can be guaranteed to converge. As suggested in [44], we set m_1 = ⌊n/4⌋ for both SVRG++ and VR-SGD++, and ρ = 1.75 for VR-SGD++. Note that they use the same initial learning rate. We compare their performance for solving ℓ2-norm regularized logistic regression, as shown in Fig. 6 (see the Supplementary Material for more results, available online). All the results show that VR-SGD++ converges faster than VR-SGD, which means that reducing the number of gradient calculations in early iterations can lead to faster convergence, as discussed in [43]. Moreover, both VR-SGD++ and VR-SGD significantly outperform SVRG++, especially when the regularization parameter is relatively small, e.g., λ = 10^{-6}.
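The growing epoch size schedule can be sketched as follows (function name is our own; the rule and constants m_1 = ⌊n/4⌋, ρ = 1.75 follow the description above):

```python
def epoch_sizes(n, num_epochs, rho=1.75, m1=None):
    """Growing epoch size schedule of Section 5.5 (VR-SGD++ sketch):
    if m_s < 2n then m_{s+1} = floor(rho * m_s), otherwise m_{s+1} = m_s."""
    m = m1 if m1 is not None else n // 4   # m_1 = floor(n/4), as suggested in [44]
    sizes = []
    for _ in range(num_epochs):
        sizes.append(m)
        if m < 2 * n:
            m = int(rho * m)
    return sizes
```

For example, with n = 1000 the schedule starts at 250 and grows by a factor of 1.75 per epoch until the epoch size reaches 2n, after which it stays constant.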

5.6 Real-World Applications

In this section, we apply VR-SGD to solve a number of machine learning problems, e.g., logistic regression, non-convex ERM, and eigenvalue computation.

5.6.1 Convex Logistic Regression

In this part, we focus on the following generalized logistic regression problem for binary classification:

min_{x∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−b_i a_i^T x)) + (λ_1/2)‖x‖² + λ_2‖x‖_1,   (17)

where {(a_i, b_i)} is a set of training examples, and λ_1, λ_2 ≥ 0 are the regularization parameters. Note that when λ_2 > 0, f_i(x) = log(1 + exp(−b_i a_i^T x)) + (λ_1/2)‖x‖². The formulation (17) includes the ℓ2-norm (i.e., λ_2 = 0), ℓ1-norm (i.e., λ_1 = 0), and elastic net (i.e., λ_1 ≠ 0 and λ_2 ≠ 0) regularized logistic regression problems. Fig. 7 shows how the objective gap decreases for the ℓ2-norm, ℓ1-norm, and elastic-net regularized logistic regression problems, respectively (see the Supplementary Material for more results, available online). From all the results, we make the following observations.

• When the problems are well-conditioned (e.g., λ_1 = 10^{-4} or λ_2 = 10^{-4}), Prox-SVRG usually converges faster than SVRG for both strongly convex (e.g., ℓ2-norm regularized logistic regression) and non-strongly convex (e.g., ℓ1-norm regularized logistic regression) cases. On the contrary, SVRG often outperforms Prox-SVRG when the problems are ill-conditioned, e.g., λ_1 = 10^{-6} or λ_2 = 10^{-6} (see Figs. 11 and 12 in the Supplementary Material, available online). The main reason is that they have different initialization settings, i.e., x̃^s = x_m^s and x_0^{s+1} = x_m^s for SVRG versus x̃^s = (1/m) Σ_{k=1}^m x_k^s and x_0^{s+1} = (1/m) Σ_{k=1}^m x_k^s for Prox-SVRG.

• Katyusha converges much faster than SAGA, SVRG, Prox-SVRG, and Catalyst in the cases when the

TABLE 2
The Three Choices of Snapshot and Starting Points for Stochastic Variance Reduction Optimization

Option I:   x̃^s = x_m^s and x_0^{s+1} = x_m^s
Option II:  x̃^s = (1/m) Σ_{k=1}^m x_k^s and x_0^{s+1} = (1/m) Σ_{k=1}^m x_k^s
Option III: x̃^s = (1/m) Σ_{k=1}^m x_k^s and x_0^{s+1} = x_m^s

6. Here, the proximal variant of SVRG is different from Prox-SVRG [34]; their main difference lies in the choice of the snapshot and starting points. That is, the two vectors of the former are set to the last iterate x_m^s, while those of Prox-SVRG are set to the average point of the previous epoch, i.e., (1/m) Σ_{k=1}^m x_k^s.



problems are ill-conditioned, e.g., λ_1 = 10^{-6}, whereas it often achieves similar or inferior performance when the problems are well-conditioned, e.g., λ_1 = 10^{-4} (see Figs. 11, 12 and 13 in the Supplementary Material, available online). Note that we implemented the original algorithm with Option I in [48] for Katyusha. Obviously, the above observation matches the convergence properties of Katyusha provided in [48], that is, only if mμ/L ≤ 3/4 does Katyusha attain the oracle complexity of O((n + √(nL/μ)) log(1/ε)) for strongly convex problems.

• VR-SGD converges significantly faster than SAGA, SVRG, Prox-SVRG and Catalyst, especially when the problems are ill-conditioned, e.g., λ_1 = 10^{-6} or λ_2 = 10^{-6} (see Figs. 11 and 12 in the Supplementary Material, available online). The main reason is that VR-SGD can use much larger learning rates than they can (e.g., 1/L for VR-SGD versus 1/(10L) for SVRG), which leads to faster convergence. This further verifies that the settings of both the snapshot and starting points in our algorithm (i.e., Algorithm 2) are better choices than Options I and II in Algorithm 1.

• In particular, VR-SGD consistently outperforms the best-known stochastic method, Katyusha, in terms of the number of passes through the data, especially when the problems are well-conditioned, e.g., 10^{-4} and 10^{-5} (see Figs. 11 and 12 in the Supplementary Material, available online). Since VR-SGD has a much lower per-iteration complexity than Katyusha, VR-SGD has an even more obvious advantage over Katyusha in terms of running time, especially in the case of sparse data (e.g., RCV1), as shown in Fig. 7b. From the algorithms of Katyusha proposed in [48], we can see that the learning rate of Katyusha is set to at least 1/(3L). Similarly, the learning rate used in VR-SGD is comparable to that of Katyusha, which may be the main reason why the performance of VR-SGD is much better than that of Katyusha. This also implies that algorithms (including VR-SGD) that admit larger learning rates generally converge faster.

5.6.2 ERM with Non-Convex Loss

In this part, we apply VR-SGD to solve the following regularized ERM problem with non-convex sigmoid loss:

min_{x∈R^d} (1/n) Σ_{i=1}^n f_i(x) + (λ/2)‖x‖²,   (18)

where f_i(x) = 1/[1 + exp(b_i a_i^T x)]. Some work [4], [79] has shown that the sigmoid loss usually generalizes better than some other loss functions (such as the squared, logistic and hinge losses) in terms of test accuracy, especially when there are outliers. Here, we consider binary classification on the four data sets Adult, MNIST, Covtype and RCV1. Note that we only consider classifying the first class in MNIST.
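The component gradient needed to plug (18) into a variance reduced method follows from the chain rule: with s = 1/(1 + exp(b_i a_i^T x)), one gets ∇f_i(x) = −b_i s(1−s) a_i. The sketch below (our own naming and derivation, not the paper's code) implements the objective and per-component gradient:

```python
import numpy as np

def sigmoid_loss(x, A, b, lam):
    """Objective (18): (1/n) sum_i 1/(1 + exp(b_i a_i^T x)) + (lam/2)||x||^2."""
    z = b * (A @ x)
    return np.mean(1.0 / (1.0 + np.exp(z))) + 0.5 * lam * (x @ x)

def sigmoid_loss_grad_i(x, a_i, b_i, lam):
    """Gradient of one component f_i plus the l2 term: with
    s = 1/(1 + exp(b_i a_i^T x)), d f_i / dx = -b_i * s * (1 - s) * a_i."""
    s = 1.0 / (1.0 + np.exp(b_i * (a_i @ x)))
    return -b_i * s * (1.0 - s) * a_i + lam * x
```

Averaging `sigmoid_loss_grad_i` over all i gives the full gradient of `sigmoid_loss`, so these two functions can serve as `grad_fi` and the full-gradient oracle in a variance reduced loop.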

We compare the performance (including training objective value and function suboptimality, i.e., F(x^s) − F(x*)) of VR-SGD with that of SAGA [58], SVRG [4], and SVRG++ [44], as shown in Fig. 8 (more results are provided in the Supplementary Material, available online), where x* denotes the best solution obtained by running all those methods for a large number of iterations and multiple random initializations.

Fig. 4. Comparison of the algorithms with Options I, II, and III for solving ridge regression and Lasso on Covtype. In each plot, the vertical axis shows the objective value minus the minimum, and the horizontal axis denotes the number of effective passes.

Fig. 5. Comparison of SVRG [1], Katyusha [48], VR-SGD, and their proximal versions for solving ridge regression problems. In each plot, the vertical axis shows the objective value minus the minimum, and the horizontal axis is the number of effective passes over the data.

Fig. 6. Comparison of SVRG++ [44], VR-SGD, and VR-SGD++ for solving logistic regression problems on Epsilon.



Note that SAGA and SVRG here are variants of the original SAGA [31] and SVRG [1], respectively. The results show that our VR-SGD method converges faster than the other methods, and its objective value is much lower. This implies that VR-SGD can yield much better solutions than the other methods, including SVRG++. Furthermore, we can see that VR-SGD's advantage over the other methods grows as λ becomes smaller, i.e., as the objective function becomes more "non-convex".

Moreover, we report the classification test accuracies of all those methods on the test sets of Adult and MNIST in Fig. 9, as the number of effective passes over the data increases. Note that the regularization parameter is set to λ = 10^{-4}. It can be seen that our VR-SGD method obtains higher test accuracies than the other methods with much shorter running time, suggesting faster convergence.

5.6.3 Eigenvalue Computation

Finally, we apply VR-SGD to solve the following non-convex leading eigenvalue computation problem:

min_{x ∈ R^d: x^T x = 1}  −x^T ( (1/n) ∑_{i=1}^{n} a_i a_i^T ) x.   (19)
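Problem (19) asks for the leading eigenvector of the matrix C = (1/n) ∑_i a_i a_i^T. As a point of reference for the experiments below, here is a minimal NumPy sketch of the objective and of the classical power iteration baseline (not the authors' VR-SGD solver); the function names are our own.

```python
import numpy as np

def leading_eigvec_power(A, num_iters=100, seed=0):
    """Classical power iteration for problem (19).

    A is the d x n data matrix whose columns are the samples a_i, so the
    matrix in (19) is C = (1/n) * A @ A.T; we never form C explicitly.
    """
    d, n = A.shape
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(num_iters):
        x = A @ (A.T @ x) / n   # multiply by C without materializing it
        x /= np.linalg.norm(x)  # project back onto the unit sphere
    return x

def objective(A, x):
    """Objective value of (19), i.e., -x^T C x (to be minimized)."""
    v = A.T @ x
    return -(v @ v) / A.shape[1]
```

Minimizing (19) over the unit sphere is equivalent to maximizing the Rayleigh quotient of C, which is why normalized power iteration serves as the deterministic baseline in the comparison below.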

We plot the performance of the classical power iteration method, VR-PCA [17], and VR-SGD on Epsilon and RCV1 in Fig. 10, where the relative error is defined as in [17], i.e., log_10(1 − ||A^T x||^2 / max_{u: u^T u = 1} ||A^T u||^2), and A ∈ R^{d×n} is the data matrix. Note that the epoch length is set to m = n for VR-PCA and VR-SGD, as suggested in [17], and both of them use a constant learning rate. The results show that the stochastic variance reduced methods, VR-PCA and VR-SGD, significantly outperform the traditional method, power iteration. Moreover, our VR-SGD method often converges much faster than VR-PCA.
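For concreteness, the relative-error metric defined above can be computed as follows. This is a sketch under the stated definition (the max term equals the squared largest singular value of A); the clamp is our own addition to guard log10(0) when x is exactly the leading eigenvector.

```python
import numpy as np

def relative_error(A, x):
    """Relative error as defined in [17]:
    log10(1 - ||A^T x||^2 / max_{u: u^T u = 1} ||A^T u||^2),
    where the max over unit vectors u is the squared largest
    singular value of A."""
    num = np.linalg.norm(A.T @ x) ** 2
    sigma_max_sq = np.linalg.norm(A, 2) ** 2   # squared spectral norm
    return np.log10(max(1.0 - num / sigma_max_sq, np.finfo(float).tiny))
```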

6 CONCLUSIONS

We proposed a simple variant of the original SVRG [1], called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG

Fig. 7. Comparison of SAGA [31], SVRG [1], Prox-SVRG [34], Catalyst [40], Katyusha [48], and VR-SGD for solving ℓ2-norm (the first row), ℓ1-norm (the second row), and elastic net (the third row) regularized logistic regression problems. In each plot, the vertical axis shows the objective value minus the minimum, and the horizontal axis is the number of effective passes (left) or running time (right).

Fig. 8. Comparison of SAGA [58], SVRG [4], SVRG++ [44], and VR-SGD for solving non-convex ERM problems with sigmoid loss: λ = 10^-5 (top) and λ = 10^-6 (bottom). Note that x* denotes the best solution obtained by running all those methods for a large number of iterations and multiple random initializations.

198 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 32, NO. 1, JANUARY 2020


and Prox-SVRG [34], the two points of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. This setting allows us to use much larger learning rates than SVRG, e.g., 1/L for VR-SGD versus 1/(10L) for SVRG, and also makes VR-SGD much more robust to learning rate selection. Different from existing proximal stochastic methods such as Prox-SVRG and Katyusha [48], we designed two different update rules for smooth and non-smooth problems, respectively, which makes VR-SGD suitable for non-smooth and/or non-strongly convex problems without using any reduction techniques as in [68]. Our empirical results also showed that for smooth problems, stochastic gradient update rules as in (12) are better choices than proximal update formulas as in (10).
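The epoch structure just described can be sketched as follows. This is a hedged reconstruction from the prose above (snapshot = average of the previous epoch's iterates, starting point = its last iterate), not the paper's exact pseudocode; the names vr_sgd, grad_i, and full_grad are ours, and the plain gradient step stands in for the smooth-case update rule (12).

```python
import numpy as np

def vr_sgd(grad_i, full_grad, x0, n, eta, epochs=20, m=None, seed=0):
    """Sketch of the VR-SGD epoch structure for a smooth objective
    f(x) = (1/n) * sum_i f_i(x): the snapshot of each epoch is the
    AVERAGE of the previous epoch's iterates, while the starting point
    is its LAST iterate (SVRG uses a single point for both roles).

    grad_i(x, i) returns the gradient of f_i; full_grad(x) returns the
    gradient of f; eta is a constant learning rate.
    """
    rng = np.random.default_rng(seed)
    m = m if m is not None else 2 * n     # epoch length
    x, snapshot = x0.copy(), x0.copy()
    for _ in range(epochs):
        mu = full_grad(snapshot)          # full gradient at the snapshot
        iterates = []
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient, unbiased for grad f(x)
            g = grad_i(x, i) - grad_i(snapshot, i) + mu
            x = x - eta * g
            iterates.append(x)
        snapshot = np.mean(iterates, axis=0)   # average -> next snapshot
        # x, the last iterate, carries over as the next starting point
    return x
```

On a toy least-squares instance (grad_i(x, i) = a_i (a_i^T x − b_i)), this sketch converges to the least-squares solution with a constant learning rate on the order of 1/L, illustrating the larger step sizes the snapshot/starting-point choice permits.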

On the practical side, the choices of the snapshot and starting points make VR-SGD significantly faster than its counterparts, SVRG and Prox-SVRG. On the theoretical side, this setting also makes our convergence analysis more challenging. We analyzed the convergence properties of VR-SGD for strongly convex objective functions, which show that VR-SGD attains a linear convergence rate. Moreover, we provided the convergence guarantees of VR-SGD for non-strongly convex functions, and our experimental results showed that VR-SGD achieves similar performance to its momentum accelerated variant, which has the optimal convergence rate O(1/T^2). In contrast, SVRG and Prox-SVRG cannot directly solve non-strongly convex functions [44]. Various experimental results show that VR-SGD significantly outperforms state-of-the-art variance reduction methods such as SAGA [58], SVRG [1], and Prox-SVRG [34], and also achieves better or at least comparable performance with recently proposed acceleration methods, e.g., Catalyst [40] and Katyusha [48]. Since VR-SGD has a much lower per-iteration complexity than accelerated methods (e.g., Katyusha), it has a more obvious advantage over them in terms of running time, especially for high-dimensional sparse data. This further verifies that VR-SGD is suitable for various large-scale machine learning problems. Furthermore, as the update rules of VR-SGD are much simpler than those of existing accelerated stochastic variance reduction methods such as Katyusha, it is more amenable to asynchronous parallel and distributed implementation, similar to [59], [60], [62].

ACKNOWLEDGMENTS

Fanhua Shang, Hongying Liu, and Licheng Jiao were supported in part by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), the National Natural Science Foundation of China (Nos. 61876220, 61836009, U1701267, 61871310, 61573267, 61502369, 61876221 and 61473215), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), and the Science Foundation of Xidian University (No. 10251180018). James Cheng was supported in part by Grants (CUHK 14206715 & 14222816) from the Hong Kong RGC. Ivor Tsang was supported by ARC FT130100746, DP180100106, and LP150100671. Lijun Zhang was supported by JiangsuSF (BK20160658). Dacheng Tao was supported by Australian Research Council Projects FL-170100117, DP-180103424, and IH180100002.

REFERENCES

[1] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 315–323.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[3] S. Zhang, A. Choromanska, and Y. LeCun, "Deep learning with elastic averaging SGD," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 685–693.

[4] Z. Allen-Zhu and E. Hazan, "Variance reduction for faster non-convex optimization," in Proc. Int. Conf. Mach. Learn., 2016, pp. 699–707.

[5] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, "Stochastic variance reduction for nonconvex optimization," in Proc. Int. Conf. Mach. Learn., 2016, pp. 314–323.

[6] Y. Liu, F. Shang, and J. Cheng, "Accelerated variance reduced stochastic ADMM," in Proc. AAAI Conf. Artif. Intell., 2017, pp. 2287–2293.

[7] X. Li, T. Zhao, R. Arora, H. Liu, and J. Haupt, "Nonconvex sparse learning via stochastic optimization with progressive variance reduction," in Proc. Int. Conf. Mach. Learn., 2016, pp. 917–925.

[8] W. Zhang, L. Zhang, Z. Jin, R. Jin, D. Cai, X. Li, R. Liang, and X. He, "Sparse learning with stochastic composite optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1223–1236, Jun. 2017.

[9] C. Qu, Y. Li, and H. Xu, "Linear convergence of SVRG in statistical estimation," arXiv preprint, arXiv:1611.01957v3, 2017.

[10] C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui, "Catalyst for gradient-based nonconvex optimization," in Proc. Int. Conf. Artif. Intell. Statist., 2018, pp. 613–622.

[11] H. Kasai, "Stochastic variance reduced multiplicative update for nonnegative matrix factorization," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2018, pp. 6338–6342.

[12] J. Duchi and F. Ruan, "Stochastic methods for composite optimization problems," arXiv preprint, arXiv:1703.08570, 2017.

[13] B. Recht and C. Ré, "Parallel stochastic gradient algorithms for large-scale matrix completion," Math. Program. Comput., vol. 5, pp. 201–226, 2013.

[14] L. Wang, X. Zhang, and Q. Gu, "A unified variance reduction-based framework for nonconvex low-rank matrix recovery," in Proc. Int. Conf. Mach. Learn., 2017, pp. 3712–3721.

[15] M. Schmidt, R. Babanezhad, M. Ahmed, A. Defazio, A. Clifton, and A. Sarkar, "Non-uniform stochastic average gradient method for training conditional random fields," in Proc. Int. Conf. Artif. Intell. Statist., 2015, pp. 819–828.

[16] Z. Allen-Zhu and Y. Li, "Doubly accelerated methods for faster CCA and generalized eigendecomposition," in Proc. Int. Conf. Mach. Learn., 2017, pp. 98–106.

Fig. 9. Testing accuracy comparison on Adult (left) and MNIST (right).

Fig. 10. Relative error comparison of leading eigenvalue computation on RCV1 (left) and Epsilon (right).



[17] O. Shamir, "A stochastic PCA and SVD algorithm with an exponential convergence rate," in Proc. Int. Conf. Mach. Learn., 2015, pp. 144–152.

[18] D. Garber, E. Hazan, C. Jin, S. M. Kakade, C. Musco, P. Netrapalli, and A. Sidford, "Faster eigenvector computation via shift-and-invert preconditioning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 2626–2634.

[19] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Math. Doklady, vol. 27, pp. 372–376, 1983.

[20] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Boston, MA, USA: Kluwer, 2004.

[21] P. Tseng, "On accelerated proximal gradient methods for convex-concave optimization," Tech. Rep., University of Washington, 2008.

[22] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.

[23] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, 1951.

[24] T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in Proc. Int. Conf. Mach. Learn., 2004, pp. 919–926.

[25] C. Hu, J. T. Kwok, and W. Pan, "Accelerated gradient methods for stochastic optimization and online learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2009, pp. 781–789.

[26] S. Bubeck, "Convex optimization: Algorithms and complexity," Found. Trends Mach. Learn., vol. 8, pp. 231–358, 2015.

[27] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, 2017.

[28] A. Rakhlin, O. Shamir, and K. Sridharan, "Making gradient descent optimal for strongly convex stochastic optimization," in Proc. Int. Conf. Mach. Learn., 2012, pp. 449–456.

[29] N. L. Roux, M. Schmidt, and F. Bach, "A stochastic gradient method with an exponential convergence rate for finite training sets," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 2672–2680.

[30] S. Shalev-Shwartz and T. Zhang, "Stochastic dual coordinate ascent methods for regularized loss minimization," J. Mach. Learn. Res., vol. 14, pp. 567–599, 2013.

[31] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 1646–1654.

[32] Y. Zhang and L. Xiao, "Stochastic primal-dual coordinate method for regularized empirical risk minimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 353–361.

[33] M. Schmidt, N. L. Roux, and F. Bach, "Minimizing finite sums with the stochastic average gradient," Math. Program., vol. 162, pp. 83–112, 2017.

[34] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM J. Optim., vol. 24, no. 4, pp. 2057–2075, 2014.

[35] S. Shalev-Shwartz and T. Zhang, "Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization," Math. Program., vol. 155, pp. 105–145, 2016.

[36] F. Shang, Y. Liu, J. Cheng, and J. Zhuo, "Fast stochastic variance reduced gradient method with momentum acceleration for machine learning," arXiv preprint, arXiv:1703.07948, 2017.

[37] L. Zhang, M. Mahdavi, and R. Jin, "Linear convergence with condition number independent access of full gradients," in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 980–988.

[38] J. Konečný, J. Liu, P. Richtárik, and M. Takáč, "Mini-batch semi-stochastic gradient descent in the proximal setting," IEEE J. Sel. Topics Signal Process., vol. 10, no. 2, pp. 242–255, Mar. 2016.

[39] A. Nitanda, "Stochastic proximal gradient descent with acceleration techniques," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 1574–1582.

[40] H. Lin, J. Mairal, and Z. Harchaoui, "A universal catalyst for first-order optimization," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 3366–3374.

[41] R. Frostig, R. Ge, S. M. Kakade, and A. Sidford, "Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 2540–2548.

[42] G. Lan and Y. Zhou, "An optimal randomized incremental gradient method," Math. Program., vol. 171, pp. 167–215, 2018.

[43] R. Babanezhad, M. O. Ahmed, A. Virani, M. Schmidt, J. Konečný, and S. Sallinen, "Stop wasting my gradients: Practical SVRG," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2242–2250.

[44] Z. Allen-Zhu and Y. Yuan, "Improved SVRG for non-strongly-convex or sum-of-non-convex objectives," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1080–1089.

[45] M. Frank and P. Wolfe, "An algorithm for quadratic programming," Naval Res. Logistics Quart., vol. 3, pp. 95–110, 1956.

[46] E. Hazan and H. Luo, "Variance-reduced and projection-free stochastic optimization," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1263–1271.

[47] F. Shang, Y. Liu, J. Cheng, K. W. Ng, and Y. Yoshida, "Guaranteed sufficient decrease for stochastic variance reduced gradient optimization," in Proc. 21st Int. Conf. Artif. Intell. Statist., 2018, pp. 1027–1036.

[48] Z. Allen-Zhu, "Katyusha: The first direct acceleration of stochastic gradient methods," J. Mach. Learn. Res., vol. 18, no. 221, pp. 1–51, 2018.

[49] L. Hien, C. Lu, H. Xu, and J. Feng, "Accelerated stochastic mirror descent algorithms for composite non-strongly convex optimization," arXiv preprint, arXiv:1605.06892v4, 2017.

[50] K. Zhou, F. Shang, and J. Cheng, "A simple stochastic variance reduced algorithm with fast convergence rates," in Proc. Int. Conf. Mach. Learn., 2018, pp. 5975–5984.

[51] B. Woodworth and N. Srebro, "Tight complexity bounds for optimizing composite objectives," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 3639–3647.

[52] A. Defazio, "A simple practical accelerated method for finite sums," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 676–684.

[53] L. Liu, J. Liu, and D. Tao, "Variance reduced methods for non-convex composition optimization," arXiv preprint, arXiv:1711.04416v1, 2017.

[54] L. Lei, C. Ju, J. Chen, and M. I. Jordan, "Nonconvex finite-sum optimization via SCSG methods," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 2345–2355.

[55] L. Liu, M. Cheng, C.-J. Hsieh, and D. Tao, "Stochastic zeroth-order optimization via variance reduction method," arXiv preprint, arXiv:1805.11811, 2018.

[56] L. Liu, J. Liu, C.-J. Hsieh, and D. Tao, "Stochastically controlled stochastic gradient for the convex and non-convex composition problem," arXiv preprint, arXiv:1809.02505, 2018.

[57] L. Liu, J. Liu, and D. Tao, "Duality-free methods for stochastic composition optimization," IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2018.2866699.

[58] S. J. Reddi, S. Sra, B. Póczos, and A. Smola, "Fast incremental method for smooth nonconvex optimization," in Proc. Conf. Decision Control, 2016, pp. 1971–1977.

[59] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan, "Perturbed iterate analysis for asynchronous stochastic optimization," SIAM J. Optim., vol. 27, no. 4, pp. 2202–2229, 2017.

[60] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, "On variance reduction in stochastic gradient descent and its asynchronous variants," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2629–2637.

[61] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 5336–5346.

[62] J. D. Lee, Q. Lin, T. Ma, and T. Yang, "Distributed stochastic variance reduced gradient methods by sampling extra data with replacement," J. Mach. Learn. Res., vol. 18, pp. 1–43, 2017.

[63] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.

[64] R. Ge, F. Huang, C. Jin, and Y. Yuan, "Escaping from saddle points – Online stochastic gradient for tensor decomposition," in Proc. Conf. Learn. Theory, 2015, pp. 797–842.

[65] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, "Adding gradient noise improves learning for very deep networks," arXiv preprint, arXiv:1511.06807, 2015.

[66] N. Flammarion and F. Bach, "From averaging to acceleration, there is only a step-size," in Proc. Conf. Learn. Theory, 2015, pp. 658–695.

[67] C. Tan, S. Ma, Y. Dai, and Y. Qian, "Barzilai-Borwein step size for stochastic gradient descent," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 685–693.



[68] Z. Allen-Zhu and E. Hazan, "Optimal black-box reductions between optimization objectives," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 1606–1614.

[69] Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., vol. 103, pp. 127–152, 2005.

[70] Y. Xu, Y. Yan, Q. Lin, and T. Yang, "Homotopy smoothing for non-smooth problems with lower complexity than O(1/ε)," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 1208–1216.

[71] P. Zhao and T. Zhang, "Stochastic optimization with importance sampling for regularized loss minimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1–9.

[72] D. Needell, N. Srebro, and R. Ward, "Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm," Math. Program., vol. 155, pp. 549–573, 2016.

[73] O. Shamir, "Without-replacement sampling for stochastic gradient methods," in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 46–54.

[74] J. Konečný and P. Richtárik, "Semi-stochastic gradient descent methods," Optim. Methods Softw., vol. 32, no. 5, pp. 993–1005, 2017.

[75] F. Shang, L. Jiao, K. Zhou, J. Cheng, Y. Ren, and Y. Jin, "ASVRG: Accelerated proximal SVRG," in Proc. Mach. Learn. Res., 2018, vol. 95, pp. 815–830.

[76] B. Carpenter, "Lazy sparse stochastic gradient descent for regularized multinomial logistic regression," Tech. Rep., Alias-i, Inc., pp. 1–20, 2008.

[77] J. Langford, L. Li, and T. Zhang, "Sparse online learning via truncated gradient," J. Mach. Learn. Res., vol. 10, pp. 777–801, 2009.

[78] W. Su, S. Boyd, and E. J. Candès, "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights," J. Mach. Learn. Res., vol. 17, pp. 1–43, 2016.

[79] S. Shalev-Shwartz, O. Shamir, and K. Sridharan, "Learning kernel-based halfspaces with the 0-1 loss," SIAM J. Comput., vol. 40, no. 6, pp. 1623–1646, 2011.

Fanhua Shang (M'14) received the PhD degree in circuits and systems from Xidian University, Xi'an, China, in 2012. He is currently a professor with the School of Artificial Intelligence, Xidian University, China. Prior to joining Xidian University, he was a research associate with the Department of Computer Science and Engineering, Chinese University of Hong Kong. From 2013 to 2015, he was a post-doctoral research fellow with the Department of Computer Science and Engineering, Chinese University of Hong Kong. From 2012 to 2013, he was a post-doctoral research associate with the Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina. His current research interests include machine learning, data mining, pattern recognition, and computer vision. He is a member of the IEEE.

Kaiwen Zhou received the BS degree in computer science and technology from Fudan University, in 2017. He is currently working toward the master's degree in the Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong. His current research interests include data mining, stochastic optimization for machine learning, large-scale machine learning, etc.

Hongying Liu (M'10) received the BE and MS degrees in computer science and technology from the Xi'an University of Technology, China, in 2006 and 2009, respectively, and the PhD degree in engineering from Waseda University, Japan, in 2012. Currently, she is a faculty member with the School of Artificial Intelligence, and also with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University, China. In addition, she is a member of the IEEE. Her major research interests include image processing, intelligent signal processing, machine learning, etc.

James Cheng is an assistant professor with the Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong. His current research interests include distributed computing systems, large-scale network analysis, temporal networks, and big data.

Ivor W. Tsang is an ARC Future Fellow and a professor of artificial intelligence with the University of Technology Sydney. He is also the research director of the UTS Priority Research Centre for Artificial Intelligence. His research focuses on transfer learning, feature selection, big data analytics for data with trillions of dimensions, and their applications to computer vision and pattern recognition. He has more than 140 research papers published in top-tier journals and conferences, including the Journal of Machine Learning Research, the Machine Learning journal, the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Neural Networks and Learning Systems, NIPS, ICML, etc. In 2009, he was conferred the 2008 Natural Science Award (Class II) by the Ministry of Education, China, which recognized his contributions to kernel methods. In 2013, he received the prestigious Australian Research Council Future Fellowship for his research on machine learning for big data. In addition, he received the prestigious IEEE Transactions on Neural Networks Outstanding 2004 Paper Award in 2007, the 2014 IEEE Transactions on Multimedia Prize Paper Award, and a number of best paper awards and honors from reputable international conferences, including the Best Student Paper Award at CVPR 2010 and the Best Paper Award at ICTAI 2011. He was also awarded the ECCV 2012 Outstanding Reviewer Award.

Lijun Zhang received the BS and PhD degrees in software engineering and computer science from Zhejiang University, China, in 2007 and 2012, respectively. He is currently an associate professor in the Department of Computer Science and Technology, Nanjing University, China. Prior to joining Nanjing University, he was a postdoctoral researcher with the Department of Computer Science and Engineering, Michigan State University. His research interests include machine learning, optimization, information retrieval, and data mining. He is a member of the IEEE.



Dacheng Tao (F'15) is a professor of computer science and ARC Laureate Fellow with the School of Information Technologies and the Faculty of Engineering and Information Technologies, and the inaugural director of the UBTECH Sydney Artificial Intelligence Centre, University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science. His research results have been expounded in one monograph and more than 200 publications in prestigious journals and prominent conferences, such as the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Image Processing, the IEEE Transactions on Neural Networks and Learning Systems, the International Journal of Computer Vision, the Journal of Machine Learning Research, NIPS, ICML, CVPR, ICCV, ECCV, ICDM, and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner-up award at IEEE ICDM'07, the best student paper award at IEEE ICDM'13, the distinguished paper award at the 2018 IJCAI, the 2014 ICDM 10-year highest-impact paper award, and the 2017 IEEE Signal Processing Society Best Paper Award. He is a fellow of the Australian Academy of Science, AAAS, IEEE, IAPR, OSA, and SPIE.

Licheng Jiao (F'18) received the BS degree from Shanghai Jiaotong University, Shanghai, China, in 1982, and the MS and PhD degrees from Xi'an Jiaotong University, Xi'an, China, in 1984 and 1990, respectively. He was a post-doctoral fellow with the National Key Laboratory for Radar Signal Processing, Xidian University, Xi'an, from 1990 to 1991, where he has been a professor with the School of Electronic Engineering since 1992, and is currently the director of the Key Laboratory of Intelligent Perception and Image Understanding, Ministry of Education of China. He has been in charge of about 40 important scientific research projects, and has published more than 20 monographs and 100 papers in international journals and conferences. His current research interests include image processing, natural computation, machine learning, and intelligent information processing. He is the chairman of the Awards and Recognition Committee, the vice board chairperson of the Chinese Association of Artificial Intelligence, a councilor of the Chinese Institute of Electronics, a committee member of the Chinese Committee of Neural Networks, and an expert of the Academic Degrees Committee of the State Council. He is a fellow of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/csdl.
