Stochastic Newton and quasi-Newton Methods for Large-Scale Convex Optimization
Donald Goldfarb
Department of Industrial Engineering and Operations Research, Columbia University
Joint with Robert Gower and Peter Richtarik
Optimization Without Borders 2016, Les Houches, February 7-12, 2016
Outline
Newton-like and quasi-Newton methods for convex stochastic optimization problems using limited memory block BFGS updates.
In the class of problems of interest, the objective functions can be expressed as the sum of a huge number of functions of an extremely large number of variables.
We present preliminary numerical results on problems from machine learning.
Related work on L-BFGS for Stochastic Optimization
P1 N.N. Schraudolph, J. Yu and S. Günter. A stochastic quasi-Newton method for online convex optimization. Int'l Conf. on AI & Statistics, 2007.
P2 A. Bordes, L. Bottou and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. JMLR, vol. 10, 2009.
P3 R.H. Byrd, S.L. Hansen, J. Nocedal and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. arXiv:1401.7020v2, 2014.
P4 A. Mokhtari and A. Ribeiro. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process., no. 10, 2014.
P5 A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. To appear in J. Mach. Learn. Res., 2015.
P6 P. Moritz, R. Nishihara and M.I. Jordan. A linearly-convergent stochastic L-BFGS algorithm. arXiv:1508.02087v1, 2015.
P7 X. Wang, S. Ma, D. Goldfarb and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. Submitted, 2015.
(The first six papers are for strongly convex problems; the last one is for nonconvex problems.)
Stochastic optimization
min_x f(x) = E[f(x, ξ)], where ξ is a random variable.

Or the finite-sum form (with f_i(x) ≡ f(x, ξ_i) for i = 1, ..., n and very large n):

min_x f(x) = (1/n) Σ_{i=1}^n f_i(x).

f and ∇f are very expensive to evaluate; e.g., SGD methods choose a random subset S ⊂ [n] and evaluate (see the sketch below)

f_S(x) = (1/|S|) Σ_{i∈S} f_i(x)   and   ∇f_S(x) = (1/|S|) Σ_{i∈S} ∇f_i(x).

Essentially, only noisy info about f, ∇f and ∇²f is available.

Challenge: how to design a method that takes advantage of noisy 2nd-order information?
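To make the subsampling concrete, here is a minimal NumPy sketch (ours, not the speaker's code) of f_S and ∇f_S for a regularized logistic-regression finite sum; the data matrix A, labels b, and regularizer lam are illustrative assumptions.

```python
import numpy as np

def f_S(x, A, b, S, lam=1e-3):
    """Subsampled objective f_S(x) = (1/|S|) sum_{i in S} f_i(x) for the
    regularized logistic losses f_i(x) = log(1 + exp(-b_i a_i^T x)) + (lam/2)||x||^2."""
    z = b[S] * (A[S] @ x)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (x @ x)

def grad_f_S(x, A, b, S, lam=1e-3):
    """Subsampled gradient (1/|S|) sum_{i in S} grad f_i(x)."""
    z = b[S] * (A[S] @ x)
    sig = 1.0 / (1.0 + np.exp(z))            # sigmoid(-z_i)
    return -(A[S].T @ (b[S] * sig)) / len(S) + lam * x

# toy usage with a random subset S of [n]
rng = np.random.default_rng(0)
n, d = 1000, 20
A, b = rng.standard_normal((n, d)), rng.choice([-1.0, 1.0], size=n)
x = np.zeros(d)
S = rng.choice(n, size=32, replace=False)
print(f_S(x, A, b, S), np.linalg.norm(grad_f_S(x, A, b, S)))
```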
Using 2nd-order information
Assumption: f(x) = (1/n) Σ_{i=1}^n f_i(x) is strongly convex and twice continuously differentiable.

Choose (compute) a sketching matrix S_k (the columns of S_k are a set of directions).

Following Byrd, Hansen, Nocedal and Singer, we do not use differences in noisy gradients to estimate curvature, but rather compute the action of the sub-sampled Hessian on S_k, i.e., compute (see the sketch below)

Y_k = (1/|T|) Σ_{i∈T} ∇²f_i(x) S_k,   where T ⊂ [n].

We choose T = S.
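As an illustration, the sub-sampled Hessian action Y_k can be formed with only matrix-matrix products of size d × p, never a full d × d Hessian. The sketch below (ours, an assumption) continues the logistic-regression example above, where ∇²f_i(x) = w_i a_i a_i^T + lam·I.

```python
import numpy as np

def hessian_action(x, A, b, T, Sk, lam=1e-3):
    """Compute Y = (1/|T|) sum_{i in T} Hess f_i(x) @ Sk without forming a d x d matrix.
    For the logistic losses above, Hess f_i(x) = w_i a_i a_i^T + lam*I with
    w_i = p_i (1 - p_i), p_i = sigmoid(b_i a_i^T x)."""
    AT = A[T]
    z = b[T] * (AT @ x)
    p = 1.0 / (1.0 + np.exp(-z))
    w = p * (1.0 - p)                        # per-example curvature weights
    return AT.T @ (w[:, None] * (AT @ Sk)) / len(T) + lam * Sk

# usage (with A, b, x, S as in the previous sketch, and T = S):
#   Yk = hessian_action(x, A, b, S, Sk)
```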
Block BFGS
Given H_k = B_k^{-1}, the block BFGS method computes a "least change" update to the current approximation H_k of the inverse of the Hessian matrix ∇²f(x) at the current point x, by solving

min_H ‖H − H_k‖
s.t.  H = H^T,  H Y_k = S_k.

This gives the updating formula (analogous to the updates derived by Broyden, Fletcher, Goldfarb and Shanno):

H_{k+1} = (I − S_k [S_k^T Y_k]^{-1} Y_k^T) H_k (I − Y_k [S_k^T Y_k]^{-1} S_k^T) + S_k [S_k^T Y_k]^{-1} S_k^T

or, by the Sherman-Morrison-Woodbury formula:

B_{k+1} = B_k − B_k S_k [S_k^T B_k S_k]^{-1} S_k^T B_k + Y_k [S_k^T Y_k]^{-1} Y_k^T.
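The update is easy to verify numerically. Below is a direct NumPy transcription of the H_{k+1} formula (a sketch, not the authors' implementation), with a toy check that the block secant condition H_{k+1} Y_k = S_k holds.

```python
import numpy as np

def block_bfgs_update(H, S, Y):
    """One block BFGS update of the inverse-Hessian approximation H, transcribing
    H_{k+1} = (I - S L Y^T) H (I - Y L S^T) + S L S^T with L = [S^T Y]^{-1}.
    S, Y are d x p blocks with S^T Y invertible (here Y = sub-sampled Hessian times S)."""
    d = H.shape[0]
    L = np.linalg.inv(S.T @ Y)
    return (np.eye(d) - S @ L @ Y.T) @ H @ (np.eye(d) - Y @ L @ S.T) + S @ L @ S.T

# sanity check on a toy quadratic f(x) = 0.5 x^T Q x: after one update, H Y = S holds
rng = np.random.default_rng(1)
d, p = 10, 3
Q = rng.standard_normal((d, d)); Q = Q @ Q.T + np.eye(d)
S = rng.standard_normal((d, p)); Y = Q @ S          # curvature block pair
H1 = block_bfgs_update(np.eye(d), S, Y)
print(np.allclose(H1 @ Y, S))                       # block secant condition -> True
```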
Limited Memory Block BFGS
After M block BFGS steps starting from H_{k+1−M}, one can express H_{k+1} as

H_{k+1} = V_k H_k V_k^T + S_k Λ_k S_k^T
        = V_k V_{k−1} H_{k−1} V_{k−1}^T V_k^T + V_k S_{k−1} Λ_{k−1} S_{k−1}^T V_k^T + S_k Λ_k S_k^T
        ...
        = V_{k:k+1−M} H_{k+1−M} V_{k:k+1−M}^T + Σ_{i=k+1−M}^{k} V_{k:i+1} S_i Λ_i S_i^T V_{k:i+1}^T,

where
V_k = I − S_k Λ_k Y_k^T,    (1)

Λ_k = (S_k^T Y_k)^{-1}, and V_{k:i} = V_k ⋯ V_i.
Limited Memory Block BFGS
Hence, when the number of variables d is large, instead of storing the d × d matrix H_k, we store the previous M block curvature pairs

(S_{k+1−M}, Y_{k+1−M}), ..., (S_k, Y_k),

and the Cholesky factors of the matrices (S_i^T Y_i) = Λ_i^{-1} for i = k+1−M, ..., k.

Then, analogously to the standard L-BFGS method, for any vector v ∈ R^d, H_k v can be computed efficiently using a two-loop block recursion, in Mp(4d + 2p) + O(p) operations if all S_i ∈ R^{d×p} (see the sketch after this list).

Intuition

Limited memory: the least-change aspect of BFGS is important.

Each block update acts like a sketching procedure.
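Below is a hedged sketch (ours, not the authors' code) of how H_k v can be applied from the stored pairs without forming any d × d matrix. It unrolls the recursion H_{k+1} = V_k H_k V_k^T + S_k Λ_k S_k^T directly rather than reproducing the exact two-loop implementation, and it recomputes each Λ_i instead of using stored Cholesky factors.

```python
import numpy as np

def limited_memory_Hv(pairs, v, h0=1.0):
    """Apply the limited-memory block BFGS matrix H_{k+1} to a vector v from the stored
    block pairs [(S_i, Y_i), ...], ordered oldest -> newest, with H_{k+1-M} = h0*I.
    Unrolls H_{k+1} = V_k H_k V_k^T + S_k L_k S_k^T, V_k = I - S_k L_k Y_k^T,
    using only d x p products; no d x d matrix is formed."""
    lams = [np.linalg.inv(S.T @ Y) for (S, Y) in pairs]       # L_i = (S_i^T Y_i)^{-1}
    cs = []
    for (S, Y), L in zip(reversed(pairs), reversed(lams)):    # newest -> oldest: apply V_i^T
        c = S.T @ v
        cs.append(c)
        v = v - Y @ (L.T @ c)
    r = h0 * v
    for (S, Y), L, c in zip(pairs, lams, reversed(cs)):       # oldest -> newest: apply V_i
        r = r - S @ (L @ (Y.T @ r)) + S @ (L @ c)
    return r

# check against the explicit dense recursion (M = 4 stored pairs, p = 3)
rng = np.random.default_rng(2)
d, p, M = 30, 3, 4
Q = rng.standard_normal((d, d)); Q = Q @ Q.T + np.eye(d)
pairs, H = [], np.eye(d)
for _ in range(M):
    S = rng.standard_normal((d, p)); Y = Q @ S
    pairs.append((S, Y))
    L = np.linalg.inv(S.T @ Y)
    V = np.eye(d) - S @ L @ Y.T
    H = V @ H @ V.T + S @ L @ S.T
v = rng.standard_normal(d)
print(np.allclose(limited_memory_Hv(pairs, v), H @ v))        # True
```

With p = 1 each block pair collapses to a single (s, y) pair and the recursion reduces to the standard BFGS/L-BFGS update.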
Choices for the Sketching Matrix S_k

We employ one of the following strategies (a code sketch follows the list):

Gaussian: S_k ∼ N(0, I) has Gaussian entries sampled i.i.d. at each iteration.

Previous search directions, delayed: store the previous L search directions S_k = [s_{k+1−L}, ..., s_k], then update H_k only once every L iterations.

Self-conditioning: sample the columns of the Cholesky factor L_k of H_k (i.e., L_k L_k^T = H_k) uniformly at random. Fortunately, we can maintain and update L_k efficiently with limited memory.

The matrix S is a sketching matrix in the sense that we are sketching the (possibly very large) equation ∇²f(x) H = I, whose solution is the inverse Hessian. Left multiplying by S^T compresses/sketches the equation, yielding S^T ∇²f(x) H = S^T.
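An illustrative sketch of the three choices; the function name and arguments are ours, not the authors' API, and the self-conditioning branch simply samples columns of a supplied factor rather than showing how that factor is maintained in limited memory.

```python
import numpy as np

def sketch_matrix(kind, d, q, rng, prev_dirs=None, L_fact=None):
    """Form a d x q sketching matrix S_k for one of the three strategies listed above.
    prev_dirs: list of recent search directions (for 'prev').
    L_fact:    a factor with L_fact @ L_fact.T = H_k (for the self-conditioning option)."""
    if kind == "gauss":                       # fresh i.i.d. Gaussian entries each iteration
        return rng.standard_normal((d, q))
    if kind == "prev":                        # the last q (delayed) search directions
        return np.column_stack(prev_dirs[-q:])
    if kind == "fact":                        # q columns of the factor of H_k, chosen uniformly
        cols = rng.choice(L_fact.shape[1], size=q, replace=False)
        return L_fact[:, cols]
    raise ValueError(f"unknown sketch type: {kind}")
```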
Stochastic Variance Reduced Gradients
Stochastic methods converge slowly near the optimum due to the variance of the gradient estimates ∇f_S(x), hence requiring a decreasing step size.

We use the control variates approach of Johnson and Zhang (2013) for their SGD method SVRG.

It uses ∇f_S(x_t) − ∇f_S(w_k) + ∇f(w_k), where w_k is a reference point, in place of ∇f_S(x_t) (see the sketch below).

w_k, and the full gradient ∇f(w_k), are computed after each full pass over the data, hence doubling the work of computing stochastic gradients.

Other recently proposed SGD variance reduction techniques, such as SAG, SAGA, SDCA and S2GD, can be used in place of SVRG.
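A minimal sketch of the SVRG estimator (ours, assuming a generic subsampled-gradient routine grad_S); the toy check illustrates that the estimate is unbiased for the full gradient at x_t.

```python
import numpy as np

def svrg_gradient(x_t, w_k, mu, S, grad_S):
    """Variance-reduced gradient estimate  grad_S(x_t) - grad_S(w_k) + mu, where mu is the
    full gradient at the reference point w_k. The same subsample S must be used in both
    subsampled terms; the estimate is unbiased for the full gradient at x_t."""
    return grad_S(x_t, S) - grad_S(w_k, S) + mu

# tiny check on f_i(x) = 0.5 (x - c_i)^2: every single-example estimate already equals
# the full gradient here (the f_i share one Hessian); in general only the mean matches.
c = np.array([1.0, 2.0, 5.0])
grad_S = lambda x, S: float(np.mean(x - c[S]))
mu = grad_S(0.0, np.arange(3))                 # full gradient at w_k = 0
ests = [svrg_gradient(0.3, 0.0, mu, np.array([i]), grad_S) for i in range(3)]
print(np.mean(ests), grad_S(0.3, np.arange(3)))
```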
The Basic Algorithm
Algorithm 0.1: Stochastic Variable Metric Learning with SVRG

Input: H_{-1} ∈ R^{d×d}, w_0 ∈ R^d, η ∈ R_+, s = subsample size, q = sample action (sketch) size, and m = number of inner iterations.

1:  for k = 0, ..., max_iter do
2:      μ = ∇f(w_k)
3:      x_0 = w_k
4:      for t = 0, ..., m − 1 do
5:          Sample S_t, T_t ⊆ [n] i.i.d. from a distribution S
6:          Compute the sketching matrix S_t ∈ R^{d×q}
7:          Compute ∇²f_{T_t}(x_t) S_t
8:          H_t = update_metric(H_{t−1}, S_t, ∇²f_{T_t}(x_t) S_t)
9:          d_t = −H_t (∇f_{S_t}(x_t) − ∇f_{S_t}(w_k) + μ)
10:         x_{t+1} = x_t + η d_t
11:     end for
12:     Option I:  w_{k+1} = x_m
13:     Option II: w_{k+1} = x_i, with i selected uniformly at random from [m]
14: end for
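To show how the pieces fit together, here is a compact, self-contained sketch of the loop above on a toy regularized least-squares finite sum. It uses a Gaussian sketch, takes T = S, and keeps H as a dense d × d matrix for clarity, whereas the actual method uses the limited-memory representation; all problem data and parameter values are illustrative assumptions, not the authors' settings.

```python
import numpy as np

# Toy finite sum: f_i(x) = 0.5 (a_i^T x - b_i)^2 + (lam/2) ||x||^2
rng = np.random.default_rng(0)
n, d, lam = 2000, 50, 1e-3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

grad_S = lambda x, S: A[S].T @ (A[S] @ x - b[S]) / len(S) + lam * x      # grad f_S(x)
full_grad = lambda x: grad_S(x, np.arange(n))
hess_action = lambda x, T, Sk: A[T].T @ (A[T] @ Sk) / len(T) + lam * Sk  # Hess f_T(x) Sk

def update_metric(H, S, Y):
    """Block BFGS update of the inverse-Hessian approximation (as on the earlier slide)."""
    L = np.linalg.inv(S.T @ Y)
    return (np.eye(d) - S @ L @ Y.T) @ H @ (np.eye(d) - Y @ L @ S.T) + S @ L @ S.T

eta, m, s, q = 0.1, 200, 64, 5            # step size, inner length, |S|, sketch size
w, H = np.zeros(d), np.eye(d)
for k in range(10):                       # outer loop
    mu = full_grad(w)                     # full gradient at the reference point w_k
    x = w.copy()
    for t in range(m):                    # inner loop
        S = rng.choice(n, size=s, replace=False)    # subsample; we take T = S
        Sk = rng.standard_normal((d, q))            # Gaussian sketching matrix
        H = update_metric(H, Sk, hess_action(x, S, Sk))
        g = grad_S(x, S) - grad_S(w, S) + mu        # SVRG variance-reduced gradient
        x = x - eta * (H @ g)
    w = x                                 # Option I: w_{k+1} = x_m
    print(k, np.linalg.norm(full_grad(w)))
```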
Convergence - Assumptions
There exist constants λ, Λ ∈ R_+ such that

f is λ-strongly convex:

f(w) ≥ f(x) + ∇f(x)^T (w − x) + (λ/2) ‖w − x‖_2^2,    (2)

f is Λ-smooth:

f(w) ≤ f(x) + ∇f(x)^T (w − x) + (Λ/2) ‖w − x‖_2^2.    (3)

These assumptions imply that

λI ⪯ ∇²f_S(w) ⪯ ΛI   for all w ∈ R^d, S ⊆ [n],    (4)

from which we can prove that there exist constants γ, Γ ∈ R_+ such that for all k we have

γI ⪯ H_k ⪯ ΓI.    (5)
Linear Convergence
Theorem

Suppose that the Assumptions hold, and let w_* be the unique minimizer of f(w). Then, for our Algorithm, we have for all k ≥ 0 that

E[f(w_k)] − f(w_*) ≤ ρ^k (E[f(w_0)] − f(w_*)),

where the convergence rate is given by

ρ = (1/(2mη) + ηΓ²Λ(Λ − λ)) / (γλ − ηΓ²Λ²) < 1,

assuming we have chosen η < γλ/(2Γ²Λ²) and that we choose m large enough to satisfy

m ≥ 1 / (2η (γλ − ηΓ²Λ(2Λ − λ))),

which is a positive lower bound given our restriction on η.
Upper and lower bounds on eigenvalues of H_k

Under the assumption that

λI ⪯ ∇²f_T(x) ⪯ ΛI   for all x ∈ R^d,    (6)

there exist constants γ, Γ ∈ R_+ such that for all k we have

γI ⪯ H_k ⪯ ΓI,    (7)

where

γ ≥ 1/(1 + MΛ),   Γ ≤ (1 + √κ)^{2M} (1 + 1/(λ(2√κ + κ))),   κ ≡ Λ/λ.    (8)

Previously derived bounds depend on the problem dimension d; e.g., Γ ∼ ((d + M)κ)^{d+M}.
gisette-scale d = 5,000, n = 6,000

Figure: gisette. Error vs. time (s) and error vs. data passes for gauss_18_M_5, prev_18_M_5, fact_18_M_3, MNJ_bH_330, and SVRG.
covtype-libsvm-binary d = 54, n = 581,012

Figure: covtype.libsvm.binary. Error vs. time (s) and error vs. data passes for gauss_8_M_5, prev_8_M_5, fact_8_M_3, MNJ_bH_3815, and SVRG.
Higgs d = 28, n = 11,000,000

Figure: HIGGS. Error vs. time (s) and error vs. data passes for gauss_4_M_5, prev_4_M_5, fact_4_M_3, MNJ_bH_16585, and SVRG.
SUSY d = 18, n = 3,548,466

Figure: SUSY. Error vs. time (s) and error vs. data passes for gauss_3_M_5, prev_3_M_5, fact_3_M_3, MNJ_bH_9420, and SVRG.
epsilon-normalized d = 2,000, n = 400,000

Figure: epsilon-normalized. Error vs. time (s) and error vs. data passes for gauss_45_M_5, prev_45_M_5, fact_45_M_3, MNJ_bH_3165, and SVRG.
rcv1-training d = 47,236, n = 20,242

Figure: rcv1-train. Error vs. time (s) and error vs. data passes for gauss_10_M_5, prev_10_M_5, fact_10_M_3, MNJ_bH_715, and SVRG.
url-combined d = 3,231,961, n = 2,396,130

Figure: url-combined. Error vs. time (s) and error vs. data passes for gauss_2_M_3, prev_2_M_3, fact_2_M_3, MNJ_bH_7740, and SVRG.
Contributions
New metric learning framework: a block BFGS framework for gradually learning the metric of the underlying function using a sketched form of the subsampled Hessian matrix.

New limited-memory block BFGS method, which may also be of interest for non-stochastic optimization.

New limited-memory factored-form block BFGS method.

Several sketching matrix possibilities.

Linear convergence rate proof for our methods.

Tighter upper and lower bounds on the eigenvalues of the variable metric.