Sparse Semismooth Newton Methods and Big Data Composite Optimization

Defeng Sun
Department of Applied Mathematics, The Hong Kong Polytechnic University
August 17, 2018 / Wuyishan

Based on joint works with: Kim-Chuan Toh, Houduo Qi, and many others
1
A brief review on nonsmooth Newton methods
1. Let X, Y be two finite-dimensional real Euclidean spaces.
2. Let F : X → Y be a locally Lipschitz continuous function.

Since F is almost everywhere differentiable [Rademacher, 1912], we can define the B-subdifferential

∂BF(x) := { lim F′(xk) : xk → x, xk ∈ DF },

where DF is the set of points at which F is differentiable. Hence, Clarke's generalized Jacobian of F at x is given by

∂F(x) = conv ∂BF(x).
2
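As a toy illustration of these definitions (our own example, not from the slides), consider the scalar function F(x) = max(x, 0): sampling derivative limits along sequences approaching 0 recovers ∂BF(0) = {0, 1}, whose convex hull [0, 1] is Clarke's generalized Jacobian.

```python
# Toy example (not from the slides): F(x) = max(x, 0) is differentiable
# everywhere except at 0, with F'(x) = 0 for x < 0 and F'(x) = 1 for x > 0.
# Hence ∂_B F(0) = {0, 1} and ∂F(0) = conv{0, 1} = [0, 1].

def F(x):
    return max(x, 0.0)

def num_derivative(x, h=1e-8):
    # central difference; used only at points where F is differentiable
    return (F(x + h) - F(x - h)) / (2.0 * h)

# limits of F'(x_k) along sequences x_k -> 0 with x_k in D_F
limits = {round(num_derivative(xk)) for xk in (-1e-3, -1e-6, 1e-6, 1e-3)}
print(sorted(limits))  # [0, 1] -- the B-subdifferential of F at 0
```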
Semismoothness
Definition 1
Let K : X ⇒ L(X, Y) be a nonempty, compact-valued and upper-semicontinuous multifunction. We say that F is semismooth at x ∈ X with respect to the multifunction K if (i) F is directionally differentiable at x; and (ii) for any ∆x ∈ X and V ∈ K(x + ∆x) with ∆x → 0,

F(x + ∆x) − F(x) − V(∆x) = o(‖∆x‖). (1)

Furthermore, if (1) is replaced by

F(x + ∆x) − F(x) − V(∆x) = O(‖∆x‖^{1+γ}), (2)

where γ > 0 is a constant, then F is said to be γ-order (strongly, if γ = 1) semismooth at x with respect to K. We say that F is a semismooth function with respect to K if it is semismooth everywhere in X with respect to K.
3
Nonsmooth Newton’s method
Assume that F(x̄) = 0.

Given x^0 ∈ X. For k = 0, 1, . . .

Main Step: Choose an arbitrary Vk ∈ K(x^k). Solve

F(x^k) + Vk(x^{k+1} − x^k) = 0.

Rates of convergence: Assume that every element of K(x̄) is nonsingular and that x^0 is sufficiently close to x̄. If F is semismooth at x̄, then

‖x^{k+1} − x̄‖ = ‖Vk^{-1}[F(x^k) − F(x̄) − Vk(x^k − x̄)]‖ = o(‖x^k − x̄‖).

The rate becomes O(‖x^k − x̄‖^{1+γ}) if F is γ-order semismooth at x̄. [The directional differentiability of F is not needed in the above analysis.]
4
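A minimal sketch of the iteration above on a hand-made piecewise linear equation (an assumed example, not from the talk); since F is piecewise linear, the nonsmooth Newton method terminates in finitely many steps:

```python
# Toy piecewise linear equation (our own example): F(x) = x + max(x, 0) - 1,
# with root 0.5. An element V of the B-subdifferential of F is 1 for x < 0
# and 2 for x > 0; at the kink x = 0 either value may be chosen.

def F(x):
    return x + max(x, 0.0) - 1.0

def V(x):
    return 2.0 if x > 0 else 1.0  # arbitrary choice V = 1 at x = 0

x = -1.0
for k in range(20):
    if abs(F(x)) < 1e-12:
        break
    x -= F(x) / V(x)  # x_{k+1} = x_k - V_k^{-1} F(x_k)

print(x)  # 0.5, reached in finitely many steps since F is piecewise linear
```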
Nonsmooth Equations
1. The nonsmooth Newton approach is popular in the complementarity and variational inequalities (nonsmooth equations) community.
2. Kojima and Shindo (1986) investigated a piecewise smooth Newton's method.
3. Kummer (1988, 1992) gave a sufficient condition of the type (1) to generalize Kojima and Shindo's work.
4. L. Qi and J. Sun (1993) proved the convergence theory as we know it now.
5. Since then, many developments ...

Why are nonsmooth Newton methods important in solving big data optimization?
5
The nearest correlation matrix problem
Consider the nearest correlation matrix (NCM) problem:

min { (1/2)‖X − G‖²_F | X ⪰ 0, X_ii = 1, i = 1, . . . , n }.

The dual of the above problem can be written as

min (1/2)‖Ξ‖² − ⟨b, y⟩ − (1/2)‖G‖²
s.t. S − Ξ + A*y = −G, S ⪰ 0,

or, after eliminating Ξ and S ⪰ 0, the following:

min { ϕ(y) := (1/2)‖Π_{S^n_+}(A*y + G)‖² − ⟨b, y⟩ − (1/2)‖G‖² }.
6
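A sketch of evaluating ϕ and its gradient for the NCM dual, under the standard identifications A*y = Diag(y), A(X) = diag(X) and b = e (assumptions spelled out here, not stated on the slide); the projection onto the PSD cone is computed via an eigendecomposition:

```python
import numpy as np

# Sketch for the NCM dual (assumed identifications: A*y = Diag(y),
# A(X) = diag(X), b = e = (1, ..., 1)). The gradient of the dual objective is
#   grad phi(y) = diag(Pi_{S^n_+}(Diag(y) + G)) - b.

def proj_psd(M):
    """Project a symmetric matrix onto the positive semidefinite cone."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.maximum(w, 0.0)) @ Q.T

def phi_and_grad(y, G, b):
    X = proj_psd(np.diag(y) + G)                   # Pi_{S^n_+}(A*y + G)
    phi = 0.5 * np.sum(X * X) - b @ y - 0.5 * np.sum(G * G)
    grad = np.diag(X) - b
    return phi, grad

rng = np.random.default_rng(0)
n = 5
C = rng.standard_normal((n, n))
G = 0.5 * (C + C.T)                                # symmetric data matrix
y = rng.standard_normal(n)
phi, grad = phi_and_grad(y, G, np.ones(n))
```

Since ϕ is continuously differentiable (the squared norm of a projection is C¹), the gradient can be checked against finite differences.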
Numerical results for the NCM
Test the second-order nonsmooth Newton-CG method [H.-D. Qi & Sun 06] ([X,y] = CorrelationMatrix(G,b,tau,tol) in Matlab) and two popular first-order methods (FOMs) [APG of Nesterov; ADMM of Glowinski (steplength 1.618)], all applied to the dual forms of the NCM with real financial data.
G: Cor3120, n = 3,120, obtained from [N. J. Higham & N. Strabic, SIMAX, 2016] [Optimal sol. rank = 3,025]
n = 3, 120 SSNCG ADMM APG
Rel. KKT Res. 2.7-8 2.9-7 9.2-7
time (s) 26.8 246.4 459.1
iters 4 58 111
avg-time/iter 6.7 4.3 4.1
Newton's method takes at most 40% more time per iteration than ADMM & APG. How is this possible?
7
Lasso-type problems
We shall use simple vector cases to explain why:
(LASSO)

min { (1/2)‖Ax − b‖² + λ‖x‖1 | x ∈ Rn },

where λ > 0, A ∈ Rm×n, and b ∈ Rm.

(Fused LASSO)

min (1/2)‖Ax − b‖² + λ1‖x‖1 + λ2‖Bx‖1,

where B is the (n−1) × n first-order difference matrix

B = [ 1 −1          ]
    [    1 −1       ]
    [       ⋱  ⋱    ]
    [          1 −1 ]
8
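For concreteness, the (n−1) × n difference matrix B above can be built in a few lines (an illustrative helper; the names are ours):

```python
import numpy as np

# Illustrative helper: the (n-1) x n first-order difference matrix B used in
# the fused LASSO penalty lambda2*||Bx||_1, with (Bx)_i = x_i - x_{i+1}.

def difference_matrix(n):
    B = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    B[idx, idx] = 1.0
    B[idx, idx + 1] = -1.0
    return B

B = difference_matrix(5)
x = np.array([3.0, 1.0, 1.0, 4.0, 4.0])
print(B @ x)  # differences between neighbors: 2, 0, -3, 0
```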
Lasso-type problems (continued)
(Clustered LASSO)

min (1/2)‖Ax − b‖² + λ1‖x‖1 + λ2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} |xi − xj|

(OSCAR)

min (1/2)‖Ax − b‖² + λ1‖x‖1 + λ2 Σ_{i=1}^{n} Σ_{j=i+1}^{n} (|xi + xj| + |xi − xj|)

We are interested in the cases where n (the number of features) and/or m (the number of samples) is large.
9
Example: Sparse regression
Sparse regression: find x such that Ax ≈ b, where A ∈ Rm×n (n = number of features, m = number of samples), searching for a sparse solution x.

[Figure: schematic of the linear system Ax ≈ b.]
10
Example: Support vector machine
[Figure: two point classes (+1 and −1) separated by a hyperplane, with its normal vector, the residuals ri, and the support vectors marked.]
11
Newton’s method
Figure: Sir Isaac Newton (Niu Dun) (4 January 1643 - 31 March 1727)
12
Which Newton’s method?
[Figure: (a) Snail (Niu), (b) Longhorn beetle (Niu), (c) Charging Bull (Niu), (d) Yak (Niu).]
13
Numerical results for fused LASSO
[Figure: performance profile (time) of Ssnal, ADMM, iADMM and LADMM; the y-axis shows the percentage of problems solved within at most x times the time of the best method.]
Performance profiles on biomedical datasets.
14
Numerical results for fused LASSO
[Figure: performance profile (time) of Ssnal, ADMM, iADMM, LADMM and SLEP; the y-axis shows the percentage of problems solved within at most x times the time of the best method.]
Performance profiles on UCI datasets.
15
Interior point methods
For illustrative purposes, consider a simpler example

min { (1/2)‖Ax − b‖² | x ≥ 0 }

and its dual

max { −(1/2)‖ξ‖² + ⟨b, ξ⟩ | A^T ξ ≤ 0 }.

Interior-point based solver I: an n × n linear system

(D + A^T A)x = rhs1,

where D is a diagonal matrix with positive diagonal elements.

Use a PCG solver (e.g., matrix-free interior point methods [K. Fountoulakis, J. Gondzio and P. Zhlobich, 2014]).

Costly when n is large.
16
Interior point methods
Interior-point based solver II: an m × m linear system

(I_m + A D^{-1} A^T) ξ = rhs2.

Forming A A^T costs O(m²n · sparsity).
17
Our nonsmooth Newton’s method
Our nonsmooth Newton's method: an m × m linear system

(I_m + A P A^T) ξ = rhs2,

where P is a diagonal matrix with 0 or 1 diagonal elements and r is the number of nonzero diagonal elements of P (second order sparsity).

Forming (AP)(AP)^T costs only O(m²r · sparsity).

With the Sherman-Morrison-Woodbury formula one can instead work with (AP)^T(AP), at a cost of O(r²m · sparsity).
18
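The cost comparison on this slide can be reproduced in a few lines (sizes and names are ours; the SMW identity used is (I + σUUᵀ)⁻¹ = I − U(σ⁻¹I + UᵀU)⁻¹Uᵀ, with U the r columns of A selected by P):

```python
import numpy as np

# Sketch (illustrative sizes): solve the Newton system (I_m + sigma*A P A^T) xi = rhs
# when P = diag(p) has only r << min(m, n) nonzero (unit) entries. Only the r
# corresponding columns of A matter, and Sherman-Morrison-Woodbury reduces the
# m x m factorization to an r x r one.

rng = np.random.default_rng(1)
m, n, r, sigma = 200, 2000, 15, 1.0
A = rng.standard_normal((m, n))
p = np.zeros(n)
p[rng.choice(n, size=r, replace=False)] = 1.0
rhs = rng.standard_normal(m)

# direct solve: O(m^2 n) to form the matrix plus O(m^3) to factorize
M = np.eye(m) + sigma * (A * p) @ A.T           # A * p scales columns by p
xi_direct = np.linalg.solve(M, rhs)

# SMW solve: (I + sigma*U U^T)^{-1} = I - U (sigma^{-1} I_r + U^T U)^{-1} U^T
U = A[:, p == 1.0]                              # m x r columns selected by P
small = np.eye(r) / sigma + U.T @ U             # only an r x r system
xi_smw = rhs - U @ np.linalg.solve(small, U.T @ rhs)

print(np.linalg.norm(xi_direct - xi_smw))       # agreement up to rounding
```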
Convex composite programming
(P) min { f(x) := h(Ax) + p(x) },

where:
X, Y are real finite-dimensional Hilbert spaces;
p : X → (−∞, +∞] is a closed proper convex function;
h : Y → R is a convex differentiable function;
A : X → Y is a linear map.

Dual problem:

(D) min { h*(ξ) + p*(u) | A*ξ + u = 0 },

where p* and h* are the Fenchel conjugate functions of p and h, e.g.,

p*(z) = sup_x { ⟨z, x⟩ − p(x) }.
19
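For the two conjugates used repeatedly below (standard convex-analysis facts, recorded here for convenience):

```latex
% h(y) = \tfrac12\|y-b\|^2: the supremum in
%   h^*(\xi) = \sup_y \{\langle \xi, y\rangle - \tfrac12\|y-b\|^2\}
% is attained at y = b + \xi, giving
\[
  h^*(\xi) = \langle \xi, b\rangle + \tfrac12\|\xi\|^2,
  \qquad \nabla h^*(\xi) = b + \xi .
\]
% p(x) = \lambda\|x\|_1: the map x \mapsto \langle z, x\rangle - \lambda\|x\|_1
% is bounded above iff |z_i| \le \lambda for all i, in which case the
% supremum equals 0, so
\[
  p^*(z) \;=\; \delta_{\{\|\cdot\|_\infty \le \lambda\}}(z)
  \;=\; \begin{cases} 0, & \|z\|_\infty \le \lambda,\\[2pt]
  +\infty, & \text{otherwise}. \end{cases}
\]
```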
Examples in machine learning
Examples of smooth loss functions h:

Linear regression: h(y) = ‖y − b‖²
Logistic regression: h(y) = log(1 + exp(−yb))
many more ...

Examples of regularizers p:

LASSO: p(x) = ‖x‖1
Fused LASSO: p(x) = ‖x‖1 + Σ_{i=1}^{n−1} |xi − xi+1|
Ridge: p(x) = ‖x‖2²
Elastic net: p(x) = ‖x‖1 + ‖x‖2²
Group LASSO
Fused Group LASSO
Clustered LASSO, OSCAR
Ordered LASSO, etc.
20
Assumptions on the loss function
Assumption 1 (Assumptions on h)
1. h : Y → R has a 1/αh-Lipschitz continuous gradient:

‖∇h(y1) − ∇h(y2)‖ ≤ (1/αh)‖y1 − y2‖, ∀ y1, y2 ∈ Y.

2. h is essentially locally strongly convex [Goebel and Rockafellar, 2008]: for any compact and convex set K ⊂ dom ∂h, there exists βK > 0 such that

(1 − λ)h(y1) + λh(y2) ≥ h((1 − λ)y1 + λy2) + (1/2)βK λ(1 − λ)‖y1 − y2‖²

for all λ ∈ [0, 1] and y1, y2 ∈ K.
21
Properties of h*

Under the assumptions on h, we know:

a. h* is strongly convex with constant αh;
b. h* is essentially smooth¹;
c. ∇h* is locally Lipschitz continuous on Dh* := int(dom h*);
d. ∂h*(y) = ∅ when y ∉ Dh*.

We only need to focus on Dh*.

¹ h* is differentiable on int(dom h*) ≠ ∅, and ‖∇h*(yi)‖ → +∞ whenever {yi} ⊂ int(dom h*) converges to y ∈ bdry(int(dom h*)).
22
An augmented Lagrangian method for (D)
The Lagrangian function for (D):

l(ξ, u; x) = h*(ξ) + p*(u) − ⟨x, A*ξ + u⟩, ∀ (ξ, u, x) ∈ Y × X × X.

Given σ > 0, the augmented Lagrangian function for (D):

Lσ(ξ, u; x) = l(ξ, u; x) + (σ/2)‖A*ξ + u‖², ∀ (ξ, u, x) ∈ Y × X × X.

The proximal mapping Proxp(x):

Proxp(x) = argmin_{u ∈ X} { p(u) + (1/2)‖u − x‖² }.

Assumption: Proxσp(x) is easy to compute for any given x.

Advantage of using (D): h* is strongly convex; min_u Lσ(ξ, u; x) is easy.
23
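For p = λ‖·‖1, the assumption holds with the coordinatewise soft-thresholding formula (a standard fact; the sketch below is ours):

```python
import numpy as np

# For p = lam*||.||_1, the proximal mapping Prox_{t*p} is coordinatewise
# soft thresholding: Prox_t(x) = sign(x) * max(|x| - t, 0), computable in O(n).

def prox_l1(x, t):
    """Prox of t*||.||_1 at x (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([3.0, -0.5, 0.2, -2.0])
print(prox_l1(x, 1.0))  # entries 2, 0, 0, -1: values shrunk toward zero by t = 1
```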
An augmented Lagrangian method of multipliers for (D)
An inexact augmented Lagrangian method of multipliers.

Given Σ εk < +∞ and σ0 > 0, choose (ξ^0, u^0, x^0) ∈ int(dom h*) × dom p* × X. For k = 0, 1, . . ., iterate:

Step 1. Compute

(ξ^{k+1}, u^{k+1}) ≈ argmin { Ψk(ξ, u) := Lσk(ξ, u; x^k) },

to be solved via a nonsmooth Newton method.

Step 2. Compute x^{k+1} = x^k − σk(A*ξ^{k+1} + u^{k+1}) and update σ_{k+1} ↑ σ∞ ≤ ∞.
24
Global convergence
The stopping criterion for the inner subproblem:

(A) Ψk(ξ^{k+1}, u^{k+1}) − inf Ψk ≤ εk²/(2σk), Σ εk < ∞.
Theorem 2 (Global convergence)
Suppose that the solution set of (P) is nonempty. Then {x^k} is bounded and converges to an optimal solution x* of (P). In addition, {(ξ^k, u^k)} is also bounded and converges to the unique optimal solution (ξ*, u*) ∈ int(dom h*) × dom p* of (D).
25
Fast linear local convergence
Assumption 2 (Error bound)
For a maximal monotone operator T(·) with T^{-1}(0) ≠ ∅, there exist ε > 0 and a > 0 such that

dist(ξ, T^{-1}(0)) ≤ a‖η‖ ∀ η ∈ B(0, ε) and ∀ ξ ∈ T^{-1}(η),

where B(0, ε) = {y ∈ Y | ‖y‖ ≤ ε}. The constant a is called the error bound modulus associated with T.

Examples where Assumption 2 holds:
1. T is a polyhedral multifunction [Robinson, 1981].
2. Tf (= ∂f) of LASSO, fused LASSO and elastic net regularized LS problems (piecewise linear-quadratic programming problems [J. Sun, PhD thesis, 1986] ⇒ error bound).
3. Tf of ℓ1 or elastic net regularized logistic regression [Luo and Tseng, 1992; Tseng and Yun, 2009].
26
Fast linear local convergence
Stopping criterion for the local convergence analysis:

(B) Ψk(ξ^{k+1}, u^{k+1}) − inf Ψk ≤ (δk²/(2σk)) min{1, ‖x^{k+1} − x^k‖²}, Σ δk < ∞.

Theorem 3

Assume that the solution set Ω of (P) is nonempty. Suppose that Assumption 2 holds for Tf with modulus af. Then {x^k} is convergent and, for all k sufficiently large,

dist(x^{k+1}, Ω) ≤ θk dist(x^k, Ω),

where θk ≈ (af(af² + σk²)^{−1/2} + 2δk) → θ∞ = af/√(af² + σ∞²) < 1 as k → ∞. Moreover, the conclusions of Theorem 2 about {(ξ^k, u^k)} remain valid.

The ALM is an approximate Newton's method!!! (The linear convergence rate can be made arbitrarily fast by increasing σk.)
27
Nonsmooth Newton method for inner problems
Fix σ > 0 and x, and denote

ψ(ξ) := inf_u Lσ(ξ, u; x) = h*(ξ) + p*(Prox_{p*/σ}(x/σ − A*ξ)) + (1/(2σ))‖Prox_{σp}(x − σA*ξ)‖².

ψ(·) is strongly convex and continuously differentiable on Dh* with

∇ψ(ξ) = ∇h*(ξ) − A Prox_{σp}(x − σA*ξ), ∀ ξ ∈ Dh*.

We thus solve the nonsmooth equation

∇ψ(ξ) = 0, ξ ∈ Dh*.
28
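A sketch of ∇ψ for the LASSO specialization h(y) = ½‖y − b‖², p = λ‖·‖1 (so ∇h*(ξ) = ξ + b, a standard conjugate identity; the variable names are ours):

```python
import numpy as np

# LASSO specialization: h(y) = 0.5*||y - b||^2 gives h*(xi) = <b, xi> + 0.5*||xi||^2
# and grad h*(xi) = xi + b, so
#   grad psi(xi) = (xi + b) - A @ Prox_{sigma*p}(x - sigma * A^T xi),  p = lam*||.||_1.

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grad_psi(xi, A, b, x, sigma, lam):
    return (xi + b) - A @ prox_l1(x - sigma * (A.T @ xi), sigma * lam)

rng = np.random.default_rng(2)
m, n, sigma, lam = 8, 20, 1.0, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
xi = rng.standard_normal(m)
g = grad_psi(xi, A, b, x, sigma, lam)
```

For this case ψ(ξ) equals ⟨b, ξ⟩ + ½‖ξ‖² + (1/(2σ))‖Prox_{σp}(x − σAᵀξ)‖² up to an additive constant, so the gradient formula can be verified by finite differences.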
Nonsmooth Newton’s method for inner problems
Denote, for ξ ∈ Dh*,

∂²ψ(ξ) := ∂²h*(ξ) + σ A ∂Prox_{σp}(x − σA*ξ) A*,

where ∂²h*(ξ) is the Clarke subdifferential of ∇h* at ξ and ∂Prox_{σp}(x − σA*ξ) is the Clarke subdifferential of Prox_{σp}(·) at x − σA*ξ (both ∇h* and Prox_{σp}(·) are Lipschitz continuous mappings). From [Hiriart-Urruty et al., 1984],

∂̂²ψ(ξ)(d) = ∂²ψ(ξ)(d), ∀ d ∈ Y,

where ∂̂²ψ(ξ) is the generalized Hessian of ψ at ξ. Define

V0 := H0 + σ A U0 A*

with H0 ∈ ∂²h*(ξ) and U0 ∈ ∂Prox_{σp}(x − σA*ξ). Then V0 ≻ 0 and V0 ∈ ∂²ψ(ξ).
29
Nonsmooth Newton’s method for inner problem
SSN(ξ^0, u^0, x, σ). Given µ ∈ (0, 1/2), η ∈ (0, 1), τ ∈ (0, 1], and δ ∈ (0, 1), choose ξ^0 ∈ Dh*. Iterate:

Step 1. Find an approximate solution d^j ∈ Y to

Vj(d) = −∇ψ(ξ^j)

with Vj ∈ ∂²ψ(ξ^j) such that

‖Vj(d^j) + ∇ψ(ξ^j)‖ ≤ min(η, ‖∇ψ(ξ^j)‖^{1+τ}).

Step 2. (Line search) Set αj = δ^{mj}, where mj is the first nonnegative integer m for which

ξ^j + δ^m d^j ∈ Dh* and ψ(ξ^j + δ^m d^j) ≤ ψ(ξ^j) + µδ^m⟨∇ψ(ξ^j), d^j⟩.

Step 3. Set ξ^{j+1} = ξ^j + αj d^j.
30
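Putting the pieces together, here is a simplified SSN solver for the LASSO inner problem (our own specialization of the algorithm above: exact Newton solves in place of the inexact condition in Step 1, and ψ evaluated up to an additive constant, which suffices for the line search):

```python
import numpy as np

# Simplified SSN for the LASSO inner problem: h(y) = 0.5*||y - b||^2,
# p = lam*||.||_1. Then grad h*(xi) = xi + b, Prox_{sigma*p} is soft
# thresholding, and an element of its Clarke Jacobian is the 0/1 diagonal
# matrix P with P_ii = 1 iff |x - sigma*A^T xi|_i > sigma*lam.

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ssn_lasso_inner(A, b, x, sigma, lam, tol=1e-9, mu=1e-4, delta=0.5, maxit=50):
    m = A.shape[0]

    def grad(xi):
        return xi + b - A @ prox_l1(x - sigma * (A.T @ xi), sigma * lam)

    def psi(xi):
        # psi up to an additive constant: enough for the Armijo line search
        z = prox_l1(x - sigma * (A.T @ xi), sigma * lam)
        return b @ xi + 0.5 * (xi @ xi) + (z @ z) / (2.0 * sigma)

    xi = np.zeros(m)
    for _ in range(maxit):
        g = grad(xi)
        if np.linalg.norm(g) <= tol:
            break
        v = x - sigma * (A.T @ xi)
        p_diag = (np.abs(v) > sigma * lam).astype(float)  # element of dProx
        Vmat = np.eye(m) + sigma * (A * p_diag) @ A.T     # V_j in d^2 psi, SPD
        d = np.linalg.solve(Vmat, -g)
        t = 1.0                                           # Armijo backtracking
        while psi(xi + t * d) > psi(xi) + mu * t * (g @ d):
            t *= delta
        xi = xi + t * d
    return xi

rng = np.random.default_rng(5)
m, n, sigma, lam = 10, 30, 1.0, 0.1
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
xi = ssn_lasso_inner(A, b, x, sigma, lam)
```

Since V is SPD, the Newton direction is a descent direction and the backtracking loop always terminates.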
Nonsmooth Newton’s method for inner problems
Theorem 4
Assume that ∇h*(·) and Prox_{σp}(·) are strongly semismooth on Dh* and X, respectively. Then {ξ^j} converges to the unique optimal solution ξ̄ ∈ Dh* and

‖ξ^{j+1} − ξ̄‖ = O(‖ξ^j − ξ̄‖^{1+τ}).

Implementable stopping criteria: the stopping criteria (A) and (B) can be achieved via

(A′) ‖∇ψk(ξ^{k+1})‖ ≤ √(αh/σk) εk,
(B′) ‖∇ψk(ξ^{k+1})‖ ≤ √(αh/σk) δk min{1, σk‖A*ξ^{k+1} + u^{k+1}‖},

since (A′) ⇒ (A) and (B′) ⇒ (B).
31
Summary: outer iterations and inner iterations
So far we have
1. Outer iterations (ALM): asymptotically superlinear (arbitrary rate of linear convergence).
2. Inner iterations (nonsmooth Newton): superlinear + cheap.

Essentially, we have a "fast + fast" algorithm.
32
Newton system for LASSO
LASSO: min (1/2)‖Ax − b‖² + λ1‖x‖1, i.e.,

h(y) = (1/2)‖y − b‖², p(x) = λ1‖x‖1.

Prox_{σp}(x) is easy to compute: Prox_{σp}(x) = sgn(x) ⊙ max{|x| − σλ1, 0}.

Newton system:

(I + σAPA*)ξ = rhs,

where P ∈ ∂Prox_{σp}(x^k − σA*ξ) is a diagonal matrix with {0, 1} entries. Most of these entries are 0 if the optimal solution x_opt is sparse.

Message: the nonsmooth Newton method can fully exploit the second order sparsity (SOS) of solutions to solve the Newton system very efficiently!
33
Newton system for fused LASSO
Fused LASSO: min (1/2)‖Ax − b‖² + λ1‖x‖1 + λ2‖Bx‖1, with the difference matrix

B = [ 1 −1          ]
    [    1 −1       ]
    [       ⋱  ⋱    ]
    [          1 −1 ],

h(y) = (1/2)‖y − b‖², p(x) = λ1‖x‖1 + λ2‖Bx‖1.

Let x_{λ2}(v) := argmin_x { (1/2)‖x − v‖² + λ2‖Bx‖1 }.

Proximal mapping of p [Friedman et al., 2007]:

Prox_p(v) = sign(x_{λ2}(v)) ⊙ max(abs(x_{λ2}(v)) − λ1, 0).

Efficient algorithms to obtain x_{λ2}(v): taut string [Davies and Kovac, 2001], a direct algorithm [Condat, 2013], dynamic programming [Johnson, 2013].
34
Newton system for fused LASSO
Dual approach to obtain x_{λ2}(v): denote

z(v) := argmin_z { (1/2)‖B*z‖² − ⟨Bv, z⟩ | ‖z‖∞ ≤ λ2 },

so that x(v) = v − B*z(v). Let C = {z | ‖z‖∞ ≤ λ2}; from the optimality condition,

z = Π_C(z − (BB*z − Bv)),

and the implicit function theorem ⇒ the Newton system for the fused LASSO:

(I + σAPA*)ξ = rhs, [P is the Han-Sun Jacobian (JOTA, 1997)]

P = P̂(I − B*(I − Σ + ΣBB*)^{-1}ΣB) (positive semidefinite),

Σ ∈ ∂Π_C(z − (BB*z − Bv)).

P̂, Σ: diagonal matrices with {0, 1} entries. Most diagonal entries of P̂ are 0 if x_opt is sparse, and the factor I − B*(I − Σ + ΣBB*)^{-1}ΣB (highlighted in red on the slide) is diagonal plus low rank. Again, one can use sparsity and this structure to solve the system efficiently.
35
Numerical results for LASSO

KKT residual:

ηKKT := ‖x − Prox_p[x − A*(Ax − b)]‖ / (1 + ‖x‖ + ‖Ax − b‖) ≤ 10^{-6}.

Compare SSNAL with state-of-the-art solvers: mfIPM, ... [Fountoulakis et al., 2014] and APG [Liu et al., 2011].

(A, b) taken from 11 Sparco collections (all very easy problems) [Van Den Berg et al., 2009].

λ = λc‖A*b‖∞ with λc = 10^{-3} and 10^{-4}.

Add 60dB noise to b in Matlab: b = awgn(b,60,'measured').

Max. iteration number: 20,000 for APG; max. computation time: 7 hours.
36
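The residual ηKKT can be computed directly (we read the proximal step as x − A*(Ax − b); as a sanity check, with A = I the minimizer of the LASSO is the soft-thresholding of b, so the residual vanishes there):

```python
import numpy as np

# KKT residual for the LASSO with p = lam*||.||_1. Sanity check: for A = I,
# the exact solution of min 0.5*||x - b||^2 + lam*||x||_1 is x* = soft(b, lam),
# so eta_KKT(x*) should be (numerically) zero.

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def eta_kkt(A, b, x, lam):
    r = A @ x - b
    num = np.linalg.norm(x - prox_l1(x - A.T @ r, lam))
    return num / (1.0 + np.linalg.norm(x) + np.linalg.norm(r))

rng = np.random.default_rng(3)
n = 50
b = rng.standard_normal(n)
lam = 0.3
x_star = prox_l1(b, lam)
print(eta_kkt(np.eye(n), b, x_star, lam))  # ~0 at the optimum
```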
Numerical results for LASSO arising from compressed sensing
(a) our Ssnal
(b) mfIPM
(c) APG: Nesterov's accelerated proximal gradient method

λc = 10^{-3}:

probname    m; n           ηKKT: a | b | c         time (hh:mm:ss): a | b | c
srcsep1     29166; 57344   1.6-7 | 7.3-7 | 8.7-7   5:44 | 42:34 | 1:56
soccer1     3200; 4096     1.8-7 | 6.3-7 | 8.4-7   01 | 03 | 2:35
blurrycam   65536; 65536   1.9-7 | 6.5-7 | 4.1-7   03 | 09 | 02
blurspike   16384; 16384   3.1-7 | 9.5-7 | 9.9-7   03 | 05 | 03

λc = 10^{-4}:

srcsep1     29166; 57344   9.8-7 | 9.5-7 | 9.9-7   9:28 | 3:31:08 | 2:50
soccer1     3200; 4096     8.7-7 | 4.3-7 | 3.3-6   01 | 02 | 3:07
blurrycam   65536; 65536   1.0-7 | 9.7-7 | 9.7-7   05 | 1:35 | 03
blurspike   16384; 16384   3.5-7 | 7.4-7 | 9.8-7   10 | 08 | 05
37
Numerical results for LASSO arising from sparse regression
11 large-scale instances (A, b) from LIBSVM [Chang and Lin, 2011].

A: data normalized (with at most unit-norm columns).

λc = 10^{-3}:

probname           m; n             ηKKT: a | b | c         time (hh:mm:ss): a | b | c
E2006.train        16087; 150360    1.6-7 | 4.1-7 | 9.1-7   01 | 14 | 02
log1p.E2006.train  16087; 4272227   2.6-7 | 4.9-7 | 1.7-4   35 | 59:55 | 2:17:57
E2006.test         3308; 150358     1.6-7 | 1.3-7 | 3.9-7   01 | 08 | 01
log1p.E2006.test   3308; 4272226    1.4-7 | 9.2-8 | 1.6-2   27 | 30:45 | 1:29:25
pyrim5             74; 201376       2.5-7 | 4.2-7 | 3.6-3   05 | 9:03 | 8:25
triazines4         186; 635376      8.5-7 | 7.7-1 | 1.8-3   29 | 49:27 | 55:31
abalone7           4177; 6435       8.4-7 | 1.6-7 | 1.3-3   02 | 2:03 | 10:05
bodyfat7           252; 116280      1.2-8 | 5.2-7 | 1.4-2   02 | 1:41 | 12:49
housing7           506; 77520       8.8-7 | 6.6-7 | 4.1-4   03 | 6:26 | 17:00
38
Why is each nonsmooth Newton step cheap?

For housing7, the computational costs in our SSNAL are as follows:

1. costs for Ax: 66 times, 0.11s in total;
2. costs for A^T ξ: 43 times, 2s in total;
3. costs for solving the inner linear systems: 43 times, 1.2s in total.

Since SSNAL has the ability to maintain the sparsity of x, the computational cost of calculating Ax is negligible compared to the other costs. In fact, each step of SSNAL is cheaper than that of many first-order methods, which need at least both Ax (where x may be dense) and A^T ξ.

SOS is important for designing robust solvers!

The SS-Newton equation can be solved very efficiently by exploiting the SOS property of solutions!
39
Solution path of LassoNAL
LassoNAL can generate a solution path as λ varies.

LassoNAL: start from λmax and decrease to the desired λ, with λnew = 0.9 λold at each step.

With λmax = ‖A*b‖∞ and λ = 10^{-3} λmax, one needs to solve 66 LASSO subproblems.

Compare LassoNAL with SPAMS (SPArse Modeling Software by Julien Mairal et al.).

SPAMS: modified LARS/homotopy algorithm (solves the problem via its solution path).
40
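A sketch of the warm-started path idea, with a plain proximal-gradient (ISTA) inner solver standing in for LassoNAL's SSN-ALM (the solver choice and problem sizes are ours); with the 0.9 decrease factor and λ = 10⁻³λmax, the path visits exactly 66 values of λ, matching the count on the slide:

```python
import numpy as np

# Warm-started lambda path (ISTA stands in for the SSN-ALM inner solver used
# by LassoNAL): start at lam_max = ||A^T b||_inf, shrink lambda by 0.9 each
# step, and reuse the previous solution as the next starting point.

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, x0, tol=1e-7, maxit=2000):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = x0.copy()
    for _ in range(maxit):
        g = A.T @ (A @ x - b)
        x_new = prox_l1(x - g / L, lam / L)
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            return x_new
        x = x_new
    return x

rng = np.random.default_rng(4)
m, n = 30, 100
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

lam_max = np.linalg.norm(A.T @ b, np.inf)
lam_target = 1e-3 * lam_max
x = np.zeros(n)                              # at lam_max the solution is x = 0
lam = lam_max
path = []
while lam > lam_target:
    lam *= 0.9
    x = ista(A, b, lam, x)                   # warm start from the previous lambda
    path.append((lam, np.count_nonzero(x)))

print(len(path))  # 66 subproblems along the path
```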
Solution path of LassoNAL
(a) LassoNAL (one run with desired λ = 10^{-3} λmax)
(b) LassoNAL (solution path from λmax to λ)
(c) SPAMS (solution path from λmax to λ)

Randomly generated data:

m; n      time (s): a | b | c   λ NO.: b | c   ratio (b/a)   nnz
50; 1e4   0.4 | 3.5 | 1.5       66 | 75        8.75          46
50; 2e4   0.4 | 3.7 | 7.7       66 | 71        9.25          49
50; 3e4   0.4 | 5.1 | 17.4      66 | 71        12.75         46
50; 4e4   0.4 | 5.1 | 32.0      66 | 69        12.75         48
50; 5e4   0.5 | 8.4 | err       66 | err       16.30         49

SPAMS reports an error when n ≥ 5 × 10^4.

LassoNAL path: thanks to warm starts, the full path costs far less than 66 single runs (ratio < 66); for simple problems, the running time is almost independent of n.
41
Solution path of LassoNAL
(a) LassoNAL (one run with desired λ = 10^{-3} λmax)
(b) LassoNAL (solution path from λmax to λ)
(c) SPAMS (solution path from λmax to λ)

UCI data, truncated to n ≤ 4 × 10^4 (SPAMS reports an error when n is large); η: KKT residual.

Prob.       m; n      time (s): a | b | c   λ NO.: b | c   ratio (b/a)   nnz: b | c   η: b | c
pyrim5      74; 4e4   0.4 | 10.9 | 37.9     66 | 1         27.25         56 | 0       2.5-7 | 9.9-1
triazines4  186; 4e4  1.4 | 33.9 | 38       66 | 1         24.21         136 | 0      4.4-7 | 9.9-1
housing7    506; 4e4  2.0 | 42.8 | 41.8     66 | 259       21.4          109 | 77     3.7-7 | 1.3-3

For difficult problems, SPAMS cannot reach the desired λ and may stop at λmax (pyrim5 & triazines4).
42
Solution path of LassoNAL
Partial solution path for housing7: the 10 largest nonzero elements in absolute value of the selected solution, for λ ∈ [10^{-3} λmax, 0.9^{33} λmax].

[Figure: partial solution path x(λ) computed by LassoNAL, with λ on the horizontal axis (0 to 400) and x(λ) on the vertical axis (−15 to 15).]
43
Numerical results for fused LASSO
(a) our SSNAL
(b) APG-based solver [Liu et al., 2011] (enhanced ...)
(c1) ADMM (classical)
(c2) ADMM (linearized)

Parameters: λ1 = λc‖A*y‖∞, λ2 = 2λ1, tol = 10^{-4}.
Problem: triazines4, m = 186, n = 635376.

λc     | nnz  | ηC      iter: a | b | c1 | c2        time (hh:mm:ss): a | b | c1 | c2
10^{-1} | 164  | 2.4-2   10 | 6448 | 3461 | 8637      18 | 26:44 | 28:42 | 46:35
10^{-2} | 1004 | 1.7-2   13 | 11820 | 3841 | 19596    22 | 48:51 | 24:41 | 1:22:11
10^{-3} | 1509 | 1.2-3   16 | 20000 | 4532 | 20000    31 | 1:16:11 | 38:23 | 1:29:48
10^{-5} | 2420 | 6.4-5   24 | 20000 | 14384 | 20000   1:01 | 1:26:39 | 1:49:44 | 1:35:36

SSNAL is vastly superior to the first-order methods APG, ADMM (classical) and ADMM (linearized).

ADMM (linearized) needs many more iterations than ADMM (classical).
44
When to choose SSNAL?
When Proxp and its generalized (HS) Jacobian ∂Proxp are easy to compute.

Almost all of the LASSO-type models are suitable for SSNAL.

When the problems are very easy, one may also consider APG or ADMM.

For very complicated problems, in particular those with many constraints, consider two-phase approaches.
45
Some remarks
1. For big optimization problems, our knowledge from the traditional optimization domain may be inadequate.
2. Belief: we do not know what we do not know. Always go to modern computers!
3. Big data optimization models provide many opportunities to test new and old ideas. SOS is just one of them.
4. Much more needs to be done, such as stochastic semismooth Newton methods (Andre Milzarek, Zaiwen Wen, ...), screening, sketching, parallelizing ...
46