Proximal Newton Methods for Large-Scale Machine Learning

Inderjit Dhillon
Computer Science & Mathematics
UT Austin

ShanghaiTech Symposium on Data Science
June 24, 2015

Joint work with C. J. Hsieh, M. Sustik, P. Ravikumar, R. Poldrack, S. Becker, and P. Olsen
Inderjit Dhillon Computer Science & Mathematics UT Austin Proximal Newton Methods for Large-Scale Machine Learning
Big Data

Examples: link prediction (e.g., LinkedIn), gene-gene networks, fMRI, image classification, spam classification.
Modern Machine Learning Problems
Inverse Covariance Estimation
Learning graph (structure and/or weights) from data.
Gene networks, fMRI analysis, stock networks, climate analysis.
Matrix Completion
Missing value estimation for recommender systems.
Assumption: the underlying matrix is low-rank.
Robust PCA
Image de-noising, face recognition, graphical modeling with latent variables.
Decompose the input matrix as sparse + low-rank matrices.
Figures from (Candes et al, 2009)
Multi-task Learning
Training T classification tasks together.
Sparse models: Lasso
Feature sharing: Group Lasso
Inverse Covariance Estimation
Estimate Relationships between Variables
Example: fMRI brain analysis.
Goal: Reveal functional connections between regions of the brain.
(Sun et al, 2009; Smith et al, 2011; Varoquaux et al, 2010; Ng et al, 2011)
Figure from (Varoquaux et al, 2010)
Estimate Relationships between Variables
Example: Gene regulatory network discovery.
Goal: Gene network reconstruction using microarray data.
(Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al, 2010; Yin and Li, 2011)
Figure from (Banerjee et al, 2008)
Other Applications
Financial Data Analysis:
Model dependencies in multivariate time series (Xuan & Murphy, 2007).
Sparse high-dimensional models in economics (Fan et al, 2011).
Social Network Analysis / Web data:
Model co-authorship networks (Goldenberg & Moore, 2005).
Model item-item similarity for recommender systems (Agarwal et al, 2011).
Climate Data Analysis (Chen et al., 2010).
Signal Processing (Zhang & Fung, 2013).
Anomaly Detection (Ide et al, 2009).
Inverse Covariance Estimation
Given: n i.i.d. samples {y1, . . . , yn}, yi ∈ Rp, yi ∼ N (µ,Σ),
Goal: Estimate relationships between variables.
The sample mean and covariance are given by:

µ = (1/n) ∑i=1..n yi   and   S = (1/n) ∑i=1..n (yi − µ)(yi − µ)T.
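As a quick numerical check, the two estimators above can be computed directly with NumPy (the Gaussian data here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
Y = rng.standard_normal((n, p))      # stand-in for the n samples y_i in R^p

mu = Y.mean(axis=0)                  # sample mean
S = (Y - mu).T @ (Y - mu) / n        # sample covariance (1/n normalization)

# Agrees with NumPy's built-in biased covariance estimate:
assert np.allclose(S, np.cov(Y, rowvar=False, bias=True))
```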
Covariance matrix Σ may not directly convey relationships.
An example, the chain graph: y(j) = 0.5 y(j−1) + N(0, 1)

Σ = [ 1.33 0.67 0.33
      0.67 1.33 0.67
      0.33 0.67 1.33 ]
Inverse Covariance Estimation (cont’d)
Conditional independence is reflected as zeros in Θ (= Σ−1):
Θij = 0 ⇔ y(i), y(j) are conditionally independent given the other variables.

Chain graph example: y(j) = 0.5 y(j−1) + N(0, 1)

Θ = [  1.00 −0.50  0.00
      −0.50  1.25 −0.50
       0.00 −0.50  1.00 ]
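The chain-graph example can be verified numerically: inverting Σ (whose displayed entries 1.33, 0.67, 0.33 are the rounded values 4/3, 2/3, 1/3) recovers exactly the tridiagonal Θ above. A small NumPy check:

```python
import numpy as np

# Chain-graph example from the slides; Sigma's entries are shown rounded
# there (1.33 = 4/3, 0.67 = 2/3, 0.33 = 1/3).
Sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 4.0, 2.0],
                  [1.0, 2.0, 4.0]]) / 3.0
Theta = np.linalg.inv(Sigma)

expected = np.array([[ 1.0, -0.5,  0.0],
                     [-0.5, 1.25, -0.5],
                     [ 0.0, -0.5,  1.0]])
assert np.allclose(Theta, expected)
# Theta[0, 2] == 0: y(1) and y(3) are conditionally independent given y(2),
# even though their covariance (0.33) is nonzero.
```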
In a GMRF G = (V, E), each node corresponds to a variable, and each edge corresponds to a non-zero entry in Θ.
Goal: Estimate the inverse covariance matrix Θ from the observations.
Maximum Likelihood Estimator: LogDet Optimization
Given the n samples, the likelihood is

P(y1, . . . , yn; µ, Θ) ∝ ∏i=1..n (det Θ)^(1/2) exp(−(1/2)(yi − µ)T Θ (yi − µ))
                        = (det Θ)^(n/2) exp(−(1/2) ∑i=1..n (yi − µ)T Θ (yi − µ)).

The log likelihood can be written as

log P(y1, . . . , yn; µ, Θ) = (n/2) log det Θ − (n/2) tr(ΘS) + constant.

Maximum likelihood estimator:

Θ* = arg min_{X ≻ 0} {− log det X + tr(SX)}.
Problem: when p > n, sample covariance matrix S is singular.
L1-regularized inverse covariance selection
A sparse inverse covariance matrix is preferred –
add ℓ1 regularization to promote sparsity.
The resulting optimization problem:
Θ = arg min_{X ≻ 0} {− log det X + tr(SX) + λ‖X‖1} = arg min_{X ≻ 0} f(X),

where ‖X‖1 = ∑i,j |Xij|. Regularization parameter λ > 0 controls the sparsity.
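For concreteness, the regularized objective f(X) is cheap to evaluate for a positive definite X; a minimal NumPy sketch (the inputs below are hypothetical):

```python
import numpy as np

def f(X, S, lam):
    """Objective of the L1-regularized log-det program (X assumed PD)."""
    sign, logdet = np.linalg.slogdet(X)   # stable log det for PD matrices
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

# At X = I the objective is tr(S) + lam * p, since log det I = 0
# and ||I||_1 = p.  Tiny check with a hypothetical S:
p = 3
S = np.eye(p)
lam = 0.5
assert np.isclose(f(np.eye(p), S, lam), p + lam * p)
```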
The problem appears hard to solve:
Non-smooth log-determinant program.
Number of parameters scales quadratically with the number of variables.
Optimization in Machine Learning
High-dimensional!
Large number of samples!
Large number of parameters!
Exploiting Structure in Machine Learning Problems
Regularized Loss Minimization (RLM):

minθ { g(θ, X) + h(θ) } ≡ f(θ),

where g(θ, X) is the loss term and h(θ) the regularization term.
Exploiting problem structure:
Can we have faster computation?
Exploiting model structure:
Can we avoid unimportant parts of the search space?
Exploiting Data distribution:
Can we speed up optimization algorithms if we roughly estimate the data distribution?
QUIC: QUadratic approximation for Inverse Covariance estimation
Sparse Inverse Covariance Estimation
p2 parameters in the non-smooth log-determinant program:

Θ = arg min_{X ≻ 0} {− log det X + tr(SX) + λ‖X‖1},

where − log det X + tr(SX) is the negative log likelihood.
First Order Methods vs Second Order Methods
Many algorithms have been proposed:
Block coordinate descent (Banerjee et al., 2007), (Friedman et al., 2007)
Gradient or accelerated gradient descent (Lu, 2009), (Duchi et al., 2008)
Greedy coordinate descent (Scheinberg & Rich., 2009)
ADMM (Scheinberg et al., 2009), (Boyd et al., 2011)

All of them are first order methods (they use only gradient information).

QUIC: the first second order method for this problem (uses Hessian information).
∇f(x) = [ ∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xd ]T

H = ∇2f(x) = [ ∂2f/∂x1∂x1   ∂2f/∂x1∂x2   · · ·   ∂2f/∂x1∂xd
               ∂2f/∂x2∂x1   ∂2f/∂x2∂x2   · · ·   ∂2f/∂x2∂xd
               . . .
               ∂2f/∂xd∂x1   ∂2f/∂xd∂x2   · · ·   ∂2f/∂xd∂xd ]
First Order Methods vs Second Order Methods
Why do all previous approaches use first order methods?
Second Order Methods for Non-Smooth Functions
Update rule for Newton method (second order method):
x ← x − η(∇2f (x))−1∇f (x)
Two difficulties:
Time complexity per iteration may be very high.
How to deal with non-smooth functions (‖x‖1)?
Addressed by (Hsieh, Sustik, Dhillon and Ravikumar, 2011).
Extensions:
Theoretical analysis: (Lee et al., 2012), (Patrinos and Bemporad, 2013), (Tang and Scheinberg, 2014), (Yen, Hsieh, Ravikumar and Dhillon, 2014), . . .
Optimization algorithms: (Olsen et al., 2012), (Dinh et al., 2013), (Treister et al., 2014), (Scheinberg and Tang, 2014), (Zhong et al., 2014), . . .
Time Complexity?
Newton direction: H−1g (H: Hessian, g: gradient).
Solving the linear system exactly: O(p6) time.
Appears to be prohibitive to apply a second order method.
Time Complexity?
Coordinate descent: update one element at a time.
Time Complexity?
Use iterative solver—coordinate descent
O(p2) per coordinate update, O(p4) for one sweep ⇒ would need about 40 days for p = 10,000 on a 2.83 GHz CPU.
Still prohibitive.
Exploiting Problem Structure
Fast Computation of Newton Direction

min_{X ≻ 0} { − log det X + tr(SX) + λ‖X‖1 }

where − log det X + tr(SX) carries the problem structure and λ‖X‖1 is the regularization.
Exploiting Problem Structure
Structure of the Hessian:
H = ∂2/∂X2 (− log det X) = X−1 ⊗ X−1.

The Kronecker product:

A ⊗ B = [ a11B · · · a1pB
          . . .
          ap1B · · · appB ]

was introduced by G. Zehfuss (1858) and later attributed to L. Kronecker.
Exploiting Problem Structure
The Hessian (a p2 × p2 matrix) can be represented using only p2 parameters!
Exploiting Problem Structure
Time complexity: O(p2) ⇒ O(p) for each coordinate update; O(p4) ⇒ O(p3) per sweep.

At the coordinate descent step that updates the (i, j) element of the descent direction:

(H vec(∆))_{ip+j} = ((X−1 ⊗ X−1) vec(∆))_{ip+j} = (X−1 ∆ X−1)_{ij} = (X−1 U)_{ij},

where U = ∆X−1 is maintained in memory.
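The identity (X−1 ⊗ X−1) vec(∆) = vec(X−1∆X−1) that makes this possible is easy to confirm numerically; a small NumPy sketch (sizes and matrices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
X = A @ A.T + p * np.eye(p)            # a positive definite iterate
W = np.linalg.inv(X)                    # X^{-1} (symmetric)
Delta = rng.standard_normal((p, p))     # a trial direction

# Hessian-vector product with the explicit p^2 x p^2 Kronecker matrix ...
hv_dense = np.kron(W, W) @ Delta.ravel()
# ... equals vec(X^{-1} Delta X^{-1}), computed with two p x p products:
hv_fast = (W @ Delta @ W).ravel()
assert np.allclose(hv_dense, hv_fast)
```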
Newton Method for Non-smooth Functions
Each iteration conducts the following updates:

Compute the generalized Newton direction

Dt = arg min∆ fXt(∆) = arg min∆ 〈∇g(Xt), vec(∆)〉 + (1/2) vec(∆)T ∇2g(Xt) vec(∆) + λ‖Xt + ∆‖1

Xt+1 ← Xt + Dt.
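Each coordinate update of this subproblem reduces to a one-dimensional quadratic plus an absolute-value term, which has a soft-thresholding closed form. A sketch with hypothetical coefficients: minimize (1/2)a d² + b d + λ|c + d|, where a is the diagonal Hessian entry, b the gradient of the quadratic part along the coordinate, and c the current variable value:

```python
import numpy as np

def soft_threshold(z, r):
    """S(z, r) = sign(z) * max(|z| - r, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - r, 0.0)

def coordinate_step(a, b, c, lam):
    """Minimize (1/2) a d^2 + b d + lam |c + d| over the scalar d.

    Closed form: the new value c + d equals S(c - b/a, lam/a).
    """
    return -c + soft_threshold(c - b / a, lam / a)

# Sanity check against a dense grid search (coefficients are hypothetical).
a, b, c, lam = 2.0, 0.3, -0.1, 0.5
obj = lambda d: 0.5 * a * d**2 + b * d + lam * abs(c + d)
grid = np.linspace(-2, 2, 400001)
d_star = coordinate_step(a, b, c, lam)
assert obj(d_star) <= obj(grid).min() + 1e-8
```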
Proximal Newton Method
Proximal Newton for sparse Inverse Covariance estimation
For t = 0, 1, . . .
1. Compute proximal Newton direction: Dt = arg min∆ fXt(Xt + ∆) (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to get α s.t. Xt+1 = Xt + αDt
   is positive definite, and
   satisfies a sufficient decrease condition f(Xt + αDt) ≤ f(Xt) + ασ∆t.
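A backtracking line search of this kind can be sketched as follows; the function and constant names are illustrative, not from the QUIC reference implementation. Positive definiteness is tested by attempting a Cholesky factorization:

```python
import numpy as np

def armijo_step(f, X, D, delta_t, sigma=0.25, beta=0.5, max_iter=30):
    """Backtracking line search: find alpha such that X + alpha*D is
    positive definite and f(X + alpha*D) <= f(X) + alpha*sigma*delta_t,
    where delta_t < 0 is the predicted decrease along D."""
    fX = f(X)
    alpha = 1.0
    for _ in range(max_iter):
        Xnew = X + alpha * D
        try:
            np.linalg.cholesky(Xnew)      # raises unless positive definite
        except np.linalg.LinAlgError:
            alpha *= beta                 # step left the PD cone; shrink
            continue
        if f(Xnew) <= fX + alpha * sigma * delta_t:
            return alpha, Xnew
        alpha *= beta
    raise RuntimeError("line search did not converge")

# Toy objective: f(X) = -log det X + tr(X), minimized at X = I.
f = lambda X: -np.linalg.slogdet(X)[1] + np.trace(X)
X, D = 2.0 * np.eye(3), -np.eye(3)                      # step from 2I toward I
delta_t = np.trace((np.eye(3) - np.linalg.inv(X)) @ D)  # <grad f(X), D>
alpha, Xnew = armijo_step(f, X, D, delta_t)
assert alpha == 1.0 and np.allclose(Xnew, np.eye(3))
```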
Exploiting Problem Structure
Can we scale to p ≈ 20,000?

Coordinate descent updates for computing Newton direction:
needs O(p2) storage
needs O(p4) computation per sweep
needs O(p3) computation per sweep (by exploiting problem structure)

Line search (compute determinant using Cholesky factorization):
needs O(p2) storage
needs O(p3) computation
Exploiting Model Structure
Variable Selection

minX { − log det X + tr(SX) + h(X) } ≡ f(X),

where − log det X + tr(SX) carries the problem structure and h(X) the regularization.
Sparsity of the Model
ER dataset, p = 692, solutions at iteration 1, 3, 5.
Only 2.7% variables become nonzero.
Active Variable Selection
Strategy: before solving the Newton direction, “guess” the variablesthat should be updated.
Fixed set: Sfixed = {(i, j) | Xij = 0 and |∇ij g(X)| ≤ λ}.
Only update the rest of the variables (the free set).

Convergence is guaranteed.

p = 1869, number of variables = p2 ≈ 3.49 million. The size of the free set drops to 20,000 very quickly.
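The fixed/free split is a simple elementwise test; a minimal NumPy sketch (for the sparse inverse covariance problem, ∇g(X) = S − X−1; the data below are hypothetical):

```python
import numpy as np

def free_set(X, grad_g, lam):
    """Boolean mask of free variables: a variable is fixed (not updated)
    when X_ij == 0 and |grad_g_ij| <= lam; all others are free."""
    fixed = (X == 0) & (np.abs(grad_g) <= lam)
    return ~fixed

# Hypothetical setup: grad g(X) = S - inv(X) for the log-det objective.
rng = np.random.default_rng(0)
p = 5
S = np.cov(rng.standard_normal((50, p)), rowvar=False, bias=True)
X = np.eye(p)
mask = free_set(X, S - np.linalg.inv(X), lam=0.5)
# Nonzero entries of X (here the diagonal) are always free.
assert mask[np.diag_indices(p)].all()
```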
Exploiting Model Structure
Can we scale to p ≈ 20,000?

Coordinate descent updates for computing Newton direction:
needs O(p2) storage
needs O(p4) computation per sweep
needs O(p3) computation per sweep (by exploiting problem structure)
needs O(mp) computation per sweep, where m ≈ ‖X‖0 (by exploiting problem and model structure)

Line search (compute determinant using Cholesky factorization):
needs O(p2) storage
needs O(p3) computation
Algorithm
QUIC: QUadratic approximation for sparse Inverse Covariance estimation
Input: Empirical covariance matrix S, scalar λ, initial X0.

For t = 0, 1, . . .
1. Variable selection: select a free set of m ≪ p2 variables.
2. Use coordinate descent to find the descent direction: Dt = arg min∆ fXt(Xt + ∆) over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to get α s.t. Xt+1 = Xt + αDt
   is positive definite, and
   satisfies a sufficient decrease condition f(Xt + αDt) ≤ f(Xt) + ασ∆t
   (Cholesky factorization of Xt + αDt).

QUIC can solve p = 20,000 in about 2.3 hours.
Algorithm

Current iterate Xt.

Form the quadratic approximation fXt(∆):

f(X) = g(X) + λ‖X‖1

fXt(∆) = g(Xt) + 〈∇g(Xt), ∆〉 + (1/2) vec(∆)T ∇2g(Xt) vec(∆) + λ‖Xt + ∆‖1

Free set selection.

Coordinate descent to minimize fXt(∆).

Xt+1 ← Xt + Dt.
Theoretical Guarantee
Theorem: Quadratic Convergence Rate for QUIC
Suppose g(·) is strongly convex and the quadratic subproblem at each iteration is solved exactly. Then QUIC converges to the global optimum with an asymptotic quadratic convergence rate:

lim_{t→∞} ‖Xt+1 − X*‖F / ‖Xt − X*‖F^2 = κ,

where κ is a constant.
QUIC Results: Data from Biology Applications
(a) Time for Estrogen, p = 692. (b) Time for hereditarybc, p = 1,869.
Quic & Dirty
Active Subspace Selection for Dirty Statistical Models
Exploiting Model Structure
min_X { Loss(X; Data) + Regularization(X) },

where the loss term encodes the problem structure and the regularization term encodes the model structure.
If h(X ) = ‖X‖1, the solution is expected to be sparse.
Active Variable Selection:
Matrix Completion Problems
min_X { g(X) + λ‖X‖∗ }

‖X‖∗ is the nuclear norm, a regularizer that promotes low-rank structure.
Active variable selection ⇒ Active subspace selection
(identify the low-dimensional subspace)
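For intuition: the proximal operator of λ‖·‖∗ soft-thresholds the singular values (singular value thresholding), which is why the nuclear norm produces low-rank solutions. A minimal NumPy sketch (function name is mine, not from the talk):

```python
import numpy as np

def prox_nuclear(X, tau):
    """prox of tau*||.||_*: soft-threshold the singular values.
    Small singular values are zeroed out, which lowers the rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt  # scale columns of U by thresholded spectrum

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5)) @ rng.standard_normal((5, 30))  # rank 5
Z = prox_nuclear(X, tau=1e6)   # tau larger than every singular value -> zero matrix
```

With tau = 0 the operator is the identity; as tau grows, the rank of the output drops.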
Active Subspace Selection
Given the current iterate X = UΣVᵀ, the "active subspace" is

A = {uvᵀ | uᵀXv ≠ 0 or |uᵀ∇g(X)v| > λ}.

Step 1: Active subspace selection. Compute U_G Σ_G V_Gᵀ = S_λ(X − ∇g(X)), and set

U_A := span(U, U_G), V_A := span(V, V_G), A = {uvᵀ | u ∈ U_A, v ∈ V_A}.

Step 2: Solve the reduced-size problem

argmin_{S ∈ ℝ^{k×k}} g_X(U_A S V_Aᵀ) + λ‖S‖∗,

where g_X(·) is the quadratic approximation of g(·) at the current iterate X.

Iterate Steps 1 and 2.

Subspace selection can be generalized to settings where the regularizer is a "decomposable norm" at the solution.
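A dense NumPy sketch of Step 1 (illustrative only; a real solver would keep thin factors and never form full matrices):

```python
import numpy as np

def soft_threshold(A, lam):
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

def active_subspace(X, grad, lam, tol=1e-10):
    """Join the singular vectors of the iterate X with those of the
    soft-thresholded step S_lambda(X - grad), returning orthonormal
    bases U_A, V_A that span the active subspace."""
    UG, sG, VGt = np.linalg.svd(soft_threshold(X - grad, lam), full_matrices=False)
    kG = int((sG > tol).sum())                    # numerical rank of S_lambda(X - grad)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = int((s > tol).sum())                      # numerical rank of X
    # Orthonormalize the joined bases: span(U, U_G) and span(V, V_G)
    UA, _ = np.linalg.qr(np.hstack([U[:, :k], UG[:, :kG]]))
    VA, _ = np.linalg.qr(np.hstack([Vt[:k].T, VGt[:kG].T]))
    return UA, VA
```

The reduced subproblem is then over S ∈ ℝ^{k×k} with k = UA.shape[1], typically far smaller than the ambient dimension.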
Active Subspace Selection for Decomposable Norms
A norm ‖·‖ is decomposable at x if there is a subspace T and a vector e ∈ T such that the subdifferential at x is

∂‖x‖_r = {ρ ∈ ℝⁿ | Π_T(ρ) = e and ‖Π_{T⊥}(ρ)‖_r* ≤ 1},

where ‖·‖_r* is the dual norm of ‖·‖_r.

Examples: ℓ₁ norm, nuclear norm, group lasso, ...

Active subspace selection:

Divide the space into a "fixed" subspace and a "free" subspace:

S_free := [T(θ)] ∪ [T(prox_{λr}(G))],  S_fixed = S_free^⊥,

where T(·) is the support of the decomposable norm. Then solve the reduced-size problem.
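For the ℓ₁ norm this fixed/free split is concrete: a coordinate is free if it is nonzero or violates the optimality condition |∇ᵢg| ≤ λ. A small NumPy sketch (function name is mine):

```python
import numpy as np

def l1_free_set(theta, grad, lam):
    """Free coordinates for an l1-regularized problem: the classical
    active-set test. Fixed coordinates are zero and satisfy |grad_i| <= lam,
    so a prox-Newton step would leave them at zero."""
    return np.where((theta != 0) | (np.abs(grad) > lam))[0]

theta = np.array([0.0, 2.0, 0.0, -1.0])
grad  = np.array([0.3, 0.5, 1.2, 0.1])
free = l1_free_set(theta, grad, lam=1.0)
# coordinates 1 and 3 are nonzero; coordinate 2 has |grad| > lam -> free = [1, 2, 3]
```

The Newton subproblem is then solved only over the (usually much smaller) free set.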
Active Subspace Selection for Dirty Models
Superposition-structured models or “dirty models”:
min_{ {θ^(r)}_{r=1}^k }  g( Σ_r θ^(r) ) + Σ_r λ_r h^(r)(θ^(r)),
Examples:
Sparse + low-rank graphical model learning
Multi-task learning (sparse + group-sparse or sparse + low-rank).
Select active subspace for each θ(r).
Proximal Newton Method
Proximal Newton for Dirty Models (QUIC & DIRTY)
For t = 0, 1, . . .

1. Compute θ = Σ_{r=1}^k θ^(r).
2. Compute the current gradient G = ∇g(θ).
3. Select the free sets:
   S^(r)_free := [T(θ^(r))] ∪ [T(prox^(r)_{λ_r}(G))],  S^(r)_fixed = (S^(r)_free)^⊥.
4. For r = 1, . . . , k, solve the subproblem
   Δ^(r) = argmin_{d ∈ ℝⁿ} ⟨d, G + Σ_{t≠r} HΔ^(t)⟩ + (1/2) dᵀHd + λ_r ‖θ^(r) + d‖_r.
5. Line search: use an Armijo-rule based step-size selection to get α such that θ = θ + αΔ sufficiently decreases the objective function value.
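The Armijo backtracking in the line-search step can be sketched generically; the `decrease` argument below is a stand-in for the predicted-decrease term of the composite objective (names are mine, not the talk's code):

```python
def armijo_step(F, theta, delta, decrease, beta=0.5, sigma=1e-4, max_iter=50):
    """Backtracking line search for a composite objective F = g + h.
    Accepts the first alpha in {1, beta, beta^2, ...} with
    F(theta + alpha*delta) <= F(theta) + sigma * alpha * decrease,
    where decrease < 0 for a descent direction."""
    f0 = F(theta)
    alpha = 1.0
    for _ in range(max_iter):
        if F(theta + alpha * delta) <= f0 + sigma * alpha * decrease:
            return alpha
        alpha *= beta          # shrink the step and try again
    return alpha

# toy example: minimize F(x) = x^2 from x = 3 along the direction -F'(3) = -6
alpha = armijo_step(lambda x: x * x, 3.0, -6.0, decrease=-36.0)
# full step overshoots (F(-3) = F(3)), so backtracking halves it: alpha = 0.5
```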
Theoretical Guarantee
Theorem: Quadratic Convergence Rate for QUIC & DIRTY
Suppose g(·) is strongly convex and the quadratic subproblem at each iteration is solved exactly. Then Quic & Dirty converges to the global optimum with an asymptotic quadratic convergence rate:

lim_{t→∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖²_F = κ,

where κ is a constant.
Results: Data from Biology Applications
Sparse + low rank graphical model learning
min_{S, L : L ⪰ 0, S − L ≻ 0}  − log det(S − L) + ⟨S − L, Σ⟩ + λ_S‖S‖₁ + λ_L tr(L).
(Figure: objective value vs. time, comparing Quic & Dirty, PGALM, and LogdetPPM.)
(c) Time for Leukemia, p = 1,255. (d) Time for Rosetta, p = 2,000.
Application: Multi-task Learning
Multi-task (multi-class) learning with sparse + group sparse structure.
Σ_{r=1}^k ℓ(y^(r), X^(r)(S^(r) + B^(r))) + λ_S‖S‖₁ + λ_B‖B‖_{1,2},
Error rate / training time. Dirty models (sparse + group sparse): Quic & Dirty, proximal gradient, ADMM; other models: Lasso, Group Lasso.

| dataset | training data | Quic & Dirty | proximal gradient | ADMM | Lasso | Group Lasso |
|---|---|---|---|---|---|---|
| USPS | 100 | 7.47% / 0.75s | 7.49% / 10.8s | 7.47% / 4.5s | 10.27% | 8.36% |
| USPS | 400 | 2.5% / 1.55s | 2.5% / 35.8s | 2.5% / 11.0s | 4.87% | 2.93% |
| RCV1 | 1000 | 18.45% / 23.1s | 18.49% / 430.8s | 18.5% / 259s | 22.67% | 20.8% |
| RCV1 | 5000 | 10.27% / 87s | 10.27% / 2254s | 10.27% / 1191s | 13.67% | 12.25% |
Big & Quic: Sparse Inverse Covariance Estimation with 1 Million Random Variables
Big & Quic
Difficulties in Scaling QUIC
Coordinate descent requires X_t and X_t⁻¹:
needs O(p²) storage
needs O(mp) computation per sweep, where m = ‖X_t‖₀

Line search (compute determinant of a big sparse matrix):
needs O(p²) storage
needs O(p³) computation
Line Search
Given the sparse matrix A = X_t + αD, we need to
1. Check its positive definiteness.
2. Compute log det(A).

Cholesky factorization in QUIC requires O(p³) computation.

If A = [a bᵀ; b C], then

det(A) = det(C)(a − bᵀC⁻¹b),

and A is positive definite iff C is positive definite and a − bᵀC⁻¹b > 0.

C is sparse, so we can compute C⁻¹b using Conjugate Gradient (CG).

Time complexity: T_CG = O(mT), where T is the number of CG iterations.
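The Schur-complement recursion above can be sketched directly. The following is a minimal illustration (not the BigQUIC implementation) that peels off one row/column at a time and uses SciPy's conjugate gradient for the C⁻¹b solves:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import cg

def logdet_and_check_pd(A):
    """det(A) = det(C) * (a - b^T C^{-1} b): accumulate the log of each
    Schur complement; A is positive definite iff every one is > 0."""
    A = np.asarray(A, dtype=float)
    logdet = 0.0
    while A.shape[0] > 1:
        a, b, C = A[0, 0], A[0, 1:], A[1:, 1:]
        x, info = cg(csc_matrix(C), b)   # C^{-1} b by conjugate gradient
        if info != 0:
            raise RuntimeError("CG did not converge")
        schur = a - b @ x
        if schur <= 0:
            return None, False           # not positive definite
        logdet += np.log(schur)
        A = C                            # recurse on the trailing block
    return (logdet + np.log(A[0, 0]), True) if A[0, 0] > 0 else (None, False)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
ld, ok = logdet_and_check_pd(A)          # det(A) = 18, so ld ~ log(18)
```

Keeping everything sparse and solving only with CG is what reduces the line-search cost to O(p) storage and O(mp) computation.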
Difficulties in Scaling QUIC
Coordinate descent requires X_t and X_t⁻¹:
needs O(p²) storage
needs O(mp) computation per sweep, where m = ‖X_t‖₀

Line search (compute determinant of a big sparse matrix):
needs O(p) storage
needs O(mp) computation
Back-of-the-envelope calculations
Coordinate descent requires m computations of wᵢᵀDwⱼ.

To solve a problem of size p = 10⁴ (avg degree = 10), QUIC takes 12 minutes.

Extrapolate: to solve a problem with p = 10⁶, QUIC would take 12 min × 10⁴ = 83.33 days (2.6 days using 32 cores).

p² memory means 8 terabytes of main memory for p = 10⁶.

Naive solution with an O(m) memory requirement:
Compute wᵢ, wⱼ by solving two linear systems by CG:
O(T_CG) computation per coordinate update,
O(mT_CG) computation per sweep.
Would need 8,333 days to solve the problem for p = 10⁶ (260 days using 32 cores).

BigQUIC solves problems of size p = 10⁶ in one day on 32 cores.
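These extrapolations are easy to reproduce with the slide's own numbers:

```python
# One sweep costs O(mp); with the average degree fixed at 10, m grows
# linearly in p, so the total work scales as p^2.
p_small, p_big = 10**4, 10**6
work_factor = (p_big / p_small) ** 2      # = 10^4
days = 12 * work_factor / (60 * 24)       # 12 minutes at p = 10^4 -> ~83.33 days
days_parallel = days / 32                 # ~2.6 days on 32 cores
terabytes = 8 * p_big**2 / 10**12         # dense p x p matrix of 8-byte doubles -> 8 TB
```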
Coordinate Updates with Memory Cache
Assume we can store M columns of W in memory.

Coordinate descent update (i, j): compute wᵢᵀDwⱼ.

If wᵢ, wⱼ are not in memory, recompute them by CG: solve Xwᵢ = eᵢ, O(mT_CG) time.
Coordinate Updates – ideal case
Want to find an update sequence that minimizes the number of cache misses: probably NP-hard.

Our strategy: update variables block by block (assume k = p/M blocks).
General case: block diagonal + sparse
If the block partition is not perfect, the extra column computations can be characterized by boundary nodes.

Given a partition {S₁, . . . , S_k}, we define the boundary nodes as

B(S_q) ≡ {j | j ∈ S_q and ∃ i ∈ S_z, z ≠ q, s.t. F_ij = 1},

where F is the adjacency matrix of the free set.
Graph Clustering Algorithm
The number of columns to be computed in one sweep is

p + Σ_q |B(S_q)|,

which can be upper bounded by

p + Σ_q |B(S_q)| ≤ p + Σ_{z≠q} Σ_{i∈S_z, j∈S_q} F_ij.
Minimize the right hand side → minimize off-diagonal entries.
Use Graph Clustering (METIS or Graclus) to find the partition.
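The boundary-node count can be illustrated with a small helper (my own, not the talk's code):

```python
import numpy as np

def columns_per_sweep(F, blocks):
    """p columns computed once each, plus one extra computation per
    boundary node: the slide's p + sum_q |B(S_q)|."""
    p = F.shape[0]
    block_of = np.empty(p, dtype=int)
    for q, S in enumerate(blocks):
        block_of[S] = q
    # j is a boundary node if it has a free-set edge into another block
    boundary = sum(
        1 for j in range(p)
        if any(block_of[i] != block_of[j] for i in np.nonzero(F[j])[0])
    )
    return p + boundary

# chain graph on 6 nodes split into two contiguous blocks: only nodes 2
# and 3 touch the other block, so one sweep computes 6 + 2 = 8 columns
F = np.zeros((6, 6), dtype=int)
for i in range(5):
    F[i, i + 1] = F[i + 1, i] = 1
n_cols = columns_per_sweep(F, [[0, 1, 2], [3, 4, 5]])
```

A bad (e.g. random) partition makes every node a boundary node, which is exactly why graph clustering pays off.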
Size of boundary nodes in real datasets
ER dataset with p = 692.
Random partition: compute 1687 columns. Partition by clustering: compute 775 columns.

O(kpT_CG) → O(pT_CG) flops per sweep.
BigQUIC
Block coordinate descent with clustering:
needs O(m + p²/k) storage
needs O(mp) computation per sweep, where m = ‖X_t‖₀

Line search (compute determinant of a big sparse matrix):
needs O(p) storage
needs O(mp) computation
Convergence Guarantees for QUIC
Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011)
QUIC converges quadratically to the global optimum, that is, for some constant 0 < κ < 1:

lim_{t→∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖²_F = κ.
BigQUIC Convergence Analysis
Recall W = X⁻¹.

When each wᵢ is computed by CG (Xwᵢ = eᵢ):
The gradient ∇_ij g(X) = S_ij − W_ij on the free set can be computed once and stored in memory.
The Hessian (wᵢᵀDwⱼ in coordinate updates) needs to be repeatedly computed.

To reduce the time overhead, the Hessian should be computed approximately.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

If ∇g(X) is computed exactly and η̲I ⪯ H_t ⪯ η̄I for some constants η̄ ≥ η̲ > 0 at every Newton iteration, then BigQUIC converges to the global optimum.
BigQUIC Convergence Analysis
To obtain a super-linear convergence rate, we require the Hessian computation to become more and more accurate as X_t approaches X*.

Measure the accuracy of the Hessian computation by R = [r₁, . . . , r_p], where rᵢ = Xwᵢ − eᵢ.

The optimality of X_t can be measured by

∇ˢ_ij f(X) = ∇_ij g(X) + sign(X_ij)λ, if X_ij ≠ 0,
∇ˢ_ij f(X) = sign(∇_ij g(X)) max(|∇_ij g(X)| − λ, 0), if X_ij = 0.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇ˢf(X_t)‖^ℓ) for some 0 < ℓ ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+ℓ}) as t → ∞.

When ‖R_t‖ = O(‖∇ˢf(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
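The optimality measure ∇ˢf is the minimum-norm subgradient of the ℓ₁-regularized objective, and it can be computed elementwise (a sketch, function name is my own):

```python
import numpy as np

def min_norm_subgrad(grad_g, X, lam):
    """Minimum-norm subgradient of f = g + lam*||X||_1, the optimality
    measure grad^S f(X): zero exactly at a solution."""
    return np.where(
        X != 0,
        grad_g + np.sign(X) * lam,                               # X_ij != 0 case
        np.sign(grad_g) * np.maximum(np.abs(grad_g) - lam, 0.0), # X_ij == 0 case
    )

# at a stationary point every entry vanishes:
X = np.array([0.0, 1.0])
grad_g = np.array([0.5, -1.0])   # |0.5| <= lam, and -1.0 + sign(1)*lam = 0
G = min_norm_subgrad(grad_g, X, lam=1.0)   # -> [0.0, 0.0]
```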
Experimental results (scalability)
(e) Scalability on random graph (f) Scalability on chain graph
Figure: BigQUIC can solve one million dimensional problems.
Experimental results
BigQUIC is faster even for medium size problems.
Figure: Comparison on FMRI data with a p = 20000 subset (the maximum dimension that previous methods can handle).
Results on FMRI dataset
228,483 voxels, 518 time points.
λ = 0.6 ⇒ average degree 8; BigQUIC took 5 hours.

λ = 0.5 ⇒ average degree 38; BigQUIC took 21 hours.
Findings:
Voxels with large degree were generally found in the gray matter.
Meaningful brain modules can be detected by modularity clustering.
Conclusions
Proximal Newton method: a second-order method for optimizing a smooth convex function plus a non-smooth regularizer.

How to develop an efficient proximal Newton method?

Exploit the structure and geometry of the problem.

For the sparse inverse covariance estimation problem, our proposed algorithm (BigQuic) can solve a 1-million-dimensional problem in 1 day on a single 32-core machine.

Main ingredients for BigQuic:
Fast coordinate descent solver for computing the Newton direction.
Avoid unimportant updates by active subspace selection.
Memory-efficient coordinate descent updates.

Quic & Dirty: extension to "Dirty" Statistical Models.
Future Work
“Effective” order of Newton-like methods for specific problems?
Multi-core Computation? “Wild” Co-ordinate Descent?
Prioritized Scheduling (Co-ordinate Descent)?
Suitability for Distributed or Out-of-core Computing?
Develop scalable & reliable software
No dearth of interesting applications!
References
[1] C. J. Hsieh, I. S. Dhillon, P. Ravikumar, S. Becker, and P. Olsen. QUIC & DIRTY: A Quadratic Approximation Approach for Dirty Statistical Models. NIPS, 2014.
[2] C. J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS (oral presentation), 2013.
[3] C. J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. NIPS, 2011.
[4] C. J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[5] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[6] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 2008.
[8] L. Li and K.-C. Toh. An Inexact Interior Point Method for L1-Regularized Sparse Covariance Selection. Mathematical Programming Computation, 2010.
[9] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse Inverse Covariance Selection via Alternating Linearization Methods. NIPS, 2010.
[10] K. Scheinberg and I. Rish. Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach. Machine Learning and Knowledge Discovery in Databases, 2010.