Coefficient Path Algorithms
Karl Sjöstrand, Informatics and Mathematical Modelling, DTU
What’s This Lecture About?
• The focus is on computation rather than methods
– Efficiency
– Algorithms provide insight
Loss Functions
• We wish to model a random variable Y by a function of a set of other random variables, f(X)
• To measure how far our model f(X) is from Y, we define a loss function L(Y, f(X))
Loss Function Example
• Let Y be a vector y of n outcome observations
• Let X be an (n×p) matrix X whose p columns are predictor variables
• Use squared error loss L(y, f(X)) = ||y − f(X)||²₂
• Let f(X) be a linear model with coefficients β, f(X) = Xβ
• The loss function is then

L(y, X\beta) = (y - X\beta)^T (y - X\beta) = \|y - X\beta\|_2^2

• The minimizer is the familiar OLS solution

\hat{\beta} = \arg\min_\beta L(y, X\beta) = (X^T X)^{-1} X^T y
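As a small sketch of this step (the data below is made up for illustration; the lecture specifies no implementation), the OLS minimizer can be computed by solving the normal equations:

```python
import numpy as np

# Made-up illustration data: n = 100 observations, p = 3 predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

# Minimizer of ||y - X beta||_2^2: solve the normal equations
# (X^T X) beta = X^T y rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```

Solving the linear system is numerically preferable to computing (XᵀX)⁻¹ directly, though the closed form on the slide is the same quantity.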
Adding a Penalty Function
• We get different results if we consider a penalty function J(β) along with the loss function
• Parameter λ defines amount of penalty
\hat{\beta}(\lambda) = \arg\min_\beta L(y, f(X)) + \lambda J(\beta)
Virtues of the Penalty Function
• Imposes structure on the model
– To avoid computational difficulties
• Unstable estimates
• Non-invertible matrices
– To reflect prior knowledge
– To perform variable selection
• Sparse solutions are easier to interpret
Selecting a Suitable Model
• We must evaluate models for many different values of λ
– For instance when doing cross-validation: for each training and test set, evaluate \hat{\beta}(\lambda) for a suitable set of values of λ
• Each evaluation of \hat{\beta}(\lambda) may be expensive
Topic of this Lecture
• Algorithms for estimating

\hat{\beta}(\lambda) = \arg\min_\beta L(y, f(X)) + \lambda J(\beta)

for all values of the parameter λ.
• Plotting the vector \hat{\beta}(\lambda) with respect to λ yields a coefficient path.
Example Path – Ridge Regression
• Regression – quadratic loss, quadratic penalty

\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
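A sketch of what this path looks like computationally (made-up data): ridge has a closed-form solution for every λ, but the path is smooth rather than piecewise linear, so it can only be sampled point by point, one linear solve per λ.

```python
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(50)

# Closed-form ridge solution, evaluated on a grid of lambda values.
lambdas = np.logspace(-3, 3, 25)
path = np.array([
    np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    for lam in lambdas
])
# Each row is beta_hat(lambda); coefficients shrink smoothly toward zero
# as lambda grows, with no knots to exploit.
```

This pointwise sampling is exactly the cost that the path algorithms in this lecture avoid, for problems that qualify.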
Example Path - LASSO
• Regression – quadratic loss, piecewise linear penalty

\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
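For the special case of an orthonormal design (XᵀX = I; an assumption added here for illustration, not stated on the slide), this LASSO formulation has the closed-form coordinate-wise soft-thresholding solution, and the piecewise linear path can be read off directly:

```python
import numpy as np

def lasso_orthonormal(c, lam):
    # Minimizer of ||y - X b||_2^2 + lam * ||b||_1 when X^T X = I,
    # with c = X^T y: coordinate-wise soft thresholding at lam / 2.
    return np.sign(c) * np.maximum(np.abs(c) - lam / 2.0, 0.0)

c = np.array([3.0, 1.0, -2.0])
# Each coefficient moves linearly in lambda until it hits zero, then stays.
print(lasso_orthonormal(c, 0.0))  # [ 3.  1. -2.]
print(lasso_orthonormal(c, 2.0))  # [ 2.  0. -1.]
print(lasso_orthonormal(c, 6.0))  # [ 0.  0.  0.]
```

The kinks where coefficients hit zero are the knots of the path; with a general (correlated) X the same piecewise linear structure holds but requires the algorithm developed later in the lecture.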
Example Path – Support Vector Machine
• Classification – details on loss and penalty later
Example Path – Penalized Logistic Regression
• Classification – non-linear loss, piecewise linear penalty

\hat{\beta}(\lambda) = \arg\min_\beta \sum_{i=1}^{n} \log\left(1 + \exp\{-y_i x_i^T \beta\}\right) + \lambda \|\beta\|_1
Image from Rosset, NIPS 2004
Path Properties
Piecewise Linear Paths
• What is required from the loss and penalty functions for piecewise linearity?
• One condition is that d\hat{\beta}(\lambda)/d\lambda is a piecewise constant vector in λ.
Condition for Piecewise Linearity
[Figure: left panel plots the coefficient values \hat{\beta}(\lambda) against \|\hat{\beta}(\lambda)\|_1; right panel plots the derivatives d\hat{\beta}(\lambda)/d\lambda against \|\hat{\beta}(\lambda)\|_1, which are piecewise constant.]
Tracing the Entire Path
• From a starting point along the path (e.g. λ = ∞), we can easily create the entire path if:
– d\hat{\beta}(\lambda)/d\lambda is known
– the knots where d\hat{\beta}(\lambda)/d\lambda changes can be worked out
The Piecewise Linear Condition
\frac{\partial \hat{\beta}(\lambda)}{\partial \lambda} = -\left( \nabla^2 L(\hat{\beta}(\lambda)) + \lambda \nabla^2 J(\hat{\beta}(\lambda)) \right)^{-1} \nabla J(\hat{\beta}(\lambda))
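This direction formula can be sanity-checked numerically on ridge regression, where the path has a closed form. With L(β) = ||y − Xβ||²₂ and J(β) = ||β||²₂ we have ∇²L = 2XᵀX, ∇J = 2β and ∇²J = 2I, so the formula predicts dβ̂/dλ = −(XᵀX + λI)⁻¹β̂(λ). A finite-difference check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
y = rng.standard_normal(40)

def ridge(lam):
    # Closed-form ridge solution for loss ||y - X b||^2, penalty ||b||^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

lam = 5.0
# Direction formula: -(grad^2 L + lam grad^2 J)^{-1} grad J
#                  = -(X^T X + lam I)^{-1} beta_hat(lam) for ridge.
direction = -np.linalg.solve(X.T @ X + lam * np.eye(3), ridge(lam))

# Central finite difference of the exact path as an independent check.
eps = 1e-6
fd = (ridge(lam + eps) - ridge(lam - eps)) / (2 * eps)
print(np.max(np.abs(direction - fd)))  # agrees to high precision
```

Note that for ridge this direction depends on λ, so the path is curved; the piecewise-linearity question is precisely when it does not.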
Sufficient and Necessary Condition
• A sufficient and necessary condition for linearity of \hat{\beta}(\lambda) at λ₀: the direction

-\left( \nabla^2 L(\hat{\beta}(\lambda)) + \lambda \nabla^2 J(\hat{\beta}(\lambda)) \right)^{-1} \nabla J(\hat{\beta}(\lambda))

is a constant vector with respect to λ in a neighborhood of λ₀.
A Stronger Sufficient Condition
• ...but not a necessary condition
• The loss is a piecewise quadratic function of β
• The penalty is a piecewise linear function of β

Then in

-\left( \nabla^2 L(\hat{\beta}(\lambda)) + \lambda \nabla^2 J(\hat{\beta}(\lambda)) \right)^{-1} \nabla J(\hat{\beta}(\lambda))

the term \nabla^2 L is constant, the term \lambda \nabla^2 J disappears, and \nabla J is constant, so the direction is constant between knots.
Implications of this Condition
• Loss functions may be
– Quadratic (standard squared error loss)
– Piecewise quadratic
– Piecewise linear (a variant of piecewise quadratic)
• Penalty functions may be
– Linear (SVM ”penalty”)
– Piecewise linear (L1 and L∞)
Condition Applied - Examples
• Ridge regression
– Quadratic loss – ok
– Quadratic penalty – not ok
• LASSO
– Quadratic loss – ok
– Piecewise linear penalty – ok
When do Directions Change?
• Directions are only valid where L and J are differentiable.
– LASSO: L is differentiable everywhere; J is not at β_j = 0.
• Directions change when a coefficient β_j touches 0.
– Variables either become 0, or leave 0
– Denote the set of non-zero variables A
– Denote the set of zero variables I
An algorithm for the LASSO
• Quadratic loss, piecewise linear penalty:

\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1

• We now know it has a piecewise linear path!
• Let’s see if we can work out the directions and knots
Reformulating the LASSO
The LASSO problem

\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1

is reformulated by splitting β = β⁺ − β⁻ into non-negative parts:

\arg\min_{\beta^+, \beta^-} \|y - X(\beta^+ - \beta^-)\|_2^2 + \lambda \sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) \quad \text{subject to } \beta_j^+ \ge 0,\ \beta_j^- \ge 0\ \forall j
Useful Conditions
• Lagrange primal function, with L(\beta) = \|y - X\beta\|_2^2 and multipliers \lambda_j^+, \lambda_j^- \ge 0 for the constraints:

L_P = \|y - X(\beta^+ - \beta^-)\|_2^2 + \lambda \sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) - \sum_{j=1}^{p} \lambda_j^+ \beta_j^+ - \sum_{j=1}^{p} \lambda_j^- \beta_j^-

• KKT conditions

\nabla L(\hat{\beta})_j + \lambda - \lambda_j^+ = 0, \quad -\nabla L(\hat{\beta})_j + \lambda - \lambda_j^- = 0, \quad \lambda_j^+ \beta_j^+ = 0, \quad \lambda_j^- \beta_j^- = 0
LASSO Algorithm Properties
• Coefficients are nonzero only if |\nabla L(\hat{\beta}(\lambda))_j| = \lambda – these variables form the active set A
• For zero variables, |\nabla L(\hat{\beta}(\lambda))_j| \le \lambda – these form the inactive set I
Working out the Knots (1)
• First case: a variable becomes zero (A → I)
• Assume we know the current \hat{\beta}(\lambda_0) and directions, so along the path

\hat{\beta}(\lambda) = \hat{\beta}(\lambda_0) + (\lambda - \lambda_0) \frac{d\hat{\beta}}{d\lambda}

• The first active coefficient to reach zero does so after a distance

d = \min_{j \in A} \left\{ -\hat{\beta}_j \Big/ \left(\frac{d\hat{\beta}}{d\lambda}\right)_j \right\}, taken over positive values
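A small numeric sketch of this step (the coefficient and direction values below are made up; sign conventions follow the slide, where only positive distances lie ahead on the path):

```python
import numpy as np

beta = np.array([1.5, -0.4, 2.0])   # current active coefficients (made up)
d = np.array([-0.5, 0.8, -0.1])     # assumed directions d(beta_j)/d(lambda)

steps = -beta / d                   # distance at which each beta_j hits zero
valid = steps > 0                   # only distances ahead on the path count
step = steps[valid].min()           # the first zero crossing is the knot
j = int(np.argmin(np.where(valid, steps, np.inf)))
print(step, j)  # 0.5 1 -> variable 1 is the first to be dropped
```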
Working out the Knots (2)
• Second case: a variable becomes non-zero (I → A)
• For inactive variables, \nabla L(\hat{\beta}(\lambda))_j changes with λ.

[Figure: |\nabla L(\hat{\beta}(\lambda))_j| for each variable plotted along the algorithm direction; the second added variable is the inactive one whose gradient first reaches the boundary.]
Working out the Knots (3)
• For some scalar d, |\nabla L(\hat{\beta} + d\,\gamma)_j| will reach λ, where \gamma = d\hat{\beta}/d\lambda.
– This is where variable j becomes active!
– Solve for d: at the knot, \nabla L(\hat{\beta} + d\gamma)_j = \pm \nabla L(\hat{\beta} + d\gamma)_i for i \in A, which gives

d = \min_{j \in I} \min\left\{ \frac{(x_i - x_j)^T (y - X\hat{\beta})}{(x_i - x_j)^T X\gamma},\ \frac{(x_i + x_j)^T (y - X\hat{\beta})}{(x_i + x_j)^T X\gamma} \right\}

taken over positive values, with i any variable in A.
Path Directions
• Directions for non-zero variables

\frac{d\hat{\beta}_A(\lambda)}{d\lambda} = -\nabla^2 L(\hat{\beta}_A(\lambda))^{-1} \nabla J(\hat{\beta}_A(\lambda)) = -(2 X_A^T X_A)^{-1} \operatorname{sgn}(\hat{\beta}_A(\lambda))
The Algorithm
• while I is not empty
– Work out the minimal distance d at which a variable is either added or dropped
– Update sets A and I
– Update \hat{\beta} = \hat{\beta} + d \cdot d\hat{\beta}/d\lambda
– Calculate new directions
• end
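The loop above can be sketched in code. This is a minimal homotopy implementation under simplifying assumptions (X has full column rank, no two events coincide); the function name and demo data are made up. It follows the slides' λ-parameterization, in which active variables satisfy 2·xⱼᵀ(y − Xβ) = λ·sgn(βⱼ) and the active directions are −(2X_AᵀX_A)⁻¹sgn(β_A):

```python
import numpy as np

def lasso_path(X, y, lam_min=1e-8):
    # Trace beta(lam) for min ||y - X b||_2^2 + lam * ||b||_1, from the
    # largest interesting lam (where b = 0) down to lam_min.
    n, p = X.shape
    c = X.T @ y                           # c_j = x_j^T (y - X b) at b = 0
    j0 = int(np.argmax(np.abs(c)))
    lam = 2.0 * abs(c[j0])                # smallest lam with b(lam) = 0
    A, s = [j0], [float(np.sign(c[j0]))]  # active set and coefficient signs
    b = np.zeros(p)
    knots = [(lam, b.copy())]
    while lam > lam_min and A:
        XA = X[:, A]
        g = -0.5 * np.linalg.solve(XA.T @ XA, np.array(s))  # d(b_A)/d(lam)
        a = X.T @ (XA @ g)                # gradient c changes by -(dlam) * a
        lam_next, event = lam_min, None
        for k, j in enumerate(A):         # case 1: active coefficient hits 0
            if g[k] != 0.0:
                l = lam - b[j] / g[k]
                if lam_min < l < lam - 1e-12 and l > lam_next:
                    lam_next, event = l, ("drop", k)
        for j in set(range(p)) - set(A):  # case 2: inactive gradient hits lam
            for sg in (1.0, -1.0):        # boundary 2 c_j = sg * lam
                den = sg + 2.0 * a[j]
                if den != 0.0:
                    l = (2.0 * c[j] + 2.0 * lam * a[j]) / den
                    if lam_min < l < lam - 1e-12 and l > lam_next:
                        lam_next, event = l, ("add", j, sg)
        for k, j in enumerate(A):         # move to the next knot
            b[j] += (lam_next - lam) * g[k]
        c = c - (lam_next - lam) * a
        lam = lam_next
        if event is None:                 # no more events: path is complete
            knots.append((lam, b.copy()))
            break
        if event[0] == "drop":
            b[A[event[1]]] = 0.0
            del A[event[1]], s[event[1]]
        else:
            A.append(event[1]); s.append(event[2])
        knots.append((lam, b.copy()))
    return knots

# Demo on an orthonormal design, where the exact solution is soft
# thresholding: knots occur where variables enter the active set.
X = np.eye(3)
y = np.array([3.0, 1.0, 0.0])
for lam, b in lasso_path(X, y):
    print(round(lam, 6), b)
```

Between consecutive knots the coefficients are linear in λ, so the returned knot list describes the entire path.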
Variants – Huberized LASSO
• Use a piecewise quadratic loss that is less sensitive to outliers
Huberized LASSO
• Same path algorithm applies
– With a minor change due to the piecewise loss
Variants - SVM
• Dual SVM formulation
– Quadratic ”loss”
– Linear ”penalty”

L_D: \hat{\alpha}(\lambda) = \arg\max_\alpha \mathbf{1}^T \alpha - \frac{1}{2\lambda} \alpha^T Y X X^T Y \alpha \quad \text{subject to } 0 \le \alpha_i \le 1\ \forall i
A few Methods with Piecewise Linear Paths
• Least Angle Regression
• LASSO (+ variants)
• Forward Stagewise Regression
• Elastic Net
• The Non-Negative Garrote
• Support Vector Machines (L1 and L2)
• Support Vector Domain Description
• Locally Adaptive Regression Splines
References
• Rosset and Zhu 2004 – Piecewise Linear Regularized Solution Paths
• Efron et al. 2003 – Least Angle Regression
• Hastie et al. 2004 – The Entire Regularization Path for the SVM
• Zhu, Rosset et al. 2003 – 1-norm Support Vector Machines
• Rosset 2004 – Tracking Curved Regularized Solution Paths
• Park and Hastie 2006 – An L1-regularization Path Algorithm for Generalized Linear Models
• Friedman et al. 2008 – Regularized Paths for Generalized Linear Models via Coordinate Descent
Conclusion
• We have defined conditions that help identify problems with piecewise linear paths
– ...and shown that efficient algorithms exist
• Having access to solutions for all values of the regularization parameter is important when selecting a suitable model