TRANSCRIPT
‘Regularization and variable selection via the elastic net’
Hui Zou, Trevor Hastie
Presented by: James Pope and Michal Kozlowski
Background
• Ordinary Least Squares: minimise the residual sum of squares, β̂ = argmin_β ‖y − Xβ‖²
• Two important aspects to evaluate the quality of a model:
• Accuracy of prediction on future data – good generalisation
• Interpretation of the model – a simpler model is often preferred
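As a quick sketch (synthetic data, with dimensions made up purely for illustration), the OLS coefficients can be computed directly with NumPy:

```python
import numpy as np

# Synthetic data: 50 observations, 3 predictors (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# OLS minimises the residual sum of squares ||y - X beta||^2.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With low noise the estimate lands close to the generating coefficients, but OLS offers no control over model complexity, which motivates the penalised methods below.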
Background
Bias – Variance trade-off
Ridge Regression and LASSO
• Penalisation methods to resolve this:
• Ridge regression – minimisation of the residual sum of squares, subject to a bound on the 𝐿2 norm of the coefficients
• Lasso – penalised least squares imposing an 𝐿1 penalty – shrinkage and variable selection at once
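A minimal scikit-learn sketch of the two penalties (the data and penalty strengths are invented for illustration): the L2 penalty shrinks coefficients but keeps all of them, while the L1 penalty sets some exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Sparse ground truth: 3 signal predictors, 7 noise predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [3.0, 1.5, 2.0]
y = X @ beta + rng.normal(size=100)

# Ridge (L2): shrinks every coefficient, none becomes exactly zero.
ridge = Ridge(alpha=1.0).fit(X, y)
# Lasso (L1): shrinkage and selection at once, some coefficients hit zero.
lasso = Lasso(alpha=0.5).fit(X, y)

n_ridge_zero = int(np.sum(ridge.coef_ == 0))
n_lasso_zero = int(np.sum(lasso.coef_ == 0))
```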
Ridge Regression and LASSO
• The LASSO is limited, however:
• In the p > n case, the lasso selects at most n variables before it saturates.
• If there is a group of highly correlated independent variables, the LASSO chooses one and discards the others.
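The saturation limit can be seen with a LARS-based lasso fit on a made-up p ≫ n design (dimensions and penalty value are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LassoLars

# p >> n design: 10 observations, 50 (pure noise) predictors.
rng = np.random.default_rng(1)
n, p = 10, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# LARS-based lasso with a small penalty: the path saturates once the
# active set reaches about n variables, so no more can enter.
model = LassoLars(alpha=0.01).fit(X, y)
n_active = int(np.sum(model.coef_ != 0))
```

Even with 50 candidate predictors and an almost negligible penalty, the active set cannot grow beyond the number of observations.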
Naïve Elastic Net
• With high-dimensional data there can be variables that are highly correlated with each other – multicollinearity.
• The methods discussed so far fail to perform grouped selection.
• Can be thought of as a hybrid of ridge and LASSO:
Naïve Elastic Net
• If 𝛼 = 1, we have ridge
• If 𝛼 = 0, we have LASSO
• If 0 < 𝛼 < 1, we have the elastic net
• In this figure 𝛼 = 0.5
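A hedged sketch of this penalty mix in scikit-learn (data and penalty values invented). Note that scikit-learn's `l1_ratio` weights the L1 term, i.e. it is the complement of the slide's α, which weights the L2 term: `l1_ratio = 1` recovers the lasso and `l1_ratio = 0` recovers ridge.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(size=100)

# scikit-learn penalty: alpha * (l1_ratio * ||b||_1
#                                + 0.5 * (1 - l1_ratio) * ||b||_2^2)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)       # a true mix
lasso_like = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)  # pure L1
ref_lasso = Lasso(alpha=0.1).fit(X, y)                      # sanity check
```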
Naïve Elastic Net is insufficient
• The naïve EN estimator is a two-stage procedure: for each fixed 𝜆2, we first find the ridge coefficients and then perform LASSO shrinkage.
• We effectively perform double shrinkage!
• Extra bias, with no further variance reduction…
Elastic Net with scaling correction
• Correct this by rescaling. Given augmented data y* = (y, 0)ᵀ and X* = (1 + 𝜆2)^(−1/2) (X; √𝜆2 · I)ᵀ, the naïve elastic net solves the lasso problem on (y*, X*) with penalty 𝛾 = 𝜆1/√(1 + 𝜆2).
• The corrected estimates are defined by β̂(elastic net) = √(1 + 𝜆2) β̂*.
• The naïve estimate can be shown to be β̂(naïve EN) = β̂*/√(1 + 𝜆2); thus β̂(elastic net) = (1 + 𝜆2) β̂(naïve EN).
• That way, we retain the grouping effect and perform shrinkage only once!
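The augmented-data construction can be checked numerically. In this sketch the penalty values (λ1 = λ2 = 1) and the data are arbitrary; the paper's penalties ‖y − Xβ‖² + λ2‖β‖² + λ1‖β‖₁ are mapped onto scikit-learn's parameterisation, and the lasso on (y*, X*) recovers the naïve elastic net coefficients after undoing the √(1 + λ2) scaling:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(4)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

lam1, lam2 = 1.0, 1.0  # illustrative penalty values

# Naive elastic net: ||y - Xb||^2 + lam2*||b||^2 + lam1*||b||_1,
# translated into scikit-learn's (alpha, l1_ratio) parameters.
a = lam1 / (2 * n) + lam2 / n
r = (lam1 / (2 * n)) / a
naive = ElasticNet(alpha=a, l1_ratio=r, fit_intercept=False,
                   tol=1e-10, max_iter=200000).fit(X, y)

# Equivalent lasso on the ridge-augmented data (y*, X*).
s = np.sqrt(1 + lam2)
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / s
y_aug = np.concatenate([y, np.zeros(p)])
gamma = lam1 / s                      # lasso penalty on the augmented problem
a_star = gamma / (2 * (n + p))        # scikit-learn scaling (n+p samples)
lasso = Lasso(alpha=a_star, fit_intercept=False,
              tol=1e-10, max_iter=200000).fit(X_aug, y_aug)

beta_naive_via_aug = lasso.coef_ / s  # undo the beta* = s * beta scaling
beta_en = (1 + lam2) * naive.coef_    # corrected elastic net estimate
```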
Elastic Net
• Elastic net estimate: β̂(elastic net) = (1 + 𝜆2) β̂(naïve elastic net)
Univariate Soft Thresholding
• Consider that when 𝜆2 = 0, the elastic net effectively becomes the LASSO.
• UST is the limiting special case of the elastic net when 𝜆2 → ∞: β̂ᵢ = (|xᵢᵀy| − 𝜆1/2)₊ sgn(xᵢᵀy).
• Elastic net applies soft thresholding on univariate coefficients.
• UST ignores the dependence between predictors, and treats them as independent variables.
• Used for significance analysis of microarrays (Tusher, 2001)
• Elastic net bridges the gap between the LASSO and UST.
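The thresholding rule above can be sketched directly, assuming standardised predictors (the data is invented for illustration):

```python
import numpy as np

def ust(X, y, lam1):
    # Univariate soft thresholding: each predictor treated independently,
    # beta_i = (|x_i' y| - lam1/2)_+ * sign(x_i' y).
    c = X.T @ y
    return np.sign(c) * np.maximum(np.abs(c) - lam1 / 2, 0.0)

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
X /= np.linalg.norm(X, axis=0)            # unit-norm columns
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=50)
beta = ust(X, y, lam1=1.0)                # only strong correlations survive
```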
LARS-EN – Elastic Net Computation
• LARS was first proposed by Efron et al. (2004); it is extended to the elastic net (LARS-EN) in this paper.
• Resembles forward stepwise regression.
• The solution path is piecewise linear.
• Given a fixed 𝜆2, we can solve the entire elastic net solution path!
• At step k, update the Cholesky factorisation of X*ₐₖᵀX*ₐₖ, the Gram matrix of the active set.
• Record the non-zero coefficients at each LARS-EN step.
• For p ≫ n it is not necessary to run the algorithm all the way to the end (early stopping).
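Scikit-learn has no LARS-EN routine as such, but the same computation can be sketched by running the LARS lasso path on the ridge-augmented data for a fixed λ2 and stopping after a fixed number of steps (all values below are illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(6)
n, p, lam2 = 30, 100, 0.5        # p >> n; lambda2 fixed in advance
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Ridge-augmented design: lasso on (y*, X*) is the (naive) elastic net.
s = np.sqrt(1 + lam2)
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / s
y_aug = np.concatenate([y, np.zeros(p)])

# Early stopping: only 10 LARS steps instead of the full path.
alphas, active, coefs = lars_path(X_aug, y_aug, method="lasso", max_iter=10)
n_active_final = int(np.sum(coefs[:, -1] != 0))
```

Each LARS step adds (or drops) at most one variable, so stopping at k steps bounds the active set, which is exactly the cost saving the slide refers to for p ≫ n.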
Summary on Elastic Net
• Ridge regression performs no variable selection: coefficients are shrunk, but all of them stay in the model.
• The LASSO solves this, but disregards clusters of highly correlated variables: it favours a single variable from each group.
• Elastic net is a good compromise between the two.
Elastic Net Evaluation
• Study: Prostate Cancer Data
• Simulation A: To evaluate prediction performance (Lasso)
• Simulation B: To evaluate feature selection and grouping performance
• Study: Leukaemia Data (p>>n)
Study: Prostate Cancer Data
• Eight clinical measures as predictors:
1. Cancer volume
2. Prostate weight
3. Age
4. Benign prostatic hyperplasia
5. Seminal vesicle invasion
6. Capsular penetration
7. Gleason score
8. Percentage of Gleason scores 4 or 5
• The response is the log prostate-specific antigen (lpsa)
• Data divided into training (67 observations) and test (30 observations) sets
• Compared methods: OLS, Ridge, Lasso, Naïve Elastic Net, Elastic Net
• Used tenfold CV on the training data to determine the tuning parameters
Study: Methods' Prediction Compared
• Error: elastic net best, OLS worst.
• For 𝜆2 = 1000, this essentially becomes UST (why 1000?)
• (previously β̂(∞), i.e. the limit 𝜆2 → ∞)
• The elastic net dominates the lasso; the lasso is hurt by the high correlation between predictors (e.g. 7 and 8)
Conjecture: if ridge regression improves on OLS (RR error < OLS error), then the elastic net improves on the lasso (EN error < Lasso error).
Simulation A
• Purpose: to show that the elastic net dominates the lasso in
• prediction accuracy
• variable selection
• Simulate using the model: y = Xβ + σε, ε ~ N(0, 1)
• Four examples (the first three from the lasso paper), each with training/validation/test splits:
1. 50 data sets, 20/20/200 observations, 8 predictors, β = (3, 1.5, 0, 0, 2, 0, 0, 0), σ = 3; pairwise correlation between xᵢ and xⱼ: corr(i, j) = 0.5^|i−j|
2. Same as 1, except β = (0.85, 0.85, …, 0.85)
3. 50 data sets, 100/100/400 observations, 40 predictors, β = (0, …, 0, 2, …, 2, 0, …, 0, 2, …, 2) with ten of each, σ = 15; pairwise correlation corr(i, j) = 0.5 for all i ≠ j
Simulation A: Example 4
• Example specifically designed to test grouped selection:
4. 50 data sets, 50/50/400 observations, 40 predictors, β = (3, …, 3, 0, …, 0) with fifteen 3s and twenty-five 0s, σ = 15
   xᵢ = Z1 + noise, i = 1, …, 5
   xᵢ = Z2 + noise, i = 6, …, 10
   xᵢ = Z3 + noise, i = 11, …, 15
   xᵢ = noise, i = 16, …, 40
• In this model there are three important groups.
• Ideally we would select the 15 signal features and reject the 25 noise features.
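The Example 4 design can be sketched as follows (the within-group noise scale of 0.1 is an assumption for illustration; the paper uses small i.i.d. normal errors):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50                                # training observations per data set
Z = rng.normal(size=(n, 3))           # three latent group factors Z1..Z3
eps = 0.1                             # within-group noise scale (assumed)

X = np.empty((n, 40))
for g in range(3):                    # three groups of five near-copies
    for j in range(5):
        X[:, 5 * g + j] = Z[:, g] + eps * rng.normal(size=n)
X[:, 15:] = rng.normal(size=(n, 25))  # 25 pure-noise predictors

beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
sigma = 15.0
y = X @ beta + sigma * rng.normal(size=n)
```

With this noise scale the within-group correlations are close to 1, which is what makes grouped selection the right behaviour here.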
Simulation A: Results
• The elastic net dominates the lasso (under collinearity).
• But in some cases it is worse than ridge or the naïve elastic net!
Simulation A (Fig.4): Error Distribution
• The paper does not comment on this; the elastic net shows many outliers.
(Figure panels: Example 1, Example 2, Example 3, Example 4)
Simulation A: Results – Sparse Solutions
• The elastic net selects more features than the lasso, due to grouping/correlation.
• Example 4 is close to optimal.
• True number of non-zero β per example: 3, 3, 20, 15.
Simulation B: Grouping Selection (Lasso)
• Given two independent variables Z1 and Z2, the response variable is y ~ N(Z1 + 0.1·Z2, 1)
• Create X from two groups (note that x2 and x5 are negated):
  x1 = Z1 + noise, x2 = −Z1 + noise, x3 = Z1 + noise
  x4 = Z2 + noise, x5 = −Z2 + noise, x6 = Z2 + noise
• Within-group correlation is roughly 1 (in absolute value)
• Between-group correlation is roughly 0
• The important variates are the Z1-group; how well do the elastic net and the lasso perform?
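The Simulation B design can be sketched as follows (the within-group noise scale and the penalty settings are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(8)
n = 100
Z1, Z2 = rng.normal(size=n), rng.normal(size=n)
y = Z1 + 0.1 * Z2 + rng.normal(size=n)

noise = lambda: 0.1 * rng.normal(size=n)   # within-group noise (assumed)
X = np.column_stack([Z1 + noise(), -Z1 + noise(), Z1 + noise(),
                     Z2 + noise(), -Z2 + noise(), Z2 + noise()])

within = np.corrcoef(X[:, 0], X[:, 2])[0, 1]   # same group: near +1
between = np.corrcoef(X[:, 0], X[:, 3])[0, 1]  # different groups: near 0

# Fit both methods to compare how each spreads weight over the Z1-group.
lasso_coef = Lasso(alpha=0.05).fit(X, y).coef_
enet_coef = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_
```

Printing the two coefficient vectors shows the qualitative difference the next slide describes: the elastic net spreads similar absolute weights across the Z1-group, while the lasso concentrates on a subset.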
Simulation B: EN vs Lasso Solution Paths
• Recall: good grouping will set the coefficients of grouped variables to similar values.
• The lasso is very unstable.
• The elastic net assigns the same (absolute) coefficient to each member of the Z1-group.
(Figure: solution paths for the lasso and the elastic net with 𝜆2 = 2; the negated variables get sign-flipped coefficients, and the Z2-group coefficients are roughly 1/10 of the Z1-group's in each model.)
Study: Leukaemia Data (p ≫ n)
Microarray Classification and Gene Selection
• Great, so far the elastic net dominates the lasso (in prediction and feature selection). What about feature selection when p ≫ n (e.g. p = 10000 and n = 100)?
(Figure: the regression y = Xβ for a square p = n design versus a wide p ≫ n design.)
• When p ≫ n, a good classification method should…
A. have gene selection built into the procedure
B. not be limited by the fact that p ≫ n (the lasso is limited to selecting at most n genes)
C. group genes sharing the same biological pathway (correlated?) into the model
• Current methods arbitrarily select one of a set of related gene predictors.
Microarray Classification and Gene Selection: Other Methods
• Lasso:
• Good at A (built-in selection)
• Fails B (handling p ≫ n) and C (grouped selection)
• The Support Vector Machine (Guyon et al., 2002) and Penalised Logistic Regression (Zhu and Hastie, 2004) fail at C.
• They use either univariate ranking or recursive feature elimination.
Microarray Classification and Gene Selection: Evaluation using the Leukaemia Study
• Leukaemia data: 7129 genes and 72 samples.
• Training set: 38 samples, of which 27 are Type 1 and 11 are Type 2.
• Test set: 34 samples.
• Goal: construct a rule to predict the 0–1 response y (Type 1 = 0, Type 2 = 1).
• Pre-screening selects 1000 genes from the 7129 before applying the elastic net.
• A t-statistic ranking was used to make "computation more manageable".
• The authors claim this does not affect the results (NEED TO UNDERSTAND MORE).
• Used ten-fold CV to select the tuning parameters.
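The pre-screening step can be sketched as ranking genes by a two-sample t-statistic (the pooled-variance form and the toy data here are assumptions; the paper does not spell out the exact variant used):

```python
import numpy as np

def prescreen_by_t(X, labels, k):
    """Rank genes (columns) by a pooled-variance two-sample t-statistic
    between the two classes, and keep the top k."""
    a, b = X[labels == 0], X[labels == 1]
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(axis=0, ddof=1)
           + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1 / na + 1 / nb))
    return np.argsort(-np.abs(t))[:k]

rng = np.random.default_rng(9)
X = rng.normal(size=(38, 500))               # toy stand-in: 38 samples, 500 "genes"
labels = (np.arange(38) >= 27).astype(int)   # 27 Type-1, 11 Type-2 as in the study
X[:, 0] += 3 * labels                        # make gene 0 strongly differential
keep = prescreen_by_t(X, labels, k=50)       # indices of retained genes
```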
Leukaemia Classification
• Early-stopping strategy with k = 200 LARS-EN steps: less computational cost (?)
• With 𝜆2 = 0.01, ten-fold CV selects step k = 82: training CV error = 3, test error = 0.
• With the full path (i.e. using the fraction s of the L1-norm to decide where to stop), the test error increases!
Leukaemia Classification Summary
• The elastic net can select more than n predictors (the training set size determines n).
• The lasso would only be able to select a maximum of 38 genes.
Leukaemia/Gene: Solution Paths
• Not clear that the 45 selected genes are properly grouped, though some coefficients seem to converge.
• No evidence that it is better than Golub (or others), despite the claim of "…best classification and internal gene selection".
(Figure: number of genes selected along the solution path; at k = 82, 45 genes are selected.)
Discussion
• The elastic net appears to be better than the lasso for prediction and feature selection.
• Other issues are unclear:
• whether the elastic net is better than, or only comparable to, ridge regression for prediction;
• whether the elastic net is better at grouping than other grouping methods (apart from the lasso).
References
• Hui Zou and Trevor Hastie, ‘Regularization and variable selection via the elastic net’, 2005
• Derek Kane, ‘Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets’, 2015
• Vivian S. Zhang, ‘Ridge regression, lasso and elastic net’, 2014
• http://www.ds100.org/sp17/assets/notebooks/linear_regression/Regularization.html
• https://stats.stackexchange.com/questions/123205/relation-between-the-tuning-parameter-lambda-parameter-estimates-beta-i-a
• https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/
• https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted