
Page 1:

‘Regularization and variable selection via the elastic net’

Hui Zou, Trevor Hastie

Presented by: James Pope and Michal Kozlowski

Page 2:

Background

• Ordinary Least Squares (criterion below):

• Two important aspects for evaluating the quality of a model:
  • Accuracy of prediction on future data – good generalisation
  • Interpretability of the model – a simpler model is often preferred
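The OLS criterion the first bullet refers to, in standard notation:

```latex
\hat{\beta}^{\mathrm{OLS}} = \arg\min_{\beta}\ \lVert y - X\beta\rVert^{2}
                           = \arg\min_{\beta}\ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2}
```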

Page 3:

Background

Bias – Variance trade-off

Page 4:

Ridge Regression and LASSO

• Penalisation methods address this trade-off:
  • Ridge regression – minimises the residual sum of squares subject to a bound on the L2 norm of the coefficients
  • Lasso – penalised least squares with an L1 penalty, performing shrinkage and variable selection at once (both criteria are given below)
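In penalised (Lagrangian) form, the two criteria referred to above are:

```latex
\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta}\ \lVert y - X\beta\rVert^{2} + \lambda\lVert\beta\rVert_{2}^{2},
\qquad
\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta}\ \lVert y - X\beta\rVert^{2} + \lambda\lVert\beta\rVert_{1}
```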

Page 5:

Ridge Regression and LASSO

Page 6:

Ridge Regression and LASSO

• The LASSO is limited, however:
  • In the p > n case, the lasso selects at most n variables before it saturates.
  • If there is a group of highly correlated predictors, the LASSO tends to select one of them and discard the others.

Page 7:

Naïve Elastic Net

• In high-dimensional data there can be variables which are highly correlated with each other – multicollinearity.

• The methods so far fail to perform grouped selection.

• The naïve elastic net can be thought of as a hybrid of ridge and the LASSO (criterion below):
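The naïve elastic net criterion from the paper, combining both penalties:

```latex
L(\lambda_1,\lambda_2,\beta) = \lVert y - X\beta\rVert^{2} + \lambda_2\lVert\beta\rVert_{2}^{2} + \lambda_1\lVert\beta\rVert_{1},
\qquad
\hat{\beta} = \arg\min_{\beta}\ L(\lambda_1,\lambda_2,\beta)
```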

Page 8:

Naïve Elastic Net

• If α = 1, we have ridge
• If α = 0, we have the LASSO
• If 0 < α < 1, we have the elastic net

• In the paper's figure, α = 0.5 (the elastic net penalty contour lies between the L1 diamond and the L2 circle)
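The α above comes from writing the constraint as a convex combination of the lasso and ridge penalties, as in the paper:

```latex
(1-\alpha)\lVert\beta\rVert_{1} + \alpha\lVert\beta\rVert_{2}^{2} \le t,
\qquad
\alpha = \frac{\lambda_2}{\lambda_1 + \lambda_2}
```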

Page 9:

Naïve Elastic Net is insufficient

• The naïve EN estimator is a two-stage procedure: for each fixed λ2, first find the ridge coefficients, then perform LASSO shrinkage.

• We effectively perform double shrinkage!

• This adds extra bias with no further reduction in variance.

Page 10:

Elastic Net with scaling correction

• Correct this by rescaling. Given augmented data (y*, X*), the naïve elastic net solves a lasso problem in the augmented space (see the lemma below).

• The corrected estimates are defined by β̂(elastic net) = (1 + λ2)·β̂(naïve elastic net).

• Equivalently, the naïve elastic net can be shown to be β̂(naïve) = β̂(elastic net)/(1 + λ2); the factor (1 + λ2) undoes the extra ridge shrinkage.

• That way we retain the grouping effect and perform shrinkage only once!
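The lemma referred to above (Lemma 1 in the paper): define the augmented data

```latex
X^{*} = (1+\lambda_2)^{-1/2}\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I_p \end{pmatrix},
\qquad
y^{*} = \begin{pmatrix} y \\ 0 \end{pmatrix}
```

Then, with γ = λ1/√(1 + λ2) and β* = √(1 + λ2)·β, the naïve elastic net criterion equals the lasso criterion ‖y* − X*β*‖² + γ‖β*‖₁, and β̂(naïve elastic net) = β̂*/√(1 + λ2). Since the augmented design has rank p, the elastic net can select all p variables even when p > n.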

Page 11:

Elastic Net

• The elastic net estimate (explicit form sketched below):
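A sketch of the estimate in the explicit form given in the paper (up to notation):

```latex
\hat{\beta}^{\mathrm{EN}} = \arg\min_{\beta}\ \beta^{T}\Big(\frac{X^{T}X + \lambda_2 I}{1+\lambda_2}\Big)\beta - 2\,y^{T}X\beta + \lambda_1\lVert\beta\rVert_{1}
```

The matrix (X^T X + λ2 I)/(1 + λ2) acts like a shrunken correlation matrix, pulling sample correlations towards zero; this is what produces the grouping effect.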

Page 12:

Univariate Soft Thresholding

• When λ2 = 0, the elastic net reduces to the LASSO.

• Univariate soft thresholding (UST) is the special case of the elastic net obtained as λ2 → ∞, written β̂(∞) in the paper.

• In this limit the elastic net applies soft thresholding to the univariate regression coefficients (formula below).

• UST ignores the dependence between predictors, and treats them as independent variables.

• UST was used for the significance analysis of microarrays (Tusher et al., 2001)

• Elastic net bridges the gap between the LASSO and UST.
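In the λ2 → ∞ limit the coefficients decouple, and each is obtained by soft-thresholding its univariate regression coefficient (the formula given in the paper):

```latex
\hat{\beta}_i(\infty) = \Big(\lvert y^{T}x_i\rvert - \frac{\lambda_1}{2}\Big)_{+}\operatorname{sgn}\!\big(y^{T}x_i\big)
```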

Page 13:

LARS-EN – Elastic Net Computation

• Builds on LARS (Efron et al., 2004); extended in this paper to compute elastic net solutions.

• Resembles forward stepwise regression.

• The solution path is piecewise linear.

• Given a fixed λ2, we can solve the entire elastic net solution path!
  • At step k, update the Cholesky factorisation of X_A^T X_A + λ2·I (up to scaling), where A is the current active set
  • Record the non-zero coefficients at each LARS-EN step

• It is not necessary to run the algorithm all the way to the end when p >> n (early stopping)
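A minimal sketch of computing an elastic net solution path in Python with scikit-learn. This is not the paper's LARS-EN implementation: sklearn's enet_path uses coordinate descent and parameterises the penalty by alpha and l1_ratio rather than (λ1, λ2), but it illustrates recording the active set along the path.

```python
import numpy as np
from sklearn.linear_model import enet_path

# Toy p >> n problem: 50 observations, 200 predictors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = X[:, :5].sum(axis=1) + 0.5 * rng.standard_normal(50)

# Whole solution path for a fixed L1/L2 mix (playing the role of a fixed lambda2).
alphas, coefs, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=100)

# Record the non-zero coefficients at each point on the path,
# analogous to recording the active set at each LARS-EN step.
for alpha, beta in zip(alphas[::20], coefs.T[::20]):
    print(f"alpha={alpha:.4f}  non-zero: {np.count_nonzero(beta)}")
```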

Page 14:

Summary on Elastic Net

• Ridge regression performs no variable selection: coefficients are shrunk but never set exactly to zero, so it keeps either all variables or none.

• The LASSO solves this, but disregards clusters of highly correlated variables: it favours a single variable from each cluster.

• The elastic net is a good compromise between the two.

Page 15:

Elastic Net Evaluation

• Study: Prostate Cancer Data

• Simulation A: to evaluate prediction performance (examples taken from the original lasso paper)

• Simulation B: To evaluate feature selection and grouping performance

• Study: Leukaemia Data (p>>n)

Page 16:

Study: Prostate Cancer Data

• Eight clinical measures as predictors:
  1. Cancer volume
  2. Prostate weight
  3. Age
  4. Benign prostatic hyperplasia
  5. Seminal vesicle invasion
  6. Capsular penetration
  7. Gleason score
  8. Gleason score 4/5

• The response is the log prostate-specific antigen level (lpsa)

• Data divided into training (67 observations) and test (30 observations) sets
• Methods compared: OLS, ridge, LASSO, naïve elastic net, elastic net
• Tenfold CV on the training set was used to determine the tuning parameters

Page 17:

Study: Prediction Methods Compared

• Prediction error: the elastic net is best, OLS is worst.

• For λ2 = 1000 this essentially becomes UST (why 1000?)
  • (previously written β̂(∞), i.e. the limit λ2 → ∞)

• The elastic net dominates the lasso; the lasso is hurt by the high correlation between predictors (e.g. 7 and 8).

Conjecture: if RR < OLS (ridge beats OLS in prediction error), then EN < Lasso (the elastic net beats the lasso).

Page 18:

Simulation A

• Purpose: to show that the elastic net dominates the lasso, in terms of:

• Prediction accuracy

• Variable selection

• Simulate using the model: y = Xβ + σε, ε ~ N(0, 1)

• Four examples (the first three from the lasso paper), each split into training/validation/test sets:

  1. 50 data sets, 20/20/200 observations, 8 predictors, β = (3, 1.5, 0, 0, 2, 0, 0, 0), σ = 3;
     pairwise correlation corr(x_i, x_j) = 0.5^|i−j|

  2. Same as 1, except β = (0.85, 0.85, …, 0.85)

  3. 50 data sets, 100/100/400 observations, 40 predictors, β = (ten 0s, ten 2s, ten 0s, ten 2s), σ = 15;
     pairwise correlation corr(x_i, x_j) = 0.5 for all i ≠ j

Page 19:

Simulation A: Example 4

• Example 4 is specifically designed to test grouped selection:

  4. 50 data sets, 50/50/400 observations, 40 predictors, β = (fifteen 3s, twenty-five 0s), σ = 15

     x_i = Z1 + noise, i = 1, …, 5
     x_i = Z2 + noise, i = 6, …, 10
     x_i = Z3 + noise, i = 11, …, 15
     x_i = noise,      i = 16, …, 40

• In this model there are three important groups of five variables each (a generative sketch follows below).
• Ideally a method would select the 15 signal features and reject the 25 noise features.
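A minimal sketch of the Example 4 generative model (the within-group noise scale of 0.1 is an assumption for illustration; the paper uses small independent normal noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 15                      # training size and response noise level

Z = rng.standard_normal((n, 3))        # three latent group factors

X = np.empty((n, 40))
for g in range(3):                     # 5 signal predictors per group
    for j in range(5):
        X[:, 5 * g + j] = Z[:, g] + 0.1 * rng.standard_normal(n)  # noise scale assumed
X[:, 15:] = rng.standard_normal((n, 25))   # 25 pure-noise predictors

beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
y = X @ beta + sigma * rng.standard_normal(n)
```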

Page 20:

Simulation A: Results

• The elastic net dominates the lasso (under collinearity).

• But in some cases it is worse than ridge or the naïve elastic net!

Page 21:

Simulation A (Fig.4): Error Distribution

• The paper does not comment on this, but the elastic net shows many outliers in its error distributions.

(Figure 4: prediction-error distributions for Examples 1–4.)

Page 22:

Simulation A: Results – Sparse Solutions

• The elastic net selects more features than the lasso, due to grouping of correlated predictors
• Example 4 is close to optimal

Non-zero β (Examples 1–4): 3, 3, 20, 15

Page 23:

Simulation B: Grouping Selection (Lasso)

• Given two independent variables Z1 and Z2, the response y is generated as y ~ N(Z1 + 0.1·Z2, 1)

• Create X from two groups (note that x2 and x5 are negated):
  x1 = Z1 + noise    x4 = Z2 + noise
  x2 = −Z1 + noise   x5 = −Z2 + noise
  x3 = Z1 + noise    x6 = Z2 + noise

• Within-group correlations are roughly ±1

• Between-group correlations are roughly 0

• The important variables are the Z1 group: how well do the elastic net and the lasso recover it? (A sketch of this setup follows below.)
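A minimal sketch of this setup fitted with scikit-learn's Lasso and ElasticNet (the penalty values are illustrative assumptions, not the paper's):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 100
Z1, Z2 = rng.standard_normal(n), rng.standard_normal(n)
y = Z1 + 0.1 * Z2 + rng.standard_normal(n)

def noise():
    return 0.01 * rng.standard_normal(n)   # within-group noise scale assumed

# Two groups of three nearly identical predictors (x2 and x5 negated).
X = np.column_stack([Z1 + noise(), -Z1 + noise(), Z1 + noise(),
                     Z2 + noise(), -Z2 + noise(), Z2 + noise()])

print(Lasso(alpha=0.1).fit(X, y).coef_)                      # tends to pick one per group
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)   # spreads weight within groups
```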

Page 24:

Simulation B: EN vs Lasso Solution Paths

• Recall that good grouped selection assigns similar coefficient values to variables in the same group.

• The lasso paths are very unstable.

• The elastic net assigns the same (absolute) coefficient to every member of the Z1 group.

(Figure: solution paths for the lasso and for the elastic net (λ2 = 2); the negated variables' paths flip sign, and the Z2-group coefficients are roughly 1/10 of the Z1-group coefficients in each model.)

Page 25:

Study: Leukaemia Data (p >> n)
Microarray Classification and Gene Selection

• Great – so far the elastic net dominates the lasso in prediction and feature selection. What about feature selection when p >> n (e.g. p = 10000 and n = 100)?


• When p >> n, a good classification method should:
  A. Have gene selection built into the procedure
  B. Not be limited by the p >> n setting (the lasso can select at most n genes)
  C. Bring genes sharing the same biological pathway (and hence correlated) into the model as a group

• Current methods arbitrarily select one of a set of correlated gene predictors.

Page 26:

Microarray Classification and Gene Selection: Other Methods

• Lasso:
  • Good at A (built-in selection)
  • Fails B (handling p >> n) and C (grouped selection)

• The support vector machine (Guyon et al., 2002) and penalised logistic regression (Zhu and Hastie, 2004) fail at C.
  • Both use either univariate ranking or recursive feature elimination.

Page 27:

Microarray Classification and Gene Selection: Evaluation using the Leukaemia Study

• Leukaemia data: 7129 genes and 72 samples
  • Training set: 38 samples, of which 27 are type 1 and 11 are type 2
  • Test set: 34 samples

• Goal: construct a rule to predict the 0–1 response y (type 1 = 0, type 2 = 1)

• Pre-screening selected 1000 of the 7129 genes before running the elastic net
  • A t-statistic was used to make "computation more manageable" (a sketch follows below)
  • The authors claim this does not affect the results (a point that needs more justification)

• Ten-fold CV was used to select the tuning parameters
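A minimal sketch of the pre-screening step, assuming a standard two-sample t-statistic per gene (the slide does not detail the exact statistic the authors used):

```python
import numpy as np
from scipy.stats import ttest_ind

def prescreen(X, y, k=1000):
    """Keep the k genes with the largest absolute two-sample t-statistic."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)   # per-gene t-statistics
    keep = np.argsort(-np.abs(t))[:k]                # indices of top-k genes
    return X[:, keep], keep
```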

Page 28:

Leukaemia Classification

• Early stopping strategy: run LARS-EN for only k = 200 steps, at much lower computational cost
  • With λ2 = 0.01, CV selects step k = 82: training ten-fold CV error = 3, test error = 0

• With the full path instead (using s, the fraction of the L1 norm, as the stopping parameter), the test error increases!

Page 29:

Leukaemia Classification Summary

• The elastic net selects more than n predictors (the training-set size determines n).

• The lasso would only be able to select at most 38 genes.

Page 30:

Leukaemia/Gene: Solution Paths

• It is not clear that the 45 selected genes are properly grouped, though some coefficients appear to converge.

• No evidence is given that this beats Golub's method (or others), despite the claim of "…best classification and internal gene selection".

(Figure: solution paths against the number of genes selected; at step k = 82, 45 genes are selected.)

Page 31:

Discussion

• The elastic net appears to be better than the lasso for prediction and feature selection.

• Other issues remain unclear.

• The elastic net is better than, or comparable to, ridge regression for prediction.

• The elastic net is better at grouping than other grouping methods (the lasso aside).

Page 32:

References

• Hui Zou and Trevor Hastie, 'Regularization and variable selection via the elastic net', Journal of the Royal Statistical Society: Series B, 67(2): 301–320, 2005

• Derek Kane, 'Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets', 2015

• Vivian S. Zhang, 'Ridge regression, lasso and elastic net', 2014

• http://www.ds100.org/sp17/assets/notebooks/linear_regression/Regularization.html

• https://stats.stackexchange.com/questions/123205/relation-between-the-tuning-parameter-lambda-parameter-estimates-beta-i-a

• https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/

• https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted