Linear Model Selection and Regularization (kti.mff.cuni.cz/~marta/su5a.pdf)
TRANSCRIPT
Linear Model Selection and Regularization
Especially useful in high dimensions, p >> 100.
Why Linear Model Regularization?
● Linear models are simple, BUT
● consider p >> n:
  ● we have more features than data records,
  ● we can (often) learn a model with 0 training error
    – even for independent features!
    – such a model is overfitted.
● Fewer features in the model may lead to a smaller test error.
● We add constraints or a penalty on the coefficients.
● A model with fewer features is also more interpretable.
Selection, Regularization Methods
● Subset Selection:
  ● evaluate all subsets and select the best model (by CV).
● Shrinkage (regularization):
  ● a penalty on coefficient size shrinks them towards zero.
● Dimension Reduction:
  ● from the p dimensions select an M-dimensional subspace, M < p,
  ● fit a linear model in this M-dim. subspace.
Best Subset Selection
● The null model M_0 predicts the sample mean ȳ.
● for (k in 1:p)
  ● fit all models with exactly k predictors,
  ● select the one with the smallest RSS, or equivalently the largest R²,
    – denote it M_k.
● Select a single best model from among M_0, ..., M_p
  using cross-validation, AIC, BIC or adjusted R².
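The loop above can be sketched in Python with NumPy (the slides use R-style pseudocode; the function names here are illustrative, not from the slides):

```python
from itertools import combinations
import numpy as np

def rss(X, y):
    """RSS of a least-squares fit with an intercept column prepended."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return float(np.sum((y - Xi @ beta) ** 2))

def best_subset(X, y):
    """For each size k, return the predictor subset M_k with the smallest RSS."""
    p = X.shape[1]
    # the null model M_0 predicts the sample mean of y
    best = {0: ((), float(np.sum((y - y.mean()) ** 2)))}
    for k in range(1, p + 1):
        cands = [(s, rss(X[:, list(s)], y)) for s in combinations(range(p), k)]
        best[k] = min(cands, key=lambda t: t[1])
    return best  # the final choice among M_0..M_p is then made by CV, AIC, BIC, ...
```

Note that RSS alone cannot compare models of different sizes, which is why the last step of the slide uses CV or a penalized criterion.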
Best Subset Selection
● Tractable only up to p ≈ 30-40 (there are 2^p subsets).
● Similarly for logistic regression:
  ● with deviance as the error measure instead of RSS,
  ● again, CV for model 'size' selection.
Forward Stepwise Selection
● The null model M_0 predicts the sample mean ȳ.
● for (k in 0:(p-1))
  ● consider the (p-k) models adding one predictor to M_k,
  ● select the one with the smallest RSS, or equivalently the largest R²,
    – denote it M_{k+1}.
● Select a single best model from among M_0, ..., M_p
  using cross-validation, AIC, BIC or adjusted R².
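A minimal NumPy sketch of the greedy loop (illustrative names, not from the slides):

```python
import numpy as np

def fit_rss(X, y, idx):
    """RSS of least squares on the columns in idx (intercept always included)."""
    Xi = np.column_stack([np.ones(len(y))] + [X[:, j] for j in idx])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    return float(np.sum((y - Xi @ beta) ** 2))

def forward_stepwise(X, y):
    """Greedy path M_0..M_p: at each step add the predictor that lowers RSS most."""
    p = X.shape[1]
    chosen, path = [], [((), fit_rss(X, y, []))]
    for _ in range(p):
        remaining = [j for j in range(p) if j not in chosen]
        j_best = min(remaining, key=lambda j: fit_rss(X, y, chosen + [j]))
        chosen.append(j_best)
        path.append((tuple(chosen), fit_rss(X, y, chosen)))
    return path  # p+1 nested models; pick one by CV, AIC, BIC, ...
```

Unlike best subset, this fits only O(p²) models, at the price of possibly missing the best subset of a given size.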
Backward Stepwise Selection
● Start from the full model M_p with all p predictors (standard LR).
● for (k in (p-1):0)
  ● consider the (k+1) models removing one predictor from M_{k+1},
  ● select the one with the smallest RSS, or equivalently the largest R²,
    – denote it M_k.
● Select a single best model from among M_0, ..., M_p
  using cross-validation, AIC, BIC or adjusted R².
Hybrid Approaches
● Go forward; at any time try to eliminate a useless predictor.
● Each algorithm may provide a different subset for a given size k (except sizes 0 and p ;-).
● None of these has to be optimal with respect to the mean test error.
Choosing the Optimal Model
● Two main approaches:
  ● analytical criteria: an adjustment to the training error to reduce overfitting (a 'penalty'),
    ● should not be used for p >> n!
  ● direct estimate of the test error, either
    ● a validation set,
    ● or a cross-validation approach.
Analytical Criteria
● Mallow's C_p, an 'in-sample error estimate':
  C_p = (RSS + 2·d·σ̂²) / n
● Akaike information criterion (more general; proportional to C_p here):
  AIC = (RSS + 2·d·σ̂²) / (n·σ̂²)
● Bayesian Information Criterion:
  BIC = (RSS + log(n)·d·σ̂²) / n
● Adjusted R²:
  Adj. R² = 1 − (RSS/(n−d−1)) / (TSS/(n−1)),
  equivalently: maximize it by minimizing RSS/(n−d−1).
(d = number of predictors, σ̂² = estimated error variance.)
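These criteria are straightforward to compute once RSS, TSS, d and σ̂² are known; a small Python helper (illustrative, using the formulas above):

```python
import numpy as np

def criteria(rss, tss, n, d, sigma2):
    """C_p, BIC and adjusted R^2 for a model with d predictors.
    Lower C_p/BIC is better; higher adjusted R^2 is better."""
    cp = (rss + 2 * d * sigma2) / n           # Mallow's C_p
    bic = (rss + np.log(n) * d * sigma2) / n  # heavier penalty than C_p once n > e^2
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2
```

Because log(n) > 2 for n > 7, BIC penalizes extra predictors more than C_p and tends to select smaller models.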
Example
Validation and Cross-Validation
● Validation: at the beginning,
  ● exclude, e.g., 1/4 of the data samples from training,
  ● use them for error estimation during model selection.
● Cross-Validation: at the beginning,
  ● split the data records into k = 10 folds,
  ● for k in 1:10
    – hide the k-th fold from training,
    – use it for error estimation during model selection.
Note: different runs may provide different subsets of size 3.
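A minimal k-fold CV sketch for least squares in NumPy (illustrative; a real analysis would repeat this for every candidate model size):

```python
import numpy as np

def kfold_mse(X, y, k=10, seed=0):
    """Estimate the test MSE of least squares by k-fold cross-validation.
    Returns the mean fold error and its standard error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)  # fit without fold i
        errs.append(np.mean((y[test] - Xte @ beta) ** 2))      # evaluate on fold i
    return float(np.mean(errs)), float(np.std(errs) / np.sqrt(k))
```

The per-fold standard error returned here is exactly what the one-standard-error rule below needs.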
Example
One Standard Error Rule
● Take the model size with the minimal CV error,
● calculate the 1-standard-error interval around this error,
● select the smallest model with its error inside this interval.
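The rule can be sketched as follows (assuming the per-size CV errors and their standard errors have already been computed):

```python
import numpy as np

def one_se_rule(cv_errors, cv_se):
    """Pick the smallest model whose CV error is within one standard error
    of the minimum. cv_errors[k], cv_se[k] are indexed by model size k."""
    cv_errors = np.asarray(cv_errors)
    k_min = int(np.argmin(cv_errors))
    threshold = cv_errors[k_min] + cv_se[k_min]
    return int(np.argmax(cv_errors <= threshold))  # first (smallest) size under threshold
```

The idea is that models whose errors differ by less than the estimation noise are statistically indistinguishable, so the simplest one wins.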
Shrinkage Methods
● Penalty for non-zero model parameters,
● no penalty for the intercept β₀.
● Ridge: minimize RSS + λ·Σ_j β_j²
● Lasso: minimize RSS + λ·Σ_j |β_j|
Ridge
● The parameter λ penalizes the sum of the β_j².
● β₀ is intentionally excluded from the penalty;
  ● we can center the features and fix β₀ = ȳ.
● For centered features: β̂_ridge = (XᵀX + λI)⁻¹ Xᵀy.
● For orthonormal features: β̂_ridge = β̂_OLS / (1 + λ).
● The solution depends on the scale of the features: standardization is useful.
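The centered closed form above can be sketched in NumPy (illustrative function, not from the slides):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge on centered data:
    beta = (X'X + lam*I)^-1 X'y, with the intercept fixed at y-bar."""
    Xc = X - X.mean(axis=0)   # centering removes the intercept from the penalty
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y.mean(), beta     # (intercept for centered features, coefficients)
```

With λ = 0 this reduces to ordinary least squares; increasing λ shrinks the coefficient vector towards zero.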
Ridge Coefficients – Cancer Example
Lasso regression
● the penalty is λ·Σ_j |β_j|,
● it forces some coefficients to be exactly zero,
● an equivalent specification:
  minimize RSS subject to Σ_j |β_j| ≤ s.
Ridge vs. Lasso
Correlated X, Parameter Shrinkage
Best subset, Ridge, Lasso
● Coefficient change for orthonormal features:
  ● best subset: hard thresholding, keep β̂_j iff |β̂_j| is above a threshold,
  ● ridge: proportional shrinkage, β̂_j / (1 + λ),
  ● lasso: soft thresholding, sign(β̂_j)·(|β̂_j| − λ)₊.
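The three update rules can be written out directly (NumPy sketch; the function names are illustrative):

```python
import numpy as np

def hard_threshold(b, t):
    """Best subset with orthonormal features: keep a coefficient iff |b| >= t."""
    return np.where(np.abs(b) >= t, b, 0.0)

def ridge_shrink(b, lam):
    """Ridge with orthonormal features: shrink every coefficient by 1/(1+lam)."""
    return b / (1.0 + lam)

def soft_threshold(b, lam):
    """Lasso with orthonormal features: shrink towards zero and clip at zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```

This makes the qualitative difference visible: ridge shrinks all coefficients but zeroes none, while lasso both shrinks and sets small coefficients exactly to zero.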
Example
● p = 45, n = 50, only 2 predictors related to the output.
Bias-variance decomposition
● Bias – 'systematic error',
  ● usually caused by a restricted model subspace.
● Var – variance of the estimate.
● We wish both to be zero; at a fixed point x₀,
  E[(y₀ − f̂(x₀))²] = Bias(f̂(x₀))² + Var(f̂(x₀)) + σ².
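A small Monte-Carlo sketch (illustrative; assumes a fixed design with noise resampled on each repetition) that estimates both terms for a prediction at a single point x₀:

```python
import numpy as np

def bias_variance(estimator, f_true, x0, n=30, sigma=1.0, reps=2000, seed=0):
    """Monte-Carlo estimate of the squared bias and variance of a prediction
    at x0, over repeated noisy training sets on a fixed design."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n)                    # fixed training inputs
    preds = []
    for _ in range(reps):
        y = f_true(x) + sigma * rng.normal(size=n)   # fresh noise each repetition
        preds.append(estimator(x, y, x0))
    preds = np.asarray(preds)
    bias2 = (preds.mean() - f_true(x0)) ** 2         # squared 'systematic error'
    return float(bias2), float(preds.var())          # variance of the estimate
```

Comparing a restricted model (a constant) with an unrestricted one (a line) on linear truth illustrates the trade-off: the constant has large bias but small variance, the line the reverse.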
Example of Bias
Example: Lasso, Ridge Regression
● red: MSE,
● green: variance,
● black: squared bias,
plotted against the penalty strength (penalty positive).
MSE: 100 observations, p differs
Penalty ~ prior model probability
● Ridge:
  ● we assume a prior probability of the parameters,
    independent, normally distributed;
  ● then the ridge estimate is the most likely one (the posterior mode).
● Bayes formula:
  P(β|X) = P(X|β) · P(β) / P(X)
  – P(X) constant, P(β) prior probability,
  – P(X|β) likelihood, P(β|X) posterior probability,
  hence P(β|X) ~ P(X|β) · P(β).
Prior Probability: Ridge, Lasso
● Ridge: Normal distribution,
● Lasso: Laplace distribution.
Linear Models for Regression
● Ridge, Lasso – penalization.
● PCR, PLS – coordinate-system change + dimension selection.
Principal Component Analysis (PCA)
eigenvalues, eigenvectors
PCR, PLS
● PCR, Principal component regression:
  ● select the directions corresponding to the largest eigenvalues,
  ● for these directions, regression coefficients are fitted,
  ● for size M = p it is equivalent to linear regression.
● PLS, Partial least squares – considers Y for the selection:
  ● calculates regression coefficients,
  ● weights the features and calculates eigenvalues,
  ● selects the first direction of PLS,
  ● other directions are found similarly, each orthogonal to the previous ones.