Lecture 3: Regularized Linear Models
Hien Van Nguyen
University of Houston
9/6/2017
Bias-Variance Tradeoff
• Suppose the data arise from the model below
• Assume the input is fixed and rewrite the MSE (decomposition reconstructed below)
• A more complex model will typically result in lower bias (better fit) but higher variance
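The slide's equations did not survive extraction; a standard reconstruction of the model and the MSE decomposition, with f(x) denoting the true value the slide annotates:

$$y = f(x) + \epsilon, \qquad \mathbb{E}[\epsilon] = 0, \quad \mathrm{Var}(\epsilon) = \sigma_\epsilon^2$$

For a fixed input $x_0$ and a learned predictor $\hat{f}$ (random through the training set),

$$\mathbb{E}\big[(y - \hat{f}(x_0))^2\big] = \underbrace{\sigma_\epsilon^2}_{\text{noise}} + \underbrace{\big(f(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\text{Variance}}$$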
Bias-Variance Tradeoff – Linear Regression
• Linear model
• Bias-variance decomposition of the linear model (both reconstructed below)
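The equations here were also lost in extraction; a reconstruction of the standard result (following, e.g., Hastie et al., The Elements of Statistical Learning, with p variables and n samples):

$$\hat{f}(x) = \hat{w}^\top x, \qquad \hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2$$

Averaging the prediction error over the training inputs gives

$$\frac{1}{n}\sum_{i=1}^n \mathrm{Err}(x_i) = \sigma_\epsilon^2 + \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - \mathbb{E}[\hat{f}(x_i)]\big)^2 + \frac{p}{n}\,\sigma_\epsilon^2,$$

i.e., each of the p variables contributes $\sigma_\epsilon^2/n$ to the variance term.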
Bias-Variance Tradeoff – Linear Regression
• Each additional variable in the predictor weight vector adds the same amount of variance σ_ε²/n regardless of whether its true coefficient is large or small (or even zero)
• In other words, the variance scales linearly with the number of variables
• On a side note, the variance is inversely proportional to the number of training samples
Why regularize the model?
• Predictive ability: linear regression has zero bias but suffers from high variance. In some applications it may be desirable to sacrifice some bias for a smaller variance
• Interpretability: a large number of predictor variables makes the model difficult to interpret, and variables can be highly correlated; we need to identify a smaller subset of important variables
• Data scarcity: when the data dimension is large and the number of training samples is small, we cannot use non-regularized linear regression
Ridge Regression
• Ridge regression (RR) is similar to least-squares regression, but shrinks the regression coefficients towards zero by imposing a penalty on their size
• One-variable case (objective reconstructed below)
• Important convention: the constant (intercept) variable b can be removed by centering the data
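The objective on the slide is missing; a reconstruction of the one-variable ridge problem, assuming the data have been centered so that the intercept b drops out:

$$\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w x_i)^2 + \lambda w^2, \qquad \lambda \ge 0$$

Note the penalty applies only to the slope w, which is why removing b by centering is the convenient convention.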
Ridge Regression
• Solution:
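A reconstruction of the solution: setting the derivative of the one-variable objective to zero,

$$-2\sum_{i=1}^n x_i (y_i - w x_i) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat{w} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 + \lambda}$$

The λ in the denominator shrinks the least-squares solution towards zero; λ = 0 recovers ordinary least squares.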
Multivariate Ridge Regression
• Multivariate version (objective reconstructed below)
• Take the derivative and set it to zero (see the derivation after this list)
• Regularization makes the problem non-singular even if XXᵀ is not full rank
• Scenarios where XXᵀ is not full rank:
• One dimension can be computed as a linear combination of the other dimensions
• The dimension is larger than the number of training samples
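A reconstruction of the multivariate derivation, using the slide's convention that X ∈ R^{d×n} stacks the n training samples as columns (so XXᵀ is d×d):

$$J(w) = \|X^\top w - y\|_2^2 + \lambda \|w\|_2^2, \qquad \nabla J = 2X(X^\top w - y) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat{w} = (XX^\top + \lambda I)^{-1} X y$$

For λ > 0, XXᵀ + λI is positive definite and hence always invertible. A minimal NumPy sketch of this closed form (variable names are my own, not the lecture's):

    import numpy as np

    def ridge_closed_form(X, y, lam):
        """Ridge solution w = (X X^T + lam*I)^{-1} X y.
        X: (d, n) with samples as columns, y: (n,), lam: regularization > 0."""
        d = X.shape[0]
        # solve() is preferred over explicitly inverting the matrix
        return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)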
Least Absolute Shrinkage and Selection Operator (Lasso)
• Motivation:
• Ridge regression rarely sets coefficients to zero exactly → it cannot perform variable selection in the linear model → less interpretable model
• Solution: use a different regularizer (reconstructed below)
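A reconstruction of the lasso objective under the same data convention as the ridge slides:

$$\hat{w} = \arg\min_w \|X^\top w - y\|_2^2 + \lambda \|w\|_1, \qquad \|w\|_1 = \sum_j |w_j|$$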
Lasso
• Ridge uses ℓ2-norm while lasso uses ℓ1-norm
• Many coefficients will be shrunk to zero exactly
• The regularizer is strongly affected by the scale of each variable: variables with good predictive power can be penalized more heavily simply because of their scale
• Good practice: scale variables to unit variance (a minimal sketch follows below)
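A minimal sketch of the standardization step, assuming features are stored as rows of X (my convention here, matching the slides' d×n layout):

    import numpy as np

    # X: (d, n) with one feature per row; rescale each feature to zero mean
    # and unit variance so the l1 penalty treats all variables comparably.
    def standardize(X):
        return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)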
Lasso
• No closed-form solution; the problem must be solved with iterative optimization methods
• The objective function cannot be decomposed explicitly into bias and variance
• But a similar trend holds: larger λ leads to higher bias and lower variance
• More suitable for variable selection (see the comparison sketch below)
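A small sketch contrasting the two regularizers with scikit-learn (not the lecture's code; note sklearn stores samples as rows, the transpose of the slides' convention):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    n, d = 50, 10
    A = rng.randn(n, d)                      # samples as rows (sklearn convention)
    w_true = np.zeros(d)
    w_true[:3] = [2.0, -1.5, 1.0]            # sparse ground truth
    y = A @ w_true + 0.1 * rng.randn(n)

    ridge = Ridge(alpha=1.0).fit(A, y)
    lasso = Lasso(alpha=0.1).fit(A, y)
    print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 10
    print("nonzero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically ~3

Internally, sklearn's Lasso uses coordinate descent, one of the iterative methods alluded to above.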
Coefficient Paths – Prostate Cancer
[Figure: regression coefficients plotted against the regularization strength on the prostate cancer data; left panel: Ridge, right panel: Lasso]
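The original figure is gone; a hedged sketch of how such coefficient paths can be reproduced with scikit-learn on stand-in data (the lecture used the prostate cancer dataset, which is not bundled with sklearn):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge, lasso_path

    rng = np.random.RandomState(0)
    A = rng.randn(80, 8)                                  # hypothetical data
    y = A @ rng.randn(8) + 0.5 * rng.randn(80)

    # Ridge path: refit over a grid of regularization strengths
    ridge_alphas = np.logspace(-2, 4, 50)
    ridge_coefs = np.array([Ridge(alpha=a).fit(A, y).coef_ for a in ridge_alphas])

    # Lasso path: computed directly by coordinate descent
    lasso_alphas, lasso_coefs, _ = lasso_path(A, y)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(ridge_alphas, ridge_coefs); ax1.set_xscale("log"); ax1.set_title("Ridge")
    ax2.plot(lasso_alphas, lasso_coefs.T); ax2.set_xscale("log"); ax2.set_title("Lasso")
    plt.show()

Ridge paths shrink smoothly towards zero as the regularization grows, while lasso paths hit exactly zero one coefficient at a time.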