Lecture 3: Regularized Linear Models
Hien Van Nguyen
University of Houston
9/6/2017
Bias-Variance Tradeoff
• Suppose the data arise from the model below
• Assume the input is fixed and rewrite the MSE (decomposition reconstructed below)
• A more complex model will typically result in lower bias (better fit) but higher variance
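The slide's equations did not survive extraction; a standard reconstruction of the model and the MSE decomposition, with f(x) denoting the true value the slide annotates:

$$y = f(x) + \epsilon, \qquad \mathbb{E}[\epsilon] = 0, \quad \mathrm{Var}(\epsilon) = \sigma_\epsilon^2$$

For a fixed input $x_0$ and a learned predictor $\hat{f}$ (random through the training set),

$$\mathbb{E}\big[(y - \hat{f}(x_0))^2\big] = \underbrace{\sigma_\epsilon^2}_{\text{noise}} + \underbrace{\big(f(x_0) - \mathbb{E}[\hat{f}(x_0)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\text{Variance}}$$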
Bias-Variance Tradeoff – Linear Regression
• Linear model
• Bias-variance decomposition of the linear model (both reconstructed below)
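The equations here were also lost in extraction; a reconstruction of the standard result (following, e.g., Hastie et al., The Elements of Statistical Learning, with p variables and n samples):

$$\hat{f}(x) = \hat{w}^\top x, \qquad \hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2$$

Averaging the prediction error over the training inputs gives

$$\frac{1}{n}\sum_{i=1}^n \mathrm{Err}(x_i) = \sigma_\epsilon^2 + \frac{1}{n}\sum_{i=1}^n \big(f(x_i) - \mathbb{E}[\hat{f}(x_i)]\big)^2 + \frac{p}{n}\,\sigma_\epsilon^2,$$

i.e., each of the p variables contributes $\sigma_\epsilon^2/n$ to the variance term.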
Bias-Variance Tradeoff – Linear Regression
• Each additional variable in the predictor weight vector adds the same amount of variance σ_ε²/n regardless of whether its true coefficient is large or small (or even zero)
• In other words, the variance scales linearly with the number of variables
• On a side note, the variance is inversely proportional to the number of training samples
Why regularize the model?
• Predictive ability: linear regression has zero bias but suffers from high variance. In some applications it may be desirable to sacrifice some bias for a smaller variance
• Interpretability: a large number of predictor variables makes the model difficult to interpret, and variables can be highly correlated; we need to identify a smaller subset of important variables
• Data scarcity: when the data dimension is large and the number of training samples is small, we cannot use non-regularized linear regression
Ridge Regression
• Ridge regression (RR) is similar to least-squares regression, but shrinks the regression coefficients towards zero by imposing a penalty on their size
• One-variable case (objective reconstructed below)
• Important convention: the constant (intercept) variable b can be removed by centering the data
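The objective on the slide is missing; a reconstruction of the one-variable ridge problem, assuming the data have been centered so that the intercept b drops out:

$$\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w x_i)^2 + \lambda w^2, \qquad \lambda \ge 0$$

Note the penalty applies only to the slope w, which is why removing b by centering is the convenient convention.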
Ridge Regression
• Solution:
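A reconstruction of the solution: setting the derivative of the one-variable objective to zero,

$$-2\sum_{i=1}^n x_i (y_i - w x_i) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat{w} = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2 + \lambda}$$

The λ in the denominator shrinks the least-squares solution towards zero; λ = 0 recovers ordinary least squares.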
Multivariate Ridge Regression
• Multivariate version (objective reconstructed below)
• Take the derivative and set it to zero (see the derivation after this list)
• Regularization makes the problem non-singular even if XXᵀ is not full rank
• Scenarios where XXᵀ is not full rank:
• One dimension can be computed as a linear combination of the other dimensions
• The dimension is larger than the number of training samples
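A reconstruction of the multivariate derivation, using the slide's convention that X ∈ R^{d×n} stacks the n training samples as columns (so XXᵀ is d×d):

$$J(w) = \|X^\top w - y\|_2^2 + \lambda \|w\|_2^2, \qquad \nabla J = 2X(X^\top w - y) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat{w} = (XX^\top + \lambda I)^{-1} X y$$

For λ > 0, XXᵀ + λI is positive definite and hence always invertible. A minimal NumPy sketch of this closed form (variable names are my own, not the lecture's):

    import numpy as np

    def ridge_closed_form(X, y, lam):
        """Ridge solution w = (X X^T + lam*I)^{-1} X y.
        X: (d, n) with samples as columns, y: (n,), lam: regularization > 0."""
        d = X.shape[0]
        # solve() is preferred over explicitly inverting the matrix
        return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)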
Least Absolute Shrinkage and Selection Operator (Lasso)
• Motivation:
• Ridge regression rarely sets coefficients to zero exactly → it cannot perform variable selection in the linear model → less interpretable model
• Solution: use a different regularizer (reconstructed below)
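A reconstruction of the lasso objective under the same data convention as the ridge slides:

$$\hat{w} = \arg\min_w \|X^\top w - y\|_2^2 + \lambda \|w\|_1, \qquad \|w\|_1 = \sum_j |w_j|$$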
Lasso
• Ridge uses ℓ2-norm while lasso uses ℓ1-norm
• Many coefficients will be shrunk to zero exactly
• The regularizer is strongly affected by the scale of each variable: variables with good predictive power can be penalized more heavily simply because of their scale
• Good practice: scale variables to unit variance (a minimal sketch follows below)
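A minimal sketch of the standardization step, assuming features are stored as rows of X (my convention here, matching the slides' d×n layout):

    import numpy as np

    # X: (d, n) with one feature per row; rescale each feature to zero mean
    # and unit variance so the l1 penalty treats all variables comparably.
    def standardize(X):
        return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)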
Lasso
• No closed-form solution; the problem must be solved with iterative optimization methods
• The objective function cannot be decomposed explicitly into bias and variance
• But a similar trend holds: larger λ leads to higher bias and lower variance
• More suitable for variable selection (see the comparison sketch below)
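A small sketch contrasting the two regularizers with scikit-learn (not the lecture's code; note sklearn stores samples as rows, the transpose of the slides' convention):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.RandomState(0)
    n, d = 50, 10
    A = rng.randn(n, d)                      # samples as rows (sklearn convention)
    w_true = np.zeros(d)
    w_true[:3] = [2.0, -1.5, 1.0]            # sparse ground truth
    y = A @ w_true + 0.1 * rng.randn(n)

    ridge = Ridge(alpha=1.0).fit(A, y)
    lasso = Lasso(alpha=0.1).fit(A, y)
    print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 10
    print("nonzero lasso coefficients:", np.sum(np.abs(lasso.coef_) > 1e-6))  # typically ~3

Internally, sklearn's Lasso uses coordinate descent, one of the iterative methods alluded to above.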
Coefficient Paths – Prostate Cancer
[Figure: regression coefficients plotted against the regularization strength on the prostate cancer data; left panel: Ridge, right panel: Lasso]
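The original figure is gone; a hedged sketch of how such coefficient paths can be reproduced with scikit-learn on stand-in data (the lecture used the prostate cancer dataset, which is not bundled with sklearn):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge, lasso_path

    rng = np.random.RandomState(0)
    A = rng.randn(80, 8)                                  # hypothetical data
    y = A @ rng.randn(8) + 0.5 * rng.randn(80)

    # Ridge path: refit over a grid of regularization strengths
    ridge_alphas = np.logspace(-2, 4, 50)
    ridge_coefs = np.array([Ridge(alpha=a).fit(A, y).coef_ for a in ridge_alphas])

    # Lasso path: computed directly by coordinate descent
    lasso_alphas, lasso_coefs, _ = lasso_path(A, y)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(ridge_alphas, ridge_coefs); ax1.set_xscale("log"); ax1.set_title("Ridge")
    ax2.plot(lasso_alphas, lasso_coefs.T); ax2.set_xscale("log"); ax2.set_title("Lasso")
    plt.show()

Ridge paths shrink smoothly towards zero as the regularization grows, while lasso paths hit exactly zero one coefficient at a time.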