Prediction with Regression
An Introduction to Linear Regression and Shrinkage Methods
Ehsan Khoddam Mohammadi
Outline
• Prediction
• Estimation
• Bias-Variance Trade-Off
• Regression
• Ordinary Least Squares
• Ridge Regression
• Lasso
Prediction: definition
Set of inputs: $X_1, X_2, \ldots, X_p$
The output: $Y$
We want to analyze the relationship between these variables (interpretation).
We want to estimate the output based on the inputs (prediction).
Prediction: the same concept in different literatures
• Machine learning: supervised learning
• Finance: forecasting
• Politics: prediction
• Estimation theory: function approximation

| | Statistics | ML | Economy |
| --- | --- | --- | --- |
| $X_1, X_2, \ldots, X_p$ | Predictors | Features | Independent variables |
| $Y$ | Response | Class | Dependent variables |
Regression: why?
• Performs well and accurately in both interpretation and prediction
• Strong foundations in mathematics, statistics, and computation
• Many modern and advanced methods are based on regression, or are variants of it
• New methods are still being invented for regression: Nobel prizes are still awarded for investigations into regression; it remains a hot topic
• Can be formulated as an optimization problem: that is the reason I chose it for this class; it is more related to the subject of the class than any other prediction method I know
Regression: classification
• Linear regression
  - Least squares
  - Regression with feature selection: best-subset selection, stepwise regression
  - Shrinkage (regularization) for regression: ridge regression, lasso regression
• Non-linear regression
  - Numerical data fitting
  - ANN
• Discrete regression
  - Logistic regression
Before proceeding with regression, let's investigate some statistical properties of
ESTIMATION
Estimating a parameter
Assume that we have i.i.d. (independent and identically distributed) samples $X_1, \ldots, X_n$ with unknown distribution. Estimating their p.d.f. is too hard in many situations; instead, we want to estimate a parameter $\theta$.
$\hat\theta$ is an estimate of $\theta$; it is a function of $X_1, \ldots, X_n$.
Bias-Variance dilemma
Definition 1: The bias of an estimator is $Bias(\hat\theta) = E(\hat\theta) - \theta$. If it is 0, the estimator is said to be unbiased.
Definition 2: The mean squared error (MSE) of an estimator is $E((\hat\theta - \theta)^2)$.
An interesting equation:
$$E((\hat\theta - \theta)^2) = Var(\hat\theta) + Bias(\hat\theta)^2$$
What does it really mean?
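The identity can be checked numerically. Below is a minimal NumPy sketch; the two estimators of the mean, and the constants used, are illustrative assumptions, not from the slides:

```python
import numpy as np

# Monte Carlo check of MSE = Var + Bias^2 for two estimators of the mean
# of a N(mu, 1) distribution, using n samples per trial.
rng = np.random.default_rng(0)
mu, n, trials = 2.0, 10, 100_000

samples = rng.normal(mu, 1.0, size=(trials, n))
est_unbiased = samples.mean(axis=1)   # sample mean: unbiased
est_shrunk = 0.8 * est_unbiased       # shrunken mean: biased, but lower variance

for est in (est_unbiased, est_shrunk):
    mse = np.mean((est - mu) ** 2)    # empirical MSE
    var = np.var(est)                 # empirical variance
    bias = np.mean(est) - mu          # empirical bias
    assert abs(mse - (var + bias ** 2)) < 1e-9
```

The shrunken estimator trades bias for lower variance, which is exactly the trade-off the shrinkage methods later in these slides exploit.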
[Image from “More on Regularization and (Generalized) Ridge Operators”, Takane,(2007)]
Test and training error as a function of model complexity.
[ Image from “The Elements of Statistical Learning”,Second Edition, Hastie et al. (2008)]
Linear Regression: Model
Set of training data: $(x_1, y_1), \ldots, (x_n, y_n)$, where each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$.
Linear regression model:
$$f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$
The real-valued coefficients $\beta$ need to be estimated.
Linear Regression: Least squares
The most popular estimation method: minimize the residual sum of squares:
$$RSS(\beta) = \sum_{i=1}^{N} (y_i - f(x_i))^2 = \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$
How do we minimize it?
Linear Regression: Least squares
Let's rewrite the last formula in matrix form:
$$RSS(\beta) = (y - X\beta)^T (y - X\beta)$$
This is a quadratic function of $\beta$ (not the point here, but we shall use this property later).
Differentiating with respect to $\beta$ and setting it to zero:
$$\frac{\partial RSS}{\partial \beta} = -2X^T(y - X\beta) = 0$$
Unique solution:
$$\hat\beta = (X^T X)^{-1} X^T y; \qquad \hat y = X\hat\beta$$
Under which assumptions can we obtain a unique solution?
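The closed-form solution is easy to verify numerically. Here is a small NumPy sketch on synthetic data; the data, noise level, and coefficient values are made up for illustration:

```python
import numpy as np

# Fit beta_hat = (X^T X)^{-1} X^T y on synthetic data with known coefficients.
rng = np.random.default_rng(0)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept column + p features
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations; np.linalg.solve is preferred over forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
```

With independent columns and little noise, `beta_hat` lands close to `beta_true`; with highly correlated columns it would not, which motivates the next slides.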
Linear Regression: Least squares, assumptions
$X$ should be full-rank; then $X^T X$ is positive definite and invertible, and a unique solution can be obtained. In other words, the feature vectors should be linearly independent, i.e. uncorrelated.
What happens to $\beta$ if $X$ is not a full-rank matrix, or if some features are highly correlated?
Linear Regression: Least squares, flaws
Low bias but high variance:
$$Var(\hat\beta) = (X^T X)^{-1} \sigma^2$$
and one can estimate $\sigma^2$ by:
$$\hat\sigma^2 = \frac{1}{N - p - 1} \sum_{i=1}^{N} (y_i - \hat y_i)^2$$
It is hard to find a meaningful relation if we have too many features.
What would you recommend to solve these problems?
Linear Regression: Improvements
• Model selection (feature selection):
  - Best-subset selection (leaps and bounds, Furnival and Wilson (1974))
  - Stepwise selection (greedy approach; sub-optimal but often preferred)
  - mRMR (uses a mutual-information criterion for selection)
• Shrinkage methods: impose a constraint on $\beta$
  - Ridge regression
  - Lasso regression
Ridge Regression
When you have a problem in statistics that you want solved, there is always a Russian statistician waiting to solve it for you. (Be careful! Only in statistics, I guarantee; they will betray you in any other situation.)
Andrey Nikolayevich Tychonoff introduced Tikhonov (!!!) regularization for ill-posed problems, also known as ridge regression in statistics.
Ridge Regression: first attempt
Remember this?
$$\hat\beta = (X^T X)^{-1} X^T y$$
Tychonoff added a term to avoid singularity, changing the formula above to:
$$\hat\beta^{ridge} = (X^T X + \lambda I)^{-1} X^T y$$
Now the inverse can be computed even if $X^T X$ is not of full rank, and $\beta$ is still a linear function of $y$. Everything started from this formula, but now we have a better point of view than Tychonoff; let's take a look!
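A quick NumPy sketch of the ridge formula, using a deliberately rank-deficient $X$ (a duplicated feature) to show that the regularized inverse still exists; the data are illustrative:

```python
import numpy as np

# beta_ridge = (X^T X + lam*I)^{-1} X^T y, computed where plain least squares fails.
rng = np.random.default_rng(1)
N = 50
x1 = rng.normal(size=N)
X = np.column_stack([x1, x1])            # two identical columns: X^T X is singular
y = 3.0 * x1 + 0.1 * rng.normal(size=N)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
# By symmetry, the two coefficients split the shared signal equally
```

Plain least squares has no unique solution here; the ridge term makes the system invertible and distributes the weight evenly across the correlated features.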
Ridge Regression: a better motivation
To avoid the high variance of $\beta$, we impose a constraint on it; our problem is now a constrained optimization problem:
$$\hat\beta^{ridge} = \arg\min_\beta \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2, \quad \text{subject to } \sum_{j=1}^{p} \beta_j^2 \le t$$
An even better representation, using the Lagrangian form:
$$\hat\beta^{ridge} = \arg\min_\beta \left\{ \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$
Or, better again, in matrix form, which we can differentiate and set to zero:
$$RSS(\lambda) = (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$$
Could you guess the solution?
Could you find a relation between $\beta$ and $\beta^{ridge}$ when the inputs are orthonormal?
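As a hint for the question above: when the columns of $X$ are orthonormal ($X^T X = I$), the ridge estimate is the least-squares estimate shrunk by the constant factor $1/(1+\lambda)$. A small NumPy check; the matrix sizes and $\lambda$ value are arbitrary:

```python
import numpy as np

# With orthonormal inputs, beta_ridge = beta_ols / (1 + lam).
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(100, 4)))  # Q has orthonormal columns, Q^T Q = I
y = rng.normal(size=100)

lam = 2.0
beta_ols = Q.T @ y  # since Q^T Q = I, (Q^T Q)^{-1} Q^T y = Q^T y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(4), Q.T @ y)
assert np.allclose(beta_ridge, beta_ols / (1 + lam))
```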
LASSO
Least Absolute Shrinkage and Selection Operator
LASSO
We impose an L1-norm constraint on our regression:
$$\hat\beta^{lasso} = \arg\min_\beta \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2, \quad \text{subject to } \sum_{j=1}^{p} |\beta_j| \le t$$
No closed form exists; the solution is a non-linear function of $y$.
How could you solve the above problem? (hint: ask Mr. Iranmehr!)
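Since no closed form exists, the lasso is solved iteratively. One standard approach (not the only one) is coordinate descent with soft-thresholding, sketched below for the Lagrangian form $\tfrac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$; the function names and data are illustrative:

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/2)*||y - X b||^2 + lam * ||b||_1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]               # partial residual
            beta[j] = soft_threshold(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])
    return beta

# Sparse recovery demo: only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 2.0, 0.0]) + 0.01 * rng.normal(size=100)
beta = lasso_cd(X, y, lam=5.0)
```

The non-zero coefficients come out slightly shrunk toward zero, and the irrelevant ones are typically set exactly to zero; that exact sparsity is what distinguishes the L1 penalty from the ridge (L2) penalty.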
LASSO: why?
• One of the first uses of the L1-norm; showed significant results in signal processing and denoising [Chen et al. (1998)]
• Base method for LAR (a new and novel regression method, not covered here) [Efron et al. (2004)]
• Good for sparse model selection where p > N [Donoho (2006b)]
REFERENCES
• "The Elements of Statistical Learning", Second Edition, Hastie et al., 2008
• "More on Regularization and (Generalized) Ridge Operators", Takane, 2007
• "Bias, Variance and MSE of Estimators", Guy Lebanon, 2004
• "Least Squares Optimization with L1-Norm Regularization", Mark Schmidt, 2005
• "Regularization: Ridge Regression and the LASSO", Tibshirani, 2006