TRANSCRIPT
Regularization Methods for Linear Regression
Mathilde Mougeot
ENSIIE
2017-2018
Mathilde Mougeot (ENSIIE) MRR2017 1 / 66
Agenda
Lessons
• 3 plenary lessons (MRR)
• 3 Practical work sessions using R (mrr)
• 10 Project sessions (ipR)
Before next session, install on your computer
1 R software, https://www.r-project.org/
2 Rstudio, https://www.rstudio.com/
Documents are available (at this stage)
• https://sites.google.com/site/MougeotMathilde/teaching
Plan
A word on data and predictive models
Data are everywhere
• Industry (temperature, IR sensors...)
• Finance: transactions
• Marketing: consumer data
• On your phone (GPS, mail, music...)
→ Databases are available everywhere: from small data sets to Big Data
Plan
A word on data and predictive models
Nowadays, predictive models are crucial for monitoring and diagnosis
• Industry: health monitoring, energy...
• Finance: forecasting the evolution of the market
• Marketing: scoring
• Health
→ Machine learning models are used to mine and operate on the data.
Plan
Regularization Methods for Linear Regression
- Linear regression and regularized linear regression belong to the predictive model family.
- Linear regression is an old model but still very useful! (Gauss, 1785; Legendre, 1805)
Outline of the lesson
• Motivations
• Ordinary Least Squares -OLS- (geometrical approach)
• The linear model (probabilistic approach)
• Using R software for modeling
Plan
Evolution of the average age of the French population
Plan
Modeling the average age of the French population
Plan
Application: Social Networks
Facebook users (millions):

31/12/04: 0
31/12/05: 5
31/12/06: 10
31/3/07: 20
30/12/07: 60
30/6/08: 100
30/12/08: 150
30/3/09: 170
30/6/09: 200
30/9/09: 250
30/12/09: 300
30/3/10: 400
30/6/10: 500
30/6/11: 750
Plan
Evolution of the number of Facebook users
Motivations: investment, forecasting
Plan
Modeling the evolution of the number of Facebook users
Introduction: Regression model
• (Y, X): a pair of variables
Y: target, quantitative variable
X = (X1, X2, ..., Xp): covariates, quantitative variables
→ The goal is to propose a regression model to explain Y given X. The parameters of the model are computed using a set of data:
Y = Fdata(X) = Fdata(X1, ..., Xp)
In this case, F is a linear function.
• Questions:
- What are the performances of this model?
- What are the main explanatory variables?
- Is it possible to use the model to predict new values? To forecast?
- Can we improve the model?
Boston Housing Data
The original data are n = 506 observations on p = 14 variables, medv (median value) being the target variable:

crim     per capita crime rate by town
zn       proportion of residential land zoned for lots over 25,000 sq.ft
indus    proportion of non-retail business acres per town
chas     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox      nitric oxides concentration (parts per 10 million)
rm       average number of rooms per dwelling
age      proportion of owner-occupied units built prior to 1940
dis      weighted distances to five Boston employment centres
rad      index of accessibility to radial highways
tax      full-value property-tax rate per USD 10,000
ptratio  pupil-teacher ratio by town
b        1000(B − 0.63)^2 where B is the proportion of blacks by town
lstat    percentage of lower status of the population
medv     median value of owner-occupied homes in USD 1000's
Boston Housing Data
The data:

nr  crim   zn  indus  chas  nox   rm    age   dis   rad  tax  ptratio  b      lstat  medv
1   0.006  18  2.3    0     0.53  6.57  65.2  4.09  1    296  15.3     396.9  4.9    24.0
2   0.027  0   7.0    0     0.46  6.42  78.9  4.96  2    242  17.8     396.9  9.1    21.6
3   0.027  0   7.0    0     0.46  7.18  61.1  4.96  2    242  17.8     392.8  4.0    34.7
4   0.032  0   2.1    0     0.45  6.99  45.8  6.06  3    222  18.7     394.6  2.9    33.4
5   0.069  0   2.1    0     0.45  7.14  54.2  6.06  3    222  18.7     396.9  5.3    36.2
...
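These data can be inspected directly in R (a sketch assuming the MASS package, where this data set ships; note that there the slide's column "b" is named "black"):

```r
# Load and inspect the Boston Housing data shipped with MASS
library(MASS)
data(Boston)
dim(Boston)           # 506 observations, 14 variables
head(Boston, 3)       # first rows, as on the slide
summary(Boston$medv)  # the target variable medv
```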
Boston Housing Data
Different points of view may exist for studying these data, with the model Y = Fdata(X):
• Mining approach
− Evaluate the performances of the model
− What are the most important variables? (variable selection)
→ sparse models, less complex, best performances
• Predictive approach
− Inference and simulation
→ point estimation for new values of the covariates
→ confidence interval computation
Outline
• Applications
• Ordinary Least Squares (OLS) / Moindres Carrés Ordinaires (MCO)
• Linear Model
• Regularization methods: ridge, lasso
OLS
Ordinary Least Squares (OLS)
Outline
Ordinary Least Squares (OLS)
• Values/Variables:
- Y, Y ∈ R: value / target variable
- X = (X^1, ..., X^p), X ∈ R^p: values / covariates
• Data: S = {(xi, yi), i = 1, ..., n, yi ∈ R, xi ∈ R^p}
• Goal: model Y linearly with X, with a "small" ε term
• Y = β1 + β2 X2 + ... + βp Xp + ε
• If X1 = 1, we can write Y = β1 X1 + β2 X2 + ... + βp Xp + ε
• Y = Σ_{j=1}^p βj Xj + ε   (p parameters βj, p variables, "one constant" inside)
OLS
Ordinary Least Squares (OLS): Simple Linear Regression model
Outline
Simple Linear Regression: example
We only have one covariate (X) to explain the target variable (Y).
The scatter plot is represented by:
[Scatter plot of y (12 to 24) against x (5 to 10)]
Outline
Simple Linear Regression: example
For all observation couples (Yi, Xi), 1 ≤ i ≤ n, the goal is to minimize Σ_{i=1}^n (Yi − (β1 + β2 Xi))²
[Same scatter plot of y against x, with the fitted regression line]
Outline
Ordinary Least Squares: simple linear model
Formalism:
• Distance to a single point: (yi − xi β2 − β1)²
• Distance to the whole sample: Σ_{i=1}^n (yi − xi β2 − β1)²
→ Best line: intercept β1 and slope β2 such that Σ_{i=1}^n (yi − xi β2 − β1)² is minimum, among all possible values of β1 and β2.
OLS estimator:
The values estimated by OLS (the estimates) for β1 and β2 verify:
(β̂1_ols, β̂2_ols) = argmin_{(β1, β2) ∈ R²} { Σ_{i=1}^n (yi − xi β2 − β1)² }
Outline
Ordinary Least Squares: simple linear model
OLS estimator (observations):
The values estimated by OLS (the estimates) for β1 and β2 verify:
(β̂1_ols, β̂2_ols) = argmin_{(β1, β2) ∈ R²} { Σ_{i=1}^n (yi − xi β2 − β1)² }
OLS estimator (vector notations y, x, 1n):
(β̂1_ols, β̂2_ols) = argmin_{(β1, β2) ∈ R²} ‖y − x β2 − 1n β1‖²₂
OLS estimator (matrix notations, Y of size (n, 1), X of size (n, 2)):
(β̂1_ols, β̂2_ols) = argmin_{β ∈ R²} ‖Y − Xβ‖²₂
Outline
Ordinary Least Squares: simple linear model
Theorem: the OLS estimators have the following expressions:
β̂1 = ȳ − β̂2 x̄
β̂2 = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Cov(x, y) / Var(x)
Proof: by zeroing the derivative of the objective function, which is convex.
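As a quick numerical check of this theorem, the closed-form estimates can be computed directly in R and compared with lm() (a minimal sketch on simulated data; names and parameters are illustrative):

```r
set.seed(1)
n <- 100
x <- runif(n, 5, 10)
y <- 3 + 2 * x + rnorm(n, sd = 0.5)

# Closed-form OLS estimates for the simple linear model
beta2 <- cov(x, y) / var(x)          # slope = Cov(x, y) / Var(x)
beta1 <- mean(y) - beta2 * mean(x)   # intercept = ybar - slope * xbar

# Same estimates via lm()
fit <- lm(y ~ x)
c(beta1, beta2)
unname(coef(fit))  # identical up to numerical precision
```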
Outline
Ordinary Least Squares: simple linear model
For the simple linear model, the correlation coefficient may be very useful:
• r(x, y): correlation coefficient / coefficient de corrélation linéaire
r(x, y) = cov(x, y) / (√var(x) √var(y))
• |r(x, y)| = 1 if and only if Y = aX + b, a linear relation between Y and X
R-square, used in multiple regression:
• R² = Var(Ŷ) / Var(Y)
• R² ∈ [0, 1]
• Simple regression: R² = r²
Outline
The value of a correlation coefficient may be non-informative
Illustration of two joint distributions with a correlation coefficient equal to 0.999:
[Two scatter plots of y against x, each with n = 500: a noisy linear cloud with correlation 0.9998, and a smooth curved point pattern with correlation 0.9990]
→ A scatter plot shows two very different distributions. Always look at the data!
Always study the data carefully.
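This point is easy to reproduce: a tight linear cloud and a smooth curve can share a correlation coefficient close to 1 (a sketch with simulated data, not the slide's exact figures):

```r
set.seed(10)
n <- 500
# Dataset 1: a noisy straight line
x1 <- runif(n); y1 <- 2 * x1 + rnorm(n, sd = 0.01)
# Dataset 2: a deterministic curve, no noise at all
x2 <- seq(0, 1, length.out = n); y2 <- x2^2

cor(x1, y1)  # very close to 1
cor(x2, y2)  # also high (about 0.97), yet the relation is not linear
```

Plotting both pairs side by side shows two very different shapes despite the similar coefficient.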
OLS
Ordinary Least Squares (OLS): Multiple Linear Regression model
Outline
Ordinary Least Squares: multiple linear model
• We suppose Y = Σ_{j=1}^p βj Xj + ε and S = {(xi, yi), i = 1, ..., n, yi ∈ R, xi ∈ R^p}
• The quadratic error is defined by:
E(β) = Σ_{i=1}^n εi² = Σ_{i=1}^n (yi − Σ_j x_i^j βj)²
with matrix notation:
E(β) = (Y − Xβ)^T (Y − Xβ)
• Goal: minimize the error E(β) on the data set S, i.e. compute β̂ ∈ R^p:
β̂ = argmin_{β ∈ R^p} E(β)
Outline
Ordinary Least Squares: multiple regression model
• We aim to compute β̂ which minimizes:
E(β) = ‖Y − Xβ‖²₂ = (Y − Xβ)^T (Y − Xβ)
• Assumption: X^T X invertible (n ≥ p)
Theorem:
β̂_MCO = (X^T X)⁻¹ X^T Y
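The theorem can be checked numerically (a sketch on simulated data; note that in practice lm() uses a QR decomposition rather than forming X^T X explicitly):

```r
set.seed(2)
n <- 200; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))  # first column = intercept
beta <- c(0.5, 2, -3)
y <- as.vector(X %*% beta + rnorm(n, sd = 0.1))

# OLS by the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same fit with lm() (intercept already inside X, hence -1)
fit <- lm(y ~ X - 1)
cbind(beta_hat, coef(fit))  # the two columns agree
```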
Outline
MCO
• Estimation
β̂ = (X^T X)⁻¹ X^T Y
• Prediction: knowing β̂ and given X1, ..., Xp, the prediction of the target can be computed: Ŷ = Σ_j β̂j Xj
Ŷ = X β̂ = X (X^T X)⁻¹ X^T Y
• P, projection matrix on the hyperplane (hat matrix): P = X (X^T X)⁻¹ X^T
P² = P
• Residuals
- ε̂ = Y − Ŷ
• Remark: no assumption on the law or distribution of ε
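The hat matrix and its idempotence can be illustrated in a few lines (a sketch on a small simulated design):

```r
set.seed(3)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
y <- as.vector(X %*% c(1, 2, -1) + rnorm(n))

P <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
y_hat <- P %*% y                        # fitted values Y_hat = P Y

max(abs(P %*% P - P))  # P^2 = P (projection), numerically ~ 0
```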
Outline
Ordinary Least Squares: geometrical interpretation
Y = Σ_{j=1}^p Xj βj + ε
with Y ∈ R^n; the fitted part X β̂ lies in the p-dimensional subspace spanned by the Xj, and the residual ε̂ in the orthogonal (n − p)-dimensional complement.
Outline Properties
Ordinary Least Squares: properties
• Orthogonality:
- Ŷ ⊥ ε̂
- Xj ⊥ ε̂, ∀j ∈ [1 ... p]: <X^j, ε̂> = 0
• Residual average:
- Σ_i ε̂i = 0 if there is an intercept in the model, X^1 = (1, 1, ..., 1)
→ the average point belongs to the hyperplane
- mean(Ŷ) = mean(Y)
• Analysis of Variance -ANOVA- (Pythagoras)
- var(Y) = var(Ŷ) + var(ε̂)
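These properties are easy to verify numerically on any fit with an intercept (a sketch on simulated data):

```r
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
e <- residuals(fit); yhat <- fitted(fit)

mean(e)                        # ~ 0: residuals average to zero (intercept present)
sum(yhat * e)                  # ~ 0: fitted values orthogonal to residuals
var(y) - (var(yhat) + var(e))  # ~ 0: variance decomposition (Pythagoras)
```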
Outline Adjustment
Multiple Linear model: example with R

> head(mydata, 3)
      y   x1   x2   x3
1 -2.20 0.38 0.98 0.46
2 -1.75 0.11 0.62 0.37
3 -0.24 0.80 0.59 0.87
...
> modlm = lm(y ~ x1 + x2 + x3, data = mydata)
Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)
Coefficients:
(Intercept)       x1        x2       x3
    0.02754  1.98163  -3.03612  0.01903
Outline Adjustment
Multiple Linear model: example with R

> modlm = lm(y ~ x1 + x2 + x3, data = mydata)
> summary(modlm)
lm(formula = y ~ x1 + x2 + x3, data = mydata)
Residuals:
   Min     1Q  Median    3Q   Max
 -0.29 -0.075 -0.0035 0.073 0.281
Coefficients:
             Estimate Std. Error  t value Pr(>|t|)
(Intercept)   0.02754    0.01503    1.833   0.0674 .
x1            1.98163    0.01577  125.652   <2e-16 ***
x2           -3.03612    0.01621 -187.286   <2e-16 ***
x3            0.01903    0.01576    1.208   0.2277
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1009 on 496 degrees of freedom
Multiple R-squared: 0.9904, Adjusted R-squared: 0.9904
F-statistic: 1.707e+04 on 3 and 496 DF, p-value: < 2.2e-16
Outline Adjustment
Ordinary Least Squares, quality of the adjustment
• R², R-square (French: coefficient de détermination)
• R² = var(Ŷ) / var(Y); remark: no unit
• cos² ω = R² = ‖Ŷ − Ȳ 1n‖² / ‖Y − Ȳ 1n‖²
ω: angle between the centered vector (Y − Ȳ 1n) and its centered prediction (Ŷ − Ȳ 1n)
• var(ε̂) = (1/n) Σ_{i=1}^n (yi − ŷi)² = (1 − R²) var(Y), in units of Y²
Outline Adjustment
Ordinary Least Squares: quality of the adjustment
• R-square (for few variables)
• R² = cos² ω = Var(Ŷ) / Var(Y)
• R² ∈ [0, 1]
• R² increases mechanically with the number of variables
• Adjusted R-squared is sometimes preferred (penalization with the number of variables)
• R²adj = 1 − (1 − R²)(n − 1)/(n − p)
• R²adj may be negative
• Residual study:
• ε̂i = yi − ŷi, ∀i ∈ 1..n
• Visualization of:
• (ε̂i, i), ∀i ∈ 1..n
• (ε̂i, ŷi): homoscedastic vs heteroscedastic
• Prediction: visualization of:
• (yi, ŷi), ∀i ∈ 1..n
• comparison with the first bisector (the line y = x)
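The adjusted R² formula can be checked against what summary.lm reports (a sketch; here p counts the intercept among the parameters):

```r
set.seed(5)
n <- 100; p <- 4  # 3 covariates + intercept
X <- matrix(rnorm(n * 3), n)
y <- as.vector(1 + X %*% c(2, -3, 0) + rnorm(n))
fit <- lm(y ~ X)
s <- summary(fit)

R2 <- s$r.squared
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - p)  # the slide's formula
c(R2_adj, s$adj.r.squared)                  # the two values agree
```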
Outline Adjustment
Ordinary Least Squares: quality of the adjustment
Graphics (yi, ŷi), 1 ≤ i ≤ n: VERY USEFUL
[Four scatter plots of y against the fitted values modlm$fit]
Outline Adjustment
Ordinary Least Squares: Studentized residual graph
ε̂i / SE = (yi − ŷi) / SE   (a unitless term)
[Scatter plot of the Studentized residuals E against ŷ (12 to 24), values scattered within about ±4]
Residual graph
→ Random distribution. There is no information left to be captured.
Outline Adjustment
Ordinary Least Squares: Studentized residual graph
ε̂i / SE = (yi − ŷi) / SE   (a unitless term)
[Index plot of the Studentized residuals E (index 0 to 100): most values are small, with a few very large ones]
Residual graph
→ Large values for some points? Outlier detection?
Outline Adjustment
Ordinary Least Squares: Studentized residual graph
ε̂i / SE = (yi − ŷi) / SE   (a unitless term)
[Scatter plot of the Studentized residuals E2 against ŷ (y2, 50 to 200), showing a clear structured pattern]
Residual graph as a function of Ŷ
→ There is still some information in the residuals.
→ The model needs to be changed.
Outline Adjustment
Ordinary Least Squares: curse of dimension
Data set: {(yi, xi), 1 ≤ i ≤ n}. One target variable, one covariate.
[Scatter plot of y against x1 with the OLS line; Y: N(0,1), X: N(0,1), n = 50, p = 1 + 1]
→ R² = 0.07, R²adj = −0.02
Outline Adjustment
Ordinary Least Squares: illustration of the impact of the number of covariates on the model
Initial data and OLS line
[Scatter plot of mco$fit against y with the OLS line; Y: N(0,1), X: N(0,1), n = 50, p = 1 + 1]
→ R² = 0.07, R²adj = −0.02
Outline Adjustment
Ordinary Least Squares: illustration of the impact of the number of covariates on the model
Initial data, with 48 more covariates drawn from N(0, 1) added to the initial data set.
[Scatter plot of mco$fit against y; Y: N(0,1), X: N(0,1), n = 50, p = 1 + 48, R² = 1.00]
→ R² = 0.99, R²adj = 0.93
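The mechanical inflation of R² can be reproduced with pure-noise covariates (a sketch, not the slide's exact data; 45 noise columns are used here so a few residual degrees of freedom remain):

```r
set.seed(6)
n <- 50
y <- rnorm(n)   # target: pure noise
x1 <- rnorm(n)  # one covariate, unrelated to y

fit_small <- lm(y ~ x1)
Xnoise <- matrix(rnorm(n * 45), n)  # 45 extra pure-noise covariates
fit_big <- lm(y ~ x1 + Xnoise)

summary(fit_small)$r.squared    # near 0: x1 explains almost nothing
summary(fit_big)$r.squared      # large, purely by overfitting
summary(fit_big)$adj.r.squared  # the adjusted version penalizes this
```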
Outline Adjustment
Ordinary Least Squares: need to change the model (1/2)
Initial data set:
Outline Adjustment
Ordinary Least Squares: need to change the model (2/2)
Logarithmic transformation:
Outline Adjustment
MCO Regression. Some limits:
If X^T X is non-invertible:
• n >> p, collinearity between some Xj
- pseudo-inverse; the solution is not unique
- variable selection
• p >> n, when the number of variables is larger than the number of observations
- regularization methods
- ridge -L2-, lasso -L1-
- variable selection
OLS model: point estimation.
Outline Adjustment
OLS, X^T X non-invertible → pseudo-inverse computation
Solution (n > p), X^T X non-invertible, with rank k, k < p:
X^T X = U Σ² U^T = U diag(σ1², ..., σk², 0, ..., 0) U^T = Uk Σk² Uk^T
(X^T X)*⁻¹ = Uk Σk⁻² Uk^T with Σk² = diag(σ1², ..., σk²)
β̂ = (X^T X)*⁻¹ X^T Y
The solution is not unique.
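A truncated-SVD pseudo-inverse can be written directly (a sketch on a deliberately collinear design; MASS::ginv does the same job):

```r
set.seed(7)
n <- 30
x1 <- rnorm(n)
X <- cbind(1, x1, 2 * x1)  # third column collinear: t(X) %*% X has rank 2
y <- 1 + x1 + rnorm(n, sd = 0.1)

s <- svd(t(X) %*% X)
k <- sum(s$d > 1e-8)  # numerical rank
# Pseudo-inverse of X'X, keeping only the k nonzero singular values
Xi <- s$u[, 1:k] %*% diag(1 / s$d[1:k]) %*% t(s$v[, 1:k])
beta_hat <- Xi %*% t(X) %*% y

# The fit matches the projection computed on the non-collinear design
y_hat <- X %*% beta_hat
max(abs(y_hat - fitted(lm(y ~ x1))))  # ~ 0
```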
Linear Model
Outline
• Motivations
• Ordinary Least Squares
• Linear Model
• Penalized regression: ridge, lasso
Linear Model
Probabilistic assumption on the residuals
The probabilistic assumption on the residuals makes it possible to introduce:
- statistical tests on the values of the coefficients (to study whether a variable is useful or not to explain the target),
- a test of whether the overall linear model is useful (all coefficients equal zero).
Linear model
• We write: Y = Xβ + ε with ε ~ N(0, σ²)
• We have:
- εi = Yi − Σ_j X_i^j βj with f(ε) = (1/√(2πσ²)) e^(−ε²/(2σ²)), i.i.d.
• Residual density & maximum likelihood estimation:
f(ε1, ..., εn) = Π_i f(εi)
= (1/((2π)^(n/2) σ^n)) e^(−Σ εi² / (2σ²))
= (1/((2π)^(n/2) σ^n)) e^(−‖Y − Xβ‖² / (2σ²))
• The goal is to compute β̂ and σ̂², solutions of the maximum likelihood estimation (MLE). Same solution for the MLE and the OLS:
- β̂ = (X^T X)⁻¹ X^T Y
- σ̂² = (1/n) Σ_{i=1}^n ε̂i²
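Note that the MLE of σ² divides by n, while lm's residual standard error divides by n − p; both versions can be computed and compared (a sketch on simulated data):

```r
set.seed(8)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)
e <- residuals(fit)

sigma2_mle <- sum(e^2) / n        # MLE: divide by n (biased downward)
sigma2_unb <- sum(e^2) / (n - 2)  # unbiased: divide by n - p, here p = 2
c(sigma2_mle, sigma2_unb, summary(fit)$sigma^2)  # last two agree
```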
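As a quick check, the closed-form β̂ of the slide can be compared with R's lm() on simulated data (a sketch; all names and sizes are arbitrary):

```r
# OLS/MLE closed form vs lm(), on simulated data.
set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))             # design with an intercept column
beta_true <- c(2, -1, 0.5)
Y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.3))

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y  # beta_hat = (X^T X)^{-1} X^T Y
res <- Y - X %*% beta_hat
sigma2_mle <- sum(res^2) / n                  # MLE of sigma^2 (divides by n, biased)

fit <- lm(Y ~ X - 1)                          # same model, fitted by lm()
```

coef(fit) agrees with beta_hat to numerical precision; note that summary(lm) reports the unbiased variance estimate ||ε̂||²/(n − p) rather than the MLE, which divides by n.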
Linear model: What are the laws of the estimators?
Y = Xβ + ε with ε ∼ N(0, σ² I_n)
Laws of the estimators:
• Law of β̂?
• Law of Ŷ?
• Law of σ̂²?
Benefits:
→ makes it possible to compute confidence intervals for β̂ and Ŷ.
→ makes it possible to test the parameters.
Law of β̂:
β̂ ∼ N(β, σ² (X^T X)^{−1})

Expectation and variance of β̂, with β̂ = (X^T X)^{−1} X^T Y and Y = Xβ + ε:
• E(β̂) = β : unbiased estimator, E(β̂) − β = 0
• Var(β̂) = σ² (X^T X)^{−1}
  Var(β̂) = (X^T X)^{−1} X^T [Var(Y)] X (X^T X)^{−1}
          = (X^T X)^{−1} X^T [Var(ε)] X (X^T X)^{−1}
          = σ² (X^T X)^{−1}
• E[(β̂ − β)(β̂ − β)^T] = Var(β̂) + 0 (no bias term)
Recall: Var(AY) = A Var(Y) A^T
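A Monte Carlo sketch (simulated data, hypothetical sizes) of the unbiasedness and covariance formula above:

```r
# Check empirically that E(beta_hat) = beta and Var(beta_hat) = sigma^2 (X^T X)^{-1}.
set.seed(3)
n <- 60; sigma <- 0.5
X <- cbind(1, rnorm(n))
beta_true <- c(1, 2)
XtXinv <- solve(t(X) %*% X)

B <- 2000
betas <- replicate(B, {
  Y <- X %*% beta_true + rnorm(n, sd = sigma)
  as.vector(XtXinv %*% t(X) %*% Y)            # one draw of beta_hat
})
emp_mean <- rowMeans(betas)                    # ~ beta_true (unbiased)
emp_var  <- var(t(betas))                      # ~ sigma^2 * XtXinv
```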
Law of Ŷ:
Ŷ ∼ N(Xβ, σ² X (X^T X)^{−1} X^T)

Expectation and variance of Ŷ, with Ŷ = X β̂:
• E(Ŷ) = Xβ : E(Ŷ) = E(X β̂) = X E(β̂) = Xβ = E(Y)
• Var(Ŷ) = σ² X (X^T X)^{−1} X^T
  Var(Ŷ) = Var(X β̂) = X Var(β̂) X^T = σ² X (X^T X)^{−1} X^T
Law of ε̂:
ε̂ ∼ N(0, σ² (I_n − X (X^T X)^{−1} X^T))

Expectation and variance of ε̂ = Y − Ŷ:
• E(ε̂) = 0
• Var(ε̂) = σ² (I_n − X (X^T X)^{−1} X^T)
  Var(ε̂) = Var(Y − Ŷ) = Var(Y − X β̂)
          = σ² I_n − X Var(β̂) X^T    (using Cov(Y, Ŷ) = Var(Ŷ))
          = σ² (I_n − X (X^T X)^{−1} X^T)
Recall: Var(AY) = A Var(Y) A^T
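The matrix H = X (X^T X)^{−1} X^T appearing in Var(Ŷ) and Var(ε̂) is an orthogonal projection. A small R sketch (simulated design, arbitrary sizes) checking its defining properties:

```r
# H is symmetric and idempotent (H H = H) with trace(H) = p;
# Y_hat = H Y and eps_hat = (I - H) Y live in orthogonal subspaces.
set.seed(4)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))             # p = 3 columns
H <- X %*% solve(t(X) %*% X) %*% t(X)
Y <- rnorm(n)
Y_hat   <- H %*% Y
eps_hat <- Y - Y_hat
sum(Y_hat * eps_hat)                          # ~ 0: fitted values orthogonal to residuals
```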
Linear model: laws of the estimators
Under the assumption that the ε_i are i.i.d. with ε_i ∼ N(0, σ²):

Theorem. If p ≤ n and X^T X is invertible, the vector (β̂, ε̂) of dimension (p + n) is a Gaussian vector
with mean (β, 0) and variance
σ² [ (X^T X)^{−1}                 0
      0           I_n − X (X^T X)^{−1} X^T ]

Law of σ̂²:
(n − p) σ̂² / σ² ∼ χ²(n − p)

We set σ̂² = ||ε̂||² / (n − p), with ||ε̂||² = Σ_i ε̂_i².
||ε̂||² follows a σ² χ²(n − p) distribution (Cochran theorem).
Then the expectation of σ̂² = ||ε̂||² / (n − p) is σ² (since E(χ²(n − p)) = n − p).
We deduce the law of σ̂²: σ̂² ∼ (σ² / (n − p)) χ²(n − p)

Recall: Student theorem. If U ∼ N(0, 1) and V ∼ χ²(d), with U and V independent, then Z = U / √(V/d) follows a Student distribution with d degrees of freedom.
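A Monte Carlo sketch (simulated data, hypothetical sizes) of the χ²(n − p) law of the residual sum of squares:

```r
# (n - p) * sigma2_hat / sigma^2 = ||eps_hat||^2 / sigma^2 should follow chi^2(n - p).
set.seed(5)
n <- 40; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
H <- X %*% solve(t(X) %*% X) %*% t(X)
stat <- replicate(5000, {
  Y <- X %*% c(1, -1, 2) + rnorm(n, sd = sigma)
  sum((Y - H %*% Y)^2) / sigma^2
})
mean(stat)    # ~ n - p = 37, the chi^2(n - p) mean
var(stat)     # ~ 2 (n - p) = 74, the chi^2(n - p) variance
```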
Significance test of β_j, σ² unknown
• Student statistic: T = β̂_j / √(σ̂² S_{j,j})
• Significance test (two-sided):
  • H0: β_j = 0
  • H1: β_j ≠ 0
• Decision at risk α, reject H0 if:
  • |β̂_j| / √(σ̂² S_{j,j}) > t_{n−p}(1 − α/2), with S_{j,j} the j-th diagonal term of (X^T X)^{−1}
  • equivalently, p-value < α
• Conclusion:
  • β̂_j is significantly different from zero
  • X_j has an influence in the model
Not true if there is collinearity between the variables.
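The Student statistics of the slide can be reproduced by hand and compared with summary(lm) (a sketch on simulated data; names and sizes are arbitrary):

```r
# Student statistics beta_hat_j / sqrt(sigma2_hat * S_jj), by hand vs summary(lm).
set.seed(6)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
Y <- 1 + 2 * x1 + 0 * x2 + rnorm(n)           # x2 is useless by construction
X <- cbind(1, x1, x2); p <- ncol(X)

S <- solve(t(X) %*% X)
beta_hat <- S %*% t(X) %*% Y
sigma2_hat <- sum((Y - X %*% beta_hat)^2) / (n - p)   # unbiased variance estimate
t_stats <- as.vector(beta_hat) / sqrt(sigma2_hat * diag(S))

fit <- summary(lm(Y ~ x1 + x2))
fit$coefficients[, "t value"]                 # same values as t_stats
```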
Global significance of the model
Test of the model at risk α:
H0: β_2 = β_3 = ... = β_p = 0
H1: ∃ j ∈ {2, ..., p}, β_j ≠ 0

Statistic:
F = ((n − p) / (p − 1)) · ||Ŷ − Ȳ||² / ||Y − Ŷ||² ∼ Fisher(p − 1, n − p)

Remark: ((n − p) / (p − 1)) ||Ŷ − Ȳ||² / ||Y − Ŷ||² = (SSE / (p − 1)) / (SSR / (n − p)) (E: Estimated; R: Residuals)

Decision rule:
• if F_obs > q_F(α), H0 is rejected: there exists a coefficient which is not zero. The regression is "useful".
• if F_obs ≤ q_F(α), H0 is accepted: all the coefficients are supposed to be null. The regression is not "useful".
Global significance of the model
• Fisher statistic
• Significance test:
  • H0: β_2 = ... = β_p = 0
  • H1: ∃ β_j ≠ 0
• Decision at risk α, reject H0 if:
  • ((n − p) / (p − 1)) · R² / (1 − R²) > f_{p−1,n−p}(1 − α)
  • or if p-value < α
→ The linear model has an added value
Global significance of the model
Remark 1, on the Fisher statistic:
F = ((n − p) / (p − 1)) · R² / (1 − R²)
The R² coefficient increases mechanically with the number of variables.
Remark 2: the adjusted R² may be used:
R²_adj = 1 − (1 − R²)(n − 1) / (n − p)
The R²_adj does not automatically increase with the number of variables.
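A quick R sketch (simulated data) recovering the F statistic from R², and the adjusted R², as defined above:

```r
# F recovered from R^2, and adjusted R^2, compared with summary(lm).
set.seed(7)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
Y <- 1 + x1 + rnorm(n)
fit <- summary(lm(Y ~ x1 + x2))
p  <- 3                                       # intercept + 2 variables
R2 <- fit$r.squared

F_from_R2 <- (n - p) / (p - 1) * R2 / (1 - R2)
F_from_R2                                     # equals fit$fstatistic[["value"]]
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - p)    # equals fit$adj.r.squared
```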
Boston Housing Data
The original data are n = 506 observations on p = 14 variables, medv being the target variable.
crim     per capita crime rate by town
zn       proportion of residential land zoned for lots over 25,000 sq.ft
indus    proportion of non-retail business acres per town
chas     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox      nitric oxides concentration (parts per 10 million)
rm       average number of rooms per dwelling
age      proportion of owner-occupied units built prior to 1940
dis      weighted distances to five Boston employment centres
rad      index of accessibility to radial highways
tax      full-value property-tax rate per USD 10,000
ptratio  pupil-teacher ratio by town
b        1000(B − 0.63)² where B is the proportion of blacks by town
lstat    percentage of lower status of the population
medv     median value of owner-occupied homes in USD 1000's
Boston Housing Data
The data:
nr  crim     zn  indus  chas  nox    rm     age   dis     rad  tax  ptratio  b       lstat  medv
1   0.00632  18  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     396.90  4.98   24.0
2   0.02731  0   7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.90  9.14   21.6
3   0.02729  0   7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
4   0.03237  0   2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
5   0.06905  0   2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.90  5.33   36.2
...
Boston Housing Data
OLS in R:
library(mlbench)
# Data
data(BostonHousing); tab = BostonHousing; names(tab)
target = "medv"; Y = tab[, target]; X = tab[, names(tab) != target]; names(X)
# OLS
resfit = lsfit(x = X, y = Y, intercept = T)
resfit$coef; hist(resfit$res)

Cst    crim   zn     indus  chas  nox     rm    age   dis    rad   tax    ptratio  b  lstat
36.45  -0.10  0.046  0.020  2.68  -17.76  3.80  0.00  -1.47  0.30  -0.01  -0.95   ...
Boston Housing Data
Linear Model in R
R code:
reslm = lm(medv ~ ., data = tab); summary(reslm)
Results: n = 506, p = 14
Residuals:
    Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199
Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas1 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
b 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Precautions
• Multicollinearity
  The OLS solution requires computing (X^T X)^{−1}.
  When the determinant of this matrix is very close to zero, the problem is ill-conditioned.
• Choice of the variables
  The coefficient of determination R² increases with the number of variables.
  If p = n, R² = 1, which is not necessarily meaningful.
Demonstration in R
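As a sketch of the multicollinearity issue (simulated data, not the lecture's own demonstration): when two columns are nearly collinear, X^T X is ill-conditioned and the individual coefficients become unstable.

```r
# Near-collinear design: X^T X has a huge condition number.
set.seed(8)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-4)        # almost an exact copy of x1
X <- cbind(1, x1, x2)
kappa(t(X) %*% X)                     # very large: the OLS problem is ill-conditioned

Y <- 1 + x1 + rnorm(n)
coef(lm(Y ~ x1 + x2))                 # individual coefficients on x1, x2 are unstable
```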