Lecture VI – Regression (Linear Methods for Regression)
Contents:
• Linear Methods for Regression
• Least Squares, Gauss-Markov Theorem
• Recursive Least Squares
Linear Regression Model
$$ y = f(\mathbf{x};\mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x_j + \epsilon = \mathbf{w}^T\mathbf{x} + \epsilon \qquad \text{(Linear Model)} $$

where $\mathbf{x} \equiv (x_1, x_2, \ldots, x_m, 1)^T$ is the input vector and $\mathbf{w} \equiv (w_1, w_2, \ldots, w_m, w_0)^T$ are the regression parameters.
The linear model either assumes that the regression function f(x) is linear, or that the linear model is a reasonable approximation. The inputs x can be:
• Quantitative inputs
• Transformations of quantitative inputs such as log, square root, etc.
• Basis expansions (e.g. polynomial representation): $x_2 = x_1^2,\ x_3 = x_1^3, \ldots$
• Interactions between variables: $x_3 = x_1 \cdot x_2$
• Dummy coding of levels of a qualitative input
In all these cases, the model is linear in the parameters, even though the final function itself may not be linear; a small design-matrix example follows below.
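To make these input types concrete, here is a minimal NumPy sketch (the variable names and values are made up for illustration) of how a design matrix combining a transformed input, an interaction term, and a dummy-coded qualitative input could be assembled; the resulting model y = Xw is still linear in the parameters:

```python
import numpy as np

# Hypothetical raw inputs: two quantitative variables and a qualitative one (3 levels)
z1 = np.array([1.0, 2.0, 3.0, 4.0])
z2 = np.array([0.5, 1.5, 2.5, 3.5])
quality = np.array([0, 1, 2, 1])            # levels of a qualitative input

# Columns that are nonlinear in the raw inputs but linear in the parameters
bias        = np.ones_like(z1)               # constant term (w_0)
log_z1      = np.log(z1)                      # transformation of a quantitative input
interaction = z1 * z2                         # interaction between variables
dummies     = np.eye(3)[quality]              # dummy coding of the qualitative levels

X = np.column_stack([bias, z1, log_z1, interaction, dummies])
print(X.shape)   # (4, 7): the model y = X w remains linear in w
```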
Power of Linear Models
[Figure: a single-layer network with inputs $1, x_1, x_2, x_3, x_4, \ldots, x_d$, weights $w_0, \mathbf{w}$, output unit $g(\cdot)$ and output $y$]

$$ y(\mathbf{x}) = f(\mathbf{x},\mathbf{w}) = g\left(\mathbf{w}^T\mathbf{x} + w_0\right) $$
If g(·) is linear, only linear functions can be modeled. However, if x is suitably preprocessed, complicated functions can be realized:
$$ \mathbf{x} = \Phi(\mathbf{z}) = \begin{bmatrix} \phi_1(\mathbf{z}) \\ \phi_2(\mathbf{z}) \\ \vdots \\ \phi_d(\mathbf{z}) \end{bmatrix}, \qquad \text{example:}\quad \mathbf{x} = \Phi(z) = \begin{bmatrix} z \\ z^2 \\ \vdots \\ z^d \end{bmatrix} $$
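A minimal sketch of this preprocessing idea, assuming NumPy (the degree d and the target function are arbitrary illustrative choices): the scalar inputs z are mapped through Φ(z) = (z, z², …, z^d) and an ordinary linear least-squares fit is done in the expanded space, so the fitted function of z is nonlinear even though the model stays linear in w:

```python
import numpy as np

def phi(z, d):
    """Map scalar inputs z to the polynomial features (z, z^2, ..., z^d)."""
    return np.column_stack([z ** k for k in range(1, d + 1)])

rng = np.random.default_rng(0)
z = np.linspace(-1, 1, 50)
t = np.sin(3 * z) + 0.1 * rng.standard_normal(z.size)   # nonlinear target + noise

d = 5
X = np.column_stack([np.ones_like(z), phi(z, d)])        # include the bias column
w, *_ = np.linalg.lstsq(X, t, rcond=None)                # ordinary linear least squares in w
print(w)                                                  # parameters of the linear-in-w model
```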
Least Squares Optimization
Least Squares Cost Function:

$$ J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\left(t_i - \hat{f}(\mathbf{x}_i)\right)^2 = \frac{1}{2}\sum_{i=1}^{N}\left(t_i - \mathbf{w}^T\mathbf{x}_i\right)^2 = \frac{1}{2}\left(\mathbf{t} - X\mathbf{w}\right)^T\left(\mathbf{t} - X\mathbf{w}\right) $$

where

$$ X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \qquad \mathbf{t} = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix}, \qquad N = \#\ \text{of training data} $$

Minimize Cost:

$$ \frac{\partial J}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}\left(\frac{1}{2}\left(\mathbf{t} - X\mathbf{w}\right)^T\left(\mathbf{t} - X\mathbf{w}\right)\right) = -X^T\mathbf{t} + X^TX\mathbf{w} = 0 \;\Rightarrow\; X^TX\mathbf{w} = X^T\mathbf{t} $$

Solution:

$$ \mathbf{w} = \left(X^TX\right)^{-1}X^T\mathbf{t} $$
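A small numerical sketch of this solution, assuming NumPy and synthetic data: the normal equations X^T X w = X^T t are solved directly and compared against np.linalg.lstsq, which solves the same least-squares problem in a numerically more stable way:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, m))])   # inputs with a bias column
w_true = np.array([0.5, 2.0, -1.0, 0.3])
t = X @ w_true + 0.1 * rng.standard_normal(N)                     # noisy targets

# Solution via the normal equations: w = (X^T X)^{-1} X^T t
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent, numerically more robust solver
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True: both minimize the same cost J(w)
```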
What are we really doing ?
Least Squares Solution:

$$ \mathbf{w} = \left(X^TX\right)^{-1}X^T\mathbf{t}, \qquad y_{pred} = \mathbf{x}_{pred}^T\mathbf{w} $$
We seek the linear function of X that minimizes the sum of the squared residuals from Y
[Figure: linear least squares fitting]
More insights into the LS solution
The Pseudo-Inverse:

$$ X^{+} = \left(X^TX\right)^{-1}X^T $$

Pseudo-inverses are a special solution to an infinite set of solutions of a non-unique inverse problem (we talked about this in the previous lecture). The matrix inversion above may still be ill-defined if $X^TX$ is close to singular, in which case so-called Ridge Regression needs to be applied.

Ridge Regression:

$$ X^{+} = \left(X^TX + \gamma I\right)^{-1}X^T, \qquad \text{where } \gamma \ll 1 $$

Multiple Outputs: just like multiple single-output regressions:

$$ W = \left(X^TX\right)^{-1}X^TY $$
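A minimal sketch of the regularized pseudo-inverse, assuming NumPy (the value of γ and the helper name ridge_pseudo_inverse are illustrative): adding γI keeps the inversion well-defined even when columns of X are nearly collinear, and the multi-output case simply applies the same pseudo-inverse to every output column of Y:

```python
import numpy as np

def ridge_pseudo_inverse(X, gamma=1e-3):
    """(X^T X + gamma I)^{-1} X^T  --  regularized pseudo-inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X = np.column_stack([X, X[:, 0] + 1e-8 * rng.standard_normal(50)])   # nearly collinear column
Y = rng.standard_normal((50, 2))                                      # two output dimensions

W = ridge_pseudo_inverse(X, gamma=1e-3) @ Y   # one regression per output column
print(W.shape)                                 # (5, 2)
```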
Geometrical Interpretation of LS
[Figure: the target vector $\mathbf{t}$ and its projection $\mathbf{y}$ onto the subspace $S$ spanned by the columns of $X$ (shown as $X[0]$, $X[1]$)]

$\mathbf{y}$ is an orthogonal projection of $\mathbf{t}$ on $S$.
Residual vector (orthogonal to $\mathbf{y}$):

$$ \mathbf{t} - \mathbf{y} = \mathbf{t} - X\mathbf{w} = \mathbf{t} - \sum_i w_i X[i] $$
… is orthogonal to the space spanned by columns of X since …
$$ \frac{\partial J}{\partial \mathbf{w}} = -\left(\mathbf{t} - X\mathbf{w}\right)^T X = 0 $$
And hence, …
y is the optimal reconstruction of t in the range of X
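This orthogonality is easy to check numerically; a minimal sketch assuming NumPy and synthetic data verifies that X^T(t − Xw) ≈ 0 for the least-squares w, i.e. the residual is orthogonal to every column of X and y = Xw is the orthogonal projection of t onto the range of X:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
t = rng.standard_normal(30)

w, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares fit
residual = t - X @ w

print(X.T @ residual)                               # ~ zeros: residual is orthogonal to the columns of X
print(np.allclose(X.T @ residual, 0, atol=1e-8))    # True
```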
Physical Interpretation of LS
• all springs have the same spring constant
• points far away generate more “force” (danger of outliers)
• springs are vertical
• the solution is the minimum energy solution achieved by the springs
[Figure: data points attached to the fitted line by vertical springs; axes x and y]
Minimum variance unbiased estimator
The Least Squares estimate of the parameters w has the smallest variance among all linear unbiased estimates.
Least Squares estimates are therefore also called BLUE estimates – Best Linear Unbiased Estimators.
Gauss-Markov Theorem
Least Squares Estimate:

$$ \hat{\mathbf{w}} = \left(X^TX\right)^{-1}X^T\mathbf{t} = H\mathbf{t}, \qquad \text{where } H = \left(X^TX\right)^{-1}X^T $$
In other words, the Gauss-Markov theorem says that there is no other matrix C such that the estimator formed by $\tilde{\mathbf{w}} = C\mathbf{t}$ will be both unbiased and have a smaller variance than $\hat{\mathbf{w}}$.
The Least Squares Estimate is an Unbiased Estimate, since $E(\hat{\mathbf{w}}) = \mathbf{w}$.
(Homework !!)
Gauss-Markov Theorem (Proof)
$$ \tilde{\mathbf{w}} = C\mathbf{t} $$

$$ E(\tilde{\mathbf{w}}) = E(C\mathbf{t}) = E\left(C(X\mathbf{w} + \boldsymbol{\epsilon})\right) = CX\mathbf{w} + C\,E(\boldsymbol{\epsilon}) = CX\mathbf{w} $$

For an Unbiased Estimate: $E(\tilde{\mathbf{w}}) = \mathbf{w} \;\Rightarrow\; CX\mathbf{w} = \mathbf{w} \;\Rightarrow\; CX = I$

$$
\begin{aligned}
Var(\tilde{\mathbf{w}}) &= E\left[(\tilde{\mathbf{w}} - E(\tilde{\mathbf{w}}))(\tilde{\mathbf{w}} - E(\tilde{\mathbf{w}}))^T\right] = E\left[(\tilde{\mathbf{w}} - \mathbf{w})(\tilde{\mathbf{w}} - \mathbf{w})^T\right] = E\left[(C\mathbf{t} - \mathbf{w})(C\mathbf{t} - \mathbf{w})^T\right] \\
&= E\left[(CX\mathbf{w} + C\boldsymbol{\epsilon} - \mathbf{w})(CX\mathbf{w} + C\boldsymbol{\epsilon} - \mathbf{w})^T\right] = E\left[(C\boldsymbol{\epsilon})(C\boldsymbol{\epsilon})^T\right] \qquad \text{since } CX = I \\
&= C\,E\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\right]C^T = \sigma^2 CC^T \qquad \text{since } E\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\right] = \sigma^2 I
\end{aligned}
$$
Gauss-Markov Theorem (Proof)
We want to show that $Var(\hat{\mathbf{w}}) \le Var(\tilde{\mathbf{w}})$.

Let $C = \left(X^TX\right)^{-1}X^T + D$.

Since $CX = I$: $\left(X^TX\right)^{-1}X^TX + DX = I \;\Rightarrow\; I + DX = I \;\Rightarrow\; DX = 0$

$$
\begin{aligned}
Var(\tilde{\mathbf{w}}) &= \sigma^2 CC^T = \sigma^2\left(\left(X^TX\right)^{-1}X^T + D\right)\left(\left(X^TX\right)^{-1}X^T + D\right)^T \\
&= \sigma^2\left(\left(X^TX\right)^{-1}X^TX\left(X^TX\right)^{-1} + DX\left(X^TX\right)^{-1} + \left(X^TX\right)^{-1}(DX)^T + DD^T\right) \\
&= \sigma^2\left(X^TX\right)^{-1} + \sigma^2 DD^T \qquad \text{since } DX = 0 \\
&= Var(\hat{\mathbf{w}}) + \sigma^2 DD^T \;\ge\; Var(\hat{\mathbf{w}})
\end{aligned}
$$

It is sufficient to show that the diagonal elements of $\sigma^2 DD^T$ are non-negative. This is true by definition. Hence proved.
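A small Monte-Carlo sketch of the theorem, assuming NumPy (the particular choice of D with DX = 0 is purely illustrative): both the least-squares estimator and the alternative linear estimator C = (X^T X)^{-1}X^T + D are unbiased, but the alternative shows the larger per-component variance, as the proof predicts:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, sigma = 40, 3, 0.5
X = np.column_stack([np.ones(N), rng.standard_normal((N, d - 1))])
w_true = np.array([1.0, -2.0, 0.5])

H = np.linalg.solve(X.T @ X, X.T)                          # least-squares matrix (X^T X)^{-1} X^T
P = X @ H                                                   # projection onto the column space of X
D = 0.3 * rng.standard_normal((d, N)) @ (np.eye(N) - P)     # any D with D X = 0
C = H + D                                                   # alternative linear unbiased estimator

ls_estimates, alt_estimates = [], []
for _ in range(5000):
    t = X @ w_true + sigma * rng.standard_normal(N)
    ls_estimates.append(H @ t)
    alt_estimates.append(C @ t)

ls_estimates, alt_estimates = np.array(ls_estimates), np.array(alt_estimates)
print(ls_estimates.mean(axis=0), alt_estimates.mean(axis=0))   # both ~ w_true (unbiased)
print(ls_estimates.var(axis=0) <= alt_estimates.var(axis=0))   # elementwise True (in expectation)
```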
Biased vs unbiased
Bias-Variance decomposition of error:

$$ E\left[\left(y_i - \hat{y}_i\right)^2\right] = \sigma_\epsilon^2 + \left(f(\mathbf{x}_i) - E\{\hat{y}_i\}\right)^2 + E\left[\left(\hat{y}_i - E\{\hat{y}_i\}\right)^2\right] = \mathrm{var(noise)} + \mathrm{bias}^2 + \mathrm{var(estimate)} $$
Gauss-Markov Theorem says that Least Squares achieves the estimate with the minimum variance (and hence, the minimum Mean Squared Error) among all the unbiased estimates (bias=0).
Does that mean that we should always work with unbiased estimators ??
No !! since there may exist some biased estimators with a smaller net mean squared error – they trade a little bias for a larger reduction in variance.
Variable Subset Selection and Shrinkage are methods (which we will explore soon) that introduce bias and try to reduce the variance of the estimate.
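A minimal simulation sketch of this trade-off, assuming NumPy (the ridge parameter γ and the data-generating choices are arbitrary): with strongly correlated inputs, the biased ridge (shrinkage) estimator typically attains a smaller mean squared error in the parameters than the unbiased least-squares estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma, gamma = 30, 1.0, 5.0
w_true = np.array([1.0, 1.0, 1.0])

ls_err, ridge_err = [], []
for _ in range(3000):
    z = rng.standard_normal(N)
    # strongly correlated inputs make the least-squares estimate high-variance
    X = np.column_stack([z, z + 0.05 * rng.standard_normal(N), rng.standard_normal(N)])
    t = X @ w_true + sigma * rng.standard_normal(N)

    w_ls = np.linalg.solve(X.T @ X, X.T @ t)                          # unbiased
    w_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(3), X.T @ t)   # biased, lower variance

    ls_err.append(np.sum((w_ls - w_true) ** 2))
    ridge_err.append(np.sum((w_ridge - w_true) ** 2))

print(np.mean(ls_err), np.mean(ridge_err))   # ridge typically shows the smaller mean squared error here
```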
Recursive Least Squares
The Sherman-Morrison-Woodbury Theorem:

$$ \left(A - \mathbf{z}\mathbf{z}^T\right)^{-1} = A^{-1} + \frac{A^{-1}\mathbf{z}\mathbf{z}^T A^{-1}}{1 - \mathbf{z}^T A^{-1}\mathbf{z}} $$

More General: The Matrix Inversion Theorem:

$$ \left(A - BC\right)^{-1} = A^{-1} + A^{-1}B\left(I - CA^{-1}B\right)^{-1}CA^{-1} $$

Recursive Least Squares Update:

Initialize: $P^{0} = \frac{1}{\gamma} I$, where $\gamma \ll 1$ (note $P \equiv \left(X^TX\right)^{-1}$)

For every new data point $(\mathbf{x}, t)$ (note that $\mathbf{x}$ includes the bias term):

$$ P^{n+1} = \frac{1}{\lambda}\left(P^{n} - \frac{P^{n}\mathbf{x}\mathbf{x}^T P^{n}}{\lambda + \mathbf{x}^T P^{n}\mathbf{x}}\right), \qquad \text{where } \lambda = \begin{cases} 1 & \text{if no forgetting} \\ < 1 & \text{if forgetting} \end{cases} $$

$$ \mathbf{w}^{n+1} = \mathbf{w}^{n} + P^{n+1}\mathbf{x}\left(t - \mathbf{w}^{nT}\mathbf{x}\right) $$
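A minimal sketch of these update equations, assuming NumPy (the class name and interface are only illustrative): P is initialized to (1/γ)I and every incoming pair (x, t) updates P and w with the rank-one formulas above, optionally with the forgetting factor λ:

```python
import numpy as np

class RecursiveLeastSquares:
    """Recursive least squares with an optional forgetting factor lambda."""

    def __init__(self, dim, gamma=1e-6, lam=1.0):
        self.P = np.eye(dim) / gamma       # P ~ (X^T X)^{-1}, initialized with gamma << 1
        self.w = np.zeros(dim)
        self.lam = lam                      # lam = 1: no forgetting, lam < 1: forgetting

    def update(self, x, t):
        x = np.asarray(x, dtype=float)      # x is assumed to include the bias term
        Px = self.P @ x
        # P^{n+1} = (1/lam) * (P^n - P^n x x^T P^n / (lam + x^T P^n x))
        self.P = (self.P - np.outer(Px, Px) / (self.lam + x @ Px)) / self.lam
        # w^{n+1} = w^n + P^{n+1} x (t - w^{nT} x)
        self.w = self.w + self.P @ x * (t - self.w @ x)
        return self.w
```

Each update costs only O(d²) operations and never inverts a matrix explicitly, which is what makes the recursive form attractive for online learning.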
Recursive Least Squares (cont’d)
Some amazing facts about recursive least squares:
• Results for W are EXACTLY the same as for the normal (batch) least squares solution after every data point has been added once – no iterations! (see the check below)
• NO matrix inversion necessary anymore
• NO learning rate necessary
• Guaranteed convergence to the optimal W (linear regression is an optimal estimator under many conditions)
• The forgetting factor λ allows forgetting old data in case of changing target functions
• Computational load is larger than the batch version of linear regression
• But don’t get fooled: if the data is singular, you will still have problems!
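To illustrate the first point, a self-contained sketch assuming NumPy (with λ = 1, i.e. no forgetting): one sequential pass of the recursive update reproduces the batch solution; strictly speaking it matches (X^T X + γI)^{-1}X^T t for the tiny initialization constant γ, which is the batch least-squares solution up to a negligible regularizer:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, gamma = 200, 4, 1e-6
X = np.column_stack([np.ones(N), rng.standard_normal((N, d - 1))])   # bias included in x
t = X @ np.array([0.2, 1.5, -0.7, 0.9]) + 0.05 * rng.standard_normal(N)

# One sequential pass of recursive least squares (lambda = 1, no forgetting)
P = np.eye(d) / gamma
w = np.zeros(d)
for x_i, t_i in zip(X, t):
    Px = P @ x_i
    P = P - np.outer(Px, Px) / (1.0 + x_i @ Px)
    w = w + P @ x_i * (t_i - w @ x_i)

# Batch solution (with the same tiny regularizer gamma)
w_batch = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ t)

print(np.allclose(w, w_batch))   # True: one pass of RLS reproduces the batch solution
```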