Lecture VI – Regression (Linear Methods for Regression)
Contents:
• Linear Methods for Regression
• Least Squares, Gauss-Markov Theorem
• Recursive Least Squares
Linear Regression Model
$$ y = f(\mathbf{x};\mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x_j + \epsilon = \mathbf{w}^T\mathbf{x} + \epsilon \qquad \text{(Linear Model)} $$

where $\mathbf{x} \equiv (x_1, x_2, \ldots, x_m, 1)^T$ is the input vector and $\mathbf{w} \equiv (w_1, w_2, \ldots, w_m, w_0)^T$ are the regression parameters.
The linear model either assumes that the regression function f(x) is linear, or that the linear model is a reasonable approximation. The inputs x can be:
• Quantitative inputs
• Transformations of quantitative inputs such as log, square root, etc.
• Basis expansions (e.g. polynomial representation): $x_2 = x_1^2,\ x_3 = x_1^3, \ldots$
• Interactions between variables: $x_3 = x_1 \cdot x_2$
• Dummy coding of levels of a qualitative input
In all these cases, the model is linear in the parameters, even though the final function itself may not be linear; a small design-matrix example follows below.
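To make these input types concrete, here is a minimal NumPy sketch (the variable names and values are made up for illustration) of how a design matrix combining a transformed input, an interaction term, and a dummy-coded qualitative input could be assembled; the resulting model y = Xw is still linear in the parameters:

```python
import numpy as np

# Hypothetical raw inputs: two quantitative variables and a qualitative one (3 levels)
z1 = np.array([1.0, 2.0, 3.0, 4.0])
z2 = np.array([0.5, 1.5, 2.5, 3.5])
quality = np.array([0, 1, 2, 1])            # levels of a qualitative input

# Columns that are nonlinear in the raw inputs but linear in the parameters
bias        = np.ones_like(z1)               # constant term (w_0)
log_z1      = np.log(z1)                      # transformation of a quantitative input
interaction = z1 * z2                         # interaction between variables
dummies     = np.eye(3)[quality]              # dummy coding of the qualitative levels

X = np.column_stack([bias, z1, log_z1, interaction, dummies])
print(X.shape)   # (4, 7): the model y = X w remains linear in w
```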
Power of Linear Models
[Figure: a single-layer network with inputs $1, x_1, x_2, x_3, x_4, \ldots, x_d$, weights $w_0, \mathbf{w}$, output unit $g(\cdot)$ and output $y$]

$$ y(\mathbf{x}) = f(\mathbf{x},\mathbf{w}) = g\left(\mathbf{w}^T\mathbf{x} + w_0\right) $$
If g(·) is linear, only linear functions can be modeled. However, if x is suitably preprocessed, complicated functions can be realized:
$$ \mathbf{x} = \Phi(\mathbf{z}) = \begin{bmatrix} \phi_1(\mathbf{z}) \\ \phi_2(\mathbf{z}) \\ \vdots \\ \phi_d(\mathbf{z}) \end{bmatrix}, \qquad \text{example:}\quad \mathbf{x} = \Phi(z) = \begin{bmatrix} z \\ z^2 \\ \vdots \\ z^d \end{bmatrix} $$
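A minimal sketch of this preprocessing idea, assuming NumPy (the degree d and the target function are arbitrary illustrative choices): the scalar inputs z are mapped through Φ(z) = (z, z², …, z^d) and an ordinary linear least-squares fit is done in the expanded space, so the fitted function of z is nonlinear even though the model stays linear in w:

```python
import numpy as np

def phi(z, d):
    """Map scalar inputs z to the polynomial features (z, z^2, ..., z^d)."""
    return np.column_stack([z ** k for k in range(1, d + 1)])

rng = np.random.default_rng(0)
z = np.linspace(-1, 1, 50)
t = np.sin(3 * z) + 0.1 * rng.standard_normal(z.size)   # nonlinear target + noise

d = 5
X = np.column_stack([np.ones_like(z), phi(z, d)])        # include the bias column
w, *_ = np.linalg.lstsq(X, t, rcond=None)                # ordinary linear least squares in w
print(w)                                                  # parameters of the linear-in-w model
```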
Least Squares Optimization
Least Squares Cost Function:

$$ J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\left(t_i - \hat{f}(\mathbf{x}_i)\right)^2 = \frac{1}{2}\sum_{i=1}^{N}\left(t_i - \mathbf{w}^T\mathbf{x}_i\right)^2 = \frac{1}{2}\left(\mathbf{t} - X\mathbf{w}\right)^T\left(\mathbf{t} - X\mathbf{w}\right) $$

where

$$ X = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}, \qquad \mathbf{t} = \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix}, \qquad N = \#\ \text{of training data} $$

Minimize Cost:

$$ \frac{\partial J}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}}\left(\frac{1}{2}\left(\mathbf{t} - X\mathbf{w}\right)^T\left(\mathbf{t} - X\mathbf{w}\right)\right) = -X^T\mathbf{t} + X^TX\mathbf{w} = 0 \;\Rightarrow\; X^TX\mathbf{w} = X^T\mathbf{t} $$

Solution:

$$ \mathbf{w} = \left(X^TX\right)^{-1}X^T\mathbf{t} $$
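A small numerical sketch of this solution, assuming NumPy and synthetic data: the normal equations X^T X w = X^T t are solved directly and compared against np.linalg.lstsq, which solves the same least-squares problem in a numerically more stable way:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, m))])   # inputs with a bias column
w_true = np.array([0.5, 2.0, -1.0, 0.3])
t = X @ w_true + 0.1 * rng.standard_normal(N)                     # noisy targets

# Solution via the normal equations: w = (X^T X)^{-1} X^T t
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Equivalent, numerically more robust solver
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(np.allclose(w_normal, w_lstsq))   # True: both minimize the same cost J(w)
```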
What are we really doing ?
Least Squares Solution:

$$ \mathbf{w} = \left(X^TX\right)^{-1}X^T\mathbf{t}, \qquad y_{pred} = \mathbf{x}_{pred}^T\mathbf{w} $$
We seek the linear function of X that minimizes the sum of the squared residuals from Y
[Figure: linear least squares fitting]
More insights into the LS solution
The Pseudo-Inverse:

$$ X^{+} = \left(X^TX\right)^{-1}X^T $$

Pseudo-inverses are a special solution to an infinite set of solutions of a non-unique inverse problem (we talked about this in the previous lecture). The matrix inversion above may still be ill-defined if $X^TX$ is close to singular, in which case so-called Ridge Regression needs to be applied.

Ridge Regression:

$$ X^{+} = \left(X^TX + \gamma I\right)^{-1}X^T, \qquad \text{where } \gamma \ll 1 $$

Multiple Outputs: just like multiple single-output regressions:

$$ W = \left(X^TX\right)^{-1}X^TY $$
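A minimal sketch of the regularized pseudo-inverse, assuming NumPy (the value of γ and the helper name ridge_pseudo_inverse are illustrative): adding γI keeps the inversion well-defined even when columns of X are nearly collinear, and the multi-output case simply applies the same pseudo-inverse to every output column of Y:

```python
import numpy as np

def ridge_pseudo_inverse(X, gamma=1e-3):
    """(X^T X + gamma I)^{-1} X^T  --  regularized pseudo-inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X = np.column_stack([X, X[:, 0] + 1e-8 * rng.standard_normal(50)])   # nearly collinear column
Y = rng.standard_normal((50, 2))                                      # two output dimensions

W = ridge_pseudo_inverse(X, gamma=1e-3) @ Y   # one regression per output column
print(W.shape)                                 # (5, 2)
```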
Geometrical Interpretation of LS
[Figure: the target vector $\mathbf{t}$ and its projection $\mathbf{y}$ onto the subspace $S$ spanned by the columns of $X$ (shown as $X[0]$, $X[1]$)]

$\mathbf{y}$ is an orthogonal projection of $\mathbf{t}$ on $S$.
Residual vector (orthogonal to $\mathbf{y}$):

$$ \mathbf{t} - \mathbf{y} = \mathbf{t} - X\mathbf{w} = \mathbf{t} - \sum_i w_i X[i] $$
… is orthogonal to the space spanned by columns of X since …
$$ \frac{\partial J}{\partial \mathbf{w}} = -\left(\mathbf{t} - X\mathbf{w}\right)^T X = 0 $$
And hence, …
y is the optimal reconstruction of t in the range of X
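This orthogonality is easy to check numerically; a minimal sketch assuming NumPy and synthetic data verifies that X^T(t − Xw) ≈ 0 for the least-squares w, i.e. the residual is orthogonal to every column of X and y = Xw is the orthogonal projection of t onto the range of X:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
t = rng.standard_normal(30)

w, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares fit
residual = t - X @ w

print(X.T @ residual)                               # ~ zeros: residual is orthogonal to the columns of X
print(np.allclose(X.T @ residual, 0, atol=1e-8))    # True
```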
Physical Interpretation of LS
• all springs have the same spring constant
• points far away generate more “force” (danger of outliers)
• springs are vertical
• the solution is the minimum energy solution achieved by the springs
[Figure: data points attached to the fitted line by vertical springs; axes x and y]
Minimum variance unbiased estimator
The Least Squares estimate of the parameters w has the smallest variance among all linear unbiased estimates.
Least Squares estimates are therefore also called BLUE estimates – Best Linear Unbiased Estimators.
Gauss-Markov Theorem
Least Squares Estimate:

$$ \hat{\mathbf{w}} = \left(X^TX\right)^{-1}X^T\mathbf{t} = H\mathbf{t}, \qquad \text{where } H = \left(X^TX\right)^{-1}X^T $$
In other words, the Gauss-Markov theorem says that there is no other matrix C such that the estimator formed by $\tilde{\mathbf{w}} = C\mathbf{t}$ will be both unbiased and have a smaller variance than $\hat{\mathbf{w}}$.
The Least Squares Estimate is an Unbiased Estimate, since $E(\hat{\mathbf{w}}) = \mathbf{w}$.
(Homework !!)
Gauss-Markov Theorem (Proof)
$$ \tilde{\mathbf{w}} = C\mathbf{t} $$

$$ E(\tilde{\mathbf{w}}) = E(C\mathbf{t}) = E\left(C(X\mathbf{w} + \boldsymbol{\epsilon})\right) = CX\mathbf{w} + C\,E(\boldsymbol{\epsilon}) = CX\mathbf{w} $$

For an Unbiased Estimate: $E(\tilde{\mathbf{w}}) = \mathbf{w} \;\Rightarrow\; CX\mathbf{w} = \mathbf{w} \;\Rightarrow\; CX = I$

$$
\begin{aligned}
Var(\tilde{\mathbf{w}}) &= E\left[(\tilde{\mathbf{w}} - E(\tilde{\mathbf{w}}))(\tilde{\mathbf{w}} - E(\tilde{\mathbf{w}}))^T\right] = E\left[(\tilde{\mathbf{w}} - \mathbf{w})(\tilde{\mathbf{w}} - \mathbf{w})^T\right] = E\left[(C\mathbf{t} - \mathbf{w})(C\mathbf{t} - \mathbf{w})^T\right] \\
&= E\left[(CX\mathbf{w} + C\boldsymbol{\epsilon} - \mathbf{w})(CX\mathbf{w} + C\boldsymbol{\epsilon} - \mathbf{w})^T\right] = E\left[(C\boldsymbol{\epsilon})(C\boldsymbol{\epsilon})^T\right] \qquad \text{since } CX = I \\
&= C\,E\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\right]C^T = \sigma^2 CC^T \qquad \text{since } E\left[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\right] = \sigma^2 I
\end{aligned}
$$
Gauss-Markov Theorem (Proof)
We want to show that $Var(\hat{\mathbf{w}}) \le Var(\tilde{\mathbf{w}})$.

Let $C = \left(X^TX\right)^{-1}X^T + D$.

Since $CX = I$: $\left(X^TX\right)^{-1}X^TX + DX = I \;\Rightarrow\; I + DX = I \;\Rightarrow\; DX = 0$

$$
\begin{aligned}
Var(\tilde{\mathbf{w}}) &= \sigma^2 CC^T = \sigma^2\left(\left(X^TX\right)^{-1}X^T + D\right)\left(\left(X^TX\right)^{-1}X^T + D\right)^T \\
&= \sigma^2\left(\left(X^TX\right)^{-1}X^TX\left(X^TX\right)^{-1} + DX\left(X^TX\right)^{-1} + \left(X^TX\right)^{-1}(DX)^T + DD^T\right) \\
&= \sigma^2\left(X^TX\right)^{-1} + \sigma^2 DD^T \qquad \text{since } DX = 0 \\
&= Var(\hat{\mathbf{w}}) + \sigma^2 DD^T \;\ge\; Var(\hat{\mathbf{w}})
\end{aligned}
$$

It is sufficient to show that the diagonal elements of $\sigma^2 DD^T$ are non-negative. This is true by definition. Hence proved.
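A small Monte-Carlo sketch of the theorem, assuming NumPy (the particular choice of D with DX = 0 is purely illustrative): both the least-squares estimator and the alternative linear estimator C = (X^T X)^{-1}X^T + D are unbiased, but the alternative shows the larger per-component variance, as the proof predicts:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, sigma = 40, 3, 0.5
X = np.column_stack([np.ones(N), rng.standard_normal((N, d - 1))])
w_true = np.array([1.0, -2.0, 0.5])

H = np.linalg.solve(X.T @ X, X.T)                          # least-squares matrix (X^T X)^{-1} X^T
P = X @ H                                                   # projection onto the column space of X
D = 0.3 * rng.standard_normal((d, N)) @ (np.eye(N) - P)     # any D with D X = 0
C = H + D                                                   # alternative linear unbiased estimator

ls_estimates, alt_estimates = [], []
for _ in range(5000):
    t = X @ w_true + sigma * rng.standard_normal(N)
    ls_estimates.append(H @ t)
    alt_estimates.append(C @ t)

ls_estimates, alt_estimates = np.array(ls_estimates), np.array(alt_estimates)
print(ls_estimates.mean(axis=0), alt_estimates.mean(axis=0))   # both ~ w_true (unbiased)
print(ls_estimates.var(axis=0) <= alt_estimates.var(axis=0))   # elementwise True (in expectation)
```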
Biased vs unbiased
Bias-Variance decomposition of error:

$$ E\left[\left(y_i - \hat{y}_i\right)^2\right] = \sigma_\epsilon^2 + \left(f(\mathbf{x}_i) - E\{\hat{y}_i\}\right)^2 + E\left[\left(\hat{y}_i - E\{\hat{y}_i\}\right)^2\right] = \mathrm{var(noise)} + \mathrm{bias}^2 + \mathrm{var(estimate)} $$
Gauss-Markov Theorem says that Least Squares achieves the estimate with the minimum variance (and hence, the minimum Mean Squared Error) among all the unbiased estimates (bias=0).
Does that mean that we should always work with unbiased estimators ??
No !! since there may exist some biased estimators with a smaller net mean squared error – they trade a little bias for a larger reduction in variance.
Variable Subset Selection and Shrinkage are methods (which we will explore soon) that introduce bias and try to reduce the variance of the estimate.
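A minimal simulation sketch of this trade-off, assuming NumPy (the ridge parameter γ and the data-generating choices are arbitrary): with strongly correlated inputs, the biased ridge (shrinkage) estimator typically attains a smaller mean squared error in the parameters than the unbiased least-squares estimator:

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma, gamma = 30, 1.0, 5.0
w_true = np.array([1.0, 1.0, 1.0])

ls_err, ridge_err = [], []
for _ in range(3000):
    z = rng.standard_normal(N)
    # strongly correlated inputs make the least-squares estimate high-variance
    X = np.column_stack([z, z + 0.05 * rng.standard_normal(N), rng.standard_normal(N)])
    t = X @ w_true + sigma * rng.standard_normal(N)

    w_ls = np.linalg.solve(X.T @ X, X.T @ t)                          # unbiased
    w_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(3), X.T @ t)   # biased, lower variance

    ls_err.append(np.sum((w_ls - w_true) ** 2))
    ridge_err.append(np.sum((w_ridge - w_true) ** 2))

print(np.mean(ls_err), np.mean(ridge_err))   # ridge typically shows the smaller mean squared error here
```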
Recursive Least Squares
The Sherman-Morrison-Woodbury Theorem:

$$ \left(A - \mathbf{z}\mathbf{z}^T\right)^{-1} = A^{-1} + \frac{A^{-1}\mathbf{z}\mathbf{z}^T A^{-1}}{1 - \mathbf{z}^T A^{-1}\mathbf{z}} $$

More General: The Matrix Inversion Theorem:

$$ \left(A - BC\right)^{-1} = A^{-1} + A^{-1}B\left(I - CA^{-1}B\right)^{-1}CA^{-1} $$

Recursive Least Squares Update:

Initialize: $P^{0} = \frac{1}{\gamma} I$, where $\gamma \ll 1$ (note $P \equiv \left(X^TX\right)^{-1}$)

For every new data point $(\mathbf{x}, t)$ (note that $\mathbf{x}$ includes the bias term):

$$ P^{n+1} = \frac{1}{\lambda}\left(P^{n} - \frac{P^{n}\mathbf{x}\mathbf{x}^T P^{n}}{\lambda + \mathbf{x}^T P^{n}\mathbf{x}}\right), \qquad \text{where } \lambda = \begin{cases} 1 & \text{if no forgetting} \\ < 1 & \text{if forgetting} \end{cases} $$

$$ \mathbf{w}^{n+1} = \mathbf{w}^{n} + P^{n+1}\mathbf{x}\left(t - \mathbf{w}^{nT}\mathbf{x}\right) $$
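A minimal sketch of these update equations, assuming NumPy (the class name and interface are only illustrative): P is initialized to (1/γ)I and every incoming pair (x, t) updates P and w with the rank-one formulas above, optionally with the forgetting factor λ:

```python
import numpy as np

class RecursiveLeastSquares:
    """Recursive least squares with an optional forgetting factor lambda."""

    def __init__(self, dim, gamma=1e-6, lam=1.0):
        self.P = np.eye(dim) / gamma       # P ~ (X^T X)^{-1}, initialized with gamma << 1
        self.w = np.zeros(dim)
        self.lam = lam                      # lam = 1: no forgetting, lam < 1: forgetting

    def update(self, x, t):
        x = np.asarray(x, dtype=float)      # x is assumed to include the bias term
        Px = self.P @ x
        # P^{n+1} = (1/lam) * (P^n - P^n x x^T P^n / (lam + x^T P^n x))
        self.P = (self.P - np.outer(Px, Px) / (self.lam + x @ Px)) / self.lam
        # w^{n+1} = w^n + P^{n+1} x (t - w^{nT} x)
        self.w = self.w + self.P @ x * (t - self.w @ x)
        return self.w
```

Each update costs only O(d²) operations and never inverts a matrix explicitly, which is what makes the recursive form attractive for online learning.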
Recursive Least Squares (cont’d)
Some amazing facts about recursive least squares:
• Results for W are EXACTLY the same as for the normal (batch) least squares solution after every data point has been added once – no iterations! (see the check below)
• NO matrix inversion necessary anymore
• NO learning rate necessary
• Guaranteed convergence to the optimal W (linear regression is an optimal estimator under many conditions)
• The forgetting factor λ allows forgetting old data in case of changing target functions
• Computational load is larger than the batch version of linear regression
• But don’t get fooled: if the data is singular, you will still have problems!
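To illustrate the first point, a self-contained sketch assuming NumPy (with λ = 1, i.e. no forgetting): one sequential pass of the recursive update reproduces the batch solution; strictly speaking it matches (X^T X + γI)^{-1}X^T t for the tiny initialization constant γ, which is the batch least-squares solution up to a negligible regularizer:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, gamma = 200, 4, 1e-6
X = np.column_stack([np.ones(N), rng.standard_normal((N, d - 1))])   # bias included in x
t = X @ np.array([0.2, 1.5, -0.7, 0.9]) + 0.05 * rng.standard_normal(N)

# One sequential pass of recursive least squares (lambda = 1, no forgetting)
P = np.eye(d) / gamma
w = np.zeros(d)
for x_i, t_i in zip(X, t):
    Px = P @ x_i
    P = P - np.outer(Px, Px) / (1.0 + x_i @ Px)
    w = w + P @ x_i * (t_i - w @ x_i)

# Batch solution (with the same tiny regularizer gamma)
w_batch = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ t)

print(np.allclose(w, w_batch))   # True: one pass of RLS reproduces the batch solution
```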