
Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Regression

Given:
– Data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear regression and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]

Prostate Cancer Dataset

• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert

Linear Regression

• Hypothesis:

  $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$   (assume $x_0 = 1$)

• Fit model by minimizing sum of squared errors

Figures are courtesy of Greg Shakhnarovich
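
With the $x_0 = 1$ convention, the hypothesis is just a dot product. A minimal numpy sketch (the function name and example values below are illustrative, not from the slides):

    import numpy as np

    def h(theta, x):
        """Evaluate h_theta(x) = sum_j theta_j * x_j, assuming x already has x_0 = 1 prepended."""
        return np.dot(theta, x)

    theta = np.array([0.5, 2.0])   # [theta_0, theta_1]
    x = np.array([1.0, 3.0])       # [x_0 = 1, x_1]
    print(h(theta, x))             # 0.5 + 2.0 * 3.0 = 6.5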

Least Squares Linear Regression

• Cost Function

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

• Fit by solving $\min_\theta J(\theta)$
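
A direct, loop-based sketch of this cost function (the function name and array shapes are assumptions, not from the slides; X is n × (d+1) with the $x_0 = 1$ column included):

    import numpy as np

    def cost(theta, X, y):
        """J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
        n = len(y)
        total = 0.0
        for i in range(n):
            prediction = np.dot(theta, X[i])    # h_theta(x^(i))
            total += (prediction - y[i]) ** 2
        return total / (2 * n)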

Intuition Behind Cost Function

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$:

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

[Figure: left, the hypothesis $h_\theta(x)$ over the training data (for fixed $\theta$, this is a function of $x$); right, $J(\theta_1)$ (a function of the parameter)]

Based on example by Andrew Ng

Intuition Behind Cost Function

For $\theta = [0, 0.5]$, i.e. $h_\theta(x) = 0.5x$, on the three training points shown:

  $J([0, 0.5]) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$

[Figure: left, the line $h_\theta(x) = 0.5x$ plotted against the data; right, the corresponding point on the $J(\theta_1)$ curve]

Based on example by Andrew Ng
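
The slide's arithmetic can be checked directly, assuming (as the plots suggest) the three training points (1, 1), (2, 2), (3, 3):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])

    residuals = 0.0 + 0.5 * x - y                             # h_theta(x) - y for theta = [0, 0.5]
    print(np.sum(residuals ** 2) / (2 * len(x)))              # ~0.5833

    print(np.sum((0.0 + 0.0 * x - y) ** 2) / (2 * len(x)))    # J([0, 0]) ~ 2.3333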

Intuition Behind Cost Function

Similarly, $J([0, 0]) \approx 2.333$.

$J(\theta)$ is convex.

[Figure: left, the flat line $h_\theta(x) = 0$ against the data; right, the corresponding point on the $J(\theta_1)$ curve]

Based on example by Andrew Ng

Intuition Behind Cost Function

[Slides 11–15: with two parameters, the left panel shows $h_\theta(x)$ for a fixed $\theta$ (a function of $x$) and the right panel shows the contour plot of $J(\theta_0, \theta_1)$ (a function of the parameters), for several choices of $\theta$]

Slides by Andrew Ng

Basic Search Procedure

• Choose initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure: surface plot of $J(\theta_0, \theta_1)$, repeated over slides 16–18 with different descent paths. Figure by Andrew Ng]

Since the least squares objective function is convex, we don't need to worry about local minima.

Gradient Descent

• Initialize $\theta$
• Repeat until convergence

  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

  $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: $J(\theta_1)$ curve with gradient descent steps toward the minimum]

Gradient Descent

• Initialize $\theta$
• Repeat until convergence

  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

For Linear Regression (derivation built up over slides 20–23):

  $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

  $= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$

  $= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$

  $= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$

  $= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

Gradient Descent for Linear Regression

• Initialize $\theta$
• Repeat until convergence

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$   (simultaneous update for $j = 0 \ldots d$)

• To achieve simultaneous update
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop

• Assume convergence when $\lVert \theta_{\mathrm{new}} - \theta_{\mathrm{old}} \rVert_2 < \epsilon$

  L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$
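
A minimal batch gradient descent sketch following the update above; the function name, default learning rate, tolerance, and iteration cap are illustrative choices, not from the slides:

    import numpy as np

    def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
        """Batch GD for linear regression. X has shape (n, d+1) with a leading column of ones."""
        n, d1 = X.shape
        theta = np.zeros(d1)
        for _ in range(max_iters):
            predictions = X @ theta                      # h_theta(x^(i)) for all i, computed once per iteration
            gradient = (X.T @ (predictions - y)) / n     # (1/n) * sum_i (h - y) * x_j, for every j at once
            theta_new = theta - alpha * gradient         # simultaneous update of all theta_j
            if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < epsilon
                return theta_new
            theta = theta_new
        return theta

Computing all predictions with one matrix–vector product before the update is what makes the update simultaneous across every $\theta_j$.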

Gradient Descent

[Slides 25–33: successive gradient descent iterations on the running example. The left panel shows the current hypothesis, starting from $h(x) = -900 - 0.1x$ (for fixed $\theta$, a function of $x$); the right panel shows the corresponding point on the contour plot of $J(\theta_0, \theta_1)$ (a function of the parameters) moving toward the minimum]

Slides by Andrew Ng

Choosing α

• α too small: slow convergence

• α too large: increasing value for $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
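
One way to follow this advice in code is to record $J(\theta)$ every iteration and flag any increase, which usually means α is too large. A sketch with hypothetical names:

    import numpy as np

    def gd_with_monitoring(X, y, alpha, iters=500):
        n, d1 = X.shape
        theta = np.zeros(d1)
        history = []
        for t in range(iters):
            residuals = X @ theta - y
            history.append((residuals @ residuals) / (2 * n))   # J(theta) at this iteration
            if t > 0 and history[-1] > history[-2]:
                print(f"J increased at iteration {t}; consider a smaller alpha")
            theta = theta - alpha * (X.T @ residuals) / n
        return theta, history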

Extending Linear Regression to More Complex Models

• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: $x_3 = x_1 \times x_2$

This allows use of linear regression techniques to fit non-linear datasets.
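
To make the list concrete, here is a sketch of building a transformed design matrix with a log transform, a square, and an interaction feature; the particular columns chosen are illustrative, not from the slides:

    import numpy as np

    def expand_features(X):
        """X has shape (n, 2) with raw features x1, x2 (x1 > 0 assumed for the log transform).
        Returns columns [1, x1, x2, log(x1), x1^2, x1*x2]."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, np.log(x1), x1 ** 2, x1 * x2])

The model stays linear in the parameters; only the inputs are transformed.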

Linear Basis Function Models

• Generally,

  $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$   where $\phi_j(x)$ is a basis function

• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Polynomial basis functions: $\phi_j(x) = x^j$
  – These are global; a small change in $x$ affects all basis functions

• Gaussian basis functions: $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
  – These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  – These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
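
A sketch of a Gaussian basis feature map for scalar inputs, assuming the standard form $\phi_j(x) = \exp(-(x - \mu_j)^2 / 2s^2)$ shown above; the centers and width below are arbitrary:

    import numpy as np

    def gaussian_basis(x, centers, s):
        """Map scalar inputs x (shape (n,)) to [1, phi_1(x), ..., phi_m(x)] with Gaussian bumps at the given centers."""
        phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
        return np.column_stack([np.ones(len(x)), phi])

    x = np.linspace(0, 1, 5)
    Phi = gaussian_basis(x, centers=np.linspace(0, 1, 4), s=0.2)   # design matrix of shape (5, 5)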

Example of Fitting a Polynomial Curve with a Linear Model

  $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
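
A minimal example of fitting such a polynomial by ordinary least squares on a Vandermonde design matrix; the synthetic data and the use of np.linalg.lstsq are my choices for illustration, not the deck's fitting procedure:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 30)
    y = 1.0 + 2.0 * x - 3.0 * x ** 3 + 0.1 * rng.standard_normal(30)   # illustrative noisy cubic

    p = 3
    X = np.vander(x, p + 1, increasing=True)          # columns [1, x, x^2, x^3]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares estimate of theta_0..theta_p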

Linear Basis Function Models

• Basic Linear Model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$

• Generalized Linear Model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$

• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton

Linear Algebra Concepts

• A vector in $\mathbb{R}^d$ is an ordered set of $d$ real numbers
  – e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$
  – "$[1, 6, 3, 4]$" is a column vector, as opposed to a row vector $(1 \; 6 \; 3 \; 4)$

• An $m$-by-$n$ matrix is an object with $m$ rows and $n$ columns, where each entry is a real number (an example matrix is shown on the slide)

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Transpose: reflect the vector/matrix across the diagonal:

  $\begin{pmatrix} a \\ b \end{pmatrix}^{\!\top} = (a \;\; b)$,   $\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{\!\top} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$

  – Note: $(Ax)^\top = x^\top A^\top$ (we'll define multiplication soon...)

• Vector norms:
  – $L_p$ norm of $v = (v_1, \ldots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$

• Length of a vector $v$ is $L_2(v)$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector dot product:

  $u \cdot v = (u_1 \;\; u_2) \cdot (v_1 \;\; v_2) = u_1 v_1 + u_2 v_2$

  – Note: the dot product of $u$ with itself = length$(u)^2$ = $\lVert u \rVert_2^2$

• Matrix product:

  $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$, $B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$

  $AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector products:
  – Dot product: $u \cdot v = u^\top v = (u_1 \;\; u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$
  – Outer product: $u v^\top = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} (v_1 \;\; v_2) = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$

Based on slides by Joseph Bradley
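
These operations map directly onto numpy; a quick illustration (not part of the slides):

    import numpy as np

    u = np.array([1.0, 2.0])
    v = np.array([3.0, 4.0])
    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    B = np.array([[5.0, 6.0], [7.0, 8.0]])

    print(u @ v)            # dot product: 1*3 + 2*4 = 11
    print(np.outer(u, v))   # outer product u v^T, a 2x2 matrix
    print(A @ B)            # matrix product AB
    print(u @ u, np.linalg.norm(u) ** 2)   # dot of u with itself equals ||u||_2^2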

Vectorization

• Benefits of vectorization
  – More compact equations
  – Faster code (using optimized matrix libraries)

• Consider our model: $h(x) = \sum_{j=0}^{d} \theta_j x_j$

• Let $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$,  $x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$

• Can write the model in vectorized form as $h(x) = \theta^\top x$

Vectorization

• Consider our model for $n$ instances: $h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$

• Let

  $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}$,  $X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$

• Can write the model in vectorized form as $h_\theta(x) = X\theta$

Vectorization

• For the linear regression cost function:

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$

  with $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, and

  Let: $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$
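
The vectorized form translates into a couple of lines of numpy; a sketch (function names are mine) that also checks it against the example-by-example sum:

    import numpy as np

    def cost_vectorized(theta, X, y):
        """J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)."""
        r = X @ theta - y
        return (r @ r) / (2 * len(y))

    def cost_loop(theta, X, y):
        """Same quantity computed one example at a time, to confirm the vectorized identity."""
        n = len(y)
        return sum((X[i] @ theta - y[i]) ** 2 for i in range(n)) / (2 * n)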

Closed Form Solution

• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is when $\frac{\partial}{\partial \theta} J(\theta) = 0$

• Derivation:

  $J(\theta) = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$
  $\propto \theta^\top X^\top X \theta - y^\top X \theta - \theta^\top X^\top y + y^\top y$
  $\propto \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y$

  (the two cross terms combine because $y^\top X \theta$ is a $1 \times 1$ scalar, so it equals its transpose $\theta^\top X^\top y$)

  Take the derivative, set it equal to 0, then solve for $\theta$:

  $\frac{\partial}{\partial \theta} \left( \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y \right) = 0$
  $(X^\top X)\theta - X^\top y = 0$
  $(X^\top X)\theta = X^\top y$
  $\theta = (X^\top X)^{-1} X^\top y$

Closed Form Solution

• Can obtain $\theta$ by simply plugging $X$ and $y$ into

  $\theta = (X^\top X)^{-1} X^\top y$

  where $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$ and $X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix}$

• If $X^\top X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that $d \le n$
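
A hedged sketch of the closed-form solve; np.linalg.pinv is the pseudo-inverse the slide mentions, and np.linalg.solve is a common alternative when $X^\top X$ is well conditioned:

    import numpy as np

    def fit_closed_form(X, y):
        """theta = (X^T X)^{-1} X^T y, computed via the pseudo-inverse to tolerate a singular X^T X."""
        return np.linalg.pinv(X.T @ X) @ (X.T @ y)

    # If X^T X is invertible, solving the normal equations directly is equivalent:
    # theta = np.linalg.solve(X.T @ X, X.T @ y)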

Gradient Descent vs. Closed Form

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n or d is large
  – Forming $X^\top X$ costs roughly $O(nd^2)$, and inverting it roughly $O(d^3)$

Improving Learning: Feature Scaling

• Idea: Ensure that features have similar scales

• Makes gradient descent converge much faster

[Figure: contours of $J(\theta)$ over $\theta_1, \theta_2$ — elongated ellipses before feature scaling, roughly circular contours after feature scaling]

Feature Standardization

• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature $j$: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
  – Replace each value $x_j^{(i)}$ with $\frac{x_j^{(i)} - \mu_j}{s_j}$   for $j = 1 \ldots d$ (not $x_0$!)
    • $s_j$ is the standard deviation of feature $j$
    • Could also use the range of feature $j$ ($\max_j - \min_j$) for $s_j$

• Must apply the same transformation to instances for both training and prediction

• Outliers can cause problems
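
A sketch of standardization that keeps the training statistics so the identical transformation can be applied at prediction time, as required above (function names are mine; X holds the raw features only, with the ones column added afterwards):

    import numpy as np

    def fit_standardizer(X_train):
        """Compute per-feature mean and standard deviation on the training set."""
        mu = X_train.mean(axis=0)
        s = X_train.std(axis=0)
        return mu, s

    def standardize(X, mu, s):
        """Replace each value with (x_j - mu_j) / s_j, using the *training* statistics."""
        return (X - mu) / s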

Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples

[Figure: three fits of Productivity vs. Time Spent — underfitting (high bias), correct fit, overfitting (high variance)]

Based on example by Andrew Ng

Regularization

• A method for automatically controlling the complexity of the learned hypothesis

• Idea: penalize for large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label

• Can also address overfitting by eliminating features (either manually or via model selection)

Regularization

• Linear regression objective function

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

  (first term: model fit to data; second term: regularization)

  – $\lambda$ is the regularization parameter ($\lambda \ge 0$)
  – No regularization on $\theta_0$!

Understanding Regularization

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• Note that $\sum_{j=1}^{d} \theta_j^2 = \lVert \theta_{1:d} \rVert_2^2$
  – This is the magnitude of the feature coefficient vector!

• We can also think of this as: $\sum_{j=1}^{d} (\theta_j - 0)^2 = \lVert \theta_{1:d} - \vec{0} \rVert_2^2$

• $L_2$ regularization pulls coefficients toward 0

Understanding Regularization

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• What happens as $\lambda \to \infty$?

[Figure: Productivity vs. Time Spent on Work with the degree-4 fit $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$]

Understanding Regularization

• What happens as $\lambda \to \infty$? The penalty drives the regularized coefficients $\theta_1, \theta_2, \theta_3, \theta_4$ toward 0 in

  $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$,

  so the hypothesis approaches the constant $\theta_0$: a flat line that underfits the data.

[Figure: Productivity vs. Time Spent on Work with the resulting flat fit]

Regularized Linear Regression

• Cost Function

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• Fit by solving $\min_\theta J(\theta)$

• Gradient update:

  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$   (this is $\alpha \frac{\partial}{\partial \theta_0} J(\theta)$)

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$   (the $- \alpha \lambda \theta_j$ term comes from the regularization)

Regularized Linear Regression

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$

• We can rewrite the gradient step as:

  $\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$
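
A sketch of this regularized step: each $\theta_j$ shrinks by the factor $(1 - \alpha\lambda)$ before the usual gradient move, while $\theta_0$ is left unpenalized. Names and argument order are illustrative:

    import numpy as np

    def ridge_gd_step(theta, X, y, alpha, lam):
        """One regularized gradient step; X is n x (d+1) with a leading ones column, theta_0 is not penalized."""
        n = len(y)
        grad = (X.T @ (X @ theta - y)) / n            # (1/n) * sum_i (h - y) * x_j for every j
        shrink = np.full_like(theta, 1.0 - alpha * lam)
        shrink[0] = 1.0                               # no regularization on theta_0
        return shrink * theta - alpha * grad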

Regularized Linear Regression

• To incorporate regularization into the closed form solution:

  $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^\top y$

Regularized Linear Regression

• To incorporate regularization into the closed form solution:

  $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^\top y$

• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$

• Can prove that for $\lambda > 0$, the inverse exists in the equation above
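
A sketch of this regularized closed form in numpy; the identity-like matrix has a 0 in its (0, 0) entry so that $\theta_0$ is not penalized (function name is mine):

    import numpy as np

    def fit_ridge_closed_form(X, y, lam):
        """theta = (X^T X + lam * E)^{-1} X^T y, where E is the identity with E[0, 0] = 0."""
        d1 = X.shape[1]
        E = np.eye(d1)
        E[0, 0] = 0.0
        return np.linalg.solve(X.T @ X + lam * E, X.T @ y)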
