
Linear Regression

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.

Regression

Given:
– Data $X = \{x^{(1)}, \ldots, x^{(n)}\}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{y^{(1)}, \ldots, y^{(n)}\}$ where $y^{(i)} \in \mathbb{R}$

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1970–2020, with linear regression and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013)]

Prostate Cancer Dataset

• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
  – 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
  – lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert

Linear Regression

• Hypothesis:

  $y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j$   (assume $x_0 = 1$)

• Fit model by minimizing sum of squared errors

Figures are courtesy of Greg Shakhnarovich
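
With the $x_0 = 1$ convention, the hypothesis is just a dot product. A minimal numpy sketch (the function name and example values below are illustrative, not from the slides):

    import numpy as np

    def h(theta, x):
        """Evaluate h_theta(x) = sum_j theta_j * x_j, assuming x already has x_0 = 1 prepended."""
        return np.dot(theta, x)

    theta = np.array([0.5, 2.0])   # [theta_0, theta_1]
    x = np.array([1.0, 3.0])       # [x_0 = 1, x_1]
    print(h(theta, x))             # 0.5 + 2.0 * 3.0 = 6.5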

Least Squares Linear Regression

• Cost Function

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

• Fit by solving $\min_\theta J(\theta)$
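
A direct, loop-based sketch of this cost function (the function name and array shapes are assumptions, not from the slides; X is n × (d+1) with the $x_0 = 1$ column included):

    import numpy as np

    def cost(theta, X, y):
        """J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
        n = len(y)
        total = 0.0
        for i in range(n):
            prediction = np.dot(theta, X[i])    # h_theta(x^(i))
            total += (prediction - y[i]) ** 2
        return total / (2 * n)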

Intuition Behind Cost Function

For insight on $J(\theta)$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$:

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

[Figure: left, the hypothesis $h_\theta(x)$ over the training data (for fixed $\theta$, this is a function of $x$); right, $J(\theta_1)$ (a function of the parameter)]

Based on example by Andrew Ng

Intuition Behind Cost Function

For $\theta = [0, 0.5]$, i.e. $h_\theta(x) = 0.5x$, on the three training points shown:

  $J([0, 0.5]) = \frac{1}{2 \times 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$

[Figure: left, the line $h_\theta(x) = 0.5x$ plotted against the data; right, the corresponding point on the $J(\theta_1)$ curve]

Based on example by Andrew Ng
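
The slide's arithmetic can be checked directly, assuming (as the plots suggest) the three training points (1, 1), (2, 2), (3, 3):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([1.0, 2.0, 3.0])

    residuals = 0.0 + 0.5 * x - y                             # h_theta(x) - y for theta = [0, 0.5]
    print(np.sum(residuals ** 2) / (2 * len(x)))              # ~0.5833

    print(np.sum((0.0 + 0.0 * x - y) ** 2) / (2 * len(x)))    # J([0, 0]) ~ 2.3333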

Intuition Behind Cost Function

Similarly, $J([0, 0]) \approx 2.333$.

$J(\theta)$ is convex.

[Figure: left, the flat line $h_\theta(x) = 0$ against the data; right, the corresponding point on the $J(\theta_1)$ curve]

Based on example by Andrew Ng

Intuition Behind Cost Function

[Slides 11–15: with two parameters, the left panel shows $h_\theta(x)$ for a fixed $\theta$ (a function of $x$) and the right panel shows the contour plot of $J(\theta_0, \theta_1)$ (a function of the parameters), for several choices of $\theta$]

Slides by Andrew Ng

Basic Search Procedure

• Choose initial value for $\theta$
• Until we reach a minimum:
  – Choose a new value for $\theta$ to reduce $J(\theta)$

[Figure: surface plot of $J(\theta_0, \theta_1)$, repeated over slides 16–18 with different descent paths. Figure by Andrew Ng]

Since the least squares objective function is convex, we don't need to worry about local minima.

Gradient Descent

• Initialize $\theta$
• Repeat until convergence

  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

  $\alpha$ is the learning rate (small), e.g., $\alpha = 0.05$

[Figure: $J(\theta_1)$ curve with gradient descent steps toward the minimum]

Gradient Descent

• Initialize $\theta$
• Repeat until convergence

  $\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneous update for $j = 0 \ldots d$)

For Linear Regression (derivation built up over slides 20–23):

  $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$

  $= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$

  $= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$

  $= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$

  $= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$

Gradient Descent for Linear Regression

• Initialize $\theta$
• Repeat until convergence

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$   (simultaneous update for $j = 0 \ldots d$)

• To achieve simultaneous update
  – At the start of each GD iteration, compute $h_\theta\!\left(x^{(i)}\right)$
  – Use this stored value in the update step loop

• Assume convergence when $\lVert \theta_{\mathrm{new}} - \theta_{\mathrm{old}} \rVert_2 < \epsilon$

  L2 norm: $\lVert v \rVert_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \ldots + v_{|v|}^2}$
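
A minimal batch gradient descent sketch following the update above; the function name, default learning rate, tolerance, and iteration cap are illustrative choices, not from the slides:

    import numpy as np

    def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
        """Batch GD for linear regression. X has shape (n, d+1) with a leading column of ones."""
        n, d1 = X.shape
        theta = np.zeros(d1)
        for _ in range(max_iters):
            predictions = X @ theta                      # h_theta(x^(i)) for all i, computed once per iteration
            gradient = (X.T @ (predictions - y)) / n     # (1/n) * sum_i (h - y) * x_j, for every j at once
            theta_new = theta - alpha * gradient         # simultaneous update of all theta_j
            if np.linalg.norm(theta_new - theta) < eps:  # ||theta_new - theta_old||_2 < epsilon
                return theta_new
            theta = theta_new
        return theta

Computing all predictions with one matrix–vector product before the update is what makes the update simultaneous across every $\theta_j$.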

Gradient Descent

[Slides 25–33: successive gradient descent iterations on the running example. The left panel shows the current hypothesis, starting from $h(x) = -900 - 0.1x$ (for fixed $\theta$, a function of $x$); the right panel shows the corresponding point on the contour plot of $J(\theta_0, \theta_1)$ (a function of the parameters) moving toward the minimum]

Slides by Andrew Ng

Choosing α

• α too small: slow convergence

• α too large: increasing value for $J(\theta)$
  – May overshoot the minimum
  – May fail to converge
  – May even diverge

To see if gradient descent is working, print out $J(\theta)$ each iteration
• The value should decrease at each iteration
• If it doesn't, adjust α
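
One way to follow this advice in code is to record $J(\theta)$ every iteration and flag any increase, which usually means α is too large. A sketch with hypothetical names:

    import numpy as np

    def gd_with_monitoring(X, y, alpha, iters=500):
        n, d1 = X.shape
        theta = np.zeros(d1)
        history = []
        for t in range(iters):
            residuals = X @ theta - y
            history.append((residuals @ residuals) / (2 * n))   # J(theta) at this iteration
            if t > 0 and history[-1] > history[-2]:
                print(f"J increased at iteration {t}; consider a smaller alpha")
            theta = theta - alpha * (X.T @ residuals) / n
        return theta, history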

Extending Linear Regression to More Complex Models

• The inputs X for linear regression can be:
  – Original quantitative inputs
  – Transformations of quantitative inputs
    • e.g., log, exp, square root, square, etc.
  – Polynomial transformations
    • example: $y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$
  – Basis expansions
  – Dummy coding of categorical inputs
  – Interactions between variables
    • example: $x_3 = x_1 \times x_2$

This allows use of linear regression techniques to fit non-linear datasets.
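
To make the list concrete, here is a sketch of building a transformed design matrix with a log transform, a square, and an interaction feature; the particular columns chosen are illustrative, not from the slides:

    import numpy as np

    def expand_features(X):
        """X has shape (n, 2) with raw features x1, x2 (x1 > 0 assumed for the log transform).
        Returns columns [1, x1, x2, log(x1), x1^2, x1*x2]."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1, x2, np.log(x1), x1 ** 2, x1 * x2])

The model stays linear in the parameters; only the inputs are transformed.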

Linear Basis Function Models

• Generally,

  $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$   where $\phi_j(x)$ is a basis function

• Typically, $\phi_0(x) = 1$ so that $\theta_0$ acts as a bias
• In the simplest case, we use linear basis functions: $\phi_j(x) = x_j$

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Polynomial basis functions: $\phi_j(x) = x^j$
  – These are global; a small change in $x$ affects all basis functions

• Gaussian basis functions: $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$
  – These are local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (width).

Based on slide by Christopher Bishop (PRML)

Linear Basis Function Models

• Sigmoidal basis functions: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
  – These are also local; a small change in $x$ only affects nearby basis functions. $\mu_j$ and $s$ control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
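
A sketch of a Gaussian basis feature map for scalar inputs, assuming the standard form $\phi_j(x) = \exp(-(x - \mu_j)^2 / 2s^2)$ shown above; the centers and width below are arbitrary:

    import numpy as np

    def gaussian_basis(x, centers, s):
        """Map scalar inputs x (shape (n,)) to [1, phi_1(x), ..., phi_m(x)] with Gaussian bumps at the given centers."""
        phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
        return np.column_stack([np.ones(len(x)), phi])

    x = np.linspace(0, 1, 5)
    Phi = gaussian_basis(x, centers=np.linspace(0, 1, 4), s=0.2)   # design matrix of shape (5, 5)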

Example of Fitting a Polynomial Curve with a Linear Model

  $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$
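
A minimal example of fitting such a polynomial by ordinary least squares on a Vandermonde design matrix; the synthetic data and the use of np.linalg.lstsq are my choices for illustration, not the deck's fitting procedure:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 30)
    y = 1.0 + 2.0 * x - 3.0 * x ** 3 + 0.1 * rng.standard_normal(30)   # illustrative noisy cubic

    p = 3
    X = np.vander(x, p + 1, increasing=True)          # columns [1, x, x^2, x^3]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares estimate of theta_0..theta_p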

Linear Basis Function Models

• Basic Linear Model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$

• Generalized Linear Model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$

• Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
  – Unless we use the kernel trick – more on that when we cover support vector machines
  – Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton

Linear Algebra Concepts

• A vector in $\mathbb{R}^d$ is an ordered set of $d$ real numbers
  – e.g., $v = [1, 6, 3, 4]$ is in $\mathbb{R}^4$
  – "$[1, 6, 3, 4]$" is a column vector, as opposed to a row vector $(1 \; 6 \; 3 \; 4)$

• An $m$-by-$n$ matrix is an object with $m$ rows and $n$ columns, where each entry is a real number (an example matrix is shown on the slide)

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Transpose: reflect the vector/matrix across the diagonal:

  $\begin{pmatrix} a \\ b \end{pmatrix}^{\!\top} = (a \;\; b)$,   $\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{\!\top} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}$

  – Note: $(Ax)^\top = x^\top A^\top$ (we'll define multiplication soon...)

• Vector norms:
  – $L_p$ norm of $v = (v_1, \ldots, v_k)$ is $\left( \sum_i |v_i|^p \right)^{1/p}$
  – Common norms: $L_1$, $L_2$
  – $L_\infty = \max_i |v_i|$

• Length of a vector $v$ is $L_2(v)$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector dot product:

  $u \cdot v = (u_1 \;\; u_2) \cdot (v_1 \;\; v_2) = u_1 v_1 + u_2 v_2$

  – Note: the dot product of $u$ with itself = length$(u)^2$ = $\lVert u \rVert_2^2$

• Matrix product:

  $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$, $B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$

  $AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$

Based on slides by Joseph Bradley

Linear Algebra Concepts

• Vector products:
  – Dot product: $u \cdot v = u^\top v = (u_1 \;\; u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$
  – Outer product: $u v^\top = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} (v_1 \;\; v_2) = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$

Based on slides by Joseph Bradley
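
These operations map directly onto numpy; a quick illustration (not part of the slides):

    import numpy as np

    u = np.array([1.0, 2.0])
    v = np.array([3.0, 4.0])
    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    B = np.array([[5.0, 6.0], [7.0, 8.0]])

    print(u @ v)            # dot product: 1*3 + 2*4 = 11
    print(np.outer(u, v))   # outer product u v^T, a 2x2 matrix
    print(A @ B)            # matrix product AB
    print(u @ u, np.linalg.norm(u) ** 2)   # dot of u with itself equals ||u||_2^2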

Vectorization

• Benefits of vectorization
  – More compact equations
  – Faster code (using optimized matrix libraries)

• Consider our model: $h(x) = \sum_{j=0}^{d} \theta_j x_j$

• Let $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}$,  $x^\top = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}$

• Can write the model in vectorized form as $h(x) = \theta^\top x$

Vectorization

• Consider our model for $n$ instances: $h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$

• Let

  $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}$,  $X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}$

• Can write the model in vectorized form as $h_\theta(x) = X\theta$

Vectorization

• For the linear regression cost function:

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^\top x^{(i)} - y^{(i)} \right)^2 = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$

  with $X \in \mathbb{R}^{n \times (d+1)}$, $\theta \in \mathbb{R}^{(d+1) \times 1}$, and

  Let: $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$
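
The vectorized form translates into a couple of lines of numpy; a sketch (function names are mine) that also checks it against the example-by-example sum:

    import numpy as np

    def cost_vectorized(theta, X, y):
        """J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)."""
        r = X @ theta - y
        return (r @ r) / (2 * len(y))

    def cost_loop(theta, X, y):
        """Same quantity computed one example at a time, to confirm the vectorized identity."""
        n = len(y)
        return sum((X[i] @ theta - y[i]) ** 2 for i in range(n)) / (2 * n)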

Closed Form Solution

• Instead of using GD, solve for the optimal $\theta$ analytically
  – Notice that the solution is when $\frac{\partial}{\partial \theta} J(\theta) = 0$

• Derivation:

  $J(\theta) = \frac{1}{2n} (X\theta - y)^\top (X\theta - y)$
  $\propto \theta^\top X^\top X \theta - y^\top X \theta - \theta^\top X^\top y + y^\top y$
  $\propto \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y$

  (the two cross terms combine because $y^\top X \theta$ is a $1 \times 1$ scalar, so it equals its transpose $\theta^\top X^\top y$)

  Take the derivative, set it equal to 0, then solve for $\theta$:

  $\frac{\partial}{\partial \theta} \left( \theta^\top X^\top X \theta - 2\theta^\top X^\top y + y^\top y \right) = 0$
  $(X^\top X)\theta - X^\top y = 0$
  $(X^\top X)\theta = X^\top y$
  $\theta = (X^\top X)^{-1} X^\top y$

Closed Form Solution

• Can obtain $\theta$ by simply plugging $X$ and $y$ into

  $\theta = (X^\top X)^{-1} X^\top y$

  where $y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$ and $X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix}$

• If $X^\top X$ is not invertible (i.e., singular), may need to:
  – Use the pseudo-inverse instead of the inverse
    • In python, numpy.linalg.pinv(a)
  – Remove redundant (not linearly independent) features
  – Remove extra features to ensure that $d \le n$
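
A hedged sketch of the closed-form solve; np.linalg.pinv is the pseudo-inverse the slide mentions, and np.linalg.solve is a common alternative when $X^\top X$ is well conditioned:

    import numpy as np

    def fit_closed_form(X, y):
        """theta = (X^T X)^{-1} X^T y, computed via the pseudo-inverse to tolerate a singular X^T X."""
        return np.linalg.pinv(X.T @ X) @ (X.T @ y)

    # If X^T X is invertible, solving the normal equations directly is equivalent:
    # theta = np.linalg.solve(X.T @ X, X.T @ y)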

Gradient Descent vs. Closed Form

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n or d is large
  – Forming $X^\top X$ costs roughly $O(nd^2)$, and inverting it roughly $O(d^3)$

Improving Learning: Feature Scaling

• Idea: Ensure that features have similar scales

• Makes gradient descent converge much faster

[Figure: contours of $J(\theta)$ over $\theta_1, \theta_2$ — elongated ellipses before feature scaling, roughly circular contours after feature scaling]

Feature Standardization

• Rescales features to have zero mean and unit variance
  – Let $\mu_j$ be the mean of feature $j$: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
  – Replace each value $x_j^{(i)}$ with $\frac{x_j^{(i)} - \mu_j}{s_j}$   for $j = 1 \ldots d$ (not $x_0$!)
    • $s_j$ is the standard deviation of feature $j$
    • Could also use the range of feature $j$ ($\max_j - \min_j$) for $s_j$

• Must apply the same transformation to instances for both training and prediction

• Outliers can cause problems
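
A sketch of standardization that keeps the training statistics so the identical transformation can be applied at prediction time, as required above (function names are mine; X holds the raw features only, with the ones column added afterwards):

    import numpy as np

    def fit_standardizer(X_train):
        """Compute per-feature mean and standard deviation on the training set."""
        mu = X_train.mean(axis=0)
        s = X_train.std(axis=0)
        return mu, s

    def standardize(X, mu, s):
        """Replace each value with (x_j - mu_j) / s_j, using the *training* statistics."""
        return (X - mu) / s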

Quality of Fit

Overfitting:
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples

[Figure: three fits of Productivity vs. Time Spent — underfitting (high bias), correct fit, overfitting (high variance)]

Based on example by Andrew Ng

Regularization

• A method for automatically controlling the complexity of the learned hypothesis

• Idea: penalize for large values of $\theta_j$
  – Can incorporate into the cost function
  – Works well when we have a lot of features, each of which contributes a bit to predicting the label

• Can also address overfitting by eliminating features (either manually or via model selection)

Regularization

• Linear regression objective function

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

  (first term: model fit to data; second term: regularization)

  – $\lambda$ is the regularization parameter ($\lambda \ge 0$)
  – No regularization on $\theta_0$!

Understanding Regularization

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• Note that $\sum_{j=1}^{d} \theta_j^2 = \lVert \theta_{1:d} \rVert_2^2$
  – This is the magnitude of the feature coefficient vector!

• We can also think of this as: $\sum_{j=1}^{d} (\theta_j - 0)^2 = \lVert \theta_{1:d} - \vec{0} \rVert_2^2$

• $L_2$ regularization pulls coefficients toward 0

Understanding Regularization

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• What happens as $\lambda \to \infty$?

[Figure: Productivity vs. Time Spent on Work with the degree-4 fit $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$]

Understanding Regularization

• What happens as $\lambda \to \infty$? The penalty drives the regularized coefficients $\theta_1, \theta_2, \theta_3, \theta_4$ toward 0 in

  $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$,

  so the hypothesis approaches the constant $\theta_0$: a flat line that underfits the data.

[Figure: Productivity vs. Time Spent on Work with the resulting flat fit]

Regularized Linear Regression

• Cost Function

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

• Fit by solving $\min_\theta J(\theta)$

• Gradient update:

  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$   (this is $\alpha \frac{\partial}{\partial \theta_0} J(\theta)$)

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$   (the $- \alpha \lambda \theta_j$ term comes from the regularization)

Regularized Linear Regression

  $J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$

  $\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$

  $\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$

• We can rewrite the gradient step as:

  $\theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$
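
A sketch of this regularized step: each $\theta_j$ shrinks by the factor $(1 - \alpha\lambda)$ before the usual gradient move, while $\theta_0$ is left unpenalized. Names and argument order are illustrative:

    import numpy as np

    def ridge_gd_step(theta, X, y, alpha, lam):
        """One regularized gradient step; X is n x (d+1) with a leading ones column, theta_0 is not penalized."""
        n = len(y)
        grad = (X.T @ (X @ theta - y)) / n            # (1/n) * sum_i (h - y) * x_j for every j
        shrink = np.full_like(theta, 1.0 - alpha * lam)
        shrink[0] = 1.0                               # no regularization on theta_0
        return shrink * theta - alpha * grad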

Regularized Linear Regression

• To incorporate regularization into the closed form solution:

  $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^\top y$

Regularized Linear Regression

• To incorporate regularization into the closed form solution:

  $\theta = \left( X^\top X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^\top y$

• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$

• Can prove that for $\lambda > 0$, the inverse exists in the equation above
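
A sketch of this regularized closed form in numpy; the identity-like matrix has a 0 in its (0, 0) entry so that $\theta_0$ is not penalized (function name is mine):

    import numpy as np

    def fit_ridge_closed_form(X, y, lam):
        """theta = (X^T X + lam * E)^{-1} X^T y, where E is the identity with E[0, 0] = 0."""
        d1 = X.shape[1]
        E = np.eye(d1)
        E[0, 0] = 0.0
        return np.linalg.solve(X.T @ X + lam * E, X.T @ y)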
