TRANSCRIPT

Page 1: Logistic Regression Demystified (Hopefully)

Logistic Regression Demystified (Hopefully)

Gabriele Tolomei, Yahoo Labs, London, UK

19th November 2015

Page 2: Logistic Regression Demystified (Hopefully)

Introduction

•  3 components need to be defined:
– Model: describes the set of hypotheses (hypothesis space) that can be represented;
– Error Measure (Cost Function): measures the price that must be paid if a misclassification error occurs;
– Learning Algorithm: is responsible for picking the best hypothesis (according to the error measure) by searching through the hypothesis space.

Page 3: Logistic Regression Demystified (Hopefully)

The Model

Page 4: Logistic Regression Demystified (Hopefully)

Linear Signal

•  Logistic Regression is an example of a linear model
•  Given a (d+1)-dimensional input x, xT = (x0, x1, …, xd), with x0 = 1
•  We define the family of real-valued functions F having d+1 parameters θ, θT = (θ0, θ1, …, θd)
•  Each function fθ in F outputs a real scalar obtained as a linear combination of the input x with the parameters θ

fθ(x) means “the application of f parametrized by θ to x” and it is referred to as the signal
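The signal formula itself is an image in the original slides; the following minimal NumPy sketch (function and variable names are my own, not from the deck) illustrates the linear combination fθ(x) = θTx with the convention x0 = 1:

```python
import numpy as np

def signal(theta, x):
    """Linear signal f_theta(x) = theta^T x, with the convention x[0] = 1 (bias term)."""
    return float(np.dot(theta, x))

# d = 2 features plus the constant x0 = 1
theta = np.array([0.5, -1.0, 2.0])   # (theta0, theta1, theta2)
x = np.array([1.0, 3.0, 0.25])       # (x0 = 1, x1, x2)
print(signal(theta, x))              # 0.5*1 - 1.0*3.0 + 2.0*0.25 = -2.0
```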

Page 5: Logistic Regression Demystified (Hopefully)

Hypothesis Space

•  The signal alone is not enough to define the hypothesis space H

•  Usually the signal is passed through a “filter”, i.e. another real-valued function g

•  hθ(x) = g(fθ(x)) defines the hypothesis space:

The set of possible hypotheses H changes depending on the parametric model (fθ) and on the thresholding function (g)

Page 6: Logistic Regression Demystified (Hopefully)

[Figure: the signal fθ(x) passed through three thresholding functions g: g = sign (hard threshold at -1/+1), g = identity (no thresholding), and g = logistic (soft threshold between 0 and 1)]

Page 7: Logistic Regression Demystified (Hopefully)

The Logistic Function

•  Domain is R, codomain is [0, 1]
•  Also known as the sigmoid function due to its “S” shape, or soft threshold (compared to the hard threshold imposed by sign)
•  When z = θTx we are applying a non-linear transformation to our linear signal
•  The output can be genuinely interpreted as a probability value
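As a quick illustration (not part of the original deck), a NumPy sketch of the logistic function, checking its soft-threshold behaviour and the symmetry l(-z) = 1 - l(z) used later:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(logistic(z))                                   # soft threshold: values in (0, 1), logistic(0) = 0.5
print(np.allclose(logistic(-z), 1.0 - logistic(z)))  # symmetry l(-z) = 1 - l(z)
```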

Page 8: Logistic Regression Demystified (Hopefully)

Probabilistic Interpretation

•  Describing the set of hypotheses using the logistic function is not enough to state that the output can be interpreted as a probability
–  All we know is that the logistic function always produces a real value between 0 and 1
–  Other functions may be defined having the same property
•  e.g., 1/π arctan(x) + 1/2

•  The key points here are:
–  the output of the logistic function can be interpreted as a probability even during learning
–  the logistic function is mathematically convenient!

Page 9: Logistic Regression Demystified (Hopefully)

Probabilistic Interpretation: Odds Ratio

•  Let p (resp., q = 1-p) be the probability of success (resp., failure) of an event
•  odds(success) = p/q = p/(1-p)
•  odds(failure) = q/p = 1/(p/q) = 1/odds(success)
•  logit(p) = ln(odds(success)) = ln(p/q) = ln(p/(1-p))
•  Logistic Regression is in fact an ordinary linear regression where the logit is the response variable!

•  The coefficients of logistic regression are expressed in terms of the natural logarithm of the odds
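A short derivation, not shown on the slide but consistent with the bullets above, makes the link explicit: treating the logit as the linear response θTx and inverting it recovers the logistic function l:

```latex
\[
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \theta^{T}x
\quad\Longrightarrow\quad
p = \frac{1}{1 + e^{-\theta^{T}x}} = l(\theta^{T}x)
\]
```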

Page 10: Logistic Regression Demystified (Hopefully)

Probabilistic Interpretation: Odds Ratio

Page 11: Logistic Regression Demystified (Hopefully)

Probabilistically-generated Data

As for any other supervised learning problem, we can only deal with a finite set D of m labelled examples which we can try to learn from, where each yi is a binary variable taking on two values {-1, +1}.

That means we do not have access to the individual probability associated with each training sample! Still, we can assume that the data we observe from D, i.e. positive (+1) and negative (-1) samples, are actually generated by an underlying and unknown probability function (noisy target) which we want to estimate.

Page 12: Logistic Regression Demystified (Hopefully)

Estimating the Noisy Target

More formally, given the generic training example (x, y), we claim there exists a conditional probability P(y|x), which is defined as:

where φ is the noisy target function
•  Deterministic function: given x as input, it always outputs either y = +1 or y = -1 (mutually exclusive)
•  Noisy target function: given x as input, it outputs both y = +1 and y = -1, each with an associated “degree of certainty”

Goal: If we assume φ: R^(d+1) → [0, 1] is the underlying and unknown noisy target which generates our examples, our aim is to find an estimate φ* which best approximates φ
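The definition of P(y|x) referenced above is an image in the original slides; a standard reconstruction consistent with the surrounding text would be:

```latex
\[
P(y \mid x) =
\begin{cases}
\varphi(x) & \text{if } y = +1,\\
1 - \varphi(x) & \text{if } y = -1.
\end{cases}
\]
```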

Page 13: Logistic Regression Demystified (Hopefully)

Hypothesized Noisy Target

We claim that the best estimate φ* of φ is h*θ(x), which in turn is picked from the set of hypotheses defined by the logistic function

But how do we select h*θ(x)? 2 elements are needed:
-  Training set D
-  Error Measure (Cost Function) to minimize

Page 14: Logistic Regression Demystified (Hopefully)

The Error Measure

Page 15: Logistic Regression Demystified (Hopefully)

The Best Hypothesis

If the hypothesis space H is made of a family of parametric models, h*θ(x) can be picked as:

That is, we want to maximise the probability of the chosen hypothesis given the data D we observed

Page 16: Logistic Regression Demystified (Hopefully)

Flipping the Coin: (Data) Likelihood

We measure the error we are making by assuming that h*θ(x) approximates the true noisy target φ

How likely is it that the observed data D have been generated by our selected hypothesis h*θ(x)?

Find the hypothesis which maximises the probability of the observed data D given a particular hypothesis
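The two selection rules on these slides are images; a plausible reconstruction of the “flip” from the posterior of the hypothesis to the likelihood of the data is:

```latex
\[
h^{*}_{\theta} = \arg\max_{h_{\theta} \in H} P(h_{\theta} \mid D)
\qquad\leadsto\qquad
h^{*}_{\theta} = \arg\max_{h_{\theta} \in H} P(D \mid h_{\theta})
\]
```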

Page 17: Logistic Regression Demystified (Hopefully)

The Likelihood Function

Given a generic training example (x, y), and assuming it has been generated by a hypothesis hθ(x), the likelihood function is:

where φ has been replaced with our hypothesis. If we assume the hypothesis is the logistic function,

and by noticing that the logistic function is symmetric, i.e. l(-z) = 1 - l(z), the likelihood for a single example is:
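The single-example likelihood is shown as an image; reconstructing it from the text (replace φ with hθ, take hθ to be the logistic function l, and use the symmetry above) gives:

```latex
\[
P(y \mid x) =
\begin{cases}
h_{\theta}(x) & \text{if } y = +1,\\
1 - h_{\theta}(x) & \text{if } y = -1,
\end{cases}
\qquad
h_{\theta}(x) = l(\theta^{T}x)
\;\Longrightarrow\;
P(y \mid x) = l\!\left(y\,\theta^{T}x\right)
\]
```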

Page 18: Logistic Regression Demystified (Hopefully)

The Likelihood Function

Having access to a full set of m i.i.d. training examples D,

the overall likelihood function is computed as:
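The overall likelihood formula is an image in the slides; under the i.i.d. assumption it factorizes as:

```latex
\[
L(\theta) = \prod_{i=1}^{m} P(y_i \mid x_i) = \prod_{i=1}^{m} l\!\left(y_i\,\theta^{T}x_i\right)
\]
```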

Page 19: Logistic Regression Demystified (Hopefully)

Why Does Likelihood Make Sense?

How does the likelihood l(yi θTxi) change w.r.t. the sign of yi and θTxi?

If the label is concordant with the signal (either positively or negatively), then l(yi θTxi) approaches 1: our prediction agrees with the true label.

           θTxi > 0    θTxi < 0
 yi > 0      ≈ 1         ≈ 0
 yi < 0      ≈ 0         ≈ 1

Conversely, if the label is discordant with the signal, then l(yi θTxi) approaches 0: our prediction disagrees with the true label.

Page 20: Logistic Regression Demystified (Hopefully)

Maximum Likelihood Estimate

Find the vector of parameters θ such that the likelihood function is maximum
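In symbols (a reconstruction; the slide shows this as an image):

```latex
\[
\theta^{*} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \prod_{i=1}^{m} l\!\left(y_i\,\theta^{T}x_i\right)
\]
```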

Page 21: Logistic Regression Demystified (Hopefully)

From MLE to In-Sample Error

Generally speaking, given a hypothesis hθ and a training set D of m labelled samples, we are interested in measuring the “in-sample” (i.e. training) error

where e() measures how “far” the chosen hypothesis is from the true observed value

How can we “transform” MLE into an expression similar to the “in-sample” error above?
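The generic in-sample error referenced above is an image; a standard reconstruction is:

```latex
\[
E_{\text{in}}(\theta) = \frac{1}{m}\sum_{i=1}^{m} e\!\left(h_{\theta}(x_i),\, y_i\right)
\]
```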

Page 22: Logistic Regression Demystified (Hopefully)

From MLE to In-Sample Error

Page 23: Logistic Regression Demystified (Hopefully)

From MLE to In-Sample Error

By noticing that the logistic function can be rewritten as follows:

we can finally write the “in-sample” error to be minimised:

Cross-Entropy Error
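The two formulas on this slide are images; a reconstruction consistent with the earlier definitions (minimising the scaled negative log-likelihood) would be:

```latex
\[
\frac{1}{l(z)} = 1 + e^{-z}
\qquad\Longrightarrow\qquad
E_{\text{in}}(\theta) = -\frac{1}{m}\ln L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\ln\!\left(1 + e^{-y_i\,\theta^{T}x_i}\right)
\]
```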

Page 24: Logistic Regression Demystified (Hopefully)

The Learning Algorithm

Page 25: Logistic Regression Demystified (Hopefully)

Picking the Best Hypothesis

So far we have defined:
-  The model
-  The error measure (cross-entropy)

To actually select the best hypothesis, we have to pick the vector of parameters so that the error measure is minimised

The usual way of achieving this is to compute the gradient with respect to θ (i.e. the vector of partial derivatives), set it to 0, and solve for θ

Page 26: Logistic Regression Demystified (Hopefully)

Mean Squared Error vs. Cross-Entropy

In the case of linear regression we have a similar expression for the error measure, i.e. Mean Squared Error (MSE)

Minimising MSE through Ordinary Least Squares (OLS) leads to a closed-form solution, often referred to as the OLS estimator for θ

The problem is that using Cross-Entropy as the error measure we cannot find a closed-form solution to the minimization problem

Iterative Solution

Page 27: Logistic Regression Demystified (Hopefully)

(Batch) Gradient Descent

A general iterative method for any nonlinear optimization

Under specific assumptions on the function to be minimised and on the learning rate parameter at each iteration, the method guarantees convergence to a local minimum

If the function is convex, like the cross-entropy error for logistic regression, then the local minimum is also the global minimum

Page 28: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Idea

1.  At t = 0 initialize the (guessed) vector of parameters θ to θ(0)
2.  Repeat until convergence:
a.  Update the current vector of parameters θ(t) by taking a “step” of size η along the “steepest” slope, given by the unit vector v: θ(t+1) = θ(t) + ηv
b.  Return to 2.

Question: How do we compute the direction v? Depending on how we solve this, we may get different solutions (Gradient Descent, Conjugate Gradient, etc.)

Page 29: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Direction v

We already said, intuitively, that the direction v should be that of the “steepest” slope

Concretely, this means moving along the direction which most reduces the in-sample error function

We want ΔEin to be as negative as possible, which means that we are actually reducing the error w.r.t. the previous iteration t-1

Page 30: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Direction v

Let’s first assume we are in the univariate case, i.e. θ = θ in R

Page 31: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Direction v

First-order Taylor approximation; second-order error term

To summarize and generalize to the multivariate case of θ:

The Greek letter nabla (∇) indicates the gradient
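The Taylor expansion itself is an image in the slides; a reconstruction consistent with the fixed-step update θ(t+1) = θ(t) + ηv (with v a unit vector) is:

```latex
\[
\Delta E_{\text{in}} = E_{\text{in}}\!\left(\theta^{(t)} + \eta v\right) - E_{\text{in}}\!\left(\theta^{(t)}\right)
\approx \eta\,\nabla E_{\text{in}}\!\left(\theta^{(t)}\right)^{T} v
\;\ge\; -\eta\,\bigl\lVert \nabla E_{\text{in}}\!\left(\theta^{(t)}\right) \bigr\rVert
\]
```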

Page 32: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Direction v

The unit vector v only contributes to the direction, and not to the magnitude, of the iterative step. Therefore:
-  the maximum (i.e. most positive) step happens when the error gradient and the direction vector have the same direction
-  the minimum (i.e. most negative) step happens when the two vectors have opposite directions

Page 33: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Direction v

At each iteration t, we want the unit vector which makes exactly the most negative step

Therefore:
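The resulting expression for v is an image in the slides; from the bound above, the most negative step is obtained with:

```latex
\[
v = -\,\frac{\nabla E_{\text{in}}\!\left(\theta^{(t)}\right)}{\bigl\lVert \nabla E_{\text{in}}\!\left(\theta^{(t)}\right) \bigr\rVert}
\]
```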

Page 34: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Step η

How does the step magnitude η affect the convergence?

[Figure: three panels comparing convergence when η is too small, η is too large, and η is variable]

Rule of thumb: dynamically change η proportionally to the gradient!

Page 35: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Step η

Remember that at each iteration the update strategy is:

where:

At each iteration t, the step η is fixed

Page 36: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Step ηt

Instead of having a fixed η at each iteration, use a variable ηt as a function of η

If we take
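The slide’s choice of ηt is an image; the standard choice consistent with the update rule on the next slide (a step proportional to the gradient magnitude) would be:

```latex
\[
\eta_t = \eta\,\bigl\lVert \nabla E_{\text{in}}\!\left(\theta^{(t)}\right) \bigr\rVert
\quad\Longrightarrow\quad
\theta^{(t+1)} = \theta^{(t)} + \eta_t\, v = \theta^{(t)} - \eta\,\nabla E_{\text{in}}\!\left(\theta^{(t)}\right)
\]
```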

Page 37: Logistic Regression Demystified (Hopefully)

Gradient Descent: The Algorithm

1.  At t = 0 initialize the (guessed) vector of parameters θ to θ(0)
2.  For t = 0, 1, 2, … until stop:
a.  Compute the gradient of the cross-entropy error ∇Ein(θ(t)) (i.e. the vector of partial derivatives)
b.  Update the vector of parameters: θ(t+1) = θ(t) - η∇Ein(θ(t))
c.  Return to 2.
3.  Return the final vector of parameters θ(∞)
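A minimal NumPy sketch of the algorithm above, using the cross-entropy error and its gradient ∇Ein(θ) = -(1/m) Σi yi xi l(-yi θTxi); all function names and the toy data are my own, not from the deck:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # E_in(theta) = (1/m) * sum_i ln(1 + exp(-y_i * theta^T x_i))
    return np.mean(np.log1p(np.exp(-y * (X @ theta))))

def gradient(theta, X, y):
    # grad E_in(theta) = -(1/m) * sum_i y_i * x_i * logistic(-y_i * theta^T x_i)
    return -(X * (y * logistic(-y * (X @ theta)))[:, None]).mean(axis=0)

def batch_gradient_descent(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    theta = np.zeros(X.shape[1])          # step 1: initial guess theta(0)
    for _ in range(max_iters):            # step 2: iterate until stop
        g = gradient(theta, X, y)         # step 2a: gradient of the cross-entropy error
        theta = theta - eta * g           # step 2b: theta(t+1) = theta(t) - eta * grad
        if np.linalg.norm(g) < tol:       # stop when the gradient is (nearly) zero
            break
    return theta                          # step 3: final parameter vector

# Tiny synthetic example: m = 4 points with the bias column x0 = 1, labels in {-1, +1}
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([+1, +1, -1, -1])
theta_star = batch_gradient_descent(X, y)
print(theta_star, cross_entropy(theta_star, X, y))
```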

Page 38: Logistic Regression Demystified (Hopefully)

Discussion: Initialization

•  How do we choose the initial value of the parameters θ(0)?
•  If the function is convex we are guaranteed to reach the global minimum no matter what the initial value of θ(0) is
•  In general we may get to the local minimum nearest to θ(0)
–  Problem: we may miss “better” local minima (or even the global one, if it exists)
–  Solution (heuristic): repeating GD 100÷1,000 times, each time with a different θ(0), may give a sense of what the global minimum eventually is (no guarantees)

Page 39: Logistic Regression Demystified (Hopefully)

Discussion: Termination

•  When does the algorithm stop?
•  Intuitively, when θ(t+1) = θ(t) ⇒ -η∇Ein(θ(t)) = 0 ⇒ ∇Ein(θ(t)) = 0
•  If the function is convex we are guaranteed to reach the global minimum when ∇Ein(θ(t)) = 0
–  i.e. there exists a unique local minimum which also happens to be the global minimum

•  In general we don’t know if eventually ∇Ein(θ(t)) = 0, therefore we can use several termination criteria, e.g.:
–  stop whenever the difference between two iterations is “small enough” → may converge “prematurely”
–  stop when the error equals ε → may not converge if the target error is not achievable
–  stop after T iterations
–  combinations of the above work in practice…

Page 40: Logistic Regression Demystified (Hopefully)

Advanced Topics

•  Gradient Descent using a second-order approximation
–  better local approximation than first-order, but each step requires computing the second derivative (Hessian matrix)
–  Conjugate Gradient makes the second-order approximation “faster” as it doesn’t require computing the full Hessian matrix explicitly
•  Stochastic Gradient Descent (SGD)
–  At each step only one sample is considered for computing the gradient of the error, instead of the full training set
•  L1 and L2 regularization to penalize extreme parameter values and deal with overfitting
–  include the L1 or L2 norm of the vector of parameters θ in the cross-entropy error function to be minimised during learning (see the sketch after this list)
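A minimal sketch combining the last two bullets, i.e. SGD on an L2-regularized cross-entropy error; the function names, hyperparameters, and toy data (reused from the batch sketch above) are my own assumptions, not from the deck:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_l2(X, y, eta=0.05, lam=0.01, epochs=50, seed=0):
    """Stochastic gradient descent on the L2-regularized cross-entropy error."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):      # one randomly-chosen sample per step
            # per-sample gradient of ln(1 + exp(-y_i theta^T x_i)) + (lam/2) * ||theta||^2
            g = -y[i] * X[i] * logistic(-y[i] * (theta @ X[i])) + lam * theta
            theta -= eta * g
    return theta

# Reusing the tiny synthetic data from the batch example above
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([+1, +1, -1, -1])
print(sgd_logistic_l2(X, y))
```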