logistic regression - svivek · logistic regression is the discriminative version. this lecture...

MachineLearning

LogisticRegression

1

Wherearewe?

Wehaveseenthefollowingideas– Linearmodels– Learningaslossminimization– Bayesianlearningcriteria(MAPandMLEestimation)– TheNaïveBayesclassifier

2

Thislecture

• Logisticregression

• ConnectiontoNaïveBayes

• Trainingalogisticregressionclassifier

• Backtolossminimization

3

Thislecture





4

LogisticRegression:Setup

• Thesetting– Binaryclassification– Inputs:Featurevectorsx 2 <d

– Labels:y 2 {-1, +1}

• Trainingdata– S={(xi, yi)},mexamples

5

Classification,but…

Theoutputy isdiscretevalued(-1 or 1)

Insteadofpredictingtheoutput,letustrytopredictP(y=1 |x)

Expandhypothesisspacetofunctionswhoseoutputis[0-1]• Originalproblem:<d ! {-1, 1}• Modifiedproblem:<d ! [0-1]• Effectivelymaketheproblemaregressionproblem

Manyhypothesisspacespossible

6

Classification,but…

Theoutputy isdiscretevalued(-1 or 1)

Insteadofpredictingtheoutput,letustrytopredictP(y=1 |x)

Expandhypothesisspacetofunctionswhoseoutputis[0-1]• Originalproblem:<d ! {-1, 1}• Modifiedproblem:<d ! [0-1]• Effectivelymaketheproblemaregressionproblem

Manyhypothesisspacespossible

7

TheSigmoidfunction

Thehypothesisspaceforlogisticregression:Allfunctionsoftheform

Thatis,alinearfunction,composedwithasigmoidfunction(thelogisticfunction) ¾

Whatisthedomainandtherangeofthesigmoidfunction?

Thisisareasonablechoice.Wewillseewhylater

8

TheSigmoidfunction




9

TheSigmoidfunction



Whatisthedomainandtherangeofthesigmoidfunction?


10

TheSigmoidfunction

¾(z)

z

11

TheSigmoidfunction

12

Whatisitsderivativewithrespecttoz?

TheSigmoidfunction

13

Whatisitsderivativewithrespecttoz?

Predictingprobabilities

Accordingtothelogisticregressionmodel,wehave

14



15



16



Orequivalently

17



Orequivalently

18

Notethatwearedirectlymodeling𝑃(𝑦|𝑥) ratherthan𝑃(𝑥|𝑦)and𝑃(𝑦)

Predictingalabelwithlogisticregression

• ComputeP(y=1|x;w)

• Ifthisisgreaterthanhalf,predict1elsepredict-1– WhatdoesthiscorrespondtointermsofwTx?

19

Predictingalabelwithlogisticregression

• ComputeP(y=1|x;w)

• Ifthisisgreaterthanhalf,predict1elsepredict-1– WhatdoesthiscorrespondtointermsofwTx?

– Prediction=sgn(wTx)

20

Thislecture





21

NaïveBayesandLogisticregression

RememberthatthenaïveBayesdecisionisalinearfunction

Here,theP’srepresenttheNaïveBayesposteriordistribution,andwcanbeusedtocalculatethepriorsandthelikelihoods.

Thatis,𝑃(𝑦 = 1|𝐰, 𝐱)iscomputedusing𝑃(𝐱|𝑦 = 1,𝐰)and𝑃(𝑦 = 1|𝐰)

22

log𝑃(𝑦 = −1|𝐱,𝐰)𝑃(𝑦 = +1|𝐱,𝐰) = 𝐰2𝐱



Butwealsoknowthat𝑃 𝑦 = +1 𝐱,𝐰 = 1 − 𝑃(𝑦 = −1|𝐱,𝐰)

23

log𝑃(𝑦 = −1|𝐱,𝐰)𝑃(𝑦 = +1|𝐱,𝐰) = 𝐰2𝐱




Substitutingintheaboveexpression,weget

24

log𝑃(𝑦 = −1|𝐱,𝐰)𝑃(𝑦 = +1|𝐱,𝐰) = 𝐰2𝐱

𝑃 𝑦 = +1 𝐰, 𝐱 = 𝜎 𝐰2𝐱 =1

1 + exp(−𝐰2𝐱)




Substitutingintheaboveexpression,weget

25

log𝑃(𝑦 = −1|𝐱,𝐰)𝑃(𝑦 = +1|𝐱,𝐰) = 𝐰2𝐱

𝑃 𝑦 = +1 𝐰, 𝐱 = 𝜎 𝐰2𝐱 =1

1 + exp(−𝐰2𝐱)

Thatis,bothnaïveBayesandlogisticregressiontrytocomputethesameposteriordistributionovertheoutputs

NaïveBayesisagenerativemodel.

LogisticRegressionisthediscriminativeversion.

Thislecture



• Trainingalogisticregressionclassifier– First:Maximumlikelihoodestimation– Then:Addingpriorsà MaximumaPosterioriestimation


26

Maximumlikelihoodestimation

Let’sgetbacktotheproblemoflearning

• Trainingdata– S={(xi, yi)},mexamples

• Whatwewant– Findaw suchthatP(S|w)ismaximized– Weknowthatourexamplesaredrawnindependentlyandareidenticallydistributed(i.i.d)

– Howdoweproceed?

27


28

Theusualtrick:Convertproductstosumsbytakinglog

Recallthatthisworksonlybecauselogisanincreasingfunctionandthemaximizer willnotchange

argmax𝐰

𝑃 𝑆 𝐰 = argmax𝐰

;𝑃 𝑦< 𝐱<,𝐰)=

<>?


29

Equivalenttosolving

argmax𝐰


;𝑃 𝑦< 𝐱<,𝐰)=

<>?

max𝐰

@log𝑃 𝑦< 𝐱<, 𝐰)=

<


30

But(bydefinition)weknowthat

argmax𝐰


;𝑃 𝑦< 𝐱<,𝐰)=

<>?

max𝐰


<

𝑃 𝑦 𝐰, 𝐱 = 𝜎 𝑦<𝐰2𝐱< =1

1 + exp(−𝑦<𝐰2𝐱<)


31

argmax𝐰


;𝑃 𝑦< 𝐱<,𝐰)=

<>?

max𝐰


<

𝑃 𝑦 𝐰, 𝐱 =1

1 + exp(−yB𝐰2𝐱<)

Equivalenttosolving

max𝐰

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<


32

argmax𝐰


;𝑃 𝑦< 𝐱<,𝐰)=

<>?

max𝐰


<

𝑃 𝑦 𝐰, 𝐱 =1


Equivalenttosolving

Thegoal:Maximumlikelihoodtrainingofadiscriminativeprobabilisticclassifierunderthelogisticmodelfortheposteriordistribution.

max𝐰

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<


33

argmax𝐰


;𝑃 𝑦< 𝐱<,𝐰)=

<>?

max𝐰


<

𝑃 𝑦 𝐰, 𝐱 =1


Equivalenttosolving

max𝐰

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<

Equivalentto:Trainingalinearclassifierbyminimizingthelogisticloss.

Thegoal:Maximumlikelihoodtrainingofadiscriminativeprobabilisticclassifierunderthelogisticmodelfortheposteriordistribution.

Maximumaposterioriestimation

Wecouldalsoaddapriorontheweights

Supposeeachweightintheweightvectorisdrawnindependentlyfromthenormaldistributionwithzeromeanandstandarddeviation𝜎

𝑝 𝐰 =;𝑝(𝑤<)E

F>?

=;1

𝜎 2𝜋� exp−𝑤<J

𝜎J

E

F>?

34

MAPestimationforlogisticregression

35


F>?

=;1


𝜎J

E

F>?

Letusworkthroughthisprocedureagaintoseewhatchanges


36


F>?

=;1


𝜎J

E

F>?

Letusworkthroughthisprocedureagaintoseewhatchanges

WhatisthegoalofMAPestimation?(Inmaximumlikelihood,wemaximizedthelikelihoodofthedata)


37


F>?

=;1


𝜎J

E

F>?

WhatisthegoalofMAPestimation?(Inmaximumlikelihood,wemaximizedthelikelihoodofthedata)

Tomaximizetheposteriorprobabilityofthemodelgiventhedata(i.e.tofindthemostprobablemodel,giventhedata)

𝑃 𝐰 𝑆 ∝ 𝑃 𝑆 𝐰 𝑃(𝐰)


38

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰

𝑃(𝐰|𝑆) = argmax𝐰

𝑃 𝑆 𝐰 𝑃(𝐰)


39

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰


Takelogtosimplify

max𝐰

log 𝑃 𝑆 𝐰 + log𝑃(𝐰)


40

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰


Takelogtosimplify

max𝐰


Wehavealreadyexpandedoutthefirstterm.

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<


41

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰


Takelogtosimplify

max𝐰


@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<

+@−𝑤<J

𝜎J

E

F>?

+ 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡𝑠

Expandthelogprior


42

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰


Takelogtosimplify

max𝐰


max𝐰

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<

+@−𝑤<J

𝜎J

E

F>?

+ 𝑐𝑜𝑛𝑠𝑡𝑎𝑛𝑡𝑠


43

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰


Takelogtosimplify

max𝐰


max𝐰

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<

−1𝜎J 𝐰

2𝐰


44

Learningbysolving


F>?

=;1


𝜎J

E

F>?

argmax𝐰


Takelogtosimplify

max𝐰


max𝐰

@−log(1 + exp(−𝑦<𝐰2𝐱<)=

<

−1𝜎J 𝐰

2𝐰

Maximizinganegativefunctionisthesameasminimizingthefunction

Learningalogisticregressionclassifier

Learningalogisticregressionclassifierisequivalenttosolving

45

min𝐰@log(1 + exp(−𝑦<𝐰2𝐱<)=

<

+1𝜎J 𝐰

2𝐰



46

Wherehaveweseenthisbefore?


<

+1𝜎J 𝐰

2𝐰



47

Wherehaveweseenthisbefore?

Thefirstquestioninthehomework:Writedownthestochasticgradientdescentalgorithmforthis?

Historically,othertrainingalgorithmsexist.Inparticular,youmightrunintoLBFGS


<

+1𝜎J 𝐰

2𝐰

Logisticregressionis…

• Aclassifierthatpredictstheprobabilitythatthelabelis+1foraparticularinput

• Thediscriminativecounter-partofthenaïveBayesclassifier

• AdiscriminativeclassifierthatcanbetrainedviaMAPorMLEestimation

• Adiscriminativeclassifierthatminimizesthelogisticlossoverthetrainingset

48

Thislecture





49

Learningaslossminimization• Thesetup

– Examplesx drawnfromafixed,unknowndistributionD– Hiddenoracleclassifierf labelsexamples– Wewishtofindahypothesish thatmimicsf

• Theidealsituation– DefineafunctionL thatpenalizesbadhypotheses– Learning:Pickafunctionh2 Htominimizeexpectedloss

• Instead,minimizeempiricallossonthetrainingset

50

ButdistributionDisunknown

Empiricallossminimization

Learning=minimizeempiricallossonthetrainingset

51

Isthereaproblemhere?

Empiricallossminimization

Learning=minimizeempiricallossonthetrainingset

Weneedsomethingthatbiasesthelearnertowardssimplerhypotheses• Achievedusingaregularizer,whichpenalizescomplex

hypotheses

52

Isthereaproblemhere? Overfitting!

Regularizedlossminimization

• Learning:

• Withlinearclassifiers:

• Whatisalossfunction?– Lossfunctionsshouldpenalizemistakes– Weareminimizingaveragelossoverthetrainingdata

• Whatistheideallossfunctionforclassification?

53

(usingl2regularization)

The0-1loss

Penalizeclassificationmistakesbetweentruelabelyandpredictiony’

• Forlinearclassifiers,thepredictiony’=sgn(wTx)– MistakeifywTx· 0

Minimizing0-1lossisintractable.Needsurrogates

54

Thelossfunctionzoo

Manylossfunctionsexist– Perceptronloss

– Hingeloss(SVM)

– Exponentialloss(AdaBoost)

– Logisticloss(logisticregression)

55

Thelossfunctionzoo

56

Thelossfunctionzoo

57

Zero-one

Thelossfunctionzoo

58

Hinge:SVM

Zero-one

Thelossfunctionzoo

59

Perceptron

Hinge:SVM

Zero-one

Thelossfunctionzoo

60

Perceptron

Hinge:SVM

Exponential:AdaBoost

Zero-one

Thelossfunctionzoo

61

Perceptron

Hinge:SVM

Logisticregression

Exponential:AdaBoost

Zero-one

Thelossfunctionzoo

62

Zoomedout

Thelossfunctionzoo

63

Zoomedoutevenmore

logistic regression - svivek · logistic regression is the discriminative version. this lecture...

Documents