
Machine Learning: Logistic Regression (svivek)

Where are we?

We have seen the following ideas:
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
– The Naïve Bayes classifier

This lecture

• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
• Back to loss minimization


Logistic Regression: Setup

• The setting
  – Binary classification
  – Inputs: feature vectors x ∈ ℝᵈ
  – Labels: y ∈ {-1, +1}

• Training data
  – S = {(xᵢ, yᵢ)}, m examples
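To make this setup concrete, here is a small synthetic dataset in exactly this format (a sketch only; the dimension d, the number of examples m, and the generating process are arbitrary choices of mine, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 5, 100                  # feature dimension and number of examples (arbitrary)
true_w = rng.normal(size=d)    # a hypothetical weight vector used only to generate labels

X = rng.normal(size=(m, d))    # inputs: m feature vectors in R^d
y = np.sign(X @ true_w)        # labels in {-1, +1}
y[y == 0] = 1                  # break the (measure-zero) ties

S = list(zip(X, y))            # training data S = {(x_i, y_i)}, m examples
```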

Classification, but…

The output y is discrete valued (-1 or 1).

Instead of predicting the output, let us try to predict P(y = 1 | x).

Expand the hypothesis space to functions whose output is in [0, 1]:
• Original problem: ℝᵈ → {-1, 1}
• Modified problem: ℝᵈ → [0, 1]
• This effectively makes the problem a regression problem.

Many hypothesis spaces are possible.


The Sigmoid function

The hypothesis space for logistic regression: all functions of the form

$$\sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

That is, a linear function, composed with a sigmoid function (the logistic function) σ.

What is the domain and the range of the sigmoid function? (Its domain is all of ℝ; its range is the interval (0, 1).)

This is a reasonable choice. We will see why later.

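A minimal NumPy sketch of this hypothesis class (the function names `sigmoid` and `predict_proba` are mine, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function: defined for every real z, output in (0, 1).
    For very negative z, np.exp(-z) can overflow and warn; scipy.special.expit
    is a numerically safer alternative."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """A logistic-regression hypothesis: a linear score w^T x passed through the sigmoid."""
    return sigmoid(np.dot(w, x))
```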

The Sigmoid function

[Plot: σ(z) as a function of z — an S-shaped curve increasing from 0 to 1, with σ(0) = 0.5.]

The Sigmoid function

What is its derivative with respect to z?

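For reference (the slide leaves this as a question), the derivative has the familiar closed form:

```latex
\frac{d\sigma}{dz}
  = \frac{d}{dz}\bigl(1 + e^{-z}\bigr)^{-1}
  = \frac{e^{-z}}{\bigl(1 + e^{-z}\bigr)^{2}}
  = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
```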


Predicting probabilities

According to the logistic regression model, we have

$$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

$$P(y = -1 \mid \mathbf{w}, \mathbf{x}) = 1 - \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(\mathbf{w}^T\mathbf{x})}$$

Or equivalently, for y ∈ {-1, +1},

$$P(y \mid \mathbf{w}, \mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}$$

Note that we are directly modeling P(y | x) rather than P(x | y) and P(y).
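The two cases collapse into the single "or equivalently" formula because the sigmoid satisfies 1 − σ(z) = σ(−z):

```latex
1 - \sigma(\mathbf{w}^T\mathbf{x}) = \sigma(-\mathbf{w}^T\mathbf{x})
\quad\Longrightarrow\quad
P(y \mid \mathbf{w}, \mathbf{x}) = \sigma(y\,\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-y\,\mathbf{w}^T\mathbf{x})}
\quad \text{for } y \in \{-1, +1\}.
```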


Predicting a label with logistic regression

• Compute P(y = 1 | x; w)

• If this is greater than half, predict 1, else predict -1
  – What does this correspond to in terms of wᵀx?
  – Prediction = sgn(wᵀx)
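A small NumPy sketch of both prediction rules (the function names are mine). Thresholding the probability at one half is the same as taking the sign of wᵀx because σ(0) = 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """P(y = 1 | x; w) under the logistic model."""
    return sigmoid(np.dot(w, x))

def predict_label(w, x):
    """Predict +1 if P(y = 1 | x; w) > 0.5, else -1.
    Since sigmoid(z) > 0.5 exactly when z > 0, this is sgn(w^T x)."""
    return 1 if np.dot(w, x) > 0 else -1
```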

This lecture

• Logistic regression
• Connection to Naïve Bayes (next)
• Training a logistic regression classifier
• Back to loss minimization

Naïve Bayes and Logistic regression

Remember that the naïve Bayes decision is a linear function:

$$\log \frac{P(y = -1 \mid \mathbf{x}, \mathbf{w})}{P(y = +1 \mid \mathbf{x}, \mathbf{w})} = -\mathbf{w}^T\mathbf{x}$$

Here, the P's represent the naïve Bayes posterior distribution, and w can be used to calculate the priors and the likelihoods. That is, P(y = 1 | w, x) is computed using P(x | y = 1, w) and P(y = 1 | w).

But we also know that P(y = +1 | x, w) = 1 − P(y = −1 | x, w).

Substituting into the expression above, we get

$$P(y = +1 \mid \mathbf{w}, \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$$

That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.

Naïve Bayes is a generative model. Logistic regression is the discriminative version.
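Spelling out the substitution: exponentiating the log-odds and using the fact that the two probabilities sum to one gives

```latex
\frac{P(y=-1 \mid \mathbf{x}, \mathbf{w})}{P(y=+1 \mid \mathbf{x}, \mathbf{w})} = \exp(-\mathbf{w}^T\mathbf{x})
\;\Longrightarrow\;
P(y=+1 \mid \mathbf{x}, \mathbf{w})\bigl(1 + \exp(-\mathbf{w}^T\mathbf{x})\bigr) = 1
\;\Longrightarrow\;
P(y=+1 \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}.
```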

This lecture

• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
  – First: Maximum likelihood estimation
  – Then: Adding priors → Maximum a Posteriori estimation
• Back to loss minimization

Maximum likelihood estimation

Let's get back to the problem of learning.

• Training data
  – S = {(xᵢ, yᵢ)}, m examples

• What we want
  – Find a w such that P(S | w) is maximized
  – We know that our examples are drawn independently and are identically distributed (i.i.d.)
  – How do we proceed?

The usual trick: convert products to sums by taking the log. Recall that this works only because log is an increasing function, so the maximizer will not change:

$$\arg\max_{\mathbf{w}} P(S \mid \mathbf{w}) = \arg\max_{\mathbf{w}} \prod_{i=1}^{m} P(y_i \mid \mathbf{x}_i, \mathbf{w})$$

This is equivalent to solving

$$\max_{\mathbf{w}} \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w})$$

But (by definition) we know that

$$P(y_i \mid \mathbf{w}, \mathbf{x}_i) = \sigma(y_i \mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)}$$

So maximum likelihood estimation is equivalent to solving

$$\max_{\mathbf{w}} \sum_{i} -\log\bigl(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\bigr)$$

The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.

Equivalent to: training a linear classifier by minimizing the logistic loss.
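A NumPy sketch of this objective, written as the equivalent minimization of the logistic loss (the helper name, the toy data, and the use of `np.logaddexp` for numerical stability are my choices):

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """sum_i log(1 + exp(-y_i * w^T x_i)): maximizing the likelihood of the
    training set under the logistic model is the same as minimizing this."""
    margins = y * (X @ w)                       # y_i * w^T x_i for every example
    return np.sum(np.logaddexp(0.0, -margins))  # log(1 + exp(-m)) without overflow

# Tiny usage example on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] > 0, 1, -1)                     # labels in {-1, +1}
print(negative_log_likelihood(np.zeros(5), X, y))    # = 100 * log(2) at w = 0
```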

Maximum a posteriori estimation

We could also add a prior on the weights. Suppose each weight in the weight vector is drawn independently from the normal distribution with zero mean and standard deviation σ:

$$p(\mathbf{w}) = \prod_{j=1}^{d} p(w_j) = \prod_{j=1}^{d} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{w_j^2}{\sigma^2}\right)$$

MAP estimation for logistic regression

Let us work through this procedure again to see what changes.

What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.) Here, it is to maximize the posterior probability of the model given the data, i.e. to find the most probable model given the data:

$$P(\mathbf{w} \mid S) \propto P(S \mid \mathbf{w})\, P(\mathbf{w})$$

Learning by solving

$$\arg\max_{\mathbf{w}} P(\mathbf{w} \mid S) = \arg\max_{\mathbf{w}} P(S \mid \mathbf{w})\, P(\mathbf{w})$$

Take the log to simplify:

$$\max_{\mathbf{w}} \log P(S \mid \mathbf{w}) + \log P(\mathbf{w})$$

We have already expanded out the first term:

$$\sum_{i} -\log\bigl(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\bigr)$$

Expanding the log prior gives

$$\sum_{j=1}^{d} -\frac{w_j^2}{\sigma^2} + \text{constants}$$

Putting these together, MAP learning solves

$$\max_{\mathbf{w}} \sum_{i} -\log\bigl(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\bigr) - \frac{1}{\sigma^2}\mathbf{w}^T\mathbf{w}$$

Maximizing a negative function is the same as minimizing the function.


Learning a logistic regression classifier

Learning a logistic regression classifier is equivalent to solving

$$\min_{\mathbf{w}} \sum_{i} \log\bigl(1 + \exp(-y_i \mathbf{w}^T\mathbf{x}_i)\bigr) + \frac{1}{\sigma^2}\mathbf{w}^T\mathbf{w}$$

Where have we seen this before?

The first question in the homework: write down the stochastic gradient descent algorithm for this problem.

Historically, other training algorithms exist. In particular, you might run into LBFGS.
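One possible stochastic gradient descent sketch for this objective (this is my own illustration, not the course's homework solution; the step size, number of epochs, σ², and how the regularizer is split across examples are all arbitrary choices). It uses the per-example gradient −yᵢ xᵢ σ(−yᵢ wᵀxᵢ) for the logistic loss plus (2/(m σ²)) w for the regularizer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, sigma2=10.0, lr=0.1, epochs=100, seed=0):
    """Minimize  sum_i log(1 + exp(-y_i w^T x_i)) + (1/sigma2) w^T w  by SGD.
    The objective is rewritten as sum_i [loss_i + w^T w / (m * sigma2)] so that
    each per-example step also takes a small regularization step."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * np.dot(w, X[i])
            grad = -y[i] * X[i] * sigmoid(-margin) + (2.0 / (m * sigma2)) * w
            w -= lr * grad
    return w

# Usage on toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) > 0, 1, -1)
w_hat = sgd_logistic_regression(X, y)
print("training accuracy:", (np.sign(X @ w_hat) == y).mean())
```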

Logistic regression is…

• A classifier that predicts the probability that the label is +1 for a particular input

• The discriminative counterpart of the naïve Bayes classifier

• A discriminative classifier that can be trained via MAP or MLE estimation

• A discriminative classifier that minimizes the logistic loss over the training set

This lecture

• Logistic regression
• Connection to Naïve Bayes
• Training a logistic regression classifier
• Back to loss minimization (next)

Learning as loss minimization

• The setup
  – Examples x are drawn from a fixed, unknown distribution D
  – A hidden oracle classifier f labels the examples
  – We wish to find a hypothesis h that mimics f

• The ideal situation
  – Define a function L that penalizes bad hypotheses
  – Learning: pick a function h ∈ H to minimize the expected loss

• But the distribution D is unknown. Instead, minimize the empirical loss on the training set (both objectives are written out below).
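The two objectives referred to on this slide, written out explicitly (a standard formulation; the notation is mine, since the slide's own equations did not survive the transcript):

```latex
\text{Expected loss:}\quad \min_{h \in H}\; \mathbb{E}_{x \sim D}\bigl[L\bigl(h(x), f(x)\bigr)\bigr]
\qquad\qquad
\text{Empirical loss:}\quad \min_{h \in H}\; \frac{1}{m}\sum_{i=1}^{m} L\bigl(h(x_i), y_i\bigr)
```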


Empirical loss minimization

Learning = minimize empirical loss on the training set.

Is there a problem here? Overfitting!

We need something that biases the learner towards simpler hypotheses:
• Achieved using a regularizer, which penalizes complex hypotheses

Regularized loss minimization

• Learning: minimize a regularized loss

• With linear classifiers: the same objective, with an l2 regularizer on w (a sketch of this objective follows below)

• What is a loss function?
  – Loss functions should penalize mistakes
  – We are minimizing average loss over the training data

• What is the ideal loss function for classification?
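A standard way to write the linear-classifier version with l2 regularization (the slide's own equation is an image, so the exact constants here are my assumption):

```latex
\min_{\mathbf{w}}\; \frac{1}{m}\sum_{i=1}^{m} L\bigl(y_i,\ \mathbf{w}^T\mathbf{x}_i\bigr) \;+\; \lambda\,\mathbf{w}^T\mathbf{w}
```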

The 0-1 loss

Penalize classification mistakes between the true label y and the prediction y′.

• For linear classifiers, the prediction is y′ = sgn(wᵀx)
  – Mistake if y wᵀx ≤ 0

Minimizing the 0-1 loss is intractable. We need surrogates.

The loss function zoo

Many loss functions exist:
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)

The loss function zoo

[Plots: the zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic (logistic regression) losses drawn as functions of y wᵀx, shown first close up, then zoomed out, then zoomed out even more.]
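A small NumPy sketch that evaluates the losses in this zoo as functions of the margin y wᵀx (the exact scalings follow common textbook conventions and may differ from the plots in the slides):

```python
import numpy as np

def zero_one(m):    return (m <= 0).astype(float)     # 1 on a mistake (y w^T x <= 0), else 0
def perceptron(m):  return np.maximum(0.0, -m)        # perceptron loss
def hinge(m):       return np.maximum(0.0, 1.0 - m)   # hinge loss (SVM)
def exponential(m): return np.exp(-m)                 # exponential loss (AdaBoost)
def logistic(m):    return np.log(1.0 + np.exp(-m))   # logistic loss (logistic regression)

margins = np.linspace(-2.0, 2.0, 9)                   # values of y * w^T x
for name, loss in [("zero-one", zero_one), ("perceptron", perceptron),
                   ("hinge", hinge), ("exponential", exponential),
                   ("logistic", logistic)]:
    print(f"{name:12s}", np.round(loss(margins), 2))
```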
