Logistic Regression

Robot image credit: Viktoriya Sukhanova © 123RF.com

These slides were assembled by Byron Boots, with only minor modifications from Eric Eaton's slides and grateful acknowledgement to the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution.
Classification Based on Probability
• Instead of just predicting the class, give the probability of the instance being that class
  – i.e., learn p(y | x)
• Comparison to the perceptron:
  – The perceptron doesn't produce a probability estimate
• Recall that:
  p(event) + p(¬event) = 1
  0 ≤ p(event) ≤ 1
Logistic Regression
• Takes a probabilistic approach to learning discriminative functions (i.e., a classifier)
• h_θ(x) should give p(y = 1 | x; θ)
  – Want 0 ≤ h_θ(x) ≤ 1
• Logistic regression model:
  h_θ(x) = g(θ^T x)
  where g(z) = 1 / (1 + e^(−z)) is the logistic/sigmoid function, so
  h_θ(x) = 1 / (1 + e^(−θ^T x))
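To make the model concrete, here is a minimal NumPy sketch of the sigmoid hypothesis (the names sigmoid and hypothesis are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), the estimated p(y = 1 | x; theta)."""
    return sigmoid(theta @ x)
```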
Interpretation of Hypothesis Output
h_θ(x) = estimated p(y = 1 | x; θ)
Example: cancer diagnosis from tumor size
  x = [x_0, x_1]^T = [1, tumorSize]^T
  h_θ(x) = 0.7
→ Tell the patient that there is a 70% chance of the tumor being malignant
Note that: p(y = 0 | x; θ) + p(y = 1 | x; θ) = 1
Therefore, p(y = 0 | x; θ) = 1 − p(y = 1 | x; θ) = 1 − 0.7 = 0.3 in this example
Based on example by Andrew Ng
Another Interpretation
• Equivalently, logistic regression assumes that
  log [ p(y = 1 | x; θ) / p(y = 0 | x; θ) ] = θ_0 + θ_1 x_1 + … + θ_d x_d
  (the left-hand side is the log odds of y = 1)
• In other words, logistic regression assumes that the log odds is a linear function of x
Side note: the odds in favor of an event is the quantity p / (1 − p), where p is the probability of the event
  E.g., if I toss a fair die, what are the odds that I will roll a 6? Here p = 1/6, so the odds are (1/6) / (5/6) = 1/5.
Based on slide by Xiaoli Fern
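The claim that the log odds equals θ^T x can be checked numerically; a quick sketch with made-up parameter values:

```python
import numpy as np

theta = np.array([-1.0, 2.0])            # [theta_0, theta_1], illustrative values
x = np.array([1.0, 0.75])                # [x_0 = 1, x_1]
p1 = 1.0 / (1.0 + np.exp(-theta @ x))    # p(y = 1 | x; theta)
log_odds = np.log(p1 / (1.0 - p1))       # log [ p(y=1|x) / p(y=0|x) ]
assert np.isclose(log_odds, theta @ x)   # the log odds is exactly theta^T x
```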
Logistic Regression
  h_θ(x) = g(θ^T x),  g(z) = 1 / (1 + e^(−z))
• Assume a threshold and...
  – Predict y = 1 if h_θ(x) ≥ 0.5
  – Predict y = 0 if h_θ(x) < 0.5
• So θ^T x should take large positive values for positive instances and large negative values for negative instances; note that h_θ(x) ≥ 0.5 exactly when θ^T x ≥ 0
Based on slide by Andrew Ng
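A minimal sketch of this decision rule (function and variable names are illustrative); since g(z) ≥ 0.5 exactly when z ≥ 0, thresholding h_θ(x) at 0.5 is the same as thresholding θ^T x at 0:

```python
import numpy as np

def predict(theta, X):
    """Predict y = 1 where h_theta(x) >= 0.5, else y = 0.

    X is an (n, d+1) matrix whose first column is all ones (x_0 = 1).
    """
    probs = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for each row
    return (probs >= 0.5).astype(int)          # equivalent to (X @ theta >= 0)
```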
Non-Linear Decision Boundary
• Can apply basis function expansion to the features, same as with linear regression:
  x = [1, x_1, x_2]^T → [1, x_1, x_2, x_1 x_2, x_1², x_2², x_1² x_2, x_1 x_2², …]^T
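As a sketch, the expansion shown above for two features could be hand-rolled like this (sklearn.preprocessing.PolynomialFeatures automates the general case):

```python
import numpy as np

def expand(x1, x2):
    """Map the features [1, x1, x2] to the expanded basis from the slide."""
    return np.array([1.0, x1, x2, x1 * x2,
                     x1**2, x2**2, x1**2 * x2, x1 * x2**2])
```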
Logistic Regression
• Given {(x^(1), y^(1)), (x^(2), y^(2)), …, (x^(n), y^(n))} where x^(i) ∈ ℝ^d, y^(i) ∈ {0, 1}
• Model:
  h_θ(x) = g(θ^T x),  g(z) = 1 / (1 + e^(−z))
  x^T = [1  x_1  …  x_d],  θ = [θ_0  θ_1  …  θ_d]^T
Logistic Regression Objective Function
• Can't just use squared loss as in linear regression:
  J(θ) = (1 / 2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
  – Using the logistic regression model h_θ(x) = 1 / (1 + e^(−θ^T x)) results in a non-convex optimization
Deriving the Cost Function via Maximum Likelihood Estimation
• The likelihood of the data is given by:
  l(θ) = Π_{i=1}^n p(y^(i) | x^(i); θ)
• So, we are looking for the θ that maximizes the likelihood:
  θ_MLE = argmax_θ l(θ) = argmax_θ Π_{i=1}^n p(y^(i) | x^(i); θ)
• Can take the log without changing the solution:
  θ_MLE = argmax_θ log Π_{i=1}^n p(y^(i) | x^(i); θ)
        = argmax_θ Σ_{i=1}^n log p(y^(i) | x^(i); θ)
Deriving the Cost Function via Maximum Likelihood Estimation
• Expand as follows, using the fact that y^(i) ∈ {0, 1}, so that
  p(y^(i) | x^(i); θ) = p(y^(i)=1 | x^(i); θ)^{y^(i)} (1 − p(y^(i)=1 | x^(i); θ))^{1−y^(i)}:
  θ_MLE = argmax_θ Σ_{i=1}^n log p(y^(i) | x^(i); θ)
        = argmax_θ Σ_{i=1}^n [ y^(i) log p(y^(i)=1 | x^(i); θ) + (1 − y^(i)) log(1 − p(y^(i)=1 | x^(i); θ)) ]
• Substitute in the model, and take the negative to yield the logistic regression objective: min_θ J(θ), where
  J(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
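A direct NumPy transcription of this objective (a sketch; the epsilon clipping guards against log(0) in floating point, which the math above does not need):

```python
import numpy as np

def cross_entropy_loss(theta, X, y, eps=1e-12):
    """J(theta): negative log-likelihood for logistic regression.

    X: (n, d+1) design matrix with a leading column of ones; y: (n,) in {0, 1}.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all i
    h = np.clip(h, eps, 1.0 - eps)         # keep log() finite
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```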
Intuition Behind the Objective
  J(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
• Cost of a single instance:
  cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
                    −log(1 − h_θ(x))   if y = 0
• Can rewrite the objective function as
  J(θ) = Σ_{i=1}^n cost(h_θ(x^(i)), y^(i))
Compare to linear regression:
  J(θ) = (1 / 2n) Σ_{i=1}^n (h_θ(x^(i)) − y^(i))²
Intuition Behind the Objective
  cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
                    −log(1 − h_θ(x))   if y = 0
Aside: recall the plot of log(z)
[Figure: plot of log(z)]
IntuitionBehindtheObjective
Ify =1• Cost=0ifpredictioniscorrect• As
• Capturesintuitionthatlargermistakesshouldgetlargerpenalties– e.g.,predict,buty =1
14
cost (h✓(x), y) =
⇢� log(h✓(x)) if y = 1
� log(1� h✓(x)) if y = 0
h✓(x) ! 0, cost ! 1
h✓(x) = 0
BasedonexamplebyAndrewNg
Ify =1
10
cost
h✓(x) = 0
Intuition Behind the Objective
  cost(h_θ(x), y) = −log(h_θ(x))       if y = 1
                    −log(1 − h_θ(x))   if y = 0
If y = 0:
[Figure: cost vs. h_θ(x) for y = 0 and y = 1]
• Cost = 0 if the prediction is correct (h_θ(x) = 0)
• As (1 − h_θ(x)) → 0, cost → ∞
• Captures the intuition that larger mistakes should get larger penalties
Based on example by Andrew Ng
Regularized Logistic Regression
• We can regularize logistic regression exactly as before:
  J(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
  J_regularized(θ) = J(θ) + λ Σ_{j=1}^d θ_j² = J(θ) + λ ‖θ_[1:d]‖₂²
  (note that the intercept θ_0 is not penalized: the sum starts at j = 1)
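Extending the hypothetical cross_entropy_loss sketch above with the penalty term (lam plays the role of λ; theta[0] is the unpenalized intercept):

```python
import numpy as np

def regularized_loss(theta, X, y, lam):
    """J_regularized(theta) = J(theta) + lambda * ||theta[1:d]||^2."""
    return cross_entropy_loss(theta, X, y) + lam * np.sum(theta[1:] ** 2)
```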
Gradient Descent for Logistic Regression
Want: min_θ J(θ)
  J_reg(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + λ ‖θ_[1:d]‖₂²
• Initialize θ
• Repeat until convergence (simultaneous update for j = 0 … d):
  θ_j ← θ_j − α ∂J(θ)/∂θ_j
Side note: use the natural logarithm (ln = log_e) so that the log cancels with the exp() in h_θ(x) = 1 / (1 + e^(−θ^T x))
Working out the derivatives gives:
  θ_0 ← θ_0 − α Σ_{i=1}^n (h_θ(x^(i)) − y^(i))
  θ_j ← θ_j − α [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) + λ θ_j ]   for j = 1 … d
Gradient Descent for Logistic Regression
Want: min_θ J(θ)
  J_reg(θ) = −Σ_{i=1}^n [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + λ ‖θ_[1:d]‖₂²
• Initialize θ
• Repeat until convergence (simultaneous update for j = 0 … d):
  θ_0 ← θ_0 − α Σ_{i=1}^n (h_θ(x^(i)) − y^(i))
  θ_j ← θ_j − α [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/n) θ_j ]   for j = 1 … d
Gradient Descent for Logistic Regression
• Initialize θ
• Repeat until convergence (simultaneous update for j = 0 … d):
  θ_j ← θ_j − α [ Σ_{i=1}^n (h_θ(x^(i)) − y^(i)) x_j^(i) + (λ/n) θ_j ]
This looks IDENTICAL to gradient descent for linear regression!
• Ignoring the 1/n constant
• However, the form of the model is very different:
  h_θ(x) = 1 / (1 + e^(−θ^T x))
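Putting the pieces together, a sketch of batch gradient descent for regularized logistic regression (names, default hyperparameters, and the convergence test are illustrative; folding the 1/n into λ, the penalty gradient is written λθ_j, and θ_0 is left unpenalized, matching the updates above):

```python
import numpy as np

def train_logistic_regression(X, y, alpha=0.01, lam=0.1, max_iters=10000, tol=1e-6):
    """Fit theta by batch gradient descent on the regularized objective.

    X: (n, d+1) design matrix with a leading column of ones; y: (n,) in {0, 1}.
    """
    theta = np.zeros(X.shape[1])                  # initialize theta
    for _ in range(max_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))      # h_theta(x^(i)) for all i
        grad = X.T @ (h - y)                      # sum_i (h - y) x^(i)
        grad[1:] += lam * theta[1:]               # penalize all but theta_0
        new_theta = theta - alpha * grad          # simultaneous update, j = 0..d
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta                      # converged
        theta = new_theta
    return theta
```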
Multi-Class Classification
Binary classification vs. multi-class classification:
[Figure: scatter plots in (x_1, x_2) of two-class data vs. multi-class data]
Disease diagnosis: healthy / cold / flu / pneumonia
Object classification: desk / chair / monitor / bookcase
Multi-Class Logistic Regression
• For 2 classes:
  h_θ(x) = 1 / (1 + exp(−θ^T x)) = exp(θ^T x) / (1 + exp(θ^T x))
  In the denominator, the 1 is the weight assigned to y = 0 and exp(θ^T x) is the weight assigned to y = 1
• For C classes {1, …, C}:
  p(y = c | x; θ_1, …, θ_C) = exp(θ_c^T x) / Σ_{c′=1}^C exp(θ_{c′}^T x)
  – Called the softmax function
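A minimal sketch of the softmax model (stacking the per-class parameter vectors as rows of a matrix Theta is an illustrative convention, not from the slides):

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = c | x; theta_1, ..., theta_C) for every class c.

    Theta: (C, d+1) matrix whose rows are theta_1, ..., theta_C.
    """
    scores = Theta @ x                        # theta_c^T x for each class
    scores = scores - np.max(scores)          # stabilize exp(); cancels in the ratio
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)    # normalize to sum to 1
```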
Multi-Class Logistic Regression
Split into One vs. Rest:
[Figure: multi-class data in (x_1, x_2) split into per-class binary problems]
• Train a logistic regression classifier for each class c to predict the probability that y = c with
  h_c(x) = exp(θ_c^T x) / Σ_{c′=1}^C exp(θ_{c′}^T x)
Implementing Multi-Class Logistic Regression
• Use h_c(x) = exp(θ_c^T x) / Σ_{c′=1}^C exp(θ_{c′}^T x) as the model for class c
• Gradient descent simultaneously updates all parameters for all C models
  – Same derivative as before, just with the above h_c(x)
• Predict the class label as the most probable label: ŷ = argmax_c h_c(x)
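A sketch of this prediction rule; because softmax is monotone in the scores θ_c^T x, the most probable class is simply the one with the largest score:

```python
import numpy as np

def predict_class(Theta, x):
    """Return argmax_c h_c(x) for a (C, d+1) parameter matrix Theta."""
    return int(np.argmax(Theta @ x))   # argmax of softmax = argmax of scores
```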