Classification: Logistic Regression from Data
Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder, Lecture 3
Slides adapted from Emily Fox
Reminder: Logistic Regression
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)$$
• Discriminative prediction: p(y |x)
• Classification uses: ad placement, spam detection
• What we didn’t talk about is how to learn β from data
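A minimal sketch of the prediction rule in equations (1) and (2); the weights and feature values below are made up for illustration:

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """P(Y = 1 | X) per equation (2), with z = beta_0 + sum_i beta_i x_i."""
    z = beta0 + np.dot(beta, x)
    return np.exp(z) / (1.0 + np.exp(z))  # equivalently 1 / (1 + exp(-z))

beta0, beta = -1.0, np.array([0.5, 2.0])  # made-up parameters
x = np.array([1.0, 0.3])                  # made-up features
p1 = predict_proba(beta0, beta, x)
print(p1, 1.0 - p1)  # P(Y=1|X) and P(Y=0|X), which sum to one
```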
Logistic Regression: Objective Function
$$\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)$$

$$= \sum_j y^{(j)} \left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right) - \ln\left(1 + \exp\left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right)\right) \quad (4)$$
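A minimal sketch of computing equation (4) over a toy dataset (all values made up):

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Conditional log likelihood, equation (4): sum_j y z - ln(1 + e^z)."""
    z = beta0 + X @ beta                        # one z^(j) per observation
    return np.sum(y * z - np.log1p(np.exp(z)))  # log1p for numerical stability

X = np.array([[1.0, 0.3], [0.2, 1.5], [0.9, 0.1]])  # toy observations
y = np.array([1.0, 0.0, 1.0])                       # toy 0/1 labels
print(log_likelihood(-1.0, np.array([0.5, 2.0]), X, y))
```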
Convexity
• Convex function
• Doesn’t matter where you start, if you “walk up” objective
• Gradient!
Gradient Descent (non-convex)
Goal
Optimize log likelihood with respect to variables W and b
[Figure: objective vs. parameter on a non-convex surface, animated over several slides: the iterate starts at point 0 and steps through 1, 2, 3 toward a local optimum, with the unexplored region labeled “Undiscovered Country”; each step applies the update $W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}$]
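The update in the figure is plain gradient descent; a minimal runnable sketch on a made-up one-dimensional quadratic, not the course’s code:

```python
def grad_descent(grad, w, alpha=0.1, steps=100):
    """Repeat w <- w - alpha * dJ/dw, the update from the figure."""
    for _ in range(steps):
        w = w - alpha * grad(w)  # walk downhill along the gradient
    return w

# J(w) = w^2 has gradient 2w and minimum at w = 0
print(grad_descent(lambda w: 2 * w, w=10.0))  # approaches 0
```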
Gradient for Logistic Regression
Gradient
$$\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[\frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n}\right] \quad (5)$$

Update

$$\Delta\beta \equiv \eta \nabla_\beta \mathcal{L}(\vec{\beta}) \quad (6)$$

$$\beta_i \leftarrow \beta_i' + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \quad (7)$$
Why are we adding? What would we do if we wanted to do descent?
η: step size, must be greater than zero
NB: Conjugate gradient is usually better, but harder to implement
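Putting equations (5)–(7) together, a hand-rolled sketch of one ascent step for logistic regression (made-up data; the gradient expression follows from equation (4)):

```python
import numpy as np

def ascent_step(beta, X, y, eta=0.1):
    """One update beta <- beta + eta * grad L(beta), equations (6)-(7)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1s so beta[0] is beta_0
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))         # p(y = 1 | x, beta)
    grad = X1.T @ (y - p)                          # each entry is dL/dbeta_i, eq. (5)
    return beta + eta * grad                       # add: we are maximizing

beta = np.zeros(3)
X = np.array([[1.0, 0.3], [0.2, 1.5]]); y = np.array([1.0, 0.0])
for _ in range(50):
    beta = ascent_step(beta, X, y)  # log likelihood increases each step
```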
Choosing Step Size
[Figure: objective vs. parameter, animated over several slides to show how different choices of step size move along the objective]
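A toy numeric illustration of the trade-off, on a made-up quadratic objective rather than anything from the slides: a small η makes slow but steady progress, while a too-large η overshoots and diverges.

```python
def run(eta, w=1.0, steps=5):
    """Gradient descent on J(w) = w^2 (gradient 2w), tracing the iterates."""
    trace = [w]
    for _ in range(steps):
        w = w - eta * 2 * w
        trace.append(round(w, 3))
    return trace

print(run(eta=0.1))  # 1, 0.8, 0.64, ... creeps toward the minimum at 0
print(run(eta=1.1))  # 1, -1.2, 1.44, ... oscillates and blows up
```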
Remaining issues
• When to stop?
• What if β keeps getting bigger?
Regularized Conditional Log Likelihood
Unregularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (8)$$

Regularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (9)$$
µ is a “regularization” parameter that trades off between likelihood and having small parameters
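A sketch of the regularized objective in equation (9); µ here is a made-up value, and β₀ is left unpenalized (a common choice the slide does not specify either way):

```python
import numpy as np

def regularized_objective(beta0, beta, X, y, mu=0.1):
    z = beta0 + X @ beta
    loglik = np.sum(y * z - np.log1p(np.exp(z)))  # conditional log likelihood, eq. (4)
    return loglik - mu * np.sum(beta ** 2)        # minus the mu * sum_i beta_i^2 penalty

X = np.array([[1.0, 0.3], [0.2, 1.5]]); y = np.array([1.0, 0.0])
print(regularized_objective(-1.0, np.array([0.5, 2.0]), X, y))
```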
Approximating the Gradient
• Our datasets are big (too big to fit into memory)
• . . . or data are changing / streaming
• Hard to compute true gradient
$$\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x\left[\nabla \mathcal{L}(\beta, x)\right] \quad (10)$$
• Average over all observations
• What if we compute an update just from one observation?
Getting to Union Station
Pretend it’s a pre-smartphone world and you want to get to Union Station
Stochastic Gradient for Logistic Regression
Given a single observation x chosen at random from the dataset,
$$\beta_i \leftarrow \beta_i' + \eta\left[-\mu\beta_i' + x_i\left(y - p(y = 1 \mid \vec{x}, \vec{\beta}')\right)\right] \quad (11)$$
Examples in class.
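A minimal sketch of the single-observation update in equation (11); η and µ are made-up hyperparameter values, and x₀ = 1 carries the bias term:

```python
import numpy as np

def sgd_step(beta, x, y, eta=0.1, mu=0.01):
    """beta_i <- beta_i' + eta * (-mu * beta_i' + x_i * (y - p)), eq. (11)."""
    x1 = np.hstack([1.0, x])                         # x_0 = 1 for the bias
    p = 1.0 / (1.0 + np.exp(-np.dot(beta, x1)))      # p(y = 1 | x, beta')
    return beta + eta * (-mu * beta + x1 * (y - p))  # elementwise update

beta = np.zeros(3)
beta = sgd_step(beta, np.array([1.0, 0.3]), y=1.0)  # one made-up observation
```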
Stochastic Gradient for Regularized Regression
$$\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (12)$$

Taking the derivative (with respect to example $x_i$):

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \left(y_i - p(y_i = 1 \mid \vec{x}_i; \beta)\right) x_j - 2\mu\beta_j \quad (13)$$
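A quick sanity check of equation (13) by finite differences, with all values made up; the objective is the log likelihood of a single example minus the µ penalty, as in equation (12):

```python
import numpy as np

def objective(beta, x, y, mu=0.05):
    p = 1.0 / (1.0 + np.exp(-np.dot(beta, x)))
    return y * np.log(p) + (1 - y) * np.log(1 - p) - mu * np.sum(beta ** 2)

beta = np.array([0.2, -0.4]); x = np.array([1.0, 0.3]); y, mu = 1.0, 0.05
p = 1.0 / (1.0 + np.exp(-np.dot(beta, x)))
analytic = (y - p) * x - 2 * mu * beta  # equation (13), one entry per beta_j

h = 1e-6
numeric = np.zeros_like(beta)
for j in range(len(beta)):
    e = np.zeros_like(beta); e[j] = h
    numeric[j] = (objective(beta + e, x, y) - objective(beta - e, x, y)) / (2 * h)
print(analytic, numeric)  # the two should agree to several decimal places
```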
Proofs about Stochastic Gradient
• Depends on the convexity of the objective and how close (within ε) you want to get to the actual answer
• Best bounds depend on changing η over time and per dimension (not all features are created equal)
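Two standard ways to realize that last bullet, sketched here as illustrations: a 1/√t decay shrinks η over time, and an AdaGrad-style accumulator gives each dimension its own effective step size (neither is necessarily the scheme a particular proof uses).

```python
import numpy as np

def eta_t(eta0, t):
    """Time-decayed step size: eta_t = eta_0 / sqrt(t + 1)."""
    return eta0 / np.sqrt(t + 1)

class PerDimStep:
    """AdaGrad-style scaling: dimensions with rare, small gradients keep larger steps."""
    def __init__(self, dim, eta0=0.1):
        self.eta0 = eta0
        self.g2 = np.zeros(dim)  # running sum of squared gradients per dimension
    def step(self, grad):
        self.g2 += grad ** 2
        return self.eta0 * grad / (np.sqrt(self.g2) + 1e-8)  # scaled update to apply
```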
In class
• Your questions!
• Working through a simple example
• Preparing for the logistic regression homework