Classification: Logistic Regression from Data
Machine Learning: Jordan Boyd-Graber, University of Colorado Boulder, Lecture 3
Slides adapted from Emily Fox
Reminder: Logistic Regression
$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (1)$$

$$P(Y = 1 \mid X) = \frac{\exp\left(\beta_0 + \sum_i \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_i \beta_i X_i\right)} \quad (2)$$
• Discriminative prediction: p(y |x)
• Classification uses: ad placement, spam detection
• What we didn’t talk about is how to learn β from data
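A minimal sketch of the prediction rule in equations (1) and (2); the weights and feature values below are made up for illustration:

```python
import numpy as np

def predict_proba(beta0, beta, x):
    """P(Y = 1 | X) per equation (2), with z = beta_0 + sum_i beta_i x_i."""
    z = beta0 + np.dot(beta, x)
    return np.exp(z) / (1.0 + np.exp(z))  # equivalently 1 / (1 + exp(-z))

beta0, beta = -1.0, np.array([0.5, 2.0])  # made-up parameters
x = np.array([1.0, 0.3])                  # made-up features
p1 = predict_proba(beta0, beta, x)
print(p1, 1.0 - p1)  # P(Y=1|X) and P(Y=0|X), which sum to one
```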
Logistic Regression: Objective Function
$$\mathcal{L} \equiv \ln p(Y \mid X, \beta) = \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (3)$$

$$= \sum_j y^{(j)} \left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right) - \ln\left(1 + \exp\left(\beta_0 + \sum_i \beta_i x_i^{(j)}\right)\right) \quad (4)$$
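A minimal sketch of computing equation (4) over a toy dataset (all values made up):

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """Conditional log likelihood, equation (4): sum_j y z - ln(1 + e^z)."""
    z = beta0 + X @ beta                        # one z^(j) per observation
    return np.sum(y * z - np.log1p(np.exp(z)))  # log1p for numerical stability

X = np.array([[1.0, 0.3], [0.2, 1.5], [0.9, 0.1]])  # toy observations
y = np.array([1.0, 0.0, 1.0])                       # toy 0/1 labels
print(log_likelihood(-1.0, np.array([0.5, 2.0]), X, y))
```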
Convexity
• Convex function
• Doesn’t matter where you start, if you “walk up” objective
• Gradient!
Gradient Descent (non-convex)
Goal
Optimize log likelihood with respect to variables W and b
[Figure: objective vs. parameter on a non-convex surface, animated over several slides: the iterate starts at point 0 and steps through 1, 2, 3 toward a local optimum, with the unexplored region labeled “Undiscovered Country”; each step applies the update $W_{ij}^{(l)} \leftarrow W_{ij}^{(l)} - \alpha \frac{\partial J}{\partial W_{ij}^{(l)}}$]
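The update in the figure is plain gradient descent; a minimal runnable sketch on a made-up one-dimensional quadratic, not the course’s code:

```python
def grad_descent(grad, w, alpha=0.1, steps=100):
    """Repeat w <- w - alpha * dJ/dw, the update from the figure."""
    for _ in range(steps):
        w = w - alpha * grad(w)  # walk downhill along the gradient
    return w

# J(w) = w^2 has gradient 2w and minimum at w = 0
print(grad_descent(lambda w: 2 * w, w=10.0))  # approaches 0
```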
Gradient for Logistic Regression
Gradient
$$\nabla_\beta \mathcal{L}(\vec{\beta}) = \left[\frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_0}, \ldots, \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_n}\right] \quad (5)$$

Update

$$\Delta\beta \equiv \eta \nabla_\beta \mathcal{L}(\vec{\beta}) \quad (6)$$

$$\beta_i \leftarrow \beta_i' + \eta \frac{\partial \mathcal{L}(\vec{\beta})}{\partial \beta_i} \quad (7)$$
Why are we adding? What would we do if we wanted to do descent?
η: step size, must be greater than zero
NB: Conjugate gradient is usually better, but harder to implement
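Putting equations (5)–(7) together, a hand-rolled sketch of one ascent step for logistic regression (made-up data; the gradient expression follows from equation (4)):

```python
import numpy as np

def ascent_step(beta, X, y, eta=0.1):
    """One update beta <- beta + eta * grad L(beta), equations (6)-(7)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend 1s so beta[0] is beta_0
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))         # p(y = 1 | x, beta)
    grad = X1.T @ (y - p)                          # each entry is dL/dbeta_i, eq. (5)
    return beta + eta * grad                       # add: we are maximizing

beta = np.zeros(3)
X = np.array([[1.0, 0.3], [0.2, 1.5]]); y = np.array([1.0, 0.0])
for _ in range(50):
    beta = ascent_step(beta, X, y)  # log likelihood increases each step
```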
Choosing Step Size
[Figure: objective vs. parameter, animated over several slides to show how different choices of step size move along the objective]
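A toy numeric illustration of the trade-off, on a made-up quadratic objective rather than anything from the slides: a small η makes slow but steady progress, while a too-large η overshoots and diverges.

```python
def run(eta, w=1.0, steps=5):
    """Gradient descent on J(w) = w^2 (gradient 2w), tracing the iterates."""
    trace = [w]
    for _ in range(steps):
        w = w - eta * 2 * w
        trace.append(round(w, 3))
    return trace

print(run(eta=0.1))  # 1, 0.8, 0.64, ... creeps toward the minimum at 0
print(run(eta=1.1))  # 1, -1.2, 1.44, ... oscillates and blows up
```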
Remaining issues
• When to stop?
• What if β keeps getting bigger?
Regularized Conditional Log Likelihood
Unregularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) \quad (8)$$

Regularized

$$\beta^* = \arg\max_\beta \sum_j \ln p(y^{(j)} \mid x^{(j)}, \beta) - \mu \sum_i \beta_i^2 \quad (9)$$
µ is a “regularization” parameter that trades off between likelihood and having small parameters
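A sketch of the regularized objective in equation (9); µ here is a made-up value, and β₀ is left unpenalized (a common choice the slide does not specify either way):

```python
import numpy as np

def regularized_objective(beta0, beta, X, y, mu=0.1):
    z = beta0 + X @ beta
    loglik = np.sum(y * z - np.log1p(np.exp(z)))  # conditional log likelihood, eq. (4)
    return loglik - mu * np.sum(beta ** 2)        # minus the mu * sum_i beta_i^2 penalty

X = np.array([[1.0, 0.3], [0.2, 1.5]]); y = np.array([1.0, 0.0])
print(regularized_objective(-1.0, np.array([0.5, 2.0]), X, y))
```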
Approximating the Gradient
• Our datasets are big (too big to fit into memory)
• . . . or data are changing / streaming
• Hard to compute true gradient
$$\nabla \mathcal{L}(\beta) \equiv \mathbb{E}_x\left[\nabla \mathcal{L}(\beta, x)\right] \quad (10)$$
• Average over all observations
• What if we compute an update just from one observation?
Getting to Union Station
Pretend it’s a pre-smartphone world and you want to get to Union Station
Stochastic Gradient for Logistic Regression
Given a single observation x chosen at random from the dataset,
$$\beta_i \leftarrow \beta_i' + \eta\left[-\mu\beta_i' + x_i\left(y - p(y = 1 \mid \vec{x}, \vec{\beta}')\right)\right] \quad (11)$$
Examples in class.
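A minimal sketch of the single-observation update in equation (11); η and µ are made-up hyperparameter values, and x₀ = 1 carries the bias term:

```python
import numpy as np

def sgd_step(beta, x, y, eta=0.1, mu=0.01):
    """beta_i <- beta_i' + eta * (-mu * beta_i' + x_i * (y - p)), eq. (11)."""
    x1 = np.hstack([1.0, x])                         # x_0 = 1 for the bias
    p = 1.0 / (1.0 + np.exp(-np.dot(beta, x1)))      # p(y = 1 | x, beta')
    return beta + eta * (-mu * beta + x1 * (y - p))  # elementwise update

beta = np.zeros(3)
beta = sgd_step(beta, np.array([1.0, 0.3]), y=1.0)  # one made-up observation
```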
Stochastic Gradient for Regularized Regression
$$\mathcal{L} = \log p(y \mid x; \beta) - \mu \sum_j \beta_j^2 \quad (12)$$

Taking the derivative (with respect to example $x_i$):

$$\frac{\partial \mathcal{L}}{\partial \beta_j} = \left(y_i - p(y_i = 1 \mid \vec{x}_i; \beta)\right) x_j - 2\mu\beta_j \quad (13)$$
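A quick sanity check of equation (13) by finite differences, with all values made up; the objective is the log likelihood of a single example minus the µ penalty, as in equation (12):

```python
import numpy as np

def objective(beta, x, y, mu=0.05):
    p = 1.0 / (1.0 + np.exp(-np.dot(beta, x)))
    return y * np.log(p) + (1 - y) * np.log(1 - p) - mu * np.sum(beta ** 2)

beta = np.array([0.2, -0.4]); x = np.array([1.0, 0.3]); y, mu = 1.0, 0.05
p = 1.0 / (1.0 + np.exp(-np.dot(beta, x)))
analytic = (y - p) * x - 2 * mu * beta  # equation (13), one entry per beta_j

h = 1e-6
numeric = np.zeros_like(beta)
for j in range(len(beta)):
    e = np.zeros_like(beta); e[j] = h
    numeric[j] = (objective(beta + e, x, y) - objective(beta - e, x, y)) / (2 * h)
print(analytic, numeric)  # the two should agree to several decimal places
```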
Proofs about Stochastic Gradient
• Depends on the convexity of the objective and how close (within ε) you want to get to the actual answer
• Best bounds depend on changing η over time and per dimension (not all features are created equal)
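Two standard ways to realize that last bullet, sketched here as illustrations: a 1/√t decay shrinks η over time, and an AdaGrad-style accumulator gives each dimension its own effective step size (neither is necessarily the scheme a particular proof uses).

```python
import numpy as np

def eta_t(eta0, t):
    """Time-decayed step size: eta_t = eta_0 / sqrt(t + 1)."""
    return eta0 / np.sqrt(t + 1)

class PerDimStep:
    """AdaGrad-style scaling: dimensions with rare, small gradients keep larger steps."""
    def __init__(self, dim, eta0=0.1):
        self.eta0 = eta0
        self.g2 = np.zeros(dim)  # running sum of squared gradients per dimension
    def step(self, grad):
        self.g2 += grad ** 2
        return self.eta0 * grad / (np.sqrt(self.g2) + 1e-8)  # scaled update to apply
```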
In class
• Your questions!
• Working through a simple example
• Preparing for the logistic regression homework