linear model (iii)

Linear Model (III)

Rong Jin

Announcement

Homework 2 is out and is due 02/05/2004 (next Tuesday)

Homework 1 is handed out

Recap: Logistic Regression Model

Assume the inputs and outputs are related by the log-linear function

$$p(y \mid x; \theta) = \frac{1}{1 + \exp\left(-y\,(x \cdot w + c)\right)}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

Estimate weights: MLE approach

$$w^* = \arg\max_w\, l(D_{train}) = \arg\max_w \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) = \arg\max_w \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left(-y_i\,(x_i \cdot w + c)\right)}$$
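To make the recap concrete, here is a minimal sketch of the model probability and the log-likelihood that MLE maximizes. The function names and the two-example toy data set are hypothetical and chosen only for illustration; labels are assumed to be +1 or -1 as in the slides.

```python
import math

def prob(y, x, w, c):
    """p(y | x; w, c) for a label y in {+1, -1} under the logistic model."""
    score = sum(xj * wj for xj, wj in zip(x, w)) + c
    return 1.0 / (1.0 + math.exp(-y * score))

def log_likelihood(data, w, c):
    """Sum of log p(y_i | x_i; w, c) over a list of (x_i, y_i) pairs."""
    return sum(math.log(prob(y, x, w, c)) for x, y in data)

# Hypothetical toy data: two features, one positive and one negative example.
toy_data = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]

# With all-zero parameters every label gets probability 0.5.
print(prob(+1, [1.0, 0.0], [0.0, 0.0], 0.0))

# Weights that agree with the labels give a higher log-likelihood.
print(log_likelihood(toy_data, [0.0, 0.0], 0.0))
print(log_likelihood(toy_data, [2.0, -2.0], 0.0))
```

The two class probabilities always sum to one, because 1/(1+exp(-z)) + 1/(1+exp(z)) = 1 for any score z.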

Example: Text Classification

Input x: a binary vector

Each word is a different dimension

xi = 0 if the ith word does not appear in the document

xi = 1 if it appears in the document

Output y: interesting document or not

+1: interesting

-1: uninteresting

Example: Text Classification

Doc 1

The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, …

Doc 2

Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people…

term    the  world  people  company  center  …
Doc 1   1    1      1       0        1       …
Doc 2   1    1      1       1        0       …
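The binary representation can be sketched in a few lines. The tokenizer below (lowercase, letters only) and the function name are assumptions made for illustration; the slides do not say how the documents were tokenized. Only the visible excerpts are used, so the "company" entry for Doc 2 comes out as 0 here, presumably because the excerpt is cut off before that word appears.

```python
import re

def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary term that occurs in the text, else 0."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if term in tokens else 0 for term in vocabulary]

vocabulary = ["the", "world", "people", "company", "center"]

# Visible excerpts of the two documents (both are truncated in the slides).
doc1 = ("The purpose of the Lady Bird Johnson Wildflower Center "
        "is to educate people around the world")
doc2 = ("Rain Bird is one of the leading irrigation manufacturers in the "
        "world, providing complete irrigation solutions for people")

print(binary_vector(doc1, vocabulary))
print(binary_vector(doc2, vocabulary))
```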

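Example 2: Text Classification

Logistic regression model

Every term ti is assigned a weight wi

$$p(y \mid d; \theta) = \frac{1}{1 + \exp\left(-y\left(\sum_i w_i t_i + c\right)\right)}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

Learning parameters: MLE approach

$$l(D_{train}) = \sum_{i=1}^{N_+} \log p(+ \mid d_i^{(+)}) + \sum_{i=1}^{N_-} \log p(- \mid d_i^{(-)})$$

$$= \sum_{i=1}^{N_+} \log \frac{1}{1 + \exp\left(-\left(\sum_{t_j \in d_i^{(+)}} w_j t_j + c\right)\right)} + \sum_{i=1}^{N_-} \log \frac{1}{1 + \exp\left(\sum_{t_j \in d_i^{(-)}} w_j t_j + c\right)}$$

where N_+ and N_- are the numbers of positive and negative training documents.

Need numerical solutions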

Example 2: Text Classification

Weight wi

wi > 0: term ti is positive evidence

wi < 0: term ti is negative evidence

wi = 0: term ti is irrelevant to whether or not the document is interesting

The larger | wi | is, the more important term ti is in determining whether the document is interesting.

Threshold c

$\sum_{t_i \in d} w_i t_i + c > 0$: more likely to be an interesting document

$\sum_{t_i \in d} w_i t_i + c < 0$: more likely to be an uninteresting document

$\sum_{t_i \in d} w_i t_i + c = 0$: decision boundary
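The decision rule is just the sign of the weighted sum plus the threshold. The sketch below applies it to the two term vectors from the earlier table; the weights and the threshold are made-up numbers for illustration, not values learned from any data.

```python
def score(doc_vector, weights, c):
    """Weighted evidence: sum_i w_i * t_i + c."""
    return sum(w * t for w, t in zip(weights, doc_vector)) + c

def classify(doc_vector, weights, c):
    """Return +1 (interesting), -1 (uninteresting), or 0 on the boundary."""
    s = score(doc_vector, weights, c)
    if s > 0:
        return +1
    if s < 0:
        return -1
    return 0

# Term order: the, world, people, company, center.
doc1 = [1, 1, 1, 0, 1]
doc2 = [1, 1, 1, 1, 0]

# Hypothetical weights: "center" is positive evidence, "company" negative,
# and the common words carry little or no weight.
weights = [0.0, 0.1, 0.2, -1.5, 2.0]
c = -0.5

print(classify(doc1, weights, c))  # positive score, so +1
print(classify(doc2, weights, c))  # negative score, so -1
```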

Example 2: Text Classification

• Dataset: Reuters-21578

• Classification accuracy

• Naïve Bayes: 77%

• Logistic regression: 88%

Why Does Logistic Regression Work Better for Text Classification?

Common words

Small weights in logistic regression

Large weights in naïve Bayes

Weight ~ p(w|+) – p(w|-)

Independence assumption

Naïve Bayes assumes that each word is generated independently

Logistic regression is able to take into account the correlation of words

Comparison

Generative Model

• Model P(x|y)
• Model the input patterns
• Usually converges fast
• Cheap computation
• Robust to noisy data

But
• Usually performs worse

Discriminative Model

• Model P(y|x) directly
• Model the decision boundary
• Usually good performance

But
• Slow convergence
• Expensive computation
• Sensitive to noisy data

Problems with Logistic Regression?

$$p(\pm \mid x; \theta) = \frac{1}{1 + \exp\left(\mp\,(c + x_1 w_1 + x_2 w_2 + \ldots + x_m w_m)\right)}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

How about words that only appear in one class?

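Overfitting Problem with Logistic Regression

Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight.

$$l(D_{train}) = \sum_{i=1}^{N_+} \log p(+ \mid d_i^{(+)}) + \sum_{i=1}^{N_-} \log p(- \mid d_i^{(-)})$$

$$= \log p(+ \mid d) + \sum_{d_i^{(+)} \neq d} \log p(+ \mid d_i^{(+)}) + \sum_{i=1}^{N_-} \log p(- \mid d_i^{(-)}) = \log p(+ \mid d) + l_+ + l_-$$

where N_+ and N_- are the numbers of positive and negative training documents.

Consider the derivative of l(Dtrain) with respect to w. Since t appears only in d, neither l_+ nor l_- depends on w:

$$\frac{\partial l(D_{train})}{\partial w} = \frac{\partial \log p(+ \mid d)}{\partial w} + \frac{\partial l_+}{\partial w} + \frac{\partial l_-}{\partial w} = \frac{1}{1 + \exp(c + x \cdot w)} + 0 + 0 > 0$$

Here x · w is the weighted sum over all terms of d, and the entry of x for t equals 1.

The derivative is always positive, so w will be infinite!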
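A small numeric check of this argument, under simplifying assumptions: the term's feature value is 1, and the contribution of every other term of d is folded into a fixed constant c. The step size and iteration counts are arbitrary choices for illustration.

```python
import math

def dl_dw(w, c=0.0):
    """Derivative of log p(+|d) with respect to w when the term's feature is 1."""
    return 1.0 / (1.0 + math.exp(c + w))

# Plain gradient ascent on the unregularized log-likelihood.
w = 0.0
history = []
for step in range(1, 10001):
    w += 1.0 * dl_dw(w)          # the derivative is positive, so w only grows
    if step in (10, 100, 1000, 10000):
        history.append((step, w))

for step, value in history:
    print(step, round(value, 3))
```

The weight keeps increasing no matter how long the loop runs; it only slows down, because the derivative approaches zero without ever reaching it.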

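Solution: Regularization

Regularized log-likelihood

$$l_{reg}(D_{train}) = l(D_{train}) - s \sum_{j=1}^{m} w_j^2 = \sum_{i=1}^{N_+} \log p(+ \mid d_i^{(+)}) + \sum_{i=1}^{N_-} \log p(- \mid d_i^{(-)}) - s \sum_{j=1}^{m} w_j^2$$

where N_+ and N_- are the numbers of positive and negative training documents.

Large weights → small weights: prevents weights from being too large

Small weights → zero: sparse weights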
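To see the effect of the penalty, consider again a term that appears in a single positive document. Assuming its feature value is 1 and everything else is fixed at zero, the regularized objective for its weight is log(1/(1+exp(-w))) - s w^2. The sketch below maximizes it by gradient ascent; the value s = 0.1, the step size, and the iteration count are illustrative assumptions.

```python
import math

def reg_objective(w, s):
    """log p(+|d) - s * w^2 for a single weight with feature value 1."""
    return -math.log(1.0 + math.exp(-w)) - s * w * w

def reg_gradient(w, s):
    """Derivative of reg_objective with respect to w."""
    return 1.0 / (1.0 + math.exp(w)) - 2.0 * s * w

s = 0.1          # regularization strength (assumed value)
w_reg = 0.0
for _ in range(2000):
    w_reg += 0.5 * reg_gradient(w_reg, s)

print(w_reg, reg_gradient(w_reg, s))
```

With the penalty in place the gradient reaches zero at a finite weight, so the maximizer is finite instead of growing without bound.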

Why Do We Need a Sparse Solution?

Two types of solutions

1. Many non-zero weights, but many of them are small

2. Only a small number of non-zero weights, and many of them are large

Occam’s Razor: the simpler the better

A simpler model that fits the data is unlikely to be a coincidence

A complicated model that fits the data might be a coincidence

Smaller number of non-zero weights → less evidence to consider → simpler model → case 2 is preferred

Occam’s Razor

[Figure: plot with horizontal axis from 0 to 5 and vertical axis from 0 to 2.5]

Occam’s Razor: Power = 1

$$y = a_1 x$$

[Figure: same axes]

Occam’s Razor: Power = 3

$$y = a_1 x + a_2 x^2 + a_3 x^3$$

[Figure: same axes]

Occam’s Razor: Power = 10

$$y = a_1 x + a_2 x^2 + \ldots + a_{10} x^{10}$$

[Figure: same axes]

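Finding Optimal Solutions

Concave objective function

No local maxima other than the global maximum

Many standard optimization algorithms work

$$l_{reg}(D_{train}) = l(D_{train}) - s \|w\|_2^2 = \sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{j=1}^{m} w_j^2 = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i\,(c + x_i \cdot w)\right)} - s \sum_{j=1}^{m} w_j^2$$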

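Gradient Ascent

Maximize the log-likelihood by iteratively adjusting the parameters in small increments

In each iteration, we adjust w in the direction that increases the log-likelihood (along the gradient)

$$w \leftarrow w + \epsilon \frac{\partial}{\partial w}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{j=1}^{m} w_j^2\right] = w + \epsilon\left[-2 s w + \sum_{i=1}^{N} x_i y_i \left(1 - p(y_i \mid x_i)\right)\right]$$

$$c \leftarrow c + \epsilon \frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{j=1}^{m} w_j^2\right] = c + \epsilon \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid x_i)\right)$$

where $\epsilon$ is the learning rate.

The sum over training examples is driven by prediction errors; the term $-2sw$ prevents the weights from being too large.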
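The update rules can be sketched as a short training loop. This is a minimal illustration, not a tuned implementation: the learning rate, the regularization strength s, the iteration count, and the function names are assumptions, and the training set is just the two term vectors from the text classification example.

```python
import math

def prob(y, x, w, c):
    """p(y | x) for y in {+1, -1} under the logistic model."""
    z = y * (sum(xj * wj for xj, wj in zip(x, w)) + c)
    return 1.0 / (1.0 + math.exp(-z))

def gradient_ascent(data, s=0.1, eps=0.1, iters=500):
    """Maximize sum_i log p(y_i|x_i) - s * sum_j w_j^2 by gradient ascent."""
    m = len(data[0][0])
    w, c = [0.0] * m, 0.0
    for _ in range(iters):
        grad_w = [-2.0 * s * wj for wj in w]   # gradient of the penalty term
        grad_c = 0.0
        for x, y in data:
            err = y * (1.0 - prob(y, x, w, c))  # prediction-error term
            for j in range(m):
                grad_w[j] += err * x[j]
            grad_c += err
        w = [wj + eps * gj for wj, gj in zip(w, grad_w)]
        c += eps * grad_c
    return w, c

# Term order: the, world, people, company, center.
data = [([1, 1, 1, 0, 1], +1),   # Doc 1, assumed interesting
        ([1, 1, 1, 1, 0], -1)]   # Doc 2, assumed uninteresting
w, c = gradient_ascent(data)
print([round(wj, 3) for wj in w], round(c, 3))
```

In this toy run the terms shared by both documents end up with weights at zero, while "center" gets a positive weight and "company" a negative one.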

Graphical Illustration

No regularization case

[Figure: panels "Using regularization" and "Without regularization"; horizontal axis: Iteration]

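When Should We Stop?

The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction:

$$\frac{\partial}{\partial w}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{j=1}^{m} w_j^2\right] = -2 s w + \sum_{i=1}^{N} x_i y_i \left(1 - p(y_i \mid x_i)\right) = 0$$

$$\frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{j=1}^{m} w_j^2\right] = \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid x_i)\right) = 0$$

In many cases, it can be very tricky

A small first-order derivative does not necessarily mean we are close to the maximum point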

Extend Logistic Regression Model to Multiple Classes

y ∈ {1,2,…,C}

How to extend the above definition to the case when the number of classes is more than 2?

$$p(\pm \mid x; \theta) = \frac{1}{1 + \exp\left(\mp\,(c + x_1 w_1 + x_2 w_2 + \ldots + x_m w_m)\right)}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

Conditional Exponential Model

It is simple!

$$p(y \mid x; \theta) \propto \exp(c_y + x \cdot w_y)$$

Ensure the probabilities sum to 1

$$p(y \mid x; \theta) = \frac{1}{Z(x)} \exp(c_y + x \cdot w_y), \qquad Z(x) = \sum_{y} \exp(c_y + x \cdot w_y)$$

Conditional Exponential Model

Prediction probability

$$p(y \mid x; \theta) = \frac{\exp(c_y + x \cdot w_y)}{\sum_{y'} \exp(c_{y'} + x \cdot w_{y'})}$$

Model parameters: for each class y, we have weights wy and threshold cy

Maximum likelihood estimation

$$l(D_{train}) = \sum_{i=1}^{N} \log p(y_i \mid x_i) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y} \exp(c_y + x_i \cdot w_y)}$$

Any problem with the above optimization problem?
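A minimal sketch of the prediction probability and the log-likelihood for the multi-class model. Classes are indexed from 0 here rather than from 1, and the function names and toy inputs are hypothetical. Subtracting the largest score before exponentiating is a common numerical safeguard that leaves the probabilities unchanged.

```python
import math

def class_probs(x, W, c):
    """p(y | x) for every class y, given per-class weights W[y] and thresholds c[y]."""
    scores = [cy + sum(xj * wj for xj, wj in zip(x, wy)) for wy, cy in zip(W, c)]
    top = max(scores)                       # shift scores for numerical stability
    exps = [math.exp(sc - top) for sc in scores]
    Z = sum(exps)                           # normalizer Z(x), up to the shift
    return [e / Z for e in exps]

def log_likelihood(data, W, c):
    """Sum of log p(y_i | x_i) over (x_i, y_i) pairs with y_i a class index."""
    return sum(math.log(class_probs(x, W, c)[y]) for x, y in data)

# Hypothetical example: 3 classes, 2 features.
W = [[1.0, -2.0], [0.5, 0.3], [-1.0, 0.7]]
c = [0.1, -0.2, 0.4]
probs = class_probs([1.0, 2.0], W, c)
print(probs, sum(probs))
```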

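Conditional Exponential Model

If we add a constant vector to every weight vector (and a constant to every threshold), we get the same log-likelihood function

$$w_y \leftarrow w_y + w_0, \qquad c_y \leftarrow c_y + c_0$$

$$l(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp\left(c_{y_i} + c_0 + x_i \cdot (w_{y_i} + w_0)\right)}{\sum_{y} \exp\left(c_y + c_0 + x_i \cdot (w_y + w_0)\right)} = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y} \exp(c_y + x_i \cdot w_y)}$$

The common factor $\exp(c_0 + x_i \cdot w_0)$ cancels between numerator and denominator.

Usually set w1 to be a zero vector and c1 to be zero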
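The invariance is easy to confirm numerically. In the sketch below all parameter values are made up; the check shifts every class by the same (w0, c0) and also pins the first class to zero, and in both cases the predicted probabilities should be unchanged up to rounding.

```python
import math

def class_probs(x, W, c):
    """p(y | x) under the conditional exponential model."""
    exps = [math.exp(cy + sum(xj * wj for xj, wj in zip(x, wy)))
            for wy, cy in zip(W, c)]
    Z = sum(exps)
    return [e / Z for e in exps]

# Hypothetical parameters: 3 classes, 2 features.
W = [[1.0, -2.0], [0.5, 0.3], [-1.0, 0.7]]
c = [0.1, -0.2, 0.4]
x = [0.8, 1.5]

# Add the same vector w0 and constant c0 to every class.
w0, c0 = [3.0, -1.5], 2.0
W_shift = [[wj + w0j for wj, w0j in zip(wy, w0)] for wy in W]
c_shift = [cy + c0 for cy in c]

# Pin the first class to zero by subtracting its own parameters from all classes.
W_pinned = [[wj - w1j for wj, w1j in zip(wy, W[0])] for wy in W]
c_pinned = [cy - c[0] for cy in c]

original = class_probs(x, W, c)
print(original)
print(class_probs(x, W_shift, c_shift))
print(class_probs(x, W_pinned, c_pinned))
```

Pinning one class removes the redundancy, so the remaining parameters are determined uniquely by the data.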

Maximum Entropy Model: Motivation

Consider a translation example

English ‘in’ → French {dans, en, à, au cours de, pendant}

Goal: p(dans), p(en), p(à), p(au-cours-de), p(pendant)

Case 1: no prior knowledge on translation

What is your guess of the probabilities?

p(dans)=p(en)=p(à)=p(au-cours-de)=p(pendant)=1/5

Case 2: 30% of the time either dans or en is used

What is your guess of the probabilities?

p(dans)=p(en)=3/20, p(à)=p(au-cours-de)=p(pendant)=7/30

Uniform distribution is favored

Maximum Entropy Model: Motivation

Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used

What is your guess of the probabilities?

How to measure the uniformity of any distribution?

Maximum Entropy Principle (MaxEnt)

The uniformity of a distribution is measured by the entropy of the distribution

$$P^* = \arg\max_P H(P)$$

where

$$H(P) = -p(\text{dans}) \log p(\text{dans}) - p(\text{en}) \log p(\text{en}) - p(\text{à}) \log p(\text{à}) - p(\text{au cours de}) \log p(\text{au cours de}) - p(\text{pendant}) \log p(\text{pendant})$$

subject to

$$p(\text{dans}) + p(\text{en}) = 3/10$$

$$p(\text{dans}) + p(\text{à}) = 1/2$$

$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au cours de}) + p(\text{pendant}) = 1$$

Solution (to one decimal place): p(dans) ≈ 0.2, p(à) ≈ 0.3, p(en) ≈ 0.1, p(au-cours-de) ≈ 0.2, p(pendant) ≈ 0.2
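Case 3 can be checked numerically, since only one quantity is free once the constraints are substituted in. The sketch below assumes, by symmetry, that au cours de and pendant receive equal probability (the constraints treat them identically), expresses every probability in terms of a = p(dans), and runs a ternary search on the entropy, which is concave in a. The function names and the search routine are illustrative choices.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

def distribution(a):
    """Probabilities (dans, en, à, au cours de, pendant) for a given a = p(dans).

    Uses p(dans)+p(en) = 0.3 and p(dans)+p(à) = 0.5; the remaining mass is
    split equally between the last two words (symmetry assumption).
    """
    rest = (0.2 + a) / 2.0
    return [a, 0.3 - a, 0.5 - a, rest, rest]

# Ternary search for the maximizer of the concave entropy on 0 < a < 0.3.
lo, hi = 1e-9, 0.3 - 1e-9
for _ in range(200):
    m1 = lo + (hi - lo) / 3.0
    m2 = hi - (hi - lo) / 3.0
    if entropy(distribution(m1)) < entropy(distribution(m2)):
        lo = m1
    else:
        hi = m2

best = distribution((lo + hi) / 2.0)
print([round(q, 4) for q in best])
print([round(q, 1) for q in best])
```

Rounded to one decimal place, the maximizer agrees with the values quoted on the slide; the unrounded maximizer has slightly higher entropy than the rounded values.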

MaxEnt for Classification Problems

$$\max_{p} H(y \mid x) = \max_{p} \; -\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

Requiring the first-order moment to be consistent between the empirical data and the model prediction

No assumption about the parametric form for the likelihood

Usually assume it is $C^n$ continuous

What is the solution for $p(y \mid x)$?

Solution to MaxEnt

Surprisingly, the solution is just the conditional exponential model without thresholds

$$p(y \mid x; \theta) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}$$

Why?

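Solution to MaxEnt

Solve the maximum entropy problem, i.e.

$$\max_{p} H(y \mid x) = \max_{p} \; -\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)$$

$$\text{subject to} \quad \sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

To maximize the entropy function under the constraints, we can introduce a set of Lagrangian multipliers into the objective function, i.e.,

$$G = H(y \mid x) + \sum_{y} \lambda_y \cdot \left[\sum_{i=1}^{N} p(y \mid x_i)\, x_i - \sum_{i=1}^{N} x_i\, \delta(y, y_i)\right] + \sum_{i=1}^{N} \mu_i \left[\sum_{y} p(y \mid x_i) - 1\right]$$

Setting the first derivative of G with respect to $p(y \mid x_i)$ to zero, we have the equations

$$\frac{\partial G}{\partial p(y \mid x_i)} = -\log p(y \mid x_i) - 1 + \lambda_y \cdot x_i + \mu_i = 0 \;\Rightarrow\; p(y \mid x_i) \propto \exp(\lambda_y \cdot x_i)$$

Since $\sum_{y} p(y \mid x_i) = 1$, we have $p(y \mid x_i)$ as

$$p(y \mid x_i) = \frac{\exp(\lambda_y \cdot x_i)}{\sum_{y'} \exp(\lambda_{y'} \cdot x_i)}$$

which is the conditional exponential model without thresholds, with the multipliers $\lambda_y$ playing the role of the weights $w_y$.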
