![Page 1: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/1.jpg)
Linear Model (III)
Rong Jin
![Page 2: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/2.jpg)
Announcement
• Homework 2 is out and is due 02/05/2004 (next Tuesday)
• Homework 1 is handed out
![Page 3: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/3.jpg)
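Recap: Logistic Regression Model
• Assume the inputs and outputs are related by the log-linear function

$$p(y \mid x; \theta) = \frac{1}{1 + \exp\left[-y\,(x \cdot w + c)\right]}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

• Estimate weights: MLE approach

$$(w^*, c^*) = \arg\max_{w, c}\; l(D_{train}) = \arg\max_{w, c} \sum_{i=1}^{n} \log p(y_i \mid x_i; \theta) = \arg\max_{w, c} \sum_{i=1}^{n} \log \frac{1}{1 + \exp\left[-y_i\,(x_i \cdot w + c)\right]}$$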
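A minimal sketch of the recap above, assuming labels in {+1, -1} and plain Python lists for x and w. The function names and the toy data are hypothetical and only illustrate how the log-likelihood is evaluated for a given (w, c).

```python
import math

def log_prob(y, x, w, c):
    """log p(y | x; w, c) for y in {+1, -1} under the logistic model."""
    z = y * (sum(xj * wj for xj, wj in zip(x, w)) + c)
    # log(1 / (1 + exp(-z))), split by sign of z to avoid overflow
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(data, w, c):
    """Sum of log p(y_i | x_i) over (x_i, y_i) pairs."""
    return sum(log_prob(y, x, w, c) for x, y in data)

# Hypothetical toy data: two binary features, labels in {+1, -1}
data = [([1, 0], +1), ([0, 1], -1), ([1, 1], +1)]
ll_zero = log_likelihood(data, [0.0, 0.0], 0.0)   # every p(y|x) equals 0.5
ll_good = log_likelihood(data, [2.0, -1.0], 0.0)  # weights that fit the labels
```

MLE looks for the (w, c) that makes this sum as large as possible; here the second setting scores higher than the all-zero one.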
![Page 4: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/4.jpg)
Example: Text Classification
• Input x: a binary vector
  • Each word is a different dimension
  • $x_i = 0$ if the ith word does not appear in the document
  • $x_i = 1$ if it appears in the document
• Output y: interesting document or not
  • +1: interesting
  • -1: uninteresting
![Page 5: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/5.jpg)
Example: Text Classification
Doc 1
The purpose of the Lady Bird Johnson Wildflower Center is to educate people around the world, …
Doc 2
Rain Bird is one of the leading irrigation manufacturers in the world, providing complete irrigation solutions for people…
| term | the | world | people | company | center | … |
|---|---|---|---|---|---|---|
| Doc 1 | 1 | 1 | 1 | 0 | 1 | … |
| Doc 2 | 1 | 1 | 1 | 1 | 0 | … |
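The binary representation can be sketched as follows. The tokenizer (whitespace split, punctuation stripped, lowercased) and the function name are assumptions made for illustration; only the excerpt of Doc 1 shown on the slide is used.

```python
def binary_vector(text, vocabulary):
    """Return x with x[i] = 1 if vocabulary[i] occurs in text, else 0."""
    words = {token.strip(".,;:!?…").lower() for token in text.split()}
    return [1 if term in words else 0 for term in vocabulary]

vocabulary = ["the", "world", "people", "company", "center"]
doc1 = ("The purpose of the Lady Bird Johnson Wildflower Center "
        "is to educate people around the world, …")
x1 = binary_vector(doc1, vocabulary)
```

For this excerpt, `x1` reproduces the Doc 1 row for the five listed terms.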
![Page 6: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/6.jpg)
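Example 2: Text Classification
• Logistic regression model: every term $t_i$ is assigned a weight $w_i$

$$p(y \mid d; \theta) = \frac{1}{1 + \exp\left[-y\left(\sum_i w_i t_i + c\right)\right]}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

• Learning parameters: MLE approach

$$l(D_{train}) = \sum_{j=1}^{N^{(+)}} \log p(+ \mid d_j) + \sum_{j=1}^{N^{(-)}} \log p(- \mid d_j) = \sum_{j=1}^{N^{(+)}} \log \frac{1}{1 + \exp\left[-\left(\sum_i w_i t_i + c\right)\right]} + \sum_{j=1}^{N^{(-)}} \log \frac{1}{1 + \exp\left[\sum_i w_i t_i + c\right]}$$

where, in each summand, the $t_i$ are the term indicators of document $d_j$.
• Need numerical solutions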
![Page 7: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/7.jpg)
Example 2: Text Classification
• Weight $w_i$
  • $w_i > 0$: term $t_i$ is positive evidence
  • $w_i < 0$: term $t_i$ is negative evidence
  • $w_i = 0$: term $t_i$ is irrelevant to whether or not the document is interesting
  • The larger $|w_i|$ is, the more important term $t_i$ is in determining whether the document is interesting
• Threshold $c$
  • $\sum_{t_i \in d} w_i t_i + c > 0$: more likely to be an interesting document
  • $\sum_{t_i \in d} w_i t_i + c < 0$: more likely to be an uninteresting document
  • $\sum_{t_i \in d} w_i t_i + c = 0$: decision boundary
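A small sketch of this decision rule, assuming binary term indicators so that the sum reduces to adding the weights of the terms present in the document. The weights, the threshold, and the function names are hypothetical.

```python
def score(doc_terms, weights, c):
    """Sum of w_i over the distinct terms present in the document, plus c."""
    return sum(weights.get(t, 0.0) for t in set(doc_terms)) + c

def classify(doc_terms, weights, c):
    """+1 for interesting, -1 for uninteresting, 0 on the decision boundary."""
    s = score(doc_terms, weights, c)
    if s > 0:
        return +1
    if s < 0:
        return -1
    return 0

# Hypothetical weights: positive evidence, negative evidence, irrelevant term
weights = {"wildflower": 1.5, "irrigation": -2.0, "the": 0.0}
c = -0.5
label_a = classify(["the", "wildflower", "center"], weights, c)
label_b = classify(["the", "irrigation"], weights, c)
```

Terms missing from `weights` are treated as having weight zero, which matches the "irrelevant term" case.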
![Page 8: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/8.jpg)
Example 2: Text Classification
• Dataset: Reuters-21578
• Classification accuracy
• Naïve Bayes: 77%
• Logistic regression: 88%
![Page 9: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/9.jpg)
Why Does Logistic Regression Work Better for Text Classification?
• Common words
  • Small weights in logistic regression
  • Large weights in naïve Bayes: weight ~ $p(w \mid +) - p(w \mid -)$
• Independence assumption
  • Naïve Bayes assumes that each word is generated independently
  • Logistic regression is able to take the correlation between words into account
![Page 10: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/10.jpg)
Comparison
Generative Model
• Model P(x|y)
• Model the input patterns
• Usually converges fast
• Cheap computation
• Robust to noisy data
But
• Usually performs worse
Discriminative Model
• Model P(y|x) directly
• Model the decision boundary
• Usually good performance
But
• Slow convergence
• Expensive computation
• Sensitive to noisy data
![Page 11: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/11.jpg)
Problems with Logistic Regression?

$$p(\pm \mid x; \theta) = \frac{1}{1 + \exp\left[\mp\,(c + x_1 w_1 + x_2 w_2 + \cdots + x_m w_m)\right]}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

How about words that appear in only one class?
![Page 12: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/12.jpg)
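Overfitting Problem with Logistic Regression
• Consider a word t that appears in only one document d, and d is a positive document. Let w be its associated weight.

$$l(D_{train}) = \sum_{i=1}^{N^{(+)}} \log p(+ \mid d_i) + \sum_{i=1}^{N^{(-)}} \log p(- \mid d_i) = \log p(+ \mid d) + \sum_{d_i \neq d} \log p(+ \mid d_i) + \sum_{i=1}^{N^{(-)}} \log p(- \mid d_i) = \log p(+ \mid d) + l_+ + l_-$$

• Consider the derivative of $l(D_{train})$ with respect to w. Because t appears only in d, neither $l_+$ nor $l_-$ depends on w:

$$\frac{\partial l(D_{train})}{\partial w} = \frac{\partial \log p(+ \mid d)}{\partial w} + \frac{\partial l_+}{\partial w} + \frac{\partial l_-}{\partial w} = \frac{1}{1 + \exp(c + x \cdot w)} + 0 + 0 > 0$$

where $x \cdot w$ inside the exponential denotes the inner product over all terms of d.
• The derivative is positive for every finite w, so the likelihood keeps increasing as w grows: w will be infinite!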
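The runaway weight can be illustrated numerically. This is a sketch under simplifying assumptions: the word has indicator x_t = 1 in d, the contribution of c and all other terms is folded into a hypothetical constant `rest`, and the step size of 1.0 is arbitrary.

```python
import math

def dlogp_dw(w, rest=0.0):
    """d/dw log p(+|d) = 1 / (1 + exp(rest + w)) when x_t = 1."""
    z = rest + w
    if z >= 0:
        e = math.exp(-z)          # equivalent form that avoids overflow
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

w = 0.0
history = []
for step in range(5000):
    w += 1.0 * dlogp_dw(w)        # plain gradient ascent on the single weight
    if step in (9, 99, 999, 4999):
        history.append(w)         # snapshots after 10, 100, 1000, 5000 steps
```

The snapshots in `history` keep increasing: the gradient shrinks but never changes sign, so there is no finite weight at which the ascent settles.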
![Page 13: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/13.jpg)
Solution: Regularization
• Regularized log-likelihood

$$l_{reg}(D_{train}) = l(D_{train}) - s\,\|w\|_2^2 = \sum_{i=1}^{N^{(+)}} \log p(+ \mid d_i) + \sum_{i=1}^{N^{(-)}} \log p(- \mid d_i) - s \sum_{i=1}^{m} w_i^2$$

• Large weights → small weights: prevents weights from being too large
• Small weights → zero: sparse weights
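To see the effect of the penalty on a single weight, the sketch below solves for the point where the regularized gradient vanishes. It assumes the same simplified setting as the overfitting slide (one word with x_t = 1, everything else fixed at zero), so the objective is log p(+|d) - s w^2; the function names and the values of s are hypothetical.

```python
import math

def reg_gradient(w, s):
    """d/dw [log p(+|d) - s * w**2] with x_t = 1 and all else fixed at 0."""
    return 1.0 / (1.0 + math.exp(w)) - 2.0 * s * w

def solve(s, lo=0.0, hi=50.0, iters=100):
    """Bisection for the root of the gradient, which is decreasing in w."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if reg_gradient(mid, s) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

w_weak = solve(s=0.001)    # light penalty
w_strong = solve(s=0.1)    # heavier penalty
```

Both optima are finite, and the heavier penalty gives the smaller weight, in contrast to the unregularized case where the weight grows without bound.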
![Page 14: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/14.jpg)
Why Do We Need a Sparse Solution?
• Two types of solutions
  1. Many non-zero weights, but many of them are small
  2. Only a small number of non-zero weights, and many of them are large
• Occam's Razor: the simpler the better
  • A simpler model that fits the data is unlikely to be a coincidence
  • A complicated model that fits the data might be a coincidence
• Smaller number of non-zero weights → less evidence to consider → simpler model → case 2 is preferred
![Page 15: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/15.jpg)
Occam's Razor
[Figure: plot; horizontal axis from 0 to 5, vertical axis from 0 to 2.5]
![Page 16: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/16.jpg)
Occam's Razor: Power = 1

$$y = a_1 x$$

[Figure: plot; horizontal axis from 0 to 5, vertical axis from 0 to 2.5]
![Page 17: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/17.jpg)
Occam's Razor: Power = 3

$$y = a_1 x + a_2 x^2 + a_3 x^3$$

[Figure: plot; horizontal axis from 0 to 5, vertical axis from 0 to 2.5]
![Page 18: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/18.jpg)
Occam's Razor: Power = 10

$$y = a_1 x + a_2 x^2 + \cdots + a_{10} x^{10}$$

[Figure: plot; horizontal axis from 0 to 5, vertical axis from 0 to 2.5]
![Page 19: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/19.jpg)
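Finding Optimal Solutions

$$l_{reg}(D_{train}) = l(D_{train}) - s\,\|w\|_2^2 = \sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{i=1}^{m} w_i^2 = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left[-y_i\,(c + x_i \cdot w)\right]} - s \sum_{i=1}^{m} w_i^2$$

• Concave objective function: no local maximum other than the global one
• Many standard optimization algorithms work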
![Page 20: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/20.jpg)
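Gradient Ascent
• Maximize the log-likelihood by iteratively adjusting the parameters in small increments
• In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient)

$$w \leftarrow w + \epsilon \frac{\partial}{\partial w}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{i=1}^{m} w_i^2\right] = w + \epsilon\left[-2 s w + \sum_{i=1}^{N} x_i y_i \left(1 - p(y_i \mid x_i)\right)\right]$$

$$c \leftarrow c + \epsilon \frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{i=1}^{m} w_i^2\right] = c + \epsilon \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid x_i)\right)$$

where $\epsilon$ is the learning rate.
• The term $\sum_i x_i y_i (1 - p(y_i \mid x_i))$ reflects the prediction errors
• The term $-2 s w$ prevents the weights from being too large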
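The update rules can be sketched in plain Python as below. This is an illustrative implementation, not the lecture's code: labels are assumed to be in {+1, -1}, the penalty is s times the sum of squared weights (so its gradient is -2 s w), and the toy data, `s`, `eps`, and the iteration count are hypothetical choices.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient_ascent(data, s=0.01, eps=0.1, iters=500):
    """Maximize sum_i log p(y_i|x_i) - s * sum_j w_j**2 over w and c."""
    m = len(data[0][0])
    w, c = [0.0] * m, 0.0
    for _ in range(iters):
        grad_w = [-2.0 * s * wj for wj in w]   # keeps weights from growing
        grad_c = 0.0                           # the threshold is not penalized
        for x, y in data:
            z = c + sum(xj * wj for xj, wj in zip(x, w))
            err = y * (1.0 - sigmoid(y * z))   # y_i * (1 - p(y_i | x_i))
            for j in range(m):
                grad_w[j] += err * x[j]
            grad_c += err
        w = [wj + eps * gj for wj, gj in zip(w, grad_w)]
        c += eps * grad_c
    return w, c

# Hypothetical toy data: the label is +1 exactly when the first feature is 1
data = [([1, 0], +1), ([1, 1], +1), ([0, 1], -1), ([0, 0], -1)]
w, c = gradient_ascent(data)
margins = [y * (c + sum(xj * wj for xj, wj in zip(x, w))) for x, y in data]
```

With these settings every margin ends up positive, so all four toy examples are classified correctly while the weights stay bounded.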
![Page 21: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/21.jpg)
Graphical Illustration
No regularization case
![Page 22: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/22.jpg)
[Figure: two panels, "Using regularization" and "Without regularization"; horizontal axis: Iteration]
![Page 23: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/23.jpg)
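When Should We Stop?
• The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction:

$$\frac{\partial}{\partial w}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{i=1}^{m} w_i^2\right] = -2 s w + \sum_{i=1}^{N} x_i y_i \left(1 - p(y_i \mid x_i)\right) = 0$$

$$\frac{\partial}{\partial c}\left[\sum_{i=1}^{N} \log p(y_i \mid x_i) - s \sum_{i=1}^{m} w_i^2\right] = \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid x_i)\right) = 0$$

• In many cases, it can be very tricky: the first-order derivative is already small close to the maximum point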
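In practice the condition "gradient equals zero" is replaced by a tolerance plus an iteration cap. The sketch below shows one such stopping rule on a hypothetical concave objective; the tolerance, step size, and objective are assumptions for illustration and not values from the lecture.

```python
def ascend(grad, theta, eps=0.1, tol=1e-6, max_iters=10_000):
    """Gradient ascent that stops once every gradient entry is below tol."""
    for it in range(max_iters):
        g = grad(theta)
        if max(abs(gj) for gj in g) < tol:
            return theta, it, True             # converged within tolerance
        theta = [tj + eps * gj for tj, gj in zip(theta, g)]
    return theta, max_iters, False             # stopped by the iteration cap

# Hypothetical concave objective f(t) = -(t0 - 3)^2 - (t1 + 1)^2
def grad_f(t):
    return [-2.0 * (t[0] - 3.0), -2.0 * (t[1] + 1.0)]

theta, iters, converged = ascend(grad_f, [0.0, 0.0])
```

Because the gradient shrinks as the maximum is approached, the steps shrink too; a loose tolerance stops early and a tight one can take many iterations, which is part of why choosing the stopping point is tricky.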
![Page 24: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/24.jpg)
Extend Logistic Regression Model to Multiple Classes
• $y \in \{1, 2, \ldots, C\}$

$$p(\pm \mid x; \theta) = \frac{1}{1 + \exp\left[\mp\,(c + x_1 w_1 + x_2 w_2 + \cdots + x_m w_m)\right]}, \qquad \theta = \{w_1, w_2, \ldots, w_m, c\}$$

• How can the above definition be extended to the case when the number of classes is more than 2?
![Page 25: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/25.jpg)
Conditional Exponential Model
• It is simple!

$$p(y \mid x; \theta) \propto \exp(c_y + x \cdot w_y)$$

• Ensure that the probabilities sum to 1

$$p(y \mid x; \theta) = \frac{1}{Z(x)} \exp(c_y + x \cdot w_y), \qquad Z(x) = \sum_{y} \exp(c_y + x \cdot w_y)$$
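A minimal sketch of the normalized model, assuming one weight list `W[y]` and one threshold `c[y]` per class. The names and the example parameters are hypothetical. Subtracting the largest score before exponentiating is a common numerical safeguard that leaves the ratios unchanged.

```python
import math

def class_probs(x, W, c):
    """p(y|x) for every class y, with weights W[y] and threshold c[y]."""
    scores = [cy + sum(xj * wj for xj, wj in zip(x, wy))
              for wy, cy in zip(W, c)]
    top = max(scores)                          # shift for numerical stability
    exps = [math.exp(sc - top) for sc in scores]
    Z = sum(exps)                              # normalization factor Z(x)
    return [e / Z for e in exps]

# Hypothetical three-class example with two features
W = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
c = [0.0, 0.0, 0.0]
probs = class_probs([1.0, 0.0], W, c)
uniform = class_probs([1.0, 0.0], [[0.0, 0.0]] * 3, c)
```

Dividing by Z(x) is what makes the outputs a probability distribution; with all parameters at zero the result is uniform over the classes.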
![Page 26: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/26.jpg)
Conditional Exponential Model
• Prediction probability

$$p(y \mid x; \theta) = \frac{\exp(c_y + x \cdot w_y)}{\sum_{y'} \exp(c_{y'} + x \cdot w_{y'})}$$

• Model parameters: for each class y, we have weights $w_y$ and threshold $c_y$
• Maximum likelihood estimation

$$l(D_{train}) = \sum_{i=1}^{N} \log p(y_i \mid x_i) = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y} \exp(c_y + x_i \cdot w_y)}$$

• Any problem with the above optimization problem?
![Page 27: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/27.jpg)
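Conditional Exponential Model
• Add a constant vector to every weight vector (and a constant to every threshold), and we have the same log-likelihood function

$$w_y \leftarrow w_y + w_0, \qquad c_y \leftarrow c_y + c_0$$

$$l(D_{train}) = \sum_{i=1}^{N} \log \frac{\exp\left(c_{y_i} + c_0 + x_i \cdot (w_{y_i} + w_0)\right)}{\sum_{y} \exp\left(c_y + c_0 + x_i \cdot (w_y + w_0)\right)} = \sum_{i=1}^{N} \log \frac{\exp(c_{y_i} + x_i \cdot w_{y_i})}{\sum_{y} \exp(c_y + x_i \cdot w_y)}$$

• Usually set $w_1$ to be a zero vector and $c_1$ to be zero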
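The invariance can be checked numerically. The sketch below uses hypothetical parameter values: it shifts every class by the same (w0, c0) and compares the predicted probabilities, then removes the redundancy by subtracting the first class's parameters from all classes.

```python
import math

def class_probs(x, W, c):
    """Conditional exponential model: one weight list and threshold per class."""
    scores = [cy + sum(xj * wj for xj, wj in zip(x, wy))
              for wy, cy in zip(W, c)]
    Z = sum(math.exp(sc) for sc in scores)
    return [math.exp(sc) / Z for sc in scores]

# Hypothetical parameters for three classes and two features
W = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
c = [0.1, -0.2, 0.4]
w0, c0 = [2.0, -3.0], 1.5
x = [1.0, 2.0]

# Same shift applied to every class
W_shift = [[wj + w0j for wj, w0j in zip(wy, w0)] for wy in W]
c_shift = [cy + c0 for cy in c]
p = class_probs(x, W, c)
p_shift = class_probs(x, W_shift, c_shift)

# Pin the first class to zero by subtracting its parameters everywhere
W_fixed = [[wj - w1j for wj, w1j in zip(wy, W[0])] for wy in W]
c_fixed = [cy - c[0] for cy in c]
p_fixed = class_probs(x, W_fixed, c_fixed)
```

All three parameter sets give the same probabilities, which is why one class can be fixed at zero without losing any expressiveness.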
![Page 28: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/28.jpg)
Maximum Entropy Model: Motivation
• Consider a translation example
  • English 'in' → French {dans, en, à, au cours de, pendant}
  • Goal: p(dans), p(en), p(à), p(au-cours-de), p(pendant)
• Case 1: no prior knowledge on translation
  • What is your guess of the probabilities?
  • p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
• Case 2: 30% of the time either dans or en is used
  • What is your guess of the probabilities?
  • p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30
• Uniform distribution is favored
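The Case 2 numbers follow from spreading each block of probability mass evenly over the outcomes it covers. A short check with exact fractions, where the helper name is hypothetical:

```python
from fractions import Fraction

def spread_evenly(total, outcomes):
    """Split a probability mass equally among the given outcomes."""
    return {o: Fraction(total) / len(outcomes) for o in outcomes}

# Case 2: dans and en together take 3/10; the other three share the rest
p = {}
p.update(spread_evenly(Fraction(3, 10), ["dans", "en"]))
p.update(spread_evenly(1 - Fraction(3, 10), ["à", "au cours de", "pendant"]))
```

Within each group the distribution is uniform, which is the "as uniform as the constraints allow" intuition behind the guess.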
![Page 29: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/29.jpg)
Maximum Entropy Model: Motivation
• Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used
  • What is your guess of the probabilities?
• How can the uniformity of a distribution be measured?
![Page 30: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/30.jpg)
Maximum Entropy Principle (MaxEnt)
• The uniformity of a distribution is measured by the entropy of the distribution

$$P^* = \arg\max_{P} H(P)$$

where

$$H(P) = -p(\text{dans}) \log p(\text{dans}) - p(\text{en}) \log p(\text{en}) - p(\text{à}) \log p(\text{à}) - p(\text{au-cours-de}) \log p(\text{au-cours-de}) - p(\text{pendant}) \log p(\text{pendant})$$

subject to

$$p(\text{dans}) + p(\text{en}) = 3/10$$
$$p(\text{dans}) + p(\text{à}) = 1/2$$
$$p(\text{dans}) + p(\text{en}) + p(\text{à}) + p(\text{au-cours-de}) + p(\text{pendant}) = 1$$

• Solution (rounded to one decimal place): p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
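The constrained problem can be solved numerically with a one-dimensional search. The sketch below relies on two observations that are not spelled out on the slide: the two equality constraints leave p(dans) as the only free quantity among the first three words, and au-cours-de and pendant enter the problem symmetrically, so the entropy maximizer gives them equal mass. The ternary search is a stand-in for any standard optimizer.

```python
import math

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def distribution(d):
    """[dans, en, à, au-cours-de, pendant] as a function of d = p(dans)."""
    en, a = 0.3 - d, 0.5 - d                   # from the two constraints
    rest = (1.0 - d - en - a) / 2.0            # remaining mass, split equally
    return [d, en, a, rest, rest]

# Ternary search on the concave map d -> H(distribution(d)), 0 < d < 0.3
lo, hi = 0.0, 0.3
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if entropy(distribution(m1)) < entropy(distribution(m2)):
        lo = m1
    else:
        hi = m2
best = distribution(0.5 * (lo + hi))
rounded = [round(q, 1) for q in best]
```

The maximizer has p(dans) close to 0.186, and rounding each probability to one decimal place gives the values quoted on the slide.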
![Page 31: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/31.jpg)
MaxEnt for Classification Problems

$$\max_{p} H(y \mid x) = \max_{p} \; -\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

• Requiring the first-order moment to be consistent between the empirical data and the model prediction
• No assumption about the parametric form of the likelihood $p(y \mid x)$
  • Usually assume it is $C^n$ continuous
• What is the solution for $p(y \mid x)$?
![Page 32: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/32.jpg)
Solution to MaxEnt
• Surprisingly, the solution is just the conditional exponential model without thresholds

$$p(y \mid x; \theta) = \frac{\exp(x \cdot w_y)}{\sum_{y'} \exp(x \cdot w_{y'})}$$

• Why?
![Page 33: Linear Model (III)](https://reader035.vdocument.in/reader035/viewer/2022062321/56813d19550346895da6dab4/html5/thumbnails/33.jpg)
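Solution to MaxEnt
Solve the maximum entropy problem, i.e.,

$$\max_{p} H(y \mid x) = \max_{p} \; -\sum_{i=1}^{N} \sum_{y} p(y \mid x_i) \log p(y \mid x_i)$$

subject to

$$\sum_{i=1}^{N} p(y \mid x_i)\, x_i = \sum_{i=1}^{N} x_i\, \delta(y, y_i), \qquad \sum_{y} p(y \mid x_i) = 1$$

To maximize the entropy function under the constraints, we can introduce a set of Lagrangian multipliers into the objective function, i.e.,

$$G = H(y \mid x) + \sum_{y} \lambda_y \cdot \left[\sum_{i=1}^{N} p(y \mid x_i)\, x_i - \sum_{i=1}^{N} x_i\, \delta(y, y_i)\right] + \sum_{i=1}^{N} \mu_i \left[\sum_{y} p(y \mid x_i) - 1\right]$$

Setting the first derivative of G with respect to $p(y \mid x_i)$ to zero, we have the equations

$$\frac{\partial G}{\partial p(y \mid x_i)} = -\log p(y \mid x_i) - 1 + \lambda_y \cdot x_i + \mu_i = 0 \;\Longrightarrow\; p(y \mid x_i) \propto \exp(\lambda_y \cdot x_i)$$

Since $\sum_{y} p(y \mid x_i) = 1$, we have $p(y \mid x_i)$ as

$$p(y \mid x_i) = \frac{\exp(\lambda_y \cdot x_i)}{\sum_{y'} \exp(\lambda_{y'} \cdot x_i)}$$

which is the conditional exponential model without thresholds, with $w_y = \lambda_y$.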