Post on 22-Dec-2015
Unconstrained Optimization
Rong Jin
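
Logistic Regression

$$
l_{reg}(D_{train}) = l(D_{train}) - s \sum_{i=1}^{m} w_i^2 = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i (b + \mathbf{x}_i \cdot \mathbf{w})\right)} - s \sum_{i=1}^{m} w_i^2
$$

The optimization problem is to find the weights w and the threshold b that maximize the above regularized log-likelihood.
How can this be done efficiently?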
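
As a minimal sketch of the objective above, the regularized log-likelihood can be evaluated directly. The function names, the list-based data layout, and the label convention y_i in {-1, +1} are assumptions made for this illustration, not part of the slides.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reg_log_likelihood(w, b, X, y, s):
    """sum_i log p(y_i | x_i) - s * sum_j w_j^2, with labels y_i in {-1, +1}."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * (b + sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total += log_sigmoid(margin)
    return total - s * sum(w_j * w_j for w_j in w)

# Hypothetical toy data: two instances with two features each
value = reg_log_likelihood([0.0, 0.0], 0.0, [[1.0, 2.0], [3.0, 4.0]], [1, -1], 0.1)
```

With all weights at zero every instance has probability 0.5, so the value is simply N times log(0.5).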
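
Gradient Ascent
- Compute the gradient
- Increase the weights w and the threshold b in the gradient direction

$$
\mathbf{w} \leftarrow \mathbf{w} + \epsilon \frac{\partial}{\partial \mathbf{w}} \left[ \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2 \right]
$$

$$
b \leftarrow b + \epsilon \frac{\partial}{\partial b} \left[ \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2 \right]
$$

where $\epsilon$ is the learning rate. The two derivatives are

$$
\frac{\partial}{\partial \mathbf{w}} \left[ \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2 \right] = -2 s \mathbf{w} + \sum_{i=1}^{N} \mathbf{x}_i y_i \left(1 - p(y_i \mid \mathbf{x}_i)\right)
$$

$$
\frac{\partial}{\partial b} \left[ \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) - s \sum_{i=1}^{m} w_i^2 \right] = \sum_{i=1}^{N} y_i \left(1 - p(y_i \mid \mathbf{x}_i)\right)
$$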
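
The update rule above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: labels are in {-1, +1}, the penalty is s times the sum of squared weights (so its gradient is 2sw), and the toy data, the fixed learning rate `eps`, and the iteration count are hypothetical choices.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient(w, b, X, y, s):
    """Gradient of the regularized log-likelihood with respect to w and b."""
    grad_w = [-2.0 * s * w_j for w_j in w]
    grad_b = 0.0
    for x_i, y_i in zip(X, y):
        z = b + sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        coeff = y_i * (1.0 - sigmoid(y_i * z))   # y_i * (1 - p(y_i | x_i))
        for j, x_ij in enumerate(x_i):
            grad_w[j] += coeff * x_ij
        grad_b += coeff
    return grad_w, grad_b

def gradient_ascent(X, y, s=0.1, eps=0.1, n_iter=500):
    """Repeatedly move w and b a step of size eps along the gradient."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(n_iter):
        grad_w, grad_b = gradient(w, b, X, y, s)
        w = [w_j + eps * g_j for w_j, g_j in zip(w, grad_w)]
        b += eps * grad_b
    return w, b

# Hypothetical toy data: two positive and two negative instances
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w, b = gradient_ascent(X, y)
```

The fixed step size works here only because the toy problem is small and well scaled; choosing it in general is exactly the difficulty the next slide discusses.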
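
Problem with Gradient Ascent
- Difficult to find the appropriate step size
  - Small: slow convergence
  - Large: oscillation or "bubbling"
- Convergence conditions
  - Robbins-Monro conditions on the step sizes $\epsilon_t$:

$$
\sum_{t=0}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=0}^{\infty} \epsilon_t^2 < \infty
$$

  - Together with a "regular" objective function, these conditions ensure convergence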
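
One schedule that satisfies both Robbins-Monro conditions is a step size proportional to 1/(t+1): the sum of the steps diverges (harmonic series) while the sum of their squares converges. The sketch below applies it to a noisy gradient; the objective f(x) = -(x - 3)^2, the noise level, and the constant `eps0` are hypothetical choices for illustration.

```python
import random

def decaying_step(t, eps0=0.5):
    """Step size eps_t = eps0 / (t + 1), which meets both conditions."""
    return eps0 / (t + 1)

def noisy_gradient(x, rng):
    """Gradient of f(x) = -(x - 3)^2 corrupted by zero-mean Gaussian noise."""
    return -2.0 * (x - 3.0) + rng.gauss(0.0, 1.0)

rng = random.Random(0)   # fixed seed so the run is repeatable
x = 0.0
for t in range(20000):
    x += decaying_step(t) * noisy_gradient(x, rng)
```

Despite the noise, the iterate settles near the maximizer x = 3 because the shrinking steps average the noise out.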
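
Newton Method
- Utilizes the second-order derivative
- Expand the objective function to the second order around $x_0$:

$$
f(x) \approx f(x_0) + a (x - x_0) + \frac{b}{2} (x - x_0)^2, \qquad a = f'(x)\big|_{x = x_0}, \quad b = f''(x)\big|_{x = x_0}
$$

- The minimum point of this approximation is

$$
x = x_0 - a / b
$$

- Newton method for optimization:

$$
x^{new} = x^{old} - \frac{f'(x)\big|_{x = x^{old}}}{f''(x)\big|_{x = x^{old}}}
$$

- Converges when the objective function is convex, provided the iteration starts close enough to the minimum (far from it, a full Newton step can overshoot)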
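
The one-dimensional update can be sketched as follows. The helper name `newton_1d`, the stopping rule, and the test function f(x) = exp(x) - 2x (convex, minimized at x = ln 2) are assumptions chosen for illustration.

```python
import math

def newton_1d(f_prime, f_second, x0, tol=1e-10, max_iter=50):
    """Iterate x_new = x_old - f'(x_old) / f''(x_old) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_second(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical convex function f(x) = exp(x) - 2x
x_star = newton_1d(lambda x: math.exp(x) - 2.0,   # f'(x)
                   lambda x: math.exp(x),         # f''(x)
                   x0=0.0)
```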
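
Multivariate Newton Method
- The objective function comprises multiple variables
  - Example: logistic regression model

$$
l_{reg}(D_{train}) = \sum_{i=1}^{N} \log \frac{1}{1 + \exp\left(-y_i (b + \mathbf{x}_i \cdot \mathbf{w})\right)} - s \sum_{i=1}^{m} w_i^2
$$

  - Text categorization: thousands of words means thousands of variables
- Multivariate Newton method
  - Multivariate function:

$$
f(\mathbf{x}) = f(x_1, x_2, \dots, x_m)
$$

  - The first-order derivative is a vector:

$$
\frac{\partial f}{\partial \mathbf{x}} = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_m} \right)
$$

  - The second-order derivative is the Hessian matrix H, an m x m matrix whose elements are defined as:

$$
H_{i,j} = \frac{\partial^2 f(x_1, x_2, \dots, x_m)}{\partial x_i \, \partial x_j}
$$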
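
Multivariate Newton Method
- Updating equation:

$$
\mathbf{x}^{new} = \mathbf{x}^{old} - H^{-1} \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}
$$

- Hessian matrix for the logistic regression model (taken for the negative regularized log-likelihood, so that it is positive definite):

$$
H = \sum_{i=1}^{N} p(y_i \mid \mathbf{x}_i) \left(1 - p(y_i \mid \mathbf{x}_i)\right) \mathbf{x}_i \mathbf{x}_i^T + 2 s \, I_{m \times m}
$$

- Can be expensive to compute
  - Example: text categorization with 10,000 words
  - The Hessian matrix is of size 10,000 x 10,000, that is, 100 million entries
  - Even worse, we have to compute the inverse of the Hessian matrix, $H^{-1}$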
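
To make the update concrete, here is a small sketch of Newton's method for logistic regression with a single scalar feature, so that the unknowns are just (w, b) and the 2 x 2 Hessian system can be solved in closed form. It minimizes the negative regularized log-likelihood. The data, the choice to regularize only w, and all function names are assumptions for this illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad_and_hessian(w, b, xs, ys, s):
    """Gradient and Hessian of the negative regularized log-likelihood
    for one scalar feature; labels are -1 or +1."""
    g_w, g_b = 2.0 * s * w, 0.0
    h_ww, h_wb, h_bb = 2.0 * s, 0.0, 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(y * (b + w * x))      # p(y_i | x_i)
        g_w -= y * x * (1.0 - p)
        g_b -= y * (1.0 - p)
        c = p * (1.0 - p)
        h_ww += c * x * x
        h_wb += c * x
        h_bb += c
    return (g_w, g_b), (h_ww, h_wb, h_bb)

def newton_logistic(xs, ys, s=0.1, n_iter=20):
    """Apply (w, b) <- (w, b) - H^{-1} * gradient for a fixed number of steps."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        (g_w, g_b), (h_ww, h_wb, h_bb) = grad_and_hessian(w, b, xs, ys, s)
        det = h_ww * h_bb - h_wb * h_wb
        # Solve the 2x2 system H * step = gradient by Cramer's rule
        step_w = (h_bb * g_w - h_wb * g_b) / det
        step_b = (h_ww * g_b - h_wb * g_w) / det
        w, b = w - step_w, b - step_b
    return w, b

# Hypothetical, non-separable toy data with one feature
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 3.0]
ys = [-1, -1, 1, -1, 1, 1]
w, b = newton_logistic(xs, ys)
```

With m variables the closed-form solve becomes an m x m linear system, which is where the cost discussed above comes from.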
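
Quasi-Newton Method
- Approximate the Hessian matrix H with another matrix B:

$$
\mathbf{x}^{new} = \mathbf{x}^{old} - B^{-1} \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}
$$

- B is updated iteratively (BFGS), utilizing the derivatives of previous iterations:

$$
B_{k+1} = B_k - \frac{B_k p_k p_k^T B_k}{p_k^T B_k p_k} + \frac{y_k y_k^T}{y_k^T p_k}, \qquad p_k = x_{k+1} - x_k, \quad y_k = g_{k+1} - g_k
$$

  where $g_k$ is the gradient at $x_k$. In practice the inverse $B^{-1}$ is updated directly by the corresponding rank-two formula, so no matrix inversion is needed.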
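
The BFGS update can be checked numerically. The sketch below implements the update of the Hessian approximation B from a step p_k and a gradient change y_k, and the usage lines build one such pair from a hypothetical quadratic f(x) = 0.5 x^T A x, whose gradient is A x. The helper names and the matrix A are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def bfgs_update(B, p, y):
    """Return B_next = B - (B p p^T B) / (p^T B p) + (y y^T) / (y^T p)."""
    Bp = mat_vec(B, p)            # B is symmetric, so (B p)^T equals p^T B
    pBp = dot(p, Bp)
    yp = dot(y, p)
    n = len(p)
    return [[B[i][j] - Bp[i] * Bp[j] / pBp + y[i] * y[j] / yp
             for j in range(n)] for i in range(n)]

# Hypothetical quadratic f(x) = 0.5 x^T A x with gradient g(x) = A x
A = [[2.0, -1.0], [-1.0, 2.0]]
x_old, x_new = [0.0, 0.0], [1.0, 0.5]
p = [a - b for a, b in zip(x_new, x_old)]                           # p_k = x_{k+1} - x_k
y = [a - b for a, b in zip(mat_vec(A, x_new), mat_vec(A, x_old))]   # y_k = g_{k+1} - g_k
B_next = bfgs_update([[1.0, 0.0], [0.0, 1.0]], p, y)
```

After the update, B_next maps p_k to y_k (the secant condition), which is the sense in which B mimics the Hessian using only gradient differences.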

Limited-Memory Quasi-Newton
- Quasi-Newton
  - Avoids computing the inverse of the Hessian matrix
  - But it still requires computing the B matrix, which needs large storage

$$
B_{k+1} = B_k - \frac{B_k p_k p_k^T B_k}{p_k^T B_k p_k} + \frac{y_k y_k^T}{y_k^T p_k}, \qquad p_k = x_{k+1} - x_k, \quad y_k = g_{k+1} - g_k
$$

- Limited-Memory Quasi-Newton (L-BFGS)
  - Avoids even explicitly computing the B matrix
  - B can be expressed as a product of the vectors $\{p_k, y_k\}$
  - Only the most recent vectors $\{p_k, y_k\}$ are kept (3~20 pairs)
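
A common way to use the stored pairs without forming any matrix is the two-loop recursion, sketched below. It returns an approximation to the inverse Hessian applied to a gradient. The function name, the initial scaling choice, the memory size of 5, and the example pairs (taken from a hypothetical quadratic with matrix [[2, -1], [-1, 2]]) are assumptions for illustration.

```python
from collections import deque

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lbfgs_direction(grad, pairs):
    """Two-loop recursion: approximate H^{-1} * grad from the stored
    (p_k, y_k) pairs, which are ordered oldest first."""
    q = list(grad)
    alphas = []
    for p, y in reversed(pairs):                 # newest pair first
        rho = 1.0 / dot(y, p)
        alpha = rho * dot(p, q)
        q = [q_i - alpha * y_i for q_i, y_i in zip(q, y)]
        alphas.append((rho, alpha))
    if pairs:                                    # initial guess: scaled identity
        p, y = pairs[-1]
        gamma = dot(p, y) / dot(y, y)
        q = [gamma * q_i for q_i in q]
    for (p, y), (rho, alpha) in zip(pairs, reversed(alphas)):   # oldest first
        beta = rho * dot(y, q)
        q = [q_i + (alpha - beta) * p_i for q_i, p_i in zip(q, p)]
    return q

# Keep only the most recent pairs; older ones are dropped automatically
pairs = deque(maxlen=5)
pairs.append(([1.0, 0.0], [2.0, -1.0]))   # (p_k, y_k) with y_k = A p_k
pairs.append(([0.5, 1.0], [0.0, 1.5]))
direction = lbfgs_direction([1.0, 1.0], pairs)
```

A minimization step would then move against this direction, x_new = x_old - step * direction, with storage that grows only linearly in the number of variables.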

Efficiency

| Method | Cost | Number of variables | Convergence rate |
| --- | --- | --- | --- |
| Standard Newton method | O(n^3) | Small | V-Fast |
| Quasi-Newton method (BFGS) | O(n^2) | Medium | Fast |
| Limited-memory Quasi-Newton method (L-BFGS) | O(n) | Large | R-Fast |
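
Empirical Study: Learning Conditional Exponential Model

| Dataset | Instances | Features |
| --- | --- | --- |
| Rule | 29,602 | 246 |
| Lex | 42,509 | 135,182 |
| Summary | 24,044 | 198,467 |
| Shallow | 8,625,782 | 264,142 |

| Dataset | Iterations | Time (s) |
| --- | --- | --- |
| Rule | 350 | 4.8 |
| | 81 | 1.13 |
| Lex | 1545 | 114.21 |
| | 176 | 20.02 |
| Summary | 3321 | 190.22 |
| | 69 | 8.52 |
| Shallow | 14527 | 85962.53 |
| | 421 | 2420.30 |

For each dataset the two rows compare gradient ascent with the limited-memory Quasi-Newton method; the row with far fewer iterations and less time presumably belongs to the limited-memory Quasi-Newton method.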

Free Software
- http://www.ece.northwestern.edu/~nocedal/software.html
  - L-BFGS
  - L-BFGS-B

Linear Conjugate Gradient Method
- Consider optimizing the quadratic function

$$
x^* = \arg\min_{x} \frac{x^T A x}{2} - b^T x
$$

- Conjugate vectors
  - The set of vectors $\{p_1, p_2, \dots, p_l\}$ is said to be conjugate with respect to a matrix A if

$$
p_i^T A p_j = 0, \quad \text{for any } i \neq j
$$

- Important property
  - The quadratic function can be optimized by simply optimizing the function along each individual direction in the conjugate set.
  - Optimal solution:

$$
x^* = \alpha_1 p_1 + \alpha_2 p_2 + \dots + \alpha_l p_l
$$

  - $\alpha_k$ is the minimizer along the kth conjugate direction
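
Example
- Minimize the following function

$$
f(x_1, x_2) = x_1^2 + x_2^2 - x_1 x_2 - x_1 - x_2
$$

- Matrix A of the quadratic term $x^T A x$

$$
A = \begin{pmatrix} 1 & -0.5 \\ -0.5 & 1 \end{pmatrix}
$$

- Conjugate directions

$$
p_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad p_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}
$$

- Optimization
  - First direction, $x_1 = x_2 = x$:

$$
f(x_1, x_2) = x^2 - 2x \quad \Rightarrow \quad \text{minimizer } \alpha_1 = 1
$$

  - Second direction, $x_1 = -x_2 = x$:

$$
f(x_1, x_2) = 3x^2 \quad \Rightarrow \quad \text{minimizer } \alpha_2 = 0
$$

- Solution: $x_1 = x_2 = 1$

[Figure: surface plot of f(x_1, x_2)]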
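
The example can be verified numerically. The sketch assumes the function is f(x1, x2) = x1^2 + x2^2 - x1 x2 - x1 - x2, that is, x^T A x - b^T x with A = [[1, -0.5], [-0.5, 1]] and b = (1, 1); the minus signs are inferred from the stated solution x1 = x2 = 1, and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

A = [[1.0, -0.5], [-0.5, 1.0]]
b = [1.0, 1.0]
p1, p2 = [1.0, 1.0], [1.0, -1.0]

def f(x):
    """f(x) = x^T A x - b^T x."""
    return dot(x, mat_vec(A, x)) - dot(b, x)

def line_minimizer(p):
    """Minimizer over alpha of f(alpha * p) = alpha^2 p^T A p - alpha b^T p."""
    return dot(b, p) / (2.0 * dot(p, mat_vec(A, p)))

conjugacy = dot(p1, mat_vec(A, p2))              # p1^T A p2, zero if conjugate
alpha1, alpha2 = line_minimizer(p1), line_minimizer(p2)
x_star = [alpha1 * u + alpha2 * v for u, v in zip(p1, p2)]
```

Each direction is minimized independently of the other, and the two one-dimensional minimizers combine into the full solution.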
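
How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure
  - Given conjugate directions $\{p_1, p_2, \dots, p_{k-1}\}$
  - Set $p_k$ as follows:

$$
p_k = -r_k + \beta_k p_{k-1}, \qquad \beta_k = \frac{r_k^T A p_{k-1}}{p_{k-1}^T A p_{k-1}}, \qquad \text{where } r_k = \left. \frac{\partial f(x)}{\partial x} \right|_{x = x_k}, \quad x_k = x_{k-1} + \alpha_{k-1} p_{k-1}
$$

- Theorem: the direction generated in the above step is conjugate to all previous directions $\{p_1, p_2, \dots, p_{k-1}\}$, i.e.,

$$
p_k^T A p_i = 0, \quad \text{for any } i \in [1, 2, \dots, k-1]
$$

- Note: computing the kth direction $p_k$ only requires the previous direction $p_{k-1}$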
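
The iterative procedure can be sketched as a complete linear conjugate gradient solver for f(x) = 0.5 x^T A x - b^T x, whose gradient is r = A x - b. The step length alpha = -r^T p / (p^T A p) is the exact minimizer along p for a quadratic. The helper names and the 3 x 3 test system are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def conjugate_gradient(A, b, x0, tol=1e-12):
    """Minimize 0.5 x^T A x - b^T x for a symmetric positive definite A."""
    x = list(x0)
    r = [ax_i - b_i for ax_i, b_i in zip(mat_vec(A, x), b)]   # gradient r_k = A x_k - b
    p = [-r_i for r_i in r]                                    # first direction: steepest descent
    for _ in range(len(b)):                                    # at most n steps in exact arithmetic
        if dot(r, r) < tol:
            break
        Ap = mat_vec(A, p)
        alpha = -dot(r, p) / dot(p, Ap)                        # exact minimizer along p
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i + alpha * ap_i for r_i, ap_i in zip(r, Ap)]   # gradient at the new point
        beta = dot(r, Ap) / dot(p, Ap)                         # beta = r^T A p / (p^T A p)
        p = [-r_i + beta * p_i for r_i, p_i in zip(r, p)]      # new conjugate direction
    return x

# Hypothetical symmetric positive definite system
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
```

Only the previous direction and the current gradient are stored, so memory use is linear in the number of variables.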

Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
- Several variants:
  - Fletcher-Reeves conjugate gradient (FR-CG)
  - Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG
- Compared to the Newton method
  - No need for computing the Hessian matrix
  - No need for storing the Hessian matrix
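
A minimal sketch of both variants follows. It is an illustration under stated assumptions rather than a reference implementation: the line search is simple backtracking, the Polak-Ribiere coefficient is clipped at zero, the direction is reset to steepest descent periodically and whenever it fails to be a descent direction, and the convex test function (minimized at (1, -2)) is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nonlinear_cg(f, grad, x0, variant="PR", max_iter=2000, tol=1e-6):
    """Nonlinear conjugate gradient with a backtracking line search."""
    x = list(x0)
    g = grad(x)
    d = [-g_i for g_i in g]
    for k in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        if dot(g, d) >= 0.0:                     # safeguard: not a descent direction
            d = [-g_i for g_i in g]
        step, fx, slope = 1.0, f(x), dot(g, d)
        # Halve the step until a sufficient-decrease condition holds
        while step > 1e-12 and \
                f([x_i + step * d_i for x_i, d_i in zip(x, d)]) > fx + 0.25 * step * slope:
            step *= 0.5
        x = [x_i + step * d_i for x_i, d_i in zip(x, d)]
        g_new = grad(x)
        if variant == "FR":                      # Fletcher-Reeves
            beta = dot(g_new, g_new) / dot(g, g)
        else:                                    # Polak-Ribiere, clipped at zero
            diff = [a - b for a, b in zip(g_new, g)]
            beta = max(0.0, dot(g_new, diff) / dot(g, g))
        if (k + 1) % len(x) == 0:                # periodic restart
            beta = 0.0
        d = [-a + beta * b for a, b in zip(g_new, d)]
        g = g_new
    return x

# Hypothetical smooth convex test function with minimum at (1, -2)
def f(x):
    return math.cosh(x[0] - 1.0) + math.cosh(x[1] + 2.0) + 0.25 * (x[0] + x[1] + 1.0) ** 2

def grad_f(x):
    shared = 0.5 * (x[0] + x[1] + 1.0)
    return [math.sinh(x[0] - 1.0) + shared, math.sinh(x[1] + 2.0) + shared]

x_pr = nonlinear_cg(f, grad_f, [0.0, 0.0], variant="PR")
x_fr = nonlinear_cg(f, grad_f, [0.0, 0.0], variant="FR")
```

Only function values and gradients are used; no Hessian is computed or stored, which is the advantage over the Newton method listed above.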