Using Gradient Descent for Optimization and Learning
Nicolas Le Roux
15 May 2009
Why have a good optimizer?
• plenty of data available everywhere
• extract information efficiently
Why gradient descent?
• cheap
• suitable for large models
• it works
Optimization vs Learning
Optimization
• function f to minimize
• time T(ρ) to reach error level ρ
Learning
• measure of quality f (cost function)
• get training samples x1, . . . , xn from p
• choose a model F with parameters θ
• minimize Ep[f (θ, x)]
Outline
1 Optimization
Basics
Approximations to Newton method
Stochastic Optimization
2 Learning (Bottou)
3 TONGA
Natural Gradient
Online Natural Gradient
Results
The basics
Taylor expansion of f to the first order:
f(θ + ε) = f(θ) + εᵀ∇θf + o(‖ε‖)
Best improvement obtained with
ε = −η∇θf, with η > 0
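The update ε = −η∇θf can be sketched in a few lines; the quadratic objective, step size and iteration count below are illustrative choices, not from the slides.

```python
import numpy as np

# Plain gradient descent, θ ← θ - η ∇θf, on a toy quadratic bowl
# f(θ) = ½ θᵀAθ (A, η and the iteration count are illustrative).
def gradient_descent(grad, theta0, eta=0.1, steps=500):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # ε = -η ∇θf
    return theta

A = np.diag([1.0, 10.0])        # eigenvalues λmin = 1, λmax = 10
grad = lambda th: A @ th        # ∇θf = Aθ for f(θ) = ½ θᵀAθ
theta = gradient_descent(grad, [1.0, 1.0])
```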
Quadratic bowl (figures: gradient descent trajectories for η = 0.1 and η = 0.3)
Optimal learning rate
For each eigenvalue λ of the Hessian, the optimal step is 1/λ and divergence occurs beyond 2/λ:
ηmin,opt = 1/λmin, ηmin,div = 2/λmin
ηmax,opt = 1/λmax, ηmax,div = 2/λmax
With η = ηmax,opt:
κ = ηmin,opt/η = λmax/λmin
T(ρ) = O(dκ log(1/ρ))
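These bounds are easy to check numerically on a diagonal quadratic, where each coordinate contracts by (1 − ηλ) per step; the eigenvalues below are illustrative.

```python
import numpy as np

# On f(θ) = ½ θᵀ diag(λ) θ, coordinate i contracts by (1 - η λi) per step,
# so η must stay below 2/λmax, and η = 1/λmax leaves the slowest mode
# shrinking by only (1 - 1/κ) per step. λ values are illustrative.
lams = np.array([1.0, 100.0])       # λmin = 1, λmax = 100, so κ = 100
theta = np.ones(2)
eta = 1.0 / lams.max()              # η = ηmax,opt
for _ in range(10):
    theta = theta - eta * lams * theta
# the λmax mode is killed immediately; the λmin mode has barely moved
```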
Limitations of gradient descent
• Speed of convergence highly dependent on κ
• Not invariant to linear transformations: with θ′ = kθ,
f(θ′ + ε) = f(θ′) + εᵀ∇θ′f + o(‖ε‖)
ε = −η∇θ′f
But ∇θ′f = ∇θf/k!
Newton method
Second-order Taylor expansion of f around θ:
f(θ + ε) = f(θ) + εᵀ∇θf + (1/2) εᵀHε + o(‖ε‖²)
Best improvement obtained with
ε = −ηH⁻¹∇θf
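On an exactly quadratic f, a single Newton step with η = 1 lands on the minimum; a minimal sketch (the quadratic and start point are illustrative choices):

```python
import numpy as np

# One Newton step ε = -η H⁻¹ ∇θf with η = 1 on f(θ) = ½ θᵀAθ,
# whose Hessian is H = A (A and the start point are illustrative).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta = np.array([5.0, -2.0])
g = A @ theta                             # ∇θf
theta = theta - np.linalg.solve(A, g)     # ε = -H⁻¹ ∇θf
# on a quadratic this lands exactly on the minimum (the origin)
```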
Quadratic bowl (figure)
Rosenbrock function (figure)
Properties of Newton method
• Newton method assumes the function is locally quadratic (Beyond Newton's method, Minka)
• H must be positive definite
• storing H and finding ε are in O(d²)
T(ρ) = O(d² log log(1/ρ))
Limitations of Newton method
• Newton method looks powerful, but...
• H may be hard to compute
• it is expensive
Gauss-Newton
Only works when f is a sum of squared residuals!
f(θ) = (1/2) ∑i ri²(θ)
∂f/∂θ = ∑i ri ∂ri/∂θ
∂²f/∂θ² = ∑i [ ri ∂²ri/∂θ² + (∂ri/∂θ)(∂ri/∂θ)ᵀ ]
θ = θ − η ( ∑i (∂ri/∂θ)(∂ri/∂θ)ᵀ )⁻¹ ∂f/∂θ
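The update above can be sketched on a tiny one-parameter least-squares problem; the exponential model, data and iteration count are illustrative choices, not from the slides.

```python
import numpy as np

# Gauss-Newton for f(θ) = ½ Σi ri(θ)²: with the Jacobian J whose rows are
# ∂ri/∂θ, the update is θ ← θ - η (JᵀJ)⁻¹ Jᵀr (here η = 1).
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.exp(-0.5 * t)                        # data generated with θ* = 0.5

def residuals(theta):
    return np.exp(-theta * t) - y           # ri(θ)

def jacobian(theta):
    return (-t * np.exp(-theta * t))[:, None]   # ∂ri/∂θ, shape (n, 1)

theta = np.array([0.1])
for _ in range(20):
    r, J = residuals(theta), jacobian(theta)
    theta = theta - np.linalg.solve(J.T @ J, J.T @ r)
```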
Properties of Gauss-Newton
θ = θ − η ( ∑i (∂ri/∂θ)(∂ri/∂θ)ᵀ )⁻¹ ∂f/∂θ
Discarded term: ri ∂²ri/∂θ²
• Does not require the computation of H
• Only valid close to the optimum, where ri ≈ 0
• Computation cost in O(d²)
Levenberg-Marquardt
θ = θ − η ( ∑i (∂ri/∂θ)(∂ri/∂θ)ᵀ + λI )⁻¹ ∂f/∂θ
• "Damped" Gauss-Newton
• Intermediate between Gauss-Newton and Steepest Descent
• Slower optimization but more robust
• Cost in O(d²)
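The effect of the damping term can be seen directly on the linear system; J, r and the λ values below are illustrative.

```python
import numpy as np

# Levenberg-Marquardt solves (JᵀJ + λI) step = -Jᵀr: λ = 0 recovers the
# Gauss-Newton step, while a large λ gives a small steepest-descent-like
# step ≈ -Jᵀr/λ (J, r and λ are illustrative values).
J = np.array([[1.0], [2.0]])
r = np.array([0.5, -1.0])
gn_step = -np.linalg.solve(J.T @ J, J.T @ r)                    # λ = 0
damped = -np.linalg.solve(J.T @ J + 1e3 * np.eye(1), J.T @ r)   # large λ
```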
Quasi-Newton methods
• Gauss-Newton and Levenberg-Marquardt can
only be used in special cases
• What about the general case?
• H characterizes the change in gradient when
moving in parameter space
• Let’s find a matrix which does the same!
BFGS
We look for a matrix B such that
∇θf(θ + ε) − ∇θf(θ) = Bt⁻¹ε (secant equation)
• Problem: this is underconstrained
• Solution: set additional constraints: small ‖Bt+1 − Bt‖W
BFGS - (2)
• B0 = I
• while not converged:
  • pt = −Bt∇f(θt)
  • ηt = linemin(f, θt, pt)
  • st = ηt pt (change in parameter space)
  • θt+1 = θt + st
  • yt = ∇f(θt+1) − ∇f(θt) (change in gradient space)
  • ρt = (stᵀyt)⁻¹
  • Bt+1 = (I − ρt st ytᵀ) Bt (I − ρt yt stᵀ) + ρt st stᵀ
(stems from the Sherman-Morrison formula)
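The loop above can be sketched directly; here linemin is replaced by the exact line minimum of a quadratic, and A and the start point are illustrative choices.

```python
import numpy as np

# Minimal BFGS sketch following the slide's loop; on f(θ) = ½ θᵀAθ the
# line search ηt = linemin(f, θt, pt) has the closed form used below.
A = np.array([[4.0, 1.0], [1.0, 2.0]])
grad = lambda th: A @ th
theta = np.array([3.0, -1.0])
B = np.eye(2)                                # B0 = I (inverse-Hessian estimate)
for _ in range(10):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-9:
        break                                # converged
    p = -B @ g                               # pt = -Bt ∇f(θt)
    eta = -(g @ p) / (p @ A @ p)             # exact line minimum on a quadratic
    s = eta * p                              # st: change in parameter space
    theta = theta + s                        # θt+1 = θt + st
    y = grad(theta) - g                      # yt: change in gradient space
    rho = 1.0 / (s @ y)
    V = np.eye(2) - rho * np.outer(s, y)
    B = V @ B @ V.T + rho * np.outer(s, s)   # Bt+1 update from the slide
# θ reaches the minimum at the origin in a couple of steps
```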
BFGS - (3)
• Requires a line search
• No matrix inversion required
• Update in O(d²)
• Can we do better than O(d²)?
L-BFGS
• Low-rank estimate of B
• Based on the last m moves in parameter and gradient space
• Cost O(md) per update
• Same ballpark as steepest descent!
Conjugate Gradient
We want to solve Ax = b
• Relies on conjugate directions
• u and v are conjugate if uᵀAv = 0
• d mutually conjugate directions form a basis of ℝᵈ
• Goal: move along conjugate directions close to the steepest descent directions
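A minimal sketch of conjugate gradient for Ax = b (A and b are illustrative values); in exact arithmetic, d steps solve a d-dimensional system.

```python
import numpy as np

# Conjugate gradient for Ax = b: successive directions p are A-conjugate
# (pᵢᵀApⱼ = 0 for i ≠ j), and d such directions solve a d-dim system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.zeros(2)
r = b - A @ x                       # residual = steepest-descent direction
p = r.copy()
for _ in range(2):                  # d = 2 steps suffice in exact arithmetic
    alpha = (r @ r) / (p @ A @ p)
    x = x + alpha * p
    r_new = r - alpha * (A @ p)
    beta = (r_new @ r_new) / (r @ r)    # Fletcher-Reeves style coefficient
    p = r_new + beta * p                # next direction, A-conjugate to p
    r = r_new
```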
Nonlinear Conjugate Gradient
• Extension to non-quadratic functions
• Requires a line search in every direction (important for conjugacy!)
• Various direction updates (Polak-Ribière)
Going from batch to stochastic
• dataset composed of n samples
• f(θ) = (1/n) ∑i fi(θ, xi)
Do we really need to see all the examples before making a parameter update?
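A per-example update loop can be sketched as follows; the least-squares model, data and step size are illustrative choices, not from the slides.

```python
import numpy as np

# One pass of stochastic updates: each step uses ∇θ fi(θ, xi) for a single
# sample instead of the full gradient (1/n) Σi ∇θ fi. Model and data are
# illustrative: noiseless least squares, so SGD can converge to θ*.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta, eta = np.zeros(3), 0.01
for i in rng.permutation(len(X)):           # one update per sample
    grad_i = (X[i] @ theta - y[i]) * X[i]   # ∇θ fi(θ, xi)
    theta = theta - eta * grad_i
# a single pass already brings θ close to θ*
```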
Stochastic optimization
• Information is redundant amongst samples
• We can afford more frequent, noisier updates
• But problems arise...
Stochastic (Bottou)
Advantage
• much faster convergence on large redundant
datasets
Disadvantages
• Keeps bouncing around unless η is reduced
• Extremely hard to reach high accuracy
• Theoretical convergence conditions are not as well defined
• Most second-order methods will not work
Batch (Bottou)
Advantages
• Guaranteed convergence to a local minimum
under simple conditions
• Lots of tricks to speed up the learning
Disadvantage
• Painfully slow on large problems
Problems arising in the stochastic setting
• First-order descent: O(dn) =⇒ O(dn)
• Second-order methods: O(d² + dn) =⇒ O(d²n)
• Special cases: algorithms requiring a line search
  • BFGS: not critical, may be replaced by a one-step update
  • Conjugate Gradient: critical, no stochastic version
Successful stochastic methods
• Stochastic gradient descent
• Online BFGS (Schraudolph, 2007)
• Online L-BFGS (Schraudolph, 2007)
Conclusions of the tutorial
Batch methods
• Second-order methods have much faster
convergence
• They are too expensive when d is large
• Except for L-BFGS and Nonlinear Conjugate
Gradient
Conclusions of the tutorial
Stochastic methods
• Much faster updates
• Terrible convergence rates
  • Stochastic Gradient Descent: T(ρ) = O(d/ρ)
  • Second-order Stochastic Descent: T(ρ) = O(d²/ρ)
Learning
• Measure of quality f (“cost function”)
• Get training samples x1, . . . , xn from p
• Choose a model F with parameters θ
• Minimize Ep[f (θ, x)]
• Time budget T
Small-scale vs. Large-scale Learning
• Small-scale learning problem: the active budget
constraint is the number of examples n.
• Large-scale learning problem: the active budget
constraint is the computing time T .
Which algorithm should we use?
T(ρ) for various algorithms:
• Gradient Descent: T(ρ) = O(ndκ log(1/ρ))
• Second-Order Gradient Descent: T(ρ) = O(d² log log(1/ρ))
• Stochastic Gradient Descent: T(ρ) = O(d/ρ)
• Second-order Stochastic Descent: T(ρ) = O(d²/ρ)
Second-Order Gradient Descent seems a good choice!
Large-scale learning
• We are limited by the time T
• We can choose between ρ and n
• Better optimization means fewer examples
Generalization error
E(f̃n) − E(f∗) = E(f∗F) − E(f∗)  (approximation error)
  + E(fn) − E(f∗F)  (estimation error)
  + E(f̃n) − E(fn)  (optimization error)
There is no need to optimize thoroughly if we cannot process enough data points!
Which algorithm should we use?
Time to reach E(f̃n) − E(f∗) < ε:
• Gradient Descent: O((d²κ/ε^(1/α)) log²(1/ε))
• Second-Order Gradient Descent: O((d²κ/ε^(1/α)) log(1/ε) log log(1/ε))
• Stochastic Gradient Descent: O(dκ²/ε)
• Second-order Stochastic Descent: O(d²/ε)
with 1/2 ≤ α ≤ 1 (statistical estimation rate).
In a nutshell
• Simple stochastic gradient descent is extremely
efficient
• Fast second-order stochastic gradient descent
can win us a constant factor
• Are there other possible factors of improvement?
What we did not talk about (yet)
• we have access to x1, . . . , xn ∼ p
• we wish to minimize Ex∼p[f (θ, x)]
• can we use the uncertainty in the dataset to get
information about p?
Cost functions
Empirical cost function:
f(θ) = (1/n) ∑i f(θ, xi)
True cost function:
f∗(θ) = Ex∼p[f(θ, x)]
Gradients
Empirical gradient:
g = (1/n) ∑i ∇θf(θ, xi)
True gradient:
g∗ = Ex∼p[∇θf(θ, x)]
Central-limit theorem:
g|g∗ ∼ N(g∗, C/n)
Posterior over g∗
Using g∗ ∼ N(0, σ²I) we have
g∗|g ∼ N( (I + C/(nσ²))⁻¹ g, (I/σ² + nC⁻¹)⁻¹ )
and then
εᵀg∗|g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g, εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
Aggressive strategy
εᵀg∗|g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g, εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
• we want to minimize Eg∗[εᵀg∗]
• Solution:
ε = −η (I + C/(nσ²))⁻¹ g
This is the regularized natural gradient.
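This step can be sketched from a batch of per-sample gradients; the gradient distribution, η and σ² below are illustrative choices.

```python
import numpy as np

# Regularized natural-gradient step ε = -η (I + C/(nσ²))⁻¹ g, with g the
# mean gradient and C the empirical covariance of the per-sample gradients
# (the gradient distribution, η and σ² are illustrative).
rng = np.random.default_rng(1)
G = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(50, 2))  # per-sample ∇θfi
n, d = G.shape
g = G.mean(axis=0)
C = np.cov(G.T)                     # gradient covariance, shape (d, d)
eta, sigma2 = 0.1, 1.0
eps = -eta * np.linalg.solve(np.eye(d) + C / (n * sigma2), g)
# the step is shrunk most along directions where the gradients disagree
```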
Conservative strategy
εᵀg∗|g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g, εᵀ(I/σ² + nC⁻¹)⁻¹ ε )
• we want to minimize Pr(εᵀg∗ > 0)
• Solution:
ε = −η (C/n)⁻¹ g
This is the natural gradient.
Online Natural Gradient
• Natural gradient has nice properties
• Its cost per iteration is in O(d²)
• Can we make it faster?
Goals
ε = −η (I + C/(nσ²))⁻¹ g
We must be able to update and invert C whenever a new gradient arrives.
Plan of action
Updating (uncentered) C:
Ct ∝ γCt−1 + (1 − γ) gt gtᵀ
Inverting C:
• maintain low-rank estimates of C using its eigendecomposition.
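The running update can be sketched as follows; γ, the dimension and the gradient stream below are illustrative choices.

```python
import numpy as np

# Exponentially discounted uncentered covariance of the gradient stream:
# Ct = γ Ct-1 + (1 - γ) gt gtᵀ (γ, d and the gradients are illustrative).
rng = np.random.default_rng(2)
d, gamma = 3, 0.9
C = np.zeros((d, d))
for _ in range(200):
    g = rng.normal(size=d)                       # current stochastic gradient
    C = gamma * C + (1 - gamma) * np.outer(g, g)
# C stays symmetric and tracks the recent second moment of the gradients
```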
Eigendecomposition of C
• Computing k eigenvectors of C = GGᵀ is O(kd²)
• But computing k eigenvectors of GᵀG is O(kp²)!
• Still too expensive to compute for each new sample (and p must not grow)
• Done only every b steps
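The GᵀG trick relies on a standard identity: if GᵀG v = w v, then GGᵀ (Gv) = w (Gv) and ‖Gv‖ = √w. A small check (the sizes are illustrative):

```python
import numpy as np

# Recover top eigenvectors of C = GGᵀ (d×d) from the small p×p matrix GᵀG:
# if GᵀG v = w v, then Gv is an eigenvector of GGᵀ with the same eigenvalue.
rng = np.random.default_rng(3)
d, p = 500, 10
G = rng.normal(size=(d, p))                 # columns: stored gradients
w, V = np.linalg.eigh(G.T @ G)              # p×p eigenproblem, cheap if p ≪ d
U = G @ V[:, ::-1] / np.sqrt(w[::-1])       # unit eigenvectors of GGᵀ
lead = U[:, 0]                              # leading eigenvector of C
```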
To prove I'm not cheating
• eigendecomposition every b steps
Ct = γ^t C + ∑k=1..t γ^(t−k) gk gkᵀ + λI,  t = 1, …, b
vt = Ct⁻¹ gt
Xt = [ γ^(t/2) U  γ^((t−1)/2) g1  …  γ^(1/2) gt−1  gt ]
Ct = Xt Xtᵀ + λI,  vt = Xt αt,  gt = Xt yt
αt = (Xtᵀ Xt + λI)⁻¹ yt
vt = Xt (Xtᵀ Xt + λI)⁻¹ yt
Computation complexity
• d dimensions
• k eigenvectors
• b steps between two updates
• Computing the eigendecomposition (every b steps): O(k(k + b)²)
• Computing the natural gradient: O(d(k + b) + (k + b)²)
• if (k + b) ≪ d, the cost per example is O(d(k + b))
What next
• Complexity in O(d(k + b))
• We need a small k
• If d > 10^6, how large should k be?
Decomposing C
C is almost block-diagonal!
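A block-diagonal C is cheap to work with: solve each block independently, at a cost of Σ_i s_i^3 instead of d^3. A minimal sketch, with invented per-block sizes standing in for one block per neuron:

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 3, 5]                 # hypothetical per-neuron block sizes
d = sum(sizes)

# Build SPD blocks standing in for the diagonal blocks of C
blocks = []
for s in sizes:
    A = rng.standard_normal((s, s))
    blocks.append(A @ A.T + 0.1 * np.eye(s))

g = rng.standard_normal(d)

# Solve C v = g one block at a time: O(sum s_i^3) instead of O(d^3)
v, i = [], 0
for B, s in zip(blocks, sizes):
    v.append(np.linalg.solve(B, g[i:i + s]))
    i += s
v = np.concatenate(v)

# Sanity check against the dense solve
C = np.zeros((d, d))
i = 0
for B, s in zip(blocks, sizes):
    C[i:i + s, i:i + s] = B
    i += s
assert np.allclose(np.linalg.solve(C, g), v)
```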
Quality of approximation
[Figure: ratio of the squared Frobenius norms vs. number k of eigenvectors kept, for the full-matrix and block-diagonal approximations of C. Left panel: up to 2000 eigenvectors; right panel: zoom on the first 40. Minimum with k = 5.]
Evolution of C
Results - MNIST
[Figure: negative log-likelihood (left) and classification error (right) on the test set vs. CPU time in seconds, comparing block-diagonal TONGA with stochastic gradient at batch sizes 1, 400, 1000 and 2000.]
Results - Rectangles
[Figure: negative log-likelihood (left) and classification error (right) on the validation set vs. CPU time in seconds (×10^4), comparing stochastic gradient with block-diagonal TONGA.]
Results - USPS
[Figure: negative log-likelihood (left) and classification error (right) vs. CPU time in seconds, comparing block-diagonal TONGA with stochastic gradient.]
TONGA - Conclusion
• Introducing uncertainty speeds up training
• There exists a fast implementation of the online
natural gradient
Difference between C and H
• H accounts for small changes in the parameter
space
• C accounts for small changes in the input space
• C is not just a “cheap cousin” of H
Future research
Much more to do!
• Can we combine the effects of H and C?
• Are there better approximations of C and H?
• Anything you can think of!
Thank you!
Questions?