Using Gradient Descent for Optimization and Learning


TRANSCRIPT

Page 1: Using Gradient Descent for Optimization and Learning


Using Gradient Descent for Optimization and Learning

Nicolas Le Roux

15 May 2009

Page 2: Using Gradient Descent for Optimization and Learning


Why do we need a good optimizer?

• plenty of data available everywhere

• extract information efficiently

Page 3: Using Gradient Descent for Optimization and Learning


Why gradient descent?

• cheap

• suitable for large models

• it works

Page 4: Using Gradient Descent for Optimization and Learning


Optimization vs Learning

Optimization

• function f to minimize

• time T(ρ) to reach error level ρ

Page 5: Using Gradient Descent for Optimization and Learning


Optimization vs Learning

Optimization

• function f to minimize

• time T(ρ) to reach error level ρ

Learning

• measure of quality f (cost function)

• get training samples x_1, …, x_n from p

• choose a model F with parameters θ

• minimize E_p[f(θ, x)]

Page 6: Using Gradient Descent for Optimization and Learning


Outline

1 Optimization

Basics

Approximations to Newton method

Stochastic Optimization

2 Learning (Bottou)

3 TONGA

Natural Gradient

Online Natural Gradient

Results

Page 7: Using Gradient Descent for Optimization and Learning


The basics

Taylor expansion of f to the first order:

f(θ + ε) = f(θ) + εᵀ∇θf + o(‖ε‖)

Best improvement obtained with

ε = −η∇θf,   η > 0
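To make the update rule concrete, here is a minimal sketch (Python/NumPy) of plain gradient descent on a quadratic bowl f(θ) = ½θᵀAθ − bᵀθ; the matrix A, the vector b, the step size eta and the iteration count are illustrative choices, not values from the slides.

```python
import numpy as np

# Illustrative quadratic bowl: f(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b                # gradient of the quadratic

theta = np.zeros(2)
eta = 0.1                               # step size eta > 0
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * grad f(theta)

print(theta, np.linalg.solve(A, b))     # should be close to the minimizer A^{-1} b
```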

Page 8: Using Gradient Descent for Optimization and Learning


Quadratic bowl

[Figure: gradient descent trajectories on a quadratic bowl for η = 0.1 and η = 0.3]

Page 9: Using Gradient Descent for Optimization and Learning


Optimal learning rate

On a quadratic bowl with smallest and largest eigenvalues λ_min and λ_max:

η_min,opt = 1/λ_min,   η_min,div = 2/λ_min

η_max,opt = 1/λ_max,   η_max,div = 2/λ_max

Taking η = η_max,opt,

κ = η_min,opt / η_max,opt = λ_max / λ_min

T(ρ) = O(dκ log(1/ρ))
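As a sanity check of this rate, the following sketch (with invented eigenvalues and tolerance) runs gradient descent on a diagonal quadratic with η = 1/λ_max and shows that the number of iterations needed grows with the condition number κ = λ_max/λ_min.

```python
import numpy as np

def iters_to_converge(lmin, lmax, tol=1e-6):
    """Gradient descent on f(x, y) = 0.5*(lmin*x^2 + lmax*y^2) with eta = 1/lmax."""
    theta = np.array([1.0, 1.0])
    eta = 1.0 / lmax                              # eta_max,opt
    grad = lambda t: np.array([lmin, lmax]) * t
    n = 0
    while np.linalg.norm(theta) > tol:
        theta = theta - eta * grad(theta)
        n += 1
    return n

for kappa in [2, 10, 100]:                        # condition numbers to try
    print(kappa, iters_to_converge(1.0, float(kappa)))
```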

Page 10: Using Gradient Descent for Optimization and Learning


Limitations of gradient descent

• Speed of convergence highly dependent on κ

• Not invariant to linear transformations

θ′ = kθ

f(θ′ + ε) = f(θ′) + εᵀ∇θ′f + o(‖ε‖)

ε = −η∇θ′f

But ∇θ′f = ∇θf / k !

Page 11: Using Gradient Descent for Optimization and Learning


Newton method

Second-order Taylor expansion of f around θ:

f(θ + ε) = f(θ) + εᵀ∇θf + ½ εᵀHε + o(‖ε‖²)

Best improvement obtained with

ε = −ηH⁻¹∇θf
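A minimal Newton-step sketch on a quadratic (the Hessian A and vector b below are illustrative, not taken from the slides):

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])    # Hessian of the quadratic (illustrative)
b = np.array([1.0, -2.0])

def grad(theta):
    return A @ theta - b

def hess(theta):
    return A                               # constant Hessian for a quadratic

theta = np.zeros(2)
eta = 1.0                                  # full Newton step
for _ in range(5):
    # theta <- theta - eta * H^{-1} grad f(theta)
    theta = theta - eta * np.linalg.solve(hess(theta), grad(theta))

print(theta)                               # a quadratic is minimized in one full step
```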

Page 12: Using Gradient Descent for Optimization and Learning


Quadratic bowl

Page 13: Using Gradient Descent for Optimization and Learning


Rosenbrock function

Page 14: Using Gradient Descent for Optimization and Learning


Properties of Newton method

• Newton method assumes the function is locally quadratic (Beyond Newton method, Minka)

• H must be positive definite

• storing H and finding ε are in O(d²)

T(ρ) = O(d² log log(1/ρ))

Page 15: Using Gradient Descent for Optimization and Learning


Limitations of Newton method

• Newton method looks powerful but...

Page 16: Using Gradient Descent for Optimization and Learning


Limitations of Newton method

• Newton method looks powerful but...

• H may be hard to compute

• it is expensive

Page 17: Using Gradient Descent for Optimization and Learning


Gauss-Newton

Only works when f is a sum of squared residuals!

f(θ) = ½ Σ_i r_i²(θ)

∂f/∂θ = Σ_i r_i ∂r_i/∂θ

∂²f/∂θ² = Σ_i [ r_i ∂²r_i/∂θ² + (∂r_i/∂θ)(∂r_i/∂θ)ᵀ ]

θ ← θ − η ( Σ_i (∂r_i/∂θ)(∂r_i/∂θ)ᵀ )⁻¹ ∂f/∂θ
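A hedged sketch of the Gauss-Newton iteration on a small nonlinear least-squares problem; the exponential model y ≈ exp(a·x), the synthetic data and the fixed number of iterations are invented for illustration.

```python
import numpy as np

# Illustrative nonlinear least squares: fit y ~ exp(a * x), f(a) = 0.5 * sum_i r_i(a)^2
x = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * x)

def residuals(a):
    return np.exp(a * x) - y               # r_i(a)

def jacobian(a):
    return (x * np.exp(a * x))[:, None]    # dr_i/da, shape (n, 1)

a = np.array([0.0])
for _ in range(10):
    r, J = residuals(a[0]), jacobian(a[0])
    # Gauss-Newton step (step size 1): a <- a - (J^T J)^{-1} J^T r
    a = a - np.linalg.solve(J.T @ J, J.T @ r)

print(a)                                    # should approach 0.7
```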

Page 18: Using Gradient Descent for Optimization and Learning


Properties of Gauss-Newton

θ ← θ − η ( Σ_i (∂r_i/∂θ)(∂r_i/∂θ)ᵀ )⁻¹ ∂f/∂θ

Discarded term: r_i ∂²r_i/∂θ²

• Does not require the computation of H

• Only valid close to the optimum where r_i = 0

• Computation cost in O(d²)

Page 19: Using Gradient Descent for Optimization and Learning


Levenberg-Marquardt

θ ← θ − η ( Σ_i (∂r_i/∂θ)(∂r_i/∂θ)ᵀ + λI )⁻¹ ∂f/∂θ

• “Damped” Gauss-Newton

• Intermediate between Gauss-Newton and Steepest Descent

• Slower optimization but more robust

• Cost in O(d²)
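The damped update only changes the linear system that is solved. Below is a generic helper sketch; the function names, the damping value lam and the step size eta are illustrative (practical Levenberg-Marquardt codes adapt lam at every iteration).

```python
import numpy as np

def lm_step(theta, residuals, jacobian, lam=1e-2, eta=1.0):
    """One damped Gauss-Newton step: theta - eta * (J^T J + lam*I)^{-1} J^T r."""
    r = residuals(theta)
    J = jacobian(theta)
    A = J.T @ J + lam * np.eye(J.shape[1])   # damped Gauss-Newton matrix
    return theta - eta * np.linalg.solve(A, J.T @ r)
```

With lam → 0 this recovers Gauss-Newton; with a large lam it approaches a small steepest-descent step.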

Page 20: Using Gradient Descent for Optimization and Learning


Quasi-Newton methods

• Gauss-Newton and Levenberg-Marquardt can only be used in special cases

• What about the general case?

Page 21: Using Gradient Descent for Optimization and Learning


Quasi-Newton methods

• Gauss-Newton and Levenberg-Marquardt can only be used in special cases

• What about the general case?

• H characterizes the change in gradient when moving in parameter space

• Let’s find a matrix which does the same!

Page 22: Using Gradient Descent for Optimization and Learning


BFGS

We look for a matrix B such that

∇θf(θ + ε) − ∇θf(θ) = B_t⁻¹ ε   (secant equation)

Page 23: Using Gradient Descent for Optimization and Learning


BFGS

We look for a matrix B such that

∇θf(θ + ε) − ∇θf(θ) = B_t⁻¹ ε   (secant equation)

• Problem: this is underconstrained

• Solution: set additional constraints: small ‖B_{t+1} − B_t‖_W

Page 24: Using Gradient Descent for Optimization and Learning


BFGS - (2)

• B_0 = I

• while not converged:

• p_t = −B_t ∇f(θ_t)

• η_t = linemin(f, θ_t, p_t)

• s_t = η_t p_t (change in parameter space)

• θ_{t+1} = θ_t + s_t

• y_t = ∇f(θ_{t+1}) − ∇f(θ_t) (change in gradient space)

• ρ_t = (s_tᵀ y_t)⁻¹

• B_{t+1} = (I − ρ_t s_t y_tᵀ) B_t (I − ρ_t y_t s_tᵀ) + ρ_t s_t s_tᵀ

(stems from the Sherman-Morrison formula)
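A compact sketch of this loop in Python/NumPy; the quadratic test function, the crude backtracking search standing in for linemin, and the stopping threshold are illustrative choices.

```python
import numpy as np

def bfgs(f, grad, theta0, tol=1e-6, max_iter=100):
    """Sketch of BFGS maintaining B, an approximation of the inverse Hessian."""
    theta = np.asarray(theta0, dtype=float)
    B = np.eye(theta.size)                     # B_0 = I
    g = grad(theta)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B @ g                             # p_t = -B_t grad f(theta_t)
        eta = 1.0                              # crude stand-in for linemin(f, theta_t, p_t)
        while f(theta + eta * p) > f(theta) and eta > 1e-10:
            eta *= 0.5
        s = eta * p                            # change in parameter space
        theta_new = theta + s
        g_new = grad(theta_new)
        y = g_new - g                          # change in gradient space
        rho = 1.0 / (s @ y)
        I = np.eye(theta.size)
        B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        theta, g = theta_new, g_new
    return theta

# Illustrative use on a quadratic:
A = np.array([[3.0, 0.5], [0.5, 1.0]]); b = np.array([1.0, -2.0])
print(bfgs(lambda t: 0.5 * t @ A @ t - b @ t, lambda t: A @ t - b, np.zeros(2)))
```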

Page 25: Using Gradient Descent for Optimization and Learning


BFGS - (3)

• Requires a line search

• No matrix inversion required

• Update in O(d²)

• Can we do better than O(d²)?

Page 26: Using Gradient Descent for Optimization and Learning


L-BFGS

• Low-rank estimate of B

• Based on the last m moves in parameter and gradient spaces

• Cost O(md) per update

• Same ballpark as steepest descent!
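For reference, a hedged sketch of the standard two-loop recursion that turns the last m (s, y) pairs into a search direction; this is the textbook recursion, not code from the talk.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate -B * grad from the last m (s, y) pairs."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((alpha, rho, s, y))
    if s_list:                                  # initial scaling gamma = s^T y / y^T y
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for alpha, rho, s, y in reversed(alphas):   # oldest pair first
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return -q                                   # search direction
```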

Page 27: Using Gradient Descent for Optimization and Learning


Conjugate Gradient

We want to solve Ax = b

• Relies on conjugate directions

• u and v are conjugate if uᵀAv = 0

• d mutually conjugate directions form a basis of ℝᵈ

• Goal: move along conjugate directions close to the steepest descent directions
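A minimal linear conjugate-gradient sketch for Ax = b (A must be symmetric positive definite; the example matrix and tolerance are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve Ax = b for symmetric positive definite A by moving along conjugate directions."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient of 0.5 x^T A x - b^T x
    p = r.copy()                   # first direction = steepest descent direction
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # next direction, conjugate to the previous ones
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))
```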

Page 28: Using Gradient Descent for Optimization and Learning


Nonlinear Conjugate Gradient

• Extension to non-quadratic functions

• Requires a line search in every direction (important for conjugacy!)

• Various direction updates (Polak-Ribière)

Page 29: Using Gradient Descent for Optimization and Learning


Going from batch to stochastic

• dataset composed of n samples

• f(θ) = (1/n) Σ_i f_i(θ, x_i)

Do we really need to see all the examples before making a parameter update?
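The answer developed next is that we do not: we can update after every sample. A hedged SGD sketch on a synthetic least-squares problem (the data, learning rate and number of epochs are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
eta = 0.01
for epoch in range(20):
    for i in rng.permutation(n):               # one noisy update per sample
        g_i = (X[i] @ theta - y[i]) * X[i]     # gradient of f_i(theta, x_i)
        theta = theta - eta * g_i

print(np.linalg.norm(theta - theta_true))      # should be small
```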

Page 30: Using Gradient Descent for Optimization and Learning


Stochastic optimization

• Information is redundant amongst samples

• We can afford more frequent, noisier updates

• But problems arise...

Page 31: Using Gradient Descent for Optimization and Learning


Stochastic (Bottou)

Advantage

• Much faster convergence on large redundant datasets

Disadvantages

• Keeps bouncing around unless η is reduced

• Extremely hard to reach high accuracy

• Theoretical convergence conditions not as well defined

• Most second-order methods will not work

Page 32: Using Gradient Descent for Optimization and Learning


Batch (Bottou)

Advantages

• Guaranteed convergence to a local minimum under simple conditions

• Lots of tricks to speed up the learning

Disadvantage

• Painfully slow on large problems

Page 33: Using Gradient Descent for Optimization and Learning


Problems arising in the stochastic setting

• First-order descent: O(dn) ⟹ O(dn)

• Second-order methods: O(d² + dn) ⟹ O(d²n)

• Special cases: algorithms requiring a line search

• BFGS: not critical, may be replaced by a one-step update

• Conjugate Gradient: critical, no stochastic version

Page 34: Using Gradient Descent for Optimization and Learning


Successful stochastic methods

• Stochastic gradient descent

• Online BFGS (Schraudolph, 2007)

• Online L-BFGS (Schraudolph, 2007)

Page 35: Using Gradient Descent for Optimization and Learning


Conclusions of the tutorial

Batch methods

• Second-order methods have much faster convergence

• They are too expensive when d is large

• Except for L-BFGS and Nonlinear Conjugate Gradient

Page 36: Using Gradient Descent for Optimization and Learning


Conclusions of the tutorial

Stochastic methods

• Much faster updates

• Terrible convergence rates

• Stochastic Gradient Descent: T(ρ) = O(dκ²/ρ)

• Second-order Stochastic Descent: T(ρ) = O(d²/ρ)

Page 37: Using Gradient Descent for Optimization and Learning


Outline

1 Optimization

Basics

Approximations to Newton method

Stochastic Optimization

2 Learning (Bottou)

3 TONGA

Natural Gradient

Online Natural Gradient

Results

Page 38: Using Gradient Descent for Optimization and Learning


Learning

• Measure of quality f (“cost function”)

• Get training samples x_1, …, x_n from p

• Choose a model F with parameters θ

• Minimize E_p[f(θ, x)]

• Time budget T

Page 39: Using Gradient Descent for Optimization and Learning


Small-scale vs. Large-scale Learning

• Small-scale learning problem: the active budget constraint is the number of examples n.

• Large-scale learning problem: the active budget constraint is the computing time T.

Page 40: Using Gradient Descent for Optimization and Learning


Which algorithm should we use?

T(ρ) for various algorithms:

• Gradient Descent: T(ρ) = O(ndκ log(1/ρ))

• Second-Order Gradient Descent: T(ρ) = O(d² log log(1/ρ))

• Stochastic Gradient Descent: T(ρ) = O(dκ²/ρ)

• Second-Order Stochastic Descent: T(ρ) = O(d²/ρ)

Second-Order Gradient Descent seems a good choice!

Page 41: Using Gradient Descent for Optimization and Learning


Large-scale learning

• We are limited by the time T

• We can choose between ρ and n

• Better optimization means fewer examples

Page 42: Using Gradient Descent for Optimization and Learning


Generalization error

E(f̃_n) − E(f*) = E(f*_F) − E(f*)   (Approximation error)
               + E(f_n) − E(f*_F)   (Estimation error)
               + E(f̃_n) − E(f_n)   (Optimization error)

There is no need to optimize thoroughly if we cannot process enough data points!

Page 43: Using Gradient Descent for Optimization and Learning


Which algorithm should we use?

Time to reach E(f̃_n) − E(f*) < ε:

• Gradient Descent: O((d²κ/ε^{1/α}) log²(1/ε))

• Second-Order Gradient Descent: O((d²κ/ε^{1/α}) log(1/ε) log log(1/ε))

• Stochastic Gradient Descent: O(dκ²/ε)

• Second-Order Stochastic Descent: O(d²/ε)

with 1/2 ≤ α ≤ 1 (statistical estimation rate).

Page 44: Using Gradient Descent for Optimization and Learning


In a nutshell

• Simple stochastic gradient descent is extremely efficient

• Fast second-order stochastic gradient descent can win us a constant factor

• Are there other possible factors of improvement?

Page 45: Using Gradient Descent for Optimization and Learning


What we did not talk about (yet)

• we have access to x_1, …, x_n ∼ p

• we wish to minimize E_{x∼p}[f(θ, x)]

• can we use the uncertainty in the dataset to get information about p?

Page 46: Using Gradient Descent for Optimization and Learning


Outline

1 Optimization

Basics

Approximations to Newton method

Stochastic Optimization

2 Learning (Bottou)

3 TONGA

Natural Gradient

Online Natural Gradient

Results

Page 47: Using Gradient Descent for Optimization and Learning


Cost functions

Empirical cost function

f(θ) = (1/n) Σ_i f(θ, x_i)

True cost function

f*(θ) = E_{x∼p}[f(θ, x)]

Page 48: Using Gradient Descent for Optimization and Learning


Gradients

Empirical gradient

g = (1/n) Σ_i ∇θf(θ, x_i)

True gradient

g* = E_{x∼p}[∇θf(θ, x)]

Central-limit theorem:

g | g* ∼ N(g*, C/n)
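A small sketch of what g and C look like in practice for a least-squares model; the data and model are invented for illustration, and C is taken as the empirical covariance of the per-sample gradients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
theta = np.zeros(d)

# Per-sample gradients of f(theta, x_i) = 0.5 * (x_i^T theta - y_i)^2
G = (X @ theta - y)[:, None] * X     # shape (n, d), one gradient per row
g = G.mean(axis=0)                   # empirical gradient (estimate of g*)
C = np.cov(G, rowvar=False)          # empirical covariance of per-sample gradients

print(g)
print(C / n)                         # estimated covariance of g around g*
```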

Page 49: Using Gradient Descent for Optimization and Learning


Posterior over g*

Using

g* ∼ N(0, σ²I)

we have

g* | g ∼ N( (I + C/(nσ²))⁻¹ g ,  (I/σ² + nC⁻¹)⁻¹ )

and then

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

Page 50: Using Gradient Descent for Optimization and Learning


Aggressive strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize E_{g*}[εᵀg*]

• Solution:

ε = −η (I + C/(nσ²))⁻¹ g

Page 51: Using Gradient Descent for Optimization and Learning


Aggressive strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize E_{g*}[εᵀg*]

• Solution:

ε = −η (I + C/(nσ²))⁻¹ g

This is the regularized natural gradient.
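Continuing the sketch above, the regularized natural-gradient direction can be computed directly from g and C; the prior variance sigma2 and the step size eta below are arbitrary illustrative values.

```python
import numpy as np

def regularized_natural_gradient(g, C, n, sigma2=1.0, eta=0.1):
    """epsilon = -eta * (I + C / (n * sigma2))^{-1} g."""
    d = g.shape[0]
    M = np.eye(d) + C / (n * sigma2)
    return -eta * np.linalg.solve(M, g)

# Illustrative usage with g, C, n as in the previous sketch:
# theta = theta + regularized_natural_gradient(g, C, n)
```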

Page 52: Using Gradient Descent for Optimization and Learning


Conservative strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize Pr(εᵀg* > 0)

• Solution:

ε = −η (C/n)⁻¹ g

Page 53: Using Gradient Descent for Optimization and Learning


Conservative strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize Pr(εᵀg* > 0)

• Solution:

ε = −η (C/n)⁻¹ g

This is the natural gradient.

Page 54: Using Gradient Descent for Optimization and Learning


Online Natural Gradient

• Natural gradient has nice properties

• Its cost per iteration is in O(d²)

• Can we make it faster?

Page 55: Using Gradient Descent for Optimization and Learning


Goals

ε = −η (I + C/(nσ²))⁻¹ g

We must be able to update and invert C whenever a new gradient arrives.

Page 56: Using Gradient Descent for Optimization and Learning


Plan of action

Updating (uncentered) C:

C_t ∝ γ C_{t−1} + (1 − γ) g_t g_tᵀ

Page 57: Using Gradient Descent for Optimization and Learning


Plan of action

Updating (uncentered) C:

C_t ∝ γ C_{t−1} + (1 − γ) g_t g_tᵀ

Inverting C:

• maintain low-rank estimates of C using its eigendecomposition
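A hedged sketch of this plan: a running update of C and a low-rank, eigendecomposition-based approximation of (C + λI)⁻¹g. The decay gamma, rank k and regularizer lam are illustrative, and keeping only the top-k eigenpairs is a simplification of the exact batched scheme described on the next slides.

```python
import numpy as np

gamma, k, lam = 0.995, 5, 1e-3            # decay, rank kept, regularizer (illustrative)

def update_cov(C, g):
    """Running uncentered covariance: C_t = gamma * C_{t-1} + (1 - gamma) * g g^T."""
    return gamma * C + (1.0 - gamma) * np.outer(g, g)

def lowrank_natural_direction(C, g):
    """Approximate (C + lam*I)^{-1} g using only the top-k eigenpairs of C."""
    w, V = np.linalg.eigh(C)              # eigenvalues in ascending order
    w, V = w[-k:], V[:, -k:]              # keep the k largest eigenpairs
    coeff = (V.T @ g) / (w + lam)         # components along the kept eigenvectors
    rest = (g - V @ (V.T @ g)) / lam      # remaining components, treated as eigenvalue 0
    return V @ coeff + rest
```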

Page 58: Using Gradient Descent for Optimization and Learning


Eigendecomposition of C

• Computing k eigenvectors of C = GGᵀ is O(kd²)

• But computing k eigenvectors of GᵀG is O(kp²)!

• Still too expensive to compute for each new sample (and p must not grow)

Page 59: Using Gradient Descent for Optimization and Learning


Eigendecomposition of C

• Computing k eigenvectors of C = GGᵀ is O(kd²)

• But computing k eigenvectors of GᵀG is O(kp²)!

• Still too expensive to compute for each new sample (and p must not grow)

• Done only every b steps

Page 60: Using Gradient Descent for Optimization and Learning


To prove I’m not cheating

• eigendecomposition every b steps

C_t = γ^t C + Σ_{k=1}^{t} γ^{t−k} g_k g_kᵀ + λI,   t = 1, …, b

v_t = C_t⁻¹ g_t

X_t = [ γ^{t/2} U   γ^{(t−1)/2} g_1   …   γ^{1/2} g_{t−1}   g_t ]

C_t = X_t X_tᵀ + λI,   v_t = X_t α_t,   g_t = X_t y_t

α_t = (X_tᵀ X_t + λI)⁻¹ y_t

v_t = X_t (X_tᵀ X_t + λI)⁻¹ y_t

Page 61: Using Gradient Descent for Optimization and Learning


Computational complexity

• d dimensions

• k eigenvectors

• b steps between two updates

• Computing the eigendecomposition (every b steps): k(k + b)²

• Computing the natural gradient: d(k + b) + (k + b)²

• if p = k + b ≪ d, the cost per example is O(d(k + b))

Page 62: Using Gradient Descent for Optimization and Learning


What next

• Complexity in O(d(k + b))

• We need a small k

• If d > 10⁶, how large should k be?

Page 63: Using Gradient Descent for Optimization and Learning


Decomposing C

C is almost block-diagonal!

Page 64: Using Gradient Descent for Optimization and Learning


Quality of approximation

[Figure: two plots of the ratio of the squared Frobenius norms versus the number k of eigenvectors kept, comparing the full-matrix approximation with the block-diagonal approximation. Minimum with k = 5.]

Page 65: Using Gradient Descent for Optimization and Learning


Evolution of C

Page 66: Using Gradient Descent for Optimization and Learning


Results - MNIST

[Figure: negative log-likelihood and classification error on the test set versus CPU time (in seconds), comparing block-diagonal TONGA with stochastic gradient using batch sizes 1, 400, 1000 and 2000]

Page 67: Using Gradient Descent for Optimization and Learning


Results - Rectangles

[Figure: negative log-likelihood and classification error on the validation set versus CPU time (in seconds), comparing block-diagonal TONGA with stochastic gradient]

Page 68: Using Gradient Descent for Optimization and Learning


Results - USPS

[Figure: negative log-likelihood and classification error versus CPU time (in seconds), comparing block-diagonal TONGA with stochastic gradient]

Page 69: Using Gradient Descent for Optimization and Learning


TONGA - Conclusion

• Introducing uncertainty speeds up training

• There exists a fast implementation of the online natural gradient

Page 70: Using Gradient Descent for Optimization and Learning


Difference between C and H

• H accounts for small changes in the parameter space

• C accounts for small changes in the input space

• C is not just a “cheap cousin” of H

Page 71: Using Gradient Descent for Optimization and Learning


Future research

Much more to do!

• Can we combine the effects of H and C?

• Are there better approximations of C and H?

• Anything you can think of!

Page 72: Using Gradient Descent for Optimization and Learning


Thank you!

Questions?