Using Gradient Descent for Optimization and Learning


TRANSCRIPT

Page 1: Using Gradient Descent for Optimization and Learning


Using Gradient Descent for Optimization and Learning

Nicolas Le Roux

15 May 2009

Page 2: Using Gradient Descent for Optimization and Learning


Why do we need a good optimizer?

• plenty of data available everywhere

• extract information efficiently

Page 3: Using Gradient Descent for Optimization and Learning


Why gradient descent?

• cheap

• suitable for large models

• it works

Page 4: Using Gradient Descent for Optimization and Learning


Optimization vs Learning

Optimization

• function f to minimize

• time T(ρ) to reach error level ρ

Page 5: Using Gradient Descent for Optimization and Learning


Optimization vs Learning

Optimization

• function f to minimize

• time T(ρ) to reach error level ρ

Learning

• measure of quality f (cost function)

• get training samples x_1, …, x_n from p

• choose a model F with parameters θ

• minimize E_p[f(θ, x)]

Page 6: Using Gradient Descent for Optimization and Learning


Outline

1 Optimization

Basics

Approximations to Newton method

Stochastic Optimization

2 Learning (Bottou)

3 TONGA

Natural Gradient

Online Natural Gradient

Results

Page 7: Using Gradient Descent for Optimization and Learning


The basics

Taylor expansion of f to the first order:

f(θ + ε) = f(θ) + εᵀ∇θf + o(‖ε‖)

Best improvement obtained with

ε = −η∇θf,   η > 0
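To make the update rule concrete, here is a minimal sketch (Python/NumPy) of plain gradient descent on a quadratic bowl f(θ) = ½θᵀAθ − bᵀθ; the matrix A, the vector b, the step size eta and the iteration count are illustrative choices, not values from the slides.

```python
import numpy as np

# Illustrative quadratic bowl: f(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b                # gradient of the quadratic

theta = np.zeros(2)
eta = 0.1                               # step size eta > 0
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * grad f(theta)

print(theta, np.linalg.solve(A, b))     # should be close to the minimizer A^{-1} b
```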

Page 8: Using Gradient Descent for Optimization and Learning


Quadratic bowl

[Figure: gradient descent trajectories on a quadratic bowl for η = 0.1 and η = 0.3]

Page 9: Using Gradient Descent for Optimization and Learning


Optimal learning rate

On a quadratic bowl with smallest and largest eigenvalues λ_min and λ_max:

η_min,opt = 1/λ_min,   η_min,div = 2/λ_min

η_max,opt = 1/λ_max,   η_max,div = 2/λ_max

Taking η = η_max,opt,

κ = η_min,opt / η_max,opt = λ_max / λ_min

T(ρ) = O(dκ log(1/ρ))
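As a sanity check of this rate, the following sketch (with invented eigenvalues and tolerance) runs gradient descent on a diagonal quadratic with η = 1/λ_max and shows that the number of iterations needed grows with the condition number κ = λ_max/λ_min.

```python
import numpy as np

def iters_to_converge(lmin, lmax, tol=1e-6):
    """Gradient descent on f(x, y) = 0.5*(lmin*x^2 + lmax*y^2) with eta = 1/lmax."""
    theta = np.array([1.0, 1.0])
    eta = 1.0 / lmax                              # eta_max,opt
    grad = lambda t: np.array([lmin, lmax]) * t
    n = 0
    while np.linalg.norm(theta) > tol:
        theta = theta - eta * grad(theta)
        n += 1
    return n

for kappa in [2, 10, 100]:                        # condition numbers to try
    print(kappa, iters_to_converge(1.0, float(kappa)))
```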

Page 10: Using Gradient Descent for Optimization and Learning


Limitations of gradient descent

• Speed of convergence highly dependent on κ

• Not invariant to linear transformations

θ′ = kθ

f(θ′ + ε) = f(θ′) + εᵀ∇θ′f + o(‖ε‖)

ε = −η∇θ′f

But ∇θ′f = ∇θf / k !

Page 11: Using Gradient Descent for Optimization and Learning


Newton method

Second-order Taylor expansion of f around θ:

f(θ + ε) = f(θ) + εᵀ∇θf + ½ εᵀHε + o(‖ε‖²)

Best improvement obtained with

ε = −ηH⁻¹∇θf
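A minimal Newton-step sketch on a quadratic (the Hessian A and vector b below are illustrative, not taken from the slides):

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])    # Hessian of the quadratic (illustrative)
b = np.array([1.0, -2.0])

def grad(theta):
    return A @ theta - b

def hess(theta):
    return A                               # constant Hessian for a quadratic

theta = np.zeros(2)
eta = 1.0                                  # full Newton step
for _ in range(5):
    # theta <- theta - eta * H^{-1} grad f(theta)
    theta = theta - eta * np.linalg.solve(hess(theta), grad(theta))

print(theta)                               # a quadratic is minimized in one full step
```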

Page 12: Using Gradient Descent for Optimization and Learning


Quadratic bowl

Page 13: Using Gradient Descent for Optimization and Learning


Rosenbrock function

Page 14: Using Gradient Descent for Optimization and Learning


Properties of Newton method

• Newton method assumes the function is locally quadratic (Beyond Newton method, Minka)

• H must be positive definite

• storing H and finding ε are in O(d²)

T(ρ) = O(d² log log(1/ρ))

Page 15: Using Gradient Descent for Optimization and Learning


Limitations of Newton method

• Newton method looks powerful but...

Page 16: Using Gradient Descent for Optimization and Learning


Limitations of Newton method

• Newton method looks powerful but...

• H may be hard to compute

• it is expensive

Page 17: Using Gradient Descent for Optimization and Learning


Gauss-Newton

Only works when f is a sum of squared residuals!

f(θ) = ½ Σ_i r_i²(θ)

∂f/∂θ = Σ_i r_i ∂r_i/∂θ

∂²f/∂θ² = Σ_i [ r_i ∂²r_i/∂θ² + (∂r_i/∂θ)(∂r_i/∂θ)ᵀ ]

θ ← θ − η ( Σ_i (∂r_i/∂θ)(∂r_i/∂θ)ᵀ )⁻¹ ∂f/∂θ
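A hedged sketch of the Gauss-Newton iteration on a small nonlinear least-squares problem; the exponential model y ≈ exp(a·x), the synthetic data and the fixed number of iterations are invented for illustration.

```python
import numpy as np

# Illustrative nonlinear least squares: fit y ~ exp(a * x), f(a) = 0.5 * sum_i r_i(a)^2
x = np.linspace(0.0, 1.0, 20)
y = np.exp(0.7 * x)

def residuals(a):
    return np.exp(a * x) - y               # r_i(a)

def jacobian(a):
    return (x * np.exp(a * x))[:, None]    # dr_i/da, shape (n, 1)

a = np.array([0.0])
for _ in range(10):
    r, J = residuals(a[0]), jacobian(a[0])
    # Gauss-Newton step (step size 1): a <- a - (J^T J)^{-1} J^T r
    a = a - np.linalg.solve(J.T @ J, J.T @ r)

print(a)                                    # should approach 0.7
```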

Page 18: Using Gradient Descent for Optimization and Learning


Properties of Gauss-Newton

θ ← θ − η ( Σ_i (∂r_i/∂θ)(∂r_i/∂θ)ᵀ )⁻¹ ∂f/∂θ

Discarded term: r_i ∂²r_i/∂θ²

• Does not require the computation of H

• Only valid close to the optimum where r_i = 0

• Computation cost in O(d²)

Page 19: Using Gradient Descent for Optimization and Learning


Levenberg-Marquardt

θ ← θ − η ( Σ_i (∂r_i/∂θ)(∂r_i/∂θ)ᵀ + λI )⁻¹ ∂f/∂θ

• “Damped” Gauss-Newton

• Intermediate between Gauss-Newton and Steepest Descent

• Slower optimization but more robust

• Cost in O(d²)
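The damped update only changes the linear system that is solved. Below is a generic helper sketch; the function names, the damping value lam and the step size eta are illustrative (practical Levenberg-Marquardt codes adapt lam at every iteration).

```python
import numpy as np

def lm_step(theta, residuals, jacobian, lam=1e-2, eta=1.0):
    """One damped Gauss-Newton step: theta - eta * (J^T J + lam*I)^{-1} J^T r."""
    r = residuals(theta)
    J = jacobian(theta)
    A = J.T @ J + lam * np.eye(J.shape[1])   # damped Gauss-Newton matrix
    return theta - eta * np.linalg.solve(A, J.T @ r)
```

With lam → 0 this recovers Gauss-Newton; with a large lam it approaches a small steepest-descent step.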

Page 20: Using Gradient Descent for Optimization and Learning


Quasi-Newton methods

• Gauss-Newton and Levenberg-Marquardt can only be used in special cases

• What about the general case?

Page 21: Using Gradient Descent for Optimization and Learning


Quasi-Newton methods

• Gauss-Newton and Levenberg-Marquardt can only be used in special cases

• What about the general case?

• H characterizes the change in gradient when moving in parameter space

• Let’s find a matrix which does the same!

Page 22: Using Gradient Descent for Optimization and Learning


BFGS

We look for a matrix B such that

∇θf(θ + ε) − ∇θf(θ) = B_t⁻¹ ε   (secant equation)

Page 23: Using Gradient Descent for Optimization and Learning


BFGS

We look for a matrix B such that

∇θf(θ + ε) − ∇θf(θ) = B_t⁻¹ ε   (secant equation)

• Problem: this is underconstrained

• Solution: set additional constraints: small ‖B_{t+1} − B_t‖_W

Page 24: Using Gradient Descent for Optimization and Learning


BFGS - (2)

• B_0 = I

• while not converged:

• p_t = −B_t ∇f(θ_t)

• η_t = linemin(f, θ_t, p_t)

• s_t = η_t p_t (change in parameter space)

• θ_{t+1} = θ_t + s_t

• y_t = ∇f(θ_{t+1}) − ∇f(θ_t) (change in gradient space)

• ρ_t = (s_tᵀ y_t)⁻¹

• B_{t+1} = (I − ρ_t s_t y_tᵀ) B_t (I − ρ_t y_t s_tᵀ) + ρ_t s_t s_tᵀ

(stems from the Sherman-Morrison formula)
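A compact sketch of this loop in Python/NumPy; the quadratic test function, the crude backtracking search standing in for linemin, and the stopping threshold are illustrative choices.

```python
import numpy as np

def bfgs(f, grad, theta0, tol=1e-6, max_iter=100):
    """Sketch of BFGS maintaining B, an approximation of the inverse Hessian."""
    theta = np.asarray(theta0, dtype=float)
    B = np.eye(theta.size)                     # B_0 = I
    g = grad(theta)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -B @ g                             # p_t = -B_t grad f(theta_t)
        eta = 1.0                              # crude stand-in for linemin(f, theta_t, p_t)
        while f(theta + eta * p) > f(theta) and eta > 1e-10:
            eta *= 0.5
        s = eta * p                            # change in parameter space
        theta_new = theta + s
        g_new = grad(theta_new)
        y = g_new - g                          # change in gradient space
        rho = 1.0 / (s @ y)
        I = np.eye(theta.size)
        B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        theta, g = theta_new, g_new
    return theta

# Illustrative use on a quadratic:
A = np.array([[3.0, 0.5], [0.5, 1.0]]); b = np.array([1.0, -2.0])
print(bfgs(lambda t: 0.5 * t @ A @ t - b @ t, lambda t: A @ t - b, np.zeros(2)))
```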

Page 25: Using Gradient Descent for Optimization and Learning


BFGS - (3)

• Requires a line search

• No matrix inversion required

• Update in O(d²)

• Can we do better than O(d²)?

Page 26: Using Gradient Descent for Optimization and Learning


L-BFGS

• Low-rank estimate of B

• Based on the last m moves in parameter and gradient spaces

• Cost O(md) per update

• Same ballpark as steepest descent!
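For reference, a hedged sketch of the standard two-loop recursion that turns the last m (s, y) pairs into a search direction; this is the textbook recursion, not code from the talk.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate -B * grad from the last m (s, y) pairs."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((alpha, rho, s, y))
    if s_list:                                  # initial scaling gamma = s^T y / y^T y
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for alpha, rho, s, y in reversed(alphas):   # oldest pair first
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return -q                                   # search direction
```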

Page 27: Using Gradient Descent for Optimization and Learning


Conjugate Gradient

We want to solve Ax = b

• Relies on conjugate directions

• u and v are conjugate if uᵀAv = 0

• d mutually conjugate directions form a basis of ℝᵈ

• Goal: move along conjugate directions close to the steepest descent directions
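A minimal linear conjugate-gradient sketch for Ax = b (A must be symmetric positive definite; the example matrix and tolerance are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve Ax = b for symmetric positive definite A by moving along conjugate directions."""
    x = np.zeros_like(b)
    r = b - A @ x                  # residual = negative gradient of 0.5 x^T A x - b^T x
    p = r.copy()                   # first direction = steepest descent direction
    rs = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # next direction, conjugate to the previous ones
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]]); b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))
```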

Page 28: Using Gradient Descent for Optimization and Learning


Nonlinear Conjugate Gradient

• Extension to non-quadratic functions

• Requires a line search in every direction (important for conjugacy!)

• Various direction updates (Polak-Ribière)

Page 29: Using Gradient Descent for Optimization and Learning


Going from batch to stochastic

• dataset composed of n samples

• f(θ) = (1/n) Σ_i f_i(θ, x_i)

Do we really need to see all the examples before making a parameter update?
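The answer developed next is that we do not: we can update after every sample. A hedged SGD sketch on a synthetic least-squares problem (the data, learning rate and number of epochs are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
eta = 0.01
for epoch in range(20):
    for i in rng.permutation(n):               # one noisy update per sample
        g_i = (X[i] @ theta - y[i]) * X[i]     # gradient of f_i(theta, x_i)
        theta = theta - eta * g_i

print(np.linalg.norm(theta - theta_true))      # should be small
```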

Page 30: Using Gradient Descent for Optimization and Learning


Stochastic optimization

• Information is redundant amongst samples

• We can afford more frequent, noisier updates

• But problems arise...

Page 31: Using Gradient Descent for Optimization and Learning


Stochastic (Bottou)

Advantage

• Much faster convergence on large redundant datasets

Disadvantages

• Keeps bouncing around unless η is reduced

• Extremely hard to reach high accuracy

• Theoretical convergence conditions not as well defined

• Most second-order methods will not work

Page 32: Using Gradient Descent for Optimization and Learning


Batch (Bottou)

Advantages

• Guaranteed convergence to a local minimum under simple conditions

• Lots of tricks to speed up the learning

Disadvantage

• Painfully slow on large problems

Page 33: Using Gradient Descent for Optimization and Learning


Problems arising in the stochastic setting

• First-order descent: O(dn) ⟹ O(dn)

• Second-order methods: O(d² + dn) ⟹ O(d²n)

• Special cases: algorithms requiring a line search

• BFGS: not critical, may be replaced by a one-step update

• Conjugate Gradient: critical, no stochastic version

Page 34: Using Gradient Descent for Optimization and Learning


Successful stochastic methods

• Stochastic gradient descent

• Online BFGS (Schraudolph, 2007)

• Online L-BFGS (Schraudolph, 2007)

Page 35: Using Gradient Descent for Optimization and Learning


Conclusions of the tutorial

Batch methods

• Second-order methods have much faster convergence

• They are too expensive when d is large

• Except for L-BFGS and Nonlinear Conjugate Gradient

Page 36: Using Gradient Descent for Optimization and Learning


Conclusions of the tutorial

Stochastic methods

• Much faster updates

• Terrible convergence rates

• Stochastic Gradient Descent: T(ρ) = O(dκ²/ρ)

• Second-order Stochastic Descent: T(ρ) = O(d²/ρ)

Page 37: Using Gradient Descent for Optimization and Learning


Outline

1 Optimization

Basics

Approximations to Newton method

Stochastic Optimization

2 Learning (Bottou)

3 TONGA

Natural Gradient

Online Natural Gradient

Results

Page 38: Using Gradient Descent for Optimization and Learning


Learning

• Measure of quality f (“cost function”)

• Get training samples x_1, …, x_n from p

• Choose a model F with parameters θ

• Minimize E_p[f(θ, x)]

• Time budget T

Page 39: Using Gradient Descent for Optimization and Learning


Small-scale vs. Large-scale Learning

• Small-scale learning problem: the active budget constraint is the number of examples n.

• Large-scale learning problem: the active budget constraint is the computing time T.

Page 40: Using Gradient Descent for Optimization and Learning


Which algorithm should we use?

T(ρ) for various algorithms:

• Gradient Descent: T(ρ) = O(ndκ log(1/ρ))

• Second-Order Gradient Descent: T(ρ) = O(d² log log(1/ρ))

• Stochastic Gradient Descent: T(ρ) = O(dκ²/ρ)

• Second-Order Stochastic Descent: T(ρ) = O(d²/ρ)

Second-Order Gradient Descent seems a good choice!

Page 41: Using Gradient Descent for Optimization and Learning


Large-scale learning

• We are limited by the time T

• We can choose between ρ and n

• Better optimization means fewer examples

Page 42: Using Gradient Descent for Optimization and Learning


Generalization error

E(f̃_n) − E(f*) = E(f*_F) − E(f*)   (Approximation error)
               + E(f_n) − E(f*_F)   (Estimation error)
               + E(f̃_n) − E(f_n)   (Optimization error)

There is no need to optimize thoroughly if we cannot process enough data points!

Page 43: Using Gradient Descent for Optimization and Learning


Which algorithm should we use?

Time to reach E(f̃_n) − E(f*) < ε:

• Gradient Descent: O((d²κ/ε^{1/α}) log²(1/ε))

• Second-Order Gradient Descent: O((d²κ/ε^{1/α}) log(1/ε) log log(1/ε))

• Stochastic Gradient Descent: O(dκ²/ε)

• Second-Order Stochastic Descent: O(d²/ε)

with 1/2 ≤ α ≤ 1 (statistical estimation rate).

Page 44: Using Gradient Descent for Optimization and Learning


In a nutshell

• Simple stochastic gradient descent is extremely efficient

• Fast second-order stochastic gradient descent can win us a constant factor

• Are there other possible factors of improvement?

Page 45: Using Gradient Descent for Optimization and Learning


What we did not talk about (yet)

• we have access to x_1, …, x_n ∼ p

• we wish to minimize E_{x∼p}[f(θ, x)]

• can we use the uncertainty in the dataset to get information about p?

Page 46: Using Gradient Descent for Optimization and Learning


Outline

1 Optimization

Basics

Approximations to Newton method

Stochastic Optimization

2 Learning (Bottou)

3 TONGA

Natural Gradient

Online Natural Gradient

Results

Page 47: Using Gradient Descent for Optimization and Learning


Cost functions

Empirical cost function

f(θ) = (1/n) Σ_i f(θ, x_i)

True cost function

f*(θ) = E_{x∼p}[f(θ, x)]

Page 48: Using Gradient Descent for Optimization and Learning


Gradients

Empirical gradient

g = (1/n) Σ_i ∇θf(θ, x_i)

True gradient

g* = E_{x∼p}[∇θf(θ, x)]

Central-limit theorem:

g | g* ∼ N(g*, C/n)
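A small sketch of what g and C look like in practice for a least-squares model; the data and model are invented for illustration, and C is taken as the empirical covariance of the per-sample gradients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
theta = np.zeros(d)

# Per-sample gradients of f(theta, x_i) = 0.5 * (x_i^T theta - y_i)^2
G = (X @ theta - y)[:, None] * X     # shape (n, d), one gradient per row
g = G.mean(axis=0)                   # empirical gradient (estimate of g*)
C = np.cov(G, rowvar=False)          # empirical covariance of per-sample gradients

print(g)
print(C / n)                         # estimated covariance of g around g*
```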

Page 49: Using Gradient Descent for Optimization and Learning


Posterior over g*

Using

g* ∼ N(0, σ²I)

we have

g* | g ∼ N( (I + C/(nσ²))⁻¹ g ,  (I/σ² + nC⁻¹)⁻¹ )

and then

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

Page 50: Using Gradient Descent for Optimization and Learning


Aggressive strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize E_{g*}[εᵀg*]

• Solution:

ε = −η (I + C/(nσ²))⁻¹ g

Page 51: Using Gradient Descent for Optimization and Learning


Aggressive strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize E_{g*}[εᵀg*]

• Solution:

ε = −η (I + C/(nσ²))⁻¹ g

This is the regularized natural gradient.
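Continuing the sketch above, the regularized natural-gradient direction can be computed directly from g and C; the prior variance sigma2 and the step size eta below are arbitrary illustrative values.

```python
import numpy as np

def regularized_natural_gradient(g, C, n, sigma2=1.0, eta=0.1):
    """epsilon = -eta * (I + C / (n * sigma2))^{-1} g."""
    d = g.shape[0]
    M = np.eye(d) + C / (n * sigma2)
    return -eta * np.linalg.solve(M, g)

# Illustrative usage with g, C, n as in the previous sketch:
# theta = theta + regularized_natural_gradient(g, C, n)
```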

Page 52: Using Gradient Descent for Optimization and Learning


Conservative strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize Pr(εᵀg* > 0)

• Solution:

ε = −η (C/n)⁻¹ g

Page 53: Using Gradient Descent for Optimization and Learning


Conservative strategy

εᵀg* | g ∼ N( εᵀ(I + C/(nσ²))⁻¹ g ,  εᵀ(I/σ² + nC⁻¹)⁻¹ ε )

• we want to minimize Pr(εᵀg* > 0)

• Solution:

ε = −η (C/n)⁻¹ g

This is the natural gradient.

Page 54: Using Gradient Descent for Optimization and Learning


Online Natural Gradient

• Natural gradient has nice properties

• Its cost per iteration is in O(d²)

• Can we make it faster?

Page 55: Using Gradient Descent for Optimization and Learning


Goals

ε = −η (I + C/(nσ²))⁻¹ g

We must be able to update and invert C whenever a new gradient arrives.

Page 56: Using Gradient Descent for Optimization and Learning


Plan of action

Updating (uncentered) C:

C_t ∝ γ C_{t−1} + (1 − γ) g_t g_tᵀ

Page 57: Using Gradient Descent for Optimization and Learning


Plan of action

Updating (uncentered) C:

C_t ∝ γ C_{t−1} + (1 − γ) g_t g_tᵀ

Inverting C:

• maintain low-rank estimates of C using its eigendecomposition
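A hedged sketch of this plan: a running update of C and a low-rank, eigendecomposition-based approximation of (C + λI)⁻¹g. The decay gamma, rank k and regularizer lam are illustrative, and keeping only the top-k eigenpairs is a simplification of the exact batched scheme described on the next slides.

```python
import numpy as np

gamma, k, lam = 0.995, 5, 1e-3            # decay, rank kept, regularizer (illustrative)

def update_cov(C, g):
    """Running uncentered covariance: C_t = gamma * C_{t-1} + (1 - gamma) * g g^T."""
    return gamma * C + (1.0 - gamma) * np.outer(g, g)

def lowrank_natural_direction(C, g):
    """Approximate (C + lam*I)^{-1} g using only the top-k eigenpairs of C."""
    w, V = np.linalg.eigh(C)              # eigenvalues in ascending order
    w, V = w[-k:], V[:, -k:]              # keep the k largest eigenpairs
    coeff = (V.T @ g) / (w + lam)         # components along the kept eigenvectors
    rest = (g - V @ (V.T @ g)) / lam      # remaining components, treated as eigenvalue 0
    return V @ coeff + rest
```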

Page 58: Using Gradient Descent for Optimization and Learning


Eigendecomposition of C

• Computing k eigenvectors of C = GGᵀ is O(kd²)

• But computing k eigenvectors of GᵀG is O(kp²)!

• Still too expensive to compute for each new sample (and p must not grow)

Page 59: Using Gradient Descent for Optimization and Learning


Eigendecomposition of C

• Computing k eigenvectors of C = GGᵀ is O(kd²)

• But computing k eigenvectors of GᵀG is O(kp²)!

• Still too expensive to compute for each new sample (and p must not grow)

• Done only every b steps

Page 60: Using Gradient Descent for Optimization and Learning


To prove I’m not cheating

• eigendecomposition every b steps

C_t = γ^t C + Σ_{k=1}^{t} γ^{t−k} g_k g_kᵀ + λI,   t = 1, …, b

v_t = C_t⁻¹ g_t

X_t = [ γ^{t/2} U   γ^{(t−1)/2} g_1   …   γ^{1/2} g_{t−1}   g_t ]

C_t = X_t X_tᵀ + λI,   v_t = X_t α_t,   g_t = X_t y_t

α_t = (X_tᵀ X_t + λI)⁻¹ y_t

v_t = X_t (X_tᵀ X_t + λI)⁻¹ y_t

Page 61: Using Gradient Descent for Optimization and Learning


Computational complexity

• d dimensions

• k eigenvectors

• b steps between two updates

• Computing the eigendecomposition (every b steps): k(k + b)²

• Computing the natural gradient: d(k + b) + (k + b)²

• if p = k + b ≪ d, the cost per example is O(d(k + b))

Page 62: Using Gradient Descent for Optimization and Learning


What next

• Complexity in O(d(k + b))

• We need a small k

• If d > 10⁶, how large should k be?

Page 63: Using Gradient Descent for Optimization and Learning


Decomposing C

C is almost block-diagonal!

Page 64: Using Gradient Descent for Optimization and Learning


Quality of approximation

[Figure: two plots of the ratio of the squared Frobenius norms versus the number k of eigenvectors kept, comparing the full-matrix approximation with the block-diagonal approximation. Minimum with k = 5.]

Page 65: Using Gradient Descent for Optimization and Learning


Evolution of C

Page 66: Using Gradient Descent for Optimization and Learning


Results - MNIST

[Figure: negative log-likelihood and classification error on the test set versus CPU time (in seconds), comparing block-diagonal TONGA with stochastic gradient using batch sizes 1, 400, 1000 and 2000]

Page 67: Using Gradient Descent for Optimization and Learning


Results - Rectangles

[Figure: negative log-likelihood and classification error on the validation set versus CPU time (in seconds), comparing block-diagonal TONGA with stochastic gradient]

Page 68: Using Gradient Descent for Optimization and Learning


Results - USPS

[Figure: negative log-likelihood and classification error versus CPU time (in seconds), comparing block-diagonal TONGA with stochastic gradient]

Page 69: Using Gradient Descent for Optimization and Learning


TONGA - Conclusion

• Introducing uncertainty speeds up training

• There exists a fast implementation of the online natural gradient

Page 70: Using Gradient Descent for Optimization and Learning


Difference between C and H

• H accounts for small changes in the parameter space

• C accounts for small changes in the input space

• C is not just a “cheap cousin” of H

Page 71: Using Gradient Descent for Optimization and Learning


Future research

Much more to do!

• Can we combine the effects of H and C?

• Are there better approximations of C and H?

• Anything you can think of!

Page 72: Using Gradient Descent for Optimization and Learning


Thank you!

Questions?