
Regularization and Optimization of Backpropagation

Keith L. Downing

The Norwegian University of Science and Technology (NTNU), Trondheim, Norway
[email protected]

October 17, 2017


Regularization

Definition of Regularization

Reduction of testing error while maintaining a low training error.

Excessive training does reduce training error, but often at the expense of higher testing error.

The network essentially memorizes the training cases, which hinders its ability to generalize and handle new cases (e.g., the test set).

Exemplifies the tradeoff between bias and variance: an NN with low bias (i.e. training error) has trouble reducing variance (i.e. test error).

Bias(θ_m) = E(θ_m) − θ

where θ is the parameterization of the true generating function, and θ_m is the parameterization of an estimator based on the sample m. E.g. θ_m are the weights of an NN after training on sample set m.

Var(θ_m) = the degree to which the estimator's results change with other data samples (e.g., the test set) from the same data generator.

→ Regularization = an attempt to combat the bias-variance tradeoff: to produce a θ_m with low bias and low variance.


Types of Regularization

Parameter Norm Penalization

Data Augmentation

Multitask Learning

Early Stopping

Sparse Representations

Ensemble Learning

Dropout

Adversarial Training


Parameter Norm Penalization

Some parameter sets, such as NN weights, produce over-fitting when many parameters (weights) have a high absolute value. So incorporate this into the loss function.

L̃(θ, C) = L(θ, C) + αΩ(θ)

L = loss function; L̃ = regularized loss function; C = cases; θ = parameters of the estimator (e.g. weights of the NN); Ω = penalty function; α = penalty scaling factor

L2 Regularization

Ω_2(θ) = (1/2) ∑_{w_i ∈ θ} w_i^2

Sum of squared weights across the whole neural network.

L1 Regularization

Ω_1(θ) = ∑_{w_i ∈ θ} |w_i|

Sum of absolute values of all weights across the whole neural network.
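
As a concrete illustration, here is a minimal NumPy sketch (not from the slides) of the two penalty terms and their gradient contributions; the function names and the α value are arbitrary choices for the example.

import numpy as np

def l2_penalty(weights):
    # Omega_2(theta) = 1/2 * sum of squared weights
    return 0.5 * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights):
    # Omega_1(theta) = sum of absolute weight values
    return sum(np.sum(np.abs(w)) for w in weights)

def regularized_loss(base_loss, weights, alpha=1e-4, norm="L2"):
    # L~(theta, C) = L(theta, C) + alpha * Omega(theta)
    penalty = l2_penalty(weights) if norm == "L2" else l1_penalty(weights)
    return base_loss + alpha * penalty

def penalty_gradient(w, alpha=1e-4, norm="L2"):
    # d(Omega_2)/dw_i = w_i        (scales linearly with the weight)
    # d(Omega_1)/dw_i = sign(w_i)  (constant magnitude; drives weights to zero)
    return alpha * (w if norm == "L2" else np.sign(w))

weights = [np.array([0.5, -1.2]), np.array([[0.3, 0.0]])]
print(regularized_loss(0.8, weights, norm="L1"))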


L1 -vs- L2 Regularization

Though very similar, they can have different effects.

In cases where most weights have a small absolute value (as during the initial phases of typical training runs), L1 imposes a much stiffer penalty: |w_i| > w_i^2.

Hence, L1 can have a sparsifying effect: it drives many weights to zero.

Comparing penalty derivatives (which contribute to the ∂L/∂w_i gradients):

∂Ω_2(θ)/∂w_i = ∂/∂w_i ((1/2) ∑_{w_k ∈ θ} w_k^2) = w_i

∂Ω_1(θ)/∂w_i = ∂/∂w_i (∑_{w_k ∈ θ} |w_k|) = sign(w_i) ∈ {−1, 0, 1}

So the L2 penalty scales linearly with w_i, giving a more stable update scheme. L1 provides a more constant gradient that can be large compared to w_i (when |w_i| < 1).

This also contributes to L1's sparsifying effect upon θ.

Sparsification supports feature selection in Machine Learning.


Dataset Augmentation

It is easier to overfit small data sets. By training on larger sets, generalization naturally increases.

But we may not have many cases.

Sometimes we can manufacture additional cases.

Approaches to Manufacturing New Data Cases

1 Modify the features in many (small or big) ways that we know will not change the target classification. E.g. rotate or translate an image.

2 Add (small) amounts of noise to the features and assume that this does not change the target. E.g., add noise to a vector of sensor readings, stock trends, customer behaviors, patient measurements, etc.

Another problem is noisy data: the targets are wrong.

Solution: Add (small) noise to all target vectors and use softmax on network outputs. This prevents overfitting to bad cases and thus improves generalization.
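
A minimal NumPy sketch of the noise-based augmentation idea; the sigma value, the number of copies, and the toy data are arbitrary assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def augment_with_noise(features, n_copies=5, sigma=0.01):
    # Manufacture extra cases by adding small Gaussian noise to each feature
    # vector, assuming the perturbation does not change the target.
    copies = [features + rng.normal(0.0, sigma, size=features.shape)
              for _ in range(n_copies)]
    return np.concatenate([features] + copies, axis=0)

X = rng.normal(size=(100, 8))     # 100 toy cases, 8 features
X_aug = augment_with_noise(X)     # 600 cases after augmentation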


Multitask Learning

Representations learned for task 1 can be reused for task 2, thus requiring fewer cases and less training to learn task 2.

By forcing the net to perform multiple tasks, its shared hidden layer (H0) has general-purpose representations.

[Figure: the input feeds a shared hidden layer H0 (shared representations: a general detector of patterns in the data, unrelated to any one task), which feeds task-specific layers H1, H2, H3 (specialized representations) and outputs Out-1, Out-2 (specialized actions).]


Early Stopping

Divide data into training, validation and test sets.

Check the error on the validation set every K epochs, but do not learn (i.e. modify weights) from these cases.

Stop training when the validation error rises.

Easy, non-obtrusive form of regularization: no need to change the network, the loss function, etc. Very common.

[Figure: the data set is divided into training, validation and testing portions. Error curves over epochs for the training, validation and testing sets; training should stop where the validation error begins to rise.]


Sparse Representations

Instead of penalizing parameters (i.e., weights) in the loss function, penalize hidden-layer activations: Ω(H), where H = hidden layer(s) and Ω can be L1, L2 or others.

Sparser representations → better pattern separation among classes → better generalization.

[Figure: weights and layer activations colored negative, positive or zero, contrasting sparse parameters with a dense representation versus dense parameters with a sparse representation.]


Ensemble Learning

Train multiple classifiers on different versions of the data set.

Each has different, but uncorrelated (key point) errors.

Use voting to classify test cases.

Each classifier has high variance, but the ensemble does not.

[Figure: the original data set is resampled into diverse training sets; the resulting classifiers vote on a new test case ("Diamond", "Cross", "Diamond").]


Dropout

[Figure: two training cases (Case 33, Case 34) with different random binary dropout masks over the non-output nodes; output nodes are never dropped.]

For each case, randomly select nodes at all levels (except output) whose outputs will be clamped to 0.

Dropout probability range: (0.2, 0.5), lower for input than hidden layers.

Similar to bagging, with each model = a submodel of the entire NN.

Due to missing non-output nodes (but fixed-length outputs and targets) during training, each connection has more significance and thus needs (and achieves by learning) a higher magnitude.

Testing involves all nodes.

Scaling: prior to testing, multiply all weights by (1 − p_h), where p_h = probability of dropping hidden nodes.
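
A minimal NumPy sketch of dropout as described above: random masking of non-output activations during training, and weight scaling by (1 − p_drop) before testing. The function names and default probability are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, p_drop=0.5, training=True):
    # Training: clamp a random subset of node outputs to 0 for this case.
    # Testing: use all nodes, so no mask is applied here.
    if not training:
        return activations
    mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * mask

def scale_weights_for_testing(weights, p_drop=0.5):
    # As on the slide: before testing, multiply the weights by (1 - p_drop)
    # so expected activation magnitudes match those seen during training.
    return [w * (1.0 - p_drop) for w in weights]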


Adversarial Training

[Figure: a 2-D feature space (Feature 1, Feature 2) where clouds show the NN's decision regions and boxes the true regions for Classes 1-3. An old case with dL/df1 low and dL/df2 high is mutated into a new case that crosses the NN's boundary.]

Create new cases by mutating old cases in ways that maximize the chance that the net will misclassify the new case.

Calculate ∂L/∂f_i for the loss function L and each input feature f_i.

∀i: f_i ← f_i + λ ∂L/∂f_i, i.e. make the largest moves along the dimensions most likely to increase loss/error; λ is a very small positive number.

Train on the new cases → increased generality.
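
A self-contained toy example of the adversarial update above. A logistic-regression "network" is used because its input gradient ∂L/∂f has a closed form; the weights, features and λ are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gradient(w, f, target):
    # Cross-entropy loss of p = sigmoid(w . f) gives dL/df_i = (p - target) * w_i.
    p = sigmoid(np.dot(w, f))
    return (p - target) * w

def adversarial_case(w, f, target, lam=0.05):
    # Move each feature a small step in the direction that increases the loss.
    return f + lam * input_gradient(w, f, target)

w = np.array([2.0, -1.0, 0.5])
f_old = np.array([0.3, 0.8, -0.2])
f_new = adversarial_case(w, f_old, target=1.0)   # new, harder training case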


Optimization

Classic Optimization Goal: Reduce training error.

Classic Machine Learning Goal: Produce a general classifier.

→ Low error on training and test set.

→ Finding the global minimum in the error (loss) landscape is less important; a local minimum may even be a better basis for good performance on the test set.

Techniques to Improve Backprop Training

Momentum

Adaptive Learning Rates


Smart Search

Optimization → making smart moves in the (K+1)-dimensional error (loss) landscape, where K = the number of parameters (weights + biases).

Ideally, this allows fast movement to the global error minimum.

Δw_i = −λ ∂(Loss)/∂w_i

Techniques for Smart Movement

λ: Make the step size appropriate for the texture (smoothness, bumpiness) of the current region of the landscape.

∂(Loss)/∂w_i: Avoid domination by the current (local) gradient by including previous gradients in the calculation of Δw_i.


Overshoot

[Figure: successive steps Δw(t), Δw(t+1), Δw(t+2) overshooting back and forth across a canyon of the descent gradient, compared with an appropriately sized Δw.]

When the current search state is in a bowl (canyon) of the search landscape, following gradients can lead to oscillations if the step size is too big.

Step size is λ (learning rate): Δw = −λ ∂L/∂w.


Inefficient Oscillations

[Figure: Loss (L) surface over weights w1 and w2, shaped like a canyon that is flat along w1 and steep along w2.]

Following the (weaker) gradient along w1 leads to the minimum loss, but the w2 gradients, up and down the sides of the cylinder (canyon), cause excess lateral motion.

Since the w1 and w2 gradients are independent, search should still move downhill quickly (in this smooth landscape).

But in a more rugged landscape, the lateral movement could visit locations where the w1 gradient is untrue to the general trend. E.g. a little dent or bump on the side of the cylinder could have gradients indicating that increasing w1 would help reduce the loss (L).


Dangers of Oscillation in a Rugged Landscape

[Figure: a rugged landscape where a small bump produces a local gradient pointing away from the main trend; the move recommended by the local gradient ends up far from where the best move would have landed.]

How do we prevent local gradients from dominating the search?


Momentum: Combatting Local Minima

[Figure: error curve over a weight w with a shallow local minimum; the step −λ dE/dw is combined with the previous step Δw(t−1) to form Δw(t), carrying the search past the local minimum.]

Δw_ij(t) = −λ ∂E/∂w_ij + α Δw_ij(t−1)

Without momentum, the plain step −λ ∂E/∂w can get stuck in a local minimum.
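
A minimal sketch of the momentum update on a toy 1-D error surface; the surface, λ and α are arbitrary choices for illustration.

import numpy as np

def momentum_step(w, grad, velocity, lam=0.1, alpha=0.9):
    # Delta_w(t) = -lambda * dE/dw + alpha * Delta_w(t-1)
    velocity = -lam * grad + alpha * velocity
    return w + velocity, velocity

# Toy error surface E(w) = w^4 - 3w^2 + w, which has a shallow dip near w = 1.1
# and a deeper minimum near w = -1.3.
def dE_dw(w):
    return 4 * w**3 - 6 * w + 1

w, v = 2.0, 0.0
for _ in range(100):
    # The accumulated velocity can carry the search through shallow dips
    # that would trap plain gradient descent.
    w, v = momentum_step(w, dE_dw(w), v, lam=0.01, alpha=0.9)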


Momentum in High Dimensions

[Figure: error landscape with a high-error region containing a local minimum and a low-error region containing the global minimum; momentum carries the search out of the local minimum toward the global one.]


Adaptive Learning Rates

These methods have one learning rate per parameter (i.e. weight and bias).

Batch Gradient Descent

Running all training cases before updating any weights.

Delta-bar-delta Algorithm (Jacobs, 1988)

While sign(∂Loss/∂Z) is constant, learning rate ↑.

When sign(∂Loss/∂Z) changes, learning rate ↓,

where Z = a learning parameter.

Stochastic Gradient Descent (SGD)

Running a minibatch, then updating weights.

Scale each learning rate inversely by accumulated gradient magnitudes over time. So the learning rate is always decreasing; the question is: how quickly?

More high-magnitude (pos or neg) gradients → learning rate ↓ quicker.

Popular variants: AdaGrad, RMSProp, Adam


Basic Framework for Adaptive SGD

1 λ = a tensor of learning rates, one per parameter.
2 Init each rate in λ to the same value.
3 Init the global gradient accumulator (R) to 0.
4 Minibatch Training Loop
   Sample a minibatch, M.
   Reset the local gradient accumulator: G ← ZeroTensor.
   For each case (c) in M:
      Run c through the network; compute loss and gradients (g).
      G ← G + g (for all parameters)
   R ← f_accum(R, G)
   Δθ = −1 × f_scale(R, λ) ⊙ G
   where θ = weights and biases, and ⊙ = the element-wise operator: multiply each gradient by the scale factor.

Key differences between algorithms: f_accum and f_scale


Specializations

AdaGrad (Duchi et al., 2011)

f_accum(R, G) = R + G ⊙ G (square each gradient in G)

f_scale(R, λ) = λ / (δ + √R) (where δ is very small, e.g. 10^−7)

RMSProp (Hinton, 2012)

f_accum(R, G) = ρR + (1 − ρ) G ⊙ G (where ρ = decay rate)

f_scale(R, λ) = λ / √(δ + R) (where δ is very small, e.g. 10^−6)

ρ controls the weight given to older gradients.

Adam (adaptive moments) (Kingma and Ba, 2014)

f_accum(R, G) = (ρR + (1 − ρ) G ⊙ G) / (1 − ρ^T) (where ρ = decay rate, T = timestep)

f_scale(R, λ) = λS / (δ + √R) (δ = 10^−8)

where S ← (φS + (1 − φ)G) / (1 − φ^T) (S = first-moment estimate, φ = decay)

RMSProp and Adam are the most popular.
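
Possible NumPy versions of these accumulate/scale pairs, written to plug into the adaptive_sgd_step sketch above. Adam is written as a single update that stores the raw moment accumulators and applies the bias corrections when the step is taken, which matches the slide's formulas; the hyperparameter defaults are just common choices.

import numpy as np

# AdaGrad
def adagrad_accum(R, G):
    return R + G * G                        # accumulate squared gradients

def adagrad_scale(R, lam, delta=1e-7):
    return lam / (delta + np.sqrt(R))

# RMSProp (rho = decay rate for old squared gradients)
def rmsprop_accum(R, G, rho=0.9):
    return rho * R + (1.0 - rho) * G * G

def rmsprop_scale(R, lam, delta=1e-6):
    return lam / np.sqrt(delta + R)

# Adam: also keeps the first-moment estimate S (momentum); t counts steps from 1.
def adam_update(theta, G, R, S, t, lam=0.001, rho=0.999, phi=0.9, delta=1e-8):
    S = phi * S + (1.0 - phi) * G           # first moment (momentum)
    R = rho * R + (1.0 - rho) * G * G       # second moment
    S_hat = S / (1.0 - phi ** t)            # bias-corrected first moment
    R_hat = R / (1.0 - rho ** t)            # bias-corrected second moment
    theta = theta - lam * S_hat / (delta + np.sqrt(R_hat))
    return theta, R, S

# Example: plug RMSProp into the generic framework sketched earlier, e.g.
# theta, R = adaptive_sgd_step(theta, grads, R, 0.01, rmsprop_accum, rmsprop_scale)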


Momentum in Adam

The S term in the Adam optimizer is called the first-moment estimate.

It implements momentum, producing a modified value of the gradient G.

Since S (and thus G) is included in f_scale, we do not need G in the final calculation of Δθ, which, in Adam, is:

Δθ = −1 × f_scale(R, λ)


Second Derivatives: Quantifying Curvature

[Figure: a loss curve L(w) annotated with regions where dL/dw < 0 or dL/dw > 0, and where d²L/dw² is negative, zero, or positive.]

When sign(∂L/∂w) = sign(∂²L/∂w²), the current slope gets more extreme.

For gradient-descent learning: ∂²L/∂w² < 0 (> 0) → more (less) descent per Δw than estimated by the standard gradient, ∂L/∂w.


Approximations using 1st and 2nd Derivatives

For a point (p_0) in search space for which we know F(p_0), use the 1st and 2nd derivatives of F at p_0 to estimate F(p) for a nearby point p.

Knowing F(p) affects the decision of moving to (or toward) p from p_0.

Using the 2nd-order Taylor Expansion of F to approximate F(p):

F(p) ≈ F(p_0) + F′(p_0)(p − p_0) + (1/2) F″(p_0)(p − p_0)^2

Extend this to a multivariable function such as L (the loss function), which computes a scalar L(θ) when given the tensor of variables, θ:

L(θ) ≈ L(θ_0) + (θ − θ_0)^T (∂L/∂θ)|_{θ_0} + (1/2) (θ − θ_0)^T (∂²L/∂θ²)|_{θ_0} (θ − θ_0)

∂L/∂θ = the Jacobian, and ∂²L/∂θ² = the Hessian

Knowing L(θ) affects the decision of moving to (or toward) θ from θ_0.
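
A worked NumPy example of this second-order estimate for a small quadratic loss (chosen so the Jacobian and Hessian are easy to write down and the approximation is exact):

import numpy as np

# Toy 2-parameter loss: L(theta) = theta_0^2 + 3*theta_0*theta_1 + 2*theta_1^2
def L(theta):
    return theta[0]**2 + 3*theta[0]*theta[1] + 2*theta[1]**2

def jacobian(theta):
    return np.array([2*theta[0] + 3*theta[1],
                     3*theta[0] + 4*theta[1]])

def hessian(theta):
    return np.array([[2.0, 3.0],
                     [3.0, 4.0]])

def taylor_estimate(theta, theta0):
    # L(theta) ~ L(theta0) + d^T J + (1/2) d^T H d, with d = theta - theta0
    d = theta - theta0
    return L(theta0) + d @ jacobian(theta0) + 0.5 * d @ hessian(theta0) @ d

theta0 = np.array([1.0, -1.0])
theta = np.array([1.2, -0.9])
print(taylor_estimate(theta, theta0), L(theta))   # equal (up to float error)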


The Hessian Matrix

A matrix (H) of all second derivatives of a function, f, with respect to all parameters.

For gradient descent learning, f = the loss function (L) and the parameters are weights (and biases).

H(L)(w)_ij = ∂²L / ∂w_i ∂w_j

Wherever the second derivatives are continuous, H is symmetric:

∂²L / ∂w_i ∂w_j = ∂²L / ∂w_j ∂w_i

Verify this symmetry: let L(x, y) = 4x²y + 3y³x², and then compute ∂²L/∂x∂y and ∂²L/∂y∂x.
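
A quick symbolic check of the symmetry exercise, assuming SymPy is available:

import sympy as sp

x, y = sp.symbols('x y')
L = 4*x**2*y + 3*y**3*x**2

d2L_dxdy = sp.diff(L, x, y)    # differentiate w.r.t. x, then y
d2L_dydx = sp.diff(L, y, x)    # differentiate w.r.t. y, then x

print(d2L_dxdy)                # 8*x + 18*x*y**2 (up to term ordering)
print(d2L_dydx)                # same expression: the mixed partials agree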


The Hessian Matrix

H(L)(w) =

| ∂²L/∂w_1²      ∂²L/∂w_1∂w_2   ···   ∂²L/∂w_1∂w_n |
| ∂²L/∂w_2∂w_1   ∂²L/∂w_2²      ···   ∂²L/∂w_2∂w_n |
|      ⋮              ⋮           ⋱        ⋮        |
| ∂²L/∂w_n∂w_1   ∂²L/∂w_n∂w_2   ···   ∂²L/∂w_n²    |


Function Estimates via Jacobian and Hessian

[Figure: contour plot of F(x, y), with points p and q separated by the displacement (Δx, Δy).]

Given: point p, point q, F(p), the Jacobian (J) and Hessian (H) at p.

Goal: Estimate F(q)

Approach: Combine Δp = [Δx, Δy]^T with J and H.

Using the Jacobian

[Δx, Δy] • [∂F/∂x, ∂F/∂y]^T |_p ≈ ΔF(x, y)|_{p→q}


Function Estimates via Jacobian and Hessian

Hessian • Δp ≈ Jacobian:

| ∂²F/∂x²    ∂²F/∂x∂y |      | Δx |     | ∂F/∂x |
| ∂²F/∂y∂x   ∂²F/∂y²  |_p  • | Δy |  ≈  | ∂F/∂y |_p

Δp^T • Hessian • Δp ≈ ΔF(x, y):

[Δx, Δy] • | ∂²F/∂x²    ∂²F/∂x∂y |      | Δx |
           | ∂²F/∂y∂x   ∂²F/∂y²  |_p  • | Δy |  ≈  [Δx, Δy] • [∂F/∂x, ∂F/∂y]^T |_p ≈ ΔF(x, y)|_{p→q}


Eigenvalues of The Hessian Matrix

Since the Hessian is real and symmetric, it decomposes into an eigenvector basis and a set of real eigenvalues.

For each eigenvector-eigenvalue pair (v_i, κ_i), the second derivative of L in the direction v_i is κ_i.

The max (min) eigenvalue = the max (min) second derivative. These indicate directions of max positive and negative curvature, along with flatter directions, all depending upon the signs of the eigenvalues.

Condition number = ratio of magnitudes of the max and min κ: max_{i,j} |κ_i / κ_j|

A high condition number (at a location in search space) → difficult gradient-descent search. Curvatures vary greatly in different directions, so the shared learning rate (which affects movement in all dimensions) may cause too big a step in some directions, and too small in others.


Optimal Step Sizes based on Hessian Eigenvalues

When the derivative of the loss function is negative along eigenvector v_i:

κ_i ≈ 0 → the landscape has constant slope → take a normal-sized step along v_i, since the gradient is an accurate representation of the surrounding area.

κ_i > 0 → the landscape is curving upward → only take a small step along v_i, since a normal step could end up in a region of increased error/loss.

κ_i < 0 → the landscape is curving downward → take a large step along v_i to descend quickly.

The proximity of the i-th dimensional axis to each such eigenvector indicates the proper step to take along dimension i: the proper Δw_i.
