regularization and optimization of backpropagation · keith l. downing regularization and...
Post on 15-Mar-2019
228 Views
Preview:
TRANSCRIPT
Regularization and Optimization ofBackpropagation
Keith L. Downing
The Norwegian University of Science and Technology (NTNU)Trondheim, Norwaykeithd@idi.ntnu.no
October 17, 2017
Keith L. Downing Regularization and Optimization of Backpropagation
Regularization
Definition of Regularization
Reduction of testing error while maintaining a low training error.
Excessive training does reduce tranining error, but often at the expenseof higher testing error.
The network essentially memorizes the training cases, which hindersits ability to generalize and handle new cases (e.g., the test set).
Exemplifies tradeoff between bias and variance: an NN with low bias(i.e. training error) has trouble reducing variance (i.e. test error).
Bias(θm) = E(θm)−θ
Where θ is the parameterization of the true generating function, and θmis the parameterization of an estimator based on the sample, m. E.g.θm are weights of an NN after training on sample set m.
Var(θ) = degree to which the estimator’s results change with other datasamples (e.g., the test set) from same data generator.
→ Regularization = attempt to combat the bias-variance tradeoff: toproduce a θm with low bias and low variance.
Keith L. Downing Regularization and Optimization of Backpropagation
Types of Regularization
Parameter Norm Penalization
Data Augmentation
Multitask Learning
Early Stopping
Sparse Representations
Ensemble Learning
Dropout
Adversarial Training
Keith L. Downing Regularization and Optimization of Backpropagation
Parameter Norm Penalization
Some parameter sets, such as NN weights, achieve over-fitting when manyparameters (weights) have a high absolute value. So incorporate this into theloss function.
L(θ ,C) = L(θ ,C) + αΩ(θ)
L = loss function; L = regularized loss function; C = cases; θ = parameters ofthe estimator (e.g. weights of the NN); Ω = penalty function; α = penaltyscaling factor
L2 Regularization
Ω2(θ) = 12 ∑wi∈θ w2
i
Sum of squared weights across the whole neural network.
L1 Regularization
Ω1(θ) = ∑wi∈θ |wi |Sum of absolute values of all weights across the whole neural network.
Keith L. Downing Regularization and Optimization of Backpropagation
L1 -vs- L2 Regularization
Though very similar, they can have different effects.
In cases where most weights have a small absolute value (as during theinitial phases of typical training runs), L1 imposes a much stiffer penalty:|wi |> w2
i .
Hence, L1 can have a sparsifying effect: it drives many weights to zero.
Comparing penalty derivatives (which contribute to ∂ L∂wi
gradients):
∂ Ω2(θ)∂wi
= ∂
∂wi( 1
2 ∑wk∈θ w2k ) = wi
∂ Ω1(θ)∂wi
= ∂
∂wi∑wk∈θ |wk |= sign(wk ) ∈ −1,0,1
So L2 penalty scales linearly with wi , giving a more stable updatescheme. L1 provides a more constant gradient that can be largecompared to wi (when |wi |< 1).
This also contributes to L1’s sparsifying effect upon θ .
Sparsification supports feature selection in Machine Learning.
Keith L. Downing Regularization and Optimization of Backpropagation
Dataset Augmentation
It is easier to overfit small data sets. By training on larger sets,generalization naturally increases.
But we may not have many cases.
Sometimes we can manufacture additional cases.
Approaches to Manufacturing New Data Cases
1 Modify the features in many (small or big) ways, but which we know willnot change the target classification. E.g. rotate or translate an image.
2 Add (small) amounts of noise to the features and assume that this doesnot change the target. E.g., add noise to a vector of sensor readings,stock trends, customer behaviors, patient measurements, etc.
Another problem is noisy data: the targets are wrong.
Solution: Add (small) noise to all target vectors and use softmax onnetwork outputs. This prevents overfitting to bad cases and thusimproves generalization.
Keith L. Downing Regularization and Optimization of Backpropagation
Multitask Learning
Representations learned for task 1 can be reused for task 2, thusrequiring fewer cases and less training to learn task 2.By forcing the net to perform multiple tasks, its shared hidden layer (H0)has general-purpose representations.
Input
H0
H1 H2 H3
Out-1 Out-2
SharedRepresentations
SpecializedRepresentations
SpecializedActions
General detector ofpatterns in data, butunrelated to a task
Keith L. Downing Regularization and Optimization of Backpropagation
Early Stopping
Divide data into training, validation and test sets.
Check the error on the validation set every K epochs, but do not learn(i.e. modify weights) from these cases.
Stop training when the validation error rises.
Easy, non-obtrustive form of regularization: no need to change thenetwork, the loss function, etc. Very common.
Training
Validation
Testing
Data Set
Error Training
Validation
Testing
Should StopTraining Here
Epochs
Keith L. Downing Regularization and Optimization of Backpropagation
Sparse Representations
Instead of penalizing parameters (i.e., weights) in the loss function,penalize hidden-layer activations.Ω(H) where H = hidden layer(s) and Ω can be L1, L2 or others.Sparser reps→ better pattern separation among classes→ bettergeneralization.
Negative
Positive
Zero
Weights LayerActivations
Sparse Parameters Dense Representation
Dense Parameters Sparse Representation
Keith L. Downing Regularization and Optimization of Backpropagation
Ensemble Learning
Train multiple classifiers on different versions of the data set.Each has different, but uncorrelated (key point) errors.Use voting to classify test cases.Each classifier has high variance, but the ensemble does not.
Original Data Set Diverse Training Sets
New Test Case
"Cross" "Diamond""Diamond"
Keith L. Downing Regularization and Optimization of Backpropagation
Dropout
1 0 0 1 1 0
Never dropoutputnodes
0 0 1 0 1 1
Case 33 Case 34
For each case, randomly select nodes at all levels (except output)whose outputs will be clamped to 0.Dropout prob range: (0.2, 0.5) - lower for input than hidden layers.Similar to bagging, with each model = a submodel of entire NN.Due to missing non-output nodes (but fixed-length outputs and targets)during training, each connection has more significance and thus needs(and achieves by learning) a higher magnitude.Testing involves all nodes.Scaling: prior to testing, multiply all weights by (1- ph), where ph = prob.dropping hidden nodes.
Keith L. Downing Regularization and Optimization of Backpropagation
Adversarial Training
OldCase
NewCase
Feature 1
Feat
ure
2
dL / d f1 = low
dL / d f2 = high
Class 1
Class 2
Class 3
Class 1
Clouds: NN regionsBoxes: True regions
Class 3
Create new cases by mutating old cases in ways that maximize thechance that the net will misclassify the new case.
Calc ∂L∂ fi
for loss func (L) and each input feature fi .
∀i : fi ← fi + λ∂L∂ fi
- make largest moves along dimensions most likely toincrease loss/error; λ is very small positive.
Train on the new cases→ Increased generality.
Keith L. Downing Regularization and Optimization of Backpropagation
Optimization
Classic Optimization Goal: Reduce training errorClassic Machine Learning Goal: Produce a generalclassifier.
→ Low error on training and test set.→ Finding the global minimum in the error (loss) landscapeis less important; a local minimum may even be a betterbasis for good performance on the test set.
Techniques to Improve Backprop Training
Momentum
Adaptive Learning Rates
Keith L. Downing Regularization and Optimization of Backpropagation
Smart Search
Optimization→ Making smart moves in the K+1 dimensionalerror (loss) landscape; K = ‖ parameters ‖ (wgts + biases).
Ideally, this allows fast mvmt to global error mimimum.
4wi =−λ∂ (Loss)
∂wi
Techniques for Smart Movement
λ : Make step size appropriate for the texture (smoothness,bumpiness) of the current region of the landscape.∂ (Loss)
∂wi: Avoid domination by the current (local) gradient by
including previous gradients in the calculation of 4wi .
Keith L. Downing Regularization and Optimization of Backpropagation
Overshoot
∆w(t+1)∆w(t) ∆w(t+2)
Descent Gradient
Appropriate Δw
When current search state is in a bowl (canyon) of thesearch landscape, following gradients can lead tooscillations if the step size is too big.Step size is λ (learning rate): 4w =−λ
∂L∂w .
Keith L. Downing Regularization and Optimization of Backpropagation
Inefficient Oscillations
Loss(L)
w1
w2
Following the (weaker) gradient along w1 leads to the minimum loss, butthe w2 gradients, up and down the sides of the cylinder (canyon), causeexcess lateral motion.Since the w1 and w2 gradients are independent, search should stillmove downhill quickly (in this smooth landscape).But in a more rugged landscape, the lateral movement could visitlocations where the w1 gradient is untrue to the general trend. E.g. alittle dent or bump on the side of the cylinder could have gradientsindicating that increasing w1 would help reduce Loss (L).
Keith L. Downing Regularization and Optimization of Backpropagation
Dangers of Oscillation in a Rugged Landscape
BumpMain Trend
Best Move MoveRecommended
by Local Gradient
Should end up here!
End uphere
How do we prevent local gradients from dominating the search?
Keith L. Downing Regularization and Optimization of Backpropagation
Momentum: Combatting Local Minima
- ƛdE/dw
!∆w(t-1)
∆w(t)
∆w(t-1)
∆w(t)
Error
w
∆wij(t) =−λ∂Ei
∂wij+ α∆wij(t−1)
Without momentum, λ∂E∂w leads to a local minimum
Keith L. Downing Regularization and Optimization of Backpropagation
Momentum in High Dimensions
Momentum
Global Minimum
Local MinimumHigh Error Region
Low Error Region
Keith L. Downing Regularization and Optimization of Backpropagation
Adaptive Learning Rates
These methods have one learning rate per parameter (i.e. weight and bias).
Batch Gradient Descent
Running all training cases before updating any weights.
Delta-bar-delta Algorithm (Jacobs, 1988)
While sign( ∂Loss∂Z ) is constant, learning rate ↑.
When sign( ∂Loss∂Z ) changes, learning rate ↓
where Z = a learning parameter.
Stochastic Gradient Descent (SGD)
Running a minibatch, then updating weights.
Scale each learning rate inversely by accumulated gradientmagnitudes over time. So learning rate is always decreasing; thequestion is: How quickly?
More high-magnitude (pos or neg) gradients→ learning rate ↓ quicker.
Popular variants: AdaGrad, RMSProp, Adam
Keith L. Downing Regularization and Optimization of Backpropagation
Basic Framework for Adaptive SGD
1 λ = a tensor of learning rates, one per parameter.2 Init each rate in λ to same value.3 Init the global gradient accumulator (R) to 0.4 Minibatch Training Loop
Sample a minibatch, M.Reset local gradient accumulator: G← ZeroTensor .For each case (c) in M:
Run c through the network; compute loss and gradients (g).G← G + g (for all parameters)
R← faccum(R,G)∆θ =−1× fscale(R,λ )Gwhere θ = weights and biases, and = element-wiseoperator: multiply each gradient by the scale factor.
Key differences between algorithms: faccum and fscale
Keith L. Downing Regularization and Optimization of Backpropagation
Specializations
AdaGrad (Duchi et. al., 2011)
faccum(R,G) = R + GG (square each gradient in G)
fscale(R,λ ) = λ
δ+√
R(where δ is very small, e.g. 10−7)
RMSProp (Hinton, 2012)
faccum(R,G) = ρR + (1−ρ)GG (where ρ = decay rate)
fscale(R,λ ) = λ√δ+R
(where δ is very small, e.g. 10−6)
ρ controls weight given to older gradients.
Adam (adaptive moments) (Kingma and Ba, 2014)
faccum(R,G) = ρR+(1−ρ)GG1−ρT (where ρ = decay rate, T = timestep)
fscale(R,λ ) = λSδ+√
R(δ = 10−8)
where S← φS+(1−φ)G1−φT (S = first-moment estimate, φ = decay)
RMSProp and Adam are the most popular.Keith L. Downing Regularization and Optimization of Backpropagation
Momentum in Adam
The S term in the Adam optimizer is called the firstmoment estimate.It implements momentum, to produce a modified value ofthe gradient G.Since S (and thus G) is included in fscale, we do not need Gin the final calculation of 4θ , which, in Adam, is:
4θ =−1× fscale(R,λ )
Keith L. Downing Regularization and Optimization of Backpropagation
Second Derivatives: Quantifying Curvature
dL/dw < 0 dL/dw > 0
d2L/dw2 = 0L
w
d2L/dw2< 0
d2L/dw2 > 0
When sign( ∂L∂w ) = sign( ∂ 2L
∂w2 ), the current slope gets more extreme.
For gradient descent learning: ∂ 2L∂w2 < 0 (> 0)→ More (Less) descent
per 4w than estimated by the standard gradient, ∂L∂w .
Keith L. Downing Regularization and Optimization of Backpropagation
Approximations using 1st and 2nd Derivatives
For a point (p0) in search space for which we know F(p0), use 1st and2nd derivative of F at p0 to estimate F(p) for a nearby point p.
Knowing F(p) affects decision of moving to (or toward) p from p0.
Using the 2nd-order Taylor Expansion of F to approximate F(p):
F (p)≈ F (p0) + F ′(p0)(p−p0) +12
F ′′(p0)(p−p0)2
Extend this to a multivariable function such as L (the loss function), thatcomputes a scalar L(θ ) when give the tensor of variables, θ :
L(θ)≈ L(θ0) + (θ −θ0)T(
∂L∂θ
)θ0
+12
(θ −θ0)T
(∂ 2L∂θ2
)θ0
(θ −θ0)
∂L∂θ
= Jacobian, and ∂ 2L∂θ 2 = Hessian
Knowing L(θ ) affects decision of moving to (or toward) θ from θ0.
Keith L. Downing Regularization and Optimization of Backpropagation
The Hessian Matrix
A matrix (H) of all second derivatives of a function, f, withrespect to all parameters.
For gradient descent learning, f = the loss function (L) and theparameters are weights (and biases).
H(L)(w)ij =∂ 2L
∂wi∂wj
Wherever the second derivs are continuous, H is symmetric:
∂ 2L∂wi∂wj
=∂ 2L
∂wj∂wi
Verify this symmetry: Let L(x ,y) = 4x2y + 3y3x2, and thencompute ∂ 2L
∂x∂y and ∂ 2L∂y∂x
Keith L. Downing Regularization and Optimization of Backpropagation
The Hessian Matrix
H(L)(wij) =
∂L2
∂w21
∂L2
∂w1∂w2· · · ∂L2
∂w1∂wn
∂L2
∂w2∂w1
∂L2
∂w22· · · ∂L2
∂w2∂wn... ... ... ...... ... ... ...
∂L2
∂wn∂w1
∂L2
∂wn∂w2· · · ∂L2
∂w2n
Keith L. Downing Regularization and Optimization of Backpropagation
Function Estimates via Jacobian and Hessian
(Δx, Δy)
x
yp
qContours =
values of F(x,y)
Given: point p, point q, F(p), the Jacobian(J) and Hessian(H) at p.
Goal: Estimate F(q)
Approach: Combine 4p = [4x ,4y ]T with J and H.
Using the Jacobian
[4x ,4y ]•
∣∣∣∣∣∣∣∂F∂x
∂F∂y
∣∣∣∣∣∣∣p
≈4F (x ,y)|p→q
Keith L. Downing Regularization and Optimization of Backpropagation
Function Estimates via Jacobian and Hessian
Hessian •4p ≈ Jacobian∂F 2
∂x2∂F 2
∂x∂y
∂F 2
∂y∂x∂F 2
∂y2
p
•
∣∣∣∣∣∣4x
4y
∣∣∣∣∣∣≈∣∣∣∣∣∣∣
∂F∂x
∂F∂y
∣∣∣∣∣∣∣p
4pT •Hessian •4p ≈4F (x ,y)
[4x ,4y ]•
∂F 2
∂x2∂F 2
∂x∂y
∂F 2
∂y∂x∂F 2
∂y2
p
•
∣∣∣∣∣∣4x
4y
∣∣∣∣∣∣≈
≈ [4x ,4y ]•
∣∣∣∣∣∣∣∂F∂x
∂F∂y
∣∣∣∣∣∣∣p
≈4F (x ,y)|p→q
Keith L. Downing Regularization and Optimization of Backpropagation
Eigenvalues of The Hessian Matrix
Since the Hessian is real and symmetric, it decomposes into aneigenvector basis and a set of real eigenvalues.
For each eigenvector-eigenvalue pair (vi ,κi ), the second derivative of Lin the direction vi is κi .
The max (min) eigenvalue = the max (min) second derivative. Theseindicate directions of max positive and negative curvature, along withflatter directions, all depending upon the signs of the eigenvalues.
Condition number = ratio of magnitudes of max and min κ:
maxi ,j
∣∣∣∣κiκj
∣∣∣∣High condition number (at a location in search space)→ difficultgradient-descent search. Curvatures vary greatly in different directions,so the shared learning rate (which affects movement in all dimensions)may cause too big a step in some directions, and too small in others.
Keith L. Downing Regularization and Optimization of Backpropagation
Optimal Step Sizes based on Hessian Eigenvalues
When the derivative of the loss function is negative along eigenvectorvi :
κi ≈ 0→ landscape has constant slope→ take a normal-sizedstep along vi , since the gradient is an accurate representation ofthe surrounding area.
κi > 0→ landscape is curving upward→ only take a small stepalong vi , since a normal step could end up in a region ofincreased error/loss.
κi < 0→ landscape is curving downward→ take a large stepalong vi to descend quickly.
The proximity of the ith dimensional axis to each such eigenvectorindicates the proper step to take along dimension i: the proper 4wi .
Keith L. Downing Regularization and Optimization of Backpropagation
top related