
Page 1: Deep Learning: Training

Juhan Nam

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

Deep Learning: Training

Page 2: Deep Learning: Training

Training Deep Neural Networks

● Gradient-based Learning

● Issue: vanishing or exploding gradients
○ The gradient in lower layers is computed as a cascaded multiplication of local gradients from upper layers (see the numerical sketch after the figure below)
○ Some elements can decay or explode exponentially
○ Learning is diluted or unstable

[Figure: an L-layer network mapping the input 𝑥 to the loss 𝑙(𝑦, ŷ); the forward pass carries the hidden unit activations through the parametric (FC/Conv) and non-parametric (activation) layers, and the backward pass carries the gradient flow ∂𝑙/∂w_ij^(k) back to each layer k]
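A minimal numerical sketch of the cascaded multiplication, assuming a toy fully-connected network with random weights and sigmoid activations (numpy only; sizes and scales are illustrative, not the course code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy deep network: 20 fully-connected layers of width 100, sigmoid activations.
L, width = 20, 100
weights = [rng.normal(0.0, 1.0, (width, width)) * 0.5 for _ in range(L)]

x = rng.normal(0.0, 1.0, width)
activations = [x]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass: the gradient w.r.t. lower layers is a product of local Jacobians.
grad = np.ones(width)                      # dl/dh at the top layer (dummy loss gradient)
grad_norms = []
for W, h in zip(reversed(weights), reversed(activations[1:])):
    local = h * (1 - h)                    # sigmoid derivative at this layer
    grad = W.T @ (grad * local)            # cascaded multiplication of local gradients
    grad_norms.append(np.linalg.norm(grad))

# The gradient magnitude typically decays (or, with large weights, explodes)
# roughly exponentially with depth.
print([f"{g:.2e}" for g in grad_norms])
```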

Page 3: Deep Learning: Training

Training Deep Neural Networks

● Gradient-based Learning

● Remedy: keep the distribution of the hidden unit activations in a controlled range
○ Normalize the input: done once as preprocessing across the entire training set
○ Set the variance of the randomly initialized weights: done once at model setup
○ Normalize the hidden units: run-time processing during training → batch normalization

[Figure: the same forward/backward diagram as on the previous slide]

Page 4: Deep Learning: Training

Input Normalization

● Standardization: zero mean and unit variance
○ Mean subtraction, then division by the standard deviation

● PCA whitening: zero mean and decorrelated unit variance
○ Mean subtraction, then rotation (PCA) and division by the standard deviation

● Add a small number to the standard deviation in the division (to avoid dividing by zero)
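A minimal numpy sketch of the two schemes above (the toy data and the epsilon value are illustrative; the epsilon follows the note on the division):

```python
import numpy as np

X = np.random.randn(500, 64) * 3.0 + 1.5    # toy data: 500 examples, 64 features
eps = 1e-8                                   # small number added to the std / eigenvalues

# Standardization: zero mean and unit variance per feature
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / (std + eps)

# PCA whitening: zero mean and decorrelated unit variance
X_centered = X - mean
cov = X_centered.T @ X_centered / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)       # the rotation is given by the eigenvectors
X_white = (X_centered @ eigvecs) / np.sqrt(eigvals + eps)
```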

Page 5: Deep Learning: Training

Input Normalization

● In practice
○ Zero mean and unit variance are commonly used in music classification tasks when the input is a log-compressed spectrogram (in image classification, however, unit variance is less common)
○ PCA whitening is not very common

● Common pitfall (see the sketch below)
○ The mean and standard deviation must be computed only from the training data (not the entire dataset)
○ The mean and standard deviation from the training set should be consistently used for the validation and test sets
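A sketch of the pitfall fix, assuming log-mel spectrograms stored as numpy arrays; the array shapes and names (X_train, X_valid, X_test) are illustrative stand-ins:

```python
import numpy as np

# Toy stand-ins for log-compressed mel spectrograms: (n_clips, n_mels, n_frames)
X_train = np.random.randn(100, 64, 200)
X_valid = np.random.randn(20, 64, 200)
X_test  = np.random.randn(20, 64, 200)

def fit_stats(X, eps=1e-8):
    """Compute mean/std from the TRAINING set only (per mel bin here)."""
    mean = X.mean(axis=(0, 2), keepdims=True)   # shape (1, n_mels, 1)
    std = X.std(axis=(0, 2), keepdims=True)
    return mean, std + eps

def apply_stats(X, mean, std):
    return (X - mean) / std

mean, std = fit_stats(X_train)                  # training data only, not the entire dataset
X_train_n = apply_stats(X_train, mean, std)
X_valid_n = apply_stats(X_valid, mean, std)     # reuse the SAME training statistics
X_test_n  = apply_stats(X_test,  mean, std)
```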

Page 6: Deep Learning: Training

Weight Initialization

● Set the variance of the randomly initialized weights so that the variance of the output matches the variance of the input at each layer: this speeds up training

● Glorot (or Xavier) initialization (2010)
○ The variance is set to σ² = 1 / fan_avg, where fan_avg = (fan_in (input layer size) + fan_out (output layer size)) / 2
○ Concerned with both the forward and backward passes
○ Used when the activation function is tanh or sigmoid

● He initialization (2015)
○ The variance is set to σ² = 2 / fan_in
○ Concerned with the forward pass only
○ Used when the activation function is ReLU or its variants
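A numpy sketch of the two schemes, drawing from a normal distribution with the variances given above (uniform variants with the same variance are also common in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    # sigma^2 = 1 / fan_avg = 2 / (fan_in + fan_out); for tanh / sigmoid layers
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, sigma, (fan_out, fan_in))

def he_normal(fan_in, fan_out):
    # sigma^2 = 2 / fan_in; for ReLU layers (forward pass only)
    sigma = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, sigma, (fan_out, fan_in))

W1 = glorot_normal(512, 256)
W2 = he_normal(512, 256)
```

In PyTorch the corresponding initializers are torch.nn.init.xavier_normal_ and torch.nn.init.kaiming_normal_.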

Page 7: Deep Learning: Training

Batch Normalization

● Normalize the output of each layer for a mini-batch input, as run-time processing during training (Ioffe and Szegedy, 2015); see the numpy sketch after the figure below
○ First, normalize the filter output to have zero mean and unit variance over the mini-batch
○ Then, rescale and shift the normalized output with trainable parameters (γ, β)
■ This lets the output exploit the non-linearity: inputs with zero mean and unit variance fall mostly in the linear range of sigmoid or tanh

[Figure: sigmoid/tanh activation curves over the input range −10 to 10; values with zero mean and unit variance fall in the nearly linear region around zero]
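A minimal numpy sketch of the training-time forward computation described above (toy shapes; γ and β would be learned in practice):

```python
import numpy as np

def batchnorm_forward(h, gamma, beta, eps=1e-5):
    """Batch norm over a mini-batch h of shape (batch, features), training mode."""
    mu = h.mean(axis=0)                     # per-feature mean over the mini-batch
    var = h.var(axis=0)                     # per-feature variance over the mini-batch
    h_hat = (h - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    out = gamma * h_hat + beta              # rescale and shift (trainable gamma, beta)
    return out, mu, var

h = np.random.randn(32, 128) * 4.0 + 2.0    # toy mini-batch of layer outputs
gamma = np.ones(128)
beta = np.zeros(128)
out, mu, var = batchnorm_forward(h, gamma, beta)
```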

Page 8: Deep Learning: Training

Batch Normalization

● Implemented as an additional layer
○ Located between the FC (or Conv) layer and the activation function layer
○ A simple element-wise scaling and shifting operation on the input (the mean and variance of the mini-batch are treated as constant vectors)

● Batch normalization in the test phase
○ Only a single example may be available in the test phase, so mini-batch statistics cannot be computed
○ Instead, use a moving average of the mean and variance of the mini-batches seen during training: μ^(mov) ← α · μ^(mov) + (1 − α) · μ^(cur)
○ In summary, four types of parameters are included in the batch norm layer: mean (moving average), variance (moving average), rescaling γ (trainable), and shift β (trainable)
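A PyTorch sketch of the placement and of the train/test behavior (layer sizes are arbitrary; nn.BatchNorm1d keeps the moving averages as running_mean/running_var and γ, β as weight/bias):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),   # between the FC layer and the activation function layer
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 128)

model.train()              # uses mini-batch statistics and updates the moving averages
_ = model(x)

model.eval()               # uses the stored moving averages (works for a single example)
with torch.no_grad():
    _ = model(x[:1])
```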

Page 9: Deep Learning: Training

Batch Normalization

● Advantages
○ Improves the gradient flow through the network, allowing a higher learning rate; as a result, training becomes much faster!
○ Reduces the dependence on weight initialization

ImageNet Classification (Ioffe and Szegedy, 2015)

[Figure 2 of the paper: single-crop validation accuracy of Inception and its batch-normalized variants vs. the number of training steps.]

Figure 3 of the paper: the number of training steps required to reach Inception's maximum accuracy (72.2%), and the maximum accuracy achieved by each network.

Model         | Steps to 72.2% | Max accuracy
Inception     | 31.0 · 10⁶     | 72.2%
BN-Baseline   | 13.3 · 10⁶     | 72.7%
BN-x5         | 2.1 · 10⁶      | 73.0%
BN-x30        | 2.7 · 10⁶      | 74.8%
BN-x5-Sigmoid |                | 69.8%

[Excerpt from Sec. 4.2.2 of the paper: with batch normalization alone (BN-Baseline), Inception's accuracy is matched in less than half the training steps; BN-x5 (learning rate ×5) needs 14 times fewer steps to reach 72.2%; BN-x30 (learning rate ×30) trains somewhat slower initially but reaches a higher final accuracy of 74.8%; BN-x5-Sigmoid reaches 69.8% with a sigmoid non-linearity, whereas Inception with sigmoid never does better than chance.]

Page 10: Deep Learning: Training

Optimization

● Stochastic gradient descent is the basic optimizer in deep learning

● However, it has limitations
○ Convergence of the loss is very slow when the loss function has a high condition number (see the numerical sketch after the figure below)
○ If the gradient becomes zero, the update gets stuck: local minima or saddle points

[Figure (source: Stanford CS231n slides): trajectories of the SGD update x_{t+1} = x_t − α ∇f(x_t). Left: when the loss changes quickly in one direction and slowly in another (high condition number), gradient descent makes very slow progress along the shallow dimension and jitters along the steep direction. Right: at a local minimum or a saddle point, ∇f(x_t) = 0 and gradient descent gets stuck.]

Condition number: the ratio of the largest to the smallest singular value of the Hessian matrix
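A tiny sketch of the slow-convergence problem on a quadratic loss with a high condition number (the Hessian and step size are hypothetical toy values):

```python
import numpy as np

# f(x) = 0.5 * x^T H x with an ill-conditioned Hessian (condition number = 500)
H = np.diag([1.0, 500.0])
grad = lambda x: H @ x

x = np.array([1.0, 1.0])
alpha = 0.003                      # must stay small to avoid divergence along the steep axis
for t in range(100):
    x = x - alpha * grad(x)        # x_{t+1} = x_t - alpha * grad f(x_t)

print(x)   # the steep coordinate converges quickly; the shallow one is still far from 0
```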

Page 11: Deep Learning: Training


Momentum

● Update the parameters using not only the current gradient but also the previous history

○ Adding inertia (or gravity) to the parameter point (see the sketch after the figure below)
○ Analogous to classical mechanics → 𝑥: displacement, 𝑣: velocity

x_{t+1} = x_t + v_{t+1}
v_{t+1} = ρ v_t − α ∇f(x_t)     (ρ v_t: accumulated gradients; −α ∇f(x_t): current gradient)

[Figure: the same SGD local-minimum / saddle-point illustration as on the previous slide (source: Stanford CS231n slides)]
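A minimal momentum sketch on the same ill-conditioned toy quadratic used earlier (hyper-parameter values are illustrative):

```python
import numpy as np

H = np.diag([1.0, 500.0])
grad = lambda x: H @ x

# SGD with momentum: v_{t+1} = rho * v_t - alpha * grad f(x_t);  x_{t+1} = x_t + v_{t+1}
x = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, rho = 0.003, 0.9
for t in range(100):
    v = rho * v - alpha * grad(x)   # accumulated gradients + current gradient
    x = x + v

print(x)   # progress along the shallow dimension is much faster than with plain SGD
```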

Page 12: Deep Learning: Training

Nesterov Momentum

● Faster convergence by computing the gradient at the tip of the velocity (see the sketch after the figure below)
○ At the current parameter point x_t, we already know that the accumulated term ρ v_t will be added to it. Thus, we compute the local gradient at the anticipated point x_t + ρ v_t

x_{t+1} = x_t + v_{t+1}
v_{t+1} = ρ v_t − α ∇f(x_t + ρ v_t)     (ρ v_t: accumulated gradients; −α ∇f(x_t + ρ v_t): advanced gradient at the look-ahead point)

[Figure: vector diagrams comparing the two updates. Momentum: the gradient step −α ∇f(x_t) is evaluated at x_t and combined with the accumulated velocity ρ v_t to give v_{t+1} and the new point x_t + v_{t+1}. Nesterov momentum: the gradient step −α ∇f(x_t + ρ v_t) is instead evaluated at the look-ahead point x_t + ρ v_t.]
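The same toy sketch with the Nesterov look-ahead; the only change from the momentum version is where the gradient is evaluated:

```python
import numpy as np

H = np.diag([1.0, 500.0])
grad = lambda x: H @ x

# Nesterov momentum: the gradient is evaluated at the look-ahead point x_t + rho * v_t
x = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, rho = 0.003, 0.9
for t in range(100):
    v = rho * v - alpha * grad(x + rho * v)   # advanced gradient at the anticipated point
    x = x + v

print(x)
```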

Page 13: Deep Learning: Training

Per-Parameter Optimization

● Every parameter has a different learning rate

● AdaGrad (Duchi, 2011)
○ Uses an adaptive learning rate for each parameter
○ Increases the learning rate for less frequently updated parameters, and vice versa

SGD (one global learning rate):      x_{t+1} = x_t − α ∇f(x_t)
Per-parameter learning rate:         x_{t+1}(i) = x_t(i) − α(i) ∇f(x_t(i))

AdaGrad:   x_{t+1}(i) = x_t(i) − (α / (√g(i) + ε)) ∇f(x_t(i)),   where g(i) = Σ_t ∇f(x_t(i))²
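A minimal AdaGrad sketch on the same toy quadratic (hyper-parameter values are illustrative):

```python
import numpy as np

H = np.diag([1.0, 500.0])
grad = lambda x: H @ x

# AdaGrad: per-parameter step alpha / (sqrt(g) + eps), g = running sum of squared gradients
x = np.array([1.0, 1.0])
g = np.zeros(2)
alpha, eps = 0.1, 1e-8
for t in range(100):
    dx = grad(x)
    g += dx ** 2                            # g(i) = sum_t grad f(x_t(i))^2
    x -= alpha / (np.sqrt(g) + eps) * dx

print(x)   # weakly / less frequently updated parameters get a relatively larger step
```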

Page 14: Deep Learning: Training

Per-Parameter Optimization

● RMSProp (Tieleman and Hinton, 2012)
○ Fixes the continuously growing g(i) of AdaGrad by using a moving average instead

● Adam (Adaptive Moment estimation)
○ Puts it all together: momentum + per-parameter learning rate (RMSProp)
○ The most widely used optimizer

RMSProp:
g_{t+1}(i) = β g_t(i) + (1 − β) ∇f(x_t(i))²
x_{t+1}(i) = x_t(i) − (α / (√g_{t+1}(i) + ε)) ∇f(x_t(i))

Adam:
v_{t+1} = ρ v_t + (1 − ρ) ∇f(x_t)
g_{t+1}(i) = β g_t(i) + (1 − β) ∇f(x_t(i))²
x_{t+1}(i) = x_t(i) − (α / (√g_{t+1}(i) + ε)) v_{t+1}
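A sketch of both updates on the same toy quadratic, following the equations above (hyper-parameter values are illustrative; the published Adam additionally applies bias correction to v and g, which is omitted here to match the slide):

```python
import numpy as np

H = np.diag([1.0, 500.0])
grad = lambda x: H @ x
alpha, rho, beta, eps = 0.01, 0.9, 0.999, 1e-8

# RMSProp: replace AdaGrad's growing sum with a moving average of squared gradients
x, g = np.array([1.0, 1.0]), np.zeros(2)
for t in range(200):
    dx = grad(x)
    g = beta * g + (1 - beta) * dx ** 2
    x -= alpha / (np.sqrt(g) + eps) * dx

# Adam (as on the slide, without bias correction): momentum + RMSProp-style scaling
x, v, g = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(200):
    dx = grad(x)
    v = rho * v + (1 - rho) * dx            # first moment (momentum)
    g = beta * g + (1 - beta) * dx ** 2     # second moment (per-parameter scaling)
    x -= alpha / (np.sqrt(g) + eps) * v
```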

Page 15: Deep Learning: Training

Optimizer Animation

● Comparison of optimizers

● Also, check the interactive article "Why Momentum Really Works" on Distill
○ https://distill.pub/2017/momentum/

Source: https://rnrahman.com/blog/visualising-stochastic-optimisers/

Page 16: Deep Learning: Training

Annealing Learning Rate

● Decay the learning rate under certain conditions (see the PyTorch sketch below)
○ Step decay: by a factor (e.g., 5 or 10) every fixed number of epochs
■ Exponential decay (α = α₀ e^(−kt)) or 1/t decay (α = α₀ / (1 + kt)) is also possible
○ Reduce on plateau: decay around every early-stopping point

● Reset the learning rate with a "warm restart"
○ Cosine (Loshchilov, 2017) and cyclic (Smith, 2017) schedules
○ Start with a high learning rate and restart from better initial weights

[Figure: loss vs. epoch. Left: decay the learning rate when the loss plateaus. Right: reset the learning rate for a warm restart (a new start!)]
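A PyTorch sketch of the schedules above using torch.optim.lr_scheduler (the model, learning rates, and decay factors are placeholders):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(128, 10)                       # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Step decay: multiply the learning rate by 0.1 every 30 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Alternatives:
# optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)                  # alpha = alpha0 * gamma^epoch
# optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)  # its step() takes the validation loss
# optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)        # cosine schedule with warm restarts

for epoch in range(90):
    # ... forward / backward / optimizer.step() for one epoch ...
    scheduler.step()          # decay the learning rate according to the schedule
```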

Page 17: Deep Learning: Training

Regularization

● Deep learning models can easily overfit to the training data (a minimal early-stopping sketch follows the figure below)

[Figure: training and validation loss vs. epoch; the point where the validation loss starts to rise marks the onset of overfitting. Left: use early stopping. Right: use weight decay, dropout, or data augmentation.]
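A sketch of early stopping plus weight decay in PyTorch (the model, the patience value, and the placeholder validation loss are illustrative; in practice val_loss comes from evaluating on the validation set each epoch):

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(128, 10)                               # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3,
                       weight_decay=1e-4)                # weight decay (L2 regularization)

best_loss, best_state, patience, wait = float("inf"), None, 10, 0
for epoch in range(200):
    # ... train for one epoch, then compute the validation loss ...
    val_loss = 1.0                                       # placeholder value
    if val_loss < best_loss:
        best_loss, wait = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())   # remember the best model so far
    else:
        wait += 1
        if wait >= patience:                             # validation loss stopped improving
            break                                        # early stopping

model.load_state_dict(best_state)                        # roll back to the best checkpoint
```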

Page 18: Deep Learning: Training

Dropout

● Turn off hidden-layer units randomly in each forward pass (Srivastava et al., 2014)
○ Implemented by multiplying the hidden units with a binary random mask
○ The dropout probability is a hyper-parameter
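A numpy sketch of the binary mask described above. The rescaling by 1/(1 − p) is the "inverted dropout" convention used by common frameworks (e.g., PyTorch's nn.Dropout) so that no scaling is needed at test time; the slide itself does not specify this detail:

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True):
    """Inverted dropout: multiply hidden units by a binary random mask during training."""
    if not train:
        return h                                   # no dropout (and no rescaling) at test time
    mask = (np.random.rand(*h.shape) > p_drop)     # binary random variables, kept with prob 1 - p
    return h * mask / (1.0 - p_drop)               # rescale so the expected activation is unchanged

h = np.random.randn(32, 1024)
h_train = dropout_forward(h, p_drop=0.5, train=True)
h_test = dropout_forward(h, train=False)
```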

Page 19: Deep Learning: Training

Dropout

● Prevents co-adaptation of hidden units
○ Co-adaptation: two or more hidden units detect the same features
○ This waste of resources is prevented by dropout

● Dropout enables a large ensemble of models
○ The ensemble reduces the variance of the model
○ A fully-connected layer with 1024 units has 2^1024 possible mask combinations (all sharing the same parameters)

Page 20: Deep Learning: Training

Data Augmentation

● Increase the quantity (and diversity) of the training data based on domain knowledge

● Commonly used digital audio effects
○ Pitch shifting
○ Time stretching
○ Equalization
○ Adding noise

● Check whether the output label is affected by the audio effects
○ You can also change the label according to how the data is augmented
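A sketch of waveform-domain augmentation, assuming the librosa library is available (a synthetic sine tone stands in for a real recording; effect parameters are arbitrary examples):

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(sr * 2) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)            # 2-second 440 Hz tone as a stand-in waveform

# Pitch shifting (in semitones) — check whether the label should change (e.g., for key/pitch tasks)
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Time stretching (rate > 1 speeds up) — tempo-related labels may need to change accordingly
y_stretch = librosa.effects.time_stretch(y, rate=1.1)

# Adding noise at a low relative level
y_noisy = y + 0.005 * np.random.randn(len(y))

# Equalization can be applied with a (randomized) filter, e.g., via scipy.signal; omitted here.
```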

Page 21: Deep Learning: Training

In Summary: Recommended Settings

● Data
○ Standardization, augmentation (optional)

● Build a neural network model
○ Add batch normalization
○ He initialization
○ Dropout, weight decay (optional)

● Optimizer
○ Adam, or SGD with Nesterov momentum

● Training
○ Early stopping
○ Learning rate annealing
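A PyTorch sketch that puts the recommended settings together (layer sizes, hyper-parameter values, and the training-loop skeleton are illustrative, not the course's reference implementation; standardized and optionally augmented inputs would replace the random tensors, and early stopping on the validation loss can be added as sketched earlier):

```python
import torch
import torch.nn as nn
import torch.optim as optim

class Classifier(nn.Module):
    def __init__(self, n_in=128, n_hidden=256, n_out=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_in, n_hidden),
            nn.BatchNorm1d(n_hidden),          # batch normalization before the activation
            nn.ReLU(),
            nn.Dropout(p=0.5),                 # dropout (optional)
            nn.Linear(n_hidden, n_out),
        )
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")   # He initialization
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

model = Classifier()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)   # Adam (+ weight decay)
# Alternative: optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.2)  # learning rate annealing
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 128)                       # stand-in for a standardized mini-batch
y = torch.randint(0, 10, (32,))
for epoch in range(3):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```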