Lecture 19: NN Regularization (GitHub Pages, 2019-08-29)

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

TRANSCRIPT

Page 1

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Lecture 19: NN Regularization

Page 2

CS109A, PROTOPAPAS, RADER

Outline

Regularization of NN

§ Norm Penalties

§ Early Stopping

§ Data Augmentation

§ Sparse Representation

§ Bagging

§ Dropout



Page 4

Regularization


Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Page 5

Overfitting


Fitting a deep neural network with 5 layers and 100 neurons per layer can lead to very good predictions on the training set but poor predictions on the validation set.

Page 6

Norm Penalties

We used to optimize: J(W; X, y)

Change to: J̃(W; X, y) = J(W; X, y) + αΩ(W)

L2 regularization: Ω(W) = (1/2)‖W‖₂²
– weight decay
– MAP estimation with a Gaussian prior

L1 regularization: Ω(W) = ‖W‖₁
– encourages sparsity
– MAP estimation with a Laplacian prior

Biases are not penalized.
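As a sketch, the two norm penalties can be written directly in NumPy (the function names here are illustrative, not from the lecture):

```python
import numpy as np

def l2_penalty(W):
    # Omega(W) = (1/2) * ||W||_2^2  -> weight decay, Gaussian prior
    return 0.5 * np.sum(W ** 2)

def l1_penalty(W):
    # Omega(W) = ||W||_1  -> encourages sparsity, Laplacian prior
    return np.sum(np.abs(W))

def penalized_loss(data_loss, W, alpha, penalty):
    # J~(W; X, y) = J(W; X, y) + alpha * Omega(W); biases are not penalized
    return data_loss + alpha * penalty(W)
```

In practice `W` would be the concatenation of all weight matrices, with bias vectors excluded.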

Page 7

Norm Penalties

For the L2 penalty, J̃(W; X, y) = J(W; X, y) + (1/2)αW², the plain gradient-descent update

W⁽ᵗ⁺¹⁾ = W⁽ᵗ⁾ − λ ∂J/∂W

becomes

W⁽ᵗ⁺¹⁾ = W⁽ᵗ⁾ − λ ∂J/∂W − λαW⁽ᵗ⁾

so each weight decays in proportion to its own size. (As before, biases are not penalized.)
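The decayed update can be sketched numerically; `sgd_step_weight_decay` is an illustrative name, with λ as the learning rate and α as the penalty strength:

```python
import numpy as np

def sgd_step_weight_decay(W, grad_J, lr=0.1, alpha=0.01):
    # W(t+1) = W(t) - lr * dJ/dW - lr * alpha * W(t):
    # the extra term shrinks each weight in proportion to its size
    return W - lr * grad_J - lr * alpha * W

# with a zero data gradient, only the decay term acts
W_next = sgd_step_weight_decay(np.array([1.0, -2.0]), grad_J=np.zeros(2))
```

With `grad_J = 0`, each weight simply shrinks by a factor of (1 − λα) per step.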

Page 8

Norm Penalties


Ω(W) = (1/2)‖W‖₂²  (L2)        Ω(W) = ‖W‖₁  (L1)

Page 9

Norm Penalties as Constraints

Useful if K is known in advance

Optimization:

• Construct Lagrangian and apply gradient descent

• Projected gradient descent


minimize J(W; X, y)  subject to  Ω(W) ≤ K
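Projected gradient descent can be sketched as a plain gradient step followed by projection back onto the constraint set; this sketch assumes an L2-norm constraint ‖W‖₂ ≤ K:

```python
import numpy as np

def project_l2_ball(W, K):
    # project W onto the feasible set {W : ||W||_2 <= K}
    norm = np.linalg.norm(W)
    return W if norm <= K else W * (K / norm)

def projected_gd_step(W, grad_J, lr, K):
    # plain gradient step, then projection back into the constraint set
    return project_l2_ball(W - lr * grad_J, K)
```

The projection step is what makes a known K usable directly, instead of tuning the penalty strength α.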


Page 11

Early Stopping


Training time can be treated as a hyperparameter.

Early stopping: stop training at the point where validation-set performance is best.
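Treating training time as a hyperparameter can be sketched as a patience rule over recorded validation losses (the function and its `patience` parameter are illustrative, not from the slides):

```python
def early_stopping_epoch(val_losses, patience=2):
    # stop once the validation loss has not improved for `patience`
    # consecutive epochs, and report the epoch whose checkpoint
    # should be restored
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch
```

In a real training loop one would save model weights at each improvement and restore the checkpoint from the returned epoch.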

Page 12

Early Stopping



Page 14

Data Augmentation


Page 15

Data Augmentation



Page 17

Sparse Representation

Minimizing J(θ; X, y) alone gives dense weights. [Figure: a network with hidden-unit weights W₁ … W₆ feeding output weights into Y; the worked example computes the output as 4.34 = (3.2, 2.0, 1.8) · (2, −2.2, 1.3), with no zero entries in the weight vector.]

Page 18

Sparse Representation

Penalizing the weights: J̃(W; X, y) = J(θ; X, y) + αΩ(W), where Ω is applied to the weights in the output layer. [Figure: the same network; the penalized weights shrink, and the worked example becomes 0.69 = (0.5, 0.2, 0.1) · (2, −2.2, 1.3).]

Page 19

Sparse Representation

Minimizing J(θ; X, y) alone also gives a dense representation: all of the hidden-layer outputs are nonzero. [Figure: the worked example computes the output from dense hidden activations (2, −2.2, 1.3).]

Page 20

Sparse Representation

Penalizing the representation instead: J̃(W; X, y) = J(θ; X, y) + αΩ(h), where h is the output of the hidden layer. [Figure: under the penalty some hidden activations become exactly zero, e.g. h = (0, −0.2, 0.9), giving a sparse representation.]
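The activation penalty can be sketched in NumPy; the function name is illustrative, and L1 is taken as the choice of Ω:

```python
import numpy as np

def sparse_representation_loss(data_loss, h, alpha):
    # J~(W; X, y) = J(theta; X, y) + alpha * Omega(h):
    # the penalty acts on the OUTPUT of the hidden layer, not the weights,
    # so an L1 choice for Omega drives some activations h_i exactly to zero
    return data_loss + alpha * np.sum(np.abs(h))
```

This is the same machinery as a norm penalty on W, just applied to h.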


Page 22

[Figure-only slide; no text recovered.]


Page 24

Noise Robustness

Random perturbation of the network weights

• Gaussian noise: equivalent to minimizing the loss with a regularization term

• Encourages a smooth function: small perturbations of the weights lead to small changes in the output

Injecting noise into the output labels

• Better convergence: prevents the pursuit of hard 0/1 probabilities
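Both ideas can be sketched in NumPy; the σ and ε values are illustrative, and the label-noise variant shown is label smoothing:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_weights(W, sigma=0.01):
    # random Gaussian perturbation of the network weights during training
    return W + rng.normal(0.0, sigma, size=W.shape)

def smooth_labels(y_onehot, eps=0.1):
    # inject noise into the output labels: soften hard 0/1 targets so the
    # network is not pushed to produce probabilities of exactly 0 or 1
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k
```

Smoothed targets still sum to 1, so they remain a valid probability distribution.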


Page 25

Dropout


Train all sub-networks obtained by removing non-output units from the base network.

Page 26

Dropout: Stochastic GD

For each new example/mini-batch:

• Randomly sample a binary mask μ, independently for each node, where μᵢ indicates whether input/hidden node i is included

• Multiply the output of node i by μᵢ, and perform a gradient update

Typically, an input node is included with probability 0.8 and a hidden node with probability 0.5.
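The per-mini-batch masking step can be sketched as follows (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_keep=0.5):
    # sample a binary mask mu independently per node; mu_i = 1 means
    # node i is included in this example/mini-batch
    mu = (rng.random(h.shape) < p_keep).astype(h.dtype)
    # multiply each node's output by mu_i before the gradient update
    return h * mu, mu
```

The gradient update then flows only through the surviving nodes, so each mini-batch effectively trains a different sub-network.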


Page 27

Dropout: Weight Scaling

At prediction time use all units, but scale each weight by its probability of inclusion during training.
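A minimal sketch of the scaling rule, assuming a single weight matrix and keep probability p:

```python
import numpy as np

def scale_weights_for_inference(W, p_keep):
    # at prediction time all units are active, so each weight is scaled
    # by the probability that its input unit was kept during training;
    # the expected pre-activation then matches training time
    return W * p_keep
```

Many modern implementations instead divide activations by p during training ("inverted dropout"), which leaves prediction-time weights untouched.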


Page 28

Adversarial Examples


Page 29

Adversarial Examples


[Figure: an image classified as "panda" with 57% confidence, plus an imperceptible noise pattern, is classified as "gibbon" with 99.3% confidence.]

Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.
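Perturbations like the one above are commonly generated with the fast gradient sign method; a minimal sketch, assuming the gradient of the loss with respect to the input is available:

```python
import numpy as np

def fgsm_example(x, grad_x, eps=0.007):
    # fast gradient sign method: step in the sign of the input gradient,
    # increasing the loss with a perturbation too small to see
    return x + eps * np.sign(grad_x)
```

Adversarial training adds such perturbed inputs, with their original labels, to the training set.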

Page 30

Recap

Regularization of NN

§ Norm Penalties

§ Early Stopping

§ Data Augmentation

§ Sparse Representation

§ Bagging

§ Dropout
