Lecture 19: NN Regularization (GitHub Pages, 2019-08-29)

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

TRANSCRIPT

Page 1

CS109A Introduction to Data Science
Pavlos Protopapas and Kevin Rader

Lecture 19: NN Regularization

Page 2

CS109A, PROTOPAPAS, RADER

Outline

Regularization of NN

§ Norm Penalties

§ Early Stopping

§ Data Augmentation

§ Sparse Representation

§ Bagging

§ Dropout



Page 4

Regularization


Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Page 5

Overfitting


Fitting a deep neural network with 5 layers and 100 neurons per layer can lead to very good predictions on the training set but poor predictions on the validation set.

Page 6

Norm Penalties

We used to optimize: J(W; X, y)

Change to: J̃(W; X, y) = J(W; X, y) + αΩ(W)

L2 regularization: Ω(W) = (1/2)‖W‖₂²
– weight decay
– MAP estimation with a Gaussian prior

L1 regularization: Ω(W) = ‖W‖₁
– encourages sparsity
– MAP estimation with a Laplacian prior

Biases are not penalized.
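As a sketch, the two norm penalties can be written directly in NumPy (the function names here are illustrative, not from the lecture):

```python
import numpy as np

def l2_penalty(W):
    # Omega(W) = (1/2) * ||W||_2^2  -> weight decay, Gaussian prior
    return 0.5 * np.sum(W ** 2)

def l1_penalty(W):
    # Omega(W) = ||W||_1  -> encourages sparsity, Laplacian prior
    return np.sum(np.abs(W))

def penalized_loss(data_loss, W, alpha, penalty):
    # J~(W; X, y) = J(W; X, y) + alpha * Omega(W); biases are not penalized
    return data_loss + alpha * penalty(W)
```

In practice `W` would be the concatenation of all weight matrices, with bias vectors excluded.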

Page 7

Norm Penalties

For the L2 penalty, J̃(W; X, y) = J(W; X, y) + (1/2)αW², the plain gradient-descent update

W⁽ᵗ⁺¹⁾ = W⁽ᵗ⁾ − λ ∂J/∂W

becomes

W⁽ᵗ⁺¹⁾ = W⁽ᵗ⁾ − λ ∂J/∂W − λαW⁽ᵗ⁾

so each weight decays in proportion to its own size. (As before, biases are not penalized.)
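The decayed update can be sketched numerically; `sgd_step_weight_decay` is an illustrative name, with λ as the learning rate and α as the penalty strength:

```python
import numpy as np

def sgd_step_weight_decay(W, grad_J, lr=0.1, alpha=0.01):
    # W(t+1) = W(t) - lr * dJ/dW - lr * alpha * W(t):
    # the extra term shrinks each weight in proportion to its size
    return W - lr * grad_J - lr * alpha * W

# with a zero data gradient, only the decay term acts
W_next = sgd_step_weight_decay(np.array([1.0, -2.0]), grad_J=np.zeros(2))
```

With `grad_J = 0`, each weight simply shrinks by a factor of (1 − λα) per step.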

Page 8

Norm Penalties


Ω(W) = (1/2)‖W‖₂²  (L2)        Ω(W) = ‖W‖₁  (L1)

Page 9

Norm Penalties as Constraints

Useful if K is known in advance

Optimization:

• Construct Lagrangian and apply gradient descent

• Projected gradient descent


minimize J(W; X, y)  subject to  Ω(W) ≤ K
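Projected gradient descent can be sketched as a plain gradient step followed by projection back onto the constraint set; this sketch assumes an L2-norm constraint ‖W‖₂ ≤ K:

```python
import numpy as np

def project_l2_ball(W, K):
    # project W onto the feasible set {W : ||W||_2 <= K}
    norm = np.linalg.norm(W)
    return W if norm <= K else W * (K / norm)

def projected_gd_step(W, grad_J, lr, K):
    # plain gradient step, then projection back into the constraint set
    return project_l2_ball(W - lr * grad_J, K)
```

The projection step is what makes a known K usable directly, instead of tuning the penalty strength α.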


Page 11

Early Stopping


Training time can be treated as a hyperparameter.

Early stopping: stop training at the point where validation-set performance is best.
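Treating training time as a hyperparameter can be sketched as a patience rule over recorded validation losses (the function and its `patience` parameter are illustrative, not from the slides):

```python
def early_stopping_epoch(val_losses, patience=2):
    # stop once the validation loss has not improved for `patience`
    # consecutive epochs, and report the epoch whose checkpoint
    # should be restored
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch
```

In a real training loop one would save model weights at each improvement and restore the checkpoint from the returned epoch.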

Page 12

Early Stopping



Page 14

Data Augmentation


Page 15

Data Augmentation



Page 17

Sparse Representation

Minimizing J(θ; X, y) alone gives dense weights. [Figure: a network with hidden-unit weights W₁ … W₆ feeding output weights into Y; the worked example computes the output as 4.34 = (3.2, 2.0, 1.8) · (2, −2.2, 1.3), with no zero entries in the weight vector.]

Page 18

Sparse Representation

Penalizing the weights: J̃(W; X, y) = J(θ; X, y) + αΩ(W), where Ω is applied to the weights in the output layer. [Figure: the same network; the penalized weights shrink, and the worked example becomes 0.69 = (0.5, 0.2, 0.1) · (2, −2.2, 1.3).]

Page 19

Sparse Representation

Minimizing J(θ; X, y) alone also gives a dense representation: all of the hidden-layer outputs are nonzero. [Figure: the worked example computes the output from dense hidden activations (2, −2.2, 1.3).]

Page 20

Sparse Representation

Penalizing the representation instead: J̃(W; X, y) = J(θ; X, y) + αΩ(h), where h is the output of the hidden layer. [Figure: under the penalty some hidden activations become exactly zero, e.g. h = (0, −0.2, 0.9), giving a sparse representation.]
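The activation penalty can be sketched in NumPy; the function name is illustrative, and L1 is taken as the choice of Ω:

```python
import numpy as np

def sparse_representation_loss(data_loss, h, alpha):
    # J~(W; X, y) = J(theta; X, y) + alpha * Omega(h):
    # the penalty acts on the OUTPUT of the hidden layer, not the weights,
    # so an L1 choice for Omega drives some activations h_i exactly to zero
    return data_loss + alpha * np.sum(np.abs(h))
```

This is the same machinery as a norm penalty on W, just applied to h.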


Page 22

[Figure-only slide; no text recovered.]


Page 24

Noise Robustness

Random perturbation of the network weights

• Gaussian noise: equivalent to minimizing the loss with a regularization term

• Encourages a smooth function: small perturbations of the weights lead to small changes in the output

Injecting noise into the output labels

• Better convergence: prevents the pursuit of hard 0/1 probabilities
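Both ideas can be sketched in NumPy; the σ and ε values are illustrative, and the label-noise variant shown is label smoothing:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_weights(W, sigma=0.01):
    # random Gaussian perturbation of the network weights during training
    return W + rng.normal(0.0, sigma, size=W.shape)

def smooth_labels(y_onehot, eps=0.1):
    # inject noise into the output labels: soften hard 0/1 targets so the
    # network is not pushed to produce probabilities of exactly 0 or 1
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / k
```

Smoothed targets still sum to 1, so they remain a valid probability distribution.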


Page 25

Dropout


Train all sub-networks obtained by removing non-output units from the base network.

Page 26

Dropout: Stochastic GD

For each new example/mini-batch:

• Randomly sample a binary mask μ, independently for each node, where μᵢ indicates whether input/hidden node i is included

• Multiply the output of node i by μᵢ, and perform a gradient update

Typically, an input node is included with probability 0.8 and a hidden node with probability 0.5.
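The per-mini-batch masking step can be sketched as follows (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_keep=0.5):
    # sample a binary mask mu independently per node; mu_i = 1 means
    # node i is included in this example/mini-batch
    mu = (rng.random(h.shape) < p_keep).astype(h.dtype)
    # multiply each node's output by mu_i before the gradient update
    return h * mu, mu
```

The gradient update then flows only through the surviving nodes, so each mini-batch effectively trains a different sub-network.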


Page 27

Dropout: Weight Scaling

At prediction time use all units, but scale each weight by its probability of inclusion during training.
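A minimal sketch of the scaling rule, assuming a single weight matrix and keep probability p:

```python
import numpy as np

def scale_weights_for_inference(W, p_keep):
    # at prediction time all units are active, so each weight is scaled
    # by the probability that its input unit was kept during training;
    # the expected pre-activation then matches training time
    return W * p_keep
```

Many modern implementations instead divide activations by p during training ("inverted dropout"), which leaves prediction-time weights untouched.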


Page 28

Adversarial Examples


Page 29

Adversarial Examples


[Figure: an image classified as "panda" with 57% confidence, plus an imperceptible noise pattern, is classified as "gibbon" with 99.3% confidence.]

Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.
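Perturbations like the one above are commonly generated with the fast gradient sign method; a minimal sketch, assuming the gradient of the loss with respect to the input is available:

```python
import numpy as np

def fgsm_example(x, grad_x, eps=0.007):
    # fast gradient sign method: step in the sign of the input gradient,
    # increasing the loss with a perturbation too small to see
    return x + eps * np.sign(grad_x)
```

Adversarial training adds such perturbed inputs, with their original labels, to the training set.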

Page 30

Recap

Regularization of NN

§ Norm Penalties

§ Early Stopping

§ Data Augmentation

§ Sparse Representation

§ Bagging

§ Dropout
