
Page 1:

10417/10617 Intermediate Deep Learning: Fall 2019

Russ Salakhutdinov
Machine Learning Department
rsalakhu@cs.cmu.edu
https://deeplearning-cmu-10417.github.io/

Midterm Review

Page 2:

Midterm Review

•  Polynomial curve fitting – generalization, overfitting
•  Loss functions for regression
•  Generalization / Overfitting
•  Statistical Decision Theory
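A minimal sketch of the first bullet, not from the original slides: polynomials of increasing degree are fit to noisy samples of sin(2πx) with a squared-error loss, so training error keeps falling while test error eventually rises (overfitting). The data-generating function, sample sizes, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x), the classic curve-fitting toy problem.
def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, t

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def design_matrix(x, degree):
    # Polynomial basis: [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

def rmse(w, x, t, degree):
    pred = design_matrix(x, degree) @ w
    return np.sqrt(np.mean((pred - t) ** 2))

for degree in (1, 3, 9):
    Phi = design_matrix(x_train, degree)
    # Least-squares fit (maximum likelihood under Gaussian noise).
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    print(degree, rmse(w, x_train, t_train, degree), rmse(w, x_test, t_test, degree))
```

With only 10 training points, the degree-9 fit typically drives the training RMSE to nearly zero while the test RMSE grows large, which is the generalization/overfitting behavior the slide refers to.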

Page 3:

Midterm Review

•  Bernoulli, Multinomial random variables (mean, variances)
•  Multivariate Gaussian distribution (form, mean, covariance)
•  Maximum likelihood estimation for these distributions
•  Exponential family / maximum likelihood estimation / sufficient statistics for the exponential family
•  Linear basis function models / maximum likelihood and least squares
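Not from the slides: a small numpy sketch of the maximum-likelihood estimators behind these bullets, assuming the standard closed forms (Bernoulli: the sample mean of the 0/1 outcomes; multivariate Gaussian: the sample mean and the covariance normalized by N; linear basis function model: the least-squares / normal-equations solution w_ML = (Φ^T Φ)^{-1} Φ^T t). The data and the choice of basis are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli MLE: mu_ML is just the sample mean of the 0/1 outcomes.
x_bern = rng.binomial(1, 0.3, size=1000)
mu_ml = x_bern.mean()

# Multivariate Gaussian MLE: sample mean and (biased, divide-by-N) covariance.
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=1000)
mean_ml = X.mean(axis=0)
cov_ml = (X - mean_ml).T @ (X - mean_ml) / X.shape[0]

# Linear basis function model: maximum likelihood = least squares,
#   w_ML = (Phi^T Phi)^{-1} Phi^T t   (the normal equations)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
Phi = np.stack([np.ones_like(x), x], axis=1)        # basis functions [1, x]
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

print(mu_ml, mean_ml, cov_ml, w_ml)
```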

Page 4:

Midterm Review

•  Regularized least squares: ridge regression
•  Bias-variance decomposition (low bias vs. high variance trade-off)
•  Gradient Descent, SGD, parameter update rules
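A sketch, not from the slides, contrasting the ridge-regression closed form w = (λI + Φ^T Φ)^{-1} Φ^T t with gradient descent on the same regularized least-squares objective. The regularization strength, basis, and step-size rule are illustrative, and the comment notes where SGD would differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression problem with a small polynomial basis [1, x, x^2, x^3].
x = rng.uniform(-1.0, 1.0, size=30)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=30)
Phi = np.vander(x, 4, increasing=True)
lam = 0.1                                   # regularization strength (illustrative)

# Regularized least squares (ridge regression), closed form:
#   w_ridge = (lambda * I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

# The same objective minimized by gradient descent.
# Full-batch gradient of 1/2 ||Phi w - t||^2 + lambda/2 ||w||^2:
w = np.zeros(Phi.shape[1])
lr = 1.0 / np.linalg.norm(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), 2)  # safe step size
for step in range(5000):
    grad = Phi.T @ (Phi @ w - t) + lam * w
    w = w - lr * grad                       # parameter update rule
    # SGD would instead step on the gradient of a single (or mini-batch of) example(s).

print(np.max(np.abs(w - w_ridge)))          # should shrink toward 0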

Page 5:

Neural Networks

‣  How neural networks predict f(x) given an input x:
   -  Forward propagation
   -  Types of units
   -  Capacity of neural networks (AND, OR, XOR)
‣  How to train neural nets:
   -  Loss function
   -  Backpropagation with gradient descent
‣  More recent techniques:
   -  Dropout
   -  Batch normalization
   -  Unsupervised pre-training

Page 6:

Artificial Neuron: Logistic Regression

•  Neuron pre-activation (or input activation):

      a(x) = b + \sum_i w_i x_i = b + w^\top x

•  Neuron output activation:

      h(x) = g(a(x)) = g(b + \sum_i w_i x_i)

   where w are the weights (parameters), b is the bias term, and g(·) is called the activation function.

(math typeset from Hugo Larochelle's "Feedforward neural network" notes, Université de Sherbrooke)
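Not part of the original slides: a minimal numpy rendering of the two equations above, with the sigmoid chosen as the activation function g (an assumption; the slide leaves g generic).

```python
import numpy as np

def sigmoid(a):
    # g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b, g=sigmoid):
    a = b + w @ x        # pre-activation: a(x) = b + sum_i w_i x_i = b + w^T x
    return g(a)          # output activation: h(x) = g(a(x))

x = np.array([0.5, -1.0, 2.0])   # an arbitrary 3-dimensional input
w = np.array([1.0, -2.0, 0.5])   # weights (parameters)
b = -0.5                          # bias term

print(neuron(x, w, b))            # a single number in (0, 1)
```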

Page 7:

(from Pascal Vincent's slides)

Decision Boundary of a Neuron

•  Binary classification:
   -  With a sigmoid activation, g(a) = sigm(a) = 1 / (1 + exp(-a)), one can interpret the neuron as estimating p(y = 1 | x)
   -  Interpret the neuron as a logistic classifier
   -  If the activation is greater than 0.5, predict 1
   -  Otherwise, predict 0
•  Decision boundary: the set of inputs where a(x) = w^\top x + b = 0
•  The same idea can be applied to a tanh activation.

Activation functions listed in the slides' math notes: linear g(a) = a; sigmoid g(a) = sigm(a) = 1 / (1 + exp(-a)); tanh g(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1); rectified linear g(a) = reclin(a) = max(0, a); softmax output o(a) = [exp(a_1)/\sum_c exp(a_c), ..., exp(a_C)/\sum_c exp(a_c)]^\top.
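A small sketch (not from the slides) of the decision rule above. Since the sigmoid is monotonic, thresholding p(y = 1 | x) at 0.5 is equivalent to thresholding the pre-activation w^T x + b at 0, which is why the decision boundary is the hyperplane w^T x + b = 0; the weights below are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, w, b):
    p = sigmoid(w @ x + b)          # interpreted as p(y = 1 | x)
    return 1 if p > 0.5 else 0      # equivalently: 1 if w @ x + b > 0 else 0

w = np.array([2.0, -1.0])
b = -0.5
for x in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.25, 0.0])):
    a = w @ x + b
    assert predict(x, w, b) == (1 if a > 0 else 0)   # the two rules agree
    print(x, predict(x, w, b))
```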

Page 8:

Capacity of a Single Neuron

•  Can solve linearly separable problems.

[Figure 1.8 from Hugo Larochelle's thesis – Example of modeling XOR with a one-hidden-layer network. Top, left to right: illustrations of the Boolean functions OR(x1, x2), AND(x1, NOT x2), and AND(NOT x1, x2). Bottom: the function XOR(x1, x2) plotted against the values of x1 and x2 (left), and against AND(x1, NOT x2) and AND(NOT x1, x2) (right). Points drawn as circles or triangles belong to class 0 or 1, respectively. Although a linear classifier can solve the classification problems associated with OR and AND, it cannot do so for XOR. However, by using the values of AND(x1, NOT x2) and AND(NOT x1, x2) as a new representation of the input (x1, x2), the XOR classification problem can be solved linearly. Note that in this last case only three values of the new representation are possible, since AND(x1, NOT x2) and AND(NOT x1, x2) cannot both be true for the same input.]

Page 9:

Capacity of a Single Neuron

•  Cannot solve non-linearly separable problems.

[Same Figure 1.8 as on the previous slide: a single linear classifier handles OR and AND but not XOR; re-representing the input as (AND(x1, NOT x2), AND(NOT x1, x2)) makes XOR linearly separable.]

•  Need to transform the input into a better representation.
•  Remember basis functions!
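Not from the slides: a sketch of the transformation described in Figure 1.8, assuming the two hidden units compute AND(x1, NOT x2) and AND(NOT x1, x2) with hard-threshold activations and hand-picked weights; in that new representation, a single linear output unit (an OR) solves XOR.

```python
import numpy as np

def step(a):
    # Hard-threshold activation: 1 if a > 0 else 0.
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_targets = np.array([0, 1, 1, 0])

# Hidden layer computing h1 = AND(x1, NOT x2) and h2 = AND(NOT x1, x2)
# (hand-picked weights; a trained net would find something equivalent).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([-0.5, -0.5])
H = step(X @ W1.T + b1)

# Output neuron computing OR(h1, h2) -- a linearly separable problem in h-space.
w2 = np.array([1.0, 1.0])
b2 = -0.5
pred = step(H @ w2 + b2)

print(H)                       # only three distinct hidden codes appear: 00, 01, 10
print(pred, xor_targets)       # the predictions match XOR
```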

Page 10:

Capacity of Neural Nets

•  Consider a single layer neural network

[Figure (from Pascal Vincent's slides): a network with inputs x1, x2, hidden units y1, y2, an output z, and bias units, shown with specific weight values, together with the decision regions R1 (z = +1) and R2 (z = -1) it carves out in the (x1, x2) plane. Layer labels: input i, hidden j, output k, with weights w_ji (input-to-hidden) and w_kj (hidden-to-output).]

Page 11:

Capacity of Neural Nets

•  Consider a single layer neural network

[Figure (from Pascal Vincent's slides), "The expressive power of neural networks": a network with inputs x1, x2, four hidden units y1–y4, and an output z1, together with the decision region in the (x1, x2) plane defined by the output z1.]

Page 12:

Capacity of Neural Nets

•  Consider a single layer neural network

(from Pascal Vincent's slides)
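A forward-propagation sketch (not from the slides) for a single-hidden-layer network, following the notation in the math notes embedded earlier: a(x) = b^(1) + W^(1) x for the hidden pre-activation, h(x) = g(a(x)), and o(x) = g_out(b^(2) + w^(2)^T h(x)) for the output. The choice of tanh hidden units, a sigmoid output, and the layer sizes are assumptions.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    # Hidden pre-activation: a(x) = b^(1) + W^(1) x
    a = b1 + W1 @ x
    # Hidden activation: h(x) = g(a(x)); tanh chosen here as the hidden unit type.
    h = np.tanh(a)
    # Output: o(x) = g_out(b^(2) + w^(2)^T h(x)); sigmoid output for binary classification.
    return 1.0 / (1.0 + np.exp(-(b2 + w2 @ h)))

rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 2))   # 2 inputs -> 4 hidden units
b1 = np.zeros(4)
w2 = rng.normal(size=4)        # 4 hidden units -> 1 output
b2 = 0.0

print(forward(np.array([0.3, -1.2]), W1, b1, w2, b2))
```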

Page 13:

Neural Networks

‣  SGD training, cross-entropy loss, ReLU activations
‣  Classification with neural networks
‣  Regularization, Dropout, Batchnorm
‣  Forward propagation and backprop (computing derivatives) – a combined sketch follows this list
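A compact sketch, not the course's code, tying these bullets together: a one-hidden-layer classifier with ReLU hidden units and a softmax output, trained with stochastic gradient descent on the cross-entropy loss, with the backprop derivatives written out by hand. The toy dataset, layer sizes, learning rate, and epoch count are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 2-class data: two Gaussian blobs.
N = 200
X = np.vstack([rng.normal(loc=-1.0, size=(N // 2, 2)),
               rng.normal(loc=+1.0, size=(N // 2, 2))])
y = np.array([0] * (N // 2) + [1] * (N // 2))

D, H, C = 2, 16, 2                      # input dim, hidden units, classes
W1 = rng.normal(scale=0.1, size=(H, D)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(C, H)); b2 = np.zeros(C)
lr = 0.1

def softmax(a):
    a = a - a.max()                      # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

for epoch in range(20):
    for n in rng.permutation(N):         # SGD: one example at a time
        x, t = X[n], y[n]
        # ---- forward propagation ----
        a1 = W1 @ x + b1
        h = np.maximum(0.0, a1)          # ReLU activations
        p = softmax(W2 @ h + b2)         # class probabilities
        # ---- backpropagation of the cross-entropy loss -log p[t] ----
        d_a2 = p.copy(); d_a2[t] -= 1.0  # gradient w.r.t. output pre-activation
        d_W2 = np.outer(d_a2, h); d_b2 = d_a2
        d_h = W2.T @ d_a2
        d_a1 = d_h * (a1 > 0)            # ReLU derivative
        d_W1 = np.outer(d_a1, x); d_b1 = d_a1
        # ---- SGD parameter updates ----
        W2 -= lr * d_W2; b2 -= lr * d_b2
        W1 -= lr * d_W1; b1 -= lr * d_b1

# Training accuracy as a rough sanity check.
logits = np.maximum(0.0, X @ W1.T + b1) @ W2.T + b2
print((logits.argmax(axis=1) == y).mean())
```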

Page 14:

Model Selection

•  Training Protocol:
   -  Train your model f(x; θ) on the Training Set D^train = {(x^(t), y^(t))}
   -  For model selection, use the Validation Set D^valid
      Ø  Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc. (see the sketch after this slide)
   -  Estimate generalization performance using the Test Set D^test
•  Remember: generalization is the behavior of the model on unseen examples.
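Not from the slides: a sketch of this training protocol using a simple regularized polynomial regressor as the model, so the example stays self-contained. The polynomial degree and regularization strength stand in for the hyper-parameters named on the slide (hidden layer size, learning rate, number of epochs); the split sizes and search grids are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# One dataset, split once into training / validation / test sets.
x = rng.uniform(0.0, 1.0, size=300)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=300)
idx = rng.permutation(300)
train, valid, test = idx[:150], idx[150:225], idx[225:]

def fit(x, t, degree, lam):
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)

def mse(w, x, t, degree):
    Phi = np.vander(x, degree + 1, increasing=True)
    return float(np.mean((Phi @ w - t) ** 2))

# Hyper-parameter search on the validation set only
# (stand-ins for hidden layer size, learning rate, number of epochs, ...).
best = None
for degree in (1, 3, 5, 9):
    for lam in (0.0, 1e-4, 1e-2, 1.0):
        w = fit(x[train], t[train], degree, lam)              # train on D_train
        err = mse(w, x[valid], t[valid], degree)              # select on D_valid
        if best is None or err < best[0]:
            best = (err, degree, lam, w)

err, degree, lam, w = best
# The test set is touched once, only to estimate generalization performance.
print(degree, lam, mse(w, x[test], t[test], degree))
```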

Page 15:

Early Stopping

•  To select the number of epochs, stop training when the validation set error increases (with some look-ahead).
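A minimal sketch (not from the slides) of early stopping, with a "patience" counter playing the role of the look-ahead. The train_step and valid_error callables are hypothetical placeholders; a synthetic U-shaped validation curve stands in for a real validation error.

```python
import numpy as np

def early_stopping(train_step, valid_error, max_epochs=100, patience=5):
    """Stop when the validation error has not improved for `patience`
    consecutive epochs (the look-ahead), and report the best epoch seen."""
    best_err, best_epoch = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)                  # one pass over the training set
        err = valid_error(epoch)           # error measured on the validation set
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement within the look-ahead window
    return best_epoch, best_err

# Stand-ins for real training: a synthetic validation curve that first falls,
# then rises again (the classic overfitting shape), with a little wiggle.
def fake_valid_error(epoch):
    return (epoch - 30) ** 2 / 900.0 + 0.1 + 0.01 * np.sin(epoch)

best_epoch, best_err = early_stopping(train_step=lambda epoch: None,
                                      valid_error=fake_valid_error,
                                      max_epochs=200, patience=10)
print(best_epoch, best_err)    # should land near the minimum of the curve (~epoch 30)
```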

Page 16:

Conv Nets

•  Convolutional networks leverage these ideas (convolution and pooling are sketched in code below):
   Ø  Local connectivity
   Ø  Parameter sharing
   Ø  Convolution
   Ø  Pooling / subsampling of hidden units
   Ø  Understanding receptive fields
•  Local contrast normalization, rectification
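Not from the slides: a plain-numpy sketch of the two central operations, a "valid" 2D convolution (implemented as cross-correlation, as deep learning libraries usually do) and non-overlapping max pooling, to make the local-connectivity and parameter-sharing ideas concrete. Image size, kernel, and pooling size are illustrative; rectification is included as a ReLU, and local contrast normalization is not shown.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one small kernel over every local window (parameter sharing +
    local connectivity). This is cross-correlation, the usual 'convolution'
    in deep learning."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling / subsampling of hidden units."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(6)
image = rng.normal(size=(8, 8))
kernel = np.array([[1.0, 0.0, -1.0],   # a single 3x3 filter: 9 shared weights
                   [1.0, 0.0, -1.0],   # reused at every spatial position
                   [1.0, 0.0, -1.0]])

fmap = np.maximum(0.0, conv2d_valid(image, kernel))   # rectification (ReLU)
print(fmap.shape, max_pool(fmap).shape)               # (6, 6) -> (3, 3)
```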