
Page 1:

10417/10617 Intermediate Deep Learning: Fall 2019

Russ Salakhutdinov
Machine Learning Department
rsalakhu@cs.cmu.edu
https://deeplearning-cmu-10417.github.io/

Midterm Review

Page 2:

Midterm Review

•  Polynomial curve fitting – generalization, overfitting
•  Loss functions for regression
•  Generalization / Overfitting
•  Statistical Decision Theory
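A minimal sketch of the first bullet, not from the original slides: polynomials of increasing degree are fit to noisy samples of sin(2πx) with a squared-error loss, so training error keeps falling while test error eventually rises (overfitting). The data-generating function, sample sizes, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of sin(2*pi*x), the classic curve-fitting toy problem.
def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, t

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def design_matrix(x, degree):
    # Polynomial basis: [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

def rmse(w, x, t, degree):
    pred = design_matrix(x, degree) @ w
    return np.sqrt(np.mean((pred - t) ** 2))

for degree in (1, 3, 9):
    Phi = design_matrix(x_train, degree)
    # Least-squares fit (maximum likelihood under Gaussian noise).
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    print(degree, rmse(w, x_train, t_train, degree), rmse(w, x_test, t_test, degree))
```

With only 10 training points, the degree-9 fit typically drives the training RMSE to nearly zero while the test RMSE grows large, which is the generalization/overfitting behavior the slide refers to.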

Page 3:

Midterm Review

•  Bernoulli, Multinomial random variables (mean, variances)
•  Multivariate Gaussian distribution (form, mean, covariance)
•  Maximum likelihood estimation for these distributions
•  Exponential family / maximum likelihood estimation / sufficient statistics for the exponential family
•  Linear basis function models / maximum likelihood and least squares
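Not from the slides: a small numpy sketch of the maximum-likelihood estimators behind these bullets, assuming the standard closed forms (Bernoulli: the sample mean of the 0/1 outcomes; multivariate Gaussian: the sample mean and the covariance normalized by N; linear basis function model: the least-squares / normal-equations solution w_ML = (Φ^T Φ)^{-1} Φ^T t). The data and the choice of basis are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli MLE: mu_ML is just the sample mean of the 0/1 outcomes.
x_bern = rng.binomial(1, 0.3, size=1000)
mu_ml = x_bern.mean()

# Multivariate Gaussian MLE: sample mean and (biased, divide-by-N) covariance.
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=1000)
mean_ml = X.mean(axis=0)
cov_ml = (X - mean_ml).T @ (X - mean_ml) / X.shape[0]

# Linear basis function model: maximum likelihood = least squares,
#   w_ML = (Phi^T Phi)^{-1} Phi^T t   (the normal equations)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=50)
Phi = np.stack([np.ones_like(x), x], axis=1)        # basis functions [1, x]
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

print(mu_ml, mean_ml, cov_ml, w_ml)
```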

Page 4:

Midterm Review

•  Regularized least squares: ridge regression
•  Bias-variance decomposition (low bias vs. high variance trade-off)
•  Gradient Descent, SGD, parameter update rules
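A sketch, not from the slides, contrasting the ridge-regression closed form w = (λI + Φ^T Φ)^{-1} Φ^T t with gradient descent on the same regularized least-squares objective. The regularization strength, basis, and step-size rule are illustrative, and the comment notes where SGD would differ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression problem with a small polynomial basis [1, x, x^2, x^3].
x = rng.uniform(-1.0, 1.0, size=30)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=30)
Phi = np.vander(x, 4, increasing=True)
lam = 0.1                                   # regularization strength (illustrative)

# Regularized least squares (ridge regression), closed form:
#   w_ridge = (lambda * I + Phi^T Phi)^{-1} Phi^T t
w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

# The same objective minimized by gradient descent.
# Full-batch gradient of 1/2 ||Phi w - t||^2 + lambda/2 ||w||^2:
w = np.zeros(Phi.shape[1])
lr = 1.0 / np.linalg.norm(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), 2)  # safe step size
for step in range(5000):
    grad = Phi.T @ (Phi @ w - t) + lam * w
    w = w - lr * grad                       # parameter update rule
    # SGD would instead step on the gradient of a single (or mini-batch of) example(s).

print(np.max(np.abs(w - w_ridge)))          # should shrink toward 0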

Page 5:

Neural Networks

‣  How neural networks predict f(x) given an input x:
   -  Forward propagation
   -  Types of units
   -  Capacity of neural networks (AND, OR, XOR)
‣  How to train neural nets:
   -  Loss function
   -  Backpropagation with gradient descent
‣  More recent techniques:
   -  Dropout
   -  Batch normalization
   -  Unsupervised pre-training

Page 6:

Artificial Neuron: Logistic Regression

•  Neuron pre-activation (or input activation):

      a(x) = b + \sum_i w_i x_i = b + w^\top x

•  Neuron output activation:

      h(x) = g(a(x)) = g(b + \sum_i w_i x_i)

   where w are the weights (parameters), b is the bias term, and g(·) is called the activation function.

(math typeset from Hugo Larochelle's "Feedforward neural network" notes, Université de Sherbrooke)
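Not part of the original slides: a minimal numpy rendering of the two equations above, with the sigmoid chosen as the activation function g (an assumption; the slide leaves g generic).

```python
import numpy as np

def sigmoid(a):
    # g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b, g=sigmoid):
    a = b + w @ x        # pre-activation: a(x) = b + sum_i w_i x_i = b + w^T x
    return g(a)          # output activation: h(x) = g(a(x))

x = np.array([0.5, -1.0, 2.0])   # an arbitrary 3-dimensional input
w = np.array([1.0, -2.0, 0.5])   # weights (parameters)
b = -0.5                          # bias term

print(neuron(x, w, b))            # a single number in (0, 1)
```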

Page 7:

(from Pascal Vincent's slides)

Decision Boundary of a Neuron

•  Binary classification:
   -  With a sigmoid activation, g(a) = sigm(a) = 1 / (1 + exp(-a)), one can interpret the neuron as estimating p(y = 1 | x)
   -  Interpret the neuron as a logistic classifier
   -  If the activation is greater than 0.5, predict 1
   -  Otherwise, predict 0
•  Decision boundary: the set of inputs where a(x) = w^\top x + b = 0
•  The same idea can be applied to a tanh activation.

Activation functions listed in the slides' math notes: linear g(a) = a; sigmoid g(a) = sigm(a) = 1 / (1 + exp(-a)); tanh g(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1); rectified linear g(a) = reclin(a) = max(0, a); softmax output o(a) = [exp(a_1)/\sum_c exp(a_c), ..., exp(a_C)/\sum_c exp(a_c)]^\top.
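A small sketch (not from the slides) of the decision rule above. Since the sigmoid is monotonic, thresholding p(y = 1 | x) at 0.5 is equivalent to thresholding the pre-activation w^T x + b at 0, which is why the decision boundary is the hyperplane w^T x + b = 0; the weights below are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, w, b):
    p = sigmoid(w @ x + b)          # interpreted as p(y = 1 | x)
    return 1 if p > 0.5 else 0      # equivalently: 1 if w @ x + b > 0 else 0

w = np.array([2.0, -1.0])
b = -0.5
for x in (np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.25, 0.0])):
    a = w @ x + b
    assert predict(x, w, b) == (1 if a > 0 else 0)   # the two rules agree
    print(x, predict(x, w, b))
```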

Page 8:

Capacity of a Single Neuron

•  Can solve linearly separable problems.

[Figure 1.8 from Hugo Larochelle's thesis – Example of modeling XOR with a one-hidden-layer network. Top, left to right: illustrations of the Boolean functions OR(x1, x2), AND(x1, NOT x2), and AND(NOT x1, x2). Bottom: the function XOR(x1, x2) plotted against the values of x1 and x2 (left), and against AND(x1, NOT x2) and AND(NOT x1, x2) (right). Points drawn as circles or triangles belong to class 0 or 1, respectively. Although a linear classifier can solve the classification problems associated with OR and AND, it cannot do so for XOR. However, by using the values of AND(x1, NOT x2) and AND(NOT x1, x2) as a new representation of the input (x1, x2), the XOR classification problem can be solved linearly. Note that in this last case only three values of the new representation are possible, since AND(x1, NOT x2) and AND(NOT x1, x2) cannot both be true for the same input.]

Page 9:

Capacity of a Single Neuron

•  Cannot solve non-linearly separable problems.

[Same Figure 1.8 as on the previous slide: a single linear classifier handles OR and AND but not XOR; re-representing the input as (AND(x1, NOT x2), AND(NOT x1, x2)) makes XOR linearly separable.]

•  Need to transform the input into a better representation.
•  Remember basis functions!
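Not from the slides: a sketch of the transformation described in Figure 1.8, assuming the two hidden units compute AND(x1, NOT x2) and AND(NOT x1, x2) with hard-threshold activations and hand-picked weights; in that new representation, a single linear output unit (an OR) solves XOR.

```python
import numpy as np

def step(a):
    # Hard-threshold activation: 1 if a > 0 else 0.
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_targets = np.array([0, 1, 1, 0])

# Hidden layer computing h1 = AND(x1, NOT x2) and h2 = AND(NOT x1, x2)
# (hand-picked weights; a trained net would find something equivalent).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([-0.5, -0.5])
H = step(X @ W1.T + b1)

# Output neuron computing OR(h1, h2) -- a linearly separable problem in h-space.
w2 = np.array([1.0, 1.0])
b2 = -0.5
pred = step(H @ w2 + b2)

print(H)                       # only three distinct hidden codes appear: 00, 01, 10
print(pred, xor_targets)       # the predictions match XOR
```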

Page 10:

Capacity of Neural Nets

•  Consider a single layer neural network

[Figure (from Pascal Vincent's slides): a network with inputs x1, x2, hidden units y1, y2, an output z, and bias units, shown with specific weight values, together with the decision regions R1 (z = +1) and R2 (z = -1) it carves out in the (x1, x2) plane. Layer labels: input i, hidden j, output k, with weights w_ji (input-to-hidden) and w_kj (hidden-to-output).]

Page 11:

Capacity of Neural Nets

•  Consider a single layer neural network

[Figure (from Pascal Vincent's slides), "The expressive power of neural networks": a network with inputs x1, x2, four hidden units y1–y4, and an output z1, together with the decision region in the (x1, x2) plane defined by the output z1.]

Page 12:

Capacity of Neural Nets

•  Consider a single layer neural network

(from Pascal Vincent's slides)
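A forward-propagation sketch (not from the slides) for a single-hidden-layer network, following the notation in the math notes embedded earlier: a(x) = b^(1) + W^(1) x for the hidden pre-activation, h(x) = g(a(x)), and o(x) = g_out(b^(2) + w^(2)^T h(x)) for the output. The choice of tanh hidden units, a sigmoid output, and the layer sizes are assumptions.

```python
import numpy as np

def forward(x, W1, b1, w2, b2):
    # Hidden pre-activation: a(x) = b^(1) + W^(1) x
    a = b1 + W1 @ x
    # Hidden activation: h(x) = g(a(x)); tanh chosen here as the hidden unit type.
    h = np.tanh(a)
    # Output: o(x) = g_out(b^(2) + w^(2)^T h(x)); sigmoid output for binary classification.
    return 1.0 / (1.0 + np.exp(-(b2 + w2 @ h)))

rng = np.random.default_rng(3)
W1 = rng.normal(size=(4, 2))   # 2 inputs -> 4 hidden units
b1 = np.zeros(4)
w2 = rng.normal(size=4)        # 4 hidden units -> 1 output
b2 = 0.0

print(forward(np.array([0.3, -1.2]), W1, b1, w2, b2))
```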

Page 13:

Neural Networks

‣  SGD training, cross-entropy loss, ReLU activations
‣  Classification with neural networks
‣  Regularization, Dropout, Batchnorm
‣  Forward propagation and backprop (computing derivatives) – a combined sketch follows this list
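A compact sketch, not the course's code, tying these bullets together: a one-hidden-layer classifier with ReLU hidden units and a softmax output, trained with stochastic gradient descent on the cross-entropy loss, with the backprop derivatives written out by hand. The toy dataset, layer sizes, learning rate, and epoch count are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 2-class data: two Gaussian blobs.
N = 200
X = np.vstack([rng.normal(loc=-1.0, size=(N // 2, 2)),
               rng.normal(loc=+1.0, size=(N // 2, 2))])
y = np.array([0] * (N // 2) + [1] * (N // 2))

D, H, C = 2, 16, 2                      # input dim, hidden units, classes
W1 = rng.normal(scale=0.1, size=(H, D)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(C, H)); b2 = np.zeros(C)
lr = 0.1

def softmax(a):
    a = a - a.max()                      # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

for epoch in range(20):
    for n in rng.permutation(N):         # SGD: one example at a time
        x, t = X[n], y[n]
        # ---- forward propagation ----
        a1 = W1 @ x + b1
        h = np.maximum(0.0, a1)          # ReLU activations
        p = softmax(W2 @ h + b2)         # class probabilities
        # ---- backpropagation of the cross-entropy loss -log p[t] ----
        d_a2 = p.copy(); d_a2[t] -= 1.0  # gradient w.r.t. output pre-activation
        d_W2 = np.outer(d_a2, h); d_b2 = d_a2
        d_h = W2.T @ d_a2
        d_a1 = d_h * (a1 > 0)            # ReLU derivative
        d_W1 = np.outer(d_a1, x); d_b1 = d_a1
        # ---- SGD parameter updates ----
        W2 -= lr * d_W2; b2 -= lr * d_b2
        W1 -= lr * d_W1; b1 -= lr * d_b1

# Training accuracy as a rough sanity check.
logits = np.maximum(0.0, X @ W1.T + b1) @ W2.T + b2
print((logits.argmax(axis=1) == y).mean())
```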

Page 14:

Model Selection

•  Training Protocol:
   -  Train your model f(x; θ) on the Training Set D^train = {(x^(t), y^(t))}
   -  For model selection, use the Validation Set D^valid
      Ø  Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc. (see the sketch after this slide)
   -  Estimate generalization performance using the Test Set D^test
•  Remember: generalization is the behavior of the model on unseen examples.
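Not from the slides: a sketch of this training protocol using a simple regularized polynomial regressor as the model, so the example stays self-contained. The polynomial degree and regularization strength stand in for the hyper-parameters named on the slide (hidden layer size, learning rate, number of epochs); the split sizes and search grids are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# One dataset, split once into training / validation / test sets.
x = rng.uniform(0.0, 1.0, size=300)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=300)
idx = rng.permutation(300)
train, valid, test = idx[:150], idx[150:225], idx[225:]

def fit(x, t, degree, lam):
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(lam * np.eye(degree + 1) + Phi.T @ Phi, Phi.T @ t)

def mse(w, x, t, degree):
    Phi = np.vander(x, degree + 1, increasing=True)
    return float(np.mean((Phi @ w - t) ** 2))

# Hyper-parameter search on the validation set only
# (stand-ins for hidden layer size, learning rate, number of epochs, ...).
best = None
for degree in (1, 3, 5, 9):
    for lam in (0.0, 1e-4, 1e-2, 1.0):
        w = fit(x[train], t[train], degree, lam)              # train on D_train
        err = mse(w, x[valid], t[valid], degree)              # select on D_valid
        if best is None or err < best[0]:
            best = (err, degree, lam, w)

err, degree, lam, w = best
# The test set is touched once, only to estimate generalization performance.
print(degree, lam, mse(w, x[test], t[test], degree))
```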

Page 15:

Early Stopping

•  To select the number of epochs, stop training when the validation set error increases (with some look-ahead).
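A minimal sketch (not from the slides) of early stopping, with a "patience" counter playing the role of the look-ahead. The train_step and valid_error callables are hypothetical placeholders; a synthetic U-shaped validation curve stands in for a real validation error.

```python
import numpy as np

def early_stopping(train_step, valid_error, max_epochs=100, patience=5):
    """Stop when the validation error has not improved for `patience`
    consecutive epochs (the look-ahead), and report the best epoch seen."""
    best_err, best_epoch = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)                  # one pass over the training set
        err = valid_error(epoch)           # error measured on the validation set
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                          # no improvement within the look-ahead window
    return best_epoch, best_err

# Stand-ins for real training: a synthetic validation curve that first falls,
# then rises again (the classic overfitting shape), with a little wiggle.
def fake_valid_error(epoch):
    return (epoch - 30) ** 2 / 900.0 + 0.1 + 0.01 * np.sin(epoch)

best_epoch, best_err = early_stopping(train_step=lambda epoch: None,
                                      valid_error=fake_valid_error,
                                      max_epochs=200, patience=10)
print(best_epoch, best_err)    # should land near the minimum of the curve (~epoch 30)
```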

Page 16:

Conv Nets

•  Convolutional networks leverage these ideas (convolution and pooling are sketched in code below):
   Ø  Local connectivity
   Ø  Parameter sharing
   Ø  Convolution
   Ø  Pooling / subsampling of hidden units
   Ø  Understanding receptive fields
•  Local contrast normalization, rectification
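Not from the slides: a plain-numpy sketch of the two central operations, a "valid" 2D convolution (implemented as cross-correlation, as deep learning libraries usually do) and non-overlapping max pooling, to make the local-connectivity and parameter-sharing ideas concrete. Image size, kernel, and pooling size are illustrative; rectification is included as a ReLU, and local contrast normalization is not shown.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one small kernel over every local window (parameter sharing +
    local connectivity). This is cross-correlation, the usual 'convolution'
    in deep learning."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling / subsampling of hidden units."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

rng = np.random.default_rng(6)
image = rng.normal(size=(8, 8))
kernel = np.array([[1.0, 0.0, -1.0],   # a single 3x3 filter: 9 shared weights
                   [1.0, 0.0, -1.0],   # reused at every spatial position
                   [1.0, 0.0, -1.0]])

fmap = np.maximum(0.0, conv2d_valid(image, kernel))   # rectification (ReLU)
print(fmap.shape, max_pool(fmap).shape)               # (6, 6) -> (3, 3)
```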