10417/10617 intermediate deep learning: fall2019rsalakhu/10417/lectures/... · 10417/10617...

10417/10617 Intermediate Deep Learning:

Fall2019 RussSalakhutdinov

Machine Learning Departmentrsalakhu@cs.cmu.edu

https://deeplearning-cmu-10417.github.io/

Midterm Review

Midterm Review •  Polynomial curve fitting – generalization, overfitting

•  Loss functions for regression

•  Generalization / Overfitting

•  Statistical Decision Theory

Midterm Review •  Bernoulli, Multinomial random variables (mean, variances)

•  Multivariate Gaussian distribution (form, mean, covariance)

•  Maximum likelihood estimation for these distributions.

•  Exponential family / Maximum likelihood estimation / sufficient statistics for exponential family.

•  Linear basis function models / maximum likelihood and least squares:

Midterm Review •  Regularized least squares:

Ridge regression

•  Bias-variance decomposition.

Low bias

High variance

•  Gradient Descend, SGD, Parameter Update Rules

Neural Networks ‣  How neural networks predict f(x) given an input x:

-  Forward propagation -  Types of units -  Capacity of neural networks (AND, OR, XOR)

‣  How to train neural nets: -  Loss function -  Backpropagation with gradient descent

‣  More recent techniques: -  Dropout -  Batch normalization -  Unsupervised Pre-training

Artificial Neuron: Logistic Regression •  Neuron pre-activation (or input activation):

Feedforward neural network

Hugo Larochelle

Departement d’informatiqueUniversite de Sherbrooke

hugo.larochelle@usherbrooke.ca

September 6, 2012

Abstract

Math for my slides “Feedforward neural network”.

• a(x) = b+P

i wixi = b+w>x

• h(x) = g(a(x)) = g(b+P

i wixi)

• x1 xd b w1 wd

• g(·) b

• h(x) = g(a(x))

• a(x) = b(1) +W(1)x⇣a(x)i = b(1)i

(1)i,j xj

• o(x) = g(out)(b(2) +w(2)>x)

•  Neuron output activation:

Hugo Larochelle

September 6, 2012

Abstract

• a(x) = b+P

i wixi = b+w>x

• h(x) = g(a(x)) = g(b+P

i wixi)

• x1 xd b w1 wd

• g(·) b

• h(x) = g(a(x))

• a(x) = b(1) +W(1)x⇣a(x)i = b(1)i

(1)i,j xj

• o(x) = g(out)(b(2) +w(2)>x)

where are the weights (parameters) is the bias term is called the activation function

Hugo Larochelle

September 6, 2012

Abstract

• a(x) = b+P

i wixi = b+w>x

• h(x) = g(a(x)) = g(b+P

i wixi)

• x1 xd b w1 wd

• g(·) b

• h(x) = g(a(x))

• a(x) = b(1) +W(1)x⇣a(x)i = b(1)i

(1)i,j xj

• o(x) = g(out)(b(2) +w(2)>x)

Hugo Larochelle

September 6, 2012

Abstract

• a(x) = b+P

i wixi = b+w>x

• h(x) = g(a(x)) = g(b+P

i wixi)

• x1 xd b w1 wd

• g(·) b

• h(x) = g(a(x))

• a(x) = b(1) +W(1)x⇣a(x)i = b(1)i

(1)i,j xj

• o(x) = g(out)(b(2) +w(2)>x)

Hugo Larochelle

September 6, 2012

Abstract

• a(x) = b+P

i wixi = b+w>x

• h(x) = g(a(x)) = g(b+P

i wixi)

• x1 xd b w1 wd

• g(·) b

• h(x) = g(a(x))

• a(x) = b(1) +W(1)x⇣a(x)i = b(1)i

(1)i,j xj

• o(x) = g(out)(b(2) +w(2)>x)

(from Pascal Vincent’s slides)

Decision Boundary of a Neuron •  Binary classification:

-  With sigmoid, one can interpret neuron as estimating -  Interpret as a logistic classifier

Hugo Larochelle

September 6, 2012

Abstract

• a(x) = b+P

i wixi = b+w>x

• h(x) = g(a(x)) = g(b+P

i wixi)

• x1 xd b w1 wd

• g(a) = a

• g(a) = sigm(a) = 11+exp(�a)

• g(a) = tanh(a) = exp(a)�exp(�a)exp(a)+exp(�a) = exp(2a)�1

exp(2a)+1

• g(a) = max(0, a)

• g(a) = reclin(a) = max(0, a)

• p(y = 1|x)

• g(·) b

• W (1)i,j b(1)i xj h(x)i w(2)

i b(2)

• h(x) = g(a(x))

• a(x) = b(1) +W(1)x⇣a(x)i = b(1)i +

(1)i,j xj

• f(x) = o⇣b(2) +w(2)>x

• p(y = c|x)

• o(a) = softmax(a) =h

exp(a1)Pc exp(ac)

. . . exp(aC)Pc exp(ac)

-  If activation is greater than 0.5, predict 1

-  Otherwise predict 0

Decision boundary

Same idea can be applied to a tanh activation

Capacity of a Single Neuron •  Can solve linearly separable problems. 21

0 1 0 1 0 1

XOR (x1, x2)

OR (x1, x2) AND (x1, x2) AND (x1, x2)

XOR (x1, x2)

AND (x1, x2)

(x1 (x1

Figure 1.8 – Exemple de modélisation de XOR par un réseau à une couche cachée. Enhaut, de gauche à droite, illustration des fonctions booléennesOR(x1, x2),AND (x1, x2)et AND (x1, x2). En bas, on présente l’illustration de la fonction XOR(x1, x2) en fonc-tion des valeurs de x1 et x2 (à gauche), puis de AND (x1, x2) et AND (x1, x2) (à droite).Les points représentés par un cercle ou par un triangle appartiennent à la classe 0 ou1, respectivement. On observe que, bien qu’un classifieur linéaire soit en mesure de ré-soudre le problème de classification associé aux fonctions OR et AND, il ne l’est pasdans le cas du problème de XOR. Cependant, on utilisant les valeurs de AND (x1, x2)et AND (x1, x2) comme nouvelle représentation de l’entrée (x1, x2), le problème declassification XOR peut alors être résolu linéairement. À noter que dans ce derniercas, il n’existe que trois valeurs possibles de cette nouvelle représentation, puisqueAND (x1, x2) et AND (x1, x2) ne peuvent être toutes les deux vraies pour une mêmeentrée.

ont été entraînés dans le cadre des travaux de cette thèse contiennent quelques milliers

de neurones cachés. Ainsi, la tâche de l’algorithme d’apprentissage est de modifier les

paramètres du réseau afin de trouver la nature des caractéristiques de l’entrée que chaque

neurone doit extraire pour résoudre le problème de classification. Idéalement, ces carac-

Capacity of a Single Neuron •  Can not solve non-linearly separable problems.

0 1 0 1 0 1

XOR (x1, x2)

OR (x1, x2) AND (x1, x2) AND (x1, x2)

XOR (x1, x2)

AND (x1, x2)A

(x1 (x1

Figure 1.8 – Exemple de modélisation de XOR par un réseau à une couche cachée. Enhaut, de gauche à droite, illustration des fonctions booléennesOR(x1, x2),AND (x1, x2)et AND (x1, x2). En bas, on présente l’illustration de la fonction XOR(x1, x2) en fonc-tion des valeurs de x1 et x2 (à gauche), puis de AND (x1, x2) et AND (x1, x2) (à droite).Les points représentés par un cercle ou par un triangle appartiennent à la classe 0 ou1, respectivement. On observe que, bien qu’un classifieur linéaire soit en mesure de ré-soudre le problème de classification associé aux fonctions OR et AND, il ne l’est pasdans le cas du problème de XOR. Cependant, on utilisant les valeurs de AND (x1, x2)et AND (x1, x2) comme nouvelle représentation de l’entrée (x1, x2), le problème declassification XOR peut alors être résolu linéairement. À noter que dans ce derniercas, il n’existe que trois valeurs possibles de cette nouvelle représentation, puisqueAND (x1, x2) et AND (x1, x2) ne peuvent être toutes les deux vraies pour une mêmeentrée.

ont été entraînés dans le cadre des travaux de cette thèse contiennent quelques milliers

de neurones cachés. Ainsi, la tâche de l’algorithme d’apprentissage est de modifier les

paramètres du réseau afin de trouver la nature des caractéristiques de l’entrée que chaque

neurone doit extraire pour résoudre le problème de classification. Idéalement, ces carac-

•  Need to transform the input into a better representation. •  Remember basis functions!

Capacity of Neural Nets •  Consider a single layer neural network

2Reseaux de neurones

.7-.4-1

sortie k

entree i

cachee jbiais

Hidden

Output

6Reseaux de neurones

• La puissance expressive des reseaux de neurones

y3 y4y2y1

Neural Networks ‣  SGD Training, cross entropy loss, ReLU activations

‣  Classification with neural networks

‣  Regularization, Dropout, Batchnorm

‣  Forward Propagation and Backprop (computing derivatives)

Model Selection •  Training Protocol: -  Train your model on the Training Set

• bµ = 1T

Ptx(t)

• b�2 = 1T�1

Pt(x(t) � bµ)2

• b⌃ = 1T�1

Pt(x(t) � bµ)(x(t) � bµ)>

• E[bµ] = µ E[b�2] = �2 E

hb⌃i= ⌃

• bµ�µpb�2/T

• µ 2 bµ±�1.96p

b�2/T

•b✓ = argmax

p(x(1), . . . ,x(T ))

•p(x(1)

, . . . ,x(T )) =Y

p(x(t))

• T�1T

b⌃ = 1T

Machine learning

• Supervised learning example: (x, y) x y

• Training set: Dtrain = {(x(t), y

• f(x;✓)

-  For model selection, use Validation Set

• bµ = 1T

Ptx(t)

• b�2 = 1T�1

Pt(x(t) � bµ)2

• b⌃ = 1T�1

• E[bµ] = µ E[b�2] = �2 E

hb⌃i= ⌃

• µ 2 bµ±�1.96p

b�2/T

•b✓ = argmax

p(x(1), . . . ,x(T ))

•p(x(1)

, . . . ,x(T )) =Y

p(x(t))

• T�1T

b⌃ = 1T

Machine learning

• f(x;✓)

• Dvalid Dtest

Ø  Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc.

-  Estimate generalization performance using the Test Set

• bµ = 1T

Ptx(t)

• b�2 = 1T�1

Pt(x(t) � bµ)2

• b⌃ = 1T�1

• E[bµ] = µ E[b�2] = �2 E

hb⌃i= ⌃

• µ 2 bµ±�1.96p

b�2/T

•b✓ = argmax

p(x(1), . . . ,x(T ))

•p(x(1)

, . . . ,x(T )) =Y

p(x(t))

• T�1T

b⌃ = 1T

Machine learning

• f(x;✓)

• Dvalid Dtest

•  Remember: Generalization is the behavior of the model on unseen examples.

Early Stopping •  To select the number of epochs, stop training when validation set error increases (with some look ahead).

Conv Nets •  Convolutional networks leverage these ideas

Ø  Local connectivity

Ø  Parameter sharing

Ø  Convolution Ø  Pooling / subsampling hidden units

Ø  Understanding Receptive Fields

•  Local contrast normalization, rectification

10417/10617 intermediate deep learning: fall2019rsalakhu/10417/lectures/... · 10417/10617...

Documents

primergy bx900 s2 blade server -...

accommodations & special education assessments update tetn...

valuation of tesla motors inc. - copenhagen business...

sustainable syntheses of (-)-jerantinines a & e and...

original -...

master)thesis...

kandidatafhandling - copenhagen business...

studentsrepo.um.edu.mystudentsrepo.um.edu.my/10417/4/shan.pdf ·...

surface freshwater limitation explains worst rice...

financial market microstructure and trading...

sarahrohleder035 10617 000-13-084 085 softball baseball

why do consumers resist buying electric...

mergers & acquisitions -...

70025 .362018 .10417 .4742...

universiti putra malaysia studies on...

modulating control valves pn16 mxg461 mxf461 10417 hq en

international human resource management -...

stochastic optimization for financial decision making...

10417/10617 intermediate deep learning:...

concerning the community, of the united nations...