MATH 829: Introduction to Data Mining and Analysis
Neural networks II
Dominique Guillot
Departments of Mathematical Sciences
University of Delaware
April 13, 2016
This lecture is based on the UFLDL tutorial (http://deeplearning.stanford.edu/)
Recall
We have:

a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)
a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)
a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)
h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1).
Recall (cont.)
Vector form:
z^{(2)} = W^{(1)} x + b^{(1)}
a^{(2)} = f(z^{(2)})
z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}
h_{W,b}(x) = a^{(3)} = f(z^{(3)}).
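As an illustration, the vector-form forward pass can be sketched in a few lines of NumPy. The sigmoid activation and the 3-3-1 layer sizes are assumptions chosen to match the network from last lecture; the function names are mine, not the slides'.

```python
import numpy as np

def f(z):
    # Sigmoid activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 3-layer network: h_{W,b}(x) = f(W2 f(W1 x + b1) + b2)."""
    z2 = W1 @ x + b1
    a2 = f(z2)
    z3 = W2 @ a2 + b2
    a3 = f(z3)
    return a3

# Toy example: 3 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.01, (3, 3)), np.zeros(3)
W2, b2 = rng.normal(0.0, 0.01, (1, 3)), np.zeros(1)
print(forward(np.array([1.0, 2.0, 3.0]), W1, b1, W2, b2))
```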
Training neural networks
Suppose we have:

A neural network with s_l neurons in layer l (l = 1, ..., n_l).
Observations (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)}) ∈ R^{s_1} × R^{s_{n_l}}.

We would like to choose W^{(l)} and b^{(l)} in some optimal way for all l.

Let

J(W, b; x, y) := \frac{1}{2} \| h_{W,b}(x) - y \|_2^2    (squared error for one sample).

Define

J(W, b) := \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \bigl( W^{(l)}_{ji} \bigr)^2

(average squared error with a Ridge penalty).

Note:

The Ridge penalty prevents overfitting.
We do not penalize the bias terms b^{(l)}_i.
The loss function J(W, b) is not convex.
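A minimal sketch of this objective, assuming the predictions h_{W,b}(x^{(i)}) have already been computed by a forward pass; the function name `loss` and the toy numbers are illustrative only.

```python
import numpy as np

def loss(weights, preds, targets, lam):
    """J(W, b): average squared error plus a Ridge penalty on the weights only.

    preds, targets: arrays of shape (m, output_dim).
    weights: list of weight matrices W^(1), ..., W^(n_l - 1).
    The bias vectors b^(l) are deliberately excluded from the penalty.
    """
    m = len(preds)
    data_term = np.sum(0.5 * np.sum((preds - targets) ** 2, axis=1)) / m
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return data_term + penalty

# Toy check: two samples, one weight matrix.
preds = np.array([[1.0], [0.0]])
targets = np.array([[0.0], [0.0]])
W = np.array([[2.0]])
print(loss([W], preds, targets, lam=0.1))  # (0.5*1 + 0)/2 + 0.05*4 = 0.45
```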
Some remarks
The loss function J(W, b) is often used both for regression and classification.

In classification problems, we choose the labels y ∈ {0, 1} (if working with the sigmoid) or y ∈ {−1, 1} (if working with tanh).
For regression problems, we scale the output so that y ∈ [0, 1] (if working with the sigmoid) or y ∈ [−1, 1] (if working with tanh).

We will use gradient descent to minimize J(W, b). Note that since the function is non-convex, we may only find a local minimum.

We need an initial choice for W^{(l)}_{ij} and b^{(l)}_i. If we initialize all the parameters to 0, then the parameters remain constant across the units of each layer because of the symmetry of the problem. As a result, we initialize the parameters to small random values (say, drawn from N(0, ε²) with ε = 0.01).
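The initialization recipe can be sketched as follows; the helper name `init_params` and the layer sizes in the example are illustrative.

```python
import numpy as np

def init_params(layer_sizes, eps=0.01, seed=0):
    """Draw W^(l) ~ N(0, eps^2) and set b^(l) = 0, breaking the symmetry
    that an all-zero initialization would preserve across units."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0.0, eps, (n_out, n_in))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    bs = [np.zeros(n_out) for n_out in layer_sizes[1:]]
    return Ws, bs

Ws, bs = init_params([3, 3, 1])
print([W.shape for W in Ws])  # [(3, 3), (1, 3)]
```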
Gradient descent and the backpropagation algorithm
We update the parameters using gradient descent as follows:

W^{(l)}_{ij} ← W^{(l)}_{ij} − α \frac{∂}{∂W^{(l)}_{ij}} J(W, b)
b^{(l)}_i ← b^{(l)}_i − α \frac{∂}{∂b^{(l)}_i} J(W, b).

Here α > 0 is a parameter (the learning rate).

Observe that:

\frac{∂}{∂W^{(l)}_{ij}} J(W, b) = \frac{1}{m} \sum_{k=1}^{m} \frac{∂}{∂W^{(l)}_{ij}} J(W, b; x^{(k)}, y^{(k)}) + λ W^{(l)}_{ij}

\frac{∂}{∂b^{(l)}_i} J(W, b) = \frac{1}{m} \sum_{k=1}^{m} \frac{∂}{∂b^{(l)}_i} J(W, b; x^{(k)}, y^{(k)}).

Therefore, it suffices to compute the derivatives of the per-sample losses J(W, b; x^{(k)}, y^{(k)}).
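One update step, assuming the per-sample derivative sums have been accumulated elsewhere (e.g., by the backpropagation algorithm of the next slide); the function name and argument layout are illustrative.

```python
import numpy as np

def gd_step(Ws, bs, grad_Ws, grad_bs, alpha, lam, m):
    """One gradient-descent update.  grad_Ws / grad_bs hold the sums over the
    m samples of the per-sample derivatives; the Ridge term contributes
    lam * W to the weight gradient only (biases are not penalized)."""
    new_Ws = [W - alpha * (gW / m + lam * W) for W, gW in zip(Ws, grad_Ws)]
    new_bs = [b - alpha * (gb / m) for b, gb in zip(bs, grad_bs)]
    return new_Ws, new_bs

# Toy example: one 1x1 weight, gradient sum 2 over m = 2 samples, no penalty.
W, b = [np.array([[1.0]])], [np.array([0.0])]
gW, gb = [np.array([[2.0]])], [np.array([4.0])]
newW, newb = gd_step(W, b, gW, gb, alpha=0.1, lam=0.0, m=2)
print(newW[0])  # 1 - 0.1 * (2/2) = 0.9
```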
Computing the derivatives using backpropagation
1. Compute the activations for all the layers (forward pass).

2. For each output unit i in layer n_l (the output layer), compute

δ^{(n_l)}_i := \frac{∂}{∂z^{(n_l)}_i} \frac{1}{2} \| y − h_{W,b}(x) \|_2^2 = −(y_i − a^{(n_l)}_i) · f′(z^{(n_l)}_i).

3. For l = n_l − 1, n_l − 2, ..., 2, and for each node i in layer l, set

δ^{(l)}_i := \Bigl( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} δ^{(l+1)}_j \Bigr) · f′(z^{(l)}_i).

4. Compute the desired partial derivatives:

\frac{∂}{∂W^{(l)}_{ij}} J(W, b; x, y) = a^{(l)}_j δ^{(l+1)}_i

\frac{∂}{∂b^{(l)}_i} J(W, b; x, y) = δ^{(l+1)}_i.
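Steps 1-4 can be sketched for a single sample as follows, assuming a sigmoid activation f; the code follows the formulas above, but the variable names and list layout are mine.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def fprime(z):
    s = f(z)
    return s * (1.0 - s)   # derivative of the sigmoid

def backprop(x, y, Ws, bs):
    """Gradients of J(W, b; x, y) = 0.5 * ||y - h(x)||^2 for one sample."""
    # Step 1: forward pass, storing z^(l) and a^(l) for every layer.
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = f(z)
        zs.append(z)
        activations.append(a)
    # Step 2: output-layer error delta^(n_l).
    delta = -(y - activations[-1]) * fprime(zs[-1])
    grad_Ws, grad_bs = [], []
    # Steps 3-4: propagate backwards and read off the derivatives.
    for l in range(len(Ws) - 1, -1, -1):
        grad_Ws.append(np.outer(delta, activations[l]))
        grad_bs.append(delta.copy())
        if l > 0:
            delta = (Ws[l].T @ delta) * fprime(zs[l - 1])
    return grad_Ws[::-1], grad_bs[::-1]

# Gradients of one sample for a 2-3-1 network:
rng = np.random.default_rng(0)
Ws = [rng.normal(0, 0.01, (3, 2)), rng.normal(0, 0.01, (1, 3))]
bs = [np.zeros(3), np.zeros(1)]
gWs, gbs = backprop(np.array([1.0, -1.0]), np.array([0.5]), Ws, bs)
print([g.shape for g in gWs])  # [(3, 2), (1, 3)]
```

A quick finite-difference check against the squared-error loss is a good way to convince yourself the deltas are right.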
Autoencoders
An autoencoder tries to learn the identity function:

Input: unlabeled data.
Output = input.

Idea: limit the number of hidden units to discover structure in the data.
Learn a compressed representation of the input.

Source: UFLDL tutorial.

Can also learn a sparse network by including supplementary constraints on the weights.
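A minimal sketch of the setup: an ordinary network whose target equals its input, with a bottleneck hidden layer. The sizes are illustrative, and these untrained weights will of course not reconstruct x well yet.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 10, 4            # bottleneck: fewer hidden units than inputs
W1 = rng.normal(0.0, 0.01, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, (n_in, n_hidden))
b2 = np.zeros(n_in)

x = rng.random(n_in)              # one unlabeled sample
code = f(W1 @ x + b1)             # compressed representation a^(2)
x_hat = f(W2 @ code + b2)         # reconstruction h_{W,b}(x)
err = 0.5 * np.sum((x_hat - x) ** 2)   # squared error J(W, b; x, x)
print(code.shape, x_hat.shape)    # (4,) (10,)
```

Training is exactly the gradient-descent/backpropagation machinery above with y^{(i)} = x^{(i)}.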
Example (UFLDL)
Train an autoencoder on 10 × 10 images with one hidden layer.

Each hidden unit i computes:

a^{(2)}_i = f\Bigl( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \Bigr).

Think of a^{(2)}_i as some non-linear feature of the input x.

Problem: Find the x that maximally activates a^{(2)}_i subject to ‖x‖_2 ≤ 1.

Claim:

x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{k=1}^{100} (W^{(1)}_{ik})^2}}.

(Hint: use Cauchy–Schwarz.)

We can now display the image maximizing a^{(2)}_i for each i.
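The claim translates directly into code: the maximizing input is row i of W^{(1)}, normalized. The helper name and the random weights standing in for a trained autoencoder are illustrative.

```python
import numpy as np

def max_activation_image(W1, i):
    """Input x with ||x||_2 = 1 maximizing hidden unit i's pre-activation;
    by Cauchy-Schwarz this is row i of W^(1), normalized."""
    w = W1[i]
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.01, (100, 100))   # 100 hidden units, 10x10 = 100 pixels
img = max_activation_image(W1, 0).reshape(10, 10)   # displayable 10x10 image
print(img.shape)  # (10, 10)
```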
Example (cont.)
100 hidden units on 10 × 10 pixel inputs:

The different hidden units have learned to detect edges at different positions and orientations in the image.
Sparse neural networks
So far we have discussed dense neural networks.
Dense networks have many parameters to learn, and can be inefficient or impossible to train.
Sparse models have been proposed in the literature.
Some of these models draw inspiration from how the early visual system is wired up in biology.
Using convolutions
Idea: Certain signals are stationary, i.e., their statistical properties do not change in space or time.
For example, images often have similar statistical properties in different regions of space.
This suggests that features learned on one part of an image can also be applied to other parts of the image.
We can "convolve" the learned features with the larger image.

Example: 96 × 96 image.
Learn features on small 8 × 8 patches sampled randomly (e.g., using a sparse autoencoder).
Run the trained model over all 8 × 8 patches of the image to get the feature activations.

Source: UFLDL tutorial.
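A naive sketch of this convolution step, assuming sigmoid activations and a handful of learned features (random weights stand in for a trained sparse autoencoder). A real implementation would vectorize or use an FFT, but the explicit loop mirrors the description above.

```python
import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def convolve_features(image, W, b):
    """Evaluate each learned 8x8 feature on every 8x8 patch of the image.

    W has shape (n_features, 8, 8); for a 96x96 image this returns activations
    of shape (n_features, 89, 89), since 96 - 8 + 1 = 89 valid positions per axis.
    """
    n_feat, ph, pw = W.shape
    H, Wd = image.shape
    out = np.empty((n_feat, H - ph + 1, Wd - pw + 1))
    for r in range(H - ph + 1):
        for c in range(Wd - pw + 1):
            patch = image[r:r + ph, c:c + pw]
            # Each feature's pre-activation is the inner product with the patch.
            out[:, r, c] = f((W * patch).sum(axis=(1, 2)) + b)
    return out

rng = np.random.default_rng(0)
image = rng.random((96, 96))
W = rng.normal(0.0, 0.01, (5, 8, 8))   # 5 learned 8x8 features
b = np.zeros(5)
acts = convolve_features(image, W, b)
print(acts.shape)  # (5, 89, 89)
```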
12/14
![Page 37: MATH 829: Introduction to Data Mining and Analysis Neural ...dguillot/...Neural_networks-II.pdf · MATH 829: Introduction to Data Mining and Analysis Neural networks II Dominique](https://reader034.vdocument.in/reader034/viewer/2022051814/6039e4199a69d92aca5104f0/html5/thumbnails/37.jpg)
Using convolutions
Idea: Certain signals are stationary, i.e., their statistical
properties do not change in space or time.
For example, images often have similar statistical properties in
di�erent regions in space.
That suggests that the features that we learn at one part of an
image can also be applied to other parts of the image.
Can �convolve� the learned features with the larger image.
Example: 96× 96 image.
Learn features on small 8× 8 patches sampled randomly (e.g.
using a sparse autoencoder).
Run the trained model through all 8× 8 patches of the image
to get the feature activations.
Source: UFLDL tutorial.
12/14
![Page 38: MATH 829: Introduction to Data Mining and Analysis Neural ...dguillot/...Neural_networks-II.pdf · MATH 829: Introduction to Data Mining and Analysis Neural networks II Dominique](https://reader034.vdocument.in/reader034/viewer/2022051814/6039e4199a69d92aca5104f0/html5/thumbnails/38.jpg)
Using convolutions
Idea: Certain signals are stationary, i.e., their statistical
properties do not change in space or time.
For example, images often have similar statistical properties in
di�erent regions in space.
That suggests that the features that we learn at one part of an
image can also be applied to other parts of the image.
Can �convolve� the learned features with the larger image.
Example: 96× 96 image.
Learn features on small 8× 8 patches sampled randomly (e.g.
using a sparse autoencoder).
Run the trained model through all 8× 8 patches of the image
to get the feature activations.
Source: UFLDL tutorial.
12/14
![Page 39: MATH 829: Introduction to Data Mining and Analysis Neural ...dguillot/...Neural_networks-II.pdf · MATH 829: Introduction to Data Mining and Analysis Neural networks II Dominique](https://reader034.vdocument.in/reader034/viewer/2022051814/6039e4199a69d92aca5104f0/html5/thumbnails/39.jpg)
Using convolutions
Idea: Certain signals are stationary, i.e., their statistical
properties do not change in space or time.
For example, images often have similar statistical properties in
di�erent regions in space.
That suggests that the features that we learn at one part of an
image can also be applied to other parts of the image.
Can �convolve� the learned features with the larger image.
Example: 96× 96 image.
Learn features on small 8× 8 patches sampled randomly (e.g.
using a sparse autoencoder).
Run the trained model through all 8× 8 patches of the image
to get the feature activations.
Source: UFLDL tutorial.
12/14
Pooling features

One can also pool the features obtained via convolution.
For example, to describe a large image, one natural approach
is to aggregate statistics of these features at various locations.
E.g., compute the mean, max, etc. over different regions.
This can lead to more robust features, and to invariant
features.
For example, if the pooling regions are contiguous, then the
pooling units will be "translation invariant", i.e., they won't
change much if objects in the image undergo a (small)
translation.
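Mean- and max-pooling over regions can be sketched as follows. This is an illustration only; the non-overlapping 4×4 region size is an arbitrary choice, not from the slides.

```python
import numpy as np

def pool(act, size=4, mode="max"):
    """Aggregate an activation map over non-overlapping
    size x size regions (truncating any remainder)."""
    H, W = act.shape
    h, w = H // size, W // size
    # Reshape so each pooling region becomes its own block
    blocks = act[:h*size, :w*size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

act = np.arange(64, dtype=float).reshape(8, 8)  # toy 8x8 activation map
pooled = pool(act, size=4, mode="max")
print(pooled.shape)  # (2, 2): one max per 4x4 region
```

Shifting the input by a pixel or two changes each region's max or mean only slightly, which is the translation invariance described above.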
Neural networks with scikit-learn

Need to install the 0.18-dev version
(http://scikit-learn.org/stable/developers/contributing.html#retrieving-the-latest-code).
sklearn.neural_network.MLPClassifier
sklearn.neural_network.MLPRegressor
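A minimal MLPClassifier example. The dataset, hidden-layer size, and other parameters below are illustrative choices, not taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy binary classification problem
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer with 20 units; weights and biases are fit by
# gradient-based optimization, as in the backpropagation derivation
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

MLPRegressor follows the same interface, with a squared-error loss and an identity output activation instead of a classification loss.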