Neural networks and optimization
Nicolas Le Roux
Criteo
18/05/15
Nicolas Le Roux (Criteo) Neural networks and optimization 18/05/15 1 / 85
1 Introduction
2 Deep networks
3 Optimization
4 Convolutional neural networks
Goal: classification and regression
- Medical imaging: cancer or not? → Classification
- Autonomous driving: optimal wheel position → Regression
- Kinect: where are the limbs? → Regression
- OCR: what are the characters? → Classification
It also works for structured outputs.
Object recognition
Linear classifier
Dataset: pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \dots, N$, with $X^{(i)} \in \mathbb{R}^n$ and $Y^{(i)} \in \{-1, 1\}$.
Goal: find $w$ and $b$ such that $\mathrm{sign}(w^\top X^{(i)} + b) = Y^{(i)}$.
Perceptron algorithm (Rosenblatt, 1957)
$w_0 = 0$, $b_0 = 0$
$\hat{Y}^{(i)} = \mathrm{sign}(w_t^\top X^{(i)} + b_t)$
$w_{t+1} \leftarrow w_t + \sum_i \big(Y^{(i)} - \hat{Y}^{(i)}\big) X^{(i)}$
$b_{t+1} \leftarrow b_t + \sum_i \big(Y^{(i)} - \hat{Y}^{(i)}\big)$
(Demo: linearly separable)
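The update rule above can be sketched in a few lines of NumPy (an illustrative sketch; the function and variable names are mine, not from the slides):

```python
import numpy as np

def perceptron(X, Y, n_iter=100):
    """Batch perceptron as on the slide: start at w0 = 0, b0 = 0 and add
    the summed errors (Y - Yhat) to the parameters. Assumes Y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        Yhat = np.sign(X @ w + b)
        Yhat[Yhat == 0] = -1          # break ties at the boundary
        err = Y - Yhat                # 0 when correct, +/-2 when wrong
        if not err.any():             # converged -- only happens on separable data
            break
        w += err @ X                  # w_{t+1} = w_t + sum_i (Y - Yhat) X
        b += err.sum()                # b_{t+1} = b_t + sum_i (Y - Yhat)
    return w, b
```

On separable data the loop stops once every example is classified correctly; on non-separable data it simply runs out of iterations, which is the failure mode discussed next.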
Some data are not separable
The perceptron algorithm does NOT converge on non-linearly separable data.
Non-convergence of the perceptron algorithm
(Demo: non-linearly separable perceptron)
We need an algorithm that works on both separable and non-separable data.
Classification error
A good classifier minimizes
$$\ell(w) = \sum_i \ell_i = \sum_i \ell\big(Y^{(i)}, \hat{Y}^{(i)}\big) = \sum_i \mathbf{1}_{\mathrm{sign}(w^\top X^{(i)} + b) \neq Y^{(i)}}$$
Convex cost functions
- The classification error is not smooth.
- The sigmoid is smooth but not convex.
- The logistic loss is a convex upper bound.
- The hinge loss (SVMs) behaves very much like the logistic loss.
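These losses are easy to compare numerically as functions of the margin $m = Y(w^\top X + b)$ (a small sketch; note the logistic loss needs a $1/\log 2$ scaling to dominate the 0-1 loss everywhere):

```python
import numpy as np

m = np.linspace(-3, 3, 601)                 # margin Y * (w.X + b)
zero_one = (m <= 0).astype(float)           # classification error: not smooth
logistic = np.log1p(np.exp(-m))             # smooth and convex
hinge = np.maximum(0.0, 1.0 - m)            # SVM loss: convex, piecewise linear

# Scaled by 1/log(2) the logistic loss upper-bounds the 0-1 loss,
# and the hinge loss upper-bounds it directly.
assert (logistic / np.log(2) >= zero_one).all()
assert (hinge >= zero_one).all()
```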
Interlude
Convex function
Non-convex function
Not-convex-but-still-OK function
End of interlude
Solving separable AND non-separable problems
(Demo: non-linearly separable logistic)
(Demo: linearly separable logistic)
Non-linear classification
(Demo: non-linearly separable polynomial)
Non-linear classification
Features $X_1, X_2$ → linear classifier
Features $X_1, X_2, X_1 X_2, X_1^2, \dots$ → non-linear classifier
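Building such features by hand is a one-liner; a linear classifier trained on the expanded matrix is then non-linear in $(X_1, X_2)$ (a sketch; the helper name is mine):

```python
import numpy as np

def poly_features(X):
    """Expand 2-D inputs with the degree-2 monomials from the slide:
    X1, X2, X1*X2, X1^2, X2^2. A linear classifier on these columns
    draws quadratic decision boundaries in the original space."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1, x2, x1 * x2, x1**2, x2**2], axis=1)
```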
Choosing the features
To make it work, I created lots of extra features:
$X_1^i X_2^j$ for $i, j \in [0, 4]$ ($p = 18$)
Would it work with fewer features?
Test with $X_1^i X_2^j$ for $i, j \in [1, 3]$
(Demo: non-linearly separable polynomial degree 2)
A graphical view of the classifiers
$f(X) = w_1 X_1 + w_2 X_2 + b$
$f(X) = w_1 X_1 + w_2 X_2 + w_3 X_1^2 + w_4 X_2^2 + w_5 X_1 X_2 + \dots$
Non-linear features
- A linear classifier on a non-linear transformation is non-linear.
- A non-linear classifier relies on non-linear features. Which ones do we choose?
- Example: $H_j = X_1^{p_j} X_2^{q_j}$
- SVM: $H_j = K(X, X^{(j)})$ with $K$ some kernel function
- Do they have to be predefined? A neural network will learn the $H_j$'s.
A 3-step algorithm for a neural network
1. Pick an example $x$
2. Transform it into $\tilde{x} = Vx$ with some matrix $V$
3. Compute $w^\top \tilde{x} + b$
1. Pick an example $x$
2. Transform it into $\tilde{x} = Vx$ with some matrix $V$
3. Compute $w^\top \tilde{x} + b = w^\top V x + b = \tilde{w}^\top x + b$: still a linear classifier.
The same algorithm, with a nonlinearity:
1. Pick an example $x$
2. Transform it into $\tilde{x} = Vx$ with some matrix $V$
3. Apply a nonlinear function $g$ to all the elements of $\tilde{x}$
4. Compute $w^\top g(Vx) + b \neq \tilde{w}^\top x + b$
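These steps can be written directly (a minimal sketch; tanh stands in for a generic transfer function $g$):

```python
import numpy as np

def forward(x, V, w, b, g=np.tanh):
    """Forward pass of a one-hidden-layer network:
    steps 2-3 map x to g(Vx), step 4 is the linear read-out."""
    h = g(V @ x)        # hidden activations, one per row of V
    return w @ h + b    # w^T g(Vx) + b

# With g the identity the whole model collapses back to a linear one,
# which is exactly why the nonlinearity is needed.
```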
Neural networks
Usually, we use $H_j = g(v_j^\top X)$ where
- $H_j$: hidden unit
- $v_j$: input weights
- $g$: transfer function
Transfer function
$$f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j\, g(v_j^\top X) + b$$
$g$ is the transfer function.
$g$ used to be the sigmoid or the tanh.
If $g$ is the sigmoid, each hidden unit is a soft classifier.
$g$ is now often the positive part, $g(a) = \max(0, a)$.
This is called a rectified linear unit (ReLU).
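Both transfer functions are tiny to implement (sketch):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # historical choice: a soft classifier in [0, 1]

def relu(a):
    return np.maximum(0.0, a)         # "positive part": the rectified linear unit
```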
Each hidden unit is a (soft) linear classifier.
We linearly combine these classifiers to get the output.
Example on the non-separable problem
(Demo: non-linearly separable neural net 3)
Training a neural network
Dataset: pairs $(X^{(i)}, Y^{(i)})$, $i = 1, \dots, N$.
Goal: find $w$ and $b$ such that $\mathrm{sign}\big(w^\top X^{(i)} + b\big) = Y^{(i)}$.
Smooth version of the goal: find $w$ and $b$ to minimize
$$\sum_i \log\Big(1 + \exp\big(-Y^{(i)} (w^\top X^{(i)} + b)\big)\Big)$$
With hidden units: find $v_1, \dots, v_k$, $w$ and $b$ to minimize
$$\sum_i \log\Big(1 + \exp\Big(-Y^{(i)} \sum_j w_j\, g(v_j^\top X^{(i)})\Big)\Big)$$
Cost function
$s$ = cost function (logistic loss, hinge loss, ...)
$$\ell(v, w, b, X^{(i)}, Y^{(i)}) = s\big(Y^{(i)}, \hat{Y}^{(i)}\big) = s\Big(\sum_j w_j H_j(X^{(i)}),\ Y^{(i)}\Big) = s\Big(\sum_j w_j\, g(v_j^\top X^{(i)}),\ Y^{(i)}\Big)$$
We can minimize it using gradient descent.
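Putting the pieces together, gradient descent on this cost can be sketched by hand (an illustrative implementation with tanh units and the logistic loss; all names and hyperparameters are mine, not from the slides):

```python
import numpy as np

def net_loss(V, w, b, X, Y):
    """Mean logistic loss of f(x) = sum_j w_j tanh(v_j^T x) + b."""
    f = np.tanh(X @ V.T) @ w + b
    return np.log1p(np.exp(-Y * f)).mean()

def train(X, Y, k=8, lr=0.1, steps=300, seed=0):
    """Plain gradient descent on the cost above; gradients via the chain rule."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(k, X.shape[1]))
    w = rng.normal(size=k)
    b = 0.0
    for _ in range(steps):
        H = np.tanh(X @ V.T)                   # hidden units H_j(X^(i)), shape (N, k)
        f = H @ w + b                          # network outputs
        dL_df = -Y / (1.0 + np.exp(Y * f))     # d/df of log(1 + exp(-Y f))
        dL_dw = H.T @ dL_df
        dL_db = dL_df.sum()
        dL_dH = np.outer(dL_df, w)
        dL_dV = ((1.0 - H**2) * dL_dH).T @ X   # tanh'(a) = 1 - tanh(a)^2
        N = len(Y)
        V -= lr * dL_dV / N
        w -= lr * dL_dw / N
        b -= lr * dL_db / N
    return V, w, b
```

Propagating the gradient layer by layer like this is exactly what backpropagation automates in deeper networks.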
Finding a good transformation of x
$H_j(x) = g(v_j^\top x)$
Is this a good $H_j$? What can we say about it?
- It is theoretically enough (universal approximation).
- That does not mean you should use it.
Going deeper
$H_j(x) = g(v_j^\top x)$
We can transform $x$ as well: $x_j = g(v_j^\top x)$.
Going from 2 to 3 layers
We can have as many layers as we like
But it makes the optimization problem harder
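Stacking layers changes the forward pass only slightly (a sketch; each entry of `Vs` is one layer's weight matrix, a naming convention of mine):

```python
import numpy as np

def deep_forward(x, Vs, w, b, g=np.tanh):
    """Multi-layer forward pass: each layer re-transforms its input with
    g(V x); the final hidden vector gets the usual linear read-out."""
    for V in Vs:
        x = g(V @ x)
    return w @ x + b
```

The gradient then has to flow back through every one of these compositions, which is one reason the optimization gets harder with depth.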
The need for fast learning
Neural networks may need many examples (several million or more).
We need to be able to use them quickly.
Batch methods
$$L(\theta) = \frac{1}{N}\sum_i \ell(\theta, X^{(i)}, Y^{(i)})$$

$$\theta_{t+1} \leftarrow \theta_t - \frac{\alpha_t}{N} \sum_i \frac{\partial \ell(\theta, X^{(i)}, Y^{(i)})}{\partial \theta}$$
To compute one update of the parameters, we need to go through all the data.
This can be very expensive.
What if we have an infinite amount of data ?
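The batch update above can be sketched in a few lines of numpy on a hypothetical least-squares problem (the problem sizes, step size, and data are illustrative assumptions, not from the slides):

```python
import numpy as np

# Hypothetical least-squares loss: l(theta, x, y) = (theta^T x - y)^2 / 2,
# so L(theta) is the average of l over all N examples.
rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true

def batch_gradient(theta):
    # Average of the N per-example gradients: one full pass over the data per update
    return X.T @ (X @ theta - y) / N

theta = np.zeros(d)
alpha = 0.1  # constant step size alpha_t
for t in range(500):
    theta = theta - alpha * batch_gradient(theta)
```

Each iteration touches all N examples, which is exactly the cost the slide warns about when N is large.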
Potential solutions
1 Discard data.
  - Seems stupid
  - Yet many people do it
2 Use approximate methods.
  - Update = average of the updates for all datapoints.
  - Are these updates really different?
  - If not, how can we learn faster?
Stochastic gradient descent
$$L(\theta) = \frac{1}{N}\sum_i \ell(\theta, X^{(i)}, Y^{(i)})$$

$$\theta_{t+1} \leftarrow \theta_t - \alpha_t \, \frac{\partial \ell(\theta, X^{(i_t)}, Y^{(i_t)})}{\partial \theta}$$
What do we lose when updating the parameters to satisfy just one example?
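A minimal numpy sketch of the stochastic update above, on a hypothetical least-squares problem: each step uses one randomly chosen example (the constant step size, iteration count, and sizes are illustrative assumptions):

```python
import numpy as np

# Hypothetical noiseless least-squares problem, so the optimum fits every example.
rng = np.random.default_rng(1)
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true

theta = np.zeros(d)
alpha_t = 0.02                     # a small constant step works on this easy problem
for t in range(20000):
    i_t = rng.integers(N)          # pick one example at random
    grad = (X[i_t] @ theta - y[i_t]) * X[i_t]   # gradient of l on that example only
    theta = theta - alpha_t * grad
```

Each update costs O(1) examples instead of O(N), which is the whole point of the stochastic variant.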
Disagreement
[Figure: $\|\mu\|^2/\sigma^2$ during optimization, log scale]
As optimization progresses, disagreement increases.
Early on, one can pick one example at a time.
What about later?
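The quantity in the plot, $\|\mu\|^2/\sigma^2$, can be measured directly from the per-example gradients. A sketch on a hypothetical noisy least-squares problem (all sizes and the noise level are assumptions):

```python
import numpy as np

# mu = mean of the per-example gradients; sigma^2 = their average squared
# deviation around mu. A large ratio means the examples agree on the update.
rng = np.random.default_rng(2)
N, d = 500, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)   # a little label noise

def agreement(theta):
    grads = (X @ theta - y)[:, None] * X        # per-example gradients, shape (N, d)
    mu = grads.mean(axis=0)
    sigma2 = ((grads - mu) ** 2).sum(axis=1).mean()
    return (mu @ mu) / sigma2

far = agreement(np.zeros(d))                    # early on: gradients mostly agree
theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]
near = agreement(theta_opt)                     # at the optimum mu = 0: pure disagreement
```

Near the optimum the mean gradient vanishes while the per-example gradients do not, so the ratio collapses: picking one example at a time becomes much less informative.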
Stochastic vs. batch methods
Goal = best of both worlds: linear rate with O(1) iteration cost.
[Figure: log(excess cost) vs. time for stochastic and deterministic methods]
For non-convex problems, stochastic works best.
Understanding the loss function of deep networks
Recent studies suggest that most local minima are close to the optimum.
Regularizing deep networks
Deep networks have many parameters
How do we avoid overfitting?
- Regularizing the parameters
- Limiting the number of units
- Dropout
Dropout
Overfitting happens when each unit gets too specialized.
The core idea is to prevent each unit from relying too much on the others.
How do we do this?
Dropout - an illustration
Figure taken from Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Dropout - a conclusion
By removing units at random, we force the others to be less specialized.
At test time, we use all the units.
This is sometimes presented as an ensemble method.
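A minimal sketch of how dropout could be applied to one layer's activations, using the "inverted" variant that rescales at training time so the test-time pass uses all the units unchanged (the drop rate p = 0.5 is an assumption, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.5  # probability of dropping a unit (assumed)

def apply_dropout(h, train):
    if train:
        mask = rng.random(h.shape) >= p   # each unit is dropped independently
        return h * mask / (1.0 - p)       # rescale survivors so expectations match test time
    return h                              # test time: all the units are used

h = np.ones(100000)
train_out = apply_dropout(h, train=True)
test_out = apply_dropout(h, train=False)
```

The rescaling by 1/(1-p) keeps the expected training activation equal to the test-time activation, which is what makes using all the units at test time sensible.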
Neural networks - Summary
A linear classifier in a feature space can model non-linear boundaries.
Finding a good feature space is essential.
One can design the feature map by hand.
One can learn the feature map, using fewer features than if it were done by hand.
Learning the feature map is potentially HARD (non-convexity).
Recipe for training a neural network
Use rectified linear units
Use dropout
Use stochastic gradient descent
Use 2000 units on each layer
Try different numbers of layers
4 Convolutional neural networks
v_j's for images

$$f(X) = \sum_j w_j H_j(X) + b = \sum_j w_j g(v_j^\top X) + b$$
If X is an image, v_j is an image too.
v_j acts as a filter (presence or absence of a pattern).
What does v_j look like?
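The formula above, written as a minimal numpy sketch (the sizes, random weights, and the choice of g as a rectifier are illustrative assumptions):

```python
import numpy as np

# f(X) = sum_j w_j g(v_j^T X) + b for a one-hidden-layer network.
rng = np.random.default_rng(4)
n_pixels, n_hidden = 32 * 32, 10
V = rng.normal(size=(n_hidden, n_pixels)) * 0.01   # row j is the filter v_j: itself an "image"
w = rng.normal(size=n_hidden)
b = 0.1

def f(X):
    return w @ np.maximum(V @ X, 0.0) + b          # g(v_j^T X) for every j, then weighted sum

X = rng.normal(size=n_pixels)                      # one flattened 32x32 image
score = f(X)
```

Each row of V has as many entries as the image has pixels, which is why v_j can be displayed as an image.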
v_j's for images - Examples
Filters are mostly local
Basic idea of convolutional neural networks
Filters are mostly local.
Instead of using image-wide filters, use small ones over patches.
Repeat for every patch to get a response image.
Subsample the response image to get local invariance.
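The filtering step above can be sketched directly: slide one small filter over every patch of the image to produce a response image, i.e. a "valid" 2D correlation (the image and filter sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
image = rng.normal(size=(8, 8))
filt = rng.normal(size=(3, 3))   # a small, local filter instead of an image-wide one

k = filt.shape[0]
H, W = image.shape
response = np.empty((H - k + 1, W - k + 1))
for i in range(H - k + 1):
    for j in range(W - k + 1):
        # The SAME filter is applied at every position: weights are shared
        response[i, j] = (image[i:i + k, j:j + k] * filt).sum()
```

Weight sharing is what makes this much cheaper than one image-wide filter per feature: the filter has k*k parameters regardless of the image size.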
Filtering - Filters 1-3
[Figures: original image, filter, and output image for three different filters]
Pooling - Filters 1-3
[Figures: original image, output image, and subsampled image for the three filters]
How to do 2x subsampling-pooling, with output image $O$ and subsampled image $S$:

$$S_{ij} = \max_{k \,\in\, \text{window around } (2i,\,2j)} O_k$$
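The pooling rule above as a minimal sketch, taking the max of the output image over non-overlapping 2x2 windows (the 6x6 response image is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
O = rng.normal(size=(6, 6))           # response ("output") image from one filter

S = np.empty((3, 3))                  # subsampled image, half the size in each dimension
for i in range(3):
    for j in range(3):
        # S_ij = max of O over the 2x2 window anchored at (2i, 2j)
        S[i, j] = O[2 * i:2 * i + 2, 2 * j:2 * j + 2].max()
```

Because only the maximum of each window survives, shifting a pattern by one pixel often leaves S unchanged: this is the local invariance the previous slide asks for.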
A convolutional layer
Transforming the data with a layer
[Figure: original datapoint and new datapoint]
A convolutional network
Face detection
NORB dataset
50 toys belonging to 5 categories
- animal, human figure, airplane, truck, car
10 instances per category
- 5 instances used for training, 5 instances for testing
Raw dataset
- 972 stereo pairs of each toy; 48,600 image pairs total
NORB dataset - 2
For each instance:
- 18 azimuths: 0 to 350 degrees, every 20 degrees
- 9 elevations: 30 to 70 degrees from horizontal, every 5 degrees
- 6 illuminations: on/off combinations of 4 lights
- 2 cameras (stereo), 7.5 cm apart, 40 cm from the object
Textured and cluttered versions
90,857 free parameters, 3,901,162 connections.
The entire network is trained end-to-end (all the layers are trained simultaneously).
Normalized-Uniform set
| Method | Error |
| --- | --- |
| Linear Classifier on raw stereo images | 30.2% |
| K-Nearest-Neighbors on raw stereo images | 18.4% |
| K-Nearest-Neighbors on PCA-95 | 16.6% |
| Pairwise SVM on 96x96 stereo images | 11.6% |
| Pairwise SVM on 95 Principal Components | 13.3% |
| Convolutional Net on 96x96 stereo images | 5.8% |
Jittered-Cluttered Dataset
291,600 stereo pairs for training, 58,320 for testing
Objects are jittered
- position, scale, in-plane rotation, contrast, brightness, backgrounds, distractor objects, ...
Input dimension: 98x98x2 (approx 18,000)
Jittered-Cluttered Dataset - Results
| Method | Error |
| --- | --- |
| SVM with Gaussian kernel | 43.3% |
| Convolutional Net with binocular input | 7.8% |
| Convolutional Net + SVM on top | 5.9% |
| Convolutional Net with monocular input | 20.8% |
| Smaller mono net (DEMO) | 26.0% |
Dataset available from http://www.cs.nyu.edu/~yann
ConvNet summary
With complex problems, it is hard to design features by hand.
Neural networks circumvent this problem.
They can be hard to train (again...).
Convolutional neural networks use knowledge about locality in images.
They are much easier to train than standard networks.
And they are FAST (again...).
What has not been covered
In some cases, we have lots of data, but without the labels: unsupervised learning.
There are techniques to use these data to get better performance.
E.g.: Task-Driven Dictionary Learning, Mairal et al.
Additional techniques currently used
Local response normalization
Data augmentation
Batch normalization
Adversarial examples
ConvNets achieve stellar performance in object recognition
Do they really understand what's going on in an image?
What I did not cover
Algorithms for pretraining (RBMs, autoencoders)
Generative models (DBMs)
Recurrent neural networks for machine translation
Conclusions
Neural networks are complex non-linear models
Optimizing them is hard
It is as much about theory as about engineering
When properly tuned, they can perform wonders
Bibliography
Deep Learning book, http://www-labs.iro.umontreal.ca/~bengioy/DLbook/
Dropout: A Simple Way to Prevent Neural Networks from Overfitting, by Srivastava et al.
Intriguing properties of neural networks, by Szegedy et al.
ImageNet Classification with Deep Convolutional Neural Networks, by Krizhevsky, Sutskever and Hinton
Gradient-based learning applied to document recognition, by LeCun et al.
Qualitatively characterizing neural network optimization problems, by Goodfellow and Vinyals
Sequence to Sequence Learning with Neural Networks, by Sutskever, Vinyals and Le
Caffe, http://caffe.berkeleyvision.org/
Hugo Larochelle's MOOC, https://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH