Machine learning - HT 2016
8. Neural Networks
Varun Kanade
University of Oxford
February 19, 2016
Outline
Today, we’ll study feedforward neural networks
- Multi-layer perceptrons
- Application to classification or regression settings
- Backpropagation to compute gradients
- Finally understand what model:forward and model:backward are doing
Artificial Neuron : Logistic Regression
[Diagram: a single artificial neuron. Inputs x1 and x2 (plus a constant 1) feed a summation unit with weights w1, w2 and bias b; a sigmoid gives y = Pr(y = 1 | x, w, b) = σ(w1 x1 + w2 x2 + b).]
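In Torch this single neuron is a two-module model; a minimal sketch (the two-input size and the BCE criterion are my choices for illustration, not from the slide):

require 'nn'

-- a single sigmoid unit = logistic regression: y = σ(w1 x1 + w2 x2 + b)
neuron = nn.Sequential()
neuron:add(nn.Linear(2, 1))    -- computes w · x + b
neuron:add(nn.Sigmoid())       -- squashes to (0, 1), read as Pr(y = 1 | x, w, b)
criterion = nn.BCECriterion()  -- Bernoulli negative log-likelihood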
Multilayer Perceptron (MLP) : Classification
[Diagram: a multilayer perceptron with inputs x1, x2, one hidden layer of two sigmoid units (weights w^2_{jk}, biases b^2_j), and a sigmoid output unit (weights w^3_{1k}, bias b^3_1) giving y = Pr(y = 1 | x, W, b).]
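The same picture in Torch; a minimal sketch (hidden width 2 as drawn, criterion my choice):

mlp = nn.Sequential()
mlp:add(nn.Linear(2, 2))   -- layer 2: z^2 = W^2 x + b^2
mlp:add(nn.Sigmoid())      -- a^2 = σ(z^2)
mlp:add(nn.Linear(2, 1))   -- layer 3: z^3 = W^3 a^2 + b^3
mlp:add(nn.Sigmoid())      -- y = Pr(y = 1 | x, W, b)
criterion = nn.BCECriterion()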
Multilayer Perceptron (MLP) : Regression
[Diagram: the same two-input network with one hidden layer of two units and a single output unit, now trained for regression: y = E[y | x, W, b].]
A Simple Example
With sigmoid units throughout and two hidden layers of two units each, the output written out in full is:

y = σ( w^4_{11} σ( w^3_{11} σ(w^2_{11} x_1 + w^2_{12} x_2 + b^2_1) + w^3_{12} σ(w^2_{21} x_1 + w^2_{22} x_2 + b^2_2) + b^3_1 )
      + w^4_{12} σ( w^3_{21} σ(w^2_{11} x_1 + w^2_{12} x_2 + b^2_1) + w^3_{22} σ(w^2_{21} x_1 + w^2_{22} x_2 + b^2_2) + b^3_2 )
      + b^4_1 )
Feedforward Neural Networks
[Diagram: a feedforward network with Layer 1 (Input), Layers 2 and 3 (Hidden), and Layer 4 (Output); every unit in a layer feeds every unit in the next.]
[Diagram: a deep network from layer 2 to layer L. The forward pass carries z^1 = x (input) through each layer up to z^{L+1} = the loss L; the backward pass carries δ^l = ∂L/∂z^l from δ^L back to δ^1 = ∂L/∂z^1.]

model = nn.Sequential()
model:add(nn.Linear(n_in, n_h1))
model:add(nn.Sigmoid())
...
model:add(nn.Linear(n_h_L-1, n_out))
model:add(nn.Sigmoid())
criterion = nn.MSECriterion()

model:forward(x)
criterion:forward(model.output, y)

dL_do = criterion:backward(model.output, y)
model:backward(x, dL_do)
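Putting the four calls together, one full gradient step looks roughly like this; a sketch with made-up sizes (two inputs, ten hidden units, one output) and a plain parameter update:

require 'nn'

local model = nn.Sequential()
model:add(nn.Linear(2, 10)); model:add(nn.Sigmoid())
model:add(nn.Linear(10, 1)); model:add(nn.Sigmoid())
local criterion = nn.MSECriterion()

local x, y = torch.randn(2), torch.Tensor{1}

model:zeroGradParameters()                 -- clear old gradients
local out   = model:forward(x)             -- forward pass: all z^l, a^l
local loss  = criterion:forward(out, y)    -- L(a^L, y)
local dL_do = criterion:backward(out, y)   -- ∂L/∂a^L
model:backward(x, dL_do)                   -- back-propagation: fills gradWeight, gradBias
model:updateParameters(0.1)                -- p <- p - 0.1 * ∂L/∂p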
Forward Equations
(1) z^1 = x (input)
(2) z^l = W^l a^{l-1} + b^l
(3) a^l = f(z^l)
(4) L(a^L, y)
Output Layer
[Diagram: the output layer, with incoming activations a^{L-1} and error signal δ^L.]

z^L = W^L a^{L-1} + b^L
a^L = f(z^L)
Loss: L(y, a^L) = criterion:forward(a^L, y)
δ^L = ∂L/∂z^L = ∂L/∂a^L ⊙ f′(z^L)
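As a concrete instance (my choice of loss and activation; the slide leaves f and L generic): with squared error L = ½‖a^L − y‖² and a sigmoid output, ∂L/∂a^L = a^L − y and f′(z^L) = a^L ⊙ (1 − a^L), so

δ^L = (a^L − y) ⊙ a^L ⊙ (1 − a^L).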
Back Propagation
[Diagram: a hidden layer l, with incoming activations a^{l-1}, outgoing activations a^l, and error signals δ^l, δ^{l+1}.]

a^l (the inputs into layer l + 1)
z^{l+1} = W^{l+1} a^l + b^{l+1}   (w^{l+1}_{j,k} is the weight on the connection from the kth unit in layer l to the jth unit in layer l + 1)
a^l = f(z^l)   (f is a non-linearity)
δ^{l+1} = ∂L/∂z^{l+1}
δ^l = ∂L/∂z^l = (∂z^{l+1}/∂z^l)^T · ∂L/∂z^{l+1}
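Filling in the Jacobian (a step the slide leaves implicit): since z^{l+1} = W^{l+1} f(z^l) + b^{l+1}, we have ∂z^{l+1}/∂z^l = W^{l+1} diag(f′(z^l)), and therefore

δ^l = diag(f′(z^l)) (W^{l+1})^T δ^{l+1} = f′(z^l) ⊙ (W^{l+1})^T δ^{l+1},

which is back-propagation equation (2) on the summary slide below.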
Gradients wrt Parameters
[Diagram: layer l with incoming activations a^{l-1} and error signal δ^l.]

z^l = W^l a^{l-1} + b^l   (w^l_{j,k} is the weight on the connection from the kth unit in layer l − 1 to the jth unit in layer l)
δ^l = ∂L/∂z^l
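The parameter gradients then follow by the chain rule (this is what the slide is setting up): since z^l_j = Σ_k w^l_{j,k} a^{l-1}_k + b^l_j,

∂L/∂w^l_{j,k} = δ^l_j a^{l-1}_k   and   ∂L/∂b^l_j = δ^l_j,

or in matrix form, ∂L/∂W^l = δ^l (a^{l-1})^T and ∂L/∂b^l = δ^l.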
Forward Equations
(1) z^1 = x (input)
(2) z^l = W^l a^{l-1} + b^l
(3) a^l = f(z^l)
(4) L(a^L, y)

Back-propagation Equations
(1) δ^L = ∂L/∂a^L ⊙ f′(z^L)
(2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)
(3) ∂L/∂b^l = δ^l
(4) ∂L/∂W^l = δ^l (a^{l-1})^T
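These eight equations are exactly what model:forward and model:backward compute. A from-scratch sketch for a 2-3-1 sigmoid network with squared loss (the sizes and the particular torch tensor calls are my own choices):

require 'torch'

local n_in, n_h, n_out = 2, 3, 1
local W2, b2 = torch.randn(n_h, n_in), torch.zeros(n_h)     -- layer 2 parameters
local W3, b3 = torch.randn(n_out, n_h), torch.zeros(n_out)  -- layer 3 (output) parameters
local x, y   = torch.randn(n_in), torch.ones(n_out)

-- forward: z^l = W^l a^{l-1} + b^l,  a^l = σ(z^l)
local a1 = x
local z2 = W2 * a1 + b2;  local a2 = torch.sigmoid(z2)
local z3 = W3 * a2 + b3;  local a3 = torch.sigmoid(z3)
local loss = 0.5 * torch.sum(torch.pow(a3 - y, 2))          -- squared-error loss

-- backward: δ^L = ∂L/∂a^L ⊙ f′(z^L),  δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)
local d3 = torch.cmul(a3 - y, torch.cmul(a3, torch.ones(n_out) - a3))
local d2 = torch.cmul(W3:t() * d3, torch.cmul(a2, torch.ones(n_h) - a2))

-- parameter gradients: ∂L/∂W^l = δ^l (a^{l-1})^T,  ∂L/∂b^l = δ^l
local gW3, gb3 = torch.ger(d3, a2), d3
local gW2, gb2 = torch.ger(d2, a1), d2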
Output Layer : LogSoftmax
[Diagram: the output layer with incoming activations a^{L-1} and error signal δ^L.]

criterion = nn.ClassNLLCriterion()

a^L = LogSoftmax(z^L)
Loss: L(y, a^L) = −a^L_y
δ^L = ∂L/∂z^L = (∂a^L/∂z^L)^T · ∂L/∂a^L
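Carrying out this Jacobian product gives the usual closed form (standard working, not spelled out on the slide): with a^L_k = z^L_k − log Σ_j exp(z^L_j) and ∂L/∂a^L = −e_y (minus the one-hot vector of the correct class),

δ^L = softmax(z^L) − e_y,   i.e.   δ^L_k = softmax(z^L)_k − 1{k = y}.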
Forward Equations
(1) z^1 = x (input)
(2) z^l = g^l(a^{l-1}; W^l, b^l)
(3) a^l = f^l(z^l)
(4) L(a^L, y)

Back-propagation Equations
(1) δ^L = (∂a^L/∂z^L)^T · ∂L/∂a^L
(2) δ^l = (∂z^{l+1}/∂a^l · ∂a^l/∂z^l)^T · δ^{l+1}
(3) ∂L/∂b^l = (∂z^l/∂b^l)^T · δ^l
(4) ∂L/∂W^l = (∂z^l/∂W^l)^T · δ^l
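This general form is precisely the contract an nn module implements: forward maps a^{l-1} to a^l, and backward multiplies the incoming δ^{l+1} by the local Jacobian transposes. A hedged sketch of a parameter-free custom module in the nn.Module style (the module name is made up; treat the details as illustrative):

local SquareSketch, parent = torch.class('nn.SquareSketch', 'nn.Module')

function SquareSketch:updateOutput(input)
  -- forward: a = f(z) = z^2, element-wise
  self.output = torch.cmul(input, input)
  return self.output
end

function SquareSketch:updateGradInput(input, gradOutput)
  -- backward: δ_in = (∂a/∂z)^T δ_out = 2 z ⊙ δ_out
  self.gradInput = torch.cmul(gradOutput, input) * 2
  return self.gradInput
end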
Forward Equations
(1) z^1 = x (input)
(2) z^l = W^l a^{l-1} + b^l
(3) a^l = f(z^l)
(4) L(a^L, y)

Back-propagation Equations
(1) δ^L = ∂L/∂a^L ⊙ f′(z^L)
(2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)
(3) ∂L/∂b^l = δ^l
(4) ∂L/∂W^l = δ^l (a^{l-1})^T
Computational Questions
What is the running time to compute the gradient for a single data point?
What is the space requirement?
Can we process multiple examples together?
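On the last question: the nn layers used above handle a minibatch natively if you pass one example per row, so in practice the answer is yes. A minimal sketch (sizes made up):

local batch = 32
local net = nn.Sequential()
net:add(nn.Linear(2, 10)); net:add(nn.Sigmoid())
net:add(nn.Linear(10, 1)); net:add(nn.Sigmoid())

local X = torch.randn(batch, 2)          -- one example per row
local Y = torch.rand(batch, 1)
local out  = net:forward(X)              -- batch x 1 outputs in a single pass
local loss = nn.MSECriterion():forward(out, Y)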
Training Deep Neural Networks
- Back-propagation gives the gradient
- Stochastic gradient descent is the method of choice
- Regularisation
  - How do we add ℓ1 or ℓ2 regularisation? (a sketch follows this list)
  - Don't regularise the bias terms
- How about convergence?
- What did we learn in the last 10 years that we didn't know in the 80s?
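A hedged sketch of one SGD step with an ℓ2 penalty applied to the weights only (the loop over model.modules is my way of skipping the bias terms; the learning rate and lambda are made up; model, criterion, x, y are as in the earlier sketch):

local lr, lambda = 0.1, 1e-4

model:zeroGradParameters()
local out = model:forward(x)
local loss = criterion:forward(out, y)
model:backward(x, criterion:backward(out, y))

-- ℓ2 regularisation: add lambda * W to each weight gradient, leave the biases alone
for _, m in ipairs(model.modules) do
  if m.weight then m.gradWeight:add(lambda, m.weight) end
end

model:updateParameters(lr)   -- p <- p - lr * ∂L/∂p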
Training Feedforward Deep Networks
[Diagram: the four-layer network from before: Layer 1 (Input), Layers 2 and 3 (Hidden), Layer 4 (Output).]

Why do we get a non-convex optimisation problem?
A toy example
[Diagram: a single sigmoid unit with input x ∈ {−1, 1}, weight w^2_1 and bias b^2_1, producing activation a^2_1. The target is y = (1 − x)/2.]
Propagating Gradients Backwards
[Diagram: a chain of single units from input z^1_1 through to output a^4_1, with one weight per layer (w^2_1, w^3_1, w^4_1) and biases b^2_1, b^3_1, b^4_1.]
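For this chain the back-propagation equations collapse to products of scalars, which is presumably the point of the slide; writing it out (my own working, assuming sigmoid units):

δ^2_1 = σ′(z^2_1) · w^3_1 · σ′(z^3_1) · w^4_1 · δ^4_1,   so   ∂L/∂w^2_1 = δ^2_1 · x.

Since σ′(z) = σ(z)(1 − σ(z)) ≤ 1/4, every extra layer multiplies the gradient by at most |w|/4: with moderate weights the gradient shrinks geometrically as it is propagated backwards (and can blow up if the weights are large).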
Digit Classification Problem on Sheet 4
[Figure: one-hot encoding vs. binary encoding of the targets.]
Digit Classification: Squared Loss vs Cross Entropy
[Figure: squared error vs. cross entropy.]
(See paper by Glorot and Bengio (2010))
Avoiding Saturation
Use rectified linear units
ReLU(z) = max(0, z)
Sometimes you will see just f(z) = |z|
Other varieties: leaky ReLUs, parametric ReLUs
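For reference (standard definitions, not spelled out on the slide): a leaky ReLU is

f(z) = max(0, z) + α min(0, z)   with a small fixed α (e.g. 0.01),

and a parametric ReLU (PReLU) has the same form but learns α along with the other parameters.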
Initialising Weights and Biases
Why is initialising important?
Suppose we were using a sigmoid unit, how would you initialise the incoming weights?
What if it were a ReLU unit?
How about the biases?
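One common answer for sigmoid units (the fan-in/fan-out heuristic of Glorot and Bengio (2010), cited a few slides back; the constants are that paper's convention, not something this slide prescribes):

-- hedged sketch: scale the initial weights so that sigmoid units start away from saturation
local n_in, n_out = 784, 200
local lin = nn.Linear(n_in, n_out)
local r = math.sqrt(6 / (n_in + n_out))
lin.weight:uniform(-r, r)
lin.bias:zero()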
Avoiding Overfitting
Deep Neural Networks have a lot of parameters
- Fully connected layers with n_1, n_2, ..., n_k units have at least n_1 n_2 + n_2 n_3 + ··· + n_{k−1} n_k parameters
- The MLP for digit recognition had 2 million parameters!
- For image detection, the neural net used by Krizhevsky, Sutskever, Hinton (2012) has 60 million parameters and 1.2 million training images
- How do we prevent deep neural networks from overfitting?
Early Stopping
Maintain a validation set and stop training when the error on the validation set stops decreasing.
What are the computational costs?
What are the advantages?
Early stopping leads to better generalisation
(See paper by Hardt, Recht and Singer (2015))
Add Data
Typically, getting additional data is either impossible or expensive
Fake the data!
Images can be translated slightly, rotated slightly, have their brightness changed, etc.
Google Offline Translate trained on entirely fake data!
(Google Research Blog)
Add Data
Adversarial Training
Take a trained (or partially trained) model
Create examples by modifications "imperceptible to the human eye", but where the model fails
(Szegedy et al., Goodfellow et al.)
Other Ideas to Reduce Overfitting
Hard constraints on weights
Gradient Clipping
Inject noise into the system
Enforce sparsity in the neural network
Unsupervised Pre-training (Bengio et al.)
Bagging and Dropout
Bagging (Leo Breiman - 1994)
- Given a dataset D = ⟨(x_i, y_i)⟩_{i=1}^m, sample D_1, D_2, ..., D_k of size m from D with replacement
- Train classifiers f_1, ..., f_k on D_1, ..., D_k
- When predicting, use the majority vote (or the average for regression)
- Clearly this approach is not practical for deep networks
Dropout
- For each input x, drop each hidden unit independently with probability 1/2
- Every input will have a potentially different mask
- Potentially exponentially many different models, but they share the "same weights"
- After training, the whole network is used, with all the weights halved
(Srivastava, Hinton, Krizhevsky, 2014)
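In Torch, dropout is just another layer placed after a hidden non-linearity; a minimal sketch (note that nn.Dropout rescales activations during training instead of halving weights afterwards, so at test time calling evaluate() is enough; that is a detail of this implementation, not of the slide):

local net = nn.Sequential()
net:add(nn.Linear(784, 200))
net:add(nn.ReLU())
net:add(nn.Dropout(0.5))     -- drop each hidden unit with probability 1/2
net:add(nn.Linear(200, 10))
net:add(nn.LogSoftMax())

net:training()               -- dropout active: a fresh random mask per input
-- ... train ...
net:evaluate()               -- dropout disabled for prediction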
Errors Made by MLP for Digit Recognition
[Figure: examples of digits misclassified by the MLP.]
Avoiding Overfitting
- Use weight sharing, a.k.a. tied weights, in the model
- Exploit invariances such as translation
- Exploit locality in images, audio, text, etc.
- Convolutional Neural Networks (convnets)
Convolutional Neural Networks (convnets)
(Fukushima, LeCun, Hinton 1980s)
Convolution
Image Convolution
[Figure. Source: L. W. Kheng]
[Figure. Source: Krizhevsky, Sutskever, Hinton (2012)]
[Figure. Source: Krizhevsky, Sutskever, Hinton (2012); Wikipedia]
[Figure. Source: Krizhevsky, Sutskever, Hinton (2012)]
[Figure. Source: Zeiler and Fergus (2013)]
[Figure. Source: Zeiler and Fergus (2013)]
Convnet in Torch
model = nn.Sequential()
model:add(nn.Reshape(1, 32, 32))
-- layer 1:
model:add(nn.SpatialConvolution(1, 16, 5, 5))
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- layer 2:
model:add(nn.SpatialConvolution(16, 128, 5, 5))
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- layer 3:
model:add(nn.Reshape(128*5*5))
model:add(nn.Linear(128*5*5, 200))
model:add(nn.ReLU())
-- output
model:add(nn.Linear(200, 10))
model:add(nn.LogSoftMax())
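A quick sanity check of the shapes this model produces (the criterion call and the random input are mine; the 128*5*5 reshape works out because 32 -> 28 -> 14 -> 10 -> 5 under the two 5x5 convolutions and 2x2 poolings):

local criterion = nn.ClassNLLCriterion()

local x = torch.randn(32, 32)                 -- one 32x32 grey-scale image
local logprobs = model:forward(x)             -- 10 log-probabilities, one per class
local loss = criterion:forward(logprobs, 3)   -- suppose the true class is 3
local _, pred = logprobs:max(1)               -- predicted class = argmax log-probability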
Convolutional Layer
z^{l+1}_{i',j',f'} = b_{f'} + Σ_{i=1}^{W_{f'}} Σ_{j=1}^{H_{f'}} Σ_{f=1}^{F_l} a^l_{i'+i−1, j'+j−1, f} · w^{l+1,f'}_{i,j,f}

(W_{f'} × H_{f'} is the spatial extent of filter f' and F_l the number of feature maps in layer l.)

∂z^{l+1}_{i',j',f'} / ∂w^{l+1,f'}_{i,j,f} = a^l_{i'+i−1, j'+j−1, f}

∂L / ∂w^{l+1,f'}_{i,j,f} = Σ_{i',j'} δ^{l+1}_{i',j',f'} · a^l_{i'+i−1, j'+j−1, f}
Convolutional Layer
z^{l+1}_{i',j',f'} = b_{f'} + Σ_{i=1}^{W_{f'}} Σ_{j=1}^{H_{f'}} Σ_{f=1}^{F_l} a^l_{i'+i−1, j'+j−1, f} · w^{l+1,f'}_{i,j,f}

∂z^{l+1}_{i',j',f'} / ∂a^l_{i,j,f} = w^{l+1,f'}_{i−i'+1, j−j'+1, f}

∂L / ∂a^l_{i,j,f} = Σ_{i',j',f'} δ^{l+1}_{i',j',f'} · w^{l+1,f'}_{i−i'+1, j−j'+1, f}
Max-Pooling Layer
b^{l+1}_{i',j'} = max_{(i,j) ∈ Ω(i',j')} a^l_{i,j}

∂b^{l+1}_{i',j'} / ∂a^l_{i,j} = I( (i, j) = argmax_{(i,j) ∈ Ω(i',j')} a^l_{i,j} )

(Ω(i',j') is the pooling window that feeds output position (i',j').)