Machine learning - HT 2016
8. Neural Networks
Varun Kanade
University of Oxford
February 19, 2016
Outline
Today, we’ll study feedforward neural networks
- Multi-layer perceptrons
- Application to classification or regression settings
- Backpropagation to compute gradients
- Finally understand what model:forward and model:backward are doing
Artificial Neuron : Logistic Regression
[Diagram: a single artificial neuron. Inputs x1 and x2 (plus a constant 1) feed a summation unit with weights w1, w2 and bias b; a sigmoid gives y = Pr(y = 1 | x, w, b) = σ(w1 x1 + w2 x2 + b).]
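In Torch this single neuron is a two-module model; a minimal sketch (the two-input size and the BCE criterion are my choices for illustration, not from the slide):

require 'nn'

-- a single sigmoid unit = logistic regression: y = σ(w1 x1 + w2 x2 + b)
neuron = nn.Sequential()
neuron:add(nn.Linear(2, 1))    -- computes w · x + b
neuron:add(nn.Sigmoid())       -- squashes to (0, 1), read as Pr(y = 1 | x, w, b)
criterion = nn.BCECriterion()  -- Bernoulli negative log-likelihood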
Multilayer Perceptron (MLP) : Classification
[Diagram: a multilayer perceptron with inputs x1, x2, one hidden layer of two sigmoid units (weights w^2_{jk}, biases b^2_j), and a sigmoid output unit (weights w^3_{1k}, bias b^3_1) giving y = Pr(y = 1 | x, W, b).]
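The same picture in Torch; a minimal sketch (hidden width 2 as drawn, criterion my choice):

mlp = nn.Sequential()
mlp:add(nn.Linear(2, 2))   -- layer 2: z^2 = W^2 x + b^2
mlp:add(nn.Sigmoid())      -- a^2 = σ(z^2)
mlp:add(nn.Linear(2, 1))   -- layer 3: z^3 = W^3 a^2 + b^3
mlp:add(nn.Sigmoid())      -- y = Pr(y = 1 | x, W, b)
criterion = nn.BCECriterion()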
Multilayer Perceptron (MLP) : Regression
[Diagram: the same two-input network with one hidden layer of two units and a single output unit, now trained for regression: y = E[y | x, W, b].]
A Simple Example
With sigmoid units throughout and two hidden layers of two units each, the output written out in full is:

y = σ( w^4_{11} σ( w^3_{11} σ(w^2_{11} x_1 + w^2_{12} x_2 + b^2_1) + w^3_{12} σ(w^2_{21} x_1 + w^2_{22} x_2 + b^2_2) + b^3_1 )
      + w^4_{12} σ( w^3_{21} σ(w^2_{11} x_1 + w^2_{12} x_2 + b^2_1) + w^3_{22} σ(w^2_{21} x_1 + w^2_{22} x_2 + b^2_2) + b^3_2 )
      + b^4_1 )
Feedforward Neural Networks
[Diagram: a feedforward network with Layer 1 (Input), Layers 2 and 3 (Hidden), and Layer 4 (Output); every unit in a layer feeds every unit in the next.]
[Diagram: a deep network from layer 2 to layer L. The forward pass carries z^1 = x (input) through each layer up to z^{L+1} = the loss L; the backward pass carries δ^l = ∂L/∂z^l from δ^L back to δ^1 = ∂L/∂z^1.]

model = nn.Sequential()
model:add(nn.Linear(n_in, n_h1))
model:add(nn.Sigmoid())
...
model:add(nn.Linear(n_h_L-1, n_out))
model:add(nn.Sigmoid())
criterion = nn.MSECriterion()

model:forward(x)
criterion:forward(model.output, y)

dL_do = criterion:backward(model.output, y)
model:backward(x, dL_do)
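Putting the four calls together, one full gradient step looks roughly like this; a sketch with made-up sizes (two inputs, ten hidden units, one output) and a plain parameter update:

require 'nn'

local model = nn.Sequential()
model:add(nn.Linear(2, 10)); model:add(nn.Sigmoid())
model:add(nn.Linear(10, 1)); model:add(nn.Sigmoid())
local criterion = nn.MSECriterion()

local x, y = torch.randn(2), torch.Tensor{1}

model:zeroGradParameters()                 -- clear old gradients
local out   = model:forward(x)             -- forward pass: all z^l, a^l
local loss  = criterion:forward(out, y)    -- L(a^L, y)
local dL_do = criterion:backward(out, y)   -- ∂L/∂a^L
model:backward(x, dL_do)                   -- back-propagation: fills gradWeight, gradBias
model:updateParameters(0.1)                -- p <- p - 0.1 * ∂L/∂p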
Forward Equations
(1) z^1 = x (input)
(2) z^l = W^l a^{l-1} + b^l
(3) a^l = f(z^l)
(4) L(a^L, y)
Output Layer
[Diagram: the output layer, with incoming activations a^{L-1} and error signal δ^L.]

z^L = W^L a^{L-1} + b^L
a^L = f(z^L)
Loss: L(y, a^L) = criterion:forward(a^L, y)
δ^L = ∂L/∂z^L = ∂L/∂a^L ⊙ f′(z^L)
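As a concrete instance (my choice of loss and activation; the slide leaves f and L generic): with squared error L = ½‖a^L − y‖² and a sigmoid output, ∂L/∂a^L = a^L − y and f′(z^L) = a^L ⊙ (1 − a^L), so

δ^L = (a^L − y) ⊙ a^L ⊙ (1 − a^L).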
Back Propagation
[Diagram: a hidden layer l, with incoming activations a^{l-1}, outgoing activations a^l, and error signals δ^l, δ^{l+1}.]

a^l (the inputs into layer l + 1)
z^{l+1} = W^{l+1} a^l + b^{l+1}   (w^{l+1}_{j,k} is the weight on the connection from the kth unit in layer l to the jth unit in layer l + 1)
a^l = f(z^l)   (f is a non-linearity)
δ^{l+1} = ∂L/∂z^{l+1}
δ^l = ∂L/∂z^l = (∂z^{l+1}/∂z^l)^T · ∂L/∂z^{l+1}
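Filling in the Jacobian (a step the slide leaves implicit): since z^{l+1} = W^{l+1} f(z^l) + b^{l+1}, we have ∂z^{l+1}/∂z^l = W^{l+1} diag(f′(z^l)), and therefore

δ^l = diag(f′(z^l)) (W^{l+1})^T δ^{l+1} = f′(z^l) ⊙ (W^{l+1})^T δ^{l+1},

which is back-propagation equation (2) on the summary slide below.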
Gradients wrt Parameters
[Diagram: layer l with incoming activations a^{l-1} and error signal δ^l.]

z^l = W^l a^{l-1} + b^l   (w^l_{j,k} is the weight on the connection from the kth unit in layer l − 1 to the jth unit in layer l)
δ^l = ∂L/∂z^l
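The parameter gradients then follow by the chain rule (this is what the slide is setting up): since z^l_j = Σ_k w^l_{j,k} a^{l-1}_k + b^l_j,

∂L/∂w^l_{j,k} = δ^l_j a^{l-1}_k   and   ∂L/∂b^l_j = δ^l_j,

or in matrix form, ∂L/∂W^l = δ^l (a^{l-1})^T and ∂L/∂b^l = δ^l.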
Forward Equations
(1) z^1 = x (input)
(2) z^l = W^l a^{l-1} + b^l
(3) a^l = f(z^l)
(4) L(a^L, y)

Back-propagation Equations
(1) δ^L = ∂L/∂a^L ⊙ f′(z^L)
(2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)
(3) ∂L/∂b^l = δ^l
(4) ∂L/∂W^l = δ^l (a^{l-1})^T
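These eight equations are exactly what model:forward and model:backward compute. A from-scratch sketch for a 2-3-1 sigmoid network with squared loss (the sizes and the particular torch tensor calls are my own choices):

require 'torch'

local n_in, n_h, n_out = 2, 3, 1
local W2, b2 = torch.randn(n_h, n_in), torch.zeros(n_h)     -- layer 2 parameters
local W3, b3 = torch.randn(n_out, n_h), torch.zeros(n_out)  -- layer 3 (output) parameters
local x, y   = torch.randn(n_in), torch.ones(n_out)

-- forward: z^l = W^l a^{l-1} + b^l,  a^l = σ(z^l)
local a1 = x
local z2 = W2 * a1 + b2;  local a2 = torch.sigmoid(z2)
local z3 = W3 * a2 + b3;  local a3 = torch.sigmoid(z3)
local loss = 0.5 * torch.sum(torch.pow(a3 - y, 2))          -- squared-error loss

-- backward: δ^L = ∂L/∂a^L ⊙ f′(z^L),  δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)
local d3 = torch.cmul(a3 - y, torch.cmul(a3, torch.ones(n_out) - a3))
local d2 = torch.cmul(W3:t() * d3, torch.cmul(a2, torch.ones(n_h) - a2))

-- parameter gradients: ∂L/∂W^l = δ^l (a^{l-1})^T,  ∂L/∂b^l = δ^l
local gW3, gb3 = torch.ger(d3, a2), d3
local gW2, gb2 = torch.ger(d2, a1), d2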
Output Layer : LogSoftmax
[Diagram: the output layer with incoming activations a^{L-1} and error signal δ^L.]

criterion = nn.ClassNLLCriterion()

a^L = LogSoftmax(z^L)
Loss: L(y, a^L) = −a^L_y
δ^L = ∂L/∂z^L = (∂a^L/∂z^L)^T · ∂L/∂a^L
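Carrying out this Jacobian product gives the usual closed form (standard working, not spelled out on the slide): with a^L_k = z^L_k − log Σ_j exp(z^L_j) and ∂L/∂a^L = −e_y (minus the one-hot vector of the correct class),

δ^L = softmax(z^L) − e_y,   i.e.   δ^L_k = softmax(z^L)_k − 1{k = y}.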
Forward Equations
(1) z^1 = x (input)
(2) z^l = g^l(a^{l-1}; W^l, b^l)
(3) a^l = f^l(z^l)
(4) L(a^L, y)

Back-propagation Equations
(1) δ^L = (∂a^L/∂z^L)^T · ∂L/∂a^L
(2) δ^l = (∂z^{l+1}/∂a^l · ∂a^l/∂z^l)^T · δ^{l+1}
(3) ∂L/∂b^l = (∂z^l/∂b^l)^T · δ^l
(4) ∂L/∂W^l = (∂z^l/∂W^l)^T · δ^l
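This general form is precisely the contract an nn module implements: forward maps a^{l-1} to a^l, and backward multiplies the incoming δ^{l+1} by the local Jacobian transposes. A hedged sketch of a parameter-free custom module in the nn.Module style (the module name is made up; treat the details as illustrative):

local SquareSketch, parent = torch.class('nn.SquareSketch', 'nn.Module')

function SquareSketch:updateOutput(input)
  -- forward: a = f(z) = z^2, element-wise
  self.output = torch.cmul(input, input)
  return self.output
end

function SquareSketch:updateGradInput(input, gradOutput)
  -- backward: δ_in = (∂a/∂z)^T δ_out = 2 z ⊙ δ_out
  self.gradInput = torch.cmul(gradOutput, input) * 2
  return self.gradInput
end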
Forward Equations
(1) z^1 = x (input)
(2) z^l = W^l a^{l-1} + b^l
(3) a^l = f(z^l)
(4) L(a^L, y)

Back-propagation Equations
(1) δ^L = ∂L/∂a^L ⊙ f′(z^L)
(2) δ^l = (W^{l+1})^T δ^{l+1} ⊙ f′(z^l)
(3) ∂L/∂b^l = δ^l
(4) ∂L/∂W^l = δ^l (a^{l-1})^T
Computational Questions
What is the running time to compute the gradient for a single data point?
What is the space requirement?
Can we process multiple examples together?
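On the last question: the nn layers used above handle a minibatch natively if you pass one example per row, so in practice the answer is yes. A minimal sketch (sizes made up):

local batch = 32
local net = nn.Sequential()
net:add(nn.Linear(2, 10)); net:add(nn.Sigmoid())
net:add(nn.Linear(10, 1)); net:add(nn.Sigmoid())

local X = torch.randn(batch, 2)          -- one example per row
local Y = torch.rand(batch, 1)
local out  = net:forward(X)              -- batch x 1 outputs in a single pass
local loss = nn.MSECriterion():forward(out, Y)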
Training Deep Neural Networks
- Back-propagation gives the gradient
- Stochastic gradient descent is the method of choice
- Regularisation
  - How do we add ℓ1 or ℓ2 regularisation? (a sketch follows this list)
  - Don't regularise the bias terms
- How about convergence?
- What did we learn in the last 10 years that we didn't know in the 80s?
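A hedged sketch of one SGD step with an ℓ2 penalty applied to the weights only (the loop over model.modules is my way of skipping the bias terms; the learning rate and lambda are made up; model, criterion, x, y are as in the earlier sketch):

local lr, lambda = 0.1, 1e-4

model:zeroGradParameters()
local out = model:forward(x)
local loss = criterion:forward(out, y)
model:backward(x, criterion:backward(out, y))

-- ℓ2 regularisation: add lambda * W to each weight gradient, leave the biases alone
for _, m in ipairs(model.modules) do
  if m.weight then m.gradWeight:add(lambda, m.weight) end
end

model:updateParameters(lr)   -- p <- p - lr * ∂L/∂p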
Training Feedforward Deep Networks
[Diagram: the four-layer network from before: Layer 1 (Input), Layers 2 and 3 (Hidden), Layer 4 (Output).]

Why do we get a non-convex optimisation problem?
A toy example
[Diagram: a single sigmoid unit with input x ∈ {−1, 1}, weight w^2_1 and bias b^2_1, producing activation a^2_1. The target is y = (1 − x)/2.]
Propagating Gradients Backwards
[Diagram: a chain of single units from input z^1_1 through to output a^4_1, with one weight per layer (w^2_1, w^3_1, w^4_1) and biases b^2_1, b^3_1, b^4_1.]
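For this chain the back-propagation equations collapse to products of scalars, which is presumably the point of the slide; writing it out (my own working, assuming sigmoid units):

δ^2_1 = σ′(z^2_1) · w^3_1 · σ′(z^3_1) · w^4_1 · δ^4_1,   so   ∂L/∂w^2_1 = δ^2_1 · x.

Since σ′(z) = σ(z)(1 − σ(z)) ≤ 1/4, every extra layer multiplies the gradient by at most |w|/4: with moderate weights the gradient shrinks geometrically as it is propagated backwards (and can blow up if the weights are large).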
Digit Classification Problem on Sheet 4
[Figure: one-hot encoding vs. binary encoding of the targets.]
Digit Classification: Squared Loss vs Cross Entropy
[Figure: squared error vs. cross entropy.]
(See paper by Glorot and Bengio (2010))
Avoiding Saturation
Use rectified linear units
ReLU(z) = max(0, z)
Sometimes you will see just f(z) = |z|
Other varieties: leaky ReLUs, parametric ReLUs
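For reference (standard definitions, not spelled out on the slide): a leaky ReLU is

f(z) = max(0, z) + α min(0, z)   with a small fixed α (e.g. 0.01),

and a parametric ReLU (PReLU) has the same form but learns α along with the other parameters.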
Initialising Weights and Biases
Why is initialising important?
Suppose we were using a sigmoid unit, how would you initialise the incoming weights?
What if it were a ReLU unit?
How about the biases?
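One common answer for sigmoid units (the fan-in/fan-out heuristic of Glorot and Bengio (2010), cited a few slides back; the constants are that paper's convention, not something this slide prescribes):

-- hedged sketch: scale the initial weights so that sigmoid units start away from saturation
local n_in, n_out = 784, 200
local lin = nn.Linear(n_in, n_out)
local r = math.sqrt(6 / (n_in + n_out))
lin.weight:uniform(-r, r)
lin.bias:zero()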
Avoiding Overfitting
Deep Neural Networks have a lot of parameters
- Fully connected layers with n_1, n_2, ..., n_k units have at least n_1 n_2 + n_2 n_3 + ··· + n_{k−1} n_k parameters
- The MLP for digit recognition had 2 million parameters!
- For image detection, the neural net used by Krizhevsky, Sutskever, Hinton (2012) has 60 million parameters and 1.2 million training images
- How do we prevent deep neural networks from overfitting?
Early Stopping
Maintain a validation set and stop training when the error on the validation set stops decreasing.
What are the computational costs?
What are the advantages?
Early stopping leads to better generalisation
(See paper by Hardt, Recht and Singer (2015))
Add Data
Typically, getting additional data is either impossible or expensive
Fake the data!
Images can be translated slightly, rotated slightly, have their brightness changed, etc.
Google Offline Translate trained on entirely fake data!
(Google Research Blog)
Add Data
Adversarial Training
Take a trained (or partially trained) model
Create examples by modifications "imperceptible to the human eye", but where the model fails
(Szegedy et al., Goodfellow et al.)
Other Ideas to Reduce Overfitting
Hard constraints on weights
Gradient Clipping
Inject noise into the system
Enforce sparsity in the neural network
Unsupervised Pre-training (Bengio et al.)
Bagging and Dropout
Bagging (Leo Breiman - 1994)
- Given a dataset D = ⟨(x_i, y_i)⟩_{i=1}^m, sample D_1, D_2, ..., D_k of size m from D with replacement
- Train classifiers f_1, ..., f_k on D_1, ..., D_k
- When predicting, use the majority vote (or the average for regression)
- Clearly this approach is not practical for deep networks
Dropout
- For each input x, drop each hidden unit independently with probability 1/2
- Every input will have a potentially different mask
- Potentially exponentially many different models, but they share the "same weights"
- After training, the whole network is used, with all the weights halved
(Srivastava, Hinton, Krizhevsky, 2014)
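In Torch, dropout is just another layer placed after a hidden non-linearity; a minimal sketch (note that nn.Dropout rescales activations during training instead of halving weights afterwards, so at test time calling evaluate() is enough; that is a detail of this implementation, not of the slide):

local net = nn.Sequential()
net:add(nn.Linear(784, 200))
net:add(nn.ReLU())
net:add(nn.Dropout(0.5))     -- drop each hidden unit with probability 1/2
net:add(nn.Linear(200, 10))
net:add(nn.LogSoftMax())

net:training()               -- dropout active: a fresh random mask per input
-- ... train ...
net:evaluate()               -- dropout disabled for prediction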
Errors Made by MLP for Digit Recognition
[Figure: examples of digits misclassified by the MLP.]
Avoiding Overfitting
- Use weight sharing, a.k.a. tied weights, in the model
- Exploit invariances such as translation
- Exploit locality in images, audio, text, etc.
- Convolutional Neural Networks (convnets)
Convolutional Neural Networks (convnets)
(Fukushima, LeCun, Hinton 1980s)
Convolution
Image Convolution
[Figure. Source: L. W. Kheng]
[Figure. Source: Krizhevsky, Sutskever, Hinton (2012)]
[Figure. Source: Krizhevsky, Sutskever, Hinton (2012); Wikipedia]
[Figure. Source: Krizhevsky, Sutskever, Hinton (2012)]
[Figure. Source: Zeiler and Fergus (2013)]
[Figure. Source: Zeiler and Fergus (2013)]
Convnet in Torch
model = nn.Sequential()
model:add(nn.Reshape(1, 32, 32))
-- layer 1:
model:add(nn.SpatialConvolution(1, 16, 5, 5))
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- layer 2:
model:add(nn.SpatialConvolution(16, 128, 5, 5))
model:add(nn.ReLU())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- layer 3:
model:add(nn.Reshape(128*5*5))
model:add(nn.Linear(128*5*5, 200))
model:add(nn.ReLU())
-- output
model:add(nn.Linear(200, 10))
model:add(nn.LogSoftMax())
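A quick sanity check of the shapes this model produces (the criterion call and the random input are mine; the 128*5*5 reshape works out because 32 -> 28 -> 14 -> 10 -> 5 under the two 5x5 convolutions and 2x2 poolings):

local criterion = nn.ClassNLLCriterion()

local x = torch.randn(32, 32)                 -- one 32x32 grey-scale image
local logprobs = model:forward(x)             -- 10 log-probabilities, one per class
local loss = criterion:forward(logprobs, 3)   -- suppose the true class is 3
local _, pred = logprobs:max(1)               -- predicted class = argmax log-probability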
Convolutional Layer
z^{l+1}_{i',j',f'} = b_{f'} + Σ_{i=1}^{W_{f'}} Σ_{j=1}^{H_{f'}} Σ_{f=1}^{F_l} a^l_{i'+i−1, j'+j−1, f} · w^{l+1,f'}_{i,j,f}

(W_{f'} × H_{f'} is the spatial extent of filter f' and F_l the number of feature maps in layer l.)

∂z^{l+1}_{i',j',f'} / ∂w^{l+1,f'}_{i,j,f} = a^l_{i'+i−1, j'+j−1, f}

∂L / ∂w^{l+1,f'}_{i,j,f} = Σ_{i',j'} δ^{l+1}_{i',j',f'} · a^l_{i'+i−1, j'+j−1, f}
Convolutional Layer
z^{l+1}_{i',j',f'} = b_{f'} + Σ_{i=1}^{W_{f'}} Σ_{j=1}^{H_{f'}} Σ_{f=1}^{F_l} a^l_{i'+i−1, j'+j−1, f} · w^{l+1,f'}_{i,j,f}

∂z^{l+1}_{i',j',f'} / ∂a^l_{i,j,f} = w^{l+1,f'}_{i−i'+1, j−j'+1, f}

∂L / ∂a^l_{i,j,f} = Σ_{i',j',f'} δ^{l+1}_{i',j',f'} · w^{l+1,f'}_{i−i'+1, j−j'+1, f}
Max-Pooling Layer
b^{l+1}_{i',j'} = max_{(i,j) ∈ Ω(i',j')} a^l_{i,j}

∂b^{l+1}_{i',j'} / ∂a^l_{i,j} = I( (i, j) = argmax_{(i,j) ∈ Ω(i',j')} a^l_{i,j} )

(Ω(i',j') is the pooling window that feeds output position (i',j').)