
Single Layer Neural Networks

Steve Renals

Informatics 2B: Learning and Data, Lecture 11, 26 February 2010


Overview

Today’s lecture

Linear discriminants and single-layer neural networks

Training the weights of a single-layer neural network directly

Gradient descent


Recap: Gaussians with equal covariance

Consider the special case in which the Gaussian pdfs for each class all share the same class-independent covariance matrix:

Σ_c = Σ   ∀c

By dropping terms that are now constant we can simplify the discriminant function to

y_c(x) = (μ_c^T Σ^{-1}) x − (1/2) μ_c^T Σ^{-1} μ_c + ln P(C_c)

This is a linear function of x. We can define two variables in terms of μ_c, P(C_c) and Σ:

w_c^T = μ_c^T Σ^{-1}        w_{c0} = −(1/2) μ_c^T Σ^{-1} μ_c + ln P(C_c)

Substituting w_c and w_{c0} into the expression for y_c(x):

y_c(x) = w_c^T x + w_{c0}

Linear discriminant function
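To make the substitution concrete, here is a minimal sketch (my own illustration, not part of the lecture) that computes w_c and w_{c0} from a class mean, shared covariance and class prior using NumPy; the function name is hypothetical:

```python
import numpy as np

def linear_discriminant(mu_c, Sigma, prior_c):
    """Build the linear discriminant y_c(x) = w_c @ x + w_c0 for one class."""
    Sigma_inv = np.linalg.inv(Sigma)
    w_c = Sigma_inv @ mu_c                                    # w_c = Sigma^{-1} mu_c (Sigma symmetric)
    w_c0 = -0.5 * mu_c @ Sigma_inv @ mu_c + np.log(prior_c)   # bias: -1/2 mu^T Sigma^{-1} mu + ln P(C_c)
    return w_c, w_c0
```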

Recap: Decision Regions: equal covariance

[Figure: "Decision regions: Equal Covariance Gaussians", plotted over x_1 and x_2, with both axes running from −8 to 8.]

Single-layer neural networks

Linear discriminant functions for a K-class problem

y_k(x) = w_k^T x + w_{k0}

May be represented as a single-layer neural network


Multiclass single-layer neural network

[Figure: a single-layer network. Inputs x_0 (bias), x_1, ..., x_d are each connected to every output y_1, ..., y_K; the connection from input x_i to output y_k carries weight w_{ki} (the figure labels w_{10}, w_{11}, w_{1d}, w_{K0}, w_{K1}, w_{Kd}).]

Single-layer neural networks

Linear discriminant functions for a K-class problem

y_k(x) = w_k^T x + w_{k0}

May be represented as a single-layer neural network

Define a K × (d + 1) weight matrix W whose kth row is the weight vector w_k^T

The 0th column is given by the biases w_{k0}

If we define an additional input dimension x_0 = 1, which corresponds to the bias, then we may write:

y = Wx

In terms of individual components:

y_k = ∑_{i=0}^{d} w_{ki} x_i



Multiclass single-layer neural network

[Figure: the same single-layer network diagram as on the previous slide.]

Input vector x = (x_0, x_1, ..., x_d)

Output vector y = (y_1, ..., y_K)

Weight matrix W: w_{ki} is the weight from input x_i to output y_k

Multiclass single-layer neural network

[Figure: the same single-layer network diagram.]

y = Wx        y_k = ∑_{i=0}^{d} w_{ki} x_i
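As a concrete illustration (a minimal sketch of my own, not from the notes), the forward computation y = Wx with the bias absorbed into an extra input x_0 = 1 takes a few lines of NumPy; the sizes K and d and the random weights below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4                          # number of classes, input dimension (arbitrary)
W = rng.normal(size=(K, d + 1))      # K x (d+1) weight matrix; column 0 holds the biases w_k0

x = rng.normal(size=d)               # an input vector
x_ext = np.concatenate(([1.0], x))   # prepend x_0 = 1 so the bias becomes an ordinary weight

y = W @ x_ext                        # y_k = sum_{i=0}^{d} w_ki x_i, i.e. y = Wx
predicted_class = int(np.argmax(y))  # classify by the largest discriminant value
```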

Training set

We want to train the weight matrix W such that (for our training data) we minimize the error in the output vectors y given the input vectors x

Training set: a set of N input/output pairs {(x^n, t^n) : 1 ≤ n ≤ N}, where t^n = (t^n_1, ..., t^n_K) is the target output vector for input vector x^n

For a classification problem, if the correct class is C_c, then:

t^n_c = 1        t^n_k = 0   ∀k ≠ c

1-from-K coding (one target component set to 1, the rest to 0)

We can write the network output vector as y^n(x^n; W) (explicitly showing the dependence on the weight matrix and the input vector)
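For instance, a toy training set with 1-from-K target coding might be built like this (a sketch of my own; the sizes and random labels are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, K = 6, 4, 3                     # examples, input dimension, classes (arbitrary)
X = rng.normal(size=(N, d))           # input vectors x^n, one per row
labels = rng.integers(0, K, size=N)   # correct class index c for each example

T = np.zeros((N, K))                  # target vectors t^n, one per row
T[np.arange(N), labels] = 1.0         # t^n_c = 1, all other components 0
```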


Sum-of-squares error function

Training problem: set the weight matrix W such that y^n(x^n; W) is as close as possible to t^n for all n

To address this problem we use the notion of an error function between the target and actual network output

The sum-of-squares error function computes the (squared) Euclidean distance between t^n and y^n over the whole training set 1 ≤ n ≤ N:

E(W) = (1/2) ∑_{n=1}^{N} ∑_{k=1}^{K} (y^n_k − t^n_k)^2
     = (1/2) ∑_{n=1}^{N} ∑_{k=1}^{K} ( ∑_{i=0}^{d} w_{ki} x^n_i − t^n_k )^2

This error function E(W) is a smooth function of the weights

Training involves setting the weight matrix W to minimize E(W)
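Written as code (a minimal sketch under my own naming, not from the lecture), the sum-of-squares error over the whole training set is:

```python
import numpy as np

def sum_of_squares_error(W, X_ext, T):
    """E(W) = 1/2 * sum_n sum_k (y^n_k - t^n_k)^2.

    W     : (K, d+1) weight matrix, column 0 holding the biases
    X_ext : (N, d+1) inputs with x_0 = 1 prepended to each row
    T     : (N, K)   target vectors (e.g. 1-from-K coded)
    """
    Y = X_ext @ W.T                    # network outputs, one row y^n per example
    return 0.5 * np.sum((Y - T) ** 2)  # squared Euclidean distance, summed over n and k
```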



Minimizing the error function

E(W) = (1/2) ∑_{n=1}^{N} ∑_{k=1}^{K} ( ∑_{i=0}^{d} w_{ki} x^n_i − t^n_k )^2

We find the minimum by looking for where the derivatives of E with respect to the weights are 0

Since E is a quadratic function of the weights, the derivatives are linear functions of the weights

Solving for the weight values:

Exact approach: the pseudo-inverse (closed-form least-squares) solution

Iterative approaches: IRLS (Newton-Raphson), gradient descent

We will only consider gradient descent
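For completeness, the exact approach can be sketched with NumPy's least-squares solver (my own illustration; the lecture itself only pursues gradient descent):

```python
import numpy as np

def fit_exact(X_ext, T):
    """Closed-form least-squares fit of the single-layer network.

    Solves min_W ||X_ext @ W.T - T||^2 (the pseudo-inverse solution).
    X_ext : (N, d+1) inputs with x_0 = 1; T : (N, K) targets.
    Returns W with shape (K, d+1).
    """
    W_transposed, *_ = np.linalg.lstsq(X_ext, T, rcond=None)  # (d+1, K) solution
    return W_transposed.T
```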


Gradient descent (1)

Gradient descent can be used whenever it is possible to compute the derivatives of the error function E with respect to the parameters to be optimized, W

Basic idea: adjust the weights to move downhill in weight space

Weight space: a K · (d + 1)-dimensional space; a weight matrix W is a point in weight space

The gradient of the error in weight space:

∇_W E = ( ∂E/∂w_{10}, ..., ∂E/∂w_{ki}, ..., ∂E/∂w_{Kd} )^T

Gradient descent (2)

Operation of gradient descent:

1. Start with a guess for the weight matrix W (small random numbers)
2. Update the weights by adjusting the weight matrix in the direction of −∇_W E
3. Recompute the error, and iterate

The update for weight w_{ki} at iteration τ + 1 is:

w^{τ+1}_{ki} = w^{τ}_{ki} − η ∂E/∂w_{ki}

The parameter η is the learning rate
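As a tiny illustration of the update rule (my own example, not from the notes), gradient descent on the one-dimensional error E(w) = (w − 3)^2 looks like this:

```python
def grad_E(w):
    return 2.0 * (w - 3.0)       # dE/dw for E(w) = (w - 3)^2

w, eta = 0.0, 0.1                # initial guess and learning rate
for _ in range(50):
    w = w - eta * grad_E(w)      # w_{tau+1} = w_tau - eta * dE/dw

print(w)                         # close to the minimiser w = 3
```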

Gradients for a single-layer neural network

E(W) = (1/2) ∑_{n=1}^{N} ∑_{k=1}^{K} ( ∑_{i=0}^{d} w_{ki} x^n_i − t^n_k )^2

To minimize E with respect to W we differentiate E with respect to each weight w_{ki}:

∂E/∂w_{ki} = ∑_{n=1}^{N} ( ∑_{j=0}^{d} w_{kj} x^n_j − t^n_k ) x^n_i
           = ∑_{n=1}^{N} (y^n_k − t^n_k) x^n_i
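A minimal sketch (my own, with arbitrary random data) that computes this gradient for all weights at once and checks one component against a finite difference:

```python
import numpy as np

def error(W, X_ext, T):
    return 0.5 * np.sum((X_ext @ W.T - T) ** 2)

def gradient(W, X_ext, T):
    # Entry [k, i] is dE/dw_ki = sum_n (y^n_k - t^n_k) * x^n_i
    return (X_ext @ W.T - T).T @ X_ext

rng = np.random.default_rng(2)
X_ext = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 3))])  # 5 examples, d = 3, x_0 = 1
T = np.eye(4)[rng.integers(0, 4, size=5)]                      # 1-from-K targets, K = 4
W = rng.normal(size=(4, 4))

g = gradient(W, X_ext, T)
eps = 1e-6
W_pert = W.copy()
W_pert[1, 2] += eps                                            # perturb a single weight
finite_diff = (error(W_pert, X_ext, T) - error(W, X_ext, T)) / eps
assert np.isclose(g[1, 2], finite_diff, rtol=1e-4, atol=1e-4)  # analytic matches numeric
```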

Gradient descent for a single-layer neural network (2)

If we define δ^n_k as the difference between the network output and the target, we can write:

δ^n_k = y^n_k − t^n_k

∂E/∂w_{ki} = ∑_{n=1}^{N} δ^n_k x^n_i

The derivative for the weight connecting input i to output k is calculated using the product of the error at the output and the input value, summed over the whole training set

Combining the expression for the derivatives with the expression for the gradient descent update we have:

w^{τ+1}_{ki} = w^{τ}_{ki} − η ∂E/∂w_{ki} = w^{τ}_{ki} − η ∑_{n=1}^{N} δ^n_k x^n_i
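Putting the pieces together, a complete batch-gradient-descent training loop for the single-layer network might look like the following sketch (my own illustration; the function name and hyperparameter values are placeholders):

```python
import numpy as np

def train_single_layer(X_ext, T, n_iters=200, eta=0.01, seed=0):
    """Batch gradient descent on the sum-of-squares error.

    X_ext : (N, d+1) inputs with x_0 = 1; T : (N, K) targets.
    Returns the (K, d+1) weight matrix W.
    """
    rng = np.random.default_rng(seed)
    K, d_plus_1 = T.shape[1], X_ext.shape[1]
    W = 0.01 * rng.normal(size=(K, d_plus_1))   # start from small random weights
    for _ in range(n_iters):
        Y = X_ext @ W.T                         # forward pass: y^n for every example
        delta = Y - T                           # delta^n_k = y^n_k - t^n_k
        grad = delta.T @ X_ext                  # dE/dw_ki = sum_n delta^n_k x^n_i
        W = W - eta * grad                      # w <- w - eta * dE/dw
    return W
```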


Schematic of gradient descent training

[Figure: a single output unit y_k connected to the inputs x_0 (bias), x_1, ..., x_i, ..., x_d through weights w_{k0}, w_{k1}, ..., w_{ki}, ..., w_{kd}. The schematic is built up over successive slides in three steps:]

1. Forward pass: y_k = ∑_{j=0}^{d} w_{kj} x_j
2. Output error: δ_k = y_k − t_k
3. Accumulate the weight update: Δw^{τ}_{ki} = Δw^{τ}_{ki} + δ_k · x_i

Interpreting the bias parameter

Derivative with respect to the bias (at the minimum):

∂E/∂w_{k0} = ∑_{n=1}^{N} ( ∑_{j=1}^{d} w_{kj} x^n_j + w_{k0} − t^n_k ) = 0

If we write:

x̄_i = (1/N) ∑_{n=1}^{N} x^n_i        t̄_k = (1/N) ∑_{n=1}^{N} t^n_k

Then we may write the solution for the bias as

w_{k0} = t̄_k − ∑_{i=1}^{d} w_{ki} x̄_i

The bias may be interpreted as compensating for the difference between the training-set mean of the targets and the training-set mean of the network outputs
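In code this interpretation is direct (a sketch of my own; `W_nobias` is a hypothetical name for the weights excluding the bias column):

```python
import numpy as np

def bias_from_means(W_nobias, X, T):
    """w_k0 = t_bar_k - sum_i w_ki * x_bar_i  for every output k.

    W_nobias : (K, d) weights excluding the bias column
    X        : (N, d) training inputs (without the x_0 = 1 column)
    T        : (N, K) training targets
    """
    x_bar = X.mean(axis=0)            # training-set mean of each input component
    t_bar = T.mean(axis=0)            # training-set mean of each target component
    return t_bar - W_nobias @ x_bar   # the bias vector (w_10, ..., w_K0)
```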

Summary

Training single-layer neural networks

Sum-of-squares error function

Gradient descent

Good coverage in Bishop, Neural Networks for Pattern Recognition (sections 3.1.3 and 3.4)
