
Page 1: Lectures 7&8: Non-linear Classification and Regression using Layered Perceptrons

Dr Martin Brown

Room: E1k

Email: [email protected]

Telephone: 0161 306 4672

http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/

[Figure: two-class scatter plot in the (x1, x2) plane; + and . samples separated by a non-linear decision boundary m(x, θ) = 0]

Page 2: Lectures 7&8: Outline

1. What approaches are possible for non-linear classification and regression problems?

2. Non-linear polynomial networks
   – Potential and problems using flexible models

3. Sigmoidal-type non-linear transformations
   – Modelling capabilities
   – Regression and classification interpretation
   – Parameter optimization using gradient descent

4. Non-linear logical functions and layered Perceptron nets

Leads on to Multi-Layer Perceptron (MLP) models next week.

Page 3: Lecture 7&8: Resources

These slides are largely self-contained, but extra background material can be found in:

Machine Learning, T Mitchell, McGraw Hill, 1997

Machine Learning, Neural and Statistical Classification, D Michie, DJ Spiegelhalter and CC Taylor, 1994: http://www.amsta.leeds.ac.uk/~charles/statlog/

In addition, there are many on-line sources for multi-layer perceptrons (MLPs) and error back propagation (EBP); just search on Google.

Advanced text:

Information Theory, Inference and Learning Algorithms, D MacKay, Cambridge University Press, 2003

Page 4: Non-Linear Regression and Classification

Most real-world modelling problems are not linear:

A task is non-linear if it cannot be represented using a linear model

Classification: the number of classification errors is too large

Regression: the noise variance is too large

Using non-linear models/relationships may help to approximate f().

[Figure: the two-class scatter plot in the (x1, x2) plane, as on Page 1]

$$y = f(\mathbf{x}) + \epsilon, \quad \epsilon \sim N(0, \sigma^2): \quad f(\mathbf{x}) \neq \mathbf{x}^T\boldsymbol{\theta}$$

Page 5: Non-Linear Classification

Consider the following 2-class classification problem

Always compare to prior error rate

Exercise: What are the error rates for prior, optimal linear and non-linear models?

Type of non-linear function is important

Data is generated by (with classification errors):

$$y = \mathrm{sgn}(f(\mathbf{x}))$$

with a binomially (Bin) distributed proportion of the class labels flipped.

[Figure: the two-class scatter plot in the (x1, x2) plane]

Page 6: Non-linear Regression

Need to balance model complexity against data accuracy

How much signal is reproducible:

$$y = f(x) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

[Figure: four panels plotting y and ŷ against x:
– bias model: ŷ = E(y) = θ₀
– linear model: ŷ = xᵀθ
– non-linear model: ŷ = m(x, θ)
– non-linear interpolation model: ŷ = m(x, θ)]

Page 7: Polynomial Non-Linear Models

A simple and convenient way to extend linear models is to consider polynomial expansions, such as the quadratic:

$$y = m(\mathbf{x}, \boldsymbol{\theta}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2$$

(bias, linear, bilinear and quadratic terms)

Expansion to any order is possible: cubic, quartic, or just a subset of terms.

A linear model is produced when: $\theta_3 = \theta_4 = \theta_5 = 0$

A polynomial model is linear in its parameters: collecting the basis terms as $\tilde{\mathbf{x}} = [1 \ x_1 \ x_2 \ x_1 x_2 \ x_1^2 \ x_2^2]^T$ gives $y = \tilde{\mathbf{x}}^T \boldsymbol{\theta}$.

Can approximate any continuous function arbitrarily closely, if a high enough polynomial expansion is used (Taylor series).
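As a concrete sketch of the expanded data matrix and the linear-in-the-parameters fit (the synthetic data and variable names here are illustrative assumptions, not from the course scripts):

```matlab
% Minimal sketch: quadratic expansion of a 2-input model, fitted by
% least squares. Data is synthetic, purely for illustration.
N = 100;
X = randn(N, 2);                                   % N 2-d input points
y = 1 + 2*X(:,1).^2 + X(:,2).^2 + 0.1*randn(N,1);  % quadratic target + noise
x1 = X(:,1); x2 = X(:,2);
Xq = [ones(N,1) x1 x2 x1.*x2 x1.^2 x2.^2];  % [bias linear bilinear quadratic]
theta = Xq \ y;                             % least squares: linear in theta
yhat  = Xq * theta;                         % model predictions
```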

Page 8: Example: Quadratic Decision Boundary

A quadratic 2-class classifier is given by:

$$y = m(\mathbf{x}, \boldsymbol{\theta}) = \mathrm{sgn}(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2)$$

This has a decision boundary given by:

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 = 0$$

a 2-dimensional ellipse.

Example of quadratic classification boundary for the Iris Setosa data

Can you modify the Perceptron simulation to work on this?

[Figure: the elliptical decision boundary separating the two classes, annotated with the estimated parameter vector θ̂]
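For illustration, a sketch of how such a boundary can be drawn in Matlab; the θ values below describe an arbitrary ellipse, not the Iris Setosa estimate:

```matlab
% Sketch: plotting a quadratic decision boundary m(x, theta) = 0.
% theta is an arbitrary illustrative choice, NOT the Iris estimate.
theta = [-1; 0; 0; 0; 1; 2];               % x1^2 + 2*x2^2 - 1 = 0
[x1, x2] = meshgrid(-2:0.05:2);
m = theta(1) + theta(2)*x1 + theta(3)*x2 + ...
    theta(4)*x1.*x2 + theta(5)*x1.^2 + theta(6)*x2.^2;
contour(x1, x2, m, [0 0]);                 % zero contour = decision boundary
```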

Page 9: Polynomial Regression “Overfitting” …

The optimal, least squares parameter estimator is given by:

$$\hat{\boldsymbol{\theta}} = (X^T X)^{-1} X^T \mathbf{y}$$

where X is the data matrix; each row represents a data point, each column is one polynomial basis term.

Which polynomial terms should be used? Polynomials are flexible but can be quite oscillatory (high frequency components), which is usually not appropriate.

Example: 20 data points, x randomly drawn from a unit variance normal distribution, y = exp(-x.^2), fitted by a fifth order polynomial.

[Figure: the 20 data points and the oscillatory fifth order polynomial fit, y and ŷ plotted against x]
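The described experiment is easy to reproduce; a Matlab sketch (my reconstruction, not the original course script):

```matlab
% Sketch of the overfitting example: 20 points, x ~ N(0,1),
% y = exp(-x.^2), fitted by a 5th order polynomial.
x = randn(20,1);
y = exp(-x.^2);
X = [ones(20,1) x x.^2 x.^3 x.^4 x.^5];   % data matrix, one basis per column
theta = (X'*X) \ (X'*y);                  % optimal least squares estimate
xs = linspace(min(x), max(x), 200)';
Xs = [ones(200,1) xs xs.^2 xs.^3 xs.^4 xs.^5];
plot(x, y, 'o', xs, Xs*theta, '-');       % oscillation visible between points
```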

Page 10: Sigmoidal Non-Linear Transformations

Let's consider another way to introduce non-linearities into a basic linear model, by producing a continuous, non-linear transformation of a weighted sum.

What sort of single input, single output functions, f(), are possible?

To estimate parameters using gradient descent, it should be differentiable.

To use for classification and regression, it should be able to represent linear and step functions, as appropriate.

[Figure: single-node network; inputs x0 = 1, x1, …, xn with weights θ0, θ1, …, θn feed a node with output y = f(u)]

$$y = f(u) = f(\mathbf{x}^T \boldsymbol{\theta}), \qquad u = \sum_i \theta_i x_i$$

Page 11: Tanh() Function

Consider the tanh() function, whose output lies in (-1, 1)

When there is a single input: u = θ₀ + θ₁x₁

When θ₁ is large (θ₁ = 4): almost a step function

When θ₁ is small (θ₁ = 0.25): almost a linear relationship

θ₀ shifts tanh() horizontally

$$y = f(u) = \tanh(u) = \frac{1 - \exp(-2u)}{1 + \exp(-2u)}$$

[Figure: f(u) plotted against u, for θ₁ small (near-linear) and θ₁ large (near-step)]
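A short Matlab sketch of the two regimes; θ₁ = 4 and θ₁ = 0.25 are the slide's values, the plotting range is an arbitrary choice:

```matlab
% tanh(theta0 + theta1*x) for large and small theta1, with theta0 = 0
x = linspace(-5, 5, 200);
plot(x, tanh(0 + 4*x),    '-', ...   % theta1 = 4: almost a step function
     x, tanh(0 + 0.25*x), '--');     % theta1 = 0.25: almost linear near 0
```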

Page 12: Tanh Function in 2D X-Space

Such functions are often known as ridge functions, because they are constant along lines in input space: u = xᵀθ = c

$$y = \tanh(u), \qquad u = \theta_0 + \theta_1 x_1 + \theta_2 x_2, \qquad \boldsymbol{\theta} = [0, 1, 1]^T$$

[Figure: surface plot of y = tanh(x₁ + x₂) over the (x1, x2) plane]
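A small Matlab sketch of this ridge structure, using the slide's θ = [0, 1, 1]ᵀ (the grid range is an arbitrary choice):

```matlab
% Ridge function from the slide: y = tanh(x1 + x2), theta = [0 1 1]'
[x1, x2] = meshgrid(-3:0.1:3);
y = tanh(0 + 1*x1 + 1*x2);     % constant along lines x1 + x2 = c
surf(x1, x2, y);               % the "ridge" runs along x1 + x2 = 0
```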

Page 13: 0-1 Sigmoid

Many books/notes use the following sigmoid function:

$$y = \mathrm{sig}(u) = \frac{1}{1 + \exp(-u)}$$

which has an output lying in the range (0, 1).

In these notes, we'll refer to both transformation functions as sigmoidal functions, because of their “lazy S” shape. In fact, they're just transformations of each other:

$$\tanh(u) = 2\,\mathrm{sig}(2u) - 1$$
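A one-line numerical check of this identity (the formal proof is left to Laboratory (ii)):

```matlab
% Verify tanh(u) = 2*sig(2u) - 1 numerically on a grid of points
u = linspace(-4, 4, 101);
sig = @(v) 1 ./ (1 + exp(-v));
max(abs(tanh(u) - (2*sig(2*u) - 1)))   % ~machine precision: forms agree
```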

Page 14: Sigmoidal Parameter Estimation

Gradient descent update for a single training datum, where $\hat{y} = f(u)$ and $u = \mathbf{x}^T\boldsymbol{\theta}$.

For the ith training pattern:

$$p_i = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2$$

Using the chain rule:

$$\frac{\partial p_i}{\partial \boldsymbol{\theta}} = \frac{\partial p_i}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial u_i}\,\frac{\partial u_i}{\partial \boldsymbol{\theta}} = -(y_i - \hat{y}_i)\,f'(u_i)\,\mathbf{x}_i$$

Giving an update rule:

$$\boldsymbol{\theta}^{k+1} = \boldsymbol{\theta}^{k} + \eta\,(y_i - \hat{y}_i)\,f'(u_i)\,\mathbf{x}_i$$

Similar to the LMS rule, apart from the extra sigmoidal derivative term, f’().
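A minimal Matlab sketch of this update rule for a single tanh node, trained here on the logical AND data; the learning rate, epoch count and zero initialization are illustrative assumptions (Laboratory (i) asks for the full implementation):

```matlab
% Sketch: sigmoidal (tanh) Perceptron trained by gradient descent,
% parameters updated after each presentation of a datum.
X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];  % AND data, bias column of ones
y = [-1; -1; -1; 1];               % targets in {-1, +1}
eta = 0.5; theta = zeros(3,1);
for epoch = 1:1000
  for i = 1:4
    u     = X(i,:) * theta;        % weighted sum u = x' * theta
    yhat  = tanh(u);               % sigmoidal output
    fdash = 1 - yhat^2;            % f'(u) = 1 - tanh(u)^2 (see Page 15)
    theta = theta + eta * (y(i) - yhat) * fdash * X(i,:)';
  end
end
sign(tanh(X * theta))              % should reproduce the AND targets
```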

Page 15: Sigmoidal Parameter Estimation (ii)

Sigmoidal function’s derivative (tanh):

$$f(u) = \tanh(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$$

$$\frac{df}{du} = \frac{(e^{u} + e^{-u})^2 - (e^{u} - e^{-u})^2}{(e^{u} + e^{-u})^2} = \frac{4}{(e^{u} + e^{-u})^2} = 1 - \tanh^2(u) = 1 - y^2 \in (0, 1]$$

[Figure: f(u) and df/du plotted against u; the derivative peaks at 1 when u = 0 and decays towards 0 as |u| grows]
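A quick finite-difference sanity check of this derivative in Matlab (step size and grid are arbitrary choices):

```matlab
% Finite-difference check of d/du tanh(u) = 1 - tanh(u).^2
u  = linspace(-3, 3, 61);
h  = 1e-6;
fd = (tanh(u + h) - tanh(u - h)) / (2*h);    % central difference
max(abs(fd - (1 - tanh(u).^2)))              % ~1e-10: the forms agree
```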

Page 16: Layered Perceptron Networks

In this section, we’re going to consider how these sigmoidal nodes can be connected together into layers to give greater/more flexible non-linear modelling behaviour

Two central questions:

1. What are the non-linear modelling capabilities?

2. How to estimate the non-linear parameters?

[Figure: two-layer network; inputs x0, x1, x2 feed hidden nodes h0, h1, h2, which feed the output y]

Page 17: Linearly Separable 2D Logical Functions

Note class output values of 0 and 1 in next few slides

[Figure: AND, OR and NOT plotted on the unit square/interval, with the class output marked at each input combination and an estimated parameter vector θ̂ alongside each plot]

Page 18: Nonlinearly Separable 2D XOR

eXclusive OR (XOR), i.e. n-bit parity with 2 inputs.

Data generated by: y = (NOT x2 AND x1) OR (NOT x1 AND x2)

Non-linear, polynomial input transformations:

x3 = x1*x2 makes the problem separable

How can multi-layer networks solve this?

[Figure: XOR on the unit square; corners (0,1) and (1,0) are class +1, corners (0,0) and (1,1) are class -1, which no single line can separate]
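A Matlab sketch of this trick; the XOR labels and the x3 = x1*x2 feature follow the slide, while the least squares fit is my own illustration:

```matlab
% XOR made linearly separable by adding the product feature x3 = x1*x2
X  = [0 0; 0 1; 1 0; 1 1];
y  = [-1; 1; 1; -1];                     % XOR class labels
Xp = [ones(4,1) X X(:,1).*X(:,2)];       % [1 x1 x2 x1*x2]
theta = Xp \ y;                          % exact fit in the expanded space
sign(Xp * theta)                         % reproduces the XOR labels
```

The fit recovers sign-correct outputs at all four corners, confirming that the product term makes XOR linearly separable in the expanded space.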

Page 19: Multi-Layer Network for 2D XOR

Can be implemented as a two-layer network (two layers of adjustable parameters) with two “hidden nodes” in the hidden layer:
– Empty circles represent linear Perceptron nodes
– Solid circles represent real signals
– Arrows represent model parameters

(NOT x2 AND x1) OR (NOT x1 AND x2) is represented in a 2-layer network as:

h1: (NOT x2 AND x1)

h2: (NOT x1 AND x2), y = h1 OR h2

[Figure: two-layer network; inputs x0 = 1, x1, x2 feed hidden nodes h1, h2 (with bias h0 = 1) in the hidden layer, which feed the output layer node y]

Page 20: Exercise: Determine the 9 Parameters

Write down the parameter vectors for the 3 Perceptron nodes:

h1: (NOT x2 AND x1)

h2: (NOT x1 AND x2)

y: h1 OR h2

[Figure: the same two-layer network, labelling the unknown hidden-layer parameter vectors θ_h1 and θ_h2 and the output-layer node y]

Page 21: Logical Functions and DNF

Any logical function can be expressed as the union of “negation and conjunction” terms, and so can be realized with a 2-layer Perceptron network:

Each hidden layer unit responds to exactly one positive example.

The output layer is formed from the union of the hidden layer outputs:

f = h1 OR h2 OR … OR hP

Each data point/positive example is given its own “hidden unit”, which responds to only that point.

Essentially, it memorizes the positive training samples.

Page 22: Lecture 7&8: Conclusions

There are many ways to build and use non-linear models for classification and regression purposes:
– Potentially get more accurate predictions/fewer errors if the data is generated by a non-linear relationship
– Parameter estimation is sometimes more complex:
  • No direct, optimal parameter calculation
  • Gradient-based estimation has local minima and differing curvatures
– Need to select an appropriate non-linear framework

Multi-layer (sigmoidal) Perceptrons are one such framework:
• Non-linearity controlled by nodes in the hidden layer
• Parameters estimated using gradient descent
• Several factors need to be considered

Page 23: Lecture 7&8: Laboratory (i)

Matlab:

Extend the basic Perceptron Matlab script so that it now trains up a quadratic classifier (note that the plotting routines will no longer be appropriate).

Implement the sigmoidal Perceptron learning algorithm, where the model consists of a single layer with a tanh activation function and the parameters are updated after each presentation of a datum (see Slides 10-14).

Test the algorithm on the logical AND and logical OR data, as you did for the normal Perceptron algorithm in the laboratory in IS2.ppt.

What are the similarities/differences of this model compared to the normal Perceptron algorithm described in IS2.ppt?

Page 24: Lecture 7&8: Laboratory (ii)

Theory:

Prove the relationship on Slide 13 between the two types of sigmoids

Verify the derivative of the tanh function on Slide 15, and prove that the derivative of the (0,1) sigmoid on Slide 13 can be expressed as y(1-y)

Calculate the optimal parameter values missing on Slides 17 and 20.

Derive a generic rule for setting the parameter values on Slide 21 for an arbitrary logical function. You may assume that you know the number of positive examples, the number of features and the logical structure of each positive example