learning concept1

TRANSCRIPT

• Slide 1/61

Supervised Learning with Neural Networks

We now look at how an agent might learn to solve a general problem by seeing examples.

Aims:

to present an outline of supervised learning as part of AI;

to introduce much of the notation and terminology used;

to introduce the classical perceptron, and to show how it can be applied more generally using kernels;

to introduce multilayer perceptrons and the backpropagation algorithm for training them.

Reading: Russell and Norvig, chapters 18 and 19.

Copyright © Sean Holden 2002-2005.

• Slide 2/61

An example

A common source of problems in AI is medical diagnosis.

Imagine that we want to automate the diagnosis of an embarrassing disease (call it D) by constructing a machine:

Measurements taken from the patient, e.g. heart rate, blood pressure, presence of green spots etc. -> Machine -> 1 if the patient suffers from D, 0 otherwise.

Could we do this by explicitly writing a program that examines the measurements and outputs a diagnosis?

Experience suggests that this is unlikely.

• Slide 3/61

An example, continued...

Let's look at an alternative approach. Each collection of measurements can be written as a vector,

x = (x1, x2, x3, ...)

where,

x1 = heart rate

x2 = blood pressure

x3 = 1 if the patient has green spots, 0 otherwise

...

and so on.

• Slide 4/61

An example, continued...

A vector of this kind contains all the measurements for a single patient and is generally called a feature vector or instance.

The measurements are usually referred to as attributes or features. (Technically, there is a difference between an attribute and a feature - but we won't have time to explore it.)

Attributes or features generally appear as one of three basic types:

continuous: x_i ∈ [L, U], where L and U are real numbers;

binary: x_i ∈ {0, 1} or x_i ∈ {-1, +1};

discrete: x_i can take one of a finite number of values, say x_i ∈ {v1, v2, ..., vp}.
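As a purely illustrative sketch (not taken from the slides), one patient's record might be encoded as follows; the attribute names and values are assumptions made up for the example.

import numpy as np

# continuous attributes: heart rate (beats/min), blood pressure (mmHg)
# binary attribute:      1 if the patient has green spots, 0 otherwise
# discrete attribute:    blood group encoded as one of {0, 1, 2, 3}
x = np.array([72.0, 118.0, 1.0, 3.0])   # one feature vector / instance

y = 1                 # label: 1 if this patient suffers from D, 0 otherwise
example = (x, y)      # a labelled example; a training sequence is a list of these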

• Slide 5/61

An example, continued...

Now imagine that we have a large collection of patient histories (m in total) and for each of these we know whether or not the patient suffered from D.

The ith patient history gives us an instance x_i.

This can be paired with a single bit, 1 or 0, denoting whether or not the ith patient suffers from D. The resulting pair (x_i, y_i) is called an example or a labelled example.

Collecting all the examples together we obtain a training sequence,

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)).

• Slide 6/61

• Slide 7/61

But what's a hypothesis?

What's a hypothesis?

Denote by X the set of all possible instances:

X = {x : x is a possible instance}.

Then a hypothesis h could simply be a function from X to {0, 1}.

In other words, h can take any instance and produce a 0 or a 1, depending on whether (according to h) the patient the measurements were taken from is suffering from D.

• Slide 8/61

But what's a hypothesis?

There are some important issues here, which are effectively right at the heart of machine learning. Most importantly: the hypothesis h can assign a 0 or a 1 to any x ∈ X.

This includes instances that did not appear in the training sequence.

The overall process therefore involves rather more than memorising the training sequence.

Ideally, the aim is that h should be able to generalize. That is, it should be possible to use it to diagnose new patients.

• Slide 9/61

But what's a hypothesis?

In fact we need a slightly more flexible definition of a hypothesis.

A hypothesis is a function from X to some suitable set Y, because:

there may be more than two classes:

Y = {No disease, Disease 1, Disease 2, Disease 3}

or, we might want to indicate how likely it is that the patient has disease D, using Y = [0, 1], where 1 denotes "definitely does have the disease" and 0 denotes "definitely does not have it".

A value close to 1 might for example denote that the patient is reasonably certain to have the disease.

• Slide 10/61

But what's a hypothesis?

One way of thinking about the previous case is in terms of probabilities:

h(x) = Pr(x is in class 1).

We may have Y = R. For example, x might contain several recent measurements of a currency exchange rate and we might want h(x) to be a prediction of what the rate will be some minutes into the future. Such problems are generally known as regression problems.

• Slide 11/61

Types of learning

The form of machine learning described is called supervised learning. This introduction will concentrate on this kind of learning. In addition, the literature also discusses:

1. Unsupervised learning.

2. Learning using membership queries and equivalence queries.

3. Reinforcement learning.

• Slide 12/61

Some further examples

Speech recognition.

Deciding whether or not to give credit.

Detecting credit card fraud.

Deciding whether to buy or sell a stock option.

Deciding whether a tumour is benign.

Data mining - that is, extracting interesting but hidden knowledge from existing, large databases. For example, databases containing financial transactions or loan applications.

Deciding whether driving conditions are dangerous.

Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles at 70 miles per hour, on a public road with other cars present, but with no assistance from humans!)

Playing games. (For example, see Tesauro, 1992 and 1995, where a world class backgammon player is described.)

• Slide 13/61

What have we got so far?

Extracting what we have so far, we get the following central ideas:

A collection X of possible instances.

A collection Y of classes to which any instance can belong. In some scenarios we might have Y = R.

A training sequence containing m labelled examples,

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m))

with x_i ∈ X and y_i ∈ Y for i = 1, ..., m.

A learning algorithm L which takes s and produces a hypothesis h. We can write,

h = L(s).

• Slide 14/61

What have we got so far?

But we need some more ideas as well:

We often need to state what kinds of hypotheses are available to L.

The collection of available hypotheses is called the hypothesis space and denoted H:

H = {h : h is available to L}.

Some learning algorithms do not always return the same h for each run on a given sequence s. In this case L(s) is a probability distribution on H.

• Slide 15/61

What's the right answer?

We may sometimes assume that there is a correct function that governs the relationship between the x's and the labels.

This is called the target concept and is denoted c. It can be regarded as the perfect hypothesis, and so we have y_i = c(x_i).

This is not always a sufficient way of thinking about the problem...

• Slide 16/61

• Slide 17/61

Generalization performance

How can generalization performance be assessed?

Model the generation of training examples using a probability distribution P on X x Y.

All examples are assumed to be independent and identically distributed (i.i.d.) according to P.

Given a hypothesis h and any example (x, y) we can introduce a measure e(h(x), y) of the error that h makes in classifying that example.

For example the following definitions for e(h(x), y) might be appropriate:

e(h(x), y) = 0 if h(x) = y, and 1 otherwise, for a classification problem;

e(h(x), y) = (y - h(x))^2 when Y = R (a regression problem).

• Slide 18/61

Generalization performance

A reasonable definition of generalization performance is then

er(h) = E[ e(h(x), y) ],

where the expectation is with respect to (x, y) drawn according to P.

In the case of the definition of e for classification problems given on the previous slide this gives

er(h) = Pr( h(x) ≠ y ).

In the case of the definition of e given for regression problems, er(h) is the expected square of the difference between true label and predicted label.
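Because er(h) is an expectation under the unknown distribution P, in practice it is usually estimated by the average error on a held-out set of i.i.d. examples. A minimal sketch follows; the names h and test_set are assumptions of the sketch, not from the slides.

def zero_one_error(prediction, label):
    # e(h(x), y) for a classification problem: 0 if correct, 1 otherwise
    return 0.0 if prediction == label else 1.0

def estimated_er(h, test_set):
    # average error over held-out (x, y) pairs approximates er(h)
    return sum(zero_one_error(h(x), y) for x, y in test_set) / len(test_set)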

• Slide 19/61

Problems encountered in practice

In practice there are various problems that can arise:

Measurements may be missing from the x vectors.

There may be noise present.

Classifications in s may be incorrect.

The practical techniques to be presented have their own approaches to dealing with such problems. Similarly, problems arising in practice are addressed by the theory.

• Slide 20/61

[Figure: block diagram of the learning scenario, with boxes for the Target Concept (NOT KNOWN), Examples, Noise, Noisy Examples (some measurements may be missing), Prior Knowledge (some is available), the Learning Algorithm, and the resulting Hypothesis.]

• Slide 21/61

The perceptron: a review

The initial development of linear discriminants was carried out by Fisher in 1936, and since then they have been central to supervised learning.

Their influence continues to be felt in the recent and ongoing development of support vector machines, of which more later...

• Slide 22/61

The perceptron: a review

We have a two-class classification problem in R^n.

We output class +1 if w^T x + w_0 > 0, or class -1 if w^T x + w_0 ≤ 0.

• Slide 23/61

The perceptron: a review

So the hypothesis is

h(x) = sgn( w^T x + w_0 ),

where

sgn(z) = +1 if z > 0, and -1 otherwise.

• Slide 24/61

The primal perceptron algorithm

w <- 0; w_0 <- 0
do
  for (each example (x_i, y_i) in s)
    if ( y_i (w^T x_i + w_0) <= 0 )
      w <- w + eta y_i x_i
      w_0 <- w_0 + eta y_i
while (mistakes are made in the for loop)
return w, w_0.
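A runnable sketch of the algorithm as reconstructed above, assuming labels y_i in {-1, +1}; the learning rate eta, the simple bias update and the epoch cap are choices made for the sketch rather than something fixed by the slides.

import numpy as np

def primal_perceptron(s, eta=1.0, max_epochs=100):
    # s is a list of (x, y) pairs with x a numpy array and y in {-1, +1};
    # returns the weight vector w and bias w_0
    n = len(s[0][0])
    w, w0 = np.zeros(n), 0.0
    for _ in range(max_epochs):              # "do ... while" loop
        mistakes = False
        for x, y in s:                       # for each example in s
            if y * (w @ x + w0) <= 0:        # example on the wrong side (or on the boundary)
                w = w + eta * y * x          # add/subtract the misclassified point
                w0 = w0 + eta * y
                mistakes = True
        if not mistakes:                     # while (mistakes are made in the for loop)
            break
    return w, w0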

• Slide 25/61

Novikoff's theorem

The perceptron algorithm does not converge if s is not linearly separable. However Novikoff proved the following:

Theorem 1. If s is non-trivial and linearly separable, where there exists a hyperplane (w_optimum, w0_optimum) with ||w_optimum|| = 1 and

y_i ( w_optimum^T x_i + w0_optimum ) >= gamma

for i = 1, ..., m, then the perceptron algorithm makes at most

( 2R / gamma )^2

mistakes, where R = max_i ||x_i||.

• Slide 26/61

Dual form of the perceptron algorithm

If we set eta = 1 then the primal perceptron algorithm operates by adding and subtracting misclassified points x_i to an initial w at each step.

As a result, when it stops we can represent the final w as

w = Σ_i alpha_i y_i x_i.

Note:

the alpha_i values are positive and proportional to the number of times x_i is misclassified;

if s is fixed then the vector (alpha_1, alpha_2, ..., alpha_m) is an alternative representation of w.

• Slide 27/61

Dual form of the perceptron algorithm

Using these facts, the hypothesis can be re-written

h(x) = sgn( Σ_i alpha_i y_i (x_i^T x) + w_0 ).
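A sketch of the dual-form training loop and hypothesis, again assuming labels in {-1, +1} and eta = 1; the bias update used here is one common choice, not something the transcript fixes. Note that the data is touched only through inner products x_i^T x.

import numpy as np

def dual_perceptron(s, max_epochs=100):
    # returns the vector alpha of misclassification counts and the bias w_0
    m = len(s)
    alpha, w0 = np.zeros(m), 0.0
    for _ in range(max_epochs):
        mistakes = False
        for j, (xj, yj) in enumerate(s):
            # the decision function uses only inner products x_i^T x_j
            f = sum(alpha[i] * yi * (xi @ xj) for i, (xi, yi) in enumerate(s)) + w0
            if yj * f <= 0:
                alpha[j] += 1
                w0 += yj
                mistakes = True
        if not mistakes:
            break
    return alpha, w0

def dual_hypothesis(x, s, alpha, w0):
    # h(x) = sgn( sum_i alpha_i y_i (x_i^T x) + w_0 )
    f = sum(alpha[i] * yi * (xi @ x) for i, (xi, yi) in enumerate(s)) + w0
    return 1 if f > 0 else -1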

• Slide 28/61

• Slide 29/61

Mapping to a bigger space

There are many problems a perceptron can't solve.

The classic example is the parity problem.

• Slide 30/61

Mapping to a bigger space

But what happens if we add another element to x?

For example we could use the product x1 x2.

[Figure: left, the parity (XOR) examples plotted against x1 and x2; right, the same examples plotted in three dimensions against x1, x2 and x1 * x2.]

In the new three-dimensional space a perceptron can easily solve this problem.
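The same idea in code, reusing the primal_perceptron sketch given after slide 24; mapping the parity labels to {-1, +1} is an assumption of the sketch.

import numpy as np

# the four parity (XOR) points, with labels mapped to {-1, +1}
xor_points = [(np.array([0., 0.]), -1),
              (np.array([0., 1.]), +1),
              (np.array([1., 0.]), +1),
              (np.array([1., 1.]), -1)]

# map each x = (x1, x2) to (x1, x2, x1*x2); the extra coordinate makes the
# problem linearly separable, so the perceptron converges
mapped = [(np.append(x, x[0] * x[1]), y) for x, y in xor_points]
w, w0 = primal_perceptron(mapped)

for x, y in mapped:
    print(y, np.sign(w @ x + w0))   # predictions agree with the labels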

• Slide 31/61

Mapping to a bigger space

More generally, we can map an instance x to a vector

Phi(x) = (phi_1(x), phi_2(x), ..., phi_k(x)),

where each phi_i is a fixed function of x.

• Slide 32/61

Mapping to a bigger space

This is an old trick, and the functions phi_i can be anything we like.

Example: In a multilayer perceptron

phi_i(x) = g( v_i^T x + b_i ),

where v_i is the vector of weights and b_i the bias associated with hidden node i, and g is an activation function.

Note however that for the time being the functions phi_i are fixed, whereas in a multilayer perceptron they are allowed to vary as a result of varying the v_i and b_i.

• Slide 33/61

Mapping to a bigger space

[Figure not captured in the transcript.]

• Slide 34/61

Mapping to a bigger space

We can now use a perceptron in the high-dimensional space, obtaining a hypothesis of the form

h(x) = sgn( Σ_i w_i phi_i(x) + w_0 ),

where the w_i are the weights from the hidden nodes to the output node.

So:

start with x;

use the phi_i to map it to a bigger space;

use a (linear) perceptron in the bigger space.
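A short sketch of this pipeline. The choice of sigmoid basis functions with randomly chosen, fixed v_i and b_i is an assumption made for illustration; the slides only require the phi_i to be fixed. The weights w and w0 could then be found by running the earlier perceptron sketch on the mapped examples.

import numpy as np

rng = np.random.default_rng(0)
k, n = 20, 2                    # k fixed basis functions on n-dimensional inputs
V = rng.normal(size=(k, n))     # fixed "hidden" weights v_i (never trained here)
b = rng.normal(size=k)          # fixed "hidden" biases b_i

def phi(x):
    # Phi(x) = (phi_1(x), ..., phi_k(x)) with phi_i(x) = g(v_i^T x + b_i),
    # where g is taken to be the sigmoid
    return 1.0 / (1.0 + np.exp(-(V @ x + b)))

def h(x, w, w0):
    # a linear perceptron applied in the bigger, k-dimensional space
    return 1 if w @ phi(x) + w0 > 0 else -1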

• Slide 35/61

Mapping to a bigger space

What happens if we use the dual form of the perceptron algorithm in this process?

We end up with a hypothesis of the form

h(x) = sgn( Σ_i alpha_i y_i Phi(x_i)^T Phi(x) + w_0 ),

where Phi(x) is the vector (phi_1(x), phi_2(x), ..., phi_k(x)).

Notice that this introduces the possibility of a tradeoff between m and k.

The sum has (potentially) become smaller.

The cost associated with this is that we may have to calculate Phi(x_i)^T Phi(x) several times.

• Slide 36/61

Kernels

This suggests that it might be useful to be able to calculate Phi(x_1)^T Phi(x_2) easily.

In fact such an observation has far-reaching consequences.

Definition 2. A kernel is a function K such that for all vectors x_1 and x_2,

K(x_1, x_2) = Phi(x_1)^T Phi(x_2).

Note that a given kernel naturally corresponds to an underlying collection of functions phi_i. Ideally, we want to make sure that the value of k does not have a great effect on the calculation of K(x_1, x_2). If this is the case then h(x) is easy to evaluate.

• Slide 37/61

Kernels: an example

We can use

K(x_1, x_2) = (x_1^T x_2)^p  or  K(x_1, x_2) = (x_1^T x_2 + c)^p

to obtain polynomial kernels.

In the latter case we have features that are monomials up to degree p, so the decision boundary obtained using K as described above will be a polynomial curve of degree p.

• Slide 38/61

Kernels: an example

[Figure: "100 examples, p = 10, c = 1". Left: the examples in the (x1, x2) plane. Right: the weighted sum value over the same plane, truncated at 1000.]

• Slide 39/61

Gradient descent

An alternative method for training a basic perceptron works as follows. We define a measure of error for a given collection of weights. For example

E(w, w_0) = Σ_i ( y_i - (w^T x_i + w_0) )^2,

where the sgn function has not been used. Modifying our notation slightly, so that the bias w_0 is absorbed into w by appending a constant input of 1 to each x, gives

E(w) = Σ_i ( y_i - w^T x_i )^2.

• Slide 40/61

Gradient descent

E(w) is parabolic and has a unique global minimum and no local minima. We therefore start with a random w and update it as follows:

w_{t+1} = w_t - eta * ( dE(w)/dw evaluated at w_t ),

where

dE(w)/dw = ( dE/dw_1, dE/dw_2, ..., dE/dw_n )

and eta is some small positive number.

The vector -dE(w)/dw tells us the direction of the steepest decrease in E(w).
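A sketch of this procedure for the squared-error measure above, with the bias absorbed into w; the learning rate, step count and random initialisation are illustrative choices, not values from the slides.

import numpy as np

def gradient_descent(s, eta=0.01, steps=1000):
    # gradient descent on E(w) = sum_i (y_i - w^T x_i)^2, with the bias
    # absorbed into w by appending a constant input of 1 to each x
    X = np.array([np.append(x, 1.0) for x, _ in s])        # rows are (x_i, 1)
    y = np.array([label for _, label in s], dtype=float)
    w = np.random.default_rng(0).normal(size=X.shape[1])   # random starting point
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ w)   # dE(w)/dw
        w = w - eta * grad                # small step against the gradient
    return w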

• Slide 41/61

Simple feedforward neural networks

We continue using the same notation as previously.

Usually, we think in terms of a training algorithm L finding a hypothesis h based on a training sequence s,

h = L(s),

and then classifying new instances x by evaluating h(x).

Usually with neural networks the training algorithm provides a vector of weights w. We therefore have a hypothesis that depends on the weight vector. This is often made explicit by writing h_w and representing the hypothesis as a mapping depending on both w and the new instance x, so

h_w(x) = classification of x.

• Slide 42/61

Backpropagation: the general case

First, let's look at the general case.

We have a completely unrestricted feedforward structure:

[Figure: an arbitrary feedforward network taking the elements of x as inputs and producing one or more outputs.]

For the time being, there may be several outputs, and no specific layering is assumed.

• Slide 43/61

Backpropagation: the general case

For each node:

[Figure: a single node j, with weighted incoming connections and an output z_j.]

w_{i->j} connects node i to node j.

a_j is the weighted sum or activation for node j.

g is the activation function.

z_j = g(a_j).

• Slide 44/61

Backpropagation: the general case

In addition, there is often a bias input for each node, which is always set to +1.

This is not always included explicitly; sometimes the bias is included by writing the weighted summation as

a_j = Σ_i w_{i->j} z_i + b_j,

where b_j is the bias for node j.

• Slide 45/61

Backpropagation: the general case

As usual we have:

instances x ∈ R^n;

a training sequence s = ((x_1, y_1), ..., (x_m, y_m)).

We also define a measure of training error

E(w) = measure of the error of the network on s,

where w is the vector of all the weights in the network.

Our aim is to find a set of weights that minimises E(w).

• Slide 46/61

Backpropagation: the general case

How can we find a set of weights that minimises E(w)?

The approach used by the backpropagation algorithm is very simple:

1. begin at step 0 with a randomly chosen collection of weights w_0;

2. at the t-th step, calculate the gradient dE(w)/dw of E(w) at the point w_t;

3. update the weight vector by taking a small step in the direction of the gradient,

w_{t+1} = w_t - eta * ( dE(w)/dw evaluated at w_t );

4. repeat this process until E(w) is sufficiently small.

• Slide 47/61

• Slide 48/61

• Slide 49/61

Backpropagation: the general case

When j is an output unit this is easy, as

delta_j = dE/da_j = ( dE/dz_j ) g'(a_j),

and the first term is in general easy to calculate for a given E.

• Slide 50/61

Backpropagation: the general case

When j is not an output unit we have

delta_j = dE/da_j = Σ_k ( dE/da_k ) ( da_k/da_j ),

where k ranges over the nodes to which node j sends a connection:

[Figure: node j feeding forward into nodes k1, k2, ..., kq.]

• Slide 51/61

Backpropagation: the general case

Then

dE/da_k = delta_k

by definition, and

da_k/da_j = ( da_k/dz_j ) ( dz_j/da_j ) = w_{j->k} g'(a_j).

So

delta_j = g'(a_j) Σ_k w_{j->k} delta_k.

• Slide 52/61

Backpropagation: the general case

Summary: to calculate dE_p(w)/dw for one pattern p:

1. Forward propagation: apply x_p and calculate the activations a_j and outputs z_j for all the nodes in the network.

2. Backpropagation 1: for outputs,

delta_j = dE_p/da_j = ( dE_p/dz_j ) g'(a_j).

3. Backpropagation 2: for other nodes,

delta_j = g'(a_j) Σ_k w_{j->k} delta_k,

where the delta_k were calculated at an earlier step.
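A runnable sketch of these three steps for a network with one hidden layer and a single output. The choices of sigmoid activations everywhere and of the per-pattern error E_p(w) = 0.5 * (y_p - z_out)^2 are assumptions made for the sketch; the general recipe above does not fix them.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_pattern(x, y, W1, b1, w2, b2):
    # W1: hidden weights (k x n), b1: hidden biases (k,),
    # w2: hidden-to-output weights (k,), b2: output bias (scalar).
    # Returns the gradients of E_p with respect to W1, b1, w2, b2.

    # 1. Forward propagation: activations a and outputs z for all nodes.
    a1 = W1 @ x + b1          # hidden activations a_j
    z1 = sigmoid(a1)          # hidden outputs z_j = g(a_j)
    a2 = w2 @ z1 + b2         # output activation
    z2 = sigmoid(a2)          # network output

    # 2. Backpropagation 1: for the output node,
    #    delta_out = (dE_p/dz_out) g'(a_out), with E_p = 0.5 (y - z_out)^2.
    delta_out = (z2 - y) * z2 * (1.0 - z2)

    # 3. Backpropagation 2: for hidden nodes,
    #    delta_j = g'(a_j) * sum_k w_{j->k} delta_k (here only one output k).
    delta_hidden = z1 * (1.0 - z1) * (w2 * delta_out)

    # gradients: dE_p/dw_{i->j} = delta_j * z_i (z_i = x_i for input nodes)
    grad_w2 = delta_out * z1
    grad_b2 = delta_out
    grad_W1 = np.outer(delta_hidden, x)
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2, grad_b2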

• Slide 53/61

Backpropagation: a specific example

[Figure: the example network, with inputs feeding a single hidden layer and one output node.]

• Slide 54/61

Backpropagation: a specific example

For the output, for the other nodes, and for the error measure E_p(w), the example's specific choices are given by equations that were not captured in the transcript.

• Slide 55/61

Backpropagation: a specific example

For the output:

We have

delta_output = dE_p/da_output = ( dE_p/dz_output ) g'(a_output),

and

dE_p(w)/dw_{j->output} = delta_output z_j,

where z_j is the output of hidden node j.

• Slide 56/61

Backpropagation: a specific example

For the hidden nodes:

We have

delta_j = g'(a_j) Σ_k w_{j->k} delta_k,

but there is only one output, so

delta_j = g'(a_j) w_{j->output} delta_output,

and we have a value for delta_output, so

dE_p(w)/dw_{i->j} = delta_j z_i = g'(a_j) w_{j->output} delta_output z_i.

• Slide 57/61

Putting it all together

We can then use the derivatives in one of two basic ways:

Batch: (as described previously) update the weights using the full gradient,

dE(w)/dw = Σ_p dE_p(w)/dw.

Sequential: using just one pattern at once,

w_{t+1} = w_t - eta * ( dE_p(w)/dw evaluated at w_t ),

selecting patterns in sequence or at random.
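A sketch of the two update styles, building on the backprop_single_pattern sketch after slide 52; params is assumed to be the tuple (W1, b1, w2, b2) from that sketch and eta a small learning rate.

import numpy as np

def batch_step(s, params, eta):
    # batch: accumulate the gradient over all patterns, then take one step
    grads = [np.zeros_like(p) for p in params]
    for x, y in s:
        for g_total, g in zip(grads, backprop_single_pattern(x, y, *params)):
            g_total += g
    return tuple(p - eta * g for p, g in zip(params, grads))

def sequential_epoch(s, params, eta, rng=np.random.default_rng(0)):
    # sequential: take a step after every single pattern, here in random order
    for idx in rng.permutation(len(s)):
        x, y = s[idx]
        grads = backprop_single_pattern(x, y, *params)
        params = tuple(p - eta * g for p, g in zip(params, grads))
    return params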

• Slide 58/61

Example: the classical parity problem

As an example we show the result of training a network with:

two inputs;

one output;

one hidden layer;

all other details as above.

The problem is the classical parity problem, with noisy examples.

The sequential approach is used, with repeated passes through the entire training sequence.

• Slide 59/61

Example: the classical parity problem

[Figure: two plots over the (x1, x2) plane, "Before training" (left) and "After training" (right).]

• Slide 60/61

Example: the classical parity problem

• Slide 61/61

Example: the classical parity problem

[Figure: "Error during training", plotted over the course of training with the horizontal axis running from 0 to 1000.]