learning concept1

TRANSCRIPT

• Slide 1/61

Supervised Learning with Neural Networks

We now look at how an agent might learn to solve a general problem by seeing examples.

Aims:

to present an outline of supervised learning as part of AI;

to introduce much of the notation and terminology used;

to introduce the classical perceptron, and to show how it can be applied more generally using kernels;

to introduce multilayer perceptrons and the backpropagation algorithm for training them.

Reading: Russell and Norvig, chapters 18 and 19.

Copyright © Sean Holden 2002-2005.

• Slide 2/61

An example

A common source of problems in AI is medical diagnosis.

Imagine that we want to automate the diagnosis of an embarrassing disease (call it D) by constructing a machine:

Measurements taken from the patient, e.g. heart rate, blood pressure, presence of green spots etc. -> Machine -> 1 if the patient suffers from D, 0 otherwise.

Could we do this by explicitly writing a program that examines the measurements and outputs a diagnosis?

Experience suggests that this is unlikely.

• Slide 3/61

An example, continued...

Let's look at an alternative approach. Each collection of measurements can be written as a vector,

x = (x1, x2, x3, ...)

where,

x1 = heart rate

x2 = blood pressure

x3 = 1 if the patient has green spots, 0 otherwise

...

and so on.

• Slide 4/61

An example, continued...

A vector of this kind contains all the measurements for a single patient and is generally called a feature vector or instance.

The measurements are usually referred to as attributes or features. (Technically, there is a difference between an attribute and a feature - but we won't have time to explore it.)

Attributes or features generally appear as one of three basic types:

continuous: x_i ∈ [L, U], where L and U are real numbers;

binary: x_i ∈ {0, 1} or x_i ∈ {-1, +1};

discrete: x_i can take one of a finite number of values, say x_i ∈ {v1, v2, ..., vp}.
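As a purely illustrative sketch (not taken from the slides), one patient's record might be encoded as follows; the attribute names and values are assumptions made up for the example.

import numpy as np

# continuous attributes: heart rate (beats/min), blood pressure (mmHg)
# binary attribute:      1 if the patient has green spots, 0 otherwise
# discrete attribute:    blood group encoded as one of {0, 1, 2, 3}
x = np.array([72.0, 118.0, 1.0, 3.0])   # one feature vector / instance

y = 1                 # label: 1 if this patient suffers from D, 0 otherwise
example = (x, y)      # a labelled example; a training sequence is a list of these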

• Slide 5/61

An example, continued...

Now imagine that we have a large collection of patient histories (m in total) and for each of these we know whether or not the patient suffered from D.

The ith patient history gives us an instance x_i.

This can be paired with a single bit, 1 or 0, denoting whether or not the ith patient suffers from D. The resulting pair (x_i, y_i) is called an example or a labelled example.

Collecting all the examples together we obtain a training sequence,

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)).

• Slide 6/61

• Slide 7/61

But what's a hypothesis?

What's a hypothesis?

Denote by X the set of all possible instances:

X = {x : x is a possible instance}.

Then a hypothesis h could simply be a function from X to {0, 1}.

In other words, h can take any instance and produce a 0 or a 1, depending on whether (according to h) the patient the measurements were taken from is suffering from D.

• Slide 8/61

But what's a hypothesis?

There are some important issues here, which are effectively right at the heart of machine learning. Most importantly: the hypothesis h can assign a 0 or a 1 to any x ∈ X.

This includes instances that did not appear in the training sequence.

The overall process therefore involves rather more than memorising the training sequence.

Ideally, the aim is that h should be able to generalize. That is, it should be possible to use it to diagnose new patients.

• Slide 9/61

But what's a hypothesis?

In fact we need a slightly more flexible definition of a hypothesis.

A hypothesis is a function from X to some suitable set Y, because:

there may be more than two classes:

Y = {No disease, Disease 1, Disease 2, Disease 3}

or, we might want to indicate how likely it is that the patient has disease D, using Y = [0, 1], where 1 denotes "definitely does have the disease" and 0 denotes "definitely does not have it".

A value close to 1 might for example denote that the patient is reasonably certain to have the disease.

• Slide 10/61

But what's a hypothesis?

One way of thinking about the previous case is in terms of probabilities:

h(x) = Pr(x is in class 1).

We may have Y = R. For example, x might contain several recent measurements of a currency exchange rate and we might want h(x) to be a prediction of what the rate will be some minutes into the future. Such problems are generally known as regression problems.

• Slide 11/61

Types of learning

The form of machine learning described is called supervised learning. This introduction will concentrate on this kind of learning. In addition, the literature also discusses:

1. Unsupervised learning.

2. Learning using membership queries and equivalence queries.

3. Reinforcement learning.

• Slide 12/61

Some further examples

Speech recognition.

Deciding whether or not to give credit.

Detecting credit card fraud.

Deciding whether to buy or sell a stock option.

Deciding whether a tumour is benign.

Data mining - that is, extracting interesting but hidden knowledge from existing, large databases. For example, databases containing financial transactions or loan applications.

Deciding whether driving conditions are dangerous.

Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles at 70 miles per hour, on a public road with other cars present, but with no assistance from humans!)

Playing games. (For example, see Tesauro, 1992 and 1995, where a world class backgammon player is described.)

• Slide 13/61

What have we got so far?

Extracting what we have so far, we get the following central ideas:

A collection X of possible instances.

A collection Y of classes to which any instance can belong. In some scenarios we might have Y = R.

A training sequence containing m labelled examples,

s = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m))

with x_i ∈ X and y_i ∈ Y for i = 1, ..., m.

A learning algorithm L which takes s and produces a hypothesis h. We can write,

h = L(s).

• Slide 14/61

What have we got so far?

But we need some more ideas as well:

We often need to state what kinds of hypotheses are available to L.

The collection of available hypotheses is called the hypothesis space and denoted H:

H = {h : h is available to L}.

Some learning algorithms do not always return the same h for each run on a given sequence s. In this case L(s) is a probability distribution on H.

• Slide 15/61

What's the right answer?

We may sometimes assume that there is a correct function that governs the relationship between the x's and the labels.

This is called the target concept and is denoted c. It can be regarded as the perfect hypothesis, and so we have y_i = c(x_i).

This is not always a sufficient way of thinking about the problem...

• Slide 16/61

• Slide 17/61

Generalization performance

How can generalization performance be assessed?

Model the generation of training examples using a probability distribution P on X x Y.

All examples are assumed to be independent and identically distributed (i.i.d.) according to P.

Given a hypothesis h and any example (x, y) we can introduce a measure e(h(x), y) of the error that h makes in classifying that example.

For example the following definitions for e(h(x), y) might be appropriate:

e(h(x), y) = 0 if h(x) = y, and 1 otherwise, for a classification problem;

e(h(x), y) = (y - h(x))^2 when Y = R (a regression problem).

• Slide 18/61

Generalization performance

A reasonable definition of generalization performance is then

er(h) = E[ e(h(x), y) ],

where the expectation is with respect to (x, y) drawn according to P.

In the case of the definition of e for classification problems given on the previous slide this gives

er(h) = Pr( h(x) ≠ y ).

In the case of the definition of e given for regression problems, er(h) is the expected square of the difference between true label and predicted label.
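Because er(h) is an expectation under the unknown distribution P, in practice it is usually estimated by the average error on a held-out set of i.i.d. examples. A minimal sketch follows; the names h and test_set are assumptions of the sketch, not from the slides.

def zero_one_error(prediction, label):
    # e(h(x), y) for a classification problem: 0 if correct, 1 otherwise
    return 0.0 if prediction == label else 1.0

def estimated_er(h, test_set):
    # average error over held-out (x, y) pairs approximates er(h)
    return sum(zero_one_error(h(x), y) for x, y in test_set) / len(test_set)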

• Slide 19/61

Problems encountered in practice

In practice there are various problems that can arise:

Measurements may be missing from the x vectors.

There may be noise present.

Classifications in s may be incorrect.

The practical techniques to be presented have their own approaches to dealing with such problems. Similarly, problems arising in practice are addressed by the theory.

• Slide 20/61

[Figure: block diagram of the learning scenario, with boxes for the Target Concept (NOT KNOWN), Examples, Noise, Noisy Examples (some measurements may be missing), Prior Knowledge (some is available), the Learning Algorithm, and the resulting Hypothesis.]

• Slide 21/61

The perceptron: a review

The initial development of linear discriminants was carried out by Fisher in 1936, and since then they have been central to supervised learning.

Their influence continues to be felt in the recent and ongoing development of support vector machines, of which more later...

• Slide 22/61

The perceptron: a review

We have a two-class classification problem in R^n.

We output class +1 if w^T x + w_0 > 0, or class -1 if w^T x + w_0 ≤ 0.

• Slide 23/61

The perceptron: a review

So the hypothesis is

h(x) = sgn( w^T x + w_0 ),

where

sgn(z) = +1 if z > 0, and -1 otherwise.

• Slide 24/61

The primal perceptron algorithm

w <- 0; w_0 <- 0
do
  for (each example (x_i, y_i) in s)
    if ( y_i (w^T x_i + w_0) <= 0 )
      w <- w + eta y_i x_i
      w_0 <- w_0 + eta y_i
while (mistakes are made in the for loop)
return w, w_0.
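A runnable sketch of the algorithm as reconstructed above, assuming labels y_i in {-1, +1}; the learning rate eta, the simple bias update and the epoch cap are choices made for the sketch rather than something fixed by the slides.

import numpy as np

def primal_perceptron(s, eta=1.0, max_epochs=100):
    # s is a list of (x, y) pairs with x a numpy array and y in {-1, +1};
    # returns the weight vector w and bias w_0
    n = len(s[0][0])
    w, w0 = np.zeros(n), 0.0
    for _ in range(max_epochs):              # "do ... while" loop
        mistakes = False
        for x, y in s:                       # for each example in s
            if y * (w @ x + w0) <= 0:        # example on the wrong side (or on the boundary)
                w = w + eta * y * x          # add/subtract the misclassified point
                w0 = w0 + eta * y
                mistakes = True
        if not mistakes:                     # while (mistakes are made in the for loop)
            break
    return w, w0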

• Slide 25/61

Novikoff's theorem

The perceptron algorithm does not converge if s is not linearly separable. However Novikoff proved the following:

Theorem 1. If s is non-trivial and linearly separable, where there exists a hyperplane (w_optimum, w0_optimum) with ||w_optimum|| = 1 and

y_i ( w_optimum^T x_i + w0_optimum ) >= gamma

for i = 1, ..., m, then the perceptron algorithm makes at most

( 2R / gamma )^2

mistakes, where R = max_i ||x_i||.

• Slide 26/61

Dual form of the perceptron algorithm

If we set eta = 1 then the primal perceptron algorithm operates by adding and subtracting misclassified points x_i to an initial w at each step.

As a result, when it stops we can represent the final w as

w = Σ_i alpha_i y_i x_i.

Note:

the alpha_i values are positive and proportional to the number of times x_i is misclassified;

if s is fixed then the vector (alpha_1, alpha_2, ..., alpha_m) is an alternative representation of w.

• Slide 27/61

Dual form of the perceptron algorithm

Using these facts, the hypothesis can be re-written

h(x) = sgn( Σ_i alpha_i y_i (x_i^T x) + w_0 ).
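A sketch of the dual-form training loop and hypothesis, again assuming labels in {-1, +1} and eta = 1; the bias update used here is one common choice, not something the transcript fixes. Note that the data is touched only through inner products x_i^T x.

import numpy as np

def dual_perceptron(s, max_epochs=100):
    # returns the vector alpha of misclassification counts and the bias w_0
    m = len(s)
    alpha, w0 = np.zeros(m), 0.0
    for _ in range(max_epochs):
        mistakes = False
        for j, (xj, yj) in enumerate(s):
            # the decision function uses only inner products x_i^T x_j
            f = sum(alpha[i] * yi * (xi @ xj) for i, (xi, yi) in enumerate(s)) + w0
            if yj * f <= 0:
                alpha[j] += 1
                w0 += yj
                mistakes = True
        if not mistakes:
            break
    return alpha, w0

def dual_hypothesis(x, s, alpha, w0):
    # h(x) = sgn( sum_i alpha_i y_i (x_i^T x) + w_0 )
    f = sum(alpha[i] * yi * (xi @ x) for i, (xi, yi) in enumerate(s)) + w0
    return 1 if f > 0 else -1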

• Slide 28/61

• Slide 29/61

Mapping to a bigger space

There are many problems a perceptron can't solve.

The classic example is the parity problem.

• Slide 30/61

Mapping to a bigger space

But what happens if we add another element to x?

For example we could use the product x1 x2.

[Figure: left, the parity (XOR) examples plotted against x1 and x2; right, the same examples plotted in three dimensions against x1, x2 and x1 * x2.]

In the new three-dimensional space a perceptron can easily solve this problem.
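The same idea in code, reusing the primal_perceptron sketch given after slide 24; mapping the parity labels to {-1, +1} is an assumption of the sketch.

import numpy as np

# the four parity (XOR) points, with labels mapped to {-1, +1}
xor_points = [(np.array([0., 0.]), -1),
              (np.array([0., 1.]), +1),
              (np.array([1., 0.]), +1),
              (np.array([1., 1.]), -1)]

# map each x = (x1, x2) to (x1, x2, x1*x2); the extra coordinate makes the
# problem linearly separable, so the perceptron converges
mapped = [(np.append(x, x[0] * x[1]), y) for x, y in xor_points]
w, w0 = primal_perceptron(mapped)

for x, y in mapped:
    print(y, np.sign(w @ x + w0))   # predictions agree with the labels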

• Slide 31/61

Mapping to a bigger space

More generally, we can map an instance x to a vector

Phi(x) = (phi_1(x), phi_2(x), ..., phi_k(x)),

where each phi_i is a fixed function of x.

• Slide 32/61

Mapping to a bigger space

This is an old trick, and the functions phi_i can be anything we like.

Example: In a multilayer perceptron

phi_i(x) = g( v_i^T x + b_i ),

where v_i is the vector of weights and b_i the bias associated with hidden node i, and g is an activation function.

Note however that for the time being the functions phi_i are fixed, whereas in a multilayer perceptron they are allowed to vary as a result of varying the v_i and b_i.

• Slide 33/61

Mapping to a bigger space

[Figure not captured in the transcript.]

• Slide 34/61

Mapping to a bigger space

We can now use a perceptron in the high-dimensional space, obtaining a hypothesis of the form

h(x) = sgn( Σ_i w_i phi_i(x) + w_0 ),

where the w_i are the weights from the hidden nodes to the output node.

So:

start with x;

use the phi_i to map it to a bigger space;

use a (linear) perceptron in the bigger space.
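A short sketch of this pipeline. The choice of sigmoid basis functions with randomly chosen, fixed v_i and b_i is an assumption made for illustration; the slides only require the phi_i to be fixed. The weights w and w0 could then be found by running the earlier perceptron sketch on the mapped examples.

import numpy as np

rng = np.random.default_rng(0)
k, n = 20, 2                    # k fixed basis functions on n-dimensional inputs
V = rng.normal(size=(k, n))     # fixed "hidden" weights v_i (never trained here)
b = rng.normal(size=k)          # fixed "hidden" biases b_i

def phi(x):
    # Phi(x) = (phi_1(x), ..., phi_k(x)) with phi_i(x) = g(v_i^T x + b_i),
    # where g is taken to be the sigmoid
    return 1.0 / (1.0 + np.exp(-(V @ x + b)))

def h(x, w, w0):
    # a linear perceptron applied in the bigger, k-dimensional space
    return 1 if w @ phi(x) + w0 > 0 else -1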

• Slide 35/61

Mapping to a bigger space

What happens if we use the dual form of the perceptron algorithm in this process?

We end up with a hypothesis of the form

h(x) = sgn( Σ_i alpha_i y_i Phi(x_i)^T Phi(x) + w_0 ),

where Phi(x) is the vector (phi_1(x), phi_2(x), ..., phi_k(x)).

Notice that this introduces the possibility of a tradeoff between m and k.

The sum has (potentially) become smaller.

The cost associated with this is that we may have to calculate Phi(x_i)^T Phi(x) several times.

• Slide 36/61

Kernels

This suggests that it might be useful to be able to calculate Phi(x_1)^T Phi(x_2) easily.

In fact such an observation has far-reaching consequences.

Definition 2. A kernel is a function K such that for all vectors x_1 and x_2,

K(x_1, x_2) = Phi(x_1)^T Phi(x_2).

Note that a given kernel naturally corresponds to an underlying collection of functions phi_i. Ideally, we want to make sure that the value of k does not have a great effect on the calculation of K(x_1, x_2). If this is the case then h(x) is easy to evaluate.

• Slide 37/61

Kernels: an example

We can use

K(x_1, x_2) = (x_1^T x_2)^p  or  K(x_1, x_2) = (x_1^T x_2 + c)^p

to obtain polynomial kernels.

In the latter case we have features that are monomials up to degree p, so the decision boundary obtained using K as described above will be a polynomial curve of degree p.

• Slide 38/61

Kernels: an example

[Figure: "100 examples, p = 10, c = 1". Left: the examples in the (x1, x2) plane. Right: the weighted sum value over the same plane, truncated at 1000.]

• Slide 39/61

Gradient descent

An alternative method for training a basic perceptron works as follows. We define a measure of error for a given collection of weights. For example

E(w, w_0) = Σ_i ( y_i - (w^T x_i + w_0) )^2,

where the sgn function has not been used. Modifying our notation slightly, so that the bias w_0 is absorbed into w by appending a constant input of 1 to each x, gives

E(w) = Σ_i ( y_i - w^T x_i )^2.

• Slide 40/61

Gradient descent

E(w) is parabolic and has a unique global minimum and no local minima. We therefore start with a random w and update it as follows:

w_{t+1} = w_t - eta * ( dE(w)/dw evaluated at w_t ),

where

dE(w)/dw = ( dE/dw_1, dE/dw_2, ..., dE/dw_n )

and eta is some small positive number.

The vector -dE(w)/dw tells us the direction of the steepest decrease in E(w).
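A sketch of this procedure for the squared-error measure above, with the bias absorbed into w; the learning rate, step count and random initialisation are illustrative choices, not values from the slides.

import numpy as np

def gradient_descent(s, eta=0.01, steps=1000):
    # gradient descent on E(w) = sum_i (y_i - w^T x_i)^2, with the bias
    # absorbed into w by appending a constant input of 1 to each x
    X = np.array([np.append(x, 1.0) for x, _ in s])        # rows are (x_i, 1)
    y = np.array([label for _, label in s], dtype=float)
    w = np.random.default_rng(0).normal(size=X.shape[1])   # random starting point
    for _ in range(steps):
        grad = -2.0 * X.T @ (y - X @ w)   # dE(w)/dw
        w = w - eta * grad                # small step against the gradient
    return w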

• Slide 41/61

Simple feedforward neural networks

We continue using the same notation as previously.

Usually, we think in terms of a training algorithm L finding a hypothesis h based on a training sequence s,

h = L(s),

and then classifying new instances x by evaluating h(x).

Usually with neural networks the training algorithm provides a vector of weights w. We therefore have a hypothesis that depends on the weight vector. This is often made explicit by writing h_w and representing the hypothesis as a mapping depending on both w and the new instance x, so

h_w(x) = classification of x.

• Slide 42/61

Backpropagation: the general case

First, let's look at the general case.

We have a completely unrestricted feedforward structure:

[Figure: an arbitrary feedforward network taking the elements of x as inputs and producing one or more outputs.]

For the time being, there may be several outputs, and no specific layering is assumed.

• Slide 43/61

Backpropagation: the general case

For each node:

[Figure: a single node j, with weighted incoming connections and an output z_j.]

w_{i->j} connects node i to node j.

a_j is the weighted sum or activation for node j.

g is the activation function.

z_j = g(a_j).

• Slide 44/61

Backpropagation: the general case

In addition, there is often a bias input for each node, which is always set to +1.

This is not always included explicitly; sometimes the bias is included by writing the weighted summation as

a_j = Σ_i w_{i->j} z_i + b_j,

where b_j is the bias for node j.

• Slide 45/61

Backpropagation: the general case

As usual we have:

instances x ∈ R^n;

a training sequence s = ((x_1, y_1), ..., (x_m, y_m)).

We also define a measure of training error

E(w) = measure of the error of the network on s,

where w is the vector of all the weights in the network.

Our aim is to find a set of weights that minimises E(w).

• Slide 46/61

Backpropagation: the general case

How can we find a set of weights that minimises E(w)?

The approach used by the backpropagation algorithm is very simple:

1. begin at step 0 with a randomly chosen collection of weights w_0;

2. at the t-th step, calculate the gradient dE(w)/dw of E(w) at the point w_t;

3. update the weight vector by taking a small step in the direction of the gradient,

w_{t+1} = w_t - eta * ( dE(w)/dw evaluated at w_t );

4. repeat this process until E(w) is sufficiently small.

• Slide 47/61

• Slide 48/61

• Slide 49/61

Backpropagation: the general case

When j is an output unit this is easy, as

delta_j = dE/da_j = ( dE/dz_j ) g'(a_j),

and the first term is in general easy to calculate for a given E.

• Slide 50/61

Backpropagation: the general case

When j is not an output unit we have

delta_j = dE/da_j = Σ_k ( dE/da_k ) ( da_k/da_j ),

where k ranges over the nodes to which node j sends a connection:

[Figure: node j feeding forward into nodes k1, k2, ..., kq.]

• Slide 51/61

Backpropagation: the general case

Then

dE/da_k = delta_k

by definition, and

da_k/da_j = ( da_k/dz_j ) ( dz_j/da_j ) = w_{j->k} g'(a_j).

So

delta_j = g'(a_j) Σ_k w_{j->k} delta_k.

• Slide 52/61

Backpropagation: the general case

Summary: to calculate dE_p(w)/dw for one pattern p:

1. Forward propagation: apply x_p and calculate the activations a_j and outputs z_j for all the nodes in the network.

2. Backpropagation 1: for outputs,

delta_j = dE_p/da_j = ( dE_p/dz_j ) g'(a_j).

3. Backpropagation 2: for other nodes,

delta_j = g'(a_j) Σ_k w_{j->k} delta_k,

where the delta_k were calculated at an earlier step.
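A runnable sketch of these three steps for a network with one hidden layer and a single output. The choices of sigmoid activations everywhere and of the per-pattern error E_p(w) = 0.5 * (y_p - z_out)^2 are assumptions made for the sketch; the general recipe above does not fix them.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single_pattern(x, y, W1, b1, w2, b2):
    # W1: hidden weights (k x n), b1: hidden biases (k,),
    # w2: hidden-to-output weights (k,), b2: output bias (scalar).
    # Returns the gradients of E_p with respect to W1, b1, w2, b2.

    # 1. Forward propagation: activations a and outputs z for all nodes.
    a1 = W1 @ x + b1          # hidden activations a_j
    z1 = sigmoid(a1)          # hidden outputs z_j = g(a_j)
    a2 = w2 @ z1 + b2         # output activation
    z2 = sigmoid(a2)          # network output

    # 2. Backpropagation 1: for the output node,
    #    delta_out = (dE_p/dz_out) g'(a_out), with E_p = 0.5 (y - z_out)^2.
    delta_out = (z2 - y) * z2 * (1.0 - z2)

    # 3. Backpropagation 2: for hidden nodes,
    #    delta_j = g'(a_j) * sum_k w_{j->k} delta_k (here only one output k).
    delta_hidden = z1 * (1.0 - z1) * (w2 * delta_out)

    # gradients: dE_p/dw_{i->j} = delta_j * z_i (z_i = x_i for input nodes)
    grad_w2 = delta_out * z1
    grad_b2 = delta_out
    grad_W1 = np.outer(delta_hidden, x)
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2, grad_b2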

• Slide 53/61

Backpropagation: a specific example

[Figure: the example network, with inputs feeding a single hidden layer and one output node.]

• Slide 54/61

Backpropagation: a specific example

For the output, for the other nodes, and for the error measure E_p(w), the example's specific choices are given by equations that were not captured in the transcript.

• Slide 55/61

Backpropagation: a specific example

For the output:

We have

delta_output = dE_p/da_output = ( dE_p/dz_output ) g'(a_output),

and

dE_p(w)/dw_{j->output} = delta_output z_j,

where z_j is the output of hidden node j.

• Slide 56/61

Backpropagation: a specific example

For the hidden nodes:

We have

delta_j = g'(a_j) Σ_k w_{j->k} delta_k,

but there is only one output, so

delta_j = g'(a_j) w_{j->output} delta_output,

and we have a value for delta_output, so

dE_p(w)/dw_{i->j} = delta_j z_i = g'(a_j) w_{j->output} delta_output z_i.

• Slide 57/61

Putting it all together

We can then use the derivatives in one of two basic ways:

Batch: (as described previously) update the weights using the full gradient,

dE(w)/dw = Σ_p dE_p(w)/dw.

Sequential: using just one pattern at once,

w_{t+1} = w_t - eta * ( dE_p(w)/dw evaluated at w_t ),

selecting patterns in sequence or at random.
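A sketch of the two update styles, building on the backprop_single_pattern sketch after slide 52; params is assumed to be the tuple (W1, b1, w2, b2) from that sketch and eta a small learning rate.

import numpy as np

def batch_step(s, params, eta):
    # batch: accumulate the gradient over all patterns, then take one step
    grads = [np.zeros_like(p) for p in params]
    for x, y in s:
        for g_total, g in zip(grads, backprop_single_pattern(x, y, *params)):
            g_total += g
    return tuple(p - eta * g for p, g in zip(params, grads))

def sequential_epoch(s, params, eta, rng=np.random.default_rng(0)):
    # sequential: take a step after every single pattern, here in random order
    for idx in rng.permutation(len(s)):
        x, y = s[idx]
        grads = backprop_single_pattern(x, y, *params)
        params = tuple(p - eta * g for p, g in zip(params, grads))
    return params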

• Slide 58/61

Example: the classical parity problem

As an example we show the result of training a network with:

two inputs;

one output;

one hidden layer;

all other details as above.

The problem is the classical parity problem, with noisy examples.

The sequential approach is used, with repeated passes through the entire training sequence.

• Slide 59/61

Example: the classical parity problem

[Figure: two plots over the (x1, x2) plane, "Before training" (left) and "After training" (right).]

• Slide 60/61

Example: the classical parity problem

• Slide 61/61

Example: the classical parity problem

[Figure: "Error during training", plotted over the course of training with the horizontal axis running from 0 to 1000.]