Perceptron
Danushka Bollegala
danushka.net/lect/dm/percept.pdf

  • Perceptron (Danushka Bollegala)

  • Bio-inspired model

    • Perceptron is a bio-inspired algorithm that tries to mimic a single neuron

    • We simply multiply each input (feature) by a weight and check whether this weighted sum (activation) is greater than a threshold.

    • If so, then we “fire” the neuron (i.e. a decision is made based on the activation)


  • A single neuron

    [Figure: a single neuron with inputs x1, x2, x3, x4, x5 connected by weights w1, w2, w3, w4, w5, producing activation a]

    activation (score): a = w1x1 + w2x2 + w3x3 + w4x4 + w5x5

    if a > θ then output = 1 else output = 0

    If the activation is greater than a predefined threshold, then the neuron fires.
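    As a rough illustration (the inputs, weights and threshold below are made-up values, not from the slides), this thresholded weighted sum can be written in a few lines of Python:

    # Minimal sketch of a single neuron: weighted sum followed by a threshold.
    # All numeric values here are illustrative only.
    def neuron_output(x, w, theta):
        a = sum(wd * xd for wd, xd in zip(w, x))  # activation a = w1x1 + ... + w5x5
        return 1 if a > theta else 0              # "fire" only if a exceeds the threshold

    x = [1.0, 0.0, 2.0, 0.5, 1.0]   # inputs x1 ... x5
    w = [0.2, -0.4, 0.1, 0.3, 0.0]  # weights w1 ... w5
    print(neuron_output(x, w, theta=0.5))         # prints 1 for these particular values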

  • Bias

    • Often we need to allow a fixed shift away from zero, because the “interesting” region may be far from the origin.

    • We adjust the previous model by including a bias term b as follows

    a = b + Σ_{d=1}^{D} wd xd

  • Notational trick

    • By introducing a feature that is always ON (i.e. x0 = 1 for all instances), we can squeeze the bias term b into the weight vector by setting w0 = b

    a = Σ_{d=0}^{D} wd xd = wTx

    This is more “elegant” because we can write the activation as the inner product between the weight vector and the feature vector. However, we should keep in mind that the bias term still appears in the model.
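    A quick sanity check of this trick (the numbers are illustrative): prepending x0 = 1 to the features and w0 = b to the weights gives the same activation:

    # Folding the bias b into the weight vector via a constant feature x0 = 1 (illustrative values).
    x = [1.0, 0.0, 2.0]
    w = [0.2, -0.4, 0.1]
    b = 0.5

    a_with_bias = b + sum(wd * xd for wd, xd in zip(w, x))            # a = b + sum_d wd xd
    a_folded    = sum(wd * xd for wd, xd in zip([b] + w, [1.0] + x))  # a = w.x with w0 = b, x0 = 1

    assert abs(a_with_bias - a_folded) < 1e-12
    print(a_with_bias, a_folded)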

  • Perceptron

    • Consider only one training instance at a time (online learning)

    • k-NN considers ALL instances (batch learning)

    • Learn only if we make a mistake when we classify using the current weight vector. Otherwise, we do not make adjustments to the weight vector

    • Error-driven learning

  • The Perceptron algorithm

    Algorithm 5 PerceptronTrain(D, MaxIter)
    1: wd ← 0, for all d = 1 ... D              // initialize weights
    2: b ← 0                                    // initialize bias
    3: for iter = 1 ... MaxIter do
    4:   for all (x, y) ∈ D do
    5:     a ← Σ_{d=1}^{D} wd xd + b            // compute activation for this example
    6:     if ya ≤ 0 then
    7:       wd ← wd + y xd, for all d = 1 ... D  // update weights
    8:       b ← b + y                          // update bias
    9:     end if
    10:  end for
    11: end for
    12: return w0, w1, ..., wD, b

    Algorithm 6 PerceptronTest(w0, w1, ..., wD, b, x̂)
    1: a ← Σ_{d=1}^{D} wd x̂d + b                // compute activation for the test example
    2: return sign(a)

    First, the perceptron algorithm is online: instead of considering the entire data set at the same time, it only ever looks at one example. It processes that example and then goes on to the next one. Second, it is error driven. This means that, so long as it is doing well, it doesn't bother updating its parameters.

    The algorithm maintains a “guess” at good parameters (weights and bias) as it runs. It processes one example at a time. For a given example, it makes a prediction. It checks to see if this prediction is correct (recall that this is training data, so we have access to true labels). If the prediction is correct, it does nothing. Only when the prediction is incorrect does it change its parameters, and it changes them in such a way that it would do better on this example next time around. It then goes on to the next example. Once it hits the last example in the training set, it loops back around for a specified number of iterations.

    The training algorithm for the perceptron is shown in Algorithm 5 and the corresponding prediction algorithm is shown in Algorithm 6. There is one “trick” in the training algorithm, which probably seems silly, but will be useful later. It is in line 6, when we check to see if we want to make an update or not. We want to make an update if the current prediction (just sign(a)) is incorrect. The trick is to multiply the true label y by the activation a and compare this against zero. Since the label y is either +1 or -1, you just need to realize that ya is positive whenever a and y have the same sign. In other words, the product ya is positive if the current prediction is correct. It is very very important to check ya ≤ 0 rather than ya < 0. Why?

    The particular form of update for the perceptron is quite simple. The weight wd is increased by yxd and the bias is increased by y.

    slide credit: CIML (Daume III)
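    As a concrete companion to the pseudocode above, here is a minimal NumPy sketch of PerceptronTrain and PerceptronTest (the toy dataset at the end is made up purely for illustration; labels are assumed to be +1/-1):

    import numpy as np

    def perceptron_train(X, y, max_iter=10):
        """Sketch of PerceptronTrain: X is an (n, D) array, y holds labels in {+1, -1}."""
        n, D = X.shape
        w = np.zeros(D)                      # line 1: initialize weights
        b = 0.0                              # line 2: initialize bias
        for _ in range(max_iter):            # line 3: MaxIter passes over the data
            for xi, yi in zip(X, y):         # line 4: one example at a time
                a = np.dot(w, xi) + b        # line 5: activation for this example
                if yi * a <= 0:              # line 6: mistake (or zero activation)
                    w = w + yi * xi          # line 7: update weights
                    b = b + yi               # line 8: update bias
        return w, b

    def perceptron_test(w, b, x):
        """Sketch of PerceptronTest: returns the sign of the activation."""
        return np.sign(np.dot(w, x) + b)

    # Toy usage with two separable clusters in 2D (illustrative data only).
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([+1, +1, -1, -1])
    w, b = perceptron_train(X, y, max_iter=20)
    print(w, b, [int(perceptron_test(w, b, xi)) for xi in X])  # all four points classified correctly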

  • Detecting errors

    • In Line 6 of the PerceptronTrain code we check whether ya ≤ 0

    • If the current instance is positive (y = +1), we should have a positive activation (a > 0) in order to have a correct prediction

    • If the current instance is negative (y = -1), we should have a negative activation (a < 0) in order to have a correct prediction

    • In both cases ya > 0

    • Therefore, if ya ≤ 0, the current instance has been misclassified and we should update the weights
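    A tiny check of this claim in Python (illustrative values; labels are assumed to be +1/-1):

    # For y in {+1, -1}: ya > 0 exactly when the predicted sign of a matches y (assuming a != 0).
    for y in (+1, -1):
        for a in (2.5, -0.3):
            prediction_correct = (1 if a > 0 else -1) == y
            print(y, a, y * a > 0, prediction_correct)   # the last two columns always agree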

  • Update rule — Intuitive Explanation

    • The perceptron update rule is w = w + yx

    • If we incorrectly classify a positive instance as negative
      • We should have a higher (more positive) activation to avoid this
      • We should increase wTx
      • Therefore, we should ADD the current instance to the weight vector

    • If we incorrectly classify a negative instance as positive
      • We should have a lower (more negative) activation to avoid this
      • We should decrease wTx
      • Therefore, we should DEDUCT the current instance from the weight vector

  • Update rule — Math Explanation


    The goal of the update is to adjust the parameters so that they are “better” for the current example. In other words, if we saw this example twice in a row, we should do a better job the second time around.

    To see why this particular update achieves this, consider the following scenario. We have some current set of parameters w1, ..., wD, b. We observe an example (x, y). For simplicity, suppose this is a positive example, so y = +1. We compute an activation a, and make an error, namely a < 0. We now update our weights and bias. Let's call the new weights w'1, ..., w'D, b'. Suppose we observe the same example again and need to compute a new activation a'. We proceed by a little algebra:

    a' = Σ_{d=1}^{D} w'd xd + b'                          (3.3)
       = Σ_{d=1}^{D} (wd + xd) xd + (b + 1)               (3.4)
       = Σ_{d=1}^{D} wd xd + b + Σ_{d=1}^{D} xd xd + 1    (3.5)
       = a + Σ_{d=1}^{D} xd^2 + 1 > a                     (3.6)

    So the difference between the old activation a and the new activation a' is Σ_d xd^2 + 1. But xd^2 ≥ 0, since it is squared. So this value is always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have successfully moved the activation in the proper direction. (Though note that there's no guarantee that we will correctly classify this point the second, third or even fourth time around!) This analysis holds for positive examples (y = +1). It should also hold for negative examples. Work it out.

    [Figure 3.3: training and test error as a function of the number of passes (early stopping)]

    The only hyperparameter of the perceptron algorithm is MaxIter, the number of passes to make over the training data. If we make many many passes over the training data, then the algorithm is likely to overfit. (This would be like studying too long for an exam and just confusing yourself.) On the other hand, going over the data only one time might lead to underfitting. This is shown experimentally in Figure 3.3. The x-axis shows the number of passes over the data and the y-axis shows the training error and the test error. As you can see, there is a “sweet spot” at which test performance begins to degrade due to overfitting.

    One aspect of the perceptron algorithm that is left underspecified is line 4, which says: loop over all the training examples. The natural implementation of this would be to loop over them in a constant order. This is actually a bad idea. Consider what the perceptron algorithm would do on a data set that consisted of 500 positive examples followed by 500 negative examples.

    If the misclassified instance is a positive one, then after we update using w = w + x, the new activation a’ is greater than the old activation a.
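    A quick numeric check of the algebra above, with made-up numbers for a misclassified positive example:

    import numpy as np

    # After a mistake on a positive example (y = +1) and the update w' = w + x, b' = b + 1,
    # the new activation equals the old activation plus ||x||^2 + 1.
    w = np.array([-0.5, 0.2]); b = -0.1          # illustrative current parameters
    x = np.array([1.0, 2.0])                     # a positive example we currently get wrong

    a_old = np.dot(w, x) + b                     # about -0.2: negative, so a mistake
    w_new, b_new = w + x, b + 1.0                # perceptron update with y = +1
    a_new = np.dot(w_new, x) + b_new             # about 5.8

    print(a_new - a_old, np.dot(x, x) + 1)       # the difference equals ||x||^2 + 1 (here 6)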

  • Quiz 1

    • Show that the analysis in the previous slide holds when y = -1 (i.e. we misclassified a negative instance)



  • Things to remember

    • There is no guarantee that we will correctly classify a misclassified instance in the next round.

    • We have simply increased/decreased the activation but this adjustment might not be sufficient. We might need to do more aggressive adjustments

    • There are algorithms that enforce such requirements explicitly such as the Passive Aggressive Classifier (not discussed here)


  • Ordering of instances

    • Ordering training instances randomly within each iteration produces good results in practice

    • Showing all the positives first and then all the negatives is a bad idea

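    One way to implement this, sketched here in the same style as the earlier perceptron_train sketch (my illustration, not code from the lecture): draw a fresh random permutation of the training indices at the start of every pass:

    import numpy as np

    def perceptron_train_shuffled(X, y, max_iter=10, seed=0):
        """Perceptron training that visits the examples in a new random order on each pass."""
        rng = np.random.default_rng(seed)
        n, D = X.shape
        w, b = np.zeros(D), 0.0
        for _ in range(max_iter):
            for i in rng.permutation(n):         # random ordering within each iteration
                if y[i] * (np.dot(w, X[i]) + b) <= 0:
                    w = w + y[i] * X[i]
                    b = b + y[i]
        return w, b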

  • Hyperplane

    • The decision in perceptron is made depending on whether wTx > 0 or wTx ≤ 0

    • The boundary between these two regions is the set of points where wTx = 0 (a hyperplane)

  • Geometric Interpretation of Hyperplane

    [Figure: the weight vector w and the hyperplane wTx = 0; the hyperplane defined by the weight vector is perpendicular to the weight vector]

  • Geometric interpretation

    [Figure: the origin O, the current weight vector w, and a positive instance x (+1) at an angle of more than 90° to w]

    The angle between the current weight vector w and the positive instance x is greater than 90°. Therefore, wTx < 0, and this instance is going to get misclassified as negative.

  • Geometric interpretation

    [Figure: the origin O, the positive instance x (+1), the old weight vector w, and the new weight vector w' = w + x lying between them]

    The new weight vector w' is the sum w + x, as given by the perceptron update rule. It lies in between x and w. Notice that the angle between w' and x is less than 90°. Therefore, x will be classified as positive by w'.

  • Vector algebra revision

    [Figure: vector addition and subtraction using the parallelogram rule]

  • Quiz 2

    • Let x = (1, 0)T and y = (1, 1)T. Compute x+y and x-y using the parallelogram approach described in the previous slide.


  • Quiz 3

    • Provide a geometric interpretation for the update rule in Perceptron when a negative instance is mistaken to be positive.


  • Linear separability

    • If a given set of positive and negative training instances can be separated into those two groups using a straight line (hyperplane), then we say that the dataset is linearly separable.

    [Figure: a linearly separable dataset]

  • Remarks

    • When a dataset is linearly separable, there can exist more than one hyperplane that separates the dataset into positive/negative groups.

    • In other words, the hyperplane that linearly separates a linearly separable dataset might not be unique.

    • However, (by definition) if a dataset is not linearly separable, then there exists NO hyperplane that separates the dataset into positive/negative groups.


  • A non-linearly separable case


    No matter how we draw straight lines, we cannot separate the red instances from the blue instances

  • Negation handling in Sentiment Classification

    [Figure: sentiment words Excellent/Terrible combined with a negation feature (NOT = 1 or NOT = -1), giving an XOR-like labelling]

    Exclusive OR (XOR): XOR(A, B) = 1 only when exactly one of the two inputs is 1.
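    A self-contained toy check of this point in Python (the encoding below is illustrative, not taken from the slides): even after many passes, the perceptron cannot get all four XOR-labelled points right:

    import numpy as np

    # XOR-style data: the label is the exclusive OR of the two binary features,
    # encoded here with +1/-1 labels.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([-1, +1, +1, -1])

    w, b = np.zeros(2), 0.0
    for _ in range(1000):                        # many passes: it still never converges
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w, b = w + yi * xi, b + yi

    preds = [int(np.sign(np.dot(w, xi) + b)) for xi in X]
    print(preds, list(y))                        # at least one prediction disagrees with y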

  • Further Remarks

    • When a dataset is linearly separable it can be proved that the perceptron will always find a separating hyperplane!

    • The final weight vector returned by the Perceptron is influenced more by the last training instances it sees.

    • To reduce this effect, take the average over all weight vectors seen during training (the averaged perceptron algorithm).

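    A simple, unoptimized sketch of that idea (an illustration of the averaged perceptron, not code from the lecture), assuming +1/-1 labels: keep a running sum of the weight vector after every example and return the average:

    import numpy as np

    def averaged_perceptron_train(X, y, max_iter=10):
        """Averaged perceptron sketch: return the average of the weights over all examples seen."""
        n, D = X.shape
        w, b = np.zeros(D), 0.0
        w_sum, b_sum, count = np.zeros(D), 0.0, 0
        for _ in range(max_iter):
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # standard perceptron update on a mistake
                    w = w + yi * xi
                    b = b + yi
                w_sum += w                          # accumulate the current weight vector
                b_sum += b
                count += 1
        return w_sum / count, b_sum / count         # averaged weights and bias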