Page 1:

CS480 Introduction to Machine Learning Perceptron

Edith Law

Based on slides from Joelle Pineau

Page 2:

Logistics

•Assignment 1 due Thursday Jan 24 8:59PM

•Special Office Hour Today 4pm-6pm AI Lab (DC2306C)

•Project Proposal due next Tuesday Jan 29 8:59PM (details in proposal.pdf and proposal.zip on the “Project” page).

Page 3:

Considering Features…

In decision trees, only a small number of features are used to make decisions.

In KNN, all features are used equally.

In perceptron algorithms, we learn a weight for each feature.

Page 4:

History

Page 5:

Bio-inspired Learning

Our brain is made up of units called neurons, which send electrical signals to one another.

The rate of firing tells us how “activated” the neuron is.

A single neuron receives input from multiple incoming neurons, each firing at a different rate.

Depending on how much the incoming neurons are firing and how strong the neural connections are, the neuron will “decide” how strongly it wants to fire.

Page 6:

Bio-inspired Learning

We can think of our learning algorithm as a single neuron.

It receives input from D-many other neurons, one for each input feature.

The strengths of these inputs are the feature values.

Each incoming connection has a weight.

The neuron sums up all the weighted inputs and decides whether to fire or not.

Firing = positive, not firing = negative.

Page 7:

Bio-inspired Learning

The binary threshold decision neuron sums up all the weighted inputs:

$a = \sum_{d=1}^{D} w_d x_d$

If the activation $a \ge 0$, it predicts positive (y = 1); otherwise it predicts negative (y = 0).

Adding a bias term b:

$a = \left[\sum_{d=1}^{D} w_d x_d\right] + b$

Equivalently, in vector form:

$a = [w_0 \;\, w_1 \;\, \cdots \;\, w_D] \, [1 \;\, x_1 \;\, \cdots \;\, x_D]^T$
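To make this concrete, here is a minimal NumPy sketch of the binary threshold neuron (not from the slides; the function and variable names are illustrative):

```python
import numpy as np

def threshold_neuron(x, w, b):
    """Binary threshold decision neuron: fires (1) iff the weighted sum plus bias is >= 0."""
    a = np.dot(w, x) + b          # activation: sum_d w_d * x_d + b
    return 1 if a >= 0 else 0

# Illustrative values with D = 3 input features.
w = np.array([0.5, -0.2, 0.1])    # learned weights
b = -0.1                          # bias term
x = np.array([1.0, 0.0, 2.0])     # feature values (input firing rates)
print(threshold_neuron(x, w, b))  # prints 1, since the activation is 0.6 >= 0
```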

Page 8:

Linear Threshold Function

Linear + threshold

[Figure: a linear threshold unit. The inputs 1, x1, …, xD (features) are multiplied by the learned weights w0, w1, …, wD, summed, and passed through a decision unit (binary threshold neuron) to produce the output y.]

Page 9:

Encoding Logic Gates

Suppose that we have two boolean features, and we want to use a perceptron to encode the NAND function.

NAND truth table:

x1 x2 | y
 0  0 | 1
 1  0 | 1
 0  1 | 1
 1  1 | 0

One choice of weights is w = [−1, −1, 1.5], giving the decision boundary −x1 − x2 + 1.5 = 0 (equivalently, x2 = −x1 + 1.5):

$v = [-1 \;\; -1 \;\; 1.5]\,[x_1 \;\; x_2 \;\; 1]^T; \qquad y = \begin{cases} 0 & \text{if } v < 0 \\ 1 & \text{if } v \ge 0 \end{cases}$
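A quick sanity check of these NAND weights (a NumPy sketch, not course-provided code):

```python
import numpy as np

w = np.array([-1.0, -1.0, 1.5])          # weights for [x1, x2, constant 1]

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    v = w @ np.array([x1, x2, 1.0])      # v = -x1 - x2 + 1.5
    y = 1 if v >= 0 else 0               # threshold at zero
    print(x1, x2, "->", y)               # reproduces the NAND truth table: 1, 1, 1, 0
```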

Page 10:

Perceptron Learning Algorithm

Intuition: change the weight vector in such a way that it will do better on a misclassified example the next time around.

Initialize w0 … wD randomly. For each example, make a prediction:

• If it is correct, do nothing.

• If it is incorrect, update the weight vector: wi ← wi + η (y − a) xi
  - if the output unit incorrectly outputs 0, add the input vector to the weight vector.
  - if the output unit incorrectly outputs 1, subtract the input vector from the weight vector.

Repeat.

This procedure is guaranteed to find a set of weights that classifies all training examples correctly, if any such set exists.
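A minimal sketch of this procedure in NumPy, assuming labels y ∈ {0, 1}, a fixed learning rate η, and a bias learned as an extra weight on a constant-1 input (names are mine, not the course's reference implementation):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_iter=100, seed=0):
    """Perceptron rule from the slides: w_i <- w_i + eta * (y - a) * x_i, where a is the
    thresholded output, so the input is added when the unit wrongly outputs 0 and
    subtracted when it wrongly outputs 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # append a constant 1 for the bias
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Xb.shape[1])               # initialize w_0 ... w_D randomly

    for _ in range(max_iter):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            a = 1 if xi @ w >= 0 else 0            # current prediction
            if a != yi:                            # incorrect: update the weight vector
                w += eta * (yi - a) * xi
                mistakes += 1
        if mistakes == 0:                          # every example correct: converged
            break
    return w

# Example: learning NAND from its truth table (linearly separable, so this converges).
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([1, 1, 1, 0])
w = train_perceptron(X, y)
print((np.hstack([X, np.ones((4, 1))]) @ w >= 0).astype(int))   # expected: [1 1 1 0]
```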

Page 11:

Perceptron Learning Algorithm

Page 12:

Perceptron Learning Algorithm

In line 6, we want to make an update if the current prediction is incorrect.

The trick is to multiply the true label y by the activation a and compare this against zero. Since the label y is either +1 or -1, ya is positive whenever a and y have the same sign, i.e., when the prediction is correct.

The update:
• weight wd is increased by yxd and the bias is increased by y
• how is this better for the current example?

Page 13:

Why does the update work?

We have a current weight vector w1,…,wD, b. We observe an example (x, y). Let's suppose y = +1. We compute an activation a and make an error, i.e., a < 0. Now we update the weight and bias.

The new activation is always at least the old activation plus 1.

Since this is a positive example, we have successfully moved the activation in the proper direction. However, there is no guarantee that we will classify this point correctly on future passes.
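A one-line check of this claim (my own algebra, using the update $w_d \leftarrow w_d + y x_d$, $b \leftarrow b + y$ with $y = +1$ and $\eta = 1$):

$a_{\text{new}} = \sum_d (w_d + x_d)\, x_d + (b + 1) = \underbrace{\sum_d w_d x_d + b}_{a_{\text{old}}} + \|x\|^2 + 1 \;\ge\; a_{\text{old}} + 1.$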

Page 14:

Convergence Behaviour of Perceptron

[B] Sec 4.1.7

Page 15:

Perceptron Learning: Example

Page 16:

Perceptron Learning Algorithm

Learning a perceptron involves choosing the values for the weights. Therefore, the hypothesis space considered in perceptron learning is the set of all possible real-valued weight vectors.

In two dimensions, learning weights for features amounts to learning a hyperplane classifier, i.e., a division of space into two halves by a straight line, where one half is positive and one half is negative. In that sense, the perceptron can be seen as explicitly finding a good linear decision boundary. The algorithm is

• online, i.e., instead of considering the entire dataset at the same time, it only looks at one example at a time.

• error driven, i.e., as long as it is doing well, it doesn't bother updating its parameters.

Page 17:

The Perceptron Algorithm: Hyper-parameters

The only hyperparameter is MaxIter, the number of passes to make over the training data.
• too many passes, and we will overfit.
• too few passes, and we will underfit.

Is a constant number of passes a good idea?

Page 18:

The Perceptron Algorithm: Hyper-parameters

Is a constant MaxIter a good idea?

• Consider what happens if the algorithm sees 500 positive examples first, then 500 negative examples later.

• The algorithm will stop updating after seeing just a few (e.g., 5) positive examples, do well for the next 495, then start seeing negative examples; it will take a while (e.g., 10 examples) to adjust and then start predicting negative again. So it will only have learned from a handful of examples.

• Trick: permute the order of examples during each iteration, as in the sketch below. (In practice, this yields about 20% savings; in theory, you can prove that it's expected to be about twice as fast.)
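A sketch of the permutation trick (assumed function and variable names; the update itself is the same as before):

```python
import numpy as np

def train_perceptron_shuffled(X, y, eta=1.0, max_iter=100, seed=0):
    """Same perceptron update, but the order of examples is re-permuted on every pass."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Xb.shape[1])
    for _ in range(max_iter):
        for i in rng.permutation(len(Xb)):     # visit examples in a fresh random order
            a = 1 if Xb[i] @ w >= 0 else 0
            if a != y[i]:
                w += eta * (y[i] - a) * Xb[i]
    return w
```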

Page 19:

Decision Boundary

For a perceptron, the decision boundary is precisely where the sign of the activation a changes from -1 to +1, i.e., the set of points that achieve zero activation (neither clearly positive nor negative).

We can rewrite this equation in terms of the dot product between the two vectors w = ⟨w1, w2, …, wD⟩ and x = ⟨x1, x2, …, xD⟩. The two vectors have a zero dot product if and only if they are perpendicular.

$B = \{x : \sum_{d} w_d x_d = 0\}$

$(x_i, y_i)$ is classified correctly if and only if $y_i (w^T x_i) > 0$.

Thus, the decision boundary is simply the plane perpendicular to w: $w^T x = 0$.
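The condition $y_i (w^T x_i) > 0$ is easy to check in vectorized form (a small sketch with ±1 labels and illustrative numbers, bias omitted as on this slide):

```python
import numpy as np

w = np.array([1.0, -1.0])                              # candidate weight vector
X = np.array([[2.0, 1.0], [0.5, 1.0], [-1.0, 0.5]])    # one example per row
y = np.array([+1, -1, -1])                             # labels in {+1, -1}

correct = y * (X @ w) > 0        # True exactly where y_i * (w . x_i) > 0
print(correct)                   # here: [ True  True  True], so w separates these points
```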

Page 20:

Decision Boundary

The decision boundary, with the weight vector pointing in the direction of the positive examples and away from the negative examples.

•The scale of the weight vector is irrelevant (e.g., if we replace w with 2w, activations are doubled, but sign doesn’t change.)

•This is saying that it only matters which side of the plane a test point falls on, not how far it is from the plane.

•A normalized weight vector, ||w|| = 1, is commonly used.

Page 21:

Decision Boundary

The dot product also computes projections, i.e., wTx is the distance of x from the origin when projected onto the vector w.

Here, all the data points are projected onto w.

You can think of this as a 1-dimensional version of the data where each data point is placed according to its projection along w.

The distance along w is exactly the activation of that example, with no bias.

With bias b, the projection plus b is compared against zero to determine whether a data point is positive or negative.
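A small sketch of this projection view (illustrative numbers; w is chosen with ||w|| = 1 as discussed on the previous slide):

```python
import numpy as np

w = np.array([0.6, 0.8])                     # unit-length weight vector
b = -0.5                                     # bias
X = np.array([[1.0, 1.0], [0.2, 0.1], [-1.0, 0.5]])

proj = X @ w                                 # distance of each point along w (the 1-D view)
print(proj)
print((proj + b >= 0).astype(int))           # projection + b compared against zero
```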

Page 22:

Decision Boundary

The bias term moves the threshold away from 0. It shifts the decision boundary away from the origin.

If b is positive, the boundary is shifted away from w. A positive bias means that more examples should be classified positive.

The decision boundary for a perceptron in D dimension is always a D-1 dimensional hyperplane.

Page 23:

Perceptron Convergence and Linear Separability

What the perceptron algorithm does is move the decision boundary in the direction of the training examples.

•Does perceptron algorithm converge?

•What does it converge to?

•How long does it take?

Page 24:

Perceptron Convergence and Linear Separability

An example of a dataset on which the perceptron algorithm will not converge:

• 1 positive example and 1 negative example with no features: the algorithm will only adjust the bias.

• If the bias is positive, the negative example is going to cause it to decrease; if the bias is negative, the positive example is going to cause it to increase.

What convergence really means is that the algorithm will not make any more updates, i.e., every example is correctly predicted.

Geometrically, it means that the algorithm has found the hyperplane that separates data into positive and negative examples.

In this case, convergence means that the data is linearly separable.

Page 25:

Perceptron Convergence and Linear Separability

[Figure: left, a dataset of + and − points that is not linearly separable (the perceptron will not converge); right, a linearly separable dataset (the perceptron will converge).]

The data is linearly separable if and only if there exists a w such that:
– For all examples, yi (wTxi) > 0
– Or equivalently, the 0-1 loss is zero for some set of parameters (w).

Page 26:

Perceptron Convergence and Linear Separability

The somewhat surprising thing about perceptron is that if the data is linearly separable, then it will converge to a weight vector that separates the data.

How long (i.e., number of updates) does it take to converge?

Intuitively, the perceptron algorithm will converge more quickly for easier problems and more slowly for harder problems. But how do we define "easy" and "hard"?

Page 27:

Margin

We can define difficulty using the notion of margin, which is the distance between the hyperplane and the nearest point.

Problems with large margins are easy, because there is a lot of “wiggle room” to find a separating hyperplane. Problems with small margins are hard because you have to have a very specific and well tuned weight vector.

$\mathrm{margin}(D, w, b) = \begin{cases} \min_{(x,y)\in D} \; y\,(w^T x + b) & \text{if } (w,b) \text{ separates } D \\ -\infty & \text{otherwise} \end{cases}$

The margin is only defined if w,b actually separate the data D.

If it is defined, then the margin is the minimum activation over all points, after each activation is multiplied by its label.
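This definition translates almost directly into code (a sketch; it returns −∞ when (w, b) fails to separate the data):

```python
import numpy as np

def margin(X, y, w, b):
    """margin(D, w, b): minimum of y * (w . x + b) over the data if (w, b) separates it, else -inf."""
    activations = y * (X @ w + b)            # label-signed activations
    return activations.min() if np.all(activations > 0) else -np.inf
```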

Page 28:

Margin

The margin of a dataset is the largest attainable margin on this dataset.

$\mathrm{margin}(D) = \sup_{w,b} \; \mathrm{margin}(D, w, b)$

In other words, to compute the margin of a dataset, you "try" every possible (w, b) pair. For each pair, you compute the margin, and then take the largest of these as the overall margin of the data.

The margin is typically denoted by gamma, γ.

Page 29:

Rosenblatt's Perceptron Convergence Theorem

Theorem. Suppose the perceptron algorithm is run on a linearly separable data set D with margin $\gamma > 0$. Assume that $\|x\| \le 1$ for all $x \in D$. Then the algorithm will converge after at most $\gamma^{-2}$ updates.

The idea of the proof:

• If the data is linearly separable with margin γ, then there exists some weight vector w* that achieves this margin.

• The perceptron algorithm is trying to find a weight vector w that points roughly in the same direction as w*. (Large margin = very rough; small margin = very precise.)

• Every time the perceptron makes an update, the angle between w and w* changes. We can show that this angle decreases.

Page 30:

Rosenblatt’s Perceptron Convergence Theorem

The idea of the proof:

•We can show that the angle decreases by showing that
  - the dot product between w and w* increases a lot,
  - the norm ||w|| does not increase very much.

Since the dot product is increasing, but w isn’t getting too long, the angle between them has to be shrinking.
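In symbols, my own summary of the standard argument (unit-length w* achieving margin γ, ||x|| ≤ 1, w started at zero, bias ignored): after k updates,

$w \cdot w^* \ge k\gamma \qquad \text{and} \qquad \|w\|^2 \le k,$

so

$1 \ge \cos\angle(w, w^*) = \frac{w \cdot w^*}{\|w\|\,\|w^*\|} \ge \frac{k\gamma}{\sqrt{k}} = \gamma\sqrt{k},$

which forces $k \le \gamma^{-2}$.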

Theorem. Suppose the perceptron algorithm is run on a linearly separable data set D with margin $\gamma > 0$. Assume that $\|x\| \le 1$ for all $x \in D$. Then the algorithm will converge after at most $\gamma^{-2}$ updates.

Page 31:

Rosenblatt’s Perceptron Convergence Theorem

The proof shows:

• If the data is linearly separable with margin γ > 0, then the perceptron will converge to a solution that separates the data after some finite number of updates. It converges quickly when γ is large. But it doesn't say anything about the solution other than that it separates the data.

For more details:

•See Hal Daume Section 4.5.

•Geoff Hinton’s explanations of the Perceptron Convergence Theorem: - A Geometric View of Perceptrons (https://youtu.be/0T57_yjjB58) - Why the Learning Works (https://youtu.be/d6nFTN081S8)

Page 32:

Rosenblatt’s Perceptron Convergence Theorem

Additional comments:

•The number of updates depends on the dataset, on the learning rate, and on the initial weights.

• If the data is not linearly separable, there will be oscillation (which can be detected automatically).

•Decreasing the learning rate to 0 can cause the oscillation to settle on some particular solution.

Page 33:

Limitations

The problem with the vanilla perceptron algorithm is that it counts later points more than it counts earlier points.

Consider this: there are 10,000 examples. After the first 100 examples, the perceptron has learned a really good classifier, but then it makes an error on the 10,000th example. It updates the weight vector, and that single update can ruin its performance on the other 99% of the data.

Page 34:

Voted Perceptron Algorithm

Intuition: We want the weight vectors that “survive” a long time to get more say than the weight vectors that are overthrown quickly.

Let (w,b)^(1), …, (w,b)^(K) be the K+1 weight vectors encountered during training, and c^(1), …, c^(K) be the survival times for each weight vector. Then the prediction on a test point is:

$\hat{y} = \mathrm{sign}\left(\sum_{k=1}^{K} c^{(k)} \, \mathrm{sign}\!\left(w^{(k)} \cdot x + b^{(k)}\right)\right)$

There is some theory showing that it is guaranteed to generalize better than the vanilla perceptron.

But it is also impractical: we need to store all the weight vectors and their counts, and ask for a growing number of votes for each prediction.
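A sketch of the voted prediction rule (assuming the weight vectors ws, biases bs, and survival counts cs were recorded during training; these names are mine):

```python
import numpy as np

def voted_predict(x, ws, bs, cs):
    """Each stored (w, b) casts c votes with its own sign prediction; return the majority sign."""
    votes = sum(c * np.sign(w @ x + b) for w, b, c in zip(ws, bs, cs))
    return np.sign(votes)
```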

Page 35:

Averaged Perceptron Algorithm

Maintain a collection of weight vectors and survival times, but at test time, predict according to the average weight vector rather than voting.

$\hat{y} = \mathrm{sign}\left(\sum_{k=1}^{K} c^{(k)} \left(w^{(k)} \cdot x + b^{(k)}\right)\right)$

This means that we can simply maintain a running sum of the survival-time-weighted weight vectors and a running sum of the weighted biases:

$\hat{y} = \mathrm{sign}\left(\left(\sum_{k=1}^{K} c^{(k)} w^{(k)}\right) \cdot x + \sum_{k=1}^{K} c^{(k)} b^{(k)}\right)$

Better than the vanilla version in terms of generalization to test data, but it still has to do early stopping so that it doesn't overfit.
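A hedged sketch of the running-sum version (labels in {+1, −1}; adding the current w and b after every example weights each by its survival time):

```python
import numpy as np

def train_averaged_perceptron(X, y, max_iter=10, seed=0):
    """Averaged perceptron: train as usual, but predict with survival-time-weighted sums."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1]); b = 0.0           # current weight vector and bias
    w_sum = np.zeros(X.shape[1]); b_sum = 0.0   # running sums of weighted weights and biases
    for _ in range(max_iter):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) <= 0:      # mistake: vanilla perceptron update
                w = w + y[i] * X[i]
                b = b + y[i]
            w_sum += w                          # each surviving step adds one more copy
            b_sum += b
    return w_sum, b_sum

# Prediction on a test point x: np.sign(w_sum @ x + b_sum)
```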

Page 36:

Averaged Perceptron Algorithm

Maintain a collection of weight vectors and survival times, but at test time, predict according to the average weight vector rather than voting.

Page 37:

Perceptron Learning Example

Examples used (bold) and not (faint). What do you notice?

Page 38:

Perceptron Learning Example

Solutions are often non-unique. The solution depends on the set of instances and on the order in which examples are sampled during updates.

Page 39:

Non-Uniqueness

Consider a linearly separable binary classification dataset. There is an infinite number of hyper-planes that separate the classes:

Which plane is best?

Related question: For a given plane, for which points should we be most confident in the classification?

Page 40:

Linear Support Vector Machines

A linear SVM is a perceptron for which we choose w such that the margin is maximized.

For a given separating hyper-plane, the margin is twice the (Euclidean) distance from the hyper-plane to the nearest training example.

• i.e. the width of the “strip” around the decision boundary that contains no training examples.

Page 41:

Can Perceptron Solve This Problem?

The decision boundaries can only be linear.

3 features: excellent, terrible, not.

• excellent -> positive

• terrible -> negative

• not would flip the label!

[Figures: a sentiment classification problem; the XOR problem]

Page 42:

Handling Non-Linearly Separable Data

Solutions to the XOR problem:

• combine multiple perceptrons in a single framework (i.e., neural network approach)

• find computationally efficient ways of doing feature mapping (i.e., kernel approach)

•relax the criterion of separating all the data

Page 43:

Limitations of Perceptrons

•Solutions are non-unique.
  - Define what "best separating hyperplane" means (e.g., max margin) and devise an algorithm (e.g., SVM) to find it.

•Cannot handle non-linearly separable data.
  - Combine perceptrons to learn non-linear boundaries.
  - Transform the features by hand to make the data linearly separable.
  - Relax the linear separability requirement.

Page 44:

Linear Regression (next class)

The perceptron is a linear rule for classification:
• y is in a finite set, e.g., {+1, -1}

What if y can be any real number?
• Known as regression.
• Again, there is a linear rule:

$\min_{w \in \mathbb{R}^d} \; \frac{1}{2}\,\|x^T w - y\|_2^2 + \lambda\,\|w\|_2^2$

where the first term is the empirical risk, the $\lambda \|w\|_2^2$ term is the regularization, and $\lambda$ is a hyperparameter.
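As a preview only, the objective is easy to write down in NumPy (a sketch assuming X is the usual N×d data matrix, which the slide writes as x):

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/2) * ||X w - y||_2^2  +  lam * ||w||_2^2  (empirical risk + regularization)."""
    return 0.5 * np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)
```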

Page 45:

What you should know

• the perceptron rule

• the perceptron learning algorithm

• its limitations (non-uniqueness and restriction to linearly separable data) and how to deal with these limitations

• a high-level understanding of Rosenblatt's Perceptron Convergence Theorem
