
Page 1: Machine Learning (CSE 446): The Perceptron Algorithm

Machine Learning (CSE 446): The Perceptron Algorithm

Sham M Kakade © 2019
University of Washington
[email protected]


Page 2: Machine Learning (CSE 446): The Perceptron Algorithm

Announcements

- HW1 posted.

- Today: the perceptron algorithm.


Page 3: Machine Learning (CSE 446): The Perceptron Algorithm

Is there a happy medium?

- Decision trees (that aren’t too deep): use relatively few features to classify.

- If we want to use ‘all’ the features, then we need a deep (high-complexity) decision tree (so overfitting will be more of a concern).

- Is there a ‘low complexity’ method (something which is better at keeping the generalization error small) which is able to utilize all the features for prediction?

- One idea: a ‘linear classifier’: use all features, but weight them.


Page 4: Machine Learning (CSE 446): The Perceptron Algorithm

Inspiration from Neurons
(Image from Wikimedia Commons.)

[Figure: a perceptron unit. The inputs x[1], ..., x[d] are multiplied by the weight parameters w[1], ..., w[d], summed together with the bias parameter b, and passed through the “fire, or not?” activation to produce the output ŷ.]

Input signals come in through dendrites; the output signal passes out through the axon.


Page 5: Machine Learning (CSE 446): The Perceptron Algorithm

A “parametric” Hypothesis

- Consider using a “linear classifier”:

f(x) = sign(w · x + b)

where w ∈ R^d and b is a scalar.

- Here sign(z) is the function which is 1 if z ≥ 0 and −1 otherwise. (Let us say that Y = {−1, 1}.)

- Notation: x ∈ R^d; x[i] denotes the i-th coordinate; remember that:

w · x = ∑_{j=1}^{d} w[j] · x[j]

- Learning requires us to set the weights w and the bias b.

- (Convenience) We can always append 1 to the vector x through the concatenation

x ← (x, 1).

This transformation allows us to absorb the bias b into the last component of the weight vector.
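As a quick illustration (not part of the slides), here is a minimal NumPy sketch of this linear classifier together with the bias-absorption trick; the particular weights and input are made-up values for illustration only.

```python
import numpy as np

def predict(w, x):
    """Linear classifier: sign(w . x), taking sign(0) = +1 as on the slide."""
    return 1 if np.dot(w, x) >= 0 else -1

# Absorb the bias b into the weights by appending a constant-1 feature to x.
w = np.array([0.4, -1.2, 0.7])     # illustrative weights (assumption)
b = 0.5                            # illustrative bias (assumption)
w_aug = np.append(w, b)            # weight vector becomes (w, b)

x = np.array([1.0, 0.0, 2.0])      # an illustrative input
x_aug = np.append(x, 1.0)          # x <- (x, 1)

# The augmented classifier agrees with sign(w . x + b).
assert predict(w_aug, x_aug) == (1 if np.dot(w, x) + b >= 0 else -1)
print(predict(w_aug, x_aug))       # prints 1 for this example
```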


Page 6: Machine Learning (CSE 446): The Perceptron Algorithm

Geometrically...

- What does the decision boundary look like? (It is the hyperplane {x : w · x + b = 0}, and w is normal to it.)


Page 7: Machine Learning (CSE 446): The Perceptron Algorithm

What would we like to do?

ε(w) = (1/N) ∑_{i=1}^{N} 1{y_i ≠ sign(w · x_i)}

- Optimization problem: find a classifier which minimizes the classification loss on the training dataset D:

min_w ε(w)

- Problem: (in general) this is an NP-hard problem.

- Let’s ignore this, and think of an algorithm.

This is the general approach of loss function minimization: find parameters which make our training error “small” (and which also generalize).
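For concreteness, the training error ε(w) above can be computed in a few lines. This is my own sketch, assuming NumPy arrays: X of shape (N, d) with the constant-1 feature appended, and labels y in {−1, +1}.

```python
import numpy as np

def training_error(w, X, y):
    """epsilon(w): fraction of training points with y_i != sign(w . x_i)."""
    predictions = np.where(X @ w >= 0, 1, -1)   # sign, with sign(0) taken as +1
    return np.mean(predictions != y)

# Tiny illustrative dataset (constant-1 feature already appended to absorb b).
X = np.array([[2.0, 1.0, 1.0], [-1.0, 0.5, 1.0]])
y = np.array([1, -1])
print(training_error(np.zeros(3), X, y))   # 0.5: the all-zero weights get one of two points wrong
```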


Page 8: Machine Learning (CSE 446): The Perceptron Algorithm

The Perceptron Learning Algorithm (Rosenblatt, 1958)

- Let’s think of an online algorithm, where we try to update w as we examine each training point (x_i, y_i), one at a time.

- Consider the ‘epoch-based’ algorithm:

1. For all (x, y) in our training set:
2. Choose a point (x, y) without replacement from D:
   Let ŷ = sign(w · x).
   If ŷ = y, then do not update: w_{t+1} = w_t.
   If ŷ ≠ y, then update: w_{t+1} = w_t + y x.
Return to the first step.

- When to stop?
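A runnable sketch of this epoch-based procedure (my own rendering, not the course's reference code): sampling without replacement is implemented here with a random permutation per epoch, and we stop after a fixed number of epochs, or earlier if a full pass makes no mistakes.

```python
import numpy as np

def perceptron_train(X, y, num_epochs=10):
    """X: (N, d) array with a constant-1 feature appended; y: labels in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    for epoch in range(num_epochs):
        mistakes = 0
        # "Without replacement": visit each training point once, in random order.
        for i in np.random.permutation(N):
            y_hat = 1 if X[i] @ w >= 0 else -1
            if y_hat != y[i]:               # mistake: w_{t+1} = w_t + y x
                w = w + y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                   # a full pass with no updates: stop early
            break
    return w
```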


Page 9: Machine Learning (CSE 446): The Perceptron Algorithm

Parameters and Convergence

This is the first supervised algorithm we have seen with (non-trivial) real-valued parameters, w.

- The perceptron learning algorithm’s sole hyperparameter is E, the number of epochs (passes over the training data). How should we tune E? Use a dev set (a minimal sketch follows after this list).

- Basic question: will the algorithm converge or not?
- Can you say what has to occur for the algorithm to converge? (Novikoff, 1962)
- Can we understand when it will never converge? (Minsky and Papert, 1969)
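One way to read “use a dev set”: train for each candidate number of epochs and keep the weights with the lowest held-out error. This sketch reuses the perceptron_train and training_error functions from the earlier sketches; the train/dev split and the range of E values are my assumptions.

```python
import numpy as np

def tune_epochs(X_train, y_train, X_dev, y_dev, max_epochs=20):
    """Return the epoch count E (and weights) with the lowest dev-set error."""
    best_E, best_err, best_w = None, np.inf, None
    for E in range(1, max_epochs + 1):
        w = perceptron_train(X_train, y_train, num_epochs=E)
        err = training_error(w, X_dev, y_dev)   # 0-1 error, but on held-out data
        if err < best_err:
            best_E, best_err, best_w = E, err, w
    return best_E, best_w
```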


Page 10: Machine Learning (CSE 446): The Perceptron Algorithm

When does the perceptron not converge?


Page 11: Machine Learning (CSE 446): The Perceptron Algorithm

Linear Separability

A dataset D = ⟨(x_n, y_n)⟩_{n=1}^{N} is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, y_n = sign(w · x_n + b).
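As a concrete (non-)example, and as an answer to the previous slide's question, the XOR-style dataset is not linearly separable, so the perceptron never stops making mistakes on it. The code below is an illustrative sketch with made-up variable names, not material from the lecture.

```python
import numpy as np

def mistake_free(w, X, y):
    """True iff sign(w . x_n) = y_n for every point (bias already appended to X)."""
    return bool(np.all(np.where(X @ w >= 0, 1, -1) == y))

# XOR-style labels in 2D, with a constant-1 feature appended to absorb b.
X_xor = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y_xor = np.array([-1, 1, 1, -1])

# Run many perceptron passes; the weights never become mistake-free,
# consistent with XOR not being linearly separable.
w = np.zeros(3)
for _ in range(1000):
    for i in range(len(X_xor)):
        if (1 if X_xor[i] @ w >= 0 else -1) != y_xor[i]:
            w = w + y_xor[i] * X_xor[i]
print(mistake_free(w, X_xor, y_xor))   # False
```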


Page 12: Machine Learning (CSE 446): The Perceptron Algorithm

The Perceptron Convergence Theorem

- Again taking b = 0 (absorbing it into w).

- Margin definition: Suppose the data are linearly separable, and all data points are at least γ away from the separating hyperplane. Precisely, there exists a w∗, which we can assume to be of unit norm (without loss of generality), such that for all (x, y) ∈ D,

y (w∗ · x) ≥ γ.

γ is the margin.

Theorem (Novikoff, 1962): Suppose the inputs are bounded such that ‖x‖ ≤ R. Assume our data D is linearly separable with margin γ. Then the perceptron algorithm will make at most R²/γ² mistakes.

(This implies that the algorithm takes at most O(N R²/γ²) steps in total, after which time w_t never changes.)
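To make these quantities concrete, here is a sketch (entirely my own construction: the synthetic data, the 0.3 margin threshold, and the pass count are assumptions) that computes R and γ for a separable dataset with a known unit-norm separator w∗, and checks that the perceptron's total number of updates stays below R²/γ².

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic separable data with a known unit-norm separator w_star (b = 0).
d = 5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(200, d))
X = X[np.abs(X @ w_star) > 0.3]          # keep only points at least 0.3 from the hyperplane
y = np.where(X @ w_star >= 0, 1, -1)

R = np.max(np.linalg.norm(X, axis=1))    # ||x|| <= R for every input
gamma = np.min(y * (X @ w_star))         # margin: y (w_star . x) >= gamma for all (x, y)
bound = (R / gamma) ** 2                 # Novikoff's bound R^2 / gamma^2

# Count perceptron updates (mistakes) over repeated passes through the data.
w, mistakes = np.zeros(d), 0
for _ in range(100):
    for i in range(len(X)):
        if (1 if X[i] @ w >= 0 else -1) != y[i]:
            w = w + y[i] * X[i]
            mistakes += 1

print(f"{mistakes} mistakes <= bound {bound:.1f}: {mistakes <= bound}")
```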


Page 13: Machine Learning (CSE 446): The Perceptron Algorithm

Proof of the “Mistake Lemma”

- Let M_t be the number of mistakes made up to time t. If we make a mistake using w_t on (x, y), then observe that y w_t · x ≤ 0.

- Suppose we make a mistake at time t:

w∗ · w_t = w∗ · (w_{t−1} + y x) = w∗ · w_{t−1} + y (w∗ · x) ≥ w∗ · w_{t−1} + γ.

Since w_0 = 0 and w∗ · w_t grows by at least γ every time we make a mistake, this implies that w∗ · w_t ≥ γ M_t.

- Also, if we make a mistake at time t (using that y w_{t−1} · x ≤ 0),

‖w_t‖² = ‖w_{t−1}‖² + 2 y w_{t−1} · x + ‖x‖² ≤ ‖w_{t−1}‖² + 0 + ‖x‖² ≤ ‖w_{t−1}‖² + R².

Since ‖w_t‖² grows by at most R² on every mistake, this implies ‖w_t‖² ≤ R² M_t, and so ‖w_t‖ ≤ R √M_t.

- Now we have that:

γ M_t ≤ w∗ · w_t ≤ ‖w∗‖ ‖w_t‖ ≤ R √M_t.

Dividing both sides by √M_t and squaring gives M_t ≤ R²/γ², which completes the proof.

Page 14: Machine Learning (CSE 446): The Perceptron Algorithm

References I

M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, MA, 1969.

A. B. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, 1962.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.
