Machine Learning (CSE 446):Perceptron Convergence
Sham M. Kakade © 2018
University of Washington
[email protected]
Review
Happy Medium?
Decision trees (that aren’t too deep): use relatively few features to classify.
K-nearest neighbors: all features weighted equally.
Today: use all features, but weight them.
For today’s lecture, assume that y ∈ {−1,+1} instead of {0, 1}, and that x ∈ Rd.
Inspiration from Neurons
Image from Wikimedia Commons.
Input signals come in through dendrites, output signal passes out through the axon.
Perceptron Learning Algorithm

Data: D = ⟨(xn, yn)⟩, n = 1, . . . , N; number of epochs E
Result: weights w and bias b
initialize: w = 0 and b = 0;
for e ∈ {1, . . . , E} do
    for n ∈ {1, . . . , N}, in random order do
        # predict
        y = sign(w · xn + b);
        if y ≠ yn then
            # update
            w ← w + yn · xn;
            b ← b + yn;
        end
    end
end
return w, b

Algorithm 1: PerceptronTrain
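A minimal Python sketch of Algorithm 1 (PerceptronTrain), assuming each example is a (feature-list, ±1 label) pair:

```python
import random

def perceptron_train(data, epochs):
    """Train a perceptron. `data` is a list of (x, y) pairs where x is a
    list of floats and y is +1 or -1. Returns weights w and bias b."""
    d = len(data[0][0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        order = list(range(len(data)))
        random.shuffle(order)  # visit examples in random order
        for n in order:
            x, y = data[n]
            activation = sum(wj * xj for wj, xj in zip(w, x)) + b
            y_hat = 1 if activation > 0 else -1  # predict
            if y_hat != y:
                # update: move the boundary toward the mistaken example
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b = b + y
    return w, b
```

On linearly separable data, enough epochs leave a weight vector that classifies every training example correctly, which is exactly what the convergence theorem later in the lecture bounds.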
Linear Decision Boundary
w·x + b = 0
activation = w·x + b
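As a small illustration (the numbers here are chosen arbitrarily), the sign of the activation tells us which side of the hyperplane w·x + b = 0 a point falls on:

```python
def activation(w, b, x):
    # signed score: zero exactly on the decision boundary w·x + b = 0
    return sum(wj * xj for wj, xj in zip(w, x)) + b

w, b = [2.0, -1.0], 0.5
print(activation(w, b, [1.0, 1.0]))   # 1.5 -> positive side, predict +1
print(activation(w, b, [0.0, 2.0]))   # -1.5 -> negative side, predict -1
```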
Interpretation of Weight Values
What does it mean when . . .
▶ w1 = 100?
▶ w2 = −1?
▶ w3 = 0?
What if ‖w‖ is “large”?
Today
What would we like to do?
▶ Optimization problem: find a classifier which minimizes the classification loss.
▶ The perceptron algorithm can be viewed as trying to do this...
▶ Problem: in general, this is an NP-Hard problem.
▶ Let’s still try to understand it...
This is the general approach of loss function minimization: find parameters which make our training error ‘small’ (and which also generalize).
When does the perceptron not converge?
Linear Separability
A dataset D = ⟨(xn, yn)⟩, n = 1, . . . , N, is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, yn = sign (w · xn + b).
If the data are separable, we can (without loss of generality) rescale so that:

▶ the “margin is 1”: for all (x, y),

y (w∗ · x) ≥ 1

(let w∗ be the smallest-norm vector with margin 1).
▶ CIML: assumes w∗ is unit length and instead scales the “1” above.
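To make the rescaling concrete, here is a small sketch (the helper names are my own): given any unit-norm separator u whose minimum margin γ over the data is positive, the vector w∗ = u/γ has margin 1, and its norm is 1/γ.

```python
def min_margin(u, data):
    # smallest value of y * (u · x) over the dataset
    return min(y * sum(uj * xj for uj, xj in zip(u, x)) for x, y in data)

def rescale_to_margin_one(u, data):
    """Given a separator u with positive minimum margin gamma, return
    w* = u / gamma, which satisfies y * (w* · x) >= 1 for every example.
    If u has unit norm, then ||w*|| = 1 / gamma."""
    gamma = min_margin(u, data)
    assert gamma > 0, "u does not separate the data"
    return [uj / gamma for uj in u]
```

So a large ‖w∗‖ in the convergence theorem below corresponds to a small geometric margin, i.e., a harder dataset.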
Perceptron Convergence
Due to Rosenblatt (1958).
Theorem: Suppose the data are scaled so that ‖xi‖₂ ≤ 1. Assume D is linearly separable, and let w∗ be a separator with “margin 1”. Then the perceptron algorithm will converge in at most ‖w∗‖² epochs.
▶ Let wt be the parameter at “iteration” t; w0 = 0.
▶ “A Mistake Lemma”: at iteration t,

if we do not make a mistake, ‖wt+1 − w∗‖² = ‖wt − w∗‖²;

if we do make a mistake, ‖wt+1 − w∗‖² ≤ ‖wt − w∗‖² − 1.
▶ The theorem directly follows from this lemma. Why?
Proof of the “Mistake Lemma”
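One way the argument can go, combining the update rule with the margin and norm assumptions (a sketch; the bias b is dropped, as in the theorem statement):

```latex
% Suppose iteration t makes a mistake on example (x, y), so w_{t+1} = w_t + y x. Then
\|w_{t+1} - w^*\|^2
  = \|(w_t - w^*) + y x\|^2
  = \|w_t - w^*\|^2 + 2 y \, x \cdot (w_t - w^*) + y^2 \|x\|^2 .
% A mistake means y (w_t \cdot x) \le 0, and the margin assumption gives y (w^* \cdot x) \ge 1, so
2 y \, x \cdot (w_t - w^*) = 2 y (w_t \cdot x) - 2 y (w^* \cdot x) \le 0 - 2 = -2 .
% Using y^2 = 1 and \|x\|^2 \le 1:
\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 - 2 + 1 = \|w_t - w^*\|^2 - 1 .
```

If no mistake is made, w remains unchanged and the distance is unchanged. Since the squared distance starts at ‖w0 − w∗‖² = ‖w∗‖², decreases by at least 1 on every mistake, and can never go below 0, at most ‖w∗‖² mistakes can ever occur.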
Voted Perceptron
▶ Suppose w1, w4, w10, w11, . . . are the parameters right after we updated (i.e., right after we made a mistake).
▶ Idea: instead of using the final wt to classify, we classify with a majority vote using w1, w4, w10, w11, . . .
▶ Why?
See CIML for details: implementation and variants.
Let w(e,n) and b(e,n) be the parameters after updating based on the nth example on epoch e.

y = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign( w(e,n) · x + b(e,n) ) )
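One way to implement this vote (a sketch, with hypothetical names): rather than storing a (w, b) pair for every single (e, n), keep each post-update parameter vector together with a count c of how many example visits it survived; the double sum above then becomes a count-weighted majority vote.

```python
def voted_predict(snapshots, x):
    """Majority-vote prediction. `snapshots` is a list of (w, b, c) triples:
    parameters right after an update, and the count c of example visits
    during which those parameters were in effect."""
    votes = 0
    for w, b, c in snapshots:
        a = sum(wj * xj for wj, xj in zip(w, x)) + b
        votes += c * (1 if a > 0 else -1)  # c copies of this snapshot's vote
    return 1 if votes > 0 else -1
```

Parameter vectors that survived many visits (i.e., made few mistakes) dominate the vote, which is the intuition behind why voting can beat using only the final wt.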
References I
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.