Machine Learning (CSE 446):Perceptron Convergence
Sham M. Kakade © 2018
University of Washington
[email protected]
Review
Happy Medium?
Decision trees (that aren’t too deep): use relatively few features to classify.
K-nearest neighbors: all features weighted equally.
Today: use all features, but weight them.
For today’s lecture, assume that y ∈ {−1,+1} instead of {0, 1}, and that x ∈ Rd.
Inspiration from Neurons
Image from Wikimedia Commons.
Input signals come in through dendrites, output signal passes out through the axon.
Perceptron Learning Algorithm

Data: D = ⟨(xn, yn)⟩, n = 1, . . . , N; number of epochs E
Result: weights w and bias b
initialize: w = 0 and b = 0;
for e ∈ {1, . . . , E} do
    for n ∈ {1, . . . , N}, in random order do
        # predict
        y = sign(w · xn + b);
        if y ≠ yn then
            # update
            w ← w + yn · xn;
            b ← b + yn;
        end
    end
end
return w, b

Algorithm 1: PerceptronTrain
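A minimal Python sketch of Algorithm 1 (PerceptronTrain), assuming each example is a (feature-list, ±1 label) pair:

```python
import random

def perceptron_train(data, epochs):
    """Train a perceptron. `data` is a list of (x, y) pairs where x is a
    list of floats and y is +1 or -1. Returns weights w and bias b."""
    d = len(data[0][0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        order = list(range(len(data)))
        random.shuffle(order)  # visit examples in random order
        for n in order:
            x, y = data[n]
            activation = sum(wj * xj for wj, xj in zip(w, x)) + b
            y_hat = 1 if activation > 0 else -1  # predict
            if y_hat != y:
                # update: move the boundary toward the mistaken example
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b = b + y
    return w, b
```

On linearly separable data, enough epochs leave a weight vector that classifies every training example correctly, which is exactly what the convergence theorem later in the lecture bounds.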
Linear Decision Boundary
w·x + b = 0
activation = w·x + b
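As a small illustration (the numbers here are chosen arbitrarily), the sign of the activation tells us which side of the hyperplane w·x + b = 0 a point falls on:

```python
def activation(w, b, x):
    # signed score: zero exactly on the decision boundary w·x + b = 0
    return sum(wj * xj for wj, xj in zip(w, x)) + b

w, b = [2.0, -1.0], 0.5
print(activation(w, b, [1.0, 1.0]))   # 1.5 -> positive side, predict +1
print(activation(w, b, [0.0, 2.0]))   # -1.5 -> negative side, predict -1
```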
Interpretation of Weight Values
What does it mean when . . .
▶ w1 = 100?
▶ w2 = −1?
▶ w3 = 0?
What if ‖w‖ is “large”?
Today
What would we like to do?
▶ Optimization problem: find a classifier which minimizes the classification loss.
▶ The perceptron algorithm can be viewed as trying to do this...
▶ Problem: in general, this is an NP-Hard problem.
▶ Let’s still try to understand it...
This is the general approach of loss function minimization: find parameters which make our training error ‘small’ (and which also generalize).
When does the perceptron not converge?
Linear Separability
A dataset D = ⟨(xn, yn)⟩, n = 1, . . . , N, is linearly separable if there exists some linear classifier (defined by w, b) such that, for all n, yn = sign (w · xn + b).
If the data are separable, we can (without loss of generality) rescale so that:

▶ the “margin is 1”: for all (x, y),

y (w∗ · x) ≥ 1

(let w∗ be the smallest-norm vector with margin 1).
▶ CIML: assumes w∗ is unit length and instead scales the “1” above.
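To make the rescaling concrete, here is a small sketch (the helper names are my own): given any unit-norm separator u whose minimum margin γ over the data is positive, the vector w∗ = u/γ has margin 1, and its norm is 1/γ.

```python
def min_margin(u, data):
    # smallest value of y * (u · x) over the dataset
    return min(y * sum(uj * xj for uj, xj in zip(u, x)) for x, y in data)

def rescale_to_margin_one(u, data):
    """Given a separator u with positive minimum margin gamma, return
    w* = u / gamma, which satisfies y * (w* · x) >= 1 for every example.
    If u has unit norm, then ||w*|| = 1 / gamma."""
    gamma = min_margin(u, data)
    assert gamma > 0, "u does not separate the data"
    return [uj / gamma for uj in u]
```

So a large ‖w∗‖ in the convergence theorem below corresponds to a small geometric margin, i.e., a harder dataset.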
Perceptron Convergence
Due to Rosenblatt (1958).
Theorem: Suppose the data are scaled so that ‖xi‖₂ ≤ 1. Assume D is linearly separable, and let w∗ be a separator with “margin 1”. Then the perceptron algorithm will converge in at most ‖w∗‖² epochs.
▶ Let wt be the parameter at “iteration” t; w0 = 0.
▶ “A Mistake Lemma”: at iteration t,

if we do not make a mistake, ‖wt+1 − w∗‖² = ‖wt − w∗‖²;

if we do make a mistake, ‖wt+1 − w∗‖² ≤ ‖wt − w∗‖² − 1.
▶ The theorem directly follows from this lemma. Why?
Proof of the “Mistake Lemma”
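One way the argument can go, combining the update rule with the margin and norm assumptions (a sketch; the bias b is dropped, as in the theorem statement):

```latex
% Suppose iteration t makes a mistake on example (x, y), so w_{t+1} = w_t + y x. Then
\|w_{t+1} - w^*\|^2
  = \|(w_t - w^*) + y x\|^2
  = \|w_t - w^*\|^2 + 2 y \, x \cdot (w_t - w^*) + y^2 \|x\|^2 .
% A mistake means y (w_t \cdot x) \le 0, and the margin assumption gives y (w^* \cdot x) \ge 1, so
2 y \, x \cdot (w_t - w^*) = 2 y (w_t \cdot x) - 2 y (w^* \cdot x) \le 0 - 2 = -2 .
% Using y^2 = 1 and \|x\|^2 \le 1:
\|w_{t+1} - w^*\|^2 \le \|w_t - w^*\|^2 - 2 + 1 = \|w_t - w^*\|^2 - 1 .
```

If no mistake is made, w remains unchanged and the distance is unchanged. Since the squared distance starts at ‖w0 − w∗‖² = ‖w∗‖², decreases by at least 1 on every mistake, and can never go below 0, at most ‖w∗‖² mistakes can ever occur.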
Voted Perceptron
▶ Suppose w1, w4, w10, w11, . . . are the parameters right after we updated (i.e., right after we made a mistake).
▶ Idea: instead of using the final wt to classify, we classify with a majority vote using w1, w4, w10, w11, . . .
▶ Why?
See CIML for details: implementation and variants.
Let w(e,n) and b(e,n) be the parameters after updating based on the nth example on epoch e.

y = sign( ∑_{e=1}^{E} ∑_{n=1}^{N} sign( w(e,n) · x + b(e,n) ) )
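One way to implement this vote (a sketch, with hypothetical names): rather than storing a (w, b) pair for every single (e, n), keep each post-update parameter vector together with a count c of how many example visits it survived; the double sum above then becomes a count-weighted majority vote.

```python
def voted_predict(snapshots, x):
    """Majority-vote prediction. `snapshots` is a list of (w, b, c) triples:
    parameters right after an update, and the count c of example visits
    during which those parameters were in effect."""
    votes = 0
    for w, b, c in snapshots:
        a = sum(wj * xj for wj, xj in zip(w, x)) + b
        votes += c * (1 if a > 0 else -1)  # c copies of this snapshot's vote
    return 1 if votes > 0 else -1
```

Parameter vectors that survived many visits (i.e., made few mistakes) dominate the vote, which is the intuition behind why voting can beat using only the final wt.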
References I
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408, 1958.