Introduction to Machine Learning, Fall 2013 – Perceptron (6). Prof. Koby Crammer, Department of Electrical Engineering, Technion.


Page 1:

Introduction to Machine Learning, Fall 2013

Perceptron (6)

Prof. Koby Crammer

Department of Electrical Engineering

Technion

Page 2:

Online Learning

Tyrannosaurus rex

Page 3:

Online Learning

Triceratops

Page 4:

Online Learning

Tyrannosaurus rex

Velociraptor

Page 5:

Formal Setting – Binary Classification

• Instances – Images, Sentences

• Labels – Parse trees, Names

• Prediction rule – Linear prediction rules

• Loss – No. of mistakes

Page 6:

Online Framework

• Initialize the classifier

• The algorithm works in rounds

• On round t the online algorithm:
  – Receives an input instance
  – Outputs a prediction
  – Receives a feedback label
  – Computes the loss
  – Updates the prediction rule

• Goal: suffer small cumulative loss
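A minimal sketch of one round-by-round loop implementing this protocol (the OnlineLearner interface, the zero-one loss, and the stream variable below are illustrative choices of mine, not taken from the slides):

```python
# Sketch of the online protocol: predict, receive feedback, suffer loss, update.
class OnlineLearner:
    def predict(self, x):            # output a prediction for instance x
        raise NotImplementedError
    def update(self, x, y, y_hat):   # update the prediction rule after feedback
        raise NotImplementedError

def run_online(learner, stream):
    cumulative_loss = 0
    for t, (x, y) in enumerate(stream):   # round t
        y_hat = learner.predict(x)        # output a prediction
        loss = int(y_hat != y)            # zero-one loss: count a mistake
        cumulative_loss += loss           # goal: keep this small
        learner.update(x, y, y_hat)       # update the prediction rule
    return cumulative_loss
```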

Page 7:

Why Online Learning?

• Fast

• Memory efficient – processes one example at a time

• Simple to implement

• Formal guarantees – mistake bounds

• Online-to-batch conversions

• No statistical assumptions

• Adaptive

• Not as good as a well-designed batch algorithm

Page 8:

Update Rules

• Online algorithms are based on an update rule which defines w_{t+1} from w_t (and possibly other information)

• Linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t)

• Some Update Rules:

– Perceptron (Rosenblatt)

– ALMA (Gentile)

– ROMMA (Li & Long)

– NORMA (Kivinen et al.)

– MIRA (Crammer & Singer)

– EG (Littlestone and Warmuth)

– Bregman Based (Warmuth)

– Numerous Online Convex Programming algorithms

Page 9:

Today

• The Perceptron Algorithm:
  – Agmon 1954
  – Rosenblatt 1958-1962
  – Block 1962, Novikoff 1962
  – Minsky & Papert 1969
  – Freund & Schapire 1999
  – Blum & Dunagan 2002

Page 10:

The Perceptron Algorithm

• Prediction: the sign of w_t · x_t

• If no mistake:
  – Do nothing

• If mistake:
  – Update: w_{t+1} = w_t + y_t x_t

• Margin after update: y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + ||x_t||^2
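A small NumPy sketch of the algorithm as described above: predict with the sign of w_t · x_t, and add y_t x_t to the weights only on a mistake. The function name and the stream format are mine:

```python
import numpy as np

def perceptron(stream, dim):
    """Run the Perceptron on a stream of (x, y) pairs, y in {-1, +1}."""
    w = np.zeros(dim)                              # initialize the classifier
    mistakes = 0
    for x, y in stream:
        y_hat = 1.0 if np.dot(w, x) >= 0 else -1.0  # prediction: sign(w . x)
        if y_hat != y:                             # mistake
            w = w + y * x                          # update: w_{t+1} = w_t + y_t x_t
            mistakes += 1
        # no mistake: do nothing
    return w, mistakes
```

Note that the update makes the margin on the current example grow by ||x_t||^2, matching the "margin after update" line above.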

Page 11:

Geometrical Interpretation

Page 12:

Relative Loss Bound

• For any competitor prediction function
• We bound the loss suffered by the algorithm by the loss suffered by the competitor

(Annotations on the bound: cumulative loss suffered by the algorithm; sequence of prediction functions; cumulative loss of the competitor)
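In symbols (my notation, not taken from the slides: ℓ_t is the loss on round t, w_t the algorithm's prediction function, u any fixed competitor), a relative loss bound has the form:

```latex
% The algorithm's cumulative loss is bounded by the competitor's cumulative
% loss plus a regret term:
\sum_{t=1}^{T} \ell_t(w_t) \;\le\; \sum_{t=1}^{T} \ell_t(u) \;+\; \mathrm{Regret}_T(u)
```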

Page 13:

Relative Loss Bound

• For any competitor prediction function
• We bound the loss suffered by the algorithm by the loss suffered by the competitor

  – Inequality: possibly a large gap
  – Regret: extra loss
  – Competitiveness: ratio

Page 14:

Relative Loss Bound

• For any competitor prediction function
• We bound the loss suffered by the algorithm by the loss suffered by the competitor

(Both cumulative losses grow with T; the extra term is constant)

Page 15:

Relative Loss Bound

• For any competitor prediction function – the best prediction function in hindsight for the data sequence
• We bound the loss suffered by the algorithm by the loss suffered by the competitor

Page 16:

Remarks

• If the input is inseparable, then the problem of finding a separating hyperplane which attains fewer than M errors is NP-hard (Open Hemisphere)

• Obtaining a zero-one loss bound with a unit competitiveness ratio is as hard as finding a constant-factor approximation for the Open Hemisphere problem

• Instead, we bound the number of mistakes the Perceptron makes by the hinge loss of any competitor

Page 17:

Definitions

• Any competitor: the parameter vector can be chosen using the input data

• The parameterized hinge loss of the competitor on an example

• True hinge loss

• 1-norm of the hinge loss
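Written out explicitly (my notation, following the standard definitions; u is the competitor, gamma > 0 the margin parameter):

```latex
% Parameterized hinge loss of u on the example (x_t, y_t):
\ell_\gamma\big(u;(x_t,y_t)\big) = \max\big\{0,\; \gamma - y_t\,(u \cdot x_t)\big\}

% The true hinge loss is the special case gamma = 1.

% 1-norm of the hinge loss over the whole sequence:
L_1(u) = \sum_{t=1}^{T} \ell_\gamma\big(u;(x_t,y_t)\big)
```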

Page 18:

Geometrical Assumption

• All examples are bounded in a ball of radius R

Page 19:

Perceptron’s Mistake Bound

• Bounds:

• If the sample is separable then
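The formulas on this slide did not survive extraction; a standard statement consistent with the definitions above (my reconstruction, with ||x_t|| <= R for all t, M the number of Perceptron mistakes, u any competitor, and L_1(u) its cumulative hinge loss at margin 1) is:

```latex
% General case, for any competitor u:
M \;\le\; \big(R\,\|u\| + \sqrt{L_1(u)}\big)^{2}

% Separable case: if some unit-norm u attains margin gamma on every example,
% i.e. y_t (u \cdot x_t) \ge \gamma > 0, then the competitor u / \gamma has
% zero hinge loss at margin 1 and the bound reduces to
M \;\le\; \frac{R^{2}}{\gamma^{2}}
```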

Page 20:

Proof - Intuition

• Two views:
  – The angle between w_t and the competitor decreases with t
  – The following sum is fixed

• As we make more mistakes, our solution gets better

[FS99, SS05]
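For the separable case, the classical argument behind these two views [FS99] can be written as follows (my notation: u is a unit-norm competitor with margin gamma on every example, M the number of mistakes, ||x_t|| <= R):

```latex
% On a mistake round w_{t+1} = w_t + y_t x_t, so the projection onto u grows
% by at least gamma, i.e. the angle between w_t and u shrinks:
u \cdot w_{t+1} = u \cdot w_t + y_t\,(u \cdot x_t) \ge u \cdot w_t + \gamma
\;\Longrightarrow\; u \cdot w_{T+1} \ge M\gamma

% while the norm grows slowly, since y_t (w_t \cdot x_t) \le 0 on a mistake:
\|w_{t+1}\|^2 = \|w_t\|^2 + 2 y_t\,(w_t \cdot x_t) + \|x_t\|^2 \le \|w_t\|^2 + R^2
\;\Longrightarrow\; \|w_{T+1}\|^2 \le M R^2

% Combining the two with the Cauchy-Schwarz inequality (\|u\| = 1):
M\gamma \le u \cdot w_{T+1} \le \|w_{T+1}\| \le R\sqrt{M}
\;\Longrightarrow\; M \le R^2 / \gamma^2
```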

Page 21:

Proof

• Define the potential:

• Bound its cumulative sum from above and below

[C04]

Page 22:

Proof

• Bound from above:

(Annotations on the derivation: telescopic sum; non-negative; zero vector)

Page 23:

Proof

• Bound from below:
  – No error on the t-th round
  – Error on the t-th round

Page 24:

Proof

• We bound each term:

Page 25:

Proof

• Bound from below:
  – No error on the t-th round
  – Error on the t-th round

• Cumulative bound:

Page 26:

Proof

• Putting both bounds together:

• We use the first degree of freedom (and scale):

• Bound:

Page 27:

Proof

• General bound:

• Choose:

• Simple bound:

Objective of SVM
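The equations on this slide were lost in extraction; below is a standard derivation of a bound of this type using the quantities defined earlier (M mistakes, radius R, competitor u, cumulative hinge loss L_1(u) at margin 1). The last line is how I read the "Objective of SVM" annotation: the quantity being minimized over u has the same norm-plus-hinge-loss shape as the SVM objective; this reading is mine, not a quote from the slide.

```latex
% Lower and upper bounds on u . w_{T+1}, summing over the M mistake rounds:
u \cdot w_{T+1} \ge \sum_{t \in \text{mistakes}} y_t\,(u \cdot x_t) \ge M - L_1(u),
\qquad
u \cdot w_{T+1} \le \|u\|\,\|w_{T+1}\| \le \|u\|\,R\sqrt{M}

% Hence M - L_1(u) \le R\|u\|\sqrt{M}; solving this quadratic in \sqrt{M} gives
M \le L_1(u) + R\|u\|\sqrt{L_1(u)} + R^2\|u\|^2
  \le \big(R\|u\| + \sqrt{L_1(u)}\big)^2

% Minimizing a norm term plus a cumulative hinge-loss term over u is exactly
% the shape of the SVM objective:
\min_{u}\;\tfrac{1}{2}\|u\|^2 + C \sum_{t} \ell\big(u;(x_t,y_t)\big)
```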

Page 28:

Proof

• Better bound: optimize the value of the scaling parameter

Page 29:

Remarks

• The bound does not depend on the dimension of the feature vector

• The bound holds for all sequences. It is not tight for most real-world data

• But there exists a setting for which it is tight – the worst case

Page 30:

Three Bounds

[Figure: plot comparing the three bounds; the numeric axis labels did not survive extraction]

Page 31:

Separable Case

• Assume there exists a competitor that separates all examples with a positive margin. Then all bounds are equivalent

• The Perceptron makes a finite number of mistakes until convergence (not necessarily to that competitor)

Page 32:

Separable Case – Other Quantities

• Use the 1st (parameterization) degree of freedom

• Scale the competitor such that

• Define

• The bound becomes
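With one standard choice of these quantities (scale the competitor so that every example has margin at least 1 and define gamma as the inverse of its norm; this is my reconstruction of the lost formulas, following the usual convention), the bound reads:

```latex
M \;\le\; \frac{R^{2}}{\gamma^{2}}
```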

Page 33:

Separable Case - Illustration

Page 34:

Separable Case – Illustration

The Perceptron will make more mistakes

Finding a separating hyperplane is more difficult

Page 35:

Inseparable Case

• A difficult problem implies a large value of the competitor's cumulative hinge loss

• In this case the Perceptron will make a large number of mistakes

Page 36:

Perceptron Algorithm

• Extremely easy to implement

• Relative loss bounds for separable and inseparable cases; minimal assumptions (not i.i.d.)

• Easy to convert to a well-performing batch algorithm (under i.i.d. assumptions) – a sketch of one such conversion follows below

• Quantities in the bound are not compatible: number of mistakes vs. hinge loss

• Margin of examples is ignored by the update

• Same update for the separable and the inseparable case
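One common online-to-batch conversion mentioned above is averaging: run the Perceptron over the training set and return the average of all intermediate weight vectors as the batch classifier. The averaging scheme is a standard choice, not necessarily the one used in the course:

```python
import numpy as np

def averaged_perceptron(data, dim, epochs=1):
    """Online-to-batch conversion: train with Perceptron updates and
    return the average of the weight vectors over all rounds."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    rounds = 0
    for _ in range(epochs):
        for x, y in data:                 # y in {-1, +1}
            if y * np.dot(w, x) <= 0:     # mistake (or zero margin)
                w = w + y * x             # Perceptron update
            w_sum += w                    # accumulate for averaging
            rounds += 1
    return w_sum / rounds                 # averaged weights for batch use
```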

Page 37:

Concluding Remarks

• Batch vs. Online
  – Two phases: training and then test
  – A single continuous process

• Statistical assumptions
  – A distribution over examples
  – All sequences

• Conversions
  – Online -> Batch
  – Batch -> Online