Winnow vs Perceptron
TRANSCRIPT
8/13/2019

SIMS 290-2: Applied Natural Language Processing
Barbara Rosario
October 4, 2004
Today
Algorithms for Classification
Binary classification
  Perceptron
  Winnow
  Support Vector Machines (SVM)
  Kernel Methods
Multi-class classification
  Decision Trees
  Naïve Bayes
  K nearest neighbor
Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs. not urgent)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multiway problem like a binary one: one class versus all the others, for all classes
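The one-versus-all reduction described above can be sketched in a few lines. This is a minimal illustration, not the lecture's method: the "binary classifier" here is a toy centroid scorer standing in for Perceptron/Winnow/SVM, and the data points and labels are invented.

```python
# One-versus-all: train one binary scorer per class (class c vs. everything
# else), then predict with the class whose scorer is most confident.

def train_one_vs_rest(examples, classes, train_binary):
    """Train one binary scorer per class on relabeled (+1 / -1) data."""
    scorers = {}
    for c in classes:
        relabeled = [(x, +1 if y == c else -1) for x, y in examples]
        scorers[c] = train_binary(relabeled)
    return scorers

def predict(scorers, x):
    """Pick the class whose binary scorer gives x the highest score."""
    return max(scorers, key=lambda c: scorers[c](x))

def centroid_scorer(relabeled):
    """Toy binary 'classifier': score by closeness to the positive centroid
    (a stand-in for a real linear classifier)."""
    pos = [x for x, y in relabeled if y == +1]
    centroid = tuple(sum(v) / len(pos) for v in zip(*pos))
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))

examples = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"),
            ((9, 0), "c"), ((8, 1), "c")]
scorers = train_one_vs_rest(examples, {"a", "b", "c"}, centroid_scorer)
print(predict(scorers, (1, 0)))  # -> a
```
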
Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: Train the classifier and predict the class for a new data item
Geometrically: find a separator
Linear versus Non Linear algorithms
Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary
Linearly separable data
[Figure: Class1 and Class2 points separated by a linear decision boundary]
Non linearly separable data
[Figure: Class1 and Class2 points that no straight line can separate]
Non linearly separable data
[Figure: Class1 and Class2 separated by a non-linear classifier boundary]
Linear versus Non Linear algorithms
Linear or non linear separable data?
We can find out only empirically.
Linear algorithms (algorithms that find a linear decision boundary)
When we think the data is linearly separable
Advantages
  – Simpler, less parameters
Disadvantages
  – High dimensional data (like for NLP) is usually not linearly separable
Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms also for non linear problems (see Kernel methods)
Linear versus Non Linear algorithms
Non Linear
When the data is non linearly separable
Advantages
  – More accurate
Disadvantages
  – More complicated, more parameters
Example: Kernel methods
Note: the distinction between linear and non linear applies also for multi-class classification (we'll see this later)
Simple linear algorithms
Perceptron and Winnow algorithm
Linear
Binary classification
Online (process data sequentially, one data point at the time)
Mistake driven
Simple single layer Neural Networks
Linear binary classification
Data: {(xi, yi)} i = 1..n
  xi in Rd (x is a vector in d-dimensional space): feature vector
  yi in {-1, +1}: label (class, category)
Classification rule:
  – if wTx + b > 0 then y = +1
  – if wTx + b < 0 then y = -1
From Gert Lanckriet, Statistical Learning Theory Tutorial
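The mistake-driven, online Perceptron that implements the rule above can be sketched as follows. The update on a mistake is the standard Perceptron rule (w <- w + y x, b <- b + y); the toy data set is invented for illustration.

```python
# Perceptron: predict sign(w.x + b); on each mistake (y * score <= 0),
# update w <- w + y*x and b <- b + y. Online and mistake driven.

def perceptron_train(data, epochs=10):
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w <- w + y x
                b += y                                      # b <- b + y
                mistakes += 1
        if mistakes == 0:   # converged: every point classified correctly
            break
    return w, b

def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented, linearly separable toy data.
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w, b = perceptron_train(data)
assert all(perceptron_predict(w, b, x) == y for x, y in data)
```

On linearly separable data the Perceptron is guaranteed to converge in a finite number of mistakes, which is why the training loop can stop at the first mistake-free epoch.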
Support Vector Machine (SVM)
Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin
[Figure: separating hyperplane wTx + b = 0, with margin boundaries wTxa + b = 1 and wTxb + b = -1 passing through the support vectors; M is the margin width]
From Gert Lanckriet, Statistical Learning Theory Tutorial
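The large-margin picture can be checked numerically. This is a sketch with hypothetical points, not the slide's own example; the formula margin = 2/||w|| for a hyperplane scaled so the closest points satisfy |wTx + b| = 1 is the standard result, used here without derivation.

```python
# For a hyperplane w.x + b = 0 scaled so the support vectors lie on
# w.x + b = +1 and w.x + b = -1, the margin width is 2 / ||w||.
from math import sqrt

w, b = (1.0, 1.0), -3.0      # hypothetical hyperplane x1 + x2 = 3
support_pos = (2.0, 2.0)     # lies on w.x + b = +1
support_neg = (1.0, 1.0)     # lies on w.x + b = -1

def f(x):
    """Signed value of the hyperplane function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

assert f(support_pos) == 1.0 and f(support_neg) == -1.0
margin = 2 / sqrt(sum(wi * wi for wi in w))
print(round(margin, 3))  # -> 1.414 (= sqrt(2), distance between the two
                         #    margin hyperplanes)
```
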
Support Vector Machine (SVM)
Text classification
Handwriting recognition
Computational biology (e.g., microarray data)
Face detection
Face expression recognition
Time series prediction
From Gert Lanckriet, Statistical Learning Theory Tutorial
Non Linear problem
[Figures: examples of data that no linear boundary separates]
Non Linear problem
Kernel methods
A family of non-linear algorithms
Transform the non linear problem into a linear one (in a different feature space)
Use linear algorithms to solve the linear problem in the new space
From Gert Lanckriet, Statistical Learning Theory Tutorial
Main intuition of kernel methods
(Copy here from blackboard)
Basic principle of kernel methods
Φ: Rd -> RD (D >> d)
wTΦ(x) + b = 0
X = [x z]    Φ(X) = [x2 z2 xz]
f(x) = sign(w1x2 + w2z2 + w3xz + b)
From Gert Lanckriet, Statistical Learning Theory Tutorial
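The quadratic map can be made concrete. In this sketch (the points, weights, and radius are invented for illustration), a circular boundary x² + z² = 1, which no line in the input space can express, becomes an ordinary hyperplane in the mapped space, so the linear rule sign(wTΦ(X) + b) separates the data.

```python
# The feature map from the slide: (x, z) -> (x^2, z^2, x*z). A circle
# x^2 + z^2 = r^2 in the input space is the hyperplane
# 1*x^2 + 1*z^2 + 0*xz - r^2 = 0 in the mapped space.

def phi(x, z):
    """Quadratic feature map R^2 -> R^3."""
    return (x * x, z * z, x * z)

def linear_classify(features, w, b):
    """sign(w . features + b): the linear rule applied in the mapped space."""
    s = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if s > 0 else -1

# Invented points: inside the unit circle is class -1, outside is +1.
points = [(0.1, 0.2, -1), (0.5, -0.3, -1), (2.0, 0.0, 1), (-1.5, 1.5, 1)]
w, b = (1.0, 1.0, 0.0), -1.0   # hyperplane x^2 + z^2 - 1 = 0
for x, z, label in points:
    assert linear_classify(phi(x, z), w, b) == label
```
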
Basic principle of kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps input into a high-dimensional feature space
Motivation: appropriate choice of Φ leads to linear separability
We can do this efficiently!
From Gert Lanckriet, Statistical Learning Theory Tutorial
Basic principle of kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher dimensional space
Multi-class classification
Given: some data items that belong to one of M possible classes
Task: Train the classifier and predict the class for a new data item
Geometrically: harder problem, no more simple geometry
Multi-class classification
[Figure: example with multiple classes]
Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)
(Some) Algorithms for Multi-class classification
Linear
  Parallel class separators: Decision Trees
  Non parallel class separators: Naïve Bayes
Non Linear
  K-nearest neighbors
Linear, parallel class separators (ex: Decision Trees)
[Figure]
Linear, NON parallel class separators (ex: Naïve Bayes)
[Figure]
Non Linear (ex: k Nearest Neighbor)
[Figure]
Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
  a leaf node, which indicates the value of the target attribute (class) of examples, or
  a decision node, which specifies some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node, which provides the classification of the instance
http://dms.irb.hr/tutorial/tut_dtrees.php
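That root-to-leaf walk can be sketched in a few lines. The tree below is the one ID3 induces from Mitchell's classic PlayTennis example (the training table follows); the dict encoding of decision nodes and leaves is my own illustration, not from the lecture.

```python
# Decision nodes are dicts {attribute: {value: subtree}}; leaves are class
# strings. Classification walks from the root, following the branch that
# matches each tested attribute, until a leaf is reached.

tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))   # the attribute this node tests
        node = node[attribute][example[attribute]]
    return node                        # a leaf: the predicted class

day = {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, day))  # -> No
```
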
Training Examples
Goal: learn when we can play Tennis and when we cannot

Day  Outlook   Temp  Humidity  Wind    Play Tennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Weak    Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Strong  Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Building Decision Trees
Splitting criterion
  Finding the features and the values to split on
    – for example, why test first "cts" and not "vs"?
    – why test on "cts 2" and not "cts 5"?
  Split that gives us the maximum information gain (or the maximum reduction of uncertainty)
Stopping criterion
  When all the elements at one node have the same class, no need to split further
In practice, one first builds a large tree and then one prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing, Manning and Schütze, for a good introduction
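The information-gain splitting criterion can be computed directly on the PlayTennis table: Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v). This sketch scores the Outlook attribute only, so the rows keep just the Outlook and Play columns.

```python
# Entropy and information gain, the uncertainty-reduction measure used to
# choose which attribute to split on.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class distribution of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Play"):
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# (Outlook, Play) pairs for days D1..D14 from the training table.
rows = [{"Outlook": o, "Play": p} for o, p in [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No")]]

print(round(information_gain(rows, "Outlook"), 3))  # -> 0.247
```

Outlook has the highest gain of the four attributes, which is why it ends up at the root of the learned tree.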
Decision Trees: Strengths
Decision trees are able to generate understandable rules
Decision trees perform classification without requiring much computation
Decision trees are able to handle both continuous and categorical variables
Decision trees provide a clear indication of which features are most important for prediction or classification
http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees: Weaknesses
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples
Decision trees can be computationally expensive to train (need to compare all possible splits)
Pruning is also expensive
Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space
http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees
Decision Trees in Weka
Naïve Bayes
More powerful than Decision Trees
[Figure: Decision Trees vs Naïve Bayes]
Naïve Bayes Models
Graphical Models: graph theory plus probability theory
Edges are conditional probabilities
[Diagram: node A with edges to nodes B and C]
P(A)
P(B|A)
P(C|A)
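In the Naïve Bayes model the class plays the role of A and the features the roles of B, C, ...: the graph factorizes the joint as P(class, f1..fn) = P(class) * prod_i P(fi | class), and classification picks the class maximizing that product. A minimal sketch; the priors and word probabilities below are invented numbers for illustration, not estimates from any corpus.

```python
# Naive Bayes classification via the factorized joint probability.
from math import prod

priors = {"spam": 0.4, "ham": 0.6}
likelihood = {                      # P(word | class), hypothetical values
    "spam": {"free": 0.05, "meeting": 0.005},
    "ham":  {"free": 0.01, "meeting": 0.03},
}

def classify(words):
    """argmax over classes of P(class) * prod P(word | class)."""
    scores = {c: priors[c] * prod(likelihood[c][w] for w in words)
              for c in priors}
    return max(scores, key=scores.get)

print(classify(["free"]))     # -> spam  (0.4*0.05 > 0.6*0.01)
print(classify(["meeting"]))  # -> ham   (0.6*0.03 > 0.4*0.005)
```
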
Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement
Modest space storage
Widely used because it works really well for text categorization
Linear, but non parallel decision boundaries
Naïve Bayes: Weaknesses
Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag of words model)
The words are independent of each other given the class: false
  – "President" is more likely to occur in a context that contains "election" than in a context that does not
Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
(But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases because often we are interested in classification accuracy and not in accurate probability estimations)
Naïve Bayes
Naïve Bayes in Weka
k Nearest Neighbor Classification
Nearest Neighbor classification rule: to classify a new object, find the object in the training set that is most similar. Then assign the category of this nearest neighbor.
K Nearest Neighbor (KNN): consult the k nearest neighbors. Decision based on the majority category of these neighbors. More robust than k = 1.
Example of a similarity measure often used in NLP is cosine similarity.
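The k-NN rule with cosine similarity can be sketched as follows. The tiny word-count vectors, labels, and the "sports"/"politics" categories are invented for illustration.

```python
# k-NN: rank training examples by cosine similarity to the query vector,
# then take a majority vote among the top k labels.
from math import sqrt

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_classify(query, training, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

training = [
    ((3, 0, 1), "sports"), ((2, 1, 0), "sports"), ((4, 0, 2), "sports"),
    ((0, 3, 1), "politics"), ((1, 4, 0), "politics"),
]
print(knn_classify((2, 0, 1), training, k=3))  # -> sports
```
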
K-Nearest Neighbor
[Figures: k-NN classification examples]