Winnow vs Perceptron
TRANSCRIPT
8/13/2019

SIMS 290-2: Applied Natural Language Processing
Barbara Rosario
October 4, 2004
Today
Algorithms for Classification
Binary classification
  Perceptron
  Winnow
  Support Vector Machines (SVM)
  Kernel Methods
Multi-class classification
  Decision Trees
  Naïve Bayes
  K nearest neighbor
Binary Classification: examples
Spam filtering (spam, not spam)
Customer service message classification (urgent vs. not urgent)
Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multiway problem like a binary one: one class versus all the others, for all classes
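The one-versus-all reduction described above can be sketched in a few lines. This is a minimal illustration, not the lecture's method: the "binary classifier" here is a toy centroid scorer standing in for Perceptron/Winnow/SVM, and the data points and labels are invented.

```python
# One-versus-all: train one binary scorer per class (class c vs. everything
# else), then predict with the class whose scorer is most confident.

def train_one_vs_rest(examples, classes, train_binary):
    """Train one binary scorer per class on relabeled (+1 / -1) data."""
    scorers = {}
    for c in classes:
        relabeled = [(x, +1 if y == c else -1) for x, y in examples]
        scorers[c] = train_binary(relabeled)
    return scorers

def predict(scorers, x):
    """Pick the class whose binary scorer gives x the highest score."""
    return max(scorers, key=lambda c: scorers[c](x))

def centroid_scorer(relabeled):
    """Toy binary 'classifier': score by closeness to the positive centroid
    (a stand-in for a real linear classifier)."""
    pos = [x for x, y in relabeled if y == +1]
    centroid = tuple(sum(v) / len(pos) for v in zip(*pos))
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, centroid))

examples = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"),
            ((9, 0), "c"), ((8, 1), "c")]
scorers = train_one_vs_rest(examples, {"a", "b", "c"}, centroid_scorer)
print(predict(scorers, (1, 0)))  # -> a
```
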
Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: Train the classifier and predict the class for a new data item
Geometrically: find a separator
Linear versus Non Linear algorithms
Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary
Linearly separable data
[Figure: Class1 and Class2 points separated by a linear decision boundary]
Non linearly separable data
[Figure: Class1 and Class2 points that no straight line can separate]
Non linearly separable data
[Figure: Class1 and Class2 separated by a non-linear classifier boundary]
Linear versus Non Linear algorithms
Linear or non linear separable data?
We can find out only empirically.
Linear algorithms (algorithms that find a linear decision boundary)
When we think the data is linearly separable
Advantages
  – Simpler, less parameters
Disadvantages
  – High dimensional data (like for NLP) is usually not linearly separable
Examples: Perceptron, Winnow, SVM
Note: we can use linear algorithms also for non linear problems (see Kernel methods)
Linear versus Non Linear algorithms
Non Linear
When the data is non linearly separable
Advantages
  – More accurate
Disadvantages
  – More complicated, more parameters
Example: Kernel methods
Note: the distinction between linear and non linear applies also for multi-class classification (we'll see this later)
Simple linear algorithms
Perceptron and Winnow algorithm
Linear
Binary classification
Online (process data sequentially, one data point at the time)
Mistake driven
Simple single layer Neural Networks
Linear binary classification
Data: {(xi, yi)} i = 1..n
  xi in Rd (x is a vector in d-dimensional space): feature vector
  yi in {-1, +1}: label (class, category)
Classification rule:
  – if wTx + b > 0 then y = +1
  – if wTx + b < 0 then y = -1
From Gert Lanckriet, Statistical Learning Theory Tutorial
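The mistake-driven, online Perceptron that implements the rule above can be sketched as follows. The update on a mistake is the standard Perceptron rule (w <- w + y x, b <- b + y); the toy data set is invented for illustration.

```python
# Perceptron: predict sign(w.x + b); on each mistake (y * score <= 0),
# update w <- w + y*x and b <- b + y. Online and mistake driven.

def perceptron_train(data, epochs=10):
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w <- w + y x
                b += y                                      # b <- b + y
                mistakes += 1
        if mistakes == 0:   # converged: every point classified correctly
            break
    return w, b

def perceptron_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented, linearly separable toy data.
data = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w, b = perceptron_train(data)
assert all(perceptron_predict(w, b, x) == y for x, y in data)
```

On linearly separable data the Perceptron is guaranteed to converge in a finite number of mistakes, which is why the training loop can stop at the first mistake-free epoch.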
Support Vector Machine (SVM)
Large Margin Classifier
Linearly separable case
Goal: find the hyperplane that maximizes the margin
[Figure: separating hyperplane wTx + b = 0, with margin boundaries wTxa + b = 1 and wTxb + b = -1 passing through the support vectors; M is the margin width]
From Gert Lanckriet, Statistical Learning Theory Tutorial
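The large-margin picture can be checked numerically. This is a sketch with hypothetical points, not the slide's own example; the formula margin = 2/||w|| for a hyperplane scaled so the closest points satisfy |wTx + b| = 1 is the standard result, used here without derivation.

```python
# For a hyperplane w.x + b = 0 scaled so the support vectors lie on
# w.x + b = +1 and w.x + b = -1, the margin width is 2 / ||w||.
from math import sqrt

w, b = (1.0, 1.0), -3.0      # hypothetical hyperplane x1 + x2 = 3
support_pos = (2.0, 2.0)     # lies on w.x + b = +1
support_neg = (1.0, 1.0)     # lies on w.x + b = -1

def f(x):
    """Signed value of the hyperplane function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

assert f(support_pos) == 1.0 and f(support_neg) == -1.0
margin = 2 / sqrt(sum(wi * wi for wi in w))
print(round(margin, 3))  # -> 1.414 (= sqrt(2), distance between the two
                         #    margin hyperplanes)
```
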
Support Vector Machine (SVM)
Text classification
Handwriting recognition
Computational biology (e.g., microarray data)
Face detection
Face expression recognition
Time series prediction
From Gert Lanckriet, Statistical Learning Theory Tutorial
Non Linear problem
[Figures: examples of data that no linear boundary separates]
Non Linear problem
Kernel methods
A family of non-linear algorithms
Transform the non linear problem into a linear one (in a different feature space)
Use linear algorithms to solve the linear problem in the new space
From Gert Lanckriet, Statistical Learning Theory Tutorial
Main intuition of kernel methods
(Copy here from blackboard)
Basic principle of kernel methods
Φ: Rd -> RD (D >> d)
wTΦ(x) + b = 0
X = [x z]    Φ(X) = [x2 z2 xz]
f(x) = sign(w1x2 + w2z2 + w3xz + b)
From Gert Lanckriet, Statistical Learning Theory Tutorial
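The quadratic map can be made concrete. In this sketch (the points, weights, and radius are invented for illustration), a circular boundary x² + z² = 1, which no line in the input space can express, becomes an ordinary hyperplane in the mapped space, so the linear rule sign(wTΦ(X) + b) separates the data.

```python
# The feature map from the slide: (x, z) -> (x^2, z^2, x*z). A circle
# x^2 + z^2 = r^2 in the input space is the hyperplane
# 1*x^2 + 1*z^2 + 0*xz - r^2 = 0 in the mapped space.

def phi(x, z):
    """Quadratic feature map R^2 -> R^3."""
    return (x * x, z * z, x * z)

def linear_classify(features, w, b):
    """sign(w . features + b): the linear rule applied in the mapped space."""
    s = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1 if s > 0 else -1

# Invented points: inside the unit circle is class -1, outside is +1.
points = [(0.1, 0.2, -1), (0.5, -0.3, -1), (2.0, 0.0, 1), (-1.5, 1.5, 1)]
w, b = (1.0, 1.0, 0.0), -1.0   # hyperplane x^2 + z^2 - 1 = 0
for x, z, label in points:
    assert linear_classify(phi(x, z), w, b) == label
```
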
Basic principle of kernel methods
Linear separability: more likely in high dimensions
Mapping: Φ maps input into a high-dimensional feature space
Motivation: appropriate choice of Φ leads to linear separability
We can do this efficiently!
From Gert Lanckriet, Statistical Learning Theory Tutorial
Basic principle of kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher dimensional space
Multi-class classification
Given: some data items that belong to one of M possible classes
Task: Train the classifier and predict the class for a new data item
Geometrically: harder problem, no more simple geometry
Multi-class classification
[Figure: example with multiple classes]
Multi-class classification: Examples
Author identification
Language identification
Text categorization (topics)
(Some) Algorithms for Multi-class classification
Linear
  Parallel class separators: Decision Trees
  Non parallel class separators: Naïve Bayes
Non Linear
  K-nearest neighbors
Linear, parallel class separators (ex: Decision Trees)
[Figure]
Linear, NON parallel class separators (ex: Naïve Bayes)
[Figure]
Non Linear (ex: k Nearest Neighbor)
[Figure]
Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
  a leaf node, which indicates the value of the target attribute (class) of examples, or
  a decision node, which specifies some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node, which provides the classification of the instance
http://dms.irb.hr/tutorial/tut_dtrees.php
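That root-to-leaf walk can be sketched in a few lines. The tree below is the one ID3 induces from Mitchell's classic PlayTennis example (the training table follows); the dict encoding of decision nodes and leaves is my own illustration, not from the lecture.

```python
# Decision nodes are dicts {attribute: {value: subtree}}; leaves are class
# strings. Classification walks from the root, following the branch that
# matches each tested attribute, until a leaf is reached.

tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))   # the attribute this node tests
        node = node[attribute][example[attribute]]
    return node                        # a leaf: the predicted class

day = {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, day))  # -> No
```
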
Training Examples
Goal: learn when we can play Tennis and when we cannot

Day  Outlook   Temp  Humidity  Wind    Play Tennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Weak    Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Strong  Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Building Decision Trees
Splitting criterion
  Finding the features and the values to split on
    – for example, why test first "cts" and not "vs"?
    – why test on "cts 2" and not "cts 5"?
  Split that gives us the maximum information gain (or the maximum reduction of uncertainty)
Stopping criterion
  When all the elements at one node have the same class, no need to split further
In practice, one first builds a large tree and then one prunes it back (to avoid overfitting)
See Foundations of Statistical Natural Language Processing, Manning and Schütze, for a good introduction
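The information-gain splitting criterion can be computed directly on the PlayTennis table: Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v). This sketch scores the Outlook attribute only, so the rows keep just the Outlook and Play columns.

```python
# Entropy and information gain, the uncertainty-reduction measure used to
# choose which attribute to split on.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the class distribution of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Play"):
    labels = [r[target] for r in rows]
    gain = entropy(labels)
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# (Outlook, Play) pairs for days D1..D14 from the training table.
rows = [{"Outlook": o, "Play": p} for o, p in [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No")]]

print(round(information_gain(rows, "Outlook"), 3))  # -> 0.247
```

Outlook has the highest gain of the four attributes, which is why it ends up at the root of the learned tree.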
Decision Trees: Strengths
Decision trees are able to generate understandable rules
Decision trees perform classification without requiring much computation
Decision trees are able to handle both continuous and categorical variables
Decision trees provide a clear indication of which features are most important for prediction or classification
http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees: Weaknesses
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples
Decision trees can be computationally expensive to train (need to compare all possible splits)
Pruning is also expensive
Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space
http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees
Decision Trees in Weka
Naïve Bayes
More powerful than Decision Trees
[Figure: Decision Trees vs Naïve Bayes]
Naïve Bayes Models
Graphical Models: graph theory plus probability theory
Edges are conditional probabilities
[Diagram: node A with edges to nodes B and C]
P(A)
P(B|A)
P(C|A)
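In the Naïve Bayes model the class plays the role of A and the features the roles of B, C, ...: the graph factorizes the joint as P(class, f1..fn) = P(class) * prod_i P(fi | class), and classification picks the class maximizing that product. A minimal sketch; the priors and word probabilities below are invented numbers for illustration, not estimates from any corpus.

```python
# Naive Bayes classification via the factorized joint probability.
from math import prod

priors = {"spam": 0.4, "ham": 0.6}
likelihood = {                      # P(word | class), hypothetical values
    "spam": {"free": 0.05, "meeting": 0.005},
    "ham":  {"free": 0.01, "meeting": 0.03},
}

def classify(words):
    """argmax over classes of P(class) * prod P(word | class)."""
    scores = {c: priors[c] * prod(likelihood[c][w] for w in words)
              for c in priors}
    return max(scores, key=scores.get)

print(classify(["free"]))     # -> spam  (0.4*0.05 > 0.6*0.01)
print(classify(["meeting"]))  # -> ham   (0.6*0.03 > 0.4*0.005)
```
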
Naïve Bayes: Strengths
Very simple model
Easy to understand
Very easy to implement
Modest space storage
Widely used because it works really well for text categorization
Linear, but non parallel decision boundaries
Naïve Bayes: Weaknesses
Naïve Bayes independence assumption has two consequences:
The linear ordering of words is ignored (bag of words model)
The words are independent of each other given the class: false
  – "President" is more likely to occur in a context that contains "election" than in a context that does not
Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
(But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases because often we are interested in classification accuracy and not in accurate probability estimations)
Naïve Bayes
Naïve Bayes in Weka
k Nearest Neighbor Classification
Nearest Neighbor classification rule: to classify a new object, find the object in the training set that is most similar. Then assign the category of this nearest neighbor.
K Nearest Neighbor (KNN): consult the k nearest neighbors. Decision based on the majority category of these neighbors. More robust than k = 1.
Example of a similarity measure often used in NLP is cosine similarity.
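The k-NN rule with cosine similarity can be sketched as follows. The tiny word-count vectors, labels, and the "sports"/"politics" categories are invented for illustration.

```python
# k-NN: rank training examples by cosine similarity to the query vector,
# then take a majority vote among the top k labels.
from math import sqrt

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def knn_classify(query, training, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(training, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

training = [
    ((3, 0, 1), "sports"), ((2, 1, 0), "sports"), ((4, 0, 2), "sports"),
    ((0, 3, 1), "politics"), ((1, 4, 0), "politics"),
]
print(knn_classify((2, 0, 1), training, k=3))  # -> sports
```
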
K-Nearest Neighbor
[Figures: k-NN classification examples]