Classification II: Decision Trees and SVMs
Digging into Data
March 3, 2014
Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah
Roadmap
Classification: machines labeling data for us
Last time: naïve Bayes and logistic regression
This time:
- Decision Trees
  - Simple, nonlinear, interpretable
- SVMs
  - (another) example of a linear classifier
  - State-of-the-art classification
- Examples in Rattle (Logistic, SVM, Trees)
- Discussion: Which classifier should I use for my problem?
Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap
Trees
Suppose that we want to construct a set of rules to represent the data
- data can be represented as a series of if-then statements
- here, “if” splits inputs into two categories
- “then” assigns a value
- when “if” statements are nested, the structure is called a tree
[Tree diagram: the root tests X1 > 5. The TRUE branch tests X1 > 8 (TRUE → FALSE, FALSE → TRUE). The FALSE branch tests X2 < −2 (TRUE → test X3 > 0: TRUE → TRUE, FALSE → FALSE; FALSE → TRUE).]
Trees
Example: data (X1, X2, X3, Y), where X1, X2, X3 are real and Y is Boolean.

First, see if X1 > 5:
- if TRUE, see if X1 > 8
  - if TRUE, return FALSE
  - if FALSE, return TRUE
- if FALSE, see if X2 < −2
  - if TRUE, see if X3 > 0
    - if TRUE, return TRUE
    - if FALSE, return FALSE
  - if FALSE, return TRUE
Trees

[The same tree diagram as above.]

- Example 1: (X1, X2, X3) = (1, 1, 1) → TRUE
- Example 2: (X1, X2, X3) = (10, −3, 0) → FALSE
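The nested if-then reading translates directly into code. Here is a minimal Python sketch of this particular toy tree (the function name `classify` is an illustrative choice, not from the slides):

```python
def classify(x1, x2, x3):
    """Trace the example tree: returns the Boolean label Y."""
    if x1 > 5:
        return not (x1 > 8)   # X1 > 8: TRUE -> FALSE, FALSE -> TRUE
    if x2 < -2:
        return x3 > 0         # X3 > 0: TRUE -> TRUE, FALSE -> FALSE
    return True               # X1 <= 5 and X2 >= -2

print(classify(1, 1, 1))      # True  (Example 1)
print(classify(10, -3, 0))    # False (Example 2)
```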
Trees
Terminology:
- branches: one side of a split
- leaves: terminal nodes that return values

Why trees?
- trees can be used for regression or classification
  - regression: returned value is a real number
  - classification: returned value is a class
- unlike linear regression, SVMs, naive Bayes, etc., trees fit local models
  - in large spaces, global models may be hard to fit
  - results may be hard to interpret
- fast, interpretable predictions
Example: Predicting Electoral Results
2008 Democratic primary:
- Hillary Clinton
- Barack Obama

Given historical data, how will a county (a small administrative unit inside an American state) vote?
- can extrapolate to state-level data
- might suggest regions in which to focus on increasing voter turnout
- would like to know how variables interact
Example: Predicting Electoral Results

[Figure 1: Classification tree for county-level outcomes in the 2008 Democratic Party primary (as of April 16), by Amanda Cox for the New York Times.]
Decision Trees
Decision tree representation:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

How would we represent each of these as a function of X and Y?
- X AND Y (both must be true)
- X OR Y (either can be true)
- X XOR Y (one and only one is true)
When to Consider Decision Trees
Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Examples:
Equipment or medical diagnosis
Credit risk analysis
Modeling calendar scheduling preferences
Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap
Top-Down Induction of Decision Trees
Main loop:
1. A ← the “best” decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

Which attribute is best?

[Figure: two candidate splits of the sample S = [29+, 35−]: attribute A1 yields subsets [21+, 5−] and [8+, 30−]; attribute A2 yields subsets [18+, 33−] and [11+, 2−].]
Entropy: Reminder
[Figure: Entropy(S) as a function of $p_\oplus$, rising from 0 at $p_\oplus = 0$ to 1.0 at $p_\oplus = 0.5$ and falling back to 0 at $p_\oplus = 1$.]

- $S$ is a sample of training examples
- $p_\oplus$ is the proportion of positive examples in $S$
- $p_\ominus$ is the proportion of negative examples in $S$
- Entropy measures the impurity of $S$:

$$\mathrm{Entropy}(S) \equiv -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$
Entropy
How spread out is the distribution of S:
$$p_\oplus(-\log_2 p_\oplus) + p_\ominus(-\log_2 p_\ominus)$$

$$\mathrm{Entropy}(S) \equiv -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$
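As a sketch, the definition translates directly to Python (the helper name and the convention that $0 \log_2 0 = 0$ are the only choices added here):

```python
import math

def entropy(n_pos, n_neg):
    """Entropy(S) = -p_plus * log2(p_plus) - p_minus * log2(p_minus)."""
    total = n_pos + n_neg
    h = 0.0
    for count in (n_pos, n_neg):
        p = count / total
        if p > 0:              # take 0 * log2(0) to be 0
            h -= p * math.log2(p)
    return h

print(entropy(29, 35))  # the sample [29+, 35-] from the split example -> ~0.99
```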
Information Gain
Which feature A would be a more useful rule in our decision tree?

Gain(S, A) = expected reduction in entropy due to sorting on A

$$\mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

[Figure: the same two candidate splits of S = [29+, 35−]: A1 yields [21+, 5−] and [8+, 30−]; A2 yields [18+, 33−] and [11+, 2−].]
Worked example, for the splits above:

$$H(S) = -\frac{29}{64}\log_2\!\left(\frac{29}{64}\right) - \frac{35}{64}\log_2\!\left(\frac{35}{64}\right) = 0.99$$

$$\mathrm{Gain}(S, A_1) = 0.99 - \frac{26}{64}\left[-\frac{5}{26}\log_2\!\left(\frac{5}{26}\right) - \frac{21}{26}\log_2\!\left(\frac{21}{26}\right)\right] - \frac{38}{64}\left[-\frac{8}{38}\log_2\!\left(\frac{8}{38}\right) - \frac{30}{38}\log_2\!\left(\frac{30}{38}\right)\right] = 0.99 - 0.28 - 0.44 = 0.27$$

$$\mathrm{Gain}(S, A_2) = 0.99 - \frac{51}{64}\left[-\frac{18}{51}\log_2\!\left(\frac{18}{51}\right) - \frac{33}{51}\log_2\!\left(\frac{33}{51}\right)\right] - \frac{13}{64}\left[-\frac{11}{13}\log_2\!\left(\frac{11}{13}\right) - \frac{2}{13}\log_2\!\left(\frac{2}{13}\right)\right] = 0.99 - 0.75 - 0.13 = 0.11$$

A1 gives the larger gain, so it is the better attribute to split on.
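The arithmetic can be checked mechanically. Here is a small sketch (helper names are my own) that recomputes both gains from the class counts in the figure; the intermediate terms above are rounded to two decimals, so the unrounded gains come out as roughly 0.27 and 0.12:

```python
import math

def entropy(n_pos, n_neg):
    total = n_pos + n_neg
    return -sum(c / total * math.log2(c / total)
                for c in (n_pos, n_neg) if c > 0)

def gain(parent, subsets):
    # parent and subsets are (positive, negative) class counts
    n = sum(parent)
    return entropy(*parent) - sum(
        (p + q) / n * entropy(p, q) for p, q in subsets)

S = (29, 35)
print(gain(S, [(21, 5), (8, 30)]))   # Gain(S, A1) -> ~0.27
print(gain(S, [(18, 33), (11, 2)]))  # Gain(S, A2) -> ~0.12
```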
ID3 Algorithm

- Start at the root; look for the best attribute
- Repeat for the subtrees at each attribute outcome
- Stop when information gain is below a threshold
- Bias: prefers shorter trees (Occam’s Razor)
  - a short hypothesis that fits the data is unlikely to be a coincidence
  - a long hypothesis that fits the data might be a coincidence
  - this prevents overfitting (more later)
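The in-class demos use Rattle; as a comparable sketch in Python, scikit-learn's decision tree with `criterion="entropy"` also picks splits by information gain. The toy data below is invented, labeled by the example tree from earlier:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows are (X1, X2, X3); labels follow the earlier example tree.
X = [[1, 1, 1], [10, -3, 0], [6, 0, 2], [9, 5, -1], [2, -4, 3], [3, -5, -2]]
y = [True, False, True, False, True, False]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(clf, feature_names=["X1", "X2", "X3"]))
print(clf.predict([[10, -3, 0]]))  # -> [False]
```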
Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap
Thinking Geometrically
Suppose you have two classes: vacations and sports
Suppose you have four documents
Sports:
- Doc1: {ball, ball, ball, travel}
- Doc2: {ball, ball}

Vacations:
- Doc3: {travel, ball, travel}
- Doc4: {travel}
What does this look like in vector space?
Put the documents in vector space

[Figure: the four documents plotted by term counts, Ball on the x-axis and Travel on the y-axis: Doc1 at (3, 1), Doc2 at (2, 0), Doc3 at (1, 2), Doc4 at (0, 1).]
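A quick sketch of that representation with plain Python (the `docs` mapping and count logic are illustrative, not from the slides):

```python
from collections import Counter

docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],   # Sports
    "Doc2": ["ball", "ball"],                     # Sports
    "Doc3": ["travel", "ball", "travel"],         # Vacations
    "Doc4": ["travel"],                           # Vacations
}

terms = ["ball", "travel"]  # the axes of the vector space
for name, words in docs.items():
    counts = Counter(words)
    print(name, [counts[t] for t in terms])  # e.g. Doc1 [3, 1]
```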
Vector space representation of documents
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 10,000s of dimensions and more
How can we do classification in this space?
Vector space classification
As before, the training set is a set of documents, each labeled with its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don’t overlap.
We define lines, surfaces, hypersurfaces to divide regions.
Classes in the vector space

[Figure: training documents plotted in 2D, forming three class regions labeled China, Kenya, and UK, plus one unlabeled document ⋆.]

- Should the document ⋆ be assigned to China, UK or Kenya?
- Find separators between the classes
- Based on these separators: ⋆ should be assigned to China
- How do we find separators that do a good job at classifying new documents like ⋆? That is the main topic of today.
Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap
Linear classifiers

Definition:
- A linear classifier computes a linear combination or weighted sum $\sum_i w_i x_i$ of the feature values.
- Classification decision: $\sum_i w_i x_i > \theta$?
- . . . where $\theta$ (the threshold) is a parameter.

(First, we only consider binary classifiers.)

- Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities).
- We call this the separator or decision boundary.
- We find the separator based on the training set.
- Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
- Assumption: the classes are linearly separable.

Before, we just talked about equations. What’s the geometric intuition?
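As a minimal sketch (the weights, threshold, and feature values are invented for illustration), the decision rule is one line of arithmetic plus a comparison:

```python
# Linear decision rule: assign class c if sum_i w_i * x_i > theta.
def linear_classify(w, x, theta):
    score = sum(wi * xi for wi, xi in zip(w, x))
    return score > theta

w = [1.0, -2.0, 0.5]   # illustrative weights
theta = 0.0            # illustrative threshold
print(linear_classify(w, [3, 1, 2], theta))  # True:  3 - 2 + 1 = 2 > 0
print(linear_classify(w, [1, 2, 0], theta))  # False: 1 - 4 + 0 = -3
```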
A linear classifier in 1D

- A linear classifier in 1D is a point $x$ described by the equation $w_1 d_1 = \theta$
- $x = \theta / w_1$
- Points $(d_1)$ with $w_1 d_1 \ge \theta$ are in the class $c$.
- Points $(d_1)$ with $w_1 d_1 < \theta$ are in the complement class $\bar{c}$.
A linear classifier in 2D

- A linear classifier in 2D is a line described by the equation $w_1 d_1 + w_2 d_2 = \theta$
- [Figure: example of a 2D linear classifier.]
- Points $(d_1, d_2)$ with $w_1 d_1 + w_2 d_2 \ge \theta$ are in the class $c$.
- Points $(d_1, d_2)$ with $w_1 d_1 + w_2 d_2 < \theta$ are in the complement class $\bar{c}$.
A linear classifier in 3D

- A linear classifier in 3D is a plane described by the equation $w_1 d_1 + w_2 d_2 + w_3 d_3 = \theta$
- [Figure: example of a 3D linear classifier.]
- Points $(d_1, d_2, d_3)$ with $w_1 d_1 + w_2 d_2 + w_3 d_3 \ge \theta$ are in the class $c$.
- Points $(d_1, d_2, d_3)$ with $w_1 d_1 + w_2 d_2 + w_3 d_3 < \theta$ are in the complement class $\bar{c}$.
Naive Bayes and Logistic Regression as linear classifiers

Multinomial Naive Bayes is a linear classifier (in log space) defined by:

$$\sum_{i=1}^{M} w_i d_i = \theta$$

where $w_i = \log\left[P(t_i \mid c)/P(t_i \mid \bar{c})\right]$, $d_i$ is the number of occurrences of $t_i$ in $d$, and $\theta = -\log\left[P(c)/P(\bar{c})\right]$. Here, the index $i$, $1 \le i \le M$, refers to the terms of the vocabulary. Logistic regression is the same (we only put it into the logistic function to turn it into a probability).

Takeaway: Naïve Bayes, logistic regression, and SVM (which we’ll get to in a second) are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).
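To make the log-space view concrete, here is a hedged sketch of the NB weights $w_i$ using the toy Sports/Vacations counts from earlier; add-one smoothing is an assumption of this sketch, not something the slides specify:

```python
import math

# Term counts in class c (Sports: Doc1 + Doc2) and its complement
# (Vacations: Doc3 + Doc4), from the four toy documents.
counts_c    = {"ball": 5, "travel": 1}
counts_cbar = {"ball": 1, "travel": 3}
vocab = ["ball", "travel"]

def prob(counts, term):
    total = sum(counts.values())
    return (counts[term] + 1) / (total + len(vocab))  # add-one smoothing

# w_i = log[P(t_i|c) / P(t_i|c-bar)]: positive weights vote for class c.
for t in vocab:
    print(t, round(math.log(prob(counts_c, t) / prob(counts_cbar, t)), 2))
```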
Which hyperplane?

- For linearly separable training sets: there are infinitely many separating hyperplanes.
- They all separate the training set perfectly . . .
- . . . but they behave differently on test data.
- Error rates on new data are low for some, high for others.
- How do we find a low-error separator?
Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap
Support vector machines

- Machine-learning research in the last two decades has improved classifier effectiveness.
- New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, neural networks, and random forests
- Applications to IR problems, particularly text classification

SVMs: a kind of large-margin classifier. A vector-space-based machine-learning method that aims to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
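As an illustrative sketch (the toy 2D points are invented), scikit-learn's `SVC` with a linear kernel finds this maximum-margin boundary and exposes the points that define it:

```python
from sklearn.svm import SVC

# Toy linearly separable 2-class data.
X = [[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ (nearly) hard margin
print(clf.support_vectors_)            # the support vectors
print(clf.predict([[2, 2], [6, 6]]))   # -> [0 1]
```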
Support Vector Machines

- 2-class training data
- decision boundary → linear separator
- criterion: being maximally far away from any data point → determines the classifier margin
- linear separator position defined by support vectors

[Figure: 2-class training data with the maximum-margin decision hyperplane; the margin is maximized, and the support vectors lie on its boundary.]
Why maximize the margin?

- Points near the decision surface → uncertain classification decisions (50% either way).
- A classifier with a large margin makes no low-certainty classification decisions.
- Gives a classification safety margin with respect to slight errors in measurement or document variation.
Why maximize the margin?

- SVM classifier: large margin around the decision boundary
- compare to decision hyperplane: place a fat separator between classes
  - unique solution
- decreased memory capacity
- increased ability to correctly generalize to test data
SVM extensions

- Slack variables: the separator need not be a perfect line
- Kernels: different geometries

[Figure: kernel example plotted on the unit square.]

- Loss functions: different penalties for getting the answer wrong
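A sketch of the first two extensions in scikit-learn (parameter values are illustrative): `C` trades margin width against slack, and `kernel` changes the geometry of the boundary:

```python
from sklearn.svm import SVC

# XOR-like data: no single line separates the classes (invented example).
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

soft = SVC(kernel="linear", C=0.1).fit(X, y)          # small C -> more slack
rbf = SVC(kernel="rbf", C=10.0, gamma=2.0).fit(X, y)  # nonlinear boundary
print(soft.score(X, y), rbf.score(X, y))  # the RBF kernel can fit XOR
```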
Outline
1 Decision Trees
2 Learning Decision Trees
3 Vector space classification
4 Linear Classifiers
5 Support Vector Machines
6 Recap
Classification
- Many commercial applications
- There are many applications of text classification for corporate intranets, government departments, and Internet publishers.
- Often greater performance gains come from exploiting domain-specific features than from changing from one machine learning method to another. (Homework 3)
Choosing what kind of classifier to use

When building a text classifier, the first question: how much training data is there currently available?
- None? Hand-write rules or use active learning
- Very little? Naïve Bayes
- A fair amount? SVM
- A huge amount? Doesn’t matter, use whatever works
- Interpretable? Decision trees
Recap
Is there a learning method that is optimal for all text classification problems?
- No, because there is a tradeoff between bias and variance.

Factors to take into account:
- How much training data is available?
- How simple/complex is the problem? (linear vs. nonlinear decision boundary)
- How noisy is the problem?
- How stable is the problem over time?
  - For an unstable problem, it’s better to use a simple and robust classifier.

You’ll be investigating the role of features in HW3!