Classification (Continued)

Dr Eamonn Keogh
Computer Science & Engineering Department
University of California, Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu
Decision Trees
Let's review the classification techniques we have seen so far, in terms of their decision surfaces.
[Figure: scatter plot of the example dataset.]
Linear Classifier

[Figure: the linear classifier's decision surface on the example dataset.]
Nearest Neighbor Classifier

[Figure: the nearest neighbor classifier's decision surface on the example dataset.]
Decision Tree Classification

• Decision tree: a flow-chart-like tree structure
  – An internal node denotes a test on an attribute
  – A branch represents an outcome of the test
  – Leaf nodes represent class labels or class distributions
• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers
• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
Decision Tree Example I

| Bought PC | age   | income | student | credit_rating |
|-----------|-------|--------|---------|---------------|
| no        | <=30  | high   | no      | fair          |
| no        | <=30  | high   | no      | excellent     |
| yes       | 31…40 | high   | no      | fair          |
| yes       | >40   | medium | no      | fair          |
| yes       | >40   | low    | yes     | fair          |
| no        | >40   | low    | yes     | excellent     |
| yes       | 31…40 | low    | yes     | excellent     |
| no        | <=30  | medium | no      | fair          |
| yes       | <=30  | low    | yes     | fair          |
| yes       | >40   | medium | yes     | fair          |
| yes       | <=30  | medium | yes     | excellent     |
| yes       | 31…40 | medium | no      | excellent     |
| yes       | 31…40 | high   | yes     | fair          |
| no        | >40   | medium | no      | excellent     |

We have the above data in our database. Based on this data, we want to predict whether potential customers are likely to buy a computer.

For example: will Joe, a 25-year-old lumberjack with a medium income and a fair credit rating, buy a PC?
Decision Tree Example II
Age?
• <=30 → Student?
  – no → no
  – yes → yes
• 31…40 → yes
• >40 → Credit rating?
  – excellent → no
  – fair → yes

Joe, a 25-year-old lumberjack with a medium income and a fair credit rating: (?, <=30, medium, no, fair)
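The walk down this tree can be hard-coded. The sketch below is a hypothetical transcription of the tree above (note that the tree never tests income, even though it appears in the table), used to classify Joe's record:

```python
def predict_buys_pc(age, income, student, credit_rating):
    """Hard-coded version of the decision tree above.

    income is carried along but never tested -- this tree does not use it.
    """
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    elif age == "31...40":
        return "yes"
    else:  # age == ">40"
        return "yes" if credit_rating == "fair" else "no"

# Joe's record: (?, <=30, medium, no, fair)
print(predict_buys_pc("<=30", "medium", "no", "fair"))  # -> no
```

According to the tree, Joe (under 30 and not a student) is predicted not to buy a PC.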
How do we construct the decision tree?

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes can be discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping the partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is used to classify the leaf)
  – There are no samples left
Imagine this dataset shows two classes of people, healthy and sick. The X-axis shows their blood sugar count; the Y-axis shows their white cell count.

We want to find the single best rule of the form:

if somefeature > somevalue then class = sick
else class = healthy

Here, the best rule is: if blood sugar > 3.5 then class = sick, else class = healthy.
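One way to find the best single threshold is to try every candidate split point and keep the one with the fewest misclassifications (the slides later use information gain; plain error count is used here for simplicity). A minimal sketch with made-up blood sugar readings:

```python
def best_threshold(values, labels):
    """Find t minimizing errors of the rule: value > t => 'sick', else 'healthy'."""
    candidates = sorted(set(values))
    # candidate thresholds: midpoints between consecutive distinct values
    mids = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    best_t, best_err = None, float("inf")
    for t in mids:
        err = sum((v > t) != (y == "sick") for v, y in zip(values, labels))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# made-up readings: the healthy cluster is low, the sick cluster is high
sugar  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["healthy", "healthy", "healthy", "sick", "sick", "sick"]
print(best_threshold(sugar, labels))  # -> (3.5, 0)
```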
Blood Sugar > 3.5?
• no → healthy
• yes → sick
Blood Sugar > 3.9?
• no → sick
• yes → White Cell > 4.3?
  – yes → healthy
  – no → sick
We have only informally shown how the decision tree chooses the splitting point for continuous attributes.
How do we choose a splitting criterion for nominal or Boolean attributes?
We want to find the single best rule of the form
if somefeature = somevalue then
class = sick
else
class = healthy
[Figure: height distributions for the two values of the nominal attribute Gender (M and F).]
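For a nominal attribute, the same search runs over (feature, value) pairs instead of thresholds. A minimal sketch, with made-up records in which gender happens to be perfectly predictive:

```python
def best_equality_rule(records, labels, features):
    """Find (feature, value) minimizing errors of:
    record[feature] == value => 'sick', else 'healthy'."""
    best = (None, None, float("inf"))
    for f in features:
        for v in {r[f] for r in records}:
            err = sum((r[f] == v) != (y == "sick")
                      for r, y in zip(records, labels))
            if err < best[2]:
                best = (f, v, err)
    return best

# made-up data: 'M' is (artificially) perfectly predictive of 'sick'
records = [{"gender": "M", "smoker": "yes"},
           {"gender": "M", "smoker": "no"},
           {"gender": "F", "smoker": "yes"},
           {"gender": "F", "smoker": "no"}]
labels = ["sick", "sick", "healthy", "healthy"]
print(best_equality_rule(records, labels, ["gender", "smoker"]))
# -> ('gender', 'M', 0)
```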
An example of a problem that decision trees do poorly on. (Decision trees split on one feature at a time, so they struggle when the true class boundary is not axis-parallel.)
We have now seen several classification algorithms. How should we compare them?

• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise and missing values
• Scalability
  – efficiency in disk-resident databases
• Interpretability
  – understanding and insight provided by the model
What happens if we run out of features to test before correctly partitioning the test set?
Here we have a dataset with three features:

• Weight
• Blood PH
• Height

We are trying to classify people into two classes, yes or no (i.e., yes, they will get sick, or no, they won't).

Most items are classified, but 28 individuals remain unclassified after all the features have been used…
[Tree diagram: successive splits on Weight, Blood PH < 7?, Height > 178?, and Blood PH < 5?. After every feature has been used, one leaf remains impure, containing 18 "no" and 10 "yes" examples: the 28 unclassified individuals.]
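As stated in the construction algorithm, when no attributes remain the impure leaf is labeled by majority vote. A sketch, assuming a leaf with 18 "no" and 10 "yes" examples:

```python
from collections import Counter

def leaf_label(class_labels):
    """Majority vote over the class labels of the examples that reached a leaf."""
    return Counter(class_labels).most_common(1)[0][0]

# the 28 unclassified individuals: 18 'no', 10 'yes'
leaf = ["no"] * 18 + ["yes"] * 10
print(leaf_label(leaf))  # -> no
```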
Feature Generation
[Figure: a weight vs. height scatter plot, alongside the derived feature BMI = kg/m² plotted on a single axis.]
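Generating such a feature in code is just computing a derived column; the heights and weights below are made up for illustration:

```python
def add_bmi(people):
    """Derive the new feature BMI = kg / m^2 from the raw weight and height."""
    for p in people:
        p["bmi"] = p["weight_kg"] / p["height_m"] ** 2
    return people

people = add_bmi([{"weight_kg": 70, "height_m": 1.75},
                  {"weight_kg": 95, "height_m": 1.70}])
print([round(p["bmi"], 1) for p in people])  # -> [22.9, 32.9]
```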
![Page 20: ClassificationContinued Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu](https://reader036.vdocument.in/reader036/viewer/2022062407/56649d545503460f94a30b8a/html5/thumbnails/20.jpg)
Feature Generation Case Study

Suppose we have the following two classes:

Class A: 100 random coin tosses
Class B: a human "faking" 100 random coin tosses

A 10100010101010101010101010101010110101…
A 11010101001010101000101010100101010101…
B 10100010101010101010101010101010101111…
B 10100101010101001100111010101010101010…
A 11110101010111101000111010101010111010…
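The slides leave the discriminating feature unstated here; one natural candidate is the length of the longest run of identical outcomes, since truly random sequences contain longer runs than most people are willing to fake. A sketch:

```python
def longest_run(seq):
    """Length of the longest run of identical consecutive symbols."""
    if not seq:
        return 0
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

# a truncated sequence from the slide: contains the run '1111'
print(longest_run("10100010101010101010101010101010101111"))  # -> 4
```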
What splitting criterion should we use?
How many bits do I need to label all the objects in this box?
How many bits do I need to label all the objects in these boxes?
Information Gain as a Splitting Criterion
• Select the attribute with the highest information gain (information gain is the expected reduction in entropy).
• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and n elements of class N
  – The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as:

E(S) = -(p/(p+n)) log₂(p/(p+n)) - (n/(p+n)) log₂(n/(p+n))

(0 log(0) is defined as 0.)
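The definition translates directly into a short function; a sketch, using the 0·log(0) = 0 convention:

```python
from math import log2

def entropy(p, n):
    """E(S) for a set with p examples of class P and n of class N."""
    total = p + n
    e = 0.0
    for count in (p, n):
        if count:  # skip empty classes: 0 * log(0) is defined as 0
            frac = count / total
            e -= frac * log2(frac)
    return e

print(round(entropy(9, 5), 3))           # -> 0.94
print(entropy(9, 0), entropy(0, 5))      # -> 0.0 0.0
```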
Information Gain in Decision Tree Induction
• Assume that, using attribute A, the current set will be partitioned into some number of child sets
• The encoding information that would be gained by branching on A is:

Gain(A) = E(current set) - Σ E(all child sets)

where each child set's entropy is weighted by the fraction of the examples it contains.

Note: entropy is at its minimum (zero) if the collection of objects is completely uniform, i.e., all objects belong to the same class.
Entropy(9, 5) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940
Entropy(9, 0) = -(9/9) log₂(9/9) - (0/9) log₂(0/9) = 0
Entropy(0, 5) = -(0/5) log₂(0/5) - (5/5) log₂(5/5) = 0

(Recall that log(1) = 0.)

E(S) = -(p/(p+n)) log₂(p/(p+n)) - (n/(p+n)) log₂(n/(p+n))
Gain(A) = Entropy(9, 5) - ((9/14) · Entropy(9, 0) + (5/14) · Entropy(0, 5))
        = 0.940 - (0 + 0)
        = 0.940

(Recall that log(1) = 0.)
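Combining the two formulas gives the gain computation above. A sketch (the weighted-child form is assumed, matching the worked numbers):

```python
from math import log2

def entropy(p, n):
    """E(S) for p positive and n negative examples; empty classes contribute 0."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

def gain(parent, children):
    """parent = (p, n); children = list of (p, n) counts after the split.

    Each child's entropy is weighted by its share of the parent's examples.
    """
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in children)

# splitting (9, 5) into the pure children (9, 0) and (0, 5)
print(round(gain((9, 5), [(9, 0), (0, 5)]), 3))  # -> 0.94
```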
Avoiding Overfitting in Classification

• The generated tree may overfit the training data
  – Too many branches; some may reflect anomalies due to noise or outliers
  – The result is poor accuracy on unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • It is difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
Approaches to Determine the Final Tree Size

• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training
  – but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node will improve the entire distribution
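The k-fold idea can be sketched without any library: split the indices into k folds, train on k-1 of them, and test on the held-out fold, so each example is tested exactly once. The round-robin fold assignment below is one simple choice:

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# every index appears in exactly one test fold
n = 25
tests = [t for _, t in k_fold_indices(n, k=5)]
print(sorted(j for t in tests for j in t) == list(range(n)))  # -> True
```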
Feature Selection

One of the nice features of decision trees is that they automatically discover the best features to use (the ones near the top of the tree) and which features are irrelevant for the problem (the features that are not used at all).

How do we decide which features to use for nearest neighbor, or for the linear classifier?

Suppose we are trying to decide if tomorrow is a good day to play tennis, based on the temperature, the wind speed, the humidity, and the outlook…

We could use just {temperature}, or just {temperature, wind speed}, or just {…}. This sounds like a search problem!
• Forward Selection
• Backward Elimination
• Bi-directional Search
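Forward selection, the first of these searches, can be sketched generically: start with no features, and greedily add whichever feature most improves a supplied evaluation score, stopping when no addition helps. The `score` function is a stand-in for whatever the classifier's cross-validated accuracy would be; the toy scorer below is made up:

```python
def forward_selection(all_features, score):
    """Greedy wrapper search; score(subset) returns higher = better."""
    selected, best = [], score(frozenset())
    improved = True
    while improved:
        improved = False
        for f in sorted(set(all_features) - set(selected)):
            s = score(frozenset(selected + [f]))
            if s > best:
                best, chosen, improved = s, f, True
        if improved:
            selected.append(chosen)
    return selected

# toy score: pretend only temperature and humidity carry information
useful = {"temperature": 0.2, "humidity": 0.1}
score = lambda subset: sum(useful.get(f, 0.0) for f in subset)
print(forward_selection(["temperature", "windspeed", "humidity", "outlook"],
                        score))  # -> ['temperature', 'humidity']
```

Backward elimination is the mirror image: start with all features and greedily remove the one whose removal helps (or hurts least).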
[Recap figures: the linear classifier and nearest neighbor classifier decision surfaces from earlier slides, shown side by side.]