TRANSCRIPT
Classification II
Numeric Attributes

• Numeric attributes can take many values
  – Creating branches for each value is not ideal
• The value range is usually split into two parts
• The splitting position is determined using the idea of information gain
• Consider the following sorted temperature values and their class labels:

  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  yes no  yes yes yes no  no  yes yes yes no  yes yes no

  – A split could be made at any of the 11 positions between the distinct values
  – For each candidate split, compute the information gain
  – Select the split that gives the highest information gain
  – E.g. temp < 71.5 produces 4 yes and 2 no, and temp > 71.5 produces 5 yes and 3 no
  – Entropy([4,2],[5,3]) = 6/14*Entropy(4/6,2/6) + 8/14*Entropy(5/8,3/8) = 0.939 bits
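The split-selection procedure above can be sketched in Python. The temperature values and labels are taken from the slide (assuming the standard 14-instance weather data, in which the values 72 and 75 each appear twice):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Sorted temperatures and their class labels from the slide.
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
          "yes", "yes", "yes", "no", "yes", "yes", "no"]

def split_entropy(threshold):
    """Weighted entropy of the two subsets produced by temp < threshold."""
    left  = [l for t, l in zip(temps, labels) if t < threshold]
    right = [l for t, l in zip(temps, labels) if t >= threshold]
    def dist(ls):
        return [ls.count("yes"), ls.count("no")]
    n = len(temps)
    return (len(left) / n) * entropy(dist(left)) + \
           (len(right) / n) * entropy(dist(right))

print(round(split_entropy(71.5), 3))  # 0.939
```

The information gain of a split is the entropy of the full label distribution minus this weighted entropy; the split with the lowest weighted entropy therefore has the highest gain.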
Missing Values

• An easy solution is to treat a missing value as a new possible value for the attribute
  – E.g. if Outlook has a missing value, we have rainy, overcast, sunny and missing as its possible values
  – This makes sense if the fact that the attribute is missing is significant (e.g. missing medical test results)
  – Also, the learning method needs no modification
• A more complex solution is to let the missing value receive a proportion of each of the known values of the attribute
  – The proportion is estimated from the proportions of instances with known values at a node
  – E.g. if Outlook has a missing value in one instance, and 4 instances have rainy, 2 have overcast and 3 have sunny, then the missing value becomes {4/9 rainy, 2/9 overcast, 3/9 sunny}
  – All the computations (such as information gain) are performed using these weighted values
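As a small illustration of the fractional-weight idea, using the Outlook counts from the slide's example:

```python
from collections import Counter

# Known Outlook values at the node (from the slide's example).
known = ["rainy"] * 4 + ["overcast"] * 2 + ["sunny"] * 3

counts = Counter(known)
total = sum(counts.values())

# A missing value is replaced by fractional weights proportional
# to the observed value counts at this node.
weights = {value: count / total for value, count in counts.items()}
print(weights)  # rainy: 4/9, overcast: 2/9, sunny: 3/9
```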
Overfitting

• When decision trees are grown until each leaf node is pure, they might learn unnecessary details from the training data
  – This is called overfitting
  – The unnecessary details may be just noise, or may arise because the training data is not representative
• Overfitting makes the classifier perform poorly on independent test data
• Two solutions
  – Stop growing the tree earlier, before it overfits the training data; in practice it is hard to estimate when to stop
  – Prune overfitted parts of the decision tree by evaluating the utility of pruning nodes from the tree, using part of the training data as validation data
Naïve Bayes Classifier

• A simple classifier based on observed probabilities
• Assumes that all the attributes contribute towards classification
  – Equally importantly
  – Independently
• For some data sets this classifier achieves better results than decision trees
• Makes use of Bayes' rule of conditional probability
• Bayes' rule: if H is a hypothesis and E is its evidence, then
  P(H|E) = P(E|H)P(H) / P(E)
• P(H|E) – conditional probability – the probability of H given E
Example Data

Outlook   Temperature  Humidity  Windy  Play (class attribute)
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
Weather Data Counts and Probabilities
Counts (yes / no):

  Outlook:      sunny 2/3     overcast 4/0    rainy 3/2
  Temperature:  hot 2/2       mild 4/2        cool 3/1
  Humidity:     high 3/4      normal 6/1
  Windy:        false 6/2     true 3/3
  Play:         9 yes, 5 no

Probabilities (yes / no):

  Outlook:      sunny 2/9, 3/5    overcast 4/9, 0/5    rainy 3/9, 2/5
  Temperature:  hot 2/9, 2/5      mild 4/9, 2/5        cool 3/9, 1/5
  Humidity:     high 3/9, 4/5     normal 6/9, 1/5
  Windy:        false 6/9, 2/5    true 3/9, 3/5
  Play:         9/14, 5/14
Naïve Bayes Example

• Given the training weather data, the algorithm computes the probabilities shown on the previous slide
• For a test instance E = {sunny, cool, high, true}, the algorithm uses Bayes' rule to compute the probabilities of Play = yes and Play = no
• First let H be Play = yes
• Then from Bayes' rule
  – P(yes|E) = P(sunny|yes)*P(cool|yes)*P(high|yes)*P(true|yes)*P(yes) / P(E)
  – All the probabilities on the right-hand side except P(E) are known from the previous slide
  – P(yes|E) = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / P(E) = 0.0053 / P(E)
  – P(no|E) = P(sunny|no)*P(cool|no)*P(high|no)*P(true|no)*P(no) / P(E)
  – P(no|E) = (3/5 * 1/5 * 4/5 * 3/5 * 5/14) / P(E) = 0.0206 / P(E)
  – Because P(yes|E) + P(no|E) = 1,
    0.0053/P(E) + 0.0206/P(E) = 1, so P(E) = 0.0053 + 0.0206
  – P(yes|E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
  – P(no|E) = 0.0206 / (0.0053 + 0.0206) = 79.5%
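A minimal sketch of this computation in Python; the conditional probabilities are the ones from the counts slide, restricted to the attribute values in the test instance:

```python
# Conditional probabilities P(value|class) from the weather data slide.
p_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
p_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}
prior_yes, prior_no = 9/14, 5/14

test = ["sunny", "cool", "high", "true"]

# Numerators of Bayes' rule; the denominator P(E) cancels
# when the two scores are normalized to sum to 1.
score_yes, score_no = prior_yes, prior_no
for value in test:
    score_yes *= p_yes[value]
    score_no  *= p_no[value]

total = score_yes + score_no
print(round(score_yes / total, 3), round(score_no / total, 3))  # 0.205 0.795
```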
K-Nearest Neighbour

• No model is built explicitly – instances are stored verbatim
• For an unknown test instance, a distance function is used to find the k nearest training instances, and their most common class is assigned
• For numeric attributes, Euclidean distance is a natural distance function
• Consider two instances
  – I1 with attribute values a1, a2, …, an and
  – I2 with attribute values a1', a2', …, an'
  – The Euclidean distance between the two instances is
    √((a1-a1')² + (a2-a2')² + … + (an-an')²)
• Because different attributes have different scales, attribute values need to be normalized to lie between 0 and 1
  – ai = (vi - min(vi)) / (max(vi) - min(vi))
• Finding nearest neighbours is computationally expensive, because the rudimentary approach computes distances to all stored instances
  – Speed-up techniques exist but are not studied here
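A sketch of k-NN with min-max normalization, following the formulas above; the data points and labels here are made up for illustration:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def normalize(data):
    """Min-max normalize each attribute (column) to the range [0, 1]."""
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in data]

def knn_predict(train_x, train_y, test_point, k=3):
    """Assign the majority class among the k nearest training instances."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda item: euclidean(item[0], test_point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: two numeric attributes (e.g. temperature and humidity).
raw = [[64, 65], [69, 70], [72, 90], [75, 95], [83, 85]]
labels = ["yes", "yes", "no", "no", "yes"]
norm = normalize(raw)
print(knn_predict(norm, labels, norm[1], k=3))
```

Note that the rudimentary search here computes a distance to every stored instance, which is exactly the cost the slide warns about.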
Classifier’s Performance Metrics

• The error rate of a classifier measures its overall performance
  – Error rate = proportion of errors = number of misclassifications / total number of test instances
• Error rate does not discriminate between different types of errors
• A binary classifier (yes and no) makes two kinds of errors
  – Calling an instance of ‘no’ an instance of ‘yes’
    • False positives
  – Calling an instance of ‘yes’ an instance of ‘no’
    • False negatives
• In practice, false positives and false negatives have different associated costs
  – The cost of lending to a defaulter is larger than the lost-business cost of refusing a loan to a non-defaulter
  – The cost of failing to detect a fire is larger than the cost of a false alarm
Confusion Matrix

• The four possible outcomes of a binary classifier are usually shown in a confusion matrix
  – TP – True Positives
  – TN – True Negatives
  – FP – False Positives
  – FN – False Negatives
  – P – Total Positives (TP+FN)
  – N – Total Negatives (FP+TN)
• A number of performance metrics are defined using these counts
                  Predicted Class
                  Yes    No
  Actual   Yes    TP     FN     P
  Class    No     FP     TN     N
                  P'     N'
Performance Metrics Derived from Confusion Matrix
• True Positive Rate, TPR = TP/P = TP/(TP+FN)
  – Also known as sensitivity and recall
• False Positive Rate, FPR = FP/N = FP/(FP+TN)
• Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Error Rate = 1 - Accuracy
• Specificity = 1 - FPR
• Positive Predictive Value = TP/P' = TP/(TP+FP)
  – Also known as precision
• Negative Predictive Value = TN/N' = TN/(TN+FN)
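These metrics follow directly from the four confusion-matrix counts; the counts below are hypothetical, chosen only to exercise the formulas:

```python
def metrics(tp, fn, fp, tn):
    """Derive the performance metrics from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn            # actual positives / negatives
    p_pred, n_pred = tp + fp, tn + fn  # predicted positives / negatives (P', N')
    accuracy = (tp + tn) / (p + n)
    return {
        "TPR (recall)": tp / p,
        "FPR": fp / n,
        "accuracy": accuracy,
        "error rate": 1 - accuracy,
        "specificity": 1 - fp / n,
        "precision (PPV)": tp / p_pred,
        "NPV": tn / n_pred,
    }

# Hypothetical counts for illustration.
m = metrics(tp=40, fn=10, fp=5, tn=45)
print(m["TPR (recall)"], m["FPR"], m["accuracy"])  # 0.8 0.1 0.85
```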
ROC - Receiver Operating Characteristic

• Metrics derived from the confusion matrix are useful for comparing classifiers
• In particular, a plot of TPR on the y-axis against FPR on the x-axis is known as a ROC plot
• A, B, C, D and E are five classifiers with different TPR and FPR values
• A is the ideal classifier because it has TPR = 1.0 and FPR = 0
• E is on the diagonal, which stands for random guessing
• C performs worse than a random guess
  – But the inverse of C, which is B, is better than D
• Classifiers should aim to be in the northwest
[Figure: ROC plot with FPR (0 to 1.0) on the x-axis and TPR (0 to 1.0) on the y-axis; classifiers A–E plotted as points, with the diagonal marking random guessing – points toward the upper left are better, points below the diagonal worse]
Testing Options 1

• Testing the classifier on the training data is not useful
  – Performance figures from such testing will be optimistic
  – Because the classifier was trained from that very data
• Ideally, a new data set called the ‘test set’ should be used for testing
  – If the test set is large, performance figures will be more realistic
  – Creating a test set needs experts’ time, and therefore creating large test sets is expensive
  – After testing, the test set is combined with the training data to produce a new classifier
  – Sometimes a third data set called ‘validation data’ is used for fine-tuning a classifier or for selecting one classifier among many
• In practice, several strategies are used to make up for the lack of test data
  – Holdout procedure – a certain proportion of the training data is held out as test data and the remainder used for training
  – Cross-validation
  – Leave-one-out cross-validation
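The holdout procedure can be sketched as follows; the test fraction and random seed here are arbitrary illustrative choices:

```python
import random

def holdout_split(data, test_fraction=1/3, seed=42):
    """Randomly hold out a proportion of the data as a test set."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    shuffled = data[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # (training set, test set)

data = list(range(30))
train, test = holdout_split(data)
print(len(train), len(test))  # 20 10
```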
Testing Options 2

• Cross-validation
  – Partition the data into a fixed number of folds
  – Use each of the partitions in turn for testing while using the remaining folds for training
  – Every instance is used for testing exactly once
  – 10-fold cross-validation is standard, particularly when repeated 10 times
• Leave-one-out
  – Is n-fold cross-validation, where n is the data size
  – One instance is held out for testing while the remaining instances are used for training
  – Results from the single-instance tests are averaged to obtain the final test result
  – Maximum utilization of the data for training
  – No sampling of data for testing; each instance is systematically used for testing
  – High costs are involved because the classifier is trained n times
  – Hard to ensure representative data for training
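A minimal sketch of fold construction for k-fold cross-validation; leave-one-out falls out as the special case k = n:

```python
def cross_validation_folds(data, k=10):
    """Yield (train, test) pairs for k-fold cross-validation."""
    for i in range(k):
        # Every k-th instance starting at offset i forms the test fold;
        # all remaining instances form the training set.
        test = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

data = list(range(20))
folds = list(cross_validation_folds(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 18 2

# Leave-one-out is the special case k = len(data): n folds of one test instance.
loo = list(cross_validation_folds(data, k=len(data)))
print(len(loo), len(loo[0][1]))  # 20 1
```

In practice each fold would train a classifier on `train` and score it on `test`, averaging the per-fold results; stratified fold assignment (keeping class proportions per fold) is the usual refinement.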