TRANSCRIPT
Classification II
Numeric Attributes

• Numeric attributes can take many values
  – Creating branches for each value is not ideal
• The value range is usually split into two parts
• The splitting position is determined using the idea of information gain
• Consider the following sorted temperature values and their class labels:

  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  yes no  yes yes yes no  no  yes yes yes no  yes yes no

  – A split could be made at any of the 11 positions between the distinct values
  – For each candidate split, compute the information gain
  – Select the split that gives the highest information gain
  – E.g. temp < 71.5 produces 4 yes and 2 no, and temp > 71.5 produces 5 yes and 3 no
  – Entropy([4,2],[5,3]) = 6/14*Entropy(4/6,2/6) + 8/14*Entropy(5/8,3/8) = 0.939 bits
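The split-selection procedure above can be sketched in Python. The temperature values and labels are taken from the slide (assuming the standard 14-instance weather data, in which the values 72 and 75 each appear twice):

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Sorted temperatures and their class labels from the slide.
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
          "yes", "yes", "yes", "no", "yes", "yes", "no"]

def split_entropy(threshold):
    """Weighted entropy of the two subsets produced by temp < threshold."""
    left  = [l for t, l in zip(temps, labels) if t < threshold]
    right = [l for t, l in zip(temps, labels) if t >= threshold]
    def dist(ls):
        return [ls.count("yes"), ls.count("no")]
    n = len(temps)
    return (len(left) / n) * entropy(dist(left)) + \
           (len(right) / n) * entropy(dist(right))

print(round(split_entropy(71.5), 3))  # 0.939
```

The information gain of a split is the entropy of the full label distribution minus this weighted entropy; the split with the lowest weighted entropy therefore has the highest gain.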
Missing Values

• An easy solution is to treat a missing value as a new possible value for the attribute
  – E.g. if Outlook has a missing value, we have rainy, overcast, sunny and missing as its possible values
  – This makes sense if the fact that the attribute is missing is significant (e.g. missing medical test results)
  – Also, the learning method needs no modification
• A more complex solution is to let the missing value receive a proportion of each of the known values of the attribute
  – The proportion is estimated from the proportions of instances with known values at a node
  – E.g. if Outlook has a missing value in one instance, and 4 instances have rainy, 2 have overcast and 3 have sunny, then the missing value becomes {4/9 rainy, 2/9 overcast, 3/9 sunny}
  – All the computations (such as information gain) are performed using these weighted values
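As a small illustration of the fractional-weight idea, using the Outlook counts from the slide's example:

```python
from collections import Counter

# Known Outlook values at the node (from the slide's example).
known = ["rainy"] * 4 + ["overcast"] * 2 + ["sunny"] * 3

counts = Counter(known)
total = sum(counts.values())

# A missing value is replaced by fractional weights proportional
# to the observed value counts at this node.
weights = {value: count / total for value, count in counts.items()}
print(weights)  # rainy: 4/9, overcast: 2/9, sunny: 3/9
```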
Overfitting

• When decision trees are grown until each leaf node is pure, they might learn unnecessary details from the training data
  – This is called overfitting
  – The unnecessary details may be just noise, or may arise because the training data is not representative
• Overfitting makes the classifier perform poorly on independent test data
• Two solutions
  – Stop growing the tree earlier, before it overfits the training data; in practice it is hard to estimate when to stop
  – Prune overfitted parts of the decision tree by evaluating the utility of pruning nodes from the tree, using part of the training data as validation data
Naïve Bayes Classifier

• A simple classifier based on observed probabilities
• Assumes that all the attributes contribute towards classification
  – Equally importantly
  – Independently
• For some data sets this classifier achieves better results than decision trees
• Makes use of Bayes' rule of conditional probability
• Bayes' rule: if H is a hypothesis and E is its evidence, then
  P(H|E) = P(E|H)P(H) / P(E)
• P(H|E) – conditional probability – the probability of H given E
Example Data

Outlook   Temperature  Humidity  Windy  Play (class attribute)
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
Weather Data Counts and Probabilities
Counts (yes / no):

  Outlook:      sunny 2/3     overcast 4/0    rainy 3/2
  Temperature:  hot 2/2       mild 4/2        cool 3/1
  Humidity:     high 3/4      normal 6/1
  Windy:        false 6/2     true 3/3
  Play:         9 yes, 5 no

Probabilities (yes / no):

  Outlook:      sunny 2/9, 3/5    overcast 4/9, 0/5    rainy 3/9, 2/5
  Temperature:  hot 2/9, 2/5      mild 4/9, 2/5        cool 3/9, 1/5
  Humidity:     high 3/9, 4/5     normal 6/9, 1/5
  Windy:        false 6/9, 2/5    true 3/9, 3/5
  Play:         9/14, 5/14
Naïve Bayes Example

• Given the training weather data, the algorithm computes the probabilities shown on the previous slide
• For a test instance E = {sunny, cool, high, true}, the algorithm uses Bayes' rule to compute the probabilities of Play = yes and Play = no
• First let H be Play = yes
• Then from Bayes' rule
  – P(yes|E) = P(sunny|yes)*P(cool|yes)*P(high|yes)*P(true|yes)*P(yes) / P(E)
  – All the probabilities on the right-hand side except P(E) are known from the previous slide
  – P(yes|E) = (2/9 * 3/9 * 3/9 * 3/9 * 9/14) / P(E) = 0.0053 / P(E)
  – P(no|E) = P(sunny|no)*P(cool|no)*P(high|no)*P(true|no)*P(no) / P(E)
  – P(no|E) = (3/5 * 1/5 * 4/5 * 3/5 * 5/14) / P(E) = 0.0206 / P(E)
  – Because P(yes|E) + P(no|E) = 1,
    0.0053/P(E) + 0.0206/P(E) = 1, so P(E) = 0.0053 + 0.0206
  – P(yes|E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
  – P(no|E) = 0.0206 / (0.0053 + 0.0206) = 79.5%
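A minimal sketch of this computation in Python; the conditional probabilities are the ones from the counts slide, restricted to the attribute values in the test instance:

```python
# Conditional probabilities P(value|class) from the weather data slide.
p_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
p_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}
prior_yes, prior_no = 9/14, 5/14

test = ["sunny", "cool", "high", "true"]

# Numerators of Bayes' rule; the denominator P(E) cancels
# when the two scores are normalized to sum to 1.
score_yes, score_no = prior_yes, prior_no
for value in test:
    score_yes *= p_yes[value]
    score_no  *= p_no[value]

total = score_yes + score_no
print(round(score_yes / total, 3), round(score_no / total, 3))  # 0.205 0.795
```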
K-Nearest Neighbour

• No model is built explicitly – instances are stored verbatim
• For an unknown test instance, a distance function is used to find the k nearest training instances, and their most common class is assigned
• For numeric attributes, Euclidean distance is a natural distance function
• Consider two instances
  – I1 with attribute values a1, a2, …, an and
  – I2 with attribute values a1', a2', …, an'
  – The Euclidean distance between the two instances is
    √((a1-a1')² + (a2-a2')² + … + (an-an')²)
• Because different attributes have different scales, attribute values need to be normalized to lie between 0 and 1
  – ai = (vi - min(vi)) / (max(vi) - min(vi))
• Finding nearest neighbours is computationally expensive, because the rudimentary approach computes distances to all stored instances
  – Speed-up techniques exist but are not studied here
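A sketch of k-NN with min-max normalization, following the formulas above; the data points and labels here are made up for illustration:

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def normalize(data):
    """Min-max normalize each attribute (column) to the range [0, 1]."""
    cols = list(zip(*data))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in data]

def knn_predict(train_x, train_y, test_point, k=3):
    """Assign the majority class among the k nearest training instances."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda item: euclidean(item[0], test_point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: two numeric attributes (e.g. temperature and humidity).
raw = [[64, 65], [69, 70], [72, 90], [75, 95], [83, 85]]
labels = ["yes", "yes", "no", "no", "yes"]
norm = normalize(raw)
print(knn_predict(norm, labels, norm[1], k=3))
```

Note that the rudimentary search here computes a distance to every stored instance, which is exactly the cost the slide warns about.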
Classifier’s Performance Metrics

• The error rate of a classifier measures its overall performance
  – Error rate = proportion of errors = number of misclassifications / total number of test instances
• Error rate does not discriminate between different types of errors
• A binary classifier (yes and no) makes two kinds of errors
  – Calling an instance of ‘no’ an instance of ‘yes’
    • False positives
  – Calling an instance of ‘yes’ an instance of ‘no’
    • False negatives
• In practice, false positives and false negatives have different associated costs
  – The cost of lending to a defaulter is larger than the lost-business cost of refusing a loan to a non-defaulter
  – The cost of failing to detect a fire is larger than the cost of a false alarm
Confusion Matrix

• The four possible outcomes of a binary classifier are usually shown in a confusion matrix
  – TP – True Positives
  – TN – True Negatives
  – FP – False Positives
  – FN – False Negatives
  – P – Total Positives (TP+FN)
  – N – Total Negatives (FP+TN)
• A number of performance metrics are defined using these counts
                  Predicted Class
                  Yes    No
  Actual   Yes    TP     FN     P
  Class    No     FP     TN     N
                  P'     N'
Performance Metrics Derived from Confusion Matrix
• True Positive Rate, TPR = TP/P = TP/(TP+FN)
  – Also known as sensitivity and recall
• False Positive Rate, FPR = FP/N = FP/(FP+TN)
• Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Error Rate = 1 - Accuracy
• Specificity = 1 - FPR
• Positive Predictive Value = TP/P' = TP/(TP+FP)
  – Also known as precision
• Negative Predictive Value = TN/N' = TN/(TN+FN)
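These metrics follow directly from the four confusion-matrix counts; the counts below are hypothetical, chosen only to exercise the formulas:

```python
def metrics(tp, fn, fp, tn):
    """Derive the performance metrics from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn            # actual positives / negatives
    p_pred, n_pred = tp + fp, tn + fn  # predicted positives / negatives (P', N')
    accuracy = (tp + tn) / (p + n)
    return {
        "TPR (recall)": tp / p,
        "FPR": fp / n,
        "accuracy": accuracy,
        "error rate": 1 - accuracy,
        "specificity": 1 - fp / n,
        "precision (PPV)": tp / p_pred,
        "NPV": tn / n_pred,
    }

# Hypothetical counts for illustration.
m = metrics(tp=40, fn=10, fp=5, tn=45)
print(m["TPR (recall)"], m["FPR"], m["accuracy"])  # 0.8 0.1 0.85
```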
ROC - Receiver Operating Characteristic

• Metrics derived from the confusion matrix are useful for comparing classifiers
• In particular, a plot of TPR on the y-axis against FPR on the x-axis is known as a ROC plot
• A, B, C, D and E are five classifiers with different TPR and FPR values
• A is the ideal classifier because it has TPR = 1.0 and FPR = 0
• E is on the diagonal, which stands for random guessing
• C performs worse than a random guess
  – But the inverse of C, which is B, is better than D
• Classifiers should aim to be in the northwest
[Figure: ROC plot with FPR (0 to 1.0) on the x-axis and TPR (0 to 1.0) on the y-axis; classifiers A–E plotted as points, with the diagonal marking random guessing – points toward the upper left are better, points below the diagonal worse]
Testing Options 1

• Testing the classifier on the training data is not useful
  – Performance figures from such testing will be optimistic
  – Because the classifier was trained from that very data
• Ideally, a new data set called the ‘test set’ should be used for testing
  – If the test set is large, performance figures will be more realistic
  – Creating a test set needs experts’ time, and therefore creating large test sets is expensive
  – After testing, the test set is combined with the training data to produce a new classifier
  – Sometimes a third data set called ‘validation data’ is used for fine-tuning a classifier or for selecting one classifier among many
• In practice, several strategies are used to make up for the lack of test data
  – Holdout procedure – a certain proportion of the training data is held out as test data and the remainder used for training
  – Cross-validation
  – Leave-one-out cross-validation
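The holdout procedure can be sketched as follows; the test fraction and random seed here are arbitrary illustrative choices:

```python
import random

def holdout_split(data, test_fraction=1/3, seed=42):
    """Randomly hold out a proportion of the data as a test set."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    shuffled = data[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]  # (training set, test set)

data = list(range(30))
train, test = holdout_split(data)
print(len(train), len(test))  # 20 10
```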
Testing Options 2

• Cross-validation
  – Partition the data into a fixed number of folds
  – Use each of the partitions in turn for testing while using the remaining folds for training
  – Every instance is used for testing exactly once
  – 10-fold cross-validation is standard, particularly when repeated 10 times
• Leave-one-out
  – Is n-fold cross-validation, where n is the data size
  – One instance is held out for testing while the remaining instances are used for training
  – Results from the single-instance tests are averaged to obtain the final test result
  – Maximum utilization of the data for training
  – No sampling of data for testing; each instance is systematically used for testing
  – High costs are involved because the classifier is trained n times
  – Hard to ensure representative data for training
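A minimal sketch of fold construction for k-fold cross-validation; leave-one-out falls out as the special case k = n:

```python
def cross_validation_folds(data, k=10):
    """Yield (train, test) pairs for k-fold cross-validation."""
    for i in range(k):
        # Every k-th instance starting at offset i forms the test fold;
        # all remaining instances form the training set.
        test = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

data = list(range(20))
folds = list(cross_validation_folds(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 18 2

# Leave-one-out is the special case k = len(data): n folds of one test instance.
loo = list(cross_validation_folds(data, k=len(data)))
print(len(loo), len(loo[0][1]))  # 20 1
```

In practice each fold would train a classifier on `train` and score it on `test`, averaging the per-fold results; stratified fold assignment (keeping class proportions per fold) is the usual refinement.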