Classification 1

Upload: horatio-stanley

Post on 13-Dec-2015


Page 1: Classification 1. 2  Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.  Supervised learning: classes

Classification


Task: Given a set of pre-classified examples, build a model or classifier to classify new cases.

Supervised learning: classes are known for the examples used to build the classifier.

A classifier can be a set of rules, a decision tree, a neural network, etc.

Typical applications: credit approval, direct marketing, fraud detection, medical diagnosis, …


Simplicity first

Simple algorithms often work very well!

There are many kinds of simple structure, e.g.:

One attribute does all the work

All attributes contribute equally and independently

A weighted linear combination might do

Instance-based: use a few prototypes

Use simple logical rules

Success of method depends on the domain


Inferring rudimentary rules

1R: learns a 1-level decision tree, i.e., rules that all test one particular attribute

Basic version:

One branch for each value of the attribute

Each branch assigns the most frequent class

Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch

Choose the attribute with the lowest error rate

(assumes nominal attributes)


Pseudo-code for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: “missing” is treated as a separate attribute value
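The pseudo-code above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `evaluate_attribute` and `one_r` are made up for this example, the data is the 14-instance weather data set from these slides, and ties (both between classes and between attributes) are broken in favour of whichever candidate is seen first, which is an assumption.

```python
from collections import Counter, defaultdict

# Weather data from the slides: Outlook, Temp, Humidity, Windy, Play (class).
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Windy"]
WEATHER = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]


def evaluate_attribute(rows, index):
    """Build one rule per attribute value and count the training errors."""
    class_counts = defaultdict(Counter)
    for row in rows:
        # A missing value (e.g. None) simply becomes one more dictionary key,
        # so it is treated as a separate attribute value.
        class_counts[row[index]][row[-1]] += 1
    rules, errors = {}, 0
    for value, counts in class_counts.items():
        majority_class, majority_count = counts.most_common(1)[0]
        rules[value] = majority_class
        errors += sum(counts.values()) - majority_count
    return rules, errors


def one_r(rows, attributes):
    """Return (attribute name, rules, errors) for the lowest-error attribute."""
    best = None
    for index, name in enumerate(attributes):
        rules, errors = evaluate_attribute(rows, index)
        if best is None or errors < best[2]:  # strict '<': first attribute wins ties
            best = (name, rules, errors)
    return best


if __name__ == "__main__":
    print(one_r(WEATHER, ATTRIBUTES))
```

On this data Outlook and Humidity both make 4 errors out of 14; the sketch keeps Outlook only because it is evaluated first.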


Evaluating the weather attributes

Attribute   Rules             Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Windy       False → Yes       2/8      5/14
            True → No*        3/6

* indicates a tie

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No


Dealing with numeric attributes

Discretize numeric attributes

Divide each attribute’s range into intervals:

Sort instances according to the attribute’s values

Place breakpoints where the class (the majority class) changes

This minimizes the total error

Example: temperature from weather data

64    65    68  69  70    71  72  72    75  75    80    81  83    85
Yes | No  | Yes Yes Yes | No  No  Yes | Yes Yes | No  | Yes Yes | No

Outlook    Temperature   Humidity   Windy   Play
Sunny      85            85         False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
…          …             …          …       …


The problem of overfitting

This procedure is very sensitive to noise

One instance with an incorrect class label will probably produce a separate interval

Also: a time stamp attribute will have zero errors

Simple solution: enforce a minimum number of instances in the majority class per interval
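The overfitting effect is easy to reproduce. The sketch below counts 1R training errors when every distinct value of an attribute gets its own branch. The temperature values and classes are the ones from the weather example; the `TIMESTAMPS` column and the function name `training_errors` are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

# Temperature values and Play classes from the weather example, in sorted order.
TEMPERATURE = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
PLAY = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
        "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
# Hypothetical time stamp attribute: every instance has a unique value.
TIMESTAMPS = list(range(len(PLAY)))


def training_errors(values, labels):
    """1R training errors when each distinct value forms its own branch."""
    class_counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        class_counts[value][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in class_counts.values())


if __name__ == "__main__":
    print("time stamp errors:", training_errors(TIMESTAMPS, PLAY))
    print("raw temperature errors:", training_errors(TEMPERATURE, PLAY))
```

A unique-valued attribute memorises the training set, so its error count says nothing about how well the rule will classify new cases.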


Discretization example

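Example (with min = 3), starting from the breakpoints placed wherever the class changes:

64    65    68  69  70    71  72  72    75  75    80    81  83    85
Yes | No  | Yes Yes Yes | No  No  Yes | Yes Yes | No  | Yes Yes | No

Final result for temperature attribute:

64  65  68  69  70    71  72  72  75  75    80  81  83  85
Yes No  Yes Yes Yes | No  No  Yes Yes Yes | No  Yes Yes No

The first two intervals share the majority class Yes, so they can be merged, which leaves a single breakpoint at 77.5 (midway between 75 and 80).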
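One way to read the discretization procedure is as a greedy left-to-right pass: keep adding instances to the current interval until its majority class has at least `min_majority` members, then close the interval at the first instance whose class differs from that majority, never splitting between two equal values. The sketch below follows that reading. It is an assumption about the details, not the slides' own code, and the function names are illustrative.

```python
from collections import Counter

TEMPERATURE = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
PLAY = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
        "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]


def majority(interval):
    """(class, count) of the most frequent class; ties go to the class seen first."""
    return Counter(label for _, label in interval).most_common(1)[0]


def partition(values, labels, min_majority):
    """Split sorted (value, class) pairs into intervals (non-empty input assumed)."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    intervals, current = [], []
    for value, label in pairs:
        if current:
            top_class, count = majority(current)
            enough = count >= min_majority
            # Close the interval only if the class changes AND the value changes.
            if enough and label != top_class and value != current[-1][0]:
                intervals.append(current)
                current = []
        current.append((value, label))
    intervals.append(current)
    return intervals


def merge_same_majority(intervals):
    """Merge neighbouring intervals that predict the same majority class."""
    merged = [intervals[0]]
    for interval in intervals[1:]:
        if majority(merged[-1])[0] == majority(interval)[0]:
            merged[-1] = merged[-1] + interval
        else:
            merged.append(interval)
    return merged


def breakpoints(intervals):
    """Midpoints between the last value of one interval and the first of the next."""
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(intervals, intervals[1:])]


if __name__ == "__main__":
    for min_majority in (1, 3):
        parts = partition(TEMPERATURE, PLAY, min_majority)
        print(min_majority, [[label for _, label in p] for p in parts])
    print(breakpoints(merge_same_majority(partition(TEMPERATURE, PLAY, 3))))
```

With `min_majority=1` the pass reproduces the eight intervals of the unconstrained example, and with `min_majority=3` it reproduces the three intervals shown as the final result. The last interval is a 2/2 tie; the sketch resolves it to the class seen first (No), which is why it is not merged with its neighbour.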


With overfitting avoidance

Resulting rule set:

Attribute     Rules                       Errors   Total errors
Outlook       Sunny → No                  2/5      4/14
              Overcast → Yes              0/4
              Rainy → Yes                 2/5
Temperature   ≤ 77.5 → Yes                3/10     5/14
              > 77.5 → No*                2/4
Humidity      ≤ 82.5 → Yes                1/7      3/14
              > 82.5 and ≤ 95.5 → No      2/6
              > 95.5 → Yes                0/1
Windy         False → Yes                 2/8      5/14
              True → No*                  3/6


Missing Values


Many data sets are plagued by the problem of missing values

Missing values can result from manual data entry, incorrect measurements, equipment errors, etc.

They are usually denoted by special markers such as:

NULL

*

?


Table 2.1


Missing Values

Imputation (filling in) of missing data

We will use two methods of single imputation:

Mean imputation

Hot deck imputation


Missing Values

Single imputation: mean imputation

The method uses the mean of the known values of the feature that contains missing data

In the case of a symbolic/categorical feature, the mode (the most frequent value) is used instead

The algorithm imputes missing values for each attribute separately
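A minimal sketch of mean/mode imputation, using only the standard library. The records are hypothetical (they are not the contents of Table 2.1), `None` is assumed to be the missing-value marker, and the function name `impute_mean_mode` is made up for this example.

```python
from statistics import mean, mode

# Hypothetical records: (cholesterol, sex); None marks a missing value.
RECORDS = [
    (200.0, "M"),
    (None, "F"),
    (300.0, None),
    (250.0, "F"),
]


def impute_mean_mode(records):
    """Fill each column separately: mean for numeric features, mode otherwise.

    Assumes every column has at least one known value.
    """
    fill_values = []
    for column in zip(*records):
        known = [v for v in column if v is not None]
        if all(isinstance(v, (int, float)) for v in known):
            fill_values.append(mean(known))  # numeric feature
        else:
            fill_values.append(mode(known))  # symbolic/categorical feature
    return [
        tuple(fill if v is None else v for v, fill in zip(row, fill_values))
        for row in records
    ]


if __name__ == "__main__":
    for row in impute_mean_mode(RECORDS):
        print(row)
```

Because each column is filled on its own, the method ignores any relationship between features, which is the main difference from hot deck imputation.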


Table 2.2


Missing Values

Single imputation: hot deck imputation

For each object that contains missing values, the most similar object (according to some distance function) is found, and the missing values are imputed from that object

If the most similar object also has a missing value for the same feature, it is discarded and the next closest object is found

The procedure is repeated until all the missing values are imputed

When no similar object is found, the closest object with the minimum number of missing values is chosen to impute the missing values
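Hot deck imputation can be sketched as a nearest-donor search. The slides leave the distance function open, so the one below is an assumption: absolute difference for numeric features, 0/1 mismatch for symbolic ones, averaged over the features known in both objects (in practice numeric features would usually be scaled first). The records are hypothetical, `None` marks a missing value, and the fallback rule for the case where no similar object exists is left out.

```python
# Hypothetical records: (feature_a, feature_b, feature_c); None marks a missing value.
RECORDS = [
    (1.0, "a", 10.0),
    (1.1, "a", None),
    (5.0, "b", 50.0),
    (5.2, None, 52.0),
]


def distance(first, second):
    """Assumed distance: mean per-feature difference over features known in both."""
    total, shared = 0.0, 0
    for x, y in zip(first, second):
        if x is None or y is None:
            continue
        shared += 1
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += abs(x - y)
        else:
            total += 0.0 if x == y else 1.0
    return total / shared if shared else float("inf")


def hot_deck_impute(records):
    """Copy each missing value from the closest object that has that feature."""
    result = [list(row) for row in records]
    for i, row in enumerate(records):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Objects that are also missing feature j are skipped, which has the
            # same effect as discarding them and moving on to the next closest one.
            donors = [k for k in range(len(records))
                      if k != i and records[k][j] is not None]
            if donors:
                closest = min(donors, key=lambda k: distance(row, records[k]))
                result[i][j] = records[closest][j]
    return [tuple(row) for row in result]


if __name__ == "__main__":
    for row in hot_deck_impute(RECORDS):
        print(row)
```

Donor values are always taken from the original records, so an imputed value is never reused to fill another gap.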


Table 2.3


Noise


Def.: Noise in the data is a random error or variance in a measured feature

The amount of noise in the data can jeopardize the results of the entire knowledge discovery process (KDP)

The influence of noise on the data can be prevented by imposing constraints on features, so that anomalies are detected when the data is entered

For instance, a DBMS usually provides a facility to define constraints for individual attributes


Noise Detection

In manual inspection, the user checks feature values against predefined constraints and manually detects the noise

For example, for object 5 in Table 2.3, the cholesterol value is 45.0, which is outside the predefined acceptable interval for this feature, namely [50.0, 600.0].
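The same check is straightforward to automate. In the sketch below the acceptable interval [50.0, 600.0] and the value 45.0 for object 5 come from the example above; the other object numbers, the dictionary layout, and the function name `out_of_range` are assumptions made for illustration.

```python
# Cholesterol values keyed by object number. Only object 5 and its value are
# taken from the example; the other object numbers are hypothetical.
RECORDS = {
    1: {"cholesterol": 331.2},
    2: {"cholesterol": 261.2},
    3: {"cholesterol": 407.5},
    5: {"cholesterol": 45.0},
}


def out_of_range(records, feature, low, high):
    """Return (object id, value) pairs whose value lies outside [low, high]."""
    return [
        (object_id, record[feature])
        for object_id, record in records.items()
        if record.get(feature) is not None and not low <= record[feature] <= high
    ]


if __name__ == "__main__":
    print(out_of_range(RECORDS, "cholesterol", 50.0, 600.0))
```

Missing values are skipped rather than reported, since they are a separate problem from noise.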


Noise

Noise can be removed using binning

Binning first orders the values of the noisy feature and then replaces the values with a mean or median value for predefined bins

In Table 2.3, the Cholesterol attribute contains the value 45.0, which is noise. As an example, consider the cholesterol feature with its values 45.0, 261.2, 331.2, and 407.5. If the bin size equals two, two bins are created: bin1 with 45.0 and 261.2, and bin2 with 331.2 and 407.5. For bin1 the mean value is 153.1, and for bin2 it is 369.4 (369.35 before rounding). Therefore the values 45.0 and 261.2 would be replaced with 153.1, and the values 331.2 and 407.5 with 369.4. Note that the two new values are within the acceptable interval.
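The binning example can be reproduced with a few lines of Python. This is a sketch assuming fixed-size bins filled in sorted order; the function name `smooth_by_bins` and its `center` parameter are illustrative, with `center` defaulting to the mean and accepting `statistics.median` as the alternative mentioned above.

```python
from statistics import mean, median

# Cholesterol values from the example.
CHOLESTEROL = [45.0, 261.2, 331.2, 407.5]


def smooth_by_bins(values, bin_size, center=mean):
    """Replace each value with the mean (or median) of its fixed-size bin."""
    # Sort positions by value so results can be written back in the input order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = list(values)
    for start in range(0, len(order), bin_size):
        positions = order[start:start + bin_size]
        bin_center = center([values[i] for i in positions])
        for i in positions:
            smoothed[i] = bin_center
    return smoothed


if __name__ == "__main__":
    print(smooth_by_bins(CHOLESTEROL, bin_size=2))
    print(smooth_by_bins(CHOLESTEROL, bin_size=2, center=median))
```

Note that binning also changes the valid value 261.2, not only the noisy 45.0: smoothing trades some precision in every bin for the removal of outliers.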