Spring 2003 1 Classification




Classification task

• Input: a training set of tuples, each labeled with one class label

• Output: a model (classifier) that assigns a class label to each tuple based on the other attributes

• The model can be used to predict the class of new tuples, for which the class label is missing or unknown


What is Classification?

• Data classification is a two-step process:
– first step: a model is built describing a predetermined set of data classes or concepts
– second step: the model is used for classification

• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute

• Data tuples are also referred to as samples, examples, or objects


Train and test

• The tuples (examples, samples) are divided into training set + test set

• Classification model is built in two steps:
– training: build the model from the training set
– test: check the accuracy of the model using the test set


Train and test

• Kinds of models:
– if-then rules
– logical formulae
– decision trees

• Accuracy of models:
– the known class of each test sample is matched against the class predicted by the model
– accuracy rate = % of test set samples correctly classified by the model
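The accuracy rate above is straightforward to compute. A minimal Python sketch (the helper name `accuracy_rate` is ours; the label lists are taken from the test-step slide later in the deck):

```python
def accuracy_rate(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class matches the known class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Known classes of four test tuples vs. the model's predictions
true_risk = ["High", "Low", "High", "High"]
predicted_risk = ["High", "Low", "Low", "High"]
print(accuracy_rate(true_risk, predicted_risk))  # 0.75
```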


Training step

training data:

  Age  Car Type  Risk
  20   Combi     High
  18   Sports    High
  40   Sports    High
  50   Family    Low
  35   Minivan   Low
  30   Combi     High
  32   Family    Low
  40   Combi     Low

The Classification algorithm produces the Classifier (model):

  if Age < 31 or Car Type = Sports
  then Risk = High


Test step

test data (known class vs. the classifier's prediction):

  Age  Car Type  Risk   Predicted Risk
  27   Sports    High   High
  34   Family    Low    Low
  66   Family    High   Low
  44   Sports    High   High

The Classifier (model) predicts the last column; it misclassifies the third tuple, so the accuracy rate on this test set is 3/4 = 75%.


Classification (prediction)

new data (class label unknown):

  Age  Car Type  Risk
  27   Sports    ?
  34   Minivan   ?
  55   Family    ?
  34   Sports    ?

The Classifier (model) assigns Risk: High, Low, Low, High.


Classification vs. Prediction

• There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends:
– classification: predicts categorical class labels
– prediction: models continuous-valued functions


Comparing Classification Methods (1)

• Predictive accuracy: this refers to the ability of the model to correctly predict the class label of new or previously unseen data

• Speed: this refers to the computation costs involved in generating and using the model

• Robustness: this is the ability of the model to make correct predictions given noisy data or data with missing values


Comparing Classification Methods (2)

• Scalability: this refers to the ability to construct the model efficiently given a large amount of data

• Interpretability: this refers to the level of understanding and insight that is provided by the model

• Simplicity:
– decision tree size
– rule compactness

• Domain-dependent quality indicators


Problem formulation

Given records in the database, each with a class label, find a model for each class.

  Age  Car Type  Risk
  20   Combi     High
  18   Sports    High
  40   Sports    High
  50   Family    Low
  35   Minivan   Low
  30   Combi     High
  32   Family    Low
  40   Combi     Low

Resulting model (decision tree):

  Age < 31?
  ├─ Y → High
  └─ N → Car Type is sports?
         ├─ Y → High
         └─ N → Low


Classification techniques

• Decision Tree Classification

• Bayesian Classifiers

• Neural Networks

• Statistical Analysis

• Genetic Algorithms

• Rough Set Approach

• k-nearest neighbor classifiers


Classification by Decision Tree Induction

• A decision tree is a tree structure, where:
– each internal node denotes a test on an attribute,
– each branch represents the outcome of the test,
– leaf nodes represent classes or class distributions

  Age < 31?
  ├─ Y → High
  └─ N → Car Type is sports?
         ├─ Y → High
         └─ N → Low


Decision Tree Induction (1)

• A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.

• Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned


Decision Tree Induction (2)

• Basic algorithm: a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.

• Many variants:
– from machine learning (ID3, C4.5)
– from statistics (CART)
– from pattern recognition (CHAID)

• Main difference: split criterion


Decision Tree Induction (3)

• The algorithm consists of two phases:
– build an initial tree from the training data such that each leaf node is pure
– prune this tree to increase its accuracy on test data


Tree Building

• In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.

• The form of the split used to partition the data depends on the type of the attribute used in the split:
– for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
– for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)


Tree Building Algorithm

  Make Tree (Training Data T) {
    Partition(T);
  }

  Partition(Data S) {
    if (all points in S are in the same class) then
      return;
    for each attribute A do
      evaluate splits on attribute A;
    use best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
  }
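The pseudocode above can be sketched in runnable Python. The `choose_split` helper below is a stand-in for the split-evaluation step covered on the following slides: it simply tries the two candidate tests from the earlier example (Age < 31, Car Type = Sports) and keeps the first that actually divides the data.

```python
def partition(S, choose_split, leaves=None):
    """Recursive divide-and-conquer skeleton of Make Tree / Partition.
    S is a list of (attributes, class_label) pairs; leaves collects the
    class labels of each pure partition reached."""
    if leaves is None:
        leaves = []
    if len({label for _, label in S}) <= 1:  # all points in the same class
        leaves.append([label for _, label in S])
        return leaves
    split = choose_split(S)                  # "evaluate splits" step
    S1 = [t for t in S if split(t[0])]
    S2 = [t for t in S if not split(t[0])]
    partition(S1, choose_split, leaves)
    partition(S2, choose_split, leaves)
    return leaves

def choose_split(S):
    """Stand-in split chooser: tries Age < 31, then Car Type = Sports,
    and returns the first predicate that actually divides S."""
    candidates = [lambda attrs: attrs[0] < 31,
                  lambda attrs: attrs[1] == "Sports"]
    for pred in candidates:
        left = sum(pred(attrs) for attrs, _ in S)
        if 0 < left < len(S):
            return pred
    raise ValueError("no usable split")

training = [((20, "Combi"), "High"), ((18, "Sports"), "High"),
            ((40, "Sports"), "High"), ((50, "Family"), "Low"),
            ((35, "Minivan"), "Low"), ((30, "Combi"), "High"),
            ((32, "Family"), "Low"), ((40, "Combi"), "Low")]
print(partition(training, choose_split))  # three pure leaves
```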


Tree Building Algorithm

• While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf

• To evaluate the goodness of a split, several splitting indices have been proposed


Split Criteria

• Gini index (CART, SPRINT)
– select the attribute that minimizes the impurity of a split

• Information gain (ID3, C4.5)
– uses entropy to measure the impurity of a split
– select the attribute that maximizes the entropy reduction

• χ² contingency table statistic (CHAID)
– measures the correlation between each attribute and the class label
– select the attribute with maximal correlation


Gini index (1)

Consider a sample training set in which each record represents a car-insurance applicant. We want to build a model of what makes an applicant a high or low insurance risk.

  RID  Age  Car Type  Risk
  0    23   family    high
  1    17   sport     high
  2    43   sport     high
  3    68   family    low
  4    32   truck     low
  5    20   family    high

Training set

The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories

Classifier(model)


Gini index (2)

SPRINT algorithm:

  Partition(Data S) {
    if (all points in S are of the same class) then
      return;
    for each attribute A do
      evaluate splits on attribute A;
    use best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
  }

  Initial call: Partition(Training Data)


Gini index (3)

• Definition:

  gini(S) = 1 - Σj pj²

  where:
  • S is a data set containing examples from n classes
  • pj is the relative frequency of class j in S

• E.g. two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements:

  ppos = p/(p+n)    pneg = n/(p+n)

  gini(S) = 1 - ppos² - pneg²


Gini index (4)

• If dataset S is split into S1 and S2, then splitting index is defined as follows:

  giniSPLIT(S) = (p1+n1)/(p+n) * gini(S1) + (p2+n2)/(p+n) * gini(S2)

where p1 and n1 (respectively p2 and n2) denote the numbers of Pos- and Neg-elements in the dataset S1 (respectively S2).

• In this definition the "best" split point is the one with the lowest value of the giniSPLIT index.
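The two definitions above translate directly into code. A small Python sketch (function names are ours):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - Σ pj², with pj the relative frequency of class j in S."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """giniSPLIT: the Gini indices of the two partitions, weighted by size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Two classes, Pos and Neg: a pure set scores 0, a 50/50 set scores 1/2
print(gini(["Pos"] * 4))                    # 0.0
print(gini(["Pos", "Neg"]))                 # 0.5
print(gini_split(["Pos"], ["Pos", "Neg"]))  # (1/3)*0 + (2/3)*(1/2) = 1/3
```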


Example (1)

  RID  Age  Car Type  Risk
  0    23   family    high
  1    17   sport     high
  2    43   sport     high
  3    68   family    low
  4    32   truck     low
  5    20   family    high

Training set


Example (1)

Attribute list for 'Age':

  Age  RID  Risk
  17   1    high
  20   5    high
  23   0    high
  32   4    low
  43   2    high
  68   3    low

Attribute list for 'Car Type':

  Car Type  RID  Risk
  family    0    high
  sport     1    high
  sport     2    high
  family    3    low
  truck     4    low
  family    5    high


Example (2)

• Possible values of a split point for the Age attribute are:
  Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43, Age ≤ 68

  Tuple count  High  Low
  Age ≤ 17     1     0
  Age > 17     3     2

G(Age ≤ 17) = 1 - (1² + 0²) = 0
G(Age > 17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25
GSPLIT = (1/6) * 0 + (5/6) * (12/25) = 2/5


Example (3)

  Tuple count  High  Low
  Age ≤ 20     2     0
  Age > 20     2     2

G(Age ≤ 20) = 1 - (1² + 0²) = 0
G(Age > 20) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (2/6) * 0 + (4/6) * (1/2) = 1/3

  Tuple count  High  Low
  Age ≤ 23     3     0
  Age > 23     1     2

G(Age ≤ 23) = 1 - (1² + 0²) = 0
G(Age > 23) = 1 - ((1/3)² + (2/3)²) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6) * 0 + (3/6) * (4/9) = 2/9


Example (4)

  Tuple count  High  Low
  Age ≤ 32     3     1
  Age > 32     1     1

G(Age ≤ 32) = 1 - ((3/4)² + (1/4)²) = 1 - 10/16 = 6/16 = 3/8
G(Age > 32) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (4/6) * (3/8) + (2/6) * (1/2) = 1/4 + 1/6 = 5/12

The lowest value of GSPLIT is for Age ≤ 23, thus we have a split point at Age = (23+32) / 2 = 27.5
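The hand computations of Examples (2)-(4) can be checked mechanically. The sketch below (helper names are ours) scores every candidate Age split on the six-tuple training set and confirms that Age ≤ 23 gives the lowest GSPLIT, 2/9:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# (Age, Risk) pairs from the Example (1) training set
data = [(23, "high"), (17, "high"), (43, "high"),
        (68, "low"), (32, "low"), (20, "high")]

scores = {}
for threshold in sorted(a for a, _ in data)[:-1]:  # Age <= 68 splits nothing off
    left = [risk for age, risk in data if age <= threshold]
    right = [risk for age, risk in data if age > threshold]
    scores[threshold] = gini_split(left, right)

best = min(scores, key=scores.get)
print(best)   # 23: GSPLIT(Age <= 23) = 2/9 is the lowest value
```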


Example (5)

Decision tree after the first split of the example set:

  Age ≤ 27.5?
  ├─ yes → Risk = High
  └─ no  → Risk = Low


Example (6)

Attribute lists are divided at the split point.

Attribute lists for Age ≤ 27.5:

  Age  RID  Risk        Car Type  RID  Risk
  17   1    high        family    0    high
  20   5    high        sport     1    high
  23   0    high        family    5    high

Attribute lists for Age > 27.5:

  Age  RID  Risk        Car Type  RID  Risk
  32   4    low         sport     2    high
  43   2    high        family    3    low
  68   3    low         truck     4    low


Example (7)

Evaluating splits for categorical attributes:

We have to evaluate the splitting index for each of the 2^N subset combinations, where N is the cardinality of the categorical attribute.

  Tuple count           High  Low
  Car Type ∈ {sport}    1     0
  Car Type ∈ {family}   0     1
  Car Type ∈ {truck}    0     1

G(Car Type ∈ {sport}) = 1 - 1² - 0² = 0
G(Car Type ∈ {family}) = 1 - 0² - 1² = 0
G(Car Type ∈ {truck}) = 1 - 0² - 1² = 0


Example (8)

G(Car Type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2
G(Car Type ∈ {sport, truck}) = 1/2
G(Car Type ∈ {family, truck}) = 1 - 0² - 1² = 0

GSPLIT(Car Type ∈ {sport}) = (1/3) * 0 + (2/3) * 0 = 0
GSPLIT(Car Type ∈ {family}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car Type ∈ {truck}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car Type ∈ {sport, family}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car Type ∈ {sport, truck}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car Type ∈ {family, truck}) = (2/3) * 0 + (1/3) * 0 = 0
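These subset evaluations can likewise be enumerated programmatically. A sketch over the Age > 27.5 partition (helper names are ours; it suffices to test the proper non-empty subsets, since a subset and its complement define the same split):

```python
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# (Car Type, Risk) pairs in the Age > 27.5 partition
data = [("sport", "high"), ("family", "low"), ("truck", "low")]
values = sorted({v for v, _ in data})

scores = {}
for size in range(1, len(values)):            # proper, non-empty subsets
    for subset in combinations(values, size):
        left = [r for v, r in data if v in subset]
        right = [r for v, r in data if v not in subset]
        scores[frozenset(subset)] = gini_split(left, right)

print(scores[frozenset({"sport"})])           # 0.0, the best split
```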


Example (9)

The lowest value of GSPLIT is for Car Type ∈ {sport}, thus this is our split point. Decision tree after the second split of the example set:

  Age ≤ 27.5?
  ├─ yes → Risk = High
  └─ no  → Car Type ∈ {sport}?
           ├─ yes → Risk = High
           └─ no (Car Type ∈ {family, truck}) → Risk = Low


Information Gain (1)

• The information gain measure is used to select the test attribute at each node in the tree

• The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node

• This attribute minimizes the information needed to classify the samples in the resulting partitions


Information Gain (2)

• Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i=1, ..., m)

• Let si be the number of samples of S in class Ci

• The expected information needed to classify a given sample is given by

  I(s1, s2, ..., sm) = - Σi pi log2(pi)

where pi is the probability that an arbitrary sample

belongs to class Ci and is estimated by si/s.
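A small Python helper for this formula (the name `expected_info` is ours); its value for class counts 9 and 5 matches the worked customer-database example later in the deck:

```python
from math import log2

def expected_info(counts):
    """I(s1, ..., sm) = -Σ pi log2(pi), with pi = si / s."""
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

print(round(expected_info([9, 5]), 2))   # 0.94
```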


Information Gain (3)

• Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A

• If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S


Information Gain (4)

• Let sij be the number of samples of the class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:

  E(A) = Σj [(s1j + s2j + ... + smj)/s] * I(s1j, s2j, ..., smj)

• The smaller the entropy value, the greater the purity of the subset partitions.


Information Gain (5)

• The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj,

  I(s1j, s2j, ..., smj) = - Σi pij log2(pij)

where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci


Information Gain (6)

The encoding information that would be gained by branching on A is

Gain(A) = I(s1, s2, ..., sm) – E(A)

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A


Example (1)

  RID  Age     Income  student  credit_rating  buys_computer
  1    <=30    high    no       fair           no
  2    <=30    high    no       excellent      no
  3    31..40  high    no       fair           yes
  4    >40     medium  no       fair           yes
  5    >40     low     yes      fair           yes
  6    >40     low     yes      excellent      no
  7    31..40  low     yes      excellent      yes
  8    <=30    medium  no       fair           no
  9    <=30    low     yes      fair           yes
  10   >40     medium  yes      fair           yes
  11   <=30    medium  yes      excellent      yes
  12   31..40  medium  no       excellent      yes
  13   31..40  high    yes      fair           yes
  14   >40     medium  no       excellent      no


Example (2)

• Let us consider the training set of tuples taken from the customer database.
• The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).

C1 corresponds to yes: s1 = 9
C2 corresponds to no:  s2 = 5

I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94


Example (3)

• Next, we need to compute the entropy of each attribute. Let us start with the attribute age:

for age = '<=30':   s11 = 2, s21 = 3, I(s11, s21) = 0.971
for age = '31..40': s12 = 4, s22 = 0, I(s12, s22) = 0
for age = '>40':    s13 = 2, s23 = 3, I(s13, s23) = 0.971


Example (4)

The entropy of age is:

  E(age) = 5/14 * I(s11, s21) + 4/14 * I(s12, s22) + 5/14 * I(s13, s23) = 0.694

The gain in information from such a partitioning would be:

  Gain(age) = I(s1, s2) - E(age) = 0.246
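The E(age) and Gain(age) computations can be reproduced in a few lines of Python (helper names are ours; the slide's 0.246 is the same value to within rounding):

```python
from math import log2

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

def entropy_of_split(partitions):
    """E(A) = Σj (|Sj| / s) * I(class counts in Sj)."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * info(p) for p in partitions)

# (yes, no) class counts in each age partition: <=30, 31..40, >40
age_partitions = [(2, 3), (4, 0), (2, 3)]
e_age = entropy_of_split(age_partitions)
gain_age = info((9, 5)) - e_age
print(round(e_age, 3))     # 0.694
print(round(gain_age, 3))  # 0.247
```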


Example (5)

• We can compute:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

• Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.


Example (6)

  age?
  ├─ <=30   → buys_computer: yes, no
  ├─ 31..40 → buys_computer: yes
  └─ >40    → buys_computer: yes, no


Example (7)

  age?
  ├─ <=30   → student?
  │           ├─ no  → no
  │           └─ yes → yes
  ├─ 31..40 → yes
  └─ >40    → credit_rating?
              ├─ excellent → no
              └─ fair      → yes


Entropy vs. Gini index

• Gini index tends to isolate the largest class from all other classes

• Entropy tends to find groups of classes that add up to 50% of the data

Example with class counts A: 40, B: 30, C: 20, D: 10:

  Split preferred by Gini (if age < 40):
    yes → class A: 40
    no  → class B: 30, class C: 20, class D: 10

  Split preferred by entropy (if age < 65):
    yes → class A: 40, class D: 10
    no  → class B: 30, class C: 20
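The stated preferences can be verified numerically with the class counts from the figure (helper names are ours):

```python
from math import log2

def gini(counts):
    s = sum(counts)
    return 1 - sum((c / s) ** 2 for c in counts)

def entropy(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

def weighted(measure, partitions):
    """Size-weighted impurity of a split into the given partitions."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * measure(p) for p in partitions)

# Class counts: A=40, B=30, C=20, D=10
isolate_largest = [(40,), (30, 20, 10)]   # "age < 40": A vs. the rest
balanced_groups = [(40, 10), (30, 20)]    # "age < 65": A+D vs. B+C

# Gini prefers isolating the largest class ...
assert weighted(gini, isolate_largest) < weighted(gini, balanced_groups)
# ... while entropy prefers the roughly 50/50 grouping.
assert weighted(entropy, balanced_groups) < weighted(entropy, isolate_largest)
```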


Tree pruning

• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.

• Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.


Tree pruning

• Prepruning approach (stopping): a tree is 'pruned' by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples.

• Postpruning approach (pruning): removes branches from a 'fully grown' tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled by the most frequent class among its former branches.


Extracting Classification Rules from Decision Trees

• The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.

• One rule is created for each path from the root to a leaf node

• Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent


Extracting Classification Rules from Decision Trees

• The decision tree of Example (7) can be converted to the following classification rules:

IF age = '<=30' AND student = 'no' THEN buys_computer = 'no'
IF age = '<=30' AND student = 'yes' THEN buys_computer = 'yes'
IF age = '31..40' THEN buys_computer = 'yes'
IF age = '>40' AND credit_rating = 'excellent' THEN buys_computer = 'no'
IF age = '>40' AND credit_rating = 'fair' THEN buys_computer = 'yes'
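Rules extracted this way can be applied directly to new tuples. A sketch with a hypothetical encoding of the five rules as condition/label pairs (the dictionary layout is ours, not part of any standard rule format):

```python
# Each rule is (conditions, class); every condition must match for it to fire.
rules = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31..40"}, "yes"),
    ({"age": ">40", "credit_rating": "excellent"}, "no"),
    ({"age": ">40", "credit_rating": "fair"}, "yes"),
]

def classify(tuple_):
    """Return the class of the first rule whose antecedent matches."""
    for conditions, label in rules:
        if all(tuple_.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule fires

print(classify({"age": "31..40", "student": "no", "credit_rating": "fair"}))  # yes
```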


Other Classification Methods

• There are a number of classification methods in the literature:
– Bayesian classifiers

– Neural-network classifiers

– K-nearest neighbor classifiers

– Association-based classifiers

– Rough and fuzzy sets


Classification Based on Concepts from Association Rule Mining

• We may apply the "quantitative rule mining" approach to discover classification rules; this is known as associative classification

• It mines rules of the form condset ⇒ y, where condset is a set of items (or attribute-value pairs) and y is a class label.


Bayesian classifiers

• A Bayesian classifier is a statistical classifier. It can predict the probability that a given sample belongs to a particular class.

• Bayesian classification is based on the Bayes theorem of a-posteriori probability.

• Let X be a data sample whose class label is unknown. Each sample is represented by an n-dimensional vector, X = (x1, x2, ..., xn).

• The classification problem may be formulated using a-posteriori probabilities as follows: determine P(C|X), the probability that the sample X belongs to a specified class C.

• P(C|X) is the a-posteriori probability of C conditioned on X.


Bayesian classifiers

• Example:

Given a set of samples describing credit applicants, P(Risk=low | Age=38, Marital_Status=divorced, Income=low, children=2) is the probability that the credit applicant X = (38, divorced, low, 2) is a low-risk applicant.

• The idea of Bayesian classification is to assign to a new unknown sample X the class label C such that P(C| X) is maximal.


Bayesian classifiers

• The main problem is how to estimate the a-posteriori probability P(C|X).

• By Bayes theorem: P(C|X) = (P(X|C) * P(C)) / P(X), where P(C) is the a-priori probability of C, that is, the probability that any given sample belongs to the class C; P(X|C) is the probability of X conditioned on C; and P(X) is the a-priori probability of X.

• In our example, P(X|C) is the probability of X = (38, divorced, low, 2) given the class Risk=low, P(C) is the probability of the class C, and P(X) is the probability of observing the sample X = (38, divorced, low, 2).


Bayesian classifiers

• Suppose a training database D consists of n samples, and suppose the class label attribute has m distinct values defining m distinct classes C_i, for i = 1, ..., m.

• Let s_i denote the number of samples of D in class C_i.

• A Bayesian classifier assigns an unknown sample X to the class C_i that maximizes P(C_i|X). Since P(X) is constant for all classes, the class C_i for which P(C_i|X) is maximized is the class for which P(X|C_i) * P(C_i) is maximized.

• P(C_i) may be estimated by s_i/n (the relative frequency of the class C_i), or we may assume that all classes are equally probable: P(C_1) = P(C_2) = ... = P(C_m).


Bayesian classifiers

• The main problem is how to compute P(X|C_i).

• Given a large dataset with many predictor attributes, it would be very expensive to compute P(X|C_i); therefore, to reduce the cost, the assumption of class conditional independence (in other words, the attribute independence assumption) is made.

• The assumption states that there are no dependencies among the predictor attributes, which leads to the following formula:

  P(X|C_i) = Π(j=1..n) P(x_j|C_i)


Bayesian classifiers

• The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the dataset:

– if the j-th attribute is categorical, then P(x_j|C_i) is estimated as the relative frequency of samples of the class C_i having value x_j for the j-th attribute,

– if the j-th attribute is continuous, then P(x_j|C_i) is estimated through a Gaussian density function.

• Due to the class conditional independence assumption, the Bayesian classifier is also known as the naive Bayesian classifier.
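A minimal naive Bayesian classifier for categorical attributes can be sketched as follows (relative-frequency estimates only, no smoothing; the tiny dataset is invented for illustration):

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(C_i) and P(x_j | C_i) by relative frequency and return
    a classifier that picks the class maximizing P(X|C_i) * P(C_i)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: k / n for c, k in class_counts.items()}
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1

    def classify(x):
        def score(c):
            p = priors[c]
            for j, v in enumerate(x):
                p *= cond[(c, j)][v] / class_counts[c]
            return p
        return max(priors, key=score)

    return classify

# Invented (Car Type, Age group) samples with a Risk class label
X = [("sport", "young"), ("sport", "young"), ("family", "old"),
     ("truck", "old"), ("family", "young")]
y = ["high", "high", "low", "low", "high"]
classify = train_naive_bayes(X, y)
print(classify(("sport", "young")))   # high
```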


Bayesian classifiers

• The assumption makes computation possible. Moreover, when the assumption is satisfied, the naive Bayesian classifier is optimal, that is it is the most accurate classifier in comparison to all other classifiers.

• However, the assumption is seldom satisfied in practice, since attributes are usually correlated.

• Several attempts are being made to apply Bayesian analysis without assuming attribute independence. The resulting models are called Bayesian networks or Bayesian belief networks

• Bayesian belief networks combine Bayesian analysis with causal relationships between attributes.


k-nearest neighbor classifiers

• The nearest neighbor classifier belongs to the family of instance-based learning methods.

• Instance-based learning methods differ from other classification methods discussed earlier in that they do not build a classifier until a new unknown sample needs to be classified.

• Each training sample is described by an n-dimensional vector representing a point in an n-dimensional space called the pattern space. When a new unknown sample has to be classified, a distance function is used to determine the member of the training set which is closest to the unknown sample.


• Once the nearest training sample is located in the pattern space, its class label is assigned to the unknown sample.

• The main drawback of this approach is that it is very sensitive to noisy training samples.

• The common solution to this problem is to adopt the k-nearest neighbor strategy.

• When a new unknown sample has to be classified, the classifier searches the pattern space for the k training samples which are closest to the unknown sample. These k training samples are called the k "nearest neighbors" of the unknown sample and the most common class label among k "nearest neighbors" is assigned to the unknown sample.

• To find the k "nearest neighbors" of the unknown sample a multidimensional index is used (e.g. R-tree, Pyramid tree, etc.).
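A minimal k-nearest neighbor sketch, using a linear scan over the pattern space instead of a multidimensional index (toy points, illustrative only):

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """training: list of (vector, label) pairs; returns the majority
    class label among the k training samples nearest to query."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda s: dist(s[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

training = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
            ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((0.2, 0.1), "A")]
label = knn_classify(training, (0.15, 0.15), k=3)
```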

• Two different issues need to be addressed regarding the k-nearest neighbor method:

– the distance function, and

– the transformation from a sample to a point in the pattern space.

• The first issue is to define the distance function. If the attributes are numeric, most k-nearest neighbor classifiers use Euclidean distance.

• Instead of the Euclidean distance, we may also apply other distance metrics like Manhattan distance, maximum of dimensions, or Minkowski distance.
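The listed metrics in one sketch: Minkowski distance with p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit p → ∞ the maximum of dimensions (Chebyshev) distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Maximum of dimensions: the limit of Minkowski as p -> infinity."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (1.0, 2.0), (4.0, 6.0)
d_manhattan = minkowski(a, b, 1)   # 3 + 4 = 7
d_euclidean = minkowski(a, b, 2)   # sqrt(9 + 16) = 5
d_max       = chebyshev(a, b)      # max(3, 4) = 4
```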


• The second issue is how to transform a sample to a point in the pattern space.

• Note that different attributes may have different scales and units, and different variability. Thus, if the distance metric is used directly, the effects of some attributes might be dominated by other attributes that have larger scale or higher variability.

• A simple solution to this problem is to weight the various attributes. One common approach is to normalize all attribute values into the range [0, 1].
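Min-max normalization into [0, 1] can be sketched as (hypothetical attribute values):

```python
def min_max_normalize(values):
    """Rescale a list of attribute values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000.0, 50_000.0, 80_000.0]   # hypothetical attribute values
normalized = min_max_normalize(incomes)    # [0.0, 0.5, 1.0]
```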


• This solution is sensitive to outliers, since a single outlier could cause virtually all other values to be compressed into a small subrange.

• Another common approach is to apply a standardization transformation, such as subtracting the mean from the value of each attribute and then dividing by its standard deviation.
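A sketch of the standardization transformation (z-scores, using the population standard deviation; toy values, illustrative only):

```python
import math

def standardize(values):
    """Subtract the mean from each value, then divide by the std deviation."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sd for v in values]

scores = standardize([2.0, 4.0, 6.0])   # mean 4, std deviation ~1.63
```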

• Recently, another approach was proposed, which consists in applying a robust space transformation called the Donoho-Stahel estimator; this estimator has some important and useful properties that make it very attractive for different data mining applications.


Classifier accuracy

• The accuracy of a classifier on a given test set of samples is defined as the percentage of test samples correctly classified by the classifier, and it measures the overall performance of the classifier.

• Note that the accuracy of the classifier is not estimated on the training dataset, since it would not be a good indicator of the future accuracy on new data.

• The reason is that the classifier generated from the training dataset tends to overfit the training data, and any estimate of the classifier's accuracy based on that data will be overoptimistic.

• In other words, the classifier is more accurate on the data that was used to train it, but it will very likely be less accurate on an independent set of data.

• To predict the accuracy of the classifier on new data, we need to assess its accuracy on an independent dataset that played no part in the formation of the classifier.

• This dataset is called the test set.

• It is important to note that the test dataset should not be used in any way to build the classifier.


• There are several methods for estimating classifier accuracy. The choice of a method depends on the amount of sample data available for training and testing.

• If there are a lot of sample data, then the following simple holdout method is usually applied.

• The given set of samples is randomly partitioned into two independent sets, a training set and a test set (typically, 70% of the data is used for training, and the remaining 30% is used for testing).

• Provided that both sets of samples are representative, the accuracy of the classifier on the test set will give a good indication of accuracy on new data.
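The holdout method can be sketched as a seeded random 70/30 split (the seed is arbitrary, for reproducibility only):

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=42):
    """Randomly partition samples into independent training and test sets."""
    shuffled = samples[:]                      # keep the original list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]      # (training set, test set)

data = list(range(100))                        # 100 hypothetical samples
train, test = holdout_split(data)              # 70 / 30 split
```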


• In general, it is difficult to say whether a given set of samples is representative or not, but at least we may ensure that the random sampling of the data set is done in such a way that the class distribution of samples in both training and test set is approximately the same as that in the initial data set.

• This procedure is called stratification.
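A sketch of stratification: each class is split separately, so both the training and the test set keep approximately the class distribution of the initial data set (toy labels, illustrative only):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_fraction=0.7, seed=42):
    """Split each class separately to preserve the class distribution."""
    by_class = defaultdict(list)
    for s, c in zip(samples, labels):
        by_class[c].append(s)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * train_fraction)
        train += members[:cut]
        test += members[cut:]
    return train, test

samples = list(range(10))
labels = ["yes"] * 5 + ["no"] * 5          # hypothetical class labels
train, test = stratified_split(samples, labels)
```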


Testing – large dataset

[Diagram: the available examples are divided randomly into a training set (70%), used to develop one tree, and a test set (30%), used to check accuracy.]


Classifier accuracy

• If the amount of data for training and testing is limited, the problem is how to use this limited amount of data both for training, to get a good classifier, and for testing, to obtain a correct estimate of the classifier's accuracy.

• The standard and very common technique for measuring the accuracy of a classifier when the amount of data is limited is k-fold cross-validation.

• In k-fold cross-validation, the initial set of samples is randomly partitioned into k approximately equal, mutually exclusive subsets, called folds: S_1, S_2, ..., S_k.


• Training and testing is performed k times. At each iteration, one fold is used for testing while the remaining k-1 folds are used for training, so at the end each fold has been used exactly once for testing and k-1 times for training.

• The accuracy estimate is the overall number of correct classifications from the k iterations divided by the total number of samples N in the initial dataset.

• Often, the k-fold cross-validation technique is combined with stratification and is then called stratified k-fold cross-validation.
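The k-fold procedure can be sketched as follows; the classifier itself is abstracted into a `train_and_test` callback, and the hypothetical callback below simply gets 9 of every 10 test samples right:

```python
def k_fold_accuracy(samples, k, train_and_test):
    """train_and_test(train, test) -> number of correctly classified
    test samples; returns the overall accuracy over all k iterations."""
    folds = [samples[i::k] for i in range(k)]    # k roughly equal folds
    correct = 0
    for i in range(k):
        test = folds[i]                           # one fold for testing
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        correct += train_and_test(train, test)
    return correct / len(samples)                 # overall accuracy estimate

# Hypothetical classifier that classifies 90% of test samples correctly:
acc = k_fold_accuracy(list(range(100)), 10,
                      lambda train, test: int(0.9 * len(test)))
```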


Testing – small dataset

[Diagram: cross-validation – the available examples are divided into a training set (90%) and a test set (10%); the split is repeated 10 times, developing 10 different trees and checking accuracy on each test set.]


Classifier accuracy

• There are many other methods of estimating classifier accuracy on a particular dataset.

• Two popular methods are leave-one-out cross-validation and bootstrapping.

• Leave-one-out cross-validation is simply N-fold cross-validation, where N is the number of samples in the initial dataset.

• At each iteration, a single sample from the dataset is left out for testing, and the remaining samples are used for training. The result of testing is either success or failure.

• The results of all N evaluations, one for each sample from the dataset, are averaged, and that average represents the final accuracy estimate.


• Bootstrapping is based on sampling with replacement.

• The initial dataset is sampled N times, where N is the total number of samples in the dataset, with replacement, to form another set of N samples for training.

• Since some samples in this new "set" will be repeated, some samples from the initial dataset will not appear in this training set. These samples form the test set.

• Both estimation methods are especially interesting for estimating classifier accuracy on small datasets. In practice, the standard and most popular technique for estimating classifier accuracy is stratified tenfold cross-validation.
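The bootstrap split can be sketched as follows (seeded for reproducibility; on average about 63.2% of the distinct samples are drawn into the training set, leaving roughly 36.8% for testing):

```python
import random

def bootstrap_split(samples, seed=42):
    """Draw N samples with replacement for training; samples that were
    never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.choice(samples) for _ in samples]   # N draws, replacement
    drawn = set(train)
    test = [s for s in samples if s not in drawn]    # never-drawn samples
    return train, test

data = list(range(100))
train, test = bootstrap_split(data)
```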


Requirements

• Focus on mega-induction

• Handle both continuous and categorical data

• No restriction on:

– number of examples

– number of attributes

– number of classes


Applications

• Treatment effectiveness

• Credit approval

• Store location

• Target marketing

• Insurance company (fraud detection)

• Telecommunication company (client classification)