Spring 2003 1
Classification
Spring 2003 2
Classification task
• Input: a training set of tuples, each labeled with one class label
• Output: a model (classifier) that assigns a class label to each tuple based on the other attributes
• The model can be used to predict the class of new tuples, for which the class label is missing or unknown
Spring 2003 3
What is Classification
• Data classification is a two-step process:
– first step: a model is built describing a predetermined set of data classes or concepts
– second step: the model is used for classification
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute
• Data tuples are also referred to as samples, examples, or objects
Spring 2003 4
Train and test
• The tuples (examples, samples) are divided into training set + test set
• Classification model is built in two steps:
– training: build the model from the training set
– test: check the accuracy of the model using the test set
Spring 2003 5
Train and test
• Kinds of models:
– if-then rules
– logical formulae
– decision trees
• Accuracy of models:
– the known class of test samples is matched against the class predicted by the model
– accuracy rate = % of test set samples correctly classified by the model
Spring 2003 6
Training step
Age  Car Type  Risk
20   Combi     High
18   Sports    High
40   Sports    High
50   Family    Low
35   Minivan   Low
30   Combi     High
32   Family    Low
40   Combi     Low
training data
Classification algorithm
Classifier(model)
if age < 31 or Car Type = Sports then Risk = High
Spring 2003 7
Test step
Age  Car Type  Risk
27   Sports    High
34   Family    Low
66   Family    High
44   Sports    High
test data
Classifier(model)
Risk (predicted): High, Low, Low, High
Spring 2003 8
Classification (prediction)
Age  Car Type  Risk
27   Sports    ?
34   Minivan   ?
55   Family    ?
34   Sports    ?
new data
Classifier(model)
Risk (predicted): High, Low, Low, High
Spring 2003 9
Classification vs. Prediction
• There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends:
– classification: predict categorical labels
– prediction: models continuous-valued functions
Spring 2003 10
Comparing Classification Methods (1)
• Predictive accuracy: this refers to the ability of the model to correctly predict the class label of new or previously unseen data
• Speed: this refers to the computation costs involved in generating and using the model
• Robustness: this is the ability of the model to make correct predictions given noisy data or data with missing values
Spring 2003 11
Comparing Classification Methods (2)
• Scalability: this refers to the ability to construct the model efficiently given a large amount of data
• Interpretability: this refers to the level of understanding and insight that is provided by the model
• Simplicity:
– decision tree size
– rule compactness
• Domain-dependent quality indicators
Spring 2003 12
Problem formulation
Given records in the database, each with a class label, find a model for each class.
Age  Car Type  Risk
20   Combi     High
18   Sports    High
40   Sports    High
50   Family    Low
35   Minivan   Low
30   Combi     High
32   Family    Low
40   Combi     Low
Age < 31?
  yes → High
  no  → Car Type is sports?
          yes → High
          no  → Low
Spring 2003 13
Classification techniques
• Decision Tree Classification
• Bayesian Classifiers
• Neural Networks
• Statistical Analysis
• Genetic Algorithms
• Rough Set Approach
• k-nearest neighbor classifiers
Spring 2003 14
Classification by Decision Tree Induction
• A decision tree is a tree structure, where
– each internal node denotes a test on an attribute,
– each branch represents the outcome of the test,
– leaf nodes represent classes or class distributions
Age < 31?
  Y → High
  N → Car Type is sports?
        Y → High
        N → Low
Spring 2003 15
Decision Tree Induction (1)
• A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.
• Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned
Spring 2003 16
Decision Tree Induction (2)
• Basic algorithm: a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.
• Many variants:
– from machine learning (ID3, C4.5)
– from statistics (CART)
– from pattern recognition (CHAID)
• Main difference: split criterion
Spring 2003 17
Decision Tree Induction (3)
• The algorithm consists of two phases:
– build an initial tree from the training data such that each leaf node is pure
– prune this tree to increase its accuracy on test data
Spring 2003 18
Tree Building
• In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.
• The form of the split used to partition the data depends on the type of the attribute used in the split:
– for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
– for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)
Spring 2003 19
Tree Building Algorithm

MakeTree(Training Data T) {
    Partition(T)
}

Partition(Data S) {
    if (all points in S are in the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A
    use best split found to partition S into S1 and S2
    Partition(S1)
    Partition(S2)
}
Spring 2003 20
Tree Building Algorithm
• While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf
• To evaluate the goodness of a split, several splitting indices have been proposed
Spring 2003 21
Split Criteria
• Gini index (CART, SPRINT)
– select the attribute that minimizes the impurity of a split
• Information gain (ID3, C4.5)
– entropy is used to measure the impurity of a split
– select the attribute that maximizes the entropy reduction
• χ² contingency table statistic (CHAID)
– measures the correlation between each attribute and the class label
– select the attribute with the maximal correlation
Spring 2003 22
Gini index (1)
Given a sample training set where each record represents a car-insurance applicant, we want to build a model of what makes an applicant a high or low insurance risk.
RID  Age  Car Type  Risk
0    23   family    high
1    17   sport     high
2    43   sport     high
3    68   family    low
4    32   truck     low
5    20   family    high
Training set
The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories
Classifier(model)
Spring 2003 23
Gini index (2)
SPRINT algorithm:

Partition(Data S) {
    if (all points in S are of the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    use best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
}
Initial call: Partition(Training Data)
Spring 2003 24
Gini index (3)
• Definition:
gini(S) = 1 - Σ_j (p_j)^2
where:
• S is a data set containing examples from n classes
• p_j is the relative frequency of class j in S
• E.g. for two classes, Pos and Neg, and a dataset S with p Pos-elements and n Neg-elements:
p_pos = p/(p+n),  p_neg = n/(p+n)
gini(S) = 1 - (p_pos)^2 - (p_neg)^2
Spring 2003 25
Gini index (4)
• If dataset S is split into S1 and S2, then splitting index is defined as follows:
giniSPLIT(S) = (p1 + n1)/(p + n) * gini(S1) + (p2 + n2)/(p + n) * gini(S2)

where p1, n1 (respectively p2, n2) denote the numbers of Pos-elements and Neg-elements in the dataset S1 (respectively S2).
• In this definition the "best" split point is the one with the lowest value of the giniSPLIT index.
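To make the split evaluation concrete, here is a minimal Python sketch of the Gini computations used by the Partition procedure above. It is an illustration only, not the original SPRINT implementation (which works on sorted attribute lists and class histograms); all function and variable names are ours.

from collections import Counter

def gini(labels):
    # gini(S) = 1 - sum_j (p_j)^2, computed from a list of class labels
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    # giniSPLIT(S) = |S1|/|S| * gini(S1) + |S2|/|S| * gini(S2)
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_numeric_split(values, labels):
    # evaluate candidate splits value(A) <= x and return (best x, lowest giniSPLIT)
    pairs = sorted(zip(values, labels))
    best_x, best_g = None, float("inf")
    for i in range(len(pairs) - 1):
        x = pairs[i][0]
        left = [lab for v, lab in pairs if v <= x]
        right = [lab for v, lab in pairs if v > x]
        g = gini_split(left, right)
        if g < best_g:
            best_x, best_g = x, g
    return best_x, best_g

# car-insurance training set from the slides: Age vs. Risk
ages = [23, 17, 43, 68, 32, 20]
risk = ["high", "high", "high", "low", "low", "high"]
print(best_numeric_split(ages, risk))   # -> (23, 0.222...), i.e. split on Age <= 23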
Spring 2003 26
Example (1)
RID  Age  Car Type  Risk
0    23   family    high
1    17   sport     high
2    43   sport     high
3    68   family    low
4    32   truck     low
5    20   family    high
Training set
Spring 2003 27
Example (1)
Age  RID  Risk
17   1    high
20   5    high
23   0    high
32   4    low
43   2    high
68   3    low
Attribute list for ‘Age’
Attribute list for ‘Car Type’
Car Type  RID  Risk
family    0    high
sport     1    high
sport     2    high
family    3    low
truck     4    low
family    5    high
Spring 2003 28
Example (2)
• Possible values of a split point for the Age attribute are:
Age<=17, Age<=20, Age<=23, Age<=32, Age<=43, Age<=68

Tuple count  High  Low
Age<=17      1     0
Age>17       3     2

G(Age<=17) = 1 - (1^2 + 0^2) = 0
G(Age>17) = 1 - ((3/5)^2 + (2/5)^2) = 1 - 13/25 = 12/25
GSPLIT = (1/6)*0 + (5/6)*(12/25) = 2/5
Spring 2003 29
Example (3)
Tuple count  High  Low
Age<=20      2     0
Age>20       2     2

G(Age<=20) = 1 - (1^2 + 0^2) = 0
G(Age>20) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (2/6)*0 + (4/6)*(1/2) = 1/3

Tuple count  High  Low
Age<=23      3     0
Age>23       1     2

G(Age<=23) = 1 - (1^2 + 0^2) = 0
G(Age>23) = 1 - ((1/3)^2 + (2/3)^2) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6)*0 + (3/6)*(4/9) = 2/9
Spring 2003 30
Example (4)
Tuple count  High  Low
Age<=32      3     1
Age>32       1     1

G(Age<=32) = 1 - ((3/4)^2 + (1/4)^2) = 1 - 10/16 = 6/16 = 3/8
G(Age>32) = 1 - ((1/2)^2 + (1/2)^2) = 1/2
GSPLIT = (4/6)*(3/8) + (2/6)*(1/2) = 1/4 + 1/6 = 5/12

The lowest value of GSPLIT is obtained for Age<=23, thus we place the split point at Age = (23+32)/2 = 27.5
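The values above can be double-checked with exact fractions in a few lines of Python; this is only a verification aid (our own helper names), not part of the SPRINT algorithm.

from fractions import Fraction as F

def gini2(high, low):
    # two-class Gini index from the High/Low counts
    n = high + low
    return 1 - F(high, n) ** 2 - F(low, n) ** 2

def gini_split2(h1, l1, h2, l2):
    n = h1 + l1 + h2 + l2
    return F(h1 + l1, n) * gini2(h1, l1) + F(h2 + l2, n) * gini2(h2, l2)

# candidate Age splits for the 6-tuple training set (counts of High, Low on each side)
print(gini_split2(1, 0, 3, 2))   # Age <= 17 -> 2/5
print(gini_split2(2, 0, 2, 2))   # Age <= 20 -> 1/3
print(gini_split2(3, 0, 1, 2))   # Age <= 23 -> 2/9  (lowest)
print(gini_split2(3, 1, 1, 1))   # Age <= 32 -> 5/12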
Spring 2003 31
Example (5)

Decision tree after the first split of the example set:

Age <= 27.5 → Risk = High
Age > 27.5  → Risk = Low
Spring 2003 32
Example (6)
Attribute lists are divided at the split point.

Attribute lists for Age <= 27.5:

Age  RID  Risk
17   1    high
20   5    high
23   0    high

Car Type  RID  Risk
family    0    high
sport     1    high
family    5    high

Attribute lists for Age > 27.5:

Age  RID  Risk
32   4    low
43   2    high
68   3    low

Car Type  RID  Risk
sport     2    high
family    3    low
truck     4    low
Spring 2003 33
Example (7)
Evaluating splits for categorical attributes

We have to evaluate the splitting index for each of the 2^N subsets, where N is the cardinality of the categorical attribute's domain.

Tuple count             High  Low
Car type = {sport}      1     0
Car type = {family}     0     1
Car type = {truck}      0     1

G(Car type ∈ {sport}) = 1 - 1^2 - 0^2 = 0
G(Car type ∈ {family}) = 1 - 0^2 - 1^2 = 0
G(Car type ∈ {truck}) = 1 - 0^2 - 1^2 = 0
Spring 2003 34
Example (8)
G(Car type ∈ {sport, family}) = 1 - (1/2)^2 - (1/2)^2 = 1/2
G(Car type ∈ {sport, truck}) = 1/2
G(Car type ∈ {family, truck}) = 1 - 0^2 - 1^2 = 0

GSPLIT(Car type ∈ {sport}) = (1/3)*0 + (2/3)*0 = 0
GSPLIT(Car type ∈ {family}) = (1/3)*0 + (2/3)*(1/2) = 1/3
GSPLIT(Car type ∈ {truck}) = (1/3)*0 + (2/3)*(1/2) = 1/3
GSPLIT(Car type ∈ {sport, family}) = (2/3)*(1/2) + (1/3)*0 = 1/3
GSPLIT(Car type ∈ {sport, truck}) = (2/3)*(1/2) + (1/3)*0 = 1/3
GSPLIT(Car type ∈ {family, truck}) = (2/3)*0 + (1/3)*0 = 0
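A short Python sketch of how the subset splits of a categorical attribute could be enumerated and scored (names are illustrative; a real SPRINT implementation works on count matrices rather than raw tuples):

from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_subset_split(values, labels):
    # score every proper, non-empty subset X of the domain for the split value(A) in X;
    # a subset and its complement describe the same split, so half the work is redundant
    domain = sorted(set(values))
    best_subset, best_g = None, float("inf")
    for r in range(1, len(domain)):
        for subset in combinations(domain, r):
            left = [lab for v, lab in zip(values, labels) if v in subset]
            right = [lab for v, lab in zip(values, labels) if v not in subset]
            n = len(labels)
            g = len(left) / n * gini(left) + len(right) / n * gini(right)
            if g < best_g:
                best_subset, best_g = set(subset), g
    return best_subset, best_g

# the Age > 27.5 partition from the example: (sport, high), (family, low), (truck, low)
print(best_subset_split(["sport", "family", "truck"], ["high", "low", "low"]))
# -> ({'sport'}, 0.0), equivalently {'family', 'truck'}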
Spring 2003 35
Example (9)
The lowest value of GSPLIT is for Car type ∈ {sport}, thus this is our split point. Decision tree after the second split of the example set:

Age <= 27.5 → Risk = High
Age > 27.5:
    Car type ∈ {sport}          → Risk = High
    Car type ∈ {family, truck}  → Risk = Low
Spring 2003 36
Information Gain (1)
• The information gain measure is used to select the test attribute at each node in the tree
• The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node
• This attribute minimizes the information needed to classify the samples in the resulting partitions
Spring 2003 37
Information Gain (2)
• Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i=1, ..., m)
• Let si be the number of samples of S in class Ci
• The expected information needed to classify a given sample is given by
I(s1, s2, ..., sm) = - Σ_{i=1..m} p_i log2(p_i)
where pi is the probability that an arbitrary sample
belongs to class Ci and is estimated by si/s.
Spring 2003 38
Information Gain (3)
• Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A
• If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S
Spring 2003 39
Information Gain (4)
• Let sij be the number of samples of class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:

E(A) = Σ_{j=1..v} [(s1j + s2j + ... + smj)/s] * I(s1j, s2j, ..., smj)
• The smaller the entropy value, the greater the purity of the subset partitions.
Spring 2003 40
Information Gain (5)
• The term (s1j + s2j + ... + smj)/s acts as the weight of the j-th subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj,

I(s1j, s2j, ..., smj) = - Σ_{i=1..m} pij log2(pij)

where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci
Spring 2003 41
Information Gain (6)
The encoding information that would be gained by branching on A is
Gain(A) = I(s1, s2, ..., sm) – E(A)
Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A
Spring 2003 42
Example (1)
RID  Age     Income  student  credit_rating  buys_computer
1    <=30    high    no       fair           no
2    <=30    high    no       excellent      no
3    31..40  high    no       fair           yes
4    >40     medium  no       fair           yes
5    >40     low     yes      fair           yes
6    >40     low     yes      excellent      no
7    31..40  low     yes      excellent      yes
8    <=30    medium  no       fair           no
9    <=30    low     yes      fair           yes
10   >40     medium  yes      fair           yes
11   <=30    medium  yes      excellent      yes
12   31..40  medium  no       excellent      yes
13   31..40  high    yes      fair           yes
14   >40     medium  no       excellent      no
Spring 2003 43
Example (2)
• Let us consider the training set of tuples from the customer database shown in Example (1).
• The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m=2).
C1 corresponds to yes: s1 = 9
C2 corresponds to no: s2 = 5
I(s1, s2) = I(9, 5) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940
Spring 2003 44
Example (3)
• Next, we need to compute the entropy of each attribute. Let's start with the attribute age:

for age = ‘<=30’:   s11 = 2, s21 = 3,  I(s11, s21) = 0.971
for age = ’31..40’: s12 = 4, s22 = 0,  I(s12, s22) = 0
for age = ‘>40’:    s13 = 3, s23 = 2,  I(s13, s23) = 0.971
Spring 2003 45
Example (4)
The entropy of age is:

E(age) = (5/14)*I(s11, s21) + (4/14)*I(s12, s22) + (5/14)*I(s13, s23) = 0.694
The gain in information from such a partitioning would be:
Gain(age) = I(s1, s2) – E(age) = 0.246
Spring 2003 46
Example (5)
• Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.
Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.
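The gains quoted above can be reproduced with a short Python sketch (standard library only; the helper names are ours):

from math import log2
from collections import Counter

def info(labels):
    # I(s1, ..., sm) = -sum_i p_i * log2(p_i)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, class_attr="buys_computer"):
    labels = [r[class_attr] for r in rows]
    e = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[class_attr] for r in rows if r[attribute] == value]
        e += len(subset) / len(rows) * info(subset)   # weighted entropy E(A)
    return info(labels) - e

# the 14-tuple training set from Example (1)
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),       (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),  ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),      (">40", "medium", "no", "excellent", "no"),
]
cols = ["age", "income", "student", "credit_rating", "buys_computer"]
rows = [dict(zip(cols, t)) for t in data]
for a in cols[:-1]:
    print(a, round(gain(rows, a), 3))
# prints values matching the slides up to rounding of intermediate results
# (age 0.247, income 0.029, student 0.152, credit_rating 0.048)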
Spring 2003 47
Example (6)
age = <=30   → buys_computers: yes, no (needs further splitting)
age = 31..40 → buys_computers: yes
age = >40    → buys_computers: yes, no (needs further splitting)
Spring 2003 48
Example (7)

Final decision tree:

age = <=30   → student?
                  no  → buys_computers: no
                  yes → buys_computers: yes
age = 31..40 → buys_computers: yes
age = >40    → credit_rating?
                  excellent → buys_computers: no
                  fair      → buys_computers: yes
Spring 2003 49
Entropy vs. Gini index
• Gini index tends to isolate the largest class from all other classes, e.g.:

class A 40, class B 30, class C 20, class D 10
if age < 40:
    yes → class A 40
    no  → class B 30, class C 20, class D 10

• Entropy tends to find groups of classes that add up to 50% of the data, e.g.:

class A 40, class B 30, class C 20, class D 10
if age < 65:
    yes → class A 40, class D 10
    no  → class B 30, class C 20
Spring 2003 50
Tree pruning
• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
• Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.
Spring 2003 51
Tree pruning
• Prepruning approach (stopping): a tree is ‘pruned’ by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples.
• Postpruning approach (pruning): removes branches from a ‘fully grown’ tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches.
Spring 2003 52
Extracting Classification Rules from Decision Trees
• The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.
• One rule is created for each path from the root to a leaf node
• Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent
Spring 2003 53
Extracting Classification Rules from Decision Trees
• The decision tree of Example (7) can be converted to classification rules:
IF age=‘<=30’ AND student=‘no’ THEN buys_computers=‘no’
IF age=‘<=30’ AND student=‘yes’ THEN buys_computers=‘yes’
IF age=’31..40’ THEN buys_computers=‘yes’
IF age=‘>40’ AND credit_rating=‘excellent’
THEN buys_computers=‘no’
IF age=‘>40’ AND credit_rating=‘fair’
THEN buys_computers=‘yes’
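One possible way to mechanize this extraction is sketched below in Python over a nested-dict encoding of the tree from Example (7); the encoding and the helper name are our own illustration, not part of the slides.

# internal node: {attribute: {value: subtree_or_leaf}}; leaf: a class label string
tree = {"age": {
    "<=30":   {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, conditions=()):
    # emit one IF-THEN rule per root-to-leaf path
    if not isinstance(node, dict):                    # leaf reached: node is the class label
        antecedent = " AND ".join("%s='%s'" % (a, v) for a, v in conditions)
        print("IF %s THEN buys_computers='%s'" % (antecedent, node))
        return
    (attribute, branches), = node.items()
    for value, subtree in branches.items():
        extract_rules(subtree, conditions + ((attribute, value),))

extract_rules(tree)   # prints the five rules listed above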
Spring 2003 54
Other Classification Methods
• There are a number of classification methods in the literature:
– Bayesian classifiers
– Neural-network classifiers
– K-nearest neighbor classifiers
– Association-based classifiers
– Rough and fuzzy sets
Spring 2003 55
Classification Based on Concepts from Association Rule Mining
• We may apply the "quantitative rule mining" approach to discover classification rules – associative classification
• It mines rules of the form condset → y, where condset is a set of items (or attribute-value pairs) and y is a class label.
Spring 2003 56
Bayesian classifiers
• Bayesian classifier is a statistical classifier. It can predict the probability that a given sample belongs to a particular class.
• Bayesian classification is based on Bayes theorem of a-posteriori probability.
• Let X be a data sample whose class label is unknown. Each sample is represented by an n-dimensional vector, X = (x1, x2, ..., xn).
• The classification problem may be formulated using a-posteriori probabilities as follows: determine P(C|X), the probability that the sample X belongs to a specified class C.
• P(C|X) is the a-posteriori probability of C conditioned on X.
Spring 2003 57
Bayesian classifiers
• Example:
Given a set of samples describing credit applicants, P(Risk=low | Age=38, Marital_Status=divorced, Income=low, Children=2) is the probability that a credit applicant X=(38, divorced, low, 2) is a low-risk applicant.
• The idea of Bayesian classification is to assign to a new unknown sample X the class label C such that P(C| X) is maximal.
Spring 2003 58
Bayesian classifiers
• The main problem is how to estimate the a-posteriori probability P(C|X).
• By Bayes' theorem: P(C|X) = (P(X|C) * P(C))/P(X), where P(C) is the a-priori probability of C, that is, the probability that any given sample belongs to class C, P(X|C) is the a-posteriori probability of X conditioned on C, and P(X) is the a-priori probability of X.
• In our example, P(X|C) is the probability that X=(38, divorced, low, 2) given the class Risk=low, P(C) is the probability of the class C, and P(X) is the probability that the sample X=(38, divorced, low, 2).
Spring 2003 59
Bayesian classifiers
• Suppose a training database D consists of n samples, and suppose the class label attribute has m distinct values defining m distinct classes C_i, for i = 1, ..., m.
• Let s_i denotes the number of samples of D in class C_i.
• Bayesian classifier assigns an unknown sample X to the class C_i that maximizes P(C_i|X). Since P(X) is constant for all classes, the class C_i for which P(C_i|X) is maximized is the class C_i for which P(X|C_i) * P(C_i) is maximized.
• P(C_i) may be estimated by s_i/n (the relative frequency of class C_i), or we may assume that all classes are equally likely: P(C_1) = P(C_2) = ... = P(C_m).
Spring 2003 60
Bayesian classifiers
• The remaining problem is how to compute P(X|C_i).
• Given a large dataset with many predictor attributes, it would be very expensive to compute P(X|C_i) directly; therefore, to reduce the cost of this computation, the assumption of class conditional independence (in other words, the attribute independence assumption) is made.
• The assumption states that there are no dependencies among the predictor attributes, which leads to the following formula:

P(X|C_i) = ∏_{j=1..n} P(x_j|C_i)
Spring 2003 61
Bayesian classifiers
• The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the dataset:
– if the j-th attribute is categorical, then P(x_j|C_i) is estimated as the relative frequency of samples of class C_i having value x_j for the j-th attribute,
– if the j-th attribute is continuous, then P(x_j|C_i) is estimated through a Gaussian density function.
• Due to the class conditional independence assumption, the Bayesian classifier is also known as the naive Bayesian classifier.
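A minimal categorical naive Bayes sketch in Python (relative-frequency estimates only, no Gaussian handling for continuous attributes; the toy records are made up for illustration and are not from the slides):

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    # estimate P(C_i) and P(x_j | C_i) by relative frequencies
    prior = Counter(labels)
    cond = defaultdict(Counter)                 # (class, attribute position) -> value counts
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1
    return prior, cond, len(labels)

def classify(x, prior, cond, n):
    # return the class maximizing P(X|C_i) * P(C_i) under attribute independence
    best_class, best_score = None, -1.0
    for c, count_c in prior.items():
        score = count_c / n                     # P(C_i)
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / count_c  # P(x_j | C_i); zero if value unseen
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# made-up toy records (age group, car type) -> risk, loosely following the insurance example
samples = [("young", "sports"), ("young", "combi"), ("old", "family"),
           ("old", "minivan"), ("young", "family"), ("old", "combi")]
labels  = ["high", "high", "low", "low", "high", "low"]
model = train_naive_bayes(samples, labels)
print(classify(("young", "sports"), *model))    # -> 'high'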
Spring 2003 62
Bayesian classifiers
• The assumption makes the computation feasible. Moreover, when the assumption holds, the naive Bayesian classifier is optimal, that is, it is the most accurate classifier in comparison to all other classifiers.
• However, the assumption is seldom satisfied in practice, since attributes are usually correlated.
• Several attempts are being made to apply Bayesian analysis without assuming attribute independence. The resulting models are called Bayesian networks or Bayesian belief networks
• Bayesian belief networks combine Bayesian analysis with causal relationships between attributes.
Spring 2003 63
k-nearest neighbor classifiers
• Nearest neighbor classifier belongs to instance-based learning methods.
• Instance-based learning methods differ from other classification methods discussed earlier in that they do not build a classifier until a new unknown sample needs to be classified.
• Each training sample is described by an n-dimensional vector representing a point in an n-dimensional space called the pattern space. When a new unknown sample has to be classified, a distance function is used to determine the member of the training set which is closest to the unknown sample.
Spring 2003 64
k-nearest neighbor classifiers
• Once the nearest training sample is located in the pattern space, its class label is assigned to the unknown sample.
• The main drawback of this approach is that it is very sensitive to noisy training samples.
• The common solution to this problem is to adopt the k-nearest neighbor strategy.
• When a new unknown sample has to be classified, the classifier searches the pattern space for the k training samples which are closest to the unknown sample. These k training samples are called the k "nearest neighbors" of the unknown sample and the most common class label among k "nearest neighbors" is assigned to the unknown sample.
• To find the k "nearest neighbors" of the unknown sample a multidimensional index is used (e.g. R-tree, Pyramid tree, etc.).
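A compact brute-force k-NN sketch in Python (no multidimensional index; the sample points are made up for illustration):

from collections import Counter
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, unknown, k=3):
    # training: list of (point, class_label); return the majority class among the k nearest
    neighbors = sorted(training, key=lambda item: euclidean(item[0], unknown))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# made-up 2-D points (e.g. two normalized attributes), not taken from the slides
training = [((0.1, 0.2), "high"), ((0.2, 0.1), "high"), ((0.8, 0.9), "low"),
            ((0.9, 0.8), "low"), ((0.7, 0.7), "low")]
print(knn_classify(training, (0.15, 0.15), k=3))   # -> 'high'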
Spring 2003 65
k-nearest neighbor classifiers
• Two different issues need to be addressed regarding the k-nearest neighbor method:
– the distance function, and
– the transformation from a sample to a point in the pattern space.
• The first issue is to define the distance function. If the attributes are numeric, most k-nearest neighbor classifiers use Euclidean distance.
• Instead of the Euclidean distance, we may also apply other distance metrics like Manhattan distance, maximum of dimensions, or Minkowski distance.
Spring 2003 66
k-nearest neighbor classifiers
• The second issue is how to transform a sample to a point in the pattern space.
• Note that different attributes may have different scales and units, and different variability. Thus, if the distance metric is used directly, the effects of some attributes might be dominated by other attributes that have larger scale or higher variability.
• A simple solution to this problem is to weight the various attributes. One common approach is to normalize all attribute values into the range [0, 1].
Spring 2003 67
k-nearest neighbor classifiers
• This solution is sensitive to the outliers problem since a single outlier could cause virtually all other values to be contained in a small subrange.
• Another common approach is to apply a standardization transformation, such as subtracting the mean from the value of each attribute and then dividing by its standard deviation.
• Recently, another approach was proposed which consists of applying a robust space transformation called the Donoho-Stahel estimator; the estimator has some important and useful properties that make it very attractive for different data mining applications.
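Both transformations mentioned above can be sketched in a few lines of Python (helper names are ours):

def min_max_normalize(values):
    # rescale attribute values into [0, 1]; sensitive to outliers
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # subtract the mean and divide by the standard deviation (z-scores)
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [20, 18, 40, 50, 35, 30, 32, 40]          # the Age column from the first training set
print([round(v, 2) for v in min_max_normalize(ages)])
print([round(z, 2) for z in standardize(ages)])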
Spring 2003 68
Classifier accuracy
• The accuracy of a classifier on a given test set of samples is defined as the percentage of test samples correctly classified by the classifier, and it measures the overall performance of the classifier.
• Note that the accuracy of the classifier is not estimated on the training dataset, since it would not be a good indicator of the future accuracy on new data.
• The reason is that the classifier generated from the training dataset tends to overfit the training data, and any estimate of the classifier's accuracy based on that data will be overoptimistic.
Spring 2003 69
Classifier accuracy
• In other words, the classifier is more accurate on the data that was used to train it, but it will very likely be less accurate on an independent set of data.
• To predict the accuracy of the classifier on new data, we need to assess its accuracy on an independent dataset that played no part in the formation of the classifier.
• This dataset is called the test set.
• It is important to note that the test dataset should not be used in any way to build the classifier.
Spring 2003 70
Classifier accuracy
• There are several methods for estimating classifier accuracy. The choice of a method depends on the amount of sample data available for training and testing.
• If there are a lot of sample data, then the following simple holdout method is usually applied.
• The given set of samples is randomly partitioned into two independent sets, a training set and a test set (typically, 70% of the data is used for training, and the remaining 30% is used for testing)
• Provided that both sets of samples are representative, the accuracy of the classifier on the test set will give a good indication of accuracy on new data.
Spring 2003 71
Classifier accuracy
• In general, it is difficult to say whether a given set of samples is representative or not, but at least we may ensure that the random sampling of the data set is done in such a way that the class distribution of samples in both training and test set is approximately the same as that in the initial data set.
• This procedure is called stratification
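A stratified holdout split could be sketched as follows in Python (standard library only; the function name and the 30% test fraction are illustrative):

import random
from collections import defaultdict

def stratified_holdout(samples, labels, test_fraction=0.3, seed=0):
    # split into training and test sets, keeping the class distribution roughly the same
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    train_idx, test_idx = [], []
    for c, idx in by_class.items():
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

labels = ["high"] * 7 + ["low"] * 3
train, test = stratified_holdout(list(range(10)), labels)
print(len(train), len(test))   # 7 and 3; each part keeps roughly the original 70%/30% class mix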
Spring 2003 72
Testing – large dataset
Available examples are divided randomly:
– Training set (70%): used to develop one tree
– Test set (30%): used to check accuracy
Spring 2003 73
Classifier accuracy
• If the amount of data for training and testing is limited, the problem is how to use this limited amount of data for training to get a good classifier and for testing to obtain a correct estimation of the classifier accuracy?
• The standard and very common technique of measuring the accuracy of a classifier when the amount of data is limited is k-fold cross-validation
• In k-fold cross-validation, the initial set of samples is randomly partitioned into k approximately equal mutually exclusive subsets, called folds, S_1, S_2, ..., S_k.
Spring 2003 74
Classifier accuracy
• Training and testing is performed k times. At each iteration, one fold is used for testing while the remaining k-1 folds are used for training. So, at the end, each fold has been used exactly once for testing and k-1 times for training.
• The accuracy estimate is the overall number of correct classifications from k iterations divided by the total number of samples N in the initial dataset.
• Often, the k-fold cross-validation technique is combined with stratification and is called stratified k-fold cross-validation
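A k-fold cross-validation loop can be sketched as follows; the classifier is passed in as two callables (train and predict), and the trivial majority-class classifier in the demo exists only to make the sketch runnable:

import random
from collections import Counter

def k_fold_accuracy(samples, labels, train, predict, k=10, seed=0):
    # k iterations: each fold is used once for testing, the remaining k-1 folds for training
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train([samples[j] for j in training], [labels[j] for j in training])
        correct += sum(predict(model, samples[j]) == labels[j] for j in folds[i])
    return correct / len(samples)          # overall correct classifications / N

majority_train = lambda xs, ys: Counter(ys).most_common(1)[0][0]
majority_predict = lambda model, x: model
labels = ["high"] * 6 + ["low"] * 4
print(k_fold_accuracy(list(range(10)), labels, majority_train, majority_predict, k=5))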
Spring 2003 75
Testing – small dataset
Available examples are divided randomly; this is repeated 10 times (cross-validation):
– Training set (90%): used to develop 10 different trees
– Test set (10%): used to check accuracy
Spring 2003 76
Classifier accuracy
• There are many other methods of estimating classifier accuracy on a particular dataset.
• Two popular methods are leave-one-out cross-validation and bootstrapping.
• Leave-one-out cross-validation is simply N-fold cross-validation, where N is the number of samples in the initial dataset.
• At each iteration, a single sample from the dataset is left out for testing, and the remaining samples are used for training. The result of testing is either success or failure.
• The results of all N evaluations, one for each sample from the dataset, are averaged, and that average represents the final accuracy estimate.
Spring 2003 77
Classifier accuracy
• Bootstrapping is based on sampling with replacement.
• The initial dataset is sampled N times, where N is the total number of samples in the dataset, with replacement, to form another set of N samples for training.
• Since some samples in this new "set" will be repeated, some samples from the initial dataset will not appear in the training set; these samples form the test set.
• Both estimation methods are especially interesting for estimating classifier accuracy on small datasets. In practice, the standard and most popular technique for estimating classifier accuracy is stratified tenfold cross-validation.
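Sampling with replacement and forming the test set from the unused samples could be sketched like this (illustrative names):

import random

def bootstrap_split(samples, labels, seed=0):
    # draw N samples with replacement for training; samples never drawn form the test set
    rng = random.Random(seed)
    n = len(samples)
    drawn = [rng.randrange(n) for _ in range(n)]           # indices, possibly repeated
    train = [(samples[i], labels[i]) for i in drawn]
    unused = sorted(set(range(n)) - set(drawn))
    test = [(samples[i], labels[i]) for i in unused]
    return train, test

train, test = bootstrap_split(list("abcdefghij"), [0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
print(len(train), len(test))   # train always has N entries; on average about 36.8% of the
                               # original samples are never drawn and end up in the test set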
Spring 2003 78
Requirements
• Focus on mega-induction
• Handle both continuous and categorical data
• No restriction on:
– number of examples
– number of attributes
– number of classes
Spring 2003 79
Applications
• Treatment effectiveness
• Credit approval
• Store location
• Target marketing
• Insurance company (fraud detection)
• Telecommunication company (client classification)