data mining models and evaluation techniques

Data Mining Models and EvaluationTechniques

Shubham Pachori12BCE055

DEPARTMENT OF COMPUTER ENGINEERINGAHMEDABAD -382424

November 2014

Data Mining Models And EvaluationTechniques

Seminar

Submitted in partial fulfillment of the requirements

For the degree of

Bachelor of Technology inComputer Science and Engineering

Shubham Pachori12BCE055

DEPARTMENT OF COMPUTER ENGINEERINGAHMEDABAD -382424

November 2014

Data Mining Models And Evaluation Technique

CERTIFICATE

This is to certify that the seminar entitled Data Mining Models and EvaluationTechniques submitted by Shubham Pachori (12BCE055), towards the partial ful-fillment of the requirements for the degree of Bachelor of Technology in ComputerScience And Engineering Nirma University, Ahmedabad is the record of work car-ried out by him under my supervision and guidance. In my opinion, the submittedwork has reached a level required for being accepted for examination. The resultsembodied in this Seminar, to the best of my knowledge, havent been submitted toany other university or institution for award of any degree or diploma.

Prof. K.P.Agarwal Prof. Anuja NairAssociate Professor, Assistant Professor,CSE Department, CSE Department,Institute Of Technology, Institute Of Technology,Nirma University, Ahmedabad. Nirma University, Ahmedabad.

Dr. Sanjay GargProf & Head Of Department,CSE Department,Institute Of Technology,Nirma University, Ahmedabad.

CSE Department,Institute of Technology, Nirma University i


Acknowledgements

I am profoundly grateful to Prof. K P AGARWAL for his expert guidancethroughout the project.His continuous encouragement have fetched us the goldenresults. His elixir of knowledge in the field has made this project achieve its zenithand credibility.

I would like to express deepest appreciation towards , Prof. SANJAY GARG,Head of Department of Computer Engineering and Prof. ANUJA NAIR, whoseinvaluable guidance supported us in completing this project.

At last I must express my sincere heartfelt gratitude to all the staff members ofComputer Engineering Department who helped me directly or indirectly during thiscourse of work.

SHUBHAM PACHORI12BCE055

CSE Department,Institute of Technology, Nirma University ii


Abstract

Databases are rich with hidden information that can be used for intelligent de-cision making. Classification and prediction are two forms of data analysis thatcan be used to extract models describing important data classes or to predict futuredata trends. Such analysis can help provide us with a better understanding of thedata at large. Classification models predicts categorical (discrete, unordered) labelfunctions. For example, we can build a classification model to categorize bank loanapplications as either safe or risky.

As predictions always have an implicit cost involved it is important to evaluateclassifiers generalization performance in order to determine whether to employ theclassifier (For example: when learning the effectiveness of medical treatments froma limited-size data, it is important to estimate the accuracy of the classifiers.) andto Optimize the classifier. (For example: when post-pruning decision trees we mustevaluate the accuracy of the decision trees on each pruning step.)

This seminar report gives an in depth explanation of classifier models (viz.Naive Bayesian and Decision Trees) and how these classifier models are evaluatedfor their accuracy on predictions. The later part of the report also deals with how toimprove the accuracy of these classifier models and it includes an exploratory studycomparing the various model evaluation techniques , carried out in Weka(A GUIBased Data Mining Tool) on representative data sets.

CSE Department,Institute of Technology, Nirma University iii


Contents

Certificate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

1 Introduction 2

2 Classification Using Decision Tree 42.1 Understanding Decision Trees . . . . . . . . . . . . . . . . . . . . 42.2 Divide and Conquer . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3 C5.0 Decision Tree Algorithm . . . . . . . . . . . . . . . . . . . . 82.4 How To Choose The Best Split? . . . . . . . . . . . . . . . . . . . 92.5 Pruning The Decision Tree . . . . . . . . . . . . . . . . . . . . . . 10

3 Probabilistic Learning - Naive Bayesian Classification 123.1 Understanding Naive Bayesian Classification . . . . . . . . . . . . 123.2 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.3 The Naive Bayes Algorthim . . . . . . . . . . . . . . . . . . . . . 143.4 Naive Bayesian Classification . . . . . . . . . . . . . . . . . . . . 15

4 Model Evaluation Techniques 174.1 Prediction Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 174.2 Confusion Matrix and Model Evaluation Metrics . . . . . . . . . . 184.3 How To Estimate These Metrics? . . . . . . . . . . . . . . . . . . . 23

4.3.1 Training and Independent Test Data . . . . . . . . . . . . . 234.3.2 Holdout Method . . . . . . . . . . . . . . . . . . . . . . . 244.3.3 K-Cross-validation . . . . . . . . . . . . . . . . . . . . . . 254.3.4 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.3.5 Comparing Two Classifier Models . . . . . . . . . . . . . . 27

4.4 ROC Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.5.1 Why Ensemble Works? . . . . . . . . . . . . . . . . . . . . 31

CSE Department,Institute of Technology, Nirma University iv


4.5.2 Ensemble Works in Two Ways . . . . . . . . . . . . . . . . 324.5.3 Learn To Combine . . . . . . . . . . . . . . . . . . . . . . 334.5.4 Learn By Consensus . . . . . . . . . . . . . . . . . . . . . 334.5.5 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334.5.6 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Conclusion and Future Scope 375.1 Comprative Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 375.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.3 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

References 40

CSE Department, Institute Of Technology,Nirma University, Ahmedabad v


Chapter 1

Introduction

The term Knowledge Discovery in Databases, or KDD for short, refers to the broadprocess of finding knowledge in data, and emphasizes the high-level application ofparticular data mining methods. It is of interest to researchers in machine learning,pattern recognition, databases, statistics, artificial intelligence, knowledge acquisi-tion for expert systems, and data visualization.The unifying goal of the KDD processis to extract knowledge from data in the context of large databases.It does this byusing data mining methods (algorithms) to extract (identify) what is deemed knowl-edge, according to the specifications of measures and thresholds, using a databasealong with any required preprocessing, subsampling, and transformations of thatdatabase.

The overall process of finding and interpreting patterns from data involves therepeated application of the following steps:

Figure 1.1: KDD Process

1. Developing an understanding of1. the application domain

CSE Department,Institute of Technology, Nirma University 1


2. the relevant prior knowledge3. the goals of the end-user

2. Creating a target data set: selecting a data set, or focusing on a subset of vari-ables, or data samples, on which discovery is to be performed.

3. Data cleaning and preprocessing.1. Removal of noise or outliers.2. Removal of noise or outliers.3. Strategies for handling missing data fields.4.Accounting for time sequence information and known changes.

4. Data reduction and projection.1.Finding useful features to represent the data depending on the goal of thetask.2.Using dimensionality reduction or transformation methods to reduce the ef-fective number of variables under consideration or to find invariant representa-tions for the data.

5. Choosing the data mining task.Deciding whether the goal of the KDD processis classification, regression, clustering, etc.

6. Choosing the data mining algorithm1. Selecting method(s) to be used for searching for patterns in the data.2. Deciding which models and parameters may be appropriate.3. Matching a particular datamining method with the overall criteria of the KDD process.

7. Data mining.Searching for patterns of interest in a particular representational form or a setof such representations as classification rules or trees, regression, clustering,and so forth.

8. Interpreting mined patterns.

9. Consolidating discovered knowledge

In the following chapters we will be exploring Data Mining Models and Evaluationtechniques in depth

CSE Department, Institute Of Technology,Nirma University, Ahmedabad 2


Chapter 2

Classification Using Decision Tree

This chapter introduces the concept of the most widely used learning method thatapply a similar strategy of dividing data into smaller and smaller portions to identifypatterns that can be used for prediction. The knowledge is then presented in theform of logical structures that can be understood without any statistical knowledge.This aspect makes these models particularly useful for business strategy and processimprovement.

1. Understading Decision Tress

2. Divide and Conquer

3. Unique identifiers

4. C 5.0 Decision Tree Algorithm

5. Choosing The Best Split

6. Pruning The Decision Tress

2.1 Understanding Decision Trees

As we might intuit from the name itself, decision tree learners build a model inthe form of a tree structure. The model itself comprises a series of logical decisions,similar to a flowchart, with decision nodes that indicate a decision to be made onan attribute. These split into branches that indicate the decisions choices. The treeis terminated by leaf nodes (also known as terminal nodes) that denote the result offollowing a combination of decisions.

Data that is to be classified begin at the root node where it is passed throughthe various decisions in the tree according to the values of its features. The paththat the data takes funnels each record into a leaf node, which assigns it a predicted



class. As the decision tree is essentially a flowchart, it is particularly appropriate forapplications in which the classification mechanism needs to be transparent for legalreasons or the results need to be shared in order to facilitate decision making. Somepotential uses include:

1. Credit scoring models in which the criteria that causes an applicant to be re-jected need to be well-specified

2. Marketing studies of customer churn or customer satisfaction that will be sharedwith management or advertising agencies

3. Diagnosis of medical conditions based on laboratory measurements, symptoms,or rate of disease progression

In spite of their wide applicability, it is worth noting some scenarios where treesmay not be an ideal fit. One such case might be a task where the data has a largenumber of nominal features with many levels or if the data has a large number ofnumeric features. These cases may result in a very large number of decisions and anoverly complex tree.

2.2 Divide and Conquer

Decision trees are built using a heuristic called recursive partitioning. This ap-proach is generally known as divide and conquer because it uses the feature valuesto split the data into smaller and smaller subsets of similar classes. Beginning at theroot node, which represents the entire dataset, the algorithm chooses a feature that isthe most predictive of the target class. The examples are then partitioned into groupsof distinct values of this feature; this decision forms the first set of tree branches.The algorithm continues to divide-and-conquer the nodes, choosing the best candi-date feature each time until a stopping criterion is reached. This might occur at anode if:

1. All (or nearly all) of the examples at the node have the same class

2. There are no remaining features to distinguish among examples

3. The tree has grown to a predefined size limit

To illustrate the tree building process, lets consider a simple example. Imaginethat we are working for a Hollywood film studio, and our desk is piled high with



screenplays. Rather than read each one cover-to-cover, you decide to develop adecision tree algorithm to predict whether a potential movie would fall into one ofthree categories: mainstream hit, critics choice, or box office bust. To gather data foryour model, we turn to the studio archives to examine the previous ten years of moviereleases. After reviewing the data for 30 different movie scripts, a pattern emerges.There seems to be a relationship between the films proposed shooting budget, thenumber of A-list celebrities lined up for starring roles, and the categories of success.A scatter plot of this data might look something like the figure 2.1(Reference [2]):

Figure 2.1: Scatter Plot of Budget vs A-List(Ref[2]) Celeberities

To build a simple decision tree using this data, we can apply a divide-and-conquerstrategy. Lets first split the feature indicating the number of celebrities, partition-ing the movies into groups with and without a low number of A-list stars (fig 2.2Reference [2])

Figure 2.2: Split 1: Scatter Plot of Budget vs A-List Celeberities (Ref[2])



Next, among the group of movies with a larger number of celebrities, we canmake another split between movies with and without a high budget(fig2.3) At thispoint we have partitioned the data into three groups. The group at the top-left cornerof the diagram is composed entirely of critically-acclaimed films. This group isdistinguished by a high number of celebrities and a relatively low budget. At thetop-right corner, the majority of movies are box office hits, with high budgets and alarge number of celebrities. The final group, which has little star power but budgetsranging from small to large, contains the flops.

Figure 2.3: Split 2: Scatter Plot of Budget vs A-List Celeberities (Ref[2])

If we wanted, we could continue to divide the data by splitting it based on increas-ingly specific ranges of budget and celebrity counts until each of the incorrectly clas-sified values resides in its own, perhaps tiny partition. Since the data can continueto be split until there are no distinguishing features within a partition, a decision treecan be prone to be overfitting for the training data with overly-specific decisions.Well avoid this by stopping the algorithm here since more than 80 percent of theexamples in each group are from a single class.

Our model for predicting the future success of movies can be represented in a sim-ple tree as shown fig 2.4(Ref[2]). To evaluate a script, follow the branches througheach decision until its success or failure has been predicted. In no time, you willbe able to classify the backlog of scripts and get back to more important work suchas writing an awards acceptance speech. Since real-world data contains more thantwo features, decision trees quickly become far more complex than this, with manymore nodes, branches, and leaves. In the next section we will throw some light on apopular algorithm for building decision tree models automatically.



Figure 2.4: Decision Tree Model(Refernence[2])

2.3 C5.0 Decision Tree Algorithm

There are numerous implementations of decision trees, but one of the most wellknown is the C5.0 algorithm. This algorithm was developed by computer scientist J.Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is animprovement over his ID3 (Iterative Dichotomiser 3) algorithm.

Strengths of C5.0 Algorithm

1. An all-purpose classifier that does well on most problems

2. Highly-automatic learningprocess can handle numeric or nominal features,missing data

3. Uses only the most important features

4. Can be used on data with relatively few training examples or a very largenumber

5. Results in a model that can be interpreted without a mathematical background(for relatively small trees)

6. More efficient than other complex models

Weaknesses of C5.0 Algorithm



1. Decision tree models are often biased toward splits on features having a largenumber of levels

2. It is easy to overfit or underfit the model

3. Can have trouble modeling some relationships due to reliance on axisparallelsplits

4. Small changes in training data can result in large changes to decision logic

5. Large trees can be difficult to interpret and the decisions they make may seemcounterintuitive

2.4 How To Choose The Best Split?

The first challenge that a decision tree will face is to identify which feature tosplit upon. In the previous example, we looked for feature values that split the datain such a way that partitions contained examples primarily of a single class. If thesegments of data contain only a single class, they are considered pure. There aremany different measurements of purity for identifying splitting criteria C5.0 usesEntropy for measuring purity. The entropy of a sample of data indicates how mixedthe class values are; the minimum value of 0 indicates that the sample is completelyhomogenous, while 1 indicates the maximum amount of disorder. The definition ofentropy is specified by:

Entropy(S) =c

n=1pi log2(pi) (2.1)

In the entropy formula, for a given segment of data (S), the term c refers to thenumber of different class levels, and pi refers to the proportion of values falling intoclass level i. For example, suppose we have a partition of data with two classes: red(60 percent), and white (40 percent). We can calculate the entropy as:

Entropy(S) =0.60 log2(0.60)0.40 log2(0.40) = 0.9709506 (2.2)

Given this measure of purity, the algorithm must still decide which feature to splitupon. For this, the algorithm uses entropy to calculate the change in homogeneityresulting from a split on each possible feature. The calculation is known as infor-mation gain. The information gain for a feature F is calculated as the differencebetween the entropy in the segment before the split (S1), and the partitions resulting



from the split (S2):

In f oGain(F) = Entropy(S1)Entropy(S2) (2.3)

The one complication is that after a split, the data is divided into more than onepartition. Therefore, the function to calculate Entropy(S2) needs to consider thetotal entropy across all of the partitions. It does this by weighing each partitionsentropy by the proportion of records falling into that partition. This can be stated ina formula as:

Entropy(S) =n

n=1

wi log2(Pi) (2.4)

In simple terms, the total entropy resulting from a split is the sum of entropyof each of the n partitions weighted by the proportion of examples falling in thatpartition wi. The higher the information gain, the better a feature is at creatinghomogeneous groups after a split on that feature. If the information gain is zero,there is no reduction in entropy for splitting on this feature. On the other hand, themaximum information gain is equal to the entropy prior to the split. This wouldimply the entropy after the split is zero, which means that the decision results incompletely homogeneous groups.

The previous formulae assume nominal features, but decision trees use informa-tion gain for splitting on numeric features as well. A common practice is testingvarious splits that divide the values into groups greater than or less than a threshold.This reduces the numeric feature into a two-level categorical feature and informationgain can be calculated easily. The numeric threshold yielding the largest informationgain is chosen for the split

2.5 Pruning The Decision Tree

A decision tree can continue to grow indefinitely, choosing splitting features anddividing into smaller and smaller partitions until each example is perfectly classifiedor the algorithm runs out of features to split on. However, if the tree grows overlylarge, many of the decisions it makes will be overly specific and the model will havebeen over fitted to the training data. The process of pruning a decision tree involvesreducing its size such that it generalizes better to unseen data.



One solution to this problem is to stop the tree from growing once it reaches acertain number of decisions or if the decision nodes contain only a small number ofexamples. This is called early stopping or pre-pruning the decision tree. As the treeavoids doing needless work, this is an appealing strategy. However, one downside isthat there is no way to know whether the tree will miss subtle, but important patternsthat it would have learned had it grown to a larger size.

An alternative, called post-pruning involves growing a tree that is too large, thenusing pruning criteria based on the error rates at the nodes to reduce the size of thetree to a more appropriate level. This is often a more effective approach than pre-pruning because it is quite difficult to determine the optimal depth of a decision treewithout growing it first. Pruning the tree later on allows the algorithm to be certainthat all important data structures were discovered.

One of the benefits of the C5.0 algorithm is that it is opinionated about pruningittakes care of many of the decisions, automatically using fairly reasonable defaults.Its overall strategy is to postprune the tree. It first grows a large tree that overfits thetraining data. Later, nodes and branches that have little effect on the classificationerrors are removed. In some cases, entire branches are moved further up the treeor replaced by simpler decisions. These processes of grafting branches are knownas subtree raising and subtree replacement, respectively. Balancing overfitting andunderfitting a decision tree is a bit of an art, but if model accuracy is vital it maybe worth investing some time with various pruning options to see if it improvesperformance on the test data. As you will soon see, one of the strengths of the C5.0algorithm is that it is very easy to adjust the training options.



Chapter 3

Probabilistic Learning - Naive BayesianClassification

When a meteorologist provides a weather forecast, precipitation is typically pre-dicted using terms such as 70 percent chance of rain. These forecasts are knownas probability of precipitation reports. Have you ever considered how they are cal-culated? It is a puzzling question, because in reality, it will either rain or it willnot. This chapter covers a machine learning algorithm called naive Bayes, whichalso uses principles of probability for classification. Just as meteorologists forecastweather, naive Bayes uses data about prior events to estimate the probability of fu-ture events. For instance, a common application of naive Bayes uses the frequencyof words in past junk email messages to identify new junk mail.

3.1 Understanding Naive Bayesian Classification

The basic statistical ideas necessary to understand the naive Bayes algorithm havebeen around for centuries. The technique descended from the work of the 18th cen-tury mathematician Thomas Bayes, who developed foundational mathematical prin-ciples (now known as Bayesian methods) for describing the probability of events,and how probabilities should be revised in light of additional information. Classi-fiers based on Bayesian methods utilize training data to calculate an observed prob-ability of each class based on feature values. When the classifier is used later onunlabeled data, it uses the observed probabilities to predict the most likely class forthe new features. Its a simple idea, but it results in a method that often has resultson par with more sophisticated algorithms. In fact, Bayesian classifiers have beenused for:

1. Text classification, such as junk email (spam) filtering, author identification,or topic categorizatio



2. Intrusion detection or anomaly detection in computer networks

3. Diagnosing medical conditions, when given a set of observed symptoms

Typically, Bayesian classifiers are best applied to problems in which the informationfrom numerous attributes should be considered simultaneously in order to estimatethe probability of an outcome. While many algorithms ignore features that haveweak effects, Bayesian methods utilize all available evidence to subtly change thepredictions. If a large number of features have relatively minor effects, taken to-gether their combined impact could be quite large.

3.2 Bayes Theorem

Bayes theorem is named after Thomas Bayes, a nonconformist English clergymanwho did early work in probability and decision theory during the 18th century. Let Xbe a data tuple. In Bayesian terms, X is considered evidence. As usual, it is describedby measurements made on a set of n attributes. Let H be some hypothesis, such asthat the data tuple X belongs to a specified class C. For classification problems,we want to determine P(H|X), the probability that the hypothesis H holds given theevidence or observed data tuple X . In other words, we are looking for the probabilitythat tuple X belongs to class C, given that we know the attribute description of X .P(H|X) is the posterior probability, or a posterior probability, of H conditioned onX . For example, suppose our world of data tuples is confined to customers describedby the attributes age and income, respectively, and that X is a 35-year-old customerwith an income of Rs40,000. Suppose that H is the hypothesis that our customerwill buy a computer. Then P(H|X) reflects the probability that customer X willbuy a computer given that we know the customers age and income. In contrast,P(H) is the prior probability, or a prior probability, of H. For our example, thisis the probability that any given customer will buy a computer, regardless of age,income, or any other information, for that matter. The posterior probability, P(H|X),is based on more information (e.g., customer information) than the prior probability,P(H), which is independent of X . Similarly, P(X |H) is the posterior probability ofX conditioned on H. That is, it is the probability that a customer, X , is 35 years oldand earns Rs40,000, given that we know the customer will buy a computer. P(X) isthe prior probability of X .Using our example, it is the probability that a person fromour set of customers is 35 years old and earns Rs40,000. How are these probabilitiesestimated? P(H), P(X |H), and P(X) may be estimated from the given data, as weshall see below. Bayes theorem is useful in that it provides a way of calculating the



posterior probability, P(H|X), from P(H), P(X |H), and P(X). Bayes theorem is

P(H|X) = P(X |H)P(H)P(X)

(3.1)

3.3 The Naive Bayes Algorthim

The naive Bayes (NB) algorithm describes a simple application using Bayes the-orem for classification. Although it is not the only machine learning method utilizingBayesian methods, it is the most common, particularly for text classification whereit has become the de facto standard. Strengths and weaknesses of this algorithm areas follows

Strength

1. Simple, fast, and very effective

2. Does well with noisy and missing data

3. Requires relatively few examples for training, but also works well with verylarge numbers of examples

4. Easy to obtain the estimated probability for a prediction

Weaknesses

1. Relies on an often-faulty assumption of equally important and independentfeatures

2. Not ideal for datasets with large

3. Estimated probabilities are less reliable than the predicted classes

The naive Bayes algorithm is named as such because it makes a couple of naiveassumptions about the data. In particular, naive Bayes assumes that all of the featuresin the dataset are equally important and independent. These assumptions are rarelytrue in most of the real-world applications.

For example, if you were attempting to identify spam by monitoring email mes-sages, it is almost certainly true that some features will be more important thanothers. For example, the sender of the email may be a more important indicatorof spam than the message text. Additionally, the words that appear in the message



body are not independent from one another, since the appearance of some words is avery good indication that other words are also likely to appear. A message with theword Viagra is probably likely to also contain the words prescription or drugs. Prob-abilistic Learning Classification Using Naive Bayes [ 96 ] However, in most caseswhen these assumptions are violated, naive Bayes still performs fairly well. Thisis true even in extreme circumstances where strong dependencies are found amongthe features. Due to the algorithms versatility and accuracy across many types ofconditions, naive Bayes is often a strong first candidate for classification learningtasks.

3.4 Naive Bayesian Classification

The nave Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, eachtuple is represented by an n-dimensional attribute vector, X = (x1,x2, .....,xn),depicting n measurements made on the tuple from n attributes, respectivelyA1,A2, .....,An.

2. Suppose that there are m classes, C1,C2, ...,Cm. Given a tuple, X , the classifierwill predict that X belongs to the class having the highest posterior probability,conditioned on X . That is, the nave Bayesian classifier predicts that tuple Xbelongs to the class Ci if and only if

P(Ci|X)> P(C j|X) f or1 < j < m, j6 = i : (3.2)

Thus we maximize P(Ci|X). The class for which P(Ci|X) is maximized iscalled the maximum posterior hypothesis. By Bayes theorem

P(Ci|X) = P(X |Ci)P(Ci)P(X) (3.3)

3. As P(X) is constant for all classes, only P(X |Ci)P(Ci) need be maximized. Ifthe class prior probabilities are not known, then it is commonly assumed thatthe classes are equally likely, that is, P(C1) = P(C2) == P(Cm), and we wouldtherefore maximize P(X |Ci). Otherwise, we maximize P(X |Ci)P(Ci). Notethat the class prior probabilities may be estimated by P(Ci) = |Ci,D|/|D|,where|Ci,D| is the number of training tuples of class Ci in D.

4. Given data sets with many attributes, it would be extremely computationally



expensive to compute P(X |Ci). In order to reduce computation in evaluatingP(X |Ci), the naive assumption of class conditional independence is made. Thispresumes that the values of the attributes are conditionally independent of oneanother, given the class label of the tuple (i.e., that there are no dependencerelationships among the attributes). Thus,

P(X |Ci) =k

n=1

P(xk|Ci) (3.4)

5. In order to predict the class label of X , P(X |Ci)P(Ci) is evaluated for each classCi. The classifier predicts that the class label of tuple X is the class Ci if andonly if

P(X |Ci)P(Ci)> P(X |C j)P(C j) f or1 < j < m, j! = i (3.5)

In other words, the predicted class label is the class Ci for which P(X |Ci)P(Ci)is the maximum.



Chapter 4

Model Evaluation Techniques

As We now have in depth explored the two most widely used classifier models thequestion which we now face is how accurately these classifiers can predict the futuretrends based on the data used for building these classifier viz. How accurately a cus-tomer recomender system of a company can predict the future purchasing behaviorof the customer based on the previously recorded sales data of the customers.

Given the signifanct role these classifiers play their accuracy becomes of primeimportance to the companies speically to those in e-commerce system. Thus themodel evaluation techniques are employed to evaluate the accuracy of the predictionsmade by a classifier model.As different classifier models have varying strengths andweaknesses, it is necessary to use test that reveal distinctions among the learnerswhen measuring how a model will perform on future data.The Succeeding sectionsin this chapters will primarily focus on the following points.

1. The reason why predictive accuracy is not sufficient to measure performanceand what are the other alternatives to measure the accuracy

2. Methods to ensure that the performance measures reasonably reflect a modelsability to predict or forecast unseen data

4.1 Prediction Accuracy

The prediction accuracy of a classifier model is defined proportion of correct pre-dictions by the total number of predictions. This number indicates the percentage ofcases in which the learner is right or wrong. For instance, suppose a classifier cor-rectly identified whether or not 99,990 out of 100,000 newborn babies are carriersof a treatable but potentially-fatal genetic defect. This would imply an accuracy of99.99 percent and an error rate of only 0.01 percent.



Although this would appear to indicate an extremely accurate classifier, it wouldbe wise to collect additional information before trusting your childs life to the test.What if the genetic defect is found in only 10 out of every 100,000 babies? A testthat predicts no defect regardless of circumstances will still be correct for 99.99percent of all cases. In this case, even though the predictions are correct for the largemajority of data, the classifier is not very useful for its intended purpose, which is toidentify children with birth defects.

The best measure of classifier performance is whether the classifier is successfulat its intended purpose. For this reason, it is crucial to have measures of modelperformance that measure utility rather than raw accuracy

4.2 Confusion Matrix and Model Evaluation Metrics

A confusion matrix is a matrix that categorizes predictions according to whetherthey match the actual value in the data. One of the tables dimensions indicates thepossible categories of predicted values while the other dimension indicates the samefor actual values.It can be an order of n-matrix depending on the values which canbe achieved by the predicted class.Figure 4.1(Refernece [2]) depicts a 2x2 and 3x3confusion matrix. There are four important terms that are considered as the building

Figure 4.1: Confusion Matrox(Ref[2])

blocks used in computing many evaluation measures.The class of interest is knownas the positive class, while all others are known as negative.

1. True Positives(TP):Correctly classified as the class of interest.

2. True Negatives (TN):Correctly classified as not the class of interest.

3. False Positives(FP):Incorrectly classified as the class of interest.

4. False Negatives(FN):Incorrectly classified as not the class of interest.



The confusion matrix is useful tool for analysing how well our classifier can recog-nize tuples of different classes. TP and TN tell us when the classifier is getting thingsright, while FP and FN tell us when the classifier is getting things wrong.Given mclasses, a confusion matrix is a matrix of atleast m by m size An entry, CMi j inthe first m rows and m columns indicates the number of tuples of class i that werelabeled by the classifier as class j. For a classifier to have good accuracy, ideallymost of the tuples would be represented along the diagonal of the confusion matrixfrom the entryCM1,1 to entryCMm,m , with the rest of the entries being zero or closeto zero. That is ideally , FP and FN are around zero.

Accuracy: The accuracy of a classifier on a given test set is the percentage of testtuples that are correctly classified by the classifier.

accuracy=TP+TNP+N

(4.1)

Error Rate: Error Rate or miss classification rate of classifier, M, which is simply1accuracy(M), where accuracy (M) is the accuracy of M.

errorrate=FP+FNP+N

(4.2)

If we use the training set instead of test set to estimate the error rate of a model, thisquantity is known as the re-substituion error.This error estimate is optimistic of thetrue error rate because the model is not tested on any samples that it has not alreadyseen.

The Class Imbalance Problem: the datasets where the main class of interest israre. That is the data set distribution reflects a significant majority of the negativeclass and minority positive class.For example in fraud detection applications, theclass of interest fraudulent class is rare or less frequently occurring in comparisonto the negative non-fraudulentclass.In medical data there may be a rare class, suchas cancer. Suppose that we have trained a classifier to classify medical data tu-ples,where the class label attribute is cancer and the possible class values are yesand no. An accuracy rate of say 97% may make the classifier seem quite accurate,but what if only, say 3% of the training tuples are actually cancer? Clearly an ac-curacy rate of 97% may not be acceptable- the classifier could be correctly labelingonly the non-cancer tuples, for instance,and miss classifying all the cancer tuples.Instead we need other measures which access how well the classifier can recognize



the positive tuples and how well it can recognize the negative tuples

Sensitivity and Specificity:Classification often involves a balance between be-ing overly conservative and overly aggressive in decision making. For example, ane-mail filter could guarantee to eliminate every spam message by aggressively elim-inating nearly every ham message at the same time. On the other hand, a guaranteethat no ham messages will be inadvertently filtered might allow an unacceptableamount of spam to pass through the filter. This tradeoff is captured by a pair ofmeasures: sensitivity and specificity.

The sensitivity of a model (also called the true positive rate), measures the pro-portion of positive examples that were correctly classified. Therefore, as shown inthe following formula, it is calculated as the number of true positives divided by thetotal number of positives in the datathose correctly classified (the true positives), aswell as those incorrectly classified (the false negatives).

sensitivity=TP

TP+FN(4.3)

The specificity of a model (also called the true negative rate), measures the pro-portion of negative examples that were correctly classified. As with sensitivity, thisis computed as the number of true negatives divided by the total number of nega-tivesthe true negatives plus the false positives.

speci f icity=TN

TN+FP(4.4)

Precision and recall: Closely related to sensitivity and specificity are two otherperformance measures, related to compromises made in classification: precision andrecall. Used primarily in the context of information retrieval, these statistics areintended to provide an indication of how interesting and relevant a models resultsare, or whether the predictions are diluted by meaningless noise.

The precision (also known as the positive predictive value) is defined as the pro-portion of positive examples that are truly positive; in other words, when a modelpredicts the positive class, how often is it correct? A precise model will only predictthe positive class in cases very likely to be positive. It will be very trustworthy.



Consider what would happen if the model was very imprecise. Over time, theresults would be less likely to be trusted. In the context of information retrieval,this would be similar to a search engine such as Google returning unrelated results.Eventually users would switch to a competitor such as Bing. In the case of the SMSspam filter, high precision means that the model is able to carefully target only thespam while ignoring the ham.

precision=TP

TP+FP(4.5)

On the other hand, recall is a measure of how complete the results are. As shownin the following formula, this is defined as the number of true positives over thetotal number of positives. We may recognize that this is the same as sensitivity, onlythe interpretation differs. A model with high recall captures a large portion of thepositive examples, meaning that it has wide breadth. For example, a search enginewith high recall returns a large number of documents pertinent to the search query.Similarly, the SMS spam filter has high recall if the majority of spam messages arecorrectly identified.

recall =TP

TP+FN(4.6)

The F-Measure: A measure of model performance that combines precision andrecall into a single number is known as the F-measure (also sometimes called theF1 score or the F-score). The F-measure combines precision and recall using theharmonic mean. The harmonic mean is used rather than the more common arithmeticmean since both precision and recall are expressed as proportions between zero andone. The following is the formula for F-measure:

FMeasure= 2 precision recallrecall+ precision

(4.7)

F =(1+2) precision recall2 precision+ recall (4.8)

In addition to accuracy-based measures, classifiers can also be compared with re-spect to the following additional aspects:

1. Speed: This refers to the computational costs involved in generating and usingthe given classifier

2. Robustness: This is the ability of the classifier to make correct predictiongiven noisy data or data with missing values.Robustness is typically assessed



with a series of synthetic data sets represeting increasing degress of noise andmissing values.

3. Scalability: This refers to the ability to construct the classifier efficiently givenlarge amounts of data. Scalability is typically assessed with a series of data setsof increasing size.

4. Interpretability : This refers to the level of understanding and insight that isprovided by the classifier or predictor. Interpretability is subjective and there-fore more difficult to asses



4.3 How To Estimate These Metrics?

We can use following methods to estimate the evaluation metrics explained in-depth in the preceding sections:

a. Training data

b. Independent test data

c. Hold-out method

d. k-fold cross-validation method

e. Leave-one-out method

f. Bootstrap method

g. Comparing Two Models

4.3.1 Training and Independent Test Data

The accuracy/error estimates on the training data are not good indicators ofperformance on future data. Because new data will probably not be exactly the sameas the training data.The accuracy/error estimates on the training data measure thedegree of classifiers over-fitting.Fig 4.2 depicts use of training set Estimation with

Figure 4.2: Training Set

independent test data (figure 4.3)is used when we have plenty of data and thereis a natural way to forming training and test data. For example: Quinlan in 1987reported experiments in a medical domain for which the classifiers were trained ondata from 1985 and tested on data from 1986.

Figure 4.3: Training and Test Set



Figure 4.4: Classification: Train, Validation, Test Split Reference[3]

4.3.2 Holdout Method

The holdout method(fig4.5) is what we have alluded to so far in our discussionsabout accuracy. In this method, the given data are randomly partitioned into twoindependent sets, a training set and a test set. Typically, two-thirds of the data areallocated to the training set, and the remaining one-third is allocated to the test set.The training set is used to derive the model, whose accuracy is estimated with thetest set . The estimate is pessimistic because only a portion of the initial data is usedto derive the model. The hold-out method is usually used when we have thousands

Figure 4.5: Holdout Method



of instances, including several hundred instances from each class.

For unbalanced data-sets, samples might not be representative.Few or no instancesof some classes will be there in case of class imbalanced data where one class is inmajority viz. fraudulent transaction detection and Medical Diagnostic Tests. Tomake the sample Representative for holdout we use the concept of stratification inthis we ensure that each class gets equal representation according to their proportionin actual data-set

Random sub-sampling is a variation of the holdout method in which the hold-out method is repeated k times.In each iteration, a certain proportion is randomlyselected for training (possibly with stratification). The error rates on the differentiterations are averaged to yield an overall error rate.It is also known as repeatedholdout method.

4.3.3 K-Cross-validation

In k-fold cross-validation, (fig 4.6) the initial data are randomly partitioned into kmutually exclusive subsets or folds, D1,D2, ...,Dk each of approximately equal size.Training and testing is performed k times. In iteration i, partition Di is reserved asthe test set, and the remaining partitions are collectively used to train the model.That is, in the first iteration, subsets D2, ..,Dk collectively serve as the training set inorder to obtain a first model, which is tested on D1; the second iteration is trained onsubsets D1,D3, ....,Dk and tested on D2; and so on. Unlike the holdout and randomsubsampling methods above, here, each sample is used the same number of times fortraining and once for testing. For classification, the accuracy estimate is the overallnumber of correct classifications from the k iterations, divided by the total numberof tuples in the initial data. For prediction, the error estimate can be computed as thetotal loss from the k iterations, divided by the total number of initial tuples.

Leave one out CVLeave-one-out is a special case of k-fold cross-validation where k is set to the num-ber of initial tuples. That is, only one sample is left out at a time for the test set.Some features of Leave one out CV are

1. Makes best use of the data.

2. Involves no random sub-sampling.



Figure 4.6: k-cross-validation

Disadvantages of Leave one out CV:

1. A disadvantage of Leave-One-Out-CV is that stratification is not possible:

2. Very computationally expensive.

4.3.4 Bootstrap

Cross validation uses sampling without replacement. The same instance, once se-lected, can not be selected again for a particular training/test set.The The bootstrapuses sampling with replacement to form the training set:

1. Sample a dataset of n instances n times with replacement to form a new datasetof n instances.

2. Use this data as the training set.

3. Use the instances from the original dataset that dont occur in the new trainingset for testing.

4. A particular instance has a probability of 11n of not being picked Thus its prob-ability of ending up in the test data is (where n tends to infinity):

(1 1n)n = e1 = 0.368 (4.9)

5. This means the training data will contain approximately 63.2% of the instancesand the test data will contain approximately 36.8% of the instances.

6. The error estimate on the test data will be very pessimistic because the classifieris trained on just 63% of the instances.



7. Therefore, combine it with the training error:

err = 0.632 .e+0.368 e (4.10)

8. The training error gets less weight than the error on the test data.

4.3.5 Comparing Two Classifier Models

Suppose that we have generated two models, M1 and M2 (for either classification orprediction), from our data. We have performed 10-fold cross-validation to obtain amean error rate for each. How can we determine which model is best? It may seemintuitive to select the model with the lowest error rate, however, the mean error ratesare just estimates of error on the true population of future data cases. There can beconsiderable variance between error rates within any given 10-fold cross-validationexperiment. Although the mean error rates obtained for M1 and M2 may appeardifferent, that difference may not be statistically significant. What if any differencebetween the two may just be attributed to chance? following points explain in detailhow statistically significant is theirr difference

1. Assume that we have two classifiers, M1 and M2, and we would like to knowwhich one is better for a classification problem.

2. We test the classifiers on n test data sets D1,D2, ,Dn and we receive error rateestimates e11,e12, ,e1n for classifier M1 and error rate estimates e21,e22, ,e2nfor classifier M2.

3. Using rate estimates we can compute the mean error rate e1 for classifier M1and the mean error rate e2 for classifier M2.

4. These mean error rates are just estimates of error on the true population offuture data cases.

5. We note that error rate estimates e11,e12, ,e1n for classifier M1 and error rateestimates e21,e22, ,e2n for classifier M2 are paired. Thus, we consider the dif-ferences d1,d2, ,dn where d j = |e1 j e2 j|.

6. The differences d1,d2, ,dn are instantiations of n random variables D1,D2, ,Dnwith mean D and standard deviation D.

7. We need to establish confidence intervals for D in order to decide whether thedifference in the generalization performance of the classifiers M1 and M2 isstatistically significant or not.



8. Since the standard deviation D is unknown, we approximate it using the samplestandard deviation sd:

sd =

1n

n

i=1[(e1i e2i) (e1 e2)]2 (4.11)

9. T-statisticsT =

DDsdn

(4.12)

10. The T statistics is governed by t-distribution with n - 1 degrees of freedom.Figure 4.7 shows t-distribution curve (Refernce[4])

Figure 4.7: t-distribution curve (Reference [4])

11. If d and sd are the mean and standard deviation of the normally distributeddifferences of n random pairs of errors, a (1 )100% confidence interval for D =1 - 2 is :

dm t1/2 sdn< d < dm+ t1/2

sdn

(4.13)

where t/2 is the t-value with v= n1 degrees of freedom, leaving an area of/2 to the right.



12. If t > z or t < z t doesnt lie in the rejection region, within the tails of the distribution.This means that we can reject the null hypothesis that the means of M1 and M2 arethe same and conclude that there is a statistically significant difference between thetwo models. Otherwise, if we cannot reject the null hypothesis, we then concludethat any difference between M1 and M2 can be attributed to chance.

4.4 ROC Curves

The ROC curve (Receiver Operating Characteristic) is commonly used to exam-ine the tradeoff between the detection of true positives, while avoiding the falsepositives. As you might suspect from the name, ROC curves were developed byengineers in the field of communications around the time of World War II; receiversof radar and radio signals needed a method to discriminate between true signalsand false alarms. The same technique is useful today for visualizing the efficacy ofmachine learning models.

The characteristics of a typical ROC diagram are depicted in the following plot(figure4.8 Reference[2]). Curves are defined on a plot with the proportion of true positiveson the vertical axis, and the proportion of false positives on the horizontal axis. Be-cause these values are equivalent to sensitivity and (1 specificity), respectively, thediagram is also known as a sensitivity/specificity plot:

Figure 4.8: ROC curves (Reference[2])



The points comprising ROC curves indicate the true positive rate at varying falsepositive thresholds. To create the curves, a classifiers predictions are sorted bythe models estimated probability of the positive class, with the largest values first.Beginning at the origin, each predictions impact on the true positive rate and falsepositive rate will result in a curve tracing vertically (for a correct prediction), orhorizontally (for an incorrect prediction).

To illustrate this concept, three hypothetical classifiers are contrasted in the pre-vious plot. First, the diagonal line from the bottom-left to the top-right corner ofthe diagram represents a classifier with no predictive value. This type of classifierdetects true positives and false positives at exactly the same rate, implying that theclassifier cannot discriminate between the two. This is the baseline by which otherclassifiers may be judged; ROC curves falling close to this line indicate models thatare not very useful. Similarly, the perfect classifier has a curve that passes throughthe point at 100 percent true positive rate and 0 percent false positive rate. It isable to correctly identify all of the true positives before it incorrectly classifies anynegative result. Most real-world classifiers are similar to the test classifier; they fallsomewhere in the zone between perfect and useless.

The closer the curve is to the perfect classifier, the better it is at identifying posi-tive values. This can be measured using a statistic known as the area under the ROCcurve (abbreviated AUC). The AUC, as you might expect, treats the ROC diagramas a two-dimensional square and measures the total area under the ROC curve. AUCranges from 0.5 (for a classifier with no predictive value), to 1.0 (for a perfect clas-sifier). A convention for interpreting AUC scores uses a system similar to academicletter grades:

1. 0.9 1.0 = A (outstanding)

2. 0.8 0.9 = B (excellent/good)

3. 0.7 0.8 = C (acceptable/fair)

4. 0.6 0.7 = D (poor)

5. 0.5 0.6 = F (no discrimination)

As with most scales similar to this, the levels may work better for some tasksthan others; the categorization is somewhat subjective.



4.5 Ensemble Methods

Motivation

1. Ensemble model improves accuracy and robustness over single model methods

2. Applications:

(a) distributed computing

(b) privacy-preserving applications

(c) large-scale data with reusable models

(d) multiple sources of data

3. Efficiency: a complex problem can be decomposed into multiple sub-problemsthat are easier to understand and solve (divide-and-conquer approach)

4.5.1 Why Ensemble Works?

1. Intuition combining diverse, independent opinions in human decision-makingas a protective mechanism e.g. stock portfolio

2. Overcome limitations of single hypothesis The target function may not be im-plementable with individual classifiers, but may be approximated by model av-eraging

3. Gives a global picture

Figure 4.9: Ensemble Gives Global picture



4.5.2 Ensemble Works in Two Ways

1. Learn to Combine

Figure 4.10: Learn to Combine(Reference[3])

2. Learn By Consensus

Figure 4.11: Learn By Consensus(Refrence[3])



4.5.3 Learn To Combine

Pros

1. Get useful feed backs from labeled data.

2. Can potentially improve accuracy.

Cons

1. Need to keep the labeled data to train the ensemble

2. May overfit the labeled data.

3. Cannot work when no labels are available

4.5.4 Learn By Consensus

Pros

1. Do not need labeled data.

2. Can improve the generalization performance.

Cons

1. No feedbacks from the labeled data.

2. Require the assumption that consensus is better.

4.5.5 Bagging

Given a set, D, of d tuples, bagging works as follows. For iteration i(i= 1,2, ....,k),a training set,Di, of d tuples is sampled with replacement fromthe original set oftuples,D. The term bagging stands for bootstrap aggregation.Each training set is abootstrap sample . Because sampling with replacement is used, some of the originaltuples of D may not be included in Di,whereas other smay occur more than once.A classifier model, Mi, is learned for each training set, Di. To classify an unknowntuple, X , each classifier, Mi, returns its class prediction, which counts as one vote.The bagged classifier, M, counts the votes and assigns the class with the most votesto X . Bagging can be applied to the prediction of continuous values by taking theaverage value of each prediction for a given test tuple. . The bagged classifier oftenhas significantly greater accuracy than a single classifier derived from D, the originaltraining data. It will not be considerably worse and is more robust to the effects ofnoisy data. The increased accuracy occurs because the composite model reduces



the variance of the individual classifiers. For prediction, it was theoretically proventhat a bagged predictor will always have improved accuracy over a single predictorderived from D.

Alogrithm The bagging algorithmcreate an ensemble of models (classifiers orpredictors) for a learning scheme where each model gives an equally-weighted pre-diction.Input: D, a set of d training tuples; k, the number of models in the ensemble; alearning scheme (e.g., decision tree algorithm, back-propagation, etc.)Output: A composite model, M.Method:(1) for i= 1tokdo // create k models:(2) create bootstrap sample, Di, by sampling D with replacement;(3) use Di to derive a model, Mi;(4) end forTo use the composite model on a tuple, X(1) if classification then(2) let each of the k models classify X and return the majority vote(3) if prediction then(4) let each of the k models predict a value for X and return the average predictedvalue;

4.5.6 Boosting

Principles

1. Boost a set of weak learners to a strong learner

2. An iterative procedure to adaptively change distribution of training data by fo-cusing more on previously misclassified records

3. Initially, all N records are assigned equal weights Unlike bagging, weights maychange at the end of a boosting round

4. Records that are wrongly classified will have their weights increased

5. Records that are classified correctly will have their weights decreased

6. Equal weights are assigned to each training tuple (1/d for round 1)

7. After a classifier Mi is learned, the weights are adjusted to allow the subsequentclassifier Mi+1 to pay more attention to tuples that were misclassified by Mi.



8. Final boosted classifier M combines the votes of each individual classifierWeight of each classifiers vote is a function of its accuracy

9. Adaboost popular boosting algorithm

Adaboost-Boosting AlgorithmInput:1). Training set D containing d tuples2). k rounds3). A classification learning schemeOutput:A composite modelMethod:

1. Data set D containing d class-labeled tuples (X1,y1),(X2,y2),(X3,y3), .(Xd,yd)

2. Initially assign equal weight 1/d to each tuple

3. To generate k base classifiers, we need k rounds or

4. iterations Round i, tuples from D are sampled with replacement , to form Di(size d)

5. Each tuples chance of being selected depends on its weight

6. Base classifier Mi, is derived from training tuples of Di

7. Error of Mi is tested using Di ]item Weights of training tuples are adjusted de-pending on how they were classifiedCorrectly classified: Decrease weightIncorrectly classified: Increase weight

8. Weight of a tuple indicates how hard it is to classify it (directly proportional)

9. Some classifiers may be better at classifying some hard tuples than others

10. We finally have a series of classifiers that complement each other

11. Error Estimate:

error(Mi) =d

jw j err(X j) (4.14)

where err(X j) is the misclassification error for X j(= 1)

12. If classifier error exceeds 0.5, we abandon it

13. Try again with a new Di and a new Mi derived from it error (Mi) affects how theweights of training tuples are updated



14. If a tuple is correctly classified in round i, its weight is multiplied by

error(Mi)1 erro(Mi) (4.15)

15. Adjust weights of all correctly classified tuples

16. Now weights of all tuples (including the misclassified tuples) are normalized

n f =sumo f oldweightssumo f newweights

(4.16)

17. Weight of a classifier Mis weight is log error(Mi)1erro(Mi)18. The lower a classifier error rate, the more accurate it is, and therefore, the higher

its weight for voting should be

19. Weight of a classifier Mis vote is log error(Mi)1erro(Mi)20. For each class c, sum the weights of each classifier that assigned class c to X

(unseen tuple)

21. The class with the highest sum is the WINNER!



Chapter 5

Conclusion and Future Scope

5.1 Comprative Study

To pratically explore the theoretical aspects of the data mining models and thetechniques to evaluate them, we conducted a small scale exploratory study in datamining tool Weka - developed by University of Waikado , Newzeland.The followingtables summarize the result of our exploratory study

Figure 5.1: Weka Screen Shots


Data Set Name Type of Data No of Attributes No of Instances Type of Classifier Attribute Under Observation Evaluation Technique No of Correctly Classified Instances No of Incorrectly Classified Instances Predicted Attribute Values TP Rate FP Rate Precision Recall F-Measure ROC Area Confusion Matrix

Only Training Set 217 69 No recurrence event 0.965 0.729 0.758 0.965 0.849 0.639

Recurrence event 0.271 0.035 0.767 0.271 0.4 0.639

10 -Fold CV( Random Seed Value 0 ) 214 72 No recurrence event 0.95 0.729 0.755 0.95 0.841 0.582


10 -Fold CV( Random Seed Value 20) 208 78 No recurrence event 0.93 0.753 0.745 0.93 0.827 0.547


Hold out Mehthod(Percent Split

66%)73 24 No recurrence event 0.972 0.88 0.761 0.972 0.854 0.697


Repeated (20) Hold out Mehthod(Percent Split



Only Training Set 215 71 No recurrence event 0.866 0.518 0.798 0.866 0.831 0.76


10 -Fold CV( Random Seed Value 0 )

208 78 No recurrence event 0.851 0.565 0.781 0.851 0.814 0.7


10 -Fold CV( Random Seed Value 20)

210 76 No recurrence event 0.861 0.565 0.783 0.861 0.82 0.702








Only Training Set 646 122 tested_negative 0.936 0.336 0.839 0.936 0.885 0.888

tested_positive 0.664 0.064 0.848 0.664 0.745 0.888


571 197 tested_negative 0.814 0.403 0.79 0.814 0.802 0.751

tested_positive 0.597 0.186 0.632 0.597 0.614 0.751

10 -Fold CV( Random Seed Value 20)

573 195 tested_negative 0.834 0.418 0.788 0.834 0.81 0.74

tested_positive 0.582 0.166 0.653 0.582 0.615 0.74


66%)192 69 tested_negative 0.849 0.463 0.762 0.849 0.803 0.722

tested_positive 0.537 0.151 0.671 0.537 0.596 0.722


66%)194 67 tested_negative 0.776 0.313 0.81 0.776 0.793 0.747

tested_positive 0.688 0.224 0.641 0.688 0.663 0.747

Only Training Set 586 182 tested_negative 0.842 0.384 0.803 0.842 0.822 0.825

tested_positive 0.616 0.158 0.676 0.616 0.645 0.825


583 185 tested_negative 0.844 0.399 0.798 0.844 0.82 0.814

tested_positive 0.601 0.156 0.674 0.601 0.635 0.814

10 -Fold CV( Random Seed Value 20) 578 190 tested_negative 0.834 0.399 0.796 0.834 0.814 0.811

tested_positive 0.601 0.166 0.66 0.601 0.629 0.811


66%)187 74 tested_negative 0.831 0.484 0.75 0.831 0.789 0.811

tested_positive 0.516 0.169 0.636 0.516 0.57 0.811


66%)205 56 tested_negative 0.842 0.313 0.822 0.842 0.832 0.838

tested_positive 0.688 0.158 0.717 0.688 0.702 0.838

286

286

Class

ClassDiabetes.arff Numeric 9 768 Nave Bayesian

Diabetes.arff Numeric 9 768 J48

Breast Cancer .arff

Breast Cancer .arff

Numeric

Numeric

10

10

J48 Class

ClassNave Bayesian

recurrencebcenorecurrena

ba

ba

==

23627194


ba

ba

==

236210191


ba

ba

==

216414187


ba

ba

==

322270


ba

ba

==

11221648


ba

ba

==

414427174


ba

ba

==

374828173


ba

ba

==

374830171


ba

ba

==

1691656


ba

ba

==

15181450

postestedbnegtesteda

ba

ba

__

1789032468

==


ba

ba

__

16010893404

==


ba

ba

__

15611283417

==


ba

ba

__

514425141

==


ba

ba

__

663028128

==


ba

ba

__

16510379421

==


ba

ba

__

16110783417

==


ba

ba

__

663026139

==


ba

ba

__

1494628138

==


ba

ba

__

16110778422

==

Data Set Name No of Attributes No of Instances Type of Classifier Attribute Under Observation Evaluation Technique No of Correctly Classified Instances No of Incorrectly Classified Instances Predicted Attribute Values TP Rate FP Rate Precision Recall F-Measure ROC Area Confusion Matrix

Only Training Set 144 6 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.96 0.04 0.923 0.96 0.941 0.993

Iris-virginica 0.92 0.02 0.958 0.92 0.939 0.993

CV 143 7 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.94 0.04 0.922 0.94 0.931 0.993

Iris-virginica 0.92 0.03 0.939 0.92 0.929 0.993

CV(Seed value 20) 143 7 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.94 0.04 0.922 0.94 0.931 0.931

Iris-virginica 0.92 0.03 0.939 0.92 0.929 0.929

Hold Out Method 49 2 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 1 0.063 0.905 1 0.95 0.969

Iris-virginica 0.882 0 1 0.882 0.938 0.967

Hold Out Method(seed 20) 51 0 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 1 0 1 1 1 1

Iris-virginica 1 0 1 1 1 1

Only Training Set 144 6 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.96 0.04 0.923 0.96 0.941 0.993

Iris-virginica 0.92 0.02 0.958 0.92 0.939 0.993

CV 143 7 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.94 0.04 0.922 0.94 0.931 0.993

Iris-virginica 0.92 0.03 0.939 0.92 0.929 0.993

CV (seed =20) 143 7 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.94 0.04 0.922 0.94 0.931 0.993

Iris-virginica 0.92 0.03 0.939 0.92 0.929 0.993

Hold out Method 49 2 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 0.947 0.031 0.947 0.947 0.947 0.993

Iris-virginica 0.938 0.029 0.938 0.938 0.938 0.993

Hold out Method seed =20 Iris-setosa 1 0 1 1 1 1

Iris-versicolor 1 0 1 1 1 1

Iris-virginica 1 0 1 1 1 1

J48

Nom(Class)

Nave Bayesian

150

Iris.arff

5

464024800050

cba

cba

464034700050

cba

cba

464034700050

cba

cba

152001900015

cba

cba

150001300015

cba

cba

464024800050

cba

cba

464034700050

cba

cba

464034700050

cba

cba

151011800016

cba

cba

150001300015

cba

cba


5.2 Conclusion

From the exploratory tests carried out on the datasets in Weka we can concludesome of the following theoretical aspect which we explored in depth in the varioussections

1. Evaluating the classifier only on the training set fetches highly optimistic resultand thus they are biased.

2. Increasing the k value incereases the credibility of the result and the best resultsare obtained when k = 10

3. Repeating the k-cross validation in iteration fetches more credible results andthe best results are obtained when it is repeated for 10 times

4. Hold out Method when repeated iteratively fetches accurate results and bestresults are obtained when it is repeated for 10 times

5. Naive Bayesian and Decision Tree Induction(J48) works excellently well withdatasets which have more nominal data in comparision to numeric data

5.3 Future Scope

The Comparative Study can be extended to new levels by incorporating R caretpackage and carrying out these comparative test on more complex data sets whichhave 1000+plus entries.A cost Sensitive comparative study can also be seen as anextension to this seminar which again can be carried out in R using ROCR pack-age.The comparative study conducted and the ones proposed as future scope can bevery helpful in desigining machine learning systems and evauluating their accuracy.



References

[1] Data Mining Concepts and Techniques: Jiawei Han, Micheline Kamber,JianPei

[2] Machine Learning With R: Brett Lanz

[3] Statistical Learning: Gareth James, Daniela Witten ,Trevor Hastie ,Robert Tib-shirani

[4] Statistics; David Freedman, Robert Pisani

[5] Inferential Statistics:Course Track Udacity

[6] Descriptive Statistics:Course Track Udacity

[7] Data Mining With Weka: Course Track University Of Waikato, Newzeland


CertificateAcknowledgementsAbstractIntroductionClassification Using Decision TreeUnderstanding Decision TreesDivide and ConquerC5.0 Decision Tree AlgorithmHow To Choose The Best Split?Pruning The Decision Tree

Probabilistic Learning - Naive Bayesian ClassificationUnderstanding Naive Bayesian ClassificationBayes TheoremThe Naive Bayes AlgorthimNaive Bayesian Classification

Model Evaluation TechniquesPrediction AccuracyConfusion Matrix and Model Evaluation MetricsHow To Estimate These Metrics?Training and Independent Test DataHoldout MethodK-Cross-validationBootstrapComparing Two Classifier Models

ROC CurvesEnsemble MethodsWhy Ensemble Works?Ensemble Works in Two WaysLearn To CombineLearn By ConsensusBaggingBoosting

Conclusion and Future ScopeComprative StudyConclusionFuture Scope

References

data mining models and evaluation techniques

Documents