
Data Mining

Instructor: Bajuna Salehe

Email: [email protected]

Web: http://www.ifm.ac.tz/staff/bajuna/courses

Classification and Prediction


Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large.

An example application

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.

A decision is needed: whether to put a new patient in an intensive-care unit.

Because ICU care is costly, patients who may survive less than a month are given higher priority.

Problem: to predict high-risk patients and discriminate them from low-risk patients.

Another application

A credit card company receives thousands of applications for new cards. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, credit rating, etc.

Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.

Machine learning and our focus

Machine learning is like human learning from past experiences. A computer does not have “experiences”; a computer system learns from data, which represent some “past experiences” of an application domain.

Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.

The task is commonly called supervised learning, classification, or inductive learning.


Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions.


For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

Classification

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.

The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

What is Classification

Classification is the task of assigning objects to their respective categories.

Examples include classifying email messages as spam or non-spam based upon the message header and content, and classifying galaxies based upon their respective shapes.


Classification can provide valuable support for informed decision making in an organisation.

For example, suppose a mobile phone company would like to promote a new cell-phone product to the public. Instead of mass mailing the promotional catalog to everyone, the company may be able to reduce the campaign cost by targeting only a small segment of the population.


It may classify each person as a potential buyer or non-buyer based on their personal information such as income, occupation, lifestyle, and credit ratings.

Discrete Data

Discrete Data – A set of data is said to be discrete if the values / observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Any data measurements that are not quantified on an infinitely divisible numeric scale. Includes items like counts, proportions, ratios, or percentages of a characteristic (e.g. sex, loan forms, department attendance, etc.) that have measurements like pass or fail, leak or no leak, small, medium, or large, go or no go tests. (SixSigma.com Dictionary)

Continuous Data

Continuous/Variable Data – A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example height, weight, temperature, the amount of sugar in an orange, the time required to run a mile.

Variable data have real-number measurements such as 2.34 or 2.55 (i.e. data that can be measured on a continuous scale).

Categorical Data

Categorical Data – A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to colour: the characteristic 'colour' can have non-overlapping categories 'black', 'brown', 'red' and 'other'. People have the characteristic of 'gender' with categories 'male' and 'female'.

Nominal Data

Nominal Data – A set of data is said to be nominal if the values / observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

Ordinal Data

Ordinal Data - A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.

The categories for an ordinal set of data have a natural order. For example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A rating of 5 indicates more enjoyment than a rating of 4, so such data are ordinal.
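The difference between nominal and ordinal codes can be sketched in Python. This is only an illustration, assuming the codings used in the examples above (males 0, females 1; biscuit ratings 1 to 5). The names GENDER_CODES, RATING_LABELS and prefers are hypothetical and chosen for this sketch.

```python
# Nominal data: the numbers are only labels, so ordering them is meaningless.
GENDER_CODES = {"male": 0, "female": 1}

# Ordinal data: the codes carry a natural order
# (1 = strongly dislike ... 5 = strongly like).
RATING_LABELS = {
    1: "strongly dislike",
    2: "dislike",
    3: "neutral",
    4: "like",
    5: "strongly like",
}


def prefers(rating_a: int, rating_b: int) -> bool:
    """Return True if rating_a indicates more enjoyment than rating_b."""
    for rating in (rating_a, rating_b):
        if rating not in RATING_LABELS:
            raise ValueError(f"rating must be between 1 and 5, got {rating}")
    return rating_a > rating_b


print(prefers(5, 4))      # a rating of 5 ranks above a rating of 4
print(RATING_LABELS[3])   # the label attached to the middle of the scale
```

Comparing two ratings is meaningful because the scale is ordered, whereas comparing the gender codes 0 and 1 would say nothing about the people they label.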

Preliminaries

The input data for a classification task is given in the form of a collection of records.

Each record, also known as an instance or example, is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
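One possible way to represent such records in Python is sketched below. The type aliases AttributeSet and Record, the attribute names, and the two sample records are assumptions made for illustration; they are not taken from the course data.

```python
from typing import Dict, List, Tuple

# A record is a pair (x, y): x is the attribute set, y is the class label.
AttributeSet = Dict[str, str]
Record = Tuple[AttributeSet, str]

# Two illustrative records (hypothetical attribute names).
records: List[Record] = [
    ({"body_temperature": "warm-blooded", "skin_cover": "hair"}, "mammal"),
    ({"body_temperature": "cold-blooded", "skin_cover": "scales"}, "fish"),
]


def class_labels(data: List[Record]) -> List[str]:
    """Collect the class label y of every record."""
    return [y for _, y in data]


def attribute_sets(data: List[Record]) -> List[AttributeSet]:
    """Collect the attribute set x of every record."""
    return [x for x, _ in data]


print(class_labels(records))
```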

Table 1. Vertebrate Data Set

Table 1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian.

The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly and ability to live in water.

The attribute set may contain both discrete and continuous features; in Table 1, however, the attribute set contains mostly discrete values.

The class label, on the other hand, must be a discrete attribute.

This is a key characteristic that distinguishes classification from another predictive modeling task known as regression, where y is a continuous attribute.


Classification can be described as a task of assigning objects to one of several predefined categories.

[Diagram: Input: attribute set (x) -> Classification model -> Output: class label (y)]

The diagram shows classification as the task of mapping an input attribute set x into its class label y.

Simple Definition

Classification is the task of learning a target function f that maps each attribute set x into one of the pre-defined class labels y.

The target function is also known informally as a classification model.

Usefulness of Classification Model

A classification model is useful for the following purposes:

– It may serve as an explanatory tool to distinguish between objects of different classes (Descriptive Modeling).

– It may also be used to predict the class label of unknown records (Predictive Modeling). Consider the table below:


A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.

For example, you may be given the characteristics of a creature known as a gila monster.

By building a classification model from the data set shown in Table 1, you may use the model to determine the class to which the creature belongs.
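To make the black-box idea concrete, here is a minimal sketch of one simple technique, a nearest-neighbor classifier that counts mismatching attributes. The training records and attribute names below are illustrative assumptions and are not copied from Table 1. The functions mismatches and classify are hypothetical helpers written for this sketch, not a method prescribed by the course.

```python
from typing import Dict, List, Tuple

AttributeSet = Dict[str, str]
Record = Tuple[AttributeSet, str]

# Illustrative training records (hypothetical; not the contents of Table 1).
TRAINING_SET: List[Record] = [
    ({"body_temp": "warm", "skin": "hair", "gives_birth": "yes",
      "flies": "no", "aquatic": "no"}, "mammal"),
    ({"body_temp": "warm", "skin": "feathers", "gives_birth": "no",
      "flies": "yes", "aquatic": "no"}, "bird"),
    ({"body_temp": "cold", "skin": "scales", "gives_birth": "no",
      "flies": "no", "aquatic": "yes"}, "fish"),
    ({"body_temp": "cold", "skin": "scales", "gives_birth": "no",
      "flies": "no", "aquatic": "no"}, "reptile"),
    ({"body_temp": "cold", "skin": "none", "gives_birth": "no",
      "flies": "no", "aquatic": "semi"}, "amphibian"),
]


def mismatches(a: AttributeSet, b: AttributeSet) -> int:
    """Count the attributes of a whose values differ in b."""
    return sum(1 for key in a if a[key] != b.get(key))


def classify(x: AttributeSet, training_set: List[Record]) -> str:
    """Return the class label of the training record closest to x."""
    _, label = min(training_set, key=lambda record: mismatches(x, record[0]))
    return label


# Attribute set of the unknown record (a gila monster).
gila_monster = {"body_temp": "cold", "skin": "scales", "gives_birth": "no",
                "flies": "no", "aquatic": "no"}
print(classify(gila_monster, TRAINING_SET))  # prints "reptile"
```

The caller only supplies an attribute set and receives a class label back, which is the black-box behaviour described above. Any other learning technique could sit behind the same classify interface.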

Classification models are most suited for predicting or describing data sets with binary or nominal target attributes.

Classification & Prediction

Classification:

– Predicts categorical class labels

– Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Prediction:

– Models continuous-valued functions, i.e., predicts unknown or missing values

Typical Applications

– Credit approval

– Target marketing

– Medical diagnosis

– Treatment effectiveness analysis

Classification Techniques

A classification technique is a systematic approach for building classification models from an input data set.

Examples of classification techniques include:

– Decision Tree Classifiers

– Rule-Based Classifiers

– Neural Networks

– Support Vector Machines

– Naïve Bayes Classifiers

– Nearest-Neighbor Classifiers


Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data (produces outputs consistent with the class labels of the input data).


A good classification model must predict correctly the class labels of records it has never seen before.

Building models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records, is therefore a key objective of the learning algorithm.

General Approach to Solve a Classification Problem

A general strategy for solving a classification problem is as follows:

– First, the input data is divided into two disjoint sets, known as the training set and the test set, respectively.

– The training set is used for building a classification model.

– The induced model is later applied to the test set to predict the class label of each test record.
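The first step of this strategy can be sketched as a simple random holdout split. The function name holdout_split, the 30% test fraction, and the fixed seed are assumptions made for illustration; the course does not prescribe a particular split ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    records: Sequence[T], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the records and divide them into disjoint training and test sets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled records form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(10), test_fraction=0.3)
print(len(train), len(test))  # prints "7 3"
```

Every record lands in exactly one of the two sets, so the model never sees the test records while it is being built.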

Why are we dividing the data into two sets?

This strategy of dividing the data into independent training and test sets allows us to obtain an unbiased estimate of the performance of a model on previously unseen records.

[Figure: General approach to solving a classification problem]

Performance Measurement of Model

Evaluation of the performance of a classification model is based upon the number of test records predicted correctly and wrongly by the model.

The counts are tabulated in a table known as a confusion matrix.

Table 2 depicts the confusion matrix for a binary classification problem.

Each entry fij in this table denotes the number of records from class i predicted to be of class j.

For instance, f01 is the number of records from class 0 wrongly predicted as class 1.

Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of wrong predictions is (f10 + f01).
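The tabulation of these counts can be sketched in Python as follows. The function name confusion_matrix and the example label lists are hypothetical, made up purely to exercise the code. The returned dictionary is keyed by (i, j), so f[(0, 1)] corresponds to f01.

```python
from typing import Dict, Sequence, Tuple


def confusion_matrix(
    actual: Sequence[int], predicted: Sequence[int]
) -> Dict[Tuple[int, int], int]:
    """Count records from class i predicted as class j, for binary labels 0 and 1."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for i, j in zip(actual, predicted):
        if (i, j) not in counts:
            raise ValueError("labels must be 0 or 1")
        counts[(i, j)] += 1
    return counts


# Hypothetical test-set labels, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)
correct = f[(1, 1)] + f[(0, 0)]  # f11 + f00
wrong = f[(1, 0)] + f[(0, 1)]    # f10 + f01
print(correct, wrong)  # prints "6 2"
```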

Although a confusion matrix provides the information needed to determine how good a classification model is, it is useful to summarize this information into a single number.

This would make it more convenient to compare the performance of different classification models.

There are several performance metrics available for doing this. One of the most popular metrics is model accuracy, which is defined as:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

Equivalently, the performance of a model can be expressed in terms of its error rate, given by the following equation:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
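Both metrics can be computed directly from the four confusion matrix entries. The sketch below assumes the counts are passed in as f11, f10, f01 and f00; the function names and the example counts are hypothetical.

```python
def accuracy(f11: int, f10: int, f01: int, f00: int) -> float:
    """Fraction of predictions that are correct: (f11 + f00) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("the confusion matrix contains no predictions")
    return (f11 + f00) / total


def error_rate(f11: int, f10: int, f01: int, f00: int) -> float:
    """Fraction of predictions that are wrong: (f10 + f01) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("the confusion matrix contains no predictions")
    return (f10 + f01) / total


# Hypothetical counts, for illustration only.
print(accuracy(3, 1, 1, 3))    # prints 0.75
print(error_rate(3, 1, 1, 3))  # prints 0.25
```

Because every prediction is either correct or wrong, the two values always sum to 1, so reporting one of them determines the other.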

Decision Trees
