Spring 2003 1 Classification




Classification task

• Input: a training set of tuples, each labeled with one class label

• Output: a model (classifier) that assigns a class label to each tuple based on the other attributes

• The model can be used to predict the class of new tuples, for which the class label is missing or unknown


What is Classification?

• Data classification is a two-step process:
– first step: a model is built describing a predetermined set of data classes or concepts
– second step: the model is used for classification

• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute

• Data tuples are also referred to as samples, examples, or objects


Train and test

• The tuples (examples, samples) are divided into training set + test set

• Classification model is built in two steps:
– training: build the model from the training set
– test: check the accuracy of the model using the test set


Train and test

• Kinds of models:
– if-then rules
– logical formulae
– decision trees

• Accuracy of models:
– the known class of each test sample is matched against the class predicted by the model
– accuracy rate = % of test set samples correctly classified by the model
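The accuracy rate above is straightforward to compute. A minimal Python sketch (the helper name `accuracy_rate` is ours; the label lists are taken from the test-step slide later in the deck):

```python
def accuracy_rate(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class matches the known class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Known classes of four test tuples vs. the model's predictions
true_risk = ["High", "Low", "High", "High"]
predicted_risk = ["High", "Low", "Low", "High"]
print(accuracy_rate(true_risk, predicted_risk))  # 0.75
```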


Training step

training data:

  Age  Car Type  Risk
  20   Combi     High
  18   Sports    High
  40   Sports    High
  50   Family    Low
  35   Minivan   Low
  30   Combi     High
  32   Family    Low
  40   Combi     Low

The Classification algorithm produces the Classifier (model):

  if Age < 31 or Car Type = Sports
  then Risk = High


Test step

test data (known class vs. the classifier's prediction):

  Age  Car Type  Risk   Predicted Risk
  27   Sports    High   High
  34   Family    Low    Low
  66   Family    High   Low
  44   Sports    High   High

The Classifier (model) predicts the last column; it misclassifies the third tuple, so the accuracy rate on this test set is 3/4 = 75%.


Classification (prediction)

new data (class label unknown):

  Age  Car Type  Risk
  27   Sports    ?
  34   Minivan   ?
  55   Family    ?
  34   Sports    ?

The Classifier (model) assigns Risk: High, Low, Low, High.


Classification vs. Prediction

• There are two forms of data analysis that can be used to extract models describing data classes or to predict future data trends:
– classification: predicts categorical class labels
– prediction: models continuous-valued functions


Comparing Classification Methods (1)

• Predictive accuracy: this refers to the ability of the model to correctly predict the class label of new or previously unseen data

• Speed: this refers to the computation costs involved in generating and using the model

• Robustness: this is the ability of the model to make correct predictions given noisy data or data with missing values


Comparing Classification Methods (2)

• Scalability: this refers to the ability to construct the model efficiently given a large amount of data

• Interpretability: this refers to the level of understanding and insight that is provided by the model

• Simplicity:
– decision tree size
– rule compactness

• Domain-dependent quality indicators


Problem formulation

Given records in the database, each with a class label, find a model for each class.

  Age  Car Type  Risk
  20   Combi     High
  18   Sports    High
  40   Sports    High
  50   Family    Low
  35   Minivan   Low
  30   Combi     High
  32   Family    Low
  40   Combi     Low

Resulting model (decision tree):

  Age < 31?
  ├─ Y → High
  └─ N → Car Type is sports?
         ├─ Y → High
         └─ N → Low


Classification techniques

• Decision Tree Classification

• Bayesian Classifiers

• Neural Networks

• Statistical Analysis

• Genetic Algorithms

• Rough Set Approach

• k-nearest neighbor classifiers


Classification by Decision Tree Induction

• A decision tree is a tree structure, where:
– each internal node denotes a test on an attribute,
– each branch represents the outcome of the test,
– leaf nodes represent classes or class distributions

  Age < 31?
  ├─ Y → High
  └─ N → Car Type is sports?
         ├─ Y → High
         └─ N → Low


Decision Tree Induction (1)

• A decision tree is a class discriminator that recursively partitions the training set until each partition consists entirely or dominantly of examples from one class.

• Each non-leaf node of the tree contains a split point, which is a test on one or more attributes and determines how the data is partitioned


Decision Tree Induction (2)

• Basic algorithm: a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.

• Many variants:
– from machine learning (ID3, C4.5)
– from statistics (CART)
– from pattern recognition (CHAID)

• Main difference: split criterion


Decision Tree Induction (3)

• The algorithm consists of two phases:
– build an initial tree from the training data such that each leaf node is pure
– prune this tree to increase its accuracy on test data


Tree Building

• In the growth phase the tree is built by recursively partitioning the data until each partition is either "pure" (contains members of the same class) or sufficiently small.

• The form of the split used to partition the data depends on the type of the attribute used in the split:
– for a continuous attribute A, splits are of the form value(A) < x, where x is a value in the domain of A
– for a categorical attribute A, splits are of the form value(A) ∈ X, where X ⊆ domain(A)


Tree Building Algorithm

  Make Tree (Training Data T) {
    Partition(T);
  }

  Partition(Data S) {
    if (all points in S are in the same class) then
      return;
    for each attribute A do
      evaluate splits on attribute A;
    use best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
  }
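The pseudocode above can be sketched in runnable Python. The `choose_split` helper below is a stand-in for the split-evaluation step covered on the following slides: it simply tries the two candidate tests from the earlier example (Age < 31, Car Type = Sports) and keeps the first that actually divides the data.

```python
def partition(S, choose_split, leaves=None):
    """Recursive divide-and-conquer skeleton of Make Tree / Partition.
    S is a list of (attributes, class_label) pairs; leaves collects the
    class labels of each pure partition reached."""
    if leaves is None:
        leaves = []
    if len({label for _, label in S}) <= 1:  # all points in the same class
        leaves.append([label for _, label in S])
        return leaves
    split = choose_split(S)                  # "evaluate splits" step
    S1 = [t for t in S if split(t[0])]
    S2 = [t for t in S if not split(t[0])]
    partition(S1, choose_split, leaves)
    partition(S2, choose_split, leaves)
    return leaves

def choose_split(S):
    """Stand-in split chooser: tries Age < 31, then Car Type = Sports,
    and returns the first predicate that actually divides S."""
    candidates = [lambda attrs: attrs[0] < 31,
                  lambda attrs: attrs[1] == "Sports"]
    for pred in candidates:
        left = sum(pred(attrs) for attrs, _ in S)
        if 0 < left < len(S):
            return pred
    raise ValueError("no usable split")

training = [((20, "Combi"), "High"), ((18, "Sports"), "High"),
            ((40, "Sports"), "High"), ((50, "Family"), "Low"),
            ((35, "Minivan"), "Low"), ((30, "Combi"), "High"),
            ((32, "Family"), "Low"), ((40, "Combi"), "Low")]
print(partition(training, choose_split))  # three pure leaves
```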


Tree Building Algorithm

• While growing the tree, the goal at each node is to determine the split point that "best" divides the training records belonging to that leaf

• To evaluate the goodness of a split, several splitting indices have been proposed


Split Criteria

• Gini index (CART, SPRINT)
– select the attribute that minimizes the impurity of a split

• Information gain (ID3, C4.5)
– uses entropy to measure the impurity of a split
– select the attribute that maximizes the entropy reduction

• χ² contingency table statistic (CHAID)
– measures the correlation between each attribute and the class label
– select the attribute with maximal correlation


Gini index (1)

Consider a sample training set in which each record represents a car-insurance applicant. We want to build a model of what makes an applicant a high or low insurance risk.

  RID  Age  Car Type  Risk
  0    23   family    high
  1    17   sport     high
  2    43   sport     high
  3    68   family    low
  4    32   truck     low
  5    20   family    high

Training set

The model built can be used to screen future insurance applicants by classifying them into the High or Low risk categories

Classifier(model)


Gini index (2)

SPRINT algorithm:

  Partition(Data S) {
    if (all points in S are of the same class) then
      return;
    for each attribute A do
      evaluate splits on attribute A;
    use best split found to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
  }

  Initial call: Partition(Training Data)


Gini index (3)

• Definition:

  gini(S) = 1 - Σj pj²

  where:
  • S is a data set containing examples from n classes
  • pj is the relative frequency of class j in S

• E.g. two classes, Pos and Neg, and dataset S with p Pos-elements and n Neg-elements:

  ppos = p/(p+n)    pneg = n/(p+n)

  gini(S) = 1 - ppos² - pneg²


Gini index (4)

• If dataset S is split into S1 and S2, then splitting index is defined as follows:

  giniSPLIT(S) = (p1+n1)/(p+n) * gini(S1) + (p2+n2)/(p+n) * gini(S2)

where p1 and n1 (respectively p2 and n2) denote the numbers of Pos- and Neg-elements in the dataset S1 (respectively S2).

• In this definition the "best" split point is the one with the lowest value of the giniSPLIT index.
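The two definitions above translate directly into code. A small Python sketch (function names are ours):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - Σ pj², with pj the relative frequency of class j in S."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """giniSPLIT: the Gini indices of the two partitions, weighted by size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Two classes, Pos and Neg: a pure set scores 0, a 50/50 set scores 1/2
print(gini(["Pos"] * 4))                    # 0.0
print(gini(["Pos", "Neg"]))                 # 0.5
print(gini_split(["Pos"], ["Pos", "Neg"]))  # (1/3)*0 + (2/3)*(1/2) = 1/3
```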


Example (1)

  RID  Age  Car Type  Risk
  0    23   family    high
  1    17   sport     high
  2    43   sport     high
  3    68   family    low
  4    32   truck     low
  5    20   family    high

Training set


Example (1)

Attribute list for 'Age':

  Age  RID  Risk
  17   1    high
  20   5    high
  23   0    high
  32   4    low
  43   2    high
  68   3    low

Attribute list for 'Car Type':

  Car Type  RID  Risk
  family    0    high
  sport     1    high
  sport     2    high
  family    3    low
  truck     4    low
  family    5    high


Example (2)

• Possible values of a split point for the Age attribute are:
  Age ≤ 17, Age ≤ 20, Age ≤ 23, Age ≤ 32, Age ≤ 43, Age ≤ 68

  Tuple count  High  Low
  Age ≤ 17     1     0
  Age > 17     3     2

G(Age ≤ 17) = 1 - (1² + 0²) = 0
G(Age > 17) = 1 - ((3/5)² + (2/5)²) = 1 - 13/25 = 12/25
GSPLIT = (1/6) * 0 + (5/6) * (12/25) = 2/5


Example (3)

  Tuple count  High  Low
  Age ≤ 20     2     0
  Age > 20     2     2

G(Age ≤ 20) = 1 - (1² + 0²) = 0
G(Age > 20) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (2/6) * 0 + (4/6) * (1/2) = 1/3

  Tuple count  High  Low
  Age ≤ 23     3     0
  Age > 23     1     2

G(Age ≤ 23) = 1 - (1² + 0²) = 0
G(Age > 23) = 1 - ((1/3)² + (2/3)²) = 1 - 1/9 - 4/9 = 4/9
GSPLIT = (3/6) * 0 + (3/6) * (4/9) = 2/9


Example (4)

  Tuple count  High  Low
  Age ≤ 32     3     1
  Age > 32     1     1

G(Age ≤ 32) = 1 - ((3/4)² + (1/4)²) = 1 - 10/16 = 6/16 = 3/8
G(Age > 32) = 1 - ((1/2)² + (1/2)²) = 1/2
GSPLIT = (4/6) * (3/8) + (2/6) * (1/2) = 1/4 + 1/6 = 5/12

The lowest value of GSPLIT is for Age ≤ 23, thus we have a split point at Age = (23+32) / 2 = 27.5
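The hand computations of Examples (2)-(4) can be checked mechanically. The sketch below (helper names are ours) scores every candidate Age split on the six-tuple training set and confirms that Age ≤ 23 gives the lowest GSPLIT, 2/9:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# (Age, Risk) pairs from the Example (1) training set
data = [(23, "high"), (17, "high"), (43, "high"),
        (68, "low"), (32, "low"), (20, "high")]

scores = {}
for threshold in sorted(a for a, _ in data)[:-1]:  # Age <= 68 splits nothing off
    left = [risk for age, risk in data if age <= threshold]
    right = [risk for age, risk in data if age > threshold]
    scores[threshold] = gini_split(left, right)

best = min(scores, key=scores.get)
print(best)   # 23: GSPLIT(Age <= 23) = 2/9 is the lowest value
```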


Example (5)

Decision tree after the first split of the example set:

  Age ≤ 27.5?
  ├─ yes → Risk = High
  └─ no  → Risk = Low


Example (6)

Attribute lists are divided at the split point.

Attribute lists for Age ≤ 27.5:

  Age  RID  Risk        Car Type  RID  Risk
  17   1    high        family    0    high
  20   5    high        sport     1    high
  23   0    high        family    5    high

Attribute lists for Age > 27.5:

  Age  RID  Risk        Car Type  RID  Risk
  32   4    low         sport     2    high
  43   2    high        family    3    low
  68   3    low         truck     4    low


Example (7)

Evaluating splits for categorical attributes:

We have to evaluate the splitting index for each of the 2^N subset combinations, where N is the cardinality of the categorical attribute.

  Tuple count           High  Low
  Car Type ∈ {sport}    1     0
  Car Type ∈ {family}   0     1
  Car Type ∈ {truck}    0     1

G(Car Type ∈ {sport}) = 1 - 1² - 0² = 0
G(Car Type ∈ {family}) = 1 - 0² - 1² = 0
G(Car Type ∈ {truck}) = 1 - 0² - 1² = 0


Example (8)

G(Car Type ∈ {sport, family}) = 1 - (1/2)² - (1/2)² = 1/2
G(Car Type ∈ {sport, truck}) = 1/2
G(Car Type ∈ {family, truck}) = 1 - 0² - 1² = 0

GSPLIT(Car Type ∈ {sport}) = (1/3) * 0 + (2/3) * 0 = 0
GSPLIT(Car Type ∈ {family}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car Type ∈ {truck}) = (1/3) * 0 + (2/3) * (1/2) = 1/3
GSPLIT(Car Type ∈ {sport, family}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car Type ∈ {sport, truck}) = (2/3) * (1/2) + (1/3) * 0 = 1/3
GSPLIT(Car Type ∈ {family, truck}) = (2/3) * 0 + (1/3) * 0 = 0
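These subset evaluations can likewise be enumerated programmatically. A sketch over the Age > 27.5 partition (helper names are ours; it suffices to test the proper non-empty subsets, since a subset and its complement define the same split):

```python
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# (Car Type, Risk) pairs in the Age > 27.5 partition
data = [("sport", "high"), ("family", "low"), ("truck", "low")]
values = sorted({v for v, _ in data})

scores = {}
for size in range(1, len(values)):            # proper, non-empty subsets
    for subset in combinations(values, size):
        left = [r for v, r in data if v in subset]
        right = [r for v, r in data if v not in subset]
        scores[frozenset(subset)] = gini_split(left, right)

print(scores[frozenset({"sport"})])           # 0.0, the best split
```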


Example (9)

The lowest value of GSPLIT is for Car Type ∈ {sport}, thus this is our split point. Decision tree after the second split of the example set:

  Age ≤ 27.5?
  ├─ yes → Risk = High
  └─ no  → Car Type ∈ {sport}?
           ├─ yes → Risk = High
           └─ no (Car Type ∈ {family, truck}) → Risk = Low


Information Gain (1)

• The information gain measure is used to select the test attribute at each node in the tree

• The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node

• This attribute minimizes the information needed to classify the samples in the resulting partitions


Information Gain (2)

• Let S be a set consisting of s data samples. Suppose the class label attribute has m distinct values defining m classes, Ci (for i=1, ..., m)

• Let si be the number of samples of S in class Ci

• The expected information needed to classify a given sample is given by

  I(s1, s2, ..., sm) = - Σi pi log2(pi)

where pi is the probability that an arbitrary sample

belongs to class Ci and is estimated by si/s.
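A small Python helper for this formula (the name `expected_info` is ours); its value for class counts 9 and 5 matches the worked customer-database example later in the deck:

```python
from math import log2

def expected_info(counts):
    """I(s1, ..., sm) = -Σ pi log2(pi), with pi = si / s."""
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

print(round(expected_info([9, 5]), 2))   # 0.94
```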


Information Gain (3)

• Let attribute A have v distinct values, {a1, a2, ..., av}. Attribute A can be used to partition S into {S1, S2, ..., Sv}, where Sj contains those samples in S that have value aj of A

• If A were selected as the test attribute, then these subsets would correspond to the branches grown from the node containing the set S


Information Gain (4)

• Let sij be the number of samples of the class Ci in a subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by:

  E(A) = Σj [(s1j + s2j + ... + smj)/s] * I(s1j, s2j, ..., smj)

• The smaller the entropy value, the greater the purity of the subset partitions.


Information Gain (5)

• The term (s1j + s2j + ... + smj)/s acts as the weight of the jth subset and is the number of samples in the subset (i.e. having value aj of A) divided by the total number of samples in S. Note that for a given subset Sj,

  I(s1j, s2j, ..., smj) = - Σi pij log2(pij)

where pij = sij/|Sj| is the probability that a sample in Sj belongs to class Ci


Information Gain (6)

The encoding information that would be gained by branching on A is

Gain(A) = I(s1, s2, ..., sm) – E(A)

Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A


Example (1)

  RID  Age     Income  student  credit_rating  buys_computer
  1    <=30    high    no       fair           no
  2    <=30    high    no       excellent      no
  3    31..40  high    no       fair           yes
  4    >40     medium  no       fair           yes
  5    >40     low     yes      fair           yes
  6    >40     low     yes      excellent      no
  7    31..40  low     yes      excellent      yes
  8    <=30    medium  no       fair           no
  9    <=30    low     yes      fair           yes
  10   >40     medium  yes      fair           yes
  11   <=30    medium  yes      excellent      yes
  12   31..40  medium  no       excellent      yes
  13   31..40  high    yes      fair           yes
  14   >40     medium  no       excellent      no


Example (2)

• Let us consider the training set of tuples taken from the customer database.
• The class label attribute, buys_computer, has two distinct values (yes, no); therefore, there are two classes (m = 2).

C1 corresponds to yes: s1 = 9
C2 corresponds to no:  s2 = 5

I(s1, s2) = I(9, 5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94


Example (3)

• Next, we need to compute the entropy of each attribute. Let us start with the attribute age:

for age = '<=30':   s11 = 2, s21 = 3, I(s11, s21) = 0.971
for age = '31..40': s12 = 4, s22 = 0, I(s12, s22) = 0
for age = '>40':    s13 = 2, s23 = 3, I(s13, s23) = 0.971


Example (4)

The entropy of age is:

  E(age) = 5/14 * I(s11, s21) + 4/14 * I(s12, s22) + 5/14 * I(s13, s23) = 0.694

The gain in information from such a partitioning would be:

  Gain(age) = I(s1, s2) - E(age) = 0.246
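The E(age) and Gain(age) computations can be reproduced in a few lines of Python (helper names are ours; the slide's 0.246 is the same value to within rounding):

```python
from math import log2

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

def entropy_of_split(partitions):
    """E(A) = Σj (|Sj| / s) * I(class counts in Sj)."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * info(p) for p in partitions)

# (yes, no) class counts in each age partition: <=30, 31..40, >40
age_partitions = [(2, 3), (4, 0), (2, 3)]
e_age = entropy_of_split(age_partitions)
gain_age = info((9, 5)) - e_age
print(round(e_age, 3))     # 0.694
print(round(gain_age, 3))  # 0.247
```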


Example (5)

• We can compute:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

• Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute's values.


Example (6)

  age?
  ├─ <=30   → buys_computer: yes, no
  ├─ 31..40 → buys_computer: yes
  └─ >40    → buys_computer: yes, no


Example (7)

  age?
  ├─ <=30   → student?
  │           ├─ no  → no
  │           └─ yes → yes
  ├─ 31..40 → yes
  └─ >40    → credit_rating?
              ├─ excellent → no
              └─ fair      → yes


Entropy vs. Gini index

• Gini index tends to isolate the largest class from all other classes

• Entropy tends to find groups of classes that add up to 50% of the data

Example with class counts A: 40, B: 30, C: 20, D: 10:

  Split preferred by Gini (if age < 40):
    yes → class A: 40
    no  → class B: 30, class C: 20, class D: 10

  Split preferred by entropy (if age < 65):
    yes → class A: 40, class D: 10
    no  → class B: 30, class C: 20
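The stated preferences can be verified numerically with the class counts from the figure (helper names are ours):

```python
from math import log2

def gini(counts):
    s = sum(counts)
    return 1 - sum((c / s) ** 2 for c in counts)

def entropy(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c > 0)

def weighted(measure, partitions):
    """Size-weighted impurity of a split into the given partitions."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * measure(p) for p in partitions)

# Class counts: A=40, B=30, C=20, D=10
isolate_largest = [(40,), (30, 20, 10)]   # "age < 40": A vs. the rest
balanced_groups = [(40, 10), (30, 20)]    # "age < 65": A+D vs. B+C

# Gini prefers isolating the largest class ...
assert weighted(gini, isolate_largest) < weighted(gini, balanced_groups)
# ... while entropy prefers the roughly 50/50 grouping.
assert weighted(entropy, balanced_groups) < weighted(entropy, isolate_largest)
```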


Tree pruning

• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.

• Tree pruning methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data.


Tree pruning

• Prepruning approach (stopping): a tree is 'pruned' by halting its construction early (i.e. by deciding not to further split or partition the subset of training samples). Upon halting, the node becomes a leaf. The leaf holds the most frequent class among the subset samples.

• Postpruning approach (pruning): removes branches from a 'fully grown' tree. A tree node is pruned by removing its branches. The lowest unpruned node becomes a leaf and is labeled by the most frequent class among its former branches.


Extracting Classification Rules from Decision Trees

• The knowledge represented in decision trees can be extracted and represented in the form of classification IF-THEN rules.

• One rule is created for each path from the root to a leaf node

• Each attribute-value pair along a given path forms a conjunction in the rule antecedent; the leaf node holds the class prediction, forming the rule consequent


Extracting Classification Rules from Decision Trees

• The decision tree of Example (7) can be converted to the following classification rules:

IF age = '<=30' AND student = 'no' THEN buys_computer = 'no'
IF age = '<=30' AND student = 'yes' THEN buys_computer = 'yes'
IF age = '31..40' THEN buys_computer = 'yes'
IF age = '>40' AND credit_rating = 'excellent' THEN buys_computer = 'no'
IF age = '>40' AND credit_rating = 'fair' THEN buys_computer = 'yes'
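Rules extracted this way can be applied directly to new tuples. A sketch with a hypothetical encoding of the five rules as condition/label pairs (the dictionary layout is ours, not part of any standard rule format):

```python
# Each rule is (conditions, class); every condition must match for it to fire.
rules = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31..40"}, "yes"),
    ({"age": ">40", "credit_rating": "excellent"}, "no"),
    ({"age": ">40", "credit_rating": "fair"}, "yes"),
]

def classify(tuple_):
    """Return the class of the first rule whose antecedent matches."""
    for conditions, label in rules:
        if all(tuple_.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule fires

print(classify({"age": "31..40", "student": "no", "credit_rating": "fair"}))  # yes
```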


Other Classification Methods

• There are a number of classification methods in the literature:
– Bayesian classifiers

– Neural-network classifiers

– K-nearest neighbor classifiers

– Association-based classifiers

– Rough and fuzzy sets


Classification Based on Concepts from Association Rule Mining

• We may apply the "quantitative rule mining" approach to discover classification rules; this is known as associative classification

• It mines rules of the form condset ⇒ y, where condset is a set of items (or attribute-value pairs) and y is a class label.


Bayesian classifiers

• A Bayesian classifier is a statistical classifier. It can predict the probability that a given sample belongs to a particular class.

• Bayesian classification is based on the Bayes theorem of a-posteriori probability.

• Let X be a data sample whose class label is unknown. Each sample is represented by an n-dimensional vector, X = (x1, x2, ..., xn).

• The classification problem may be formulated using a-posteriori probabilities as follows: determine P(C|X), the probability that the sample X belongs to a specified class C.

• P(C|X) is the a-posteriori probability of C conditioned on X.


Bayesian classifiers

• Example:

Given a set of samples describing credit applicants, P(Risk=low | Age=38, Marital_Status=divorced, Income=low, children=2) is the probability that the credit applicant X = (38, divorced, low, 2) is a low-risk applicant.

• The idea of Bayesian classification is to assign to a new unknown sample X the class label C such that P(C| X) is maximal.


Bayesian classifiers

• The main problem is how to estimate the a-posteriori probability P(C|X).

• By Bayes theorem: P(C|X) = (P(X|C) * P(C)) / P(X), where P(C) is the a-priori probability of C, that is, the probability that any given sample belongs to the class C; P(X|C) is the probability of X conditioned on C; and P(X) is the a-priori probability of X.

• In our example, P(X|C) is the probability of X = (38, divorced, low, 2) given the class Risk=low, P(C) is the probability of the class C, and P(X) is the probability of observing the sample X = (38, divorced, low, 2).


Bayesian classifiers

• Suppose a training database D consists of n samples, and suppose the class label attribute has m distinct values defining m distinct classes C_i, for i = 1, ..., m.

• Let s_i denote the number of samples of D in class C_i.

• A Bayesian classifier assigns an unknown sample X to the class C_i that maximizes P(C_i|X). Since P(X) is constant for all classes, the class C_i for which P(C_i|X) is maximized is the class for which P(X|C_i) * P(C_i) is maximized.

• P(C_i) may be estimated by s_i/n (the relative frequency of the class C_i), or we may assume that all classes are equally probable: P(C_1) = P(C_2) = ... = P(C_m).


Bayesian classifiers

• The main problem is how to compute P(X|C_i).

• Given a large dataset with many predictor attributes, it would be very expensive to compute P(X|C_i); therefore, to reduce the cost, the assumption of class conditional independence (in other words, the attribute independence assumption) is made.

• The assumption states that there are no dependencies among the predictor attributes, which leads to the following formula:

  P(X|C_i) = Π(j=1..n) P(x_j|C_i)


Bayesian classifiers

• The probabilities P(x_1|C_i), P(x_2|C_i), ..., P(x_n|C_i) can be estimated from the dataset:

– if the j-th attribute is categorical, then P(x_j|C_i) is estimated as the relative frequency of samples of the class C_i having value x_j for the j-th attribute,

– if the j-th attribute is continuous, then P(x_j|C_i) is estimated through a Gaussian density function.

• Due to the class conditional independence assumption, the Bayesian classifier is also known as the naive Bayesian classifier.
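A minimal naive Bayesian classifier for categorical attributes can be sketched as follows (relative-frequency estimates only, no smoothing; the tiny dataset is invented for illustration):

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(C_i) and P(x_j | C_i) by relative frequency and return
    a classifier that picks the class maximizing P(X|C_i) * P(C_i)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: k / n for c, k in class_counts.items()}
    cond = defaultdict(Counter)          # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1

    def classify(x):
        def score(c):
            p = priors[c]
            for j, v in enumerate(x):
                p *= cond[(c, j)][v] / class_counts[c]
            return p
        return max(priors, key=score)

    return classify

# Invented (Car Type, Age group) samples with a Risk class label
X = [("sport", "young"), ("sport", "young"), ("family", "old"),
     ("truck", "old"), ("family", "young")]
y = ["high", "high", "low", "low", "high"]
classify = train_naive_bayes(X, y)
print(classify(("sport", "young")))   # high
```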


Bayesian classifiers

• The assumption makes computation possible. Moreover, when the assumption is satisfied, the naive Bayesian classifier is optimal, that is it is the most accurate classifier in comparison to all other classifiers.

• However, the assumption is seldom satisfied in practice, since attributes are usually correlated.

• Several attempts are being made to apply Bayesian analysis without assuming attribute independence. The resulting models are called Bayesian networks or Bayesian belief networks

• Bayesian belief networks combine Bayesian analysis with causal relationships between attributes.


k-nearest neighbor classifiers

• The nearest neighbor classifier belongs to the family of instance-based learning methods.

• Instance-based learning methods differ from other classification methods discussed earlier in that they do not build a classifier until a new unknown sample needs to be classified.

• Each training sample is described by an n-dimensional vector representing a point in an n-dimensional space called the pattern space. When a new unknown sample has to be classified, a distance function is used to determine the member of the training set which is closest to the unknown sample.


• Once the nearest training sample is located in the pattern space, its class label is assigned to the unknown sample.

• The main drawback of this approach is that it is very sensitive to noisy training samples.

• The common solution to this problem is to adopt the k-nearest neighbor strategy.

• When a new unknown sample has to be classified, the classifier searches the pattern space for the k training samples which are closest to the unknown sample. These k training samples are called the k "nearest neighbors" of the unknown sample and the most common class label among k "nearest neighbors" is assigned to the unknown sample.

• To find the k "nearest neighbors" of the unknown sample a multidimensional index is used (e.g. R-tree, Pyramid tree, etc.).
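A minimal k-nearest neighbor sketch, using a linear scan over the pattern space instead of a multidimensional index (toy points, illustrative only):

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """training: list of (vector, label) pairs; returns the majority
    class label among the k training samples nearest to query."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(training, key=lambda s: dist(s[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

training = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
            ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((0.2, 0.1), "A")]
label = knn_classify(training, (0.15, 0.15), k=3)
```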

• Two different issues need to be addressed regarding the k-nearest neighbor method:

– the distance function, and

– the transformation from a sample to a point in the pattern space.

• The first issue is to define the distance function. If the attributes are numeric, most k-nearest neighbor classifiers use Euclidean distance.

• Instead of the Euclidean distance, we may also apply other distance metrics like Manhattan distance, maximum of dimensions, or Minkowski distance.
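The listed metrics in one sketch: Minkowski distance with p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit p → ∞ the maximum of dimensions (Chebyshev) distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Maximum of dimensions: the limit of Minkowski as p -> infinity."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (1.0, 2.0), (4.0, 6.0)
d_manhattan = minkowski(a, b, 1)   # 3 + 4 = 7
d_euclidean = minkowski(a, b, 2)   # sqrt(9 + 16) = 5
d_max       = chebyshev(a, b)      # max(3, 4) = 4
```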


• The second issue is how to transform a sample to a point in the pattern space.

• Note that different attributes may have different scales and units, and different variability. Thus, if the distance metric is used directly, the effects of some attributes might be dominated by other attributes that have larger scale or higher variability.

• A simple solution to this problem is to weight the various attributes. One common approach is to normalize all attribute values into the range [0, 1].
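Min-max normalization into [0, 1] can be sketched as (hypothetical attribute values):

```python
def min_max_normalize(values):
    """Rescale a list of attribute values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000.0, 50_000.0, 80_000.0]   # hypothetical attribute values
normalized = min_max_normalize(incomes)    # [0.0, 0.5, 1.0]
```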


• This solution is sensitive to outliers, since a single outlier could cause virtually all other values to be compressed into a small subrange.

• Another common approach is to apply a standardization transformation, such as subtracting the mean from the value of each attribute and then dividing by its standard deviation.
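A sketch of the standardization transformation (z-scores, using the population standard deviation; toy values, illustrative only):

```python
import math

def standardize(values):
    """Subtract the mean from each value, then divide by the std deviation."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sd for v in values]

scores = standardize([2.0, 4.0, 6.0])   # mean 4, std deviation ~1.63
```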

• Recently, another approach was proposed, which consists in applying a robust space transformation called the Donoho-Stahel estimator; this estimator has some important and useful properties that make it very attractive for different data mining applications.


Classifier accuracy

• The accuracy of a classifier on a given test set of samples is defined as the percentage of test samples correctly classified by the classifier, and it measures the overall performance of the classifier.

• Note that the accuracy of the classifier is not estimated on the training dataset, since it would not be a good indicator of the future accuracy on new data.

• The reason is that the classifier generated from the training dataset tends to overfit the training data, and any estimate of the classifier's accuracy based on that data will be overoptimistic.

• In other words, the classifier is more accurate on the data that was used to train it, but it will very likely be less accurate on an independent set of data.

• To predict the accuracy of the classifier on new data, we need to assess its accuracy on an independent dataset that played no part in the formation of the classifier.

• This dataset is called the test set.

• It is important to note that the test dataset should not be used in any way to build the classifier.


• There are several methods for estimating classifier accuracy. The choice of a method depends on the amount of sample data available for training and testing.

• If there are a lot of sample data, then the following simple holdout method is usually applied.

• The given set of samples is randomly partitioned into two independent sets, a training set and a test set (typically, 70% of the data is used for training, and the remaining 30% is used for testing).

• Provided that both sets of samples are representative, the accuracy of the classifier on the test set will give a good indication of accuracy on new data.
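The holdout method can be sketched as a seeded random 70/30 split (the seed is arbitrary, for reproducibility only):

```python
import random

def holdout_split(samples, train_fraction=0.7, seed=42):
    """Randomly partition samples into independent training and test sets."""
    shuffled = samples[:]                      # keep the original list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]      # (training set, test set)

data = list(range(100))                        # 100 hypothetical samples
train, test = holdout_split(data)              # 70 / 30 split
```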


• In general, it is difficult to say whether a given set of samples is representative or not, but at least we may ensure that the random sampling of the data set is done in such a way that the class distribution of samples in both training and test set is approximately the same as that in the initial data set.

• This procedure is called stratification.
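A sketch of stratification: each class is split separately, so both the training and the test set keep approximately the class distribution of the initial data set (toy labels, illustrative only):

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_fraction=0.7, seed=42):
    """Split each class separately to preserve the class distribution."""
    by_class = defaultdict(list)
    for s, c in zip(samples, labels):
        by_class[c].append(s)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * train_fraction)
        train += members[:cut]
        test += members[cut:]
    return train, test

samples = list(range(10))
labels = ["yes"] * 5 + ["no"] * 5          # hypothetical class labels
train, test = stratified_split(samples, labels)
```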


Testing – large dataset

[Diagram: the available examples are divided randomly into a training set (70%), used to develop one tree, and a test set (30%), used to check accuracy.]


Classifier accuracy

• If the amount of data for training and testing is limited, the problem is how to use this limited amount of data both for training, to get a good classifier, and for testing, to obtain a correct estimate of the classifier's accuracy.

• The standard and very common technique for measuring the accuracy of a classifier when the amount of data is limited is k-fold cross-validation.

• In k-fold cross-validation, the initial set of samples is randomly partitioned into k approximately equal, mutually exclusive subsets, called folds: S_1, S_2, ..., S_k.


• Training and testing is performed k times. At each iteration, one fold is used for testing while the remaining k-1 folds are used for training, so at the end each fold has been used exactly once for testing and k-1 times for training.

• The accuracy estimate is the overall number of correct classifications from the k iterations divided by the total number of samples N in the initial dataset.

• Often, the k-fold cross-validation technique is combined with stratification and is then called stratified k-fold cross-validation.
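The k-fold procedure can be sketched as follows; the classifier itself is abstracted into a `train_and_test` callback, and the hypothetical callback below simply gets 9 of every 10 test samples right:

```python
def k_fold_accuracy(samples, k, train_and_test):
    """train_and_test(train, test) -> number of correctly classified
    test samples; returns the overall accuracy over all k iterations."""
    folds = [samples[i::k] for i in range(k)]    # k roughly equal folds
    correct = 0
    for i in range(k):
        test = folds[i]                           # one fold for testing
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        correct += train_and_test(train, test)
    return correct / len(samples)                 # overall accuracy estimate

# Hypothetical classifier that classifies 90% of test samples correctly:
acc = k_fold_accuracy(list(range(100)), 10,
                      lambda train, test: int(0.9 * len(test)))
```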


Testing – small dataset

[Diagram: cross-validation – the available examples are divided into a training set (90%) and a test set (10%); the split is repeated 10 times, developing 10 different trees and checking accuracy on each test set.]


Classifier accuracy

• There are many other methods of estimating classifier accuracy on a particular dataset.

• Two popular methods are leave-one-out cross-validation and bootstrapping.

• Leave-one-out cross-validation is simply N-fold cross-validation, where N is the number of samples in the initial dataset.

• At each iteration, a single sample from the dataset is left out for testing, and the remaining samples are used for training. The result of testing is either success or failure.

• The results of all N evaluations, one for each sample from the dataset, are averaged, and that average represents the final accuracy estimate.


• Bootstrapping is based on sampling with replacement.

• The initial dataset is sampled N times, where N is the total number of samples in the dataset, with replacement, to form another set of N samples for training.

• Since some samples in this new "set" will be repeated, some samples from the initial dataset will not appear in this training set. These samples form the test set.

• Both estimation methods are especially interesting for estimating classifier accuracy on small datasets. In practice, the standard and most popular technique for estimating classifier accuracy is stratified tenfold cross-validation.
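The bootstrap split can be sketched as follows (seeded for reproducibility; on average about 63.2% of the distinct samples are drawn into the training set, leaving roughly 36.8% for testing):

```python
import random

def bootstrap_split(samples, seed=42):
    """Draw N samples with replacement for training; samples that were
    never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.choice(samples) for _ in samples]   # N draws, replacement
    drawn = set(train)
    test = [s for s in samples if s not in drawn]    # never-drawn samples
    return train, test

data = list(range(100))
train, test = bootstrap_split(data)
```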


Requirements

• Focus on mega-induction

• Handle both continuous and categorical data

• No restriction on:

– number of examples

– number of attributes

– number of classes


Applications

• Treatment effectiveness

• Credit approval

• Store location

• Target marketing

• Insurance company (fraud detection)

• Telecommunication company (client classification)