DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011
Lecture 6: Classification and Prediction - Decision Tree and Classification Rules
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
What Is Classification, What Is Prediction?
Decision Tree
Classification Rule: Covering Algorithm
What Is Classification?
Case
A bank loans officer needs analysis of her data in order to learn which loan applicants are “safe” and which are “risky” for the bank
A marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer or not
A medical researcher wants to analyze breast cancer data in order to predict which one of three specific treatments a patient should receive
The data analysis task is classification, where the model or classifier is constructed to predict categorical labels
The model is a classifier
What Is Prediction?
Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at the shop
This data analysis task is numeric prediction, where the model constructed predicts a continuous value or ordered values, as opposed to a categorical label
This model is a predictor
Regression analysis is a statistical methodology that is most often used for numeric prediction
How does classification work?
Data classification is a two-step process
In the first step (the learning step or training phase), a model is built describing a predetermined set of data classes or concepts
The data tuples used to build the classification model are called the training data set
If the class label is provided, this step is known as supervised learning; otherwise it is called unsupervised learning
The learned model may be represented in the form of classification rules, decision trees, Bayesian models, mathematical formulae, etc.
How does classification work?
In the second step,
The learned model is used for classification
Estimate the predictive accuracy of the model using hold-out data set (a test set of class-labeled samples which are randomly selected and are independent of the training samples)
If the accuracy of the model were estimated on the training data set, the estimate would be optimistic, because the model tends to overfit the training data
If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples or objects for which the class label is unknown
In the experiment, there are three kinds of dataset, training data set, hold-out data set (or validation data set), and test data set
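A minimal sketch of this two-step process in Python (assuming scikit-learn is available; the tiny encoded dataset below is only an illustrative stand-in for real training data):

```python
# Step 1: learn a model from a training split; Step 2: estimate its
# predictive accuracy on an independent hold-out split.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Encoded attribute vectors and class labels (stand-in data, not the lecture's).
X = [[0, 1, 0], [0, 1, 1], [1, 1, 0], [2, 1, 0], [2, 0, 0],
     [2, 0, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [2, 0, 0],
     [0, 0, 1], [1, 1, 1], [1, 0, 0], [2, 1, 1]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)        # learning step
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```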
Issues Regarding Classification/Prediction
Comparing classification methods
The criteria to compare and evaluate classification and prediction methods
Accuracy: an ability of a given classifier to correctly predict the class label of new or unseen data
Speed: the computation costs involved in generating and using the given classifier or predictor
Robustness: an ability of the classifier or predictor to make correct predictions given noisy data or data with missing values
Scalability: an ability to construct the classifier or predictor efficiently given large amounts of data
Interpretability: the level of understanding and insight that is provided by the classifier or predictor – subjective and more difficult to assess
Decision Tree
A decision tree is a flow-chart-like tree structure:
each internal node denotes a test on an attribute
each branch represents an outcome of the test
each leaf node represents a class
the top-most node in the tree is the root node
Instead of using the complete set of features jointly to make a decision, different subsets of features are used at different levels of the tree.
[Figure: a decision tree representing the concept buys_computer. The root tests Age; for Age <=30 the tree tests student? (no -> no, yes -> yes); for Age 31…40 the answer is yes; for Age >40 the tree tests Credit_rating (excellent -> no, fair -> yes).]
Decision Tree Induction
Normal procedure: a greedy algorithm that works top-down in a recursive divide-and-conquer fashion
First: an attribute is selected for the root node and a branch is created for each possible attribute value
Then: the instances are split into subsets (one for each branch extending from the node)
Finally: the procedure is repeated recursively for each branch, using only the instances that reach that branch
The process stops if:
All instances for a given node belong to the same class
There is no remaining attribute on which the samples may be further partitioned -> a majority vote is employed
There is no sample for the branch to test the attribute -> a majority vote is employed
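A compact sketch of this recursive procedure in Python (the attribute-selection heuristic is passed in as a function, e.g. the information gain defined later; names such as `majority_class` are illustrative, not from the lecture):

```python
from collections import Counter

def majority_class(instances):
    # Majority vote over the class labels (last element of each instance).
    return Counter(inst[-1] for inst in instances).most_common(1)[0][0]

def build_tree(instances, attributes, choose_attribute):
    classes = {inst[-1] for inst in instances}
    if len(classes) == 1:          # all instances in one class -> pure leaf
        return classes.pop()
    if not attributes:             # no attribute left -> majority vote
        return majority_class(instances)
    best = choose_attribute(instances, attributes)   # e.g. highest info gain
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for v in {inst[best] for inst in instances}:     # one branch per value
        subset = [inst for inst in instances if inst[best] == v]
        tree[best][v] = build_tree(subset, remaining, choose_attribute)
    return tree
```

Here `attributes` is a list of column indices and each instance is a list whose last element is the class label; a branch value with no samples simply does not appear in the dictionary.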
Decision Tree Representation (An Example)
The decision tree (DT) of the weather example is:
[Figure: the decision tree for the weather example. The root tests outlook: sunny -> test humidity (high -> no, normal -> yes); overcast -> yes; rainy -> test windy (false -> yes, true -> no).]
Outlook Temp. Humid. Windy Play
sunny hot high false N
sunny hot high true N
overcast hot high false Y
rainy mild high false Y
rainy cool normal false Y
rainy cool normal true N
overcast cool normal true Y
sunny mild high false N
sunny cool normal false Y
rainy mild normal false Y
sunny mild normal true Y
overcast mild high true Y
overcast hot normal false Y
rainy mild high true N
An Example (Which attribute is the best?)
There are four possible splits to consider, one for each attribute
Criteria for Attribute Selection
Which is the best attribute? The one which will result in the smallest tree
Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain
Information gain increases with the average purity of the subsets that an attribute produces
Strategy: the attribute with the highest information gain is chosen as the test attribute for the current node
Computing "Information"
Information is measured in bits
Given a probability distribution, the information required to predict an event is the distribution's entropy
Entropy gives the information required in bits (this can involve fractions of bits!)
Information gain measures the goodness of a split
Formula for computing expected information:
Let S be a set consisting of s data instances, where the class label attribute has n distinct classes Ci (for i = 1, …, n)
Let si be the number of instances in class Ci
The expected information or entropy is
info([s1,s2,…,sn]) = entropy(s1/s, s2/s, …, sn/s) = -Σi pi log2(pi), where pi is the probability that an instance belongs to class Ci, i.e., pi = si/s
Formula for computing information gain: the information gain of attribute A is
gain(A) = info. before splitting - info. after splitting
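A small Python sketch of these formulas, checked against the weather data (plain lists, no external libraries; function names are illustrative):

```python
from math import log2

def info(counts):
    """Expected information (entropy in bits) of a class distribution,
    e.g. info([9, 5]) for 9 'yes' and 5 'no' instances."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def expected_info(split):
    """Weighted entropy after splitting, e.g. [[2, 3], [4, 0], [3, 2]]."""
    total = sum(sum(part) for part in split)
    return sum(sum(part) / total * info(part) for part in split)

def gain(before, split):
    """Information gain = info before splitting - info after splitting."""
    return info(before) - expected_info(split)

# Splitting on Outlook: sunny [2 yes, 3 no], overcast [4, 0], rainy [3, 2].
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247 bits
```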
Expected Information for "Outlook"
"Outlook" = "sunny": info([2,3]) = entropy(2/5,3/5) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971 bits
"Outlook" = "overcast": info([4,0]) = entropy(1,0) = -(1)log2(1) - (0)log2(0) = 0 bits
"Outlook" = "rainy": info([3,2]) = entropy(3/5,2/5) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971 bits
Expected information for attribute "Outlook":
info([2,3],[4,0],[3,2]) = (5/14)info([2,3]) + (4/14)info([4,0]) + (5/14)info([3,2]) = [(5/14)x0.971] + [(4/14)x0] + [(5/14)x0.971] = 0.693 bits
(The weather data table from the earlier slide is repeated here for reference.)
Information Gain for "Outlook"
Information gain: info. before splitting - info. after splitting
gain(”Outlook”) = info([9,5]) - info([2,3],[4,0],[3,2])
= 0.940-0.693
= 0.247 bits
Information gain for attributes from weather data:
gain(”Outlook”) = 0.247 bits
gain(”Temperature”) = 0.029 bits
gain(“Humidity”) = 0.152 bits
gain(“Windy”) = 0.048 bits
An Example of Gain Criterion (Which attribute is the best?)
Gain(outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.247  (the best)
Gain(temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.029
Gain(humidity) = info([9,5]) - info([3,4],[6,1]) = 0.152
Gain(windy) = info([9,5]) - info([6,2],[3,3]) = 0.048
Continuing to Split
If "Outlook" = "sunny":
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
"Temperature" = "hot": info([0,2]) = entropy(0,1) = -(0)log2(0) - (1)log2(1) = 0 bits
"Temperature" = "mild": info([1,1]) = entropy(1/2,1/2) = -(1/2)log2(1/2) - (1/2)log2(1/2) = 1 bit
"Temperature" = "cool": info([1,0]) = 0 bits
Expected information for attribute "Temperature": info([0,2],[1,1],[1,0]) = (2/5)info([0,2]) + (2/5)info([1,1]) + (1/5)info([1,0]) = 0 + [(2/5)x1] + 0 = 0.4 bits
Information gain: gain("Temperature") = info([2,3]) - info([0,2],[1,1],[1,0]) = -(2/5)log2(2/5) - (3/5)log2(3/5) - 0.4 = 0.971 - 0.4 = 0.571 bits
The Final Decision Tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when the data can't be split any further
Properties for a Purity Measure
Properties we require from a purity measure:
When a node is pure, the measure should be zero
When impurity is maximal (i.e., all classes equally likely), the measure should be maximal
The measure should obey the multistage property (i.e., decisions can be made in several stages):
measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
Entropy is the only function that satisfies all
three properties!
Some Properties of the Entropy
The multistage property:
entropy(p,q,r) = entropy(p,q+r) + [(q+r)/(p+q+r)] x entropy(q,r)
Ex.: info([2,3,4]) can be calculated as info([2,7]) + (7/9) x info([3,4])
= [-(2/9)log2(2/9) - (7/9)log2(7/9)] + (7/9)[-(3/7)log2(3/7) - (4/7)log2(4/7)]
= - (2/9)log2(2/9) - (7/9) [ log2 (7/9) + (3/7)log2(3/7) + (4/7)log2(4/7) ]
= - (2/9)log2(2/9)
- (7/9) [ (3/7)log2(7/9) + (4/7)log2(7/9) +(3/7)log2(3/7) + (4/7)log2(4/7) ]
= - (2/9)log2(2/9)
- (7/9) [ (3/7)log2(7/9) + (3/7)log2(3/7) +(4/7)log2(7/9) + (4/7)log2(4/7) ]
= - (2/9)log2(2/9)
- (7/9) [ (3/7)log2(7/9 x 3/7) + (4/7)log2(7/9 x 4/7) ]
= - (2/9)log2(2/9) - (7/9) [ (3/7)log2(3/9) + (4/7)log2(4/9) ]
= - (2/9)log2(2/9) - (3/9)log2(3/9) - (4/9)log2(4/9)
A Problem: Highly-Branching Attributes
Problematic: attributes with a large number of values (extreme case: ID code)
Subsets are more likely to be pure if there is a large number of values
Information gain is biased towards choosing attributes with a large number of values
This may result in overfitting (selection of an attribute that is non-optimal for prediction) and fragmentation
Example: Highly-Branching Attributes
Entropy of the split on the ID code:
info(ID) = info([0,1],[0,1],[1,0],…,[0,1]) = 0 bits
gain(ID) = 0.940 bits (the maximum)
[Figure: splitting on the ID code produces one branch per instance (A, B, …, M, N), each leading to a pure single-instance leaf.]
ID Outlook Temp. Humid. Windy Play
A sunny hot high false N
B sunny hot high true N
C overcast hot high false Y
D rainy mild high false Y
E rainy cool normal false Y
F rainy cool normal true N
G overcast cool normal true Y
H sunny mild high false N
I sunny cool normal false Y
J rainy mild normal false Y
K sunny mild normal true Y
L overcast mild high true Y
M overcast hot normal false Y
N rainy mild high true N
Modification: The Gain Ratio As a Split Info.
Gain ratio: a modification of the information gain that reduces its bias
Gain ratio takes the number and size of branches into account when choosing an attribute
It corrects the information gain by taking the intrinsic information of a split into account
Intrinsic information: the entropy of the distribution of instances into branches (i.e., how much information we need to tell which branch an instance belongs to)
Computing the Gain Ratio
Example: intrinsic information (split info) for the ID code
info([1,1,…,1]) = 14 x ( -(1/14)log2(1/14) ) = 3.807 bits
The value of an attribute decreases as its intrinsic information gets larger
Definition of gain ratio: gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
Example: gain_ratio("ID") = gain("ID") / intrinsic_info("ID") = 0.940 bits / 3.807 bits = 0.246
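A short Python sketch of the gain-ratio calculation for the ID-code split (self-contained; names are illustrative and mirror the earlier entropy sketch):

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain_ratio(before, split):
    """Information gain corrected by the intrinsic information of the split."""
    total = sum(before)
    after = sum(sum(part) / total * info(part) for part in split)
    intrinsic = info([sum(part) for part in split])   # entropy of branch sizes
    return (info(before) - after) / intrinsic

# ID code split: 14 single-instance branches (9 'yes' leaves, 5 'no' leaves).
id_split = [[1, 0]] * 9 + [[0, 1]] * 5
print(round(info([1] * 14), 3))                # intrinsic information ~3.807 bits
print(round(gain_ratio([9, 5], id_split), 3))  # ~0.247 (0.246 on the slide, rounding aside)
```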
Gain Ratio for Weather Data
Gain Ratio for Weather Data (Discussion)
"Outlook" still comes out top
However: "ID" has an even greater gain ratio
Standard fix: an ad hoc test to prevent splitting on that type of attribute
Problem with gain ratio: it may overcompensate
It may choose an attribute just because its intrinsic information is very low
Standard fix: only consider attributes with greater than average information gain
Avoiding Overfitting the Data
The naive DT algorithm grows each branch of the tree just deeply enough to perfectly classify the training examples.
This algorithm may produce trees that overfit the training examples but do not work well for general cases.
Reason: the training set may contain some noise, or it may be too small to produce a representative sample of the true target tree (function).
Avoid Overfitting: Pruning
Pruning simplifies a decision tree to prevent overfitting to noise in the data
Two main pruning strategies:
1. Prepruning: stop growing a tree when there is no statistically significant association between any attribute and the class at a particular node.
Most popular test: the chi-squared test; only statistically significant attributes were allowed to be selected by the information gain procedure
2. Postpruning: take a fully-grown decision tree and discard unreliable parts by two main pruning operations, i.e., subtree replacement and subtree raising, with some possible strategies, e.g., error estimation, significance testing, the MDL principle.
Postpruning is preferred in practice, because prepruning can stop growing the tree too early.
Subtree Replacement
Bottom-up: tree is considered for replacement once all its
subtrees have been considered
Subtree Raising
Deletes node and redistributes instances
Slower than subtree replacement (Worthwhile?)
Tree to Rule vs. Rule to Tree
[Figure: the weather decision tree from the earlier slide (outlook at the root; sunny -> humidity, overcast -> yes, rainy -> windy).]
Tree -> Rule:
If outlook=sunny & humidity=high then class=no
If outlook=sunny & humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=false then class=yes
If outlook=rainy & windy=true then class=no
Rule -> Tree:
If outlook=sunny & humidity=high then class=no
If humidity=normal then class=yes
If outlook=overcast then class=yes
If outlook=rainy & windy=true then class=no
Question: how does this rule set classify an instance with outlook=rainy & windy=true & humidity=normal? Or with outlook=rainy & windy=false & humidity=high?
Classification Rules
Classification Rule: Algorithms
Two main algorithms are:
Inferring rudimentary rules - 1R: a 1-level decision tree
Covering algorithms: algorithms to construct the rules
Pruning rules & computing significance: hypergeometric distribution vs. binomial distribution; incremental reduced-error pruning
Inferring Rudimentary Rules (1R)
1R learns a 1-level decision tree, i.e., it generates a set of rules that all test on one particular attribute, focusing on each attribute in turn
Note: "missing" can be treated as a separate attribute value
1R's simple rules performed not much worse than much more complex decision trees
Pseudo-code:
• For each attribute,
• For each value of the attribute, make a rule as
follows:
• count how often each class appears
• find the most frequent class
• make the rule assign that class to this attribute-value
• Calculate the error rate of the rules
• Choose the rules with the smallest error rate
(Holte, 93)
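A minimal Python sketch of this 1R procedure (instances as dicts; helper and variable names are illustrative, not from the lecture):

```python
from collections import Counter, defaultdict

def one_r(instances, class_key):
    """Return (best_attribute, rules, error_count): for each attribute build
    one rule per value (predict its most frequent class), then keep the
    attribute whose rule set makes the fewest errors."""
    best = None
    for attr in instances[0]:
        if attr == class_key:
            continue
        counts = defaultdict(Counter)           # value -> class counts
        for inst in instances:
            counts[inst[attr]][inst[class_key]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Example with two rows of weather-style data:
data = [{"outlook": "sunny", "windy": "false", "play": "no"},
        {"outlook": "overcast", "windy": "false", "play": "yes"}]
print(one_r(data, "play"))
```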
An Example: Evaluating the Weather Attributes (Nominal, Ordinal)
Outlook Temp. Humidity Windy Play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Outlook" or of "Humidity" (both 4/14)
Attribute   Rule                   Errors   Total errors
Outlook     sunny -> no            2/5      4/14
            overcast -> yes        0/4
            rainy -> yes           2/5
Temp.       hot -> no              2/4      5/14
            mild -> yes            2/6
            cool -> yes            1/4
Humidity    high -> no             3/7      4/14
            normal -> yes          1/7
Windy       false -> yes           2/8      5/14
            true -> no             3/6
An Example: Evaluating the Weather Attributes (Numeric)
Attribute   Rule                       Errors   Total errors
Outlook     sunny -> no                2/5      4/14
            overcast -> yes            0/4
            rainy -> yes               2/5
Temp.       <= 77.5 -> yes             3/10     5/14
            > 77.5 -> no               2/4
Humidity    <= 82.5 -> yes             1/7      3/14
            82.5 < H <= 95.5 -> no     2/6
            > 95.5 -> yes              0/1
Windy       false -> yes               2/8      5/14
            true -> no                 3/6
1R chooses the attribute that produces rules with the smallest number of errors, i.e., the rule set of attribute "Humidity" (3/14)
Outlook Temp. Humidity Windy Play
sunny 85 85 false no
sunny 80 90 true no
overcast 83 86 false yes
rainy 70 96 false yes
rainy 68 80 false yes
rainy 65 70 true no
overcast 64 65 true yes
sunny 72 95 false no
sunny 69 70 false yes
rainy 75 80 false yes
sunny 75 70 true yes
overcast 72 90 true yes
overcast 81 75 false yes
rainy 71 91 true no
Dealing with Numeric Attributes
Numeric attributes are discretized: the range of the attribute is divided into a set of intervals
Instances are sorted according to the attribute's values
Breakpoints are placed where the (majority) class changes (so that the total error is minimized)
Example: Temperature from the weather data
Sorted values with their classes, with breakpoints where the class changes:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y | N | Y Y Y | N N Y | Y Y | N | Y Y | N
Enforcing a minimum partition size (min = 3), working left to right:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y N Y Y Y | N N Y Y Y | N Y Y N
Merging adjacent partitions with the same majority class:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Y N Y Y Y N N Y Y Y | N Y Y N
Covering Algorithm
A separate-and-conquer algorithm: focus on each class in turn and seek a way of covering all instances of that class; more rules can be added to obtain a "perfect" rule set
Comparison to a decision tree (DT):
A decision tree is divide-and-conquer: it focuses on all classes at each step and seeks an attribute to split on that best separates the classes
A DT can be converted into a rule set; the straightforward conversion yields an overly complex rule set, and more effective conversions are not trivial
In multiclass situations, a covering algorithm concentrates on one class at a time, whereas a DT learner takes all classes into account
Separate-and-conquer: selects the test that maximizes the number of covered positive examples and minimizes the number of negative examples that pass the test; it usually does not pay any attention to the examples that do not pass the test.
Divide-and-conquer: optimizes for all outcomes of the test.
Constructing Classification Rules (An Example)
Rule so far, and the rule after adding a new test:
If x <= 1.2 then class = b
If x > 1.2 then class = b
If x > 1.2 & y <= 2.6 then class = b
More rules could be added for a "perfect" rule set
[Figures: the corresponding decision tree (testing x > 1.2 and then y > 2.6) and the instance space of classes a and b, split at x = 1.2 and y = 2.6, shown at each stage as tests are added.]
A Simple Covering Algorithm
Generates a rule by adding tests that maximize the rule's accuracy, even though each new test reduces the rule's coverage
Similar to the situation in decision trees: the problem of selecting an attribute to split on; but the decision tree inducer maximizes overall purity, whereas the covering algorithm maximizes rule accuracy
Goal: maximize accuracy
t: total number of instances covered by the rule
p: positive examples of the class covered by the rule
t - p: number of errors made by the rule
One option: select the test that maximizes the ratio p/t
We are finished when p/t = 1 or the set of instances cannot be split any further.
An Example: Contact Lenses Data
First try to find a rule for “hard”
Age             Spectacle prescription  Astigmatism  Tear prod. rate  Recom. lenses
young           myope                   no           reduced          none
young           myope                   no           normal           soft
young           myope                   yes          reduced          none
young           myope                   yes          normal           hard
young           hypermetrope            no           reduced          none
young           hypermetrope            no           normal           soft
young           hypermetrope            yes          reduced          none
young           hypermetrope            yes          normal           hard
pre-presbyopic  myope                   no           reduced          none
pre-presbyopic  myope                   no           normal           soft
pre-presbyopic  myope                   yes          reduced          none
pre-presbyopic  myope                   yes          normal           hard
pre-presbyopic  hypermetrope            no           reduced          none
pre-presbyopic  hypermetrope            no           normal           soft
pre-presbyopic  hypermetrope            yes          reduced          none
pre-presbyopic  hypermetrope            yes          normal           none
presbyopic      myope                   no           reduced          none
presbyopic      myope                   no           normal           none
presbyopic      myope                   yes          reduced          none
presbyopic      myope                   yes          normal           hard
presbyopic      hypermetrope            no           reduced          none
presbyopic      hypermetrope            no           normal           soft
presbyopic      hypermetrope            yes          reduced          none
presbyopic      hypermetrope            yes          normal           none
An Example: Contact Lenses Data (Finding a Good Choice)
Rule we seek: If ? then recommendation = hard
Possible tests:
Age = Young                              2/8
Age = Pre-presbyopic                     1/8
Age = Presbyopic                         1/8
Spectacle prescription = Myope           3/12
Spectacle prescription = Hypermetrope    1/12
Astigmatism = no                         0/12
Astigmatism = yes                        4/12
Tear production rate = Reduced           0/12
Tear production rate = Normal            4/12
(Astigmatism = yes and Tear production rate = Normal tie at 4/12.)
Modified Rule and Resulting Data
Rule with the best test added: If astigmatism = yes then recommendation = hard
(The contact lenses table is shown again on the slide, with the rows where astigmatism = yes underlined.)
• The underlined rows match the rule.
• However, we need to refine the rule, since not all of the instances it covers are classified correctly.
Further Refinement
Current state: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6
Modified Rule and Resulting Data
Rule with the best test added: If astigmatism = yes and tear prod. rate = normal then recommendation = hard
(The contact lenses table is shown again, with the rows where astigmatism = yes and tear production rate = normal underlined.)
• The underlined rows match the rule.
• However, we still need to refine the rule, since not all of the instances it covers are classified correctly.
Further Refinement
Current state: If astigmatism = yes and tear prod. rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/3
There is a tie between the first and the fourth test (both cover only correct instances); we choose the one with greater coverage
Modified Rule and Resulting Data
Final rule with the best test added: If astigmatism = yes and tear prod. rate = normal and spectacle prescription = myope then recommendation = hard
• The rows matching the rule (shown in blue on the slide) are all 'hard'.
• No need to refine the rule further, since the rule is now perfect.
(The contact lenses table is shown again, with the three rows covered by the final rule highlighted.)
Finding More Rules
The second rule for recommending "hard lenses" is built from the instances not covered by the first rule:
If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
These two rules cover all "hard lenses":
(1) If astigmatism = yes & tear prod. rate = normal & spectacle prescription = myope then recommendation = hard
(2) If age = young & astigmatism = yes & tear production rate = normal then recommendation = hard
The process is then repeated with the other two classes, "soft lenses" and "none".
Pseudo-code for the PRISM Algorithm
For each class C
• Initialize E to the instance set
• While E contains instances in class C
• Create a rule R with an empty left-hand-side that
predicts class C
• Until R is perfect (or there are no more
attributes to use) do
• For each attribute A not mentioned in R, and
each value v,
• Consider adding the condition A = v to the
left-hand side of R
• Select A and v to maximize the accuracy p/t
(break ties by choosing the condition with
the largest p)
• Add A = v to R
• Remove the instances covered by R from E
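A rough Python sketch of this PRISM loop (instances as dicts with a class key, as in the earlier 1R sketch; helper names are illustrative, not the lecture's):

```python
def covers(rule, inst):
    """A rule is a list of (attribute, value) conditions."""
    return all(inst[a] == v for a, v in rule)

def prism(instances, class_key):
    rules = []
    for c in {inst[class_key] for inst in instances}:   # one class at a time
        E = list(instances)
        while any(inst[class_key] == c for inst in E):
            rule, covered = [], E            # empty left-hand side predicts c
            while True:
                wrong = [i for i in covered if i[class_key] != c]
                attrs_left = [a for a in covered[0]
                              if a != class_key and a not in dict(rule)]
                if not wrong or not attrs_left:
                    break                    # rule is perfect, or nothing left to add
                best, best_key = None, (-1.0, 0)
                for a in attrs_left:
                    for v in {i[a] for i in covered}:
                        subset = [i for i in covered if i[a] == v]
                        p = sum(1 for i in subset if i[class_key] == c)
                        key = (p / len(subset), p)   # maximize p/t, then p
                        if key > best_key:
                            best, best_key = (a, v), key
                rule.append(best)
                covered = [i for i in covered if i[best[0]] == best[1]]
            rules.append((rule, c))
            E = [i for i in E if not covers(rule, i)]   # remove covered instances
    return rules
```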
Order Dependency among Rules
PRISM without the outer loop generates a decision list for one class
Subsequent rules are designed for instances that are not covered by previous rules
Here, order does not matter because all rules predict the same class
The outer loop considers all classes separately, so no order dependence is implied
Two resulting problems are: overlapping rules, and a default rule is required
Separate-and-Conquer
Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
First, a rule is identified
Then, all instances covered by the rule are separated out
Finally, the remaining instances are "conquered"
Difference to divide-and-conquer methods: the subset covered by a rule doesn't need to be explored any further
There is variety in the separate-and-conquer approach: search method (e.g., greedy, beam search, ...), test selection criteria (e.g., accuracy, ...), pruning method (e.g., MDL, hold-out set, ...), stopping criterion (e.g., minimum accuracy), post-processing step
Also: a decision list vs. one rule set for each class
Good Rules and Bad Rules (Overview)
Sometimes it is better not to generate perfect rules that guarantee the correct classification of all training instances, in order to avoid overfitting.
How do we decide which rules are worthwhile?
How do we tell when it becomes counterproductive to continue adding terms to a rule in order to exclude a few pesky instances of the wrong type?
Two main strategies for pruning rules:
Global pruning (post-pruning): create all perfect rules, then prune
Incremental pruning (pre-pruning): prune a rule while it is being generated
Three pruning criteria:
MDL principle (Minimum Description Length): rule size + exceptions
Statistical significance (as in INDUCT)
Error on a hold-out set (reduced-error pruning)
Hypergeometric Distribution
The dataset contains T examples
The class contains P examples
The rule selects t examples
p of the t examples selected by the rule are correctly covered
[Diagram: of the T examples, P are in the class and T - P are not; of the t examples selected by the rule, p come from the class and t - p from outside it.]
Computing Significance
We want the probability that a random rule does at least as well (statistical significance of rule):
m(R) = Σ_{i=p}^{min(t,P)} C(P,i) x C(T-P, t-i) / C(T,t)

where C(p,q) = p! / (q! (p-q)!)
Good/Bad Rules by Statistical Significance (An Example)
(1) If astigmatism = yes then recommendation = hard
    success fraction = 4/12; no-information success fraction = 4/24
    probability of 4/24 -> 4/12 = 0.047
(2) If astigmatism = yes and tear production rate = normal then recommendation = hard
    success fraction = 4/6; no-information success fraction = 4/24
    probability of 4/24 -> 4/6 = 0.0014
(3) If astigmatism = yes and tear prod. rate = normal and age = young then recommendation = hard
    success fraction = 2/2; no-information success fraction = 4/24
    probability of 4/24 -> 2/2 = 0.022
A reduced probability means a better rule; an increased probability means a worse rule.
Adding the second test reduces the probability from 0.047 to 0.0014; adding the third raises it from 0.0014 to 0.022, so rule (2) is the best rule.
Example for rule (1): P = p = 4, T = 24, t = 12
m(R) = C(4,4) x C(24-4, 12-4) / C(24,12) = [1 x C(20,8)] / C(24,12)
     = [20! / (8! 12!)] / [24! / (12! 12!)]
     = (20! x 12!) / (8! x 24!) = 0.047
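A quick Python sketch of this significance computation, implementing the m(R) sum directly (self-contained; function name is illustrative):

```python
from math import comb

def rule_significance(T, P, t, p):
    """m(R): probability that a random rule selecting t of the T examples
    covers at least p of the P class members (smaller = more significant)."""
    return sum(comb(P, i) * comb(T - P, t - i)
               for i in range(p, min(t, P) + 1)) / comb(T, t)

# The three 'hard' rules above (T = 24 instances, P = 4 of class 'hard'):
print(round(rule_significance(24, 4, 12, 4), 3))   # rule (1): ~0.047
print(round(rule_significance(24, 4, 6, 4), 4))    # rule (2): ~0.0014
print(round(rule_significance(24, 4, 2, 2), 3))    # rule (3): ~0.022
```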
Good/Bad Rules by Statistical Significance (Another Example)
(4) If astigmatism = yes and tear production rate = normal then recommendation = none
    success fraction = 2/6; no-information success fraction = 15/24
    probability of 15/24 -> 2/6 = 0.985
(5) If astigmatism = no and tear production rate = normal then recommendation = soft
    success fraction = 5/6; no-information success fraction = 5/24
    probability of 5/24 -> 5/6 = 0.0001
(6) If tear production rate = reduced then recommendation = none
    success fraction = 12/12; no-information success fraction = 15/24
    probability of 15/24 -> 12/12 = 0.0017
A good rule has a low probability; a bad rule has a high probability.
The Binomial Distribution
Approximation: we can use sampling with replacement instead of sampling without replacement
The dataset contains T examples; the class contains P examples; the rule selects t examples, of which p are correctly covered
m(R) = Σ_{i=p}^{min(t,P)} C(t,i) (P/T)^i (1 - P/T)^(t-i)
Pruning Strategies
For better estimation, a rule should be evaluated on data not used for training.
This requires a growing set and a pruning set. Two options are:
Reduced-error pruning for rules builds a full unpruned rule set and simplifies it subsequently
Incremental reduced-error pruning simplifies a rule immediately after it has been built.
INDUCT (Incremental Pruning Algorithm)
Initialize E to the instance set
Until E is empty do
  For each class C for which E contains an instance
    Use the basic covering algorithm to create the best perfect rule for C
    Calculate the significance m(R) for the rule and the significance m(R-) for the rule with the final condition omitted
    If m(R-) < m(R), prune the rule and repeat the previous step
  From the rules for the different classes, select the most significant one (i.e., the one with the smallest m(R))
  Print the rule
  Remove the instances covered by the rule from E
Continue
INDUCT's significance computation for a rule:
• The probability that a completely random rule with the same coverage performs at least as well
• A random rule R selects t cases at random from the dataset
• We want to know how likely it is that p of these belong to the correct class
• This probability is given by the hypergeometric distribution
Example:
RID age income student Credit_rating Class:buys_computer
1 youth High No Fair No
2 youth High No Excellent No
3 middle_age High No Fair Yes
4 senior Medium No Fair Yes
5 senior Low Yes Fair Yes
6 senior Low Yes Excellent No
7 middle_age Low Yes Excellent Yes
8 youth Medium No Fair No
9 youth Low Yes Fair Yes
10 senior Medium Yes Fair Yes
11 youth Medium Yes Excellent Yes
12 middle_age Medium No Excellent Yes
13 middle_age High Yes Fair Yes
14 senior Medium No Excellent No
The classification task is to predict whether a customer will buy a computer